title | description | services | documentationcenter | author | manager | editor | ms.assetid | ms.custom | ms.service | ms.workload | ms.tgt_pltfrm | ms.devlang | ms.topic | ms.date | ms.author |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Azure Quickstart - CNTK training with Batch AI - Azure CLI | Microsoft Docs |
Quickly learn to run a CNTK training job with Batch AI using the Azure CLI |
batch-ai |
na |
AlexanderYukhanov |
Vaman.Bedekar |
tysonn |
batch-ai |
na |
CLI |
quickstart |
10/06/2017 |
Alexander.Yukhanov |
This quickstart shows how to use the Azure command-line interface (CLI) to run a Microsoft Cognitive Toolkit (CNTK) training job with the Batch AI service. The Azure CLI is used to create and manage Azure resources from the command line or in scripts.
In this example, you use the MNIST database of handwritten digit images to train a convolutional neural network (CNN) on a single-node GPU cluster managed by Batch AI.
If you don't have an Azure subscription, create a free account before you begin.
This quickstart requires that you are running the latest Azure CLI version. If you need to install or upgrade, see Install Azure CLI 2.0.
The Batch AI resource providers also need to be registered once for your subscription using the Azure Cloud Shell or Azure CLI. A provider registration can take up to 15 minutes.
az provider register -n Microsoft.BatchAI
az provider register -n Microsoft.Batch
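Because registration can take several minutes, a script may need to wait before it creates any resources. The following Python sketch polls `az provider show` until the provider reports `Registered`. The function names, the 30-second polling interval, and the injectable `get_state` and `sleep` parameters are illustrative assumptions, not part of the quickstart; the 900-second timeout mirrors the 15 minutes mentioned above.

```python
import subprocess
import time


def registration_state(namespace, run=subprocess.run):
    """Return the registrationState string reported by the Azure CLI."""
    result = run(
        ["az", "provider", "show", "--namespace", namespace,
         "--query", "registrationState", "--output", "tsv"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def wait_until_registered(namespace, timeout_s=900, poll_s=30,
                          get_state=registration_state, sleep=time.sleep):
    """Poll until the provider is Registered; return False on timeout."""
    waited = 0
    while True:
        if get_state(namespace) == "Registered":
            return True
        if waited >= timeout_s:
            return False
        sleep(poll_s)  # wait before asking again
        waited += poll_s
```

Calling `wait_until_registered("Microsoft.BatchAI")` and then `wait_until_registered("Microsoft.Batch")` would cover both providers. On Windows, the CLI executable may need to be invoked as `az.cmd` when it is launched without a shell.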
Batch AI clusters and jobs are Azure resources and must be placed in an Azure resource group.
Create a resource group with the az group create command.
The following example creates a resource group named myResourceGroup in the eastus location. It then uses the az configure command to set this resource group and location as the default.
az group create --name myResourceGroup --location eastus
az configure --defaults group=myResourceGroup
az configure --defaults location=eastus
This quickstart uses an Azure storage account to host data and scripts for the training job. Create a storage account with the az storage account create command.
az storage account create --name mystorageaccount --sku Standard_LRS
For later commands, set default storage account environment variables:
- Linux
export AZURE_STORAGE_ACCOUNT=mystorageaccount
export AZURE_STORAGE_KEY=$(az storage account keys list --account-name mystorageaccount -o tsv --query [0].value)
export AZURE_BATCHAI_STORAGE_ACCOUNT=mystorageaccount
export AZURE_BATCHAI_STORAGE_KEY=$(az storage account keys list --account-name mystorageaccount -o tsv --query [0].value)
- Windows
set AZURE_STORAGE_ACCOUNT=mystorageaccount
az storage account keys list --account-name mystorageaccount -o tsv --query [0].value > temp.txt
set /p AZURE_STORAGE_KEY=< temp.txt
set AZURE_BATCHAI_STORAGE_ACCOUNT=mystorageaccount
set /p AZURE_BATCHAI_STORAGE_KEY=< temp.txt
del temp.txt
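Later commands rely on these four variables being set, and an empty key (for example, after a failed key lookup) can be easy to miss. The following Python sketch reports which of them are unset or empty. The function name and the `environ` parameter are hypothetical helpers introduced for illustration.

```python
import os

# The four variables set in the step above.
REQUIRED_VARIABLES = (
    "AZURE_STORAGE_ACCOUNT",
    "AZURE_STORAGE_KEY",
    "AZURE_BATCHAI_STORAGE_ACCOUNT",
    "AZURE_BATCHAI_STORAGE_KEY",
)


def missing_storage_variables(environ=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]
```

An empty list from `missing_storage_variables()` suggests that the shell session is ready for the storage and Batch AI commands that follow.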
For illustration purposes, this quickstart uses an Azure file share to host the training data and scripts for the learning job.
- Create a file share named batchaiquickstart using the az storage share create command.
az storage share create --name batchaiquickstart
- Create a directory in the share named mnistcntksample using the az storage directory create command.
az storage directory create --share-name batchaiquickstart --name mnistcntksample
- Download the sample package and unzip it. Upload the contents to the directory using the az storage file upload command:
az storage file upload --share-name batchaiquickstart --source Train-28x28_cntk_text.txt --path mnistcntksample
az storage file upload --share-name batchaiquickstart --source Test-28x28_cntk_text.txt --path mnistcntksample
az storage file upload --share-name batchaiquickstart --source ConvNet_MNIST.py --path mnistcntksample
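The three uploads differ only in the source file, so they can be driven from a loop. The following Python sketch builds the same `az storage file upload` argument list for each file and runs the uploads in order. The function names and the injectable `run` parameter are assumptions made for illustration; the flags are the ones used in the commands above.

```python
import subprocess

# The files contained in the sample package, as uploaded above.
SAMPLE_FILES = (
    "Train-28x28_cntk_text.txt",
    "Test-28x28_cntk_text.txt",
    "ConvNet_MNIST.py",
)


def upload_command(source, share="batchaiquickstart", path="mnistcntksample"):
    """Build the az argument list for uploading one file to the share."""
    return ["az", "storage", "file", "upload",
            "--share-name", share, "--source", source, "--path", path]


def upload_all(files=SAMPLE_FILES, run=subprocess.run):
    """Upload each file in turn; check=True stops at the first failure."""
    for source in files:
        run(upload_command(source), check=True)
```

Running `upload_all()` from the directory that holds the unzipped files would be equivalent to issuing the three commands by hand.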
Use the az batchai cluster create command to create a Batch AI cluster consisting of a single GPU VM node. In this example, the VM runs the default Ubuntu LTS image. Specify image UbuntuDSVM instead to run the Microsoft Deep Learning Virtual Machine, which supports additional training frameworks. The NC6 size has one NVIDIA K80 GPU. Mount the file share at a folder named azurefileshare. The full path of this folder on the GPU compute node is $AZ_BATCHAI_MOUNT_ROOT/azurefileshare.
az batchai cluster create --name mycluster --vm-size STANDARD_NC6 --image UbuntuLTS --min 1 --max 1 --afs-name batchaiquickstart --afs-mount-path azurefileshare --user-name <admin_username> --password <admin_password>
After the cluster is created, output is similar to the following:
{
"allocationState": "resizing",
"allocationStateTransitionTime": "2017-10-05T02:09:03.194000+00:00",
"creationTime": "2017-10-05T02:09:01.998000+00:00",
"currentNodeCount": 0,
"errors": null,
"id": "/subscriptions/10d0b7c6-9243-4713-xxxx-xxxxxxxxxxxx/resourceGroups/myresourcegroup/providers/Microsoft.BatchAI/clusters/mycluster",
"location": "eastus",
"name": "mycluster",
"nodeSetup": {
"mountVolumes": {
"azureBlobFileSystems": null,
"azureFileShares": [
{
"accountName": "batchaisamples",
"azureFileUrl": "https://batchaisamples.file.core.windows.net/batchaiquickstart",
"credentialsInfo": {
"accountKey": null,
"accountKeySecretUrl": null
},
"directoryMode": "0777",
"fileMode": "0777",
"relativeMountPath": "azurefileshare"
}
],
"fileServers": null,
"unmanagedFileSystems": null
},
"setupTask": null
},
"nodeStateCounts": {
"idleNodeCount": 0,
"leavingNodeCount": 0,
"preparingNodeCount": 0,
"runningNodeCount": 0,
"unusableNodeCount": 0
},
"provisioningState": "succeeded",
"provisioningStateTransitionTime": "2017-10-05T02:09:02.857000+00:00",
"resourceGroup": "myresourcegroup",
"scaleSettings": {
"autoScale": null,
"manual": {
"nodeDeallocationOption": "requeue",
"targetNodeCount": 1
}
},
"subnet": {
"id": null
},
"tags": null,
"type": "Microsoft.BatchAI/Clusters",
"userAccountSettings": {
"adminUserName": "demoUser",
"adminUserPassword": null,
"adminUserSshPublicKey": null
},
"virtualMachineConfiguration": {
"imageReference": {
"offer": "UbuntuServer",
"publisher": "Canonical",
"sku": "16.04-LTS",
"version": "latest"
}
},
"vmPriority": "dedicated",
"vmSize": "STANDARD_NC6"
}
To get an overview of the cluster status, run the az batchai cluster list command:
az batchai cluster list -o table
Output is similar to the following:
Name       Resource Group    VM Size       State    Idle  Running  Preparing  Unusable  Leaving
---------  ----------------  ------------  -------  ----  -------  ---------  --------  -------
mycluster  myresourcegroup   STANDARD_NC6  steady   1     0        0          0         0
For more detail, run the az batchai cluster show command. It returns all the cluster properties shown after cluster creation.
The cluster is ready when the nodes are allocated and have finished preparation (see the nodeStateCounts attribute). If something went wrong, the errors attribute contains the error description.
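The readiness rule can be expressed as a small check over the JSON returned by az batchai cluster show. The following Python sketch assumes a manually scaled cluster, as in this quickstart, and treats the cluster as ready when no node is still preparing and the idle plus running nodes reach targetNodeCount. The function name and that exact rule are illustrative assumptions rather than an official definition.

```python
def cluster_is_ready(cluster):
    """Return True when all target nodes are allocated and prepared.

    `cluster` is the dictionary parsed from the cluster JSON output.
    Raises RuntimeError if the errors attribute is populated.
    """
    if cluster.get("errors"):
        raise RuntimeError(f"Cluster reported errors: {cluster['errors']}")
    counts = cluster["nodeStateCounts"]
    # Assumes manual scaling; "manual" is null for autoscale clusters.
    target = cluster["scaleSettings"]["manual"]["targetNodeCount"]
    usable = counts["idleNodeCount"] + counts["runningNodeCount"]
    return counts["preparingNodeCount"] == 0 and usable >= target
```

Applied to the creation output shown earlier, where every node count is 0 and targetNodeCount is 1, the check returns False; it returns True once one node is idle.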
After the cluster is ready, configure and submit the learning job.
- Create a JSON template file for job creation named job.json:
{
"properties": {
"stdOutErrPathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/azurefileshare",
"inputDirectories": [{
"id": "SAMPLE",
"path": "$AZ_BATCHAI_MOUNT_ROOT/azurefileshare/mnistcntksample"
}],
"outputDirectories": [{
"id": "MODEL",
"pathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/azurefileshare",
"pathSuffix": "model",
"type": "custom"
}],
"containerSettings": {
"imageSourceRegistry": {
"image": "microsoft/cntk:2.1-gpu-python3.5-cuda8.0-cudnn6.0"
}
},
"nodeCount": 1,
"cntkSettings": {
"pythonScriptFilePath": "$AZ_BATCHAI_INPUT_SAMPLE/ConvNet_MNIST.py",
"commandLineArgs": "$AZ_BATCHAI_INPUT_SAMPLE $AZ_BATCHAI_OUTPUT_MODEL"
}
}
}
- Create a job named myjob to run on the cluster with the az batchai job create command:
az batchai job create --name myjob --cluster-name mycluster --config job.json
Output is similar to the following:
{
"caffeSettings": null,
"chainerSettings": null,
"cluster": {
"id": "/subscriptions/10d0b7c6-9243-4713-xxxx-xxxxxxxxxxxx/resourceGroups/myresourcegroup/providers/Microsoft.BatchAI/clusters/mycluster",
"resourceGroup": "myresourcegroup"
},
"cntkSettings": {
"commandLineArgs": "$AZ_BATCHAI_INPUT_SAMPLE $AZ_BATCHAI_OUTPUT_MODEL",
"configFilePath": null,
"languageType": "Python",
"processCount": 1,
"pythonInterpreterPath": null,
"pythonScriptFilePath": "$AZ_BATCHAI_INPUT_SAMPLE/ConvNet_MNIST.py"
},
"constraints": {
"maxTaskRetryCount": null,
"maxWallClockTime": "7 days, 0:00:00"
},
"containerSettings": {
"imageSourceRegistry": {
"credentials": null,
"image": "microsoft/cntk:2.1-gpu-python3.5-cuda8.0-cudnn6.0",
"serverUrl": null
}
},
"creationTime": "2017-10-05T06:41:42.163000+00:00",
"customToolkitSettings": null,
"environmentVariables": null,
"executionInfo": {
"endTime": null,
"errors": null,
"exitCode": null,
"lastRetryTime": null,
"retryCount": null,
"startTime": "2017-10-05T06:41:44.392000+00:00"
},
"executionState": "running",
"executionStateTransitionTime": "2017-10-05T06:41:44.953000+00:00",
"experimentName": null,
"id": "/subscriptions/10d0b7c6-9243-4713-xxxx-xxxxxxxxxxxx/resourceGroups/demo/providers/Microsoft.BatchAI/jobs/myjob",
"inputDirectories": [
{
"id": "SAMPLE",
"path": "$AZ_BATCHAI_MOUNT_ROOT/azurefileshare/mnistcntksample"
}
],
"jobPreparation": null,
"location": null,
"name": "cntk_job",
"nodeCount": 1,
"outputDirectories": [
{
"createNew": true,
"id": "MODEL",
"pathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/azurefileshare",
"pathSuffix": "model",
"type": "Custom"
}
],
"priority": 0,
"provisioningState": "succeeded",
"provisioningStateTransitionTime": "2017-10-05T06:41:44.238000+00:00",
"resourceGroup": "demo",
"stdOutErrPathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/azurefileshare",
"tags": null,
"tensorFlowSettings": null,
"toolType": "CNTK",
"type": "Microsoft.BatchAI/Jobs"
}
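The job.json template from the first step can also be generated by a script, which may help when the training script or node count changes between runs. The following Python sketch reproduces that template. Note how the directory ids appear to map to environment variables: the input directory with id SAMPLE is referenced as $AZ_BATCHAI_INPUT_SAMPLE, and the output directory with id MODEL as $AZ_BATCHAI_OUTPUT_MODEL. The function names and parameters are hypothetical.

```python
import json

# Mount point of the file share on the compute node, as used in job.json.
MOUNT = "$AZ_BATCHAI_MOUNT_ROOT/azurefileshare"
IMAGE = "microsoft/cntk:2.1-gpu-python3.5-cuda8.0-cudnn6.0"


def build_job_config(script="ConvNet_MNIST.py", node_count=1):
    """Return the job template from this quickstart as a dictionary."""
    return {
        "properties": {
            "stdOutErrPathPrefix": MOUNT,
            "inputDirectories": [
                {"id": "SAMPLE", "path": f"{MOUNT}/mnistcntksample"}
            ],
            "outputDirectories": [
                {"id": "MODEL", "pathPrefix": MOUNT,
                 "pathSuffix": "model", "type": "custom"}
            ],
            "containerSettings": {"imageSourceRegistry": {"image": IMAGE}},
            "nodeCount": node_count,
            "cntkSettings": {
                "pythonScriptFilePath": f"$AZ_BATCHAI_INPUT_SAMPLE/{script}",
                "commandLineArgs":
                    "$AZ_BATCHAI_INPUT_SAMPLE $AZ_BATCHAI_OUTPUT_MODEL",
            },
        }
    }


def write_job_config(path="job.json", **kwargs):
    """Write the template to disk for use with the --config option."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(build_job_config(**kwargs), handle, indent=2)
```

Calling `write_job_config()` with no arguments should produce a file equivalent to the job.json shown above.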
Use the az batchai job list command to get an overview of the job status:
az batchai job list -o table
Output is similar to the following:
Name   Resource Group    Cluster    Cluster RG       Nodes  State    Exit code
-----  ----------------  ---------  ---------------  -----  -------  ---------
myjob  myresourcegroup   mycluster  myresourcegroup  1      running
For more detail, run the az batchai job show command.
The executionState attribute contains the current execution state of the job:
- queued: the job is waiting for the cluster nodes to become available
- running: the job is running
- succeeded (or failed): the job is completed and executionInfo contains details about the result
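A script that monitors a job can branch on these states. The following Python sketch takes the dictionary parsed from the az batchai job show output and reports whether the job has finished, reading the exit code from executionInfo once it has. The function name and the message wording are illustrative assumptions.

```python
# States in which the job has completed, per the list above.
TERMINAL_STATES = {"succeeded", "failed"}


def summarize_job(job):
    """Return (finished, message) for a parsed job dictionary."""
    state = job["executionState"]
    if state not in TERMINAL_STATES:
        # queued or running: executionInfo has no result yet
        return False, f"job is {state}"
    info = job.get("executionInfo") or {}
    return True, f"job {state} with exit code {info.get('exitCode')}"
```

For the job output shown earlier, where executionState is running, the function returns `(False, "job is running")`.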
Use the az batchai job list-files command to list links to the stdout and stderr log files:
az batchai job list-files --name myjob --output-directory-id stdouterr
Output is similar to the following:
[
{
"contentLength": 733,
"downloadUrl": "https://batchaisamples.file.core.windows.net/batchaiquickstart/10d0b7c6-9243-4713-91a9-2730375d3a1b/demo/jobs/cntk_job/stderr.txt?sv=2016-05-31&sr=f&sig=Rh%2BuTg9C1yQxm7NfA9YWiKb%2B5FRKqWmEXiGNRDeFMd8%3D&se=2017-10-05T07%3A44%3A38Z&sp=rl",
"lastModified": "2017-10-05T06:44:38+00:00",
"name": "stderr.txt"
},
{
"contentLength": 300,
"downloadUrl": "https://batchaisamples.file.core.windows.net/batchaiquickstart/10d0b7c6-9243-4713-91a9-2730375d3a1b/demo/jobs/cntk_job/stdout.txt?sv=2016-05-31&sr=f&sig=jMhJfQOGry9jr4Hh3YyUFpW5Uaxnp38bhVWNrTTWMtk%3D&se=2017-10-05T07%3A44%3A38Z&sp=rl",
"lastModified": "2017-10-05T06:44:29+00:00",
"name": "stdout.txt"
}
]
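The listing is plain JSON, so the links can be extracted by file name. The download URLs carry a shared access signature whose `se` query parameter is the expiry time, so a link should be used before that moment. The following Python sketch maps names to URLs and reads the expiry; the function names are hypothetical, and the timestamp format is assumed to match the sample output above.

```python
import json
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlparse


def download_urls(listing_json):
    """Map each file name in the listing to its downloadUrl."""
    return {entry["name"]: entry["downloadUrl"]
            for entry in json.loads(listing_json)}


def link_expiry(url):
    """Return the expiry (the `se` parameter) as an aware UTC datetime."""
    expiry = parse_qs(urlparse(url).query)["se"][0]
    parsed = datetime.strptime(expiry, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)
```

For example, `download_urls(output)["stderr.txt"]` yields the link for the error log, which can then be fetched with any HTTP client while it is still valid.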
You can stream or tail a job's output files while the job is executing. The following example uses the az batchai job stream-file command to stream the stderr.txt log:
az batchai job stream-file --job-name myjob --output-directory-id stdouterr --name stderr.txt
Output is similar to the following. Interrupt the output by pressing [Ctrl]-[C].
…
Finished Epoch[2 of 40]: [Training] loss = 0.104846 * 60000, metric = 3.00% * 60000 3.849s (15588.5 samples/s);
Finished Epoch[3 of 40]: [Training] loss = 0.077043 * 60000, metric = 2.23% * 60000 3.902s (15376.7 samples/s);
Finished Epoch[4 of 40]: [Training] loss = 0.063050 * 60000, metric = 1.82% * 60000 3.811s (15743.9 samples/s);
…
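The epoch lines have a regular shape, so training progress can be pulled out of the streamed log for plotting or early inspection. The following Python sketch extracts the epoch number, total epochs, loss, and metric percentage from text in the format shown above. The regular expression is derived only from these sample lines and may need adjusting for other CNTK output; the names are hypothetical.

```python
import re

# Matches lines such as:
# Finished Epoch[2 of 40]: [Training] loss = 0.104846 * 60000, metric = 3.00% ...
EPOCH_LINE = re.compile(
    r"Finished Epoch\[(\d+) of (\d+)\]: \[Training\] "
    r"loss = ([\d.]+) \* \d+, metric = ([\d.]+)%"
)


def parse_epochs(log_text):
    """Return a list of (epoch, total_epochs, loss, metric_percent) tuples."""
    return [
        (int(m.group(1)), int(m.group(2)),
         float(m.group(3)), float(m.group(4)))
        for m in EPOCH_LINE.finditer(log_text)
    ]
```

Lines that do not match the pattern are skipped, so the function can be applied to the whole stderr.txt content.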
Use the az batchai job delete command to delete the job:
az batchai job delete --name myjob
Use the az batchai cluster delete command to delete the cluster:
az batchai cluster delete --name mycluster
In this quickstart, you learned how to run a CNTK training job on a Batch AI cluster using the Azure CLI. To learn more about using Batch AI with different toolkits, see the training recipes.