---
title: Run Python scripts through Data Factory
description: Tutorial - Learn how to run Python scripts as part of a pipeline through Azure Data Factory using Azure Batch.
author: mammask
ms.devlang: python
ms.topic: tutorial
ms.date: 12/11/2019
ms.author: komammas
ms.custom: mvc, tracking-python
---

# Tutorial: Run Python scripts through Azure Data Factory using Azure Batch

In this tutorial, you'll learn how to:

> [!div class="checklist"]
> * Authenticate with Batch and Storage accounts
> * Develop and run a script in Python
> * Create a pool of compute nodes to run an application
> * Schedule your Python workloads
> * Monitor your analytics pipeline
> * Access your log files

The example below runs a Python script that receives CSV input from a blob storage container, performs a data manipulation process, and writes the output to a separate blob storage container.

If you don’t have an Azure subscription, create a free account before you begin.

## Prerequisites

## Sign in to Azure

Sign in to the Azure portal at https://portal.azure.com.

[!INCLUDE batch-common-credentials]

## Create a Batch pool using Batch Explorer

In this section, you'll use Batch Explorer to create the Batch pool that your Azure Data Factory pipeline will use. If you prefer to create the pool programmatically, a sketch using the Batch Python SDK follows the steps below.

1. Sign in to Batch Explorer using your Azure credentials.
2. Select your Batch account.
3. Create a pool by selecting **Pools** on the left side bar, then the **Add** button above the search form.
    1. Choose an ID and display name. We'll use `custom-activity-pool` for this example.
    2. Set the scale type to **Fixed size**, and set the dedicated node count to 2.
    3. Under **Data science**, select **Dsvm Windows** as the operating system.
    4. Choose **Standard_f2s_v2** as the virtual machine size.
    5. Enable the start task and add the command `cmd /c "pip install pandas"`. The user identity can remain as the default **Pool user**.
    6. Select **OK**.
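
As a scripted alternative to the Batch Explorer steps above, the following is a minimal sketch using the Batch Python SDK (`azure-batch`). The account name, key, URL, and image reference values are placeholders, not values from this tutorial; use `batch_client.account.list_supported_images()` to find the exact DSVM Windows image and node agent SKU for your region, and note that older SDK versions name the client parameter `base_url` instead of `batch_url`.

```python
# Minimal sketch: create an equivalent pool with the Batch Python SDK (azure-batch).
# All "<...>" values are placeholders; verify the image reference and node agent SKU
# with batch_client.account.list_supported_images() before running.
import azure.batch.models as batchmodels
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials

credentials = SharedKeyCredentials("<batch-account-name>", "<batch-account-key>")
# Note: older azure-batch versions name this parameter base_url instead of batch_url.
batch_client = BatchServiceClient(
    credentials, batch_url="https://<batch-account-name>.<region>.batch.azure.com")

pool = batchmodels.PoolAddParameter(
    id="custom-activity-pool",
    vm_size="Standard_F2s_v2",
    target_dedicated_nodes=2,
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(
            publisher="<dsvm-publisher>",      # placeholder: Data Science VM (Windows) image
            offer="<dsvm-windows-offer>",
            sku="<dsvm-windows-sku>",
            version="latest"),
        node_agent_sku_id="<windows-node-agent-sku>"),
    # Same start task as in the UI steps: install pandas on every node.
    start_task=batchmodels.StartTask(
        command_line='cmd /c "pip install pandas"',
        wait_for_success=True,
        user_identity=batchmodels.UserIdentity(
            auto_user=batchmodels.AutoUserSpecification(
                scope=batchmodels.AutoUserScope.pool,
                elevation_level=batchmodels.ElevationLevel.non_admin))))

batch_client.pool.add(pool)
```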

## Create blob containers

Here you'll create the blob containers that will store the input and output files for your Batch job.

1. Sign in to Storage Explorer using your Azure credentials.
2. Using the storage account linked to your Batch account, create two blob containers (one for input files, one for output files) by following the steps at Create a blob container.
    * In this example, we'll call our input container `input`, and our output container `output`.
3. Upload `main.py` and `iris.csv` to your input container `input` using Storage Explorer by following the steps at Managing blobs in a blob container. (A scripted alternative is sketched after these steps.)
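
If you'd rather script this step than use Storage Explorer, here is a minimal sketch using version 12 of the `azure-storage-blob` package. The connection string is a placeholder, and `main.py` and `iris.csv` are assumed to be in the working directory:

```python
# Minimal sketch: create the input/output containers and upload the input files
# with the azure-storage-blob v12 SDK. The connection string is a placeholder.
from azure.core.exceptions import ResourceExistsError
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-account-connection-string>")

for container in ("input", "output"):
    try:
        service.create_container(container)
    except ResourceExistsError:
        pass  # container already exists

for filename in ("main.py", "iris.csv"):
    with open(filename, "rb") as data:
        service.get_blob_client(container="input", blob=filename).upload_blob(data, overwrite=True)
```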

## Develop a script in Python

The following Python script loads the iris.csv dataset from your input container, performs a data manipulation process, and saves the results back to the output container.

```python
# Load libraries
from azure.storage.blob import BlockBlobService
import pandas as pd

# Define parameters
storageAccountName = "<storage-account-name>"
storageKey         = "<storage-account-key>"
containerName      = "output"

# Establish connection with the blob storage account
blobService = BlockBlobService(account_name=storageAccountName,
                               account_key=storageKey
                               )

# Load iris dataset from the task node
df = pd.read_csv("iris.csv")

# Subset records
df = df[df['Species'] == "setosa"]

# Save the subset of the iris dataframe locally in task node
df.to_csv("iris_setosa.csv", index = False)

# Upload the subset file to the output container
blobService.create_blob_from_path(containerName, "iris_setosa.csv", "iris_setosa.csv")
```

Save the script as `main.py` and upload it to the Azure Storage container. Be sure to test and validate its functionality locally before uploading it to your blob container:

```bash
python main.py
```
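
Note that the script above uses the legacy `azure-storage-blob` v2 SDK; `BlockBlobService` is no longer available in version 12 of the package. If your pool's nodes have the v12 SDK instead, the upload step would look roughly like the sketch below (using the same placeholder account name and key as above). If the package isn't preinstalled on the DSVM image, you may also need to add it to the pool's start task command.

```python
# Equivalent upload step using the azure-storage-blob v12 SDK,
# where BlockBlobService is no longer available.
from azure.storage.blob import BlobServiceClient

blob_service = BlobServiceClient(
    account_url=f"https://{storageAccountName}.blob.core.windows.net",
    credential=storageKey)

with open("iris_setosa.csv", "rb") as data:
    blob_service.get_blob_client(container=containerName, blob="iris_setosa.csv").upload_blob(data, overwrite=True)
```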

## Set up an Azure Data Factory pipeline

In this section, you'll create and validate a pipeline using your Python script.

1. Follow the steps to create a data factory under the "Create a data factory" section of this article.

2. In the **Factory Resources** box, select the + (plus) button and then select **Pipeline**.

3. In the **General** tab, set the name of the pipeline as "Run Python".

4. In the **Activities** box, expand **Batch Service**. Drag the custom activity from the **Activities** toolbox to the pipeline designer surface.

5. In the **General** tab, specify **testPipeline** for **Name**.

6. In the **Azure Batch** tab, add the Batch account that was created in the previous steps and select **Test connection** to ensure that it is successful.

7. In the **Settings** tab, enter the command `python main.py`.

8. For the **Resource Linked Service**, add the storage account that was created in the previous steps. Test the connection to ensure it is successful.

9. In the **Folder Path**, select the name of the Azure Blob Storage container that contains the Python script and the associated inputs. The selected files will be downloaded from the container to the pool node instances before the Python script runs.

10. Click **Validate** on the pipeline toolbar above the canvas to validate the pipeline settings. Confirm that the pipeline has been successfully validated. To close the validation output, select the **>>** (right arrow) button.

11. Click **Debug** to test the pipeline and ensure it works accurately.

12. Click **Publish** to publish the pipeline.

13. Click **Trigger** to run the Python script as part of a batch process. (A scripted alternative for triggering and monitoring the run is sketched after these steps.)
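
As an alternative to triggering the run from the portal UI, you can start and monitor the pipeline run programmatically. The following is a minimal sketch using the `azure-mgmt-datafactory` and `azure-identity` packages; the subscription ID, resource group, and factory name are placeholders, and "Run Python" is the pipeline name from step 3.

```python
# Minimal sketch: trigger the "Run Python" pipeline and poll its status with the
# Data Factory management SDK. All "<...>" values are placeholders.
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

run = adf_client.pipelines.create_run("<resource-group>", "<data-factory-name>", "Run Python")

while True:
    pipeline_run = adf_client.pipeline_runs.get("<resource-group>", "<data-factory-name>", run.run_id)
    print("Pipeline run status:", pipeline_run.status)
    if pipeline_run.status not in ("Queued", "InProgress"):
        break
    time.sleep(30)
```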

## Monitor the log files

If warnings or errors are produced by the execution of your script, you can check `stdout.txt` or `stderr.txt` for more information about the output that was logged.

1. Select **Jobs** from the left-hand side of Batch Explorer.
2. Choose the job created by your data factory. Assuming you named your pool `custom-activity-pool`, select `adfv2-custom-activity-pool`.
3. Click on the task that had a failure exit code.
4. View `stdout.txt` and `stderr.txt` to investigate and diagnose your problem. (A scripted way to download these files is sketched after these steps.)
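
You can also retrieve these log files without the Batch Explorer UI. Here is a minimal sketch using the Batch Python SDK, reusing a `BatchServiceClient` like the one in the pool-creation sketch earlier; the job and task IDs are placeholders you can copy from Batch Explorer.

```python
# Minimal sketch: download stdout.txt and stderr.txt for a task with the Batch Python SDK.
# batch_client is a BatchServiceClient as in the pool-creation sketch; IDs are placeholders.
for filename in ("stdout.txt", "stderr.txt"):
    stream = batch_client.file.get_from_task("<job-id>", "<task-id>", filename)
    with open(filename, "wb") as f:
        for chunk in stream:
            f.write(chunk)
```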

## Next steps

In this tutorial, you explored an example that showed you how to run Python scripts as part of a pipeline through Azure Data Factory using Azure Batch.

To learn more about Azure Data Factory, see:

> [!div class="nextstepaction"]
> - Azure Data Factory
> - Pipelines and activities
> - Custom activities