---
title: Run Python scripts through Data Factory
description: Tutorial - Learn how to run Python scripts as part of a pipeline through Azure Data Factory using Azure Batch.
author: mammask
ms.devlang: python
ms.topic: tutorial
ms.date: 12/11/2019
ms.author: komammas
ms.custom: mvc, tracking-python
---
In this tutorial, you'll learn how to:
[!div class="checklist"]
- Authenticate with Batch and Storage accounts
- Develop and run a script in Python
- Create a pool of compute nodes to run an application
- Schedule your Python workloads
- Monitor your analytics pipeline
- Access your log files
The example below runs a Python script that receives CSV input from a blob storage container, performs a data manipulation process, and writes the output to a separate blob storage container.
If you don’t have an Azure subscription, create a free account before you begin.
- An installed Python distribution, for local testing.
- The Azure `pip` package.
- The `iris.csv` dataset
- An Azure Batch account and a linked Azure Storage account. See Create a Batch account for more information on how to create and link Batch accounts to storage accounts.
- An Azure Data Factory account. See Create a data factory for more information on how to create a data factory through the Azure portal.
- Batch Explorer.
- Azure Storage Explorer.
Sign in to the Azure portal at https://portal.azure.com.
[!INCLUDE batch-common-credentials]
In this section, you'll use Batch Explorer to create the Batch pool that your Azure Data Factory pipeline will use.
- Sign in to Batch Explorer using your Azure credentials.
- Select your Batch account.
- Create a pool by selecting Pools on the left side bar, then the Add button above the search form.
- Choose an ID and display name. We'll use `custom-activity-pool` for this example.
- Set the scale type to Fixed size, and set the dedicated node count to 2.
- Under Data science, select Dsvm Windows as the operating system.
- Choose `Standard_f2s_v2` as the virtual machine size.
- Enable the start task and add the command `cmd /c "pip install pandas"`. The user identity can remain as the default Pool user.
- Select OK.
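If you'd rather script the pool creation than click through Batch Explorer, the sketch below shows roughly the same configuration using the `azure-batch` Python SDK. The placeholder account name, key, URL, and the DSVM Windows image reference values (`microsoft-dsvm`, `dsvm-windows`, `server-2016`) are assumptions; verify them against the images available to your Batch account before running it.

```python
# A minimal sketch of creating the same pool with the azure-batch SDK.
# Placeholder credentials and the DSVM image reference values are assumptions.
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

credentials = SharedKeyCredentials("<batch-account-name>", "<batch-account-key>")
batch_client = BatchServiceClient(
    credentials, batch_url="https://<batch-account-name>.<region>.batch.azure.com")

pool = batchmodels.PoolAddParameter(
    id="custom-activity-pool",
    vm_size="Standard_f2s_v2",
    target_dedicated_nodes=2,
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(
            publisher="microsoft-dsvm",   # assumed DSVM Windows image reference
            offer="dsvm-windows",
            sku="server-2016",
            version="latest",
        ),
        node_agent_sku_id="batch.node.windows amd64",
    ),
    # Install pandas on each node before any tasks run
    start_task=batchmodels.StartTask(
        command_line='cmd /c "pip install pandas"',
        wait_for_success=True,
    ),
)

batch_client.pool.add(pool)
```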
Here you'll create blob containers that will store the input and output files for the Batch job.
- Sign in to Storage Explorer using your Azure credentials.
- Using the storage account linked to your Batch account, create two blob containers (one for input files, one for output files) by following the steps at Create a blob container.
- In this example, we'll call our input container `input`, and our output container `output`.
- Upload `main.py` and `iris.csv` to your input container `input` using Storage Explorer by following the steps at Managing blobs in a blob container.
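If you prefer to script the container setup and upload instead of using Storage Explorer, here's a minimal sketch using the same legacy `azure-storage` SDK (`BlockBlobService`) that the tutorial script uses; the placeholder account name and key are assumptions.

```python
# A minimal sketch of creating the containers and uploading the input files
# with the legacy azure-storage SDK; the placeholder credentials are assumptions.
from azure.storage.blob import BlockBlobService

blobService = BlockBlobService(account_name="<storage-account-name>",
                               account_key="<storage-account-key>")

# Create the input and output containers used in this tutorial
blobService.create_container("input")
blobService.create_container("output")

# Upload the script and the dataset to the input container
blobService.create_blob_from_path("input", "main.py", "main.py")
blobService.create_blob_from_path("input", "iris.csv", "iris.csv")
```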
The following Python script loads the `iris.csv` dataset from your `input` container, performs a data manipulation process, and saves the results back to the `output` container.
```python
# Load libraries
from azure.storage.blob import BlockBlobService
import pandas as pd

# Define parameters
storageAccountName = "<storage-account-name>"
storageKey = "<storage-account-key>"
containerName = "output"

# Establish connection with the blob storage account
blobService = BlockBlobService(account_name=storageAccountName,
                               account_key=storageKey)

# Load iris dataset from the task node
df = pd.read_csv("iris.csv")

# Subset records
df = df[df['Species'] == "setosa"]

# Save the subset of the iris dataframe locally in the task node
df.to_csv("iris_setosa.csv", index=False)

# Upload the result file to the output container
blobService.create_blob_from_path(containerName, "iris_setosa.csv", "iris_setosa.csv")
```
Save the script as `main.py` and upload it to the Azure Storage `input` container. Be sure to test and validate its functionality locally before uploading it to your blob container:

```bash
python main.py
```
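After a local run, you can optionally confirm that the result landed in the `output` container. The quick check below (with placeholder credentials as assumptions) simply lists the blobs the container holds.

```python
# Optional check, assuming placeholder credentials: list the blobs in the
# output container to confirm that iris_setosa.csv was uploaded.
from azure.storage.blob import BlockBlobService

blobService = BlockBlobService(account_name="<storage-account-name>",
                               account_key="<storage-account-key>")
for blob in blobService.list_blobs("output"):
    print(blob.name)
```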
In this section, you'll create and validate a pipeline using your Python script.
- Follow the steps to create a data factory under the "Create a data factory" section of this article.
- In the Factory Resources box, select the + (plus) button and then select Pipeline.
- In the General tab, set the name of the pipeline as "Run Python".
- In the Activities box, expand Batch Service. Drag the custom activity from the Activities toolbox to the pipeline designer surface.
- In the General tab, specify testPipeline for Name.
- In the Azure Batch tab, add the Batch account that was created in the previous steps and Test connection to ensure that it is successful.
- In the Settings tab, enter the command `python main.py`.
- For the Resource Linked Service, add the storage account that was created in the previous steps. Test the connection to ensure it is successful.
- In the Folder Path, select the name of the Azure Blob Storage container that contains the Python script and the associated inputs. This will download the selected files from the container to the pool node instances before the execution of the Python script.
- Click Validate on the pipeline toolbar above the canvas to validate the pipeline settings. Confirm that the pipeline has been successfully validated. To close the validation output, select the >> (right arrow) button.
- Click Debug to test the pipeline and ensure it works accurately.
- Click Publish to publish the pipeline.
- Click Trigger to run the Python script as part of a batch process.
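If you want to define and trigger the same pipeline programmatically rather than in the Data Factory UI, the rough sketch below uses the `azure-mgmt-datafactory` SDK. The subscription, resource group, factory, and linked service names are hypothetical placeholders and must match the resources and linked services you actually created; treat the exact constructor arguments as assumptions to check against your installed SDK version.

```python
# A rough sketch, under the assumptions stated above, of defining the pipeline
# with the azure-mgmt-datafactory SDK. All <...> values are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    CustomActivity, LinkedServiceReference, PipelineResource
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Custom activity that runs main.py on the Batch pool
activity = CustomActivity(
    name="testPipeline",
    command="python main.py",
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="<batch-linked-service>"),
    resource_linked_service=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="<storage-linked-service>"),
    folder_path="input",  # container holding main.py and iris.csv
)

pipeline = PipelineResource(activities=[activity])
adf_client.pipelines.create_or_update(
    "<resource-group>", "<data-factory-name>", "Run Python", pipeline)

# Trigger a pipeline run
run = adf_client.pipelines.create_run(
    "<resource-group>", "<data-factory-name>", "Run Python")
print(run.run_id)
```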
If the execution of your script produces warnings or errors, you can check `stdout.txt` or `stderr.txt` for more information on the output that was logged.
- Select Jobs from the left-hand side of Batch Explorer.
- Choose the job created by your data factory. Assuming you named your pool `custom-activity-pool`, select `adfv2-custom-activity-pool`.
- Click on the task that had a failure exit code.
- View `stdout.txt` and `stderr.txt` to investigate and diagnose your problem.
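You can also retrieve these files programmatically. The sketch below, which assumes the same placeholder Batch credentials as earlier and a hypothetical task ID, downloads a task's `stderr.txt` with the `azure-batch` SDK; the job ID matches the `adfv2-custom-activity-pool` job shown in Batch Explorer.

```python
# A small sketch, assuming placeholder credentials and a hypothetical task ID,
# that downloads a task's stderr.txt with the azure-batch SDK.
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials

credentials = SharedKeyCredentials("<batch-account-name>", "<batch-account-key>")
batch_client = BatchServiceClient(
    credentials, batch_url="https://<batch-account-name>.<region>.batch.azure.com")

# Stream stderr.txt from the task node and save it locally
stream = batch_client.file.get_from_task(
    "adfv2-custom-activity-pool", "<task-id>", "stderr.txt")
with open("stderr.txt", "wb") as f:
    for chunk in stream:
        f.write(chunk)
```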
In this tutorial, you explored an example that taught you how to run Python scripts as part of a pipeline through Azure Data Factory using Azure Batch.
To learn more about Azure Data Factory, see:
[!div class="nextstepaction"]
- Azure Data Factory
- Pipelines and activities
- Custom activities