---
title: Build your first data factory (Azure portal) | Microsoft Docs
description: In this tutorial, you create a sample Azure Data Factory pipeline using Data Factory Editor in the Azure portal.
services: data-factory
documentationcenter: ''
author: spelluru
manager: jhubbard
editor: monicar

ms.assetid: d5b14e9e-e358-45be-943c-5297435d402d
ms.service: data-factory
ms.workload: data-services
ms.tgt_pltfrm: na
ms.devlang: na
ms.topic: hero-article
ms.date: 12/06/2016
ms.author: spelluru
---
In this article, you learn how to use the Azure portal to create your first Azure data factory.
- Read through the Tutorial Overview article and complete the prerequisite steps.
- This article does not provide a conceptual overview of the Azure Data Factory service. We recommend that you go through the Introduction to Azure Data Factory article for a detailed overview of the service.
A data factory can have one or more pipelines. A pipeline can have one or more activities in it. For example, a Copy Activity copies data from a source to a destination data store, and an HDInsight Hive activity runs a Hive script to transform input data and produce output data. Let's start by creating the data factory in this step.
- Log in to the Azure portal.
- Click NEW on the left menu, click Data + Analytics, and click Data Factory.
- In the New data factory blade, enter GetStartedDF for the Name.
[!IMPORTANT] The name of the Azure data factory must be globally unique. If you receive the error Data factory name “GetStartedDF” is not available, change the name of the data factory (for example, yournameGetStartedDF) and try creating again. See the Data Factory - Naming Rules topic for naming rules for Data Factory artifacts.
The name of the data factory may be registered as a DNS name in the future and hence become publicly visible.
- Select the Azure subscription where you want the data factory to be created.
- Select an existing resource group or create a resource group. For the tutorial, create a resource group named ADFGetStartedRG.
- Click Create on the New data factory blade.
[!IMPORTANT] To create Data Factory instances, you must be a member of the Data Factory Contributor role at the subscription/resource group level.
- You see the data factory being created in the Startboard of the Azure portal as follows:
- Congratulations! You have successfully created your first data factory. After the data factory has been created successfully, you see the data factory page, which shows you the contents of the data factory.
Before creating a pipeline in the data factory, you need to create a few Data Factory entities first. You first create linked services to link data stores and computes to your data factory, then define input and output datasets to represent input/output data in the linked data stores, and finally create the pipeline with an activity that uses these datasets.
In this step, you link your Azure Storage account and an on-demand Azure HDInsight cluster to your data factory. The Azure Storage account holds the input and output data for the pipeline in this sample. The HDInsight linked service is used to run the Hive script specified in the activity of the pipeline in this sample. Identify which data store and compute services are used in your scenario, and link those services to the data factory by creating linked services.
In this step, you link your Azure Storage account to your data factory. In this tutorial, you use the same Azure Storage account to store input/output data and the HQL script file.
- Click Author and deploy on the DATA FACTORY blade for GetStartedDF. You should see the Data Factory Editor.
- Click New data store and choose Azure storage.
- You should see the JSON script for creating an Azure Storage linked service in the editor. (A sketch of the completed JSON appears after these steps.)
- Replace account name with the name of your Azure storage account and account key with the access key of the Azure storage account. To learn how to get your storage access key, see the information about how to view, copy, and regenerate storage access keys in Manage your storage account.
- Click Deploy on the command bar to deploy the linked service.
After the linked service is deployed successfully, the Draft-1 window should disappear and you see AzureStorageLinkedService in the tree view on the left.
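For reference, here is a minimal sketch of what the completed Azure Storage linked service JSON typically looks like once the placeholders are filled in. The <accountname> and <accountkey> values are placeholders for your own storage account name and access key; the exact template that the editor generates may differ slightly.

```json
{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
        }
    }
}
```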
In this step, you link an on-demand HDInsight cluster to your data factory. The HDInsight cluster is automatically created at runtime and deleted after it is done processing and idle for the specified amount of time.
- In the Data Factory Editor, click ... More, click New compute, and select On-demand HDInsight cluster.
- Copy and paste the following snippet to the Draft-1 window. The JSON snippet describes the properties that are used to create the HDInsight cluster on-demand.
{ "name": "HDInsightOnDemandLinkedService", "properties": { "type": "HDInsightOnDemand", "typeProperties": { "version": "3.2", "clusterSize": 1, "timeToLive": "00:30:00", "linkedServiceName": "AzureStorageLinkedService" } } }
The following table provides descriptions for the JSON properties used in the snippet:
Property | Description |
---|---|
version | Specifies that the version of the HDInsight cluster to be created is 3.2. |
clusterSize | Specifies the size of the HDInsight cluster. |
timeToLive | Specifies the idle time for the HDInsight cluster before it is deleted. |
linkedServiceName | Specifies the storage account that is used to store the logs that are generated by HDInsight. |

Note the following points:
- The Data Factory creates a Windows-based HDInsight cluster for you with the JSON. You could also have it create a Linux-based HDInsight cluster. See On-demand HDInsight Linked Service for details.
- You could use your own HDInsight cluster instead of using an on-demand HDInsight cluster. (A sketch of such a linked service appears after these steps.) See HDInsight Linked Service for details.
- The HDInsight cluster creates a default container in the blob storage you specified in the JSON (linkedServiceName). HDInsight does not delete this container when the cluster is deleted. This behavior is by design. With an on-demand HDInsight linked service, an HDInsight cluster is created every time a slice is processed unless there is an existing live cluster (timeToLive). The cluster is automatically deleted when the processing is done.

  As more slices are processed, you see many containers in your Azure blob storage. If you do not need them for troubleshooting of the jobs, you may want to delete them to reduce the storage cost. The names of these containers follow a pattern: "adfyourdatafactoryname-linkedservicename-datetimestamp". Use tools such as Microsoft Storage Explorer to delete containers in your Azure blob storage.

  See On-demand HDInsight Linked Service for details.
- Click Deploy on the command bar to deploy the linked service.
- Confirm that you see both AzureStorageLinkedService and HDInsightOnDemandLinkedService in the tree view on the left.
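If you decide to bring your own HDInsight cluster instead of the on-demand option, the linked service definition takes a different shape. The following is a minimal sketch, assuming a hypothetical cluster named mycluster and its HTTP credentials; see HDInsight Linked Service for the authoritative schema and properties.

```json
{
    "name": "HDInsightLinkedService",
    "properties": {
        "type": "HDInsight",
        "typeProperties": {
            "clusterUri": "https://mycluster.azurehdinsight.net/",
            "userName": "admin",
            "password": "<password>",
            "linkedServiceName": "AzureStorageLinkedService"
        }
    }
}
```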
In this step, you create datasets to represent the input and output data for Hive processing. These datasets refer to the AzureStorageLinkedService you created earlier in this tutorial. The linked service points to an Azure Storage account, and the datasets specify the container, folder, and file name in that storage account that holds the input and output data.
- In the Data Factory Editor, click ... More on the command bar, click New dataset, and select Azure Blob storage.
- Copy and paste the following snippet to the Draft-1 window. In the JSON snippet, you are creating a dataset called AzureBlobInput that represents input data for an activity in the pipeline. In addition, you specify that the input data is located in the blob container called adfgetstarted and the folder called inputdata.
{ "name": "AzureBlobInput", "properties": { "type": "AzureBlob", "linkedServiceName": "AzureStorageLinkedService", "typeProperties": { "fileName": "input.log", "folderPath": "adfgetstarted/inputdata", "format": { "type": "TextFormat", "columnDelimiter": "," } }, "availability": { "frequency": "Month", "interval": 1 }, "external": true, "policy": {} } }
The following table provides descriptions for the JSON properties used in the snippet:
Property | Description |
---|---|
type | The type property is set to AzureBlob because data resides in Azure blob storage. |
linkedServiceName | Refers to the AzureStorageLinkedService you created earlier. |
fileName | This property is optional. If you omit this property, all the files from the folderPath are picked. In this case, only input.log is processed. |
type (under format) | The log files are in text format, so we use TextFormat. |
columnDelimiter | Columns in the log files are delimited by the comma character (,). |
frequency/interval | frequency is set to Month and interval is 1, which means that the input slices are available monthly. |
external | This property is set to true because the input data is not generated by the Data Factory service. |

- Click Deploy on the command bar to deploy the newly created dataset. You should see the dataset in the tree view on the left.
Now, you create the output dataset to represent the output data stored in the Azure Blob storage.
- In the Data Factory Editor, click ... More on the command bar, click New dataset, and select Azure Blob storage.
- Copy and paste the following snippet to the Draft-1 window. In the JSON snippet, you are creating a dataset called AzureBlobOutput, and specifying the structure of the data that is produced by the Hive script. In addition, you specify that the results are stored in the blob container called adfgetstarted and the folder called partitioneddata. The availability section specifies that the output dataset is produced on a monthly basis.
{ "name": "AzureBlobOutput", "properties": { "type": "AzureBlob", "linkedServiceName": "AzureStorageLinkedService", "typeProperties": { "folderPath": "adfgetstarted/partitioneddata", "format": { "type": "TextFormat", "columnDelimiter": "," } }, "availability": { "frequency": "Month", "interval": 1 } } }
See the Create the input dataset section for descriptions of these properties. You do not set the external property on an output dataset because the dataset is produced by the Data Factory service.
- Click Deploy on the command bar to deploy the newly created dataset.
- Verify that the dataset is created successfully.
In this step, you create your first pipeline with an HDInsightHive activity. The input slice is available monthly (frequency: Month, interval: 1), the output slice is produced monthly, and the scheduler property for the activity is also set to monthly. The settings for the output dataset and the activity scheduler must match. Currently, the output dataset drives the schedule, so you must create an output dataset even if the activity does not produce any output. If the activity doesn't take any input, you can skip creating the input dataset. The properties used in the following JSON are explained at the end of this section.
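For example, in the JSON you deploy in this step, the availability section of the AzureBlobOutput dataset and the scheduler section of the activity both carry the same monthly setting:

```json
{ "frequency": "Month", "interval": 1 }
```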
- In the Data Factory Editor, click Ellipsis (…) More commands and then click New pipeline.
- Copy and paste the following snippet to the Draft-1 window.
[!IMPORTANT] Replace storageaccountname with the name of your storage account in the JSON.
{ "name": "MyFirstPipeline", "properties": { "description": "My first Azure Data Factory pipeline", "activities": [ { "type": "HDInsightHive", "typeProperties": { "scriptPath": "adfgetstarted/script/partitionweblogs.hql", "scriptLinkedService": "AzureStorageLinkedService", "defines": { "inputtable": "wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/inputdata", "partitionedtable": "wasb://adfgetstarted@<storageaccountname>.blob.core.windows.net/partitioneddata" } }, "inputs": [ { "name": "AzureBlobInput" } ], "outputs": [ { "name": "AzureBlobOutput" } ], "policy": { "concurrency": 1, "retry": 3 }, "scheduler": { "frequency": "Month", "interval": 1 }, "name": "RunSampleHiveActivity", "linkedServiceName": "HDInsightOnDemandLinkedService" } ], "start": "2016-04-01T00:00:00Z", "end": "2016-04-02T00:00:00Z", "isPaused": false } }
In the JSON snippet, you are creating a pipeline that consists of a single activity that uses Hive to process data on an HDInsight cluster.

- The Hive script file, partitionweblogs.hql, is stored in the Azure storage account (specified by the scriptLinkedService, called AzureStorageLinkedService), in the script folder in the container adfgetstarted.
- The defines section is used to specify the runtime settings that are passed to the Hive script as Hive configuration values (for example, ${hiveconf:inputtable}, ${hiveconf:partitionedtable}).
- The start and end properties of the pipeline specify the active period of the pipeline.
- In the activity JSON, you specify that the Hive script runs on the compute specified by the linkedServiceName property, HDInsightOnDemandLinkedService.
[!NOTE] See "Pipeline JSON" in Pipelines and activities in Azure Data Factory for details about JSON properties used in the example.
- Confirm the following:
  - The input.log file exists in the inputdata folder of the adfgetstarted container in the Azure blob storage.
  - The partitionweblogs.hql file exists in the script folder of the adfgetstarted container in the Azure blob storage. Complete the prerequisite steps in the Tutorial Overview if you don't see these files.
  - You replaced storageaccountname with the name of your storage account in the pipeline JSON.
- Click Deploy on the command bar to deploy the pipeline. Since the start and end times are set in the past and isPaused is set to false, the pipeline (the activity in the pipeline) runs immediately after you deploy.
- Confirm that you see the pipeline in the tree view.
- Congratulations, you have successfully created your first pipeline!
- Click X to close the Data Factory Editor blades and to navigate back to the Data Factory blade, and click Diagram.
- In the Diagram View, you see an overview of the pipelines and datasets used in this tutorial.
- To view all activities in the pipeline, right-click the pipeline in the diagram and click Open Pipeline.
- Confirm that you see the HDInsightHive activity in the pipeline.

  To navigate back to the previous view, click Data factory in the breadcrumb menu at the top.
- In the Diagram View, double-click the dataset AzureBlobInput. Confirm that the slice is in Ready state. It may take a couple of minutes for the slice to show up in Ready state. If it does not happen after you wait for some time, see whether you have the input file (input.log) placed in the right container (adfgetstarted) and folder (inputdata).
- Click X to close the AzureBlobInput blade.
- In the Diagram View, double-click the dataset AzureBlobOutput. You see that the slice is currently being processed.
- When processing is done, you see the slice in Ready state.
[!IMPORTANT] Creation of an on-demand HDInsight cluster usually takes some time (approximately 20 minutes). Therefore, expect the pipeline to take approximately 30 minutes to process the slice.
- When the slice is in Ready state, check the partitioneddata folder in the adfgetstarted container in your blob storage for the output data.
- Click the slice to see details about it in a Data slice blade.
- Click an activity run in the Activity runs list to see details about an activity run (Hive activity in our scenario) in an Activity run details window.
From the log files, you can see the Hive query that was executed and status information. These logs are useful for troubleshooting any issues. See the Monitor and manage pipelines using Azure portal blades article for more details.
[!IMPORTANT] The input file gets deleted when the slice is processed successfully. Therefore, if you want to rerun the slice or do the tutorial again, upload the input file (input.log) to the inputdata folder of the adfgetstarted container.
You can also use the Monitor & Manage application to monitor your pipelines. For detailed information about using this application, see Monitor and manage Azure Data Factory pipelines using Monitoring and Management App.
- Click the Monitor & Manage tile on the home page for your data factory.
- You should see the Monitor & Manage application. Change the Start time and End time to match the start time (04-01-2016 12:00 AM) and end time (04-02-2016 12:00 AM) of your pipeline, and click Apply.
- Select an activity window in the Activity Windows list to see details about it.
In this tutorial, you created an Azure data factory to process data by running a Hive script on an HDInsight Hadoop cluster. You used the Data Factory Editor in the Azure portal to do the following steps:
- Created an Azure data factory.
- Created two linked services:
  - Azure Storage linked service to link your Azure blob storage that holds input/output files to the data factory.
  - Azure HDInsight on-demand linked service to link an on-demand HDInsight Hadoop cluster to the data factory. Azure Data Factory creates an HDInsight Hadoop cluster just-in-time to process input data and produce output data.
- Created two datasets, which describe input and output data for the HDInsight Hive activity in the pipeline.
- Created a pipeline with an HDInsight Hive activity.
In this article, you have created a pipeline with a transformation activity (HDInsight Activity) that runs a Hive script on an on-demand HDInsight cluster. To see how to use a Copy Activity to copy data from an Azure Blob to Azure SQL, see Tutorial: Copy data from an Azure blob to Azure SQL.
Topic | Description |
---|---|
Data Transformation Activities | This article provides a list of data transformation activities (such as HDInsight Hive transformation you used in this tutorial) supported by Azure Data Factory. |
Scheduling and execution | This article explains the scheduling and execution aspects of Azure Data Factory application model. |
Pipelines | This article helps you understand pipelines and activities in Azure Data Factory and how to use them to construct end-to-end data-driven workflows for your scenario or business. |
Datasets | This article helps you understand datasets in Azure Data Factory. |
Monitor and manage pipelines using Monitoring App | This article describes how to monitor, manage, and debug pipelines using the Monitoring & Management App. |