---
title: 'Tutorial: Use REST API to create an Azure Data Factory pipeline | Microsoft Docs'
description: In this tutorial, you use REST API to create an Azure Data Factory pipeline with a Copy Activity to copy data from an Azure blob storage to an Azure SQL database.
services: data-factory
documentationcenter: ''
author: linda33wj
manager: ''
editor: ''
ms.assetid: 1704cdf8-30ad-49bc-a71c-4057e26e7350
ms.service: data-factory
ms.workload: data-services
ms.tgt_pltfrm: na
ms.devlang: na
ms.topic: tutorial
ms.date: 01/22/2018
ms.author: jingwang
robots: noindex
---

Tutorial: Use REST API to create an Azure Data Factory pipeline to copy data


Note

This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see copy activity tutorial.

In this article, you learn how to use REST API to create a data factory with a pipeline that copies data from an Azure blob storage to an Azure SQL database. If you are new to Azure Data Factory, read through the Introduction to Azure Data Factory article before doing this tutorial.

In this tutorial, you create a pipeline with one activity in it: Copy Activity. The copy activity copies data from a supported source data store to a supported sink data store. For a list of data stores supported as sources and sinks, see supported data stores. The activity is powered by a globally available service that can copy data between various data stores in a secure, reliable, and scalable way. For more information about the Copy Activity, see Data Movement Activities.

A pipeline can have more than one activity. And, you can chain two activities (run one activity after another) by setting the output dataset of one activity as the input dataset of the other activity. For more information, see multiple activities in a pipeline.

Note

This article does not cover all the Data Factory REST API. See Data Factory REST API Reference for comprehensive documentation on the Data Factory REST API.

The data pipeline in this tutorial copies data from a source data store to a destination data store. For a tutorial on how to transform data using Azure Data Factory, see Tutorial: Build a pipeline to transform data using Hadoop cluster.

Prerequisites

  • Go through Tutorial Overview and complete the prerequisite steps.

  • Install Curl on your machine. You use the Curl tool with REST commands to create a data factory.

  • Follow instructions from this article to:

    1. Create a Web application named ADFCopyTutorialApp in Azure Active Directory.
    2. Get client ID and secret key.
    3. Get tenant ID.
    4. Assign the ADFCopyTutorialApp application to the Data Factory Contributor role.
  • Install Azure PowerShell.

  • Launch PowerShell and do the following steps. Keep Azure PowerShell open until the end of this tutorial. If you close and reopen, you need to run the commands again.

    1. Run the following command and enter the user name and password that you use to sign in to the Azure portal:
     Connect-AzureRmAccount
    2. Run the following command to view all the subscriptions for this account:
     Get-AzureRmSubscription
    3. Run the following command to select the subscription that you want to work with. Replace <NameOfAzureSubscription> with the name of your Azure subscription.
     Get-AzureRmSubscription -SubscriptionName <NameOfAzureSubscription> | Set-AzureRmContext
    4. Create an Azure resource group named ADFTutorialResourceGroup by running the following command in PowerShell:
       New-AzureRmResourceGroup -Name ADFTutorialResourceGroup -Location "West US"
    If the resource group already exists, you specify whether to update it (Y) or keep it as it is (N).
    
    Some of the steps in this tutorial assume that you use the resource group named ADFTutorialResourceGroup. If you use a different resource group, you need to use the name of your resource group in place of ADFTutorialResourceGroup in this tutorial.
    

Create JSON definitions

Create the following JSON files in the folder where curl.exe is located.

datafactory.json

Important

The data factory name must be globally unique, so you may want to prefix or suffix ADFCopyTutorialDF to make it a unique name.

{  
    "name": "ADFCopyTutorialDF",  
    "location": "WestUS"
}  

azurestoragelinkedservice.json

Important

Replace accountname and accountkey with the name and key of your Azure storage account. To learn how to get your storage access key, see View, copy and regenerate storage access keys.

{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
        }
    }
}

For details about JSON properties, see Azure Storage linked service.

azuresqllinkedservice.json

Important

Replace servername, databasename, username, and password with the name of your Azure SQL server, the name of your SQL database, the user account, and the password for the account.

{
    "name": "AzureSqlLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "description": "",
        "typeProperties": {
            "connectionString": "Data Source=tcp:<servername>.database.windows.net,1433;Initial Catalog=<databasename>;User ID=<username>;Password=<password>;Integrated Security=False;Encrypt=True;Connect Timeout=30"
        }
    }
}

For details about JSON properties, see Azure SQL linked service.

inputdataset.json

{
  "name": "AzureBlobInput",
  "properties": {
    "structure": [
      {
        "name": "FirstName",
        "type": "String"
      },
      {
        "name": "LastName",
        "type": "String"
      }
    ],
    "type": "AzureBlob",
    "linkedServiceName": "AzureStorageLinkedService",
    "typeProperties": {
      "folderPath": "adftutorial/",
      "fileName": "emp.txt",
      "format": {
        "type": "TextFormat",
        "columnDelimiter": ","
      }
    },
    "external": true,
    "availability": {
      "frequency": "Hour",
      "interval": 1
    }
  }
}

The following table provides descriptions for the JSON properties used in the snippet:

| Property | Description |
| --- | --- |
| type | The type property is set to AzureBlob because data resides in an Azure blob storage. |
| linkedServiceName | Refers to the AzureStorageLinkedService that you created earlier. |
| folderPath | Specifies the blob container and the folder that contains input blobs. In this tutorial, adftutorial is the blob container and folder is the root folder. |
| fileName | This property is optional. If you omit this property, all files from the folderPath are picked. In this tutorial, emp.txt is specified for the fileName, so only that file is picked up for processing. |
| format -> type | The input file is in the text format, so we use TextFormat. |
| columnDelimiter | The columns in the input file are delimited by the comma character (,). |
| frequency/interval | The frequency is set to Hour and interval is set to 1, which means that the input slices are available hourly. In other words, the Data Factory service looks for input data every hour in the root folder of the blob container (adftutorial) you specified. It looks for the data within the pipeline start and end times, not before or after these times. |
| external | This property is set to true if the data is not generated by this pipeline. The input data in this tutorial is in the emp.txt file, which is not generated by this pipeline, so we set this property to true. |
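If you haven't already uploaded the sample emp.txt file as part of the prerequisites, the following is a minimal sketch that uploads a local emp.txt to the adftutorial container using the AzureRM-era Azure.Storage cmdlets. The account name, account key, and local file path placeholders are assumptions you replace with your own values.

# Optional: upload the sample emp.txt file to the adftutorial container.
# Replace <accountname> and <accountkey> with your storage account values.
$ctx = New-AzureStorageContext -StorageAccountName "<accountname>" -StorageAccountKey "<accountkey>"
Set-AzureStorageBlobContent -File ".\emp.txt" -Container "adftutorial" -Blob "emp.txt" -Context $ctx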

For more information about these JSON properties, see Azure Blob connector article.

outputdataset.json

{
  "name": "AzureSqlOutput",
  "properties": {
    "structure": [
      {
        "name": "FirstName",
        "type": "String"
      },
      {
        "name": "LastName",
        "type": "String"
      }
    ],
    "type": "AzureSqlTable",
    "linkedServiceName": "AzureSqlLinkedService",
    "typeProperties": {
      "tableName": "emp"
    },
    "availability": {
      "frequency": "Hour",
      "interval": 1
    }
  }
}

The following table provides descriptions for the JSON properties used in the snippet:

| Property | Description |
| --- | --- |
| type | The type property is set to AzureSqlTable because data is copied to a table in an Azure SQL database. |
| linkedServiceName | Refers to the AzureSqlLinkedService that you created earlier. |
| tableName | Specifies the table to which the data is copied. |
| frequency/interval | The frequency is set to Hour and interval is 1, which means that the output slices are produced hourly between the pipeline start and end times, not before or after these times. |

There are three columns – ID, FirstName, and LastName – in the emp table in the database. ID is an identity column, so you need to specify only FirstName and LastName here.

For more information about these JSON properties, see Azure SQL connector article.

pipeline.json

{
  "name": "ADFTutorialPipeline",
  "properties": {
    "description": "Copy data from a blob to Azure SQL table",
    "activities": [
      {
        "name": "CopyFromBlobToSQL",
        "description": "Push Regional Effectiveness Campaign data to Azure SQL database",
        "type": "Copy",
        "inputs": [
          {
            "name": "AzureBlobInput"
          }
        ],
        "outputs": [
          {
            "name": "AzureSqlOutput"
          }
        ],
        "typeProperties": {
          "source": {
            "type": "BlobSource"
          },
          "sink": {
            "type": "SqlSink",
            "writeBatchSize": 10000,
            "writeBatchTimeout": "60:00:00"
          }
        },
        "Policy": {
          "concurrency": 1,
          "executionPriorityOrder": "NewestFirst",
          "retry": 0,
          "timeout": "01:00:00"
        }
      }
    ],
    "start": "2017-05-11T00:00:00Z",
    "end": "2017-05-12T00:00:00Z"
  }
}

Note the following points:

  • In the activities section, there is only one activity whose type is set to Copy. For more information about the copy activity, see data movement activities. In Data Factory solutions, you can also use data transformation activities.
  • Input for the activity is set to AzureBlobInput and output for the activity is set to AzureSqlOutput.
  • In the typeProperties section, BlobSource is specified as the source type and SqlSink is specified as the sink type. For a complete list of data stores supported by the copy activity as sources and sinks, see supported data stores. To learn how to use a specific supported data store as a source/sink, click the link in the table.

Replace the value of the start property with the current day and the value of the end property with the next day. You can specify only the date part and skip the time part of the date time. For example, "2017-02-03", which is equivalent to "2017-02-03T00:00:00Z".

Both start and end datetimes must be in ISO format. For example: 2016-10-14T16:32:41Z. The end time is optional, but we use it in this tutorial.

If you do not specify value for the end property, it is calculated as "start + 48 hours". To run the pipeline indefinitely, specify 9999-09-09 as the value for the end property.

In the preceding example, there are 24 data slices as each data slice is produced hourly.
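If you want to generate these values rather than type them, here is a minimal convenience sketch that prints today's date (UTC) and tomorrow's date in the expected ISO 8601 format so that you can paste them into pipeline.json:

# Print ISO 8601 (UTC) values for the start and end properties of pipeline.json.
$today = (Get-Date).ToUniversalTime().Date
$start = $today.ToString("s") + "Z"
$end   = $today.AddDays(1).ToString("s") + "Z"
Write-Host "start: $start"
Write-Host "end:   $end"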

For descriptions of JSON properties in a pipeline definition, see create pipelines article. For descriptions of JSON properties in a copy activity definition, see data movement activities. For descriptions of JSON properties supported by BlobSource, see Azure Blob connector article. For descriptions of JSON properties supported by SqlSink, see Azure SQL Database connector article.

Set global variables

In Azure PowerShell, execute the following commands after replacing the values with your own:

Important

See Prerequisites section for instructions on getting client ID, client secret, tenant ID, and subscription ID.

$client_id = "<client ID of application in AAD>"
$client_secret = "<client key of application in AAD>"
$tenant = "<Azure tenant ID>";
$subscription_id="<Azure subscription ID>";

$rg = "ADFTutorialResourceGroup"

Run the following command after updating the name of the data factory you are using:

$adf = "ADFCopyTutorialDF"

Authenticate with AAD

Run the following command to authenticate with Azure Active Directory (AAD):

$cmd = { .\curl.exe -X POST https://login.microsoftonline.com/$tenant/oauth2/token  -F grant_type=client_credentials  -F resource=https://management.core.windows.net/ -F client_id=$client_id -F client_secret=$client_secret };
$responseToken = Invoke-Command -scriptblock $cmd;
$accessToken = (ConvertFrom-Json $responseToken).access_token;

(ConvertFrom-Json $responseToken) 
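Optionally, add a quick check that a token was actually returned before continuing. This is a minimal sketch; a missing token usually means an incorrect client ID, client secret, or tenant ID.

# Optional: stop early if authentication did not return an access token.
if ([string]::IsNullOrEmpty($accessToken)) {
    throw "Authentication failed. Verify the client ID, client secret, and tenant ID."
}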

Create data factory

In this step, you create an Azure Data Factory named ADFCopyTutorialDF. A data factory can have one or more pipelines. A pipeline can have one or more activities in it. For example, a Copy Activity to copy data from a source to a destination data store, or an HDInsight Hive activity to run a Hive script to transform input data to produce output data. Run the following commands to create the data factory:

  1. Assign the command to variable named cmd.

    [!IMPORTANT] Confirm that the name of the data factory you specify here (ADFCopyTutorialDF) matches the name specified in the datafactory.json.

    $cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data @datafactory.json https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/ADFCopyTutorialDF?api-version=2015-10-01};
  2. Run the command by using Invoke-Command.

    $results = Invoke-Command -scriptblock $cmd;
  3. View the results. If the data factory has been successfully created, you see the JSON for the data factory in the results; otherwise, you see an error message.

    Write-Host $results
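    Optionally, instead of reading the raw JSON, you can parse the response and inspect the provisioning state. This is a hedged sketch that assumes the usual Azure Resource Manager response shape; the same pattern works for the linked service, dataset, and pipeline calls later in this tutorial.

    # Parse the response. A successful call returns the data factory JSON;
    # an error response carries an "error" object instead.
    $response = ConvertFrom-Json $results
    if ($response.error) {
        Write-Host "Error:" $response.error.message
    } else {
        Write-Host "Provisioning state:" $response.properties.provisioningState
    }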
    

Note the following points:

  • The name of the Azure Data Factory must be globally unique. If you see the error in results: Data factory name “ADFCopyTutorialDF” is not available, do the following steps:

    1. Change the name (for example, yournameADFCopyTutorialDF) in the datafactory.json file.

    2. In the first command where the $cmd variable is assigned a value, replace ADFCopyTutorialDF with the new name and run the command.

    3. Run the next two commands to invoke the REST API to create the data factory and print the results of the operation.

      See Data Factory - Naming Rules topic for naming rules for Data Factory artifacts.

  • To create Data Factory instances, you need to be a contributor or administrator of the Azure subscription.

  • The name of the data factory may be registered as a DNS name in the future and hence become publicly visible.

  • If you receive the error: "This subscription is not registered to use namespace Microsoft.DataFactory", do one of the following and try publishing again:

    • In Azure PowerShell, run the following command to register the Data Factory provider:

      Register-AzureRmResourceProvider -ProviderNamespace Microsoft.DataFactory

      You can run the following command to confirm that the Data Factory provider is registered.

      Get-AzureRmResourceProvider
    • Sign in to the Azure portal with the Azure subscription and navigate to a Data Factory blade, or create a data factory in the Azure portal. This action automatically registers the provider for you.

Before creating a pipeline, you need to create a few Data Factory entities first. You first create linked services to link the source and destination data stores to your data factory. Then, you define input and output datasets to represent data in the linked data stores. Finally, you create the pipeline with an activity that uses these datasets.

Create linked services

You create linked services in a data factory to link your data stores and compute services to the data factory. In this tutorial, you don't use any compute service such as Azure HDInsight or Azure Data Lake Analytics. You use two data stores of type Azure Storage (source) and Azure SQL Database (destination). Therefore, you create two linked services named AzureStorageLinkedService and AzureSqlLinkedService of types: AzureStorage and AzureSqlDatabase.

The AzureStorageLinkedService links your Azure storage account to the data factory. This storage account is the one in which you created a container and uploaded the data as part of prerequisites.

AzureSqlLinkedService links your Azure SQL database to the data factory. The data that is copied from the blob storage is stored in this database. You created the emp table in this database as part of prerequisites.

Create Azure Storage linked service

In this step, you link your Azure storage account to your data factory. You specify the name and key of your Azure storage account in this section. See Azure Storage linked service for details about JSON properties used to define an Azure Storage linked service.

  1. Assign the command to variable named cmd.

    $cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@azurestoragelinkedservice.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/linkedservices/AzureStorageLinkedService?api-version=2015-10-01};
  2. Run the command by using Invoke-Command.

    $results = Invoke-Command -scriptblock $cmd;
  3. View the results. If the linked service has been successfully created, you see the JSON for the linked service in the results; otherwise, you see an error message.

    Write-Host $results

Create Azure SQL linked service

In this step, you link your Azure SQL database to your data factory. You specify the Azure SQL server name, database name, user name, and user password in this section. See Azure SQL linked service for details about JSON properties used to define an Azure SQL linked service.

  1. Assign the command to variable named cmd.

    $cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data @azuresqllinkedservice.json https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/linkedservices/AzureSqlLinkedService?api-version=2015-10-01};
  2. Run the command by using Invoke-Command.

    $results = Invoke-Command -scriptblock $cmd;
  3. View the results. If the linked service has been successfully created, you see the JSON for the linked service in the results; otherwise, you see an error message.

    Write-Host $results
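    If the call succeeds but later copy runs fail to connect to the database, a quick network check against the server can help. This is an optional sketch; replace the server name with yours, and note that it only confirms that the SQL endpoint is reachable on port 1433, not that the server firewall allows your client or the Data Factory service.

    # Optional: confirm that the Azure SQL server endpoint is reachable on port 1433.
    Test-NetConnection -ComputerName "<servername>.database.windows.net" -Port 1433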

Create datasets

In the previous step, you created linked services to link your Azure Storage account and Azure SQL database to your data factory. In this step, you define two datasets named AzureBlobInput and AzureSqlOutput that represent the input and output data stored in the data stores referred to by AzureStorageLinkedService and AzureSqlLinkedService, respectively.

The Azure Storage linked service specifies the connection string that the Data Factory service uses at run time to connect to your Azure storage account. And, the input blob dataset (AzureBlobInput) specifies the container and the folder that contains the input data.

Similarly, the Azure SQL Database linked service specifies the connection string that the Data Factory service uses at run time to connect to your Azure SQL database. And, the output SQL table dataset (AzureSqlOutput) specifies the table in the database to which the data from the blob storage is copied.

Create input dataset

In this step, you create a dataset named AzureBlobInput that points to a blob file (emp.txt) in the root folder of a blob container (adftutorial) in the Azure Storage represented by the AzureStorageLinkedService linked service. If you don't specify a value for the fileName (or skip it), data from all blobs in the input folder are copied to the destination. In this tutorial, you specify a value for the fileName.

  1. Assign the command to variable named cmd.

    $cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@inputdataset.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/datasets/AzureBlobInput?api-version=2015-10-01};
  2. Run the command by using Invoke-Command.

    $results = Invoke-Command -scriptblock $cmd;
  3. View the results. If the dataset has been successfully created, you see the JSON for the dataset in the results; otherwise, you see an error message.

    Write-Host $results

Create output dataset

The Azure SQL Database linked service specifies the connection string that the Data Factory service uses at run time to connect to your Azure SQL database. The output SQL table dataset (AzureSqlOutput) you create in this step specifies the table in the database to which the data from the blob storage is copied.

  1. Assign the command to variable named cmd.

    $cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@outputdataset.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/datasets/AzureSqlOutput?api-version=2015-10-01};
  2. Run the command by using Invoke-Command.

    $results = Invoke-Command -scriptblock $cmd;
  3. View the results. If the dataset has been successfully created, you see the JSON for the dataset in the results; otherwise, you see an error message.

    Write-Host $results

Create pipeline

In this step, you create a pipeline with a copy activity that uses AzureBlobInput as an input and AzureSqlOutput as an output.

Currently, output dataset is what drives the schedule. In this tutorial, output dataset is configured to produce a slice once an hour. The pipeline has a start time and end time that are one day apart, which is 24 hours. Therefore, 24 slices of output dataset are produced by the pipeline.
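As a quick sanity check, you can compute the expected number of hourly slices from the start and end times in pipeline.json. The timestamps below are the ones used in this tutorial; adjust them to the values you actually specified.

# One slice is produced per hour between start and end: 24 for a one-day window.
$start = [datetime]"2017-05-11T00:00:00Z"
$end   = [datetime]"2017-05-12T00:00:00Z"
(New-TimeSpan -Start $start -End $end).TotalHours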

  1. Assign the command to variable named cmd.

    $cmd = {.\curl.exe -X PUT -H "Authorization: Bearer $accessToken" -H "Content-Type: application/json" --data "@pipeline.json" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/datapipelines/ADFTutorialPipeline?api-version=2015-10-01};
  2. Run the command by using Invoke-Command.

    $results = Invoke-Command -scriptblock $cmd;
  3. View the results. If the pipeline has been successfully created, you see the JSON for the pipeline in the results; otherwise, you see an error message.

    Write-Host $results

Congratulations! You have successfully created an Azure data factory with a pipeline that copies data from Azure Blob Storage to an Azure SQL database.

Monitor pipeline

In this step, you use Data Factory REST API to monitor slices being produced by the pipeline.

$ds = "AzureSqlOutput"

Important

Make sure that the start and end times specified in the following command match the start and end times of the pipeline.

$cmd = {.\curl.exe -X GET -H "Authorization: Bearer $accessToken" https://management.azure.com/subscriptions/$subscription_id/resourcegroups/$rg/providers/Microsoft.DataFactory/datafactories/$adf/datasets/$ds/slices?start=2017-05-11T00%3a00%3a00.0000000Z"&"end=2017-05-12T00%3a00%3a00.0000000Z"&"api-version=2015-10-01};
$results2 = Invoke-Command -scriptblock $cmd;
if ((ConvertFrom-Json $results2).value -ne $NULL) {
    ConvertFrom-Json $results2 | Select-Object -Expand value | Format-Table
} else {
    (ConvertFrom-Json $results2).RemoteException
}

Run the Invoke-Command and the statements that follow it repeatedly until you see a slice in the Ready state or the Failed state. When the slice is in the Ready state, check the emp table in your Azure SQL database for the output data.
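If you prefer not to rerun the commands by hand, the following optional sketch polls roughly every 30 seconds until at least one slice reports Ready or Failed. It assumes the slice objects returned by the call above expose a state property; adjust the property name if your output shows something different.

# Optional: poll until at least one slice is in the Ready or Failed state.
do {
    Start-Sleep -Seconds 30
    $results2 = Invoke-Command -scriptblock $cmd
    $slices = (ConvertFrom-Json $results2).value
    $done = $slices | Where-Object { $_.state -eq "Ready" -or $_.state -eq "Failed" }
} while (-not $done)
$slices | Format-Table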

For each slice, two rows of data from the source file are copied to the emp table in the Azure SQL database. Therefore, you see 24 new records in the emp table when all the slices are successfully processed (in Ready state).
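If you have the SqlServer PowerShell module installed (Install-Module SqlServer), a quick check like the following confirms the row count. The server, database, and credential placeholders are assumptions you replace with your own values.

# Optional: count the rows copied into the emp table.
Invoke-Sqlcmd -ServerInstance "<servername>.database.windows.net" `
    -Database "<databasename>" -Username "<username>" -Password "<password>" `
    -Query "SELECT COUNT(*) AS RowsCopied FROM dbo.emp"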

Summary

In this tutorial, you used REST API to create an Azure data factory to copy data from an Azure blob to an Azure SQL database. Here are the high-level steps you performed in this tutorial:

  1. Created an Azure data factory.
  2. Created linked services:
    1. An Azure Storage linked service to link your Azure Storage account that holds input data.
    2. An Azure SQL linked service to link your Azure SQL database that holds the output data.
  3. Created datasets, which describe input data and output data for pipelines.
  4. Created a pipeline with a Copy Activity with BlobSource as source and SqlSink as sink.

Next steps

In this tutorial, you used Azure blob storage as a source data store and an Azure SQL database as a destination data store in a copy operation. The following table provides a list of data stores supported as sources and destinations by the copy activity:

[!INCLUDE data-factory-supported-data-stores]

To learn about how to copy data to/from a data store, click the link for the data store in the table.