---
title: Move data from an on-premise SQL Server to SQL Azure with Azure Data Factory | Microsoft Docs
description: Set up an ADF pipeline that composes two data migration activities that together move data on a daily basis between databases on-premise and in the cloud.
services: machine-learning
documentationcenter: ''
author: bradsev
manager: jhubbard
editor: cgronlun

ms.assetid: 36837c83-2015-48be-b850-c4346aa5ae8a
ms.service: machine-learning
ms.workload: data-services
ms.tgt_pltfrm: na
ms.devlang: na
ms.topic: article
ms.date: 12/16/2016
ms.author: bradsev
---

Move data from an on-premise SQL Server to SQL Azure with Azure Data Factory

This topic shows how to move data from an on-premise SQL Server database to an Azure SQL Database via Azure Blob Storage using Azure Data Factory (ADF).

The following menu links to topics that describe how to ingest data into target environments where the data can be stored and processed during the Team Data Science Process.

[!INCLUDE cap-ingest-data-selector]

Introduction: What is ADF and when should it be used to migrate data?

Azure Data Factory is a fully managed cloud-based data integration service that orchestrates and automates the movement and transformation of data. The key concept in the ADF model is the pipeline. A pipeline is a logical grouping of activities, each of which defines the actions to perform on the data contained in datasets. Linked services define the information needed for Data Factory to connect to the data resources.

With ADF, existing data processing services can be composed into data pipelines that are highly available and managed in the cloud. These data pipelines can be scheduled to ingest, prepare, transform, analyze, and publish data, and ADF manages and orchestrates the complex data and processing dependencies. Solutions can be quickly built and deployed in the cloud, connecting a growing number of on-premises and cloud data sources.

Consider using ADF:

  • when data needs to be migrated continually in a hybrid scenario that accesses both on-premise and cloud resources
  • when the data needs to be transformed, modified, or have business logic applied to it as it is migrated.

ADF allows for the scheduling and monitoring of jobs using simple JSON scripts that manage the movement of data on a periodic basis. ADF also has other capabilities such as support for complex operations. For more information on ADF, see the documentation at Azure Data Factory (ADF).

The Scenario

We set up an ADF pipeline that composes two data migration activities. Together they move data on a daily basis between an on-premise SQL database and an Azure SQL Database in the cloud. The two activities are:

  • copy data from an on-premise SQL Server database to an Azure Blob Storage account
  • copy data from the Azure Blob Storage account to an Azure SQL Database.

Note

The steps shown here have been adapted from the more detailed tutorial provided by the ADF team: Move data between on-premises sources and cloud with Data Management Gateway. References to the relevant sections of that topic are provided when appropriate.

Prerequisites

This tutorial assumes you have:

  • an Azure subscription and an Azure Storage account
  • an Azure SQL Database to serve as the migration target
  • an on-premise SQL Server that holds the data to be migrated

Note

This procedure uses the Azure portal.

Upload the data to your on-premise SQL Server

We use the NYC Taxi dataset to demonstrate the migration process. The NYC Taxi dataset is available on Azure Blob Storage: NYC Taxi Data. The data has two files: the trip_data.csv file, which contains trip details, and the trip_fare.csv file, which contains details of the fare paid for each trip. A sample and description of these files are provided in NYC Taxi Trips Dataset Description.

You can either adapt the procedure provided here to a set of your own data or follow the steps as described by using the NYC Taxi dataset. To upload the NYC Taxi dataset into your on-premise SQL Server database, follow the procedure outlined in Bulk Import Data into SQL Server Database. These instructions are for a SQL Server on an Azure Virtual Machine, but the procedure for uploading to the on-premise SQL Server is the same.
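For reference, one possible command-line version of the bulk import is sketched below. It assumes the bcp utility installed with SQL Server, an existing database and table (the database name nyctaxidb, the file path, and the server\instance name are assumptions), and a CSV file with a header row; substitute your own names and paths.

    # Illustrative only: bulk-load trip_data.csv into the on-premise table nyctaxi_data with bcp.
    # -T = Windows authentication, -c = character data, -t "," = comma delimiter, -F 2 = skip the header row.
    bcp nyctaxidb.dbo.nyctaxi_data in C:\temp\trip_data.csv -S "servername\instancename" -T -c -t "," -F 2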

Create an Azure Data Factory

The instructions for creating a new Azure Data Factory and a resource group in the Azure portal are provided in Create an Azure Data Factory. Name the new ADF instance adfdsp and name the resource group adfdsprg.
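If you prefer to script this step, a minimal Azure PowerShell sketch follows. It assumes the Azure PowerShell module is installed and you are signed in; the location is an assumption and should be a region where Azure Data Factory is available.

    # Illustrative sketch: create the resource group and the data factory from Azure PowerShell.
    # Older versions of the Azure PowerShell module may first require: Switch-AzureMode AzureResourceManager
    New-AzureResourceGroup -Name adfdsprg -Location "West US"
    New-AzureDataFactory -ResourceGroupName adfdsprg -Name adfdsp -Location "West US"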

Install and configure the Data Management Gateway

To enable your pipelines in an Azure data factory to work with an on-premise SQL Server, you need to add it as a Linked Service to the data factory. To create a Linked Service for an on-premise SQL Server, you must:

  • download and install Microsoft Data Management Gateway onto the on-premise computer.
  • configure the linked service for the on-premises data source to use the gateway.

The Data Management Gateway serializes and deserializes the source and sink data on the computer where it is hosted.

For set-up instructions and details on the Data Management Gateway, see Move data between on-premises sources and cloud with Data Management Gateway.
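A gateway can also be registered with the data factory from Azure PowerShell before you run the installer on the on-premise machine. The sketch below is illustrative only; the gateway name adfdspgateway is an assumption, and the registration key returned is what you enter in the Data Management Gateway configuration manager after installation.

    # Illustrative only: register a gateway (name adfdspgateway is an assumption) with the data factory.
    # The cmdlet output includes the key used to register the installed gateway on the on-premise machine.
    New-AzureDataFactoryGateway -ResourceGroupName adfdsprg -DataFactoryName adfdsp -Name adfdspgateway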

Create linked services to connect to the data resources

A linked service defines the information needed for Azure Data Factory to connect to a data resource. The step-by-step procedure for creating linked services is provided in Create linked services.

We have three resources in this scenario for which linked services are needed.

  1. Linked service for on-premise SQL Server
  2. Linked service for Azure Blob Storage
  3. Linked service for Azure SQL database

Linked service for on-premise SQL Server database

To create the linked service for the on-premise SQL Server:

  • click the Data Store in the ADF landing page on Azure Classic Portal
  • select SQL and enter the username and password credentials for the on-premise SQL Server. Specify the server as the fully qualified server name followed by the instance name (servername\instancename). Name the linked service adfonpremsql, as sketched below.
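For reference, a rough JSON equivalent of this linked service, in the same early ADF schema used by the table definitions later in this topic, might look like the following sketch; the connection string values and gateway name are placeholders, not values from this walkthrough.

    {
        "name": "adfonpremsql",
        "properties":
        {
            "type": "OnPremisesSqlLinkedService",
            "connectionString": "Data Source=servername\\instancename;Initial Catalog=databasename;Integrated Security=False;User ID=username;Password=password;",
            "gatewayName": "yourgatewayname"
        }
    }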

Linked service for Blob

To create the linked service for the Azure Blob Storage account:

  • click the Data Store in the ADF landing page on Azure Classic Portal
  • select Azure Storage Account
  • enter the Azure Storage account name and account key (the container that receives the copied data is specified later in the blob table definition). Name the linked service adfds, as sketched below.
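For reference, a rough JSON equivalent of this linked service in the same early ADF schema might look like the sketch below; the account name and key are placeholders.

    {
        "name": "adfds",
        "properties":
        {
            "type": "AzureStorageLinkedService",
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=youraccountname;AccountKey=youraccountkey"
        }
    }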

Linked service for Azure SQL database

To create the linked service for the Azure SQL Database:

  • click the Data Store in the ADF landing page on Azure Classic Portal
  • select Azure SQL and enter the username and password credentials for the Azure SQL Database. The username must be specified as user@servername.
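For reference, a rough JSON equivalent in the same early ADF schema might look like the following sketch; the server, database, and credential values are placeholders, and the linked service name matches the one referenced in the SQL Azure table definition later in this topic.

    {
        "name": "adfdssqlazure_linked_servicename",
        "properties":
        {
            "type": "AzureSqlLinkedService",
            "connectionString": "Server=tcp:yourserver.database.windows.net,1433;Database=yourdatabase;User ID=user@yourserver;Password=yourpassword;Encrypt=True;"
        }
    }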

Define and create tables to specify how to access the datasets

Create tables that specify the structure, location, and availability of the datasets with the following script-based procedures. JSON files are used to define the tables. For more information on the structure of these files, see Datasets.

Note

You should execute the Add-AzureAccount cmdlet before executing the New-AzureDataFactoryTable cmdlet to confirm that the right Azure subscription is selected for the command execution. For documentation of this cmdlet, see Add-AzureAccount.
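For example, a minimal sign-in sequence might look like the following; the subscription name is a placeholder.

    # Sign in and select the subscription that contains the data factory before creating ADF objects.
    Add-AzureAccount
    Select-AzureSubscription -SubscriptionName "your subscription name"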

The JSON-based definitions in the tables use the following names:

  • the table name in the on-premise SQL server is nyctaxi_data
  • the container name in the Azure Blob Storage account is containername

Three table definitions are needed for this ADF pipeline:

  1. SQL on-premise Table
  2. Blob Table
  3. SQL Azure Table

Note

These procedures use Azure PowerShell to define and create the ADF activities. But these tasks can also be accomplished using the Azure portal. For details, see Create datasets.

SQL on-premise Table

The table definition for the on-premise SQL Server is specified in the following JSON file:

    {
        "name": "OnPremSQLTable",
        "properties":
        {
            "location":
            {
                "type": "OnPremisesSqlServerTableLocation",
                "tableName": "nyctaxi_data",
                "linkedServiceName": "adfonpremsql"
            },
            "availability":
            {
                "frequency": "Day",
                "interval": 1,
                "waitOnExternal":
                {
                    "retryInterval": "00:01:00",
                    "retryTimeout": "00:10:00",
                    "maximumRetry": 3
                }
            }
        }
    }

Column names are not included in this definition. You can sub-select specific columns by including them in the table definition (for details, see the ADF documentation on datasets); an illustrative snippet follows the cmdlet below.

Copy the JSON definition of the table into a file called onpremtabledef.json and save it to a known location (here assumed to be C:\temp\onpremtabledef.json). Create the table in ADF with the following Azure PowerShell cmdlet:

New-AzureDataFactoryTable -ResourceGroupName ADFdsprg -DataFactoryName ADFdsp -File C:\temp\onpremtabledef.json
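For illustration only, sub-selecting columns could look like the following structure section added to the properties of OnPremSQLTable; the column names shown are taken from the NYC Taxi trip data, and the types are assumptions.

    "structure":
    [
        { "name": "medallion", "type": "String" },
        { "name": "trip_distance", "type": "String" }
    ]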

Blob Table

The definition of the table for the output blob location follows (this maps the data ingested from on-premise to the Azure blob):

    {
        "name": "OutputBlobTable",
        "properties":
        {
            "location":
            {
                "type": "AzureBlobLocation",
                "folderPath": "containername",
                "format":
                {
                    "type": "TextFormat",
                    "columnDelimiter": "\t"
                },
                "linkedServiceName": "adfds"
            },
            "availability":
            {
                "frequency": "Day",
                "interval": 1
            }
        }
    }

Copy the JSON definition of the table into a file called bloboutputtabledef.json and save it to a known location (here assumed to be C:\temp\bloboutputtabledef.json). Create the table in ADF with the following Azure PowerShell cmdlet:

New-AzureDataFactoryTable -ResourceGroupName adfdsprg -DataFactoryName adfdsp -File C:\temp\bloboutputtabledef.json  

SQL Azure Table

The definition of the table for the Azure SQL Database output follows (this schema maps the data coming from the blob):

    {
        "name": "OutputSQLAzureTable",
        "properties":
        {
            "structure":
            [
                { "name": "column1", "type": "String" },
                { "name": "column2", "type": "String" }
            ],
            "location":
            {
                "type": "AzureSqlTableLocation",
                "tableName": "your_db_name",
                "linkedServiceName": "adfdssqlazure_linked_servicename"
            },
            "availability":
            {
                "frequency": "Day",
                "interval": 1
            }
        }
    }

Copy the JSON definition of the table into a file called AzureSqlTable.json and save it to a known location (here assumed to be C:\temp\AzureSqlTable.json). Create the table in ADF with the following Azure PowerShell cmdlet:

New-AzureDataFactoryTable -ResourceGroupName adfdsprg -DataFactoryName adfdsp -File C:\temp\AzureSqlTable.json  

Define and create the pipeline

Specify the activities that belong to the pipeline and create the pipeline with the following script-based procedures. A JSON file is used to define the pipeline properties.

  • The script assumes that the pipeline name is AMLDSProcessPipeline.
  • Also note that we set the periodicity of the pipeline to execute on a daily basis and use the default execution time for the job (12 am UTC).

Note

The following procedures use Azure PowerShell to define and create the ADF pipeline. But this task can also be accomplished using the Azure portal. For details, see Create pipeline.

Using the table definitions provided previously, the pipeline definition for the ADF is specified as follows:

    {
        "name": "AMLDSProcessPipeline",
        "properties":
        {
            "description" : "This pipeline has one Copy activity that copies data from an on-premise SQL to Azure blob",
             "activities":
            [
                {
                    "name": "CopyFromSQLtoBlob",
                    "description": "Copy data from on-premise SQL server to blob",     
                    "type": "CopyActivity",
                    "inputs": [ {"name": "OnPremSQLTable"} ],
                    "outputs": [ {"name": "OutputBlobTable"} ],
                    "transformation":
                    {
                        "source":
                        {                               
                            "type": "SqlSource",
                            "sqlReaderQuery": "select * from nyctaxi_data"
                        },
                        "sink":
                        {
                            "type": "BlobSink"
                        }   
                    },
                    "Policy":
                    {
                        "concurrency": 3,
                        "executionPriorityOrder": "NewestFirst",
                        "style": "StartOfInterval",
                        "retry": 0,
                        "timeout": "01:00:00"
                    }       

                 },

                {
                    "name": "CopyFromBlobtoSQLAzure",
                    "description": "Push data to Sql Azure",        
                    "type": "CopyActivity",
                    "inputs": [ {"name": "OutputBlobTable"} ],
                    "outputs": [ {"name": "OutputSQLAzureTable"} ],
                    "transformation":
                    {
                        "source":
                        {                               
                            "type": "BlobSource"
                        },
                        "sink":
                        {
                            "type": "SqlSink",
                            "WriteBatchTimeout": "00:5:00",                
                        }            
                    },
                    "Policy":
                    {
                        "concurrency": 3,
                        "executionPriorityOrder": "NewestFirst",
                        "style": "StartOfInterval",
                        "retry": 2,
                        "timeout": "02:00:00"
                    }
                 }
            ]
        }
    }

Copy this JSON definition of the pipeline into a file called pipelinedef.json and save it to a known location (here assumed to be C:\temp\pipelinedef.json). Create the pipeline in ADF with the following Azure PowerShell cmdlet:

New-AzureDataFactoryPipeline  -ResourceGroupName adfdsprg -DataFactoryName adfdsp -File C:\temp\pipelinedef.json

Confirm that the pipeline is visible in your data factory in the Azure Classic Portal (click Diagram to view it):

(Figure: ADF pipeline diagram)

Start the Pipeline

The pipeline can now be run using the following command:

Set-AzureDataFactoryPipelineActivePeriod -ResourceGroupName ADFdsprg -DataFactoryName ADFdsp -StartDateTime startdateZ -EndDateTime enddateZ -Name AMLDSProcessPipeline

The startdate and enddate parameter values need to be replaced with the actual dates between which you want the pipeline to run.
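For example, to run the pipeline for three days (the dates below are purely illustrative), the call might look like this:

    Set-AzureDataFactoryPipelineActivePeriod -ResourceGroupName adfdsprg -DataFactoryName adfdsp -StartDateTime 2016-12-13Z -EndDateTime 2016-12-16Z -Name AMLDSProcessPipeline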

Once the pipeline executes, you should be able to see the data show up in the container selected for the blob, one file per day.
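As a quick check, you can list the blobs written to the container with the Azure Storage cmdlets; the storage account name and key below are placeholders.

    # Illustrative check: list the blobs produced by the first copy activity.
    $ctx = New-AzureStorageContext -StorageAccountName "youraccountname" -StorageAccountKey "youraccountkey"
    Get-AzureStorageBlob -Container "containername" -Context $ctx | Select-Object Name, LastModified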

Note that we have not used the functionality provided by ADF to copy data incrementally. For more information on how to do this and on other capabilities provided by ADF, see the ADF documentation.