---
title: Copy data to/from Azure SQL Data Warehouse | Microsoft Docs
description: Learn how to copy data to/from Azure SQL Data Warehouse using Azure Data Factory
services: data-factory
documentationcenter: ''
author: linda33wj
manager: craigg
ms.assetid: d90fa9bd-4b79-458a-8d40-e896835cfd4a
ms.service: data-factory
ms.workload: data-services
ms.tgt_pltfrm: na
ms.devlang: na
ms.topic: conceptual
ms.date: 01/10/2018
ms.author: jingwang
robots: noindex
---
Note
This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see Azure SQL Data Warehouse connector in V2.
This article explains how to use the Copy Activity in Azure Data Factory to move data to/from Azure SQL Data Warehouse. It builds on the Data Movement Activities article, which presents a general overview of data movement with the copy activity.
Tip
To achieve best performance, use PolyBase to load data into Azure SQL Data Warehouse. The Use PolyBase to load data into Azure SQL Data Warehouse section has details. For a walkthrough with a use case, see Load 1 TB into Azure SQL Data Warehouse under 15 minutes with Azure Data Factory.
You can copy data from Azure SQL Data Warehouse to the following data stores:
[!INCLUDE data-factory-supported-sinks]
You can copy data from the following data stores to Azure SQL Data Warehouse:
[!INCLUDE data-factory-supported-sources]
Tip
When copying data from SQL Server or Azure SQL Database to Azure SQL Data Warehouse, if the table does not exist in the destination store, Data Factory can automatically create the table in SQL Data Warehouse by using the schema of the table in the source data store. See Auto table creation for details.
The Azure SQL Data Warehouse connector supports basic authentication.
You can create a pipeline with a copy activity that moves data to/from an Azure SQL Data Warehouse by using different tools/APIs.
The easiest way to create a pipeline that copies data to/from Azure SQL Data Warehouse is to use the Copy data wizard. See Tutorial: Load data into SQL Data Warehouse with Data Factory for a quick walkthrough on creating a pipeline using the Copy data wizard.
You can also use the following tools to create a pipeline: Azure portal, Visual Studio, Azure PowerShell, Azure Resource Manager template, .NET API, and REST API. See Copy activity tutorial for step-by-step instructions to create a pipeline with a copy activity.
Whether you use the tools or APIs, you perform the following steps to create a pipeline that moves data from a source data store to a sink data store:
- Create a data factory. A data factory may contain one or more pipelines.
- Create linked services to link input and output data stores to your data factory. For example, if you are copying data from an Azure blob storage to an Azure SQL data warehouse, you create two linked services to link your Azure storage account and Azure SQL data warehouse to your data factory. For linked service properties that are specific to Azure SQL Data Warehouse, see linked service properties section.
- Create datasets to represent input and output data for the copy operation. In the example mentioned in the last step, you create a dataset to specify the blob container and folder that contains the input data. And, you create another dataset to specify the table in the Azure SQL data warehouse that holds the data copied from the blob storage. For dataset properties that are specific to Azure SQL Data Warehouse, see dataset properties section.
- Create a pipeline with a copy activity that takes a dataset as an input and a dataset as an output. In the example mentioned earlier, you use BlobSource as a source and SqlDWSink as a sink for the copy activity. Similarly, if you are copying from Azure SQL Data Warehouse to Azure Blob Storage, you use SqlDWSource and BlobSink in the copy activity. For copy activity properties that are specific to Azure SQL Data Warehouse, see copy activity properties section. For details on how to use a data store as a source or a sink, click the link in the previous section for your data store.
When you use the wizard, JSON definitions for these Data Factory entities (linked services, datasets, and the pipeline) are automatically created for you. When you use tools/APIs (except .NET API), you define these Data Factory entities by using the JSON format. For samples with JSON definitions for Data Factory entities that are used to copy data to/from an Azure SQL Data Warehouse, see JSON examples section of this article.
The following sections provide details about JSON properties that are used to define Data Factory entities specific to Azure SQL Data Warehouse:
The following table provides descriptions for the JSON elements that are specific to the Azure SQL Data Warehouse linked service.
Property | Description | Required |
---|---|---|
type | The type property must be set to: AzureSqlDW | Yes |
connectionString | Specify information needed to connect to the Azure SQL Data Warehouse instance for the connectionString property. Only basic authentication is supported. | Yes |
Important
Configure the Azure SQL Database firewall and the database server to allow Azure services to access the server. Additionally, if you are copying data to Azure SQL Data Warehouse from outside Azure, including from on-premises data sources with the Data Factory gateway, configure an appropriate IP address range for the machine that is sending data to Azure SQL Data Warehouse.
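For reference, a minimal AzureSqlDW linked service definition looks like the following sketch (the server, database, and credential values are placeholders); a complete example appears in the JSON examples later in this article:

```json
{
    "name": "AzureSqlDWLinkedService",
    "properties": {
        "type": "AzureSqlDW",
        "typeProperties": {
            "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
        }
    }
}
```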
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.).
The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for the dataset of type AzureSqlDWTable has the following properties:
Property | Description | Required |
---|---|---|
tableName | Name of the table or view in the Azure SQL Data Warehouse database that the linked service refers to. | Yes |
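For example, a dataset of type AzureSqlDWTable that references such a table might look like the following sketch (the dataset name is a placeholder, and the availability and policy sections are omitted here; full dataset examples appear later in this article):

```json
{
    "name": "AzureSqlDWDataset",
    "properties": {
        "type": "AzureSqlDWTable",
        "linkedServiceName": "AzureSqlDWLinkedService",
        "typeProperties": {
            "tableName": "MyTable"
        }
    }
}
```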
For a full list of sections & properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output tables, and policy are available for all types of activities.
Note
The Copy Activity takes only one input and produces only one output.
In contrast, the properties available in the typeProperties section of the activity vary with each activity type. For the Copy activity, they vary depending on the types of sources and sinks.
When source is of type SqlDWSource, the following properties are available in typeProperties section:
Property | Description | Allowed values | Required |
---|---|---|---|
sqlReaderQuery | Use the custom query to read data. | SQL query string. For example: select * from MyTable. | No |
sqlReaderStoredProcedureName | Name of the stored procedure that reads data from the source table. | Name of the stored procedure. The last SQL statement must be a SELECT statement in the stored procedure. | No |
storedProcedureParameters | Parameters for the stored procedure. | Name/value pairs. Names and casing of parameters must match the names and casing of the stored procedure parameters. | No |
If the sqlReaderQuery is specified for the SqlDWSource, the Copy Activity runs this query against the Azure SQL Data Warehouse source to get the data.
Alternatively, you can specify a stored procedure by specifying the sqlReaderStoredProcedureName and storedProcedureParameters (if the stored procedure takes parameters).
If you do not specify either sqlReaderQuery or sqlReaderStoredProcedureName, the columns defined in the structure section of the dataset JSON are used to build a query (for example: select column1, column2 from mytable) to run against the Azure SQL Data Warehouse. If the dataset definition does not have the structure, all columns are selected from the table.
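A sketch of what such a structure section in the dataset JSON might look like (the column names and types are illustrative), from which the query select column1, column2 from mytable would be generated:

```json
"structure": [
    { "name": "column1", "type": "String" },
    { "name": "column2", "type": "Int32" }
]
```

The next example shows the stored-procedure option instead.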
"source": {
"type": "SqlDWSource",
"sqlReaderStoredProcedureName": "CopyTestSrcStoredProcedureWithParameters",
"storedProcedureParameters": {
"stringData": { "value": "str3" },
"identifier": { "value": "$$Text.Format('{0:yyyy}', SliceStart)", "type": "Int"}
}
}
The stored procedure definition:
```sql
CREATE PROCEDURE CopyTestSrcStoredProcedureWithParameters
(
    @stringData varchar(20),
    @identifier int
)
AS
SET NOCOUNT ON;
BEGIN
    select *
    from dbo.UnitTestSrcTable
    where dbo.UnitTestSrcTable.stringData != @stringData
    and dbo.UnitTestSrcTable.identifier != @identifier
END
GO
```
SqlDWSink supports the following properties:
Property | Description | Allowed values | Required |
---|---|---|---|
sqlWriterCleanupScript | Specify a query for Copy Activity to execute such that data of a specific slice is cleaned up. For details, see the repeatability section. | A query statement. | No |
allowPolyBase | Indicates whether to use PolyBase (when applicable) instead of the BULKINSERT mechanism. Using PolyBase is the recommended way to load data into SQL Data Warehouse. See the Use PolyBase to load data into Azure SQL Data Warehouse section for constraints and details. | True, False (default) | No |
polyBaseSettings | A group of properties that can be specified when the allowPolyBase property is set to true. | | No |
rejectValue | Specifies the number or percentage of rows that can be rejected before the query fails. Learn more about PolyBase's reject options in the Arguments section of the CREATE EXTERNAL TABLE (Transact-SQL) topic. | 0 (default), 1, 2, … | No |
rejectType | Specifies whether the rejectValue option is specified as a literal value or a percentage. | Value (default), Percentage | No |
rejectSampleValue | Determines the number of rows to retrieve before PolyBase recalculates the percentage of rejected rows. | 1, 2, … | Yes, if rejectType is percentage |
useTypeDefault | Specifies how to handle missing values in delimited text files when PolyBase retrieves data from the text file. Learn more about this property from the Arguments section in CREATE EXTERNAL FILE FORMAT (Transact-SQL). | True, False (default) | No |
writeBatchSize | Inserts data into the SQL table when the buffer size reaches writeBatchSize. | Integer (number of rows) | No (default: 10000) |
writeBatchTimeout | Wait time for the batch insert operation to complete before it times out. | timespan. Example: "00:30:00" (30 minutes). | No |
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true
}
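When allowPolyBase is not set (or is set to false), the sink uses the BULKINSERT mechanism and the batching properties apply. A minimal sketch with illustrative values:

```json
"sink": {
    "type": "SqlDWSink",
    "writeBatchSize": 10000,
    "writeBatchTimeout": "00:30:00"
}
```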
Using PolyBase is an efficient way of loading a large amount of data into Azure SQL Data Warehouse with high throughput. You can see a large gain in throughput by using PolyBase instead of the default BULKINSERT mechanism. See the copy performance reference numbers for a detailed comparison. For a walkthrough with a use case, see Load 1 TB into Azure SQL Data Warehouse under 15 minutes with Azure Data Factory.
- If your source data is in Azure Blob or Azure Data Lake Store, and the format is compatible with PolyBase, you can copy directly to Azure SQL Data Warehouse by using PolyBase. See Direct copy using PolyBase for details.
- If your source data store and format are not natively supported by PolyBase, you can use the Staged Copy using PolyBase feature instead. It also provides better throughput by automatically converting the data into a PolyBase-compatible format, storing the data in Azure Blob storage, and then loading the data into SQL Data Warehouse.
Set the allowPolyBase property to true, as shown in the following example, for Azure Data Factory to use PolyBase to copy data into Azure SQL Data Warehouse. When you set allowPolyBase to true, you can specify PolyBase-specific properties by using the polyBaseSettings property group. See the SqlDWSink section for details about the properties that you can use with polyBaseSettings.
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true,
"polyBaseSettings":
{
"rejectType": "percentage",
"rejectValue": 10.0,
"rejectSampleValue": 100,
"useTypeDefault": true
}
}
SQL Data Warehouse PolyBase directly supports Azure Blob and Azure Data Lake Store (using service principal authentication) as the source, with specific file format requirements. If your source data meets the criteria described in this section, you can copy directly from the source data store to Azure SQL Data Warehouse by using PolyBase. Otherwise, you can use Staged Copy using PolyBase.
Tip
To copy data from Data Lake Store to SQL Data Warehouse efficiently, learn more from Azure Data Factory makes it even easier and convenient to uncover insights from data when using Data Lake Store with SQL Data Warehouse.
If the following requirements are not met, Azure Data Factory checks the settings and automatically falls back to the BULKINSERT mechanism for the data movement:
- Source linked service is of type AzureStorage or AzureDataLakeStore with service principal authentication.
- The input dataset is of type AzureBlob or AzureDataLakeStore, and the format type under type properties is OrcFormat, ParquetFormat, or TextFormat with the following configurations:

  - rowDelimiter must be \n.
  - nullValue is set to empty string (""), or treatEmptyAsNull is set to true.
  - encodingName is set to utf-8, which is the default value.
  - escapeChar, quoteChar, firstRowAsHeader, and skipLineCount are not specified.
  - compression can be no compression, GZip, or Deflate.

  ```json
  "typeProperties": {
      "folderPath": "<blobpath>",
      "format": {
          "type": "TextFormat",
          "columnDelimiter": "<any delimiter>",
          "rowDelimiter": "\n",
          "nullValue": "",
          "encodingName": "utf-8"
      },
      "compression": {
          "type": "GZip",
          "level": "Optimal"
      }
  },
  ```

- There is no skipHeaderLineCount setting under BlobSource or AzureDataLakeStore for the Copy activity in the pipeline.
- There is no sliceIdentifierColumnName setting under SqlDWSink for the Copy activity in the pipeline. (PolyBase guarantees that all data is updated or nothing is updated in a single run. To achieve repeatability, you can use sqlWriterCleanupScript.)
- There is no columnMapping being used in the associated Copy activity.
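Putting these requirements together, a copy activity that loads PolyBase-compatible text files from Azure Blob storage directly into SQL Data Warehouse might look like the following sketch; the dataset names are placeholders for datasets that satisfy the criteria above:

```json
"activities": [
    {
        "name": "Sample direct copy from Azure Blob to SQL Data Warehouse via PolyBase",
        "type": "Copy",
        "inputs": [ { "name": "PolyBaseCompatibleBlobInput" } ],
        "outputs": [ { "name": "AzureSQLDWOutput" } ],
        "typeProperties": {
            "source": {
                "type": "BlobSource"
            },
            "sink": {
                "type": "SqlDWSink",
                "allowPolyBase": true
            }
        }
    }
]
```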
When your source data doesn't meet the criteria introduced in the previous section, you can enable copying data via an interim staging Azure Blob storage (it cannot be Premium Storage). In this case, Azure Data Factory automatically performs transformations on the data to meet the data format requirements of PolyBase, then uses PolyBase to load the data into SQL Data Warehouse, and finally cleans up your temporary data from the Blob storage. See Staged Copy for details on how copying data via a staging Azure Blob works in general.
Note
When copying data from an on-premises data store into Azure SQL Data Warehouse using PolyBase and staging, if your Data Management Gateway version is below 2.4, JRE (Java Runtime Environment) is required on your gateway machine, which is used to transform your source data into the proper format. We suggest that you upgrade your gateway to the latest version to avoid this dependency.
To use this feature, create an Azure Storage linked service that refers to the Azure storage account that has the interim blob storage, and then specify the enableStaging and stagingSettings properties for the Copy Activity as shown in the following code:

```json
"activities": [
    {
        "name": "Sample copy activity from SQL Server to SQL Data Warehouse via PolyBase",
        "type": "Copy",
        "inputs": [ { "name": "OnpremisesSQLServerInput" } ],
        "outputs": [ { "name": "AzureSQLDWOutput" } ],
        "typeProperties": {
            "source": {
                "type": "SqlSource"
            },
            "sink": {
                "type": "SqlDWSink",
                "allowPolyBase": true
            },
            "enableStaging": true,
            "stagingSettings": {
                "linkedServiceName": "MyStagingBlob"
            }
        }
    }
]
```
The following sections provide best practices in addition to those mentioned in Best practices for Azure SQL Data Warehouse.
To use PolyBase, the user that is used to load data into SQL Data Warehouse must have the "CONTROL" permission on the target database. One way to achieve that is to add the user as a member of the "db_owner" role. Learn how to do that by following this section.
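A minimal sketch of how this might be done in the target data warehouse, assuming a hypothetical loading user named LoaderUser and database named MyDW (run either statement as an administrator in that database):

```sql
-- Option 1: add the loading user to the db_owner role
EXEC sp_addrolemember 'db_owner', 'LoaderUser';

-- Option 2: grant CONTROL on the database directly
GRANT CONTROL ON DATABASE::[MyDW] TO LoaderUser;
```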
PolyBase loads are limited to loading rows smaller than 1 MB and cannot load to VARCHAR(MAX), NVARCHAR(MAX), or VARBINARY(MAX) columns. Refer to here.
If you have source data with rows of size greater than 1 MB, you may want to split the source tables vertically into several small ones where the largest row size of each of them does not exceed the limit. The smaller tables can then be loaded using PolyBase and merged together in Azure SQL Data Warehouse.
To achieve the best possible throughput, consider assigning a larger resource class to the user that is used to load data into SQL Data Warehouse via PolyBase. Learn how to do that by following the Change a user resource class example.
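For illustration, assigning the same hypothetical LoaderUser to a larger static resource class is done through role membership:

```sql
-- Assign the loading user to a larger static resource class (for example, xlargerc)
EXEC sp_addrolemember 'xlargerc', 'LoaderUser';
```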
The following table provides examples of how to specify the tableName property in the dataset JSON for various combinations of schema and table name.
DB Schema | Table name | tableName JSON property |
---|---|---|
dbo | MyTable | MyTable or dbo.MyTable or [dbo].[MyTable] |
dbo1 | MyTable | dbo1.MyTable or [dbo1].[MyTable] |
dbo | My.Table | [My.Table] or [dbo].[My.Table] |
dbo1 | My.Table | [dbo1].[My.Table] |
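For example, to reference the My.Table table under the dbo1 schema (the last row in the table above), the dataset typeProperties would contain:

```json
"typeProperties": {
    "tableName": "[dbo1].[My.Table]"
}
```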
If you see the following error, it could be an issue with the value you specified for the tableName property. See the table for the correct way to specify values for the tableName JSON property.
```
Type=System.Data.SqlClient.SqlException,Message=Invalid object name 'stg.Account_test'.,Source=.Net SqlClient Data Provider
```
Currently, the PolyBase feature in Data Factory only accepts the same number of columns as in the target table. Say you have a table with four columns, and one of them is defined with a default value. The input data should still contain four columns. Providing a three-column input dataset would yield an error similar to the following message:
```
All columns of the table must be specified in the INSERT BULK statement.
```
A NULL value is a special form of a default value. If the column is nullable, the input data in the blob for that column can be empty (but cannot be missing from the input dataset). PolyBase inserts NULL for those values in Azure SQL Data Warehouse.
If you are using Copy Wizard to copy data from SQL Server or Azure SQL Database to Azure SQL Data Warehouse and the table that corresponds to the source table does not exist in the destination store, Data Factory can automatically create the table in the data warehouse by using the source table schema.
Data Factory creates the table in the destination store with the same table name in the source data store. The data types for columns are chosen based on the following type mapping. If needed, it performs type conversions to fix any incompatibilities between source and destination stores. It also uses Round Robin table distribution.
Source SQL Database column type | Destination SQL DW column type (size limitation) |
---|---|
Int | Int |
BigInt | BigInt |
SmallInt | SmallInt |
TinyInt | TinyInt |
Bit | Bit |
Decimal | Decimal |
Numeric | Decimal |
Float | Float |
Money | Money |
Real | Real |
SmallMoney | SmallMoney |
Binary | Binary |
Varbinary | Varbinary (up to 8000) |
Date | Date |
DateTime | DateTime |
DateTime2 | DateTime2 |
Time | Time |
DateTimeOffset | DateTimeOffset |
SmallDateTime | SmallDateTime |
Text | Varchar (up to 8000) |
NText | NVarChar (up to 4000) |
Image | VarBinary (up to 8000) |
UniqueIdentifier | UniqueIdentifier |
Char | Char |
NChar | NChar |
VarChar | VarChar (up to 8000) |
NVarChar | NVarChar (up to 4000) |
Xml | Varchar (up to 8000) |
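For illustration only, for a hypothetical source table with an int column and an nvarchar(50) column, the auto-created table would resemble the following DDL (the table and column names are assumptions):

```sql
-- Same table name as the source, mapped column types, Round Robin distribution
CREATE TABLE [dbo].[MySourceTable]
(
    [Id] int,
    [Name] nvarchar(50)
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN
);
```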
[!INCLUDE data-factory-type-repeatability-for-sql-sources]
As mentioned in the data movement activities article, Copy activity performs automatic type conversions from source types to sink types with the following 2-step approach:
- Convert from native source types to .NET type
- Convert from .NET type to native sink type
When moving data to & from Azure SQL Data Warehouse, the following mappings are used from SQL type to .NET type and vice versa.
The mapping is the same as the SQL Server Data Type Mapping for ADO.NET.
SQL Server Database Engine type | .NET Framework type |
---|---|
bigint | Int64 |
binary | Byte[] |
bit | Boolean |
char | String, Char[] |
date | DateTime |
Datetime | DateTime |
datetime2 | DateTime |
Datetimeoffset | DateTimeOffset |
Decimal | Decimal |
FILESTREAM attribute (varbinary(max)) | Byte[] |
Float | Double |
image | Byte[] |
int | Int32 |
money | Decimal |
nchar | String, Char[] |
ntext | String, Char[] |
numeric | Decimal |
nvarchar | String, Char[] |
real | Single |
rowversion | Byte[] |
smalldatetime | DateTime |
smallint | Int16 |
smallmoney | Decimal |
sql_variant | Object * |
text | String, Char[] |
time | TimeSpan |
timestamp | Byte[] |
tinyint | Byte |
uniqueidentifier | Guid |
varbinary | Byte[] |
varchar | String, Char[] |
xml | Xml |
You can also map columns from source dataset to columns from sink dataset in the copy activity definition. For details, see Mapping dataset columns in Azure Data Factory.
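For example, a sketch of an explicit column mapping in the copy activity typeProperties; the source and sink column names here are hypothetical:

```json
"translator": {
    "type": "TabularTranslator",
    "columnMappings": "SourceColumnA: SinkColumnA, SourceColumnB: SinkColumnB"
}
```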
The following examples provide sample JSON definitions that you can use to create a pipeline by using the Azure portal, Visual Studio, or Azure PowerShell. They show how to copy data to and from Azure SQL Data Warehouse and Azure Blob Storage. However, data can be copied directly from any of the sources to any of the sinks stated here by using the Copy Activity in Azure Data Factory.
The sample defines the following Data Factory entities:
- A linked service of type AzureSqlDW.
- A linked service of type AzureStorage.
- An input dataset of type AzureSqlDWTable.
- An output dataset of type AzureBlob.
- A pipeline with Copy Activity that uses SqlDWSource and BlobSink.
The sample copies time-series (hourly, daily, etc.) data from a table in Azure SQL Data Warehouse database to a blob every hour. The JSON properties used in these samples are described in sections following the samples.
Azure SQL Data Warehouse linked service:
```json
{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
}
}
```
Azure Blob storage linked service:
```json
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
}
}
}
```
Azure SQL Data Warehouse input dataset:
The sample assumes you have created a table “MyTable” in Azure SQL Data Warehouse and it contains a column called “timestampcolumn” for time series data.
Setting “external”: ”true” informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory.
```json
{
"name": "AzureSqlDWInput",
"properties": {
"type": "AzureSqlDWTable",
"linkedServiceName": "AzureSqlDWLinkedService",
"typeProperties": {
"tableName": "MyTable"
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
```
Azure Blob output dataset:
Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, day, and hours parts of the start time.
```json
{
"name": "AzureBlobOutput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}",
"partitionedBy": [
{
"name": "Year",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyy"
}
},
{
"name": "Month",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "MM"
}
},
{
"name": "Day",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "dd"
}
},
{
"name": "Hour",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "HH"
}
}
],
"format": {
"type": "TextFormat",
"columnDelimiter": "\t",
"rowDelimiter": "\n"
}
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
```
Copy activity in a pipeline with SqlDWSource and BlobSink:
The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to SqlDWSource and sink type is set to BlobSink. The SQL query specified for the SqlReaderQuery property selects the data in the past hour to copy.
```json
{
"name":"SamplePipeline",
"properties":{
"start":"2014-06-01T18:00:00",
"end":"2014-06-01T19:00:00",
"description":"pipeline for copy activity",
"activities":[
{
"name": "AzureSQLDWtoBlob",
"description": "copy activity",
"type": "Copy",
"inputs": [
{
"name": "AzureSqlDWInput"
}
],
"outputs": [
{
"name": "AzureBlobOutput"
}
],
"typeProperties": {
"source": {
"type": "SqlDWSource",
"sqlReaderQuery": "$$Text.Format('select * from MyTable where timestampcolumn >= \\'{0:yyyy-MM-dd HH:mm}\\' AND timestampcolumn < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
},
"sink": {
"type": "BlobSink"
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}
]
}
}
```
Note
In the example, sqlReaderQuery is specified for the SqlDWSource. The Copy Activity runs this query against the Azure SQL Data Warehouse source to get the data.
Alternatively, you can specify a stored procedure by specifying the sqlReaderStoredProcedureName and storedProcedureParameters (if the stored procedure takes parameters).
If you do not specify either sqlReaderQuery or sqlReaderStoredProcedureName, the columns defined in the structure section of the dataset JSON are used to build a query (select column1, column2 from mytable) to run against the Azure SQL Data Warehouse. If the dataset definition does not have the structure, all columns are selected from the table.
The sample defines the following Data Factory entities:
- A linked service of type AzureSqlDW.
- A linked service of type AzureStorage.
- An input dataset of type AzureBlob.
- An output dataset of type AzureSqlDWTable.
- A pipeline with Copy activity that uses BlobSource and SqlDWSink.
The sample copies time-series data (hourly, daily, etc.) from Azure blob to a table in Azure SQL Data Warehouse database every hour. The JSON properties used in these samples are described in sections following the samples.
Azure SQL Data Warehouse linked service:
```json
{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
}
}
```
Azure Blob storage linked service:
```json
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
}
}
}
```
Azure Blob input dataset:
Data is picked up from a new blob every hour (frequency: hour, interval: 1). The folder path and file name for the blob are dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, and day part of the start time and file name uses the hour part of the start time. “external”: “true” setting informs the Data Factory service that this table is external to the data factory and is not produced by an activity in the data factory.
```json
{
"name": "AzureBlobInput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "StorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/myfolder/yearno={Year}/monthno={Month}/dayno={Day}",
"fileName": "{Hour}.csv",
"partitionedBy": [
{
"name": "Year",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyy"
}
},
{
"name": "Month",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "MM"
}
},
{
"name": "Day",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "dd"
}
},
{
"name": "Hour",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "HH"
}
}
],
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
}
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"externalData": {
"retryInterval": "00:01:00",
"retryTimeout": "00:10:00",
"maximumRetry": 3
}
}
}
}
```
Azure SQL Data Warehouse output dataset:
The sample copies data to a table named "MyOutputTable" in Azure SQL Data Warehouse. Create the table in Azure SQL Data Warehouse with the same number of columns as you expect the Blob CSV file to contain. New rows are added to the table every hour.
```json
{
"name": "AzureSqlDWOutput",
"properties": {
"type": "AzureSqlDWTable",
"linkedServiceName": "AzureSqlDWLinkedService",
"typeProperties": {
"tableName": "MyOutputTable"
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
```
Copy activity in a pipeline with BlobSource and SqlDWSink:
The pipeline contains a Copy Activity that is configured to use the input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to BlobSource and sink type is set to SqlDWSink.
```json
{
"name":"SamplePipeline",
"properties":{
"start":"2014-06-01T18:00:00",
"end":"2014-06-01T19:00:00",
"description":"pipeline with copy activity",
"activities":[
{
"name": "AzureBlobtoSQLDW",
"description": "Copy Activity",
"type": "Copy",
"inputs": [
{
"name": "AzureBlobInput"
}
],
"outputs": [
{
"name": "AzureSqlDWOutput"
}
],
"typeProperties": {
"source": {
"type": "BlobSource",
"blobColumnSeparators": ","
},
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true
}
},
"scheduler": {
"frequency": "Hour",
"interval": 1
},
"policy": {
"concurrency": 1,
"executionPriorityOrder": "OldestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}
]
}
}
```
For a walkthrough, see Load 1 TB into Azure SQL Data Warehouse under 15 minutes with Azure Data Factory and the Load data with Azure Data Factory article in the Azure SQL Data Warehouse documentation.
See Copy Activity Performance & Tuning Guide to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it.