---
title: Copy data to and from Azure SQL Data Warehouse by using Azure Data Factory | Microsoft Docs
description: Learn how to copy data from supported source stores to Azure SQL Data Warehouse or from SQL Data Warehouse to supported sink stores by using Data Factory.
services: data-factory
author: linda33wj
manager: craigg
ms.reviewer: douglasl
ms.service: data-factory
ms.workload: data-services
ms.tgt_pltfrm: na
ms.devlang: na
ms.topic: conceptual
ms.date: 11/08/2018
ms.author: jingwang
---
[!div class="op_single_selector" title1="Select the version of Data Factory service you're using:"]
This article explains how to use Copy Activity in Azure Data Factory to copy data to or from Azure SQL Data Warehouse. It builds on the Copy Activity overview article that presents a general overview of Copy Activity.
You can copy data from Azure SQL Data Warehouse to any supported sink data store. And you can copy data from any supported source data store to Azure SQL Data Warehouse. For a list of data stores that are supported as sources or sinks by Copy Activity, see the Supported data stores and formats table.
Specifically, this Azure SQL Data Warehouse connector supports these functions:
- Copy data by using SQL authentication and Azure Active Directory (Azure AD) Application token authentication with a service principal or managed identities for Azure resources.
- As a source, retrieve data by using a SQL query or stored procedure.
- As a sink, load data by using PolyBase or a bulk insert. We recommend PolyBase for better copy performance.
Important
PolyBase supports only SQL authentication, not Azure AD authentication.
Important
If you copy data by using Azure Data Factory Integration Runtime, configure an Azure SQL server firewall so that Azure services can access the server. If you copy data by using a self-hosted integration runtime, configure the Azure SQL server firewall to allow the appropriate IP range. This range includes the machine's IP that is used to connect to Azure SQL Database.
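As one option, you can create these firewall rules with T-SQL in the master database of your Azure SQL server. The following is a minimal sketch; the rule names and the self-hosted integration runtime IP address are illustrative placeholders:

-- Run in the master database of the Azure SQL server. Rule names and IPs are placeholders.
-- Allow Azure services to access the server (the special 0.0.0.0 rule):
EXECUTE sp_set_firewall_rule @name = N'AllowAllWindowsAzureIps',
    @start_ip_address = '0.0.0.0', @end_ip_address = '0.0.0.0';

-- Allow the machine that hosts a self-hosted integration runtime (example IP):
EXECUTE sp_set_firewall_rule @name = N'SelfHostedIntegrationRuntime',
    @start_ip_address = '203.0.113.5', @end_ip_address = '203.0.113.5';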
Tip
To achieve best performance, use PolyBase to load data into Azure SQL Data Warehouse. The Use PolyBase to load data into Azure SQL Data Warehouse section has details. For a walkthrough with a use case, see Load 1 TB into Azure SQL Data Warehouse under 15 minutes with Azure Data Factory.
[!INCLUDE data-factory-v2-connector-get-started]
The following sections provide details about properties that define Data Factory entities specific to an Azure SQL Data Warehouse connector.
The following properties are supported for an Azure SQL Data Warehouse linked service:
Property | Description | Required |
---|---|---|
type | The type property must be set to AzureSqlDW. | Yes |
connectionString | Specify the information needed to connect to the Azure SQL Data Warehouse instance for the connectionString property. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes |
servicePrincipalId | Specify the application's client ID. | Yes, when you use Azure AD authentication with a service principal. |
servicePrincipalKey | Specify the application's key. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. | Yes, when you use Azure AD authentication with a service principal. |
tenant | Specify the tenant information (domain name or tenant ID) under which your application resides. You can retrieve it by hovering the mouse over the top-right corner of the Azure portal. | Yes, when you use Azure AD authentication with a service principal. |
connectVia | The integration runtime to be used to connect to the data store. You can use Azure Integration Runtime or a self-hosted integration runtime (if your data store is located in a private network). If not specified, it uses the default Azure Integration Runtime. | No |
For different authentication types, refer to the following sections on prerequisites and JSON samples, respectively:
- SQL authentication
- Azure AD application token authentication: Service principal
- Azure AD application token authentication: Managed identities for Azure resources
Tip
If you hit an error with the error code "UserErrorFailedToConnectToSqlServer" and a message like "The session limit for the database is XXX and has been reached.", add Pooling=false to your connection string and try again.
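For example, a connection string with pooling disabled might look like the following sketch (the server, database, and credential values are placeholders):

Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30;Pooling=False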
{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
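If you prefer to keep the connection string out of the linked service definition, a minimal sketch that references a secret in Azure Key Vault instead might look like the following. It assumes you've already created an Azure Key Vault linked service and stored the full connection string as a secret; the names are placeholders:

{
    "name": "AzureSqlDWLinkedService",
    "properties": {
        "type": "AzureSqlDW",
        "typeProperties": {
            "connectionString": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "<Azure Key Vault linked service name>",
                    "type": "LinkedServiceReference"
                },
                "secretName": "<name of the secret that holds the connection string>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}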
To use service principal-based Azure AD application token authentication, follow these steps:

1. Create an Azure Active Directory application from the Azure portal. Make note of the application name and the following values that define the linked service:

   - Application ID
   - Application key
   - Tenant ID

2. Provision an Azure Active Directory administrator for your Azure SQL server on the Azure portal if you haven't already done so. The Azure AD administrator can be an Azure AD user or an Azure AD group. If you grant the group that contains the service principal an admin role, skip steps 3 and 4. The administrator will have full access to the database.

3. Create contained database users for the service principal. Connect to the data warehouse from or to which you want to copy data by using tools like SSMS, with an Azure AD identity that has at least ALTER ANY USER permission. Run the following T-SQL:

   CREATE USER [your application name] FROM EXTERNAL PROVIDER;

4. Grant the service principal the needed permissions as you normally do for SQL users or others. Run the following code:

   EXEC sp_addrolemember [role name], [your application name];

5. Configure an Azure SQL Data Warehouse linked service in Azure Data Factory.
{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;Connection Timeout=30"
},
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
A data factory can be associated with a managed identity for Azure resources that represents the specific factory. You can use this service identity for Azure SQL Data Warehouse authentication. The designated factory can access and copy data from or to your data warehouse by using this identity.
Important
Note that PolyBase isn't currently supported for MSI authentication.
To use MSI-based Azure AD application token authentication, follow these steps:

1. Create a group in Azure AD, and make the factory MSI a member of the group.

   a. Find the data factory service identity from the Azure portal. Go to your data factory's Properties and copy the SERVICE IDENTITY ID.

   b. Install the Azure AD PowerShell module. Sign in by using the Connect-AzureAD command, and run the following commands to create the group and add the data factory MSI as a member:

      $Group = New-AzureADGroup -DisplayName "<your group name>" -MailEnabled $false -SecurityEnabled $true -MailNickName "NotSet"
      Add-AzureAdGroupMember -ObjectId $Group.ObjectId -RefObjectId "<your data factory service identity ID>"

2. Provision an Azure Active Directory administrator for your Azure SQL server on the Azure portal if you haven't already done so.

3. Create contained database users for the Azure AD group. Connect to the data warehouse from or to which you want to copy data by using tools like SSMS, with an Azure AD identity that has at least ALTER ANY USER permission. Run the following T-SQL:

   CREATE USER [your Azure AD group name] FROM EXTERNAL PROVIDER;

4. Grant the Azure AD group needed permissions as you normally do for SQL users and others. For example, run the following code:

   EXEC sp_addrolemember [role name], [your Azure AD group name];

5. Configure an Azure SQL Data Warehouse linked service in Azure Data Factory.
{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": {
"type": "SecureString",
"value": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;Connection Timeout=30"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
For a full list of sections and properties available for defining datasets, see the Datasets article. This section provides a list of properties supported by the Azure SQL Data Warehouse dataset.
To copy data from or to Azure SQL Data Warehouse, set the type property of the dataset to AzureSqlDWTable. The following properties are supported:
Property | Description | Required |
---|---|---|
type | The type property of the dataset must be set to AzureSqlDWTable. | Yes |
tableName | The name of the table or view in the Azure SQL Data Warehouse instance that the linked service refers to. | No for source, Yes for sink |
{
"name": "AzureSQLDWDataset",
"properties":
{
"type": "AzureSqlDWTable",
"linkedServiceName": {
"referenceName": "<Azure SQL Data Warehouse linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "MyTable"
}
}
}
For a full list of sections and properties available for defining activities, see the Pipelines article. This section provides a list of properties supported by the Azure SQL Data Warehouse source and sink.
To copy data from Azure SQL Data Warehouse, set the type property in the Copy Activity source to SqlDWSource. The following properties are supported in the Copy Activity source section:
Property | Description | Required |
---|---|---|
type | The type property of the Copy Activity source must be set to SqlDWSource. | Yes |
sqlReaderQuery | Use the custom SQL query to read data. Example: select * from MyTable. | No |
sqlReaderStoredProcedureName | The name of the stored procedure that reads data from the source table. The last SQL statement must be a SELECT statement in the stored procedure. | No |
storedProcedureParameters | Parameters for the stored procedure. Allowed values are name or value pairs. Names and casing of parameters must match the names and casing of the stored procedure parameters. | No |
- If sqlReaderQuery is specified for the SqlDWSource, the Copy Activity runs this query against the Azure SQL Data Warehouse source to get the data. Or you can specify a stored procedure. Specify sqlReaderStoredProcedureName and storedProcedureParameters if the stored procedure takes parameters.
- If you don't specify either sqlReaderQuery or sqlReaderStoredProcedureName, the columns defined in the structure section of the dataset JSON are used to construct a query, such as select column1, column2 from mytable, which runs against Azure SQL Data Warehouse. If the dataset definition doesn't have the structure, all columns are selected from the table.
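For example, a dataset that defines only a structure section like the following sketch (the column and table names are illustrative) would cause Copy Activity to issue select column1, column2 from MyTable against the data warehouse:

{
    "name": "AzureSQLDWDatasetWithStructure",
    "properties": {
        "type": "AzureSqlDWTable",
        "linkedServiceName": {
            "referenceName": "<Azure SQL Data Warehouse linked service name>",
            "type": "LinkedServiceReference"
        },
        "structure": [
            { "name": "column1" },
            { "name": "column2" }
        ],
        "typeProperties": {
            "tableName": "MyTable"
        }
    }
}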
"activities":[
{
"name": "CopyFromAzureSQLDW",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure SQL DW input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SqlDWSource",
"sqlReaderQuery": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
"activities":[
{
"name": "CopyFromAzureSQLDW",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure SQL DW input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SqlDWSource",
"sqlReaderStoredProcedureName": "CopyTestSrcStoredProcedureWithParameters",
"storedProcedureParameters": {
"stringData": { "value": "str3" },
"identifier": { "value": "$$Text.Format('{0:yyyy}', <datetime parameter>)", "type": "Int"}
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
CREATE PROCEDURE CopyTestSrcStoredProcedureWithParameters
(
@stringData varchar(20),
@identifier int
)
AS
SET NOCOUNT ON;
BEGIN
select *
from dbo.UnitTestSrcTable
where dbo.UnitTestSrcTable.stringData != @stringData
and dbo.UnitTestSrcTable.identifier != @identifier
END
GO
To copy data to Azure SQL Data Warehouse, set the sink type in Copy Activity to SqlDWSink. The following properties are supported in the Copy Activity sink section:
Property | Description | Required |
---|---|---|
type | The type property of the Copy Activity sink must be set to SqlDWSink. | Yes |
allowPolyBase | Indicates whether to use PolyBase, when applicable, instead of the BULKINSERT mechanism. We recommend that you load data into SQL Data Warehouse by using PolyBase. See the Use PolyBase to load data into Azure SQL Data Warehouse section for constraints and details. Allowed values are True and False (default). | No |
polyBaseSettings | A group of properties that can be specified when the allowPolybase property is set to true. | No |
rejectValue | Specifies the number or percentage of rows that can be rejected before the query fails. Learn more about PolyBase's reject options in the Arguments section of CREATE EXTERNAL TABLE (Transact-SQL). Allowed values are 0 (default), 1, 2, and so on. | No |
rejectType | Specifies whether the rejectValue option is a literal value or a percentage. Allowed values are Value (default) and Percentage. | No |
rejectSampleValue | Determines the number of rows to retrieve before PolyBase recalculates the percentage of rejected rows. Allowed values are 1, 2, and so on. | Yes, if the rejectType is percentage. |
useTypeDefault | Specifies how to handle missing values in delimited text files when PolyBase retrieves data from the text file. Learn more about this property from the Arguments section in CREATE EXTERNAL FILE FORMAT (Transact-SQL). Allowed values are True and False (default). | No |
writeBatchSize | Inserts data into the SQL table when the buffer size reaches writeBatchSize. Applies only when PolyBase isn't used. The allowed value is integer (number of rows). | No. The default is 10000. |
writeBatchTimeout | Wait time for the batch insert operation to finish before it times out. Applies only when PolyBase isn't used. The allowed value is timespan. Example: "00:30:00" (30 minutes). | No |
preCopyScript | Specify a SQL query for Copy Activity to run before writing data into Azure SQL Data Warehouse in each run. Use this property to clean up the preloaded data. | No |
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true,
"polyBaseSettings":
{
"rejectType": "percentage",
"rejectValue": 10.0,
"rejectSampleValue": 100,
"useTypeDefault": true
}
}
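For example, a sink that clears a staging table before each run by using the preCopyScript property might look like the following sketch (the table name dbo.MyStagingTable is illustrative):

"sink": {
    "type": "SqlDWSink",
    "preCopyScript": "TRUNCATE TABLE dbo.MyStagingTable",
    "allowPolyBase": true
}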
Learn more about how to use PolyBase to efficiently load SQL Data Warehouse in the next section.
Using PolyBase is an efficient way to load a large amount of data into Azure SQL Data Warehouse with high throughput. You'll see a large gain in the throughput by using PolyBase instead of the default BULKINSERT mechanism. See Performance reference for a detailed comparison. For a walkthrough with a use case, see Load 1 TB into Azure SQL Data Warehouse.
- If your source data is in Azure Blob storage or Azure Data Lake Store, and the format is compatible with PolyBase, copy directly to Azure SQL Data Warehouse by using PolyBase. For details, see Direct copy by using PolyBase.
- If your source data store and format aren't natively supported by PolyBase, use the Staged copy by using PolyBase feature instead. The staged copy feature also provides better throughput. It automatically converts the data into a PolyBase-compatible format, stores the data in Azure Blob storage, and then loads the data into SQL Data Warehouse.
Important
Note that PolyBase isn't currently supported for MSI-based Azure AD Application token authentication.
SQL Data Warehouse PolyBase directly supports Azure Blob storage and Azure Data Lake Store (by using a service principal) as the source, and it has specific file format requirements. If your source data meets the criteria described in this section, use PolyBase to copy directly from the source data store to Azure SQL Data Warehouse. Otherwise, use Staged copy by using PolyBase.
Tip
To copy data efficiently from Data Lake Store to SQL Data Warehouse, see Azure Data Factory makes it even easier and convenient to uncover insights from data when using Data Lake Store with SQL Data Warehouse.
If the requirements aren't met, Azure Data Factory checks the settings and automatically falls back to the BULKINSERT mechanism for the data movement.
1. The source linked service type is Azure Blob storage (AzureBlobStorage/AzureStorage) with account key authentication, or Azure Data Lake Storage Gen1 (AzureDataLakeStore) with service principal authentication.

2. The input dataset type is AzureBlob or AzureDataLakeStoreFile. The format type under type properties is OrcFormat, ParquetFormat, or TextFormat, with the following configurations:

   - fileName doesn't contain a wildcard filter.
   - rowDelimiter must be \n.
   - nullValue is either set to an empty string ("") or left as the default, and treatEmptyAsNull isn't set to false.
   - encodingName is set to utf-8, which is the default value.
   - escapeChar, quoteChar, and skipLineCount aren't specified. PolyBase supports skipping the header row, which can be configured as firstRowAsHeader in ADF.
   - compression can be no compression, GZip, or Deflate.
"typeProperties": { "folderPath": "<blobpath>", "format": { "type": "TextFormat", "columnDelimiter": "<any delimiter>", "rowDelimiter": "\n", "nullValue": "", "encodingName": "utf-8", "firstRowAsHeader": <any> }, "compression": { "type": "GZip", "level": "Optimal" } },
"activities":[
{
"name": "CopyFromAzureBlobToSQLDataWarehouseViaPolyBase",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureSQLDWDataset",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource",
},
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true
}
}
}
]
When your source data doesn’t meet the criteria in the previous section, enable data copying via an interim staging Azure Blob storage instance. It can't be Azure Premium Storage. In this case, Azure Data Factory automatically runs transformations on the data to meet the data format requirements of PolyBase. Then it uses PolyBase to load data into SQL Data Warehouse. Finally, it cleans up your temporary data from the blob storage. See Staged copy for details about copying data via a staging Azure Blob storage instance.
To use this feature, create an Azure Storage linked service that refers to the Azure storage account with the interim blob storage. Then specify the enableStaging
and stagingSettings
properties for the Copy Activity as shown in the following code:
"activities":[
{
"name": "CopyFromSQLServerToSQLDataWarehouseViaPolyBase",
"type": "Copy",
"inputs": [
{
"referenceName": "SQLServerDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureSQLDWDataset",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
},
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": {
"referenceName": "MyStagingBlob",
"type": "LinkedServiceReference"
}
}
}
}
]
The following sections provide best practices in addition to those mentioned in Best practices for Azure SQL Data Warehouse.
To use PolyBase, the user that loads data into SQL Data Warehouse must have "CONTROL" permission on the target database. One way to achieve that is to add the user as a member of the db_owner role. Learn how to do that in the SQL Data Warehouse overview.
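As a sketch of one way to grant that permission, assuming a SQL user named adf_load_user already exists (the user name and database name below are placeholders), you could run either of the following:

-- Add the loading user to the db_owner role (user name is a placeholder):
EXEC sp_addrolemember 'db_owner', 'adf_load_user';

-- Or grant CONTROL on the database directly:
GRANT CONTROL ON DATABASE::[<your database name>] TO adf_load_user;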
PolyBase loads are limited to rows smaller than 1 MB. They can't load to VARCHAR(MAX), NVARCHAR(MAX), or VARBINARY(MAX). For more information, see SQL Data Warehouse service capacity limits.
When your source data has rows greater than 1 MB, you might want to vertically split the source tables into several small ones. Make sure that the largest size of each row doesn't exceed the limit. The smaller tables can then be loaded by using PolyBase and merged together in Azure SQL Data Warehouse.
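For example, if two narrower tables have already been loaded through PolyBase, a CTAS statement such as the following sketch can merge them back into one wide table in the data warehouse (the table, column, and key names are illustrative):

-- Merge two narrow tables that share a key into one wide table (names are illustrative).
CREATE TABLE dbo.MyWideTable
WITH (DISTRIBUTION = HASH([Id]))
AS
SELECT a.[Id], a.[Col1], a.[Col2], b.[Col3], b.[Col4]
FROM dbo.MyTable_Part1 AS a
JOIN dbo.MyTable_Part2 AS b
    ON a.[Id] = b.[Id];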
To achieve the best possible throughput, assign a larger resource class to the user that loads data into SQL Data Warehouse via PolyBase.
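Because resource classes in SQL Data Warehouse are implemented as database roles, one way to do this is to add the loading user to a larger resource class role, as in the following sketch (the user name is a placeholder):

-- Assign the loading user to a larger static resource class (user name is a placeholder):
EXEC sp_addrolemember 'xlargerc', 'adf_load_user';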
The following table gives examples of how to specify the tableName property in the JSON dataset. It shows several combinations of schema and table names.
DB Schema | Table name | tableName JSON property |
---|---|---|
dbo | MyTable | MyTable or dbo.MyTable or [dbo].[MyTable] |
dbo1 | MyTable | dbo1.MyTable or [dbo1].[MyTable] |
dbo | My.Table | [My.Table] or [dbo].[My.Table] |
dbo1 | My.Table | [dbo1].[My.Table] |
If you see the following error, the problem might be the value you specified for the tableName property. See the preceding table for the correct way to specify values for the tableName JSON property.
Type=System.Data.SqlClient.SqlException,Message=Invalid object name 'stg.Account_test'.,Source=.Net SqlClient Data Provider
Currently, the PolyBase feature in Data Factory accepts only the same number of columns as in the target table. An example is a table with four columns where one of them is defined with a default value. The input data still needs to have four columns. A three-column input dataset yields an error similar to the following message:
All columns of the table must be specified in the INSERT BULK statement.
The NULL value is a special form of the default value. If the column is nullable, the input data in the blob for that column might be empty. But it can't be missing from the input dataset. PolyBase inserts NULL for missing values in Azure SQL Data Warehouse.
When you copy data from or to Azure SQL Data Warehouse, the following mappings are used from Azure SQL Data Warehouse data types to Azure Data Factory interim data types. See schema and data type mappings to learn how Copy Activity maps the source schema and data type to the sink.
Azure SQL Data Warehouse data type | Data Factory interim data type |
---|---|
bigint | Int64 |
binary | Byte[] |
bit | Boolean |
char | String, Char[] |
date | DateTime |
datetime | DateTime |
datetime2 | DateTime |
datetimeoffset | DateTimeOffset |
decimal | Decimal |
FILESTREAM attribute (varbinary(max)) | Byte[] |
float | Double |
image | Byte[] |
int | Int32 |
money | Decimal |
nchar | String, Char[] |
ntext | String, Char[] |
numeric | Decimal |
nvarchar | String, Char[] |
real | Single |
rowversion | Byte[] |
smalldatetime | DateTime |
smallint | Int16 |
smallmoney | Decimal |
sql_variant | Object * |
text | String, Char[] |
time | TimeSpan |
timestamp | Byte[] |
tinyint | Byte |
uniqueidentifier | Guid |
varbinary | Byte[] |
varchar | String, Char[] |
xml | Xml |
For a list of data stores supported as sources and sinks by Copy Activity in Azure Data Factory, see supported data stores and formats.