Skip to content

Latest commit

 

History

History
89 lines (56 loc) · 6.41 KB

machine-learning-data-science-move-data-to-azure-blob-using-ssis.md

File metadata and controls

89 lines (56 loc) · 6.41 KB

Move Data to or from Azure Blob Storage using SSIS connectors

The SQL Server Integration Services Feature Pack for Azure provides components to connect to Azure, transfer data between Azure and on-premises data sources, and process data stored in Azure.

Guidance on technologies used to move data to and/or from Azure Blob storage are linked here:

[AZURE.INCLUDE blob-storage-tool-selector]

Introduction

Once customers have moved on-premises data into the cloud, they can access it from any Azure service to leverage the full power of the suite of Azure technologies. It may be used, for example, in Azure Machine Learning or on an HDInsight cluster.

This will typically be the first step for the SQL and HDInsight walkthroughs.

For a discussion of canonical scenarios that use SSIS to accomplish business needs common in hybrid data integration scenarios, see Doing more with SQL Server Integration Services Feature Pack for Azure blog.

[AZURE.NOTE] For a complete introduction to Azure blob storage, please refer to Azure Blob Basics and Azure Blob Service.

Prerequisites

To perform the tasks described in this article, you must have an Azure subscription and an Azure storage account set up. You must know your Azure storage account name and account key in order to upload or download data.

To use the SSIS connectors you must download:

[AZURE.NOTE] SSIS is installed with SQL Server, but is not included in the Express version. For information on what applications are included in various editions of SQL Server, see SQL Server Editions

For training materials on SSIS, see Hands On Training for SSIS

For information on how to get up-and-running using SISS to build simple extraction, transformation, and load (ETL) packages, see SSIS Tutorial: Creating a Simple ETL Package.

Download NYC Taxi dataset

The example described here use a publicly available dataset -- the NYC Taxi Trips dataset. The dataset consists of about 173 million taxi rides in NYC in the year 2013. There are two types of data : trip details data and fare data. As there is a file for each month, we have 24 files in all, each of which is approximately 2GB uncompressed.

Upload data to Azure blob storage

To move data using the SSIS feature pack from on-premises to Azure blob storage, we use an instance of the Azure Blob Upload Task, shown below:

configure-data-science-vm

The parameters that the task uses are described here:

Field Description
AzureStorageConnection Specifies an existing Azure Storage Connection Manager or creates a new one that refers to an Azure storage account that points to where the blob files are hosted.
BlobContainer Specifies the name of the blob container that will hold the uploaded files as blobs.
BlobDirectory Specifies the blob directory where the uploaded file will be stored as a block blob. The blob directory is a virtual hierarchical structure. If the blob already exists, it will be replaced.
LocalDirectory Specifies the local directory that contains the files to be uploaded.
FileName Specifies a name filter to select files with the specified name pattern. For example, MySheet*.xls* includes files such as MySheet001.xls and MySheetABC.xlsx
TimeRangeFrom/TimeRangeTo Specifies a time range filter. Files modified after TimeRangeFrom and before TimeRangeTo will be included.

[AZURE.NOTE] The AzureStorageConnection credentials need to be correct and the BlobContainer must exist before the transfer is attempted.

Download data from Azure blob storage

To download data from Azure blob storage to on-premise storage with SSIS, use an instance of the Azure Blob Upload Task.

##More advanced SSIS-Azure scenarios We note here that the SSIS feature pack allows for more complex flows to be handled by packaging tasks together. For example, the blob data could feed directly into an HDInsight cluster whose output could be downloaded back to a blob and then to an on-premises storage. SSIS can run Hive and Pig jobs on an HDInsight cluster using additional SSIS connectors: