---
title: Move data from on-premises HDFS | Microsoft Docs
description: Learn about how to move data from on-premises HDFS using Azure Data Factory.
services: data-factory
documentationcenter: ''
author: linda33wj
manager: jhubbard
editor: monicar
ms.assetid: 3215b82d-291a-46db-8478-eac1a3219614
ms.service: data-factory
ms.workload: data-services
ms.tgt_pltfrm: na
ms.devlang: na
ms.topic: article
ms.date: 12/07/2016
ms.author: jingwang
---
This article outlines how you can use the Copy Activity in an Azure data factory to move data from an on-premises HDFS to another data store. This article builds on the data movement activities article that presents a general overview of data movement with copy activity and supported data store combinations.
Data Factory currently supports only moving data from an on-premises HDFS to other data stores, not moving data from other data stores to an on-premises HDFS.
The Data Factory service supports connecting to an on-premises HDFS by using the Data Management Gateway. See the moving data between on-premises locations and cloud article to learn about Data Management Gateway and for step-by-step instructions on setting up the gateway. Use the gateway to connect to HDFS even if it is hosted in an Azure IaaS VM.
While you can install the gateway on the same on-premises machine or Azure IaaS VM as the HDFS, we recommend that you install the gateway on a separate machine or Azure IaaS VM. Having the gateway on a separate machine reduces resource contention and improves performance. When you install the gateway on a separate machine, that machine must be able to access the machine that hosts the HDFS.
The easiest way to create a pipeline that copies data from on-premises HDFS is to use the Copy data wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
The following examples provide sample JSON definitions that you can use to create a pipeline by using the Azure portal, Visual Studio, or Azure PowerShell. They show how to copy data from an on-premises HDFS to Azure Blob storage. However, data can be copied directly to any of the sinks stated here using the Copy Activity in Azure Data Factory.
The sample has the following data factory entities:
- A linked service of type Hdfs.
- A linked service of type AzureStorage.
- An input dataset of type FileShare.
- An output dataset of type AzureBlob.
- A pipeline with Copy Activity that uses FileSystemSource and BlobSink.
The sample copies data from an on-premises HDFS to an Azure blob every hour. The JSON properties used in these samples are described in sections following the samples.
As a first step, set up the Data Management Gateway by following the instructions in the moving data between on-premises locations and cloud article.
HDFS linked service
This example uses Windows authentication. See the HDFS linked service section for the different types of authentication you can use.
{
"name": "HDFSLinkedService",
"properties":
{
"type": "Hdfs",
"typeProperties":
{
"authenticationType": "Windows",
"userName": "Administrator",
"password": "password",
"url" : "http://<machine>:50070/webhdfs/v1/",
"gatewayName": "mygateway"
}
}
}
Azure Storage linked service
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
}
}
}
HDFS input dataset
This dataset refers to the HDFS folder DataTransfer/UnitTest/. The pipeline copies all the files in this folder to the destination.
Setting "external": true informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory.
{
"name": "InputDataset",
"properties": {
"type": "FileShare",
"linkedServiceName": "HDFSLinkedService",
"typeProperties": {
"folderPath": "DataTransfer/UnitTest/"
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
Azure Blob output dataset
Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, day, and hours parts of the start time.
{
"name": "OutputDataset",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/hdfs/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": "\t"
},
"partitionedBy": [
{
"name": "Year",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyy"
}
},
{
"name": "Month",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "MM"
}
},
{
"name": "Day",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "dd"
}
},
{
"name": "Hour",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "HH"
}
}
]
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
Pipeline with Copy activity
The pipeline contains a Copy Activity that is configured to use these input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to FileSystemSource and the sink type is set to BlobSink.
{
"name": "pipeline",
"properties":
{
"activities":
[
{
"name": "HdfsToBlobCopy",
"inputs": [ {"name": "InputDataset"} ],
"outputs": [ {"name": "OutputDataset"} ],
"type": "Copy",
"typeProperties":
{
"source":
{
"type": "FileSystemSource"
},
"sink":
{
"type": "BlobSink"
}
},
"policy":
{
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 1,
"timeout": "00:05:00"
}
}
],
"start": "2014-06-01T18:00:00Z",
"end": "2014-06-01T19:00:00Z"
}
}
The following table provides descriptions for the JSON elements specific to the HDFS linked service.
Property | Description | Required |
---|---|---|
type | The type property must be set to: Hdfs | Yes |
url | URL to the HDFS | Yes |
encryptedCredential | New-AzureRMDataFactoryEncryptValue output of the access credential. | No |
userName | Username for Windows authentication. | Yes (for Windows Authentication) |
password | Password for Windows authentication. | Yes (for Windows Authentication) |
authenticationType | Windows or Anonymous. | Yes |
gatewayName | Name of the gateway that the Data Factory service should use to connect to the HDFS. | Yes |
See Move data between on-premises sources and the cloud with Data Management Gateway for details about setting credentials for on-premises HDFS.
Using Anonymous authentication:
{
"name": "hdfs",
"properties":
{
"type": "Hdfs",
"typeProperties":
{
"authenticationType": "Anonymous",
"userName": "hadoop",
"url" : "http://<machine>:50070/webhdfs/v1/",
"gatewayName": "mygateway"
}
}
}
Using Windows authentication:
{
"name": "hdfs",
"properties":
{
"type": "Hdfs",
"typeProperties":
{
"authenticationType": "Windows",
"userName": "Administrator",
"password": "password",
"url" : "http://<machine>:50070/webhdfs/v1/",
"gatewayName": "mygateway"
}
}
}
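If you do not want to keep the password in the JSON definition, you can use the encryptedCredential property listed in the table above instead. The following is a minimal sketch, assuming the credential string has already been generated with New-AzureRMDataFactoryEncryptValue; the placeholder value and the exact combination of properties should be validated against your environment:
{
"name": "HDFSLinkedService",
"properties":
{
"type": "Hdfs",
"typeProperties":
{
"authenticationType": "Windows",
"encryptedCredential": "<value returned by New-AzureRMDataFactoryEncryptValue>",
"url" : "http://<machine>:50070/webhdfs/v1/",
"gatewayName": "mygateway"
}
}
}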
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.).
The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for a dataset of type FileShare (which includes the HDFS dataset) has the following properties:
Property | Description | Required |
---|---|---|
folderPath | Path to the folder. Example: myfolder. Use the escape character '\' for special characters in the string. For example: for folder\subfolder, specify folder\\subfolder, and for d:\samplefolder, specify d:\\samplefolder. You can combine this property with partitionedBy to have folder paths based on slice start/end date-times. | Yes |
fileName | Specify the name of the file in the folderPath if you want the table to refer to a specific file in the folder. If you do not specify any value for this property, the table points to all files in the folder. When fileName is not specified for an output dataset, the name of the generated file is in the following format: Data.<Guid>.txt (for example: Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt). | No |
partitionedBy | partitionedBy can be used to specify a dynamic folderPath and fileName for time-series data. Example: folderPath parameterized for every hour of data. | No |
fileFilter | Specify a filter to be used to select a subset of files in the folderPath rather than all files. Allowed values are: * (multiple characters) and ? (single character). Example 1: "fileFilter": "*.log". Example 2: "fileFilter": "2014-1-?.txt". Note: fileFilter is applicable for an input FileShare dataset. | No |
format | The following format types are supported: TextFormat, AvroFormat, JsonFormat, OrcFormat, and ParquetFormat. Set the type property under format to one of these values. See the Specifying TextFormat, Specifying AvroFormat, Specifying JsonFormat, Specifying OrcFormat, and Specifying ParquetFormat sections for details. If you want to copy files as-is between file-based stores (binary copy), you can skip the format section in both input and output dataset definitions. | No |
compression | Specify the type and level of compression for the data. Supported types are: GZip, Deflate, and BZip2; supported levels are: Optimal and Fastest. Currently, compression settings are not supported for data in AvroFormat or OrcFormat. For more information, see the Compression support section. | No |
Note
fileName and fileFilter cannot be used simultaneously.
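As an illustration, the following is a minimal sketch of an input dataset that combines fileFilter and compression (both described in the table above) with the InputDataset from the earlier sample; the filter pattern and compression settings are example values only:
{
"name": "InputDataset",
"properties": {
"type": "FileShare",
"linkedServiceName": "HDFSLinkedService",
"typeProperties": {
"folderPath": "DataTransfer/UnitTest/",
"fileFilter": "*.log",
"compression": {
"type": "GZip",
"level": "Optimal"
}
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}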
As mentioned in the previous section, you can specify a dynamic folderPath and fileName for time-series data with partitionedBy. You can do so with Data Factory macros and the system variables SliceStart and SliceEnd, which indicate the logical time period for a given data slice.
To learn more about time series datasets, scheduling, and slices, see Creating Datasets, Scheduling & Execution, and Creating Pipelines articles.
"folderPath": "wikidatagateway/wikisampledataout/{Slice}",
"partitionedBy":
[
{ "name": "Slice", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyyMMddHH" } },
],
In this example, {Slice} is replaced with the value of the Data Factory system variable SliceStart in the specified format (yyyyMMddHH). SliceStart refers to the start time of the slice. The folderPath is different for each slice. For example: wikidatagateway/wikisampledataout/2014100103 or wikidatagateway/wikisampledataout/2014100104.
"folderPath": "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}",
"fileName": "{Hour}.csv",
"partitionedBy":
[
{ "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
{ "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
{ "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
{ "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } }
],
In this example, year, month, day, and time of SliceStart are extracted into separate variables that are used by folderPath and fileName properties.
[!INCLUDE data-factory-file-format]
[!INCLUDE data-factory-compression]
For a full list of sections & properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output tables, and policies are available for all types of activities.
Properties available in the typeProperties section of the activity, on the other hand, vary with each activity type. For the Copy activity, they vary depending on the types of sources and sinks.
When the source is of type FileSystemSource, the following properties are available in the typeProperties section:
Property | Description | Allowed values | Required |
---|---|---|---|
recursive | Indicates whether the data is read recursively from the subfolders or only from the specified folder. | True, False (default) | No |
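For example, the following sketch shows only the typeProperties section of the copy activity from the earlier pipeline with recursive enabled, so that files in subfolders of DataTransfer/UnitTest/ are copied as well; the rest of the activity definition stays the same:
"typeProperties":
{
"source":
{
"type": "FileSystemSource",
"recursive": true
},
"sink":
{
"type": "BlobSink"
}
}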
[!INCLUDE data-factory-column-mapping]
[!INCLUDE data-factory-structure-for-rectangualr-datasets]
See Copy Activity Performance & Tuning Guide to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it.