---
title: Move data from on-premises HDFS | Microsoft Docs
description: Learn about how to move data from on-premises HDFS using Azure Data Factory.
services: data-factory
documentationcenter: ''
author: linda33wj
manager: jhubbard
editor: monicar
ms.assetid: 3215b82d-291a-46db-8478-eac1a3219614
ms.service: data-factory
ms.workload: data-services
ms.tgt_pltfrm: na
ms.devlang: na
ms.topic: article
ms.date: 12/07/2016
ms.author: jingwang
---
This article outlines how you can use the Copy Activity in an Azure data factory to move data from an on-premises HDFS to another data store. This article builds on the data movement activities article that presents a general overview of data movement with copy activity and supported data store combinations.
Data Factory currently supports only moving data from an on-premises HDFS to other data stores, not moving data from other data stores to an on-premises HDFS.
The Data Factory service supports connecting to an on-premises HDFS by using the Data Management Gateway. See the moving data between on-premises locations and cloud article to learn about Data Management Gateway and for step-by-step instructions on setting up the gateway. Use the gateway to connect to HDFS even if it is hosted in an Azure IaaS VM.
While you can install the gateway on the same on-premises machine or Azure IaaS VM as the HDFS, we recommend that you install the gateway on a separate machine or Azure IaaS VM. Having the gateway on a separate machine reduces resource contention and improves performance. When you install the gateway on a separate machine, that machine must be able to access the machine that hosts the HDFS.
The easiest way to create a pipeline that copies data from on-premises HDFS is to use the Copy data wizard. See Tutorial: Create a pipeline using Copy Wizard for a quick walkthrough on creating a pipeline using the Copy data wizard.
The following examples provide sample JSON definitions that you can use to create a pipeline by using the Azure portal, Visual Studio, or Azure PowerShell. They show how to copy data from an on-premises HDFS to Azure Blob storage. However, data can be copied directly to any of the sinks stated here using the Copy Activity in Azure Data Factory.
The sample has the following data factory entities:
- A linked service of type Hdfs.
- A linked service of type AzureStorage.
- An input dataset of type FileShare.
- An output dataset of type AzureBlob.
- A pipeline with Copy Activity that uses FileSystemSource and BlobSink.
The sample copies data from an on-premises HDFS to an Azure blob every hour. The JSON properties used in these samples are described in sections following the samples.
As a first step, set up the Data Management Gateway by following the instructions in the moving data between on-premises locations and cloud article.
HDFS linked service
This example uses Windows authentication. See the HDFS linked service section for the different types of authentication you can use.
{
"name": "HDFSLinkedService",
"properties":
{
"type": "Hdfs",
"typeProperties":
{
"authenticationType": "Windows",
"userName": "Administrator",
"password": "password",
"url" : "http://<machine>:50070/webhdfs/v1/",
"gatewayName": "mygateway"
}
}
}
Azure Storage linked service
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
}
}
}
HDFS input dataset
This dataset refers to the HDFS folder DataTransfer/UnitTest/. The pipeline copies all the files in this folder to the destination.
Setting "external": true informs the Data Factory service that the dataset is external to the data factory and is not produced by an activity in the data factory.
{
"name": "InputDataset",
"properties": {
"type": "FileShare",
"linkedServiceName": "HDFSLinkedService",
"typeProperties": {
"folderPath": "DataTransfer/UnitTest/"
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
Azure Blob output dataset
Data is written to a new blob every hour (frequency: hour, interval: 1). The folder path for the blob is dynamically evaluated based on the start time of the slice that is being processed. The folder path uses year, month, day, and hours parts of the start time.
{
"name": "OutputDataset",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "AzureStorageLinkedService",
"typeProperties": {
"folderPath": "mycontainer/hdfs/yearno={Year}/monthno={Month}/dayno={Day}/hourno={Hour}",
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": "\t"
},
"partitionedBy": [
{
"name": "Year",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "yyyy"
}
},
{
"name": "Month",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "MM"
}
},
{
"name": "Day",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "dd"
}
},
{
"name": "Hour",
"value": {
"type": "DateTime",
"date": "SliceStart",
"format": "HH"
}
}
]
},
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}
Pipeline with Copy activity
The pipeline contains a Copy Activity that is configured to use these input and output datasets and is scheduled to run every hour. In the pipeline JSON definition, the source type is set to FileSystemSource and the sink type is set to BlobSink.
{
"name": "pipeline",
"properties":
{
"activities":
[
{
"name": "HdfsToBlobCopy",
"inputs": [ {"name": "InputDataset"} ],
"outputs": [ {"name": "OutputDataset"} ],
"type": "Copy",
"typeProperties":
{
"source":
{
"type": "FileSystemSource"
},
"sink":
{
"type": "BlobSink"
}
},
"policy":
{
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 1,
"timeout": "00:05:00"
}
}
],
"start": "2014-06-01T18:00:00Z",
"end": "2014-06-01T19:00:00Z"
}
}
The following table provides descriptions for the JSON elements specific to the HDFS linked service.
Property | Description | Required |
---|---|---|
type | The type property must be set to: Hdfs | Yes |
url | URL to the HDFS | Yes |
encryptedCredential | New-AzureRMDataFactoryEncryptValue output of the access credential. | No |
userName | Username for Windows authentication. | Yes (for Windows Authentication) |
password | Password for Windows authentication. | Yes (for Windows Authentication) |
authenticationType | Windows or Anonymous. | Yes |
gatewayName | Name of the gateway that the Data Factory service should use to connect to the HDFS. | Yes |
See Move data between on-premises sources and the cloud with Data Management Gateway for details about setting credentials for on-premises HDFS.
Using Anonymous authentication:
{
"name": "hdfs",
"properties":
{
"type": "Hdfs",
"typeProperties":
{
"authenticationType": "Anonymous",
"userName": "hadoop",
"url" : "http://<machine>:50070/webhdfs/v1/",
"gatewayName": "mygateway"
}
}
}
Using Windows authentication:
{
"name": "hdfs",
"properties":
{
"type": "Hdfs",
"typeProperties":
{
"authenticationType": "Windows",
"userName": "Administrator",
"password": "password",
"url" : "http://<machine>:50070/webhdfs/v1/",
"gatewayName": "mygateway"
}
}
}
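If you do not want to keep the password in the JSON definition, you can use the encryptedCredential property listed in the table above instead. The following is a minimal sketch, assuming the credential string has already been generated with New-AzureRMDataFactoryEncryptValue; the placeholder value and the exact combination of properties should be validated against your environment:
{
"name": "HDFSLinkedService",
"properties":
{
"type": "Hdfs",
"typeProperties":
{
"authenticationType": "Windows",
"encryptedCredential": "<value returned by New-AzureRMDataFactoryEncryptValue>",
"url" : "http://<machine>:50070/webhdfs/v1/",
"gatewayName": "mygateway"
}
}
}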
For a full list of sections & properties available for defining datasets, see the Creating datasets article. Sections such as structure, availability, and policy of a dataset JSON are similar for all dataset types (Azure SQL, Azure blob, Azure table, etc.).
The typeProperties section is different for each type of dataset and provides information about the location of the data in the data store. The typeProperties section for a dataset of type FileShare (which includes the HDFS dataset) has the following properties:
Property | Description | Required |
---|---|---|
folderPath | Path to the folder. Example: myfolder. Use the escape character '\' for special characters in the string. For example: for folder\subfolder, specify folder\\subfolder, and for d:\samplefolder, specify d:\\samplefolder. You can combine this property with partitionedBy to have folder paths based on slice start/end date-times. | Yes |
fileName | Specify the name of the file in the folderPath if you want the table to refer to a specific file in the folder. If you do not specify any value for this property, the table points to all files in the folder. When fileName is not specified for an output dataset, the name of the generated file is in the following format: Data.<Guid>.txt (for example: Data.0a405f8a-93ff-4c6f-b3be-f69616f1df7a.txt). | No |
partitionedBy | partitionedBy can be used to specify a dynamic folderPath and fileName for time-series data. Example: folderPath parameterized for every hour of data. | No |
fileFilter | Specify a filter to be used to select a subset of files in the folderPath rather than all files. Allowed values are: * (multiple characters) and ? (single character). Example 1: "fileFilter": "*.log". Example 2: "fileFilter": "2014-1-?.txt". Note: fileFilter is applicable for an input FileShare dataset. | No |
format | The following format types are supported: TextFormat, AvroFormat, JsonFormat, OrcFormat, and ParquetFormat. Set the type property under format to one of these values. See the Specifying TextFormat, Specifying AvroFormat, Specifying JsonFormat, Specifying OrcFormat, and Specifying ParquetFormat sections for details. If you want to copy files as-is between file-based stores (binary copy), you can skip the format section in both input and output dataset definitions. | No |
compression | Specify the type and level of compression for the data. Supported types are: GZip, Deflate, and BZip2; supported levels are: Optimal and Fastest. Currently, compression settings are not supported for data in AvroFormat or OrcFormat. For more information, see the Compression support section. | No |
Note
fileName and fileFilter cannot be used simultaneously.
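As an illustration, the following is a minimal sketch of an input dataset that combines fileFilter and compression (both described in the table above) with the InputDataset from the earlier sample; the filter pattern and compression settings are example values only:
{
"name": "InputDataset",
"properties": {
"type": "FileShare",
"linkedServiceName": "HDFSLinkedService",
"typeProperties": {
"folderPath": "DataTransfer/UnitTest/",
"fileFilter": "*.log",
"compression": {
"type": "GZip",
"level": "Optimal"
}
},
"external": true,
"availability": {
"frequency": "Hour",
"interval": 1
}
}
}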
As mentioned in the previous section, you can specify a dynamic folderPath and fileName for time-series data with partitionedBy. You can do so with Data Factory macros and the system variables SliceStart and SliceEnd, which indicate the logical time period for a given data slice.
To learn more about time series datasets, scheduling, and slices, see Creating Datasets, Scheduling & Execution, and Creating Pipelines articles.
"folderPath": "wikidatagateway/wikisampledataout/{Slice}",
"partitionedBy":
[
{ "name": "Slice", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyyMMddHH" } },
],
In this example, {Slice} is replaced with the value of the Data Factory system variable SliceStart in the specified format (yyyyMMddHH). SliceStart refers to the start time of the slice. The folderPath is different for each slice. For example: wikidatagateway/wikisampledataout/2014100103 or wikidatagateway/wikisampledataout/2014100104.
"folderPath": "wikidatagateway/wikisampledataout/{Year}/{Month}/{Day}",
"fileName": "{Hour}.csv",
"partitionedBy":
[
{ "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
{ "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "MM" } },
{ "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "dd" } },
{ "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } }
],
In this example, year, month, day, and time of SliceStart are extracted into separate variables that are used by folderPath and fileName properties.
[!INCLUDE data-factory-file-format]
[!INCLUDE data-factory-compression]
For a full list of sections & properties available for defining activities, see the Creating Pipelines article. Properties such as name, description, input and output tables, and policies are available for all types of activities.
Properties available in the typeProperties section of the activity, on the other hand, vary with each activity type. For the Copy activity, they vary depending on the types of sources and sinks.
When the source is of type FileSystemSource, the following properties are available in the typeProperties section:
Property | Description | Allowed values | Required |
---|---|---|---|
recursive | Indicates whether the data is read recursively from the subfolders or only from the specified folder. | True, False (default) | No |
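For example, the following sketch shows only the typeProperties section of the copy activity from the earlier pipeline with recursive enabled, so that files in subfolders of DataTransfer/UnitTest/ are copied as well; the rest of the activity definition stays the same:
"typeProperties":
{
"source":
{
"type": "FileSystemSource",
"recursive": true
},
"sink":
{
"type": "BlobSink"
}
}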
[!INCLUDE data-factory-column-mapping]
[!INCLUDE data-factory-structure-for-rectangualr-datasets]
See Copy Activity Performance & Tuning Guide to learn about key factors that impact performance of data movement (Copy Activity) in Azure Data Factory and various ways to optimize it.