The HDInsight Streaming Activity in a Data Factory pipeline executes Hadoop Streaming programs on your own or on-demand HDInsight cluster. This article builds on the data transformation activities article, which presents a general overview of data transformation and the supported transformation activities.

You can use the HDInsightStreamingActivity activity to invoke a Hadoop Streaming job from an Azure Data Factory pipeline. The following JSON snippet shows the syntax for using the HDInsightStreamingActivity in a pipeline JSON file.
```json
{
    "name": "HadoopStreamingPipeline",
    "properties": {
        "description": "Hadoop Streaming Demo",
        "activities": [
            {
                "name": "RunHadoopStreamingJob",
                "description": "Run a Hadoop streaming job",
                "type": "HDInsightStreaming",
                "getDebugInfo": "Failure",
                "inputs": [ ],
                "outputs": [ { "name": "OutputTable" } ],
                "linkedServiceName": "HDInsightLinkedService",
                "typeProperties": {
                    "mapper": "cat.exe",
                    "reducer": "wc.exe",
                    "input": "wasb://adfsample@<account name>.blob.core.windows.net/example/data/gutenberg/davinci.txt",
                    "output": "wasb://adfsample@<account name>.blob.core.windows.net/example/data/StreamingOutput/wc.txt",
                    "filePaths": [
                        "adfsample/example/apps/wc.exe",
                        "adfsample/example/apps/cat.exe"
                    ],
                    "fileLinkedService": "StorageLinkedService",
                    "arguments": [ ]
                },
                "policy": {
                    "concurrency": 1,
                    "executionPriorityOrder": "NewestFirst",
                    "retry": 1,
                    "timeout": "01:00:00"
                }
            }
        ]
    }
}
```
Note the following:
- Set the linkedServiceName to the name of the linked service that points to the HDInsight cluster on which the streaming MapReduce job runs (see the linked service sketch after this list).
- Set the type of the activity to HDInsightStreaming.
- For the mapper property, specify the name of the mapper executable. In the above example, cat.exe is the mapper executable.
- For the reducer property, specify the name of the reducer executable. In the above example, wc.exe is the reducer executable.
- For the input property, specify the input file (including its location) for the mapper. In the example, "wasb://adfsample@<account name>.blob.core.windows.net/example/data/gutenberg/davinci.txt": adfsample is the blob container, example/data/gutenberg is the folder, and davinci.txt is the blob.
- For the output property, specify the output file (including its location) for the reducer. The output of the Hadoop Streaming job is written to the location specified for this property.
- In the filePaths section, specify the paths for the mapper and reducer executables. In the example: "adfsample/example/apps/wc.exe", adfsample is the blob container, example/apps is the folder, and wc.exe is the executable.
- For the fileLinkedService property, specify the Azure Storage linked service that represents the Azure Storage account containing the files specified in the filePaths section (see the storage linked service sketch after this list).
- For the arguments property, specify the arguments for the streaming job; the example above passes no arguments (see the populated sketch after this list).
- The getDebugInfo property is an optional element. When it is set to Failure, logs are downloaded only for failed executions. When it is set to All, logs are always downloaded, irrespective of the execution status.
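
For reference, a linked service that points to an existing (bring-your-own) HDInsight cluster might look like the following. This is a minimal sketch and not part of the sample pipeline above; the cluster URI, user name, and password are placeholder assumptions, and StorageLinkedService is assumed to be the linked service for the storage account used by the cluster.

```json
{
    "name": "HDInsightLinkedService",
    "properties": {
        "type": "HDInsight",
        "typeProperties": {
            "clusterUri": "https://<clustername>.azurehdinsight.net/",
            "userName": "<cluster user name>",
            "password": "<cluster password>",
            "linkedServiceName": "StorageLinkedService"
        }
    }
}
```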
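Similarly, the Azure Storage linked service referenced by fileLinkedService could be defined as shown below. Again, this is a sketch; substitute your own account name and key in the connection string.

```json
{
    "name": "StorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>"
        }
    }
}
```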
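Finally, if the streaming job needs command-line arguments, they go in the arguments array of typeProperties. The following fragment is a hypothetical illustration: the -D generic option and the mapred.reduce.tasks setting are assumptions for demonstration, not values used by the sample above.

```json
"typeProperties": {
    "mapper": "cat.exe",
    "reducer": "wc.exe",
    "input": "wasb://adfsample@<account name>.blob.core.windows.net/example/data/gutenberg/davinci.txt",
    "output": "wasb://adfsample@<account name>.blob.core.windows.net/example/data/StreamingOutput/wc.txt",
    "filePaths": [
        "adfsample/example/apps/wc.exe",
        "adfsample/example/apps/cat.exe"
    ],
    "fileLinkedService": "StorageLinkedService",
    "arguments": [
        "-D",
        "mapred.reduce.tasks=1"
    ]
}
```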