---
title: Create Java MapReduce for Apache Hadoop - Azure HDInsight
description: Learn how to use Apache Maven to create a Java-based MapReduce application, then run it with Hadoop on Azure HDInsight.
services: hdinsight
ms.reviewer: jasonh
author: hrasheed-msft
ms.service: hdinsight
ms.custom: hdinsightactive,hdiseo17may2017
ms.topic: conceptual
ms.date: 04/23/2018
ms.author: hrasheed
---
# Develop Java MapReduce programs for Apache Hadoop on HDInsight

Learn how to use Apache Maven to create a Java-based MapReduce application, then run it with Apache Hadoop on Azure HDInsight.
> [!NOTE]
> This example was most recently tested on HDInsight 3.6.
## Prerequisites

- Java JDK 8 or later (or an equivalent, such as OpenJDK).

  > [!NOTE]
  > HDInsight versions 3.4 and earlier use Java 7. HDInsight 3.5 and later use Java 8.

- Apache Maven.
## Configure environment variables

The following environment variables may be set when you install Java and the JDK. However, you should check that they exist and that they contain the correct values for your system.

- `JAVA_HOME` - should point to the directory where the Java runtime environment (JRE) is installed. For example, on an OS X, Unix, or Linux system, it should have a value similar to `/usr/lib/jvm/java-8-oracle`. In Windows, it would have a value similar to `c:\Program Files (x86)\Java\jre1.8`.
- `PATH` - should contain the following paths:
  - `JAVA_HOME` (or the equivalent path)
  - `JAVA_HOME\bin` (or the equivalent path)
  - The directory where Maven is installed
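To verify these values before you continue, a quick check from a terminal works (a minimal sketch; the commented export lines are examples only and must be adjusted to your actual install paths):

```bash
# Print the current value; it should be non-empty and point at your JDK.
echo "JAVA_HOME is: $JAVA_HOME"

# Verify that the java and mvn commands resolve through PATH.
which java
which mvn

# If JAVA_HOME is not set, set it for the current session (example path only):
# export JAVA_HOME=/usr/lib/jvm/java-8-oracle
# export PATH="$JAVA_HOME/bin:$PATH"
```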
## Create a Maven project

1. From a terminal session, or command line in your development environment, change directories to the location you want to store this project.

2. Use the `mvn` command, which is installed with Maven, to generate the scaffolding for the project.

   ```bash
   mvn archetype:generate -DgroupId=org.apache.hadoop.examples -DartifactId=wordcountjava -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
   ```

   > [!NOTE]
   > If you are using PowerShell, you must enclose the `-D` parameters in double quotes.
   >
   > ```powershell
   > mvn archetype:generate "-DgroupId=org.apache.hadoop.examples" "-DartifactId=wordcountjava" "-DarchetypeArtifactId=maven-archetype-quickstart" "-DinteractiveMode=false"
   > ```
   This command creates a directory with the name specified by the `artifactId` parameter (wordcountjava in this example). This directory contains the following items:

   - `pom.xml` - The Project Object Model (POM) that contains information and configuration details used to build the project.
   - `src` - The directory that contains the application.
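   To see exactly what was generated (a sketch; file names can vary slightly between archetype versions), list the new directory:

   ```bash
   # Show every file the quickstart archetype created.
   find wordcountjava -type f
   ```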
## Add dependencies and build configuration

1. Delete the `src/test/java/org/apache/hadoop/examples/apptest.java` file. It is not used in this example.
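   For example, from the directory that contains the project (a sketch; if the archetype generated the file as `AppTest.java`, adjust the name to match what you see):

   ```bash
   cd wordcountjava
   rm src/test/java/org/apache/hadoop/examples/apptest.java
   ```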
2. Edit the `pom.xml` file and add the following text inside the `<dependencies>` section:

   ```xml
   <dependency>
       <groupId>org.apache.hadoop</groupId>
       <artifactId>hadoop-mapreduce-examples</artifactId>
       <version>2.7.3</version>
       <scope>provided</scope>
   </dependency>
   <dependency>
       <groupId>org.apache.hadoop</groupId>
       <artifactId>hadoop-mapreduce-client-common</artifactId>
       <version>2.7.3</version>
       <scope>provided</scope>
   </dependency>
   <dependency>
       <groupId>org.apache.hadoop</groupId>
       <artifactId>hadoop-common</artifactId>
       <version>2.7.3</version>
       <scope>provided</scope>
   </dependency>
   ```

   This defines required libraries (listed within `<artifactId>`) with a specific version (listed within `<version>`). At compile time, these dependencies are downloaded from the default Maven repository. You can use the Maven repository search to view more.

   The `<scope>provided</scope>` element tells Maven that these dependencies should not be packaged with the application, as they are provided by the HDInsight cluster at run time.

   > [!IMPORTANT]
   > The version used should match the version of Hadoop present on your cluster. For more information on versions, see the HDInsight component versioning document.
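   One way to check which Hadoop version the cluster actually provides before pinning `<version>` (a sketch; replace USERNAME and CLUSTERNAME with your own values):

   ```bash
   # Query the Hadoop version on the cluster head node over SSH.
   ssh USERNAME@CLUSTERNAME-ssh.azurehdinsight.net "hadoop version"
   ```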
3. Add the following to the `pom.xml` file. This text must be inside the `<project>...</project>` tags in the file; for example, between `</dependencies>` and `</project>`.

   ```xml
   <build>
       <plugins>
           <plugin>
               <groupId>org.apache.maven.plugins</groupId>
               <artifactId>maven-shade-plugin</artifactId>
               <version>2.3</version>
               <configuration>
                   <transformers>
                       <transformer implementation="org.apache.maven.plugins.shade.resource.ApacheLicenseResourceTransformer">
                       </transformer>
                   </transformers>
               </configuration>
               <executions>
                   <execution>
                       <phase>package</phase>
                       <goals>
                           <goal>shade</goal>
                       </goals>
                   </execution>
               </executions>
           </plugin>
           <plugin>
               <groupId>org.apache.maven.plugins</groupId>
               <artifactId>maven-compiler-plugin</artifactId>
               <version>3.6.1</version>
               <configuration>
                   <source>1.8</source>
                   <target>1.8</target>
               </configuration>
           </plugin>
       </plugins>
   </build>
   ```

   The first plugin configures the Maven Shade Plugin, which builds an uberjar (sometimes called a fatjar) containing the dependencies required by the application. It also prevents duplication of licenses within the jar package, which can cause problems on some systems.

   The second plugin configures the target Java version.

   > [!NOTE]
   > HDInsight 3.4 and earlier use Java 7. HDInsight 3.5 and later use Java 8.
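   To confirm which JDK Maven itself compiles with (it should report Java 8 or later for HDInsight 3.5 and later), a quick check:

   ```bash
   # Print Maven's version, including the Java version it runs on.
   mvn -v
   ```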
4. Save the `pom.xml` file.
## Create the MapReduce application

1. Go to the `wordcountjava/src/main/java/org/apache/hadoop/examples` directory and rename the `App.java` file to `WordCount.java`.

2. Open the `WordCount.java` file in a text editor and replace the contents with the following text:

   ```java
   package org.apache.hadoop.examples;

   import java.io.IOException;
   import java.util.StringTokenizer;
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.io.IntWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapreduce.Job;
   import org.apache.hadoop.mapreduce.Mapper;
   import org.apache.hadoop.mapreduce.Reducer;
   import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
   import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
   import org.apache.hadoop.util.GenericOptionsParser;

   public class WordCount {

       // The mapper splits each input line into tokens and emits a (word, 1) pair per token.
       public static class TokenizerMapper
               extends Mapper<Object, Text, Text, IntWritable> {

           private final static IntWritable one = new IntWritable(1);
           private Text word = new Text();

           public void map(Object key, Text value, Context context
                           ) throws IOException, InterruptedException {
               StringTokenizer itr = new StringTokenizer(value.toString());
               while (itr.hasMoreTokens()) {
                   word.set(itr.nextToken());
                   context.write(word, one);
               }
           }
       }

       // The reducer (also used as a combiner) sums the counts emitted for each word.
       public static class IntSumReducer
               extends Reducer<Text, IntWritable, Text, IntWritable> {

           private IntWritable result = new IntWritable();

           public void reduce(Text key, Iterable<IntWritable> values,
                              Context context
                              ) throws IOException, InterruptedException {
               int sum = 0;
               for (IntWritable val : values) {
                   sum += val.get();
               }
               result.set(sum);
               context.write(key, result);
           }
       }

       public static void main(String[] args) throws Exception {
           Configuration conf = new Configuration();
           String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
           if (otherArgs.length != 2) {
               System.err.println("Usage: wordcount <in> <out>");
               System.exit(2);
           }
           Job job = new Job(conf, "word count");
           job.setJarByClass(WordCount.class);
           job.setMapperClass(TokenizerMapper.class);
           job.setCombinerClass(IntSumReducer.class);
           job.setReducerClass(IntSumReducer.class);
           job.setOutputKeyClass(Text.class);
           job.setOutputValueClass(IntWritable.class);
           FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
           FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
           System.exit(job.waitForCompletion(true) ? 0 : 1);
       }
   }
   ```

   Notice the package name is `org.apache.hadoop.examples` and the class name is `WordCount`. You use these names when you submit the MapReduce job.
3. Save the file.
## Build the application

1. Change to the `wordcountjava` directory, if you are not already there.

2. Use the following command to build a JAR file containing the application:

   ```bash
   mvn clean package
   ```

   This command cleans any previous build artifacts, downloads any dependencies that have not already been installed, and then builds and packages the application.
3. Once the command finishes, the `wordcountjava/target` directory contains a file named `wordcountjava-1.0-SNAPSHOT.jar`.

   > [!NOTE]
   > The `wordcountjava-1.0-SNAPSHOT.jar` file is an uberjar, which contains not only the WordCount job, but also dependencies that the job requires at runtime.
## Upload the jar and run jobs (SSH)

1. Use the following command to upload the jar file to the HDInsight head node:

   ```bash
   scp target/wordcountjava-1.0-SNAPSHOT.jar USERNAME@CLUSTERNAME-ssh.azurehdinsight.net:
   ```

   Replace USERNAME with your SSH user name for the cluster. Replace CLUSTERNAME with the HDInsight cluster name.

   This command copies the file from the local system to the head node. For more information, see Use SSH with HDInsight.
2. Connect to HDInsight using SSH. For more information, see Use SSH with HDInsight.
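   For example, using the same USERNAME and CLUSTERNAME placeholders as above:

   ```bash
   ssh USERNAME@CLUSTERNAME-ssh.azurehdinsight.net
   ```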
3. From the SSH session, use the following command to run the MapReduce application:

   ```bash
   yarn jar wordcountjava-1.0-SNAPSHOT.jar org.apache.hadoop.examples.WordCount /example/data/gutenberg/davinci.txt /example/data/wordcountout
   ```

   This command starts the WordCount MapReduce application. The input file is `/example/data/gutenberg/davinci.txt`, and the output directory is `/example/data/wordcountout`. Both the input file and the output are stored in the default storage for the cluster.
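   MapReduce fails if the output directory already exists, so if you rerun the job, remove the previous output first (a sketch, assuming the same output path as above):

   ```bash
   # Delete the output directory left over from a previous run.
   hdfs dfs -rm -r /example/data/wordcountout
   ```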
4. Once the job completes, use the following command to view the results:

   ```bash
   hdfs dfs -cat /example/data/wordcountout/*
   ```

   You should receive a list of words and counts, with values similar to the following text:

   ```output
   zeal    1
   zelus   1
   zenith  2
   ```
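   If you want the most frequent words rather than the raw alphabetical listing, one option is to sort the output with standard shell tools on the head node (a sketch):

   ```bash
   # Show the ten highest counts by sorting on the second column.
   hdfs dfs -cat /example/data/wordcountout/* | sort -k2 -n -r | head -n 10
   ```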
## Next steps

In this document, you have learned how to develop a Java MapReduce job. See the following documents for other ways to work with HDInsight.

For more information, see also the Java Developer Center.