---
title: Use Apache Pig with SSH on an HDInsight cluster - Azure
description: Learn how to connect to a Linux-based Apache Hadoop cluster with SSH, and then use the Pig command to run Pig Latin statements interactively, or as a batch job.
services: hdinsight
author: hrasheed-msft
ms.reviewer: jasonh
ms.service: hdinsight
ms.custom: hdinsightactive
ms.topic: conceptual
ms.date: 02/27/2018
ms.author: hrasheed
---
[!INCLUDE pig-selector]
Learn how to interactively run Apache Pig jobs from an SSH connection to your HDInsight cluster. The Pig Latin programming language allows you to describe transformations that are applied to the input data to produce the desired output.
> [!IMPORTANT]
> The steps in this document require a Linux-based HDInsight cluster. Linux is the only operating system used on HDInsight version 3.4 or greater. For more information, see HDInsight retirement on Windows.
1. Use SSH to connect to your HDInsight cluster. The following example connects to a cluster named `myhdinsight` as the account named `sshuser`:
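   A minimal example, assuming the cluster's default SSH endpoint of `CLUSTERNAME-ssh.azurehdinsight.net`:

   ```bash
   # Connect as sshuser to the myhdinsight cluster over SSH.
   ssh sshuser@myhdinsight-ssh.azurehdinsight.net
   ```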
   For more information, see Use SSH with HDInsight.
2. Once connected, start the Pig command-line interface (CLI) by using the following command:

   ```bash
   pig
   ```

   After a moment, the prompt changes to `grunt>`.
3. Enter the following statement:

   ```
   LOGS = LOAD '/example/data/sample.log';
   ```

   This command loads the contents of the sample.log file into `LOGS`. You can view the contents of the file by using the following statement:

   ```
   DUMP LOGS;
   ```
4. Next, transform the data by applying a regular expression to extract only the logging level from each record by using the following statement:

   ```
   LEVELS = foreach LOGS generate REGEX_EXTRACT($0, '(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)', 1) as LOGLEVEL;
   ```

   You can use `DUMP` to view the data after the transformation. In this case, use `DUMP LEVELS;`.
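   For example, given a hypothetical record from sample.log whose text contains `[INFO]`, `REGEX_EXTRACT` returns the first capture group of the regular expression, so the corresponding row in `LEVELS` is `(INFO)`.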
5. Continue applying transformations by using the statements in the following table:

   | Pig Latin statement | What the statement does |
   | --- | --- |
   | `FILTEREDLEVELS = FILTER LEVELS by LOGLEVEL is not null;` | Removes rows that contain a null value for the log level, and stores the results into `FILTEREDLEVELS`. |
   | `GROUPEDLEVELS = GROUP FILTEREDLEVELS by LOGLEVEL;` | Groups the rows by log level, and stores the results into `GROUPEDLEVELS`. |
   | `FREQUENCIES = foreach GROUPEDLEVELS generate group as LOGLEVEL, COUNT(FILTEREDLEVELS.LOGLEVEL) as COUNT;` | Creates a set of data that contains each unique log level value and how many times it occurs. The data is stored into `FREQUENCIES`. |
   | `RESULT = order FREQUENCIES by COUNT desc;` | Orders the log levels by count (descending), and stores the results into `RESULT`. |

   > [!TIP]
   > Use `DUMP` to view the result of the transformation after each step.
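   With the sample data used in this document, these transformations ultimately produce one tuple per log level, such as `(TRACE,816)`; the full result appears in the batch job output at the end of this document.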
6. You can also save the results of a transformation by using the `STORE` statement. For example, the following statement saves the `RESULT` to the `/example/data/pigout` directory on the default storage for your cluster:

   ```
   STORE RESULT into '/example/data/pigout';
   ```

   > [!NOTE]
   > The data is stored in the specified directory in files named `part-nnnnn`. If the directory already exists, you receive an error.
7. To exit the grunt prompt, enter the following statement:

   ```
   QUIT;
   ```
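After you exit, you can verify the data written by the `STORE` statement from the SSH session. A minimal sketch, assuming you ran the `STORE` statement in the previous steps and that the `hdfs` client is available on the cluster head node:

```bash
# List the part-nnnnn files written by the STORE statement.
hdfs dfs -ls /example/data/pigout

# Print the contents of the part files to the console.
hdfs dfs -cat /example/data/pigout/part-*
```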
You can also use the Pig command to run Pig Latin statements contained in a file.
1. After exiting the grunt prompt, use the following command to create a file named `pigbatch.pig`:

   ```bash
   nano ~/pigbatch.pig
   ```
2. Type or paste the following lines:

   ```
   LOGS = LOAD '/example/data/sample.log';
   LEVELS = foreach LOGS generate REGEX_EXTRACT($0, '(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)', 1) as LOGLEVEL;
   FILTEREDLEVELS = FILTER LEVELS by LOGLEVEL is not null;
   GROUPEDLEVELS = GROUP FILTEREDLEVELS by LOGLEVEL;
   FREQUENCIES = foreach GROUPEDLEVELS generate group as LOGLEVEL, COUNT(FILTEREDLEVELS.LOGLEVEL) as COUNT;
   RESULT = order FREQUENCIES by COUNT desc;
   DUMP RESULT;
   ```

   When finished, use Ctrl + X, then Y, and then Enter to save the file.
3. Use the following command to run the `pigbatch.pig` file by using the Pig command:

   ```bash
   pig ~/pigbatch.pig
   ```

   Once the batch job finishes, you see the following output:

   ```
   (TRACE,816)
   (DEBUG,434)
   (INFO,96)
   (WARN,11)
   (ERROR,6)
   (FATAL,2)
   ```
For general information on Pig in HDInsight, see the following document:
For more information on other ways to work with Hadoop on HDInsight, see the following documents: