Merge pull request MicrosoftDocs#3025 from ShawnJackson/patch-58
Edit hdinsight-hadoop-use-hive-curl.md
cjgronlund committed Mar 9, 2015
2 parents c6b2ac4 + 14e9093 commit ad824f5
Showing 1 changed file with 44 additions and 44 deletions.
88 changes: 44 additions & 44 deletions articles/hdinsight-hadoop-use-hive-curl.md
@@ -20,137 +20,137 @@

[AZURE.INCLUDE [hive-selector](../includes/hdinsight-selector-use-hive.md)]

In this document, you will learn how to use Curl to run Hive queries on a Hadoop on HDInsight cluster.
In this document, you will learn how to use Curl to run Hive queries on a Hadoop on Azure HDInsight cluster.

Curl is used to demonstrate how you can interact with HDInsight using raw HTTP requests to run, monitor, and retrieve the results of Hive queries. This works by using the WebHCat REST API (formerly known as Templeton,) provided by your HDInsight cluster.
Curl is used to demonstrate how you can interact with HDInsight by using raw HTTP requests to run, monitor, and retrieve the results of Hive queries. This works by using the WebHCat REST API (formerly known as Templeton) provided by your HDInsight cluster.

> [AZURE.NOTE] If you are already familiar with using Linux-based Hadoop servers, but are new to HDInsight, see <a href="../hdinsight-hadoop-linux-information/" target="_blank">What you need to know about Hadoop on Linux-based HDInsight</a>.
##<a id="prereq"></a>Prerequisites

To complete the steps in this article, you will need the following.
To complete the steps in this article, you will need the following:

* A Hadoop on HDInsight cluster (Linux or Windows-based)

* <a href="http://curl.haxx.se/" target="_blank">Curl</a>

* <a href="http://stedolan.github.io/jq/" target="_blank">jq</a>

##<a id="curl"></a>Run Hive queries using Curl
##<a id="curl"></a>Run Hive queries by using Curl

> [AZURE.NOTE] When using Curl or any other REST communication with WebHCat, you must authenticate the requests by providing the HDInsight cluster administrator username and password. You must also use the cluster name as part of the URI used to send the requests to the server.
> [AZURE.NOTE] When using Curl or any other REST communication with WebHCat, you must authenticate the requests by providing the user name and password for the HDInsight cluster administrator. You must also use the cluster name as part of the Uniform Resource Identifier (URI) used to send the requests to the server.
>
> For the commands in this section, replace **USERNAME** with the user to authenticate to the cluster, and **PASSWORD** with the password for the user account. Replace **CLUSTERNAME** with the name of your cluster.
> For the commands in this section, replace **USERNAME** with the user to authenticate to the cluster, and replace **PASSWORD** with the password for the user account. Replace **CLUSTERNAME** with the name of your cluster.
>
> The REST API is secured using <a href="http://en.wikipedia.org/wiki/Basic_access_authentication" target="_blank">basic authentication</a>. You should always make requests using HTTPS to ensure that your credentials are securely sent to the server.
> The REST API is secured via <a href="http://en.wikipedia.org/wiki/Basic_access_authentication" target="_blank">basic authentication</a>. You should always make requests by using Secure HTTP (HTTPS) to help ensure that your credentials are securely sent to the server.
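
As a rough illustration of what basic authentication involves, the following shell sketch builds the `Authorization` header that Curl's `-u` option sends on your behalf. The function name `basic_auth_header` is hypothetical, the credentials are placeholders, and the sketch assumes a `base64` utility is available.

```shell
# Build the Authorization header used by HTTP basic authentication.
# $1 = user name, $2 = password (placeholders, as elsewhere in this article).
basic_auth_header() {
    # Join user name and password with a colon, base64-encode the result,
    # and strip any line wrapping that base64 may add.
    encoded=$(printf '%s:%s' "$1" "$2" | base64 | tr -d '\n')
    printf 'Authorization: Basic %s\n' "$encoded"
}
```

For example, `curl -H "$(basic_auth_header USERNAME PASSWORD)" -G https://CLUSTERNAME.azurehdinsight.net/templeton/v1/status` should behave like the `-u` form used in this article. Because base64 is an encoding rather than encryption, the credentials are readable by anyone who can see the request, which is why HTTPS matters here.
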
1. From a command-line, use the following command to verify that you can connect to your HDInsight cluster.
1. From a command line, use the following command to verify that you can connect to your HDInsight cluster:

curl -u USERNAME:PASSWORD -G https://CLUSTERNAME.azurehdinsight.net/templeton/v1/status

You should receive a response similar to the following.
You should receive a response similar to the following:

{"status":"ok","version":"v1"}

The parameters used in this command are as follows.
The parameters used in this command are as follows:

* **-u** - the user name and password used to authenticate the request
* **-G** - indicates that this is a GET request
* **-u** - The user name and password used to authenticate the request.
* **-G** - Indicates that this is a GET request.

The beginning of the url, **https://CLUSTERNAME.azurehdinsight.net/templeton/v1**, will be the same for all requests. The path, **/status**, indicates that the request is to return status of WebHCat (also known as Templeton,) for the server. You can also request the version of Hive using the following command.
The beginning of the URL, **https://CLUSTERNAME.azurehdinsight.net/templeton/v1**, will be the same for all requests. The path, **/status**, indicates that the request is to return a status of WebHCat (also known as Templeton) for the server. You can also request the version of Hive by using the following command:

curl -u USERNAME:PASSWORD -G https://CLUSTERNAME.azurehdinsight.net/templeton/v1/version/hive

This should return a response similar to the following.
This should return a response similar to the following:

{"module":"hive","version":"0.13.0.2.1.6.0-2103"}

2. Use the following to create a new table named **log4jLogs**.
2. Use the following to create a new table named **log4jLogs**:

curl -u USERNAME:PASSWORD -d user.name=USERNAME -d execute="DROP+TABLE+log4jLogs;CREATE+EXTERNAL+TABLE+log4jLogs(t1+string,t2+string,t3+string,t4+string,t5+string,t6+string,t7+string)+ROW+FORMAT+DELIMITED+FIELDS+TERMINATED+BY+' '+STORED+AS+TEXTFILE+LOCATION+'wasb:///example/data/';SELECT+t4+AS+sev,COUNT(*)+AS+count+FROM+log4jLogs+WHERE+t4+=+'[ERROR]'+GROUP+BY+t4;" -d statusdir="wasb:///example/curl" https://CLUSTERNAME.azurehdinsight.net/templeton/v1/hive

The parameters used in this command are as follows.
The parameters used in this command are as follows:

* **-d** - since `-G` is not used, the request defaults to the POST method. `-d` specifies the data values that are sent with the request
* **-d** - Since `-G` is not used, the request defaults to the POST method. `-d` specifies the data values that are sent with the request.

* **user.name** - the user that is running the command
* **user.name** - The user that is running the command.

* **execute** - the HiveQL statements to execute
* **execute** - The HiveQL statements to execute.

* **statusdir** - the directory that status for this job will be written to
* **statusdir** - The directory that the status for this job will be written to.

These statements perform the following actions.
These statements perform the following actions:

* **DROP TABLE** - deletes the table and the data file, in case the table already exists
* **DROP TABLE** - Deletes the table, if it already exists. Because **log4jLogs** is created as an external table, dropping it removes only the table definition, not the data file.

* **CREATE EXTERNAL TABLE** - creates a new 'external' table in Hive. External tables only store the table definition in Hive - the data is left in the original location
* **CREATE EXTERNAL TABLE** - Creates a new 'external' table in Hive. External tables store only the table definition in Hive. The data is left in the original location.

> [AZURE.NOTE] External tables should be used when you expect the underlying data to be updated by an external source, such as an automated data upload process, or by another MapReduce operation, but always want Hive queries to use the latest data.
>
> Dropping an external table does **not** delete the data, only the table definition.

* **ROW FORMAT** - tells Hive how the data is formatted. In this case, the fields in each log are separated by a space
* **ROW FORMAT** - Tells Hive how the data is formatted. In this case, the fields in each log are separated by a space.

* **STORED AS TEXTFILE LOCATION** - tells Hive where the data is stored (the example/data directory,) and that it is stored as text
* **STORED AS TEXTFILE LOCATION** - Tells Hive where the data is stored (the example/data directory), and that it is stored as text.

* **SELECT** - select a count of all rows where column **t4** contain the value **[ERROR]**. This should return a value of **3** as there are three rows that contain this value
* **SELECT** - Selects a count of all rows where column **t4** contains the value **[ERROR]**. This should return a value of **3** as there are three rows that contain this value.

> [AZURE.NOTE] Notice that the spaces between HiveQL statements are replaced by the `+` character when used with Curl. Quoted values that contain a space, such as the delimiter, should not be replaced by `+`.

This command should return a job ID that can be used to check the status of the job.

{"id":"job_1415651640909_0026"}

3. To check the status of the job, use the following command. Replace the **JOBID** with the value returned in the previous step. For example, if the return value was `{"id":"job_1415651640909_0026"}` then the JOBID would be `job_1415651640909_0026`.
3. To check the status of the job, use the following command. Replace **JOBID** with the value returned in the previous step. For example, if the return value was `{"id":"job_1415651640909_0026"}`, then **JOBID** would be `job_1415651640909_0026`.

curl -G -u USERNAME:PASSWORD -d user.name=USERNAME https://CLUSTERNAME.azurehdinsight.net/templeton/v1/jobs/JOBID | jq .status.state

If the job has completed, the state will be "SUCCEEDED".
If the job has finished, the state will be **SUCCEEDED**.

> [AZURE.NOTE] This curl request returns a JSON document with information about the job; jq is used to retrieve only the state value.
> [AZURE.NOTE] This Curl request returns a JavaScript Object Notation (JSON) document with information about the job; jq is used to retrieve only the state value.

4. Once the state of the job has changed to **SUCCEEDED**, you can retrieve the results of the job from Azure Blob Storage. The `statusdir` parameter passed with the query contains the location of the output file; in this case, **wasb:///example/curl**. This address stores the output of the job in the **example/curl** directory on the default storage container used by your HDInsight cluster.
4. Once the state of the job has changed to **SUCCEEDED**, you can retrieve the results of the job from Azure Blob storage. The `statusdir` parameter passed with the query contains the location of the output file; in this case, **wasb:///example/curl**. This address stores the output of the job in the **example/curl** directory on the default storage container used by your HDInsight cluster.

You can list and download these files using the <a href="../xplat-cli/" target="_blank">Azure Cross-Platform Command-Line Interface (xplat-cli)</a>. For example, to list files in the **example/curl**, use the following command.
You can list and download these files by using the <a href="../xplat-cli/" target="_blank">Azure Cross-Platform Command-Line Interface (xplat-cli)</a>. For example, to list files in **example/curl**, use the following command:

azure storage blob list <container-name> example/curl

To download a file, use the following.
To download a file, use the following:

azure storage blob download <container-name> <blob-name> <destination-file>

> [AZURE.NOTE] You must either specify the storage account name that contains the blob using the `-a` and `-k` parameters, or set the **AZURE\_STORAGE\_ACCOUNT** and **AZURE\_STORAGE\_ACCESS\_KEY** environment variables. See <a href="../hdinsight-upload-data/" target="_blank">Upload data to HDInsight</a> for more information.
> [AZURE.NOTE] You must either specify the storage account name that contains the blob by using the `-a` and `-k` parameters, or set the **AZURE\_STORAGE\_ACCOUNT** and **AZURE\_STORAGE\_ACCESS\_KEY** environment variables. See <a href="../hdinsight-upload-data/" target="_blank">Upload data to HDInsight</a> for more information.

6. Use the following statements to create a new 'internal' table named **errorLogs**.
6. Use the following statements to create a new 'internal' table named **errorLogs**:

curl -u USERNAME:PASSWORD -d user.name=USERNAME -d execute="CREATE+TABLE+IF+NOT+EXISTS+errorLogs(t1+string,t2+string,t3+string,t4+string,t5+string,t6+string,t7+string)+STORED+AS+ORC;INSERT+OVERWRITE+TABLE+errorLogs+SELECT+t1,t2,t3,t4,t5,t6,t7+FROM+log4jLogs+WHERE+t4+=+'[ERROR]';SELECT+*+from+errorLogs;" -d statusdir="wasb:///example/curl" https://CLUSTERNAME.azurehdinsight.net/templeton/v1/hive

These statements perform the following actions.
These statements perform the following actions:

* **CREATE TABLE IF NOT EXISTS** - creates a table, if it does not already exist. Since the **EXTERNAL** keyword is not used, this is an 'internal' table, which is stored in the Hive data warehouse and is managed completely by Hive
* **CREATE TABLE IF NOT EXISTS** - Creates a table, if it does not already exist. Since the **EXTERNAL** keyword is not used, this is an internal table, which is stored in the Hive data warehouse and is managed completely by Hive.

> [AZURE.NOTE] Unlike **EXTERNAL** tables, dropping an internal table will delete the underlying data as well.
> [AZURE.NOTE] Unlike external tables, dropping an internal table will delete the underlying data as well.

* **STORED AS ORC** - stores the data in Optimized Row Columnar (ORC) format. This is a highly optimized and efficient format for storing Hive data
* **INSERT OVERWRITE ... SELECT** - selects rows from the **log4jLogs** table that contain **[ERROR]**, then insert the data into the **errorLogs** table
* **SELECT * ** - selects all rows from the new **errorLogs** table.
* **STORED AS ORC** - Stores the data in Optimized Row Columnar (ORC) format. This is a highly optimized and efficient format for storing Hive data.
* **INSERT OVERWRITE ... SELECT** - Selects rows from the **log4jLogs** table that contain **[ERROR]**, then inserts the data into the **errorLogs** table.
* **SELECT** - Selects all rows from the new **errorLogs** table.

7. Use the job ID returned to check the status of the job. Once it has succeeded, use the xplat-cli as described previously to download and view the results. The output should contain three lines, all of which contain **[ERROR]**.
7. Use the job ID returned to check the status of the job. Once it has succeeded, use xplat-cli as described previously to download and view the results. The output should contain three lines, all of which contain **[ERROR]**.
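
Rather than rerunning the status command by hand, the check in steps 3 and 7 can be wrapped in a loop. The following is a minimal sketch, not part of the original article: the function names and the `HDI_USER`, `HDI_PASSWORD`, and `HDI_CLUSTER` environment variables are hypothetical, and it assumes that **FAILED** and **KILLED** are the other final states a job can report.

```shell
# Hypothetical helper: print the current state of a WebHCat job.
# Issues the same request as step 3; requires curl and jq.
# $1 = job ID, for example job_1415651640909_0026.
get_job_state() {
    curl -s -G -u "$HDI_USER:$HDI_PASSWORD" -d user.name="$HDI_USER" \
        "https://$HDI_CLUSTER.azurehdinsight.net/templeton/v1/jobs/$1" |
        jq -r .status.state
}

# Poll until the job reaches a final state, then print that state.
# $1 = job ID, $2 = seconds to wait between checks (default 10).
# Returns 0 if the job succeeded and 1 if it failed or was killed.
wait_for_job() {
    while true; do
        state=$(get_job_state "$1")
        case "$state" in
            SUCCEEDED) echo "$state"; return 0 ;;
            FAILED|KILLED) echo "$state"; return 1 ;;
        esac
        sleep "${2:-10}"
    done
}
```

With the three variables set, `wait_for_job job_1415651640909_0026 && echo "Results are ready"` would block until the job finishes. The loop has no upper bound on the number of checks, so a timeout may be worth adding for unattended use.
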


##<a id="summary"></a>Summary

As demonstrated in this document, you can use raw HTTP request to run, monitor, and view the results of Hive jobs on your HDInsight cluster.
As demonstrated in this document, you can use a raw HTTP request to run, monitor, and view the results of Hive jobs on your HDInsight cluster.

For more information on the REST interface used in this article, see the <a href="https://cwiki.apache.org/confluence/display/Hive/WebHCat+Reference" target="_blank">WebHCat Reference</a>.
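
The submission step can also be scripted. The sketch below is one possible approach rather than the method used in this article: it relies on Curl's `--data-urlencode` option to encode the HiveQL, so the query can be written with ordinary spaces instead of `+` characters. The function name and the `HDI_USER`, `HDI_PASSWORD`, and `HDI_CLUSTER` environment variables are hypothetical.

```shell
# Hypothetical helper: submit HiveQL to WebHCat and print the job ID.
# $1 = HiveQL text, $2 = status directory (for example wasb:///example/curl).
# Requires curl and jq.
submit_hive_query() {
    # --data-urlencode encodes everything after "execute=", so spaces and
    # quotes in the query do not need to be replaced by hand.
    curl -s -u "$HDI_USER:$HDI_PASSWORD" \
        -d user.name="$HDI_USER" \
        --data-urlencode "execute=$1" \
        -d statusdir="$2" \
        "https://$HDI_CLUSTER.azurehdinsight.net/templeton/v1/hive" |
        jq -r .id
}
```

For example, `submit_hive_query "SELECT * FROM errorLogs;" "wasb:///example/curl"` should print only the job ID from the `{"id":"..."}` response, ready to be passed to a status check.
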

##<a id="nextsteps"></a>Next steps

For general information on Hive with HDInsight.
For general information on Hive with HDInsight:

* [Use Hive with Hadoop on HDInsight](../hdinsight-use-hive/)

For information on other ways you can work with Hadoop on HDInsight.
For information on other ways you can work with Hadoop on HDInsight:

* [Use Pig with Hadoop on HDInsight](../hdinsight-use-pig/)
