title	description	keywords	services	documentationcenter	author	manager	editor	tags	ms.assetid	ms.service	ms.custom	ms.workload	ms.tgt_pltfrm	ms.devlang	ms.topic	ms.date	ms.author
Create an Apache Spark cluster in Azure HDInsight \| Microsoft Docs	HDInsight Spark quickstart on how to create an Apache Spark cluster in HDInsight.	spark quickstart,interactive spark,interactive query,hdinsight spark,azure spark	hdinsight		nitinme	jhubbard	cgronlun	azure-portal	91f41e6a-d463-4eb4-83ef-7bbb1f4556cc	hdinsight	hdinsightactive,hdiseo17may2017	big-data	na	na	get-started-article	09/07/2017	nitinme

Create an Apache Spark cluster in Azure HDInsight

In this article, you learn how to create an Apache Spark cluster in Azure HDInsight and then run a Spark SQL query on a Hive table. For information on Spark on HDInsight, see Overview: Apache Spark on Azure HDInsight.

Prerequisites

An Azure subscription. Before you begin this tutorial, you must have an Azure subscription. See Create your free Azure account today.

Create HDInsight Spark cluster

In this section, you create an HDInsight Spark cluster using an Azure Resource Manager template. For other cluster creation methods, see Create HDInsight clusters.

Click the following image to open the template in the Azure portal.
Enter the following values:
- Subscription: Select your Azure subscription for this cluster.
- Resource group: Create a resource group or select an existing one. Resource group is used to manage Azure resources for your projects.
- Location: Select a location for the resource group. The template uses this location for creating the cluster as well as for the default cluster storage.
- ClusterName: Enter a name for the HDInsight cluster that you want to create.
- Spark version: Select 2.0 as the version that you want to install on the cluster.
- Cluster login name and password: The default login name is admin.
- SSH user name and password.
Write down these values. You need them later in the tutorial.
Select I agree to the terms and conditions stated above, select Pin to dashboard, and then click Purchase. You can see a new tile titled Submitting deployment for Template deployment. It takes about 20 minutes to create the cluster.

If you run into an issue with creating HDInsight clusters, it could be that you do not have the right permissions to do so. For more information, see Access control requirements.

Note

This article creates a Spark cluster that uses Azure Storage Blobs as the cluster storage. You can also create a Spark cluster that uses Azure Data Lake Store as the default storage. For instructions, see Create an HDInsight cluster with Data Lake Store.

Run Spark SQL statements on a Hive table

SQL (Structured Query Language) is the most common and widely used language for querying and defining data. The founders of Spark sought to harness this knowledge, opening up the well-known data querying language to a wider audience of analysts who wish to work with data that lives on Hadoop Distributed File System (HDFS). Spark SQL is that offering. It functions as an extension to Apache Spark for processing structured data, using the familiar SQL syntax.

Spark SQL supports both SQL and HiveQL as query languages. Its capabilities include binding in Python, Scala, and Java. With it, you can query data stored in many locations, such as external databases, structured data files (example: JSON), and Hive tables.

Running Spark SQL on an HDInsight cluster

When you use a Jupyter notebook configured for your HDInsight Spark cluster, you get a preset sqlContext that you can use to run Hive queries using Spark SQL. In this section, you learn how to start a Jupyter notebook and then run a basic Spark SQL query on an existing Hive table (hivesampletable) that is available on all HDInsight clusters.

Open the Azure portal.
If you opted to pin the cluster to the dashboard, click the cluster tile from the dashboard to launch the cluster blade.

If you did not pin the cluster to the dashboard, from the left pane, click HDInsight clusters, and then click the cluster you created.
From Quick links, click Cluster dashboards, and then click Jupyter Notebook. If prompted, enter the admin credentials for the cluster.

[!NOTE] You may also access the Jupyter notebook for your cluster by opening the following URL in your browser. Replace CLUSTERNAME with the name of your cluster:

https://CLUSTERNAME.azurehdinsight.net/jupyter
Create a notebook. Click New, and then click PySpark.

A new notebook is created and opened with the name Untitled(Untitled.pynb).
Click the notebook name at the top, and enter a friendly name if you want.
Paste the following code in an empty cell, and then press SHIFT + ENTER to run the code. In the code below, %%sql (called the sql magic) tells Jupyter notebook to use the preset sqlContext to run the Hive query. The query retrieves the top 10 rows from a Hive table (hivesampletable) that is available by default on all HDInsight clusters.
```
%%sql
SELECT * FROM hivesampletable LIMIT 10
```
For more information on the %%sql magic and the preset contexts, see Jupyter kernels available for an HDInsight cluster.

[!NOTE] Every time you run a query in Jupyter, your web browser window title shows a (Busy) status along with the notebook title. You also see a solid circle next to the PySpark text in the top-right corner. After the job is completed, it changes to a hollow circle.
The screen should refresh to show the query output.
Shut down the notebook to release the cluster resources after you have finished running the application. To do so, from the File menu on the notebook, click Close and Halt.
If you plan to complete the next steps at a later time, make sure you delete the HDInsight cluster you created in this article.

[!INCLUDE delete-cluster-warning]

Next step

In this article, you learned how to create an HDInsight Spark cluster and run a basic Spark SQL query. Advance to the next article to learn how to use an HDInsight Spark cluster to run interactive queries on sample data.

[!div class="nextstepaction"] Run interactive queries on an HDInsight Spark cluster

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Files

apache-spark-jupyter-spark-sql.md

apache-spark-jupyter-spark-sql.md

Create an Apache Spark cluster in Azure HDInsight

Prerequisites

Create HDInsight Spark cluster

Run Spark SQL statements on a Hive table

Running Spark SQL on an HDInsight cluster

Next step

Files

apache-spark-jupyter-spark-sql.md

Latest commit

History

apache-spark-jupyter-spark-sql.md

File metadata and controls

Create an Apache Spark cluster in Azure HDInsight

Prerequisites

Create HDInsight Spark cluster

Run Spark SQL statements on a Hive table

Running Spark SQL on an HDInsight cluster

Next step