title	description	services	author	ms.reviewer	ms.service	ms.custom	ms.topic	ms.date	ms.author
Manage resources for Apache Spark cluster on Azure HDInsight	Learn how to manage resources for Spark clusters on Azure HDInsight for better performance.	hdinsight	hrasheed-msft	jasonh	hdinsight	hdinsightactive	conceptual	01/23/2018	hrasheed

Manage resources for Apache Spark cluster on Azure HDInsight

Learn how to access interfaces such as the Apache Ambari UI, the Apache Hadoop YARN UI, and the Spark History Server that are associated with your Apache Spark cluster, and how to tune the cluster configuration for optimal performance.

Prerequisites:

An Apache Spark cluster on HDInsight. For instructions, see Create Apache Spark clusters in Azure HDInsight.

Open the Ambari Web UI

Apache Ambari is used to monitor the cluster and make configuration changes. For more information, see Manage Apache Hadoop clusters in HDInsight by using the Azure portal.

Open the Spark History Server

Spark History Server is the web UI for completed and running Spark applications. It is an extension of Spark's Web UI.

To open the Spark History Server Web UI

From the Azure portal, open the Spark cluster. For more information, see List and show clusters.
From Quick Links, click Cluster Dashboard, and then click Spark History Server.

When prompted, enter the admin credentials for the Spark cluster. You can also open the Spark History Server by browsing to the following URL:
```
https://<ClusterName>.azurehdinsight.net/sparkhistory
```
Replace <ClusterName> with your Spark cluster name.
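
As a small illustration of the substitution described above, the following sketch builds the URL from a cluster name. The function name and the cluster name `mysparkcluster` are hypothetical and used only for the example.

```python
def history_server_url(cluster_name: str) -> str:
    """Return the Spark History Server URL for an HDInsight cluster name."""
    name = cluster_name.strip()
    if not name:
        raise ValueError("cluster_name must not be empty")
    # Same pattern as the URL shown in the article.
    return f"https://{name}.azurehdinsight.net/sparkhistory"


# "mysparkcluster" is a made-up cluster name.
print(history_server_url("mysparkcluster"))
```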

Open the YARN UI

You can use the YARN UI to monitor applications that are currently running on the Spark cluster.

From the Azure portal, open the Spark cluster. For more information, see List and show clusters.
From Quick Links, click Cluster Dashboard, and then click YARN.

[!TIP] You can also launch the YARN UI from the Ambari UI. To launch the Ambari UI, click Cluster Dashboard, and then click HDInsight Cluster Dashboard. From the Ambari UI, click YARN, click Quick Links, click the active Resource Manager, and then click Resource Manager UI.

Optimize clusters for Spark applications

Three key parameters are commonly adjusted for Spark configuration, depending on application requirements: spark.executor.instances, spark.executor.cores, and spark.executor.memory. An executor is a process launched for a Spark application. It runs on a worker node and is responsible for carrying out the tasks for the application. The default number of executors and the executor sizes for each cluster are calculated based on the number of worker nodes and the worker node size. This information is stored in spark-defaults.conf on the cluster head nodes.

These three parameters can be set at the cluster level (for all applications that run on the cluster) or specified for each individual application.
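
To make the effect of the three parameters concrete, here is a minimal sketch that totals the cores and executor memory one application would request. The helper names are hypothetical, and the figures are example values rather than HDInsight defaults. The sketch ignores the per-executor memory overhead that YARN adds to each container, as well as the container used by the driver or application master.

```python
import re

# Multipliers that convert a Spark memory suffix to mebibytes.
_UNITS_MB = {"k": 1 / 1024, "m": 1, "g": 1024, "t": 1024 * 1024}


def memory_to_mb(value: str) -> int:
    """Convert a Spark memory string such as '3072M' or '2G' to mebibytes."""
    match = re.fullmatch(r"(\d+)([kmgt])b?", value.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized memory value: {value!r}")
    amount, unit = match.groups()
    return int(int(amount) * _UNITS_MB[unit])


def requested_resources(instances: int, cores: int, memory: str) -> dict:
    """Total cores and executor memory an application asks for."""
    return {
        "total_cores": instances * cores,
        "total_memory_mb": instances * memory_to_mb(memory),
    }


# Example values only.
settings = {
    "spark.executor.instances": 10,
    "spark.executor.cores": 4,
    "spark.executor.memory": "3072M",
}
totals = requested_resources(
    settings["spark.executor.instances"],
    settings["spark.executor.cores"],
    settings["spark.executor.memory"],
)
print(totals)
```

Comparing these totals with the cluster's available memory and vcores gives a rough idea of how many such applications can run side by side.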

Change the parameters using Ambari UI

From the Ambari UI, click Spark, click Configs, and then expand Custom spark-defaults.
The default values are sized so that four Spark applications can run concurrently on the cluster. You can change these values from the user interface.
Click Save to save the configuration changes. At the top of the page, you are prompted to restart all the affected services. Click Restart.

Change the parameters for an application running in Jupyter notebook

For applications running in a Jupyter notebook, you can use the %%configure magic to make the configuration changes. Ideally, make such changes at the beginning of the application, before you run your first code cell. Doing so ensures that the configuration is applied to the Livy session when it is created. To change the configuration at a later stage in the application, you must use the -f parameter. However, doing so discards all progress in the application.

The following snippet shows how to change the configuration for an application running in Jupyter.

```
%%configure
{"executorMemory": "3072M", "executorCores": 4, "numExecutors":10}
```

Configuration parameters must be passed in as a JSON string and must be on the line immediately after the magic, as shown in the example.
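
If you generate notebook cells programmatically, the layout rule above (magic on the first line, JSON on the next) can be sketched as follows. The function `configure_cell` is a hypothetical helper, not part of Jupyter or Livy. It only produces the cell text, including the -f form used to change the configuration after the session has started.

```python
import json


def configure_cell(executor_memory: str, executor_cores: int,
                   num_executors: int, force: bool = False) -> str:
    """Build the text of a %%configure cell for a Jupyter notebook."""
    magic = "%%configure -f" if force else "%%configure"
    body = json.dumps({
        "executorMemory": executor_memory,
        "executorCores": executor_cores,
        "numExecutors": num_executors,
    })
    # The JSON body goes on the line directly after the magic.
    return f"{magic}\n{body}"


print(configure_cell("3072M", 4, 10))
print(configure_cell("3072M", 4, 10, force=True))
```

Using json.dumps avoids hand-written JSON mistakes such as a missing brace or quote.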

Change the parameters for an application submitted using spark-submit

The following command is an example of how to change the configuration parameters for a batch application that is submitted using spark-submit.

```
spark-submit --class <the application class to execute> --executor-memory 3072M --executor-cores 4 --num-executors 10 <location of application jar file> <application parameters>
```
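
When the command is assembled by a script, building it as an argument list avoids quoting and dash-encoding mistakes. The following is a minimal sketch under that assumption. The class name, jar path, and application argument are hypothetical placeholders, and the command is only printed, not executed.

```python
import shlex


def spark_submit_args(app_class, jar, app_args=(), executor_memory="3072M",
                      executor_cores=4, num_executors=10):
    """Return the spark-submit command as an argument list."""
    return [
        "spark-submit",
        "--class", app_class,
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        "--num-executors", str(num_executors),
        jar,
        *app_args,
    ]


# Hypothetical class name, jar path, and argument, for illustration only.
args = spark_submit_args("com.example.MyApp", "/tmp/myapp.jar", ["input.csv"])
print(shlex.join(args))
# To run it on a node where spark-submit is installed:
#   import subprocess; subprocess.run(args, check=True)
```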

Change the parameters for an application submitted using cURL

The following command is an example of how to change the configuration parameters for a batch application that is submitted using cURL.

```
curl -k -v -H 'Content-Type: application/json' -X POST -d '{"file":"<location of application jar file>", "className":"<the application class to execute>", "args":[<application parameters>], "numExecutors":10, "executorMemory":"2G", "executorCores":5}' localhost:8998/batches
```
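
The same request body can be produced from Python, which makes it harder to send malformed JSON. This sketch uses only the standard library and builds the request without sending it. The jar location and class name are hypothetical, and the `http://` scheme in front of `localhost:8998` is an assumption about the local Livy endpoint.

```python
import json
import urllib.request


def livy_batch_request(jar, app_class, app_args, num_executors=10,
                       executor_memory="2G", executor_cores=5,
                       livy_url="http://localhost:8998/batches"):
    """Build (but do not send) the POST request for a Livy batch."""
    payload = {
        "file": jar,
        "className": app_class,
        "args": list(app_args),
        "numExecutors": num_executors,
        "executorMemory": executor_memory,
        "executorCores": executor_cores,
    }
    return urllib.request.Request(
        livy_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical jar location and class name.
request = livy_batch_request("wasb:///example/jars/myapp.jar",
                             "com.example.MyApp", ["arg1"])
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# Sending it requires a reachable Livy endpoint:
#   with urllib.request.urlopen(request) as response: print(response.read())
```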

Change these parameters on a Spark Thrift Server

Spark Thrift Server provides JDBC/ODBC access to a Spark cluster and is used to service Spark SQL queries. Tools such as Power BI and Tableau use the ODBC protocol to communicate with Spark Thrift Server and execute Spark SQL queries as a Spark application. When a Spark cluster is created, two instances of the Spark Thrift Server are started, one on each head node. Each Spark Thrift Server is visible as a Spark application in the YARN UI.

Spark Thrift Server uses Spark dynamic executor allocation, so spark.executor.instances is not used. Instead, Spark Thrift Server uses spark.dynamicAllocation.minExecutors and spark.dynamicAllocation.maxExecutors to specify the executor count. The configuration parameters spark.executor.cores and spark.executor.memory are used to modify the executor size. You can change these parameters as shown in the following steps:

Expand the Advanced spark-thrift-sparkconf category to update the parameters spark.dynamicAllocation.minExecutors, spark.dynamicAllocation.maxExecutors, and spark.executor.memory.
Expand the Custom spark-thrift-sparkconf category to update the parameter spark.executor.cores.
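
Before entering new values in Ambari, it can help to check that the executor range is consistent. The following sketch collects the four properties named above and rejects a range where the minimum exceeds the maximum. The helper name and the numbers are illustrative and are not HDInsight defaults.

```python
def thrift_executor_settings(min_executors, max_executors, cores, memory):
    """Return Spark Thrift Server executor properties after a sanity check."""
    if min_executors < 0 or max_executors < min_executors:
        raise ValueError("need 0 <= minExecutors <= maxExecutors")
    return {
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        "spark.executor.cores": str(cores),
        "spark.executor.memory": memory,
    }


# Example values only; they are not HDInsight defaults.
for key, value in thrift_executor_settings(1, 8, 2, "4G").items():
    print(f"{key}={value}")
```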

Change the driver memory of the Spark Thrift Server

Spark Thrift Server driver memory is configured to 25% of the head node RAM size, provided the total RAM size of the head node is greater than 14 GB. You can use the Ambari UI to change the driver memory configuration:

From the Ambari UI, click Spark, click Configs, expand Advanced spark-env, and then provide the value for spark_thrift_cmd_opts.
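
The 25% rule can be written out as a small calculation. This is a sketch of the rule as stated above, not of how HDInsight implements it. Because the article does not say what happens for head nodes with 14 GB or less, the hypothetical helper returns None in that case.

```python
def default_thrift_driver_memory_gb(head_node_ram_gb: float):
    """Return 25% of head node RAM, or None when the rule does not apply."""
    if head_node_ram_gb <= 14:
        # The article states the rule only for head nodes with more than 14 GB.
        return None
    return head_node_ram_gb * 0.25


# Example head node sizes in GB.
for ram_gb in (14, 28, 56):
    print(ram_gb, default_thrift_driver_memory_gb(ram_gb))
```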

Reclaim Spark cluster resources

Because of Spark dynamic allocation, the only resources consumed by the Thrift Server are those for the two application masters. To reclaim these resources, you must stop the Thrift Server services running on the cluster.

From the left pane of the Ambari UI, click Spark.
On the next page, click Spark Thrift Servers.
You should see the two head nodes on which the Spark Thrift Server is running. Click one of the head nodes.
The next page lists all the services running on that head node. From the list, click the drop-down button next to Spark Thrift Server, and then click Stop.
Repeat these steps on the other head node.

Restart the Jupyter service

Launch the Ambari Web UI as shown at the beginning of the article. From the left navigation pane, click Jupyter, click Service Actions, and then click Restart All. This restarts the Jupyter service on all the head nodes.

Monitor resources

Launch the YARN UI as shown at the beginning of the article. In the Cluster Metrics table at the top of the screen, check the values of the Memory Used and Memory Total columns. If the two values are close, there might not be enough resources to start the next application. The same applies to the VCores Used and VCores Total columns. Also, in the main view, if an application stays in the ACCEPTED state and does not transition to either the RUNNING or the FAILED state, this could also indicate that it is not getting enough resources to start.
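
The same checks can be sketched in code against data shaped like the YARN ResourceManager REST API responses (the clusterMetrics object from ws/v1/cluster/metrics and the application list from ws/v1/cluster/apps). The helper names and the sample numbers are made up for illustration. How to reach the ResourceManager endpoint on your cluster is not covered here.

```python
def has_headroom(cluster_metrics: dict, need_mb: int, need_vcores: int) -> bool:
    """Check whether free memory and vcores cover a new application's request."""
    free_mb = cluster_metrics["totalMB"] - cluster_metrics["allocatedMB"]
    free_vcores = (cluster_metrics["totalVirtualCores"]
                   - cluster_metrics["allocatedVirtualCores"])
    return free_mb >= need_mb and free_vcores >= need_vcores


def stuck_in_accepted(apps: list) -> list:
    """Return the IDs of applications still waiting in the ACCEPTED state."""
    return [app["id"] for app in apps if app["state"] == "ACCEPTED"]


# Made-up sample data shaped like the ResourceManager REST responses.
metrics = {"totalMB": 98304, "allocatedMB": 92160,
           "totalVirtualCores": 32, "allocatedVirtualCores": 30}
apps = [
    {"id": "application_1500000000000_0001", "state": "RUNNING"},
    {"id": "application_1500000000000_0002", "state": "ACCEPTED"},
]
print(has_headroom(metrics, need_mb=30720, need_vcores=8))
print(stuck_in_accepted(apps))
```

In this sample, used memory and vcores are close to the totals, so a new application would likely wait in the ACCEPTED state until resources are released.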

Kill running applications

In the YARN UI, from the left panel, click Running. From the list of running applications, determine the application to be killed and click its ID.
Click Kill Application in the top-right corner, and then click OK.
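
As an alternative to the UI, the YARN ResourceManager REST API accepts a PUT to an application's state resource with the state KILLED. The sketch below only builds that request and does not send it. The ResourceManager address and application ID are hypothetical, and authentication and gateway routing on an HDInsight cluster are assumptions left out of the example.

```python
import json
import urllib.request


def kill_request(resource_manager_url: str, app_id: str) -> urllib.request.Request:
    """Build (but do not send) a request that asks YARN to kill an application."""
    url = f"{resource_manager_url.rstrip('/')}/ws/v1/cluster/apps/{app_id}/state"
    return urllib.request.Request(
        url,
        data=json.dumps({"state": "KILLED"}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical ResourceManager address and application ID.
request = kill_request("http://resourcemanager.example:8088",
                       "application_1500000000000_0002")
print(request.get_method(), request.full_url)
```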
