title | description | services | ms.service | author | ms.author | ms.reviewer | ms.custom | ms.topic | ms.date |
---|---|---|---|---|---|---|---|---|---|
Use extended Spark History Server to debug and diagnose Spark applications - Azure HDInsight | Use extended Spark History Server to debug and diagnose Spark applications - Azure HDInsight. | hdinsight | hdinsight | jejiang | jejiang | jasonh | hdinsightactive,hdiseo17may2017 | conceptual | 09/14/2018 |
This article provides guidance on how to use the extended Apache Spark History Server to debug and diagnose completed and running Spark applications. The extension includes the Data tab, the Graph tab, and the Diagnosis tab. On the Data tab, users can check the input and output data of the Spark job. On the Graph tab, users can check the data flow and replay the job graph. On the Diagnosis tab, users can refer to Data Skew, Time Skew, and Executor Usage Analysis.
Apache Spark History Server is the web UI for completed and running Spark applications.
-
From the Azure portal, open the Spark cluster. For more information, see List and show clusters.
-
From Quick Links, click Cluster Dashboard, and then click Spark History Server. When prompted, enter the admin credentials for the Spark cluster.
You can also open the Spark History Server by browsing to the following URL. Replace <ClusterName> with the name of your Spark cluster.
https://<ClusterName>.azurehdinsight.net/sparkhistory
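The substitution can be sketched as a small shell helper. This is only an illustration: the function name `spark_history_url` and the cluster name `mycluster` are placeholders that do not come from any HDInsight tool.

```shell
#!/usr/bin/env bash
# Build the Spark History Server URL for a given HDInsight cluster name.
# "spark_history_url" is a hypothetical helper name used for illustration.
spark_history_url() {
    cluster_name="$1"
    if [ -z "$cluster_name" ]; then
        >&2 echo "Please provide the cluster name."
        return 1
    fi
    printf 'https://%s.azurehdinsight.net/sparkhistory\n' "$cluster_name"
}

# Example with a placeholder cluster name.
spark_history_url "mycluster"
```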
The Spark History Server web UI looks like:
Select the job ID, then click Data on the tool menu to open the data view.
-
Check the Inputs, Outputs, and Table Operations by selecting the tabs separately.
-
Copy all rows by clicking the Copy button.
-
Save all data as a CSV file by clicking the csv button.
-
Search by entering keywords in the Search field. The search results display immediately.
-
Click a column header to sort the table, click the plus sign to expand a row and show more details, or click the minus sign to collapse a row.
-
Download a single file by clicking the Partial Download button on the right. The selected file is downloaded to your local machine. If the file no longer exists, a new tab opens to show the error messages.
-
Copy the full path or relative path by selecting Copy Full Path or Copy Relative Path from the expanded download menu. For Azure Data Lake Storage files, Open in Azure Storage Explorer launches Azure Storage Explorer and locates the folder after you sign in.
-
Click a page number below the table to navigate between pages when there are too many rows to display on one page.
-
Hover over the question mark beside Data to show the tooltip, or click the question mark to get more information.
-
Send feedback with issues by clicking Provide us feedback.
Select the job ID, then click Graph on the tool menu to open the job graph view.
-
Check an overview of your job in the generated job graph.
-
By default, the graph shows all jobs. You can filter it by Job ID.
-
By default, Progress is selected. To check the data flow, select Read/Written in the Display dropdown list.
The graph nodes are displayed in color to show the heatmap.
-
Play back the job by clicking the Playback button, and stop at any time by clicking the Stop button. During playback, tasks are displayed in color to show their status:
- Green for succeeded: The job has completed successfully.
- Orange for retried: Instances of tasks that failed but do not affect the final result of the job. These tasks had duplicate or retry instances that may succeed later.
- Blue for running: The task is running.
- White for waiting or skipped: The task is waiting to run, or the stage was skipped.
- Red for failed: The task has failed.
Skipped stages are displayed in white.
[!NOTE] Playback is available for each completed job. Playback is not supported for incomplete jobs.
-
Scroll the mouse wheel to zoom the job graph in or out, or click Zoom to fit to fit the graph to the screen.
-
Hover over a graph node to see the tooltip when there are failed tasks, and click a stage to open the stage page.
-
On the job graph tab, a stage displays a tooltip and a small icon if it has tasks that meet the following conditions:
- Data skew: data read size > average data read size of all tasks inside this stage * 2 and data read size > 10 MB
- Time skew: execution time > average execution time of all tasks inside this stage * 2 and execution time > 2 mins
-
The job graph node displays the following information for each stage:
- ID.
- Name or description.
- Total task number.
- Data read: the sum of input size and shuffle read size.
- Data write: the sum of output size and shuffle write size.
- Execution time: the time between start time of the first attempt and completion time of the last attempt.
- Row count: the sum of input records, output records, shuffle read records and shuffle write records.
- Progress.
[!NOTE] By default, the job graph node displays information from the last attempt of each stage (except for stage execution time). During playback, the graph node shows information for each attempt.
[!NOTE] For read and write data sizes, we use 1 MB = 1000 KB = 1000 * 1000 bytes.
-
Send feedback with issues by clicking Provide us feedback.
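The data skew condition used on the job graph tab can be sketched in a few lines of shell and awk. This is a minimal illustration and not the History Server's own implementation. The input format (one `task_id input_bytes shuffle_read_bytes` line per task) and the function name `flag_data_skew` are assumptions, as is applying the stage-level definition of data read (input size plus shuffle read size) to each task.

```shell
#!/usr/bin/env bash
# Flag tasks whose data read exceeds both 2x the stage average and 10 MB.
# Input on stdin: "task_id input_bytes shuffle_read_bytes" per line
# (a hypothetical format chosen for this sketch).
flag_data_skew() {
    awk '
        { id[NR] = $1; read[NR] = $2 + $3; total += $2 + $3 }
        END {
            if (NR == 0) exit
            avg = total / NR
            limit = 10 * 1000 * 1000    # 10 MB, with 1 MB = 1000 * 1000 bytes
            for (i = 1; i <= NR; i++)
                if (read[i] > 2 * avg && read[i] > limit)
                    print id[i]
        }'
}

# Four tasks in one stage; only t3 reads far more than the others.
printf '%s\n' "t1 3000000 1000000" "t2 5000000 0" "t3 20000000 40000000" "t4 3000000 0" | flag_data_skew
```

Both conditions must hold, so a task that is large relative to its peers but reads less than 10 MB is not flagged.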
Select the job ID, then click Diagnosis on the tool menu to open the job Diagnosis view. The Diagnosis tab includes Data Skew, Time Skew, and Executor Usage Analysis.
Click the Data Skew tab. The corresponding skewed tasks are displayed based on the specified parameters.
-
Specify Parameters - The first section displays the parameters used to detect Data Skew. The built-in rule is: the task data read is greater than 3 times the average task data read, and the task data read is more than 10 MB. If you want to define your own rule for skewed tasks, you can choose your parameters. The Skewed Stage and Skew Chart sections are refreshed accordingly.
-
Skewed Stage - The second section displays stages that have skewed tasks meeting the criteria specified above. If there is more than one skewed task in a stage, the skewed stage table displays only the most skewed task (for example, the task with the largest data read for data skew).
-
Skew Chart - When a row in the skewed stage table is selected, the skew chart displays more task distribution details based on data read and execution time. The skewed tasks are marked in red and the normal tasks are marked in blue. For performance reasons, the chart displays up to 100 sample tasks. The task details are displayed in the bottom-right panel.
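The way the Skewed Stage table keeps only the most skewed task per stage can be sketched as follows. This is an illustration under stated assumptions, not the tool's code: the input format (`stage_id task_id bytes_read` per line), the function name `most_skewed_per_stage`, and its two optional arguments (the multiple of the average and the minimum size in bytes) are all hypothetical. The defaults mirror the built-in rule of 3 times the average and 10 MB.

```shell
#!/usr/bin/env bash
# For each stage, report the single most skewed task, if any.
# Usage: most_skewed_per_stage [factor] [min_bytes] < input
# Input on stdin: "stage_id task_id bytes_read" per line (hypothetical format).
most_skewed_per_stage() {
    awk -v factor="${1:-3}" -v min_bytes="${2:-10000000}" '
        { stage[NR] = $1; task[NR] = $2; bytes[NR] = $3; sum[$1] += $3; count[$1]++ }
        END {
            for (i = 1; i <= NR; i++) {
                s = stage[i]
                avg = sum[s] / count[s]
                # A task is skewed when it exceeds both thresholds;
                # keep only the largest skewed task of each stage.
                if (bytes[i] > factor * avg && bytes[i] > min_bytes && bytes[i] > best[s]) {
                    best[s] = bytes[i]
                    best_task[s] = task[i]
                }
            }
            for (s in best) print s, best_task[s], best[s]
        }' | sort
}

# Stage s1 has two skewed tasks (t7 and t8); stage s2 has none.
printf '%s\n' \
    "s1 t1 1000000" "s1 t2 1000000" "s1 t3 1000000" "s1 t4 1000000" \
    "s1 t5 1000000" "s1 t6 1000000" "s1 t7 40000000" "s1 t8 60000000" \
    "s2 t1 20000000" "s2 t2 20000000" "s2 t3 20000000" | most_skewed_per_stage
```

Passing different arguments, for example `most_skewed_per_stage 2 5000000`, corresponds to defining your own rule in the Specify Parameters section.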
The Time Skew tab displays skewed tasks based on task execution time.
-
Specify Parameters - The first section displays the parameters used to detect Time Skew. The default criteria to detect time skew are: the task execution time is greater than 3 times the average execution time, and the task execution time is greater than 30 seconds. You can change the parameters based on your needs. The Skewed Stage and Skew Chart sections display the corresponding stage and task information, just like the Data Skew tab above.
-
Click Time Skew. The filtered result is displayed in the Skewed Stage section according to the parameters set in the Specify Parameters section. Click one item in the Skewed Stage section. The corresponding chart is drawn in the third section, and the task details are displayed in the bottom-right panel.
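The default time skew criteria, with adjustable parameters, can be sketched in the same way. The input format (`task_id seconds` per line) and the function name `flag_time_skew` are assumptions made for this illustration.

```shell
#!/usr/bin/env bash
# Flag tasks whose execution time exceeds both factor x the average and min_seconds.
# Usage: flag_time_skew [factor] [min_seconds] < input
# Input on stdin: "task_id seconds" per line (hypothetical format).
flag_time_skew() {
    awk -v factor="${1:-3}" -v min_s="${2:-30}" '
        { id[NR] = $1; secs[NR] = $2; total += $2 }
        END {
            if (NR == 0) exit
            avg = total / NR
            for (i = 1; i <= NR; i++)
                if (secs[i] > factor * avg && secs[i] > min_s)
                    print id[i], secs[i]
        }'
}

# With the defaults (3 times the average, 30 seconds), only t4 is flagged.
printf '%s\n' "t1 10" "t2 12" "t3 8" "t4 200" | flag_time_skew
```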
The Executor Usage Graph visualizes the actual executor allocation and running status of the Spark job.
-
Click Executor Usage Analysis. Four types of curves about executor usage are drawn: Allocated Executors, Running Executors, Idle Executors, and Max Executor Instances. For allocated executors, each "Executor added" or "Executor removed" event increases or decreases the number of allocated executors. You can check "Event Timeline" in the "Jobs" tab for more comparison.
-
Click the color icon to select or unselect the corresponding content in all charts.
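How the Allocated Executors curve follows from the "Executor added" and "Executor removed" events can be sketched as a running count. The input format (`timestamp event` per line, in time order, with the event written as `added` or `removed`) and the function name `allocated_executors_curve` are assumptions for illustration. The real event log uses a different format.

```shell
#!/usr/bin/env bash
# Turn a time-ordered list of executor events into an allocation curve.
# Input on stdin: "timestamp event" per line, where event is "added" or
# "removed" (a hypothetical, simplified format).
allocated_executors_curve() {
    awk '
        $2 == "added"   { allocated++ }
        $2 == "removed" { allocated-- }
        { print $1, allocated + 0 }'
}

# Two executors are added, one is removed, then one more is added.
printf '%s\n' "0 added" "5 added" "9 removed" "12 added" | allocated_executors_curve
```

Each output line is a point on the curve: the timestamp followed by the number of executors allocated at that moment.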
To revert to the community version, do the following steps:
-
Open the cluster in Ambari. Click Spark2 in the left panel.
-
Click the Configs tab.
-
Expand the group Custom spark2-defaults.
-
Click Add Property, add spark.ui.enhancement.enabled=false, and save.
-
The property is now set to false.
-
Click Save to save the configuration.
-
Click Spark2 in the left panel. Under the Summary tab, click Spark2 History Server.
-
Restart the history server by clicking Restart of Spark2 History Server.
-
Refresh the Spark History Server web UI. It is reverted to the community version.
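If you have a copy of the Spark defaults as a properties file, a quick check that the property was saved can be sketched as below. This is only an illustration: the function name `get_enhancement_setting` is hypothetical, and the file location is passed as an argument because this article does not state where Ambari writes the configuration on disk.

```shell
#!/usr/bin/env bash
# Print the value of spark.ui.enhancement.enabled from a properties file,
# or "unset" if the property is absent.
# Usage: get_enhancement_setting <conf_file>
# Accepts both "key=value" and "key value" lines.
get_enhancement_setting() {
    awk -F '[ \t=]+' '
        $1 == "spark.ui.enhancement.enabled" { value = $2 }
        END { print (value == "" ? "unset" : value) }' "$1"
}

# Example against a temporary file standing in for the real configuration.
conf_file="$(mktemp)"
printf '%s\n' "spark.master yarn" "spark.ui.enhancement.enabled=false" > "$conf_file"
get_enhancement_setting "$conf_file"
rm -f "$conf_file"
```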
If you run into a history server error, follow these steps to provide the event:
-
Download the event by clicking Download in the history server web UI.
-
Click Provide us feedback from the Data or Graph tab.
-
Provide the title and description of the error, drag the zip file to the edit field, then click Submit new issue.
If you want to upgrade with a hotfix, use the script below, which upgrades spark-enhancement.jar*.
upgrade_spark_enhancement.sh:
#!/usr/bin/env bash

# Copyright (C) Microsoft Corporation. All rights reserved.

# Arguments:
# $1 Enhancement jar path

if [ "$#" -ne 1 ]; then
    >&2 echo "Please provide the upgrade jar path."
    exit 1
fi

# install_jar <jars_folder> <jar_url>
# Downloads the hotfix jar, backs up the existing jar, and replaces it.
install_jar() {
    tmp_jar_path="/tmp/spark-enhancement-hotfix-$( date +%s )"

    if wget -O "$tmp_jar_path" "$2"; then
        for FILE in "$1"/spark-enhancement*.jar
        do
            back_up_path="$FILE.original.$( date +%s )"
            echo "Back up $FILE to $back_up_path"
            mv "$FILE" "$back_up_path"
            echo "Copy the hotfix jar file from $tmp_jar_path to $FILE"
            cp "$tmp_jar_path" "$FILE"

            echo "Hotfix done."
            # Only the first matching jar is replaced.
            break
        done
    else
        >&2 echo "Download jar file failed."
        exit 1
    fi
}

jars_folder="/usr/hdp/current/spark2-client/jars"
jar_path=$1

# Act only on nodes that already have the enhancement jar installed.
if ls "${jars_folder}"/spark-enhancement*.jar 1>/dev/null 2>&1; then
    install_jar "$jars_folder" "$jar_path"
else
    >&2 echo "There is no target jar on this node. Exit with no action."
    exit 0
fi
Usage:
upgrade_spark_enhancement.sh https://${jar_path}
Example:
upgrade_spark_enhancement.sh https://${account_name}.blob.core.windows.net/packages/jars/spark-enhancement-${version}.jar
To use the bash file from the Azure portal:
-
Launch the Azure portal, and select your cluster.
-
Click Script actions, then Submit new. Complete the Submit script action form, then click the Create button.
- Script type: select Custom.
- Name: specify a script name.
- Bash script URI: upload the bash file to the private cluster, then copy the URL here. Alternatively, use the URI provided.
https://hdinsighttoolingstorage.blob.core.windows.net/shsscriptactions/upgrade_spark_enhancement.sh
- Select Head and Worker.
- Parameters: set the parameters following the bash usage.
-
Currently, it works only for Spark 2.3 clusters.
-
Input and output data that use RDD are not shown on the Data tab.
If you have any feedback, or if you encounter any other problems when using this tool, send an email to ([email protected]).