LIHADOOP-18124: Complete Open sourcing of Dr. Elephant
RB=681089

G=superfriends-reviewers
R=annag,fli,shanm,viramach
A=annag,shanm
akshayrai committed Mar 14, 2016
1 parent 8906b39 commit 6c06bc8
Showing 45 changed files with 41 additions and 315 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -1,5 +1,5 @@
#
# Copyright 2015 LinkedIn Corp.
# Copyright 2016 LinkedIn Corp.
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may not
# use this file except in compliance with the License. You may obtain a copy of
262 changes: 8 additions & 254 deletions README.md
@@ -1,266 +1,20 @@
# Dr. Elephant

Dr. Elephant is a performance monitoring and tuning tool for Hadoop. The goal of Dr. Elephant is to improve developer
productivity and increase cluster efficiency by making it easier to tune Hadoop jobs. It analyzes Hadoop jobs using a
set of configurable heuristics that provide insights on how a job performed and uses the results to make suggestions on
how to tune the job to make it perform more efficiently.
<a href=""><img src="images/readme/dr-elephant-logo-150x150.png" align="left" hspace="10" vspace="6"></a>

## Why Dr. Elephant?
Efficient use of Hadoop cluster resources, and developer productivity, are big problems for users of Hadoop. There are
no actively maintained tools provided by the open source community to bridge this gap. Dr. Elephant, in addition to
solving this problem, is easy to use and extensible.
**Dr. Elephant** is a performance monitoring and tuning tool for Hadoop. He automatically gathers all the metrics, runs analysis on them, and presents them in a simple way for easy consumption. His goal is to improve developer productivity and increase cluster efficiency by making it easier to tune Hadoop jobs. He analyzes Hadoop/Spark jobs using a set of configurable, rule-based heuristics that provide insights on how a job performed and uses the results to make suggestions on how to tune the job to make it perform more efficiently.

## Key Features
* Pluggable and configurable Heuristics that diagnose a job
* Integration with Azkaban scheduler and designed to integrate with any hadoop scheduler such as Oozie.
* Representation of historic performance of jobs and flows
* Job level comparison of flows
* Diagnostic heuristics for Map/Reduce and Spark
* Easily extendable to newer job types, applications and schedulers
* Rest API to fetch all the information

## How does it work?
Dr. Elephant gets a list of all recent succeeded and failed applications, once every minute, from the Resource manager.
The metadata for each application, viz, the job counters, configurations and the task data, are fetched from the Job
History server. Once it has all the metadata, Dr. Elephant runs a set of different Heuristics on them and generates a
diagnostic report on how the individual heuristics and the job as a whole performed. These are then tagged with one of
five severity levels, to indicate potential performance problems.
## Documentation

## Use Cases
At Linkedin, developers use Dr. Elephant for a number of different use cases including monitoring how their flow is
performing on the cluster, understanding why their flow is running slow, how and what can be tuned to improve their
flow, comparing their flow against previous executions, troubleshooting etc. Dr. Elephant’s performance green-lighting
is a prerequisite to run jobs on production clusters.
For more information on Dr. Elephant, [see the wiki](https://github.com/linkedin/dr-elephant/wiki).

## Sample Job Analysis/Tuning
Dr. Elephant’s home page, or the dashboard, includes all the latest analysed jobs along with some statistics.
User guide: [Click here](https://github.com/linkedin/dr-elephant/wiki/User-Guide)

<img src="images/readme/dashboard.png" alt="unable to load image" height="200" width="450" align="center" />
Developer guide: [Click here](https://github.com/linkedin/dr-elephant/wiki/Developer-Guide)

Once a job completes, it can be found in the Dashboard, or by filtering on the Search page. One can filter jobs by the
job id, the flow execution url(if scheduled from a scheduler), the user who triggered the job, job finish time, the type
of the job, or even based on severity of the individual heuristics.
Administrator guide: [Click here](https://github.com/linkedin/dr-elephant/wiki/Administrator-Guide)

<img src="images/readme/search.png" alt="unable to load image" height="200" width="450" align="center" />

The search results provide a high level analysis report of the jobs using color coding to represent severity levels on
how the job and the heuristics performed. The color Red means the job is in critical state and requires tuning while
Green means the job is running efficiently.

## Severity levels

Severity is a measure of how inefficiently a job performs. There are five severity levels, and each heuristic and job is
assigned one based on the configured thresholds. In decreasing order of severity, they are

CRITICAL > SEVERE > MODERATE > LOW > NONE

| SEVERITY | COLOR | DESCRIPTION |
| -------- | --------------------------------------- | -------------------------------------------------- |
| CRITICAL | ![Alt text](images/readme/critical.png) | The job is in critical state and must be tuned |
| SEVERE | ![Alt text](images/readme/severe.png) | There is scope for improvement |
| MODERATE | ![Alt text](images/readme/moderate.png) | There is scope for further improvement |
| LOW | ![Alt text](images/readme/low.png) | There is scope for few minor improvements |
| NONE | ![Alt text](images/readme/none.png) | The job is safe. No tuning necessary |
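
The ordering above can be sketched as a small enum. This is only an illustration in Python, not the project's code: the numeric values and the `job_severity` helper are assumptions, and in particular the rule that a job is as severe as its worst heuristic is assumed rather than stated in this README.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Numeric values are an assumption, chosen only so that comparisons
    # follow CRITICAL > SEVERE > MODERATE > LOW > NONE.
    NONE = 0
    LOW = 1
    MODERATE = 2
    SEVERE = 3
    CRITICAL = 4


def job_severity(heuristic_severities):
    """Hypothetical helper: treat a job as being as severe as its worst heuristic."""
    return max(heuristic_severities, default=Severity.NONE)


worst = job_severity([Severity.LOW, Severity.SEVERE, Severity.NONE])
print(worst.name)
```

Using an integer-backed enum keeps the comparison operators meaningful, so filtering for "SEVERE or worse" becomes a plain `>=` check.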

Once one filters and identifies one’s job, one can click on the result to get the complete report. The report includes
details on each of the individual heuristics and a link, [Explain], which provides suggestions on how to tune the job to
improve that heuristic.

<img src="images/readme/jobdetails.png" alt="unable to load image" height="200" width="450" align="center" />

<img src="images/readme/suggestions.png" alt="unable to load image" height="200" width="450" align="center" />

## Heuristics
One can go to the help page in Dr. Elephant to see what each of the heuristics means. For more information on the
Heuristics, one can refer to the complete documentation here <link>.

## Dr. Elephant Setup

### Compiling & testing locally

#### Play Setup - One time
* To be able to build & run the application, download and install [Play framework 2.2.2](http://downloads.typesafe.com/play/2.2.2/play-2.2.2.zip).
* Add the Play installation directory to the system path.

#### Hadoop Setup - One time
* Setup hadoop locally. You can find instructions to setup a single node cluster [here](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html).
* Export variable HADOOP\_HOME if you haven't already.
```
export HADOOP_HOME=/path/to/hadoop/home
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
```

* Add hadoop to the system path because dr-elephant uses _'hadoop classpath'_ to load the right classes.
```
export PATH=$HADOOP_HOME/bin:$PATH
```

#### Mysql Setup - One time
* Set up and start mysql locally on your box.
* Create a database called 'drelephant'. Use root with no password.
```
mysql -u root -p
mysql> create database drelephant;
```

#### Dr. Elephant Setup
* Start Hadoop and run the history server.
* To compile dr-elephant, run the compile script specifying a path to an external configuration file. Note that this will require Bash (Unix shell).
```
./compile.sh [/path/to/conf]
```
The configuration file includes the following properties,
```
hadoop_version=2.4.1 // The hadoop version you are running. (Default - 2.3.0)
spark_version=1.4.0 // The spark version (Default - 1.4.0)
play_opts="-Dsbt.repository.config=/path/to/resolver/conf ..." // Include other sbt/play options.
```
If any of the above properties are not set then the default values will be used. Additionally, if you want to configure
a custom repository then set the property sbt.repository.config to the resolver file location as shown in the above
example. See section 'Adding a new Resolver' below for more info.
* Unzip the zip file generated in the previous step (check dist) and change to the dr-elephant release directory created.
Henceforth we will refer to this as DR_RELEASE.
```
cd dist; unzip dr-elephant*.zip; cd dr-elephant*
```
* If you are running dr-elephant for the first time after creating the database, you need to enable evolutions. To do so append _-Devolutionplugin=enabled_ and _-DapplyEvolutions.default=true_ to jvm\_props in elephant.conf file.
```
vim ./app-conf/elephant.conf
jvm_props="... -Devolutionplugin=enabled -DapplyEvolutions.default=true"
```
* To start dr-elephant, run the start script specifying a path to the application's configuration directory.
```
$DR_RELEASE/bin/start.sh $DR_RELEASE/../../app-conf
```
* To stop dr-elephant run,
```
$DR_RELEASE/bin/stop.sh
```
* The dr-elephant logs are generated in the 'dist' directory beside the dr-elephant release.
```
less $DR_RELEASE/../logs/elephant/dr_elephant.log
```

### DB Schema evolutions

When the schema in the model package changes, run play to automatically apply the evolutions.

* There is a problem with Ebean where it does not support something like @Index to generate indices for columns of interest
* So what we did to work around this is to manually add indices into the sql script.
* To do this, we needed to prevent the automatically generated sql to overwrite our modified sql.
* The evolution sql file must be changed (by moving or removing the header "To stop Ebean DDL generation, remove this comment and start using Evolutions") to make sure it does not automatically generate new sql.
* To re-create the sql file from a new schema in code:
* Backup the file at ./conf/evolutions/default/1.sql
* Remove the file
* Run play in debug mode and browse the page. This causes EBean to generate the new sql file, and automatically apply the evolution.
* Copy over the indices from the old 1.sql file
* Remove the header in the sql file so it does not get overwritten
* Browse the page again to refresh the schema to add the indices.

### Deployment on the cluster

* SSH into the cluster machine.
* Switch to the appropriate user.
```
sudo -iu <user>
```
* Unzip the dr-elephant release and change directory to it.
* To start dr-elephant, run the start script. The start script takes an optional argument to the application's conf directory. Alternatively, you can set an env variable ELEPHANT_CONF_DIR.
```
./bin/start.sh [/path/to/app-conf]
```
* To stop dr-elephant run,
```
./bin/stop.sh
```
* To deploy a new version, be sure to kill the running process first.

### Adding new heuristics

* Create a new heuristic and test it.
* Create a new view for the heuristic for example helpMapperSpill.scala.html
* Add the details of the heuristic in the HeuristicConf.xml file.
* The HeuristicConf.xml file requires the following details for each heuristic:
* **applicationtype**: The type of application analysed by the heuristic. e.g. mapreduce or spark
* **heuristicname**: Name of the heuristic.
* **classname**: Fully qualified name of the class.
* **viewname**: Fully qualified name of the view.
* **hadoopversions**: Versions of Hadoop with which the heuristic is compatible.
* Optionally, if you wish to override the threshold values of the severities used in the Heuristic and use custom
threshold limits, you can specify them in the HeuristicConf.xml between params tag. See examples below.
* A sample entry in HeuristicConf.xml would look like,
```
<heuristic>
<applicationtype>mapreduce</applicationtype>
<heuristicname>Mapper GC</heuristicname>
<classname>com.linkedin.drelephant.mapreduce.heuristics.MapperGCHeuristic</classname>
<viewname>views.html.help.mapreduce.helpGC</viewname>
</heuristic>
```
* A sample entry showing how to override/configure severity thresholds would look like,
```
<heuristic>
<applicationtype>mapreduce</applicationtype>
<heuristicname>Mapper Data Skew</heuristicname>
<classname>com.linkedin.drelephant.mapreduce.heuristics.MapperDataSkewHeuristic</classname>
<viewname>views.html.help.mapreduce.helpMapperDataSkew</viewname>
<params>
<num_tasks_severity>10, 50, 100, 200</num_tasks_severity>
<deviation_severity>2, 4, 8, 16</deviation_severity>
<files_severity>1/8, 1/4, 1/2, 1</files_severity>
</params>
</heuristic>
```
* Run Doctor Elephant, it should now include the new heuristics.
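
To make the `params` override more concrete, here is a minimal Python sketch that reads the threshold lists out of a heuristic entry like the one above. It is not how Dr. Elephant itself loads HeuristicConf.xml; `parse_thresholds` is a hypothetical helper, and the sketch assumes each parameter is a comma-separated list of numbers or simple fractions. How the four values map onto severity levels is not stated here, so the sketch only parses them.

```python
import xml.etree.ElementTree as ET
from fractions import Fraction

# Sample entry, copied from the override example in this README.
SAMPLE = """
<heuristic>
  <applicationtype>mapreduce</applicationtype>
  <heuristicname>Mapper Data Skew</heuristicname>
  <classname>com.linkedin.drelephant.mapreduce.heuristics.MapperDataSkewHeuristic</classname>
  <viewname>views.html.help.mapreduce.helpMapperDataSkew</viewname>
  <params>
    <num_tasks_severity>10, 50, 100, 200</num_tasks_severity>
    <deviation_severity>2, 4, 8, 16</deviation_severity>
    <files_severity>1/8, 1/4, 1/2, 1</files_severity>
  </params>
</heuristic>
"""


def parse_thresholds(xml_text):
    """Hypothetical helper: return {param name: [threshold, ...]} for one heuristic."""
    root = ET.fromstring(xml_text.strip())
    params = root.find("params")
    if params is None:
        # No overrides configured; the heuristic would keep its defaults.
        return {}
    return {
        child.tag: [float(Fraction(part.strip())) for part in child.text.split(",")]
        for child in params
    }


thresholds = parse_thresholds(SAMPLE)
for name, values in thresholds.items():
    print(name, values)
```

`Fraction` is used so that entries such as `1/8` parse without evaluating arbitrary expressions.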

### Adding a new Resolver

If you want to add a custom repository, configure the resolver in a separate file as shown below and specify the path to
this file in the compiler configuration file.
cat resolver.conf
```
[repositories]
local
# label ":" url [ ["," ivyPattern] "," artifactPattern [", mavenCompatible"]]
custom_resolver : repo_url, [organization]/[module]/[revision]/[module]-[revision].ivy, [organisation]/[module]/[revision]/[artifact]-[revision](-[classifier]).[ext], mavenCompatible
```
After defining the resolver configuration, include the path to it under play_opts="... -Dsbt.repository.config=/path/to/resolver.conf" in your compiler configuration.

## Project Structure

app → Contains all the source files
└ com.linkedin.drelephant → Application Daemons
└ org.apache.spark → Spark Support
└ controllers → Controller logic
└ models → Includes models that Map to DB
└ views → Page templates

app-conf → Application Configurations
└ elephant.conf → Port, DB, Keytab and other JVM Configurations (Overrides application.conf)
└ FetcherConf.xml → Fetcher Configurations
└ HeuristicConf.xml → Heuristic Configurations
└ JobTypeConf.xml → JobType Configurations

conf → Configurations files
└ evolutions → DB Schema
└ application.conf → Main configuration file
└ log4j.properties → log configuration file
└ routes → Routes definition

public → Public assets
└ assets → Library files
└ css → CSS files
└ images → Image files
└ js → Javascript files

scripts
└ start.sh → Starts Dr. Elephant
└ stop.sh → Stops Dr. Elephant

test → Source folder for unit tests

compile.sh → Compiles the application

## License

@@ -276,4 +30,4 @@ After defining the resolver configuration, include the path to it under play_opt
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations under
the License.
the License.
@@ -1,5 +1,5 @@
/*
* Copyright 2015 LinkedIn Corp.
* Copyright 2016 LinkedIn Corp.
*
* Licensed under the Apache License, Version 2.0 (the "License"); you may not
* use this file except in compliance with the License. You may obtain a copy of
2 changes: 1 addition & 1 deletion app/controllers/IdUrlPair.java
@@ -1,5 +1,5 @@
/*
* Copyright 2015 LinkedIn Corp.
* Copyright 2016 LinkedIn Corp.
*
* Licensed under the Apache License, Version 2.0 (the "License"); you may not
* use this file except in compliance with the License. You may obtain a copy of
2 changes: 1 addition & 1 deletion app/models/AppHeuristicResultDetails.java
@@ -1,5 +1,5 @@
/*
* Copyright 2015 LinkedIn Corp.
* Copyright 2016 LinkedIn Corp.
*
* Licensed under the Apache License, Version 2.0 (the "License"); you may not
* use this file except in compliance with the License. You may obtain a copy of
2 changes: 1 addition & 1 deletion app/models/AppResult.java
@@ -1,5 +1,5 @@
/*
* Copyright 2015 LinkedIn Corp.
* Copyright 2016 LinkedIn Corp.
*
* Licensed under the Apache License, Version 2.0 (the "License"); you may not
* use this file except in compliance with the License. You may obtain a copy of
13 changes: 2 additions & 11 deletions app/views/help/mapreduce/helpMapperTime.scala.html
@@ -76,12 +76,7 @@ <h4>Suggestions</h4>
one file, and changing split size won't help. If that is your case, you should either try CombineFileInputFormat or
use Pig/Hive.
<br>
@* TODO Before Open Source *@
@*
We will be releasing the go/hadooptuningtips page as part of Dr. Elephant's documentation.
Update this link when we migrate code to github.
*@
See <a href="http://go/hadooptuningtips">Hadoop Tuning Tips</a> for further information.<br>
See <a href="https://github.com/linkedin/dr-elephant/wiki/Tuning-Tips">Hadoop Tuning Tips</a> for further information.<br>
</p>
<h3>Large files/Unsplittable files</h3>
<p>
@@ -125,9 +120,5 @@ <h4>Suggestions</h4>
The input split size is controlled by formula <b>max(minSplitSize, min(maxSplitSize, blockSize))</b>. See the
previous section for further details. <br>
In the case above, since mapper input size >> block size and you want to increase mappers, you should decrease min split size close to BlockSize(512MB). <br>
@* TODO Before Open Source *@
@*
Update link while migrating code to github
*@
See <a href="http://go/hadooptuningtips">Hadoop Tuning Tips</a> for further information.
See <a href="https://github.com/linkedin/dr-elephant/wiki/Tuning-Tips">Hadoop Tuning Tips</a> for further information.
</p>
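
The split-size formula quoted in this help page, max(minSplitSize, min(maxSplitSize, blockSize)), can be checked with a few lines of Python. This is a sketch of the arithmetic only; the function name, the parameter names and the example sizes other than the 512MB block size are assumptions for illustration.

```python
MB = 1024 * 1024


def split_size(min_split_size, max_split_size, block_size):
    """Input split size as given by max(minSplitSize, min(maxSplitSize, blockSize))."""
    return max(min_split_size, min(max_split_size, block_size))


block = 512 * MB          # block size used in the help text
no_limit = 2**63 - 1      # stand-in for "no maximum configured" (assumption)

# Small minimum, no maximum: the split equals the block size.
default = split_size(1, no_limit, block)

# A minimum above the block size gives larger splits, hence fewer mappers.
larger = split_size(1024 * MB, no_limit, block)

# A maximum below the block size gives smaller splits, hence more mappers.
smaller = split_size(1, 128 * MB, block)

print(default // MB, larger // MB, smaller // MB)
```

This matches the advice above: when the minimum split size is what keeps splits larger than a block, lowering it toward the block size increases the number of mappers.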
6 changes: 1 addition & 5 deletions app/views/help/mapreduce/helpMemory.scala.html
@@ -70,8 +70,4 @@ <h4>Suggestions</h4>
</ul>

<br>
@* TODO Before Open Source *@
@*
Update link while migrating code to github
*@
See <a href="http://go/hadooptuningtips">Hadoop Tuning Tips</a> for further information.<br>
See <a href="https://github.com/linkedin/dr-elephant/wiki/Tuning-Tips">Hadoop Tuning Tips</a> for further information.<br>
6 changes: 1 addition & 5 deletions app/views/help/mapreduce/helpReducerTime.scala.html
@@ -87,9 +87,5 @@ <h3>Suggestions</h3>
For Azkaban flows, add jvm.args=-Dmapreduce.job.reduces=NUMBER_OF_REDUCERS to your job properties<br>
<br>
Generally, Dr. Elephant(and Hadoop team) advises the ideal task time to be 5-10 minutes.<br>
@* TODO Before Open Source *@
@*
Update link while migrating code to github
*@
See <a href="http://go/hadooptuningtips">Hadoop Tuning Tips</a> for further information.
See <a href="https://github.com/linkedin/dr-elephant/wiki/Tuning-Tips">Hadoop Tuning Tips</a> for further information.
</p>
6 changes: 1 addition & 5 deletions app/views/help/mapreduce/helpShuffleSort.scala.html
@@ -60,9 +60,5 @@ <h3>Suggestions</h3>
For Apache-Pig jobs: Use "set mapreduce.job.reduce.slowstart.completedmaps 0.95"<br>
For Apache-Hive jobs: Use "set mapreduce.job.reduce.slowstart.completedmaps=0.95"<br>
For Azkaban flows, add jvm.args=-Dmapreduce.job.reduce.slowstart.completedmaps=0.95 to your job properties(will affect all MapReduce jobs under this azkaban job)<br>
@* TODO Before Open Source *@
@*
Update link while migrating code to github
*@
See <a href="http://go/hadooptuningtips">Hadoop Tuning Tips</a> for further information.
See <a href="https://github.com/linkedin/dr-elephant/wiki/Tuning-Tips">Hadoop Tuning Tips</a> for further information.
</p>
7 changes: 0 additions & 7 deletions conf/application.conf
@@ -78,10 +78,3 @@ logger.play=INFO

# Logger provided to your application:
logger.application=DEBUG

# Emailer
# smtp.host=
# smtp.port=
# smtp.from=
# smtp.user=
# smtp.password=
File renamed without changes
File renamed without changes
Binary file added images/wiki/dr-elephant-logo-150x150.png
Binary file added images/wiki/dr-elephant-logo-300x300.png
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
2 changes: 1 addition & 1 deletion public/js/flowhistoryform.js
@@ -58,7 +58,7 @@ function getGraphTooltipContent(record, jobDefList) {
heading.appendChild(document.createElement("br"));

var details = document.createElement("p");
details.appendChild(document.createTextNode("Job Score = " + record.score));
details.appendChild(document.createTextNode("Flow Score = " + record.score));

var jobTable = document.createElement("table");
if (record.score != 0) {