Adds python script to collect and anonymize Spark and MR job history logs #223

Open

simonpk wants to merge 272 commits into master
Conversation


@simonpk commented Mar 8, 2017

Looking towards creating a shared repository of job history logs, here's a script that collects job history files from HDFS, optionally anonymizes them, and bundles them into a tarball. Not yet sure where in the source tree this script, or a future repo of job logs, should live. Note that about half the lines in the script are whitelist words, which should not be anonymized.

Use this program to capture a snapshot of MapReduce and Spark history data into a local tarball. It should be run as a user that has access to the history directories. Optionally, anonymization can be applied, which replaces sensitive data (hostnames, usernames, ...) with base64'd SHA-256 hashes of the same data.

Usage examples:

Do a basic run, getting the last 1000 MR + Spark jobs.
get_jobhistory_hdfs.py -o /tmp/jobhistory.tgz

Get the last 20 jobs, with anonymization.
get_jobhistory_hdfs.py -o /tmp/jobhistory.tgz -c 20 -a

Basic run, but running as the HDFS user for permission reasons.
sudo -u hdfs ./get_jobhistory_hdfs.py -o /tmp/jobhistory.tgz

optional arguments:
-h, --help show this help message and exit
-c COUNT, --count COUNT
How many recent jobs to retrieve
-d MR_DIR, --mr-dir MR_DIR
History dir for mapreduce
-s SPARK_DIR, --spark-dir SPARK_DIR
History dir for spark
-o OUTPUT_TARBALL, --output-tarball OUTPUT_TARBALL
Output tarball name
-a, --anonymize Anonymize output
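
For context, a minimal sketch of the kind of anonymization described above: every non-whitelisted token is replaced with a base64'd SHA-256 hash of itself. The names (`WHITELIST`, `anonymize_text`) and the token regex are illustrative assumptions, not the PR's actual code.

```python
import base64
import hashlib
import re

# Illustrative whitelist; the real script carries a much larger set of
# non-sensitive words that must be left untouched.
WHITELIST = {"mapreduce", "spark", "job", "task"}

def hash_token(token):
    """Replace a sensitive token with a base64'd SHA-256 hash of it."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")

def anonymize_text(text):
    """Hash every word-like token that is not in the whitelist."""
    def repl(match):
        token = match.group(0)
        return token if token.lower() in WHITELIST else hash_token(token)
    return re.sub(r"[A-Za-z][\w.-]*", repl, text)
```

With a whitelist of common Hadoop terms, hostnames and usernames hash to stable opaque strings while the rest of the log stays readable.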

Akshay Rai and others added 30 commits November 10, 2014 20:53
Changed the script so it prints out what it is doing.

I suspect that somehow the directory name was not getting set correctly. Used a
one-liner that seems to work on both Mac and Linux to determine the directory of a script.

Used braces to delimit the shell variable names. Maybe random variables were getting set?

The zip command included a -x that pretty much amounted to nothing, since it excluded '*'.
Changed the zip command to be more specific.

Finally, added a trap command that fields ^C and exits the script.

It seems to survive ctrl-C now.

RB=399693
BUGS=HADOOP-7816
R=fli,akrai
A=fli
…plying it to the query instead of asking the query to perform the case-insensitive comparison.
Changed HadoopJobData to include finishTime since that is needed for
metrics.
Changed the signature of getJobCounter to include jobConf and jobData
so that it can publish metrics
Updated README.md

Tested locally on my box and on spades

RB=406817
BUGS=HADOOP-7814
R=fli,mwagner
A=fli
The java file DaliMetricsAPI.java has a flavor of the APIs that we will be exposing from the dali library.
We can split these classes into individual files when we move this functionality to the dali library.

Changed start script to look for a config file that configures a publisher. If the file is present,
then dr-elephant is started with an option that has the file name. If the file is not present,
then the behavior is unchanged (i.e. no metrics are published).

If the file is parsed correctly then dr-elephant publishes metrics in HDFS (one avro file per job)
for jobs that are configured to publish the metrics.

The job needs to set something like mapreduce.job.publish-counters='org.apache.hadoop.examples.WordCount$AppCounter:*'
to publish all counters in the given group. The format is 'groupName:counterName', where counterName can be an
asterisk to indicate all counters in the group. See the class DaliMetricsAPI.CountersToPublish.
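
As a rough illustration of that 'groupName:counterName' format (a Python sketch only; the real parsing lives in the Java class DaliMetricsAPI.CountersToPublish and may differ):

```python
def parse_publish_counters(spec):
    """Parse a 'groupName:counterName' spec; '*' selects all counters in the group."""
    group, _, counter = spec.rpartition(":")
    if not group:
        raise ValueError("expected 'groupName:counterName', got %r" % spec)
    return group, None if counter == "*" else counter

# Example from the commit message above:
group, counter = parse_publish_counters(
    "org.apache.hadoop.examples.WordCount$AppCounter:*")
# group == "org.apache.hadoop.examples.WordCount$AppCounter"; counter is None, i.e. all counters.
```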

The HDFSPublisher is configured with a base path under which metrics are published. The date/hour hierarchy is added
to the base path.

The XML file for configuring dr-elephant is checked in as a template. A config file needs to be added to the
'conf' path of dr-elephant (manually, as per meeting with hadoop-admin) on clusters where we want dr-elephant
to publish metrics.

RB=409443
BUGS=HADOOP-7814
R=fli,csteinba,mwagner,cbotev,ahsu
A=fli,ahsu
hadoop-1 does not have JobStatus.getFinishTime(). This causes dr-elephant to hang.

Set the start time to be the same as the finish time for h1 jobs.

For consistency, reverted to the old method of scraping the job tracker URL so that we get only the
start time, and set the finish time to be equal to the start time for retired jobs as well.

RB=417975
BUGS=HADOOP-8640
R=fli,mwagner
A=fli
RB=417448
BUGS=HADOOP-8648
R=fli
A=fli
babak-altiscale and others added 28 commits November 23, 2016 23:43
…inkedin#169)

* logger actually returned number of application types, not job types

* also log the appType size
…ows (linkedin#167)

also fixed testDepthCalculation - our test workflow only has one parent so Depth should be 1.
* Rewrite Spark fetcher/heuristics.

The purpose of this update is to:

- rewrite the Spark data fetcher to use Spark event logs minimally, since it can be expensive to download and process these fully as done before
- rewrite the Spark data fetcher to use the [Spark monitoring REST API](https://spark.apache.org/docs/1.4.1/monitoring.html#rest-api), which provides almost all of the information Spark heuristics need
- update the Spark heuristics to provide hopefully more useful information and avoid being arbitrarily restrictive

The new Spark-related code is provided in parallel to the old Spark-related code. To enable it:

- Uncomment and swap in the appropriate fragments in `AggregatorConf.xml`, `FetcherConf.xml`, and `HeuristicConf.xml`.
- Set `SPARK_CONF_DIR` (or `SPARK_HOME`) to an appropriate location so that Dr. Elephant can find `spark-defaults.conf`.

Heuristics added:

- "Executor shuffle read bytes distribution": We now provide a distribution with min/25p/median/75p/max, with severity based on a max-to-median ratio.
- "Executor shuffle write bytes distribution": We now provide a distribution with min/25p/median/75p/max, with severity based on a max-to-median ratio.

Heuristics changed:

- "Average input size" -> "Executor input bytes distribution": Instead of providing an average along with min/max, we now provide a distribution with min/25p/median/75p/max, with severity based on a max-to-median ratio.
- "Average peak storage memory" -> "Executor storage memory used distribution": Instead of providing an average along with min/max, we now provide a distribution with min/25p/median/75p/max, with severity based on a max-to-median ratio.
- "Average runtime" -> "Executor task time distribution": Instead of providing an average along with min/max, we now provide a distribution with min/25p/median/75p/max, with severity based on a max-to-median ratio.
- "Memory utilization rate" -> "Executor storage memory utilization rate": This seemed to imply total memory but it is just the utilization rate for storage memory, so has been relabeled to indicate that. Shuffle memory is important too (but we don't seem to have access to shuffle memory utilization metrics).
- "Total memory used at peak" -> "Total executor storage memory used": This also refers to storage memory. It has been relabeled to indicate that.
- "Spark problematic stages" -> ("Spark stages with high task failure rates", "Spark stages with long average executor runtimes"): This was a combination of stages with high task failure rates and those with long runtimes. Those have been separated.

Heuristics removed:

- spark.executor.cores: I think this is somewhat discretionary. At the very least, our internal recommendation stopped matching the one in Dr. Elephant.
- spark.shuffle.manager: This was changed to "sort" by default as of Spark 1.2, so there is no current use for checking this setting.
- "Average output size": Metrics related to output size appear to be deprecated or non-existent, so there is no current use for checking this setting.

Finally, overall waste metrics are calculated based on allocation [app runtime * # of executors * executor memory] vs. usage [total executor run time * executor memory]. They were previously calculated based only on storage memory and some 50% buffer, which I didn't understand.
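
In code form, a sketch of that calculation under the assumption that times share a unit and memory is per executor (not the actual implementation):

```python
def waste_metrics(app_runtime, num_executors, executor_memory, total_executor_run_time):
    """Wasted resources = allocated resources minus resources actually used.

    allocation = app runtime * number of executors * executor memory
    usage      = total executor run time * executor memory
    """
    allocation = app_runtime * num_executors * executor_memory
    usage = total_executor_run_time * executor_memory
    wasted = max(allocation - usage, 0)
    return allocation, usage, wasted
```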

Added unit tests and also tested against our internal cluster as much as I practically could. Will need help to fully validate.
* Fix default (mb) and verify memory size >= 0.
* Parse the string that's already been retrieved from the config.
…in#174)

* LIHADOOP-25142: Build the user-summary page for Dr. Elephant
* LIHADOOP-20704: Expose rest interface to provide aggregation and filtering based on username
* Other fixes:
  - Tooltips on search panel
  - Fix to search by job_*
* Use map-reduce job configuration properties to provide flow and job
history for runs outside of a scheduler. They are injected into the job conf by Pig (see PIG-3048) and Hive (HIVE-3708)

* Prefix jobDefId/workflowDefId with username
Jobs which put large files (> 500 MB) in the distributed cache are flagged. Files referenced by the following properties are considered (a sketch follows the list).
  mapreduce.job.cache.files
  mapreduce.job.cache.archives
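
A minimal sketch of that check (the helper names and conf handling are assumptions; the actual heuristic is implemented in dr-elephant itself):

```python
# Illustrative threshold from the commit message above: flag files larger than 500 MB.
LIMIT_BYTES = 500 * 1024 * 1024
CACHE_PROPERTIES = ("mapreduce.job.cache.files", "mapreduce.job.cache.archives")

def flag_large_cache_files(job_conf, file_size_fn):
    """Return distributed-cache entries whose size exceeds the limit.

    job_conf: dict of job configuration properties (comma-separated file lists).
    file_size_fn: callable returning a file's size in bytes (e.g. an HDFS lookup).
    """
    flagged = []
    for prop in CACHE_PROPERTIES:
        for path in filter(None, job_conf.get(prop, "").split(",")):
            if file_size_fn(path) > LIMIT_BYTES:
                flagged.append((prop, path))
    return flagged
```
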
…p2 (linkedin#203)

(1) Use ArrayList instead
(2) Add unit test for this
@rtype: str
"""
hasher = hashlib.sha256()
hasher.update(plaintext)
@paulbramsen (Contributor) commented Mar 13, 2017

A pedantic comment on security (I'm currently TAing a course on computer security so I can't not at least point this out :P): it might be a good idea to add a random salt to the plaintext before hashing. Otherwise low-entropy plaintexts are crackable. I see two ways of doing this:

  1. Take the salt as an input to the script. This would allow an organization to use a single salt for all their files, maintaining plaintext correlations between runs.
  2. Generate the salt at the beginning of each script run and use it for the duration of the run. This is easier, but correlations between runs are lost.

Also, of course there could be the option to do either.
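
For illustration, a minimal sketch of the salted variant suggested here, covering both options (the `--salt` flag and function names are hypothetical, not part of the PR):

```python
import argparse
import base64
import hashlib
import os

def salted_hash(plaintext, salt):
    """base64'd SHA-256 of salt || plaintext, so low-entropy values aren't trivially crackable."""
    hasher = hashlib.sha256()
    hasher.update(salt)
    hasher.update(plaintext.encode("utf-8"))
    return base64.b64encode(hasher.digest()).decode("ascii")

parser = argparse.ArgumentParser()
# Option 1: caller-supplied salt keeps hashes comparable across runs (hypothetical flag).
parser.add_argument("--salt", help="shared secret salt; preserves correlations between runs")
args = parser.parse_args()

# Option 2: if no salt is given, generate one per run; correlations across runs are lost.
salt = args.salt.encode("utf-8") if args.salt else os.urandom(16)
```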
