Adds python script to collect and anonymize Spark and MR job history logs #223

Open

simonpk wants to merge 272 commits into master
Conversation


@simonpk commented Mar 8, 2017

Looking towards creating a shared repository of job history logs, here's a script that collects job history files from HDFS, optionally anonymizes them, and bundles them into a tarball. Not yet sure where in the source tree this script, or a future repo of job logs, should live. Note that about half the lines in the script are whitelist words, which should not be anonymized.

Use this program to capture a snapshot of MapReduce and Spark history data into a local tarball. It should be run as a user that has access to the history directories. Optionally, anonymization can be applied, which replaces sensitive data (hostnames, usernames, ...) with base64'd SHA-256 hashes of the same data.

Usage examples:

Do a basic run, getting the last 1000 MR + Spark jobs.
get_jobhistory_hdfs.py -o /tmp/jobhistory.tgz

Get the last 20 jobs, with anonymization.
get_jobhistory_hdfs.py -o /tmp/jobhistory.tgz -c 20 -a

Basic run, but running as the HDFS user for permission reasons.
sudo -u hdfs ./get_jobhistory_hdfs.py -o /tmp/jobhistory.tgz

optional arguments:
-h, --help show this help message and exit
-c COUNT, --count COUNT
How many recent jobs to retrieve
-d MR_DIR, --mr-dir MR_DIR
History dir for mapreduce
-s SPARK_DIR, --spark-dir SPARK_DIR
History dir for spark
-o OUTPUT_TARBALL, --output-tarball OUTPUT_TARBALL
Output tarball name
-a, --anonymize Anonymize output
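
For context, a minimal sketch of the kind of anonymization described above: every non-whitelisted token is replaced with a base64'd SHA-256 hash of itself. The names (`WHITELIST`, `anonymize_text`) and the token regex are illustrative assumptions, not the PR's actual code.

```python
import base64
import hashlib
import re

# Illustrative whitelist; the real script carries a much larger set of
# non-sensitive words that must be left untouched.
WHITELIST = {"mapreduce", "spark", "job", "task"}

def hash_token(token):
    """Replace a sensitive token with a base64'd SHA-256 hash of it."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")

def anonymize_text(text):
    """Hash every word-like token that is not in the whitelist."""
    def repl(match):
        token = match.group(0)
        return token if token.lower() in WHITELIST else hash_token(token)
    return re.sub(r"[A-Za-z][\w.-]*", repl, text)
```

With a whitelist of common Hadoop terms, hostnames and usernames hash to stable opaque strings while the rest of the log stays readable.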

Akshay Rai and others added 30 commits November 10, 2014 20:53
Changed the script so it prints out what it is doing.

I suspect that somehow the directory name was not getting set correctly. Used a
one-liner that seems to work on both Mac and Linux to determine the directory of a script.

Used braces to delimit the shell variable names. Maybe random variables were getting set?

The zip command included a -x that pretty much amounted to nothing, since it excluded '*'.
Changed the zip command to be more specific.

Finally, added a trap command that fields ^C and exits the script.

It seems to survive ctrl-C now.

RB=399693
BUGS=HADOOP-7816
R=fli,akrai
A=fli
…plying it to the query instead of asking the query to perform the case-insensitive comparison.
Changed HadoopJobData to include finishTime since that is needed for
metrics.
Changed the signature of getJobCounter to include jobConf and jobData
so that it can publish metrics
Updated README.md

Tested locally on my box and on spades

RB=406817
BUGS=HADOOP-7814
R=fli,mwagner
A=fli
The java file DaliMetricsAPI.java has a flavor of the APIs that we will be exposing from the dali library.
We can split these classes into individual files when we move this functionality to the dali library.

Changed start script to look for a config file that configures a publisher. If the file is present,
then dr-elephant is started with an option that has the file name. If the file is not present,
then the behavior is unchanged (i.e. no metrics are published).

If the file is parsed correctly then dr-elephant publishes metrics in HDFS (one avro file per job)
for jobs that are configured to publish the metrics.

The job needs to set something like mapreduce.job.publish-counters='org.apache.hadoop.examples.WordCount$AppCounter:*'
to publish all counters in the given group. The format is 'groupName:counterName', where counterName can be an
asterisk to indicate all counters in the group. See the class DaliMetricsAPI.CountersToPublish.
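
As a rough illustration of that 'groupName:counterName' format (a Python sketch only; the real parsing lives in the Java class DaliMetricsAPI.CountersToPublish and may differ):

```python
def parse_publish_counters(spec):
    """Parse a 'groupName:counterName' spec; '*' selects all counters in the group."""
    group, _, counter = spec.rpartition(":")
    if not group:
        raise ValueError("expected 'groupName:counterName', got %r" % spec)
    return group, None if counter == "*" else counter

# Example from the commit message above:
group, counter = parse_publish_counters(
    "org.apache.hadoop.examples.WordCount$AppCounter:*")
# group == "org.apache.hadoop.examples.WordCount$AppCounter"; counter is None, i.e. all counters.
```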

The HDFSPublisher is configured with a base path under which metrics are published. The date/hour hierarchy is added
to the base path.

The XML file for configuring dr-elephant is checked in as a template. A config file needs to be added to the
'conf' path of dr-elephant (manually, as per meeting with hadoop-admin) on clusters where we want dr-elephant
to publish metrics.

RB=409443
BUGS=HADOOP-7814
R=fli,csteinba,mwagner,cbotev,ahsu
A=fli,ahsu
hadoop-1 does not have JobStatus.getFinishTime(). This causes dr-elephant to hang.

Set the start time to be the same as the finish time for h1 jobs.

For consistency, reverted to the old method of scraping the job tracker URL so that we get only the
start time, and set the finish time to be equal to the start time for retired jobs as well.

RB=417975
BUGS=HADOOP-8640
R=fli,mwagner
A=fli
RB=417448
BUGS=HADOOP-8648
R=fli
A=fli
babak-altiscale and others added 28 commits November 23, 2016 23:43
…inkedin#169)

* logger actually returned number of application types, not job types

* also log the appType size
…ows (linkedin#167)

also fixed testDepthCalculation - our test workflow only has one parent so Depth should be 1.
* Rewrite Spark fetcher/heuristics.

The purpose of this update is to:

- rewrite the Spark data fetcher to use Spark event logs minimally, since it can be expensive to download and process these fully as done before
- rewrite the Spark data fetcher to use the [Spark monitoring REST API](https://spark.apache.org/docs/1.4.1/monitoring.html#rest-api), which provides almost all of the information Spark heuristics need
- update the Spark heuristics to provide hopefully more useful information and avoid being arbitrarily restrictive

The new Spark-related code is provided in parallel to the old Spark-related code. To enable it:

- Uncomment and swap in the appropriate fragments in `AggregatorConf.xml`, `FetcherConf.xml`, and `HeuristicConf.xml`.
- Set `SPARK_CONF_DIR` (or `SPARK_HOME`) to an appropriate location so that Dr. Elephant can find `spark-defaults.conf`.

Heuristics added:

- "Executor shuffle read bytes distribution": We now provide a distribution with min/25p/median/75p/max, with severity based on a max-to-median ratio.
- "Executor shuffle write bytes distribution": We now provide a distribution with min/25p/median/75p/max, with severity based on a max-to-median ratio.

Heuristics changed:

- "Average input size" -> "Executor input bytes distribution": Instead of providing an average along with min/max, we now provide a distribution with min/25p/median/75p/max, with severity based on a max-to-median ratio.
- "Average peak storage memory" -> "Executor storage memory used distribution": Instead of providing an average along with min/max, we now provide a distribution with min/25p/median/75p/max, with severity based on a max-to-median ratio.
- "Average runtime" -> "Executor task time distribution": Instead of providing an average along with min/max, we now provide a distribution with min/25p/median/75p/max, with severity based on a max-to-median ratio.
- "Memory utilization rate" -> "Executor storage memory utilization rate": This seemed to imply total memory but it is just the utilization rate for storage memory, so has been relabeled to indicate that. Shuffle memory is important too (but we don't seem to have access to shuffle memory utilization metrics).
- "Total memory used at peak" -> "Total executor storage memory used": This also refers to storage memory. It has been relabeled to indicate that.
- "Spark problematic stages" -> ("Spark stages with high task failure rates", "Spark stages with long average executor runtimes"): This was a combination of stages with high task failure rates and those with long runtimes. Those have been separated.

Heuristics removed:

- spark.executor.cores: I think this is somewhat discretionary. At the very least, our internal recommendation stopped matching the one in Dr. Elephant.
- spark.shuffle.manager: This was changed to "sort" by default as of Spark 1.2, so there is no current use for checking this setting.
- "Average output size": Metrics related to output size appear to be deprecated or non-existent, so there is no current use for checking this setting.

Finally, overall waste metrics are calculated based on allocation [app runtime * # of executors * executor memory] vs. usage [total executor run time * executor memory]. They were previously calculated based only on storage memory and some 50% buffer, which I didn't understand.
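
In code form, a sketch of that calculation under the assumption that times share a unit and memory is per executor (not the actual implementation):

```python
def waste_metrics(app_runtime, num_executors, executor_memory, total_executor_run_time):
    """Wasted resources = allocated resources minus resources actually used.

    allocation = app runtime * number of executors * executor memory
    usage      = total executor run time * executor memory
    """
    allocation = app_runtime * num_executors * executor_memory
    usage = total_executor_run_time * executor_memory
    wasted = max(allocation - usage, 0)
    return allocation, usage, wasted
```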

Added unit tests and also tested against our internal cluster as much as I practically could. Will need help to fully validate.
* Fix default (mb) and verify memory size >= 0.
* Parse the string that's already been retrieved from the config.
…in#174)

* LIHADOOP-25142: Build the user-summary page for Dr. Elephant
* LIHADOOP-20704: Expose rest interface to provide aggregation and filtering based on username
* Other fixes:
  - Tooltips on search panel
  - Fix to search by job_*
* Use map-reduce job configuration properties to provide flow and job
history for runs outside of a scheduler. They are injected into the job conf by Pig (see PIG-3048) and Hive (HIVE-3708)

* Prefix jobDefId/workflowDefId with username
Jobs which put large files (> 500 MB) in the distributed cache are flagged. Files referenced by the following properties are considered (a sketch follows the list).
  mapreduce.job.cache.files
  mapreduce.job.cache.archives
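
A minimal sketch of that check (the helper names and conf handling are assumptions; the actual heuristic is implemented in dr-elephant itself):

```python
# Illustrative threshold from the commit message above: flag files larger than 500 MB.
LIMIT_BYTES = 500 * 1024 * 1024
CACHE_PROPERTIES = ("mapreduce.job.cache.files", "mapreduce.job.cache.archives")

def flag_large_cache_files(job_conf, file_size_fn):
    """Return distributed-cache entries whose size exceeds the limit.

    job_conf: dict of job configuration properties (comma-separated file lists).
    file_size_fn: callable returning a file's size in bytes (e.g. an HDFS lookup).
    """
    flagged = []
    for prop in CACHE_PROPERTIES:
        for path in filter(None, job_conf.get(prop, "").split(",")):
            if file_size_fn(path) > LIMIT_BYTES:
                flagged.append((prop, path))
    return flagged
```
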
…p2 (linkedin#203)

(1) Use ArrayList instead
(2) Add unit test for this
@rtype: str
"""
hasher = hashlib.sha256()
hasher.update(plaintext)
@paulbramsen (Contributor) commented Mar 13, 2017

A pedantic comment on security (I'm currently TAing a course on computer security so I can't not at least point this out :P): it might be a good idea to add a random salt to the plaintext before hashing. Otherwise low-entropy plaintexts are crackable. I see two ways of doing this:

  1. Take the salt as an input to the script. This would allow an organization to use a single salt for all their files, maintaining plaintext correlations between runs.
  2. Generate the salt at the beginning of each script run and use it for the duration of the run. This is easier, but correlations between runs are lost.

Also, of course there could be the option to do either.
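
For illustration, a minimal sketch of the salted variant suggested here, covering both options (the `--salt` flag and function names are hypothetical, not part of the PR):

```python
import argparse
import base64
import hashlib
import os

def salted_hash(plaintext, salt):
    """base64'd SHA-256 of salt || plaintext, so low-entropy values aren't trivially crackable."""
    hasher = hashlib.sha256()
    hasher.update(salt)
    hasher.update(plaintext.encode("utf-8"))
    return base64.b64encode(hasher.digest()).decode("ascii")

parser = argparse.ArgumentParser()
# Option 1: caller-supplied salt keeps hashes comparable across runs (hypothetical flag).
parser.add_argument("--salt", help="shared secret salt; preserves correlations between runs")
args = parser.parse_args()

# Option 2: if no salt is given, generate one per run; correlations across runs are lost.
salt = args.salt.encode("utf-8") if args.salt else os.urandom(16)
```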
