Adds python script to collect and anonymize Spark and MR job history logs #223
Conversation
…job history server on hadoop2
Changed the script so it prints out what it is doing. I suspect that somehow the directory name was not getting set correctly. Used a one-liner that seems to work on both Mac and Linux to determine the directory of a script. Used braces to delimit the shell variable names; maybe random variables were getting set. The zip command included a -x that pretty much amounted to nothing, since it excluded '*'; changed the zip command to be more specific. Finally, added a trap command that fields ^C and exits the script. It seems to survive Ctrl-C now. RB=399693 BUGS=HADOOP-7816 R=fli,akrai A=fli
…n job has no mappers
…plying it to the query instead of asking the query to perform the case-insensitive comparison.
Changed HadoopJobData to include finishTime since that is needed for metrics. Changed the signature of getJobCounter to include jobConf and jobData so that it can publish metrics. Updated README.md. Tested locally on my box and on spades. RB=406817 BUGS=HADOOP-7814 R=fli,mwagner A=fli
The java file DaliMetricsAPI.java has a flavor of the APIs that we will be exposing from the dali library. We can split these classes into individual files when we move this functionality to the dali library. Changed the start script to look for a config file that configures a publisher. If the file is present, then dr-elephant is started with an option that has the file name. If the file is not present, then the behavior is unchanged (i.e. no metrics are published). If the file is parsed correctly, then dr-elephant publishes metrics to HDFS (one avro file per job) for jobs that are configured to publish metrics. The job needs to set something like mapreduce.job.publish-counters='org.apache.hadoop.examples.WordCount$AppCounter:*' to publish all counters in the given group. The format is 'groupName:counterName', where counterName can be an asterisk to indicate all counters in the group; see the class DaliMetricsAPI.CountersToPublish. The HDFSPublisher is configured with a base path under which metrics are published; the date/hour hierarchy is added to the base path. The XML file for configuring dr-elephant is checked in as a template. A config file needs to be added to the 'conf' path of dr-elephant (manually, as per meeting with hadoop-admin) on clusters where we want dr-elephant to publish metrics. RB=409443 BUGS=HADOOP-7814 R=fli,csteinba,mwagner,cbotev,ahsu A=fli,ahsu
hadoop-1 does not have JobStatus.getFinishTime(). This causes dr-elephant to hang. Set the start time to be the same as the finish time for h1 jobs. For consistency, reverted to the old method of scraping the job tracker url so that we get only the start time, and set the finish time to be equal to the start time for retired jobs as well. RB=417975 BUGS=HADOOP-8640 R=fli,mwagner A=fli
RB=417448 BUGS=HADOOP-8648 R=fli A=fli
…ff 51 reducers instead of 50
…inkedin#169) * logger actually returned number of application types, not job types * also log the appType size
…before running heuristic (linkedin#168)
…ows (linkedin#167) also fixed testDepthCalculation - our test workflow only has one parent so Depth should be 1.
* Rewrite Spark fetcher/heuristics.

The purpose of this update is to:
- rewrite the Spark data fetcher to use Spark event logs minimally, since it can be expensive to download and process these fully as done before
- rewrite the Spark data fetcher to use the [Spark monitoring REST API](https://spark.apache.org/docs/1.4.1/monitoring.html#rest-api), which provides almost all of the information Spark heuristics need
- update the Spark heuristics to provide hopefully more useful information and avoid being arbitrarily restrictive

The new Spark-related code is provided in parallel to the old Spark-related code. To enable it:
- Uncomment and swap in the appropriate fragments in `AggregatorConf.xml`, `FetcherConf.xml`, and `HeuristicConf.xml`.
- Set `SPARK_CONF_DIR` (or `SPARK_HOME`) to an appropriate location so that Dr. Elephant can find `spark-defaults.conf`.

Heuristics added:
- "Executor shuffle read bytes distribution": We now provide a distribution with min/25p/median/75p/max, with severity based on a max-to-median ratio.
- "Executor shuffle write bytes distribution": We now provide a distribution with min/25p/median/75p/max, with severity based on a max-to-median ratio.

Heuristics changed:
- "Average input size" -> "Executor input bytes distribution": Instead of providing an average along with min/max, we now provide a distribution with min/25p/median/75p/max, with severity based on a max-to-median ratio.
- "Average peak storage memory" -> "Executor storage memory used distribution": Instead of providing an average along with min/max, we now provide a distribution with min/25p/median/75p/max, with severity based on a max-to-median ratio.
- "Average runtime" -> "Executor task time distribution": Instead of providing an average along with min/max, we now provide a distribution with min/25p/median/75p/max, with severity based on a max-to-median ratio.
- "Memory utilization rate" -> "Executor storage memory utilization rate": This seemed to imply total memory, but it is just the utilization rate for storage memory, so it has been relabeled to indicate that. Shuffle memory is important too (but we don't seem to have access to shuffle memory utilization metrics).
- "Total memory used at peak" -> "Total executor storage memory used": This also refers to storage memory. It has been relabeled to indicate that.
- "Spark problematic stages" -> ("Spark stages with high task failure rates", "Spark stages with long average executor runtimes"): This was a combination of stages with high task failure rates and those with long runtimes. Those have been separated.

Heuristics removed:
- spark.executor.cores: I think this is somewhat discretionary. At the very least, our internal recommendation stopped matching the one in Dr. Elephant.
- spark.shuffle.manager: This was changed to "sort" by default as of Spark 1.2, so there is no current use for checking this setting.
- "Average output size": Metrics related to output size appear to be deprecated or non-existent, so there is no current use for checking this setting.

Finally, overall waste metrics are calculated based on allocation [app runtime * # of executors * executor memory] vs. usage [total executor run time * executor memory]; see the sketch below. They were previously calculated based only on storage memory and some 50% buffer, which I didn't understand.

Added unit tests and also tested against our internal cluster as much as I practically could. Will need help to fully validate.
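As a rough illustration of the allocation-vs-usage waste calculation described in that commit, here is a minimal Python sketch; the variable names and the MB-second units are assumptions for illustration, not the aggregator's actual code:

```python
def resource_waste(app_runtime_sec, num_executors, executor_memory_mb,
                   total_executor_task_time_sec):
    """Illustrative waste calculation: allocation vs. actual usage.

    allocation = app runtime * # of executors * executor memory
    usage      = total executor run time * executor memory
    Both are expressed here in MB-seconds (an assumption for this sketch).
    """
    allocation = app_runtime_sec * num_executors * executor_memory_mb
    usage = total_executor_task_time_sec * executor_memory_mb
    wasted = max(allocation - usage, 0)
    ratio = wasted / float(allocation) if allocation else 0.0
    return wasted, ratio

# Example: a 600s app with 10 executors of 4096MB each, whose tasks kept
# executors busy for a combined 3000 executor-seconds, wastes 50% of its
# allocation.
wasted_mb_sec, waste_ratio = resource_waste(600, 10, 4096, 3000)
```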
* Fix default (mb) and verify memory size >= 0.
* Parse the string that's already been retrieved from the config.
…in#174) * LIHADOOP-25142: Build the user-summary page for Dr. Elephant. * LIHADOOP-20704: Expose a REST interface to provide aggregation and filtering based on username. Other fixes: tooltips on the search panel; fix to search by job_*
…d How to Contribute page (linkedin#185) Quick Setup: https://github.com/linkedin/dr-elephant/wiki/Quick-Setup-Instructions How to Contribute?: https://github.com/linkedin/dr-elephant/wiki/How-to-Contribute%3F
* Use map-reduce job configuration properties to provide flow and job history run outside of a scheduler. They are injected into the job conf by pig (see PIG-3048) and hive (HIVE-3708) * Prefix jobDefId/workflowDefId with username
Jobs which put large files (> 500MB) in the distributed cache are flagged. Files referenced by the following properties are considered: mapreduce.job.cache.files and mapreduce.job.cache.archives.
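A minimal Python sketch of the idea behind this check (the function and parameter names are hypothetical; the real heuristic reads these properties from the job configuration inside dr-elephant itself):

```python
# Properties whose comma-separated entries are checked by the heuristic.
CACHE_PROPERTIES = ("mapreduce.job.cache.files", "mapreduce.job.cache.archives")
SIZE_LIMIT_BYTES = 500 * 1024 * 1024  # 500MB threshold from the description

def flag_large_cache_files(job_conf, file_sizes):
    """Return distributed-cache entries larger than the limit.

    job_conf   -- dict of job configuration properties (comma-separated paths)
    file_sizes -- dict mapping each path to its size in bytes (hypothetical)
    """
    flagged = []
    for prop in CACHE_PROPERTIES:
        for path in filter(None, job_conf.get(prop, "").split(",")):
            if file_sizes.get(path, 0) > SIZE_LIMIT_BYTES:
                flagged.append(path)
    return flagged
```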
* Removes pattern matching
…p2 (linkedin#203) (1) Use ArrayList instead (2) Add unit test for this
…dd missing workflow links (linkedin#207)
@rtype: str
"""
hasher = hashlib.sha256()
hasher.update(plaintext)
A pedantic comment on security (I'm currently TAing a course on computer security, so I can't not at least point this out :P): it might be a good idea to add a random salt to the plaintext before hashing. Otherwise, low-entropy plaintexts are crackable. I see two ways of doing this:
- Take the salt as an input to the script. This would allow an organization to use a single salt for all their files, maintaining plaintext correlations between runs.
- Generate the salt at the beginning of each script run and use it for the duration of the run. This is easier, but correlations between runs are lost.
Also, of course there could be the option to do either.
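For what it's worth, a minimal sketch of the second option (generate a salt once per run), assuming a hashing helper along the lines of the fragment quoted above; the names here are illustrative rather than the script's actual API:

```python
import base64
import hashlib
import os

# Generated once at script startup: correlations within a run are kept,
# correlations between runs are lost.
RUN_SALT = os.urandom(16)

def anonymize_token(plaintext, salt=RUN_SALT):
    """Return a base64'd SHA-256 hash of salt + plaintext.

    @rtype: str
    """
    hasher = hashlib.sha256()
    hasher.update(salt)
    hasher.update(plaintext.encode("utf-8"))
    return base64.b64encode(hasher.digest()).decode("ascii")
```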
Looking towards creating a shared repository of job history logs, here's a script to collect, anonymize and create tarballs from job history files on HDFS. Not yet sure where in the source tree this script, or a future repo of job logs, should live. Note that half the lines in the script are whitelist words, which should not be anonymized.
Use this program to get a snapshot of MapReduce & Spark history data to a local tarball. It should be called as a user that has access to the history directories. Optionally, anonymization can be applied, which replaces sensitive data (hostnames, usernames, ...) with base64'd SHA-256 hashes of the same data.
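To make the anonymization step concrete, here is a rough sketch of how a whitelist-aware replacement pass could look (assumed names and regex; the script's actual structure and whitelist differ):

```python
import re

# Tokens that carry no sensitive information are left untouched; the real
# script ships a much longer whitelist.
WHITELIST = {"mapreduce", "spark", "job", "task", "attempt"}

TOKEN_RE = re.compile(r"[A-Za-z0-9_.-]+")

def anonymize_line(line, hash_token):
    """Replace every non-whitelisted token with its hashed form.

    hash_token -- callable mapping a token to its base64'd SHA-256 hash
    """
    def replace(match):
        token = match.group(0)
        if token.lower() in WHITELIST:
            return token
        return hash_token(token)
    return TOKEN_RE.sub(replace, line)
```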
Usage examples:
Do a basic run, getting the last 1000 MR + Spark jobs.
get_jobhistory_hdfs.py -o /tmp/jobhistory.tgz
Get the last 20 jobs, with anonymization.
get_jobhistory_hdfs.py -o /tmp/jobhistory.tgz -c 20 -a
Basic run, but running as the HDFS user for permission reasons.
sudo -u hdfs ./get_jobhistory_hdfs.py -o /tmp/jobhistory.tgz
optional arguments:
-h, --help show this help message and exit
-c COUNT, --count COUNT
How many recent jobs to retrieve
-d MR_DIR, --mr-dir MR_DIR
History dir for mapreduce
-s SPARK_DIR, --spark-dir SPARK_DIR
History dir for spark
-o OUTPUT_TARBALL, --output-tarball OUTPUT_TARBALL
Output tarball name
-a, --anonymize Anonymize output
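For reference, an argument parser along these lines would produce roughly the help text above (a sketch only; the default count and whether --output-tarball is required are assumptions based on the usage examples):

```python
import argparse

def build_arg_parser():
    """Build a parser matching the options listed above."""
    parser = argparse.ArgumentParser(
        description="Snapshot MapReduce and Spark job history into a local tarball.")
    parser.add_argument("-c", "--count", type=int, default=1000,
                        help="How many recent jobs to retrieve")
    parser.add_argument("-d", "--mr-dir", help="History dir for mapreduce")
    parser.add_argument("-s", "--spark-dir", help="History dir for spark")
    parser.add_argument("-o", "--output-tarball", required=True,
                        help="Output tarball name")
    parser.add_argument("-a", "--anonymize", action="store_true",
                        help="Anonymize output")
    return parser

if __name__ == "__main__":
    args = build_arg_parser().parse_args()
```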