A tool to record and visualize metrics captured from Cascading (Scalding) workflows at runtime.
Designed to target the pain points of analysts and end users of Cascading, Sahale provides insight into a workflow's runtime resource usage and makes job debugging and locating relevant Hadoop logs easy. The tool reveals optimization opportunities by exposing inefficient MapReduce jobs in a larger workflow, and enables users to track the execution history of their workflows.
Sahale has been tested and verified to work with several different Cascading DSLs, but the example projects are geared toward Scalding, which is our primary Hadoop analytics tool at Etsy.
Step One: This step assumes you have a MySQL instance Sahale can use. Clone the Sahale repository and follow the instructions in `src/main/sql/create_db_tables.sql` to create the tables the Scala and NodeJS components expect. Modify `db-config.json` in the project's root directory to point to your database.
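For example, the whole step might look like the following sketch, assuming the SQL file can be sourced directly and that a MySQL user and database already exist (the names here are illustrative):

```bash
git clone https://github.com/etsy/Sahale.git
cd Sahale
# Create the tables the Scala and NodeJS components expect
mysql -u sahale_user -p sahale_db < src/main/sql/create_db_tables.sql
# Then edit db-config.json to point at your MySQL host, user, and database
```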
Step Two: Install the Node Package Manager, `cd` into Sahale's root directory, and execute `npm install`. Execute `node app` to run the server; browse to localhost on port 5735 and enjoy.
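On a typical setup, that boils down to:

```bash
cd Sahale      # the project root from Step One
npm install    # fetch the NodeJS dependencies
node app       # start the web server
# Now browse to http://localhost:5735
```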
Step Three: Install Maven 3. Update the `pom.xml` file with the correct Hadoop and Scala/Scalding versions for your Hadoop installation. Update `src/main/resources/flow-tracker.properties` with the hostname you plan to run the NodeJS server on. Keep the port number here in sync with the NodeJS port (see Step Two). Execute `mvn install`.
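In shell form, the build step is simply (the comments summarize the edits described above):

```bash
# Edit pom.xml for your Hadoop and Scala/Scalding versions, and
# src/main/resources/flow-tracker.properties for the NodeJS host and port
mvn install   # builds the Sahale fatjar under target/
```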
For a quick test, use `bin/runjob` to run the example job(s). You will need to supply some text file(s) on HDFS to run them against.
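For instance, you might stage input like this (the paths are illustrative; check `bin/runjob` itself for its exact arguments):

```bash
# Upload a local text file for the example job to chew on
hadoop fs -mkdir -p /user/$USER/sahale-input
hadoop fs -put corpus.txt /user/$USER/sahale-input/
bin/runjob    # point it at /user/$USER/sahale-input per the script's usage
```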
Users can run their own tracked Scalding jobs in two ways. Both start by making the Scalding job(s) in question a subclass of `com.etsy.sahale.TrackedJob`.
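As a minimal sketch, assuming `TrackedJob` follows the standard Scalding `Job(args)` constructor pattern (the word-count logic and argument names are illustrative):

```scala
import com.etsy.sahale.TrackedJob
import com.twitter.scalding.{Args, TextLine, Tsv}

// A tracked Scalding job: a plain word count whose runtime metrics
// are reported to the Sahale server while it runs on the cluster.
class TrackedWordCount(args: Args) extends TrackedJob(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.split("""\s+""") }
    .groupBy('word) { _.size('count) }
    .write(Tsv(args("output")))
}
```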
The easiest way is to add your source code to `src/main/scala/examples`, build the Sahale fatjar with `mvn install`, and execute using `bin/runjob`, just as one would for the included example job. You can add any job dependencies to the fatjar via the `pom.xml`.
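In command form, that route might look like this (the source file name is an assumption):

```bash
cp TrackedWordCount.scala src/main/scala/examples/
mvn install    # rebuilds the fatjar with your job included
bin/runjob     # launch it the same way as the bundled example
```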
The other method is to declare the Sahale JAR as a dependency in your own project build, then add it to job runs using `hadoop jar`'s `-libjars` argument. This approach can integrate easily into your existing workflow.
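A sketch of such a run; the jar names, job class, and paths are assumptions, and the exact invocation depends on how you normally launch Scalding jobs:

```bash
hadoop jar my-project.jar com.twitter.scalding.Tool \
  -libjars /path/to/sahale.jar \
  com.example.TrackedWordCount --hdfs \
  --input corpus.txt --output counts --track-job
```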
Only jobs submitted to a Hadoop cluster are tracked; local-mode runs are not. All tracked jobs must include the argument `--track-job`, which the `bin/runjob` convenience script includes by default.
Sahale was handmade at Etsy.com and is named for Sahale Mountain, which is a wonderful vantage point from which to view the Cascades ;)