This is Twitter's tutorial tutorial for Scalding scalding adapted to run
on Hadoop as a standalone job - i.e. without requiring scald.rb
etc.
This was built as a Scala SBT project by the [Concurrent Inc] concurrent team,
in order to integrate the scalding tutorial into the Cascading SDK. It
is based on the excellent work done by Snowplow Analytics for
porting the Wordcount example
to SBT.
The versioning of the project follows the versions of the scalding release on which it is based.
Please note that this tutorial uses scala 2.10 and not 2.9.
In order to use this tutorial, you need to have SBT
and the hadoop
command
installed. Cascading and therefore scalding is compatible with a number of
hadoop distributions. If you are unsure, if your distribution is compatible,
please check the compatibility page.
You do not need to have a full hadoop cluster, in order to run this tutorial. The local mode of hadoop is sufficient.
Assuming you already have SBT installed:
$ git clone git://github.com/Cascading/scalding-tutorial.git
$ cd scalding-tutorial
$ sbt assembly
The 'fat jar' is now available as:
target/scalding-tutorial-0.8.11.jar
Some modifications have been done to the code, order to properly work in an SBT based build.
- all code is now in
src/main/scala/tutorial
- the data files for the different parts live now in
data
- the classes in the matrix tutorial have been renamed to match the file names, so that the commandline invocation is similar to the original tutorial
- the documentation of the examples has been adapted to match the new structure
Each part of the tutorial explains, how to run it properly. However the general way is always
$ hadoop jar target/scalding-tutorial-0.8.11.jar <TutorialPart> --local <addtional arguments>
Copyright 2012-2013 Concurrent Inc, with significant portions copyright 2012 Twitter, Inc. and Snowplow Analytics Inc.
Licensed under the [Apache License, Version 2.0] license (the "License"); you may not use this software except in compliance with the License.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.