Jupyter Scala is a Scala kernel for Jupyter. It aims to be a versatile and easily extensible alternative to other Scala kernels or notebook UIs, building on both Jupyter and Ammonite.

The current version is available for Scala 2.11. Support for Scala 2.10 could be added back, and Scala 2.12 should be supported soon (via ammonium / Ammonite).
- Quick start
- Extra launcher options
- Comparison to alternatives
- Status / disclaimer
- Big data frameworks
  - Spark
  - Flink
  - Scio / Beam
  - Scalding
- Special commands / API
- Plotting
  - Vegas
  - plotly-scala
- Jupyter installation
- Internals
- Compiling it
## Quick start

First ensure you have Jupyter installed: running `jupyter --version` should print a value >= 4.0. See the Jupyter installation section below if that's not the case.

Ensure the coursier launcher is available in the `PATH`. On OS X, `brew install --HEAD paulp/extras/coursier` should install it. `coursier --help` should then print a version >= 1.0.0-M14.

Then simply run the `jupyter-scala` script of this repository to install the kernel. Launch it with `--help` to list the available (non-mandatory) options. Once installed, the kernel should be listed by `jupyter kernelspec list`.
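For instance, a default installation followed by a check that the kernel was registered (a minimal sketch; it assumes the script is run from the repository root):

```
$ ./jupyter-scala
$ jupyter kernelspec list
```

The second command should list a kernel with ID `scala`.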
## Extra launcher options

Some options can be passed to the `jupyter-scala` script / launcher; they can be combined, as in the example below:

- The kernel ID (`scala`) can be changed with `--id custom` (this allows the kernel to be installed alongside already installed Scala kernels).
- The kernel name, which appears in the Jupyter Notebook UI, can be changed with `--name "Custom name"`.
- If a kernel with the same ID is already installed and should be erased, the `--force` option should be specified.
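For example, to install a second copy of the kernel under a custom ID and name, overwriting any previous kernel with that ID (the ID and name values here are just placeholders):

```
$ ./jupyter-scala --id scala-custom --name "Scala (custom)" --force
```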
## Comparison to alternatives

There are already a few notebook UIs or Jupyter kernels for Scala out there:

- the ones originating from IScala,
- the ones originating from scala-notebook,
  - scala-notebook itself, and
  - spark-notebook, which updated / reworked various parts of it and added Spark support to it, and
- the ones affiliated with Apache,
  - Toree (incubated, formerly known as spark-kernel), a Jupyter kernel to do Spark calculations, and
  - Zeppelin, a JVM-based alternative to Jupyter, with some support for Spark, Flink, and Scalding in particular.
Compared to them, jupyter-scala aims to be versatile, allowing support for big data frameworks to be added on-the-fly. It aims to build on the nice features of both Jupyter (alternative UIs, ...) and Ammonite - it is now based on an only slightly modified version of the latter (ammonium). Most of what can be done in notebooks can also be done in the console via ammonium. jupyter-scala is not tied to specific versions of Spark - one can add support for a given version in one notebook, and support for another version in another notebook.
## Status / disclaimer

jupyter-scala builds on top of both Jupyter and Ammonite, which are widely used and well tested / reliable. The specific features of jupyter-scala (support for big data frameworks in particular) should be relied on with caution - some are just proofs of concept for now (support for Flink and Scio), others are a bit more used, but in specific contexts (support for Spark, quite used on YARN at my current company, but whose status with other cluster managers is unknown).
## Big data frameworks

### Spark

Status: some specific uses (Spark on YARN) are well tested in particular contexts (especially with the previous version; less so with the current one for now), while others (Mesos, standalone clusters) are untested with the current code base.
Use like
```scala
import $exclude.`org.slf4j:slf4j-log4j12`, $ivy.`org.slf4j:slf4j-nop:1.7.21` // for cleaner logs
import $profile.`hadoop-2.6`
import $ivy.`org.apache.spark::spark-sql:2.1.0` // adjust the Spark version here - Spark >= 2.0 required
import $ivy.`org.apache.hadoop:hadoop-aws:2.6.4`
import $ivy.`org.jupyter-scala::spark:0.4.0` // for JupyterSparkSession (SparkSession aware of the jupyter-scala kernel)

import org.apache.spark._
import org.apache.spark.sql._
import jupyter.spark.session._

val sparkSession = JupyterSparkSession.builder() // important - call this rather than SparkSession.builder()
  .jupyter() // this method must be called right after builder()
  // .yarn("/etc/hadoop/conf") // optional, for Spark on YARN - argument is the Hadoop conf directory
  // .emr("2.6.4") // on AWS ElasticMapReduce, this adds the AWS-related JARs to the Spark JAR list
  // .master("local") // change to "yarn-client" on YARN
  // .config("spark.executor.instances", "10")
  // .config("spark.executor.memory", "3g")
  // .config("spark.hadoop.fs.s3a.access.key", awsCredentials._1)
  // .config("spark.hadoop.fs.s3a.secret.key", awsCredentials._2)
  .appName("notebook")
  .getOrCreate()
```
Important: `SparkSession`s should not be created manually. Only the ones from the `org.jupyter-scala::spark` library are aware of the kernel, and set up the `SparkSession` accordingly (passing it the loaded dependencies, the kernel build products, etc.).

Note that no Spark distribution is required for the kernel to work. In particular, on YARN, the call to `.yarn(...)` above itself generates the so-called Spark assembly (or the list of JARs with Spark 2) that gets shipped to the driver and the executors.
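Once created, the session behaves like a regular `SparkSession`. A minimal sanity check, using only the standard Spark 2.x API (nothing here is jupyter-scala specific):

```scala
import org.apache.spark.sql.functions.sum

val ds = sparkSession.range(100)                  // a Dataset with values 0 to 99, in a column named "id"
val total = ds.agg(sum("id")).first().getLong(0)  // sums the column on the cluster

// total should be 4950
```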
### Flink

Status: POC
Use like
```scala
import $exclude.`org.slf4j:slf4j-log4j12`, $ivy.`org.slf4j:slf4j-nop:1.7.21`, $ivy.`org.slf4j:log4j-over-slf4j:1.7.21` // for cleaner logs
import $ivy.`org.jupyter-scala::flink-yarn:0.4.0`

import jupyter.flink._

addFlinkImports()

sys.props("FLINK_CONF_DIR") = "/path/to/flink-conf-dir" // directory that should contain flink-conf.yaml

interp.load.cp("/etc/hadoop/conf")

val cluster = FlinkYarn(
  taskManagerCount = 2,
  jobManagerMemory = 2048,
  taskManagerMemory = 2048,
  name = "flink",
  extraDistDependencies = Seq(
    "org.apache.hadoop:hadoop-aws:2.7.3" // required on AWS ElasticMapReduce
  )
)

val env = JupyterFlinkRemoteEnvironment(cluster.getJobManagerAddress)
```
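The environment can then be used with the usual Flink batch API. A minimal word-count sketch, assuming `JupyterFlinkRemoteEnvironment` exposes the standard `ExecutionEnvironment` API and that `addFlinkImports()` brought `org.apache.flink.api.scala._` into scope:

```scala
val counts = env
  .fromElements("to be or not to be")
  .flatMap(_.toLowerCase.split("\\s+")) // split lines into words
  .map((_, 1))                          // pair each word with a count of 1
  .groupBy(0)                           // group by the word
  .sum(1)                               // sum the counts

counts.print() // triggers execution on the remote cluster and prints the results
```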
### Scio / Beam

Status: POC
Use like
```scala
import $ivy.`org.jupyter-scala::scio:0.4.0`

import jupyter.scio._
import com.spotify.scio._
import com.spotify.scio.accumulators._
import com.spotify.scio.bigquery._
import com.spotify.scio.experimental._

val sc = JupyterScioContext(
  "runner" -> "DataflowPipelineRunner",
  "project" -> "jupyter-scala",
  "stagingLocation" -> "gs://bucket/staging"
).withGcpCredential("/path-to/credentials.json") // alternatively, set the GOOGLE_APPLICATION_CREDENTIALS env var to that path
```
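The context can then be used like a regular `ScioContext`. A minimal sketch, assuming `JupyterScioContext` follows the standard Scio API (the output location is a placeholder):

```scala
sc.parallelize(1 to 10)                 // create an SCollection from a local collection
  .map(n => s"doubled: ${n * 2}")       // a trivial transformation
  .saveAsTextFile("gs://bucket/output") // placeholder output location

sc.close() // submits the pipeline to the runner
```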
### Scalding

Status: TODO! (nothing for now)
## Special commands / API

Being based on a slightly modified version of Ammonite, jupyter-scala allows you to

- add dependencies / repositories,
- manage pretty-printing,
- load external scripts, etc.

the same way Ammonite does, with the same API, described in its documentation.
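For instance, with the standard Ammonite magic imports (the dependency and script names below are just placeholders):

```scala
// add a dependency, fetched from the usual repositories
import $ivy.`org.typelevel::cats-core:0.9.0`

// load an external script, myscript.sc, from the current directory
import $file.myscript
```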
It has some additions compared to it though. One can exclude dependencies with, e.g.

```scala
import $exclude.`org.slf4j:slf4j-log4j12`
```

to exclude `org.slf4j:slf4j-log4j12` from subsequent dependency loading.
One can also publish HTML, images, or JavaScript to the notebook output, via the `publish` API:

```scala
publish.html(
  """
    <b>Foo</b>
    <div id="bar"></div>
  """
)

publish.png(png) // png: Array[Byte]

publish.js(
  """
    console.log("hey");
  """
)
```
## Plotting

As with big data frameworks, support for plotting libraries can be added on-the-fly during a notebook session.
### Vegas

Vegas is a Scala wrapper for Vega-Lite.
Use like
```scala
import $ivy.`org.vegas-viz::vegas:0.3.8`

import vegas._

Vegas("Country Pop").
  withData(
    Seq(
      Map("country" -> "USA", "population" -> 314),
      Map("country" -> "UK", "population" -> 64),
      Map("country" -> "DK", "population" -> 80)
    )
  ).
  encodeX("country", Nom).
  encodeY("population", Quant).
  mark(Bar).
  show
```
Additional Vegas samples in a jupyter-scala notebook are available here.
### plotly-scala

plotly-scala is a Scala wrapper for plotly.js.
Use like
```scala
import $ivy.`org.plotly-scala::plotly-jupyter-scala:0.3.0`

import plotly._
import plotly.element._
import plotly.layout._
import plotly.JupyterScala._

plotly.JupyterScala.init()

val (x, y) = Seq(
  "Banana" -> 10,
  "Apple" -> 8,
  "Grapefruit" -> 5
).unzip

Bar(x, y).plot()
```
## Jupyter installation

Check that you have Jupyter installed by running `jupyter --version`. It should print a value >= 4.0. If it's not the case, a quick way of setting it up consists in installing the Anaconda Python distribution (or its lightweight counterpart, Miniconda), and then running

```
$ pip install jupyter
```

or

```
$ pip install --upgrade jupyter
```

`jupyter --version` should then print a value >= 4.0.
## Internals

jupyter-scala uses the Scala interpreter of ammonium, a slightly modified Ammonite. The interaction with Jupyter (the Jupyter protocol, ZMQ concerns, etc.) is handled in a separate project, jupyter-kernel. In a way, jupyter-scala is just a bridge between these two projects.

The API as seen from a jupyter-scala session is defined in the `scala-api` module, which itself depends on the `api` module of jupyter-kernel. The core of the kernel is in the `scala` module, in particular with an implementation of an `Interpreter` for jupyter-kernel, and implementations of the interfaces / traits defined in `scala-api`. There is also a third module, `scala-cli`, which deals with command-line argument parsing and launches the kernel itself. The launcher script just runs this third module.
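Schematically, the split looks like the following (a purely illustrative sketch - these are not the actual jupyter-kernel or scala-api interfaces, just the shape of the bridge):

```scala
// Hypothetical shapes, for illustration only.

// jupyter-kernel side: protocol / ZMQ handling, language-agnostic.
// It asks an Interpreter to evaluate code received over the wire.
trait Interpreter {
  def interpret(code: String): Either[String, String] // error or result
}

// scala-api side: what a notebook session sees (e.g. the publish helpers above).
trait SessionApi {
  def publishHtml(html: String): Unit
}

// scala module: bridges ammonium's Scala interpreter to jupyter-kernel.
class ScalaInterpreter(api: SessionApi) extends Interpreter {
  def interpret(code: String): Either[String, String] =
    Left("stub - the real implementation delegates to ammonium")
}
```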
## Compiling it

Clone the sources:
```
$ git clone https://github.com/alexarchambault/jupyter-scala.git
$ cd jupyter-scala
```
Compile and publish them:
```
$ sbt publishLocal
```
Edit the `jupyter-scala` script, and set `VERSION` to `0.4.1-SNAPSHOT` (the version being built / published locally). Install it:

```
$ ./jupyter-scala --id scala-develop --name "Scala (develop)" --force
```
If one wants to make changes to jupyter-kernel or ammonium, and test them via jupyter-scala, just clone their sources,
```
$ git clone https://github.com/alexarchambault/jupyter-kernel
```
or
```
$ git clone https://github.com/alexarchambault/ammonium
```
build them and publish them locally,
```
$ cd jupyter-kernel
$ sbt publishLocal
```
or
```
$ cd ammonium
$ sbt published/publishLocal
```
Then adjust `ammoniumVersion` or `jupyterKernelVersion` in the `build.sbt` of jupyter-scala (set them to `0.4.1-SNAPSHOT` or `0.8.1-SNAPSHOT`), reload the SBT instance compiling / publishing jupyter-scala (type `reload`, or exit and relaunch it), and build / publish jupyter-scala locally again (`sbt publishLocal`). That will make the locally published artifacts of jupyter-scala depend on the locally published ones of ammonium or jupyter-kernel.
Released under the Apache 2.0 license, see LICENSE for more details.