In this developer journey, we will create a scalable recommender system using Apache Spark and Elasticsearch.
The system works by training a collaborative filtering model on user rating data with Spark MLlib, storing the resulting user and item factor vectors in Elasticsearch, and then using an Elasticsearch vector scoring plugin to compute recommendations in real time at query time.
This repo contains a Jupyter notebook illustrating the basics of using Spark to train a collaborative filtering recommendation model from ratings data stored in Elasticsearch, saving the model factors to Elasticsearch, and then using Elasticsearch to serve real-time recommendations based on the user and item factors.
When the reader has completed this journey, they will understand how to:
- Ingest and index user event data into Elasticsearch using the Elasticsearch Spark connector
- Load event data into Spark DataFrames and use Spark's machine learning library (MLlib) to train a collaborative filtering recommender model
- Export the trained model into Elasticsearch
- Use a custom Elasticsearch plugin to compute user and similar-item recommendations, and combine recommendations with search and content filtering
- Load the movie dataset into Spark.
- Use Spark DataFrame operations to clean up the dataset and load it into Elasticsearch.
- Train a collaborative filtering recommendation model using Spark MLlib.
- Save the resulting model into Elasticsearch.
- Generate some example recommendations using Elasticsearch queries and a custom vector scoring plugin. The Movie Database API is used to display movie poster images for the recommended movies.
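Once the factor vectors are saved, generating user recommendations reduces to scoring items by the dot product between a user's factor vector and each item's factor vector. The following is a pure-Python sketch of that idea with made-up toy factors (it is not the plugin's actual code; real factors come from the trained Spark model stored in Elasticsearch):

```python
# Toy sketch of how saved factor vectors yield user recommendations.
# All factor values below are hypothetical, for illustration only.

user_factors = {1: [0.9, 0.1, 0.3]}
item_factors = {
    10: [0.8, 0.2, 0.1],
    20: [0.1, 0.9, 0.4],
    30: [0.7, 0.3, 0.5],
}

def dot(a, b):
    """Dot product of two factor vectors."""
    return sum(x * y for x, y in zip(a, b))

def recommend(user_id, n=2):
    """Score every item for a user; return the top-n item ids."""
    u = user_factors[user_id]
    ranked = sorted(item_factors,
                    key=lambda i: dot(u, item_factors[i]),
                    reverse=True)
    return ranked[:n]

print(recommend(1))  # → [30, 10]
```

In the journey itself this scoring happens inside Elasticsearch via the vector scoring plugin, so recommendations can be combined with ordinary search filters in a single query.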
- Apache Spark: An open-source, fast and general-purpose cluster computing system
- Elasticsearch: Open-source search and analytics engine
- Jupyter Notebooks: An open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.
- Data Science: Systems and scientific methods to analyze structured and unstructured data in order to extract knowledge and insights.
- Artificial Intelligence: Artificial intelligence can be applied to disparate solution spaces to deliver disruptive technologies.
- Python: Python is a programming language that lets you work more quickly and integrate your systems more effectively.
Follow these steps to create the required services and run the notebook locally.
- Clone the repo
- Set up Elasticsearch
- Download the Elasticsearch Spark connector
- Download Apache Spark
- Download the data
- Launch the notebook
- Run the notebook
Clone the REPO locally. In a terminal, run:
$ git clone https://github.com/IBM/REPO
This Journey currently depends on Elasticsearch 5.3.0. Go to the downloads page and download the appropriate package for your system.
For example on Linux / Mac you can download the TAR archive and unzip it using the command:
$ tar xfvz elasticsearch-5.3.0.tar.gz
Change directory to the newly unzipped folder using:
$ cd elasticsearch-5.3.0
Next, you will need to install the Elasticsearch vector scoring plugin. You can do this by running the following command (Elasticsearch will download the plugin file for you):
$ ./bin/elasticsearch-plugin install https://github.com/MLnick/elasticsearch-vector-scoring/releases/download/v5.3.0/elasticsearch-vector-scoring-5.3.0.zip
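The vector scoring plugin scores each document by the similarity between a query vector and a factor vector stored in the document; for similar-item recommendations this is typically cosine similarity. A minimal pure-Python illustration of the computation (not the plugin's code; the vectors are hypothetical):

```python
import math

def cosine(a, b):
    """Cosine similarity between two factor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical item factor vectors for two similar movies.
item_a = [0.8, 0.2, 0.1]
item_b = [0.7, 0.3, 0.2]
print(round(cosine(item_a, item_b), 3))  # → 0.978
```

A score close to 1.0 means the two items have similar factor vectors, i.e. the model considers them similar.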
Next, start Elasticsearch (do this in a separate terminal window in order to keep it up and running):
$ ./bin/elasticsearch
You should see some startup logs displayed. Check that the elasticsearch-vector-scoring plugin is successfully loaded:
$ ./bin/elasticsearch
[2017-09-08T15:58:18,781][INFO ][o.e.n.Node ] [] initializing ...
...
[2017-09-08T15:58:19,406][INFO ][o.e.p.PluginsService ] [2Zs8kW3] loaded plugin [elasticsearch-vector-scoring]
[2017-09-08T15:58:20,676][INFO ][o.e.n.Node ] initialized
...
You will need to run your PySpark notebook with the elasticsearch-spark JAR (version 5.3.0) on the classpath. Follow these steps to set up the connector:
- Download the ZIP file.
- Unzip the file by running $ unzip elasticsearch-hadoop-5.3.0.zip.
- The connector JAR will be located in the dist subfolder.
Download Apache Spark from the Apache Spark downloads page and unzip it. Make a note of the path to the unzipped folder, since it is referred to as PATH_TO_SPARK when launching the notebook. This Journey assumes you are running Spark with the default Scala version of 2.11.
You will be using the MovieLens dataset of ratings given by a set of users to movies, together with movie metadata. There are a few versions of the dataset; you should download the "latest small" version. Once downloaded, unzip the file by running:
$ unzip ml-latest-small.zip
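The ratings file in this dataset is a plain CSV with userId, movieId, rating, and timestamp columns. As a quick sanity check before loading it into Spark, it can be parsed with the Python standard library; the sketch below uses an inline sample in the same shape (the values are made up, not taken from the real dataset):

```python
import csv
import io

# Inline sample in the same shape as MovieLens ratings.csv;
# the rows are hypothetical, for illustration only.
sample = """userId,movieId,rating,timestamp
1,31,2.5,1260759144
1,1029,3.0,1260759179
2,31,4.0,1260759182
"""

rows = list(csv.DictReader(io.StringIO(sample)))
# Convert the string fields into typed (user, item, rating) tuples.
ratings = [(int(r["userId"]), int(r["movieId"]), float(r["rating"]))
           for r in rows]
print(len(ratings), ratings[0])  # → 3 (1, 31, 2.5)
```

In the notebook the same data is instead read into a Spark DataFrame, cleaned up, and indexed into Elasticsearch via the Elasticsearch Spark connector.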
To run the notebook you will need to launch a PySpark session within a Jupyter notebook. Remember to include the Elasticsearch Spark connector JAR from step 3 on the classpath.
From the base directory of the Journey repository that you cloned in step 1, run the following command to launch your PySpark notebook server locally. Here PATH_TO_SPARK is the path where you unzipped the Spark release, and PATH_TO_ES_SPARK is the path where you unzipped the Elasticsearch Spark connector (this assumes you are running Spark with the default Scala version of 2.11):
PYSPARK_DRIVER_PYTHON="jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" PATH_TO_SPARK/bin/pyspark --driver-memory 4g --driver-class-path PATH_TO_ES_SPARK/dist/elasticsearch-spark_2.11-5.3.0.jar
This should open a browser window with the Journey folder contents displayed. Click on the notebooks subfolder, then click on the elasticsearch-spark-recommender.ipynb file to launch the notebook.
Note: TODO
TODO
- Demo on YouTube
- Watch the meetup presentation and slide deck covering some of the background and technical details behind this Journey.
- An extended version of the above presentation was given at ApacheCon Big Data Europe 2016 (see slides).
- Error: XYZ
  Solution
- Error: ABC
  Solution