In this developer journey, we will create a scalable recommender system using Apache Spark and Elasticsearch.
The system works by training a collaborative filtering model on user rating data with Spark MLlib, storing the resulting user and item factor vectors in Elasticsearch, and then using an Elasticsearch vector scoring plugin to compute recommendations in real time at query time.
This repo contains a Jupyter notebook illustrating the basics of using Spark to train a collaborative filtering recommendation model from ratings data stored in Elasticsearch, saving the model factors to Elasticsearch, and then using Elasticsearch to serve real-time recommendations based on the user and item factors.
When the reader has completed this journey, they will understand how to:
- Ingest and index user event data into Elasticsearch using the Elasticsearch Spark connector
- Load event data into Spark DataFrames and use Spark's machine learning library (MLlib) to train a collaborative filtering recommender model
- Export the trained model into Elasticsearch
- Use a custom Elasticsearch plugin to compute user and similar-item recommendations, and combine recommendations with search and content filtering
- Load the movie dataset into Spark.
- Use Spark DataFrame operations to clean up the dataset and load it into Elasticsearch.
- Train a collaborative filtering recommendation model using Spark MLlib.
- Save the resulting model into Elasticsearch.
- Generate some example recommendations using Elasticsearch queries and a custom vector scoring plugin. The Movie Database API is used to display movie poster images for the recommended movies.
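Once the factor vectors are saved, generating user recommendations reduces to scoring items by the dot product between a user's factor vector and each item's factor vector. The following is a pure-Python sketch of that idea with made-up toy factors (it is not the plugin's actual code; real factors come from the trained Spark model stored in Elasticsearch):

```python
# Toy sketch of how saved factor vectors yield user recommendations.
# All factor values below are hypothetical, for illustration only.

user_factors = {1: [0.9, 0.1, 0.3]}
item_factors = {
    10: [0.8, 0.2, 0.1],
    20: [0.1, 0.9, 0.4],
    30: [0.7, 0.3, 0.5],
}

def dot(a, b):
    """Dot product of two factor vectors."""
    return sum(x * y for x, y in zip(a, b))

def recommend(user_id, n=2):
    """Score every item for a user; return the top-n item ids."""
    u = user_factors[user_id]
    ranked = sorted(item_factors,
                    key=lambda i: dot(u, item_factors[i]),
                    reverse=True)
    return ranked[:n]

print(recommend(1))  # → [30, 10]
```

In the journey itself this scoring happens inside Elasticsearch via the vector scoring plugin, so recommendations can be combined with ordinary search filters in a single query.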
- Apache Spark: An open-source, fast and general-purpose cluster computing system
- Elasticsearch: Open-source search and analytics engine
- Jupyter Notebooks: An open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.
- Data Science: Systems and scientific methods to analyze structured and unstructured data in order to extract knowledge and insights.
- Artificial Intelligence: Artificial intelligence can be applied to disparate solution spaces to deliver disruptive technologies.
- Python: Python is a programming language that lets you work more quickly and integrate your systems more effectively.
Follow these steps to create the required services and run the notebook locally.
- Clone the repo
- Set up Elasticsearch
- Download the Elasticsearch Spark connector
- Download Apache Spark
- Download the data
- Launch the notebook
- Run the notebook
Clone the REPO locally. In a terminal, run:
$ git clone https://github.com/IBM/REPO
This Journey currently depends on Elasticsearch 5.3.0. Go to the downloads page and download the appropriate package for your system.
For example on Linux / Mac you can download the TAR archive and unzip it using the command:
$ tar xfvz elasticsearch-5.3.0.tar.gz
Change directory to the newly unzipped folder using:
$ cd elasticsearch-5.3.0
Next, you will need to install the Elasticsearch vector scoring plugin. You can do this by running the following command (Elasticsearch will download the plugin file for you):
$ ./bin/elasticsearch-plugin install https://github.com/MLnick/elasticsearch-vector-scoring/releases/download/v5.3.0/elasticsearch-vector-scoring-5.3.0.zip
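The vector scoring plugin scores each document by the similarity between a query vector and a factor vector stored in the document; for similar-item recommendations this is typically cosine similarity. A minimal pure-Python illustration of the computation (not the plugin's code; the vectors are hypothetical):

```python
import math

def cosine(a, b):
    """Cosine similarity between two factor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical item factor vectors for two similar movies.
item_a = [0.8, 0.2, 0.1]
item_b = [0.7, 0.3, 0.2]
print(round(cosine(item_a, item_b), 3))  # → 0.978
```

A score close to 1.0 means the two items have similar factor vectors, i.e. the model considers them similar.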
Next, start Elasticsearch (do this in a separate terminal window in order to keep it up and running):
$ ./bin/elasticsearch
You should see some startup logs displayed. Check that the elasticsearch-vector-scoring plugin is successfully loaded:
$ ./bin/elasticsearch
[2017-09-08T15:58:18,781][INFO ][o.e.n.Node ] [] initializing ...
...
[2017-09-08T15:58:19,406][INFO ][o.e.p.PluginsService ] [2Zs8kW3] loaded plugin [elasticsearch-vector-scoring]
[2017-09-08T15:58:20,676][INFO ][o.e.n.Node ] initialized
...
You will need to run your PySpark notebook with the elasticsearch-spark JAR (version 5.3.0) on the classpath. Follow these steps to set up the connector:
- Download the ZIP file.
- Unzip the file by running $ unzip elasticsearch-hadoop-5.3.0.zip.
- The connector JAR will be located in the dist subfolder.
Download Apache Spark from the Apache Spark downloads page and unzip it. Make a note of the path to the unzipped folder, since it is referred to as PATH_TO_SPARK when launching the notebook. This Journey assumes you are running Spark with the default Scala version of 2.11.
You will be using the MovieLens dataset of ratings given by a set of users to movies, together with movie metadata. There are a few versions of the dataset; you should download the "latest small" version. Once downloaded, unzip the file by running:
$ unzip ml-latest-small.zip
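The ratings file in this dataset is a plain CSV with userId, movieId, rating, and timestamp columns. As a quick sanity check before loading it into Spark, it can be parsed with the Python standard library; the sketch below uses an inline sample in the same shape (the values are made up, not taken from the real dataset):

```python
import csv
import io

# Inline sample in the same shape as MovieLens ratings.csv;
# the rows are hypothetical, for illustration only.
sample = """userId,movieId,rating,timestamp
1,31,2.5,1260759144
1,1029,3.0,1260759179
2,31,4.0,1260759182
"""

rows = list(csv.DictReader(io.StringIO(sample)))
# Convert the string fields into typed (user, item, rating) tuples.
ratings = [(int(r["userId"]), int(r["movieId"]), float(r["rating"]))
           for r in rows]
print(len(ratings), ratings[0])  # → 3 (1, 31, 2.5)
```

In the notebook the same data is instead read into a Spark DataFrame, cleaned up, and indexed into Elasticsearch via the Elasticsearch Spark connector.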
To run the notebook you will need to launch a PySpark session within a Jupyter notebook. Remember to include the Elasticsearch Spark connector JAR from step 3 on the classpath.
From the base directory of the Journey repository that you cloned in step 1, run the following command to launch your PySpark notebook server locally. Here PATH_TO_SPARK is the path where you unzipped the Spark release, and PATH_TO_ES_SPARK is the path where you unzipped the Elasticsearch Spark connector (this assumes you are running Spark with the default Scala version of 2.11):
PYSPARK_DRIVER_PYTHON="jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" PATH_TO_SPARK/bin/pyspark --driver-memory 4g --driver-class-path PATH_TO_ES_SPARK/dist/elasticsearch-spark_2.11-5.3.0.jar
This should open a browser window with the Journey folder contents displayed. Click on the notebooks subfolder, then click on the elasticsearch-spark-recommender.ipynb file to launch the notebook.
Note: TODO
TODO
- Demo on YouTube
- Watch the meetup presentation and slide deck covering some of the background and technical details behind this Journey.
- An extended version of the above presentation was given at ApacheCon Big Data Europe 2016 (see slides).
- Error: XYZ
  Solution
- Error: ABC
  Solution