This project performs an in-depth analysis of the Game of Thrones character network using Apache Spark GraphX, Neo4j, and Spark ML. The dataset used for this analysis is available here.
Here is an overview of the project architecture:
The project comprises several key steps:
- Read the Dataset:
  - Use the provided Game of Thrones dataset.
- Connect Spark GraphX with Neo4j:
  - Follow the instructions in the Neo4j Spark Connector documentation to establish a connection.
- Import Dataset into Neo4j:
  - Use the Cypher query language to perform CRUD operations on the dataset in Neo4j.
- Apache Zeppelin with GraphX:
  - Integrate Apache Zeppelin with GraphX.
  - Read data from Neo4j.
  - Conduct exploratory data analysis.
  - Execute five graph algorithms of your choice using GraphX.
  - Visualize the results.
- Create a Customizable Dashboard:
  - Develop a customizable dashboard to visualize dataset information and the results of graph algorithms.
- Spark ML:
  - Use Spark ML to apply machine learning algorithms to the dataset.
Use the provided docker-compose.yml file to set up the cluster. The included services are:
- Zeppelin (Apache Zeppelin 0.10.0)
- Spark Master (Bitnami Spark 3.1.2)
- Neo4j (Bitnami Neo4j 5)
Adjust the volume mounts and port mappings in docker-compose.yml to suit your environment.
# Clone the repository from GitHub
git clone https://github.com/rmakaoui/Project_GraphX_SparkML_neo4j.git
# Move into the project directory
cd Project_GraphX_SparkML_neo4j
# Start the services in the background with Docker Compose
docker-compose up -d
After running these commands, the Docker services (Zeppelin, Spark Master, Neo4j) will start, and you can access Apache Zeppelin at http://localhost:8080 in your browser.
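Containers can take a little while to become reachable after `docker-compose up -d` returns, so it may help to poll a service before opening it. The sketch below is illustrative only: `wait_for` and the `WAIT_INTERVAL` variable are hypothetical names and are not part of this repository.

```shell
# Hypothetical helper (not part of the repository): run a command repeatedly
# until it succeeds or the given number of attempts is used up.
# Usage: wait_for ATTEMPTS COMMAND [ARGS...]
wait_for() {
  attempts=$1
  shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Succeed as soon as the probed command exits with status 0
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Pause between attempts, but not after the last one
    if [ "$i" -lt "$attempts" ]; then
      sleep "${WAIT_INTERVAL:-2}"
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, `wait_for 30 curl -sf http://localhost:8080` returns once Zeppelin answers on the address given above, and `docker-compose ps` lists the state of each container. Checking Neo4j the same way would require the host port that docker-compose.yml maps for it.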
- Follow the steps outlined in the guide for data analysis, graph algorithms, and machine learning.