This repository contains the code of a small project that implements a data stream with Apache Kafka, consumed by a Neo4j sink instance.
Instances of Kafka, Zookeeper, and Schema Registry, together with Neo4j, are deployed using the definitions in a docker-compose.yml file.
A Java application defines two producers that publish movies from different sources; in this case, the sources are two movie datasets obtained from Kaggle: Netflix Movies and TV Shows and The Movies Dataset.
The Java application reads the CSV files at a given rate and sends the records, under the specified topic names, to the Kafka broker deployed with Docker. The data is serialized using Avro.
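To make the flow concrete, here is a minimal, self-contained sketch of a producer that serializes a record with Avro and sends it to the broker. It is illustrative only: the `movies` topic, the `Movie` schema, and the field names are hypothetical stand-ins for the project's actual ones, and the addresses assume the local defaults used elsewhere in this README.

```java
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import io.confluent.kafka.serializers.KafkaAvroSerializer;

public class AvroProducerSketch {

    // Hypothetical schema: the real project derives its schemas from the CSV datasets.
    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"Movie\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"string\"},"
      + "{\"name\":\"title\",\"type\":\"string\"}]}";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");          // e.g. BOOTSTRAP_SERVERS_ADDR
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", KafkaAvroSerializer.class.getName());
        props.put("schema.registry.url", "http://localhost:8081"); // Schema Registry container

        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // In the real application, each CSV row would be mapped to a record like this.
            String id = "m1";
            GenericRecord movie = new GenericData.Record(schema);
            movie.put("id", id);
            movie.put("title", "Example Movie");

            producer.send(new ProducerRecord<>("movies", id, movie));
            producer.flush();
        }
    }
}
```

Pointing the serializer at the Schema Registry lets `KafkaAvroSerializer` register the schema and embed its ID in every message, which is what allows downstream consumers such as the Neo4j sink to deserialize the records.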
The Neo4j instance acting as a sink polls records from the Kafka broker; whenever records arrive, a custom Cypher query, specified via the environment variable `NEO4J_streams_sink_topic_cypher_<topic-name>`, is executed to merge them into the graph.
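As a hedged sketch, such a variable could look like the following in the Docker environment configuration. The topic name `movies` and the record fields are hypothetical; `event` is the name the neo4j-streams sink binds each incoming record to:

```
NEO4J_streams_sink_topic_cypher_movies=MERGE (m:Movie {id: event.id}) SET m.title = event.title
```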
The main components of the stack used in this project are:
- Docker v20.10.6
- Zookeeper (cp-zookeeper)
- Kafka (cp-enterprise-kafka)
- Schema Registry (cp-schema-registry)
- Neo4j v4.2.6
- APOC v4.2.0.2
- neo4j-streams-4.0.8
- Java 11
- Avro serializer v5.3.0
Other libraries were also used to develop the Java source code.
- Download `credits.csv` and `movies_metadata.csv` from The Movies Dataset and place them in the `data/` folder. Do the same for `netflix_titles.csv` from Netflix Movies and TV Shows.
- There is a file called `.env.development` in the root folder where the environment variables are declared. Create a `.env` file and fill in the placeholder declarations there, e.g. `BOOTSTRAP_SERVERS_ADDR=localhost:9092`.
- Run `mvn clean` followed by `mvn package`.
- Inside the `docker/` folder, run `docker-compose --env-file .env up`. This creates the containers; wait long enough for everything to initialize, keeping an eye on the logs.
- To execute the producers, run the `main()` function of `src/main/java/com/movies/graph/MoviesProducer.java` (see the command sketch after this list). The Neo4j instance will consume the records automatically.
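If you prefer the command line to an IDE, one possible invocation is via the Maven exec plugin; this is a sketch that assumes the class takes no arguments:

```
mvn exec:java -Dexec.mainClass="com.movies.graph.MoviesProducer"
```

After the producers finish, the merged movies should be visible in the Neo4j browser at http://localhost:7474, assuming the compose file maps the default port.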
Thanks to Bruno Berisso for helping develop the idea and generously giving his time to answer questions.