This project explores the use of Neo4j for a potential collaboration with the ASTRIAGraph project.

Initial Commands

Start the neo4j container:

docker run -it --rm -p 7474:7474 -p 7687:7687 -v $(pwd)/data:/data -e NEO4J_AUTH=neo4j/tapis4ever --name=neo neo4j:5.22.0

Monitor CPU and memory usage:

docker stats neo

Generate test CSV files to import into the database (this was already done, but can be run again as needed):

python generate_test_data.py

Copy new files into the container directory:

sudo cp test_data_100k.csv data/
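The generate_test_data.py script itself is not reproduced here. As a rough idea of what such a generator could look like, the following is a minimal sketch, assuming three properties per node (consistent with the 300000 properties reported for 100000 nodes), a unique string property_1 of the form prp_<i>, and a random integer property_2. The function name write_test_csv, the third column property_3, the value range, and the typed header are illustrative assumptions and are not taken from the actual script.

```python
import csv
import random


def write_test_csv(path, num_nodes, seed=0):
    """Write num_nodes rows of synthetic node data to a CSV file at path."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same file
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        # Header in neo4j-admin import style; ":int" types the column as an integer.
        # The exact header used by the real script is an assumption.
        writer.writerow(["property_1", "property_2:int", "property_3"])
        for i in range(1, num_nodes + 1):
            writer.writerow([
                f"prp_{i}",           # unique for each node
                rng.randint(0, 100),  # random, so values repeat across nodes (range assumed)
                f"value_{i % 10}",    # hypothetical third property
            ])
```

For example, write_test_csv("test_data_100k.csv", 100_000) would produce a file with the same name as the one copied above.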

Load the CSV file into Neo4j. NOTE: the neo4j container should NOT be running. The following runs a new container just to do the import:

docker run -it --rm -p 7474:7474 -p 7687:7687 -v $(pwd)/data:/data -v $(pwd)/import:/import -e NEO4J_AUTH=neo4j/tapis4ever neo4j:5.22.0 neo4j-admin database import full neo4j --nodes=/import/test_data_100k.csv --overwrite-destination

100K example

Imported: 100000 nodes 0 relationships 300000 properties Peak memory usage: 518.0MiB

500k example

IMPORT DONE in 2s 597ms. Imported: 500000 nodes 0 relationships 1500000 properties Peak memory usage: 519.5MiB

10M example

IMPORT DONE in 11s 580ms. Imported: 10000000 nodes 0 relationships 30000000 properties Peak memory usage: 632.1MiB
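To compare the runs, the reported durations can be turned into an import rate. The sketch below is only arithmetic on the figures printed above; the helper names parse_duration and nodes_per_second are made up for illustration, and the 100K run is left out because its summary does not include a duration.

```python
import re

# Figures copied from the import summaries above: (label, nodes, reported duration).
RUNS = [
    ("500k", 500_000, "2s 597ms"),
    ("10M", 10_000_000, "11s 580ms"),
]

_SECONDS_PER_UNIT = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}


def parse_duration(text):
    """Convert a duration such as '2s 597ms' into seconds."""
    total = 0.0
    # "ms" is listed first so that "597ms" is not read as minutes.
    for value, unit in re.findall(r"(\d+)\s*(ms|s|m|h)\b", text):
        total += int(value) * _SECONDS_PER_UNIT[unit]
    return total


def nodes_per_second(nodes, duration_text):
    """Return the average import rate for one run."""
    return nodes / parse_duration(duration_text)


def report():
    """Print the import rate for each run listed in RUNS."""
    for label, nodes, duration in RUNS:
        print(f"{label}: {nodes_per_second(nodes, duration):,.0f} nodes/s")
```

Calling report() works out to roughly 190,000 nodes per second for the 500k run and roughly 860,000 for the 10M run, which suggests (though two data points do not prove it) that fixed startup cost dominates the smaller import.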

Timing results of some queries

To time queries, first exec into the container:

docker exec -it neo bash

Then, from within the container:

# return total number of nodes 
cypher-shell --database=neo4j -u neo4j -p tapis4ever "MATCH (n) RETURN count(n) as nodes"

# return all distinct properties across all nodes 
cypher-shell --database=neo4j -u neo4j -p tapis4ever "MATCH (n) RETURN distinct keys(n)"

# property_1 is unique for each node 
cypher-shell --database=neo4j -u neo4j -p tapis4ever "MATCH (n) WHERE n.property_1 = 'prp_1' RETURN count(n) as nodes"

# property_2 is random, so several nodes should match any given value
cypher-shell --database=neo4j -u neo4j -p tapis4ever "MATCH (n) WHERE n.property_2 = 10 RETURN count(n) as nodes"
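The commands above print results but not how long each query took. One way to capture wall-clock times is a small wrapper run from the host, sketched below under a few assumptions: the container is named neo as in the docker run command earlier, docker is on the PATH, and the helper names cypher_shell_argv and time_command are hypothetical. The measured time includes cypher-shell startup and connection overhead, not just query execution.

```python
import subprocess
import time


def cypher_shell_argv(query, container="neo", user="neo4j",
                      password="tapis4ever", database="neo4j"):
    """Build a docker exec command that runs one query through cypher-shell."""
    return [
        "docker", "exec", container,
        "cypher-shell", f"--database={database}", "-u", user, "-p", password,
        query,
    ]


def time_command(argv):
    """Run a command and return (elapsed_seconds, stdout)."""
    start = time.perf_counter()
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return time.perf_counter() - start, result.stdout
```

For example, time_command(cypher_shell_argv("MATCH (n) RETURN count(n) as nodes")) returns the elapsed seconds together with the query output. Inside the container, prefixing a cypher-shell command with the shell's time keyword gives a similar measurement without any Python.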

Memory Usage

Usage started out a little under 500MB after server startup, even with the 10M data set. Queries temporarily increased usage, though. For example:

The simple count of all nodes raised usage to a peak of about 850MB; it then dropped back to around 550MB.
The distinct keys query and the property_1 query raised usage to a peak of about 1.62GB; it then dropped to around 1.41GB.
The property_2 query raised usage to a peak of almost 1.7GB; it then dropped back to around 1.45GB.
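Peaks like these can be read off docker stats by eye, but they can also be sampled programmatically. The following is a minimal sketch, assuming the container is named neo and that docker stats reports memory in its usual B/KiB/MiB/GiB form; the function names are illustrative. Because it polls at a fixed interval, it can miss spikes shorter than that interval.

```python
import re
import subprocess
import time

_UNIT_BYTES = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}


def parse_mem_usage(text):
    """Return the bytes in use from a MemUsage value such as '850MiB / 15.5GiB'."""
    used = text.split("/")[0].strip()  # keep only the "used" half
    match = re.fullmatch(r"([\d.]+)\s*(B|KiB|MiB|GiB)", used)
    if match is None:
        raise ValueError(f"unrecognized memory value: {used!r}")
    return int(float(match.group(1)) * _UNIT_BYTES[match.group(2)])


def current_usage(container="neo"):
    """Ask docker for a single memory sample for the given container."""
    out = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{.MemUsage}}", container],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_mem_usage(out)


def poll_peak(container="neo", samples=30, interval=1.0):
    """Sample memory usage repeatedly and return the highest value seen, in bytes."""
    peak = 0
    for _ in range(samples):
        peak = max(peak, current_usage(container))
        time.sleep(interval)
    return peak
```

Starting poll_peak() in one terminal just before running a query in another gives an approximate peak for that query.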

ASTRIA Data Experiments

Importing the Data

The high-level strategy that seems to work is this:

Start with a v3 dump file
Start a v4.0 neo4j instance and use it to import the data (this takes multiple steps; see below).
Once the data has been imported to a 4.0 instance, run a 4.4 instance "on top of" the data directory. This seems to work.

Work directory: ~/tmp/ASTRIA. The steps below assume two subdirectories, ~/tmp/ASTRIA/data and ~/tmp/ASTRIA/import. ~/tmp/ASTRIA/data holds the Neo4j database, while ~/tmp/ASTRIA/import holds the dump file, graph.db.dump.
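Since every docker run command below mounts data and import from the current directory, it can help to confirm the layout before starting. This is a small optional sketch; check_layout is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def check_layout(work_dir):
    """Return a list of problems with the expected layout; empty if all is in place."""
    work_dir = Path(work_dir).expanduser()
    problems = []
    if not (work_dir / "data").is_dir():
        problems.append("missing directory: data/")
    if not (work_dir / "import").is_dir():
        problems.append("missing directory: import/")
    elif not (work_dir / "import" / "graph.db.dump").is_file():
        problems.append("missing dump file: import/graph.db.dump")
    return problems
```

For example, check_layout("~/tmp/ASTRIA") returns an empty list when both directories and the dump file are present.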

First, start a container to create the initial database shell:

docker run -it --rm -p 7474:7474 -p 7687:7687 -v $(pwd)/data:/data -v $(pwd)/import:/import -e NEO4J_AUTH=neo4j/tapis4ever -e NEO4J_dbms_allow__upgrade=true neo4j:4.0

Shut down this container so the server stops, and then create a new container to load data:

docker run -it --entrypoint=bash --rm -p 7474:7474 -p 7687:7687 -v $(pwd)/data:/data -v $(pwd)/import:/import -e NEO4J_AUTH=neo4j/tapis4ever -e NEO4J_dbms_allow__upgrade=true neo4j:4.0

Next, import the data from the dump file:

neo4j-admin load --database=neo4j --from=/import/graph.db.dump --force

Finally, exit the shell to stop that container and start up Neo4j as normal. Here, we specify neo4j:4.0.

docker run \
  -it --rm \
  --name neo4j4.4 \
  -p 7474:7474 -p 7687:7687 \
  -v $(pwd)/data:/data -v $(pwd)/import:/import \
  -e NEO4J_AUTH=neo4j/tapis4ever \
  -e NEO4J_dbms_allow__upgrade=true \
  -e NEO4J_apoc_export_file_enabled=true \
  -e NEO4J_apoc_import_file_enabled=true \
  -e NEO4J_apoc_import_file_use__neo4j__config=true \
  -e NEO4JLABS_PLUGINS=\[\"apoc\"\] \
  -e NEO4J_dbms_security_procedures_unrestricted=algo.*,apoc.* \
  neo4j:4.0

Once the server has started and loaded the data correctly, stop the container and run the same command again, replacing neo4j:4.0 with neo4j:4.4.
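Because the 4.0 and 4.4 runs differ only in the image tag, the command can be generated from one place. The sketch below builds the argument list for the docker run command shown above with the image as a parameter; neo4j_run_argv is a hypothetical helper, and the NEO4J_ prefix on the procedures setting is an assumption made so that the image treats it as configuration. Passing a list to subprocess also avoids the shell escaping needed for the plugin list.

```python
import os


def neo4j_run_argv(image, work_dir=None, name="neo4j4.4", password="tapis4ever"):
    """Build the docker run argument list for the given Neo4j image tag."""
    work_dir = work_dir or os.getcwd()
    env = {
        "NEO4J_AUTH": f"neo4j/{password}",
        "NEO4J_dbms_allow__upgrade": "true",
        "NEO4J_apoc_export_file_enabled": "true",
        "NEO4J_apoc_import_file_enabled": "true",
        "NEO4J_apoc_import_file_use__neo4j__config": "true",
        "NEO4JLABS_PLUGINS": '["apoc"]',
        # Prefix assumed; see the note above.
        "NEO4J_dbms_security_procedures_unrestricted": "algo.*,apoc.*",
    }
    argv = [
        "docker", "run", "-it", "--rm", "--name", name,
        "-p", "7474:7474", "-p", "7687:7687",
        "-v", f"{work_dir}/data:/data",
        "-v", f"{work_dir}/import:/import",
    ]
    for key, value in env.items():
        argv += ["-e", f"{key}={value}"]
    argv.append(image)  # the image tag goes last, after all options
    return argv
```

Running subprocess.run(neo4j_run_argv("neo4j:4.0")) and later subprocess.run(neo4j_run_argv("neo4j:4.4")) from the work directory would correspond to the two steps described above.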

Starting the Server Again and Basic Queries

Once the above procedure has been done on a host, the Neo4j database container can be restarted from the same directory. The start_astria_db.sh script contains a complete docker run command that can be used.

Executing Basic Queries

The simplest way to test the database is to exec into the neo4j container:

docker exec -it neo4j4.4 bash

and then use cypher-shell to run queries against the local databases. Be sure to pass the same auth credentials established in the earlier steps; e.g.,

cypher-shell -u neo4j -p tapis4ever "MATCH (n) RETURN count(n) as nodes"
+-----------+
| nodes     |
+-----------+
| 103049370 |
+-----------+
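When checking counts like this from a script, the bordered table has to be turned back into values. The following is a minimal sketch of such a parser for the table layout shown above; parse_cypher_table is a hypothetical helper, it returns every value as a string, and it does not handle values that themselves contain a | character.

```python
def parse_cypher_table(text):
    """Parse cypher-shell's bordered table output into a list of dicts."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.splitlines()
        if line.strip().startswith("|")  # skip the +----+ border lines
    ]
    if not rows:
        return []
    header, *data = rows  # first table row holds the column names
    return [dict(zip(header, row)) for row in data]
```

Applied to the output above, the single result row maps the column name nodes to the string "103049370", which can then be converted with int().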
