DBLP Pipeline

This is a data pipeline built as an assignment for CS 516: Database Systems at Duke University. It ingests XML data from the DBLP Computer Science Bibliography and turns it into a Postgres database for analysis.

Setup

The first step is to create a conda environment from a1.txt. Run the command below, substituting the path to a1.txt if your environment file is not named environment.yml:

conda env create -f environment.yml

After that, run setup.sh to download the DBLP XML data.

Note that this project was specifically made to run correctly on WSL2.

Running the pipeline

Assuming that you have Postgres installed and set up correctly, the first step is setting up the following tables:

article: pubkey (text), journal (text), year (int)
inproceedings: pubkey (text), booktitle (text), year (int)
authorship: pubkey (text), author (text)
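
As a rough sketch of the table setup above, the following Python builds one CREATE TABLE statement per table and runs them on a DB-API connection. Only the table names, column names, and types come from this README; the helper names `create_table_statements` and `create_tables` are hypothetical, and the in-memory SQLite connection is a stand-in for a real Postgres connection (for example one opened with psycopg2), so treat it as an illustration and not as the project's own setup code.

```python
import sqlite3

# Table and column layout copied from the README. Everything else in this
# sketch (function names, the SQLite stand-in) is an assumption.
SCHEMA = {
    "article": [("pubkey", "text"), ("journal", "text"), ("year", "int")],
    "inproceedings": [("pubkey", "text"), ("booktitle", "text"), ("year", "int")],
    "authorship": [("pubkey", "text"), ("author", "text")],
}


def create_table_statements(schema=SCHEMA):
    """Build one CREATE TABLE statement per table in the schema."""
    statements = []
    for table, columns in schema.items():
        column_sql = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
        statements.append(f"CREATE TABLE IF NOT EXISTS {table} ({column_sql});")
    return statements


def create_tables(connection, schema=SCHEMA):
    """Run the statements on any DB-API connection and commit."""
    cursor = connection.cursor()
    for statement in create_table_statements(schema):
        cursor.execute(statement)
    connection.commit()


if __name__ == "__main__":
    # In-memory SQLite is used here only so the sketch runs without a server.
    demo_connection = sqlite3.connect(":memory:")
    create_tables(demo_connection)
    for statement in create_table_statements():
        print(statement)
```

The statements can also be pasted directly into psql if you prefer to create the tables by hand.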

After that, run run.sh to convert the downloaded XML into rows in these tables.
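
To illustrate the kind of conversion this step performs, here is a minimal sketch that streams DBLP-style records and splits them into rows for the three tables. It is not the code in pipeline.py: the function name `parse_records`, the `VENUE_TAG` mapping, and the sample XML are all made up for illustration, and the sketch assumes each record carries a `key` attribute plus `author`, `journal` or `booktitle`, and `year` child elements.

```python
import io
import xml.etree.ElementTree as ET

# Tiny invented sample in the shape of DBLP records; not real DBLP data.
SAMPLE_XML = b"""<dblp>
<article key="journals/example/Doe20">
<author>Jane Doe</author>
<author>John Roe</author>
<title>An Example Article</title>
<journal>Example Journal</journal>
<year>2020</year>
</article>
<inproceedings key="conf/example/Roe19">
<author>John Roe</author>
<title>An Example Paper</title>
<booktitle>EXAMPLE</booktitle>
<year>2019</year>
</inproceedings>
</dblp>"""

# Which child element holds the venue for each record type (assumed mapping).
VENUE_TAG = {"article": "journal", "inproceedings": "booktitle"}


def parse_records(source):
    """Stream records from a file-like object into rows for the three tables."""
    rows = {"article": [], "inproceedings": [], "authorship": []}
    for _event, element in ET.iterparse(source, events=("end",)):
        if element.tag not in VENUE_TAG:
            continue  # skip child elements and the root
        pubkey = element.get("key")
        venue = element.findtext(VENUE_TAG[element.tag])
        year_text = element.findtext("year")
        year = int(year_text) if year_text else None
        rows[element.tag].append((pubkey, venue, year))
        for author in element.findall("author"):
            rows["authorship"].append((pubkey, author.text))
        element.clear()  # drop the record's children to limit memory use
    return rows


if __name__ == "__main__":
    parsed = parse_records(io.BytesIO(SAMPLE_XML))
    for table, table_rows in parsed.items():
        print(table, table_rows)
```

One caveat: the full DBLP dump relies on a DTD for named character entities, which the standard-library parser does not resolve, so a real run may need a DTD-aware parser such as lxml.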

Analysis

Some basic SQL analysis of the database can be found in assignment1.ipynb.
