This repository has been archived by the owner on Nov 29, 2023. It is now read-only.

pubmed_xml2rdbms

This is an ETL project that transfers MEDLINE/PubMed citation record data from a set of XML files to a MySQL table.

The National Library of Medicine (NLM) provides a baseline set of MEDLINE/PubMed citation records in XML format for bulk download on an annual basis. For each citation record, we extract the following subset of data:

  • PubMed ID
  • Article Title
  • Abstract Text
  • Keywords
  • Medical Subject Headings (MeSH)
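
The extraction of these five fields can be sketched with the JDK's built-in StAX parser. This is a minimal illustration, not the project's actual implementation: the class `CitationSketch`, its `Citation` holder, and the `parse` method are hypothetical names. It assumes the element names used in the NLM citation XML (`PubmedArticle`, `PMID`, `ArticleTitle`, `AbstractText`, `Keyword`, `DescriptorName`), and the inline sample contains no DOCTYPE declaration.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class CitationSketch {

    /** Hypothetical holder for the five extracted fields. */
    public static final class Citation {
        public String pmid;
        public String title;
        public String abstractText = "";
        public final List<String> keywords = new ArrayList<>();
        public final List<String> meshHeadings = new ArrayList<>();
    }

    /** Collects character data up to the end of the current element, including text inside nested inline tags. */
    private static String readText(XMLStreamReader r) throws XMLStreamException {
        StringBuilder sb = new StringBuilder();
        int depth = 1;
        while (depth > 0) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
            } else if (event == XMLStreamConstants.CHARACTERS || event == XMLStreamConstants.CDATA) {
                sb.append(r.getText());
            }
        }
        return sb.toString().trim();
    }

    /** Streams through the XML and returns one Citation per PubmedArticle element. */
    public static List<Citation> parse(Reader xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Turn off DTD support and external entity resolution for this sketch.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        XMLStreamReader r = factory.createXMLStreamReader(xml);

        List<Citation> citations = new ArrayList<>();
        Citation current = null;
        while (r.hasNext()) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = r.getLocalName();
                if (name.equals("PubmedArticle")) {
                    current = new Citation();
                } else if (current != null) {
                    switch (name) {
                        case "PMID":
                            // Keep only the first PMID; later ones may belong to referenced records.
                            if (current.pmid == null) current.pmid = readText(r);
                            break;
                        case "ArticleTitle":
                            current.title = readText(r);
                            break;
                        case "AbstractText":
                            // Abstracts may be split into several sections; join them with spaces.
                            current.abstractText = (current.abstractText + " " + readText(r)).trim();
                            break;
                        case "Keyword":
                            current.keywords.add(readText(r));
                            break;
                        case "DescriptorName":
                            current.meshHeadings.add(readText(r));
                            break;
                        default:
                            break;
                    }
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && r.getLocalName().equals("PubmedArticle") && current != null) {
                citations.add(current);
                current = null;
            }
        }
        r.close();
        return citations;
    }

    public static void main(String[] args) throws XMLStreamException {
        String sample = "<PubmedArticleSet><PubmedArticle><MedlineCitation>"
                + "<PMID>1</PMID><Article><ArticleTitle>Sample title</ArticleTitle></Article>"
                + "</MedlineCitation></PubmedArticle></PubmedArticleSet>";
        for (Citation c : parse(new StringReader(sample))) {
            System.out.println(c.pmid + " | " + c.title);
        }
    }
}
```

A streaming parser is used here because the baseline files are large; only one citation is held in memory at a time before it would be handed to the database layer.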

After all citation records are processed, the fully populated MySQL table is used by the NCBO Resource Index project. The Resource Index consumes data from biomedical resources and generates annotations from ontology classes in the BioPortal application.

Configuration

The configuration file in src/main/resources specifies the path to the baseline set of XML files and the database settings, e.g., table name and credentials.
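
As an illustration of what such settings might look like when read from Java, the sketch below loads them with `java.util.Properties`. Everything specific here is an assumption: this README does not state the configuration file's name, format, or keys, so the key names (`pubmed.xml.dir`, `db.url`, `db.table`, `db.user`, `db.password`) and the `ConfigSketch` class are hypothetical. Check the actual file in src/main/resources for the real names.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

public class ConfigSketch {

    /** Hypothetical, immutable view of the settings the ETL job needs. */
    public static final class Settings {
        public final String xmlDirectory;
        public final String jdbcUrl;
        public final String tableName;
        public final String user;
        public final String password;

        Settings(Properties p) {
            // All key names below are illustrative assumptions, not the project's real keys.
            xmlDirectory = require(p, "pubmed.xml.dir");
            jdbcUrl = require(p, "db.url");
            tableName = require(p, "db.table");
            user = require(p, "db.user");
            password = require(p, "db.password");
        }
    }

    /** Returns the trimmed value for key, or fails fast if it is missing or blank. */
    static String require(Properties p, String key) {
        String value = p.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException("Missing required setting: " + key);
        }
        return value.trim();
    }

    /** Parses properties-format text from the reader into a Settings object. */
    public static Settings load(Reader in) throws IOException {
        Properties p = new Properties();
        p.load(in);
        return new Settings(p);
    }

    public static void main(String[] args) throws IOException {
        String example = String.join("\n",
                "pubmed.xml.dir=/data/pubmed/baseline",
                "db.url=jdbc:mysql://localhost:3306/resource_index",
                "db.table=pubmed_citations",
                "db.user=etl",
                "db.password=changeme");
        Settings s = load(new StringReader(example));
        System.out.println(s.xmlDirectory + " -> " + s.tableName);
    }
}
```

Failing fast on a missing key surfaces configuration mistakes before any XML file is opened, which matters for a job that may run for a long time.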

Use the logback.xml file in src/main/resources to customize log output.

Build

This is a Maven project. Use the typical Maven command to compile and package a runnable JAR file:

mvn package

Make sure to use the JAR file with dependencies included, e.g.:

pubmed-xml2rdbms-1.0-SNAPSHOT-jar-with-dependencies.jar

Run

Successful execution of the JAR file assumes that:

  • You have access to the MySQL database specified in the configuration file
  • You have downloaded the baseline set of XML files from NLM and specified their path in the configuration file

java -jar pubmed-xml2rdbms-1.0-SNAPSHOT-jar-with-dependencies.jar