
Uploading Data

Zebulun Arendsee edited this page Feb 4, 2022 · 6 revisions

Genbank data

octofludb pull is the base command for updating the database. The first run may take most of a day, since it must pull all flu data from Genbank and classify all the swine data with octoFLU. Subsequent runs are much faster.

I usually update the database once a month, just before handling WGS selections, with octofludb pull --nmonths=2. This pulls the last two months of data from Genbank (more than necessary, but the extra does no harm). It also runs octoFLU to classify all unclassified swine strains, finds antigenic motifs, fills in missing subtype data, etc. It may be a good idea to clear the last two months of cached Genbank data before running octofludb pull. This may be done by deleting the two most recent $OCTOFLUDB_HOME/build/.gb_YEAR-MONTH.ttl files.
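Picking out the two most recent cache files by hand is easy to get wrong, so here is a minimal sketch of a helper that lists them first. The function name recent_gb_cache is made up for this example, and it assumes the file names use a four-digit year and a zero-padded month (e.g. .gb_2022-01.ttl) so that sorting by name also sorts by date.

```shell
# List the N most recent monthly Genbank cache files in a build directory.
# ASSUMPTION: names look like .gb_2022-01.ttl (four-digit year, zero-padded
# month), so lexical order matches chronological order.
recent_gb_cache() {
    build_dir=$1
    n=${2:-2}   # default: the two most recent files
    for f in "$build_dir"/.gb_*.ttl; do
        # Skip the unexpanded pattern when no cache files exist.
        [ -e "$f" ] && printf '%s\n' "$f"
    done | sort | tail -n "$n"
}

# Dry run: print the files that would be removed.
#   recent_gb_cache "$OCTOFLUDB_HOME/build"
# After checking the list, delete them:
#   recent_gb_cache "$OCTOFLUDB_HOME/build" | while IFS= read -r f; do rm -- "$f"; done
```

Listing before deleting keeps the step reversible until you are sure the right months were selected.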

Custom data

octofludb prep offers several subcommands for processing different data types into turtle files that can be uploaded to the GraphDB database with octofludb upload.

The most important of these are fasta, table, and tag. fasta and table are internally almost equivalent calls. The only difference between a FASTA file and a tabular file is that for FASTA the final column is a sequence. However, this is ONLY true when the FASTA headers are formatted as delimited fields with the same number of fields in every header.
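Since the table-like behavior depends on every FASTA header having the same number of fields, it can help to check that before running octofludb prep fasta. The following is a rough sketch, not part of octofludb: the function name check_fasta_headers is hypothetical, and the "|" delimiter is an assumption, so pass whatever delimiter your headers actually use.

```shell
# Report FASTA headers whose field count differs from the first header.
# $1 = FASTA file, $2 = header delimiter (ASSUMED to be "|" by default).
# Returns non-zero if any header has a different number of fields.
check_fasta_headers() {
    awk -v d="${2:-|}" '
        /^>/ {
            # Drop the leading ">" and count the delimited fields.
            n = split(substr($0, 2), fields, d)
            if (expected == 0) {
                expected = n
            } else if (n != expected) {
                printf "line %d: %d fields, expected %d\n", NR, n, expected
                bad = 1
            }
        }
        END { exit bad + 0 }
    ' "$1"
}

# Usage (file name is only an example):
#   check_fasta_headers myseqs.fna '|' && echo "headers are consistent"
```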

For example, if you wanted to upload some new annotations for a set of strains:

id	coolness
OK384170	awesome
OK384184	awesome
OK384186	awesome
OL468248	awesome
OL331064	awesome
OK625341	boring
OK384168	boring
OK625361	boring
OL331238	boring
OK384281	boring

To upload these (assuming they are in the file named "coolness.txt"), you can do the following:

$ octofludb prep table coolness.txt > coolness.ttl
$ less coolness.ttl  # always inspect

The resulting coolness.ttl file contains the following turtle data:

@prefix f: <https://flu-crew.org/term/> .
@prefix fid: <https://flu-crew.org/id/> .

fid:ok384168 f:coolness "boring" ;
    f:genbank_id "OK384168" ;
    f:id "OK384168" .

fid:ok384170 f:coolness "awesome" ;
    f:genbank_id "OK384170" ;
    f:id "OK384170" .

fid:ok384184 f:coolness "awesome" ;
    f:genbank_id "OK384184" ;
    f:id "OK384184" .

fid:ok384186 f:coolness "awesome" ;
    f:genbank_id "OK384186" ;
    f:id "OK384186" .

fid:ok384281 f:coolness "boring" ;
    f:genbank_id "OK384281" ;
    f:id "OK384281" .

fid:ok625341 f:coolness "boring" ;
    f:genbank_id "OK625341" ;
    f:id "OK625341" .

fid:ok625361 f:coolness "boring" ;
    f:genbank_id "OK625361" ;
    f:id "OK625361" .

fid:ol331064 f:coolness "awesome" ;
    f:genbank_id "OL331064" ;
    f:id "OL331064" .

fid:ol331238 f:coolness "boring" ;
    f:genbank_id "OL331238" ;
    f:id "OL331238" .

fid:ol468248 f:coolness "awesome" ;
    f:genbank_id "OL468248" ;
    f:id "OL468248" .

Note that octofludb inferred that the id column was a genbank_id. However, it also preserved the "id" column. Whether this is the behavior you want is a good question. If you worry that uploading the "id" column could cause problems internally, you may want to rename the "id" header in your table to "genbank_id", which will circumvent the problem.
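If you do want to rename the header, a small awk filter avoids editing the table by hand. This is only a sketch: rename_id_header is a name invented here, and it assumes the table is tab-separated with "id" as the first column header, as in coolness.txt above.

```shell
# Rename a leading "id" header to "genbank_id" in a tab-separated table.
# ASSUMPTION: the file is tab-delimited and "id" is the first header.
rename_id_header() {
    awk 'BEGIN { FS = OFS = "\t" }
         NR == 1 && $1 == "id" { $1 = "genbank_id" }   # touch the header row only
         { print }' "$1"
}

# Usage (output file name is only an example):
#   rename_id_header coolness.txt > coolness-renamed.txt
#   octofludb prep table coolness-renamed.txt > coolness.ttl
#   less coolness.ttl  # always inspect
```

Data rows pass through unchanged, so the only difference in the input to octofludb prep table is the header name.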

Next you can upload the data to the GraphDB database:

$ octofludb upload coolness.ttl

Now you can query it:

PREFIX onto: <http://www.ontotext.com/>
PREFIX f: <https://flu-crew.org/term/>

SELECT DISTINCT
  ?strain
  ?clade
FROM onto:disable-sameAs
WHERE {
  ?sid f:strain_name ?strain .
  ?sid f:has_segment ?gid .
  ?gid f:segment_name "HA" .
  ?gid f:coolness "awesome" .
  ?gid f:gl_clade ?clade .
}

Running this query with octofludb query returns just the awesome sequences linked to their global clades:

A/swine/Jiangsu/65/2015 1C.2
A/swine/Texas/A02246976/2021    1A.3.3.2
A/swine/Kansas/A02248188/2021   1B.2.2.2
A/swine/Minnesota/A02248193/2021        1B.2.1
A/swine/North_Carolina/A02248187/2021   1A.3.3.3

Deleting data

What if you upload data and then later want to remove it? Here you must be careful. Since the uploaded data include connections that already existed, you cannot reverse the upload by simply deleting all edges represented in the turtle file. Doing so would also delete edges that existed before the upload. For example, the link from the segment id (e.g., fid:ok384168) to the genbank id (OK384168) would be lost from the database.

Deleting data requires a carefully constructed DELETE statement.

Before deleting anything, it is usually good to run a SELECT statement to see what exactly is being deleted:

PREFIX f: <https://flu-crew.org/term/>

SELECT
  ?gid
  ?coolness
WHERE {
  ?gid f:coolness ?coolness .
}

This returns:

https://flu-crew.org/id/ol331238        boring
https://flu-crew.org/id/ol468248        awesome
https://flu-crew.org/id/ol331064        awesome
https://flu-crew.org/id/ok625341        boring
https://flu-crew.org/id/ok384170        awesome
https://flu-crew.org/id/ok384168        boring
https://flu-crew.org/id/ok384186        awesome
https://flu-crew.org/id/ok384281        boring
https://flu-crew.org/id/ok625361        boring
https://flu-crew.org/id/ok384184        awesome

Looks right, so next we rephrase that query to delete instead of select:

PREFIX f: <https://flu-crew.org/term/>

DELETE {
  ?gid f:coolness ?coolness .
}
WHERE {
  ?gid f:coolness ?coolness .
}

Because this is an update rather than a query, it must be run with the octofludb update command instead of octofludb query.

After running this command, you can rerun the query, and nothing should be returned.
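As a rough sketch of that workflow, the DELETE statement above can be saved to a file and passed to octofludb update. The helper name write_delete_update and the file names are invented for this example, and passing the SPARQL file as a single argument is an assumption about the command line, so check octofludb update --help before relying on it.

```shell
# Write the DELETE statement from this page to a file of your choosing.
write_delete_update() {
    cat > "$1" <<'EOF'
PREFIX f: <https://flu-crew.org/term/>

DELETE {
  ?gid f:coolness ?coolness .
}
WHERE {
  ?gid f:coolness ?coolness .
}
EOF
}

# Usage, left commented out because it needs a running database.
# ASSUMPTION: both commands take the SPARQL file name as their argument,
# and select-coolness.rq holds the SELECT preview query shown earlier.
#   write_delete_update delete-coolness.rq
#   octofludb query select-coolness.rq    # preview what will be deleted
#   octofludb update delete-coolness.rq
#   octofludb query select-coolness.rq    # should now return nothing
```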

This is a simple case. Uploading custom data can get pretty messy. Just think carefully about how you will remove the data and check for conflicts between columns in the table and pre-existing relations in the database. And remember to inspect the *.ttl files before uploading them.
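One quick way to see which relations a turtle file would touch is to count the predicates in it. The sketch below is a rough text-based check, not a real turtle parser: ttl_predicates is a hypothetical name, and it assumes the f: prefix and the whitespace-separated layout that octofludb prep produced in the coolness.ttl example.

```shell
# Count how often each f: predicate appears in a turtle file.
# ASSUMPTION: predicates use the "f:" prefix and are separated from
# neighboring tokens by whitespace, as in the coolness.ttl example.
ttl_predicates() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^f:[A-Za-z0-9_]+$/) print $i
    }' "$1" | sort | uniq -c
}

# Usage:
#   ttl_predicates coolness.ttl
```

For coolness.ttl this would list f:coolness, f:genbank_id, and f:id, which makes it obvious that the file carries relations besides the new annotation. Literal values that happen to contain a standalone f:something token would be miscounted, so treat the output as a hint rather than a guarantee.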
