Uploading Data
octofludb pull is the base command for updating the database. It may take most of a day to run the first time, since it must pull all flu data from GenBank and classify all the swine data with octoFLU. However, subsequent runs are much faster.
I usually update the database once a month, just before handling WGS selections. The command I use is octofludb pull --nmonths=2. It pulls the last two months of data from GenBank (more than necessary, but more doesn't hurt). It also runs octoFLU to classify all unclassified swine strains, finds antigenic motifs, fills in missing subtype data, etc. It may be a good idea to clear the last two months of cached GenBank data before running octofludb pull. This may be done by deleting the two most recent $OCTOFLUDB_HOME/build/.gb_YEAR-MONTH.ttl files.
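Picking the two most recent cache files by hand is easy to get wrong, so here is a minimal Python sketch that lists them for review. It is not part of octofludb. The helper name most_recent_cache_files is hypothetical, and the sketch assumes that YEAR is a four-digit year and MONTH is a one- or two-digit month number.

```python
import re
from pathlib import Path

# Assumed file name layout, based on the .gb_YEAR-MONTH.ttl pattern above:
# a 4-digit year and a 1- or 2-digit month (this is an assumption).
CACHE_PATTERN = re.compile(r"^\.gb_(\d{4})-(\d{1,2})\.ttl$")


def most_recent_cache_files(build_dir, n=2):
    """Return the n most recent .gb_YEAR-MONTH.ttl files, newest first."""
    dated = []
    for path in Path(build_dir).iterdir():
        match = CACHE_PATTERN.match(path.name)
        if match:
            year, month = int(match.group(1)), int(match.group(2))
            dated.append(((year, month), path))
    # Sort on the numeric (year, month) pair rather than on the file name,
    # so the order is right even if months are not zero-padded.
    dated.sort(key=lambda item: item[0], reverse=True)
    return [path for _, path in dated[:n]]
```

To use it, call the function on $OCTOFLUDB_HOME/build, print the returned paths, and delete them (for example with path.unlink()) only after checking that they are the files you expect.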
octofludb prep offers several subcommands for processing different data types into turtle files that can be uploaded to the GraphDB database with octofludb upload.
The most important of these are fasta, table, and tag. Internally, fasta and table are almost equivalent calls. The only difference between a FASTA file and a tabular file is that for FASTA the final column is a sequence. This is ONLY true, however, when every FASTA header is split by a delimiter into the same number of fields.
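The FASTA/table equivalence can be illustrated with a small Python sketch. It is not octofludb's parser. The function name fasta_to_rows and the "|" delimiter are assumptions made for illustration. Each header is split into fields, and the sequence is appended as the final column.

```python
def fasta_to_rows(fasta_text, delimiter="|"):
    """Turn FASTA records into table rows: header fields, then the sequence.

    The "|" delimiter is an assumption; use whatever your headers use.
    """
    rows = []
    fields, seq_parts = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if fields is not None:
                rows.append(fields + ["".join(seq_parts)])
            fields = line[1:].split(delimiter)
            seq_parts = []
        elif fields is not None:
            # Sequence lines may wrap; they are joined into one string.
            seq_parts.append(line)
    if fields is not None:
        rows.append(fields + ["".join(seq_parts)])
    # The table view only makes sense if every header has the same
    # number of fields.
    if len({len(row) for row in rows}) > 1:
        raise ValueError("headers do not all have the same number of fields")
    return rows
```

For a record with the header >OK384170|awesome, the resulting row holds the two header fields followed by the sequence, which is the same shape as a three-column table row.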
For example, if you wanted to upload some new annotations for a set of strains:
id coolness
OK384170 awesome
OK384184 awesome
OK384186 awesome
OL468248 awesome
OL331064 awesome
OK625341 boring
OK384168 boring
OK625361 boring
OL331238 boring
OK384281 boring
To upload these (assuming they are in the file named "coolness.txt"), you can do the following:
$ octofludb prep table coolness.txt > coolness.ttl
$ less coolness.ttl # always inspect
The resulting coolness.ttl file contains the following turtle data:
@prefix f: <https://flu-crew.org/term/> .
@prefix fid: <https://flu-crew.org/id/> .
fid:ok384168 f:coolness "boring" ;
f:genbank_id "OK384168" ;
f:id "OK384168" .
fid:ok384170 f:coolness "awesome" ;
f:genbank_id "OK384170" ;
f:id "OK384170" .
fid:ok384184 f:coolness "awesome" ;
f:genbank_id "OK384184" ;
f:id "OK384184" .
fid:ok384186 f:coolness "awesome" ;
f:genbank_id "OK384186" ;
f:id "OK384186" .
fid:ok384281 f:coolness "boring" ;
f:genbank_id "OK384281" ;
f:id "OK384281" .
fid:ok625341 f:coolness "boring" ;
f:genbank_id "OK625341" ;
f:id "OK625341" .
fid:ok625361 f:coolness "boring" ;
f:genbank_id "OK625361" ;
f:id "OK625361" .
fid:ol331064 f:coolness "awesome" ;
f:genbank_id "OL331064" ;
f:id "OL331064" .
fid:ol331238 f:coolness "boring" ;
f:genbank_id "OL331238" ;
f:id "OL331238" .
fid:ol468248 f:coolness "awesome" ;
f:genbank_id "OL468248" ;
f:id "OL468248" .
Note that octofludb inferred that the id column was a genbank_id. However, it also preserved the "id" column. Whether this is the behavior you want is a good question. If you worry that uploading the "id" column could cause problems internally, you may want to rename the table header to "genbank_id", which will circumvent the problem.
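Renaming the header can be done in any editor, but a small Python sketch may help when there are many tables. The helper name rename_header is hypothetical, and the sketch assumes a tab-delimited table whose first line is the header.

```python
def rename_header(table_text, old="id", new="genbank_id", sep="\t"):
    """Rename one column in the header row, leaving the data rows untouched.

    Assumes the first line is the header and columns are separated by sep
    (tab by default, which is an assumption about the input file).
    """
    header, newline, body = table_text.partition("\n")
    columns = [new if name == old else name for name in header.split(sep)]
    return sep.join(columns) + newline + body
```

Applied to the contents of coolness.txt, this changes only the first line, from "id coolness" to "genbank_id coolness". Write the result to a new file and run octofludb prep table on that file instead.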
Next you can upload the data to the GraphDB database:
$ octofludb upload coolness.ttl
Now you can query it:
PREFIX onto: <http://www.ontotext.com/>
PREFIX f: <https://flu-crew.org/term/>
SELECT DISTINCT
?strain
?clade
FROM onto:disable-sameAs
WHERE {
?sid f:strain_name ?strain .
?sid f:has_segment ?gid .
?gid f:segment_name "HA" .
?gid f:coolness "awesome" .
?gid f:gl_clade ?clade .
}
Running this query with octofludb query returns just the awesome sequences linked to their global clades:
A/swine/Jiangsu/65/2015 1C.2
A/swine/Texas/A02246976/2021 1A.3.3.2
A/swine/Kansas/A02248188/2021 1B.2.2.2
A/swine/Minnesota/A02248193/2021 1B.2.1
A/swine/North_Carolina/A02248187/2021 1A.3.3.3
What if you upload data and then later want to remove it? Here you must be careful. Since the uploaded data include connections that already existed, you cannot reverse the upload by simply deleting all edges represented in the turtle file. Doing so would also delete edges that were in the database before the upload. For example, the link from segment id (e.g., fid:ok384168) to genbank id (OK384168) would be lost from the database.
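The problem can be made concrete by treating the graph as a set of triples. The following Python sketch is a toy model, not octofludb code. The function names are hypothetical, and it assumes the f:genbank_id edge for fid:ok384168 was already in the database before the upload.

```python
def naive_rollback(graph, uploaded):
    """Delete every triple that appears in the uploaded turtle file."""
    return graph - uploaded


def careful_rollback(graph, uploaded, before):
    """Delete only the uploaded triples that were not there before."""
    return graph - (uploaded - before)


# Toy data: triples are (subject, predicate, object) tuples.
# Assumption: this edge existed before the upload.
before = {("fid:ok384168", "f:genbank_id", "OK384168")}
uploaded = {
    ("fid:ok384168", "f:coolness", "boring"),
    ("fid:ok384168", "f:genbank_id", "OK384168"),
    ("fid:ok384168", "f:id", "OK384168"),
}

# A graph is a set, so re-uploading an existing triple adds nothing new.
after_upload = before | uploaded

# The naive rollback loses the pre-existing genbank_id edge.
lost = before - naive_rollback(after_upload, uploaded)

# The careful rollback restores the original state exactly.
restored = careful_rollback(after_upload, uploaded, before)
```

In practice you rarely have a snapshot of the graph from before the upload, which is why the removal has to be expressed as a targeted DELETE on the relations you know are new.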
Deleting data requires a carefully constructed DELETE statement.
Before deleting anything, it is usually good to run a SELECT statement to see what exactly is being deleted:
PREFIX f: <https://flu-crew.org/term/>
SELECT
?gid
?coolness
WHERE {
?gid f:coolness ?coolness .
}
This returns:
https://flu-crew.org/id/ol331238 boring
https://flu-crew.org/id/ol468248 awesome
https://flu-crew.org/id/ol331064 awesome
https://flu-crew.org/id/ok625341 boring
https://flu-crew.org/id/ok384170 awesome
https://flu-crew.org/id/ok384168 boring
https://flu-crew.org/id/ok384186 awesome
https://flu-crew.org/id/ok384281 boring
https://flu-crew.org/id/ok625361 boring
https://flu-crew.org/id/ok384184 awesome
Looks right, so next we rephrase that query to delete instead of select:
PREFIX f: <https://flu-crew.org/term/>
DELETE {
?gid f:coolness ?coolness .
}
WHERE {
?gid f:coolness ?coolness .
}
Note that this is not a query but an update, so we use the octofludb update command.
After running this command, you can rerun the query, and nothing should be returned.
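One way to keep the SELECT preview and the DELETE in step is to generate both from the same triple pattern. The Python sketch below only builds the two SPARQL strings; it does not talk to the database. The helper name select_and_delete is hypothetical.

```python
PREFIX = "PREFIX f: <https://flu-crew.org/term/>"


def select_and_delete(pattern, variables, prefix=PREFIX):
    """Build a SELECT preview and a matching DELETE from one triple pattern.

    Both statements share the same WHERE clause, so whatever the SELECT
    shows is what the DELETE will remove.
    """
    where = "WHERE {\n  " + pattern + "\n}"
    select = prefix + "\nSELECT " + " ".join(variables) + "\n" + where
    delete = prefix + "\nDELETE {\n  " + pattern + "\n}\n" + where
    return select, delete


select_text, delete_text = select_and_delete(
    "?gid f:coolness ?coolness .", ["?gid", "?coolness"]
)
```

Save select_text and delete_text to files, run the first with octofludb query, inspect the rows, and only then run the second with octofludb update.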
This is a simple case. Uploading custom data can get pretty messy. Just think carefully about how you will remove the data and check for conflicts between columns in the table and pre-existing relations in the database. And remember to inspect the *.ttl files before uploading them.