Handling large annotation sets #2

ekg · 2015-06-23T12:58:45Z

Could you provide efficient queries across the annotations using an FM-index over the concatenated annotation strings from the VCF file? A second, compressed bitvector could mark where each variant's annotation starts in the concatenated text (basically storing a variant-to-annotation mapping).

Then you could subset to a given set of records with a particular annotation by finding the ranks of the occurrences of a given pattern in the auxiliary bitvector.
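A minimal sketch of this idea, under loud assumptions: a naively built suffix array stands in for the FM-index, and a sorted list of start offsets stands in for the compressed bitvector (a `bisect` over it plays the role of a rank query). The function names and the sample annotation strings are hypothetical, and none of this is BGT code.

```python
from bisect import bisect_right


def build_index(annotations):
    """Concatenate per-record annotation strings and note where each one starts."""
    text = ""
    starts = []  # offsets of the set bits in the (here uncompressed) start bitvector
    for ann in annotations:
        starts.append(len(text))
        text += ann + "\x00"  # separator, so a match cannot span two records
    # Naive suffix array (stand-in for an FM-index; fine only for tiny inputs).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, starts, sa


def find_occurrences(text, sa, pattern):
    """Return every offset where pattern occurs, via binary search on the suffix array."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return [sa[i] for i in range(first, lo)]


def records_matching(index, pattern):
    """Map pattern occurrences back to record numbers with a rank over the starts."""
    text, starts, sa = index
    hits = find_occurrences(text, sa, pattern)
    # rank1(pos) - 1 is the 0-based record whose annotation contains pos
    return sorted({bisect_right(starts, pos) - 1 for pos in hits})


# Hypothetical INFO-like strings, one per variant record.
index = build_index(["AC=2;AF=0.5", "AC=1;DB", "AF=0.1;DB"])
print(records_matching(index, "DB"))
```

A real implementation would swap the suffix array for a compressed FM-index and the offset list for a succinct bitvector with rank support; the mapping from occurrence to record stays the same.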

I guess this wouldn't help much when you have to compare floats in the annotations and the annotation is present in every record: then you end up comparing lots of values to execute the query. There might be a way around this, though.

ekg · 2015-06-23T13:00:04Z

That's not the most precise definition of what I mean, so let me know if it needs clarification.

lh3 · 2015-06-23T13:38:48Z

BGT has a different design from VCF. I see annotating each VCF as a waste of resources, so I encourage using a single variant annotation file for all BGT databases. You locate a particular row in BGT by an allele string like "11:10000:1:C". Currently, BGT reads through the variant annotation file to collect allele strings and then finds the matching rows in BGT. It is reasonably fast. The preferred way is really to have a proper disk-based database backend for annotations. SQLite could be an option. Cassandra would be better if performance becomes an issue.
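To illustrate the SQLite option, here is a minimal sketch using Python's standard `sqlite3` module. The table name, the `gene` and `af` columns, and the sample rows are all assumptions made up for illustration; only the allele string format comes from the comment above. The query returns allele strings, which could then be used to locate rows in BGT.

```python
import sqlite3


def build_annotation_db(rows, path=":memory:"):
    """Create a (hypothetical) annotation table keyed by allele string."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS annotation ("
        "allele TEXT PRIMARY KEY, gene TEXT, af REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO annotation VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def alleles_where_af_below(conn, max_af):
    """Return the allele strings whose (assumed) allele frequency is below max_af."""
    cur = conn.execute(
        "SELECT allele FROM annotation WHERE af < ? ORDER BY allele", (max_af,)
    )
    return [row[0] for row in cur]


# Made-up example rows: (allele string, gene, allele frequency).
conn = build_annotation_db([
    ("11:10000:1:C", "GENE_A", 0.01),
    ("11:20000:1:T", "GENE_B", 0.30),
])
print(alleles_where_af_below(conn, 0.05))
```

An index on a numeric column such as `af` would let the database handle the float comparisons without scanning every annotation.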

ml4wc mentioned this issue Apr 27, 2016

freebayes vcfs (1.0-r282 and 1.0-r265) #6

Closed

zmaroti mentioned this issue Sep 16, 2021

annotation of output #14

Open
