Handling large annotation sets #2

Open
ekg opened this issue Jun 23, 2015 · 2 comments
Comments

@ekg

ekg commented Jun 23, 2015

Could you provide efficient queries across the annotations using an FM-index over the concatenated annotation strings from the VCF file? A second compressed bitvector could mark where each variant's annotation starts in the concatenated text (basically storing a variant-to-annotation mapping).

Then you could subset to the records with a particular annotation by finding the occurrences of a given pattern and taking their ranks in the auxiliary bitvector.

I guess this wouldn't help much when you have to compare floats in the annotations and the annotation is present in all records. Then you end up needing to compare lots of values to execute the query. There might be a way around this, though.
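To make the FM-index plus start-bitvector idea concrete, here is a minimal Python sketch. It is only an illustration under several assumptions: the suffix array is built naively, the full suffix array is kept for locating hits, and a sorted list of start offsets with binary search stands in for rank queries on a compressed bitvector. The names `build_index` and `find_records` and the sample annotation strings are hypothetical, not part of BGT or any VCF library.

```python
from bisect import bisect_right

SEP = "\x01"       # record separator, assumed absent from annotation text
SENTINEL = "\x00"  # end-of-text marker, smaller than every other character


def build_index(annotations):
    """Concatenate per-record annotation strings and build a toy FM-index."""
    text, starts = "", []
    for ann in annotations:
        starts.append(len(text))  # offset where this record's annotation begins
        text += ann + SEP
    text += SENTINEL

    # Naive suffix array; fine for a demo, far too slow for real data.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = [text[i - 1] for i in sa]  # i == 0 wraps around to the sentinel

    # C[c]: number of characters in the text strictly smaller than c.
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)

    return {"sa": sa, "C": C, "occ": occ, "starts": starts, "n": len(text)}


def find_records(index, pattern):
    """Return sorted indices of records whose annotation contains pattern."""
    lo, hi = 0, index["n"]  # half-open suffix-array interval
    for ch in reversed(pattern):  # backward search
        if ch not in index["C"]:
            return []
        lo = index["C"][ch] + index["occ"][ch][lo]
        hi = index["C"][ch] + index["occ"][ch][hi]
        if lo >= hi:
            return []
    positions = [index["sa"][i] for i in range(lo, hi)]
    # Rank step: count record starts at or before each hit, minus one.
    return sorted({bisect_right(index["starts"], p) - 1 for p in positions})


idx = build_index([
    "missense;gene=BRCA1",
    "synonymous;gene=TP53",
    "missense;gene=TP53",
])
print(find_records(idx, "missense"))
print(find_records(idx, "TP53"))
```

The rank step is what turns text offsets into record numbers, so a real implementation would presumably swap the sorted list for a compressed bitvector with rank support and sample the suffix array rather than storing it whole.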

@ekg
Author

ekg commented Jun 23, 2015

Not the most precise description of what I mean, so let me know if it needs clarification.

@lh3
Owner

lh3 commented Jun 23, 2015

BGT has a different design from VCF. I see annotating each VCF file as a waste of resources, so I encourage using a single variant annotation file for all BGT databases. You locate a particular row in BGT by an allele string like "11:10000:1:C". Currently, BGT reads through the variant annotation file to collect allele strings and then finds the matching rows in BGT. It is reasonably fast. The preferred way is really to have a proper disk-based database backend for annotations. SQLite could be an option. Cassandra would be better if performance becomes an issue.
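As a rough sketch of what an SQLite-backed annotation store keyed by allele string might look like, the Python below uses the standard `sqlite3` module. The table name `annotation`, the columns `effect` and `af`, the helper names, and the sample rows are all hypothetical and chosen only for illustration. Nothing here reflects an actual BGT schema or interface.

```python
import sqlite3


def build_annotation_db(rows):
    """Create an in-memory annotation table keyed by allele string.

    The schema is hypothetical: one row per allele, with an effect label
    and an allele frequency.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE annotation ("
        "allele TEXT PRIMARY KEY, effect TEXT, af REAL)"
    )
    conn.executemany("INSERT INTO annotation VALUES (?, ?, ?)", rows)
    # Secondary index so filtering by effect avoids a full table scan.
    conn.execute("CREATE INDEX idx_effect ON annotation (effect)")
    conn.commit()
    return conn


def select_alleles(conn, effect, max_af):
    """Return allele strings whose annotation matches the given filter."""
    cur = conn.execute(
        "SELECT allele FROM annotation "
        "WHERE effect = ? AND af < ? ORDER BY allele",
        (effect, max_af),
    )
    return [row[0] for row in cur]


conn = build_annotation_db([
    ("11:10000:1:C", "missense", 0.01),
    ("11:10050:1:T", "synonymous", 0.20),
    ("12:500:1:G", "missense", 0.30),
])
alleles = select_alleles(conn, "missense", 0.05)
print(alleles)
```

The returned allele strings would then be the keys used to locate rows in BGT. Numeric filters such as the frequency cutoff are handled by the database itself, which is one place where a string index alone is a poor fit.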


2 participants