[E::hts_idx_push] Invalid record on sequence #2: end 1 < begin 2040518 #2329

Closed
ivan-adzhubey opened this issue Dec 4, 2024 · 2 comments

@ivan-adzhubey

Hi,

I am getting this cryptic error message when trying to create an index for the NCBI ALFA VCF:

$ tabix -p vcf freq.vcf.gz
[E::hts_idx_push] Invalid record on sequence #2: end 1 < begin 2040518
tbx_index_build3 failed: freq.vcf.gz

The VCF was downloaded from NCBI FTP site: https://ftp.ncbi.nih.gov/snp/population_frequency/latest_release/

$ tabix |& head -3
Version: 1.21
Usage: tabix [OPTIONS] [FILE] [REGION [...]]

Re-heading the VCF with contig information from the NCBI GRCh38 FASTA sequence file does not help; the same error persists. It looks like the file itself is not strict VCF v4.1, even though it declares itself as such. It must have been indexed somehow at some point, though, since the FTP site also hosts an accompanying freq.vcf.gz.tbi file. However, I cannot recreate the index file with the current tabix version, 1.21.

@pd3
Member

pd3 commented Dec 16, 2024

It looks like the data is malformed. The second sequence in the VCF is NC_000008.11, and one of its records is missing a column: every other row in the VCF has 21 columns, but this one has just 20.

$ zcat freq.vcf.gz | awk '{if($1=="NC_000008.11" && $2==2040518)print $1,$2,NF}' 
NC_000008.11 2040518 21
NC_000008.11 2040518 20
NC_000008.11 2040518 21
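
The same check can be generalized so that the position does not have to be known in advance. The following is a minimal sketch, not part of tabix or bcftools: the function name `check_vcf_columns` is hypothetical, and it assumes the `#CHROM` header line carries the correct column count. It uses awk's default whitespace splitting, as in the command above, and prints the line number, contig, position, and field count of every data row that disagrees with the header.

```shell
# Hypothetical helper: list data rows whose field count differs from the #CHROM line.
# Usage: check_vcf_columns FILE.vcf.gz
check_vcf_columns() {
    gzip -dc "$1" | awk '
        /^##/     { next }                    # skip meta-information lines
        /^#CHROM/ { expected = NF; next }     # column count declared by the header
        NF != expected { print NR, $1, $2, NF, "expected " expected }
    '
}
```

An empty result means every data row has as many fields as the header line; any printed row is a candidate for the kind of truncated record seen here.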

Running it through bcftools view appears to fix it:

$ bcftools view freq.vcf.gz -H | awk '{if($1=="NC_000008.11" && $2==2040518)print $1,$2,NF}' 
NC_000008.11 2040518 21
NC_000008.11 2040518 21
NC_000008.11 2040518 21
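
Building on that observation, one possible workflow is to write the bcftools output to a new bgzip-compressed file and index that copy instead of the original. This is only a sketch: the function name `rewrite_and_index` and the output filename are hypothetical, and the thread does not confirm that tabix then succeeds on the rewritten file.

```shell
# Hypothetical helper: rewrite a VCF through bcftools, then index the result.
# Usage: rewrite_and_index INPUT.vcf.gz OUTPUT.vcf.gz
rewrite_and_index() {
    # -Oz writes bgzip-compressed VCF; -o names the output file
    bcftools view -Oz -o "$2" "$1" &&
    # index the rewritten copy only if bcftools finished without error
    tabix -p vcf "$2"
}
```

For example, `rewrite_and_index freq.vcf.gz freq.fixed.vcf.gz` leaves the downloaded file untouched and attempts to create `freq.fixed.vcf.gz.tbi` next to the rewritten copy.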

@pd3 pd3 closed this as completed Dec 16, 2024
@ivan-adzhubey
Author

Oh, so "sequence" in the error message means contig; that helps. The VCF file in question is really a mess. Sorry for bugging you about something that was not your code's problem in the first place. Thank you!
