Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

error with makeTxDbFromGenBank #15

Open
jayoung opened this issue Oct 14, 2021 · 0 comments
Open

error with makeTxDbFromGenBank #15

jayoung opened this issue Oct 14, 2021 · 0 comments

Comments

@jayoung
Copy link

jayoung commented Oct 14, 2021

hi there,

I just found your package - looks really useful: thank you!

I had no trouble reading in my genbank sequence using readGenBank, but when I tried to use makeTxDbFromGenBank I encountered some errors. I got a little way down the road in trying to hack a solution before giving up - I've included notes below.

It suspect it is something about the specific Genbank entry I'm trying to use: it's a viral genome, so most genes are intronless, and most genes lack a transcript annotation and lack exon annotation (most have only the gene and CDS). I don't know how common this situation is, so don't know how worthwhile it will be for you to fix it.

But as I started looking at the code, I thought it might not be too difficult to fix it, and I would guess it is might be common enough for genomes like viruses and simple eukaryotes like S.cerevisiae and friends that it could be good motivation.

library(genbankr)

# this works
ref <- readGenBank(GBAccession("JQ673480.1"))

##### makeTxDbFromGenBank error #1:
ref_txdb <- makeTxDbFromGenBank(ref, reassign.ids = FALSE)
# Error in data.frame(tx_id = as.integer(factor(txgr$transcript_id)), gene_id = as.character(txgr$gene_id),  : 
#  arguments imply differing number of rows: 3, 0

I started playing around with the makeTxDbFromGenBank function and was able to get past this error by changing line 87 of txdb.R as follows:

gene_id = as.character(txgr$gene_id)
# changed to to 
gene_id = as.character(txgr$gene)

After fixing that I tried again:

###### error # 2
ref_txdb <- makeTxDbFromGenBank_JY(ref, reassign.ids = FALSE)
##  Error in .check_foreign_key(splicings$tx_id, "integer", "splicings$tx_id",  : 'splicings$tx_id' cannot contain NAs 

I was able to get past that one by changing line 71 of txdb.R

splicedf = data.frame(tx_id = chartxidToInt_JY(cdsgr$transcript_id, txgr$transcript_id), ...
## to 
tx_id = chartxidToInt_JY( cdsgr$transcript_id ), ...

and on trying again:

###### error # 3
makeTxDbFromGenBank_JY(ref, reassign.ids = FALSE)
#Error in .check_foreign_key(splicings$tx_id, "integer", "splicings$tx_id",  : 
#      all the values in 'splicings$tx_id' must be present in #'transcripts$tx_id' 

I can see that the problem is stemming from the fact I have many CDSs without a corresponding transcript. The help page ?readGenBank sheds some light on why that is, for this genome:
In files where transcripts are not present, 'approximate transcripts' defined by the ranges spanned by groups of exons are used. Currently, we do not support generating approximate transcripts from CDSs in files that contain actual transcript annotations, even if those annotations do not cover all genes with CDS/exon annotations.

Would you be able to implement that support to help in cases like mine? thanks for considering it!

all the best,

Janet Young

sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.6

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] genbankr_1.20.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7                  lattice_0.20-45             prettyunits_1.1.1          
 [4] png_0.1-7                   Rsamtools_2.8.0             Biostrings_2.60.2          
 [7] assertthat_0.2.1            digest_0.6.28               utf8_1.2.2                 
[10] BiocFileCache_2.0.0         R6_2.5.1                    GenomeInfoDb_1.28.4        
[13] stats4_4.1.1                RSQLite_2.2.8               httr_1.4.2                 
[16] pillar_1.6.3                zlibbioc_1.38.0             rlang_0.4.11               
[19] GenomicFeatures_1.44.2      progress_1.2.2              curl_4.3.2                 
[22] rstudioapi_0.13             rentrez_1.2.3               blob_1.2.2                 
[25] S4Vectors_0.30.2            Matrix_1.3-4                BiocParallel_1.26.2        
[28] stringr_1.4.0               RCurl_1.98-1.5              bit_4.0.4                  
[31] biomaRt_2.48.3              DelayedArray_0.18.0         compiler_4.1.1             
[34] rtracklayer_1.52.1          pkgconfig_2.0.3             BiocGenerics_0.38.0        
[37] SummarizedExperiment_1.22.0 tidyselect_1.1.1            KEGGREST_1.32.0            
[40] tibble_3.1.5                GenomeInfoDbData_1.2.6      matrixStats_0.61.0         
[43] IRanges_2.26.0              XML_3.99-0.8                fansi_0.5.0                
[46] crayon_1.4.1                dplyr_1.0.7                 dbplyr_2.1.1               
[49] GenomicAlignments_1.28.0    bitops_1.0-7                rappdirs_0.3.3             
[52] grid_4.1.1                  jsonlite_1.7.2              lifecycle_1.0.1            
[55] DBI_1.1.1                   magrittr_2.0.1              stringi_1.7.5              
[58] cachem_1.0.6                XVector_0.32.0              xml2_1.3.2                 
[61] ellipsis_0.3.2              filelock_1.0.2              generics_0.1.0             
[64] vctrs_0.3.8                 rjson_0.2.20                restfulr_0.0.13            
[67] tools_4.1.1                 bit64_4.0.5                 BSgenome_1.60.0            
[70] Biobase_2.52.0              glue_1.4.2                  purrr_0.3.4                
[73] MatrixGenerics_1.4.3        hms_1.1.1                   parallel_4.1.1             
[76] fastmap_1.1.0               yaml_2.2.1                  AnnotationDbi_1.54.1       
[79] GenomicRanges_1.44.0        memoise_2.0.0               VariantAnnotation_1.38.0   
[82] BiocIO_1.2.0               

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant