Models for newer versions of Guppy with sup basecalls #198

hasindu2008 · 2023-10-06T01:26:24Z

Hi NanoSim developers,

I am wondering whether you have a pre-trained model for simulating human genomic reads that resemble those basecalled with newer versions of Guppy (e.g. Guppy 6) in super accuracy mode (R9.4.1 chemistry). Or does the default -c human option already suffice to emulate this kind of data?

kmnip · 2023-10-09T16:24:16Z

The existing models in NanoSim are very old.
If you have an ONT dataset with the right basecaller+chemistry combination, then you can subsample to 1 million reads and run the characterization stage. It should be relatively straightforward, e.g.

# subsample 1 million reads (seqtk writes uncompressed FASTQ to stdout, so compress it)
seqtk sample reads.fq.gz 1000000 | gzip > reads_sub1M.fq.gz

# NanoSim characterization stage
read_analysis.py genome -i reads_sub1M.fq.gz ...
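
A minimal sketch of the subsampling step above, assuming `seqtk`, `gzip`, and `awk` are on the PATH. The function names `subsample_reads` and `count_fastq_reads` are hypothetical helpers, not part of NanoSim or seqtk. It fixes the random seed so the subsample is reproducible and lets you check the read count before running the characterization stage.

```shell
# Subsample N reads with a fixed seed and write gzip-compressed FASTQ.
# seqtk prints uncompressed FASTQ to stdout, hence the pipe through gzip.
# Arguments: input FASTQ, number of reads, output file, seed (optional).
subsample_reads() {
    in_fq="$1"
    n_reads="$2"
    out_fq="$3"
    seed="${4:-11}"
    seqtk sample -s "$seed" "$in_fq" "$n_reads" | gzip > "$out_fq"
}

# Count records in a gzipped FASTQ file.
# Assumption: every record spans exactly 4 lines (no wrapped sequences).
count_fastq_reads() {
    gzip -cd "$1" | awk 'END { print NR / 4 }'
}

# Example (not run here):
#   subsample_reads reads.fq.gz 1000000 reads_sub1M.fq.gz 42
#   count_fastq_reads reads_sub1M.fq.gz
```

If the count is lower than requested, the input simply had fewer reads than the target.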

hasindu2008 · 2023-10-16T03:47:03Z

Can we use NA12878 data to train a human genome model and then use the trained model as a generic human model?
To explain further: say I create a custom human genome reference file by manually incorporating some variants. Can the model trained with NA12878 be used on such a reference, or do I need to train on data from samples that carry exactly those variants?

SaberHQ · 2024-02-22T23:33:58Z

> Can we use NA12878 data to train the human genome and then use the trained model as a generic human model? To explain further, say I create a custom human genome reference file by manually incorporating some variants, can the model trained with NA12878 be used on such a reference? Or do I need to train on data samples exactly with those variants?

I am not sure I understood your question correctly. But if you want to simulate nanopore reads with the characteristics of the latest basecaller and chemistry, I would suggest training NanoSim on such reads and then using the simulator to generate reads from a given reference genome/transcriptome.
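
The two stages can be sketched roughly as below. This is only an illustration of the mechanics: the wrapper functions `run`, `train_model`, and `simulate_reads` and all file names are hypothetical, and the flags (`-i`, `-rg`, `-o`, `-t`, `-c`, `-n`) are my reading of the NanoSim README, so please confirm them with `read_analysis.py genome --help` and `simulator.py genome --help` for your version. Note that the reference given to the simulator is a separate argument from the one used for training.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Characterization stage: learn a model from real reads.
# Arguments: reads, reference the reads came from, output prefix, threads.
train_model() {
    run read_analysis.py genome -i "$1" -rg "$2" -o "$3" -t "${4:-4}"
}

# Simulation stage: generate reads from a (possibly different) reference.
# Arguments: target reference, model prefix, output prefix, number of reads.
simulate_reads() {
    run simulator.py genome -rg "$1" -c "$2" -o "$3" -n "$4"
}

# Example (hypothetical paths, not run here):
#   train_model reads_sub1M.fq.gz ref.fa model/training 8
#   simulate_reads custom_ref.fa model/training sim/simulated 1000
```

Setting `DRY_RUN=1` first prints the commands, which makes it easy to review them before starting a long run.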

Btw, if you are interested in the newer basecaller, dorado, we provide a trained model for H. sapiens NA24385 - AshkenazimTrio - Son (hg002), which was sequenced using Kit v14 (R10 chemistry) and basecalled with dorado. The model trained on 1 million subsampled reads is available on the GitHub page (along with the other models in the pre-trained_models folder). If you are interested in the model trained on the whole dataset, you can get it through Zenodo - DOI: 10.5281/zenodo.10064740. The model was trained using NanoSim v3.0.2 with scikit-learn v0.23.2 and Python v3.7.10.

If you have any issues using the pre-trained models, check the dependencies section for some information and tips.
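
As a quick sanity check before loading the hg002 model, you could compare the installed scikit-learn version against the one used for training (v0.23.2, as stated above). This is a sketch only: `check_sklearn_version` is a hypothetical helper, and the conda environment name in the comment is illustrative.

```shell
# Version the hg002 model was trained with, per the comment above.
EXPECTED_SKLEARN="0.23.2"

# Compare an installed version string against the expected one.
# Prints "ok" on a match; prints a message and returns 1 otherwise.
check_sklearn_version() {
    installed="$1"
    if [ "$installed" = "$EXPECTED_SKLEARN" ]; then
        echo "ok"
    else
        echo "mismatch: expected $EXPECTED_SKLEARN, found $installed"
        return 1
    fi
}

# Example (not run here):
#   check_sklearn_version "$(python -c 'import sklearn; print(sklearn.__version__)')"
#
# One possible way to get matching versions (environment name is made up):
#   conda create -n nanosim-hg002 -c conda-forge python=3.7.10 scikit-learn=0.23.2
```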

dlaehnemann · 2024-10-31T10:31:57Z

I see that you recently added a new model for genomic DNA, trained with a recent version of nanosim in #224. Thanks already for this. Would it be possible to also get newly trained models for transcriptomic and metagenomic DNA?

The background of my request is that I am wrapping nanosim to make it easily usable in snakemake workflows. I have recently created such a wrapper, and for testing it relies on the pre-trained models. In combination with the scikit-learn version issue discussed in #162, this currently means that we need to use an older (bioconda) version of nanosim (3.1). See the respective discussion here:
snakemake/snakemake-wrappers#3165

lcoombe · 2024-10-31T16:11:44Z

Hi @dlaehnemann,

Sure, we can look at adding those updated pre-trained models! We'll likely use the same data as was used for the older models, but with the most recent NanoSim version.

Will keep you updated - but hopefully can have that for you in the next couple of weeks!
Lauren

dlaehnemann · 2024-11-01T10:15:58Z

Thanks, that would be great. One TODO to remove from the wrapper... 😅

lcoombe · 2024-11-05T23:36:07Z

Hi @dlaehnemann,
I have generated updated pre-trained models for transcriptome and metagenome modes (#238), and updated the README page accordingly!

SaberHQ added the feature request label Feb 22, 2024
