Calculate length of simulated reads #22

Closed
npavlovikj opened this issue Jan 19, 2018 · 2 comments

Comments

@npavlovikj

Hi,

I have a question regarding the header notation of the simulated reads.
In the given example, for the "aligned" reads you say "92_12710_2 means that this read has 92-base head region (cannot be aligned), followed by 12710 bases of middle region, and then 2-base tail region."
Does this mean that the total length of the simulated read is 12,804bp (92 + 12710 + 2)?

This is what I thought, but when I compared the header info from NanoSim (first column in the example below) with the length of the sequence itself (second column in the example below), these numbers don't match (the length is always longer):
ENA|U00096|U00096.3-Escherichia-coli-str.-K-12-substr.-MG1655,-complete-genome.1213634_aligned_7_R9_8481_25 | 8563
ENA|U00096|U00096.3-Escherichia-coli-str.-K-12-substr.-MG1655,-complete-genome.989475_aligned_8_F31_6280_22 | 6406
ENA|U00096|U00096.3-Escherichia-coli-str.-K-12-substr.-MG1655,-complete-genome.2385960_aligned_9_R1_3551_22 | 3642

Can you please help me understand how I can calculate the length of the simulated reads based on the information provided in the header and/or the other model files?

Thank you,
Natasha
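
For readers following along, the comparison above can be reproduced with a short sketch. It assumes the header ends in `_<aligned|unaligned>_<index>_<strand><head>_<middle>_<tail>`, with an optional underscore between the strand letter and the head length; the function names `parse_header` and `nominal_length` are made up for this illustration and are not part of NanoSim.

```python
import re

# Assumed layout of the trailing header fields (not taken from NanoSim's code):
#   ..._<aligned|unaligned>_<index>_<strand><head>_<middle>_<tail>
# The underscore between strand and head is treated as optional.
HEADER_RE = re.compile(
    r"_(?P<kind>aligned|unaligned)_(?P<index>\d+)_(?P<strand>[FR])_?"
    r"(?P<head>\d+)_(?P<middle>\d+)_(?P<tail>\d+)$"
)


def parse_header(header):
    """Return the head, middle and tail lengths encoded in a read header."""
    match = HEADER_RE.search(header.strip())
    if match is None:
        raise ValueError("unrecognised read header: " + header)
    return {
        "kind": match.group("kind"),
        "strand": match.group("strand"),
        "head": int(match.group("head")),
        "middle": int(match.group("middle")),
        "tail": int(match.group("tail")),
    }


def nominal_length(header):
    """Sum head + middle + tail, ignoring any insertions or deletions."""
    fields = parse_header(header)
    return fields["head"] + fields["middle"] + fields["tail"]


# The three reads quoted in the question, paired with their observed lengths.
prefix = "ENA|U00096|U00096.3-Escherichia-coli-str.-K-12-substr.-MG1655,-complete-genome."
reads = [
    (prefix + "1213634_aligned_7_R9_8481_25", 8563),
    (prefix + "989475_aligned_8_F31_6280_22", 6406),
    (prefix + "2385960_aligned_9_R1_3551_22", 3642),
]
for header, observed in reads:
    nominal = nominal_length(header)
    print(nominal, observed, observed - nominal)
```

For these three reads the header sum falls short of the observed length by 48, 73 and 68 bases respectively, which is the mismatch being asked about.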

@cheny19
Collaborator

cheny19 commented Jan 22, 2018

Hi Natasha,

In your example (92_12710_2), 12710 should be the length of the reference sequence. Since indels are introduced to the sequence, the simulated read is not necessarily 12710 + 92 + 2. That's why the numbers don't match.

If you want the length of the simulated reads from the FASTA file, simply use len(sequence) - 1 in Python or another programming language. If you have to use only the header and model files, then use 12710 - deletions + insertions. You can find the deletions and insertions in the simulated error profile, where each introduced error is listed.

Thanks,
Chen
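
Both approaches from this reply can be sketched in a few lines. This is only an illustration under stated assumptions: `fasta_lengths` and `expected_length` are hypothetical helper names, `insertions` and `deletions` are taken to be the total numbers of inserted and deleted bases in the middle region, and the head and tail are assumed to be added to the read unchanged. The `- 1` in the reply presumably accounts for a trailing newline on the sequence line; stripping whitespace before counting avoids needing it.

```python
def fasta_lengths(lines):
    """Yield (header, sequence_length) for each record in FASTA-formatted lines."""
    header, length = None, 0
    for line in lines:
        line = line.strip()  # drops the trailing newline, so no "- 1" is needed
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, length
            header, length = line[1:], 0
        else:
            length += len(line)  # sequences may be wrapped over several lines
    if header is not None:
        yield header, length


def expected_length(head, middle, tail, insertions, deletions):
    """Read length from header fields plus indel totals for the middle region.

    Assumes head and tail are appended as-is and that only the middle
    region gains inserted bases and loses deleted bases.
    """
    return head + (middle - deletions + insertions) + tail


# Toy numbers for illustration only (not taken from a real NanoSim run).
print(expected_length(head=2, middle=10, tail=3, insertions=4, deletions=1))
```

To measure a real file, something like `with open("simulated_reads.fasta") as handle: lengths = dict(fasta_lengths(handle))` should work, where the file name is a placeholder.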

@npavlovikj
Author

Hi Chen,

Thank you so much for your detailed reply.
It was stupid of me to omit the "I-D" part in my calculations...
I did some testing with the equation you provided, and the numbers match.

Many thanks for the help!
Natasha

@cheny19 cheny19 closed this as completed Jan 24, 2018