Calculate length of simulated reads #22

npavlovikj · 2018-01-19T00:50:31Z

Hi,

I have a question regarding the header notation of the simulated reads.
In the given example, for the "aligned" reads you say "92_12710_2 means that this read has 92-base head region (cannot be aligned), followed by 12710 bases of middle region, and then 2-base tail region."
Does this mean that the total length of the simulated read is 12,804bp (92 + 12710 + 2)?

This is what I thought, but when I compared the header info from NanoSim (first column in the example below) with the length of the sequence itself (second column in the example below), these numbers don't match (the length is always longer):
ENA|U00096|U00096.3-Escherichia-coli-str.-K-12-substr.-MG1655,-complete-genome.1213634_aligned_7_R9_8481_25 | 8563
ENA|U00096|U00096.3-Escherichia-coli-str.-K-12-substr.-MG1655,-complete-genome.989475_aligned_8_F31_6280_22 | 6406
ENA|U00096|U00096.3-Escherichia-coli-str.-K-12-substr.-MG1655,-complete-genome.2385960_aligned_9_R1_3551_22 | 3642
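
The mismatch can be quantified with a minimal sketch like the one below. It parses the trailing fields of each header as written above and compares head + middle + tail with the observed sequence length. The regular expression and the function name `parse_header` are assumptions made for illustration and are not part of NanoSim; the separator after the strand letter is treated as optional because the headers quoted in this thread do not show one.

```python
import re

# Hypothetical pattern for the trailing header fields:
# _aligned_<index>_<strand><head>_<middle>_<tail>
# The underscore after the strand letter is optional (assumption).
HEADER_RE = re.compile(r"_(aligned|unaligned)_(\d+)_([FR])_?(\d+)_(\d+)_(\d+)$")

def parse_header(header):
    """Return (head, middle, tail) as integers, or raise ValueError."""
    match = HEADER_RE.search(header)
    if match is None:
        raise ValueError("unrecognized header: " + header)
    head, middle, tail = (int(match.group(i)) for i in (4, 5, 6))
    return head, middle, tail

# Headers and observed sequence lengths copied from the example above.
prefix = "ENA|U00096|U00096.3-Escherichia-coli-str.-K-12-substr.-MG1655,-complete-genome."
observed = {
    prefix + "1213634_aligned_7_R9_8481_25": 8563,
    prefix + "989475_aligned_8_F31_6280_22": 6406,
    prefix + "2385960_aligned_9_R1_3551_22": 3642,
}

differences = []
for header, length in observed.items():
    head, middle, tail = parse_header(header)
    naive = head + middle + tail
    differences.append(length - naive)
    print(naive, length, length - naive)
```

In all three cases the observed sequence is longer than the sum of the three header fields.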

Can you please help me understand how I can calculate the length of the simulated reads based on the information provided in the header and/or the other model files?

Thank you,
Natasha

cheny19 · 2018-01-22T18:28:32Z

Hi Natasha,

In your example (92_12710_2), 12710 should be the length of the reference sequence. Since indels are introduced to the sequence, the simulated read is not necessarily 12710 + 92 + 2. That's why the numbers don't match.

If you want the length of simulated reads from the FASTA file, simply use len(sequence) - 1 in Python or another programming language. If you have to use only the header and model files, then use 12710 - deletions + insertions. You can find the deletions and insertions in the simulated error profile, where each introduced error is listed.
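
Both approaches can be sketched roughly as follows. This is only an illustration under stated assumptions: `fasta_lengths` and `expected_length` are hypothetical helper names, the indel counts in the usage lines are made-up placeholders rather than values from a real error profile, and adding the head and tail regions to the total is an assumption based on the header description quoted in the question. Stripping the line ending before counting means no "- 1" correction is needed here.

```python
import io

def fasta_lengths(handle):
    """Map each FASTA record name to its sequence length.

    Works on any iterable of lines and handles multi-line sequences.
    """
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()  # drop the newline so it is not counted
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths

def expected_length(head, middle, tail, insertions, deletions):
    """Length implied by the header plus indel counts.

    middle is the reference length; insertions and deletions are the
    total numbers of inserted and deleted bases (assumed to be summed
    by the caller from the simulated error profile).
    """
    return head + (middle - deletions + insertions) + tail

# Tiny in-memory FASTA used only to show the call.
demo = io.StringIO(">read_1\nACGT\nAC\n>read_2\nGGG\n")
print(fasta_lengths(demo))

# Placeholder indel counts, not real data.
print(expected_length(92, 12710, 2, insertions=30, deletions=10))
```

For a real file, passing an open file object, for example `with open(path) as fh: fasta_lengths(fh)`, would work the same way.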

Thanks,
Chen

npavlovikj · 2018-01-23T19:24:19Z

Hi Chen,

Thank you so much for your detailed reply.
It was stupid of me to omit the "I-D" part in my calculations...
I did some testing with the equation you provided, and the numbers match.

Many thanks for the help!
Natasha

cheny19 closed this as completed Jan 24, 2018
