I have a question regarding the header notation of the simulated reads.
In the given example, for the "aligned" reads you say "92_12710_2 means that this read has 92-base head region (cannot be aligned), followed by 12710 bases of middle region, and then 2-base tail region.". Does this mean that the total length of the simulated read is 12,804bp (92 + 12710 + 2)?
This is what I thought, but when I compared the header info from NanoSim (first column in the example below) with the length of the sequence itself (second column in the example below), these numbers don't match (the length is always longer):
ENA|U00096|U00096.3-Escherichia-coli-str.-K-12-substr.-MG1655,-complete-genome.1213634_aligned_7_R9_8481_25 | 8563
ENA|U00096|U00096.3-Escherichia-coli-str.-K-12-substr.-MG1655,-complete-genome.989475_aligned_8_F31_6280_22 | 6406
ENA|U00096|U00096.3-Escherichia-coli-str.-K-12-substr.-MG1655,-complete-genome.2385960_aligned_9_R1_3551_22 | 3642
Can you please help me understand how I can calculate the length of the simulated reads based on the information provided in the header and/or the other model files?
Thank you,
Natasha
In your example (92_12710_2), 12710 is the length of the aligned region on the reference sequence, not its length in the read. Since indels are introduced into the sequence during simulation, the simulated read is not necessarily 92 + 12710 + 2 bases long. That's why the numbers don't match.
If you want the length of the simulated reads from the FASTA file, simply use len(sequence) - 1 in Python or another programming language. If you have to use only the header and model files, then use 12710 - deletions + insertions for the middle region, and add the head and tail lengths. You can find the deletions and insertions in the simulated error profile, where each introduced error is listed.
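To make the arithmetic concrete, here is a minimal Python sketch. It assumes the header layout seen in the examples in this thread (`..._aligned_<index>_<F|R><head>_<middle>_<tail>`). The function names are hypothetical and not part of NanoSim, and the inserted and deleted base counts are expected to be tallied by you from the simulated error profile.

```python
import re

# Assumed layout of the end of an aligned-read header, inferred from the
# examples in this thread: ..._aligned_<index>_<F|R><head>_<middle>_<tail>
HEADER_RE = re.compile(r"_aligned_(\d+)_([FR])(\d+)_(\d+)_(\d+)$")


def parse_header(header):
    """Return (head, middle, tail) parsed from an aligned-read header."""
    match = HEADER_RE.search(header.strip())
    if match is None:
        raise ValueError(f"unrecognised header: {header!r}")
    _index, _strand, head, middle, tail = match.groups()
    return int(head), int(middle), int(tail)


def expected_length(header, inserted, deleted):
    """Read length implied by the header plus indel counts.

    `inserted` and `deleted` are total base counts that you tally yourself
    from the simulated error profile (hypothetical inputs here).
    """
    head, middle, tail = parse_header(header)
    return head + (middle - deleted + inserted) + tail


def net_indel(header, observed_length):
    """Insertions minus deletions implied by an observed read length."""
    head, middle, tail = parse_header(header)
    return observed_length - (head + middle + tail)


# Headers and sequence lengths quoted in the question above.
PREFIX = ("ENA|U00096|U00096.3-Escherichia-coli-str.-K-12-substr.-MG1655,"
          "-complete-genome.")
examples = [
    (PREFIX + "1213634_aligned_7_R9_8481_25", 8563),
    (PREFIX + "989475_aligned_8_F31_6280_22", 6406),
    (PREFIX + "2385960_aligned_9_R1_3551_22", 3642),
]

for header, length in examples:
    print(parse_header(header), "net indel:", net_indel(header, length))
```

For the three reads in the question, the head, middle and tail values sum to less than the observed length, so each read carries more inserted than deleted bases, which is consistent with the sequence always being longer than the header suggests.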
Thank you so much for your detailed reply.
It was stupid of me to omit the "I-D" part in my calculations...
I did some testing with the equation you provided, and the numbers match.