{sample}.fastq.gz have incorrect sequence identifier string #6

Closed
Nicolas-Fernandez opened this issue May 3, 2021 · 2 comments

Comments

@Nicolas-Fernandez

Dear Fedonin,

I ran VirGenA with the option "assembling using reference, without msa" on reads that were cleaned by alignment to a reference genome (Bowtie2 + Samtools). Some tools, such as fastqc, fastqscreen and DNAstar, accept the resulting fastq, but VirGenA fails with this error:

java.io.IOException: File {sample}.fastq.gz have incorrect sequence identifier string

Some parameters:

Mode:
Reference Selector: false
Use Major: true

Data:
Reads Insertion Length: 1000 nt

Computing:
Thread Number: -1 threads
Batch Size: 1000 reads

Assembling:
Reference: {my_ref}.fasta
MSA: {my_msa}.fasta
Minimum Read Length: 50 nt
Uclust Identity (%): 0.95
Minimum Contig Length: 1000 nt
Delta (%): 0.05

My fastq format (head) before and after cleaning:

BEFORE

@FS10001377:5:BPA73114-2327:1:1101:1140:1000 1:N:0:4
AACATTGGCCGTGACAGCTTGACAAATGTTAAAAACACTATTAGCATA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@FS10001377:5:BPA73114-2327:1:1101:1360:1000 1:N:0:4
GCACATCACTACGCAACTTTAGAGCACATCACTACGCAACTTTAGAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@FS10001377:5:BPA73114-2327:1:1101:2240:1000 1:N:0:4
GCTTATTGTTGGCGTTGCACTTCTTGCTGTTTTTCAG

AFTER

@FS10001377:5:BPA73114-2327:1:1101:1000:1260
GAGTTTAGTTCCCTTCCATCATATGCAGCTTTTGCTACTGTTCAAGAAGCTTATGAGCAGGCTGTTGCTAATGGTGATTCTGAAGTTGTTCTTAAAAAGTTGAAGAAGTCTTTGAA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@FS10001377:5:BPA73114-2327:1:1101:1000:1530
CTGCTTGCACTGATGACAATGCTTTAGCTTACTACAACACAACAAAGGGAGGTAGGTTTGTACTTTCACTGTTATCCGATTTACAGGATTTGAAATGGGCTAGATTCCCTAAGAGTGATGGAACTGGTACTATC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@FS10001377:5:BPA73114-2327:1:1101:1000:2010
GCCATTGTGTATTTAGTAAGACGTTGACGTGATATATGTGGTACCATGTCACCGTCTATTCTAAACTTAAAGAAGTCATGTTTAGCAACAGCTGGACAATCCTTAAGTAAATTATAAATTGTTTCTTCATGTTGGTAG

Is the problem that the last part of the header (1:N:0:4) is missing?
Or maybe something else?
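
To make the difference between the two header styles explicit, here is a minimal Python sketch that splits a header line into the read name and the comment that follows the first space. The function name `split_header` is hypothetical and is not part of VirGenA, Bowtie2 or Samtools; the two header strings are copied from the BEFORE and AFTER samples above.

```python
def split_header(line):
    """Split a FASTQ header line into (read name, comment after the first space)."""
    if not line.startswith("@"):
        raise ValueError("not a FASTQ header line: %r" % line)
    # partition() returns an empty comment when the line contains no space.
    name, _, comment = line[1:].rstrip("\n").partition(" ")
    return name, comment


before = "@FS10001377:5:BPA73114-2327:1:1101:1140:1000 1:N:0:4"
after = "@FS10001377:5:BPA73114-2327:1:1101:1000:1260"

print(split_header(before))
print(split_header(after))
```

The BEFORE header carries a comment field (`1:N:0:4`), while the AFTER header has an empty one, so the cleaned file no longer says which mate a read is.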

Thank you very much,
Nicolas

@Nicolas-Fernandez
Author

Update: the issue is gone when using the option samtools fastq -N ("Always add either '/1' or '/2' to the end of read names, even when put into different files"). The file.fastq.gz now contains the information about which read is R1 and which is R2, I guess. :)
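
If re-running samtools is not convenient, the same suffix could in principle be added afterwards. The following is only a sketch of that idea in Python, not something tested against VirGenA: the function name `add_mate_suffix` is hypothetical, and it assumes well-formed four-line FASTQ records (no wrapped sequence lines).

```python
import re

# A header already carries a suffix if it contains a space or a slash.
_HAS_SEPARATOR = re.compile(r"[ /]")


def add_mate_suffix(lines, mate):
    """Yield FASTQ lines, appending '/<mate>' to headers that lack ' ' or '/'.

    Assumes strict four-line records, so every fourth line (0, 4, 8, ...)
    is a header; sequence, '+' and quality lines pass through unchanged.
    """
    for index, line in enumerate(lines):
        line = line.rstrip("\n")
        if index % 4 == 0 and not _HAS_SEPARATOR.search(line):
            line = f"{line}/{mate}"
        yield line


record = [
    "@FS10001377:5:BPA73114-2327:1:1101:1000:1260",
    "GAGTTTAG",
    "+",
    "FFFFFFFF",
]
for fixed in add_mate_suffix(record, 1):
    print(fixed)
```

For a real file, the lines could be read with `gzip.open(path, "rt")` and written back the same way; the shortened sequence and quality strings in `record` are illustrative only.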

@gFedonin
Owner

gFedonin commented May 3, 2021

You are right! VirGenA searches for the first ' ' or '/' and then trims the ending to match read names in the R1 and R2 files. Bowtie + samtools seem to do the same, trimming the ending. Adding '/1' or '/2' or '1:N:0:4' back fixes this, as you already found yourself.
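
The trimming described above can be sketched roughly as follows. This is a Python illustration of the described behaviour, not VirGenA's actual Java code: the function name `pairing_key` is hypothetical, and raising an error when neither separator is found is an assumption based on the exception reported in this issue. The R2 header in the example is also made up for illustration.

```python
def pairing_key(header):
    """Return the read name up to the first ' ' or '/'.

    Assumption: a header without either separator is rejected, which would
    match the "incorrect sequence identifier string" error reported here.
    """
    name = header[1:] if header.startswith("@") else header
    # Positions of the first space and the first slash, ignoring misses (-1).
    cuts = [pos for pos in (name.find(" "), name.find("/")) if pos != -1]
    if not cuts:
        raise ValueError("incorrect sequence identifier string: " + header)
    return name[:min(cuts)]


r1 = "@FS10001377:5:BPA73114-2327:1:1101:1140:1000 1:N:0:4"
r2 = "@FS10001377:5:BPA73114-2327:1:1101:1140:1000 2:N:0:4"  # hypothetical mate
slash_style = "@FS10001377:5:BPA73114-2327:1:1101:1000:1260/1"
trimmed = "@FS10001377:5:BPA73114-2327:1:1101:1000:1260"

print(pairing_key(r1) == pairing_key(r2))
print(pairing_key(slash_style))
```

Calling `pairing_key(trimmed)` raises `ValueError` in this sketch, because the header cleaned by Bowtie + samtools has no ' ' or '/' left to trim at.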

@gFedonin gFedonin closed this as completed May 3, 2021