@@ -92,7 +92,7 @@ Options:
92
92
## Output description
93
93
94
94
- ` O_PREFIX_good.fq.gz ` : contains reads that passed all filters (possibly trimmed).
95
- - ` O_PREFIX_adap.fq.gz ` : contains discarded due to the presence of adapters.
95
+ - ` O_PREFIX_adap.fq.gz ` : contains reads discarded due to the presence of adapters.
96
96
- ` O_PREFIX_cont.fq.gz ` : contains contamination reads.
97
97
- ` O_PREFIX_lowQ.fq.gz ` : contains reads discarded due to low quality issues.
98
98
- ` O_PREFIX_NNNN.fq.gz ` : contains reads discarded due to the presence of *N*'s.
@@ -118,7 +118,64 @@ Options:
118
118
119
119
#### Adapters
120
120
121
- TODO TODO TODO TODO TODO TODO TODO TODO TODO TODO TODO TODO
121
+ Technical sequences within the reads are detected if the option
122
+ ` --adapters <ADAPTERS.fa>:<mismatches>:<score> ` is given. The
123
+ adapter sequence(s) are read from the fasta file, and the
124
+ search is done using a 'seed and extend' approach. It starts by looking for
125
+ 16-nucleotide-long seeds, for which a user-defined number of mismatches is
126
+ allowed (` mismatches ` ). If found, a score is computed. If the score is larger
127
+ than the user-defined threshold (` score ` ) and the number of matched
128
+ nucleotides exceeds 12, then the read is trimmed if the remaining part is
129
+ longer than ` MINL ` (user defined) and discarded otherwise. If no
130
+ 16-nucleotide-long seeds are found, we proceed with 8-nucleotide-long seeds
131
+ and apply the same criteria to trim/discard a read. A list of possible
132
+ situations follows, to illustrate how it works (` MINL=25 ` , ` mismatches=2 ` ):
133
+
134
+ ```
135
+ ADAPTER: CAAGCAGAAGACGGCATACGAG
136
+ REV_COM: AGATCGGAAGAGCTCGTATGCC
137
+
138
+ CASE1A: CACAGTCGATCAGCGAGCAGGCATTCATGCTGAGATCGGAAGAGATCGTATG
139
+ ||||||||||||X|||----
140
+ AGATCGGAAGAGCTCGTATG
141
+ - Seed: 16 Nucleotides
142
+ - Return: trimmed, TRIMA:0:31
143
+ CASE1B: CACATCATCGCTAGCTATCGATCGATCGATGCTATGCAAGATCGGAAGAGCT
144
+ ||||||||------
145
+ AGATCGGAAGAGCT
146
+ - Seed: 8 Nucleotides
147
+ - Return: trimmed, TRIMA:0:37
148
+ CASE1C: CACATCATCGCTAGCTATCGATCGATCGATGCTATGCACGAAGATCGGAAGA
149
+ ||||||||---
150
+ AGATCGGAAGA
151
+ - Seed: 8 Nucleotides
152
+ - Return: nothing done, reason: Match length < 12
153
+ CASE2A: CATACATCACGAGCTAGCTAGAGATCGGAAGAGCTCGTATGCCCAGCATCGA
154
+ ||||||||||||||||------
155
+ AGATCGGAAGAGCTCGTATGCC
156
+ - Seed: 16 Nucleotides
157
+ - Return: discarded, reason: remaining read too short.
158
+ CASE2B: CCACAGTACAATACATCACGAGCTAGCTAGAGATCGGAAGAGCTCGTATGCA
159
+ ||||||||||||||||||||||
160
+ AGATCGGAAGAGCTCGTATGCC
161
+ - Seed: 16 Nucleotides
162
+ - Return: trimmed, TRIMA:0:28
163
+ CASE3A: TATGCCGTCTTCTGCTTGCAGTGCATGCTGATGCATGCTGCATGCTAGCTGC
164
+ ||||||||||||||||--
165
+ TATGCCGTCTTCTGCTTG
166
+ - Seed: 16 Nucleotides
167
+ - Return: discarded, reason: remaining read too short
168
+ CASE3B: CGTCTTCTGCTTGCCGATCGATGCTAGCTACGATCGTCGAGCTAGCTACGTG
169
+ ||||||||-----
170
+ CGTCTTCTGCTTG
171
+ - Seed: 8 Nucleotides
172
+ - Return: discarded, reason: remaining read too short
173
+ CASE3C: TCTTCTGCTTGCCGATCGATGCTAGCTACGATCGTCGAGCTAGCTACGTGCG
174
+ ||||||||---
175
+ TCTTCTGCTTG
176
+ - Seed: 8 Nucleotides
177
+ - Return: nothing done, reason: Match length < 12
178
+ ```
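As a rough illustration of the decision logic described above, the following Python sketch searches for a 16-nucleotide seed, falls back to an 8-nucleotide seed, and then applies the match-length and `MINL` checks. It is not the trimFilter implementation. The function names are hypothetical, the score check is a caller-supplied placeholder (`score_ok`) because the score formula is not given here, and only adapters running towards the 3' end of the read are handled (cases 1 and 2). The strict comparisons follow the wording of the paragraph ("exceeds 12", "longer than `MINL`"), and the reads are synthetic poly-T sequences modelled on CASE1A, CASE1B, CASE1C and CASE2A.

```python
MIN_MATCHED = 12  # from the text: matched nucleotides must exceed 12


def count_mismatches(a, b):
    """Number of positions at which two sequences differ (up to the shorter one)."""
    return sum(x != y for x, y in zip(a, b))


def find_seed(read, adapter, seed_len, max_mismatches):
    """Leftmost read position where the first seed_len adapter bases
    match with at most max_mismatches mismatches, or None."""
    seed = adapter[:seed_len]
    if len(seed) < seed_len:
        return None
    for pos in range(len(read) - seed_len + 1):
        if count_mismatches(read[pos:pos + seed_len], seed) <= max_mismatches:
            return pos
    return None


def classify(read, adapter, max_mismatches, minl,
             score_ok=lambda matched, overlap: True):
    """Hypothetical trim/discard decision for a 3'-side adapter.

    score_ok is a placeholder for the real score threshold test,
    which is not specified in the text (assumption)."""
    for seed_len in (16, 8):  # try long seeds first, then short ones
        pos = find_seed(read, adapter, seed_len, max_mismatches)
        if pos is None:
            continue
        # "Extend": compare the adapter with the read from the seed start.
        overlap = min(len(adapter), len(read) - pos)
        matched = overlap - count_mismatches(read[pos:pos + overlap], adapter)
        if matched <= MIN_MATCHED or not score_ok(matched, overlap):
            return "nothing done"
        if pos > minl:  # remaining part is read[0:pos]
            return f"trimmed, TRIMA:0:{pos - 1}"
        return "discarded"
    return "nothing done"


adapter = "AGATCGGAAGAGCTCGTATGCC"  # REV_COM from the listing above

# One mismatch inside a 16-nt seed, adapter truncated at the read end.
print(classify("T" * 32 + "AGATCGGAAGAGATCGTATG", adapter, 2, 25))
# Only 14 adapter bases fit, so the 8-nt seed is used.
print(classify("T" * 38 + adapter[:14], adapter, 2, 25))
# Only 11 adapter bases fit: match too short.
print(classify("T" * 41 + adapter[:11], adapter, 2, 25))
# Full adapter at position 21: remaining read is shorter than MINL.
print(classify("T" * 21 + adapter + "T" * 9, adapter, 2, 25))
```

With `MINL=25` and `mismatches=2`, the four calls print `trimmed, TRIMA:0:31`, `trimmed, TRIMA:0:37`, `nothing done` and `discarded`, in that order.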
122
179
123
180
#### Impurities
124
181
@@ -307,17 +364,32 @@ IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
307
364
308
365
## Test/examples
309
366
310
- The examples in folder ` examples/trimFilter_SReport/ ` works in the following
367
+ The examples in folder ` examples/trimFilter_SReport/ ` work in the following
311
368
way:
312
369
313
- 1 . See folder ` fa_fq_files ` . The file ` EColi_rRNA.fq ` was created with
370
+ 1 . See folder ` examples/fa_fq_files ` . The file ` EColi_rRNA.fq ` was created with
314
371
` create_fq.sh ` and contains:
315
372
* 2e5 reads of length 50 from ` EColi_genome.fa ` with NO errors.
316
373
* 5e4 reads of length 50 from ` rRNA_modified.fa ` with NO errors
317
374
(rRNA contaminations).
318
375
* Artificially generated reads with low quality score (see ` create_fq.sh ` )
319
- * Artificially generated reads with Ns (see ` create_fq.sh ` ).
320
- 2 . ` run_example_TREE.sh ` : the code was tested with flags:
376
+ * Artificially generated reads with Ns (see ` create_fq.sh ` ).
377
+ * Adapter files: ` adapter_even_long.fa ` , ` adapter_odd_long.fa ` ,
378
+ ` adapter_even_short.fa ` , ` adapter_odd_short.fa ` . Fasta files containing
379
+ one adapter sequence each, longer/shorter than 16 nucleotides and with
380
+ an even/odd length.
381
+ * Example files to test the adapter contamination searches:
382
+ ` human_[even/odd]_wad_[even/odd]_[long/short].fq ` . Short fastq files where
383
+ adapter contaminations have been inserted in all possible ways:
384
+ even/odd positions, at the beginning/middle/end of the reads. Read
385
+ lengths are even or odd as the first suffix indicates. The adapter
386
+ contaminations included are indicated by the second even/odd suffix,
387
+ and the long/short suffix.
388
+ 2 . ` adapters/run_example.sh ` : runs examples of reads containing adapter
389
+ contaminations. A set of different possibilities is covered.
390
+ See the README file inside the folder ` adapters ` .
391
+
392
+ 3 . ` run_example_TREE.sh ` : the code was tested with flags:
321
393
```
322
394
$ ../../bin/trimFilter -l 50 --ifq PATH/TO/EColi_rRNA.fq.gz
323
395
--method TREE --ifa PATH/TO/rRNA_modified.fa:0.9:50
@@ -326,13 +398,13 @@ IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
326
398
i.e., we check for contaminations from rRNA, trim reads with lowQ at
327
399
the ends and less than 5% in the remaining part, and strip reads
328
400
containing N's. The output should coincide with the files ` example_TREE* `
329
- 3 . ` run_example_BLOOM.sh ` :
401
+ 4 . ` run_example_BLOOM.sh ` :
330
402
* bloom filter is generated for ` rRNA_modified.fa ` with FPR = 0.0075
331
403
and ` kmersize=25 ` . The output should coincide with ` rRNA_example.bf* ` .
332
404
* trimFilter is run as in 3. but passing a bloom filter to look for
333
405
contaminations with ` score=0.4 ` .
334
- 4 . ` run_example_SA.sh ` : TODO
335
- 5 . With this set up, it is possible to run further customized tests.
406
+ 5 . ` run_example_SA.sh ` : TODO
407
+ 6 . With this setup, it is possible to run further customized tests.
336
408
337
409
** NOTE:** ` rRNA_modified.fa ` is the ` rRNA_CRUnit.fa ` sequence, where we have
338
410
removed the lines containing N's for testing purposes.