Commit a60014b

committed
adapters single end added
1 parent d5f3a39 commit a60014b

426 files changed: +7541 -1380 lines changed

CMakeLists.txt

+20-1
@@ -94,7 +94,7 @@ CONFIGURE_FILE(${CMAKE_SOURCE_DIR}/config.h.in
 ${CMAKE_SOURCE_DIR}/config.h [ESCAPE_QUOTES])
 
 # Set verbose
-set(CMAKE_VERBOSE_MAKE ON)
+#set(CMAKE_VERBOSE_MAKE ON)
 
 # Set compiler flags
 set(CMAKE_C_FLAGS "-Wall -O3 -march=native -std=c11")
@@ -132,6 +132,7 @@ add_executable(trimFilter ${PROJECT_SOURCE_DIR}/trimFilter.c
 ${PROJECT_SOURCE_DIR}/io_trimFilter.c
 ${PROJECT_SOURCE_DIR}/fa_read.c
 ${PROJECT_SOURCE_DIR}/fq_read.c
+${PROJECT_SOURCE_DIR}/adapters.c
 ${PROJECT_SOURCE_DIR}/tree.c
 ${PROJECT_SOURCE_DIR}/bloom.c
 ${PROJECT_SOURCE_DIR}/city.c
@@ -140,6 +141,24 @@ add_executable(trimFilter ${PROJECT_SOURCE_DIR}/trimFilter.c
 ${PROJECT_SOURCE_DIR}/Lmer.c
 ${PROJECT_SOURCE_DIR}/str_manip.c )
 
+
+add_executable(trimFilterDS ${PROJECT_SOURCE_DIR}/trimFilterDS.c
+${PROJECT_SOURCE_DIR}/init_trimFilter.c
+${PROJECT_SOURCE_DIR}/io_trimFilter.c
+${PROJECT_SOURCE_DIR}/fa_read.c
+${PROJECT_SOURCE_DIR}/ds_read.c
+${PROJECT_SOURCE_DIR}/fq_read.c
+${PROJECT_SOURCE_DIR}/tree.c
+${PROJECT_SOURCE_DIR}/bloom.c
+${PROJECT_SOURCE_DIR}/city.c
+${PROJECT_SOURCE_DIR}/adapters.c
+${PROJECT_SOURCE_DIR}/trim.c
+${PROJECT_SOURCE_DIR}/fopen_gen.c
+${PROJECT_SOURCE_DIR}/Lmer.c
+${PROJECT_SOURCE_DIR}/str_manip.c )
+
+
+
 # Set linker flags
 set(CMAKE_C_LINK_FLAGS "-lm " )
 add_executable(makeBloom ${PROJECT_SOURCE_DIR}/makeBloom.c

README_trimFilter.md

+81-9
@@ -92,7 +92,7 @@ Options:
 ## Output description
 
 - `O_PREFIX_good.fq.gz`: contains reads that passed all filters (maybe trimmed).
-- `O_PREFIX_adap.fq.gz`: contains discarded due to the presence of adapters.
+- `O_PREFIX_adap.fq.gz`: contains reads discarded due to the presence of adapters.
 - `O_PREFIX_cont.fq.gz`: contains contamination reads.
 - `O_PREFIX_lowQ.fq.gz`: contains reads discarded due to low quality issues.
 - `O_PREFIX_NNNN.fq.gz`: contains reads discarded due to *N*'s issues.
@@ -118,7 +118,64 @@ Options:
 
 #### Adapters
 
-TODO TODO TODO TODO TODO TODO TODO TODO TODO TODO TODO TODO
+Technical sequences within the reads are detected if the option
+`--adapters <ADAPTERS.fa>:<mismatches>:<score>` is given. The
+adapter sequence(s) are read from the fasta file, and the
+search is done using a 'seed and extend' approach. It starts by looking for
+16-nucleotide-long seeds, for which a user-defined number of mismatches is
+allowed (`mismatches`). If found, a score is computed. If the score is larger
+than the user-defined threshold (`score`) and the number of matched
+nucleotides exceeds 12, then the read is trimmed if the remaining part is
+longer than `MINL` (user-defined) and discarded otherwise. If no
+16-nucleotide-long seeds are found, we proceed with 8-nucleotide-long seeds
+and apply the same criteria to trim/discard a read. A list of possible
+situations follows, to illustrate how it works (`MINL=25`, `mismatches=2`):
+
+```
+ADAPTER: CAAGCAGAAGACGGCATACGAG
+REV_COM: AGATCGGAAGAGCTCGTATGCC
+
+CASE1A: CACAGTCGATCAGCGAGCAGGCATTCATGCTGAGATCGGAAGAGATCGTATG
+                                        ||||||||||||X|||----
+                                        AGATCGGAAGAGCTCGTATG
+- Seed: 16 Nucleotides
+- Return: trimmed, TRIMA:0:31
+CASE1B: CACATCATCGCTAGCTATCGATCGATCGATGCTATGCAAGATCGGAAGAGCT
+                                              ||||||||------
+                                              AGATCGGAAGAGCT
+- Seed: 8 Nucleotides
+- Return: trimmed, TRIMA:0:37
+CASE1C: CACATCATCGCTAGCTATCGATCGATCGATGCTATGCACGAAGATCGGAAGA
+                                                 ||||||||---
+                                                 AGATCGGAAGA
+- Seed: 8 Nucleotides
+- Return: nothing done, reason: Match length < 12
+CASE2A: CATACATCACGAGCTAGCTAGAGATCGGAAGAGCTCGTATGCCCAGCATCGA
+                             ||||||||||||||||------
+                             AGATCGGAAGAGCTCGTATGCC
+- Seed: 16 Nucleotides
+- Return: discarded, reason: remaining read too short.
+CASE2B: CCACAGTACAATACATCACGAGCTAGCTAGAGATCGGAAGAGCTCGTATGCA
+                                      ||||||||||||||||||||||
+                                      AGATCGGAAGAGCTCGTATGCC
+- Seed: 16 Nucleotides
+- Return: trimmed, TRIMA:0:28
+CASE3A: TATGCCGTCTTCTGCTTGCAGTGCATGCTGATGCATGCTGCATGCTAGCTGC
+        ||||||||||||||||--
+        TATGCCGTCTTCTGCTTG
+- Seed: 16 Nucleotides
+- Return: discarded, reason: remaining read too short
+CASE3B: CGTCTTCTGCTTGCCGATCGATGCTAGCTACGATCGTCGAGCTAGCTACGTG
+        ||||||||-----
+        CGTCTTCTGCTTG
+- Seed: 8 Nucleotides
+- Return: discarded, reason: remaining read too short
+CASE3C: TCTTCTGCTTGCCGATCGATGCTAGCTACGATCGTCGAGCTAGCTACGTGCG
+        ||||||||---
+        TCTTCTGCTTG
+- Seed: 8 Nucleotides
+- Return: nothing done, reason: Match length < 12
+```
 
 #### Impurities
 
@@ -307,17 +364,32 @@ IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
 
 ## Test/examples
 
-The examples in folder `examples/trimFilter_SReport/` works in the following
+The examples in folder `examples/trimFilter_SReport/` work in the following
 way:
 
-1. See folder `fa_fq_files`. The file `EColi_rRNA.fq` was created with
+1. See folder `examples/fa_fq_files`. The file `EColi_rRNA.fq` was created with
 `create_fq.sh` and contains:
 * 2e5 reads of length 50 from `EColi_genome.fa` with NO errors.
 * 5e4 reads of length 50 from `rRNA_modified.fa` with NO errors
 (rRNA contaminations).
 * Artificially generated reads with low quality score (see `create_fq.sh`)
-* Artificially generated reads with Ns (see `create_fq.sh`).
-2. `run_example_TREE.sh`: the code was tested with flags:
+* Artificially generated reads with Ns (see `create_fq.sh`).
+* Adapter files: `adapter_even_long.fa`, `adapter_odd_long.fa`,
+`adapter_even_short.fa`, `adapter_odd_short.fa`. Fasta files containing
+one adapter sequence each, longer/shorter than 16 nucleotides and with
+an even/odd length.
+* Example files to test the adapter contamination searches:
+`human_[even/odd]_wad_[even/odd]_[long/short].fq`. Short fastq files where
+adapter contaminations have been inserted in all possible ways:
+even/odd positions, at the beginning/middle/end of the reads. Read
+lengths are even or odd as the first suffix indicates. The adapter
+contaminations included are indicated by the second even/odd suffix
+and the long/short suffix.
+2. `adapters/run_example.sh`: runs examples of reads containing adapter
+contaminations. A set of different possibilities is covered.
+See the README file inside the folder `adapters`.
+
+3. `run_example_TREE.sh`: the code was tested with flags:
 ```
 $ ../../bin/trimFilter -l 50 --ifq PATH/TO/EColi_rRNA.fq.gz
 --method TREE --ifa PATH/TO/rRNA_modified.fa:0.9:50
@@ -326,13 +398,13 @@ IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
 i.e., we check for contaminations from rRNA, trim reads with lowQ at
 the ends and less than 5% in the remaining part, and strip reads
 containing N's. The output should coincide with the files `example_TREE*`
-3. `run_example_BLOOM.sh`:
+4. `run_example_BLOOM.sh`:
 * bloom filter is generated for `rRNA_modified.fa` with FPR = 0.0075
 and `kmersize=25`. The output should coincide with `rRNA_example.bf*`.
 * trimFilter is run like in 2. but passing a bloom filter to look for
 contaminations with `score=0.4`.
-4. `run_example_SA.sh`: TODO
-5. With this set up, it is possible to run further customized tests.
+5. `run_example_SA.sh`: TODO
+6. With this set up, it is possible to run further customized tests.
 
 **NOTE:** `rRNA_modified.fa` is the `rRNA_CRUnit.fa` sequence, where we have
 removed the lines containing N's for testing purposes.
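
The trim/discard rule described in the new "Adapters" section of `README_trimFilter.md` can be sketched in a few lines. This is only an illustration, not the code in `adapters.c`: the function names, the score values and the exact comparison operators are assumptions. In particular, the README text says the match must "exceed 12" nucleotides while the worked cases reject "Match length < 12"; the sketch follows the worked cases. The README does not say how the score is computed, so it is passed in as a plain number. The first helper reproduces the `REV_COM` line of the listing: it is the start of the reverse complement of the 34-nucleotide "Illumina Single End Adapter 2" added in this commit.

```python
# Illustrative sketch only; names and operators are assumptions, not the
# project's actual C implementation.

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

MIN_MATCH = 12  # assumed from "Match length < 12" in the README cases


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    return seq.translate(_COMPLEMENT)[::-1]


def adapter_decision(match_len: int, score: float, score_threshold: float,
                     remaining_len: int, minl: int) -> str:
    """Decide what happens to a read once a seed has been found and extended.

    match_len       -- number of nucleotides covered by the adapter match
    score           -- alignment score (how it is computed is not specified)
    score_threshold -- the <score> part of the --adapters option
    remaining_len   -- read length left after cutting the adapter away
    minl            -- minimum allowed read length (MINL)
    """
    if score <= score_threshold or match_len < MIN_MATCH:
        return "untouched"   # match not convincing enough, read is kept as is
    if remaining_len > minl:
        return "trimmed"     # enough read left after removing the adapter
    return "discarded"       # remaining read too short


if __name__ == "__main__":
    adapter = "CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT"
    # prints AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG
    print(reverse_complement(adapter))
    # Lengths taken from CASE1A, CASE2A and CASE1C; scores are placeholders.
    print(adapter_decision(20, 10.0, 5.0, 32, 25))
    print(adapter_decision(22, 10.0, 5.0, 21, 25))
    print(adapter_decision(11, 10.0, 5.0, 41, 25))
```

With `MINL=25`, a read that keeps 32 nucleotides after the adapter is removed is trimmed, one that keeps only 21 is discarded, and an 11-nucleotide match leaves the read untouched regardless of the score.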

doxygen_sqlite3.db

1.19 MB
Binary file not shown.

examples/bloomROC/create_ROC_png.R

+14
@@ -0,0 +1,14 @@
+# Generates ROC curves.
+tags <- c( "0p0075")
+for (FPR_text in tags) {
+  bloom <- read.csv(paste0("example_ROC_",FPR_text,"_bloom.csv"))
+  FPR = bloom[,4]
+  TPR = bloom[,2]
+  png(paste0("ROC_",FPR_text,"_bloom.png"))
+  plot(FPR,TPR, main="ROC curves with option -p 0.0075",
+       xlab="False positive rate",ylab="sensitivity",
+       xlim = c(min(FPR),max(FPR)),
+       ylim = c(min(TPR),max(TPR)),
+       type="o", col="blue")
+  dev.off()
+}
+15-15
@@ -1,17 +1,17 @@
 FN,TP,TN,FP
-0.050000,0.050470,0.949530,0.999420,0.000580
-0.060000,0.050670,0.949330,0.999600,0.000400
-0.070000,0.051010,0.948990,0.999690,0.000310
-0.080000,0.051850,0.948150,0.999820,0.000180
-0.090000,0.052440,0.947560,0.999850,0.000150
-0.100000,0.052960,0.947040,0.999870,0.000130
-0.110000,0.053850,0.946150,0.999880,0.000120
-0.120000,0.055700,0.944300,0.999950,0.000050
-0.130000,0.056650,0.943350,0.999950,0.000050
-0.140000,0.057770,0.942230,0.999950,0.000050
+0.050000,0.050420,0.949580,0.999530,0.000470
+0.060000,0.050640,0.949360,0.999650,0.000350
+0.070000,0.050950,0.949050,0.999710,0.000290
+0.080000,0.051880,0.948120,0.999830,0.000170
+0.090000,0.052460,0.947540,0.999860,0.000140
+0.100000,0.053030,0.946970,0.999870,0.000130
+0.110000,0.053820,0.946180,0.999880,0.000120
+0.120000,0.055580,0.944420,0.999930,0.000070
+0.130000,0.056590,0.943410,0.999950,0.000050
+0.140000,0.057740,0.942260,0.999950,0.000050
 0.150000,0.059110,0.940890,0.999960,0.000040
-0.160000,0.062000,0.938000,0.999990,0.000010
-0.170000,0.063510,0.936490,0.999990,0.000010
-0.180000,0.065400,0.934600,0.999990,0.000010
-0.190000,0.067290,0.932710,0.999990,0.000010
-0.200000,0.071730,0.928270,0.999990,0.000010
+0.160000,0.062080,0.937920,0.999990,0.000010
+0.170000,0.063800,0.936200,0.999990,0.000010
+0.180000,0.065520,0.934480,0.999990,0.000010
+0.190000,0.067360,0.932640,0.999990,0.000010
+0.200000,0.071640,0.928360,0.999990,0.000010
-25 Bytes
Binary file not shown.
+15-15
@@ -1,17 +1,17 @@
 FN,TP,TN,FP
-0.050000,0.050400,0.949600,0.999500,0.000500
-0.060000,0.050640,0.949360,0.999650,0.000350
-0.070000,0.050920,0.949080,0.999740,0.000260
-0.080000,0.051780,0.948220,0.999820,0.000180
-0.090000,0.052350,0.947650,0.999840,0.000160
-0.100000,0.052960,0.947040,0.999880,0.000120
-0.110000,0.053680,0.946320,0.999890,0.000110
-0.120000,0.055480,0.944520,0.999920,0.000080
-0.130000,0.056330,0.943670,0.999950,0.000050
+0.050000,0.050370,0.949630,0.999530,0.000470
+0.060000,0.050630,0.949370,0.999680,0.000320
+0.070000,0.050880,0.949120,0.999720,0.000280
+0.080000,0.051690,0.948310,0.999850,0.000150
+0.090000,0.052250,0.947750,0.999880,0.000120
+0.100000,0.052850,0.947150,0.999880,0.000120
+0.110000,0.053530,0.946470,0.999890,0.000110
+0.120000,0.055390,0.944610,0.999930,0.000070
+0.130000,0.056350,0.943650,0.999950,0.000050
 0.140000,0.057500,0.942500,0.999960,0.000040
-0.150000,0.058730,0.941270,0.999960,0.000040
-0.160000,0.061640,0.938360,0.999990,0.000010
-0.170000,0.063350,0.936650,0.999990,0.000010
-0.180000,0.064990,0.935010,0.999990,0.000010
-0.190000,0.066880,0.933120,0.999990,0.000010
-0.200000,0.071190,0.928810,0.999990,0.000010
+0.150000,0.058820,0.941180,0.999970,0.000030
+0.160000,0.061730,0.938270,0.999980,0.000020
+0.170000,0.063180,0.936820,0.999990,0.000010
+0.180000,0.064860,0.935140,0.999990,0.000010
+0.190000,0.066970,0.933030,0.999990,0.000010
+0.200000,0.071110,0.928890,0.999990,0.000010
-17 Bytes
Binary file not shown.
+16-16
@@ -1,17 +1,17 @@
 FN,TP,TN,FP
-0.050000,0.050290,0.949710,0.999290,0.000710
-0.060000,0.050500,0.949500,0.999580,0.000420
-0.070000,0.050810,0.949190,0.999670,0.000330
-0.080000,0.051600,0.948400,0.999800,0.000200
-0.090000,0.052120,0.947880,0.999830,0.000170
-0.100000,0.052730,0.947270,0.999880,0.000120
-0.110000,0.053480,0.946520,0.999880,0.000120
-0.120000,0.055190,0.944810,0.999910,0.000090
-0.130000,0.056210,0.943790,0.999940,0.000060
-0.140000,0.057160,0.942840,0.999950,0.000050
-0.150000,0.058430,0.941570,0.999970,0.000030
-0.160000,0.061380,0.938620,0.999980,0.000020
-0.170000,0.063010,0.936990,0.999980,0.000020
-0.180000,0.064640,0.935360,0.999980,0.000020
-0.190000,0.066360,0.933640,0.999980,0.000020
-0.200000,0.070740,0.929260,0.999990,0.000010
+0.050000,0.050330,0.949670,0.999340,0.000660
+0.060000,0.050520,0.949480,0.999580,0.000420
+0.070000,0.050810,0.949190,0.999680,0.000320
+0.080000,0.051580,0.948420,0.999800,0.000200
+0.090000,0.052140,0.947860,0.999830,0.000170
+0.100000,0.052700,0.947300,0.999870,0.000130
+0.110000,0.053340,0.946660,0.999890,0.000110
+0.120000,0.055150,0.944850,0.999930,0.000070
+0.130000,0.056110,0.943890,0.999940,0.000060
+0.140000,0.057160,0.942840,0.999960,0.000040
+0.150000,0.058550,0.941450,0.999970,0.000030
+0.160000,0.061200,0.938800,0.999990,0.000010
+0.170000,0.062930,0.937070,0.999990,0.000010
+0.180000,0.064730,0.935270,0.999990,0.000010
+0.190000,0.066600,0.933400,0.999990,0.000010
+0.200000,0.070600,0.929400,0.999990,0.000010
-14 Bytes
Binary file not shown.
+16-16
@@ -1,17 +1,17 @@
 FN,TP,TN,FP
-0.050000,0.049710,0.950290,0.990740,0.009260
-0.060000,0.050130,0.949870,0.997260,0.002740
-0.070000,0.050520,0.949480,0.999020,0.000980
-0.080000,0.051170,0.948830,0.999690,0.000310
-0.090000,0.051530,0.948470,0.999790,0.000210
-0.100000,0.052120,0.947880,0.999830,0.000170
-0.110000,0.052650,0.947350,0.999850,0.000150
-0.120000,0.054310,0.945690,0.999910,0.000090
-0.130000,0.055230,0.944770,0.999920,0.000080
-0.140000,0.056230,0.943770,0.999930,0.000070
-0.150000,0.057350,0.942650,0.999950,0.000050
-0.160000,0.059930,0.940070,0.999950,0.000050
-0.170000,0.061310,0.938690,0.999970,0.000030
-0.180000,0.062910,0.937090,0.999980,0.000020
-0.190000,0.064800,0.935200,0.999990,0.000010
-0.200000,0.068810,0.931190,0.999990,0.000010
+0.050000,0.049780,0.950220,0.990760,0.009240
+0.060000,0.050210,0.949790,0.996800,0.003200
+0.070000,0.050570,0.949430,0.999030,0.000970
+0.080000,0.051180,0.948820,0.999750,0.000250
+0.090000,0.051650,0.948350,0.999830,0.000170
+0.100000,0.052080,0.947920,0.999850,0.000150
+0.110000,0.052640,0.947360,0.999880,0.000120
+0.120000,0.054190,0.945810,0.999900,0.000100
+0.130000,0.055090,0.944910,0.999910,0.000090
+0.140000,0.056150,0.943850,0.999930,0.000070
+0.150000,0.057200,0.942800,0.999950,0.000050
+0.160000,0.059800,0.940200,0.999980,0.000020
+0.170000,0.061320,0.938680,0.999980,0.000020
+0.180000,0.062900,0.937100,0.999990,0.000010
+0.190000,0.064660,0.935340,0.999990,0.000010
+0.200000,0.068520,0.931480,0.999990,0.000010
-5 Bytes
Binary file not shown.
@@ -0,0 +1,2 @@
+>Illumina Single End Adapter 2
+CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT

@@ -0,0 +1,2 @@
+>Artificial adapter (trimmed from Illumina Single End Adapter 2)
+CAAGCAGAAGACGG

+2
@@ -0,0 +1,2 @@
+>Illumina Single End Adapter 1
+GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG

@@ -0,0 +1,3 @@
+>Artificial adapter (trimmed from Illumina Single End Adapter 1)
+GATCGGAAGAGCTCG
+