LeapFrog
========
A set of tools for the genomic localization of repetitive elements (or
of their flanking regions), based on read-pair information.
The analysis steps and data sets are:
* (1) input fastq
The paired-end fastq files generated from the organism that you are
interested in. This MUST be paired end!
* (2) reference fasta
A fasta file containing the genome reference sequence.
* (3) element_database
The database is a multi-fasta file of the elements that are to be
located. The software expects the sequence headers to have the
following format: `>NAME#FAMILY`
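Because the header format is a hard requirement, it can help to check the element database before building anything from it. The following is a minimal sketch, not part of LeapFrog: the function names are hypothetical, and it assumes that any free text after the first whitespace in a header can be ignored.

```python
def parse_element_header(header_line):
    """Split a '>NAME#FAMILY' fasta header into (name, family)."""
    if not header_line.startswith(">"):
        raise ValueError(f"not a fasta header: {header_line!r}")
    # Assumption: anything after the first whitespace is a free-text description.
    parts = header_line[1:].split()
    token = parts[0] if parts else ""
    # Split at the first '#'; both sides must be non-empty.
    name, sep, family = token.partition("#")
    if not sep or not name or not family:
        raise ValueError(f"header is not in NAME#FAMILY form: {header_line!r}")
    return name, family


def check_element_database(lines):
    """Return {name: family} for every header in an iterable of fasta lines."""
    families = {}
    for line in lines:
        if line.startswith(">"):
            name, family = parse_element_header(line.rstrip("\n"))
            families[name] = family
    return families
```

Passing an open file handle, for example `check_element_database(open("elements.fa"))` with a hypothetical file name, raises a `ValueError` on the first header that does not match.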
* (4) bowtie2db for the reference fasta
bowtie2 database based on (2)
* (5) bowtie2db for the element database
bowtie2 database based on (3)
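Steps (4) and (5) both come down to one `bowtie2-build` call per fasta file. As a rough sketch, the commands can be assembled as follows; the file names and index prefixes are placeholders, not names that LeapFrog requires.

```python
import shlex


def bowtie2_build_command(fasta_path, index_prefix):
    """Return the argument list for `bowtie2-build <fasta> <index_prefix>`."""
    return ["bowtie2-build", fasta_path, index_prefix]


# Placeholder paths: substitute your own reference and element files.
commands = [
    bowtie2_build_command("reference.fa", "reference_db"),  # step (4)
    bowtie2_build_command("elements.fa", "element_db"),     # step (5)
]

# Print the commands so they can be inspected before anything is run.
for cmd in commands:
    print(shlex.join(cmd))
```

To actually build the indexes, each list can be handed to `subprocess.run(cmd, check=True)`, provided `bowtie2-build` is on the `PATH`.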
* (6) get danglers
The LeapFrog script `lf_danglers` will run bowtie2 in the background
and output a properly renamed fastq file containing the
"danglers". A "dangler" is a read that does not map to the element
database (3), but its paired-end mate does!
                          ____
                         /    \
                    =====      ===== <- dangler
    +========================+
    | A sequence from the    |
    | element database (3/5) |
    +========================+
Check how to run the script using the `-h` parameter. This script
takes as input the element bowtie2 database (5) and the input fastq
(1).
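To make the definition of a dangler concrete, here is a minimal sketch that picks danglers out of a paired-end SAM alignment against the element database by looking at the standard SAM FLAG bits. It is not how `lf_danglers` is implemented (use `-h` for the real interface), and the function names are hypothetical.

```python
# Standard SAM FLAG bits.
READ_PAIRED = 0x1
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def is_dangler(flag):
    """True if the read is paired, unmapped itself, and its mate is mapped."""
    return (
        bool(flag & READ_PAIRED)
        and bool(flag & READ_UNMAPPED)
        and not (flag & MATE_UNMAPPED)
    )


def danglers_from_sam(sam_lines):
    """Yield (read_name, rname, sequence, quality) for each dangler.

    Aligners commonly set RNAME of an unmapped read to its mate's
    reference, so rname is usually the element the mate hit.
    """
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if is_dangler(int(fields[1])):
            yield fields[0], fields[2], fields[9], fields[10]
```

A read with FLAG 69 (paired, unmapped, first in pair) is a dangler under this test; a read with FLAG 77 (both mates unmapped) is not.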
* (7) map the danglers to the reference genome
Run a regular bowtie2 job that maps the dangler sequences (6) against
the reference genome (2/4).
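A sketch of what such a job could look like is given below. It assumes the danglers are mapped as unpaired reads (`-U`) and that `samtools` is used to produce the sorted BAM file needed in step (8); all file names are placeholders.

```python
import shlex


def bowtie2_map_command(index_prefix, fastq_path, sam_path, threads=1):
    """Argument list for mapping unpaired reads with bowtie2."""
    return [
        "bowtie2",
        "-p", str(threads),  # number of alignment threads
        "-x", index_prefix,  # bowtie2 index of the reference genome (4)
        "-U", fastq_path,    # unpaired dangler reads from step (6)
        "-S", sam_path,      # SAM output file
    ]


def samtools_sort_command(sam_path, bam_path):
    """Argument list for converting the SAM output to a sorted BAM file."""
    return ["samtools", "sort", "-o", bam_path, sam_path]


# Placeholder file names: substitute your own.
map_cmd = bowtie2_map_command("reference_db", "danglers.fastq", "danglers.sam", threads=4)
sort_cmd = samtools_sort_command("danglers.sam", "danglers.bam")
print(shlex.join(map_cmd))
print(shlex.join(sort_cmd))
```

As before, the lists can be passed to `subprocess.run(cmd, check=True)` once the printed commands look right.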
* (8) extract PFRs from the BAM alignment from (7)
Use the script `lf_regionify` for this. The script needs to be executed
for each genome/sample separately. The output is a GFF file identifying
each PFR separately. The script splits PFRs based on family and
orientation and tries to unmerge peaks that are close together. A
score is assigned to each PFR.
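The idea of splitting by family and orientation can be illustrated with a deliberately simplified sketch. This is not the algorithm used by `lf_regionify`: the `max_gap` parameter, the read-count score, the `family=` attribute and the `PFR` feature name are all assumptions made for the example, and no attempt is made to unmerge nearby peaks.

```python
from collections import defaultdict


def regionify(hits, max_gap=500):
    """Group hits by (chrom, family, strand) and merge nearby ones.

    hits: iterable of (chrom, start, end, strand, family), 1-based inclusive.
    Returns (chrom, start, end, strand, family, read_count) tuples.
    """
    groups = defaultdict(list)
    for chrom, start, end, strand, family in hits:
        groups[(chrom, family, strand)].append((start, end))

    regions = []
    for (chrom, family, strand), spans in sorted(groups.items()):
        spans.sort()
        cur_start, cur_end = spans[0]
        count = 1
        for start, end in spans[1:]:
            if start - cur_end <= max_gap:
                # Close enough: extend the current region.
                cur_end = max(cur_end, end)
                count += 1
            else:
                regions.append((chrom, cur_start, cur_end, strand, family, count))
                cur_start, cur_end, count = start, end, 1
        regions.append((chrom, cur_start, cur_end, strand, family, count))
    return regions


def to_gff_line(region, source="sketch", feature="PFR"):
    """Format a region as one tab-separated, nine-column GFF line."""
    chrom, start, end, strand, family, score = region
    return "\t".join(
        [chrom, source, feature, str(start), str(end), str(score), strand, ".", f"family={family}"]
    )
```

Hits on opposite strands, or hits from different families, never end up in the same region here, even when they overlap.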
* (9) compare PFRs between genomes
This script (`lf_findiff`) is still very experimental. It takes a
number of input GFF PFR files, determines which PFRs overlap, and
then reports presence/absence information.
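The comparison can be pictured as clustering overlapping regions across samples and recording which samples contribute to each cluster. The sketch below only illustrates that idea and is not the `lf_findiff` algorithm. For brevity it ignores family and orientation, and the function name and input layout are assumptions.

```python
def presence_absence(samples):
    """Cluster overlapping regions and report which samples have each one.

    samples: {sample_name: [(chrom, start, end), ...]}
    Returns a list of (chrom, start, end, {sample_name: bool}).
    """
    # Tag every region with its sample and sort by position.
    tagged = sorted(
        (chrom, start, end, name)
        for name, regions in samples.items()
        for chrom, start, end in regions
    )

    clusters = []
    for chrom, start, end, name in tagged:
        last = clusters[-1] if clusters else None
        if last and last["chrom"] == chrom and start <= last["end"]:
            # Overlaps the current cluster: extend it and note the sample.
            last["end"] = max(last["end"], end)
            last["samples"].add(name)
        else:
            clusters.append({"chrom": chrom, "start": start, "end": end, "samples": {name}})

    names = sorted(samples)
    return [
        (c["chrom"], c["start"], c["end"], {n: n in c["samples"] for n in names})
        for c in clusters
    ]
```

A cluster where one sample maps to `False` marks a region seen in the other samples but not in that one.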