Updating README/Docker scripts
Nam-phuong Nguyen committed Mar 25, 2019
1 parent 70dba24 commit 1afb8e0
Showing 3 changed files with 56 additions and 46 deletions.
70 changes: 31 additions & 39 deletions README.md
@@ -151,8 +151,38 @@ Finally, ViFi outputs several working files that can be deleted after a run. Th
7. \<prefix\>.fixed.trans.bam - A BAM file created by merging 6. and any human/viral paired end reads discovered by running the viral HMMs on 3.
8. \<prefix\>.fixed.trans.cs.bam - A coordinate sorted BAM file of 7.

## References
1. Nguyen ND, Deshpande V, Luebeck J, Mischel PS, Bafna V (2018) ViFi: accurate detection of viral integration and mRNA fusion reveals indiscriminate and unregulated transcription in proximal genomic regions in cervical cancer. Nucleic Acids Res (April):1–17.

# [Advanced Notes](#advanced_notes)

## Building evolutionary models

ViFi can be run with or without evolutionary models (i.e., the HMMs). We outline the steps for building the HMMs below. We also include a Docker pipeline that builds the HMMs automatically; the pipeline requires only that Docker be installed.

## Using Docker pipeline to build HMMs for use in ViFi
The following command will create HMMs from a set of unaligned sequences. The sequences are assumed to share a common viral ancestor (i.e., don't mix viral families together when running the pipeline).

```
bash $VIFI_DIR/scripts/build_references.sh <INPUT_SEQ> <OUTPUT_DIR> <PREFIX>
```

The OUTPUT_DIR folder will contain a set of HMMs (files with the suffix `.hmmbuild`) and a file listing the HMMs.
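
If the list file ever needs to be regenerated by hand, the following is a minimal Python sketch. It assumes the list is simply one `.hmmbuild` path per line, which is what the `ls $OUTPUT_DIR/*.hmmbuild > $OUTPUT_DIR/hmm_list.txt` step in `scripts/build_references.sh` suggests. The function name `write_hmm_list` is hypothetical and not part of ViFi.

```python
import glob
import os


def write_hmm_list(output_dir, list_name="hmm_list.txt"):
    """Write the paths of all *.hmmbuild files in output_dir, one per line.

    Assumption: the list file is a plain text file of paths, as produced
    by `ls $OUTPUT_DIR/*.hmmbuild > $OUTPUT_DIR/hmm_list.txt`.
    """
    # Sort so the order is deterministic, similar to the default `ls` order.
    hmm_paths = sorted(glob.glob(os.path.join(output_dir, "*.hmmbuild")))
    list_path = os.path.join(output_dir, list_name)
    with open(list_path, "w") as handle:
        for path in hmm_paths:
            handle.write(path + "\n")
    return hmm_paths
```

Calling `write_hmm_list("/path/to/OUTPUT_DIR")` rewrites `hmm_list.txt` in that folder and returns the paths it wrote.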

## Using customized reference

If you want to use a customized reference or a reference for a different organism, you can inform ViFi
of the reference sequences by supplying a chromosome file to ViFi using the **--chromosome_list**. The file
format is a single line that has the sequence names delimited by spaces. For example:

```
mouse_chr1 mouse_chr2
```

## Installation (Deprecated):
would inform ViFi that any sequences found in the BAM file that do not match mouse_chr1 or mouse_chr2 are
considered viral sequences.
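
The rule described above can be sketched in Python as follows. This illustrates the documented behavior only, not ViFi's internal code. The function names and the example sequence name `hpv16` are hypothetical.

```python
def read_chromosome_list(path):
    """Return host reference names from a single-line, space-delimited file."""
    with open(path) as handle:
        return handle.readline().split()


def split_host_and_viral(bam_sequence_names, host_names):
    """Partition sequence names into (host, viral).

    Any name that is not in the chromosome list is treated as viral,
    following the README description of --chromosome_list.
    """
    host = set(host_names)
    host_seqs = [name for name in bam_sequence_names if name in host]
    viral_seqs = [name for name in bam_sequence_names if name not in host]
    return host_seqs, viral_seqs
```

With a chromosome list of `mouse_chr1 mouse_chr2` and BAM sequence names `mouse_chr1`, `mouse_chr2`, and `hpv16`, only `hpv16` ends up in the viral group.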

## Installation from source code (Deprecated):
We provide instructions for installing ViFi on Linux below.

1. ViFi download (if you have not already cloned this source code):
@@ -221,41 +251,3 @@ Note that this version defaults to searching for **HPV**. To search for HBV, ru
```
python run_vifi.py -f <input_R1.fq.gz> -r <input_R2.fq.gz> -o <output_dir> -v hbv
```

## References
1. Nguyen ND, Deshpande V, Luebeck J, Mischel PS, Bafna V (2018) ViFi: accurate detection of viral integration and mRNA fusion reveals indiscriminate and unregulated transcription in proximal genomic regions in cervical cancer. Nucleic Acids Res (April):1–17.

# [Advanced Notes](#advanced_notes)

## Building evolutionary models

ViFi can be run with and without evolutionary models (i.e., the HMMs). The HMMs

## Building Alignment and Tree on viral family of interest

ViFi can build HMMs from any viral family if there is an existing FASTA alignment and NEWICK tree on the viral
sequences. Note that the sequences should be phylogenetically related to each other (i.e., do not mix HPV and HBV
sequences). Any standard alignment method and tree reconstruction method can be used. In our paper, we used [PASTA](https://github.com/smirarab/pasta) to construct our alignment and tree and provide the steps in doing this below.
Instructions on installing and running PASTA can be found [here](https://github.com/smirarab/pasta).

## Building HMMs

We created a script to allow easy creation of the HMMs used within ViFi for a viral family of interest. To

Requires:
## 1) Python 2.7
## 2) DendroPy version 4.0.0 or higher (https://github.com/jeetsukumaran/DendroPy):
sudo pip install dendropy

## Using customized reference

If you want to use a customized reference or a reference for a different organism, you can inform ViFi
of the reference sequences by supplying a chromosome file to ViFi using the **--chromosome_list**. The file
format is a single line that has the sequence names delimited by spaces. For example:

```
mouse_chr1 mouse_chr2
```

would inform ViFi that any sequences found in the BAM file that do not match mouse_chr1 or mouse_chr2 are
considered viral sequences.
7 changes: 4 additions & 3 deletions scripts/build_hmms.py
@@ -62,8 +62,8 @@ def build_hmms(tree_map, alignment_file, output_dir, prefix, keep_alignment):
parser.add_argument('--prefix', dest='prefix',
help="prefix used to name HMM files (default=viral)", metavar='PREFIX', default = 'viral',
action='store', type=str)
parser.add_argument('--max_size', dest='max_size',
help="Maximum subtree size to decompose (default=10)", metavar='MAX_SIZE', default = 10,
parser.add_argument('--max_size_fraction', dest='max_size',
help="Maximum fraction of total size to decompose (default=.10)", metavar='MAX_SIZE', default = 0.10,
action='store', type=int)
parser.add_argument('--keep_alignment', dest='keep_alignment',
help="Keep temporary ", default = False,
@@ -76,6 +76,7 @@ def build_hmms(tree_map, alignment_file, output_dir, prefix, keep_alignment):
tree = Tree.get(file=open(arg.tree_file, 'r'), schema="newick", preserve_underscores=True)
tree_map = {}
print "Decomposing Tree"
decompose_tree(tree, arg.max_size, tree_map = tree_map, decomposition=arg.decomposition)
max_size = max(10, int(arg.max_size*len(tree.leaf_nodes())))
decompose_tree(tree, max_size, tree_map = tree_map, decomposition=arg.decomposition)
print "Building HMMs"
build_hmms(tree_map, arg.alignment_file, arg.output_dir, arg.prefix, arg.keep_alignment)
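
Review note on the change above: the option now takes a fraction of the tree's leaf count, with a floor of 10, but the `type=int` context line is left unchanged, so a fractional value such as `0.25` passed on the command line would be rejected by `argparse`. Below is a minimal sketch of the apparent intent. Using `type=float` is an assumption about the intended behavior, not something this commit contains, and the helper names `resolve_max_size` and `build_parser` are hypothetical.

```python
import argparse


def resolve_max_size(fraction, num_leaves, floor=10):
    """Mirror max(10, int(arg.max_size * len(tree.leaf_nodes()))) from the diff."""
    return max(floor, int(fraction * num_leaves))


def build_parser():
    parser = argparse.ArgumentParser()
    # Assumption: float is used so that values such as 0.25 parse;
    # the diff keeps type=int next to the new fractional default.
    parser.add_argument('--max_size_fraction', dest='max_size',
                        metavar='MAX_SIZE', default=0.10, type=float,
                        help="Maximum fraction of total size to decompose (default=.10)")
    return parser
```

For example, `resolve_max_size(0.25, 400)` gives a maximum subtree size of 100, while `resolve_max_size(0.10, 50)` falls back to the floor of 10.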
25 changes: 21 additions & 4 deletions scripts/build_references.sh
@@ -3,14 +3,31 @@ INPUT_FASTA=$1
OUTPUT_DIR=$2
PREFIX=$3

#Build alignment/tree
run_pasta.py --max-mem-mb=4000 -d dna -j $PREFIX -i $INPUT_FASTA -o $OUTPUT_DIR
mkdir -p $OUTPUT_DIR
OUTPUT_DIR=`realpath $OUTPUT_DIR`
INPUT_DIR=`dirname $INPUT_FASTA`
if [ "$INPUT_DIR" == "." ]
then
INPUT_DIR=`pwd`
fi
INPUT_NAME=`basename $INPUT_FASTA`

#Pull latest version of PASTA
docker pull smirarab/pasta

#Build alignment/tree, use 4GB for alignment
docker run -v $INPUT_DIR/:/data/ -v $OUTPUT_DIR/:/output/ smirarab/pasta run_pasta.py --max-mem-mb=4000 -d dna -j $PREFIX -i $INPUT_NAME -o /output/

cd $OUTPUT_DIR

OUT_ALN=`grep "Writing resulting alignment" $PREFIX.out.txt | awk '{print $7}'`
OUT_ALN=`basename $OUT_ALN`
OUT_TREE=`grep "Writing resulting tree" $PREFIX.out.txt | awk '{print $7}'`

OUT_TREE=`basename $OUT_TREE`
#Decompose alignment/tree into HMMs
python $VIFI_DIR/scripts/build_hmms.py --tree_file $OUT_TREE --alignment_file $OUT_ALN --prefix $PREFIX --output_dir $OUTPUT_DIR
docker run -v $OUTPUT_DIR/:/output/ docker.io/namphuon/vifi python "scripts/build_hmms.py" --tree_file /output/$OUT_TREE --alignment_file /output/$OUT_ALN --prefix $PREFIX --output_dir /output/

#python $VIFI_DIR/scripts/build_hmms.py --tree_file $OUT_TREE --alignment_file $OUT_ALN --prefix $PREFIX --output_dir $OUTPUT_DIR

#Build HMM list
ls $OUTPUT_DIR/*.hmmbuild > $OUTPUT_DIR/hmm_list.txt
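
Review note: the `grep ... | awk '{print $7}'` steps depend on the exact wording of the PASTA log, so a change in the log format would silently produce empty file names. The same extraction can be sketched in Python with an explicit failure value. The function name `find_logged_path` is hypothetical, the sketch returns only the first match (the shell pipeline prints every match), and the log line used in the example below is an assumed format, not copied from PASTA output.

```python
import os


def find_logged_path(log_text, marker, field_index=6):
    """Return the basename of the 7th whitespace-separated field of the
    first log line containing marker, or None if no such line exists.

    field_index=6 corresponds to awk's $7 in build_references.sh.
    """
    for line in log_text.splitlines():
        if marker in line:
            fields = line.split()
            if len(fields) > field_index:
                return os.path.basename(fields[field_index])
    return None
```

Assuming a log line of the form `PASTA INFO: Writing resulting alignment to /output/viral.aln`, `find_logged_path(log_text, "Writing resulting alignment")` returns `viral.aln`. A `None` result indicates that the marker line was not found.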
