Skip to content
/ BPAC Public

Predict TFBS with sequence information, accessible chromatin information and ChIP-seq

Notifications You must be signed in to change notification settings

sliu2/BPAC

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

BPAC (TF Binding Prediction from ACcessibility data)

BPAC: A universal model for prediction of transcription factor binding sites based on chromatin accessibility.

Installation

require numpy, pyBigWig package to be installed before BPAC installation.

python setup.py install

Usage

Preprocess (required for offline prediction)

Generate read profile (in bigwig format). Require bedtools (https://github.com/arq5x/bedtools2) and ucsc tools (http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/).

  1. bedtools genomecov -ibam -bg -split -i sample.bam -g genome_file > sample_read.bg
  2. bedGraphToBigWig sample.bg genome_file sample_read.bw

Generate cut profile (in bigwig format). Require ucsc tools.

  1. cut_profile.py genome_bed_file sample.bam sample_cut.wig
  2. wigToBigWig sample_cut.wig genome_file sample_cut.bw
    replace sample with your acutal filenames
    genome_file is hg19.chrom.sizes for hg19, mm9.chrom.sizes for mm9.
    genome_bed_file is hg19_sorted.bed for hg19, mm9_sorted.bed for mm9.

Generate feature file

Usage: generateAttributes generateAttribute.py config_file output_file
config_file contains:
[setting]
bed_file=example.bed
read_bw_file=example.read.bw
cut_bw_file=example.cut.bw
bam_file=example.bam
TSSbedFile=example.TSS.bed
chromsizeFile=example.chrom.sizes
phastConsFile=example.phastCons.bw
ChIPpeakFile=example.ChIP.bed

bed_file is bed6 format with score representing PWM score, e.g., http://bioinfo.wilmer.jhu.edu/BPAC/data/example.bed6.

read_bw_file or cut_bw_file are generated through preprocessing section. If read_bw_file or cut_bw_file is not provided, bam_file is used instead, read counts and cut counts are then generated from
bam file instead of bw files. Bam file needs to be indexed by running samtools index bam_file. Total number of reads in the bam
file is stored in bam_file".reads" file. If it is not provided, it is counted from bam_file assuming the bam_file is from the
whole genome. bam_file processing may be very slow.
chromsizFile or TSSbedFile are provided within package for different genomes phastConsFile can be downloaded from UCSC browser (in bigwig format):
http://hgdownload.cse.ucsc.edu/goldenpath/hg19/phastCons100way/hg19.100way.phastCons.bw for hg19
ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/phastCons30way/vertebrate/ for mm9
If ChIPpeakFile entry is empty, it will generate label "-1" as undecided and can be used for test. ChIPpeakFile is also in bed6 format. Output format from peaks calling software like MACS2 also works.
The first 3 columns of resulting feature file are chromosome start end
The command e.g.:
generateAttributes example.cfg output_feature.tsv
generateAttributes example_bam.cfg output_feature.tsv (bam file)

Construct model

Usage: constructModel training_feature_file output_file classifier parameters_for_classifier
Generate model from feature file using Random Forest (RF) or Support Vector Machine (SVM)
classifier: RF (default) or SVM
parameters_for_classifier is in the format: param1=value1,param2=value2,param3=value3 . There are no spaces.
The command e.g.:
constructModel output_feature.tsv modelRF.pkl RF n_estimators=200

Prediction

Usage: predictWithModel test_file model_file output_file
test_file is in feature file format generated by generateAttributes.
prediction output_file is bed4 format: chromosome start end probability_being_possitive
The command e.g.:
predictWithModel test_feature.tsv modelRF.pkl pred.txt

Evaluation

Usage: evalPrediction True Prediction positive_label
True is a column of labels, each row is for each sample
Prediction is a column of prediction probability, with
positive class probability 1, negative class prob. 0
The command e.g.:
evalPrediction <(cut -f15 test_feature.tsv) <(cut -f4 pred.txt) 1

About

Predict TFBS with sequence information, accessible chromatin information and ChIP-seq

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages