BPAC: A universal model for prediction of transcription factor binding sites based on chromatin accessibility.
require numpy, pyBigWig package to be installed before BPAC installation.
python setup.py install
Generate read profile (in bigwig format). Require bedtools (https://github.com/arq5x/bedtools2) and ucsc tools (http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/).
- bedtools genomecov -ibam -bg -split -i sample.bam -g genome_file > sample_read.bg
- bedGraphToBigWig sample.bg genome_file sample_read.bw
Generate cut profile (in bigwig format). Require ucsc tools.
- cut_profile.py genome_bed_file sample.bam sample_cut.wig
- wigToBigWig sample_cut.wig genome_file sample_cut.bw
replace sample with your acutal filenames
genome_file is hg19.chrom.sizes for hg19, mm9.chrom.sizes for mm9.
genome_bed_file is hg19_sorted.bed for hg19, mm9_sorted.bed for mm9.
Usage: generateAttributes generateAttribute.py config_file output_file
config_file contains:
[setting]
bed_file=example.bed
read_bw_file=example.read.bw
cut_bw_file=example.cut.bw
bam_file=example.bam
TSSbedFile=example.TSS.bed
chromsizeFile=example.chrom.sizes
phastConsFile=example.phastCons.bw
ChIPpeakFile=example.ChIP.bed
bed_file is bed6 format with score representing PWM score, e.g., http://bioinfo.wilmer.jhu.edu/BPAC/data/example.bed6.
read_bw_file or cut_bw_file are generated through preprocessing section. If read_bw_file or cut_bw_file is not provided, bam_file is used instead, read counts and cut counts are then generated from
bam file instead of bw files. Bam file needs to be indexed by running samtools index bam_file. Total number of reads in the bam
file is stored in bam_file".reads" file. If it is not provided, it is counted from bam_file assuming the bam_file is from the
whole genome. bam_file processing may be very slow.
chromsizFile or TSSbedFile are provided within package for different genomes
phastConsFile can be downloaded from UCSC browser (in bigwig format):
http://hgdownload.cse.ucsc.edu/goldenpath/hg19/phastCons100way/hg19.100way.phastCons.bw for hg19
ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/phastCons30way/vertebrate/ for mm9
If ChIPpeakFile entry is empty, it will generate label "-1" as undecided and can be used for test. ChIPpeakFile is also in bed6 format. Output format from peaks calling software like MACS2 also works.
The first 3 columns of resulting feature file are chromosome start end
The command e.g.:
generateAttributes example.cfg output_feature.tsv
generateAttributes example_bam.cfg output_feature.tsv (bam file)
Usage: constructModel training_feature_file output_file classifier parameters_for_classifier
Generate model from feature file using Random Forest (RF) or Support Vector Machine (SVM)
classifier: RF (default) or SVM
parameters_for_classifier is in the format: param1=value1,param2=value2,param3=value3 . There are no spaces.
The command e.g.:
constructModel output_feature.tsv modelRF.pkl RF n_estimators=200
Usage: predictWithModel test_file model_file output_file
test_file is in feature file format generated by generateAttributes.
prediction output_file is bed4 format: chromosome start end probability_being_possitive
The command e.g.:
predictWithModel test_feature.tsv modelRF.pkl pred.txt
Usage: evalPrediction True Prediction positive_label
True is a column of labels, each row is for each sample
Prediction is a column of prediction probability, with
positive class probability 1, negative class prob. 0
The command e.g.:
evalPrediction <(cut -f15 test_feature.tsv) <(cut -f4 pred.txt) 1