Classifying RNAs by Ensemble Machine learning Algorithms

Classifying transcripts as long non-coding RNAs (lncRNAs) is difficult because there is no consensus on what a lncRNA truly is. Instead of using thresholds or rules to identify lncRNAs, this tool uses an ensemble stacking method of 8 different gradient boosting models to predict lncRNAs. Trained only on true, validated lncRNAs, this method has been tested on plant transcriptomes with a high success rate.

See our publication at: Prediction of plant lncRNA by ensemble machine learning classifiers

Getting started

To use this tool, clone this repository to your machine:

git clone https://github.com/gbgolding/crema.git

Prerequisites

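To use this tool you will need, at minimum, the programs used in the example below: Python 3, CPAT (cpat.py), and DIAMOND. The repository also includes a requirements.txt file.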

Example

Before you can run the tool, you'll need to remove all rRNAs and tRNAs from your input data.
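The README does not say how to remove rRNAs and tRNAs. As a minimal sketch, assuming you already have a list of the transcript IDs to drop (for example from your annotation), the filtering step could look like this. The function name `filter_fasta` is hypothetical and not part of this repository.

```python
def filter_fasta(lines, exclude_ids):
    """Yield FASTA lines, skipping every record whose ID is in exclude_ids.

    Assumption: the record ID is the header text between '>' and the
    first whitespace character.
    """
    keep = True
    for line in lines:
        if line.startswith(">"):
            header = line[1:].split()
            record_id = header[0] if header else ""
            # Decide once per record; sequence lines inherit the decision.
            keep = record_id not in exclude_ids
        if keep:
            yield line
```

To apply it to files, open the input FASTA, read the IDs to exclude into a set, and write the yielded lines to a new file that you then pass to CPAT, DIAMOND, and predict.py.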

Then, run cpat.py. For example:

cpat.py -g your_transcript_fasta_file.fa -o cpat_output.txt -x ./cpat_models/ath_hexamer -d ./cpat_models/ath_logit.RData

Next, create the DIAMOND database from the SwissProt protein database:

diamond makedb --in uniprot_sprot.fasta -d swissprot.dmnd

Run DIAMOND:

diamond blastx -d swissprot.dmnd -q your_transcript_fasta_file.fa -o diamond_output.txt \
-e 0.001 -k 5 --matrix BLOSUM62 --gapopen 11 --gapextend 1 --more-sensitive \
-f 6 qseqid pident length qframe qstart qend sstart send evalue bitscore
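If you want to inspect the DIAMOND results before running the predictor, the tab-separated output can be read with the column order given in the -f 6 field list above. The following is a rough sketch, not what predict.py does internally; `best_hits` is a hypothetical helper, and choosing the best hit by bitscore is an assumption.

```python
import csv

# Column order taken from the -f 6 field list in the DIAMOND command.
DIAMOND_COLUMNS = [
    "qseqid", "pident", "length", "qframe", "qstart",
    "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(lines):
    """Return {qseqid: hit dict} keeping the highest-bitscore hit per query."""
    best = {}
    for fields in csv.reader(lines, delimiter="\t"):
        if len(fields) != len(DIAMOND_COLUMNS):
            continue  # skip blank or malformed lines
        hit = dict(zip(DIAMOND_COLUMNS, fields))
        hit["pident"] = float(hit["pident"])
        hit["length"] = int(hit["length"])
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best
```

Calling `best_hits(open("diamond_output.txt"))` would give one entry per transcript that had a SwissProt hit; transcripts with no hit are simply absent from the result.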

Once you have identified your transcript features using CPAT and DIAMOND, you can run the tool!

python3 bin/predict.py -f your_transcript_fasta_file.fa -c cpat_output.txt -d diamond_output.txt

Note: if the script cannot find logit_models.RData, please run predict.py using its full file path. This is a known issue that we are working on solving.

Output files

All output files are written to your working directory. Support for custom output directories is planned.

The most helpful output file is final_ensemble_predictions.csv. It contains the features used in prediction as well as the lncRNA prediction score and final decision for each transcript.

The columns describe:

gene name
length of transcript
ORF length
GC%
Fickett score (for more info see the CPAT paper)
Hexamer score (for more info see the CPAT paper)
% identity to a hit in the SwissProt database
Alignment length of hit in SwissProt database
Ratio of alignment length to transcript length
Ratio of alignment length to ORF length
Score of lncRNA prediction (you can use this to rank your predictions)
Final decision of prediction: 1 == lncRNA
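Since the prediction score can be used to rank predictions, here is a small sketch of doing that. It assumes the CSV columns appear in exactly the order listed above (gene name first, score second to last, decision last) and that the file starts with a header row; neither is confirmed here, so check your file and adjust the indices if, for instance, there is an extra index column. `rank_lncrnas` is a hypothetical helper.

```python
import csv

# Assumed zero-based positions, following the column order listed above.
NAME_COL, SCORE_COL, DECISION_COL = 0, 10, 11


def rank_lncrnas(lines, has_header=True):
    """Return (name, score) pairs for predicted lncRNAs, highest score first."""
    reader = csv.reader(lines)
    if has_header:
        next(reader, None)  # skip the header row if present
    ranked = []
    for row in reader:
        if len(row) <= DECISION_COL:
            continue  # skip blank or short rows
        if float(row[DECISION_COL]) == 1:  # 1 == lncRNA
            ranked.append((row[NAME_COL], float(row[SCORE_COL])))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked
```

For example, `rank_lncrnas(open("final_ensemble_predictions.csv"))` would return the predicted lncRNAs ordered from most to least confident under those assumptions.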

The other files may be less useful to you, depending on what you're looking at.

all_model_predictions.csv: how each base model predicted the transcript (1 == lncRNA).
all_model_scores.csv: the lncRNA prediction scores of each transcript for each base model.
ensemble_logreg_pred.csv: the raw output of the final logistic regression stacking classifier.
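To illustrate how a logistic regression stacking classifier can combine base-model outputs into one score, here is a generic sketch. It is not the trained model shipped with this tool: the function name, the weights, and the bias are all hypothetical, and whether the real classifier takes base scores or base predictions as input is an assumption.

```python
import math


def stack_scores(base_scores, weights, bias):
    """Combine base-model scores with a logistic function.

    base_scores: one score per base model (8 for this tool's ensemble).
    weights, bias: hypothetical coefficients; the real ones would come
    from the trained logistic regression model.
    """
    if len(base_scores) != len(weights):
        raise ValueError("need exactly one weight per base model")
    z = bias + sum(w * s for w, s in zip(weights, base_scores))
    return 1.0 / (1.0 + math.exp(-z))
```

With positive weights, a transcript that most base models score highly ends up with a stacked score near 1, while one they score low ends up near 0.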
