Classifying transcripts as lncRNAs is difficult because there is no consensus on what a lncRNA truly is. Instead of identifying lncRNAs with fixed thresholds or rules, this tool predicts them with a stacking ensemble of 8 different gradient boosting models. The models were trained only on true, validated lncRNAs, and the method has been tested on plant transcriptomes with a high success rate.
See our publication at: Prediction of plant lncRNA by ensemble machine learning classifiers
To use this tool, clone this repository to your machine:
git clone https://github.com/gbgolding/crema.git
To use this tool you will need:
- python3
- CPAT v1.2.1
- python2
- DIAMOND
Before you can run the tool, you'll need to remove all rRNAs and tRNAs from your input data.
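How you identify rRNAs and tRNAs is up to you; this README does not prescribe a method. As a minimal sketch, assuming you already have a set of transcript IDs to exclude (for example, from your annotation), the filtering step could look like the following. The function names and the example IDs are hypothetical and are not part of this repository.

```python
from typing import Iterable, Iterator, Set, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line.strip() and header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def drop_transcripts(lines: Iterable[str], exclude_ids: Set[str]) -> str:
    """Return FASTA text without records whose ID is in exclude_ids.

    The ID is taken to be the first whitespace-delimited word of the header.
    """
    kept = []
    for header, seq in read_fasta(lines):
        fields = header.split()
        seq_id = fields[0] if fields else ""
        if seq_id not in exclude_ids:
            kept.append(f">{header}\n{seq}\n")
    return "".join(kept)


# Hypothetical input: three records, one of which is an rRNA to remove.
example = [
    ">tx1 some gene\n", "ATGC\n", "GGTA\n",
    ">rRNA_1\n", "CCCC\n",
    ">tx2\n", "TTTT\n",
]
filtered = drop_transcripts(example, {"rRNA_1"})
```

To apply this to a real file, pass an open file handle instead of the example list and write the returned text to a new FASTA file.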
Then, you will need to run cpat.py. An example:
cpat.py -g your_transcript_fasta_file.fa -o cpat_output.txt -x ./cpat_models/ath_hexamer -d ./cpat_models/ath_logit.RData
Next, create the DIAMOND database from the SwissProt protein database:
diamond makedb --in uniprot_sprot.fasta -d swissprot.dmnd
Run DIAMOND:
diamond blastx -d swissprot.dmnd -q your_transcript_fasta_file.fa -o diamond_output.txt \
-e 0.001 -k 5 --matrix BLOSUM62 --gapopen 11 --gapextend 1 --more-sensitive \
-f 6 qseqid pident length qframe qstart qend sstart send evalue bitscore
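The DIAMOND command above writes a tab-separated file whose ten columns follow the field list given to -f 6. If you want to inspect that file yourself, a minimal sketch for reading it is shown below. Keeping only the highest-bitscore hit per transcript is an assumption made for illustration; it is not necessarily how predict.py uses the file, and the function name is hypothetical.

```python
import csv
import io

# Column order copied from the -f 6 field list in the DIAMOND command.
DIAMOND_COLUMNS = [
    "qseqid", "pident", "length", "qframe", "qstart",
    "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle):
    """Return {qseqid: row dict}, keeping the highest-bitscore hit per query."""
    best = {}
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) != len(DIAMOND_COLUMNS):
            continue  # skip blank or malformed lines
        row = dict(zip(DIAMOND_COLUMNS, fields))
        row["pident"] = float(row["pident"])
        row["length"] = int(row["length"])
        row["bitscore"] = float(row["bitscore"])
        current = best.get(row["qseqid"])
        if current is None or row["bitscore"] > current["bitscore"]:
            best[row["qseqid"]] = row
    return best


# Made-up example rows, only to show the expected layout.
sample = io.StringIO(
    "tx1\t85.0\t120\t1\t1\t360\t10\t129\t1e-50\t200.5\n"
    "tx1\t60.0\t80\t2\t5\t244\t3\t82\t1e-10\t75.1\n"
    "tx2\t99.1\t50\t-1\t150\t1\t1\t50\t1e-20\t101.0\n"
)
hits = best_hits(sample)
```

For a real run, replace the sample with an open handle on diamond_output.txt.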
Once you have identified your transcript features using CPAT and DIAMOND, you can run the tool!
python3 bin/predict.py -f your_transcript_fasta_file.fa -c cpat_output.txt -d diamond_output.txt
Note: if the script cannot find logit_models.RData, please run predict.py using its full file path. This is a known issue that we are working on solving.
All output files are written to your working directory. Custom output directories to come...
The most helpful output file is final_ensemble_predictions.csv.
The CSV contains the features used in prediction as well as the lncRNA prediction score and the final decision.
The columns describe:
- gene name
- length of transcript
- ORF length
- GC%
- Fickett score (for more info see the CPAT paper)
- Hexamer score (for more info see the CPAT paper)
- % identity to a hit in the SwissProt database
- Alignment length of hit in SwissProt database
- Ratio of alignment length to transcript length
- Ratio of alignment length to ORF length
- Score of lncRNA prediction (you can use this to rank your predictions)
- Final decision of prediction: 1 == lncRNA
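Because the prediction score can be used to rank predictions, a short sketch of that ranking may help. It assumes the columns appear in the order listed above, so the gene name is the first field, the score is the second-to-last field, and the decision is the last field. Check this against your own file before relying on it. The function name and the header names in the example are hypothetical.

```python
import csv
import io


def rank_lncrnas(handle):
    """Return (name, score) pairs for predicted lncRNAs, highest score first."""
    ranked = []
    for fields in csv.reader(handle):
        if len(fields) < 3:
            continue  # skip blank or truncated lines
        try:
            score = float(fields[-2])
            decision = int(float(fields[-1]))
        except ValueError:
            continue  # header row or other non-numeric line
        if decision == 1:  # 1 == lncRNA
            ranked.append((fields[0], score))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Made-up rows with hypothetical header names, only to show the layout.
sample = io.StringIO(
    "gene,length,orf_length,gc,fickett,hexamer,pident,"
    "align_len,align_tx_ratio,align_orf_ratio,score,prediction\n"
    "tx1,1200,90,41.2,0.61,-0.12,0,0,0,0,0.93,1\n"
    "tx2,2500,1800,48.7,1.10,0.45,92.3,580,0.23,0.32,0.08,0\n"
    "tx3,800,120,39.5,0.70,-0.05,0,0,0,0,0.97,1\n"
)
ranked = rank_lncrnas(sample)
```

For a real run, pass an open handle on final_ensemble_predictions.csv instead of the sample.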
The other files may be less useful to you, depending on what you're looking at.
all_model_predictions.csv: how each base model predicted the transcript (1 == lncRNA).
all_model_scores.csv: the lncRNA prediction scores of each transcript for each base model.
ensemble_logreg_pred.csv: the raw output of the final logistic regression stacking classifier.