- Setup1: Leave-responses-out (LRO) cross validation
- Setup2: Leave-cell lines-out (LCO) cross validation (Unseen cell lines with 1/5th cell lines for each cancer type hold out during testing)
- Below, we provide detailed instructions and commands on how to reproduce the provided processed and cleaned data files from source files.
- Each setup requires different type of data split as discussed below.
- CCLE: Download CCLE expression data from here and save it as
CCLE/CCLE_expression.csv
. - Combined: Download the
combined_rnaseq_data
file with gene expression data for cell lines renamed for CTRPv2 from here and save it inCombined/
.
- To create the processed data files, first download the source datasets as mentioned below.
- For CTRPv2 with adjusted AUCs, we download the following files from here:
- For drug information:
drug_info
- For adjusted AUCs:
combined_single_response_agg
- For drug information:
- For PRISM, we download the following files from here
- For drug information and AUCs:
secondary-screen-dose-response-curve-parameters.csv
- For drug information and AUCs:
- Run the commands provided in
commands_ctrpv2.log
andcommands_prism.log
to clean the files within respective directories for the following steps.
mkdir -p ctrpv2/LRO/
python scripts/create_cv.py --data_dir /content/DrugRank/data/Combined/ --save_dir ctrpv2/LRO/
The above script creates the list of AUCs in aucs.txt
and splits in splits.txt
. Run this command similarly for PRISM.
python scripts/create_setup_LRO.py --data_dir ctrpv2/LRO/ --save_dir ctrpv2/LRO/pletorg/ --genexp_file Combined/combined_rnaseq_data_combat
- The above script does the following:
- Creates the full cell line - drug sensitivity matrix and saves as
auc_matrix.txt
. M x N matrix - Saves the list of cell IDs as
cells.txt
. # cells = M - Saves the list of drug IDs as
drugs.txt
. # drugs = N - Note that the order of cells across rows and the order of drugs across columns must be consistent
with the respective ordering in the files
cells.txt
anddrugs.txt
.
- Creates the full cell line - drug sensitivity matrix and saves as
- Rerun the script accordingly for PRISM.
python scripts/lasso.py --data_dir ctrpv2/LRO/ --l1_ratio 0.5
python scripts/lasso.py --data_dir prism/LRO/ --l1_ratio 0.1
- This runs Elastic Net on the drug response matrix to identify informative genes for each cell line, following the standard setup in pLETORg.
- To compute cosine similarities:
python scripts/cell_similarity.py --data_dir ctrpv2/LRO/ --genes ctrpv2/LRO/genes.txt --sim cos
python scripts/cell_similarity.py --data_dir prism/LRO/ --genes prism/LRO/genes.txt --sim cos
- To compute simliarities using RBF kernel:
python scripts/cell_similarity.py --data_dir ctrpv2/LRO/ --genes ctrpv2/LRO/genes.txt --sim rbf
python scripts/cell_similarity.py --data_dir prism/LRO/ --genes prism/LRO/genes.txt --sim rbf
python scripts/create_setup_LCO.py --data_dir ctrpv2/LRO/ --save_dir ctrpv2/LCO/pletorg/
python scripts/create_setup_LCO.py --data_dir prism/LRO/ --save_dir prism/LCO/pletorg/
- Creates the auc matrices separately for training, val and test in each fold
- Saves the list of cell IDs as
*_cells.txt
and the union of cells in../cells.txt
, and corresponding pairwise similarity matrix - Note that the order of cells across rows is consistent with the respective ordering in the similarity matrix file.