Large-scale gene expression profiling has been widely used to characterize cellular states in response to various disease conditions, genetic perturbations and so on. Although the cost of whole-genome expression profiling has been dropping steadily, generating a compendium of expression profiling over thousands of samples is still very expensive. Recognizing that gene expressions are often highly correlated, researchers from the NIH LINCS program have developed a cost-effective strategy of profiling only ˜1,000 carefully selected landmark genes and relying on computational methods to infer the expression of remaining target genes. However, the computational approach adopted by the LINCS program is currently based on linear regression, limiting its accuracy since it does not capture complex nonlinear relationship between expression of genes.
We present a deep learning method (abbreviated as DGEX) to infer the expression of target genes from the expression of landmark genes. We used the microarray-based GEO dataset, consisting of 111K expression profiles, to train our model and compare its performance to those from other methods. In terms of mean absolute error averaged across all genes, deep learning significantly outperforms linear regression with 15.33% relative improvement. A gene-wise comparative analysis shows that deep learning achieves lower error than linear regression in 99.97% of the target genes. We also tested the performance of our learned model on an independent RNA-Seq-based GTEx dataset, which consists of 2,921 expression profiles. Deep learning still outperforms linear regression with 6.57% relative improvement, and achieves lower error in 81.31% of the target genes.
This code base provides all the necessary pieces to reproduce the main results of D-GEX. If you have any questions, please email [email protected]
-
Python (2.7). Python 2.7.6 is recommended.
-
Numpy(>=1.6.1).
-
Scipy(>=0.10).
-
Theano(0.7).
-
Scikit-learn(>=0.15.2).
-
l1ktools (v1.1). Please do not use later version as they changed the API.
The original data files are not provided within this codebase, as some of them require applying for access. Once you download all of them, please put them in this codebase.
The GEO microarray expression data can be accessed from lincscloud. You need to apply for the access. The original data downloaded is bgedv2_QNORM.gctx (data level 3 Q2NORM).
UPDATE1: It seems lincscloud has been migrated to NCH: http://www.lincscloud.org/download.html . And the data is publicly available now at GSE70138:GSE70138_Broad_LINCS_Level3_INF_mlr12k_n115209x22268_2015-12-31.gct.gz. However, the number of samples is slightly different (129158 vs 115209). Not sure if the current code base works fine for this new data. Will take a look in the furture.
UPDATE2: Since the data provided in UPDATE1 is slightly different than the original GEO version we used in our publiction, please email us ([email protected]) if you could like to download the original version.
The GTEx data we used in our paper is a preliminary version from the Broad institute before the official publication of GTEx, and is not publicly available. For those who are interested in the data, please email us ([email protected]) with your basic information through an academic institute email address, and we will provide you the private download link. The data you will download is GTEx_RNASeq_RPKM_n2921x55993.gctx.
The 1000 Genomes RNA-Seq expression data can be accessed from EMBL-EBI. The original data downloaded is GD462.GeneQuantRPKM.50FN.samplename.resk10.txt.
The predicted expression of L1000 data based on D-GEX can be downloaded at l1000_n1328098x22268.gctx. It consists of 1328098 expression profiles of 22268 genes. The first 978 genes are landmark genes that were directly measured by the L1000 platform. The other 21290 genes are target genes infered by D-GEX based on the GEO data. The expression profiles of each gene were standardized to mean 0 and standard deviation 1.
The whole preprocessing step should be done by run
$ ./preprocess.sh
Specifically, there are four steps.
- Removing duplicates by k-means:
kmeans.py
,nodup_idx.py
. - Coverting data into numpy format:
bgedv2.py
,GTEx.py
,1000G.py
. - Quantile normalization:
bgedv2_reqnorm.py
,GTEx_reqnorm.py
,1000G_reqnorm.py
. - Standardization:
bgedv2_norm.py
,GTEx_norm.py
,1000G_norm.py
.
Training D-GEX is done by run H1_0-4760.py
, H1_4760-9520.py
, H2_0-4760.py
, H2_4760-9520.py
, H3_0-4760.py
, H3_4760-9520.py
. Each stript trains half of the target genes (0-4760 or 4760-9520) with a certain architecture (1, 2 or 3 hidden layers).
A training example using 200 epoch, 0.75 include rate (0.25 dropout rate) and 1 hidden layer with 9000 hidden units in each hidden layer for 0-4760 target genes is by:
$ ./ H1_0-4760.py 9000_H1_0-4760_75 200 9000 0.75
In which, 9000_H1_0-4760_75 is the base name for all the output files.
Each training instance will output 7 files. For example, by running
$ ./ H1_0-4760.py 9000_H1_0-4760_75 200 9000 0.75
It outputs:
9000_H1_0-4760_75.log, the log file of the training instance.
9000_H1_0-4760_75_bestva_model.pkl, the model saved by best performance on Y_va (GEO microarray data).
9000_H1_0-4760_75_bestva_Y_va_hat.npy, the Y_va_hat predicted by best performance on Y_va (GEO microarray data).
9000_H1_0-4760_75_bestva_Y_te_hat.npy, the Y_te_hat predicted by best performance on Y_va (GEO microarray data).
9000_H1_0-4760_75_best1000G_model.pkl, the model saved by best performance on Y_1000G (1000G RNA-Seq data).
9000_H1_0-4760_75_best1000G_Y_1000G_hat.npy, the Y_1000G_hat predicted by best performance on Y_1000G (1000G RNA-Seq data).
9000_H1_0-4760_75_best1000G_Y_GTEx_hat.npy, the Y_GTEx_hat predicted by best performance on Y_1000G (1000G RNA-Seq data).
Gene expression inference with deep learning, 2016. Bioinformatics, bioRxiv.