2.CNN_RNN_sequence_analysis

To use this repository, first download the data here and decompress the files in this folder.

This example shows how to combine a CNN and an RNN to predict the function of non-coding DNA sequences, using the data from DeepSEA. As described in DeepSEA and DanQ, the human GRCh37 reference genome was segmented into non-overlapping 200-bp bins. Each model input is a 1000-bp DNA sequence centered on one of these 200-bp bins. The labels were generated by collecting profiles from the ENCODE and Roadmap Epigenomics data releases, which yields a 919-dimensional binary vector for each sequence (690 transcription factor binding profiles, 125 DNase I–hypersensitive profiles and 104 histone-mark profiles). To turn the DNA sequence string into a numerical form that can be fed to the model, we use one-hot encoding. For DNA sequences, the specific motifs matter, and so do the interactions between upstream and downstream motifs, which also play an important role in determining sequence function. We therefore combine a CNN and an RNN by stacking a bi-directional LSTM layer on top of 1D convolutional layers. The original implementation of DanQ requires Theano, which has been discontinued, so we reimplemented the idea using only Keras.
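The input preparation and architecture described above can be sketched roughly as follows. This is a minimal illustration, not the code in this repository: the helper names (`window_for_bin`, `one_hot_encode`, `build_model`), the A/C/G/T column order, the handling of unknown bases, and all layer sizes are assumptions chosen for the example. Only the 200-bp bin, the 1000-bp window, the 919 labels, and the "bi-directional LSTM on top of 1D convolutions" layout come from the text.

```python
import numpy as np

BIN_SIZE = 200      # width of each labelled bin (from the text)
WINDOW_SIZE = 1000  # width of each model input (from the text)
N_LABELS = 919      # 690 TF + 125 DNase I + 104 histone-mark profiles

# Column order A, C, G, T is an assumption; the real data may use another order.
_BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def window_for_bin(bin_index):
    """Return (start, end) of the 1000-bp window centered on a 200-bp bin.

    Coordinates are 0-based, end-exclusive. Bins near a chromosome start
    give a negative start and would need to be skipped or padded.
    """
    bin_start = bin_index * BIN_SIZE
    flank = (WINDOW_SIZE - BIN_SIZE) // 2  # 400 bp on each side of the bin
    return bin_start - flank, bin_start + BIN_SIZE + flank


def one_hot_encode(sequence):
    """Encode a DNA string as a (length, 4) float32 matrix.

    Bases other than A/C/G/T (for example N) are left as all-zero rows,
    which is an assumption of this sketch.
    """
    encoded = np.zeros((len(sequence), 4), dtype=np.float32)
    for position, base in enumerate(sequence.upper()):
        column = _BASE_INDEX.get(base)
        if column is not None:
            encoded[position, column] = 1.0
    return encoded


def build_model():
    """Build a CNN + bi-directional LSTM classifier with Keras.

    Layer sizes are illustrative placeholders, not this repository's settings.
    Keras is imported here so the encoding helpers work without it installed.
    """
    from tensorflow import keras
    from tensorflow.keras import layers

    return keras.Sequential([
        keras.Input(shape=(WINDOW_SIZE, 4)),
        layers.Conv1D(320, 26, activation="relu"),       # motif detectors
        layers.MaxPooling1D(pool_size=13, strides=13),
        layers.Dropout(0.2),
        layers.Bidirectional(layers.LSTM(320, return_sequences=True)),
        layers.Dropout(0.5),
        layers.Flatten(),
        layers.Dense(925, activation="relu"),
        layers.Dense(N_LABELS, activation="sigmoid"),    # one output per profile
    ])
```

Because each of the 919 outputs is an independent yes/no label, a model like this would typically be compiled with a binary cross-entropy loss, for example `build_model().compile(optimizer="adam", loss="binary_crossentropy")`.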

Reference: