A novel deep learning-based method for predicting RNA-protein interactions

April 13, 2017

RNA-binding proteins (RBPs) account for 5–10% of the eukaryotic proteome and regulate RNA localization and translation. Mutations in RBPs have also been associated with disease risk, for example FUS and TDP-43 in amyotrophic lateral sclerosis. Decoding the links between RNAs and proteins can therefore offer insight into the mechanisms behind these associations. Identifying ncRNA-protein interactions experimentally remains challenging and costly, so computational models are a useful complement. A method that accurately and automatically determines whether an RNA binds to a protein is urgently needed.

Fig. 1. Encoding RNA and protein sequences into vectors of k-mer frequencies. The 20 amino acids are grouped as follows: (Ala, Gly, Val), (Ile, Leu, Phe, Pro), (Tyr, Met, Thr, Ser), (His, Asn, Gln, Trp), (Arg, Lys), (Asp, Glu) and (Cys).

We developed IPMiner, a deep learning-based method that predicts RNA-protein interactions directly from sequences and can be applied to any RNA-protein pair. IPMiner proceeds in the following 4 steps:

In step 1 (Fig. 1), IPMiner encodes both RNA and protein sequences as simple k-mer features. For RNA sequences, we extract the frequency of each 4-mer, that is, the number of times the 4-mer appears in the sequence, which gives 4^4 = 256 features. For protein sequences, we first divide the 20 amino acids into 7 groups and then compute the frequency of each 3-mer over this reduced amino acid alphabet, which gives 7^3 = 343 features.
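The encoding in step 1 can be sketched in a few lines of Python. This is a minimal illustration, not the released IPMiner code: the function names, the use of raw counts, the conversion of T to U, and the rule of skipping k-mers that contain unknown symbols are all assumptions made for the example. The amino acid groups follow the caption of Fig. 1.

```python
from itertools import product

# Reduced amino acid alphabet from Fig. 1 (one-letter codes), 7 groups.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_GROUP = {aa: str(i) for i, members in enumerate(GROUPS) for aa in members}


def kmer_counts(seq, alphabet, k):
    """Count every possible k-mer over `alphabet` in `seq`.

    The result has len(alphabet) ** k entries, ordered by the order of
    the symbols in `alphabet`.
    """
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vec = [0] * len(index)
    for start in range(len(seq) - k + 1):
        pos = index.get(seq[start:start + k])
        if pos is not None:  # assumption: k-mers with unknown symbols are skipped
            vec[pos] += 1
    return vec


def encode_rna(seq, k=4):
    """4-mer counts over A, C, G, U (256 features for k=4)."""
    seq = seq.upper().replace("T", "U")  # assumption: accept DNA-style input
    return kmer_counts(seq, "ACGU", k)


def encode_protein(seq, k=3):
    """3-mer counts over the 7-letter reduced alphabet (343 features for k=3)."""
    reduced = "".join(AA_TO_GROUP.get(aa, "X") for aa in seq.upper())
    return kmer_counts(reduced, "0123456", k)
```

Calling encode_rna on an RNA string and encode_protein on a protein string yields two fixed-length vectors regardless of sequence length, which is what lets the later steps treat every RNA-protein pair uniformly.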

In step 2, we use a stacked autoencoder to further refine the representations of the raw k-mer features, separately for proteins and RNAs (Fig. 2). A stacked autoencoder consists of multiple neural network layers, each of which learns to reconstruct its input after a nonlinear transformation.
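To make the idea of step 2 concrete, here is a toy stacked autoencoder written with NumPy. It is a sketch under stated assumptions rather than the architecture used in IPMiner: the layer sizes, learning rate, sigmoid encoder with linear decoder, and greedy layer-by-layer training are choices made for the example, and the supervised fine-tuning mentioned in step 3 is not shown.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_autoencoder_layer(X, n_hidden, epochs=300, lr=0.5, seed=0):
    """Fit one autoencoder layer (sigmoid encoder, linear decoder) by
    full-batch gradient descent on the mean squared reconstruction error.

    Returns the encoder parameters (W, b) and the loss at each epoch.
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    W = rng.normal(0.0, 0.1, (n_features, n_hidden))  # encoder weights
    b = np.zeros(n_hidden)
    V = rng.normal(0.0, 0.1, (n_hidden, n_features))  # decoder weights
    c = np.zeros(n_features)
    losses = []
    for _ in range(epochs):
        H = sigmoid(X @ W + b)              # hidden code
        R = H @ V + c                       # reconstruction of the input
        err = R - X
        losses.append(float(np.mean(err ** 2)))
        g_R = 2.0 * err / err.size          # gradient of the loss w.r.t. R
        g_Z = (g_R @ V.T) * H * (1.0 - H)   # back through the sigmoid
        V -= lr * (H.T @ g_R)
        c -= lr * g_R.sum(axis=0)
        W -= lr * (X.T @ g_Z)
        b -= lr * g_Z.sum(axis=0)
    return W, b, losses


def stacked_encode(X, layer_sizes):
    """Train layers greedily: each layer learns to reconstruct the codes
    produced by the previous one. Returns the top-level codes."""
    rep = X
    for size in layer_sizes:
        W, b, _ = train_autoencoder_layer(rep, size)
        rep = sigmoid(rep @ W + b)
    return rep
```

In this sketch one stack would be trained on the RNA k-mer matrix and a separate one on the protein k-mer matrix, matching the "separately for proteins and RNAs" wording above.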

In step 3, the high-level features learned by the stacked autoencoders for proteins and RNAs are concatenated and fed into a random forest classifier, which predicts whether the RNA-protein pair interacts. To reduce the potential bias of a single classifier and improve accuracy, we also train 2 other random forest classifiers: one takes the raw k-mer frequency features without any post-processing as input, and the other takes the features abstracted by the unsupervised stacked autoencoder, without fine-tuning on labeled RNA-protein pairs. In total, we have 3 random forest classifiers with different input features that complement each other.
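The three-classifier setup of step 3 might look roughly like the following with scikit-learn. The view names, the number of trees, and the helper functions are hypothetical and chosen for illustration; how each feature matrix is actually produced (raw k-mers, unsupervised autoencoder codes, fine-tuned codes) is assumed to have happened beforehand.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def pair_features(rna_feats, protein_feats):
    """Concatenate RNA and protein feature rows for the same pairs."""
    return np.hstack([rna_feats, protein_feats])


def train_forests(views, labels, n_trees=50, seed=0):
    """Fit one random forest per feature view.

    `views` maps a view name (for example "raw_kmer", "sae_unsupervised",
    "sae_fine_tuned"; hypothetical names) to an (n_pairs, n_features) array.
    """
    return {
        name: RandomForestClassifier(n_estimators=n_trees, random_state=seed).fit(X, labels)
        for name, X in views.items()
    }


def positive_probabilities(models, views):
    """Return an (n_pairs, n_views) array; each column holds one forest's
    predicted probability that the pair interacts (label 1)."""
    return np.column_stack(
        [models[name].predict_proba(views[name])[:, 1] for name in models]
    )
```

The probability columns returned by positive_probabilities are the kind of per-classifier outputs that step 4 combines.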

In step 4, we integrate the outputs of these 3 classifiers using stacked ensembling: the outputs are fed into a logistic regression model, which learns a weight for each classifier. Unlike traditional majority voting, this approach automatically learns how much each classifier contributes to the final decision.
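The stacking step can be illustrated with a small logistic regression fitted by gradient descent on the classifiers' output probabilities. This is a sketch, not IPMiner's implementation: the function names, learning rate, and number of epochs are assumptions, and in practice the level-0 probabilities should come from held-out predictions so that the weights are not fitted on outputs the forests have already memorized.

```python
import numpy as np


def fit_stacking_weights(P, y, epochs=2000, lr=0.5):
    """Fit logistic regression on level-0 outputs.

    P: (n_pairs, n_classifiers) array of predicted interaction probabilities.
    y: (n_pairs,) array of 0/1 labels.
    Returns one weight per classifier and a bias term.
    """
    n, m = P.shape
    w = np.zeros(m)
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(P @ w + b)))
        grad = p - y                     # gradient of the log loss w.r.t. the logit
        w -= lr * (P.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def stacked_predict(P, w, b):
    """Combine the classifiers' probabilities into one final probability."""
    return 1.0 / (1.0 + np.exp(-(P @ w + b)))
```

Majority voting would give each classifier the same say; here a classifier whose output carries no information about the label ends up with less influence than an informative one, because its weight is learned from the data.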

Fig. 2. A stacked autoencoder is used to further refine the representations of the raw k-mer features, separately for proteins and RNAs. The refined features are then fed into a random forest to classify RNA-protein interactions.

Because IPMiner requires only sequences as input, it can be used to predict the probability of interaction for any pair of RNAs and proteins. Its efficacy has been demonstrated on multiple RNA-protein datasets. To better serve the academic community, easy-to-use standalone software has been released at http://www.csbio.sjtu.edu.cn/bioinf/IPMiner/ and https://github.com/xypan1232/IPMiner. To use IPMiner, users only need to prepare two FASTA files, one for RNAs and one for proteins; IPMiner then automatically calculates the interaction potential between every pair of RNAs and proteins in the two files.
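The two-file, all-pairs workflow described above can be pictured with the short sketch below. It does not reproduce the IPMiner command line or its internals: parse_fasta and score_all_pairs are hypothetical helpers, and score_fn stands in for whatever trained model assigns an interaction score to one RNA-protein pair.

```python
from itertools import product


def parse_fasta(lines):
    """Yield (identifier, sequence) tuples from an iterable of FASTA lines.

    The identifier is the first whitespace-separated token after '>'.
    """
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = (line[1:].split() or [""])[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def score_all_pairs(rna_records, protein_records, score_fn):
    """Apply score_fn(rna_seq, protein_seq) to every RNA-protein combination.

    Returns a dict keyed by (rna_id, protein_id).
    """
    return {
        (r_id, p_id): score_fn(r_seq, p_seq)
        for (r_id, r_seq), (p_id, p_seq) in product(rna_records, protein_records)
    }
```

With open file handles for the two FASTA files, list(parse_fasta(handle)) gives the records for each side, and the returned dictionary holds one score per RNA-protein pair, so m RNAs and n proteins produce m × n entries.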

Xiaoyong Pan¹, Hong-Bin Shen²
¹Department of Medical Informatics, Erasmus Medical Center, Rotterdam, The Netherlands
²Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and
Key Laboratory of System Control and Information Processing, Ministry of Education of China

Publication

IPMiner: hidden ncRNA-protein interaction sequential pattern mining with stacked autoencoder for accurate computational prediction.
Pan X, Fan YX, Yan J, Shen HB
BMC Genomics. 2016 Aug 9

bioinformatics, deep learning, RNA, RNA-protein
