Simultaneous integrated analysis of biological datasets: an evaluation of O2PLS
The rapid progress in high throughput technology made it possible to measure biological processes at several levels: DNA markers (genetic code), gene expression (which represents the process of reading the code of a gene), proteomics (proteins are the products of a gene and are needed for biological processes), metabolomics (molecules which play a role in many different chemic reactions in the body). By combining data from the different levels researchers aim to gain deeper understanding of biological mechanisms. State of the art methods, however, do not fully explore the joint nature of these data.
For illustration we have data from 466 participants from the Finnish DILGOM study. Here two biological aspects were measured: gene expression (6272) and metabolites (137). A straightforward approach to analyse the data is pairwise: all combinations of metabolites and gene expressions are considered at a time. However there are many pairs (more than 850k) and joint relationships (several genes related to multiple metabolites) might not be recovered. Integrative analysis of all measurements from all datasets (i.e. simultaneous data analysis, Fig. 1) are more likely to give an insight across the datasets and hence about the underlying biological processes. We aim to find parts of two datasets which are highly connected. To find these parts we use the O2PLS method. O2PLS constructs the joint part of the two datasets, and the remaining part consists of data-specific informative part and noise. Thus we end up with joint, metabolite-specific and gene-specific information in the data. Inferring how genes and metabolites are related, while separating the related from the unrelated part, is the aim of the paper.
The parts found by O2PLS are combinations of gene expressions and metabolites. Specifically, O2PLS simultaneously assigns a value to each gene and metabolite indicating its importance to the joint or specific part. Large positive or negative values indicate large contribution to the corresponding part. The most important genes and metabolites in the joint part can be further investigated to understand the relationship between gene expression and metabolite concentration. The specific parts may also be interpreted by looking at the top genes and metabolites. The amount of information of each part can be quantified by its variation relative to the total variation in the corresponding dataset.
We used the measurements on gene expression and the abundance of several metabolites. Regarding the metabolites we found that 46% of the total information was in the joint part, while 12% was in the metabolite-specific part. Regarding gene-expression we found that 1,3% of the total information was in the joint part, while 50% was in the gene-specific part. The O2PLS results are visualized in Fig. 2. These results confirm the former findings based on pair-wise analysis. In addition we found interesting other genes for future research.
To conclude, O2PLS is a promising tool for summarizing information from two datasets. However the current status in biology is DNA markers, methylation, proteomics, in addition to gene expression and metabolites. These data are heterogeneous: each dataset represents a different layer of the biological mechanisms and these data are generated by different measurement techniques. Due to this heterogeneity it is highly important to model the data-specific information correctly. Ignoring this might lead to failure of recovering the joint relationships. To gain better understanding, integrated analysis should be performed of available datasets. O2PLS can be the starting point for developing such methods.
Said el Bouhaddani 1, Jeanine Houwing-Duistermaat 1,2, Geurt Jongbloed 3, Hae-Won Uh 1
1Dept of Medical statistics and Bioinformatics, Leiden University Medical Center, Leiden, The Netherlands
2Dept of Statistics, University of Leeds, Leeds, United Kingdom
3Dept of Applied mathematics, Delft university of technology, Delft, The Netherlands
Evaluation of O2PLS in Omics data integration.
Bouhaddani SE, Houwing-Duistermaat J, Salo P, Perola M, Jongbloed G, Uh HW
BMC Bioinformatics. 2016 Jan 20
|Measuring small molecules for nutrition research? Metabolomics is the study of small molecules that are present in biological samples. These small molecules include amino acids, fatty acids, organic acids and sugars, and are referred to as…|
|Ultra Quick sample preparation prior to GC-MS based… Metabolomics is a comprehensive study of low molecular weight compounds in biological samples and has been applied to wide range of research areas from medical to food analysis. For the…|
|It’s Time for Multimodal Biometric Verification. Here’s why If you have ever watched one of the Mission: Impossible movies, then you’ve definitely seen biometric verification in action. A ton of characters in the franchise have used their eyes,…|
|AI Can’t Save Us From This Pandemic - But It Can… Much has been said about artificial intelligence (AI) and its vast potential to revolutionize life as we know it. But if we were to pinpoint a specific field that has…|
|Abiotic stress signaling in plants: functional… Basic plant science has never been more important than in the current scenario of climate change and increasing human population. To address the global challenges like increase in food production…|
|An integrated genomics approach for identifying… Breast cancer is the most common cancer in women with over 2 million new cases worldwide in 2018. Three kinds of receptors are usually found on the surface of breast…|