Simultaneous integrated analysis of biological datasets: an evaluation of O2PLS

August 31, 2016

The rapid progress in high-throughput technology has made it possible to measure biological processes at several levels: DNA markers (the genetic code), gene expression (the process of reading the code of a gene), proteomics (proteins are the products of genes and are needed for biological processes) and metabolomics (molecules that play a role in many different chemical reactions in the body). By combining data from the different levels, researchers aim to gain a deeper understanding of biological mechanisms. State-of-the-art methods, however, do not fully explore the joint nature of these data.

For illustration, we use data from 466 participants in the Finnish DILGOM study, in which two biological levels were measured: gene expression (6272 variables) and metabolites (137 variables). A straightforward approach is pairwise analysis: each combination of one metabolite and one gene expression is considered separately. However, there are many pairs (6272 × 137, more than 850k), and joint relationships (several genes related to multiple metabolites) might not be recovered. An integrative analysis of all measurements from all datasets (i.e. simultaneous data analysis, Fig. 1) is more likely to give insight across the datasets and hence into the underlying biological processes. We aim to find parts of the two datasets that are highly connected, and to find these parts we use the O2PLS method. O2PLS constructs the joint part of the two datasets; the remainder of each dataset consists of a data-specific informative part and noise. Thus we end up with joint, metabolite-specific and gene-specific information in the data. The aim of the paper is to infer how genes and metabolites are related while separating the related from the unrelated part.
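The size of the pairwise problem and the idea of a joint part can be sketched in a few lines of Python. This is only an illustration on simulated data, assuming NumPy is available. It takes the leading singular vectors of the cross-product matrix of the two centred datasets as joint directions, a simplified stand-in that skips the correction for data-specific variation that O2PLS performs. The function name `joint_directions` and all simulated quantities are hypothetical and do not come from the paper.

```python
import numpy as np

# Size of the pairwise approach for the dimensions quoted in the text.
n_genes, n_metabolites = 6272, 137
n_pairs = n_genes * n_metabolites
print(n_pairs)  # 859264

def joint_directions(X, Y, n_components=1):
    """Return weights for the columns of X and Y from the SVD of X^T Y.

    Simplified illustration only: no data-specific parts are removed.
    """
    Xc = X - X.mean(axis=0)  # centre each column
    Yc = Y - Y.mean(axis=0)
    U, _, Vt = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
    return U[:, :n_components], Vt[:n_components].T

# Simulated data (hypothetical): one shared latent variable drives
# the first 3 "genes" and the first 2 "metabolites".
rng = np.random.default_rng(0)
n_samples = 200
t = rng.normal(size=(n_samples, 1))
w_true = np.zeros((30, 1))
w_true[:3] = 1.0
c_true = np.zeros((10, 1))
c_true[:2] = 1.0
X = t @ w_true.T + 0.1 * rng.normal(size=(n_samples, 30))
Y = t @ c_true.T + 0.1 * rng.normal(size=(n_samples, 10))

W, C = joint_directions(X, Y)

# Variables with the largest absolute weights in the joint direction.
top_genes = sorted(int(i) for i in np.argsort(-np.abs(W[:, 0]))[:3])
top_metabolites = sorted(int(i) for i in np.argsort(-np.abs(C[:, 0]))[:2])
print(top_genes, top_metabolites)
```

A single decomposition of this kind looks at all variables at once, which is why it can pick up a group of genes related to a group of metabolites without testing each pair separately.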

Fig. 1. Graphical description of the complex relationships between several levels of biological processes.

The parts found by O2PLS are combinations of gene expressions and metabolites. Specifically, O2PLS simultaneously assigns a value to each gene and each metabolite indicating its importance to the joint or specific part. Large positive or negative values indicate a large contribution to the corresponding part. The most important genes and metabolites in the joint part can be further investigated to understand the relationship between gene expression and metabolite concentration. The specific parts may also be interpreted by looking at their top genes and metabolites. The amount of information in each part can be quantified by its variation relative to the total variation in the corresponding dataset.
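The two summaries described above, ranking variables by the size of their assigned value and expressing the variation of a part relative to the total, can be sketched as follows. This is a minimal illustration assuming NumPy. The helper names `top_variables` and `variation_fraction`, the variable names and the toy numbers are hypothetical, and variation is taken here as the sum of squares of the centred data, which is an assumption about how it is measured.

```python
import numpy as np

def top_variables(loadings, names, k):
    """Return the k variables with the largest absolute loading."""
    order = np.argsort(-np.abs(loadings))[:k]
    return [(names[i], float(loadings[i])) for i in order]

def variation_fraction(part, data):
    """Sum of squares of a part relative to that of the centred data."""
    centred = data - data.mean(axis=0)
    return float(np.sum(part ** 2) / np.sum(centred ** 2))

# Toy loadings for four hypothetical genes.
loadings = np.array([0.05, -0.80, 0.10, 0.58])
names = ["g1", "g2", "g3", "g4"]
top = top_variables(loadings, names, k=2)
print(top)

# Toy dataset with two columns (column means are already zero);
# the "part" keeps only the variation carried by the first column.
data = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
part = data.copy()
part[:, 1] = 0.0
fraction = variation_fraction(part, data)
print(fraction)  # 0.5
```

Ranking by absolute value matters because a large negative value signals as strong a contribution as a large positive one.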

Fig. 2. Visualisation of the results of the data analysis done with O2PLS. The datasets are decomposed into an overlapping part, a data-specific part and a remaining part. The percentages indicate the variation of each part relative to the total amount of variation.

We used the measurements of gene expression and of the abundance of several metabolites. For the metabolites, we found that 46% of the total information was in the joint part, while 12% was in the metabolite-specific part. For gene expression, we found that 1.3% of the total information was in the joint part, while 50% was in the gene-specific part. The O2PLS results are visualized in Fig. 2. These results confirm earlier findings based on pairwise analysis. In addition, we found further genes of interest for future research.

To conclude, O2PLS is a promising tool for summarizing information from two datasets. However, current biological studies also measure DNA markers, methylation and proteomics, in addition to gene expression and metabolites. These data are heterogeneous: each dataset represents a different layer of the biological mechanisms, and the datasets are generated by different measurement techniques. Because of this heterogeneity, it is highly important to model the data-specific information correctly. Ignoring it might lead to a failure to recover the joint relationships. To gain a better understanding, an integrated analysis of all available datasets should be performed. O2PLS can be the starting point for developing such methods.

Said el Bouhaddani¹, Jeanine Houwing-Duistermaat¹,², Geurt Jongbloed³, Hae-Won Uh¹
¹Dept of Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, The Netherlands
²Dept of Statistics, University of Leeds, Leeds, United Kingdom
³Dept of Applied Mathematics, Delft University of Technology, Delft, The Netherlands

Publication

Evaluation of O2PLS in Omics data integration.
Bouhaddani SE, Houwing-Duistermaat J, Salo P, Perola M, Jongbloed G, Uh HW
BMC Bioinformatics. 2016 Jan 20

big data, correlation, gene expression, holistic approach, integration, metabolites, O2PLS, statistics