
Project

Machine Learning for Genomic Data Fusion

It has been shown that while a single genomic data source might not be sufficiently informative, fusing several complementary genomic data sources delivers more accurate predictions. In this regard, genomic data fusion has garnered much interest across biological research communities. Consequently, finding efficient and effective techniques for fusing heterogeneous biological data sources has gained growing attention over the past few years.

Kernel methods, in particular, are an interesting class of techniques for data fusion. We investigate using the geometric mean of kernel matrices, instead of the arithmetic mean, for kernel data fusion. Although computing geometric means of matrices is challenging, it opens an intriguing research direction in data fusion. We apply geometric kernel fusion to protein fold recognition, protein subnuclear localization, and gene prioritization.
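For reference, the two notions being contrasted can be written out for the simplest case of two symmetric positive definite kernel matrices. The symbols K_1 and K_2 are notation assumed here, and the abstract does not state which definition of the geometric mean the project uses when more than two matrices are fused.

```latex
% Arithmetic mean of two kernel matrices
\[
  \tfrac{1}{2}\,(K_1 + K_2)
\]
% Geometric mean of two symmetric positive definite matrices
\[
  K_1 \,\#\, K_2 \;=\; K_1^{1/2}\left(K_1^{-1/2}\, K_2\, K_1^{-1/2}\right)^{1/2} K_1^{1/2}
\]
```

When K_1 and K_2 commute, this reduces to (K_1 K_2)^{1/2}, the direct analogue of the scalar geometric mean. Kernel matrices generally do not commute, which is one reason the computation is harder than simple averaging.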

Our kernel data fusion frameworks offer a significant improvement over multiple kernel learning approaches proposed for protein fold recognition. Furthermore, our kernel-based protein fold recognizers, which were developed by fusing twenty-six different protein features through the geometric mean of their corresponding kernel matrices, improve the state of the art. Moreover, by incorporating the available functional domain information through our proposed hybridization model, we come close to solving the protein fold recognition problem for 27 folds.

In addition, the experimental results demonstrate that geometric kernel fusion can effectively improve the accuracy of the state-of-the-art kernel fusion models for predicting protein subnuclear locations, detecting protein remote homology, and prioritizing disease-associated genes.

In particular, for gene prioritization, we design a geometric kernel data fusion model using the log-Euclidean mean of kernel matrices, which offers scalability to large datasets. Moreover, to deliver more accurate gene prioritization predictions, we introduce a heuristic weighted approach for integrating kernel matrices using a log-Euclidean mean of kernel matrices.
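A minimal sketch of a (weighted) log-Euclidean mean of kernel matrices is given below, assuming NumPy and symmetric positive semidefinite inputs. The function names, the eigenvalue clipping threshold `eps`, and the way weights are normalized are illustrative assumptions; the abstract does not describe the project's heuristic weighting scheme, so the weights here are simply supplied by the caller.

```python
import numpy as np

def spd_log(K, eps=1e-10):
    """Matrix logarithm of a symmetric positive (semi)definite matrix."""
    K = (K + K.T) / 2.0                       # enforce exact symmetry
    eigvals, eigvecs = np.linalg.eigh(K)
    eigvals = np.clip(eigvals, eps, None)     # assumption: clip tiny or negative eigenvalues
    return (eigvecs * np.log(eigvals)) @ eigvecs.T

def spd_exp(S):
    """Matrix exponential of a symmetric matrix."""
    S = (S + S.T) / 2.0
    eigvals, eigvecs = np.linalg.eigh(S)
    return (eigvecs * np.exp(eigvals)) @ eigvecs.T

def log_euclidean_mean(kernels, weights=None):
    """exp(sum_i w_i * log(K_i)); uniform weights when none are given."""
    n = len(kernels)
    if weights is None:
        weights = np.full(n, 1.0 / n)
    else:
        weights = np.asarray(weights, dtype=float)
        weights = weights / weights.sum()     # normalize so the weights sum to 1
    log_sum = sum(w * spd_log(K) for w, K in zip(weights, kernels))
    return spd_exp(log_sum)
```

Each kernel needs only one eigendecomposition, and the logarithms are combined by an ordinary weighted sum, so adding another data source does not require an iterative procedure over all matrices.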

Next, we focus on fusing biological data sources at the decision level. We discuss the possible advantage of combining multiple heterogeneous biological kernels in the gene prioritization task using late aggregation operators, such as ordered weighted averaging. Accordingly, we design several kernel-based gene prioritization frameworks that integrate multiple genomic data sources through late integration. Our proposed models were submitted to the second Critical Assessment of Functional Annotation (CAFA2) challenge to predict human phenotype terms, where they delivered promising results compared with those of the other participating groups.
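Ordered weighted averaging (OWA) can be illustrated with a short sketch in plain Python. The gene names, scores, and weight vector below are hypothetical, and `rank_genes` is an illustrative helper rather than part of the project's frameworks. The defining step is that weights are attached to rank positions after sorting, not to particular data sources.

```python
def owa(values, weights):
    """Ordered weighted average: weights apply to the values sorted in descending order."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))

def rank_genes(scores_by_gene, weights):
    """Fuse per-source scores for each gene with OWA, then rank genes best first."""
    fused = {gene: owa(scores, weights) for gene, scores in scores_by_gene.items()}
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical scores from three separately trained kernel models.
scores = {
    "geneA": [0.9, 0.1, 0.2],
    "geneB": [0.5, 0.6, 0.4],
    "geneC": [0.1, 0.2, 0.1],
}
ranking = rank_genes(scores, weights=[0.2, 0.3, 0.5])
```

With weights `[1, 0, 0]` the operator returns the maximum score, with `[0, 0, 1]` the minimum, and with uniform weights the plain mean, so the weight vector controls how optimistic the aggregation is.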

To tackle the gene prioritization task more effectively, we develop a model that fuses both genomic and phenotypic information. The proposed method is grounded in the concept of matrix completion, which lets us exploit the advantages of a multi-task approach to gene prioritization: the model can detect patterns in the data that are common to several diseases or phenotypes. This particularly appealing aspect of our method, along with the use of phenotypic similarity between diseases, enables us to handle gene prioritization for diseases with very few known genes and for genes that have not yet been extensively characterized.
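The matrix completion idea can be sketched with a plain regularized low-rank factorization fitted by alternating least squares, assuming NumPy. This is a generic illustration and not the project's model: the function name, the parameters `rank`, `reg`, and `n_iters`, and the boolean `observed` mask are all assumptions introduced here. Rows stand for genes and columns for phenotypes.

```python
import numpy as np

def complete_matrix(R, observed, rank=2, reg=0.1, n_iters=50, seed=0):
    """Fill in a partially observed matrix R with a low-rank model U @ V.T.

    R        : (n_rows, n_cols) array, e.g. genes x phenotypes
    observed : boolean array of the same shape, True where R is known
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = R.shape
    U = rng.normal(scale=0.1, size=(n_rows, rank))
    V = rng.normal(scale=0.1, size=(n_cols, rank))
    ridge = reg * np.eye(rank)
    for _ in range(n_iters):
        # Update each row factor using only that row's observed entries.
        for i in range(n_rows):
            cols = observed[i]
            if cols.any():
                Vi = V[cols]
                U[i] = np.linalg.solve(Vi.T @ Vi + ridge, Vi.T @ R[i, cols])
        # Update each column factor using only that column's observed entries.
        for j in range(n_cols):
            rows = observed[:, j]
            if rows.any():
                Uj = U[rows]
                V[j] = np.linalg.solve(Uj.T @ Uj + ridge, Uj.T @ R[rows, j])
    return U @ V.T
```

Because all columns share the same row factors, what is learned from well-studied phenotypes also shapes the predictions for sparsely annotated ones, which is the multi-task effect described above.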

To deliver more accurate gene-phenotype matrix completion, we extend classical Bayesian matrix factorization to work with multiple sources of side information. The availability of side information allows us to make nontrivial predictions for genes with no previously known disease association. For the first time, our gene prioritization method can combine data sources describing genes with data sources describing phenotypes, and in this way improve the state of the art. Evaluation results on our benchmarks show that our proposed model improves accuracy over a state-of-the-art gene prioritization method, Endeavour.

 

Date: 1 Oct 2012 → 29 Jun 2018
Keywords: Data fusion, Genomic data fusion, Kernel methods, Machine Learning, Gene prioritization, Critical Assessment of Functional Annota, Geometric mean of matrices, Matrix completion, Matrix factorization, Human phenotype terms, Protein fold recognition, Ordered weighted averaging, Geometric kernel data fusion
Disciplines: Evolutionary biology, General biology, Social medical sciences, Scientific computing, Bioinformatics and computational biology, Public health care, Public health services, Applied mathematics in specific fields, Computer architecture and networks, Distributed computing, Information sciences, Information systems, Programming languages, Theoretical computer science, Visual computing, Other information and computing sciences, Artificial intelligence, Cognitive science and intelligent systems, Modelling, Biological system engineering, Signal processing, Control systems, robotics and automation, Design theories and methods, Mechatronics and robotics, Computer theory
Project type: PhD project