Genomics and proteomics are currently producing massive datasets containing a wealth of information about underlying biological mechanisms. However, many interesting properties of biomolecules, such as drugs, DNA and proteins, cannot be easily determined experimentally and thus form worthwhile targets for computational prediction. In our bioinformatics group, we are interested in developing new machine learning methods to find regularities or interpretable knowledge in biological data and, in turn, exploit the knowledge to predict important properties of biomolecules.
Biological data, drawn from the observation of the complex mechanisms of the cell, are represented in various types of data structures. For example, gene expression profiles can be represented as vectors, genome sequences as symbol sequences, phylogenetic profiles as trees, chemical compounds as graphs and relationships among proteins as networks. One of our main goals is to develop effective and theoretically well-founded machine learning methods for such structured data. Current projects employ diverse approaches such as graph mining, network module enumeration, structured output learning and Bayesian inference for dealing with structured data.
Graph mining is a recently emerging approach to finding frequently appearing subgraphs in large databases of graphs. It has been used, for example, for finding abundant phrases in a text corpus, but has a large potential for computational biology as well. In particular, graph mining has been proven effective in chemoinformatics, where the task is to predict biochemical properties such as toxicity, mutagenicity, or solubility of drug candidates. Such properties can be screened by experimental means, however, given a large amount of drug candidates, computational prediction is useful for the drug discovery processes by providing a form of virtual screening.
We have been working on developing efficient and accurate algorithms for learning from graph data. Our goal is to derive prediction rules that depend on only a few, interpretable subgraph features. Among several approaches we have proposed so far, the flagship method is graph boosting, where graph mining is repeatedly called to collect necessary subgraph descriptors progressively [p 56]. In comparison with conventional methods that create all subgraph descriptors at once, graph boosting is faster and requires less memory by avoiding the enumeration of useless descriptors. So far, graph mining methods have been mostly heuristic and lacking solid theoretical basis. In contrast, graph boosting has guaranteed convergence properties, and the maximum-margin property to warrant good generalization. Along this line of research, we are working on graph least angle regression (gLARS), graph partial least squares regression (gPLS) and graph principal component analysis (gPCA). Recently, we developed a graph boosting method that can predict the chemical activity with error bars. This Bayesian approach allows us to evaluate the uncertainty of predictions, which is often important in applications. This can eventually lead to novel methods for active learning and automated drug discovery algorithms.
As biological networks such as protein interaction networks and gene regulatory networks have recently become available in a large number of species, bioinformatics groups worldwide are working on novel methods for the discovery of useful knowledge from the networks. In particular, discovery of densely connected protein groups (i.e., modules) is crucial, as biological functions tend to be achieved not by individual proteins, but by groups or complexes of proteins. Most methods partition a network into disjoint modules; however, this is not appropriate for discovering protein complexes, because one protein often participates in multiple complexes. We have proposed a dense module enumeration method that can find all dense modules. It is based on a technique called reverse search, which was developed in the 90s and can solve difficult problems where conventional algorithms such as branch-and-bound become inefficient.
In addition to novel data mining techniques, we continue to apply kernel methods to problems of computational biology. In collaboration with the Friedrich Miescher Lab (FML) group led by Gunnar Rätsch, we applied support vector machines (SVMs) to the prediction of subcellular localization of proteins from their amino acid sequences. The SVM was extended to choose automatically an optimal combination from a large set of possibly useful kernels; this way, competitive accuracy was achieved without tedious handcrafting of models. A second project done with the FML group concerns domain adaptation methods in the field of genome annotation. For model organisms such as C. elegans, a substantial amount of labeled data is available, but there is only little data available for newly sequenced organisms. In a recent project, we evaluated a number of domain adaptation methods (including some newly developed ones) to transfer knowledge from one organism to another [p. 60], showing that significant improvements can be achieved.
In another campus collaboration we have continued a joint project with the Department of Protein Evolution at the Max Planck Institute for Developmental Biology, working on the development of Bayesian three-dimensional protein structure elucidation methods based on Inferential Structure Determination (ISD). In contrast to optimization-based methods, ISD can take the uncertainty into account that is inherent in NMR data. It allows sampling the structures from the full posterior distribution with Markov chain Monte Carlo methods, and assessing the precision and quality of the obtained structures. Recently, our main focus has been the integration of heterogeneous data sources such as NMR, X-ray crystallography, and cryo-electron microscopy (cryo-EM). The ISD method and software were extended to integrate low-resolution density maps from cryo-EM and structure prediction by homology. These developments contributed to the determination of the first structure of a mitochondrial membrane protein (VDAC).