We are studying the principled design of learning algorithms that are able to identify regularities in data. Research in this area thus encompasses not only the development of improved algorithms for generic learning problems, but also the design of new algorithms for specific applications.
Technically, many of the approaches in the department fall into the category of kernel algorithms. They are based on the notion of positive-definite kernels. These kernels can be shown to play three roles in learning. First, the kernel can be thought of as a (nonlinear) similarity measure that is used to compare the data (e.g., visual images). Second, the kernel can be shown to correspond to an inner product in an associated linear space with the mathematical structure of a reproducing kernel Hilbert space (RKHS). In this way, the kernel induces a linear representation of the data. Third, it can be shown that a large class of kernel algorithms leads to solutions that can be expanded in terms of kernel functions centered on the training data. In this sense, the kernel also determines the function class used for learning, i.e., the hypotheses that are used in examining the dataset for regularities. All three roles lie at the heart of empirical inference, rendering kernel methods an elegant mathematical framework for studying learning and designing learning algorithms.
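The three roles can be made concrete in a few lines of NumPy: a Gaussian kernel acts as the similarity measure, its Gram matrix encodes the RKHS inner products, and a kernel ridge regression solution expands in kernels centered on the training data. This is a generic illustrative sketch, not code from the department; the bandwidth, regularization constant, and toy regression target are arbitrary choices.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    # Role 1: k(x, z) = exp(-||x - z||^2 / (2 sigma^2)) is a
    # positive-definite similarity measure between data points.
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 1))
y = np.sin(X[:, 0])              # toy regression target

# Role 2: the Gram matrix K collects RKHS inner products <phi(x_i), phi(x_j)>.
K = gaussian_kernel(X, X)

# Role 3: the kernel ridge regression solution expands in kernels centered
# on the training points, f(x) = sum_i alpha_i k(x_i, x).
lam = 1e-2
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

x_test = np.array([[0.3]])
f_test = gaussian_kernel(x_test, X) @ alpha
```

The positive definiteness of the Gram matrix is what guarantees that an RKHS with these inner products exists.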
During their relatively short history in machine learning, kernel methods have already undergone several conceptual changes. Initially, kernels were viewed as a way of “kernelizing” algorithms, i.e., constructing nonlinear variants of existing linear algorithms. The next step was the use of kernels to induce a linear representation of data that did not come from a vector space to begin with, thus allowing the use of a number of linear methods for data types such as strings or graphs. The third change happened only recently. It was observed that kernels sometimes let us rewrite optimization problems over large classes of nonlinear functions as linear problems in RKHSs. In a statistical context, this usually amounts to transforming certain higher order statistics into first-order (linear) ones, and handling them using convenient tools from linear algebra and functional analysis. An example that we have co-developed is a class of methods for distribution embeddings in RKHSs.
As well as providing a measure of distance on probability distributions (namely, the RKHS distance between embeddings), these mappings directly imply a measure of dependence between random variables, consisting of the RKHS distance between the embedding of the joint distribution and that of the product of marginals.
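The RKHS distance between mean embeddings can be estimated directly from samples via Gram matrices. The sketch below, a hedged illustration rather than the department's implementation, computes a biased estimate of the squared distance for a Gaussian kernel with an arbitrarily chosen bandwidth:

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    # Biased estimate of the squared RKHS distance between embeddings:
    # ||mu_P - mu_Q||^2 = E k(x,x') - 2 E k(x,y) + E k(y,y').
    return (gaussian_kernel(X, X, sigma).mean()
            - 2 * gaussian_kernel(X, Y, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean())

rng = np.random.default_rng(1)
# Distance is near zero for two samples from the same distribution,
# and clearly positive when the second sample is mean-shifted.
same = mmd2(rng.normal(size=(200, 2)), rng.normal(size=(200, 2)))
diff = mmd2(rng.normal(size=(200, 2)), rng.normal(1.0, 1.0, size=(200, 2)))
```

The dependence measure mentioned above follows the same pattern, with the joint distribution and the product of marginals in place of the two samples.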
When the embeddings are computed on the basis of finite samples, a question of particular interest is whether the distance between embeddings is large enough to be statistically significant (and thus, whether the distributions are deemed different on the basis of the observations). We have provided means of verifying this significance, along with associated nonparametric hypothesis tests for homogeneity, independence, and conditional independence.
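One generic way to calibrate such a test, shown here as an illustration and not as the specific test constructions developed in the department, is a permutation test: under the null hypothesis of homogeneity, pooling the two samples and reshuffling them leaves the distribution of the statistic unchanged.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    # Biased estimate of the squared RKHS distance between embeddings.
    return (gaussian_kernel(X, X, sigma).mean()
            - 2 * gaussian_kernel(X, Y, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean())

def permutation_test(X, Y, n_perm=200, sigma=1.0, seed=0):
    # Compare the observed statistic against its null distribution,
    # approximated by recomputing it on random relabelings of the pool.
    rng = np.random.default_rng(seed)
    observed = mmd2(X, Y, sigma)
    pooled = np.vstack([X, Y])
    n = len(X)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        exceed += mmd2(pooled[idx[:n]], pooled[idx[n:]], sigma) >= observed
    return (exceed + 1) / (n_perm + 1)   # p-value

rng = np.random.default_rng(4)
p_diff = permutation_test(rng.normal(size=(100, 2)),
                          rng.normal(1.0, 1.0, size=(100, 2)))
p_same = permutation_test(rng.normal(size=(100, 2)),
                          rng.normal(size=(100, 2)))
```

A small p-value indicates that a distance this large is unlikely under the null, so the distributions are deemed different.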
The behavior and performance of any kernel algorithm using these distribution embeddings hinges upon properties of the kernel used. This led us to a detailed study of the class of kernels that induce injective RKHS embeddings, i.e., embeddings that do not lose information and uniquely characterize all probability distributions from a given set.
Kernel dependence measures based on distribution embeddings may be used not only to detect whether significant dependence exists, but also, when optimized, to reveal underlying structure in the data. Thus, data can be clustered more effectively when the resulting clusters are given structure using side information, by maximizing a kernel dependence measure with respect to this side information. Such information may take the form of additional descriptions of the data, such as captions for images, or may impose a structure on the clusters using prior knowledge about their mutual relations. In the first case (additional descriptions of the data), we have developed a novel clustering algorithm, Correlational Spectral Clustering, which uses the kernel canonical correlations between the data and the side information to improve spectral clustering. In the example considered, images were clustered more consistently with human labeling when side information from the associated descriptions was used to guide the clustering. In the second case (prior cluster structure known), the clusters were assumed to follow a tree structure, leading to the Numerical Taxonomy Clustering algorithm.
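A standard instance of such a dependence measure is the empirical Hilbert-Schmidt independence criterion (HSIC), the squared RKHS distance between the embedding of the joint distribution and the product of the marginal embeddings. The sketch below is a minimal illustration with arbitrary bandwidths, not the clustering algorithms described above:

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def hsic(X, Y, sigma=1.0):
    # Empirical HSIC: trace(K H L H) / (n - 1)^2, where H is the
    # centering matrix. It vanishes (in expectation, up to O(1/n)
    # bias) when X and Y are independent.
    n = len(X)
    K = gaussian_gram(X, sigma)
    L = gaussian_gram(Y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(2)
x = rng.normal(size=(200, 1))
dependent = hsic(x, x ** 2)                       # y is a function of x
independent = hsic(x, rng.normal(size=(200, 1)))  # unrelated sample
```

Maximizing such a measure over cluster assignments, with one kernel on the data and one on the side information, is the common thread of the two algorithms above.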
A second set of projects is concerned with the use of non-standard inference principles in machine learning. We have in the past devoted substantial effort to such inference principles, in particular semi-supervised learning. We investigated the use of local inference in several learning problems. We have also contributed to algorithms implementing a novel approach to regularization termed the “Universum,” and connected it to known methods of learning and data analysis.
In another focus started during the previous reporting period, we have continued and expanded our work on structured-output learning, dealing with learning algorithms that generalize classification and regression to a situation where the goal is to learn an input-output mapping between arbitrary sets of objects. Canonical examples of an output in this framework are sequences, trees and strings. For such problems, the large size of the output space renders standard learning methods ineffective (e.g., the size of the output space for sequences scales exponentially with the length of the sequences). A series of papers has developed new supervised and semi-supervised learning methods that combine advantages of kernel methods with the efficiency of dynamic programming algorithms. In recent work, we have provided a unifying analysis of existing supervised inference methods for structured prediction using convex duality techniques.
Our analysis has shown that these methods can be cast as duals of various divergence minimization problems with respect to structured data constraints. By extending these constraints to employ unlabeled data, we developed a class of semi-supervised learning methods for structured output learning. Another direction we pursued in this framework is extending supervised learning methods to complex tasks, which consist of multiple structured output problems. This is particularly challenging since exact inference of such problems is intractable. We used multitask learning techniques and devised an efficient approximation algorithm to learn multiple structured output predictors jointly.
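The role of dynamic programming in taming exponentially large output spaces, mentioned for the sequence case above, can be illustrated with Viterbi decoding over a linear-chain score. This is a generic textbook sketch with made-up scores, not one of the methods developed in the projects described:

```python
import numpy as np

def viterbi(emission, transition):
    # emission[t, s]: score of state s at sequence position t.
    # transition[s, s2]: score of moving from state s to state s2.
    # The output space contains S**T label sequences, but dynamic
    # programming recovers the highest-scoring one in O(T * S^2).
    T, S = emission.shape
    score = emission[0].copy()
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transition + emission[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Three positions, two states; zero transition scores make the
# decoder follow the per-position emission maxima.
best = viterbi(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]),
               np.zeros((2, 2)))   # → [0, 1, 0]
```

Structured-output learners of the kind discussed above wrap such a decoder inside training, scoring each candidate output jointly with the input.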
The last two projects in the area of kernel algorithms draw their motivation from practical problems that we encountered in application domains. In kernel machines, the solution is usually written as an expansion in terms of an a priori chosen kernel function. The choice of this kernel is nontrivial yet important in practice. Sometimes a linear combination drawn from a large library of kernels works best, or a multi-scale approach combined with aggressive sparsification to keep the runtime complexity under control. The latter work improves upon our earlier work on sparsification of kernel machines: with the additional degree of freedom introduced by the multi-scale approach, a higher degree of sparsification can be achieved.
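The validity of such kernel libraries rests on a simple fact: any nonnegative combination of positive-definite kernels is again positive definite. The sketch below checks this numerically for a multi-scale Gaussian library; the bandwidths and weights are arbitrary illustrative values, standing in for weights that a multiple-kernel-learning procedure would select.

```python
import numpy as np

def gaussian_gram(X, sigma):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))

# A nonnegative combination of positive-definite kernels is itself
# positive definite, so the multi-scale library kernel is valid.
sigmas = [0.5, 1.0, 2.0, 4.0]            # illustrative scales
weights = np.array([0.1, 0.4, 0.4, 0.1])  # hypothetical learned weights
K = sum(w * gaussian_gram(X, s) for w, s in zip(weights, sigmas))

eigs = np.linalg.eigvalsh(K)  # all (numerically) nonnegative
```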