Scholkopf20123AGrettonKBorgwardtMRaschBSchölkopfASmola2012-03-0013723−773Journal of Machine Learning ResearchWe propose a framework for analyzing and comparing distributions, which we use to construct statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS), and is called the maximum mean discrepancy (MMD). We present two distribution-free tests based on large deviation bounds for the MMD, and a third test based on the asymptotic distribution of this statistic. The MMD can be computed in quadratic time, although efficient linear time approximations are available. Our statistic is an instance of an integral probability metric, and various classical metrics on distributions are obtained when alternative function classes are used in place of an RKHS. We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests.nonotspecifiedhttp://www.kyb.tuebingen.mpg.de/published-723A Kernel Two-Sample Test150171542015017207551501715421SongSGBB20123LSongASmolaAGrettonJBedoKBorgwardt2012-03-001313931434Journal of Machine Learning ResearchWe introduce a framework for feature selection based on dependence maximization between the selected features and the labels of an estimation problem, using the Hilbert-Schmidt Independence Criterion. The key idea is that good features should be highly dependent on the labels. Our approach leads to a greedy procedure for feature selection. We show that a number of existing feature selectors are special cases of this framework. Experiments on both artificial and real-world data show that our feature selector works well in practice.nonotspecifiedhttp://www.kyb.tuebingen.mpg.de/fileadmin/user_upload/files/publications/2012/JMLR-2007-Song.pdfaccepted41Feature Selection via Dependence Maximization1501720755ThomaCGHKSSYYB20103MThomaHChengAGrettonJHanH-PKriegelAJSmolaLSongPSYuXYanKMBorgwardt2010-10-0053302–318Statistical Analysis and Data MiningThe goal of frequent subgraph mining is to detect subgraphs that frequently occur in a dataset of graphs. In classification settings, one is often interested in discovering discriminative frequent subgraphs, whose presence or absence is indicative of the class membership of a graph. In this article, we propose an approach to feature selection on frequent subgraphs, called CORK, that combines two central advantages. First, it optimizes a submodular quality criterion, which means that we can yield a near-optimal solution using greedy feature selection. Second, our submodular quality function criterion can be integrated into gSpan, the state-of-the-art tool for frequent subgraph mining, and help to prune the search space for discriminative frequent subgraphs even during frequent subgraph mining.nonotspecifiedhttp://www.kyb.tuebingen.mpg.de/published-302Discriminative frequent subgraph mining with optimality guarantees1501715420150172075547643LSongJBedoKMBorgwardtAGrettonASmola2007-07-0013: ISMB/ECCB 2007 Conference Proceedings23i490i498BioinformaticsMotivation: Identifying significant genes among thousands of sequences on a microarray is a central challenge for cancer research in bioinformatics. The ultimate goal is to detect the genes that are involved in disease outbreak and progression. A multitude of methods have been proposed for this task of feature selection, yet the selected gene lists differ greatly between different methods. To accomplish biologically meaningful gene selection from microarray data, we have to understand the theoretical connections and the differences between these methods. In this article, we define a kernel-based framework for feature selection based on the Hilbert–Schmidt independence criterion and backward elimination, called BAHSIC. We show that several well-known feature selectors are instances of BAHSIC, thereby clarifying their relationship. Furthermore, by choosing a different kernel, BAHSIC allows us to easily define novel feature selection algorithms. As a further advantage, feature selection via BAHSIC works directly on multiclass problems.
Results: In a broad experimental evaluation, the members of the BAHSIC family reach high levels of accuracy and robustness when compared to other feature selection techniques. Experiments show that features selected with a linear kernel provide the best classification performance in general, but if strong non-linearities are present in the data then non-linear kernels can be more suitable.nonotspecifiedhttp://www.kyb.tuebingen.mpg.de/published0Gene selection via the BAHSIC family of algorithms150171542039813KMBorgwardtAGrettonMRaschH-PKriegelBSchölkopfASmola2006-08-004: ISMB 2006 Conference Proceedings22e49e57BioinformaticsMotivation: Many problems in data integration in bioinformatics can be posed as one common question: Are two sets of observations generated by the same distribution? We propose a kernel-based statistical test for this problem, based on the fact that two distributions are different if and only if there exists at least one function having different expectation on the two distributions. Consequently we use the maximum discrepancy between function means as the basis of a test statistic.
The Maximum Mean Discrepancy (MMD) can take advantage of the kernel trick, which allows us to apply it not only to vectors, but strings, sequences, graphs, and other common structured data types arising in molecular biology.
Results: We study the practical feasibility of an MMD-based test on three central data integration tasks: Testing cross-platform comparability of microarray data, cancer diagnosis, and data-content based schema matching for two different protein function classification schemas. In all of these experiments, including high-dimensional ones, MMD is very accurate in finding samples that were generated from the same distribution, and outperforms its best competitors.
Conclusions: We have defined a novel statistical test of whether two samples are from the same distribution, compatible with both multivariate and structured data, that is fast, easy to implement, and works well, as confirmed by our experiments.nonotspecifiedhttp://www.kyb.tuebingen.mpg.de/published0Integrating Structured Biological data by Kernel Maximum Mean Discrepancy1501715420VishwanathanBGS20063SVNVishwanathanKMBorgwardtOGuttmanAJSmola2006-03-007-969721729NeurocomputingWe present a framework for efficient extrapolation of reduced rank approximations, graph kernels, and locally linear embeddings (LLE) to unseen data. We also present a principled method to combine many of these kernels and then extrapolate them. Central to our method is a theorem for matrix approximation, and an extension of the representer theorem to handle multiple joint regularization constraints. Experiments in protein classification demonstrate the feasibility of our approach.nonotspecifiedhttp://www.kyb.tuebingen.mpg.de/published8Kernel extrapolation34153KMBorgwardtCSOngSSchönauerVishwanathanAJSmolaH-PKriegel2005-06-00Suppl. 1: ISMB 2005 Proceedings21i47i56BioinformaticsMotivation: Computational approaches to protein function prediction infer protein function by finding proteins with similar sequence, structure, surface clefts, chemical properties, amino acid motifs, interaction partners or phylogenetic profiles. We present a new approach that combines sequential, structural and chemical information into one graph model of proteins. We predict functional class membership of enzymes and non-enzymes using graph kernels and support vector machine classification on these protein graphs.
Results: Our graph model, derivable from protein sequence and structure only, is competitive with vector models that require additional protein information, such as the size of surface pockets. If we include this extra information into our graph model, our classifier yields significantly higher accuracy levels than the vector models. Hyperkernels allow us to select and to optimally combine the most relevant node attributes in our protein graphs. We have laid the foundation for a protein function prediction system that integrates protein information from various sources efficiently and effectively.nonotspecifiedhttp://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/pdf3415.pdfpublished0Protein function prediction via graph kernels41937AGrettonKMBorgwardtMRaschBSchölkopfASmolaVancouver, BC, Canada2007-09-00513520Twentieth Annual Conference on Neural Information Processing Systems (NIPS 2006)We propose two statistical tests to determine if two samples are from different distributions. Our test statistic is in both cases the distance between the means of the two samples mapped into a reproducing kernel Hilbert space (RKHS). The first test is based on a large deviation bound for the test statistic, while the second is
based on the asymptotic distribution of this statistic.
The test statistic can be computed in $O(m^2)$ time. We apply our approach to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where our test performs strongly.
We also demonstrate excellent performance when comparing distributions over graphs, for which no alternative tests currently exist.nonotspecifiedhttp://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/NIPS2006_0583_4193[0].pdfpublished7A Kernel Method for the Two-Sample-Problem150171542041947JHuangASmolaAGrettonKMBorgwardtBSchölkopfVancouver, BC, Canada2007-09-00601608Twentieth Annual Conference on Neural Information Processing Systems (NIPS 2006)We consider the scenario where training and test data are drawn from different distributions, commonly referred to as sample selection bias. Most algorithms for this setting try to first recover sampling distributions and then make appropriate corrections based on the distribution estimate. We present a nonparametric method which directly produces resampling weights without distribution estimation. Our method works by matching distributions between training and
testing sets in feature space. Experimental results demonstrate that our method works well in practice.nonotspecifiedhttp://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/NIPS2006_0915_4194[0].pdfpublished7Correcting Sample Selection Bias by Unlabeled Data150171542044267AGrettonKMBorgwardtMRaschBSchölkopfAJSmolaVancouver, BC, Canada2007-07-0016371641Twenty-Second AAAI Conference on Artificial Intelligence (IAAI-07)We describe a technique for comparing distributions without
the need for density estimation as an intermediate step.
Our approach relies on mapping the distributions into a Reproducing Kernel Hilbert Space. We apply this technique to
construct a two-sample test, which is used for determining
whether two sets of observations arise from the same distribution. We use this test in attribute matching for databases using the Hungarian marriage method, where it performs strongly. We also demonstrate excellent performance when comparing distributions over graphs, for which no alternative tests currently exist.nonotspecifiedhttp://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/Gretton_4426[0].pdfpublished4A Kernel Approach to Comparing Distributions150171542044717LSongAJSmolaAGrettonKMBorgwardtCorvallis, OR, USA2007-06-00815822Twenty-Fourth Annual International Conference on Machine Learning (ICML 2007)We propose a family of clustering algorithms based on the maximization of dependence between the input variables and their cluster labels, as expressed by the Hilbert-Schmidt Independence Criterion (HSIC). Under this framework, we unify the geometric, spectral, and statistical dependence views of clustering, and subsume many existing algorithms as special cases (e.g. k-means and spectral clustering). Distinctive to our framework is that kernels can also be applied on the labels, which can endow them with particular structures. We also obtain a perturbation bound on the change in k-means clustering.nonotspecifiedhttp://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/cluhsic_[0].pdfpublished7A Dependence Maximization View of Clustering150171542044627LSongAJSmolaAGrettonKMBorgwardtJBedoCorvallis, OR, USA2007-06-00823830Twenty-Fourth Annual International Conference on Machine Learning (ICML 2007)We introduce a framework for filtering features that employs the Hilbert-Schmidt Independence Criterion (HSIC) as a measure of dependence between the features and the labels. The key idea is that good features should maximise such dependence. Feature selection for various supervised learning problems (including classification and regression) is unified under this framework, and the solutions can be approximated using a backward-elimination algorithm. We demonstrate the usefulness of our method on both artificial and real world datasets.nonotspecifiedhttp://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/ICML07_[0].pdfpublished7Supervised Feature Selection via Dependence Estimation1501715420BorgwardtGVS20057KMBorgwardtOGuttmanSVNVishwanathanAJSmolaBrugge, Belgium2005-04-0045546013th European Symposium on Artificial Neural Networks (ESANN 2005)We present a principled method to combine kernels under
joint regularization constraints. Central to our method is an extension of the representer theorem for handling multiple joint regularization constraints. Experimental evidence shows the feasibility of our approach.nonotspecifiedhttp://www.kyb.tuebingen.mpg.de/fileadmin/user_upload/files/publications/ESANN-005_Borgwardt.pdfpublished5Joint Regularization31747AGrettonAJSmolaOBousquetRHerbrichABelitskiMAugathYMurayamaJPaulsBSchölkopfNKLogothetisBarbados2005-01-00112119Tenth International Workshop on Artificial Intelligence and Statistics (AI & Statistics 2005)We discuss reproducing kernel Hilbert space (RKHS)-based measures of statistical dependence, with emphasis on constrained covariance (COCO), a novel criterion to test dependence of random variables. We show that COCO is a test for independence if and only if the associated RKHSs are universal. That said, no independence test exists that can distinguish dependent and independent random variables in all circumstances. Dependent random variables can result in a COCO which is arbitrarily close to zero when the source densities are highly non-smooth. All current kernel-based independence tests share this behaviour. We demonstrate exponential convergence between the population and empirical COCO. Finally, we use COCO as a measure of joint neural activity between voxels in MRI recordings of the macaque monkey, and compare the results to the mutual information and the correlation. We also show the effect of removing breathing artefacts from the MRI recording.nonotspecifiedhttp://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/pdf3174.pdfpublished7Kernel Constrained Covariance for Dependence Measurement1501715420150171542153762AGrettonAJSmolaJHuangMSchmittfullKMBorgwardtBSchölkopfMIT PressCambridge, MA, USA2009-02-00131160Dataset Shift in Machine LearningGiven sets of observations of training and test data, we consider the problem of re-weighting the training data such that its distribution more closely matches that of the test data. We achieve this goal by matching covariate distributions between training and test sets in a high dimensional feature space (specifically, a reproducing
kernel Hilbert space). This approach does not require distribution estimation. Instead, the sample weights are obtained by a simple quadratic programming procedure. We provide a uniform convergence bound on the distance between
the reweighted training feature mean and the test feature mean, a transductive bound on the expected loss of an algorithm trained on the reweighted data, and a connection to single class SVMs. While our method is designed to deal with the case of simple covariate shift (in the sense of Chapter ??), we have also found benefits for sample selection bias on the labels. Our correction procedure yields its greatest and most consistent advantages when the learning algorithm returns a classifier/regressor that is simpler" than the data might suggest.nonotspecifiedhttp://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/shift-book-for-LeEtAl-webversion_5376[0].pdfpublished29Covariate Shift by Kernel Mean Matching15017154201501720755511146AGrettonKBorgwardtMRaschBSchölkopfASmola2008-04-002008-04-00A Kernel Method for the Two-sample ProblemnonotspecifiedA Kernel Method for the Two-sample Problem1501715420