Scholkopf2012 3 A Gretton K Borgwardt M Rasch B Schölkopf A Smola 2012-03-00 13 723−773 Journal of Machine Learning Research We propose a framework for analyzing and comparing distributions, which we use to construct statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS), and is called the maximum mean discrepancy (MMD). We present two distribution-free tests based on large deviation bounds for the MMD, and a third test based on the asymptotic distribution of this statistic. The MMD can be computed in quadratic time, although efficient linear time approximations are available. Our statistic is an instance of an integral probability metric, and various classical metrics on distributions are obtained when alternative function classes are used in place of an RKHS. We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests. no notspecified http://www.kyb.tuebingen.mpg.de/ published -723 A Kernel Two-Sample Test 15017 15420 15017 20755 15017 15421 SongSGBB2012 3 L Song A Smola A Gretton J Bedo K Borgwardt 2012-03-00 13 1393 1434 Journal of Machine Learning Research We introduce a framework for feature selection based on dependence maximization between the selected features and the labels of an estimation problem, using the Hilbert-Schmidt Independence Criterion. The key idea is that good features should be highly dependent on the labels. Our approach leads to a greedy procedure for feature selection. We show that a number of existing feature selectors are special cases of this framework. Experiments on both artificial and real-world data show that our feature selector works well in practice. no notspecified http://www.kyb.tuebingen.mpg.de/fileadmin/user_upload/files/publications/2012/JMLR-2007-Song.pdf accepted 41 Feature Selection via Dependence Maximization 15017 20755 ThomaCGHKSSYYB2010 3 M Thoma H Cheng A Gretton J Han H-P Kriegel AJ Smola L Song PS Yu X Yan KM Borgwardt 2010-10-00 5 3 302–318 Statistical Analysis and Data Mining The goal of frequent subgraph mining is to detect subgraphs that frequently occur in a dataset of graphs. In classification settings, one is often interested in discovering discriminative frequent subgraphs, whose presence or absence is indicative of the class membership of a graph. In this article, we propose an approach to feature selection on frequent subgraphs, called CORK, that combines two central advantages. First, it optimizes a submodular quality criterion, which means that we can yield a near-optimal solution using greedy feature selection. Second, our submodular quality function criterion can be integrated into gSpan, the state-of-the-art tool for frequent subgraph mining, and help to prune the search space for discriminative frequent subgraphs even during frequent subgraph mining. no notspecified http://www.kyb.tuebingen.mpg.de/ published -302 Discriminative frequent subgraph mining with optimality guarantees 15017 15420 15017 20755 4764 3 L Song J Bedo KM Borgwardt A Gretton A Smola 2007-07-00 13: ISMB/ECCB 2007 Conference Proceedings 23 i490 i498 Bioinformatics Motivation: Identifying significant genes among thousands of sequences on a microarray is a central challenge for cancer research in bioinformatics. The ultimate goal is to detect the genes that are involved in disease outbreak and progression. A multitude of methods have been proposed for this task of feature selection, yet the selected gene lists differ greatly between different methods. To accomplish biologically meaningful gene selection from microarray data, we have to understand the theoretical connections and the differences between these methods. In this article, we define a kernel-based framework for feature selection based on the Hilbert–Schmidt independence criterion and backward elimination, called BAHSIC. We show that several well-known feature selectors are instances of BAHSIC, thereby clarifying their relationship. Furthermore, by choosing a different kernel, BAHSIC allows us to easily define novel feature selection algorithms. As a further advantage, feature selection via BAHSIC works directly on multiclass problems. Results: In a broad experimental evaluation, the members of the BAHSIC family reach high levels of accuracy and robustness when compared to other feature selection techniques. Experiments show that features selected with a linear kernel provide the best classification performance in general, but if strong non-linearities are present in the data then non-linear kernels can be more suitable. no notspecified http://www.kyb.tuebingen.mpg.de/ published 0 Gene selection via the BAHSIC family of algorithms 15017 15420 3981 3 KM Borgwardt A Gretton M Rasch H-P Kriegel B Schölkopf A Smola 2006-08-00 4: ISMB 2006 Conference Proceedings 22 e49 e57 Bioinformatics Motivation: Many problems in data integration in bioinformatics can be posed as one common question: Are two sets of observations generated by the same distribution? We propose a kernel-based statistical test for this problem, based on the fact that two distributions are different if and only if there exists at least one function having different expectation on the two distributions. Consequently we use the maximum discrepancy between function means as the basis of a test statistic. The Maximum Mean Discrepancy (MMD) can take advantage of the kernel trick, which allows us to apply it not only to vectors, but strings, sequences, graphs, and other common structured data types arising in molecular biology. Results: We study the practical feasibility of an MMD-based test on three central data integration tasks: Testing cross-platform comparability of microarray data, cancer diagnosis, and data-content based schema matching for two different protein function classification schemas. In all of these experiments, including high-dimensional ones, MMD is very accurate in finding samples that were generated from the same distribution, and outperforms its best competitors. Conclusions: We have defined a novel statistical test of whether two samples are from the same distribution, compatible with both multivariate and structured data, that is fast, easy to implement, and works well, as confirmed by our experiments. no notspecified http://www.kyb.tuebingen.mpg.de/ published 0 Integrating Structured Biological data by Kernel Maximum Mean Discrepancy 15017 15420 VishwanathanBGS2006 3 SVN Vishwanathan KM Borgwardt O Guttman AJ Smola 2006-03-00 7-9 69 721 729 Neurocomputing We present a framework for efficient extrapolation of reduced rank approximations, graph kernels, and locally linear embeddings (LLE) to unseen data. We also present a principled method to combine many of these kernels and then extrapolate them. Central to our method is a theorem for matrix approximation, and an extension of the representer theorem to handle multiple joint regularization constraints. Experiments in protein classification demonstrate the feasibility of our approach. no notspecified http://www.kyb.tuebingen.mpg.de/ published 8 Kernel extrapolation 3415 3 KM Borgwardt CS Ong S Schönauer Vishwanathan AJ Smola H-P Kriegel 2005-06-00 Suppl. 1: ISMB 2005 Proceedings 21 i47 i56 Bioinformatics Motivation: Computational approaches to protein function prediction infer protein function by finding proteins with similar sequence, structure, surface clefts, chemical properties, amino acid motifs, interaction partners or phylogenetic profiles. We present a new approach that combines sequential, structural and chemical information into one graph model of proteins. We predict functional class membership of enzymes and non-enzymes using graph kernels and support vector machine classification on these protein graphs. Results: Our graph model, derivable from protein sequence and structure only, is competitive with vector models that require additional protein information, such as the size of surface pockets. If we include this extra information into our graph model, our classifier yields significantly higher accuracy levels than the vector models. Hyperkernels allow us to select and to optimally combine the most relevant node attributes in our protein graphs. We have laid the foundation for a protein function prediction system that integrates protein information from various sources efficiently and effectively. no notspecified http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/pdf3415.pdf published 0 Protein function prediction via graph kernels 4193 7 A Gretton KM Borgwardt M Rasch B Schölkopf A Smola Vancouver, BC, Canada2007-09-00 513 520 Twentieth Annual Conference on Neural Information Processing Systems (NIPS 2006) We propose two statistical tests to determine if two samples are from different distributions. Our test statistic is in both cases the distance between the means of the two samples mapped into a reproducing kernel Hilbert space (RKHS). The first test is based on a large deviation bound for the test statistic, while the second is based on the asymptotic distribution of this statistic. The test statistic can be computed in $O(m^2)$ time. We apply our approach to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where our test performs strongly. We also demonstrate excellent performance when comparing distributions over graphs, for which no alternative tests currently exist. no notspecified http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/NIPS2006_0583_4193[0].pdf published 7 A Kernel Method for the Two-Sample-Problem 15017 15420 4194 7 J Huang A Smola A Gretton KM Borgwardt B Schölkopf Vancouver, BC, Canada2007-09-00 601 608 Twentieth Annual Conference on Neural Information Processing Systems (NIPS 2006) We consider the scenario where training and test data are drawn from different distributions, commonly referred to as sample selection bias. Most algorithms for this setting try to first recover sampling distributions and then make appropriate corrections based on the distribution estimate. We present a nonparametric method which directly produces resampling weights without distribution estimation. Our method works by matching distributions between training and testing sets in feature space. Experimental results demonstrate that our method works well in practice. no notspecified http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/NIPS2006_0915_4194[0].pdf published 7 Correcting Sample Selection Bias by Unlabeled Data 15017 15420 4426 7 A Gretton KM Borgwardt M Rasch B Schölkopf AJ Smola Vancouver, BC, Canada2007-07-00 1637 1641 Twenty-Second AAAI Conference on Artificial Intelligence (IAAI-07) We describe a technique for comparing distributions without the need for density estimation as an intermediate step. Our approach relies on mapping the distributions into a Reproducing Kernel Hilbert Space. We apply this technique to construct a two-sample test, which is used for determining whether two sets of observations arise from the same distribution. We use this test in attribute matching for databases using the Hungarian marriage method, where it performs strongly. We also demonstrate excellent performance when comparing distributions over graphs, for which no alternative tests currently exist. no notspecified http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/Gretton_4426[0].pdf published 4 A Kernel Approach to Comparing Distributions 15017 15420 4471 7 L Song AJ Smola A Gretton KM Borgwardt Corvallis, OR, USA2007-06-00 815 822 Twenty-Fourth Annual International Conference on Machine Learning (ICML 2007) We propose a family of clustering algorithms based on the maximization of dependence between the input variables and their cluster labels, as expressed by the Hilbert-Schmidt Independence Criterion (HSIC). Under this framework, we unify the geometric, spectral, and statistical dependence views of clustering, and subsume many existing algorithms as special cases (e.g. k-means and spectral clustering). Distinctive to our framework is that kernels can also be applied on the labels, which can endow them with particular structures. We also obtain a perturbation bound on the change in k-means clustering. no notspecified http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/cluhsic_[0].pdf published 7 A Dependence Maximization View of Clustering 15017 15420 4462 7 L Song AJ Smola A Gretton KM Borgwardt J Bedo Corvallis, OR, USA2007-06-00 823 830 Twenty-Fourth Annual International Conference on Machine Learning (ICML 2007) We introduce a framework for filtering features that employs the Hilbert-Schmidt Independence Criterion (HSIC) as a measure of dependence between the features and the labels. The key idea is that good features should maximise such dependence. Feature selection for various supervised learning problems (including classification and regression) is unified under this framework, and the solutions can be approximated using a backward-elimination algorithm. We demonstrate the usefulness of our method on both artificial and real world datasets. no notspecified http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/ICML07_[0].pdf published 7 Supervised Feature Selection via Dependence Estimation 15017 15420 BorgwardtGVS2005 7 KM Borgwardt O Guttman SVN Vishwanathan AJ Smola Brugge, Belgium2005-04-00 455 460 13th European Symposium on Artificial Neural Networks (ESANN 2005) We present a principled method to combine kernels under joint regularization constraints. Central to our method is an extension of the representer theorem for handling multiple joint regularization constraints. Experimental evidence shows the feasibility of our approach. no notspecified http://www.kyb.tuebingen.mpg.de/fileadmin/user_upload/files/publications/ESANN-005_Borgwardt.pdf published 5 Joint Regularization 3174 7 A Gretton AJ Smola O Bousquet R Herbrich A Belitski M Augath Y Murayama J Pauls B Schölkopf NK Logothetis Barbados2005-01-00 112 119 Tenth International Workshop on Artificial Intelligence and Statistics (AI & Statistics 2005) We discuss reproducing kernel Hilbert space (RKHS)-based measures of statistical dependence, with emphasis on constrained covariance (COCO), a novel criterion to test dependence of random variables. We show that COCO is a test for independence if and only if the associated RKHSs are universal. That said, no independence test exists that can distinguish dependent and independent random variables in all circumstances. Dependent random variables can result in a COCO which is arbitrarily close to zero when the source densities are highly non-smooth. All current kernel-based independence tests share this behaviour. We demonstrate exponential convergence between the population and empirical COCO. Finally, we use COCO as a measure of joint neural activity between voxels in MRI recordings of the macaque monkey, and compare the results to the mutual information and the correlation. We also show the effect of removing breathing artefacts from the MRI recording. no notspecified http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/pdf3174.pdf published 7 Kernel Constrained Covariance for Dependence Measurement 15017 1542015017 15421 5376 2 A Gretton AJ Smola J Huang M Schmittfull KM Borgwardt B Schölkopf MIT Press Cambridge, MA, USA 2009-02-00 131 160 Dataset Shift in Machine Learning Given sets of observations of training and test data, we consider the problem of re-weighting the training data such that its distribution more closely matches that of the test data. We achieve this goal by matching covariate distributions between training and test sets in a high dimensional feature space (specifically, a reproducing kernel Hilbert space). This approach does not require distribution estimation. Instead, the sample weights are obtained by a simple quadratic programming procedure. We provide a uniform convergence bound on the distance between the reweighted training feature mean and the test feature mean, a transductive bound on the expected loss of an algorithm trained on the reweighted data, and a connection to single class SVMs. While our method is designed to deal with the case of simple covariate shift (in the sense of Chapter ??), we have also found benefits for sample selection bias on the labels. Our correction procedure yields its greatest and most consistent advantages when the learning algorithm returns a classifier/regressor that is simpler" than the data might suggest. no notspecified http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/shift-book-for-LeEtAl-webversion_5376[0].pdf published 29 Covariate Shift by Kernel Mean Matching 15017 1542015017 20755 5111 46 A Gretton K Borgwardt M Rasch B Schölkopf A Smola 2008-04-00 2008-04-00 A Kernel Method for the Two-sample Problem no notspecified A Kernel Method for the Two-sample Problem 15017 15420