B. Schölkopf, C.J.C. Burges, A.J. Smola (eds.) (1999). Advances in Kernel Methods: Support Vector Learning. MIT Press, Cambridge, MA, USA.

L. Song, A. Smola, A. Gretton, J. Bedo, K. Borgwardt (2012). Feature Selection via Dependence Maximization. Journal of Machine Learning Research 13:1393–1434.
Abstract: We introduce a framework for feature selection based on dependence maximization between the selected features and the labels of an estimation problem, using the Hilbert-Schmidt Independence Criterion. The key idea is that good features should be highly dependent on the labels. Our approach leads to a greedy procedure for feature selection. We show that a number of existing feature selectors are special cases of this framework. Experiments on both artificial and real-world data show that our feature selector works well in practice.
http://www.kyb.tuebingen.mpg.de/fileadmin/user_upload/files/publications/2012/JMLR-2007-Song.pdf

A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, A. Smola (2012). A Kernel Two-Sample Test. Journal of Machine Learning Research 13:723–773.
Abstract: We propose a framework for analyzing and comparing distributions, which we use to construct statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS), and is called the maximum mean discrepancy (MMD). We present two distribution-free tests based on large deviation bounds for the MMD, and a third test based on the asymptotic distribution of this statistic. The MMD can be computed in quadratic time, although efficient linear time approximations are available. Our statistic is an instance of an integral probability metric, and various classical metrics on distributions are obtained when alternative function classes are used in place of an RKHS. We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests.

M. Thoma, H. Cheng, A. Gretton, J. Han, H.-P. Kriegel, A.J. Smola, L. Song, P.S. Yu, X. Yan, K.M. Borgwardt (2010). Discriminative frequent subgraph mining with optimality guarantees. Statistical Analysis and Data Mining 3(5):302–318.
Abstract: The goal of frequent subgraph mining is to detect subgraphs that frequently occur in a dataset of graphs. In classification settings, one is often interested in discovering discriminative frequent subgraphs, whose presence or absence is indicative of the class membership of a graph. In this article, we propose an approach to feature selection on frequent subgraphs, called CORK, that combines two central advantages. First, it optimizes a submodular quality criterion, which means that we can yield a near-optimal solution using greedy feature selection. Second, our submodular quality criterion can be integrated into gSpan, the state-of-the-art tool for frequent subgraph mining, and help to prune the search space for discriminative frequent subgraphs even during frequent subgraph mining.

L. Song, J. Bedo, K.M. Borgwardt, A. Gretton, A. Smola (2007). Gene selection via the BAHSIC family of algorithms. Bioinformatics 23(13, ISMB/ECCB 2007 Conference Proceedings):i490–i498.
Abstract: Motivation: Identifying significant genes among thousands of sequences on a microarray is a central challenge for cancer research in bioinformatics. The ultimate goal is to detect the genes that are involved in disease outbreak and progression. A multitude of methods have been proposed for this task of feature selection, yet the selected gene lists differ greatly between different methods.
To accomplish biologically meaningful gene selection from microarray data, we have to understand the theoretical connections and the differences between these methods. In this article, we define a kernel-based framework for feature selection based on the Hilbert–Schmidt independence criterion and backward elimination, called BAHSIC. We show that several well-known feature selectors are instances of BAHSIC, thereby clarifying their relationship. Furthermore, by choosing a different kernel, BAHSIC allows us to easily define novel feature selection algorithms. As a further advantage, feature selection via BAHSIC works directly on multiclass problems.
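The Hilbert–Schmidt independence criterion at the core of BAHSIC has a simple empirical form. As a rough illustration (not the authors' implementation), assuming a Gaussian kernel whose bandwidth `sigma` is a free parameter, the biased estimator tr(KHLH)/(n−1)² can be sketched as:

```python
import numpy as np

def gaussian_kernel(Z, sigma=1.0):
    """Pairwise Gaussian (RBF) kernel matrix for the rows of Z."""
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(X, Y, sigma=1.0):
    """Biased empirical HSIC: tr(K H L H) / (n - 1)^2."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    K = gaussian_kernel(X, sigma)
    L = gaussian_kernel(Y, sigma)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```

Features can then be ranked by how much this score drops when they are removed.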
Results: In a broad experimental evaluation, the members of the BAHSIC family reach high levels of accuracy and robustness when compared to other feature selection techniques. Experiments show that features selected with a linear kernel provide the best classification performance in general, but if strong non-linearities are present in the data then non-linear kernels can be more suitable.

K.M. Borgwardt, A. Gretton, M. Rasch, H.-P. Kriegel, B. Schölkopf, A. Smola (2006). Integrating Structured Biological Data by Kernel Maximum Mean Discrepancy. Bioinformatics 22 (ISMB 2006 Conference Proceedings):e49–e57.
Abstract: Motivation: Many problems in data integration in bioinformatics can be posed as one common question: Are two sets of observations generated by the same distribution? We propose a kernel-based statistical test for this problem, based on the fact that two distributions are different if and only if there exists at least one function having different expectation on the two distributions. Consequently we use the maximum discrepancy between function means as the basis of a test statistic.
The Maximum Mean Discrepancy (MMD) can take advantage of the kernel trick, which allows us to apply it not only to vectors, but also to strings, sequences, graphs, and other common structured data types arising in molecular biology.
Results: We study the practical feasibility of an MMD-based test on three central data integration tasks: Testing cross-platform comparability of microarray data, cancer diagnosis, and data-content based schema matching for two different protein function classification schemas. In all of these experiments, including high-dimensional ones, MMD is very accurate in finding samples that were generated from the same distribution, and outperforms its best competitors.
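For intuition, the MMD statistic described above reduces to averages of kernel evaluations. A minimal sketch of the biased estimate of the squared MMD, assuming a Gaussian kernel (the bandwidth `sigma` is a free parameter, not a value from the paper):

```python
import numpy as np

def k_rbf(A, B, sigma=1.0):
    """Gaussian kernel evaluated between the rows of A and the rows of B."""
    d2 = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Biased estimate of the squared MMD between samples X and Y."""
    return (k_rbf(X, X, sigma).mean()
            + k_rbf(Y, Y, sigma).mean()
            - 2.0 * k_rbf(X, Y, sigma).mean())
```

The statistic vanishes when the two samples coincide and grows as they separate.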
Conclusions: We have defined a novel statistical test of whether two samples are from the same distribution, compatible with both multivariate and structured data, that is fast, easy to implement, and works well, as confirmed by our experiments.

S.V.N. Vishwanathan, K.M. Borgwardt, O. Guttman, A.J. Smola (2006). Kernel extrapolation. Neurocomputing 69(7–9):721–729.
Abstract: We present a framework for efficient extrapolation of reduced rank approximations, graph kernels, and locally linear embeddings (LLE) to unseen data. We also present a principled method to combine many of these kernels and then extrapolate them. Central to our method is a theorem for matrix approximation, and an extension of the representer theorem to handle multiple joint regularization constraints. Experiments in protein classification demonstrate the feasibility of our approach.

K.M. Borgwardt, C.S. Ong, S. Schönauer, S.V.N. Vishwanathan, A.J. Smola, H.-P. Kriegel (2005). Protein function prediction via graph kernels. Bioinformatics 21(Suppl. 1, ISMB 2005 Proceedings):i47–i56.
Abstract: Motivation: Computational approaches to protein function prediction infer protein function by finding proteins with similar sequence, structure, surface clefts, chemical properties, amino acid motifs, interaction partners or phylogenetic profiles. We present a new approach that combines sequential, structural and chemical information into one graph model of proteins. We predict functional class membership of enzymes and non-enzymes using graph kernels and support vector machine classification on these protein graphs.
Results: Our graph model, derivable from protein sequence and structure only, is competitive with vector models that require additional protein information, such as the size of surface pockets. If we include this extra information into our graph model, our classifier yields significantly higher accuracy levels than the vector models. Hyperkernels allow us to select and to optimally combine the most relevant node attributes in our protein graphs. We have laid the foundation for a protein function prediction system that integrates protein information from various sources efficiently and effectively.
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/pdf3415.pdf

A.J. Smola, B. Schölkopf, K.-R. Müller (1998). The connection between regularization operators and support vector kernels. Neural Networks 11(4):637–649.
Abstract: In this paper a correspondence is derived between regularization operators used in regularization networks and support vector kernels. We prove that the Green's functions associated with regularization operators are suitable support vector kernels with equivalent regularization properties. Moreover, the paper provides an analysis of currently used support vector kernels in view of regularization theory and corresponding operators associated with the classes of both polynomial kernels and translation-invariant kernels. The latter are also analyzed on periodical domains. As a by-product we show that a large number of radial basis functions, namely conditionally positive definite functions, may be used as support vector kernels.

A. Gretton, K.M. Borgwardt, M. Rasch, B. Schölkopf, A. Smola (2007). A Kernel Method for the Two-Sample-Problem. In Advances in Neural Information Processing Systems 19 (NIPS 2006), Vancouver, BC, Canada, pp. 513–520.
Abstract: We propose two statistical tests to determine if two samples are from different distributions. Our test statistic is in both cases the distance between the means of the two samples mapped into a reproducing kernel Hilbert space (RKHS). The first test is based on a large deviation bound for the test statistic, while the second is based on the asymptotic distribution of this statistic. The test statistic can be computed in $O(m^2)$ time. We apply our approach to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where our test performs strongly. We also demonstrate excellent performance when comparing distributions over graphs, for which no alternative tests currently exist.
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/NIPS2006_0583_4193[0].pdf

J. Huang, A. Smola, A. Gretton, K.M. Borgwardt, B. Schölkopf (2007). Correcting Sample Selection Bias by Unlabeled Data. In Advances in Neural Information Processing Systems 19 (NIPS 2006), Vancouver, BC, Canada, pp. 601–608.
Abstract: We consider the scenario where training and test data are drawn from different distributions, commonly referred to as sample selection bias. Most algorithms for this setting try to first recover sampling distributions and then make appropriate corrections based on the distribution estimate. We present a nonparametric method which directly produces resampling weights without distribution estimation. Our method works by matching distributions between training and
testing sets in feature space. Experimental results demonstrate that our method works well in practice.
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/NIPS2006_0915_4194[0].pdf

A. Gretton, K.M. Borgwardt, M. Rasch, B. Schölkopf, A.J. Smola (2007). A Kernel Approach to Comparing Distributions. In Proceedings of the 22nd AAAI Conference on Artificial Intelligence (AAAI-07), Vancouver, BC, Canada, pp. 1637–1641.
Abstract: We describe a technique for comparing distributions without the need for density estimation as an intermediate step. Our approach relies on mapping the distributions into a reproducing kernel Hilbert space. We apply this technique to construct a two-sample test, which is used for determining whether two sets of observations arise from the same distribution. We use this test in attribute matching for databases using the Hungarian marriage method, where it performs strongly. We also demonstrate excellent performance when comparing distributions over graphs, for which no alternative tests currently exist.
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/Gretton_4426[0].pdf

L. Song, A.J. Smola, A. Gretton, K.M. Borgwardt (2007). A Dependence Maximization View of Clustering. In Proceedings of the 24th International Conference on Machine Learning (ICML 2007), Corvallis, OR, USA, pp. 815–822.
Abstract: We propose a family of clustering algorithms based on the maximization of dependence between the input variables and their cluster labels, as expressed by the Hilbert-Schmidt Independence Criterion (HSIC). Under this framework, we unify the geometric, spectral, and statistical dependence views of clustering, and subsume many existing algorithms as special cases (e.g. k-means and spectral clustering). Distinctive to our framework is that kernels can also be applied to the labels, which can endow them with particular structures. We also obtain a perturbation bound on the change in k-means clustering.
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/cluhsic_[0].pdf

L. Song, A.J. Smola, A. Gretton, K.M. Borgwardt, J. Bedo (2007). Supervised Feature Selection via Dependence Estimation. In Proceedings of the 24th International Conference on Machine Learning (ICML 2007), Corvallis, OR, USA, pp. 823–830.
Abstract: We introduce a framework for filtering features that employs the Hilbert-Schmidt Independence Criterion (HSIC) as a measure of dependence between the features and the labels. The key idea is that good features should maximise such dependence. Feature selection for various supervised learning problems (including classification and regression) is unified under this framework, and the solutions can be approximated using a backward-elimination algorithm.
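The backward-elimination scheme can be sketched in a few lines. This is a toy illustration rather than the released BAHSIC code: linear kernels, the `n_keep` parameter, and one-feature-at-a-time greedy removal are all simplifying assumptions introduced here.

```python
import numpy as np

def linear_hsic(X, y):
    """Biased HSIC with linear kernels on features X and labels y."""
    n = len(y)
    H = np.eye(n) - np.ones((n, n)) / n
    K = X @ X.T
    L = np.outer(y, y)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def backward_eliminate(X, y, n_keep=1):
    """Repeatedly drop the feature whose removal hurts HSIC(X, y) least."""
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        # score[i]: dependence retained if active[i] were removed
        scores = [linear_hsic(X[:, [f for f in active if f != j]], y)
                  for j in active]
        active.pop(int(np.argmax(scores)))
    return active
```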
We demonstrate the usefulness of our method on both artificial and real-world datasets.
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/ICML07_[0].pdf

K.M. Borgwardt, O. Guttman, S.V.N. Vishwanathan, A.J. Smola (2005). Joint Regularization. In Proceedings of the European Symposium on Artificial Neural Networks (ESANN 2005), Brugge, Belgium, pp. 455–460.
Abstract: We present a principled method to combine kernels under joint regularization constraints. Central to our method is an extension of the representer theorem for handling multiple joint regularization constraints. Experimental evidence shows the feasibility of our approach.
http://www.kyb.tuebingen.mpg.de/fileadmin/user_upload/files/publications/ESANN-005_Borgwardt.pdf

A. Gretton, A.J. Smola, O. Bousquet, R. Herbrich, A. Belitski, M. Augath, Y. Murayama, J. Pauls, B. Schölkopf, N.K. Logothetis (2005). Kernel Constrained Covariance for Dependence Measurement. In Proceedings of the International Workshop on Artificial Intelligence and Statistics (AISTATS 2005), Bridgetown, Barbados, pp. 112–119.
Abstract: We discuss reproducing kernel Hilbert space (RKHS)-based measures of statistical dependence, with emphasis on constrained covariance (COCO), a novel criterion to test dependence of random variables. We show that COCO is a test for independence if and only if the associated RKHSs are universal. That said, no independence test exists that can distinguish dependent and independent random variables in all circumstances. Dependent random variables can result in a COCO which is arbitrarily close to zero when the source densities are highly non-smooth. All current kernel-based independence tests share this behaviour. We demonstrate exponential convergence between the population and empirical COCO. Finally, we use COCO as a measure of joint neural activity between voxels in MRI recordings of the macaque monkey, and compare the results to the mutual information and the correlation. We also show the effect of removing breathing artefacts from the MRI recording.
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/pdf3174.pdf

Y. Altun, T. Hofmann, A.J. Smola (2004). Gaussian Process Classification for Segmenting and Annotating Sequences. In Proceedings of the 21st International Conference on Machine Learning (ICML 2004), Banff, Alberta, Canada.
Abstract: Many real-world classification tasks involve the prediction of multiple, inter-dependent class labels. A prototypical case of this sort deals with prediction of a sequence of labels for a sequence of observations.
Such problems arise naturally in the context of annotating and segmenting observation sequences. This paper generalizes Gaussian Process classification to predict multiple labels by taking dependencies between neighboring labels into account. Our approach is motivated by the desire to retain rigorous probabilistic semantics, while overcoming limitations of parametric methods like Conditional Random Fields, which exhibit conceptual and computational difficulties in high-dimensional input spaces. Experiments on named entity recognition and pitch accent prediction tasks demonstrate the competitiveness of our approach.
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/ICML2004-Altun_2740[0].pdf

C.S. Ong, X. Mary, S. Canu, A.J. Smola (2004). Learning with Non-Positive Kernels. In Proceedings of the 21st International Conference on Machine Learning (ICML 2004), Banff, Alberta, Canada.
Abstract: In this paper we show that many kernel methods can be adapted to deal with indefinite kernels, that is, kernels which are not positive semidefinite. They do not satisfy Mercer's condition and they induce associated functional spaces called Reproducing Kernel Kreĭn Spaces (RKKS), a generalization of Reproducing Kernel Hilbert Spaces (RKHS). Machine learning in RKKS shares many "nice" properties of learning in RKHS, such as orthogonality and projection. However, since the kernels are indefinite, we can no longer minimize the loss; instead we stabilize it. We show a general representer theorem for constrained stabilization and prove generalization bounds by computing the Rademacher averages of the kernel class. We list several examples of indefinite kernels and investigate regularization methods to solve spline interpolation. Some preliminary experiments with indefinite kernels for spline smoothing are reported for truncated spectral factorization, Landweber-Fridman iterations, and MR-II.
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/pdf3416.pdf

A.J. Smola, O. Mangasarian, B. Schölkopf (2002). Sparse kernel feature analysis. Passau, Germany, pp. 167–178.
Abstract: Kernel Principal Component Analysis (KPCA) has proven to be a versatile tool for unsupervised learning, however at a high computational cost due to the dense expansions in terms of kernel functions. We overcome this problem by proposing a new class of feature extractors employing ℓ1 norms in coefficient space instead of the Reproducing Kernel Hilbert Space in which KPCA was originally formulated. Moreover, the modified setting allows us to efficiently extract features which maximize criteria other than the variance, in a way similar to projection pursuit.

B. Schölkopf, R. Herbrich, A.J. Smola (2001). A Generalized Representer Theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory (COLT 2001), Amsterdam, The Netherlands, pp. 416–426.
Abstract: Wahba's classical representer theorem states that the solutions of certain risk minimization problems involving an empirical risk term and a quadratic regularizer can be written as expansions in terms of the training examples. We generalize the theorem to a larger class of regularizers and empirical risk terms, and give a self-contained proof utilizing the feature space associated with a kernel.
The result shows that a wide range of problems have optimal solutions that live in the finite-dimensional span of the training examples mapped into feature space, thus enabling us to carry out kernel algorithms independent of the (potentially infinite) dimensionality of the feature space.

R.C. Williamson, A.J. Smola, B. Schölkopf (2000). Entropy Numbers of Linear Function Classes. In Proceedings of the 13th Annual Conference on Computational Learning Theory (COLT 2000), Palo Alto, CA, USA, pp. 309–319.

R.C. Williamson, A.J. Smola, B. Schölkopf (1999). Entropy numbers, operators and support vector kernels. In Advances in Kernel Methods: Support Vector Learning, MIT Press, pp. 127–144.

B. Schölkopf, C.J.C. Burges, A.J. Smola (1999). Introduction to support vector learning. In Advances in Kernel Methods: Support Vector Learning, MIT Press, pp. 1–15.

B. Schölkopf, A.J. Smola, K.-R. Müller (1999). Kernel principal component analysis. In Advances in Kernel Methods: Support Vector Learning, MIT Press, pp. 327–352.
Abstract: A new method for performing a nonlinear form of Principal Component Analysis is proposed. By the use of integral operator kernel functions, one can efficiently compute principal components in high-dimensional feature spaces, related to input space by some nonlinear map; for instance the space of all possible d-pixel products in images.
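In matrix form, the kernel PCA computation amounts to double-centering the kernel matrix and reading off its leading eigenvectors. A hedged sketch of this linear-algebra outline (not the authors' implementation; the small-eigenvalue guard is an assumption added here for numerical safety):

```python
import numpy as np

def kernel_pca(K, n_components=2):
    """Project data onto leading kernel principal components, given kernel matrix K."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H                        # double-center the kernel matrix
    evals, evecs = np.linalg.eigh(Kc)     # eigh returns ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]
    # scale eigenvectors so each component axis has unit norm in feature space
    alphas = evecs[:, :n_components] / np.sqrt(np.maximum(evals[:n_components], 1e-12))
    return Kc @ alphas, evals
```

With a linear kernel K = X Xᵀ this reduces to ordinary PCA of X, up to sign.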
We give the derivation of the method and present experimental results on polynomial feature extraction for pattern recognition.

B. Schölkopf, C.J.C. Burges, A.J. Smola (1999). Roadmap. In Advances in Kernel Methods: Support Vector Learning, MIT Press, pp. 17–22.

K.-R. Müller, A.J. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen, V. Vapnik (1999). Using support vector machines for time series prediction. In Advances in Kernel Methods: Support Vector Learning, MIT Press, pp. 243–253.

A.J. Smola, N. Murata, B. Schölkopf, K.-R. Müller (1998). Asymptotically Optimal Choice of ε-Loss for Support Vector Machines. In Proceedings of the International Conference on Artificial Neural Networks (ICANN 1998), Skövde, Sweden, pp. 105–110.
Abstract: Under the assumption of asymptotically unbiased estimators we show that there exists a nontrivial choice of the insensitivity parameter in Vapnik's ε-insensitive loss function which scales linearly with the input noise of the training data. This finding is backed by experimental results.

A.J. Smola, B. Schölkopf, K.-R. Müller (1998). Convex Cost Functions for Support Vector Regression. In Proceedings of the International Conference on Artificial Neural Networks (ICANN 1998), Skövde, Sweden, pp. 99–104.
Abstract: The concept of Support Vector Regression is extended to a more general class of convex cost functions. It is shown how the resulting convex constrained optimization problems can be efficiently solved by a Primal-Dual Interior Point path following method. Both computational feasibility and improvement of estimation are demonstrated in the experiments.

B. Schölkopf, S. Mika, A.J. Smola, G. Rätsch, K.-R. Müller (1998). Kernel PCA pattern reconstruction via approximate pre-images. In Proceedings of the International Conference on Artificial Neural Networks (ICANN 1998), Skövde, Sweden, pp. 147–152.
Abstract: Algorithms based on Mercer kernels construct their solutions in terms of expansions in a high-dimensional feature space F. Previous work has shown that all algorithms which can be formulated in terms of dot products in F can be performed using a kernel without explicitly working in F.
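The dot-product property just mentioned can be checked directly on a small case: for 2-D inputs, the homogeneous polynomial kernel k(x, y) = (x · y)² equals an explicit dot product over the degree-2 monomial features (x₁², √2 x₁x₂, x₂²). A self-contained illustration:

```python
import numpy as np

def poly2_kernel(x, y):
    """Degree-2 homogeneous polynomial kernel k(x, y) = (x . y)^2."""
    return float(np.dot(x, y)) ** 2

def phi(x):
    """Explicit feature map for 2-D inputs: all degree-2 monomials."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])
```

The √2 factor makes the cross term count twice, matching the kernel's binomial expansion.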
The list of such algorithms includes support vector machines and nonlinear kernel principal component extraction. So far, however, it did not include the reconstruction of patterns from their largest nonlinear principal components, a technique which is common practice in linear principal component analysis. The present work proposes an idea for approximately performing this task. As an illustrative example, an application to the de-noising of data clusters is presented.

A.J. Smola, B. Schölkopf (1998). From regularization operators to support vector kernels. In Advances in Neural Information Processing Systems 10 (NIPS 1997), Denver, CO, USA, pp. 343–349.
Abstract: We derive the correspondence between regularization operators used in Regularization Networks and Hilbert-Schmidt kernels appearing in Support Vector Machines. More specifically, we prove that the Green's functions associated with regularization operators are suitable Support Vector kernels with equivalent regularization properties. As a by-product we show that a large number of Radial Basis Functions, namely conditionally positive definite functions, may be used as Support Vector kernels.
http://www.kyb.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS-1997-Smola.pdf

B. Schölkopf, P. Simard, A.J. Smola, V. Vapnik (1998). Prior knowledge in support vector kernels. In Advances in Neural Information Processing Systems 10 (NIPS 1997), Denver, CO, USA, pp. 640–646.
Abstract: We explore methods for incorporating prior knowledge about a problem at hand in Support Vector learning machines. We show that both invariances under group transformations and prior knowledge about locality in images can be incorporated by constructing appropriate kernel functions.
http://www.kyb.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS-1997-Schoelkopf.pdf

A.J. Smola, B. Schölkopf, K.-R. Müller (1998). General cost functions for Support Vector Regression. In Proceedings of the Australian Conference on Neural Networks (ACNN'98), Brisbane, Australia, pp. 79–83.
Abstract: The concept of Support Vector Regression is extended to a more general class of convex cost functions. Moreover it is shown how the resulting convex constrained optimization problems can be efficiently solved by a Primal-Dual Interior Point path following method. Both computational feasibility and improvement of estimation are demonstrated in the experiments.

B. Schölkopf, A.J. Smola, K.-R. Müller, C. Burges, V. Vapnik (1998). Support Vector methods in learning and feature extraction. In Proceedings of the Australian Conference on Neural Networks (ACNN'98), Brisbane, Australia, pp. 72–78.
Abstract: The last years have witnessed an increasing interest in Support Vector (SV) machines, which use Mercer kernels for efficiently performing computations in high-dimensional spaces. In pattern recognition, the SV algorithm constructs nonlinear decision functions by training a classifier to perform a linear separation in some high-dimensional space which is nonlinearly related to input space. Recently, we have developed a technique for Nonlinear Principal Component Analysis (Kernel PCA) based on the same types of kernels.
This way, we can for instance efficiently extract polynomial features of arbitrary order by computing projections onto principal components in the space of all products of n pixels of images. We explain the idea of Mercer kernels and associated feature spaces, and describe connections to the theory of reproducing kernels and to regularization theory, followed by an overview of the above algorithms employing these kernels.

A. Gretton, A.J. Smola, J. Huang, M. Schmittfull, K.M. Borgwardt, B. Schölkopf (2009). Covariate Shift by Kernel Mean Matching. In Dataset Shift in Machine Learning, MIT Press, Cambridge, MA, USA, pp. 131–160.
Abstract: Given sets of observations of training and test data, we consider the problem of re-weighting the training data such that its distribution more closely matches that of the test data. We achieve this goal by matching covariate distributions between training and test sets in a high-dimensional feature space (specifically, a reproducing kernel Hilbert space). This approach does not require distribution estimation. Instead, the sample weights are obtained by a simple quadratic programming procedure. We provide a uniform convergence bound on the distance between the reweighted training feature mean and the test feature mean, a transductive bound on the expected loss of an algorithm trained on the reweighted data, and a connection to single-class SVMs. While our method is designed to deal with the case of simple covariate shift (in the sense of Chapter ??), we have also found benefits for sample selection bias on the labels. Our correction procedure yields its greatest and most consistent advantages when the learning algorithm returns a classifier/regressor that is "simpler" than the data might suggest.
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/shift-book-for-LeEtAl-webversion_5376[0].pdf

A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, A. Smola (April 2008).

B. Schölkopf, J.C. Platt, J. Shawe-Taylor, A.J. Smola, R.C. Williamson (November 1999). Estimating the support of a high-dimensional distribution.

B. Schölkopf, J. Shawe-Taylor, A.J. Smola, R.C. Williamson (March 1999). Generalization Bounds via Eigenvalues of the Gram matrix.

A.J. Smola, O.L. Mangasarian, B. Schölkopf (1999). Sparse Kernel Feature Analysis.

A.J. Smola, R.C. Williamson, B. Schölkopf (September 1998). Generalization bounds and learning rates for Regularized principal manifolds.

A.J. Smola, S. Mika, B. Schölkopf (September 1998). Quantization Functionals and Regularized Principal Manifolds.

A.J. Smola, R.C. Williamson, B. Schölkopf (August 1998). Generalization Bounds for Convex Combinations of Kernel Functions.

C. Saunders, M.O. Stitson, J. Weston, L. Bottou, B. Schölkopf, A.J. Smola (1998). Support Vector Machine Reference Manual.

B. Schölkopf, R. Williamson, A.J. Smola, J. Shawe-Taylor (1999). Single-class Support Vector Machines. Dagstuhl-Seminar on Unsupervised Learning, Dagstuhl, Germany, pp. 19–20.
Abstract: Suppose you are given some dataset drawn from an underlying probability distribution P and you want to estimate a subset
S of input space such that the probability that a test point drawn from P lies outside of S is bounded by some a priori specified ν, 0 < ν ≤ 1. We propose an algorithm to deal with this problem by trying to estimate a function f which is positive on S and negative on the complement of S. The functional form of f is given by a kernel expansion in terms of a potentially small subset of the training data; it is regularized by controlling the length of the weight vector in an associated feature space. We can prove that ν upper-bounds the fraction of outliers (training points outside of S) and lower-bounds the fraction of support vectors. Asymptotically, under some mild condition on P, both become equalities. The algorithm is a natural extension of the support vector algorithm to the case of unlabelled data.
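For reference, the optimization problem behind the algorithm just described, in the form it is usually written (m training points, feature map Φ into the associated feature space):

```latex
\min_{w,\ \xi,\ \rho}\quad
  \frac{1}{2}\,\lVert w\rVert^{2}
  \;+\; \frac{1}{\nu m}\sum_{i=1}^{m}\xi_i \;-\; \rho
\qquad \text{s.t.}\qquad
  \langle w, \Phi(x_i)\rangle \;\ge\; \rho - \xi_i,
  \qquad \xi_i \ge 0 .
```

The decision function is f(x) = sgn(⟨w, Φ(x)⟩ − ρ): points with f(x) < 0 fall outside the estimated region S, and the parameter ν plays exactly the dual role stated above, upper-bounding the outlier fraction and lower-bounding the support-vector fraction.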