4269
1
GH
Bakir
T
Hofmann
B
Schölkopf
AJ
Smola
B
Taskar
SVN
Vishwanathan
MIT Press
Cambridge, MA, USA
2007-09-00
Machine learning develops intelligent computer systems that are able to generalize from previously seen examples. A new domain of machine learning, in which the prediction must satisfy the additional constraints found in structured data, poses one of machine learning's greatest challenges: learning functional dependencies between arbitrary input and output domains. This volume presents and analyzes the state of the art in machine learning algorithms and theory in this novel field. The contributors discuss applications as diverse as machine translation, document markup, computational biology, and information extraction, among others, providing a timely overview of an exciting field.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
360
Predicting Structured Data
15017
15420
973
1
B
Schölkopf
AJ
Smola
MIT Press
Cambridge, MA, USA
2002-12-00
In the 1990s, a new type of learning algorithm was developed, based on results from statistical learning theory: the Support Vector Machine (SVM). This gave rise to a new class of theoretically elegant learning machines that use a central concept of SVMs (kernels) for a number of learning tasks. Kernel machines provide a modular framework that can be adapted to different tasks and domains by the choice of the kernel function and the base algorithm. They are replacing neural networks in a variety of fields, including engineering, information retrieval, and bioinformatics.
Learning with Kernels provides an introduction to SVMs and related kernel methods. Although the book begins with the basics, it also includes the latest research. It provides all of the concepts necessary to enable a reader equipped with some basic mathematical knowledge to enter the world of machine learning using theoretically well-founded yet easy-to-use kernel algorithms and to understand and apply the powerful algorithms that have been developed over the last few years.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
644
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond
15017
15420
727
1
B
Schölkopf
CJC
Burges
AJ
Smola
MIT Press
Cambridge, MA, USA
1999-00-00
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
352
Advances in Kernel Methods: Support Vector Learning
SongSGBB2012
3
L
Song
A
Smola
A
Gretton
J
Bedo
K
Borgwardt
2012-05-00
13
1393
1434
We introduce a framework for feature selection based on dependence maximization between the selected features and the labels of an estimation problem, using the Hilbert-Schmidt Independence Criterion. The key idea is that good features should be highly dependent on the labels. Our approach leads to a greedy procedure for feature selection. We show that a number of existing feature selectors are special cases of this framework. Experiments on both artificial and real-world data show that our feature selector works well in practice.
no
notspecified
http://www.kyb.tuebingen.mpg.de/fileadmin/user_upload/files/publications/2012/JMLR-2007-Song.pdf
published
41
Feature Selection via Dependence Maximization
15017
20755
Scholkopf2012
3
A
Gretton
K
Borgwardt
M
Rasch
B
Schölkopf
A
Smola
2012-03-00
13
723
773
We propose a framework for analyzing and comparing distributions, which we use to construct statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS), and is called the maximum mean discrepancy (MMD). We present two distribution-free tests based on large deviation bounds for the MMD, and a third test based on the asymptotic distribution of this statistic. The MMD can be computed in quadratic time, although efficient linear time approximations are available. Our statistic is an instance of an integral probability metric, and various classical metrics on distributions are obtained when alternative function classes are used in place of an RKHS. We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
50
A Kernel Two-Sample Test
15017
15420
15017
20755
15017
15421
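The abstract of the preceding record ("A Kernel Two-Sample Test") describes the maximum mean discrepancy (MMD) as a quadratic-time statistic computed in an RKHS. The following is a minimal NumPy sketch of one common biased estimate of the squared MMD, not the authors' implementation. The Gaussian kernel, the bandwidth `sigma`, and the function names `gaussian_kernel` and `mmd2_biased` are assumptions chosen for illustration; the test thresholds discussed in the abstract are not computed here.

```python
import numpy as np


def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))


def mmd2_biased(x, y, sigma=1.0):
    """Biased estimate of the squared MMD between samples x and y.

    Every kernel matrix is formed in full, so the cost is quadratic
    in the sample sizes.
    """
    k_xx = gaussian_kernel(x, x, sigma)
    k_yy = gaussian_kernel(y, y, sigma)
    k_xy = gaussian_kernel(x, y, sigma)
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()


rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y_same = rng.normal(size=(200, 2))              # same distribution as x
y_shifted = rng.normal(loc=2.0, size=(200, 2))  # mean-shifted distribution

mmd_same = mmd2_biased(x, y_same)
mmd_shifted = mmd2_biased(x, y_shifted)
print(mmd_same, mmd_shifted)
```

In this sketch the estimate for the mean-shifted sample should be clearly larger than the estimate for two samples from the same distribution; turning the statistic into a test requires a threshold, which is omitted.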
ThomaCGHKSSYYB2010
3
M
Thoma
H
Cheng
A
Gretton
J
Han
H-P
Kriegel
AJ
Smola
L
Song
PS
Yu
X
Yan
KM
Borgwardt
2010-10-00
5
3
302
318
The goal of frequent subgraph mining is to detect subgraphs that frequently occur in a dataset of graphs. In classification settings, one is often interested in discovering discriminative frequent subgraphs, whose presence or absence is indicative of the class membership of a graph. In this article, we propose an approach to feature selection on frequent subgraphs, called CORK, that combines two central advantages. First, it optimizes a submodular quality criterion, which means that we can yield a near-optimal solution using greedy feature selection. Second, our submodular quality function criterion can be integrated into gSpan, the state-of-the-art tool for frequent subgraph mining, and help to prune the search space for discriminative frequent subgraphs even during frequent subgraph mining.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
16
Discriminative frequent subgraph mining with optimality guarantees
15017
15420
15017
20755
4268
3
T
Hofmann
B
Schölkopf
AJ
Smola
2008-06-00
3
36
1171
1220
We review machine learning methods employing positive definite kernels. These methods formulate learning and estimation problems in a reproducing kernel Hilbert space (RKHS) of functions defined on the data domain, expanded in terms of a kernel. Working in linear spaces of functions has the benefit of facilitating the construction and analysis of learning algorithms while at the same time allowing large classes of functions. The latter include nonlinear functions as well as functions defined on nonvectorial data.
no
notspecified
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/pdf4268mod_4268[0].pdf
published
49
Kernel Methods in Machine Learning
15017
15420
4768
3
S
Sonnenburg
ML
Braun
CS
Ong
S
Bengio
L
Bottou
G
Holmes
Y
LeCun
K-R
Müller
F
Pereira
CE
Rasmussen
G
Rätsch
B
Schölkopf
A
Smola
P
Vincent
J
Weston
RC
Williamson
2007-10-00
8
2443
2466
Open source tools have recently reached a level of maturity which makes them suitable for building large-scale real-world systems. At the same time, the field of machine learning has developed a large body of powerful learning algorithms for diverse applications. However, the true potential of these methods is not realized, since existing implementations are not openly shared, resulting in software with low usability, and weak interoperability. We argue that this situation can be significantly improved by increasing incentives for researchers to publish their software under an open source model. Additionally, we outline the problems authors are faced with when trying to publish algorithmic implementations of machine learning methods. We believe that a resource of peer reviewed software accompanied by short articles would be highly valuable to both the machine learning and the general scientific community.
no
notspecified
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/JMLR-8-Sonnenburg_4768[0].pdf
published
23
The Need for Open Source Software in Machine Learning
15017
15420
VishwanathanBGS2006
3
SVN
Vishwanathan
KM
Borgwardt
O
Guttman
AJ
Smola
2006-03-00
7-9
69
721
729
We present a framework for efficient extrapolation of reduced rank approximations, graph kernels, and locally linear embeddings (LLE) to unseen data. We also present a principled method to combine many of these kernels and then extrapolate them. Central to our method is a theorem for matrix approximation, and an extension of the representer theorem to handle multiple joint regularization constraints. Experiments in protein classification demonstrate the feasibility of our approach.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
8
Kernel extrapolation
3779
3
A
Gretton
R
Herbrich
A
Smola
O
Bousquet
B
Schölkopf
2005-12-00
6
2075
2129
We introduce two new functionals, the constrained covariance and the kernel mutual information, to measure the degree of independence of random variables. These quantities are both based on the covariance between functions of the random variables in reproducing kernel Hilbert spaces (RKHSs). We prove that when the RKHSs are universal, both functionals are zero if and only if the random variables are pairwise independent.
We also show that the kernel mutual information is an upper bound near independence on the Parzen window estimate of the mutual information.
Analogous results apply for two correlation-based dependence functionals introduced earlier: we show the kernel canonical correlation and the kernel generalised variance to be independence measures for universal kernels, and prove the latter to be an upper bound on the mutual information near independence. The performance of the kernel dependence functionals in measuring independence is verified in the context of independent component analysis.
no
notspecified
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/gretton05a_3779[0].pdf
published
54
Kernel Methods for Measuring Independence
15017
15420
3512
3
CS
Ong
A
Smola
R
Williamson
2005-07-00
6
1043
1071
This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional.
We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach.
no
notspecified
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/ong05hyperkernel_[0].pdf
published
28
Learning the Kernel with Hyperkernels
15017
15420
4679
3
A
Chalimourda
B
Schölkopf
AJ
Smola
2005-03-00
2
18
205
205
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
0
Experimentally optimal ν in support vector regression for different noise models and parameter settings
15017
15420
1832
3
AJ
Smola
B
Schölkopf
2004-08-00
3
14
199
222
In this tutorial we give an overview of the basic ideas underlying Support Vector (SV) machines for function estimation. Furthermore, we include a summary of currently used algorithms for training SV machines, covering both the quadratic (or convex) programming part and advanced methods for dealing with large datasets. Finally, we mention some modifications and extensions that have been applied to the standard SV algorithm, and discuss the aspect of regularization from a SV perspective.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
23
A Tutorial on Support Vector Regression
15017
15420
2102
3
ABA
Graf
AJ
Smola
S
Borer
2003-05-00
3
14
597
605
This paper discusses classification using support vector machines in a normalized feature space. We consider both normalization in input space and in feature space. Exploiting the fact that in this setting all points lie on the surface of a unit hypersphere we replace the optimal separating hyperplane by one that is symmetric in its angles, leading to an improved estimator. Evaluation of these considerations is done in numerical experiments on two real-world datasets. The stability to noise of this offset correction is subsequently investigated as well as its optimality.
no
notspecified
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/pdf2102.pdf
published
8
Classification in a Normalized Feature Space using Support Vector Machines
15017
15422
1844
3
S
Mika
G
Rätsch
J
Weston
B
Schölkopf
AJ
Smola
K-R
Müller
2003-05-00
5
25
623
628
We incorporate prior knowledge to construct nonlinear algorithms for invariant feature extraction and discrimination. Employing a unified framework in terms of a nonlinearized variant of the Rayleigh coefficient, we propose nonlinear generalizations of Fisher's discriminant and oriented PCA using support vector kernel functions. Extensive simulations show the utility of our approach.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
5
Constructing descriptive and discriminative nonlinear features: Rayleigh coefficients in kernel feature spaces
15017
15420
787
3
RC
Williamson
AJ
Smola
B
Schölkopf
2001-09-00
6
47
2516
2532
We derive new bounds for the generalization error of kernel machines, such as support vector machines and related regularization networks, by obtaining new bounds on their covering numbers. The proofs make use of a viewpoint that is apparently novel in the field of statistical learning theory. The hypothesis class is described in terms of a linear operator mapping from a possibly infinite-dimensional unit ball in feature space into a finite-dimensional space. The covering numbers of the class are then determined via the entropy numbers of the operator. These numbers, which characterize the degree of compactness of the operator, can be bounded in terms of the eigenvalues of an integral operator induced by the kernel function used by the machine. As a consequence, we are able to theoretically explain the effect of the choice of kernel function on the generalization performance of support vector machines.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
16
Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators
785
3
AJ
Smola
S
Mika
B
Schölkopf
RC
Williamson
2001-06-00
1
179
209
Many settings of unsupervised learning can be viewed as quantization problems: the minimization of the expected quantization error subject to some restrictions. This allows the use of tools such as regularization from the theory of (supervised) risk minimization for unsupervised learning. This setting turns out to be closely related to principal curves, the generative topographic map, and robust coding. We explore this connection in two ways: (1) we propose an algorithm for finding principal manifolds that can be regularized in a variety of ways; and (2) we derive uniform convergence bounds and hence bounds on the learning rates of the algorithm. In particular, we give bounds on the covering numbers which allow us to obtain nearly optimal learning rates for certain types of regularization operators. Experimental results demonstrate the feasibility of the approach.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
30
Regularized principal manifolds
15017
15420
970
3
B
Schölkopf
JC
Platt
J
Shawe-Taylor
AJ
Smola
RC
Williamson
2001-03-00
7
13
1443
1471
Suppose you are given some data set drawn from an underlying probability distribution P and you want to estimate a simple subset S of input space such that the probability that a test point drawn from P lies outside of S equals some a priori specified value between 0 and 1.
We propose a method to approach this problem by trying to estimate a function f that is positive on S and negative on the complement. The functional form of f is given by a kernel expansion in terms of a potentially small subset of the training data; it is regularized by controlling the length of the weight vector in an associated feature space. The expansion coefficients are found by solving a quadratic programming problem, which we do by carrying out sequential optimization over pairs of input patterns. We also provide a theoretical analysis of the statistical performance of our algorithm.
The algorithm is a natural extension of the support vector algorithm to the case of unlabeled data.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
28
Estimating the support of a high-dimensional distribution
734
3
B
Schölkopf
AJ
Smola
RC
Williamson
PL
Bartlett
2000-05-00
5
12
1207
1245
We propose a new class of support vector algorithms for regression and classification. In these algorithms, a parameter ν lets one effectively control the number of support vectors. While this can be useful in its own right, the parameterization has the additional benefit of enabling us to eliminate one of the other free parameters of the algorithm: the accuracy parameter ε in the regression case, and the regularization constant C in the classification case. We describe the algorithms, give some theoretical results concerning the meaning and the choice of ν, and report experimental results.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
38
New Support Vector Algorithms
216
3
B
Schölkopf
S
Mika
CJC
Burges
P
Knirsch
K-R
Müller
G
Rätsch
AJ
Smola
1999-09-00
5
10
1000
1017
This paper collects some ideas targeted at advancing our understanding of the feature spaces associated with support vector (SV) kernel functions. We first discuss the geometry of feature space. In particular, we review what is known about the shape of the image of input space under the feature space map, and how this influences the capacity of SV methods. Following this, we describe how the metric governing the intrinsic geometry of the mapped surface can be computed in terms of the kernel, using the example of the class of inhomogeneous polynomial kernels, which are often used in SV pattern recognition. We then discuss the connection between feature space and input space by dealing with the question of how one can, given some vector in feature space, find a preimage (exact or approximate) in input space. We describe algorithms to tackle this issue, and show their utility in two applications of kernel methods. First, we use it to reduce the computational complexity of SV decision functions; second, we combine it with the kernel PCA algorithm, thereby constructing a nonlinear statistical denoising technique which is shown to perform well on real-world data.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
17
Input space versus feature space in kernel-based methods
15017
15422
733
3
B
Schölkopf
K-R
Müller
AJ
Smola
1999-09-00
3
14
154
163
We describe recent developments and results of statistical learning theory. In the framework of learning from examples, two factors control generalization ability: explaining the training data by a learning machine of a suitable complexity. We describe kernel algorithms in feature spaces as elegant and efficient methods of realizing such machines. Examples thereof are Support Vector Machines (SVM) and Kernel PCA (Principal Component Analysis). More important than any individual example of a kernel algorithm, however, is the insight that any algorithm that can be cast in terms of dot products can be generalized to a nonlinear setting using kernels.
Finally, we illustrate the significance of kernel algorithms by briefly describing industrial and academic applications, including ones where we obtained benchmark record results.
no
notspecified
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/pdf733.pdf
published
9
Lernen mit Kernen: Support-Vektor-Methoden zur Analyse hochdimensionaler Daten
15017
15422
948
3
AJ
Smola
B
Schölkopf
1998-09-00
1-2
22
211
231
We present a kernel-based framework for pattern recognition, regression estimation, function approximation, and multiple operator inversion. Adopting a regularization-theoretic framework, the above are formulated as constrained optimization problems. Previous approaches such as ridge regression, support vector methods, and regularization networks are included as special cases. We show connections between the cost function and some properties up to now believed to apply to support vector machines only. For appropriately chosen cost functions, the optimal solution of all the problems described above can be found by solving a simple quadratic programming problem.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
20
On a Kernel-Based Method for Pattern Recognition, Regression, Approximation, and Operator Inversion
15017
15422
730
3
B
Schölkopf
AJ
Smola
K-R
Müller
1998-07-00
5
10
1299
1319
A new method for performing a nonlinear form of principal component analysis is proposed. By the use of integral operator kernel functions, one can efficiently compute principal components in high-dimensional feature spaces, related to input space by some nonlinear map—for instance, the space of all possible five-pixel products in 16 × 16 images. We give the derivation of the method and present experimental results on polynomial feature extraction for pattern recognition.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
20
Nonlinear Component Analysis as a Kernel Eigenvalue Problem
15017
15422
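The abstract of the preceding record ("Nonlinear Component Analysis as a Kernel Eigenvalue Problem") describes computing principal components in a feature space through kernel functions. The steps can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the homogeneous polynomial kernel of degree 2, the function name `kernel_pca`, and the random toy data are all chosen here for the example.

```python
import numpy as np


def kernel_pca(x, n_components=2, degree=2):
    """Project the rows of x onto the leading kernel principal components.

    Uses a homogeneous polynomial kernel k(a, b) = (a . b) ** degree
    (an assumption made for this sketch).
    """
    k = (x @ x.T) ** degree
    m = k.shape[0]

    # Centre the data in feature space using only the kernel matrix.
    one = np.full((m, m), 1.0 / m)
    k_c = k - one @ k - k @ one + one @ k @ one

    # Eigendecomposition of the centred kernel matrix (ascending order).
    eigvals, eigvecs = np.linalg.eigh(k_c)
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Rescale the coefficients so the feature-space eigenvectors have unit norm.
    alphas = eigvecs / np.sqrt(eigvals)

    # Projections of the training points onto the components.
    return k_c @ alphas, eigvals


rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))
projections, eigvals = kernel_pca(x)
print(projections.shape)
```

The sketch only projects the training points; projecting new points would require centring their kernel evaluations against the training set in the same way.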
732
3
AJ
Smola
B
Schölkopf
K-R
Müller
1998-06-00
4
11
637
649
In this paper a correspondence is derived between regularization operators used in regularization networks and support vector kernels. We prove that the Green's functions associated with regularization operators are suitable support vector kernels with equivalent regularization properties. Moreover, the paper provides an analysis of currently used support vector kernels in view of regularization theory and corresponding operators associated with the classes of both polynomial kernels and translation invariant kernels. The latter are also analyzed on periodical domains. As a by-product we show that a large number of radial basis functions, namely conditionally positive definite functions, may be used as support vector kernels.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
12
The connection between regularization operators and support vector kernels
5465
7
X
Zhang
L
Song
A
Gretton
A
Smola
Vancouver, BC, Canada
2009-06-00
1937
1944
Many machine learning algorithms can be formulated in the framework of statistical independence, using measures such as the Hilbert-Schmidt Independence Criterion. In this paper, we extend this criterion to deal with structured and interdependent observations. This is achieved by modeling the structures using undirected graphical models and comparing the Hilbert space embeddings of distributions. We apply this new criterion to independent component analysis and sequence clustering.
no
notspecified
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/NIPS2008-Zhang_5465[0].pdf
published
7
Kernel Measures of Independence for Non-IID Data
15017
15420
5666
7
M
Thoma
H
Cheng
A
Gretton
J
Han
H-P
Kriegel
AJ
Smola
L
Song
PS
Yu
X
Yan
KM
Borgwardt
Sparks, NV, USA
2009-05-00
1076
1087
Graph classification is an increasingly important step in numerous application domains, such as function prediction of molecules and proteins, computerised scene analysis, and anomaly detection in program flows. Among the various approaches proposed in the literature, graph classification based on frequent subgraphs is a popular branch: Graphs are represented as (usually binary) vectors, with components indicating whether a graph contains a particular subgraph that is frequent across the dataset. On large graphs, however, one faces the enormous problem that the number of these frequent subgraphs may grow exponentially with the size of the graphs, but only few of them possess enough discriminative power to make them useful for graph classification. Efficient and discriminative feature selection among frequent subgraphs is hence a key challenge for graph mining. In this article, we propose an approach to feature selection on frequent subgraphs, called CORK, that combines two central advantages. First, it optimizes a submodular quality criterion, which means that we can yield a near-optimal solution using greedy feature selection. Second, our submodular quality function criterion can be integrated into gSpan, the state-of-the-art tool for frequent subgraph mining, and help to prune the search space for discriminative frequent subgraphs even during frequent subgraph mining.
no
notspecified
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/SDM2009-Thoma_5666[0].pdf
published
11
Near-optimal supervised feature selection among frequent subgraphs
15017
15420
15017
20755
4928
7
A
Gretton
K
Fukumizu
CH
Teo
L
Song
B
Schölkopf
AJ
Smola
Vancouver, BC, Canada
2008-09-00
585
592
Whereas kernel measures of independence have been widely applied in machine learning (notably in kernel ICA), there is as yet no method to determine whether they have detected statistically significant dependence. We provide a novel test of the independence hypothesis for one particular kernel independence measure, the Hilbert-Schmidt independence criterion (HSIC). The resulting test costs O(m^2), where m is the sample size. We demonstrate that this test outperforms established contingency table-based tests. Finally, we show the HSIC test also applies to text (and to structured data more generally), for which no other independence test presently exists.
no
notspecified
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/NIPS2007-Gretton_[0].pdf
published
7
A Kernel Statistical Test of Independence
15017
15420
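The abstract of the preceding record ("A Kernel Statistical Test of Independence") refers to the Hilbert-Schmidt independence criterion (HSIC) and an O(m^2) cost in the sample size m. The following NumPy sketch computes one common biased empirical HSIC estimate, trace(K H L H) / m^2, in quadratic time. It is a hedged illustration rather than the authors' implementation: the Gaussian kernels, the bandwidth `sigma`, the 1/m^2 normalisation, and the function names are assumptions, and the significance threshold of the test is not computed.

```python
import numpy as np


def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))


def hsic_biased(x, y, sigma=1.0):
    """Biased empirical HSIC, trace(K H L H) / m**2, for paired samples x and y."""
    m = x.shape[0]
    k = gaussian_kernel(x, x, sigma)
    l = gaussian_kernel(y, y, sigma)

    # H K H without forming H: subtract row and column means, add the grand mean.
    k_centered = (
        k
        - k.mean(axis=0, keepdims=True)
        - k.mean(axis=1, keepdims=True)
        + k.mean()
    )

    # trace(H K H L) equals the elementwise sum because both matrices are symmetric.
    return np.sum(k_centered * l) / m**2


rng = np.random.default_rng(0)
x = rng.normal(size=(300, 1))
y_dependent = x + 0.1 * rng.normal(size=(300, 1))  # strongly dependent on x
y_independent = rng.normal(size=(300, 1))          # drawn independently of x

hsic_dep = hsic_biased(x, y_dependent)
hsic_indep = hsic_biased(x, y_independent)
print(hsic_dep, hsic_indep)
```

Centring by subtracting means, rather than multiplying by the centring matrix, keeps the whole computation at quadratic cost in this sketch.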
4929
7
L
Song
AJ
Smola
K
Borgwardt
A
Gretton
Vancouver, BC, Canada
2008-09-00
1385
1392
Maximum variance unfolding (MVU) is an effective heuristic for dimensionality reduction. It produces a low-dimensional representation of the data by maximizing the variance of their embeddings while preserving the local distances of the original data. We show that MVU also optimizes a statistical dependence measure which aims to retain the identity of individual observations under the distance-preserving constraints. This general view allows us to design "colored" variants of MVU, which produce low-dimensional representations for a given task, e.g. subject to class labels or other side information.
no
notspecified
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/NIPS2007-Song_[0].pdf
published
7
Colored Maximum Variance Unfolding
15017
15420
5155
7
L
Song
X
Zhang
A
Smola
A
Gretton
B
Schölkopf
Helsinki, Finland
2008-07-00
992
999
Moment matching is a popular means of parametric density estimation. We extend this technique to nonparametric estimation of mixture models. Our approach works by embedding distributions into a reproducing kernel Hilbert space, and performing moment matching in that space. This allows us to tailor density estimators to a function class of interest (i.e., for which we would like to compute expectations). We show our density estimation approach is useful in applications such as message compression in graphical models, and image classification and retrieval.
no
notspecified
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/ICML2008-Gretton_[0].pdf
published
7
Tailoring density estimation via reproducing kernel moment matching
15017
15420
4644
7
AJ
Smola
A
Gretton
L
Song
B
Schölkopf
Sendai, Japan
2007-10-00
40
41
While kernel methods are the basis of many popular techniques in supervised learning, they are less commonly used in testing, estimation, and analysis of probability distributions, where information theoretic approaches rule the roost. However, it becomes difficult to estimate mutual information or entropy if the data are high dimensional.
no
notspecified
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/DS-2007-Gretton_[0].pdf
published
1
A Hilbert Space Embedding for Distributions
15017
15420
4645
7
A
Smola
A
Gretton
L
Song
B
Schölkopf
Sendai, Japan
2007-10-00
13
31
We describe a technique for comparing distributions without the need for density estimation as an intermediate step. Our approach relies on mapping the distributions into a reproducing kernel Hilbert space. Applications of this technique can be found in two-sample tests, which are used for determining whether two sets of observations arise from the same distribution, covariate shift correction, local learning, measures of independence, and density estimation.
no
notspecified
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/ALT-2007-Gretton_[0].pdf
published
18
A Hilbert Space Embedding for Distributions
15017
15420
4193
7
A
Gretton
KM
Borgwardt
M
Rasch
B
Schölkopf
A
Smola
Vancouver, BC, Canada
2007-09-00
513
520
We propose two statistical tests to determine if two samples are from different distributions. Our test statistic is in both cases the distance between the means of the two samples mapped into a reproducing kernel Hilbert space (RKHS). The first test is based on a large deviation bound for the test statistic, while the second is based on the asymptotic distribution of this statistic. The test statistic can be computed in $O(m^2)$ time. We apply our approach to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where our test performs strongly. We also demonstrate excellent performance when comparing distributions over graphs, for which no alternative tests currently exist.
no
notspecified
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/NIPS2006_0583_4193[0].pdf
published
7
A Kernel Method for the Two-Sample-Problem
15017
15420
4194
7
J
Huang
A
Smola
A
Gretton
KM
Borgwardt
B
Schölkopf
Vancouver, BC, Canada
2007-09-00
601
608
We consider the scenario where training and test data are drawn from different distributions, commonly referred to as sample selection bias. Most algorithms for this setting try to first recover sampling distributions and then make appropriate corrections based on the distribution estimate. We present a nonparametric method which directly produces resampling weights without distribution estimation. Our method works by matching distributions between training and testing sets in feature space. Experimental results demonstrate that our method works well in practice.
no
notspecified
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/NIPS2006_0915_4194[0].pdf
published
7
Correcting Sample Selection Bias by Unlabeled Data
15017
15420
TeoSVL2007
7
CH
Teo
A
Smola
SVN
Vishwanathan
QV
Le
San Jose, CA, USA
2007-08-00
727
736
A wide variety of machine learning problems can be described as minimizing a regularized risk functional, with different algorithms using different notions of risk and different regularizers. Examples include linear Support Vector Machines (SVMs), Logistic Regression, Conditional Random Fields (CRFs), and Lasso amongst others. This paper describes the theory and implementation of a highly scalable and modular convex solver which solves all these estimation problems. It can be parallelized on a cluster of workstations, allows for data-locality, and can deal with regularizers such as l1 and l2 penalties. At present, our solver implements 20 different estimation problems, can be easily extended, scales to millions of observations, and is up to 10 times faster than specialized solvers for many applications. The open source code is freely available as part of the ELEFANT toolbox.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
9
A scalable modular convex solver for regularized risk minimization
15017
15420
4426
7
A
Gretton
KM
Borgwardt
M
Rasch
B
Schölkopf
AJ
Smola
Vancouver, BC, Canada
2007-07-00
1637
1641
We describe a technique for comparing distributions without the need for density estimation as an intermediate step. Our approach relies on mapping the distributions into a Reproducing Kernel Hilbert Space. We apply this technique to construct a two-sample test, which is used for determining whether two sets of observations arise from the same distribution. We use this test in attribute matching for databases using the Hungarian marriage method, where it performs strongly. We also demonstrate excellent performance when comparing distributions over graphs, for which no alternative tests currently exist.
no
notspecified
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/Gretton_4426[0].pdf
published
4
A Kernel Approach to Comparing Distributions
15017
15420
4764
7
L
Song
J
Bedo
KM
Borgwardt
A
Gretton
A
Smola
Wien, Austria
2007-07-00
i490
i498
Motivation: Identifying significant genes among thousands of sequences on a microarray is a central challenge for cancer research in bioinformatics. The ultimate goal is to detect the genes that are involved in disease outbreak and progression. A multitude of methods have been proposed for this task of feature selection, yet the selected gene lists differ greatly between different methods. To accomplish biologically meaningful gene selection from microarray data, we have to understand the theoretical connections and the differences between these methods. In this article, we define a kernel-based framework for feature selection based on the Hilbert–Schmidt independence criterion and backward elimination, called BAHSIC. We show that several well-known feature selectors are instances of BAHSIC, thereby clarifying their relationship. Furthermore, by choosing a different kernel, BAHSIC allows us to easily define novel feature selection algorithms. As a further advantage, feature selection via BAHSIC works directly on multiclass problems.
Results: In a broad experimental evaluation, the members of the BAHSIC family reach high levels of accuracy and robustness when compared to other feature selection techniques. Experiments show that features selected with a linear kernel provide the best classification performance in general, but if strong non-linearities are present in the data then non-linear kernels can be more suitable.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
0
Gene selection via the BAHSIC family of algorithms
15017
15420
4471
7
L
Song
AJ
Smola
A
Gretton
KM
Borgwardt
Corvallis, OR, USA
2007-06-00
815
822
We propose a family of clustering algorithms based on the maximization of dependence between the input variables and their cluster labels, as expressed by the Hilbert-Schmidt Independence Criterion (HSIC). Under this framework, we unify the geometric, spectral, and statistical dependence views of clustering, and subsume many existing algorithms as special cases (e.g. k-means and spectral clustering). Distinctive to our framework is that kernels can also be applied on the labels, which can endow them with particular structures. We also obtain a perturbation bound on the change in k-means clustering.
no
notspecified
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/cluhsic_[0].pdf
published
7
A Dependence Maximization View of Clustering
15017
15420
4462
7
L
Song
AJ
Smola
A
Gretton
KM
Borgwardt
J
Bedo
Corvallis, OR, USA
2007-06-00
823
830
We introduce a framework for filtering features that employs the Hilbert-Schmidt Independence Criterion (HSIC) as a measure of dependence between the features and the labels. The key idea is that good features should maximise such dependence. Feature selection for various supervised learning problems (including classification and regression) is unified under this framework, and the solutions can be approximated using a backward-elimination algorithm. We demonstrate the usefulness of our method on both artificial and real world datasets.
no
notspecified
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/ICML07_[0].pdf
published
7
Supervised Feature Selection via Dependence Estimation
15017
15420
5705
7
QV
Le
AJ
Smola
T
Gärtner
Y
Altun
Berlin, Germany
2006-09-00
306
317
In contrast to the standard inductive inference setting of predictive machine learning, in real-world learning problems the test instances are often already available at training time. Transductive inference tries to improve the predictive accuracy of learning algorithms by making use of the information contained in these test instances. Although this description of transductive inference applies to predictive learning problems in general, most transductive approaches consider the case of classification only. In this paper we introduce a transductive variant of Gaussian process regression with automatic model selection, based on approximate moment matching between training and test data. Empirical results show the feasibility and competitiveness of this approach.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
11
Transductive Gaussian Process Regression with Automatic Model Selection
3981
7
KM
Borgwardt
A
Gretton
M
Rasch
H-P
Kriegel
B
Schölkopf
A
Smola
Fortaleza, Brazil
2006-08-00
e49
e57
Motivation: Many problems in data integration in bioinformatics can be posed as one common question: Are two sets of observations generated by the same distribution? We propose a kernel-based statistical test for this problem, based on the fact that two distributions are different if and only if there exists at least one function having different expectation on the two distributions. Consequently we use the maximum discrepancy between function means as the basis of a test statistic.
The Maximum Mean Discrepancy (MMD) can take advantage of the kernel trick, which allows us to apply it not only to vectors, but strings, sequences, graphs, and other common structured data types arising in molecular biology.
Results: We study the practical feasibility of an MMD-based test on three central data integration tasks: Testing cross-platform comparability of microarray data, cancer diagnosis, and data-content based schema matching for two different protein function classification schemas. In all of these experiments, including high-dimensional ones, MMD is very accurate in finding samples that were generated from the same distribution, and outperforms its best competitors.
Conclusions: We have defined a novel statistical test of whether two samples are from the same distribution, compatible with both multivariate and structured data, that is fast, easy to implement, and works well, as confirmed by our experiments.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
0
Integrating Structured Biological data by Kernel Maximum Mean Discrepancy
15017
15420
3921
7
J
McAuley
T
Caetano
A
Smola
MO
Franz
Pittsburgh, PA, USA
2006-06-00
617
624
In this paper, we use large neighborhood Markov random fields to learn rich prior models of color images. Our approach extends the monochromatic Fields of Experts model (Roth & Black, 2005a) to color images. In the Fields of Experts model, the curse of dimensionality due to very large clique sizes is circumvented by parameterizing the potential functions according to a product of experts. We introduce simplifications to the original approach by Roth and Black which allow us to cope with the increased clique size (typically 3x3x3 or 5x5x3 pixels) of color images. Experimental results are presented for image denoising which evidence improvements over state-of-the-art monochromatic image priors.
no
notspecified
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/icml06_3921[0].pdf
published
7
Learning High-Order MRF Priors of Color Images
15017
15420
5704
7
Y
Altun
AJ
Smola
Pittsburgh, PA, USA
2006-06-00
139
153
In this paper we unify divergence minimization and statistical inference by means of convex duality. In the process of doing so, we prove that the dual of approximate maximum entropy estimation is maximum a posteriori estimation as a special case. Moreover, our treatment leads to stability and convergence bounds for many statistical learning problems. Finally, we show how an algorithm by Zhang can be used to solve this class of optimization problems efficiently.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
14
Unifying Divergence Minimization and Statistical Inference Via Convex Duality
3774
7
A
Gretton
O
Bousquet
A
Smola
B
Schölkopf
Singapore
2005-10-00
63
78
We propose an independence criterion based on the eigenspectrum of covariance operators in reproducing kernel Hilbert spaces (RKHSs), consisting of an empirical estimate of the Hilbert-Schmidt norm of the cross-covariance operator (we term this a Hilbert-Schmidt Independence Criterion, or HSIC). This approach has several advantages, compared with previous kernel-based independence criteria. First, the empirical estimate is simpler than any other kernel dependence test, and requires no user-defined regularisation. Second, there is a clearly defined population quantity which the empirical estimate approaches in the large sample limit, with exponential convergence guaranteed between the two: this ensures that independence tests based on HSIC do not suffer from slow learning rates.
Finally, we show in the context of independent component analysis (ICA) that the performance of HSIC is competitive with that of previously published kernel-based criteria, and of other recently published ICA methods.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
15
Measuring Statistical Dependence with Hilbert-Schmidt Norms
15017
15420
3415
7
KM
Borgwardt
CS
Ong
S
Schönauer
SVN
Vishwanathan
AJ
Smola
H-P
Kriegel
Detroit, MI, USA
2005-06-00
i47
i56
Motivation: Computational approaches to protein function prediction infer protein function by finding proteins with similar sequence, structure, surface clefts, chemical properties, amino acid motifs, interaction partners or phylogenetic profiles. We present a new approach that combines sequential, structural and chemical information into one graph model of proteins. We predict functional class membership of enzymes and non-enzymes using graph kernels and support vector machine classification on these protein graphs.
Results: Our graph model, derivable from protein sequence and structure only, is competitive with vector models that require additional protein information, such as the size of surface pockets. If we include this extra information into our graph model, our classifier yields significantly higher accuracy levels than the vector models. Hyperkernels allow us to select and to optimally combine the most relevant node attributes in our protein graphs. We have laid the foundation for a protein function prediction system that integrates protein information from various sources efficiently and effectively.
no
notspecified
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/pdf3415.pdf
published
0
Protein function prediction via graph kernels
BorgwardtGVS2005
7
KM
Borgwardt
O
Guttman
SVN
Vishwanathan
AJ
Smola
Brugge, Belgium
2005-04-00
455
460
We present a principled method to combine kernels under joint regularization constraints. Central to our method is an extension of the representer theorem for handling multiple joint regularization constraints. Experimental evidence shows the feasibility of our approach.
no
notspecified
http://www.kyb.tuebingen.mpg.de/fileadmin/user_upload/files/publications/ESANN-005_Borgwardt.pdf
published
5
Joint Regularization
3174
7
A
Gretton
AJ
Smola
O
Bousquet
R
Herbrich
A
Belitski
M
Augath
Y
Murayama
J
Pauls
B
Schölkopf
NK
Logothetis
Bridgetown, Barbados
2005-01-00
112
119
We discuss reproducing kernel Hilbert space (RKHS)-based measures of statistical dependence, with emphasis on constrained covariance (COCO), a novel criterion to test dependence of random variables. We show that COCO is a test for independence if and only if the associated RKHSs are universal. That said, no independence test exists that can distinguish dependent and independent random variables in all circumstances. Dependent random variables can result in a COCO which is arbitrarily close to zero when the source densities are highly non-smooth. All current kernel-based independence tests share this behaviour. We demonstrate exponential convergence between the population and empirical COCO. Finally, we use COCO as a measure of joint neural activity between voxels in MRI recordings of the macaque monkey, and compare the results to the mutual information and the correlation. We also show the effect of removing breathing artefacts from the MRI recording.
no
notspecified
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/pdf3174.pdf
published
7
Kernel Constrained Covariance for Dependence Measurement
15017
15420
15017
15421
2741
7
Y
Altun
AJ
Smola
T
Hofmann
Banff, Alberta, Canada
2004-07-00
2
9
In this paper we define conditional random fields in reproducing kernel Hilbert spaces and show connections to Gaussian Process classification. More specifically, we prove decomposition results for undirected graphical models and we give constructions for kernels. Finally we present efficient means of solving the optimization problem using reduced rank decompositions and we show how stationarity can be exploited efficiently in the optimization process.
no
notspecified
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/pdf2741.pdf
published
7
Exponential Families for Conditional Random Fields
15017
15420
2740
7
Y
Altun
T
Hofmann
AJ
Smola
Banff, Alberta, Canada
2004-07-00
4
Many real-world classification tasks involve the prediction of multiple, inter-dependent class labels. A prototypical case of this sort deals with prediction of a sequence of labels for a sequence of observations. Such problems arise naturally in the context of annotating and segmenting observation sequences. This paper generalizes Gaussian Process classification to predict multiple labels by taking dependencies between neighboring labels into account. Our approach is motivated by the desire to retain rigorous probabilistic semantics, while overcoming limitations of parametric methods like Conditional Random Fields, which exhibit conceptual and computational difficulties in high-dimensional input spaces. Experiments on named entity recognition and pitch accent prediction tasks demonstrate the competitiveness of our approach.
no
notspecified
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/ICML2004-Altun_2740[0].pdf
published
-4
Gaussian Process Classification for Segmenting and Annotating Sequences
15017
15420
3416
7
CS
Ong
X
Mary
S
Canu
AJ
Smola
Banff, Alberta, Canada
2004-07-00
81
In this paper we show that many kernel methods can be adapted to deal with indefinite kernels, that is, kernels which are not positive semidefinite. They do not satisfy Mercer's condition and they induce associated functional spaces called Reproducing Kernel Kreĭn Spaces (RKKS), a generalization of Reproducing Kernel Hilbert Spaces (RKHS). Machine learning in RKKS shares many "nice" properties of learning in RKHS, such as orthogonality and projection. However, since the kernels are indefinite, we can no longer minimize the loss; instead, we stabilize it. We show a general representer theorem for constrained stabilization and prove generalization bounds by computing the Rademacher averages of the kernel class. We list several examples of indefinite kernels and investigate regularization methods to solve spline interpolation. Some preliminary experiments with indefinite kernels for spline smoothing are reported for truncated spectral factorization, Landweber-Fridman iterations, and MR-II.
no
notspecified
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/pdf3416.pdf
published
-81
Learning with Non-Positive Kernels
2183
7
G
Rätsch
A
Smola
S
Mika
Vancouver, BC, Canada
2003-10-00
513
520
In this paper we consider formulations of multi-class problems based on a generalized notion of a margin and using output coding. This includes, but is not restricted to, standard multi-class SVM formulations. Unlike many previous approaches, we learn the code as well as the embedding function. We illustrate how this can lead to a formulation that allows for solving a wider range of problems with, for instance, many classes or even “missing classes”. To keep our optimization problems tractable we propose an algorithm capable of solving them using two-class classifiers, similar in spirit to Boosting.
no
notspecified
http://www.kyb.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS-2002-Raetsch.pdf
published
7
Adapting Codes and Embeddings for Polychotomies
3418
7
CS
Ong
AJ
Smola
RC
Williamson
2003-10-00
495
502
no
notspecified
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/pdf3418.pdf
published
7
Hyperkernels
3417
7
CS
Ong
AJ
Smola
Washington, DC, USA
2003-08-00
568
575
We expand on the problem of learning a kernel via an RKHS on the space of kernels itself. The resulting optimization problem is shown to have a semidefinite programming solution. We demonstrate that it is possible to learn the kernel for various formulations of machine learning problems. Specifically, we provide mathematical programming formulations and experimental results for the C-SVM, ν-SVM and Lagrangian SVM for classification on UCI data, and novelty detection.
no
notspecified
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/pdf3417.pdf
published
7
Machine Learning with Hyperkernels
2133
7
A
Gretton
R
Herbrich
A
Smola
Hong Kong
2003-04-00
880
883
We introduce a new contrast function, the kernel mutual information (KMI), to measure the degree of independence of continuous random variables. This contrast function provides an approximate upper bound on the mutual information, as measured near independence, and is based on a kernel density estimate of the mutual information between a discretised approximation of the continuous random variables. We show that the kernel generalised variance (KGV) of F. Bach and M. Jordan (see JMLR, vol.3, p.1-48, 2002) is also an upper bound on the same kernel density estimate, but is looser. Finally, we suggest that the addition of a regularising term in the KGV causes it to approach the KMI, which motivates the introduction of this regularisation.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
3
The Kernel Mutual Information
15017
15420
2091
7
B
Schölkopf
AJ
Smola
Canberra, Australia
2003-00-00
41
64
We briefly describe the main ideas of statistical learning theory, support vector machines, and kernel feature spaces. This includes a derivation of the support vector optimization problem for classification and regression, the ν-trick, various kernels and an overview of applications of kernel methods.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
23
A Short Introduction to Learning with Kernels
15017
15420
4258
7
AJ
Smola
B
Schölkopf
Canberra, Australia
2003-00-00
65
117
Bayesian methods allow for a simple and intuitive representation of the function spaces used by kernel methods. This chapter describes the basic principles of Gaussian Processes, their implementation and their connection to other kernel-based Bayesian estimation methods, such as the Relevance Vector Machine.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
52
Bayesian Kernel Methods
15017
15420
1831
7
AJ
Smola
O
Mangasarian
B
Schölkopf
Passau, Germany
2002-00-00
167
178
Kernel Principal Component Analysis (KPCA) has proven to be a versatile tool for unsupervised learning, albeit at a high computational cost due to the dense expansions in terms of kernel functions. We overcome this problem by proposing a new class of feature extractors employing ℓ1 norms in coefficient space instead of the Reproducing Kernel Hilbert Space in which KPCA was originally formulated. Moreover, the modified setting allows us to efficiently extract features which maximize criteria other than the variance in a way similar to projection pursuit.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
11
Sparse kernel feature analysis
1838
7
B
Schölkopf
R
Herbrich
AJ
Smola
Amsterdam, The Netherlands
2001-07-00
416
426
Wahba’s classical representer theorem states that the solutions of certain risk minimization problems involving an empirical risk term and a quadratic regularizer can be written as expansions in terms of the training examples. We generalize the theorem to a larger class of regularizers and empirical risk terms, and give a self-contained proof utilizing the feature space associated with a kernel. The result shows that a wide range of problems have optimal solutions that live in the finite dimensional span of the training examples mapped into feature space, thus enabling us to carry out kernel algorithms independent of the (potentially infinite) dimensionality of the feature space.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
10
A Generalized Representer Theorem
2426
7
S
Mika
B
Schölkopf
AJ
Smola
Key West, FL, USA
2001-01-00
98
104
We present a fast training algorithm for the kernel Fisher discriminant classifier. It uses a greedy approximation technique and has an empirical scaling behavior which improves upon the state of the art by more than an order of magnitude, thus rendering the kernel Fisher algorithm a viable option also for large datasets.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
6
An Improved Training Algorithm for Kernel Fisher discriminants
1837
7
A
Chalimourda
B
Schölkopf
AJ
Smola
Como, Italy
2000-07-00
199
204
In support vector (SV) regression, a parameter ν controls the number of support vectors and the number of points that come to lie outside of the so-called ε-insensitive tube. For various noise models and SV parameter settings, we experimentally determine the values of ν that lead to the lowest generalization error. We find good agreement with the values that had previously been predicted by a theoretical argument based on the asymptotic efficiency of a simplified model of SV regression.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
5
Choosing ν in Support Vector Regression with Different Noise Models: Theory and Experiments
820
7
RC
Williamson
AJ
Smola
B
Schölkopf
Palo Alto, CA, USA
2000-07-00
309
319
This paper collects together a miscellany of results originally motivated by the analysis of the generalization performance of the "maximum-margin" algorithm due to Vapnik and others. The key feature of the paper is its operator-theoretic viewpoint. New bounds on covering numbers for classes related to Maximum Margin classes are derived directly without making use of a combinatorial dimension such as the VC-dimension. Specific contents of the paper include: a new and self-contained proof of Maurey's theorem and some generalizations with small explicit values of constants; bounds on the covering numbers of maximum margin classes suitable for the analysis of their generalization performance.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
10
Entropy Numbers of Linear Function Classes.
819
7
AJ
Smola
B
Schölkopf
Stanford, CA, USA
2000-07-00
911
918
In kernel-based methods such as Regularization Networks, large datasets pose significant problems since the number of basis functions required for an optimal solution equals the number of samples. We present a sparse greedy approximation technique to construct a compressed representation of the design matrix. Experimental results are given and connections to Kernel-PCA, Sparse Kernel Feature Analysis, and Matching Pursuit are pointed out. 1. Introduction: Many recent advances in machine learning such as Support Vector Machines [Vapnik, 1995], Regularization Networks [Girosi et al., 1995], or Gaussian Processes [Williams, 1998] are based on kernel methods. Given an m-sample $\{(x_1, y_1), \ldots, (x_m, y_m)\}$ of patterns $x_i \in X$ and target values $y_i \in Y$, these algorithms minimize the regularized risk functional $\min_{f \in H} R_{\mathrm{reg}}[f] = \frac{1}{m} \sum_{i=1}^{m} c(x_i, y_i, f(x_i)) + \frac{\lambda}{2} \|f\|_H^2$ (1). Here $H$ denotes a reproducing kernel Hilbert space (RKHS) [Aronszajn, 1950].
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
7
Sparse Greedy Matrix Approximation for Machine Learning
817
7
S
Mika
G
Rätsch
J
Weston
B
Schölkopf
AJ
Smola
K-R
Müller
Denver, CO, USA
2000-06-00
526
532
no
notspecified
http://www.kyb.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS-1999-Mika.pdf
published
6
Invariant feature extraction and classification in kernel spaces
815
7
B
Schölkopf
RC
Williamson
AJ
Smola
J
Shawe-Taylor
JC
Platt
Denver, CO, USA
2000-06-00
582
588
Suppose you are given some dataset drawn from an underlying probability distribution P and you want to estimate a “simple” subset S of input space such that the probability that a test point drawn from P lies outside of S equals some a priori specified ν between 0 and 1. We propose a method to approach this problem by trying to estimate a function f which is positive on S and negative on the complement. The functional form of f is given by a kernel expansion in terms of a potentially small subset of the training data; it is regularized by controlling the length of the weight vector in an associated feature space. We provide a theoretical analysis of the statistical performance of our algorithm. The algorithm is a natural extension of the support vector algorithm to the case of unlabelled data.
no
notspecified
http://www.kyb.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS-1999-Schoelkopf.pdf
published
6
Support vector method for novelty detection
816
7
AJ
Smola
J
Shawe-Taylor
B
Schölkopf
RC
Williamson
Denver, CO, USA
2000-06-00
342
348
Effective methods of capacity control via uniform convergence bounds for function expansions have been largely limited to Support Vector machines, where good bounds are obtainable by the entropy number approach. We extend these methods to systems with expansions in terms of arbitrary (parametrized) basis functions and a wide range of regularization methods covering the whole range of general linear additive models. This is achieved by a data dependent analysis of the eigenvalues of the corresponding design matrix.
no
notspecified
http://www.kyb.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS-1999-Smola.pdf
published
6
The entropy regularization information criterion
818
7
G
Rätsch
B
Schölkopf
AJ
Smola
K-R
Müller
T
Onoda
S
Mika
Denver, CO, USA
2000-06-00
561
567
AdaBoost and other ensemble methods have successfully been applied to a number of classification tasks, seemingly defying problems of overfitting. AdaBoost performs gradient descent in an error function with respect to the margin, asymptotically concentrating on the patterns which are hardest to learn. For very noisy problems, however, this can be disadvantageous. Indeed, theoretical analysis has shown that the margin distribution, as opposed to just the minimal margin, plays a crucial role in understanding this phenomenon. Loosely speaking, some outliers should be tolerated if this has the benefit of substantially increasing the margin on the remaining points. We propose a new boosting algorithm which allows for the possibility of a pre-specified fraction of points to lie in the margin area or even on the wrong side of the decision boundary.
no
notspecified
http://www.kyb.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS-1999-Raetsch.pdf
published
6
ν-Arc: Ensemble Learning in the Presence of Outliers
1835
7
G
Rätsch
B
Schölkopf
AJ
Smola
S
Mika
T
Onoda
K-R
Müller
Kyoto, Japan
2000-04-00
341
344
We propose a new boosting algorithm which, similarly to ν-Support-Vector Classification, allows for the possibility of a pre-specified fraction ν of points to lie in the margin area or even on the wrong side of the decision boundary. It gives a nicely interpretable way of controlling the trade-off between minimizing training error and capacity. Furthermore, it can act as a filter for finding and selecting informative patterns from a database.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
3
Robust Ensemble Learning for Data Mining
974
7
AJ
Smola
PJ
Bartlett
B
Schölkopf
D
Schuurmans
Denver, CO, USA
2000-00-00
422
The concept of large margins is a unifying principle for the analysis of many different approaches to the classification of data from examples, including boosting, mathematical programming, neural networks, and support vector machines. The fact that it is the margin, or confidence level, of a classification--that is, a scale parameter--rather than a raw training error that matters has become a key tool for dealing with classifiers. This book shows how this idea applies to both the theoretical analysis and the design of algorithms. The book provides an overview of recent developments in large margin classifiers, examines connections with other methods (e.g., Bayesian inference), and identifies strengths and weaknesses of the method, as well as directions for future research. Among the contributors are Manfred Opper, Vladimir Vapnik, and Grace Wahba.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
-422
Advances in Large Margin Classifiers
793
7
AJ
Smola
A
Elisseeff
B
Schölkopf
RC
Williamson
Denver, CO, USA
2000-00-00
369
387
This chapter contains sections titled: Introduction, Tools from Functional Analysis, Convex Combinations of Parametric Families, Convex Combinations of Kernels, Multilayer Networks, Discussion, Appendix: A Remark on Traditional Weight Decay, Appendix: Proofs.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
18
Entropy numbers for convex combinations and MLPs
4789
7
N
Oliver
B
Schölkopf
AJ
Smola
Denver, CO, USA
2000-00-00
51
60
This chapter contains sections titled: Introduction, Natural Kernels, The Natural Regularization Operator, The Feature Map of Natural Kernel, Experiments, Discussion.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
9
Natural Regularization from Generative Models
794
7
G
Rätsch
B
Schölkopf
AJ
Smola
S
Mika
T
Onoda
K-R
Müller
Denver, CO, USA
2000-00-00
207
220
This chapter contains sections titled: Introduction, Boosting and the Linear Programming Solution, ν-Algorithms, Experiments, Conclusion, Acknowledgments.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
13
Robust ensemble learning
811
7
T
Graepel
R
Herbrich
B
Schölkopf
AJ
Smola
P
Bartlett
K
Müller
K
Obermayer
RC
Williamson
Edinburgh, UK
1999-09-00
304
309
We provide a new linear program to deal with classification of data in the case of data given in terms of pairwise proximities. This allows us to avoid the problems inherent in using feature spaces with indefinite metric in support vector machines, since the notion of a margin is needed only in input space, where the classification actually occurs. Moreover, in our approach we can enforce sparsity in the proximity representation by sacrificing training error. This turns out to be favorable for proximity data. Similar to ν-SV methods, the only parameter needed in the algorithm is the (asymptotical) number of data points being classified with a margin. Finally, the algorithm is successfully compared with ν-SV learning in proximity space and K-nearest-neighbors on real world data from neuroscience and molecular biology.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
5
Classification on proximity data with LP-machines
810
7
B
Schölkopf
J
Shawe-Taylor
AJ
Smola
RC
Williamson
Edinburgh, UK
1999-09-00
103
108
Model selection in support vector machines is usually carried out by minimizing the quotient of the radius of the smallest enclosing sphere of the data and the observed margin on the training set. We provide a new criterion taking the distribution within that sphere into account by considering the eigenvalue distribution of the Gram matrix of the data. Experimental results on real world data show that this new criterion provides a good prediction of the shape of the curve relating generalization error to kernel width.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
5
Kernel-dependent support vector error bounds
809
7
AJ
Smola
B
Schölkopf
G
Rätsch
Edinburgh, UK1999-09-00
575
580
We have recently proposed a new approach to control the number of basis functions and the accuracy in support vector machines. The latter is transferred to a linear programming setting, which inherently enforces sparseness of the solution. The algorithm computes a nonlinear estimate in terms of kernel functions and an ε>0 with the property that at most a fraction ν of the training set has an error exceeding ε. The algorithm is robust to local perturbations of these points' target values. We give an explicit formulation of the optimization equations needed to solve the linear program and point out which modifications of the standard optimization setting are necessary to take advantage of the particular structure of the equations in the regression case.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
5
Linear programs for automatic accuracy control in regression
806
7
S
Mika
B
Schölkopf
AJ
Smola
K-R
Müller
M
Scholz
G
Rätsch
Denver, CO, USA1999-06-00
536
542
Kernel PCA as a nonlinear feature extractor has proven powerful as a preprocessing step for classification algorithms. But it can also be considered as a natural generalization of linear principal component analysis.
This gives rise to the question of how to use nonlinear features for data compression, reconstruction, and de-noising, applications common in linear PCA. This is a nontrivial task, as the results provided by kernel PCA live in some high dimensional feature space and need not have pre-images in input space. This work presents ideas for finding approximate pre-images, focusing on Gaussian kernels, and shows experimental results using these pre-images in data reconstruction and de-noising on toy examples as well as on real world data.
no
notspecified
http://www.kyb.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS-1998-Mika.pdf
published
6
Kernel PCA and De-noising in feature spaces
804
7
AJ
Smola
T
Friess
B
Schölkopf
Denver, CO, USA1999-06-00
585
591
Semiparametric models are useful tools in the case where domain knowledge exists about the function to be estimated or emphasis is put onto understandability of the model. We extend two learning algorithms, Support Vector machines and Linear Programming machines, to this case and give experimental results for SV machines.
no
notspecified
http://www.kyb.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS-1998-Smola.pdf
published
6
Semiparametric support vector and linear programming machines
805
7
B
Schölkopf
PL
Bartlett
AJ
Smola
R
Williamson
Denver, CO, USA1999-06-00
330
336
no
notspecified
http://www.kyb.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS-1998-Schoelkopf.pdf
published
6
Shrinking the tube: a new support vector regression algorithm
814
7
P
Vannerem
K-R
Müller
AJ
Smola
B
Schölkopf
S
Söldner-Rembold
Iraklio, Greece1999-04-00
1
7
We have studied the application of different classification algorithms in the analysis of simulated high energy physics data. Whereas Neural Network algorithms have become a standard tool for data analysis, the performance of other classifiers such as Support Vector Machines has not yet been tested in this environment. We chose two different problems to compare the performance of a Support Vector Machine and a Neural Net trained with back-propagation: tagging events of the type e+e- -> ccbar and the identification of muons produced in multihadronic e+e- annihilation events.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
6
Classifying LEP data with support vector algorithms
807
7
RC
Williamson
AJ
Smola
B
Schölkopf
Nordkirchen, Germany1999-03-00
285
299
We derive new bounds for the generalization error of feature space machines, such as support vector machines and related regularization networks, by obtaining new bounds on their covering numbers. The proofs are based on a viewpoint that is apparently novel in the field of statistical learning theory. The hypothesis class is described in terms of a linear operator mapping from a possibly infinite dimensional unit ball in feature space into a finite dimensional space. The covering numbers of the class are then determined via the entropy numbers of the operator. These numbers, which characterize the degree of compactness of the operator, can be bounded in terms of the eigenvalues of an integral operator induced by the kernel function used by the machine. As a consequence we are able to theoretically explain the effect of the choice of kernel functions on the generalization performance of support vector machines.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
14
Entropy numbers, operators and support vector kernels
808
7
AJ
Smola
RC
Williamson
S
Mika
B
Schölkopf
Nordkirchen, Germany1999-03-00
214
229
Many settings of unsupervised learning can be viewed as quantization problems — the minimization of the expected quantization error subject to some restrictions. This allows the use of tools such as regularization from the theory of (supervised) risk minimization for unsupervised settings. Moreover, this setting is very closely related to both principal curves and the generative topographic map.
We explore this connection in two ways: 1) we propose an algorithm for finding principal manifolds that can be regularized in a variety of ways. Experimental results demonstrate the feasibility of the approach. 2) We derive uniform convergence bounds and hence bounds on the learning rates of the algorithm. In particular, we give good bounds on the covering numbers which allows us to obtain a nearly optimal learning rate of order O(m^{-1/2+α}) for certain types of regularization operators, where m is the sample size and α an arbitrary positive constant.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
15
Regularized principal manifolds
791
7
RC
Williamson
AJ
Smola
B
Schölkopf
Denver, CO, USA1999-00-00
127
144
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
17
Entropy numbers, operators and support vector kernels
ScholkopfBS1999
7
B
Schölkopf
CJC
Burges
AJ
Smola
Denver, CO, USA1999-00-00
1
15
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
14
Introduction to support vector learning
789
7
B
Schölkopf
AJ
Smola
K-R
Müller
Denver, CO, USA1999-00-00
327
352
A new method for performing a nonlinear form of Principal Component Analysis is proposed. By the use of integral operator kernel functions, one can efficiently compute principal components in high-dimensional feature spaces, related to input space by some nonlinear map; for instance the space of all possible d-pixel products in images. We give the derivation of the method and present experimental results on polynomial feature extraction for pattern recognition.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
25
Kernel principal component analysis
ScholkopfBS1999_2
7
B
Schölkopf
CJC
Burges
AJ
Smola
Denver, CO, USA1999-00-00
17
22
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
5
Roadmap
MullerSRSKV1999
7
K-R
Müller
AJ
Smola
G
Rätsch
B
Schölkopf
J
Kohlmorgen
V
Vapnik
Denver, CO, USA1999-00-00
243
253
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
10
Using support vector machines for time series prediction
SmolaMSM1998
7
AJ
Smola
N
Murata
B
Schölkopf
K-R
Müller
Skövde, Sweden1998-09-00
105
110
Under the assumption of asymptotically unbiased estimators we show that there exists a nontrivial choice of the insensitivity parameter in Vapnik’s ε-insensitive loss function which scales linearly with the input noise of the training data. This finding is backed by experimental results.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
5
Asymptotically Optimal Choice of ε-Loss for Support Vector Machines
1871
7
AJ
Smola
B
Schölkopf
K-R
Müller
Skövde, Sweden1998-09-00
99
104
The concept of Support Vector Regression is extended to a more general class of convex cost functions. It is shown how the resulting convex constrained optimization problems can be efficiently solved by a Primal-Dual Interior Point path following method. Both computational feasibility and improvement of estimation are demonstrated in the experiments.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
5
Convex Cost Functions for Support Vector Regression
802
7
B
Schölkopf
S
Mika
AJ
Smola
G
Rätsch
K-R
Müller
Skövde, Sweden1998-09-00
147
152
Algorithms based on Mercer kernels construct their solutions in terms of expansions in a high-dimensional feature space F. Previous work has shown that all algorithms which can be formulated in terms of dot products in F can be performed using a kernel without explicitly working in F. The list of such algorithms includes support vector machines and nonlinear kernel principal component extraction. So far, however, it did not include the reconstruction of patterns from their largest nonlinear principal components, a technique which is common practice in linear principal component analysis.
The present work proposes an idea for approximately performing this task. As an illustrative example, an application to the de-noising of data clusters is presented.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
5
Kernel PCA pattern reconstruction via approximate pre-images
801
7
B
Schölkopf
P
Bartlett
AJ
Smola
R
Williamson
Skövde, Sweden1998-09-00
111
116
A new algorithm for Support Vector regression is proposed. For a priori chosen ν, it automatically adjusts a flexible tube of minimal radius to the data such that at most a fraction ν of the data points lie outside. The algorithm is analysed theoretically and experimentally.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
5
Support vector regression with automatic accuracy control
797
7
AJ
Smola
B
Schölkopf
Denver, CO, USA1998-06-00
343
349
We derive the correspondence between regularization operators used in Regularization Networks and Hilbert Schmidt Kernels appearing in Support Vector Machines. More specifically, we prove that the Green's Functions associated with regularization operators are suitable Support Vector Kernels with equivalent regularization properties. As a by-product we show that a large number of Radial Basis Functions, namely conditionally positive definite functions, may be used as Support Vector kernels.
no
notspecified
http://www.kyb.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS-1997-Smola.pdf
published
6
From regularization operators to support vector kernels
15017
15422
798
7
B
Schölkopf
P
Simard
AJ
Smola
V
Vapnik
Denver, CO, USA1998-06-00
640
646
We explore methods for incorporating prior knowledge about a problem at hand in Support Vector learning machines. We show that both invariances under group transformations and prior knowledge about locality in images can be incorporated by constructing appropriate kernel functions.
no
notspecified
http://www.kyb.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS-1997-Schoelkopf.pdf
published
6
Prior knowledge in support vector kernels
15017
15422
799
7
AJ
Smola
B
Schölkopf
K-R
Müller
Brisbane, Australia1998-02-00
79
83
The concept of Support Vector Regression is extended to a more general class of convex cost functions. Moreover, it is shown how the resulting convex constrained optimization problems can be efficiently solved by a Primal-Dual Interior Point path following method. Both computational feasibility and improvement of estimation are demonstrated in the experiments.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
4
General cost functions for Support Vector Regression
1823
7
B
Schölkopf
AJ
Smola
K-R
Müller
C
Burges
V
Vapnik
Brisbane, Australia1998-02-00
72
78
The last years have witnessed an increasing interest in Support Vector (SV) machines, which use Mercer kernels for efficiently performing computations in high-dimensional spaces. In pattern recognition, the SV algorithm constructs nonlinear decision functions by training a classifier to perform a linear separation in some high-dimensional space which is nonlinearly related to input space. Recently, we have developed a technique for Nonlinear Principal Component Analysis (Kernel PCA) based on the same types of kernels. This way, we can for instance efficiently extract polynomial features of arbitrary order by computing projections onto principal components in the space of all products of n pixels of images. We explain the idea of Mercer kernels and associated feature spaces, and describe connections to the theory of reproducing kernels and to regularization theory, followed by an overview of the above algorithms employing these kernels.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
6
Support Vector methods in learning and feature extraction
803
7
B
Schölkopf
P
Knirsch
AJ
Smola
C
Burges
Stuttgart, Germany1998-00-00
125
132
Kernel-based learning methods provide their solutions as expansions in terms of a kernel. We consider the problem of reducing the computational complexity of evaluating these expansions by approximating them using fewer terms. As a by-product, we point out a connection between clustering and approximation in reproducing kernel Hilbert spaces generated by a particular class of kernels.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
7
Fast Approximation of Support Vector Kernel Expansions, and an Interpretation of Clustering as Approximation in Feature Spaces
421
7
B
Schölkopf
AJ
Smola
K-R
Müller
Lausanne, Switzerland1997-10-00
583
588
A new method for performing a nonlinear form of Principal Component Analysis is proposed. By the use of integral operator kernel functions, one can efficiently compute principal components in high-dimensional feature spaces, related to input space by some nonlinear map; for instance the space of all possible d-pixel products in images. We give the derivation of the method and present experimental results on polynomial feature extraction for pattern recognition.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
5
Kernel principal component analysis
15017
15422
416
7
K-R
Müller
AJ
Smola
G
Rätsch
B
Schölkopf
J
Kohlmorgen
V
Vapnik
Lausanne, Switzerland1997-10-00
999
1004
Support Vector Machines are used for time series prediction and compared to radial basis function networks. We make use of two different cost functions for Support Vectors: training with (i) an ε-insensitive loss and (ii) Huber's robust loss function, and discuss how to choose the regularization parameters in these models. Two applications are considered: data from (a) a noisy (normal and uniform noise) Mackey-Glass equation and (b) the Santa Fe competition (set D). In both cases Support Vector Machines show an excellent performance. In case (b) the Support Vector approach improves the best known result on the benchmark by 29%.
no
notspecified
http://www.kyb.tuebingen.mpg.de/fileadmin/user_upload/files/publications/ICANN-1997-Mueller.pdf
published
5
Predicting time series with support vector machines
15017
15422
5376
2
A
Gretton
AJ
Smola
J
Huang
M
Schmittfull
KM
Borgwardt
B
Schölkopf
MIT Press
Cambridge, MA, USA
2009-02-00
131
160
Given sets of observations of training and test data, we consider the problem of re-weighting the training data such that its distribution more closely matches that of the test data. We achieve this goal by matching covariate distributions between training and test sets in a high dimensional feature space (specifically, a reproducing kernel Hilbert space). This approach does not require distribution estimation. Instead, the sample weights are obtained by a simple quadratic programming procedure. We provide a uniform convergence bound on the distance between the reweighted training feature mean and the test feature mean, a transductive bound on the expected loss of an algorithm trained on the reweighted data, and a connection to single class SVMs. While our method is designed to deal with the case of simple covariate shift (in the sense of Chapter ??), we have also found benefits for sample selection bias on the labels. Our correction procedure yields its greatest and most consistent advantages when the learning algorithm returns a classifier/regressor that is "simpler" than the data might suggest.
no
notspecified
http://www.kyb.tuebingen.mpg.de//fileadmin/user_upload/files/publications/shift-book-for-LeEtAl-webversion_5376[0].pdf
published
29
Covariate Shift by Kernel Mean Matching
15017
1542015017
20755
5702
2
Y
Altun
AJ
Smola
MIT Press
Cambridge, MA, USA
2007-09-00
283
300
In this paper we study the problem of estimating conditional probability distributions for structured output prediction tasks in Reproducing Kernel Hilbert Spaces. More specifically, we prove decomposition results for undirected graphical models, give constructions for kernels, and show connections to Gaussian Process classification. Finally we present efficient means of solving the optimization problem and apply this to label sequence learning. Experiments on named entity recognition and pitch accent prediction tasks demonstrate the competitiveness of our approach.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
17
Density Estimation of Structured Outputs in Reproducing Kernel Hilbert Spaces
2512
2
B
Schölkopf
AJ
Smola
Wiley
Chichester, UK
2005-00-00
5328
5335
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
7
Support Vector Machines and Kernel Algorithms
15017
15420
2205
2
B
Schölkopf
AJ
Smola
MIT Press
Cambridge, MA, USA
2002-11-00
1119
1125
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
6
Support Vector Machines
5111
46
A
Gretton
K
Borgwardt
M
Rasch
B
Schölkopf
A
Smola
2008-04-00
3437
46
A
Gretton
O
Bousquet
AJ
Smola
B
Schölkopf
2005-06-00
2005-06-00
Measuring Statistical Dependence with Hilbert-Schmidt Norms
no
notspecified
Measuring Statistical Dependence with Hilbert-Schmidt Norms
15017
15420
2936
46
A
Gretton
A
Smola
O
Bousquet
R
Herbrich
B
Schölkopf
NK
Logothetis
2004-10-00
2004-10-00
Behaviour and Convergence of the Constrained Covariance
no
notspecified
Behaviour and Convergence of the Constrained Covariance
15017
15420
2212
46
A
Gretton
R
Herbrich
AJ
Smola
2003-04-00
2003-04-00
The Kernel Mutual Information
no
notspecified
The Kernel Mutual Information
15017
15420
1836
46
B
Schölkopf
JC
Platt
AJ
Smola
2000-02-00
2000-02-00
Kernel method for percentile feature extraction
no
notspecified
Kernel method for percentile feature extraction
1874
46
B
Schölkopf
JC
Platt
J
Shawe-Taylor
AJ
Smola
RC
Williamson
1999-11-00
1999-11-00
Estimating the support of a high-dimensional distribution
no
notspecified
Estimating the support of a high-dimensional distribution
1821
46
B
Schölkopf
J
Shawe-Taylor
AJ
Smola
RC
Williamson
1999-03-00
1999-03-00
Generalization Bounds via Eigenvalues of the Gram matrix
no
notspecified
Generalization Bounds via Eigenvalues of the Gram matrix
SmolaMS1998
46
AJ
Smola
OL
Mangasarian
B
Schölkopf
1999-00-00
1999-00-00
Sparse Kernel Feature Analysis
no
notspecified
Sparse Kernel Feature Analysis
1834
46
AJ
Smola
RC
Williamson
B
Schölkopf
1998-09-00
1998-09-00
Generalization bounds and learning rates for Regularized principal manifolds
no
notspecified
Generalization bounds and learning rates for Regularized principal manifolds
1869
46
AJ
Smola
S
Mika
B
Schölkopf
1998-09-00
1998-09-00
Quantization Functionals and Regularized Principal Manifolds
no
notspecified
Quantization Functionals and Regularized Principal Manifolds
1872
46
AJ
Smola
RC
Williamson
B
Schölkopf
1998-08-00
1998-08-00
Generalization Bounds for Convex Combinations of Kernel Functions
no
notspecified
Generalization Bounds for Convex Combinations of Kernel Functions
1873
46
RC
Williamson
AJ
Smola
B
Schölkopf
1998-00-00
1998-00-00
Generalization Performance of Regularization Networks and Support Vector Machines via Entropy Numbers of Compact Operators
no
notspecified
Generalization Performance of Regularization Networks and Support Vector Machines via Entropy Numbers of Compact Operators
1819
46
C
Saunders
MO
Stitson
J
Weston
L
Bottou
B
Schölkopf
AJ
Smola
1998-00-00
1998-00-00
Support Vector Machine Reference Manual
no
notspecified
Support Vector Machine Reference Manual
1509
46
B
Schölkopf
AJ
Smola
K-R
Müller
1996-12-00
1996-12-00
Nonlinear Component Analysis as a Kernel Eigenvalue Problem
no
notspecified
Nonlinear Component Analysis as a Kernel Eigenvalue Problem
15017
15422
5046
7
A
Zien
G
Rätsch
S
Mika
B
Schölkopf
C
Lemmen
A
Smola
T
Lengauer
K-R
Müller
Heidelberg, Germany1999-10-00
German Conference on Bioinformatics (GCB '99)
In order to extract protein sequences from nucleotide sequences, it is an important step to recognize points from which regions encoding proteins start, the so-called translation initiation sites (TIS). This can be modeled as a classification problem. We demonstrate the power of support vector machines (SVMs) for this task, and show how to successfully incorporate biological prior knowledge by engineering an appropriate kernel function.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
0
Engineering Support Vector Machine Kernels That Recognize Translation Initiation Sites
1828
7
B
Schölkopf
R
Williamson
AJ
Smola
J
Shawe-Taylor
Dagstuhl, Germany1999-03-00
19
20
Dagstuhl-Seminar: Unsupervised Learning
Suppose you are given some dataset drawn from an underlying probability distribution P and you want to estimate a subset S of input space such that the probability that a test point drawn from P lies outside of S is bounded by some a priori specified 0<ν≤1. We propose an algorithm to deal with this problem by trying to estimate a function f which is positive on S and negative on the complement of S. The functional form of f is given by a kernel expansion in terms of a potentially small subset of the training data; it is regularized by controlling the length of the weight vector in an associated feature space. We can prove that ν upper bounds the fraction of outliers (training points outside of S) and lower bounds the fraction of support vectors. Asymptotically, under some mild condition on P, both become equalities. The algorithm is a natural extension of the support vector algorithm to the case of unlabelled data.
no
notspecified
http://www.kyb.tuebingen.mpg.de/
published
1
Single-class Support Vector Machines
5271
10
K
Fukumizu
A
Gretton
A
Smola
VishwanathanGBS2004
10
SVN
Vishwanathan
O
Guttman
K
Borgwardt
A
Smola