- Neighborhood graph - it describes the similarity structure among points in a data set. Each vertex (= blue point in the figure) represents one data point. Two vertices are connected by a red line in the graph if the corresponding data points are "close" or "similar" or "strongly related" to each other. For example, vertices might represent people, and two people are linked in the graph if they are friends of each other. An important question is then to identify clusters or "communities" in the graph, that is sets of vertices that are tightly connected to each other. In the graph shown in the figure we might suspect two communities, one on the left hand side and one on the right hand side. One of our research questions is how such communities can be detected in very large graphs with millions of vertices, and how one can assess the statistical "significance" of the discovered communities.
Learning algorithms try to estimate functional dependencies based on empirical data. In the statistical approach to learning, the fundamental assumption is that the data are generated randomly by the same data-generating process. Given some empirical data, the goal of learning is to identify the best possible model from a set of competing ones. Statistical Learning Theory tries to assess to what extent this has been successful. The criterion of interest is the degree to which the model succeeds in capturing the regularities in the data generating process. Because of the randomness in the data-generating process, all statements about the success of a model for given data have an inherent statistical nature.
In our group we focus on two topics: theoretical analysis of clustering algorithms and the statistical analysis of graphs.
Clustering is one of the most widely used techniques for exploratory data analysis. Across all disciplines, from social sciences through biology to computer science, people try to get a first intuition about their data by identifying meaningful groups among the data points. Despite the popularity of clustering, distressingly little is known about the theoretical properties of clustering. This lack of theory becomes more and more urgent as data becomes increasingly complex, appears in massive amounts, has to be treated online, and clustering results cannot be inspected by hand any more. Given some data set and some task we want to achieve, what is the best algorithm to choose, which parameters should be used, and what guarantees can we give about the results we obtain? While these questions have essentially been answered for classification by now, for clustering all of them remain a challenge.
Statistical Analysis of Graphs
One of the strengths of machine learning is that it is able to deal with complex relationships between different data objects. Such relations are often modeled by a graph. For example, graphs can naturally be used to describe relations between proteins in a metabolic network, the structure of complex chemical compounds, the interactions between neurons in the brain, dependencies of linked web pages in the world wide web, social interactions between people, or interactions between sensors in a sensor network. In recent years, many machine learning algorithms have tried to explicitly make use of graph structures in the data: graph partitioning for clustering, label propagation for semi-supervised learning, manifold methods for dimensionality reduction, graph kernels to describe either the similarity of objects in a graph or the similarity between different graphs, classification with structured output, or detecting communities in graphs. In all those areas, many new questions arise that deal with the relation between graph theory, statistics, and geometry. Exactly which graphs do we want to use to model our data? Which properties of graphs are attractive for machine learning? Which ones are misleading? How can we compare different graphs? How can we extend graphs to new objects? How can we deal with the computational challenges that are often posed by graph-based algorithms? How can we efficiently make use of the graph structure?