Uncertainty is present at many levels in biological and artificial adaptive systems. Exemplars from which a system learns are frequently noisy, mislabeled, or atypical. Even with high-quality data, gauging and combining a multitude of data sources and constraints in usually imperfect models of the world requires us to represent and process uncertain knowledge in order to make sound decisions. In the increasingly complicated settings of modern science, model structure or causal relationships may not be known a priori.
The probabilistic framework for working with uncertain information is at the heart of the Causal and Probabilistic Inference group. In the Bayesian setting, probabilities are used to represent beliefs about variables of interest and their relationships. The calculus is based on simple rules from probability theory, which are used to refine or update beliefs in the light of new data. The theory of Bayesian inference is well developed, and its fundamental role for decision theory has long been established. A central difference between the Bayesian and other mainstream traditions in statistics lies in the treatment of unobserved variables: these are summed over all possible instances in the former, while often estimated or otherwise conveniently imputed in many of the latter. While justified, the Bayesian insistence on summations leads to two major problems, which need to be addressed. First, the prior distributions, or equivalently the weighting of instances in the summations, need to be specified or assessed. Second, the summations are typically computationally hard due to nonlinear dependencies between many variables, and approximations are typically unavoidable. These problems are usually related, in that some classes of priors may lead to more tractable computations than others, and modeling versus approximation errors need to be traded off against each other. Large-scale Bayesian applications in machine learning or statistics are conceptually guided by, but often diverge considerably from, the pure theory. This leaves much room and high demand for novel concepts, algorithms, and theoretical as well as empirical analyses, in what is sometimes misleadingly presented in textbooks as a closed and solved framework.
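The basic calculus of refining beliefs with new data can be illustrated with a minimal sketch. The model below (a conjugate Beta-Bernoulli pair, chosen here purely for illustration and not taken from the text) shows how a prior over a coin's bias is updated in closed form as observations arrive:

```python
import numpy as np

# Illustrative sketch: Bayesian updating of a belief about a coin's bias
# theta, with a conjugate Beta prior and 0/1 (Bernoulli) observations.
def beta_bernoulli_update(alpha, beta, observations):
    """Return posterior Beta parameters after observing 0/1 outcomes."""
    heads = int(np.sum(observations))
    tails = len(observations) - heads
    # Conjugacy: posterior is Beta(alpha + heads, beta + tails).
    return alpha + heads, beta + tails

# Start from a uniform prior Beta(1, 1) and update on four coin flips.
alpha, beta = beta_bernoulli_update(1.0, 1.0, [1, 1, 0, 1])
posterior_mean = alpha / (alpha + beta)  # E[theta | data] = 4/6
```

In this conjugate case the "summation" over the unobserved bias is available analytically; the two problems named above arise precisely when no such closed form exists.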
The focus of our research directly addresses these questions, both at the level of algorithms for machine learning, and of individual applications of inference. We develop new algorithms and approximation techniques, assess their quality both in theory and in empirical evaluations, and apply them to challenging problems. Below we exemplify some of these developments.
Gaussian process (GP) models are probabilistic kernel-based learning algorithms. Through parameterization of the kernel (or covariance function), a flexible family of models can be treated, and inference over kernels can be approximated. We are addressing several issues of fundamental importance for GPs, such as the use of sparse approximations and the evaluation of approximation techniques for classification models, as well as theoretical aspects including generalization error bounds and online sequence prediction. We also apply GP models to problems in reinforcement learning and control, and to real-time inverse kinematics in autonomous robots.
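The core of GP regression is short enough to sketch directly. The following minimal implementation (a standard textbook construction, not the group's own code; the squared-exponential kernel and fixed hyperparameters are illustrative assumptions) computes the posterior mean and covariance at test inputs:

```python
import numpy as np

# Minimal GP regression sketch with an assumed squared-exponential kernel;
# hyperparameters are fixed by hand here rather than learned.
def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X_train, y_train, X_test, noise=1e-2):
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test)
    K_ss = rbf_kernel(X_test, X_test)
    # Cholesky-based solve for numerical stability.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v
    return mean, cov

X = np.array([-2.0, 0.0, 2.0])
y = np.sin(X)
mean, cov = gp_posterior(X, y, np.array([0.0]))
```

Exact inference as above costs O(n^3) in the number of training points, which is precisely what the sparse approximations mentioned in the text are designed to avoid.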
When observations are measured as time series, for example neuronal recordings or the monitoring of a machine or factory, it is important to capture the temporal dynamics. This allows us to group time series according to similar dynamics, for instance to deduce common functions in biology, or to segment a time course into regimes of distinct dynamical behavior, for example to detect anomalies. We have developed Bayesian approaches for time-series clustering and segmentation, using deterministic variational approximations, which work without supervision and compare favorably with classical approaches in terms of accuracy and stability. We applied these models to spike sorting of neuronal recordings, and to clustering and segmentation of human movement trajectories for robot imitation learning. Bayesian methods are especially valuable for higher-order problems, such as the optimization of hyperparameters in hierarchical models. In this context, it is possible to support even non-Bayesian methods (which may be preferable due to faster running time) by higher-order Bayesian learning, if the only alternative lies in time-consuming cross-validation procedures.
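The idea of grouping time series by their dynamics rather than their values can be illustrated with a much simpler non-Bayesian stand-in for the variational methods described above: fit an AR(1) coefficient to each series and compare the fitted coefficients (the AR(1) model and least-squares fit are illustrative simplifications, not the group's method):

```python
import numpy as np

# Simplified illustration (not the variational approach described in the
# text): characterize each series by a least-squares AR(1) coefficient,
# so that series with similar dynamics can be grouped together.
def ar1_coefficient(x):
    """Least-squares estimate of a in x[t] = a * x[t-1] + noise."""
    return np.dot(x[:-1], x[1:]) / np.dot(x[:-1], x[:-1])

rng = np.random.default_rng(0)

def simulate_ar1(a, n=500):
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = a * x[t - 1] + rng.normal()
    return x

# Two slowly varying series and one oscillatory one.
series = [simulate_ar1(0.9), simulate_ar1(0.9), simulate_ar1(-0.5)]
coeffs = [ar1_coefficient(x) for x in series]
# Similar dynamics give similar coefficients, so grouping can proceed,
# e.g. by thresholding pairwise distances |a_i - a_j|.
```

The Bayesian treatment replaces the point estimates above with posterior distributions over dynamics, which is what makes unsupervised clustering and segmentation robust to noise.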
In experimental design, the goal is to optimize data sampling or measurement architectures such that inference or estimation can be carried out faster or at lower cost. We develop Bayesian experimental design and efficient approximate inference algorithms for GP and sparse generalized linear models, and address applications in systems biology, natural image reconstruction, and magnetic resonance imaging. The expertise within our group allows us to combine Bayesian techniques, such as variational inference, message passing, and MCMC, with ideas from adjacent fields such as sparse estimation and convex optimization. While Bayesian practitioners can rely on robust and consistent inference principles, they need to master a large and rapidly growing number of approximations and computational techniques, and be prepared to find a good trade-off specifically for the problem at hand. Many of these techniques and concepts come from, or have been developed in the context of, classical statistics.
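A common design criterion in this Gaussian setting is expected information gain, which for a Bayesian linear model amounts to measuring where predictive variance is largest. The sketch below (a generic textbook construction with assumed priors and candidate points, not the group's algorithms) chooses the next measurement on that basis:

```python
import numpy as np

# Hypothetical sketch: sequential experimental design for Bayesian linear
# regression. Under Gaussian assumptions, picking the candidate input with
# the largest predictive variance maximizes expected information gain.
def posterior(Phi, y, alpha=1.0, sigma2=0.1):
    """Posterior over weights: prior N(0, alpha^-1 I), noise variance sigma2."""
    A = alpha * np.eye(Phi.shape[1]) + Phi.T @ Phi / sigma2
    S = np.linalg.inv(A)              # posterior covariance
    m = S @ Phi.T @ y / sigma2        # posterior mean
    return m, S

def next_design_point(candidates, S, sigma2=0.1):
    """Index of the candidate feature vector with largest predictive variance."""
    var = np.einsum('ij,jk,ik->i', candidates, S, candidates) + sigma2
    return int(np.argmax(var))

# Features phi(x) = [1, x]; so far we have only measured near x = 0.
Phi = np.array([[1.0, 0.0], [1.0, 0.1]])
y = np.array([0.0, 0.05])
m, S = posterior(Phi, y)
cands = np.array([[1.0, 0.05], [1.0, 2.0]])
chosen = next_design_point(cands, S)
# The far-away input x = 2.0 is the more informative next measurement.
```

For sparse generalized linear models the posterior is no longer Gaussian, which is where the approximate inference machinery mentioned above enters.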
While an unfortunate divide still exists between many Bayesian and classical statisticians, we try to build on both these traditions to get closer to the common goal of efficient, reliable Bayesian machine learning.
Being probabilistic methods, Bayesian approaches make statements about distributions of parameters, variables, or predictions. Such statements can also concern dependencies between observed quantities (e.g., the correlation between the size of the stork population and the human birth rate). A convenient way of visualizing the dependency structure implied by a joint distribution is a graphical model, or Bayesian network. The joint density then factorizes into conditional densities of each variable given its parents, for every graph that renders the joint distribution Markovian. One should, however, not mistake the links in such networks for causal links indicating that a parent variable exerts a causal influence on its child nodes. Inferring the causal graph that visualizes the causal structure of the data-generating process is a highly non-trivial task requiring additional assumptions. Indeed, there used to be a substantial controversy in the statistics community as to whether this is possible at all when using data from passive observations only.
A small but growing community studying causal inference, however, develops methods to construct causal graphs from such data. These are directed acyclic graphs whose links represent causal relationships, and for which the joint distribution is Markovian and faithful, meaning that the observed pattern of conditional independences is generic in the sense that it is common to almost all choices of the conditional densities assigned to each node. State-of-the-art causal inference algorithms work with these assumptions but share the following drawbacks: first, conditional statistical independence tests often rely on strong assumptions like multivariate Gaussianity and linearity of causal links and, second, the Markov and the faithfulness condition are often consistent with a large set of causal graphs (“Markov equivalent graphs”) rather than selecting a unique structure.
To address the first problem we have modified the algorithms by using kernel independence tests which indeed improved the performance. To tackle the second problem, i.e., to distinguish between Markov equivalent causal graphs, additional assumptions are needed. Implicit or explicit priors that favor “simple” conditionals can render one of several competing causal hypotheses more likely. We have explored novel approaches postulating additional statistical asymmetries between cause and effect that we sketch now. Apart from an entropy-based criterion for formally selecting hypotheses that yield simple conditionals, we have defined complexity of conditional probability densities by semi-norms in reproducing kernel Hilbert spaces and developed causal inference algorithms based upon this definition.
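A kernel independence test of the kind used above can be sketched compactly. The statistic below is in the spirit of HSIC (the fixed unit bandwidth and the biased empirical estimator are simplifying assumptions made for this illustration):

```python
import numpy as np

# Sketch of a kernel dependence measure in the spirit of HSIC: a biased
# empirical estimate with fixed-bandwidth RBF kernels. Unlike correlation,
# it also detects non-linear, non-monotonic dependence.
def rbf_gram(x, bandwidth=1.0):
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-0.5 * d2 / bandwidth**2)

def hsic(x, y):
    n = len(x)
    K, L = rbf_gram(x), rbf_gram(y)
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return np.trace(K @ H @ L @ H) / n**2

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# x and x**2 are uncorrelated but strongly dependent.
dependent = hsic(x, x**2 + 0.1 * rng.normal(size=200))
independent = hsic(x, rng.normal(size=200))
```

In practice the statistic is calibrated against its null distribution (e.g. by permutation) to obtain a test; here the dependent pair simply yields a visibly larger value than the independent one.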
Another method, which uses priors on conditionals in a more implicit way, has recently been proposed in the literature. It exploits the fact that linear structural equations between non-Gaussian variables induce joint distributions that do not admit any causal graph other than the original one, which is then assumed to be the true causal structure. We have extended this approach (called “LINGAM”) to deal with time series, where the ground truth is known because the time order excludes causal links influencing the past. Our work confirmed that empirical time series are significantly more likely to admit a linear autoregressive moving average model in the forward than in the backward direction.
We have also transferred the LINGAM idea to non-linear structural equation models, showing that functional relations with additive noise between two variables induce, in the generic case, joint distributions that do not admit an additive noise model in the backward direction. Although non-linearity makes the analysis significantly harder, this setting has the advantage that it also works when the noise is Gaussian, a case where the linear approach fails. To test whether such an additive noise model exists, one checks whether the predictor variable and the residual errors of a non-linear regression are statistically independent. Kernel independence tests provided a helpful tool for this task.
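The residual-independence test just described can be sketched end to end. The example below is a simplified illustration (polynomial regression and a fixed-bandwidth HSIC-style statistic are stand-ins for the non-linear regression and kernel tests used in the actual work): fit in both directions on synthetic data with known ground truth, and prefer the direction whose residuals are closer to independent of the predictor:

```python
import numpy as np

# Simplified additive-noise-model direction test: fit a non-linear
# regression in both directions and compare residual dependence with a
# kernel (HSIC-style) statistic.
def rbf_gram(x, bandwidth=1.0):
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-0.5 * d2 / bandwidth**2)

def hsic(x, y):
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(rbf_gram(x) @ H @ rbf_gram(y) @ H) / n**2

def residual_dependence(cause, effect, degree=3):
    """Polynomial fit effect ~ f(cause); return dependence between cause
    and the residual. Small values indicate a valid additive noise model."""
    cause = (cause - cause.mean()) / cause.std()
    effect = (effect - effect.mean()) / effect.std()
    coef = np.polyfit(cause, effect, degree)
    residual = effect - np.polyval(coef, cause)
    return hsic(cause, residual)

rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = x**3 + x + rng.normal(size=300)   # true model: x -> y with additive noise
forward = residual_dependence(x, y)   # residuals ~ independent of x
backward = residual_dependence(y, x)  # no additive model in this direction
```

Because the backward direction generically admits no additive noise model, its residuals retain dependence on the predictor, and the asymmetry identifies the causal direction.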
To justify causal inference rules of the above type, we have developed graphical models that represent algorithmic dependences among single objects instead of statistical dependences among random variables. We have proved that three versions of the Markov condition that are known to coincide for statistical dependences essentially also coincide for algorithmic dependences. Based on these results, we have postulated an algorithmic Markov condition relating the observed algorithmic dependences to causality, and shown that it can be justified in strong analogy to the statistical causal Markov condition. This adds another level to causal inference, because a graph may explain the statistical but not the algorithmic dependences. The algorithmic Markov condition imposes, for instance, that the shortest description of the joint density P(cause, effect) is given by separate descriptions of P(cause) and P(effect | cause). Causal graphs for which the conditional densities assigned to the different nodes share algorithmic information must be rejected as “non-generic”, for the same reason that one does not accept unfaithful distributions (because they would require specific adjustments of the network parameters).