Alumni of the Cognitive Engineering Group
I have finished my thesis, "Learning Data-Driven Representations for Robust Monocular Computer Vision Applications". It can be found at
Computer Vision aims at teaching computers how to 'see', i.e., to understand the contents of an image or video. Remarkable progress in this field has enabled consumer-level cameras to identify faces in real time and even to determine when they smile. It also allows cars to drive nearly autonomously when conditions are favourable. However, these success stories are mostly limited to rather specialized applications.
Machine Learning has significantly broadened the range of applicability of computer vision systems in recent years. In my work, I have designed, implemented, and analyzed new approaches to classical computer vision problems that use learning to find new solutions to old problems. One of these is self-motion estimation: to drive autonomously, cars and other mobile agents need to estimate their current speed and direction of motion. My supervisor and I have shown that imagery from a single camera suffices to estimate its self-motion in everyday driving scenarios, using a new representation of optical flow.
This screenshot shows a visualization of typical results:
The current input frame is shown on top, overlaid with observed flow vectors, classified as inliers (green, caused by self-motion) and outliers (red, caused by moving objects, unusual geometry, or false measurements). Also overlaid are visualizations of the current motion: speed and yaw angle in the middle of the lower part of the image, and the horizon line in the middle of the image, obtained by integrating pitch and roll. Below the example frame are time courses of the estimated motion: forward motion, followed by changes in yaw, pitch, and roll. The estimated motion (red) coincides well with the true motion (black) despite frequent changes in pitch and roll.
For cars to drive autonomously on our streets, they also need to watch out for other cars and pedestrians. Predicting their future motion paths is important to avoid critical situations and collisions. A car's future motion depends strongly on its orientation, which is usually estimated together with its position. We have presented a new way to estimate the orientation of cars from single images and have shown that using such estimates improves estimates of object motion and orientation in tracking applications. To do so, we represent uncertain and often ambiguous orientation knowledge as a multimodal distribution of a continuous circular variable.
My earlier PhD work is on the interpretation of single images, supervised by . When we humans look at a picture -- say, of our friend's last holiday -- we need only a split second to understand it in a rough way: among other things, we can tell the scene type, the location of a few prominent objects, and the approximate geometrical layout, and we even have an idea of the relative position and orientation of the photographer. Combining existing algorithms for scene type classification, object detection, and the estimation of geometry and viewpoint, we aimed to re-create such a representation in the computer as a basis for further task-specific visual processing.
We also published a paper on psychophysical experiments designed to test whether some notion of viewpoint is indeed part of this human gist. For this, we asked human participants to identify the horizon line, one measure of viewpoint, in images after having seen them for only 153 ms. They did so with similar precision as when seeing the images for a longer time, drawing on several different cues.
For a more in-depth description of my work, please see the
During my PhD, I worked on several projects.
While supervised by , I explored new ideas for robust monocular computer vision in an automotive context. Building on work by Roberts and colleagues ("Learning General Optical Flow Subspaces for Egomotion Estimation and Detection of Motion Anomalies", CVPR 2009), we extended a probabilistic latent variable model that represents optical flow fields as a linear combination of basis flow fields plus a mean. Flow vectors that do not fit the model -- missing values, flow measurement errors, flow from independently moving objects, and unusual scene geometry -- are automatically identified and discarded from further processing. The mapping from the model's latent variables to self-motion (forward motion and changes in yaw, pitch, and roll) is linear. Our extensions include a more flexible variance distribution of the flow vector components and a mixture of experts that broadens the model's applicability. We analyzed the original and extended models and tested them on challenging data sets. The results show that this model can be used for accurate estimation of self-motion and of the focus of expansion.
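The subspace idea behind this model -- flow fields as a mean plus a linear combination of basis flow fields, with a linear map from latent coordinates to self-motion -- can be sketched in a few lines of NumPy. This is a minimal linear-algebra illustration on synthetic data, not the probabilistic model from the thesis; all dimensions, names, and the generated data are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: each flow field is a vector of stacked (u, v)
# components, generated as a linear function of a hidden 4-D self-motion
# vector (forward motion, yaw, pitch, roll) plus noise.
n_samples, n_dims, n_motion = 200, 2 * 64, 4
A_true = rng.normal(size=(n_dims, n_motion))        # unknown generative map
motions = rng.normal(size=(n_samples, n_motion))    # true self-motions
flows = motions @ A_true.T + 0.01 * rng.normal(size=(n_samples, n_dims))

# Learn the subspace: a mean flow field plus basis flow fields (via SVD/PCA).
mean_flow = flows.mean(axis=0)
centered = flows - mean_flow
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
basis = Vt[:n_motion]                               # basis flow fields

# Latent coordinates of each training flow field in the subspace.
latents = centered @ basis.T

# The latent-to-self-motion mapping is linear: fit it by least squares.
motion_mean = motions.mean(axis=0)
M, *_ = np.linalg.lstsq(latents, motions - motion_mean, rcond=None)

# Estimate self-motion for a new, unseen flow field.
test_motion = rng.normal(size=n_motion)
test_flow = A_true @ test_motion
est_motion = ((test_flow - mean_flow) @ basis.T) @ M + motion_mean
print("max abs error:", np.max(np.abs(est_motion - test_motion)))
```

The actual model additionally handles missing and outlier flow vectors probabilistically, which the plain SVD fit above cannot do.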
In our second project, we investigated the benefit of using object orientation measurements for object state filtering. This could help in predicting other vehicles' future states and thus in risk detection and collision avoidance. We built our object orientation estimator on the deformable part object detector by Felzenszwalb and colleagues, using the positions and sizes of the parts relative to the object bounding box, as well as the overall size and position of the object, as features. We trained random regression trees to estimate car viewpoint from these features and represent the output as a multimodal distribution of a continuous circular variable. We then use particle filters to disambiguate estimates from temporal context, which involves circular versions of kernel density estimation and mean shift clustering. Combining monocular self-motion estimates from the project above with these object orientation estimates in a dynamic object state filtering approach shows that object state filtering from a moving observer benefits from viewpoint measurements.
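A circular kernel density estimate of this kind can be sketched with von Mises kernels: each orientation measurement contributes a kernel on the circle, and the peaks of the resulting density are the candidate orientations. This is an illustrative sketch, not the exact method from the paper; the sample data, the kernel concentration, and all names are assumptions.

```python
import numpy as np

def vonmises_kde(samples, kappa, grid):
    """Kernel density estimate on the circle: one von Mises kernel per
    orientation sample, evaluated at the grid angles (radians)."""
    diffs = grid[:, None] - samples[None, :]
    kernels = np.exp(kappa * np.cos(diffs)) / (2 * np.pi * np.i0(kappa))
    return kernels.mean(axis=1)

# Hypothetical orientation measurements for one tracked car: most near
# 0.5 rad, but a third flipped by 180 degrees (front/back ambiguity).
rng = np.random.default_rng(1)
samples = np.concatenate([
    rng.vonmises(0.5, 20.0, size=40),
    rng.vonmises(0.5 + np.pi, 20.0, size=20),
])

grid = np.linspace(-np.pi, np.pi, 360, endpoint=False)
density = vonmises_kde(samples, kappa=8.0, grid=grid)

# The density is multimodal; its peaks are the candidate orientations
# that temporal filtering can later disambiguate.
strongest_mode = grid[np.argmax(density)]
```

Unlike a histogram, this density wraps correctly at ±π, which is what makes the circular versions of KDE and mean shift necessary in the first place.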
Before working with image sequences, I worked on building a general-purpose representation of image content while supervised by . This representation is based on the 'gist of a scene', which forms within a few hundred milliseconds in the human brain and is the basis for further visual processing. We tried to create a similar representation by combining the output of object detectors, scene type classifiers, estimates of surface orientation, camera height, and pitch, as well as learned prior knowledge, which could be expressed as 'cars do not float' or 'finding people in city images is quite likely'. The project showed promising preliminary results. To clarify the role of viewpoint in this gist, we performed experiments on horizon estimation. We showed that participants are able to estimate the horizon in an image quite accurately, even when having seen the image for only 153 ms, suggesting that viewpoint information could indeed be part of this first processing step of visual stimuli. We also investigated the cues underlying horizon estimation by introducing variations of the stimuli and by comparing human performance with that of simple computer algorithms.
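To give an idea of the kind of baseline meant by 'simple computer algorithms', a horizon estimator can be as basic as the following toy sketch, which picks the image row with the strongest sky/ground brightness transition. This is purely illustrative and assumes a synthetic image; it is not one of the baselines actually used in the experiments.

```python
import numpy as np

def naive_horizon_row(gray):
    """Toy horizon estimator: return the row index where the mean
    vertical intensity gradient is largest, i.e. the strongest
    horizontal sky/ground transition. Illustrative assumption only."""
    grad = np.abs(np.diff(gray.astype(float), axis=0))  # |d/dy| per row gap
    return int(np.argmax(grad.mean(axis=1))) + 1

# Synthetic test image: bright 'sky' above row 40, darker 'ground' below.
rng = np.random.default_rng(2)
img = np.vstack([np.full((40, 64), 200.0), np.full((24, 64), 80.0)])
img += rng.normal(0.0, 5.0, size=img.shape)
print(naive_horizon_row(img))  # → 40
```

Such a brightness-gradient baseline fails on images without a clear sky/ground contrast, which is exactly why comparing it against human performance helps isolate which cues humans actually use.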
In a side project with , , and , we built and evaluated a new image retrieval tool based on semantic sketches. Specifying the spatial arrangement of semantic classes like 'house' or 'car' can help image retrieval in many cases and is feasible given proper annotations.
For these projects, I would like to acknowledge supervision by ()
|Oct 2001 - Apr 2007|Studied Mathematics at Karlsruhe University|
|Nov 2007 - Feb 2013|PhD in Informatics here at MPI|
|Aug 2013 - Nov 2013|Postdoc at MPI|