Image interpretation, which is effortless and instantaneous for human beings, is the grand challenge of computer vision. The dream is to build a "description machine" that produces a rich semantic description of the underlying scene, including the names and poses of the objects that are present, and even "recognizes" other aspects of the scene, such as actions and context. Mathematical frameworks are advanced from time to time, but none is yet widely accepted, and none clearly points the way to closing the gap with natural vision.
The goal of this project is to develop a new strategy for image interpretation, especially annotating cluttered scenes with instances from many object categories (e.g., a kitchen; see Images for samples) and videos of people interacting with objects in everyday life (e.g., cooking). Efficient search and evidence integration appear indispensable for handling such complex scenes. Our approach is inspired by two facets of human search: divide-and-conquer querying in games like "Twenty Questions" and selective attention in natural vision. In particular, we want to design algorithms for shifting focus from one location to another within a fixed spatial scope, and for rapid and adaptive zooming, allowing the system to switch from monitoring the scene as a whole to local scrutiny for fine discrimination, and back again, depending on the current input and on changes in target probabilities as events unfold.
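The divide-and-conquer querying mentioned above can be illustrated with a small sketch. The code below is purely illustrative (the hypotheses, queries, and probabilities are invented): given a posterior over hypotheses, it greedily selects the binary query whose answer removes the most uncertainty, i.e., whose "yes" probability is closest to 1/2, as in an idealized game of "Twenty Questions".

```python
import math

def entropy(p):
    """Binary entropy in bits; 0 at p = 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def best_query(posterior, queries):
    """posterior: dict hypothesis -> probability (sums to 1).
    queries: dict question -> set of hypotheses answering "yes".
    For a noise-free binary answer, expected information gain equals
    the entropy of the answer, so we maximize that."""
    def answer_entropy(yes_set):
        p_yes = sum(p for h, p in posterior.items() if h in yes_set)
        return entropy(p_yes)
    return max(queries, key=lambda q: answer_entropy(queries[q]))

# Toy example (all values hypothetical):
posterior = {"cup": 0.4, "bowl": 0.3, "pan": 0.2, "kettle": 0.1}
queries = {
    "is it used for drinking?": {"cup"},          # p_yes = 0.4
    "does it hold liquid?": {"cup", "bowl", "kettle"},  # p_yes = 0.8
    "is it metal?": {"pan", "kettle"},            # p_yes = 0.3
}
print(best_query(posterior, queries))  # → "is it used for drinking?"
```

The query with "yes" probability 0.4 is chosen because its answer entropy (about 0.97 bits) exceeds that of the 0.3 and 0.8 splits.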
We propose a model-based framework for determining what evidence to acquire from multiple scales and locations, and for coherently integrating the evidence by updating likelihoods. The model is Bayesian and is designed for efficient search and scene processing in an information-theoretic sense. One component is a prior distribution on a huge interpretation vector; each bit represents a high-level scene attribute, with widely varying degrees of specificity and resolution – some are very coarse (general hypotheses) and some are very fine (specific hypotheses). The other component is a simple conditional data model for a corresponding family of learned binary classifiers, one per bit. The classifiers are executed sequentially and adaptively; the order of execution is determined online, during scene parsing, and is driven by removing as much uncertainty as possible about the overall scene interpretation given the evidence to date. The correspondence between interpretation bits and classifiers allows for a fundamental "oracle" approximation: instead of choosing the next classifier to minimize conditional entropy (which is intractable), we choose the classifier whose corresponding interpretation bit is currently the most informative. This is feasible in practice by sampling from the current posterior distribution and selecting the bits with the highest variance. Hence, "entropy pursuit" alternates between testing and efficient optimization. The design of the classifiers will leverage recent advances at the intersection of machine learning and dynamical systems. Specifically, we will describe spatiotemporal subvolumes of a video at a specific scale and location with dynamical systems and develop classification methods for time-series data.
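The entropy-pursuit loop can be sketched in a few lines of code. This is a minimal illustration under strong simplifying assumptions that are not part of the proposal: the interpretation is a short binary vector, the posterior is represented by weighted samples, and each classifier reports its bit with a fixed accuracy (the constant `ACCURACY` below is invented). The loop repeatedly picks the unqueried bit with the highest posterior variance (the "oracle" approximation), runs its classifier, and reweights the samples by the conditional data model.

```python
ACCURACY = 0.9  # assumed P(classifier output = true bit); hypothetical

def posterior_mean(samples, weights, k):
    """Posterior marginal probability that bit k equals 1."""
    z = sum(weights)
    return sum(w * s[k] for s, w in zip(samples, weights)) / z

def most_uncertain_bit(samples, weights, asked):
    """Oracle approximation: among bits not yet queried, pick the one
    whose Bernoulli marginal m has maximal variance m * (1 - m)."""
    def variance(k):
        m = posterior_mean(samples, weights, k)
        return m * (1 - m)
    n_bits = len(samples[0])
    return max((k for k in range(n_bits) if k not in asked), key=variance)

def entropy_pursuit(samples, classifiers, n_steps):
    """Alternate between selecting the most informative bit, running
    its classifier, and updating sample weights by the likelihood of
    the observed classifier output."""
    weights = [1.0] * len(samples)
    asked = set()
    for _ in range(n_steps):
        k = most_uncertain_bit(samples, weights, asked)
        out = classifiers[k]()  # binary output of classifier k
        weights = [w * (ACCURACY if s[k] == out else 1 - ACCURACY)
                   for s, w in zip(samples, weights)]
        asked.add(k)
    return weights

# Demo: uniform prior over all 3-bit interpretations; the (assumed)
# classifiers report the true scene's bits without error.
samples = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
true_scene = (1, 0, 1)
classifiers = [lambda bit=bit: bit for bit in true_scene]
weights = entropy_pursuit(samples, classifiers, n_steps=3)
print([round(posterior_mean(samples, weights, k), 2) for k in range(3)])
# → [0.9, 0.1, 0.9]
```

After three queries the posterior marginals concentrate on the true interpretation, limited only by the assumed classifier accuracy; in the proposed framework the same update would be applied to samples drawn from the full prior over interpretation vectors.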
This framework should be broadly applicable to scene interpretation problems arising in many areas of science and engineering where current systems are limited in scope. Finally, such results would naturally draw connections with biological vision, since maintaining a probability distribution over the possible spatiotemporal locations of objects and actions can be visualized as a spatiotemporal "heat map" and related to "priority maps" in selective attention.