Image interpretation, which is effortless and instantaneous for human beings, is the grand challenge of computer vision. The dream is to build a "description machine" that produces a rich semantic description of the underlying scene, including the names and poses of the objects that are present, and that even "recognizes" other things, such as actions and context. Mathematical frameworks are advanced from time to time, but none is yet widely accepted, and none clearly points the way to closing the gap with natural vision.
The goal of this project is to develop a new strategy for image interpretation, especially for annotating cluttered scenes with instances from many object categories (e.g., a kitchen; see Images for some sample images) and videos of people interacting with objects in everyday life (e.g., cooking). Efficient search and evidence integration appear indispensable for handling such complex scenes. Our approach is inspired by two facets of human search: divide-and-conquer querying, as in games like "Twenty Questions", and selective attention in natural vision. In particular, we want to design algorithms for shifting focus from one location to another with a fixed spatial scope, and for rapid and adaptive zooming. Zooming allows one to switch from monitoring the scene as a whole to local scrutiny for fine discrimination, and back again, depending on current input and changes in target probabilities as events unfold.
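To make the "Twenty Questions" analogy concrete, here is a minimal sketch of divide-and-conquer querying over a one-dimensional strip of image cells. It is an illustration only, not the project's algorithm. It assumes a single target, noise-free yes/no answers, a uniform prior, and a cell count that is a power of two. All names (`best_query`, `update`, `locate`) are hypothetical. Each query asks "is the target inside this region?", and regions come in several sizes, loosely mimicking a choice between coarse monitoring and local scrutiny.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a yes/no answer whose probability of 'yes' is p."""
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0)


def best_query(posterior, queries):
    """Pick the region whose answer is most uncertain under the posterior.

    With noise-free answers, the expected information gain of a query
    equals the entropy of its answer, so this is a greedy choice.
    """
    return max(queries, key=lambda q: binary_entropy(sum(posterior[i] for i in q)))


def update(posterior, query, answer):
    """Zero out cells inconsistent with the answer, then renormalize."""
    kept = [p if ((i in query) == answer) else 0.0 for i, p in enumerate(posterior)]
    total = sum(kept)
    return [p / total for p in kept]


def locate(n_cells, true_cell):
    """Find true_cell by asking region queries; returns (estimate, queries asked).

    Assumes n_cells is a power of two.
    """
    posterior = [1.0 / n_cells] * n_cells  # uniform prior (an assumption)

    # Candidate regions: dyadic blocks of every size, from half the strip
    # (coarse scope) down to single cells (fine scope).
    queries = []
    size = n_cells // 2
    while size >= 1:
        for start in range(0, n_cells, size):
            queries.append(frozenset(range(start, start + size)))
        size //= 2

    asked = 0
    # Keep asking while more than one cell is still possible.
    while sum(p > 0 for p in posterior) > 1:
        query = best_query(posterior, queries)
        answer = true_cell in query  # simulated noise-free oracle
        posterior = update(posterior, query, answer)
        asked += 1
    return posterior.index(max(posterior)), asked


if __name__ == "__main__":
    print(locate(16, 11))
```

Under these assumptions the search halves the remaining candidates at every step, so 16 cells need 4 queries. Real scenes would call for noisy classifiers in place of the oracle and a prior that reflects scene context, so this greedy rule is only a starting point.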