We are developing a new strategy for automated scene interpretation, especially for annotating cluttered scenes that display instances from many object categories (e.g., a kitchen scene) and for annotating videos of people interacting with objects in everyday life (e.g., cooking). We contend that efficient search and evidence integration are indispensable for handling the complexity of such scenes. Our approach is inspired by two facets of human search: divide-and-conquer querying in games like "Twenty Questions" and selective attention in natural vision. We want to design algorithms for shifting focus from one location to another at a fixed spatial scope, and for rapid and adaptive zooming, allowing one to switch from monitoring the scene as a whole to local scrutiny for fine discrimination.
These considerations inspire a new, model-based framework for determining what evidence to acquire from multiple scales and locations, and for coherently integrating the evidence by updating likelihoods. We introduce a novel information-theoretic approach to scene interpretation called entropy pursuit. Likely states are computed in a coarse-to-fine manner based on stepwise uncertainty reduction. This should be broadly applicable to scene interpretation problems arising in many areas of science and engineering where current systems are limited in scope. Finally, such results would naturally draw connections with biological vision, since maintaining a probability distribution over the possible spatiotemporal locations of objects and actions can be visualized as a spatiotemporal "heat map" and related to "priority maps" in selective attention.
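The core loop of stepwise uncertainty reduction can be sketched as a greedy query-selection procedure: maintain a posterior over scene states, and at each step ask the (possibly noisy) binary query whose answer is expected to shrink the posterior entropy the most. The sketch below is a minimal, hypothetical illustration of this idea, not the authors' actual system; the states, queries, noise model, and function names are all illustrative assumptions.

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a distribution given as a dict state -> probability."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def posterior(prior, query, answer, noise=0.1):
    """Bayes update: query(state) gives the true answer; observed answers
    flip with probability `noise` (an assumed symmetric noise model)."""
    post = {s: q * ((1 - noise) if query(s) == answer else noise)
            for s, q in prior.items()}
    z = sum(post.values())
    return {s: v / z for s, v in post.items()}

def expected_entropy(prior, query, noise=0.1):
    """Expected posterior entropy after asking a binary `query`."""
    total = 0.0
    for answer in (True, False):
        # Marginal probability of observing this answer under the prior.
        p_ans = sum(q * ((1 - noise) if query(s) == answer else noise)
                    for s, q in prior.items())
        if p_ans > 0:
            total += p_ans * entropy(posterior(prior, query, answer, noise))
    return total

def entropy_pursuit(prior, queries, oracle, steps=5, noise=0.1):
    """Greedily ask the query minimizing expected posterior entropy,
    then update the belief with the oracle's (noisy) answer."""
    belief = dict(prior)
    for _ in range(steps):
        best = min(queries, key=lambda q: expected_entropy(belief, q, noise))
        belief = posterior(belief, best, oracle(best), noise)
    return belief
```

In a scene-interpretation setting, the states would range over candidate annotations at a location, and the queries would correspond to classifier tests of varying spatial scope, so the coarse-to-fine behavior emerges because coarse, high-information tests are selected before fine, discriminative ones.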
Coarse-to-fine classification of objects within an image.