Abstract:
Video-based recognition and prediction of a temporally extended
activity can benefit from a detailed description of
high-level expectations about the activity. Stochastic grammars
allow for an efficient representation of such expectations
and are well-suited for the specification of temporally
well-ordered activities. In this paper, we extend stochastic
grammars by adding event parameters, state checks, and
sensitivity to an internal scene model. We present an implemented
system that uses human-specified grammars to recognize
a person performing the Towers of Hanoi task from a
video sequence by analyzing object interaction events. Experimental
results from several videos show robust recognition
of the full task and its constituent sub-tasks even though
no appearance models of the objects in the video are provided.
These experiments include videos of the task performed
with differently shaped objects and with distracting
and extraneous interactions.