ObjectSpaces is an architecture that facilitates vision-based action and object recognition. We have developed ObjectSpaces as a hierarchical framework for representing prior knowledge about image contents using familiar object-oriented constructs called classes. We also present system processes, or layers, that facilitate feature extraction, motion characterization, and scene-wide context management.

Indigenous hand-based actions, such as flip forward, are associated with this book object.
Figure 1. Overhead view of scene with known objects highlighted.

The Architecture

Our architecture uses an adaptive bottom-up and top-down strategy with multiple, integrated layers. This framework supports multi-domain applications, context abstraction, and class reuse. The layers allow information to be shared whenever it is needed. For example, a person model in the Object layer supplies the Extraction layer with parameters that help guide color blob tracking. Likewise, blob trajectories that represent hand regions are passed back to the respective person object. This free exchange of information between layers helps offset the limitations of strictly bottom-up or top-down approaches. The complementary use of context enables the architecture to recover from failures or inconsistencies that occur at either the lowest or highest levels of abstraction in the system.

  • Extraction Layer - provides tracking and motion analysis facilities
  • Object Layer - contains all scene articles and people objects
  • Scene Layer - supervises interactions and provides domain-specific context
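The bidirectional exchange between layers can be sketched as follows. This is an illustrative sketch only; the class and member names (BlobTrajectory, TrackingHints, ExtractionLayer) are assumptions, not the original VARS source.

```cpp
#include <utility>
#include <vector>

struct BlobTrajectory {                        // produced by the Extraction layer
    std::vector<std::pair<int,int>> centroids; // hand centroid per frame
};

struct TrackingHints {                         // supplied by an Object-layer person model
    int expectedHandSize;                      // pixel area used to gate blob candidates
    unsigned char skinY, skinU, skinV;         // nominal skin color in YUV
};

class ExtractionLayer {
public:
    // Top-down: the Object layer passes person-model parameters down.
    void configure(const TrackingHints& hints) { hints_ = hints; }
    // Bottom-up: tracked hand trajectories are handed back up.
    BlobTrajectory latestTrajectory() const { return trajectory_; }
private:
    TrackingHints hints_{};
    BlobTrajectory trajectory_{};
};
```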

Figure 2. Information discovery, sharing, & archiving across layers.

The Article Class

A class is simply a data structure of properties and methods needed for implementing tasks and holding information. We begin by establishing the objects in the environment that we will focus on, deriving a class from Article for each type. Associated with each object is a finite state machine driven by contact with a person. Object state or a query from the Scene layer triggers other object functions. For example, initial contact with an object forces a transition from the inactive state to the tentative state. If the contact turns out to be transient, the object returns to the inactive state; otherwise, it progresses to the active state, which initiates a contact event and a request to the Extraction layer to compare hand motions against pre-trained motion models. A contact event is a record that maintains information about the interaction, such as the duration, the person performing the action, and a description of the action, if available. The Article class is the base class from which each of these object types is derived.
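The contact-driven state machine described above can be sketched as a minimal C++ class. The names (ObjectState, ContactEvent, Article's methods) are illustrative assumptions, not the original implementation.

```cpp
#include <string>
#include <vector>

enum class ObjectState { Inactive, Tentative, Active };

struct ContactEvent {          // record of one interaction
    int durationFrames = 0;
    std::string person;        // who performed the action
    std::string action;        // description, if recognized
};

class Article {
public:
    ObjectState state() const { return state_; }
    const std::vector<ContactEvent>& events() const { return events_; }

    // Initial contact: Inactive -> Tentative.
    void onContactBegin(const std::string& person) {
        if (state_ == ObjectState::Inactive) {
            state_ = ObjectState::Tentative;
            pending_ = ContactEvent{0, person, ""};
        }
    }
    // Sustained contact: Tentative -> Active, which logs a contact event
    // (the full system would also ask the Extraction layer to match motion models).
    void onContactSustained(int frames) {
        if (state_ == ObjectState::Tentative) {
            state_ = ObjectState::Active;
            pending_.durationFrames = frames;
            events_.push_back(pending_);
        }
    }
    // Transient contact: fall back to Inactive.
    void onContactLost() { state_ = ObjectState::Inactive; }

private:
    ObjectState state_ = ObjectState::Inactive;
    ContactEvent pending_{};
    std::vector<ContactEvent> events_;
};
```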

Figure 3. We look for evidence throughout the class hierarchy to differentiate between classes sharing the same parent.

The Person Class

The Person class works in tandem with the Extraction layer to locate people based on a model of a person. This view-based model is characterized by the arm/hand components as well as the head/torso component, as seen in Figure 5. The former is characterized by physical properties, such as hand size and skin color, as well as physiological considerations, like arm span. Likewise, size and shape specify the head/torso region. Skin color is described by an array containing all of the flesh tones in the person's hands. This color distribution is used in the Extraction layer to assist in the segmentation of the hands. We elect to use the YUV color space instead of RGB because it separates the luminance and chrominance bands.

Figure 5. Model of Person includes parameters for describing hand and head/torso regions.

  • region centroid
  • color distribution
  • pixel area
  • perimeter edge count
  • aspect ratio
  • arm spanned area
  • estimate of future hand position
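The region properties listed above can be gathered into a simple feature record, against which candidate blobs are gated. This is a sketch under assumed names and placeholder thresholds; the original class layout is not shown in the source.

```cpp
struct HandRegion {
    double cx = 0, cy = 0;       // region centroid
    int    pixelArea = 0;        // pixel area
    int    perimeterEdges = 0;   // perimeter edge count
    double aspectRatio = 1.0;    // bounding-box aspect ratio
    double predX = 0, predY = 0; // estimate of future hand position
};

// A candidate blob is kept only if it fits the person model's hand profile.
// The threshold parameters are illustrative placeholders.
inline bool matchesHandProfile(const HandRegion& r,
                               int minArea, int maxArea,
                               double maxAspect) {
    return r.pixelArea >= minArea && r.pixelArea <= maxArea &&
           r.aspectRatio <= maxAspect;
}
```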

The Scene Class

The Scene layer, derived from the Scene class, lies at the highest level of abstraction in the system and pays "attention" to the Object and Extraction layers. It maintains a list of scene objects and person objects. We also construct a matrix to hold the conditional probabilities between each pair of articles. This layer searches for correlations between object interactions in order to classify particular activities or to identify certain human behaviors. Domain-specific context about the environment can be embedded at this level.

Figure 6. P(R|Z1) > {P(R|Z2), P(R|Z3)} where R represents a hand region.
Spatial context regarding the location of articles in the surroundings can be embedded into activity zones to facilitate tracking and recognition. Activity zones are non-overlapping regions of the image, each specified by a bounding box and a probability.
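An activity zone as just described reduces to a bounding box plus a prior. The sketch below is an assumed representation (names ActivityZone, zonePrior are illustrative); because zones are non-overlapping, a point lookup can return the first match.

```cpp
#include <vector>

struct ActivityZone {
    int x0, y0, x1, y1;   // bounding box (inclusive)
    double prior;         // probability attached to this zone
    bool contains(int x, int y) const {
        return x >= x0 && x <= x1 && y >= y0 && y <= y1;
    }
};

// Return the prior of the zone containing (x, y), or 0 if none does
// (zones are non-overlapping, so at most one can match).
inline double zonePrior(const std::vector<ActivityZone>& zones, int x, int y) {
    for (const auto& z : zones)
        if (z.contains(x, y)) return z.prior;
    return 0.0;
}
```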

Class Reuse

Figure 7. Framework accommodates context abstraction for application in multiple domains.

Recognizing Hand-based Actions

To determine which items in the surroundings are handled, the location of a person's hands must be recovered. We segment skin-colored blobs from each frame using a predefined color table. Starting with a snapshot of the scene at time t, I(t), we rifle through each pixel in the image looking for members of the color distribution, producing a binary image B(t). Using connected components, we propagate through B(t) to produce a set of binary regions, or "blobs." Candidate blobs that do not match the profile supplied by the person model are eliminated. Hand position information is accumulated in a buffer, then passed to an HMM module that tries to identify known actions previously associated with a particular object.
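The connected-components step can be sketched with a standard flood fill over the binary image B(t). This is generic labeling machinery, not the original VARS code; the grid representation and Blob fields are assumptions.

```cpp
#include <queue>
#include <utility>
#include <vector>

struct Blob { int area = 0; double cx = 0, cy = 0; };

// Group set pixels of a binary image into 4-connected blobs,
// recording each blob's area and centroid.
std::vector<Blob> findBlobs(const std::vector<std::vector<int>>& binary) {
    int h = binary.size(), w = h ? binary[0].size() : 0;
    std::vector<std::vector<int>> seen(h, std::vector<int>(w, 0));
    std::vector<Blob> blobs;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            if (!binary[y][x] || seen[y][x]) continue;
            Blob b;
            std::queue<std::pair<int,int>> q;
            q.push({y, x}); seen[y][x] = 1;
            while (!q.empty()) {                 // flood fill one component
                auto [py, px] = q.front(); q.pop();
                b.area++; b.cx += px; b.cy += py;
                const int dy[] = {1, -1, 0, 0}, dx[] = {0, 0, 1, -1};
                for (int k = 0; k < 4; ++k) {
                    int ny = py + dy[k], nx = px + dx[k];
                    if (ny >= 0 && ny < h && nx >= 0 && nx < w &&
                        binary[ny][nx] && !seen[ny][nx]) {
                        seen[ny][nx] = 1; q.push({ny, nx});
                    }
                }
            }
            b.cx /= b.area; b.cy /= b.area;      // blob centroid
            blobs.push_back(b);
        }
    return blobs;
}
```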

Figure 8. Process of tracking the hands and recovering meaningful actions associated with an object.

Hidden Markov Models (HMMs)

A Hidden Markov Model (HMM) can be described as a finite-state machine characterized by two stochastic processes: one determines state transitions and is unobservable; the other produces the output observations for each state. The states are not directly determined by the observations; hence, the states are hidden. One goal of the HMM then becomes to uncover the most likely sequence of states that produced the observations.
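Recovering the most likely state sequence is the standard Viterbi decoding problem. The sketch below is textbook Viterbi over a transition matrix A, emission matrix B, and initial distribution pi; it is generic HMM machinery, not the original VARS implementation.

```cpp
#include <cmath>
#include <vector>

// Viterbi decoding: return the most likely hidden state sequence for a
// discrete observation sequence obs, working in log space for stability.
std::vector<int> viterbi(const std::vector<std::vector<double>>& A,
                         const std::vector<std::vector<double>>& B,
                         const std::vector<double>& pi,
                         const std::vector<int>& obs) {
    int N = pi.size(), T = obs.size();
    std::vector<std::vector<double>> logp(T, std::vector<double>(N));
    std::vector<std::vector<int>> back(T, std::vector<int>(N, 0));
    for (int i = 0; i < N; ++i)
        logp[0][i] = std::log(pi[i]) + std::log(B[i][obs[0]]);
    for (int t = 1; t < T; ++t)
        for (int j = 0; j < N; ++j) {
            double best = -1e300; int arg = 0;
            for (int i = 0; i < N; ++i) {
                double v = logp[t-1][i] + std::log(A[i][j]);
                if (v > best) { best = v; arg = i; }
            }
            logp[t][j] = best + std::log(B[j][obs[t]]);
            back[t][j] = arg;
        }
    // Trace back from the most probable final state.
    std::vector<int> path(T);
    double best = -1e300;
    for (int i = 0; i < N; ++i)
        if (logp[T-1][i] > best) { best = logp[T-1][i]; path[T-1] = i; }
    for (int t = T - 1; t > 0; --t) path[t-1] = back[t][path[t]];
    return path;
}
```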

HMMs are used to recover hand-based actions. We use HMMs because they handle data that varies in space and time well. To model actions that take place throughout the scene, hand position (centroid) alone is used to construct the observation feature. Our approach assumes a fixed, overhead camera (8'-10' from the floor), so scale variation is insignificant because perspective projection distortion is small. We anticipate that actions can occur anywhere within the scene. However, Cartesian position is not translation or rotation invariant. Unfortunately, expressing observation components as derivatives tends to erode the Viterbi algorithm's dynamic time warping of the data, which in turn compromises our handling of time variations.

Figure 9. 6 state, semi-ergodic HMM with skip transitions and corresponding states of "flip forward" action.

To normalize motion displacement, objects provide affine transformations to deal with translation and rotation. As the hands traverse space during some action, they pass through certain normalized areas in the image space that correspond to the HMM's states. Hand transitions from area to area generate a sequence of states, which is used to characterize an action. We assume all actions are single-handed motions; however, the image-based methods mentioned earlier can be invoked to detect hand-pair interaction with an object. During training, 20 examples of each action, captured by the same person, are manually segmented. Deliberate rest states are used as delimiters to parse individual actions during testing. The continuous HMM shown in Figure 9 was empirically selected to optimize recognition. After some testing, we found this model topology to perform better than strictly "left-to-right" structures.
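The object-supplied normalization above amounts to translating the trajectory to the object's origin and rotating it into the object's frame. A minimal sketch, assuming the object provides its centroid and an orientation angle (names are illustrative):

```cpp
#include <cmath>
#include <vector>

struct Pt { double x, y; };

// Express a hand trajectory in an object-relative frame: translate by the
// object centroid, then rotate by the negative of the object's orientation.
std::vector<Pt> normalizeTrajectory(const std::vector<Pt>& traj,
                                    Pt objectCentroid, double objectAngleRad) {
    double c = std::cos(-objectAngleRad), s = std::sin(-objectAngleRad);
    std::vector<Pt> out;
    out.reserve(traj.size());
    for (const Pt& p : traj) {
        double tx = p.x - objectCentroid.x;   // translate to object origin
        double ty = p.y - objectCentroid.y;
        out.push_back({c * tx - s * ty,       // rotate into object frame
                       s * tx + c * ty});
    }
    return out;
}
```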

Object Classification

Throughout video acquisition and analysis, we collect evidence that is assessed in order to label and summarize activity as well as to identify unknown objects and people. To develop awareness of human actions, as well as of unknown objects introduced to a scene, we merge extracted representations of motion with prior knowledge to synthesize beliefs. Bayes' theorem weighs the strength of belief in a hypothesis against prior knowledge and observed evidence. In addition, it provides attractive features, including: (1) its ability to pool evidence from various sources while making a hypothesis, and (2) its amenability to recursive or incremental calculation, especially when evidence is accumulated over time. These features motivate our application of Bayesian classification to summarize activities and resolve unknown objects.
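The recursive calculation mentioned in (2) reduces to re-weighting the current belief by each new likelihood and renormalizing: P(class | e1..en) ∝ P(en | class) · P(class | e1..e(n-1)). A minimal sketch of that update (function name assumed):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Incremental Bayes update: multiply the prior belief over candidate classes
// by the likelihood of the newest evidence, then renormalize.
std::vector<double> bayesUpdate(std::vector<double> belief,
                                const std::vector<double>& likelihood) {
    double norm = 0.0;
    for (std::size_t i = 0; i < belief.size(); ++i) {
        belief[i] *= likelihood[i];
        norm += belief[i];
    }
    if (norm > 0)
        for (double& b : belief) b /= norm;
    return belief;
}
```

Feeding each event's likelihoods through this update is what lets belief in a hypothesis grow (or erode) as evidence accumulates over time.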

Figure 10. Belief network corresponding to a naive Bayesian classifier for selecting the most likely GCM.


The system has been implemented in C++ to run under the Win9x/NT environment in near real-time. Our approach requires a static view of the scene from a ceiling-mounted camera. Video was acquired using color CCD cameras and a framegrabber (frames acquired in the YUV color space). The Vision-based Action Recognition System (VARS) is shown below as it captures interactive events.

Figure 11. VARS is shown above as it captures interactive events. VARS runs on any Windows NT/9x platform in real-time (with a framegrabber) or from stored image sequences. This system was developed by Darnell Moore.

Results and Experiments

Experiment I: Recognizing new objects

To demonstrate detection and recognition of newly introduced objects, several objects (book, notebook, printer, and mouse) were carried into the scene after the background was acquired. The system was already aware of several other objects in the room, including a keyboard, chair, and desk. Segmentation began immediately as initial image-based evidence of the unknown objects was acquired and initial beliefs were forged. As a person interacted with both known and unknown objects, the strength of belief grew in proportion to the number of actions identified, as shown in Figure 12. New action- or image-based information is extracted for one of the four unknowns during each "event" period. Rough shape and size information was sufficient for establishing Z1 and Z2 early on (event 1), but not Z3 and Z4. While relevant actions quickly classified Z1 as a notebook, conflicting actions registered to Z2 during events 12, 20, and 25 compromised belief testimony. Although Z3 has no actions associated with it, moderate belief can still be established by monitoring its interaction with the keyboard. In general, however, articles such as Z3 stand a greater chance of being mislabeled if actions associated with other GCMs are performed while interacting with them. Classification probability for Z1 through Z4 after 33 events (acquired over 1500 frames) was 97%, 94%, 79%, and 95%, respectively. Closer inspection reveals that 7, 12, 8, and 8 events, respectively, were needed to achieve this degree of classification.

Figure 12. The strength of belief (proportional to grayscale) as image and action events are registered to one of four unknowns.

Experiment II: Recognition from action alone

Experiments were also conducted to evaluate the strength of action-based evidence alone. Figure 13-a shows the mean log probability of candidate GCMs after 11 action events that were acquired over 583 frames. Note that belief was shared between the GCMs for notebook and book until event 7, when "write" was the most probable action observed, consequently rejecting book. Figure 13-b shows the accumulated likelihoods of several actions as they occurred throughout this sequence. It also reveals the potential for actions to be confused. Note that some actions that never actually occurred, such as "erase," have high accumulated probabilities, suggesting that they may be similar to several of the gestures performed. As a testament to inferring classification from action, adding initial image-based evidence to this action-based information yielded results that were only 3% higher.
Figure 13. (a) Mean log probability of GCM classification over several action events; (b) accumulated likelihoods of several actions as they occurred throughout the corresponding sequence, with the most probable action per event highlighted.

Experiment III: Recognizing objects in the background

To demonstrate detection of objects the system is unaware of using action alone, we performed several eating actions (stir, cut, feed) in an undeclared space in the scene. When actions, such as stir, occur for more than one GCM, belief is shared. Without image-based segmentation, motion normalization suffers, resulting in lower action recognition rates and occasional mislabeling. (Notice in events 8 and 9 in Figure 14, "open" and "erase" were mistaken for "feed" and "cut.") The table GCM exhibits the strongest belief, as shown in Figure 14. The suggested bounding box, determined by the new activity zone, appropriately covers the area where the activity takes place and can be used to indicate future contact with the table.

Figure 14. Mean log probability of GCM classification for an unknown object without image-based segmentation.









Georgia Tech, College of Computing, GVU Center
Copyright © 1997-2001