Wearable Computing and Contextual Awareness


by

Thad Eugene Starner

S.B., Massachusetts Institute of Technology (1991)
S.B., Massachusetts Institute of Technology (1991)
S.M., Massachusetts Institute of Technology (1995)

Submitted to the Program in Media Arts and Sciences,
School of Architecture and Planning,
in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 1999

(c)Massachusetts Institute of Technology 1999. All Rights Reserved.

Abstract

Computer hardware continues to shrink in size and increase in capability. This trend has allowed the prevailing concept of a computer to evolve from the mainframe to the minicomputer to the desktop. Just as the physical hardware changes, so does the use of the technology, tending towards more interactive and personal systems. Currently, another physical change is underway, placing computational power on the user's body. These wearable machines encourage new applications that were formerly infeasible and, correspondingly, will result in new usage patterns. This thesis suggests that the fundamental improvement offered by wearable computing is an increased sense of user context.

I hypothesize that on-body systems can sense the user's context with little or no assistance from environmental infrastructure. These body-centered systems that ``see'' as the user sees and ``hear'' as the user hears, provide a unique ``first-person'' viewpoint of the user's environment. By exploiting models recovered by these systems, interfaces are created which require minimal directed action or attention by the user. In addition, more traditional applications are augmented by the contextual information recovered by these systems.

To investigate these issues, I provide perceptually sensible tools for recovering and modeling user context in a mobile, everyday environment. These tools include a downward-facing, camera-based system for establishing the location of the user; a tag-based object recognition system for augmented reality; and several on-body gesture recognition systems to identify various user tasks in constrained environments.

To address the practicality of contextually-aware wearable computers, issues of power recovery, heat dissipation, and weight distribution are examined. In addition, I have encouraged a community of wearable computer users at the Media Lab through design, management, and support of hardware and software infrastructure. This unique community provides a heightened awareness of the use and social issues of wearable computing. As much as possible, the lessons from this experience will be conveyed in the thesis.


Acknowledgments



I thank my Lord for His gift of a privileged life as a researcher and my parents for being a continuing source of support over the years. I take neither for granted, even though it may appear otherwise sometimes.

I would also like to thank my wife, Tavenner Hall, for her support and endless hours editing my documents. She knows I don't take her for granted!

This time period has, without a doubt, been the most difficult of my life. In addition to the happiness of getting married, finding a faculty position, and buying a house, there have been the stresses of thesis defense, moving, a robbery, finishing this document, and what seems an unending list of severe family medical problems. In fact, sizeable portions of this work were performed in airplanes, cars, and hospitals around the country. I would like to thank my readers and my MIT and Georgia Tech families for their forbearance and support during this trying time.

On a happier note, I would like to thank my advisor for ten years of guidance and fruitful research. I do not think Sandy knew what he was getting into when I hunted him down and insisted on doing undergraduate research with him. It has been a blast (hopefully for him as well)! Both of my readers have had a significant impact beyond this thesis. Pattie Maes's remarks in her 1993 Agents class provided the foundation for much of my thinking on the potential uses and style of interfaces for wearable computing. Steve Feiner's talk at the Media Lab on augmented reality gave me confidence that such research could indeed be valued in academia and provided some example scenarios which I re-implemented with computer vision (with permission).

I would like to thank Bradley Rhodes and Lenny Foner for many insights over the years and helping to create a collegial and thoughtful environment for wearable computing in the Media Lab. Many people have influenced my views and efforts, possibly more so than in most dissertations due to the ``living experiment'' aspect of this work, and it would be impossible to thank them all. Some probably don't even remember their off-hand comments or actions that have shaped my opinions and presentations over the years. I would like to thank, in no particular order, Neil Gershenfeld, Roz Picard, Hiroshi Ishii, Aaron Bobick, Whitman Richards, Rehmi Post, Maggie Orth, Doug Platt, Chris Schmandt, Walter Bender, Joe Paradiso, Joe Jacobson, Marvin Minsky, Nicholas Negroponte, Henry Lieberman, Mike Hawley, Olin Shivers, Irfan Essa, Steve Roberts, Ken Russell, Matt Reynolds, Nitin Sawhney, Deb Roy, Mark Billinghurst, Steve Mann, Hong Tan, Dan Gruhl, Josh Weaver, Richard Stallman, Mitch Resnick, Jen Healey, Ed Keyes, Ali Azarbayejani, Amy Bruckman, Ron MacNeil, Justine Cassell, Trevor Darrell, Tom Minka, Andy Wilson, Yuri Ivanov, Ifung Liu, Kris Popat, Sumit Basu, Flavia Sparacino, Matthew Turk, Devon McCullough, Brygg Ullmer, Mike Johnson, Chris Wren, Stephen Intille, Alex Sherstinsky, Rob Poor, Dana Kirsch, Jeff Levine, Adam Oranchak, Ben Walter, Bayard Wentzel, Solomon Assefa, Len Giambroni, Kevin Pipe, Baback Moghaddam, Martin Bichsel, Justin Seger, Tony Jebara, Russ Hoffman, Lee Campbell, John Makhoul, Rich Schwartz, Francis Kubala, Gerald Maguire, Paul Picot, David Kaplowitz, and Joost Bonsen. In addition, I'd like to thank Peter Cochrane, Marcus Smith, Barry Crabtree, Jerry Bowskill, and all the great people at BT Labs; Chris George at HandyKey; Robert Kinney and the U.S. Army Natick Research Labs; and Al Becker and the folks at Reflection Technology.

I am extremely grateful to the project's UROP's, and I've tried to indicate their efforts explicitly in the main document. Thanks also to the faculty, research affiliates, post-docs, graduate students, and undergrads who have comprised the Perceptual Computing and Vision and Modeling groups over the years. You have taught me much.

My sincere thanks to my office-mate Brian Clarkson, who patiently shared his space with the ``Wearables Closet'' and its hordes of experimenters.

Thanks to the many members of the wearables community, especially the IEEE Task Force and the members of the wearables and technomads mailing lists, who have provided comments, direction, criticism, and even hard labor at times. You know who you are.

A huge thanks to Karen Navarro and the second floor crew for the assistance in managing the Wearables Closet, the first IEEE ISWC, and the associated interactions with the press. My gratitude also goes to Linda Peterson who provides a voice of sanity for the students as they work through the program. Laurie, Judy, Kate, and Bea, though you'll never read this, thanks as well.

Thanks to Nicholas Negroponte for creating and nurturing such a marvelous and unique environment.

Finally, thanks to the sponsors listed below and the United States Air Force Laboratory Graduate Fellowship program which ``jump-started'' my graduate career. The individuals behind these organizations have had a significant, positive influence on the research presented here.


This research was conducted at the MIT Media Laboratory and was sponsored in part by BT and the Things That Think Consortium.


Contents


List of Figures

  1. A Private Eye mounted in the brim of a hat.
  2. A Private Eye mounted in a pair of safety glasses. Note that the resulting display mount can be used over normal eyeglasses.
  3. A common misconception is that users of monocular displays worn in front of one eye see two separate images as above.
  4. In fact, the user's brain fuses the image from the monocular display with the image of the world.
  5. A more discreet display by MicroOptical, embedded in a pair of prescription eyeglasses (photo by Sam Ogden).
  6. Handykey's Twiddler, a one-handed chording keyboard with tilt sensitive mouse.
  7. The Locust: an environmentally-powered, microcontroller-based IR transponder.
  8. A head-mounted camera and a head-up display connected wirelessly to a SGI can simulate a powerful wearable computer for augmented reality.
  9. Functional diagram of wireless video wearable computer simulation system.
  10. The hat-mounted camera, pointed downward to the hands, and the corresponding view.
  11. A hard hat adapted for mounting downward- and forward-facing cameras.
  12. A functional diagram of the mobile video backpack. Arrows indicate video channels.
  13. The vision toolkit: low level modules are linked together to create applications.
  14. A visual tag representing the value ten. Squares at either end are red while the inner squares representing the bits are green.
  15. Variants on a 5-state HMM. The top model is a standard left-to-right HMM with no skip transitions. The bottom model has two skip transitions effectively allowing it to become a 3-state or 4-state model depending on training.
  16. The Remembrance Agent in use while writing a paper. The bottom buffer shows the RA's suggestions for files ``relevant'' to the user's current work.
  17. View from the desk-based tracking camera. Images are analyzed at 320x240 resolution.
  18. The four state HMM used for recognition.
  19. Using the finger as a mouse.
  20. An incoming video stream is re-mapped to a polygonal representation of a two dimensional Gaussian. This accentuated fisheye view enlarges the letters in the center of the screen while still allowing enough of the visual field to provide context.
  21. A similar mapping useful to those who suffer from a scotoma or ``blind-spot.''
  22. Ruth Marshall's desktop low vision aid.
  23. Multiple graphical overlays aligned through visual tag tracking. The techniques shown in the following three figures allow the attaching of hypertext to the physical environment.
  24. When a tag is first located, an arrow is used to indicate a hyperlink.
  25. If the user shows interest, the appropriate text labels are displayed. This image shows the overlaying text rotating in conjunction with the tag, demonstrating that the system can recover rotation with respect to the user's head.
  26. If the user approaches the object, 3D graphics or movie sequences are displayed.
  27. A maintenance task using 3D animated graphics. The left side shows the laser printer to be repaired. The right side shows the same printer with the overlying transparent instructions showing how to reach the toner cartridge.
  28. The areas that comprise the Patrol play field.
  29. Initial DUCK! BattleMap.
  30. A more informative, though possibly misleading, version of the DUCK! BattleMap.
  31. Views from the DUCK! cameras. The images on the left are from the downward looking camera while the images on the right are from the camera pointing forward.
  32. HMM topology for DUCK!.
  33. Typical detection of Patrol area transitions.
  34. Images taken during aiming/shooting, reloading, and ``other'' actions.
  35. A re-design of the DUCK! interface providing unique identifiers for each player and distinctive tones for different player actions.
  36. Empirical data taken for a 58.7 kg man of effective force perpendicular to the ground on the foot while walking. This curve should scale roughly based on weight.
  37. Definition of axes for piezoelectric materials. Note that the electrodes are mounted on the $3$ axis.
  38. Simple diagram showing two shoe generation systems: 1) piezoelectric film insert or 2) metal spring with coupled generator system.
  39. Power from body-centered sources. Total power for each action is included in parentheses.
  40. Air flow caused by natural convection.
  41. Blood flow geometries in the arm.
  42. Image of experimental water baths (actual experiment was performed in a temperature controlled room).
  43. Heat loss from forearm (o) and control (+) water baths.
  44. Heat conduction through forearm versus water bath temperature.
  45. Forearm computer heat dissipation.

List of Tables

  1. ASL test lexicon
  2. Word accuracy of desk-based system
  3. Word accuracy of wearable computer system
  4. Patrol area recognition accuracy
  5. Confusion matrix between aiming, reloading, and other tasks.
  6. Comparisons of common energy sources and computing power requirements.
  7. Human energy expenditures for selected activities.
  8. Piezoelectric characteristics of PVDF and PZT.
  9. Comparison of rechargeable battery technologies. On-time refers to the running time of a 100 cm$^3$ battery with a 5W drain.
  10. Thermal conductivities
  11. Daily averages for different mechanisms of heat loss
  12. Skin temperature sensations
  13. Skin structure.
  14. Heat dissipation for various blood flow rates given a $39{}^oC$ external heat bath.
  15. Calibration of digital thermometers.
  16. Temperature compensated thermal capacity of water
  17. Calibrated temperature readings for control and forearm baths.
  18. Average temperature readings and standard deviations for the control and forearm baths
  19. Heat loss of control bath.
  20. Heat loss of forearm bath versus the normalized control bath.


Introduction

From mainframe to wearable

Computer hardware continues to shrink in size and increase in capability. This trend caused the prevailing concept of a computer to change from the mainframe to the minicomputer to the desktop. Just as the physical hardware changes, so does the use of the technology, tending toward more interactive and personal systems. In the late 1990's, another physical change is underway, placing computational power on the user's body, making it accessible at all times. ``Wearable computers'' enable new applications that were formerly infeasible, resulting in new usage paradigms. However, previous personal technologies provide a perspective on these new opportunities.

As with any modern industry, wearable computing has a long history of technological precursors. Many of these are technological tools to augment man's senses or to fulfill a specific need. For example, eyeglasses, which augment sight, are first mentioned by Roger Bacon in 1268. In the 1665 preface to Micrographia, Robert Hooke goes further, suggesting the addition ``of artificial Organs to the natural ... to improve our other senses of hearing, smelling, tasting, and touching.'' In the age of electricity, electronic augmentations such as hearing aids and vision enhancement for the near-blind became available. The use of these systems demonstrates an interesting social trend. Initially, such devices were used sparingly to compensate for a disability when it was necessary to communicate. However, as communication has become more essential to everyday work and living, more users wear their device continuously, simply as a matter of convenience. Today, such devices have progressed to the stage of being implanted into the user, such as with artificial cochlea or retinas [6,177].

A related trend can be seen in consumer goods as a particular function is needed repeatedly throughout the day. Domestic mechanical clocks first appeared in the late 14th century [5]. These cabinet-sized clocks were derived from the large public tower clocks that chimed the hour, often to indicate the hours for prayer in monasteries. However, it was the need of an accurate timepiece for naval navigation that prompted John Harrison to invent an accurate ``pocket'' watch in 1762. After a period of being simply a women's fashion accessory, the wristwatch began to dominate in the early 20th century due to the need for synchronization of soldiers during World War I and the need for a hands-free time reference for aviation. By the 1970's, electronic wristwatches surpassed the accuracy of the best cabinet-sized mechanical watches, completing a 700 year transition from an unwieldy, inaccurate instrument to a mobile, nearly ubiquitous timepiece that is quickly and easily referenced.

Is this the trend for computing? The desktop computer is currently entrenched as the prevalent consumer item, much as cabinet and mantle clocks were in the 1600's. However, merchants and scientists have carried shrunken abaci, slide rules, and calculators for decades. Will computing make the transition from the desktop to the body for the general populace? If so, what will be its form?

In 1992, pen computers were presented as the next logical step in computing. The reasoning was that handwriting is the most intuitive interface for computing and that everyone who would buy such a device would know how to write. Even Microsoft joined in the fray by producing Windows for Pens, a version of their desktop product, to compete with custom pen operating systems. The claim was that users would want the same familiar interface as on the desktop for these mobile devices. However, natural cursive handwriting is slow and requires a relatively large writing surface, limiting the form factor of these devices. Alternative, faster handwriting schemes that used little screen real estate, such as Xerox PARC's Unistroke [80] system, were perceived as too complex or too limiting for the casual user to learn. However, by 1998 the Palm Pilot pen computer, with its custom operating system and the Grafitti lettering method, refuted these preconceptions by becoming the first pen computer to sell the 2 million units, which is considered to be the benchmark of a consumer-grade success.

In many respects, the current generation of successful personal digital assistants (PDA's) resembles the pocket watches of the Victorian era. While improving mobility relative to desktop and laptop systems, a current PDA requires the user to extract it from its case or pocket, flip open the lid, and use both hands to operate it. Most significantly, these machines offer reduced functionality when compared to their desktop counterparts, concentrating on applications for occasional data entry or reminders. Like early reading glasses, these PDA's are used relatively infrequently and only for a specific set of tasks. Thus, pen computers are considered non-essential for many groups of users and are often left at home.

Wearable computers, with their expanded utility, increased accessibility, and improved ergonomics, should supplant the desktop as the preferred interface for computing. For example, as displays become embedded into eyeglasses, users will be freed from maintaining the static neck and back position required by computer monitors for data entry. In addition, as a class these devices should subsume the current concepts of portable consumer electronics by concentrating functionality into one package. Much as desktop computers are becoming all-purpose information appliances, incorporating the telephone, fax, answering machine, television, and VCR, so too should the wearable incorporate the wristwatch, cellular phone, fax machine, palmtop, compact disk player, camera, camcorder, health monitor, etc.

The wearable computer may eventually look like a black box, at most the size of a deck of cards, enclosing a powerful yet energy-conserving CPU and a large capacity data storage device. This black box may have one output device - an LED to indicate that it is on and that its body-centered wireless network is functioning. This wireless network will connect peripherals to the wearable computer in a radius of about two to three meters centered at the body. The wearable's functionality will depend on the peripherals the consumer chooses.

For example, suppose the user likes to listen to music. Current hard drives allow storage of over 200 CD's on a pocket-sized device. Wireless earphones, which will automatically connect with the wearable's wireless network, allow the user to listen to any song at any time. Add a walnut-sized camera, and the wearable computer transforms into a camcorder. Add an Internet modem, and the wearable becomes a pager, cellular phone, web-browser, and e-mail reader. With medical sensors, the wearable transmogrifies into a version of the Star Trek tri-corder, concentrating many diagnostic and recording devices into one unit. With wearable computing and a wireless body-centered network, companies need only create the appropriate peripheral whenever a new need or niche market is discovered. Suddenly, sophisticated portable electronics become cheap and powerful for the consumer, and the computer industry gets an attractive upgrade path to pursue.

This thesis, however, will concentrate on problems and potentials that are unique to the field of wearable computing. It will provide examples of novel interfaces and suggest new design possibilities such as powering the wearable from user actions or cooling the machine via user contact. To begin, let's examine the attributes of a wearable computer.

What is a wearable computer?

``Wearable computing'' can describe a broad range of devices and concepts. During the time of this work, wearables were equated with head-up, head-mounted displays, one-handed keyboards, and specially made computers worn in satchels or belt packs. However, at the beginning of this work in 1993, the author meticulously avoided defining the term to encourage exploration and collaboration, taking a cue from the rapidly expanding software agents community of the time. However, it became necessary to contrast wearables to laptops and PDA's in an attempt to explain the conceptual differences in the interface. My first attempt was in ``The Cyborgs Are Coming,'' [205] originally written in 1993 as an expedient means of explaining the purpose of the wearable computer to curious bystanders (The cited technical report was derived from the original paper which was distributed widely in 1994; the original version is included in Appendix A for reference). In this paper, the author suggests that ``persistence and consistency'' are the two distinguishing characteristics of a wearable computer interface. The wearable interface is ``persistent'' in that it is constantly available and used concurrently while the user is performing other tasks. For example, while a medical doctor is examining a patient, the wearable may display the patient's history or CAT scan. It may record the doctor's observations and search automatically for precedents or possible interactions between prescribed drugs. ``Consistency'' means that the same structured wearable interface and functionality is used in every situation, though adapted and molded over the course of a lifetime of interaction with the user.

The term ``cyborg'' above deserves some attention. Originally coined by Manfred Clynes and Nathan Kline in 1960 [39], a cyborg is a combination of human and machine in which the interface becomes a ``natural'' extension that does not require much conscious attention, such as when a person rides a bicycle. While Clynes and Kline's subject was adapting man for the rigors of space, the same word might be applied to systems which assist the user on a more intellectual level.

As the field developed, members of the community described the attributes of a wearable more explicitly. In 1997, Bradley Rhodes described a wearable computer in relation to five properties [174]. According to Rhodes, wearables are portable while operational; enable hands-free or hands-limited use; can get the attention of the user even when not in active use; are always ``on,'' acting on behalf of the user; and attempt to sense the user's current context to serve him better. Korteum et al. [] describe similar criteria but use the term ``augmented reality'' to describe ``the user interface technique that allows focusing the user's attention and present information in an unobtrusive, context-dependent manner.'' Also in 1997, Steve Mann defines his ``WearComp'' system as being ``eudaemonic'' in that the user considers the apparatus as part of himself, ``existential'' in that the user has complete, informed control of the apparatus, and ``ephemeral'' [sic] meaning that the system is always operating at least on some minimal level and has an output channel open to the user at all times. Later, Mann would refine these attributes [129] as constant and always ready; unrestrictive; unmonopolizing of the user's attention; observable by the user; controllable by the user; attentive to the environment; useful as a communications tool to others; and personal.

Note that all of these definitions explicitly avoid describing how the apparatus is implemented but instead concentrate on an interface ideal. The author's own guiding principle may be summarized best by the concept of symbiosis as described in J.C.R. Licklider's paper ``Man-Computer Symbiosis:'' [123]

``Man-computer symbiosis'' is a subclass of man-machine systems. There are many man-machine systems. At present however, there are no man-computer symbioses. ... The hope is that, in not too many years, human brains and computing machines will be coupled together very tightly and that the resulting partnership will think as no human brain has ever thought and process data in a way not approached by the information-handling machines we know today.
In order to achieve such a symbiosis, I believe that the computer must be constantly with the user, sharing in the experience of the user's life, drawing input from the user's environment, and providing useful and graceful assistance as appropriate. Specifically, I believe the ideal wearable

  1. Persists and provides constant access: Designed for everyday and continuous use over the course of a lifetime, the wearable can interact with the user at any given time, allowing for user interruptions when necessary. Correspondingly, the wearable can be accessed by the user quickly and with very little effort. Needless to say, such a device must be mobile and physically unobtrusive to meet these goals.

  2. Senses and models context: In order to provide the best cognitive support for the user, the wearable must try to observe and model the user's environment, the user's physical and mental state, and the wearable's own internal state. In some cases, the user may provide explicit contextual cues to help the wearable in its task. To provide parity, the wearable should inform the user of its own status, either through an explicit display or through subtle ``background'' cues [100]. In addition, the wearable should make its models observable to the user.

  3. Augments and mediates: The wearable should provide universal information support for the user in both the physical and virtual realms. For example, the wearable should gather information and resources relevant to a particular physical location automatically and filter this information based on the user's current needs and preferences. The wearable should adapt to provide a common, extensible interface to automation or computation in the environment. In addition, the wearable should manage potential interruptions, such as phone calls or e-mail, to best serve its user.

  4. Interacts seamlessly: The wearable should adapt its input and output modalities automatically to those which are most appropriate and socially graceful at the time. In many instances, the computer interface will be secondary to the user's primary task and should take the minimal necessary amount of the user's attention. In addition, the interface should guarantee privacy when appropriate, adapt to its user over time, and encourage personalization.

Many of the attributes described previously mesh with these principles. However, context sensing is the key advantage wearable computers have over related devices. When not being used for a task that requires the user's full attention, wearable computers will be used as a secondary interface. In other words, while the user is attending a conversation or inspecting equipment for repair, the wearable computer will provide information support to augment the user's native knowledge and abilities. To provide this service efficiently and without interrupting the user with a complex interface, the wearable computer will have to sense the user's actions and predict what is needed. The next section explores this idea more fully.

The importance of context sensing

For most computer systems, the only input devices are used to get instructions or information directly from the user. The user manipulates a keyboard and a 2D or 3D pointing device to drive a software package toward a particular goal, such as drawing a graph or solving a spreadsheet. Wearable computers offer a unique opportunity to re-direct sensing technology toward recovering both environmental and personal user context in a more natural, mobile environment. Wearable computers have the potential to ``see'' as the user sees, ``hear'' as the user hears, and experience the life of the user in a ``first-person'' sense. In addition, wearables provide the opportunity to sense user behavior over time. This increase in available information may lead to more intelligent and fluid interfaces that use the physical world as part of the interface.

Since context has been mentioned repeatedly, an explanation of the term may be in order. Using a working definition by Bradley Rhodes [175], given a user and a set of goals, context is those features of the environment not created explicitly to be input to the system. A context-aware application is a system that uses context to perform useful work, where ``useful'' means relating to a goal, subgoal, related goal, or future goal.

The importance of context in communication and interface can not be overstated. Physical environment, time of day, mental state, and the personal model each conversant has of the other participants can be critical in conveying necessary information and mood. An anecdote from Nicholas Negroponte's book ``Being Digital'' [150] illustrates this point:

Before dinner, we walked around Mr. Shikanai's famous outdoor art collection, which during the daytime doubles as the Hakone Open Air Museum. At dinner with Mr. and Mrs. Shikanai, we were joined by Mr. Shikanai's private male secretary who, quite significantly, spoke perfect English, as the Shikanais spoke none at all. The conversation was started by Wiesner, who expressed great interest in the work by Alexander Calder and told about both MIT's and his own personal experience with that great artist. The secretary listened to the story and then translated it from beginning to end, with Mr. Shikanai listening attentively. At the end, Mr. Shikanai reflected, paused, and then looked up at us and emitted a shogun-size ``Ohhhh.''

The male secretary then translated: ``Mr. Shikanai says that he too is very impressed with the work of Calder and Mr. Shikanai's most recent acquisitions were under the circumstances of ...'' Wait a minute. Where did all that come from?

This continued for most of the meal. Wiesner would say something, it would be translated in full, and the reply would be more or less an ``Ohhhh,'' which was then translated into a lengthy explanation. I said to myself that night, if I really want to build a personal computer, it has to be as good as Mr. Shikanai's secretary. It has to be able to expand and contract signals as a function of knowing me and my environment so intimately that I literally can be redundant on most occasions.

This story contains many subtleties. For example, the ``agent'' (i.e. the secretary) sensed the physical location of the party and the particular object of interest, namely, the work by Calder. In addition, the agent could attend, parse, understand, and translate the English spoken by Wiesner, augmenting Mr. Shikanai's abilities. The agent also predicted what Mr. Shikanai's replies might be based on a model of his tastes and personal history. After Mr. Shikanai consented/specified the response ``Ohhhh,'' the agent took an appropriate action, filling in details based on a model of Wiesner and Negroponte's interests and what they already knew. One can imagine that Mr. Shikanai's secretary uses his model of his employer to perform other functions as well. For example, he can remind Mr. Shikanai of information from past meetings or correspondences. The agent can prevent ``information overload'' by attending to complicated details and prioritizing information based on its relevancy. In addition, he has the knowledge and social grace to know when and how Mr. Shikanai should be interrupted for other real-time concerns such as a phone call or upcoming meeting. These kinds of interactions suggest the types of interface a contextually-aware computer might assume.

Also note that the anecdote naturally limits the possible form factors of the user's computer. Either the computer must have eyes and ears everywhere its master may travel, or it must travel with the user, as with wearable computers. The latter method suggests a more symbiotic relationship with the user. The computer is physically transported by the user to different environments where it may gain more experience. In return, the computer provides the user with progressively more sophisticated and personalized service. Additionally, the user and computer may benefit in other ways from being in close proximity, as will be discussed in later sections.

The computer interface described in ``Being Digital'' is more of a long term goal than what can be addressed in one doctoral thesis. In fact, such symbiotic man-machine relationships have been pursued since the early days of computer science, as shown by the Licklider quote in the previous section. However, this thesis takes concrete steps toward this ideal by developing a body-centered sensing platform through wearable computing, introducing methods to analyze the incoming data, developing models of the user and environment, and suggesting contextually-driven interfaces for the future.

Research areas and contributions

Wearable computing provides opportunities for research in many broad fields. This section provides a short overview of the specific areas addressed by this thesis.

Contextual awareness

Contextually aware computing can be broken into three processes: perception, modeling, and the interface itself.

Perception

Current multimodal interfaces concentrate mainly on the desktop or room environments. With wearable computing, sensors such as cameras, microphones, inertial sensors, and Global Positioning System receiver may be mounted on the user's body. This results in a drastic increase in available data about the user's environment and requires appropriate pattern recognition techniques for analysis. This thesis demonstrates the effectiveness of techniques such as hidden Markov models, multidimensional receptive field histograms, and principal component analysis in this information-rich environment.

Sensor mounting locations can be very important in determining the type and quality of data recovered. I introduce self-observing body mounted cameras as a way for recovering location and hand and foot motion. In addition, I compare the costs, types of data, and privacy implications between sensing with environmental and wearable infrastructure.

Modeling

Context modeling involves observations of the user, the environment, and the computer itself. Models may be used on a low level to aid perception:

How does my user's skin color change in this new lighting?

an associative level:

What objects might be viewable from this room?

or at a higher task level:

What is the user doing?

Such models are introduced throughout the thesis as a means to improve performance and reduce interface complexity.

Interfaces

While current hardware limitations prevent proper implementation and evaluation, this thesis suggests several novel wearable computer interfaces. Some of these are tightly coupled to the perceptual layer, following a more traditional style of direct user input. However, progressively more contextually-driven interfaces are pursued, hopefully leading to a subtler coupling of man and machine in the future.

Design implications

Due to the wearable computer's close proximity to the body, software and hardware become highly intertwined. For example, continuously monitoring a sensor results in an increased computational load. In turn, this either decreases the average battery life of the unit or increases the mass of batteries the user must carry. Both can dramatically effect the usage of the machine. In addition, the increased computation results in excess heat production which must be controlled for the machine to operate. Thus, such contextual interfaces as advocated above will force new designs in the construction of comfortable wearable computers.

To address this issue, analyses of user-derived power sources, heat dissipation, and weight or load bearing are presented. In addition, observations and lessons are offered from managing the hardware, software, and research support of a wearable computer community at the M.I.T. Media Laboratory for over six years.

Organization

The chapters in this dissertation detail a series of projects and experiments designed to progress toward more contextually-based interfaces. In most cases, the major contributions are the perceptual tools and models. Details on how to evaluate these systems are included. In addition, the chapters on power, heat, and load bearing address issues that will become critical as these more computationally intensive systems are adopted.

Chapter 2
details the tools and techniques used throughout this thesis. An everyday-use wearable computer platform, infrared location beacon infrastructure, wireless video wearable computer simulation system, and a mobile, multi-channel recording platform for body-worn cameras are described. In addition, a computer vision toolkit and a hidden Markov (HMM) system are introduced as a basis for perceiving and modeling the user's context.

Chapter 3
provides a perspective on the life of an everyday user of a wearable computer. In doing so, this chapter helps distinguish wearables from every other class of computing device. The concepts of ``augmented memory,'' ``serendipitous interfaces,'' and ``intellectual collectives'' are introduced.

Chapter 4
details the extension and adaptation of a previous camera-based sign language recognition system to a wearable platform. While the system is directly controlled by the user, it provides a proof of concept of how a user-observing body-mounted camera can provide a compelling interface.

Chapter 5
documents an augmented reality (AR) system that generates hypertext links overlaid on physical objects. This example demonstrate how wearables can provide context-sensitive ``just-in-time information'' based simply on the user's head gestures.

Chapter 6
documents work toward a contextually aware system with no explicit input by the user. For a real-space ``paintball'' style game, the system tracks players' locations and arm gestures. While the hardware currently does not exist to complete the system, a personal battle awareness map is prototyped using pre-stored game data.

Chapter 7
begins a trio of chapters examining future design issues for developing sophisticated wearable computers. This chapter presents a discussion on harnessing power through normal user activities, such as walking and typing.

Chapter 8
reviews heat dissipation issues for both humans and wearable computers; models wearable computer cooling methods through potentially beneficial thermal interactions with the user's forearm; and verifies the model experimentally.

Chapter 9
surveys the load bearing literature and explores where and how wearable computers might best be carried on the body given the information from the previous chapters.

Chapter 10
provides conclusions and directions for future work.


Tools and Techniques

Developing a research infrastructure takes a considerable amount of effort and time. However, the process of designing the infrastructure provides crucial insights on how the technology might or might not be used. ``Everyday living'' with the infrastructure provides practical experience that can be gained no other way. The MIT Wearable Computing Project was no exception. As with any project exploring new hardware and metaphors of use, it transpired that some equipment was not practical once it was purchased and used, while other equipment became critical for effective use or experimentation. The members of the project often developed their own hardware drivers because commercial versions were unavailable or proved brittle in a mobile environment. This chapter will describe briefly the hardware and software that became central to the author's research and everyday use. The listed hardware and software was developed by the author or an undergraduate researcher under his immediate supervision except where noted otherwise. On a higher level, certain perceptual tools and modeling techniques became critical to the author's research and were packaged into advanced tool sets for use by the internal Media Laboratory community. Since these tool sets were used in different applications, they are described and developed in this chapter out of the context of a particular project. Later chapters will describe their use for particular applications and contrast the use of these tools to previous work in that particular domain.

Platforms

Researching new methods of computing can lead to very divergent and incompatible hardware. To prevent this, I created and maintained a store of reference platforms, ranging from true everyday-use wearable computers to systems that simulated powerful future hardware or stored information for later analysis. These platforms enabled rapid prototyping of ideas and concentrated the results into a pool of knowledge and apparatus that could be built upon constantly and consistently. In addition, as the everyday-use wearables became more powerful, the applications that were prototyped on the simulation systems could be migrated to more casual use.

The Lizzy wearable computer

In late 1995, the Things That Think consortium at the MIT Media Laboratory decided to direct money into wearable computing research. Having already ordered PC/104 boards to upgrade my own wearable, I began to design inexpensive wearable computer ``kits'' that were highly customizable. By early 1996, the first monies were spent for this ``Wearables Closet,'' a library of hardware maintained for experimentation and rapid prototyping with wearable and embedded computing.

There were many considerations in the base wearable design. The low end system had to be inexpensive for embedded use but expandable to desktop performance. The system was intended to support everyday use [205], high end digital photography [130], augmented reality [214], and medical quality signal capture [160,162]. From earlier experimentation, I knew that form factor plays a crucial role in the use of such machines. The large flat surfaces of laptops, for example, are not very ergonomic for extended wear. Whatever the resulting form factor, the system had to fit in a piece of comfortable clothing for carrying. Another surprisingly crucial factor is battery life. For everyday use, the wearable should run a minimum of six hours on a charge. This way the user can create a daily routine of changing batteries during lunch time. Needless to say, no one commercial design satisfied all of these constraints, neither at the time nor currently.

To avoid supporting separate wearable computers for each user's needs, I chose to support the PC/104 board architecture and have each user manufacture his own wearable computer. The PC/104 standard is built around the concept of a stackable set of boards which connect via headers that are electrically identical to the standard 16-bit PC ISA bus of the time (the standard has since been extended to include PCI). The 3.6'' by 3.8'' boards stack vertically or can tile horizontally with special adaptors. PC/104 boards are developed by many vendors, support a surprising array peripherals, are rugged and heat tolerant, and often have enforced bounds on their power consumption due to their physical size.

I began to create a procedure for manufacturing wearable computers. The procedure needed to be simple enough that anyone who might need a wearable or embedded computer could follow it and produce a working machine in an afternoon. In addition, the procedure needed to reveal the functions of the underlying components and needed to teach the skills necessary in modifying the design so that the user would be confident in extending the system himself. One of the responsibilities of owning a wearable computer was tutoring of the other users on how to make their wearables. These philosophies proved useful in promoting the platform and extending its functionality. By the end of the first year the instructions were fairly robust, and approximately ten machines had been produced. As a personal goal, I wanted the procedure to be repeatable by researchers outside of the Media Laboratory as well, and the design adopted the name ``Lizzy'' from a talk by David Ross calling for a standardized open hardware design at the 1996 Boeing Wearable Computer Symposium. Unfortunately, making the design public proved difficult due to a shortage of parts during 1996. However, by January 1997 I had arranged for all of the components to be available through the suppliers, and I released the instructions to the MIT Wearable Computing Project's web site. A copy of these instructions can be found in Appendix B.

Figure 2-1: A Private Eye mounted in the brim of a hat.
\begin{figure}\centerline{
\epsfysize=2.5in
\epsfbox{images/rhodes-hatmount.ps}
} \end{figure}

Figure 2-2: A Private Eye mounted in a pair of safety glasses. Note that the resulting display mount can be used over normal eyeglasses.
\begin{figure}\centerline{
\epsfysize=2.5in
\epsfbox{images/PE.ps}
} \end{figure}

Figure 2-3: A common misconception is that users of monocular displays worn in front of one eye see two separate images as above.
\begin{figure}\centerline{
\epsfysize=2.5in
\epsfbox{images/PE-zephyr-separated-eye.ps}
} \end{figure}

Figure 2-4: In fact, the user's brain fuses the image from the monocular display with the image of the world.
\begin{figure}\centerline{
\epsfysize=2.5in
\epsfbox{images/PE-zephyr-composite.ps}
} \end{figure}

Figure 2-5: A more discreet display by MicroOptical, embedded in a pair of prescription eyeglasses (photo by Sam Ogden).
\begin{figure}\centerline{
\epsfysize=2.5in
\epsfbox{images/microoptical.ps}
} \end{figure}

One of the most striking characteristics of a typical Lizzy wearable computer is its head-up display, Reflection Technology's Private Eye. This display produces 720 by 280 pixel resolution in monochrome red-on-black. It is fully addressable and its focus can be adjusted from ten inches to infinity. Typically, the Private Eye is mounted on the brim of a cap as in Figure 2-1 or in a pair of safety glasses as in Figure 2-2. The safety glasses mount holds the display directly in the line of sight for one of the user's eyes creating an overlay effect (see Figures 2-3 and 2-4). Such an effect is very useful for creating augmented realities. Over time, other displays were adapted for use as well, including modified cathode ray tubes [128], MicroOptical's display glasses (see Figure 2-5), and even the PalmPilot [89].

Figure 2-6: Handykey's Twiddler, a one-handed chording keyboard with tilt sensitive mouse.
\begin{figure}\centerline{
\epsfysize=2.5in
\epsfbox{images/twiddler.ps}
} \end{figure}

Another striking characteristic of the Lizzy is its keyboard, Handykey's Twiddler (see Figure 2-6). This 18 key chording keyboard is used with either hand and allows typing at up to 60 words per minute. The Twiddler also contains a mouse, activated by pressing a key with the thumb and rolling or pitching the unit for x or y movement respectively. This keyboard often provides the primary source of user input for a Lizzy.

In early 1996, a standard system consisted of a Private Eye, Twiddler, PC/104 based 50MHz 486 computer, 16M of RAM, and 815M of hard disk. Such a system required three PC/104 boards, and the standard 5.5'' by 5.5'' by 2.75'' enclosure could hold a maximum of four boards. By the time the instructions were released publicly, the system required only two boards, the processor speed had increased to 100MHz, and disk densities had increased. Internally to the project, stock was maintained for options such as 16-bit sound boards, video digitizers with on-board digital signal processors, PCMCIA adaptors, or extra communications ports as desired. Cameras, biosensors, alternative displays, extra disk capacity, higher end CPU boards, and custom clothing were also available. Another important option was wireless Internet connectivity through cellular digital packet data (CDPD) modems. Using Bell Atlantic Mobile's CDPD service, the wearable computers were assigned their own Internet address and appeared as a normal static workstation to the rest of the Internet. Service coverage grew to include many urban centers in the United States.

Linux is the operating system of choice for Lizzy design. Most other operating systems were too brittle for serious consideration or did not support the Lizzy's peripherals and were not open enough for the development of appropriate drivers. Consequently, much of the software produced by the Wearable Computing Project concentrated on producing open source for Linux. Among the software developed were drivers for HandyKey's Twiddler by Jeffrey Levine; the X11R6 windowing system on Reflection Technology's Private Eye display by Ben Walter; Sierra Wireless's PocquetPlus 110 CDPD wireless modem by Bayard Wentzel; Adjeco's ANDI-FG video digitizer by Ben Walter; General Reality's CyberTrack pitch, roll, and yaw sensor by Len Giambrone, and a general Global Positioning System decoder by graduate student Daniel Dreilinger. In addition, the Lizzy architecture created a focus for research on wearable-based software and hardware [174,209,89,76].

The Lizzy and related infrastructure have proved very successful, both internally and externally. Approximately twenty five Lizzys have been made internally from the Wearables Closet, and many more machines have been manufactured by researchers and hobbyists world-wide using the Lizzy reference design as a starting point. A benchmark for this success is that the supplier for the Private Eyes and their driver boards had run out of their stock of 100 by one month after the public release of the Lizzy instructions. At MIT, several Lizzy owners are everyday users in that, during a typical day, one can expect to see the user wearing his machine. However, perhaps the best indicator of success of the design, at least for everyday use, is the author's own system. While for many years the author would spend more time using his wearable than desktop machines, as of December 11, 1996 the author switched permanently from desktops to his wearable as his general computing device. In other words, almost all of the author's routine e-mail, web browsing, programming, and text editing, including the preparation of this document, is performed on a Lizzy-based wearable computer. Specifically, the author's system consists of a Private Eye, Twiddler, 133MHz 586 processor board with 20M of RAM, and a CDPD wireless modem. Run-time, without modem usage, is approximately 15 hours on two Sony NPF-950 lithium camcorder batteries, allowing for continuous use during the day.

The Locust: indoor infrared location beacons

Outdoors, the Global Positioning System (GPS) can be used to determine user location. However, for indoor use, a system of low-cost, infrared (IR), light-powered beacons called ``Locust'' was developed to serve this purpose (Figure2-7) [209,210]. Each Locust consists of a 4MHz PIC 16C84 microcontroller, a RS232 line voltage converter, infrared receiver, infrared LED, 6'' by 6'' 9V solar cell, and a voltage regulator. The Locust motherboard is derived from the IRX 2.0 by graduate student Rob Poor, and the resulting board is approximately 1" by 3". The IR LED on each locust is effective to about 20 feet, subtending an angle of approximately 38 degrees about the line of sight. Each Locust is programmed with a unique string of 4 symbols corresponding to its location. The Locust transmits these symbols repeatedly so that a listener, upon receiving the signal, knows his approximate location. A similar system has been described by Long et al. using television remote controls [126]. Since the Locust have to be numerous to cover an entire building, they are designed to be dependent solely on their solar cells so that battery replacement is not be an issue. These systems are typically mounted under fluorescent light fixtures where they can draw power and effectively cover a region.

Figure 2-7: The Locust: an environmentally-powered, microcontroller-based IR transponder.
\begin{figure}\centerline{
\epsfysize=2.5in
\epsfbox{images/ir.ps}
} \end{figure}

In addition to being location beacons, the Locusts allow location-based information uploading. A short message, in this case one byte, is transmitted to a Locust. After the Locust receives the message, it retransmits the message, interleaved with its location information during the Locust's transmit cycle. This uploaded information may be self contained, or it may be a pointer to encrypted information stored elsewhere.

Simulation of a vision-based wearable

While one of the driving principles of the Wearable Computing Project was to design software and hardware for everyday use, much of the advanced research, such as found in this thesis, required more computing power than was available on wearable computers of the time. In order to simulate the more powerful wearable computers of the future, a full duplex wireless video system was designed. This proved valuable for integrating computer vision techniques into wearable computing applications. The first such system in the project, designed and implemented by graduate student Steve Mann [214,127], used amateur television bands. However, with the advent of cheap, unlicensed, and multi-channel 2.4GHz video and audio transmitters, the necessary equipment became much more accessible and could be placed in a shoulder bag. When combined with a small, head-mounted camera, a head-up display such as Virtual I/O's i-glasses or Sony's Glasstron, and a remote Silicon Graphics, Inc. (SGI) workstation, such a system can create the illusion of a powerful computer-vision driven wearable computer (see Figure 2-8). First the camera image of what the user is seeing is sent to the SGI. For most cases, an Elmo MN401E camera was used with a 15mm lens, selected to approximate the correct size of objects when viewed from the head-up display. The SGI analyzes the video and composites an appropriate ``wearable computer'' display for the situation on top of the incoming video. This image is then sent back to the head-up display where it is displayed. The entire process happens in real-time, only limited by the processing speed of the SGI and normal NTSC frame rate. The process is summarized in Figure 2-9, adapted from Starner et al. [214]. Note that an advantage to this system is that the user only sees the computer graphics when they are composited with the video image, insuring proper registration (ignoring latency effects between the video and graphics image). Thus, the issue of improper alignment of the head mounted display with the head mounted camera can be ignored.

Figure 2-8: A head-mounted camera and a head-up display connected wirelessly to a SGI can simulate a powerful wearable computer for augmented reality.
\begin{figure}\centerline{
\epsfysize=2.5in
\epsfbox{images/AR-system-alda2.ps}
} \end{figure}

Figure 2-9: Functional diagram of wireless video wearable computer simulation system.
\begin{figure}\centerline{
\epsfysize=3.0in
\epsfbox{images/visfilter-thad.ps}
} \end{figure}

Recording visual data

Figure 2-10: The hat-mounted camera, pointed downward to the hands, and the corresponding view.
\begin{figure}\centerline{
\epsfysize=2.5in
\epsfbox{images/asl-wearable-combined.ps}
} \end{figure}

When designing a recognition system, a repeatable, stable database of input is needed for training and testing the system [96,211]. For several experiments described in this thesis, ``wearables'' had to be developed that could record information for such later reference. Figure 2-10 demonstrates a baseball cap with a downward-facing video camera embedded in its brim. The goal of this camera is to observe the wearer's hand, feet, and body motions. The resulting apparatus was used for recognizing sign language. The camera shown is an Elmo MN401E with a 4mm lens, which allows the largest field of view possible with this model. The camera head is about the size of a lipstick can and is tethered to a 4'' by 6'' by 2'' camera control box which outputs a high quality NTSC composite or svideo control signal. This camera cap was used in conjunction with a rack-mount Sony Betacam 2800 video recorder to produce high quality video tape of sign language, as will be discussed in the next chapter. Surprisingly, the video showed very little image vibration due to camera motion.

The video backpack

Figure 2-11: A hard hat adapted for mounting downward- and forward-facing cameras.
\begin{figure}\centerline{
\epsfysize=2.5in
\epsfbox{images/patrol-apparatus.ps}
} \end{figure}

A completely mobile unit, capable of recording several synchronized channels of video, was desired for one of the experiments. For this system, a consumer grade Sony Hi-8 camcorder was used to record video. Since video of the scene in front of the user as well as video of his body motions was desired, a custom backpack of video equipment was designed. First, an Elmo QN401E ``matchstick-sized'' camera with a 2.2 mm lens was added to the camera cap in Figure 2-10 to observe a wide field of view ahead of the user. The cap was later replaced with the more durable matte black hard hat shown in Figure 2-11, and the downward looking MN401E camera remounted appropriately. While initially awkward and prone to more vibration than the cap, the hard hat provided a firm mount for both cameras. Both cameras require camera control units, which were placed in the backpack. Since the Hi-8 camcorder could record only one stream of video, a Presearch VQ42C quad display was employed. This unit can take up to four streams of composite NTSC video as input and output a composite video stream with each video stream subsampled and placed into separate quadrants of the image. In this manner, the camcorder could record synchronized video from both cameras. Figure 2-12 shows a functional diagram of the system. The system required over 40 watts of power, resulting in seven kilograms of batteries to run for two hours. Thus, a backpack was necessary to carry the apparatus comfortably. In addition, this video backpack provided protection for the equipment for the relatively harsh DUCK! environment described in later chapters.

Figure 2-12: A functional diagram of the mobile video backpack. Arrows indicate video channels.
\begin{figure}\centerline{
\epsfysize=2.5in
\epsfbox{images/duck-diagram.ps}
} \end{figure}

Analysis and modeling tools

Several tools were developed relating to the analysis and modeling of the user's actions and environment through video. While this section introduces these techniques, later chapters will apply them to particular problems, address previous work, and evaluate the resulting systems. Most of the tools in this section were used in conjunction with the wireless video system described above or used with data sets that were recorded for later reference and experimentation.

A computer vision toolkit

To increase the speed of prototyping vision-based systems, I produced a vision toolkit consisting of small, modular programs that can be reconfigured quickly through Unix pipes. Figure 2-13 shows this vision architecture and the projects involved. Some initial modules evolved from previous work with the ALIVE project [249] which tracks the entire user's body in a room-sized augmented reality.

Most clients of the vision architecture concatenate low level feature detectors, filters, and higher level, domain-specific feature detectors to determine necessary information. For example, ``BlobFinder,'' a low level module, tracks all the blobs of a certain range of colors in view of the camera and returns the shape parameters of those blobs. Parameterized filters remove trivially uninteresting blobs, and application-specific modules extract the parameters of the objects of interest. Finally, these parameters are passed to modeling or graphics applications as desired.

Each module can take input from Unix standard in, produce output on standard out, and describe errors through standard error. All interactions between modules consist of user readable ASCII text. A benefit of this design is that the user can easily observe the output at any given level of the vision system. Another benefit is that the output at any or all levels of the vision system can be logged by using the Unix ``tee'' command, enabling easy troubleshooting and experimentation. The flexibility and ease of use of this architecture allowed its use in several projects at the Media Laboratory and at affiliated sites [237,130,214,219]. The next sections describe each module in detail.

Figure 2-13: The vision toolkit: low level modules are linked together to create applications.
\begin{figure}\centerline{
\epsfysize=2.5in
\epsfbox{images/vision-toolkit.ps}
} \end{figure}

BlobFinder

BlobFinder represents the lowest perceptual layer involved in the vision architecture. Color NTSC composite video is captured and analyzed at 320 by 243 pixel resolution. This lower resolution avoids video interlace effects. To segment each blob initially, the algorithm scans the image until it finds a pixel of the appropriate color, determined by an a priori model or specified through use of interactive sliders. A typical rule for testing whether a given pixel should be included in the segmentation $p_T[x,y]$ is

\begin{displaymath}
p_T[x,y] = \left\{
\begin{array}{ll}
1 & \mbox{if $r[x,y]...
...\times slider3$} \\
0 & \mbox{otherwise}
\end{array} \right.\end{displaymath}

where $r[x,y]$, $g[x,y]$, and $b[x,y]$ are the respective red, green, and blue values for the pixel $p[x,y]$ and $slider1$, $slider2$, and $slider3$ are the current values of the sliders set by the user. The sliders are given initial defaults, but changes in lighting, camera quality, and digitizer quality often cause significant variations, requiring the user to adjust the system to a color sample before proceeding.

Once a pixel of the right color is found, the region is grown from that pixel by checking the eight nearest neighbors. Any neighbor that is found to be the appropriate color is added to a ``grow list,'' and the initial pixel is removed from the grow list. Next, the color of the neighbors of each pixel on the grow list is checked. This process continues until there are no more pixels on the grow list [101]. Each pixel checked is considered part of the blob. This, in effect, performs a simple morphological dilation upon the resultant image that helps to prevent edge and lighting aberrations [101]. The centroid is calculated as a by-product of the growing step and is stored as a potential seed pixel for the next frame.

After extracting the blobs from the scene, second moment analysis is performed on each blob. In effect, parameters are produced which model each blob as an ellipse. The result is a nine element feature vector for each blob consisting of the $x$ and $y$ position of its centroid (as a rule, image positions are normalized from 0 to 1 by dividing by the maximum horizontal or vertical image dimension as appropriate), area in pixels, and major and minor axes as described by normalized $x$ and $y$ offsets from the centroid and the axes' length. These last six parameters can be obtained by finding the eigenvalues and eigenvectors of the matrix


\begin{displaymath}\left( \begin{array}{cc}
a & b/2 \\
b/2 & c
\end{array}\right) \end{displaymath}

where $a$, $b$, and $c$ are defined as

\begin{displaymath}a = \int \int_{I'} (x')^2dx'dy' \end{displaymath}


\begin{displaymath}b = \int \int_{I'} x'y'dx'dy' \end{displaymath}


\begin{displaymath}c = \int \int_{I'} (y')^2dx'dy' \end{displaymath}

($x'$ and $y'$ are the $x$ and $y$ coordinates normalized to the centroid)

Solving for the eigenvalues


\begin{displaymath}\lambda = \frac{a + c \pm \sqrt{a^2 - 2ac + b^2 +c^2}}{2} \end{displaymath}

The eigenvector corresponding to the larger of the two eigenvalues indicates the direction of the the major axis, which is also the axis of least inertia for the blob [93]. The length of the major axis is twice the square root of the first eigenvalue. Similarly, the minor axis is perpendicular to the major axis and has a length of twice the square root of the second eigenvalue. It follows that the eccentricity of the bounding ellipse can be found by determining the ratio of the square roots of the eigenvalues. Note that there is a 180 degree ambiguity when describing the angle of the blob. Angles are constrained to be between $\pm 90$ degrees to address this problem.

BlobFinder can be reconfigured to report blobs of several specified colors through successive iterations through the video image. For example, all red and green blobs within a certain tolerance can be reported. However, BlobFinder is programmed to return a maximum of fifty blobs per color per frame. A required frame rate can be dictated to BlobFinder through command line arguments. If BlobFinder can go faster than the desired frame rate, it will slow down to match the given rate (using the Unix command ``select'' which frees the processor for other tasks). When BlobFinder can not meet its specified frame rate, it reports discrepancies to standard error output. In general, BlobFinder can maintain a 10-15 frames per second rate for most scenes using a 175MHz R5000 SGI.

SizeFilter

SizeFilter's name implies its function. Reading a stream of blobs from standard input, SizeFilter outputs only those blobs consisting of more than a specified minimum of pixels. Blobs are output from largest to smallest in size. This filter is extremely useful in eliminating the many small blobs that can result from noise or small background objects in the video.

HandTrack

HandTrack tracks the user's hands using higher level information to eliminate extraneous blobs from the candidate blob list, identifies situations where the hands occlude each other, and correctly labels such situations. More specifically, this module tracks the hands over time and uses their expected size and position from one frame to the next to avoid confusion with other blobs. Handtrack assumes that the tracking camera is mounted above a desktop looking down at the user or that the tracking camera is worn in a cap and aimed down toward the wearer's hands, as described in the previous section. When a hand occludes the face, as in the case of the desktop version, or the nose, as in the case of the wearable camera, color tracking alone can not resolve the ambiguity. However, since the face or nose remains in the same area of the frame, its position can be determined and those pixels in the frame ignored. However, the hands move rapidly and occlude each other often. When occlusion occurs, the hands appear as a single blob of larger than normal area with significantly different moments than either of the two hands in the previous frame. In such situations, each of the two hands is assigned the features of this single blob. While not as informative as tracking each hand separately, this method retains a surprising amount of discriminating information. The occlusion event itself is implicitly modeled, and the combined position and moment information are retained.

FingerTrack

FingerTrack is a simple version of HandTrack which attempts to track the tip of an extended finger. In visually noisy environments the user wears a specially-colored thimble. In this case, FingerTrack assumes the largest blob is the tip of the finger. When the hand's natural coloration is used, the system assumes the largest blob is the hand and that topmost pixel in the blob is the tip of the finger. Fingertrack outputs the fingertip's $x$ and $y$ position in coordinates normalized from 0 to 1.

TagRec

Fiducials are used in computer vision when accuracy and precision are desired in determining the position and orientation of objects [35,31,10]. Generally, these ``tags'' are designed to be distinct against their surrounding environment. In some cases, fiducials are designed to reflect infrared light or are themselves luminous [102]. In additional, a coding scheme can be used to uniquely identify each fiducial [172,173,146,35]. When an object is uniquely identified by a wearable computer, virtual information and behaviors can be assigned to that object, as will be described in later sections.

TagRec attempts to identify fiducials in the environment from blobs segmented by color, as produced by BlobFinder. Here, fiducials consist of a linear array of characters generated on low-cost miniature eight character LED signs or a linear arrangement of regularly-spaced red and green squares printed on small slips of paper. With the LED signs, the first and last characters always display an ``*'', and the middle characters show either an ``*'' or a blank. The middle characters indicate a unique ID through a simple binary code. For the printed tags, a red square marks the beginning and end of the tag, and the green squares act as the bits to indicate the ID of the tag (see Figure 2-14).

Figure 2-14: A visual tag representing the value ten. Squares at either end are red while the inner squares representing the bits are green.
\begin{figure}\centerline{
\epsfysize=1.0in
\epsfbox{images/tag10.ps}
} \end{figure}

The primary problem TagRec addresses is locating and identifying tags in the presence of noise. Noise, in this case, consists of other objects in the scene and spurious distortions from the camera's electronics that share the same colors as the tags. As an initial step, BlobFinder and SizeFilter find candidate blobs in the scene. For the LED tags, the thresholds for BlobFinder can be set so that there are very few candidate blobs that are not part of a valid tag. In order to determine if a group of blobs are part of the tag, TagRec examines the candidate list for blobs of approximately the same size. If these blobs meet a maximum size variance threshold, they are examined for linearity. Note that this test implies that an LED tag will consist of a minimum of three blobs. Finally, if the linearity test is passed, TagRec checks the spacing of the blobs to determine if they coincide with what is expected from the known geometry of the tag. A by-product of this step is the reconstruction of the identity of the tag.

Since the printed tags are not self-luminous, they can be harder to distinguish from the background. Thus, two distinct colors are used for each tag. From the blob candidate list, tuples of red blobs of similar sizes and the appropriate eccentricity are formed. Next, the line between the two red blobs is scanned for green blobs of the right relative size, linearity, and spacing as previously described. A valid tag must consist of two red squares and one or more green squares. If the resulting set of blobs is judged to be a tag, the blobs are removed from the candidate list and the process is repeated until no more valid pairs of red blobs remain. Note that this process is designed to avoid false positives.

Once an appropriate pattern is found, the identity of the tag is reconstructed by adding the values of the bits indicated by the internal squares. The presence of a square indicates an ``on'' bit. The internal squares are read left to right, with the leftmost square indicating the most significant bit. Since only seven of the characters are used on the LED tags, both the LED and paper tags contain five bits of potential information. Note the assumption is that the tag is read in the correct orientation. In other words, a tag that is upside down to the camera will have a different identity than when it is right side up. A simple way to eliminate this confusion is to equate tags and their reversed equivalents, halving the potential unique identities.

Besides identity, TagRec also returns the rotation and perceived distance of each tag. In fact, in situations where TagRec locates a tag but can not successfully identify it due to lighting or extreme rotation, TagRec will still report the tag's location and attributes. Rotation is calculated from the relative positions of the endpoints of the tag. Assuming the camera view is orthogonal to the surface of the tag and that the actual size of the tag and focal length of the camera are known, the perceived distance to the tag can be calculated from the distance between the tag's endpoints. Generally, only relative size was used in the applications to calculate a ``zoom factor,'' so true distance was not calculated. Theoretically, the perceived shape of the tag's squares could be used to determine the full 3D orientation of the tag relative to the camera. However, in practice, the tags would have to be fairly large or very close to the camera for effective shape recovery. Since part of the goal of the tag tracker is to be unobtrusive, large tags are unacceptable.

HTKPrepare

This module translates a stream of feature vectors to a format that Entropic's Hidden Markov Model Toolkit (HTK) can parse. Elements of the feature vector are assumed to be floating point numbers. Mainly designed for convenience, this module is adapted to whatever domain is needed. The only processing that may occur in this module is that the deltas of some features may be calculated and included in the output HTK feature vectors.

XFakeEvents

XFakeEvents provides a streaming interface for controlling the pointer in X Windows. XFakeEvents takes as input a stream of positions and mouse button combinations and generates appropriate events for the specified X Windows server. Originally written by then undergraduate Ken Russell, this module is extremely useful in interfacing perceptual systems to traditional desktop applications.

ColorSample

ColorSample is another simple, low level utility. Given a video image, ColorSample outputs the average color and luminance values for pre-defined regions in that image. There is no particular limit to the number of regions in the image; however, the number and size of the regions limit the frame rate of the utility. A desired frame rate can be specified as a command line option, and ColorSample will print error messages when it can not meet a given frame rate. If ColorSample can run faster than the given rate, it will slow down as appropriate.

VisualFilter

VisualFilter, named for a technique championed by Mann [127], re-maps a video image on to a polygonal mesh based on specifications from a file. In effect, VisualFilter maps real-time video images on to polygons as if they were textures. The geometry of the polygons is stored in a modified point dictionary form [67] that includes which section of the video image should be mapped to which polygon. This system allows for visual re-mappings that are impossible with traditional lenses. While I originally wrote this utility for the SGI Onyx with Reality Engine 2 and Sirius video capture board, the same technique can now be used on much lower priced machines.

PIPR: probabilistic image patch recognition

This module, adapted by Bernt Schiele from his doctoral thesis work [190,189,217], classifies video image patches based on multidimensional receptive field histograms. For training, a library of images, grouped into recognition classes, is selected. Each image is split into sub-images corresponding to the areas of most interest to create an image patch database. At run time, the system returns the probabilities for a given video image's patches matching patches represented in the library. Note that the number of probabilities returned per frame is the number of sub-images times the number of classes represented in the training database. In the specific system described later, a grid of 4 by 4 sub-images is used for three classes of actions resulting in 48 probabilities per frame. These probabilities can be used as features themselves. The system runs at ten frames per second on a SGI R10000 O2.

TimedData

TimedData is a utility for playing back data. It reads a specified number of lines from its input and outputs these lines at the specified frame rate. In general, TimedData is used for testing or for demonstrations, which is why it doesn't appear in the architecture diagram.

Hidden Markov models

Many of the vision toolkit modules described in the last section concentrate on generating and filtering feature vectors. This section will describe a method for recognizing events based on these feature vectors. Hidden Markov Models (HMM's), through Entropic's HTK toolkit, will be used in this thesis to recognize word signs in sign language, tasks in a ``paintball'' style game, and changes in location. Before the specifics of each of these systems can be discussed in subsequent chapters, a general overview on the training and testing of HMM's is necessary.

Hidden Markov models are used prominently and successfully in speech recognition and, more recently, in handwriting recognition [252,96,211]. Related to dynamic time warping, HMM's are extremely useful in modeling events characterized by features changing through time. Explicit segmentation is not necessary for either training or recognition, eliminating possible errors from pre-segmentation schemes. The output of the recognizer is a stream of time-stamped events that can be compared to a reference training stream for error calculation. In addition, models of language and context can be applied on several different levels. HMM's allow the tailoring of the model to the task selectively, knowledgeably, and scalably. Consequently, HMM's seem ideal for recognizing the complex, time-structured events that mark the everyday life of a user.

While a substantial body of literature exists on HMM technology [14,96,169,252], this section briefly outlines a traditional discussion of the algorithms. After outlining the fundamental theory in training and testing a discrete HMM, this result is then generalized to the continuous density case used in the experiments. For broader discussion of the topic, [96,169,206] are recommended.

Topology

A time domain process demonstrates a Markov property if the conditional probability density of the current event, given all present and past events, depends only on the $j$th most recent events. If the current event depends solely on the most recent past event, then the process is a first order Markov process.

The initial topology for an HMM can be determined by estimating how many different states (i.e. events) are involved for each ``unit class.'' Examples of ``units'' include phonemes in speech [96], signs in sign language [219], or letters in handwriting [211]. A unit class is a particular type of unit. For example, the lowercase letters of the alphabet would be 26 classes in handwriting. Once an initial topology is chosen, fine tuning can be performed empirically for each class, by rerunning the same training and testing experiments with different topologies. To simplify the situation, one topology may be chosen for all classes. For example, for several applications in this thesis, an initial topology of five states was considered sufficient for the most complex class. To handle less complicated classes, skip transitions can be specified. Figure 2-15 shows a 5-state HMM with and without such skip transitions. In this case, the skip transitions allow the HMM to emulate a 3- or 4-state HMM. While a different HMM topology could be specified for each unit class depending on its complexity, similar accuracy gains can be realized by specifying one HMM model with appropriate skip transitions for all classes. Ideally, training for each unit class weights that class's model's transitions to emulate the appropriate HMM topology automatically. In research systems, such skip transition models are appropriate, since a great deal of time may be spent in optimizing a particular class's model at the expense of exploring better features or higher level relationships between models.

Figure 2-15: Variants on a 5-state HMM. The top model is a standard left-to-right HMM with no skip transitions. The bottom model has two skip transitions effectively allowing it to become a 3-state or 4-state model depending on training.
\begin{figure}\centerline{
\epsfysize=2.5in
\epsfbox{images/5state-with-without.ps}
} \end{figure}

Mathematical basis for evaluation, estimation, and decoding

In order to proceed more smoothly, a list of symbols that will be used in this discussion is provided below. The meaning for some of these variables will become clearer in context, but the reader is urged to gain some familiarity with them before continuing.

$T$: the number of observations.

$N$: number of states in the HMM.

$L$: distinct number of possible observations.

$s$: a state. For convenience (and with regard to convention in the HMM literature), state $i$ at time $t$ will be denoted as $s_t = i$.

$S$; the set of states. $S_I$ and $S_F$ will be used to denote the set of initial and final states respectively.

$O_t$: an observation at time $t$.

${\bf O}$: an observation sequence $O_1, O_2,...O_T$.

$v$: a particular type of observation.

$a$: state transition probability. $a_{ij}$ represents the transition probability from state $i$ to state $j$.

$A$: the set of state transition probabilities.

$b$: state output probability. $b_j(k)$ represents the probability of generating some discrete symbol $v_k$ in state $j$.

$B$: the set of state output probabilities.

$\pi$: initial state distribution.

$\lambda$: a convenience variable representing a particular hidden Markov model. $\lambda$ consists of $A$, $B$, and $\pi$.

$\alpha$: the ``forward variable,'' a convenience variable. $\alpha_t(i)$ is the probability of the partial observation sequence to time $t$ and state $i$, which is reached at time $t$, given the model $\lambda$. In notation, $\alpha_t(i) = Pr(O_1, O_2, ...,
O_t, s_t=i\vert\lambda)$.

$\beta$: the ``backward variable,'' a convenience variable. Similar to the forward variable, $\beta_t(i) = Pr(O_{t+1}, O_{t+2},
..., O_T \vert s_t=i,\lambda)$, or the probability of the partial observation sequence from $t+1$ to the final observation $T$, given state $i$ at time $t$ and the model $\lambda$.

$\gamma$: generally used for a posterior probabilities. $\gamma_t(i,j)$ will be defined as the probability of a path being in state $i$ at time $t$ and making a transition to state $j$ at time $t+1$, given the observation sequence and the particular model. In other words, $\gamma_t(i,j) = Pr(s_t=i,
s_{t+1}=j\vert{\bf O},\lambda)$. $\gamma_t(i)$ will be defined as the posterior probability of being in state $i$ at time $t$ given the observation sequence and the model, or $\gamma_t(i) = Pr(s_t=i\vert{\bf O},\lambda)$.

There are three key problems in HMM use. These are the evaluation problem, the estimation problem, and the decoding problem. The evaluation problem is that given an observation sequence and a model, what is the probability that the observed sequence was generated by the model ( $Pr({\bf O}\vert\lambda)$)? If this can be evaluated for all competing models for an observation sequence, then the model with the highest probability can be chosen for recognition.

$Pr({\bf O}\vert\lambda)$ can be calculated several ways. The naive way is to sum the probability over all the possible state sequences in a model for the observation sequence:


\begin{displaymath}Pr({\bf O}\vert\lambda) = \sum_{all S} \prod_{t=1}^T a_{s_{t-1}s_t}b_{s_t}(O_t) \end{displaymath}

The initial distribution $\pi_{s_1}$ is absorbed into the notation for $a_{s_0}{s_1}$ for simplicity in this discussion. The above equation can be better understood by ignoring the outside sum and product and setting $t=1$. Assuming a particular state sequence through the model and the observation sequence, the inner product is the probability of transitioning to the state at time 1 (in this case, from the initial state) times the probability of observation 1 being output from this state. By multiplying over all times 1 through T, the probability that the state sequence $S$ and the observation sequence ${\bf O}$ occur together is obtained. Summing this probability for all possible state sequences $S$ produces $Pr({\bf O}\vert\lambda)$. However, this method is exponential in time, so the more efficient forward-backward algorithm is used in practice.

The forward variable has already been defined above. Here its inductive calculation, called the forward algorithm, is shown (from [96]).

The first step initializes the forward variable with the initial probability for all states, while the second step inductively steps the forward variable through time. The final step gives the desired result $Pr({\bf O}\vert\lambda)$, and it can be shown by constructing a lattice of states and transitions through time that the computation is only order $O(N^2T)$ where $N$ is the number of states and $T$ is the number of observations.

Another way of computing $Pr({\bf O}\vert\lambda)$ is through use of the backward variable $\beta$, as already defined above, in a similar manner.

The estimation problem concerns how to adjust $\lambda$ to maximize $Pr({\bf O}\vert\lambda)$ given an observation sequence ${\bf O}$. Given an initial model, which can have flat probabilities, the forward-backward algorithm allows us to evaluate this probability. All that remains is to find a method to improve the initial model. Unfortunately, an analytical solution is not known, but an iterative technique can be employed.

Using the actual evidence from the training data, a new estimate for the respective output probability can be assigned


\begin{displaymath}\bar{b}_j(k) = \frac{\sum_{t \epsilon O_t = v_k}
\gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)} \end{displaymath}

where $\gamma_t(i)$ is defined as the posterior probability of being in state $i$ at time $t$ given the observation sequence and the model. Similarly, the evidence can be used to develop a new estimate of the probability of a state transition ($\bar{a}_{ij}$) and initial state probabilities ($\bar{\pi}_i$). Thus,


\begin{displaymath}\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1}\gamma_t(i,j)}
{\sum_{t=1}^{T-1} \gamma_t(i)} \end{displaymath}

Initial state probabilities can also be re-estimated through the formula


\begin{displaymath}\bar{\pi}_i = \gamma_1(i) \end{displaymath}

Thus all the components of $\lambda$, namely $A$, $B$, and $\pi$ can be re-estimated. Since either the forward or backward algorithm can be used to evaluate $Pr({\bf O} \vert \bar{\lambda})$ versus the previous estimation, the above technique can be used iteratively to converge the model to some limit. While the technique described only handles a single observation sequence, it is easy to extend to a set of observation sequences [96,14,252].

While the estimation and evaluation processes described above are sufficient for the development of an HMM system, the Viterbi algorithm provides a quick means of evaluating a set of HMM's in practice as well as providing a solution for the decoding problem [96]. In decoding, the goal is to recover the state sequence given an observation sequence. The Viterbi algorithm can be viewed as a special form of the forward-backward algorithm where only the maximum path at each time step is taken instead of all paths. This optimization reduces computational load and additionally allows the recovery of the most likely state sequence. The steps to the Viterbi algorithm are

In many HMM system implementations, the Viterbi algorithm is used for evaluation at recognition time. Note that since Viterbi only guarantees the maximum of $Pr({\bf O}, S\vert\lambda)$ over all $S$ (as a result of the first order Markov assumption) instead of the sum over all possible state sequences, the resultant scores are only an approximation. For example, if there are two mostly disjoint state sequences through one model with medium probability and one state sequence through a second model with high probability, the Viterbi algorithm would favor the second HMM over the first. However, Rabiner [169] shows that the probabilities obtained from both methods are typically very close.

In practice, the Viterbi algorithm may be modified with a limit on the lowest numerical value of the probability of the state sequence, which in effect causes a beam search of the space. While this modification no longer guarantees an optimum result, a considerable speed increase may be obtained. Furthermore, to aid in estimation, the Baum-Welch algorithm may be manipulated so that parts of the model are held constant while other parts are trained.

So far the discussion has assumed some method of quantization of feature vectors into classes, but it is easy to see how the actual probability densities might be used. However, the above algorithms must be modified to accept continuous densities. The efforts of Baum, Petrie, Liporace, and Juang [15,14,124,107] showed how to generalize the Baum-Welch, Viterbi, and forward-backward algorithms to handle a variety of characteristic densities. In this context, however, the densities will be assumed to be Gaussian. Specifically,


\begin{displaymath}b_j(O_t) = \frac{1}{\sqrt{(2 \pi)^n \vert\sigma_j\vert}}
e^{\frac{1}{2}(O_t-\mu_j)'\sigma_j^{-1}(O_t-\mu_j)}\end{displaymath}

Initial estimations of $\mu$ and $\sigma$ may be found by dividing the evidence evenly among the states of the model and calculating the mean and variance in the normal way.


\begin{displaymath}\mu_j = \frac{1}{T}\sum^T_{t=1}O_t \end{displaymath}


\begin{displaymath}\sigma_j = \frac{1}{T}\sum^T_{t=1}(O_t - \mu_j)(O_t - \mu_j)' \end{displaymath}

Whereas flat densities were used for the initialization step before, here the evidence is used. Now all that is needed is a way to provide new estimates for the output probability. We wish to weight the influence of a particular observation for each state based on the likelihood of that observation occurring in that state. Adapting the solution from the discrete case yields


\begin{displaymath}\bar{\mu_j} = \frac{\sum_{t=1}^T \gamma_t(j)O_t}{\sum_{t=1}^T
\gamma_t(j)} \end{displaymath}

and


\begin{displaymath}\bar{\sigma_j} = \frac{\sum_{t=1}^{T} \gamma_t(j)(O_t -
\bar{\mu_j})(O_t - \bar{\mu_j})^t}{\sum_{t=1}^{T} \gamma_t(j)} \end{displaymath}

In practice, $\mu_j$ is used to calculate $\bar{\sigma_j}$ instead of the re-estimated $\bar{\mu_j}$ for convenience. While this is not strictly proper, the values are approximately equal in contiguous iterations [96] and seem not to make an empirical difference [252]. Since only one stream of data will be used and only one mixture (Gaussian density) will be assumed, the algorithms above can proceed normally incorporating these changes for the continuous density case.

Training an HMM network

When using HMM's to recognize strings of data such as continuous speech, cursive handwriting, or American Sign Language sentences, several methods can be used to bring context to bear in training and recognition. A simple context modeling method is embedded training. While initial training of the models might rely on manual segmentation or, as in this thesis, evenly dividing the evidence among the models for an automatic initial estimate, embedded training trains the models in situ and allows model boundaries to shift through a probabilistic entry into the initial states of each model [252].

Often, a unit can be affected by both the unit in front of it and the unit behind it. For phonemes in speech, this is called ``co-articulation.'' While this can confuse systems based on recognizing isolated units, the context information can be used to aid overall recognition. For example, if two units are often seen together, recognizing the two units as one group may be beneficial.

A final use of context is best described as the inter-word level in speech (speech processing can be thought of on three levels: phoneme, intra-word, and inter-word). This is one level removed from the inter-unit context described in the preceding paragraph. Statistical grammars relating the probability of the co-occurrence of two or more words can be used to weight the recognition process. Grammars that associate two words are called bigrams, whereas grammars that associate three words are called trigrams. Rule-based grammars can also be used to aid recognition.

This section described the foundations for hidden Markov models without regard to their application. Subsequent chapters which use this framework will address the details on how to evaluate and tune an HMM-based recognizer for their specific domains.


Everyday Use

We are confronted with insurmountab