Perceptual User Interfaces using Vision-based Eye Trackers
1. R. Ruddarraju, A. Haro, I. Essa, "Fast Multiple Camera Head Pose Tracking," In Proceedings of Vision Interface 2003, Halifax, Canada.
2. R. Ruddarraju, A. Haro, K. Nagel, Q. Tran, I. Essa, G. Abowd, E. Mynatt, "Perceptual User Interfaces using Vision-Based Eye Tracking," In Proceedings of the Fifth International Conference on Multimodal Interfaces (ICMI-PUI'03), Nov. 5-7, 2003 (in conjunction with ACM UIST 2003), ACM Press, Vancouver B.C., Canada. (To appear)
Head pose and eye gaze are very valuable cues in face-to-face interactions between people. The same information is valuable to computer systems a person intends to interact with: awareness of eye gaze gives the system the context that the user is looking at it, and thereby supports effective interaction with the user.
In this paper, we present a real-time vision-based eye tracking system to robustly track a user's eyes and head movements. Our system utilizes robust eye tracking data from multiple cameras to estimate 3D head orientation via triangulation. Multiple cameras also afford a larger tracking volume than is possible with an individual sensor, which is valuable in an attentive environment. Each individual eye tracker exploits the red-eye effect to track eyes robustly using an infrared lighting source. Although our system relies on infrared light, it can still track reliably in environments saturated with infrared light, such as a residential living room with much sunlight.
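The red-eye effect can be exploited by differencing two IR-lit frames, as in the IBM pupil-detection technique this system builds on: pupils retro-reflect light from LEDs near the camera axis but not from off-axis LEDs. Below is a minimal sketch of that bright-pupil/dark-pupil differencing; the threshold, blob-size limits, and function name are illustrative, not the paper's.

```python
from collections import deque

def detect_pupils(on_axis, off_axis, threshold=40, min_area=2, max_area=50):
    """Find candidate pupil centroids from a pair of IR-illuminated frames.

    on_axis:  grayscale frame lit by LEDs near the camera axis (bright pupils)
    off_axis: grayscale frame lit by LEDs away from the axis (dark pupils)
    Both are 2D lists of intensities with the same dimensions.
    """
    h, w = len(on_axis), len(on_axis[0])
    # Pupils glow in the on-axis frame but not the off-axis one, so the
    # difference image isolates them; ambient IR raises both frames
    # roughly equally and cancels out.
    mask = [[on_axis[y][x] - off_axis[y][x] > threshold for x in range(w)]
            for y in range(h)]
    seen = [[False] * w for _ in range(h)]
    pupils = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            # Flood-fill one connected blob of bright-difference pixels.
            queue = deque([(y, x)])
            seen[y][x] = True
            blob = []
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = cy + dy, cx + dx
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Keep only plausibly pupil-sized blobs; return (x, y) centroids.
            if min_area <= len(blob) <= max_area:
                pupils.append((sum(p[1] for p in blob) / len(blob),
                               sum(p[0] for p in blob) / len(blob)))
    return pupils
```

The differencing step is what makes the sunlight claim plausible: sunlight contributes nearly equally to both frames, so only the retro-reflected pupil response survives the subtraction.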
We also present experiments and quantitative results to demonstrate the robustness of our eye tracking in two application prototypes: the Family Intercom and the Cook's Collage. In the Family Intercom application, eye gaze estimation is used to make inferences about the desire of an elderly parent to communicate with remote family members. In the Cook's Collage application, eye gaze estimates are used to assist a user while cooking in a kitchen as well as to evaluate the effectiveness of the Collage's user interface. In both cases, our tracking system works well in real-world environments, subject to varying lighting conditions, while allowing the user interaction to be unobtrusive yet engaging.
Figure 1. Program Flow
Our system uses IBM BlueEyes infrared lighting cameras (courtesy of IBM Research). These cameras serve as sensors for our eye tracking algorithm, and the eyes tracked across cameras are combined to estimate the user's head pose and eye gaze direction. From the tracked head pose we estimate the user's eye gaze and measure whether the user is looking at a previously defined region of interest, which the prototype applications use to further interact with the user.
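The final step above, testing whether the estimated gaze falls on a predefined region of interest, can be sketched as a ray-rectangle intersection. This is illustrative geometry only, assuming each region is modeled as a planar rectangle with orthogonal edge vectors; the paper does not specify its exact test.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def looking_at_region(head_pos, gaze_dir, corner, u_edge, v_edge):
    """True if the gaze ray from head_pos along gaze_dir hits the rectangle
    spanned by orthogonal edges u_edge and v_edge starting at corner."""
    normal = cross(u_edge, v_edge)
    denom = dot(normal, gaze_dir)
    if abs(denom) < 1e-9:      # gaze parallel to the region's plane
        return False
    t = dot(normal, sub(corner, head_pos)) / denom
    if t <= 0:                 # region lies behind the user
        return False
    hit = tuple(p + t * d for p, d in zip(head_pos, gaze_dir))
    # Express the hit point in the rectangle's own edge coordinates.
    w = sub(hit, corner)
    u = dot(w, u_edge) / dot(u_edge, u_edge)
    v = dot(w, v_edge) / dot(v_edge, v_edge)
    return 0.0 <= u <= 1.0 and 0.0 <= v <= 1.0
```

In practice each application would register its regions (display, recipe, photographs) once as such rectangles and query them every frame.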
Figure 2. Feature Tracking, 2 eyes and 2 mouth corners.
We use several pre-calibrated cameras to estimate a user's head pose. For each camera, we use the tracked eye locations to estimate mouth corners. These two mouth corners and eye positions are then used as low-level features across all cameras to estimate the user's 3D head pose. We use a combination of stereo triangulation, noise reduction via interpolation, and a camera switching metric to use the best subsets of cameras for better tracking as a user moves their head through the tracking volume. Further details of our head pose estimation can be found in our earlier work on fast multiple-camera head pose tracking.
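The stereo-triangulation step can be sketched as midpoint triangulation: each camera's tracked feature defines a viewing ray, and the feature's 3D position is taken halfway along the shortest segment between two rays. This is one standard formulation, assumed here for illustration; the paper does not detail its exact method.

```python
def triangulate(origin_a, ray_a, origin_b, ray_b):
    """Midpoint triangulation of a feature seen from two calibrated cameras.

    origin_*: camera centers in world coordinates.
    ray_*:    direction vectors from each camera toward the feature
              (from the calibrated back-projection of its pixel location).
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    w0 = tuple(p - q for p, q in zip(origin_a, origin_b))
    a, b, c = dot(ray_a, ray_a), dot(ray_a, ray_b), dot(ray_b, ray_b)
    d, e = dot(ray_a, w0), dot(ray_b, w0)
    denom = a * c - b * b          # zero only for parallel rays
    s = (b * e - c * d) / denom    # closest-point parameter along ray_a
    t = (a * e - b * d) / denom    # closest-point parameter along ray_b
    pa = tuple(p + s * r for p, r in zip(origin_a, ray_a))
    pb = tuple(p + t * r for p, r in zip(origin_b, ray_b))
    # With noisy tracking the rays rarely meet; return the segment midpoint.
    return tuple((x + y) / 2.0 for x, y in zip(pa, pb))
```

Triangulating both eyes and the two mouth corners gives four 3D points, from which a head orientation can be fit.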
Figure 3. Multiple camera setup
Multiple cameras provide both a large tracking volume and 3D head pose information. However, as a user moves in the tracking volume, their eyes may no longer be visible from some cameras. Our system detects when a user's head is moving away from a camera and uses a different subset of cameras, excluding it, to estimate the 3D pose more accurately. This is done because cameras with only partial views of the face will have increased eye-tracking errors, since the tracked regions' appearance (and possibly motion) will no longer be eye-like. Removing these cameras from the 3D pose calculation is important because the pose estimate is sensitive to noise when computed from a small number of cameras. In practice, camera subset switching is not done very often, but it must be done for certain head angles to avoid incorrect pose estimates.
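A camera-switching metric of the kind described might score each camera by how frontally it views the face and drop cameras past an angular threshold. The 45° cutoff, vector representation, and function name below are assumptions for illustration; the paper does not give its exact metric.

```python
import math

def select_cameras(head_dir, cameras, max_angle_deg=45.0):
    """Keep only cameras with a sufficiently frontal view of the face.

    head_dir: unit vector the face is pointing along (world frame).
    cameras:  dict name -> unit vector from the head toward that camera.
    """
    keep = []
    for name, to_cam in cameras.items():
        # Angle between where the face points and where the camera sits.
        cos_angle = sum(h * c for h, c in zip(head_dir, to_cam))
        angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
        if angle <= max_angle_deg:   # face is turned toward this camera
            keep.append(name)
    return keep
```

Pose would then be triangulated only from the surviving subset, avoiding the partial-view cameras whose eye tracks have degraded.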
Fisher's linear discriminant has previously been used to compensate for varying lighting conditions in face recognition. We use it to compute a classification score for candidate eyes versus non-eyes and have found it to yield better classification than PCA on our data.
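For two classes, Fisher's linear discriminant projects features onto the direction w = Sw^-1 (m_eye - m_noneye), where Sw is the within-class scatter matrix; candidates scoring high along w are classified as eyes. A sketch on 2-D features follows. Real eye patches would first be reduced to a low-dimensional feature vector; the 2-D case is an assumption that keeps the matrix inverse explicit.

```python
def fisher_direction(eyes, non_eyes):
    """Two-class Fisher discriminant direction for 2-D feature vectors."""
    def mean(xs):
        n = len(xs)
        return [sum(v[0] for v in xs) / n, sum(v[1] for v in xs) / n]

    def scatter(xs, m):
        s = [[0.0, 0.0], [0.0, 0.0]]
        for v in xs:
            d = [v[0] - m[0], v[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
        return s

    m1, m0 = mean(eyes), mean(non_eyes)
    s1, s0 = scatter(eyes, m1), scatter(non_eyes, m0)
    # Within-class scatter, with a tiny ridge term for invertibility.
    sw = [[s1[i][j] + s0[i][j] + (1e-6 if i == j else 0.0)
           for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    # w = Sw^-1 (m_eye - m_noneye); larger w.x means "more eye-like".
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

def score(w, x):
    """Classification score of candidate x along the Fisher direction."""
    return w[0] * x[0] + w[1] * x[1]
```

Unlike PCA, which keeps directions of maximum overall variance, this direction explicitly maximizes between-class separation relative to within-class spread, which is why it copes better with lighting variation.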
Figure 4. Tracking using Fisher's linear discriminant; classification using PCA
The Family Intercom project investigates context-aware family communication between homes. The intent is to provide interfaces that facilitate a person's ability to decide whether a proposed conversation should be initiated. The user-gaze interface shown in Figure 6 provides feedback to the caller to help them determine whether it would be appropriate to initiate a voice conversation.
Figure 5. Family intercom setup
In one home, the vision-based eye tracking system tracks user gaze towards a collection of framed family photographs. Figure 5 shows the setup of the eye trackers on a common household table in an elderly parent's home. In the second home, the remote panel is based on the Digital Family Portrait and displays a portrait and a qualitative estimate of activity for the family member pictured in the first home. Figure 6b shows the interface at a remote family member's house. When a family member notices the digital portrait of their family, they simply touch the portrait to create a connection.
Figure 6. Family Intercom interface with eye tracking data from vision system
The visual gaze tracker conveys statistics of the callee's eye gaze towards the family pictures to the caller and facilitates the appropriate social protocol for initiating conversations between the users.
The Cook's Collage application explores how to support retrospective memory (keeping a record of transpired events) in a household cooking scenario. Our system estimates user attention in three regions: the Collage's display, the recipe, and the cooking area.
Figure 7. Cook's Collage setup
Currently, we use the head pose data to evaluate the usability of the Collage's display. The Cook's Collage expects the cook to refer to its display for additional cooking prompts every time they return to cooking after an interruption. Ignoring the display after an interruption could suggest the need for user interface improvements or a better placement of displays.
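The usability check just described (did the cook consult the display after each interruption?) could be computed from a log of per-frame gaze-region estimates. The log format, 10-second window, and function name below are hypothetical, not from the paper.

```python
def display_checks_after_interruptions(gaze_log, interruptions, window=10.0):
    """For each interruption time, report whether the cook looked at the
    Collage display within `window` seconds afterwards.

    gaze_log:      list of (timestamp, region) pairs, region being one of
                   "display", "recipe", or "cooking".
    interruptions: list of interruption timestamps.
    """
    results = []
    for t in interruptions:
        # Any "display" gaze sample inside the post-interruption window
        # counts as the cook having consulted the Collage.
        looked = any(t <= ts <= t + window and region == "display"
                     for ts, region in gaze_log)
        results.append((t, looked))
    return results
```

A low fraction of True results across sessions would be the signal that the display is being ignored and its placement or interface needs rethinking.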
Figure 8. Collage display
Our subject pool consists of 4 subjects for the Family Intercom and 4 for the Cook's Collage. For the Family Intercom experiment, a subject performs a regular daily activity, like reading or eating, while sitting near a table containing the user's family pictures. A separate 15-minute sequence consisting of 225 frames is recorded throughout the experiment to capture ground truth for verification. The video is hand-labeled to record the regions viewed by the user, and these hand-labeled frames are compared with the regions estimated by the head pose tracking system. The accuracy percentage is the fraction of frames estimated correctly by the system.
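The accuracy measure reduces to a per-frame comparison of hand labels against system estimates; a minimal sketch (the label vocabulary here is illustrative):

```python
def region_accuracy(ground_truth, estimated):
    """Fraction of frames where the estimated gaze region matches the
    hand-labeled ground-truth region."""
    assert len(ground_truth) == len(estimated)
    correct = sum(g == e for g, e in zip(ground_truth, estimated))
    return correct / len(ground_truth)
```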
For the Cook's Collage experiment, subjects were asked to cook a recipe provided to them, and a video was also recorded to capture ground truth. Since this experiment is shorter than the Family Intercom experiment, a 10-minute sequence was used for comparison instead. This is also motivated by the fact that the user is more actively involved in the cooking experiment, which provides a larger range of head poses for comparison purposes.
Subject                    Correct estimate
Subject 1                  87%
Subject 2                  88%
Subject 3                  90%
Subject 4                  84%
Average correct estimate   87.25%

Table 1: Statistics of estimated eye contact for the Family Intercom experiment.
Subject                    Correct estimate
Subject 1                  81%
Subject 2                  84%
Subject 3                  78%
Subject 4                  82%
Average correct estimate   81.25%

Table 2: Statistics of estimated eye contact for the Cook's Collage experiment.
1. A. Aaltonen, A. Hyrskykari, and K. Raiha. 101 spots, or how do users read menus? In Human Factors in Computing Systems: CHI 98, pages 132-139, New York, 1998. ACM Press.
2. A. Allport. Visual Attention. MIT Press, 1993.
3. M. Argyle. Social Interaction. Methuen & Co., London, England, 1969.
4. M. Argyle and M. Cook. Gaze and Mutual Gaze. Cambridge University Press, Cambridge, UK, 1976.
5. R. Ruddarraju, A. Haro, and I. Essa. Fast multiple camera head pose tracking. In Proceedings of Vision Interface 2003, Halifax, Canada, 2003.
6. P. Barber and D. Legge. Perception and Information. Methuen & Co., London, England, 1976.
7. P. Belhumeur, J. Hespanha, and D. Kriegman. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), July 1997.
8. M. La Cascia, S. Sclaroff, and V. Athitsos. Fast, reliable head tracking under varying illumination: An approach based on robust registration of texture mapped 3-d models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000.
9. J. H. Goldberg and J. C. Schryver. Eye-gaze determination of user intent at the computer interface. Elsevier Science Publishing, New York, 1995.
10. A. Haro, M. Flickner, and I. Essa. Detecting and tracking eyes by using their physiological properties, dynamics, and appearance. In IEEE Computer Vision and Pattern Recognition, pages 163-168, 2000.
11. M. Harville, A. Rahimi, T. Darrell, G. Gordon, and J. Woodfill. 3-d pose tracking with linear depth and brightness constraints. In International Conference on Computer Vision, 1999.
12. B. Jabrain, J. Wu, R. Vertegaal, and L. Grigorov. Establishing remote conversations through eye contact with physical awareness proxies. In Extended Abstracts of ACM CHI, 2003.
13. Y. Matsumoto, T. Ogasawara, and A. Zelinsky. Behavior recognition based on head pose and gaze direction measurement. In IEEE International Conference on Intelligent Robots and Systems, 2000.
14. C. H. Morimoto, D. Koons, A. Amir, and M. Flickner. Pupil detection and tracking using multiple light sources. Technical Report RJ-10117, IBM Almaden Research Center, 1998.
15. E. Mynatt, J. Rowan, and A. Jacobs. Digital family portraits: Providing peace of mind for extended family members. In ACM CHI, 2001.
16. K. Nagel, C. Kidd, T. O'Connell, S. Patil, and G. Abowd. The family intercom: Developing a context-aware audio communication system. In Ubicomp, 2001.
17. A. Schoedl, A. Haro, and I. Essa. Head tracking using a textured polygonal model. In Proceedings of the Workshop on Perceptual User Interfaces, 1998.
18. R. Stiefelhagen. Tracking focus of attention in meetings. In International Conference on Multimodal Interfaces, 2002.
19. Q. Tran and E. Mynatt. Cook's Collage: Two exploratory designs. In CHI 2002 Conference Proceedings, 2002.
20. A. L. Yarbus. Eye Movements during Perception of Complex Objects. Plenum Press, New York, 1967.