They discuss the fact that in the absence of a mediaspace, people move fluidly from any one of these types of communication to any other.
They built their system around "buttons," which are basically user-configurable scripts for controlling the resources. Five major button types emerged/evolved: background, sweep, glance, office share, Vphone. These reflect different intentions of use of the mediaspace. The background button is (roughly) the default (disconnected) state. It puts a common area on your monitor. The sweep gives you ~1 second peeks into some set of locales, whereas glance is focused on one person for ~3 seconds. Office share and video phone (Vphone) are two-way A/V connections and differ only in intended use (long-term collaboration versus focused discussion).
They initially dealt with the privacy problem by being "intentionally naive" and letting the social norms control privacy issues. They separate privacy into several issues:
They built "Goddard" on top of "iiif" to control the resources and defeat iiif's control/ownership notions. It is what allows you to have control over your resources. They use Gaver's auditory cues to provide notification. They note that non-speech auditory cues can be tailored to be much less disruptive (gradual buildup in volume, auditory icons, etc.).
They also built a time-based notification system called "Khronika" which provides selective awareness of planned and electronic events. It can be told to keep the user informed about "events," "daemons," and "notifications." Events are defined in terms of their class, their start time, and their duration (conferences, visitors, local movies, arriving email). Events are also "classified" to provide higher-level abstractions for selection. Daemons are programs that watch for specific things: they are a set of constraints that can fire a notification. They also mention here the Polyscope and Portholes systems. Polyscope provided a very limited "history" mechanism. They also use the awareness system (Polyscope) as a gateway to intentional connection (vphone, glance, etc.).
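A minimal sketch of the event/daemon/notification split described above. This is not Khronika's actual interface: the class names, the fields beyond class/start/duration, the time units, and the `notify` hook are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Event:
    """An event as the notes describe it: a class, a start time, a duration."""
    event_class: str       # e.g. "seminar", "visitor", "email" (illustrative)
    start: float           # minutes since midnight (assumed unit)
    duration: float        # minutes (assumed unit)
    description: str = ""


@dataclass
class Daemon:
    """A set of constraints that fires a notification when an event matches."""
    wanted_classes: set
    lead_time: float                   # notify this many minutes ahead
    notify: Callable[[str], None]      # hypothetical delivery hook

    def matches(self, event: Event, now: float) -> bool:
        # Constraint 1: the event is in a class the user selected.
        # Constraint 2: the event starts within the lead-time window.
        return (event.event_class in self.wanted_classes
                and 0 <= event.start - now <= self.lead_time)

    def poll(self, events: List[Event], now: float) -> int:
        """Fire one notification per matching event; return how many fired."""
        fired = 0
        for event in events:
            if self.matches(event, now):
                self.notify(f"{event.event_class}: {event.description}")
                fired += 1
        return fired
```

In this sketch the notification is just a string handed to a callback; in the real system it would presumably be an auditory cue or a message, which is where the selectivity pays off.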
Almost all the examples are of the (thrice-yearly) IETF transmissions. They note that the bandwidth consumed by IETF is 100 to 300 Kbps with spikes to 500 Kbps. They mention (but don't explain) RTP (the Real-time Transport Protocol) used by most (?) MBONE apps. They say that the cellular phones get the cost of an audio channel down to 18 Kbps, including overhead. (Raw 8-bit, 8 kHz audio plus control info on vat is said to be 75 Kbps.)
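A quick check of the raw audio figure, reading "8 bit, 8K" as 8-bit samples at an 8 kHz sampling rate (that reading is an assumption). The payload alone comes to 64 Kbps, so the quoted 75 Kbps leaves roughly 11 Kbps for headers and control information.

```python
def pcm_bitrate_kbps(bits_per_sample: int, sample_rate_hz: int) -> float:
    """Raw PCM payload rate in Kbps, taking 1 Kbps = 1000 bits/s."""
    return bits_per_sample * sample_rate_hz / 1000


raw = pcm_bitrate_kbps(8, 8000)   # 8-bit samples at 8 kHz
overhead = 75 - raw               # 75 Kbps is the figure quoted for vat
print(raw, overhead)
```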
On page 59 they have a reasonable table of the proposed thresholds for each type of media and transmission distance. They note that coding two things with one number is bad, but that's the way it is. This also does not allow these two parameters to be controlled individually; if you want stuff from far away of one type, you get all the stuff with a smaller threshold.
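A small sketch of why one number cannot control media type and distance separately. The TTL values below are made up for illustration and are not the ones from the table on page 59; the exact comparison used by real routers (`>` versus `>=`) is also an assumption here.

```python
def forwards(packet_ttl: int, link_threshold: int) -> bool:
    """A packet crosses a link only if its TTL meets the link's threshold."""
    return packet_ttl >= link_threshold


# Hypothetical initial TTLs per stream (NOT the values from the paper's table).
STREAM_TTL = {"audio_main": 200, "video_main": 120, "audio_local": 60}


def streams_admitted(link_threshold: int) -> list:
    """Names of all streams that a link with this threshold lets through."""
    return sorted(name for name, ttl in STREAM_TTL.items()
                  if forwards(ttl, link_threshold))
```

With these made-up numbers, any threshold low enough to admit `video_main` also admits `audio_main`; there is no setting that passes the video while blocking the audio, which is the coupling the notes complain about.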
They produce some interesting numbers for multicast packet forwarding: a Sparc 1+ forwards one packet in about 1.0 ms, a Sparc 10 in 0.6 ms. They mention that some people use the CPU speed of the machine forwarding packets as a bandwidth controller. The problem is that in the saturation case the (user-level) mrouted is not getting any cycles, so its neighbors will "time it out" and stop sending data to the site. This gives the CPU a rest, lets the mrouted run again, and then starts the flood again, yielding oscillation.
They explain that the LSRR option was bad because Van Jacobson found that periodically (at 30-second intervals) you would get huge (65-85%) packet losses if you were transmitting video. During non-trouble periods, losses were around 0.5%. He suggested that the LSRR was competing with normal routing updates. They mention the "bogus ICMP" packet problems that are caused by "screaming" routers when confronted by multicast traffic.
The basic scoop is that they base their routing choices on both the source and destination. (Regular OSPF uses only destination.) Normal OSPF can just use Dijkstra's algorithm to calculate routing; they have to calculate a form of the minimal spanning tree (since they have to worry about both sources and destinations) called the Steiner tree.
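For reference, a minimal sketch of the Dijkstra calculation that regular OSPF relies on, computing least-cost distances from one root. The graph format (a dict of neighbor-to-cost dicts) and the sample topology are assumptions for illustration; the multicast tree construction itself is not shown.

```python
import heapq


def dijkstra(graph: dict, source: str) -> dict:
    """Return the least total link cost from source to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a cheaper path was already found
        for neighbor, cost in graph.get(node, {}).items():
            candidate = d + cost
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


# Hypothetical four-router topology with symmetric link costs.
TOPOLOGY = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}
```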
They allow IP datagrams to be labelled with a TOS (Type of Service) parameter: minimize delay, maximize throughput, maximize reliability, minimize monetary cost, and normal service. They allow different routes based on this. Further, they can optimize because they can do "expanding ring search," which is basically just a set of pings with successively larger TTLs. When an MOSPF router sees an IP packet with a given TTL it can figure out that the TTL is too small to reach another group member and not bother forwarding it. They do not split the multicast stream across two equal-cost channels.
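A sketch of expanding ring search and of the router-side shortcut described above. The doubling schedule for the TTL, the function names, and the simplification of "distance" to a single hop count are all assumptions; the paper's actual procedure may differ.

```python
def expanding_ring_search(hops_to_nearest_member: int, max_ttl: int = 255):
    """Probe with growing TTLs until one reaches a group member.

    Returns (ttl_that_succeeded, probes_sent), or (None, probes_sent)
    if no TTL up to max_ttl is large enough.
    """
    ttl = 1
    probes = 0
    while ttl <= max_ttl:
        probes += 1
        if ttl >= hops_to_nearest_member:  # this probe reaches a member
            return ttl, probes
        ttl *= 2                           # doubling is an assumed schedule
    return None, probes


def router_should_forward(packet_ttl: int, hops_to_nearest_member: int) -> bool:
    """A router that knows the topology can drop probes that cannot arrive."""
    return packet_ttl >= hops_to_nearest_member
```

The second function is where the claimed optimization lives: because the router knows how far away the nearest member is, the small-TTL probes die at the first hop instead of flooding outward.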
They go into a significant amount of detail about the protocol necessary to make this all work. The upshot is that the routers have complete knowledge of the internetwork and can do smart things. They also provide some mechanisms to help the routers keep their tables smaller (coalescing routes). They compare MOSPF (on parts of the internet) to DVMRP with the following advantages claimed: increased stability, aggregation of sources to make global DVMRP tables smaller, slightly more efficient pruning, and optimization of the IP multicasting "ring search." They also do an analysis of the algorithm; they claim that only the cost of the Dijkstra calculations needs to be considered, and that the number of these can be reduced significantly by using some wildcards and default routes. (They claim a 200-router area implies 200 Dijkstra calculations, which can be done in 3 seconds on a 10 MIPS processor.)
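The parenthetical claim can be turned into a per-calculation budget. This is only arithmetic on the three quoted numbers (200 calculations, 3 seconds, 10 MIPS); the function name is made up.

```python
def per_run_budget(runs: int, total_seconds: float, mips: float):
    """Return (milliseconds per run, instructions per run)."""
    ms_per_run = total_seconds * 1000 / runs
    instructions_per_run = total_seconds * mips * 1_000_000 / runs
    return ms_per_run, instructions_per_run


print(per_run_budget(200, 3, 10))
```

That works out to 15 ms, or about 150,000 instructions, per Dijkstra calculation over a 200-router area.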
The MOSPF protocol doesn't really do much about what to do across AS boundaries. They introduce the "some routers get all the packets" trick to get around this problem. Inside the AS, all the routers know the whole state, so there is no problem, but across the boundaries little (or nothing) is known. I'm concerned by this, although I can't put my finger on why.
She notes that video-mediated conversation has been studied in the past and that video-mediated conversations tend to be more "orderly" than their face-to-face versions. The previous results on number of interruptions and pauses are inconsistent (although most involved studies of pairs). The claim is that perhaps listeners are less likely to attempt to seize the floor when using the video because the visual cues are less present. She notes that Rutter has shown that audio-only conversations are less spontaneous, more formal, and more socially distant than face-to-face discussions.
Her hypotheses were:
They observed usage not too different from the telephone. However, about 1/4 of the calls seemed to be users monitoring their environment. They noted that about 1/4 of the Glances were first thing in the morning, right after lunch, and on weekends, when it was uncertain if users would be in (and perhaps before the calling user was absorbed in their work).
They observed (duh) two novel uses of Cruiser: officeshare and "ambush." Ambush means that you connect to an empty office and then go about your business, waiting for that person to come in. Cruiser interactions involved more greeting and scheduling, but less problem solving and decision making. They attribute this to a lack of shared objects, tools, etc. The students would use Cruiser to see if the mentor was available for assistance and perhaps ask a question or two and/or schedule a meeting. Cruiser was frequently used to inquire about status.
They did some questionnaire-based studies on the users' perceptions of different media's effectiveness for various topics. Cruiser was most closely related to the telephone.
After 4 weeks, subjects were asked about privacy issues. Most didn't think there was a problem in their small, collaborative community. 4 of the 23 didn't want strangers looking into their office. Some analysis yielded that people felt that a conversation initiated by someone else via Cruiser was more of a violation than one initiated face-to-face. They were concerned about field-of-view privacy problems in the Cruiser system, as well as the hands-free audio.
AutoCruises (system-initiated) were accepted only 3% of the time, versus 54% for user-initiated calls. Sounds like the difference in whether you'll talk to someone when you pass them in the hall. They considered the short length of calls (especially all the status checking) as proof that the system was not as expressive as other media. The low completion of autocruises plus the privacy invasion and intrusion factor led them to conclude that they hadn't produced a system with enough power in the visual channel to handle all the human-to-human protocol.
Their conclusions revolve around the idea that Cruiser is too heavyweight, and they're right. The system requires attention to make things happen and it is obtrusive. One of their users said, "There is no halfway with Cruiser." They also complained about the lack of shared objects. They discuss that when the telephone was introduced in England, it took a while for the social norms to develop about how the device should be used; they suggest the same might be true of Cruiser. They continue the telephone analogy to the bitter end.
They suggest that they need a new system that balances "accessibility, privacy, and solitude." Accessibility is the ability to get to someone easily. Privacy is their word for control of what information is available. Solitude is the intrusion factor.
He claims that the two big areas for real-time shared workspaces are shared window systems and mediaspaces. He notes that the first handles only things in the computer and the second only things outside the computer. He notes that in conventional apps the seams are starting to be overcome by cut/paste/clipboards and uniform interfaces. He mentions three seams in CSCW apps:
Their basic trick is to overlay individual workspace images. They do this with overlays from CCD cameras as well as from the computer. They have a face camera, a camera pointed at the desk, a private monitor, and a shared monitor. It's a Mac system and so you can just drag apps from the shared to the private monitor as needed. They can merge the desktop and the video image. They can also screen share (with control, via the technique that Timbuktu uses) and share drawing surfaces. They provide an example use of the TWS workstation to help teach calligraphy. Seems a little biased.
He mentions four real time sharing alternatives and examples of each:
They found that people in the same room liked having the TWS' windows mirror the spatial layout. (They added this later.) It does this by understanding the floor-plan of the office. It only does this for the face windows, not the other displays.
They discuss several of the aspects that comprise informal communication. They point out that informal communication (due to excellent feedback) is better at handling situations where participants are elaborating or modifying what they are saying to deal with someone else's objections or misunderstandings. They classify the formality of communication with this diagram:
Daft and Lengel classified rich communication channels as ones that can overcome different frames of reference or clarify ambiguity to change understanding in a timely manner. (Ordering: 1) face-to-face 2) telephone 3) personal documents (letters) 4) impersonal documents 5) numeric documents.) The authors added bandwidth and spontaneity.
They identify some points that are important to informal communication:
Results: about 10% of the possible conversation opportunities were actually converted into conversations. In the same study about 41% of face-to-face opportunities resulted in conversation. Problems:
They do have a breakdown of CSCW apps based on three axes: distribution in space, distribution in time, and individual versus group support. They are handling (obviously) different place, same time, and group work.
The architecture is such that clients only talk to local servers (performance) and the servers communicate with each other. The servers exchange the inter-domain information this way; they process their own domain. The servers are not particularly concerned with what client programs may do with the source information (images) or properties (everything else).
The interface is fairly space-intensive and they (correctly) note that this is a detriment to casual use (one of their users specifically mentioned this). They make several mentions of serendipity in Portholes use. This seems crucial to mediaspace success to me, and is really tied to the digital world. It's harder to get serendipity with limited analog resources. They also mention that users have less motivation for use (via an example) if they "aren't guaranteed of seeing anything." This seems to motivate automatic processing based on "interest functions." They make a point of diving into the community-building effects of Portholes. They mention that it is a place for both the serious and the whimsical. I wonder how important that "both" really is.
They bring up some issues for future work:
They do this in the context of instructors explaining how to use machine-shop equipment. They noted the instructor's actions to be: find object, express, confirm. The operator's (student's) actions were: find object, understand, manipulate, respond. The focal point of the instructor changed every few seconds and might change even twice in one second. To support their application you need movable 3D focal points so the instructor can see/show. They noted the 3D expressions of the instructor and student: position, motion/manipulation, and confirmation.
They devised a model task and performed an experiment. The variables were having the instructor present (face-to-face) or remote and whether or not the instructor could use gestures. They do an analysis of the parts of speech used in the discussion so they can understand what the users were talking about. They also bring up a lot of good points about the effects of camera position and orientation. The discussion of remote objects (duh) gets hard when you can't share point of focus.
They use the results to design a communication system for remote 3D collaboration. Requirements:
They did an experiment with a face-to-face case and two remote cases. The two remote cases compared a fixed camera view with a SharedView camera. The SharedView camera decreased discussion of directions and orientations.
They make the point that the face-to-face view does not allow for many types of signals/communication that naturally occur. Further, the difficulty of adjusting the camera ensures that the camera view is relatively fixed. The really good point they are making, which they just barely mention, is that artifacts/information must be actively presented to colleagues.
They constructed two rooms, each with four cameras. The rooms had three cameras identically placed (face to face, side view of desk for context, desk view for shared documents); the fourth was different in the two rooms (bird's-eye view in one, and a dollhouse view in the other, see below). The first task was for each participant to draw the other's room. (This seems very contrived to me; this just cries out for a more "exploratory" view and manipulation technology.) The second task was to arrange some furniture in a dollhouse; the participants were secretly given different and conflicting goals for the task.
Their results showed that the face-to-face view was rarely used. Their analysis indicates that if people have a choice between face-to-face and views that give access to shared work objects, they'll choose the view that allows shared work. They mention that people used the face-to-face (for short periods of time) to seemingly assess each other's mood and engagement. They suggest that their data may under-emphasize the importance of glances. They also say their data shows that particular views were not predictably associated with any task, but rather varied based on the current task. Humans are clever at using tools.
They discovered some problems with using a multi-viewed system:
They use an icon/drag based control system. You drag an icon into a virtual office (on screen) and it connects the people inside the office.
They noted some technological obstacles:
They note that eye-gaze was very difficult to establish in the mediaspace, despite its importance. They mention work by some psychology guys about the use of eye-gaze in conversation and claim this work says that eye-gaze has at least five functions in conversation (regulate flow, provide feedback on perception, emotions, nature of relationship, reflect status relationships). They mention the EuroPARC (?) idea of video tunnels to help with the eye-gaze problem.
They talk about the fact that the status of the participants is not reflected in the mediaspace display and that their implementation may move people around after a break (arbitrary ordering). They claim this was highly disconcerting to the participants. I don't believe this; if people change seats in a meeting after a break, I can deal with it.
They make the usual observation that the lack of conversational cues causes more of a need for turn-taking and/or a moderator. They claim that video image size had an impact on the conversation. Participants with large images appeared to have more of an impact on the conversation. They bring in some work by psychologists on "social" distance and discuss the three levels of social distance (4-12 feet for strangers, 1.5-4 feet for friends, and <1.5 feet for intimate friends) in conversation. They relate the image size to this.
On the privacy side, they mention that you need to know when your office has someone in it, and you need a way to control it. They mention that the IIIF system was too complicated for easy use. They also reference Gaver & Smith's feedback via audio cues as a good idea for informing you of what is happening.
They are trying to develop metaphors for privacy/communication built on already existing (human) communication practices. (Shut one's door, wait to see someone who is busy talking to someone else, etc.) They are trying to build a visual language for manipulating the parameters of the system. They are also trying some automatic switching stuff based on who the current speaker is. This won't work if they have a 2-second switching delay.
They mention the video draw system, which was a similar setup but with monitors you could draw on and pictures of the hands. A big problem was that you were in the way of the camera, and if you weren't, it created a parallax problem.
Good point: this allows people to actually work in the same place on the whiteboard if they are at different sites; this is not possible without videowhiteboard, as your bodies get in the way. People using the system felt like the collaborator was on the other side of the screen; they say this is not a big problem and that VW evokes the correct cues from people when they collaborate. They mention that the collaborator is superimposed on the drawing surface, thus avoiding divisions of attention between speaker and marks that occur with a real whiteboard.
Problems: No eye contact. Shadows don't distinguish between people. There is no feedback, and so users can't tell if their (subtle) gestures or movements are perceived. They had problems because their resolution was only 330x240 and the optical alignment was worse towards the edges of the screen. Users can only erase their own marks.
They claim that at EuroPARC people tend to establish video contact before engaging in conversation. This can escalate as their attempts go unnoticed, until they are gesturing wildly in an attempt to gain the recipient's attention. Sometimes they just give up and buzz off, and other times they drop back to an audio connection and announce themselves.
They really discuss how the inability to perform gestures successfully is a problem for conversation. People use gestures to attract the listener's attention at points, as well as to give illustrations, etc. They speculate that since gaze is also difficult, people have only vocal cues to help them get co-participation. (They give as an example people giving up on utterances because the recipient was not giving the speaker the feedback that he should.)
They make a big deal out of the fact that a video monitor is only a small piece of your field of vision, and thus the peripheral gestures are going to be distorted. Similarly, the speaker's access to the recipient is very limited by the screen size.
They bring out some of the asymmetries of collaboration and try to justify why these might be good. (I thought these were wildly obvious.) They note that you can be connected with others but confine the disturbance to your mediaspace. They also note that it can be effective as a collaboration tool if you work at it (conversationwise).
He talks about everyday listening and how well-designed auditory icons can help you monitor activities and events and thus provide a basis for collaboration. He reviews the ARKola setup and task. The most interesting result here is that people divided the labor up between them, but used the audio icons to help them monitor what they could not see (or were not focussed on). They also noted that this reduces the risk of adventuring, as you will hear if things go wrong somewhere else. In total, he says that it reduces the difficulty of transition between division of labor and focussed collaboration.
He reviews the EAR work which is basically environmental sounds that give people reminders in their offices of events. The sounds are tailored to be unobtrusive and not come up quickly so as to startle. They use the Khronika event database to tell them when to play the sounds. They use it to support awareness of events that might normally not be perceived (e.g. meetings starting somewhere in the building). He mentions the beginnings of the audio cues that are used in the RAVE system: the creaking open door for someone looking in, the closing door for them leaving, etc.
In both cases the principle is that users can be aware of the sounds even if they visually were not attending to them.
As far as privacy in polyscope goes, the options are no information, short text message, manual video (captured when the user clicks), or automatic video (captured once a minute). For feedback they have none, names only (who's looking at you), and video (video images of observers). For symmetry they have yes and no: if you set it to yes, then they will not give out video to those not giving out their video. In actual use, 77% of the time the users used no feedback, and the symmetry was almost never on. Actual users commented that although symmetry was good in the abstract, in practice it didn't matter. They (correctly) note that the problem with polyscope was that the users were forced to make explicit choices about their accessibility and visibility. This yields what I call the "fluidity" problem with accessibility and privacy. It needs to be a continuous variable, and should be easily changeable. They also noted that symmetry is partially qualitative; the views from the cameras vary in terms of their information content. They claim that people feel more strongly about symmetry in the full-motion case; I don't think this is clear. They did say that their system didn't solve the symmetry problem and it may be because it just plain doesn't matter.
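A sketch of the privacy and symmetry settings as data, to make the rule concrete. The names and types here are assumptions, not polyscope's actual implementation, and the sketch only answers whether video is released; what a refused viewer sees instead is not stated in the notes.

```python
from dataclasses import dataclass
from enum import Enum


class Privacy(Enum):
    NONE = 0          # no information
    TEXT = 1          # short text message
    MANUAL_VIDEO = 2  # frame captured when the user clicks
    AUTO_VIDEO = 3    # frame captured once a minute


@dataclass
class User:
    name: str
    privacy: Privacy
    symmetry: bool = False  # "yes" means: no video for those who give none


def gives_video(user: User) -> bool:
    """True if this user publishes any kind of video image."""
    return user.privacy in (Privacy.MANUAL_VIDEO, Privacy.AUTO_VIDEO)


def may_see_video(owner: User, viewer: User) -> bool:
    """Decide whether viewer is allowed to see owner's video image."""
    if not gives_video(owner):
        return False
    if owner.symmetry and not gives_video(viewer):
        return False  # symmetry on, and the viewer offers no video in return
    return True
```

Note that every setting here is a discrete, explicit choice, which is exactly the "fluidity" complaint above.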
Vrooms just lays the same type of stuff onto a spatial metaphor; worse, the spatial metaphor breaks as you can be in multiple rooms at once. All the users in a room see the same view; the view includes images of all the users in the room. Rooms can be created/deleted on the fly by users. They claim that the symmetry constraint is automatically satisfied by the room metaphor (I don't think this is a solution, it's changing the problem). Nifty interaction hack: if you move your image near the image of someone else you get a heavy box around the two, and if you let go, you'll get a two-way audio-video connection to them.
They make a good observation that the view in a mediaspace like theirs is "god's-eye" and not embedded in yourself as it is normally. They ask whether this will break the ideas about social space that people normally have. They consider the possibility of "doors" and "hallways" (a la Cruiser) to extend their spatial metaphor. They suggest (rightly) that polyscope is a common vroom.
They basically try to create a virtual workspace, with a hallway metaphor. The hallway is a sequence of nodes in virtual space, which correspond to people's offices, common areas, etc. If you "encounter" someone in one of these areas, you can begin conversing. For convenience, the hallway is circular. They have three basic movements in the virtual hallway: jump, planned path, random walk. Jump is a video phone call. The other two are the ways to get to destinations in Cruiser; the first is when the user plans the route, the second is when the system generates one on the fly. They claim this is their mechanism for unplanned interaction. (Seems highly questionable to me... how do you know if you see someone? How long does a walk take (per node)?) They are forced to create new nodes whenever people are actually engaged in a conversation, so you can stumble in later (called conversation nodes). Note that as you pass an office, you see into the office, and the occupant sees you "out in the hallway," which is really an image from your camera. Cruisers are always announced. This seems very high on the interruption scale.
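A toy model of the three movements on a circular hallway. Everything here is an assumption for illustration: the class and method names, the idea that a random walk steps to an adjacent node each time, and the node names. It is not how Cruiser is actually implemented.

```python
import random


class Hallway:
    """A circular sequence of nodes (offices, common areas, and so on)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def jump(self, target):
        """Go straight to one node, like a video phone call."""
        if target not in self.nodes:
            raise ValueError(f"unknown node: {target}")
        return [target]

    def planned_path(self, route):
        """The user supplies the route; only check that every stop exists."""
        for stop in route:
            if stop not in self.nodes:
                raise ValueError(f"unknown node: {stop}")
        return list(route)

    def random_walk(self, start, steps, rng=None):
        """The system generates a route, one adjacent node at a time."""
        rng = rng or random.Random()
        i = self.nodes.index(start)
        path = [start]
        for _ in range(steps):
            # Step left or right; the modulo makes the hallway circular.
            i = (i + rng.choice((-1, 1))) % len(self.nodes)
            path.append(self.nodes[i])
        return path
```

Even this toy version raises the questions in the parenthetical above: nothing in it says how long you linger at a node or how you would notice who is there.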
They spend quite a while discussing interruption protocols in the real world. They suggest that there is an "availability" factor that we discern from the situation, the people present, and the snippets of audio we might overhear. They also mention that people use explicit signals (such as door closings) to make clear their availability. They move this into Cruiser with the notion of blinds. If the blinds are open the cruiser can see in; if they are closed he cannot. They also talk about half-open or partially-open blinds giving partial information. They (again) enforce reciprocity... if you are not allowing access to your video you can't get access to others'.
They spend the rest of the paper talking about the integration of the Cruiser system into the workplace, and it sounds very "Intermezzoish." They talk about things like Cruiser automatically generating random walks to the printer node if you print a document. They also talk about automatically generating cruises based on your current task. This would be something like the system restricting your "hallway" to include only those nodes which include your coauthors on a paper, setting your availability to high for your coauthors and low for everyone else, etc. They also talk about the reverse: if you cruise to someone, the system invokes the shared editor on the document you are working on. In general, all of this stuff is doable; it would just take an immense amount of work.
They use phenomenology (philosophical psychology) to create a paradigm called the workaday world. This is basically the idea that people's everyday, mundane relationships and resources (technology included) should be the basis for design. They highlight three aspects of the workaday world: technology, sociality, and work practice. These three are tightly linked, and can't be separated.
They make a wonderful argument that not only are the formal and the informal separate, but they flow fluidly back and forth. Although this idea is not new to them, they do highlight the important point that almost all CSCW systems built so far have addressed only one of these two. They then begin to try to highlight the workaday world paradigm as de-emphasizing technology (making it "mundane"). They also make sure to beat on the fact that it should not require attention. ("Invisibility of ubiquity and invisibility of non-attendance.")
They define four axes of social interaction (see RAVE paper also):