Video Textures

Arno Schödl Georgia Institute of Technology
Richard Szeliski Microsoft Research
David Salesin Microsoft Research, University of Washington
Irfan Essa Georgia Institute of Technology

In the News: Check out the article (2.7 MB) in the October 2000 issue of Computer Graphics World about this paper.

The video clips on this web page show the results described in our SIGGRAPH 2000 paper. They appear in the same order as in the paper. Your browser must have an MPEG file association to play them. Click on the images and the links in the text to play the videos. If your video player supports a loop mode (e.g., "Repeat forever"), try using it, since many of the clips are cyclified (i.e., they start over seamlessly at the end).

This is the submission video, which is also contained on the paper submission DVD.

short loop, first order

short loop, second order

long loop

random play texture

Candle flame. A 33-second video of a candle flame was turned into four different video textures: one random play texture; and three different video loops, each containing three different primitive loops. One of the video loops repeats every 426 frames. The other two repeat every 241 frames; these each use the same set of three primitive loops, scheduled in a different order. One of those shorter loops is shown on the left, the random play texture on the right. The position of the frame currently being displayed in the original video clip is denoted by the red bar. The red curves show the possible transitions from one frame in the original video clip to another.
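The random play texture above can be sketched as a simple Markov process over frames: a jump from frame i to frame j is likely when frame j looks like frame i's natural successor, frame i+1. The code below is a minimal illustration of that idea, not the paper's implementation; the function names (`transition_probs`, `random_play`), the raw L2 frame distance, and the choice of sigma are all assumptions made for this sketch.

```python
import numpy as np

def transition_probs(frames, sigma=1.0):
    """Map frame-to-frame L2 distances to transition probabilities.

    P[i, j] ~ exp(-D[i+1, j] / sigma): jumping from frame i to frame j
    is probable when frame j resembles frame i+1, the natural successor.
    """
    n = len(frames)
    flat = frames.reshape(n, -1).astype(float)
    # D[i, j] = L2 distance between frame i and frame j
    D = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=2)
    # compare each candidate frame j against the successor frame i+1
    P = np.exp(-D[1:] / sigma)          # shape (n-1, n)
    P /= P.sum(axis=1, keepdims=True)   # each row is a distribution
    return P

def random_play(P, start=0, length=20, rng=None):
    """Sample a frame sequence by repeatedly drawing the next frame."""
    rng = rng or np.random.default_rng(0)
    seq = [start]
    for _ in range(length - 1):
        i = seq[-1]
        if i >= len(P):        # past the last row: reuse the final row
            i = len(P) - 1
        seq.append(int(rng.choice(len(P[0]), p=P[i])))
    return seq
```

In this sketch the red transition curves in the figure correspond to entries of P with non-negligible probability.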

Original sequence, clock only.

  • This sequence processed without and with preservation of dynamics.

Original sequence with hand coming in.
(uncompressed AVI version, 16 MB)

  • This sequence processed without future cost computation and with future cost computation.

Clock. This example shows the necessity for both the preservation of dynamics and the future cost computation. The input video sequence shows a clock with a swinging pendulum. Without considering dynamics, a forward-swinging pendulum is likely to match equally well with a backward-swinging frame, causing unnatural jumps in the motion. Adding in the temporal filtering solves this problem. At the end of the input video, a hand moves into the frame. Without the future cost computation, the video texture will reach a dead end, from which no transition to the earlier video will work without a visual jump. The future cost computation solves this problem by increasing the probability of a transition before the hand comes into frame.
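The future cost computation described above can be sketched as a fixed-point iteration in the spirit of the paper's formulation: the anticipated cost of a transition into frame j includes, discounted by a factor alpha, the cheapest cost reachable after landing on j, so dead ends (like the frames where the hand enters) become expensive and the texture jumps away before reaching them. The function name `future_cost` and the iteration count are assumptions of this sketch.

```python
import numpy as np

def future_cost(D, alpha=0.99, p=1.0, iters=200):
    """Propagate anticipated future cost backward through the transitions.

    Iterates F[i, j] = D[i, j]**p + alpha * min_k F[j, k]: a transition
    i -> j is penalized by its own cost plus the best continuation cost
    available from frame j. Frames near a dead end accumulate high cost.
    """
    Dp = D ** p
    F = Dp.copy()
    for _ in range(iters):
        m = F.min(axis=1)             # best continuation from each frame j
        F_new = Dp + alpha * m[None, :]
        if np.allclose(F_new, F):
            break
        F = F_new
    return F
```

With alpha close to 1, the penalty for steering toward a dead end propagates far back in time, which is exactly what raises the probability of a transition before the hand comes into frame.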

Flag. A 38-second video of a flying flag was cyclified using the lowest average cost loop contained in the video. Video textures were created using no fading, cross-fading, and morphing. Cross-fading improves the quality of the transition, at the cost of a small amount of blurring. Morphing works even better at removing the jump without introducing blur, even though the chosen alignment is one stripe off from the geometrically correct one. This misalignment, which causes a fold to magically disappear during the transition, is almost invisible to the unsuspecting observer.
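Cross-fading at a transition can be sketched as a linear blend over a few frames on either side of the jump. This is a minimal illustration, not the paper's multi-way algorithm; the function name `crossfade` and the linear weighting schedule are assumptions of this sketch.

```python
import numpy as np

def crossfade(seq_a, seq_b, overlap=4):
    """Blend the last `overlap` frames of seq_a with the first `overlap`
    frames of seq_b, linearly shifting weight from seq_a to seq_b so the
    jump at the transition is smeared over several frames."""
    # fade weights strictly between 0 and 1, e.g. [0.2, 0.4, 0.6, 0.8]
    w = np.linspace(0.0, 1.0, overlap + 2)[1:-1]
    blended = [(1.0 - wi) * a + wi * b
               for wi, a, b in zip(w, seq_a[-overlap:], seq_b[:overlap])]
    return list(seq_a[:-overlap]) + blended + list(seq_b[overlap:])
```

The blur mentioned above is visible in this sketch too: during the overlap, each output frame is an average of two source frames, so any misaligned detail is softened rather than moved.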

Campfire. A 10-second video of a campfire was cyclified using a single transition. The transition is hardly visible without cross-fading, but cross-fading over four frames hides it entirely. Although the configuration of the flames never repeats even approximately, the transition is well hidden by the high temporal flicker.

Portrait. A 25-second video of a woman posing for a portrait was turned into a random-play video texture with 20 transitions. Although the frames across the transitions are already quite similar, the morphing performs a subtle alignment of details, such as the eye positions, which hides the transitions almost entirely. Such video textures could be useful in replacing the static portraits that often appear on web pages.

Waterfall. This example of a waterfall works less well. The original 5-minute video sequence never repeats itself, and yet, unlike the campfire, there is a great deal of continuity between the frames, making it difficult to find any unnoticeable transitions. Our best result was obtained by selecting a 6-second source clip, and using cross-fading with frequent transitions so that averaging is always performed across multiple subsequences at once. Although the resulting video texture is blurrier than the original video clip, the resulting imagery is still fairly convincing as a waterfall.

Blowing grass. Here is another example that does not work well as a video texture. Like the waterfall sequence, the original 43-second video of blowing grass never repeats itself. Unlike the waterfall sequence, blurring several frames together does not produce acceptable results. Our automatic morphing also fails to find accurate correspondences in the video frames. The best we could do was to cross-fade the transitions (using a 4-second clip as the source), which creates occasional (and objectionable) blurring as the video texture is played.

Sound synthesis. Adding sound to video textures is relatively straightforward. We simply take the sound samples associated with each frame and play them back with the video frames selected to be rendered. To mask any popping effects, we use the same multi-way cross-fading algorithm described in Section 4. The resulting sound tracks, at least in the videos for which we have tried this, Waterfall and Campfire, sound very natural.

3D Portrait. Video textures can be combined with traditional image-based rendering algorithms such as view interpolation. We created a three-dimensional video texture from three videos of a smiling woman, taken simultaneously from three different viewing angles about 20 degrees apart. We used the center camera to extract and synthesize the video texture, and the first still from each camera to estimate a 3D depth map. We then masked out the background using background subtraction (a clear shot of the background was taken before filming began). To generate each new frame in the 3D video animation, we mapped a portion of the video texture onto the 3D surface, rendered it from a novel viewpoint, and then combined it with the flat image of the background warped to the correct location.

Swings. In this example, the video of two children on swings is manually divided into two halves: one for each swing. These parts are analyzed and synthesized independently, then recombined into the final video texture. The overall video texture is significantly superior to the best video texture that could be generated using the entire video frame.

Balloons sequence

Standard deviation of color values over time

Balloons. For this example, we developed an automatic segmentation algorithm that separates the original video stream into regions that move independently. We first compute the variance of each pixel across time, threshold this image to obtain connected regions of motion, and use connected component labeling followed by a morphological dilation to obtain the five region labels (shown as color regions in this still). The independent regions are then analyzed and synthesized separately, and then recombined using feathering.
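The segmentation pipeline above (per-pixel temporal variance, thresholding, connected component labeling, morphological dilation) can be sketched in a few lines. This version is a simplified stand-in: it uses a hand-rolled 4-connected flood fill instead of a library labeling routine, applies no dilation step, and the function names (`label_regions`, `segment_moving_regions`) and the threshold are assumptions of this sketch.

```python
import numpy as np

def label_regions(mask):
    """4-connected component labeling of a boolean mask via flood fill."""
    labels = np.zeros(mask.shape, dtype=int)
    current = 0
    for y, x in zip(*np.nonzero(mask)):
        if labels[y, x]:
            continue                       # already part of a region
        current += 1
        stack = [(y, x)]
        while stack:
            cy, cx = stack.pop()
            if not (0 <= cy < mask.shape[0] and 0 <= cx < mask.shape[1]):
                continue
            if not mask[cy, cx] or labels[cy, cx]:
                continue
            labels[cy, cx] = current
            stack += [(cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)]
    return labels, current

def segment_moving_regions(video, thresh):
    """Threshold per-pixel std-dev over time, then label the motion regions.

    video: array of shape (T, H, W); static pixels have near-zero std-dev.
    """
    std = video.std(axis=0)
    return label_regions(std > thresh)
```

Each labeled region would then be analyzed and synthesized as its own video texture and recombined with feathering, as described above.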

Fish. Motion factorization can be further extended to extract independent video sprites from a video sequence. We used background subtraction to create a video sprite of a fish, starting from 5 minutes of video of a fish in a tank. Unfortunately, fish are uncooperative test subjects who frequently visit the walls of the fish tank, where they are hard to extract from the scene because of reflections in the glass. We therefore used as source material only those pieces of video where the fish is swimming freely.

Runner. Instead of using visual smoothness as the only criterion for generating video, we can also add some user-controlled terms to the error function in order to influence the selection of frames. The simplest form of such user control is to interactively select the set of frames S in the sequence that are used for synthesis. We took 3 minutes of video of a runner on a treadmill, starting at a slow jog and then gradually speeding up to a fast run. As the user moves a slider selecting a certain region of the video (the black region of the slider in the figure), the synthesis attempts to select frames that remain within that region, while at the same time using only fairly smooth transitions to jump forward or backward in time. The user can therefore control the speed of the runner by moving the slider back and forth, and the runner makes natural-looking transitions between the different gaits.
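The user-controlled selection above can be sketched by adding a penalty term to the transition cost for frames outside the slider's preferred range, then sampling the next frame from the combined cost. The function name `controlled_step`, the linear out-of-range penalty, and the weight `w` are assumptions of this sketch, not the paper's exact error function.

```python
import numpy as np

def controlled_step(i, D, lo, hi, w=50.0, rng=None):
    """Pick the next frame from frame i.

    Combines the visual transition cost D[i, j] with a user cost that
    grows linearly with the distance of j outside the preferred window
    [lo, hi] (the black region of the slider), and samples a frame with
    probability proportional to exp(-cost).
    """
    n = D.shape[1]
    j = np.arange(n)
    # zero inside [lo, hi], linear penalty outside it
    user = np.maximum(0, np.maximum(lo - j, j - hi)).astype(float)
    cost = D[i] + w * user
    p = np.exp(-cost)
    p /= p.sum()
    rng = rng or np.random.default_rng(0)
    return int(rng.choice(n, p=p))
```

Sliding the window [lo, hi] back and forth would steer the synthesis between the slow-jog and fast-run portions of the sequence, while the D term still favors visually smooth jumps.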

Watering can. As another example, we took a 15-second clip of a watering can pouring water into a birdbath. The central portion of this video, which shows the water pouring as a continuous stream, makes a very good video texture. We can therefore shorten or extend the pouring sequence by using the same technique as we did for the runner, only advancing the slider automatically at a faster or slower speed. Thus, the same mechanism can be used to achieve a natural-looking time compression or dilation in a video sequence.

Mouse-controlled fish. Instead of directly specifying a preferred range of frames, we can select frames based on other criteria. For example, in order to interactively guide the path of the fish presented earlier with a mouse, we could give preference to frames in which the fish's video sprite has a certain desired velocity vector.

Fish Tank. The final example we show is a complete fish tank, populated with artificial fish sprites. The tank includes two sets of bubbles, two independently swaying plants, and a small number of independently moving fish. The fish have been scripted to follow a path (here, the SIGGRAPH "2000" logo), using the same techniques described for the mouse-controlled fish.