Three test sets were captured, including panoramic sequences shot
both with and without a tripod.
Two outdoor scenes and one indoor scene were captured so that the
dataset would include different lighting conditions and different
structural elements. For example, the indoor scene included
bookshelves, books, papers, and desks, all of which have straight
edges that are mostly horizontal or vertical. In contrast, the
outdoor scenes included trees, grass, and leaves, whose contours
are not straight. Of the two outdoor scenes, one had only
static elements, while the second included a moving car. This second
outdoor test set will be used later this week to test removal and
mosaicing of dynamic elements in a video sequence.
The first step in creating a video mosaic is
to correlate consecutive pairs of images. As described in the
proposal, a simple translation-only correlation is first conducted.
This is done by simply using a portion of the second image in the pair
as a template, which is then slid over the first image. Two
correlation techniques were tried: 1)
accumulating the squared difference between the image and the
template and averaging over the size of the template, and 2)
accumulating the product of the image and template values.
Using the first method, the best match occurs where the difference is
minimal. Using the second method, the best match occurs where the
accumulated value is largest. Neither method seemed to be markedly
better than the other. Two differencing methods were used:
differencing on the overall luminance value, and differencing on the
red, green, and blue channels separately. Several parameters can be
tuned, and the final implementation will most likely let the user
select them. The parameters are template size, correlation technique,
and shift heuristics.
The template size may be small or may be the same size as the
second image. The choice of template size affects the results as well
as the real-time capability of the system. More about the real-time
possibilities of this program will be discussed in the section on
Results So Far. The different correlation techniques were
described above. The shift heuristics allow the user to specify whether
the video sequence was a slow or fast pan by entering the most likely
range of pixel offsets by which the second image is shifted relative
to the first image in a pair of consecutive images. If the
sequence was a fast pan, the shift is most likely large; if the
pan was slow, the shift is most likely small.
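As an illustration, the two matching scores and the shift-range search
described above might be sketched in Python as follows. This is a
minimal sketch, not the project's code: the function names, the
list-of-rows image representation, and the single-channel (luminance)
input are assumptions made for the example, and the product score is
left unnormalized, as in the description above.

```python
def mean_squared_difference(image, template, x_offset):
    """Average squared difference with the template's left edge at x_offset."""
    total = 0.0
    count = 0
    for img_row, tpl_row in zip(image, template):
        for j, t in enumerate(tpl_row):
            d = img_row[x_offset + j] - t
            total += d * d
            count += 1
    return total / count


def product_sum(image, template, x_offset):
    """Accumulated product of image and template values (unnormalized)."""
    total = 0.0
    for img_row, tpl_row in zip(image, template):
        for j, t in enumerate(tpl_row):
            total += img_row[x_offset + j] * t
    return total


def best_horizontal_offset(image, template, min_x, max_x, method="ssd"):
    """Slide the template over columns min_x..max_x and return the best offset.

    min_x and max_x stand in for the shift heuristic (hypothetical names):
    a narrow range for a slow pan, a wider or larger one for a fast pan.
    """
    # Clamp the search range so the template always fits inside the image.
    last_valid = len(image[0]) - len(template[0])
    candidates = range(max(min_x, 0), min(max_x, last_valid) + 1)
    if len(candidates) == 0:
        raise ValueError("shift range leaves no valid template position")
    if method == "ssd":
        # Method 1: the best match has the smallest average squared difference.
        return min(candidates,
                   key=lambda x: mean_squared_difference(image, template, x))
    # Method 2: the best match has the largest accumulated product.
    return max(candidates, key=lambda x: product_sum(image, template, x))
```

The returned value is the column of the first image at which the
template's left edge fits best. Narrowing min_x and max_x reduces the
number of positions tested, which is one way the shift heuristic could
help interactivity.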
It is important to note that correlation was performed at one-half
the original height of the image. This is necessary because of
interlacing. Correlation on the full image size
generated very poor results because the interlaced rows produced
spurious errors, so these rows must be removed prior
to correlation.
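One simple way to obtain a half-height image is to keep a single
field, that is, every other row of the frame. The sketch below assumes
that this is how the interlaced rows are removed; the function name and
the parity argument are hypothetical.

```python
def keep_one_field(frame, parity=0):
    """Return every other row of the frame, starting at row `parity` (0 or 1)."""
    if parity not in (0, 1):
        raise ValueError("parity must be 0 or 1")
    # Slicing with a step of 2 drops the rows belonging to the other field.
    return frame[parity::2]
```

Applied to a frame with 244 rows, this yields 122 rows.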
The next step in correlation is to perform column-by-column
vertical correlation. The correlation implemented thus
far operates on multiple columns of the image at once and includes only
horizontal shifting. This initial correlation gives a rough estimate of where
the images lie, relative to each other. Next, each image will be
divided into its separate columns and vertical correlation between
columns will be performed.
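Since this step has not been implemented yet, the following is only a
sketch of what a per-column vertical search could look like, assuming
that the horizontal correlation has already paired each column of the
second image with a column of the first. The function name, the
max_shift parameter, and the use of the mean squared difference as the
score are all assumptions.

```python
def best_vertical_shift(col_a, col_b, max_shift):
    """Return the dy in [-max_shift, max_shift] that best aligns two columns.

    The score is the mean squared difference between col_a[i + dy] and
    col_b[i], taken over the rows where both columns overlap.
    """
    best_dy, best_err = 0, float("inf")
    n = len(col_a)
    for dy in range(-max_shift, max_shift + 1):
        total, count = 0.0, 0
        for i, b in enumerate(col_b):
            k = i + dy
            if 0 <= k < n:  # skip rows that fall outside col_a
                d = col_a[k] - b
                total += d * d
                count += 1
        if count and total / count < best_err:
            best_dy, best_err = dy, total / count
    return best_dy
```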
Correlation is used to determine how much the second image has
shifted relative to the first in a pair of consecutive images. The
pair of images can then be composited according to the shift.
Currently a simple composite is performed that lays the second image
on top of the first image, but shifted by the calculated value. Next,
a more sophisticated technique will be implemented in which the
columns closest to the center of an image will be displayed.
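The simple composite described above, in which the second image is
laid on top of the first at the calculated shift, could be sketched as
follows. The sketch assumes a non-negative, horizontal-only shift and
images of equal height; the function name is hypothetical.

```python
def composite_pair(first, second, shift):
    """Lay `second` over `first` with its left edge at column `shift`."""
    if not 0 <= shift <= len(first[0]):
        raise ValueError("shift must lie between 0 and the first image's width")
    mosaic = []
    for row_a, row_b in zip(first, second):
        # Keep the first image left of the shift, then the second image,
        # then whatever part of the first image extends past the second.
        mosaic.append(row_a[:shift] + row_b + row_a[shift + len(row_b):])
    return mosaic
```

For two images of equal width, the result is shift columns wider than
either input.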
The following are panoramic composites of
the indoor and outdoor scenes. These scenes include only static
elements.


The images show fairly distinct seams between composited images.
Vertical column correlation, together with a better compositing
technique, should reduce the seams. The video sequence used for the
top image was captured using a tripod, while the video sequence for
the bottom image was captured freehand. As a result, the top image has
much better vertical alignment than the bottom.
The above images were mosaiced using a template 50 pixels wide
and 122 pixels high. The original images are 360x244.
The program was highly interactive with a 50x122 template: the
video mosaic was produced within seconds. Even a 180x122 template
was fairly interactive.