The goal of video mosaicing is to produce a panoramic view from a
sequence of images. The motivation behind this is that individual
pictures or frames in a video sequence have a very limited field of
view. Human vision, by contrast, covers a much larger field of view. Hence, it
seems natural to paste together a series of limited-view images to
create one image with a field of view more similar to our own.
The system that has been implemented consists of two parts. The
first part of the system is general video mosaicing which allows a
user to input a series of image sequences to produce one panoramic
image. Parameters which a user may tweak include the size of the
template for correlation, the type of matching that is performed, and
the type of compositing that is performed. The second part of the
system attempts to remove dynamic objects from the image sequence in
order to recover the static background. The input parameter is a
difference tolerance which specifies how different a pixel must be
from one image to the next in order to be considered dynamic. Details
about correlation, compositing, and dynamic element removal are
described in the corresponding sections of this report.
The system is fast enough to generate the panoramic image in real
time as the video file is read.
The first step in creating a video mosaic is to correlate pairs of
consecutive images. As described in the proposal and progress report,
translation-only correlation is first performed by using a portion of
the second image in a pair as a template. The template is then slid
over the first image. Correlation is done by accumulating the square
of the difference between the image and the template and averaging
over the size of the template. Accumulating the product of the image
and template values instead was found not to give better correlation.
The size of the template, and whether correlation uses luminance only
or red, green, and blue pixel intensities, were left as parameters to
be chosen by the user. This allows a user to test out different
template sizes since a larger template may result in a better mosaic
for some sequences while in others, a smaller template is
sufficient.
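
The correlation score described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the function names, the list-of-rows image layout, and the Rec. 601 luminance weights are all assumptions made for the example, as is the choice to average the RGB case over every channel sample.

```python
def luminance(pixel):
    """Convert an (r, g, b) pixel to one luminance value.

    The Rec. 601 weights used here are an assumption; the report does
    not say how luminance was computed.
    """
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def mean_squared_difference(image, template, x0, y0, use_luminance=True):
    """Average squared difference between `template` and the region of
    `image` whose top-left corner is at column x0, row y0.

    Both images are lists of rows; each pixel is an (r, g, b) tuple.
    A lower score means a better match.
    """
    total = 0.0
    count = 0
    for ty, template_row in enumerate(template):
        for tx, t_pixel in enumerate(template_row):
            i_pixel = image[y0 + ty][x0 + tx]
            if use_luminance:
                diff = luminance(i_pixel) - luminance(t_pixel)
                total += diff * diff
                count += 1
            else:
                # Compare red, green, and blue separately and average
                # over every channel sample (an assumption).
                for a, b in zip(i_pixel, t_pixel):
                    total += (a - b) ** 2
                    count += 1
    return total / count
```

Sliding the template means evaluating this score at each candidate position and keeping the position with the lowest value.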
Correlation is separated into two tasks: horizontal correlation
using the template and vertical correlation using a single column. To
improve performance, correlation includes a heuristic on the
horizontal shift from one image to the next: each image is assumed
unlikely to be shifted by more than a quarter of the width from the
previous image. This heuristic did not result in errors because the
input is always a video stream, which has a great deal of coherence
from one image to the next.
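
A sketch of how the quarter-width heuristic bounds the horizontal search is given below. It assumes grayscale images stored as lists of rows, a camera that pans so the second image is shifted to the right, and a template cut from the left edge of the second image. The report does not state which portion of the image was used, so these choices and the function names are hypothetical.

```python
def msd(first, template, x0):
    """Mean squared difference with the template's left edge at column x0."""
    total = 0.0
    count = 0
    for row_first, row_template in zip(first, template):
        for tx, t in enumerate(row_template):
            d = row_first[x0 + tx] - t
            total += d * d
            count += 1
    return total / count


def find_horizontal_shift(first, second, template_width, max_fraction=0.25):
    """Return the horizontal shift of `second` relative to `first`.

    Only shifts up to `max_fraction` of the image width are tried,
    which is the quarter-width heuristic from the text.
    """
    width = len(first[0])
    # Assumed: the template is the leftmost strip of the second image.
    template = [row[:template_width] for row in second]
    # Never slide the template past the right edge of the first image.
    max_shift = min(int(width * max_fraction), width - template_width)
    best_shift, best_score = 0, float("inf")
    for shift in range(max_shift + 1):
        score = msd(first, template, shift)
        if score < best_score:
            best_shift, best_score = shift, score
    return best_shift
```

For a 16-column image, the bound limits the search to shifts 0 through 4 rather than every position at which the template fits.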
The mosaicing algorithm is based on the paper "Panoramic Mosaics
by Manifold Projection" by Peleg et al. Peleg performed
vertical correlation on a column-by-column basis. However, it was not
clear in the paper whether correlation was performed for each of the
columns in the image. Separating an image into columns is a more
intensive approach, and it seemed unnecessary because we already have
vertical alignment within an image. Hence, vertical column
correlation is performed only at the seam between consecutive images.
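
One way to picture vertical correlation on a single seam column is the sketch below. It assumes grayscale images as lists of rows and reuses the mean-squared-difference score for the column comparison; the report does not specify the score or the search range for this step, so `find_vertical_shift` and `max_shift` are hypothetical.

```python
def column(image, x):
    """Extract column x of an image stored as a list of rows."""
    return [row[x] for row in image]


def find_vertical_shift(col_first, col_second, max_shift):
    """Return dy such that col_second[y] best matches col_first[y + dy].

    Only rows where both columns overlap contribute to the score, and
    ties are resolved in favour of the smallest absolute shift.
    """
    height = len(col_first)
    best_dy, best_score = 0, float("inf")
    # Try 0 first, then -1, 1, -2, 2, ... so small shifts win ties.
    for dy in sorted(range(-max_shift, max_shift + 1), key=abs):
        total, count = 0.0, 0
        for y in range(height):
            if 0 <= y + dy < height:
                d = col_first[y + dy] - col_second[y]
                total += d * d
                count += 1
        if count and total / count < best_score:
            best_dy, best_score = dy, total / count
    return best_dy
```

The two columns would be taken with `column(first, seam_x)` and `column(second, seam_x - shift)`, where `seam_x` and `shift` are illustrative names for the seam position and the horizontal shift found earlier.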
Experimentation revealed that vertical correlation was not necessary
for the most part. There is little vertical deviation in sweeping the
camera in a circular motion. The additional vertical correlation
results in more computation, and so was left as an option which the
user may or may not use. An example of vertical alignment is shown
below.

Correlation is used to determine how much the second image has shifted relative to the first in a pair of consecutive images. The pair of images can then be composited according to the shift. Three types of compositing have been implemented in the system: simple overlay, seam calculation, and seam calculation with averaging.
Simple overlay pastes the second image on top of the first image,
but shifted by the calculated value. Seam calculation takes into
consideration the fact that alignment is usually better at the center
than at the edges of images and that distortion is minimal at the
center. The seam is calculated as the column that is equidistant to
the centers of the pair of images being composited. Averaging can be
done on the region around the seam. The number of columns to average
around the seam is left as a parameter which can be specified by the
user. This parameter is basically the size of the averaging kernel.
The default is no averaging. A comparison of the three types of
compositing is shown below in the following order: Simple Overlay,
Seam Calculation, Seam Calculation with Averaging. Notice how the
seams between consecutive images disappear as more sophisticated
compositing schemes are used.
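
Separately from the comparison images, the three compositing schemes can be sketched in code. This is an illustration under stated assumptions: images are grayscale lists of rows of equal size, the shift is horizontal only with 0 <= shift < width, and "averaging" is taken to mean the mean of the two overlapping images within a band of `blend_width` columns around the seam. The report does not spell out the averaging kernel, so that interpretation and all names are hypothetical.

```python
def seam_column(width, shift):
    """Mosaic column equidistant from the centers of two images of equal
    width, where the second is offset `shift` columns to the right."""
    center_first = width / 2
    center_second = shift + width / 2
    return int((center_first + center_second) / 2)


def composite(first, second, shift, mode="seam", blend_width=0):
    """Composite two images; mode is "overlay" or "seam".

    blend_width=0 reproduces the default of no averaging.
    """
    width = len(first[0])
    seam = seam_column(width, shift)
    band_start = seam - blend_width // 2
    mosaic = []
    for row_a, row_b in zip(first, second):
        out = []
        for x in range(shift + width):
            in_a = x < width    # column is covered by the first image
            in_b = x >= shift   # column is covered by the second image
            if not in_b:
                out.append(row_a[x])
            elif not in_a:
                out.append(row_b[x - shift])
            elif mode == "overlay":
                # Second image is pasted on top wherever it exists.
                out.append(row_b[x - shift])
            elif blend_width and band_start <= x < band_start + blend_width:
                # Average the two images in a band around the seam.
                out.append((row_a[x] + row_b[x - shift]) / 2)
            else:
                # First image left of the seam, second image from the seam on.
                out.append(row_a[x] if x < seam else row_b[x - shift])
        mosaic.append(out)
    return mosaic
```

With two 8-column images and a shift of 4, the centers sit at columns 4 and 8 of the mosaic, so the seam falls at column 6, in the middle of the overlap.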



All the above examples show a static scene. What happens when
elements in the video are moving, such as people and cars? This
question was the basis of the second part of the system. Simply
mosaicing a video that includes moving elements would result in an
image that has copies of the dynamic element. An example of such an
image is shown:

In order to create a correct panorama of the static background, the
dynamic objects in the video must be removed. This is accomplished by
taking the difference between pairs of consecutive images. The
magnitude of this difference is nonzero at every pixel position where
there was movement. A tolerance level on the pixel difference can be
specified by the user. This tolerance level is the difference threshold
above which pixels are considered to be in motion. Once the pixel positions of
the moving elements are known, these pixels can be removed entirely
from the image to be composited. The parts of the background that
were occluded by the moving object in one image may be revealed in
another image in the video sequence. Hence, by having a sequence of
images, the static background can be mostly recovered. Note that the
compositing step is done only after the dynamic elements have been
removed. The following images are examples of recovered background.
The speckles of black or colored pixels which resemble a motion blur
belong to the moving object. These are the parts of the static
background which could not be recovered.
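
The removal step can be sketched as below. The example assumes grayscale frames that have already been aligned to a common coordinate system and share the same size, which sidesteps the correlation step. The function names and the rule of keeping the first static value seen at each position are assumptions, not the system's documented behaviour.

```python
def dynamic_mask(previous, current, tolerance):
    """True where the pixel changed by more than `tolerance` between
    two aligned frames, i.e. where it is considered to be in motion."""
    return [
        [abs(c - p) > tolerance for p, c in zip(row_p, row_c)]
        for row_p, row_c in zip(previous, current)
    ]


def recover_background(frames, tolerance):
    """Build a background image from pixels that were not in motion.

    Positions that are flagged as dynamic in every frame pair stay
    None; these correspond to background that could not be recovered.
    """
    height, width = len(frames[0]), len(frames[0][0])
    background = [[None] * width for _ in range(height)]
    for previous, current in zip(frames, frames[1:]):
        mask = dynamic_mask(previous, current, tolerance)
        for y in range(height):
            for x in range(width):
                # Keep the first static value seen at each position.
                if background[y][x] is None and not mask[y][x]:
                    background[y][x] = current[y][x]
    return background
```

Because the mask only reacts to change, the interior of a slowly moving, uniformly colored object can pass as static, which is one plausible source of the leftover speckles.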





