In this assignment you will implement a function to compute the costs involved in stereo matching. This is the first part of a three-part stereo assignment. Code produced during this assignment will be used in future ones (solutions will also be provided). You may find the supplemental readings on stereo listed on the course web site to be helpful, but they are not required. In this problem set it may be helpful to use some of the image processing functions in the OpenCV library.
Grading: This is the graduate level version. I expect most students in 7495 will do this one. It contains some additional questions about the complexity of the computation and includes a simple example of convolution, an important image processing concept. The answers from the students in 7495 will determine the mean for this assignment track. Any students from 4495 who do this version of the assignment will be graded on a more forgiving curve. The majority of points will go towards implementing and testing the function that computes the SSD.
1. A fundamental issue in stereo matching is selecting the matching cost function which measures the similarity between two N by N windows of pixels (where for convenience, N is an odd number). A popular choice is the normalized correlation cost function NC(IL,IR), which can be viewed as an inner product between two vectors of pixels, IL and IR. Prove that if IL and IR are two image patches related by IR = a * IL + b for any two scalar constants a and b, then NC(IL,IR) = 1.
2. Consider two corresponding scanlines in a left and right stereo image pair. Imagine a very simple stereo algorithm in which each window on the left scanline (centered at each pixel location on the scanline) is matched to a window on the right scanline independently, by computing the match scores for D different discrete values of disparity and then picking the best match. (Unlike the dynamic programming approach, this method does not enforce consistency across the set of matches). For simplicity, assume that each pixel from the left scanline is matched (i.e. occlusions are not allowed) and ignore any boundary conditions at the edges of the images. What is the computational complexity of matching the entire image with the above method using NC, as a function of the window size N, the number of disparity levels D, and the width M and height H of the image ? (Your answer will indicate how the amount of computation varies with N, D, M and H. You should ignore boundary issues completely, assume for your analysis that an SSD value is computed for all pixels in the image, even though in practice you will not actually compute values at the borders of the image.)
3. An alternative to NC is a raw (un-normalized) SSD cost function, which computes the sum of squared intensity differences between two windows using the original image pixels. One potential advantage of this cost function is that it can be computed very efficiently. In this case, we can divide the matching cost computation into two phases: computation and aggregation.
In the computation phase, a set of squared difference images is computed by shifting the entire right image by a constant disparity level d and computing, for each pixel individually, the squared intensity difference with respect to the left image. This gives a set of D squared difference images: SDIFF(x, y, d) = [IL(x, y) - IR(x - d, y)] ^2.
In the aggregation phase, each squared difference image is convolved with an N by N box filter, yielding at each location x, y the sum of squared differences of pixels within an N by N neighborhood. Therefore: SSD(x, y, d) = SDIFF(x, y, d) ** BN(x, y), where BN(x, y) is the box filter centered at x, y (which is equal to 1 inside an N by N square and 0 outside), and '**' denotes convolution. Intuitively, what this step is doing is constructing the SSD value at each pixel by summing up all of the squared differences inside a window (or mask) which is centered at that pixel location. The convolution operation captures the process of sweeping a filter through all of the positions that it can occupy within an image and outputting, for each position, the result of multiplying the filter by the image and adding up the resulting products.
Because of the special properties of the box filter, it is possible to implement this operation in a very intuitive and efficient fashion that does not require any detailed understanding of convolution. The box filter convolution is separable, and can be implemented very efficiently as two separate moving averages in x and y. Consider aggregating in the x direction first: Given the sum at (x, y) based on an interval of N pixels, the sum at (x + 1, y) can be obtained by subtracting the leftmost old pixel and adding the rightmost new pixel. In other words, SSDX(x + 1, y, d) = SSDX(x, y, d) - SDIFF(x - n, y, d) + SDIFF(x + n + 1, y, d), where n = (N - 1)/2. This computation is a moving average of the columns of SDIFF. The final SSD image can be obtained in a similar manner from SSDX by aggregating in the y direction (i.e. computing a moving average of its rows).
Consider the following example, where we have an SDIFF image given by
|
1 |
2 |
3 |
4 |
5 |
|
2 |
3 |
4 |
5 |
6 |
|
3 |
4 |
5 |
6 |
7 |
|
4 |
5 |
6 |
7 |
8 |
|
5 |
6 |
7 |
8 |
9 |
and a 3 by 3 pixel box filter
|
1 |
1 |
1 |
|
1 |
1 |
1 |
|
1 |
1 |
1 |
.
The resulting SSDX image would be
|
B |
6 |
9 |
12 |
B |
|
B |
9 |
12 |
15 |
B |
|
B |
12 |
15 |
18 |
B |
|
B |
15 |
18 |
21 |
B |
|
B |
18 |
21 |
24 |
B |
in which B denotes unknown border pixels whose values cannot be computed.
(a) What would be the final SSD image computed from the SSDX image above? Complete the table below with your answer (some of the values have already been provided)
|
B |
B |
B |
||
|
B |
27 |
36 |
||
|
B |
36 |
|||
(b) What is the computational complexity of image matching in this case (analogous to (2) above)? Your answer will indicate how the amount of computation varies with N, D, M and H. You should ignore boundary issues as before. Compare and contrast the NC and un-normalized SSD cost functions.
4. Write a function compute_sdiff that generates the set of SDIFF images described above. It should take a stereo pair of images and a range of disparity values as its input. Write a function compute_ssd that takes as its input the set of SDIFF images and a window size, and outputs a set of SSD images. It's easiest if these are all floating point images so there are no issues in representing the full range of possible SSD values. The OpenCV image class supports the creation of these types of multi-band, floating point images and provides some useful image processing functions.
You will have to address the issue of boundary conditions (i.e. near the image boundary the window would extend outside the image). This can be a problem particularly when searching over large values of disparity. The easiest solution is to limit the range of x, y values you process, effectively truncating the output image. The output image will then contain a border of pixels for which no answer was computed. Set these to some default value (e.g. 0).
5. Write a routine to display an image in a window by first mapping it into a range of grey scale values. The routine should work for pixel values that are floating point, negative, etc. You will need this to display SSD images and disparity maps so that you can visualize them. Test your display code on a known image and then use it to visualize SSD images and test your code from part 4. You may want to use the test pair: [test-left.bmp, test-right.bmp]. Correct disparity images are provided for d=5 and d=15 for testing purposes.
6. You will run your SSD code on the following two stereo pairs: [sawtooth-left.bmp, sawtooth-right.bmp] and [tsukuba-left.bmp, tsukuba-right.bmp]. The sawtooth image is a "synthetic" image, while the Tsukuba image is a staged scene from Prof. Ohta's lab at U. Tsukuba in Japan. (Both of these pairs were taken from the Middlebury Stereo Vision Page and are described in the paper by Scharstein and Szeliski). In testing and debugging your code on these images, you may find it useful to start with a pair of reduced size images, so your code takes less time to run. [sawtooth-left-small.bmp, sawtooth-right-small.bmp] are provided for this purpose. (Use the original images for your final submission however).
(a) Identify the approximate range of disparity values that is present in each of these two stereo pairs. (The easiest way to do this is to open them in an image editing program that displays the position of the mouse in image coordinates, and examine corresponding features manually.)
(b) Using a window size of 11 (N = 11), compute an SSD image for each of the sawtooth and Tsukuba pairs. Use disparity 14 for sawtooth and disparity 4 for Tsukuba. Save both SSD images as gray-scale .bmp files (you can use your display code to convert the images to gray-scale). Submit these two files along with executable for your program (The executable should compute and display the sawtooth SSD output.)
Instructions for submitting part 1: Make a zip file with the suffix '-track1' containing your source code, executable, and the two SSD images from question 6(b), encoded as gray-scale .bmp image files. You will also need to answer the questions above. This part does not need to be submitted electronically. Hand-written solutions are fine as long as they are legible. Follow the instructions on the course web page for submitting the zip file.