Problem Set 2 - Stereo Matching

CS 4495/7495
Computer Vision

Due by midnight Sept. 24, 2002.

In this assignment you will implement and test a stereo matching algorithm based on dynamic programming (DP) and the SSD cost function. You may find the supplemental readings on stereo listed on the course web site to be helpful, but they are not required. In this problem set it may be helpful to use some of the image processing functions in the OpenCV library.

Grading: There are two parts which have equal weight and I expect them each to take about a week to complete. You have the option of turning in your answers for part 1 early (by Sept. 17 or later). If you choose to do this,  we will give you solution code for part 1 which you can build on in doing part 2 (and we will grade your part 1 earlier). I encourage everyone to do this, but it is not required. 

1. Matching Cost Computation

A fundamental issue in stereo matching is selecting the matching cost function which measures the similarity between two N by N windows of pixels. We desire a robust measure which can be implemented efficiently.

(a) The normalized correlation cost function NC(IL,IR) can be viewed as an inner product between two vectors of pixels. Prove that if IL and IR are two image patches related by IR = a * IL + b for any constants a and b, then NC(IL,IR) = 1.

(b) Imagine a very simple stereo algorithm in which each window in the left image (centered at each pixel on the left scanline) is matched to the right scanline independently, by computing the match scores for D different discrete values of disparity and then picking the best match. (Unlike the DP approach, this method does not enforce consistency across the set of matches). For simplicity, assume that each pixel is matched (ie occlusions are not allowed) and ignore any boundary conditions at the edges of the images. What is the complexity of the total computational cost for matching a scanline using NC, as a function of the window size N, the number of disparity levels D, and the width of the image M?

(c) An alternative to NC is a raw (un-normalized) SSD cost function, which computes the sum of squared intensity differences between two windows using the original image pixels. One potential advantage of this cost function is that it can be computed very efficiently. In this case, we can divide the matching cost computation into two phases: computation and aggregation

In the computation phase, a set of squared difference images is computed by shifting the entire right image by a constant disparity level d and computing, for each pixel individually, the squared intensity difference with respect to the left image. This gives a set of D squared difference images: SDIFF(x, y, d) = [IL(x, y) - IR(x - d, y)]^2. 

In the aggregation phase, each squared difference image is convolved with an N by N box filter, yielding at each location x, y the sum of squared differences of pixels within an N by N neighborhood. Therefore: SSD(x, y, d) = SDIFF(x, y, d) ** BN(x, y), where BN(x, y) is the box filter centered at x, y (which is equal to 1 inside an N by N square and 0 outside), and '**' denotes convolution. Intuitively, what this step is doing is constructing the SSD value at each pixel by summing up all of the squared differences inside a window (or mask) which is centered at that pixel location. The convolution operation captures the process of sweeping a filter through all of the positions that it can occupy within an image and outputting, for each position, the result of multiplying the filter by the image and adding up the resulting products. 

Because of the special properties of the box filter, it is possible to implement this operation in a very intuitive fashion that does not require any detailed understanding of convolution. The box filter convolution is separable, and can be implemented very efficiently as two separate moving averages in x and y. Consider aggregating in the x direction first: Given the sum at (x, y) based on an interval of N pixels, the sum at (x + 1, y) can be obtained by subtracting the leftmost old pixel and adding the rightmost new pixel. In other words, SSDX(x + 1, y, d) = SSDX(x, y, d) - SDIFF(x - n, y, d) + SDIFF(x + n + 1, y, d), where n = (N - 1)/2. SSD is formed in a similar manner from SSDX by summing up its columns. 

What is the complexity cost of scanline matching in this case (analogous to (b) above)? Compare and contrast the two cost functions.

(d) Write a function compute_sdiff that generates the set of SDIFF images described above. It should take a stereo pair of images and a range of disparity values as its input. Write a function compute_ssd that takes as its input the set of SDIFF images and a window size, and outputs a set of SSD images. It's easiest if these are all floating point images so there are no issues in representing the full range of possible SSD values. The OpenCV image class supports the creation of these types of multi-band, floating point images and provides some useful image processing functions.

You will have to address the issue of boundary conditions (i.e. near the image boundary the window may fall outside the image). This can be a problem particularly when searching over large values of disparity. The easiest solution is to limit the range of x, y values you process, effectively truncating the output image. The output image will then contain a border of pixels for which no answer was computed. Set these to some default value.

(e) Write a routine to display an image in a window by first mapping it into a range of grey scale values. The routine should work for pixel values that are floating point, negative, etc. You will need this to display SSD images and disparity maps so that you can visualize them. Test your display code on a known image and then use it to visualize SSD images and test your code from part (d). You may want to use the test pair: [test-left.bmp, test-right.bmp]. Computed disparity images are provided for d=5 and d=15.

(f) Compute the SSD images for the following two stereo pairs: [sawtooth-left.bmp, sawtooth-right.bmp] and [tsukuba-left.bmp, tsukuba-right.bmp]. The sawtooth image is a "synthetic" image with 3 disparity levels, while the Tsukuba image is a staged scene from Ohta's lab at U. Tsukuba. (Both of these pairs were taken from the Middlebury Stereo Vision Page and are described in the paper by Scharstein and Szeliski). What range of disparity values is present in each of these stereo pairs?

Instructions for submitting part 1: Make a zip file with the suffix '-part1' containing your source code, an executable, and two SSD images, encoded as gray-scale .bmp image files. The first SSD image should be from the sawtooth pair and should correspond to the disparity of the large sawtooth pattern in the foreground. The second SSD image should show the tsukuba pair at disparity 4. The executable should compute and display the sawtooth SSD output. You will also need to answer the questions above. This part does not need to be submitted electronically. Hand-written solutions are fine as long as they are legible. Follow the instructions on the course web page for submitting the zip file.

2. Stereo Using Dynamic Programming

For this part you will implement a dynamic programming algorithm to match two images, one scan-line at a time. You will exploit the ordering constraint that we discussed in class. The class presentation of DP involved an M by M lattice which encoded all possible matchs between pixels, given the ordering constraint. In practice, we would prefer not to have to consider all possible matchs. For example, most real scenes have a restricted range of depths. By restricting the number of disparity values which have to be considered, we can save both computation and storage.

(a) With respect to the virtual image, we write the stereo equations as xL = f X / Z and xR = f (X - B) / Z and define the disparity d = xL - xR. Show that d > 0. What implication does this have for the search lattice in the dynamic programming approach to matching? In the following DP lattices, which paths correspond to valid stereo pairs? Sketch the depth profile of a possible scene in each valid case. By depth profile we mean the depths in the scene for a single scanline, corresponding to an X-Z slice through some 3-D surface. (Numerical accuracy is not required here. A piecewise planar scene with the right qualitative structure as far as occlusions and disocclusions are concerned is all that is necessary.) Number the key points in the depth profile and their corresponding locations on the DP lattice. Most stereo algorithms consider a range of discrete disparity values, including d = 0. What is the meaning of zero disparity in light of d > 0?

(a)

(b)

(c)

(b) The Disparity Space Image (DSI) is a standard data structure for stereo matching algorithms. It encodes the same information as the DP lattice, but in a more compact and useful format. The DSI image is D by W, where D is the number of discrete disparities and W is the width of the scanline. The columns in the DSI correspond to pixels in the left scanline. The rows correspond to disparity values. In the DP lattice, a match between two pixels was encoded by a vertex with coordinates (xL, xR). In the DSI, this match is encoded by the disparity d at column xL (with xR = xL - d). Below we have reproduced figure (c) above along with the corresponding left to right path through a DSI image.

d = 3 M M M
d = 2 L R
d = 1 L R
d = 0 M M M M L R M M M M M
x = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

(d) Path through DP Lattice

(e) Corresponding Path Through DSI Image for Left Scanline

Each cell in the DSI image can have one of three labels: M stands for 'match' and identifies a match with the right scanline at the given disparity value. L is for 'left' and denotes a pixel which is visible in the left image only (the occluded case). Likewise R identifies pixels which are visible in the right image only (the dis-occluded case). Given a valid path through the DSI image, each column will contain either an M or an L (since each pixel on the left scanline is either matched or occluded in the right). The disparity value associated with an 'L' label identifies the next available pixel to match in the right scanline. The disparity associated with an 'R' label identifies a dis-occluded pixel at that disparity in the right scanline. Below is another example path through a DSI image. Draw the corresponding path through a DP lattice. (This example, as well as the description of the data structure, is due to Scharstein and Szeliski).

d = 3 M M M
d = 2 M M L R M M
d = 1 L R
d = 0 M M M L R M M M M
x = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

(f) Sample Path Through DSI Image

(c) Dynamic programming in the DSI image proceeds in a directly analogous fashion to DP on the lattice. There are seven possible transitions from one cell to another. They are illustrated below, with the predecessor state in lower-case and the current state in upper case. Note that at the boundaries of the DSI, some of the transitions will be invalid. For example, only transitions (v) and (vi) can be applied to a predecessor state in the last column, while (v) and (vi) cannot be applied to states in the first row (d = 0) and (iii) and (iv) cannot be applied to states in the last row (maximum d value). No transitions can be applied when starting from the terminal cell (x = W - 1, d = 0).

m M
m L
  L
l  
  M
l  
m
R
r
R
r M

(i)

(ii)

(iii)

(iv)

(v)

(vi)

(vii)

(g) The seven possible transitions between cells in a DSI image

Show that by starting in cell (x = 0, d = 0) with state M (the starting configuration), all valid combinations of transitions (i) - (vii) will lead to termination of the path at the terminal cell (x = W - 1, d = 0). The path between the starting cell and terminal cell with the lowest cost represents the best match of the left scanline to the right. Thus finding the best cost path through the DSI image is equivalent to finding the best cost path through the DP lattice.

(d) There are many possible transitions between pairs of labels that are not included in the list above. For each of the following transitions (or combinations of transitions), explain why it is not included in the list. (Note that in figure (k) we are showing the composition of three transitions: m to L, then l to R, then r to M).

R
m
  M
m  
l L
    R
M L M
(h) (i) (j) (k)

(e) Since there are three possible states for each cell, we need to maintain three separate costs at each cell. The cost values correspond to the cumulative cost along the lowest cost path to the cell which ends in each of the possible states. In addition to the cumulative cost at each cell, we have to store a back-pointer to the predecessor cell. The transition from that cell to the current cell results in the updated cost.  Two three-band images of D by W dimensions, DSI_cost and DSI_ptr,  hold the costs and pointers respectively. Thus DSI_cost(x, d, s) holds the cost of being in state s at location (x, d) in the DSI image, while DSI_ptr(x, d, s) gives the predecessor cell for the path ending in (x, d, s). Each transition can be described by an offset vector (ox, od, os) which functions as a backpointer. If transition i ends in cell (x, d, s) then the predecessor cell is (x - oxi, d - odi, s - osi). Let bpi(x, d, s) denote the function which maps the current cell into its predecessor under transition i. Write pseudocode for recursively generating DSI_cost and DSI_ptr while traversing the DSI image from left to right. The cost of a match (i.e. s = M) can be obtained directly from the SSD images from part 1. The cost of assigning the left or right state to a cell is the occlusion_cost, a constant which you will have to set by hand. These costs can be preloaded into DSI_cost to speed the computation. The initial conditions at x = 0 should be designed so that the cell (0,0,M) is always the first cell in the path.

(f) Write a function DSI_forward which takes as inputs the SSD images, a y coordinate, and occlusion_cost, and returns DSI_cost and DSI_ptr for scanline y. Due to boundary conditions, not all pixels will have SSD values. Your function should start at the first pixel with an SSD value and end with the last. Write a function DSI_backward which has inputs DSI_cost, DSI_ptr, y coordinate, and a global disparity_map image. The function should chain backwards from the lowest cost terminal state to find the optimal path. As it computes the optimal path, it should substitute the optimal disparity values into scanline y of disparity_map. When the optimal state for a given cell is L, there is no direct measurement of disparity available. A reasonable procedure would be to linearly interpolate the disparities from adjacent match states along the scanline. An approximation to this would be to report the disparity associated with L.

(g) Design a set of match costs and an occlusion_cost for the path in figure (f) above. These costs should have the property that the path in figure (f) is optimal. What are these costs? Test your functions from part (f) on this input and verify the correctness of your implementation.

(h) Write a function compute_disparities that takes the SSD images and occlusion_cost and outputs a disparity_map. The disparity_map is an image, defined with respect to the left image in the stereo pair, that contains a disparity estimate for each pixel. Due to boundary conditions, not all pixels will have disparity estimates. You should use a distinguished value for the disparity in these cases so they can be easily recognized. You may want to debug your code on reduced size images so the runtime is faster: [sawtooth-left-small.bmp, sawtooth-right-small.bmp].

(i) Run compute_disparities on the two stereo pairs from part 1(f). Experiment with different disparity ranges using the intuition you gained in part 1(f). Use your visualization function from part 1(e) to view the disparity images. Ground truth disparity images are available for these two cases: sawtooth-truedisp.bmp and tsukuba-truedisp.bmp. Arrange to view your output and the ground-truth side by side. Comment on your output. Where are you making mistakes? What types of mistakes are they? Where are you successful? What disparity ranges gave the best results? What choice of occlusion_cost gave the best results?

(j) Extra Credit (+10 pts) Modify your program to compute the naive disparity estimate described in part 1(b). Compare this result on the test images to the result with DP. Which one is better?

(k) Extra Credit (up to +20 pts) Implement one of the evaluation metrics from Section 5.1 of Scharstein and Szeliski (available from class web site). The full ground truth data is available here. Describe your implementation and report the results.

Instructions for submitting part 2: Make a zip file with the suffix '-part2' containing your source code, an executable, and two disparity images in .bmp format named sawtooth-disp.bmp and tsukuba-disp.bmp, which correspond to the test data. Your executable should compute the disparity image for the sawtooth case (it should do everything the executable from part 1 did, and in addition it should display the computed disparity image and show the ground truth for comparison. You will also need to answer the questions above. This part does not need to be submitted electronically. Hand-written solutions are fine as long as they are legible. Follow the instructions on the course web page for submitting the zip file.

Grading: This PS is 10% of your grade. Each part is 50 points out of a total of 100. Within each part, the questions are worth 20 points and the program/output is worth 30.