Project 2: Local Feature Matching

Detected keypoints on two images.

This project served as a basic introduction to the concept of image feature matching. We built an elementary version of the SIFT keypoint descriptor, based on keypoints found using the Harris Corner Detector. The process can be broken down into three main steps: keypoint detection, keypoint description, and keypoint matching. Each of these steps will be described in detail.

Keypoint Detection

The first stage in the process is to identify points of interest in the image. Ideally, these points should be distinctive: picking regions of low detail and low contrast would not produce very informative interest points.

Interpreting eigenvalues (courtesy Robert Collins)

For this project, our keypoint detector of choice was the Harris Corner Detector (Harris & Stephens, 1988). A corner is defined as a small region in which a large change in intensity occurs in both the x and y directions. Both directions must experience a change; otherwise, the region could simply be an edge or another low-detail point. The gradients of the input image are calculated using two Sobel filters, one oriented for the x gradient and the other for the y gradient. Next, these gradient images are smoothed using a Gaussian smoothing filter. Using these smoothed gradients, we can calculate the second moment matrix at each pixel in the image. A useful property of this matrix is that its eigenvalues describe the characteristics of the gradients at that point. When both eigenvalues are large and similar in magnitude, we have found a corner. The graphic on the left illustrates these characteristics.
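As an illustration, a minimal sketch of this computation might look like the following. The function name, the smoothing sigma, and the alpha constant are my own choices for this sketch, not necessarily what the project used; the det(M) - alpha*trace(M)^2 cornerness score is a standard stand-in for examining the eigenvalues directly, since it is large only when both eigenvalues are large.

    import numpy as np
    from scipy.ndimage import sobel, gaussian_filter

    def harris_response(image, sigma=1.0, alpha=0.06):
        # x and y gradients from two Sobel filters
        Ix = sobel(image, axis=1)
        Iy = sobel(image, axis=0)

        # Entries of the second moment matrix, smoothed with a Gaussian
        Sxx = gaussian_filter(Ix * Ix, sigma)
        Syy = gaussian_filter(Iy * Iy, sigma)
        Sxy = gaussian_filter(Ix * Iy, sigma)

        # Cornerness: det(M) - alpha * trace(M)^2 is large only when both
        # eigenvalues of the second moment matrix are large
        return Sxx * Syy - Sxy ** 2 - alpha * (Sxx + Syy) ** 2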

The last step in keypoint detection is to clean up the results. We eliminate interest points near the border, which often form spuriously at the image edges, and set a threshold value so that only the more "prominent" corner points are kept. I experimentally found that a threshold of two times the mean response gave me the best results. Using a technique called non-maximum suppression, we take only the largest value in a small neighborhood (5x5). This ensures that we do not interpret a single corner as many corners.
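A possible sketch of this cleanup step, assuming the response map from above: the 5x5 neighborhood and the two-times-the-mean threshold follow the description, while the function name and the 16-pixel border width are my own assumptions.

    import numpy as np
    from scipy.ndimage import maximum_filter

    def select_keypoints(response, border=16, threshold_scale=2.0, nms_size=5):
        r = response.copy()

        # Suppress interest points near the image border
        r[:border, :] = 0
        r[-border:, :] = 0
        r[:, :border] = 0
        r[:, -border:] = 0

        # Keep only responses above two times the mean response
        r[r < threshold_scale * r.mean()] = 0

        # Non-maximum suppression: keep only the largest value in each 5x5 neighborhood
        local_max = maximum_filter(r, size=nms_size)
        ys, xs = np.nonzero((r == local_max) & (r > 0))
        return xs, ys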

Detected keypoints for given images.

Keypoint Description

This stage is where the simplified version of the Scale Invariant Feature Transform is utilized. Unlike the full SIFT algorithm introduced by David Lowe in 1999, my implementation is not invariant to scale or rotation. It does, however, handle changes in brightness across images sufficiently well.

As with the Harris Corner Detector, the first step is to calculate x and y gradients for the given image and then filter them with a Gaussian smoothing filter. For each interest point detected in the last step, we build the feature from a 16x16 window around the point. This window is then divided into a 4x4 grid of cells (if the feature width is 16, each of these cells will be 4 pixels by 4 pixels). Within each cell, the gradient direction and magnitude are computed from the x and y gradient components calculated previously. We discretize the direction into 8 possible bins (each bin spans pi/4) and build a histogram for the cell, where each pixel votes for its direction bin with a weight equal to its gradient magnitude. When we append the histograms from all 16 cells in the window, we end up with a 128-dimension descriptor. The descriptor is normalized to unit length so that it is invariant to intensity (brightness).
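A sketch of how the cell histograms could be built and assembled into the 128-dimension descriptor, assuming the smoothed gradient images Ix and Iy and an interest point at (x, y); the function name and indexing details are illustrative, not the project's exact code.

    import numpy as np

    def sift_descriptor(Ix, Iy, x, y, width=16):
        # Gradient magnitude and orientation at every pixel
        mag = np.hypot(Ix, Iy)
        ori = np.arctan2(Iy, Ix)  # orientation in (-pi, pi]

        half = width // 2
        patch_mag = mag[y - half:y + half, x - half:x + half]
        patch_ori = ori[y - half:y + half, x - half:x + half]

        cell = width // 4  # 4x4 grid of cells, each 4x4 pixels when width is 16
        histograms = []
        for i in range(4):
            for j in range(4):
                m = patch_mag[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
                o = patch_ori[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
                # 8 orientation bins of width pi/4, votes weighted by gradient magnitude
                hist, _ = np.histogram(o, bins=8, range=(-np.pi, np.pi), weights=m)
                histograms.append(hist)

        descriptor = np.concatenate(histograms)  # 16 cells x 8 bins = 128 dimensions
        return descriptor / (np.linalg.norm(descriptor) + 1e-10)  # normalize to unit length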

93% accuracy for Notre Dame (top) and 97% accuracy for Mount Rushmore (lower).

Feature Matching

The final stage in the SIFT pipeline is to match the calculated features. The similarity between two features is calculated simply as the Euclidean distance between them. In order to include only the more distinct matches in the results, a simple ratio test is used to calculate a confidence for each match. This ratio, computed as the distance to the nearest neighbor divided by the distance to the second nearest neighbor, is used to sort the calculated matches. The lower the ratio, the more confident we can be that the match is unique and therefore correct.
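The matching stage could be sketched roughly as follows, assuming two arrays of 128-dimension descriptors; the function name and the inverse-ratio confidence score are my own conventions for this sketch.

    import numpy as np

    def match_features(desc1, desc2):
        # Pairwise Euclidean distances between every pair of descriptors
        dists = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)

        matches, confidences = [], []
        for i, row in enumerate(dists):
            order = np.argsort(row)
            nearest, second = row[order[0]], row[order[1]]
            # Ratio test: distance to nearest neighbor over distance to second nearest
            ratio = nearest / (second + 1e-10)
            matches.append((i, order[0]))
            confidences.append(1.0 / (ratio + 1e-10))  # lower ratio -> higher confidence

        # Sort matches from most to least confident
        idx = np.argsort(confidences)[::-1]
        return np.array(matches)[idx], np.array(confidences)[idx]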

Using the given groundtruth points, I was able to achieve an accuracy of 93% on the Notre Dame image pair. I was able to obtain an even better 97% on the Mount Rushmore image pair.

Additional Examples

I tested my feature detector and descriptor on additional images provided in the supplemental materials. Although there are no ground truth points, it is still fairly easy to assess performance by observing the match lines between the two images. If the lines seem to follow a consistent pattern and do not cross often, we can conclude that the keypoint matching was relatively successful.

(from top to bottom) Two views of a house, El Capricho de Gaudi, and the Statue of Liberty.

Failures

The limitations of this implementation quickly begin to show when more challenging examples are used. The provided Episcopal Palace image pair has a large degree of scale difference between the two images. I did not incorporate any SIFT octaves or keypoint scales in my implementation, so most of the keypoints detected did not correspond between the images. Additionally, there is a large amount of repetition in the details of the building, so matches between features were often incorrect. As a result, the accuracy for this image pair was extremely low: just 8%.

The Episcopal Gaudi image pair had a low 8% accuracy.

I challenged my implementation with a few other provided examples. The following images demonstrate a few more specific cases in which the pipeline failed substantially. In the first pair, the two images are taken from entirely different perspectives, which introduces not only a fair amount of scale difference but also some keypoint rotation. In the second pair, showing the Pantheon in Paris, it is difficult for even a human to immediately recognize that the two images show the same building. The second image is taken from a significantly greater distance, so the amount of overlap between the images is greatly reduced. Finally, in the last pair the two images of the Statue of Liberty differ not only by a large perspective shift but also by significant lighting differences.

Additional failure cases for my SIFT pipeline.

Conclusion

The above examples showcase the strengths and weaknesses of the local feature matching pipeline I created. I was very impressed with the high accuracy of my feature matching and its robustness against differences in lighting and perspective. However, I also recognized that it would fail in many cases - especially those involving scale differences - and if I had additional time I would attempt to incorporate scale invariance. One of the most significant challenges I faced while working on this project was computation time. I spent a lot of time devising faster ways to calculate certain quantities, and reduced the overall runtime from several minutes to less than 20 seconds. Another feature I would try to implement is PCA, so that I could further speed up the matching process (which was responsible for a significant portion of the elapsed time).