Project 2: Local Feature Matching

Eiffel Tower from another viewpoint

Eiffel Tower from one viewpoint

One of the incredible aspects of human vision is our ability to see the same scene from different viewpoints, scales, and orientations and understand it as just that - the same scene. Yet if we pause for a moment, this is clearly not a trivial task. If we wish to endow computers with this ability, how should we start?

One possible solution is to use local features to match images. Conceptually, we can sketch out a rough approach that could work. First, we'd find points in each image that are very distinctive, such as corners. Then we'd extract information around these interest points so that they can be compared against points in the other image. The best matches could then be taken, giving us a rough sense of where objects in one image appear in the other.

David Lowe formalized this approach with the SIFT pipeline. The pipeline can be broken up into five components.

  1. Find a set of distinctive keypoints
  2. Define a region around each keypoint
  3. Extract and normalize the region content
  4. Compute a local descriptor from the normalized region
  5. Match local descriptors

Each of these steps can be done in a variety of ways. Below, I walk through my implementation, pointing out what I did and its strengths and weaknesses.
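
Before diving in, here is a rough sketch of how these steps might be wired together in MATLAB. The function names and image paths are placeholders standing in for my implementation files, not built-ins, so treat this as an outline rather than the actual code.

    % Hypothetical driver for the pipeline; get_interest_points, get_features,
    % and match_features stand in for the actual implementation files.
    image1 = im2single(rgb2gray(imread('view1.jpg')));
    image2 = im2single(rgb2gray(imread('view2.jpg')));

    [x1, y1] = get_interest_points(image1);                    % step 1: find keypoints
    [x2, y2] = get_interest_points(image2);
    feats1 = get_features(image1, x1, y1);                     % steps 2-4: describe each keypoint
    feats2 = get_features(image2, x2, y2);
    [matches, confidences] = match_features(feats1, feats2);   % step 5: match descriptors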

Interest Point Detection

In this step, we want to find points in each image that are distinctive. Otherwise, a point that is generic enough, like a point along an edge or in a smooth region, could be matched to several others, giving poor results. For this portion, I used the Harris corner detector. The idea is that if we're looking at a corner through a small window, shifting the window in any direction gives a large change in intensity. The value we're calculating in this process is the local auto-correlation. If we compute the local auto-correlation for each pixel across the entire image, the process can be formalized according to the equation below.

Change in intensity for a window shift of (u, v):

E(u, v) = \sum_{x, y} w(x, y) \left[ I(x + u, y + v) - I(x, y) \right]^2

The problem is that this computation is very computationally intensive. Thankfully, the auto-correlation can be approximated using the second moment matrix. By looking at the eigenvalues of the second moment matrix, we can tell if we're centered on a corner. If both eigenvalues are large, then there's a lot of change in both directions, indicating the presence of a corner. Otherwise, if just one is large, or neither is, we're looking at an edge or a smooth patch, respectively. By using the trace and determinant of the second moment matrix, we can circumvent calculating the eigenvalues directly.

Second moment matrix (with I_x and I_y the image derivatives):

M = \sum_{x, y} w(x, y) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}

In my implementation, I used the Sobel filter to calculate the derivatives in the x and y directions. I also applied a Gaussian filter after each Sobel filtering step to reduce noise. To calculate the Harris value of each pixel, where a larger value means the pixel is more likely to be a corner, I used the following formula, which incorporates the mathematical tricks discussed above.

Harris corner response, using the determinant and trace in place of the eigenvalues:

R = \det(M) - \alpha \, \mathrm{trace}(M)^2 = \lambda_1 \lambda_2 - \alpha (\lambda_1 + \lambda_2)^2

I cut off values less than 1% of the maximum Harris response, treating those points as non-corners.
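
A minimal sketch of this computation is below, assuming a grayscale, single-precision image img. The Gaussian size/sigma and the constant alpha are placeholder values, not necessarily the exact parameters I settled on.

    % Harris response per pixel (placeholder parameters)
    alpha = 0.04;                            % empirical corner constant (assumed value)
    g     = fspecial('gaussian', 9, 1);      % Gaussian used to reduce noise
    sob   = fspecial('sobel');               % Sobel kernel (y-derivative); transpose for x

    Iy = imfilter(imfilter(img, sob),  g);   % derivative in y, then blur
    Ix = imfilter(imfilter(img, sob'), g);   % derivative in x, then blur

    % Entries of the second moment matrix, accumulated over a local window
    Ixx = imfilter(Ix .* Ix, g);
    Iyy = imfilter(Iy .* Iy, g);
    Ixy = imfilter(Ix .* Iy, g);

    % R = det(M) - alpha * trace(M)^2, computed at every pixel
    har = Ixx .* Iyy - Ixy .^ 2 - alpha * (Ixx + Iyy) .^ 2;

    % Discard anything below 1% of the maximum response
    har(har < 0.01 * max(har(:))) = 0;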

The 100 best interest points for three different image pairs are shown below; in each pair, the same scene is shown from slightly different perspectives on the left and right. We can see that the algorithm does a decent job selecting points that seem significant, and the points are generally the same in both images.

Corner Detection Results

I used non-maximum suppression to increase the quality of the corners. Non-maximum suppression is a technique to thin out responses and keep only the points with the sharpest change in intensity. To implement it, I used the colfilt function in MATLAB, which allowed me to take the maximum value in a small sliding window (I chose 3x3). This led to better feature points: the best possible accuracy when running the Notre Dame images through the pipeline was 86% without non-maximum suppression and 94% with it. When we look at the 100 best corners detected in the two cases, the difference between the corners selected is visually minimal. However, to get the boost in accuracy, the corners selected with non-maximum suppression must be slightly more distinct. Another benefit is that by reducing the number of corners detected, the time spent matching is reduced dramatically. Without filtering out any results, matching takes approximately 60 seconds (depending on the matching method, discussed later) without non-maximum suppression vs. just 1 second with it. This is because only 1,337 features are generated when using non-maximum suppression, vs. 12,581 without.
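
A minimal sketch of the suppression step, operating on the thresholded response har from the sketch above:

    % Keep only pixels that are the maximum of their 3x3 neighborhood
    local_max = colfilt(har, [3 3], 'sliding', @max);
    har(har < local_max) = 0;
    [y, x] = find(har > 0);                  % row/column coordinates of the surviving corners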

Corner Detection Results with and without Non-Maximum Suppression

The amount of Gaussian blurring also had a big impact on accuracy. On the Notre Dame pair, the accuracy ranged anywhere from 70% to 94% depending on how large the filter was and what standard deviation was used. This makes sense: a small amount of blurring eliminates noise and increases the accuracy of the derivatives, but too much removes useful information from the pixels.

Local Feature Description

Now that we've found interest points, the next thing to do is extract information around these points. The first, naive thing to do is just use the raw pixel values as features. An area of 16x16 was taken around each interest point, which means each feature has 256 elements when flattened out. Why is this naive? Because even the slightest variation in illumination or viewpoint can greatly change the individual pixel values, leading to poor matches. However, it still works fairly well. The accuracy on the 100 best corners when using the entire pipeline with just the patches as features was 79%, 75%, and 2% for the Notre Dame, Mount Rushmore, and Episcopal Gaudi image pairs, respectively.
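
A minimal sketch of this naive descriptor, assuming keypoint coordinates x (columns) and y (rows) that are far enough from the image border that no boundary handling is needed:

    % Raw 16x16 patches around each keypoint, flattened into 256-element rows
    feats = zeros(numel(x), 256);
    for i = 1:numel(x)
        r = round(y(i));  c = round(x(i));
        win = img(r-7:r+8, c-7:c+8);         % 16x16 window around the keypoint
        feats(i, :) = win(:)';               % flatten to a 1x256 feature
    end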

A better approach is to create a histogram of the gradient orientations in small blocks around each point. This captures the local content without being completely wedded to the pixel values themselves. To be more precise, an area of size 16x16 is inspected around each corner detected by the Harris method. This 16x16 area is broken down into a 4x4 grid of chunks, each of size 4x4. In each chunk, the gradient is taken (I used MATLAB's imgradient function), and a histogram of the gradient directions is created. The direction is discretized into 8 bins at 45 degree increments. When creating the histogram, the gradient of each pixel in the 4x4 chunk is greedily assigned to the closest of the eight orientations, with no interpolation performed.

The accuracy on the 100 best corners when using the entire pipeline with the histogram-of-gradients features was 96%, 93%, and 3% for the Notre Dame, Mount Rushmore, and Episcopal Gaudi image pairs, respectively. The results are shown below. Clearly the histogram approach outperforms the pipeline when just image patches are used, but it is still not enough to fix the poor performance on the Episcopal Gaudi pair. For Mount Rushmore and Notre Dame, the images in each pair are very similar, unlike in Episcopal Gaudi, where the two images are at different scales and slightly different angles. Since my pipeline doesn't take rotation or scale into account, it is easily fooled by these differences, leading to the poor results. In contrast, the Eiffel Tower photos lead to a mixture of good and bad results. The pipeline does a good job matching the top of the tower, even though the two photos were taken at noticeably different angles, because the features at the top of the tower are fairly distinct and orientation independent by themselves. Where it gets tripped up is on the lower part, where there are many similar features, leading to false matches.

Pipeline Results with 100 best feature matches
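
A minimal sketch of the gradient-histogram descriptor described above, for a single keypoint at (row r, column c). Magnitude-weighted binning is one common choice rather than a detail taken from my code, the ordering of the 128 entries is illustrative, and boundary handling is again omitted.

    % 16x16 window split into a 4x4 grid of 4x4 cells, 8 orientation bins per cell
    win = img(r-7:r+8, c-7:c+8);
    [gmag, gdir] = imgradient(win);          % gradient magnitude and direction (degrees)
    bins = mod(round(gdir / 45), 8) + 1;     % snap each direction to one of 8 bins
    desc = zeros(1, 128);
    for cr = 1:4
        for cc = 1:4
            rows = (cr-1)*4 + (1:4);
            cols = (cc-1)*4 + (1:4);
            b = bins(rows, cols);
            m = gmag(rows, cols);
            h = accumarray(b(:), m(:), [8 1]);         % magnitude-weighted histogram
            desc((cr-1)*32 + (cc-1)*8 + (1:8)) = h';   % place the cell's 8 bins
        end
    end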

Feature Matching

Once we've found features in the two images, we're faced with the task of matching them. The most obvious question is what constitutes a "good" or "bad" match. In this case, we'll use the classic L2 (Euclidean) distance to determine close matches. But just because two features are very close doesn't necessarily mean they form a good match. If a feature has many similar instances in an image, it becomes difficult to tell which two are actually equivalent. This is where the nearest neighbor distance ratio comes into play. Here, we consider not only the closest distance between a feature in one image and the features in the other, but also the second-closest distance. A good feature should be distinct and occur infrequently in an image, meaning the second-closest match should be much farther away than the first. Thus, to select the best matches, we take the ratio between the closest and second-closest distances.
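
As a sketch, the brute-force version of this ratio test can be written with pairwise distances; features1 and features2 are the descriptor matrices from the previous section, and defining the confidence as the inverse ratio is just one reasonable choice.

    % Brute-force nearest neighbor distance ratio test
    dists = pdist2(features1, features2);          % all pairwise Euclidean distances
    [d, idx] = sort(dists, 2);                     % nearest neighbors first in each row
    nndr = d(:, 1) ./ d(:, 2);                     % closest / second-closest distance
    matches = [(1:size(features1, 1))', idx(:, 1)];
    confidences = 1 ./ nndr;                       % low ratio -> distinctive -> high confidence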

The easiest way to calculate the closest and second-closest distances is to iterate through every feature in the first image, comparing it with every feature in the second. While this works fine, it can be quite slow given a large number of features. One way to address this is to use a k-d tree to perform the search. A k-d tree is a binary tree in which each node represents a k-dimensional point; each non-leaf node can be thought of as a hyperplane splitting the space. This allows for faster searches for the same reason that binary search is faster than iterating through an entire list. MATLAB has a handy implementation of a k-d tree, which I used alongside the brute-force approach. For a small number of features (~1,000), the brute-force approach took 0.28 seconds vs. 0.23 seconds with a k-d tree. Where the latter really shines is when the number of features is larger: given 12,000 features, the brute-force algorithm took 45.85 seconds to complete the matching vs. just 30 seconds for the k-d tree. Further, the k-d tree causes no decrease in accuracy, making it the superior method.
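
The k-d tree variant is nearly a drop-in replacement; a sketch using MATLAB's KDTreeSearcher and knnsearch:

    % Same ratio test, with a k-d tree answering the nearest-neighbor queries
    tree = KDTreeSearcher(features2);              % build the tree over image 2's features
    [idx, d] = knnsearch(tree, features1, 'K', 2); % two nearest neighbors per query row
    nndr = d(:, 1) ./ d(:, 2);
    matches = [(1:size(features1, 1))', idx(:, 1)];
    confidences = 1 ./ nndr;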

Extra Credit

As I mentioned in the sections above, I implemented non-maximum suppression and used a k-d tree to speed up matching. Check the respective sections for more detail.

Conclusion

Overall, we've seen how to use a SIFT-like pipeline to compare local features in two images. For roughly similar images, the pipeline works great, taking only a few seconds and matching features with good accuracy. The pipeline does falter when faced with images that differ widely in scale and viewing angle. To improve accuracy, we could make the corners and features scale and orientation invariant, but that's for another day (and a higher extra credit cap ☺).