Project 2: Local Feature Matching

This project report covers the implementation of a SIFT-like pipeline for instance-level feature detection and matching.

Interest point detection

The first part of the pipeline is interest point detection, i.e. finding features in the image that are likely to be identified in other images of the same physical location.

I chose to implement a basic version of the Harris corner detector. First I compute the image gradients, then the second moment matrix and the Harris score for each pixel in the image.


	% Harris response R = det(M) - ALPHA * trace(M)^2 for every pixel
	output = (Sxx .* Syy) - (Sxy .^ 2) - ALPHA * ((Sxx + Syy) .^2);
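
For context, here is a minimal sketch of how the second moment matrix entries Sxx, Syy and Sxy feeding that formula could be computed. The Gaussian window size and sigma are illustrative, not necessarily the values my pipeline uses.

	% Sketch: second moment matrix entries from Gaussian-weighted gradient
	% products. Window size and sigma are illustrative.
	g = fspecial('gaussian', 9, 1.5);       % weighting window
	[Ix, Iy] = imgradientxy(im, 'sobel');   % image gradients
	Sxx = imfilter(Ix .^ 2, g);
	Syy = imfilter(Iy .^ 2, g);
	Sxy = imfilter(Ix .* Iy, g);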

Visualization of the Harris corner response matrix.

The corner responses are then filtered so that any response below the threshold HAR_THRESHOLD is suppressed, since such responses are presumed to be edges or flat regions. All interest points near the edge of the image are also suppressed, since a patch around them cannot be extracted in the feature descriptor step. Finally, all non-maxima points are suppressed using a sliding window. This speeds up the pipeline by producing fewer and more distinct interest points: essentially, non-maxima suppression selects the best point in a cluster around a good feature in the image rather than all of them. It also reduces potential confusion when matching, since there are fewer similar-looking points. A sketch of these steps follows below.
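
A minimal sketch of the suppression steps, assuming a WIN-by-WIN sliding window and a border margin of HALF_PATCH pixels (both names are illustrative):

	% Sketch: threshold, border and non-maxima suppression of the Harris
	% response. WIN and HALF_PATCH are illustrative names.
	output(output < HAR_THRESHOLD) = 0;      % presumed edges or flat regions
	output(1:HALF_PATCH, :) = 0;             % suppress points near the borders
	output(end-HALF_PATCH+1:end, :) = 0;
	output(:, 1:HALF_PATCH) = 0;
	output(:, end-HALF_PATCH+1:end) = 0;
	local_max = output == imdilate(output, ones(WIN));   % sliding-window maxima
	[y, x] = find(local_max & output > 0);   % remaining interest points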

Parameters and efficiency

Visual inspection of the Harris corner response matrix indicates good performance: the points found lie on the corners of Notre Dame. Using non-maxima suppression speeds the pipeline up significantly and increases accuracy by about 3 percentage points.

Feature descriptors

The pipeline then creates a SIFT-like feature vector for each interest point. These 128-dimensional vectors act like fingerprints that are unique to features in the image. Ideally they are also robust to scaling and lighting changes.

The features are created by looking at a 16x16 pixel patch around each interest point and calculating the magnitude and direction of the image gradients in the patch. A Gaussian filter is then applied to the magnitudes to make the descriptor less sensitive to shifts in the images.
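
A hedged sketch of this step, reading the Gaussian filter as a weighting window over the patch (patch, the window size and the sigma are all illustrative):

	% Sketch: gradients of the 16x16 patch and Gaussian weighting of the
	% magnitudes. The sigma is illustrative.
	[mag, dir] = imgradient(patch);    % gradient magnitude and direction
	dir = mod(dir, 360);               % map direction into [0, 360) degrees
	w = fspecial('gaussian', 16, 8);   % weighting window
	mag = mag .* w;                    % downweight pixels far from the centre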

The patch is then divided into 4x4 cells that are processed one by one. For each cell an eight-bin gradient magnitude histogram is computed; these 4x4x8 bins serve as our 128-dimensional feature vector. To make the feature vector further resistant to small shifts, trilinear interpolation is used to propagate bin values to nearby bins in the patch. This last step increased accuracy by about five percentage points.
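
Continuing the sketch above, the histogram layout with hard binning looks roughly like this (the trilinear interpolation is omitted for brevity):

	% Sketch: build the 4x4x8 histogram from the weighted magnitudes.
	feat = zeros(4, 4, 8);
	for r = 1:16
	    for c = 1:16
	        cr = ceil(r / 4);                 % which 4x4 cell, row-wise
	        cc = ceil(c / 4);                 % which 4x4 cell, column-wise
	        b = floor(dir(r, c) / 45) + 1;    % one of eight orientation bins
	        feat(cr, cc, b) = feat(cr, cc, b) + mag(r, c);
	    end
	end
	feat = feat(:);                           % 4*4*8 = 128 dimensions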

To make the feature vectors more uniform they are normalized, clamped at 0.2 and then renormalized. Clamping and normalizing improved the accuracy by 5 percentage points.
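
This step is compact enough to show directly:

	% Normalize, clamp large components at 0.2, then renormalize.
	feat = feat / norm(feat);
	feat = min(feat, 0.2);
	feat = feat / norm(feat);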

Parameters and efficiency

The SIFT-like feature descriptors are superior to raw pixel-intensity patch templates, since they are more resistant to changes in lighting, scale and rotation.

Feature matching

For each feature in the second image I found the two closest features in feature space (using the Euclidean distance). I then applied the nearest neighbour distance ratio (NNDR) test: matches with a ratio below NNDR_THRESHOLD were considered good. The results were sorted from best ratio to worst. A small ratio means that the nearest neighbour is much more similar than the next nearest, implying that the match is good relative to all other possible matches.


	% Two nearest neighbours in features1 for every feature in features2:
	% D(1,i) and D(2,i) hold the closest and second closest distances, and
	% I(1,i) the index of the closest feature in features1.
	[D, I] = pdist2(features1, features2, 'euclidean', 'Smallest', 2);
	matches = [];
	confidences = [];
	for i = 1:size(features2, 1)
	    nndr = D(1, i) / D(2, i);    % nearest neighbour distance ratio
	    if nndr <= NNDR_THRESHOLD
	        matches = [matches; [I(1, i) i]];
	        confidences = [confidences; nndr];
	    end
	end
	% Sort the matches from best (lowest) ratio to worst.
	[confidences, order] = sort(confidences);
	matches = matches(order, :);

Parameters and efficiency

Discussion

This project is very open to improvements both in the pipeline itself and through tweaking of the parameters. I settled on implementing a SIFT-like pipeline and tweaking the parameters as best I could.

Final accuracy

The table below shows my algorithm's accuracy on the different image pairs.
Image            Accuracy (at NNDR=0.8)   Accuracy (100 best)
Notre Dame       84% (out of 76)          77%
Mount Rushmore   38% (out of 16)          22%
Gaudi            11% (out of 11)          6%
I decided to measure both the accuracy of the 100 best matches and the accuracy of just the points the algorithm was confident in. Preferably one would use only the confident matches rather than a fixed number of matches, to avoid too many incorrect matches. The result is very good for Notre Dame, largely because the two images are similar with respect to scale and the church has clear and distinct features. Accuracy could be increased on the other images if the parameters were tweaked for those images, though that might lead to overfitting. Better would be to add scale detection for Mount Rushmore and more advanced feature detection for Gaudi. One thing to try would be to implement SURF instead of SIFT.