The corner detection is a straightforward Harris corner detector, but each point is analyzed at multiple scales in order to "guess" how big the feature is.
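The detection step can be sketched as follows: a minimal Harris response computed with NumPy/SciPy, where `sigma` and `k` are illustrative defaults rather than the project's actual values.

```python
import numpy as np
from scipy import ndimage

def harris_response(img, sigma=1.0, k=0.05):
    """Harris corner response for a grayscale float image (sketch)."""
    # Image gradients
    Ix = ndimage.sobel(img, axis=1, mode="reflect")
    Iy = ndimage.sobel(img, axis=0, mode="reflect")
    # Entries of the second-moment matrix M, smoothed around each pixel
    Sxx = ndimage.gaussian_filter(Ix * Ix, sigma)
    Syy = ndimage.gaussian_filter(Iy * Iy, sigma)
    Sxy = ndimage.gaussian_filter(Ix * Iy, sigma)
    # Harris score: det(M) - k * trace(M)^2
    det = Sxx * Syy - Sxy * Sxy
    trace = Sxx + Syy
    return det - k * trace * trace
```

Corners show up as strong positive peaks in the returned score map, edges as negative values, and flat regions as roughly zero.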
The scale is found by first applying a Laplacian-of-Gaussian (LoG) filter to the image, where the size of the filter corresponds to the size of the features we're looking for. The LoG has the useful property of responding strongly to circles or blobs whose size matches the filter's. Therefore, to pick the scale of each detected point, we choose the scale with the maximum LoG response.
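The scale search can be sketched like this. The `s*s` factor is the standard scale normalization that makes LoG responses comparable across filter sizes; the function and parameter names are illustrative.

```python
import numpy as np
from scipy import ndimage

def select_scale(img, y, x, sigmas):
    """Pick the filter scale (sigma) with the strongest normalized
    LoG response at keypoint (y, x). Sketch, not the project's code."""
    responses = []
    for s in sigmas:
        # sigma^2 normalization makes responses comparable across scales
        log = s * s * ndimage.gaussian_laplace(img, sigma=s)
        responses.append(abs(log[y, x]))
    return sigmas[int(np.argmax(responses))]
```

For a Gaussian blob, the normalized response peaks when the filter's sigma matches the blob's sigma, which is exactly the "guess the feature size" behavior described above.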
As a last step, we perform non-maximum suppression so that keypoints are only selected if they are the maximum in a 3x3 neighborhood.
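The 3x3 non-maximum suppression can be written compactly with a maximum filter; the threshold parameter here is an assumption for filtering out weak responses.

```python
import numpy as np
from scipy import ndimage

def nms_3x3(response, threshold=0.0):
    """Keep only pixels that are the maximum of their 3x3 neighborhood
    (and above a threshold). Returns an array of (row, col) keypoints."""
    local_max = ndimage.maximum_filter(response, size=3, mode="constant")
    keep = (response == local_max) & (response > threshold)
    return np.argwhere(keep)
```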
The SIFT descriptor works by first selecting a window around each feature whose size corresponds to the scale determined in the corner detection phase. We then calculate the image gradients inside that window; the gradient magnitudes are weighted by a Gaussian falloff with variance half the window width, which reduces the importance of gradients far from the feature center. Finally, the weighted gradients are accumulated into a 4x4 grid of 8-bin orientation histograms, forming a 4x4x8 description of the area around the point by interpolation:
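The histogram construction can be sketched as below, assuming the gradient magnitudes and orientations have already been computed for the whole image. Each sample's vote is split among its neighboring spatial and orientation bins with linear (trilinear) interpolation weights; the Gaussian variance follows the half-window-width convention described above, and all names are illustrative.

```python
import numpy as np

def sift_descriptor(mag, theta, center, width=16):
    """Build a 4x4x8 SIFT-style descriptor from gradient magnitudes
    `mag` and orientations `theta` (radians). Sketch implementation."""
    cy, cx = center
    half = width // 2
    hist = np.zeros((4, 4, 8))
    var = width / 2.0  # Gaussian falloff with variance half the window width
    for dy in range(-half, half):
        for dx in range(-half, half):
            y, x = cy + dy, cx + dx
            if not (0 <= y < mag.shape[0] and 0 <= x < mag.shape[1]):
                continue
            w = mag[y, x] * np.exp(-(dy * dy + dx * dx) / (2 * var))
            # Continuous bin coordinates in [0, 4) spatially, [0, 8) in angle
            rbin = (dy + half) / width * 4 - 0.5
            cbin = (dx + half) / width * 4 - 0.5
            obin = (theta[y, x] % (2 * np.pi)) / (2 * np.pi) * 8
            r0 = int(np.floor(rbin))
            c0 = int(np.floor(cbin))
            o0 = int(np.floor(obin))
            fr, fc, fo = rbin - r0, cbin - c0, obin - o0
            # Distribute the weighted vote over the 8 surrounding bins
            for ri, wr in ((r0, 1 - fr), (r0 + 1, fr)):
                if not 0 <= ri < 4:
                    continue
                for ci, wc in ((c0, 1 - fc), (c0 + 1, fc)):
                    if not 0 <= ci < 4:
                        continue
                    for oi, wo in ((o0 % 8, 1 - fo), ((o0 + 1) % 8, fo)):
                        hist[ri, ci, oi] += w * wr * wc * wo
    desc = hist.ravel()
    n = np.linalg.norm(desc)
    return desc / n if n > 0 else desc
```

Without the interpolation weights, a sample sitting near a bin boundary would dump its entire vote into one histogram, and a sub-pixel shift of the window could move that vote wholesale to a neighboring bin.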
To match features, we calculate the pairwise distances between the features in the two images and pick the closest pairs. To avoid ambiguous matches, we want to keep only the most meaningful pairs, so we perform the "ratio test."
For each feature, we calculate the ratio of the distance to its closest match over the distance to its second-closest match. A low ratio means the best match is clearly better than the runner-up, so we discard any matches with a ratio greater than 0.7.
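The matching and ratio test together are only a few lines; this sketch assumes descriptors are stacked as rows of a matrix and uses Euclidean distance.

```python
import numpy as np
from scipy.spatial.distance import cdist

def match_features(desc1, desc2, ratio=0.7):
    """Nearest-neighbor matching with the ratio test: keep a match only
    if the best candidate is clearly closer than the second best."""
    d = cdist(desc1, desc2)      # pairwise distances between descriptors
    idx = np.argsort(d, axis=1)  # candidates sorted nearest-first
    matches = []
    for i in range(len(desc1)):
        best, second = idx[i, 0], idx[i, 1]
        if d[i, best] < ratio * d[i, second]:
            matches.append((i, best))
    return matches
```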
These matches were 92% accurate. The top 100 were 91% accurate.
The effect of searching through scales on match accuracy.
One of the interesting questions I had during this project was whether or not interpolation was necessary. The original SIFT paper motivates interpolation as a way to avoid boundary effects, where a small shift causes a sample to move abruptly from one histogram to another.
Using a single scale, I managed to go from 88% accuracy to 92% accuracy on all matches simply by performing the interpolation step.
These pictures achieved extremely high accuracy without scale-invariance. Trying to make the features scale-invariant actually lowered the accuracy.
These matches were 97% accurate. The top 100 were 100% accurate. One scale is used. 511 total good matches, 15 total bad matches.
These matches were 95% accurate. The top 100 were 96% accurate. Three scales are used. 177 total good matches, 10 total bad matches.
For these images I had to increase the ratio threshold to 0.8. Because the pictures were taken from significantly different angles, my feature descriptors had a hard time staying invariant to the affine distortion. Scale-invariance helped increase the accuracy.
These matches were 14% accurate. One scale is used. 12 total good matches, 71 total bad matches.
These matches were 17% accurate. Ten scales are used. 8 total good matches, 40 total bad matches.
These matches were 92% accurate. The top 100 were 91% accurate. One scale was used.
These matches were 85% accurate. The top 100 were 87% accurate. Three scales were used.
These matches were 89% accurate. The top 100 were 91% accurate. Six scales were used.
These matches were 89% accurate. The top 100 were 94% accurate. Eleven scales were used.
These matches were 89% accurate. The top 100 were 95% accurate. Thirteen scales were used.