Computer Vision Project

Project 2: Local Feature Matching

Notre Dame, with matches shown

This project implements a basic SIFT-like pipeline, with several parameter adjustments to improve accuracy. Each step in the pipeline is explained below, accompanyed by the accuracy of the pipeline after adding that step. In this way, it is possible to see how each step is critical to the overall functionality of the system.

Introduction

Images pairs of Notre Dame, Mount Rushmore, and the Episcopal Gaudi were used to test this SIFT pipeline. The raw pairs are shown below for reference:

Feature Matching

The most critical element in the pipeline is feature matching. Without feature matching the accuracy rate is typically 0%. After implementing a simple nearest-neighbor matching algorithm, using the euclidean distance to compute the distance between features, and ranking the features by the inverse of the ratio distance described in Szeliski, the accuracy is still nearly 0%, because the features and their interest points are still random. According to the SIFT paper, there is a very low probability of a correct match when the ratio distance is greater than 0.75. Therefore, in a real pipeline, it would make sense to disregard any features that exceed this ratio. However, for this implementation, the top 100 matches are retained, regardless of ratio distance.

Feature Description

The simplest feature descriptor is a simple image patch. Using such a patch as the descriptor, and using the human-supplied interest points provided by the template code, we can achieve accuracy on the Notre Dame image of around 55%. If we use machine-detected interest points instead, the accuracy jumps to 72%. An example of the output from this type of pipeline is shown below, for reference.

Why the discrepancy between the accuracy when the method for choosing the points is changed? The reason is that there are more than 1000 points returned by the interest point detector, compared to under a hundred for the human-annotated points. The order of magnitude more points generated by the interest point detector means that there is a higher chance that the same physical location is chosen as an interest point in both images. This is the only way that a valid match is possible. More interest points increases the chance of this condition existing.

When a SIFT-like detector, using histograms of gradients within a feature, is used, accuracy jumps to 82%.

With Human-Annotated Interest Points (55% accuracy)	With Machine-Detected Interest Points (72% accuracy)

Interest Point Detection

Before interest point detection is added the maximum accuracy of the pipeline is about 82% on the Notre Dame images. After interest point detection is implemented, the accuracy jumps to 96%. As discussed above, this is almost certainly due to the fact that there are more matches for the pipeline to choose from, so there are enough good matches to fill the top 100 spots with correct matches.

Human-Annotated Interest Points (82%)	Machine-Detected Interest Points (96%)

Parameter Tuning

The importance of tuning parameters was important in this project. There are a variety of free parameters that need to be chosen by the implementer, and require trial and error to set right. The best example of this is the output from the SIFT feature descriptor. The documentation recommended that this descriptor be raised to a fractional power. A few minutes of trial and error were required to settle on a reasonable value for this parameter (0.3).

Other Images

Up to now, the analysis has focued on the Notre Dame image pair. The results of the other provided image pairs are presented below, for reference:

	Mount Rushmore (99%)
	Episcopal Gaudi (6%)

Mount Rushmore performs especially well, because there are a preponderance of good corners and distinctive features to track.

Episcopal Gaudi is difficult because the difference in scale between the two images isn't accounted for by the simple SIFT-like descriptors implemented in this project.