In this project, I implemented the following:
The projection from a 3D point to a 2D point is represented by:
$$ \begin{pmatrix}u \\ v \\ 1\end{pmatrix} \cong \begin{pmatrix}su \\ sv \\ s \end{pmatrix} = M \begin{pmatrix}X \\ Y \\ Z \\ 1 \end{pmatrix} =\begin{pmatrix}m_{11} & m_{12} & m_{13} & m_{14} \\ m_{21} & m_{22} & m_{23} & m_{24} \\ m_{31} & m_{32} & m_{33} & m_{34} \end{pmatrix} \begin{pmatrix}X \\ Y \\ Z \\ 1 \end{pmatrix} $$, where M is the projection matrix from 3D homogeneous coordinates to 2D homogeneous coordinates.
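As a quick numeric sketch of this projection and the homogeneous divide (the matrix and point below are made up for illustration, not taken from this project):

```python
import numpy as np

# A made-up 3x4 projection matrix, just to illustrate the homogeneous divide
M = np.array([[1.0, 0.0, 0.0, 0.5],
              [0.0, 1.0, 0.0, -0.5],
              [0.0, 0.0, 1.0, 1.0]])

X = np.array([2.0, 4.0, 1.0, 1.0])  # 3D point (X, Y, Z) in homogeneous form
su, sv, s = M @ X                   # (s*u, s*v, s)
u, v = su / s, sv / s               # divide by s to recover pixel coordinates
print(u, v)                         # -> 1.25 1.75
```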
Each correspondence gives two equations, so with N such pairs we obtain a system of equations. Notice that M still has one degree of freedom (its overall scale), so let's fix $m_{34} = 1$ and use the original 3D and 2D points (without any scaling). Writing the system in matrix form:
$$ \begin{pmatrix} X_1 & Y_1 & Z_1 & 1 & 0 & 0 & 0 & 0 & -u_1 X_1 & -u_1 Y_1 & -u_1 Z_1 \\ 0 & 0 & 0 & 0 & X_1 & Y_1 & Z_1 & 1 & -v_1 X_1 & -v_1 Y_1 & -v_1 Z_1 \\ \vdots & & & & & & & & & & \vdots \\ X_n & Y_n & Z_n & 1 & 0 & 0 & 0 & 0 & -u_n X_n & -u_n Y_n & -u_n Z_n \\ 0 & 0 & 0 & 0 & X_n & Y_n & Z_n & 1 & -v_n X_n & -v_n Y_n & -v_n Z_n \end{pmatrix} \begin{pmatrix} m_{11} \\ m_{12} \\ m_{13} \\ m_{14} \\ m_{21} \\ m_{22} \\ m_{23} \\ m_{24} \\ m_{31} \\ m_{32} \\ m_{33} \end{pmatrix} = \begin{pmatrix} u_1 \\ v_1 \\ \vdots \\ u_n \\ v_n \end{pmatrix} $$ Solving this least-squares system gives us the entries of M. The code is shown below:
A = zeros(2*size(Points_2D,1), 11);
b = zeros(2*size(Points_2D,1), 1);
for i = 1:size(Points_2D, 1)
    % Two rows per correspondence, matching the linear system above
    A(2*i-1, :) = [Points_3D(i,:), 1, zeros(1,4), -Points_2D(i,1)*Points_3D(i,:)];
    A(2*i, :)   = [zeros(1,4), Points_3D(i,:), 1, -Points_2D(i,2)*Points_3D(i,:)];
    b(2*i-1) = Points_2D(i,1);
    b(2*i)   = Points_2D(i,2);
end
m = [A \ b; 1];         % least-squares solve, then append m34 = 1
M = reshape(m, 4, 3)';  % MATLAB fills column-major, so reshape to 4x3 and transpose
The following code computes the camera center from M = [Q | m4] via C = -Q^{-1} m4:
Center = -(M(1:3,1:3) \ M(1:3,4))';  % backslash solve instead of an explicit inverse
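The same least-squares setup can be sketched in Python/numpy (a hedged translation of the MATLAB above; the function and variable names here are mine, not the project's):

```python
import numpy as np

def estimate_projection(points_3d, points_2d):
    """Estimate the 3x4 projection matrix with m34 fixed to 1, plus the camera center."""
    n = len(points_3d)
    A = np.zeros((2 * n, 11))
    b = np.zeros(2 * n)
    for i, ((X, Y, Z), (u, v)) in enumerate(zip(points_3d, points_2d)):
        # Two rows per 3D-2D correspondence
        A[2*i]     = [X, Y, Z, 1, 0, 0, 0, 0, -u*X, -u*Y, -u*Z]
        A[2*i + 1] = [0, 0, 0, 0, X, Y, Z, 1, -v*X, -v*Y, -v*Z]
        b[2*i], b[2*i + 1] = u, v
    m, *_ = np.linalg.lstsq(A, b, rcond=None)      # least-squares solve
    M = np.append(m, 1.0).reshape(3, 4)            # row-major reshape, m34 = 1
    center = -np.linalg.solve(M[:, :3], M[:, 3])   # camera center C = -Q^{-1} m4
    return M, center
```

Note that numpy reshapes row-major, so no transpose is needed here, unlike the MATLAB version.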
Ideally, if a pair of points (p, q) is a correct match, we should have
$$ q^T F p = 0 $$, where F is the fundamental matrix (3 x 3) and p, q are in homogeneous coordinates (p in image A and q in image B).
Given n pairs, we can rewrite:
$$ Ax = \begin{pmatrix} p_{11} q_{11} & p_{12} q_{11} & q_{11} & p_{11} q_{12} & p_{12} q_{12} & q_{12} & p_{11} & p_{12} & 1\\ \vdots & & & & & & & & \vdots\\ p_{n1} q_{n1} & p_{n2} q_{n1} & q_{n1} & p_{n1} q_{n2} & p_{n2} q_{n2} & q_{n2} & p_{n1} & p_{n2} & 1 \end{pmatrix} \begin{pmatrix} f_{11} \\ f_{12} \\ f_{13} \\ f_{21} \\ f_{22} \\ f_{23} \\ f_{31} \\ f_{32} \\ f_{33} \end{pmatrix} = 0 $$, where $p_{ik}$ denotes the k-th coordinate of the i-th point in image A (and $q_{ik}$ likewise in image B).
In general, we don't have perfect matches, and when n > 8 we usually have null(A) = {0}, so no exact nonzero solution exists. Instead of computing the null space, we take the SVD of A:
$$ A = USV^T $$ and take the column of V associated with the smallest singular value as our solution for x (i.e., the entries of F). To enforce the rank-2 condition, we compute the SVD of F, set its smallest singular value to 0, and recompose F from the updated singular values. (Since the discarded singular value is the smallest one, this step introduces only a small error.)
The code is shown below.
n = size(Points_a,1);
assert(n >= 8);  % the eight-point algorithm needs at least 8 correspondences
A = zeros(n,9);
for i = 1:n
    u = Points_a(i,:);  % point p in image A
    v = Points_b(i,:);  % point q in image B
    A(i,:) = [u(1)*v(1), u(2)*v(1), v(1), u(1)*v(2), u(2)*v(2), v(2), u(1), u(2), 1];
end
[~,~,V] = svd(A);
f_values = V(:,end);  % right singular vector for the smallest singular value
F_matrix = reshape(f_values, 3, 3)';
% Enforce rank 2: zero the smallest singular value of F and recompose
[U,S,V] = svd(F_matrix);
S(3,3) = 0;
F_matrix = U*S*V';
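The rank-2 enforcement step can also be sketched in numpy (the matrix below is an arbitrary full-rank example, not an actual fundamental matrix from this project):

```python
import numpy as np

F = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 10.0]])  # made-up full-rank 3x3 matrix

U, s, Vt = np.linalg.svd(F)
s[-1] = 0.0                       # zero out the smallest singular value
F2 = U @ np.diag(s) @ Vt          # closest rank-2 matrix in Frobenius norm
print(np.linalg.matrix_rank(F2))  # -> 2
```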
I wrote the RANSAC part to perform the standard loop: repeatedly sample 8 random correspondences, estimate F from the sample with the eight-point code above, count the inliers whose epipolar error |q^T F p| falls below a threshold epsilon, and keep the F with the most inliers.
By trial and error, I set the iteration count N = 1000 and epsilon = 1e-2.
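A minimal sketch of such a RANSAC loop in Python/numpy (illustrative only; `fit_fn` stands in for an eight-point estimator like the one above, and all names here are mine):

```python
import numpy as np

def ransac_fundamental(pts_a, pts_b, fit_fn, n_iters=1000, eps=1e-2, seed=0):
    """RANSAC sketch: sample 8 pairs, fit F, keep the F with the most inliers."""
    rng = np.random.default_rng(seed)
    n = len(pts_a)
    ha = np.hstack([pts_a, np.ones((n, 1))])  # homogeneous coordinates, image A
    hb = np.hstack([pts_b, np.ones((n, 1))])  # homogeneous coordinates, image B
    best_F, best_inliers = None, np.zeros(n, dtype=bool)
    for _ in range(n_iters):
        idx = rng.choice(n, size=8, replace=False)       # minimal sample
        F = fit_fn(pts_a[idx], pts_b[idx])               # fit F to the sample
        err = np.abs(np.einsum('ij,jk,ik->i', hb, F, ha))  # |q^T F p| per pair
        inliers = err < eps
        if inliers.sum() > best_inliers.sum():
            best_F, best_inliers = F, inliers
    return best_F, best_inliers
```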
I normalized the coordinates by the following procedure:
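One common procedure, which I assume here for illustration, is Hartley-style normalization: translate the points to zero mean and scale them so the mean distance from the origin is sqrt(2). A numpy sketch:

```python
import numpy as np

def normalize_points(pts):
    """Hartley-style normalization (an assumed, standard choice):
    zero-mean the points and scale so the mean distance from the origin is sqrt(2)."""
    centroid = pts.mean(axis=0)
    centered = pts - centroid
    mean_dist = np.linalg.norm(centered, axis=1).mean()
    s = np.sqrt(2) / mean_dist
    T = np.array([[s, 0, -s * centroid[0]],   # 3x3 transform acting on
                  [0, s, -s * centroid[1]],   # homogeneous coordinates
                  [0, 0, 1.0]])
    return centered * s, T
```

If F is estimated on normalized points, it must then be mapped back to pixel coordinates, conventionally via F = T_b' * F_norm * T_a.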
The projection matrix is:
$$ \begin{pmatrix} 0.7679 & -0.4938 & -0.0234 & 0.0067 \\ -0.0852 & -0.0915 & -0.9065 & -0.0878 \\ 0.1827 & 0.2988 & -0.0742 & 1.0000 \end{pmatrix} $$ The total residual is 0.0445.
The estimated camera location is (-1.5126, -2.3517, 0.2827).
The figures for the estimated projection matrix of the simple example.
The figures for the estimated fundamental matrix of the simple example, using all 20 pairs of points.
We obtained perfect matching in the Mount Rushmore example by using RANSAC, which probabilistically eliminates the effects of noise and outliers.
We obtained perfect matching in the Notre Dame example as well.
This one is harder, with many outliers. This is partly due to the difference in scale: image B is much smaller than image A, so its point coordinates span a smaller range, and the scale mismatch leads to large numerical error.
After normalizing all data points, the accuracy improved because the numerical error was reduced.