

Pattern Recognition: Face Recognition Using Linear Discriminant Analysis (LDA) and the Generalized Singular Value Decomposition (GSVD)

Kihwan Kim, Sangmin Lee, Prof. James M. Rehg
Georgia Institute of Technology


1. Introduction 


We built a face recognition system using Linear Discriminant Analysis (LDA) and the Generalized Singular Value Decomposition (GSVD). We trained on several sets of faces from the Yale face database. To put the training data under uniform conditions, we also implemented a semi-automatic crop application as a substitute for a full face detection step. LDA is used to cluster the classes (one class per person) and GSVD to reduce the dimension of the data [1]. Moreover, GSVD yields a more general solution to the LDA criterion than conventional LDA or the PCA/LDA combination [2]. For the classification and decision step, we apply weighted k-Nearest Neighbor (WKNN) classification.



2. Application overview 


The system was developed with Microsoft Visual Studio 6.0 and Intel OpenCV. A second version was developed in MATLAB to cross-check the data in the C++ version.


Fig1. Application overview 
(1) Cropping the face data
We assume that face detection has already been done. Because the output of the detection step is both very important and very sensitive for the recognition step, we built a semi-automatic crop application: the user only clicks the two eye points in the image, and the application automatically rotates and crops the face according to the distance between the eyes. In our tests the crop area is 60 by 50 pixels, so each data point has 3000 dimensions. The initial number of training classes is 3; we chose three classes so that the results can be analyzed in a two-dimensional space, for reasons discussed later.
(2) Preprocessing
After cropping the face images, we applied Contrast Limited Adaptive Histogram Equalization (CLAHE) to enhance and normalize the input data. 
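The eye-click geometry of the cropping step can be sketched as follows. This is a minimal Python stand-in, not the actual C++/OpenCV code; the fraction of the crop width spanned by the eyes (`eye_frac`) is an assumption, since the report fixes only the 60-by-50 output size.

```python
import math

def crop_params(left_eye, right_eye, crop_w=50, eye_frac=0.6):
    """Rotation angle (degrees) and scale factor that level the eyes and
    fit the face into the fixed-size crop.
    eye_frac -- assumed fraction of the crop width spanned by the eyes."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = math.degrees(math.atan2(dy, dx))          # rotate by -angle to level the eyes
    scale = (eye_frac * crop_w) / math.hypot(dx, dy)  # resize so the eye distance fits the crop
    return angle, scale
```

The returned angle and scale would feed a similarity warp (e.g., a rotation matrix) before extracting the 60-by-50 patch.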

(3) Building the G matrix (offline training)
From the preprocessed 3000-dimensional image data, which vary in lighting but share roughly the same viewpoint (mostly frontal faces), we compute the G matrix that maps training and testing data into a reduced-dimensional space.
(4) Online test
We then test further face images from each class, with varying expressions and slightly different poses. After a test image is mapped into the reduced-dimensional space by G, the mapped point is classified by weighted k-Nearest Neighbor.
Fig2. Training Set examples
Fig3. Testing Image examples 

3. Algorithm 


Here we describe the LDA/GSVD algorithm. Conventional LDA works with the within-cluster, between-cluster, and mixture scatter matrices: from these one finds a transformation that maximizes the between-cluster scatter while minimizing the within-cluster scatter. The LDA/GSVD approach differs slightly from the conventional method, even though the internal computation is similar.
The criterion function of LDA/GSVD is
J(G) = max trace((G^{T}S_{W}G)^{-1}(G^{T}S_{B}G)),
which simultaneously maximizes trace(G^{T}S_{B}G) and minimizes trace(G^{T}S_{W}G). To maximize this function without computing the inverse of S_{W}, we use LDA/GSVD. We prove why the following algorithm achieves this goal in the discussion section.



The LDA/GSVD algorithm is as follows.
1. From the data set A = [d_{1}, d_{2}, ..., d_{n}] and the cluster centroids C = [c_{1}, c_{2}, ..., c_{k}] (there are k clusters and n data points), compute
H_{B} = [sqrt(n_{1})(c_{1} - c), sqrt(n_{2})(c_{2} - c), ..., sqrt(n_{k})(c_{k} - c)] and
H_{W} = [d_{1} - c_{1}, ..., d_{n} - c_{k}] (each data point minus its own cluster centroid),
where c is the global centroid and n_{i} is the number of data points in cluster i. Note that S_{B} = H_{B}H_{B}^{T} and S_{W} = H_{W}H_{W}^{T}.
2. Stack H_{B}^{T} and H_{W}^{T} to create K = (H_{B}^{T}; H_{W}^{T}).
3. Perform a singular value decomposition of K: K = P(R 0; 0 0)Q^{T}.
4. With t = rank(K), take the first k rows and first t columns of P: M = P(1:k, 1:t).
5. Perform an SVD of M: M = UE_{A}W^{T}.
6. Compute X from the two SVDs: X = Q(R^{-1}W 0; 0 I).
7. Take the first k-1 columns of X as G: G = X(:, 1:k-1).
8. Multiply each training datum by G^{T} to obtain the dimension-reduced data and train the k-nearest-neighbor classifier; then map each test datum with G^{T} in the same way and classify it.
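The steps above can be sketched in NumPy. This mirrors the MATLAB cross-check version rather than the C++/OpenCV implementation; the function name and rank threshold are ours.

```python
import numpy as np

def lda_gsvd(A, labels):
    """Sketch of steps 1-8: A is an m x n data matrix (one column per image),
    labels is a length-n array of class ids. Returns G of shape (m, k-1)."""
    m, n = A.shape
    classes = np.unique(labels)
    k = len(classes)
    c = A.mean(axis=1, keepdims=True)                      # global centroid
    HB_cols, HW_cols = [], []
    for cl in classes:
        Acl = A[:, labels == cl]
        ccl = Acl.mean(axis=1, keepdims=True)              # cluster centroid c_i
        HB_cols.append(np.sqrt(Acl.shape[1]) * (ccl - c))  # sqrt(n_i)(c_i - c)
        HW_cols.append(Acl - ccl)                          # d_j - c_i
    HB, HW = np.hstack(HB_cols), np.hstack(HW_cols)
    K = np.vstack([HB.T, HW.T])                            # step 2: K = (H_B^T; H_W^T)
    P, s, QT = np.linalg.svd(K)                            # step 3: K = P (R 0; 0 0) Q^T
    t = int(np.sum(s > 1e-10 * s[0]))                      # t = rank(K)
    M = P[:k, :t]                                          # step 4
    U, sM, WT = np.linalg.svd(M)                           # step 5: M = U E_A W^T
    X = QT.T.copy()                                        # step 6: X = Q (R^{-1}W 0; 0 I)
    X[:, :t] = QT.T[:, :t] @ np.diag(1.0 / s[:t]) @ WT.T
    return X[:, :k - 1]                                    # step 7: G = X(:, 1:k-1)
```

Applying G^{T} to a 3000-dimensional image vector then yields the (k-1)-dimensional point used in the online test.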



4. Results 


(1) Application performance (speed) Processing speed matters in a real application. Our LDA/GSVD implementation applies the SVD twice rather than using a QR decomposition. The total time to compute the G matrix is approximately 15 s on a Pentium 4 3.4 GHz with 1 GB RAM for the 3-class, 7-image case, whereas the MATLAB version takes about 5 min under the same conditions.
(2) Offline training results
Fig4 shows the 3-class / 2-dimensional map. We chose three classes because LDA/GSVD then reduces the dimension to two, so the clustering of the data can be observed directly in a two-dimensional graph. 


Fig4. 3Class/ 2Dimension Map 


However, we found that with the number of classes fixed, clustering is better when each class contains between 5 and 7 images. Since we were not sure whether this generalizes, we also tested images with very simple features, shown in Fig5. These results again show that a certain class size gives a more desirable separation between the classes, but we cannot claim this is a general phenomenon. We presume it is due to the low-dimensional subspace: because LDA/GSVD reduces the dimension from 3000 to k-1, where k is the number of classes, the (k-1)-dimensional subspace becomes too small for larger per-class data sets. We therefore assume 5 to 7 images per class is an appropriate number of data to fit in the space. 


Fig5. Images with simple features 


(3) Online test results As shown in Fig6, the blue, green, and red points each represent one class (person): blue is class 0, green is class 1, and red is class 2. When a test image arrives, its 3000-dimensional vector is multiplied by G^{T} and mapped into the 2-dimensional reduced space; the mapped point is drawn in white. The nearest neighbors are then computed. 


Fig6. Applying weighted k-Nearest Neighbor
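The weighted k-NN vote in the reduced space can be sketched as below. The inverse-distance weighting is an assumption on our part; the report does not spell out the exact weight function used.

```python
import numpy as np

def wknn_classify(y, train_pts, train_labels, k=5):
    """Classify reduced-space point y by a weighted vote of its k nearest
    training points; each neighbor votes with weight 1/distance (assumed)."""
    d = np.linalg.norm(train_pts - y, axis=1)   # distances to all training points
    scores = {}
    for i in np.argsort(d)[:k]:                 # k nearest neighbors
        w = 1.0 / (d[i] + 1e-12)                # inverse-distance weight
        scores[train_labels[i]] = scores.get(train_labels[i], 0.0) + w
    return max(scores, key=scores.get)          # class with the highest score
```

Here `train_pts` holds the G^{T}-mapped training images (one row per point) and `y` is the white test point of Fig6.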



Fig7. Test of class 0 ( First person ) 


Fig7 shows the first result: the test images are the 16 images of class 0 (the first person), of which images 0 to 6 were already used in the offline training step. These 16 images include different poses and expressions. As the training result suggests, the first person's data (blue points) are well separated from the other two people, and the result is very good: class 0 received the highest score in every test, so the success rate is 100% and the FAR is 0% here. 


Fig8. Test of class 1 (Second person) 


For the second person, the data distribution in two dimensions is somewhat close to that of the third person. We found that, with certain expressions, the second and third person look slightly alike even to the human eye, because we crop only the inner face area rather than the face outline. Nevertheless, during weighted k-Nearest Neighbor classification the correct class received the higher score, so the final result is also good. 


Fig9. Test of class 2 (Third person) 


For the third person, the 11th and 15th images were misclassified, so his success rate is 83% and his FAR is 0.13. The overall FAR is therefore 4.167% for the 3-class, 7-image case. 





5. Discussion



We implemented and tested a face recognition algorithm based on LDA/GSVD. One of the most impressive properties of this algorithm is that it clusters well even when the reduced dimension is only two, and the online test results were also good, with a low FAR. However, more tests with more data are needed. Because we tested only the 3-class case, in order to visualize the result in two dimensions, we should also test four or more classes even though the data distribution is then harder to see. We should also look for a more principled way to choose the number of images per class.
In the final stage of the online test we used weighted k-Nearest Neighbor to classify the input. Thanks to the good clustering from the training step and the low mapping dimension, KNN performed very well. But when the mapping dimension is higher and we cannot inspect the data distribution (the clustering result), it is not clear that KNN will still give desirable results. In future work we should therefore not only test more classes but also try other classification methods in the final step.
Finally, we give a more detailed explanation of the algorithm: why we use LDA/GSVD rather than conventional LDA, how LDA/GSVD differs from it, what the K matrix is, and why the Generalized Singular Value Decomposition is needed. 





The ultimate goal of LDA/GSVD is to maximize the distance between clusters while minimizing the distance between elements of the same cluster (i.e., to maximize the between-cluster scatter and minimize the within-cluster scatter). We therefore seek a matrix G that, applied to both scatter matrices, achieves this goal while also reducing the dimension of each data point. Arithmetically:
max trace(G^{T}S_{B}G) and min trace(G^{T}S_{W}G), i.e., max trace((G^{T}S_{W}G)^{-1}(G^{T}S_{B}G)),
where S_{W}, S_{B} ∈ R^{m×m} are the within-cluster and between-cluster scatter matrices (m is the data dimension). What the algorithm finds is the matrix G in this expression.
Which G maximizes the between-cluster scatter while minimizing the within-cluster scatter? P. Howland and H. Park [1] have shown that this G is given by the eigenvectors associated with the t largest eigenvalues of S_{W}^{-1}S_{B}, where t = rank(S_{B}). The proof is as follows. According to Golub and Van Loan [3], there exists a nonsingular matrix X that satisfies
X^{T}S_{B}X = Λ = diag(λ_{1}, ..., λ_{m}) and X^{T}S_{W}X = I_{m}
Let x_{i} be the i-th column of X; then
S_{B}x_{i} = λ_{i}S_{W}x_{i}, i.e., S_{W}^{-1}S_{B}x_{i} = λ_{i}x_{i}
so λ_{i} is an eigenvalue of S_{W}^{-1}S_{B} with eigenvector x_{i}. When t = rank(S_{B}), only λ_{1}, ..., λ_{t} can be positive. The criterion function satisfies
trace((G^{T}S_{W}G)^{-1}(G^{T}S_{B}G)) = trace((G^{T}X^{-T}X^{-1}G)^{-1}(G^{T}X^{-T}ΛX^{-1}G)) = trace((G_{M}^{T}G_{M})^{-1}(G_{M}^{T}ΛG_{M})),
where G_{M} = X^{-1}G. If G has full rank, G_{M} also has full rank, so it has a reduced QR factorization G_{M} = QR, where Q ∈ R^{m×l} has orthonormal columns and R is nonsingular. Then
trace((G_{M}^{T}G_{M})^{-1}(G_{M}^{T}ΛG_{M})) = trace((R^{T}Q^{T}QR)^{-1}R^{T}Q^{T}ΛQR) = trace((R^{T}R)^{-1}R^{T}Q^{T}ΛQR) = trace(R^{-1}Q^{T}ΛQR) = trace(Q^{T}ΛQRR^{-1}) = trace(Q^{T}ΛQ)
max trace((G^{T}S_{W}G)^{-1}(G^{T}S_{B}G)) = max trace(Q^{T}ΛQ) ≤ λ_{1} + ... + λ_{t} = trace(S_{W}^{-1}S_{B})
and the bound is attained by Q = (I; 0), i.e., G = X(I; 0)R.
Because the trace criterion is unchanged when G is multiplied on the right by any nonsingular matrix (Howland and Park, 2004), the maximum is attained by the eigenvectors of S_{W}^{-1}S_{B} with the l largest eigenvalues; in other words, the first l columns of the X above form a G satisfying the criterion. However, we cannot always find G by computing eigenvectors of S_{W}^{-1}S_{B}: when there are not enough data compared to the number of features, the scatter matrix becomes singular, so the inverse of S_{W} does not exist. Many researchers have therefore sought G without inverting S_{W}. Two-stage algorithms first apply PCA or LSI to obtain a nonsingular S_{W} and then use conventional LDA; their drawback is that "experimentation has been needed to determine which intermediate reduced dimension produces optimal results after the second stage," and moreover PCA and LSI do not consider the clustered structure of the data [1]. In our project we used the GSVD to address this singular-data problem: the GSVD finds the eigenvectors of S_{W}^{-1}S_{B} without forming S_{W}^{-1}.
Let H_{B} and H_{W} be such that S_{B} = H_{B}H_{B}^{T} and S_{W} = H_{W}H_{W}^{T}, let K = (H_{B}^{T}; H_{W}^{T}), and let t = rank(K). The GSVD of the pair gives
U^{T}H_{B}^{T}Q = Σ_{A}(W^{T}R, 0) and V^{T}H_{W}^{T}Q = Σ_{B}(W^{T}R, 0),
where Σ_{A} and Σ_{B} are diagonal. Multiplying each side by the inverse of (W^{T}R, 0), extended by the identity, yields
U^{T}H_{B}^{T}X = (Σ_{A}, 0) and V^{T}H_{W}^{T}X = (Σ_{B}, 0), where X = Q(R^{-1}W 0; 0 I)
From these equations we obtain
H_{B}^{T} = U(Σ_{A}, 0)X^{-1} and H_{W}^{T} = V(Σ_{B}, 0)X^{-1}
S_{B} = H_{B}H_{B}^{T} = X^{-T}(Σ_{A}^{T}Σ_{A} 0; 0 0)X^{-1} and S_{W} = H_{W}H_{W}^{T} = X^{-T}(Σ_{B}^{T}Σ_{B} 0; 0 0)X^{-1}
Since Σ_{A} and Σ_{B} are diagonal, write Σ_{A} = diag(a_{1}, ..., a_{t}) and Σ_{B} = diag(b_{1}, ..., b_{t}); then Σ_{A}^{T}Σ_{A} = diag(a_{1}^{2}, ..., a_{t}^{2}) and Σ_{B}^{T}Σ_{B} = diag(b_{1}^{2}, ..., b_{t}^{2}) are also diagonal. Multiplying the equations above on the right by X gives
S_{B}X = X^{-T}(Σ_{A}^{T}Σ_{A} 0; 0 0) and S_{W}X = X^{-T}(Σ_{B}^{T}Σ_{B} 0; 0 0)
Taking the i-th column of each formula (x_{i} being the i-th column of X and e_{i} the i-th standard basis vector), for i ≤ t,
(1/a_{i}^{2})S_{B}x_{i} = X^{-T}e_{i} = (1/b_{i}^{2})S_{W}x_{i}, i.e., S_{B}x_{i} = (a_{i}^{2}/b_{i}^{2})S_{W}x_{i}
Since the inverse of H_{W}H_{W}^{T} may not exist, we cannot move S_{W} to the other side; nevertheless, the formula shows that a_{i}^{2}/b_{i}^{2} plays the role of an eigenvalue of (H_{W}H_{W}^{T})^{-1}H_{B}H_{B}^{T} with eigenvector x_{i}. Thus the X obtained by the algorithm generalizes the eigenvectors of S_{W}^{-1}S_{B}, and its leading columns form the G that achieves the goal of the LDA criterion.
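The simultaneous diagonalization claimed above can be checked numerically with a small random example (the sizes are arbitrary; this only illustrates the construction, not our face data):

```python
import numpy as np

# X from the GSVD construction simultaneously diagonalizes
# S_B = H_B H_B^T and S_W = H_W H_W^T.
rng = np.random.default_rng(1)
m, k, n = 6, 3, 12
HB = rng.standard_normal((m, k))           # stand-in for the between-cluster factor
HW = rng.standard_normal((m, n))           # stand-in for the within-cluster factor
K = np.vstack([HB.T, HW.T])                # K = (H_B^T; H_W^T)
P, s, QT = np.linalg.svd(K)                # K = P (R 0; 0 0) Q^T
t = int(np.sum(s > 1e-10 * s[0]))          # t = rank(K) (= m for generic data)
M = P[:k, :t]                              # first k rows, first t columns of P
U, sM, WT = np.linalg.svd(M)               # M = U E_A W^T
X = QT.T.copy()                            # X = Q (R^{-1}W 0; 0 I)
X[:, :t] = QT.T[:, :t] @ np.diag(1.0 / s[:t]) @ WT.T
DB = X.T @ (HB @ HB.T) @ X                 # should be diag(a_i^2) padded with zeros
DW = X.T @ (HW @ HW.T) @ X                 # should be diag(b_i^2) padded with zeros
off = lambda D: np.abs(D - np.diag(np.diag(D))).max()
print(off(DB), off(DW))                    # off-diagonal mass of both results
```

Both off-diagonal residuals come out at machine-precision level, and DB + DW equals the identity on the first t coordinates, matching Σ_{A}^{T}Σ_{A} + Σ_{B}^{T}Σ_{B} = I from the GSVD.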



6. References



[1] P. Howland and H. Park, "Generalizing Discriminant Analysis Using the Generalized Singular Value Decomposition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004. 

[2] P. Howland and H. Park, "Equivalence of Several Two-stage Methods for Linear Discriminant Analysis," SIAM International Conference on Data Mining, 2004. 

[3] G. Golub and C. Van Loan, Matrix Computations, Johns Hopkins Studies in the Mathematical Sciences, 1996. 








