Pattern Recognition : Face Recognition Using Linear Discriminant Analysis(LDA) and Generalized   Singular Vector Decomposition(GSVD)   Kihwan Kim Face Recognition using LDA/GSVD   Kihwan Kim        Sangmin Lee Prof. James M. Regh Georgia Institute of Technology 1. Introduction We made face recognition system using Linear Discriminant Analysis (LDA) and Generalized Singular Vector Decomposition. (GSVD) We  trained some sets of face data from Yale face data set. To make the training set in same condition, we also implemented semi-auto  crop application which is substitution for perfect face detection application. We will use LDA for clustering each class (each person) and  GSVD for reducing dimension of the data.[1] However, GSVD itself enables getting more general solution from LDA than conventional LDA or PCA/LDA combination.[2] For classification and decision step, we will apply Weighted K-Nearst Neighborhood. (WKNN). 2. Application overview - Developed on Microsoft Visual Studio 6.0 and Intel OpenCV.   - Developed one more version on MATLAB for checking the data in the C++ version. Fig1 Application overview (1) Cropping the face data. We assume that face detection is already done. Because the result of the detection step is both very important and very sensitive to recognition step, we made a semi-auto crop application. What we need to do is only clicking two eye points in the image. Then, the application automatically rotates and crops corresponding to the distance between eyes. In our test, the size of crop area is 60 by 50 pixels. So the dimension of each data is 3000 dimension. And the initial number of class for training is 3. The reason why we choose the class number as 3 is to analyze the result in 2 dimensional spaces. Later we will talk about why this happen. (2) Preprocessing After cropping the face images, we applied Contrast Limited Additive Histogram Equalization ( CLAHE ) to enhance and normalize the input data. (3) Making G matrix and Off-line training With every pre-processed 3000 dimensional image data which has various lighting conditions but has same view point, mostly frontal view of face images, we calculated G matrix mapping the training and testing data to certain reduced dimension. (4) Online Test We tested other face images from each class which has various expressions and slightly different pose. After these data converted to reduced dimensional space by G matrix, the mapped point is classified to certain class by applying Weighted K-Nearst Neighborhood Fig2. Training Set examples   Fig3. Testing Image examples 3. Algorithm Here we describe the algorithm for LDA/GSVD. Conventional LDA is performed by using within-cluster, between-cluster and mixture scatter matrices. With this matrices, we could find the matrices which maximize the between cluster and minimize within-cluster. However, LDA/GSVD approach is slightly different from conventional method even though the internal calculation is similar.   Criteria function of LDA/GSVD is                             J(G) = Max trace(GTSBG) and Min trace(GTSWG)                                           = Max trace ((GTSWG )-1(GTSBG))                       In order to maximize this function without finding inverse of SW, we use LDA/GSVD. We will prove why following algorithm achieve this goal in discussion section. The algorithm of LDA/GSVD is as follows.   1. Compute HB and HW from data set A = [d1, d2 … dn] and centroid of each cluster C = [c1, c2, ….. ck]     by (there are k clusters and     n data) HB = [d1 – c1 ….. dn - ck] HW  = [sqrt(n1)(c1 – c),sqrt(n2)(c2 – c) ……, sqrt(nk)(ck – c)]               where c is global centroid and ni is number of data in the cluster i   2. Concatenate HB and HW to create K K = (HBT; HWT) 3. Perform singular value decomposition on K K = P(R 0; 0 0)QT 4. When t = rank(K), get first k rows and first t columns of P, M = P(1:k, 1:t) 5. Perform SVD on M M = UEAWT 6. Compute X from the elements that we have gotten from both SVD computation X = Q(R-1W 0; 0 I), 7. Obtain first k-1 columns of X and assign it to G G = X(:,1:t) 8. With the G obtained from above multiply each training data to get the dimension reduced data and run it in K nearest neighbor classification, then train the classifier. Then again multiply G with each test data set and with the test data set obtained from multiplication test the data. 4. Result (1) Application performance (Speed) In a certain application, the processing speed is also very important issue. In our LDA/GSVD algorithm, we used two times of SVD is applied rather than using Q-R Decomposition. However, the total elapsed time to perform making G matrix is approximately 15 sec in P4 3.4GHZ,1G for 3 class, 7 images  case while Matlab version has 5 min for same condition.   (2) Off-line training results   The Fig4 shows us the result of 3-Class/ 2 Dimensional map. The reason why we chose the number of class as three was that we could get the reduced dimension as two when we apply three-class case to LDA/GSVD. So we can easily observe the data distribution of clustering in the two dimensional graph. Fig4. 3-Class/ 2-Dimension Map However, we found out that if number of class is fixed, the clustering result is better when number of data in each class is assigned between 5 to 7.  Because we are not sure whether it could be generalized or not, we tested with some images that possesses very simple features. It is shown in Fig5. These results show us again that in certain size of class has more desirable clustering result between each class. But, we cannot say it is general phenomena. We just presume that it is due to low dimensional subspace. Because LDA/GSVD reduced the dimension from 3000 to k -1 dimension where k is number of class, this k – 1 subspace is too small in larger data set in each class. Thus we assume that 5 to 7 is appropriate number of data to fit in the space. Fig5. Images with simple features (3) On-line test results As shown in the Fig6. each Blue, Green, Red point represent each class or person. Blue is class 0, Green is class 1 and Red is class2. When one test image is coming, this 3000 dimensional data hit the G matrix and is mapped into this 2 dimensional reduced space. This point is represented as white point. Then, Calculate the nearst neighborhood. Fig6. Applying Weighted K-Nearst Neighborhood Fig7. Test of class 0 ( First person ) Data in the Fig7 shows the first result, Test image is 16 image of class 0 or first person. In the off-line training step, 0 to 6 images are already used. These 16 images also include different pose and expression. As shown in the training result, first person (blue point)’s data is well separated from two other people. The result is pretty nice. He got highest score from every testing. So Success rate is 100% and FAR is 0 % at this moment. Fig8. Test of class 1 (Second person) In second person, his data distribution in the 2 dimension is somewhat close to 3rd person. We found out that, in certain expression, second and third person is slightly look alike even in our eye’s perception due to crop the only within face area, not in shape of face. However, during the K-Nearst Neighborhood, he got higher score. Thus, final result is also pretty nice. Fig. Test of class 2 (Third person) As for 3rd person case, he got some mis-classification result in 11th and 15th image. So his success rate is 83% and FAR is 0.13 So overall FAR is 4.167% at 3 class-7image cases. 4. Discussion We implemented and tested the face recognition algorithm using LDA/GSVD. One of the most impressive things in this algorithm is that we can cluster well even when reduced dimension is only two. Furthermore, the result of online test was also pretty good with low FAR. However, we need to apply more tests with more data. And because we tested only 3 class case to see the result easily in 2 dimensions, we should test more than 4 class case even though we could not see the data distribution easily. One more thing we should consider is to find out more general way to select the number of image data in certain class. In the final stage of the online test, we used weighted K-Nearst  Neighborhood to classify the input test data. Due to the nice clustering result from training step and low mapping dimension,  K-NN itself performed very well. But if the mapping dimension is higher and we cannot see the real data distribution ( clustering result ), we are not sure to get a desirable result from it by K-NN. Thus, in future work, we should not only apply more class case but also try to use other classification method in final step. Finally, we will discuss about more specific explanation about the algorithm. Why we use LDA/GSVD rather than conventional LDA, what LDA/GSVD is different from conventional LDA, what is K matrix and why we should use Generalized Singular Vector Decomposition. Ultimate goal of LDA/GSVD is maximization of distance between each cluster from others, and minimizing the distance between elements in same cluster (i.e. maximizing the value of between cluster scatter matrix and minimizing the value of within cluster scatter matrix). So in LDA/GSVD we want to find a matrix by multiplying it with both scatter matrices, as well as reducing the dimension of each data, to reach the ultimate goal the described above. In arithmetical expression can be written as follows Max trace(GTSBG) and Min trace(GTSWG) = Max trace ((GTSWG )-1(GTSBG))   Where SW =>  Rnxm, Sb=>Rpxn are the within cluster and between cluster matrix. What we are trying to find through this algorithm is the matrix G from the mathematical expression above.   What is the G that maximize the between scatter matrix and minimize the within scatter matrix? P. Howland and H. Park[1]have shown that G matrix above is same as the first t eigenvectors associated with t largest eigenvalues (i.e. in SVD of K, K = USVT, first t columns of matrix U) where t is number of rank of SW. The proof is as follows According to Golub and Van Loan[3], there exists a nonsingular matrix X that satisfies   XTSBX = L = diag(l1……..lm) and XTSWX  = Im   Let xi be ith column of X, then we have   SBxi = liSWxi SW-1SB = lix   From the mathematical expression above we can see that lI is the eigenvalue and xi is the eigenvector of SW-1SB. When t = rank(SB), then only up to tth l can be positive. The criteria function   trace ((GTSWG )-1(GTSBG)) = trace ((GTX-TX-1G )-1(GTX-TL X-1G)) = trace ((GMTX-1GM )-1(GTML GM)),   where GM = X-1G. If G has full rank, GM also has full rank so that it has reduced QR factorization which is GM = QR, where Q Î Rmxl has orthonormal columns and R is nonsingular. Then we have   trace ((GMTX-1GM )-1(GTML GM)) = trace ((RTQTQR)-1 RTQTLQR) = trace ((RTR)-1 RTQTLQR) = trace (R-1QTLQR) = trace (QTLQR R-1) = trace (QTLQ)   max trace ((GTSWG )-1(GTSBG)) = max trace (QTLQ) £ l1+ ……+lt = trace (SW-1SB)   This can be achieved by Q = (I; 0) and G = X(I; 0) R   Even though we transform G with various matrices we all get the same traces for XG and G (Howland and Park, 2004). Therefore max trace((GTSWG )-1(GTSBG)) is the l largest eigenvectors of SW-1SB, which implies that l columns of X from the (XTSBX = L = diag(l1……..lm) and XTSWX  = Im) is the G that satisfies the criteria function. However, we cannot always find G by finding first l eigenvectors of SW-1SB, because when there are not enough data compare to number of feature space scatter matrix becomes singular, thus finding inverse of SW impossible.  So many have explored to find the matrix G without obtaining inverse of SW. There were two step algorithm which is, at first, use PCA or LSI to obtain the nonsingular SW then use the conventional LDA. However, drawbacks of those algorithm was “experimentation has been needed to determine which intermediate reduced dimension produces optimal results after the second stage.” Moreover, PCA and LSI does not consider the clustered structure of the data[1] In our project we have used GSVD to exploit this problem of sparse data matrix. GSVD enables it to find eigenvector of SW-1SB without finding SW-1.   Let HB and HW be SB  = HBTHB  and SW = HWTHW. Let matrix K be K = (HBT ;HWT)  and t = rank(K) UTHBQ = åA (WTR, 0) and VTKBQ = åB(WTR,0)Where åA  and åB are the diagonal matrices Multiply each side with inverse of (WTR, 0) to obtain UTHBX = åA and VTKBX = åB Where X = Q(R-1W 0; 0 I)   With this equation we formulate as follows. HB = UåAX-1 and Hw = VåBX-1 HBTHB = X-T(åATåA 0; 0 0)X-1 and HWTHW = X-T(åBTåB 0; 0 0)X-1   Since åA, and åB are diagonal, it can be written as   åA = diag (a1 ….. at … ) åB = diag (b1 … bt  …..)   åATåA and åBTåB are also diagonal. åATåA = diag (a12 ….. at2 … ) åBTåB = diag (b12 … bt2 …..) With the equation we have above we can get HBTHB (åATåA 0; 0 0)-1X= X-T and HWTHW(åBTåB 0; 0 0) -1X = X-T HBTHB (åATåA 0; 0 0)-1X = HWTHW(åBTåB 0; 0 0) -1X   Each column of the fomula will be (xi being the ith column of X)   1/ai HBTHB xi = 1/bi HWTHW xi i HBTHB xi  = ai/bi HWTHW xi   Since we do not have inverse of HWTHW, we cannot send HWTHW to other side however, the formula shows that ai/bI is eigenvalues of matrix (HWTHW)-1HBTHB and xi is eigenvector of the matrix. This shows that X that we obtained from the algorithm is the eigenvectors of (HWTHW)-1HBTHB and first t columns of X, which is G that achieves the goal of the LDA algorithm. 5. Refrences [1] Generalizing Discriminant Analysis Using the Generalized Singular Value Decomposition (P. Howland and H.Park) IEEE Transactions on Pattern Analysis and Machine Intelligence [2] Equivalence of Several Two-stage Methods for Linear Discriminant Analysis (P. Howland and Haesun Park) SIAM International Conference on Data Mining 2004 [3] Matrix Computations (G. Golub and F.Van) Johns Hopkins Studies in Mathematical Sciences 1996