Dimensionality Reduction Using Random Projection

Dimensionality Reduction

Typically, problems arise when performing pattern recognition tasks in high-dimensional spaces (i.e., the “curse of dimensionality”). Significant improvements can be achieved by first mapping the data into a lower-dimensional subspace. Applying dimensionality reduction to a vector <math>x=[x_{1},x_{2},...,x_{N}]^{T}</math> in an N-dimensional space yields another vector <math>y=[y_{1},y_{2},...,y_{K}]^{T}</math> in a K-dimensional space, where <math> K << N </math>. Dimensionality reduction techniques based on linear transformations have been very popular for determining the intrinsic dimensionality of the underlying manifold as well as for extracting its principal directions (i.e., basis vectors).

The most prominent method in this category is Principal Component Analysis (see module on Face Recognition Using PCA). PCA determines the basis vectors by finding the directions of maximum variance in the data, and it is optimal in the sense that it minimizes the error between the original image and the one reconstructed from its low-dimensional representation. This is equivalent to retaining as much as possible of the variation present in the original data. Its success has triggered significant research in pattern recognition, and many powerful dimensionality reduction techniques (e.g., Probabilistic PCA, Linear Discriminant Analysis (LDA), Independent Component Analysis (ICA), Local Feature Analysis (LFA), Kernel PCA) have been proposed for finding appropriate low-dimensional pattern representations [1].

Random Projection (RP) is a powerful method for dimensionality reduction [2] [3] [4] [5]. It is a computationally simple and efficient method that preserves the structure of the data without introducing significant distortion, and its most important property is that it is a general, data-independent reduction method. In RP, the original high-dimensional data is projected onto a low-dimensional subspace using a random matrix whose rows have unit length. In contrast to methods such as PCA, which compute a low-dimensional subspace by optimizing certain criteria (e.g., PCA finds a subspace that maximizes the variance in the data), RP does not use such criteria and is therefore data independent. A number of theoretical results support that RP approximately preserves pairwise distances between points in Euclidean space, volumes and affine distances, and the structure of the data (e.g., clustering).

Random Projection (RP)

We summarize below the main ideas behind RP. Let us suppose that we are given a set of vectors <math>\Gamma _{i}, i=1,2,...,M</math>, in a p-dimensional space. RP can be used to transform <math>\Gamma _{i}</math> to a lower dimension d, with d<<p, via the following transformation:

<math> \tilde{\Gamma}_{i}=R\Gamma _{i} </math>

where R is a <math>d \times p</math> matrix whose entries are realizations of independent and identically distributed (i.i.d.) zero-mean normal variables and whose rows are orthogonalized and scaled to unit length (see the algorithm below). RP is motivated by the Johnson-Lindenstrauss lemma [6], which states that a set of M points in a high-dimensional Euclidean space can be mapped down onto a <math>d \geq O(logM/ \epsilon ^{2})</math> dimensional subspace such that the distances between the points are approximately preserved (i.e., not distorted by more than a factor of <math>1 \pm \epsilon</math>, for any <math>0<\epsilon <1</math>). There is experimental evidence suggesting that this lower bound on d is not very tight and that, in practice, lower values of d give good results.
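As a rough illustration of the bound, the snippet below (a minimal sketch, assuming Python with scikit-learn available) uses the johnson_lindenstrauss_min_dim utility, which evaluates one standard form of the Johnson-Lindenstrauss bound, to show how the required dimension d grows as the allowed distortion <math>\epsilon</math> shrinks.

<syntaxhighlight lang="python">
# Sketch: evaluate the Johnson-Lindenstrauss lower bound on d for M points.
# The exact constant in the bound is implementation-specific.
from sklearn.random_projection import johnson_lindenstrauss_min_dim

M = 400  # e.g., the number of face images in the ORL database
for eps in (0.1, 0.3, 0.5):
    d = johnson_lindenstrauss_min_dim(n_samples=M, eps=eps)
    print(f"M = {M}, eps = {eps}: d >= {d}")
</syntaxhighlight>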

The entries of the matrix R can be calculated using the following algorithm:

Step 1. Set each entry of the matrix to an i.i.d. N(0, 1) value.

Step 2. Orthogonalize the d rows of the matrix (e.g., using the Gram-Schmidt orthogonalization algorithm).

Step 3. Normalize the rows of the matrix to unit length (this is important for preserving similarities in the low-dimensional space).

Orthogonalizing the random vectors is required to preserve the similarities between the original vectors in the low-dimensional space. In sufficiently high dimensions, however, it is possible to save computation time by skipping the orthogonalization step without significantly affecting the quality of the projection matrix. This is because, in high-dimensional spaces, there exist many more nearly orthogonal directions than strictly orthogonal ones; therefore, high-dimensional vectors with random directions are very likely to be close to orthogonal.
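The following is a minimal sketch of this construction, assuming NumPy. It orthonormalizes the rows of R with a QR decomposition, which yields the same orthonormal basis as Gram-Schmidt up to signs, and also shows the cheaper variant that only normalizes the rows.

<syntaxhighlight lang="python">
import numpy as np

def random_projection_matrix(d, p, orthogonalize=True, seed=None):
    """Generate a d x p random projection matrix R (d << p).

    Entries are drawn i.i.d. from N(0, 1) (Step 1). If requested, the d rows
    are orthogonalized (Step 2) and scaled to unit length (Step 3).
    """
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((d, p))
    if orthogonalize:
        # QR of R^T gives a p x d matrix Q with orthonormal columns;
        # its transpose is a d x p matrix with orthonormal rows.
        Q, _ = np.linalg.qr(R.T)
        R = Q.T
    else:
        # In high dimensions it is often sufficient to skip orthogonalization
        # and only normalize the rows, since random directions are nearly
        # orthogonal anyway.
        R /= np.linalg.norm(R, axis=1, keepdims=True)
    return R

# Example: project a p-dimensional vector onto a d-dimensional subspace.
p, d = 1764, 100
R = random_projection_matrix(d, p, seed=0)
x = np.random.default_rng(1).random(p)
y = R @ x   # d-dimensional representation
</syntaxhighlight>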

A simpler algorithm has been proposed by Achlioptas [7] for approximating the random matrix, yielding significant computational savings both when generating R and when computing the projections <math>R\Gamma_{i}</math>. The algorithm produces a sparse random matrix whose elements belong to {-1, 0, +1}, therefore simplifying the computations considerably.
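A minimal sketch of a sparse matrix in this spirit is shown below, assuming NumPy. Each entry is +1 with probability 1/6, 0 with probability 2/3, and -1 with probability 1/6; the sqrt(3/d) scaling factor is an assumption here, following the usual statement of the result, and keeps squared lengths preserved in expectation. scikit-learn's SparseRandomProjection provides a comparable, more general implementation.

<syntaxhighlight lang="python">
import numpy as np

def achlioptas_matrix(d, p, seed=None):
    """Sparse d x p random projection matrix in the spirit of Achlioptas [7].

    Entries take the values +1, 0, -1 with probabilities 1/6, 2/3, 1/6;
    the sqrt(3/d) factor rescales the projection so that squared lengths
    are preserved in expectation (this scaling is an assumption here).
    """
    rng = np.random.default_rng(seed)
    entries = rng.choice([1.0, 0.0, -1.0], size=(d, p), p=[1/6, 2/3, 1/6])
    return np.sqrt(3.0 / d) * entries
</syntaxhighlight>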

RP has been applied to various types of problems, including information retrieval, machine learning, and optimization (see [5]). Although it is based on a simple idea, RP has demonstrated good performance in a number of applications, yielding results comparable to conventional dimensionality reduction techniques, such as PCA, while having much lower computational requirements.

Using RP for Face Recognition

Face recognition is a key biometric technology with a wide range of potential applications related to national security and safety, including surveillance, information security, access control, identity fraud, gang tracking, banking, and finding missing children. Faces are highly dynamic and can vary considerably in orientation, lighting, scale, and facial expression; therefore, face recognition is considered a difficult problem to solve. Considerable progress has been made in face recognition research over the last two decades, especially with the development of powerful models of face appearance [8]. These models represent faces as points in high-dimensional spaces and employ dimensionality reduction to find a more meaningful representation. The key observation is that although face images can be regarded as points in a high-dimensional space, they often lie on a manifold (i.e., subspace) of much lower dimensionality, embedded in the high-dimensional image space [9].

Several reasons make RP attractive for face recognition. First of all, RP is data independent. This is in contrast to other techniques, such as PCA, where the data is projected onto a low-dimensional subspace computed from a training set. This implies that RP is faster to compute than PCA; in particular, PCA becomes very expensive to compute when the dimensionality of the data is very high. Second, RP does not require that the face representations be updated when the face database changes (i.e., when new faces are added to the database or old faces are deleted). In contrast, traditional dimensionality reduction techniques, such as PCA, require re-computing the basis vectors and re-projecting the faces to maintain their optimality. Although there exist algorithms for updating the basis vectors adaptively, these algorithms have stability issues, and the faces still have to be re-projected onto the updated basis.

Finally, RP has much lower computational requirements compared to PCA. Computing a <math>d \times p</math> random matrix is <math>O(pd^2)</math> using the algorithm presented in the previous section and <math>O(pd)</math> using the simplified algorithm by Achlioptas [7]. On the other hand, traditional methods such as PCA and SVD are more expensive with PCA being <math>O(p^2M)+O(p^3)</math> and SVD being <math>O(prM)</math> assuming sparse matrices with r non-zero entries per column. It should be mentioned that in general, performing the projection of M points is <math>O(pdM)</math> both for PCA and RP. However, when approximating the projection matrix using integer values as in [7], the projections can be computed much faster.

RP has been shown to have several other properties that can benefit face recognition [10]. For example, it has been shown that when the original data form highly eccentric Gaussian clusters, RP preserves the clustering structure in the lower-dimensional space (i.e., down to <math>O(logk)</math> dimensions, where k is the number of clusters) and produces more spherical clusters, which could improve recognition performance. In addition, it has been shown that, for moderate dimensions, the inner product between the mapped vectors closely follows the inner product of the original vectors; that is, random mappings do not introduce significant distortion in the data and preserve similarities approximately.

We outline below the steps of face recognition using RP. Suppose that we are given M training face images <math> I_{1},I_{2},...,I_{M}</math>, each of size <math>N \times N</math>:

Step 1: Represent each <math>N \times N</math> face image <math>I_{i}</math> as a one-dimensional, <math>N^2 \times 1</math> vector <math>\Gamma _{i}</math>. This can be done by simply stacking the rows of the image, one after the other (see Figure 1). Note: the images must have been scaled and aligned with each other first.

Figure 1. Vector representation of face images.

Step 2: Compute the average face:

<math> \Psi = \frac{1}{M} \sum _{i=1}^{M} \Gamma _{i} </math>

Step 3: Subtract the mean face from each face:

<math> \Phi _{i} = \Gamma _{i}- \Psi </math>

Step 4: Generate the random matrix using the algorithm given in the previous section.

Step 5: Project the normalized face images onto the random subspace:

<math> w_{i}=R\Phi _{i} </math>
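The steps above can be summarized with the following sketch, assuming NumPy and the random_projection_matrix helper from the earlier sketch, and assuming the training images have already been scaled and aligned.

<syntaxhighlight lang="python">
import numpy as np

def rp_train(images, d, seed=None):
    """Project M aligned N x N training images onto a d-dimensional RP space."""
    # Step 1: stack the rows of each image into an N^2 x 1 vector; collect
    # the vectors as the columns of a p x M matrix.
    Gamma = np.column_stack([img.reshape(-1) for img in images]).astype(float)
    p, M = Gamma.shape
    # Step 2: compute the average face.
    Psi = Gamma.mean(axis=1, keepdims=True)
    # Step 3: subtract the mean face from each face.
    Phi = Gamma - Psi
    # Step 4: generate the d x p random matrix (see random_projection_matrix).
    R = random_projection_matrix(d, p, seed=seed)
    # Step 5: project the normalized faces onto the random subspace.
    W = R @ Phi   # column i holds the d-dimensional representation w_i
    return R, Psi, W
</syntaxhighlight>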

To perform face recognition, first we represent each training face in a lower dimensional space using RP:

<math> w_{i}=\begin{bmatrix} y_{i1}\\ y_{i2}\\ \vdots \\ y_{iK} \end{bmatrix},i=1,2,...,M </math>

Given an unknown <math>N \times N</math> face image I (aligned in the same way as the training faces), we apply the following steps for recognition purposes:

Step 1. Represent I as a one-dimensional, <math>N^2 \times 1</math> vector <math>\Gamma</math>.

Step 2. Normalize <math>\Gamma</math> by subtracting the average face: <math>\Phi = \Gamma- \Psi</math>

Step 3. Project <math>\Phi</math> onto the RP space

<math> w=R\Phi, \; \text{where} \; w=\begin{bmatrix} y_{1}\\ y_{2}\\ \vdots \\ y_{K} \end{bmatrix} </math>

Step 4. Find the closest training face <math>\Phi_{l}</math> to the unknown face <math>\Phi</math>:

<math> e_{r}=\min_{l}\left \| w-w_{l} \right \| </math>

where <math>e_{r}</math> is the minimum error.

Step 5. If <math>e_{r} < T_{r}</math>, where <math>T_{r}</math> is a threshold, the face <math>\Gamma</math> is recognized as face <math>\Gamma _{l}</math>

Typically, the Euclidean distance is used to compute the error.
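A minimal sketch of the recognition steps is given below, reusing R, Psi, and W from the rp_train sketch above; the threshold <math>T_{r}</math> is a free parameter.

<syntaxhighlight lang="python">
import numpy as np

def rp_recognize(test_image, R, Psi, W, threshold):
    """Return (index of closest training face, minimum error), or (None, error)."""
    Gamma = test_image.reshape(-1, 1).astype(float)   # Step 1: N^2 x 1 vector
    Phi = Gamma - Psi                                 # Step 2: subtract the mean face
    w = R @ Phi                                       # Step 3: project onto the RP space
    # Step 4: Euclidean distances to all projected training faces.
    dists = np.linalg.norm(W - w, axis=0)
    l = int(np.argmin(dists))
    e_r = float(dists[l])
    # Step 5: accept the match only if the minimum error is below the threshold.
    return (l, e_r) if e_r < threshold else (None, e_r)
</syntaxhighlight>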

Experimental Results

We have considered two different databases in our experiments: the ORL and the CVL databases [11]. The ORL database contains 400 face images from 40 subjects, with 10 frontal exposures of each subject under different facial expressions, lighting, and slight orientation changes. The CVL database contains 798 images from 114 subjects, with 7 exposures of each subject under different facial expressions and orientations. We used only the frontal exposures of each subject in our experiments (i.e., 3 exposures with different facial expressions, for a total of 342 images). In each experiment, we divided the database under experimentation into three subsets: training, gallery, and test.

We have performed two types of experiments: (a) the training set contains images of the same subjects as the gallery set, and (b) the training and gallery sets contain different subjects. It should be mentioned that the second scenario is more realistic when we do not have a representative set of images for training; in this case, we would have to use images of other subjects to create a representative training set and compute the RP space. Recognition is performed using the closest match approach: given a test face, we find the nearest face in the gallery set. The recognition accuracy is computed as the ratio of the faces recognized correctly from the test set over the total number of faces in the test set. For comparison purposes, we have also evaluated PCA. Since it has been reported that RP performs better when averaging the results over several RPs, we also report results using majority voting on a relatively small ensemble of five different random projections.

Figures 2(a)-(c) show the results using the ORL database when the subjects in the training and gallery sets have the same identities, while Figures 2(d)-(f) show the case of subjects having different identities. In the first case, we built the training set using 3 images from each subject (i.e., 120 images), while in the second case we built the training set using the images of 13 subjects (i.e., 130 images). The images of the remaining 27 subjects (i.e., 270 images) were used to create the gallery and test sets. To test the sensitivity of each method, we also varied the number of images per subject in the gallery set (i.e., Figure 2(a) - 3 images per subject, Figure 2(b) - 2 images per subject, Figure 2(c) - 1 image per subject, Figure 2(d) - 5 images per subject, Figure 2(e) - 4 images per subject, and Figure 2(f) - 3 images per subject). In each graph, the red line corresponds to RP, the blue line corresponds to PCA, and the green line corresponds to RP using majority voting.

Our first observation is that PCA in general performs better when the identities of the subjects in the gallery set are the same as the identities of the subjects in the training set. Comparing RP with PCA, our results show that PCA performs better than RP mostly at low dimensions (i.e., 20-30). This result is consistent with previous studies, where it has been reported that RP compares favorably with PCA for a moderate or higher number of dimensions. The difference in performance becomes smaller and smaller as the number of dimensions increases.

The results using different subjects in the gallery set than in the training set are more interesting. In this case, RP and PCA perform very similarly; in fact, RP seems to be doing slightly better (i.e., by 1%-2%) than PCA at higher dimensions. This is mainly because PCA is data dependent while RP is data independent. Overall, both methods seem to be affected by the number of images per subject in the gallery set: their performance degrades as the number of images per subject in the gallery set decreases. The degradation in the performance of the two methods is emphasized by the fact that we increase the size of the test set as we decrease the size of the gallery set (i.e., the images removed from the gallery set are added to the test set).

Figure 2. Experiments using the ORL database, closest match, and majority voting. (a)-(c) Same subjects in the training and gallery sets, (d)-(f) Different subjects in the training and gallery sets. The proportion of subjects in the gallery and test sets varies as shown. The blue line corresponds to PCA using closest match, the red line corresponds to RP using closest match, and the green line corresponds to RP using majority voting.

The results for the CVL database are shown in Figure 3. Figure 3(a) corresponds to using subjects with the same identity both in the training and gallery sets, while Figures 3(b) and (c) correspond to different subjects. In Figure 3(a), we built the training, gallery, and test sets by splitting the CVL database into three equal parts, keeping 1 image per subject for each part. In Figures 3(b) and (c), we built the training set using the images of 34 subjects (i.e., 102 images). Figure 3(b) corresponds to storing 2 images per subject in the gallery set, while Figure 3(c) corresponds to storing 1 image per subject in the gallery set. The results for the CVL database are quite consistent with those for the ORL database. Figure 3(b) presents an interesting case where RP seems to be doing better than PCA (i.e., by 3%-5%) at 50 dimensions and higher.

Figure 3. Experiments using the CVL database, closest match, and majority voting. (a) Same subjects in the training and gallery sets (b)-(c) Different subjects in training and gallery sets. The proportion of subjects in the gallery and test sets varies as shown. The blue line corresponds to PCA using closest match, the red line corresponds to RP using closest match, and the green line corresponds to RP using majority voting.

Project

In this project, you will implement face recognition using RP and study the effect of several factors on identification performance. For your experiments, you will use a subset of the AR face database (50 subjects, 10 images per subject) [1]. Each face image is stored as an individual Portable Gray Map (PGM) image file. The file name is defined as follows: the first two digits represent the user index and the next two digits denote the image index for the same user. For example, image 0204.pgm is the <math>4^{th}</math> image of user 2. Each image is a <math> 42 \times 42 </math> (= 1764 pixels) gray-scale image. For different users, the images with the same image index are captured under similar conditions, including lighting and expression.
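The sketch below illustrates one way to load the images and parse the file names, assuming NumPy and Pillow are available and that the PGM files sit in a hypothetical directory ar_faces/.

<syntaxhighlight lang="python">
import glob
import os
import numpy as np
from PIL import Image   # Pillow can read PGM files

# Load the 42 x 42 images into a 1764 x (number of images) data matrix and
# recover the user/image indices from the uuii.pgm naming convention.
paths = sorted(glob.glob(os.path.join("ar_faces", "*.pgm")))
vectors, user_ids, image_ids = [], [], []
for path in paths:
    name = os.path.splitext(os.path.basename(path))[0]   # e.g., "0204"
    user_ids.append(int(name[:2]))    # user index (e.g., 2)
    image_ids.append(int(name[2:]))   # image index for that user (e.g., 4)
    vectors.append(np.asarray(Image.open(path), dtype=float).reshape(-1))

X = np.column_stack(vectors)   # each column is one vectorized face image
</syntaxhighlight>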

Experiments

To evaluate RP, you will perform several experiments using the following three methods for evaluating recognition accuracy: (i) closest match, (ii) majority voting, and (iii) scoring. For each experiment, use the first five images from each subject to build the gallery set and the remaining five for testing (i.e., no images are needed for training, as in the case of PCA, since the projection matrix is built randomly). In each experiment, compare RP with PCA by varying the number of dimensions as well as the number of images per subject in the gallery set.

(a) Given a test face, the closest match approach performs recognition by finding the nearest face in the gallery set. In this case, recognition accuracy is computed as the ratio of the faces recognized correctly from the test set over the total number of faces in the test set. To account for instabilities in the performance of RP due to the random nature of the projection matrix, report the recognition performance as the average over five different RPs.

(b) In the majority voting approach, you need to run five different RPs (i.e., the number of RPs is chosen arbitrarily; running more RPs would produce more stable results) for each test face and find the closest match in each case. The identity of the face is then determined by applying majority voting to the five closest matches found. Recognition accuracy is computed by dividing the number of correctly recognized faces from the test set by the total number of faces in the test set.
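A minimal sketch of this voting scheme, assuming the rp_train and rp_recognize helpers from the earlier sketches and a list of gallery images with their subject labels, is:

<syntaxhighlight lang="python">
import numpy as np
from collections import Counter

def majority_vote_identity(test_image, gallery_images, gallery_labels,
                           d, n_projections=5):
    """Determine the identity of a test face by voting over several RPs."""
    votes = []
    for seed in range(n_projections):
        # One independent random projection of the gallery per iteration.
        R, Psi, W = rp_train(gallery_images, d, seed=seed)
        l, _ = rp_recognize(test_image, R, Psi, W, threshold=np.inf)
        votes.append(gallery_labels[l])
    # The identity receiving the most votes among the closest matches wins.
    return Counter(votes).most_common(1)[0][0]
</syntaxhighlight>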

(c) Given a test face, the scoring approach retrieves the closest x matches from the gallery set. The recall rate for each test face is then computed, that is, the ratio of the face images retrieved from the gallery set that belong to the same person as the test face, over the total number of images stored in the gallery for that person. Recognition accuracy is computed by averaging the recall rates over the test set. To account for instabilities in the performance of RP due to the random nature of the projection matrix, report the average over five different RPs.
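A minimal sketch of the recall computation for a single test face, under the same assumptions as the previous sketches, is:

<syntaxhighlight lang="python">
import numpy as np

def recall_at_x(test_image, true_label, R, Psi, W, gallery_labels, x):
    """Fraction of this person's gallery images found among the closest x matches."""
    w = R @ (test_image.reshape(-1, 1).astype(float) - Psi)
    dists = np.linalg.norm(W - w, axis=0)
    top_x = np.argsort(dists)[:x]   # indices of the closest x gallery faces
    hits = sum(1 for i in top_x if gallery_labels[i] == true_label)
    total = sum(1 for lab in gallery_labels if lab == true_label)
    return hits / total
</syntaxhighlight>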

Wiki Assessment

Pre-test questions


1. What is the goal of dimensionality reduction?

Answer: To project high-dimensional data into a low-dimensional space. Different criteria could be used in determining an appropriate low-dimensional space.

2. How do you think high dimensionality affects performance?

Answer: Time, memory, and redundancy issues.

3. Consider that you are developing a face recognition system. What would you do in order to reduce the dimensionality of the face images?

Answer: Depending on students’ background, different answers might be presented, for example, reducing image size, extracting facial features, or using PCA.

4. What are the main factors affecting face recognition performance?

Answer: Lighting, pose, facial expressions, aging.

Post-test questions


1. How does RP work?

Answer: see the Random Projection (RP) section above.

2. What are three reasons that motivate the use of RP for face recognition?

Answer: (1) RP is data independent; (2) the projection matrix does not have to be updated when the face database changes; and (3) RP is computationally efficient.

3. What are the main similarities/differences between RP and PCA?

Answer: Both RP and PCA are dimensionality reduction techniques. RP is data independent while PCA is data dependent. PCA is more computationally intensive.

4. Describe how to apply RP for face recognition.

Answer: see the Using RP for Face Recognition section above.

References and Resources

  1. W. Zhao, R. Chellappa, P.J. Phillips, and A. Rosenfeld, "Face recognition: A literature survey," ACM Computing Surveys, vol. 35, no. 4, pp. 399–458, 2003.
  2. S. Dasgupta, "Experiments with random projection," Uncertainty in Artificial Intelligence, 2000.
  3. E. Bingham and H. Mannila, "Random projection in dimensionality reduction: applications to image and text data," International Conference on Knowledge Discovery and Data Mining, pp. 245–250, 2001.
  4. D. Fradkin and D. Madigan, "Experiments with random projection for machine learning," ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 517–522, 2003.
  5. N. Goel, G. Bebis, and A. Nefian, "Face Recognition Experiments with Random Projection," SPIE Defense and Security Symposium (Biometric Technology for Human Identification), Orlando, FL, March 28 - April 1, 2005.
  6. W. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," Conference on Modern Analysis and Probability, pp. 189–206, 1984.
  7. D. Achlioptas, "Database-friendly random projections," ACM Symposium on the Principles of Database Systems, pp. 274–281, 2001.
  8. W. Zhao, R. Chellappa, P.J. Phillips, and A. Rosenfeld, "Face recognition: A literature survey," ACM Computing Surveys, vol. 35, no. 4, pp. 399–458, 2003.
  9. B. Moghaddam, "Principal manifolds and probabilistic subspaces for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 6, pp. 780–788, 2002.
  10. S. Dasgupta, "Experiments with random projection," Uncertainty in Artificial Intelligence, 2000.
  11. Face Recognition Home Page: http://www.face-rec.org/. A good resource for face recognition methods and data sets.