Feature Subset Selection Using Genetic Algorithms (GAs)


<Contributed to MeLi Wiki by Professor George Bebis, Department of Computer Science, University of Nevada, Reno>


Introduction

The majority of real-world object detection problems involve “concepts” (e.g., face, vehicle) rather than specific objects. Usually, these “conceptual objects” exhibit large within-class variability. As a result, there is no easy way to derive an analytical decision boundary that separates a given “conceptual object” from all others. One feasible approach is to learn the decision boundary from a set of training examples using supervised learning, where each training instance is associated with a class label. Building an object detection system under this framework involves two main steps: (1) extracting a number of features, and (2) training a classifier using the extracted features to distinguish among different class instances.

Choosing an appropriate set of features is critical when designing pattern classification systems under the framework of supervised learning. Often, a large number of features are extracted to better represent the target concept. Without employing some kind of feature selection strategy, however, many of them could be redundant or even irrelevant to the classification task. As a result, the classifier might not generalize well. Ideally, we would like to use only features having high discriminative power while ignoring or paying less attention to the rest. For instance, to allow a vehicle detector to generalize well, it would be necessary to exclude features encoding fine details that might appear in specific vehicles only. A limited yet salient feature set simplifies both the pattern representation and the classifiers that are built on the selected representation. Consequently, the resulting classifier will be more efficient.


Feature Extraction and Selection

A major issue in pattern classification/recognition tasks is pattern representation. By pattern representation we mean finding a set of features that not only represents the pattern (e.g., face, fingerprint) in a more compact and robust way but also provides more discriminative information. This is also known as the "feature extraction" step in pattern recognition [1]. Specifically, the goal of feature extraction is to determine an appropriate subspace of dimensionality <math>m</math> in the original feature space of dimensionality <math>d</math>, where <math>m \leq d</math>. Depending on the nature of the task at hand, the features can be extracted either manually or automatically by applying transformations to hand-picked features or to the original raw pattern values (e.g., image pixels), which can themselves be considered a primitive set of features. In principle, the transformations used for feature extraction perform dimensionality reduction and can be either linear or non-linear. Transformation-based methods have the potential of generating better features than the original ones; however, the new features may not have a physical meaning since they are combinations of the original features. A popular linear feature extraction method is Principal Component Analysis (PCA), which has been reviewed in another module.

In most practical cases, relevant features are not known a priori. Finding out which features to use for classification (or recognition) is referred to as "feature selection". More formally, given a set of <math>d</math> features, the goal is to select a subset of size <math>m</math> that leads to the smallest classification error [1]. The selection and quality of the features representing a pattern have considerable impact on the success of subsequent pattern classification. Feature selection is essentially an optimization problem that involves searching the space of possible feature subsets [2]. Feature selection approaches have two main components: a search strategy and an evaluation method. Search strategies belong to three categories: optimal, heuristic, and randomized.

Exhaustive search guarantees finding the optimal subset; however, the number of possible subsets grows combinatorially, which makes it impractical. Sequential Forward Selection (SFS) and Sequential Backward Selection (SBS) are well-known heuristic feature selection schemes [2]. SFS starts with an empty feature set, selects the best single feature based on some criterion, and adds it to the feature set. SBS starts with the entire feature set and at each step drops the feature whose absence least decreases performance. SFFS and SBFS, the "floating" versions of SFS and SBS, extend them by considering groups of features at each step [2]. These strategies make local decisions and cannot be expected to find globally optimal solutions. Randomized search uses probabilistic steps or a sampling process; the Relief algorithm and several extensions of it are typical examples. Genetic Algorithms (GAs) have also been proposed for feature selection.

Evaluation strategies fall into two categories: filter and wrapper. Filter approaches are computationally more efficient than wrapper approaches since they evaluate the goodness of selected features using criteria that can be tested quickly (e.g., reducing the correlation or the mutual information among features). This, however, could lead to non-optimal feature subsets. Wrapper approaches perform the evaluation by training the classifier on the selected features and estimating the classification error using a validation set. Although this is a slower procedure, the selected features are usually better suited to the classifier employed.

Genetic Algorithms (GAs)

GAs are a class of optimization procedures inspired by the biological mechanisms of reproduction. In the past, they have been used to solve various problems including target recognition, object recognition, face recognition, and face detection/verification. This section contains a brief summary of the fundamentals of GAs; [3] provides an excellent introduction, and the reader is referred to it as well as to [4]. GAs operate iteratively on a population of structures, each of which represents a candidate solution to the problem at hand, properly encoded as a string of symbols (e.g., binary). A randomly generated set of such strings forms the initial population from which the GA starts its search. Three basic genetic operators guide this search: selection, crossover, and mutation. The genetic search process is iterative: strings in the population are evaluated, selected, and recombined at every iteration (generation) until some termination condition is reached (see Fig. 1). The basic algorithm, where P(t) is the population of strings at generation t, is given below:


[[File:Code GA.jpg]]


Fig. 1. Illustration of GA’s iterative process.
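
As a concrete illustration of this loop, here is a minimal Python sketch. The one-max fitness function and the particular selection and recombination choices are illustrative placeholders only; the operators actually used for feature subset selection are described later in this module.

<syntaxhighlight lang="python">
import random

# Minimal sketch of the basic GA loop: evaluate, select, recombine, repeat.
# Operator choices and parameter values here are illustrative placeholders.
def run_ga(fitness, chrom_len, pop_size=50, generations=100,
           p_cross=0.66, p_mut=0.04):
    # t = 0: randomly generate the initial population P(t)
    population = [[random.randint(0, 1) for _ in range(chrom_len)]
                  for _ in range(pop_size)]
    for t in range(generations):
        # evaluate P(t) and keep the fitter half as parents (one simple choice)
        parents = sorted(population, key=fitness, reverse=True)[:pop_size // 2]
        # crossover and mutation produce the next generation P(t+1)
        offspring = []
        while len(offspring) < pop_size:
            a, b = random.sample(parents, 2)
            if random.random() < p_cross:
                child = [x if random.random() < 0.5 else y for x, y in zip(a, b)]
            else:
                child = a[:]
            child = [1 - g if random.random() < p_mut else g for g in child]
            offspring.append(child)
        population = offspring
    return max(population, key=fitness)

# Example: maximize the number of 1's in a 20-bit string (the "one-max" problem).
best = run_ga(fitness=sum, chrom_len=20)
</syntaxhighlight>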


Evaluation of each string is based on a fitness function that is problem-dependent. It determines which of the candidate solutions are better. This corresponds to the environmental determination of survivability in natural selection. Selection of a string, which represents a point in the search space, depends on the string’s fitness relative to those of other strings in the population. It probabilistically removes, from the population, those points that have relatively low fitness (see Fig. 2).


Fig. 2. Illustration of the selection operator.


Mutation, as in natural systems, is a very low-probability operator that simply flips individual bits (see Fig. 3). Mutation plays the role of restoring lost genetic material. Crossover, in contrast, is applied with high probability. It is a randomized yet structured operator that allows information exchange between points in the search space (see Fig. 4). Its goal is to recombine the genetic material of fit individuals without introducing any new gene values. In summary, selection probabilistically filters out solutions that perform poorly, choosing high-performance solutions to concentrate on or exploit. Crossover and mutation, through string operations, generate new solutions for exploration.


Fig. 3. Illustration of the mutation operator.


Fig. 4. Illustration of the cross-over operator.


Given an initial population of elements, GAs use the feedback from the evaluation process to select fitter solutions, eventually converging to a population of high-performance solutions. GAs do not guarantee a globally optimal solution; however, they have the ability to search very large spaces and reach near-optimal solutions quickly. Their fast convergence is explained by the schema theorem: short bit patterns in the chromosomes with above-average fitness receive an exponentially growing number of trials in subsequent generations [3].

Genetic Feature Subset Selection

Traditionally, there are three main steps in building a pattern classification system using supervised learning. First, some preprocessing is applied to the input patterns (e.g., normalizing the pattern with respect to size and orientation, compensating for light variations, reducing noise, etc.). Second, feature extraction is applied to represent the patterns by a compact set of features. The last step involves training a classifier to assign input patterns to their correct category. In most cases, no explicit feature selection step takes place besides the feature weighting performed implicitly by the classifier. Fig. 5 illustrates the main steps of the GA-based approach. The main difference from the traditional approach is the inclusion of a step that performs feature selection using GAs.


Fig. 5. Main steps involved in building an object detection system using feature subset selection.


Feature selection encoding: A simple encoding scheme can be used where the chromosome is a bit string whose length is determined by the number of features. Each feature is associated with one bit in the string. If the <math>i</math>-th bit is 1, then the <math>i</math>-th feature is selected; otherwise, that feature is ignored. Each chromosome thus represents a different subset of features.
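
As a small illustration (with hypothetical variable names), the chromosome can be applied directly as a column mask on the feature matrix:

<syntaxhighlight lang="python">
import numpy as np

# The chromosome is a 0/1 string of length d; bits equal to 1 select
# the corresponding feature columns.
def select_features(X, chromosome):
    """X: (n_samples, d) feature matrix; chromosome: length-d list of 0/1."""
    mask = np.asarray(chromosome, dtype=bool)
    return X[:, mask]

# Example: 6 features; this chromosome selects features 0, 2 and 5.
X = np.random.rand(4, 6)
X_subset = select_features(X, [1, 0, 1, 0, 0, 1])   # shape (4, 3)
</syntaxhighlight>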

Feature subset fitness evaluation: The goal of feature subset selection is to use fewer features while achieving the same or better performance. Therefore, the fitness evaluation contains two terms: (1) accuracy and (2) the number of features selected. The performance of the classifier is estimated using a validation data set (separate from the training and test sets), which guides the GA search. If two subsets achieve the same performance while containing different numbers of features, the subset with fewer features is preferred. Between accuracy and feature subset size, accuracy is our major concern. We used the fitness function shown below to combine the two terms:

<math>fitness = 10^{4} \times Accuracy + 0.5 \times Zeros</math>

where <math>Accuracy</math> corresponds to the classification accuracy on the validation set for a particular subset of eigenvectors, and <math>Zeros</math> corresponds to the number of eigenvectors not selected (i.e., zeros in the chromosome). The <math>Accuracy</math> term ranges roughly from <math>0.50</math> to <math>0.99</math>; thus, the first term assumes values from <math>5000</math> to <math>9900</math>. The <math>Zeros</math> term ranges from <math>0</math> to <math>L-1</math>, where <math>L</math> is the length of the chromosome; thus, the second term assumes values from <math>0</math> to <math>99</math> (<math>L=200</math>). Based on the weights assigned to each term, the <math>Accuracy</math> term dominates the fitness value. This implies that individuals with higher accuracy will outweigh individuals with lower accuracy, no matter how many features they contain. Overall, the higher the accuracy, the higher the fitness; also, the fewer the selected features, the higher the fitness.
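
A direct sketch of this fitness function is given below; estimate_accuracy is a hypothetical wrapper that trains the chosen classifier on the eigen-features selected by the chromosome and returns its accuracy on the validation set.

<syntaxhighlight lang="python">
# Two-term fitness: accuracy dominates, the number of discarded features
# (zeros in the chromosome) acts as a tie-breaker.
def fitness(chromosome, estimate_accuracy):
    accuracy = estimate_accuracy(chromosome)   # roughly 0.50 .. 0.99
    zeros = chromosome.count(0)                # eigenvectors not selected
    return 1e4 * accuracy + 0.5 * zeros
</syntaxhighlight>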

Initial population: In general, the initial population is generated randomly (e.g., each bit in an individual is set by flipping a coin). This, however, would produce a population in which each individual contains approximately the same number of 1's and 0's on average. To explore subsets of different numbers of features, the number of 1's for each individual is generated randomly, and the 1's are then randomly scattered across the chromosome. In the experiments below, we used a population size of 2000 and 200 generations. In most cases, the GA converged in fewer than 200 generations.
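
A minimal sketch of this initialization scheme, using the chromosome length and population size quoted above:

<syntaxhighlight lang="python">
import random

# Draw the number of 1's at random for each individual, then scatter the 1's
# at random positions, so the initial population covers subsets of very
# different sizes.
def random_individual(chrom_len):
    num_ones = random.randint(1, chrom_len)
    ones = set(random.sample(range(chrom_len), num_ones))
    return [1 if i in ones else 0 for i in range(chrom_len)]

population = [random_individual(200) for _ in range(2000)]
</syntaxhighlight>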

Selection: Our selection strategy was cross-generational. Assuming a population of size <math>N</math>, <math>N</math> offspring are generated, doubling the size of the population, and the best <math>N</math> individuals are then selected from the combined parent–offspring population [3].
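
A minimal sketch of this cross-generational scheme (fitness is any callable that scores a chromosome, such as the one sketched above with its classifier wrapper bound in):

<syntaxhighlight lang="python">
# N parents and N offspring compete together; the best N survive.
def cross_generational_select(parents, offspring, fitness, N):
    combined = parents + offspring              # 2N individuals
    combined.sort(key=fitness, reverse=True)
    return combined[:N]
</syntaxhighlight>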

Crossover: There are three basic types of crossover: one-point, two-point, and uniform. For one-point crossover, the parent chromosomes are split at a common point chosen randomly and the resulting sub-chromosomes are swapped. For two-point crossover, the chromosomes are thought of as rings with the first and last genes connected (i.e., a wrap-around structure); the rings are split at two common points chosen randomly and the resulting sub-chromosomes are swapped. Uniform crossover is different from the above two schemes: each gene of the offspring is selected randomly from the corresponding genes of the parents. Since we do not know in general how the eigenvectors depend on each other, if dependent eigenvectors are far apart in the chromosome it is very likely that traditional one-point or two-point crossover will destroy such schemata. To avoid this problem, uniform crossover is used here. The crossover probability used in all of our experiments was 0.66.
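
A minimal sketch of uniform crossover as described above:

<syntaxhighlight lang="python">
import random

# With probability p_cross, each gene of the two children is drawn from one of
# the two parents with equal probability; otherwise the parents are copied.
def uniform_crossover(parent_a, parent_b, p_cross=0.66):
    if random.random() >= p_cross:
        return parent_a[:], parent_b[:]
    child1, child2 = [], []
    for a, b in zip(parent_a, parent_b):
        if random.random() < 0.5:
            child1.append(a)
            child2.append(b)
        else:
            child1.append(b)
            child2.append(a)
    return child1, child2
</syntaxhighlight>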

Mutation: We use the traditional mutation operator, which flips individual bits with a very low probability. The mutation probability used in all of our experiments was 0.04.
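
A minimal sketch of this operator, under the usual per-bit interpretation:

<syntaxhighlight lang="python">
import random

# Flip each bit independently with a very low probability (0.04 here).
def mutate(chromosome, p_mut=0.04):
    return [1 - gene if random.random() < p_mut else gene for gene in chromosome]
</syntaxhighlight>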

Experimental Results

The feasibility of using GAs for feature subset selection has been tested on several challenging problems such as face/vehicle detection [5] and gender classification [6]. Here, we present experimental results for gender classification (see Fig. 6) using PCA for feature extraction. The key objective is to use GAs to select the most relevant eigen-features for gender classification.


Fig. 6. The problem of gender classification.


Male and female faces make up two challenging classes of naturally structured but slightly deformable patterns with considerable within-class variation. Previous studies on gender classification fall into one of two main categories: feature-based and appearance-based. Feature-based methods use local features extracted from regions of the image. Methods using geometrical features such as distances and areas between facial features (e.g., eyes, nose, and mouth) fall into this category and are motivated by psychophysical experiments and statistical analysis. However, such features discard textural information, like the "smoothness" of faces or hair style, that contains strong gender information. Moreover, locating and extracting them accurately is a difficult task. These observations have led to template-based methods, which use features extracted from the whole image (i.e., global features). Appearance-based methods using raw or lower-dimensional face representations are common examples. The main drawback of appearance-based approaches is that they are sensitive to occlusion; however, they can tolerate noise better than feature-based approaches. What is common among traditional gender classification approaches is that they do not apply feature selection to choose the "best" features for gender classification. Often, the same features used for face recognition are also used for gender classification. However, this is not an optimal choice. In fact, there is cognitive evidence suggesting that gender classification is a simpler problem than general face recognition. Also, gender classification in humans does not seem to depend on the ability to identify individuals, and there is some evidence suggesting that it is carried out by a separate population of cells in the inferior temporal cortex.

A major issue in face processing tasks is face representation. Commonly, faces are represented in a low-dimensional space, for example, using PCA. In this case, the low-dimensional face space is spanned by the principal components (i.e., the eigenvectors corresponding to the largest eigenvalues) of the distribution of face images. We refer to the projection coefficients of an image on this space as eigen-features. Using PCA or any other dimensionality reduction technique, a key issue in the context of gender classification is finding which dimensions encode mostly gender information. In the case of PCA, it is well known that different eigenvectors encode different kinds of information. For example, the first few eigenvectors seem to encode lighting, while other eigenvectors seem to encode features such as glasses or moustaches. Fig. 7 shows several eigenvectors computed from our dataset; it is evident that eigenvectors 1-4 encode light variations while eigenvectors 10 and 20 encode information about glasses. Although many of the eigen-features are important for face recognition, they might actually confuse the classifier in other applications such as gender classification. Therefore, selecting a good subset of eigen-features is important in order to improve gender classification performance.
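
A minimal sketch of how such eigen-features can be computed with PCA (assuming faces is an array of vectorized, aligned face images; the SVD is used for convenience, and the number of components kept is just a parameter):

<syntaxhighlight lang="python">
import numpy as np

# faces: (n_images, n_pixels) array of vectorized face images.
def pca_eigenfeatures(faces, num_components=250):
    mean_face = faces.mean(axis=0)
    centered = faces - mean_face
    # Rows of Vt are the eigenvectors of the covariance matrix,
    # already ordered by decreasing eigenvalue (singular value).
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    eigenvectors = Vt[:num_components]            # (num_components, n_pixels)
    eigenfeatures = centered @ eigenvectors.T     # projection coefficients
    return mean_face, eigenvectors, eigenfeatures
</syntaxhighlight>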


Fig. 7. Eigenvectors (left to right, top to bottom): no. 1-6, 8, 10, 12, 14, 19, 20, 150, 200 and 250.


In our experiments, we used 400 frontal images from 400 distinct people, representing different ethnicities, with different facial expressions, and under different lighting conditions. Half of them were male and the other half female. We evaluated feature selection using three classifiers (Bayesian, Neural Networks (NNs), and Support Vector Machines (SVMs)). For each classifier, the average error rate was recorded using a three-fold cross-validation procedure. To do this, we randomly split the dataset three times, keeping 300 images (150 female and 150 male) for training, 50 images (25 female and 25 male) for validation, and 50 images (25 female and 25 male) for testing. The validation set was strictly used to evaluate the goodness of a given subset of selected eigen-features. Only the first 10 to 250 eigenvectors were used in our experiments (the last 50 eigenvectors seemed to encode mostly noise).
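
A minimal sketch of one of these random splits, under a hypothetical label convention (0 = female, 1 = male, 200 images per class):

<syntaxhighlight lang="python">
import numpy as np

# Per class: 150 images for training, 25 for validation, 25 for testing.
def split_once(labels, rng):
    train_idx, val_idx, test_idx = [], [], []
    for cls in (0, 1):
        idx = rng.permutation(np.where(labels == cls)[0])
        train_idx.extend(idx[:150])
        val_idx.extend(idx[150:175])
        test_idx.extend(idx[175:200])
    return train_idx, val_idx, test_idx

rng = np.random.default_rng(0)
labels = np.array([0] * 200 + [1] * 200)
train_idx, val_idx, test_idx = split_once(labels, rng)
</syntaxhighlight>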


Fig. 8. Error rates (Left) ERM: manually selected features; ERG: GA selected features; ERS: using SFBS for feature selection. (Right) RNG/RNS: ratio between number of features selected by GA/SFBS and the complete feature set; RIG/RIS: percentage of the information contained in the GA/SFBS feature subsets.


Fig. 9. The distribution of eigenvectors selected by GAs (left) and SFBS (right).


Our experiments showed that feature selection improved the performance of all the classifiers tried. Also, features selected by the GA approach yielded higher accuracy than those selected by SFBS. Figure 8 shows the results for each classifier. The best performance was obtained using the SVM classifier: using only 8.4% of the features in the complete set, the SVM classifier achieved an error rate of 4.7%. To get an idea of the optimal set of eigenvectors selected by GAs or SFBS for each classifier, we computed histograms showing the average distribution of the selected eigenvectors (i.e., over the three training sets). The x-axis corresponds to the eigenvectors, ordered by their eigenvalues, while the y-axis corresponds to the average number of times an eigenvector within some interval was selected by the GA or SFBS in the final solution. Figure 9 shows the histograms for the case of the SVM classifier. It is interesting to note that the SFBS approach seems to favor mostly eigenvectors corresponding to large eigenvalues, which may explain why it did not perform as well as the GA approach in our experiments.


Fig. 10. Reconstructed images using the selected feature subsets. (1st row): original images; (2nd row): using the top 30 eigenvectors; (3rd row): using the eigenvectors selected by SFBS; (4th row): using the eigenvectors selected by GAs.


For visualization purposes, we reconstructed the face images using the selected eigenvectors only. Figure 10 shows these results for the case of the SVM classifier. Several interesting observations can be made. First, it is obvious from Figure 10 (last row) that identity information has been lost in the faces reconstructed using the eigenvectors selected by the GA approach. In contrast, the faces reconstructed using the top 30 eigenvectors (2nd row of Figure 10) or the eigenvectors selected by SFBS (3rd row of Figure 10) reveal strong identity information. Although identity information has been lost in the images reconstructed using the eigenvectors selected by the GA approach, these images do disclose strong gender information: the reconstructed female faces look more "female" than the reconstructed male faces. This implies that the GA approach was more successful in selecting eigenvectors encoding gender information. Second, eigenvectors encoding features unimportant for gender classification seem to have been discarded by the GA approach. This is obvious from the reconstructed face images corresponding to the first two males shown in Figure 10. Although both of them wear eye-glasses, the reconstructed faces do not contain eye-glasses, which implies that the eigenvectors encoding eye-glasses were not selected by the GA approach. Note that the images reconstructed using the 30 largest eigenvectors, as well as those reconstructed using the eigenvectors selected by the SFBS approach, appear to preserve features less relevant to gender classification (e.g., eye-glasses).
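
For completeness, a minimal sketch of how such reconstructions can be obtained from a chromosome, reusing the PCA quantities sketched earlier (an illustration of the principle, not the exact code used to produce Fig. 10):

<syntaxhighlight lang="python">
import numpy as np

# Reconstruct a face using only the eigenvectors whose bits are 1.
def reconstruct(face, mean_face, eigenvectors, chromosome):
    mask = np.asarray(chromosome, dtype=bool)
    selected = eigenvectors[mask]                 # (num_selected, n_pixels)
    coeffs = (face - mean_face) @ selected.T      # project onto the subset
    return mean_face + coeffs @ selected          # back-project to pixel space
</syntaxhighlight>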

Project

In this project, you will familiarize yourself with using genetic algorithms for feature subset selection in gender classification. In particular, you will perform a number of experiments to replicate some of the experimental results reported in the Experimental Results section above.


Experiments

(a) If you do not have enough experience with GAs, it might be a good idea to experiment with GAs on a simple problem first.

(a.1) Implement or find some good GA code (e.g., [GA]).
(a.2) Test your code on a simple optimization problem (e.g., see pages 15-18 of [Goldberg89]).

(b) Generate base-line results for gender classification (e.g., without using feature subset selection).

(b.1) Divide the face data into three sets: training, validation, and test (see the Experimental Results section).
(b.2) Apply PCA to represent the data in a low-dimensional space; keep 99% of the information.
(b.3) Pick the classifier of your choice (e.g., NN, SVM) and apply training to classify the face images into two classes: male and female.
(b.4) Using 3-fold cross-validation (see the Experimental Results section), repeat the experiments three times and report the average classification accuracy.

(c) Use feature subset selection for gender classification.

(c.1) Repeat the experiment in (b) using GAs for selecting the most relevant eigen-vectors.
(c.2) Compare your results.

(d) Compare GAs with alternative feature selection algorithms.

(d.1) Use alternative feature selection algorithms for gender classification (e.g., see [TOOLDIAG])
(d.2) Compare your results with those obtained in (c).

(e) If time permits, repeat experiments (b) and (c) using alternative dimensionality reduction methods for feature extraction (e.g., LDA).


Wiki Assessment

Pre-test questions

  1. What is the goal of feature extraction? What about feature selection?
  2. What is the goal of gender classification? What are some practical applications?
  3. Could you devise a simple algorithm for gender classification?
  4. Could you devise a simple algorithm for feature subset selection?
  5. Do you think that using the same features both for face recognition and gender classification will be a good idea?


Post-test questions

  1. Explain the problem of feature selection. What are the main components in feature selection? What are the main search strategies?
  2. How is feature extraction different from feature selection?
  3. What is the difference between filter and wrapper approaches for feature evaluation?
  4. Describe how GAs work.
  5. Describe how to use GAs for feature selection.
  6. Why is feature selection necessary for pattern classification?


References

  1. A. Jain, R. Duin, J. Mao, Statistical pattern recognition: a review, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22 (1) (2000) 4–37.
  2. M. Dash, H. Liu, Feature selection for classification, Intelligent Data Analysis, 1 (3) (1997) 131–156.
  3. D. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison Wesley, Reading, MA, 1989.
  4. M. Srinivas and L. Patnaik, Genetic algorithms: a survey, IEEE Computer, 27 (6) (1994) 17–26.
  5. Z. Sun, G. Bebis, and R. Miller, "Object Detection Using Feature Subset Selection", Pattern Recognition, vol. 37, pp. 2165-2176, 2004.
  6. Z. Sun, G. Bebis, X. Yuan, and S. Louis, "Genetic Feature Subset Selection for Gender Classification: A Comparison Study", IEEE Workshop on Applications of Computer Vision, pp. 165-170, Orlando, December 2002.