Feature Subset Selection Using Genetic Algorithms (GAs)
Introduction
The majority of real-world object detection problems involve "concepts" (e.g., face, vehicle) rather than specific objects. Usually, these "conceptual objects" exhibit large within-class variability. As a result, there is no easy way to derive an analytical decision boundary that separates a given "conceptual object" from all others. One feasible approach is to learn the decision boundary from a set of training examples using supervised learning, where each training instance is associated with a class label. Building an object detection system under this framework involves two main steps: (1) extracting a number of features, and (2) training a classifier using the extracted features to distinguish among different class instances.
Choosing an appropriate set of features is critical when designing pattern classification systems under the framework of supervised learning. Often, a large number of features are extracted to better represent the target concept. Without some kind of feature selection strategy, however, many of them could be redundant or even irrelevant to the classification task, and as a result the classifier might not generalize well. Ideally, we would like to use only features with high discriminative power while ignoring or paying less attention to the rest. For instance, for a vehicle detector to generalize well, it may be necessary to exclude features encoding fine details that appear only in specific vehicles. A limited yet salient feature set simplifies both the pattern representation and the classifiers built on the selected representation; consequently, the resulting classifier is more efficient.
Feature Extraction and Selection
A major issue in pattern classification/recognition tasks is pattern representation. By pattern representation we mean not only finding a set of features that represent the pattern (e.g., face, fingerprint) in a more compact and robust way, but also providing more discriminative information. This is known as the "feature extraction" step in pattern recognition [Jain00]. Specifically, the goal of feature extraction is to determine an appropriate subspace of dimensionality <math>m</math> in the original feature space of dimensionality <math>d</math>, where <math>m</math> is less than or equal to <math>d</math>. Depending on the nature of the task at hand, the features can be extracted either manually or automatically by applying transformations to hand-picked features or to the original raw pattern values (e.g., image pixels), which can themselves be considered a primitive set of features. In principle, the transformations used for feature extraction perform dimensionality reduction and can be either linear or non-linear. Transformation-based methods have the potential to generate better features than the original ones; however, the new features may not have a physical meaning since they are combinations of the original features. A popular linear feature extraction method is Principal Component Analysis (PCA), which has been reviewed in another module.
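As a concrete illustration, the following is a minimal sketch of linear feature extraction with PCA using scikit-learn; the data matrix, sample count, and target dimensionality <math>m</math> are arbitrary placeholders chosen only for this example.

# A minimal sketch of linear feature extraction with PCA.
# X (n samples x d features) and m are illustrative placeholders.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))    # 200 patterns, d = 50 original features

m = 10                            # desired subspace dimensionality (m <= d)
pca = PCA(n_components=m)
X_reduced = pca.fit_transform(X)  # project onto the m principal components

print(X_reduced.shape)                      # (200, 10)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained

Each new feature is a linear combination of the original ones, which is why transformation-based features may lack a direct physical interpretation.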
In most practical cases, relevant features are not known a priori. Finding out which features to use for classification (or recognition) is referred to as "feature selection". More formally, given a set of <math>d</math> features, the goal is to select a subset of size <math>m</math> that leads to the smallest classification error [Jain00]. The selection and quality of the features representing a pattern have considerable impact on the success of subsequent pattern classification. Feature selection is essentially an optimization problem that involves searching the space of possible feature subsets [Dash97]. Feature selection approaches have two main components: a search strategy and an evaluation method. Search strategies belong to three categories: optimal, heuristic, and randomized.
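To get a sense of the size of this search space, note that the number of distinct subsets of size <math>m</math> drawn from <math>d</math> features is given by the binomial coefficient

\[
\binom{d}{m} = \frac{d!}{m!\,(d-m)!}, \qquad \text{e.g.}\ \binom{100}{10} \approx 1.7 \times 10^{13},
\]

so even modest values of <math>d</math> and <math>m</math> rule out naive enumeration of all candidate subsets.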
Exhaustive search guarantees finding the optimal subset; however, the number of possible subsets grows combinatorially, which makes it impractical for all but small feature sets. Sequential Forward Selection (SFS) and Sequential Backward Selection (SBS) are well-known heuristic feature selection schemes [Dash97]. SFS starts with an empty feature set, selects the single best feature according to some criterion, and adds it to the feature set; this step is repeated until the desired number of features is reached. SBS starts with the entire feature set and at each step drops the feature whose absence least degrades performance. SFFS and SBFS are extensions of SFS and SBS that consider groups of features at each step [Dash97]. These strategies make local decisions and cannot be expected to find globally optimal solutions. Randomized search uses probabilistic steps or a sampling process; the Relief algorithm and several of its extensions are typical examples. Genetic Algorithms (GAs) have also been proposed for feature selection.
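The sketch below illustrates the greedy SFS loop described above: start from an empty feature set and repeatedly add the single feature that most improves a scoring criterion. The criterion used here (cross-validated accuracy of a k-nearest-neighbour classifier), the dataset, and the target subset size are illustrative assumptions, not part of the original description.

# A minimal sketch of Sequential Forward Selection (SFS).
# Scoring criterion, dataset, and subset size are illustrative choices.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
d = X.shape[1]

def score(subset):
    """Evaluate a candidate feature subset by cross-validated accuracy."""
    clf = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(clf, X[:, subset], y, cv=5).mean()

selected, remaining = [], list(range(d))
m = 5                                   # desired number of features
while len(selected) < m:
    # add the feature whose inclusion yields the best score
    best_f = max(remaining, key=lambda f: score(selected + [f]))
    selected.append(best_f)
    remaining.remove(best_f)
    print(f"added feature {best_f}, score = {score(selected):.3f}")

Because each addition is final, SFS can miss feature combinations that are only useful together, which is exactly the kind of local decision the floating (SFFS/SBFS) and randomized strategies try to avoid.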
Evaluation strategies fall into two categories: filter and wrapper. Filter approaches are computationally more efficient than wrapper approaches since they evaluate the goodness of the selected features using criteria that can be tested quickly (e.g., reducing the correlation or the mutual information among features). This efficiency, however, may come at the cost of selecting suboptimal features. Wrapper approaches perform evaluation by training the classifier on the selected features and estimating the classification error using a validation set. Although this is a slower procedure, the selected features are usually better suited to the classifier employed.
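The following is a minimal sketch of how a GA can drive wrapper-style feature selection: each chromosome is a binary mask over the <math>d</math> features, and its fitness is the validation accuracy of a classifier trained on the selected features. The population size, number of generations, mutation rate, and choice of classifier are illustrative assumptions, not prescribed values.

# A minimal sketch of GA-based feature selection with wrapper evaluation.
# Chromosome = binary mask over features; fitness = validation accuracy.
# Population size, generations, mutation rate, and classifier are illustrative.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
d = X.shape[1]

def fitness(mask):
    """Wrapper evaluation: train on the selected features, score on validation."""
    if not mask.any():
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit(X_tr[:, mask], y_tr)
    return clf.score(X_val[:, mask], y_val)

pop = rng.integers(0, 2, size=(20, d)).astype(bool)   # 20 random feature masks
for generation in range(30):
    scores = np.array([fitness(ind) for ind in pop])
    # tournament selection: keep the better of two randomly drawn individuals
    parents = pop[[max(rng.choice(len(pop), 2), key=lambda i: scores[i])
                   for _ in range(len(pop))]]
    # single-point crossover between consecutive parents
    children = parents.copy()
    for i in range(0, len(children) - 1, 2):
        cut = rng.integers(1, d)
        children[i, cut:], children[i + 1, cut:] = \
            parents[i + 1, cut:].copy(), parents[i, cut:].copy()
    # bit-flip mutation with a small per-bit probability
    children ^= rng.random(children.shape) < 0.02
    pop = children

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("selected features:", np.flatnonzero(best), "fitness:", round(fitness(best), 3))

Because fitness is measured by actually training and validating the classifier, this is a wrapper approach: slower than a filter criterion, but the surviving feature masks are tuned to the classifier that will ultimately be deployed.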