# The importance of feature selection fs in the representation of original data

Introduction[ edit ] A feature selection algorithm can be seen as the combination of a search technique for proposing new feature subsets, along with an evaluation measure which scores the different feature subsets. The simplest algorithm is to test each possible subset of features finding the one which minimizes the error rate.

This is an exhaustive search of the space, and is computationally intractable for all but the smallest of feature sets. The choice of evaluation metric heavily influences the algorithm, and it is these evaluation metrics which distinguish between the three main categories of feature selection algorithms: Each new subset is used to train a model, which is tested on a hold-out set.

Counting the number of mistakes made on that hold-out set the error rate of the model gives the score for that subset.

- Filter methods have also been used as a preprocessing step for wrapper methods, allowing a wrapper to be used on larger problems;
- Exhaustive search is generally impractical, so at some implementor or operator defined stopping point, the subset of features with the highest score discovered up to that point is selected as the satisfactory feature subset.

As wrapper methods train a new model for each subset, they are very computationally intensive, but usually provide the best performing feature set for that particular type of model. Filter methods use a proxy measure instead of the error rate to score a feature subset. This measure is chosen to be fast to compute, while still capturing the usefulness of the feature set.

However the feature set doesn't contain the assumptions of a prediction model, and so is more useful for exposing the relationships between the features. Many filters provide a feature ranking rather than an explicit best feature subset, and the cut off point in the ranking is chosen via cross-validation.

Filter methods have also been used as a preprocessing step for wrapper methods, allowing a wrapper to be used on larger problems. Embedded methods are a catch-all group of techniques which perform feature selection as part of the model construction process. The exemplar of this approach is the LASSO method for constructing a linear model, which penalizes the regression coefficients with an L1 penalty, shrinking many of them to zero.

These approaches tend to be between filters and wrappers in terms of computational complexity. In traditional statistics, the most popular form of feature selection is stepwise regressionwhich is a wrapper technique. It is a greedy algorithm that adds the best feature or deletes the worst feature at each round. The main control issue is deciding when to stop the algorithm. In machine learning, this is typically done by cross-validation.

In statistics, some criteria are optimized. This leads to the inherent problem of nesting. More robust methods have been explored, such as branch and bound and piecewise linear network. Subset selection[ edit ] Subset selection evaluates a subset of features as a group for suitability. Subset selection algorithms can be broken up into Wrappers, Filters and Embedded. Wrappers use a search algorithm to search through the space of possible features and evaluate each subset by running a model on the subset.

## Feature selection

Wrappers can be computationally expensive and have a risk of over fitting to the model. Filters are similar to Wrappers in the search approach, but instead of evaluating against a model, a simpler filter is evaluated. Embedded techniques are embedded in and specific to a model. Many popular search approaches use greedy hill climbingwhich iteratively evaluates a candidate subset of features, then modifies the subset and evaluates if the new subset is an improvement over the old.

Evaluation of the subsets requires a scoring metric that grades a subset of features. Exhaustive search is generally impractical, so at some implementor or operator defined stopping point, the subset of features with the highest score discovered up to that point is selected as the satisfactory feature subset. The stopping criterion varies by algorithm; possible criteria include: Alternative search-based techniques are based on targeted projection pursuit which finds low-dimensional projections of the data that score highly: