
Parameters for Feature Selection

Last Updated : 08 Apr, 2025

Feature selection is the process of selecting a subset of relevant features that contribute the most to a model's predictions while discarding redundant, irrelevant or noisy features. This ensures that the model focuses on the important variables required for prediction.

In this article, we will discuss the various parameters used in feature selection, helping you understand how to choose the right features for your machine learning model.

Common Parameters Used in Feature Selection

Here are different parameters that can be used in feature selection.

1. Information Gain

Information Gain measures the reduction in uncertainty about the target variable obtained by knowing the value of a feature. In classification tasks, it is used to select features based on how much information they provide about the target variable. It is used in algorithms like Decision Trees.

\text{IG}(Y, X) = H(Y) - H(Y \mid X)

Where:

  • H(Y) is the entropy of the target variable.
  • H(Y|X) is the conditional entropy of the target variable given the feature X.

A higher information gain indicates that the feature provides more information and is a better choice for feature selection.
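
As a rough illustration, information gain for a discrete target can be estimated with scikit-learn's mutual_info_classif (mutual information coincides with information gain in this setting). This is a minimal sketch; the iris dataset is just a placeholder.

```python
# A minimal sketch: scoring features by estimated mutual information, which
# equals information gain for a discrete target. Dataset is a placeholder.
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

data = load_iris()
X, y = data.data, data.target

# Higher scores mean the feature tells us more about the target classes.
scores = mutual_info_classif(X, y, random_state=0)
for name, score in zip(data.feature_names, scores):
    print(f"{name}: {score:.3f}")
```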

2. Correlation Coefficient

The correlation coefficient quantifies the degree to which two variables are related. In feature selection, it helps identify redundant features. Highly correlated features tend to provide similar information, so one of them can often be removed without significantly affecting the model's performance.

r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}

Where,

  • X_i and Y_i are individual data points for features X and Y
  • \bar{X} and \bar{Y} are the mean values of features X and Y.

A high absolute value of the correlation coefficient (close to 1 or -1) indicates a strong relationship between features. Features with high correlation can often be merged or one can be removed to reduce redundancy.
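
A minimal sketch of this idea, using a pandas correlation matrix on synthetic data; the 0.9 cutoff for flagging redundant features is an assumption and can be tuned.

```python
# A minimal sketch: flagging redundant features from a pandas correlation
# matrix. The data is synthetic and the 0.9 cutoff is an assumed threshold.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=100)})
df["b"] = 2 * df["a"] + rng.normal(scale=0.01, size=100)  # nearly a copy of "a"
df["c"] = rng.normal(size=100)                            # unrelated feature

corr = df.corr().abs()
# Keep only the upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print("Redundant features:", to_drop)  # expected: ['b']
```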

3. p-Value

The p-value helps determine whether the relationship between a feature and the target variable is statistically significant. In feature selection, the p-value is used to identify features that significantly affect the model's output. A low p-value indicates that a feature is statistically significant. A common rule of thumb is:

  • p-value < 0.05: The feature is statistically significant.
  • p-value ≥ 0.05: The feature may not be statistically significant and can be considered for removal.
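
As a hedged example, scikit-learn's f_classif returns ANOVA F-test p-values for each feature against a classification target; the iris dataset and the 0.05 cutoff below are just for illustration.

```python
# A minimal sketch: ANOVA F-test p-values per feature via f_classif.
# The iris dataset and the 0.05 cutoff are just for illustration.
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

data = load_iris()
X, y = data.data, data.target

f_scores, p_values = f_classif(X, y)
for name, p in zip(data.feature_names, p_values):
    verdict = "significant" if p < 0.05 else "candidate for removal"
    print(f"{name}: p = {p:.4g} ({verdict})")
```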

4. Recursive Feature Elimination

Recursive Feature Elimination (RFE) is a wrapper method for feature selection that recursively removes features from the dataset to improve model performance. The process involves fitting the model multiple times and eliminating the least important features based on model accuracy or feature importance scores.

Steps in RFE are:

  1. Train the model on the complete set of features.
  2. Rank the features based on their importance.
  3. Remove the least important features.
  4. Repeat the process until the desired number of features is selected.

RFE is effective for selecting the most relevant features when the model’s performance is the primary focus.
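
A minimal sketch of RFE with scikit-learn, using a logistic regression estimator; keeping two features is an arbitrary choice for illustration.

```python
# A minimal sketch of RFE: a logistic regression is refit repeatedly and the
# weakest features are dropped until two remain (an arbitrary choice here).
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)

print("Selected feature mask:", rfe.support_)
print("Feature ranking (1 = kept):", rfe.ranking_)
```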

5. Feature Importance

Embedded methods, such as those used in tree-based models like Random Forest and Gradient Boosting, provide a feature importance score based on how much each feature contributes to the prediction.

  • Random Forest Feature Importance: Random Forest computes the importance of each feature based on the decrease in impurity (Gini index or entropy) caused by the feature. Features that reduce impurity significantly are considered important, as sketched in the example after this list.
  • Gradient Boosting Feature Importance: In Gradient Boosting features are ranked by their contribution to the model's performance in terms of reducing the loss function.
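
As a short illustration of the Random Forest case described above, impurity-based importances are exposed through the fitted model's feature_importances_ attribute; the dataset is just a placeholder.

```python
# A minimal sketch: impurity-based importances from a fitted Random Forest.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
X, y = data.data, data.target

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importances sum to 1; larger values mean larger total impurity reduction.
for name, importance in zip(data.feature_names, model.feature_importances_):
    print(f"{name}: {importance:.3f}")
```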

6. Variance Threshold

Variance threshold is a simple filter-based method used to select features based on their variance. Features with low variance, i.e., features that have the same or nearly the same value across samples, are less likely to contribute useful information to the model and can be removed.

\text{Variance}(X) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu_X)^2

Where

  • x_i is an individual sample,
  • \mu_X is the mean of feature X
  • n is the number of samples.

Features with variance close to zero can be discarded, as they do not provide enough differentiation between data points.
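
A minimal sketch using scikit-learn's VarianceThreshold on a tiny synthetic matrix; the 0.01 cutoff is an assumption and should be chosen to match the scale of your features.

```python
# A minimal sketch: dropping near-constant columns with VarianceThreshold.
# The 0.01 cutoff is an assumption; it should match the scale of your features.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 2.1, 1.0],
              [0.0, 2.0, 3.0],
              [0.0, 2.1, 0.5],
              [0.0, 2.0, 2.0]])  # column 0 is constant, column 1 nearly so

selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)

print("Kept column indices:", selector.get_support(indices=True))  # expected: [2]
print(X_reduced)
```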

7. Chi-Square Test

The Chi-Square test is used to determine whether there is a significant association between two categorical variables. It is commonly used in feature selection for classification problems involving categorical data. Features with a higher Chi-Square score are considered more important.

\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}

Where

  • O_i is the observed frequency,
  • E_i is the expected frequency.

A higher Chi-Square score indicates a stronger relationship between the feature and the target variable.
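
A hedged example using scikit-learn's chi2 scorer together with SelectKBest; note that chi2 expects non-negative feature values (e.g. counts or frequencies), and the dataset and k=2 below are purely illustrative.

```python
# A minimal sketch: Chi-Square scores via SelectKBest. chi2 expects
# non-negative features (e.g. counts); the dataset and k=2 are illustrative.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

data = load_iris()
X, y = data.data, data.target  # all feature values are non-negative here

selector = SelectKBest(score_func=chi2, k=2).fit(X, y)

for name, score in zip(data.feature_names, selector.scores_):
    print(f"{name}: chi2 = {score:.2f}")
print("Selected column indices:", selector.get_support(indices=True))
```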

By using methods like Information Gain, the correlation coefficient, p-values, Recursive Feature Elimination, feature importance scores, the variance threshold and the Chi-Square test, you can effectively choose the best features for your model and avoid overfitting, redundancy and computational inefficiency.

