Module 2.1 Feature Selection
• model.fit(X_train_selected, y_train)
• Step 8: Make Predictions and Evaluate the Model
• Make predictions on the test set and compute the mean squared error to evaluate the model's performance (this assumes mean_squared_error has been imported from sklearn.metrics).
• y_pred = model.predict(X_test_selected)
• mse = mean_squared_error(y_test, y_pred)
• print("Mean Squared Error:", mse)
• Forward Selection
• Forward Selection starts with an empty set of features and adds the most promising feature at each step.
• The model's performance is evaluated after each addition, and the process continues until a specified number of features has been selected.
Step-by-step process
• Step 1: Empty Feature Set
• Begin with an empty set of features. This set will gradually grow as the algorithm progresses.
• Step 2: Model Training and Evaluation
• Train a machine learning model (e.g., linear regression, decision tree, etc.) using the dataset with the currently selected features. In the first iteration this set is empty, so the search effectively begins by evaluating each single-feature candidate in Step 3.
• Evaluate the model's performance using a suitable metric, such as mean squared error (MSE) for regression tasks or accuracy for classification tasks.
• Step 3: Feature Selection
• In each iteration of forward selection, consider adding one of the remaining candidate
features to the set of selected features.
• Train a new model with the current set of selected features plus the candidate feature.
• Evaluate the performance of the new model using the same metric as in Step 2.
• Step 4: Select the Best Feature
• Among all the candidate features considered in the current iteration, choose the one that leads to the best
improvement in model performance. This is typically determined by comparing the model's performance
metrics.
• Add the selected feature to the set of selected features.
• Step 5: Stopping Criterion
• Decide on a stopping criterion. This could be a predefined number of features to select, a specific
performance threshold, or any other relevant criterion.
• Check if the stopping criterion is met. If it is, stop the forward selection process. Otherwise, continue to the
next iteration.
• Step 6: Final Model
• Once the stopping criterion is met, the selected features form the final set of features to be used in your
model.
• Train a final model using all the selected features.
• Evaluate the final model on a separate test dataset to assess its performance in a more realistic scenario.
• Suppose you're working on a predictive modeling task to predict house prices.
• You have a dataset with features like the number of bedrooms, square footage,
presence of a garage, distance to the nearest school, and age of the house.
• You start with an empty set of features.
• In each iteration, you consider adding one feature to the set and measure how
much it improves the model's ability to predict house prices.
• You continue this process until you've added a predefined number of features or
until you're satisfied with the model's performance.
• The selected features form the final set used in your model to predict house prices. A minimal code sketch of this loop follows below.
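• Below is a minimal forward-selection sketch, not taken from the original slides: the synthetic dataset, the LinearRegression model, and the max_features stopping criterion are all illustrative assumptions.

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Illustrative synthetic regression data (a stand-in for the house-price dataset)
X, y = make_regression(n_samples=200, n_features=10, n_informative=4, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

selected = []                        # Step 1: start with an empty feature set
remaining = list(range(X.shape[1]))
max_features = 4                     # Step 5: assumed stopping criterion

while remaining and len(selected) < max_features:
    best_feature, best_mse = None, float("inf")
    for f in remaining:              # Step 3: try each remaining candidate feature
        trial = selected + [f]
        model = LinearRegression().fit(X_train[:, trial], y_train)
        mse = mean_squared_error(y_val, model.predict(X_val[:, trial]))
        if mse < best_mse:
            best_feature, best_mse = f, mse
    selected.append(best_feature)    # Step 4: keep the best-performing candidate
    remaining.remove(best_feature)
    print(f"Added feature {best_feature}, validation MSE = {best_mse:.3f}")

print("Selected feature indices:", selected)  # Step 6: final feature set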
Embedded Methods
• Embedded methods incorporate feature selection as part of the model
training process.
• These techniques automatically select relevant features during model
training.
• Lasso Regression and Random Forest Importance are two widely used embedded methods.
• Lasso Regression
• Lasso Regression introduces a regularization term that penalizes the absolute values of the feature coefficients. As a result, some coefficients become exactly zero, effectively removing the corresponding features from the model. This technique encourages sparsity and performs feature selection simultaneously.
• Lasso is a modification of linear regression in which the model is penalized for the sum of the absolute values of its weights. The absolute values of the weights are therefore (in general) shrunk, and many tend to become exactly zero.
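• Concretely, scikit-learn's Lasso minimizes the objective (1 / (2 * n_samples)) * ||y - Xw||_2^2 + alpha * ||w||_1, where the L1 term alpha * ||w||_1 is what drives individual weights to exactly zero.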
• Initialize and Train the Lasso Regression Model: Create a Lasso Regression model and specify the strength of the regularization penalty, typically denoted alpha. Larger alpha values lead to stronger regularization and hence more aggressive feature selection. You can use techniques like cross-validation to choose an appropriate alpha value.
• from sklearn.linear_model import Lasso
• alpha = 0.01 # Adjust the value of alpha based on your data and requirements
• lasso_model = Lasso(alpha=alpha)
• lasso_model.fit(X_train, y_train)
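• If you prefer to choose alpha by cross-validation, as mentioned above, scikit-learn's LassoCV automates this. A brief sketch (the cv=5 setting is an illustrative assumption):

from sklearn.linear_model import LassoCV

# Fit Lasso with the regularization strength chosen by 5-fold cross-validation
lasso_model = LassoCV(cv=5).fit(X_train, y_train)
print("Chosen alpha:", lasso_model.alpha_)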
• Feature Selection: Lasso Regression will automatically perform feature selection by shrinking the coefficients of less important features towards zero. After training the model, you can examine the coefficients to identify which features were selected.
• selected_features = X.columns[lasso_model.coef_ != 0]
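• Putting these steps together, here is a minimal end-to-end sketch; the diabetes dataset and alpha = 0.01 are illustrative assumptions, not part of the original example:

import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Illustrative dataset: 10 standardized numeric features
data = load_diabetes()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Assumed alpha; in practice, tune it (e.g., with LassoCV as above)
lasso_model = Lasso(alpha=0.01, max_iter=10_000)
lasso_model.fit(X_train, y_train)

# Features whose coefficients were not shrunk to zero
selected_features = X.columns[lasso_model.coef_ != 0]
print("Selected features:", list(selected_features))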
• Random Forest Importance
• Random Forest Importance (RFI) is a technique used to perform
feature selection by leveraging the capabilities of a Random Forest
classifier or regressor. Random Forest is an ensemble learning method
that combines multiple decision trees to make predictions. RFI
measures the importance of each feature in the Random Forest
model and ranks them based on their contribution to the model's
predictive performance. Features that contribute the most to
reducing impurity or error are considered more important.
• Train a Random Forest Model: Create and train a Random Forest classifier using your training data (this assumes RandomForestClassifier has been imported from sklearn.ensemble).
• rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
• rf_classifier.fit(X_train, y_train)
• Feature Importance Calculation: Retrieve the feature importances from the trained Random Forest model.
• feature_importances = rf_classifier.feature_importances_
• Rank Features: Rank the features by their importance scores in descending order (this assumes pandas has been imported as pd). You can use this ranking to select the top features for your model.
• feature_ranking = pd.DataFrame({'Feature': X.columns, 'Importance': feature_importances})
• feature_ranking = feature_ranking.sort_values(by='Importance', ascending=False)
• Select Top Features: Choose the top N features based on your requirements. You can select a fixed number of features or use a threshold on the importance score.
• top_n_features = feature_ranking.head(N) # Replace N with the desired number of features
• selected_features = top_n_features['Feature'].tolist()
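• Alternatively, to select by a threshold rather than a fixed N (the cutoff value below is an illustrative assumption):

# Keep only features whose importance exceeds an assumed cutoff
threshold = 0.01
selected_features = feature_ranking.loc[feature_ranking['Importance'] > threshold, 'Feature'].tolist()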
• Train and Evaluate the Model with Selected Features: Train a Random Forest model using only the selected features and evaluate its performance.
• X_train_selected = X_train[selected_features]
• X_test_selected = X_test[selected_features]
• rf_classifier.fit(X_train_selected, y_train)
• y_pred = rf_classifier.predict(X_test_selected)
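• To finish the evaluation, compare the predictions against the held-out labels; accuracy is used here as one reasonable metric for a classifier:

from sklearn.metrics import accuracy_score

# Accuracy of the model trained on the reduced feature set
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy with selected features:", accuracy)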