Machine Learning Techniques Lesson 1
Suppose you're working on a regression problem and have identified the Random Forest
Regressor as a promising model. There is one challenge to overcome: missing values in the
dataset, which can hinder the use of the model. This calls for a deliberate strategy to handle
them. In this article, we'll explore several techniques to address this issue.
With these approaches, we can train our Random Forest Regressor on data that contains
missing values and compare how each choice affects the quality of its predictions.
In this guide, we'll explore different techniques and approaches to handle missing values
effectively. We'll discuss how to identify missing values, understand their patterns and causes,
and then make informed decisions on how to handle them.
Handling missing values involves making choices based on the specific context and dataset.
We'll explore various strategies, such as imputation, where missing values are filled in with
estimated values, or removal, where rows or columns with missing values are excluded from
the analysis. Each approach has its pros and cons, and we'll dive into the considerations for
choosing the most appropriate method.
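Before choosing between imputation and removal, it helps to see how much data is actually missing. The following is a minimal sketch of that inspection step; the toy DataFrame and its column names are invented for illustration and are not taken from the lesson's dataset.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration; the column names are hypothetical.
df = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 9550.0],
    "YearBuilt": [2003, 1976, 2001, 1915],
    "GarageArea": [548.0, np.nan, np.nan, 642.0],
})

# Number of missing entries in each column.
missing_counts = df.isnull().sum()

# Share of rows that are missing in each column (between 0.0 and 1.0).
missing_share = df.isnull().mean()

# Show only the columns that have at least one missing value.
summary = pd.DataFrame({"n_missing": missing_counts, "share": missing_share})
print(summary[summary["n_missing"] > 0])
```

A column that is missing in most rows may be a candidate for removal, while a column with only a few gaps is usually worth imputing.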
Parameters:
X_train (array-like): Training data features.
X_valid (array-like): Validation data features.
y_train (array-like): Training data labels.
y_valid (array-like): Validation data labels.
Returns:
float: Mean absolute error between the actual validation labels and the
predicted labels.
"""
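The docstring above describes a helper that takes the training and validation splits and returns a mean absolute error, but its body is not shown. One way such a helper could look is sketched below. The name score_dataset, the model settings, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error


def score_dataset(X_train, X_valid, y_train, y_valid):
    """Fit a random forest on the training split and return the validation MAE.

    The function name and the model settings are hypothetical.
    """
    # n_estimators and random_state are illustrative choices, not from the text.
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)


# Synthetic data, used only to show the call.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

mae = score_dataset(X[:150], X[150:], y[:150], y[150:])
print(f"Validation MAE: {mae:.3f}")
```

Scoring every missing-value strategy with the same helper keeps the comparison fair, because only the preprocessing changes between runs.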
import pandas as pd

def cols_missing(df):
    """
    Return the names of the columns that contain missing values.
    Parameters:
        df (pd.DataFrame): The DataFrame to check for missing values.
    Returns:
        list: A list of column names with missing values.
    """
    cols_with_missing = []
    for col in df.columns:
        if df[col].isnull().any():
            cols_with_missing.append(col)
    return cols_with_missing

# Columns with missing values in the training data
cols_with_missing = cols_missing(X_train)
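import pandas as pd

# The function name is an assumption. cols_with_missing is passed in so that
# the same columns are dropped from the training and validation data.
def drop_missing_columns(df, cols_with_missing):
    """
    Drop the given columns from a DataFrame.
    Parameters:
        df (pd.DataFrame): The DataFrame to drop columns from.
        cols_with_missing (list): List of column names with missing values.
    Returns:
        pd.DataFrame: The DataFrame with columns dropped.
    """
    df_dropped = df.drop(cols_with_missing, axis=1)
    return df_dropped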
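import pandas as pd
from sklearn.impute import SimpleImputer

def impute_missing_values(df):
    """
    Fill missing values with the column mean (the SimpleImputer default).
    Parameters:
        df (pd.DataFrame): The DataFrame to impute missing values.
    Returns:
        pd.DataFrame: The DataFrame with imputed values.
    """
    imputer = SimpleImputer()
    imputed_data = pd.DataFrame(imputer.fit_transform(df))
    # fit_transform returns a NumPy array, so restore the column names
    imputed_data.columns = df.columns
    return imputed_data

imputed_X_train = impute_missing_values(X_train)
imputed_X_valid = impute_missing_values(X_valid)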
The code snippet above wraps the imputation step in the impute_missing_values function
and documents it with a docstring. Here's a summary of what it does:
• The impute_missing_values function takes a DataFrame df as input and returns the
DataFrame with imputed values.
• Inside the function, a SimpleImputer is created to impute missing values.
• The fit_transform method is used to fit the imputer on the data and transform the
data, filling in the missing values.
• The transformed data is converted back to a DataFrame, and the column names are
assigned to the imputed data using df.columns .
• The resulting DataFrame with imputed values is returned from the function.
• Finally, the impute_missing_values function is called to impute missing values in
X_train and X_valid , assigning the results to imputed_X_train and
imputed_X_valid , respectively.
By encapsulating the imputation logic within a function and providing a docstring, the
code becomes more modular, reusable, and self-explanatory.
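Calling impute_missing_values separately on X_train and X_valid refits the imputer on each DataFrame, so gaps in the validation data are filled with statistics computed from the validation data itself. A common alternative is to fit the imputer once on the training data and reuse it for validation. The sketch below shows that variant. It is not the function described above, and the two small DataFrames are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data invented for illustration.
X_train = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, np.nan]})
X_valid = pd.DataFrame({"a": [np.nan, 100.0], "b": [np.nan, 40.0]})

imputer = SimpleImputer()  # default strategy is "mean"

# Learn the column means from the training data only.
imputed_X_train = pd.DataFrame(
    imputer.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)

# Reuse the training means to fill the gaps in the validation data.
imputed_X_valid = pd.DataFrame(
    imputer.transform(X_valid), columns=X_valid.columns, index=X_valid.index
)

print(imputed_X_valid)
```

Here the missing validation entries receive the training means (2.0 for a, 15.0 for b), which mirrors what would happen when the model meets new data after deployment.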
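# The function name is an assumption.
def add_missing_indicators(df, cols_with_missing):
    """
    Add a boolean <column>_was_missing indicator for each listed column.
    Parameters:
        df (pd.DataFrame): The DataFrame to add indicator columns to.
        cols_with_missing (list): List of column names with missing values.
    Returns:
        pd.DataFrame: The DataFrame with indicator columns added.
    """
    # Work on a copy so the caller's DataFrame is left unchanged
    df_plus = df.copy()
    for col in cols_with_missing:
        df_plus[col + '_was_missing'] = df_plus[col].isnull()
    return df_plus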
# Make copy of data to avoid changing the original data (when imputing)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()
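The copies above are the starting point for combining indicator columns with imputation: first record which entries were missing, then fill them in. The remaining steps might look like the sketch below. The toy data is invented, the indicators are cast to integers to keep every column numeric, and fitting the imputer on the training data only is an assumption rather than something the text prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data invented for illustration.
X_train = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, 30.0]})
X_valid = pd.DataFrame({"a": [np.nan, 5.0], "b": [40.0, 50.0]})

# Columns with missing values, decided from the training data.
cols_with_missing = [c for c in X_train.columns if X_train[c].isnull().any()]

# Make copies to avoid changing the original data.
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# Record where values were missing before they are filled in.
for col in cols_with_missing:
    X_train_plus[col + "_was_missing"] = X_train_plus[col].isnull().astype(int)
    X_valid_plus[col + "_was_missing"] = X_valid_plus[col].isnull().astype(int)

# Impute: fit on the training data, then apply the same fill values to validation.
imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(
    imputer.fit_transform(X_train_plus),
    columns=X_train_plus.columns,
    index=X_train_plus.index,
)
imputed_X_valid_plus = pd.DataFrame(
    imputer.transform(X_valid_plus),
    columns=X_valid_plus.columns,
    index=X_valid_plus.index,
)

print(imputed_X_train_plus)
```

The indicator columns let the model learn whether the fact that a value was missing carries information of its own.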
Handling missing values is only part of the work: the model's performance also depends on its
hyperparameters. This is where grid search comes in. Grid search is a technique that helps us
systematically explore different combinations of hyperparameters to identify the best
ones for our RandomForestRegressor model. It saves us from the tedious and time-consuming
process of manually tuning hyperparameters.
During grid search, we specify a range of values for different hyperparameters, such as the
number of trees in the forest, the maximum depth of the trees, and the minimum number of
samples required to split a node. Grid search then exhaustively tries all possible combinations
of these hyperparameters and evaluates the performance of each combination using cross-
validation.
By performing grid search, we can evaluate multiple models with different hyperparameter
settings and compare their performance. This allows us to identify the combination of
hyperparameters that gives the lowest prediction error, or the best trade-off between
error and model complexity.
Grid search helps us avoid the risk of using suboptimal hyperparameter values that could lead
to underfitting or overfitting the data. It provides a systematic approach to fine-tuning our
RandomForestRegressor model, enabling us to achieve better predictive performance on
unseen data.
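from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# The function name and the GridSearchCV steps are assumptions; the text
# states only the inputs and the returned model.
def grid_search_rf(X, y, param_grid, cv=5):
    """
    Run a grid search over param_grid and return the best model.
    Parameters:
        X (pd.DataFrame or np.array): The input features.
        y (pd.Series or np.array): The target variable.
        param_grid (dict): The parameter grid to search over.
        cv (int, optional): The number of cross-validation folds (default: 5).
    Returns:
        RandomForestRegressor: The best RandomForestRegressor model found during
        grid search.
    """
    # Create RandomForestRegressor
    rf = RandomForestRegressor()
    # Evaluate every combination in param_grid with cv-fold cross-validation
    grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=cv)
    grid_search.fit(X, y)
    # Model refit on all of X and y with the best combination
    best_rf = grid_search.best_estimator_
    return best_rf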
Notes on the hyperparameters in param_grid:
1. n_estimators : This hyperparameter determines the number of trees in the random forest.
The param_grid specifies three possible values to be evaluated: 100, 200, and 300. The grid
search will try these values to find the best number of trees for the model.
2. max_depth : This hyperparameter determines the maximum depth of each tree in the
random forest. The param_grid specifies three possible values: None , 5, and 10. A None
value means there is no maximum depth limit. The grid search will evaluate these values to
find the best maximum depth for the trees.
3. min_samples_split : This hyperparameter determines the minimum number of samples
required to split an internal node in a tree. The param_grid specifies three possible values:
2, 5, and 10. This parameter helps control the complexity of the trees: larger values produce
simpler trees, which can reduce overfitting. The grid search will test these values to find the
best minimum number of samples for splitting a node.
By specifying different values for each hyperparameter in the param_grid , the grid search will
evaluate all possible combinations of these values. It will train and evaluate multiple
RandomForestRegressor models, each with a different combination of hyperparameters,
using cross-validation. The goal is to identify the combination that yields the best
performance based on the evaluation metric chosen, such as mean absolute error or R-
squared.
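The grid described above can be written as a dictionary, and the number of models it implies can be counted before any training starts. The sketch below uses scikit-learn's ParameterGrid for the count. The values come from the description above, while cv = 5 is the default mentioned for the grid search function.

```python
from sklearn.model_selection import ParameterGrid

# Values as described in the text.
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
}

# Every combination of one value per hyperparameter: 3 * 3 * 3.
n_combinations = len(ParameterGrid(param_grid))

# Each combination is trained once per cross-validation fold.
cv = 5
n_fits = n_combinations * cv

print(n_combinations, n_fits)  # 27 135
```

With its default settings, GridSearchCV also refits the best combination once on the full training data, so the total is one fit more than this count. Adding a fourth hyperparameter with three values would triple it.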
The choice of the cv value depends on factors such as the size of the dataset, the complexity
of the model, and the available computing resources. Common values for cv include 3, 5, and
10, but other values can be used too.
Using a higher cv value can give us a more reliable estimate of how well the model will
perform, but it may take longer to compute. On the other hand, using a lower cv value can be
faster, but the performance estimate may have more variability.
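The cost side of this trade-off is easy to see in a small experiment. The sketch below runs cross-validation with three values of cv on synthetic data; the dataset and model settings are assumptions chosen to keep the run short. The spread across folds that it prints is only a rough signal of how stable the estimate is.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data, used only for illustration.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = RandomForestRegressor(n_estimators=30, random_state=0)

results = {}
for cv in (3, 5, 10):
    # scikit-learn reports negated MAE so that higher is better; flip the sign back.
    scores = -cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
    results[cv] = scores
    print(
        f"cv={cv}: {len(scores)} fits, "
        f"mean MAE={scores.mean():.2f}, spread across folds={scores.std():.2f}"
    )
```

Each extra fold means one more model to train, and inside a grid search that cost is multiplied by the number of hyperparameter combinations.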
It's important to remember that cross-validation is not the final evaluation. Once the best
hyperparameters are found through grid search, it's crucial to assess the model's
performance on a separate, unseen test set to get a more accurate understanding of how it
will perform in real-world situations.
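One way to follow this advice is to set aside a test set before the grid search and score the chosen model on it only once, at the end. The sketch below does this on synthetic data. The dataset, the small grid, and the split ratio are assumptions chosen for a quick run, not values from the text.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data, used only for illustration.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Hold out 20% of the rows; the grid search never sees them.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A deliberately small grid so the example runs quickly.
param_grid = {"n_estimators": [20, 40], "max_depth": [None, 5]}
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_absolute_error",
)
grid_search.fit(X_train, y_train)

# Cross-validated error of the best combination (sign flipped back to MAE).
cv_mae = -grid_search.best_score_

# Error on the held-out test set, computed once with the refit best model.
test_mae = mean_absolute_error(y_test, grid_search.best_estimator_.predict(X_test))

print(grid_search.best_params_)
print(f"CV MAE: {cv_mae:.2f}  Test MAE: {test_mae:.2f}")
```

The cross-validated score was used to pick the hyperparameters, so it tends to be somewhat optimistic. The test score is the fairer estimate of performance on unseen data.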
Homework
Apply these techniques to the Housing Prices data from the Kaggle competition "Housing Prices
Competition for Kaggle Learn Users" and analyze your solution.
References:
1. scikit-learn API reference: sklearn.ensemble.RandomForestRegressor
2. scikit-learn User Guide, section 3.2: Tuning the hyper-parameters of an estimator
3. Kaggle: Your Machine Learning and Data Science Community
4. Kalinowski, M.; Escovedo, T.; Villamizar, H.; Lopes, H. Engenharia de Software para Ciência
de Dados: Um guia de boas práticas com ênfase na construção de sistemas de Machine
Learning em Python. Casa do Código.