
Machine Learning Techniques

Lesson 1: Missing Values


Igor Caetano Diniz, Machine Learning Engineer and Data Scientist

Suppose you're working on a regression problem and have identified the Random Forest
Regressor as a promising strategy. However, there's a challenge to overcome: missing values
in the dataset, which can prevent the model from being fit at all. This calls for a smart and
effective strategy for handling the missing values. In this article, we'll explore several
suggestions and techniques to address this issue.

By implementing these approaches, we can ensure that our Random Forest Regressor
performs optimally even in the presence of missing values. So, let's delve into the solutions
that will help us successfully navigate this obstacle and unlock the full potential of our
regression analysis.

Deal with Missing Values


Dealing with missing values is a common challenge when working with data. Missing values
can occur for various reasons, such as data entry errors, incomplete data collection, or certain
values being unavailable. It's important to handle missing values properly to ensure the
accuracy and reliability of our analyses and models.

In this guide, we'll explore different techniques and approaches to handle missing values
effectively. We'll discuss how to identify missing values, understand their patterns and causes,
and then make informed decisions on how to handle them.

Handling missing values involves making choices based on the specific context and dataset.
We'll explore various strategies, such as imputation, where missing values are filled in with
estimated values, or removal, where rows or columns with missing values are excluded from
the analysis. Each approach has its pros and cons, and we'll dive into the considerations for
choosing the most appropriate method.

First: Modularize the code


Functional Programming

Add functions to your pipeline to make estimation and evaluation reproducible

Create a function to evaluate model performance

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_valid, y_train, y_valid):
    """
    Compare different approaches for a machine learning task.

    Parameters:
    X_train (array-like): Training data features.
    X_valid (array-like): Validation data features.
    y_train (array-like): Training data labels.
    y_valid (array-like): Validation data labels.

    Returns:
    float: Mean absolute error between the actual validation labels and the
    predicted labels.
    """
    # Fit a baseline random forest and score it on the validation set
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
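
A quick usage sketch (assuming X_train, X_valid, y_train, and y_valid come from an earlier train/validation split, and that missing values have already been handled with one of the strategies below, since the plain random forest rejects NaNs):

mae = score_dataset(X_train, X_valid, y_train, y_valid)
print(f"Validation MAE: {mae:.0f}")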

Create a function to list columns with missing values

import pandas as pd

def cols_missing(df: pd.DataFrame) -> list:
    """
    Get a list of columns with missing values in a DataFrame.

    Parameters:
    df (pd.DataFrame): The DataFrame to check for missing values.

    Returns:
    list: A list of column names with missing values.
    """
    # Collect every column that contains at least one null value
    cols_with_missing = []
    for col in df.columns:
        if df[col].isnull().any():
            cols_with_missing.append(col)
    return cols_with_missing

cols_with_missing = cols_missing(X_train)
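
For reference, the same check can be written as a one-line list comprehension, which is the more common pandas idiom (equivalent output, shown here on the same X_train):

# Equivalent to cols_missing(X_train)
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]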

Second: Adopt a proper strategy


1. Drop Columns with missing values

import pandas as pd

def drop_columns_with_missing(df: pd.DataFrame, cols_with_missing: list) -> pd.DataFrame:
    """
    Drop columns with missing values from a DataFrame.

    Parameters:
    df (pd.DataFrame): The DataFrame to drop columns from.
    cols_with_missing (list): Column names to drop, computed on the training
    set so that training and validation data keep the same columns.

    Returns:
    pd.DataFrame: The DataFrame with columns dropped.
    """
    return df.drop(cols_with_missing, axis=1)

# Determine the columns to drop from the training data only, then drop them
# from both training and validation data
cols_with_missing = cols_missing(X_train)
reduced_X_train = drop_columns_with_missing(X_train, cols_with_missing)
reduced_X_valid = drop_columns_with_missing(X_valid, cols_with_missing)
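
To quantify what dropping columns costs, the reduced datasets can be scored with the score_dataset function defined earlier (a sketch, assuming y_train and y_valid are already defined):

# Lower MAE is better; this gives a baseline for the other strategies
print("MAE (drop columns):", score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))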

2. Imputation: fill in missing values with an average measure

import pandas as pd
from sklearn.impute import SimpleImputer

def impute_missing_values(train_df: pd.DataFrame, valid_df: pd.DataFrame):
    """
    Impute missing values in training and validation DataFrames using
    SimpleImputer (mean imputation by default).

    The imputer is fitted on the training data only and then applied to the
    validation data, so both sets are filled with the same statistics.

    Parameters:
    train_df (pd.DataFrame): Training features to fit the imputer on.
    valid_df (pd.DataFrame): Validation features to transform.

    Returns:
    tuple: The imputed training and validation DataFrames.
    """
    imputer = SimpleImputer()
    # fit_transform learns the column means from the training data;
    # transform reuses those means on the validation data
    imputed_train = pd.DataFrame(imputer.fit_transform(train_df),
                                 columns=train_df.columns, index=train_df.index)
    imputed_valid = pd.DataFrame(imputer.transform(valid_df),
                                 columns=valid_df.columns, index=valid_df.index)
    return imputed_train, imputed_valid

# Impute missing values in training and validation data
imputed_X_train, imputed_X_valid = impute_missing_values(X_train, X_valid)

In the code snippet, I have modularized the process of imputing missing values using the
impute_missing_values function and added a docstring. Here's a summary of the changes:
• The impute_missing_values function takes the training and validation DataFrames as input and returns both with imputed values.
• Inside the function, a SimpleImputer is created to impute missing values (by default it fills in each column's mean).
• The fit_transform method fits the imputer on the training data and fills in its missing values; transform then applies the same learned statistics to the validation data, so both sets are imputed consistently and no information from the validation set leaks into preprocessing.
• The transformed arrays are converted back to DataFrames, and the original column names and indexes are restored.
• Finally, impute_missing_values is called on X_train and X_valid, assigning the results to imputed_X_train and imputed_X_valid.
By encapsulating the imputation logic within a function and providing a docstring, the
code becomes more modular, reusable, and self-explanatory.
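
As with the previous strategy, the imputed datasets can be scored for comparison (again assuming y_train and y_valid from the earlier split):

# Compare against the drop-columns MAE from strategy 1
print("MAE (imputation):", score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))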

3. Imputation + add binary column


When we have missing values in a dataset, we usually fill them in using a technique called
imputation. However, sometimes the imputed values may not exactly match the original
missing values. To address this, there's an extension to imputation that adds extra
information to the dataset.

In this approach, we not only fill in the missing values but also create a new column to
indicate which values were imputed. By doing this, we provide the model with additional
knowledge about the imputed entries. This can help the model make better predictions by
considering the specific characteristics associated with missing values.

However, it's important to note that this approach doesn't always improve the results. Its
effectiveness depends on the dataset and the relationship between missing values and the
target variable. So, it's necessary to experiment with different imputation methods and
evaluate their impact on the model's performance to see if this extension is helpful for a
specific problem.
import pandas as pd

def add_missing_indicator(df: pd.DataFrame, cols_with_missing: list) -> pd.DataFrame:
    """
    Add indicator columns to a DataFrame indicating missing values.

    Parameters:
    df (pd.DataFrame): The DataFrame to add indicator columns to.
    cols_with_missing (list): List of column names with missing values.

    Returns:
    pd.DataFrame: The DataFrame with indicator columns added.
    """
    # Copy so the original data is not changed when imputing later
    df_plus = df.copy()
    for col in cols_with_missing:
        df_plus[col + '_was_missing'] = df_plus[col].isnull()
    return df_plus

# Add indicator columns for missing values (the function copies the data,
# so the originals are not modified)
X_train_plus = add_missing_indicator(X_train, cols_with_missing)
X_valid_plus = add_missing_indicator(X_valid, cols_with_missing)

# Impute missing values in training and validation data; column names and
# indexes are preserved by impute_missing_values
imputed_X_train_plus, imputed_X_valid_plus = impute_missing_values(X_train_plus, X_valid_plus)

• The add_missing_indicator function takes a DataFrame df and a list of column names cols_with_missing as input, and returns the DataFrame with indicator columns added.
• Inside the function, a copy of the DataFrame is made to avoid changing the original data. Then, a loop adds an indicator column for each column with missing values.
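
The extended datasets can be scored the same way, which makes it easy to check whether the indicator columns actually help on a given dataset (a sketch, assuming y_train and y_valid):

# Compare against the MAE of plain imputation from strategy 2
print("MAE (imputation + indicators):", score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))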

Bonus: Hyperparameter tuning Module


When training a RandomForestRegressor model, we need to choose the right combination of
hyperparameters that determine how the model learns and makes predictions. However,
finding the best set of hyperparameters can be a challenging task.

This is where grid search comes in. Grid search is an essential technique that helps us
systematically explore different combinations of hyperparameters to identify the optimal
ones for our RandomForestRegressor model. It saves us from the tedious and time-consuming
process of manually tuning hyperparameters.

During grid search, we specify a range of values for different hyperparameters, such as the
number of trees in the forest, the maximum depth of the trees, and the minimum number of
samples required to split a node. Grid search then exhaustively tries all possible combinations
of these hyperparameters and evaluates the performance of each combination using cross-
validation.

By performing grid search, we can evaluate multiple models with different hyperparameter
settings and compare their performance. This allows us to identify the best combination of
hyperparameters that results in the highest predictive accuracy or the best trade-off between
accuracy and model complexity.

Grid search helps us avoid the risk of using suboptimal hyperparameter values that could lead
to underfitting or overfitting the data. It provides a systematic approach to fine-tuning our
RandomForestRegressor model, enabling us to achieve better predictive performance on
unseen data.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

def grid_search_random_forest(X, y, param_grid, cv=5):
    """
    Perform grid search on a RandomForestRegressor using the specified
    parameter grid.

    Parameters:
    X (pd.DataFrame or np.array): The input features.
    y (pd.Series or np.array): The target variable.
    param_grid (dict): The parameter grid to search over.
    cv (int, optional): The number of cross-validation folds (default: 5).

    Returns:
    RandomForestRegressor: The best RandomForestRegressor model found during
    grid search.
    """
    # Create RandomForestRegressor (fixed seed for reproducible results)
    rf = RandomForestRegressor(random_state=0)

    # Perform grid search over every combination in param_grid
    grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=cv)
    grid_search.fit(X, y)

    # Get the best model, refitted on all of X with the best parameters
    best_rf = grid_search.best_estimator_

    return best_rf

How can we use this function? Like this:

# Define the parameter grid for grid search
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10]
}

# Perform grid search on RandomForestRegressor
# (X_train must already have its missing values handled, e.g. by one of the
# strategies above)
best_model = grid_search_random_forest(X_train, y_train, param_grid, cv=5)
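
Since grid_search_random_forest returns the refitted best estimator, the winning hyperparameter values can be read directly off the model (a sketch using the best_model from above; scikit-learn estimators expose their constructor parameters as attributes):

print("Best n_estimators:", best_model.n_estimators)
print("Best max_depth:", best_model.max_depth)
print("Best min_samples_split:", best_model.min_samples_split)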

Note:

Grid Search Parameters:


The param_grid is a parameter in the grid search that allows us to define the hyperparameter
grid to search over. It specifies the different values that we want to try for each
hyperparameter. In this example:

1. n_estimators: This hyperparameter determines the number of trees in the random forest. The param_grid specifies three possible values to be evaluated: 100, 200, and 300. The grid search will try these values to find the best number of trees for the model.
2. max_depth: This hyperparameter determines the maximum depth of each tree in the random forest. The param_grid specifies three possible values: None, 5, and 10. A None value means there is no maximum depth limit. The grid search will evaluate these values to find the best maximum depth for the trees.
3. min_samples_split: This hyperparameter determines the minimum number of samples required to split an internal node in a tree. The param_grid specifies three possible values: 2, 5, and 10. This parameter helps control the complexity of the trees and prevents overfitting. The grid search will test these values to find the best minimum number of samples for splitting a node.

By specifying different values for each hyperparameter in the param_grid, the grid search will
evaluate all possible combinations of these values: here that is 3 × 3 × 3 = 27 combinations,
and with cv=5 each one is trained and validated five times, for 135 model fits in total. Each
RandomForestRegressor is evaluated with cross-validation, and the goal is to identify the
combination that yields the best performance based on the chosen evaluation metric, such as
mean absolute error or R-squared.

Cross Validation Parameter:


The cv hyperparameter, which stands for cross-validation, is an important setting in grid
search. It determines how the data is divided during the evaluation of the model. Think of it as
a way to assess the model's performance and estimate how well it would work on new, unseen
data.
Cross-validation helps us evaluate the model in a reliable and robust manner. It involves
splitting the training data into cv number of groups, or folds. The model is then trained and
evaluated multiple times, each time using a different fold as the validation set and the
remaining folds as the training set.
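
The fold mechanics can be seen directly with scikit-learn's cross_val_score. This is an illustrative sketch, not part of the pipeline above, and it assumes the imputed_X_train and y_train from the imputation step:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# With cv=5, each of the 5 folds takes one turn as the validation set
scores = cross_val_score(
    RandomForestRegressor(n_estimators=100, random_state=0),
    imputed_X_train, y_train,
    cv=5, scoring='neg_mean_absolute_error',
)
print("MAE per fold:", -scores)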

The choice of the cv value depends on factors such as the size of the dataset, the complexity
of the model, and the available computing resources. Common values for cv include 3, 5, and
10, but other values can be used too.
Using a higher cv value can give us a more reliable estimate of how well the model will
perform, but it may take longer to compute. On the other hand, using a lower cv value can be
faster, but the performance estimate may have more variability.

It's important to remember that cross-validation is not the final evaluation. Once the best
hyperparameters are found through grid search, it's crucial to assess the model's
performance on a separate, unseen test set to get a more accurate understanding of how it
will perform in real-world situations.
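
A final sketch of that check (assuming a hypothetical X_test and y_test that were held out before any tuning, and preprocessed the same way as the training data):

from sklearn.metrics import mean_absolute_error

# The test set is touched exactly once, after all tuning is finished
test_preds = best_model.predict(X_test)
print("Test MAE:", mean_absolute_error(y_test, test_preds))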

Homework

Apply these techniques to the Housing Prices data in the Housing Prices Competition for
Kaggle Learn Users and analyze your solution.

References:

1. sklearn.ensemble.RandomForestRegressor
2. 3.2. Tuning the hyper-parameters of an estimator
3. Kaggle: Your Machine Learning and Data Science Community
4. Kalinowski, M.; Escovedo, T.; Villamizar, H.; Lopes, H. Engenharia de Software para Ciência
de Dados: Um guia de boas práticas com ênfase na construção de sistemas de Machine
Learning em Python. Casa do Código.

