
Machine Learning Techniques

Lesson 1: Missing Values


Igor Caetano Diniz, Machine Learning Engineer and Data Scientist

Suppose you're working on a regression problem and have identified the Random Forest
Regressor as a promising strategy. However, there's a challenge to overcome: missing values
in the dataset, which can prevent the model from being fit at all. This calls for a smart and
effective strategy for handling the missing values. In this article, we'll explore several
suggestions and techniques to address this issue.

By implementing these approaches, we can ensure that our Random Forest Regressor
performs optimally even in the presence of missing values. So, let's delve into the solutions
that will help us successfully navigate this obstacle and unlock the full potential of our
regression analysis.

Deal with Missing Values


Dealing with missing values is a common challenge when working with data. Missing values
can occur for various reasons, such as data entry errors, incomplete data collection, or certain
values being unavailable. It's important to handle missing values properly to ensure the
accuracy and reliability of our analyses and models.

In this guide, we'll explore different techniques and approaches to handle missing values
effectively. We'll discuss how to identify missing values, understand their patterns and causes,
and then make informed decisions on how to handle them.

Handling missing values involves making choices based on the specific context and dataset.
We'll explore various strategies, such as imputation, where missing values are filled in with
estimated values, or removal, where rows or columns with missing values are excluded from
the analysis. Each approach has its pros and cons, and we'll dive into the considerations for
choosing the most appropriate method.

First: Modularize the code


Functional Programming

Add functions to your pipeline to make estimation and evaluation reproducible

Create a function to evaluate model performance

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_valid, y_train, y_valid):
    """
    Compare different approaches for a machine learning task.

    Parameters:
    X_train (array-like): Training data features.
    X_valid (array-like): Validation data features.
    y_train (array-like): Training data labels.
    y_valid (array-like): Validation data labels.

    Returns:
    float: Mean absolute error between the actual validation labels and the
    predicted labels.
    """
    # Fit a baseline random forest and score it on the validation set
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
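
A quick usage sketch (assuming X_train, X_valid, y_train, and y_valid come from an earlier train/validation split, and that missing values have already been handled with one of the strategies below, since the plain random forest rejects NaNs):

mae = score_dataset(X_train, X_valid, y_train, y_valid)
print(f"Validation MAE: {mae:.0f}")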

Create a function to list columns with missing values

import pandas as pd

def cols_missing(df: pd.DataFrame) -> list:
    """
    Get a list of columns with missing values in a DataFrame.

    Parameters:
    df (pd.DataFrame): The DataFrame to check for missing values.

    Returns:
    list: A list of column names with missing values.
    """
    # Collect every column that contains at least one null value
    cols_with_missing = []
    for col in df.columns:
        if df[col].isnull().any():
            cols_with_missing.append(col)
    return cols_with_missing

cols_with_missing = cols_missing(X_train)
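
For reference, the same check can be written as a one-line list comprehension, which is the more common pandas idiom (equivalent output, shown here on the same X_train):

# Equivalent to cols_missing(X_train)
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]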

Second: Adopt a proper strategy


1. Drop Columns with missing values

import pandas as pd

def drop_columns_with_missing(df: pd.DataFrame, cols_with_missing: list) -> pd.DataFrame:
    """
    Drop columns with missing values from a DataFrame.

    Parameters:
    df (pd.DataFrame): The DataFrame to drop columns from.
    cols_with_missing (list): Column names to drop, computed on the training
    set so that training and validation data keep the same columns.

    Returns:
    pd.DataFrame: The DataFrame with columns dropped.
    """
    return df.drop(cols_with_missing, axis=1)

# Determine the columns to drop from the training data only, then drop them
# from both training and validation data
cols_with_missing = cols_missing(X_train)
reduced_X_train = drop_columns_with_missing(X_train, cols_with_missing)
reduced_X_valid = drop_columns_with_missing(X_valid, cols_with_missing)
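
To quantify what dropping columns costs, the reduced datasets can be scored with the score_dataset function defined earlier (a sketch, assuming y_train and y_valid are already defined):

# Lower MAE is better; this gives a baseline for the other strategies
print("MAE (drop columns):", score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))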

2. Imputation: fill in missing values with an average measure

import pandas as pd
from sklearn.impute import SimpleImputer

def impute_missing_values(train_df: pd.DataFrame, valid_df: pd.DataFrame):
    """
    Impute missing values in training and validation DataFrames using
    SimpleImputer (mean imputation by default).

    The imputer is fitted on the training data only and then applied to the
    validation data, so both sets are filled with the same statistics.

    Parameters:
    train_df (pd.DataFrame): Training features to fit the imputer on.
    valid_df (pd.DataFrame): Validation features to transform.

    Returns:
    tuple: The imputed training and validation DataFrames.
    """
    imputer = SimpleImputer()
    # fit_transform learns the column means from the training data;
    # transform reuses those means on the validation data
    imputed_train = pd.DataFrame(imputer.fit_transform(train_df),
                                 columns=train_df.columns, index=train_df.index)
    imputed_valid = pd.DataFrame(imputer.transform(valid_df),
                                 columns=valid_df.columns, index=valid_df.index)
    return imputed_train, imputed_valid

# Impute missing values in training and validation data
imputed_X_train, imputed_X_valid = impute_missing_values(X_train, X_valid)

In the code snippet, I have modularized the process of imputing missing values using the
impute_missing_values function and added a docstring. Here's a summary of the changes:
• The impute_missing_values function takes the training and validation DataFrames as input and returns both with imputed values.
• Inside the function, a SimpleImputer is created to impute missing values (by default it fills in each column's mean).
• The fit_transform method fits the imputer on the training data and fills in its missing values; transform then applies the same learned statistics to the validation data, so both sets are imputed consistently and no information from the validation set leaks into preprocessing.
• The transformed arrays are converted back to DataFrames, and the original column names and indexes are restored.
• Finally, impute_missing_values is called on X_train and X_valid, assigning the results to imputed_X_train and imputed_X_valid.
By encapsulating the imputation logic within a function and providing a docstring, the
code becomes more modular, reusable, and self-explanatory.
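
As with the previous strategy, the imputed datasets can be scored for comparison (again assuming y_train and y_valid from the earlier split):

# Compare against the drop-columns MAE from strategy 1
print("MAE (imputation):", score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))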

3. Imputation + add binary column


When we have missing values in a dataset, we usually fill them in using a technique called
imputation. However, sometimes the imputed values may not exactly match the original
missing values. To address this, there's an extension to imputation that adds extra
information to the dataset.

In this approach, we not only fill in the missing values but also create a new column to
indicate which values were imputed. By doing this, we provide the model with additional
knowledge about the imputed entries. This can help the model make better predictions by
considering the specific characteristics associated with missing values.

However, it's important to note that this approach doesn't always improve the results. Its
effectiveness depends on the dataset and the relationship between missing values and the
target variable. So, it's necessary to experiment with different imputation methods and
evaluate their impact on the model's performance to see if this extension is helpful for a
specific problem.
import pandas as pd

def add_missing_indicator(df: pd.DataFrame, cols_with_missing: list) -> pd.DataFrame:
    """
    Add indicator columns to a DataFrame indicating missing values.

    Parameters:
    df (pd.DataFrame): The DataFrame to add indicator columns to.
    cols_with_missing (list): List of column names with missing values.

    Returns:
    pd.DataFrame: The DataFrame with indicator columns added.
    """
    # Copy so the original data is not changed when imputing later
    df_plus = df.copy()
    for col in cols_with_missing:
        df_plus[col + '_was_missing'] = df_plus[col].isnull()
    return df_plus

# Add indicator columns for missing values (the function copies the data,
# so the originals are not modified)
X_train_plus = add_missing_indicator(X_train, cols_with_missing)
X_valid_plus = add_missing_indicator(X_valid, cols_with_missing)

# Impute missing values in training and validation data; column names and
# indexes are preserved by impute_missing_values
imputed_X_train_plus, imputed_X_valid_plus = impute_missing_values(X_train_plus, X_valid_plus)

• The add_missing_indicator function takes a DataFrame df and a list of column names cols_with_missing as input, and returns the DataFrame with indicator columns added.
• Inside the function, a copy of the DataFrame is made to avoid changing the original data. Then, a loop adds an indicator column for each column with missing values.
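
The extended datasets can be scored the same way, which makes it easy to check whether the indicator columns actually help on a given dataset (a sketch, assuming y_train and y_valid):

# Compare against the MAE of plain imputation from strategy 2
print("MAE (imputation + indicators):", score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))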

Bonus: Hyperparameter tuning Module


When training a RandomForestRegressor model, we need to choose the right combination of
hyperparameters that determine how the model learns and makes predictions. However,
finding the best set of hyperparameters can be a challenging task.

This is where grid search comes in. Grid search is an essential technique that helps us
systematically explore different combinations of hyperparameters to identify the optimal
ones for our RandomForestRegressor model. It saves us from the tedious and time-consuming
process of manually tuning hyperparameters.

During grid search, we specify a range of values for different hyperparameters, such as the
number of trees in the forest, the maximum depth of the trees, and the minimum number of
samples required to split a node. Grid search then exhaustively tries all possible combinations
of these hyperparameters and evaluates the performance of each combination using cross-
validation.

By performing grid search, we can evaluate multiple models with different hyperparameter
settings and compare their performance. This allows us to identify the best combination of
hyperparameters that results in the highest predictive accuracy or the best trade-off between
accuracy and model complexity.

Grid search helps us avoid the risk of using suboptimal hyperparameter values that could lead
to underfitting or overfitting the data. It provides a systematic approach to fine-tuning our
RandomForestRegressor model, enabling us to achieve better predictive performance on
unseen data.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

def grid_search_random_forest(X, y, param_grid, cv=5):
    """
    Perform grid search on a RandomForestRegressor using the specified
    parameter grid.

    Parameters:
    X (pd.DataFrame or np.array): The input features.
    y (pd.Series or np.array): The target variable.
    param_grid (dict): The parameter grid to search over.
    cv (int, optional): The number of cross-validation folds (default: 5).

    Returns:
    RandomForestRegressor: The best RandomForestRegressor model found during
    grid search.
    """
    # Create RandomForestRegressor (fixed seed for reproducible results)
    rf = RandomForestRegressor(random_state=0)

    # Perform grid search over every combination in param_grid
    grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=cv)
    grid_search.fit(X, y)

    # Get the best model, refitted on all of X with the best parameters
    best_rf = grid_search.best_estimator_

    return best_rf

How can we use this function? Like this:

# Define the parameter grid for grid search
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10]
}

# Perform grid search on RandomForestRegressor
# (X_train must already have its missing values handled, e.g. by one of the
# strategies above)
best_model = grid_search_random_forest(X_train, y_train, param_grid, cv=5)
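
Since grid_search_random_forest returns the refitted best estimator, the winning hyperparameter values can be read directly off the model (a sketch using the best_model from above; scikit-learn estimators expose their constructor parameters as attributes):

print("Best n_estimators:", best_model.n_estimators)
print("Best max_depth:", best_model.max_depth)
print("Best min_samples_split:", best_model.min_samples_split)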

Note:

Grid Search Parameters:


The param_grid is a parameter in the grid search that allows us to define the hyperparameter
grid to search over. It specifies the different values that we want to try for each
hyperparameter. In this example:

1. n_estimators: This hyperparameter determines the number of trees in the random forest. The param_grid specifies three possible values to be evaluated: 100, 200, and 300. The grid search will try these values to find the best number of trees for the model.
2. max_depth: This hyperparameter determines the maximum depth of each tree in the random forest. The param_grid specifies three possible values: None, 5, and 10. A None value means there is no maximum depth limit. The grid search will evaluate these values to find the best maximum depth for the trees.
3. min_samples_split: This hyperparameter determines the minimum number of samples required to split an internal node in a tree. The param_grid specifies three possible values: 2, 5, and 10. This parameter helps control the complexity of the trees and prevents overfitting. The grid search will test these values to find the best minimum number of samples for splitting a node.

By specifying different values for each hyperparameter in the param_grid, the grid search will
evaluate all possible combinations of these values: here that is 3 × 3 × 3 = 27 combinations,
and with cv=5 each one is trained and validated five times, for 135 model fits in total. Each
RandomForestRegressor is evaluated with cross-validation, and the goal is to identify the
combination that yields the best performance based on the chosen evaluation metric, such as
mean absolute error or R-squared.

Cross Validation Parameter:


The cv hyperparameter, which stands for cross-validation, is an important setting in grid
search. It determines how the data is divided during the evaluation of the model. Think of it as
a way to assess the model's performance and estimate how well it would work on new, unseen
data.
Cross-validation helps us evaluate the model in a reliable and robust manner. It involves
splitting the training data into cv number of groups, or folds. The model is then trained and
evaluated multiple times, each time using a different fold as the validation set and the
remaining folds as the training set.
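
The fold mechanics can be seen directly with scikit-learn's cross_val_score. This is an illustrative sketch, not part of the pipeline above, and it assumes the imputed_X_train and y_train from the imputation step:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# With cv=5, each of the 5 folds takes one turn as the validation set
scores = cross_val_score(
    RandomForestRegressor(n_estimators=100, random_state=0),
    imputed_X_train, y_train,
    cv=5, scoring='neg_mean_absolute_error',
)
print("MAE per fold:", -scores)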

The choice of the cv value depends on factors such as the size of the dataset, the complexity
of the model, and the available computing resources. Common values for cv include 3, 5, and
10, but other values can be used too.
Using a higher cv value can give us a more reliable estimate of how well the model will
perform, but it may take longer to compute. On the other hand, using a lower cv value can be
faster, but the performance estimate may have more variability.

It's important to remember that cross-validation is not the final evaluation. Once the best
hyperparameters are found through grid search, it's crucial to assess the model's
performance on a separate, unseen test set to get a more accurate understanding of how it
will perform in real-world situations.
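
A final sketch of that check (assuming a hypothetical X_test and y_test that were held out before any tuning, and preprocessed the same way as the training data):

from sklearn.metrics import mean_absolute_error

# The test set is touched exactly once, after all tuning is finished
test_preds = best_model.predict(X_test)
print("Test MAE:", mean_absolute_error(y_test, test_preds))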

Homework

Apply these techniques to the Housing Prices data in the Housing Prices Competition for
Kaggle Learn Users and analyze your solution.

References:

1. sklearn.ensemble.RandomForestRegressor
2. 3.2. Tuning the hyper-parameters of an estimator
3. Kaggle: Your Machine Learning and Data Science Community
4. Kalinowski, M.; Escovedo, T.; Villamizar, H.; Lopes, H. Engenharia de Software para Ciência
de Dados: Um guia de boas práticas com ênfase na construção de sistemas de Machine
Learning em Python. Casa do Código.

