
MACHINE LEARNING

MASTERY

NEXT-LEVEL
Data
Science
Mastering Techniques
for the Modern Era

Vinod Chugani
This is Just a Sample

Thank you for your interest in Next-Level Data Science.

This is just a sample of the full text. You can purchase the complete book online from:
https://machinelearningmastery.com/next-level-data-science


Disclaimer
The information contained within this eBook is strictly for educational purposes. If you wish to
apply ideas contained in this eBook, you are taking full responsibility for your actions.
The author has made every effort to ensure that the information within this book was accurate at
the time of publication. The author does not assume and hereby disclaims any liability to any
party for any loss, damage, or disruption caused by errors or omissions, whether such errors or
omissions result from accident, negligence, or any other cause.
No part of this eBook may be reproduced or transmitted in any form or by any means, electronic or
mechanical, recording or by any information storage and retrieval system, without written
permission from the author.

Credits
Founder: Jason Brownlee
Lead Editor: Adrian Tam
Author: Vinod Chugani
Technical Reviewers: Yoyo Chan, Darci Heikkinen, and Ishaan Mahajan

Copyright
Next-Level Data Science
© 2024 MachineLearningMastery.com. All Rights Reserved.

Edition: v1.00
Contents

This is Just a Sample i

Preface iv

Introduction v

2 From Train-Test to Cross-Validation: Advancing Your Model’s Evaluation 1
    Model Evaluation: Train-Test vs. Cross-Validation 1
    The “Why” of Cross-Validation 3
    Delving Deeper with k-Fold Cross-Validation 5
    Further Reading 7
    Summary 7

7 Capturing Curves: Advanced Modeling with Polynomial Regression 9
    Establishing a Baseline with Linear Regression 9
    Capturing Curves with Polynomial Regression 11
    Experimenting with a Cubic Regression 13
    Further Reading 14
    Summary 15

15 Boosting Over Bagging: Enhancing Predictive Accuracy with Gradient Boosting Regressors 16
    What Is Boosting? 16
    Comparing Baseline Decision Tree to Gradient Boosting Ensembles 17
    Optimizing Gradient Boosting with Learning Rate Adjustments 21
    Final Optimization: Tuning Learning Rate and Number of Trees 24
    Further Reading 26
    Summary 27
Preface
Preface

Data science applies mathematics and statistics to tell a story from the data you have and
sometimes to predict something about the data you have not yet collected. Traditional
statistical techniques are powerful enough to reveal a lot about the data, but as the discipline
has evolved, more tools, such as machine learning models, have been invented to bring new insights.
Machine learning models are called models because they are based on certain assumptions
about the data you used to train them. As the great statistician George Box said, “All models
are wrong, but some are useful.” How to correctly use a model and make it more useful is
the key to success in applying a model in data science. It is not whether you get the sharpest
tool but whether you use the tool correctly.
This book is the sequel to our earlier publication, The Beginner’s Guide to Data Science.
In this book, we will focus on linear regression and tree-based models, which are the simplest
and most ubiquitous models in data science. You will be led through the steps to prepare the
data and apply the models correctly.
Introduction

Welcome to Next-Level Data Science. This book is the sequel to The Beginner’s Guide to
Data Science. However, that is not a prerequisite for this book. This book focuses on the
most commonly used models in data science, namely, linear regression, decision trees, and
some other models derived from them. While these are the models every data scientist should
know, this book aims to show you how you should use these models to get the most out of
them.
As we said in the first book, the central theme of data science is to tell a story from
data. Using sophisticated models or the latest and greatest software is not the requirement.
Sometimes, the simplest and oldest models can do a good job. Some people reach for very
sophisticated models because they think the simpler ones don’t work, or because they tried
one and used it incorrectly without realizing it. If you can use the simple models to their
full potential, you will be surprised how many problems they can deal with.

What Is a Model in Data Science


A model is a way of thinking about the data. Once you assume the pattern of the data, you
can “build” a model with the data and use it for analysis or prediction. But there are several
questions you should be able to answer as a data scientist:
- Among the many models, how can I know which one fits the data well?
- How much data should I use to “train” a model?
- What does a model assume about the data?
- How can I prepare the data if it is currently in a form that the model cannot accept?
- How can I tell a model is not doing well, and how should I fix it?
These are important questions and, in fact, you might be able to answer some of them with
your intuition. But as data science is a precise discipline, there are some scientific methods
to answer all the questions above. The goal is to provide convincing evidence to support your
claim and back up your story. Only then can you deliver a persuasive summary of the data
you have.

This book is created to walk you through a data science project with just enough
programming. Some simple yet commonly used models are discussed. Visualizations are
employed only when the charts are required to explain a concept. You will start with a
dataset (the same one as in the first book) and apply the models to it. Through these examples,
you will be shown how a data scientist thinks when a tool, such as a regression model, is used.

Book Organization
This book is in three parts. The first part is about linear regression. This is the simplest model
but powerful enough for many applications. It is simple enough to understand, therefore, you
can use it as an example to learn how to apply a model in general to a data science project.
This part includes the following chapters:
1. Regression Through Two Different Lenses
2. From Train-Test to Cross-Validation: Advancing Your Model’s Evaluation
3. The Strategic Use of Sequential Feature Selector for Housing Price Predictions
4. The Search for the Sweet Spot in a Linear Regression with Numeric Features
5. One-Hot Encoding: Understanding the “Hot” in Data
6. Interpreting Coefficients in Linear Regression Models
7. Capturing Curves: Advanced Modeling with Polynomial Regression
These chapters not only show you how to create a linear regression model in Python but also
illustrate how many stories you can tell by using just a simple model. This is an important
concept in data science; namely, a simple trick reveals a lot from the data, and you should
extract the most from your limited resources.
The second part is about some skills you can apply to improve a model or a workflow,
using the regression models from Part I as examples:
8. The Power of Pipelines
9. Detecting and Overcoming Perfect Multicollinearity in Large Datasets
10. Scaling to Success: Implementing and Optimizing Penalized Models
11. Filling the Gaps: A Comparative Guide to Imputation Techniques in Machine Learning
These chapters address the common problems you encounter in a data science project.
The pipelines (Chapter 8) are about how to make the workflow more efficient, and the other
chapters are about how to make the data and model more robust and accurate.
Part III moves away from linear regression to cover the tree-based models:
12. Branching Out: Exploring Tree-Based Models for Regression
13. Decision Trees and Ordinal Encoding: A Practical Guide
14. From Single Trees to Forests: Enhancing Real Estate Predictions with Ensembles
15. Boosting Over Bagging: Enhancing Predictive Accuracy with Gradient Boosting
Regressors

16. Navigating Missing Data Challenges with XGBoost


17. Exploring LightGBM: Leaf-Wise Growth with GBDT and GOSS
18. CatBoost Essentials: Building Robust Home Price Prediction Systems
Sometimes, tree-based models are found to be more powerful than linear regression models,
and you will find an explanation in Chapter 13. Nothing stops you from using more than one
tree to solve your problem. The natural extension is the random forests covered in the later
chapters of this part.
At this point, you should know the two basic but powerful families of models for data science
projects. As a reminder, the models themselves are not the focus of this book; the focus is the
mental models for applying them as tools in your data science project. To help you build this
data science mindset, three summary chapters are in the appendix:
A. Planning Your Data Science Project
B. From Features to Performance: Crafting Robust Predictive Models
C. Interpreting and Communicating Data Science Results
All the chapters in this book are based on the same dataset so that you can build insights
on a seemingly simple dataset as you navigate through the chapters. You need to find the right
tool for the right problem, but you also need to use the tool efficiently. This includes applying
the appropriate Python function and interpreting the result correctly.

Requirements for This Book


This book focuses on machine learning models that are most useful for common data science
projects. Even so, you do not need to be an expert in machine learning to enjoy it since
most models and concepts are well explained when they are introduced. However, if you
are interested in learning more, you may find the other book by Machine Learning Mastery,
Machine Learning with Python¹, useful.
The code in this book is in Python. It is the most popular language in this domain, and
you can easily find help online. It does not mean you need to be a Python expert. All you
need to know is to install and set up Python with libraries such as scikit-learn, Matplotlib,
XGBoost, and so on. The examples assume that you have a Python environment available.
This may be on your workstation or laptop; it may be in a VM or a Docker instance that
you run; or it may be a server instance that you configure in the cloud. Please refer to
Appendix D if you need instructions on how to set up your Python environment.
The most important requirement of this book is a good mindset. There is a lot to cover in
this book. Each topic can be drilled down and expanded into a much larger context. However,
remember that you should focus on the high-level goal of the data instead of all the details in
the models and code. In many cases, you will see that a function from the machine learning
library is used, and there are many arguments you can specify for that function. However, each
example covers only the minimal setup needed to complete the job. You should finish the story

¹ https://machinelearningmastery.com/machine-learning-with-python/

in each chapter first and go back to change the code to try a different idea afterward. This
is the fastest way to learn.

Your Outcomes from Reading This Book


This book will lead you from being a programmer with little to no data science knowledge to
a developer with sufficient knowledge to carry out a basic data science workflow.
Specifically, you will know:
- How to determine if a model is right for your problem.
- The basic but commonly used machine learning models in data science.
- How to collect data to back up your model choice and interpret the result.
- How to deliver your argument quantitatively.
This book is not a textbook. You are provided with the theoretical background at a
bare minimum. This book, however, is a cookbook or playbook in that there are step-by-step
instructions you can follow to finish a project.
You do not need to read the chapters in order. While the chapters are ordered to
help you go from the beginning to the end of a data science journey, you can skip to chapters
as your needs or interests motivate you. To get the most from this book, you should also
attempt to improve the results, try a different function or model, apply the method to a
similar but different problem, and so on. You are welcome to share your findings with us at
[email protected].

Next
Let’s dive in. Next up is Part I, where you will take a tour of the scikit-learn library in Python
to create a linear regression model, which is an essential skill you will be using throughout
this book.
2
From Train-Test to Cross-Validation: Advancing Your Model’s Evaluation

Many beginners will initially rely on the train-test method to evaluate their models. This
method is the most straightforward way to give a clear indication of how well a model performs
on unseen data. However, this approach may lead to an incomplete understanding of a model’s
capabilities. In this chapter, we’ll discuss why it’s important to go beyond the basic train-test
split and how cross-validation can offer a more thorough evaluation of model performance.
You will go through the essential steps to achieve a deeper and more accurate assessment of
your machine learning models.
Let’s get started.

Overview
This chapter is divided into three parts; they are:
- Model Evaluation: Train-Test vs. Cross-Validation
- The “Why” of Cross-Validation
- Delving Deeper with k-Fold Cross-Validation

2.1 Model Evaluation: Train-Test vs. Cross-Validation


The predictive power of a machine learning model is determined by its design (such as a linear
vs. nonlinear model) and its complexity (such as the number of parameters in a linear regression
model). You need to make sure you have the right design before deciding on the appropriate
complexity.
The performance of a machine learning model is gauged by how well it performs on
previously unseen (or test) data. In a standard train-test split, you divide the dataset into
two parts: a larger portion for training your model and a smaller portion for testing its
performance. The split ratio is a trade-off. Too much data in the test set leaves too few
samples for the model to learn from, causing it to underfit. Too little data in the test set
makes the evaluation untrustworthy, so you cannot tell whether the model overfits.
The model is suitable if the tested performance is acceptable. This approach is
straightforward but doesn’t always utilize your data most effectively.

Figure 2.1: How to do train-test split on a dataset. The dataset is divided into a training portion (80%) and a testing portion (20%), and one model is scored.

However, if you don’t have sufficient data, you can do cross-validation. Figure 2.2 shows a
5-fold cross-validation, where the dataset is split into five subsets (“folds”). In each round of
validation, a different fold is used as the test set while the remaining form the training set.
This process is repeated five times, producing five different models and ensuring each data
point is used for training and testing.

Figure 2.2: How to do 5-fold cross-validation on a dataset. The dataset is split into five 20% folds; each fold serves once as the test set, and five models are scored.

Here is an example in code to illustrate the above:

# Load the Ames dataset
import pandas as pd
Ames = pd.read_csv("Ames.csv")

# Import Linear Regression, Train-Test, Cross-Validation from scikit-learn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Select features and target
X = Ames[["GrLivArea"]]   # Feature: GrLivArea, a 2D matrix
y = Ames["SalePrice"]     # Target: SalePrice, a 1D vector

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear Regression model using Train-Test
model = LinearRegression()
model.fit(X_train, y_train)
train_test_score = round(model.score(X_test, y_test), 4)
print(f"Train-Test R^2 Score: {train_test_score}")

# Perform 5-Fold Cross-Validation
cv_scores = cross_val_score(model, X, y, cv=5)
cv_scores_rounded = [round(score, 4) for score in cv_scores]
print(f"Cross-Validation R^2 Scores: {cv_scores_rounded}")

Listing 2.1: Running train-test split and 5-fold cross-validation on the same dataset

While the train-test method yields a single R² score, cross-validation provides you with
a spectrum of five different R² scores, one from each fold of the data, offering a more
comprehensive view of the model’s performance:

Train-Test R^2 Score: 0.4789


Cross-Validation R^2 Scores: [0.4884, 0.5412, 0.5214, 0.5454, 0.4673]

Output 2.1: The R² scores found

The roughly equal R² scores across the five folds mean the model is stable. You can then decide
whether this model (i.e., linear regression) provides acceptable predictive power.
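As a quick check of this stability, you can summarize the spread of the five scores. Below is a minimal sketch, assuming cv_scores from Listing 2.1 is still in scope:

...
import numpy as np

# A small standard deviation relative to the mean suggests a stable model
print(f"Mean R^2: {np.mean(cv_scores):.4f}")
print(f"Std of R^2: {np.std(cv_scores):.4f}")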

2.2 The “Why” of Cross-Validation


Understanding the variability of your model’s performance across different subsets of data
is crucial in machine learning. The train-test split method, while useful, only gives you a
snapshot of how your model might perform on one particular set of unseen data.
By systematically using multiple folds of data for both training and testing, cross-
validation offers a more robust and comprehensive evaluation of the model’s performance.
Each fold acts as an independent test, providing insights into how the model is expected
to perform across varied data samples. This multiplicity not only helps identify potential
overfitting but also ensures that the performance metric (in this case, the R² score) is not
overly optimistic or pessimistic by chance¹ but rather a reliable indicator of how the model
will generalize to unseen data.
To visually demonstrate this, let’s consider the R² scores from both a train-test split and
a 5-fold cross-validation process:

¹ If you do one test and get a high (or low) R², you may wonder if you are just being lucky (or
unlucky). But if you get the same result five times in a row, you know it is real.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
# Import Seaborn and Matplotlib
import seaborn as sns
import matplotlib.pyplot as plt

# Perform the train-test split and 5-fold cross-validation as before, so that
# cv_scores_rounded contains your cross-validation scores and train_test_score
# is your single train-test R^2 score
Ames = pd.read_csv("Ames.csv")
X = Ames[["GrLivArea"]]   # Feature: GrLivArea, a 2D matrix
y = Ames["SalePrice"]     # Target: SalePrice, a 1D vector
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
train_test_score = round(model.score(X_test, y_test), 4)
cv_scores = cross_val_score(model, X, y, cv=5)
cv_scores_rounded = [round(score, 4) for score in cv_scores]

# Plot the box plot for cross-validation scores
cv_scores_df = pd.DataFrame(cv_scores_rounded, columns=["Cross-Validation Scores"])
sns.boxplot(data=cv_scores_df, y="Cross-Validation Scores",
            width=0.3, color="lightblue", fliersize=0)

# Overlay individual scores as points
plt.scatter([0] * len(cv_scores_rounded), cv_scores_rounded,
            color="blue", label="Cross-Validation Scores")
plt.scatter(0, train_test_score, color="red", zorder=5, label="Train-Test Score")

# Plot the visual
plt.title("Model Evaluation: Cross-Validation vs. Train-Test")
plt.ylabel("R^2 Score")
plt.xticks([0], ["Evaluation Scores"])
plt.legend(loc="lower left", bbox_to_anchor=(0, +0.1))
plt.show()

Listing 2.2: Visually comparing the R² scores obtained from different trained models

This visualization underscores the difference in insights gained from a single train-test
evaluation versus the broader perspective offered by cross-validation:

Figure 2.3: The distribution of R² scores from various models

The red dot in Figure 2.3 is near the lower end of the box-and-whisker plot. If you obtained
only that score, you would underestimate the accuracy of your model. With more scores from
cross-validation, you get a better estimate. Through cross-validation, you gain a deeper
understanding of your model’s performance, moving closer to developing machine learning
solutions that are both effective and reliable.

2.3 Delving Deeper with k-Fold Cross-Validation


Cross-validation is a cornerstone of reliable machine learning model evaluation, with
cross_val_score() providing a quick and automated way to perform this task. Now, let’s
turn your attention to the KFold class, a component of scikit-learn that offers a deeper dive
into the folds of cross-validation. The KFold class provides not just a score but a window
into the model’s performance across different segments of your data. We demonstrate this by
replicating the example above:

import pandas as pd
# Import k-fold and necessary libraries
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

Ames = pd.read_csv("Ames.csv")

# Select features and target
X = Ames[["GrLivArea"]].values   # Convert to numpy array for KFold
y = Ames["SalePrice"].values     # Convert to numpy array for KFold

# Initialize linear regression and k-fold
model = LinearRegression()
kf = KFold(n_splits=5)

# k-fold cross-validation in detailed steps
for fold, (train_index, test_index) in enumerate(kf.split(X), start=1):
    # Split the data into training and testing sets
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Fit the model and predict
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Calculate and print the R^2 score for the current fold
    print(f"Fold {fold}:")
    print(f"TRAIN set size: {len(train_index)}")
    print(f"TEST set size: {len(test_index)}")
    print(f"R^2 score: {round(r2_score(y_test, y_pred), 4)}\n")

Listing 2.3: Performing k-fold cross-validation manually in scikit-learn

This code block will show you the size of each training and testing set and the corresponding
R² score for each fold:

Fold 1:
TRAIN set size: 2063
TEST set size: 516
R^2 score: 0.4884

Fold 2:
TRAIN set size: 2063
TEST set size: 516
R^2 score: 0.5412

Fold 3:
TRAIN set size: 2063
TEST set size: 516
R^2 score: 0.5214

Fold 4:
TRAIN set size: 2063
TEST set size: 516
R^2 score: 0.5454

Fold 5:
TRAIN set size: 2064
TEST set size: 515
R^2 score: 0.4673

Output 2.2: Metrics from the models trained using k-fold cross-validation

The KFold class shines in its transparency and control over the cross-validation process. While
cross_val_score() simplifies the process into one line, KFold opens it up, allowing you to view
the exact splits of your data. This is incredibly valuable when you need to:
- Understand how your data is being divided.
- Implement custom preprocessing before each fold.
- Gain insights into the consistency of your model’s performance.
By using the KFold class, you can manually iterate over each split and apply the model training
and testing process. This not only ensures that you’re fully informed about the data
being used at each stage but also offers the opportunity to modify the process to suit complex
needs.
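For example, here is a minimal sketch of custom preprocessing inside the loop: a StandardScaler is fit on each training split only and then applied to the corresponding test split, so no information from the test fold leaks into training. The scaler is an illustrative choice, not part of the example above; any preprocessing step could take its place.

import pandas as pd
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

Ames = pd.read_csv("Ames.csv")
X = Ames[["GrLivArea"]].values
y = Ames["SalePrice"].values

kf = KFold(n_splits=5)
for fold, (train_index, test_index) in enumerate(kf.split(X), start=1):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Fit the scaler on the training fold only, then apply it to both splits
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    model = LinearRegression()
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    print(f"Fold {fold} R^2: {round(r2_score(y_test, y_pred), 4)}")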

2.4 Further Reading


This section provides more resources on the topic if you want to go deeper.

Papers
Dean De Cock. “Ames, Iowa: Alternative to the Boston Housing Data as an End of
Semester Regression Project”. Journal of Statistics Education, 19(3), 2011.
https://jse.amstat.org/v19n3/decock.pdf

Resources
Dean De Cock. Ames Housing Dataset. 2011.
https://jse.amstat.org/v19n3/decock/AmesHousing.txt
Dean De Cock. Ames Housing Data Dictionary. 2011.
https://jse.amstat.org/v19n3/decock/DataDocumentation.txt

Online
sklearn.model_selection.train_test_split API. scikit-learn.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
sklearn.model_selection.cross_val_score API. scikit-learn.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
sklearn.model_selection.KFold API. scikit-learn.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html

2.5 Summary
In this chapter, we explored the importance of thorough model evaluation through cross-
validation and the KFold method. Both techniques meticulously avoid the pitfall of data
leakage by keeping training and testing data distinct, thereby ensuring the model’s performance
is accurately measured. Moreover, by validating each data point exactly once and using it
for training k − 1 times, these methods provide a detailed view of the model’s ability to
generalize, boosting confidence in its real-world applicability. Through practical examples,

we’ve demonstrated how integrating these strategies into your evaluation process leads to
more reliable and robust machine learning models ready for the challenges of new and unseen
data.
Specifically, you learned:
- The efficiency of cross_val_score() in automating the cross-validation process.
- How KFold offers detailed control over data splits for tailored model evaluation.
- How both methods ensure full data utilization and prevent data leakage.
In the next chapter, you will look into selecting inputs for your models.
7
Capturing Curves: Advanced Modeling with Polynomial Regression

When we analyze relationships between variables in machine learning, we often find that a
straight line doesn’t tell the whole story. That’s where polynomial transformations come
in, adding layers to our regression models without complicating the calculation process.
By transforming our features into their polynomial counterparts—squares, cubes, and other
higher-degree terms—we give linear models the flexibility to curve and twist, fitting snugly
to the underlying trends of our data.
This chapter will explore how you can move beyond simple linear models to capture
more complex relationships in your data. You’ll learn about the power of polynomial and
cubic regression techniques, which allow you to see beyond the apparent and uncover the
underlying patterns that a straight line might miss. You will also delve into the balance
between adding complexity and maintaining predictability in your models, ensuring that they
are both powerful and practical.
Let’s get started.

Overview
This chapter is divided into three parts; they are:
- Establishing a Baseline with Linear Regression
- Capturing Curves with Polynomial Regression
- Experimenting with a Cubic Regression

7.1 Establishing a Baseline with Linear Regression


When we talk about relationships between two variables, linear regression is often the first
step because it is the simplest. It models the relationship by fitting a straight line to the
data. This line is described by the simple equation y = mx + b, where y is the dependent
variable, x is the independent variable, m is the slope of the line, and b is the y-intercept.
Let’s demonstrate this by predicting the “SalePrice” in the Ames dataset based on its overall
quality, which is an integer value ranging from 1 to 10.

# Import the necessary libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

# Prepare data for linear regression
Ames = pd.read_csv("Ames.csv")
X = Ames[["OverallQual"]]   # Predictor
y = Ames["SalePrice"]       # Response

# Create and fit the linear regression model
linear_model = LinearRegression()
linear_model.fit(X, y)

# Coefficients
intercept = int(linear_model.intercept_)
slope = int(linear_model.coef_[0])
eqn = f"Fitted Line: y = {slope}x - {abs(intercept)}"

# Perform 5-fold cross-validation to evaluate model performance
cv_score = cross_val_score(linear_model, X, y).mean()

# Visualize best fit and display CV results
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color="blue", alpha=0.5, label="Data points")
plt.plot(X, linear_model.predict(X), color="red", label=eqn)
plt.title("Linear Regression of SalePrice vs OverallQual", fontsize=16)
plt.xlabel("Overall Quality", fontsize=12)
plt.ylabel("Sale Price", fontsize=12)
plt.legend(fontsize=14)
plt.grid(True)
plt.text(1, 540000, f"5-Fold CV R^2: {cv_score:.3f}", fontsize=14, color="green")
plt.show()

Listing 7.1: Using OverallQual to predict SalePrice using linear regression

Figure 7.1: Result of linear regression



With a basic linear regression, your model came up with the following equation: y =
43383x − 84264. This means that each additional point in quality is associated with an
increase of approximately $43,383 in the sale price. To evaluate the performance of your
model, you used 5-fold cross-validation, resulting in an R² of 0.618. This value indicates that
about 61.8% of the variability in sale prices can be explained by the overall quality of the
house using this simple model.
Linear regression is straightforward to understand and implement. However, it assumes
that the relationship between the independent and dependent variables is linear, which might
not always be the case, as seen in the scatter plot above. While linear regression provides
a good starting point, real-world data often require more complex models to capture curved
relationships, as you’ll see in the next section on polynomial regression.

7.2 Capturing Curves with Polynomial Regression


Real-world relationships are often not straight lines but curves. Polynomial regression allows
us to model these curved relationships. For a third-degree polynomial, this method takes
your simple linear equation and adds terms for each power of x: y = ax + bx² + cx³ + d. You
can implement this by using the PolynomialFeatures class from the sklearn.preprocessing
library, which generates a new feature matrix consisting of all polynomial combinations of the
features with a degree less than or equal to the specified degree. Here’s how you can apply it
to your dataset:

# Import the necessary libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures
import matplotlib.pyplot as plt

# Load the data
Ames = pd.read_csv("Ames.csv")
X = Ames[["OverallQual"]]
y = Ames["SalePrice"]

# Transform the predictor variable to polynomial features up to the 3rd degree
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X)

# Create and fit the polynomial regression model
poly_model = LinearRegression()
poly_model.fit(X_poly, y)

# Extract model coefficients that form the polynomial equation
intercept = int(poly_model.intercept_)
coefs = np.rint(poly_model.coef_).astype(int)
eqn = f"Fitted Line: y = " \
      f"{coefs[0]}x^1 {coefs[1]:+d}x^2 {coefs[2]:+d}x^3 {intercept:+d}"

# Perform 5-fold cross-validation
cv_score = cross_val_score(poly_model, X_poly, y).mean()

# Generate data to plot curve
X_range = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
X_range_poly = poly.transform(X_range)

# Plot
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color="blue", alpha=0.5, label="Data points")
plt.plot(X_range, poly_model.predict(X_range_poly), color="red", label=eqn)
plt.title("Polynomial Regression (3rd Degree) of SalePrice vs OverallQual", fontsize=16)
plt.xlabel("Overall Quality", fontsize=12)
plt.ylabel("Sale Price", fontsize=12)
plt.legend(fontsize=14)
plt.grid(True)
plt.text(1, 540000, f"5-Fold CV R^2: {cv_score:.3f}", fontsize=14, color="green")
plt.show()

Listing 7.2: Creating polynomial features using scikit-learn

First, you transform the predictor variable into polynomial features up to the third degree.
This enhancement expands your feature set from just x (Overall Quality) to x, x², x³ (i.e.,
each feature becomes three different but correlated features), allowing your linear model to
fit a more complex, curved relationship in the data. You then fit this transformed data into a
linear regression model to capture the nonlinear relationship between the overall quality and
sale price.
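To see what the transformation produces, you can inspect a single sample. The following is a minimal sketch; the input value of 5 is arbitrary, and get_feature_names_out requires a recent version of scikit-learn:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=3, include_bias=False)
# A single sample with OverallQual = 5 expands to [5, 25, 125], i.e., x, x^2, x^3
print(poly.fit_transform(np.array([[5]])))
print(poly.get_feature_names_out(["OverallQual"]))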

Figure 7.2: The result of polynomial regression

The new model has the equation y = 65966x − 11619x² + 1006x³ − 31343. The curve fits
the data points more closely than the straight line, indicating a better model. The 5-fold
cross-validation gave us an R² of 0.681, which is an improvement over the linear model. This
suggests that including the squared and cubic terms helps your model capture more of the
complexity in the data. Polynomial regression introduces the ability to fit curves, but sometimes,
focusing on a specific power, like the cubic term, can reveal deeper insights, as you will explore
in cubic regression.

7.3 Experimenting with a Cubic Regression


Sometimes, you may suspect that a specific power of x is particularly important. In these
cases, you can focus on that power. Cubic regression is a special case where you model the
relationship with a cube of the independent variable: y = ax³ + b. To effectively focus on this
power, you can utilize the FunctionTransformer class from the sklearn.preprocessing library,
which allows you to create a custom transformer to apply a specific function to the data.
This approach is useful for isolating and highlighting the impact of higher-degree terms like
x³ on the response variable, providing a clear view of how the cubic term alone explains the
variability in the data.

# Import the necessary libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import FunctionTransformer
import matplotlib.pyplot as plt

# Load data
Ames = pd.read_csv("Ames.csv")
X = Ames[["OverallQual"]]
y = Ames["SalePrice"]

# Function to apply cubic transformation
def cubic_transformation(x):
    return x ** 3

# Apply transformation
cubic_transformer = FunctionTransformer(cubic_transformation)
X_cubic = cubic_transformer.fit_transform(X)

# Fit model
cubic_model = LinearRegression()
cubic_model.fit(X_cubic, y)

# Get coefficients and intercept
intercept_cubic = int(cubic_model.intercept_)
coef_cubic = int(cubic_model.coef_[0])
eqn = f"Fitted Line: y = {coef_cubic}x^3 + {intercept_cubic}"

# Cross-validation
cv_score_cubic = cross_val_score(cubic_model, X_cubic, y).mean()

# Generate data to plot curve
X_range = np.linspace(X.min(), X.max(), 300)
X_range_cubic = cubic_transformer.transform(X_range)

# Plot
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color="blue", alpha=0.5, label="Data points")
plt.plot(X_range, cubic_model.predict(X_range_cubic), color="red", label=eqn)
plt.title("Cubic Regression of SalePrice vs OverallQual", fontsize=16)
plt.xlabel("Overall Quality", fontsize=12)
plt.ylabel("Sale Price", fontsize=12)
plt.legend(fontsize=14)
plt.grid(True)
plt.text(1, 540000, f"5-Fold CV R^2: {cv_score_cubic:.3f}", fontsize=14, color="green")
plt.show()

Listing 7.3: Performing cubic regression with the help of FunctionTransformer

Here, you applied a cubic transformation to your independent variable and obtained a cubic
model with the equation y = 361x³ + 85579. This represents a slightly simpler approach than
the full polynomial regression model, focusing solely on the cubic term’s predictive power.

Figure 7.3: Result of cubic regression

With cubic regression, the 5-fold cross-validation yielded an R² of 0.678. This performance is
slightly below the full polynomial model but still notably better than the linear one. Cubic
regression is simpler than a higher-degree polynomial regression and can be sufficient for
capturing the relationship in some datasets. It’s less prone to overfitting than a higher-degree
polynomial model but more flexible than a linear model. The coefficient in the cubic regression
model, 361, indicates the rate at which sale prices increase as the quality cubed increases. This
emphasizes the substantial influence that very high quality levels have on the price, suggesting
that properties with exceptional quality see a disproportionately higher increase in their sale
price. This insight is particularly valuable for investors or developers focused on high-end
properties where quality is at a premium.
As you may imagine, this technique is not limited to polynomial terms. You can introduce
more exotic functions, such as logarithms and exponentials, if you think that makes sense
in your scenario.
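For instance, a logarithmic transform can be swapped in with one line. This is a minimal sketch, not a recommendation for this particular dataset; np.log1p, which computes log(1 + x), is used instead of np.log so that zero values are handled safely:

import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

Ames = pd.read_csv("Ames.csv")
X = Ames[["OverallQual"]]

# Apply a log transform instead of a cube
log_transformer = FunctionTransformer(np.log1p)
X_log = log_transformer.fit_transform(X)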

7.4 Further Reading


This section provides more resources on the topic if you want to go deeper.

Papers
Dean De Cock. “Ames, Iowa: Alternative to the Boston Housing Data as an End of
Semester Regression Project”. Journal of Statistics Education, 19(3), 2011.
https://jse.amstat.org/v19n3/decock.pdf

Resources
Dean De Cock. Ames Housing Dataset. 2011.
https://jse.amstat.org/v19n3/decock/AmesHousing.txt
Dean De Cock. Ames Housing Data Dictionary. 2011.
https://jse.amstat.org/v19n3/decock/DataDocumentation.txt

Online
sklearn.preprocessing.PolynomialFeatures API. scikit-learn.
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
sklearn.preprocessing.FunctionTransformer API. scikit-learn.
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html
Tamas Ujhelyi. Polynomial Regression in Python using scikit-learn. Nov. 2021.
https://data36.com/polynomial-regression-python-scikit-learn/

7.5 Summary
This chapter explored different regression techniques suited for modeling relationships in
data across varying complexities. We started with linear regression to establish a baseline
for predicting house prices based on quality ratings. Visuals accompanying this section
demonstrate how a linear model attempts to fit a straight line through the data points,
illustrating the basic concept of regression. Advancing to polynomial regression, we tackled
more intricate, nonlinear trends, which enhanced model flexibility and accuracy. The
accompanying graphs showed how a polynomial curve adjusts to fit the data points more
closely than a simple linear model. Finally, we focused on cubic regression to examine the
impact of a specific power of the predictor variable, isolating the effects of higher-degree terms
on the dependent variable. The cubic model proved to be particularly effective, capturing the
essential characteristics of the relationship with sufficient precision and simplicity.
Specifically, you learned:
- How to identify nonlinear trends using visualization techniques.
- How to model nonlinear trends using polynomial regression techniques.
- How cubic regression can capture similar predictability with fewer model complexities.
Starting with the next chapter, you will see various techniques to make the modeling project
easier.
15
Boosting Over Bagging: Enhancing Predictive Accuracy with Gradient Boosting Regressors

Ensemble learning techniques fall primarily into two categories: bagging and boosting. Bagging
reduces model variance and overfitting by averaging many models trained on random subsets
of the data, whereas boosting reduces underfitting by combining models that correct each other’s
errors. This chapter begins our deep dive into boosting, starting with the gradient boosting
regressor. Through its application on the Ames housing dataset, we will demonstrate how
boosting uniquely enhances models, setting the stage for exploring various boosting techniques
in upcoming chapters.
Let’s get started.

Overview
This chapter is divided into four parts; they are:
- What Is Boosting?
- Comparing Baseline Decision Tree to Gradient Boosting Ensembles
- Optimizing Gradient Boosting with Learning Rate Adjustments
- Final Optimization: Tuning Learning Rate and Number of Trees

15.1 What Is Boosting?


Boosting is an ensemble technique combining multiple simple models to create a strong learner.
Unlike other ensemble methods that may build models in parallel, boosting adds models
sequentially, with each new model focusing on improving the areas where previous models
struggled. This methodically improves the ensemble’s accuracy with each iteration, making
it particularly effective for complex datasets.

Key Features of Boosting.



- Sequential Learning: Boosting builds one model at a time. Each new model learns
from the shortcomings of the previous ones, allowing for progressive improvement in
capturing data complexities.
- Error Correction: New learners focus on previously mispredicted instances,
continuously enhancing the ensemble’s capability to capture difficult patterns in the
data. A minimal sketch of this idea follows the list.
- Model Complexity: The ensemble’s complexity grows as more models are added,
enabling it to capture intricate data structures effectively.
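To make the sequential error-correction idea concrete, here is a minimal sketch of two rounds of boosting on synthetic data. It illustrates the principle only and is not the full gradient boosting algorithm: the second tree is trained on the residuals of the first, and its scaled prediction is added to the ensemble, which typically lowers the training error.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Synthetic one-dimensional regression data
rng = np.random.RandomState(42)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X.ravel()) * 5 + rng.normal(scale=0.5, size=200)

# Round 1: fit a weak learner, then compute its errors
tree1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree1.fit(X, y)
residuals = y - tree1.predict(X)

# Round 2: fit the next weak learner to the errors of the first
tree2 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree2.fit(X, residuals)

# The ensemble prediction is the sum of the staged predictions
y_pred = tree1.predict(X) + 0.5 * tree2.predict(X)
print("MSE, one tree: ", round(mean_squared_error(y, tree1.predict(X)), 3))
print("MSE, two trees:", round(mean_squared_error(y, y_pred), 3))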

Boosting vs. Bagging. Bagging involves building several models (often independently) and
combining their outputs to enhance the ensemble’s overall performance, primarily by reducing
the risk of overfitting the noise in the training data. In contrast, boosting focuses on improving
the accuracy of predictions by learning from errors sequentially, which allows it to adapt more
intricately to the data.

Boosting Regressors in scikit-learn. Scikit-learn provides several implementations of boosting
tailored for different needs and data scenarios (a quick usage sketch follows the list):
- AdaBoost Regressor: Employs a sequence of weak learners and adjusts their focus
based on the errors of the previous model, improving where past models were lacking.
- Gradient Boosting Regressor: Builds models one at a time, with each new model
trained to correct the residuals (errors) made by the previous ones, improving accuracy
through careful adjustments.
- HistGradient Boosting Regressor: An optimized form of gradient boosting designed
for larger datasets, which speeds up training by binning continuous features into
histograms.
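As a quick orientation, all three regressors live in sklearn.ensemble and share the usual fit()/predict() interface, so they can be swapped into the same pipeline. Here is a minimal sketch with default hyperparameters:

from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                              HistGradientBoostingRegressor)

boosting_models = {
    "AdaBoost": AdaBoostRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "HistGradient Boosting": HistGradientBoostingRegressor(random_state=42),
}
# Each model is trained with model.fit(X_train, y_train)
# and evaluated with model.score(X_test, y_test)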
Each method utilizes the core principles of boosting to improve its components’ performance,
showcasing the versatility and power of this approach in tackling predictive modeling challenges.
In the following sections, we will demonstrate a practical application of the gradient boosting
regressor using the Ames housing dataset.

15.2 Comparing Baseline Decision Tree to Gradient Boosting Ensembles
In transitioning from the theoretical aspects of boosting to its practical applications, this
section will demonstrate the gradient boosting regressor using the meticulously preprocessed
Ames housing dataset. The preprocessing steps, consistent across various tree-based models,
ensure that the improvements observed can be attributed directly to the model’s capabilities,
setting the stage for an effective comparison.
The code below establishes your comparative analysis framework by first setting up a
baseline using a single decision tree, which is not an ensemble method. This baseline will
allow you to clearly illustrate the incremental benefits brought by actual ensemble methods.
Following this, you’ll configure two versions, each of bagging, random forest, and the gradient
15.2 Comparing Baseline Decision Tree to Gradient Boosting Ensembles 18

boosting regressor, with 100 and 200 trees, respectively, to explore the enhancements these
ensemble techniques offer over the baseline.

# Import necessary libraries for preprocessing and modeling
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, FunctionTransformer
from sklearn.ensemble import \
    GradientBoostingRegressor, BaggingRegressor, RandomForestRegressor

# Load the dataset
Ames = pd.read_csv("Ames.csv")

# Adjust data types for categorical variables
for col in ["MSSubClass", "YrSold", "MoSold"]:
    Ames[col] = Ames[col].astype("object")

# Exclude "PID" and "SalePrice" from features and handle the "Electrical" column
numeric_features = Ames.select_dtypes(include=["int64", "float64"]) \
    .drop(columns=["PID", "SalePrice"]).columns
categorical_features = Ames.select_dtypes(include=["object"]).columns \
    .difference(["Electrical"])
electrical_feature = ["Electrical"]

# Manually specify the categories for ordinal encoding according to the data dictionary
ordinal_order = {
    # Electrical system
    "Electrical": ["Mix", "FuseP", "FuseF", "FuseA", "SBrkr"],
    # General shape of property
    "LotShape": ["IR3", "IR2", "IR1", "Reg"],
    # Type of utilities available
    "Utilities": ["ELO", "NoSeWa", "NoSewr", "AllPub"],
    # Slope of property
    "LandSlope": ["Sev", "Mod", "Gtl"],
    # Evaluates the quality of the material on the exterior
    "ExterQual": ["Po", "Fa", "TA", "Gd", "Ex"],
    # Evaluates the present condition of the material on the exterior
    "ExterCond": ["Po", "Fa", "TA", "Gd", "Ex"],
    # Height of the basement
    "BsmtQual": ["None", "Po", "Fa", "TA", "Gd", "Ex"],
    # General condition of the basement
    "BsmtCond": ["None", "Po", "Fa", "TA", "Gd", "Ex"],
    # Walkout or garden level basement walls
    "BsmtExposure": ["None", "No", "Mn", "Av", "Gd"],
    # Quality of basement finished area
    "BsmtFinType1": ["None", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"],
    # Quality of second basement finished area
    "BsmtFinType2": ["None", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"],
    # Heating quality and condition
    "HeatingQC": ["Po", "Fa", "TA", "Gd", "Ex"],
    # Kitchen quality
    "KitchenQual": ["Po", "Fa", "TA", "Gd", "Ex"],
    # Home functionality
    "Functional": ["Sal", "Sev", "Maj2", "Maj1", "Mod", "Min2", "Min1", "Typ"],
    # Fireplace quality
    "FireplaceQu": ["None", "Po", "Fa", "TA", "Gd", "Ex"],
    # Interior finish of the garage
    "GarageFinish": ["None", "Unf", "RFn", "Fin"],
    # Garage quality
    "GarageQual": ["None", "Po", "Fa", "TA", "Gd", "Ex"],
    # Garage condition
    "GarageCond": ["None", "Po", "Fa", "TA", "Gd", "Ex"],
    # Paved driveway
    "PavedDrive": ["N", "P", "Y"],
    # Pool quality
    "PoolQC": ["None", "Fa", "TA", "Gd", "Ex"],
    # Fence quality
    "Fence": ["None", "MnWw", "GdWo", "MnPrv", "GdPrv"]
}

# Extract list of ALL ordinal features from dictionary
ordinal_features = list(ordinal_order.keys())

# List of ordinal features except Electrical
ordinal_except_electrical = [feat for feat in ordinal_features if feat != "Electrical"]

# Define transformations for various feature types
electrical_transformer = Pipeline(steps=[
    ("impute_electrical", SimpleImputer(strategy="most_frequent")),
    ("ordinal_electrical", OrdinalEncoder(categories=[ordinal_order["Electrical"]]))
])

numeric_transformer = Pipeline(steps=[
    ("impute_mean", SimpleImputer(strategy="mean"))
])

# Categorical imputer using SimpleImputer
categorical_imputer = SimpleImputer(strategy="constant", fill_value="None")

ordinal_transformer = Pipeline([
    ("impute_ordinal", categorical_imputer),
    ("ordinal", OrdinalEncoder(categories=[ordinal_order[feat]
                                           for feat in ordinal_features
                                           if feat in ordinal_except_electrical]))
])

nominal_features = [feat for feat in categorical_features if feat not in ordinal_features]

categorical_transformer = Pipeline([
    ("impute_nominal", categorical_imputer),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

# Combined preprocessor for numeric, ordinal, nominal, and specific electrical data
preprocessor = ColumnTransformer(
    transformers=[
        ("electrical", electrical_transformer, ["Electrical"]),
        ("num", numeric_transformer, numeric_features),
        ("ordinal", ordinal_transformer, ordinal_except_electrical),
        ("nominal", categorical_transformer, nominal_features)
    ])

# Define model pipelines including Gradient Boosting Regressor
models = {
    "Decision Tree (1 Tree)": DecisionTreeRegressor(random_state=42),
    "Bagging Regressor (100 Decision Trees)":
        BaggingRegressor(estimator=DecisionTreeRegressor(random_state=42),
                         n_estimators=100, random_state=42),
    "Bagging Regressor (200 Decision Trees)":
        BaggingRegressor(estimator=DecisionTreeRegressor(random_state=42),
                         n_estimators=200, random_state=42),
    "Random Forest (Default of 100 Trees)":
        RandomForestRegressor(random_state=42),
    "Random Forest (200 Trees)":
        RandomForestRegressor(n_estimators=200, random_state=42),
    "Gradient Boosting Regressor (Default of 100 Trees)":
        GradientBoostingRegressor(random_state=42),
    "Gradient Boosting Regressor (200 Trees)":
        GradientBoostingRegressor(n_estimators=200, random_state=42)
}

# Evaluate models using cross-validation and print results
results = {}
for name, model in models.items():
    model_pipeline = Pipeline([
        ("preprocessor", preprocessor),
        ("regressor", model)
    ])
    scores = cross_val_score(model_pipeline,
                             Ames.drop(columns="SalePrice"), Ames["SalePrice"], cv=5)
    results[name] = round(scores.mean(), 4)
    print(f"{name}: Mean CV R^2 = {results[name]}")

Listing 15.1: Setting up a baseline decision tree and the ensemble models for comparison

Below are the cross-validation results, showcasing how each model performs in terms of mean
R² values:

Decision Tree (1 Tree): Mean CV R^2 = 0.7663


Bagging Regressor (100 Decision Trees): Mean CV R^2 = 0.8957
Bagging Regressor (200 Decision Trees): Mean CV R^2 = 0.897
Random Forest (Default of 100 Trees): Mean CV R^2 = 0.8954
Random Forest (200 Trees): Mean CV R^2 = 0.8969
Gradient Boosting Regressor (Default of 100 Trees): Mean CV R^2 = 0.9027
Gradient Boosting Regressor (200 Trees): Mean CV R^2 = 0.9061

Output 15.1: Cross-validation results



The results from your ensemble models underline several key insights into the behavior and
performance of advanced regression techniques:
- Baseline and Enhancement: Starting with a basic decision tree regressor, which serves
as the baseline with an R² of 0.7663, you can observe significant performance uplifts as
you introduce more complex models. Both bagging and random forest regressors, using
different numbers of trees, show improved scores, illustrating the power of ensemble
methods in leveraging multiple learning models to reduce error.
- Gradient Boosting Regressor’s Edge: Particularly notable is the gradient boosting
regressor. With its default setting of 100 trees, it achieves an R² of 0.9027, and
increasing the number of trees to 200 nudges the score up to 0.9061. This
indicates the effectiveness of GBR in this context and highlights its efficiency in
sequential improvement from additional learners.
- Marginal Gains from More Trees: While increasing the number of trees generally results
in better performance, the incremental gains diminish as you expand the ensemble size.
This trend is evident across bagging, random forest, and gradient boosting models,
suggesting a point of diminishing returns where additional computational resources
yield minimal performance improvements.
The results highlight the gradient boosting regressor’s robust performance. It effectively
leverages comprehensive preprocessing and the sequential improvement strategy characteristic
of boosting. Next, you will explore how adjusting the learning rate can refine your model’s
performance, enhancing its predictive accuracy.

15.3 Optimizing Gradient Boosting with Learning Rate Adjustments
The learning_rate is unique to boosting models like the gradient boosting regressor,
distinguishing it from other models, such as decision trees and random forests, which do
not have a direct equivalent of this parameter. Adjusting the learning_rate allows you to
delve deeper into the mechanics of boosting and enhance your model’s predictive power by
fine-tuning how aggressively it learns from each successive tree.

What Is the Learning Rate?


In the context of gradient boosting regressors and other gradient descent-based algorithms,
the “learning rate” is a crucial hyperparameter that controls the speed at which the model
converges. At its core, the learning rate influences the size of the steps the model takes toward
the optimal solution during training. Here’s a breakdown:
- Size of Steps: The learning rate determines the magnitude of the updates to the model’s
weights during training. A higher learning rate makes larger updates, allowing the
model to learn faster but at the risk of overshooting the optimal solution. Conversely,
a lower learning rate makes smaller updates, which means the model learns slower
but with potentially higher precision.
- Impact on Model Training:
  ◦ Convergence: A learning rate that is too high may cause the training process
    to converge too quickly to a suboptimal solution, or it might not converge at
    all as it overshoots the minimum.
  ◦ Accuracy and Overfitting: A learning rate that is too low can lead the model
    to learn too slowly, which may require more trees to achieve similar accuracy,
    potentially leading to overfitting if not monitored.
- Tuning: Choosing the right learning rate balances speed and accuracy. It is often
selected through trial and error or through more systematic approaches like GridSearchCV and
RandomizedSearchCV, as adjusting the learning rate can significantly affect the model’s
performance and training time.
By adjusting the learning rate, data scientists can control how quickly a boosting model adapts
to the complexity of its errors. This makes the learning rate a powerful tool in fine-tuning
model performance, especially in boosting algorithms where each new tree is built to correct
the residuals (errors) left by the previous trees.
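Before running a systematic search, a quick manual sweep can build intuition for how sensitive the model is to this parameter. The sketch below reuses the preprocessor from Listing 15.1; the three values tried are illustrative:

...
from sklearn.model_selection import cross_val_score

# Try a few learning rates by hand and compare cross-validated scores
for lr in [0.01, 0.1, 0.3]:
    pipeline = Pipeline([
        ("preprocessor", preprocessor),
        ("regressor", GradientBoostingRegressor(learning_rate=lr, random_state=42))
    ])
    scores = cross_val_score(pipeline,
                             Ames.drop(columns="SalePrice"), Ames["SalePrice"], cv=5)
    print(f"learning_rate={lr}: Mean CV R^2 = {round(scores.mean(), 4)}")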
To optimize the learning_rate, you start with GridSearchCV, a systematic method that
will explore predefined values ([0.001, 0.01, 0.1, 0.2, 0.3]) to ascertain the most effective setting
for enhancing your model’s accuracy.

...

# Experiment with GridSearchCV
from sklearn.model_selection import GridSearchCV

# Parameter grid for GridSearchCV
param_grid = {
    "regressor__learning_rate": [0.001, 0.01, 0.1, 0.2, 0.3]
}

# Setup the GridSearchCV
grid_search = GridSearchCV(model_pipeline, param_grid, cv=5, scoring="r2", verbose=1)

# Fit the GridSearchCV to the data
grid_search.fit(Ames.drop(columns="SalePrice"), Ames["SalePrice"])

# Best parameters and best score from Grid Search
print("Best parameters (Grid Search):", grid_search.best_params_)
print("Best score (Grid Search):", round(grid_search.best_score_, 4))

Listing 15.2: Finding the optimal learning_rate using grid search

Here are the results from the GridSearchCV, focused solely on optimizing the learning_rate
parameter:

Fitting 5 folds for each of 5 candidates, totalling 25 fits


Best parameters (Grid Search): {'regressor__learning_rate': 0.1}
Best score (Grid Search): 0.9061

Output 15.2: The optimal learning_rate as found

Using GridSearchCV, you found that a learning_rate of 0.1 yielded the best result.

Following this, you utilize RandomizedSearchCV to expand your search. Unlike
GridSearchCV, RandomizedSearchCV randomly selects from a continuous range, allowing for
a potentially more precise optimization by exploring between the standard values, thus
providing a comprehensive understanding of how subtle variations in learning_rate can impact
performance.

...

# Experiment with RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

# Parameter distribution for RandomizedSearchCV
param_dist = {
    # Uniform distribution between 0.001 and 0.3
    "regressor__learning_rate": uniform(0.001, 0.299)
}

# Setup the RandomizedSearchCV
random_search = RandomizedSearchCV(model_pipeline, param_distributions=param_dist,
                                   n_iter=50, cv=5, scoring="r2", verbose=1,
                                   random_state=42)

# Fit the RandomizedSearchCV to the data
random_search.fit(Ames.drop(columns="SalePrice"), Ames["SalePrice"])

# Best parameters and best score from Random Search
print("Best parameters (Random Search):", random_search.best_params_)
print("Best score (Random Search):", round(random_search.best_score_, 4))

Listing 15.3: Using RandomizedSearchCV for more precise optimization
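
One detail worth noting: scipy’s uniform is parameterized by a starting point (loc) and a
width (scale), not by lower and upper bounds, so uniform(0.001, 0.299) draws values between
0.001 and 0.001 + 0.299 = 0.3. A quick sketch to confirm the range:

...

# scipy's uniform(loc, scale) samples from [loc, loc + scale]
from scipy.stats import uniform

dist = uniform(0.001, 0.299)
print(dist.rvs(size=5, random_state=42))   # all samples fall between 0.001 and 0.3
print(dist.ppf(0), dist.ppf(1))            # endpoints of the distribution: 0.001 and 0.3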

Contrasting with GridSearchCV, RandomizedSearchCV identified a slightly different optimal
learning_rate of approximately 0.158, which enhanced your model’s performance. This
improvement underscores the value of a randomized search in fine-tuning models, as it can
quickly explore a more diverse set of possibilities and potentially yield better configurations.
This would be difficult with a grid search, as the parameter space can easily become intractably
large.

Fitting 5 folds for each of 50 candidates, totalling 250 fits

Best parameters (Random Search): {'regressor__learning_rate': 0.1579021730580391}
Best score (Random Search): 0.9134

Output 15.3: Result of RandomizedSearchCV

The optimization through RandomizedSearchCV has demonstrated its efficacy by pinpointing
a learning rate that pushes your model’s performance to new heights, achieving an R2 score
of 0.9134. These experiments with learning_rate adjustments through GridSearchCV and
RandomizedSearchCV illustrate the delicate balance required in tuning gradient boosting models.
They also highlight the benefits of exploring both systematic and randomized parameter search
strategies to optimize a model fully.

Encouraged by the gains achieved through these optimization strategies, you will now
extend your focus to fine-tuning both the learning_rate and n_estimators simultaneously.
This next phase aims to uncover even more optimal settings by exploring the combined impact
of these crucial parameters on your gradient boosting regressor’s performance.

15.4 Final Optimization: Tuning Learning Rate and Number of Trees
Building on your previous findings, let’s now advance to a more comprehensive optimization
approach that involves simultaneously tuning both learning_rate and n_estimators. This
dual-parameter tuning is designed to explore how these parameters work together, potentially
enhancing the performance of the gradient boosting regressor even further.
You’ll begin with GridSearchCV to systematically explore combinations of learning_rate
and n_estimators. This approach provides a structured way to assess the impact of varying
both parameters on your model’s accuracy.

...

# "preprocessor" is already set up as your preprocessing pipeline
model_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", GradientBoostingRegressor(random_state=42))
])

# Parameter grid for GridSearchCV
param_grid = {
    "regressor__learning_rate": [0.001, 0.01, 0.1, 0.2, 0.3],
    "regressor__n_estimators": [100, 200, 300, 400, 500]
}

# Setup the GridSearchCV
grid_search = GridSearchCV(model_pipeline, param_grid, cv=5, scoring="r2", verbose=1)

# Fit the GridSearchCV to the data
grid_search.fit(Ames.drop(columns="SalePrice"), Ames["SalePrice"])

# Best parameters and best score from Grid Search
print("Best parameters (Grid Search):", grid_search.best_params_)
print("Best score (Grid Search):", round(grid_search.best_score_, 4))

Listing 15.4: Fine-tuning the learning rate and number of trees

The GridSearchCV process evaluated 25 different combinations across 5 folds, totaling 125 fits:

Fitting 5 folds for each of 25 candidates, totalling 125 fits

Best parameters (Grid Search): {'regressor__learning_rate': 0.1, 'regressor__n_estimators': 500}
Best score (Grid Search): 0.9089

Output 15.4: The optimal learning rate and number of trees found

It confirmed that a learning_rate of 0.1—the default setting—remains effective. However, it
suggested that an increase to 500 trees could slightly improve your model’s performance,
elevating the R2 score to 0.9089. This is a modest enhancement compared to the R2 of 0.9061
achieved earlier with 200 trees and a learning_rate of 0.1. Interestingly, the previous
randomized search yielded an even better result of 0.9134 with only 200 trees and a
learning_rate of approximately 0.158, illustrating the potential benefits of exploring a
broader parameter space to optimize performance.
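
With two parameters in play, a pivot of the grid’s cv_results_ can make their interaction
easier to read than a single best score. A short sketch, assuming the two-parameter
grid_search fitted above:

...

# Pivot mean R^2 scores: rows are learning rates, columns are tree counts
import pandas as pd

results = pd.DataFrame(grid_search.cv_results_)
scores = results.pivot_table(index="param_regressor__learning_rate",
                             columns="param_regressor__n_estimators",
                             values="mean_test_score")
print(scores.round(4))

A table like this typically shows a ridge of good scores: smaller learning rates pair well with
more trees, and larger learning rates with fewer.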
To ensure that you have thoroughly explored the parameter space, and to potentially uncover
even better configurations, you’ll now employ RandomizedSearchCV. This method allows for a
more explorative and less deterministic approach by sampling from a continuous distribution
of parameter values.

...

from scipy.stats import uniform, randint

# Parameter distribution for RandomizedSearchCV
param_dist = {
    # Uniform distribution between 0.001 and 0.3
    "regressor__learning_rate": uniform(0.001, 0.299),
    # Uniform distribution of integers from 100 to 500
    "regressor__n_estimators": randint(100, 501)
}

# Setup the RandomizedSearchCV
random_search = RandomizedSearchCV(model_pipeline, param_distributions=param_dist,
                                   n_iter=50, cv=5, scoring="r2", verbose=1,
                                   random_state=42)

# Fit the RandomizedSearchCV to the data
random_search.fit(Ames.drop(columns="SalePrice"), Ames["SalePrice"])

# Best parameters and best score from Random Search
print("Best parameters (Random Search):", random_search.best_params_)
print("Best score (Random Search):", round(random_search.best_score_, 4))

Listing 15.5: Using RandomizedSearchCV to find the optimal learning rate and
number of trees

The RandomizedSearchCV extended your search across a broader range of possibilities, testing
50 different configurations across 5 folds, totaling 250 fits:

Fitting 5 folds for each of 50 candidates, totalling 250 fits

Best parameters (Random Search): {'regressor__learning_rate': 0.12055843054286139, 'regressor__n_estimators': 287}
Best score (Random Search): 0.9158

Output 15.5: Optimal hyperparameters as found by RandomizedSearchCV

It identified an even more effective setting with a learning_rate of approximately 0.121 and
n_estimators at 287, achieving the best R2 score yet at 0.9158. This underscores the potential
of randomized parameter tuning to discover optimal settings that more rigid methods might
miss.
To validate the performance improvements achieved through your tuning efforts, you will
now perform a final cross-validation using the gradient boosting regressor configured with the
best parameters identified: n_estimators set to 287 and a learning_rate of approximately
0.121.

...

# Cross-check model performance of gradient boosting regressor with tuned parameters
# "preprocessor" is already set up as your preprocessing pipeline
model_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", GradientBoostingRegressor(n_estimators=287,
                                            learning_rate=0.12055843054286139,
                                            random_state=42))
])

# Using the full dataset X, y
X = Ames.drop(columns="SalePrice")
y = Ames["SalePrice"]

# Perform 5-fold cross-validation
cv_scores = cross_val_score(model_pipeline, X, y, cv=5, scoring="r2")

# Output the mean cross-validated score of the tuned model
print("Performance of gradient boosting regressor with tuned parameters:",
      round(cv_scores.mean(), 4))

Listing 15.6: Training the gradient boosting regressor using the optimal hyperparameters

The final output confirms the performance of your tuned gradient boosting regressor.

Performance of gradient boosting regressor with tuned parameters: 0.9158

Output 15.6: The model score using optimal hyperparameters

By optimizing both learning_rate and n_estimators, you have achieved an R2 score of 0.9158.
This score not only validates the enhancements made through parameter tuning but also
emphasizes the capability of the gradient boosting regressor to adapt and perform consistently
across the dataset.
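
Cross-validation estimates how the tuned configuration generalizes, but it does not leave you
with a single fitted model. To actually use the model, you would refit the tuned pipeline on
all available data. A minimal sketch, reusing the model_pipeline, X, and y defined above:

...

# Refit the tuned pipeline on the full dataset before making predictions
final_model = model_pipeline.fit(X, y)

# Predict on new data with the same columns as the training features;
# here the first five training rows stand in for unseen houses
print(final_model.predict(X.head()).round(0))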

15.5 Further Reading


This section provides more resources on the topic if you want to go deeper.

Papers
Dean De Cock. “Ames, Iowa: Alternative to the Boston Housing Data as an End of
Semester Regression Project”. Journal of Statistics Education, 19(3), 2011.
https://jse.amstat.org/v19n3/decock.pdf

Resources
Dean De Cock. Ames Housing Dataset. 2011.
https://jse.amstat.org/v19n3/decock/AmesHousing.txt
Dean De Cock. Ames Housing Data Dictionary. 2011.
https://jse.amstat.org/v19n3/decock/DataDocumentation.txt

Online
sklearn.model_selection.GridSearchCV API. scikit-learn.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
sklearn.model_selection.RandomizedSearchCV API. scikit-learn.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
sklearn.ensemble.GradientBoostingRegressor API. scikit-learn.
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
Dhiraj K. Implementing Gradient Boosting Regression in Python. Sept. 2024.
https://blog.paperspace.com/implementing-gradient-boosting-regression-python/

15.6 Summary
This chapter explored the capabilities of the gradient boosting regressor (GBR), from
understanding the foundational concepts of boosting to advanced optimization techniques
using the Ames housing dataset. It focused on key parameters of the GBR such as the
number of trees and learning rate, essential for refining the model’s accuracy and efficiency.
Through systematic and randomized approaches, it demonstrated how to fine-tune these
parameters using GridSearchCV and RandomizedSearchCV, enhancing the model’s performance
significantly.
Specifically, you learned:
• The fundamentals of boosting and how it differs from other ensemble techniques like
bagging.
• How to achieve incremental improvements by experimenting with a range of models.
• Techniques for tuning the learning rate and number of trees for the gradient boosting
regressor.
In the next chapter, you will explore the famous XGBoost library for gradient boosting trees.