MACHINE LEARNING MASTERY
Next-Level Data Science
Mastering Techniques for the Modern Era
Vinod Chugani
This is Just a Sample
Disclaimer
The information contained within this eBook is strictly for educational purposes. If you wish to
apply ideas contained in this eBook, you are taking full responsibility for your actions.
The author has made every effort to ensure the accuracy of the information within this book was
correct at time of publication. The author does not assume and hereby disclaims any liability to any
party for any loss, damage, or disruption caused by errors or omissions, whether such errors or
omissions result from accident, negligence, or any other cause.
No part of this eBook may be reproduced or transmitted in any form or by any means, electronic or
mechanical, recording or by any information storage and retrieval system, without written
permission from the author.
Credits
Founder: Jason Brownlee
Lead Editor: Adrian Tam
Author: Vinod Chugani
Technical Reviewers: Yoyo Chan, Darci Heikkinen, and Ishaan Mahajan
Copyright
Next-Level Data Science
© 2024 MachineLearningMastery.com. All Rights Reserved.
Edition: v1.00
Contents
Preface
Introduction
Preface
Data science applies mathematics and statistics to tell a story from the data you have and
sometimes predict something about the data you have not yet collected. The traditional
statistical techniques are powerful enough to reveal a lot about the data, but as the discipline
evolved, more tools have emerged, such as machine learning models, to bring new insights.
Machine learning models are called models because they are based on certain assumptions
about the data you used to train them. As the great statistician George Box said, “All models
are wrong, but some are useful.” Knowing how to use a model correctly, and how to make it more
useful, is the key to success in applying a model in data science. What matters is not whether
you have the sharpest tool but whether you use it correctly.
This book is the sequel to our earlier publication, The Beginner’s Guide to Data Science.
In this book, we will focus on linear regression and tree-based models, which are the simplest
and most ubiquitous models in data science. You will be led through the steps to prepare the
data and apply the models correctly.
Introduction
Welcome to Next-Level Data Science. This book is the sequel to The Beginner’s Guide to
Data Science. However, that is not a prerequisite for this book. This book focuses on the
most commonly used models in data science, namely, linear regression, decision trees, and
some other models derived from them. While these are the models every data scientist should
know, this book aims to show you how you should use these models to get the most out of
them.
As we said in the first book, the central theme of data science is to tell a story from
data. Using sophisticated models or the latest and greatest software is not the requirement.
Sometimes, the simplest and oldest models can do a good job. Some people jump to very
sophisticated models because they assume the simpler ones do not work, or because they tried
one and did not realize they had used it incorrectly. If you can use the simple models to their
full potential, you will be surprised how many problems they can handle.
This book is created to walk you through a data science project with just enough programming.
Some simple yet commonly used models are discussed. Visualizations are employed only when a
chart is required to explain a concept. You will start with a dataset (the same one as in the
first book) and apply the models to it. Through these examples, you will be shown how a data
scientist thinks when a tool, such as a regression model, is used.
Book Organization
This book is in three parts. The first part is about linear regression. This is the simplest
model, but it is powerful enough for many applications. It is also simple enough to understand;
therefore, you can use it as an example to learn how to apply a model to a data science project
in general.
This part includes the following chapters:
1. Regression Through Two Different Lenses
2. From Train-Test to Cross-Validation: Advancing Your Model’s Evaluation
3. The Strategic Use of Sequential Feature Selector for Housing Price Predictions
4. The Search for the Sweet Spot in a Linear Regression with Numeric Features
5. One-Hot Encoding: Understanding the “Hot” in Data
6. Interpreting Coefficients in Linear Regression Models
7. Capturing Curves: Advanced Modeling with Polynomial Regression
These chapters not only show you how to create a linear regression model in Python but also
illustrate how many stories you can tell by using just a simple model. This is an important
concept in data science; namely, a simple technique can reveal a lot about the data, and you
should extract the most from your limited resources.
The second part is about some skills you can apply to improve a model or a workflow,
using the regression models from Part I as examples:
8. The Power of Pipelines
9. Detecting and Overcoming Perfect Multicollinearity in Large Datasets
10. Scaling to Success: Implementing and Optimizing Penalized Models
11. Filling the Gaps: A Comparative Guide to Imputation Techniques in Machine Learning
These chapters address the common problems you encounter in a data science project.
The pipelines (Chapter 8) are about how to make the workflow more efficient, and the other
chapters are about how to make the data and model more robust and accurate.
Part III moves away from linear regression to cover the tree-based models:
12. Branching Out: Exploring Tree-Based Models for Regression
13. Decision Trees and Ordinal Encoding: A Practical Guide
14. From Single Trees to Forests: Enhancing Real Estate Predictions with Ensembles
15. Boosting Over Bagging: Enhancing Predictive Accuracy with Gradient Boosting
Regressors
You are encouraged to run the code in each chapter first and go back to change the code to try
a different idea afterward. This is the fastest way to learn.
Next
Let’s dive in. Next up is Part I, where you will take a tour of the scikit-learn library in Python
to create a linear regression model, which is an essential skill you will be using throughout
this book.
2
From Train-Test to
Cross-Validation: Advancing
Your Model’s Evaluation
Many beginners will initially rely on the train-test method to evaluate their models. This
method is the most straightforward way to give a clear indication of how well a model performs
on unseen data. However, this approach may lead to an incomplete understanding of a model’s
capabilities. In this chapter, we’ll discuss why it’s important to go beyond the basic train-test
split and how cross-validation can offer a more thorough evaluation of model performance.
You will go through the essential steps to achieve a deeper and more accurate assessment of
your machine learning models.
Let’s get started.
Overview
This chapter is divided into three parts; they are:
⊲ Model Evaluation: Train-Test vs. Cross-Validation
⊲ The “Why” of Cross-Validation
⊲ Delving Deeper with k-Fold Cross-Validation
Figure 2.1: How to do train-test split on a dataset (80% of the data is used for training and 20% for testing; one model is scored)
However, if you don’t have sufficient data, you can do cross-validation. Figure 2.2 shows a
5-fold cross-validation, where the dataset is split into five subsets (“folds”). In each round of
validation, a different fold is used as the test set while the remaining folds form the training set.
This process is repeated five times, producing five different models and ensuring each data
point is used for training and testing.
Figure 2.2: 5-fold cross-validation, where each 20% fold takes a turn as the test set while the remaining 80% forms the training set
Listing 2.1: Running train-test split and 5-fold cross-validation on the same dataset
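The code for Listing 2.1 is not reproduced in this sample. A minimal sketch of running both evaluations on the Ames housing data, assuming a single predictor such as GrLivArea (the full chapter may use a different setup), could look like this:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Load the Ames housing data and pick a single predictor (assumed here)
Ames = pd.read_csv("Ames.csv")
X = Ames[["GrLivArea"]]
y = Ames["SalePrice"]

# Train-test split: one model, one R^2 score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("Train-test R^2:", round(model.score(X_test, y_test), 4))

# 5-fold cross-validation: five models, five R^2 scores
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print("Cross-validation R^2 scores:", scores.round(4))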
While the train-test method yields a single R2 score, cross-validation provides you with
a spectrum of five different R2 scores, one from each fold of the data, offering a more
comprehensive view of the model’s performance:
The roughly equal R2 scores across the five folds mean the model is stable. You can then decide
whether this model (i.e., linear regression) provides acceptable predictive power.
¹ If you do one test and get a high (or low) R2, you may wonder if you are just being lucky (or unlucky). But if you get the same result five times in a row, you should know it is real.
2.2 The “Why” of Cross-Validation
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
# Import Seaborn and Matplotlib
import seaborn as sns
import matplotlib.pyplot as plt
Listing 2.2: Visually comparing the R2 scores obtained from different trained models
This visualization underscores the difference in insights gained from a single train-test
evaluation versus the broader perspective offered by cross-validation:
The red dot in Figure 2.3 is near the lower end of the box-and-whisker plot. If you obtained only
that score, you would be underestimating the accuracy of your model. With more scores from
cross-validation, you have a better estimate. Through cross-validation, you gain a deeper
understanding of your model’s performance, moving closer to developing machine learning
solutions that are both effective and reliable.
2.3 Delving Deeper with k-Fold Cross-Validation
import pandas as pd
# Import k-fold and necessary libraries
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

Ames = pd.read_csv("Ames.csv")
X = Ames[["GrLivArea"]]   # predictor assumed for this sample
y = Ames["SalePrice"]

kf = KFold(n_splits=5)

# Iterate over each fold, train a model, and score it
for fold, (train_index, test_index) in enumerate(kf.split(X), start=1):
    model = LinearRegression()
    model.fit(X.iloc[train_index], y.iloc[train_index])
    y_test = y.iloc[test_index]
    y_pred = model.predict(X.iloc[test_index])

    # Calculate and print the R^2 score for the current fold
    print(f"Fold {fold}:")
    print(f"TRAIN set size: {len(train_index)}")
    print(f"TEST set size: {len(test_index)}")
    print(f"R^2 score: {round(r2_score(y_test, y_pred), 4)}\n")
This code block will show you the size of each training and testing set and the corresponding
R2 score for each fold:
Fold 1:
TRAIN set size: 2063
TEST set size: 516
R^2 score: 0.4884
Fold 2:
TRAIN set size: 2063
TEST set size: 516
R^2 score: 0.5412
Fold 3:
TRAIN set size: 2063
TEST set size: 516
R^2 score: 0.5214
Fold 4:
TRAIN set size: 2063
TEST set size: 516
R^2 score: 0.5454
Fold 5:
TRAIN set size: 2064
TEST set size: 515
R^2 score: 0.4673
Output 2.2: Metrics from the models trained using k-fold cross-validation
The KFold class shines in its transparency and control over the cross-validation process. While
cross_val_score() simplifies the process into one line, KFold opens it up, allowing you to view
the exact splits of your data. This is incredibly valuable when you need to:
⊲ Understand how your data is being divided.
2.4 Further Reading
Papers
Dean De Cock. “Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project”. Journal of Statistics Education, 19(3), 2011.
https://jse.amstat.org/v19n3/decock.pdf
Resources
Dean De Cock. Ames Housing Dataset. 2011.
https://jse.amstat.org/v19n3/decock/AmesHousing.txt
Dean De Cock. Ames Housing Data Dictionary. 2011.
https://jse.amstat.org/v19n3/decock/DataDocumentation.txt
Online
sklearn.model_selection.train_test_split API. scikit-learn.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
sklearn.model_selection.cross_val_score API. scikit-learn.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
sklearn.model_selection.KFold API. scikit-learn.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
2.5 Summary
In this chapter, we explored the importance of thorough model evaluation through cross-
validation and the KFold method. Both techniques meticulously avoid the pitfall of data
leakage by keeping training and testing data distinct, thereby ensuring the model’s performance
is accurately measured. Moreover, by validating each data point exactly once and using it
for training k − 1 times, these methods provide a detailed view of the model’s ability to
generalize, boosting confidence in its real-world applicability. Through practical examples,
we’ve demonstrated how integrating these strategies into your evaluation process leads to
more reliable and robust machine learning models ready for the challenges of new and unseen
data.
Specifically, you learned:
⊲ The efficiency of cross_val_score() in automating the cross-validation process.
⊲ How KFold offers detailed control over data splits for tailored model evaluation.
⊲ How both methods ensure full data utilization and prevent data leakage.
In the next chapter, you will look into selecting inputs for your models.
7
Capturing Curves: Advanced
Modeling with Polynomial
Regression
When we analyze relationships between variables in machine learning, we often find that a
straight line doesn’t tell the whole story. That’s where polynomial transformations come
in, adding layers to our regression models without complicating the calculation process.
By transforming our features into their polynomial counterparts—squares, cubes, and other
higher-degree terms—we give linear models the flexibility to curve and twist, fitting snugly
to the underlying trends of our data.
This chapter will explore how you can move beyond simple linear models to capture
more complex relationships in your data. You’ll learn about the power of polynomial and
cubic regression techniques, which allow you to see beyond the apparent and uncover the
underlying patterns that a straight line might miss. You will also delve into the balance
between adding complexity and maintaining predictability in your models, ensuring that they
are both powerful and practical.
Let’s get started.
Overview
This chapter is divided into three parts; they are:
⊲ Establishing a Baseline with Linear Regression
⊲ Capturing Curves with a Polynomial Regression
⊲ Experimenting with a Cubic Regression
# Coefficients
intercept = int(linear_model.intercept_)
slope = int(linear_model.coef_[0])
eqn = f"Fitted Line: y = {slope}x - {abs(intercept)}"
With a basic linear regression, your model came up with the following equation: y =
43383x − 84264. This means that each additional point in quality is associated with an
increase of approximately $43,383 in the sale price. To evaluate the performance of your
model, you used 5-fold cross-validation, resulting in an R2 of 0.618. This value indicates that
about 61.8% of the variability in sale prices can be explained by the overall quality of the
house using this simple model.
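The full listing behind these numbers is not shown in this sample. A minimal sketch of the baseline fit, assuming the setup described above (SalePrice regressed on OverallQual, scored with 5-fold cross-validation), might be:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Load data and select the single quality feature
Ames = pd.read_csv("Ames.csv")
X = Ames[["OverallQual"]]
y = Ames["SalePrice"]

# Fit the baseline straight-line model
linear_model = LinearRegression()
linear_model.fit(X, y)

# Mean R^2 from 5-fold cross-validation
cv_score = cross_val_score(linear_model, X, y, cv=5).mean()
print(f"Slope: {linear_model.coef_[0]:.0f}, Intercept: {linear_model.intercept_:.0f}")
print(f"5-Fold CV R^2: {cv_score:.3f}")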
Linear regression is straightforward to understand and implement. However, it assumes
that the relationship between the independent and dependent variables is linear, which might
not always be the case, as seen in the scatter plot above. While linear regression provides
a good starting point, real-world data often require more complex models to capture curved
relationships, as you’ll see in the next section on polynomial regression.
# Plot
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color="blue", alpha=0.5, label="Data points")
plt.plot(X_range, poly_model.predict(X_range_poly), color="red", label=eqn)
plt.title("Polynomial Regression (3rd Degree) of SalePrice vs OverallQual", fontsize=16)
plt.xlabel("Overall Quality", fontsize=12)
plt.ylabel("Sale Price", fontsize=12)
plt.legend(fontsize=14)
plt.grid(True)
plt.text(1, 540000, f"5-Fold CV R^2: {cv_score:.3f}", fontsize=14, color="green")
plt.show()
First, you transform the predictor variable into polynomial features up to the third degree.
This enhancement expands your feature set from just x (Overall Quality) to x, x², and x³ (i.e.,
each feature becomes three different but correlated features), allowing your linear model to
fit a more complex, curved relationship in the data.
linear regression model to capture the nonlinear relationship between the overall quality and
sale price.
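The transformation step itself is elided above. A minimal sketch of it, assuming scikit-learn’s PolynomialFeatures (listed in the Further Reading of this chapter), might be:

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

Ames = pd.read_csv("Ames.csv")
X = Ames[["OverallQual"]]
y = Ames["SalePrice"]

# Expand x into [x, x^2, x^3]; include_bias=False leaves the intercept to the model
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X)

# Fit the same linear regression, now on the polynomial features
poly_model = LinearRegression()
poly_model.fit(X_poly, y)

# Mean R^2 from 5-fold cross-validation of the degree-3 model
cv_score = cross_val_score(poly_model, X_poly, y, cv=5).mean()
print(f"Coefficients: {poly_model.coef_.round(0)}, Intercept: {poly_model.intercept_:.0f}")
print(f"5-Fold CV R^2: {cv_score:.3f}")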
The new model has the equation y = 65966x − 11619x² + 1006x³ − 31343. The curve fits
the data points more closely than the straight line, indicating a better model. The 5-fold
cross-validation gave us an R2 of 0.681, which is an improvement over the linear model. This
suggests that including the squared and cubic terms helps your model capture more of the
complexity in the data. Polynomial regression introduces the ability to fit curves, but sometimes,
focusing on a specific power, like the cubic term, can reveal deeper insights, as you will explore
in cubic regression.
7.3 Experimenting with a Cubic Regression
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

# Cubic transformation: keep only the x^3 term
def cubic_transformation(x):
    return x ** 3

# Load data
Ames = pd.read_csv("Ames.csv")
X = Ames[["OverallQual"]]
y = Ames["SalePrice"]

# Apply transformation
cubic_transformer = FunctionTransformer(cubic_transformation)
X_cubic = cubic_transformer.fit_transform(X)

# Fit model
cubic_model = LinearRegression()
cubic_model.fit(X_cubic, y)

# Cross-validation (5-fold by default)
cv_score_cubic = cross_val_score(cubic_model, X_cubic, y).mean()

# Plot (X_range, X_range_cubic, and eqn are built as in the earlier
# polynomial listing, which is elided in this sample)
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color="blue", alpha=0.5, label="Data points")
plt.plot(X_range, cubic_model.predict(X_range_cubic), color="red", label=eqn)
Here, you applied a cubic transformation to your independent variable and obtained a cubic
model with the equation y = 361x³ + 85579. This represents a slightly simpler approach than
the full polynomial regression model, focusing solely on the cubic term’s predictive power.
With cubic regression, the 5-fold cross-validation yielded an R2 of 0.678. This performance is
slightly below the full polynomial model but still notably better than the linear one. Cubic
regression is simpler than a higher-degree polynomial regression and can be sufficient for
capturing the relationship in some datasets. It’s less prone to overfitting than a higher-degree
polynomial model but more flexible than a linear model. The coefficient in the cubic regression
model, 361, indicates the rate at which sale prices increase as the quality cubed increases. This
emphasizes the substantial influence that very high-quality levels have on the price, suggesting
that properties with exceptional quality see a disproportionately higher increase in their sale
price. This insight is particularly valuable for investors or developers focused on high-end
properties where quality is a premium.
As you may imagine, this technique is not limited to polynomial regression. You can introduce
more exotic functions, such as log and exponential, if you think they make sense in the scenario.
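For example, a logarithmic feature can be plugged in through the same FunctionTransformer mechanism used above. The sketch below is an illustration only; the choice of np.log1p is an assumption, not something taken from the chapter:

import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

Ames = pd.read_csv("Ames.csv")
X = Ames[["OverallQual"]]
y = Ames["SalePrice"]

# Replace x with log(1 + x) instead of a polynomial term
log_transformer = FunctionTransformer(np.log1p)
X_log = log_transformer.fit_transform(X)

log_model = LinearRegression()
log_model.fit(X_log, y)
cv_score_log = cross_val_score(log_model, X_log, y, cv=5).mean()
print(f"5-Fold CV R^2 (log feature): {cv_score_log:.3f}")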
7.4 Further Reading
Papers
Dean De Cock. “Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project”. Journal of Statistics Education, 19(3), 2011.
https://jse.amstat.org/v19n3/decock.pdf
Resources
Dean De Cock. Ames Housing Dataset. 2011.
https://jse.amstat.org/v19n3/decock/AmesHousing.txt
Dean De Cock. Ames Housing Data Dictionary. 2011.
https://jse.amstat.org/v19n3/decock/DataDocumentation.txt
Online
sklearn.preprocessing.PolynomialFeatures API. scikit-learn.
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
sklearn.preprocessing.FunctionTransformer API. scikit-learn.
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html
Tamas Ujhelyi. Polynomial Regression in Python using scikit-learn. Nov. 2021.
https://data36.com/polynomial-regression-python-scikit-learn/
7.5 Summary
This chapter explored different regression techniques suited for modeling relationships in
data across varying complexities. We started with linear regression to establish a baseline
for predicting house prices based on quality ratings. Visuals accompanying this section
demonstrate how a linear model attempts to fit a straight line through the data points,
illustrating the basic concept of regression. Advancing to polynomial regression, we tackled
more intricate, nonlinear trends, which enhanced model flexibility and accuracy. The
accompanying graphs showed how a polynomial curve adjusts to fit the data points more
closely than a simple linear model. Finally, we focused on cubic regression to examine the
impact of a specific power of the predictor variable, isolating the effects of higher-degree terms
on the dependent variable. The cubic model proved to be particularly effective, capturing the
essential characteristics of the relationship with sufficient precision and simplicity.
Specifically, you learned:
⊲ How to identify nonlinear trends using visualization techniques.
⊲ How to model nonlinear trends using polynomial regression techniques.
⊲ How cubic regression can capture similar predictability with fewer model complexities.
Starting with the next chapter, you will see various techniques to make the modeling project
easier.
15
Boosting Over Bagging:
Enhancing Predictive Accuracy
with Gradient Boosting
Regressors
Ensemble learning techniques primarily fall into two categories: bagging and boosting. Bagging
reduces model variance and overfitting by averaging many models trained on random subsets of the
data, whereas boosting reduces underfitting by combining models that correct each other’s
errors. This chapter begins our deep dive into boosting, starting with the gradient boosting
errors. This chapter begins our deep dive into boosting, starting with the gradient boosting
regressor. Through its application on the Ames housing dataset, we will demonstrate how
boosting uniquely enhances models, setting the stage for exploring various boosting techniques
in upcoming chapters.
Let’s get started.
Overview
This chapter is divided into four parts; they are:
⊲ What Is Boosting?
⊲ Comparing Baseline Decision Tree to Gradient Boosting Ensembles
⊲ Optimizing Gradient Boosting with Learning Rate Adjustments
⊲ Final Optimization: Tuning Learning Rate and Number of Trees
⊲ Sequential Learning: Boosting builds one model at a time. Each new model learns from the shortcomings of the previous ones, allowing for progressive improvement in capturing data complexities.
⊲ Error Correction: New learners focus on previously mispredicted instances, continuously enhancing the ensemble’s capability to capture difficult patterns in the data (see the sketch after this list).
⊲ Model Complexity: The ensemble’s complexity grows as more models are added, enabling it to capture intricate data structures effectively.
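The following bare-bones sketch illustrates the error-correcting idea with two shallow decision trees and a shrinkage factor. It uses synthetic data and illustrative numbers; it is not code from this chapter:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=1, noise=10.0, random_state=42)

learning_rate = 0.5                  # shrinkage applied to each new learner
pred = np.full_like(y, y.mean())     # start from the mean prediction

for stage in range(1, 3):            # two boosting stages
    residuals = y - pred             # errors left by the current ensemble
    tree = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, residuals)
    pred += learning_rate * tree.predict(X)   # each new tree corrects part of the error
    print(f"Stage {stage}: mean absolute residual = {np.mean(np.abs(y - pred)):.2f}")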
Boosting vs. Bagging. Bagging involves building several models (often independently) and
combining their outputs to enhance the ensemble’s overall performance, primarily by reducing
the risk of overfitting the noise in the training data. In contrast, boosting focuses on improving
the accuracy of predictions by learning from errors sequentially, which allows it to adapt more
intricately to the data.
You will compare a baseline decision tree regressor against ensemble models, including bagging
and random forest regressors and the gradient boosting regressor with 100 and 200 trees, to
explore the enhancements these ensemble techniques offer over the baseline.
# Exclude "PID" and "SalePrice" from features and handle the "Electrical" column
numeric_features = Ames.select_dtypes(include=["int64", "float64"]) \
                       .drop(columns=["PID", "SalePrice"]).columns
categorical_features = Ames.select_dtypes(include=["object"]).columns \
                           .difference(["Electrical"])
electrical_feature = ["Electrical"]

# Manually specify the categories for ordinal encoding according to the data dictionary
ordinal_order = {
    # Electrical system
    "Electrical": ["Mix", "FuseP", "FuseF", "FuseA", "SBrkr"],
    # General shape of property
    "LotShape": ["IR3", "IR2", "IR1", "Reg"],
    # Type of utilities available
    "Utilities": ["ELO", "NoSeWa", "NoSewr", "AllPub"],
    # Slope of property
    "LandSlope": ["Sev", "Mod", "Gtl"],
    # Evaluates the quality of the material on the exterior
    "ExterQual": ["Po", "Fa", "TA", "Gd", "Ex"],
    # Evaluates the present condition of the material on the exterior
    "ExterCond": ["Po", "Fa", "TA", "Gd", "Ex"],
    # Height of the basement
    "BsmtQual": ["None", "Po", "Fa", "TA", "Gd", "Ex"],
    # General condition of the basement
    "BsmtCond": ["None", "Po", "Fa", "TA", "Gd", "Ex"],
    # Walkout or garden level basement walls
    "BsmtExposure": ["None", "No", "Mn", "Av", "Gd"],
    # Quality of basement finished area
    "BsmtFinType1": ["None", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"],
    # Quality of second basement finished area
    "BsmtFinType2": ["None", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"],
    # Heating quality and condition
...
numeric_transformer = Pipeline(steps=[
    ("impute_mean", SimpleImputer(strategy="mean"))
])

ordinal_transformer = Pipeline([
    ("impute_ordinal", categorical_imputer),
    ("ordinal", OrdinalEncoder(categories=[ordinal_order[feat]
                                           for feat in ordinal_features
                                           if feat in ordinal_except_electrical]))
])
# Combined preprocessor for numeric, ordinal, nominal, and specific electrical data
preprocessor = ColumnTransformer(
    transformers=[
        ("electrical", electrical_transformer, ["Electrical"]),
        ("num", numeric_transformer, numeric_features),
        ("ordinal", ordinal_transformer, ordinal_except_electrical),
        ("nominal", categorical_transformer, nominal_features)
    ])
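The portion of the listing that defines the models and scores them is not included in this sample. A minimal sketch, assuming the Ames DataFrame and the preprocessor built above, might be:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, GradientBoostingRegressor

models = {
    "Decision Tree (baseline)": DecisionTreeRegressor(random_state=42),
    "Bagging (100 trees)": BaggingRegressor(n_estimators=100, random_state=42),
    "Random Forest (100 trees)": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting (100 trees)": GradientBoostingRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting (200 trees)": GradientBoostingRegressor(n_estimators=200, random_state=42),
}

y = Ames["SalePrice"]
X = Ames.drop(columns=["SalePrice"])

# Score each model with 5-fold cross-validation on top of the shared preprocessor
for name, model in models.items():
    pipeline = Pipeline([("preprocessor", preprocessor), ("regressor", model)])
    score = cross_val_score(pipeline, X, y, cv=5).mean()
    print(f"{name}: mean R^2 = {score:.4f}")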
Below are the cross-validation results, showcasing how each model performs in terms of mean
R2 values:
The results from your ensemble models underline several key insights into the behavior and
performance of advanced regression techniques:
⊲ Baseline and Enhancement: Starting with a basic decision tree regressor, which serves as the baseline with an R2 of 0.7663, you can observe significant performance uplifts as you introduce more complex models. Both bagging and random forest regressors, using different numbers of trees, show improved scores, illustrating the power of ensemble methods in leveraging multiple learning models to reduce error.
⊲ Gradient Boosting Regressor’s Edge: Particularly notable is the gradient boosting regressor. With its default setting of 100 trees, it achieves an R2 of 0.9027, and increasing the number of trees further to 200 nudges the score up to 0.9061. This indicates the effectiveness of GBR in this context and highlights its efficiency in sequential improvement from additional learners.
⊲ Marginal Gains from More Trees: While increasing the number of trees generally results in better performance, the incremental gains diminish as you expand the ensemble size. This trend is evident across bagging, random forest, and gradient boosting models, suggesting a point of diminishing returns where additional computational resources yield minimal performance improvements.
The results highlight the gradient boosting regressor’s robust performance. It effectively
leverages comprehensive preprocessing and the sequential improvement strategy characteristic
of boosting. Next, you will explore how adjusting the learning rate can refine your model’s
performance, enhancing its predictive accuracy.
◦ Convergence: A learning rate that is too high may cause the training process
to converge too quickly to a suboptimal solution, or it might not converge at
all as it overshoots the minimum.
◦ Accuracy and Overfitting: A learning rate that is too low can lead the model
to learn too slowly, which may require more trees to achieve similar accuracy,
potentially leading to overfitting if not monitored.
⊲ Tuning: Choosing the right learning rate balances speed and accuracy. It is often selected through trial and error or more systematic approaches like GridSearchCV and RandomizedSearchCV, as adjusting the learning rate can significantly affect the model’s performance and training time.
By adjusting the learning rate, data scientists can control how quickly a boosting model adapts
to the complexity of its errors. This makes the learning rate a powerful tool in fine-tuning
model performance, especially in boosting algorithms where each new tree is built to correct
the residuals (errors) left by the previous trees.
To optimize the learning_rate, you start with GridSearchCV, a systematic method that
will explore predefined values ([0.001, 0.01, 0.1, 0.2, 0.3]) to ascertain the most effective setting
for enhancing your model’s accuracy.
...
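The listing itself is elided in this sample. A minimal sketch of such a grid search, assuming a pipeline whose final step is named "regressor" (consistent with the parameter names shown in the output below), might be:

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingRegressor

# Reuse the preprocessor built earlier; the final step is named "regressor"
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", GradientBoostingRegressor(random_state=42)),
])

param_grid = {"regressor__learning_rate": [0.001, 0.01, 0.1, 0.2, 0.3]}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X, y)

print("Best parameters (Grid Search):", grid_search.best_params_)
print("Best score (Grid Search):", round(grid_search.best_score_, 4))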
Here are the results from the GridSearchCV, focused solely on optimizing the learning_rate
parameter:
Using GridSearchCV, you found that a learning_rate of 0.1 yielded the best result.
...
Encouraged by the gains achieved through these optimization strategies, you will now
extend your focus to fine-tuning both the learning_rate and n_estimators simultaneously.
This next phase aims to uncover even more optimal settings by exploring the combined impact
of these crucial parameters on your gradient boosting regressor’s performance.
...
The GridSearchCV process evaluated 25 different combinations across 5 folds, totaling 125 fits:
Best parameters (Grid Search): {'regressor__learning_rate': 0.1, 'regressor__n_estimators': 500}
Best score (Grid Search): 0.9089
Output 15.4: The optimal learning rate and number of trees found
...
Listing 15.5: Using RandomizedSearchCV to find the optimal learning rate and
number of trees
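The body of Listing 15.5 is elided in this sample. A minimal sketch of a randomized search over both parameters, assuming continuous and integer distributions from scipy.stats (the exact ranges here are assumptions), might be:

from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "regressor__learning_rate": uniform(0.01, 0.3),   # samples from [0.01, 0.31)
    "regressor__n_estimators": randint(100, 1000),    # integer tree counts
}

random_search = RandomizedSearchCV(
    pipeline,              # same pipeline as in the grid search above
    param_distributions,
    n_iter=50,             # 50 configurations x 5 folds = 250 fits
    cv=5,
    random_state=42,
)
random_search.fit(X, y)

print("Best parameters (Random Search):", random_search.best_params_)
print("Best score (Random Search):", round(random_search.best_score_, 4))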
The RandomizedSearchCV extended your search across a broader range of possibilities, testing
50 different configurations across 5 folds, totaling 250 fits:
Best parameters (Random Search): {'regressor__learning_rate': 0.12055843054286139, 'regressor__n_estimators': 287}
Best score (Random Search): 0.9158
It identified an even more effective setting with a learning_rate of approximately 0.121 and
n_estimators at 287, achieving the best R2 score yet at 0.9158. This underscores the potential
of randomized parameter tuning to discover optimal settings that more rigid methods might
miss.
To validate the performance improvements achieved through your tuning efforts, you will
now perform a final cross-validation using the gradient boosting regressor configured with the
best parameters identified: n_estimators set to 287 and a learning_rate of approximately
0.121.
...
# Cross check model performance of gradient boosting regressor with tuned parameters
Listing 15.6: Training the gradient boosting regressor using the optimal hyperparameters
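The body of Listing 15.6 is elided here. A minimal sketch of the final check, assuming the pipeline, X, and y defined earlier, might be:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Plug the tuned hyperparameters back into the pipeline and re-score it
pipeline.set_params(
    regressor=GradientBoostingRegressor(
        n_estimators=287,
        learning_rate=0.12055843054286139,
        random_state=42,
    )
)
final_score = cross_val_score(pipeline, X, y, cv=5).mean()
print(f"Mean 5-fold R^2 with tuned parameters: {final_score:.4f}")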
The final output confirms the performance of your tuned gradient boosting regressor.
By optimizing both learning_rate and n_estimators, you have achieved an R2 score of 0.9158.
This score not only validates the enhancements made through parameter tuning but also
emphasizes the capability of the gradient boosting regressor to adapt and perform consistently
across the dataset.
15.5 Further Reading
Papers
Dean De Cock. “Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project”. Journal of Statistics Education, 19(3), 2011.
https://jse.amstat.org/v19n3/decock.pdf
Resources
Dean De Cock. Ames Housing Dataset. 2011.
https://jse.amstat.org/v19n3/decock/AmesHousing.txt
Dean De Cock. Ames Housing Data Dictionary. 2011.
https://jse.amstat.org/v19n3/decock/DataDocumentation.txt
Online
sklearn.model_selection.GridSearchCV API. scikit-learn.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
sklearn.model_selection.RandomizedSearchCV API. scikit-learn.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
sklearn.ensemble.GradientBoostingRegressor API. scikit-learn.
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
Dhiraj K. Implementing Gradient Boosting Regression in Python. Sept. 2024.
https://blog.paperspace.com/implementing-gradient-boosting-regression-python/
15.6 Summary
This chapter explored the capabilities of the gradient boosting regressor (GBR), from
understanding the foundational concepts of boosting to advanced optimization techniques
using the Ames housing dataset. It focused on key parameters of the GBR such as the
number of trees and learning rate, essential for refining the model’s accuracy and efficiency.
Through systematic and randomized approaches, it demonstrated how to fine-tune these
parameters using GridSearchCV and RandomizedSearchCV, enhancing the model’s performance
significantly.
Specifically, you learned:
⊲ The fundamentals of boosting and how it differs from other ensemble techniques like bagging.
⊲ How to achieve incremental improvements by experimenting with a range of models.
⊲ Techniques for tuning learning rate and number of trees for the gradient boosting regressor.
In the next chapter, you will explore the famous XGBoost library for gradient boosting trees.
This is Just a Sample