Ensemble Machine Learning
Cookbook
Dipayan Sarkar
Vijayalakshmi Natarajan
BIRMINGHAM - MUMBAI
Ensemble Machine Learning Cookbook
Copyright © 2019 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form
or by any means, without the prior written permission of the publisher, except in the case of brief quotations
embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented.
However, the information contained in this book is sold without warranty, either express or implied. Neither the
authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to
have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products
mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy
of this information.
ISBN 978-1-78913-660-9
www.packtpub.com
mapt.io
Mapt is an online digital library that gives you full access to over 5,000 books and videos, as
well as industry leading tools to help you plan your personal development and advance
your career. For more information, please visit our website.
Why subscribe?
Spend less time learning and more time coding with practical eBooks and Videos
from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Packt.com
Did you know that Packt offers eBook versions of every book published, with PDF and
ePub files available? You can upgrade to the eBook version at www.packt.com and as a print
book customer, you are entitled to a discount on the eBook copy. Get in touch with us at
[email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a
range of free newsletters, and receive exclusive discounts and offers on Packt books and
eBooks.
Foreword
Artificial Intelligence, with Machine Learning alongside it, currently occupies a formidable
position in the analytics domain for automating strategic decisions. The unabated, meteoric
growth we have been witnessing in business analytics over the last 3 to 5 years is
responsible for this new avatar, AIMLA, an acronym that stands for Artificial Intelligence
and Machine Learning Algorithms.
AIMLA is the new frontier in business analytics that promises a great future in terms of
attractive career opportunities for budding young students of management and managers.
Andrew Ng, a luminary in this field, predicts "AI will transform every industry just like
electricity transformed them 100 years back." AI will have to necessarily use machine
learning algorithms for automation of decisions.
Against this backdrop, the role of this book, Ensemble Machine Learning Cookbook, which is
being introduced into the market by Packt Publishing, looms large. Personally speaking, it
was indeed a pleasure reading this book. Every chapter has been so nicely organized in
terms of the themes "Getting ready", "How to do it", "How it works", and "There's more".
The book uses Python, the new analytic language for deriving insights from data in the
most effective manner. I congratulate the two authors, Dipayan Sarkar and Vijayalakshmi
Natarajan, for producing a practical, yet conceptually rigorous analytic decision-oriented
book that is the need of the hour.
Conceptual clarity, cohesive content, lucid explanation, appropriate datasets for each
algorithm, and analytics for insights using Python coding are the hallmarks of the book.
The journey of ensemble machine learning algorithms in the book involves eight chapters
starting from the preliminary background to Python and going all the way step-by-step, to
Chapter 7, Boosting Model Performance with Boosting. The fascinating part to me has been
Chapter 4, Statistical and Machine Learning Algorithms that is so nicely packed with multiple
regression, logistic regression, Naïve Bayes, decision trees, and support vector machines
that are the bedrock of supervised machine learning. Apart from the rest of the content on
machine learning that was very carefully and effectively covered, what stands out is the
all-important ensemble model, random forest, and its implementation in Chapter 6, When in
Doubt, Use Random Forests.
The book is compact, at around 300 pages, and does not frighten anyone with its size.
This new book Ensemble Machine Learning Cookbook will be extremely handy for both
students and practitioners as a guide for not only understanding machine learning but also
automating it for analytic decisions.
I wish authors Dipayan Sarkar and Vijayalakshmi Natarajan all the best.
Dr. P. K. Viswanathan
Contributors
This book is dedicated to my mother, Rina Sarkar, my father, Gurudas Sarkar, and my
niece, PomPom.
A special thanks to Kajal Goel, Pravalika Aitipamula, Rajat Sharma, Rehan Ali Ansari,
About the reviewers
Dr. P.K. Viswanathan has been rated as one of the top 10 and top 20 most prominent
Analytics and Data Science Academicians in India in the years 2017 and 2018 respectively.
In his industrial tenure, spanning 15+ years, he has held senior management positions in
Ballarpur Industries and J.K. Industries. He holds degrees in MSc (Madras University),
MBA (FMS, Delhi), MS (Manitoba, Canada), and PhD (Madras University). He has authored
various books on Business Statistics and Marketing Research. He has published research
articles in reputed journals and has presented papers in national and international
conferences. He has also conducted many training programs, and is involved as a key
faculty in the Management Development Programs at Great Lakes.
Vadim Smolyakov is currently pursuing his PhD at MIT in the areas of computer science
and artificial intelligence. His primary research interests include Bayesian inference, deep
learning, and optimization. Prior to coming to MIT, Vadim received his undergraduate
degree in engineering science at the University of Toronto. He previously worked as a data
scientist in the e-commerce space. Vadim is passionate about machine learning and data
science, and is interested in making the field accessible to a broad audience, inspiring
readers to innovate and pursue research in artificial intelligence.
Swarna Gupta holds a B.E. in computer science, and has 5 years of experience in the data
science space. She is currently working with Rolls Royce in the capacity of a data scientist.
Her work revolves around leveraging data science and machine learning to create value for
the business. She has extensively worked on IoT-based projects in the vehicle telematics
and solar manufacturing industries. Swarna also makes time in her busy schedule to
be a regular pro bono contributor to social organizations, helping them to solve specific
business problems with the help of data science and machine learning. She takes a keen
interest in the mathematics behind machine learning, deep learning, and artificial
intelligence.
Packt is searching for authors like you
If you're interested in becoming an author for Packt, please visit authors.packtpub.com
and apply today. We have worked with thousands of developers and tech professionals,
just like you, to help them share their insight with the global tech community. You can
make a general application, apply for a specific hot topic that we are recruiting an author
for, or submit your own idea.
Preface
Ensemble modeling is an approach used to improve the performance of machine learning
models. It combines two or more similar or dissimilar machine-learning algorithms to
deliver superior predictive power. This book will help you to implement some popular machine-
learning algorithms to cover different paradigms of ensemble machine learning, such as
boosting, bagging, and stacking.
This Ensemble Machine Learning Cookbook will start by getting you acquainted with the
basics of ensemble techniques and exploratory data analysis. You'll then learn to implement
tasks related to statistical and machine learning algorithms to understand the ensemble of
multiple heterogeneous algorithms. It'll also ensure that you don't miss out on key topics
such as resampling methods. As you progress, you'll get a better understanding of bagging,
boosting, stacking, and learn how to work with the Random Forest algorithm using real-
world examples. The book will highlight how these ensemble methods use multiple models
to improve machine learning results, compared to a single model. In the concluding
chapters, you'll delve into advanced ensemble models using neural networks, Natural
Language Processing (NLP), and more. You'll also be able to implement models covering
fraud detection, text categorization, and sentiment analysis.
By the end of this book, you'll be able to harness ensemble techniques and the working
mechanisms of machine-learning algorithms to build intelligent models using individual
recipes.
Chapter 2, Getting Started with Ensemble Machine Learning, explores what ensemble learning
is and how it can help in real-life scenarios. Basic ensemble techniques, including
averaging, weighted averaging, and max-voting, are explained. These techniques form the
basis of ensemble learning, and an understanding of them will lay the groundwork for
readers to move on to a more advanced stage after reading this chapter.
Chapter 3, Resampling Methods, helps readers get to know the various types of resampling
methods used by machine-learning algorithms. The advantages and disadvantages of each
resampling method are explained, and readers also learn the code to be executed for each
type of sampling.
Chapter 4, Statistical and Machine Learning Algorithms, introduces a handful of algorithms
that will be useful when we get into an ensemble of multiple heterogeneous algorithms. This
chapter uses scikit-learn to prepare all the algorithms to be used.
Chapter 5, Bag the Models with Bagging, provides the readers with an understanding of
what bootstrap aggregation is and how the bootstrap results can be aggregated, in a process
also known as bagging.
Chapter 6, When in Doubt, Use Random Forests, introduces the random forest algorithm. It
will introduce to readers how, and what kind of, ensemble techniques are used by Random
Forest and how this helps our models avoid overfitting.
Chapter 7, Boosting Model Performance with Boosting, introduces boosting and discusses how
it helps to improve model performance by reducing variance and increasing accuracy.
The chapter also explains that boosting is not robust against outliers and noisy data,
but is flexible and can be used with a range of loss functions.
Chapter 8, Blend It with Stacking, applies stacking to learn the optimal combination of base
learners. This chapter will acquaint readers with stacking, which is also known as stacked
generalization.
Chapter 10, Heterogeneous Ensemble Classifiers Using H2O, is a complete code walk-through
of a classification case study for default prediction, with an ensemble of multiple
heterogeneous algorithms built using the H2O library.
Chapter 11, Heterogeneous Ensemble for Text Classification Using NLP, is a complete code
walk-through on a classification case study to classify sentiment polarity using an ensemble
of multiple heterogeneous algorithms. Here, NLP techniques such as semantics are used to
improve the accuracy of classification. Then, the mined text information is used to employ
ensemble classification techniques for sentiment analysis. In this case study, the H2O
library is used for building models.
Chapter 12, Homogeneous Ensemble for Multiclass Classification Using Keras, is a complete
code walk-through of a classification case study for multiclass classification with a
homogeneous ensemble, using data diversity with the tf.keras module from TensorFlow.
Once the file is downloaded, please make sure that you unzip or extract the folder using the
latest version of your archiving software, such as WinRAR/7-Zip for Windows, Zipeg/iZip/UnRarX
for Mac, or 7-Zip/PeaZip for Linux.
The code bundle for the book is also hosted on GitHub at https://github.com/
PacktPublishing/Ensemble-Machine-Learning-Cookbook. In case there's an update to the
code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available
at https://github.com/PacktPublishing/. Check them out!
Conventions used
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames,
file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an
example: "We will use the os package in the operating system's dependent functionality,
and the pandas package for data manipulation."
Sections
In this book, you will find several headings that appear frequently (Getting ready, How to do
it..., How it works..., There's more..., and See also).
To give clear instructions on how to complete a recipe, use these sections as follows:
Getting ready
This section tells you what to expect in the recipe and describes how to set up any software
or any preliminary settings required for the recipe.
How to do it…
This section contains the steps required to follow the recipe.
How it works…
This section usually consists of a detailed explanation of what happened in the previous
section.
There's more…
This section consists of additional information about the recipe in order to make you more
knowledgeable about the recipe.
See also
This section provides helpful links to other useful information for the recipe.
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book
title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes
do happen. If you have found a mistake in this book, we would be grateful if you would
report this to us. Please visit www.packt.com/submit-errata, select your book, click
on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we
would be grateful if you would provide us with the location address or website name.
Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in
and you are interested in either writing or contributing to a book, please visit
authors.packtpub.com.
Reviews
Please leave a review. Once you have read and used this book, why not leave a review on
the site that you purchased it from? Potential readers can then see and use your unbiased
opinion to make purchase decisions, we at Packt can understand what you think about our
products, and our authors can see your feedback on their book. Thank you!
Table of Contents
Preface 10
Chapter 1: Get Closer to Your Data 1
Introduction 1
Data manipulation with Python 2
Getting ready 2
How to do it... 3
How it works... 9
There's more... 11
See also 12
Analyzing, visualizing, and treating missing values 12
How to do it... 13
How it works... 22
There's more... 24
See also 25
Exploratory data analysis 25
How to do it... 25
How it works... 34
There's more... 35
See also 37
Chapter 2: Getting Started with Ensemble Machine Learning 38
Introduction to ensemble machine learning 38
Max-voting 41
Getting ready 41
How to do it... 42
How it works... 44
There's more... 44
Averaging 46
Getting ready 46
How to do it... 46
How it works... 48
Weighted averaging 48
Getting ready 48
How to do it... 49
How it works... 50
See also 51
Chapter 3: Resampling Methods 52
Introduction to sampling 52
Getting ready 53
How to do it... 54
How it works... 55
There's more... 56
See also 57
k-fold and leave-one-out cross-validation 57
Getting ready 59
How to do it... 60
How it works... 62
There's more... 62
See also 64
Bootstrapping 65
Getting ready 67
How to do it... 67
How it works... 69
See also 70
Chapter 4: Statistical and Machine Learning Algorithms 71
Technical requirements 71
Multiple linear regression 72
Getting ready 73
How to do it... 74
How it works... 82
There's more... 83
See also 85
Logistic regression 85
Getting ready 87
How to do it... 88
How it works... 90
See also 91
Naive Bayes 91
Getting ready 93
How to do it... 94
How it works... 99
There's more... 99
See also 100
Decision trees 100
Getting ready 102
How to do it... 103
How it works... 109
There's more... 110
See also 110
Support vector machines 110
Getting ready 113
How to do it... 114
How it works... 117
Introduction 206
An ensemble of homogeneous models for energy prediction 207
Getting ready 208
How to do it... 209
How it works... 211
There's more... 213
See also 217
An ensemble of homogeneous models for handwritten digit
classification 218
Getting ready 218
How to do it... 221
How it works... 227
1
Get Closer to Your Data
In this chapter, we will cover the following recipes:
Data manipulation with Python
Analyzing, visualizing, and treating missing values
Exploratory data analysis
Introduction
In this book, we will cover various ensemble techniques and will learn how to ensemble
multiple machine learning algorithms to enhance a model's performance. We will use
pandas, NumPy, scikit-learn, and Matplotlib, all of which are Python libraries that we will
work with throughout the book. By now, you should be well aware of data
manipulation and exploration.
In this chapter, we will recap how to read and manipulate data in Python, how to analyze
and treat missing values, and how to explore data to gain deeper insights. We will use
various Python packages, such as numpy and pandas, for data manipulation and
exploration, and the seaborn package for data visualization. We will continue to use some or
all of these libraries in the later chapters of this book as well. We will also use the Anaconda
distribution for our Python coding. If you have not installed Anaconda, you need to
download it from https://www.anaconda.com/download. At the time of writing this book,
the latest version of Anaconda is 5.2, and comes with both Python 3.6 and Python 2.7. We
suggest you download Anaconda for Python 3.6. We will also use the HousePrices
dataset, which is available on GitHub.
Getting ready
We will use the os package for operating system-dependent functionality, and
the pandas package for data manipulation.
Let's now take a look at the data definitions to understand our variables. In the following
code, we list the data definition for a few variables. The dataset and the complete data
definitions are available on GitHub. Here is an abridged version of the data description file:
MS SubClass (Nominal): Identifies the type of dwelling involved in
the sale
Lot Frontage (Continuous): Linear feet of street connected to
property
Alley (Nominal): Type of alley access to property
Overall Qual (Ordinal): Rates the overall material and finish of
the house
Overall Cond (Ordinal): Rates the overall condition of the house
Year Built (Discrete): Original construction date
Mas Vnr Type (Nominal): Masonry veneer type
Mas Vnr Area (Continuous): Masonry veneer area in square feet
Garage Type (Nominal): Garage location
Garage Yr Blt (Discrete): Year garage was built
Garage Finish (Ordinal): Interior finish of the garage
Garage Cars (Discrete): Size of garage in car capacity
We will then import the os and pandas packages and set our working directory according
to our requirements, as seen in the following code block:
import os
import pandas as pd
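# Setting the working directory to the folder that contains the dataset
# (the path below is a placeholder; replace it with your own)
os.chdir('path/to/your/working/directory')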
The next step is to download the dataset from GitHub and copy it to your working
directory.
How to do it...
Now, let's perform some data manipulation steps:
1. First, we will read the data in HousePrices.csv from our current working
directory and create our first DataFrame for manipulation. We name
the DataFrame housepricesdata, as follows:
housepricesdata = pd.read_csv("HousePrices.csv")
2. Let's now take a look at our DataFrame and see how it looks:
# See first five observations from top
housepricesdata.head(5)
You might not be able to see all the rows; Jupyter will truncate some of the
variables. In order to view all of the rows and columns for any output in
Jupyter, execute the following commands:
# Setting options to display all rows and columns
pd.options.display.max_rows = None
pd.options.display.max_columns = None
3. We can see the dimensions of the DataFrame with shape. shape is an attribute of
the pandas DataFrame:
housepricesdata.shape
With the preceding command, we can see the number of rows and columns, as
follows:
(1460, 81)
Here, we can see that the DataFrame has 1460 observations and 81 columns.
4. With the dtypes attribute, we can see the datatype of each variable in the
DataFrame:
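# Listing the datatype of each variable (the output is shown below)
housepricesdata.dtypes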
Id int64
MSSubClass int64
MSZoning object
LotFrontage float64
LotArea int64
LotConfig object
LandSlope object
...
BedroomAbvGr int64
KitchenAbvGr int64
KitchenQual object
TotRmsAbvGrd int64
SaleCondition object
SalePrice int64
Length: 81, dtype: object
We're now all ready to start with our data manipulation, which we can do in
many different ways. In this section, we'll look at a few ways in which we can
manipulate and prepare our data for the purpose of analysis.
5. The describe() function will show the statistics for the numerical variables
only:
housepricesdata.describe()
6. We will remove the Id column, as this will not be necessary for our analysis:
# inplace=True will overwrite the DataFrame after dropping the Id column
housepricesdata.drop(['Id'], axis=1, inplace=True)
7. Let's now look at the distribution of some of the object type variables, that is, the
categorical variables. In the following example, we are going to look at LotShape
and LandContour. We can study the other categorical variables of the dataset in
the same way as shown in the following code block:
# Name the count column as "count"
lotshape_frequencies =
pd.crosstab(index=housepricesdata["LotShape"], columns="count")
landcountour_frequencies =
pd.crosstab(index=housepricesdata["LandContour"], columns="count")
# Name the count column as "count"
print(lotshape_frequencies)
print("\n") # to keep a blank line for display
print(landcountour_frequencies)
Prior to typecasting any variable, ensure that there are no missing values.
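8. MSSubClass is numeric in the raw data but actually represents dwelling categories, so we typecast it into a categorical variable and then count the observations in each category. The following is a sketch of this step (the astype('category') and pd.crosstab() calls are assumptions chosen to match the output shown next):

# Typecasting MSSubClass into a categorical variable
housepricesdata['MSSubClass'] = housepricesdata['MSSubClass'].astype('category')

# Counting the observations in each category of MSSubClass
print(pd.crosstab(index=housepricesdata['MSSubClass'], columns='count'))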
We can see the count of observations for each category of houses, as shown in the
following code block:
category
col_0 count
MSSubClass
20 536
30 69
40 4
45 12
50 144
60 299
70 60
75 16
80 58
85 20
90 52
120 87
160 63
180 10
190 30
There are many variables that might not be very useful by themselves, but
transforming them gives us a lot of interesting insights. Let's create some new,
meaningful variables.
9. YearBuilt and YearRemodAdd represent the original construction date and the
remodel date respectively. However, if they can be converted into age, these
variables will tell us how old the buildings are and how many years it has been
since they were remodeled. To do this, we create two new
variables, BuildingAge and RemodelAge:
# Importing datetime package for date time operations
import datetime as dt
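# A sketch of deriving the two new variables; using the current year as the
# reference point is an assumption
current_year = dt.datetime.now().year
housepricesdata['BuildingAge'] = current_year - housepricesdata['YearBuilt']
housepricesdata['RemodelAge'] = current_year - housepricesdata['YearRemodAdd']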
Variables that contain label data need to be converted into a numerical form for
machine learning algorithms to use. To achieve this, we will perform encoding, which
transforms the labels into a numerical form so that the algorithms can use them.
11. We need to identify the variables that need encoding, which include Street,
LotShape, and LandContour. We will perform one-hot encoding, which is a
representation of categorical variables as binary vectors. We will use the pandas
package in Python to do this:
# We use the get_dummies() function to one-hot encode LotShape
one_hot_encoded_variables = pd.get_dummies(housepricesdata['LotShape'], prefix='LotShape')

# Print the one-hot encoded variables to see what they look like
print(one_hot_encoded_variables)
We can see the one-hot encoded variables that have been created in the following
screenshot:

12. Next, we add the one-hot encoded variables to our DataFrame:
housepricesdata = pd.concat([housepricesdata, one_hot_encoded_variables], axis=1)
We can see the output that we get after adding the one-hot encoded variables to
the DataFrame in the following screenshot:
13. Now, let's remove the original variable, since we have already created our one-hot
encoded variables:
# Dropping the original variable after one-hot encoding it
# inplace=True will overwrite the DataFrame
housepricesdata.drop(['LotShape'], axis=1, inplace=True)
How it works...
The pandas module is one of the key Python libraries for data manipulation. We have also
used other packages, such as os and datetime. After
we set our working directory and read the CSV file into Python as a pandas DataFrame, we
moved on to looking at a few data manipulation methods.
Step 1 to Step 5 in the preceding section showed us how to read the data from a CSV file in
Python using pandas, and also how to use functions such as dtypes.
The pandas package also provides methods for reading data from various other file formats,
such as Excel, JSON, and HDF5.
You can also read HDF5 format files in Python using the h5py package. The h5py package
is a Python interface to the HDF5 binary data format. HDF® supports n-dimensional
datasets, and each element in the dataset may itself be a complex object. There is no limit on
the number or size of data objects in the collection. More info can be found at https://www.
hdfgroup.org/. A sample code block looks like this:
import h5py
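# A minimal sketch of reading a dataset with h5py; the filename and the
# dataset key below are placeholders
with h5py.File('dataset.h5', 'r') as h5_file:
    print(list(h5_file.keys()))       # list the datasets stored in the file
    data = h5_file['my_dataset'][:]   # read one dataset into memory as a NumPy array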
We look at the datatypes of the variables, and use describe() to see the summary
statistics for the numerical variables. We need to note that describe() works only for
numerical variables and is intelligent enough to ignore non-numerical variables. In Step 6,
we dropped the Id column, which is not needed for the analysis. In Step 7, we took a look
at the distribution of the LotShape and LandContour variables using pd.crosstab(); the
same code can be used to look at the distribution of the other categorical variables.
We then moved on to learning how to convert datatypes. We had a few variables that were
actually categorical, but appeared to be numerical in the dataset. This is often the case in a
real-life scenario, hence we need to learn how to typecast our variables. Step 8 showed us
how to convert a numerical variable, such as MSSubClass, into a categorical type.
In Step 9, we created new meaningful variables from existing variables. We created the new
variables, BuildingAge and RemodelAge, from YearBuilt and YearRemodAdd
respectively, to represent the age of the building and the number of years that have passed
since the buildings were remodeled. This method of creating new variables can provide
better insights into our analysis and modeling. This process of creating new features is
called feature engineering. In Step 10, we added the new variables to our DataFrame.
From there, we moved on to encoding our categorical variables. We needed to encode our
categorical variables because they have named descriptions. Many machine learning
algorithms cannot operate on labelled data because they require all input and output
variables to be numeric. In Step 11, we learned how to use the get_dummies() function,
which is a part of the pandas package, to create the one-hot encoded variables. In Step 12,
we added the one_hot_encoded_variables to our DataFrame. And finally, in Step 13, we
removed the original variable once it had been one-hot encoded.
There's more...
The types of data manipulation required depend on your business requirements. In this
first recipe, we saw a few ways to carry out data manipulation, but there is no limit to what
you can do and how you can manipulate data for analysis.
We have also seen how to convert a numerical variable into a categorical variable. We can
do this kind of typecasting in many ways. For example, we can convert a categorical
variable into a numerical variable, if required, with the following code:
# Converting a categorical variable to numerical
# Using astype() to cast a pandas object to a specified datatype
housepricesdata['GarageYrBlt'].astype('int64')
You can only convert the GarageYrBlt variable if it does not contain any missing values.
The preceding code will throw an error, since GarageYrBlt contains missing values.
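One way around this, sketched below, is to fill in the missing values first (here with 0, purely for illustration) and then perform the cast:

# Fill the missing values before casting, then convert to int64
housepricesdata['GarageYrBlt'] = housepricesdata['GarageYrBlt'].fillna(0).astype('int64')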
We have looked at how we can use one-hot encoding to convert categorical variables to
numerical variables, and why we do this. In addition to one-hot encoding, we can perform
other kinds of encoding, such as label encoding, frequency encoding, and so on. An
example code for label encoding is given in the following code block:
# We use sklearn.preprocessing and import LabelEncoder class
from sklearn.preprocessing import LabelEncoder
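# A sketch of applying LabelEncoder to a single column; LandContour is used
# here purely as an example
label_encoder = LabelEncoder()
housepricesdata['LandContour_encoded'] = label_encoder.fit_transform(housepricesdata['LandContour'].astype(str))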
See also
The pandas guide to type conversion functions (https://bit.ly/2MzFwiG)
The pandas guide to one-hot encoding using get_dummies() (https://bit.ly/
2N1xjTZ)
The scikit-learn guide to one-hot encoding (https://bit.ly/2wrNNLz)
The scikit-learn guide to label encoding (https://bit.ly/2pDddVb)
Analyzing, visualizing, and treating missing values
Missing values are caused by incomplete data. It is important to handle missing values
effectively, as they can lead to inaccurate inferences and conclusions. In this section, we will
look at how to analyze, visualize, and treat missing values.
How to do it...
Let's start by analyzing variables with missing values. Set the options in pandas to view all
rows and columns, as shown in the previous section:
1. With the following syntax, we can see which variables have missing values:
# Check which variables have missing values
columns_with_missing_values = housepricesdata.columns[housepricesdata.isnull().any()]
housepricesdata[columns_with_missing_values].isnull().sum()
2. You might also like to see the missing values in terms of percentages. To see the
count and percentage of missing values, execute the following command:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
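# A sketch of one way to build the summary: the count and percentage of
# missing values per variable, combined into a single table
total = housepricesdata.isnull().sum().sort_values(ascending=False)
percent = (housepricesdata.isnull().sum() /
           housepricesdata.isnull().count() * 100).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
print(missing_data[missing_data['Total'] > 0])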
It will show you the missing values in both absolute and percentage terms, as
shown in the following screenshot:
But there is a catch. Let's look at the Alley variable again. It shows us that it
has 93.76% missing values. Now take another look at the data description that we
looked at in the preceding section. The variable description for Alley shows that
it has three levels: gravel, paved, and no access. In the original dataset, 'No
Access' is codified as NA. When NA is read in Python, it is treated as NaN, which
means that a value is missing, so we need to be careful.
3. Now, we will replace the missing values for Alley with a valid value, such as
'No Access':
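# A sketch of the replacement: NaN in Alley (originally codified as NA,
# meaning no alley access) is replaced with a valid label
housepricesdata['Alley'].fillna('No Access', inplace=True)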
4. Now, let's visualize the missing values and try to see how we can treat
them. The following code generates a chart that showcases the spread of missing
values. Here, we use the seaborn library to plot the charts:
# Lets import seaborn. We will use seaborn to generate our charts
import seaborn as sns
The color of the map is generated with linearly increasing brightness by the
cubehelix_palette() function:
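# A sketch of the heatmap call; the figure size and palette settings are
# illustrative (missing values show up as white cells)
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 10))
sns.heatmap(housepricesdata.isnull(), yticklabels=False, cbar=False,
            cmap=sns.cubehelix_palette(light=1, reverse=True, as_cmap=True))
plt.show()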
From the preceding plot, it is easier to read the spread of the missing values. The
white marks on the chart indicate missing values. Notice that Alley no longer
reports any missing values.
5. We will impute the missing values in the LotFrontage variable with its median value:
housepricesdata['LotFrontage'].fillna(housepricesdata['LotFrontage'].median(), inplace=True)
6. Let's view the missing value plot once again to see if the missing values from
LotFrontage have been imputed. Copy and execute the preceding code. The
missing value plot will look as follows:
Here, we can see in the preceding plot that there are no more missing values for
Alley or LotFrontage.
7. We have figured out from the data description that several variables have values
that are codified as NA. Because this is read in Python as missing values, we
replace all of these with their actual values, which we get to see in the data
description shown in the following code block:
# Replacing all NA values with their original meaning
housepricesdata['BsmtQual'].fillna('No Basement', inplace=True)
housepricesdata['BsmtCond'].fillna('No Basement', inplace=True)
housepricesdata['BsmtExposure'].fillna('No Basement', inplace=True)
housepricesdata['BsmtFinType1'].fillna('No Basement', inplace=True)
housepricesdata['BsmtFinType2'].fillna('No Basement', inplace=True)
housepricesdata['GarageYrBlt'].fillna(0, inplace=True)
8. Let's take a look at the missing value plot after having treated the
preceding variables:
We notice from the preceding plot that there are no more missing values for the
variables that we have just treated. However, we are left with a few missing
values in MasVnrType, MasVnrArea, and Electrical.
10. We will then impute the missing values in MasVnrType with None and
MasVnrArea with zero. This is done with the commands shown in the following
code block:
# Filling in the missing values for MasVnrType and MasVnrArea with None and 0, respectively
housepricesdata['MasVnrType'].fillna('None', inplace=True)
housepricesdata['MasVnrArea'].fillna(0, inplace=True)
We are still left with one missing value in the Electrical variable.
11. Let's take a look at the observation where Electrical has a missing value:
housepricesdata['MSSubClass'][housepricesdata['Electrical'].isnull()]
12. We see that MSSubClass is 80 when Electrical is null. Let's see the
distribution of the Electrical type by MSSubClass:
# Using crosstab to generate the count of Electrical type by MSSubClass
print(pd.crosstab(index=housepricesdata["Electrical"],
                  columns=housepricesdata['MSSubClass'], dropna=False, margins=True))
From the following output, we can see that when MSSubClass is 80, the
majority of cases of the Electrical type are SBrkr:
13. Go ahead and impute the missing value in the Electrical variable with
SBrkr by executing the following code:
housepricesdata['Electrical'].fillna('SBrkr', inplace=True)
14. After this, let's take a look at our missing value plot for a final time:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(20, 10))
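# Regenerate the missing-value heatmap; the palette settings are illustrative,
# as in the earlier sketch
sns.heatmap(housepricesdata.isnull(), yticklabels=False, cbar=False,
            cmap=sns.cubehelix_palette(light=1, reverse=True, as_cmap=True))
plt.show()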
Notice that the plot has changed and now shows no missing values in our DataFrame.
How it works...
In Step 1 and Step 2, we looked at the variables with missing values in absolute and
percentage terms. We noticed that the Alley variable had more than 93% of its values
missing. However, from the data description, we figured out that the Alley variable had
a No Access to Alley value, which is codified as NA in the dataset. When this value was
read in Python, all instances of NA were treated as missing values. In Step 3, we replaced the
NA in Alley with No Access.
In Step 4, we used the seaborn library to plot the missing value chart. In this chart, we
identified the variables that had missing values. The missing values were denoted in white,
while the presence of data was denoted in color. We noticed from the chart that Alley had
no more missing values.
In Step 5, we noticed that one of the numerical variables, LotFrontage, had more than 17%
of its values missing. We decided to impute the missing values with the median of this
variable. We revisited the missing value chart in Step 6 to see whether the variables were
left with any missing values. We noticed that Alley and LotFrontage showed no white
marks, indicating that neither of the two variables had any further missing values.
In Step 7, we identified a handful of variables that had data codified with NA. This caused
the same problem we encountered previously, as Python treated them as missing values.
We replaced all such codified values with actual information.
We then revisited the missing value chart in Step 8. We saw that almost all the variables
then had no missing values, except for MasVnrType, MasVnrArea, and Electrical.
In Steps 9 and 10, we filled in the missing values for the MasVnrType and MasVnrArea
variables with None and 0, respectively.
In Step 11, we looked at what type of house was missing the Electrical value. We noticed
that MSSubClass denoted the dwelling type and, for the missing Electrical value, the
MSSubClass was 80, which meant it was split or multi-level. In Step 12, we checked the
distribution of Electrical by the dwelling type, which was MSSubClass. We noticed that
when MSSubClass equals 80, the majority of the values of Electrical are SBrkr, which
stands for standard circuit breakers and Romex. For this reason, we decided to impute the
missing value in Electrical with SBrkr.
Finally, in Step 14, we again revisited the missing value chart and saw that there were no
more missing values in the dataset.
There's more...
Using the preceding plots and missing value charts, it was easy to figure out the count,
percentage, and spread of missing values in the datasets. We noticed that many variables
had missing values for the same observations. However, after consulting the data
description, we saw that most of the missing values were actually not missing, but since
they were codified as NA, pandas treated them as missing values.
It is very important for data analysts to understand data descriptions and treat the missing
values appropriately.
Missing values generally fall into the following categories:

Missing completely at random (MCAR): MCAR denotes that the missing values
have nothing to do with the object being studied. In other words, data is
MCAR when the probability of missing data on a variable is not related to other
measured variables or to the values themselves. An example of this could be, for
instance, the age of certain respondents to a survey not being recorded, purely by
chance.
Missing at random (MAR): The name MAR is a little misleading here because
the absence of values is not random in this case. Data is MAR if its absence is
related to other observed variables, but not to the underlying values of the data
itself. For example, when we collect data from customers, rich customers are less
likely to disclose their income than their other counterparts, resulting in MAR
data.
There are various strategies that can be applied to impute missing values.
See also
The scikit-learn module for imputation (https://bit.ly/2MzFwiG)
Multiple imputation by chained equations using the StatsModels library in
Python (https://bit.ly/2PYLuYy)
Feature imputation algorithms using fancyimpute (https://bit.ly/2MJKfOY)
How to do it...
1. In the first section on data manipulation, we saw the summary statistics for our
datasets. However, we have not looked at this since imputing the missing values.
Let's now look at the data and its basic statistics using the following code:
# To take a look at the top 5 rows in the dataset
housepricesdata.head(5)
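The basic statistics mentioned above can be obtained with describe(); a minimal sketch:

# Summary statistics of the numerical variables in the dataset
housepricesdata.describe()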
2. With the preceding code, we can see the summary statistics of the variables in the
earlier section.
The following code shows us how many variables there are for each datatype. We
can see that we have 3 float-type variables, 33 integer-type variables, 45 object-
type variables, and 4 unsigned integers that hold the one-hot encoded values for
the LotShape variable:
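The counting code itself is not reproduced here; one way of producing these counts is a sketch along these lines:

# Count the number of variables of each datatype
housepricesdata.dtypes.value_counts()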
3. Let's create two variables to hold the names of the numerical and categorical
variables:
# Pulling out names of numerical variables by conditioning dtypes NOT equal to object type
numerical_features = housepricesdata.dtypes[housepricesdata.dtypes != "object"].index
print("Number of Numerical features: ", len(numerical_features))
This shows us the number of numerical and categorical variables:
We use the melt() method from pandas to reshape our DataFrame. You
may want to view the reshaped data after using the melt() method to
understand how the DataFrame is arranged.
melt_num_features = pd.melt(housepricesdata, value_vars=numerical_features)
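The FacetGrid() call that lays out the distribution plots is not reproduced here; a minimal sketch, assuming seaborn has been imported as sns (distplot was the distribution-plot function available at the time of writing):

# Sketch (assumed): one distribution plot per melted numerical variable
grid = sns.FacetGrid(melt_num_features, col="variable", col_wrap=5,
                     sharex=False, sharey=False)
grid = grid.map(sns.distplot, "value")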
for ax in grid.axes.flat:
    plt.setp(ax.get_xticklabels(), rotation=90)
In our dataset, we see that various attributes are present that can drive house prices. We can try to see the relationship between these attributes and the SalePrice variable, which indicates the prices of the houses.
Let's see the distribution of the house sale prices by each categorical variable in the following plots:
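The plotting code for this step is not reproduced here; a sketch that reuses the melt() and FacetGrid() pattern, with the categorical_features index from the earlier sketch as an assumption:

# Sketch (assumed): SalePrice distribution for each categorical variable
melt_cat_features = pd.melt(housepricesdata, id_vars=['SalePrice'],
                            value_vars=categorical_features)
grid_cat = sns.FacetGrid(melt_cat_features, col="variable", col_wrap=3,
                         sharex=False, sharey=False)
grid_cat = grid_cat.map(sns.boxplot, "value", "SalePrice")
for ax in grid_cat.axes.flat:
    plt.setp(ax.get_xticklabels(), rotation=90)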
6. We will now take a look at the correlation matrix for all numerical variables
using the following code:
# Generate a correlation matrix for all the numerical variables
corr=housepricesdata[numerical_features].corr()
print(corr)
It might be tough to view the correlations displayed in the preceding format. You
might want to take a look at the correlations graphically.
7. We can also view the correlation matrix plot for the numerical variables. In order
to do this, we use the numerical_features variable that we created in Step 3 to
hold the names of all the numerical variables:
# Get correlation of numerical variables
df_numerical_features = housepricesdata.select_dtypes(include=[np.number])
correlation = df_numerical_features.corr()
correlation["SalePrice"].sort_values(ascending=False)*100

# Correlation Heat Map (Seaborn library)
f, ax = plt.subplots(figsize=(14,14))
plt.title("Correlation of Numerical Features with Sale Price", y=1, size=20)
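The heatmap() call itself is not reproduced on this page; a minimal sketch that completes the step on the axes created above:

# Sketch (assumed): draw the correlation heatmap
sns.heatmap(correlation, square=True, vmax=0.8, ax=ax)
plt.show()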
8. You may also want to evaluate the correlation of your numerical variables with
SalePrice to see how these numerical variables are related to the prices of the
houses:
row_count = 11
col_count = 3
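The loop that draws the scatter plots is not reproduced here; a minimal sketch, assuming an 11-by-3 grid of regplot() panels with the regression line suppressed via fit_reg=False, as described in the How it works... section:

# Sketch (assumed): scatter plot of each numerical variable against SalePrice
fig, axs = plt.subplots(row_count, col_count, figsize=(15, 45))
for i, feature in enumerate(numerical_features[:row_count * col_count]):
    row, col = divmod(i, col_count)
    sns.regplot(x=housepricesdata[feature], y=housepricesdata['SalePrice'],
                fit_reg=False, ax=axs[row][col])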
The following screenshot shows us the correlation plots. Here, we plot each of the numerical variables against SalePrice:
9. If you want to evaluate the correlation of your numerical variables with the sale
prices of the houses numerically, you can use the following commands:
# See correlation between numerical variables with house prices
corr=housepricesdata.corr()["SalePrice"]
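The sorting step described in the How it works... section is not reproduced here; a one-line sketch:

# Sort the correlations with SalePrice in descending order
corr.sort_values()[::-1]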
You can view the correlation output sorted in a descending manner in the
following table:
How it works...
In Step 1, we started by reading and describing our data. This step provided us with
summary statistics for our dataset. We looked at the number of variables for each datatype
in Step 2.
In Step 4 and Step 5, we used the seaborn library to plot our charts. We also introduced the
melt() function from pandas, which can be used to reshape our DataFrame and feed it to
the FacetGrid() function of the seaborn library. Here, we showed how you can draw the
distribution plots for all the numerical variables in one go. We also showed you how
to use the same FacetGrid() function to plot the distribution of SalePrice by each
categorical variable.
We generated the correlation matrix in Step 6 using the corr() function of the DataFrame
object. However, we noticed that with too many variables, the display does not make it
easy for you to identify the correlations. In Step 7, we plotted the correlation matrix
heatmap by using the heatmap() function from the seaborn library.
In Step 8, we saw how the numerical variables correlated with the sale prices of houses
using a scatter plot matrix. We generated the scatter plot matrix using the regplot()
function from the seaborn library. Note that we used a parameter, fit_reg=False, to
remove the regression line from the scatter plots.
In Step 9, we repeated Step 8 to see the relationship of the numerical variables with the sale
prices of the houses in a numerical format, instead of scatter plots. We also sorted the
output in descending order by applying a [::-1] slice to the result of the corr() function.
There's more...
We have seen a few ways to explore data, both statistically and visually. There are quite a
few libraries in Python that you can use to visualize your data. One of the most widely used
of these is ggplot. Before we look at a few commands, let's learn how ggplot works.
There are seven layers of grammatical elements in ggplot, of which the first three are mandatory:
Data
Aesthetics
Geometrics
Facets
Statistics
Coordinates
Theme
You will often start by providing a dataset to ggplot(). Then, you provide an aesthetic
mapping with the aes() function to map the variables to the x and y axes. With aes(), you
can also set the color, size, shape, and position of the charts. You then add the type of
geometric shape you want with functions such as geom_point() or geom_histogram().
You can also add various options, such as plotting statistical summaries, faceting, visual
themes, and coordinate systems.
The following code is an extension to what we have used already in this chapter, so we will
directly delve into the ggplot code here:
f = pd.melt(housepricesdata, id_vars=['SalePrice'], value_vars=numerical_features[0:9])
ggplot(f, aes('value', 'SalePrice')) + geom_point(color='orange') + facet_wrap('variable', scales='free')
Similarly, in order to view the density plot for the numerical variables, we can execute the
following code:
f_1 = pd.melt(housepricesdata, value_vars=numerical_features[0:9])
ggplot(f_1, aes('value')) + geom_density(color="red") + facet_wrap('variable', scales='free')
The plot shows us the univariate density plot for each of our numerical variables. The
geom_density() function computes and draws a kernel density estimate, which is a smoothed
version of the histogram.
See also
The guide to the seaborn library (https://bit.ly/2iU2aRU)
2
Getting Started with Ensemble Machine Learning
In this chapter, we'll cover the following recipes:
Max-voting
Averaging
Weighted averaging
Ensemble models are known for providing an advantage over single models in terms of
performance. They can be applied to both regression and classification problems. You can
either decide to build ensemble models with algorithms from the same family or opt to pick
them from different families. If multiple models are built on the same dataset using neural
networks only, then that ensemble would be called a homogeneous ensemble model. If
multiple models are built using different algorithms, such as support vector machines
(SVMs), neural networks, and random forests, then the ensemble model would be called a
heterogeneous ensemble model.
1. Base learners are learners that are designed and fit on training data
2. The base learners are combined to form a single prediction model by using
specific ensembling techniques such as max-voting, averaging, and weighted
averaging
However, to get an ensemble model that performs well, the base learners themselves need to be as accurate and as diverse as possible.
To perform well, the ensemble models require a sufficient amount of data. Ensemble
techniques prove to be more useful when you have large and non-linear datasets.
Irrespective of how well you fine-tune your models, there's always the risk of high bias or
high variance. Even the best model can fail if the bias and variance aren't taken into account
while training the model. Both bias and variance represent a kind of error in the
predictions. In fact, the total error is comprised of bias-related error, variance-related error,
and unavoidable noise-related error (or irreducible error). The noise-related error is mainly
due to noise in the training data and can't be removed. However, the errors due to bias and
variance can be reduced.
A measure such as the mean square error (MSE) captures all of these errors for a continuous
target variable and can be represented as follows:

$MSE = E[(Y - \hat{Y})^2]$

In this formula, E stands for the expected mean, Y represents the actual target values, and
$\hat{Y}$ represents the predicted values for the target variable. The MSE can be broken down into its
components, bias, variance, and noise, as shown in the following formula:

$MSE = \mathrm{Bias}(\hat{Y})^2 + \mathrm{Var}(\hat{Y}) + \sigma^2$

Here, $\sigma^2$ denotes the irreducible noise-related error.
While bias refers to how close the expected value of our estimate is to the ground truth,
variance measures the deviation of the estimate from its own expected value.
Estimators with a small MSE are desirable. In order to minimize the MSE, we
would like our estimate to be centered on the ground truth (zero bias) and to have a low
deviation (low variance) from the ground truth (correct) value. In other words, we'd like to be
confident (low variance, low uncertainty, a more peaked distribution) about the value of our estimate. High
bias degrades the performance of the algorithm on the training dataset and leads to
underfitting. High variance, on the other hand, is characterized by low training errors and
high validation errors. Having high variance reduces the performance of the learners on
unseen data, leading to overfitting.
Max-voting
Max-voting, which is generally used for classification problems, is one of the simplest ways
of combining predictions from multiple machine learning algorithms.
In max-voting, each base model makes a prediction and votes for each sample. Only the
class with the highest number of votes is included in the final prediction.
For example, let's say we have an online survey, in which consumers answer a question in a
five-level Likert scale. We can assume that a few consumers will provide a rating of five,
while others will provide a rating of four, and so on. If a majority, say more than 50% of the
consumers, provide a rating of four, then the final rating is taken as four. In this example,
taking the final rating as four is similar to taking a mode for all of the ratings.
Getting ready
In the following steps, we will import the required packages and read in the data.
To start with, import the os and pandas packages and set your working directory
according to your requirements:
# import required packages
import os
import pandas as pd
Download the Cryotherapy.csv dataset from GitHub and copy it to your working
directory. Read the dataset:
df_cryotherapydata = pd.read_csv("Cryotherapy.csv")
We can see that the data has been read properly and has the Result_of_Treatment class
variable. We then move on to creating models with Result_of_Treatment as the
response variable.
How to do it...
You can create a voting ensemble model for a classification problem using the
VotingClassifier class from Python's scikit-learn library. The following steps
showcase an example of how to combine the predictions of the decision tree, SVMs, and
logistic regression models for a classification problem:
1. Import the required libraries for building the decision tree, SVM, and logistic
regression models. We also import VotingClassifier for max-voting:
# Import required libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
2. We then move on to building our feature set and creating our train and test
datasets:
# We create train & test sample from our dataset
from sklearn.cross_validation import train_test_split
'Area']
X = df_cryotherapydata[feature_columns]
Y = df_cryotherapydata['Result_of_Treatment']
3. We build our models with the decision tree, SVM, and logistic
regression algorithms:
# create the sub models
estimators = []
dt_model = DecisionTreeClassifier(random_state=1)
estimators.append(('DecisionTree', dt_model))
svm_model = SVC(random_state=1)
estimators.append(('SupportVector', svm_model))
logit_model = LogisticRegression(random_state=1)
estimators.append(('Logistic Regression', logit_model))
We can then see the accuracy score of each of the individual base learners:
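A minimal sketch of these two steps (fitting and scoring the individual learners, then building the max-voting ensemble with VotingClassifier), assuming the estimators list from Step 3 and the train/test split from Step 2:

from sklearn.metrics import accuracy_score

# Fit each base learner and report its individual accuracy on the test set
for label, model in estimators:
    model.fit(X_train, Y_train)
    print(label, accuracy_score(Y_test, model.predict(X_test)))

# Combine the base learners with hard (majority) voting
ensemble_model = VotingClassifier(estimators=estimators, voting='hard')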
ensemble_model.fit(X_train, Y_train)
predicted_labels = ensemble_model.predict(X_test)
# Accuracy of the max-voting ensemble on the test set
print("Hard Voting accuracy: ", accuracy_score(Y_test, predicted_labels))
We can see the accuracy score of the ensemble model using Hard Voting:
How it works...
VotingClassifier implements two types of voting—hard and soft voting. In hard
voting, the final class label is predicted as the class label that has been predicted most
frequently by the classification models. In other words, the predictions from all classifiers
are aggregated to predict the class that gets the most votes. In simple terms, it takes the
mode of the predicted class labels.
In hard voting, the predicted class label $\hat{y}$ for an observation is the label that receives the
majority of the votes from the individual classifiers $C_i$, where i = 1, ..., n:

$\hat{y} = mode\{C_1(x), C_2(x), ..., C_n(x)\}$

As shown in the previous section, we have three models, one from the decision tree, one
from the SVMs, and one from logistic regression. Let's say that the models classify a
training observation as class 1, class 0, and class 1 respectively. Then, with majority voting,
we have the following:

$\hat{y} = mode\{1, 0, 1\} = 1$
In the preceding section, in Step 1, we imported the required libraries to build our
models. In Step 2, we created our feature set. We also split our data to create the training
and testing samples. In Step 3, we trained three models with the decision tree, SVMs, and
logistic regression respectively. In Step 4, we looked at the accuracy score of each of the base
learners, while in Step 5, we ensembled the models using VotingClassifier() and
looked at the accuracy score of the ensemble model.
There's more...
Many classifiers can estimate class probabilities. In this case, the class labels are predicted
by averaging the class probabilities. This is called soft voting and is recommended for an
ensemble of well-tuned classifiers.
# create the sub models
estimators = []
dt_model = DecisionTreeClassifier(random_state=1)
estimators.append(('DecisionTree', dt_model))
svm_model = SVC(random_state=1, probability=True)
estimators.append(('SupportVector', svm_model))
logit_model = LogisticRegression(random_state=1)
estimators.append(('Logistic Regression', logit_model))
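The construction and evaluation of the soft-voting ensemble is not reproduced on this page; a minimal sketch, assuming the same training and testing sets as in the previous recipe:

from sklearn.metrics import accuracy_score

# Fit each base learner and report its individual accuracy
for label, model in estimators:
    model.fit(X_train, Y_train)
    print(label, accuracy_score(Y_test, model.predict(X_test)))

# Average the predicted class probabilities across the base learners
soft_voting_model = VotingClassifier(estimators=estimators, voting='soft')
soft_voting_model.fit(X_train, Y_train)
print("Soft Voting accuracy: ",
      accuracy_score(Y_test, soft_voting_model.predict(X_test)))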
We get to see the accuracy from individual learners and the ensemble learner using soft
voting:
The SVC class can't estimate class probabilities by default, so we've set its
probability hyper-parameter to True in the preceding code. With
probability=True, SVC will be able to estimate class probabilities.
Averaging
Averaging is usually used for regression problems, or it can be used while estimating the
probabilities in classification tasks. Predictions are extracted from multiple models, and an
average of the predictions is used to make the final prediction.
Getting ready
Let us get ready to build multiple learners and see how to implement averaging:
Download the whitewines.csv dataset from GitHub and copy it to your working
directory, and let's read the dataset:
df_winedata = pd.read_csv("whitewines.csv")
In the following screenshot, we can see that the data has been read properly:
How to do it...
We have a dataset that is based on the properties of wines. Using this dataset, we'll build
multiple regression models with quality as our response variable. With multiple
learners, we extract multiple predictions. The averaging technique then takes the average
of all of the predicted values for each sample:
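Steps 1 to 3 are not reproduced here; a minimal sketch of the imports, the feature/response split, and the train/test split, assuming the response column is named quality and that the split proportions mirror the other recipes:

# Sketch (assumed) of Steps 1-3
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Separate the features from the response variable, quality
X = df_winedata.drop(['quality'], axis=1)
Y = df_winedata['quality']

# Create the training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=1)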
4. Build the base regression learners using linear regression, SVR, and a decision
tree:
# Build base learners
linreg_model = LinearRegression()
svr_model = SVR()
regressiontree_model = DecisionTreeRegressor()
# Fit the base learners on the training data (this fitting step is assumed here)
linreg_model.fit(X_train, Y_train)
svr_model.fit(X_train, Y_train)
regressiontree_model.fit(X_train, Y_train)
5. Use the base learners to make a prediction based on the test data:
linreg_predictions = linreg_model.predict(X_test)
svr_predictions = svr_model.predict(X_test)
regtree_predictions = regressiontree_model.predict(X_test)
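Step 6, which averages the three sets of predictions, is not reproduced on this page; a one-line sketch:

# Average the predictions from the three base learners
average_predictions = (linreg_predictions + svr_predictions + regtree_predictions) / 3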
How it works...
In Step 1, we imported the required packages. In Step 2, we separated the feature set and the
response variable from our dataset. We split our dataset into training and testing samples
in Step 3.
Note that our response variable is continuous in nature. For this reason, we built our
regression base learners in Step 4 using linear regression, SVR, and a decision tree. In Step 5,
we passed our test dataset to the predict() function to predict our response variable. And
finally, in Step 6, we added all of the predictions together and divided them by the number
of base learners, which is three in our example.
Weighted averaging
Like averaging, weighted averaging is also used for regression tasks. Alternatively, it can
be used while estimating probabilities in classification problems. Base learners are assigned
different weights, which represent the importance of each model in the prediction.
Getting ready
Download the wisc_bc_data.csv dataset from GitHub and copy it to your working
directory. Let's read the dataset:
df_cancerdata = pd.read_csv("wisc_bc_data.csv")
How to do it...
Here, we have a dataset based on the properties of cancerous tumors. Using this dataset,
we'll build multiple classification models with diagnosis as our response variable. The
diagnosis variable has the values, B and M, which indicate whether the tumor is benign
or malignant. With multiple learners, we extract multiple predictions. The weighted
averaging technique takes a weighted average of all of the predicted values for each
sample.
In this example, we consider the predicted probabilities as the output and use
the predict_proba() function of the scikit-learn algorithms to predict the class
probabilities:
3. We'll then split our data into training and testing sets:
# Create train & test sets
X_train, X_test, Y_train, Y_test = \
train_test_split(X, Y, test_size=0.20, random_state=1)
4. We then build our base classifiers:
# create the sub models
estimators = []
dt_model = DecisionTreeClassifier()
estimators.append(('DecisionTree', dt_model))
svm_model = SVC(probability=True)
estimators.append(('SupportVector', svm_model))
logit_model = LogisticRegression()
estimators.append(('Logistic Regression', logit_model))
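Steps 5 and 6, which fit the models and extract the class probabilities, are not reproduced here; a minimal sketch, assuming the prediction variable names used in Step 7:

# Step 5 (sketch): fit each base learner on the training data
for _, model in estimators:
    model.fit(X_train, Y_train)

# Step 6 (sketch): predicted class probabilities for the test observations
dt_predictions = dt_model.predict_proba(X_test)
svm_predictions = svm_model.predict_proba(X_test)
logit_predictions = logit_model.predict_proba(X_test)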
7. Assign different weights to each of the models to get our final predictions:
weighted_average_predictions=(dt_predictions * 0.3 +
svm_predictions * 0.4 + logit_predictions * 0.3)
How it works...
In Step 1, we imported the libraries that are required to build our models. In Step 2, we
created the response and feature sets. We retrieved our feature set using the iloc()
function of the pandas DataFrame. In Step 3, we split our dataset into training and testing
sets. In Step 4, we built our base classifiers. Kindly note that we passed probability=True
to our SVC function to allow SVC() to return class probabilities. In the SVC class, the default
is probability=False.
In Step 5, we fitted our model to the training data. We used the predict_proba() function
in Step 6 to predict the class probabilities for our test observations.
Finally, in Step 7, we assigned different weights to each of our models to estimate the
weighted average predictions. The question that comes up is how to choose the weights.
One way is to sample the weights uniformly at random, make sure they normalize to one,
validate the resulting ensemble on the test set, and repeat this process while keeping track
of the weights that provide the highest accuracy. This is an example of a random search.
See also
The following are the scikit reference links:
3
Resampling Methods
In this chapter, we will be introduced to the fundamental concept of sampling. We'll also
learn about resampling and why it's important.
Sampling is the process of selecting a subset of observations from the population with the
purpose of estimating some parameters about the whole population. Resampling methods,
on the other hand, are used to improve the estimates of the population parameters.
Introduction to sampling
k-fold and leave-one-out cross-validation
Bootstrap sampling
Introduction to sampling
Sampling techniques can be broadly classified into non-probability sampling techniques
and probability sampling techniques. Non-probability sampling techniques are based on
the judgement of the user, whereas in probability sampling, the observations are selected
by chance.
Probability sampling most often includes simple random sampling (SRS), stratified
sampling, and systematic sampling:
SRS: In SRS, each observation in the population has an equal probability of being
chosen for the sample.
Stratified sampling: In stratified sampling, the population data is divided into
separate groups, called strata. A probability sample is then drawn from each
group.
Systematic sampling: In systematic sampling, observations are selected from the
population at regular intervals, for example, every kth record from an ordered list.
If the sample is too small or too large, it may lead to incorrect findings. For
this reason, it's important that we've got the right sample size. A well-
designed sample can help identify the biasing factors that can skew the
accuracy and reliability of the expected outcome.
Errors might be introduced to our samples for a variety of reasons. An error might occur
due to random sampling, for example, which is known as a sampling error, or because the
method of drawing observations causes the samples to be skewed, which is known as
sample bias.
Getting ready
In Chapter 1, Get Closer to your Data, we manipulated and prepared the data from the
HousePrices.csv file and dealt with the missing values. In this example, we're going to
use the final dataset to demonstrate these sampling and resampling techniques.
We'll import the required libraries. We'll read the data and take a look at the dimensions of
our dataset:
# import os for operating system dependent functionalities
import os
# import pandas for data manipulation (assumed; pd is used below)
import pandas as pd
Let's read our data. We'll prefix the DataFrame name with df_ to make it easier to
understand:
df_housingdata = pd.read_csv("Final_HousePrices.csv")
How to do it...
Now that we have read our dataset, let's look at how to do the sampling:
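Steps 1 to 3 are not reproduced here; a minimal sketch of them, based on the description in the How it works... section and assuming SalePrice is the response variable:

from sklearn.model_selection import train_test_split

# Steps 1-2 (sketch): dimensions and missing-value check
print(df_housingdata.shape)
print(df_housingdata.isnull().sum().sum())

# Step 3 (sketch): separate the features from the response variable
X = df_housingdata.drop(['SalePrice'], axis=1)
Y = df_housingdata['SalePrice']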
4. We split both our predictor and our response datasets into training and testing
subsets using train_test_split():
# Create train & test sets
5. We can find the number of observations and columns in each subset as follows:
print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)
We can see that 70% of the data has been allocated to the training dataset and 30% has been
allocated to the testing dataset:
How it works...
In Step 1 and Step 2, we looked at the dimensions of our DataFrame and found that our
dataset had no missing values. In Step 3, we separated out the features and the response
variable. In Step 4, we used the train_test_split() function from
sklearn.model_selection to split our data and create the training and testing subsets.
Notice that we passed two parameters, train_size and test_size, and set the values to
0.7 and 0.3, respectively. train_size and test_size can take values between 0.0 and
1.0, which represent the proportion of the dataset allocated to each. If an integer value is
provided, the number represents the absolute number of observations.
In Step 5, we looked at the shape of the subsets that were created by the
train_test_split() function.
There's more...
In this example, we're going to use a dataset in which we measure a dichotomous
categorical target variable. It's important to understand that the distribution of both classes
of our target variable is similar in both the training and testing subsets:
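The code that reads this dataset is not reproduced here; a sketch, with the filename being an assumption:

# Sketch (assumed filename): read the credit card default dataset
df_creditcarddata = pd.read_csv("UCI_Credit_Card.csv")
print(df_creditcarddata.shape)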
We have 30,000 observations with 25 variables. The last variable, the default
payment next month, is our target variable, which has values that are either 0 or
1.
2. We separate our data into a feature set and the response variable and split it into
training and testing subsets using the following code:
# create feature & response set
X = df_creditcarddata.iloc[:,0:24]
Y = df_creditcarddata['default payment next month']
We can now see the distribution of our dichotomous class in our target variable for both the
training and testing subsets:
print(pd.value_counts(Y_train.values)*100/Y_train.shape)
print(pd.value_counts(Y_test.values)*100/Y_test.shape)
In the following output, we can see that the distributions of both the classes are the same in
both subsets:
See also
The scikit-learn guide to sklearn.model_selection: https://bit.ly/2px08Ii
k-fold and leave-one-out cross-validation
The simplest kind of cross-validation is the holdout method, which we saw in the previous
recipe, Introduction to sampling. In the holdout method, when we split our data into training
and testing subsets, there's a possibility that the testing set isn't that similar to the training
set because of the high dimensionality of the data. This can lead to instability in the
outcome. For this reason, it's very important that we sample our data efficiently. We can
solve this problem using other cross-validation methods such as leave-one-out cross-
validation (LOOCV) or k-fold cross-validation (k-fold CV).
k-fold CV is a widely used approach that's used for estimating test errors. The original
dataset with N observations is divided into K subsets and the holdout method is
repeated K times. In each iteration, K-1 subsets are used as the training set and the rest are
used as the testing set. The error is calculated as follows:

$CV_{(K)} = \frac{1}{K}\sum_{i=1}^{K} MSE_i$
In LOOCV, the number of subsets K is equal to the number of observations in the dataset,
N. LOOCV uses one observation from the original dataset as the validation set and the
remaining N-1 observations as the training set. This is iterated N times, so that each
observation in the sample is used as the validation data in each iteration. This is the same as
k-fold CV, in which K equals N, the number of data points in the set. LOOCV usually takes
a lot of computational power because of the large number of iterations required.
In LOOCV, the estimates from each fold are highly correlated and their
average can have a high level of variance.
Estimating the test error is based on a single observation and is represented as
$MSE_i = (y_i - \hat{y}_i)^2$. We can compute the average of the MSEs for all the folds as follows:

$CV_{(N)} = \frac{1}{N}\sum_{i=1}^{N} MSE_i$

This calculation is no different from the calculation involved in k-fold CV. We'll use scikit-learn
libraries to see how techniques such as k-fold CV and LOOCV can be implemented.
Getting ready
In the following code block, we can see how we can import the required libraries:
import pandas as pd
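Only the pandas import is visible here; the remaining imports used in this recipe are assumed to be along these lines:

# Sketch (assumed): remaining imports for the k-fold CV and LOOCV examples
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold, LeaveOneOut
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error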
We read our data and split the features and the response variable:
# Let's read our data.
df_autodata = pd.read_csv("autompg.csv")
X = df_autodata.iloc[:,1:8]
Y = df_autodata.iloc[:,0]
X=np.array(X)
Y=np.array(Y)
How to do it...
The k-folds cross-validator provides us with the train and test indices to split the data into
training and testing subsets:
1. We'll split the dataset into K consecutive folds (without shuffling by default) with
K=10:
kfoldcv = KFold(n_splits=10)
kf_ytests = []
kf_predictedvalues = []
mean_mse = 0.0
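The cross-validation loop itself is not reproduced on this page; a sketch that mirrors the LOOCV loop shown later in this recipe:

# Sketch (assumed): fit and evaluate a linear regression on each of the 10 folds
for train_index, test_index in kfoldcv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    Y_train, Y_test = Y[train_index], Y[test_index]

    model = LinearRegression()
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_test)

    kf_ytests += list(Y_test)
    kf_predictedvalues += list(Y_pred)

    mse = mean_squared_error(kf_ytests, kf_predictedvalues)
    r2score = r2_score(kf_ytests, kf_predictedvalues)
    print("R^2: {:.2f}, MSE: {:.2f}".format(r2score, mse))
    mean_mse += mse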
2. We can look at our coefficient of determination using r2_score() and the mean
squared error using mse():
print("Average CV Score :" ,mean_mse/10)
3. We plot the predicted values against the actual values of the response variable:
## Let us plot the model
plt.scatter(kf_ytests, kf_predictedvalues)
plt.xlabel('Reported mpg')
plt.ylabel('Predicted mpg')
How it works...
In Step 1, the k-fold cross validator splits the dataset into K consecutive folds with K=10. The
k-fold cross-validator provides us with the train and test indices and then splits the data
into training and testing subsets. In Step 2, we looked at the coefficient of determination
using r2_score() and the mean squared error using mse(). The coefficient of
determination and the mean squared error are 79% and 12.85, respectively. In Step 3, we
plotted the predicted values against the actual values of the response variable, mpg.
There's more...
We'll now do the same exercise with LOOCV by using LeaveOneOut from
sklearn.model_selection:
1. We'll read our data once again and split it into the features and response sets:
# Let's read our data.
df_autodata = pd.read_csv("autompg.csv")
X = df_autodata.iloc[:,1:8]
Y = df_autodata.iloc[:,0]
X=np.array(X)
Y=np.array(Y)
loo_ytests = []
loo_predictedvalues = []
mean_mse = 0.0
# Assumed loop setup: iterate over the leave-one-out splits
loocv = LeaveOneOut()
for train_index, test_index in loocv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    Y_train, Y_test = Y[train_index], Y[test_index]

    model = LinearRegression()
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_test)

    # there is only one y-test and y-pred per iteration over the loo.split,
    # so we append them to the respective lists.
    loo_ytests += list(Y_test)
    loo_predictedvalues += list(Y_pred)

    mse = mean_squared_error(loo_ytests, loo_predictedvalues)
    r2score = r2_score(loo_ytests, loo_predictedvalues)
    print("R^2: {:.2f}, MSE: {:.2f}".format(r2score, mse))
    mean_mse += mse
3. We can look at our coefficient of determination using r2_score() and the mean
squared error using mse():
print("Average CV Score :" ,mean_mse/X.shape[0])
We can take a look at the coefficient of determination, and the mean squared error
for the LOOCV results:
4. We can plot the predicted values against the actual values of the response
variable:
## Let us plot the model
plt.scatter(loo_ytests, loo_predictedvalues)
plt.xlabel('Reported mpg')
plt.ylabel('Predicted mpg')
The plot that is generated by the preceding code gives us the following output:
In LOOCV, there is no randomness in the splitting method, so it'll always provide you with
the same result.
The stratified k-fold CV method is often used in classification problems. This is a variation
of the k-fold CV method that returns stratified folds: each fold contains approximately the
same percentage of samples of each target class as the original dataset. StratifiedShuffleSplit
is a variation of shuffle splits that creates splits by maintaining the same percentage of
samples for each class in every split.
See also
The scikit-learn guide to other methods of cross-validation: https://bit.ly/2px08Ii
Bootstrapping
Bootstrapping is based on the jackknife method, which was proposed by Quenouille in
1949, and then refined by Tukey in 1958. The jackknife method is used for testing
hypotheses and estimating confidence intervals. It's obtained by calculating the
estimate after leaving out each observation and then computing the average of these
calculations. With a sample of size N, the jackknife estimate can be found by aggregating
the estimates of every N-1 sized sub-sample. It's similar to bootstrap samples, but while the
bootstrap method is sampling with replacement, the jackknife method samples the data
without replacement.
"The essence of bootstrapping is the idea that in the absence of any other knowledge about a
population, the distribution of values found in a random sample of size n from the population is
the best guide to the distribution in the population. Therefore to approximate what would
happen if the population was resampled, it's sensible to resample the sample. In other words, the
infinite population that consists of the n observed sample values, each with probability 1/n, is
used to model the unknown real population."
–Bryan F. J. Manly
As we can see in the preceding diagram, some of the data points in the S1 subset also
appear in S2 and S4.
Let's say that we have n bootstrap samples from our original sample. $\hat{\theta}_i$ denotes the
estimate from the i-th bootstrap sample, where i = 1, 2, 3, ..., n. If $\hat{\theta}$ denotes the estimate of the
parameter for the original sample, the standard error for $\hat{\theta}$ is given as follows:

$SE(\hat{\theta}) = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(\hat{\theta}_i - \bar{\theta}\right)^2}$

where $\bar{\theta}$ is given as follows:

$\bar{\theta} = \frac{1}{n}\sum_{i=1}^{n}\hat{\theta}_i$
Getting ready
We need to import the required libraries as usual. This time, we will use the resample
function from sklearn.utils, which we've not used previously:
import pandas as pd
import numpy as np
from sklearn.utils import resample
We load our data and fill in the missing values with the median for the horsepower
variable. We also drop the carname variable:
# Let's read our data. We prefix the data frame name with "df_" for easier understanding.
df_autodata = pd.read_csv("autompg.csv")
df_autodata['horsepower'].fillna(df_autodata['horsepower'].median(), inplace=True)
df_autodata.drop(['carname'], axis=1, inplace=True)
How to do it...
Now that we have read our data, let's see how we can perform bootstrap sampling:
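Step 1, which defines the custom function, is not reproduced here; a minimal sketch of it, based on the description in the How it works... section (the use of global variables is an assumption so that Step 2 can run unchanged):

from sklearn.utils import resample

def create_bootstrap_oob(df):
    global df_bootstrap_sample, df_oob_sample
    # Bootstrap sample: 100 observations drawn with replacement
    df_bootstrap_sample = resample(df, replace=True, n_samples=100)
    # Out-of-bag sample: observations never drawn into the bootstrap sample
    df_oob_sample = df.loc[~df.index.isin(df_bootstrap_sample.index)]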
2. We loop through 50 iterations and call the custom function by passing the
df_autodata DataFrame. We capture the mean of the mpg variable for each
bootstrap sample, which we'll measure against the mean of the mpg variable in
our original DataFrame, which is df_autodata:
iteration = 50
bootstap_statistics = list()
originalsample_statistics = list()

for i in range(iteration):
    # Call custom function create_bootstrap_oob(). Pass df_autodata
    create_bootstrap_oob(df_autodata)

    # Capture mean value of mpg variable for all bootstrap samples
    bootstap_statistics.append(df_bootstrap_sample.iloc[:,0].mean())
    originalsample_statistics.append(df_autodata['mpg'].mean())
3. We plot the mean of the mpg variable for each iteration, for which a separate
bootstrap sample has been considered. We capture the mean of the mpg variable
for each bootstrap sample in each iteration:
import matplotlib.pyplot as plt
f, ax= plt.subplots(figsize=(6,6))
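The plotting commands themselves are not reproduced here; a sketch of how the two series might be drawn on the axes created above:

# Sketch (assumed): bootstrap sample means versus the original sample mean, per iteration
ax.plot(range(iteration), bootstap_statistics, label='Bootstrap sample mean of mpg')
ax.plot(range(iteration), originalsample_statistics, label='Original sample mean of mpg')
ax.set_xlabel('Iteration')
ax.set_ylabel('Mean of mpg')
ax.legend()
plt.show()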
We finally plot the mean of the mpg variable against each iteration, which can be seen in the
following image:
How it works...
In Step 1, we created a custom function, create_bootstrap_oob(), and used the
resample() function from sklearn.utils to create a bootstrap sample with 100 observations.
See also
The scikit-learn guide to sklearn.cross_validation.Bootstrap: https://bit.ly/2RC5MYv
4
Statistical and Machine Learning Algorithms
In this chapter, we will cover the following recipes:
Technical requirements
The technical requirements for this chapter remain the same as those we detailed in
Chapter 1, Get Closer to Your Data.
Visit the GitHub repository to get the dataset and the code. These are arranged by chapter
and by the name of the topic. For the linear regression dataset and code, for example, visit the corresponding folder in the repository.
Training a linear regression model involves estimating the values of the coefficients for
each of the predictor variables, denoted by the letter $\beta$. In the preceding equation, $\epsilon$ denotes
an error term, which is normally distributed with zero mean and constant variance.
This is represented as follows:

$\epsilon \sim N(0, \sigma^2)$
Various techniques can be used to build a linear regression model. The most frequently
used is the ordinary least square (OLS) estimate. The OLS method is used to produce a
linear regression line that seeks to minimize the sum of the squared error. The error is the
distance from an actual data point to the regression line. The sum of the squared error
measures the aggregate of the squared difference between the training instances, which are
each of our data points, and the values predicted by the regression line. This can be
represented as follows:

$SSE = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$
In the preceding equation, $y_i$ is the actual training instance and $\hat{y}_i$ is the value predicted by
the regression line.
In the context of machine learning, gradient descent is a common technique that can be
used to optimize the coefficients of predictor variables by minimizing the training error of
the model through multiple iterations. Gradient descent starts by initializing the
coefficients to zero. Then, the coefficients are updated with the intention of minimizing the
error. Updating the coefficients is an iterative process and is performed until a minimum
squared error is achieved.
In the gradient descent technique, a hyperparameter called the learning rate, denoted
by $\alpha$, is provided to the algorithm. This parameter determines how fast the algorithm
moves toward the optimal values of the coefficients. If $\alpha$ is very large, the algorithm might
skip the optimal solution. If it is too small, however, the algorithm might need too many
iterations to converge to the optimum coefficient values. For this reason, it is important to
use the right value for $\alpha$.
In this recipe, we will use the gradient descent method to train our linear regression model.
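The update logic itself can be illustrated with a minimal NumPy sketch. Note that this is only an illustration of the idea and not the code used in this recipe; the function name, the fixed learning rate, and the iteration count are assumptions made for the example:

# A minimal sketch of batch gradient descent for linear regression
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iterations=1000):
    # X is an (m, n) feature matrix and y is an (m,) target vector
    m, n = X.shape
    X_b = np.c_[np.ones(m), X]      # add a column of 1s for the intercept
    theta = np.zeros(n + 1)         # initialize the coefficients to zero
    for _ in range(n_iterations):
        predictions = X_b.dot(theta)
        # gradient of the squared error with respect to the coefficients
        gradients = (2 / m) * X_b.T.dot(predictions - y)
        theta -= alpha * gradients  # alpha is the learning rate
    return theta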
Getting ready
In Chapter 1, Get Closer To Your Data, we took the HousePrices.csv file and looked at
how to manipulate and prepare our data. We also analyzed and treated the missing values
in the dataset. We will now use this final dataset for our model-building exercise, using
linear regression:
In the following code block, we will start by importing the required libraries:
# import os for operating system dependent functionalities
import os

# import pandas for data manipulation and analysis
import pandas as pd
Let's read our data. We prefix the DataFrame name with df_ so that we can understand it
easily:
df_housingdata = pd.read_csv("Final_HousePrices.csv")
How to do it...
Let's move on to building our model. We will start by identifying our numerical and
categorical variables. We study the correlations using the correlation matrix and the
correlation plots.
1. First, we'll take a look at the variables and the variable types:
# See the variables and their data types
df_housingdata.dtypes
2. We'll then look at the correlation matrix. The corr() method computes the
pairwise correlation of columns:
# We pass 'pearson' as the method for calculating our correlation
df_housingdata.corr(method='pearson')
3. Besides this, we'd also like to study the correlation between the predictor
variables and the response variable:
# we store the correlation matrix output in a variable
pearson = df_housingdata.corr(method='pearson')
# assume target attr is the last, then remove corr with itself
corr_with_target = pearson.iloc[-1][:-1]
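The How it works... section also refers to sorting these correlation coefficients by their absolute values; a minimal sketch of that step, reusing the corr_with_target variable from the preceding snippet, might be:

# sort the correlations with the target by absolute value, strongest first
corr_sorted = corr_with_target.abs().sort_values(ascending=False)
print(corr_sorted)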
4. We can look at the correlation plot using the heatmap() function from the
seaborn package:
f, ax = plt.subplots(figsize=(11, 11))
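The rest of the plotting code is not reproduced above. Based on the description in the How it works... section (the np.zeros_like() and np.triu_indices_from() mask, cmap="YlGnBu", and cbar_kws={"shrink": 0.5}), a sketch of the full step, reusing the pearson correlation matrix computed earlier, might look as follows:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# mask the upper triangle of the correlation matrix
mask = np.zeros_like(pearson, dtype=bool)
mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize=(11, 11))
sns.heatmap(pearson, mask=mask, cmap="YlGnBu", cbar_kws={"shrink": 0.5}, ax=ax)
plt.show()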
The following screenshot is the correlation plot. Note that we have removed the
upper triangle of the heatmap using the np.zeros_like()
and np.triu_indices_from() functions:
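Moving on, the next step plots the distribution of the SalePrice variable. The plotting code itself is not reproduced here; a minimal sketch using seaborn (the exact arguments are assumptions) might be:

import seaborn as sns
import matplotlib.pyplot as plt

# distribution plot for the target variable
sns.distplot(df_housingdata['SalePrice'])
plt.show()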
The following screenshot gives us the distribution plot for the SalePrice
variable:
6. We can also use JointGrid() from our seaborn package to plot a combination
of plots:
from scipy import stats
g = sns.JointGrid(df_housingdata['YearBuilt'],
df_housingdata['SalePrice'])
g = g.plot(sns.regplot, sns.distplot)
g = g.annotate(stats.pearsonr)
With the preceding code, we are able to plot the scatter plot for YearBuilt and
SalePrice, while also plotting the histogram for each of these variables on each
axis:
7. Let's now scale our numeric variables using min-max normalization. To do this,
we first need to select only the numeric variables from our dataset:
# create a variable to hold the names of the numeric data types, viz. int16,
# int32 and so on
num_cols = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
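The selection and scaling code itself is not shown above; a sketch of this step, assuming scikit-learn's MinMaxScaler and the df_housingdata_numcols name used later in this recipe, might be:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# keep only the numeric columns
df_numcols = df_housingdata.select_dtypes(include=num_cols)

# scale every numeric column to the [0, 1] range
scaler = MinMaxScaler()
df_housingdata_numcols = pd.DataFrame(scaler.fit_transform(df_numcols),
                                      columns=df_numcols.columns,
                                      index=df_numcols.index)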
In the following table, we can see that our numeric variables have been scaled
down:
# loop over the categorical columns and one-hot encode each of them
for col in list(df_housingdata_catcol.columns):
    one_hot_encoded_variables = pd.get_dummies(df_housingdata_catcol[col], prefix=col)
    df_housingdata_catcol = pd.concat([df_housingdata_catcol, one_hot_encoded_variables], axis=1)
    df_housingdata_catcol.drop([col], axis=1, inplace=True)
10. We have now created a DataFrame with only numeric variables that have been
scaled. We have also created a DataFrame with only categorical variables that
have been encoded. Let's combine the two DataFrames into a single DataFrame:
df_housedata = pd.concat([df_housingdata_numcols,
df_housingdata_catcol], axis=1)
12. We can create our training and testing datasets using the train_test_split
class from sklearn.model_selection:
# Create feature and response variable set
# We create train & test sample from our dataset
from sklearn.model_selection import train_test_split
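The split itself is not shown above; a sketch of this step, assuming SalePrice is the response variable and a 70/30 split (the split ratio and random_state are assumptions), might be:

# Separate the predictors (X) and the response (Y)
X = df_housedata.drop('SalePrice', axis=1)
Y = df_housedata['SalePrice']

# Hold out 30% of the data for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)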
13. We can now use SGDRegressor() to build a linear model. We fit this linear
model by minimizing the regularized empirical loss with SGD:
import numpy as np
from sklearn.linear_model import SGDRegressor
lin_model = SGDRegressor()
lin_model.fit(X_train, Y_train)
lin_model_predictions = lin_model.predict(X_test)
By running the preceding code, we find out that the coefficient of determination is
roughly 0.81.
Note that r2_score() takes two arguments. The first argument should
be the true values, not the predicted values, otherwise, it would return an
incorrect result.
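A sketch of the scoring call, with the true values passed first as the note above requires, might be:

from sklearn.metrics import r2_score

# true values first, predicted values second
print(r2_score(Y_test, lin_model_predictions))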
14. We check the root mean square error (RMSE) on the test data:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(Y_test, lin_model_predictions)
rmse = np.sqrt(mse)
print(rmse)
Running the preceding code provides output to the effect that the RMSE equals
36459.44.
15. We now plot the actual and predicted values using matplotlib.pyplot:
plt.figure(figsize=(8, 8))
plt.scatter(Y_test, lin_model_predictions)
plt.xlabel('Actual Median value of house prices ($1000s)')
plt.ylabel('Predicted Median value of house prices ($1000s)')
plt.tight_layout()
The resulting plot with our actual values and the predicted values will look as follows:
Because the chart shows most values lying along an approximately 45-degree diagonal line, our predicted values are quite close to the actual values, apart from a few.
How it works...
In Step 1, we looked at the variable types. We saw that the dataset had both numeric and
non-numeric variables. In Step 2, we used the Pearson method to calculate the pairwise
correlation among all the numeric variables. After that, in Step 3, we saw how all of the
predictor variables are related to the target variable. We also looked at how to sort
correlation coefficients by their absolute values.
In Step 4, we painted a heatmap to visualize the correlation between the variables. Then, we
introduced two functions from the NumPy library: zeros_like()
and triu_indices_from(). The zeros_like() function takes the correlation matrix as
an input and returns an array of zeros with the same shape and type as the given
array. triu_indices_from() returns the indices for the upper triangle of the array. We
used these two functions to mask the upper triangular part of the correlation plot. We
called the heatmap() function from the seaborn library to paint a correlation heat map
and passed our correlation matrix to it. We also set the color of the matrix using
cmap="YlGnBu" and the size of the legend bar using cbar_kws={"shrink": 0.5}.
In Step 9, Step 10, and Step 11, we performed one-hot encoding on the categorical variables
and added the encoded variables to the DataFrame. We also dropped the original
categorical variables. In Step 12, we split our dataset into a training set and a testing set. In Step 13, we built our linear regression model using SGDRegressor() and printed the coefficient of determination. In Step 14, we checked the RMSE on the test data. Finally, in Step 15, we plotted the predicted and actual values to see how well our model performed.
There's more...
Consider a linear regression model, given the following hypothesis function:

hθ(x) = θ₀ + θ₁x₁ + θ₂x₂ + ... + θₙxₙ
In this case, the cost function for hθ(x) is the mean squared error (MSE):

MSE(θ) = (1/m) Σ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

In this formula, m represents the number of training instances, and the sum runs over i = 1, ..., m. x⁽ⁱ⁾ and y⁽ⁱ⁾ are the input vector and the target for the ith training instance respectively, while θ represents the parameters or coefficients for each input variable. hθ(x⁽ⁱ⁾) is the predicted value for the ith training instance using the parameters. The MSE is always non-negative and the closer it gets to zero, the better.
The MSE is higher when the model performs poorly on the training data. The objective of the learning algorithm, therefore, is to find the value of θ such that the MSE is minimized. This can be represented as follows:

θ* = argmin_θ MSE(θ)

The stochastic gradient descent method finds the values of θ that minimize the cost function. In order to minimize the cost function, it keeps changing the parameters by calculating the slope (the derivative) of the cost function. It starts by initializing the parameters to zero. The parameters are updated at each step of the gradient descent:

θⱼ := θⱼ − α (∂/∂θⱼ) MSE(θ)
Every training instance will modify θ. The algorithm averages these values to calculate the final θ.

α is the learning rate, which tells the algorithm how rapidly to move toward the minimum. A large α might miss the minimum error, while a small α might make the algorithm take a long time to run.
In the preceding section, we used the SGDRegressor() function, but we opted for the default values of the hyperparameters. We are now going to change α to 0.0000001 and the max_iter value to 2000:
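The tuned code is not reproduced above; a sketch, assuming the learning rate is passed through SGDRegressor's eta0 parameter, might be:

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error, r2_score

# eta0 is scikit-learn's initial learning rate for SGDRegressor
lin_model = SGDRegressor(eta0=0.0000001, max_iter=2000)
lin_model.fit(X_train, Y_train)

lin_model_predictions = lin_model.predict(X_test)
print(np.sqrt(mean_squared_error(Y_test, lin_model_predictions)))  # RMSE
print(r2_score(Y_test, lin_model_predictions))  # coefficient of determination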
In our case, the preceding code gives the result that the RMSE drops from 36,459 to 31,222 and the coefficient of determination improves from 0.81 to 0.86. These results will vary for every run.
See also
The scikit-learn documentation on regression metrics: https://bit.ly/2D6Wn8s
The scikit-learn guide to density estimation: https://bit.ly/2RlnlMj
Logistic regression
In the previous section, we noted that linear regression is a good choice when the target
variable is continuous. We're now going to move on to look at a binomial logistic regression
model, which can predict the probability that an observation falls into one of two categories
of a dichotomous target variable based on one or more predictor variables. A binomial
logistic regression is often referred to as logistic regression.
Logistic regression is similar to linear regression, except that the dependent variable is measured on a dichotomous scale. Logistic regression allows us to model a relationship between multiple predictor variables and a dichotomous target variable. However, unlike linear regression, in the case of logistic regression the linear function is used as an input to another function, g:

hθ(x) = g(θ₀ + θ₁x₁ + ... + θₙxₙ)

Here, g is the sigmoid or logistic function. The sigmoid function is given as follows:

g(z) = 1 / (1 + e^(−z))
The following graph represents a sigmoid curve in which the values on the y-axis lie between 0 and 1, crossing the axis at 0.5:

The output, which lies between 0 and 1, is the probability of the positive class. We can interpret the output of our hypothesis function as positive if the value returned is greater than or equal to 0.5. Otherwise, we interpret it as negative.
In the case of logistic regression, we use a cost function known as cross-entropy. For binary classification, this takes the following form:

J(θ) = −(1/m) Σ [ y⁽ⁱ⁾ log(hθ(x⁽ⁱ⁾)) + (1 − y⁽ⁱ⁾) log(1 − hθ(x⁽ⁱ⁾)) ]

Here, the sum runs over the m training instances.
Cross-entropy increases as the predicted probability diverges from the actual label. A
higher divergence results in a higher cross-entropy value. In the case of linear regression,
we saw that we can minimize the cost using gradient descent. In the case of logistic
regression, we can also use gradient descent to update the coefficients and minimize the
cost function.
Getting ready
In this section, we're going to use a dataset that contains information on default payments,
demographics, credit data, payment history, and bill statements of credit card clients in
Taiwan from April 2005 to September 2005. This dataset is taken from the UCI ML repository and is available on GitHub:
Let's read our data. We will prefix the name of the DataFrame with df_ to make it easier to
read:
df_creditdata = pd.read_csv("UCI_Credit_Card.csv")
How to do it...
Let's start by looking at the variables and data types:
1. First, we're going to take a look at the dimensions of our dataset and the first few observations using the shape attribute and the head() function:
print(df_creditdata.shape)
print(df_creditdata.head())
4. In the previous section, we saw how to explore correlations among the variables.
We will skip this here, but readers are advised to check for correlation as
multicollinearity might have an impact on the model.
5. However, we will check if there are any null values, as follows:
df_creditdata.isnull().sum()
6. We will then separate the predictor and response variables. We will also split our
training and testing data:
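The code for this step is not reproduced here; a sketch covering the 70/30 split and the StandardScaler() standardization described in the How it works... section (the target column name and random_state are assumptions based on the UCI dataset) might be:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# separate the predictors and the response variable
# (the target column name is assumed here)
X = df_creditdata.drop('default.payment.next.month', axis=1)
Y = df_creditdata['default.payment.next.month']

# 70% of the data for training, 30% for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

# standardize the predictors in both samples
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)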
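Similarly, the model-fitting step that produces the predictedvalues array is not shown; a sketch using scikit-learn's LogisticRegression with the default l2 penalty mentioned in the See also section (the model name is an assumption) might be:

from sklearn.linear_model import LogisticRegression

# fit a logistic regression model with the default l2 penalty
logit_model = LogisticRegression(penalty='l2')
logit_model.fit(X_train, Y_train)

# predicted class probabilities for the test data
predictedvalues = logit_model.predict_proba(X_test)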
9. We separate out the probabilities of one class. In this case, we will look at class 1:
# We take the predicted values of class 1
Y_predicted = predictedvalues[:, 1]
# We check to see if the right values have been considered from the
predicted values
print(Y_predicted)
11. We can then see the area under curve (AUC) value of the receiver operating characteristic (ROC) curve:
from sklearn.metrics import roc_curve, auc

# we compute the false positive rate (fpr) and the true positive rate (tpr)
fpr, tpr, thresholds = roc_curve(Y_test, Y_predicted)

# we pass the fpr & tpr values to auc() to calculate the area under curve
roc_auc = auc(fpr, tpr)
print(roc_auc)
The following graph shows the ROC curve with the AUC value annotated on it:
The model can be improved by tuning the hyperparameters.
How it works...
In Step 1, we looked at the dimensions of our dataset. In Step 2, we took a glimpse at the
datatypes of the variables and noticed that all our variables were numeric in nature. In Step
3, we dropped the ID column since it is of no use for our exercise. We skipped looking at
the correlations between the variables, but it is recommended that the reader adds this step
in order to fully understand and analyze the data.
In Step 4, we moved on to check whether we had any missing values in our dataset. We
noticed that our dataset had no missing values in this case. In Step 5, we separated the
predictor and response variable and also split our dataset into a training dataset, which was
70% of the data, and a testing dataset, which was 30% of the data. In Step 6, we used StandardScaler() from sklearn.preprocessing to standardize our predictor variables in both the training and testing datasets.
In Steps 8 and 9, we filtered out the probabilities for class 1 and looked at our model score.
In Steps 10 and 11, we looked at the AUC value and plotted our ROC curve. We will explore
more about hyperparameter tuning for each technique in upcoming sections.
See also
You might have noticed that, in Step 7, we used a hyperparameter penalty of l2.
The penalty is the regularization term and l2 is the default value. The
hyperparameter penalty can also be set to l1; however, that may lead to a sparse
solution, pushing most coefficients to zero. More information about this topic can
be found at the following link: https://bit.ly/2RjbSwM
Naive Bayes
The Naive Bayes algorithm is a probabilistic learning method. It is known as naive because it assumes that all of the events (features) are independent of one another, which is actually quite rare in the real world. However, in spite of this assumption, the Naive Bayes algorithm has proven over time to provide great performance in terms of its prediction accuracy.
Bayesian probability theory is based on the principle that the estimated likelihood of an event or a potential outcome should be based on the evidence at hand across multiple trials. Bayes' theorem provides a way to calculate the probability of a given class, given some knowledge about prior observations:

p(class|observation) = p(observation|class) p(class) / p(observation)

p(class|observation): This is the probability that the class holds given the observation.
p(observation): This is the prior probability that the training data is observed.
p(class): This is the prior probability of the class.
p(observation|class): This is the probability of the observation given that the class holds.

In other words, if H is the space of possible hypotheses, the most probable hypothesis, class ∈ H, is the one that maximizes p(class|observation).
Getting ready
A Naive Bayes classifier is one of the most basic algorithms that can be applied in text
classification problems.
In this recipe, we will use the spam.csv file, which can be downloaded from the GitHub repository.
This spam.csv dataset has two columns. One column holds messages and the other
column holds the message type, which states whether it is a spam message or a ham
message. We will apply the Naive Bayes technique to predict whether a message is likely to
be spam or ham.
Let's read our data. As we did in the previous sections, we will prefix the name of the
DataFrame with df_ so that we can read it easily:
df_messages = pd.read_csv('spam.csv', encoding='latin-1', \
sep=',', names=['labels','message'])
How to do it...
Let's now move on to look at how to build our model.
1. After reading the data, we use the head() function to take a look it:
df_messages.head(3)
In the following screenshot, we can see that there are two columns: labels and
message. The output is as follows:
2. We then use the describe() function to look at a few metrics in each of the
columns:
df_messages.describe()
For the object datatype, the result of describe() will provide the count, unique, top, and freq metrics. top refers to the most common value, while freq is the frequency of this value.
With the preceding command, we see the count, number of unique values, and
frequency for each class of the target variable:
4. To analyze our dataset even further, let's take a look at the word count and the
character count for each message:
df_messages['word_count'] = df_messages['message'].apply(lambda x:
len(str(x).split(" ")))
df_messages['character_count'] = df_messages['message'].str.len()
df_messages[['message','word_count', 'character_count']].head()
5. In this case, labels is our target variable. We have two classes: spam and
ham. We can see the distribution of spam and ham messages using a bar plot:
labels_count = pd.DataFrame(df_messages.groupby('labels')['message'].count())
labels_count.reset_index(inplace = True)
plt.figure(figsize=(4,4))
sns.barplot(labels_count['labels'], labels_count['message'])
plt.ylabel('Frequency', fontsize=12)
plt.xlabel('Labels', fontsize=12)
plt.show()
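The encoding step itself (Step 6 in the How it works... section) is not reproduced here; a minimal sketch using a simple mapping might be:

# encode the class labels: ham -> 0, spam -> 1
df_messages['labels'] = df_messages['labels'].map({'ham': 0, 'spam': 1})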
Notice that, in the following screenshot, under the labels variable, all ham and
spam messages are now labelled as 0 and 1 respectively:
7. We will now split our data into training and testing samples:
# Split your data into train & test set
X_train, X_test, Y_train, Y_test = train_test_split(df_messages['message'],
                                                    df_messages['labels'], test_size=0.2, random_state=1)
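Steps 8 and 9 (vectorizing the messages and fitting the model) are not reproduced above; a sketch that is consistent with the vectorizer, model_nb, and predict_train names used in the following steps might be:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# convert the messages into a matrix of token counts
vectorizer = CountVectorizer()
vect_train = vectorizer.fit_transform(X_train)

# fit a Multinomial Naive Bayes model and predict on the training data
model_nb = MultinomialNB()
model_nb.fit(vect_train, Y_train)
predict_train = model_nb.predict(vect_train)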
10. We load the required libraries for the evaluation metrics, as follows:
from sklearn.metrics import accuracy_score
11. We now check our accuracy by evaluating the model with the training data:
# Calculate Train Accuracy
print('Accuracy score: {}'.format(accuracy_score(Y_train, predict_train)))
12. Now we check the accuracy of our test data by evaluating the model with the unseen test data:
# We apply the model to our test data
vect_test = vectorizer.transform(X_test)
prediction = model_nb.predict(vect_test)
# Calculate Test Accuracy
print('Accuracy score: {}'.format(accuracy_score(Y_test, prediction)))
How it works...
In Step 1, we looked at our dataset. In Step 2 and Step 3, we looked at the statistics for
the ham and spam class labels. In Step 4, we extended our analysis by looking at the word
count and the character count for each of the messages in our dataset. In Step 5, we saw the
distribution of our target variables (ham and spam), while in Step 6 we encoded our class
labels for the target variable with the numbers 1 and 0. In Step 7, we split our dataset into
training and testing samples. In Step 8, we used CountVectorizer() from
sklearn.feature_extraction.text to convert the collection of messages to a matrix of
token counts.
In Step 9 and Step 10, we built our model and imported the required classes from
sklearn.metrics to measure the various scores respectively. In Step 11 and 12, we
checked the accuracy of our training and testing datasets.
There's more...
The Naive Bayes algorithm comes in multiple variations. These include the Multivariate Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes algorithms. These variations can be applied to solve different problems.
Multivariate Bernoulli Naive Bayes: This algorithm is used when the feature
vectors provide a binary representation of whether a word or feature occurs in
the document or not. Every token in the feature vector of a document is
associated with either the 1 or 0 values. 1 represents a token in which the word
occurs, and 0 represents a token in which the word does not occur. The
Multivariate Bernoulli Naive Bayes algorithm can be used in situations in which
the absence of a particular word matters, such as in the detection of spam
content.
Multinomial Naive Bayes: This is used when multiple occurrences of words are
to be considered in classification problems. In this variation, text documents are
characterized by the frequency of the term, instead of binary values. Frequency
is a discrete count that refers to how many times a given word or token appears
in a document. The Multinomial Naive Bayes algorithm can be used for topic
modeling, which is a method for finding a group of words that best represent the
key information in a corpus of documents.
Gaussian Naive Bayes: In scenarios where we have continuous features, one way to deal with continuous data in Naive Bayes classification is to discretize the features. Alternatively, we can apply the Gaussian Naive Bayes algorithm. This assumes that the features follow a normal (Gaussian) distribution and uses the Gaussian density to calculate the class-conditional probabilities.
See also
In scikit-learn, CountVectorizer() counts the number of times a word shows
up in the document and uses that value as its weight. You can also use
TfidfVectorizer(), where the weight assigned to each token depends on both
its frequency in a document and how often the term recurs in the entire corpus.
You can find more on TfidfVectorizer at the following link: https://bit.ly/2sJCoVN.
The scikit-learn documentation on the Naive Bayes classifier for multivariate
Bernoulli models: https://bit.ly/2y3fASv.
The scikit-learn documentation on the Naive Bayes classifier for multinomial
models: https://bit.ly/2P4Ohic.
Decision trees
Decision trees, a non-parametric supervised learning method, are popular algorithms used
for predictive modeling. The most well-known decision tree algorithms include the
iterative dichotomizer (ID3), C4.5, CART, and C5.0. ID3 is only applicable for categorical
features. C4.5 is an improvement on ID3 and has the ability to handle missing
values and continuous attributes. The tree-growing process involves finding the best split
at each node using the information gain. However, the C4.5 algorithm converts a
continuous attribute into a dichotomous categorical attribute by splitting at a suitable
threshold value that can produce maximum information gain.
Decision trees are built using recursive partitioning, which splits the data into subsets
based on several dichotomous independent attributes. This recursive process may split the
data multiple times until the splitting process terminates after a particular stopping
criterion is reached. The best split is the one that maximizes a splitting criterion. For
classification learning, the techniques used as the splitting criterion are entropy and
information gain, the Gini index, and the gain ratio. For regression tasks, however,
standard deviation reduction is used.
The C4.5 and C5.0 algorithms use entropy (also known as Shannon entropy) and information gain to identify the optimal attributes and decide on the splitting criterion. Entropy is a probabilistic measure of uncertainty or randomness, defined as follows:

Entropy(S) = − Σ pᵢ log₂(pᵢ)

Here, pᵢ is the proportion of instances in S that belong to class i. In the case of a two-class attribute, entropy can range from 0 to 1. For an n-class attribute, entropy can take values between 0 and log₂(n). For a homogeneous variable, where there is just a single class, the entropy is zero, because the probability of that class is 1 and log₂(1) = 0.
To use entropy to identify the best attributes at which to split, the algorithm calculates the change in homogeneity that would result from a split on each possible attribute. This change is known as information gain. Constructing a decision tree is all about finding the attribute that returns the highest information gain. Information gain is calculated as the difference between the entropy before the split and the (weighted) entropy after the split:

InformationGain = Entropy(before the split) − Σ (|Sᵥ| / |S|) × Entropy(Sᵥ)

Here, the sum runs over the subsets Sᵥ produced by splitting S on the chosen attribute.
The higher the information gain, the better a feature is. Information gain is calculated for all
features. The algorithm chooses the feature with the highest information gain to create the
root node. The information gain is calculated at each node to select the best feature for that
node.
The Gini index is a measure of the degree of impurity and can also be used to identify the optimal attributes for the splitting criterion. It is calculated as follows:

Gini(S) = 1 − Σ pᵢ²
Getting ready
To build our model with a decision tree algorithm, we will use the backorders.csv file, which can be downloaded from the GitHub repository.
This dataset has 23 columns. The target variable is went_on_backorder. This identifies
whether a product has gone on back order. The other 22 variables are the predictor
variables. A description of the data is provided in the code that comes with this book:
Let's read our data. As we have done previously, we are going to prefix the name of the
DataFrame with df_ to make it easier to understand:
df_backorder = pd.read_csv("BackOrders.csv")
How to do it...
Let's now move on to building our model:
1. First, we want to look at the dimensions of the dataset and the data using
the shape and head() functions. We also take a look at the statistics of the
numeric variables using describe():
df_backorder.shape
df_backorder.head()
df_backorder.describe()
If you get your output in scientific notation, you can change to view it in
standard form instead by executing the following command:
pd.options.display.float_format = '{:.2f}'.format
2. With dtypes, we get to see the data types of each of the variables:
df_backorder.dtypes
3. We can see that sku is an identifier and will be of no use to us for our model-
building exercise. We will, therefore, drop sku from our DataFrame as follows:
df_backorder.drop('sku', axis=1, inplace=True)
4. We can check whether there are any missing values with the isnull().sum()
command:
df_backorder.isnull().sum()
5. Since the number of missing values in the lead_time variable is about 5%, we
will remove all the observations where lead_time is missing for our initial
analysis:
df_backorder = df_backorder.dropna(axis=0)
6. We now need to encode our categorical variables. We select only the categorical
variables and call pd.get_dummies() to dummy-code the non-numeric
variables:
non_numeric_attributes = df_backorder.select_dtypes(include=['object']).columns
df_backorder = pd.get_dummies(columns=non_numeric_attributes, data=df_backorder,
                              prefix=non_numeric_attributes, prefix_sep="_", drop_first=True)
df_backorder.dtypes
With the preceding code, we get to see the datatypes. We notice that dummy-coded variables have been created for the non-numeric attributes.
We can see that our data has an imbalanced distribution, with approximately 81% of the observations belonging to class 0 and 19% belonging to class 1:
8. We will now split our data into training and testing datasets:
# Performing train test split on the data
X, Y = df_backorder.loc[:, df_backorder.columns != 'went_on_backorder_Yes'].values, \
       df_backorder.loc[:, 'went_on_backorder_Yes'].values
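The train_test_split() call and the model fitting for Step 9 are not reproduced above; a sketch that is consistent with the model_DT_Gini, X_test, and Y_test names used below (the split ratio and random_state are assumptions) might be:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# hold out 30% of the data for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

# fit a decision tree with the default hyperparameters (Gini impurity)
model_DT_Gini = DecisionTreeClassifier()
model_DT_Gini.fit(X_train, Y_train)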
With model_DT_Gini, we can see the default values of the hyperparameters that the classifier uses:
10. We can use the model to predict our class labels using both our training and our
testing datasets:
# Predict with our test data
test_predictedvalues = model_DT_Gini.predict(X_test)
# Check accuracy
acc = accuracy_score(Y_test, test_predictedvalues)
print("Accuracy is", acc)
This gives us the accuracy along with the count of True Negative (TN), False
Positive (FP), False Negative (FN), and True Positive (TP) values:
cm = confusion_matrix(Y_test, test_predictedvalues)
plt.figure()
plot_confusion_matrix(cm, classes=target_names, normalize=False)
plt.show()
We can then see the amount of TNs, FPs, FNs, and TPs in our confusion matrix
plot:
12. We can change the hyperparameters to tune our model. We can also perform a
grid search to find the hyperparameter values that supply optimum results. We
can use the following code to set the hyperparameter values:
# set the parameters for grid search
grid_search_parameters = {"criterion": ["gini", "entropy"],
                          "min_samples_split": [2],
                          "max_depth": [None, 2, 3],
                          "min_samples_leaf": [1, 5],
                          "max_leaf_nodes": [None],
                          }
# Use GridSearchCV(), pass the values you have set for grid search
model_DT_Grid = GridSearchCV(classifier, grid_search_parameters,
cv=10)
model_DT_Grid.fit(X_train, Y_train)
14. After running the preceding command, we can see the best parameter values
among those provided using best_params_:
model_DT_Grid.best_params_
15. You can use the model that is selected using the GridSearchCV() function:
test_predictedvalues = model_DT_Grid.predict(X_test)
acc = accuracy_score(Y_test, test_predictedvalues)
print("Accuracy is", acc)
cm = confusion_matrix(Y_test, test_predictedvalues)
plt.figure()
plot_confusion_matrix(cm, classes=target_names, normalize=False)
plt.show()
These results will vary depending on the samples used and the hyperparameter
tuning.
How it works...
In Step 1, we took a look at the dimensions of our dataset. We also saw the statistics of our
numerical variables. In Step 2, we looked at the datatypes of each of our variables. In Step 3,
we dropped the sku attribute, because it is an identifier that will be of no use to us for
our model. In Step 4, we checked for missing values and noticed that the lead_time
attribute had 3,403 missing values, which is roughly 5% of the total number of
observations. In Step 5, we dropped the observations for which the lead_time had missing
values. Note that there are various strategies to impute missing values, but we haven't
considered these in this exercise.
In Step 6, we used get_dummies() from the pandas library with drop_first=True as one
of the parameters to perform a k-1 dummy coding on the categorical variables. In Step 7, we
took a look at the distribution of our target variable. We saw that the class labels, 0 and 1, are in a ratio of approximately 81%-19%, which is not very well balanced. However, we had enough observations for both classes to proceed to our next steps. In Step 8, we separated
our predictor and response variables. We also split our dataset to create a training dataset
and a testing dataset. In Step 9, we used a DecisionTreeClassifier() to build our
model. We noted the default hyperparameters values and noticed that, by
default, DecisionTreeClassifier() uses the Gini impurity measure as the splitting
criterion.
In Step 10, we used the model to predict our test sample. We took a note of the overall
accuracy and the amount of TP, TN, FP, and FN values that we achieved. In Step 11, we
used plot_confusion_matrix() to plot these values in the form of a confusion matrix.
Please note that plot_confusion_matrix() is readily available at https://bit.ly/2MdyDU9 and is also provided with the book in the code folder for this chapter.
In Step 12, we set the hyperparameter values for the grid search algorithm. In Steps 13 and 14, we used GridSearchCV() to look for the optimum
hyperparameters. In Step 15, we used the model returned by the grid search to predict our
test observations. Finally, in Step 16, we used classification_report() from
sklearn.metrics to generate various scores including precision, recall, f1-score,
and support.
There's more...
Sometimes, a model can classify training data perfectly but faces difficulty when working
with new data. This problem is known as overfitting. The model fails to generalize to the
new test data.
If we allow the recursive splitting process to repeat until each leaf node can no longer be split, the model may fit the training data perfectly but lead to poor performance on unseen data. For this reason, tree-based models are susceptible to overfitting. To overcome this, we need to control the depth of our decision tree.
There are multiple ways to avoid overfitting. One method is to terminate the growth before a perfect classification of the training data is made, for example by limiting the maximum depth of the tree, requiring a minimum number of samples to split a node or to form a leaf, or capping the maximum number of leaf nodes.
Another method is to allow the data to overfit, and then to prune the tree after it is
constructed. This involves eliminating nodes that are not clearly relevant, which also
minimizes the size of the decision tree.
See also
The scikit-learn documentation on the decision tree classifier: https://bit.ly/1Ymrzjw
The scikit-learn documentation on the decision tree regressor: https://bit.ly/2xMNSua
Support vector machines
For an SVM, the two classes are represented as -1 and +1 instead of 1 and 0. The separating hyperplane can, therefore, be written as follows:

w · x + b = 0

For the two classes, the classification rules become w · xᵢ + b ≥ +1 when yᵢ = +1, and w · xᵢ + b ≤ −1 when yᵢ = −1.
However, it's quite possible that there are a lot of hyperplanes that correctly classify the
training data. There might be infinite solutions of w and b that hold for the preceding rules.
An algorithm such as a perceptron learning algorithm will just find any linear classifier.
SVM, however, finds the optimal hyperplane, which is at a maximum distance from any
data point. The further the data points lie from the hyperplane, the more confident we are
that they have been correctly classified. We would therefore like the data points to be as far
away from the hyperplane as possible, while still being able to classify them correctly. The
best hyperplane is the one that has the maximum margin between the two classes. This is
known as the maximum-margin hyperplane.
It's possible for SVM to choose the most important vectors that define the separation
hyperplane from the training data. These are the data points that lie closest to the
hyperplane and are known as support vectors. Support vectors are the data points that are
hardest to classify. At the same time, these represent high-quality data. If you remove all
the other data points and use only the support vectors, you can get back the exact decision
hyperplane and the margin using the same SVM model. The number of data points does
not really matter, just the support vectors.
We normalize the weights w and b so that the support vectors satisfy the following condition:

|w · xᵢ + b| = 1
The initial SVM algorithms could only be used in the case of linearly separable data. These are known as hard-margin SVMs. However, hard-margin SVMs work only when the data is completely linearly separable and doesn't contain any noise. In the case of noise or outliers, a hard-margin SVM might fail.
Vladimir Vapnik proposed soft-margin SVMs to deal with data that is not linearly separable by using slack variables. Slack variables allow errors to be made while fitting the model to the training dataset. In hard-margin classification, we will get a decision boundary with a small margin. In soft-margin classification, we will get a decision boundary with a larger margin:
SVMs can also perform non-linear classification extremely well using something called a
kernel trick. This refers to transformations in which the predictor variables are implicitly
mapped to a higher-dimensional feature space. Popular kernel types include the following:
Linear kernels
Polynomial kernels
Radial basis function (RBF) kernels
Sigmoid kernels
Different kernel functions are available for various decision functions. We can add kernel
functions together to achieve even more complex planes.
Getting ready
In this chapter, we are going to use the bank.csv file, which is based on bank marketing
data and which you can download from GitHub. This data is related to a
Portuguese bank's direct marketing campaigns that took place over phone calls. The goal is
to predict whether the client will subscribe to a term deposit:
Let's read our data. We will again prefix the name of the DataFrame with df_ to make it
easier to understand:
df_bankdata = pd.read_csv("bank.csv")
How to do it...
In this section, we're going to look at checking null values, standardizing numeric values,
and one-hot-encoding categorical variables:
1. With the following command, we can see that we have ten categorical variables and seven numerical variables in the dataset:
df_bankdata.dtypes
2. With the following command, we notice there are no missing values, so we can
proceed with our next steps:
df_bankdata.isnull().sum()
4. We can convert our target class to the binary values 1 and 0 with the following
command:
df_bankdata['y'] = (df_bankdata['y']=='yes').astype(int)
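Steps 5 to 8 (one-hot encoding the non-numerical variables, separating the predictors and the response, and splitting the data), as described in the How it works... section, are not reproduced above; a sketch (the split ratio and random_state are assumptions) might be:

import pandas as pd
from sklearn.model_selection import train_test_split

# one-hot encode the remaining non-numerical variables
df_bankdata = pd.get_dummies(df_bankdata, drop_first=True)

# separate the predictors and the response, then split the data
X = df_bankdata.drop('y', axis=1)
Y = df_bankdata['y']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)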
9. We then build our first model using SVC with the default kernel, the radial basis function (RBF):
from sklearn.svm import SVC

# Note: you need not pass kernel='rbf' to SVC() because it's the default
svc_model = SVC(kernel='rbf')
svc_model.fit(X_train, Y_train)
10. We check our training and testing accuracy via the SVC model built with the RBF kernel:
train_predictedvalues = svc_model.predict(X_train)
test_predictedvalues = svc_model.predict(X_test)
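The accuracy calculation itself is not shown above; a sketch of it might be:

from sklearn.metrics import accuracy_score

print('Train accuracy:', accuracy_score(Y_train, train_predictedvalues))
print('Test accuracy:', accuracy_score(Y_test, test_predictedvalues))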
11. We can rebuild our SVC model with a polynomial kernel as follows:
svc_model = SVC(kernel='poly')
svc_model.fit(X_train, Y_train)
train_predictedvalues = svc_model.predict(X_train)
test_predictedvalues = svc_model.predict(X_test)
12. We can also build an SVC model with the linear kernel. Instead of kernel='poly', we can use kernel='linear' in the preceding code:
svc_model = SVC(kernel='linear')
Our results will vary depending on the different types of kernel and other
hyperparameter values used.
How it works...
In Step 1, we looked at the data types of our variables. We noticed that we have ten
categorical and seven numerical variables. In Step 2, we checked for missing values and saw
that there were no missing values in our dataset. In Step 3, we checked the class balance of
our target variable and found out that it has the values of yes and no. In Step 4, we
converted our target variable to 1 and 0 to represent yes and no respectively. In Steps 5 and
6, we performed one-hot encoding on the non-numerical variables.
In Step 7, we separate the predictor and response variables and in Step 8, we split our
dataset into training and testing datasets. After that, in Step 9, we used SVC() from
sklearn.svm with the default RBF kernel to build our model. We applied it to our training
and testing data to predict the class. In Step 10, we checked the accuracy of our training and
testing data. In Step 11, we changed our hyperparameter to set the kernel to polynomial. We
noticed that training accuracy remained more or less the same, but the test accuracy
improved.
With the polynomial kernel, the default degree is 3. You can change the
polynomial degree to a higher value and note the change in
the model's performance.
In Step 12, we changed the kernel to linear to see if the results improved compared to the
polynomial kernel. We did not, however, see any significant improvement.
There's more...
In this exercise, we have seen how to use various kernels in our code. Kernel
functions must be symmetric. Preferably, they should have a positive (semi-)
definite Gram matrix. A Gram matrix is the matrix of all possible inner
products of V, where V is a set of m vectors. For convenience, we
treat positive semi-definite and positive-definite functions interchangeably. In
practice, positive definiteness of the kernel matrix ensures that kernel
algorithms converge to a unique solution.
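To make the Gram matrix idea concrete, the following small sketch builds the Gram matrix of
a linear kernel for a handful of vectors and checks that its eigenvalues are non-negative, that
is, that the matrix is positive semi-definite. The data values are arbitrary:
# Illustrative example: Gram matrix of a linear kernel and a PSD check
import numpy as np

V = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])  # a set of m = 3 vectors
gram = V @ V.T                                        # all possible inner products
eigenvalues = np.linalg.eigvalsh(gram)
print(gram)
print('Positive semi-definite:', np.all(eigenvalues >= -1e-10))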
A linear kernel is the simplest of all kernels available. It works well with text
classification problems.
A linear kernel can be written as K(x, y) = x·y + c, where c is an optional constant term. A
polynomial kernel generalizes this to K(x, y) = (αx·y + c)^d, where α is the slope, d is the
degree of the kernel, and c is the constant term.
The radial basis function kernel (RBF), also known as the Gaussian kernel, is a
more complicated kernel and can outperform polynomial kernels. The RBF
kernel is given as follows:

K(x, y) = exp(-||x - y||² / (2σ²))

The parameter σ can be tuned to increase the performance of the kernel. This is important:
with an over-estimated σ, the kernel can lose its non-linear power and behave more
linearly. On the other hand, if σ is underestimated, the decision function can be highly
sensitive to noise in the training data.
Not all kernels are strictly positive-definite. The sigmoid kernel function, though quite
widely used, is not positive-definite. The sigmoid kernel is given as follows:

K(x, y) = tanh(αx·y + c)

Here, α is the slope and c is the constant term. Note that an SVM with a sigmoid kernel is
equivalent to a two-layer perceptron neural network.
Adding a kernel trick to an SVM model can give us new models. How do we
choose which kernel to use? The first approach is to try out the RBF kernel, since
it works pretty well most of the time. However, it is a good idea to use other
kernels and validate your results. Using the right kernel with the right dataset
can help you build the best SVM models.
See also
More on the positive definite matrix can be found here: https://bit.ly/
2NnGeLK.
Positive definite kernels are a generalization of the positive definite matrix. You
can find out more about this here: https://bit.ly/2NlsIs1.
5
Bag the Models with Bagging
In this chapter, we discuss the following recipes:
Bootstrap aggregation
Ensemble meta-estimators
Bagging regressors
Introduction
The combination of classifiers can help reduce misclassification errors substantially. Many
studies have shown that such ensemble methods can significantly reduce the variance of the
prediction model. Several techniques have been proposed to achieve this variance reduction.
For example, in many cases, bootstrap aggregating (bagging) classification trees have been
shown to have higher accuracy than a single classification tree. Bagging can be applied to
tree-based algorithms to enhance the accuracy of the predictions, although it can also be
used with non-tree-based methods.
Bootstrap aggregation
Bootstrap aggregation, also known as bagging, is a powerful ensemble method that was
proposed by Leo Breiman in 1994 to prevent overfitting. The concept behind bagging is to
combine the predictions of several base learners to create a more accurate output.
Breiman showed that bagging can successfully achieve the desired result in
unstable learning algorithms where small changes to the training data can lead to large
variations in the predictions. Breiman demonstrated that algorithms such as neural
networks and decision trees are examples of unstable learning algorithms. Bootstrap
aggregation is effective on small datasets.
The general procedure for bagging helps to reduce variance for those algorithms that have
high variance. Bagging supports both classification and regression problems. The following
diagram shows how the bootstrap aggregation flow works:
Using bootstrapping with a training dataset X, we generate N bootstrap samples X1, X2,.....,
XN.
For each bootstrap sample Xi, we train a classifier Ci. The combined classifier averages
the outputs from all of these individual classifiers as follows:

C(x) = (1/N) Σ (i=1 to N) Ci(x)
In a bagging classifier, voting is used to make a final prediction. The pseudo-code for the
bagging classifier proposed by Breiman is as follows:
In the case of the bagging regressor, the final prediction is the average of the predictions of
the models that are built over each bootstrap sample. The following pseudo-code describes
the bagging regressor:
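The pseudo-code figures are not reproduced in this excerpt. The following compact sketch
captures the bagging regressor procedure described above (bootstrap, fit a base learner on
each sample, average the predictions); the base learner and the number of bootstrap samples
are illustrative choices:
# Hedged sketch of the bagging regressor procedure
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import resample

def bagging_regressor_predict(X_train, y_train, X_test, n_bootstraps=10):
    predictions = []
    for _ in range(n_bootstraps):
        # draw a bootstrap sample with replacement
        X_bs, y_bs = resample(X_train, y_train, replace=True)
        model = DecisionTreeRegressor().fit(X_bs, y_bs)
        predictions.append(model.predict(X_test))
    # the final prediction is the average over all bootstrap models
    return np.mean(predictions, axis=0)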
Getting ready
We start by importing the required libraries and reading our file. We suppress any
warnings using the warnings.filterwarnings() function from the warnings library:
import warnings
warnings.filterwarnings('ignore')
import os
import pandas as pd
import numpy as np
We now set our working folder. Download the autompg.csv file from GitHub and copy it
into your working folder. Then set the working directory as follows:
os.chdir('.../.../Chapter 5')
os.getcwd()
We read our data with read_csv() and prefix the name of the data frame with df_ so that
it is easier to understand:
df_autodata = pd.read_csv("autompg.csv")
We notice that the horsepower variable has six missing values. We can fill in the missing
values using the median of the horsepower variable's existing values with the following
code:
df_autodata['horsepower'].fillna(df_autodata['horsepower'].median(), inplace=True)
We notice that the carname variable is an identifier and is not useful in our model-building
exercise, so we can drop it as follows:
df_autodata.drop(['carname'], axis=1, inplace=True)
How to do it...
In this section, we will see how to build a model using bootstrap samples:
1. In the following code block, we create our bootstrap and out-of-bag (OOB) samples:
from sklearn.utils import resample

def create_bootstrap_oob(df):
    global df_OOB
    global df_bootstrap_sample

    # create the bootstrap sample
    df_bootstrap_sample = resample(df, replace=True, n_samples=100)

    # create the OOB sample from the rows not drawn into the bootstrap sample
    bootstrap_sample_index = tuple(df_bootstrap_sample.index)
    bootstrap_df = df.index.isin(bootstrap_sample_index)
    df_OOB = df[~bootstrap_df]
2. We build models using the bootstrap samples and average the cost function
across all the models. We use the SGDRegressor() on each bootstrap sample. In
the following code block, we reuse our previously written custom
function, create_bootstrap_oob(), to create the bootstrap and OOB error
samples:
from sklearn.linear_model import SGDRegressor

iteration = 50
mse_each_iterations = list()
lm = SGDRegressor()
total_mse = 0
average_mse = list()
for i in range(iteration):
    create_bootstrap_oob(df_autodata)
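The body of this loop is not shown in the excerpt. The sketch below repeats the loop with a
plausible body based on the description in the How it works... section; the response column
name mpg is an assumption:
# Hedged sketch of the loop body; 'mpg' as the response column is an assumption
from sklearn.metrics import mean_squared_error

for i in range(iteration):
    create_bootstrap_oob(df_autodata)

    # split the bootstrap and OOB samples into features and response
    X_BS = df_bootstrap_sample.drop(['mpg'], axis=1)
    Y_BS = df_bootstrap_sample['mpg']
    X_OOB = df_OOB.drop(['mpg'], axis=1)
    Y_OOB = df_OOB['mpg']

    # fit on the bootstrap sample and predict on the OOB sample
    lm.fit(X_BS, Y_BS)
    mse = mean_squared_error(Y_OOB, lm.predict(X_OOB))

    # track the MSE and its running average across iterations
    mse_each_iterations.append(mse)
    total_mse += mse
    average_mse.append(total_mse / (i + 1))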
3. We are now going to plot the MSE for each model built:
import matplotlib.pyplot as plt

plt.plot(average_mse, label='Average MSE')  # running average accumulated in Step 2
plt.xlabel('Iterations')
plt.ylabel('Mean Squared Error')
plt.legend(loc=1)
plt.show()
How it works...
In Step 1, we executed our custom function code to create
the create_bootstrap_oob() function that creates the bootstrap and OOB samples for
us. In Step 2, we executed the following steps:
3. We called create_bootstrap_oob(df_autodata) to create the bootstrap and OOB
samples, respectively.
4. We split both the df_bootstrap_sample and the df_OOB samples into feature
sets and response variables.
5. We fit the SGDRegressor() to our bootstrap sample to build our model.
6. We passed the OOB sample to the model to predict our values.
In Step 3, we created a plot to show the MSE for each iteration up to the fiftieth iteration.
This result may vary because of randomness.
See also
Bagging Predictors by Leo Breiman, September 1994
Ensemble meta-estimators
The bagging classifier and the bagging regressor are ensemble meta-estimators that fit the
base classifier and regressor models respectively on random subsets of the original dataset.
The predictions from each model are combined to create the final prediction. These kinds of
meta-estimators induce randomization into the model-building process and aggregate the
outcome. The aggregation averages over the iterations for a numerical target variable and
performs a plurality vote in order to reach a categorical outcome.
Bagging classifiers
Bagging classifiers train each classifier model on a random subset of the original training
set and then aggregate the predictions by performing a plurality vote to reach a categorical
outcome. In the following recipe, we are going to look at an implementation of a bagging
classifier with bootstrap samples.
How to do it...
1. We import BaggingClassifier and DecisionTreeClassifier from the
scikit-learn library. We also import the other required libraries as follows:
2. Next, we read our data and take a look at the dimensions:
df_winedata = pd.read_csv('winedata.csv')
df_winedata.shape
3. We separate our features and the response set. We also split our data into
training and testing subsets (a sketch of the split and the model construction
follows the code below):
X = df_winedata.iloc[:,1:14]
Y = df_winedata.iloc[:,0]
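The train/test split and the construction of the bagged model (Steps 4 and 5 in the How it
works... section) are not shown in this excerpt. A sketch consistent with that description,
with illustrative split and estimator settings, is as follows:
# Hedged sketch of the omitted split and model-building steps
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)

dt_model = DecisionTreeClassifier(criterion='entropy')
bag_dt_model = BaggingClassifier(dt_model, n_estimators=10, bootstrap=True,
                                 random_state=1)
bag_dt_model.fit(X_train, Y_train)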
6. We can see the score after passing the test data to the model:
bag_dt_model.score(X_test, Y_test)
8. We will now use code to plot the confusion matrix. Note that this code has been
taken from scikit-learn.org. We execute the following code to create
the plot_confusion_matrix() function:
# code from
# http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
import itertools
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# compact version of the scikit-learn example referenced above
def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    plt.xticks(range(len(classes)), classes, rotation=45)
    plt.yticks(range(len(classes)), classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('Actuals')
    plt.xlabel('Predicted')

cm = confusion_matrix(Y_test, predictedvalues)
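The call that produces the plot (Step 9) is not shown in the excerpt; it would look something
like the following, where the class labels for the wine data are an assumption:
# Hedged sketch of Step 9: plot the confusion matrix computed above
plot_confusion_matrix(cm, classes=[1, 2, 3], title='Confusion matrix')
plt.show()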
How it works...
In Step 1, we imported the required libraries to build our decision tree classifier model
using the bagging classifier. In Step 2, we read our dataset, which was winedata.csv. In
Step 3, we separated our feature set and the target variable. We also split our data into
training and testing subsets. In Step 4, we created a decision tree classifier model and
passed it to the BaggingClassifier(). In the DecisionTreeClassifier(), the default
value for the criterion parameter was gini, but we changed it to entropy. We then
passed our decision tree model to the BaggingClassifier(). In
the BaggingClassifier(), we have parameters including n_estimators and
bootstrap. n_estimators is the number of base estimators in the ensemble and has a
default value of 10. The bootstrap parameter indicates whether samples are drawn with
replacement or not and is set to True by default.
In Step 5 and Step 6, we fitted our model to the training data and looked at the score of the
test set. In Step 7, we called the predict() method and passed the test feature set. In Step 8,
we added the code for the plot_confusion_matrix() from http://scikit-learn.org,
which takes the confusion matrix as one of its input parameters and plots the confusion
matrix. In Step 9, we called the plot_confusion_matrix() function by passing the
confusion matrix to generate the confusion matrix plot.
There's more...
We can also use GridSearchCV() from sklearn.model_selection to grid search the
best parameters and use them in the BaggingClassifier:
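The grid-search code itself is not included in this excerpt; a minimal sketch of how it might be
set up is shown below. The parameter grid values are illustrative, and the best estimator is
stored under the final_bag_dt_model name used in the next step:
# Hedged sketch of a grid search over BaggingClassifier parameters
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

param_grid = {'n_estimators': [10, 20, 50], 'max_samples': [0.5, 0.8, 1.0]}
grid = GridSearchCV(BaggingClassifier(DecisionTreeClassifier(criterion='entropy'),
                                      bootstrap=True, random_state=1),
                    param_grid, cv=5)
grid.fit(X_train, Y_train)

final_bag_dt_model = grid.best_estimator_
print(grid.best_params_)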
6. We can then look at the accuracy of our OOB samples in the following code
block:
final_bag_dt_model.fit(X_train, Y_train)
bag_predictedvalues = final_bag_dt_model.predict(X_test)
If we plot our confusion matrix, we can see that we have made an improvement with
regard to the number of misclassifications that are made. In the earlier example, two
instances of class 2 were wrongly predicted as class 3, but we can now see that the number
of misclassifications has reduced to one:
See also
The scikit-learn guide to bagging classifiers: https://bit.ly/2zaq8lS
Bagging regressors
Bagging regressors are similar to bagging classifiers. They train each regressor model on a
random subset of the original training set and aggregate the predictions. Then, the
aggregation averages over the iterations because the target variable is numeric. In the
following recipe, we are going to showcase the implementation of a bagging regressor with
bootstrap samples.
Getting ready
We will import the required libraries, BaggingRegressor and DecisionTreeRegressor,
from sklearn.ensemble and sklearn.tree respectively:
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
We read our dataset, which is bostonhousing.csv, and look at the dimensions of the
DataFrame:
df_housingdata = pd.read_csv('bostonhousing.csv')
df_housingdata.shape
We now move on to creating our feature set and our target variable set.
How to do it...
1. We first separate our feature and response set. We will also split our data into
training and testing subsets (a sketch of the split and the model construction
follows the code below):
X = df_housingdata.iloc[:,1:14]
Y = df_housingdata.iloc[:,-1]
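The split, model construction, and training (Steps 1 to 4 in the How it works... section) are
only partly shown. A sketch consistent with the parameters used later in Step 7, with
n_estimators=5 for this first model and assumed split settings, is:
# Hedged sketch of the omitted split, model-building, and scoring steps
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)

dt_model = DecisionTreeRegressor()
bag_dt_model = BaggingRegressor(dt_model, max_features=1.0, n_estimators=5,
                                bootstrap=True, random_state=1)
bag_dt_model.fit(X_train, Y_train)
print(bag_dt_model.score(X_test, Y_test))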
5. We use the predict() function and pass the test dataset to predict our target
variable as follows:
predictedvalues = bag_dt_model.predict(X_test)
6. We plot the scatter plot of our actual values and the predicted values of our
target variable with the following code:
#We can plot the actuals and the predicted values
plt.figure(figsize=(4, 4))
plt.scatter(Y_test, predictedvalues)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.tight_layout()
7. We now change the n_estimators parameter to 30 in the following code and re-
execute the steps from Step 3 to Step 6:
bag_dt_model = BaggingRegressor(dt_model, max_features=1.0, n_estimators=30,
                                bootstrap=True, random_state=1)
8. The plot of the actual values against the predicted values looks as follows. It
shows that, after changing the n_estimators parameter from 5 to 30, the values are
predicted more accurately than in the previous case:
How it works...
In Step 1, we separated the features and the target variable set. We also split our data into
training and testing subsets. In Step 2, we created a decision tree regressor model and
passed it to the BaggingRegressor() function. Note that we also passed the
n_estimators=5 parameter to the BaggingRegressor() function. As mentioned earlier,
n_estimators is the number of base estimators (trees) we would like the algorithm to build. In
Step 3, we trained our model.
In Step 4, we looked at the model score, which was 0.71. In Step 5, we used the predict()
function to predict our target variable for the test subset. After that, in Step 6, we plotted a
scatterplot to explore the relationship between the actual target values and the predicted
target values.
In Step 7, we changed the n_estimators parameter's value from 5 to 30 and re-built our
model. This time, we noticed that the model score improved to 0.82. In Step 8, we plotted
the actual and predicted values and saw that the correlation between the actual and
predicted values was much better than our previous model, where we used
n_estimators=5.
See also
The scikit-learn guide to bagging regressors: https://bit.ly/2pZFmUh
Single estimator versus bagging: https://bit.ly/2q08db6
6
When in Doubt, Use Random Forests
In this chapter, we will cover the following recipes:
All the decision trees that make up a random forest are different because we
build each tree on a different random subset of our data. A random forest tends to be more
accurate than a single decision tree because it minimizes overfitting.
The following diagram demonstrates bootstrap sampling being done from the source
sample. Models are built on each of the samples and then the predictions are combined to
arrive at a final result:
Each tree in a random forest is built using the following steps, where A represents the entire
forest and a represents a single tree, for a = 1 to A:
1. Create a bootstrap sample by drawing, with replacement, from the training data (X, Y);
label this sample (Xa, Ya) and train a tree fa on it.
In a regression problem, predictions for the test instances are made by taking the mean of
the predictions made by all trees. This can be represented as follows:

f(x') = (1/A) Σ (a=1 to A) fa(x')
Here, A is the total number of trees in the random forest, a = 1 represents the first tree in the
forest, and the last tree in the forest is A. fa(x') represents the prediction from a single
tree for a test instance x'.
If we have a classification problem, majority voting or the most common answer is used.
Getting ready
In this example, we use a dataset from the UCI ML repository on credit card defaults. This
dataset contains the following information:
Default payments
Demographic factors
Credit data
History of payments
Bill statements of credit card clients
The data and the data descriptions are provided in the GitHub folder:
We will start by loading the required libraries and reading our dataset:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
Let's now read our data. We will prefix the DataFrame name with df_ so that we can
understand it easily:
df_creditcarddata = pd.read_csv("UCI_Credit_Card.csv")
df_creditcarddata.shape
We can explore our data in various ways. Let's take a look at a couple of different methods:
selected_columns = df_creditcarddata[['AGE', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3',
                                      'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'LIMIT_BAL']]

# the histogram call is not shown in the excerpt; the note below implies something like
# this (figure size is an assumed value; the trailing semicolon suppresses output)
selected_columns.hist(figsize=(16, 20), xlabelsize=8, ylabelsize=8);
Note that we have used a semicolon in the last line in the preceding code
block. The semicolon helps to hide the verbose information produced by
Matplotlib. xlabelsize and ylabelsize are used to adjust the font size
in the x-axis and the y-axis.
We will now explore the payment defaults by age group. We bucket the AGE variable and
store the binned values in a new variable, age_group, in df_creditcarddata:
df_creditcarddata['age_group'] = pd.cut(df_creditcarddata['AGE'],
                                        range(0, 100, 10), right=False)
df_creditcarddata.head()
We then use our new age_group variable to plot the number of defaults per age group:
# Default vs Age
pd.crosstab(df_creditcarddata.age_group,
            df_creditcarddata["default.payment.next.month"]).plot(
    kind='bar', stacked=False, grid=True)
We can drop the age_group variable from df_creditcarddata since we do not need it
anymore:
df_creditcarddata = df_creditcarddata.drop(columns = ['age_group'])
df_creditcarddata.head()
We will now look at the payment defaults according to the credit limits of the account
holders:
fig_facetgrid = sns.FacetGrid(df_creditcarddata,
hue='default.payment.next.month', aspect=4)
fig_facetgrid.map(sns.kdeplot, 'LIMIT_BAL', shade=True)
max_limit_bal = df_creditcarddata['LIMIT_BAL'].max()
fig_facetgrid.set(xlim=(0,max_limit_bal));
fig_facetgrid.set(ylim=(0.0,0.000007));
fig_facetgrid.set(title='Distribution of limit balance by default.payment')
fig_facetgrid.add_legend()
We can also assign labels to some of our variables to make the interpretations better. We
assign labels for the Gender, Marriage, and Education variables.
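The mapping dictionaries used in the following code are not shown in this excerpt. Plausible
definitions based on the UCI data description (the exact label text is an assumption) are:
# Hedged sketch of the label maps; label text follows the UCI data description
GenderMap = {1: 'Male', 2: 'Female'}
MarriageMap = {1: 'Married', 2: 'Single', 3: 'Others', 0: 'Others'}
EducationMap = {1: 'Graduate School', 2: 'University', 3: 'High School',
                4: 'Others', 5: 'Unknown', 6: 'Unknown', 0: 'Others'}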
df_creditcarddata['SEX'] = df_creditcarddata.SEX.map(GenderMap)
df_creditcarddata['MARRIAGE'] = df_creditcarddata.MARRIAGE.map(MarriageMap)
df_creditcarddata['EDUCATION'] = df_creditcarddata.EDUCATION.map(EducationMap)

df_creditcarddata['PAY_0'] = df_creditcarddata['PAY_0'].astype(str)
df_creditcarddata['PAY_2'] = df_creditcarddata['PAY_2'].astype(str)
df_creditcarddata['PAY_3'] = df_creditcarddata['PAY_3'].astype(str)
df_creditcarddata['PAY_4'] = df_creditcarddata['PAY_4'].astype(str)
df_creditcarddata['PAY_5'] = df_creditcarddata['PAY_5'].astype(str)
df_creditcarddata['PAY_6'] = df_creditcarddata['PAY_6'].astype(str)
There are more explorations available in the code bundle provided with this book. We now
move on to training our random forest model.
How to do it...
We will now look at how to use a random forest to train our model:
7. We might notice that the column names have been changed to numbers. We
assign the column names and index values back to the scaled DataFrames:
X_train_scaled.columns = X_train.columns.values
X_test_scaled.columns = X_test.columns.values
X_train_scaled.index = X_train.index.values
X_test_scaled.index = X_test.index.values
X_train = X_train_scaled
X_test = X_test_scaled
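The model construction and accuracy check (Steps 8 and 9 in the How it works... section) are
not shown in the excerpt. A sketch consistent with the variable names used in the following
steps, with assumed hyperparameter values, is:
# Hedged sketch: build the random forest classifier and check its accuracy
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

model_RF = RandomForestClassifier(n_estimators=100, random_state=1)
model_RF.fit(X_train, y_train)

y_pred_RF = model_RF.predict(X_test)
print('Train accuracy:', model_RF.score(X_train, y_train))
print('Test accuracy:', metrics.accuracy_score(y_test, y_pred_RF))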
10. We get the false positive rate (FPR) and true positive rate (TPR) by
passing y_test and y_pred_proba to roc_curve(). We also get the auc value
using roc_auc_score(). Using the FPR, TPR, and the AUC value, we plot the
ROC curve with the AUC value annotated on the plot:
y_pred_proba = model_RF.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="AUC="+str(auc))
plt.legend(loc=4)
plt.show()
The following graph shows the ROC curve with the AUC value annotated on it:
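The code that builds evaluation_scores (Step 11) is omitted from the excerpt; one way to
compute the scores mentioned in the How it works... section (kappa, precision, recall, and
accuracy) is sketched here:
# Hedged sketch of Step 11: collect several evaluation metrics in a Series
from sklearn import metrics

evaluation_scores = pd.Series({
    'Kappa': metrics.cohen_kappa_score(y_test, y_pred_RF),
    'Precision': metrics.precision_score(y_test, y_pred_RF),
    'Recall': metrics.recall_score(y_test, y_pred_RF),
    'Accuracy': metrics.accuracy_score(y_test, y_pred_RF),
})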
print(evaluation_scores)
12. We can also evaluate a few statistics based on the class of the target variable,
which in this case is 0 or 1:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_RF))
13. We can plot the top 10 variables by feature importance to see which variables are
important for the model:
feature_importances = pd.Series(model_RF.feature_importances_,
                                index=X_train.columns)
feature_importances.nlargest(10).plot(kind='barh')  # top 10 features
The following screenshot shows the top 10 variables with their relative
importance:
We can change the hyperparameters to see how the model can perform better. We can also
perform a grid search over combinations of hyperparameter values to fine-tune our model.
How it works...
In Step 1, we split our target and feature variables. In Step 2, in our feature set, we separated
the numeric and non-numeric variables. In Step 3 and Step 4, we converted the non-numeric
variables to dummy coded variables and added them back to the DataFrame. In Step 5, we
split our dataset into training and testing subsets, and in Step 6, we imported
StandardScaler() from sklearn.preprocessing and applied the same scale to our
features.
After executing the commands in Step 6, we noticed that the column names had changed to
sequential numbers. For this reason, in Step 7, we assigned the column names and the index
values back to the scaled DataFrame. In Step 8, we imported RandomForestClassifier()
from sklearn.ensemble and built our first random forest classifier model. After that,
in Step 9 and Step 10, we used our model to calculate the accuracy of our training model and
plotted the ROC curve respectively. We also annotated the ROC Curve with the AUC
value.
In Step 11, we evaluated other scores, including the kappa value, the precision, the recall,
and the accuracy.
In Step 12, we also evaluated these scores based on each class of the target variable, which
in this case is 0 or 1, using classification_report from sklearn.metrics.
There, classification_report() provides us with metrics such as precision, recall, and
f1-score by each class, as well as the average of each of the metrics.
Finally, in Step 13, we looked at the relative variable importance of the top 10 features. This
can help in feature selection to build the models with the right features.
There's more...
Isolation forest is another algorithm that is built on the basis of decision trees, and it's used
for anomaly and outlier detection. This algorithm is based on the assumption that the
outlier data points are rare.
The algorithm works a bit differently to the random forest. It creates a bunch of decision
trees, then it calculates the path length necessary to isolate an observation in the tree. The
idea is that isolated observations, or anomalies, are easier to separate because there are
fewer conditions necessary to distinguish them from normal cases. Thus, the anomalies will
have shorter paths than normal observations and will, therefore, reside closer to the root of
the tree. When several decision trees are created, the scores are averaged, which gives us a
good idea about which observations are truly anomalies. As a result, isolation forests are
used for outliers and anomaly detection.
Also, an isolation forest does not utilize any distance or density measures to detect an
anomaly. This reduces the computational cost significantly compared to the distance-based
and density-based methods.
See also
The scikit-learn implementation of the isolation forest algorithm can be found
here: https://bit.ly/2DCjGGF
H2O's own documentation explains how it brings lightning-fast machine learning to
enterprises:
"H2O's core code is written in Java. Inside H2O, a distributed key/value store is used to
access and reference data, models, objects, and so on, across all nodes and machines. The
algorithms are implemented on top of H2O's distributed Map/Reduce framework and
utilize the Java fork/join framework for multi-threading. The data is read in parallel and is
distributed across the cluster and stored in memory in a columnar format in a compressed
way. H2O's data parser has built-in intelligence to guess the schema of the incoming
dataset and supports data ingest from multiple sources in various formats"
- from h2o.ai
H2O provides us with distributed random forests, which are a powerful tool used for
classification and regression tasks. This generates multiple trees, rather than a single tree. In
a distributed random forest, the predictions of the individual trees are averaged to reach a
final result, for both classification and regression models.
Getting ready
Java is an absolute must for H2O to run. Make sure you have Java installed by running the
following commands in Jupyter:
! apt-get install default-jre
! java -version
You will now need to install H2O. To install this from Jupyter, use the following command:
! pip install h2o
import h2o
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
To use H2O, we need to initialize an instance and connect to it. We can do that as follows:
h2o.init()
By default, the preceding command tries to connect to an instance. If it fails to do so, it will
attempt to start an instance and then connect to it. Once connected to an instance, we will
see the details of that instance, as follows:
Check whether the data in the H2O DataFrame is properly loaded as follows:
hf_creditcarddata.head()
We drop the ID column, as this will not be required for our model building exercise:
hf_creditcarddata = hf_creditcarddata.drop(["ID"], axis = 1)
We will now move on to explore our data and build our model.
How to do it...
We have performed various explorations on our data in the previous section. There is no
limit to the ways in which we can explore our data. In this section, we are going to look at a
few more techniques:
1. We check the correlation of each of our feature variables with the target variable:
df_creditcarddata.drop(['default.payment.next.month'], axis=1).corrwith(
    df_creditcarddata['default.payment.next.month']).plot.bar(
        figsize=(20, 10), title='Correlation with Response variable',
        fontsize=15, rot=45, grid=True)
The following plot shows how each of the features is correlated with the target
variable:
2. We check the datatypes in the H2O DataFrame. Note that for the pandas
DataFrame, we used dtypes. For the H2O DataFrame, we use types:
hf_creditcarddata.types
3. We notice that they are all of the integer datatype. We will convert them to factor
type, which is categorical in nature:
hf_creditcarddata['SEX'] = hf_creditcarddata['SEX'].asfactor()
hf_creditcarddata['EDUCATION'] = hf_creditcarddata['EDUCATION'].asfactor()
hf_creditcarddata['MARRIAGE'] = hf_creditcarddata['MARRIAGE'].asfactor()
hf_creditcarddata['PAY_0'] = hf_creditcarddata['PAY_0'].asfactor()
hf_creditcarddata['PAY_2'] = hf_creditcarddata['PAY_2'].asfactor()
hf_creditcarddata['PAY_3'] = hf_creditcarddata['PAY_3'].asfactor()
hf_creditcarddata['PAY_4'] = hf_creditcarddata['PAY_4'].asfactor()
hf_creditcarddata['PAY_5'] = hf_creditcarddata['PAY_5'].asfactor()
hf_creditcarddata['PAY_6'] = hf_creditcarddata['PAY_6'].asfactor()
'PAY_AMT4','PAY_AMT5','PAY_AMT6']
target = 'default.payment.next.month'
6. We now split the H2O DataFrame into training and testing subsets. We use 70%
of our data for training the model and the remaining 30% for validation:
splits = hf_creditcarddata.split_frame(ratios=[0.7], seed=123)
train = splits[0]
test = splits[1]
7. We build our random forest model with the default settings. You can check the
model performance on the test data with the following commands:
from h2o.estimators.random_forest import H2ORandomForestEstimator
print(RF_D.model_performance(test))
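The construction and training of RF_D are not shown in the excerpt. A minimal sketch with
the default estimator settings described above (the model ID and seed are illustrative
choices) is:
# Hedged sketch: build and train the default distributed random forest
RF_D = H2ORandomForestEstimator(model_id='RF_D', seed=123)
RF_D.train(x=predictors, y=target, training_frame=train)
print(RF_D.model_performance(test))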
How it works...
In the Getting ready section, we installed JRE and H2O. We initialized and connected to an
H2O instance with h2o.init(). We then read our data using pandas and converted it to
an H2O DataFrame. We used the head() and describe() methods on the H2O
DataFrame, just like we used them on a pandas DataFrame. We then dropped the ID
column from the H2O DataFrame.
After we did these data explorations in the Getting ready section, we moved on to the next
steps. In Step 1, we checked the correlation of each of the features with the target
variable. In Step 2, we used the h2o DataFrame and checked the datatypes.
In Step 3, we used asfactor() to convert the numeric variables to the categorical type. We
performed this on variables that were supposed to be of a categorical type but were
appearing as numeric.
In Step 5, we separated our features and the target variable. In Step 6, we split the H2O
DataFrame into training and testing subsets using split_frame() on our H2O
DataFrame. We used the ratios parameter and set it to ratios=[0.7] for
split_frame() to allocate 70% of the data to the training set and 30% of the data to the
testing set.
There's more...
In our preceding example, we have an AUC of 0.76 and a log loss of 0.44:
We notice that the AUC has slightly improved to 0.77 and that the log loss has
dropped to 0.43:
2. We can also apply a grid search to extract the best model from the given options.
We set our options as follows:
from h2o.grid.grid_search import H2OGridSearch

search_criteria = {'strategy': "RandomDiscrete"}

RF_Grid = H2OGridSearch(
    H2ORandomForestEstimator(
        model_id='RF_Grid',
        ntrees=200,
        nfolds=10,
        stopping_metric='AUC',
        stopping_rounds=25),
    hyper_params=hyper_params,
    search_criteria=search_criteria)

RF_Grid.train(x=predictors, y=target, training_frame=train)
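The hyper_params dictionary referenced in the preceding grid search is not included in the
excerpt. An illustrative definition (the parameter names are valid H2O random forest
hyperparameters, but the values are assumptions) might be:
# Hedged sketch of a hyperparameter grid for the H2O random forest
hyper_params = {
    'max_depth': [10, 20, 30],
    'sample_rate': [0.7, 0.8, 0.9],
    'col_sample_rate_per_tree': [0.7, 0.8, 0.9],
}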
4. We now sort all models by AUC in a descending manner and then pick the first
model, which has the highest AUC:
RF_Grid_sorted = RF_Grid.get_grid(sort_by='auc',decreasing=True)
print(RF_Grid_sorted)
best_RF_model = RF_Grid_sorted.model_ids[0]
best_RF_from_RF_Grid = h2o.get_model(best_RF_model)
6. We can plot the variable importance from the best model that we have achieved
so far:
best_RF_from_RF_Grid.varimp_plot()
See also
You may want to look into extremely randomized trees, which have a slightly different
implementation but can sometimes perform better than random forests.
In ensemble methods, each model learns differently in terms of the subset of the dataset
and the subset of the feature vector used for training. These subsets are taken
randomly. Extremely randomized trees possess a high randomness factor in the way they
compute the splits and select the subset of features. Unlike random forests, in
which the most discriminative threshold is sought for each candidate feature, in extremely
randomized trees the splitting thresholds are drawn at random and the best of these
randomly generated thresholds is used as the splitting rule. Due to this, the variance of the
ensemble decreases further and the overall performance may be better.
7
Boosting Model Performance with Boosting
In this chapter, we will cover the following recipes:
Introduction to boosting
Implementing AdaBoost for disease risk prediction using scikit-learn
Implementing gradient boosting for disease risk prediction using scikit-learn
Implementing extreme gradient boosting for glass identification using XGBoost
with scikit-learn
Introduction to boosting
A boosting algorithm is an ensemble technique that helps to improve model performance
and accuracy by taking a group of weak learners and combining them to form a strong
learner. The idea behind boosting is that predictors should learn from mistakes that have
been made by previous predictors.
When an input is misclassified by a hypothesis, its weight is increased in the next iteration so
that the next hypothesis is more likely to classify it correctly. In the final combination, more
weight is given to the hypotheses that provide better performance on the training data. This
process, repeated over multiple iterations, combines a collection of weak learners into a
strong learner, thereby improving the model's performance.
In bagging, no bootstrap sample depends on any other bootstrap sample, so the base models
can be trained in parallel. Boosting works in a sequential manner and does not involve
bootstrap sampling. Both
bagging and boosting reduce the variance of a single estimate by combining several
estimates from different models into a single estimate. However, it is important to note that
boosting does not help significantly if the single model is overfitting. Bagging would be a
better option if the model overfits. On the other hand, boosting tries to reduce bias, while
bagging rarely improves bias.
In this chapter, we will introduce different boosting algorithms such as Adaptive Boosting
(AdaBoost), gradient boosting, and extreme gradient boosting (XGBoost).
AdaBoost focuses on combining a set of weak learners into a strong learner. The process of
an AdaBoost classifier is as follows:
1. Initially, a short decision tree classifier is fitted onto the data. The decision tree
can just have a single split, which is known as a decision stump. The overall
errors are evaluated. This is the first iteration.
The concept behind this algorithm is to distribute the weights to the training example and
select the classifier with the lowest weighted error. Finally, it constructs a strong classifier
as a linear combination of these weak learners:

F(x) = sign( Σ (m=1 to M) θm fm(x) )

Here, F(x) represents the strong classifier, the θm values represent the weights, and the
fm(x) terms represent the weak classifiers.
The AdaBoost classifier takes various parameters. The important ones are explained as
follows:
Getting ready
To start with, import the os and the pandas packages and set your working directory
according to your requirements:
# import required packages
import os
import pandas as pd
import numpy as np
os.getcwd()
Download the breastcancer.csv dataset from GitHub and copy it to your working
directory. Read the dataset:
df_breastcancer = pd.read_csv("breastcancer.csv")
Take a look at the first few rows with the head() function:
df_breastcancer.head(5)
Notice that the diagnosis variable has values such as M and B, representing Malignant and
Benign, respectively. We will perform label encoding on the diagnosis variable so that we
can convert the M and B values into numeric values:
from sklearn.preprocessing import LabelEncoder

lb = LabelEncoder()
df_breastcancer['diagnosis'] = lb.fit_transform(df_breastcancer['diagnosis'])
df_breastcancer.head(5)
We now separate our target and feature set. We also split our dataset into training and
testing subsets (a sketch of the split follows the code below):
# Create feature & response variables
# Drop the response var and id column as it'll not make any sense to the
analysis
X = df_breastcancer.iloc[:,2:31]
# Target
Y = df_breastcancer.iloc[:,0]
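The train/test split itself is not shown in the excerpt; a minimal sketch, with illustrative test
size and random_state values, is:
# Hedged sketch of the train/test split described above
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)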
Now, we will move on to building our model using the AdaBoost algorithm.
It is important to note that the accuracy and AUC scores may differ
because of random splits and other randomness factors.
How to do it...
We will now look at how to use an AdaBoost to train our model:
1. Before we build our first AdaBoost model, let's train our model using
the DecisionTreeClassifier:
from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier(max_depth=3, random_state=0)
dtree.fit(X_train, Y_train)
2. We can see our accuracy and Area Under the Curve (AUC) with the following
code:
from sklearn.metrics import roc_curve, auc

# Mean accuracy
print('The mean accuracy is: ', (dtree.score(X_test, Y_test))*100, '%')

# AUC score
y_pred_dtree = dtree.predict_proba(X_test)
fpr_dtree, tpr_dtree, thresholds = roc_curve(Y_test, y_pred_dtree[:,1])
auc_dtree = auc(fpr_dtree, tpr_dtree)
print('AUC Value: ', auc_dtree)
We get an accuracy score and an AUC value of 91.81% and 0.91, respectively.
Note that these values might be different for different users due to randomness.
3. Now, we will build our AdaBoost model using the scikit-learn library. We will
use the AdaBoostClassifier to build our AdaBoost model. AdaBoost uses a
decision tree as its base classifier by default; here, we explicitly pass the dtree
estimator we built in Step 1:
from sklearn.ensemble import AdaBoostClassifier

AdaBoost = AdaBoostClassifier(n_estimators=100, base_estimator=dtree,
learning_rate=0.1, random_state=0)
AdaBoost.fit(X_train, Y_train)
4. We check the accuracy and AUC value of the model on our test data:
# Mean accuracy
print('The mean accuracy is: ', (AdaBoost.score(X_test, Y_test))*100, '%')

# AUC score
y_pred_adaboost = AdaBoost.predict_proba(X_test)
fpr_ab, tpr_ab, thresholds = roc_curve(Y_test, y_pred_adaboost[:,1])
auc_adaboost = auc(fpr_ab, tpr_ab)
print('AUC Value: ', auc_adaboost)
We notice that we get an accuracy score of 92.82% and an AUC value of 0.97. Both
of these metrics are higher than the decision tree model we built in Step 1.
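5. Next, we fine-tune the hyperparameters of our AdaBoost model. The following is a sketch based on the description in the How it works... section, which keeps n_estimators at 100 and raises the learning_rate to 0.4:

AdaBoost_with_tuning = AdaBoostClassifier(n_estimators=100, base_estimator=dtree,
learning_rate=0.4, random_state=0)
AdaBoost_with_tuning.fit(X_train, Y_train)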
6. Now, we will check the accuracy and AUC values of our new model on our test
data:
# Mean accuracy
print('The mean accuracy is: ', (AdaBoost_with_tuning.score(X_test, Y_test))*100, '%')

# AUC score
y_pred_adaboost_tune = AdaBoost_with_tuning.predict_proba(X_test)
fpr_ab_tune, tpr_ab_tune, thresholds = roc_curve(Y_test, y_pred_adaboost_tune[:,1])
auc_adaboost_tune = auc(fpr_ab_tune, tpr_ab_tune)
print('AUC Value: ', auc_adaboost_tune)
We notice the accuracy drops to 92.39%, but that we get an improved AUC value of 0.98.
How it works...
In Step 1, we used the DecisionTreeClassifier to build our model. In Step 2, we noticed
that our mean accuracy and the AUC score were 91.81% and 0.91, respectively. We aimed
to improve this using the AdaBoost algorithm.
Note that the AdaBoost algorithm uses a decision tree as the base classifier by default. In
Step 3, we trained our model using AdaBoost with the default base learner. We set
n_estimators to 100 and the learning_rate to 0.1. We checked our mean accuracy
and AUC value in Step 4. We noticed that we got a decent improvement in the mean
accuracy and the AUC as they jumped to 93.57% and 0.977, respectively.
In Step 5, we fine-tuned some of the hyperparameters for our AdaBoost algorithm, which
used a decision tree as the base classifier. We set the n_estimators to 100 and the
learning_rate to 0.4. Step 6 gave us the accuracy and AUC values for the model we built
in Step 5. We saw that the accuracy dropped to 93.56% and that the AUC stayed similar at
0.981.
There's more...
Here, we will showcase training a model using AdaBoost with a support vector
machine (SVM) as the base learner.
By default, AdaBoost uses a decision tree as the base learner. We can use different base
learners as well. In the following example, we have used an SVM as our base learner with
the AdaBoost algorithm. We use SVC with rbf as the kernel:
from sklearn.svm import SVC
Adaboost_with_svc_rbf = AdaBoostClassifier(n_estimators=100,
base_estimator=SVC(probability=True, kernel='rbf'), learning_rate=1,
random_state=0)
Adaboost_with_svc_rbf.fit(X_train, Y_train)
We can check the accuracy and the AUC values of our AdaBoost model with support
vector classifier (SVC) as the base learner:
# Mean accuracy
print('The mean accuracy is: ', (Adaboost_with_svc_rbf.score(X_test, Y_test))*100, '%')

# AUC score
y_pred_svc_rbf = Adaboost_with_svc_rbf.predict_proba(X_test)
fpr_svc_rbf, tpr_svc_rbf, thresholds = roc_curve(Y_test, y_pred_svc_rbf[:,1])
auc_svc_rbf = auc(fpr_svc_rbf, tpr_svc_rbf)
print('AUC Value: ', auc_svc_rbf)
We notice that the accuracy and AUC values fall to 62.57% and 0.92, respectively.
Now, we will rebuild our AdaBoost model with SVC. This time, we will use a linear kernel, keeping the other settings the same as before:
Adaboost_with_svc_linear = AdaBoostClassifier(n_estimators=100,
base_estimator=SVC(probability=True, kernel='linear'), learning_rate=1,
random_state=0)
Adaboost_with_svc_linear.fit(X_train, Y_train)

We now get a mean accuracy of 90.64% and a decent AUC value of 0.96.
We will now plot a graph to compare the AUC value of each model using the following
code:
import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(8,8))
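# The individual ROC curves are plotted here; a sketch using the fpr/tpr arrays
# computed earlier (the legend labels are assumptions):
plt.plot(fpr_dtree, tpr_dtree, label='Decision Tree (AUC = %0.2f)' % auc_dtree)
plt.plot(fpr_ab, tpr_ab, label='AdaBoost (AUC = %0.2f)' % auc_adaboost)
plt.plot(fpr_ab_tune, tpr_ab_tune, label='AdaBoost tuned (AUC = %0.2f)' % auc_adaboost_tune)
plt.plot(fpr_svc_rbf, tpr_svc_rbf, label='AdaBoost with SVC rbf (AUC = %0.2f)' % auc_svc_rbf)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')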
plt.legend(loc=5)
plt.show()
We can also plot the accuracy of all the models with the following code:
import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(8,8))
values = [dtree.score(X_test,Y_test),
AdaBoost.score(X_test,Y_test),
AdaBoost_with_tuning.score(X_test,Y_test),
Adaboost_with_svc_rbf.score(X_test,Y_test),
Adaboost_with_svc_linear.score(X_test,Y_test)]
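# Display labels for the five models, in the same order as 'values'
# (the exact label text is an assumption)
label = ['Decision Tree', 'AdaBoost', 'AdaBoost with tuning',
'AdaBoost with SVC (rbf)', 'AdaBoost with SVC (linear)']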
def plot_bar_accuracy():
    # this is for plotting purpose
    index = np.arange(len(label))
    plt.bar(index, values)
    plt.xlabel('Algorithms', fontsize=10)
    plt.ylabel('Accuracy', fontsize=10)
    plt.xticks(index, label, fontsize=10, rotation=90)
    plt.title('Model Accuracies')
    plt.show()

plot_bar_accuracy()
See also
We can also use grid search with AdaBoost:
#grid search using svm
from sklearn.model_selection import GridSearchCV

# Parameter grid, as described in the paragraph that follows
Ada_Grid = {'n_estimators': [10, 30, 40, 100],
            'learning_rate': [0.1, 0.2, 0.3]}

Adaboost_with_svc = AdaBoostClassifier(n_estimators=100,
base_estimator=SVC(probability=True, kernel='linear'), learning_rate=1,
algorithm='SAMME')
estimator = Adaboost_with_svc
Adaboost_with_grid_search = GridSearchCV(estimator, Ada_Grid).fit(X_train, Y_train)
print(Adaboost_with_grid_search.best_params_)
print(Adaboost_with_grid_search.best_score_)
In the preceding code, we performed a grid search with the n_estimators set to 10, 30,
40, and 100, and learning_rate set to 0.1, 0.2, and 0.3.
Gradient boosting trains models in a sequential manner: each new model is fitted so as to reduce the errors made by the ensemble built so far.
While the AdaBoost model identifies errors by using weights that have been assigned to the
data points, gradient boosting does the same by calculating the gradients in the loss
function. The loss function is a measure of how a model is able to fit the data on which it is
trained and generally depends on the type of problem being solved. If we are talking about
regression problems, mean squared error may be used, while in classification problems, the
logarithmic loss can be used. The gradient descent procedure is used to minimize loss when
adding trees one at a time. Existing trees in the model remain the same.
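In other words, the ensemble is built stage by stage. Using standard gradient boosting notation (not taken from this book), the update at stage m can be written as F_m(x) = F_{m-1}(x) + ν · h_m(x), where h_m is a new tree fitted to the negative gradient (the pseudo-residuals) of the loss function evaluated at F_{m-1}, and ν is the learning rate.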
Getting ready
We will take the same dataset that we used for training our AdaBoost model. In this
example, we will see how we can train our model using gradient boosting machines. We
will also look at a handful of hyperparameters that can be tuned to improve the model's
performance.
Then, we read our data and label encode our target variables to 1 and 0:
# Read the Dataset
df_breastcancer = pd.read_csv("breastcancer.csv")
Then, separate our target and feature variables. We split our data into train and test subsets:
# create feature & response variables
# drop the response var and id column as it'll not make any sense to the analysis
X = df_breastcancer.iloc[:,2:31]

# Target variable
Y = df_breastcancer.iloc[:,0]
This is the same code that we used in the Getting ready section of the
AdaBoost example.
How to do it...
We will now look at how to use gradient boosting machines to train our model:
1. First, we build the model with GradientBoostingClassifier and fit it to our
training data:
from sklearn.ensemble import GradientBoostingClassifier

GBM_model = GradientBoostingClassifier()
GBM_model.fit(X_train, Y_train)
2. Here, we must pass our test data to the predict() function to make the
predictions using the model we built in Step 1:
Y_pred_gbm = GBM_model.predict(X_test)
5. We can check the test accuracy and the AUC value with accuracy_score()
and roc_auc_score().
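A minimal sketch of this check, assuming the metric functions are imported from sklearn.metrics, is:

from sklearn.metrics import accuracy_score, roc_auc_score

print('Test accuracy: ', accuracy_score(Y_test, Y_pred_gbm))
print('Test AUC: ', roc_auc_score(Y_test, GBM_model.predict_proba(X_test)[:,1]))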
How it works...
In Step 1, we trained a gradient boosting classifier model. In Step 2, we used the
predict() method to make predictions on our test data.
In Step 4, we used confusion_matrix() to generate the confusion matrix to see the true
positives, true negatives, false positives, and false negatives.
In Step 5, we looked at the accuracy and the AUC values of our test data using
the accuracy_score() and roc_auc_score() functions.
In the next section, we will tune our hyperparameters using a grid search to find the
optimal model.
There's more...
We will now look at how to fine-tune the hyperparameters for gradient boosting machines. We start by defining a grid of hyperparameter values to search over (the dictionary name and the learning_rate values shown here are assumptions):
grid_params = {
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [3, 5, 8],
    "max_features": ["log2", "sqrt"],
    "criterion": ["friedman_mse", "mae"],
    "subsample": [0.3, 0.6, 1.0]
}
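The grid search itself can then be set up and fitted; a minimal sketch, assuming five-fold cross-validation:

grid = GridSearchCV(GradientBoostingClassifier(), grid_params, cv=5)
grid.fit(X_train, Y_train)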
5. We pass our test data to the predict method to get the predictions:
grid_predictions = grid.predict(X_test)
The output shows that the average precision and the f1-score improved from the previous case:
7. Now, we will take a look at the confusion matrix and plot it, like we did earlier:
cnf_matrix = confusion_matrix(Y_test, grid_predictions)
plot_confusion_matrix(cnf_matrix,classes=[0,1])
We notice that the accuracy remains the same but that the AUC improves from
0.96 to 0.97:
Some of the important parameters that are used in XGBoost are as follows:
sample_rate: This specifies the row sampling rate (the x axis) for each tree. For
example, setting it to 0.5 tells XGBoost to randomly collect half of the data
instances to grow trees. The default value is 1 and the range is 0.0 to 1.0. Higher
values may improve training accuracy.
col_sample_rate: This specifies the column sampling rate (the y axis) for each
split in each level. The default value is 1.0 and the range is from 0 to 1.0. Higher
values may improve training accuracy.
Getting ready...
You will need the XGBoost library installed to continue with this recipe. You can use the
pip command to install the XGBoost library as follows:
pip install xgboost

We then import the required packages and read our dataset:
import itertools
import pandas as pd

df_glassdata = pd.read_csv('glassdata.csv')
df_glassdata.shape
This data has been taken from the UCI ML repository. The column names have been changed according to the data description that's provided at the following link: https://bit.ly/2EZX6IC.
We split our data into a target and feature set, and verify it. Note that we ignore the ID
column:
# split data into X and Y
X = df_glassdata.iloc[:,1:10]
Y = df_glassdata.iloc[:,10]
print(X.shape)
print(Y.shape)
How to do it...
Now, we will proceed to build our first XGBoost model:
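The model-fitting and tree-plotting code can be sketched as follows; the split ratio is an assumption, and the plot_tree() usage follows the description in the How it works... section:

from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier, plot_tree
from matplotlib import pyplot

# Split the glass data into train and test subsets (assumed 70/30 split)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)

# 1. Fit the XGBoost classifier to the training data
xg_model = XGBClassifier()
xg_model.fit(X_train, Y_train)

# 2. Plot the first boosted tree from left to right
plot_tree(xg_model, num_trees=0, rankdir='LR')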
fig = pyplot.gcf()
fig.set_size_inches(30, 30)
With num_trees=0, we get the first boosted tree. We can view the other boosted
trees by setting the index value to the num_trees parameter.
You will need the graphviz library installed on your system to plot
the boosted trees.
4. We will now use predict() on our test data to get the predicted values. We can
see our test accuracy with accuracy_score():
from sklearn.metrics import accuracy_score

test_predictions = xg_model.predict(X_test)
test_accuracy = accuracy_score(Y_test, test_predictions)
print("Test accuracy: %.2f%%" % (test_accuracy * 100.0))
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
    plt.text(j, i, format(cm[i, j], fmt),
             horizontalalignment="center",
             color="white" if cm[i, j] > thresh else "black")

plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.tight_layout()
7. We then look at the unique values of our target variable to set the names of each
level of our target variable:
Y.unique()

# Class names for the confusion matrix plot (derived here from the unique target
# values; this assignment is an assumption)
target_names = [str(c) for c in sorted(Y.unique())]

plt.figure()
plot_confusion_matrix(cm, classes=target_names)
plt.show()
We can now visualize the confusion matrix, as shown in the following screenshot:
How it works...
In Step 1, we fit the XGBClassifier to our train data. In Step 2 and Step 3, we
visualized the individual boosted trees. To do this, we used the plot_tree() function. We
passed our XGBoost model to plot_tree() and set the index of the tree by setting the
num_trees parameter. The rankdir='LR' parameter plotted the tree from left to right.
Setting rankdir to TB would plot a vertical tree, from top to bottom.
In Step 4, we passed our test subset to predict() to get the test accuracy. Step 5 gave us the
confusion matrix. In Step 6, we sourced a predefined
function, plot_confusion_matrix(), from scikit-learn.org. We used this function to
plot our confusion matrix. In Step 7, we looked at the unique values of our target variable so
that we could set the names for each class of our confusion matrix plot. We then plotted our
confusion matrix to evaluate our model.
There's more...
In this section, we will look at how we can check feature importance and perform feature
selection based on that. We will also look at how we can evaluate the performance of our
XGBoost model using cross-validation.
from xgboost import plot_importance

plot_importance(xg_model)
After executing the preceding code, we get to see the following chart, which shows feature
importance in descending order of importance:
In the following example, the SelectFromModel takes the pretrained XGBoost model and
provides a subset from our dataset with the selected features. It decides on the selected
features based on a threshold value.
Features that have an importance that is greater than or equal to the threshold value are
kept, while any others are discarded:
# The threshold values to use for feature selection
feature_importance = np.sort(xg_model.feature_importances_)
from sklearn.feature_selection import SelectFromModel

# Loop over each importance value as a threshold; the loop header and the
# SelectFromModel call here are assumptions based on the description above
for each_threshold in feature_importance:
    selection = SelectFromModel(xg_model, threshold=each_threshold, prefit=True)
    selected_feature_X_train = selection.transform(X_train)

    # Train the model
    selection_model = XGBClassifier()
    selection_model.fit(selected_feature_X_train, Y_train)

    # Reduce X_test only to the selected feature
    selected_feature_X_test = selection.transform(X_test)

    # Predict using the test value of the selected feature
    predictions = selection_model.predict(selected_feature_X_test)
    accuracy = accuracy_score(Y_test, predictions)
    print("Threshold=%.5f, Number of Features=%d, Model Accuracy: %.2f%%" %
          (each_threshold, selected_feature_X_train.shape[1], accuracy*100))
We notice that the performance of the model fluctuates with the number of selected
features. Based on the preceding output, we decide to opt for five features that give us an
accuracy value of 72%. Also, if we use the Occam's razor principle, we can probably opt for
a simpler model with four features that gives us a slightly lower accuracy of 71%.
We can also evaluate our models using cross-validation. To perform k-fold cross-validation,
we must import the KFold class from sklearn.model_selection.
First, we create the KFold object and mention the number of splits that we would like to
have:
kfold = KFold(n_splits=40, random_state=0)
xg_model_with_kfold = XGBClassifier()
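# Evaluate the model with cross-validation (a sketch; cross_val_score is
# assumed to be imported from sklearn.model_selection)
from sklearn.model_selection import cross_val_score

results = cross_val_score(xg_model_with_kfold, X, Y, cv=kfold)
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100.0, results.std()*100.0))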
With cross_val_score(), we evaluate our model, which gives us the mean and standard
deviation classification accuracy. We notice that we get a mean accuracy of 77.92% and a
standard deviation of 22.33%.
If you have many classes for a multi-class classification task, you may use stratified folds
when performing cross-validation:
from sklearn.model_selection import StratifiedKFold

Stratfold = StratifiedKFold(n_splits=40, random_state=0)
xg_model_with_stratfold = XGBClassifier()
See also
LightGBM is open source software for the gradient boosting framework that
was developed by Microsoft. It uses tree-based algorithms differently from other
Gradient Boosting Machines (GBMs): https://bit.ly/2QW53jH
8
Blend It with Stacking
In this chapter, we will cover the following recipes:
Technical requirements
The technical requirements for this chapter remain the same as those we detailed in earlier
chapters.
Visit the GitHub repository to find the dataset and the code. The datasets and code files are
arranged according to chapter numbers, and by the name of the topic.
Because the predictions from the base learners are blended together, stacking is also
referred to as blending.
It is important for stacked generalization that the predictions from the base learners are
not correlated with each other. In order to get uncorrelated predictions from the base
learners, algorithms that use different approaches internally may be used to train the base
learners. Stacked generalization is used mainly for minimizing the generalization error of
the base learners, and can be seen as a refined version of cross-validation. It uses a strategy
that's more sophisticated than cross-validation's winner-takes-all approach for combining
the predictions from the base learners.
Getting ready...
In this example, we use a dataset from the UCI ML Repository on credit card defaults. This
dataset contains information on default payments, demographic factors, credit data, history
of payments, and bill statements of credit card clients. The data and the data descriptions
are provided in the GitHub repository.
We will start by loading the required libraries and reading our dataset:
import os
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
Let's now read our data. We will prefix the DataFrame name with df_ so that we can
understand it easily:
df_creditcarddata = pd.read_csv("UCI_Credit_Card.csv")
We notice that the dataset now has 30,000 observations and 24 columns. Let's now move on
to training our models.
How to do it...
1. We split our target and feature variables:
from sklearn.model_selection import train_test_split
X = df_creditcarddata.iloc[:,0:23]
Y = df_creditcarddata['default.payment.next.month']
# Then we take the train subset and carve out a validation set from the same
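# A sketch of the two-stage split; the split sizes are assumptions chosen to
# match the subset sizes reported later (3,000 test rows and 5,400 validation rows)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=1)
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.2, random_state=1)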
3. Check the dimensions of each subset to ensure that our splits are correct:
# Dimensions for train subsets
print(X_train.shape)
print(Y_train.shape)
4. Import the required libraries for the base learners and the meta-learner:
# for the base learners
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# for the meta-learner
from sklearn.linear_model import LogisticRegression
5. Create instances of the base learners and fit the model on our training data:
# The base learners
model_1 = GaussianNB()
model_2 = KNeighborsClassifier(n_neighbors=1)
model_3 = DecisionTreeClassifier()
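Fitting the base learners and generating their validation-set predictions can be sketched as follows; the variable names are chosen to match the code that follows later in this recipe:

# Fit each base learner on the training subset
base_learner_1 = model_1.fit(X_train, Y_train)
base_learner_2 = model_2.fit(X_train, Y_train)
base_learner_3 = model_3.fit(X_train, Y_train)

# 6. Use the fitted base learners to predict the target on the validation subset
val_prediction_base_learner_1 = base_learner_1.predict(X_val)
val_prediction_base_learner_2 = base_learner_2.predict(X_val)
val_prediction_base_learner_3 = base_learner_3.predict(X_val)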
7. We have three sets of prediction results from three base learners. We use them to
create a stacked array:
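# A sketch of the stacking step; np.dstack is an assumption that matches the
# [0, 0:n] indexing used for the test stack later in this recipe
final_train_stack = np.dstack([val_prediction_base_learner_1,
val_prediction_base_learner_2,
val_prediction_base_learner_3,
Y_val])

# 8. Convert the stacked array to a DataFrame and name the columns
stacked_train_dataframe = pd.DataFrame(final_train_stack[0,0:5400],
columns='NB_VAL KNN_VAL DT_VAL Y_VAL'.split())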
print(stacked_train_dataframe.shape)
print(stacked_train_dataframe.head(5))
In the following image, we see that the stacked array now has 5,400 observations
and 4 columns:
9. Train the meta-learner using the stacked array that we created in Step 8:
# Build the meta-learner
meta_learner = LogisticRegression()
meta_learner_model = meta_learner.fit(stacked_train_dataframe.iloc[:,0:3],
stacked_train_dataframe['Y_VAL'])
10. Create the stacked test set with the testing subset:
# Take the test data (new data)
# Apply the base learners on this new data to make predictions
# We now use the models to make predictions on the test data and create a new stacked dataset
test_prediction_base_learner_1 = base_learner_1.predict(X_test)
test_prediction_base_learner_2 = base_learner_2.predict(X_test)
test_prediction_base_learner_3 = base_learner_3.predict(X_test)
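# Stack the three sets of test predictions into an array (a sketch that mirrors
# the training stack built earlier)
final_test_stack = np.dstack([test_prediction_base_learner_1,
test_prediction_base_learner_2,
test_prediction_base_learner_3])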
11. Convert the final_test_stack stacked array to a DataFrame and add column
names to each of the columns. Verify the dimensions and take a look at the first
few rows:
stacked_test_dataframe = pd.DataFrame(final_test_stack[0,0:3000],
columns='NB_TEST KNN_TEST DT_TEST'.split())
print(stacked_test_dataframe.shape)
print(stacked_test_dataframe.head(5))
We see that the stacked array now has 3,000 observations and 3 columns in
stacked_test_dataframe:
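12. We can check how each base learner performs on the original test subset; a minimal sketch using accuracy_score():

print("Accuracy from GaussianNB:", accuracy_score(Y_test, test_prediction_base_learner_1))
print("Accuracy from KNN:", accuracy_score(Y_test, test_prediction_base_learner_2))
print("Accuracy from Decision Tree:", accuracy_score(Y_test, test_prediction_base_learner_3))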
We notice that the accuracy is as follows. Note that based on the sampling
strategy and hyperparameters, the results may vary:
13. Use the meta-learner on the stacked test data and check the accuracy:
test_predictions_meta_learner = meta_learner_model.predict(stacked_test_dataframe)
print("Accuracy from Meta Learner:", accuracy_score(Y_test, test_predictions_meta_learner))
We see the following output returned by the meta-learner applied on the stacked
test data. This accuracy is higher than the individual base learners:
How it works...
In Step 1, we split our dataset into target and feature sets. In Step 2, we created our training,
validation, and testing subsets. We took a look at the dimensions of each of the subset in
Step 3 to verify that the splits were done correctly.
We then moved on to building our base learners and the meta-learner. In Step 4, we
imported the required libraries for the base learners and the meta-learner. For the base
learners, we used Gaussian Naive Bayes, KNN, and a decision tree, while for the meta-
learner we used logistic regression.
In Step 5, we fitted the base learners to our train dataset. Single models, including Gaussian
Naive Bayes, KNN, and a decision tree, are established in the level 0 space. We then had
three base models.
In Step 6, we used these three base models on our validation subset to predict the target
variable. We then had three sets of predictions given by the respective base learners.
Now the base learners will be integrated by logistic regression in the level 1 space via
stacked generalization. In Step 7, we stacked the three sets of predicted values to create an
array. We also stacked the actual target variable of our training dataset to the array. We
then had four columns in our array: three columns from the three sets of predicted values
of the base learners and a fourth column from the target variable of our training dataset.
We called the array final_train_stack and the DataFrame built from it
stacked_train_dataframe, and we named the columns according to the algorithm used for each base learner. In our case, we used the
names NB_VAL, KNN_VAL, and DT_VAL since we used Gaussian Naive Bayes, KNN, and a
decision tree classifier, respectively. Because the base learners are fitted to our validation
subset, we suffixed the column names with _VAL to make them easier to understand.
In Step 9, we built the meta-learner with logistic regression and fitted it to our stacked
dataset, stacked_train_dataframe. Notice that we moved away from our original
dataset to a stacked dataset, which contains the predicted values from our base learners.
In Step 10, we used the base models on our test subset to get the predicted results. We
called it final_test_stack. In Step 11, we converted the final_test_stack array to a
DataFrame called stacked_test_dataframe. Note that in
our stacked_test_dataframe, we only had three columns, which held the predicted
values returned by the base learners applied on our test subset. The three columns were
named after the algorithm used, suffixed with _TEST, so we have NB_TEST, KNN_TEST, and
DT_TEST as the three columns in stacked_test_dataframe.
In Step 12, we checked the accuracy of the base models on our original test subset. The
Gaussian Naive Bayes, KNN, and decision tree classifier models gave us accuracy ratings of
0.39, 0.69, and 0.73, respectively.
In Step 13, we checked the accuracy that we get by applying the meta-learner model on our
stacked test data. This gave us an accuracy of 0.77, which we can see is higher than the
individual base learners. However, bear in mind that simply adding more base learners to
your stacking algorithm doesn't guarantee that you'll get better accuracy.
There's more...
Creating a stacking model can be tedious. The mlxtend library provides tools that simplify
building the stacking model. It provides StackingClassifier, which is the ensemble-learning
meta-classifier for stacking, and it also provides StackingCVClassifier, which uses cross-
validation to prepare the input for the second level meta-learner to prevent overfitting.
You can download the library from https://pypi.org/project/mlxtend/ or use the pip
install mlxtend command to install it. You can find some great examples of simple
stacked classification and stacked classification with grid search at http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/.
See also
You can also take a look at the ML-Ensemble library. To find out more about ML-
Ensemble, visit http://ml-ensemble.com/. A guide to using ML-Ensemble is available
at https://bit.ly/2GFsxJN.
H2O's stacked ensemble method is an ensemble machine learning algorithm for supervised
problems that finds the optimal combination of a collection of predictive algorithms using
stacking. H2O's stacked ensemble supports regression, binary classification, and multiclass
classification.
In this example, we'll take a look at how to use H2O's stacked ensemble to build a stacking
model. We'll use the bank marketing dataset, which is available in the GitHub repository.
Getting ready...
First, import the h2o library and other modules from H2O:
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
from h2o.grid.grid_search import H2OGridSearch
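# Start a local H2O cluster before reading any data
h2o.init()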
Once we run the preceding code, the h2o instance gets initialized and we will see the
following output:
Now that we have instantiated an H2O instance, we move onto reading our dataset and
building stacking models.
How to do it...
1. We read our data using the h2o.import_file() function. We pass the filename
to the function as the parameter:
df_bankdata = h2o.import_file("bank-full.csv")
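2. We split the H2OFrame into training and testing subsets; a minimal sketch, assuming an 80/20 split:

train, test = df_bankdata.split_frame(ratios=[0.8], seed=1)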
3. We check the dimensions of the training and testing subsets to verify that the
splits are OK:
train.shape, test.shape
4. We take a look at the first few rows to ensure that data is loaded correctly:
df_bankdata.head()
5. We separate the target and predictor column names, which are the response
and predictors, respectively:
# Set the predictor names
predictors = train.columns
print(predictors)
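# The response column in bank-full.csv is assumed to be named "y"
response = "y"
# Remove the response from the list of predictors
predictors.remove(response)

# 6. Convert the response column to a categorical type
train[response] = train[response].asfactor()
test[response] = test[response].asfactor()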
7. We will train our base learners using cross-validation. We set the nfolds value
to 5. We also set a variable encoding to 'OneHotExplicit'. We will use this
variable to encode our categorical variables:
# Number of CV folds
nfolds = 5
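# Encoding scheme for categorical variables, as mentioned above
encoding = "OneHotExplicit"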
8. We start training our base learners. We choose the Gradient Boosting Machine
algorithm to build our first base learner:
# Train and cross-validate a GBM
base_learner_gbm = H2OGradientBoostingEstimator(distribution="bernoulli",\
ntrees=100,\
max_depth=5,\
min_rows=2,\
learn_rate=0.01,\
nfolds=nfolds,\
fold_assignment="Modulo",\
categorical_encoding = encoding,\
keep_cross_validation_predictions=True)
base_learner_gbm.train(x=predictors, y=response,
training_frame=train)
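9. For our second base learner, we train and cross-validate a random forest; the following is a minimal sketch in which the ntrees value is an assumption, while the remaining settings mirror the GBM base learner:

# Train and cross-validate a random forest
base_learner_rf = H2ORandomForestEstimator(ntrees=250,\
nfolds=nfolds,\
fold_assignment="Modulo",\
keep_cross_validation_predictions=True)
base_learner_rf.train(x=predictors, y=response, training_frame=train)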
10. For our third base learner, we implement a Generalized Linear Model (GLM):
# Train and cross-validate a GLM
base_learner_glm = H2OGeneralizedLinearEstimator(family="binomial",\
model_id="GLM",\
lambda_search=True,\
nfolds = nfolds,\
fold_assignment =
"Modulo",\
keep_cross_validation_predictions = True)
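# Train the GLM base learner (a sketch following the same pattern as the other base learners)
base_learner_glm.train(x=predictors, y=response, training_frame=train)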
11. Get the best-performing base learner on the test set in terms of the test AUC.
Compare this with the test AUC of the stacked ensemble model:
# Compare to base learner performance on the test set
gbm_test_performance = base_learner_gbm.model_performance(test)
rf_test_performance = base_learner_rf.model_performance(test)
glm_test_performance = base_learner_glm.model_performance(test)
baselearner_best_auc_test = max(gbm_test_performance.auc(),
rf_test_performance.auc(), glm_test_performance.auc())
print("Best AUC from the base learners", baselearner_best_auc_test)
stack_auc_test = perf_stack_test.auc()
print("Best Base-learner Test AUC: ", baselearner_best_auc_test)
print("Ensemble Test AUC: ", stack_auc_test)
12. We train a stacked ensemble using the base learners we built in the preceding
steps:
all_models = [base_learner_glm, base_learner_gbm, base_learner_rf]
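A minimal sketch of the ensemble itself; the use of deep learning as the meta-learner follows the How it works... section, while the remaining settings are assumptions:

ensemble = H2OStackedEnsembleEstimator(base_models=all_models,
metalearner_algorithm="deeplearning")
ensemble.train(x=predictors, y=response, training_frame=train)

# Evaluate the stacked ensemble on the test subset
perf_stack_test = ensemble.model_performance(test)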
How it works...
In Step 1, we used the h2o.import_file() function to read our dataset.
In Step 2, we split our H2OFrame into training and testing subsets. In Step 3, we checked the
dimensions of these subsets to verify that our split is adequate for our requirements.
In Step 4, we took a look at the first few rows to check if the data is correctly loaded. In Step
5, we separated out the column names of our response and predictor variables, and in Step
6, we converted the response variables into a categorical type with the asfactor()
function.
We defined a variable called nfolds in Step 7, which we used for cross-validation. We have
also defined a variable encoding which we used in the next steps to instruct H2O to use
one-hot encoding for categorical variables. In Step 8 to Step 10, we built our base learners.
In Step 8, we trained a Gradient Boosting Machine model, passing values to a few of its
hyperparameters, such as ntrees, max_depth, min_rows, learn_rate, nfolds,
fold_assignment, and categorical_encoding.
Note that for all base learners, cross-validation folds must be the same and
keep_cross_validation_predictions must be set to True.
In Step 9, we trained a random forest base learner using the following hyperparameters:
ntrees, nfolds, fold_assignment.
In Step 10, we trained our algorithm with a GLM. Note that we have not encoded the
categorical variables for the GLM. This is because H2O's GLM handles categorical columns
internally and can take advantage of the categorical columns for better performance and
efficient memory utilization.
From H2O.ai: "We strongly recommend avoiding one-hot encoding
categorical columns with any levels into many binary columns, as this is
very inefficient. This is especially true for Python users who are used to
expanding their categorical variables manually for other frameworks".
In Step 11, we generated the test AUC values for each of the base learners and printed the
best AUC.
In Step 12, we trained a stacked ensemble model by combining the output of the base
learners using H2OStackedEnsembleEstimator. We used the trained ensemble model on
our test subset. Note that by default GLM is used as the meta-learner
for H2OStackedEnsembleEstimator. However, we have used deep learning as the meta-
learner in our example.
Note that we have used default hyperparameters values for our meta-
learner. We can specify the hyperparameter values
with metalearner_params. The metalearner_params option allows
you to pass in a dictionary/list of hyperparameters to use for the
algorithm that is used as meta-learner.
There's more...
You may also assemble a list of models to stack together in different ways. In the preceding
example, we trained individual models and put them in a list to ensemble them. We can
also train a grid of models:
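1. The hyperparameter grid and the search criteria referred to in the next step can be sketched as follows; the parameter values are assumptions:

# Hyperparameters to search over for the random forest grid
hyper_params = {"max_depth": [3, 5, 9],
"sample_rate": [0.7, 0.9, 1.0]}

# Random search over the grid, limited to a handful of models
search_criteria = {"strategy": "RandomDiscrete", "max_models": 3, "seed": 1}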
2. We train the grid using the hyperparameters defined in the preceding code:
# Train the grid
grid = H2OGridSearch(model=H2ORandomForestEstimator(nfolds=nfolds,\
fold_assignment="Modulo",\
keep_cross_validation_predictions=True),\
hyper_params=hyper_params,\
search_criteria=search_criteria,\
grid_id="rf_grid_binomial")
stack_auc_test = perf_stack_test.auc()
The preceding code will give the best base-learner test AUC and the test AUC from the
ensemble model. If the response variable is highly imbalanced, consider fine-tuning
hyperparameters such as balance_classes, class_sampling_factors, and
max_after_balance_size to control oversampling and under-sampling.
See also
Take a look at StackNet, which was developed by Marios Michailidis as part of his PhD.
StackNet is available under the MIT licence. It's a scalable and analytical framework that
resembles a feed-forward neural network, and uses Wolpert's stacked-generalization
concept to improve accuracy in machine learning predictive tasks. It uses the notion of
meta-learners, in that it uses the predictions of some algorithms as features for other
algorithms. StackNet can also generalize stacking on multiple levels. It is, however,
computationally intensive. It was originally developed in Java, but a lighter Python version
of StackNet, named pystacknet, is now available as well.
Let's think about how StackNet works. In the case of a neural network, the output of one
layer is inserted as an input to the next layer and an activation function, such as sigmoid,
tanh, or relu, is applied. Similarly, in the case of StackNet, the activation functions can be
replaced with any supervised machine learning algorithm.
The stacking element can be run in two modes: a normal stacking mode and a re-stacking
mode. In the case of a normal stacking mode, each layer uses the predictions of the
previous one. In the case of re-stacking mode, each layer uses the neurons and activations
of the previous layers.
Sample code that uses StackNet would consist of the following steps:
2. We read the data, drop the ID column, and check the dimensions of the dataset:
df_creditcarddata = pd.read_csv("UCI_Credit_Card.csv")
df_creditcarddata = df_creditcarddata.drop(["ID"], axis=1)
df_creditcarddata.shape
3. We separate our target and predictor variables. We also split the data into
training and testing subsets:
#create the predictor & target set
X = df_creditcarddata.iloc[:,0:23]
Y = df_creditcarddata['default.payment.next.month']
4. We define the models for the base learners and the meta-learner:
models=[[DecisionTreeClassifier(criterion="entropy", max_depth=5,
max_features=0.5, random_state=1),
GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
max_depth=5, max_features=0.5, random_state=1),
LogisticRegression(random_state=1)],
[RandomForestClassifier (n_estimators=500, criterion="entropy",
max_depth=5, max_features=0.5, random_state=1)]]
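# A sketch of the StackNet model construction; the import path and parameter
# names here follow the pystacknet project README and are assumptions
from pystacknet.pystacknet import StackNetClassifier

model = StackNetClassifier(models, metric="auc", folds=4,
restacking=False, use_retraining=True,
use_proba=True, random_state=1, verbose=1)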
model.fit(X_train,Y_train )
There are various case studies of StackNet being used in winning competitions in Kaggle.
An example of how StackNet can be used is available at https://bit.ly/2T7339y.
9
Homogeneous Ensembles
Using Keras
In this chapter, we will cover the following topics:
Introduction
In the case of ensemble models, each base classifier must have some degree of diversity
within itself. This diversity can be obtained in one of the following manners:
In the case of ensemble models, where different algorithms are used for the base learners,
the ensemble is called a heterogeneous ensemble method. If the same algorithm is used for
all the base learners on different distributions of the training set, the ensemble is called a
homogeneous ensemble.
The focus of Keras is the idea of a model. Keras supports two types of models. The main
type of model is a sequence of layers, called Sequential. The other type of model in Keras is
the non-sequential model, built with the functional Model class.
Getting ready
We'll start by installing Keras. In order to install Keras, you will need to have Theano or
TensorFlow installed in your system. In this example, we'll go with TensorFlow as the
backend for Keras.
There are two variants of TensorFlow: a CPU version and a GPU version.
To install the CPU-only package, use the following command:
pip install tensorflow

If you have to install the GPU package, use the following command:
pip install tensorflow-gpu
Once you've installed TensorFlow, you'll need to install Keras using the following
command:
pip install keras

In order to upgrade your already-installed Keras library, use the following command:
sudo pip install --upgrade keras
Once we're done with installing the libraries, let's import the required libraries:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
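# Additional imports used later in this recipe (assumed here)
from sklearn.metrics import mean_squared_error
from keras.models import Sequential
from keras.layers import Dense, Activation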
How to do it...
We'll now build our test subset and train our neural network models:
1. Separate the test subset to apply the models in order to make predictions:
df_traindata, df_testdata = train_test_split(df_energydata,
test_size=0.3)
3. Take the test subset and split it into target and feature variables:
X_test = df_testdata.iloc[:,3:27]
Y_test = df_testdata.iloc[:,28]
4. Validate the preceding split by checking the shape of X_test and Y_test:
print(X_test.shape)
print(Y_test.shape)
5. Let's create multiple neural network models using Keras. We use a for loop to
build multiple models:
ensemble = 20
frac = 0.7
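# Accumulator for the summed predictions of all the models (assumed initialization)
predictions_total = np.zeros(len(X_test))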
for i in range(ensemble):
print("number of iteration:", i)
print("predictions_total", predictions_total)
############################################################
model = Sequential()
# Adding the input layer and the first hidden layer
model.add(Dense(units=16, kernel_initializer = 'normal',
activation = 'relu', input_dim = 24))
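# The remaining layers and the training step are sketched below; the layer sizes,
# optimizer, epochs, and batch size are assumptions
# Adding a second hidden layer and the output layer (single regression output)
model.add(Dense(units=8, kernel_initializer='normal', activation='relu'))
model.add(Dense(units=1, kernel_initializer='normal'))

# Configure the learning process with a loss function and an optimizer
model.compile(loss='mean_squared_error', optimizer='adam')

# Draw a bootstrap sample from the training data and fit the model on it
df_bootstrap = df_traindata.sample(frac=frac, replace=True)
X_train = df_bootstrap.iloc[:,3:27]
Y_train = df_bootstrap.iloc[:,28]
model.fit(X_train, Y_train, epochs=100, batch_size=32, verbose=0)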
############################################################
# We use predict() to predict our values
model_predictions = model.predict(X_test)
model_predictions = model_predictions.flatten()
print("TEST MSE for individual model: ",
mean_squared_error(Y_test, model_predictions))
print("")
print(model_predictions)
print("")
6. Take the summation of the predicted values and divide them by the number
of iterations to get the average predicted values. We use the average predicted
values to calculate the mean-squared error (MSE) for our ensemble:
predictions_total = predictions_total/ensemble
print("MSE after ensemble: ", mean_squared_error(np.array(Y_test),
predictions_total))
How it works...
Here's a diagrammatic representation of the ensemble homogeneous model workflow:
In the preceding diagram, we assume that we have 100 training samples. We train 100 models,
one on each sample, and apply them to our test sample. We get 100 sets of predictions,
which we ensemble by averaging when the target variable is numeric or when we are
predicting probabilities for a classification problem. In the case of class label
predictions, we would opt for max voting instead.
In Step 1, we separated our train and test samples. This is the same test sample that we used
for our predictions with all the models we built in this recipe. In Step 2, we checked the
shape of the train and test subsets. In Step 3, we split our test subset into target and
predictor variables, and then checked the shape again in Step 4 to ensure we got the right
split.
In Step 5, we used the Keras library to build our neural network models. We initialized two
variables, ensemble and frac. We used the ensemble variable to run a for loop for a
certain number of iterations (in our case, we set it to 20). We then used the frac variable
to assign the proportion of data we took for our bootstrap samples from the training subset.
In our example, we set frac to 0.7.
Within the for loop in Step 5, we built multiple neural network models and
applied them to our test subset to get the predictions. We created Sequential models
and added layers to them using the add() method. In the first layer, we specified the input
dimensions using the input_dim argument; because we have 24 input features, we set
input_dim to 24. We also specified the activation function to use in each layer by
setting the activation argument.
You can also set the activation function through an Activation layer, as follows:
# Example code to set activation function through the activation layer
model.add(Dense(64))
model.add(Activation('tanh'))
In this step, before we train our model, we configure the learning process using the
compile method. The compile method takes a mandatory loss function, a mandatory
optimizer, and optional metrics as its arguments.
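A minimal sketch of such a compile call for this regression problem (the optimizer and metric
shown here are illustrative choices rather than the book's exact settings) is:
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae'])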
Once we finished all the iterations of the for loop in Step 5, we divided the summation of
the predictions by the number of iterations, which is held in the ensemble variable (set to
20), to get the average predictions. We used the average predictions to calculate the MSE of
the ensemble result.
There's more...
If your models have high computational requirements, you can use Google
Colaboratory. Colaboratory is a free Jupyter Notebook environment that requires no setup
and runs entirely in the cloud. It is a free cloud service that offers free GPU support. You can
use Google Colab to build your deep learning applications using TensorFlow, Keras, PyTorch,
and OpenCV.
Once you create your account with https://colab.research.google.com/, you can log in
using your credentials.
Once you're logged in, you can move straight to the File menu to create your Python
notebook:
Once you click on the File menu, you'll see New Python 3 notebook; clicking it creates a new
notebook that supports Python 3.
You can click on Untitled0.ipynb in the top-left corner to rename the file:
Go to Edit and then Notebook settings. A window pops up showing the different
settings you can change:
Choose the Graphics Processing Unit (GPU) option as the Hardware accelerator, as shown
in the preceding screenshot, in order to use the free GPU.
One neat thing about Google Colab is that it can work with your own Google Drive. You can
choose to create your own folder in your Google Drive or use the default Colab Notebooks
folder. In order to use the default Google Colab Notebooks folder, follow the steps shown
in the following screenshot:
To start reading your datasets, you can store them in folders in Google Drive.
After you have logged in to Google Colab and created a new notebook, you will have to
mount the drive by executing the following code in your notebook:
from google.colab import drive
drive.mount('/content/drive')
When the preceding code is run, it will ask for the authorization code to be entered, as
shown here:
Paste the authorization code into the text box. You'll get a different authorization code each
time. Upon authorization, the drive is mounted.
Once the drive is mounted, you can read .csv files using pandas, as we showed earlier in
the chapter. The rest of the code, as shown in the How to do it... section, runs as it is. If you use
the GPU, you'll notice a substantial increase in computational speed.
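For example, a dataset stored in the default Colab Notebooks folder can be read as follows
(the filename here is a placeholder for whichever file you uploaded to your Drive):
import pandas as pd

# Hypothetical path; replace the filename with your own dataset in Google Drive
df_energydata = pd.read_csv('/content/drive/My Drive/Colab Notebooks/energydata.csv')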
See also
There are various activation functions available for use with the Keras library, including
relu, elu, selu, tanh, sigmoid, softplus, and softmax.
For more information about the preceding activation functions, visit
https://keras.io/activations/.
This recipe uses the Street View House Numbers (SVHN) dataset, a real-world dataset
obtained from house numbers in Google Street View images.
We use Google Colab to train our models. In the first phase, we build a single model using
Keras. In the second phase, we build multiple homogeneous models and ensemble their
results.
Getting ready
The dataset has 60,000 house number images. Each image is labeled between 1 and 10. Digit
1 is labelled as 1, digit 9 is labelled as 9, and digit 0 is labelled as 10. The images are 32 x 32
images centered around a single character. In some cases, we can see the images are
visually indistinct.
import os
import matplotlib.pyplot as plt
import numpy as np
from numpy import array
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
Now, we import a library called h5py to read the HDF5 format file and open our data file,
which is called SVHN_single_grey.h5:
import h5py

# Open the HDF5 file (the path assumes the file sits in the working directory)
h5f = h5py.File('SVHN_single_grey.h5', 'r')
We load the training and test subsets and close the file:
# Load the training and test set
x_train = h5f['X_train'][:]
y_train = h5f['y_train'][:]
x_test = h5f['X_test'][:]
y_test = h5f['y_test'][:]

# Close the HDF5 file
h5f.close()
We reshape our train and test subsets so that each 32 x 32 image becomes a flat vector of
1,024 values. We also change the datatype to float:
x_train = x_train.reshape(x_train.shape[0], 1024).astype('float32')
x_test = x_test.reshape(x_test.shape[0], 1024).astype('float32')
We now normalize our data by dividing it by 255.0. This also converts the data type of the
values to float:
# normalize inputs from 0-255 to 0-1
x_train = x_train / 255.0
x_test = x_test / 255.0
We then check the shapes of the train and test features and our target subsets.
We visualize some of the images. We also print labels on top of the images:
# Visualizing the 1st 10 images in our dataset
# along with the labels
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 1))
for i in range(10):
    plt.subplot(1, 10, i+1)
    plt.imshow(x_train[i].reshape(32,32), cmap="gray")
    plt.title(y_train[i], color='r')
    plt.axis("off")
plt.show()
We now perform one-hot encoding on our target variable. We also store our y_test labels
in another variable, called y_test_actuals, for later use:
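The code for this step is not shown in this excerpt; a minimal sketch using scikit-learn's
LabelBinarizer (keras.utils.to_categorical is an equally valid choice for zero-based labels)
might look like this:
from sklearn.preprocessing import LabelBinarizer

# Keep the original integer labels for the confusion matrix later
y_test_actuals = y_test

# One-hot encode the target labels for training with categorical_crossentropy
lb = LabelBinarizer()
y_train = lb.fit_transform(y_train)
y_test = lb.transform(y_test)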
How to do it...
We'll now build a single model with the Keras library:
# building a linear stack of layers with the sequential model
model = Sequential()
model.add(Dense(512, input_shape=(1024,)))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dense(10))
model.add(Activation('softmax'))

# compiling the sequential model
model.compile(loss='categorical_crossentropy',
              metrics=['accuracy'], optimizer='adam')
3. Fit the model to the train data and validate it with the test data:
# training the model and saving metrics in history
svhn_model = model.fit(x_train, y_train,
batch_size=128, epochs=100,
verbose=2,
validation_data=(x_test, y_test))
#plt.subplot(2,1,1)
plt.plot(svhn_model.history['acc'])
plt.plot(svhn_model.history['val_acc'])
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epochs')
plt.legend(['Train', 'Test'], loc='upper left')
plt.tight_layout()
plt.plot(svhn_model.history['loss'])
plt.plot(svhn_model.history['val_loss'])
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epochs')
plt.legend(['Train', 'Test'], loc='upper right')
plt.tight_layout()
6. Reuse the code from the scikit-learn website to plot the confusion matrix:
# code from http://scikit-learn.org
import itertools
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    """
    plt.imshow(cm, cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('Actuals')
    plt.xlabel('Predicted')

cm = confusion_matrix(y_test_actuals, predicted_classes)
print(cm)
plt.figure(figsize=(10,10))
plot_confusion_matrix(cm, classes=target_names, normalize=False)
plt.show()
8. We'll now look at how to ensemble the results of multiple homogeneous models.
Define a function to fit the model to the training data:
# fit model on dataset
def train_models(x_train, y_train):
    # building a linear stack of layers with the sequential model
    model = Sequential()
    model.add(Dense(512, input_shape=(1024,)))
    model.add(Activation('relu'))
    model.add(Dropout(0.2))
    model.add(Dense(512))
    model.add(Activation('relu'))
    model.add(Dropout(0.2))
    model.add(Dense(10))
    model.add(Activation('softmax'))

    # compiling the sequential model
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])

    # training the model and saving metrics in history
    svhn_model = model.fit(x_train, y_train, batch_size=32, epochs=25)
    return model
10. Write a function to evaluate the models and get the accuracy score of each model:
# evaluate a specific number of members in an ensemble
def evaluate_models(models, no_of_models, x_test, y_test):
    # select a subset of members
    subset = models[:no_of_models]
    # make prediction
    y_predicted_ensemble = ensemble_predictions(subset, x_test)
    # calculate accuracy
    return accuracy_score(y_test_actuals, y_predicted_ensemble)
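The ensemble_predictions() helper called above is one of the three custom functions; its
definition is not shown in this excerpt. A minimal sketch, which averages the predicted class
probabilities of the member models and picks the most likely class, might look like this:
# A possible implementation of ensemble_predictions() (not the book's exact code)
def ensemble_predictions(models, x_test):
    # collect the predicted probabilities from every member model
    all_predictions = np.array([model.predict(x_test) for model in models])
    # sum the probabilities across members and pick the highest-scoring class
    summed = np.sum(all_predictions, axis=0)
    predicted_indices = np.argmax(summed, axis=1)
    # note: if the original labels run from 1 to 10, these 0-based indices may need
    # to be mapped back to label values before comparing them with y_test_actuals
    return predicted_indices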
How it works...
In Step 1 to Step 7, we built a single neural network model to see how to use a labelled
image dataset to train our model and predict the actual label for an unseen image.
In Step 1, we built a linear stack of layers with the sequential model using Keras. We
defined three layers: an input layer, a hidden layer, and an output layer. We provided
input_shape=(1024,) to the input layer, since each 32 x 32 image is flattened into 1,024
values. We used the relu activation function in the first and second layers. Because ours is
a multi-class classification problem, we used softmax as the activation function for our
output layer.
In Step 4 and Step 5, we plotted the model accuracy and the loss metric for every epoch.
From Step 8 onward, we ensembled multiple models. We wrote three custom functions:
train_models() to fit a model on the training data, ensemble_predictions() to combine
the predictions of the member models, and evaluate_models() to calculate the accuracy
score for a given number of ensemble members.
In Step 11, we fitted all the models. We set the no_of_models variable to 50. We trained
our models in a loop by calling the train_models() function. We then passed x_train
and y_train to the train_models() function for every model built at every iteration. We
also called evaluate_models(), which returned the accuracy scores of each model built.
We then appended all the accuracy scores.
In Step 12, we plotted the accuracy scores for all the models.
10
Heterogeneous Ensemble Classifiers Using H2O
In this chapter, we will cover the following recipe:
Introduction
In this chapter, we'll showcase how to build a heterogeneous ensemble classifier using H2O,
which is an open source, distributed, in-memory machine learning platform. H2O provides a
host of supervised and unsupervised algorithms.
Among the supervised algorithms, H2O provides us with neural networks, random forest
(RF), generalized linear models, a Gradient-Boosting Machine, a naive Bayes classifier, and
XGBoost.
H2O also provides us with a stacked ensemble method that aims to find the optimal
combination of a collection of predictive algorithms using the stacking process. H2O's
stacked ensemble supports both regression and classification.
This dataset contains information about credit card clients in Taiwan. It includes
information about payment defaulters, customers' demographic factors, their credit
data, and their payment history. The dataset is provided in the GitHub repository, and is
also available from its main source, the UCI ML Repository: https://bit.ly/2EZX6IC.
In our example, we'll use the following supervised algorithms from H2O to build our
models:
A generalized linear model (GLM)
A random forest
A gradient-boosting machine (GBM)
A stacked ensemble
We'll see how to use these algorithms in Python and learn how to set some of the
hyperparameters for each of the algorithms.
Getting ready
We'll use Google Colab to build our model. In Chapter 9, Homogeneous Ensembles Using
Keras, we explained how to use Google Colaboratory in the There's more... section.
We install H2O in Colab with the following command:
!pip install h2o
Executing the preceding command will show you a few instructions, with the final line
showing the following message (the version number of H2O will be different, depending on
the latest version available):
Successfully installed colorama-0.4.1 h2o-3.22.1.2
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
from h2o.grid.grid_search import H2OGridSearch
We then initialize the H2O instance:
h2o.init()
Upon successful initialization, we'll see the information shown in the following screenshot.
This information might differ, depending on the environment:
We'll read our dataset from Google Drive. In order to do this, we first need to mount the
drive:
from google.colab import drive
drive.mount('/content/drive')
It will instruct you to go to a URL to get the authorization code. You'll need to click on the
URL, copy the authorization code, and paste it. Upon successful mounting, you can read
your file from the respective folder in Google Drive:
# Reading dataset from Google drive
df_creditcarddata = h2o.import_file("/content/drive/My Drive/Colab
Notebooks/UCI_Credit_Card.csv")
You can run similar methods on an H2O DataFrame as you can on pandas. For example, in
order to see the first 10 observations in the DataFrame, you can use the following
command:
df_creditcarddata.head()
In order to see all the column names, we run the following syntax:
df_creditcarddata.columns
In a pandas DataFrame, we used dtypes to see the datatype of each column. In an H2O
DataFrame, we use the following:
df_creditcarddata.types
This gives us the following output. Note that the categorical variables appear as 'enum':
We don't need the ID column for predictive modeling, so we remove it from our
DataFrame:
df_creditcarddata = df_creditcarddata.drop(["ID"], axis = 1)
We can see the distribution of the numeric variables using the hist() method:
import pylab as pl
df_creditcarddata[['AGE','BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','B
ILL_AMT5','BILL_AMT6', 'LIMIT_BAL']].as_data_frame().hist(figsize=(20,20))
pl.show()
The following screenshot shows us the plotted variables. This can help us in our analysis of
each of the variables:
To extend our analysis, we can see the distribution of defaulters and non-defaulters by
gender, education, and marital status:
# Defaulters by Gender
columns = ["default.payment.next.month","SEX"]
default_by_gender = df_creditcarddata.group_by(by=columns).count(na ="all")
print(default_by_gender.get_frame())
# Defaulters by education
columns = ["default.payment.next.month","EDUCATION"]
default_by_education = df_creditcarddata.group_by(by=columns).count(na="all")
print(default_by_education.get_frame())

# Defaulters by MARRIAGE
columns = ["default.payment.next.month","MARRIAGE"]
default_by_marriage = df_creditcarddata.group_by(by=columns).count(na="all")
print(default_by_marriage.get_frame())
Next, we convert the categorical variables to factor type with the asfactor() method (the
target variable is converted in the same way, as described in the How it works... section):
df_creditcarddata['SEX'] = df_creditcarddata['SEX'].asfactor()
df_creditcarddata['EDUCATION'] = df_creditcarddata['EDUCATION'].asfactor()
df_creditcarddata['MARRIAGE'] = df_creditcarddata['MARRIAGE'].asfactor()
df_creditcarddata['PAY_0'] = df_creditcarddata['PAY_0'].asfactor()
df_creditcarddata['PAY_2'] = df_creditcarddata['PAY_2'].asfactor()
df_creditcarddata['PAY_3'] = df_creditcarddata['PAY_3'].asfactor()
df_creditcarddata['PAY_4'] = df_creditcarddata['PAY_4'].asfactor()
df_creditcarddata['PAY_5'] = df_creditcarddata['PAY_5'].asfactor()
df_creditcarddata['PAY_6'] = df_creditcarddata['PAY_6'].asfactor()
target = 'default.payment.next.month'
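The predictor list and the train/test split are not shown in this excerpt; a minimal sketch,
assuming every remaining column is used as a predictor and a 70/30 split with H2O's
split_frame() method, might be:
# Assumed construction of the predictor list: every column except the target
predictors = [col for col in df_creditcarddata.columns if col != target]

# Assumed train/test split using split_frame(); the 70/30 ratio is illustrative
train, test = df_creditcarddata.split_frame(ratios=[0.7])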
How to do it...
Let's move on to training our models using the algorithms we mentioned earlier in this
chapter. We'll start by training our generalized linear model (GLM) models. We'll build
three GLM models:
A GLM model with the default settings
A GLM model with regularization (using lambda_search)
A GLM model trained with a grid search over its hyperparameters
Now we will start training our models in the following sections.
1. Train a GLM model with the default settings. Because our target is binary, we set the
family argument to binomial, so the trained model is an H2OBinomialModel subclass.
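The estimator definition for this step is not shown here; a minimal sketch, using the same
cross-validation settings as the other models in this recipe (the model_id is illustrative),
might be:
GLM_default_settings = H2OGeneralizedLinearEstimator(family = 'binomial', \
                                    model_id = 'GLM_default', nfolds = 10, \
                                    fold_assignment = "Modulo", \
                                    keep_cross_validation_predictions = True)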
2. We created predictor and target variables in the Getting ready section. Pass the
predictor and target variables to the model:
GLM_default_settings.train(x = predictors, y = target,
training_frame = train)
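Step 3, which trains a regularized GLM with lambda_search, and Step 4, which defines the
grid-search hyperparameters, are not shown in this excerpt. A minimal sketch of both,
assuming a random search over the alpha and lambda parameters referred to in Step 5, might
look like this:
# Hypothetical regularized GLM (Step 3); lambda_search finds the optimal lambda
GLM_regularized = H2OGeneralizedLinearEstimator(family = 'binomial', \
                                    model_id = 'GLM_regularized', \
                                    lambda_search = True, \
                                    nfolds = 10, fold_assignment = "Modulo", \
                                    keep_cross_validation_predictions = True)
GLM_regularized.train(x = predictors, y = target, training_frame = train)

# Hypothetical hyperparameter grid and search criteria (Step 4); values are illustrative
hyper_parameters = {'alpha': [0.001, 0.01, 0.1, 0.5, 1.0],
                    'lambda': [0.001, 0.01, 0.1, 1.0]}
search_criteria = {'strategy': 'RandomDiscrete', 'max_models': 20, 'seed': 1}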
GLM_grid_search = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial', \
                                    nfolds = 10, fold_assignment = "Modulo", \
                                    keep_cross_validation_predictions = True), \
                                hyper_parameters, grid_id="GLM_grid",
                                search_criteria=search_criteria)
5. We get the grid result sorted by the auc value with the get_grid() method:
# Get the grid results, sorted by validation AUC
GLM_grid_sorted = GLM_grid_search.get_grid(sort_by='auc',
decreasing=True)
GLM_grid_sorted
In the following screenshot, we can see the auc score for each model, which
consists of different combinations of the alpha and lambda parameters:
6. We can see the model metrics on our train data and our cross-validation data:
# Extract the best model from random grid search
Best_GLM_model_from_Grid = GLM_grid_sorted.model_ids[0]
# model performance
Best_GLM_model_from_Grid = h2o.get_model(Best_GLM_model_from_Grid)
print(Best_GLM_model_from_Grid)
From the preceding code block, you can evaluate the model metrics, which
include MSE, RMSE, Null and Residual Deviance, AUC, and Gini, along with
the Confusion Matrix. At a later stage, we will use the best model from the grid
search for our stacked ensemble.
Let us look at the following image and evaluate the model metrics:
7. Train the model using random forest. The code for random forest using default
settings looks as follows:
# Build a RF model with default settings
RF_default_settings = H2ORandomForestEstimator(model_id = 'RF_D', \
                                               nfolds = 10, \
                                               fold_assignment = "Modulo", \
                                               keep_cross_validation_predictions = True)
8. To get the summary output of the model, use the following code:
RF_default_settings.summary()
9. Train the random forest model using a grid search. Set the hyperparameters as
shown in the following code block:
hyper_params = {'sample_rate': [0.7, 0.9],
                'col_sample_rate_per_tree': [0.8, 0.9],
                'max_depth': [3, 5, 9],
                'ntrees': [200, 300, 400]
               }

RF_grid_search = H2OGridSearch(H2ORandomForestEstimator(nfolds = 10, \
                                    fold_assignment = "Modulo", \
                                    keep_cross_validation_predictions = True, \
                                    stopping_metric = 'AUC', stopping_rounds = 5), \
                               hyper_params = hyper_params, \
                               grid_id = 'RF_gridsearch')
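Step 10, which trains the grid search, is not shown in this excerpt; presumably it calls
train() in the same way as the earlier models:
RF_grid_search.train(x = predictors, y = target, training_frame = train)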
11. Sort the results by AUC score to see which model performs best:
# Sort the grid models
RF_grid_sorted = RF_grid_search.get_grid(sort_by='auc',
decreasing=True)
print(RF_grid_sorted)
12. Extract the best model from the grid search result:
Best_RF_model_from_Grid = RF_grid_sorted.model_ids[0]
# Model performance
Best_RF_model_from_Grid = h2o.get_model(Best_RF_model_from_Grid)
print(Best_RF_model_from_Grid)
In the following screenshot, we see the model metrics for the grid model on the
train data and the cross-validation data:
13. Train the model using GBM. Here's how to train a GBM with the default settings:
GBM_default_settings = H2OGradientBoostingEstimator(model_id = 'GBM_default', \
                                    nfolds = 10, \
                                    fold_assignment = "Modulo", \
                                    keep_cross_validation_predictions = True)
14. Use a grid search on the GBM. To perform a grid search, set the hyperparameters
as follows:
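The hyperparameter grid itself is not shown in this excerpt; a minimal sketch covering the
parameters discussed in the How it works... section (the values are illustrative) might be:
hyper_params = {'learn_rate': [0.01, 0.1],
                'sample_rate': [0.7, 0.9],
                'col_sample_rate': [0.8, 0.9],
                'max_depth': [3, 5, 9],
                'ntrees': [200, 300, 400]}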
15. Use the hyperparameters on H2OGridSearch() to train the GBM model using
grid search:
GBM_grid_search = H2OGridSearch(H2OGradientBoostingEstimator(nfolds = 10, \
                                    fold_assignment = "Modulo", \
                                    keep_cross_validation_predictions = True, \
                                    stopping_metric = 'AUC', stopping_rounds = 5), \
                                hyper_params = hyper_params, \
                                grid_id = 'GBM_Grid')
16. As with the earlier models, we can view the results sorted by AUC:
# Sort and show the grid search results
GBM_grid_sorted = GBM_grid_search.get_grid(sort_by='auc',
decreasing=True)
print(GBM_grid_sorted)
17. Extract the best model from the grid search result:
Best_GBM_model_from_Grid = GBM_grid_sorted.model_ids[0]

# Model performance
Best_GBM_model_from_Grid = h2o.get_model(Best_GBM_model_from_Grid)
print(Best_GBM_model_from_Grid)
18. Create a list of the best models from the earlier models that we built using grid
search:
# list the best models from each grid
all_models = [Best_GLM_model_from_Grid, Best_RF_model_from_Grid,
Best_GBM_model_from_Grid]
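Steps 19 and 20, which set up and train the stacked ensemble, are not shown in this excerpt.
A minimal sketch using H2O's H2OStackedEnsembleEstimator with the best grid-searched
models as base learners might look like this:
# Hypothetical stacked ensemble trained on the same data as the base learners
ensemble = H2OStackedEnsembleEstimator(model_id = 'ensemble', base_models = all_models)
ensemble.train(x = predictors, y = target, training_frame = train)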
21. Compare the performance of the base learners on the test data. The following
code tests the model performance of all the GLM models we've built:
# Checking the model performance for all GLM models built
model_perf_GLM_default = GLM_default_settings.model_performance(test)
model_perf_GLM_regularized = GLM_regularized.model_performance(test)
model_perf_Best_GLM_model_from_Grid = Best_GLM_model_from_Grid.model_performance(test)
The following code tests the model performance of all the random forest models
we've built:
# Checking the model performance for all RF models built
model_perf_RF_default_settings = RF_default_settings.model_performance(test)
model_perf_Best_RF_model_from_Grid = Best_RF_model_from_Grid.model_performance(test)
The following code tests the model performance of all the GBM models we've
built:
# Checking the model performance for all GBM models built
model_perf_GBM_default_settings = GBM_default_settings.model_performance(test)
model_perf_Best_GBM_model_from_Grid = Best_GBM_model_from_Grid.model_performance(test)
22. To get the best AUC from the base learners, execute the following commands:
# Best AUC from the base learner models
best_auc = max(model_perf_GLM_default.auc(), \
               model_perf_GLM_regularized.auc(), \
               model_perf_Best_GLM_model_from_Grid.auc(), \
               model_perf_RF_default_settings.auc(), \
               model_perf_Best_RF_model_from_Grid.auc(), \
               model_perf_GBM_default_settings.auc(), \
               model_perf_Best_GBM_model_from_Grid.auc())
23. The following commands show the AUC from the stacked ensemble model:
# Eval ensemble performance on the test data
Ensemble_model = ensemble.model_performance(test)
Ensemble_model = Ensemble_model.auc()
How it works...
We used Google Colab to train our models. After we installed H2O in Google Colab, we
initialized the H2O instance. We also imported the required libraries.
We mounted Google Drive and read our dataset using h2o.import_file(). This created
an H2O DataFrame, which is very similar to a pandas DataFrame. Instead of being held in
memory by the Python process, however, the data is located in the H2O cluster.
We then performed basic operations on the H2O DataFrame to analyze our data. We took a
look at the dimensions, the top few rows, and the data types of each column. The shape
attribute returned a tuple with the number of rows and columns. The head() method
returned the top 10 observations. The types attribute returned the data types of each
column.
We didn't need the ID column for predictive modeling, so we removed it using the drop()
method with axis=1 as a parameter. With axis=1, it drops columns; the default value of
axis=0 would have dropped rows instead.
We analyzed the distribution of the numeric variables. There's no limit to how far you can
explore your data. We also saw the distribution of both of the classes of our target variable
by various categories, such as gender, education, and marriage.
We then converted the categorical variables to factor type with the asfactor()
method. This was done for the target variable as well.
We created a list of predictor variables and target variables. We split our DataFrame into
the train and test subsets with the split_frame() method.
After we split our datasets into train and test subsets, we moved onto training our models.
We used GLM, random forest, a gradient-boosting machine (GBM), and stacked
ensembles to train the stacking model.
In the How to do it... section, in Step 1 and Step 2, we showcased the code to train a GLM
model with the default settings. We used cross-validation to train our model.
In Step 3, we trained a GLM model with lambda_search, which helps to find the optimal
regularization parameter.
In Step 4, we used grid-search parameters to train our GLM model. We set our hyper-
parameters and provided these to the H2OGridSearch() method. This helps us search for
the optimum parameters across models. In the H2OGridSearch() method, we used
the RandomDiscrete search-criteria strategy.
In Step 5, with the get_grid() method, we looked at the AUC score of each model built
with different combinations of the parameters provided. In Step 6, we extracted the best
model from the random grid search. We can also use the print() method on the best
model to see the model performance metrics on both the train data and the cross-validation
data.
In Step 7, we trained a random forest model with default settings and looked at the
summary of the resulting model in step 8. In Step 9 and Step 10, we showcased the code to
train a random forest model using grid-search. We set multiple values for various
acceptable hyper-parameters, such as sample_rate, col_sample_rate_per_tree,
max_depth, and ntrees. sample_rate refers to row sampling without replacement. It
takes a value between 0 and 1, indicating the sampling percentage of the data.
col_sample_rate_per_tree is the column sampling for each tree without replacement.
max_depth is set to specify the maximum depth to which each tree should be built. Deeper
trees may perform better on the training data but will take more computing time and may
overfit and fail to generalize on unseen data. The ntrees parameter is used for tree-based
algorithms to specify the number of trees to build on the model.
In Step 11 and Step 12, we printed the AUC score of each model generated by the grid-
search and extracted the best model from it.
We also trained GBM models to fit our data. In Step 13, we built the GBM using the default
settings. In Step 14, we set the hyperparameter space for the grid search. We used this in
Step 15, where we trained our GBM. In the GBM, we set values for hyperparameters, such
as learn_rate, sample_rate, col_sample_rate, max_depth, and ntrees. The
learn_rate parameter is used to specify the rate at which the GBM algorithm trains the
model. A lower value for the learn_rate parameter is better and can help in avoiding
overfitting, but can be costly in terms of computing time.
Step 16 showed us the AUC score of each resulting model from the grid search. We
extracted the best grid-searched GBM in Step 17.
In Step 21, we evaluated all the GLM models we built on our test data. We did the same
with all the models we trained using RF and GBM. Step 22 gave us the model with the
maximum AUC score. In Step 23, we evaluated the AUC score of the stacked ensemble
model on the test data in order to compare the performance of the stacked ensemble model
with the individual base learners.
There's more...
Note that we used cross-validation to train all our models. We used the nfolds option to
set the number of folds to use for cross-validation. In our example, we used nfolds=10, but
we can also set it to other values.
The number of folds needs to be the same across all the models you build.
With a value for nfolds specified, we can also provide a value for the fold_assignment
parameter. fold_assignment takes values such as auto, random, modulo, and
stratified. If we set it to auto, the algorithm automatically chooses an option; currently,
it chooses random. With fold_assignment set to random, it will enable a random split of
the data into nfolds sets. When fold_assignment is set to modulo, it uses a deterministic
method to evenly split the data into nfolds sets that doesn't depend on the seed parameter.
See also
The H2O documentation is available at http://docs.h2o.ai/.
11
Heterogeneous Ensemble for Text Classification Using NLP
In this chapter, we will cover the following topics:
Introduction
Text classification is a widely studied area of language processing and text mining. Using
text classification mechanisms, we can classify documents into predefined categories based
on their content.
In this chapter, we'll take a look at how to classify short text messages that get delivered to
our mobile phones. While some messages we receive are important, others might represent
a serious threat to our privacy. We want to be able to classify the text messages correctly in
order to avoid spam and to avoid missing important messages.
In this example, we opt for algorithms such as Naive Bayes, random forest, and support
vector machines to train our models.
We also process our data using term frequency-inverse document frequency (TF-IDF).
TF tells us how often a word appears in a message or a document, and is calculated as:
TF = No. of times a word appears in a document / Total No. of words in the document
IDF tells us how rare the word is across the collection of documents:
IDF = log(Total No. of documents / No. of documents containing the word)
TF-IDF numerically scores the importance of a word based on how often it appears in a
document and how rare it is across the collection of documents. Simply put, the higher the
TF-IDF score, the rarer and more distinctive the term; the lower the score, the more common
it is. The mathematical representation of TF-IDF is as follows:
TF-IDF(w, d, D) = TF(w, d) * IDF(w, D)
where w represents the word, d represents a document, and D represents the collection of
documents. For example, if a word appears 3 times in a 100-word message, TF = 3/100 = 0.03;
if it appears in 10 out of 1,000 messages, IDF = log(1,000/10) = log(100) = 2 (using base-10
logarithms), giving a TF-IDF score of 0.06.
In this example, we'll use the SMS spam collection dataset, which has labelled
messages that have been gathered for cellphone spam research. This dataset is available in
the UCI ML repository and is also provided in the GitHub repository.
Getting ready
Note that for this example, we import libraries such as nltk to prepare our data. We also
import the CountVectorizer and TfidfVectorizer classes from
sklearn.feature_extraction.text. These modules are used for feature extraction in ML
algorithms.
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.tight_layout()
After reading the data, we check whether it has been loaded properly:
df_sms.head()
We also check the number of observations and features in the dataset with
dataframe.shape:
df_sms.shape
"orange"], fontsize=13)
ax.set_alpha(0.8)
ax.set_title("Percentage Share of Spam and Ham Messages")
ax.set_ylabel("Count of Spam & Ham messages");
ax.set_yticks([0, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500,
5000, 5500])
totals = []
for i in ax.patches:
    totals.append(i.get_height())
total = sum(totals)
We also define a function to remove punctuation, convert the text to lowercase, and remove
stop words:
lemmatizer = WordNetLemmatizer()
return clean_words
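The full body of the function is not shown in this excerpt; a minimal sketch that performs
the steps described above (punctuation removal, lowercasing, stop-word removal, and
lemmatization with the lemmatizer defined earlier) might look like this:
import string
from nltk.corpus import stopwords

# A possible implementation of text_processing() (not the book's exact code)
def text_processing(text):
    # remove punctuation and convert to lowercase
    no_punctuation = ''.join(ch for ch in text if ch not in string.punctuation).lower()
    # remove stop words and lemmatize the remaining words
    clean_words = [lemmatizer.lemmatize(word)
                   for word in no_punctuation.split()
                   if word not in stopwords.words('english')]
    return clean_words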
We apply the defined text_processing() function to the text variable in the dataset:
df_sms['text'] = df_sms['text'].apply(text_processing)
We separate our feature and target variables, and split our data into train and test
subsets:
X = df_sms.loc[:,'text']
Y = df_sms.loc[:,'type']
Y = Y.astype('int')
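The train/test split and the CountVectorizer set-up are not shown in the lines above; a
minimal sketch (the split ratio and vectorizer options are illustrative choices) might be:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical 70/30 split of the SMS data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

# Bag-of-words vectorizer for the "count data" used below
count_vectorizer = CountVectorizer(stop_words='english')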
count_train = count_vectorizer.fit_transform(X_train)
count_test = count_vectorizer.transform(X_test)
We also use the TfidfVectorizer module to convert the text into TF-IDF vectors:
tfidf = TfidfVectorizer(stop_words='english')
tfidf_train = tfidf.fit_transform(X_train)
tfidf_test = tfidf.transform(X_test)
Let's now move on to training our models. We use the following algorithms both on the
count data and the TF-IDF data and see how the individual models perform:
Naive Bayes
Support vector machine
Random forest
We also combine the model predictions to see the result from the ensemble.
How to do it...
Let's begin with training our models, and see how they perform in this section:
1. Train the model using the Naive Bayes algorithm. Apply this algorithm to both
the count data and the TF-IDF data.
The following is the code to train the Naive Bayes on the count data:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(count_train, Y_train)
nb_pred_train = nb.predict(count_train)
nb_pred_test = nb.predict(count_test)
nb_pred_train_proba = nb.predict_proba(count_train)
nb_pred_test_proba = nb.predict_proba(count_test)
Take a look at the train and test accuracy for the preceding model:
This gives us the following output, which shows the precision, recall, f1-score, and support
for each class:
The following plot shows us the true negative, false positive, false negative, and
true positive values:
nb.fit(tfidf_train, Y_train)
nb_pred_train_tfidf = nb.predict(tfidf_train)
nb_pred_test_tfidf = nb.predict(tfidf_test)
nb_tfidf_pred_train_proba = nb.predict_proba(tfidf_train)
nb_tfidf_pred_test_proba = nb.predict_proba(tfidf_test)
print('The accuracy for the training data is {}'.format(nb.score(count_train, Y_train)))
print('The accuracy for the testing data is {}'.format(nb.score(count_test, Y_test)))

target_names = ['Spam','Ham']
plot_confusion_matrix(cm, classes=target_names)
plt.show()
In the following screenshot, we can see the output from the preceding code block:
6. Fit the model with the support vector machine classifier with the count data. Use
GridSearchCV to perform a search over the specified parameter values for the
estimator:
from sklearn.svm import SVC
svc = SVC(kernel='rbf',probability=True)
svc_params = {'C':[0.001, 0.01, 0.1, 1, 10]}
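The grid-search fit for the count data is not shown in the lines above; by analogy with the
TF-IDF version in Step 8, it would look like this:
from sklearn.model_selection import GridSearchCV

# Fit the SVC with an RBF kernel to the count data using 5-fold grid search
svc_gcv = GridSearchCV(svc, svc_params, cv=5)
svc_gcv.fit(count_train, Y_train)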
The grid-search provides us with the optimum model. We get to see the
parameter values and the score of the optimum model:
7. Take a look at the test accuracy of the count data with the following code:
print(classification_report(Y_test, svc_rbf_test_predicted_values))
target_names = ['Spam','Ham']
svc_gcv = GridSearchCV(svc,svc_params,cv=5)
svc_gcv.fit(tfidf_train, Y_train)
The following output shows the best score of the model trained with the SVM and
RBF kernel on the TF-IDF data:
9. Print the classification report and the confusion matrix for the preceding model:
10. Fit the random forest model on the count data with grid search cross-validation, as we did for the SVM:
from sklearn.ensemble import RandomForestClassifier

# Create the random forest estimator used in the grid search
rf = RandomForestClassifier()

# Set the parameters for grid search
rf_params = {"criterion": ["gini", "entropy"],
             "min_samples_split": [2, 3],
             "max_depth": [None, 2, 3],
             "min_samples_leaf": [1, 5],
             "max_leaf_nodes": [None],
             "oob_score": [True]}
# Use GridSearchCV() and pass the values you have set for the grid search
rf_gcv = GridSearchCV(rf, rf_params, cv=5)
rf_gcv.fit(count_train, Y_train)
rf_test_predicted_values = rf_gcv.predict(count_test)
A grid search of the random forest with the grid parameters returns the best
parameters and the best score, as seen in the following screenshot:
11. Using a classification report and a confusion matrix, take a look at the
performance metrics of the random forest model with the count data on our test
data:
print(classification_report(Y_test, rf_test_predicted_values))

target_names = ['Spam','Ham']
cm = confusion_matrix(Y_test, rf_test_predicted_values)
plot_confusion_matrix(cm, classes=target_names)
plt.show()
12. Build a model on a random forest with a grid search on the TF-IDF data:
# Set the parameters for grid search
rf_params = {"criterion": ["gini", "entropy"],
             "min_samples_split": [2, 3],
             "max_depth": [None, 2, 3],
             "min_samples_leaf": [1, 5],
             "max_leaf_nodes": [None],
             "oob_score": [True]}

# Use GridSearchCV() and pass the values you have set for the grid search
rf_gcv = GridSearchCV(rf, rf_params, cv=5)
rf_gcv.fit(tfidf_train, Y_train)

rf_tfidf_train_predicted_values = rf_gcv.predict(tfidf_train)
rf_tfidf_test_predicted_values = rf_gcv.predict(tfidf_test)
rf_gcv_tfidf_pred_train_proba = rf_gcv.predict_proba(tfidf_train)
rf_gcv_tfidf_pred_test_proba = rf_gcv.predict_proba(tfidf_test)
print(classification_report(Y_test, rf_tfidf_test_predicted_values))
target_names = ['Spam','Ham']
# Pass actual & predicted values to the confusion matrix()
cm = confusion_matrix(Y_test, rf_tfidf_test_predicted_values)
plt.figure()
plot_confusion_matrix(cm, classes=target_names)
plt.show()
13. Take the output of the predict_proba() methods to gather the predicted
probabilities from each model to plot the ROC curves. The full code is provided
in the code bundle.
Here's a sample of the code to plot the ROC curve from the Naive Bayes model on
the count data:
fpr, tpr, thresholds = roc_curve(Y_test, nb_pred_test_proba[:,1])
roc_auc = auc(fpr, tpr)
With the complete code provided in the code bundle, we can view the ROC plot
from all the models and compare them:
14. Average the probabilities from all the models and plot the ROC curves:
plt.subplot(4,3,7)
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.0])
plt.ylim([-0.1,1.01])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.subplot(4,3,8)
# d holds the averaged class probabilities for the ensemble (see the code bundle)
fpr, tpr, thresholds = roc_curve(Y_test, d[:,1])
roc_auc = auc(fpr, tpr)
We can see the average result of the ROC and AUC scores in the following
screenshot:
15. Check the accuracy of the ensemble result. Create an array of the predicted results, as follows:
predicted_array = np.array([nb_pred_test_tfidf,
                            svc_tfidf_rbd_test_predicted_values,
                            rf_tfidf_test_predicted_values])
16. Calculate the mode of the predicted values for the respective observations to
perform max-voting in order to get the final predicted result:
# Using mode on the array, we get the max vote for each observation
predicted_array = mode(predicted_array)
17. Plot the test accuracy for the models trained on the count data and TF-IDF
data, respectively:
How it works...
In the Getting ready section, we imported all the required libraries and defined the function to plot the confusion matrix. We read our dataset using UTF-8 encoding. We checked the proportion of spam and ham messages in our dataset and used the CountVectorizer and TfidfVectorizer modules to convert the texts into count vectors and TF-IDF vectors, respectively.

After that, we built multiple models using various algorithms. We also applied each algorithm on both the count data and the TF-IDF data.
The Naive Bayes classifier is widely used for text classification in machine learning. The
Naive Bayes algorithm is based on the conditional probability of features belonging to a
class. In Step 1, we built our first model with the Naive Bayes algorithm on the count
data. In Step 2, we checked the performance metrics using classification_report() to
see the precision, recall, f1-score, and support. In Step 3, we called
plot_confusion_matrix() to plot the confusion matrix.
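As a minimal, self-contained illustration of this idea (the texts and labels below are made up for the example and are not the recipe's SMS data), a multinomial Naive Bayes model can be fitted directly on count vectors:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win a free prize now", "are we still meeting today",
         "free entry in a weekly draw", "see you at lunch"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(texts)  # sparse term-count matrix

nb = MultinomialNB()
nb.fit(X_counts, labels)
print(nb.predict(vectorizer.transform(["free prize draw today"])))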
Then, in Step 4, we built the Naive Bayes model on the TF-IDF data and evaluated the
performance in Step 5. In Step 6 and Step 7, we trained our model using the support vector
machine on the count data, evaluated its performance using the output from
classification_report, and plotted the confusion matrix. We trained our SVM model
using the RBF kernel. We also showcased an example of using GridSearchCV to find the
best parameters. In Step 8 and Step 9, we repeated what we did in Step 6 and Step 7, but this
time, we trained the SVM on TF-IDF data.
In Step 10, we trained a random forest model using grid search on the count data. We set
gini and entropy for the criterion hyperparameter. We also set multiple values for the
parameters, such as min_samples_split, max_depth, and min_samples_leaf. In Step
11, we evaluated the model's performance.
We then trained another random forest model on the TF-IDF data in Step 12. Using the predict_proba() function, we got the class probabilities on our test data. We used the same in Step 13 to plot the ROC curves with AUC scores annotated on the plots for each of the models. This helps us to compare the performance of the models.

In Step 14, we averaged the probabilities, which we got from the models for both the count and TF-IDF data. We then plotted the ROC curves for the ensemble results. From Step 15 through to Step 17, we plotted the test accuracy for each of the models built on the count data as well as the TF-IDF data.
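As a minimal sketch of that averaging step on the TF-IDF test data (the Naive Bayes and random forest probability variables follow the names used above, while the SVC probability variable is an assumed name, since it isn't shown in this recipe):

# Average the positive-class probabilities from the three TF-IDF models
avg_proba = (nb_tfidf_pred_test_proba[:, 1]
             + svc_tfidf_pred_test_proba[:, 1]  # assumed name for the SVC probabilities
             + rf_gcv_tfidf_pred_test_proba[:, 1]) / 3.0

fpr, tpr, thresholds = roc_curve(Y_test, avg_proba)
roc_auc = auc(fpr, tpr)  # AUC of the averaged-probability ROC curve
print('Ensemble AUC: {:.3f}'.format(roc_auc))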
Sentiment analysis of movie reviews using an ensemble model
We have movie reviews in .txt files that are separated into two folders: negative and positive. There are 1,000 positive reviews and 1,000 negative reviews. These files can be retrieved from GitHub.
The first part is to prepare the dataset. We'll read the review files that are
provided in the .txt format, append them, label them as positive or negative
based on which folder they have been put in, and create a .csv file that contains
the label and text.
In the second part, we'll build multiple base learners on both the count data and
on the TF-IDF data. We'll evaluate the performance of the base learners and then
evaluate the ensemble of the predictions.
Getting ready
We start by importing the required libraries:
import os
import glob
import pandas as pd
We set our path variable and iterate through the .txt files in the folders.
The TXT files for the positive reviews are read and the reviews are appended in an array.
We use the array to create a DataFrame, df_pos.
path="/.../Chapter 11/CS - IMDB Classification/txt_sentoken/pos/*.txt"
files = glob.glob(path)
text_pos = []
for p in files:
file_read = open(p, "r")
to_append_pos = file_read.read()
text_pos.append(to_append_pos)
file_read.close()
df_pos = pd.DataFrame({'text':text_pos,'label':'positive'})
df_pos.head()
We also iterate through the TXT files in the negative folder to read the negative reviews and append them in an array. We use the array to create a DataFrame, df_neg:
path = "/Users/Dippies/CODE PACKT - EML/Chapter 11/CS - IMDB Classification/txt_sentoken/neg/*.txt"
files = glob.glob(path)

text_neg = []
for n in files:
    file_read = open(n, "r")
    to_append_neg = file_read.read()
    text_neg.append(to_append_neg)
    file_read.close()

df_neg = pd.DataFrame({'text': text_neg, 'label': 'negative'})
df_neg.head()
Finally, we merge the positive and negative DataFrames into a single DataFrame using
the concat() method:
df_moviereviews=pd.concat([df_pos, df_neg])
We can take a look at the prepared DataFrame with the head() and tail() methods:
print(df_moviereviews.head())
print(df_moviereviews.tail())
From the preceding output, we notice that the positive and negative reviews have been added sequentially. The first half of the DataFrame holds the positive reviews, while the next half holds the negative reviews.
We therefore shuffle the DataFrame using the shuffle() method:
df_moviereviews = shuffle(df_moviereviews)
df_moviereviews.head(10)
We validate the dimensions of the merged DataFrame to see whether it holds 2,000
observations, which would be the result of combining the 1,000 negative and 1,000 positive
reviews:
df_moviereviews.shape
From the preceding code, we notice that we have 2,000 observations and 2 columns.
We may also write the resulting DataFrame into another .csv file in order to avoid recreating the CSV file from the TXT files as we did in the preceding steps:
df_moviereviews.to_csv("/.../Chapter 11/CS - IMDB Classification/Data_IMDB.csv")
Next, we'll define the plot_confusion_matrix() method that we have used earlier.
We can now see the share of the positive and negative reviews in our data. In our case, the
proportion is exactly 50:50:
df_moviereviews["label"].value_counts().plot(kind='pie')
plt.tight_layout(pad=1,rect=(0, 0, 0.7, 1))
plt.text(x=-0.9,y=0.1, \
s=(np.round(((df_moviereviews["label"].\
value_counts()[0])/(df_moviereviews["label"].value_counts()[0] + \
df_moviereviews["label"].value_counts()[1])),2)))
[ 272 ]
Sarkar, Dipayan, and Vijayalakshmi Natarajan. Ensemble Machine Learning Cookbook : Over 35 Practical Recipes to Explore Ensemble Machine Learning
Techniques Using Python, Packt Publishing, Limited, 2019. ProQuest Ebook Central, http://ebookcentral.proquest.com/lib/bcu/detail.action?docID=5667626.
Created from bcu on 2023-01-10 17:18:45.
Heterogeneous Ensemble for Text Classification Using NLP Chapter 11
plt.text(x=0.4,y=-0.3, \
s=(np.round(((df_moviereviews["label"].\
value_counts()[1])/(df_moviereviews["label"].value_counts()[0] + \
df_moviereviews["label"].value_counts()[1])),2)))
The output of the preceding code can be seen in the following screenshot:
We will now replace the "positive" label with 1 and the "negative" label with 0:
df_moviereviews.loc[df_moviereviews["label"]=='positive',"label",]=1
df_moviereviews.loc[df_moviereviews["label"]=='negative',"label",]=0
We prepare our data using various data-cleaning and preparation mechanisms. We'll
follow the same sequence as we followed in the previous recipe to preprocess our data:
    # ...(tail of the text-cleaning function; the full function is in the code bundle)
    nopunc = ''.join(nopunc)
    return clean_words
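A minimal sketch of such a cleaning function, assuming NLTK stopwords and a Porter stemmer (the recipe's exact function may differ in detail), is shown here:

import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def clean_text(text):
    # Lowercase the text and strip punctuation character by character
    nopunc = [ch for ch in text.lower() if ch not in string.punctuation]
    nopunc = ''.join(nopunc)
    # Drop stop words and stem the remaining tokens
    clean_words = [stemmer.stem(word) for word in nopunc.split()
                   if word not in stopwords.words('english')]
    return clean_words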
We'll now build our base learners and evaluate the ensemble result.
How to do it...
We start by importing the remaining libraries we need:
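A minimal set of imports consistent with the code used below (a sketch, not necessarily the recipe's exact list) is:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import mode

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_curve, auc, accuracy_score)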
We proceed by training the base learners on the count data and on the TF-IDF data. We train random forest, Naive Bayes, and support vector classifier models as our base learners.
6. Train the random forest model using grid search on the count data:
# Set the parameters for grid search
rf_params = {"criterion": ["gini", "entropy"],
             "min_samples_split": [2, 3],
             "max_depth": [None, 2, 3],
             "min_samples_leaf": [1, 5],
             "max_leaf_nodes": [None],
             "oob_score": [True]}

# Use GridSearchCV() and pass the values you have set for the grid search
rf_count = GridSearchCV(rf, rf_params, cv=5)
rf_count.fit(count_train, Y_train)
rf_count_predicted_values = rf_count.predict(count_test)
rf_count_probabilities = rf_count.predict_proba(count_test)
In the following screenshot, we can see the output of the preceding code:
Train another random forest model with the same grid search on the TF-IDF data and evaluate it:
# Use GridSearchCV() and pass the values you have set for the grid search
rf_tfidf = GridSearchCV(rf, rf_params, cv=5)
rf_tfidf.fit(tfidf_train, Y_train)
rf_tfidf_predicted_values = rf_tfidf.predict(tfidf_test)

print(classification_report(Y_test, rf_tfidf_predicted_values))
cm = confusion_matrix(Y_test, rf_tfidf_predicted_values)
plot_confusion_matrix(cm, classes=target_names, normalize=False)
plt.show()
10. Train the Naive Bayes model on the count data and check the accuracy of the test
data:
nb_count = MultinomialNB()
nb_count.fit(count_train, Y_train)
nb_count_predicted_values = nb_count.predict(count_test)
nb_count_probabilities = nb_count.predict_proba(count_test)
12. Train the Naive Bayes model on the TF-IDF data and evaluate its performance in the same way as we did for the earlier models:
nb_tfidf = MultinomialNB()
nb_tfidf.fit(tfidf_train, Y_train)

nb_tfidf_predicted_values = nb_tfidf.predict(tfidf_test)
nb_tfidf_probabilities = nb_tfidf.predict_proba(tfidf_test)

print(classification_report(Y_test, nb_tfidf_predicted_values))
13. Train a model with a support vector classifier algorithm with the linear kernel on the count data. We also grid-search the C parameter for the SVC:
svc_count = SVC(kernel='linear', probability=True)
svc_params = {'C':[0.001, 0.01, 0.1, 1, 10]}

svc_gcv_count = GridSearchCV(svc_count, svc_params, cv=5)
svc_gcv_count.fit(count_train, Y_train)

svc_count_predicted_values = svc_gcv_count.predict(count_test)
svc_count_probabilities = svc_gcv_count.predict_proba(count_test)

svc_count_train_accuracy = svc_gcv_count.score(count_train, Y_train)
svc_count_test_accuracy = svc_gcv_count.score(count_test, Y_test)

print(classification_report(Y_test, svc_count_predicted_values))

# Pass actual & predicted values to the confusion_matrix()
cm = confusion_matrix(Y_test, svc_count_predicted_values)
plt.figure()
plot_confusion_matrix(cm, classes=target_names, normalize=False)
plt.show()
14. Train a model with the support vector classifier algorithm with the linear kernel on the TF-IDF data. Again, we grid-search the C parameter for the SVC:
svc_tfidf = SVC(kernel='linear', probability=True)
svc_params = {'C':[0.001, 0.01, 0.1, 1, 10]}

svc_gcv_tfidf = GridSearchCV(svc_tfidf, svc_params, cv=5)
svc_gcv_tfidf.fit(tfidf_train, Y_train)

svc_tfidf_predicted_values = svc_gcv_tfidf.predict(tfidf_test)
svc_tfidf_probabilities = svc_gcv_tfidf.predict_proba(tfidf_test)

svc_tfidf_train_accuracy = svc_gcv_tfidf.score(tfidf_train, Y_train)
svc_tfidf_test_accuracy = svc_gcv_tfidf.score(tfidf_test, Y_test)
print(classification_report(Y_test, svc_tfidf_predicted_values))
# Pass actual & predicted values to the confusion_matrix()
cm = confusion_matrix(Y_test, svc_tfidf_predicted_values)
plt.figure()
plot_confusion_matrix(cm, classes=target_names)
plt.show()
15. Plot the ROC curve for each of the models. The code for one of the plots is shown here (the complete code is provided in this book's code bundle):
fpr, tpr, thresholds = roc_curve(Y_test, rf_count_probabilities[:,1])
roc_auc = auc(fpr, tpr)
In the following screenshot, we can compare the ROC curves of all the models
we've trained:
16. Plot the ROC curves for the ensemble results on the count and TF-IDF data:
predicted_values_count = np.array([rf_count_predicted_values,
                                   nb_count_predicted_values,
                                   svc_count_predicted_values])
predicted_values_tfidf = np.array([rf_tfidf_predicted_values,
                                   nb_tfidf_predicted_values,
                                   svc_tfidf_predicted_values])

# Max-vote by taking the mode of the stacked predictions
predicted_values_count = mode(predicted_values_count)
predicted_values_tfidf = mode(predicted_values_tfidf)
18. Plot the test accuracy for each of the models trained on the count data and the TF-IDF data:
count = np.array([rf_count_test_accuracy,
                  nb_count_test_accuracy,
                  svc_count_test_accuracy,
                  accuracy_score(Y_test, predicted_values_count[0][0])])

tfidf = np.array([rf_tfidf_test_accuracy,
                  nb_tfidf_test_accuracy,
                  svc_tfidf_test_accuracy,
                  accuracy_score(Y_test, predicted_values_tfidf[0][0])])
plt.xticks([0,1,2,3],label_list)
for i in range(4):
plt.text(x=i,y=(count[i]+0.001), s=np.round(count[i],4))
for i in range(4):
plt.text(x=i,y=tfidf[i]-0.003, s=np.round(tfidf[i],4))
plt.legend(["Count","TFIDF"])
plt.title("Test accuracy")
The following plot shows the accuracy comparison between the count data and
the TF-IDF data across all models and the ensemble result:
How it works...
We started by importing the required libraries. In this chapter, we used a module called glob. The glob module is used to match a specified pattern against paths, directories, and filenames. We used the glob module to look for all the files in a specified path. After that, we used the open() method to open each file in read mode. We read each file and appended it to form a dataset with all the review comments. We also created a label column to tag each review with a positive or negative tag.
However, after we appended all the positive and negative reviews, we noticed that they
were added sequentially, which means the first half held all the positive reviews and the
second half contained the negative reviews. We shuffled the data using the shuffle()
method.
We cleaned our data by converting it to lowercase, removing the punctuation and stop
words, performing stemming, and tokenizing the texts to create feature vectors.
In the How to do it... section, we started by importing the libraries in Step 1. In Step 2, we
separated our target and feature variables into X and Y.
We split our data into train and test subsets in Step 3. We used test_size=.3 to split the
data into train and test subsets.
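The count_train/count_test and tfidf_train/tfidf_test matrices used by the models come from CountVectorizer and TfidfVectorizer, as in the previous recipe. The following is a minimal sketch of that vectorization step; using the cleaning function as the analyzer is an assumption, as are the X_train and X_test names for the split features:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Count features
count_vectorizer = CountVectorizer(analyzer=clean_text)
count_train = count_vectorizer.fit_transform(X_train)
count_test = count_vectorizer.transform(X_test)

# TF-IDF features
tfidf_vectorizer = TfidfVectorizer(analyzer=clean_text)
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)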
In Step 6, we set our hyperparameters for grid search to train a random forest model. We trained our random forest model on the count data and checked our train and test accuracy. We used the predict() and predict_proba() methods for all the models we built to predict the class as well as the class probabilities.
In Step 7, we generated the confusion matrix to evaluate the model's performance for the
random forest model we built in the preceding step. In Step 8 and Step 9, we repeated the
training for another random forest model on the TF-IDF data and evaluated the
performance. We trained the Naive Bayes model on the count data and the TF-IDF data
from Step 10 through to Step 12.
In Step 13 and Step 14, we trained the support vector classifier algorithm with the linear kernel on the count data and the TF-IDF data, respectively. In Step 15, we plotted the ROC curves with the AUC score for each of the base learners we built. We also plotted the ROC curves for the ensemble in Step 16 to compare its performance with the base learners. Finally, in Step 17, we plotted the test accuracy of each of the models on the count and TF-IDF data.
There's more...
In today's world, the availability and flow of textual information is limitless. This means we need various techniques to deal with this text and extract meaningful information. For example, part-of-speech (POS) tagging is one of the fundamental tasks in the NLP space. POS tagging is used to label words in a text with their respective parts of speech. These tags may then be used in more complex tasks, such as syntactic and semantic parsing, machine translation (MT), and question answering.
The main parts of speech are as follows:
Nouns
Pronouns
Adjectives
Verbs
Adverbs
Prepositions
Conjunctions
Interjections
The NLTK library has functions to get POS tags that can be applied to texts after
tokenization. Let's import the required libraries:
import os
import pandas as pd
import nltk
from nltk.tag import pos_tag
from nltk.corpus import stopwords
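A minimal sketch of applying tokenization and POS tagging to the review texts follows; building tokenized_sent and postag this way is an assumption that is consistent with the indexing used below:

# Tokenize each review, then tag every token with its part of speech.
# Assumes the required NLTK resources (for example, 'punkt' and the tagger)
# have been downloaded with nltk.download().
tokenized_sent = [nltk.word_tokenize(review) for review in df_moviereviews["text"]]
postag = [pos_tag(tokens) for tokens in tokenized_sent]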
We take a look at the list of the first 10 tokens from the first movie review:
tokenized_sent[0][0:10]
We print the first 10 POS tags for the first movie review:
postag[0][0:10]
Chunking is another process that can add more structure to POS tagging. Chunking is used
for entity detection; it tags multiple tokens to recognize them as meaningful entities. There
are various chunkers available; NLTK provides ne_chunk, which recognizes people
(names), places, and organizations. Other frequently used chunkers include OpenNLP,
Yamcha, and Lingpipe. It's also possible to use a combination of chunkers and apply max-
voting on the results to improve the classification's performance.
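For instance, a small sketch of named-entity chunking with NLTK's ne_chunk (the sentence here is illustrative) looks like this:

import nltk

# Assumes the NLTK chunker resources have been downloaded with nltk.download()
sentence = "Packt Publishing is based in Birmingham"
tokens = nltk.word_tokenize(sentence)
tags = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tags)  # groups tagged tokens into entities such as organizations and places
print(tree)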
12
Homogenous Ensemble for Multiclass Classification Using Keras
In this chapter, we'll cover the following recipe:
Classifying fashion products with an ensemble of homogeneous models
Introduction
Many studies have been carried out on classification problems to find out how to obtain better classification accuracy. The problem tends to be more complex when there's a large number of classes on which to make a prediction. In the case of multiclass classification, it's assumed that each class in the target variable is independent of the others. A multiclass classification technique involves training one or more models to classify a target variable that can take more than two classes.
In this recipe, we'll use the Fashion-MNIST dataset, in which each image belongs to one of the following ten classes:
T-shirt/top
Trouser
Pullover
Dress
Coat
Sandal
Shirt
Sneakers
Bag
Ankle boot
Each image is a 28 x 28 grayscale image. We will proceed by reading the data to build a few
homogeneous models over a few iterations to see whether the ensemble can deliver a
higher accuracy.
Getting ready
We'll use Google Colab to train our models. Google Colab comes with TensorFlow
installed, so we don't have to install it separately in our system.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import tensorflow as tf
from tensorflow import keras
from sklearn.utils import resample
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from scipy import stats
We load our data from the datasets that come with tf.keras:
# Load the fashion-mnist pre-shuffled train data and test data
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
# ...(tail of the image-plotting loop; the full loop is in the code bundle)
plt.show()
With the preceding code, we plot the first 15 images, along with the associated labels:
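A minimal sketch of such a plotting loop is shown here; the 3 x 5 grid layout is an assumption, and the class-name list simply follows the ten classes listed earlier:

class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneakers', 'Bag', 'Ankle boot']

plt.figure(figsize=(10, 6))
for i in range(15):
    plt.subplot(3, 5, i + 1)
    plt.imshow(x_train[i], cmap='gray')   # plot the image in grayscale
    plt.title(class_names[y_train[i]])    # show the actual class label
    plt.axis('off')
plt.show()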
How to do it...
1. In the following code block, we'll create multiple homogeneous models over a
few iterations using tf.keras:
accuracy = pd.DataFrame(columns=["Accuracy", "Precision", "Recall"])
predictions = np.zeros(shape=(10000, 7))
row_index = 0
for i in range(7):
    # Bootstrap sampling: draw 40,000 samples with replacement
    x_boot, y_boot = resample(x_train, y_train, replace=True,
                              n_samples=40000, random_state=None)
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(256, activation=tf.nn.relu),
        tf.keras.layers.Dense(128, activation=tf.nn.relu),
        tf.keras.layers.Dense(128, activation=tf.nn.relu),
        tf.keras.layers.Dense(128, activation=tf.nn.relu),
        tf.keras.layers.Dense(128, activation=tf.nn.relu),
        tf.keras.layers.Dense(128, activation=tf.nn.relu),
        tf.keras.layers.Dense(128, activation=tf.nn.relu),
        tf.keras.layers.Dense(128, activation=tf.nn.relu),
        tf.keras.layers.Dense(128, activation=tf.nn.relu),
        tf.keras.layers.Dense(128, activation=tf.nn.relu),
        tf.keras.layers.Dense(10, activation=tf.nn.softmax)])

    # Compile the model
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])

    # Train the model on the bootstrap sample
    model.fit(x_boot, y_boot, epochs=10, batch_size=64)

    # Evaluate accuracy on the test data
    score = model.evaluate(x_test, y_test, batch_size=64)
    accuracy.loc[row_index, "Accuracy"] = score[1]

    # Make predictions and compute weighted precision and recall
    model_pred = model.predict(x_test)
    pred_classes = model_pred.argmax(axis=-1)
    accuracy.loc[row_index, 'Precision'] = precision_score(y_test, pred_classes,
                                                           average='weighted')
    accuracy.loc[row_index, 'Recall'] = recall_score(y_test, pred_classes,
                                                     average='weighted')

    # Save predictions to the predictions array
    predictions[:, i] = pred_classes

    print(score)
    row_index += 1
We run seven iterations, with 10 epochs in each iteration. In the following screenshot, we can see the progress as the model gets trained:
2. With the code in Step 1, we collate the accuracy, precision, and recall for every
iteration on the test data:
accuracy
In the following screenshot, we can see how the preceding three metrics
change in each iteration:
3. We'll form a DataFrame with the predictions that are returned by all of the
models in each iteration:
# Create dataframe using prediction of each iteration
df_iteration = pd.DataFrame([predictions[:,0],\
predictions[:,1],\
predictions[:,2],\
predictions[:,3],\
predictions[:,4],\
predictions[:,5],\
predictions[:,6]])
5. We perform max-voting to identify the most predicted class for each observation.
We simply use mode to find out which class was predicted the most times for an
observation:
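A minimal sketch of this max-voting step, assuming scipy.stats.mode is applied column-wise to df_iteration (which is consistent with the mode[0].T indexing used a little later), is:

# Max-voting: take the most frequent predicted class for each observation (each column)
mode = stats.mode(df_iteration)
# mode[0] holds the voted class for every observation
print(accuracy_score(y_test, mode[0].T))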
accuracy["Models"]=["Model 1",\
"Model 2",\
"Model 3",\
"Model 4",\
"Model 5",\
"Model 6",\
"Model 7"]
10. We then combine the accuracy, precision, and recall in one single table:
accuracy = accuracy.append(pd.DataFrame([[accuracy_score(y_test, mode[0].T),
                                          0, 0, "Ensemble Model"]],
                                         columns=["Accuracy", "Precision",
                                                  "Recall", "Models"]))
accuracy.index = range(accuracy.shape[0])
In the following screenshot, we can see the structure that holds the metrics from
each of the models and the ensemble model:
11. We plot the accuracy returned by each iteration and the accuracy from max-
voting:
plt.figure(figsize=(20,8))
plt.plot(accuracy.Models,accuracy.Accuracy)
plt.title("Accuracy across all Iterations and Ensemble")
plt.ylabel("Accuracy")
plt.show()
This gives us the following plot. We notice that the accuracy returned by the max-
voting method is the highest compared to individual models:
12. We also plot the precision and recall for each model and the ensemble:
plt.figure(figsize=(20,8))
plt.plot(accuracy.Models, accuracy.Accuracy, accuracy.Models, accuracy.Precision)
From the preceding screenshot, we notice that the precision and recall improve for an
ensemble model.
How it works...
In the Getting ready section, we imported our required libraries. Note that we've imported the TensorFlow library. We can directly access the datasets by importing the tf.keras.datasets module. This module comes with various built-in datasets, including the following:
boston_housing
cifar10
cifar100
fashion_mnist
imdb
mnist
reuters
We used the fashion_mnist dataset from this module. We loaded the pre-shuffled train and test data and checked the shape of the train and test subsets.
We noticed, in the Getting ready section, that the shape of the training subset is (60000, 28, 28), which means that we have 60,000 images that are 28 x 28 pixels in size.
We checked the distinct levels in the target variable with the unique() method. We saw
that there were 10 classes from 0 to 9.
We also took a quick look at some of the images. We defined the number of columns and
rows that we required. Running an iteration, we plotted the images with
matplotlib.pyplot.imshow() in grayscale. We also printed the actual class labels
against each of the images using matplotlib.pyplot.title().
In the How to do it... section, in Step 1, we created multiple homogeneous models using
the tf.keras module. In each iteration, we used the resample() method to create
bootstrap samples. We passed replace=True to the resample() method to ensure that
we have samples with replacement.
In this step, we also defined the model architecture. We added layers to the model using
tf.keras.layers. In each layer, we defined the number of units.
We ran through a few iterations in our example. We set the number of iterations. In each
iteration, we compiled the model and fit it to our training data. We made predictions on
our test data and captured the following metrics in a DataFrame:
Accuracy
Precision
Recall
We've used Rectified Linear Units (ReLU) as the activation function for the hidden layers. ReLU is defined as f(x) = max(0, x). In neural networks, ReLU is recommended as the default activation function.
Note that, in the last layer of the model architecture, we've used softmax
as the activation function. The softmax function can be considered a
generalization of the sigmoid function. While the sigmoid function is used
to represent a probability distribution of a dichotomous variable, the
softmax function is used to represent a probability distribution of a target
variable with more than two classes. When the softmax function is used
for multi-class classification, it returns a probability value between 0 and 1
for each class. The sum of all probabilities will be equal to one.
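As a small numeric illustration of the softmax behaviour described above (the logits are arbitrary):

import numpy as np

logits = np.array([2.0, 1.0, 0.1])
softmax = np.exp(logits) / np.sum(np.exp(logits))
print(softmax)        # approximately [0.659 0.242 0.099]
print(softmax.sum())  # the probabilities sum to 1.0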
In Step 2, we checked the structure of the accuracy DataFrame that we created in Step 1. We noticed that we had three columns for accuracy, precision, and recall, and that the metrics for each iteration were captured. In Step 3, we converted the datatypes in the DataFrame to integers.
In Step 5, we checked the accuracy of the model with the max-voted predictions. In Step 6
and Step 7, we generated the confusion matrix to visualize the correct predictions. The
diagonal elements in the plot were the correct predictions, while the off-diagonal elements
were the misclassifications. We saw that there was a higher number of correct
classifications compared to misclassifications.
In Step 8 and Step 9, we proceeded to create a structure to hold the performance metrics
(accuracy, precision, and recall), along with the labels for each iteration and the ensemble.
We used this structure to plot our charts for the performance metrics.
In Step 10, we plotted the accuracy for each iteration and the max-voted predictions.
Similarly, in Step 11, we plotted the precision and recall for each iteration and the max-
voted predictions.
From the plots we generated in Step 10 and Step 11, we noticed how the accuracy, precision,
and recall improved for the max-voted predictions.
See also
The tf.keras module provides us with TensorFlow-specific functionality, such as eager-
execution, data pipelines, and estimators. You can take a look at the various options
the tf.keras module provides us.
In the present day, the Adam optimizer is one of the best optimizers. It's an extension of Stochastic Gradient Descent (SGD). SGD uses a single learning rate for all weight updates, and the learning rate remains unchanged during the model training process. The Adam algorithm, by contrast, uses adaptive learning rate methods to compute an individual learning rate for each parameter.
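If you prefer to configure the optimizer explicitly rather than passing the 'adam' string, you can pass an optimizer instance to compile(); a minimal sketch is:

# Pass an Adam optimizer instance; hyperparameters such as the learning rate
# can be set on the instance if required
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=tf.keras.optimizers.Adam(),
              metrics=['accuracy'])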
The tf.keras.losses module provides us with various options so that we can choose our
loss function. We used sparse_categorical_crossentropy. Depending on your task,
you might opt for other options, such
as binary_crossentropy, categorical_crossentropy, mean_squared_error, and so
on.
You can get more detailed information about the other hyperparameters that can be used
with tf.keras at https://www.tensorflow.org/api_docs/python/tf/keras.