Practitioner’s Guide to
Data Science
Contents

List of Figures
Preface
1 Introduction
  1.1 A Brief History of Data Science
  1.2 Data Science Role and Skill Tracks
    1.2.1 Engineering
    1.2.2 Analysis
    1.2.3 Modeling/Inference
  1.3 What Kind of Questions Can Data Science Solve?
    1.3.1 Prerequisites
    1.3.2 Problem Type
  1.4 Structure of Data Science Team
  1.5 Data Science Roles
5 Data Pre-processing
  5.1 Data Cleaning
  5.2 Missing Values
    5.2.1 Impute missing values with median/mode
    5.2.2 K-nearest neighbors
Appendix
Bibliography
Index
List of Figures

10.1 Test mean squared error for the ridge regression
10.2 Test mean squared error for the lasso regression
Preface
To reproduce the code for big data, such as running the R notebook, you need
to set up Spark in Databricks. Follow the instructions in section
4.3 on setting up and using the Spark environment.
Then, run the "Create Spark Data" notebook to create Spark data
frames. After that, you can run the pyspark notebook to learn how
to use pyspark.
Complementary Reading
If you are new to R, we recommend R for Marketing Research
and Analytics by Chris Chapman and Elea McDonnell Feit. The
book is practical and provides repeatable R code. Parts I and II of the
book cover the basics of R programming and foundational statistics.
It is an excellent book on marketing analytics.
If you are new to Python, we recommend the Python version
of the book mentioned above, Python for Marketing Research and
Analytics by Jason Schwarz, Chris Chapman, and Elea McDonnell
Feit.
If you want to dive deeper into some of the book’s topics, there
are many places to learn more.
• For machine learning, Python Machine Learning, 3rd Edition by
Raschka and Mirjalili is a good book on implementing machine
learning in Python. Applied Predictive Modeling by Kuhn and
Johnson is an applied, practitioner-friendly textbook using the R
package caret.
• For statistics models in R, a recommended book is An Introduc-
tion to Statistical Learning (ISL) by James, Witten, Hastie, and
Tibshirani. A more advanced treatment of the topics in ISL is
The Elements of Statistical Learning by Friedman, Tibshirani,
and Hastie.
About the Authors
1 Introduction
“When you’re fundraising, it’s AI. When you’re hiring, it’s ML.
When you’re implementing, it’s logistic regression.”
For outsiders, data science is the magic that can extract useful
information from data. Everyone is familiar with the concept of big
data. Data science trainees are now expected to have the skills to manage
large data sets. These skills may include Hadoop, a system that
uses Map/Reduce to process large data sets distributed across a
cluster of computers, or Spark, a system that builds on top of
Hadoop and speeds up the process by loading massive data sets into
shared memory (RAM) across clusters, with an additional suite of
machine learning functions for big data.
The new skills are essential for dealing with large data sets beyond
a single computer's memory or hard disk and for large-scale
cluster computing. However, they are not necessary for deriving
meaningful insights from data.

A lot of data means more sophisticated tinkering with computers,
especially a cluster of computers. The computing and programming
skills needed to handle big data were the biggest hurdle for
traditional analysis practitioners to become successful data scientists.
However, this barrier has been significantly lowered thanks to the
cloud computing revolution, as discussed in Chapter 2. After all,
it isn't the size of the data that's important, but what you do
with it. You may be feeling a mix of skepticism and confusion. We
understand; we had the same reaction.
To declutter, let’s start with a brief history of data science. If you
search on Google Trends, which shows search keyword informa-
tion over time, the term “data science” dates back further than
2004. Media coverage may give the impression that machine learn-
ing algorithms are a recent invention and that there was no “big”
data before Google. However, this is not true. While there are new
and exciting developments in data science, many of the techniques
we use are based on decades of work by statisticians, computer
scientists, mathematicians, and scientists from a variety of other
fields.
In the early 19th century, Legendre and Gauss came up with the
least squares method for linear regression. At the time, it was
mainly used by physicists to fit their data. Nowadays, nearly anyone
can build linear regression models using spreadsheets with just
a little bit of self-guided online training.
In 1936, Fisher came up with linear discriminant analysis. In the
1940s, logistic regression became a widely used model. Then, in the
1970s, Nelder and Wedderburn formulated the "generalized linear
model (GLM)," which unifies models such as linear and logistic regression
under a common framework.
Most of the classical statistical models are of the first type, stochastic
data models. Black-box models, such as random forest, gradient
boosting machine (GBM), and deep learning, are algorithmic
models. As Breiman pointed out, algorithmic models can be used
on large, complex data as a more accurate and informative alternative
to stochastic modeling on smaller datasets. These algorithms
have developed rapidly, with much-expanded applications in fields
outside of traditional statistics, which is one of the most important
reasons why statisticians are not in the mainstream of today's data
science, both in theory and practice.
Python is overtaking R as the most popular language in data
science, mainly due to the backgrounds of many data scientists.
Since 2000, the approaches to getting information out of data have
shifted from traditional statistical models to a more diverse tool-
box that includes machine learning and deep learning models. To
help readers who are traditional data practitioners, we provide
both R and Python code.
What is the driving force behind the shifting trend? John Tukey
identified four forces driving data analysis (there was no “data
science” when this was written in 1962):
“We don’t see things as they are, we see them as we are. [by
Anais Nin]”
When people talk about all the machine learning and artificial
intelligence algorithms, they often overlook the critical data engi-
neering part that makes everything possible. Data engineering is
the unseen iceberg under the water surface. Does your company
need a data scientist? You are only ready for a data scientist if
you have a data engineer. You need to have the ability to get data
before making sense of it. If you only deal with small datasets
with formatted data, you can get by with plain text files such
as CSV (i.e., comma-separated values) or spreadsheets. As the
data increases in volume, variety, and velocity, data engineering
becomes a sophisticated discipline in its own right.

¹ This is based on "Industry recommendations for academic data science
programs" (https://github.com/brohrer/academic_advisory), a collection
of thoughts from different data scientists across industries about what a data
scientist does and what differentiates an exceptional data scientist.
1.2.1 Engineering
Data engineering is the foundation that makes everything else
possible (figure 1.2). It mainly involves building data infrastruc-
tures and pipelines. In the past, when data was stored on local
servers, computers, or other devices, constructing the data infras-
tructure was a major IT project. This included software, hardware
for servers to store the data, and the ETL (extract, transform, and
load) process.
With the advent of cloud computing, the new standard for stor-
ing and computing data is on the cloud. Data engineering today
is essentially software engineering with data flow as the primary
focus. The fundamental element for automation is maintaining the
data pipeline through modular, well-commented code, and version
control.
1.2.2 Analysis
Analysis turns raw data into meaningful insights through a fast
and often exploratory approach. To excel as an analyst, one must
possess a solid understanding of the relevant domain, perform ex-
ploratory analysis efficiently, and be able to communicate findings
through compelling storytelling (figure 1.3).
Domain knowledge is the knowledge of the industry where you apply data
science. You can't make sense of data without context. Some questions
about the context are:
• What are the critical metrics for this kind of business?
• What are the business questions?
• What type of data do they have, and what does the data repre-
sent?
• How to translate a business need to a data problem?
• What has been tried before, and with what results?
• What are the accuracy-cost-time trade-offs?
• How can things fail?
• What are other factors not accounted for?
• Which assumptions are reasonable, and which are faulty?
Domain knowledge helps you to deliver the results in an audience-
friendly way with the right solution to the right problem.
1.2.3 Modeling/Inference
Modeling/inference is a process that dives deeper into the data to
discover patterns that are not easily seen. It is often misunderstood.
When people think of data science, they may immediately think of
complex machine learning models. Despite the overrepresentation
of machine learning in the public’s mind, the truth is that you
don’t have to use machine learning to be a data scientist. Even
data scientists who use machine learning in their work spend less
than 20% of their time working on machine learning. They spend
most of their time communicating with different stakeholders and
collecting and cleaning data.
This track mainly focuses on three problems: (1) prediction, (2)
explanation, and (3) causal inference (figure 1.4).
Prediction focuses on forecasting based on what has happened, and
understanding each variable's role is not a concern. Many black-box
models, such as ensemble methods and deep learning, are often
used to make predictions. Examples of such problems are image
recognition, machine translation, and recommendation. Despite the
remarkable success of many deep-learning models, they operate
almost entirely as black boxes.
Getting data from different sources and dumping them into a data
lake. A data lake is a storage repository that stores a vast amount
of raw data in its native format, including XML, JSON, CSV,
Parquet, etc. It is a data cesspool rather than a data lake. The
data engineer’s job is to get a clean schema out of the data lake by
transforming and formatting the data. Some common problems to
resolve are
• Enforce new tables’ schema to be the desired one
• Repair broken records in newly inserted data
• Aggregate the data to form the tables with a proper granularity
One cannot make a silk purse out of a sow's ear. Data scientists need
relevant and accurate data. The supply problem mentioned above
is a case in point. There was relevant data, but it was not sound. All
the later analytics based on that data was building on sand. Of
course, data almost always has noise, but it has to be within a certain
range. Generally speaking, the accuracy requirement for the independent
variables of interest and the response variable is higher than
for others. For the above question 2, these are the variables related to the
"new promotion" and "sales of P1197."
The data has to be helpful for the question. If we want to predict
which product consumers are most likely to buy in the next three
months, we need historical purchasing data: the last purchase
time, invoice amounts, coupon usage, etc. Information about
customers' credit card numbers, ID numbers, and email addresses
will not help much.

Often, data quality is more important than quantity, but
you cannot completely overlook quantity. Assuming you can guarantee
data quality, the more data, the better.
1. Description
2. Comparison
3. Clustering
4. Classification
You don't need to try all the models, just several models that
generally perform well. For example, the random forest algorithm is
usually used as a baseline model to set model performance expectations.
5. Regression
6. Optimization
When data and analytics are critical to the business, you can't afford to
outsource them. Also, each company has its own business context, and it
needs new kinds of data as the business grows and uses the results in
novel ways. Being a data-driven organization requires cross-organization
commitments to identify what data each department needs to collect,
establish the infrastructure and process for collecting and maintaining
that data, and standardize how to deliver analytical results. Unfortunately,
it is unlikely that an off-the-shelf solution will be flexible
enough to adapt to the specific business context. In general, most
companies establish their own data science team.
Where should the data science team fit? In general, the data science
team is organized in three ways.

The first option is that there is no standalone data science team; each
team hires its own data scientists.
Roles and typical skills:

Data infrastructure engineer: Go, Python, AWS/Google Cloud/Azure, logstash, Kafka, and Hadoop

Data engineer: Spark/Scala, Python, SQL, AWS/Google Cloud/Azure, data modeling

BI engineer: Tableau/Looker/Mode, etc., data visualization, SQL, Python

Data analyst: SQL, basic statistics, data visualization

Data scientist: R/Python, SQL, basic applied statistics, data visualization, experimental design

Research scientist: R/Python, advanced statistics, experimental design, ML, research background, publications, conference contributions, algorithms

Applied scientist: ML algorithm design, often with an expectation of fundamental software engineering skills

Machine learning engineer: more advanced software engineering skillset, algorithms, machine learning algorithm design, system design
The above table shows some data science roles and common tech-
nical keywords in job descriptions. Those roles are different in the
following key aspects:
• How much business knowledge is required?
• Does it need to deploy code in the production environment?
• How frequently is data updated?
FIGURE 1.5: Different roles in data science and the skill requirements
Data analysts are technical but not engineers. They analyze ad hoc data and
deliver the results through presentations. The data is, most of
the time, structured. They need to know coding basics (SQL or
R/Python), but they rarely need to write production-level code.
This role was often mixed up with "data scientist" by many companies but
is now much better defined in mature companies.
The most significant difference between a data analyst and a data
scientist is the requirement of mathematics and statistics. Most
data scientists have a quantitative background and do A/B exper-
iments and sometimes machine learning models. Data analysts usu-
ally don’t need a quantitative background or an advanced degree.
The analytics they do are primarily descriptive with visualizations.
They mainly handle structured and ad hoc data.
Research scientists are experts who have a research background.
They do rigorous analysis and make causal inferences by framing
experiments and developing hypotheses, and proving whether they
are true or not. They are researchers that can create new models
and publish peer-reviewed papers. Most of the small/mid compa-
nies don’t have this role.
Applied scientist is the role that aims to fill the gap between
data/research scientists and data engineers. They have a decent
scientific background but are also experts in applying their knowl-
edge and implementing solutions at scale. They have a different
focus than research scientists. Instead of scientific discovery, they
focus on real-life applications. They usually need to pass a coding
bar.
In the past, some data scientist roles encapsulated statistics, machine
learning, and algorithmic knowledge, including taking models
from proof of concept to production. More recently, however,
some of these responsibilities have become more common in another
role: machine learning engineer. Larger companies often distinguish
between data scientist and machine learning engineer roles.
Machine learning engineer roles deal more with the algorithmic
and machine learning side and strongly emphasize software
engineering. In contrast, data scientist roles emphasize analytics.

2 Soft Skills for Data Scientists
Data scientists need to communicate at many levels: from the big picture
high above the ground down to the details at the very bottom. To
convert a business question into a data science problem, a data scientist
needs to communicate using language other people can
understand and obtain the required information through formal
and informal conversations.
In the entire data science project cycle, including defining, planning,
developing, and implementing, every step needs a data scientist
involved to ensure the whole team can correctly determine
the business problem and reasonably evaluate the business value
and chance of success. Corporations are investing heavily in data science and
machine learning, and there is a very high expectation of return
on the investment.

However, it is easy to set an unrealistic goal and an inflated estimate
of a data science project's business impact. The team's data scientist
should lead and navigate the discussions to ensure that data and
analytics, not wishful thinking, back the goal. Many data science
projects over-promise on business value and are too optimistic
about the delivery timeline. These projects eventually fail by not
delivering the promised business impact within the promised timeline.
As data scientists, we need to identify these issues early and
communicate with the entire team to ensure the project has a
realistic deliverable and timeline. The data science team also needs
to work closely with data owners on many fronts: for example,
identifying relevant internal and external data sources, evaluating the
data's quality and relevance to the project, and working closely with
the infrastructure team to understand the available computation resources
(i.e., hardware and software). It is easy to create scalable
computation resources through cloud infrastructure for a
data science project. However, you need to evaluate the cost of the dedicated
computation resources and make sure it fits the budget.
In summary, data science projects are much more than data and
analytics. A successful project requires a data scientist to lead
many aspects of the project.
The types of data used and the final model development define the
different kinds of data science projects.
2.4.1.1 Offline and Online Data
There are offline and online data. Offline data are historical data
stored in databases or data warehouses. With the development of
data storage techniques, the cost to store a large amount of data
is low. Offline data are versatile and rich in general (for example,
websites may track and keep each user’s mouse position, click and
typing information while the user is visiting the website). The data
is usually stored in a distributed system, and it can be extracted
in batch to create features used in model training.
Online data are real-time information that flows into models to trigger
automatic actions. Real-time data can frequently change (for example,
the keywords a customer is searching for can change at any
given time). Capturing and using real-time online data requires the
integration of a machine learning model into the production infrastructure.
familiar with computer engineering, but the cloud infrastructure
makes it much more manageable. Based on the offline and online
data and model properties, we can separate data science projects
into three different categories as described below.
2.4.1.2 Offline Training and Offline Application

When the model is trained on offline historical data and the output is a
report, there is no need for real-time execution. Usually, there is no
run-time constraint on the machine learning model unless it runs beyond
a reasonable time frame, such as a few days. We can call this type of
data science project an "offline training, offline application" project.
2.4.1.3 Offline Training and Online Application
Another type of data science project uses offline data for train-
ing and applies the trained model to real-time online data in the
production environment. For example, we can use historical data
to train a personalized advertisement recommendation model that
provides a real-time ad recommendation. The model training uses
historical offline data. The trained model then takes customers'
online real-time data as input features and runs in real-time
to provide an automatic action. The model training is very
similar to the "offline training, offline application" project. But to
put the trained model into production, there are specific requirements:
for example, the features used in offline training have
to be available online in real-time, and the model's online run-time has
to be short enough not to impact the user experience. In most
cases, data science projects in this category create continuous and
scalable business value as the model could run millions of times
a day. We will use this type of data science project to describe
the typical data science project cycle from section 2.4.2 to section
2.4.5.
2.4.1.4 Online Training and Online Application
(1) the business team, which may include members from the
business operation team, business analytics, insight, and
metrics reporting team;
(2) the technology team, which may include members from
the database and data warehouse team, data engineering
team, infrastructure team, core machine learning team,
and software development team;
(3) the project, program, and product management team de-
pending on the scope of the data science project.
1. Shadow mode
2. A/B testing
Data-related work, such as collecting, cleaning, and preprocessing the data,
can take 60% to 80% of the total time for a given data science project,
but people often don't realize that.
When there is already a lot of data collected across the organization,
people assume we have enough data for everything. This
leads to the mistake of being too optimistic about data availability
and quality. We do not need "big data," but data that can help us
solve the problem. The data available may be of low quality, and
we need to put substantial effort into cleaning it before we
can use it. There are "unexpected" efforts needed to bring the right and
relevant data to a specific data science project. To ensure smooth
delivery of data science projects, we need to account for the "unexpected"
work at the planning stage. Data scientists all know that data
preprocessing and feature engineering are usually the most time-consuming
parts of a data science project. However, people outside
data science are often not aware of it, and we need to educate other
team members and the leadership team.
When the model runs in real time, each instance's total run time (i.e.,
model latency) should not impact the customer's user experience. Nobody
wants to wait even one second to see the results after clicking the "search"
button. In the production stage, feature availability is crucial for running
a real-time model. Engineering resources are essential for model
production. However, in traditional companies, it is common for
a data science project to fail to scale in real-time applications
due to a lack of computation capacity, engineering resources, or a
non-tech culture and environment.
As the business problem evolves rapidly, the data and model in
the production environment need to change accordingly, or the
model’s performance deteriorates over time. The online production
environment is more complicated than model training and testing.
For example, when we pull online features from different resources,
some may be missing at a specific time; the model may run into a
time-out zone; and various software versions can cause problems.
We need regular checkups during the entire life cycle of the model,
from implementation to retirement. Unfortunately, people often
don't set up a monitoring system for data science projects, which is
another common mistake: missing necessary online checkups.
It is essential to set up a monitoring dashboard and automatic alarms,
and to create model tuning, re-training, and retirement plans.
3 Introduction to the Data

1. Demography
   • age: age of the respondent
   • gender: male/female
   • house: 0/1 variable indicating if the customer owns a house or not

2. Sales in the past year
   • store_exp: expense in store
   • online_exp: expense online
   • store_trans: times of store purchase
   • online_trans: times of online purchase

3. Survey on product preference
1. Strongly disagree
2. Disagree
3. Neither agree nor disagree
4. Agree
5. Strongly agree
• Q1. I like to buy clothes from different brands
• Q2. I buy almost all my clothes from some of my favorite brands
• Q3. I like to buy premium brands
• Q4. Quality is the most important factor in my purchasing deci-
sion
• Q5. Style is the most important factor in my purchasing decision
• Q6. I prefer to buy clothes in store
• Q7. I prefer to buy clothes online
• Q8. Price is important
• Q9. I like to try different styles
• Q10. I like to make decisions myself and don't need too many suggestions from others
There are 4 segments of customers:
1. Price
2. Conspicuous
3. Quality
4. Style
str(sim.dat,vec.len=3)
$$\boldsymbol{\beta}_g = (1, 0, -1) \times \gamma, \quad g = 1, \ldots, 40$$
The second forty survey questions are also important questions but
only one answer has a coefficient that is different from the other
two answers:
$$\boldsymbol{\beta}_g = (1, 0, 0) \times \gamma, \quad g = 41, \ldots, 80$$
$$\boldsymbol{\beta}^{T} = \left( \frac{40}{3},\ \underbrace{1, 0, -1}_{\text{question 1}},\ \ldots,\ \underbrace{1, 0, 0}_{\text{question 41}},\ \ldots,\ \underbrace{0, 0, 0}_{\text{question 81}},\ \ldots,\ \underbrace{0, 0, 0}_{\text{question 120}} \right) \times \gamma$$
For each value of 𝛾, 20 data sets are simulated. The bigger 𝛾 is,
the larger the corresponding parameter. We provided the data sets
with 𝛾 = 2. Let’s check the data:
## 6 1 0 0 1 1 0 1
The IMDB movie review dataset is included in the Keras library, and there
are a few built-in functions in Keras for data loading and pre-processing.
It contains 50,000 movie reviews (25,000 in training and 25,000 in testing)
from IMDB, as well as each movie review's binary sentiment: positive or
negative. The raw data contains the text of each movie review, and it has
to be pre-processed before being fitted with any machine learning models.
By using Keras's built-in functions, we can easily get the processed dataset
(i.e., a numerical data frame) for machine learning algorithms. Keras's
built-in functions perform the following tasks to convert the raw review
text into a data frame:
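The loading call itself is not shown in this extract; a minimal R sketch, assuming the keras package is installed and configured (num_words = 10000 is an illustrative choice, not a value from the book):

library(keras)
# load the pre-processed IMDB data, keeping only the 10,000 most frequent words
imdb <- dataset_imdb(num_words = 10000)
train_x <- imdb$train$x  # integer-encoded reviews
train_y <- imdb$train$y  # binary sentiment labels
test_x <- imdb$test$x
test_y <- imdb$test$y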
4 Big Data Cloud Platform

We can think of a cluster of connected computers as one powerful machine
with memory, hard disk, and CPU equivalent to the sum of the individual
computers. It is common to have hundreds or even thousands of nodes in
a cluster.
In the past, users needed to write code (such as MPI) to distribute
data and run parallel computations. Fortunately, with recent
developments, the cloud environment for big data analysis is much more
user-friendly. As the data is often beyond the size of a single hard disk,
the dataset itself is stored across different nodes (i.e., the Hadoop
system). When doing analysis, the data is distributed across different
nodes, and the algorithms run in parallel to leverage the corresponding
nodes' CPUs (i.e., the Spark system).
4.2.1 Hadoop
The very first problem internet companies faced was that a lot of data
had been collected and they needed to store it well for future
analysis. Google developed its own file system to provide efficient,
reliable access to data using large clusters of commodity hardware.
The open-source version is known as the Hadoop Distributed File System
(HDFS). Both systems use Map-Reduce to allocate computation
across computation nodes on top of the file system. Hadoop is
written in Java, and writing map-reduce jobs in Java is a direct
way to interact with Hadoop, which is not familiar to many in the
data and analytics community. To help people better use the Hadoop
system, an SQL-like data warehouse system called Hive and a scripting
language for analytics called Pig were introduced for
people with an analytics background to interact with the Hadoop system.
Within Hive, we can create user-defined functions through R or
Python to leverage the distributed and parallel computing infrastructure.
Map-reduce on top of HDFS is the main concept of the
Hadoop ecosystem. Each map-reduce operation requires retrieving
data from the hard disk, performing the computation, and storing
the result back onto the disk. So, jobs on top of Hadoop require a
lot of disk operations, which may slow down the entire computation
process.
4.2.2 Spark
Spark works on top of a distributed file system, including HDFS,
with better data and analytics efficiency by leveraging in-memory
operations. Spark is more tailored for data processing and analytics,
and the need to interact with Hadoop directly is greatly reduced.
The Spark system includes an SQL-like framework called Spark
SQL and a parallel machine learning library called MLlib. Fortunately
for many in the analytics community, Spark also supports
R and Python. We can interact with data stored in a distributed
file system using parallel computing across nodes easily with R and
Python through the Spark API, without worrying about
lower-level details of distributed computing. We will introduce how
to use an R notebook to drive Spark computations.
4.3.2 R Notebook
For this book, we will use R notebooks for examples and demos,
and the corresponding Python notebooks will be available online
too. An R notebook contains multiple cells, and, by default, the
content within each cell is R script. Usually, each cell is a
well-managed segment of a few lines of code that accomplishes a
specific task. For example, Figure 4.2 shows the default cell for
an R notebook. We can type in R scripts and comments the same as
we would in the R console. By default, only the result from the last
line will be shown following the cell. However, you can use the print()
function to output results for any line. If we move the mouse to
the middle of the lower edge of the cell, below the results, a "+"
symbol will show up, and clicking on the symbol will insert a new
cell below. When we click any area within a cell, it becomes
editable, and you will see a few icons in the top right corner of the
cell where we can run the cell, as well as add a cell below or above,
copy the cell, cut the cell, etc. One quick way to run the cell is
Shift+Enter when the cell is chosen. Users will become familiar with
the notebook environment quickly.
# Install sparklyr
if (!require("sparklyr")) {
install.packages("sparklyr")
}
# Load sparklyr package
library(sparklyr)
library(dplyr)
head(iris)
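The code that creates the Spark connection is not shown in this extract; a minimal sketch, assuming a Databricks environment as described in the preface (locally, spark_connect(master = "local") also works):

# create the Spark connection object used throughout this section
sc <- spark_connect(method = "databricks")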
In real applications, the data set may be massive and cannot fit on a
single hard disk, and most likely such data are already stored in the
Spark system. If the data is already in the Hadoop/Spark ecosystem in
the form of a Spark data frame (SDF), we can create a local R object
that links to the SDF with the tbl() function, where my_sdf is the SDF
in the Spark system and my_sdf_tbl is the local R object referring to my_sdf:
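The corresponding code is not shown in this extract; a minimal sketch of the two operations described here and in the next paragraph (the table name "my_sdf" follows the text):

# link a local R object to an existing Spark data frame named "my_sdf"
my_sdf_tbl <- dplyr::tbl(sc, "my_sdf")

# copy the local iris data frame to the Spark cluster as a Spark data frame
iris_tbl <- sdf_copy_to(sc = sc, x = iris, overwrite = TRUE)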
The above one-line code copies the iris dataset from the local node to
the Spark cluster environment. "sc" is the Spark connection we just
created; "x" is the data frame that we want to copy; "overwrite"
is the option of whether to overwrite the target object if
an SDF with the same name exists in the Spark environment. Finally,
sdf_copy_to() returns an R object representing the
copied SDF (i.e., it creates a "pointer" to the SDF such that we
can refer to iris_tbl in the R notebook to operate on the iris SDF). Now
iris_tbl in the local R environment can be used to refer to the iris
SDF in the Spark system.
To check whether the iris data was copied to the Spark environment
successfully, we can apply the src_tbls() function to the
Spark connection (sc):
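A minimal sketch of that check:

src_tbls(sc)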
or using the head() function to return the first few rows in iris_tbl:
head(iris_tbl)
iris_tbl %>%
mutate(Sepal_Add = Sepal_Length + Sepal_Width) %>%
group_by(Species) %>%
summarize(count = n(), Sepal_Add_Avg = mean(Sepal_Add))
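The plot below uses a summarized local data frame, iris_summary, that is not created in this extract; a hedged sketch of how it might be built (rounding Sepal_Width to the nearest 0.5 to form groups, then collecting the grouped summary back to the local R session):

iris_summary <- iris_tbl %>%
    # round sepal width to the nearest 0.5 to form groups
    mutate(Sepal_Width_round = round(Sepal_Width * 2) / 2) %>%
    group_by(Species, Sepal_Width_round) %>%
    summarize(count = n(),
              Sepal_Length_avg = mean(Sepal_Length),
              Sepal_Length_stdev = sd(Sepal_Length)) %>%
    collect()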
library(ggplot2)
ggplot(iris_summary, aes(Sepal_Width_round,
Sepal_Length_avg,
color = Species)) +
geom_line(size = 1.2) +
geom_errorbar(aes(ymin = Sepal_Length_avg - Sepal_Length_stdev,
ymax = Sepal_Length_avg + Sepal_Length_stdev),
width = 0.05) +
geom_text(aes(label = count),
vjust = -0.2,
hjust = 1.2,
color = "black") +
theme(legend.position="top")
After fitting the k-means model, we can apply the model to predict
other datasets through the ml_predict() function. The following code
applies the model to iris_tbl again to predict the cluster and collect
the results as a local R object (i.e., prediction) using the collect()
function:
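The model-fitting step is not shown in this extract; a hedged sketch that fits a 3-cluster k-means model on the two petal measurements and then scores iris_tbl (the object names fit2 and prediction match the plotting code below):

# fit a k-means model with 3 clusters using Spark MLlib
fit2 <- ml_kmeans(iris_tbl, ~ Petal_Length + Petal_Width, k = 3)

# score the same Spark data frame and pull the results into a local R data frame
prediction <- collect(ml_predict(fit2, iris_tbl))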
prediction %>%
ggplot(aes(Petal_Length, Petal_Width)) +
geom_point(aes(Petal_Width, Petal_Length,
col = factor(prediction + 1)),
size = 2, alpha = 0.5) +
geom_point(data = fit2$centers, aes(Petal_Width, Petal_Length),
col = scales::muted(c("red", "green", "blue")),
pch = 'x', size = 12) +
scale_color_discrete(name = "Predicted Cluster",
labels = paste("Cluster", 1:3)) +
labs(x = "Petal Length",
y = "Petal Width",
title = "K-Means Clustering",
subtitle = "Use Spark ML to predict cluster
membership with the iris dataset")
These procedures cover the basics of big data analysis that a data
scientist needs to know as a beginner. We have an R notebook on
the book website that contains the contents of this chapter. We
also have a Python notebook on the book website.
state division
Alabama East South Central
Alaska Pacific
Arizona Mountain
Arkansas West South Central
California Pacific
The results from the above query only return one row as expected.
Sometimes we want to find an aggregated value based on groups
that can be defined by one or more columns. Instead of writing
multiple SQL statements to calculate the aggregated value for each group,
we can use GROUP BY to calculate it for each group
within a single SELECT statement. For example, if we want to
find how many states are in each division, we can use the following:
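The query itself is not shown in this extract; a hedged sketch, assuming a DBI connection con to the database that holds the division table:

library(DBI)
# count the number of states in each division
dbGetQuery(con, "
    SELECT division, COUNT(*) AS n_states
    FROM division
    GROUP BY division
")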
The database system is usually designed such that each table contains
a specific piece of information, and oftentimes we need to
JOIN multiple tables to achieve a specific task. There are a few typical
types of JOINs: inner join (keep only rows that match the join
condition from both tables), left outer join (rows from the inner join +
unmatched rows from the first table), right outer join (rows from the
inner join + unmatched rows from the second table), and full outer
join (rows from the inner join + unmatched rows from both tables).
The typical JOIN statement is illustrated below:
For example, let us join the division table and the metrics table to find
the average population and income for each division, with the results
ordered by division name:
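The JOIN statement is not shown in this extract; a hedged sketch using the same assumed connection con (the population and income columns of the metrics table are assumptions based on the text):

# average population and income per division, ordered by division name
dbGetQuery(con, "
    SELECT d.division,
           AVG(m.population) AS avg_population,
           AVG(m.income)     AS avg_income
    FROM division AS d
    INNER JOIN metrics AS m
        ON d.state = m.state
    GROUP BY d.division
    ORDER BY d.division
")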
5 Data Pre-processing

In real life, depending on the stage of data cleanup, data has the
following types:
1. Raw data
2. Technically correct data
3. Data that is proper for the model
4. Summarized data
5. Data with fixed format
The raw data is the first-hand data that analysts pull from the
database, market survey responses from clients, experimental
results collected by the research and development department,
and so on. These data may be very rough, and R sometimes
can't read them directly. The table title could be multi-line, or the
format may not meet the requirements:
• Use 50% to represent the percentage rather than 0.5, so R will
read it as a character;
• The missing value of the sales is represented by “-” instead of
space so that R will treat the variable as character or factor type;
• The data is in a slideshow document, or the spreadsheet is not
“.csv” but “.xlsx”
• …
Most of the time, you need to clean the data so that R can import
it. Some data formats require a specific package. Technically
correct data is data that, after preliminary cleaning or format
conversion, R (or another tool you use) can successfully import.
Assume we have loaded the data into R with reasonable column
names, variable formats, and so on. That does not mean the data is
entirely correct. There may be some observations that do not make
sense, such as a negative age, a discount percentage greater
than 1, or missing data. Depending on the situation, there may
be a variety of problems with the data. It is necessary to clean the
data before modeling. Moreover, different models have different
requirements on the data. For example, some models may require
the variables to be on a consistent scale; some may be susceptible to
outliers or collinearity; some may not be able to handle categorical
variables; and so on. The modeler has to preprocess the data to
make it proper for the specific model.
Sometimes we need to aggregate the data. For example, add up
the daily sales to get annual sales of a product at different loca-
tions. In customer segmentation, it is common practice to build
a profile for each segment. It requires calculating some statistics
such as average age, average income, age standard deviation, etc.
Q3 Q4 Q5 Q6 Q7
Min. :1.00 Min. :1.00 Min. :1.00 Min. :1.00 Min. :1.00
1st Qu.:1.00 1st Qu.:2.00 1st Qu.:1.75 1st Qu.:1.00 1st Qu.:2.50
Median :1.00 Median :3.00 Median :4.00 Median :2.00 Median :4.00
Mean :1.99 Mean :2.76 Mean :2.94 Mean :2.45 Mean :3.43
3rd Qu.:3.00 3rd Qu.:4.00 3rd Qu.:4.00 3rd Qu.:4.00 3rd Qu.:4.00
Max. :5.00 Max. :5.00 Max. :5.00 Max. :5.00 Max. :5.00
Q8 Q9 Q10 segment
Min. :1.0 Min. :1.00 Min. :1.00 Conspicuous:200
1st Qu.:1.0 1st Qu.:2.00 1st Qu.:1.00 Price :250
Median :2.0 Median :4.00 Median :2.00 Quality :200
Mean :2.4 Mean :3.08 Mean :2.32 Style :350
3rd Qu.:3.0 3rd Qu.:4.00 3rd Qu.:3.00
Max. :5.0 Max. :5.00 Max. :5.00
age store_exp
Min. :16.00 Min. : 155.8
1st Qu.:25.00 1st Qu.: 205.1
Median :36.00 Median : 329.8
Mean :38.58 Mean : 1358.7
3rd Qu.:53.00 3rd Qu.: 597.4
Max. :69.00 Max. :50000.0
NA's :1 NA's :1
Therefore, there are not many papers on missing value imputation for
prediction models. Those who want to study the topic further can refer
to Saar-Tsechansky and Provost's comparison of different imputation
methods (Saar-Tsechansky and Provost, 2007) and the book by De Waal,
Pannekoek, and Scholtus (de Waal et al., 2011).
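The code that produced the transformed object referenced below is not shown in this extract; a hedged sketch, assuming the two variables are income and age and that they are centered and scaled with caret:

library(caret)
sdat <- sim.dat[, c("income", "age")]
# center and scale the two variables
trans_scale <- preProcess(sdat, method = c("center", "scale"))
transformed <- predict(trans_scale, sdat)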
Now the two variables are in the same scale. You can check the re-
sult using summary(transformed). Note that there are missing values.
$$skewness = \frac{\sum (x_i - \bar{x})^3}{(n-1) v^{3/2}}, \quad \text{where} \quad v = \frac{\sum (x_i - \bar{x})^2}{(n-1)}$$
A zero skewness means that the distribution is symmetric, i.e. the
probability of falling on either side of the distribution’s mean is
equal.
[Figure: density curves of a left-skewed distribution (skewness = -1.88) and a right-skewed distribution (skewness = 1.88)]
$$x^{*} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \log(x) & \text{if } \lambda = 0 \end{cases}$$
describe(sim.dat)
It is easy to see the skewed variables. If the mean and trimmed mean
differ a lot, there are very likely outliers. By default, trimmed reports
the mean after dropping the top and bottom 10%. It can be adjusted by
setting the argument trim=. It is clear that store_exp has outliers.
As an example, we will apply Box-Cox transformation on
store_trans and online_trans:
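The call is not shown in this extract; a hedged sketch using caret's preProcess(), which matches the output printed below:

trans <- preProcess(sim.dat[, c("store_trans", "online_trans")],
                    method = c("BoxCox"))
trans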
## Pre-processing:
## - Box-Cox transformation (2)
## - ignored (0)
##
## Lambda estimates for Box-Cox transformation:
## 0.1, 0.7
The last line of the output shows the estimates of 𝜆 for each vari-
able. As before, use predict() to get the transformed result:
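A minimal sketch of that step, reusing the trans object from the sketch above:

transformed <- predict(trans, sim.dat[, c("store_trans", "online_trans")])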
[Figure: histograms of store_trans before and after the Box-Cox transformation]
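The output below looks like the print method of caret's BoxCoxTrans() applied to a single variable; a hedged sketch of a call that would produce output in this format:

BoxCoxTrans(sim.dat$store_trans)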
## Box-Cox Transformation
##
## 1000 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.00 4.00 5.35 7.00 20.00
##
## Largest/Smallest: 20
## Sample Skewness: 1.11
##
## Estimated Lambda: 0.1
## With fudge factor, Lambda = 0 will be used for transformations
## [1] -0.2155
[Figure: scatterplot matrix of age, income, store_exp, online_exp, store_trans, and online_trans]
It is also easy to observe the pair relationship from the plot. age
is negatively correlated with online_trans but positively correlated
with store_trans. It seems that older people tend to purchase from
the local store. The amount of expense is positively correlated with
income. Scatterplot matrix like this can reveal lots of information
before modeling.
In addition to visualization, there are some statistical methods to
define outliers, such as the commonly used Z-score. The Z-score
for variable Y is defined as:
$$Z_i = \frac{Y_i - \bar{Y}}{s}$$
$$M_i = \frac{0.6745 (Y_i - \bar{Y})}{MAD}$$
## [1] 59
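The equation referenced in the next sentence is not shown in this extract; the standard form of the spatial sign transformation (an assumption consistent with the description that follows) is:

$$x_{ij}^{*} = \frac{x_{ij}}{\sqrt{\sum_{j=1}^{p} x_{ij}^{2}}}$$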
where $x_{ij}$ represents the $i^{th}$ observation and $j^{th}$ variable. As shown
in the equation, every observation for sample $i$ is divided by its
norm (the square root of its sum of squares). The denominator is the
Euclidean distance to the center of the p-dimensional predictor space.
Three things to pay attention to here:
# KNN imputation
sdat <- sim.dat[, c("income", "age")]
imp <- preProcess(sdat, method = c("knnImpute"), k = 5)
sdat <- predict(imp, sdat)
transformed <- spatialSign(sdat)
transformed <- as.data.frame(transformed)
par(mfrow = c(1, 2), oma = c(2, 2, 2, 2))
plot(income ~ age, data = sdat, col = "blue", main = "Before")
plot(income ~ age, data = transformed, col = "blue", main = "After")
[Figure: income versus age before and after the spatial sign transformation]
Some readers may have found that the above code does not seem
to standardize the data before transformation. Recall the introduc-
tion of KNN, preProcess() with method="knnImpute" by default will
standardize data.
5.6 Collinearity
Collinearity is probably the technical term best known by the most
un-technical people. When two predictors are very strongly correlated,
including both in a model may lead to confusion or problems with a
singular matrix. The corrplot package has an excellent function of the
same name, corrplot(), that can visualize the correlation structure of
a set of predictors. The function has the option to reorder
the variables in a way that reveals clusters of highly correlated
ones.
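The call that produced the correlation plot is not shown in this extract; a hedged sketch using corrplot.mixed() on the six numeric variables (complete cases only, since several variables contain missing values):

library(corrplot)
num_vars <- c("age", "income", "store_exp",
              "online_exp", "store_trans", "online_trans")
# pairwise correlations of the numeric variables
correlation <- cor(sim.dat[, num_vars], use = "complete.obs")
corrplot.mixed(correlation, order = "hclust", upper = "ellipse")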
[Figure: correlation plot of the numeric variables; for example, the correlation between age and online_trans is -0.74]
The closer the correlation is to 0, the lighter the color and the
closer the shape is to a circle. An elliptical shape means the correlation
is not equal to 0 (because we set upper = "ellipse"); the greater
the correlation, the narrower the ellipse. Blue represents a positive
correlation; red represents a negative correlation. The direction
of the ellipse also changes with the correlation. The correlation
coefficients are shown in the lower triangle of the matrix.
The variable relationships seen in the previous scatterplot matrix are
also clear in this plot.
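The code that produced the output below is not shown in this extract; caret's findCorrelation() is a common way to get the column indexes of predictors to drop because of high pairwise correlation, so the call was likely along these lines (the cutoff value is an assumption):

# column indexes (within the correlation matrix) of predictors to remove
findCorrelation(correlation, cutoff = 0.7)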
## [1] 2 6
# make a copy
zero_demo <- sim.dat
# add two sparse variable zero1 only has one unique value zero2 is a
# vector with the first element 1 and the rest are 0s
zero_demo$zero1 <- rep(1, nrow(zero_demo))
zero_demo$zero2 <- c(1, rep(0, nrow(zero_demo) - 1))
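The detection step is not shown in this extract; a hedged sketch using caret's nearZeroVar(), whose defaults are freqCut = 95/5 and uniqueCut = 10:

# returns the indexes of the near-zero-variance columns (the two we just added)
nearZeroVar(zero_demo, freqCut = 95/5, uniqueCut = 10)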
## [1] 20 21
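The code that produced the 0/1 matrix below is not shown in this extract; one way to get such an encoding is caret's class2ind() on the gender variable (a hedged reconstruction, not necessarily the original call):

dumVar <- class2ind(as.factor(sim.dat$gender))
head(dumVar)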
## Female Male
## [1,] 1 0
## [2,] 1 0
## [3,] 0 1
## [4,] 0 1
## [5,] 0 1
## [6,] 0 1
dummyVars() can also use the formula format. The variable on the right-hand
side can be both categorical and numeric. For a numerical
variable, the function will keep the variable unchanged. The advantage
is that you can apply the function to a data frame without
separating out the numerical variables first.

6 Data Wrangling
This chapter focuses on some of the most frequently used data ma-
nipulations and shows how to implement them in R and Python.
It is critical to explore the data with descriptive statistics (mean,
standard deviation, etc.) and data visualization before analysis.
Transform data so that the data structure is in line with the re-
quirements of the model. You also need to summarize the results
after analysis.
When the data is too large to fit in a computer's memory, we can
use a big data analytics engine like Spark on a cloud platform
(see Chapter 4). Even though the user interfaces of many data platforms
are much friendlier now, it is still easier to manipulate the data as
a local data frame. Spark's R and Python interfaces aim to keep
the data manipulation syntax consistent with popular packages for
local data frames. As shown in Section 4.4, we can run nearly all
of the dplyr functions on a Spark data frame once the
Spark environment is set up. And the Python interface pyspark uses a
similar syntax to pandas. This chapter focuses on data manipulations on
standard data frames, which is also the foundation of big data
manipulation.
Even when the data can fit in memory, there may be situations
where it is slow to read and manipulate due to its relatively
large size. Some R packages can make the process faster at the
cost of familiarity, especially for data wrangling. But they avoid the
hurdle of setting up a Spark cluster and working in an unfamiliar
environment. It is not a topic covered in this chapter, but Appendix 13
briefly introduces some of the alternative R packages to read, write, and
wrangle a data set that is relatively large but not too big to fit in
memory.
¹ https://www.tidyverse.org/packages/
1. Display
2. Subset
3. Summarize
4. Create new variable
5. Merge
# Read data
sim.dat <- read.csv("http://bit.ly/2P5gTw4")
6.1.1.1 Display
tbl_df(sim.dat)
glimpse(sim.dat)
6.1.1.2 Subset
dplyr::distinct(sim.dat)
dplyr::slice(sim.dat, 10:15)
It is equivalent to sim.dat[10:15,].
dplyr::top_n(sim.dat,2,income)
If you want to select columns instead of rows, you can use select().
The following are some sample codes:
6.1.1.3 Summarize
summarise() tells R the manipulation(s) to do. Then list the exact actions
inside summarise(). For example, Age = round(mean(na.omit(age)), 0)
tells R the following things:
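The full pipeline is not shown in this extract; a hedged sketch of what the grouped summary might look like (the exact set of summary columns is an assumption):

library(dplyr)
sim.dat %>%
    group_by(segment) %>%
    summarise(Age = round(mean(na.omit(age)), 0),
              FemalePct = round(mean(gender == "Female"), 2),
              store_exp = round(mean(na.omit(store_exp)), 0),
              online_exp = round(mean(online_exp), 0),
              store_trans = round(mean(store_trans), 1),
              online_trans = round(mean(online_trans), 1))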
(0.73). They are very likely to be digital natives and prefer online
shopping. You may notice that the Style group purchases more frequently
online (online_trans), but their expense (online_exp) is not higher. It
makes us wonder what the average expense per transaction is, so we have a
better idea of the price range of the group.
The analytical process is iterative rather than a set of independent steps.
The current step will shed new light on what to do next. Sometimes
you need to go back and fix something in the previous steps. Let's
check the average one-time online and in-store purchase amounts:
sim.dat %>%
group_by(segment) %>%
summarise(avg_online = round(sum(online_exp)/sum(online_trans), 2),
avg_store = round(sum(store_exp)/sum(store_trans), 2))
## # A tibble: 4 x 3
## segment avg_online avg_store
## <chr> <dbl> <dbl>
## 1 Conspicuous 442. 479.
## 2 Price 69.3 81.3
## 3 Quality 126. 105.
## 4 Style 92.8 121.
The Price group has the lowest average one-time purchase amount. The
Conspicuous group pays the highest price. When we build customer
profiles in real life, we also need to look at the survey summarization.
You may be surprised how much information simple data
manipulations can provide.
Another common task is to check which columns have missing values.
It requires the program to look at each column in the data. In this
case you can use summarise_all():
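A minimal sketch of that check, counting the missing values in every column:

sim.dat %>%
    dplyr::summarise_all(~ sum(is.na(.)))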
There are often situations where you need to create new variables.
For example, adding online and store expense to get total expense.
In this case, you will apply a function to the columns and return
a column with the same length. mutate() can do it for you and
append one or more new columns:
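The mutate() call itself is not shown in this extract; a minimal sketch of the step described above (the column name total_exp follows the text):

sim.dat %>%
    dplyr::mutate(total_exp = store_exp + online_exp)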
The above code sums up two columns and appends the result
(total_exp) to sim.dat. Another similar function is transmute(). The
difference is that transmute() will delete the original columns and
only keep the new ones.
6.1.1.5 Merge
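The example data frames and join calls are not shown in this extract; a hedged sketch that reproduces the pattern of the outputs below (two small tables keyed by ID, then left, inner, and full joins):

x <- data.frame(ID = c("A", "B", "C"), x1 = 1:3)
y <- data.frame(ID = c("B", "C", "D"), y1 = c(TRUE, TRUE, FALSE))

dplyr::left_join(x, y, by = "ID")   # all rows of x, matched rows of y
dplyr::inner_join(x, y, by = "ID")  # only rows present in both tables
dplyr::full_join(x, y, by = "ID")   # all rows from both tables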
## ID x1
## 1 A 1
## 2 B 2
## 3 C 3
## ID y1
## 1 B TRUE
## 2 C TRUE
## 3 D FALSE
## ID x1 y1
## 1 A 1 <NA>
## 2 B 2 TRUE
## 3 C 3 TRUE
## ID x1 y1
## 1 B 2 TRUE
## 2 C 3 TRUE
## ID x1 y1
## 1 A 1 <NA>
## 2 B 2 TRUE
## 3 C 3 TRUE
## 4 D <NA> FALSE
## simulate a matrix
x <- cbind(x1 =1:8, x2 = c(4:1, 2:5))
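The call that produced the column means below is not shown; a minimal sketch:

# apply mean() over MARGIN = 2, i.e. column by column
apply(x, 2, mean)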
## x1 x2
## 4.5 3.0
## [[1]]
##
## 1 3 7
## 2 1 1
##
## [[2]]
##
## 2 4 6 8
## 1 1 1 1
## [,1] [,2]
## 0% 1 2.0
## 25% 1 3.5
## 50% 2 5.0
## 75% 4 6.5
## 100% 7 8.0
Results can have different lengths for each call. This is a trickier
example. What will you get?
The data frame sdat only includes numeric columns. Now we can
go ahead and use apply() to get the mean and standard deviation for
each column:
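A hedged sketch of that step (na.rm = TRUE because several columns contain missing values; the exact form used in the book is not shown):

apply(sdat, 2, function(col) {
    c(mean = round(mean(col, na.rm = TRUE), 2),
      sd = round(sd(col, na.rm = TRUE), 2))
})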
Even though the average online expense is higher than the store expense,
the standard deviation of store expense is much higher than that of online
expense, which indicates there are very likely some unusually large or
small purchases in store. We can check it quickly:
summary(sdat$store_exp)
summary(sdat$online_exp)
There are some odd values in store expense. The minimum value
is -500, which indicates that you should preprocess the data before
analyzing it. Checking those simple statistics will help you better
understand your data. It then gives you some idea of how to preprocess
and analyze it. How about using lapply() and sapply()?
Run the following code and compare the results:
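The comparison code is not shown in this extract; a minimal sketch of the idea (lapply() returns a list, while sapply() simplifies the result when possible):

lapply(sdat, mean, na.rm = TRUE)
sapply(sdat, mean, na.rm = TRUE)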
sdat<-sim.dat[1:5,1:6]
sdat
For the above data sdat, what if we want to reshape the data
to have a column indicating the purchasing channel (i.e. from
store_exp or online_exp) and a second column with the correspond-
ing expense amount? Assume we want to keep the rest of the
columns the same. It is a task to change data from “wide” to
“long”.
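The reshaping code is not shown in this extract; a hedged sketch using tidyr::gather() (the object name msdat is an assumption, reused in the sketch after the separate() discussion below):

library(tidyr)
# stack store_exp and online_exp into a Channel / Expense pair of columns
msdat <- tidyr::gather(sdat, "Channel", "Expense",
                       store_exp, online_exp)
msdat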
## # A tibble: 2 x 4
## # Groups: house [1]
## house gender total_online_exp total_store_exp
## <chr> <chr> <dbl> <dbl>
## 1 Yes Female 413. 1007.
## 2 Yes Male 533. 1218.
The above code also uses the functions in the dplyr package in-
troduced in the previous section. Here we use package::function to
make clear the package name. It is not necessary if the package is
already loaded.
Another pair of functions that do opposite manipulations are sep-
arate() and unite().
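A hedged sketch of the separate() step described next, applied to the msdat object from the sketch above:

# split "Channel" (e.g. "store_exp") into "Source" ("store") and "Type" ("exp")
sepdat <- tidyr::separate(msdat, Channel, c("Source", "Type"))
head(sepdat)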
You can see that the function separates the original column
“Channel” to two new columns “Source” and “Type”. You can use
sep = to set the string or regular expression to separate the col-
umn. By default, it is “_”.
The unite() function will do the opposite: combining two columns.
It is the generalization of paste() to a data frame.
sepdat %>%
unite("Channel", Source, Type, sep = "_")
7 Model Tuning Strategy
7.1 Variance-Bias Trade-Off

$$y = f(X) + \epsilon \tag{7.1}$$
$$E(y - \hat{y})^2 = E[f(X) + \epsilon - \hat{f}(X)]^2 = \underbrace{E[f(X) - \hat{f}(X)]^2}_{(1)} + \underbrace{Var(\epsilon)}_{(2)} \tag{7.2}$$
It is also called the Mean Squared Error (MSE), where (1) is the
systematic, reducible error and (2) is the irreducible random error.
The reducible part (1) can be further decomposed as:
$$\begin{aligned} E[\hat{f}(X) - f(X)]^2 &= E\left(\hat{f}(X) - E[\hat{f}(X)] + E[\hat{f}(X)] - f(X)\right)^2 \\ &= E\left(E[\hat{f}(X)] - f(X)\right)^2 + E\left(\hat{f}(X) - E[\hat{f}(X)]\right)^2 \\ &= [Bias(\hat{f}(X))]^2 + Var(\hat{f}(X)) \end{aligned} \tag{7.3}$$
source('http://bit.ly/2KeEIg9')
# randomly simulate some non-linear samples
x = seq(1, 10, 0.01) * pi
e = rnorm(length(x), mean = 0, sd = 0.2)
fx <- sin(x) + e + sqrt(x)
dat = data.frame(x, fx)
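The plotting code is not shown in this extract; a hedged sketch that fits a straight line (high bias) and a very flexible smoother (high variance, using the same span = 0.03 as the code further below) to the simulated data:

library(ggplot2)
# linear fit: high bias, low variance
ggplot(dat, aes(x, fx)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE)

# flexible loess fit: low bias, high variance
ggplot(dat, aes(x, fx)) +
    geom_point() +
    geom_smooth(span = 0.03)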
[Figure: the simulated data (fx versus x) with the fitted curves]
The resulting plot (Fig. 7.3) indicates the smoothing method fit
the data much better and it has a much smaller bias. However,
this method has a high variance. If we simulate different subsets
of the sample, the result curve will change significantly:
# sample 2
idx2 = sample(1:length(x), 100)
dat2 = data.frame(x2 = x[idx2], fx2 = fx[idx2])
p2 = ggplot(dat2, aes(x2, fx2)) +
geom_smooth(span = 0.03) +
geom_point()
# sample 3
idx3 = sample(1:length(x), 100)
dat3 = data.frame(x3 = x[idx3], fx3 = fx[idx3])
p3 = ggplot(dat3, aes(x3, fx3)) +
geom_smooth(span = 0.03) +
geom_point()
# sample 4
idx4 = sample(1:length(x), 100)
dat4 = data.frame(x4 = x[idx4], fx4 = fx[idx4])
p4 = ggplot(dat4, aes(x4, fx4)) +
geom_smooth(span = 0.03) +
geom_point()
[Figure: the flexible smoothing fits on four random subsets (fx1 to fx4 versus x1 to x4)]
The fitted lines (blue) change over different samples which means it
has high variance. People also call it overfitting. Fitting the linear
model using the same four subsets, the result barely changes:
[Figure: the linear fits on the same four subsets (fx1 to fx4 versus x1 to x4)]
Complex models tend to overfit by learning too much from the current
sample set. Those models are susceptible to the specific sample set used
to fit them. The model prediction may be off when future data is unlike
past data. Conversely, a simple model, such as ordinary linear regression,
tends to underfit, leading to poor predictions by learning too little from
the data. It systematically over-predicts or under-predicts the data
regardless of how well future data resemble past data.
Model evaluation is essential to assess the efficacy of a model. A
modeler needs to understand how a model fits the existing data
and how it would work on future data. Also, trying multiple models
and comparing them is always a good practice. All these need data
splitting and resampling.
The simplest way is to split the data randomly, which does not control
for any data attributes. However, sometimes we may want to ensure that
training and testing data have a similar outcome distribution. For
example, suppose you want to predict the likelihood of customer
retention. In that case, you want two data sets with a similar
percentage of retained customers.
There are three main ways to split the data that account for the
similarity of resulted data sets. We will describe the three ap-
proaches using the clothing company’s customer data as examples.
# load data
sim.dat <- read.csv("http://bit.ly/2P5gTw4")
library(caret)
# set random seed to make sure reproducibility
set.seed(3456)
trainIndex <- createDataPartition(sim.dat$segment,
p = 0.8,
list = FALSE,
times = 1)
head(trainIndex)
## Resample1
## [1,] 1
## [2,] 2
## [3,] 3
## [4,] 4
## [5,] 6
## [6,] 7
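The code that creates the two data sets is not shown in this extract; a minimal sketch using the partition index above:

datTrain <- sim.dat[trainIndex, ]
datTest <- sim.dat[-trainIndex, ]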
According to the setting, there are 800 samples in the training set
and 200 in the testing set. Let’s check the distribution of the two
groups:
datTrain %>%
dplyr::group_by(segment) %>%
dplyr::summarise(count = n(),
percentage = round(length(segment)/nrow(datTrain), 2))
## # A tibble: 4 x 3
## segment count percentage
## <chr> <int> <dbl>
## 1 Conspicuous 160 0.2
## 2 Price 200 0.25
## 3 Quality 160 0.2
## 4 Style 280 0.35
datTest %>%
dplyr::group_by(segment) %>%
dplyr::summarise(count = n(),
percentage = round(length(segment)/nrow(datTest), 2))
## # A tibble: 4 x 3
## segment count percentage
## <chr> <int> <dbl>
## 1 Conspicuous 40 0.2
## 2 Price 50 0.25
## 3 Quality 40 0.2
## 4 Style 70 0.35
The percentages are the same for these two sets. In practice, it
is possible that the distributions are not identical but should be
close.
library(lattice)
# select variables
testing <- subset(sim.dat, select = c("age", "income"))
set.seed(5)
# select 5 random samples
startSet <- sample(1:dim(testing)[1], 5)
start <- testing[startSet, ]
# save the rest in data frame 'samplePool'
samplePool <- testing[-startSet, ]
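The sampling call referenced in the next sentence is not shown in this extract; a hedged sketch using caret's maxDissim():

# select 5 more samples from the pool that are maximally dissimilar
# to the starting set
newSamp <- maxDissim(start, samplePool, obj = minDiss, n = 5)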
The obj = minDiss in the above code tells R to use minimum dissimilarity
to define the distance between groups. Next, randomly select
5 samples from samplePool into the data frame RandomSet:
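A minimal sketch of the random selection for comparison:

RandomSet <- sample(1:dim(samplePool)[1], 5)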
[Figure: age versus income for the initial set, the maximum dissimilarity samples, and the random samples]
[Figure 7.6: 100 simulated observations of an AR(1) time series with φ = -0.9]
Fig. 7.6 shows 100 simulated time series observations. The goal is
to make sure both the training and test sets cover the whole period.
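The code that generated the series is not shown in this extract; a hedged sketch using arima.sim():

set.seed(1)
# 100 observations from an AR(1) process with phi = -0.9
timedata <- arima.sim(list(ar = -0.9), n = 100)
plot(timedata)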
timeSlices <- createTimeSlices(1:length(timedata),
                               initialWindow = 36,  # inferred from the first training window shown below
                               horizon = 12,
                               fixedWindow = T)
str(timeSlices, max.level = 1)
## List of 2
## $ train:List of 53
## $ test :List of 53
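The extraction step is not shown in this extract; a minimal sketch that pulls out the training and testing indexes and prints the first training window:

trainSlices <- timeSlices[[1]]
testSlices <- timeSlices[[2]]
trainSlices[[1]]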
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## [18] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## [35] 35 36
testSlices[[1]]
## [1] 37 38 39 40 41 42 43 44 45 46 47 48
7.2.2 Resampling
You can consider resampling as repeated splitting. The basic idea
is to use part of the data to fit the model and then use the rest of the
data to calculate model performance. Repeat the process multiple times
and aggregate the results. The differences between resampling techniques
usually center around the way the subsamples are chosen. There are
two main reasons why we may need resampling:
In k-fold cross-validation, we randomly divide the data into k folds, use
one fold to test the model, and use the remaining k - 1 folds to train the
model. Then repeat the process k times with each of the k folds as the
test set, and aggregate the results into a performance profile.

Denote by $\hat{f}^{-\kappa}(X)$ the fitted function, computed with the
$\kappa^{th}$ fold removed, and by $x_i^{\kappa}$ the predictors for the
samples in the left-out fold. The process of k-fold cross-validation is as follows:
library(caret)
class <- sim.dat$segment
# create k folds
set.seed(1)
cv <- createFolds(class, k = 10, returnTrain = T)
str(cv)
## List of 10
## $ Fold01: int [1:900] 1 2 3 4 5 6 7 8 9 10 ...
## $ Fold02: int [1:900] 1 2 3 4 5 6 7 9 10 11 ...
## $ Fold03: int [1:900] 1 2 3 4 5 6 7 8 10 11 ...
## $ Fold04: int [1:900] 1 2 3 4 5 6 7 8 9 11 ...
## $ Fold05: int [1:900] 1 3 4 6 7 8 9 10 11 12 ...
## $ Fold06: int [1:900] 1 2 3 4 5 6 7 8 9 10 ...
## $ Fold07: int [1:900] 2 3 4 5 6 7 8 9 10 11 ...
## $ Fold08: int [1:900] 1 2 3 4 5 8 9 10 11 12 ...
## $ Fold09: int [1:900] 1 2 4 5 6 7 8 9 10 11 ...
## $ Fold10: int [1:900] 1 2 3 5 6 7 8 9 10 11 ...
The above code creates ten folds (k = 10) according to the customer
segments (we set class to be the categorical variable segment). The
function returns a list of 10 elements, each containing the row indices
of the corresponding training set. Once you know how to split the data,
the repetition comes naturally.
The apparent error rate is the error rate when the data is used
twice, both to fit the model and to check its accuracy, and it is
apparently over-optimistic. The modified bootstrap estimate reduces
the bias but can be unstable with small sample sizes. This estimate
can also be unduly optimistic when the model severely over-fits, since
the apparent error rate will then be close to zero. Efron and
Tibshirani (Efron and Tibshirani, 1997) discuss another technique,
called the "632+ method," for adjusting the bootstrap estimates.
8
Measuring Performance
# install any packages in p_to_install that are not yet available
if (length(p_to_install) > 0) {
    install.packages(p_to_install)
}
$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$
Both are common measures of regression model performance. Let's use
the previous income prediction as an example and fit a simple linear model:
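The fitting code itself is not shown in this excerpt; a minimal sketch, with the formula taken from the Call: line of the output below:

# linear model for income using expense and transaction variables
fit <- lm(income ~ store_exp + online_exp + store_trans + online_trans,
          data = sim.dat)
summary(fit)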
##
## Call:
## lm(formula = income ~ store_exp + online_exp + store_trans +
## online_trans, data = sim.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -128768 -15804 441 13375 150945
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 85711.680 3651.599 23.47 < 2e-16 ***
## store_exp 3.198 0.475 6.73 3.3e-11 ***
## online_exp 8.995 0.894 10.06 < 2e-16 ***
## store_trans 4631.751 436.478 10.61 < 2e-16 ***
## online_trans -1451.162 178.835 -8.11 1.8e-15 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31500 on 811 degrees of freedom
## (184 observations deleted due to missingness)
y <- sim.dat$income
yhat <- predict(fit, sim.dat)
MSE <- mean((y - yhat)^2, na.rm = T )
RMSE <- sqrt(MSE)
RMSE
## [1] 31433
$$R^2 = 1 - \frac{RSS}{TSS}$$

where $RSS = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ and $TSS = \sum_{i=1}^{n}(y_i - \bar{y})^2$.
$$Adjusted\ R^2 = 1 - \frac{RSS/(n - p - 1)}{TSS/(n - 1)}$$

$$C_p = \frac{1}{n}\big(RSS + 2p\hat{\sigma}^2\big)$$
The R functions AIC() and BIC() calculate the AIC and BIC values for a fitted model.
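For example, applied to the linear model fit above:

# information criteria for the fitted linear model
AIC(fit)
BIC(fit)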
The process includes (1) separating the data into training and test-
ing sets, (2) fitting the model using the training data (xTrain and yTrain), and
(3) applying the trained model to the testing data (xTest and yTest) to
evaluate model performance.
Following the createDataPartition() setting below (p = 0.8), we use 80% of
the sample as training and the remaining 20% as testing.
set.seed(100)
# separate the data to be training and testing
trainIndex <- createDataPartition(disease_dat$y, p = 0.8,
list = F, times = 1)
xTrain <- disease_dat[trainIndex, ] %>% dplyr::select(-y)
xTest <- disease_dat[-trainIndex, ] %>% dplyr::select(-y)
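The response vectors and the model fit are not shown in this excerpt. A minimal sketch, assuming a random forest fitted with the randomForest package (the object name rf_model is an assumption):

# response vectors for training and testing
yTrain <- disease_dat$y[trainIndex]
yTest <- disease_dat$y[-trainIndex]
# fit a random forest on the training set
library(randomForest)
set.seed(100)
rf_model <- randomForest(x = xTrain, y = as.factor(yTrain), ntree = 500)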
Apply the trained random forest model to the testing data to get
two types of predictions:
• probability (a value between 0 and 1)
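The probabilities shown below can be obtained with predict(); a sketch, assuming the rf_model object from above:

# predicted class probabilities for the test samples
yhatprob <- predict(rf_model, xTest, type = "prob")
head(yhatprob, 10)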
## 0 1
## 47 0.831 0.169
## 101 0.177 0.823
## 196 0.543 0.457
## 258 0.858 0.142
## 274 0.534 0.466
## 369 0.827 0.173
## 389 0.852 0.148
## 416 0.183 0.817
## 440 0.523 0.477
## 642 0.836 0.164
• category prediction (0 or 1)
## 146 232 269 302 500 520 521 575 738 781
## 0 0 1 0 0 0 1 0 0 0
## Levels: 0 1
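The confusion matrix below can be produced by cross-tabulating the class predictions with the observed labels; a minimal sketch, again assuming rf_model:

# predicted class labels and confusion matrix against the observed labels
yhat <- predict(rf_model, xTest)
table(yhat, yTest)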
## yTest
## yhat 1 0
## 1 56 1
## 0 15 88
$$Total\ accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
You can calculate the total accuracy when there are more than
two categories. This statistic is straightforward but has some dis-
advantages. First, it doesn't differentiate between error types. In
a real application, different types of error may have different im-
pacts. For example, it is much worse to tag an important email as
spam and miss it than to fail to filter out a spam email. Provost
et al. (Provost F, 1998) discussed in detail the problems of
using total accuracy to compare classifiers. There are some other
metrics based on the confusion matrix that measure different types
of error.
Precision measures how accurate the positive predictions are
(i.e., among those emails predicted as spam, what percentage are
actually spam?):
$$precision = \frac{TP}{TP + FP}$$
Sensitivity measures the coverage of actual positive samples
(i.e., among the true spam emails, what percentage are predicted as spam?):
$$Sensitivity = \frac{TP}{TP + FN}$$
Specificity measures the coverage of actual negative samples
(i.e., among the non-spam emails, what percentage pass the filter?):
$$Specificity = \frac{TN}{TN + FP}$$
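As a small worked example, the confusion matrix shown earlier (with "1" as the positive class) gives:

# TP = 56, FP = 1, FN = 15, TN = 88 from the confusion matrix above
TP <- 56; FP <- 1; FN <- 15; TN <- 88
TP / (TP + FP)   # precision, about 0.98
TP / (TP + FN)   # sensitivity, about 0.79
TN / (TN + FP)   # specificity, about 0.99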
$$Kappa = \frac{P_0 - P_e}{1 - P_e}$$
Kappa        Agreement
<0           Less than chance agreement
0.01–0.20    Slight agreement
0.21–0.40    Fair agreement
0.41–0.60    Moderate agreement
0.61–0.80    Substantial agreement
0.81–0.99    Almost perfect agreement
# install.packages("fmsb")
kt<-fmsb::Kappa.test(table(yhat,yTest))
kt$Result
##
## Estimate Cohen's kappa statistics and test the
## null hypothesis that the extent of agreement is
## same as random (kappa=0)
##
## data: table(yhat, yTest)
## Z = 9.7, p-value <2e-16
## 95 percent confidence interval:
## 0.6972 0.8894
## sample estimates:
## [1] 0.7933
kt$Judgement
8.2.3 ROC
Receiver Operating Characteristic (ROC) curve uses the predicted
class probabilities and determines an effective threshold such that
values above the threshold are indicative of a specific event. We
have shown the definitions of sensitivity and specificity above. The
sensitivity is the true positive rate and the specificity is the true
negative rate; "1 - specificity" is the false positive rate. The ROC
curve is a graph of pairs of true positive rate (sensitivity) and false
positive rate (1 - specificity) values that result as the test's cutoff
value is varied.
The Area Under the Curve (AUC) is a common measure for two-
class problem. There is usually a trade-off between sensitivity and
specificity. If the threshold is set lower, then there are more sam-
ples predicted as positive and hence the sensitivity is higher. Let's
look at the predicted probability yhatprob in the swine disease ex-
ample. The predicted probability object yhatprob has two columns:
one is the predicted probability that a farm will have an outbreak,
and the other is the probability that the farm will NOT have an
outbreak, so the two add up to 1. We use the probability of an out-
break (the 2nd column) for further illustration. You can use the roc()
function to get an ROC object (rocCurve) and then apply different
functions to that object to get the needed plot or ROC statistics. For
example, the following code produces the ROC curve:
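The rocCurve object itself is not created in this excerpt; a minimal sketch using the pROC package, assuming yhatprob and yTest from earlier:

library(pROC)
# ROC object from the observed labels and the predicted outbreak probability
rocCurve <- roc(response = yTest, predictor = yhatprob[, 2])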
# the remaining arguments are reconstructed from the axis labels of the figure
plot(1 - rocCurve$specificities,
     rocCurve$sensitivities,
     type = 'l',
     xlab = "1 - Specificities",
     ylab = "Sensitivities")
[Figure: ROC curve for the random forest model (x-axis: 1 - Specificities, y-axis: Sensitivities).]
The first argument of roc() is response, the vector of observed outcomes.
The second argument, predictor, is the continuous prediction
(probability or link function value). The x-axis of the ROC curve is
"1 - specificity" and the y-axis is "sensitivity." The ROC curve starts
from (0, 0) and ends at (1, 1). A perfect model that correctly
identifies all the samples will have 100% sensitivity and specificity
which corresponds to the curve that also goes through (0, 1). The
area under the perfect curve is 1. A model that is totally useless
corresponds to a curve that is close to the diagonal line and an
area under the curve about 0.5.
You can visually compare different models by putting their ROC
curves on one plot, or compare them using the AUC. DeLong et
al. came up with a statistical test to compare AUCs based on U-statistics
(E.R. DeLong, 1988), which gives a p-value and confidence in-
terval. You can also use the bootstrap to get a confidence interval for
the AUC (Hall P, 2004).
We can use the following code in R to get an estimate of AUC and
its confidence interval:
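A minimal sketch using pROC (the DeLong interval is the package default):

# estimated AUC and its confidence interval
auc(rocCurve)
ci.auc(rocCurve)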
table(yTest)
## yTest
## 1 0
## 71 89
[Figure: cumulative gains chart (x-axis: % Samples Tested, y-axis: % Samples Found).]
$$f(\mathbf{X}) = \mathbf{X}\boldsymbol{\beta} = \beta_0 + \sum_{j=1}^{p} \mathbf{x}_{.j}\,\beta_j$$

$$RSS(\beta) = \sum_{i=1}^{N}\big(y_i - f(\mathbf{x}_{i.})\big)^2 = \sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2$$
Before fitting the model, we need to clean the data, such as remov-
ing bad data points that are not logical (negative expense).
To fit a linear regression model, let us first check if there are any
missing values or outliers:
[Figure: distribution of total_exp (histogram with Frequency on the y-axis and total_exp on the x-axis, plus a boxplot).]
y <- modeldat$total_exp
# Find data points with Z-score larger than 3.5
zs <- (y - mean(y))/mad(y)
modeldat <- modeldat[-which(zs > 3.5), ]
[Figure: pairwise correlation plot of the survey questions Q1–Q10; several questions are strongly positively correlated (e.g., Q6 and Q10: 0.85) and others strongly negatively correlated (e.g., Q6 and Q7: −0.93).]
(3) if all the variables in the dataset except the response vari-
able are included in the model, we can use . at the right
side of ~
(4) if we want to consider the interaction between two vari-
ables such as Q1 and Q2, we can add Q1*Q2 to the formula
(which expands to the main effects plus their interaction,
Q1 + Q2 + Q1:Q2)
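The model whose output is shown below can be fit along these lines (the formula is taken from the Call: line of the output):

# log-transformed total expense regressed on all remaining survey questions
lmfit <- lm(log(total_exp) ~ ., data = modeldat)
summary(lmfit)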
##
## Call:
## lm(formula = log(total_exp) ~ ., data = modeldat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1749 -0.1372 0.0128 0.1416 0.5623
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.09831 0.05429 149.18 < 2e-16 ***
## Q1 -0.14534 0.00882 -16.47 < 2e-16 ***
## Q2 0.10228 0.01949 5.25 2.0e-07 ***
## Q3 0.25445 0.01835 13.87 < 2e-16 ***
## Q6 -0.22768 0.01152 -19.76 < 2e-16 ***
## Q8 -0.09071 0.01650 -5.50 5.2e-08 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
confint(lmfit,level=0.9)
## 5 % 95 %
## (Intercept) 8.00892 8.18771
## Q1 -0.15987 -0.13081
## Q2 0.07018 0.13437
## Q3 0.22424 0.28466
## Q6 -0.24665 -0.20871
## Q8 -0.11787 -0.06354
The Q-Q plot is used to check the normality assumption for the resid-
uals. For normally distributed residuals, the points should follow a
straight line on the Q-Q plot. The further the points depart from a
straight line, the further the residuals depart from a normal distribution.
plot(lmfit, which = 2)
plot(lmfit, which = 3)
plot(lmfit, which = 4)
[Figure: diagnostic plots for lmfit (Residuals vs Fitted, Normal Q-Q, Scale-Location, and Cook's distance); observation 960 stands out.]
## Q1 Q2 Q3 Q6 Q8 total_exp
## 155 4 2 1 4 4 351.9
## 678 2 1 1 1 2 1087.3
## 960 2 1 1 1 3 658.3
It is not easy to see why those records are outliers from the above
output alone. It becomes clear once we condition on the independent
variables (Q1, Q2, Q3, Q6, and Q8). Let us examine the value of
total_exp for samples with the same Q1, Q2, Q3, Q6, and Q8 answers
as the 3rd row above (record 960).
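A minimal sketch of the selection (the filter values are read off the printed row 960):

# records sharing the same Q1, Q2, Q3, Q6, Q8 answers as record 960
datcheck <- modeldat %>%
    dplyr::filter(Q1 == 2, Q2 == 1, Q3 == 1, Q6 == 1, Q8 == 3)
nrow(datcheck)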
## [1] 87
summary(datcheck$total_exp)
There are 87 such records, and the summary shows that record 960's
total expense is an outlier: all the other 86 records with the same
survey responses have a much higher total expense!
• Cook's distance: the maximum Cook's distance is around 0.05.
Even though the graph does not have any point with a Cook's
distance of more than 0.5, we can still spot some outliers.
The graphs suggest some outliers, but it is our decision what to do
with them. We can either remove them or investigate them further.
If the values are not due to any data error, we should consider them
in our analysis.
library(lattice)
library(caret)
library(dplyr)
library(elasticnet)
library(lars)
# Load Data
sim.dat <- read.csv("http://bit.ly/2P5gTw4")
ymad <- mad(na.omit(sim.dat$income))
# Calculate Z values
zs <- (sim.dat$income - mean(na.omit(sim.dat$income)))/ymad
# which(na.omit(zs>3.5)) find outlier
# which(is.na(zs)) find missing values
idex <- c(which(na.omit(zs > 3.5)), which(is.na(zs)))
# Remove rows with outlier and missing values
sim.dat <- sim.dat[-idex, ]
set.seed(100)
ctrl <- trainControl(method = "cv", number = 10)
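The PLS tuning call itself is not shown in this excerpt; a minimal sketch, assuming the ten survey questions as predictors and income as the response (the object name plsTune and the tune length are assumptions):

# tune the number of PLS components with 10-fold cross-validation
plsTune <- train(sim.dat[, paste0("Q", 1:10)], sim.dat$income,
                 method = "pls",
                 tuneLength = 10,
                 trControl = ctrl,
                 preProc = c("center", "scale"))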
From the result, we can see that the optimal number of components is
7. However, if we pay attention to the RMSE improvement, we will
find only minimal improvement after three components.
In practice, we could choose the model with three components
if the improvement does not make a practical difference and we
would rather have a simpler model.
We can also find the relative importance of each variable during the
PLS model tuning process using the following code:
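A sketch of the importance calculation and plot (plsImp is an assumed name):

# variable importance from the tuned PLS model
plsImp <- varImp(plsTune, scale = FALSE)
plot(plsImp, top = 10)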
[Figure: PLS variable importance; Q6, Q2, Q3, and Q1 rank highest (x-axis: Importance).]
The above plot shows that Q1, Q2, Q3, and Q6 are more impor-
tant than the other variables. Now let's fit a PCR model with the number
of principal components as the hyper-parameter:
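A minimal sketch of the PCR tuning call (the object name pcrTune and the grid are assumptions):

# tune the number of principal components with 10-fold cross-validation
pcrTune <- train(sim.dat[, paste0("Q", 1:10)], sim.dat$income,
                 method = "pcr",
                 tuneGrid = data.frame(ncomp = 1:10),
                 trControl = ctrl,
                 preProc = c("center", "scale"))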
Looking at the cross-validated RMSE across different numbers of
components, we again find little difference after the model with three
components. In practice, we can keep the model with three components.
Now let’s compare the hyper-parameter tuning process for PLS
and PCR:
[Figure: cross-validated RMSE versus number of components for PLS and PCR.]
The plot confirms our choice of using a model with three compo-
nents for both PLS and PCR.
10
Regularization Methods
library(NetlifyDS)
$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = RSS + \lambda\sum_{j=1}^{p}\beta_j^2 \tag{10.1}$$
There are a few functions that fit ridge regression, such as lm.ridge()
from MASS and enet() from elasticnet. If you know the
value of λ, you can use either of them to fit the ridge regression.
A more convenient way is to use the train() function from caret. Let's
use the 10 survey questions to predict the total purchase amount
(the sum of online and store purchases).
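The tuning call itself is not shown; a sketch consistent with the output below (20 λ values between 0 and 0.1, centering/scaling, 10-fold cross-validation; trainx and trainy are the predictor and response objects named in the later code):

# tune the ridge penalty over a small grid
ridgeGrid <- data.frame(.lambda = seq(0, 0.1, length = 20))
set.seed(100)
ridgeRegTune <- train(trainx, trainy,
                      method = "ridge",
                      tuneGrid = ridgeGrid,
                      trControl = ctrl,
                      preProc = c("center", "scale"))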
## Ridge Regression
##
## 999 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 899, 899, 899, 899, 899, 900, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000 1744 0.7952 754.0
## 0.005263 1744 0.7954 754.9
## 0.010526 1744 0.7955 755.9
## 0.015789 1744 0.7955 757.3
## 0.021053 1745 0.7956 758.8
## 0.026316 1746 0.7956 760.6
## 0.031579 1747 0.7956 762.4
## 0.036842 1748 0.7956 764.3
## 0.042105 1750 0.7956 766.4
## 0.047368 1751 0.7956 768.5
## 0.052632 1753 0.7956 770.6
## 0.057895 1755 0.7956 772.7
## 0.063158 1757 0.7956 774.9
## 0.068421 1759 0.7956 777.2
## 0.073684 1762 0.7956 779.6
## 0.078947 1764 0.7955 782.1
## 0.084211 1767 0.7955 784.8
## 0.089474 1769 0.7955 787.6
## 0.094737 1772 0.7955 790.4
## 0.100000 1775 0.7954 793.3
##
## RMSE was used to select the optimal model using
## the smallest value.
## The final value used for the model was lambda
## = 0.005263.
The results show that the best value of λ is 0.005, and the corresponding
RMSE and R² are 1744 and 0.7954. You can see from figure 10.1
that as λ increases, the RMSE first decreases slightly and then increases.
plot(ridgeRegTune)
FIGURE 10.1: Test mean squared error for the ridge regression
Once you have the tuning parameter value, there are different func-
tions to fit a ridge regression. Let’s look at how to use enet() in
elasticnet package.
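A minimal sketch of the enet() call (the penalty value follows the tuning result above):

# ridge fit via elastic net with only the ridge penalty specified
ridgefit <- elasticnet::enet(x = as.matrix(trainx), y = trainy,
                             lambda = 0.005, normalize = TRUE)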
Note that ridgefit above only assigns the value of the tuning parameter
for ridge regression. Since the elastic net model includes both the ridge
and lasso penalties, we need to use the predict() function to get the
model fit. You can get the fitted results by setting s = 1 and mode =
"fraction". Here s = 1 means we only use the ridge penalty. We
will come back to this when we get to lasso regression.
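A sketch of the prediction call that produces ridgePred:

# fitted values at s = 1 (ridge only)
ridgePred <- predict(ridgefit, newx = as.matrix(trainx),
                     s = 1, mode = "fraction", type = "fit")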
By setting type = "fit", the above returns a list object. The fit
item has the predictions:
names(ridgePred)
head(ridgePred$fit)
## 1 2 3 4 5 6
## 1290.5 224.2 591.4 1220.6 853.4 908.2
ridgeCoef<-predict(ridgefit,newx = as.matrix(trainx),
s=1, mode="fraction", type="coefficients")
It also returns a list and the estimates are in the coefficients item:
10.2 LASSO
Even though ridge regression shrinks the parameter estimates
towards 0, it won't shrink any estimate to exactly 0, which
means it includes all predictors in the final model. So it can't
select variables. That may not be a problem for prediction, but it is
a huge disadvantage if you want to interpret the model, especially
when the number of variables is large. A popular alternative to the
ridge penalty is the Least Absolute Shrinkage and Selection
Operator (LASSO) (R, 1996).
Similar to ridge regression, lasso adds a penalty. The lasso coefficients $\hat{\beta}_{\lambda}^{L}$ minimize the following:

$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j| = RSS + \lambda\sum_{j=1}^{p}|\beta_j| \tag{10.2}$$
The only difference between lasso and ridge is the penalty. In sta-
tistical parlance, ridge uses 𝐿2 penalty (𝛽𝑗2 ) and lasso uses 𝐿1
penalty (|𝛽𝑗 |). 𝐿1 penalty can shrink the estimates to 0 when 𝜆 is
big enough. So lasso can be used as a feature selection tool. It is
a huge advantage because it leads to a more explainable model.
Similar to other models with tuning parameters, lasso regression re-
quires cross-validation to tune the parameter. You can use train()
in a similar way as we showed in the ridge regression section. To
tune the parameter, we need to set up the cross-validation and the
parameter range. Also, it is advised to standardize the predictors:
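A sketch of the tuning call (the fraction grid is an assumption; the tuning parameter name matches the output discussed below):

# tune the lasso fraction with 10-fold cross-validation
lassoGrid <- data.frame(fraction = seq(0.8, 1, length = 20))
set.seed(100)
lassoTune <- train(trainx, trainy,
                   method = "lasso",
                   tuneGrid = lassoGrid,
                   trControl = ctrl,
                   preProc = c("center", "scale"))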
The results show that the best value of the tuning parameter
(fraction in the output) is 0.957, and the RMSE and R² are
1742 and 0.7954. The performance is nearly the same as ridge
regression. You can see from figure 10.2 that as the tuning
parameter increases, the RMSE first decreases and then increases.
plot(lassoTune)
FIGURE 10.2: Test mean squared error for the lasso regression
Once you select a value for the tuning parameter, there are different
functions to fit lasso regression, such as lars() in lars, enet() in
elasticnet, and glmnet() in glmnet. They all have very similar syntax.
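A sketch of fitting the lasso with enet() and predicting at the tuned fraction (lassoModel is an assumed name):

# lambda = 0 gives a pure lasso penalty
lassoModel <- elasticnet::enet(x = as.matrix(trainx), y = trainy,
                               lambda = 0, normalize = TRUE)
lassoFit <- predict(lassoModel, newx = as.matrix(trainx),
                    s = 0.957, mode = "fraction", type = "fit")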
Again, by setting type = "fit", the call above returns a list object. The
fit item has the predictions:
head(lassoFit$fit)
## 1 2 3 4 5 6
## 1357.3 300.5 690.2 1228.2 838.4 1010.1
It also returns a list and the estimates are in the coefficients item:
This algorithm works well for lasso regression, especially when
the dimension is high.
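The elastic net output below comes from a tuning call along these lines (the grid values are assumptions):

# tune fraction and lambda jointly for the elastic net
enetGrid <- expand.grid(.lambda = c(0, 0.01, 0.1),
                        .fraction = seq(0.8, 1, length = 20))
set.seed(100)
enetTune <- train(trainx, trainy,
                  method = "enet",
                  tuneGrid = enetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"))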
Elasticnet
999 samples
10 predictor
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were fraction = 0.9579 and lambda = 0.
The results show that the best values of the tuning parameters are
fraction = 0.9579 and lambda = 0. It also indicates that the final
model is lasso only (the ridge penalty parameter lambda is 0). The
RMSE and R² are 1742.2843 and 0.7954, respectively.
$$\min_{\beta_0,\,\boldsymbol{\beta}}\ \frac{1}{N}\sum_{i=1}^{N} w_i\, l(y_i, \beta_0 + \boldsymbol{\beta}^T\mathbf{x}_i) + \lambda\Big[(1-\alpha)\,\|\boldsymbol{\beta}\|_2^2/2 + \alpha\,\|\boldsymbol{\beta}\|_1\Big]$$

where

$$l(y_i, \beta_0 + \boldsymbol{\beta}^T\mathbf{x}_i) = -\log\big[\mathcal{L}(y_i, \beta_0 + \boldsymbol{\beta}^T\mathbf{x}_i)\big]$$
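The fitted object plotted below is not created in the excerpt; a minimal sketch using glmnet with its default α = 1 (lasso):

library(glmnet)
# penalized linear model fitted over a path of lambda values
glmfit <- glmnet(as.matrix(trainx), trainy)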
plot(glmfit, label = T)
[Figure: glmnet coefficient paths; the x-axis is the L1 norm, the y-axis is the coefficient value, each numbered curve is one predictor, and the numbers along the top are the effective degrees of freedom.]
Each curve in the plot represents one predictor. The default setting
is α = 1, which means there is only the lasso penalty. From left to right,
the L1 norm increases, which means λ decreases. The bottom
x-axis is the L1 norm (i.e., $\|\boldsymbol{\beta}\|_1$). The upper x-axis is the effective
degrees of freedom (df) for the lasso. You can check the details of
every step by:
print(glmfit)
Df %Dev Lambda
1 0 0.000 3040
2 2 0.104 2770
3 2 0.192 2530
4 2 0.265 2300
5 3 0.326 2100
6 3 0.389 1910
7 3 0.442 1740
8 3 0.485 1590
9 3 0.521 1450
...
The first column Df is the degrees of freedom (i.e., the number of non-
zero coefficients), %Dev is the percentage of deviance explained, and
Lambda is the value of the tuning parameter λ. By default, the function
tries 100 different values of λ. However, if the %Dev does not change
sufficiently as λ changes, the algorithm stops before going through
all of them. We didn't show the full output above, but it only uses
68 different values of λ. You can also set the value of λ using s= :
coef(glmfit, s = 1200)
## s1 s2
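The cross-validated object discussed below can be created along these lines (a sketch; cv.glmnet uses 10 folds by default):

cvfit <- cv.glmnet(as.matrix(trainx), trainy)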
We can plot the object using plot(). The red dotted line is the
cross-validation curve; each red point is the cross-validated mean
squared error for a value of λ. The grey bars around the red points
indicate the upper and lower standard deviation. The two grey dot-
ted vertical lines represent the two selected values of λ: one gives
the minimum mean cross-validated error (lambda.min), the other
gives the error that is within one standard error of the minimum
(lambda.1se).
plot(cvfit)
[Figure: cross-validation curve from cv.glmnet; x-axis log(λ), y-axis mean-squared error, with the number of non-zero coefficients along the top.]
## [1] 12.57
## [1] 1200
$$y_i \sim Bernoulli(\theta_i)$$

$$\log\Big(\frac{\theta_i}{1-\theta_i}\Big) = \eta_{\boldsymbol{\beta}}(x_i) = \beta_0 + \sum_{g=1}^{G}\mathbf{x}_{i,g}^{T}\boldsymbol{\beta}_g$$

$$\begin{aligned} l(\boldsymbol{\beta}) &= \log\Big[\prod_{i=1}^{n}\theta_i^{y_i}(1-\theta_i)^{1-y_i}\Big] \\ &= \sum_{i=1}^{n}\big\{y_i\log(\theta_i) + (1-y_i)\log(1-\theta_i)\big\} \\ &= \sum_{i=1}^{n}\big\{y_i\,\eta_{\boldsymbol{\beta}}(\mathbf{x}_i) - \log[1+\exp(\eta_{\boldsymbol{\beta}}(\mathbf{x}_i))]\big\} \end{aligned}$$
library(MASS)
library(glmnet)
dat <- read.csv("http://bit.ly/2KXb1Qi")
# separate the binary response and the predictors
trainy <- dat$y
trainx <- dplyr::select(dat, -y)
levels(as.factor(trainy))
# penalized logistic regression over a lambda path
# (the output below shows predictions from this type of fit, so glmnet is used here)
fit <- glmnet(as.matrix(trainx), trainy, family = "binomial")
newdat = as.matrix(trainx[1:3, ])
predict(fit, newdat, type = "link", s = c(2.833e-02, 3.110e-02))
## s1 s2
## 1 0.1943 0.1443
## 2 -0.9913 -1.0077
## 3 -0.5841 -0.5496
The first column of the above output is the predicted link function
value when 𝜆 = 0.02833. The second column of the output is the
predicted link function when 𝜆 = 0.0311.
Similarly, you can change the setting for type to produce different
outputs. You can use the cv.glmnet() function to tune parameters.
The parameter setting is nearly the same as before, the only differ-
ence is the setting of type.measure. Since the response is categorical,
not continuous, we have different performance measurements. The
most common settings of type.measure for classification are:
• class: misclassification error rate
• auc: the area under the ROC curve (for two-class problems)
For example:
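A sketch of the cross-validation call that produces the plot and λ values below:

# 10-fold cross-validation using the error rate as the criterion
cvfit <- cv.glmnet(as.matrix(trainx), trainy,
                   family = "binomial",
                   type.measure = "class",
                   nfolds = 10)
plot(cvfit)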
[Figure: cross-validated misclassification error versus log(λ).]
The code above uses the error rate as the performance criterion and
10-fold cross-validation. Similarly, you can get the λ value for the
minimum error rate and the λ whose error rate is one standard error
from the minimum:
cvfit$lambda.min
## [1] 2.643e-05
cvfit$lambda.1se
## [1] 0.003334
You can get the parameter estimates and make predictions in the
same way as before.
10.4.2.3 Group lasso logistic regression
$$S_\lambda(\boldsymbol{\beta}) = -l(\boldsymbol{\beta}) + \lambda\sum_{g=1}^{G} s(df_g)\,\|\boldsymbol{\beta}_g\|_2$$

$$\lambda_{max} = \max_{g \in \{1,\dots,G\}}\Big\{\frac{1}{s(df_g)}\,\|\mathbf{x}_g^{T}(\mathbf{y}-\bar{\mathbf{y}})\|_2\Big\}, \tag{10.4}$$
such that when 𝜆 = 𝜆𝑚𝑎𝑥 , only the intercept is in the model. When
𝜆 goes to 0, the model is equivalent to ordinary logistic regression.
Three criteria may be used to select the optimal value of λ. One
is the AUC, which you have seen many times in this book by
now. The log-likelihood score used in Meier et al. (L Meier and
Buhlmann, 2008) is taken as the average log-likelihood of the
validation data over all cross-validation sets. Another one is the
maximum correlation coefficient in Yeo and Burge (Yeo and Burge,
2004), which is defined as:
devtools::install_github("netlify/NetlifyDS")
library("NetlifyDS")
The package includes the swine disease breakout data and you can
load the data by:
data("sim1_da1")
Dummy variables from the same question are in the same group:
index[1:50]
...
$ auc : num [1:100] 0.573 0.567 0.535 ...
$ log_likelihood : num [1:100] -554 -554 -553 ...
$ maxrho : num [1:100] -0.0519 0.00666 ...
$ lambda.max.auc : Named num [1:2] 0.922 0.94
..- attr(*, "names")= chr [1:2] "lambda" "auc"
$ lambda.1se.auc : Named num [1:2] 16.74 0.81
..- attr(*, "names")= chr [1:2] "" "se.auc"
$ lambda.max.loglike: Named num [1:2] 1.77 -248.86
..- attr(*, "names")= chr [1:2] "lambda" "loglike"
$ lambda.1se.loglike: Named num [1:2] 9.45 -360.13
..- attr(*, "names")= chr [1:2] "lambda" "se.loglike"
$ lambda.max.maxco : Named num [1:2] 0.922 0.708
..- attr(*, "names")= chr [1:2] "lambda" "maxco"
plot(cv_fit)
[Figure: cross-validated AUC versus λ for the group lasso logistic regression, with dashed lines at the λ with the maximum AUC and the λ one standard error away.]
The x-axis is the value of the tuning parameter and the y-axis is the
AUC. The two dashed lines mark the λ with the maximum AUC and
the λ one standard error from the maximum. Once you choose
the value of the tuning parameter, you can use fitglasso() to fit
the model. For example, we can fit the model using the parameter
value that gives the maximum AUC, which is λ = 0.922:
coef(fitgl)
0.922
Intercept -5.318e+01
Q1.A 1.757e+00
Q1.B 1.719e+00
Q2.A 2.170e+00
Q2.B 6.939e-01
Q3.A 2.102e+00
Q3.B 1.359e+00
...
CART can refer to the tree model in general, but most of the time,
it represents the algorithm initially proposed by Breiman (Breiman
et al., 1984). After Breiman, there are many new algorithms, such
as ID3, C4.5, and C5.0. C5.0 is an improved version of C4.5, but
since C5.0 is not open source, the C4.5 algorithm is more popular.
C4.5 was a major competitor of CART. But now, all those seem
outdated. The most popular tree models are Random Forest (RF)
and Gradient Boosting Machine (GBM). Despite being out of favor
in application, it is important to understand the mechanism of the
basic tree algorithm, because the later models are built on the
same foundation.
The original CART algorithm targets binary classification, and the
later algorithms can handle multi-category classification. A single
tree is easy to explain but has poor accuracy. More complicated
tree models, such as RF and GBM, can provide much better pre-
diction at the cost of explainability. As the model becomes more
complicated, it behaves more like a black box, which makes it very
difficult to explain the relationships among predictors. There is always
a trade-off between explainability and predictability.
The reason why it is called a "tree" is, of course, because the structure
of the model resembles a tree: it starts from a root node and splits
into branches that end in leaf nodes.
data("iris")
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
CART uses Gini impurity as the splitting criterion; the later ID3, C4.5, and
C5.0 use entropy. We will look at the three most common splitting
criteria.
𝑝1 (1 − 𝑝1 ) + 𝑝2 (1 − 𝑝2 )
It is easy to see that when the sample set is pure, one of the
probabilities is 0 and the Gini score is at its smallest. Conversely, when
$p_1 = p_2 = 0.5$, the Gini score is at its largest, in which case the purity
of the node is the smallest. Let's look at an example. Suppose
we want to determine which students are computer science (CS)
majors. Here is a simple hypothetical classification tree result
obtained with the gender variable.
The Gini impurity for the node “Gender” is the following weighted
average of the above two scores:
$$\frac{3}{5}\times\frac{5}{18} + \frac{2}{5}\times 0 = \frac{1}{6}$$
So entropy decreases from 1 to 0.39 after the split and the IG for
“Gender” is 0.61.
$$Gain\ Ratio = \frac{Information\ Gain}{Split\ Information}$$

where the split information is the entropy of the split proportions themselves, $-\sum_{i=1}^{k}\frac{|S_i|}{|S|}\log_2\frac{|S_i|}{|S|}$ for a split into subsets $S_1,\dots,S_k$.
The split information for the birth month is 3.4, and the gain ratio
is 0.22, which is smaller than that of gender (0.63). The gain ratio
therefore prefers gender over the birth month as the splitting feature.
The gain ratio favors attributes with fewer categories and leads
to better generalization (less overfitting).
In equation (11.1), $\bar{y}_1$ and $\bar{y}_2$ are the averages of the samples in $S_1$
and $S_2$. The way a regression tree grows is to automatically decide
on the splitting variables and split points that can maximize SSE
reduction. Since this process is essentially a recursive segmenta-
tion, this approach is also called recursive partitioning.
Take a look at this simple regression tree for the height of 10
students:
SSE for the 10 students in root node is 522.9. After the split, SSE
decreases from 522.9 to 168.
• Minimum sample size at a node: requiring a minimum sample
size at the node helps to prevent the leaf nodes having only
one sample. The sample size can be a tuning parameter. If it
is too large, the model tends to under-fit. If it is too small, the
model tends to over-fit. In the case of severe class imbalance,
the minimum sample size may need to be smaller because the
number of samples in a particular class is small.
• Maximum depth of the tree: If the tree grows too deep, the model
tends to over-fit. It can be a tuning parameter.
• Maximum number of terminal nodes: Limit on the terminal
nodes works the same as the limit on the depth of the tree. They
are proportional.
• The number of variables considered for each split: the algorithm
randomly selects variables used in finding the optimal split point
at each level. In general, the square root of the number of all
variables works best, which is also the default setting for many
functions. However, people often treat it as a tuning parameter.
Remove branches
Another way is to first let the tree grow as much as possible and
then go back to remove insignificant branches. The process reduces
the depth of the tree. The idea is to overfit the training set and then
correct using cross-validation. There are different implementations.
• cost/complexity penalty
The idea is that the pruning minimizes the penalized error 𝑆𝑆𝐸𝜆
with a certain value of tuning parameter 𝜆.
You train a complete tree using the subset (1) and apply the tree
on the subset (2) to calculate the accuracy. Then prune the tree
based on a node and apply that on the subset (2) to calculate
another accuracy. If the accuracy after pruning is higher or equal
to that from the complete tree, then we set the node as a terminal
node. Otherwise, keep the subtree under the node. The advantage
of this method is that it is easy to compute. However, when the
size of the subset (2) is much smaller than that of the subset (1),
there is a risk of over-pruning. Some researchers found that this
method results in more accurate trees than pruning process based
on tree size (F. Espoito and Semeraro, 1997).
• Error-complexity pruning
This method is to search for a trade-off between error and com-
plexity. Assume we have a splitting node 𝑡, and the corresponding
subtree 𝑇 . The error cost of the node is defined as:
$p(t)$ is the ratio of the node's sample size to the total sample size.
The multiplication $r(t)\times p(t)$ cancels out the sample size of the
node. If we keep node $t$, the error cost of the corresponding subtree $T_t$
is the sum of the error costs of its leaves:

$$R(T_t) = \sum_{i\,\in\,\text{leaves of}\ T_t} R(i)$$

The error-complexity measure of node $t$ is then:

$$a(t) = \frac{R(t) - R(T_t)}{\text{no. of leaves} - 1}$$
$$E(t) = \frac{n_t - n_{t,c} + k - 1}{n_t + k}$$

where:
$k$ = number of categories
$n_t$ = sample size under node $t$
$n_{t,c}$ = number of samples under $t$ that belong to category $c$
The sample average for region 𝑅1 is 163, for region 𝑅2 is 176. For
a new observation, if it is female, the model predicts the height to
be 163, if it is male, the predicted height is 176. Calculating the
mean is easy. Let’s look at the first step in more detail which is to
divide the space into 𝑅1 , 𝑅2 , … , 𝑅𝐽 .
In theory, the region can be any shape. However, to simplify the
problem, we divide the predictor space into high-dimensional rect-
angles. The goal is to divide the space in a way that minimize
RSS. Practically, it is nearly impossible to consider all possible
partitions of the feature space. So we use an approach named re-
cursive binary splitting, a top-down, greedy algorithm. The process
starts from the top of the tree (root node) and then successively
splits the predictor space. Each split produces two branches (hence
binary). At each step of the process, it chooses the best split at
that particular step, rather than looking ahead and picking a split
that leads to a better tree in general (hence greedy).
Calculate the RSS decrease after the split for different $(j, s)$ pairs,
and choose the pair that gives the largest decrease.
[Figure: cross-validated RMSE across the tuning parameter values for the regression tree.]
print(rpartTree)
## n= 999
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 999 1.581e+10 3479.0
## 2) Q3< 3.5 799 2.374e+09 1819.0
## 4) Q5< 1.5 250 3.534e+06 705.2 *
## 5) Q5>=1.5 549 1.919e+09 2326.0 *
## 3) Q3>=3.5 200 2.436e+09 10110.0 *
You can see that the final model picks Q3 and Q5 to predict total
expenditure. To visualize the tree, you can convert the rpart object to
a party object using partykit and then use the plot() function:
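A minimal sketch, assuming rpartTree is the fitted rpart object printed above:

library(partykit)
# convert to a party object and plot the tree
plot(as.party(rpartTree))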
[Figure: the fitted decision tree; the root split is on Q3.]
When fitting tree models, people need to choose the way to treat
categorical predictors. If you know some of the categories have
higher predictability, then the first approach may be better. In
the rest of this section, we will build tree models using the above
two approaches and compare them.
Let’s build a classification model to identify the gender of the
customer:
# approach 2: convert categorical predictors into dummy variables
# (the dummyVars() call is reconstructed; fullRank is an assumption)
dumMod <- dummyVars(~., data = trainx1, fullRank = TRUE)
trainx2 <- predict(dumMod, trainx1)
# the response variable is gender
trainy <- dat$gender
##
## Female Male
## 0.554 0.446
The outcome is pretty balanced, with 55% female and 45% male.
We use the train() function in the caret package to call rpart to build the
model. We can compare the model results from the two approaches:
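A sketch of the tuning call for the grouped-categories data (the object name rpartTune and the tune length are assumptions; the trControl mirrors the later bagging and random forest calls):

set.seed(100)
rpartTune <- caret::train(trainx1, trainy,
                          method = "rpart",
                          tuneLength = 30,
                          metric = "ROC",
                          trControl = trainControl(method = "cv",
                                        summaryFunction = twoClassSummary,
                                        classProbs = TRUE,
                                        savePredictions = TRUE))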
CART
1000 samples
11 predictor
2 classes: 'Female', 'Male'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 901, 899, 900, 900, 901, 900, ...
Resampling results across tuning parameters:
......
ROC was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.00835.
plot.roc(rpartRoc,
type = "s",
print.thres = c(.5),
print.thres.pch = 3,
print.thres.pattern = "",
print.thres.cex = 1.2,
col = "red", legacy.axes = TRUE,
print.thres.col = "red")
plot.roc(rpartFactorRoc,
type = "s",
add = TRUE,
print.thres = c(.5),
print.thres.pch = 16, legacy.axes = TRUE,
print.thres.pattern = "",
print.thres.cex = 1.2)
legend(.75, .2,
c("Grouped Categories", "Independent Categories"),
lwd = c(1, 1),
col = c("black", "red"),
pch = c(16, 3))
[Figure: ROC curves for the single tree fit with grouped categories versus independent (dummy-coded) categories.]
1. Low accuracy
2. Unstable: a small change in the training data can lead to a very
different tree.
You can use the average of all the out-of-bag
performance values to gauge the predictive performance of the
entire ensemble. This correlates well with either cross-validation
estimates or test set estimates. On average, each tree uses about
2/3 of the samples, and the remaining 1/3 is used as out-of-bag
samples. When the number of bootstrap samples is large enough,
the out-of-bag performance estimate approximates that from
leave-one-out cross-validation.
You need to choose the number of bootstrap samples. The au-
thor of “Applied Predictive Modeling” (Kuhn and Johnston, 2013)
points out that often people see an exponential decrease in predic-
tive improvement as the number of iterations increases. Most of
the predictive power is from a small portion of the trees. Based
on their experience, model performance can have small improve-
ments up to 50 bagging iterations. If it is still not satisfying, they
suggest trying other more powerfully predictive ensemble methods
such as random forests and boosting which will be described in the
following sections.
The disadvantages of bagging tree are:
• As the number of bootstrap samples increases, the computation
and memory requirements increase as well. You can mitigate
this disadvantage by parallel computing. Since each bootstrap
sample and modeling is independent of any other sample and
model, you can easily parallelize the bagging process by building
those models separately and bring back the results in the end to
generate the prediction.
• The bagged model is difficult to explain which is common for all
ensemble approaches. However, you can still get variable impor-
tance by combining measures of importance across the ensemble.
For example, we can calculate the RSS decrease for each variable
across all trees and use the average as the measurement of the
importance.
• Since the bagging tree uses all of the original predictors at every
split of every tree, the trees are correlated with each other.
The tree correlation prevents bagging from optimally reducing
the variance of the predictions.
Then fit the model using train function in caret package. Here
we just set the number of trees to be 1000. You can tune that
parameter.
set.seed(100)
bagTune <- caret::train(trainx, trainy,
method = "treebag",
nbagg = 1000,
metric = "ROC",
trControl = trainControl(method = "cv",
summaryFunction = twoClassSummary,
classProbs = TRUE,
savePredictions = TRUE))
bagTune
## Bagged CART
##
## 1000 samples
## 11 predictor
## 2 classes: 'Female', 'Male'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 901, 899, 900, 900, 901, 900, ...
## Resampling results:
##
## ROC Sens Spec
## 0.7093 0.6533 0.6774
Random forest reduces tree correlation by randomly selecting only a
subset of the candidate variables at each split. Since the trees in the
forest don't always use the same variables, tree correlation is lower
than in bagging. Random forest tends to work better when there are
more predictors. Since we only have 10 predictors here, the
improvement from the random forest is marginal. The number of
randomly selected predictors is a tuning parameter in the random
forest. Since the random forest is computationally intensive, we
suggest starting with a value around $m = \sqrt{p}$. Another tuning
parameter is the number of trees in the forest. You can start with
1000 trees and then increase the number until performance levels
off. The basic random forest is shown in Algorithm 4.
# the head of the call (mtry grid and method) is reconstructed from the output below
mtryValues <- 1:5
set.seed(100)
rfTune <- caret::train(trainx, trainy,
                       method = "rf",
                       ntree = 1000,
                       tuneGrid = data.frame(.mtry = mtryValues),
                       importance = TRUE,
                       metric = "ROC",
                       trControl = trainControl(method = "cv",
                                     summaryFunction = twoClassSummary,
                                     classProbs = TRUE,
                                     savePredictions = TRUE))
rfTune
## Random Forest
##
## 1000 samples
## 11 predictor
## 2 classes: 'Female', 'Male'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 899, 900, 900, 899, 899, 901, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 1 0.7169 0.5341 0.8205
## 2 0.7137 0.6334 0.7175
## 3 0.7150 0.6478 0.6995
## 4 0.7114 0.6550 0.6950
## 5 0.7092 0.6514 0.6882
##
## ROC was used to select the optimal model using
## the largest value.
## The final value used for the model was mtry = 1.
The results show that the optimal number of randomly selected vari-
ables at each node is 1. The optimal AUC is not much higher
than that from the bagging tree.
If you have already selected the values of the tuning parameters, you can
also use the randomForest package to fit a random forest directly.
Since the bagging tree is a special case of the random forest, you can fit
a bagging tree by setting $mtry = p$. The importance() function
returns the importance of each predictor:
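A minimal sketch of the direct fit (the object name rfit matches the calls below; mtry = 1 follows the tuning result):

library(randomForest)
set.seed(100)
rfit <- randomForest(x = trainx, y = as.factor(trainy), mtry = 1, ntree = 1000)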
importance(rfit)
## MeanDecreaseGini
## Q1 9.056
## Q2 7.582
## Q3 7.611
## Q4 12.308
## Q5 5.628
## Q6 9.740
## Q7 6.638
## Q8 7.829
## Q9 5.955
## Q10 4.781
## segment 11.185
varImpPlot(rfit)
[Figure: varImpPlot(rfit) showing MeanDecreaseGini for each predictor; Q4 and segment rank highest.]
It is easy to see from the plot that segment and Q4 are the top two
variables to classify gender.
$$\overline{err} = \frac{1}{N}\sum_{i=1}^{N} I\big(y_i \neq G(x_i)\big)$$
The algorithm produces a series of classifiers 𝐺𝑚 (𝑥), 𝑚 =
1, 2, ..., 𝑀 from different iterations. In each iteration, it finds the
best classifier based on the current weights. The misclassified sam-
ples in the 𝑚𝑡ℎ iteration will have higher weights in the (𝑚+1)𝑡ℎ it-
eration and the correctly classified samples will have lower weights.
As it moves on, the algorithm will put more effort into the “diffi-
cult” samples until it can correctly classify them. So it requires the
algorithm to change focus at each iteration. At each iteration, the
algorithm will calculate a stage weight based on the error rate. The
final prediction is a weighted combination of all the classifiers:

$$G(x) = sign\Big(\sum_{m=1}^{M}\alpha_m G_m(x)\Big)$$
Algorithm 5 AdaBoost.M1
1: The response variable has two values: +1 and −1
2: Initialize the observations to have the same weights: $w_i = \frac{1}{N},\ i = 1, \dots, N$
3: for m = 1 to M do
4:   Fit a classifier $G_m(x)$ using weights $w_i$
5:   Compute the error rate: $err_m = \frac{\sum_{i=1}^{N} w_i I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} w_i}$
When using the tree as the base learner, basic gradient boosting
has two tuning parameters: tree depth and the number of itera-
tions. You can further customize the algorithm by selecting a dif-
ferent loss function and gradient (Hastie T, 2008). The final line
of the loop includes a regularization strategy. Instead of adding
$f_i^{(j)}$ to the previous iteration's $f_i$, only a fraction of the value is
added. This fraction is called the learning rate, denoted $\lambda$ in the algo-
rithm. It can take values between 0 and 1 and is another tuning
parameter of the model.
The way to calculate variable importance in boosting is similar
to a bagging model. You get variable importance by combining
measures of importance across the ensemble. For example, we can
calculate the Gini index improvement for each variable across all
trees and use the average as the measurement of the importance.
Boosting is a very popular method for classification. It is one of the
methods that can be directly applied to the data without requir-
ing a great deal of time-consuming data preprocessing. Applying
boosting on tree models significantly improves predictive accuracy.
Some advantages of trees that are sacrificed by boosting are speed
and interpretability.
Let’s look at the R implementation.
set.seed(100)
gbmTune <- caret::train(x = trainx,
y = trainy,
method = "gbm",
tuneGrid = gbmGrid,
metric = "ROC",
verbose = FALSE,
trControl = trainControl(method = "cv",
classProbs = TRUE,
savePredictions = TRUE))
1000 samples
11 predictor
2 classes: 'Female', 'Male'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 899, 900, 900, 899, 899, 901, ...
Resampling results across tuning parameters:
ROC was used to select the optimal model using the largest value.
The final values used for the model were n.trees = 4,
interaction.depth = 3, shrinkage = 0.01 and n.minobsinnode = 6.
The results show that the tuning parameter settings that lead to
the best ROC are n.trees = 4 (number of trees), interaction.depth =
3 (depth of tree), shrinkage = 0.01 (learning rate) and n.minobsinnode
= 6 (minimum number of observations in each node).
Now, let's compare the results from the tree models above.
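The ROC objects plotted below are built from the cross-validated predictions saved during tuning (savePredictions = TRUE). A sketch of the pattern for the single tree (rpartTune is the single-tree tuning object sketched earlier); treebagRoc, rfRoc, and gbmRoc are built the same way from bagTune, rfTune, and gbmTune:

library(pROC)
rpartRoc <- roc(response = rpartTune$pred$obs,
                predictor = rpartTune$pred$Female,
                levels = rev(levels(rpartTune$pred$obs)))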
plot.roc(rpartRoc,
type = "s",
print.thres = c(.5), print.thres.pch = 16,
print.thres.pattern = "", print.thres.cex = 1.2,
col = "black", legacy.axes = TRUE,
print.thres.col = "black")
plot.roc(treebagRoc,
type = "s",
add = TRUE,
print.thres = c(.5), print.thres.pch = 3,
legacy.axes = TRUE, print.thres.pattern = "",
print.thres.cex = 1.2,
col = "red", print.thres.col = "red")
plot.roc(rfRoc,
type = "s",
add = TRUE,
print.thres = c(.5), print.thres.pch = 1,
legacy.axes = TRUE, print.thres.pattern = "",
print.thres.cex = 1.2,
col = "green", print.thres.col = "green")
plot.roc(gbmRoc,
type = "s",
add = TRUE,
print.thres = c(.5), print.thres.pch = 10,
legacy.axes = TRUE, print.thres.pattern = "",
print.thres.cex = 1.2,
col = "blue", print.thres.col = "blue")
[Figure: ROC curves for the single tree, bagged tree, random forest, and boosted tree models.]
Since the data here doesn't have many variables, we don't see
a significant difference among the models. But you can still see
that the ensemble methods are better than a single tree. In most
real applications, ensemble methods perform much better.
Random forest and boosted trees can serve as baseline models:
before exploring different models, you can quickly run a random
forest to get a sense of the achievable performance.
12
Deep Learning
Practitioners can now pick one framework and start training their deep learning
models right away in popular cloud environments. Much of the
heavy lifting to train a deep learning model has been embedded in
these open-source frameworks and there are also many pre-trained
models available for users to adopt. Users can now enjoy the rela-
tively easy access to software and hardware to develop their own
deep learning applications. In this book, we will demonstrate deep
learning examples using Keras, a high-level abstraction of Tensor-
Flow, using the Databricks Community Edition platform.
In summary, deep learning is not something that developed only in the
past few years; it has been an ongoing research area for decades.
The accumulation of data, the advancement of new opti-
mization algorithms, and the improvement of computation power
have finally enabled everyday deep learning applications. In the
foreseeable future, deep learning will continue to revolutionize ma-
chine learning methods across many more areas.
$$\hat{y}^{(i)} = \sigma(w^T x^{(i)} + b)$$

where $\sigma(z) = \frac{1}{1+e^{-z}}$. The following figure summarizes the process:
There are two types of layers. The last layer connects directly to
the output. All the rest are intermediate layers. Depending on your
definition, we call it “0-layer neural network” where the layer count
only considers intermediate layers. To train the model, you need
a cost function which is defined as equation (12.2).
$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) \tag{12.2}$$

where

$$L(\hat{y}^{(i)}, y^{(i)}) = -y^{(i)}\log(\hat{y}^{(i)}) - (1-y^{(i)})\log(1-\hat{y}^{(i)})$$
$b^{[1]}$ is the column vector of the four bias parameters shown above.
$z^{[1]}$ is a column vector of the four non-activated neurons. When you
apply an activation function to a matrix or vector, you apply it
element-wise. $W^{[1]}$ is the matrix formed by stacking the four row vectors:

$$W^{[1]} = \begin{bmatrix} w_1^{[1]T} \\ w_2^{[1]T} \\ w_3^{[1]T} \\ w_4^{[1]T} \end{bmatrix}$$
So if you have one sample, you can go through the above forward
propagation process to calculate the output $\hat{y}$ for that sample. If
you have m training samples, you need to repeat this process for each
of the m samples. We use superscript (i) to denote a quantity
associated with the $i^{th}$ sample. You need to do the same cal-
culation for all m samples.
For i = 1 to m, do:
where

$$X = \begin{bmatrix} | & | & & | \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ | & | & & | \end{bmatrix},$$

$$A^{[l]} = \begin{bmatrix} | & | & & | \\ a^{[l](1)} & a^{[l](2)} & \cdots & a^{[l](m)} \\ | & | & & | \end{bmatrix}_{l = 1\ \text{or}\ 2},$$

$$Z^{[l]} = \begin{bmatrix} | & | & & | \\ z^{[l](1)} & z^{[l](2)} & \cdots & z^{[l](m)} \\ | & | & & | \end{bmatrix}_{l = 1\ \text{or}\ 2}$$
You can add layers like this to get a deeper neural network as
shown in the bottom right of figure 12.1.
$$\sigma(z) = \frac{1}{1+e^{-z}}$$

[Figure: the sigmoid activation function plotted for z from −5 to 5.]
When the output has more than 2 categories, people use softmax
function as the output layer activation function.
$$f_i(\mathbf{z}) = \frac{e^{z_i}}{\sum_{j=1}^{J} e^{z_j}} \tag{12.3}$$
where z is a vector.
• Hyperbolic Tangent Function (tanh)
$$tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} \tag{12.4}$$

[Figure: the tanh activation function plotted for z from −5 to 5.]
The tanh function crosses point (0, 0) and the value of the function
is between 1 and -1 which makes the mean of the activated neurons
closer to 0. The sigmoid function doesn’t have that property. When
you preprocess the training input data, you sometimes center the
data so that the mean is 0. The tanh function is doing that data
processing in some way which makes learning for the next layer a
little easier. This activation function is used a lot in the recurrent
neural networks where you want to polarize the results.
• Rectified Linear Unit (ReLU) Function
The most popular activation function is the Rectified Linear Unit
(ReLU) function. It is a piecewise function, or a half rectified func-
tion:
$$R(z) = \max(0, z)$$

[Figure: the ReLU activation function plotted for z from −5 to 5.]
$$R(z)_{Leaky} = \begin{cases} z & z \geq 0 \\ az & z < 0 \end{cases}$$

[Figure: the leaky ReLU activation function.]
12.1.5 Optimization
So far, we have introduced the core components of deep learning
architecture, layer, weight, activation function, and loss function.
With the architecture in place, we need to determine how to update
the network based on a loss function (a.k.a. objective function). In
this section, we will look at variants of optimization algorithms
that will improve the training speed.
$$V_t = \beta V_{t-1} + (1-\beta)\theta_t$$

And we have:

$$V_0 = 0,\quad V_1 = \beta V_0 + (1-\beta)\theta_1,\quad V_2 = \beta V_1 + (1-\beta)\theta_2,\quad \dots,\quad V_{100} = \beta V_{99} + (1-\beta)\theta_{100}$$

For example, with $\beta = 0.95$:

$$V_0 = 0,\quad V_1 = 0.05\theta_1,\quad V_2 = 0.0475\theta_1 + 0.05\theta_2,\ \dots$$
The black line in the left plot of figure 12.8 is the exponentially
weighted average of simulated temperature data with β = 0.95.
$V_t$ is approximately an average over the previous $\frac{1}{1-\beta}$ days, so β =
0.95 approximates a 20-day average. The red line corresponds to
β = 0.8, which approximates a 5-day average. As β increases, the
average is taken over a larger window of previous values, and hence the
curve gets smoother. A larger β also means that the current
value $\theta_t$ gets less weight (1 − β), so the average adapts more slowly.
It is easy to see from the plot that the averages at the beginning
are more biased. Bias correction can help to achieve a better
estimate:
$$V_t^{corrected} = \frac{V_t}{1-\beta^t}$$

$$V_1^{corrected} = \frac{V_1}{1-0.95} = \theta_1$$

$$V_2^{corrected} = \frac{V_2}{1-0.95^2} = 0.4872\theta_1 + 0.5128\theta_2$$
For β = 0.95, the original $V_2 = 0.0475\theta_1 + 0.05\theta_2$, which is a
small fraction of both $\theta_1$ and $\theta_2$. That is why it starts so
much lower, with a big bias. After correction, $V_2^{corrected} = 0.4872\theta_1 +
0.5128\theta_2$ is a weighted average whose two weights add up to 1,
which removes the bias.
# simulate 100 days of temperature-like data
# (the coefficients a, b, c are not shown in this excerpt; values here are placeholders)
a <- -0.02; b <- 2.5; c <- 20
day = c(1:100)
theta = a * day^2 + b * day + c + runif(length(day), -5, 5)
theta = round(theta, 0)
par(mfrow = c(1, 2))
plot(day, theta, cex = 0.5, pch = 3, ylim = c(0, 100),
     main = "Without Correction",
     xlab = "Days", ylab = "Tempreture")
beta1 = 0.95
beta2 = 0.8
beta3 = 0.5
# exponentially weighted average without bias correction
# (the function header is reconstructed; the name is an assumption)
exp_weight_avg = function(theta, beta) {
    v = rep(0, length(theta))
    for (i in 1:length(theta)) {
        if (i == 1) {
            v[i] = (1 - beta) * theta[i]
        } else {
            v[i] = beta * v[i - 1] + (1 - beta) * theta[i]
        }
    }
    return(v)
}
# exponentially weighted average with bias correction
# (the function header is reconstructed; the name is an assumption)
exp_weight_avg_correct = function(theta, beta) {
    v = rep(0, length(theta))
    for (i in 1:length(theta)) {
        if (i == 1) {
            v[i] = (1 - beta) * theta[i]
        } else {
            v[i] = beta * v[i - 1] + (1 - beta) * theta[i]
        }
    }
    # divide by (1 - beta^t) to correct the bias in the early samples
    v = v/(1 - beta^c(1:length(v)))
    return(v)
}
[Figure 12.8: exponentially weighted averages of the simulated temperature data without (left) and with (right) bias correction, for β = 0.95, 0.8, and 0.5.]
$$w = w - \alpha V_{dw};\qquad b = b - \alpha V_{db}$$

$$w = w - \alpha\frac{dw}{\sqrt{S_{dw}}};\qquad b = b - \alpha\frac{db}{\sqrt{S_{db}}}$$
parameter. The goal is still to adjust the learning speed. Recall the
example that illustrates the intuition behind it. When parameter
b is close to its target value, we want to decrease the oscillations
along the vertical direction.
Adaptive Moment Estimation (Adam)
The Adaptive Moment Estimation (Adam) algorithm is, in some
way, a combination of momentum and RMSprop. On iteration t,
compute dw, db using the current mini-batch. Then calculate both
V and S using the gradient descents.
$$\min_{w,b} J(w, b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + penalty$$
$$L_2\ penalty = \frac{\lambda}{2m}\|w\|_2^2 = \frac{\lambda}{2m}\sum_{i=1}^{n_x} w_i^2$$

$$L_1\ penalty = \frac{\lambda}{m}\sum_{i=1}^{n_x}|w_i|$$
For a neural network,

$$J(w^{[1]}, b^{[1]}, \dots, w^{[L]}, b^{[L]}) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2}\sum_{l=1}^{L}\|w^{[l]}\|_F^2$$

where

$$\|w^{[l]}\|_F^2 = \sum_{i=1}^{n^{[l]}}\sum_{j=1}^{n^{[l-1]}}\big(w_{ij}^{[l]}\big)^2$$
Let’s look at how to use the keras R package for a toy exam-
ple in deep learning with the handwritten digits image dataset
(i.e. MNIST). keras has many dependent packages, so it takes a
few minutes to install.
install.packages("keras")
library(keras)
install_keras()
You can run the code in this section in the Databrick community
edition with R as the interface. Refer to section 4.3 for how to
set up an account, create a notebook (R or Python) and start
a cluster. For an audience with a statistical background, using a
well-managed cloud environment has the following benefit:
• Minimum language barrier in coding for most statisticians
• Zero setups to save time using the cloud environment
• Get familiar with the current trend of cloud computing in the
industrial context
You can also run the code on your local machine with R and the
required Python packages (keras uses the Python TensorFlow back-
end engine). Different versions of Python may cause some errors
when running install_keras(). Here are the things you could do
when you encounter the Python backend issue in your local ma-
chine:
• Run reticulate::py_config() to check the current Python config-
uration to see if anything needs to be changed.
• By default, install_keras() uses virtual environment
~/.virtualenvs/r-reticulate. If you don’t know how to set
12.1 Feedforward Neural Network 289
2
https://tensorflow.rstudio.com/reference/keras/install_keras/
290 12 Deep Learning
str(mnist)
List of 2
$ train:List of 2
..$ x: int [1:60000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...
..$ y: int [1:60000(1d)] 5 0 4 1 9 2 1 3 1 4 ...
$ test :List of 2
..$ x: int [1:10000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...
..$ y: int [1:10000(1d)] 7 2 1 0 4 1 4 9 5 9 ...
Now we prepare the features (x) and the response variable (y)
for both the training and testing dataset, and we can check the
structure of the x_train and y_train using str() function.
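A sketch of the preparation step:

# training and testing features and labels
x_train <- mnist$train$x
y_train <- mnist$train$y
x_test <- mnist$test$x
y_test <- mnist$test$y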
str(x_train)
str(y_train)
Now let's plot a chosen 28x28 matrix as an image using R's image()
function. In R's image() function, the way an image is shown is
rotated 90 degrees from the matrix representation, so there are
additional steps to rearrange the matrix such that image() shows
it in the actual orientation.
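The input_matrix shown below is one training image; a sketch (the chosen index is arbitrary):

# extract a single 28x28 image as a matrix
index_image <- 28
input_matrix <- x_train[index_image, 1:28, 1:28]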
dplyr::tibble(input_matrix)
## # A tibble: 28 × 1
## input_matrix[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0 9
# step 1: reshape
x_train <- array_reshape(x_train,
c(nrow(x_train), 784))
x_test <- array_reshape(x_test,
c(nrow(x_test), 784))
# step 2: rescale
x_train <- x_train / 255
x_test <- x_test / 255
str(x_train)
str(x_test)
The number of input features is 784 (i.e., the scaled value of each pixel in the 28x28
image) and the number of classes for the output is 10 (i.e., one of the
ten digit categories). So the input size for the first layer is 784 and the
output size for the last layer is 10. We can add any number of
compatible layers in between.
In keras, it is easy to define a DNN model: (1) use
keras_model_sequential() to initiate a model placeholder and all
model structures are attached to this model object, (2) layers are
added in sequence by calling the layer_dense() function, (3) add
arbitrary layers to the model based on the sequence of calling
layer_dense(). For a dense layer, all the nodes from the previous
layer are connected with each and every node to the next layer. In
layer_dense() function, we can define how many nodes in that layer
through the units parameter. The activation function can be de-
fined through the activation parameter. For the first layer, we also
need to define the input features’ dimension through input_shape
parameter. For our preprocessed MNIST dataset, there are 784
columns in the input data. A common way to reduce overfitting is
to use the dropout method, which randomly drops a proportion of
the nodes in a layer. We can define the dropout proportion through
layer_dropout() function immediately after the layer_dense() func-
tion.
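A sketch of the model definition consistent with the summary shown below (the dropout rates are assumptions):

dnn_model <- keras_model_sequential()
dnn_model %>%
    layer_dense(units = 256, activation = "relu", input_shape = c(784)) %>%
    layer_dropout(rate = 0.4) %>%
    layer_dense(units = 128, activation = "relu") %>%
    layer_dropout(rate = 0.3) %>%
    layer_dense(units = 64, activation = "relu") %>%
    layer_dropout(rate = 0.3) %>%
    layer_dense(units = 10, activation = "softmax")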
The above dnn_model has 4 layers: the first layer has 256 nodes, the 2nd
layer 128 nodes, the 3rd layer 64 nodes, and the last layer 10 nodes. The
activation function for the first 3 layers is relu and the activation
function for the last layer is softmax, which is typical for classifi-
cation problems. The model detail can be obtained through the sum-
mary() function. The number of parameters of each layer can be cal-
culated as: (number of input features + 1) times (number of nodes
in the layer). For example, the first layer has (784+1)x256=200960
parameters; the 2nd layer has (256+1)x128=32896 parameters.
Please note that dropout only randomly drops a certain proportion of
node outputs for each batch; it does not reduce the number of pa-
rameters in the model. The dnn_model we just defined has 242762
parameters in total to be estimated.
summary(dnn_model)
________________________________________________________________________________
Layer (type) Output Shape Param #
================================================================================
dense_1 (Dense) (None, 256) 200960
________________________________________________________________________________
dropout_1 (Dropout) (None, 256) 0
________________________________________________________________________________
dense_2 (Dense) (None, 128) 32896
________________________________________________________________________________
dropout_2 (Dropout) (None, 128) 0
________________________________________________________________________________
dense_3 (Dense) (None, 64) 8256
________________________________________________________________________________
dropout_3 (Dropout) (None, 64) 0
________________________________________________________________________________
dense_4 (Dense) (None, 10) 650
================================================================================
Total params: 242,762
Trainable params: 242,762
Non-trainable params: 0
________________________________________________________________________________
Now we can feed the data (x and y) into the neural network that we
just built to estimate all the parameters in the model. Here we
define three hyperparameters for this model: epochs, batch_size, and
validation_split. Training takes just a couple of minutes to finish.
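A minimal sketch of the compile and fit steps, assuming the standard
keras workflow; the history output below indicates 15 epochs, while
the optimizer, batch size, and validation split shown here are
illustrative assumptions.

dnn_model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = optimizer_rmsprop(),
  metrics = c("accuracy")
)

# y_train is assumed to be one-hot encoded with to_categorical()
dnn_history <- dnn_model %>% fit(
  x_train, y_train,
  epochs = 15,
  batch_size = 128,
  validation_split = 0.2
)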
str(dnn_history)
List of 2
$ params :List of 8
..$ metrics : chr [1:4] "loss" "acc" "val_loss" "val_acc"
..$ epochs : int 15
..$ steps : NULL
plot(dnn_history)
12.1.7.3 Prediction
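A sketch of evaluating the fitted network on the test set and
generating class predictions, assuming the usual keras workflow; the
object name dnn_pred is illustrative.

# model performance on the test set
dnn_model %>% evaluate(x_test, y_test)

# predicted class for each test image
dnn_pred <- dnn_model %>%
  predict(x_test) %>%
  k_argmax()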
## loss accuracy
## 0.09035096 0.98100007

[1] 190

With about 98.1% accuracy on the 10,000 test images, 190 images are
misclassified. We can pick one of them to inspect, for example:

index_image = 34
You start from the top left corner of the image, put the filter on
the top left 3 x 3 sub-matrix of the input image, and take the
element-wise product. Then you add up the 9 numbers. Move the filter
forward one step at a time until it reaches the bottom right. The
detailed process is shown in figure 12.12.
Let’s use edge detection as an example to see how convolution works.

# read the example image
image = magick::image_read("http://bit.ly/2Nh5ANX")

# a 3 x 3 kernel that detects vertical edges
kernel_vertical = matrix(c(1, 1, 1, 0, 0, 0, -1, -1, -1),
                         nrow = 3, ncol = 3)
kernel_vertical

plot(image)
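The code that applies the two kernels is sketched below using
magick::image_convolve(); the horizontal kernel is assumed to be the
vertical kernel rotated by 90 degrees, as the text explains next.

# convolve the image with the vertical and horizontal edge-detection kernels
image_edge_vertical = magick::image_convolve(image, kernel_vertical)
kernel_horizontal = t(kernel_vertical)
image_edge_horizontal = magick::image_convolve(image, kernel_horizontal)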
plot(image_edge_vertical)
plot(image_edge_horizontal)
So the output image has a lighter region in the middle that
corresponds to the vertical edge of the input image. When the input
image is large, as in figure 12.13 where it is 1020 x 711, the edge
will not look as thick as it does in this small example. To detect
the horizontal edge, you only need to rotate the filter by 90
degrees. The right image in figure 12.13 shows the horizontal edge
detection result. You can see how the convolution operator detects a
specific feature from the image.
The parameters for the convolution operation are the elements in the
filter. For the 3 x 3 filter shown below, the parameters to estimate
are 𝑤1 to 𝑤9. So far, we have moved the filter one step at a time
when we convolve. You can move more than one step as well. For
example, you can hop 2 steps each time after the sum of the
element-wise product. This is called strided convolution. Using
stride 𝑠 means the output is downsized by a factor of 𝑠. It is rarely
used in practice, but it is good to be familiar with the concept.
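As a sketch of the idea (not the book's code), the following function
convolves a 2D matrix with a square filter using stride s; the filter
values w1 to w9 below are chosen arbitrarily for illustration.

# valid convolution of a 2D matrix with a square filter, moving s steps at a time
conv2d_stride <- function(input, filter, s = 1) {
    f <- nrow(filter)
    out_rows <- (nrow(input) - f) %/% s + 1
    out_cols <- (ncol(input) - f) %/% s + 1
    output <- matrix(0, nrow = out_rows, ncol = out_cols)
    for (ii in 1:out_rows) {
        for (jj in 1:out_cols) {
            i <- (ii - 1) * s + 1
            j <- (jj - 1) * s + 1
            # element-wise product of the filter and the current patch, then sum
            output[ii, jj] <- sum(input[i:(i + f - 1), j:(j + f - 1)] * filter)
        }
    }
    output
}

# example: a 3 x 3 filter (w1 ... w9) applied with stride 2
w <- matrix(c(1, 0, -1, 1, 0, -1, 1, 0, -1), nrow = 3)
x <- matrix(rnorm(36), nrow = 6)
conv2d_stride(x, w, s = 2)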
library(EBImage)
library(dplyr)

# convert the color image to 2D grayscale by averaging the color channels
gray_eggshell = apply(eggshell, c(1, 2), mean)

# pooling with filter size f and stride s; type is either "mean" or "max"
# (the function wrapper and the 2D branch are reconstructed here; the
#  original listing only shows the 3D color-image branch)
pool_image <- function(image, f, s, type = "max") {
    if (length(dim(image)) == 3) {
        # get image dimensions
        col <- dim(image[, , 1])[2]
        row <- dim(image[, , 1])[1]
        # calculate new dimension size
        c <- (col - f)/s + 1
        r <- (row - f)/s + 1
        # create new image object (rows, columns, 3 color channels)
        newImage <- array(0, c(r, c, 3))
        # loop over the RGB layers
        for (rgb in 1:3) {
            m <- image[, , rgb]
            m3 <- matrix(0, ncol = c, nrow = r)
            i <- 1
            if (type == "mean") {
                for (ii in 1:r) {
                    j <- 1
                    for (jj in 1:c) {
                        # average of the f x f patch
                        m3[ii, jj] <- mean(as.numeric(m[i:(i + (f - 1)), j:(j + (f - 1))]))
                        j <- j + s
                    }
                    i <- i + s
                }
            } else {
                for (ii in 1:r) {
                    j <- 1
                    for (jj in 1:c) {
                        # maximum of the f x f patch
                        m3[ii, jj] <- max(as.numeric(m[i:(i + (f - 1)), j:(j + (f - 1))]))
                        j <- j + s
                    }
                    i <- i + s
                }
            }
            newImage[, , rgb] <- m3
        }
    } else if (length(dim(image)) == 2) {
        # grayscale image: the same logic applied to a single matrix
        col <- dim(image)[2]
        row <- dim(image)[1]
        c <- (col - f)/s + 1
        r <- (row - f)/s + 1
        newImage <- matrix(0, ncol = c, nrow = r)
        i <- 1
        for (ii in 1:r) {
            j <- 1
            for (jj in 1:c) {
                patch <- as.numeric(image[i:(i + (f - 1)), j:(j + (f - 1))])
                newImage[ii, jj] <- if (type == "mean") mean(patch) else max(patch)
                j <- j + s
            }
            i <- i + s
        }
    }
    newImage
}
Let’s apply both max and mean pooling with filter size 10 (𝑓 = 10)
and stride 10 (𝑠 = 10).
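A sketch of the corresponding calls, assuming the pool_image()
function reconstructed above; the object names are illustrative.

maxpool_eggshell <- pool_image(gray_eggshell, f = 10, s = 10, type = "max")
meanpool_eggshell <- pool_image(gray_eggshell, f = 10, s = 10, type = "mean")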
You can see the result by plotting the output image (figure 12.18).
The top left is the original color picture. The top right is the 2D
grayscale picture. The bottom left is the result of max pooling.
The bottom right is the result of mean pooling. The max-pooling
gives you the value of the largest pixel and the mean-pooling gives
the average of the patch. You can consider it as a representation
of features, looking at the maximal or average presence of differ-
ent features. In general, max-pooling works better. You can gain
some intuition from the example (figure 12.18). The max-pooling
“picks” more distinct features and average-pooling blurs out fea-
tures evenly.
You multiply each of the 27 numbers in the 3 × 3 × 3 filter with the
corresponding numbers from the top left region of the color input
image and add them up. Add a bias parameter and apply an activation
function, which gives you the first number of the output image. Then
slide the filter over to calculate the next one.
The final output is 2D 4 × 4. If you want to detect features in the
red channel only, you can use a filter with the second and third
channels to be all 0s. With different choices of the parameters,
you can get different feature detectors. You can use more than one
filter and each filter has multiple channels. For example, you can
use one 3 × 3 × 3 filter to detect the horizontal edge and another
to detect the vertical edge. Figure 12.18 shows an example of one
layer with two filters. Each filter has a dimension of 3 × 3 × 3. The
output dimension is 4×4×2. The output has two channels because
we have two filters on the layer. The total number of parameters
is 58 (each filter has one bias parameter 𝑏).
We use the following notations for layer 𝑙:
str(x_train)
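A minimal sketch of a possible cnn_model definition for the
28 x 28 x 1 MNIST input; the filter counts, kernel sizes, and
dense-layer width are illustrative assumptions, and x_train is
assumed to have been reshaped to an array of dimension (n, 28, 28, 1).

library(keras)

cnn_model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(28, 28, 1)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dense(units = 10, activation = "softmax")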
summary(cnn_model)
We then train the model and save each epoch's history using the
fit() function. Please note that, as we are not using a GPU, it takes
a few minutes to finish; please be patient while waiting for the
results. The training time can be significantly reduced by running on
a GPU.
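A sketch of the training and evaluation calls, assuming the standard
keras workflow; the history printed below indicates 10 epochs and 375
steps per epoch, which is consistent with batch_size = 128 and
validation_split = 0.2, while the optimizer is an assumption.

cnn_model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = optimizer_adam(),
  metrics = c("accuracy")
)

cnn_history <- cnn_model %>% fit(
  x_train, y_train,
  epochs = 10,
  batch_size = 128,
  validation_split = 0.2
)

# evaluate the fitted model on the test set
cnn_model %>% evaluate(x_test, y_test)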
## loss accuracy
## 0.02301287 0.99300003
str(cnn_history)
## List of 2
## $ params :List of 3
## ..$ verbose: int 1
## ..$ epochs : int 10
## ..$ steps : int 375
## $ metrics:List of 4
## ..$ loss : num [1:10] 0.3415 0.0911 0.0648 0.0504 0.0428 ...
## ..$ accuracy : num [1:10] 0.891 0.973 0.981 0.985 0.987 ...
## ..$ val_loss : num [1:10] 0.071 0.0515 0.0417 0.0377 0.0412 ...
## ..$ val_accuracy: num [1:10] 0.978 0.985 0.988 0.99 0.988 ...
## - attr(*, "class")= chr "keras_training_history"
plot(cnn_history)
12.2.5.3 Prediction
# model prediction
cnn_pred <- cnn_model %>%
predict(x_test) %>%
k_argmax()
head(cnn_pred)
## [1] 70

With 99.3% accuracy on the 10,000 test images, only about 70 images
are misclassified.
12.3 Recurrent Neural Network

Some data are sequential, where each observation depends on the
previous ones. For example, consider mapping an input audio clip to a
text transcript: the input is the voice signal over time, and the
output is the corresponding sequence of words over time. A recurrent
neural network (RNN) is a deep-learning model that can process this
type of sequential data.
The recurrent neural network allows information to flow from one step
to the next through a repetitive structure. Figure 12.20 shows the
basic building block of an RNN. You combine the activation from the
previous step with the current input $x^{<t>}$ to produce an output
$\hat{y}^{<t>}$ and an updated activation that supports the next
input at $t + 1$.
Suppose each input word is represented using a dictionary of the
10,000 most frequently occurring words. You can build the dictionary
by finding the top 10,000 occurring words in your training set. Each
word in your training set then has a position in the dictionary
sequence. For example, “use” is the 8320th element of the dictionary
sequence, so $x^{<1>}$ is a vector of all zeros except for a one in
position 8320. Using this one-hot representation, each input
$x^{<t>}$ is a vector with all zeros except for one element.
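A tiny illustration of this one-hot representation (the dictionary
size and the position of “use” follow the example above):

dict_size <- 10000
x_1 <- rep(0, dict_size)
x_1[8320] <- 1    # the word "use" sits at position 8320
sum(x_1)          # exactly one non-zero element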
The information flows from one step to the next with a repetitive
structure until the last time step input $x^{<T_x>}$, and then it
outputs $\hat{y}^{<T_y>}$. In this example, $T_x = T_y$; the
architecture changes when $T_x$ and $T_y$ are not the same. The model
shares the parameters $W_{ya}$, $W_{aa}$, $W_{ax}$, $b_a$, and $b_y$
across all time steps of the input.

$$L^{<t>}(\hat{y}^{<t>}, y^{<t>}) = -y^{<t>} \log(\hat{y}^{<t>}) - (1 - y^{<t>}) \log(1 - \hat{y}^{<t>})$$

$$L(\hat{y}, y) = \sum_{t=1}^{T_y} L^{<t>}(\hat{y}^{<t>}, y^{<t>})$$
The above defines the forward process. As before, backpropagation
computes the gradients of the loss with respect to the parameters
using the chain rule for differentiation.
In this RNN structure, the information only flows from left to right,
so at any position the model only uses data from earlier in the
sequence to make a prediction. That does not work when predicting the
current word requires information from later words. For example,
consider the following two sentences:

Given just the first three words, it is not enough to know whether
the word “April” is part of a person's name. It is a person's name in
sentence 1 but not in sentence 2, yet the two sentences share the
same first three words. In this case, we need a model that allows the
information to flow in both directions. A bidirectional RNN takes
data from both earlier and later in the sequence. The disadvantage is
that it needs the entire word sequence to make a prediction at any
position. For a speech recognition application that requires
capturing the speech in real time, we need a more complex method
called the attention model. We will not get into those models here.
Deep Learning with R (Chollet and Allaire, 2018) provides a
high-level introduction to bidirectional RNNs with runnable code; it
teaches both the intuition and the practical, computational usage of
deep learning models. For Python users, refer to Deep Learning with
Python (Chollet, 2017). A standard text with heavier mathematics is
Deep Learning (Goodfellow et al., 2016).
For sentence 1, you need to use “she” in the adjective clause because
“April” is a girl; for sentence 2, you need to use a different word
because “April” is not a person there.
The word “male” has a score of -1 for the “gender” feature, and
“female” has a score of 1. Both “Apple” and “pumpkin” have a high
score for the “food” feature and much lower scores for the rest. You
can set the number of features to learn, usually more than the few we
list in the figure above. If you use 200 features to represent the
words, then the learned embedding for each word is a vector of length
200.
For language-related applications, text embedding is the most crit-
ical step. It converts raw text into a meaningful vector representa-
tion. Once we have a vector representation, it is easy to calculate
typical numerical metrics such as cosine similarity. There are many
pre-trained text embeddings available for us to use. We will briefly
introduce some of these popular embeddings.
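As a small illustration of such a metric, the following computes the
cosine similarity between two hypothetical embedding vectors (the
numbers are made up for illustration):

cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
v_apple <- c(0.1, 0.9, 0.2)
v_pumpkin <- c(0.2, 0.8, 0.1)
cosine_sim(v_apple, v_pumpkin)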
The first widely used embedding is word2vec. It was introduced in
2013 and trained on a large collection of text in an unsupervised
way. Training the word2vec embedding uses either the continuous
bag-of-words (CBOW) or the skip-gram architecture. In the
bag-of-words architecture, the model predicts the current word based
on a window of surrounding context words. In the skip-gram
architecture, the model uses the current word to predict the
surrounding window of context words. There are pre-trained word2vec
embeddings based on large amounts of text (such as wiki pages, news
reports, etc.) for general applications.
The IMDB movie-review dataset is split so that the training and
testing data have 25,000 records each. Each review varies in length.
12.3.4.1 Data preprocessing
Machine learning algorithms cannot deal with raw text, so we have to
convert text into numbers before feeding it into an algorithm.
Tokenization is one way to convert text data into a numerical
representation. For example, suppose we have 500 unique words across
all reviews in the training dataset. We can label each word by the
rank (i.e., from 1 to 500) of its frequency in the training data.
Then each word is replaced by an integer between 1 and 500. This way,
we can map each movie review from its raw text format to a sequence
of integers.

As reviews have different lengths, the sequences of integers will
have different sizes too. So another important step is to make sure
each input has the same length, by padding or truncating. For
example, we can set a length of 50 words: for any review shorter than
50 words, we pad with 0s to make it 50 in length, and for reviews
with more than 50 words, we truncate the sequence to 50 by keeping
only the first 50 words. After padding and truncating, we have a
typical data frame where each row is an observation and each column
is a feature. The number of features is the number of words designed
for each review (i.e., 50 in this example).
After tokenization, the numerical input is just a naive mapping of
the original words; the integers do not have their usual numerical
meanings. We need to use embedding to convert these categorical
integers into more meaningful representations. The word embedding
captures the inherent relationships among words and dramatically
reduces the input dimension (see section 12.3.3). The embedding
dimension is the size of the vector space used to represent the
entire vocabulary; it can be, for example, 128 or 256, and it stays
the same even when the vocabulary changes. Each embedding vector is
filled with real numbers. The embedding vectors can be learned from
the training data, or we can use pre-trained embedding models such as
word2vec or BERT.
Now we load the IMDB dataset, and we can check the structure of the
loaded object using the str() command.
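A sketch of the loading and padding steps, assuming a vocabulary of
the 10,000 most frequent words (max_unique_word) and a maximum review
length of 128 (max_review_len); the book's exact values may differ.

library(keras)

max_unique_word <- 10000
max_review_len <- 128

imdb <- dataset_imdb(num_words = max_unique_word)
c(c(x_train, y_train), c(x_test, y_test)) %<-% imdb
str(x_train)

# pad (or truncate) every review to the same length
x_train <- pad_sequences(x_train, maxlen = max_review_len)
x_test <- pad_sequences(x_test, maxlen = max_review_len)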
The x_train and x_test are numerical data frames ready to be used
for recurrent neural network models.
Simple Recurrent Neural Network

Like the DNN and CNN models we trained earlier, RNN models are
relatively easy to train using keras after the pre-processing stage.
In the following example, we use layer_embedding() to fit an
embedding layer based on the training dataset; it has two parameters:
input_dim (the number of unique words) and output_dim (the length of
the dense vectors). Then we add a simple RNN layer by calling
layer_simple_rnn(), followed by a dense layer layer_dense() that
connects to the binary response variable.
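A minimal sketch of the model definition and compile step described
above; the embedding output dimension, the number of RNN units, and
the dropout rates are illustrative assumptions, and max_unique_word
is the vocabulary size defined when loading the data.

rnn_model <- keras_model_sequential()

rnn_model %>%
  layer_embedding(input_dim = max_unique_word, output_dim = 128) %>%
  layer_simple_rnn(units = 64, dropout = 0.2, recurrent_dropout = 0.2) %>%
  layer_dense(units = 1, activation = "sigmoid")

rnn_model %>% compile(
  loss = "binary_crossentropy",
  optimizer = "adam",
  metrics = c("accuracy")
)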
batch_size = 128
epochs = 5
validation_split = 0.2

# train the model; the first two arguments to fit() are reconstructed
rnn_history <- rnn_model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs = epochs,
  validation_split = validation_split
)
plot(rnn_history)
rnn_model %>%
evaluate(x_test, y_test)
## loss accuracy
## 0.5441073 0.7216800
A simple RNN layer is a good starting point for learning RNNs, but
its performance is usually not that good, because long-term
dependencies are hard to learn due to the vanishing gradient problem.
A Long Short-Term Memory (LSTM) model can carry useful information
from earlier words to later words. In keras, it is easy to replace a
simple RNN layer with an LSTM layer by using layer_lstm().
lstm_model <- keras_model_sequential()

lstm_model %>%
  layer_embedding(input_dim = max_unique_word, output_dim = 128) %>%
  layer_lstm(units = 64, dropout = 0.2, recurrent_dropout = 0.2) %>%
  layer_dense(units = 1, activation = 'sigmoid')
batch_size = 128
epochs = 5
validation_split = 0.2
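# a sketch of the compile and fit calls, mirroring the simple RNN model
# above; the optimizer choice is an assumption
lstm_model %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = c('accuracy')
)

lstm_history <- lstm_model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs = epochs,
  validation_split = validation_split
)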
plot(lstm_history)
lstm_model %>%
evaluate(x_test, y_test)
## loss accuracy
## 0.361364 0.844080
13
Handling Large Local Data
13.1 readr
read_csv("2015,2016,2017
1,2,3
4,5,6")
## Rows: 2 Columns: 3
## -- Column specification -------------------------------
## Delimiter: ","
## dbl (3): 2015, 2016, 2017
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 x 3
## `2015` `2016` `2017`
## <dbl> <dbl> <dbl>
## 1 1 2 3
## 2 4 5 6
The major function of readr is to turn flat files into data frames:

• read_csv(): reads comma-delimited files
• read_csv2(): reads semicolon-separated files (common in countries
  where , is used as the decimal mark)
• read_table(): reads a common variation of fixed-width files where
  columns are separated by white space
• read_log(): reads Apache-style log files

The good thing is that these functions have similar syntax. Once you
learn one, the others become easy. Here we will focus on read_csv().
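A sketch of reading the customer data used throughout the book and
printing the first rows; the URL appears later in this chapter, and
wrapping the result in head() matches the six rows shown below.

sim.dat <- read_csv("http://bit.ly/2P5gTw4")
head(sim.dat)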
# A tibble: 6 x 19
age gender income house store_exp online_exp store_trans online_trans Q1
<int> <chr> <dbl> <chr> <dbl> <dbl> <int> <int> <int>
1 57 Female 1.21e5 Yes 529. 304. 2 2 4
2 63 Female 1.22e5 Yes 478. 110. 4 2 4
3 59 Male 1.14e5 Yes 491. 279. 7 2 5
4 60 Male 1.14e5 Yes 348. 142. 10 2 5
5 51 Male 1.24e5 Yes 380. 112. 4 4 4
6 59 Male 1.08e5 Yes 338. 196. 4 5 4
# ... with 10 more variables: Q2 <int>, Q3 <int>, Q4 <int>, Q5 <int>, Q6 <int>,
# Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>, segment <chr>
The function reads the file into R as a tibble. You can consider a
tibble to be the next iteration of the data frame. Tibbles differ
from data frames in the following ways:

• A tibble never changes the type of an input (i.e., no more
  stringsAsFactors = FALSE!)
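For example, if one of the rows cannot be parsed as a number, the
whole column is read as characters; a sketch consistent with the
output below:

dat <- read_csv("2015,2016,2017
100,200,300
canola,soybean,corn")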
## Rows: 2 Columns: 3
## -- Column specification -------------------------------
## Delimiter: ","
## chr (3): 2015, 2016, 2017
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(dat)
## # A tibble: 2 x 3
## `2015` `2016` `2017`
## <chr> <chr> <chr>
## 1 100 200 300
## 2 canola soybean corn
You can also add comments at the top of the file and tell R to skip
those lines:
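A sketch consistent with the output below, assuming the comment lines
start with # and are skipped via the comment argument (the original
comment text is not shown):

dat <- read_csv("# this is a comment line to skip
Date,Food,Mood
Monday,carrot,happy
Tuesday,carrot,happy
Wednesday,carrot,happy
Thursday,carrot,happy
Friday,carrot,happy
Saturday,carrot,extremely happy
Sunday,carrot,extremely happy", comment = "#")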
## Rows: 7 Columns: 3
## -- Column specification -------------------------------
## Delimiter: ","
## chr (3): Date, Food, Mood
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(dat)
## # A tibble: 7 x 3
## Date Food Mood
## <chr> <chr> <chr>
## 1 Monday carrot happy
## 2 Tuesday carrot happy
## 3 Wednesday carrot happy
## 4 Thursday carrot happy
## 5 Friday carrot happy
## 6 Saturday carrot extremely happy
## 7 Sunday carrot extremely happy
If you don’t have column names, set col_names = FALSE and R will
assign the names “X1”, “X2”, … to the columns:
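A sketch consistent with the output below:

dat <- read_csv("Saturday,carrot,extremely happy
Sunday,carrot,extremely happy", col_names = FALSE)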
## Rows: 2 Columns: 3
## -- Column specification -------------------------------
## Delimiter: ","
## chr (3): X1, X2, X3
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(dat)
## # A tibble: 2 x 3
## X1 X2 X3
## <chr> <chr> <chr>
## 1 Saturday carrot extremely happy
## 2 Sunday carrot extremely happy
## Rows: 1 Columns: 10
## -- Column specification -------------------------------
## Delimiter: "\t"
## chr (10): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(dat)
## # A tibble: 1 x 10
## X1 X2 X3 X4 X5 X6 X7 X8 X9
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 every man is a poet when he is in
## # ... with 1 more variable: X10 <chr>
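For other delimiters you can use read_delim(); a sketch consistent
with the output below (the trailing newline marks the string as
literal data rather than a file path):

dat <- read_delim("THE|UNBEARABLE|RANDOMNESS|OF|LIFE\n",
  delim = "|", col_names = FALSE)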
## Rows: 1 Columns: 5
## -- Column specification -------------------------------
## Delimiter: "|"
## chr (5): X1, X2, X3, X4, X5
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(dat)
## # A tibble: 1 x 5
## X1 X2 X3 X4 X5
## <chr> <chr> <chr> <chr> <chr>
## 1 THE UNBEARABLE RANDOMNESS OF LIFE
Another situation you will often run into is missing values. In
marketing surveys, people often use “99” to represent a missing
value. You can tell R to treat all observations with the value “99”
as missing when you read the data:
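A sketch consistent with the output below:

dat <- read_csv("Q1,Q2,Q3
5,4,99", na = "99")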
## Rows: 1 Columns: 3
## -- Column specification -------------------------------
## Delimiter: ","
## dbl (2): Q1, Q2
## lgl (1): Q3
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(dat)
## # A tibble: 1 x 3
## Q1 Q2 Q3
## <dbl> <dbl> <lgl>
## 1 5 4 NA
For writing data back to disk, you can use write_csv() and
write_tsv(). Two characteristics of these functions increase the
chances of the output file being read back in correctly:

• They always encode strings in UTF-8.
• They save dates and date-times in ISO8601 format so they are easily
  parsed elsewhere.
write_csv(sim.dat, "sim_dat.csv")
For other data types, you can use the following packages:

• haven: SPSS, Stata, and SAS data
• readxl and xlsx: Excel data (.xls and .xlsx)
• DBI: together with a database backend such as RMySQL, RSQLite, or
  RPostgreSQL, lets you read data directly from a database using SQL

Some other useful materials:

• For getting data from the internet, you can refer to the book
  “XML and Web Technologies for Data Sciences with R”.
13.2 data.table — enhanced data.frame

We will use the clothes customer data to illustrate. There are two
dimensions inside []: the first one indicates the row and the second
one indicates the column, and a comma separates them.
library(dplyr)

# read data
sim.dat <- readr::read_csv("http://bit.ly/2P5gTw4")

# dplyr syntax: average number of online transactions by gender
sim.dat %>%
  group_by(gender) %>%
  summarise(Avg_online_trans = mean(online_trans))
library(data.table)

# convert the data frame into a data table
dt <- data.table(sim.dat)
class(dt)

# average number of online transactions across all customers
dt[, mean(online_trans)]

## [1] 13.55

# note: the same syntax fails on a plain data frame: sim.dat[, mean(online_trans)]
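To get the average by gender, pass the grouping variable as the third
argument inside the brackets; a sketch consistent with the output
below:

dt[, mean(online_trans), by = gender]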
## gender V1
## 1: Female 15.38
## 2: Male 11.26
You can group by more than one variable, for example by “gender” and
“house”:
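A sketch consistent with the output below, using .() to list the
grouping variables:

dt[, mean(online_trans), by = .(gender, house)]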
## gender house V1
## 1: Female Yes 11.312
## 2: Male Yes 8.772
## 3: Female No 19.146
## 4: Male No 16.486
Different from a data frame, a data table takes three arguments
inside the brackets: dt[i, j, by], where i selects rows, j computes
on columns, and by specifies the grouping variables. This general
form maps naturally onto a SQL query. For example, the by-gender
average we computed above is equal to the following SQL:
SELECT
gender,
avg(online_trans)
FROM
sim.dat
GROUP BY
gender
R code:
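(a sketch of the data.table call that matches the query below; the
named column avg corresponds to AS avg)

dt[, .(avg = mean(online_trans)), by = .(gender, house)]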
is equal to SQL:
SELECT
gender,
house,
avg(online_trans) AS avg
FROM
sim.dat
GROUP BY
gender,
house
R code:
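(a sketch of the data.table call that matches the query below; the
row filter goes in the first argument)

dt[age < 40, .(avg = mean(online_trans)), by = .(gender, house)]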
is equal to SQL:
SELECT
gender,
house,
avg(online_trans) AS avg
FROM
sim.dat
WHERE
age < 40
GROUP BY
gender,
house
You can see the analogy between data.table and SQL. Now let's focus
on operations in data.table:

• select row
• select column

Selecting a column in data.table does not need $:
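A sketch consistent with the output below; the bare column name is
used inside the brackets, wrapped in head() to shorten the printed
result:

head(dt[, age])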
## [1] 57 63 59 60 51 59
• tabulation

In data.table, .N means to count:
# row count
dt[, .N]
## [1] 1000
# counts by gender
dt[, .N, by= gender]
## gender N
## 1: Female 554
## 2: Male 446
## gender count
## 1: Female 292
## 2: Male 86
Order table:

Since a data.table keeps some characteristics of a data frame, they
share some operations. For example, the following orders the table by
online_exp in descending order and shows the top five rows:
dt[order(-online_exp)][1:5]
You can also order the table by more than one variable. The following
code orders the table by gender and then, within gender, by
online_exp in descending order:
dt[order(gender, -online_exp)][1:5]
14
R code for data simulation
2,NA,5000,NA,10,500,NA,NA),
ncol=length(vars), byrow=TRUE)
Now let’s edit the data we just simulated a little by adding tags
to 0/1 binomial variables:
In the real world, the data always includes some noise, such as
missing values or wrong imputation. So we will add some noise to the
data:
So far we have created part of the data. You can check it using
summary(sim.dat). Next, we will move on to simulate survey data.
nf <- 800

for (j in 1:20) {
    set.seed(19870 + j)
    x <- c("A", "B", "C")
    sim.da1 <- NULL
    for (i in 1:nf) {
        # sample(x, 120, replace=TRUE)->sam
        sim.da1 <- rbind(sim.da1, sample(x, 120, replace = TRUE))
    }

    # r = 0.5
    # s1 <- c(rep(c(1/2, 0, -1/2), 40),

    # r = 1
    # s1 <- c(rep(c(1, 0, -1), 40),
    #         rep(c(1, 0, 0), 40),
    #         rep(c(0, 0, 0), 40))
    # link1 <- as.matrix(dummy.sim1) %*% s1 - 40/3

    # r = 2
    # s1 <- c(rep(c(2, 0, -2), 40),
    #         rep(c(2, 0, 0), 40),
    #         rep(c(0, 0, 0), 40))
    #
    # link1 <- as.matrix(dummy.sim1) %*% s1 - 40/3/0.5

    for (i in 1:120) {
        ind <- c(ind, rep(i, 2))
    }