Practical Business Analytics Using R and Python
Solve Business Problems Using a Data-driven Approach
Second Edition
Umesh R. Hodeghatta, Ph.D
Umesha Nayak
Table of Contents
Preface ........................................................................ xix
Foreword ....................................................................... xxiii
6.2.3 Mean Absolute Error (MAE) or Mean Absolute Deviation (MAD) .............. 168
6.2.4 Sum of Squared Errors (SSE) ............................................. 169
6.2.5 R² (R-Squared) .......................................................... 169
6.2.6 Adjusted R² ............................................................. 169
6.3 Classification Model Evaluation ........................................... 170
6.3.1 Classification Error Matrix ............................................. 170
6.3.2 Sensitivity Analysis in Classification .................................. 171
6.4 ROC Chart ................................................................. 173
6.5 Overfitting and Underfitting .............................................. 174
6.5.1 Bias and Variance ....................................................... 175
6.6 Cross-Validation .......................................................... 177
6.7 Measuring the Performance of Clustering ................................... 179
6.8 Chapter Summary ........................................................... 179
References .................................................................... 673
Index ......................................................................... 683
About the Authors
Dr. Umesh R. Hodeghatta is an engineer, scientist, and
educator. He is currently a faculty member at Northeastern
University, specializing in data analytics, AI, machine
learning, deep learning, natural language processing
(NLP), and cybersecurity. He has more than 25 years of
work experience in technical and senior management
positions at AT&T Bell Laboratories, Cisco Systems,
McAfee, and Wipro. He was also a faculty member at Kent
State University in Kent, Ohio, and Xavier Institute of
Management in Bhubaneswar, India. He earned a master’s
degree in electrical and computer engineering (ECE) from Oklahoma State University
and a doctorate degree from the Indian Institute of Technology (IIT). His research
interest is applying AI/machine learning to strengthen an organization’s information
security based on his expertise in information security and machine learning. As a
chief data scientist, he is helping business leaders to make informed decisions and
recommendations linked to the organization’s strategy and financial goals, reflecting an
awareness of external dynamics based on a data-driven approach.
He has published many journal articles in international journals and conference
proceedings. In addition, he has authored books titled Business Analytics Using R:
A Practical Approach and The InfoSec Handbook: An Introduction to Information
Security, published by Springer Apress. Furthermore, Dr. Hodeghatta has contributed
his services to many professional organizations and regulatory bodies. He was an
executive committee member of the IEEE Computer Society (India); academic advisory
member for the Information and Security Audit Association (ISACA); IT advisor for the
government of India; technical advisory member of the International Neural Network
Society (INNS) India; and advisory member of the Task Force on Business Intelligence &
Knowledge Management. He was listed in “Who’s Who in the World” for the years 2012,
2013, 2014, 2015, and 2016. He is also a senior member of the IEEE (USA).
Preface
Business analytics, data science, artificial intelligence (AI), and machine learning (ML)
are hot words right now in the business community. Artificial intelligence and machine
learning systems are enabling organizations to make informed decisions by optimizing
processes, understanding customer behavior, maximizing customer satisfaction, and
thus accelerating overall top-line growth. AI and machine learning help organizations
by performing tasks efficiently and consistently, thus improving overall customer
satisfaction level.
In financial services, AI models are designed to help manage customers’ loans,
retirement plans, investment strategies, and other financial decisions. In the automotive
industry, AI models can help with vehicle design, sales and marketing decisions, customer
safety features based on the customer's driving patterns, recommended vehicle
types for the customer, and more. This has helped automotive companies to predict future
manufacturing resources needed to build, for example, electric and driverless vehicles.
AI models also help them in making better advertisement decisions.
AI can play a big role in customer relationship management (CRM), too. Machine
learning models can predict consumer behavior, start a virtual agent conversation, and
forecast trend analysis that can improve efficiency and response time.
Recommendation systems (AI systems) can learn users’ content preferences and can
select customers’ choice of music, book, game, or any items the customer is planning
to buy online. Recommendation systems can reduce return rates and help create better
targeted content management.
Sentiment analysis using machine learning techniques can predict the opinions and
feelings of users of content. This helps companies to improve their products and services
by analyzing the customers’ reviews and feedback.
These are a few sample business applications, but this can be extended to any
business problem provided you have data; for example, an AI system can be developed
for HR functions, manufacturing, process engineering, IT infrastructure and security,
software development life cycle, and more.
There are several industries that have begun to adopt AI into their business
decision process. Investment in analytics, machine learning, and artificial intelligence
is predicted to triple in 2023, and by 2025, it is predicted to become a $47 billion market
(per International Data Corp.). According to a recent research survey in the United
States, nearly 34 percent of businesses are currently implementing or plan to implement
AI solutions in their business decisions.
Machine learning refers to the learning algorithm of AI systems to make decisions
based on past data (historical data). Some of the commonly used machine learning
methods include neural networks, decision trees, k-nearest neighbors, logistic
regression, cluster analysis, association rules, deep neural networks, hidden Markov
models, and natural language processing. Availability and abundance of data, lower
storage and processing costs, and efficient algorithms have made machine learning and
AI a reality in many organizations.
AI will be the biggest disruptor to the industry in the next five years. This will no doubt
have a significant impact on the workforce. Though many say AI can replace a significant
number of jobs, it can actually enhance productivity and improve the efficiency of workers.
AI systems can help executives make better business decisions and allow businesses to
work on resources and investments to beat the competition. When decision-makers and
business executives make decisions based on reliable data and recommendations arrived
at through AI systems, they can make better choices for their business, investments, and
employees, thus enabling their business to stand out from the competition.
There are currently thousands of jobs posted on job portals in machine learning,
data science, and AI, and it is one of the fastest-growing technology areas, according to
the Kiplinger report of 2017. Many of these jobs are going unfilled because of a shortage
of qualified engineers. Apple, IBM, Google, Facebook, Microsoft, Walmart, and Amazon
are some of the top companies hiring data scientists in addition to other companies
such as Uber, Flipkart, Citibank, Fidelity Investments, GE, and many others including
manufacturing, healthcare, agriculture, and transportation companies. Many open job
positions are in San Jose, Boston, New York, London, Hong Kong, and many other cities.
If you have the right skills, then you can be a data scientist in one of these companies
tomorrow!
A data scientist/machine learning engineer needs to acquire a broad set of skills, spanning business understanding, statistics and mathematics, computer programming, and knowledge of data analysis techniques and tools.
This book aims to cover the skills required to become a data scientist. It enables
you to gain sufficient knowledge and skills to process data and to develop machine
learning models. We have attempted to cover the most commonly used learning
algorithms and to show how to develop models using open-source tools such as R and Python.
Practical Business Analytics Using R and Python is organized into five parts. The
first part covers the fundamental principles required to perform analytics. It starts by
defining the sometimes confusing terminologies that exist in analytics, job skills, tools,
and technologies required for an analytical engineer, before describing the process
necessary to execute AI and analytics projects. The second and subsequent chapters
cover the basics of math, probability theory, and statistics required for analytics,
before delving into SQL, the business analytics process, exploring data using graphical
methods, and an in-depth discussion of how to evaluate analytics model performance.
In Part II, we introduce supervised machine learning models. We start with
regression analysis and then introduce different classification algorithms, including
naïve Bayes, decision trees, logistic regression, and neural networks.
Part III discusses time-series models. We cover the most commonly used models
including ARIMA.
Part IV covers unsupervised learning and text mining. In unsupervised learning,
we discuss clustering analysis and association mining. We end the section by briefly
introducing big data analytics.
In the final part, we discuss the open-source tools, R and Python, and how to use them
to program for analytics. The focus here is on developing sufficient programming skills
to perform analytics.
Source Code
All the source code used in this book can be downloaded from https://github.com/apress/practical-business-analytics-r-python.
Foreword
We live in an era where mainstream media, business
literature, and boardrooms are awash in breathless hype
about the data economy, AI revolution, Industry/Business
4.0, and similar terms that are used to describe a meaningful
social and business inflection point. Indeed, today’s
discourse describes what seems to be an almost overnight
act of creation that has generated a new data-centric
paradigm of societal and business order that, according to
the current narrative, is without precedent and requires
the creation of an entirely new set of practitioners and best
practices out of thin air. In this brave new world, goes the narrative, everyone has already
been reinvented as a “data-something” (data scientist, data analyst, data storyteller, data
citizen, etc.) with the associated understanding of what that means for themselves and
for business. These newly minted “data-somethings” are already overhauling current
practice by operationalizing data-driven decision-making practices throughout their
organizations. They just don’t know it yet.
Look under the covers of most operating organizations, however, and a different
picture appears. There is a yawning chasm of skillsets, common language, and run-
rate activity between those operating in data-centric and data-adjacent roles and those
in many operating functions. As such, in many organizations, data-centric activity is
centered in only a few specialized areas or is performed only when made available as a
feature in the context of commercial, off-the-shelf software (COTS). Similarly, while there
is certainly more data volume and more variation in the said data and all of that varied
data is arriving at a much higher velocity than in past, organizations often do not possess
either the infrastructure sophistication or the proliferation of skillsets to validate whether
the data is any good (veracity of data) and actually use it for anything of value. And with
the exception of a few organizations for which data is their only or primary business,
most companies use data and associated analytic activities as a means to accomplish
their primary business objective more efficiently, not as an end in and of itself.
A similar discontinuity occurred a few decades ago, with the creation of the
Internet. In the years that followed, entire job categories were invented (who needed a
“webmaster” or “e-commerce developer” in 1990?), and yet it took almost a decade and
a half before there was a meaningful common language and understanding between
practitioners and operational business people at many organizations. The resulting
proliferation of business literature tended to focus on “making business people more
technical” so they could understand this strange new breed of practitioners who held
the keys to the new world. And indeed, that helped to a degree. But the accompanying
reality is that the practitioners also needed to be able to understand and communicate in
the language of businesses and contextualize their work as an enabler of business rather
than the point of business.
Academia and industry both struggled to negotiate the balance during the Internet
age and once again struggle today in the nascent data economy. Academia too often
errs on the side of theoretical courses of study that teach technical skills (mathematics,
statistics, computer science, data science) without contextualizing their application
to business, while industry rushes to apply to business problems techniques it lacks
the technical skill to properly understand. In both groups,
technologists become overly wedded to a given platform, language, or technique at
the expense of leveraging the right tool for the job. In all cases, the chasm of skills
and common language among stakeholders often leads to incorrect
conclusions, under-utilized analytics, or both. Neither is a good outcome.
There is space both for practitioners to be trained in the practical application of
techniques to business context and for business people to not only understand more
about the “black box” of data-centric activity but be able to perform more of that
activity in a self-service manner. Indeed, the democratization of access to both
high-powered computing and low-code analytical software environments makes it possible
for a broader array of people to become practitioners, which is part of what the hype is
all about.
Enter this book, which provides readers with a stepwise walk-through of the
mathematical underpinnings of business analytics (important to understand the proper
use of various techniques) while placing those techniques in the context of specific,
real-world business problems (important to understand the appropriate application of
those techniques). The authors (both of whom have longstanding industry experience
together, and one of whom is now bringing that experience to the classroom in a
professionally oriented academic program) take an evenhanded approach to technology
choices by ensuring that currently fashionable platforms such as R and Python are
represented primarily as alternatives that can accomplish equivalent tasks, rather than
endpoints in and of themselves. The stack-agnostic approach also helps readers prepare
as to how they might incorporate the next generation of available technology, whatever
that may be in the future.
If you are a would-be practitioner in business, I urge you to read this book with the
associated business context in mind. Just as with the dawn of the Internet, the true value
of the data economy will only begin to be realized when all the “data-somethings” we
work with act as appropriately contextualized practitioners who use data in the service of
the business of their organizations.
Dan Koloski
Professor of the Practice and Head of Learning Programs
Roux Institute at Northeastern University
October 2022
Dan Koloski is a professor of the practice in the analytics program and director of
professional studies at the Roux Institute at Northeastern University.
Professor Koloski joined Northeastern after spending more than 20 years in the
IT and software industry, working in both technical and business management roles
in companies large and small. This included application development, product
management and partnerships, and helping lead a spin-out and sale from a venture-
backed company to Oracle. Most recently, Professor Koloski was vice president of
product management and business development at Oracle, where he was responsible
for worldwide direct and channel go-to-market activities, partner integrations, product
management, marketing/branding, and mergers and acquisitions for more than
$2 billion in product and cloud-services business. Before Oracle, he was CTO and
director of strategy of the web business unit at Empirix, a role that included product
management, marketing, alliances, mergers and acquisitions, and analyst relations. He
also worked as a freelance consultant and Allaire-certified instructor, developing and
deploying database-driven web applications.
Professor Koloski earned a bachelor’s degree from Yale University and earned his
MBA from Harvard Business School in 2002.
PART I
Introduction to Analytics
CHAPTER 1
An Overview of Business
Analytics
1.1 Introduction
Today’s world is data-driven and knowledge-based. In the past, knowledge was gained
mostly through observation; now, knowledge is secured not only through observation
but also by analyzing data that is available in abundance. In the 21st century, knowledge
is acquired and applied by analyzing data available through various applications,
social media sites, blogs, and much more. The advancement of computer systems
complements knowledge of statistics, mathematics, algorithms, and programming.
Enormous storage and extensive computing capabilities have ensured that knowledge can
be quickly derived from huge amounts of data and be used for many other purposes. The
following examples demonstrate how seemingly obscure or unimportant data can be
used to make better business decisions:
• A hotel in Switzerland welcomes you with your favorite drink and
dish; you are so delighted!
• Based on your daily activities and/or food habits, you are warned
about the high probability of becoming a diabetic so you can take the
right steps to avoid it.
• You enter a grocery store and find that your regular monthly
purchases are already selected and set aside for you. The only
decision you have to make is whether you require all of them or want
to remove some from the list. How happy you are!
There are many such scenarios that are made possible by analyzing data about you
and your activities that is collected through various means—including mobile phones,
your Google searches, visits to various websites, your comments on social media
sites, your activities using various computer applications, and more. The use of data
analytics in these scenarios has focused on your individual perspective. Now, let’s look at
scenarios from a business perspective.
• As a taxi business owner, you are able to repeatedly attract the same
customers based on their travel history and preferences of taxi type
and driver.
All these scenarios are possible by analyzing data that the businesses and others
collect from various sources. There are many such possible scenarios. The application of
data analytics to the field of business is called business analytics.
You have most likely observed the following scenarios:
• You’ve been searching, for the past few days, on Google for
adventurous places to visit. You’ve also tried to find various travel
packages that might be available. You suddenly find that when
you are on Facebook, Twitter, or other websites, they show a
specific advertisement of what you are looking for, usually at a
discounted rate.
All of these possibilities are now a reality because of data analytics specifically used
by businesses.
• This book offers the right mix of theory and hands-on labs. The concepts are
explained using business scenarios or case studies where required.
• Practical insights into the use of data that has been collected,
collated, purchased, or available for free from government sources
or others. These insights are attained via computer programming,
statistical and mathematical knowledge, and expertise in relevant
fields that enable you to understand the data and arrive at predictive
capabilities.
• Practical cases and examples that enable you to apply what you learn
from this book.
1.3 Confusing Terminology
Many terms are used in discussions of this topic—for example, data analytics, business
analytics, big data analytics, and data science. Most of these are, in a sense, the same.
However, the purpose of the analytics, the extent of the data that’s available for analysis, and
the difficulty of the data analysis may vary from one to the other. Finally, regardless of the
differences in terminology, we need to know how to use the data effectively for our businesses.
These differences in terminology should not get in the way of applying techniques to the
data (especially in analyzing it and using it for various purposes, including understanding it,
deriving models from it, and then using these models for predictive purposes).
In layperson’s terms, let’s look at some of this terminology:
Now let’s discuss each of these drivers for business analytics in more detail.
1.5.2 Human Resources
Retention is the biggest problem faced by an HR department in any industry, especially
in the service industry. An HR department can identify which employees have high
potential for retention by processing past employee data. Similarly, an HR department
can also analyze which competence (qualification, knowledge, skill, or training) has the
most influence on the organization’s or team’s capability to deliver quality output within
committed timelines.
1.5.3 Product Design
Product design is not easy and often involves complicated processes. Risks factored in
during product design, subsequent issues faced during manufacturing, and any resultant
issues faced by customers or field staff can be a rich source of data that can help you
understand potential issues with a future design. This analysis may reveal issues with
materials, issues with the processes employed, issues with the design process itself,
issues with the manufacturing, or issues with the handling of the equipment installation
or later servicing. The results of such an analysis can substantially improve the quality of
future designs by any company. Another interesting aspect is that data can help indicate
which design aspects (color, sleekness, finish, weight, size, or material) customers like
and which ones customers do not like.
1.5.4 Service Design
Like products, services are also carefully designed and priced by organizations.
Identifying components of the service (and what are not) also depends on product
design and cost factors compared to pricing. The length of the warranty, coverage during
warranty, and pricing for various services can also be determined based on data from
earlier experiences and from target market characteristics. Some customer regions may
more easily accept “use and throw” products, whereas other regions may prefer “repair
and use” kinds of products. Hence, the types of services need to be designed according
to the preferences of regions. Again, different service levels (responsiveness) may have
different price tags and may be targeted toward a specific segment of customers (for
example, big corporations, small businesses, or individuals).
Having a clear understanding of the problem/data task is one of the most important
requirements. If the person analyzing the data does not understand the underlying
problem or the specific characteristics of the task, then the analysis performed by the
data analytics person can lead to the wrong conclusions or lead the business in the
wrong direction. Also, if an individual does not know the specific domain in which the
problem is being solved, then one should consult the domain expert to perform the
analysis. Not understanding requirements, and just having only programming skills
along with statistical or mathematical knowledge, can sometimes lead to proposing
impractical (or even dangerous) suggestions for the business. These suggestions also
waste the time of core business personnel.
the proper analysis techniques and algorithms to suitable situations or analyses. The
depth of this knowledge may vary with job title and experience. For example, linear
regression or multiple linear regression (supervised method) may be suitable if you
know (based on business characteristics) that there exists a strong relationship between
a response variable and various predictors. Clustering (unsupervised method) can
allow you to cluster data into various segments. Using and applying business analytics
effectively can be difficult without understanding these techniques and algorithms.
Having knowledge of tools is important. Though it is not possible to learn all the
tools that are available, knowing as many of them as possible helps in fetching job interviews. Computer
knowledge is required for a capable data analytics person as well so that there is no
dependency on other programmers who don’t understand the statistics or mathematics
behind the techniques or algorithms. Platforms such as R, Python, and Hadoop have
reduced the pain of learning programming, even though at times we may have to use
other complementary programming languages.
3. Data structures and data storage or data warehousing techniques,
including how to query the data effectively.
Knowledge of data structures and data storage/data warehousing eliminates
dependence on database administrators and database programmers. This enables you
to consolidate data from varied sources (including databases and flat files), arrange
them into a proper structure, and store them appropriately in a data repository required
for the analysis. The capability to query such a data repository is another additional
competence of value to any data analyst.
4. Statistical and mathematical concepts (probability theory,
linear algebra, matrix algebra, calculus, and cost-optimization
algorithms such as gradient descent or ascent algorithms).
Data analytics and data mining techniques use many statistical and mathematical
concepts on which various algorithms, measures, and computations are based. Good
knowledge of statistical and mathematical concepts is essential to properly use the
concepts to depict, analyze, and present the data and the results of the analysis.
Otherwise, wrong interpretations, wrong models, and wrong theories can lead others
in the wrong direction because the technique was applied incorrectly or the results
were misinterpreted.
Statistics contribute to a significant aspect of effective data analysis. Similarly, the
knowledge discovery enablers such as machine learning have contributed significantly
to the application of business analytics. Another area that has given impetus to business
analytics is the growth of database systems, from SQL-oriented ones to NoSQL ones.
All these combined, along with easy data visualization and reporting capabilities, have
led to a clear understanding of what the data tells us and what we understand from
the data. This has led to the vast application of business and data analytics to solve
problems faced by organizations and to drive a competitive edge in business through the
application of this understanding.
There are umpteen tools available to support each piece of the business analytics
framework. Figure 1-1 presents some of these tools, along with details of the typical
analytics framework.
The steps of the typical AI and business analytics project process are as follows:
2. Study the data and data types; preprocess the data; and clean up
missing values and any other data errors.
3. Check for the outliers in the data and remove them from the data
set to reduce their adverse impact on the analysis.
1.8 Chapter Summary
In this chapter, you saw how knowledge has evolved. You also looked at many scenarios
in which data analytics helps individuals. The chapter included many examples of
business analytics helping businesses to grow and compete effectively. You were also
provided with examples of how business analytics results are used by businesses
effectively.
You briefly went through the skills required for a business analyst. In particular, you
understood the importance of the following: understanding the business and business
problems, data analysis techniques and algorithms, computer programming, data
structures and data storage/warehousing techniques, and statistical and mathematical
concepts required for data analytics.
Finally, we briefly explained the process for executing an analytics project.
CHAPTER 2
The Foundations
of Business Analytics
Uncertainty and randomness are bound to exist in most business decisions. Probability
quantifies the uncertainty that we encounter every day. This chapter discusses the
fundamentals of statistics, such as mean, variance, standard deviation, probability
theory basics, types of probability distributions, and the difference between population
and sample, which are essential for any analytics modeling. We will provide
demonstrations using both Python and R.
2.1 Introduction
We all have studied statistics at some point of time in our education. However, we may
never have gained a true appreciation of why applying some of that statistical knowledge
is important. In the context of data and business analytics, knowledge of statistics
can provide insight into characteristics of a data set you have to analyze that will help
you determine the right techniques and methods to be employed for further analysis.
There are many terms in statistics such as mean, variance, median, mode, and standard
deviation, among others. We will try to provide a context for these terms with a simple
example from our daily lives before explaining the terms from a business angle. Further,
we will cover the basics of probability theory and different probability distributions and
why they are necessary for business data analytics.
Imagine you are traveling and have just reached the bank of a muddy river, but there
are no bridges or boats or anyone to help you to cross the river. Unfortunately, you do
not know how to swim. When you look around in this confused situation where there is no
help available to you, you notice a sign, as shown in Figure 2-1.
The sign says, “The mean depth of the river is 4 feet.” Say this value of mean is
calculated by averaging the depth of the river at each square-foot area of the river. This
leads us to the following question: “What is average or mean?” Average or mean is the
quantity arrived at by summing up the depth at each square foot and dividing this sum
by the number of measurements (i.e., number of square feet measured).
Your height is 6 feet. Does Figure 2-1 provide enough information for you to attempt
to cross the river by walking? If you say “yes,” definitely I appreciate your guts. I would
not dare to cross the river because I do not know whether there is any point where the
depth is more than my height. If there are points with depths like 7 feet, 8 feet, 10 feet, or
12 feet, then I will not dare to cross as I do not know where these points are, and at these
points I am likely to drown.
Suppose the sign also says “Maximum depth is 12ft and minimum depth is 1ft” (see
Figure 2-2). I am sure this additional information will scare you since you now know that
there are points where you can get drowned. Maximum depth is the measure at one or
more points that are the largest of all the values measured. Again, with this information
you may not be sure that the depth of 12 feet is at one point or at multiple points.
Minimum sounds encouraging (this is the lowest of the values observed) for you to cross
the river, but again you do not know whether it is at one point or multiple points.
Figure 2-2. The sign indicating mean, maximum, and minimum depths
Suppose, in addition to the previous information, that the sign (shown in Figure 2-3)
also says “Median of the depth is 4.5ft.” Median is the middle point of all the measured
depths if all the measured depths are arranged in ascending order. This means 50
percent of the depths measured are less than this, and also 50 percent of the depths
measured are above this. You may not still dare to cross the river as 50 percent of the
values are above 4.5 feet and the maximum depth is 12 feet.
Figure 2-3. The sign indicating mean, maximum, minimum, and median depths
Suppose, in addition to the previous information, that the sign (shown in Figure 2-4)
also says “Quartile 3 is 4.75ft.” Quartile 3 is the point below which 75 percent of the
measured values fall when the measured values are arranged in ascending order. This
also means there are 25 percent of the measured values that have greater depth than this.
You may not be still comfortable crossing the river as you know the maximum depth is
12 feet and there are 25 percent of the points above 4.75 feet.
Suppose, in addition to the previous information, that the sign (shown in Figure 2-5)
also says “Percentile 90 is 4.9ft and percentile 95 is 5ft.” Suppose this is the maximum
information available. You now know that only 5 percent of the measured points are of
depth more than 5 feet. You may now want to take a risk if you do not have any other
means other than crossing the river by walking or wading through as now you know that
there are only 5 percent of the points with depth more than 5 feet. Your height is 6 feet.
You may hope that the 98th or 99th percentile may still be only 5.5 feet. You may now believe that the
maximum points may be rare and you can, by having faith in God, cross the river safely.
In spite of the previous cautious calculations, you may still drown if you reach rare
points of depth of more than 6 feet (like the maximum point of depth). But, with the
foregoing information, you know that your risk is substantially less compared to your risk
at the initial time when you had only limited information (that the mean depth of the
river is 4 feet).
This is the point we wanted to make through the river analogy: with one single
parameter of measurement, you may not be able to describe the situation clearly and
may require more parameters to elaborate the situation. Each additional parameter
calculated may increase the clarity required to make decisions or to understand the
phenomenon clearly. Again, another note of caution: there are many other parameters
than the ones discussed earlier that are of interest in making decisions or understanding
any situation or context.
Statistical parameters such as mean or average, median, quartile, maximum,
minimum, range, variance, and standard deviation describe the data very clearly. As
shown in the example discussed earlier, one aspect of the data may not provide all the
clarity necessary, but many related parameters together provide better clarity with regard
to the data, the situation, or the context.
Later in this chapter, we will discuss how to calculate all these parameters using R as
well as Python. Before that, we need to understand the important aspect—the meaning
of population and sample.
However, when we have to analyze the data, it is difficult to get the entire population,
especially when the data size is enormous. This is because:
• It is not always possible to gather the data of the entire population. For
example, in the previous example, how can we measure the volume
of the entire river by taking every square inch of the river flow? It is
practically not possible. Similarly, in many business situations, we
may not be able to acquire the data of the entire population.
• It takes substantial time to process the data, and the time taken to
analyze it may be prohibitively high relative to how quickly the results
are needed for the application. For example, if the entire transaction
data related to all the purchases of all the users has to be analyzed
before you recommend a particular product to a user, the amount
of processing time taken may be so huge that you may miss the
opportunity to suggest the product to the user who has to be provided
the suggestions quickly when he is in session on the Internet.
2.2.2 Sample
In simple terms, sample means a section or subset of the population selected for
analysis. Examples of samples are the following: randomly selected 100,000 employees
from the entire IT industry or randomly selected 1,000 employees of a company or
randomly selected 1,000,000 transactions of an application or randomly selected
10,000,000 Internet users or randomly selected 5,000 users each from each ecommerce
site, and so on. Sample can also be selected using stratification (i.e., based on some rules
of interest). For example, all the employees of the IT industry whose income is greater
than $100,000 or all the employees of a company whose salary is greater than $50,000
or the top 100,000 transactions by amount per transaction (e.g., minimum $1,000 per
transaction or all Internet users who spend more than two hours per day, etc.).
Several sampling techniques are available to ensure that the data integrity is
maintained and the conclusions and hypothesis can be applied to the population.
Several sampling techniques have been in practice since the old days of statistics and
research design. We will briefly mention them without going into the details as they are
covered extensively in many statistics books. The popular sampling methods used in
data analytics include simple random sampling, systematic sampling, stratified sampling, and cluster sampling.
However, of late, we have higher computing power at our hands because of cloud
technologies and the possibility to cluster computers for better computing power.
Though such large computing power allows us, in some cases, to use the entire
population for analysis, sampling definitely helps carry out the analysis relatively easily
and faster in many cases. However, sampling has a weakness: if the samples are not
selected properly, then the analysis results may be wrong. For example, for analyzing the
data for the entire year, only this month’s data is taken. This sample selection may not
give the required information as to how the changes have happened over the months.
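As a simple illustration of drawing a simple random sample in code, the following minimal Python sketch uses the standard random module; the transaction amounts here are hypothetical, generated purely for illustration.

import random

random.seed(42)  # fixed seed so this illustration is reproducible

# Hypothetical population: amounts (in dollars) of 100,000 transactions
population = [round(random.uniform(10, 5000), 2) for _ in range(100_000)]

# Simple random sampling: pick 1,000 transactions without replacement
sample = random.sample(population, k=1_000)

print(len(sample))   # 1000
print(sample[:5])    # a peek at the first few sampled transactions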
2.3.1 Mean
Mean is also known as average in general terms. If we have to summarize any data set
quickly, then the common measure used is mean. Some examples of the usage of the
mean are the following:
• For a business, mean profitability over the last five years may be one
good way to represent the organization’s profitability in order to
judge its success.
• For a country, mean gross domestic product (GDP) over the last five
years may be a good way to represent the health of the economy.
• For a business, mean growth in sales or revenue over a period of the
last five years may be a good way to represent growth.
Normally, a mean or average figure gives a sense of what the figure is likely to be for
the next year based on the performance for the last number of years. However, there are
limitations of using or relying only on this parameter.
Let’s look at a few examples to understand more about using mean or average:
> #GoodLuck Co. Pvt. Ltd - Profit figures for last 5 years
> Year1Prof<-1000000
> Year2Prof<-750000
> Year3Prof<-600000
> Year4Prof<-500000
> Year5Prof<-500000
> # To calculate the mean or average profit, you need to
> # sum all the 5 years' profits and divide by the number of years
> SumYrsProfs<-Year1Prof+Year2Prof+Year3Prof+Year4Prof+Year5Prof
> MeanProf<- SumYrsProfs/5
> MeanProf
[1] 670000
>
Year1Profit = 1000000
Year2Profit = 750000
Year3Profit = 600000
Year4Profit = 500000
Year5Profit = 500000
# Sum the five yearly profits before dividing by the number of years
TotalProfit = Year1Profit + Year2Profit + Year3Profit + Year4Profit + Year5Profit
MeanProfit = TotalProfit/5
print(MeanProfit)
670000.0
Figure 2-9. Alternative and simple way for calculating mean in Python
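A minimal sketch of such an alternative, assuming the same five profit values, is to keep the profits in a list and use sum() and len(), or the statistics module's mean() function:

import statistics

profits = [1000000, 750000, 600000, 500000, 500000]

print(sum(profits) / len(profits))   # 670000.0
print(statistics.mean(profits))      # 670000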
There could be many different ways in any programming language to perform a task.
The mean for the previous example is calculated using R and Python in a couple ways in
Figures 2-6, 2-7, 2-8, and 2-9.
Similarly, for the other examples we can work out the mean or average value if we
know the individual figures for the years.
The problem with the mean or average as a single parameter is as follows:
• Any extreme high or low figure in one or more of the years can skew
the mean, and thus the mean may not appropriately represent the
likely figure next year. For example, consider that there was very
high profit in one of the years because of a volatile international
economy that led to severe devaluation of the local currency. Profits
for five years of a company were, respectively, €6,000,000; €4,000,000;
€4,500,000; €4,750,000; and €4,250,000. The first-year profit of
€6,000,000 was on account of steep devaluation of the euro in the
international market. If the effective value of profit without taking
into consideration devaluation during the first year is €4,000,000,
then the mean profit is overstated by €400,000 on account of the increased
first-year profit (an actual mean of €4,700,000 versus an effective mean of
€4,300,000), as shown in Figure 2-10.
Figure 2-10. Actual mean profit and effective mean profit example
• Using mean or average alone will not show the volatility in the
figure over the years effectively. Also, mean or average does not
depict the trend as to whether it is decreasing or increasing. Let’s
take an example. Suppose the revenue of a company over the last
five years is, respectively, $22,000,000; $15,000,000; $32,000,000;
$18,000,000; and $10,000,000. The average revenue of the last five
years is $19,400,000. If you notice the figures, the revenue is quite
volatile; that is, compared to the first year, it decreased significantly in
the second year, jumped up by a huge number during the third year,
then decreased significantly during the fourth year, and continued
to decrease further significantly during the fifth year. The average
or mean figure depicts neither this volatility in revenue nor the
downward trend in revenue. Figure 2-11 shows this downside
of mean as a measure.
2.3.2 Median
Median is the middle value when the values are arranged in either ascending or
descending order. In many circumstances, median may be more representative than
mean. It clearly divides the data set at the middle into two equal partitions; that is, 50
percent of the values will be below the median, and 50 percent of the values will be
above the median. Examples are as follows:
Let us consider the age of 20 workers in an organization as 18, 20, 50, 55, 56, 57, 58,
47, 36, 57, 56, 55, 54, 37, 58, 49, 51, 54, 22, and 57. From a simple examination of these
figures, you can make out that the organization has more aged workers than youngsters
and there may be an issue of knowledge drain in a few years if the organizational
retirement age is 60. Let us also compare mean and median for this data set. The
following figure shows that 50 percent of the workers are above 54 years of age and are
likely to retire early (i.e., if we take 60 years as retirement age, they have only 6 years to
retirement), which may depict the possibility of significant knowledge drain. However,
if we use the average figure of 47.35, it shows a better situation (i.e., about 12.65 years to
retirement). But, it is not so if we look at the raw data: 13 of the 20 employees are already
at the age of 50 or older, which is of concern to the organization. Figure 2-12 shows a
worked-out example of median using R.
We repeat the same in Python, but we use the statistics library to calculate the
median, as shown in Figure 2-13.
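A minimal Python sketch of what Figure 2-13 illustrates, assuming the WorkAge values as printed in the R session shown later in this chapter, is the following:

import statistics

# WorkAge data as printed in the R output later in this chapter
WorkAge = [18, 20, 50, 56, 57, 58, 47, 36, 57, 56,
           55, 54, 37, 58, 49, 51, 54, 22, 57]

print(statistics.median(WorkAge))   # 54, the middle value of the sorted ages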
However, if the 10th value had been 54 and the 11th value had been 55, respectively,
then the median would have been (54+55)/2, i.e., 54.5.
Let us take another example of the productivity of a company. Let the productivity per
day in terms of items produced per worker be 20, 50, 55, 60, 21, 22, 65, 55, 23, 21, 20, 35, 56,
59, 22, 23, 25, 30, 35, 41, 22, 24, 25, 24, and 25, respectively. The median productivity is 25
items per day, which means that there are 50 percent of the workers in the organization
who produce less than 25 items per day, and there are 50 percent of the employees who
produce more than 25 items per day. Mean productivity is 34.32 items per day because
some of the workers have significantly higher productivity than the median worker, which
is evident from the productivity of some of the workers; that is, 65 items per day, 60 items
per day, 59 items per day, 56 items per day, 55 items per day, 55 items per day, etc. The
analysis from R in Figure 2-14 clearly shows the difference between mean and median.
> ProdWorkDay<- c(20, 50, 55, 60, 21, 22, 65, 55, 23, 21, 20, 35,
56, 59, 22, 23, 25, 30, 35, 41, 22, 24, 25, 24,25)
> MedProd<-median(ProdWorkDay)
> MedProd
[1] 25
> MeanProd<-mean(ProdWorkDay)
> MeanProd
[1] 34.32
>
If you have to work out median through manual calculations, you have to arrange
the data points in ascending or descending order and then select the value of the middle
term if there are an odd number of values. If there are an even number of values, then
you have to sum up the middle two terms and then divide the sum by 2 as mentioned in
the previous discussions.
If you notice from the previous discussion, instead of only mean or median alone,
looking at both mean and median gives a better idea of the data.
2.3.3 Mode
Mode is the data point in the data set that occurs the most. For example, in our data set
related to the age of workers, 57 occurs the maximum number of times (i.e., three times).
Hence, 57 is the mode of the workers’ age data set. This shows the pattern of repetition in
the data.
There is no built-in function in R to compute mode. Hence, we have written a
function and have computed the mode as shown in Figure 2-15. We have used the same
data set we used earlier (i.e., WorkAge).
> ##MODE
> WorkAge
[1] 18 20 50 56 57 58 47 36 57 56 55 54 37 58 49 51 54 22
[19] 57
> # We are creating a function by name CalMode to calculate mode
> # This function is used to compute the highest number of occurrences
> # of the same term
> CalMode<- function(dataset)
+ {
+ UniDataSet <-unique(dataset)
+ UniDataSet[which.max(tabulate(match(dataset,UniDataSet)))]
+ }
> #Using CalMode function on WorkAge data
> CalMode(WorkAge)
[1] 57
>
In the previous function, unique() creates a set of unique numbers from the data set.
In the case of the WorkAge example, the unique numbers are 18, 20, 50, 55, 56, 57, 58, 47,
36, 54, 37, 49, 51, and 22. The match() function matches the numbers between the ones in
the data set and the unique numbers set we got and provides the position of each unique
number in the original data set. The function tabulate() returns the number of times
each unique number occurs in the data set. The function which.max() returns the
position of the maximum times repeating number in the unique numbers set.
In Python, the statistics library provides the mode() function to calculate the
mode of the given data set. We can also use the max() function with the key argument, as
shown in Figure 2-16. Some of the functions of R are not directly applicable in Python.
The purpose of the previous description is to demonstrate the concepts.
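A minimal Python sketch of what Figure 2-16 illustrates, assuming the same WorkAge data, might look like this:

import statistics

WorkAge = [18, 20, 50, 56, 57, 58, 47, 36, 57, 56,
           55, 54, 37, 58, 49, 51, 54, 22, 57]

# statistics.mode() returns the most frequently occurring value
print(statistics.mode(WorkAge))               # 57

# Alternative: max() with a key argument that counts occurrences
print(max(set(WorkAge), key=WorkAge.count))   # 57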
2.3.4 Range
The range is a simple but essential statistical parameter. It depicts the distance between
the end points of the data set arranged in ascending or descending order (i.e., between
the maximum value in the data set and the minimum value in the data set). This
provides the measure of overall dispersion of the data set.
The R command range(dataset) provides the minimum and maximum values (see
Figure 2-17) on the same data set used earlier (i.e., WorkAge).
> ##Range
> RangeWorkAge<-range(WorkAge)
> RangeWorkAge
[1] 18 58
>
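R's range() returns the minimum and maximum together; the Python standard library has no single equivalent, so a minimal sketch on the same data uses min() and max():

WorkAge = [18, 20, 50, 56, 57, 58, 47, 36, 57, 56,
           55, 54, 37, 58, 49, 51, 54, 22, 57]

print(min(WorkAge), max(WorkAge))    # 18 58, matching R's range() output
print(max(WorkAge) - min(WorkAge))   # 40, the overall spread of the ages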
2.3.5 Quantiles
Quantiles are also known as percentiles. Quantiles divide the data set arranged in
ascending or descending order into equal partitions. The median is nothing but the
data point dividing the data arranged in ascending or descending order into two sets of
equal number of elements. Hence, it is also known as the 50th percentile. On the other
hand, quartiles divide the data set arranged in ascending order into four sets of equal
number of data elements. The first quartile (also known as Q1 or as the 25th percentile)
will have 25 percent of the data elements below it and 75 percent of the data elements
above it. The second quartile (also known as Q2 or the 50th percentile or median) will
have 50 percent of the data elements below it and 50 percent of the data elements above
it. The third quartile (also known as Q3 or the 75th percentile) has 75 percent of the data
elements below it and 25 percent of the data elements above it. Quantile is a generic
word, whereas quartile is specific to a particular percentile. For example, Q1 is the 25th
percentile. Quartile 4 is nothing but the 100th percentile.
Quantiles, quartiles, or percentiles provide us with the information that the mean
is not able to provide us. In other words, quantiles, quartiles, or percentiles provide us
additional information about the data set in addition to mean.
Let us take the same two data sets as given in the section “Median” and work
out the quartiles. Figures 2-19A and 2-19B show the working of the quartiles. In the
following code, we have the data in an array called WorkAge. We call the built-in function
quantile() in R to find out the different quantiles. In this example, we want the
“quartile” and hence set the value to 0.25. By setting the value of the probs parameter,
you can decide how you want the data ratios to be split.
> ##Quartiles
> # Sam&George LLP
> # Data of Employee Age of 20 workers
> WorkAge
[1] 18 20 50 56 57 58 47 36 57 56 55 54 37 58
[15] 49 51 54 22 57
> # Let us calculate the quartiles.
> # However, there is no function in R like quartile()
> # Instead, we use quantile() only
> QuartWorkAge<-quantile(WorkAge, probs = seq(0,1,0.25))
> QuartWorkAge
0% 25% 50% 75% 100%
18.0 42.0 54.0 56.5 58.0
> #Now let us calculate median using the median()
function as we used earlier
> MedWorkAge<-median(WorkAge)
> MedWorkAge
[1] 54
> #Now let us calculate the median of the work age
using the quantile() function
> MediWorkAge<-quantile(WorkAge, probs=0.50)
> MediWorkAge
50%
54
>
Similarly, you can divide the data set into 20 sets of equal number of data elements
by using the quantile function with probs = seq(0, 1, 0.05), as shown in Figure 2-20.
Figure 2-20. Partitioning the data into a set of 20 sets of equal number of data
elements
As you can observe from Figure 2-20, the minimum value of the data set is seen at
the 0 percentile, and the maximum value of the data set is seen at the 100 percentile. As
you can observe, typically between each 5 percentiles you can see one data element.
We use the statistics module in Python to calculate the quartiles. The
statistics.quantiles() function returns the quantiles of the data that
correspond to the number n set in the function. The function returns the corresponding
n-1 quantiles. For example, if n is set to 10 for deciles, the statistics.quantiles()
method will return 10-1=9 cut points of equal intervals, as shown in Figure 2-21.
Figure 2-21. Partitioning the data into equal number of data elements
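A minimal Python sketch of what Figure 2-21 illustrates, assuming the same WorkAge data, is shown here. Note that statistics.quantiles() uses a different interpolation method by default than R's quantile(), so the cut points may differ slightly.

import statistics

WorkAge = [18, 20, 50, 56, 57, 58, 47, 36, 57, 56,
           55, 54, 37, 58, 49, 51, 54, 22, 57]

# n=4 returns the 3 quartile cut points (25th, 50th, and 75th percentiles)
print(statistics.quantiles(WorkAge, n=4))

# n=10 returns 10 - 1 = 9 decile cut points of equal intervals
print(statistics.quantiles(WorkAge, n=10))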
As evident from this discussion, quartiles and various quantiles provide additional
information about the data distribution in addition to the information provided by mean
or median (even though median is nothing but the second quartile).
2.3.6 Standard Deviation
The measures mean and median depict the center of the data set, or distribution. On the
other hand, standard deviation specifies the spread of the data set or data values.
The standard deviation is manually calculated as follows:
1. First, the mean of all the values is calculated.
2. Then the distance of each value from the mean is calculated (this
is known as the deviation).
3. Each deviation is squared.
4. The squared deviations are summed up and divided by the number of values (or by the
number of values minus one, in the case of a sample).
5. The square root of this result is the standard deviation.
The squaring in step 3 is required to understand the real spread of the data as the
negatives and positives in the data set compensate for each other or cancel out the effect
of each other when we calculate or arrive at the mean.
Let us take the age of the workers example shown in Figure 2-22 to calculate the
standard deviation.
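A minimal Python sketch of the same manual calculation, applied to the WorkAge1 data used later in this chapter and using the sample form (n - 1 in the denominator, which is also what R's sd() uses), is the following:

import math

WorkAge1 = [18, 20, 50, 55, 56, 57, 58, 47, 36, 57,
            56, 55, 54, 37, 58, 49, 51, 54, 22, 57]

mean_age = sum(WorkAge1) / len(WorkAge1)              # step 1: the mean
deviations = [x - mean_age for x in WorkAge1]         # step 2: deviation of each value
squared = [d ** 2 for d in deviations]                # step 3: square the deviations
variance = sum(squared) / (len(WorkAge1) - 1)         # step 4: average them (n - 1 for a sample)
std_dev = math.sqrt(variance)                         # step 5: take the square root

print(std_dev)   # about 13.32, matching R's sd() on the same data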
In R, this calculation can be done easily through the simple command sd(dataset).
Figure 2-23 shows the example of finding standard deviation using R.
Similarly, the standard deviation can be calculated using the stdev() function of the
statistics library, as shown in Figure 2-24.
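A minimal Python sketch of what Figure 2-24 illustrates, assuming the same WorkAge1 data, would be:

import statistics

WorkAge1 = [18, 20, 50, 55, 56, 57, 58, 47, 36, 57,
            56, 55, 54, 37, 58, 49, 51, 54, 22, 57]

# statistics.stdev() computes the sample standard deviation (n - 1 in the
# denominator), which is what R's sd() returns as well
print(statistics.stdev(WorkAge1))   # about 13.32

For the population standard deviation (dividing by n rather than n - 1), the statistics module provides pstdev() instead.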
Normally, as per the rules of the normal curve (a data set that consists of a large
number of items is generally said to have a normal distribution or normal curve):
• +/- 1 standard deviation denotes that about 68 percent of the data falls
within it.
• +/- 2 standard deviations denote that about 95 percent of the data falls
within it.
• +/- 3 standard deviations denote that 99.7 percent of the data falls
within it.
In total, around 99.7 percent of the data will be within +/- 3 standard deviations.
As you can see from Figure 2-25, in the case of normally distributed data (where
the number of data points is typically greater than 30; the more, the better), it is observed
that about 68 percent of the data falls within +/- one standard deviation from the center
of the distribution (i.e., mean). Similarly, about 95 percent (or around 95.2 percent,
as shown in Figure 4-17B) of the data values fall within +/- two standard deviations
from the center. About 99.7 percent of the data values fall within +/- three standard
deviations from the center. A curve shown here is known as typically a bell curve or
normal distribution curve. For example, profit or loss of all the companies in a country is
normally distributed around the center value (i.e., mean of the profit or loss).
Figure 2-25. Bell curve (probability density against z-score) showing data coverage
within various standard deviations: the middle value of a normal distribution is the
mean, the width of the bell curve is defined by the standard deviation, and 68.2 percent,
95.4 percent, and 99.7 percent of the values lie within one, two, and three standard
deviations of the mean, respectively
The higher the standard deviation, the greater the spread from the mean; that is,
the data points vary significantly from each other, which shows the heterogeneity of
the data. The lower the standard deviation, the smaller the spread from the mean; that
is, the data points vary less from each other, which shows the homogeneity of the data.
The standard deviation, along with other measures such as the mean, median,
quartiles, and percentiles, gives us substantial information about the data and explains
the data more effectively.
2.3.7 Variance
Variance is another way of depicting the spread. In simple terms, it is the square of the
standard deviation, as shown in Figure 2-26; it expresses the spread in terms of the
squared deviations from the mean value. We continue to use the WorkAge data set used
earlier in this chapter.
> ##Variance
> WorkAge1
[1] 18 20 50 55 56 57 58 47 36 57 56 55 54 37 58
[16] 49 51 54 22 57
> WorkAgeStdDev<-sd(WorkAge1)
> WorkAgeStdDev
[1] 13.32301
> WorkAgeVar<-var(WorkAge1)
> WorkAgeVar
[1] 177.5026
> WorkAgeStdDev*WorkAgeStdDev
[1] 177.5026
>
2.3.8 Summary Command in R
The command summary(dataset) provides the following information on the data
set, which covers most of the statistical parameters discussed: the minimum value,
first quartile, median (i.e., the second quartile), mean, third quartile, and maximum
value. This is an easy way of getting the summary information through a single
command (see Figure 2-28, which shows a screenshot from R).
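For example, running summary() on the WorkAge1 vector shown earlier prints its Min., 1st Qu., Median, Mean, 3rd Qu., and Max. values:
> summary(WorkAge1)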
After using the summary(dataset) command, you can, if required, use additional
commands such as sd(dataset) and var(dataset) to obtain the additional parameters of
interest related to the data.
2.4 Probability
The concepts of probability and the related distributions are as important to business
analytics as they are to the field of pure statistics. Some of the important techniques used in
business analytics, such as Bayesian methods and decision trees, are based on the
concepts of probability.
As you are aware, probability in simple terms is the chance of an event happening.
In some cases, we may have some prior information related to the event; in other cases,
the event may be random, that is, we may not have prior knowledge of the outcome. A
popular way to describe probability is with the example of tossing a coin or rolling a die.
A coin has two sides, and when it is tossed, the probability of either the head or the tail
coming up is 1/2, because in any toss only the head or the tail can come up. You can
validate this by tossing the coin many times and observing that the proportion of heads
(or tails) is around 50 percent (i.e., 1/2). Similarly, the probability of any one particular
number being rolled on a die is 1/6, which again can be validated by rolling the die
many times.
If an event cannot happen, its probability is 0. If an event is sure to happen, its
probability is 1. The probability of any event is always between 0 and 1 and depends
upon the chance of it happening or the uncertainty associated with its happening.
Mathematically, the probability of any event “e” is the ratio of the number of
outcomes favorable to the event to the total number of possible outcomes. It is denoted as P(e).
P(e) = n/N, where “n” is the number of outcomes favorable to the event and “N” is the
total number of possible outcomes.
Any two or more events can happen independently of each other. Similarly, any
two or more events can happen exclusively of each other.
Example 1: Can you travel at the same time to two destinations in opposite
directions? If you travel toward the west, you can’t travel toward the east at the same time.
Example 2: If we are making a profit in one of the client accounts, we cannot at the
same time be making a loss in the same account.
Examples 1 and 2 are types of events where the happening of one event excludes the
happening of the other; they are known as mutually exclusive events.
Example 3: A person tossing a coin and it raining can happen at the same time, but
neither impacts the outcome of the other.
Example 4: A company may make a profit and at the same time have legal issues. One
event (making a profit) does not have an impact on the other event (having legal issues).
Examples 3 and 4 are types of events that do not impact the outcome of each
other; they are known as mutually independent events. They are also examples of
mutually nonexclusive events, as both outcomes can happen at the same time.
2.4.2 Probability Distributions
Random variables are important in analysis. Probability distributions depict the
distribution of the values of a random variable. The distributions can help in selecting
the right algorithms, and hence plotting the distribution of the data is an important part
of the analytical process; this is performed as a part of exploratory data analysis (EDA).
The following are some important probability distributions:
• Normal distribution
• Binomial distribution
• Poisson distribution
• Uniform distribution
• Chi-squared distribution
• Exponential distribution
We will not be discussing all of these. There are many more types of distributions
possible including F-distribution, hypergeometric distribution, joint and marginal
probability distributions, and conditional distributions. We will discuss only normal
distribution, binomial distribution, and Poisson distribution in this chapter. We will
discuss more about these distributions and other relevant distributions in later chapters.
2.4.2.1 Normal Distribution
A data set is considered to be normally distributed if its values are distributed
symmetrically around the mean, as shown in Figure 2-29. Normal distribution is
observed in many real-life situations. On account of the bell shape of the distribution,
the normal distribution is also called the bell curve. The properties of the normal
distribution, typically having 68 percent of the values within +/- 1 standard deviation,
95 percent of the values within +/- 2 standard deviations, and 99.7 percent of the values
within +/- 3 standard deviations, are heavily used in most analytical techniques; these
are also the properties of the standard normal curve. The standard normal curve has a
mean of 0 and a standard deviation of 1. The z-score, used to normalize the values of the
features in a regression, is based on the concept of the standard normal distribution. The
normal distribution is a bell-shaped curve, and probabilities under it are computed in R
using the pnorm() function.
Please note that we are interested in the upper tail as we want to know the percentage
of employees who have received a grade of 4 or 5. The answer here is 25.25 percent.
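The exact inputs of that computation are not reproduced here, but one parameter combination consistent with the quoted 25.25 percent is a mean grade of 3.5 with a standard deviation of 0.75 (both assumptions); the upper-tail probability is then obtained in R as follows:
> pnorm(4, mean = 3.5, sd = 0.75, lower.tail = FALSE)   # approximately 0.2525, i.e., 25.25 percent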
2.4.2.2 Binomial Distribution
The binomial distribution typically arises where each trial is measured as a success or a
failure, as shown in Figure 2-31. In a cricket match, tossing a coin is an important event at
the beginning of the match to decide which side bats (or fields) first. Tossing a coin and
calling “head” wins you the toss if “head” is the outcome. Otherwise, if “tail” is the
outcome, you lose the toss.
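As a small R sketch (the numbers are chosen purely for illustration), the probability of getting exactly 6 heads in 10 tosses of a fair coin, and of getting at most 6 heads, can be computed as follows:
> dbinom(6, size = 10, prob = 0.5)   # P(exactly 6 heads), about 0.205
> pbinom(6, size = 10, prob = 0.5)   # P(6 heads or fewer), about 0.828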
2.4.2.3 Poisson Distribution
The Poisson distribution represents the number of independent events happening in a
fixed time interval. The arrival of calls at a call center, the arrival of customers in a banking
hall, or the arrival of passengers at an airport or bus terminus follows a Poisson distribution.
Please note that we have used lower = FALSE (i.e., lower.tail = FALSE) as we are
interested in the upper tail: we want to know the probability of 26 or more customers
arriving at the bank’s branch per hour. The answer here is 11.22 percent.
A typical Poisson distribution may look like the one shown in Figure 2-34.
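The exact inputs of that computation are not reproduced here, but assuming an average arrival rate of 20 customers per hour (an assumption consistent with the 11.22 percent quoted above), the upper-tail probability can be computed in R as follows:
> ppois(25, lambda = 20, lower.tail = FALSE)   # probability of 26 or more arrivals, close to 11.22 percent for this lambda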
2.4.3 Conditional Probability
The conditional probability is the probability of event A occurring given that event B
has already occurred. It is the probability of one event occurring with some relationship
with the other events. This conditional probability is written as P(A|B), meaning the
probability of A given B has occurred.
If A and B are both independent events (where event A occurring has no impact on
event B occurring), then the conditional probability of (A | B) is the probability of event
A, P(A).
P(A|B) = P(A)
If events A and B are not independent,
P(A|B) = P(A AND B) / P(B)
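For instance (a purely hypothetical illustration, not from this chapter’s data): if 40 percent of customers buy product B, and 10 percent of customers buy both product A and product B, then P(A|B) = 0.10 / 0.40 = 0.25; that is, a customer who has bought B has a 25 percent chance of also buying A.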
Conditional probability has many areas of application and is used quite
frequently in analytics algorithms. For example, the enrollment of students in a specific
university depends on the university’s program ranking as well as its tuition fees.
Similarly, the report of rain in your area announced by a news or radio channel depends
on many other conditional factors.
> ##
> ## Reading data with three columns:
> # Employee name, employee age, and employee salary
> # Reading the data into an R data frame
> EmpData = read.csv("empdata.csv")
> #printing the contents of the dataframe
> EmpData
ID EmpName EmpAge EmpSal
1 1 John 18 18000
2 2 Craig 28 28000
3 3 Bill 32 32000
4 4 Nick 42 42000
5 5 Umesh 50 50000
6 6 Rama 55 55000
7 7 Ken 57 57000
8 8 Zen 58 58000
9 9 Roberts 59 59000
10 10 Andy 59 59000
> # Displaying summary statistics of the data frame
> summary(EmpData)
ID EmpName EmpAge EmpSal
Min. : 1.00 Andy :1 Min. :18.00 Min. :18000
1st Qu.: 3.25 Bill :1 1st Qu.:34.50 1st Qu.:34500
Median : 5.50 Craig :1 Median :52.50 Median :52500
Mean : 5.50 John :1 Mean :45.80 Mean :45800
3rd Qu.: 7.75 Ken :1 3rd Qu.:57.75 3rd Qu.:57750
Max. :10.00 Nick :1 Max. :59.00 Max. :59000
(Other):4
>
As you can see in the figure, the command summary(dataset) can be used here also
to obtain the summary information pertaining to each feature (i.e., the data in each
column).
You can now compute any additional information required (as shown in
Figure 2-36).
> #Computing standard deviation of Employee Age
> StdDevEmpAge<-sd(EmpData$EmpAge)
> StdDevEmpAge
[1] 14.97999
> #Computing standard deviation of Employee salary
> StdDevEmpSal<-sd(EmpData$EmpSal)
> StdDevEmpSal
[1] 14979.99
>
import pandas as pd
df = pd.read_csv("empdata.csv", sep=",")
df
Output:
The statistics of the data frame can be obtained using the describe() function, as
shown in Figure 2-39 and Figure 2-40.
df.describe()
2.6 Scatter Plot
Scatter plots are an important kind of plot in the analysis of data. These plots depict
the relationship between two variables. Scatter plots are normally used to show cause
and effect relationships, but any relationship seen in the scatter plots need not always
be a cause and effect relationship. Figure 2-41 shows how to create a scatter plot in R,
and Figure 2-42 shows the actual scatter plot generated. The underlying concept of
correlation will be explained in detail in subsequent chapters about regression.
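A sketch of the R command behind such a plot (consistent with the type = "b" option mentioned later in this section) is:
> plot(EmpData$EmpAge, EmpData$EmpSal, type = "b")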
Figure 2-42. Scatter plot created in R (using the method shown in Figure 2-41)
As you can see from this example, there is a direct relationship between employee
age and employee salary. The salary of the employees grows in direct proportion to
their age (this may not be true in a real scenario). Figure 2-42 shows that the salary
of the employees increases in proportion to their age and increases linearly. Such a
relationship is known as a linear relationship. Please note that type = "b" along with the
plot() command has created both a point and a line graph.
Let us now consider another data frame named EmpData1 with one additional
feature (also known as a column or field) and with different data in it. In Figure 2-43
you can see the data and a summary of the data in this data frame. As you can see there,
one more feature has been added, namely, EmpPerGrade, and the salary values have been
changed from those in the earlier data frame, EmpData.
Figure 2-44. Scatter plot from R showing the changed relationship between two
features of data frame EmpData1
Now, as you can see from Figure 2-44, the relationship between the employee age
and the employee salary has changed; as the age grows, the increase in employee salary
is no longer proportional but tapers down. This is normally known as a quadratic
relationship.
Figure 2-45. Scatter plot from R showing the changed relationship between two
features of data frame EmpData1
In Figure 2-45, you can see the relationship plotted between employee age and
employee performance grade. Ignore the first data point, as it is for a new employee
who joined recently and was not yet graded; hence, the performance grade recorded
is 0. Otherwise, as you can observe, as age progresses (as per this data), the
performance comes down. In this case, there is an inverse relationship between
employee age and employee performance (i.e., as the age progresses, performance
degrades). This again is not real data and is given only for illustration.
The same can be plotted using Python Pandas, as shown in Figure 2-46.
df.plot('EmpSal', 'EmpAge', kind="scatter")
Figure 2-46. Scatter plot using the Pandas plot() function
2.7 Chapter Summary
• In this chapter, you learned about various statistical parameters of
interest in descriptive analytics (mean, median, quantiles, quartiles,
percentiles, standard deviation, variance, and mode). You saw how
to compute these using R and Python. You also learned how most
of these parameters can be obtained through a single simple command,
such as summary(dataset) in R or describe() in Python.
• You explored how scatter plots can show the relationship between
various features of the data frame and hence enable us to better
understand these relationships graphically and easily.
CHAPTER 3
Structured Query Language Analytics
3.1 Introduction
Structured Query Language (SQL) is a popular programming language created to define,
populate, manipulate, and query databases. It is not a general-purpose programming
language, as it typically works only with databases and cannot be used for creating
desktop applications, web applications, or mobile applications. There may be some
variations in SQL syntax when it comes to different database engines. By going through
the documentation of the database engine you are going to use, you can easily ascertain
the differences yourself. SQL is a powerful tool for business and data analysts, and hence
we are going to cover it in detail.
SQL is typically pronounced as “sequel” or by its individual letters.
To demonstrate the usage of SQL, in this chapter we use the SQL dialect of a
popular database engine: PostgreSQL.
Typically, SQL statements can be categorized into three broad categories: data
definition language (DDL), data manipulation language (DML, which includes the data
query language, DQL, used for querying), and data control language (DCL).
In this chapter, we will not delve deep into definitions and theory, as there are
hundreds of books already available on this topic. Typical data types used in a database
are CHAR to store string data of fixed length; VARCHAR to store variable-size string; TEXT,
etc., to store a string; SMALLINT/MEDIUMINT/BIGINT/NUMERIC/REAL/DOUBLE PRECISION,
etc., to store numeric data; MONEY to store monetary values; DATE/TIMESTAMP/INTERVAL
to store date, date and time, and time interval values; BOOLEAN to store logical data like
TRUE/FALSE, YES/NO, ON/OFF; etc. There may be many other data types available for a
particular database engine. These data types also vary from one database engine to the
other. We are not going to discuss the data types and the limitations related to them here
because it is beyond the scope of this chapter. To get a clear understanding of these,
please refer to the corresponding documentation from the database engine you use.
This book is meant to provide you with an understanding of the practical use of SQL,
particularly in the context of business analytics. Please note that the intention of this
book is not to provide a comprehensive chapter covering all the aspects of SQL but to
demonstrate those aspects that are most useful to a data engineer or a data scientist.
3.2 Data Used by Us
To familiarize you with SQL, we will use some manufacturing data, as many factories
are becoming smart by using automation and sensors and have started using artificial
intelligence and machine learning to manage their operations better. To start with, we will
create three tables: machine_details, machine_status, and machine_issues.
The first table, machine_details, consists of machine details such as the machine
ID, machine name, and manufacturing line ID. The second table, machine_status,
holds the status of each machine while the factory is running; it captures, at various
timestamps, the readings coming from the sensors built into each machine. The third
table, machine_issues, captures issues with the machines on various dates. While the
first table is a master table that is created once, the second table gets populated
automatically during the running of the factory, and the third table is updated manually
by the machine maintenance department.
Let’s create the first table using the psql shell interface. The code used is
provided here:
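A sketch of such a statement (the column names follow the queries used later in this chapter; the data types and lengths are assumptions):
CREATE TABLE machine_details (
    machine_id INTEGER PRIMARY KEY,
    machine_name VARCHAR(20),
    mfg_line INTEGER NOT NULL
);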
Please note that we have represented the SQL keywords in uppercase letters
and the details provided by us in lowercase letters. However, technically it is not
necessary to type the keywords in capital letters. Please note that you should end every SQL
statement with a semicolon (;). INTEGER can be specified as INT, and CHARACTER can be
specified as CHAR.
Caution Do not use hyphens (-) in the names of tables, columns, and so on, as
they are not recognized. You can use an underscore (_).
To check if the table was created, you can use the \d command on the psql (short
form for PostgreSQL) shell interface.
Let’s now create the second table, machine_status. The code used for this purpose is
provided here:
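A sketch of this second statement (column names taken from the queries later in the chapter; data types assumed):
CREATE TABLE machine_status (
    machine_id INTEGER NOT NULL REFERENCES machine_details (machine_id),
    pressure_sensor_reading NUMERIC(6, 2),
    temp_sensor_reading NUMERIC(6, 2),
    date_and_time TIMESTAMP
);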
Let’s now create the third table, machine_issues. The code used for this purpose is
provided here:
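A sketch of the third statement (again, data types and lengths are assumptions):
CREATE TABLE machine_issues (
    machine_id INTEGER NOT NULL REFERENCES machine_details (machine_id),
    date DATE NOT NULL,
    issue_descrip VARCHAR(60)
);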
Now, let’s check the creation of all the three tables using the \d command on the
psql interface. The code used (i.e., input) is provided here:
postgres=# \d
List of relations
Schema | Name | Type | Owner
-----------+-----------------+--------+--------------
public | machine_details | table | postgres
public | machine_issues | table | postgres
public | machine_status | table | postgres
(3 rows)
We used SQL DDL to create the previous table structures in our database. In the
previous scripts, we used constraints like PRIMARY KEY, NOT NULL, and FOREIGN KEY. A
primary key represents the column that identifies each row uniquely, which means that a
primary key field/column cannot have a NULL value or a duplicate value. A foreign key is
a column or a combination of columns referencing the primary key of some other table.
The NOT NULL constraint ensures that the column must be filled and cannot hold a NULL/
blank value. This means that such a column cannot have a missing value. But it does not
mean that such a field cannot have a wrong or inaccurate value. In addition, we can use
other constraints like CHECK and UNIQUE as relevant.
Currently, the three tables created by us do not have any data populated in them.
Now we will insert some data into them. First, let’s populate the table machine_details.
We use the INSERT command for this purpose, as follows:
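A sketch of such INSERT statements (the machine names follow the MachineNNN convention used later in the chapter; the mfg_line values shown are illustrative assumptions):
INSERT INTO machine_details (machine_id, machine_name, mfg_line)
VALUES (1, 'Machine001', 1),
       (2, 'Machine002', 1),
       (3, 'Machine003', 1);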
Similarly, let’s insert a few records into the machine_issues table manually on behalf
of the maintenance department. The code is provided here:
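A sketch of such INSERT statements, using a few of the rows that appear in the outputs later in this chapter:
INSERT INTO machine_issues (machine_id, date, issue_descrip)
VALUES (1, '2021-12-31', 'Taken off for preventive maintenance'),
       (1, '2022-02-02', 'Break down bearing issue'),
       (5, '2022-03-05', 'Break down leakage issue');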
Any SQL statement will throw an error, pointing out the location of the error, if
there is anything wrong with the syntax or if a value does not match the defined data type.
Otherwise, the INSERT statement will, upon execution, show the number of
records added.
You can check on the status of the insertion of the records using the DML/DQL
statement SELECT. The SELECT statement to query the machine_details table is
provided here:
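A minimal form of such a query is:
SELECT * FROM machine_details;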
Another way to confirm the proper insertion of the records is to add RETURNING *; at
the end of the INSERT statement.
We can look for distinct (unique) values in a field/column of a table by using
DISTINCT before the field name in the SELECT statement. For example, we can find out
the distinct issues reported for the machines in the machine_issues table. The code used
in this regard is provided here:
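A query of this form, consistent with the output shown next, is:
SELECT DISTINCT issue_descrip FROM machine_issues;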
issue_descrip
--------------------------------------
Break down leakage issue
Taken off for preventive maintenance
Break down valve issue
Break down bearing issue
(4 rows)
We have discussed the creation of the table and the insertion of the records into the
table because sometimes as a business analyst or data engineer you may be required to
create and populate your own data tables.
Using these tables, let us now answer a few typical questions:
• Which machines have had breakdown issues?
• Which machines were taken off for preventive maintenance?
• If the tolerance range for the pressure for machine 2 is between
50 kPa and 60 kPa, check if the pressure readings sent by the
corresponding sensor are within these tolerance limits.
• If the tolerance range for the pressure for machine 2 and machine 8
is between 50 kPa and 60 kPa, check if the pressure readings sent by
the corresponding sensors are within these tolerance limits.
• If the tolerance range for the pressure for machine 3 and machine 9
is between 43 kPa and 45 kPa and the tolerance range for temperature
is between 210 degrees Celsius and 230 degrees Celsius, check if both
these parameters are within these tolerance limits.
To address these queries, the SELECT statement helps us. Let’s answer the first
question, i.e., which machines have breakdown issues. If we require only the machine ID
and the corresponding breakdown issue, then we can use a simple SELECT statement
as follows:
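A sketch of such a query, consistent with the output and the LIKE pattern discussed next, is:
SELECT machine_id, date, issue_descrip
FROM machine_issues
WHERE issue_descrip LIKE '%Break%';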
The output returned by the previous SELECT statement looks like this:
machine_id | date | issue_descrip
------------+------------+-----------------------------------
1 | 2022-02-02 | Break down bearing issue
7 | 2022-02-28 | Break down bearing issue
5 | 2022-03-05 | Break down leakage issue
10 | 2022-03-31 | Break down valve issue
(4 rows)
The LIKE operator matches the pattern specified within the single quotes with
the data in the field and returns those rows that match the pattern. Here, the pattern
says that in the field issue_descrip we are looking for the word Break wherever it occurs
within the field.
Additional Tip In the previous SELECT query, instead of LIKE '%Break%', you
can use LIKE '%Break_down%'. The underscore between the two words stands
for exactly one character, including a space. However, you should note that the words or
phrases you are looking for in the LIKE pattern are case sensitive.
Similarly, you can check for the machine IDs that underwent preventive
maintenance using the following SELECT statement on the machine_issues table:
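A sketch of such a query, consistent with the output shown next, is:
SELECT machine_id, date, issue_descrip
FROM machine_issues
WHERE issue_descrip LIKE '%preventive_maintenance%';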
machine_id | date | issue_descrip
------------+------------+--------------------------------------------
1 | 2021-12-31 | Taken off for preventive maintenance
2 | 2021-12-31 | Taken off for preventive maintenance
3 | 2021-12-31 | Taken off for preventive maintenance
4 | 2021-12-31 | Taken off for preventive maintenance
5 | 2021-12-31 | Taken off for preventive maintenance
Suppose in the earlier queries we also want the machine name. The machine_name
is available in another table, i.e., in the machine_details table, and not in the table we
queried earlier, i.e., the machine_issues table. Hence, in order to get the desired result, we
have to join the machine_issues table with the machine_details table, as follows:
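A sketch of such a join (m and k are the aliases described next; the WHERE pattern shown is an assumption):
SELECT m.machine_id, k.machine_name, m.date, m.issue_descrip
FROM machine_issues m
INNER JOIN machine_details k
    ON m.machine_id = k.machine_id
WHERE m.issue_descrip LIKE '%Break%';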
In the previous query, m and k are the aliases used for the tables machine_issues and
machine_details, respectively. Aliases reduce the need to repeat the entire name of the
associated table before each column name when multiple tables are joined.
If you want to count the number of preventive maintenance instances per machine,
you can use the aggregate function COUNT on the issue_descrip field in the SELECT
statement and use the GROUP BY clause to get the count per machine, as follows:
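A sketch of such a query is:
SELECT machine_id, COUNT(issue_descrip) AS preventive_maintenance_count
FROM machine_issues
WHERE issue_descrip LIKE '%preventive_maintenance%'
GROUP BY machine_id;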
Caution You should ensure that the GROUP BY clause includes all the fields in
the SELECT statement other than the aggregate function. Otherwise, you will get
an error. Alternatively, the field not used in the GROUP BY clause needs to have the
AGGREGATE function on it in the SELECT statement.
Additional Tip There are various aggregate functions like SUM, COUNT, MAX, MIN,
and AVG. You may use them to aggregate the data depending upon the context.
Let’s take the next question: if the tolerance range for the pressure for machine
2 is between 50 kPa and 60 kPa, check if the pressure readings sent by the corresponding
sensor are within these tolerance limits.
We have all the required data related to this query in the machine_status table. Let’s
query this table for the answer, as follows:
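A sketch of such a query, consistent with the output shown next, is:
SELECT machine_id, pressure_sensor_reading
FROM machine_status
WHERE machine_id = 2
  AND pressure_sensor_reading BETWEEN 50 AND 60;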
machine_id | pressure_sensor_reading
-----------+----------------------------------
2 | 55.50
2 | 55.25
(2 rows)
This query will return only those records that have the pressure_sensor_reading
between 50 kPa and 60 kPa. It does not tell us whether there are any records pertaining to
machine 2 that have the pressure_sensor_reading outside the range of 50 kPa to 60 kPa.
Hence, the answer provided by this query is partial. To get the other
part of the answer, we need to additionally query the machine_status table as follows:
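A sketch of this second query is:
SELECT machine_id, pressure_sensor_reading
FROM machine_status
WHERE machine_id = 2
  AND pressure_sensor_reading NOT BETWEEN 50 AND 60;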
machine_id | pressure_sensor_reading
-----------+-----------------------------------
(0 rows)
The previous output clearly shows that all the values of the
pressure_sensor_reading for machine 2 are within the tolerance limits specified.
If we want to understand the MAX and MIN values of the pressure_sensor_reading
instead of the individual reading, thereby getting the maximum and minimum values
within the tolerance limits, we can use the following query:
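One way to write this query, consistent with the output that follows, is:
SELECT machine_id, MAX(pressure_sensor_reading), MIN(pressure_sensor_reading)
FROM machine_status
WHERE machine_id = 2
GROUP BY machine_id;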
machine_id | max | min
-----------+-------+-------
2 | 55.50 | 55.25
(1 row)
Let’s take up the next question: if the tolerance range for the pressure for machine 2
and machine 8 is between 50 kPa and 60 kPa, check if the pressure readings sent by the
corresponding sensors are within these tolerance limits. For this again, we need to query
only one table, i.e., machine_status, as follows:
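A sketch of such a query (note the OR between the two machine_ids, as discussed in the tip that follows):
SELECT machine_id, pressure_sensor_reading
FROM machine_status
WHERE (machine_id = 2 OR machine_id = 8)
  AND pressure_sensor_reading NOT BETWEEN 50 AND 60;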
machine_id | pressure_sensor_reading
-----------+----------------------------------
(0 rows)
You can clearly validate the above result using the following query where you can see
that all the pressure readings are within the tolerance limits specified:
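A sketch of this validation query is:
SELECT machine_id, pressure_sensor_reading
FROM machine_status
WHERE (machine_id = 2 OR machine_id = 8)
  AND pressure_sensor_reading BETWEEN 50 AND 60;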
machine_id | pressure_sensor_reading
-----------+-----------------------------------
2 | 55.50
8 | 55.75
2 | 55.25
8 | 55.50
(4 rows)
Additional Tip In the previous query, you need to use OR between the two
machine_ids. Otherwise, zero records will be returned, as a single row cannot
have machine_id = 2 and machine_id = 8 at the same time. OR looks for any one of
the conditions to be satisfied; even if only one condition turns out to be true, the row will be
included. AND looks for all of the conditions to be satisfied; only if all the conditions turn
out to be true will the query return the related output.
Let’s take up our next question, which is more complicated than the ones we dealt
with previously: if the tolerance range for the pressure for machine 3 and machine 9
is between 43 kPa and 45 kPa and the tolerance range for temperature is between 210
degrees Celsius and 230 degrees Celsius, check if both these parameters are within these
tolerance limits. The query (code) related to the same is provided here:
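A sketch of such a query is:
SELECT machine_id, pressure_sensor_reading, temp_sensor_reading
FROM machine_status
WHERE (machine_id = 3 OR machine_id = 9)
  AND pressure_sensor_reading BETWEEN 43 AND 45
  AND temp_sensor_reading BETWEEN 210 AND 230;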
We can add ORDER BY machine_id ASC to get the details in ascending order of
machine_id, which allows better readability of the data when there are more records in the
table. Here is the query and the result:
To be doubly sure that there are no records pertaining to machine 3 or 9 that fall
outside the tolerance limits specified in the question, let’s use the following query:
In the previous examples, our data is clean because we created the records ourselves.
However, to help you understand the process of dealing with imperfect data, we will now
add some incomplete and incorrect records to these tables. Please note that you can use the
INSERT/UPDATE statements to do so. After modifying the data, the details in the
three tables are as follows. The queries in this regard and the outputs are provided here.
Code:
Output:
Code:
Output:
Code:
Output:
You can now see some columns with no/blank details in two of these tables. Some of
the table columns have been defined with the NOT NULL constraint and hence will not
allow any NULL/no/blank value. However, other columns without this constraint may
have such NULL/no/blank value. We can find out such rows and columns in each of the
tables as follows.
Code:
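A sketch of such a query for the machine_details table (as noted next, only the machine_name column needs to be checked there) is:
SELECT * FROM machine_details
WHERE machine_name IS NULL;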
Output:
Please note that we know that machine_id and mfg_line are integer fields and
cannot be NULL as defined by the constraints on those columns set while creating the
table. Hence, we have included here only the machine_name column in our query.
In the previous result, you can observe mfg_line = 0 for the machine_id = 17. You
may understand in discussion with the procurement team or manufacturing team that
this machine is a new machine and is yet to be deployed on any manufacturing line.
Hence mfg_line is currently set as 0, which means it is yet to be deployed.
As you know, the machine-naming convention is based on the machine_id, so you
can easily update the names of the machines in the table machine_details, without
reference to anybody else, using the UPDATE statement, as follows:
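A sketch of one such UPDATE statement (machine 17's name follows the MachineNNN convention seen later; the other machine_ids would be updated similarly):
UPDATE machine_details
SET machine_name = 'Machine017'
WHERE machine_id = 17;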
Additional Tip Instead of repeating many times the previous query for different
machine_ids, you can use the CASE statement with the UPDATE statement. The
CASE statement is described later in this chapter.
After doing this, we can now observe from the following query that all the values in
the table machine_details are complete and correct.
Code:
Output:
Code:
Output:
If you think through the previous result, you may conclude that at the specified
time when the automated system tried to get the data from the sensors, possibly the
sensors were unavailable, which means possibly the machines were down or taken off
for preventive maintenance. Alternatively, it is possible that the sensors were faulty and
returning a NULL value. From the data available with us, we can check if the machine was
taken off for preventive maintenance or was on maintenance on account of the machine
breakdown. Let's combine the data from the two tables, i.e., machine_status and
machine_issues, to check on these aspects.
Code:
Output:
Additional Tip Alternatively, you can use the TIMESTAMP data type converted
to the DATE data type so that it is easy to see the date values side by side. For this
purpose, you may use the type cast k.date_and_time::DATE in the SELECT
function. The CAST feature is very useful for converting the data type from one
type to another. There are other ways of type casting. Please refer to the relevant
documentation pertaining to the database engine you use.
As you can see, three machines with machine_ids 13, 14, and 16 were taken off for
preventive maintenance on April 9, 2022. However, as you can see two other machines
with machine_ids 12 and 15 from the machine_status table are not in this list, but they
have the pressure_sensor_reading as NULL and temp_sensor_reading as NULL on April
9, 2022, which means that possibly the sensors were not readable by our application
populating the data into the table machine_status, but the machines were working.
Otherwise, their numbers would have been listed in the machine_issues table on April
9, 2022 (presuming that the machine maintenance department has not missed out to
enter the issues with these machines in the machine_issues table). Now, we cannot
retain this faulty data in the machine_status table as it can skew all our analysis. Hence,
we have two options, i.e., either to remove the rows with null values on April 9, 2022,
for machine_ids 12 and 15 from the machine_status table or to substitute some values
within tolerance limits like average values or median values for the said machine from
the table (if we are sure in discussion with the floor personnel that these two machines
were working on that day and we have a good amount of the data from which we can
derive these average or median values) for both the pressure_sensor_reading and
temp_sensor_reading. The decision we must take is based on the impact of each action.
If we are the business analysts, we would inform the concerned technicians to rectify the
sensors and delete the blank value rows pertaining to machine_ids 12 and 15 for April 9,
2022, from the machine_status table. We will also remove the rows with blank values
pertaining to the machines taken off for preventive maintenance in the machine_status
table. For this purpose, we can use the DELETE function of SQL.
Code:
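A sketch of such a DELETE statement (the type cast follows the earlier tip; the exact predicate used originally is not reproduced here):
DELETE FROM machine_status
WHERE pressure_sensor_reading IS NULL
  AND temp_sensor_reading IS NULL
  AND date_and_time::DATE = '2022-04-09';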
The status returned is DELETE 5. This indicates that five rows are deleted.
Alternatively, you could substitute the missing sensor values with the average or
median sensor value, using the UPDATE statement, for the two machines that
were not taken off for preventive maintenance. This means the nonexistent values are
replaced with the average of the existing values of that column, the median of the
existing values, or the most representative value for that column for the particular
machine_id, based on the nature of the sensor. For example, suppose for a particular
machine_id we have only three rows, and for one of the columns two values are
populated and one is not. If we are sure from the other data available to us that this
particular value is missing and needs to be filled, we can use the average of the two
existing values or the most representative value for that field.
Output:
Please note that we have used the ORDER BY clause to order the data returned by the
query in the order of machine_id in the machine_details table. Alternatively, you can
use ORDER BY on date_and_time from the machine_status table. Also, you can use ORDER
BY both together, i.e., date_and_time from the machine_status table and machine_id
from the machine_details table.
Code:
Output:
Based on the purpose for which you use the report, you can choose one of the
suitable methods.
Additional Tip In the SELECT clause you should mention the fields in the order
you want to have them in your report.
Here is another example of INNER JOIN. Here we are joining the tables machine_
details and machine_issues and ordering the rows by the date from the machine_
issues table and, within that, by the machine_id from the machine_details table.
Code:
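A sketch of such a join is:
SELECT k.machine_id, k.machine_name, m.date, m.issue_descrip
FROM machine_details k
INNER JOIN machine_issues m
    ON k.machine_id = m.machine_id
ORDER BY m.date ASC, k.machine_id ASC;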
Output:
You can also combine more than two tables using the inner join as long as you have
a relationship between the first table and the second, the second and the third, and so on.
Let's explore the inner join on all three tables, i.e., machine_status, machine_issues,
and machine_details. The following query demonstrates the corresponding results.
Code:
Output:
However, from the previous result, you can see that using INNER JOIN on all three
tables has returned a lot of duplicated rows, and the complexity of such an output makes it
less usable. Hence, we suggest you use INNER JOIN on more than two tables only when it
really makes sense in terms of the utility of such a report/output.
Let’s now explore LEFT JOIN (also known as LEFT OUTER JOIN). This gets all the details
from the first table (i.e., left table) and populates the corresponding details from the next
table (i.e., second table), both based on the fields specified in the SELECT statement when
the data in the common field in both the tables match. Where the data for the common
field from the first table (i.e., left table) does not match with the next table (i.e., second
table), then the details from the first table are still selected, and the fields from the second
table are marked as NULL in the result returned. Only those fields that are included in
the SELECT statement are returned by the query. This enables us to check on the data
pertaining to the rows in the first table, which is not there in the second table. For example,
we can use this to find out if there was any machine in the organization (all the details of
the machines in the organization are captured in the machine_details table), which either
was not taken off for preventive maintenance or did not have any breakdown or any other
issue (such issues are captured in our case in the machine_issues table). The LEFT JOIN
on the tables machine_details (i.e., left table) and machine_issues (i.e., second table) is
demonstrated with the query and the results shown next.
Code:
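A sketch of such a LEFT JOIN is:
SELECT k.machine_id, k.machine_name, m.date, m.issue_descrip
FROM machine_details k
LEFT JOIN machine_issues m
    ON k.machine_id = m.machine_id
ORDER BY k.machine_id ASC;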
Output:
Here, you can see that the machines with machine_ids 12, 15, and 17 did not
have any preventive maintenance or breakdown issues. As you know, machine_id 17,
i.e., Machine017, is yet to be used on a manufacturing line, and hence no issues or
preventive maintenance have been recorded for it. However, machines 12 and 15 have
not undergone any preventive maintenance or had any breakdown issues so far. This may
suggest that the organization's maintenance team should take them up for preventive
maintenance if the due period for preventive maintenance has already elapsed since the
last preventive maintenance date (or since the deployment date, if they are newly
deployed machines currently in use).
Let’s now explore the RIGHT JOIN (also known as the RIGHT OUTER JOIN). This returns
all the rows of the second table (the right table) and the corresponding values from the
first table where the common field (used for the join) in the second table matches the
common field in the first table. Where a row of the second table has no matching value of
the common field in the first table, all the data from the second table is still returned, with
the columns from the first table populated with NULL values. Only those fields listed in
the SELECT statement are returned. This enables us to check on the data pertaining to the
rows in the second table that are not there in the first table. Let’s take the machine_details
table as the second table (the right table) and the machine_status table as the first table.
Here is the query and the output pertaining to the RIGHT JOIN.
Code:
postgres=# SELECT m.machine_id, m.machine_name, m.mfg_line, k.pressure_
sensor_reading, k.temp_sensor_reading, k.date_and_time
postgres-# FROM machine_status k
postgres-# RIGHT JOIN machine_details m
postgres-# ON m.machine_id = k.machine_id
postgres-# ORDER BY m.machine_id ASC;
Output:
The previous output clearly shows that the sensors on machines 12, 13, 14, 15, and 16
are possibly not yet deployed or activated, or the automation to obtain the feed from them
is yet to be carried out. The machines themselves are surely deployed, as they are on
manufacturing line 3, and as we saw earlier, some of them underwent preventive
maintenance too. As we know, the machine with machine_id 17 is yet to be deployed.
Let’s now look at FULL JOIN (also known as FULL OUTER JOIN). If we are joining two
tables, the result is a combination of both LEFT JOIN and RIGHT JOIN; the records from
both tables are included in the result, and wherever a record has no match in the other
table, the columns from that other table are populated with NULL values. Let’s now carry
out a FULL OUTER JOIN on two tables, i.e., machine_status and machine_issues. The
result includes the matching records from both tables as well as the records from both
tables that do not match. The query and the results are provided next.
Code:
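A sketch of such a FULL OUTER JOIN (the columns selected here are an assumption):
SELECT k.machine_id, k.pressure_sensor_reading, k.temp_sensor_reading,
       m.machine_id AS issues_machine_id, m.date, m.issue_descrip
FROM machine_status k
FULL OUTER JOIN machine_issues m
    ON k.machine_id = m.machine_id;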
Output:
As you can see in the previous output, wherever the left table did not have
corresponding data in the right table, the columns pertaining to the right table have
been populated with NULL (blank) values and included in the output. In the same way, if
a row in the right table has no matching row in the left table, the columns pertaining to
the left table are populated with NULL (blank) values. In our case, however, only left-table
rows lack matches in the right table, and hence the fields pertaining to the right table are
populated with NULL values for those rows.
A FULL JOIN is typically used to check on the similarity or dissimilarity between the
contents of the two tables. In the case of our example, this may be more confusing than
really useful. However, as you saw in the earlier examples, INNER JOIN, LEFT JOIN, and
RIGHT JOIN were all useful.
Additional Tip It is not necessary to include all the columns from all the tables in the
SELECT statement for the JOINs. You need to include them based on the purpose of
the query. However, in our examples we have included all just to make you aware of the
outputs in detail. We have not explored the SELF JOIN in this chapter. A SELF JOIN is
either an INNER JOIN or a LEFT JOIN or a RIGHT JOIN on the same table.
Let’s now explore another important SQL conditional expression, i.e., CASE. This is very
useful in understanding and exploring the data, and even for reporting. It can be used
for cleaning the data as well. CASE expressions typically use a WHEN…THEN…ELSE
structure within them; the ELSE part is optional. Let’s say we want to find out the number
of instances of preventive maintenance and the number of instances of breakdown; the
CASE expression can be used as follows.
Code:
postgres=# SELECT
postgres-# SUM (CASE
postgres(# WHEN issue_descrip LIKE '%Break_down%' THEN 1
postgres(# ELSE 0
postgres(# END
postgres(# ) AS "No. of Break Down Cases",
postgres-# SUM (CASE
postgres(# WHEN issue_descrip LIKE '%preventive_
maintenance%' THEN 1
postgres(# ELSE 0
postgres(# END
postgres(# ) AS "No. of Preventive Maintenance Cases"
postgres-# FROM machine_issues;
Output:
Suppose you want to look at the machine_id, number of breakdown cases and
number of preventive maintenance cases; then you use the query using CASE as follows.
Code:
Output:
Please note that when a CASE expression has multiple WHEN and THEN clauses, the
evaluation starts with the first WHEN. If that WHEN condition evaluates to TRUE, the THEN
expression following it is executed, and the subsequent WHEN conditions are not evaluated.
However, if the first WHEN condition evaluates to FALSE, the next WHEN condition is
evaluated, and so on. If all the WHEN conditions evaluate to FALSE, the ELSE expression
is executed.
Let’s look at an example of this. The query and the result are shown next.
Code:
Output:
At this point in time, the machine_status table has no records with NULL values for
pressure_sensor_reading and temp_sensor_reading. Hence, only the ELSE expression
got executed, and the result for all the records is the same, i.e., health_status is
machine_is_working.
Now, let’s add a record/row with NULL values for pressure_sensor_reading,
temp_sensor_reading, and date_and_time, and check what is returned by the previous query.
Code:
Output:
As you can see, the machine_status table now has a row/record with a NULL value for
the pressure_sensor_reading, temp_sensor_reading, and date_and_time columns.
Let’s now run the CASE query used before this INSERT and look at what happens.
Code:
Output:
Output:
machine_id | avg_pres_sens_rdg | avg_temp_sens_rdg
-----------+----------------------+---------------------
1 | 25.3750000000000000 | 125.3300000000000000
2 | 55.3750000000000000 | 250.1250000000000000
3 | 44.3250000000000000 | 220.3750000000000000
4 | 20.2000000000000000 | 190.3750000000000000
5 | 100.4250000000000000 | 500.2750000000000000
7 | 25.5250000000000000 | 125.6500000000000000
8 | 55.6250000000000000 | 250.6500000000000000
9 | 44.3750000000000000 | 220.3250000000000000
10 | 20.0500000000000000 | 190.1250000000000000
11 | 100.6500000000000000 | 500.3500000000000000
(10 rows)
Additional Tip You can use the ROUND function on the AVG function to
round the AVG value to the required number of decimal places. For example, use
ROUND(AVG(pressure_sensor_reading), 2) to round the returned value to
two decimals.
As you can note, the details of the machine_id 16 are not shown in the previous result
as the fields of pressure_sensor_reading and temp_sensor_reading are NULL.
Additional Tip We have given an alias to the calculated AVG fields to provide a
meaningful heading to such Average columns.
Let’s now see how we can use ANY and ALL with the SELECT statement. These allow
the comparison of any or all values with the result returned by a subquery. These are
always used with comparison operators like = (i.e., equal to), != (i.e., not equal to), <=
(i.e., less than or equal to), < (i.e., less than), >= (i.e., greater than or equal to), > (i.e.,
greater than).
Code:
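A sketch of such a query, consistent with the output shown next (the subquery computes the average pressure reading per machine):
SELECT machine_id, pressure_sensor_reading
FROM machine_status
WHERE pressure_sensor_reading < ANY
      (SELECT AVG(pressure_sensor_reading)
       FROM machine_status
       GROUP BY machine_id)
ORDER BY pressure_sensor_reading ASC;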
Output:
machine_id | pressure_sensor_reading
-----------+------------------------
10 | 20.00
10 | 20.10
4 | 20.15
4 | 20.25
1 | 25.25
7 | 25.50
1 | 25.50
7 | 25.55
3 | 44.25
9 | 44.25
3 | 44.40
9 | 44.50
2 | 55.25
2 | 55.50
8 | 55.50
8 | 55.75
5 | 100.35
5 | 100.50
11 | 100.55
(19 rows)
As the highest average for any machine is 100.65 (pertaining to machine_id 11),
< ANY returns all the values of pressure_sensor_reading less than this value, and any
value of pressure_sensor_reading at or above it is not selected. Hence, as you can see,
the only value not selected here is 100.75 (pertaining to machine_id 11), which is higher
than this highest average of 100.65.
Let’s now see what happens if we use > ANY. The results are as follows.
Code:
Output:
machine_id | pressure_sensor_reading
-----------+------------------------
10 | 20.10
4 | 20.15
4 | 20.25
1 | 25.25
1 | 25.50
7 | 25.50
7 | 25.55
9 | 44.25
3 | 44.25
3 | 44.40
9 | 44.50
2 | 55.25
8 | 55.50
2 | 55.50
8 | 55.75
5 | 100.35
5 | 100.50
11 | 100.55
11 | 100.75
(19 rows)
As you can see, all the values are selected except the one value lower than the lowest
average of 20.05 (pertaining to machine_id 10).
The previous queries may not be of much use in the context of our data but may be
useful in a different context with a different data set.
Additional Tip Please note that the null value is not considered while calculating
the AVG. This is applicable in our case with respect to machine_id 16. There is no
average calculated for machine_id 16 used previously.
Similar to the previous example, you can use ALL before the subquery where the
comparison will be carried out with the value returned by the subquery. As NULL
values in your table can create issues with such queries, you need to be careful.
Where it is possible, rows with the NULL values may be deleted in the copy of the
data set used by you for your ease of analysis.
Similarly, UNION and INTERSECT may also be useful if we have to combine the results
of two or more queries. UNION is used when we want to combine the results of two or
more queries. This ensures the inclusion of the data from both the queries into the
final result set. If there is a common record in both the queries, UNION ensures that only
one of them is retained in the result set. INTERSECT is used when we want to select the
common rows from two or more queries. For example, if you want to find out which
machines have the issues reported, you can use INTERSECT on the machine_ids from
the machine_details (i.e., master file of machines) table with the machine_ids from the
machine_issues (which captures the issues with the machines) table.
Code:
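A sketch of such a query is:
SELECT machine_id FROM machine_details
INTERSECT
SELECT machine_id FROM machine_issues
ORDER BY machine_id;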
Output:
machine_id
------------
1
2
3
4
5
7
8
9
10
11
13
14
16
(13 rows)
Caution For the UNION and INTERSECT operators to work, each query that
takes part in the SELECT statement should have same number of columns with
the same order and compatible data types.
Please note that typically in a SELECT statement, the clauses are executed in the
following order: FROM > ON > OUTER > WHERE > GROUP BY > HAVING > SELECT > ORDER BY,
according to www.designcise.com.
Let’s now explore EXISTS, NOT EXISTS, IN, and NOT IN.
Code:
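A sketch of an EXISTS query on our data (the columns selected are an assumption):
SELECT m.machine_id, m.machine_name
FROM machine_details m
WHERE EXISTS
      (SELECT 1
       FROM machine_issues k
       WHERE k.machine_id = m.machine_id);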
Output:
In this query, for every record in the outer table (i.e., in the first FROM statement), the
subquery within the brackets is evaluated, and if the match is made, then the output
from the first table as per the first SELECT statement is returned. Then for the next record
in the outer table (i.e., in the first FROM statement), the subquery within the brackets is
evaluated, and if the match is made, then the output from the first table as per the first
SELECT statement is returned. This way the result of the subquery check is made for
every record in the first table. Hence, this type of query is very inefficient in terms of the
resource utilization.
The same result can be obtained using IN as follows.
Code:
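A sketch of the equivalent query using IN is:
SELECT machine_id, machine_name
FROM machine_details
WHERE machine_id IN (SELECT machine_id FROM machine_issues);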
Output:
NOT EXISTS is exactly the opposite of EXISTS. NOT EXISTS evaluates to TRUE only when
a row from the first table (in the first FROM clause) has no match in the table of the
subquery, and it returns the corresponding result of the first SELECT statement. Let's
change the previously executed query to NOT EXISTS and check what happens.
Code:
Output:
EXISTS evaluates to TRUE for a row when the subquery returns at least one record for it;
NOT EXISTS evaluates to TRUE when the subquery returns no records.
LIMIT in the SELECT statement helps you sample the data or limit the output of the
query in terms of the number of records. This helps in the case of huge data, when you
really want only the top few records, or when you want to check whether the query works
correctly before returning all the records. The following is an example.
Code:
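A sketch of such a query, consistent with the output shown next, is:
SELECT machine_id, date, issue_descrip
FROM machine_issues
LIMIT 5;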
Output:
machine_id | date | issue_descrip
-----------+------------+-------------------------------------
1 | 2021-12-31 | Taken off for preventive maintenance
2 | 2021-12-31 | Taken off for preventive maintenance
3 | 2021-12-31 | Taken off for preventive maintenance
4 | 2021-12-31 | Taken off for preventive maintenance
5 | 2021-12-31 | Taken off for preventive maintenance
(5 rows)
As you can see, only the first five records returned by the query are output as the result.
OFFSET N ROWS FETCH NEXT N ROWS ONLY; or FETCH FIRST N ROWS ONLY; can
be used as an alternative if you want to inspect different parts of the output before you
execute the query fully, in case huge data is being returned. Examples are provided here.
Code:
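A sketch using OFFSET and FETCH NEXT, consistent with the first output shown next, is:
SELECT machine_id, date, issue_descrip
FROM machine_issues
OFFSET 5 ROWS FETCH NEXT 5 ROWS ONLY;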
Output:
machine_id | date | issue_descrip
-----------+------------+-------------------------------------
7 | 2021-12-31 | Taken off for preventive maintenance
8 | 2021-12-31 | Taken off for preventive maintenance
9 | 2021-12-31 | Taken off for preventive maintenance
10 | 2021-12-31 | Taken off for preventive maintenance
11 | 2021-12-31 | Taken off for preventive maintenance
(5 rows)
Code:
Output:
machine_id | date | issue_descrip
-----------+------------+-------------------------------------
1 | 2021-12-31 | Taken off for preventive maintenance
2 | 2021-12-31 | Taken off for preventive maintenance
3 | 2021-12-31 | Taken off for preventive maintenance
4 | 2021-12-31 | Taken off for preventive maintenance
5 | 2021-12-31 | Taken off for preventive maintenance
(5 rows)
3.4 Chapter Summary
• In this chapter, you saw how SQL can act as an excellent utility for
data analytics.
• You learned how to create the tables in the Postgres database using
PostgreSQL.
• You learned how to insert the data into the tables, if required.
• You also learned how to query on the tables and get the results
presented to you.
• You also learned how these queries can be used to provide you
with insights related to your business problems and to reveal the
intelligence hidden in the data.
• You also learned how you can handle the missing values.
CHAPTER 4
Business Analytics Process
This chapter covers the process and life cycle of business analytics projects. We discuss
various steps in the analytic process, from understanding requirements to deploying a
model in production. We also discuss the challenges at each stage of the process and
how to overcome those challenges.
The typical process of business analytics and data mining projects is as follows:
1. Start with a business problem or objective to be addressed, or
start with the data to understand what patterns you see in the data
and what knowledge you can decipher from the data.
2. Study the data and data types, preprocess the data, clean up the data
for missing values, and fix any other data elements or errors.
3. Check for the outliers in the data and remove them from the data
set to reduce their adverse impact on the analysis.
Most organizations have data spread across various databases. Pulling data from
multiple sources is a required part of solving business analytics tasks. Sometimes, data
may be stored in databases for different purposes than the objective you are trying to
solve. Thus, the data has to be prepared to ensure it addresses the business problem
prior to any analytics process. This process is sometimes referred to as data munging or
data wrangling, which is covered later in the chapter.
4.1.2.1 Sampling
Many times, unless you have a big data infrastructure, only a sample of the population
is used to build analytical modeling. A sample is “a smaller collection of units from a
population used to determine truths about that population” (Field, 2005). The sample
should be representative of the population. Choosing a sampling technique depends on
the type of business problem.
For example, you might want to study the annual gross domestic product (GDP) per
capita for several countries over a period of time and the periodic behavior of such series
in connection with business cycles. Monthly housing sales over a period of 6–10 years
show cyclic behavior, but for 6–12 months, the sales data may show seasonal behavior.
Stock market data over a period of 10–15 years may show a different trend than over
a 100-day period. Similarly, forecasting sales based on previous data over a period of
time, or analyzing Twitter sentiments and trends over a period of time, is cyclic data. If
the fluctuations are not of a fixed period, they are cyclic. If the changes are in a specific
period of the calendar, the pattern is seasonal. Time-series data is data obtained through
repeated measurements over a particular time period.
For time-series data, the sample should contain the time period (date or time or
both) and only a sample of measurement records for that particular day or time instead
of the complete data collected. For example, the Dow Jones volume is traded over 18
months. The data is collected for every millisecond, so the volume of this data for a day is
huge. Over a 10-month period, this data can be in terabytes.
The other type of data is not time dependent. It can be continuous or discrete data,
but time has no significance in such data sets. For example, you might look at the income
or job skills of individuals in a company, the number of credit transactions in a retail
store, or age and gender information. There is no relationship between any two data
records.
Unless you have big data infrastructure, you can just take a sample of records for any
analysis. Use a randomization technique and take steps to ensure that all the members
of a population have an equal chance of being selected. This method is called probability
sampling. There are several variations on this type of sampling.
n = (z × σ / E)²

n = p × (1 − p) × (z / E)²
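These are the standard sample-size formulas for estimating a population mean (given an estimate of the standard deviation σ) and a population proportion p, where z is the z-value for the desired confidence level and E is the acceptable margin of error. A minimal Python sketch, with illustrative values assumed for z, σ, p, and E:

import math

# Sample size for estimating a mean: n = (z * sigma / E)^2
# Assumed illustrative values: 95% confidence (z = 1.96), sigma = 15, margin of error E = 2
z, sigma, E = 1.96, 15, 2
n_mean = math.ceil((z * sigma / E) ** 2)

# Sample size for estimating a proportion: n = p * (1 - p) * (z / E)^2
# Assumed illustrative values: p = 0.5 (most conservative), E = 0.05
p, E_prop = 0.5, 0.05
n_prop = math.ceil(p * (1 - p) * (z / E_prop) ** 2)

print(n_mean, n_prop)   # 217 and 385 with these assumed values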
4.1.3.1 Data Types
Data can be either qualitative or quantitative. Qualitative data is not numerical—for
example, type of car, favorite color, or favorite food. Quantitative data is numeric.
Additionally, quantitative data can be divided into categories of discrete or continuous
data (described in more detail later in this section).
Quantitative data is often referred to as measurable data. This type of data
allows statisticians to perform various arithmetic operations, such as addition and
multiplication, and to find population parameters, such as mean or variance. The
observations represent counts or measurements, and thus all values are numerical. Each
observation represents a characteristic of the individual data points in a population or
a sample.
Before the analysis, understand the variables you are using and prepare all of them
with the right data type. Many tools support the transformation of variable types.
4.1.3.2 Data Preparation
After the preliminary data type conversions, the next step is to study the data. You need
to check the values and their association with the data. You also need to find missing
values, null values, empty spaces, and unknown characters so they can be removed from
the data before the analysis. Otherwise, this can impact the accuracy of the model. This
section describes some of the criteria and analysis that can be performed on the data.
Fill in the values with average value or mode: This is the simplest
method. Determine the average value or mode value for all the
records of an attribute and then use this value to fill in all the
missing values. This method depends on the type of problem you
are trying to solve. For example, if you have time-series data, this
method is not recommended. In time-series data, let's say you are collecting the moisture content of your agricultural land's soil every day. Data is collected, say, every 24 hours for one month or maybe for one year. Since moisture varies every day depending on the weather conditions, it is not recommended to impute a mean or mode value for a missing value.
1. Data set income: 100, 210, 300, 400, 900, 1000, 1100, 2000,
2100, 2500.
4. Use the average bin values to fill in the missing value for a
particular bin.
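As a minimal pandas sketch of these two imputation ideas—filling with the overall mean (or mode) and filling with the average of a group or bin—assuming hypothetical income and segment columns:

import pandas as pd
import numpy as np

df = pd.DataFrame({'income': [100, 210, 300, 400, np.nan, 900, 1000, 1100, 2000, np.nan, 2100, 2500]})

# Simplest method: fill missing values with the overall mean (or use df['income'].mode()[0])
df['income_mean_filled'] = df['income'].fillna(df['income'].mean())

# Bin/group-based method: group records by a related attribute (here a hypothetical 'segment' column)
# and fill each missing value with the average of its own group
df['segment'] = ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C']
df['income_group_filled'] = df['income'].fillna(df.groupby('segment')['income'].transform('mean'))
print(df)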
4.1.3.3 Data Transformation
After a preliminary analysis of data, sometimes you may realize that the raw data you
have may not provide good results or doesn’t seem to make any sense. For example,
data may be skewed, data may not be normally distributed, or measurement scales
may be different for different variables. In such cases, data may require transformation.
Common transformation techniques include normalization, data aggregation, and
smoothing. After the transformation, before presenting the analysis results, the inverse
transformation should be applied.
Normalization
Certain techniques such as regression assume that the data is normally distributed and
that all the variables should be treated equally. Sometimes the data we collect for various
predictor variables may differ in their measurement units, which may have an impact on
the overall equation. This may cause one variable to have more influence over another
variable. In such cases, all the predictor variable data is normalized to one single scale.
Some common normalization techniques include the following:
A = [2, 3, 4, 5, 6, 7]
Mean = 4.5
SD = 1.87

Value    Standardized value (z-score)
2        -1.33630621
3        -0.801783726
4        -0.267261242
5         0.267261242
6         0.801783726
7         1.33630621
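A minimal Python sketch of two common normalization techniques applied to the small example above—min-max scaling to [0, 1] and z-score standardization:

import numpy as np

A = np.array([2, 3, 4, 5, 6, 7], dtype=float)

# Min-max normalization: rescale values to the range [0, 1]
A_minmax = (A - A.min()) / (A.max() - A.min())

# Z-score standardization: subtract the mean and divide by the (sample) standard deviation
A_z = (A - A.mean()) / A.std(ddof=1)

print(A_minmax)
print(A_z)   # matches the standardized values listed above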
The basic idea is to find a value of λ such that the transformed data is as close to normally distributed as possible. The formula for a Box-Cox transformation is as follows:

A′ = (A^λ − 1) / λ   for λ ≠ 0 (and A′ = log A when λ = 0)
4.1.5.1 Descriptive Analytics
Descriptive analytics, also referred to as EDA, explains the patterns hidden in the data.
These patterns can be the number of market segments, sales numbers based on regions,
groups of products based on reviews, software bug patterns in a defect database,
behavioral patterns in an online gaming user database, and more. These patterns are
purely based on historical data and use basic statistics and data visualization techniques.
4.1.5.2 Predictive Analytics
Prediction consists of two methods: classification and regression analysis.
Classification is a data analysis in which data is classified into different classes. For
example, a credit card can be approved or denied, flights at a particular airport are on
time or delayed, and a potential employee will be hired or not. The class prediction is
based on previous behaviors or patterns in the data. The task of the classification model
is to determine the class of data from a new set of data that was not seen before.
Regression predicts the value of a numerical variable (continuous variable)—for
example, company revenue or sales numbers. Most books refer to prediction as the
prediction of a value of a continuous variable. However, classification is also prediction,
as the classification model predicts the class of new data of an unknown class label.
4.1.5.3 Machine Learning
Machine learning is about making computers learn and perform tasks based on past historical data. Learning is always based on observations from the available data.
We divide any sample data into two sets. The first data set is called the training data set, and it is used to train the model. The second data set, called the test data (sometimes referred to as the validation set), is used to test the model's performance. In this example, the data is a set of documents that are already categorized and labeled into different classes. The labeling is done by an expert who understands the different classes. This labeled data set is the training data set. The algorithm learns from the training data, which has class labels, and creates a model. Once the model is ready, it accepts a new set of documents whose labels are unknown and classifies them into the proper class. Common supervised classification algorithms include logistic regression, support vector machines, naïve Bayes, k-nearest neighbors, and decision trees.
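As a minimal scikit-learn sketch of this supervised workflow—train on labeled data, then classify unseen records—using one of the algorithms mentioned above (logistic regression) on a small synthetic data set:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic labeled data standing in for the expert-labeled documents
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Split into a training set (to learn the model) and a test set (unseen data)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print("accuracy on unseen data:", model.score(X_test, y_test))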
Having more variables in the data set may not always provide the desired results.
However, if you have more predictor variables, you need more records. For example, if
you want to find out the relationship between one Y and one single predictor X, then 15
data points may give you results. But if you have 10 predictor variables, 15 data points is
not enough. Then how much is enough? Statisticians and many researchers have worked
on this and given a rough estimate. For example, a procedure by Delmater and Hancock
(2001) indicates that you should have 6 × m × p records for any predictive models, where
p is number of variables and m is the number of outcome classes. The more records you
have, the better the prediction results. Hence, in big data processing, you can eliminate
the need for sampling and try to process all the available data to get better results.
Typically, the following points are addressed during the presentation of the model
and its use in solving the business problems.
4.1.7.1 Problem Description
First, specify the problem defined by the business and solved by the model. In this step,
you are revalidating the precise problem intended to be solved and connecting the
management to the objective of the data analysis.
• The programs used are not effective in exploiting parallelism and hence reduce the possibility of effectively using the results.
4.2 Chapter Summary
In this chapter, we focused on the processes involved in business analytics, including
identifying and defining a business problem, preparing data, collecting data, modeling
data, evaluating model performance, and reporting to the management on the findings.
You learned about various methods involved in data cleaning, including
normalizing, transforming variables, handling missing values, and finding outliers.
We also delved into data exploration, which is the most important process in business
analytics.
Further, you explored supervised machine learning, unsupervised machine learning,
and how to choose different methods based on business requirements. We also touched
upon the various metrics to measure the performance of different models including both
regression and classification models.
CHAPTER 5
Exploratory Data Analysis
[Figure: sources of data for analysis—operational and manufacturing databases (structured and unstructured) and log data]
• To determine whether the data set can answer the business problem
you are trying to solve
The following sections describe the tables and graphs required for carrying out good
data analysis. The basic concepts are explained in each section and are demonstrated
using R. Then we demonstrate this using Python. We will try to avoid explaining the
concepts twice.
5.1.1 Tables
The easiest and most common tool available for looking at data is a table. Tables contain
rows and columns. Raw data is displayed as rows of observations and columns of
variables. Tables are useful for smaller samples, as it can be difficult to display the whole data set if you have many records. By presenting the data in tables, you can gain insight into the data, including the type of data, the variable names, and the way the data is organized.
The table helps us to quickly check the contents of the data and browse the features
we have in the data set. Looking at the data table provides an understanding of the
feature name, whether feature names are meaningful, and how the name is related
to other features; it also allows us to check the response variable labels (if you are
performing predictive analytics) and identify data types.
The following output is the descriptive statistics of the stock price data set using the
summary() function in R. The output summary() provides the statistical mean, variance,
first quartile, third quartile, and the other measures described earlier. See Figure 5-4.
> stocks3<-read.csv(header=TRUE,"stocks3.csv")
> summary(stocks3)
Day Stock1 Stock2
Min. : 1.0 Min. :17.22 Min. :19.25
1st Qu.:238.2 1st Qu.:27.78 1st Qu.:35.41
Median :475.5 Median :38.92 Median :49.06
Mean :475.5 Mean :37.93 Mean :43.96
3rd Qu.:712.8 3rd Qu.:46.88 3rd Qu.:53.25
Max. :950.0 Max. :61.50 Max. :60.25
Stock3 Stcok4 Stock5
Min. :12.75 Min. :34.38 Min. :27.75
1st Qu.:16.12 1st Qu.:41.38 1st Qu.:49.66
Median :19.38 Median :43.94 Median :61.75
Mean :18.70 Mean :45.35 Mean :60.86
3rd Qu.:20.88 3rd Qu.:48.12 3rd Qu.:71.84
Max. :25.12 Max. :60.12 Max. :94.12
Stcok6 Stock7 Stock8
Min. :14.12 Min. :58.00 Min. :16.38
1st Qu.:18.00 1st Qu.:65.62 1st Qu.:21.25
Median :25.75 Median :68.62 Median :22.50
Mean :24.12 Mean :70.67 Mean :23.29
3rd Qu.:28.88 3rd Qu.:76.38 3rd Qu.:26.38
Max. :35.25 Max. :87.25 Max. :29.25
Stock9 Stock10 Ratings
Min. :31.50 Min. :34.00 High :174
1st Qu.:41.75 1st Qu.:41.38 Low :431
Median :44.75 Median :46.69 Medium:345
Mean :44.21 Mean :46.99
3rd Qu.:47.62 3rd Qu.:52.12
Max. :53.00 Max. :62.00
In Python, the descriptive statistics of the data can be explored using the pandas describe() function, as shown next. The first step is reading the data set, and the second step is calling describe().
Here is the input:
import pandas as pd

# Read the stock price data set and display the first rows and the summary statistics
stocks = pd.read_csv("stocks3.csv")
stocks.head()
stocks.describe()
5.1.3 Graphs
Graphs represent data visually and provide more details about the data, enabling you
to identify outliers in the data, see the probability distribution for each variable, provide
a statistical description of the data, and present the relationship between two or more
variables. Graphs include bar charts, histograms, box plots, and scatter plots. In addition,
looking at the graphs of multiple variables simultaneously can provide more insights
into the data. There are three types of graphical analysis: univariate, bivariate, and
multivariate.
Univariate analysis analyzes one variable at a time. It is the simplest form of
analyzing data. You analyze a single variable, summarize the data, and find the patterns
in the data. You can use several visualization graphs to perform univariate data analysis,
including bar charts, pie charts, box plots, and histograms.
Bivariate and multivariate data analysis is used to compare relationships between two or more variables in the data. The major purpose of bivariate analysis is to describe the correlation between two variables, compare them, and suggest possible causal relationships, if any, to be investigated further.
If more than two variables are involved, then multivariate analysis is applied. Apart
from the two x- and y-axes, the third and fourth dimensions are distinguished using the color, shape, or size of the plotted points. Beyond four or five dimensions, visualization becomes almost impossible.
Also, visualization is limited to the size of the output device. If you have a huge
amount of data and plot a graph, it may become challenging to interpret the results as
they are cluttered within the plot area. For example, if your plot area is only 10 inches by
10 inches, your data must fit within this area. It may be difficult to understand the plot.
Hence, it is recommended to “zoom” the plot to get a better understanding of it.
5.1.3.1 Histogram
A histogram represents the frequency distribution of the data. Histograms are similar to
bar charts but group numbers into ranges. Also, a histogram lets you show the frequency
distribution of continuous data. This helps analyze the distribution (for example,
normal or Gaussian, uniform, binomial, etc.) and any skewness present in the data.
Figure 5-6 describes the probability density graph of the first variable of the stock price
data, Figure 5-7 describes the histogram, and Figure 5-8 shows both the histogram and
distribution in a single graph.
In Python, we can plot both the histogram and the probability density function of a single variable to check the distribution. Unlike in R, we use two different functions in Python: the plot.hist() function plots the histogram, and the plot.kde() function plots the density, as shown in Figure 5-9. Both should provide the same information, and both figures should look the same.
#Histograms
stocks.Stock1.plot.hist(by=None, bins=15)
stocks.Stock1.plot.kde(bw_method='silverman', ind=1200)
There may be several ways to plot the density functions and histograms. The one shown here is a basic and simple method supported by the matplotlib library. Though
there are many other visualization libraries in Python, we will not be able to discuss all of
them here as that is not the purpose of this book. The objective is to explain the concepts
and provide an understanding of the different plots that are available to explain the data
through visualization and not to explore different libraries.
For example, using the seaborn library, we can combine the histogram and the density function in a single plot, as shown in Figure 5-10.
Figure 5-10. Histogram and density function using the seaborn() library
in Python
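A minimal sketch of how such a combined plot can be produced with seaborn (assuming the stocks data frame loaded earlier; newer seaborn versions provide histplot with a kde option):

import seaborn as sns
import matplotlib.pyplot as plt

# Histogram of Stock1 with a kernel density estimate overlaid
sns.histplot(data=stocks, x="Stock1", kde=True, bins=15)
plt.show()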
In the previous histogram, we can see how the data is spread; in this case, it is "bimodal," as it has two distinct peaks. The stock prices have reached over 120 two times. The data does not follow a normal distribution.
Depending on the data set, if the data is normally distributed, the shape can be “bell”
shaped. If data is “uniform,” then the spread has the same height throughout. If the data
has more peaks, it is called a multimodal distribution. Data also can be skewed, with
most of the data having only high values compared to the others; there is a significant
difference between high-value and low-value data. Depending on the data set, the data
can be left skewed or right skewed, as shown in Figure 5-11 and Figure 5-12. Sometimes,
data distribution can be just random with no proper distinct shape.
5.1.3.2 Box Plots
A box plot or whisker plot is also a graphical description of data. Box plots, created by
John W. Tukey, show the distribution of a data set based on a five-number summary:
minimum, maximum, median, first quartile, and third quartile. Figure 5-13 explains
how to interpret a box plot and its components. It also shows the central tendency;
however, it does not show the distribution like a histogram does. In simple words, a box
plot is a graphical representation of the statistical summary: the data spread within the IQR (interquartile range) and the outliers above the maximum whisker value and below the minimum whisker value. Knowing the outliers in the data is useful information for the
analytical model exercise. Outliers can have a significant influence and impact on the
model. Box plots are also beneficial if we have multiple variables to be compared. For
example, if you have a data set that has sales of multiple brands, then box plots can be
used to compare the sales of different brands.
[Figure 5-13: components of a box plot—the median (the middle of the data set), the IQR between Q1 and Q3, the whiskers extending 1.5 × IQR beyond the quartiles, and the outliers beyond the whiskers]
Tukey (1977) provided the following definition for outliers: values more than 1.5 × IQR above the third quartile or more than 1.5 × IQR below the first quartile.
If the data is normally distributed, then IQR = 1.35 σ, where σ is the population
standard deviation.
The box plot distribution will explain how tightly the data is spread across and
whether it is symmetrical or skewed. Figure 5-14 describes the box plot with respect to a
normal distribution and how symmetrically data is distributed.
[Figure 5-14: a box plot aligned with the normal density curve, showing Q1, Q3, the IQR, the whisker limits at 1.5 × IQR beyond the quartiles, and the corresponding z-scores]
A box plot is positively skewed if the distance from the median to the maximum
is greater than the distance from the median to the minimum. Similarly, a box plot is
negatively skewed if the distance from the median to the minimum is greater than the
distance from the median to the maximum.
One commonly used application of box plots, apart from finding the spread and the outliers, is plotting multiple variables side-by-side to compare their data. Figure 5-15 shows an example. As you can see from the box plots, the three variables have different spreads, their medians are different, and the middle 50 percent of the data (the IQR) covers a different range for each variable. The plot makes it easy to compare the three distributions at a glance.
stocks.Stock1.plot.box()
stocks.boxplot(column=['Stock1','Stock2','Stock3','Stock5'])
5.1.3.3 Bivariate Analysis
The most common data visualization tool used for bivariate analysis is the scatter plot.
Scatter plots can be used to identify the relationships between two continuous variables.
Each data point on a scatter plot is a single observation. All the observations can be
plotted on a single chart.
5.1.3.4 Scatter Plots
Figure 5-17 shows a scatter plot of the number of employees versus revenue (in millions
of dollars) of various companies. As you can see, there is a strong relationship between
the two variables that is almost linear. However, you cannot draw any causal implications
without further statistical analysis. The example shows the scatter plot of the number of
employees on the x-axis and revenues on the y-axis. For every point on the x-axis, there is
a corresponding point on the y-axis. As you can see, the points are spread in proportion
and have a linear relationship. Though not all the points are aligned proportionally, most
points are.
Figure 5-17. A scatter plot of the number of employees versus revenue (in millions
of dollars)
Unfortunately, scatter plots are not always useful for finding relationships. In the
case of our Stocks data set example, it is difficult to interpret any relationship between
Stock1 and Stock2, as shown in Figure 5-18. In Python, we use the plot.scatter()
function for plotting scatter plots.
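For example, a minimal sketch using the stocks data frame loaded earlier:

import matplotlib.pyplot as plt

# Scatter plot of Stock1 against Stock2 using the pandas plotting API
stocks.plot.scatter(x='Stock1', y='Stock2')
plt.show()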
> setwd("E:/Umesh-MAY2022/Personal-May2022/BA2ndEditin/Book Chapters/Chapter 5 - EDA")
> stocks<-read.csv(header=TRUE,"stocks.csv")
> pairs(stocks)
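A rough Python equivalent of R's pairs() can be sketched with pandas (seaborn's pairplot is another option), again assuming the stocks data frame:

import pandas as pd
import matplotlib.pyplot as plt

# Scatter plot matrix of all numeric variables, analogous to pairs(stocks) in R
pd.plotting.scatter_matrix(stocks, figsize=(10, 10), diagonal='hist')
plt.show()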
In this example, the variables are on the diagonal, from the top left to the bottom right, and each variable is plotted against the others. For example, the panel at the intersection of Stock1 and Stock2 plots Stock1 on the x-axis and Stock2 on the y-axis. Similarly, the panel for Stock8 versus Stock9 plots Stock8 on the x-axis and Stock9 on the y-axis. If there is any correlation between two variables, it can be seen from the plot; for example, if two variables have a linear relationship, the data points fall roughly along a straight line.
5.1.4.1 Correlation Plot
Correlation is a statistical measure that expresses the relationship between two variables.
The correlation coefficient, referred to as Pearson correlation coefficient (r), quantifies
the strength of the relationship. When you have two variables, X and Y, if the Y variable
tends to increase corresponding to the increase in the X variable, then we say we have a
positive correlation between the variables. When the Y variable tends to decrease as the
X variable increases, we say there is a negative correlation between the two variables.
One way to check the correlation pattern is to use a scatter plot, and another way is to
use a correlation graph, as shown in Figure 5-21.
In this example, a blue dot represents a positive correlation, and red represents a
negative correlation. The larger the dot, the stronger the correlation. The diagonal dots
(from top left to bottom right) are positively correlated because each dot represents the
correlation of each attribute with itself.
In Python, the same can be drawn with the help of the matplotlib library, as shown in Figure 5-22. As you can see from the figure, the Pearson coefficient, r, is 0.82 between Stock5 and Stock2 and 0.882 between Stock7 and Stock4.
Here is the input:
corel = stocks.corr()
corel.style.background_gradient(cmap='coolwarm').set_precision(3)
5.1.4.2 Density Plots
A probability density function of each variable can be plotted as a function of the class.
Density plots are used to show the distribution of data. They can also be used to compare
the separation by class. For example, in our stock price example, each stock is rated as
high, low, or medium. You can compare the probability density of each stock price with
respect to each class. Similar to scatter plot matrices, a density featureplot() function
can illustrate the separation by class and show how closely they overlap each other, as
shown in Figure 5-23. In this example (of the stock price data set) shown in Figure 5-23,
some stock prices overlap very closely and are hard to separate.
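A rough Python sketch of this kind of class-separated density view, using pandas groupby on the Ratings column of the stock data (column name taken from the summary output shown earlier):

import matplotlib.pyplot as plt

# One density curve of Stock1 per Ratings class (High/Low/Medium) on the same axes
fig, ax = plt.subplots()
for label, group in stocks.groupby('Ratings'):
    group['Stock1'].plot.kde(ax=ax, label=label)
ax.legend()
plt.show()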
There are many libraries and packages available to support plots. There are many
extensions of the basic plots to get more details, but essentially they all have similar
characteristics of describing data. For example, the ggplot2 library supports graphical description of data in three or four dimensions with the help of different shapes (triangle, circle, square, etc.), colors, and also the size of the plotted objects as an indication of additional dimensions. It is not in the scope of this book to cover all the libraries available
and all the different representations. Our intention is to provide enough insights on
fundamental graphical tools available to perform analytics that facilitate developing
better models. Depending on the data you are analyzing and depending on the
characteristics of the variable, you can select a graphical method. No single technique is
considered as the standard.
The process of EDA can be summarized with the help of Figure 5-26. Depending
on the type of variables, you can choose different techniques, as shown in Figure 5-26.
Data can be of two types, either numerical or categorical. If you are performing analysis
on one variable, then it is a univariate analysis; otherwise, it is a bivariate (or multivariate) analysis. For a numerical variable, you will describe the various statistical parameters and also plot different graphs such as histograms, box plots, etc. If the variable is categorical, then you have a bar plot, pie chart, etc., and can use simple measures such as counting the different categories and checking their proportions in percent.
5.3 Chapter Summary
In this chapter, we focused on the fundamental techniques of exploring data and its
characteristics. We discussed both graphical methods and tabulated data.
We covered both univariate and bivariate analysis, including histograms, box plots, density plots, scatter plots, and correlation plots.
We also discussed how to visualize categorical data and its characteristics using
bar charts.
CHAPTER 6
Evaluating Analytics Model Performance
There are several ways to measure the performance of different models. This chapter
discusses the various measures to test a model’s performance, including regression,
classification, and clustering methods.
6.1 Introduction
After we create a model, the next step is to measure its performance. We have different
measures for classification models and different measures for regression models.
Evaluating a model’s performance is a key aspect of understanding how accurately your
model can predict when applying the model to new data. Though several measures have
been used, we will cover only the most commonly used and popular measures. When
we have to predict a numerical value, we use regression. When we predict a class or
category, we use the classification model.
6.2.1 Root-Mean-Square Error
The root-mean-square error (RMSE) is given by the following formula:

RMSE = √( (1/n) × Σ (yk − ŷk)² )

Here, yk is the actual value for the kth sample, ŷk is the corresponding predicted value, and n is the total number of samples.
If the model has a good fit, then the error should be less. This is measured by RMSE,
and the lower the value of RMSE, the better the model “fit” is to the given data. The
RMSE value can vary from one data set to another, and there is no standard RMSE value
to say this value is “good” or “bad.”
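A minimal Python sketch of the RMSE computation for a handful of illustrative actual and predicted values:

import numpy as np

# Illustrative actual and predicted values
y_actual = np.array([3.0, 5.0, 7.5, 10.0])
y_pred   = np.array([2.5, 5.5, 7.0, 11.0])

# RMSE: square the errors, average them, then take the square root
rmse = np.sqrt(np.mean((y_actual - y_pred) ** 2))
print(rmse)   # about 0.66 for these values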
6.2.5 R2 (R-Squared)
R2 is called the coefficient of determination and is a measure that explains how close your predicted values are to the actual values. In other words, it explains how much of the variation in the data is captured by your regression line; R2 describes the proportion of variance explained. The R2 value varies from 0 to 1. The higher the value of R2, the better the fit and the lower the unexplained variance for a given model. If the regression model is perfect, SSE is zero, and R2 is 1. R2 compares the distance of the estimated values from the mean with the distance of the actual values from the mean. It is given by the following formula:
R2 = SSR / SST = (SST − SSE) / SST
Here, SST is the total sum of squares, SSE is the sum of squared errors (residuals), and SSR is the regression (explained) sum of squares. You can refer to any statistics book for the derivation of the equations.
6.2.6 Adjusted R2
The problem with R2 is that its value never decreases as you add more predictor variables, even when the added variables contribute little; hence you are always tempted to add more data and more variables to make the fit look better. Adjusted R2 is an adjustment to R2 to overcome this situation. Adjusted R2 takes into account the number of predictor variables in the data set and penalizes variables that do not improve the model.
Adjusted R2 = 1 − [ (1 − R2) × (n − 1) / (n − k − 1) ]

Here, n is the number of samples and k is the number of predictor variables.
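A minimal Python sketch computing R2 and adjusted R2 from illustrative actual and predicted values (with an assumed number of predictors k):

import numpy as np

y_actual = np.array([3.0, 5.0, 7.5, 10.0, 12.0, 15.0])
y_pred   = np.array([2.8, 5.4, 7.0, 10.5, 12.5, 14.3])

sse = np.sum((y_actual - y_pred) ** 2)            # sum of squared errors
sst = np.sum((y_actual - y_actual.mean()) ** 2)   # total sum of squares
r2 = 1 - sse / sst                                # R-squared = (SST - SSE) / SST

n, k = len(y_actual), 2                           # assume the model used k = 2 predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(r2, adj_r2)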
All the previous measures can be influenced by outliers. Whether to keep or remove an outlier is something you have to decide as part of exploratory data analysis. These measures are used to compare models and assess the degree of prediction accuracy; the best model need not fit the training data perfectly.
In Figure 6-1, the total predicted positive class is 120, and the total predicted
negative class is 120. However, the actual positive and negative classes are different. The
actual positive class is only 110, and the negative class is 130. Therefore, the predictive
classification model has incorrectly predicted a class of 10 values and thus has resulted
in a classification error.
Further, if the actual class is yes and the predicted class is also yes, then it is
considered a true positive (TP); if the actual class is yes and the predicted class is no, then
it is a false negative (FN). Similarly, if the actual class is no and predicted class is also
no, then it is referred to as a true negative (TN). Finally, if the actual class is no and the
predicted is yes, it is called a false positive (FP).
Using the contingency table shown in Figure 6-2, we can calculate how well the
model has performed. If you want to calculate the accuracy of this model, then just add
up the true positives and true negatives and divide by the total number of values. If “a” is
true positive, “b” is false negative, “c” is false positive, and “d” is true negative, as shown
in Figure 6-2, then the accuracy is calculated as follows:
Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
For the previous example, the accuracy of the model is (80 + 90)/240 × 100 = 70.8%.
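As a quick check of this calculation in Python, using cell counts consistent with the totals quoted in the text (TP = 80, TN = 90, predicted positives 120, actual positives 110, giving FP = 40 and FN = 30):

# Confusion-matrix counts implied by the worked example
tp, tn, fp, fn = 80, 90, 40, 30

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(round(accuracy * 100, 1))   # 70.8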
and if it is predicted as a positive class instead, it is counted as FP. This can be explained
using the confusion matrix shown in Figure 6-2. This matrix forms the basis for many
other common measures used in machine learning classifications.
The true positive rate (also called the hit rate or recall) is estimated as follows:

tp rate (recall) = TP / (TP + FN)
Similarly, the false positive rate (fp rate), also referred to as the false alarm rate, is estimated as follows:

fp rate = FP / (FP + TN)
Sensitivity is the metric that measures a model’s ability to predict true positives of
each available category.
Specificity is the metric that measures a model’s ability to predict true negatives of
each available category.
Specificity = True negatives / (False positives + True negatives) = TN / (FP + TN) = 1 − fp rate

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)
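These metrics can also be computed directly with scikit-learn from actual and predicted labels; a minimal sketch on illustrative labels:

from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Illustrative actual and predicted class labels (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = precision_score(y_true, y_pred)      # TP / (TP + FP)
recall    = recall_score(y_true, y_pred)         # TP / (TP + FN), the true positive rate
specificity = tn / (tn + fp)                     # 1 - false positive rate
print(precision, recall, specificity)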
6.4 ROC Chart
A receiver operating characteristics (ROC) graph represents the performance of a classifier. It is a technique to visualize the performance and accordingly choose the best
classifier. The ROC graph was first adopted in machine learning by Spackman (1989),
who demonstrated how ROC can be used to evaluate and compare different classification
algorithms. Now, most of the machine learning community is adopting this technique
to measure the performance of a classifier, not just relying on the accuracy of the
model (Provost and Fawcett, 1997; Provost et al., 1998). ROC shows a relation between
the sensitivity and the specificity of the classification algorithm. A receiver operating
characteristic (ROC) graph is a two-dimensional graph of TP rate versus FP rate. It is a
plot of the true-positive rate on the y-axis and the false-positive rate on the x-axis.
The true positive rate should be higher for a good classifier model, as shown in Figure 6-3. The area under the curve (AUC) indicates how well the classifier is performing. As we can see from the graph, for a classifier to perform well, it should have a higher true positive rate than false positive rate. The false positive rate should stabilize over the test instances. In Figure 6-3, we have plotted the ROC for three different models, and the AUC for the first classifier is the highest. Typically, AUC should fall between 0.5 and 1.0. In ideal conditions, when the separation of the two classes is perfect and the distributions do not overlap, the area under the ROC curve reaches 1. An AUC less than 0.5 might indicate that the model is not performing well and needs attention.
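A minimal scikit-learn sketch of building an ROC curve and computing its AUC from predicted class probabilities (illustrative synthetic data):

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=500, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = LogisticRegression().fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]          # probability of the positive class

fpr, tpr, _ = roc_curve(y_test, probs)             # false and true positive rates
print("AUC:", roc_auc_score(y_test, probs))

plt.plot(fpr, tpr)                                 # the ROC curve
plt.plot([0, 1], [0, 1], linestyle='--')           # the 0.5-AUC diagonal for reference
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()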
test data. This gives a false confidence that your model has performed well and you will
tend to take far more risk using the model, which can leave you in a vulnerable situation.
When an overfitting or underfitting situation arises, you must revisit the data
split (training and test data) and reconstruct the model. Also, use a k-fold validation
(explained under section 6.6) mechanism and take the average or the best model. To
reduce the bias error, one should repeat the model-building process by resampling the
data. The best way to avoid overfitting is to test the model on data that is entirely outside
the scope of your training data. This gives you confidence that you have a representative
sample that is part of the production data. In addition to this, it is always good to
revalidate the model periodically to determine whether your model is degrading or
needs improvement.
Bias and variance are tricky situations. If we minimize bias, then we may end up
overfitting the data and end with high variance. If we minimize the variance, then we
end up underfitting the model with high bias. Thus, we need to make a compromise
between the two. Figure 6-4 depicts the bias-variance trade-off. As model complexity increases, bias decreases while variance increases; the optimal model complexity is the one that minimizes the total error.
[Figure 6-4: the bias-variance trade-off—error versus model complexity, showing the variance curve, the bias² curve, and the optimal model at the minimum total error]
High bias means the model has a high prediction error. This results in unexpected
model behavior for the test data (underfitting). Bias can be reduced by the following:
High variance means the model is picking up noise and the model is overfitting. The
following techniques may optimize the model and reduce the variance:
6.6 Cross-Validation
The goal of the supervised machine learning model is to predict the new data as
accurately as possible. To achieve this, we divide the sample data into two sets. The first
data set is called the training data set and is used to train the model. The second data
set, called test data (sometimes referred to as validation set), is used to test the model
performance. The model predicts the class on the test data. The test data already has the
actual class that is compared to what is being predicted by the model. The difference
between the actual and predicted values gives an error and thus measures the model’s
performance. The model should perform well on both the training and the test data sets. Also, the model should show the same behavior even if it is tested on a different set of data. Otherwise, as explained in earlier sections, this will result in overfitting or underfitting. To overcome this problem, we use a method called cross-validation.
Cross-validation is a sampling technique that uses different portions of the data to
train and test the model on different iterations. The goal of the cross-validation is to test
the model’s ability to predict on unseen data that was not used in constructing the model
before and give insight into how the model can behave on independent data. The cross-
validation process involves dividing the sample data into k-sets rather than just two sets.
For example, for a 10-fold data set, the first model is created using 9-folds of smaller
training sets, and the model is tested with 1-fold test data set. In the next iteration,
the test data set is shuffled with the other nine sets of training data, and one of the
training data sets becomes test data. This process repeats until all the k-sets are covered
iteratively. This process gives the entire sample data a chance to be part of the model-building exercise, thus avoiding the overfitting and underfitting problems. The final model is either the average of all the k models or the best model out of the k iterations.
Figure 6-5 describes the k-fold validation technique. Typically, the k-value can range
anywhere from 5 to 10.
[Figure 6-5: k-fold cross-validation—the data is split into k folds; in each iteration one fold is the test set and the remaining folds form the training set]
The first step is to divide the data randomly into k folds. In the first iteration, select one of the k folds as the test data and the remaining (k − 1) folds for training the model. In the next iteration, select a different fold as the test set from the previous (k − 1) folds, include the previous test set as part of the new training data, and build the model. Finally, repeat this process over the entire k folds. This allows the entire data set to be part of the model-building exercise, thus reducing any overfitting or underfitting problems.
The process of k-fold validation is as follows (a minimal Python sketch follows these steps):
1. Split the data set into k folds. The suggested value is k = 10.
2. For each fold in the data set, build your model on k – 1 folds and test the model to check its effectiveness on the left-out fold.
3. Record the prediction error you observe on the left-out fold.
4. Repeat the steps k times so that each of the k folds serves as part of the test set.
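A minimal scikit-learn sketch of these steps, using cross_val_score to build and score the model on each fold (illustrative synthetic data):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# 10-fold cross-validation: train on 9 folds, score on the left-out fold, repeat 10 times
scores = cross_val_score(LogisticRegression(), X, y, cv=10)
print(scores)          # one score per fold
print(scores.mean())   # average performance across the k folds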
6.8 Chapter Summary
In this chapter, we discussed how to measure the performance of a regression model
and a classification model.
In regression, we discussed what R2, adjusted R2, and RMSE are. How we measure
each one is important. We also discussed mean absolute percentage error and mean
absolute error.
In classification, we discussed precision, recall, F-score, calculating the accuracy
of the model, and sensitivity analysis. We also discussed ROC curves and the area
under curve.
We discussed the overfitting and underfitting of the model and the trade-off between
the two. We also talked about techniques to solve such challenges when we build
the models.
Finally, we mentioned cross validation and the various measures available for the
clustering analysis.
PART II
CHAPTER 7
Simple Linear Regression
employees. The compensation and benefits structure may include salary structure—
that is, how good the salaries are compared to those in other similar organizations or
other organizations in the industry, whether there are bonus or additional incentives
for higher performance or additional perks, etc. To drive home the point, there may
be multiple factors that influence a particular outcome or that are associated with
a particular outcome. Again, each one of these may in turn be associated with or
influenced by other factors. For example, salary structure may influence the work
environment or satisfaction levels of the employees.
Imagine you are a property developer as well as a builder. You are planning to
build a huge shopping mall. The prices of various inputs required such as cement, steel,
sand, pipes, and so on, vary a lot on a day-to-day basis. If you have to decide on the sale
price of the shopping mall or the price of rent that you need to charge for the individual
shops, you need to understand the likely cost of building. For this information, you may
have to consider the periodic change in the costs of these inputs (cement, steel, sand,
etc.) and what factors influence the price of each of these in the market.
You may want to estimate the profitability of the company, arrive at the best possible
cost of manufacturing of a product, estimate the quantum of increase in sales, estimate
the attrition of the company so that you can plan well for recruitment, decide on the
likely cost of the shopping mall you are building, or decide on the rent you need to
charge for a square foot or a square meter. In all these cases you need to understand
the association or relationship of these factors with the ones that influence, decide, or
impact them. The relationship between two factors is normally explained in statistics
through correlation or, to be precise, the coefficient of correlation (i.e., R) or the coefficient
of determination (i.e., R2).
The regression equation depicts the relationship between a response variable (also
known as the dependent variable) and the corresponding independent variables. This
means that the value of the dependent variable can be predicted based on the values
of the independent variables. When there is a single independent variable, then the
regression is called simple regression. When there are multiple independent variables,
then the regression is called multiple regression. Again, the regressions can be of two
types based on the relationship between the response variable and the independent
variables (i.e., linear regression or nonlinear regression). In the case of linear regression,
the relationship between the response variable and the independent variables is
explained through a straight line, and in the case of a nonlinear relationship, the
relationship between the response variable and independent variables is nonlinear
(polynomial like quadratic, cubic, etc.).
Normally we may find a linear relationship between the price of the house and the
area of the house. We may also see a linear relationship between salary and experience.
However, if we take the relationship between rain and the production of grains, the
production of the grains may increase with moderate to good rain but then decrease
if the rain exceeds the level of good rain and becomes extreme rain. In this case, the
relationship between the quantum of rain and the production of food grains is normally
nonlinear; initially food grain production increases and then reduces.
Regression is a supervised method because we know both the exact values of the
response (i.e., dependent) variable and the corresponding values of the independent
variables. This is the basis for establishing the model. This basis or model is then used
for predicting the values of the response variable where we know the values of the
independent variable and want to understand the likely value of the response variable.
7.2 Correlation
As described in earlier chapters, correlation explains the relationship between two
variables. This may be a cause-and-effect relationship or otherwise, but it need not
always be a cause-and-effect relationship. However, variation in one variable can be
explained with the help of the other parameter when we know the relationship between
two variables over a range of values (i.e., when we know the correlation between two
variables). Typically, the relationship between two variables is depicted through a scatter
plot as explained in earlier chapters.
Attrition is related to the employee satisfaction index. This means that attrition
is correlated with “employee satisfaction index.” Normally, the lower the employee
satisfaction, the higher the attrition. Also, the higher the employee satisfaction, the lower
the attrition. This means that attrition is inversely correlated with employee satisfaction.
In other words, attrition has a negative correlation with employee satisfaction or is
negatively associated with employee satisfaction.
Normally the profitability of an organization is likely to go up with the sales
quantum. This means the higher the sales, the higher the profits. The lower the sales,
the lower the profits. Here, the relationship is that of positive correlation as profitability
increases with the increase in sales quantum and decreases with the decrease in sales
quantum. Here, we can say that the profitability is positively associated with the sales
quantum.
Normally, the fewer the defects in a product, or the faster the response to issues, the higher the customer satisfaction of any company will be. Here, customer satisfaction
is inversely related to defects in the product or negatively correlated with the defects in
the product. However, the same customer satisfaction is directly related to or positively
correlated with the speed of response.
Correlation explains the extent of change in one of the variables given the unit
change in the value of another variable. Correlation assumes a very significant role in
statistics and hence in the field of business analytics as any business cannot make any
decision without understanding the relationship between various forces acting in favor
of or against it.
Strong association or correlation between two variables enables us to better predict
the value of the response variable from the value of the independent variable. However,
a weak association or low correlation between two variables does not help us to predict
the value of the response variable from the value of the independent variable.
7.2.1 Correlation Coefficient
Correlation coefficient is an important statistical parameter of interest that gives us a
numerical indication of the relationship between two variables. This will be useful only
in the case of linear association between the variables. This will not be useful in the case
of nonlinear associations between the variables.
It is easy to compute the correlation coefficient. To compute it, we require the
following:
Once we have these values, we need to convert each value of each variable into standard units, that is, subtract the variable's mean from each value and divide by its standard deviation.
Once we have converted each value of each variable into standard units, the correlation coefficient (normally depicted as r or R) is calculated as the average of the products of the corresponding standard units of the two variables.
The correlation coefficient can also be found using the following equivalent formula: r = Σ(x − x̄)(y − ȳ) / √( Σ(x − x̄)² × Σ(y − ȳ)² )
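A minimal Python sketch of the standard-units approach just described (illustrative x and y values; numpy's corrcoef gives the same result):

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([65.0, 60.0, 52.0, 45.0, 40.0])

# Convert each variable into standard units (subtract the mean, divide by the SD)
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

# The correlation coefficient is the average of the products of the standard units
r = np.mean(zx * zy)
print(r, np.corrcoef(x, y)[0, 1])   # both give the same value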
correl_attri_empsat
Figure 7-1B. Scatter plot between employee satisfaction index and attrition
As shown in the scatter plot in Figure 7-1B, even though the relationship is not
linear, it is near linear. This is shown by the correlation coefficient of -0.983. As you can
see, the negative sign indicates the inverse association or negative association between
attrition percentage and employee satisfaction index. The previous plot shows that the
deterioration in the employee satisfaction leads to an increased rate of attrition.
Further, the correlation test shown in Figure 7-2B confirms that there is an excellent
statistically significant correlation between attrition and the employee satisfaction index.
Figure 7-2A provides the code used to generate this.
#We will now test and confirm whether there exists a statistically significant correlation
Figure 7-2A. Code to test correlation using Spearman’s rank correlation rho
Spearman’s rho is the measure of the strength of the association between two
variables. The Spearman correlation shows whether there is significant correlation
(significantly monotonically related). The null hypothesis for the Spearman’s rank
correlation test is “true rho is equal to zero,” and the alternative hypothesis is “true rho
is not equal to zero.” In the previous test, notice that the p-value is very low and less
than the significance level of 0.05. Hence, there is a significant relationship between
these two specified variables, and we can reject the null hypothesis and uphold that the
true rho is not equal to zero. The estimated rho value returned by the test is -1, which indicates a clear and perfect negative correlation between the two variables, i.e., attrition and empsat.
Please note, the previous data is illustrative only and may not be representative of a
real scenario. It is used for the purpose of illustrating the correlation. Further, in the case
of extreme values (outliers) and associations like nonlinear associations, the correlation
coefficient may be very low and may depict no relationship or association. However,
there may be a real and good association among the variables.
7.3 Hypothesis Testing
At this point in time, it is apt for us to briefly touch upon hypothesis testing. This is one
of the important aspects in statistics. In hypothesis testing we start with an assertion or
claim or status quo about a particular population parameter of one or more populations.
This assertion or claim or status quo is known as the null hypothesis, or H0. An example
of the null hypothesis may be a statement like the following: the population mean of
population 1 is equal to the population mean of population 2. There is another statement
known as the alternate hypothesis, or H1, which is opposite to the null hypothesis. In our
example, the alternate hypothesis specifies that there is a significant difference between
the population mean of population 1 and the population mean of population 2. A level
of significance, or Type I error rate, of normally 0.05 is specified. This is the probability that the null hypothesis is rejected when it is actually true. This is represented by the symbol
α. The smaller the value of α, the smaller the risk of a Type I error.
Then we decide the sample size required to reduce the errors.
We use test statistics to either reject the null hypothesis or not reject the null
hypothesis. When we reject the null hypothesis, it means that the alternate hypothesis is
true. However, if we do not reject the null hypothesis it does not mean that the alternate
hypothesis is true. It only shows that we do not have sufficient evidence to reject the null
hypothesis. Normally, the t-value is the test statistic used.
Then we use the data and arrive at the sample value of the test statistic. We then
calculate the p-value on the basis of the test statistic. The p-value is the probability that
the test statistic is equal to or more than the sample value of the test statistic when the
null hypothesis is true. We then compare the p-value with the level of significance (i.e.,
α). If the p-value is less than the level of significance, then the null hypothesis is rejected.
This also means that the alternate hypothesis is accepted. If the p-value is greater than or
equal to the level of significance, then we cannot reject the null hypothesis.
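A minimal SciPy sketch of this procedure for comparing two population means, using a two-sample t-test on illustrative data:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample1 = rng.normal(loc=50, scale=5, size=40)   # sample from population 1
sample2 = rng.normal(loc=53, scale=5, size=40)   # sample from population 2

# H0: the two population means are equal; H1: they are different
t_stat, p_value = stats.ttest_ind(sample1, sample2)

alpha = 0.05
print(p_value, "reject H0" if p_value < alpha else "fail to reject H0")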
The p-value is used (among many other uses in the field of statistics) to validate
the significance of the parameters to the model in the case of regression analysis. If the
p-value of any parameter in the regression model is less than the level of significance
(typically 0.05), then we reject the null hypothesis that there is no significant
contribution of the parameter to the model, and we accept the alternate hypothesis
that there is significant contribution of the parameter to the model. If the p-value of a
parameter is greater than or equal to the level of significance, then we cannot reject the
null hypothesis that there is no significant contribution of the parameter to the model.
We include in the final model only those parameters that have significance to the model.
7.4.1 Assumptions of Regression
There are four assumptions of regression. These need to be fulfilled if we need to rely
upon any regression equation.
Y1 = β0 + β1 x1
In the previous equation, β0 is known as the intercept, and β1 is known as the slope of
the regression line. Intercept is the value of the response variable when the value of the
independent variable (i.e., x) is zero. This depicts the point at which the regression line
touches the y-axis when x is zero. The slope can be calculated easily using the following
formula: (R × Standard deviation of the response variable) / (Standard deviation of the
independent variable).
From this, you can see that when the value of the independent variable increases by
one standard deviation, the value of the response variable increases by R × one standard
deviation of the response variable, where R is the coefficient of correlation.
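A quick numeric check of this relationship in Python—slope = R × (SD of the response) / (SD of the predictor), with the intercept chosen so the line passes through the means (illustrative values):

import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])   # independent variable
y = np.array([14.0, 25.0, 33.0, 46.0, 54.0])   # response variable

r = np.corrcoef(x, y)[0, 1]
slope = r * y.std(ddof=1) / x.std(ddof=1)      # beta1 = R * SDy / SDx
intercept = y.mean() - slope * x.mean()        # beta0, so the line passes through the means

print(slope, intercept)
print(np.polyfit(x, y, 1))                     # the least-squares fit gives the same slope and intercept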
lm() stands for “linear model” and takes the response variable, the independent
variable, and the dataframe as input parameters.
Let’s look at a simple example to understand how to use the previous command.
Assuming the competence or capability of the sales personnel is equal, keeping the sale
restricted to one single product, we have a set of data that contains the number of hours
of effort expended by each salesperson and the corresponding number of sales made.
Figure 7-3A shows the code to import the data from a file; it lists the contents of the
dataframe and the summary of the contents of the dataframe.
cust_df <- read.table("cust1.txt", header = TRUE, sep = ",")
#In the above read statement you can alternatively use read.csv() or read.csv2()
#(adjust the path to where cust1.txt is stored on your machine)
cust_df
summary(cust_df)
Figure 7-3A. Code to read data, list the contents of the dataframe and display
the summary
Figure 7-3B shows the output of the code provided in Figure 7-3A.
Figure 7-3B. Creating a data frame from a text file (data for the examples)
In Figure 7-3, we have imported a table of data containing 21 records with the
Sales_Effort and Product_Sales from a file named cust1.txt into a data frame named
cust_df. The Sales_Effort is the number of hours of effort put in by the salesperson
during the first two weeks of a month, and the Product_Sales is the number of product
sales closed by the salesperson during the same period. The summary of the data is
also shown.
In this data we can treat Product_Sales as the response variable and Sales_Effort
as the independent variable as the product sales depend upon the sales effort put in
place by the salespeople.
We will now split the data into Train_Data used for the model generation and
Test_Data used for validating the model generated. Then, we will run the simple linear
regression to model the relationship between Product_Sales and Sales_Effort using
the lm(response variable ~ independent variable, data = dataframe name)
command of R. We will use the Train_Data as the input dataframe. Figure 7-4A shows
the code.
#Now, we will split the data into two separate sets Train_Data, Test_Data
#One for the model generation purposes, another for testing the model generated
#As our data set has very limited records we will be utilizing 90% of it for training
size = sort(sample(nrow(cust_df), nrow(cust_df)*0.9))
Train_Data = cust_df[size, ]
Test_Data = cust_df[-size, ]
#Printing Train_Data
Train_Data
#Printing Test_Data
Test_Data
#Generating the simple linear regression model on the training data
mod_simp_reg = lm(Product_Sales ~ Sales_Effort, data = Train_Data)
Figure 7-4A. Code to split the dataset into two separate dataframes and to
generate the linear regression model
Figure 7-4B. Splitting the data into train and test sets and generating a simple
linear regression model in R
The output clearly shows the split of the data into two sets, Train_Data and Test_Data, with 18 and 3 records, respectively. We then generated the simple linear regression model named mod_simp_reg. The lm() command has generated the model but does not print any output because the result has been assigned to mod_simp_reg. Running the summary(mod_simp_reg) command outputs the summary of the simple linear regression model.
Figure 7-5A shows the code.
summary(mod_simp_reg)
Figure 7-5A. Code to generate the summary of the simple linear regression model
generated
The initial part of the summary shows which element of the data frame is regressed
against which other element and the name of the data frame that contained the data to
arrive at the model.
Residuals depict the difference between the actual value of the response variable
and the value of the response variable predicted using the regression equation. The
maximum residual is shown as 0.19455. The spread of residuals is provided here by
specifying the values of min, max, median, Q1, and Q3 of the residuals. In this case,
the spread is from -0.14061 to +0.19455. As the principle behind the regression line
and regression equation is to reduce the error or difference, the expectation is that the
median value should be very near to 0. As you can see, here the median value is -0.02474,
which is almost equal to 0. The prediction error can go up to the maximum value of the
residual. As this value (i.e., 0.19455) is very small, we can accept this residual.
The next section specifies the coefficient details. Here β0 is specified by the intercept
estimate (i.e., 0.0688518), and β1 is specified by the Sales_Effort estimate (0.0979316).
Hence, the simple linear regression equation is as follows:
Product_Salesi = 0.0688518 + 0.0979316 × Sales_Efforti
The value next to the coefficient estimate is the standard error of the estimate, which specifies the uncertainty of the estimate. Then comes the "t" value, which specifies how large the coefficient estimate is relative to its standard error. The next value, Pr(>|t|), is the probability of observing an absolute t-value at least this large purely by chance when the parameter has no real effect. Ideally, this probability (popularly known as the p-value) should be very small (like 0.001, 0.005, 0.01, or 0.05) for the relationship between the response variable and the independent variable to be significant. As the p-value of the coefficient of Sales_Effort here is very small (i.e., almost 0, <2e-16), we reject the null
hypothesis that there is no significance of the parameter to the model and accept the
alternate hypothesis that there is significance of the parameter to the model. Hence, we
conclude that there is a significant relationship between the response variable Product_
Sales and the independent variable Sales_Effort. The number of asterisks (*s) next to
the p-value of each parameter specifies the level of significance. Please refer to “Signif.
codes” in the model summary as given in Figure 7-5B.
The next section shows the overall model quality-related statistics.
• The residual standard error is the square root of the sum of the squares of the residuals divided by the degrees of freedom (in our case 16), as specified in the summary. This is 0.1044 and is very low, as required by us.
• The multiple R-squared value shown here is just the square of the correlation coefficient (i.e., R). Multiple R-squared is also known as the coefficient of determination. The adjusted R-squared value adjusts R-squared for the number of parameters in the model so as to avoid overfitting, and hence we rely more on the adjusted R-squared value than on the multiple R-squared value. Here the multiple R-squared is 0.9989 and the adjusted R-squared is 0.9988, which are very high and show an excellent relationship between the response variable Product_Sales and the independent variable Sales_Effort. (The short sketch after this list shows how to read these values directly from the summary object.)
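Should you prefer to read these quantities programmatically rather than off the printed summary, they are available as components of R's summary.lm object, shown here for the model built earlier:
summary(mod_simp_reg)$sigma          #residual standard error
summary(mod_simp_reg)$r.squared      #multiple R-squared
summary(mod_simp_reg)$adj.r.squared  #adjusted R-squared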
7.4.4.1 Test of Linearity
To test the linearity, we plot the residuals against the corresponding fitted values.
Figure 7-6B depicts this.
Figure 7-6A shows the simple code.
#Test of Linearity
plot(mod_simp_reg)   #produces the standard diagnostic plots; the first is residuals vs. fitted values
Figure 7-6A. Code to generate the plot of the model to test linearity
For the model to pass the test of linearity, we should not have any pattern in the
distribution of the residuals, and they should be randomly placed around the 0.0
residual line. That is, the residuals will be randomly varying around the mean of the
value of the response variable. In our case, as we can see there are no patterns in the
distribution of the residuals. Hence, it passes the condition of linearity.
7.4.4.2 Test of Independence of Errors
This assumption requires that the errors (residuals) around the regression line are independent of one another; it is especially relevant for data collected over time, where there is a possibility of autocorrelation. Further, as shown in Figure 7-6B, the residuals are distributed randomly around the mean value of the response variable. If we need to test for autocorrelation, we can use the Durbin-Watson test.
Figure 7-7A gives the code.
library(lmtest)
dwtest(mod_simp_reg)
Figure 7-7A. Code to test the independence of errors around the regression line
In the case of the Durbin-Watson test, the null hypothesis (i.e., H0) is that there is no
autocorrelation, and the alternative hypothesis (i.e., H1) is that there is autocorrelation. If
the p-value is < 0.05, then we reject the null hypothesis—that is, we conclude that there is
autocorrelation. In the previous case, the p-value is greater than 0.05, which means that
there is no evidence to reject the null hypothesis that there is no autocorrelation. Hence,
the test of independence of errors around the regression line passes. Alternatively for
this test, you can use the durbinWatsonTest() function from library(car).
7.4.4.3 Test of Normality
As per this test, the residuals should be normally distributed. To check on this, we will
look at the normal Q-Q plot (one among the many graphs created using the plot(model
name) command), as shown in Figure 7-8.
Figure 7-8. Test for assumption of “normality” using normal Q-Q plot
Figure 7-8 shows that the residuals are almost on the straight line in the above
normal Q-Q plot. This shows that the residuals are normally distributed. Hence, the
normality test of the residuals passes.
library(gvlma)
gv_model <- gvlma(mod_simp_reg)
summary(gv_model)
Figure 7-9A. Using the gvlma() function to validate the model assumptions
Figure 7-9B. Output of the gvlma() with the linear regression assumptions
validated
In Figure 7-9B, we have given the output of running the gvlma() function on our model. The Global Stat line clearly shows that the assumptions related to this regression model are acceptable. Here, we need to check whether the p-value is greater than 0.05. If the p-value is greater than 0.05, then we can safely conclude, as shown above, that the assumptions are validated. If the p-value is less than 0.05, then we need to revisit the regression model.
In Figure 7-10, as the points are spread in a random fashion around the near
horizontal line, this assures us that the assumption of constant variance (or
homoscedasticity) is fulfilled.
Figure 7-11. Plot using the crPlots() function to validate the linearity assumption
7.4.5 Conclusion
As shown, the simple linear regression model fitted using the R function lm(response
variable ~ independent variable, data = dataframe name) representing the
simple linear regression equation, namely, Product_Salesi = 0.0688518 + 0.0979316
Sales_Efforti, is a good model as it passes the tests to validate the assumptions of the
regression too. A note of caution here is that there are various ways the regression
equation may be created and validated.
The values of the response variable are predicted using the function predict(model name, newdata), where the model name is the name of the model arrived at from the input data (i.e., Train_Data in our case) and newdata contains the independent variable data for which the response variable has to be predicted. In our case, we used the Test_Data we had earmarked for the model validation. However, we can create any new dataframe with new Sales_Effort values and use it as the newdata input to obtain the prediction.
As shown, by passing the additional parameter interval = "prediction" to the predict() function, we have also obtained a prediction interval, which specifies the range within which the prediction is expected to fall. By default, this uses a confidence level of 0.95.
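The call described here takes the following form (a minimal sketch; the book's corresponding listing is not shown in this excerpt):
#Predict Product_Sales for the held-out Test_Data, along with a prediction interval
predict(mod_simp_reg, newdata = Test_Data, interval = "prediction")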
As you can see from the predicted values, if we round off the predicted Product_Sales values to the nearest whole numbers, they match the Test_Data values shown in Figure 7-4B. This suggests that the model is working well on the prediction.
7.4.7 Additional Notes
It may be observed from the model fitted earlier that the intercept is not zero, but it is
0.0688518. Actually, when there is no sales effort, ideally, there should not be any sales.
But this may not be so; there may be some walk-in sales possible because of the other
means such as advertisements, websites, etc. Similarly, there cannot be partial product
sales like 3.1. However, the sales efforts put in would have moved the salesperson toward
the next potential sale partially. If we are interested in arriving at the model without
intercept (i.e., no product sales when there is no sales effort), then we can do so as
shown by forcing the intercept to zero value.
Figure 7-13A shows the code used.
mod_simp_reg_wo_intercept = lm(Product_Sales ~ Sales_Effort - 1, data = Train_Data)  #the -1 forces the intercept to zero; one possible form of the call
summary(mod_simp_reg_wo_intercept)
However, if we are to believe and use this model, we have to validate the fulfilment of the other assumptions of the regression.
We will now build the same simple linear regression model using Python. First, we import the required packages.
import numpy as np
import pandas as pd
import sklearn
These commands load the packages into the memory for usage.
Now, we will read the file into the working environment for further processing. This
is done using the code given in Figure 7-15A.
#import the text file with data from the local machine
df = pd.read_csv("C:/Users/kunku/OneDrive/Documents/Book Revision/cust1.txt", header=0)
print(df)
Figure 7-15A. The code to read the file and create a dataframe
Sales_Effort Product_Sales
0 100 10
1 82 8
2 71 7
3 111 11
4 112 11
5 61 6
6 62 6
7 113 11
8 101 10
9 99 10
10 79 8
11 81 8
12 51 5
13 50 5
14 49 5
15 30 3
16 31 3
17 29 3
18 20 2
19 41 4
20 39 4
In the previous dataframe, we have 21 records with one response variable and one
predictor variable (row index starting from 0).
y = df["Product_Sales"]
y.head(5)
Figure 7-16A. Code to create a separate dataframe with response variables only
0 10
1 8
2 7
3 11
4 11
Name: Product_Sales, dtype: int64
Figure 7-17A shows the code related to the creation of a separate dataframe with only
the predictor variable.
X = df.drop(['Product_Sales'], axis=1)
print(X.dtypes)
X.head(5)
Figure 7-17A. Code to create a separate dataframe with only predictor variables
Sales_Effort int64
dtype: object
Out[9]:
Sales_Effort
0 100
1 82
2 71
3 111
4 112
In the following section, we have split the data set into two separate data sets, i.e.,
the training data set and test data set, and have generated the simple linear regression
model using the scikit-learn package.
Figure 7-18A shows the code.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
train_samples = 15
X_train, X_test, y_train, y_test = train_test_split(X, y,
    train_size=train_samples, test_size=6)
Line_Reg = LinearRegression()
Line_Reg.fit(X_train, y_train)
Figure 7-18A. Code to split the data into training data and test data and also to
generate the model using training data
The output of the previous code is the creation of four separate dataframes, i.e., X_train, X_test, y_train, and y_test, and also the generation of the model. We have used the scikit-learn package to split the data as well as to generate the linear regression model named Line_Reg. The output generated by Python in Jupyter is shown in Figure 7-18B, which suggests a linear regression model has been built.
Figure 7-18B. Output related to train-test split and linear regression model
generation
In the following section, we have checked for the coefficient value of the predictor
and model accuracy based on the test data set.
Figure 7-19A shows the code.
print(Line_Reg.coef_)
Line_Reg.score(X_test, y_test)
Figure 7-19A. Code to get the coefficient of the model as well as accuracy score of
the model
[0.09826593]
0.9990992557160832
Figure 7-19B. Output related to obtaining the regression coefficient and the
regression model accuracy score
Figure 7-19B shows the coefficient of Sales_Effort as 0.09826593 and the Line_Reg regression model accuracy score (R-squared on the test data) as 0.9991, i.e., 99.91 percent.
In the following section, we have checked on the mean square error of the model.
This is one of the measures of the model error, as you are aware. Figure 7-20A shows the
code related to this.
np.mean((Line_Reg.predict(X_test)-y_test)**2)
Figure 7-20A. Code to generate the mean square error of the model
0.007806450460612922
As you can observe from Figure 7-20B, the mean square error is very small; it is nearly zero. Hence, we have obtained a well-fitted model.
In the following section, we have predicted the values of the response variable from the X_test data set. This depicts high agreement between the predicted values (predicted) and the actual values (y_test). Of course, there are some residuals (i.e., the difference between the fitted and actual values). Figure 7-21A shows the code.
#run the model on the test data to arrive at the predicted values
predicted = Line_Reg.predict(X_test)
print(predicted)
print(y_test)
Figure 7-21A. Code to predict on the test data and to print predicted and original
response variable values
print(Line_Reg_1.summary())
As the output of the previous code, an OLS model named Line_Reg_1 is generated using the statsmodels package, and print(Line_Reg_1.summary()) prints the output given in Figure 7-22B.
The R-squared and Adj. R-squared values are both 0.999, showing an almost perfect relationship between Product_Sales and Sales_Effort. The p-value (P>|t|) of 0.000 in the case of Sales_Effort shows that Sales_Effort is significant to the model. The very low probability of the F-statistic (i.e., 4.42e-29) shows that the model generated is significant. The Durbin-Watson test value of 1.507 (which is between 1.50 and 2.50) shows that there is no autocorrelation. Prob(Jarque-Bera) of 0.625 shows that we cannot reject the null hypothesis that the residuals are normally distributed.
#We will use the entire data set to carry out the prediction
predicted_full = Line_Reg_1.predict(X)
print(predicted_full)
print(y)
Now, we will first plot the scatter plot between the predicted values and the residuals
of the model, through code shown in Figure 7-24A.
import matplotlib.pyplot as plt
plt.scatter(predicted_full, (predicted_full-y))
plt.xlabel("fitted")
plt.ylabel("residuals")
plt.show()
Figure 7-24A. Plotting a scatter plot of the fitted (predicted) values against the residuals of the model
from scipy import stats
stats.probplot(y-predicted_full, plot=plt)
plt.show()
Figure 7-25A. The code to validate the assumption of normal distribution of the
residuals
The graph in Figure 7-25B confirms the assumption that the residuals are normally
distributed.
As you can see, all the assumptions of the linear regression are validated. Hence, we
can use the model for the predictions.
#(even though we have tested the model against the test data)
df_new_predictors = pd.DataFrame(data=data)
predicted_resp_variable = Line_Reg_1.predict(df_new_predictors)
print(predicted_resp_variable)
Figure 7-26A. Code to create a dataframe with new predictor variables and carry
out the prediction using the model
0 10.973073
1 7.824101
dtype: float64
The values of Product_Sales predicted (after rounding off to the nearest whole number) are 11 and 8.
7.6 Chapter Summary
• In this chapter, you went through some examples of how the relationships between various aspects influence other factors. Understanding these relationships helps us not only to understand what can happen to the associated factors but also to predict their values. You learned how the regression model or regression equation explains the relationship between a response variable and the independent variable(s). You also learned about linear and nonlinear relationships as well as simple regression and multiple regression.
• You explored how the simple linear regression model arrived at can
be used to predict the value of the response variable when it is not
known but when the related independent variable value is known.
• You looked at how, using R, you can arrive at the simple linear
regression model without intercept and the usage of the same.
CHAPTER 8
Multiple Linear Regression
In Chapter 7, you explored simple linear regression, which depicts the relationship
between the response variable and one predictor. You saw that the expectation is
that the response variable is a continuous variable that is normally distributed. If the
response variable is a discrete variable, you use a different regression method. If the
response variable can take values such as yes/no or multiple discrete variables (for
example, views such as strongly agree, agree, partially agree, and do not agree), you use
logistic regression. You will explore logistic regression in Chapter 11. When you have
more than one predictor—say, two predictors or three predictors or n predictors (with
n not equal to 1)—the regression between the response variable and the predictors is
known as multiple regression, and the linear relationship between them is expressed as
multiple linear regression or a multiple linear regression equation. In this chapter, you
will see examples of situations in which many factors affect one response, outcome, or
dependent variable.
Imagine that you want to construct a building. Your main cost components are the
cost of labor and the cost of materials including cement and steel. Your profitability is
positively impacted if the costs of cement, steel, and other materials decrease while
keeping the cost of labor constant. Instead, if the costs of materials increase, your
profitability is negatively impacted while keeping the cost of labor constant. Your
profitability will further decrease if the cost of labor also increases.
It is possible in the market for one price to go up or down or all the prices to move in
the same direction. Suppose the real estate industry is very hot, and there are lots of takers
for the houses, apartments, or business buildings. Then, if there is more demand for the
materials and the supply decreases, the prices of these materials are likely to increase. If the
demand decreases for the houses, apartments, or business buildings, the prices of these
materials are likely to decrease as well (because of the demand being less than the supply).
Now let’s presume that the selling prices are quite fixed because of the competition,
and hence the profitability is decided and driven primarily by the cost or cost control.
We can now collect data related to the cost of cement, steel, and other materials, as well
as the cost of labor as predictors or independent variables, and profitability (in percent)
as the response variable. Such a relationship can be expressed through a multiple linear
regression model or multiple linear regression equation.
In this example, suppose we find that the relationship of the cost of other materials (one of the predictors) to the response variable is dependent on the cost of the cement.
Then we say that there is a significant interaction between the cost of other materials
and the cost of the cement. We include the interaction term cost of other materials:cost
of cement in the formula for generating the multiple linear regression model while also
including all the predictors. Thus, our multiple linear regression model is built using the
predictors cost of cement, cost of steel, cost of other materials, and the interaction term,
cost of other materials:cost of cement versus the profitability as the response variable.
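In R's formula notation, such an interaction term is written with a colon. The following is only an illustrative sketch with hypothetical column and dataframe names (Profitability, Cost_Cement, Cost_Steel, Cost_Other, construction_df), since this construction example has no accompanying data in the book:
#Hypothetical names, for illustration only
profit_model <- lm(Profitability ~ Cost_Cement + Cost_Steel + Cost_Other +
                     Cost_Cement:Cost_Other, data = construction_df)
summary(profit_model)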
Now imagine that you are a human resources (HR) manager. You know that the
compensation to be paid to an employee depends on their qualifications, experience,
and skill level, as well as the availability of other people with that particular skill set
versus the demand. In this case, compensation is the response variable, and the other
parameters are the independent variables, or the predictor variables. Typically, the
higher the experience and higher the skill levels, the lower the availability of people
compared to the demand, and the higher the compensation should be. The skill levels
and the availability of those particular skills in the market may significantly impact the
compensation, whereas the qualifications may not impact compensation as much as the
skill levels and the availability of those particular skills in the market.
In this case, there may be a possible relationship between experience and skill level;
ideally, more experience means a higher skill level. However, a candidate could have a
low skill level in a particular skill while having an overall high level of experience—in
which case, experience might not have a strong relationship with skill level. This feature
of having a high correlation between two or more predictors themselves is known
as multicollinearity and needs to be considered when arriving at the multiple linear
regression model and the multiple linear regression equation.
Understanding the interactions between the predictors as well as multicollinearity is
very important in ensuring that we get a correct and useful multiple regression model.
When we have the model generated, it is necessary to validate it on all four assumptions
of regression:
• Linearity of the relationship
• Independence of residuals
• Normality of residuals
• Constant variance of residuals (homoscedasticity)
The starting point for building any multiple linear regression model is to get our
data in a dataframe format, as this is the requirement of the lm() function in R. The
expectation when using the lm() function is that the response variable data is distributed
normally. However, independent variables are not required to be normally distributed.
Predictors can contain factors too.
Multiple regression modeling may be used to model the relationship between
a response variable and two or more predictor variables to n number of predictor
variables (say, 100 or more variables). The more features that have a relationship with
the response variable, the more complicated the modeling will be. For example, a
person’s health, if quantified through a health index, might be affected by the quality
of that person’s environment (pollution, stress, relationships, and water quality), the
quality of that person’s lifestyle (smoking, drinking, eating, and sleeping habits), and
genetics (history of the health of the parents). These factors may have to be taken into
consideration to understand the health index of the person.
8.1.1 The Data
To demonstrate multiple linear regression, we have created data with three variables:
Annual Salary, Experience in Years, and Skill Level. These are Indian salaries, but for
the sake of convenience, we have converted them into U.S. dollars and rounded them
to thousands. Further, we have not restricted skill levels to whole numbers; assessors may assign skill levels in decimal points. Hence, in the context of this data, even Skill Level is represented as continuous data. This makes sense, as in an organization with hundreds of employees, it
is not fair to categorize all of them, say, into five buckets; it’s better to differentiate them
with skill levels such as 4.5, 3.5, 2.5, 0.5, and 1.5. In this data set, all the variables are
continuous variables.
Here, we import the data from the CSV file sal1.txt to the dataframe sal_data_1.
Figure 8-1A shows the code.
Figure 8-1A. Code to read the file and to display the details of the
dataframe created
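The listing itself is not reproduced in this excerpt; a minimal equivalent, assuming sal1.txt sits in the working directory, would be:
#Read the salary data and inspect its structure
sal_data_1 <- read.csv("sal1.txt", header = TRUE)
str(sal_data_1)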
If you use the head() and tail() functions on the data, you will get an idea of what
the data looks like. Figure 8-2A shows the code.
head(sal_data_1)
tail(sal_data_1)
Figure 8-2A. Code to read the first few and last few records of the dataframe
sal_data_1
Figure 8-2B. The display of first few and last few rows of the dataframe
Please note that we have not shown all the data, as the data set has 48 records. In
addition, this data is illustrative only and may not be representative of a real scenario.
The data was collected at a certain point in time.
8.1.2 Correlation
We explained correlation in Chapter 7. Correlation specifies the way that one variable
relates to another variable. This is easily done in R by using the cor() function.
Figure 8-3A shows the coefficient of correlation R between each pair of these three
variables (Annual Salary, Experience in Years, Skill Level).
cor(sal_data_1)
Figure 8-3A. Code to provide the correlation between each field to the other fields
of the dataframe
Figure 8-3B. Correlation between each field to the other fields of the dataframe
As you can see from Figure 8-3B, there is a very high correlation of about 0.9888
between Annual Salary and Experience in Years. Similarly, there is a very high
correlation of about 0.9414 between Annual Salary and Skill Level. Also, there is a
very high correlation of about 0.8923 between Experience in Years and Skill Level.
Each variable is highly correlated with the others, and all the correlations are positive.
The relationship between two pairs of variables is generated visually by using the R
command, as shown in Figure 8-4A.
library(caret)
featurePlot(x=sal_data_1[,c("Expe_Yrs","Skill_lev")],
y=sal_data_1$Annu_Salary, plot="pairs")
Figure 8-4A. Code to visually depict the relationship between the fields
The output, i.e., the visual relationship between the fields, is provided in Figure 8-4B
through a scatter plot matrix. A scatter plot matrix explains the relationship between
each variable in the data set to the other variables. In our case, we have three variables: Annu_Salary (represented as y), Expe_Yrs, and Skill_lev. Hence, we will have
a 3 × 3 scatter plot matrix. The label in each row specifies the y-axis for the row and the
x-axis for the column. In our scatter plot matrix shown in Figure 8-4B, we have three
labels (Annu_Salary, Skill_lev, Expe_Yrs) each representing one of the variables in the
data set. The first plot in Figure 8-4B has y as the y-axis and Expe_Yrs as the x-axis, and
the second plot in the first row has y as the y-axis and Skill_lev as the x-axis. Similarly,
the second row in the first plot has Skill_lev as the y-axis and Expe_Yrs as the x-axis,
and the third plot in the second row has Skill_lev as the y-axis and y as the x-axis.
Similarly, the second plot in the third row has Expe_Yrs as the y-axis and Skill_lev
as the x-axis, and the third plot in the third row has Expe_Yrs as the y-axis and y as
the x-axis.
Figure 8-4B. Visually depicting the relationship between fields using the
featurePlot() function
Here, we use the caret package and the featurePlot() function. The response
variable is plotted as y, and the predictor variables are plotted with their respective
variable names.
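The listing that actually builds the model (Figure 8-5A in the book) is not reproduced above; a minimal sketch of the call, using the dataframe and column names introduced so far, would be:
#Multiple linear regression of Annual Salary on Experience in Years and Skill Level
sal_model_1 <- lm(Annu_Salary ~ Expe_Yrs + Skill_lev, data = sal_data_1)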
The model gets created, but there will not be any output in R. To see the details
of the model created using the lm() function, we need to use summary(model name).
Figure 8-5B shows the command along with the output.
You can see in this summary of the multiple regression model that both independent
variables are significant to the model, as the p-value for both is less than 0.05. Further,
the overall model p-value is also less than 0.05. Further, as you can see, the adjusted
R-squared value of 99.48 percent indicates that the model explains 99.48 percent of the
variance in the response variable. Further, the median of the residuals is –23.8, which is very close to 0 relative to the scale of Annual Salary.
You can explore the individual aspects of this model by using specific R commands.
You can use fitted(model name) to understand the values fitted using the model. The
fitted values are shown in Figure 8-6. Similarly, you can use residuals(model name) to
understand the residuals for each value of Annual Salary fitted versus the actual Annual
Salary per the data used. You can use coefficients(model name) to get the details of
the coefficients (which is part of the summary data of the model shown previously).
Here the residuals versus fitted plot seem to show that the residuals are spread
randomly around the dashed line at 0. If the response variable is linearly related to the
predictor variables, there should be no relationship between the fitted values and the
residuals. The residuals should be randomly distributed. Even though there seems to
be a pattern, in this case we know clearly from the data that there is a linear relationship
between the response variable and the predictors. This is also shown through a high
correlation between the response variable and each predictor variable. Hence, we
cannot conclude that the linearity assumption is violated. The linear relationship
between the response variable and predictors can be tested through the crPlots(model
name) function, as shown in Figure 8-10. As both the graphs in Figure 8-10 show near
linearity, we can accept that the model sal_model_1 fulfils the test of linearity.
The normal Q-Q plot seems to show that the residuals are not normally distributed.
However, the visual test may not always be appropriate. We need to ensure normality
only if it matters to our analysis and is really important, because, in reality, data may not
always be normal. Hence, we need to apply our judgment in such cases. Typically, as per the central limit theorem and as a rule of thumb, we do not require validating normality for very large samples. Furthermore, if the data is very
small, most of the statistical tests may not yield proper results. However, we can validate
the normality of the model (that is, of the residuals) through the Shapiro-Wilk normality
test by using the shapiro.test(residuals(model name)) command. The resultant
output is as follows:
Here, the null hypothesis is that the normality assumption holds good. We reject the
null hypothesis if the p-value is < 0.05. As the p-value is > 0.05, we cannot reject the null
hypothesis that normality assumption holds good.
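For reference, the concrete call for the model at hand would be as follows (the output itself is not reproduced in this excerpt):
#Null hypothesis: the residuals are normally distributed
shapiro.test(residuals(sal_model_1))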
Further, the scale-location plot (Figure 8-9) shows the points distributed randomly
around a near-horizontal line. Thus, the assumption of constant variance of the errors is
fulfilled. We validate this understanding with ncvTest(model name) from the car library.
Here, the null hypothesis is that there is constant error variance and the alternative
hypothesis is that the error variance changes with the level of fitted values of the
response variable or linear combination of predictors. As the p-value is > 0.05, we cannot
reject the null hypothesis that there is constant error variance. See the following Figure.
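The figure is not reproduced here; the check it documents takes this form (a sketch):
library(car)
#Null hypothesis: the error variance is constant (homoscedasticity)
ncvTest(sal_model_1)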
From the earlier discussion and the previous graphs that show near linearity, we can
accept that the model sal_model_1 fulfils the test of linearity.
Another way to visually confirm the assumption of normality (in addition to the
discussion earlier in this regard) is by using qqPlot(model name, simulate = TRUE,
envelope = 0.95). Figure 8-11 shows the resultant plot. As you can see, most of the
points are within the confidence interval of 0.95.
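For reference, the call that produces Figure 8-11 takes the following form (a sketch; the figure itself is not reproduced here):
library(car)
qqPlot(sal_model_1, simulate = TRUE, envelope = 0.95)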
As the p-value is < 0.05, the Durbin-Watson test rejects the null hypothesis that there
is no autocorrelation. Hence, the Durbin-Watson test supports the alternate hypothesis that there exists autocorrelation, i.e., a lack of independence of errors. However, as
mentioned previously, we know that the data used does not lack independence. Hence,
we can ignore the Durbin-Watson test.
As you can see, all the tests of the regression model assumptions are successful.
8.1.5 Multicollinearity
Multicollinearity is another problem that can happen with multiple linear regression
methods. Say you have both date of birth and age as predictors. You know that both are
the same in a way, or, in other words, that one is highly correlated with the other. If two
predictor variables are highly correlated with each other in this way, there is no point in
considering both of these predictors in a multiple linear regression equation. We usually
eliminate one of these predictors from the multiple linear regression model or equation,
because multicollinearity can adversely impact the model.
Multicollinearity can be determined in R easily by using the vif(model name)
function. VIF stands for variance inflation factor. VIF for any two predictors is calculated
using the simple formula, VIF = 1/(1-R2), where R is the correlation coefficient between
these two predictors. The correlation coefficient R between the predictor Expe_Yrs and
Skill_lev in our case is 0.8923255 (refer to Figure 8.3B); i.e., R2 is 0.7962447979. So, VIF
= 1/(1-0.7962447979) = ~4.91. Typically, for this test of multicollinearity to pass, the VIF value should be less than 5. Figure 8-12 is the test showing the calculation of VIF.
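Figure 8-12 itself is not reproduced in this excerpt; the calculation it documents can be sketched as follows:
library(car)
#Variance inflation factors for the predictors in the model
vif(sal_model_1)
#The pairwise formula quoted above, computed directly from the correlation
1 / (1 - cor(sal_data_1$Expe_Yrs, sal_data_1$Skill_lev)^2)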
As the VIF is less than 5, we can assume that there is no significant multicollinearity.
However, we cannot rule out moderate multicollinearity.
Multicollinearity typically impacts the significance of one of the coefficients and
makes it nonsignificant. However, in this case, we do not see such an impact. Instead,
we see that the coefficients of both predictors are significant. Further, the existence
of multicollinearity does not make a model not usable for prediction. Hence, we can
use the preceding model for prediction even though we presume multicollinearity
(according to some schools of statistical thought that suggest multicollinearity when the
VIF value is greater than 4).
Figure 8-13A shows how to use the model to predict the Annual Salary for the new
data of Experience in Years and Skill Level.
#We will check the predicted value of the Annu_Salary using the Model
#(illustrative new values below; the values used in the book's listing are not reproduced here)
predictor_newdata <- data.frame(Expe_Yrs = 4.0, Skill_lev = 2.5)
predict(sal_model_1, newdata=predictor_newdata)
Figure 8-13A. The code for the prediction using the model on the new values of
the predictors
Now, let’s take another set of values for the predictor variables and check what the
model returns as Annual Salary. Figure 8-14 shows the code used and the prediction
made by sal_model_1.
#Let us now get the model with only Expe_Yrs as the predictor
sal_model_2 <- lm(Annu_Salary ~ Expe_Yrs, data = sal_data_1)
summary(sal_model_2)
Figure 8-15A. Code to generate the model after dropping Skill_lev and displaying
the summary of the model
This is also a significant model with the response variable Annual Salary and the
predictor variable Experience in Years, as the p-value of the model as well as the p-value
of the predictor are less than 0.05. Further, the model explains about 97.74 percent of the
variance in the response variable.
Alternatively, if we remove the Experience in Years predictor variable, we use the
code in Figure 8-16A to get the model shown in Figure 8-16B.
sal_model_3 <- lm(Annu_Salary ~ Skill_lev, data = sal_data_1)
summary(sal_model_3)
Figure 8-16A. Code to generate the model after dropping Expe_Yrs and displaying
the summary of the model
This is also a significant model with the response variable Annual Salary and the
predictor variable Skill Level, as the p-value of the model as well as the p-value of the
predictor are less than 0.05. Further, the model explains about 88 percent of the variance
in the response variable as shown by the R-squared value.
However, when we have various models available for the same response variable
with different predictor variables, one of the best ways to select the most useful model is
to choose the model with the lowest AIC value. Here we have made a comparison of the
AIC values of three models: sal_model_1 with both Experience in Years and Skill Level
as predictors, sal_model_2 with only Experience in Years as the predictor variable, and
sal_model_3 with only Skill Level as the predictor variable. Figure 8-17 shows both the
code and the output related to this.
> AIC(sal_model_1)
[1] 710.7369
> AIC(sal_model_2)
[1] 779.9501
> AIC(sal_model_3)
[1] 858.63
Figure 8-17. Code for generation of the AIC values of the models along with the
corresponding output
If you compare the AIC values, you find that the model with both Experience in Years
and Skill Level as predictors is the best model.
We need to have the package MASS and then use this library to run the
stepAIC(model name) function. Figure 8-18A provides the code related to stepwise
multiple linear regression model generation.
library(MASS)
stepAIC(sal_model_1)
This confirms our understanding as per the discussion in the prior section of this chapter. The downside of this stepwise approach is that, because it drops predictor variables one by one, it does not evaluate all the different combinations of the predictor variables.
library(leaps)
#Best model is the one with the highest Adjusted R-Squared value or R-Squared value
#(illustrative reconstruction of the call; scale = "r2" produces the plot in Figure 8-19B)
leaps_model <- regsubsets(Annu_Salary ~ Expe_Yrs + Skill_lev, data = sal_data_1)
plot(leaps_model, scale = "r2")
Figure 8-19A. Code for generating all subsets of multiple linear regression
Figure 8-19B. Plot of the multiple linear regression model generated in R using the
all subsets approach scaled with R-squared
If we use the adjusted R-squared value instead of the R-squared value, we get the
plot shown in Figure 8-20. To get this, we use “adjr2” instead of “r2” in the code given in
Figure 8-19A.
Figure 8-20. Plot of the multiple linear regression model generated in R using the
all subsets approach scaled with adjusted R-squared value
As you can see, we select both the predictors Experience in Years and Skill Level, as this model has the highest R-squared value of 0.99 and also the highest adjusted R-squared value. If the model recommended by the adjusted R-squared differs from the one recommended by R-squared, we prefer the former, as adjusted R-squared accounts for the degrees of freedom. In this example, the best model selected is the one with both the predictors.
Yi = β0 + β1 x1 + β2 x2 + β3 x3 + …
In this equation, β0 is known as the intercept, and βi is known as the slope of the
regression line. The intercept is the value of the response variable when the values of
each independent variable are 0. This depicts the point at which the regression line
touches the y-axis when x1 and x2, etc., are 0. The slope can be calculated easily by using
this formula:
(R × Standard deviation of the response variable) / Standard deviation of the
independent variable
From this formula, you can see that when the value of the independent variable
increases by one standard deviation, the value of the response variable increases by R ×
one standard deviation of the response variable, where R is the coefficient of correlation.
In our example, the multiple linear regression equation (using the coefficient estimates for this data, which also appear in the statsmodels output later in this chapter) is as follows:
Annu_Salaryi = 3011.6568 + 1589.6773 × Expe_Yrsi + 1263.6482 × Skill_levi
8.1.9 Conclusion
From our data, we got a good multiple linear regression model that we validated for the
success of the assumptions. We also used this model for predictions.
glm_sal_model <- glm(Annu_Salary ~ Expe_Yrs + Skill_lev, data = sal_data_1)
summary(glm_sal_model)
Figure 8-21A. Code to generate the multiple linear regression using the alternative
method glm()
As expected, you will find that the result is the same as that obtained through the
lm() function.
> #Predicting the response variable on the new predictor variables using
> #glm_sal_model
> predicted
16014.64
Figure 8-22. Code to generate prediction using glm_sal_model and the predicted
value output
As you can see from the data used for building the model, the predicted value is
almost in tune with the related actual values in our data set.
Please note: we have used glm_sal_model instead of sal_model_1 generated using
the lm() function, as there is no difference between the two.
This is done using the function predict(model name, newdata), where model name
is the name of the model arrived at from the input data, and newdata contains the data of
independent variables for which the response variable has to be predicted.
This model may work in predicting the Annual Salary for Experience in Years and
Skill Level beyond those in the data set. However, we have to be very cautious in using
the model on such an extended data set, as the model generated does not know or has
no means to know whether the model is suitable for the extended data set. We also do
not know whether the extrapolated data follows the linear relationship. It is possible
that after a certain number of years, while experience increases, the Skill Level and the
corresponding Annual Salary may not go up linearly; the rate of increase in salary may taper off, leading to a gradual flattening of the slope of the relationship.
Typically, 75 percent to 80 percent of the data set is taken to train the model, and another 25 percent to 20 percent of the data set is used to test the model.
Let’s use the same data set that we used previously to do this. For this, we have to use
install.packages("caret") and library(caret). Figure 8-23A shows the splitting of
the data set into two subsets—training data set and test data set—using R code.
library(caret)
set.seed(45)
#Creating Train_Data set (one possible call; the book's exact listing is not reproduced here)
index <- createDataPartition(sal_data_1$Annu_Salary, p = 0.8, list = FALSE)
Train_Data <- sal_data_1[index, ]
summary(Train_Data)
#Creating Test_Data set out of the remaining records from the original data
Test_Data <- sal_data_1[-index, ]
summary(Test_Data)
Figure 8-23A. Code to split the data into two separate dataframes, namely, one for training and another for testing
Figure 8-23B shows the output of the previous code showing summary(Train_Data)
and summary(Test_Data).
This split of the entire data set into two subsets, Train_Data and Test_Data, has been
done randomly. Now, we have 40 records in Train_Data and 8 records in Test_Data.
We will now train our Train_Data using the machine-learning concepts to generate a
model. Figure 8-24A provides the code.
Trained_Model <- lm(Annu_Salary ~ Expe_Yrs + Skill_lev, data = Train_Data)  #one way to fit on the training data; the book's exact listing is not reproduced here
summary(Trained_Model)
As you can see, the model generated, Trained_Model, is a good fit with the p-values
of the coefficients of the predictors being < 0.05 as well as the overall model p-value
being < 0.05.
We will now use the Trained_Model arrived at previously to predict the values of
Annual_Salary in respect to Experience in Years and Skill Level from the Test_Data.
Figure 8-25A shows the code.
predicted_Annu_Salary <- predict(Trained_Model, newdata = Test_Data)
summary(predicted_Annu_Salary)
Figure 8-25B shows the output of the previous code showing the summary of the
predicted values. The output shows the minimum, maximum, first quartile, median,
mean, and third quartile of the predictions made.
> summary(predicted_Annu_Salary)
Now, let’s see whether the values of Annual Salary we predicted using the model
generated out of Training_Data and actual values of the Annual Salary in the Test_Data
are in tune with each other. Figure 8-26 shows the code and the output showing actual
values and predicted values.
Figure 8-26. Shows both the Actual Annu_Salary and the Predicted Annu_Salary
for comparison
You can see here that the actual Annual Salary from Test_Data and the Annual Salary predicted by the model generated from Train_Data, i.e., predicted_Annu_Salary, match closely with each other. Hence, the model generated from Train_Data, i.e., Trained_Model, can be used effectively.
8.5 Cross Validation
Cross validation, such as k-fold cross validation, is used to validate the model generated. It is useful when we have limited data that must be split into a small test set (say, 20 percent of the data) and a relatively bigger training set (say, 80 percent of the data); this can make the validation of the model relatively difficult or not feasible
because the data in the training set and test set may be split in such a way that both may
not be very representative of the entire data set.
This problem is eliminated by cross validation methods like k-fold cross validation. Here, we divide the data set into k folds, use k-1 folds as training data and the remaining fold as test data, and repeat this k times. Each fold has almost an equal number of data points (depending upon the k value) randomly drawn from the entire data set. In our case, as we have 48 records in total and k=10, each fold has 4 or 5 records. No element of one fold is repeated in another fold. For each fold, the model is validated, and the value predicted by the model (Predicted) and the value predicted by cross validation (cvpred) are tabulated as the output, along with the difference between the actual value and the cross validated prediction as the CV residual. Using the difference between the Predicted values and the cvpred values, we can arrive at the root mean square error of the fit of the linear regression model. This cross validation method is also useful when we have a lot of data, as every data point is used for training as well as for testing by rotation, and no data point appears in more than one fold.
Let’s look at the example of the multiple linear regression we used earlier. Let’s
validate this model using the k-fold validation. In our example, we will be using K = 10.
This is for the sake of convenience so that every time we have 90 percent of the data in
the training set and another 10 percent of the data in the test set. This way, the data once
in the training set will move some other time to the test set. Thus, in fact, by rotating, we
will ensure that all the points are used for model generation as well as the testing. For
this cross validation, we will use library(DAAG) and use either the cv.lm(data set,
formula for the model, m=number of folds) or CVlm(data set, formula for the
model, m=number of folds) function. Here m is the number of folds (i.e., k) we have
decided on. Figure 8-27A shows the code to generate the cv_model.
#Cross Validation
library(DAAG)
cv_Model <- CVlm(data = sal_data_1, form.lm = formula(Annu_Salary ~ Expe_Yrs + Skill_lev), m = 10)
plot(cv_Model)
Figure 8-27A. Code to carry out cross validation and to plot the model generated
The previous code carves out 10 (as m=10) separate data sets from the original
data. As you can see, in each data set carved out there will be either 4 or 5 rows (as we
have only 48 records in total in our original data set). Each fold shows the number of
observations in the set. As you can see in Figure 8-27B, fold 1 has four observations,
and fold 2 has five observations. The other eight folds are not shown in Figure 8-27B in
order to avoid the clutter. The index of the rows selected for each fold is also shown. The
output also shows, for each fold, the Predicted value, which shows the value predicted
using all the observations; the cvpred value, which shows the value predicted using
cross validation; the actual response variable value; and the CV residual, which is the difference between the cvpred value and the actual value of the response variable. At the end, the
sum of squares error and the mean square error of the cross validation residuals are
also shown.
Figure 8-27C shows the plot generated by plot(cv_Model).
The values Predicted (predicted using all the values) and cvpred (predicted using
each fold) are almost similar as is evident from the previous plot. Also, we can see from
Figure 8-27B, the predictions made by cv_Model (i.e., cvpred) are almost similar to the
Predicted values as well as the actual values of Annu_Salary. Now, let’s check the root
mean square error between the model generated using the linear regression model and
the cross validation model. Figure 8-28 shows the code and output.
> #One way to compute the RMSE between the Predicted and cvpred columns returned by CVlm
> rmse <- sqrt(mean((cv_Model$Predicted - cv_Model$cvpred)^2))
> rmse
[1] 0.2554861
Figure 8-28. Root mean square error calculated between linear regression model
and cross validation model
If we check the root mean square error between the values predicted by multiple
linear regression models and cross validation models, as shown in Figure 8-28, it is
very negligible, i.e., 0.25. Hence, k-fold cross validation validates the multiple linear
regression model arrived at earlier.
Instead of k-fold cross validation, we can use other methods like repeated k-fold
cross validation, bootstrap sampling, and leave-one-out cross validation.
Note We need to use the set.seed() command when carrying out the model
generation or splitting the data so that the result for each run is consistent.
We will now carry out the same multiple linear regression using Python. First, we import the required packages.
import numpy as np
import pandas as pd
import sklearn
#import the text file with data from the local machine
#this is a text file with headers and hence the header row is specified
df = pd.read_csv("sal1.txt", header=0)   #adjust the path to where sal1.txt is stored locally
print(df)
Figure 8-30A. Code to import the data into a dataframe from the text file
Figure 8-30B shows the output of the code to import the data, i.e., the listing of the
data imported.
Figure 8-30B. Partial view of the of the data of the dataframe created
In the previous dataframe we have 48 records with one response variable and two
predictor variables (only the partial view is shown here).
print(Mul_Line_Reg_1.summary())
Figure 8-31A. Code to generate the multiple linear regression model using
statsmodels and to print the summary of the model created
Here is the interpretation of the model output: The intercept value is 3011.6568, the
coefficient of Expe_Yrs is 1589.6773, and that of Skill_lev is 1263.6482. As you can see, the
P>|t| value is 0.000 in the case of both the predictors, i.e., Expe_Yrs and Skill_lev, which
is less than the significance level of 0.05. This informs us that both the predictor variables
are significant to the model. Further, the Prob (F-statistic) value (i.e., 1.76e-52) is very low
and is less than the significance level of 0.05. Moreover, both the R-squared and adjusted R-squared values are 0.995, which signifies that the predictors explain almost all of the variance in the response variable. Hence, the model generated is significant. The AIC value of 708.7 is also low. The Durbin-Watson value shown in the model is 1.339, which shows minimal autocorrelation (a value between roughly 1.5 and 2.5 shows no autocorrelation). The Jarque-Bera test checks for the normality of the residuals. The null hypothesis here is that the residuals are normally distributed. The Prob(JB) value is 0.266, which is not significant. Hence, we cannot reject the null hypothesis, which supports the normality assumption.
y = df["Annu_Salary"]
y.head(5)
Figure 8-32A. Code to create a separate dataframe y with only the response
variable data
The check for whether the dataframe has been created properly is done through
y.head(5), given earlier. Figure 8-32B shows the output.
0 4000
1 6000
2 8000
3 10000
4 12000
Name: Annu_Salary, dtype: int64
Now, we will create another dataframe with only the predictor variable data.
Figure 8-33A shows the code.
X = df.drop(['Annu_Salary'], axis=1)
print(X.dtypes)
X.head(5)
Figure 8-33A. The code to create a separate dataframe X with only the predictor
variable data
In the previous code, X.head(5) is added to check if the dataframe has been
populated properly. The output of the code given in Figure 8-33A is provided in
Figure 8-33B.
Expe_Yrs float64
Skill_lev float64
dtype: object
Out[6]:
Expe_Yrs Skill_lev
0 0.0 0.5
1 1.0 1.0
2 2.0 1.5
3 3.0 2.0
4 4.0 2.5
In the following section, we have split the data set into two separate data sets,
i.e., the training data set and test data set, and have generated the multiple linear
regression model using the scikit-learn package. Even the train_test_split has been
accomplished using the scikit-learn package. Figure 8-34A shows the code.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
train_samples = 35
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=train_samples)
Mul_Line_Reg = LinearRegression()
Mul_Line_Reg.fit(X_train, y_train)
Figure 8-34A. Code to carry out the split of the dataframes into separate training
and test datasets (both X, y) and to generate the model using the training data
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
The previous output only shows that a linear regression model has been generated.
In the following section, we have checked for the coefficient values of the predictors,
model accuracy based on the test data set, and also the mean square error on the
predictions from the test data set. Figure 8-35 provides the code and output.
#Code:
print(Mul_Line_Reg.coef_)
Output: [ 1628.07306079 1176.49916823]
#Code:
Mul_Line_Reg.score(X_test, y_test)
Output: 0.99391939705948518
#Code
np.mean((Mul_Line_Reg.predict(X_test)-y_test)**2)
Output: 158940.20463956954
Figure 8-35. Code and output showing the coefficients of the model, the accuracy
of the model, and the mean square error on the test predictions
In the following section, we predict the values of the response variable from the test-set predictors. The output shows high agreement between the predicted and actual values. Of course, there are some residuals (i.e., differences between the fitted and actual values). Figure 8-36A shows the code.
#run the model on the test data to arrive at the predicted values
predicted = Mul_Line_Reg.predict(X_test)
print(predicted)
print(y_test)
Figure 8-36A. Code to predict on the X_test data using the model trained on
training data
Figure 8-36B. Output of the prediction on the X_test and actual values from y_test
To start validating the assumptions, we will create a scatter plot of the fitted values against the residuals. Figure 8-37A shows the code.
plt.scatter(predicted, (predicted-y_test))
plt.xlabel("fitted")
plt.ylabel("residuals")
plt.show()
Figure 8-37A. Code to create a scatter plot of fitted values to the residuals
We will now plot a scatter plot of ordered residuals against the theoretical quantiles
using the SciPy package. This will help us to understand the normality of the distribution
of the residuals. Figure 8-38A shows the code.
from scipy import stats
stats.probplot(y_test - predicted, plot=plt)
plt.show()
Figure 8-38A. Code to plot ordered residuals against the theoretical quantiles to validate the assumption of normality
Figure 8-38B. The plot of the ordered values of the residuals against the
theoretical quantiles for validating the assumption of the normality
We will also test the assumption of the linearity in another way. For this purpose, we
will use the Seaborn package to create a pairplot(), plotting the relationship between
each pair of the variables. Figure 8-39A shows the code.
sns.pairplot(df)
plt.draw()
plt.show()
Figure 8-39A. Code to create a pairplot() on the original dataframe showing the
relationship between each pair of the variables
From the previous graph, we can confirm that there is an almost linear relationship between each pair of variables. Hence, we can conclude that the assumption of linearity holds.
As you can see from the previous plots, the assumptions of linear regression are validated. Hence, we can use the model for predictions.
# 'data' holds the new predictor values (Expe_Yrs and Skill_lev) defined earlier
df_new_predictors = pd.DataFrame(data=data)
predicted_resp_variable = Mul_Line_Reg.predict(df_new_predictors)
print(predicted_resp_variable)
Figure 8-40A. Code to create a new dataframe with the new predictor values and
to predict using these values as inputs
You can observe from the previous output that the predicted values are very close to the actual values from the initial data set, i.e., 10000 and 16000, respectively.
8.7 Chapter Summary
In this chapter, you saw examples of multiple linear relationships. When you have a
response variable that is continuous and is normally distributed with multiple predictor
variables, you use multiple linear regression. If a response variable can take discrete
values, you use logistic regression.
You also briefly looked into significant interaction, which occurs when the outcome
is impacted by one of the predictors based on the value of the other predictor. You
learned about multicollinearity, whereby two or more predictors may be highly
correlated or may represent the same aspect in different ways. You saw the impacts of
multicollinearity and how to handle them in the context of the multiple linear regression
model or equation.
You then explored the data we took from one of the entities and the correlation
between various variables. You saw a high correlation among all three variables involved
(the response variable as well as the predictor variables). By using this data in R, you
learned how to arrive at the multiple linear regression model and to validate if it is a
good model with significant predictor variables.
You learned various techniques to validate that the assumptions of the regression are
met. Different approaches can lead to different interpretations, so you have to proceed
cautiously in that regard.
In exploring multicollinearity further, you saw that functions such as vif() enable
you to understand the existence of multicollinearity. You briefly looked at handling
multicollinearity in the context of multiple linear regression. Multicollinearity does not
reduce the value of the model for prediction purposes. Through the example of Akaike
information criterion (AIC), you learned to compare various models and that the best
one typically has the lowest AIC value.
You then explored two alternative ways to arrive at the best-fit multiple linear
regression models: stepwise multiple linear regression and the all subsets approach
to multiple linear regression. You depicted the model by using the multiple linear
regression equation.
Further, you explored how the glm() function with the frequency distribution
gaussian with link = "identity" can provide the same model as that generated
through the lm() function, as we require normality in the case of a continuous response
variable.
Further, you saw how to predict the value of the response variable by using the values
of the predictor variables.
You also explored how to split the data set into two subsets, training data and test data. You learned how to use the training data to generate a model and to validate it against the response variable of the test data, by using the predict() function on the model generated from the training data.
Finally, we demonstrated how to carry out in Python all the analyses we performed in R.
CHAPTER 9
Classification
In this chapter, we will focus on classification techniques. Classification is the task of predicting the value of a categorical variable: a classification model predicts a category or class, whereas a regression model predicts a continuous value. We will explain classification methods such as naïve Bayes, decision trees, and others, demonstrating each technique first with R libraries and then with Python packages.
[Figure (omitted): data feeds either a classification model or a prediction model; prediction examples include a behavior score, sales numbers, and weather patterns.]
Assume that you are applying for a mortgage loan. You fill out a long application
form with all your personal details, including income, age, qualification, location of
the house, valuation of the house, and more. You are anxiously waiting for the bank’s
decision on whether your loan application has been approved. The bank has to make
a decision about whether the loan should be approved or not. How does the bank
decide? The bank reviews various parameters provided in your application form, and
then—based on similar applications received previously and the experience the bank
has had with those customers—the bank decides whether the loan should be approved
or denied. The bank may be able to review and classify ten or twenty applications in a day but may find it extremely difficult if it receives thousands of applications at a time. Classification algorithms can perform this task instead. Similarly, assume that a company has to decide whether to launch a new product in the market. Product performance depends on various parameters, and the decision may be based on the experience the company has had launching similar products in the past under numerous market conditions.
There are numerous examples of classification tasks in business, such as the loan approval decision just described.
Prediction is the same as classification except that the results you are predicting represent the future. Examples of prediction tasks in business include the following:
• Predicting the value of a stock price for the next three months
• Predicting which basketball team will win this year’s game based on
the past data
• Predicting the percentage decrease in traffic deaths next year if the
speed limit is reduced
[Figure (omitted): unlabeled documents are the input to a machine learning (classifier) model, and the model outcome is the labeled documents.]
input characteristics of the data. Each algorithm's performance may vary. Though different scholars and data scientists offer certain guidelines for selecting an algorithm, there is no standard method. We recommend you try different algorithms and select the algorithm and model that provide the best accuracy.
[Figure (omitted): labeled training records are used to learn a model, and the model then assigns Yes/No labels to new, unlabeled records.]
We begin our discussion with a simple classifier and then discuss the other classifier
models such as decision trees, naïve Bayes, and random forest.
9.1.1 K-Nearest Neighbor
We begin our discussion with the k-nearest neighbor (KNN) algorithm, one of the
simpler algorithms for classification based on the distance measure. We will explore the
concepts behind the nearest neighbor with an example and demonstrate how to build
the KNN model using both the R and Python libraries.
The nearest neighbor algorithm is commonly known as a lazy learning algorithm because it does not define any complicated rules to learn from the data; instead, it memorizes the training data set. You will find many real-world applications of KNN, such as categorizing a voter as Democrat or Republican or grouping news events into different categories.
The Euclidean distance for the two points, X = (x1, x2, x3, … xn) and Y = (y1, y2, y3, … yn), is defined as follows:

Euclidean Distance = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}
Similarly, the Manhattan distance for the two points, X = (x1, x2, x3, … xn) and
Y = (y1, y2, y3, … yn) is defined as follows:
Manhattan Distance = \sum_{i=1}^{k} |x_i - y_i|
And the Minkowski distance for the two points, X = (x1, x2, x3, … xn) and
Y = (y1, y2, y3, … yn), is defined as follows:
Minkowski Distance = \left( \sum_{i=1}^{k} |x_i - y_i|^q \right)^{1/q}
If q=1, then it is equivalent to the Manhattan distance, and for the case q=2, it is
equivalent to the Euclidian distance. Although q can be any real value, it is typically set
to a value between 1 and 3.
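As a quick illustration (not from the book's listings), the three distance measures can be computed directly with SciPy; the two sample points below are arbitrary:

import numpy as np
from scipy.spatial import distance

x = np.array([34, 200])
y = np.array([44, 204])

print(distance.euclidean(x, y))        # sqrt(10^2 + 4^2)  ~ 10.77
print(distance.cityblock(x, y))        # |10| + |4|        = 14   (Manhattan)
print(distance.minkowski(x, y, p=3))   # (10^3 + 4^3)^(1/3) ~ 10.21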
9.1.2 KNN Algorithm
K-nearest neighbor does not assume any relationship between the predictors (X) and the class (Y). Instead, it infers the class of a record from its similarity to the records in the training data set. Though there are many potential measures, KNN typically uses the Euclidean distance between records to find the most similar ones and label the class. For the algorithm to work, you need a labeled training data set, a distance measure, and a value for k, the number of neighbors to consider.
Let’s demonstrate this with the following example. In this loan approval example,
we have a target class variable, Approval, which depends on two independent variables,
Age and PurchaseAmount, as shown in Figure 9-4. The data already has a class label of
Yes or No.
Figure 9-4. Loan approval training data and new data for KNN
Given the new data, with a variable Age of 34 and a PurchaseAmount of 200, we
calculate the distance from all the training records. We will use the Euclidian distance
measure for our distance calculation. As shown in Figure 9-5, a distance of (34,200) is
10.77033 units from (44, 204), 17.029386 from (35, 183), 22.135944 from (41,221), and
so on. Once the distances are calculated, choose the k-value and classify the new, unknown record by taking the majority vote among its k nearest neighbors. For example, if k=5, then you have 3 No and 2 Yes; the majority is No, and hence the new data is classified as No. In other words, to classify a new record, the algorithm finds its nearest matches and tags it with their class. For example, if it looks like a mango and tastes like a mango, then it's probably a mango.
After computing the distances between records, we need to choose k, the number of nearest neighbors whose majority vote decides the class. A higher value of k reduces the risk of overfitting due to noise in the training set, but too large a value blurs the class boundaries. Ideally, we balance the value of k such that the misclassification error is minimized. Generally, we choose k between 2 and 10; for each value of k, we calculate the misclassification error and select the value that gives the minimum error.
Please note that the predictor variables should be standardized to a common
scale before computing the Euclidean distances and classifying. We can use any of the
standardization techniques we discussed in Chapter 4, such as min-max or z-score
transformation.
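As an illustrative sketch (assuming the X_train and X_test dataframes used later in the Python walkthrough), z-score standardization can be applied with scikit-learn before the distances are computed:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)   # fit the scaling on the training data only
X_test_std = scaler.transform(X_test)         # apply the same scaling to the test data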
9.1.3 KNN Using R
In this example, we will investigate the loan approval problem. The data consists of three variables: the target class, which indicates whether the loan is approved, and the two predictors Age and PurchaseAmount, as shown in Figure 9-6. The data consists of 53 training records and 10 test records. A class of 0 means the loan is not approved, and 1 means the loan is approved. We will use the class package in R to run the KNN algorithm function.
The following section explains the step-by-step process of creating a KNN model and
testing the performance.
Step 1: Read the training data and test data.
Step 2: Preprocess the data if required. Check the data types of each variable and
transform them to the appropriate data types.
Step 3: Prepare the data as per the library requirements. In this case, read the KNN
function R documentation and accordingly prepare your training and test data.
> ## Read the knn() documentation to see the required input parameters using help(knn).
> ## For this library function, the input training data should not include the
> ## response variable, so we separate it out
> train_data = knndata[,-3]
> # Keep the class labels of the training data to pass to knn() as cl
> cls = knndata[,3]
> # Separate the response variable from the test data
> test_data = knntest[,-3]
>
Step 4: Create the KNN model using the knn() function from the class package. If
you have not installed the class package, you must install the package first.
See the description of the function in the official R documentation (https://www.rdocumentation.org/packages/class/versions/7.3-20/topics/knn).
> library(class)
> knnmod<-knn(train=train_data,test=test_data,cl=cls,k=3,
+ prob=FALSE)
The model has training data and test data as input, and it uses the training data to
build the model and the test data to predict. At this stage, we do not know the exact value
of k and assume that k is 3. In step 6, we will demonstrate how to find the value of k.
Step 5: Our KNN model has predicted the target class of the test data. We must check the model performance by comparing the predicted class values with the actual values. We will write a small function, check_error(), that compares the actual and predicted values by calculating the mean error.
Please note that the confusion matrix is a truth table of actual versus predicted values in matrix form. The predicted values are displayed vertically, and the actual values are displayed horizontally. For example, in this case, there are four cases where the model predicted 0 and the actual value is also 0. Similarly, there are eight cases where the model predicted 1 but the actual value is 0. This matrix helps in determining the model's accuracy, precision, and recall. The actual formulas and calculations are explained in Chapter 6. We urge you to refer to that chapter and calculate these measures manually for a better understanding of the matrix and the output produced by the code.
Reference
Prediction 0 1
0 4 1
1 8 1
Accuracy : 0.3571
95% CI : (0.1276, 0.6486)
No Information Rate : 0.8571
P-Value [Acc > NIR] : 1.0000
Kappa : -0.0678
Sensitivity : 0.3333
Specificity : 0.5000
Pos Pred Value : 0.8000
Neg Pred Value : 0.1111
Prevalence : 0.8571
Detection Rate : 0.2857
Detection Prevalence : 0.3571
Balanced Accuracy : 0.4167
'Positive' Class : 0
The accuracy of the model is around 35.71 percent, with a 95 percent confidence interval of (0.128, 0.649). The results also provide sensitivity and specificity measures.
Step 6: We will determine the best value of k by calculating the error for each value of k from 1 to 10. Normally, k lies between 1 and 10. We do not consider k = 1 because a single neighbor defeats the purpose of majority voting. We will write a small function that loops over the different values of k and calculates the error for each. Once we have the errors, we will plot k-values versus error. The final model uses the k-value with the minimum error. In this example, we have not demonstrated normalizing the variables; as part of fine-tuning model performance, you should consider normalizing them.
Figure 9-7 shows the plot of k-values versus error. This is referred to as an elbow plot and is normally used to determine the optimum value of k. In this example, the error is minimal for values of k between 2 and 6. Since this is a small data set, we recommend a k-value of 3.
raw_data = pd.read_csv("knn-data.csv")
raw_test = pd.read_csv("knntestdata.csv")
In this step, we will split the data into X_train, y_train, X_test, and y_test by
separating the Approval variable as shown. Then we will check the data types and convert
them to the appropriate data types. For example, the Approval variable should be
converted to the categorical variable using the astype() function.
Check the data types of the variables in the data set and take the necessary actions to
convert the data types. In this case, as you can see, Approval is the target variable, and it
should be categorical.
#drop the target variable from X and keep y separately for the KNN API
X_train = raw_data.drop(columns='Approval')
y_train = raw_data['Approval']
X_test = raw_test.drop(columns='Approval')
y_test = raw_test['Approval']
Step 4: Create the KNN model using the KNeighborsClassifier() function. We can use any value of k; to start, we will choose k=2, and then we will find the optimal value of k using the elbow method, with the misclassification error for different values of k as the measure.
knn_model = KNeighborsClassifier(n_neighbors = 2)
knn_model.fit(X_train, y_train)
knn_model.classes_
Step 5: Predict the class of the new data using the KNN model.
Here is the input:
predicted = knn_model.predict(X_test)
predicted
print(confusion_matrix(y_test, predicted))
print(classification_report(y_test, predicted))
Step 7: The final step is to calculate the error for different k-values and select the k
based on the minimum error. For this, we will write a small function to loop through
different k-values, create the model, and calculate the error. We will also plot the error
versus k-values graph to choose the k-value.
Here is the input:
err = []
for i in np.arange(1, 10):
    knn_new = KNeighborsClassifier(n_neighbors=i)
    knn_new.fit(X_train, y_train)
    new_predicted = knn_new.predict(X_test)
    err.append(np.mean(new_predicted != y_test))
plt.plot(err)
As you can see, the misclassification error is least for k between 2 and 4. For optimal model performance, we can choose k=3 (an odd number), since the algorithm is based on majority voting.
The results obtained from R and Python are slightly different. We used open-source libraries and did not match every parameter of the two API functions, including the distance measure. If you go through the documentation and apply the same parameters in both R and Python, you should be able to achieve the same results. We strongly recommend reading the documentation, using the right parameters, and rerunning the code.
P(Y|X) = \frac{P(X|Y)\,P(Y)}{P(X)}
Applying the same thing in the context of classification, Bayes’ theorem provides a
formula for calculating the probability of a given record belonging to a class. Suppose you
have m classes, C1, C2, C3, … Cm, and the probability of classes is P(C1), P(C2), … P(Cm). If you
know the probability of occurrence of x1, x2, x3, … attributes within each class, then by using
Bayes’ theorem, you can calculate the probability of the record xi, belonging to class Ci:
P(C|X) = \frac{P(x|C)\,P(C)}{P(x)}     (1)
where:
P(c|x) is the posterior probability of class c given the predictor x.
P(x|c) is the likelihood, i.e., the probability of the instance x given class c.
P(c) is the prior probability of class c.
P(x) is the marginal probability of the predictor x.
The naïve Bayes classifier assumes what is called conditional independence; that is, the effect of the value of a predictor (x) on a given class (c) is independent of the values of the other predictors. The previous equation then simplifies to the following:

P(C|x_1, x_2, \ldots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i|C)
P(Ci) is the prior probability of belonging to class Ci in the absence of any other attributes, and P(Ci|Xi) is the posterior probability of Xi belonging to class Ci. To classify a record using Bayes' theorem, we compute the probability of the record belonging to each class Ci and then assign the class with the highest probability score calculated using the naïve Bayes formula.
Consider the following loan approval training data (Purchase Frequency, Credit Rating, Age, and the Loan Approval class):

ID  Purchase Frequency  Credit Rating  Age     Loan Approval
1   Medium              OK             < 35    No
2   Medium              Excellent      < 35    No
3   High                Fair           35–40   Yes
4   High                Fair           > 40    Yes
5   Low                 Excellent      > 40    Yes
6   Low                 OK             > 40    No
7   Low                 Excellent      35–40   Yes
8   Medium              Fair           < 35    No
9   Low                 Fair           < 35    No
10  Medium              Excellent      > 40    No
11  High                Fair           < 35    Yes
12  Medium              Excellent      35–40   No
13  Medium              Fair           35–40   Yes
14  High                OK             < 35    No
We will calculate the prior probability and class probability for each class. To
calculate these probabilities, we can construct a frequency table for each attribute
against the target class. Then, from the frequency table, calculate the likelihood; finally,
apply the naïve Bayes equation to calculate the posterior probability for each class. Then
the class with the highest posterior probability decides the prediction class.
Figure 9-8 is the frequency table for the Purchase Frequency attribute. Using this table, we can calculate the probabilities:
P(High) = 4/14 = 0.28 (out of 14 training records, 4 have a high purchase frequency). Similarly, we can calculate all the other Purchase Frequency prior probabilities.
P(Yes) = 6/14
P(No) = 8/14
P(High|Yes) = 3/6 = 0.5; there are 3 High purchases out of the 6 Yes records.
Now, given all these probabilities, we can calculate the posterior probability using Bayes' theorem (assuming conditional independence):
P(Yes|High) = P(High|Yes) × P(Yes) / P(High) = (0.5 × 0.42) / 0.28 = 0.75
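The hand calculation above can be reproduced in a few lines of Python (a simple sketch using the counts from the frequency table, not code from the book):

p_high_given_yes = 3 / 6   # 3 of the 6 Yes records have Purchase Frequency = High
p_yes = 6 / 14             # prior probability of class Yes
p_high = 4 / 14            # prior probability of Purchase Frequency = High

p_yes_given_high = (p_high_given_yes * p_yes) / p_high
print(round(p_yes_given_high, 2))   # 0.75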
Similarly, we can come up with frequency tables to calculate the likelihood for other
attributes, as shown in Figure 9-9.
Figure 9-9 (summarized). Frequency tables used to calculate the probabilities for the Credit Rating and Age attributes. For example, Credit Rating = Excellent occurs in 2 of the 6 Yes records and 3 of the 8 No records (5/14 overall), and Credit Rating = Fair occurs in 4/6 Yes and 2/8 No records (6/14 overall); Age < 35 occurs in 1/6 Yes and 5/8 No records (6/14 overall), and Age > 40 occurs in 2/6 Yes and 2/8 No records (4/14 overall).
credData
str(credData)
Once we read the data, divide the data into test and train.
Here is the input:
library(caret)
set.seed(1234)
data_partition<-createDataPartition(y=data_df$Approval,
p=0.8,
list=FALSE)
train<-data_df[data_partition,]
test<-data_df[-data_partition,]
The next step is to build the classifier (naïve Bayes) model by using the mlbench and
e1071 packages.
For the new sample data X = (Age > 40, Purchase Frequency = Medium, Credit Rating
= Excellent), the naïve Bayes model has predicted Approval = No.
Here is the input:
> nb_pred<-predict(nb_model,test)
> nb_pred
[1] No Yes Yes No No Yes No No Yes No Yes No Yes No
Levels: No Yes
The next step is to measure the performance of the NB classifier, that is, how well it has predicted. We use the caret library to print the confusion matrix. As discussed in the earlier sections, the confusion matrix is a truth table of what the model predicted versus the actual values. The confusionMatrix() function calculates the accuracy of the model and also provides a sensitivity analysis report. As you can see from the output results, our NB model achieved an accuracy of 67 percent.
Here is the input:
calculated could result in wrong predictions. When the preceding calculations are done using computers, multiplying many small probabilities can lead to floating-point underflow, so the probabilities are usually combined as sums of log probabilities; the class with the highest log probability is still the winning class. If a particular category does not appear in a particular class, its conditional probability equals 0, so the whole product becomes 0 (and log(0) is undefined). To avoid this problem, we use add-one or Laplace smoothing, adding 1 to each count. Laplace smoothing tackles the zero-probability problem: by adding 1, every likelihood is pushed to a nonzero value, which keeps the overall model usable. This is especially relevant in text classification problems.
Also, for numeric attributes, normal distribution is assumed. However, if we know the
attribute is not a normal distribution and is likely to follow some other distribution,
you can use different procedures to calculate estimates—for example, kernel density
estimation does not assume any particular distribution (Kernel Density Estimation
(KDE) is a method to estimate the probability density function of a continuous random
variable). Another possible method is to discretize the data. Although conditional
independence does not hold in real-world situations, naïve Bayes tends to perform well.
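For reference, scikit-learn exposes Laplace smoothing directly through the alpha parameter of its naïve Bayes classifiers. The sketch below is illustrative only and assumes integer-encoded categorical predictors in X_train/X_test and class labels in y_train:

from sklearn.naive_bayes import CategoricalNB

nb = CategoricalNB(alpha=1.0)    # alpha=1.0 is add-one (Laplace) smoothing
nb.fit(X_train, y_train)
print(nb.predict_proba(X_test))  # posterior probability of each class per record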
9.3 Decision Trees
A decision tree builds a classification model by using a tree structure, constructed incrementally by breaking the data set down into smaller and smaller subsets. The final tree consists of a root node, branches, and leaf nodes, also called decision nodes. The root node is the topmost attribute, the branches lead to the attributes selected next based on the decisions at the root node, and the leaf nodes hold the final class assigned by the tree model.
Figure 9-10 demonstrates a simple decision tree model. Once the root node is
decided, the decision happens at each node for an attribute on how to split. Selection of
the root node attribute and the criterion for the split node attribute is based on various
methods such as gini index, entropy, misclassification errors, etc. The next section
explains the techniques in detail. In this example, the root node is Purchase Frequency,
which has two branches, High and Low. If Purchase Frequency is High, the next decision
node would be Age, and if Purchase Frequency is Low, the next decision node is Credit
Rating. The next nodes are chosen based on the information and purity of each node.
The leaf node represents the classification decision; in this case it is Yes or No. The
final decision tree consists of a set of rules, more of an ‘If Else’ rule used to classify the
new data.
The decision tree algorithm is based on the divide-and-conquer rule. Let x1, x2, and
x3 be independent variables and Y denote the dependent variable. The X variables can
be continuous, binary, or ordinal. The first step is selecting one of the variables, xi, as
the root node to split. Depending on the type of the variable and the values, the split
decision is made. After the root attribute is split, the next attribute is selected at each
branch, and the algorithm proceeds recursively by splitting at each child node. The
splitting continues until the decision class is reached. We call the final leaf homogeneous,
meaning the final terminal node contains only one class.
The following are the basic steps involved in creating a decision tree structure:
1. The tree starts by selecting the root attribute from the training set,
based on the gini index and entropy.
2. The root node is branched, and the decision to split is made based
on the attribute characteristics and the split measures.
3. At each branch, the attributes of the next node are selected based
on the information gain and entropy measure.
4. This process continues until all the attributes are considered for
the decision.
The decision tree can handle both categorical and numerical variables.
9.3.1.1 Entropy
Entropy is a measure of how messy the data is. That is, it is how difficult it is to separate
data in the sample into different classes. For example, say you have an urn that holds 100 marbles and all of them are red; then whichever marble you pick turns out to be red, so there is no randomness in the data. The data is in good order, the level of disorder is zero, and the entropy is zero.
Similarly, if the urn has red, blue, and green marbles in random order, then the data is in disorder. The level of randomness is high, as you do not know which marble you will get; the data is completely mixed, and the entropy is high. The distribution of the marbles in the urn determines the entropy: if the three colors are equally distributed, the entropy is highest; if they are distributed as, say, 50 percent, 30 percent, and 20 percent, the entropy is somewhere between low and high. For a two-class problem, entropy ranges from a minimum of 0 to a maximum of 1.
For example, let’s say we have an attribute LoanApproval, with 6 Yes and 8 No, then
Entropy(LoanApproval) = Entropy(8,6).
= E(0.429, 0.571)
= - (0.428 log2 (0.428) – (0.571log2(0.571)
= 0.9854
The entropy of two attributes would be as follows:
E A, X P c E c
cx
Purchase Frequency   Yes   No
High                 3     1
Medium               1     5
Low                  2     2
9.3.1.2 Information Gain
A decision tree consists of many branches and many levels. Information gain is the
decrease of entropy from one level to the next level. For example, in Figure 9-12, the tree
has three levels.
[Figure 9-12 (figure omitted): a root node with entropy 0.92 splits into two child nodes, N1 with entropy 0.62 and N2 with entropy 0.72, giving information gains of 0.3 and 0.2, respectively.]
The root node has an entropy of 0.92, node N1 has an entropy of 0.62, and node N2 has an entropy of 0.72; the entropy has decreased by 0.3 and 0.2, respectively. A higher entropy means the classes in a node are evenly mixed, and a lower entropy means the node is purer, dominated by one class. The decision tree algorithm uses this measure to construct the tree structure: it is all about finding the attribute that returns the highest information gain (i.e., the attribute that makes the split decision easiest at each branch and node). The purity of the subset partition depends on the value of entropy. The smaller the entropy value, the greater the purity.
In order to select the decision-tree node and attribute to split the tree, we measure
the information provided by each attribute. Such a measure is referred to as a measure
of the goodness of split. The attribute with the highest information gain is chosen as the
test attribute for the node to split. This attribute minimizes the information needed
to classify the samples in the recursive partition nodes. This approach of splitting
minimizes the number of tests needed to classify an object and guarantees that a simple
tree is formed. Many algorithms use entropy to calculate the homogeneity of a sample.
Let N be a set consisting of n data samples, and let k be the class attribute, with m distinct class labels Ci (for i = 1, 2, 3, … m).
The information gain obtained by splitting the set S on an attribute A is calculated as follows:

Gain(A) = Entropy(S) - E(A, X)
Gain(A) is the difference between the entropy of the set and the weighted entropy after splitting on attribute A; it is the expected reduction in entropy caused by that attribute. The attribute with the highest information gain is chosen as the root node for the given set S, and branches are created for each of its values in the resulting partition.
ID  Purchase Frequency  Credit Rating  Age     Loan Approval
1   Medium              OK             < 35    No
2   Medium              Excellent      < 35    No
3   High                Fair           35–40   Yes
4   High                Fair           > 40    Yes
5   Low                 Excellent      > 40    Yes
6   Low                 OK             > 40    No
7   Low                 Excellent      35–40   Yes
8   Medium              Fair           < 35    No
9   Low                 Fair           < 35    No
10  Medium              Excellent      > 40    No
11  High                Fair           < 35    Yes
(continued)
Loan Approval: Yes = 6, No = 8
We will use the recursive partitioning to build the tree. The first step is to calculate
overall impurity measures of all the attributes. Select the root node based on the purity
of the node. At each successive stage, repeat the process by comparing this measure for
each attribute. Choose the node that has minimum impurity.
In this example, Loan Approval is the target class that we have to predict by building the tree model; it has two class labels, Yes (approved) and No (denied). There are two distinct classes (m = 2): C1 represents the class Yes, and C2 corresponds to No. There are six samples of class Yes and eight samples of class No. To compute the information gain of each attribute, we first use the entropy formula to determine the expected information needed to classify a given sample.
Entropy(LoanApproval) = E(6/14, 8/14) = E(0.429, 0.571)
= -(0.429 \log_2 0.429 + 0.571 \log_2 0.571)
= 0.985
Next, compute the entropy of each attribute—Age, Purchase Frequency, and Credit
Rating. For each attribute, look at the distribution of Yes and No and compute the
information for each distribution. Let’s start with the Purchase Frequency attribute.
Having a frequency table for each helps the computation.
The first step is to calculate the entropy of each PurchaseFrequency category, as
shown in Figure 9-13.
Figure 9-13 (summarized). Frequency table of Purchase Frequency against Loan Approval and the entropy of each category:

Purchase Frequency   Yes   No    Entropy
High                 3     1     E(3,1) = 0.811
Medium               1     5     E(1,5) = 0.649
Low                  2     2     E(2,2) = 1.000
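As a quick check (a sketch, not the book's code), the weighted entropy of this split and the resulting information gain can be computed as follows; the per-category entropies agree with the figure above:

import numpy as np

def entropy(counts):
    p = np.array(counts) / sum(counts)
    p = p[p > 0]                       # ignore empty categories
    return -(p * np.log2(p)).sum()

e_root = entropy([6, 8])               # 6 Yes, 8 No -> ~0.985
e_split = (4/14) * entropy([3, 1]) \
        + (6/14) * entropy([1, 5]) \
        + (4/14) * entropy([2, 2])     # weighted entropy after the split -> ~0.796
gain = e_root - e_split                # Gain(Purchase Frequency) -> ~0.19
print(round(e_root, 3), round(e_split, 3), round(gain, 3))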
Similarly, compute the gain for the other attributes, Gain(Age) and Gain(Credit Rating). Whichever attribute has the highest information gain is selected as the root node for that partition, and branches are grown for each of its values. The decision tree continues to grow until all the attributes in the partition are covered.
Credit Rating has the highest information gain, so it is used as the root node, and branches are grown for each of its values. The next branch nodes are based on the remaining two attributes, Age and Purchase Frequency. Both have almost the same information gain, so either can be used as the split node for a branch. The final decision tree looks like Figure 9-14.
Rule 2: If Age < 35, Credit Rating = OK, Purchase Frequency is Low,
then Loan Approval = No.
Rule 3: If Credit Rating is OK or Fair, Purchase Frequency is High,
then Loan Approval = Yes.
Applying the previous rules to the new sample X = (Age > 40, Purchase Frequency = Medium, Credit Rating = Excellent) gives the predicted class Loan Approval = No.
There are two ways to limit the overfitting error. One way is to set rules to stop the
tree growth at the beginning. The other way is to allow the full tree to grow and then
prune the tree to a level where it does not overfit.
One method is to stop the growth of the tree and set some rules at the beginning
before the model starts overfitting the data. It is not easy to determine a good point for
stopping the tree growth. One popular method used is Chi-Squared Automatic Interaction
Detection (CHAID), which has been widely used in many open-source tools. CHAID
uses a well-known statistical test called a chi-squared test to assess whether splitting a
node improves the purity of a node and is statistically significant. At each node split, the
variables with the strongest association with the response variable are selected based on
the chi-squared test of independence. The tree split is stopped when this test does not
show a significant association. This method is more suitable for categorical variables, but it can be adapted by transforming continuous variables into categorical bins.
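To make the idea concrete, the statistical test CHAID relies on is the chi-squared test of independence. The sketch below (not from the book) applies SciPy's version of the test to the Purchase Frequency versus Loan Approval counts used earlier in this chapter; a large p-value would argue against splitting on that attribute:

from scipy.stats import chi2_contingency

table = [[3, 1],   # High:   3 Yes, 1 No
         [1, 5],   # Medium: 1 Yes, 5 No
         [2, 2]]   # Low:    2 Yes, 2 No
chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value)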
The other method is to allow the tree to grow fully and then prune the tree. The
purpose of pruning is to identify the branches that hardly reduce the error rate and
remove them. The process of pruning consists of successively selecting a decision
node and redesignating it as a leaf node, thus reducing the size of the tree. The pruning process should remove branches that mainly reflect noise while retaining the branches that genuinely reduce misclassification error.
Most tools provide an option to select the size of the split and the method to
prune the tree. If you do not remember the chi-squared test, that’s okay. However, it
is important to know which method to choose and when and why to choose it. The
pruning tree method is implemented in multiple software packages such as SAS,
SPSS, C4.5, and other packages. We recommend you read the documentation of the
appropriate libraries and packages before selecting the methods.
[Figure (omitted): the training data is resampled into several bootstrapped training sets; a classifier is trained on each set, and the individual classifiers are aggregated into a single ensemble model.]
We generally do not have the luxury of having multiple training sets. Instead, we bootstrap by taking repeated samples from the same training set. In this approach, we generate Z different bootstrapped training data sets, train a model on each bootstrapped set, and average all the predictions. This method is called bagging. Though bagging improves overall predictive performance, it is often difficult to identify the variables that matter most to the procedure.
A random forest is a small tweak to the bagged tree. A random forest also builds each tree on a bootstrapped training sample, but each time a split is considered, it uses a random sample of m predictors from the full set of p predictors, where m ≈ √p; that is, the number of predictors considered for each split is approximately the square root of the total number of predictors. All predictors get a chance to participate, and the averaged model is more reliable, with better performance and a lower test error than bagging alone.
Random forests have low bias, and by adding more trees we can reduce variance and thus overfitting. Random forest models are relatively robust to the set of input variables and often require little data preprocessing. Research has shown that they are also efficient to build compared with other models.
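As an illustrative sketch (not the book's listing), a random forest with the m ≈ √p rule can be fitted in scikit-learn; X_train, y_train, and X_test are assumed to come from a train/test split like the ones used in this chapter:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500,     # number of bootstrapped trees
                            max_features="sqrt",  # consider ~sqrt(p) predictors per split
                            random_state=1)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))                   # accuracy on the held-out test data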
Table 9-4 lists the various types of classifiers and their advantages and disadvantages.
>
> #Step1: Set working directory
> setwd("E:/Umesh-MAY2022/Personal-May2022/BA2ndEdition/2ndEdition/Book
Chapters/Chapter 9 - Pred-Classification/Code/DecisionTree")
> #step 2: Read data
> attrition_df<-read.csv("attrdataDecisionTree.csv")
Step 2: Once we read the data, we explore the data set to check the type of the
variable, missing values, and null values that may exist in the data. Since the data set
is complete and clean, we will continue with the next step. The R tool has read and
assigned the correct data types. It has identified categorical and integer variables
properly; hence, we do not need to convert any variables’ data types.
>
Step 3: The next step is exploring the data to understand its distribution. You can plot various graphs and tables to check the distribution, skewness, etc. This should indicate how reasonable your assumptions are and how well the model might perform.
The target class is quite balanced, so you should aim for model performance higher than the majority-class proportion of about 54 percent. Similarly, all the other variables are fairly balanced, so the model can learn well; we expect it to perform beyond 54 percent.
> ##Step 3: Explore data. In this case we will look into how data
> # is distributed into different categories.
> table(attrition_df$Attrition)
No Yes
96 112
> prop.table(table(attrition_df$Attrition))
No Yes
0.4615385 0.5384615
> prop.table(table(attrition_df$WorkChallenging))
No Yes
0.5384615 0.4615385
> prop.table(table(attrition_df$WorkEnvir))
Excellent Low
0.5384615 0.4615385
> prop.table(table(attrition_df$Compensation))
Excellent Low
0.4038462 0.5961538
>
Step 4: The next step is to divide the sample into two sets: train and test. We use the caret library in R to perform this operation. Typically we use 80 percent of the sample to train the model and 20 percent to test it. As you can see from the following code, there are 208 records; 80 percent of them (167) are used to train the model, and the remaining 41 are used to test it.
> library(caret)
> set.seed(1234)
> data_partition<-createDataPartition(y=attrition_df$Attrition,
+ p=0.8, list=FALSE)
> train<-attrition_df[data_partition,]
> test<-attrition_df[-data_partition,]
> nrow(attrition_df)
[1] 208
> nrow(train)
[1] 167
> nrow(test)
[1] 41
> head(train)
Attrition YrsExp WorkChallenging WorkEnvir Compensation TechExper maritalstatus
2 No 2.0 Yes Excellent Excellent Excellent married
3 No 2.5 Yes Excellent Low Excellent single
4 Yes 2.0 No Excellent Low Excellent married
5 No 2.0 Yes Low Low Low married
6 Yes 2.0 No Low Low Excellent single
7 No 2.0 No Excellent Excellent Low married
education children ownhouse loan
2 graduate no yes yes
3 graduate no yes no
4 graduate no yes yes
5 undergraduate no yes no
6 graduate no no no
7 undergraduate no yes no
>
Step 5: Once we have the data split for training and testing, we create the model using the decision tree algorithm. We use the rpart package to create the decision tree and use information gain, as discussed earlier, to split the nodes. Please refer to the documentation for more details on the various input parameters of the rpart() function. (See https://cran.r-project.org/web/packages/rpart/rpart.pdf.)
> library(rpart)
> equation = Attrition~YrsExp+WorkChallenging+WorkEnvir+Compensation+TechExper+maritalstatus+education+children+ownhouse+loan
> attr_tree<-rpart(formula = equation,
+ data = train,
+ method = 'class',
+ minsplit=2,
+ parms = list(split = 'information')
+ )
>
Step 6: Summarize the model and plot the decision tree structure. The model
summary provides information about the split, variables it considered for the tree, and
other useful information.
> summary(attr_tree)
Call:
rpart(formula = equation, data = train, method = "class", parms =
list(split = "information"), minsplit = 2)
n= 167
Variable importance
YrsExp 20, WorkEnvir 15, WorkChallenging 13, maritalstatus 13, Compensation 12,
TechExper 9, education 7, ownhouse 7, children 2, loan 2
R provides many libraries to plot the structure of the final tree. Figure 9-17 is the final
tree using the rpart.plot function.
> library(rpart.plot)
> rpart.plot(attr_tree)
>
Step 7: Let’s apply our fully grown decision tree model to predict the new data (test
data). The model’s performance is measured by how well the model has predicted the
test data. The following code is to predict the test data, and the next step is to measure
the performance of the model:
Step 8: Let’s measure the performance of the model using a contingency table, also
called a confusion matrix.
tree_pred No Yes
No 19 0
Yes 0 22
Reference
Prediction No Yes
No 19 0
Yes 0 22
Accuracy : 1
95% CI : (0.914, 1)
No Information Rate : 0.5366
P-Value [Acc > NIR] : 8.226e-12
Kappa : 1
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.4634
Detection Rate : 0.4634
Detection Prevalence : 0.4634
Balanced Accuracy : 1.0000
'Positive' Class : No
The confusion matrix function in R provides the accuracy of the model, the false-positive rate, specificity, sensitivity, and other measures. In this case, the accuracy of the model is 100 percent. However, the model formula included all the variables, and we do not know whether the tree is underfitted or overfitted because we did not measure its performance on the training data. As an exercise, follow the same steps and functions to calculate the prediction performance on the training data and determine whether the model is overfitting or underfitting. Also, if you look at the tree itself, it did not use all the variables.
Step 9: We assume that the tree is fully grown, and we will demonstrate how to prune it. We will also plot the pruned tree (Figure 9-18). We will use the built-in prune() API function, which takes a fully grown tree model and applies the pruning method; in this case, the package offers the cp (complexity parameter) option. Refer to the rpart() documentation for the other methods and options (https://www.rdocumentation.org/packages/rpart/versions/4.1.16/topics/rpart). See Figure 9-18.
>
Step 10: We will predict the test data on the pruned tree model and check the
accuracy of the model.
Reference
Prediction No Yes
No 16 2
Yes 3 20
Accuracy : 0.878
95% CI : (0.738, 0.9592)
No Information Rate : 0.5366
P-Value [Acc > NIR] : 3.487e-06
Kappa : 0.7539
Sensitivity : 0.8421
Specificity : 0.9091
Pos Pred Value : 0.8889
Neg Pred Value : 0.8696
Prevalence : 0.4634
Detection Rate : 0.3902
Detection Prevalence : 0.4390
Balanced Accuracy : 0.8756
'Positive' Class : No
The pruned tree has an accuracy of 87 percent, and it has considered only four
variables for constructing the tree. Further, by working on other parameters of the
rpart() library, you can improve the performance of the model.
Finally, we will plot the receiver operating characteristic (ROC) curve to measure the performance of the model using sensitivity analysis. In R, use the ROCR library to plot the ROC graph and measure the AUC. See Figure 9-19.
[[1]]
[1] 0.9055024
Note To learn more about the rpart() library and options, please refer to the documentation: https://cran.r-project.org/web/packages/rpart/rpart.pdf.
# Preprocessing libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale
Step 2: Read the data set into a Pandas dataframe. All the data manipulations are
performed on the Pandas dataframe.
#Set the working directory where the dataset is stored before reading the dataset
data_dir = 'E:/Code/DecisionTree'
os.chdir(data_dir)
attrData_df = pd.read_csv("attrdataDecisionTree.csv")
attrData_df.head()
Attrition YrsExp WorkChallenging WorkEnvir Compensation TechExper \
0 Yes 2.5 No Low Low Excellent
1 No 2.0 Yes Excellent Excellent Excellent
2 No 2.5 Yes Excellent Low Excellent
3 Yes 2.0 No Excellent Low Excellent
4 No 2.0 Yes Low Low Low
Step 3: Explore the data. Check the data types, missing values, etc.
In this case, the data set is clean and there are no missing values. In reality that is rarely true, so as a first step you should always clean the data set. Unlike R, the Python scikit-learn libraries do not accept text values when building models, so we have to convert any text values to numerical values. In our data set, the variables are categorical and stored as text. We use the LabelEncoder() function to convert them to numerical values. We could also use the Pandas get_dummies() function or scikit-learn's OrdinalEncoder() function. Since the variables are nominal and have only two categories each (Yes/No, Married/Single, Graduate/Undergraduate), get_dummies() would give the same results. To learn when and how to use dummy variables, refer to the scikit-learn documentation. Typically we use LabelEncoder() for the target class and get_dummies() for the independent variables; because the variables in our data set have only two classes, we use LabelEncoder(), as both approaches produce the same end result.
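The encoding step itself is not reproduced above, so here is a minimal sketch (assuming the attrData_df dataframe read in Step 2 and the column names used in this example):

from sklearn.preprocessing import LabelEncoder

X = attrData_df.copy()
text_cols = ['Attrition', 'WorkChallenging', 'WorkEnvir', 'Compensation',
             'TechExper', 'maritalstatus', 'education', 'children',
             'ownhouse', 'loan']
le = LabelEncoder()
for col in text_cols:
    X[col] = le.fit_transform(X[col])   # e.g., No/Yes becomes 0/1
X.head()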
Step 4: This step includes data exploration and preparation.
maritalstatus education children ownhouse loan
0 1 1 0 0 0
1 1 0 0 1 1
2 2 0 0 1 0
3 1 0 0 1 1
4 1 1 0 1 0
Step 4a: Convert the object data types to categorical data types. Since our data set
variables are categorical and Python reads them as an object, we need to convert them to
categorical.
cat_vars = ['Attrition', 'WorkChallenging',
            'WorkEnvir', 'Compensation',
            'TechExper', 'maritalstatus', 'education',
            'children', 'ownhouse', 'loan']
for var in cat_vars:
    X[var] = X[var].astype('category', copy=False)
X.dtypes
Attrition category
YrsExp float64
WorkChallenging category
WorkEnvir category
Compensation category
TechExper category
maritalstatus category
education category
children category
ownhouse category
loan category
dtype: object
Step 4b: To understand the data distribution and how the variables influence the target class, explore the data using visualizations or tables. In the following example, we just use tables, as all our variables have only two classes.
X.Attrition.value_counts()
1 112
0 96
Name: Attrition, dtype: int64
X.WorkChallenging.value_counts()
0 112
1 96
Name: WorkChallenging, dtype: int64
X.TechExper.value_counts()
0 176
1 32
Name: TechExper, dtype: int64
X.Compensation.value_counts()
1 124
0 84
Name: Compensation, dtype: int64
Step 4c: In this step, prepare the data as required by the scikit-learn API. The DecisionTreeClassifier() function accepts X (the independent variables only) and Y (the target class only) as inputs. Hence, we separate Attrition from X, as shown.
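A minimal sketch of this separation (variable names follow the earlier steps; the output below shows the first two values of Y):

Y = X['Attrition']                          # target class only
X_NoAttrVar = X.drop(columns='Attrition')   # independent variables only
Y.head(2)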
0 1
1 0
Name: Attrition, dtype: category
Categories (2, int64): [0, 1]
Step 5: Split the data into train and test sets. We use 80 percent of the data for training and 20 percent for testing, a common rule of thumb in the industry. If you have lots of data, you can also split 75 percent and 25 percent.
# Split dataset into training set and test set. We will use the sklearn
# train_test_split() function
X_train, X_test, y_train, y_test = train_test_split(X_NoAttrVar, Y,
train_size = 0.8, random_state=1)
# 80% training and 20% test
X_train.shape
(166, 10)
y_train.shape
(166,)
X_test.shape
(42, 10)
y_test.shape
(42,)
Step 6: Create the decision tree model using the DecisionTreeClassifier() function, with entropy as the split criterion.

dec_model = DecisionTreeClassifier(random_state=10,
                                   criterion="entropy", max_depth=10)
dec_model_fit = dec_model.fit(X_train, y_train)  ## Train the model
DecisionTreeClassifier(criterion='entropy', max_depth=10, random_state=10)
dec_model_fit.get_n_leaves()
11
dec_model_fit.classes_
array([0, 1], dtype=int64)
Step 6a: Plot the tree model. Once you have the model, plot the decision tree and see
how many levels your model has. This helps you to understand the model and arrive at
the rules.
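The plotting code itself is not shown above; a minimal sketch using scikit-learn's plot_tree() (assuming dec_model_fit and X_train from the earlier steps) would look like this:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(14, 8))
plot_tree(dec_model_fit,
          feature_names=list(X_train.columns),   # column names of the predictors
          class_names=['No', 'Yes'],             # 0 -> No, 1 -> Yes
          filled=True)
plt.show()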
Step 7: The final step is to predict the test data, without the labels, from the decision
tree model we just created and measure the model accuracy.
9.7.2 Making Predictions
predicted = dec_model_fit.predict(X_test)
Step 8: Measure the accuracy of the model using a contingency table (also referred
to as a confusion matrix). The classification_report() function also provides a
sensitivity analysis report including precision, recall, and f-score.
print(classification_report(y_test, predicted))
precision recall f1-score support
0 1.00 1.00 1.00 21
1 1.00 1.00 1.00 21
accuracy 1.00 42
macro avg 1.00 1.00 1.00 42
weighted avg 1.00 1.00 1.00 42
print(confusion_matrix(y_test, predicted))
[[21 0]
[ 0 21]]
print("Accuracy:",metrics.accuracy_score(y_test, predicted))
Accuracy: 1.0
dec_model_2= DecisionTreeClassifier(random_state=0,
criterion="entropy", max_depth=4)
dec_model_fit_2 = dec_model_2.fit(X_train, y_train) ## Train the model
##Predict Test Data
predicted_2 = dec_model_fit_2.predict(X_test)
Step 9: Measure the performance of the “pruned tree” and check the accuracy of
the model. The accuracy of the pruned tree model is only 95.23 percent. Finally, plot the
pruned tree using the plot_tree() function.
precision recall f1-score support
0 1.00 0.90 0.95 21
1 0.91 1.00 0.95 21
accuracy 0.95 42
macro avg 0.96 0.95 0.95 42
weighted avg 0.96 0.95 0.95 42
Accuracy: 0.9523809523809523
Step 10: Plot the ROC curve and find out the AUC measure as well. See Figure 9-20.
roc_auc_score(y_test,dec_model_fit_3.predict_proba(X_test)[:,1])
0.9761904761904762
# Compute fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, predicted_3)
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate or (1 - Specificity)')
plt.ylabel('True Positive Rate or (Sensitivity)')
plt.title('Receiver Operating Characteristic')
9.8 Chapter Summary
The chapter explained the fundamental concepts of the classification method in
supervised machine learning and the differences between the classification and
prediction models.
We discussed various classification techniques such as k-nearest neighbor, naïve
Bayes, and decision tree. We also touched upon ensemble models.
This chapter described the decision tree model, how to build the decision tree,
how to select the decision tree root, and how to split the tree. You saw examples of
building the decision tree, pruning the tree, and measuring the performance of the
classification model.
You also learned about the bias-variance concept with respect to overfitting and
underfitting.
Finally, you explored how to create a classification model, how to measure the
performance of the model, and how to improve the model performance, using R
and Python.
CHAPTER 10
Neural Networks
Though neural networks have been around for many years, advances in technology and computational power have recently made them popular, and today they often perform better than other machine learning algorithms. In this chapter, we will discuss using neural networks and associated deep neural network algorithms to solve classification problems.
This is possible because the human brain can process every detail and variation with
the help of millions of neurons based on the past memory of a similar object or pattern
and experience. Millions of neurons and billions of connections connect the brain
and eyes. This process happens seamlessly, as the brain is like a supercomputer and
able to process complex information within microseconds. Hence, you can recognize
patterns presented to you visually almost instantaneously. Neurons are the structural
and functional units of the nervous system. The nervous system is divided into a central
and peripheral nervous system. The central nervous system (CNS) is composed of the
brain and its neuronal connections. Neurons transmit signals between each other via
junctions known as synapses. Synapses are of two types, electrical and chemical. The
CNS neurons are connected by electrical synapses, as shown in Figure 10-2.
The neurons act via action potentials with the help of voltage-gated and ligand-gated
ionic channels. Action potentials activate the neuron and transmit signals in synapses
using neurotransmitters. The neurotransmitters can excite or inhibit other neurons. The
main seat of memory in the CNS is the hippocampus. The hippocampus is important for
the conversion of short-term memory to long-term memory. This conversion is known
10.2.1 Perceptrons
Let’s start the discussion by understanding how a perceptron works. Once we
understand how a perceptron works, we can learn to combine multiple perceptrons to
form a feed-forward neural network. At its core, a perceptron is a simple mathematical
model that takes a set of inputs and does certain mathematical operations to produce
computation results. A perceptron with the activation function is similar to a biological
neuron; the neurotransmitters can excite or inhibit other neurons based on the electrical
signals by sensors. In this case, the perceptron with its activation function has to decide
whether to fire an output or not based on the threshold values.
A perceptron can take several input values and produce a single output. In the
example shown in Figure 10-3, the perceptron takes two inputs, x1 and x2, and produces
a single output. In general, it could have more than two inputs. The output is the
summation of weights multiplied by the input x.
Figure 10-3. A perceptron with two inputs, x1 and x2, weights w1 and w2, and an activation that produces Output = x1·w1 + x2·w2; the output is 1 if x1·w1 + x2·w2 > threshold and 0 if x1·w1 + x2·w2 <= threshold
Rosenblatt proposed a function to compute the output. The output Y depends on the inputs x1 and x2, with weights w1 and w2 corresponding to x1 and x2. The neuron triggers an output of 0 or 1 determined by the weighted summation of w and x, W·X = Σ wk·xk (summing over k = 0 to n), compared against a threshold value. To put this in perspective, here it is expressed as an algebraic equation:

Output = 0 if Σ xk·wk ≤ threshold
Output = 1 if Σ xk·wk > threshold      (1)
The output is 0 if the sum of all the weights multiplied by input x is less than some
threshold, and the output is 1 if it is greater than some threshold.
The mathematical function is called the activation function. In this case, the
activation function is the summation function of all the inputs and corresponding
weights, and it fires 0 or 1 based on the threshold set, as shown in the previous equation.
Since the output of the neuron is binary, it can be treated as a simple classification model. For example, using this model, you could predict whether to play golf based on rain, x1, and weather outlook, x2. Depending on the importance of each feature, corresponding weights can be assigned. In this case, as shown in Figure 10-4, assign x1 = 1 for a sunny day and 0 for a rainy day, x2 = 1 if the outlook is sunny and 0 if the outlook is cloudy, and weights w1 = 5 and w2 = 3. Then, if x1 = 0 and x2 = 0, the weighted sum is below the threshold and you will not play golf; if x1 = 1 and x2 = 0, the weighted sum exceeds the threshold and you would play golf, and so on.
Figure 10-4. A perceptron F(x, w) for the play-golf example, with inputs x1 and x2, weights w1 and w2, and an activation producing output y; the accompanying table lists the (x1, x2) input combinations and the resulting output
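To make the computation concrete, here is a minimal R sketch of the perceptron just described. The weights follow the golf example (w1 = 5, w2 = 3); the threshold value of 4 is an assumption chosen only for illustration, since the text does not fix one.

# A single perceptron: fires 1 if the weighted sum of the inputs exceeds a threshold
perceptron <- function(x, w, threshold) {
  if (sum(w * x) > threshold) 1 else 0
}

w <- c(5, 3)        # w1 = 5 (day), w2 = 3 (outlook)
threshold <- 4      # assumed threshold for illustration

perceptron(c(0, 0), w, threshold)   # rainy day, cloudy outlook -> 0 (do not play)
perceptron(c(1, 0), w, threshold)   # sunny day, cloudy outlook -> 1 (play)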
By rearranging the terms and replacing the threshold with a bias, b, the equation simplifies as follows:

Output = 0 if w·x + b ≤ 0
Output = 1 if w·x + b > 0      (2)
The term b is called the bias. The bias determines how easy it is for the perceptron to fire: with a large positive bias, it is easy for the perceptron output to be 1, whereas with a large negative bias it is difficult to fire the output. Henceforth, we will not use the term threshold; instead, we use bias. The perceptron fires 0 or 1 based on both the weights and the bias. For simplicity we can keep the bias fixed and optimize the weights. How do we tune the weights and biases in response to external stimuli without intervention? We can devise a learning algorithm that automatically tunes the weights and biases of an artificial neuron. This is the basic idea of the neural network algorithm.
Footnote: Those unfamiliar with basic logic operations such as AND, OR, NOT, and XOR should refer to a book on digital electronics.
The previous example demonstrates how a perceptron can implement a logical operator, one of the most common problems in networks and circuits. However, more complex problems are solved by using multiple neurons and more complex functions that produce the output by tuning weights and biases without direct intervention by a programmer. In the following section, let's explore the neural network architecture with multiple neurons and how the activation function triggers the output. We will also explore the algorithm that tunes the weights and biases automatically.
The most widely used neural networks have been multilayer feed-forward networks. The basic neural network architecture consists of multiple neurons organized into multiple layers, as shown in Figure 10-6. The leftmost layer in this network is called the input layer, and the neurons within this layer are called input neurons. The middle layer is called a hidden layer: the neurons in this layer take their inputs from the previous layer and pass their outputs to the next layer, without receiving direct input or producing direct output. There can be multiple hidden layers in the network as the network learning deepens. Figure 10-6 is a three-layer neural network, whereas Figure 10-7 is a four-layer network with two hidden layers.
Figure 10-6. A three-layer feed-forward network with inputs x1, x2, and x3, hidden neurons h1 through h4, and one output neuron; each hidden neuron computes hi = Σ(wij · xi) + bi, and the output neuron computes Oi = σ(Σ(wij · Hi) + bi), whose output ranges from 0 to 1
The last layer is called the output layer, which contains the output neuron. There can be a single output neuron, or more than one, depending on the problem. In general, each neuron is called a node, and each link connection is associated with a weight and a bias. Weights are similar to the coefficients in linear regression and are subjected to iterative adjustment; biases play a role similar to the intercept term and are likewise adjusted as the network learns.
Figure 10-7. A four-layer network with inputs x1, x2, and x3, two hidden layers (hidden layer 1 and hidden layer 2), and an output layer; as before, each hidden neuron computes hi = Σ(wij · xi) + bi and the output neuron computes Oi = σ(Σ(wij · Hi) + bi), with output between 0 and 1
x1, x2, x3, …, xn are the inputs to the hidden-layer neurons; the output of each hidden neuron is the weighted sum of its inputs plus the bias:

hi = Σ(wij · xi) + bi

Similarly, the output is the activation applied to the weighted sum of the hidden-neuron outputs plus the bias:

Oi = σ(Σ(wij · Hi) + bi)
These are represented as vectors. The output depends on the activation function.
The most common activation function used in the output is a sigmoid. We will discuss
different types of activation functions in the next section. Overall, the learning happens
from one layer to the next in a sequential pattern. Each layer attempts to learn the
minute details of the feature to predict the outcome as accurately as possible.
10.3 Learning Algorithms
The purpose of a neural network algorithm is to capture the complicated relationship between the response variable and the predictor variables much more accurately. For instance, in linear regression we assume that the relationship between the response variable and the predictor variables is linear and that the errors are normally distributed. In many cases this is not true, and the relationship may be unknown. To overcome this, we adopt several transformations of the data to make it normal. In the case of neural networks,
no such transformation or correction is required. The neural network tries to learn such
a relationship by passing through different layers and adjusting the weights. How does
learning happen, and how does a neural network capture and predict output? We will
illustrate this with an example of a simple data set. This data set is only for the purpose
of explaining the concept, but in practice, the features are much more complex, and the
data size is also larger. In the example demonstration, we will be using a larger data set to
demonstrate how to create the neural network model.
Let’s design a simple neural network model to understand how a neural network
predicts attrition. Figure 10-8 is our neural network model consisting of three input
neurons, four hidden-layer neurons, and one output neuron. For simplicity, we will
name each node. Nodes 1, 2, and 3 are input-layer nodes; nodes 4, 5, 6, and 7 belong to
the hidden layer, and node 8 is the output layer. Each hidden neuron is connected to the input neurons by arrows labeled with weights, denoted wij for the connection from node i to node j, and each node has an additional bias, denoted bi.
Figure 10-8. A neural network for predicting attrition. The input nodes are 1 (YrsExpr), 2 (AnnuSalary), and 3 (SkillLev); the hidden nodes h4 through h7 have biases b4 through b7 and receive weights w14 through w37 from the inputs; the output node o8 has bias b8 and receives weights w48 through w78 from the hidden nodes. Each hidden node computes Hi = Σ(wij · xi) + bi, the output node computes Oi = Σ(whij · Hi) + bi, and the activation function is the sigmoid, 1/(1 + e^-x)
The objective is to predict the output, attrition, which is 0 or 1, based on the input
training data. The output neuron triggers 0 or 1 in response to the external stimuli
without any intervention. The network is provided with the known class labels, and the
network should “guess” the output. After “guessing,” the error is computed, and the weights are adjusted according to the error until the error is minimized.
Initially, the weights (w0, w1, w2, …, wn) and biases are initialized to some random values. The error function is as follows:

E = Σi (Yi − f(wi, Xi))²      (3)

where Yi is the actual value of the class, and f(w, x) is the prediction as a function of the weights and the input data. This is a minimization problem whose objective function is the error function in equation 3; the objective is to find the weights that minimize the error.
The output of each hidden node is the sum product of the input values x1, x2, x3 … xn
and the corresponding weights w1, w2, … wn and the bias b represented mathematically
as follows:
hi = Σ(wij · xi) + bi      (4)
where the weights are initially set to some random values. These weights are automatically adjusted as the network “learns.” The output of O8 is the sum product of the input values h1, h2, h3, …, hn and the corresponding hidden-to-output weights, plus the bias b, as shown in the figure, represented mathematically as follows:

Oi = Σ(whij · Hi) + bi      (5)
If the network is learning, then we should see changes in the output for any small changes in the weights (or biases). If a small change in a weight (or bias) causes only a small change in the output, we can use this fact repeatedly to nudge the weights and biases so that the network behaves the way we want and predicts as accurately as possible. Since the perceptron output is binary, 0 or 1, a small change can abruptly flip the output, which makes it difficult to gradually modify the weights and biases so that the network gets closer to the desired output. This problem can be resolved by introducing what is called a sigmoid function as the activation function. The output of the sigmoid for the given inputs x1, x2, …, xn, instead of being exactly 0 or 1, can take any value between 0 and 1, and you can then set a threshold to decide between 0 and 1.
The sigmoid function is given as follows:
g(z) = 1 / (1 + e^-z)      (6)

The output of a neuron of a neural network is a function of the weights and bias, f(w·x, b), given by the following:

g(z) = 1 / (1 + exp(−Σ(wij · xi) − bi))      (8)
Suppose z = w·x + b is a large positive number; then e^-z ≈ 0, so g(z) ≈ 1. In other words, when z is large and positive, the output from the sigmoid neuron is close to 1; on the other hand, if z is very negative, then the output of the neuron is close to 0. Figure 10-9 shows the shape of the sigmoid function.
Using the input values and the initial weights and biases shown in Figure 10-10, the output of hidden node 4 is calculated as follows:

H4 = 1 / (1 + exp(−(0.05·2.5 − 0.07·4 + 0.01·0.5 − 0.01)))      (9)
   = 0.4600
Figure 10-10. The attrition network with the input values YrsExpr = 2.5, AnnuSalary = 4, and SkillLev = 0.5; randomly initialized input-to-hidden weights (W14 = 0.05, W15 = -0.01, W16 = 0.02, W17 = 0.01, and so on), hidden-to-output weights (0.06, -0.04, 0.05, 0.015), and biases (B4 = -0.01, B5 = -0.01, B6 = -0.02, B7 = -0.05, B8 = -0.12); and the resulting output Oi = 0.4807
h4          h5          h6          h7
0.460085    0.461327    0.544879    0.434135
The output of the output node is calculated using the following formula:
Oi = Σ(whij · Hi) + bi      (10)

Just like before, the input to output node 8 comes from the hidden-layer neurons, and it is given by the following:

O8 = 1 / (1 + exp(−(0.06·0.4600 − 0.04·0.4613 + 0.05·0.5448 + 0.015·0.4341 − 0.12)))      (11)

O8 = 0.480737
Note that if there is more than one hidden layer, the same calculations are applied,
except the input values for the subsequent hidden layer would be the output of the
previous hidden layer.
Finally, to classify (predict the output) for this training record, we can use some cutoff value, say, 0.5. If the output is less than 0.5, the predicted class is 0, and if the output is greater than 0.5, it is 1. For this model and the given data, the output is 0 because 0.4807 < 0.5.
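The forward pass just described can be reproduced with a few lines of R. This is only a sketch: the weights and bias for hidden node 4 are taken from equation 9, the hidden-layer outputs and the hidden-to-output weights are taken from equation 11, and the remaining values from Figure 10-10 are omitted.

sigmoid <- function(z) 1 / (1 + exp(-z))

# Inputs from Figure 10-10: YrsExpr = 2.5, AnnuSalary = 4, SkillLev = 0.5
x  <- c(2.5, 4, 0.5)

# Hidden node 4: weights w14, w24, w34 and bias b4 (as in equation 9)
w4 <- c(0.05, -0.07, 0.01)
b4 <- -0.01
h4 <- sigmoid(sum(w4 * x) + b4)     # about 0.4600

# Output node 8: hidden outputs h4..h7, weights w48..w78, and bias b8 (as in equation 11)
h  <- c(0.4600, 0.4613, 0.5448, 0.4341)
w8 <- c(0.06, -0.04, 0.05, 0.015)
b8 <- -0.12
o8 <- sigmoid(sum(w8 * h) + b8)     # about 0.4807, classified as 0 with a 0.5 cutoff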
We have discussed how the neural network output is generated using inputs from
one layer to the next layer. Such networks are called feed-forward neural networks;
information is always fed forward to the next layer. In the next section, we will discuss
how to train the neural network to produce the best prediction results.
10.3.4 Backpropagation
The estimation of weights is based on errors. The errors are computed at the last layer
and propagate backward to the hidden layers for estimating (adjusting) weights. This
is an iterative process. The initial weights are randomly assigned to each connected
neuron, as shown in Figure 10-10, and for the first training record, the output is
computed as discussed earlier. Let's say the predicted output is Ŷ, and the actual class is Y. The error is as follows:

Error = (Yi − Ŷi)      (12)
For the next iteration, the new weight should be the previous weight plus the small
adjustment Δ.
The learning rate (LR) is the fine-tuning factor: the higher the learning rate, the larger the weight adjustment, and vice versa. It is a constant and can take a value from 0 to 1.
The prediction for the next training record depends on the LR and the input, and this iterative process continues until all the records are processed. In our example, for the first observation the output is 0.4807, and the error is 1 − 0.4807 = 0.5193. This error is used to adjust the weights before the next training record is processed, as in the previous equations 1–6. The weights are updated again after the second observation is passed through the network, and the process continues until all the observations in the training set are used. One complete pass through the training set is called an epoch (also loosely referred to as an iteration or cycle). Typically, many epochs are needed before the optimal weights and biases are achieved and the network performs at its best. When training in batches, the entire batch of training data is run through the network before the weights are updated.
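As a rough illustration of the kind of adjustment the backpropagation step makes, the sketch below applies a simple delta-rule-style update to one weight. The learning rate and the update formula here are assumptions for illustration only; they are not the exact rule used by the neuralnet package.

# Illustrative update: new weight = old weight + LR * error * input activation
lr      <- 0.1            # assumed learning rate
error   <- 1 - 0.4807     # actual class (1) minus predicted output (0.4807)
h4      <- 0.4600         # activation feeding the weight being updated

w48_old <- 0.06
w48_new <- w48_old + lr * error * h4   # slightly larger weight, pushing the output toward 1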
Figure 10-11 illustrates the backpropagation algorithm that has just been explained. The weight updates and the network optimization process stop when one of the following conditions is met:
• The error rate has reached the threshold where any further updates to the weights make no difference.
Figure 10-11. The backpropagation algorithm: the activation function produces the output, the errors are calculated at the output layer, and the errors are propagated backward through the network
For the previous example of predicting attrition, we calculate the optimized weights using an R program. Figure 10-12 shows the optimized weights (and biases); the final weights and biases are labeled on the connections of each node. This process took 1,780 iterations to find the optimized weights.
For these optimized weights and bias, the model-predicted output and the actual
values are as shown in Figure 10-13.
Since the predicted output consists of probability values, we use a threshold to decide the final output value of 0 or 1. We will treat a value of 0.5 or higher as 1 and anything less than 0.5 as 0. As you can observe from the truth table in Figure 10-14, the predicted values for five records are correct and match the actual values. However, the fourth record's prediction is wrong: our neural network has predicted 0, whereas the actual value is 1. This is called a prediction error. The model can be further tuned to improve the accuracy. We will learn more about tuning the model in the next section.
10.4 Activation Functions
In our earlier discussion, we mentioned the activation function. Activation functions
are necessary to trigger the response for a given input. At every stage of the neural
network layers, the output of the neuron depends on the input from the previous
layer’s neurons. The output of each layer’s neuron is the sum of products of the inputs
and their corresponding weights. This is passed through an activation function. The
corresponding output of the activation is the input to the next layer’s neuron and so on.
The sum of products of the weights and inputs is simply a linear function, just a polynomial of degree one. If no activation function is used, the model acts as a simple linear regression model. However, the data is not always linear; a simple linear regression model will not learn all the details of complex data such as videos, images, audio, speech, and text, resulting in poor model performance. For this reason, we use an activation function. The activation function takes the high-dimensional, nonlinear input data and transforms it into a more meaningful output signal, which is fed into the next layer. Activation functions are applied at every stage so that the final output of the model provides accurate predictions based on what it has learned at each layer. In short, activation functions give neural networks their nonlinear property.
There are different types of activation functions available for a neural network. We
will discuss the commonly used activation functions. These activation functions are
sometimes referred to as threshold functions or transfer functions since they force the
output signal to a finite value.
10.4.1 Linear Function
The linear function is the simplest activation function, represented by the following formula:

Y = f(x) ∝ x

The output is directly proportional to the input x and can range from −∞ to ∞. Figure 10-15 shows the linear activation function; the input and output follow a linear relationship.
10.4.2 Sigmoid Function
The sigmoid (logistic) function squashes its input into the range 0 to 1:

f(x) = 1 / (1 + e^-x)
10.4.3 Tanh Function
The hyperbolic tangent (tanh) function is an extension of the sigmoid (logistic) function. It is similar to the sigmoid except that it is symmetric around the origin, and its gradients are not restricted to vary in a single direction, as shown in Figure 10-17.
The tanh function behaves much like the sigmoid function. The main difference is that the tanh function pushes the input values into the range -1 to 1 instead of 0 to 1.
10.4.4 ReLU Function
The rectified linear unit (ReLU) function is defined as follows:

f(x) = 0 when x < 0
f(x) = x when x ≥ 0

As you can see from Figure 10-18, it is similar to a linear function that outputs the input directly if it is positive; otherwise, it outputs zero.
10.4.5 Softmax Function
The softmax function is defined as follows:

f(x)j = e^(xj) / Σk e^(xk),   for j = 1, 2, …, k
where the xj values are the elements of the input vector and can take any real value. The denominator is a normalization term that ensures all the output values sum to 1 and therefore constitute a valid probability distribution.
When we have a multiclass problem, the neural network model output layer
will have the same number of neurons as the number of target classes. The softmax
activation function returns the probability for every data point of all the individual
classes.
• Sigmoid and softmax functions are generally used at the output layer
for classification problems.
• ReLU is used only inside hidden layers but not at the output layer.
• Tanh has shown better results than sigmoid and can be used in both
output and hidden layers.
There are many other, more recently developed activation functions, including leaky ReLU, swish, and the exponential linear unit (ELU). They are not as common as the others. Also note that not all activation functions are supported by every library you may intend to use.
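For reference, the activation functions discussed in this section can be written in a few lines of R; this is just a sketch to show their behavior on a small vector of inputs.

sigmoid <- function(x) 1 / (1 + exp(-x))
relu    <- function(x) pmax(0, x)
softmax <- function(x) exp(x) / sum(exp(x))   # vector in, probabilities out
# tanh() is built into R

z <- c(-2, 0, 3)
sigmoid(z)   # values between 0 and 1
tanh(z)      # values between -1 and 1
relu(z)      # 0 for negative inputs, x otherwise
softmax(z)   # nonnegative values that sum to 1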
Job attrition may be related to many reasons and factors, but in this specific case,
we have data consisting of six variables. Attrition represents whether the employee
has exited the organization or is still in the organization (Yes for Exit and No for
Currently continuing in the organization). Yrs_Exp represents the experience of the
employee at this point of time (in years), Work_Challenging represents whether the
work assigned to the employee is challenging or not, Work_Envir represents whether
the work environment is Excellent or Low, Compensation represents whether the compensation is Excellent or Low, and Tech_Exper represents the employee's technical expertise (Excellent or Low). The data is for the last two years and pertains only to those employees with between 2 and 5 years of experience.
As you can observe from the data set, there are attributes that carry the same information and may be correlated. With a neural network, unlike with some other algorithms, we do not need to worry about feature selection or removing features, so we will create the neural network model without removing any variables. By nature, a neural network adjusts its weights and biases based on what it learns from the data.
10.5.1.1 Exploring Data
The data is extracted from the comma-separated value text file AttrData_NN.txt. This
data set has only 52 records.
Step 1: Read the data set and check the data summary using the summary()
function of R.
Here is the input:
As you can see, the Attrition field has 28 “Yes” values, which means that these
employees have exited the organization, and 24 “No” values, which means that these
employees are still continuing in the organization. You can also see that 28 employees
have not been assigned “Challenging Work” and 24 employees have been assigned
“Challenging Work.” Twenty-eight of the employees are working in teams where
the Work Environment is considered as excellent, whereas 24 are working in teams
where the work environment is not that great (here marked as Low). Twenty-one of
the employees have excellent compensation at or above the market compensation
(known here as Excellent), whereas 31 have a compensation that is below the market
compensation or low compensation (known here as Low). Of all the employees, 44 have
excellent technical expertise, whereas 8 others have Low technical expertise.
Ideally, when the organization is providing challenging work to an employee, work
environment within the team is excellent, compensation is excellent, and technical
expertise of the employee is high, then there should be low chance for the Attrition.
10.5.1.2 Preprocessing Data
Before we create a model, we should check the data types of the attributes and convert
them to proper data types (numerical, logical, categorical, etc.) if required. We use str()
to check the data type. R reads all the variable types properly. We do not have to convert
any data types.
Here is the input:
The other task is to meet the input data requirements specified by the library you
are using. In this particular example, we are using the neuralnet() R package. This
package (in general, any neural network) expects the input to be all numerical values.
Thus, as part of preprocessing, the next step of the model-building process is to prepare
the input data. We will use model.matrix() to convert all our categorical values (Yes/No,
Excellent, Low, etc.) to numerical values, as shown in Figure 10-21.
> data_df_mx<-as.data.frame(model.matrix(~WorkChallenging+WorkEnvir+Compensation+TechExper+YrsExp+Attrition, data=data_df))
> head(data_df_mx)
Just to keep all the values consistent, we will scale the YrsExp variable between 0 and 1 using a custom scale01() function, as shown in Figure 10-22. Before we do that, note that the model.matrix() function created an additional intercept column; it has no significance in our model, and hence we will remove it.
Here is the input:
> data_df_2<-as.data.frame(data_df_mx[,c(-1)])
> #Scale YrsExp variable using custom scale function
> scale01<-function(x) {
+ (x-min(x))/(max(x)-min(x))
+ }
> data_df_2$YrsExp<-scale01(data_df_mx$YrsExp)
> head(data_df_2)
> library(caret)
> set.seed(1234)
> data_partition <- createDataPartition(data_df_2$AttritionYes,
+ p=0.8,list=FALSE)
> train<-data_df_2[data_partition,]
> test<-data_df_2[-data_partition,]
> train<-as.data.frame(train)
> head(test)
Figure 10-23. Data preparation, dividing data into two parts (test and train)
There are other function parameters that can be used based on the problem you
want the neural network to solve. The details of the input parameters are listed in
the documentation, and you can refer to the documentation to learn more about the
neuralnet() function (https://www.rdocumentation.org/packages/neuralnet/
versions/1.44.2/topics/neuralnet).
Here is the input:
The objective of the neural network model is to predict the outcome of attrition
based on the five parameters in the data. In this example, the model is a simple network
with two hidden layers and five neurons corresponding to five input variables. This is
specified by the hidden parameter of the neuralnet() function. Since this is a simple
binary classification problem, we use the backprop algorithm to create the model.
Figure 10-25 shows the architecture of the model.
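The call that produced this model appears only in a figure, so the following is a sketch of how such a model could be specified with neuralnet(). The dummy-coded column names are those produced by the earlier model.matrix() step, and the hidden-layer sizes and learning rate are assumptions based on the description of the architecture, not values taken from the original code.

library(neuralnet)

# Sketch: two hidden layers (sizes assumed), backpropagation, logistic output
model_nn <- neuralnet(AttritionYes ~ WorkChallengingYes + WorkEnvirLow +
                        CompensationLow + TechExperLow + YrsExp,
                      data = train,
                      hidden = c(5, 5),        # assumed layer sizes
                      algorithm = "backprop",
                      learningrate = 0.01,     # required when algorithm = "backprop"
                      linear.output = FALSE)
plot(model_nn)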
The plot() function provides the model’s network architecture and the final weights
and biases displayed on each link connected to each neuron, as shown in Figure 10-25. If
you want to extract the model’s weights and biases, run the following command.
Here is the input:
> model_nn$weights
The model is trained with 80 percent of 52 records, which is only around 42. As it
passes through each iteration, weights and biases are adjusted based on the final output-
layer error as discussed in our previous sections. Since the training data has 42 records,
one epoch consists of 42 iterations. Since we have not specified how many epochs the
program should run to optimize, it runs through iterations until it reaches optimum
performance conditions and breaks the loop. The resulting error rates, optimal weights,
and biases achieved for the last iteration of training the neural net on this data are shown
in Figure 10-26. The neuralnet package uses the entire data set to calculate the gradients and update the weights, repeating until convergence or until stepmax steps are reached. The rep parameter is the number of repetitions of the training.
10.5.1.6 Summary Report
The summary report prints the actual versus predicted values. Since the output of the neural network consists of probability values ranging between 0 and 1, we have to set a threshold value to convert the output to either 0 or 1 in the case of a classification model. In our example, we have set a threshold of 0.8: any value greater than 0.8 is converted to 1, and any value less than 0.8 to 0.
Here is the input:
> ### Copying results to a dataframe - Actual vs. Predicted
> # pred_results is assumed to hold the model's predictions on the test set
> comp_results <- data.frame(actual = test$AttritionYes,
                             predicted = pred_results$net.result)
> comp_results
> thr <- function(x) {
+   if (x > 0.8) { return(1) } else { return(0) } }
> comp_results['predicted'] <- apply(comp_results['predicted'], 1, thr)
> attach(comp_results)
> table(actual, predicted)
As you can see from Figure 10-28, the predicted values > 0.8 are considered 1 and
<0.8 as 0, and accordingly, the actual versus predicted matrix is displayed.
Step 1: Import all the libraries necessary to create the neural network model
and measure the accuracy of the model. pandas and NumPy are required for data ingestion and data manipulation, Matplotlib is for plotting, and os provides the basic operating-system utilities. We will also import the scikit-learn packages essential for building the neural network model and measuring the model performance: confusion_matrix(), roc_curve(), classification_report(), etc.
Here is the input:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
##IPython display
from IPython.display import display
Step 2: Read the data file AttrData_NN.csv to a Pandas dataframe and print the
summary.
Here is the input:
data_dir='E:/umesh/Dataset/NN'
filename = "AttrData_NN.csv"
os.chdir(data_dir)
data_df = pd.read_csv(filename)
print(data_df.shape)
Step 3: Explore the data. Check different attributes’ data distribution and data types.
You should use all the techniques you have learned in earlier chapters including data
normalization, data preprocessing, data type conversion, etc. You can gain insight into
the data and the data distribution, distribution of categorical variables, etc., just by
exploring the data.
Here is the input:
data_df_sub=data_df.select_dtypes(include=['object'])
for c in data_df_sub.columns:
display(data_df_sub[c].value_counts())
Step 5: Prepare the data as per the API requirements. The neural network API does
not accept text values. Hence, convert data types with any strings into numerical values
using LabelEncoder().
Note If you have more than two levels in your categorical variables, you should create dummy variables for those predictors using one-hot encoding (for example, pandas get_dummies() or scikit-learn's OneHotEncoder()). Since our data has only two levels for each variable (Yes/No, Low/Excellent, etc.), we are using the LabelEncoder() function for everything, including the target variable.
# X is assumed to be a copy of data_df created in an earlier (not shown) step
le = LabelEncoder()
var = ['Attrition','WorkChallenging','WorkEnvir','Compensation','TechExper']
X[var] = X[var].apply(LabelEncoder().fit_transform)
X[var] = X[var].astype('category', copy=False)
X.head(3)
Step 6: Split data into two parts, train and test. Train the model using the Training
data set and test the model with the Test data set. The sklearn neural network() function
accepts X parameters and Y parameters separately. Hence, we have to drop target
variables from the dataframe and create a separate X dataframe and Y dataframe with
only the target class.
##Data preparation
##y is the response class (target) and X1 holds the features; once this is done, split the data set
y = X['Attrition']
X1 = X.drop(columns='Attrition')
X1.head(3)
# split into train and test sets (an 80/20 split is assumed here)
X_train, X_test, y_train, y_test = train_test_split(X1, y, test_size=0.2, random_state=1234)
print(X_train.shape); print(X_test.shape); print(y_train.shape); print(y_test.shape)
#Generate neural network model with hidden_layer_sizes=(5, 2): two hidden layers with 5 and 2 neurons
# (constructor reconstructed from the description; parameters assumed from the figure)
nn_model = MLPClassifier(hidden_layer_sizes=(5, 2),
                         solver='adam', shuffle=True, max_iter=1000)
nn_model.fit(X_train, y_train)
nn_model.hidden_layer_sizes
nn_model.n_layers_
nn_model.classes_
Figure 10-39. Results of creating the neural network model using the
MLPClassifier() API of scikit-learn on the training data
The multilayer perceptron classifier MLPClassifier() optimizes the log-loss function using LBFGS or stochastic-gradient-based solvers. Stochastic gradient descent (SGD) is an extension of gradient descent that calculates the gradient on only one training example at every iteration. The learning rate controls the size of the adjustment made at every iteration. If the LR is too large, the updates may overshoot the optimum value; if the LR is too small, you may require many iterations to reach a minimum. A recommended approach is to start with a learning rate of 0.1 and adjust it as necessary.
You have to provide several input parameters for the function to work properly. At a
minimum, you must specify the number of layers, activation function, learning rate, and
maximum number of epochs. We are providing the MLPClassifier() API details from
the scikit-learn documentation here for reference:
Parameters
hidden_layer_sizes : tuple, length = n_layers - 2, default=(100,)
The ith element represents the number of neurons in the ith
hidden layer.
activation : {'identity', 'logistic', 'tanh', 'relu'}, default='relu'
Activation function for the hidden layer.
• identity, no-op activation, useful to implement linear
bottleneck, returns f(x) = x
• logistic, the logistic sigmoid function, returns f(x) = 1 / (1 +
exp(-x))
• tanh, the hyperbolic tan function, returns f(x) = tanh(x)
• relu, the rectified linear unit function, returns f(x) = max(0, x)
solver : {'lbfgs', 'sgd', 'adam'}, default='adam'
The solver for weight optimization.
• lbfgs is an optimizer in the family of quasi-Newton methods.
• sgd refers to stochastic gradient descent.
• adam refers to a stochastic gradient-based optimizer proposed by
Kingma, Diederik, and Jimmy Ba
Note The default solver adam works pretty well on relatively large data sets (with
thousands of training samples or more) in terms of both training time and validation
score. For small data sets, however, lbfgs can converge faster and perform better.
alpha : float, default=0.0001
batch_size : int, default='auto'
learning_rate_init : float, default=0.001
power_t : float, default=0.5
max_iter : int, default=200
shuffle : bool, default=True
tol : float, default=1e-4
verbose : bool, default=False
warm_start : bool, default=False
When set to True, reuse the solution of the previous call to fit as
initialization; otherwise, just erase the previous solution.
momentum : float, default=0.9
nesterovs_momentum : bool, default=True
Whether to use Nesterov’s momentum. Only used when
solver='sgd' and momentum > 0.
early_stopping : bool, default=False
Whether to use early stopping to terminate training when the
validation score is not improving. If set to true, it will automatically
set aside 10 percent of the training data as validation and
terminate training when the validation score is not improving by
at least tol for n_iter_no_change consecutive epochs. The split is
stratified, except in a multilabel setting. If early stopping is False,
then the training stops when the training loss does not improve by
more than tol for n_iter_no_change consecutive passes over the
training set. Only effective when solver='sgd' or 'adam'.
validation_fraction : float, default=0.1
The proportion of training data to set aside as validation set for
early stopping. Must be between 0 and 1. Only used if early_
stopping is True.
beta_1 : float, default=0.9
Exponential decay rate for estimates of first moment vector in
adam; should be in [0, 1). Only used when solver='adam'.
beta_2 : float, default=0.999
Exponential decay rate for estimates of second moment vector in
adam; should be in [0, 1). Only used when solver='adam'.
epsilon : float, default=1e-8
Value for numerical stability in adam. Only used when
solver='adam'.
n_iter_no_change : int, default=10
Maximum number of epochs to not meet tol improvement. Only
effective when solver='sgd' or adam.
New in version 0.20.
max_fun : int, default=15000
Step 8: Predict test data using the model that you just created.
Here is the input:
#predict_train = NNCL.predict(X_train)
predict_test = nn_model.predict(X_test)
predict_test
Step 9: Measure the accuracy of the model. Use a confusion matrix (a truth table of actual versus predicted values). scikit-learn provides built-in functions to calculate the different performance measures, and we will use those functions. We will also plot the loss curve, which shows how the loss decreased over the iterations taken to adjust the weights and biases and arrive at an optimal model.
Here is the input:
print(confusion_matrix(y_test,predict_test))
print(classification_report(y_test,predict_test))
plt.plot(nn_model.loss_curve_)   # loss curve over the training iterations (assumed plotting call)
plt.show()
Step 10: Finally, we will print the weights and biases of the optimized model for
reference. See Figure 10-42.
Figure 10-42. Printing the weights and biases of the optimized model
The point on the error surface where the error is at its minimum is called the global minimum; other low points are called local minima. The neural network adopts a gradient descent algorithm to find the minimum error. The learning can also converge to the weights of a local minimum rather than the global minimum if the correct number of epochs is not selected during the model-creation process.
Figure: Gradient descent on the error surface. The error is plotted against the coefficients (weights), and the descent proceeds in steps (small or large deltas) toward the global minimum.
Neural network learning heavily relies on having a sufficient quantity of data for
the training process. A neural network may perform poorly with smaller data, as in our
example. Similarly, having an imbalanced class of data also leads to poor learning in the
minority category.
A practical challenge is computational runtime. Though neural network concepts have been around for more than 50 years, the computational ability to process large data sets used to be a challenge. This is no longer a problem because of the availability of storage and powerful processors. Even so, a neural network is computationally intensive and requires a longer runtime than other classifiers, and this runtime can grow rapidly with a higher number of variables and more network layers (many more weights to compute). If you have a real-time or near-real-time prediction application, you should measure the runtime to make sure it is not causing an unacceptable delay in the decision-making process.
A final challenge is the careful selection of input variables. Since neural networks automatically adjust the weights and biases based on the gradient descent of the error, there is no built-in mechanism to add or remove variables based on the output. This can be an advantage or a disadvantage depending on the problem.
10.9 Chapter Summary
We started the chapter by providing a background of artificial neurons and how to
imitate brain functions using artificial neurons. We explained the fundamentals of
perceptrons and building neural networks and how a neural network learns and adjusts
the weights and biases to perform better.
We discussed different activation functions such as ReLU, sigmoid, and tanh, and you learned about the gradient descent algorithm for finding the optimal weights and biases.
You also learned how to create a neural network model using both R and Python
with a practical business case.
Finally, we ended the chapter by introducing deep neural networks and their
applications.
CHAPTER 11
Logistic Regression
In Chapters 7 and 8, we discussed simple linear regression and multiple linear
regression, respectively. In both types of regression, we have a dependent variable or
response variable as a continuous variable that is normally distributed. However, this is
not always the case. Often, the response variable is not normally distributed. It may be a
binary variable following a binomial distribution and taking the values of 0/1 or No/Yes.
It may be a categorical or discrete variable taking multiple values that may follow other
distributions other than the normal one.
If the response variable is a discrete variable (which can be nominal, ordinal, or
binary), you use a different regression method, known as logistic regression. If the
response variable takes binary values (such as Yes/No or Sick/Healthy) or multiple
discrete variables (such as Strongly Agree, Agree, Partially Agree, and Do Not Agree),
we can use logistic regression. Logistic regression is still a linear type of regression.
Logistic regression with the response variable taking only two values (such as Yes/No
or Sick/Healthy) is known as binomial logistic regression. A logistic regression with a
response variable that can take multiple discrete values is known as multinomial logistic
regression. (Note that nonlinear regression is outside the scope of this book.)
Consider the following examples:
Hence, we use the following bounding function that enables us to determine the
probability of success and derive the logistic regression model by combining linear
regression with the bounding function:
f(a) = 1 / (1 + e^-a)
We use the maximum likelihood method to generate the logistic regression equation that
predicts the natural logarithm of the odds ratio. From this logistic regression equation, we
determine the predicted odds ratio and the predicted probability of success, as shown here:
Here, we do not use the dependent variable value as is. Instead, we use the natural
logarithm of the odds ratio.
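The equations referred to above are shown only in the original figures. In standard notation (a generic statement of the relationship, not copied from those figures), the logistic regression model, the odds, and the probability of success are related as follows:

ln( p / (1 − p) ) = b0 + b1·x1 + … + bk·xk

p / (1 − p) = e^(b0 + b1·x1 + … + bk·xk)

p = 1 / (1 + e^−(b0 + b1·x1 + … + bk·xk))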
11.1 Logistic Regression
In this section, we will demonstrate logistic regression using a data set.
11.1.1 The Data
Let’s start by considering the data set we have created, which has six variables.
The data covers the last six months and pertains only to those employees with 2 to 5
years of experience. The data is extracted from the CSV text file named attr_data.txt
by using the read.csv() command.
> attrition_data<-read.csv("attr_data.txt")
> summary(attrition_data)
Attrition Yrs_Exp Work_Challenging
No :24 Min. :2.000 No :28
Yes:28 1st Qu.:2.500 Yes:24
Median :4.000
Mean :3.519
3rd Qu.:4.500
Max. :5.000
Work_Envir Compensation Tech_Exper
Excellent:28 Excellent:21 Excellent:44
Low :24 Low :31 Low : 8
This code also presents a summary of the data. As you can see, the Attrition
field has 28 Yes values; this means these employees have exited the organization. The
24 No values represent employees who are still working in the organization. You can
also observe from the preceding summary that 28 employees have not been assigned
challenging work (Work_Challenging), and 24 employees have been. Furthermore,
28 employees are working in teams where the work environment (Work_Envir) is
considered excellent, whereas 24 are working in teams where the work environment is
not that great (here marked Low). Finally, 21 employees have excellent compensation,
at par or above the market compensation (shown here as Excellent); but 31 have
compensation that is below the market compensation or low compensation (shown here
as Low). Out of the total employees, 44 have excellent technical expertise (Tech_Exper),
whereas 8 others have low technical expertise. The data set contains 52 records.
Ideally, when the organization is providing challenging work to an employee, the
work environment within the team is excellent, compensation is excellent, and technical
expertise of the employee is low, then the chance for attrition should be low.
Here is a glimpse of the data:
> head(attrition_data)
Attrition Yrs_Exp Work_Challenging Work_Envir
1 Yes 2.5 No Low
2 No 2.0 Yes Excellent
3 No 2.5 Yes Excellent
4 Yes 2.0 No Excellent
5 No 2.0 Yes Low
6 Yes 2.0 No Low
Compensation Tech_Exper
1 Low Excellent
2 Excellent Excellent
3 Low Excellent
4 Low Excellent
5 Low Low
6 Low Excellent
> tail(attrition_data)
Attrition Yrs_Exp Work_Challenging Work_Envir
47 No 4.0 Yes Excellent
48 No 4.5 No Excellent
49 Yes 5.0 No Excellent
50 No 5.0 No Excellent
51 Yes 2.0 Yes Excellent
52 No 4.0 Yes Excellent
Compensation Tech_Exper
47 Excellent Excellent
48 Low Low
49 Excellent Excellent
50 Excellent Excellent
51 Excellent Excellent
52 Excellent Excellent
> attri_logit_model<-glm(
Attrition~Yrs_Exp+Work_Challenging+Work_Envir+Compensation+Tech_Exper,
data=attrition_data,
family =binomial(link="logit"))
> summary(attri_logit_model)
The model created by using the glm() function is shown here, along with the
summary (generated by using summary(model name)):
Only one level of each categorical variable is shown here. This is because
each variable has two levels, and one level is taken as a reference level by the model.
An example is the categorical variable Work_Challenging, which has two levels: Work_
ChallengingYes and Work_ChallengingNo. Only Work_ChallengingYes is shown in the
model, as Work_ChallengingNo is taken as the reference level.
You can see in the preceding summary of the logistic regression model that except
for Yrs_Exp, all other variables are significant to the model (as each p-value is less than
0.05). Work_ChallengingYes, Work_EnvirLow, CompensationLow, and Tech_ExperLow
are the significant variables. Yrs_Exp is not a significant variable to the model, as it
has a high p-value. It is quite obvious from even a visual examination of the data that
Yrs_Exp will not be significant to the model, as attrition is observed regardless of the
number of years of experience. Furthermore, you can see that the model has converged
in seven Fisher’s scoring iterations, which is good because ideally we expect the model to
converge in less than eight iterations.
We can now eliminate Yrs_Exp from the logistic regression model and recast the
model. The formula used for recasting the logistic regression model and the summary of
the model are provided here:
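The recast call itself is not reproduced in the text; it simply drops Yrs_Exp from the earlier glm() formula (the same form is used later when the model is refit on the training data):

> attri_logit_model_2<-glm(
    Attrition~Work_Challenging+Work_Envir+Compensation+Tech_Exper,
    data=attrition_data,
    family =binomial(link="logit"))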
> summary(attri_logit_model_2)
As you can see, now all the model parameters are significant because the p-values
are less than 0.05.
The degrees of freedom for the data are calculated as n minus 1 (the number of
data points – 1, or 52 – 1 = 51). The degrees of freedom for the model is n minus 1 minus
the number of coefficients (52 – 1 – 4 = 47). These are shown in the above summary of
the model.
Deviance is a measure of lack of fit. Null and residual deviance are the most common values used in statistical software to measure the model fit. The null deviance is the deviance of the model with only an intercept; it tells us how well the response can be predicted without any other coefficients, i.e., keeping only the intercept. The residual deviance measures the fit of the model with all the predictors; the lower this value, the better the model and the more accurately it can predict the response.
The Chi-square statistic can be calculated as the difference between the null deviance and the residual deviance, with degrees of freedom equal to the difference in their degrees of freedom. Based on the p-value for this statistic, we can determine whether the model is "fit" enough: the lower the p-value, the better the model compared to the model with only an intercept.
In our previous example, we have a null deviance of 71.779 and a residual deviance of 26.086, so the Chi-square statistic is 71.779 − 26.086 = 45.693 on 51 − 47 = 4 degrees of freedom.
Let’s compare both the models—attri_logit_model (with all the predictors) and
attri_logit_model_2 (with only significant predictors)—and check how the second
model fares with respect to the first one. See Figure 11-2.
This test is carried out by using the anova() function and checking the Chi-square
p-value. In this table, the Chi-square p-value is 0.7951. This is not significant. This
suggests that the model attri_logit_model_2 without Yrs_Exp works well compared
to the model attri_logit_model with all the variables. There is not much difference
between the two models. Hence, we can safely use the simpler model without Yrs_Exp
(that is, attri_logit_model_2).
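The comparison in Figure 11-2 can be produced with a call along the following lines (a sketch; the exact call appears only in the figure):

> anova(attri_logit_model_2, attri_logit_model, test="Chisq")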
From this model (attri_logit_model_2), we can see that the coefficient of Work_
ChallengingYes is –3.4632, Work_EnvirLow is 4.5215, CompensationLow is 2.7090, and
Tech_ExperLow is –3.8547. These coefficient values are the natural logarithm of the odds
ratio. A coefficient with a minus sign indicates that it decreases the potential for Attrition.
Similarly, a coefficient with a plus sign indicates that it increases the potential for Attrition.
Hence, the odds of Attrition = Yes are exp(4.5215) = 91.97 times higher when Work_Envir = Low than when Work_Envir = Excellent, with all other variables remaining the same (all other things being equal). For example, if the chance of Attrition = Yes is 5 percent, or 0.05, when Work_Envir = Excellent, then the odds of Attrition = Yes when Work_Envir = Low become (0.05 / (1 − 0.05)) × 91.97 = 4.8376, which corresponds to a probability of Attrition = Yes of 4.8376 / (1 + 4.8376) = 0.8286, or about 82.86 percent. Similarly, the odds of Attrition = Yes are exp(2.7090) = 15.02 times higher when Compensation = Low than when Compensation = Excellent, with all other variables remaining the same. Both Work_Envir = Low and Compensation = Low increase the possibility of Attrition = Yes.
However, the other two variables, Work_Challenging = Yes and Tech_Exper = Low, have negative coefficients, which means they reduce the possibility of Attrition = Yes. Work_Challenging = Yes, with a coefficient of −3.4632, multiplies the odds of Attrition = Yes by exp(−3.4632) = 0.0313 compared to Work_Challenging = No, with all other variables remaining the same. Similarly, Tech_Exper = Low, with a coefficient of −3.8547, multiplies the odds of Attrition = Yes by exp(−3.8547) = 0.0211 compared to Tech_Exper = Excellent, with all other variables remaining the same.
From this, you can see that Attrition is influenced mainly by Work_Envir and Compensation.
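The odds ratios quoted above are simply the exponentiated coefficients. A quick way to obtain them all at once (a sketch, not shown in the original text) is:

> exp(coef(attri_logit_model_2))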
A pseudo R-squared provides another measure of model fit; it compares the residual deviance to the null deviance:
> pseudo_R_Squared<-1-(attri_logit_model_2$deviance/attri_logit_model_2$null.deviance)
> pseudo_R_Squared
[1] 0.6365818
This calculation shows that the model explains 63.65 percent of the deviance. You can
also compute the value of pseudo R-square by using library(pscl) and pR2(model_name).
Another way to verify the model fit is by calculating the p-value with the Chi-square
method as follows:
p_value <- pchisq(model_deviance_diff, df_data - df_model, lower.tail = FALSE)
Here, df_data can be calculated as nrow(data_set) - 1, or in our case nrow(attrition_data) - 1; df_model is model$df.residual; and model_deviance_diff is model$null.deviance - model$deviance. These calculations are shown here:
Because the p-value is very small, the reduction in deviance cannot be assumed to
be by chance. As the p-value is significant, the model is a good fit.
This may be due to data or a portion of data predicting the response perfectly. This is
known as the issue of separation or quasi-separation.
Here are some general words of caution with respect to the logistic regression model:
• We have a problem when the null deviance is less than the residual
deviance.
In these cases, we may have to revisit the model and look at each coefficient again.
11.1.5 Multicollinearity
We talked about multicollinearity in Chapter 8. Multicollinearity can be detected easily in R by using the vif(model name) function. VIF stands for variance inflation factor. Typically, the rule of thumb for the multicollinearity test to pass is that the VIF value should be less than 5. The following test shows the calculation of VIF:
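The call itself appears only in the figure; a sketch, assuming the vif() function from the car package, is:

> library(car)
> vif(attri_logit_model_2)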
As you can see, our model does not suffer from multicollinearity.
11.1.6 Dispersion
Dispersion (the variance of the dependent variable) above the value of 1 (the summary of the logistic regression notes that the dispersion parameter for the binomial family is taken to be 1) is a potential issue with some regression models, including the logistic regression model. This is known as overdispersion. Overdispersion occurs when the observed variance of the dependent variable is bigger than expected under the binomial distribution (that is, 1). This leads to issues with the reliability of the significance tests, as it is likely to adversely impact the standard errors.
Whether a model suffers from the issue of overdispersion can be easily found using
R, as shown here:
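The check itself appears only in the figure. A common way to assess overdispersion (a sketch, assuming the usual rule of thumb) is to compare the residual deviance to the residual degrees of freedom; the ratio should be close to or below 1. With the values reported earlier, this is roughly 26.086 / 47 ≈ 0.56.

> deviance(attri_logit_model_2)/df.residual(attri_logit_model_2)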
The model generated by us, attri_logit_model_2, does not suffer from the issue of
overdispersion. If a logistic regression model does suffer from overdispersion, you need
to use quasibinomial distribution in the glm() function instead of binomial distribution.
> library(caret)
> set.seed(1234)
> # setting seed ensures the repeatability of the results on
different trials
> # We are going to partition data into train and test using
> # createDataPartition() function from the caret package
> # we use 80% of the data as train and 20% as test
> Data_Partition<-createDataPartition(attrition_data$Attrition,
p=0.8,list=FALSE)
> Training_Data<-attrition_data[Data_Partition, ]
> Test_Data<-attrition_data[-Data_Partition, ]
> nrow(attrition_data)
[1] 52
> nrow(Training_Data)
[1] 43
> nrow(Test_Data)
[1] 9
> summary(Training_Data)
Attrition  Yrs_Exp         Work_Challenging  Work_Envir     Compensation
 No :20    Min.   :2.000   No :24            Excellent:24   Excellent:18
 Yes:23    1st Qu.:2.500   Yes:19            Low      :19   Low      :25
           Median :4.000
           Mean   :3.547
           3rd Qu.:4.500
           Max.   :5.000
Tech_Exper
 Excellent:37
 Low      : 6
This split of the entire data set into two subsets (Training_Data and Test_Data)
has been done randomly. Now, we have 43 records in Training_Data and 9 records in
Test_Data.
We now train our model using Training_Data to generate a model, as shown here:
## Model 3
## Create a logistic regression model using Training_Data set
# We will not use Yrs_Exp variable as it is not significant.
# We already explained what variables to consider in our previous
discussion
train_logit_model<-glm(Attrition~Work_Challenging+Work_
Envir+Compensation+Tech_Exper,
data=Training_Data,
family =binomial(link="logit"))
summary(train_logit_model)
As you can see, the model generated (train_logit_model) has taken seven Fisher’s
scoring iterations to converge.
The next step is to predict the test data and measure the performance of the model as
follows:
1. Use the model generated from the training data to predict the
response variable for the test data and store the predicted data in
the test data set.
2. Compare the values generated from the response variable with
the actual values of the response variable in the test data set.
• True positives are the ones that are actually positives (1) and are
also predicted as positives (1).
• True negatives are the ones that are actually negatives (0) and are also predicted as negatives (0).
• False positives are the ones that are predicted as positives (1) but
are actually negatives (0).
• False negatives are the ones that are predicted as negatives (0)
but are actually positives (1).
In addition, Precision = TP / (FP + TP) and F1 Score = 2TP / (2TP + FP + FN) may be
considered.
Higher accuracy, higher sensitivity, and higher specificity are typically expected.
Check whether these values are appropriate to the objective of the prediction in mind.
If the prediction will affect the safety or health of people, we have to ensure the highest
accuracy. In such cases, each predicted value should be determined with caution and
further validated through other means, if required.
11.2.1 Example of Prediction
The originally fitted model (attri_logit_model_2) not only explains the relationship
between the response variable and the independent variables but also provides a
mechanism to predict the value of the response variable from the values of the new
independent variables. This is done as shown here:
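The prediction call is shown only in a figure; the sketch below illustrates the idea with a hypothetical new employee whose values are made up purely for illustration:

> new_employee <- data.frame(Work_Challenging = "Yes",
                             Work_Envir       = "Excellent",
                             Compensation     = "Excellent",
                             Tech_Exper       = "Excellent")
> predict(attri_logit_model_2, newdata = new_employee, type = "response")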
We take the value of Attrition as Yes if the probability returned by the prediction
is > 0.5, and we take the same as No if the probability returned by the prediction is not
> 0.5. As you can see in the preceding code, the value is far below 0.5, so we can safely
assume that Attrition = No.
The preceding prediction is determined by using the function predict(model
name, newdata=dataframe_name, type="response"), where model name is the name
of the model arrived at from the input data, newdata contains the data of independent
variables for which the response variable has to be predicted, and type="response" is
required to ensure that the outcome is not logit(y).
> predicted<-predict(train_logit_model,
newdata=Test_Data,
type="response")
FALSE TRUE
No 3 1
Yes 1 4
We can clearly see from the confusion matrix that the model generates mostly true positives and true negatives. The accuracy of the model can be calculated using the following formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (4 + 3) / (4 + 3 + 1 + 1) = 7/9 = 0.77
> library(ROCR)
> prediction_object<-prediction(predicted,
Test_Data$Attrition)
> prediction_object
This ROC curve clearly shows that the model generates almost no false positives
and generates high true positives. Hence, we can conclude that the model generated is a
good model.
11.4 Regularization
Regularization is a complex subject that we won’t discuss thoroughly here. However, we
provide an introduction to this concept because it is an important aspect of statistics that
you need to understand in the context of statistical models.
Regularization is the method normally used to avoid overfitting. When we keep
adding parameters to our model to increase its accuracy and fit, at some point our
prediction capability using this model decreases. By taking too many parameters, we are
overfitting the model to the data and losing the value of generalization, which could have
made the model more useful in prediction.
Using forward and backward model fitting and subset model fitting, we try to avoid
overfitting and hence make the model more generalized and useful in predicting future
values. This will ensure less bias as well as less variance when relating to the test data.
Regularization is also useful when we have more parameters than the data
observations in our data set and the least squares method cannot help because it
would lead to many models (not a single unique model) that would fit to the same data.
Regularization allows us to find one reasonable solution in such situations.
Shrinkage methods are the most used regularization methods. They add a penalty term
to the regression model to carry out the regularization. We penalize the loss function by
adding a multiple (λ, also known as the shrinkage parameter) of the regularization norm,
such as Lasso or Ridge (also known as the shrinkage penalty), of the linear regression
weights vector. We may use cross validation to get the best multiple (λ value). The more
complex the model, the greater the penalty. We use either the L1 regularizer (Lasso) or the L2
regularizer (Ridge). Regularization shrinks the coefficient estimates to reduce the variance.
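As a brief illustration (our notation, not the book's), the penalized objective minimized by these shrinkage methods can be written as follows, where \ell(\beta) denotes the model's log-likelihood and \lambda \ge 0 is the shrinkage parameter:

\hat{\beta}^{\text{Lasso}} = \arg\min_{\beta} \left\{ -\ell(\beta) + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\}
\qquad
\hat{\beta}^{\text{Ridge}} = \arg\min_{\beta} \left\{ -\ell(\beta) + \lambda \sum_{j=1}^{p} \beta_j^{2} \right\}

The L1 penalty is what allows the Lasso to push some coefficients exactly to 0, whereas the L2 penalty only shrinks them toward 0.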
Ridge regression shrinks the parameter estimates toward 0 but never exactly to 0, whereas Lasso regression shrinks the estimates of some parameters all the way to 0. For Ridge, as the value of λ increases, the flexibility of the fit decreases, which reduces the variance at the cost of some additional bias. Without such shrinkage, the parameter estimates can change hugely even for small changes in the training data, and this gets aggravated as the number of parameters increases. Lasso creates less-complicated models, which makes them easier to interpret and use for prediction.
Let’s explore the concept of regularization on our data set attrition_data without Yrs_
Exp. We don’t take Yrs_Exp into consideration because we know that it is not significant.
We use the glmnet() function from the glmnet package to determine the regularized
model. We use the cv.glmnet() function from the glmnet package to determine the
best lambda value. We use alpha=1 for the Lasso and use alpha=0 for the Ridge. We use
family="binomial" and type="class" because our response variable is binary and
we are using the regularization in the context of logistic regression, as required. The
glmnet() function requires the input to be in the form of a matrix and the response
variable to be a numeric vector. This fits a generalized linear model via penalized
maximum likelihood. The regularization path is computed for the Lasso or elasticnet
penalty at a grid of values for the regularization parameter lambda.
The generic format of this function as defined in the glmnet R package is as follows:
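The full signature is quite long; the following is a simplified sketch showing only the commonly used arguments (consult the glmnet package documentation for the complete list and the exact defaults):

glmnet(x, y, family = "gaussian", alpha = 1, nlambda = 100,
       lambda = NULL, standardize = TRUE, intercept = TRUE, ...)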
As usual, we will not be using all the parameters. We will be using only the absolutely
required parameters in the interest of simplicity. Please explore the glmnet package
guidelines for details of each parameter.
We will first prepare the inputs required. We need the model in the format of a matrix,
as the input for the glmnet() function. We also require the response variable as a vector:
> ###################
> library(glmnet)
Loading required package: Matrix
Loaded glmnet 4.1-1
Warning message:
package ‘glmnet’ was built under R version 3.6.3
> #converting into a matrix as required for the input.
> x<-model.matrix(Attrition~Work_Challenging+Work_Envir+Compensation+Tech_Exper,
+                 data = attrition_data)
> y<-attrition_data$Attrition
> glmnet_fit<-glmnet(x,y,
+ family="binomial",
+ alpha=1,
+ nlambda=100)
> summary(glmnet_fit)
Length Class Mode
a0 68 -none- numeric
beta 340 dgCMatrix S4
df 68 -none- numeric
dim 2 -none- numeric
lambda 68 -none- numeric
dev.ratio 68 -none- numeric
nulldev 1 -none- numeric
npasses 1 -none- numeric
jerr 1 -none- numeric
offset 1 -none- logical
classnames 2 -none- character
call 6 -none- call
nobs 1 -none- numeric
Explaining the contents of the summary is beyond the scope of this book, but we
will show how the regularization is carried out primarily using the graphs. We use the
plot() function for this purpose. As we are using the binary data and logistic regression,
we use xvar="dev" (where dev stands for deviance) and label = TRUE to identify the
parameters in the plot as inputs to the plot() function in Figure 11-4.
> ##plot
> plot(glmnet_fit,
xvar="dev",
label=TRUE)
Figure 11-4. Coefficient paths plotted against the deviance explained: two variables have positive (+) coefficients, and two have negative (–) coefficients
The output of the glmnet_fit using the print() function is shown here:
> print(glmnet_fit)
Call: glmnet(x = x, y = y, family = "binomial", alpha = 1, nlambda = 100)
Df %Dev Lambda
1 0 0.00 0.273000
2 2 5.56 0.248700
3 2 10.91 0.226600
4 3 15.61 0.206500
5 3 19.94 0.188200
....
....
....
This output primarily shows the degrees of freedom (the number of nonzero coefficients), the percentage of the null deviance explained by the model, and the lambda value. As you can see, the lambda value keeps decreasing. As lambda decreases, the percentage of deviance explained by the model increases, as does the number of nonzero coefficients. Even though we supplied nlambda = 100 to the function (this is the default), the lambda value is shown only 68 times. This is because the algorithm stops early once it sees no further significant change in the percentage of deviance explained by the model.
Now we will make the prediction of the class labels at lambda = 0.05. Here type =
"class" refers to the response type:
As you can see, all four values are predicted accurately, as they match the first four
rows of our data set.
Now we will cross-validate the regularized model by using the cv.glmnet() function from the glmnet package. This function does k-fold cross-validation for glmnet, produces a plot, and returns the lambda value that minimizes the cross-validation error as well as the lambda value at one standard error. By default it does a 10-fold cross-validation; the number of folds can be changed if required. Here we use type.measure = "class" because we have binary data and a logistic regression; "class" gives the misclassification error. We then plot the output of cv.glmnet()—that is, cv.fit—by using the plot() function.
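A minimal sketch of these two steps, reusing the x, y, and alpha = 1 inputs passed to glmnet() earlier (the object name cv.fit is the one used in the text):

# 10-fold cross-validation with misclassification error as the measure
cv.fit <- cv.glmnet(x, y, family = "binomial", type.measure = "class", alpha = 1)
# Plot the cross-validated misclassification error against log(lambda)
plot(cv.fit)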
Figure 11-5 shows the cross-validated curve along with the upper and lower values
of the misclassification error against the log(lambda) values. Red dots depict the cross-
validated curve.
The following shows some of the important output parameters of the cv.fit model,
including lambda.min or lambda.1se:
In the previous code, lambda.min is the value of lambda that gives the minimum cvm, where cvm is the cross-validation error. Similarly, lambda.1se is the largest value of lambda such that the error is within one standard error of the minimum.
We can view the coefficient value at the lambda.min value using the coef(cv.fit,
s = "lambda.min") command in R. The output is a sparse matrix with the second levels
shown for each independent factor.
Let’s now see how this regularized model predicts the values by using the predict()
function and the s = "lambda.min" option. We will check this for the first six values of
our data set. The results are shown here:
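A minimal sketch of this prediction (the choice of the first six rows mirrors the text):

# Class labels for the first six records at the lambda that minimizes the CV error
predict(cv.fit, newx = x[1:6, ], s = "lambda.min", type = "class")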
All six values are predicted properly by the predict() function. However, please
note that we have not validated the results for our entire data set. The accuracy of
the model may not be 100 percent, as our objective of regularization was to provide a
generalized model for future predictions without worrying about an exact fit (overfit) on
the training data.
The code to read the data from the text file and to print the data is provided below (the import code is also provided in text form to make it easy for readers to copy).
In the resulting dataframe, we have 52 records with one response variable (i.e., Attrition) and five predictor variables (only a partial view is shown).
attri_data.info()
train_samp = 40
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
        train_size=train_samp, test_size=12)
# Fit the logistic regression model on the training data; this step is
# reconstructed here to mirror the model-building code shown later.
from sklearn.linear_model import LogisticRegression
logist_regre = LogisticRegression(random_state=0, penalty='l2',
        solver='lbfgs', multi_class='multinomial',
        max_iter=500).fit(X_train, y_train)
predicted = logist_regre.predict(X_test)
from sklearn.metrics import classification_report
print(classification_report(y_test, predicted))
As you can see, the performance of the model (i.e., the accuracy of prediction on
the test data set) is not very high. As we know from the previous discussions on model
building in R, Yrs_Exp is not a significant field for this model, so we will now build a
model without the variable Yrs_Exp. We will then validate the model using the accuracy
of the prediction on the test data.
Note If you want the output to remain the same every time this step is run, then
use the np.random.seed(n) function. Otherwise, the algorithm may provide
different output every time for each specific run.
Z = X.drop(['Yrs_Exp'], axis=1)
#split the dataset into two separate sets viz. training set
#and test set manually
#training set for generating the model
#test set for validating the model generated
train_samp = (40)
from sklearn.model_selection import train_test_split
Z_train, Z_test, y_train, y_test = train_test_split(Z, y,
train_size=train_samp, test_size=12)
from sklearn.linear_model import LogisticRegression
logist_regre = LogisticRegression(random_state = 0, penalty = 'l2',
solver = 'lbfgs', multi_class = 'multinomial',
max_iter = 500).fit(Z_train, y_train)
Predict on the test data and print the confusion matrix score to measure the accuracy
of the model. The code in this regard is provided below:
predicted = logist_regre.predict(Z_test)
import sklearn.metrics as sm
print(sm.confusion_matrix(y_test, predicted))
Here, you see higher accuracy. However, depending on the train-test split carried out by the algorithm, you may get high accuracy even for the initial model.
Note If you want the output to remain the same every time this step is run, then use the np.random.seed(n) function. Otherwise, the algorithm may provide different output for each specific run. Also, please note that the numbers of the Jupyter Notebook cells in this example may not be consecutive because we carried out some additional steps during our exercise that are not required to be shown here.
We can see that the optimization of the previous model was successful, as it converged within eight iterations. As you can see from the logit regression results, the model's p-value is very small; hence, the model is significant. We will also check the significance of each of the predictors to the model, as follows. This confirms that the selected predictors are significant, as the p-value of each of them is less than the chosen significance level of 0.05:
Once we have the model, the next step is to predict using the model and measure the
performance.
Here is the input:
Print the Confusion Matrix: The code and the corresponding output are
shown below:
Note You could also have generated the model initially with all the predictors and checked the significance of each of them. In that case, you may find more predictors that appear nonsignificant. However, as you remove those nonsignificant predictors one by one from the regression equation, you may find that one or more of them turns out to be significant in the final model.
11.6 Chapter Summary
In this chapter, you saw that if the response variable is a categorical or discrete variable
(which can be nominal, ordinal, or binary), you use a different regression method, called
logistic regression. If you have a dependent variable with only two values, such as Yes or
No, you use binomial logistic regression. If the dependent variable takes more than two
categorical values, you use multinomial logistic regression.
You looked at a few examples of logistic regression. The assumptions of
linearity, normality, and homoscedasticity that generally apply to regressions
do not apply to logistic regression. You used the glm() function with family = binomial(link = "logit") to create a logistic regression model.
You also looked at the underlying statistics and how logit (log odds) of the
dependent variable is used in the logistic regression equation instead of the actual value
of the dependent variable.
You also imported the data set to understand the underlying data. You created the
model and verified the significance of the predictor variables to the model by using the
p-value. One of the variables (Yrs_Exp) was not significant. You reran the model without
this predictor variable and arrived at a model in which all the variables were significant.
You explored how to interpret the coefficients and their impact on the dependent
variable. You learned about deviance as a measure of lack of fit and saw how to verify
the model’s goodness of fit by using the p-value of deviance difference using the Chi-
square method.
You need to use caution when interpreting the logistic regression model. You
checked for multicollinearity and overdispersion.
You then split the data set into training and test sets. You tried to come up with
a logistic regression model out of the training data set. Through this process, you
learned that a good model generated from such a training set can be used to predict
the dependent variable. You can use a classification report to check measures such as
accuracy, specificity, and sensitivity.
You also learned how to use the prediction() and performance() functions from the ROCR package to generate an ROC curve that validates the model on the same data set as the original.
You learned how to predict the value of a new data set by using the logistic regression
model you developed. Then you learned about multinomial logistic regression and the R
packages that can be used in this regard.
Also, you learned about regularization, including why it’s required and how it’s
carried out.
Finally, you learned how to generate the logistic regression model using Python and allied utilities such as scikit-learn, NumPy, and pandas, through interactive programming in Jupyter Notebook within the Anaconda framework.
PART III
Time-Series Models
CHAPTER 12
Time Series: Forecasting
All the data we have talked about so far includes lots of time-series data, i.e., data taken over a continuous period of time (minute by minute, hour by hour, day by day, week by week, month by month, quarter by quarter, year by year, and so on). This data may be univariate (i.e., data about a single factor or parameter) or multivariate (i.e., data with multiple variables). Examples of univariate data are the closing price of a particular stock on a particular stock exchange (e.g., the price of Walt Disney stock on the New York Stock Exchange), the total revenue of a particular organization (e.g., HP), the sales volume of a particular product of a particular organization (e.g., a Voltas AC 1.5-ton model in California), and the price of a particular commodity in a particular market (e.g., the price of 10 grams of 24-carat gold in New York). Examples of multivariate data are the open price, close price, and volume traded of a particular stock on a particular stock exchange; the revenue, expenses, profit before tax, and profit after tax of a company; and the maximum temperature, minimum temperature, and average rainfall of a particular city across various periods.
If we have a good amount of data, we can forecast the future values of it using the
forecasting models. The methodology is simple. First we verify the data for accuracy
and clean up the data where relevant, then we generate the model using the clean and
accurate data, and finally we use the model generated to predict/forecast.
In this chapter, we will mostly use the functions from the libraries base, stats,
graphics, forecast, and tseries in R. We use pandas, NumPy, scikit-learn,
statsmodels, pmdarima, etc., in Python. Please install these packages and load these
libraries when you do the hands-on work. If the code does not work, first check if you
have failed to load the libraries concerned.
perception or outlook is that the company’s results are going to be good) or when the
demand for the products of the company is likely to be increasing significantly because
of the shortage of the products in the market or the sudden bankruptcy of a major
competitor for these products. Data in such cases demonstrates a trend. The trend may
be downward also in the case of some stocks due to the performance of the company in
a particular quarter or continuously over multiple quarters. The trend of the price may
be downward for a particular vegetable (a particularly perishable vegetable) if the crop
output significantly increases beyond the demand. Level is the average value, whereas
trend is an increasing or decreasing value.
We call the third factor the irregular or error component, which typically captures
those influences that are not captured by seasonal and trend components and may be
pure white noise.
The other way we need to look at these characteristics is whether these are additive
or multiplicative in the context of a particular time series. The maximum temperature
or minimum temperature over a year is additive as from one period to the next period
it cannot increase drastically. In this case, the seasonality, trend, and error components
all will be additive. However, car sales can demonstrate a multiplicative seasonal pattern, in the sense that car sales may increase by, say, 20 to 30 percent just in the month before the country's financial budget, in view of the possibility of additional taxation in the upcoming budget, and in the immediately following month they may come down by, say, 10 to 20 percent.
We consider a time series to be stationary if it does not demonstrate either
seasonality or trend. However, even a stationary time series will still have random
fluctuations. A time series may be demonstrating all three characteristics, i.e.,
seasonality, trend, and irregular components.
We can also call the characteristics systematic or nonsystematic. Systematic
constituents of the time series can be modeled well and easily, whereas nonsystematic
constituents of the time series are difficult to model. For example, the random influence
on the prices of a commodity that is not explainable is nonsystematic, whereas normal
seasonality that can be easily explained or modeled is systematic. Similarly, a trend is
part of the systematic component.
> ##Get the data into the dataframe from the .csv dataset
> str(max_temp_imd_2017)
$ YEAR : int 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 ...
Figure 12-1. Characteristics of the temperature time series imported, output of the
R code executed
Figure 12-1 shows that there are 117 rows of data (i.e., from 1901 to 2017) and that
there are 18 variables including the year, maximum temperature data for each of the 12
months of these years, and another five columns, namely, ANNUAL, JAN.FEB, MAR.MAY,
JUN.SEP, and OCT.DEC.
We want to retain only the year and the 12 individual month column data and
remove all the other columns. We are doing this using the following code:
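A minimal sketch of this step, based on the column names shown in the str() output above (the object name new_temp_df is the one used in the text):

# Keep YEAR and the 12 monthly columns; drop the aggregate columns
new_temp_df <- max_temp_imd_2017[ , !(names(max_temp_imd_2017) %in%
                  c("ANNUAL", "JAN.FEB", "MAR.MAY", "JUN.SEP", "OCT.DEC"))]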
We can test to check whether the intended columns have been dropped from the
data set using the str(new_temp_df) command from R. We get the output shown in
Figure 12-2.
$ YEAR: int 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 ...
To keep things simple and easy to understand, we have created the following small univariate data set with 24 months of data pertaining to the years 2015 and 2016, converting the data to a time series with a start month of Jan 2015 and a monthly frequency. Further, we have checked that the time series has been created properly by verifying its starting month, its ending month, and its frequency, and all of them are as expected. At the end, we plot the time series using plot(ts_max_temp).
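A minimal sketch of this step, assuming the 2015 and 2016 monthly values are pulled from new_temp_df (the object name ts_max_temp is from the text; the exact subsetting shown here is an assumption):

# 24 monthly maximum temperatures for 2015 and 2016, as a single vector
monthly_vals <- as.vector(t(as.matrix(
                  new_temp_df[new_temp_df$YEAR %in% c(2015, 2016), 2:13])))
# Convert to a monthly time series starting in Jan 2015
ts_max_temp <- ts(monthly_vals, start = c(2015, 1), frequency = 12)
start(ts_max_temp)      # expected: 2015 1
end(ts_max_temp)        # expected: 2016 12
frequency(ts_max_temp)  # expected: 12
plot(ts_max_temp)       # Figure 12-3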
Please Note Where R code and its output are shown together in this chapter, we include > at the start of each code line to make clear which lines are code and to separate them from the embedded output. If you are copying such code from here and inputting it in R, you should not include the >. The entire R code of this chapter is provided in a separate R script on the accompanying Apress website.
The output of the last code line, plot(ts_max_temp), is provided in Figure 12-3; it
shows the fluctuations of the data over these different months over 2 years.
Now we will decompose the time series using the command decompose(ts_max_
temp) and look at the seasonal, trend, and irregular components of the time series using
the following code. The related outputs are also embedded here.
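A minimal sketch of the decomposition (the object name decomp matches the plot(decomp) call referred to next):

decomp <- decompose(ts_max_temp)  # split into seasonal, trend, and irregular parts
decomp$seasonal                   # seasonal component
decomp$trend                      # trend component
decomp$random                     # irregular (error) component
plot(decomp)                      # Figure 12-4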
The output of the plot(decomp) command provided shows clearly the seasonal
and trend components graphically in Figure 12-4. The second graph from the bottom
of the output shown in Figure 12-4 shows the seasonal trend; i.e., almost the same
pattern is repeated with the seasons. As you can see in this graph, nearer to the middle
of the year the temperature peaks, and during the end of the year it drops. This happens
during both years, i.e., over the entire period of the input data. The trend component
is shown in the middle graph. This typically shows if the values have a component that
continues to increase or continues to decrease or they are horizontal. In our case, there
is a trend component for the partial spectrum of the data, i.e., July of the earlier year to
June of the next year (as shown in the output provided along with the code). A random
component is an irregular or error component. There is no random component here as
the bottommost graph shows a straight line (as you can see from the output shown along
with the code, it is NULL).
The following command creates a plot clearly showing the patterns of the
decomposition:
Figure 12-5 shows the output, which clearly shows both the trend and the seasonal
patterns.
In the triple exponential forecasting model, we use all the three components, i.e.,
the level, the trend, and the seasonal components. Now, we will add another smoothing
factor, i.e., γ. This smoothing factor γ will also take a value between 0 and 1. Again,
the higher this value, the higher the relative weightage to the recent observations in
the time series, and the lower this value, the higher the relative weightage to the older
observations in the past.
To learn more, refer to books on statistics, which cover these aspects and the related
formulae in detail.
ets() is the most popular function used for exponential smoothing models. You can find it in the forecast library of R. ets(time_series_name, model="ZZZ") will select an appropriate type of model automatically, without the need for you to specify it. Here, each letter can be one of the following: A means additive, M means multiplicative, and N means none. The first letter stands for the error type, the second letter stands for the trend type, and the third letter stands for the seasonal component type. Let's check how this works with our time-series data, i.e., ts_max_temp. The code used for this, along with the output, is provided here:
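A minimal sketch of the call (the object name model_1 is our assumption; it is not named in the text, although the later models are named model_2 and model_3):

library(forecast)
model_1 <- ets(ts_max_temp, model = "ZZZ")  # let ets() pick the error/trend/seasonal types
model_1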
Call:
ets(y = ts_max_temp, model = "ZZZ")
Smoothing parameters:
alpha = 0.9999
beta = 0.9999
Initial states:
l = 22.2701
b = 2.3093
sigma: 1.7486
AIC AICc BIC
108.7209 112.0543 114.6112
The previous output shows that a model with additive error and additive trend (and no seasonal component) was selected. The output also shows the AIC, AICc, and BIC values as well as the smoothing parameters alpha and beta used. Alpha and beta are the smoothing parameters for the level and trend, respectively. Both are almost 1. These high values of alpha and beta suggest that the recent observations dominate the forecast.
Note that the values of alpha, beta, and gamma will be between 0 and 1. A value
nearer to 1 means the recent observations are given more weight compared to the older
observations. A value nearer to 0 means the distant past observations get more relative
weights compared to the recent observations.
Let’s now forecast the next three max temperatures from the year 2017 using this
model and check if this gives near equal predictions. The simple code used along with
the corresponding output is shown here:
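A sketch of the forecasting call, using the assumed model name model_1 from above:

forecast(model_1, 3)  # forecast the next three months (Jan-Mar 2017)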
The output shown also shows the limits for an 80 percent confidence interval and a
95 percent confidence interval. From the raw data for the year 2017 from the government
data set, the values for January 2017, February 2017, and March 2017 are, respectively,
26.45, 29.46, and 31.60. As you can see, there is wide variation among the actual values
and the predicted values. All the predicted values are on the lower side.
Let’s now explore other models we have not explored: AAA and ANA. Among these,
we will first explore AAA as we saw a seasonal pattern in our data earlier. The code and
the model generated as output are provided here:
> model_2 <- ets(ts_max_temp, model="AAA")
> model_2
ETS(A,A,A)
Call:
ets(y = ts_max_temp, model = "AAA")
Smoothing parameters:
alpha = 0.974
beta = 0.1329
gamma = 0.003
Initial states:
l = 28.7322
b = 0.1554
s = -4.852 -2.2334 0.4228 0.7817 1.111 1.2383
2.5547 4.1739 2.903 0.2007 -2.0253 -4.2754
sigma: 0.8999
AIC AICc BIC
78.84388 180.84388 98.87079
Let’s now predict the next three values for the year 2017, i.e., January 2017, February
2017, and March 2017, the actual values of which are 26.45, 29.46, and 31.60.
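A sketch of this prediction with the model_2 object created above:

forecast(model_2, 3)  # forecast Jan, Feb, and Mar 2017 with the AAA model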
From the previous, we observe that the values fitted are on the higher side. Let’s now
explore the model with ANA. The code along with the model and the forecasted three
values (pertaining to January 2017, February 2017, and March 2017) are shown here:
Initial states:
l = 30.6467
s = -4.8633 -2.1923 0.4212 0.6708 0.9745 1.2463
2.8205 4.541 3.1301 0.2431 -2.2662 -4.7257
sigma: 0.9319
AIC AICc BIC
81.87586 141.87586 99.54667
> forecast(model_3, 3)
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
Jan 2017 28.14746 26.95320 29.34172 26.32100 29.97393
Feb 2017 30.60699 28.91821 32.29577 28.02423 33.18975
Mar 2017 33.11627 31.04801 35.18453 29.95314 36.27940
Still, we observe that the predicted values are differing significantly from the
actual values.
The previous may be because of the limited data set we used to create the time
series. However, the forecasting of unknown values is not always correct because the
actual values may be impacted by many other parameters unknown to us and the past
data may not be completely representative of the future data.
>str(nhtemp)
Time-Series [1:60] from 1912 to 1971: 49.9 52.3 49.4 51.1 49.4 47.9 49.8 50.9
49.3 51.9 ...
>plot(nhtemp)
We will now use the auto.arima function from library(forecast). The code and
the output are given here:
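A minimal sketch of the call (the object name model_4 is our assumption; the text does not name this model):

library(forecast)
model_4 <- auto.arima(nhtemp)  # let auto.arima() select the (p, d, q) order
model_4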
You can see the model’s order for values (p, d, q) in the previous model is (0, 1, 1).
Here, p stands for (p, d, q) autoregressive model of order p, q stands for (p, d, q)
moving average model of order q, and d stands for the number of times the time series
has been differenced.
In the autoregressive model, each value of the time series is predicted using the
linear combination of past p values. In a moving average model, each value of the time
series is predicted using the linear combination of past q errors.
We can make the prediction as usual using the forecast(model_name, n) command as earlier, where n is the number of future values to be predicted. The code and the results for the previously generated model are given here:
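A sketch of the forecasting call, again with the assumed name model_4:

forecast(model_4, 3)  # forecast the next three yearly values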
The accuracy of the model can be tested using accuracy(model_name). The code and
output are given here:
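A sketch of the accuracy check on the same (assumed) model object:

accuracy(model_4)  # reports ME, RMSE, MAE, MPE, MAPE, MASE, and ACF1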
As you can see, the various accuracy measures of the model are provided, including mean error (ME), root mean square error (RMSE), mean absolute error (MAE), mean percentage error (MPE), mean absolute percentage error (MAPE), mean absolute scaled error (MASE), and the first-order autocorrelation coefficient (ACF1). The errors are very small, mostly between 0 and 1. Hence, the accuracy is excellent in our case.
As you can see, the time series does not have any seasonal or trend component. Such a time series is known as stationary; its statistical properties, such as the mean and variance, are constant over time, and its covariance structure is consistent.
The following is an example of a nonstationary time series:
As you can see in the plot, there is at least a seasonal component. Hence, it is a
nonstationary time series.
The ndiffs() function in R checks and suggests how many differences are required
to make the time series stationary. Let’s check the time series we have in hand,
i.e., nhtemp.
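A minimal sketch of this check:

library(forecast)
ndiffs(nhtemp)  # suggests the number of differences needed; the text proceeds with d = 1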
Now we will difference the original time series once (i.e., differences=d=1) and
then check if the time series is now stationary. Here, the function diff(time series,
differences=d) is used. By default, this function takes d as 1. If it is different than 1, then
we need to pass the parameter differences=d. In order to check if the time series has
become stationary after differencing, we use the augmented Dickey-Fuller test. The code
and the output are provided here:
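A minimal sketch of these two steps (the object name diff_nhtemp matches the output shown below):

diff_nhtemp <- diff(nhtemp, differences = 1)  # difference the series once (d = 1)
library(tseries)
adf.test(diff_nhtemp)                         # augmented Dickey-Fuller test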
data: diff_nhtemp
Dickey-Fuller = -4.6366, Lag order = 3, p-value = 0.01
alternative hypothesis: stationary
Warning message:
In adf.test(diff_nhtemp) : p-value smaller than printed p-value
As the p-value is very small (less than the significance level), we reject the null
hypothesis that the time series is nonstationary and accept the alternative hypothesis
that the time series is stationary.
We will now plot the differenced time series using the following code:
>plot(diff_nhtemp)
Now, we can check for the autocorrelation function (ACF) plot and partial
autocorrelation function (PACF) using the Acf(time series) and Pacf(time
series) from the library (forecast) to check the p and q parameters required for
the model building. We already know d=1 from the previous discussions. At lag m, the
autocorrelation depicts the relationship between an element of the time series and the
value, which is m intervals away from it considering the intervening interval values.
Partial autocorrelation depicts the relation between an element of the time series
and the value that is m intervals away from it without considering the intervening
interval values.
Note that as we have d=1, we will be using the differenced time series in Acf() and
in Pacf() instead of the original time series.
Let’s plot this and check. The following is the code.
> Acf(diff_nhtemp)
> Pacf(diff_nhtemp)
The following rules help interpret the p order for AR() and the q order for MA() from the two plots.

For an AR(p) process:
• Acf(): The ACF plot is sinusoidal (i.e., of the form of a sine wave) or is gradually or exponentially decaying.
• Pacf(): The PACF plot shows a significant spike at lag p but no further spikes beyond lag p. The residuals beyond lag p are normally within the plotted confidence interval. A few residuals other than the initial significant ones falling just outside the confidence interval are OK and may be ignored.

For an MA(q) process:
• Acf(): The ACF plot shows a significant spike at lag q. The residuals beyond lag q are normally within the plotted confidence interval. A few residuals other than the initial significant ones falling just outside the confidence interval are OK and may be ignored.
• Pacf(): The PACF plot is sinusoidal (i.e., of the form of a sine wave) or is gradually or exponentially decaying.
Both the Acf plot and Pacf plots shown in Figure 12-8 have very small residuals.
Further, Pacf() shows a significant spike at the first lag, and all the subsequent residuals
are within the confidence interval plotted. Hence, we have p=1, and we have AR(p) =
AR(1). Acf() shows a significant spike at the first lag, and all the subsequent residuals
are within the confidence interval plotted. Hence, we have q=1, and we have MA(q) =
MA(1). We already know that d=1.
We now run the arima() model on the original time series using these (p, d, q)
values; i.e., the order of the model is (1,1,1). The code and the output are given here:
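A sketch of the call, using the model name model_6 referred to later in the text:

model_6 <- arima(nhtemp, order = c(1, 1, 1))
model_6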
ar1 ma1
0.0073 -0.8019
s.e. 0.1802 0.1285
sigma^2 estimated as 1.291: log likelihood = -91.76, aic = 189.52
We now need to evaluate whether the model is a good fit and whether the residuals are normal. We use the Box-Ljung test for the model fit verification and use quantile-quantile plots of the residuals to check whether they are normal. The code and the output are shown here:
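A minimal sketch of these two checks on model_6:

# Box-Ljung test: a non-significant p-value suggests no residual autocorrelation
Box.test(residuals(model_6), type = "Ljung-Box")
# Quantile-quantile plot to check whether the residuals are normally distributed
qqnorm(residuals(model_6))
qqline(residuals(model_6))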
As the p-value is not significant, we cannot reject the null hypothesis that the residuals are uncorrelated (i.e., there is no autocorrelation). Hence, we consider the model to be an adequate fit, without residual autocorrelation.
Further, Figure 12-9 shows the output from the quantile-to-quantile plot, which clearly
shows that the residuals are normally distributed, as all the points are on the straight line.
From this, you can conclude that both the assumptions are met and the model_6
generated by us is a very good fit. We can use it to predict the future values as we have
done earlier.
As you can see from the previous discussions, the model generated through auto.arima() is a little different from the one we generated here.
Please Note The ARMA model does not have a d component. It will have
only (p, q).
We will now plot the forecasted values using model_6 along with the original values
of the time series, nhtemp, using the following code:
> ##Forecasting using model_6 and then plotting the forecasted values
> forecast(model_6, 3)
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
1972 51.90458 50.44854 53.36062 49.67776 54.13141
1973 51.89660 50.41015 53.38304 49.62327 54.16992
1974 51.89654 50.38194 53.41114 49.58015 54.21292
> plot(forecast(model_6, 3), xlab="Timeline", ylab="Temperature")
Figure 12-10 shows the time series we get with the additional forecasted values.
Figure 12-10. Time series nhtemp extended with the forecasted values
12.5 Forecasting in Python
We have so far seen how to import data, convert it to a time series, decompose the time
series into various components like seasonal, trend, level, and error; use various methods
of generating models; check for the prerequisites (where applicable); validate the model
fit and the assumptions; and forecast using the models in R. Now, we will be looking at
carrying out the same activities in Python. We are using the Jupyter Notebook from the
Anaconda framework here for the coding as it provides you with the output immediately
when running the code in each of the cells. This provides you with the capability of
interactive programming. Throughout the Jupyter Notebook we have provided “comments”
that will help you to understand the reasons behind the specific code or the output.
import pandas as pd
import numpy as np
import sklearn
import statsmodels as sm
By using these commands, these modules will be loaded into memory and will be available for use.
Max_Temp_df = pd.read_csv("C:/Users/kunku/OneDrive/Documents/Book Revision/Max_IMD_Temp_Train.csv", sep=',', header=0)
Max_Temp_df.head(5)
The output shows that there are two fields in the dataframe, namely, Date and Temp.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36 entries, 0 to 35
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 36 non-null object
1 Temp 36 non-null float64
dtypes: float64(1), object(1)
memory usage: 704.0+ bytes
As you can see, we have 36 rows of data. The two elements/columns of the data are Date and Temp. We observe that neither of the fields has a NULL value. Further, we find that Date is an object type of field, which means it is a text field and is not currently recognized as a date during processing; we need to convert it to a true date format. Our dates represent the corresponding months of the respective years.
The dates will be converted to a datetime type and will be converted to a Month_
Year format and added as a separate field in the dataframe Max_Temp_df. Further, the
following revised information is provided with regard to data types:
Now, we will drop the Date field, and we will index the dataframe on the Month_Year
field. The code is shown here:
Now, you can clearly see the dataframe Max_Temp_df as having 36 rows of data,
indexed on the Month_Year field.
As the Date field has been dropped and the index has been set on Month_Year, if
you check for Max_Temp_df.info() as provided previously, you will get the following
information without the Date-related information:
<class 'pandas.core.frame.DataFrame'>
PeriodIndex: 36 entries, 2014-01 to 2016-12
Freq: M
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Temp 36 non-null float64
dtypes: float64(1)
memory usage: 576.0 bytes
As you can see from the previous plot, there is definitely a seasonal component
repeating each year. Let’s now decompose the time series to understand its
composition better.
In this code, after decomposition of the time series, we are plotting the seasonal
component. Figure 12-11 shows the output.
You can see that there is a seasonal impact on the time series. Now, we will plot the
other two components. The code in this regard is provided here:
Figure 12-12 shows the plot of the trend and residual components.
As you can see from the previous plots, we do have a small upward trend component
and some residual/irregular components. Hence, all three components are relevant to
the time series Max_Temp_df.
Output:
0.2964816976648261
True
From this output, we can see that the p-value is not significant. Hence, we will not
reject the null hypothesis that the time series is not stationary. The “True” in the output
means that the time series is currently nonstationary, and we need to make the time
series stationary before we proceed for model generation, by using differencing.
The ndiffs() command shows the need for differencing twice. We also know from the decomposition done earlier that we have to difference twice, i.e., once for the trend and once for the seasonal component. Now, we will carry out the differencing twice and create a new dataframe, i.e., Max_Temp_diff. When this is done, some of the beginning values become NaN, and we need to drop them. This is done by using the following code:
The output of the differencing and some of the values (partial view of the output) of
the diffs are provided in Figure 12-13.
The NaN values will be dropped from the time series after applying the
previous code.
Now, we will check whether the differenced time series is stationary using the same method, i.e., the augmented Dickey-Fuller test. If the p-value is significant (less than or equal to the level of significance), we reject the null hypothesis that the time series is not stationary and accept the alternative hypothesis that the time series is stationary. Otherwise, we cannot reject the null hypothesis that the time series is not stationary. In our case, it turns out that the differenced time series is now stationary.
Figure 12-14 shows the code and output.
#Test whether the differenced series is now stationary
from pmdarima.arima import ADFTest
ADF_Test_Res = ADFTest(alpha=0.05)
# should_diff() returns the p-value and whether further differencing is needed;
# the differenced dataframe name Max_Temp_diff is taken from the text above
p_value, should_diff = ADF_Test_Res.should_diff(Max_Temp_diff)
print(p_value)
print(should_diff)
Output:
0.0183348400728848
False
Figure 12-14. Augmented Dickey-Fuller test to check if the time series is stationary
12.5.7 Model Generation
We will now generate the model using the pmdarima package. We will be using the auto_
arima utility within the pmdarima package for this purpose. This auto_arima function
does the heavy lifting for us and suggests the best possible model after iterating on the
various options. We will use d=differencing=2 in the input as we know the d is 2. The
code is shown here:
Now, we will interpret the model summary. The model clearly shows that the best
possible model is the seasonal ARIMA model (i.e., SARIMAX) with (p,d,q)x(P,D,Q,S)
as (1,2,0)x(1,1,0,12). We clearly know that this model has a seasonal component.
We observe that the d value that we arrived at earlier and the d value suggested by the
auto_arima are same, i.e., 2. The S clearly shows 12 as we have 12 months of data for
every year and depicts the seasonal component. The ar.L1 has a significant p-value of
0.002, and the Sigma2, which represents the error, also has a significant p-value. The
Prob(Q), i.e., the p-value of the Ljung-Box test, is insignificant; hence, we cannot reject the null hypothesis that the residuals have no autocorrelation and are only white noise. Also, the p-value of the Jarque-Bera test, i.e., Prob(JB), is 0.56, which is insignificant; hence, we cannot reject the null hypothesis that the residuals are normally distributed. Further, Prob(H) for heteroskedasticity is 0.40, which is insignificant; hence, we cannot reject the null hypothesis that the error residuals have the same variance. Thus, the model generated is validated from all these perspectives.
Figure 12-16. The Acf() plot from statsmodels on the Max_Temp_diff dataframe
Figure 12-17. The Pacf() plot from statsmodels on the Max_Temp_diff dataframe
As you can see from the ACF plot, it is sinusoidal in form; i.e., it follows a sine-wave
pattern. This means the order of MA(), i.e., q, is 0. If we observe the PACF plot, we find
that none of the residuals is significant (just one point is beyond the confidence interval,
which we can ignore). However, the first lag is showing a positive value before it turns
into negative for the further lags. Hence, we can conclude that the order of AR(), i.e.,
p, is 1. This is also in tune with the Max_Temp_Best_fit model hyperparameters. We
already know that d=2.
Now, we will use the residuals of the model generated, i.e., Max_Temp_Best_fit, and
generate the ACF and PACF plots.
The code to generate the ACF plot on the residuals of the model is given here:
The code for generating the PACF plot on the residuals of the model is given here:
We can see that almost all the residuals are very small or near zero. Hence, we can
conclude that the model is fit to be used.
12.5.9 Forecasting
Forecasts are normally evaluated on test data held out from the training data to ensure that our testing of the predictions is appropriate. In our case, we have the Max_Temp_IMD_Test.csv data for the subsequent dates, i.e., from Jan 2017 to Dec 2017, which we kept aside for the purpose of testing.
We will import the test data to our working environment and then use the Date field to
get the Month_Year field as we did in the case of our original training data. We will then drop
the Date field and index the Max_Temp_Test_df on Month_Year. The code is as follows:
# Importing the test time series, which has 12 months of data after the training data used above
Max_Temp_Test_df = pd.read_csv("C:/Users/kunku/OneDrive/Documents/Book Revision/Max_Temp_IMD_Test.csv", sep=",", header=0)
Max_Temp_Test_df['Month_Year'] = pd.to_datetime(Max_Temp_Test_df['Date']).dt.to_period('M')
Max_Temp_Test_df.drop("Date", inplace=True, axis=1)
# Indexing the dataframe on the Month_Year field so that further operations on the data can be done easily
Max_Temp_Test_df.set_index("Month_Year", inplace=True)
Max_Temp_Test_df.head(5)
Figure 12-20. The first five records of the Max_Temp_Test_df used for testing of
the model
Now, we will forecast using the model Max_Temp_Best_fit (without using any data
input) for the next six months beyond the training data using the following code:
2017-01 28.711060
2017-02 31.842207
2017-03 34.905397
2017-04 38.357296
2017-05 40.641819
2017-06 39.668761
Freq: M, dtype: float64
Figure 12-21. The predicted or forecasted results using the model Max_Temp_
Best_fit
The model has forecasted the output. If we compare these values with the 2017 monthly values in our Max_Temp_Test_df, we find that the forecasted values are not a near match in many cases. This means that the model built is still not able to fully explain all the components of the time series. More generally, it is often difficult to forecast accurately from past data alone: factors that the model has never seen may affect the future values, and the past values may simply not be representative of the future.
We will now predict the values using the Max_Temp_Best_fit model on our current
data used for the training, i.e., Max_Temp_df, which is known as in-sample or in-series
prediction. The code used for the same is provided here:
#You can also predict on the sample used for the model generation
#This may help for close comparision with the actual values
#and the predicted values
predictions_in_sample = Max_Temp_Best_fit.predict_in_sample(alpha=0.05)
predictions_in_sample
Month_Year
2014-01 0.000000
2014-02 39.716663
2014-03 28.110012
2014-04 31.929993
2014-05 36.530003
2014-06 34.800001
2014-07 34.530000
2014-08 29.550003
2014-09 30.789995
2014-10 30.040004
2014-11 29.899995
2014-12 25.810010
2015-01 27.304002
2015-02 16.740004
2015-03 30.039972
2015-04 32.584781
2015-05 32.013921
2015-06 34.477478
2015-07 29.915671
2015-08 31.045280
2015-09 31.890271
2015-10 31.537564
2015-11 29.147156
2015-12 24.666804
2016-01 25.196811
(partial output)
12.6 Chapter Summary
In this chapter, you learned what a time series is and the uses/benefits of time series in
the real world.
You learned about the components of a time series such as seasonal, trend, and
irregular/error components.
You learned practically, through the help of examples, how to carry out exponential
smoothing modeling in R.
You learned, through the help of examples, how to carry out ARIMA and ARMA
modeling in R. In the process, you also learned the prerequisites that need to be met
before such modeling is carried out (e.g., the time series needs to be stationary) and
what assumptions your models need to fulfil to be of use.
You also learned how to use the model for forecasting.
You then experimented with decomposition, model generation, its validation, and
forecasting using Python instead of R.
PART IV
Cluster Analysis
Clustering is an unsupervised learning technique to categorize data in the absence
of defined categories in the sample data set. In this chapter, we will explore different
techniques and how to perform clustering analysis.
13.1 Overview of Clustering
Clustering analysis is an unsupervised technique. Unlike supervised learning, in
unsupervised learning, the data has no class labels for the machines to learn and predict
the class. Instead, the machine decides how to group the data into different categories.
The objective of the clusters is to enable the business to perform meaningful analysis. Clustering analysis can uncover previously undetected relationships in a data set. For example, cluster analysis can be applied in marketing for customer segmentation based on demographics to identify groups of people who purchase similar products. Similarly, clusters based on consumer spending can be used to estimate the potential demand for products and services. These kinds of analyses help businesses formulate marketing strategies.
Nielsen (and earlier, Claritas) were pioneers in cluster analysis. Through its segmentation solution, Nielsen helped customize demographic data to understand geography based on region, state, ZIP code, neighborhood, and block. This has helped the company come up with effective naming and differentiation of groups such as movers and shakers, fast-track families, football-watching beer aficionados, and casual, sweet-palate drinkers.
In a human resources (HR) department, cluster analysis can help to identify
employee skills and performance. Furthermore, you can cluster based on interests,
demographics, gender, and salary to help a business act on HR-related issues such as
relocating, improving performance, or hiring an appropriately skilled labor force for
forthcoming projects.
In finance, cluster analysis can help create risk-based portfolios based on various
characteristics such as returns, volatility, and P/E ratio. Similarly, clusters can be
created based on revenues and growth, market capital, products and solutions, and
global presence. These clusters can help a business position itself in the market. Other
applications of clustering include grouping newspaper articles based on topics such
as sports, science, or politics; grouping the effectiveness of the software development
process based on defects and processes; and grouping various species based on classes
and subclasses.
A clustering algorithm takes the raw data as input and segregates the data into different groups. In the example shown in Figure 13-1, bags are clustered together by the algorithm based on their size. The purpose of cluster analysis is to segregate data into groups. The idea of clustering is not new and has been applied in many areas, including archaeology, astronomy, science, education, medicine, psychology, and sociology.
There are different clustering algorithms to perform this task, and in the next
section, we will discuss various clustering techniques and how to perform the clustering
technique with an example.
13.1.1 Distance Measure
To understand the clustering techniques, we first have to understand how distance is measured between sample records and how records are grouped into two or more clusters. There are several different metrics to measure the distance between two records. The most common measures are Euclidean distance, Manhattan distance, and Minkowski distance.
13.1.2 Euclidean Distance
Euclidean distance is the simplest and most common measure used. The Euclidean distance E_{ij} between two records i and j, each described by p variables, is defined as follows:

E_{ij} = \sqrt{(X_{i1} - X_{j1})^2 + (X_{i2} - X_{j2})^2 + (X_{i3} - X_{j3})^2 + \cdots + (X_{ip} - X_{jp})^2}    (1)

As an extension to this equation, you can also assign a weight to each variable based on its importance. The weighted Euclidean distance equation is as follows:

E_{ij} = \sqrt{W_1 (X_{i1} - X_{j1})^2 + W_2 (X_{i2} - X_{j2})^2 + W_3 (X_{i3} - X_{j3})^2 + \cdots + W_p (X_{ip} - X_{jp})^2}    (2)
13.1.3 Manhattan Distance
Another well-known measure is Manhattan (or city block) distance, which is defined as follows:

M_{ij} = \lvert X_{i1} - X_{j1} \rvert + \lvert X_{i2} - X_{j2} \rvert + \cdots + \lvert X_{ip} - X_{jp} \rvert

Here, X_{ik} is the value of the kth of the p variables for record i. Both the Euclidean distance and the Manhattan distance should also satisfy the standard conditions of a distance measure: the distance between two records is never negative, the distance from a record to itself is 0, the distance is symmetric (distance(i, j) = distance(j, i)), and the triangle inequality holds (distance(i, j) ≤ distance(i, k) + distance(k, j)).
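As a small illustration (with made-up numbers), both measures are available through the dist() function in base R:

# Two records described by three variables (hypothetical values)
records <- rbind(c(25, 50000, 3),
                 c(30, 62000, 5))
dist(records, method = "euclidean")  # Euclidean distance
dist(records, method = "manhattan")  # Manhattan (city block) distance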
Usually, data sets have a combination of categorical and continuous variables. The Gower similarity coefficient is a measure that can handle both quantitative variables (such as income or salary) and categorical variables (such as spam/not spam).
For a given cluster B with samples B1, B2, B3, …, Bm and a cluster C with samples C1, C2, C3, …, Cn, single linkage is defined as the shortest distance between any pair of records Ci and Bj. This is represented as Min(distance(Ci, Bj)) for i = 1, 2, 3, …, n; j = 1, 2, 3, …, m. The two samples that are on the edge make up the shortest distance between the two clusters. This defines the cluster boundary, and the boundaries form the clusters, as shown in Figure 13-2.
Complete linkage is the distance between two clusters defined as the longest distance between two points in the clusters, as shown in Figure 13-3. The farthest distance between two records in clusters Ci and Bj is represented as Max(distance(Ci, Bj)) for i = 1, 2, …, n; j = 1, 2, 3, …, m. The farthest samples make up the edge of the cluster.
Average linkage is a measure that indicates the average distance between each point in one cluster and each point in the other cluster, as shown in Figure 13-4. The average distance between records in one cluster and records in the other cluster is calculated as Average(distance(Ci, Bj)) for i = 1, 2, 3, …, n; j = 1, 2, 3, …, m.
The centroid is the center of the cluster. It is calculated by taking the average of all the
records in that cluster. The centroid distance between the two clusters A and B is simply
the distance between centroid(A) and centroid(B).
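These linkage choices correspond directly to the method argument of R's hclust() function; a small sketch with made-up data follows:

set.seed(7)
pts <- matrix(rnorm(20), ncol = 2)  # ten hypothetical records with two variables
d <- dist(pts)                      # pairwise Euclidean distances
hclust(d, method = "single")        # single linkage (nearest pair of records)
hclust(d, method = "complete")      # complete linkage (farthest pair of records)
hclust(d, method = "average")       # average linkage
hclust(d, method = "centroid")      # centroid linkage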
Selecting the clustering distance measure depends on the data, and it also requires some amount of domain knowledge. When the data is well spread out, single linkage may be a good choice. On the other hand, average or complete linkage may be a better choice if the data is more compact and appears to form a spherical shape. Selecting a specific cluster method is always a challenge, and we are also unsure in advance how many clusters will be formed by the algorithm.
how many clusters will be formed by the algorithm. Based on the domain knowledge,
we need to check each element in a cluster and decide whether to keep two or three
clusters. Our research and experience have shown that the unsupervised learning
methods are not intuitive or easy to comprehend. It takes effort to understand and label
the clusters properly.
13.3 Types of Clustering
Clustering analysis is performed on data to gain insights that help you understand the characteristics and distribution of the data. The process involves grouping records with similar characteristics together while keeping different groups as dissimilar as possible from one another. If two samples have close similarity measures across the variables, they are grouped into the same cluster. Unlike in classification, clustering algorithms do not rely on predefined class labels in the sample data. Instead, the algorithms are based on the similarity measures, discussed earlier, between the different records and clusters.
There are several clustering techniques, which differ in the similarity measures used, the thresholds applied when constructing the clusters, and the flexibility with which objects can move between clusters. Irrespective of the procedure used, the resulting clusters must be reviewed by the user. Clustering algorithms fall into one of two families: hierarchical and nonhierarchical clustering.
13.3.1 Hierarchical Clustering
Hierarchical clustering constructs clusters by organizing the records of the data set into a hierarchy from top to bottom, much as all files and folders on a hard disk are organized in a hierarchy. As the algorithm steps through the samples, it creates this hierarchy based on distance. There are two types: the agglomerative method and the divisive method. Both methods are based on the same concept. In the agglomerative method, the algorithm starts with n clusters and then merges similar clusters until a single cluster is formed. In the divisive method, the algorithm starts with one cluster and then splits the elements into multiple clusters based on dissimilarities, as shown in Figure 13-5.
Figure 13-5 shows six records: reading left to right (agglomerative), records 1 and 2 merge, then 3 joins them, records 4, 5, and 6 form another group, and finally all six merge into the single cluster {1,2,3,4,5,6}; reading right to left corresponds to the divisive method.
The agglomerative method works as follows:
1. Start with n clusters. Each record in the data set can be a cluster by itself.
2. At each step, the two most similar clusters are merged.
3. The process repeats until all records are merged into a single cluster.
The divisive method works in the opposite direction:
1. Start with one single cluster where all the samples in the data set are in one cluster.
2. At every step, the cluster with the largest distance measure between its records is split.
3. The process repeats until all the cluster elements are separated. This also creates a hierarchy of clusters.
13.3.2 Dendrograms
A dendrogram is a tree-like structure that represents this hierarchy, summarizing the clustering process pictorially, as shown in Figure 13-6.
Figure 13-6 plots the clusters against the distance at which they are merged; reading the tree bottom-up corresponds to the agglomerative method and top-down to the divisive method.
13.3.3 Nonhierarchical Method
In nonhierarchical clustering, no hierarchy of clusters is formed; instead, the number of
clusters is prespecified with k partitions. The partitions are formed by minimizing the
error. The objective is to minimize total intracluster variance, using the squared error
function.
E = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - m_j \rVert^2
Here, k is the number of partitions (clusters), x_i is a sample, C_j is the jth cluster, and m_j is the mean (centroid) of cluster j. The algorithm partitions the n objects into k clusters by assigning each object to the cluster with the nearest mean. The goal is to divide the samples into k clusters that are as homogeneous as possible, producing k distinct, well-separated clusters. Many partitioning methods are implemented by different tools; the common ones are k-means, probabilistic clustering, the k-medoids method, and partitioning around medoids (PAM). All these algorithms are optimization problems that try to minimize the error function. The most common is the k-means algorithm.
13.3.4 K-Means Algorithm
The objective of k-means clustering is to minimize total intracluster variance.
The k-means clustering algorithm starts with k centroids whose initial values are selected randomly or from prior information. It then assigns each object to the closest cluster center based on the distance measure and recalculates the centroid after each assignment. After recalculating the centroids, it checks each data point's distance to the centroid of its own cluster; if that centroid is still the closest, the point stays where it is, otherwise it moves to the nearest cluster. This process is repeated until no data point moves from one cluster to another. The following example demonstrates the k-means algorithm.
We want to group the visitors to a website using just their age (a one-dimensional
space) using the k-means clustering algorithm. Here is the one-dimensional age vector:
[ 15,15,16,19,19,20,20,21,22,28,35,40,41,42,43,44,60,61,65]
Iteration 1:
Let k = 2. Let’s choose 2 centroids randomly, 16 and 22. Move all the points to 16 or
22 based on distance. Now the two clusters are as follows:
C1 = [15, 15, 16,19]
C2 = [20,21,22,28,35,40,41,42,43,44,60,61,65 ]
Iteration 2:
Updated centroid 1: 16.25 (the average of the elements assigned to cluster 1 in the previous iteration)
Updated centroid 2: 36.85
Move elements closer to the new updated centroid.
New C1 elements = [ 15,15,16,19, 20, 21, 22, 28 ]
New C2 elements = [35,40,41,42,43,44,60,61,65]
Iteration 3:
Updated centroid 1:19.5
Updated centroid 2: 47.88
Move the elements to the updated centroid point.
New C1 elements = [15,15,16,19,20,21,22,28]
New C2 elements = [35,40,41,42,43,44,60,61,65]
The centroids do not change from iteration 3 to iteration 4, so the algorithm stops here. Using k-means clustering, two groups have been identified: C1 [15–28] and C2 [35–65]. However, the initial random selection of cluster centroids can affect the iterations and the final cluster elements. To overcome this, run the algorithm multiple times with different starting conditions to get a fair view of what the clusters should be.
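As an optional, hedged sketch (ours, not the book's code), the same one-dimensional example can be reproduced with scikit-learn's KMeans; the variable names are our own.

import numpy as np
from sklearn.cluster import KMeans

# The one-dimensional age vector from the worked example, as a single feature column
ages = np.array([15, 15, 16, 19, 19, 20, 20, 21, 22, 28,
                 35, 40, 41, 42, 43, 44, 60, 61, 65]).reshape(-1, 1)

# k = 2; n_init=10 reruns k-means with different starting centroids to reduce
# the sensitivity to the initial random selection discussed above
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(ages)

print(km.cluster_centers_.ravel())  # typically about 19.5 and 47.9, as in the worked example
print(km.labels_)                   # cluster membership for each age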
• Birch: This builds a tree called the clustering feature tree (CFT) for the
given data. It typically works well when the number of variables is
less than 20.
13.3.6 Evaluating Clustering
Unlike supervised machine learning algorithms such as classification or regression, it is hard to evaluate the performance of clustering algorithms and compare one with another. However, researchers have developed common techniques based on cluster homogeneity and completeness. Homogeneity measures whether each cluster contains only samples that belong to a single class; completeness measures whether all samples of a given class are assigned to the same cluster. Intracluster distance is another measure used to find out how well the clusters are formed. The objective of a clustering algorithm is to develop distinct clusters that are well separated from each other.
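As a hedged illustration of these ideas (not code from the book), scikit-learn provides homogeneity, completeness, and silhouette scores; the synthetic data below is generated purely to demonstrate the calls.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import homogeneity_score, completeness_score, silhouette_score

# Synthetic data with known class labels, used only for this illustration
X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(homogeneity_score(y_true, labels))   # 1.0 when every cluster contains a single class
print(completeness_score(y_true, labels))  # 1.0 when every class falls into a single cluster
print(silhouette_score(X, labels))         # closer to +1 means more distinct, well-separated clusters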
13.4 Limitations of Clustering
In general, clustering is unsupervised learning, and there are no predefined class
labels in the sample data set. The algorithm reads all the data, and based on different
measures, it tries to group the data into different clusters. Hierarchical clustering is
simple to understand and interpret. It does not require you to specify the number of
clusters to form. It has the following limitations:
• For large data sets, computing and storing the n × n distance matrix can be expensive and slow, and the results can have low stability.
• The results may vary when the distance metric is changed from one measure to another.
K-means is a simple, relatively efficient method. The problem with this method is that a different k can change the results and the cluster formation. A practical approach is to compare the outcomes of multiple runs with different k values and choose the best one based on a predefined criterion. Selecting the initial k is often driven by external factors such as prior knowledge, practical constraints, or requirements; if the selection of k is not based on prior knowledge, you have to try a few different values and compare the resulting clusters.
Clustering high-dimensional data is a major challenge. Many applications, such as text documents and pharmaceutical data, have high-dimensional feature spaces, and distance measures can become meaningless in these applications because of the equidistance problem. Several techniques have been developed to address this, and some of the newer methods include CLIQUE, ProClus, and frequent pattern-based clustering.
13.5 Clustering Using R
In this section, we will create a k-means clustering model using R. The workflow is the same as for any other model: read the data set, explore it, prepare it for the clustering function, and finally create the model. Since there are no definitive metrics for measuring the performance of cluster models, the clusters have to be examined manually after they are formed, before finalizing the cluster names and the cluster elements.
Step 1: Load the essential libraries to the development environment and read
data from the source. This k-means clustering model aims to create clusters based on
students’ performance on assignments and exams. The data contains StudentID, Quiz1,
Quiz2, Quiz3, Quiz4, and Quiz5 variables. The goal is to group students into different clusters based on their performance and assign grades to each student.
Step 2: Check the data types and remove the StudentID column from the data.
Step 3: Standardize data to a single scale using the scale() function. Since all the
variables are on different scales, it is always recommended to scale the data for better
performance.
Quiz1 Quiz2 Quiz3 Quiz4 Quiz5
[1,] -0.2291681 -0.6336322 0.57674037 0.4707916 -0.6022429
[2,] -0.8481581 -1.4423179 -0.55332570 -1.2543467 1.0763489
[3,] 1.0088120 -0.0559995 0.89961639 -1.6856313 -0.6022429
[4,] 0.3898220 0.1172903 0.89961639 0.9020761 1.0763489
[5,] 0.6993170 -0.4603424 -1.19907774 0.4707916 1.0763489
[6,] -0.8481581 -1.7888975 -0.06901167 0.9020761 1.0763489
Step 4: Build the clustering model using the k-means function in R. Initially choose
the k-value as 3. The kmeans() function has several input parameters that are listed next;
we use nstart = 25 to generate 25 initial configurations. For all other parameters, we
use default values.
kmeans() Function Arguments (from the R documentation)
x This is a numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a dataframe with all numeric columns).
centers This is either the number of clusters, say k, or a set of initial (distinct) cluster centers. If a number, a random set of (distinct) rows in x is chosen as the initial centers.
iter.max This is the maximum number of iterations allowed.
nstart If centers is a number, how many random sets should be chosen?
algorithm This is a character string that may be abbreviated. Note that "Lloyd" and "Forgy" are alternative names for one algorithm.
object This is an R object of class kmeans, typically the result ob of ob <- kmeans(..).
method This is a character string, which may be abbreviated. centers causes fitted to return cluster centers (one for each input point), and classes causes fitted to return a vector of class assignments.
trace This is a logical or integer number, currently only used in the default method (Hartigan-Wong). If positive (or true), tracing information on the progress of the algorithm is produced. Higher values may produce more tracing information.
... This is not used.
kmeans() Return Values (from the R documentation)
cluster This is a vector of integers (from 1:k) indicating the cluster to which each point is allocated.
centers This is a matrix of cluster centers.
totss This is the total sum of squares.
withinss This is a vector of within-cluster sums of squares, one component per cluster.
tot.withinss This is the total within-cluster sum of squares, i.e., sum(withinss).
betweenss This is the between-cluster sum of squares, i.e., totss - tot.withinss.
size This is the number of points in each cluster.
iter This is the number of (outer) iterations.
ifault This is an integer that is an indicator of a possible algorithm problem; this is for experts.
Step 5: Summarize the model by printing the number of clusters, cluster distribution,
assignment, and cluster centers using the following functions:
[1] 3 2 3 1 1 1 1 2 3 3 1 3 1 2 3 3 3 2 3 3 3 3 2 2 2 1 2 2 3 3
[31] 3 2 3 3 3 1 3 3 1 1 3 3 1 2 3 1 3 3 3 3 2 3 3 3 3 3 3 3 3 1
[61] 2 1 3 1 1 2 2 1 3 2 2 3 2 3 1 3 2 2 3 2 2 3 2 1 1 1 3 1 3 1
[91] 2 1 3 3 3 3 1 3 1 3 3 1 1 3 3 1 3 2 2 3 1 1 3 3 3 3 3 3 3 3
[121] 3 3 3 1 3 3 1 3 3 2 1
> # Centers of each cluster for each variables
> km_model$centers
Quiz1 Quiz2 Quiz3 Quiz4 Quiz5
1 0.08032695 0.1920428 0.2063826 0.2678341 1.4713117
2 -0.80230698 -0.6550260 -1.0854732 -1.1265587 -0.2913925
3 0.27044532 0.1593750 0.3184396 0.3044389 -0.6022429
>
Step 6: Plot the clusters using the fviz_cluster() function. Since there are more than two dimensions in the data, fviz_cluster() uses principal component analysis (PCA) and plots the first two principal components, which explain the majority of the variance in the data. Figure 13-7 shows the output.
Step 7: Find the optimal value of k by using the elbow method. Fortunately,
fviz_nbclust() supports this.
Step 9: From both methods, the optimal value is found to be 2. We would use 2 and
re-create the model. See Figure 13-10.
[1] 2 1 2 2 2 2 1 1 2 2 2 2 2 1 2 2 2 1 2 2 2 1 1 1 1 2 1 1 2 1
[31] 2 1 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2
[61] 1 2 2 2 2 1 1 2 2 1 1 2 1 2 2 2 1 1 2 1 1 2 1 2 2 2 2 1 2 2
[91] 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 1 2 2 1 2 2 2 2 2
[121] 2 2 2 2 2 2 2 2 2 1 2
Optimize the number of clusters using the NbClust() function. To learn more about NbClust(), please read its documentation. See Figure 13-11.
*******************************************************************
* Among all indices:
* 5 proposed 3 as the best number of clusters
* 3 proposed 4 as the best number of clusters
* 6 proposed 5 as the best number of clusters
* 1 proposed 8 as the best number of clusters
* 1 proposed 9 as the best number of clusters
* 3 proposed 10 as the best number of clusters
* 1 proposed 13 as the best number of clusters
******************************************************************
> require(factoextra)
> fviz_dend(x = cluster_hier,
+ rect = TRUE,
+ cex = 0.5, lwd = 0.6,
+ k = 5,
+ k_colors = c("purple","red",
+ "green3", "blue", "magenta"),
+ rect_border = "gray",
+ rect_fill = FALSE)
Warning message:
`guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
"none")` instead.
The dendrogram in Figure 13-12 shows how the clusters have grouped students based on their performance in the various assignment components. The graph is hard to read and interpret because there are too many observations on the x-axis. The bigger the plot area, the better the visual representation: on a 55- or 65-inch TV screen some of the x-axis values become readable, and on a 10ft by 10ft display probably all of them would be, but practically this is not possible. Hence, we need a better visualization approach to zoom into the plot and understand the details, such as adding horizontal and vertical scroll bars, adding a zoom mechanism, or simply plotting a limited range of values on the x- and y-axes.
13.6 Clustering Using Python
In this section, we create the same k-means clustering model using Python. As in R, we have to find the optimal value of k, the number of clusters, using the elbow method and the silhouette method, which were explained earlier.
Step 1: Import all the necessary libraries to the development environment and
read the data. The k-means clustering model aims to create clusters based on students’
performance in the different components of assignments and exams. The data contains
StudentID, Quiz1, Quiz2, Quiz3, Quiz4, and Quiz5 variables. The goal is to group students into different clusters based on their performance and assign grades to each student.
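The book shows this step's code as a screenshot; a minimal sketch of what it likely does follows. The file name grades.csv is an assumption.

import pandas as pd

# Assumed file name; the frame holds StudentID plus the five quiz columns
grades_df = pd.read_csv('grades.csv')

print(grades_df.shape)
grades_df.head()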
Step 3: For the data exploration and data preparation, we remove StudentID since it
is not required for the clustering analysis. Then we scale all the other variables.
from sklearn.preprocessing import MinMaxScaler  # scaler type assumed from the 0-1 range in the output below
scaler = MinMaxScaler()
grades_df2 = grades_df.drop(columns=['StudentID'])  # drop the identifier before scaling
grades_scaled = pd.DataFrame(scaler.fit_transform(grades_df2),
                             columns=grades_df2.columns)
grades_scaled.head()
Quiz1 Quiz2 Quiz3 Quiz4 Quiz5
0 0.666667 0.20 0.727273 0.9 0.0
1 0.533333 0.06 0.515152 0.5 0.5
2 0.933333 0.30 0.787879 0.4 0.0
3 0.800000 0.33 0.787879 1.0 0.5
4 0.866667 0.23 0.393939 0.9 0.5
Step 4: Create the k-means clustering model. Choose k arbitrarily to start and then optimize it. Please read the documentation to understand the input parameters. Here, we set only the n_clusters, random_state, and n_init parameters and keep the default values for the rest.
KMeans() Parameters (from the sklearn documentation)
n_clusters: int, default=8
n_init: int, default=10
max_iter: int, default=300
tol: float, default=1e-4
verbose: int, default=0. Verbosity mode.
copy_x: bool, default=True
algorithm: "lloyd" by default; "auto" and "full" are deprecated and will be removed in Scikit-Learn 1.3. They are both aliases for "lloyd".
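The model-creation code itself is shown as a figure in the book, so the following is only a hedged sketch consistent with the step description (only n_clusters, random_state, and n_init are set). The value n_clusters=5 is a guess based on the five centroid rows printed in Step 5, and the random_state value is arbitrary.

from sklearn.cluster import KMeans

# Initial, arbitrary choice of k; it is optimized later with the elbow and silhouette methods
kmm_model = KMeans(n_clusters=5, n_init=10, random_state=42)
kmm_model.fit(grades_scaled)

print(kmm_model.labels_[:10])      # cluster assigned to the first ten students
print(kmm_model.cluster_centers_)  # one centroid row per cluster, in the scaled quiz space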
Step 5: Check the cluster model output and display the cluster allocation to the
StudentIDs.
Quiz1 Quiz2 Quiz3 Quiz4 Quiz5
0 0.608696 0.264348 0.565217 0.752174 5.217391e-01
1 0.837500 0.393750 0.734848 0.931250 7.187500e-01
2 0.852482 0.306596 0.710509 0.885106 1.110223e-16
3 0.685714 0.283333 0.474747 0.395238 -2.775558e-17
4 0.497222 0.326250 0.540404 0.895833 -2.775558e-17
grades_clusters = pd.DataFrame(kmm_model.labels_ ,
columns=['Cluster ID'],
index = grades_df.StudentID)
grades_clusters.head(10)
Cluster ID
StudentID
20000001 2
20000002 0
20000003 3
20000004 1
20000005 0
20000006 0
20000007 0
20000008 3
20000009 2
20000010 2
Step 6: Find the optimal value of k using the elbow method, as shown in Figure 13-13.
kmm_values = [1,2,3,4,5,6,7,8,9,10]
k_elbowFunc(kmm_values, grades_scaled)
Figure 13-13. Finding the optimal value of k using the elbow method
Similarly, the silhouette measure indicates how distinct two clusters are, that is, how well they are separated. Its values range from -1 to 1. The y-axis of the plot shows the silhouette width for the different k values on the x-axis. Silhouette coefficients close to +1 indicate that the sample is far away from the neighboring clusters (this is what we want), a value at or near 0 indicates that the sample is very close to the decision boundary between two neighboring clusters, and negative values indicate that samples may have been assigned to the wrong cluster.
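A hedged sketch of how such a silhouette curve can be produced with scikit-learn follows; the helper name and plotting details are ours, not the book's.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def k_silhouetteFunc(k_values, data):
    # Average silhouette width for each candidate k (silhouette needs k >= 2)
    scores = [silhouette_score(data,
                               KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(data))
              for k in k_values]
    plt.plot(list(k_values), scores, marker='o')
    plt.xlabel('Number of clusters k')
    plt.ylabel('Average silhouette width')
    plt.show()

k_silhouetteFunc(range(2, 11), grades_scaled)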
Step 7: From Figure 13-13, the optimal value of k is 2, or arguably 4, so we can pick one of these values and rebuild the model.
grades_clusters = pd.DataFrame(kmm_model_k3.labels_ ,
columns=['Cluster ID'],
index = grades_df.StudentID)
grades_clusters.head(10)
Cluster ID
StudentID
20000001 1
20000002 3
20000003 0
20000004 3
20000005 3
20000006 3
20000007 3
20000008 0
20000009 1
20000010 1
pca_mds['K_cluster'] = kmm_model_k3.predict(grades_scaled)
sns.scatterplot(data=pca_mds,x = "pca_1",y="pca_2",hue="K_cluster")
<AxesSubplot:xlabel='pca_1', ylabel='pca_2'>
The next section explains hierarchical clustering models and how to create them using the hierarchy module from SciPy (scipy.cluster.hierarchy). It is pretty straightforward. Since we already explained how hierarchical clustering works in theory, we will just present the code here. As you learned earlier, two measures are important: the distance among different observations for every variable, and the distance between the clusters. Dendrograms show the cluster formation for a specific distance measure. See Figure 13-15.
#Hierarchical Clustering
from scipy.cluster import hierarchy  # provides the linkage and dendrogram functions
# Create clustering using Single linkage
cluster_hier = hierarchy.linkage(grades_scaled, 'single')
cluster_dendo = hierarchy.dendrogram(cluster_hier, orientation='right',
                                     labels = grades_df['StudentID'].to_list())
#Hierarchical Clustering
# Create clustering using "Group Average linkage"
cluster_hier = hierarchy.linkage(grades_scaled, 'average')
cluster_dendo = hierarchy.dendrogram(cluster_hier, orientation='left',
labels = grades_df['StudentID'].to_list())
The dendrogram in Figure 13-16 shows how the clusters have grouped students based on their performance in the various assignment components. The graph becomes hard to read and interpret as the number of observations grows. The bigger the plot area, the better the visual representation: a large TV screen makes some of the x-axis values readable, and an even larger display would make most of them readable, but practically this is not possible. Hence, we have to develop a better visualization program to zoom into the plot and understand the details, such as adding horizontal and vertical scroll bars, adding a zoom mechanism, or simply plotting a limited range of values on the x- and y-axes.
Various other clustering methods are in use, depending on the use case and application. The sklearn library supports several of them, including affinity propagation, mean shift, spectral clustering, agglomerative (Ward) clustering, DBSCAN, OPTICS, Gaussian mixture models, and BIRCH.
13.7 Chapter Summary
In this chapter, you learned what unsupervised learning is and what clustering analysis is. You also looked at various clustering techniques, including hierarchical and nonhierarchical clustering, and the distance measures used for creating clusters, including Euclidean distance, Manhattan distance, single linkage, average linkage, and so on. Finally, you learned how to create k-means and hclust() models using both R and Python and how to select the optimal value of k for the right number of clusters.
CHAPTER 14
Relationship Data Mining
14.1 Introduction
The growth of e-commerce, digital transactions, and the retail industry has led to the
generation of a humongous amount of data and an increase in database size. The
customer transactional information and the relationships between the transactions
and customer buying patterns are hidden in the data. Traditional learning and data
mining algorithms that exist may not be able to determine such relationships. This has
created an opportunity to find new and faster ways to mine the data to find meaningful
hidden relationships in the transactional data. To find such associations, the association
rule algorithm was developed. Though many other algorithms have been developed, the apriori algorithm introduced in 1993 by Agrawal and Srikant (AIS93) is the most prominent one. The apriori algorithm mines data to find association rules in large real-world transactional databases. This method is also referred to as association-rule analysis, affinity analysis, market-basket analysis (MBA), or relationship data mining.
The association rule analysis is used to find out “which item goes with what item.”
This association is used in the study of customer transaction databases, as shown in
Figure 14-1. The association rules provide a simple analysis indicating that when an
event occurs, another event occurs with a certain probability. Knowing such probability
and discovering such relationships from a huge transactional database can help
companies manage inventory, product promotions, product discounts, the launch of
new products, and other business decisions. Examples are finding the relationship
between phones and phone cases, determining whether customers who purchase a
mobile phone also purchase a screen guard, or seeing whether a customer buys milk
and pastries together. Based on such association probabilities, stores can promote a new product, sell extra services, or sell additional products at a promotional price. Such analysis might encourage customers to buy a new product at a reduced price and help companies increase sales and revenue. Association rules are probabilistic if-then relationships computed from the data.
Association rules fall under unsupervised learning and are used for the discovery
of patterns rather than the prediction of an outcome. Though the association rules
are applied for transactional data, the rules can also be applied in other areas such as
biomedical research to find patterns of DNA, find insurance fraud patterns, find credit
card transaction frauds, etc.
The apriori algorithm (Agrawal and Srikant, 1995) generates frequent-item sets. It begins with one-item sets, then generates two-item sets of items frequently purchased together, then three-item sets, and so on, until all the frequent-item sets in the transactional database are generated. Once the list of frequent-item sets is generated, you can find out how many of those item sets appear in the database. In general, generating the n-item sets uses the frequent (n-1)-item sets and requires one complete pass through the database. The apriori algorithm is fast and efficient even for large databases with many items.
The apriori rules are derived based on the association of frequent items in the data.
Transaction data provides the knowledge of frequent items in the data. By having such
knowledge, we can generate rules and efficiently reduce the number of frequent items
of interest. The apriori algorithm is based on the assumption that “the entire subset
of a frequent item set must also be frequent.” The item set is frequent only if all of its
subsets, pairs, triples, and singles occur frequently and are considered “interesting”
and “valuable.” Before discussing how to generate the association rules, to discover the
most frequent patterns, we will discuss the support, confidence, and lift metrics used to
deduce the associations and rules.
14.2.1 Support
Support (S) is the fraction of transactions that contain both A and B (antecedent and
consequent). The support is defined as the number of transactions that include both
the antecedent and the consequent item sets. It is expressed as a percentage of the total
number of records in the database.
Support(A & B) = Freq(A & B) / N (where N is the total number of transactions in
database)
For example, if the two-item set {Milk, Jam} appears in 5 out of a total of 10 transactions in the data set, then Support = S = 5/10 = 50%.
14.2.2 Confidence
Confidence (A --> B) is a ratio of support for A and B to the support for A. It is expressed
as a ratio of the number of transactions that contain A and B together to the number of
transactions that contain A.
Conf(A --> B) = Trans(A & B) / Trans(A) = P(B | A)
Though support and confidence are good measures of the strength of an association rule, they can sometimes be deceptive. For example, if the antecedent or the consequent has high support, the rule can have high confidence even when the two items are independent.
14.2.3 Lift
Lift measures how much more often the antecedent and the consequent occur together than would be expected if they were independent, so it gives a better measure for comparing the strength of associations. Mathematically, lift is defined as follows:
Lift(A --> B) = Conf(A --> B) / P(B) = [P(A & B) / P(A)] / P(B) = P(A & B) / [P(A) × P(B)]
For the first case, calculate the support, confidence, and lift for shirt -> tie; the antecedent is shirt, and the consequent is tie. Whenever someone buys a shirt, they may also buy a tie.
There are five transactions in total. Out of five transactions, there are three
transactions that have shirt -> tie.
Support(A & B) = Freq(A & B) / N (where N is the total number of transactions in
database)
Support (shirt->tie) = 3/5 = 0.6
Confidence(A -->B) = Support(A & B) / Support(A) = Freq(A & B) / Freq(A)
Confidence (shirt->tie) = 3/3 = 1
Lift(A -->B) = Support(A & B) / [Support(A) × Support (B)]
Lift = (3/5) / (3/5 * 5/5) = 1
Figure 14-2 lists the other combinations’ support, confidence, and lift.
Go ahead and calculate the support, confidence, and lift measures for the other examples: socks -> shirt and (pant, tie) -> belt.
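As a generic, hedged illustration (the five transactions themselves are in Figure 14-2 and are not reproduced here, so the basket lists below are placeholders, not the book's data), the three metrics can be computed directly from a list of transactions:

def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    n = len(transactions)
    freq_a  = sum(1 for t in transactions if antecedent <= t)                 # transactions containing A
    freq_b  = sum(1 for t in transactions if consequent <= t)                 # transactions containing B
    freq_ab = sum(1 for t in transactions if (antecedent | consequent) <= t)  # transactions containing both
    support = freq_ab / n
    confidence = freq_ab / freq_a
    lift = confidence / (freq_b / n)
    return support, confidence, lift

# Placeholder baskets -- NOT the transactions from Figure 14-2
baskets = [{'shirt', 'tie'}, {'shirt', 'tie', 'belt'}, {'shirt', 'tie'},
           {'socks', 'shirt'}, {'pant', 'belt'}]
print(rule_metrics(baskets, {'shirt'}, {'tie'}))  # (0.6, 0.75, 1.25) for this toy data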
For example, suppose there are four transactions in database D, with transaction IDs 200, 201, 202, and 203, over the items pencil, pen, eraser, scale, and notebook. For simplicity, we will number these items 1, 2, 3, 4, and 5 in sequence.
The first step is to set a support value. We will set it to 50%, so support S = 50%.
The next step is to generate a one-item set, two-item set, three-item set, etc., based
on the support.
There are five items; hence, we have five one-item sets and seven candidate two-item sets, as shown in Figure 14-3. Out of the seven two-item sets, only four meet the support criterion (50 percent support): {1,2}, {2,3}, {2,4}, and {3,4}. The next step is to generate three-item sets from these; {2,3,5}, {2,3,4}, {1,2,3}, and {1,3,4} are the four possible three-item sets. Out of these four, only one meets the support criterion: item set {2,3,4}.
Once we generate the rules, the goal is to find the rules that indicate a strong
association between the items and indicate dependencies between the antecedent
(previous item) and the consequent (next item) in the set.
The support indicates how much of the overall transaction data an item set covers; if only a small number of transactions meet the minimum support, the rule may be ignored. The lift ratio indicates how much stronger the rule is than a random selection of the consequent, while the confidence gives the rate at which the consequent is found among transactions containing the antecedent. Low confidence indicates a low consequent rate, so you should weigh carefully whether promoting the consequent is worthwhile. The more records there are, the better the conclusion, and the more distinct the rules considered, the better the interpretation and outcome. We recommend reviewing the rules with a top-down approach rather than automating the decision by searching thousands of rules.
A high value of confidence suggests a strong association rule. But when B is
independent of A—that is, p(B) = p(B | A)—and p(B) is high, then we’ll have a rule with
high confidence. For example, if p(“buy pen”) = 85 percent and is independent of “buy
pencil,” then the rule “buy pen” ⇒ “buy pencil” will have a confidence of 85 percent. If
nearly all customers buy pen and nearly all customers buy pencil, then the confidence
level will be high regardless of whether there is an association between the items.
Similarly, if support is very low, it is not worth examining.
Note that association rules do not represent causality or correlation between the two items. A --> B does not mean that A causes B (no causality is implied), and A --> B can be different from B --> A, unlike correlation, which is symmetric.
> str(marys)
'data.frame': 100 obs. of 15 variables:
> sum(is.na(marys))
[1] 0
> # apriori() function accepts logical values and hence convert data to Logical
> marys_1 <- marys %>% mutate_if(is.numeric,as.logical)
> marys_2<-subset(marys_1, select = -c(Trans..Id))
> #str(marys_2)
> head(marys_2)
Belt Shoe Perfume Dress Shirt Jackets Trouser Tie Wallet TravelBag NailPolish
1 FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE
2 FALSE FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE
3 FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
4 FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE TRUE
5 FALSE TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE
6 FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
Socks Hats Fitbit
1 FALSE FALSE TRUE
2 TRUE FALSE FALSE
3 TRUE TRUE FALSE
4 FALSE FALSE TRUE
5 TRUE TRUE FALSE
6 FALSE FALSE TRUE
Step 4: Find the frequent item sets and association rules using the apriori()
algorithm with support, confidence, and lift. In this case, we have set support to 0.5 and
confidence to 0.7.
> #Find frequent itemsets and association rules by applying apripori() algorithm
> #by setting support and confidence limits
> rules<-apriori(marys_2,
+ parameter = list(minlen=3, support=0.5, conf=0.7))
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen maxlen
0.7 0.1 1 none FALSE TRUE 5 0.5 3 10
target ext
rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Step 5: List the set of association rules and explore the output rules.
Figure 14-8 shows both Input and Output of the association rules.
import os
import pandas as pd
import numpy as nd
data_dir = 'C:/Personal/dataset'
os.chdir(data_dir)
marys_df = pd.read_csv('marys.csv')
marys_df.head()
   Trans. Id  Belt  Shoe  Perfume  Dress  Shirt  Jackets  Trouser  Tie  \
0          1     0     1        1      1      1        0        1    1
1          2     0     0        1      0      1        0        1    1
2          3     0     1        0      0      1        1        1    1
3          4     0     0        1      1      1        0        1    0
4          5     0     1        0      0      1        0        1    1

   Wallet  TravelBag  NailPolish  Socks  Hats  Fitbit
0       1          0           0      0     0       1
1       0          0           1      1     0       0
2       1          1           1      1     1       0
3       0          0           1      0     0       1
4       1          1           0      1     1       0
print(marys_df.shape)
(100, 15)
Step 2: As in the R example, drop the Trans. Id column so that marys_df2 contains only the item columns.
marys_df2.head()
Step 3: Find the frequent item sets using the apriori() algorithm with a
support of 0.5.
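The listing for this step appears in Figure 14-11; a hedged sketch of what it likely looks like, assuming the mlxtend implementation (the library is not named in the surviving text, but it matches the association_rules() call used in Step 4), is:

from mlxtend.frequent_patterns import apriori

# marys_df2 holds only the one-hot item columns prepared in Step 2
item_sets = apriori(marys_df2, min_support=0.5, use_colnames=True)
item_sets.head(6)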
item_sets.head(6)
support itemsets
0 0.55 (Shoe)
1 0.60 (Perfume)
2 0.85 (Shirt)
3 0.70 (Trouser)
4 0.50 (Tie)
5 0.70 (Wallet)
Figure 14-11. Step 3, executing the apriori() algorithm to find frequent item sets
Step 4: Find the association rules from the frequent data set. We use the
association_rules() function.
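Again as a hedged sketch (the confidence threshold of 0.7 mirrors the R example earlier; the value used in the book's figure is not visible):

from mlxtend.frequent_patterns import association_rules

# Generate rules from the frequent item sets found in Step 3
mba_rules = association_rules(item_sets, metric="confidence", min_threshold=0.7)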
mba_rules.head(10)
mba_rules[mba_rules['antecedents']==frozenset({'Shirt'})].sort_values(by
= 'lift', ascending = False)
Figure 14-13. Printing rules with support 0.7, one item set
We use frozenset() to build a Python function that finds the set of rules for promoting any item set. The frozenset() function is commonly used to remove duplicates from a sequence and to compute mathematical set operations such as intersection, union, and symmetric difference. In this example, we use it to find the {Shirt}-only rules and then the rules for the {Shirt, Wallet} combination.
Figure 14-14 shows both the input and output of the association rules for the two-item set. The rules are sorted by their lift values. From these rules, we can say that {Shirt, Wallet} and {Shoe} have a strong association and are sold together. Similarly, the next association is {Shirt, TravelBag} with {Wallet}, which are sold together. Such associations help in making business decisions such as managing inventory, promoting a new {Shoe} or {Wallet}, and stocking appropriately.
14.6 Chapter Summary
In this chapter, we explained another unsupervised learning technique called
relationship mining. Relationship mining is also referred to as association rules mining
or market basket analysis.
Association rules find interesting associations among large transactional item sets
in the database. You learned the basic concepts of association rule analysis, how to
perform such analysis using both Python and R, and what metrics are used to measure
the strength of the association with a case study.
CHAPTER 15
Introduction to Natural
Language Processing
A language is not just words. It’s a culture, a tradition, a unification of a
community, a whole history that creates what a community is. It’s all
embodied in a language.
Noam Chomsky
Natural language processing (NLP) aims to make computers process human languages.
NLP has a wide variety of applications today. You may already be using them when you
are buying a book on Amazon, talking to Google Assistant or Siri when checking the
weather, or talking to an automated chat agent when seeking customer service. In this
chapter, we start with an overview of NLP and discuss NLP applications, key concepts,
various NLP tasks, how to create models, the Python and R NLP libraries, and case
studies.
15.1 Overview
Today technology has become ubiquitous; without smartphones, it is hard to conduct daily business. Our morning routine often starts by talking to Siri, Google Assistant, or a similar AI bot on our phones, asking about the weather or traffic. We talk to these voice assistants in our natural language, not in computer programming languages. But computers know how to interpret and process only binary data, so how can you make computers understand human language?
In computer science, NLP is an area that deals with methods, processes, and
algorithms to process languages. There are various steps to perform NLP tasks. This
chapter will explain the various NLP tasks and commonly used methods, models, and
techniques to process language. This chapter will help you solve NLP problems with
various techniques and methods and suggest the best method to choose based on the
type of problem you are solving.
We start with an overview of numerous applications of NLP in real-world scenarios,
then cover a basic understanding of language and what makes NLP complex, and
next discuss various NLP tasks involved in building NLP applications. We will also be
discussing machine learning and deep learning methods to solve NLP.
15.2 Applications of NLP
There are several applications of NLP, including Amazon Alexa, Google Assistant, and
Siri. Some other applications of NLP include sentiment analysis, question answering,
text summarization, machine translation, voice recognition and analysis, and user
recommendation systems. Grammar checkers, spelling checkers, Google language
translators, and voice authentication are some other applications of NLP. Other
examples of NLP are your smartphone messaging application’s next word prediction
functionality and the recent Microsoft Outlook feature that suggests automatic email
responses.
15.2.1 Chatbots
Chatbots can converse with humans using natural languages. AI chatbots have recently become popular and are mainly used as the first level of customer support, answering simple user queries and collecting customer information for further assistance. AI chatbots have several benefits: they enhance efficiency, cut operational time and cost, and carry no emotions or bias toward customers. According to Salesforce, 69 percent of consumers are satisfied with the use of chatbots to resolve their issues and the speed at which they can communicate.
Figure 15-1 is an example of a chatbot from the CITI Bank account website. When
you go online and contact customer support, an AI chatbot triggers the conversation.
According to a Mordor Intelligence report, the chatbot market in 2020 was valued at
USD $17 billion and is projected to reach USD $102 billion by 2026.
Voice-assisted intelligent devices such as Alexa, Google Home, etc., are gaining
tremendous popularity due to convenience. In addition, many recent advancements
in deep learning neural networks and machine learning techniques have made the
technology more efficient. Though the technology is heavily adopted in many sectors,
the banking sector has been a significant adopter due to its speed, efficiency, and quicker
response time. In addition, chatbots in the banking sector are also assisting in capturing
critical behavior data and performing cognitive analytics. These analytics help businesses
to build a better customer experience. For the complete market analysis, read the full
report at https://www.mordorintelligence.com/industry-reports/chatbot-market.
15.2.2 Sentiment Analysis
It is common to read feedback and reviews on social media before buying any product
or service. There has been a rapid growth of online forums to post product-related
information and customers’ opinions. This information is useful for both companies and
customers. Customers read these reviews and can assess the product’s quality, customer
experience, price, and satisfaction level in purchasing decisions. At the same time,
companies are interested in such information to assess the feedback of their products
and services and arrive at decisions to improve the product quality and customer
experience.
Sentiment analysis has been studied by academic and enterprise applications and
is gaining importance, particularly to understand customers’ emotions such as happy,
satisfied, dissatisfied, angry, or simply positive or negative. Sentiment classification is
valuable in many business intelligence applications and recommendation systems,
where thousands of feedback and ratings are summarized quickly to create a
snapshot. Sentiment classification is also useful for filtering messages, such as spam
(Tatemura, 2000).
Twitter, Facebook, Tumblr, and other social media encourage users to freely express
their opinions, thoughts, and feelings and share them with their friends. As a result, the vast amount of unstructured social media data may contain information that is useful to marketers and researchers. Each individual message carries little information, but analyzing millions of such messages can yield meaningful insights. Typical sources of user sentiment that can be analyzed include product reviews, customer feedback and ratings, and posts on social media platforms.
15.2.3 Machine Translation
So many languages are spoken globally that translation is always needed to communicate with people who speak different languages. Even today, human translators who have learned multiple languages help others by translating from one language to another, so there has always been a need for machine translation. The earliest machine translation methods were developed during the Cold War to translate Russian documents into English using NLP. In 1964, the U.S. government created a committee called the Automatic Language Processing Advisory Committee (ALPAC) to explore machine translation techniques. Although ALPAC could not produce promising results in its time, today's advancements in computational technology make it possible to build models that translate languages with high accuracy.
Microsoft also has a machine translator. Figure 15-5 is the translator output for the
exact text used earlier (https://www.bing.com/translator).
It should be noted that the Microsoft translator was faster than Google, with an almost instantaneous translation time. So, there have been significant improvements from where we started to where we are today, even though there is still scope for further improvement and research in this area.
Chatbots, machine translators, and sentiment analyzers are only a few applications of NLP. Others include text summarization, information extraction, named entity recognition (NER), automatic text indexing, recommendation systems, and trend analysis.
Text summarization is the process of generating a short summary of a paragraph or a text document. Information retrieval is the retrieval of specific information related to a selected topic from a body of text or documents. Named entity recognition (NER) is a subtask of information retrieval that locates and classifies named entities, such as organizations, persons, or animals, in unstructured text. Based on previous customer behavior, a recommendation system suggests similar material to the customer, such as Netflix movie recommendations or Amazon purchase recommendations.
Let’s begin our NLP discussion with a shared understanding of language and how
computers process language.
15.3 What Is Language?
Before the existence of languages, humans used to communicate using signs or
drawings. Though they were simple and easy to understand, not all emotions were easy
to express. Language is a mode of communication between two parties that involves a
complex combination of its constituent elements such as scripts, characters, and letters.
Language also comprises words, structure, tone, and grammar. It has both syntax and
semantics. It is easy to learn all the aspects of a native language as humans. However,
suppose one has to learn another language; that requires a lot more effort to learn
all aspects such as the structure, the grammar, and the ambiguities that exist in that
language.
Learning any natural language, whether it is English, French, or German, requires
years of learning and practice. To speak a language means understanding many aspects
of the language such as concepts of words, phrases, spellings, grammar, and structure
in a meaningful way. Though it may be easier for humans to learn, making a computer
learn and master a language is challenging.
It has already been proven that computers can solve complex mathematical
functions, but they have yet to master all the components of a spoken or written
language.
Pronunciation and sounds are also an essential part of the language. Similarly, the
roots of the words, how they are derived, and the context when used in a phrase are all
critical aspects of any language. In formal language definitions, a language consists of
phonemes (sounds and pronunciations), morphemes and lexemes (the roots of the
words), syntax, semantics, and context. It is vital to understand all aspects of a language
and how a language is structured in order for the computer to process the language.
Figure 15-6 shows the different components of language and some of the associated NLP
tasks and applications.
Figure 15-6 shows, for example, that the syntax component (phrases and sentences) maps to NLP tasks such as parsing, context-free grammar (CFG), and entity extraction.
15.3.1 Phonemes
Phonemes are the various sounds for the different letters in a language, such as the way you pronounce a or b or the words apple and ability. When uttered in combination with other letters, phonemes can produce different sounds, and they are essential to the meaning of the words. For example, standard English has around 44 phonemes.
15.3.2 Lexeme
A lexeme is the abstract unit underlying a word's various inflectional endings. For example, take, taking, and taken are all inflected forms of the same lexeme take. Similarly, dogs and dog, and cats and cat, are inflected forms of the lexemes dog and cat. A lexeme is the smallest unit of a word without its inflectional endings.
15.3.3 Morpheme
Morphemes are similar to lexemes, but they are obtained by removing prefixes and suffixes, which add meaning to the words in a language. For example, by removing -ed from the English word followed, we get follow, which is a morpheme. Similarly, the words reliable, reliability, and rely are derived from the same morpheme, rel. Not all morphemes are words, and morphemes can carry grammatical structure in a text.
15.3.4 Syntax
Syntax is a set of grammar rules to construct sentences from a set of words in a language.
In NLP, syntactic structure is represented in many ways. A common approach is a tree
structure, as shown in Figure 15-7. A constituent is a group of words that appears as a
single unit in a phrase. A sentence is a hierarchy of constituents. The sentence’s syntactic
structure is guided by a set of grammar rules for the language. The earliest known
grammar structure is in the Sanskrit language and was defined by Panini in the 4th
century BC.
Figure 15-7 shows such a tree, with noun phrases (NP) such as "an ant" and "my pants" appearing as constituents.
15.3.5 Context
The words can have different meanings depending on the context. Context sets the
situation. For example, the word fly can have two different meanings. Suppose it is
associated with the word aeroplane; in that case, it can mean travel, whereas fly also
can mean an insect. The context of the word fly depends on its association. Languages
can have multiple meanings for the words; thus, the meaning of a sentence can change
depending on the context and how the words are used.
Though humans can resolve ambiguity easily, it is difficult for machines to do so, which makes NLP challenging.
Portability of NLP solutions: Since every language is different, an NLP solution developed for one language may not work with another. This makes it hard to develop a single NLP solution that models all languages: you must either build a language-agnostic solution or build separate solutions for each language, and both options are challenging.
15.5 Approaches to NLP
We can solve NLP problems using the traditional approach, parsing words, sentences,
grammar; using a heuristic approach; or using probabilistic and machine learning
approaches. We will briefly describe all the methods with examples using Python
libraries. Of course, you can do all the NLP tasks using R as well.
The heuristic approach is the early attempt of building an NLP system. It is based on
creating a set of rules and developing a program to map the rules. Such systems required
a corpus containing word dictionaries and their meaning compiled over time. Besides
dictionaries and meaning of words, a more elaborate knowledge base has been built
over a period of time to support a rule-based heuristic NLP system, including semantic
relationships, synonyms, and hyponyms. One such example is the WordNet corpus
(https://wordnet.princeton.edu/).
15.5.1 WordNet Corpus
WordNet is an open-source public corpus with a large lexical database of nouns,
verbs, adjectives, and adverbs grouped into sets of synonyms. They are referred to as
synsets and are interlinked using conceptual-semantic and lexical relations. Figure 15-8
demonstrates how to access WordNet using the NLTK library and the word association
of sweet.
Figure 15-8. Accessing the WordNet corpus using the NLTK library
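The code in the figure did not survive extraction cleanly; a rough equivalent using NLTK's WordNet interface (our reconstruction, not the original listing) is:

import nltk
from nltk.corpus import wordnet

nltk.download('wordnet')  # one-time download of the WordNet corpus

syns = wordnet.synsets("sweet")                      # all synsets for the word "sweet"
print(syns[0].name())                                # name of the first synset
print(syns[0].definition())                          # dictionary-style gloss of that sense
print([lemma.name() for lemma in syns[0].lemmas()])  # synonyms grouped in that synset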
15.5.2 Brown Corpus
Brown University created the Brown corpus in 1961. The corpus contains more than one million words of English text drawn from roughly 500 sources, categorized by genre, such as news, editorial, and so on. Figure 15-9 shows an example of accessing the Brown corpus using the NLTK library.
from nltk.corpus import brown   # requires nltk.download('brown') the first time
brown.categories()
brown.words(categories='news')
15.5.3 Reuters Corpus
The Reuters corpus contains nearly 1.3 million words from 10,000 news articles. The
documents have been classified into 90 different topics. Figure 15-10 shows an example
of accessing the Reuters corpus.
from nltk.corpus import reuters   # requires nltk.download('reuters') the first time
reuters.categories()
15.5.4.1 re.search() Method
The re.search() method looks for a specific pattern and returns the first match. In the example shown in Figure 15-11, re.search() scans the string for the pattern and reports where it is found; here the pattern RE is found starting at the 27th character of the string. The example also searches for the patterns ex and or. Note that re.search() and re.findall() are two different methods; you can examine the output to understand the difference between the two.
Here is the input:
import re
text
re.search(r'RE',text)
re.search(r'ex',text)
re.search(r'or',text)
15.5.4.2 re.findall()
This method is similar to re.search(), but it finds all non-overlapping matches of a
pattern in a string, as shown in Figure 15-12.
re.findall(r'or',text)
re.findall(r'regex',text)
re.findall(r'[RE]',text.lower())
15.5.4.3 re.sub()
The sub() function substitutes a search pattern with a new string, similar to Find and Replace in Microsoft Word. In Figure 15-13, re.sub() searches for the pattern RE in the string (passed as the third parameter) and replaces each occurrence with the new string REGEXPR (the second parameter).
re.sub(r'RE','REGEXPR',text)
The purpose of this section was to introduce the regular expression tool library.
The syntax and different pattern types for a regular expression can be found in the
documentation. For the Python documentation, refer to: https://docs.python.org/3/
library/re.html. For the R regular expression documentation, please refer to https://
www.rdocumentation.org/packages/base/versions/3.6.2/topics/regex.
TidyText: This is another popular library for text mining and text processing. This is
based on the same principles of tidyr and dplyr. Reference: https://cran.r-project.
org/web/packages/tidytext/tidytext.pdf
Stringr: Another simple, easy-to-use set of string-processing functions built around the stringi package. Reference: https://cran.r-project.org/web/packages/stringr/index.html
SpacyR: SpacyR is an R wrapper around the Python Spacy package for R programmers. It supports all the text-processing and NLP functionality provided by Spacy, including tokenization, lemmatization, extracting token sequences, entities, phrases, etc. Reference: https://spacy.io/universe/project/spacyr
tm: This is the most popular package in R for creating a corpus object and processing
text within a dataframe. It supports a number of NLP functions. Reference: https://
cran.r-project.org/web/packages/tm/tm.pdf
15.8.1 Text Normalization
In the earlier sections, we talked about the properties of language such as phonemes,
morphemes, lexemes, etc. These are important steps in building the vocabulary, and
sometimes this is called normalizing the text. In this section, we will discuss the steps to
normalize text for further processing and prepare it for various NLP tasks. Some tasks
include normalizing text to all lowercase letters and removing unnecessary punctuation,
decimal points, etc., to have all the terms in the same form. For example, you would
remove the periods in U.S.A. and N.Y.C. Similarly, you would keep the root of the word
and remove its prefixes and suffixes. For example, just keep the word leader when leaders
and leader appear in the text.
15.8.2 Tokenization
The first step in NLP is to break the larger chunks of documents into smaller sentences
and the sentences into smaller chunks of words known as tokens. Each token is
significant as it has a meaning associated with it. Tokenization is also the fundamental
task in text analytics. Tokens can be words, numbers, punctuation marks, symbols, and
even emoticons in the case of social media texts.
Let’s go through a few examples of tokenization techniques using the NLTK library.
NLTK supports the word_tokenize() function, which tokenizes words based on
space. The first step is to invoke the libraries using the import function and then call the
word_tokenize() function with the appropriate parameters. In this example, we use the
text and tokenize the complete text string using the word_tokenize() method.
Here is the input:
## Sample Text
my_text = ("The Omicron surge of coronavirus cases had not yet peaked in the US "
           "and the country is expected to see an increase in hospitalisations "
           "and deaths in the next few weeks")

# Tokenization
from nltk.tokenize import word_tokenize
mytext_tokens = word_tokenize(my_text)
print(mytext_tokens)
This method does not deal appropriately with apostrophes. For example, I’m and
we’ll should be ideally split into “I am” and “we will.” But, this method does not do it.
Let’s see how our library handles this situation with another example.
sent
sent.split()
sent2
word_tokenize(sent2)
As you can see from the previous example, the tokenizer did not do a good job with apostrophes. Such situations should be handled appropriately, for example with regex. Other tokenizer libraries are available that do a similar job, including the Treebank tokenizer, the WordPunct tokenizer, and the TweetTokenizer. The Treebank tokenizer does a better job, splitting a word such as I'm into I and 'm.
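As a brief, hedged illustration of the Treebank tokenizer mentioned above (the sample sentence is ours):

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
# Contractions are split into separate tokens: "I'm" -> "I", "'m"; "we'll" -> "we", "'ll"
print(tokenizer.tokenize("I'm sure we'll need a better tokenizer."))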
15.8.3 Lemmatization
Lemmatization is the process of removing ending letters to create a base form of the
word. For example, the word car may appear in different forms in the whole document
such as car, cars, car’s, cars’, etc., and the lemmatization process brings these to a single
base form of car. The base form of the dictionary word is called the lemma. Several
lemmatizer libraries are available such as WordNet, Spacy, TextBlob, and Gensim. The
most commonly used lemmatizer is the WordNet lemmatizer. In our example, we will
explore the WordNet and Spacy lemmatizers.
Here is the input:
##Lemmatization
from nltk.stem import WordNetLemmatizer
lemma = WordNetLemmatizer()
mytext_lemma = ' '.join([lemma.lemmatize(word) for word in mytext_tokens])
print(mytext_lemma)
The first example, shown in Figure 15-16, uses WordNetLemmatizer(), and the
second example, shown in Figure 15-17, uses the Spacy lemmatizer. As we can see
from the output, the Spacy lemmatizer has done a better job than the NLTK WordNet
lemmatizer: it reduced the words peaked and expected to their base forms, which
WordNet missed.
Here is the input:
#Spacy Lemmatizer
import spacy
load_spacy = spacy.load('en_core_web_sm')
spacy_lemma = load_spacy(my_text)
print(" ".join([word.lemma_ for word in spacy_lemma]))
15.8.4 Stemming
Stemming is a process where words are reduced to their root form. As part of stemming,
the inflected form of a word is changed to the base form, called the stem, by chopping off
its affixes. For example, digit, digitization, and digital will all be stemmed to digit.
The stem may not always be a valid dictionary word.
The two most common stemmers used are the Porter stemmer and the Snowball
stemmer. The Porter stemmer was the earliest stemmer developed. The Snowball
stemmer supports multiple languages, whereas the Porter stemmer supports only the
English language. The following examples demonstrate the stemming process using the
Porter stemmer and the Snowball stemmer.
Here's the input:
my_text
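The Porter stemmer code appears only as a figure in the original; a minimal sketch, assuming the mytext_tokens list produced earlier by word_tokenize(), is:
#PorterStemmer() (sketch)
from nltk.stem import PorterStemmer
port_stem = PorterStemmer()
print(' '.join(port_stem.stem(word) for word in mytext_tokens))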
#SnowballStemmer()
from nltk.stem.snowball import SnowballStemmer
print(SnowballStemmer.languages)
snow_stem= SnowballStemmer(language='english')
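Applying the stemmer to the tokens (shown as a figure in the original) might look like this sketch:
print(' '.join(snow_stem.stem(word) for word in mytext_tokens))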
As you can see from the previous example, the Porter stemmer stemmed the
words surge, cases, peak, expect, increase, hospitalization, and weeks, and the Snowball
stemmer stemmed the words surge, coronavirus, cases, peak, expect, increase,
hospitalization, and weeks. Both stemmers have done a similar job, and in practice you
could use either one.
15.8.5 Stop Words
Stop words are very common words, such as a, an, the, and of, that usually add little value for analysis and are therefore often removed.
Here is the input:
##Remove NLTK stop words (reconstructed from the original figure)
import nltk
nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')
mytext_nostop = [word for word in mytext_tokens if word not in stop_words]
print(mytext_nostop)
In the previous example, as shown in Figure 15-20, NLTK removed the common stop
words from the text; however, the capitalized The was not removed because the NLTK stop
word list contains only lowercase entries (lowercasing the tokens first would handle this).
If you want to remove other common and frequently appearing words that do not add much
value to the sentence, you can create your own dictionary of such words and apply it to
the text before further processing.
15.8.6 Part-of-Speech Tagging
Part-of-speech (POS) tagging is a process where the algorithm reads the text and
assigns a part of speech to each word. By the end of this process, each word is annotated
with its POS tag, such as verb, noun, adverb, preposition, etc. POS tags help in parsing
text and resolving word-sense ambiguity. POS tagging is also used in identifying named
entities, coreference resolution, and speech recognition. It is impractical to manually
annotate each word in a corpus with its POS tag. Fortunately, many libraries and corpora
are available with POS tags already assigned to the words. The most popular POS-tagged
corpora are the following:
Figure 15-21 shows a sample POS tagset from the Penn Treebank corpus; for
example, tag JJ means it is an adjective, VB is for verb, and RB is for adverb.
The manual design of POS tags by human experts requires linguistic knowledge. It is
laborious to manually annotate and assign POS tags to each word and each sentence of
the text in a corpus. Instead, many computer-based algorithms are used, and one such
popular method is a sequence model using hidden Markov models (HMMs). Maximum
entropy models (MaxEnt) and sequential conditional random fields (CRFs) can also be
used to assign POS tags. We will not discuss the HMM, MaxEnt, or CRF POS tagging models
here. You can refer to the published papers by the inventors given in the reference
section to learn more about the models for POS tags.
As shown in Figure 15-22, we tag single words using the NLTK library and the Penn
Treebank POS tagset. The NLTK POS tagger has tagged both dog and beautiful as nouns.
Here’s the input:
##POS TAGGING
nltk.pos_tag(['dog'])
nltk.pos_tag(['beautiful'])
In the following example, as shown in Figure 15-23, we tag each word in a sentence
using the NLTK library.
Here’s the input:
mynew_text = 'Novak Djokovic lost 2021 US Open to Daniil Medvedev in the final'
[nltk.pos_tag([word]) for word in mynew_text.split(' ')]
The tagger parsed the sentence and tagged each word, for example, Novak as a noun
and Open as a verb. (Because each word is tagged in isolation here, the tagger cannot use
the surrounding context.)
You do not have to memorize the tags and their association. You can call help to
learn more about the meaning of different POS tags used by the library.
Here’s the input:
nltk.download('tagsets')   ##download the tagset documentation
nltk.help.upenn_tagset()
The next word in a sentence depends on the previous words. The next word could be a
verb or a noun, depending on its probability of association with the words that precede it.
We can build such models if we know the probability of each word and its occurrence in a
corpus. This approach in NLP is called the language model (LM). In an LM, the probabilities
of the words are estimated from a training set. Though linguists prefer grammar models, the
language model has given good results and has become the standard in NLP.
Or, using the bigram approximation:
P(hard | to find a true friend) ≈ P(hard | friend)
In summary, to predict the next word wn given the history w1, w2, w3, …, wn-1, we
have this:
Unigram model: P(wn)
Bigram model: P(wn | wn-1)
Trigram model: P(wn | wn-1, wn-2)
Language models have applications in speech recognition, part-of-speech
tagging, and optical character recognition. They are vital in spelling correction and
machine translation. Though this model is a good approximation for many NLP
applications, it is insufficient on its own because English, like any natural language, is
not simply a sequence of structured words. Language has what linguists call long-distance
dependencies. There are more complex models, such as trigram models and
probabilistic context-free grammars (CFGs).
Several language modeling toolkits are publicly available. The SRI Language
Modeling Toolkit and Google N-Gram model are popular.
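To make the idea concrete, here is a small sketch (not from the book) that estimates a bigram probability from a toy corpus using NLTK utilities:
from collections import Counter
from nltk import bigrams, word_tokenize

corpus = "it is hard to find a true friend . a true friend is a treasure"   # toy corpus
tokens = word_tokenize(corpus)
unigram_counts = Counter(tokens)
bigram_counts = Counter(bigrams(tokens))

# P(friend | true) = count(true, friend) / count(true)
p = bigram_counts[("true", "friend")] / unigram_counts["true"]
print(p)   # 1.0 in this toy corpus, because "true" is always followed by "friend"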
15.9.1 Bag-of-Words Modeling
The machine learning models work with numerical data rather than textual data. Using
the bag-of-words (BoW) techniques, we can convert text into equivalent fixed-length
vectors. The BoW model provides a mechanism for representing textual data as vectors.
The model relies on the count of actual terms present in a document. Each entry in the
vector corresponds to a term in the vocabulary, and the number in that particular entry
indicates the frequency of the term that appeared in the sentence under consideration.
Once the vocabulary is available, each sentence can be represented as a vector using the
BoW model. When the counts are reduced to simple presence (1) or absence (0) indicators, this representation is also popularly known as one-hot vectorization.
Assume that a document has two sentences, shown here:
Sentence 1: This is course on NLP.
Sentence 2: Welcome to NLP class. Let’s start learning NLP.
Using the BoW technique, we can represent both sentences as vectors after obtaining
the list of vocabulary words. In this document, the vocabulary size is 11. Figure 15-25
represents the vector representation of the two sentences in the document. Each entry
holds the count of the corresponding vocabulary word in the sentence, and 0 if the word
does not appear, as shown in Figure 15-26. For example, in sentence 2, NLP appears twice,
and hence the NLP entry has a count of 2, while the other words that appear have a count of 1.
Sentence 1 0 0 1 0 0 0 0 1 1 1 1
Sentence 2 1 1 2 1 1 1 1 0 0 0 0
document
#[[1 0 0 0 1 1 0 1 1 0 0]
# [0 1 0 0 2 0 1 1 1 0 1]
# [0 0 1 1 1 0 0 1 1 1 0]]
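The code that produced the commented matrix above appears only as a figure; a minimal sketch using scikit-learn's CountVectorizer, with an assumed three-sentence corpus, would look like this (the exact counts depend on the sentences used):
from sklearn.feature_extraction.text import CountVectorizer

document = ["This is course on NLP",
            "Welcome to NLP class. Let's start learning NLP",
            "NLP is fun to learn"]          # the third sentence is an assumption

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(document)
print(vectorizer.get_feature_names_out())   # the vocabulary
print(bow.toarray())                        # one count vector per sentence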
15.9.2 TF-IDF Vectors
As we noticed, the BoW model considers only the frequency of words within a document;
if a word is not present, it is ignored or given low weight. This type of vector
representation does not consider the importance of a word in the overall context. Some
words may occur in only one document but carry important information, whereas other
words that appear frequently in every document may not carry much information, and hence
patterns that could be found across similar documents may be lost. A weighting scheme that
addresses this is called TF-IDF (term frequency-inverse document frequency). The TF-IDF
vector representation mitigates such issues in the text corpus.
15.9.3 Term Frequency
Term frequency (TF) is how frequently a term occurs in a document. In a given
corpus, document sizes vary, and a term is more likely to appear more often in a bigger
document than in a smaller one. Normalizing the counts makes the values comparable
across documents. One common representation of TF therefore divides the count of the
term by the total number of terms in the document:
tf(w, d) = (count of word w in document d) / (total count of words in document d)   (1)
Thus, with this normalized representation, two documents of different lengths, "John
is driving faster" and "Jack is driving faster than John," can be compared on the same footing.
15.9.4 Inverse Document Frequency
Inverse document frequency (IDF) weights a term by how rare it is across the corpus:
terms that appear in only a few documents receive higher IDF values than terms that appear
in almost every document. In Figure 15-28, IDF values for different words are shown for the
documents in the corpus. In this example, there are four words (w1, w2, w3, and w4), and the
table shows the frequency of these terms appearing in different documents (d1, d2, d3, and d4).
As you can see from the example, even though w3 and w4 appear less frequently, they have
higher IDF values.
w1 w2 w3 w4
d1 16 10 2 9
d2 15 4 1 6
d3 17 6 1 5
d4 18 7 2 3
15.9.5 TF-IDF
By combining the two, i.e., TF and IDF, you obtain a composite weight called TF-IDF.
The TF-IDF weight of a word w in a document d is the product of its term frequency and
its inverse document frequency:
tf-idf(w, d) = tf(w, d) × idf(w)
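In practice you rarely compute these weights by hand; a minimal scikit-learn sketch (the three sentences are illustrative, not from the book's example) looks like this:
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["this is a course on NLP",
        "welcome to NLP class",
        "NLP techniques are evolving every day"]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)         # sparse matrix of TF-IDF weights
print(tfidf.get_feature_names_out())
print(weights.toarray().round(2))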
15.10 Text Classifications
In the case of library management, we want to classify library books into different
categories. In the case of email, we want to separate spam from legitimate messages.
Similarly, if you have a lot of customer feedback and reviews, you may want to identify
behavioral patterns and classify the reviews into different groups. While understanding the grammar
and semantics of a language is still a distant goal, researchers have taken a divide-and-
conquer approach. According to Manning (2009), sentences can be clearly grammatical
or ungrammatical. Most of the time words are used as a single part of speech, but
not always. Our discussions have shown how words in a language can be explained
using probability theory. Further, we have also shown how we can represent text in
a mathematical structure as vectors. Since we have learned how to convert words to
vectors and apply probability theory, can we use advanced machine learning techniques
to solve this problem?
Given a set of documents D = {d1, d2, d3, …, dn}, each with a label (category), we can
train a classifier using supervised machine learning. For any unlabeled document, the
trained classifier should be able to predict the class of the document. That is, given a new
document d and a fixed set of categories C = {c1, c2, …, cm}, the trained classifier should
predict a class c ∈ C for d, as shown in Figure 15-30.
This task in NLP is referred to as text classification. We can use any supervised
machine learning classification technique to achieve text classification. The most
commonly used classifiers are naïve Bayes, SVM, and neural nets. The naïve Bayes
classifier is a simple probabilistic classifier based on the Bayes theorem with an
assumption that each word in a sentence is independent of the others. It is one of the
most basic classification techniques used in many applications such as email spam
detection, email sorting, and sentiment analysis. Even though naïve Bayes is a simple
technique, research has shown it gives good performance in many complex real-world
problems. Although conditional independence does not hold in real-world situations,
naïve Bayes performs well.
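Before walking through the math, here is a minimal scikit-learn sketch of such a text classifier; the training documents, labels, and pipeline below are illustrative (loosely modeled on the movie-review example discussed later), not the book's own code:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = ["very good movie", "good movie", "bad movie", "very bad"]
train_labels = ["positive", "positive", "negative", "negative"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_docs, train_labels)
print(clf.predict(["bad movie"]))   # expected to come out as 'negative'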
The naïve Bayes (NB) classifier is derived based on the simple Bayes rule.
Classification is done based on the highest probability score calculated using the
following formula:
P(Ci | X1, X2, X3, …, Xp) = P(X1, X2, X3, …, Xp | Ci) P(Ci) / [P(X1, X2, …, Xp | C1) P(C1) + P(X1, X2, …, Xp | C2) P(C2) + … + P(X1, X2, …, Xp | Cm) P(Cm)]
Here, P(Ci) is the prior probability of belonging to class Ci in the absence of any other
attributes, and P(Ci | X1, X2, …, Xp) is the posterior probability that a record with attribute
values X1, X2, …, Xp belongs to class Ci. To classify a record using Bayes' theorem, first
compute its probability of belonging to each class Ci. Naïve Bayes assumes that the predictor
attributes are all mutually independent within each class; for text, this means that the order
of the words does not matter.
Applying the Bayes rule to classify a document d into a class c, the probability of
document d being in class c is computed as follows:
P(c | d) ∝ P(c) × ∏ P(w | c), where the product runs over the words w in document d
Here, Document d ∈ D, where D denotes the training document set. Document d can
be represented as a bag of words (the order and position of words does not matter). Each
word w ∈ d comes from a set W of all the feature words (vocabulary set).
To classify a new document, the product of the probability of each word of the
document given a particular class (likelihood) is estimated. Then, this product is
multiplied by the probability of each class (prior). Probabilities are calculated using the
previous equation for each class. The one with the highest probability decides the final
class of the new document.
When the previous calculations are done on a computer, multiplying many small
probabilities may lead to floating-point underflow. It is, therefore, better to perform the
computation by adding the logarithms of the probabilities. The class with the highest log
probability is still the most probable class.
Table 15-1 illustrates an example of sample documents containing documents with
known and unknown classes. The training set contains four documents with a known
class and one document with an unknown class.
We will illustrate how to find the class of unknown document using the naive Bayes
algorithm.
Step 1: Calculate the prior probability of each class, P(c):
P(positive) = Npos / (Npos + Nneg) = 3/(3+2) = 3/5 = 0.6 = P(c1)
P(negative) = Nneg / (Npos + Nneg) = 2/(3+2) = 2/5 = 0.4 = P(c2)
Step 2: Using add-one (Laplace) smoothing, calculate the probability of each word,
as shown here:
P(very|c1) = (count(very, positive) + 1) / (count of words in positive documents + |vocab|)
P(very|c1) = (1+1) / (7+9) = 2/16 = 0.125
(The positive documents contain a total of seven words, and the vocabulary has nine distinct words.)
Step 3: Similarly, calculate the probability of other words.
P(good|positive) = (1+1)/(7+9) = 2/16 = 0.125
P(movie|positive) = (2+1)/(7+9) = 3/16 = 0.1875
P(bad|positive) = (0+1)/(7+9) = 1/16 = 0.0625
P(very|negative) = (count(very, negative) + 1) / (count of words in negative documents + |vocab|)
P(very|negative) = (0+1) / (4+9) = 1/13 = 0.0769
P(movie|negative) = (2+1) / (4+9) = 3/13 = 0.2308
P(bad|negative) = (1+1) / (4+9) = 2/13 = 0.1538
Step 4: Let’s classify document d6 applying naïve Bayes.
Document d6 = “bad movie”.
We first calculate the probability of each word for the positive class and then for the
negative class.
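Carrying these numbers through (a quick completion of the calculation; the book shows it as a figure):
P(positive | d6) ∝ P(positive) × P(bad | positive) × P(movie | positive) = 0.6 × 0.0625 × 0.1875 ≈ 0.0070
P(negative | d6) ∝ P(negative) × P(bad | negative) × P(movie | negative) = 0.4 × 0.1538 × 0.2308 ≈ 0.0142
Because 0.0142 > 0.0070, naïve Bayes assigns document d6 to the negative class.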
15.11 Word2vec Models
Languages are ambiguous. For example, small and tiny are similar words but can be used
in different contexts. But, these ambiguities make it hard for computers to understand
and interpret, thus making NLP tasks complex.
In our earlier sections, we discussed representing words as vectors. We started with
one-hot vector representation, and then we looked at TF, IDF, and TF-IDF. Though these
representations are a good start for solving many problems, they take away some of the
information that appears within a sentence. For example:
Both sentences have similar meanings except for the words cycling and bicycle.
Humans can interpret them easily, whereas for a computer to find the similarity, we
have to represent the sentences as vectors. We can represent both words using one-hot
representation: Cycle = [ 0 0 1 0 0 0 0 0 ] and Bicycle = [ 0 0 0 0 0 0 0 1]. Both word vectors
are orthogonal to each other, so a similarity measure such as cosine similarity or the
Jaccard similarity measure returns 0; i.e., no similarity is detected. This is the problem.
In 2013, Tomas Mikolov et al. developed an algorithm called Word2vec, a new
model that minimizes computational complexity. This model uses a neural network to
learn word vectors from a given corpus. The resulting vectors seem to have semantic
and syntactic meaning. Empirical results have shown that this method gives good results
on word similarity and analogy tasks. Word2vec comes in two model architectures:
• Continuous bag-of-words (CBOW)
• Skip-gram
For example, given a window of five words wt-2, wt-1, wt, wt+1, wt+2, the word wt is the
center (target) word, and wt-2, wt-1, wt+1, wt+2 are the surrounding (context) words.
In CBOW, the model predicts the center word wt from the context words, whereas in
Skip-gram, the model predicts the surrounding words wt-2, wt-1, wt+1, wt+2 from wt.
Research and studies have shown that Skip-gram tends to produce better word
embeddings than CBOW. Figure 15-31 shows the
architecture of both CBOW and Skip-gram. If you want to learn the architecture and
algorithms, we suggest reading the paper “Distributed representations of words and
phrases and their compositionality.”
Figure 15-31. Word2vec embedding architecture of CBOW and Skip-gram
The output of the Word2vec algorithm is a |v| × d matrix, where |v| is the size of
the vocabulary, and d is the number of dimensions used to represent each word
vector. Google has already trained the Word2vec model on the Google News data set.
Google Word2vec has a vocabulary size of 3 million words, with each vector having a
dimension of 300.
This model can be downloaded from Google at https://code.google.com/
archive/p/word2vec/. The download size is around 1.5 GB. Also, there are many
libraries that have implemented Word2vec. We will use the Gensim package in this
example. Python’s Gensim library provides various methods to use the pretrained model
directly. The Gensim library function also allows creating a Word2vec model from
scratch based on your corpus.
Step 1: Install Gensim in your environment.
Step 2: Load the Google pretrained Word2vec model file.
Here is the input:
import gensim
from gensim.models import KeyedVectors
#Load the pretrained vectors from the pretrained Word2vec model file
model=KeyedVectors.load_word2vec_format('C:/Users/phdst/gensim-data/word2vec-goo
gle-news-300/word2vec-google-news-300.gz', binary=True, limit=300000)
model['girl']
model['Boston']
Figure 15-33 and Figure 15-34 show both the input and output.
model.most_similar('excellent')
model.most_similar(['fraud','cheat'])
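You can also ask for the similarity score between a specific pair of words; a small sketch using the pretrained model loaded above:
print(model.similarity('fraud', 'cheat'))
print(model.similarity('excellent', 'terrible'))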
Step 5: Get the Word2vec representation for your own corpus.
In this example, we have a corpus with three documents. We limit the vector size to 100
and the minimum frequency of words to 1. Once we train the Word2vec model on this corpus,
every word in the corpus has its own vector representation, which we can then inspect.
Here is the input:
#Train a Word2vec model on your own corpus (reconstructed sketch; the three-sentence corpus is assumed)
from gensim.models import Word2Vec
my_corpus = [doc.split() for doc in document]            # tokenize each of the three documents
w2v_model = Word2Vec(my_corpus, vector_size=100, min_count=1)
w2v_model.wv['NLP']                                      # vector for a word in the corpus
On the other hand, NLP is related to making computers understand the text and
spoken words for effective communication between humans and computers. NLP
combines linguistics with statistical machine learning and deep learning models to
enable the computers to process language and understand its meaning, sentiment,
and intent.
The following example demonstrates how to visualize the words that occur most
frequently using the wordcloud library. We create our corpus (my_text) and plot the
words that frequently occur in the text. In our text, the word language appears more
frequently than the other words; hence, it is displayed in a larger font, as shown in
Figure 15-37.
Here is the input:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

my_text = 'We are learning Natural Language Processing in this class. Natural Language Processing is all about how to make computers to understand language. Natural Language Processing techniques are evolving everyday.'
myword_cloud = WordCloud().generate(my_text)
# Display the generated image:
plt.imshow(myword_cloud)
plt.axis("off")
plt.show()
15.15 Chapter Summary
In this chapter, we discussed what NLP is and what text analytics are and how they are
interrelated. We also discussed the various applications of NLP.
We talked about how a computer can process the natural language and the basics
of language processing, including tokenization, stemming, lemmatization, POS
tagging, etc.
We also discussed language modeling, representing words as vectors, term
frequency, and inverse document frequency (TF-IDF), Word2vec, and other models.
We demonstrated the basic concepts with examples using various libraries and also
mentioned various libraries for NLP in R and Python.
We ended the chapter by discussing how deep learning can be applied to solve some
of the NLP problems and applications.
CHAPTER 16
Big Data Analytics and Future Trends
16.1 Introduction
Data is power and is going to be another dimension of value in any enterprise. Data
is and will continue to be the major decision driver going forward. All organizations
and institutions have woken up to the value of data and are trying to collate data from
various sources and mine it for its value. Businesses are trying to understand consumer/
market behavior in order to get the maximum out of each consumer with the minimum
effort possible. Fortunately, these organizations and institutions have been supported
by the evolution of technology in the form of increasing storage power and computing
power, as well as the power of the cloud, to provide infrastructure as well as tools. This
has driven the growth of data analytical fields, including descriptive analytics, predictive
analytics, machine learning, deep learning, artificial intelligence, and the Internet
of Things.
This chapter does not delve into the various definitions of Big Data. Many pundits
have offered many divergent definitions of Big Data, confusing people more than
clarifying the issue. However, in general terms, Big Data means a huge amount of data
that cannot be easily understood or analyzed manually or with limited computing power
or limited computer resources. Analyzing Big Data requires the capability to crunch data
of a diverse nature (from structured data to unstructured data) from various sources
(such as social media, structured databases, unstructured databases, and the Internet of
Things).
In general, when people refer to Big Data, they are referring to data with three
characteristics: variety, volume, and velocity, as shown in Figure 16-1.
Variety refers to the different types of data that are available on the Web, the Internet,
in various databases, etc. This data can be structured or unstructured and can be from
various social media and/or from other sources. Volume refers to the size of the data that
is available for you to process. Its size is big—terabytes and petabytes. Velocity refers to
how fast you can process and analyze data, determine its meaning, arrive at the models,
and use the models that can help business.
The following have aided the effective use of Big Data:
The Hadoop Distributed File System (HDFS) allows data to be distributed and stored
among many computers. Further, it allows the use of the increased processing power
and memory of multiple clustered systems. This has overcome the obstacle of not being
able to store huge amounts of data in a single system and not being able to analyze
that data because of a lack of required processing power and memory. The Hadoop
ecosystem consists of modules that enable us to process the big data and perform the
analysis.
A user application can submit a job to Hadoop. Once data is loaded onto the
system, it is divided into multiple blocks, typically 64 MB or 128 MB. Then the Hadoop
Job client submits the job to the JobTracker. The JobTracker distributes and schedules
the individual tasks to different machines in a distributed system; many machines are
clustered together to form one entity. The tasks are divided into two phases: Map tasks
are done on small portions of data where the data is stored, and Reduce tasks combine
data to produce the final output. The TaskTrackers on different nodes execute the tasks
as per MapReduce implementation, and the reduce function is stored in the output
files on the file system. The entire process is controlled by various smaller tasks and
functions. Figure 16-3 shows the full Hadoop ecosystem and framework.
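To make the Map and Reduce phases concrete, here is a small local simulation of the classic word-count pattern in Python (a sketch only, not an actual Hadoop job):
from collections import defaultdict

def map_phase(text):
    # Map: emit a (word, 1) pair for every word in the input split
    return [(word, 1) for word in text.lower().split()]

def reduce_phase(pairs):
    # Reduce: sum the counts for each key
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

pairs = map_phase("big data needs big storage and big compute")
print(reduce_phase(pairs))   # {'big': 3, 'data': 1, ...}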
In addition to these tools, NoSQL (originally referring to not only SQL) databases
such as Cassandra, ArangoDB, MarkLogic, OrientDB, Apache Giraph, MongoDB, and
Dynamo have supported or complemented the Big Data ecosystem significantly. These
NoSQL databases can store and analyze multidimensional, structured or unstructured,
huge data effectively. This has provided significant fillip to the consolidation and
integration of data from diverse sources for analysis.
Currently Apache Spark is gaining momentum in usage. Apache Spark is a fast,
general engine for Big Data processing, with built-in modules for streaming, SQL,
machine learning, and graph processing. Apache Spark, an interesting development in
recent years, provides an extremely fast engine for data processing and analysis. It allows
an easy interface to applications written in R, Java, Python, or Scala. Apache Spark has a
stack of libraries such as MLlib, Spark Streaming, Spark SQL, and GraphX. It can run in
stand-alone mode as well as on Hadoop. Similarly, it can access various sources from
HBase to Cassandra to HDFS. Many users and organizations have shown interest in this
tool and have started using it, resulting in it becoming very popular in a short period of
time. This tool provides significant hope to organizations and users of Big Data.
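As a flavor of Spark's Python interface, here is a minimal PySpark sketch (it assumes a local Spark installation and an illustrative CSV file name):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("QuickLook").getOrCreate()
df = spark.read.csv("sales_data.csv", header=True, inferSchema=True)   # hypothetical file
df.groupBy("Product").count().show()
spark.stop()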
Tools such as Microsoft Business Intelligence and Tableau provide dashboards and
the visualization of data. These have enabled organizations to learn from the data and
leverage this learning to formulate strategies or improve the way they conduct their
operations or their processes.
The following are some of the advantages of using Hadoop for Big Data processing:
Microsoft Azure, Amazon, and Cloudera are some of the big providers of cloud
facilities and services for effective Big Data analysis.
16.3.4 Prescriptive Analytics
Data analysis is no longer focused only on understanding the patterns or value
hidden in the data. The future trend is to prescribe the actions to be taken,
based on the past and depending on present circumstances, without the need
for human intervention. This is going to be of immense value in fields such as
healthcare, aeronautics, and automotive.
16.3.5 Internet of Things
The Internet of Things is a driving force for the future. It has the capability to bring data from
diverse sources such as home appliances, industrial machines, weather equipment, and
sensors from self-driving vehicles or even people. This has the potential to create a huge
amount of data that can be analyzed and used to provide proactive solutions to potential
future and current problems. This can also lead to significant innovations and improvements.
16.3.6 Artificial Intelligence
Neural networks can drive artificial intelligence, in particular by enabling models to learn
from huge amounts of data without human intervention, specific programming, or the need for
handcrafted models. Deep learning is one such area that is acting as a driver in the field of
Big Data. This may throw up many of the “whats” that we are not aware of. We may not
understand the “whys” for some of them, but the “whats” of those may be very useful.
Hence, we may move away from the perspective of always looking for cause and effect.
The speed at which the machine learning field is being developed and used drives
significant emphasis in this area. Further, natural language processing (NLP) and
property graphs (PGs) are also likely to drive new application design and development,
putting the capabilities of these technologies in the hands of organizations and users.
16.3.9 Real-Time Analytics
Organizations are hungry to understand the opportunities available to them. They want
to understand in real time what is happening—for example, what a particular person is
purchasing or what a person is planning for—and use the opportunity appropriately to
offer the best possible solutions or discounts or cross-sell related products or services.
Organizations are no longer satisfied with a delayed analysis of data, which results in
missed business opportunities because they were not aware of what was happening in
real time.
tool easily implemented or migrated to other tools. Efforts in Predictive Model Markup
Language (PMML) are already moving in this direction. The Data Mining Group (DMG)
is working on this.
16.6 Cloud Analytics
Both data and tools in the cloud have provided a significant boost to Big Data analysis.
More and more organizations and institutions are using cloud facilities for their data
storage and analysis. Organizations are moving from the private cloud to the public or
hybrid cloud for data storage and data analysis. This has provided the organization with
cost-effectiveness, scalability, and availability. Of course, security may be a concern, and
significant efforts to increase security in the cloud are in progress.
16.7 In-Database Analytics
In-database analytics have increased security and reduced privacy concerns, in part
by addressing governance. Organizations, if required, can do away with intermediate
requirements for data analysis such as data warehouses. Organizations that are more
conscious about governance and security concerns will provide significant fillips to in-
database analytics. Lots of vendor organizations have already made their presence felt in
this space.
16.8 In-Memory Analytics
The value of in-memory analytics is driving transactional processing and analytical
processing side-by-side. This may be very helpful in fields where immediate intervention
based on the results of analysis is essential. Systems with hybrid transactional/analytical
processing (HTAP) are already being used by some organizations. However, using HTAP
for its own sake may not add much value; when the rate of data change is slow and you
still need to bring in data from various diverse systems to carry out effective analysis,
it may be overkill, leading to higher costs to the organization.
provide real-time market insights, fraud detection and prevention, risk analysis, better
financial advice to customers, improved portfolio analysis, and many more applications.
In manufacturing, the data generated by machinery and processes can be used for
predictive maintenance, quality control, process optimization, machinery downtime
optimization, anomaly detection, production forecasting, product lifecycle management
(PLM), etc.
Big volumes of data are being generated every day by people who use social
media. Many researchers are interested in analyzing such data and finding out useful
patterns. Though the data can be useful from a business perspective, it can provide
many perspectives on different aspects of society. It can also lead society on a negative
path because of malicious intentions by some. This discussion is beyond the scope of
this book.
16.12 Chapter Summary
In this chapter, we briefly introduced Big Data and Big Data analytics. We also discussed
various terms and technologies to manage Big Data and analyze it.
We also touched upon some of the challenges of AI, machine learning, and the
future trends.
We hope that this introduction to Big Data helps you to explore these and related
areas based on your needs and interest.
PART V
R for Analytics
CHAPTER 17
R for Analytics
This chapter introduces the R tool. We'll discuss the fundamentals of R required to
perform business analytics, data mining, and machine learning. This chapter provides
enough basics to start R programming for data analysis. This chapter also introduces the
data types, variables, and data manipulations in R and explores some of the packages of
R and how they can be used for data analytics.
Several free books are available to learn R. In this section, instead of focusing on R
basics, we demonstrate some of the operations performed on a data set with an example.
The examples include checking for nulls and NAs, handling missing values, cleaning and
exploring data, plotting various graphs, and using some of the loop functions, including
apply(), cut(), and paste(). These examples are intended only as a reference in case you
need them; we do not claim they are the only or the best solutions.
All variables in R are represented as vectors. The following are the basic data types in R:
We use the Summer Olympics medals data set, which lists gold, silver, and bronze
medals won by different countries from 1896 to 2008. We have added some dummy
countries with some dummy values to demonstrate some of the concepts.
1. Read the data set and create sample variables. The dplyr
package supports a function called sample_frac(), which takes a
random sample from the data set.
dplyr is a library that helps you to address some of the common data
manipulation challenges. It supports several functions. Here are some examples:
• mutate() adds new variables that are functions of existing variables.
• select() picks variables based on their names.
• filter() picks cases based on their values.
• summarise() reduces multiple values to a single summary.
• arrange() changes the ordering of the rows.
2. Check the size of the data, the number of columns, and the rows (records).
The code and the corresponding output are provided below:
> ncol(medal_df)
[1] 6
> str(sample_data)
'data.frame': 126 obs. of 6 variables:
$ Country : Factor w/ 140 levels "Afghanistan",..: 13 90 10 70 22 26 61
103 114 112 ...
$ NOC.CODE: Factor w/ 128 levels "","A","AFG","ALG",..: 14 86 11 68 21 2
54 99 104 36 ...
$ Total : int 1 10 1 17 385 NA 23 1 1 113 ...
$ Golds : int 0 3 0 2 163 1 8 0 0 34 ...
$ Silvers : Factor w/ 51 levels "0","1","11","112",..: 1 25 1 3 5 51 39 2
2 33 ...
$ Bronzes : int 1 4 1 4 105 NA 8 0 0 30 ...
Select only a subset of variables from the dataframe. This is often
required in data processing tasks. In the following code, the first
case selects the two columns required, and the second case selects
all the columns except the two columns mentioned. We use the
head() function to display the first six rows and tail() to display
the last six rows of the dataframe. The code and the corresponding
outputs are provided below:
> sample_data$Silvers<-as.integer(sample_data$Silvers)
> str(sample_data)
Note str() gives the summary of the object in active memory, and head()
enables you to view the first few (six) lines of data.
Total Golds
1 1 0
2 10 3
3 1 0
4 17 2
5 385 163
6 NA 1
> X4 = select(sample_data, -NOC.CODE, - Total)
> head(X4)
6. Often, we have to get specific column names from the data. Here
is an example, demonstrated using the code and the corresponding
outputs:
17.2.2 Apply() Functions in R
We will check the proportion of NAs in each column using the apply() function.
Although the for() and while() loops are useful programming tools, curly brackets
and structuring functions can sometimes be cumbersome, especially when dealing with
large data sets. R has some cool functions that implement loops in a compact form to
make data analysis simpler and more effective. R supports the following functions, which
we’ll look at from a data analysis perspective:
The apply() function is simpler in form, and its code can be very few lines (actually,
one line) while helping to perform effective operations. The other, more complex forms
are lapply(), sapply(), vapply(), mapply(), rapply(), and tapply(). Using these
functions depends on the structure of the data and the format of the output you need to
help your analysis.
The following example demonstrates the proportion of NAs in each column and the
row number of the presence of NA. In both cases, we are using the apply() function.
The first apply() function checks the NAs in each column and provides the proportion
of NAs. In the second case, it is providing the row numbers that have NAs. The code and
the corresponding output are provided below:
na.omit() removes all the records that have NAs in the data set. The code and the
corresponding output are provided below:
3 Barbados BAR 1.0000
4 Latvia LAT 17.0000
5 China CHN 385.0000
6 Countr B A 100.7833
17.2.3 lapply()
The lapply() function outputs the results as a list. lapply() can be applied to a list,
dataframe, or vector. The output is always a list with the same number of elements as the
object passed.
The following example demonstrates the lapply() function. We use lapply() to
find out the mean of the total, gold, and bronze medals. The code and the corresponding
output are provided below:
$Golds
[1] 33.08824
$Bronzes
[1] 35.06618
17.2.4 sapply()
The difference between sapply() and lapply() is the output result. The result of
sapply() is a vector, whereas the result for lapply() is a list. You can use the appropriate
functions depending on the kind of data analysis you are doing and the result format you
need. The following example demonstrates the use of sapply() for the same Summer
Olympics Medals data set example through the code and the corresponding output:
> ##SAPPLY()
> sapply(medal_df['Golds'],mean, na.rm = TRUE)
Golds
33.08824
> sapply(select(medal_df, -Country,-NOC.CODE,-Silvers),mean, na.rm = TRUE)
Total Golds Bronzes
102.47015 33.08824 35.06618
17.4 split()
The split() function divides the data set into groups based on factor levels passed as an
argument. It takes two arguments, (x, f): x is the dataframe, and f is the factor that
defines the groups. We
will divide the data set into three “bins” we have created. The split() command divides
the data set into three groups as shown through the code and corresponding output
provided here:
The unsplit() function reverses the split() results. The code and the
corresponding output are provided below:
In the following example, myfunc_count() prints only the countries that have more than
25 gold medals. The function takes two arguments: the first argument is the dataframe,
and the second argument is the number. Once the function is defined, you call it by
passing the two arguments as shown; here we pass the medal dataframe and 25. The
output prints all the rows whose gold medal count (a column in the data set) is greater
than 25. The code and the corresponding output are provided below:
17.6 Chapter Summary
In this chapter, we covered converting a data type, finding missing values and NAs,
removing the rows or columns with NAs, generating a sample dataset, splitting the
dataset, removing the duplicate records from the dataset, filtering the dataset based on
the condition specified, etc.
We also covered various built-in looping apply() functions. We also covered writing
your own functions.
We have only covered the basics of R required for the basic analytics problems that
you will come across. This should be used only as a quick-reference guide. If you want to
learn more complex functions and R programming, you may refer to an R book or the R
documentation.
CHAPTER 18
Python Programming for Analytics
This chapter introduces Python. We discuss the fundamentals of Python that are
required to perform business analytics, data mining, and machine learning. This chapter
provides enough basics to start using Python programming for data analysis. We will
explore the pandas and NumPy tools using Jupyter Notebook so you can perform basic
data analytics to create models. We will be discussing the pandas DataFrame, NumPy
arrays, data manipulation, data types, missing values, data slicing and dicing, as well as
data visualization.
18.1 Introduction
Python has been around since the 1990s. It gained popularity as an interactive
programming language and is easy to learn. It is known for its concise, modular,
and simplistic approach without the need for any complicated code. Just like other
programming languages such as C, C++, and Java, Python is a structured, procedural,
functional, object-oriented language. It supports statements, control flows, arithmetic
operations, scientific operations, and regular expressions. All these features have made
Python a popular language for data analytics and machine learning. Several open-
source libraries are available to support data analytics and machine learning including
scikit-learn, TensorFlow, Keras, H2O, PyTorch, and many more. Scikit-learn is an open-
source machine learning library that supports supervised and unsupervised learning.
It also provides various tools for data wrangling, data preprocessing, model selection,
model evaluation, and many other utilities. As you may recall, many of our regression
and classification learning models were developed using scikit-learn libraries. Similarly,
pandas and NumPy are the two most commonly used libraries for manipulating data
and performing analysis on data.
Jupyter Notebook is the original web interface for writing Python code for data
science. The notebook also can be shared with others as a web document. It offers a
simple, web-based, interactive experience. JupyterLab is the latest development that
provides an interactive development environment for coding. It provides a flexible
interface to configure and arrange workflows in data science, scientific computing, and
development of machine learning models.
This chapter assumes you are familiar with the basic programming of Python. We
will not cover Python programming. We focus only on the pandas and NumPy libraries
and how to use them for data analysis. All our work is done using Jupyter Notebook.
We want users to write and practice the code rather than copying from the text. The full
notebook is available on GitHub for download.
• Iteration over the data set (called DataFrame in Pandas) for the
purpose of data analytics
• Feature engineering, merging features, deleting columns, separating
features, etc.
The following examples demonstrate the techniques using pandas for data analytics
that we have mentioned. We encourage you to type each line of the code and practice. If
you need the full Jupyter Notebook, then you can download it from GitHub.
The first step is to import the pandas library and create a pandas DataFrame. sys and
os are Python system libraries.
Here is the input code:
import pandas as pd
import os
import sys
There will not be any output shown. These packages will be imported into the
memory and will be available for further use.
Each column in a pandas DataFrame is a Series. The pandas DataFrame is similar to
a spreadsheet, database, or table. Basically, you have rows and columns.
Here is the input code to create a DataFrame and print out the first few records of
the data:
my_df = pd.DataFrame([2000, 4000, 6000, 1000])   # illustrative variable name; values follow the original example
my_df.head()
# Typical DataFrame inspection calls (the exact calls appear only as a figure in the original)
my_df.describe()
my_df.head()
my_df.tail()
my_df.info()
my_df.count()
Here is the output along with the input code provided above:
In our next example, we will read four CSV files, stores.csv, products.csv,
salesperson.csv, and totalsales.csv, which contain information about products,
salespeople, sales, and units sold. Each file has multidimensional data. We read these
data set files as a pandas DataFrame and demonstrate performing the data analysis.
We store each data file as a dataframe. For example, we store stores.csv in a stores
dataframe, products.csv in a products dataframe, and so on.
# Set the working directory where the CSV files live, then read each file into a dataframe
# (the directory path is elided here; substitute your own)
mypath = '...'
os.chdir(mypath)
os.getcwd()
stores = pd.read_csv('stores.csv')
products = pd.read_csv('products.csv')
salesperson = pd.read_csv('salesperson.csv')
sales = pd.read_csv('totalsales.csv')
Here is the output along with the input code shown above:
Reading a dataframe using the head() or tail() function is shown next. Similar
to R, Python has a function to get a glimpse of the data using the head() or tail()
function. The function head() provides the top five rows (default), and tail() provides
the bottom five rows of the data. You can also specify more rows to be displayed by
specifying the number within the parentheses. For example, head(10) displays 10 rows.
Here is the input:
sales.head(10)
Here is the output along with the input code shown above:
If you have limited data, we can print the whole dataframe, as shown here:
dtypes provides the data types of the variables in the data set, and info provides
more information about the data set, including null and non-null counts as shown,
which is useful for further analysis. shape provides the size of the data such as the
number of rows × the number of columns. In this example, the sales dataframe has
396 rows and 5 columns of data.
Here is the input:
sales.dtypes
sales.info()   # includes null and non-null counts
sales.shape
Here is the output along with the input code shown above:
sales['Product'].head(10)
sales.Product.unique()   # reconstructed; the exact call in the original figure may differ
Here is the output along with the input code shown above:
If we want to access multiple columns, we can slice the dataframe as shown next.
In this example, we are slicing the dataframe to access three columns out of five (Date,
Product, and Salesperson) from the sales dataframe.
Here is the input:
sales[['Date', 'Product', 'Salesperson']].head()
Here is the output along with the input code shown above:
In the following example, we slice the dataframe to display only specific rows of the
dataframe.
Here is the input (the row range is illustrative):
sales[5:10]
Here is the output along with the input code shown above:
The next set of functions demonstrates data analysis on the pandas DataFrame. For
example, search for the sales data of Units sold greater than 4 units.
Here is the input:
sales[sales['Units'] > 4.0]   # 'Units' is the units-sold column
Here is the output along with the input code shown above:
Similarly, in the following example we view sales data of product AR400 sold greater
than or equal to 5. This data slicing and dicing is similar to extracting data from a
database using SQL.
Here is the input (a sketch; the Product and Units column names are assumed):
sales[(sales['Product'] == 'AR400') & (sales['Units'] >= 5)]
Here is the output along with the input code shown above:
df.loc and df.iloc provide label-based and position-based data slicing, respectively. df.loc()
accesses a group of rows and columns by label or a Boolean array, whereas df.iloc()
uses integer location-based indexing for selection by position.
sales.iloc[[10, 12]]
sales.iloc[[10]]
sales.iloc[:3]
Here is the output along with the input code shown above:
In the case of df.loc(), you can use a row index to slice the data. It can be a number
or label. In our case, rows are numbered, so we can use numbers. In the first example,
we try to get the first record, and in the second example, we try to get the first and fifth
records. In the third example we fetch the second and the third records.
sales.loc[1]
sales.loc[[1, 5]]
sales.loc[2:3]
Here is the output along with the input code shown above:
# Common summary statistics on the sales data (reconstructed; column names assumed)
sales['Units'].mean()
sales['Units'].median()
sales['Units'].max()
sales['Units'].min()
sales.Units.describe()
sales['Product'].value_counts()
sales['Salesperson'].value_counts()
Here is the output along with the input code shown above:
sales.groupby("Product")['Units'].sum()   # reconstructed; groups the sales by product
Here is the output along with the input code shown above:
sales.isnull().sum()
sales.isna().sum()
sales['Units'].isna().sum()
Here is the output along with the input code shown above:
The previous isna() did not give the location of the NAs. To find out the location of
the NAs, meaning specific rows and columns, you have to slice the data further, as shown
in the next set of functions. The following function provides the location of the NAs in
the dataframe.
Here is the input:
sales[sales['Units'].isna()]
sales[sales['Salesperson'].isnull()]   # column names assumed
Here is the output along with the input code shown above:
Once you identify the NAs and NULL values in the data, it is up to you how you want
to substitute these NAs. We have shown a simple example here, but you can extend it to
any technique of assigning values to such locations with NAs. In the following example,
we have just used the number 3.0 to fill the NAs. However, you can substitute the NAs
with the mean, mode, or median values. You can use the same fillna() function to
impute the mean, the mode, or any other value.
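For instance, imputing the column mean instead of a constant might look like this (a sketch, assuming a Units column):
sales['Units'] = sales['Units'].fillna(sales['Units'].mean())   # fill NAs with the column mean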
Here is the input:
sales = sales.fillna(3.0)   # fill every NA with 3.0
Here is the output along with the input code shown above:
sales['Units'] = sales['Units'].astype(int)   # reconstructed; converts the imputed column to integers
##check the data types after the conversion
sales.dtypes
Here is the output along with the input code shown above:
sales.dtypes
sales['Date'] = pd.to_datetime(sales['Date'])
sales.dtypes
###extract the year, month, and day into separate columns
sales['Year'] = sales['Date'].dt.year
sales['Month'] = sales['Date'].dt.month
sales['Day'] = sales['Date'].dt.day
sales.head()
Here is the output along with the input code shown above:
18.2.7 Feature Engineering
The pandas DataFrame supports splitting columns, merging columns, etc. The following
section demonstrates how two columns can be combined. In this example, the first
name and last name columns are combined. There are a couple of ways to do this, and
we show both. The agg() function aggregates using one or more operations over the
specified axis.
Here is the input:
#salesperson["Name"] = salesperson[["First Name", "Last Name"]].agg(" ".join, axis=1)   # reconstructed; column names assumed
Here is the output along with the input code shown above:
Here is the output along with the input code shown above:
sales['Units'].plot.box()
sales["Product"].value_counts().plot.bar()
Here is the output along with the input code shown above (a box plot and a
bar chart):
import numpy as np

a = np.arange(15).reshape(3, 5)
print(a)
a.shape
a.ndim
a.dtype.name
a.size
type(a)
Here is the output along with the input code shown above:
b = np.array([(1.2, 3.4, 5.2), (7, 8, 10)])
b.dtype
Here is the output along with the input code shown above:
c = np.zeros((3, 4))    # variable names here are illustrative
d = np.ones((3, 4))
e = np.empty((2, 4))
Here is the output along with the input code shown above:
Many times, arrays need reshaping for vector, matrix, and algebraic operations. We
can perform such operations using the reshape() function of NumPy. The np.reshape()
function gives a new shape to an array without changing its data. The following example
demonstrates the different sizes of the matrix using the np.reshape() function.
Here is the input:
a = np.arange(12)
print(a.reshape(3, 4))
a.reshape(4, 3)
a.reshape(6, 2)
a.reshape(12, 1)
Here is the output along with the input code shown above:
b = np.array([[10, 11, 12], [14, 15, 16]])
b.shape
c = np.array([(1.2, 3.4, 5.2), (7, 8, 10)])
c.shape
b + c
b - c
c - b
b * c
Here is the output along with the above lines of code (Please Note: Order of code
lines here are a little different from the above):
You can also perform mathematical and statistical analysis on the NumPy arrays.
This includes finding the square root, the mean, the exponential, and so on. The
following example demonstrates some of these operations: we add two NumPy arrays
and then compute the square, square root, exponential, and rounded values of the
array elements.
Here is the input:
np.add(b,c)
np.square(c)
np.sqrt(c)
np.exp(c)
np.round_(c)
Here is the output along with the code input provided above (Please Note: order of
the code shown here differs a little from the above):
a = np.random.random_sample((5, 2))
b = np.random.random_sample((3, 4))
a.mean()
a.std()
Here is the output along with the above input code lines:
The standard deviation is computed along the axis. The default is to compute the
standard deviation of the flattened array (axis 0 is a column and 1 is a row).
Here is the input:
a.std(axis=0)
a.std(axis=1)
Here is the output along with the input code shown above:
a[2]
a[4]
a[4][0]
a[0][0]
a[0:1]
a[0:1] = 10
You can also change a particular value of an array element by assigning values
as shown.
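The assignment itself is ordinary NumPy indexing; for example:
a[0][0] = 99    # overwrite a single element
print(a)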
18.4 Chapter Summary
In this chapter we discussed the fundamentals of Python for performing business
analytics, data mining, and machine learning using pandas and NumPy.
We explored data manipulation, data types, missing values, data slicing and dicing,
as well as data visualization.
We also discussed the apply() looping functions, statistical analysis on dataframes,
and NumPy arrays.
We have only covered the parts of pandas and NumPy required for basic analytics
problems that arise. Therefore, this chapter should be used only as a quick-reference
guide. If you want to learn more complex functions and Python programming, you may
refer to a Python book or the Python, pandas, or NumPy documentation.
References
1. BAESENS, BART. (2014). Analytics in a Big Data World, The
Essential Guide to Data Science and Its Applications. Wiley India
Pvt. Ltd.
2. MAYER-SCHONBERGER, VIKTOR & CUKIER KENNETH. (2013).
Big Data, A Revolution That Will Transform How We Live, Work
and Think. John Murray (Publishers), Great Britain
10. ZUMEL, NINA & MOUNT, JOHN. (2014). Practical Data Science
with R. Dreamtech Press, New Delhi
11. KABACOFF, ROBERT.I. (2015). R In Action – Data analysis and
graphics with R. Dreamtech Press, New Delhi
44. Berson, Alex, Smith, Stephen & Thearling, Kurt. (2000). Building
Data Mining Applications for CRM. McGraw-Hill.
45. Berson, Alex, Smith, Stephen & Thearling, Kurt. (2000). Building
Data Mining Applications for CRM. McGraw-Hill.
47. Han, Jiawei & Kamber, Micheline. (2000). Data Mining: Concepts
and Techniques. Morgan Kaufmann Publishers.
60. Provost, Foster J., Tom Fawcett, and Ron Kohavi. “The case
against accuracy estimation for comparing induction algorithms.”
ICML. Vol. 98. 1998
61. Hanley, James A., and Barbara J. McNeil. “The meaning and use
of the area under a receiver operating characteristic (ROC) curve.”
Radiology 143.1 (1982): 29–36
67. Kingma, Diederik, and Jimmy Ba. “Adam: A method for stochastic
optimization.” arXiv preprint arXiv:1412.6980 (2014)
85. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean.
Efficient estimation of word representations in vector space. ICLR
Workshop, 2013
92. https://www.mordorintelligence.com/industry-reports/
chatbot-market
94. https://wordnet.princeton.edu/
95. Google AI Blog: All Our N-gram are Belong to You
(googleblog.com)
97. https://cran.r-project.org/web/packages/rpart/rpart.pdf
98. https://www.rdocumentation.org/packages/rpart/
versions/4.1.16
100. https://cran.r-project.org/web/packages/rpart/rpart.pdf
101. https://www.rdocumentation.org/packages/rpart/
versions/4.1.16
106. http://scikit-learn.org/stable/modules/generated/
sklearn.tree.DecisionTreeClassifier.html
107. http://paginas.fe.up.pt/~ec/files_1011/week%2008%20-%20
Decision%20Trees.pdf
108. http://www.quora.com/Machine-Learning/Are-gini-index-
entropy-or-classification-error-measures-causing-any-
difference-on-Decision-Tree-classification
109. http://www.quora.com/Machine-Learning/Are-gini-index-
entropy-or-classification-error-measures-causing-any-
difference-on-Decision-Tree-classification
110. https://rapid-i.com/rapidforum/index.php?topic=3060.0
111. http://stats.stackexchange.com/questions/19639/which-is-
a-better-cost-function-for-a-random-forest-tree-gini-
index-or-entropy
Dataset Citation
Justifying recommendations using distantly-labeled reviews and fine-grained aspects,
Jianmo Ni, Jiacheng Li, Julian McAuley, Empirical Methods in Natural Language
Processing (EMNLP), 2019
Index
A
ACF and PACF plots, 477–480
Action potentials, 346
Activation function, 349, 352, 362
    dimensional and nonlinear input data, 362
    linear function, 362
    neural network layers, 362
    ReLU function, 365
    selection, 366
    sigmoid function, 363
    softmax function, 365, 366
    tanh function, 364
    types, 362
Adjusted R2, 169, 170, 179, 200, 236, 249, 264
Affinity propagation method, 497
agg() function, 657
AGGREGATE functions, 77, 108
AI algorithms, 10
Akaike information criterion (AIC) value, 243, 276
All subsets regression approach, 248, 249
Alternate hypothesis, 190, 199
Anaconda framework, 210, 261, 430, 440, 464
Analytics job skills requirement
    communications skills, 14
    data storage/data warehousing, 15
    data structure, 15
    statistical and mathematical concepts, 15, 16
    tools, techniques, and algorithms, 14, 15
Analytics methods, 5, 121, 134
anova() function, 410
ANOVA, 164, 410
Antecedent, 525, 526, 529
Apache Hadoop ecosystem, 603, 605
Apache Hadoop YARN, 605
Apache HBase, 605
Apache Hive, 605
Apache Mahout, 605
Apache Oozie, 605
Apache Pig, 605
Apache Spark, 606
Apache Storm, 605
apply() function, 626, 627, 633, 658, 659
Approval variable, 291, 292
Apriori algorithm, 523, 533, 561
    advantages, 524
    assumption, 525
    data mining, 523
    frequent-item sets, 524
    rules generation, 527–529
Area under curve (AUC), 173, 179, 380
Artificial intelligence (AI), 64, 131, 347, 399, 601, 608
Artificial neural networks (ANNs), 347, 677
Q
qqPlot(model name, simulate = TRUE, envelope = 0.95), 240
Quadratic relationship, 59
Qualitative data, 125
Quantile regression, 192
Quantiles, 37–39
Quantile-to-quantile plot, 463
Quantitative data, 125, 596
Quartile 3, 23, 24

R
R, 147, 276
    cor() function, 231
    correlation plot, 159
    decision trees, 320–330
    density plot, 161
    hierarchical clustering
        average/centroid methods, 506
        dendrograms, 507, 508
        distance methods, 506
        Euclidian method, 506
        hclus() function, 506
        NbCLust() function, 506
        optimal clusters, 507
    k-means clustering model
        components, 501
        create clusters, 498, 499
        elbow method, 503, 504
        fvizcluster() function, 502
        kmeans() function, 500
        optimal value, 505
        scale() function, 499, 500
        silhouette method, 503, 504
        data types/StudentID column, 499
        summarizing, 502
        unsupervised clustering, 503
    KNN, 284–290
    lm() function, 229, 235
    Naïve Bayes classifier, 300, 301, 303
    pairs() command, 156
    Scatter matrix plot, 157
    summary() function, 143
    summary() statistics, data set, 143
    View() function, 141
    VIF calculation, 242
R2 (R-squared), 135, 169, 200, 218, 236, 245, 248, 411
Random forests, 318
Random sampling, 27, 124
Range, 36
range(dataset), 36
rpart() library, 323
Rare points, 25
Ratio data, 126
Raw data, 140
R code, 446–448
Real-time analytics, 609
Real-time data analysis, 119
Receiver operating characteristics (ROC) graph, 135, 173, 174, 329, 330, 342, 343
Rectified linear unit (ReLU) function, 365
Recurrent neural networks (RNNs), 399, 598
Regression line, 250
Regression models, 129, 132, 135, 185, 198, 277
    adjusted R2, 169, 170
    MAD, 168
    MAE, 168
    MAPE, 168
    prediction error, 167
    R2, 169
    RSME, 168
    SSE, 169