DMBAR Chapter 1
ANALYTICS IN R
Indian Adaptation by
O.P. Wali, Professor, Indian Institute of Foreign Trade
Copyright © 2022 by John Wiley & Sons, Inc. All rights reserved.
PART I
PRELIMINARIES
CHAPTER 1 Introduction
1.1 WHAT IS BUSINESS ANALYTICS?
▪ Business Analytics (BA) is the practice and art of bringing quantitative data to bear on decision-making.
▪ Business Analytics, or more generically, analytics, includes a range of data analysis methods. Many powerful
applications involve little more than counting, rule-checking, and basic arithmetic.
▪ The next level of business analytics, now termed Business Intelligence (BI), refers to data visualization and
reporting for understanding.
▪ Business Analytics now typically includes BI as well as sophisticated data analysis methods, such as statistical
models and data mining algorithms used for exploring data, quantifying and explaining relationships between
measurements, and predicting new records.
The Business Analytics toolkit also includes statistical experiments, the most common of which is known to
marketers as A-B testing. These are often used for pricing decisions:
▪ Orbitz, the travel site, found that it could price hotel options higher for Mac users than for Windows users.
▪ The Staples online store found that it could charge more for staplers if a customer lived far from a Staples store.
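To make the idea of an A-B test concrete, here is a minimal sketch in Python of one common way to compare two variants: a two-proportion z-test. The visitor and purchase counts are hypothetical, the function name `two_proportion_z_test` is introduced only for this illustration, and the test itself is an assumed choice of analysis, not a method attributed to Orbitz or Staples.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both variants share one conversion rate."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution, via the error function
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: 5000 visitors see price A, 5000 see price B
z, p = two_proportion_z_test(conv_a=400, n_a=5000, conv_b=460, n_b=5000)
print(f"z = {z:.2f}, p-value = {p:.4f}")
```

A small p-value suggests that the difference in purchase rates between the two variants is unlikely to be due to chance alone. Whether the difference matters commercially is a separate business question.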
Successful use of analytics and data mining requires both an understanding of the business context where value is to
be captured, and an understanding of exactly what the data mining methods do.
1.2 WHAT IS DATA MINING?
Data mining refers to business analytics methods that go beyond counts, descriptive techniques, reporting, and
methods based on business rules.
Data mining includes statistical and machine-learning methods that inform decision-making, often in an automated
fashion.
The era of Big Data has accelerated the use of data mining.
Data mining methods, with their power and automaticity, have the ability to cope with huge amounts of data and
extract value.
1.3 DATA MINING AND RELATED TERMS
The term data mining itself means different things to different people.
Data mining stands at the confluence of the fields of statistics and machine learning (also known as artificial
intelligence).
The emphasis that classical statistics places on inference is absent from data mining.
In comparison to statistics, data mining deals with large datasets in an open-ended fashion, making it impossible to
put the strict limits around the question being addressed that inference would require.
Data mining is vulnerable to the danger of overfitting, where a model is fit so closely to the available sample of
data that it describes not merely structural characteristics of the data, but random peculiarities as well. In
engineering terms, the model is fitting the noise, not just the signal.
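Overfitting can be seen in a small simulation. The sketch below is in Python and uses hypothetical data: the signal is y = 2x and the noise is Gaussian. It compares a simple least-squares line with a "memorizer" (1-nearest-neighbor) that reproduces each training value exactly. All function names are introduced only for this illustration.

```python
import random

random.seed(1)

def make_data(n):
    """Hypothetical data: signal y = 2x plus Gaussian noise."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + random.gauss(0, 1)) for x in xs]

def fit_line(data):
    """Ordinary least-squares line; returns a prediction function."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    slope = (sum((x - mx) * (y - my) for x, y in data)
             / sum((x - mx) ** 2 for x, _ in data))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def fit_memorizer(data):
    """1-nearest-neighbor: predicts the y of the closest training x,
    so it reproduces the training noise along with the signal."""
    return lambda x: min(data, key=lambda pt: abs(pt[0] - x))[1]

def mse(model, data):
    """Mean squared prediction error of a model on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

train, holdout = make_data(500), make_data(500)
line, memorizer = fit_line(train), fit_memorizer(train)
for name, model in [("line", line), ("memorizer", memorizer)]:
    print(f"{name:10s} train MSE = {mse(model, train):.2f}  "
          f"holdout MSE = {mse(model, holdout):.2f}")
```

The memorizer has zero error on the data it was fit to, because it has fit the noise. On the holdout data, the simpler line is expected to predict better.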
1.4 BIG DATA
Data mining and Big Data go hand in hand. Big Data is a relative term—data today are big by reference to the past,
and to the methods and devices available to deal with them.
The challenge Big Data presents is often characterized by the four V’s—volume, velocity, variety, and veracity.
a. Volume refers to the amount of data.
b. Velocity refers to the flow rate: the speed at which data is being generated and changed.
c. Variety refers to the different types of data being generated (currency, dates, numbers, text, etc.).
d. Veracity refers to the fact that data is being generated by organic distributed processes (e.g., millions of people signing up
for services or free downloads) and not subject to the controls or quality checks that apply to data collected for a study.
Most large organizations face both the challenge and the opportunity of Big Data because most
routine data processes now generate data that can be stored and, possibly, analyzed.
1.5 DATA SCIENCE
Data science is a mix of skills in the areas of statistics, machine learning, math, programming, business, and IT.
Data science itself is thus broader than the other concepts we discussed above, and it is a rare individual who
combines deep skills in all the constituent areas.
Although Big Data is the motivating power behind the growth of data science, most data scientists do not actually
spend most of their time working with terabyte-size or larger data.
Data of the terabyte or larger size would be involved at the deployment stage of a model. There are manifold
challenges at that stage, most of them IT and programming issues related to data-handling and tying together
different components of a system.
1.6 WHY ARE THERE SO MANY DIFFERENT METHODS?
The goal is to find a combination of household income level and household lot size that separates buyers (solid
circles) from nonbuyers (hollow circles) of riding mowers. The first method (left panel) looks only for horizontal
and vertical lines to separate buyers from nonbuyers, whereas the second method (right panel) looks for a single
diagonal line.
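The contrast between the two methods can be sketched in Python. The household data below are hypothetical (they are not the riding-mower data), and the thresholds in both rules are hand-picked for illustration, not fitted by any algorithm.

```python
# Hypothetical households: (income in $000s, lot size in 000s sq ft, is_buyer)
households = [
    (90, 20, True), (75, 22, True), (110, 18, True), (60, 24, True), (100, 21, True),
    (50, 17, False), (65, 16, False), (40, 20, False), (80, 14, False), (55, 19, False),
]

def axis_parallel_rule(income, lot_size):
    """Horizontal and vertical cuts only: one threshold per variable (hand-picked)."""
    return income >= 58 and lot_size >= 17.5

def diagonal_rule(income, lot_size):
    """A single diagonal line: a weighted sum compared with one cutoff (hand-picked)."""
    return income + 5 * lot_size > 165

def accuracy(rule, data):
    """Share of households whose buyer status the rule reproduces."""
    return sum(rule(inc, lot) == buyer for inc, lot, buyer in data) / len(data)

print("axis-parallel accuracy:", accuracy(axis_parallel_rule, households))
print("diagonal accuracy:     ", accuracy(diagonal_rule, households))

# A new hypothetical household near the boundary: the two rules disagree
new_household = (59, 18)
print("axis-parallel says buyer:", axis_parallel_rule(*new_household))
print("diagonal says buyer:     ", diagonal_rule(*new_household))
```

Both rules separate these ten households, yet they draw different boundaries and so can classify a new household differently. Which shape of boundary works better depends on the data, which is one reason there are so many different methods.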
1.7 TERMINOLOGY AND NOTATION
• Algorithm
• Attribute: see Predictor.
• Case: see Observation.
• Confidence
• Dependent Variable: see Response.
• Estimation: see Prediction.
• Feature: see Predictor.
• Holdout Data (or holdout set)
• Input Variable: see Predictor.
• Model
• Observation
• Outcome Variable: see Response.
• Output Variable: see Response.
• P(A|B)
• Prediction
• Predictor
• Profile
• Record: see Observation.
• Response
• Sample
• Score
• Success Class
• Supervised Learning
• Target: see Response.
• Test Data (or test set)
• Training Data (or training set)
• Unsupervised Learning
• Validation Data (or validation set)
• Variable
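The terms training data, validation data, and test data in the list above refer to separate portions of the available records. A minimal sketch in Python of one way to create them by random partitioning follows. The 50/30/20 proportions, the function name `partition`, and the record IDs are assumptions made for illustration.

```python
import random

def partition(records, train_frac=0.5, valid_frac=0.3, seed=42):
    """Randomly split records into training, validation, and test sets.

    The test set receives whatever remains after the first two portions.
    """
    shuffled = records[:]                      # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)      # fixed seed makes the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

records = list(range(1000))  # hypothetical record IDs
train, valid, test = partition(records)
print(len(train), len(valid), len(test))
```

Each record lands in exactly one of the three sets, so a model fit on the training data can be assessed on records it has not seen.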
1.8 ROAD MAPS TO THIS BOOK
Data mining from a process perspective. Numbers in parentheses indicate chapter numbers.
Organization of data mining methods in this book, according to the nature of the data.