CST 322 Data Analytics (Elective)
COURSE MATERIAL
To become an ultimate destination for acquiring the latest and advanced knowledge in multidisciplinary domains.
To provide high quality education in engineering and technology through innovative
teaching-learning practices, research and consultancy, embedded with professional
ethics.
To promote intellectual curiosity and thirst for acquiring knowledge through outcome
based education.
To partner with industry and reputed institutions to enhance students' employability skills and pedagogical pursuits.
To leverage technologies to solve real-life societal problems through community service.
DEPARTMENT VISION
To produce competent professionals with research and innovation skills, by providing the most conducive environment for quality, research-oriented undergraduate education along with moral values, committed to building a vibrant nation.
DEPARTMENT MISSION
CO PO MAPPING
CO’S PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
C312B.1 3 2 2 - - - - - - - 2
C312B.2 3 3 - - 2 - - - - - - 2
C312B.3 3 2 - - 2 - - - - - - 2
C312B.4 3 2 2 - - - - - - - 2
C312B.5 3 3 3 - - - - - - - - -
C312B.6 3 2 - - 2 - - - - - - 2
C312 3 2.3 2.3 2 2
Preamble:
This course helps the learner to understand the basic concepts of data analytics. This course covers
mathematics for data analytics, predictive and descriptive analytics of data, Big data and its
applications, techniques for managing big data and data analysis & visualization using R
programming tool. It enables the learners to perform data analysis on a real world scenario using
appropriate tools.
Prerequisite: NIL
Course Outcomes: After the completion of the course the student will be able to
CO1 Illustrate the mathematical concepts for data analytics (Cognitive Knowledge
Level: Apply)
CO2 Explain the basic concepts of data analytics (Cognitive Knowledge Level:
Understand)
CO4 Describe the key concepts and applications of Big Data Analytics (Cognitive
Knowledge Level: Understand)
CO5 Demonstrate the usage of Map Reduce paradigm for Big Data Analytics
(Cognitive Knowledge Level: Apply)
CO6 Use R programming tool to perform data analysis and visualization (Cognitive
Knowledge Level: Apply)
Assessment Pattern
Bloom's Category        Test 1 (%)   Test 2 (%)   End Semester Examination (%)
Remember                    30           30           30
Understand                  40           40           40
Apply                       30           30           30

Mark Distribution
Total Marks: 150    CIE Marks: 50    ESE Marks: 100    ESE Duration: 3 hours
Continuous Internal Evaluation Pattern (Part B): Part B contains 7 questions (preferably, 3 questions each from the completed modules and 1 question from the partly completed module), each with 7 marks. Out of the 7 questions, a student should answer any 5.

End Semester Examination Pattern: There will be two parts: Part A and Part B. Part A contains 10 questions with 2 questions from each module, having 3 marks for each question. Students should answer all questions. Part B contains 2 full questions from each module, of which students should answer any one. Each question can have a maximum of 2 sub-divisions and carries 14 marks.
Syllabus
Module - 1 (Mathematics for Data Analytics)
Descriptive statistics - Measures of central tendency and dispersion, Association of two variables - Discrete variables, Ordinal and Continuous variables, Probability calculus - probability distributions, Inductive statistics - Point estimation, Interval estimation, Hypothesis Testing - Basic definitions, t-test
Module - 2 (Introduction to Data Analytics)
Module - 4 (Big Data Analytics)
Big Data Overview – State of the practice in analytics, Example Applications - Credit Risk Modeling, Business Process Analytics. Big Data Analytics using Map Reduce and Apache Hadoop, Developing and Executing a Hadoop MapReduce Program.
Module - 5 (R programming for Data Analysis)
Text Book
1. Bart Baesens, "Analytics in a Big Data World: The Essential Guide to Data Science and its Business Intelligence and Analytic Trends", John Wiley & Sons, 2013.
2. David Dietrich, EMC Education Services, "Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data", John Wiley & Sons, 2015.
3. Jiawei Han, Micheline Kamber, "Data Mining: Concepts and Techniques", Elsevier, 2006.
4. Christian Heumann and Michael Schomaker, "Introduction to Statistics and Data Analysis", Springer, 2016.
References
1. Margaret H. Dunham, Data Mining: Introductory and Advanced Topics. Pearson, 2012.
2. Michael Berthold, David J. Hand, Intelligent Data Analysis, Springer, 2007.
1. The following table gives the midterm exam marks (x) and the final exam marks (y) obtained by twelve students in a course:

Midterm (x): 72  50  81  74  94  86  59  83  65  33  88  81
Final (y):   84  63  77  78  90  75  49  79  77  52  74  90
a) Use the method of least squares to find an equation for the prediction of a
student’s final exam marks based on the student’s midterm grade in the
course.
b) Predict the final exam marks of a student who received an 86 on the
midterm exam.
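A minimal R sketch (not part of the original question paper) showing how the least-squares answer to Question 1 could be checked with lm(), using the marks from the table above:

midterm <- c(72, 50, 81, 74, 94, 86, 59, 83, 65, 33, 88, 81)
final   <- c(84, 63, 77, 78, 90, 75, 49, 79, 77, 52, 74, 90)
fit <- lm(final ~ midterm)                 # least-squares line: final = a + b * midterm
coef(fit)                                  # intercept a and slope b
predict(fit, data.frame(midterm = 86))     # predicted final exam mark for a midterm of 86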
2. Perform knn classification on the following dataset and predict the class for the data
point X (P1 = 3, P2 =7), assuming the value of k as 3.
P1 P2 Class
7 7 False
7 4 False
3 4 True
1 4 True
Members 23 24 27 25 30 28
3. List and explain any two methods for dealing with missing values in a dataset.
4. Consider the following data (in increasing order) for the attribute age: 13, 15, 16, 16, 19, 20,
20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70. Sketch an
example for stratified sampling using samples of size 5 and the strata “youth,” “middle-aged,”
and “senior.”
6. Find the absolute support, relative support and confidence of the rule (bread => jam) in the
following set of transactions
T1 {bread, butter}, T2 {bread, jam, milk}, T3 {milk, curd}, T4 {bread, jam}
Part B
(Answer any one question from each module. Each question carries 14 Marks)
(b) A hiking enthusiast has a new app for his smartphone which summarizes his hikes using a GPS device. Let us look at the distance hiked (in km) and maximum altitude (in m) for the last 10 hikes: (6)
Distance 12.5 29.9 14.8 18.7 7.6 16.2 16.5 27.4 12.1 17.5
Altitude 342 1245 502 555 398 670 796 912 238 466
Calculate the arithmetic mean and median for both distance and altitude.
OR
(b) A total of 150 customers of a petrol station are asked about their satisfaction with their car and motorbike insurance. The results are summarized below. Determine and interpret Pearson's χ² statistic and Cramer's V. (6)
Satisfied Unsatisfied Total
Car 33 25 58
Car (Diesel engine) 29 31 60
Motor bike 12 20 32
Total 74 76 150
(b) Discuss the methods for handling noisy data. Consider the following sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34. (6)
Illustrate smoothing by bin means and bin boundaries
OR
14. (a) What is the need for sampling in data analytics? Discuss the different sampling (8)
techniques.
(b) Use these methods to normalize the following group of data: (6)
200, 300, 400, 600, 1000
(i) min-max normalization by setting min = 0 and max = 1
(ii) z-score normalization
(iii) normalization by decimal scaling .
15. (a) A database has five transactions. Let min_sup be 60% and min_conf be 80%. Find all frequent itemsets using the Apriori algorithm.
TID items_bought
T100 {M, O, N, K, E, Y}
T200 {D, O, N, K, E, Y}
T300 {M, A, K, E}
T400 {M, U, C, K, Y}
T500 {C, O, O, K, I, E}
(b) Generate strong association rules from any one 3 itemset. (4)
OR
(b) Suppose that the data mining task is to cluster points (with (x, y) representing location) into three clusters, where the points are A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9). The distance function is Euclidean distance. Suppose initially we assign A1, B1, and C1 as the center of each cluster, respectively. Use the k-means algorithm to show only (6)
(a) The three cluster centers after the first round of execution.
(b) The final three clusters.
17. (a) Illustrate the working of a Map Reduce program with example.
(8)
OR
18. (a) Discuss the architecture of HDFS and its features. (8)
(b) Illustrate the use of big data analytics in credit risk modeling. (6)
19. (a) List and explain the R functions used in descriptive statistics. (8)
OR
20. (a) Discuss the data visualization for multiple variables in R (8)
(b) Describe the R functions used for cleaning dirty data. (6)
(5 x 14 = 70)
Teaching Plan
No | Contents | No. of Lecture Hrs
4.4 Big Data Analytics using Map Reduce and Apache Hadoop 1
4.5 Big Data Analytics using Map Reduce and Apache Hadoop 1
MODULE 1 – MATHEMATICS FOR DATA ANALYTICS
Descriptive statistics
Descriptive statistics are used to describe or summarize the characteristics of a sample or data set,
such as a variable's mean, standard deviation, or frequency. Inferential statistics, in contrast, help us draw conclusions about a larger population from the properties of a data sample.
Measures of central tendency and dispersion are common descriptive measures for summarising
numerical data.
Measures of central tendency are measures of the location of the middle or the center of a distribution.
The most frequently used measures of central tendency are the mean, median and mode.
A measure of dispersion is a numerical value describing the amount of variability present in a
data set.
The standard deviation (SD) is the most commonly used measure of dispersion; it measures the scatter of the values about their mean.
The range can also be used to describe the variability in a set of data and is defined as the
difference between the maximum and minimum values. The range is an appropriate measure of
dispersion when the distribution is skewed.
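As a minimal R sketch (using a small hypothetical vector), the measures of central tendency and dispersion described above can be computed as follows:

x <- c(12, 15, 15, 18, 22, 24, 24, 24, 30)   # hypothetical data
mean(x)          # arithmetic mean
median(x)        # median
sd(x)            # standard deviation
range(x)         # minimum and maximum values
diff(range(x))   # the range (max - min)
IQR(x)           # interquartile range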
Association between two variables means the values of one variable relate in some way to the values of the other. Common tools for examining association include cross tabulations, scattergrams, and correlation.
Correlation
There are several types of correlation measures that can be applied to different measurement scales of a variable (i.e. nominal, ordinal, or interval). One of these, the Pearson product-moment correlation coefficient, is based on interval-level data and on the concept of deviation from a mean for each of the variables. A related statistic, the covariance, is the average of the products of the deviations of the observed values of the two variables from their respective means; the Pearson correlation coefficient is the covariance divided by the product of the two standard deviations.
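A minimal R sketch (hypothetical vectors) of the covariance and Pearson correlation coefficient just described:

x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 7, 9, 12)
cov(x, y)                     # sample covariance
cor(x, y)                     # Pearson product-moment correlation
cov(x, y) / (sd(x) * sd(y))   # the same correlation, from the definition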
Regression
If the correlation between two variables is found to be significant and there is reason to suspect that one
variable influences the other, then one might decide to calculate a regression line for the two variables.
For example, one might state that an increase in population size results in an increase in the crime rate. The crime rate would then be considered the dependent variable and the population size the independent variable.
PROBABILITY CALCULUS
From probability calculus we know that for two events A and B, the probability of B given A is
obtained by dividing the joint by the marginal: p(B∣A) = p(A and B)/p(A).
Probability can be approached in three ways: theoretical probability, experimental probability, and axiomatic probability.
A probability distribution is the mathematical function that gives the probabilities of occurrence of
different possible outcomes for an experiment. It is a mathematical description of a random
phenomenon in terms of its sample space and the probabilities of events (subsets of the sample space).
Inductive statistics is the phase of statistics which is concerned with the conditions under which
conclusions about populations can be drawn from analysis of particular samples. Inductive statistics is
also known as statistical inference, or inferential statistics.
Inductive statistics is the logical process of drawing general conclusions based on specific pieces of
information.
It is the branch of statistics that deals with generalizations, predictions, estimations, and decisions about a population based on data sampled from that population.
POINT ESTIMATE AND AN INTERVAL ESTIMATE
A point estimate is a single value estimate of a parameter. For instance, a sample mean is a
point estimate of a population mean.
An interval estimate gives you a range of values where the parameter is expected to lie. A
confidence interval is the most common type of interval estimate.
The main difference between point and interval estimation is the values that are used. Point estimation uses a single value (a statistic such as the sample mean), while interval estimation uses a range of values to infer information about the population.
Hypothesis Testing
Hypothesis testing is an act in statistics whereby an analyst tests an assumption regarding a population
parameter. The methodology employed by the analyst depends on the nature of the data used and the
reason for the analysis
1. The first step is for the analyst to state the two hypotheses so that only one can be
right.
2. The next step is to formulate an analysis plan, which outlines how the data will be
evaluated.
3. The third step is to carry out the plan and physically analyse the sample data.
4. The fourth and final step is to analyse the results and either reject the null hypothesis,
or state that the null hypothesis is plausible, given the data.
STEPS FOR CONSTRUCTING A HYPOTHESIS
T-TEST
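The t-test material itself is not reproduced in these notes; as a minimal hedged sketch, a one-sample t-test on hypothetical data can be run in R as follows, which also returns the confidence interval discussed above:

marks <- c(48, 52, 55, 47, 49, 53, 51, 46, 50, 54)   # hypothetical sample
t.test(marks, mu = 50)            # tests H0: true mean = 50; gives t, df, p-value, 95% CI
t.test(marks, mu = 50)$conf.int   # the interval estimate (confidence interval) alone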
MODULE 2 – INTRODUCTION TO DATA ANALYTICS
Introduction to Data Analytics - Analytics, Analytics Process Model, Analytical Model Requirements, Data Analytics Life Cycle overview, Basics of data collection, sampling, preprocessing and dimensionality reduction.
Data Analytics
Data analytics is the science of analysing raw data in order to understand and draw conclusions from it. The findings are interpreted and used to help organisations customize content, create better marketing and promotional campaigns, develop content strategies and develop products. Data analytics helps organisations to maximize market efficiency and improve their earnings.
Types of Data Analytics (figure): 1. Descriptive Analytics, 2. Diagnostic Analytics, 3. Predictive Analytics, 4. Prescriptive Analytics.
Analytics Process Model (figure): identify the business problem, identify the data sources, select and preprocess the data, perform analytics, and finally interpret, evaluate and deploy the model (post-processing).
Step 1: In the first step, a thorough definition of the business problem to be addressed is needed, e.g. churn modelling for post-paid telco subscriptions or fraud detection for credit cards. Defining the perimeter of the analytical modelling exercise requires a close collaboration between the data scientist and the business expert. Both parties have to agree on a set of key concepts; these may include how we define a customer, a transaction, churn or fraud.
Step 2: Next, all source data that could be of potential interest need to be identified. The golden rule here is: the more data, the better. The analytical model itself will later decide which data are relevant and which are not for the task at hand.
Once the analytical model has been validated and approved, it can be appropriately put into production as an analytics application.
Data Analytics Life Cycle
The Data Analytics Life Cycle is designed for big data problems and data science projects.
Key roles for a successful analytics project:
1. Business User: Someone who understands the domain and benefits from the results. This person can consult and advise the project team on the context of the project, the value of the results and how the outputs will be operationalized.
2. Project Sponsor: Responsible for the genesis of the project; provides the impetus and requirements for the project and defines the core business problem.
3. Project Manager: Ensures that key milestones and objectives are met on time and at the expected quality.
4. Database Administrator (DBA): Provisions and configures the database environment to support the analytics needs of the team and ensures that the appropriate security levels are in place for the data repositories.
5. Data Engineer: Leverages deep technical skills to assist with SQL queries, data management and data extraction, and provides support for data ingestion into the analytic sandbox.
6. Data Scientist: Provides subject-matter expertise for analytical techniques and data modelling, and applies valid analytical techniques to the given business problem, ensuring that the overall analytics objectives are met.
Data Analytics Life Cycle (figure): Phase 1 – Discovery, Phase 2 – Data Preparation, Phase 3 – Model Planning, Phase 4 – Model Building, Phase 5 – Communicate Results, Phase 6 – Operationalize. Key questions at the phase gates include: Do I have enough information to draft an analytic plan? Do I have enough good quality data to start building the model? Do I have a good idea about the type of model to try? Is the model robust enough?
Phase 1 – Discovery: The team learns the business domain, including relevant history, such as whether the organization or business unit has attempted similar projects in the past. The team assesses the resources available to support the project in terms of people, technology, time and data. In Phase 2 (Data Preparation), the team performs Extract, Transform and Load (ETL) to get the data into an analytic sandbox.
Analytical Model Requirements:
1. Business relevance
2. Statistical performance and power
3. Interpretability and justifiability
4. Operational efficiency
5. Economic cost
6. Compliance with regulation and legislation
Basics of Data Collection, Sampling and Preprocessing
Types of data sources:
1. Transactions
2. Text documents or multimedia content can also be interesting to analyse.
3. Qualitative, expert-based data: an expert is a person with a substantial amount of subject-matter expertise in a particular setting.
Sampling: The aim of sampling is to take a subset of historical data and use it to build the analytical model; the sample should be as representative as possible of the population on which the model will be applied. Example (credit scoring): all customers who come to the bank and apply for a mortgage form the so-called through-the-door population, which is split into accepts and rejects; the accepts are further split into goods and bads. Another possibility is withdrawals: these are customers who were offered credit but decided not to take it.
Types of Data Elements
1. Continuous: data elements that are defined on an interval, which can be limited or unlimited.
2. Categorical (e.g. nominal, ordinal, binary).
Missing Values – schemes to deal with missing values:
1. Replace (impute): replace the missing value with a known value, e.g. impute a missing credit bureau score with the average or median of the known values.
2. Delete: the most straightforward option is to delete observations or variables with lots of missing values.
3. Keep: missing values can be meaningful, e.g. a customer did not disclose his or her income because he or she is currently unemployed.
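A minimal R sketch (hypothetical vector) of the replace/impute and delete options listed above:

income <- c(1200, NA, 1500, 1800, NA, 2000)   # hypothetical data with missing values
income_imputed <- ifelse(is.na(income), median(income, na.rm = TRUE), income)   # option 1: impute with the median
income_deleted <- income[!is.na(income)]      # option 2: delete observations with missing values
income_imputed
income_deleted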
Outlier Detection and Treatment
Outliers are extreme observations that are very dissimilar to the rest of the population (e.g. a single very low income in a scatter plot of income against age). Two common detection methods:
1. Box plot: values more than 1.5 x IQR (interquartile range) below Q1 or above Q3 are flagged as outliers (Min, Q1, Median, Q3, Max).
2. Z-scores: measure how many standard deviations an observation lies away from the mean. Example, with mean age = 40 and standard deviation = 10:
   age 30: z = (30 − 40)/10 = −1
   age 50: z = (50 − 40)/10 = +1
   age 10: z = (10 − 40)/10 = −3
   age 40: z = (40 − 40)/10 = 0
   age 60: z = (60 − 40)/10 = +2
   age 80: z = (80 − 40)/10 = +4
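A minimal R sketch of both detection methods, using the six ages from the example above (the mean of 40 and standard deviation of 10 are the values assumed in that example):

age <- c(30, 50, 10, 40, 60, 80)
z <- (age - 40) / 10          # z-scores with the assumed mean 40 and sd 10: -1 1 -3 0 2 4
z                             # in practice, scale(age) estimates the mean and sd from the data
q <- quantile(age, c(0.25, 0.75))
age[age < q[1] - 1.5 * IQR(age) | age > q[2] + 1.5 * IQR(age)]   # box-plot (1.5 x IQR) rule
boxplot(age)                  # visual check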
Standardizing Data
1. Min-max standardization: X_new = ((X_old − min(X_old)) / (max(X_old) − min(X_old))) × (new_max − new_min) + new_min, where new_max and new_min are the newly imposed maximum and minimum.
2. Z-score standardization: calculate the z-scores, z = (X − μ) / σ.
3. Decimal scaling: X_new = X_old / 10^n, i.e. dividing by a power of 10, with n the number of digits of the maximum absolute value.
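A minimal R sketch of the three standardization schemes (hypothetical income values):

income <- c(1000, 1200, 1500, 1800, 2000)
minmax  <- (income - min(income)) / (max(income) - min(income))   # 1. min-max scaling to [0, 1]
zscore  <- (income - mean(income)) / sd(income)                   # 2. z-score standardization (or scale(income))
decimal <- income / 10^nchar(max(abs(income)))                    # 3. decimal scaling by a power of 10
minmax; zscore; decimal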
Categorization (Coarse Classification)
Categorical variables with many levels, or continuous variables, can be grouped (binned) into a smaller number of categories so that they can be used in, for example, regression models. A simple approach is equal-interval binning (e.g. Bin 1: 0–1,000; Bin 2: 1,000–2,000) or equal-frequency binning. For classification problems, candidate groupings can be compared using the counts of goods and bads (the odds) in each category, for example with a chi-squared test, and the grouping (option) with the best fit is retained.
CST322 – Data Analytics
Module – III (Predictive and Descriptive Analytics)
Predictive Analytics
• Statistics research develops tools for prediction and forecasting using
data and statistical models
• Statistical methods can be used to summarize or describe a collection of
data.
Predictive Analytics
• Statistical methods can also be used to verify data mining results.
• For example, after a classification or prediction model is mined, the
model should be verified by statistical hypothesis testing.
A statistical hypothesis test (sometimes called confirmatory data analysis) is a method of statistical inference used to decide whether the data sufficiently support a particular hypothesis.
Classic problems in machine learning
• Supervised learning
• Unsupervised learning
• Semi-supervised learning
Supervised learning
• The supervision in the learning comes from the labeled examples in the
training data set.
• Basically supervised learning is when we teach or train the machine
using data that is well labelled, which means some data is already tagged with the correct answer (output).
Supervised learning
• For instance, suppose you are given a basket filled with different kinds
of fruits. Now the first step is to train the machine with all the different
fruits one by one like this:
Supervised learning
• Supervised learning is classified into two categories of algorithms:
🞄 Classification: A classification problem is when the output variable is
a category, such as “Red” or “blue” , “disease” or “no disease”.
🞄 Regression: A regression problem is when the output variable is a
real value, such as “dollars” or “weight”.
Unsupervised learning
• Unsupervised learning is the training of a machine using information
that is neither classified nor labeled and allowing the algorithm to act
on that information without guidance.
• Here the task of the machine is to group unsorted information according to similarities, patterns, and differences without any prior training on the data.
Unsupervised learning
• For instance, suppose it is given an image having both dogs and cats
which it has never seen.
Unsupervised learning
• Unsupervised learning is classified into two categories of algorithms:
🞄 Clustering: A clustering problem is where you want to discover the
inherent groupings in the data, such as grouping customers by
purchasing behavior.
🞄 Association: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people who buy X also tend to buy Y.
Unsupervised learning
• Types of Unsupervised Learning:-
🞄 Clustering
🞄 Exclusive (partitioning)
🞄 Agglomerative
Supervised vs. Unsupervised Machine Learning
Naïve Bayes
Classifier Algorithm
Why is it called Naïve Bayes?
• The Naïve Bayes algorithm is comprised of two words Naïve and Bayes,
Which can be described as:
🞄 Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of colour, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple; each feature individually contributes to the identification without depending on the others.
Bayes' Theorem:
• Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is
used to determine the probability of a hypothesis with prior knowledge. It
depends on the conditional probability.
• The formula for Bayes' theorem is given as: P(A|B) = (P(B|A) · P(A)) / P(B)
Working of Naïve Bayes' Classifier: - Example
• Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play or not on a particular day according to the weather conditions.
Working of Naïve Bayes' Classifier: - Example 1
• Problem: If the weather is sunny, should the player play or not?
Dataset excerpt (Outlook, Play): 0 Rainy Yes; 1 Sunny Yes; 2 Overcast Yes; ...
Advantages of Naïve Bayes Classifier:
• Naïve Bayes is one of the fast and easy ML algorithms to predict a class
of datasets.
• It can be used for Binary as well as Multi-class Classifications.
• It performs well in Multi-class predictions as compared to the other
Algorithms.
• It is the most popular choice for text classification problems.
Working of Naïve Bayes' Classifier: - Example 2
x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
1. Calculate Prior Probability
P(Play=Yes) = 9/14
P(Play=No) = 5/14
P(Outlook=o|Play=b) P(Temperature=t|Play=b)
Outlook Play=Yes Play=No Temperature Play=Yes Play=No
Sunny 2/9 3/5 Hot 2/9 2/5
Overcast 4/9 0/5 Mild 4/9 2/5
Rain 3/9 2/5 Cool 3/9 1/5
2. Calculate the conditional probabilities of the individual attributes
v_NB = argmax over v_j ∈ {Yes, No} of P(v_j) · P(Outlook=Sunny | v_j) · P(Temperature=Cool | v_j) · P(Humidity=High | v_j) · P(Wind=Strong | v_j)

v_NB(Yes) = P(Yes) · P(Sunny|Yes) · P(Cool|Yes) · P(High|Yes) · P(Strong|Yes) = 0.0053
v_NB(No)  = P(No) · P(Sunny|No) · P(Cool|No) · P(High|No) · P(Strong|No) = 0.0206

Normalization:
P(Yes) = v_NB(Yes) / (v_NB(Yes) + v_NB(No)) = 0.205
P(No)  = v_NB(No) / (v_NB(Yes) + v_NB(No)) = 0.795

Hence on a Sunny day, with Cool temperature, High humidity and Strong wind, the player can't play the game.
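A minimal R check of the hand calculation above for x' = (Sunny, Cool, High, Strong). The Outlook and Temperature probabilities come from the tables above; the Humidity and Wind probabilities are the standard play-tennis values, assumed here because those tables are not reproduced in these notes:

p_yes <- (9/14) * (2/9) * (3/9) * (3/9) * (3/9)   # P(Yes)*P(Sunny|Yes)*P(Cool|Yes)*P(High|Yes)*P(Strong|Yes) ~ 0.0053
p_no  <- (5/14) * (3/5) * (1/5) * (4/5) * (3/5)   # P(No)*P(Sunny|No)*P(Cool|No)*P(High|No)*P(Strong|No)      ~ 0.0206
p_yes / (p_yes + p_no)    # normalized P(Yes) ~ 0.205
p_no  / (p_yes + p_no)    # normalized P(No)  ~ 0.795 -> the player does not play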
KNN algorithm
• K-Nearest Neighbour is one of the simplest Machine Learning
algorithms based on Supervised Learning technique.
• The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
KNN algorithm
• K-NN is a non-parametric algorithm, which means it does not make
any assumption on underlying data.
• It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, performs an action on the dataset.
Example
Suppose, we have an image of a creature that looks similar to cat and
dog, but we want to know either it is a cat or dog. So for this
identification, we can use the KNN algorithm, as it works on a similarity
measure. Our KNN model will find the features of the new data set that are similar to those of the cat and dog images, and based on the most similar features it will put the creature in either the cat or the dog category.
Why do we need a K-NN Algorithm?
Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1. In which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm.
Suppose we have a new data point and we need to put it in the required category.
Consider the below image:
Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in geometry. It
can be calculated as:
How to select the value of K in the K-NN Algorithm?
• There is no particular way to determine the best value for "K", so we
need to try some values to find the best out of them. The most preferred
value for K is 5.
• A very low value for K, such as K=1 or K=2, can be noisy and lead to the effects of outliers in the model.
Advantages of KNN Algorithm:
• It is simple to implement.
• It is robust to the noisy training data
• It can be more effective if the training data is large.
Name     Age   Gender   Class of Sport
Zaira     34      1      Cricket
Sachin    55      0      Neither
Rahul     40      0      Cricket
Pooja     20      1      Neither
Smith     15      0      Cricket
Laxmi     55      1      Football
Michael   15      0      Football

Let's find in which class of people "Kiran" will lie, whose k factor is 3 and age is 5.
So we have to find the distance between any two points using
d = √((x2 − x1)² + (y2 − y1)²)
For example, for a person of age 32 and gender 0:
d = √((5 − 32)² + (1 − 0)²) = √(729 + 1) ≈ 27.02
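A hedged R sketch of the same idea using the knn() function from the class package, applied to the small (P1, P2) dataset given in Question 2 of the sample questions earlier in this material, with query point X = (3, 7) and k = 3:

library(class)
train <- data.frame(P1 = c(7, 7, 3, 1),
                    P2 = c(7, 4, 4, 4))       # training points from the sample question
cl    <- factor(c(FALSE, FALSE, TRUE, TRUE))  # their classes
test  <- data.frame(P1 = 3, P2 = 7)           # the query point X
knn(train, test, cl, k = 3)                   # class predicted by majority vote of the 3 nearest neighbours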
Linear Regression
Regression
• Technique used for the modeling and analysis of numerical data
• Exploits the relationship between two or more variables so that we can
gain information about one of them through knowing values of the other
Regression can be used for prediction, estimation, hypothesis testing, and modeling causal relationships.
Why Linear Regression?
• Linear regression algorithm shows a linear relationship between a
dependent (Y) and one or more independent (x) variables, hence called
as linear regression.
• Since linear regression shows the linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.
Linear Regression is a Probabilistic Model
• Much of mathematics is devoted to studying variables that are
deterministically related to one another
A Linear Probabilistic Model
• Mathematically, we can represent a linear regression as y = a0 + a1x + ε, where a0 is the intercept, a1 is the regression coefficient (slope), and ε is the random error.
Linear Regression Line
A linear line showing the relationship between the dependent and
independent variables is called a regression line. A regression line can
show two types of relationship:
• Positive Linear Relationship: If the dependent variable increases on the Y-axis and the independent variable increases on the X-axis, then such a relationship is called a positive linear relationship.
Linear Regression Line
• Negative Linear Relationship:
If the dependent variable decreases on the Y-axis and independent
variable increases on the X-axis, then such a relationship is called a
negative linear relationship.
Example
Y - dependent variable
x1 and x2 - independent variables
a = Ȳ − b1·x̄1 − b2·x̄2
Example
From the computation table: ΣY = 19.5, Σx1 = 20, Σx2 = 24, with n = 5; solving the normal equations gives b1 = 2.28 and b2 = −1.67.

a = Ȳ − b1·x̄1 − b2·x̄2 = 19.5/5 − 2.28 × (20/5) + 1.67 × (24/5) = 2.796

Final Regression Equation/Model: Y = 2.796 + 2.28·x1 − 1.67·x2
For x1 = 3, x2 = 2: Y = 2.796 + 2.28(3) − 1.67(2) = 6.296
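A hedged R sketch of fitting a two-predictor regression like the one above with lm(); the data vectors here are hypothetical, since the slide's full data table is not reproduced in these notes:

x1 <- c(3, 4, 5, 6, 2)           # hypothetical predictor values
x2 <- c(8, 5, 7, 3, 1)
Y  <- c(3.7, 3.5, 2.8, 7.0, 2.5)
fit <- lm(Y ~ x1 + x2)                      # least-squares fit Y = a + b1*x1 + b2*x2
coef(fit)                                   # intercept a and coefficients b1, b2
predict(fit, data.frame(x1 = 3, x2 = 2))    # prediction at x1 = 3, x2 = 2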
Unsupervised
Learning- Clustering
Clustering
• It is basically a type of unsupervised learning method
• Generally, it is used as a process to find meaningful structure,
explanatory underlying processes, generative features, and groupings
inherent in a set of examples.
Clustering – Ex.
• The data points in the graph below clustered together can be classified
into one single group. We can distinguish the clusters, and we can
identify that there are 3 clusters in the below picture.
Clustering Methods :
• Density-Based Methods: These methods consider the clusters as the
dense region having some similarities and differences from the lower
dense region of the space. These methods have good accuracy and the
ability to merge two clusters. Example DBSCAN (Density-Based Spatial
Clustering of Applications with Noise), OPTICS (Ordering Points To Identify the Clustering Structure), etc.
Clustering Methods : (Contd…)
• Partitioning Methods: These methods partition the objects into k clusters, and each partition forms one cluster. They optimize an objective criterion (similarity function), for example when distance is the major parameter. Examples: K-means, CLARANS (Clustering Large Applications based upon Randomized Search), etc.
Applications of Clustering in different fields
• Marketing: It can be used to characterize & discover customer
segments for marketing purposes.
• Biology: It can be used for classification among different species of
plants and animals.
Hierarchical algorithms
Hierarchical algorithms
• Hierarchical clustering is another unsupervised machine learning
algorithm, which is used to group the unlabeled datasets into a cluster
and also known as hierarchical cluster analysis or HCA.
• In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is known as the dendrogram.
Approaches
• The hierarchical clustering technique has two approaches:
🞄 Agglomerative: Agglomerative is a bottom-up approach, in which
the algorithm starts with taking all data points as single clusters and
merging them until one cluster is left.
🞄 Divisive: The divisive algorithm is the reverse of the agglomerative algorithm, as it is a top-down approach.
Agglomerative Hierarchical clustering
• The agglomerative hierarchical clustering algorithm is a popular
example of HCA.
• To group the datasets into clusters, it follows the bottom-up approach. It
means this algorithm considers each data point as a single cluster at the beginning and then starts combining the closest pair of clusters together.
How does Agglomerative Hierarchical clustering work?
Step-1: Treat each data point as a single cluster (if there are N data points, there will be N clusters).
Step-2: Take the two closest data points or clusters and merge them to form one cluster (N-1 clusters remain).
Step-3: Again, take the two closest clusters and merge them together (N-2 clusters remain).
Step-4: Repeat Step-3 until only one big cluster is left.
Step-5: Once all the clusters are combined into one big cluster,
develop the dendrogram to divide the clusters as per the
problem.
Measure for the distance between two clusters
• There are various ways to calculate the distance between two clusters,
and these ways decide the rule for clustering. These measures are called
Linkage methods.
• Some of the popular linkage methods are given below:
Measure for the distance between two clusters
• Complete Linkage: It is the farthest distance between the two points
of two different clusters. It is one of the popular linkage methods as it
forms tighter clusters than single-linkage.
In the above diagram, the left part shows how clusters are created in agglomerative clustering, and the right part shows the corresponding dendrogram.
• firstly, the datapoints P2 and P3 combine together and form a cluster,
correspondingly a dendrogram is created, which connects P2 and P3 with a
rectangular shape. The height is decided according to the Euclidean distance
between the data points.
• In the next step, P5 and P6 form a cluster, and the corresponding
dendrogram is created. It is higher than the previous one, as the Euclidean distance between P5 and P6 is a little greater than that between P2 and P3.
• Again, two new dendrograms are created that combine P1, P2, and P3 in one
dendrogram, and P4, P5, and P6, in another dendrogram.
• At last, the final dendrogram is created that combines all the data points
together.
We can cut the dendrogram tree structure at any level as per our requirement.
Clusters using a Single Link Technique
Problem Definition:
For the given dataset, find the clusters
using a single link technique. Use
Euclidean distance and draw the dendrogram.
Clusters using a Single Link Technique
Step – 1 Compute the distance matrix
Find Euclidean distance between each
and every point
Clusters using a Single Link Technique
Step – 2 Merging the two closest
members
Form clusters based on the minimum value in the matrix and update the distance matrix. The smallest distance corresponds to (P3, P6), so P3 and P6 are merged first.
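A minimal R sketch of single-linkage hierarchical clustering with hclust(); the six two-dimensional points are illustrative stand-ins, since the slide's distance table is not reproduced here:

pts <- data.frame(x = c(0.40, 0.22, 0.35, 0.26, 0.08, 0.45),
                  y = c(0.53, 0.38, 0.32, 0.19, 0.41, 0.30))   # hypothetical points P1..P6
rownames(pts) <- paste0("P", 1:6)
d  <- dist(pts, method = "euclidean")   # pairwise Euclidean distance matrix
hc <- hclust(d, method = "single")      # single-link (minimum distance) merging
plot(hc)                                # dendrogram; cut it with cutree(hc, k = 2)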
K-Mean (A centroid based Technique):
• The K means algorithm takes the input parameter K from the user and
partitions the dataset containing N objects into K clusters so that resulting
similarity among the data objects inside the group (intra-cluster) is high but
the similarity of data objects with the data objects from outside the cluster is
low (inter-cluster).
Algorithm: K mean
Input:
• K: The number of clusters in which the dataset has to be divided
• D: A dataset containing N number of objects
Method
1. Randomly assign K objects from the dataset (D) as cluster centers (C).
2. (Re)assign each object to the cluster whose mean it is most similar to, based upon the mean value of the objects in the cluster.
3. Update the cluster means, i.e., recalculate the mean of each cluster with the updated objects.
4. Repeat steps 2 and 3 until the assignments no longer change.
Advantages
• Simple, easy to understand, and easy to implement
• It is also efficient: the time taken by K-means clustering rises linearly with the number of data points.
• In general, no other clustering algorithm performs better than K-means.
Disadvantages
• The user needs to specify an initial value of K
• The process of finding the clusters may not converge
• Not suitable for discovering all types of clusters
Example: Cluster the following ten points using the K-means algorithm:
X: 4  6  5  5  6  8  4  5  2  2
Y: 4  3  7  2  6  3  7  6  6  4
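A minimal R sketch applying kmeans() to the ten points listed above (assuming the first row is X and the second is Y; k = 2 is just an illustrative choice):

X <- c(4, 6, 5, 5, 6, 8, 4, 5, 2, 2)
Y <- c(4, 3, 7, 2, 6, 3, 7, 6, 6, 4)
pts <- data.frame(X, Y)
set.seed(1)                      # k-means starts from random centers
km <- kmeans(pts, centers = 2)
km$centers                       # final cluster centers
km$cluster                       # cluster assignment of each point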
Association Rule Mining
• Market Basket Analysis is one of the key techniques used by large retailers to show associations between items.
• It allows retailers to identify relationships between the items that
people buy together frequently.
Basic Definitions
• Support Count (σ) - Frequency of occurrence of an item set.
Here 𝝈({Milk, Bread, Diaper})=2
Rule Evaluation Metrics
• Support(s) –
🞄 The number of transactions that include items in the {X} and {Y} parts of the rule as
a percentage of the total number of transactions.
🞄 It is a measure of how frequently the collection of items occur together as a
percentage of all transactions.
• Confidence(c) –
🞄 It is the ratio of the number of transactions that include all items in both {A} and {B} to the number of transactions that include all items in {A}.
Rule Evaluation Metrics
• Lift(l) –
🞄 The lift of the rule X=>Y is the confidence of the rule divided by the expected
confidence, assuming that the item sets X and Y are independent of each other.
🞄 The expected confidence is the frequency (support) of {Y}; equivalently, lift = confidence / support({Y}).
Illustration
From the given table, consider the rule {Milk, Diaper} => {Beer}:
• Support s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
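A minimal R sketch computing support, confidence and lift for {Milk, Diaper} => {Beer} directly from a transaction list. The five transactions below are the classic example assumed by the illustration; the actual table is not reproduced in these notes:

trans <- list(c("Bread", "Milk"),
              c("Bread", "Diaper", "Beer", "Eggs"),
              c("Milk", "Diaper", "Beer", "Coke"),
              c("Bread", "Milk", "Diaper", "Beer"),
              c("Bread", "Milk", "Diaper", "Coke"))
count <- function(items) sum(sapply(trans, function(t) all(items %in% t)))   # transactions containing all the items
support    <- count(c("Milk", "Diaper", "Beer")) / length(trans)             # 2/5 = 0.4
confidence <- count(c("Milk", "Diaper", "Beer")) / count(c("Milk", "Diaper"))
lift       <- confidence / (count("Beer") / length(trans))                   # confidence / support({Beer})
support; confidence; lift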
Limitations of Apriori Algorithm
• Apriori Algorithm can be slow.
• The main limitation is the time and memory required to hold a vast number of candidate sets when there are many frequent item sets, a low minimum support, or large item sets.
MODULE 4 – BIG DATA ANALYTICS
Data is one of the prime factors of any business purpose. Business Enterprises are data-driven
and without data, no one can have a competitive advantage. It has different definitions wherein
the huge amount of data can be considered as Big Data. It is the most widely used technology
these days in almost every business vertical.
Data can be defined as figures or facts that can be stored in or can be used by a computer.
Big Data is a term that is used for denoting a collection of datasets that is large and complex,
making it very difficult to process using legacy data processing applications.
Structured Data
Unstructured Data
Semi-structured Data
The above three types of Big Data are technically applicable at all levels of analytics. It is
critical to understand the source of raw data and its treatment before analysis while working
with large volumes of big data. Because there is so much data, extraction of information needs
to be done efficiently to get the most out of the data.
Structured Data
Structured data is highly organized and thus, is the easiest to work with. Its dimensions are
defined by set parameters. Every piece of information is grouped into rows and columns like
spreadsheets. Structured data has quantitative data such as age, contact, address, billing,
expenses, debit or credit card numbers, etc.
Due to structured data’s quantitative nature, it is easy for programs to sort through and collect
data. It requires little to no preparation to process structured data. The data only needs to be
cleaned and pared down to the relevant points. The data does not need to be converted or
interpreted too deeply to perform a proper inquiry.
Structured data follow road maps to specific data points or schemas for outlining the location
of each datum and its meaning.
The streamlined process of merging enterprise data with relational data is one of the perks of
structured data. Due to the pertinent data dimensions being defined and being in a uniform
format, very little preparation is required to have all sources be compatible.
The ETL process, for structured data, stores the finished product in a data warehouse. The
initial data is harvested for a specific analytics purpose, and for this, the databases are highly
structured and filtered. However, there is only a limited amount of structured data available,
and it falls under a slim minority of all existing data. Consensus says that structured data makes
up only 20 percent or less of all data.
Unstructured Data
Not all data is structured and well-sorted with instructions on how to use it. All unorganized
data is known as unstructured data.
Almost everything generated by a computer is unstructured data. The time and effort required
to make unstructured data readable can be cumbersome. To yield real value from data, datasets
need to be interpretable. But the process to make that happen can be much more rewarding.
The challenging part about unstructured data analysis is teaching an application to understand
the information it’s extracting. Oftentimes, translation into structured form is required, which
is not easy and varies with different formats and end goals. Some methods to achieve the
translation are by using text parsing, NLP, and developing content hierarchies through
taxonomy. Complex algorithms are involved to blend the processes of scanning, interpreting,
and contextualizing.
Unlike structured data, which is stored in data warehouses, unstructured data is stored in data
lakes. Data lakes preserve the raw format of data as well as all of its information. Data lakes
make data more malleable, unlike data warehouses where data is limited to its defined schema.
Semi-structured Data
Semi-structured data falls somewhere between structured data and unstructured data. It mostly
translates to unstructured data that has metadata attached to it. Semi-structured data can be
inherited such as location, time, email address, or device ID stamp. It can even be a semantic
tag attached to the data later.
Consider the example of an email. The time an email was sent, the email addresses of the sender
and the recipient, the IP address of the device that the email was sent from, and other relevant
information are linked to the content of the email. While the actual content itself is not
structured, these components enable the data to be grouped in a structured manner.
Using the right datasets can make semi-structured data into a significant asset. It can aid
machine learning and AI training by associating patterns with metadata.
Subtypes of Data
Apart from the three above-mentioned types, there are subtypes of data that are not formally
considered Big Data but are somewhat pertinent to analytics. Most times, it is the origin of data
such as social media, machine (operational logging), event-triggered, or geospatial. It can also
involve access levels—open (open source), linked (web data transmitted via APIs and other
connection methods), or dark or lost (siloed within systems for the inaccessibility to outsiders
such as CCTV systems).
Volume: This refers to tremendously large data. The volume of data is rising exponentially: in 2016 the data created was only 8 ZB, and it was expected to rise to about 40 ZB by 2020, which is extremely large.
Variety: A reason for this rapid growth of data volume is that data is coming from different sources in
various formats. We have already discussed how data is categorized into different types. Let us take
another glimpse at it with more examples.
Velocity: The speed of data accumulation also plays a role in determining whether the data is big data
or normal data.
Value: How will the extraction of data work? Here, our fourth V comes in; it deals with a mechanism
to bring out the correct meaning of data. First of all, you need to mine data, i.e., the process to turn raw
data into useful data. Then, an analysis is done on the data that you have cleaned or retrieved from the
raw data. Then, you need to make sure whatever analysis you have done benefits your business, such
as in finding out insights, results, etc., in a way that was not possible earlier.
Veracity: Since packages get lost during execution, we need to start again from the stage of mining raw
data to convert it into valuable data. And this process goes on. There will also be uncertainties and
inconsistencies in the data that can be overcome by veracity. Veracity means the trustworthiness and
quality of data. The veracity of data must be maintained.
Banking
Since there is a massive amount of data that is gushing in from innumerable sources, banks
need to find uncommon and unconventional ways to manage big data. It’s also essential to
examine customer requirements, render services according to their specifications, and reduce
risks while sustaining regulatory compliance. Financial institutions have to deal with Big Data
Analytics to solve this problem.
• NYSE (New York Stock Exchange): NYSE generates about one terabyte of new trade data every
single day. So imagine, if one terabyte of data is generated every day, in a whole year how
much data there would be to process. This is what Big Data is used for.
Government
Government agencies utilize Big Data for running agencies, managing utilities, dealing with traffic jams, and limiting the effects of crime. However, apart from the benefits of Big Data, the government also has to address concerns of transparency and privacy.
• Aadhar Card: The Indian government has a record of all 1.21 billion citizens. This huge data is
stored and analyzed to find out several things, such as the number of youth in the country.
According to which several schemes are made to target the maximum population. All this big
data can’t be stored in some traditional database, so it is left for storing and analyzing using
several Big Data Analytics tools.
Education
Education concerning Big Data produces a vital impact on students, school systems, and
curriculums. By interpreting big data, people can ensure students’ growth, identify at-risk
students, and achieve an improved system for the evaluation and assistance of principals and
teachers.
• Example: The education sector holds a lot of information concerning curriculum, students,
and faculty. The information is analyzed to get insights that can enhance the operational
adequacy of the educational organization. Collecting and analyzing information about a
student such as attendance, test scores, grades, and other issues take up a lot of data. So, big
data approaches a progressive framework wherein this data can be stored and analyzed
making it easier for the institutes to work with.
When it comes to what Big Data is in Healthcare, we can see that it is being used enormously.
It includes collecting data, analyzing it, leveraging it for customers. Also, patients’ clinical data
is too complex to be solved or understood by traditional systems. Since big data is processed
by Machine Learning algorithms and Data Scientists, tackling such huge data becomes
manageable.
• Example: Nowadays, doctors rely mostly on patients’ clinical records, which means that a lot
of data needs to be gathered, that too for different patients. It is not possible for old or
traditional data storage methods to store this data. Since there is a large amount of data
coming from different sources, in various formats, the need to handle this large amount of
data is increased, and that is why the Big Data approach is needed.
E-commerce
• Flipkart: Flipkart is a huge e-commerce website dealing with lots of traffic daily. But, when
there is a pre-announced sale on Flipkart, traffic grows exponentially that crashes the website.
So, to handle this kind of traffic and data, Flipkart uses Big Data. Big Data can help in organizing
and analyzing the data for further use.
Social Media
Social media in the current scenario is considered the largest data generator. The stats have
shown that around 500+ terabytes of new data get generated into the databases of social media
every day, particularly in the case of Facebook. The data generated mainly consist of videos,
photos, message exchanges, etc. A single activity on any social media site generates a lot of
data which is again stored and gets processed whenever required. Since the data stored is in
terabytes, it would take a lot of time for processing if it is done by our legacy systems. Big
Data is a solution to this problem.
Apache Hadoop
Big Data Hadoop is a framework that allows you to store big data in a distributed environment for
parallel processing.
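Hadoop MapReduce programs are normally written in Java; as a purely illustrative, hedged sketch, the map-and-reduce idea behind a word count can be simulated in R (the language used elsewhere in this material):

docs <- c("big data needs big tools", "hadoop processes big data")   # two toy documents
mapped <- unlist(lapply(docs, function(d) {            # Map: emit a (word, 1) pair for every word
  words <- strsplit(d, " ")[[1]]
  setNames(rep(1, length(words)), words)
}))
word_counts <- tapply(mapped, names(mapped), sum)      # Shuffle/Reduce: group by word and sum the counts
word_counts                                            # big=3, data=2, hadoop=1, needs=1, processes=1, tools=1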
Apache Pig
Apache Pig is a platform that is used for analyzing large datasets by representing them as data flows.
Pig is designed to provide an abstraction over MapReduce which reduces the complexities of writing a
MapReduce program.
Apache HBase
Apache HBase is a multidimensional, distributed, open-source, and NoSQL database written in Java.
It runs on top of HDFS providing Bigtable-like capabilities for Hadoop.
Apache Spark
Apache Spark is an open-source, general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Talend
Talend is an open-source data integration platform. It provides many services for enterprise
application integration, data integration, data management, cloud storage, data quality, and Big Data.
Splunk
Splunk is an American company that produces software for monitoring, searching, and analyzing
machine-generated data using a Web-style interface.
Apache Hive
Apache Hive is a data warehouse system developed on top of Hadoop and is used for interpreting
structured and semi-structured data.
Kafka
Apache Kafka is a distributed messaging system that was initially developed at LinkedIn and later
became part of the Apache project. Kafka is agile, fast, scalable, and distributed by design.
MODULE 5 – REVIEW OF BASIC DATA ANALYTIC METHODS USING R
The previous chapter presented the six phases of the Data Analytics Lifecycle.
• Phase 1: Discovery
• Phase 2: Data Preparation
• Phase 3: Model Planning
• Phase 4: Model Building
• Phase 5: Communicate Results
• Phase 6: Operationalize
The first three phases involve various aspects of data exploration. In general, the success of a data analysis project requires a deep understanding of the data. It also requires a toolbox for mining and presenting the data. These activities include the study of the data in terms of basic statistical measures and creation of graphs and plots to visualize and identify relationships and patterns. Several free or commercial tools are available for exploring, conditioning, modeling, and presenting data. Because of its popularity and versatility, the open-source programming language R is used to illustrate many of the presented analytical tasks and models in this book.
This chapter introduces the basic functionality of the R programming language and environment. The first section gives an overview of how to use R to acquire, parse, and filter the data as well as how to obtain some basic descriptive statistics on a dataset. The second section examines using R to perform exploratory data analysis tasks using visualization. The final section focuses on statistical inference, such as hypothesis testing and analysis of variance in R.
sales <- read.csv("c:/data/yearly_sales.csv")   # import the CSV file (path as used in later examples)
head(sales)                                     # examine the imported dataset
summary(sales)
In this example, the data file is imported using the read.csv() function. Once the file has been imported, it is useful to examine the contents to ensure that the data was loaded properly as well as to become familiar with the data. In the example, the head() function, by default, displays the first six records of sales. The summary() function provides some descriptive statistics, such as the mean and median, for each data column. Additionally, the minimum and maximum values as well as the 1st and 3rd quartiles are provided. Because the gender column contains two possible characters, an "F" (female) or "M" (male), the summary() function provides the count of each character's occurrence.
summary(sales)
Plotting a dataset's contents can provide information about the relationships between the various columns. In this example, the plot() function generates a scatterplot of the number of orders (sales$num_of_orders) against the annual sales (sales$sales_total). The $ is used to reference a specific column in the dataset sales. The resulting plot is shown in Figure 3-1.
# plot num_of_orders vs. sales
plot(sales$num_of_orders, sales$sales_total,
     main="Number of Orders vs. Sales")
[Figure 3-1: Scatterplot of sales$num_of_orders (x-axis) against sales$sales_total (y-axis)]
Each point corresponds to the number of orders and the total sales for each customer. The plot indicates
that the annual sales are proportional to the number of orders placed. Although the observed relationship
between these two variables is not purely linear, the analyst decided to apply linear regression using the
lm() function as a first step in the modeling process.
results <- lm(sales$sales_total ~ sales$num_of_orders)
results

Call:
lm(formula = sales$sales_total ~ sales$num_of_orders)

Coefficients:
        (Intercept)  sales$num_of_orders
             -154.1                166.2
The resulting intercept and slope values are -154.1 and 166.2, respectively, for the fitted linear equation.
However, results stores considerably more information that can be examined with the summary() function. Details on the contents of results are examined by applying the attributes() function.
Because regression analysis is presented in more detail later in the book, the reader should not overly focus
on interpreting the following output.
summary(results)

Call:
lm(formula = sales$sales_total ~ sales$num_of_orders)

Residuals:
    Min      1Q  Median      3Q     Max
 -666.5  -125.5   -26.7    86.6  4103.4
The summary() function is an example of a generic function. A generic function is a group of functions sharing the same name but behaving differently depending on the number and the type of arguments they receive. Utilized previously, plot() is another example of a generic function; the plot is determined by the passed variables. Generic functions are used throughout this chapter and the book. In the final portion of the example, the following R code uses the generic function hist() to generate a histogram (Figure 3-2) of the residuals stored in results. The function call illustrates that optional parameter values can be passed. In this case, the number of breaks is specified to observe the large residuals.
FIGURE 3-2 Evidence of large residuals (histogram of results$residuals)
This simple example illustrates a few of the basic model planning and building tasks that may occur
in Phases 3 and 4 of the Data Analytics Lifecycle. Throughout this chapter, it is useful to envision how the
presented R functionality will be used in a more comprehensive analysis.
[Figure 3-3: The RStudio GUI, showing the Scripts, Workspace, Plots, and Console panes]
• Plots: Displays the plots generated by the R code and provides a straightforward mechanism to export the plots
Additionally, the console pane can be used to obtain help information on R. Figure 3-4 illustrates that
by entering ? lm at the console prompt, the help details of the lm ( ) function are provided on the right.
Alternatively, help ( lm ) could have been entered at the console prompt.
Functions such as edit() and fix() allow the user to update the contents of an R variable. Alternatively, such changes can be implemented with RStudio by selecting the appropriate variable from the workspace pane.
R allows one to save the workspace environment, including variables and loaded libraries, into an .Rdata file using the save.image() function. An existing .Rdata file can be loaded using the load() function. Tools such as RStudio prompt the user for whether the developer wants to save the workspace contents prior to exiting the GUI.
The reader is encouraged to install R and a preferred GUI to try out the R examples provided in the book and utilize the help functionality to access more details about the discussed topics.
[Figure 3-4: Accessing help in RStudio; entering ?lm (or help(lm)) at the console opens the 'Fitting Linear Models' help page for the lm() function]
R uses a forward slash {!) as the separator character in the directory and file paths. This convention
makes script Iiies somewhat more portable at the expense of some initial confusion on the part of Windows
users, w ho may be accustomed to using a backslash (\) as a separator. To simplify the import of multiple Iiies
with long path names, the setwd () function can be used to set the working directory for the su bsequent
import and export operations, as show n in the follow ing R code.
Other import functions include read.table() and read.delim(), which are intended to import other common file types such as TXT. These functions can also be used to import the yearly_sales.csv file, as the following code illustrates.
The main difference between these import functions is the default values. For example, the read.delim() function expects the column separator to be a tab ("\t"). In the event that the numerical data in a data file uses a comma for the decimal, R also provides two additional functions - read.csv2() and read.delim2() - to import such data. Table 3-1 includes the expected defaults for headers, column separators, and decimal point notations.
# export data as tab-delimited without the row names
write.table(sales, "sales_modified.txt", sep="\t", row.names=FALSE)
Sometimes it is necessary to read data from a database management system (DBMS). R packages such as DBI [6] and RODBC [7] are available for this purpose. These packages provide database interfaces for communication between R and DBMSs such as MySQL, Oracle, SQL Server, PostgreSQL, and Pivotal Greenplum. The following R code demonstrates how to install the RODBC package with the install.packages() function. The library() function loads the package into the R workspace. Finally, a connector (conn) is initialized for connecting to a Pivotal Greenplum database training2 via open database connectivity (ODBC) with user user. The training2 database must be defined either in the /etc/ODBC.ini configuration file or using the Administrative Tools under the Windows Control Panel.
install . packages ( "RODBC" )
library(RODBC)
conn <- odbcConnect ("t r aining2", uid="user" , pwd= "passwor d " )
Th e con nector needs to be present to su bmit a SQL query to an ODBC database by using the
sq l Qu ery () function from the RODBC package. The following Rcode retrieves specific columns from
the housi ng table in which household income (h inc ) is greater than $1,000,000.
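The query itself is not preserved in this excerpt; a minimal sketch, in which the selected column names other than hinc (serialno, state, persons, rooms) are illustrative assumptions:
# retrieve selected columns for households with income above $1,000,000
housing_data <- sqlQuery(conn, "select serialno, state, persons, rooms
                                from housing
                                where hinc > 1000000")
head(housing_data)   # inspect the first few returned rows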
Although plots can be saved using the RStudio GUI, plots can also be saved using R code by specifying
the appropriate graphic devices. Using the jpeg() function, the following R code creates a new JPEG
file, adds a histogram plot to the file, and then closes the file. Such techniques are useful when automating
standard reports. Other functions, such as png(), bmp(), pdf(), and postscript(), are available
in R to save plots in the desired format.
jpeg(file="c:/data/sales_hist.jpeg")   # create a new jpeg file
hist(sales$num_of_orders)              # export histogram to jpeg
dev.off()                              # shut off the graphic device
More information on data imports and exports can be found at http://cran.r-project.org/
doc/manuals/r-release/R-data.html, such as how to import data sets from statistical software
packages including Minitab, SAS, and SPSS.
Nominal: The values represent labels that distinguish one from another.
Ordinal: The attribute values imply a sequence.
Interval: The difference between two values is meaningful (operations such as + and -).
Ratio: Both the difference and the ratio of two values are meaningful (operations such as x and / in addition to + and -).
Data of one attribute type may be converted to another. For example, the quality of diamonds {Fair,
Good, Very Good, Premium, Ideal} is considered ordinal but can be converted to nominal {Good, Excellent}
with a defined mapping. Similarly, a ratio attribute like Age can be converted into an ordinal attribute such
as {Infant, Adolescent, Adult, Senior}. Understanding the attribute types in a given dataset is important
to ensure that the appropriate descriptive statistics and analytic methods are applied and properly
interpreted. For example, the mean and standard deviation of U.S. postal ZIP codes are not very meaningful or
appropriate. Proper handling of categorical variables will be addressed in subsequent chapters. Also, it is
useful to consider these attribute types during the following discussion on R data types.
R provides several functions, such as class() and typeof(), to examine the characteristics of a
given variable. The class() function represents the abstract class of an object. The typeof() function
determines the way an object is stored in memory. Although i appears to be an integer, i is internally
stored using double precision. To improve the readability of the code segments in this section, the inline
R comments are used to explain the code or to provide the returned values.
class(i) # returns "numeric"
typeof(i) # returns "double"
class(flag)    # returns "logical"
typeof(flag)   # returns "logical"
Additional R functions exist that can test the variables and coerce a variable into a specific type. The
following R code illustrates how to test if i is an integer using the is.integer() function and to coerce
i into a new integer variable, j, using the as.integer() function. Similar functions can be applied
for double, character, and logical types.
is.integer(i) # returns FALSE
j <- as.integer(i) # coerces contents of i into an integer
is.integer(j) # returns TRUE
The application of the length() function reveals that the created variables each have a length of 1.
One might have expected the returned length of sport to have been 8 for each of the characters in the
string "football". However, these three variables are actually one-element vectors.
length(i)      # returns 1
length(flag) # returns 1
length(sport) # returns 1 (not 8 for "football")
Vectors
Vectors are a basic building block for data in R. As seen previously, simple R variables are actually vectors.
A vector can only consist of values in the same class. The tests for vectors can be conducted using the
is.vector() function.
is.vector(i)       # returns TRUE
is.vector(flag)    # returns TRUE
is.vector(sport)   # returns TRUE
R provides functionality that enables the easy creation and manipulation of vectors. The following R
code illustrates how a vector can be created using the combine function, c(), or the colon operator, :,
to build a vector from the sequence of integers from 1 to 5. Furthermore, the code shows how the values
of an existing vector can be easily modified or accessed. The code, related to the z vector, indicates how
logical comparisons can be built to extract certain elements of a given vector.
u <- c("red", "yellow", "blue")   # create a vector "red" "yellow" "blue"
u                  # returns "red" "yellow" "blue"
u[1]               # returns "red" (1st element in u)
v <- 1:5           # create a vector 1 2 3 4 5
v                  # returns 1 2 3 4 5
sum(v)             # returns 15
w <- v * 2         # create a vector 2 4 6 8 10
w                  # returns 2 4 6 8 10
w[3]               # returns 6 (the 3rd element of w)
z <- v + w         # sums two vectors element by element
z                  # returns 3 6 9 12 15
z > 8              # returns FALSE FALSE TRUE TRUE TRUE
z[z > 8]           # returns 9 12 15
z[z > 8 | z < 5]   # returns 3 9 12 15 ("|" denotes "or")
Sometimes it is necessary to initialize a vector of a specific length and then populate the content of
the vector later. The vector() function, by default, creates a logical vector. A vector of a different type
can be specified by using the mode parameter. The vector c, an integer vector of length 0, may be useful
when the number of elements is not initially known and the new elements will later be added to the end
of the vector as the values become available.
a <- vector(length=3) # create a logical vector of length 3
a # returns FALSE FALSE FALSE
b <- vector(mode="numeric", 3)    # create a numeric vector of length 3
typeof(b) # returns "double"
b[2] <- 3.1 #assign 3.1 to the 2nd element
b # returns 0.0 3.1 0.0
c <- vector(mode="integer", 0)    # create an integer vector of length 0
c            # returns integer(0)
length(c)    # returns 0
Although vectors may appear to be analogous to arrays of one dimension, they are technically
dimensionless, as seen in the following R code. The concept of arrays and matrices is addressed in the following
discussion.
length(b)    # returns 3
dim(b)       # returns NULL (an undefined value)
[Console output of an array example: two 3 x 4 slices of zeros, with the value 158000 assigned to element [2,1] of the first slice]
A two-dimensional array is known as a matrix. The following code initializes a matrix to hold the
quarterly sales for the three regions. The parameters nrow and ncol define the number of rows and columns,
respectively, for the sales_matrix.
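The initialization itself is not shown in this excerpt; a minimal sketch consistent with the description (the use of zeros as initial values is an assumption):
# 3 regions (rows) by 4 quarters (columns), initialized to zero
sales_matrix <- matrix(0, nrow = 3, ncol = 4)
sales_matrix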
R provides the standard matrix operations such as addition, subtraction, and multiplication, as well
as the transpose function t() and the inverse matrix function matrix.inverse() included in the
matrixcalc package. The following R code builds a 3 x 3 matrix, M, and multiplies it by its inverse to
obtain the identity matrix.
library(matrixcalc)
M <- matrix(c(1,3,3,5,0,4,3,3,3), nrow=3, ncol=3)   # build a 3x3 matrix
M %*% matrix.inverse(M)                             # multiply M by its inverse
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
Data Frames
Similar to the concept of matrices, data frames provide a structure for storing and accessing several variables
of possibly different data types. In fact, as the i s . d ata . fr a me () function indicates, a data frame was
created by the r e ad . csv () function at the beginning of the chapter.
r.import a CSV :ile of the total annual sales :or each customer
s ales < - read . csv ("c : / data/ ye arly_s a l es . c sv" )
i s .da t a . f r ame (sal es ) ~ t·eturns TRUE
As seen earlier, the variables stored in the data frame can be easily accessed using the $ notation. The
following R code illustrates that in this example, each variable is a vector with the exception of gende r ,
which was, by a read . csv () default, imported as a factor. Discussed in detail later in this section, a factor
denotes a categorical variable, typically with a few finite levels such as "F" and "M " in the case of gender.
Because of their flexibility to handle many data types, data frames are the preferred input format for
many ofthe modeling functions available in R. The foll owing use of the s t r () function provides the
structure of the sal es data frame. This function identifi es the integer and numeric (double) data types,
the factor variables and levels, as well as the first few values for each variable.
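In this case the call is just the following (its output depends on the imported file and is not reproduced here):
str(sales)    # display the structure and first few values of each variable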
In the simplest sense, data frames are lists of variables of the same length. A subset of the data frame
can be retrieved through subsetting operators. R's subsetting operators are powerful in that they allow
one to express complex operations in a succinct fashion and easily retrieve a subset of the dataset.
sales$gender
# retrieve the first two rows of the data frame
sales[1:2,]
# retrieve the first, third, and fourth columns
sales[,c(1,3,4)]
# retrieve both the cust_id and the sales_total columns
sales[,c("cust_id", "sales_total")]
# retrieve all the records whose gender is female
sales[sales$gender=="F",]
The following R code shows that the class of the sales variable is a data frame. However, the type of
the sales variable is a list. A list is a collection of objects that can be of various types, including other lists.
class(sales)
"data.frame"
typeof(sales)
"list"
Lists
Lists can contain any type of objects, including other lists. Using the vector v and the matrix M created in
earlier examples, the following R code creates assortment, a list of different object types.
# build an assorted list of a string, a numeric, a list, a vector,
# and a matrix
housing <- list("own", "rent")
assortment <- list("football", 7.5, housing, v, M)
assortment
[[1]]
[1] "football"

[[2]]
[1] 7.5

[[3]]
[[3]][[1]]
[1] "own"

[[3]][[2]]
[1] "rent"

[[4]]
[1] 1 2 3 4 5

[[5]]
     [,1] [,2] [,3]
[1,]    1    5    3
[2,]    3    0    3
[3,]    3    4    3
In displaying the contents of assortment, the use of the double brackets, [[]], is of particular
importance. As the following R code illustrates, the use of the single set of brackets only accesses an item
in the list, not its content.
# examine the fifth object, M, in the list
class(assortment[5])     # returns "list"
length(assortment[5])    # returns 1
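For contrast, and as an assumed continuation of the example, double brackets reach the content itself:
class(assortment[[5]])    # returns the matrix class of M itself
length(assortment[[5]])   # returns 9, the number of elements in the 3x3 matrix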
As presented earlier in the data frame discussion, the str() function offers details about the structure
of a list.
str(assortment)
List of 5
 $ : chr "football"
 $ : num 7.5
 $ :List of 2
  ..$ : chr "own"
  ..$ : chr "rent"
 $ : int [1:5] 1 2 3 4 5
 $ : num [1:3, 1:3] 1 3 3 5 0 4 3 3 3
Factors
Factors were briefly introduced during the discussion of the gender variable in the data frame sales.
In this case, gender could assume one of two levels: F or M. Factors can be ordered or not ordered. In the
case of gender, the levels are not ordered.
class(sales$gender) # returns "factor"
is.ordered(sales$gender) # returns FALSE
Included with the ggplot2 package, the diamonds data frame contains three ordered factors.
Examining the cut factor, there are five levels in order of improving cut: Fair, Good, Very Good, Premium,
and Ideal. Thus, sales$gender contains nominal data, and diamonds$cut contains ordinal data.
head(sales$gender) # display first six values and the levels
F F M M F F
Levels: F M
library(ggplot2)
data(diamonds) # load the data frame into the R workspace
str(diamonds)
'data.frame': 53940 obs. of 10 variables:
$ carat num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 ...
$ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 ...
$ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 ...
$ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 ...
$ depth  : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
$ table  : num 55 61 65 58 58 57 57 55 61 61 ...
$ price  : int 326 326 327 334 335 336 336 337 337 338 ...
$ x      : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
$ y      : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
$ z      : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
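The excerpt does not show how sales_group, used in the next code block, is constructed; a minimal sketch, in which the dollar thresholds for the three groups are illustrative assumptions:
# categorize each customer by annual sales amount
sales_group <- vector(mode="character", length=length(sales$sales_total))
sales_group[sales$sales_total < 100] <- "small"
sales_group[sales$sales_total >= 100 & sales$sales_total < 500] <- "medium"
sales_group[sales$sales_total >= 500] <- "big"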
# create and add the ordered factor to the sales data frame
spender<- factor(sales_group,levels=c("small", "medium", "big"),
ordered = TRUE)
sales <- cbind(sales,spender)
str(sales$spender)
Ord.factor w/ 3 levels "small"<"medium"<..: 3 2 1 2 3 1 1 1 2 1 ...
head(sales$spender)
big medium small medium big small
Levels: small < medium < big
The cbind() function is used to combine variables column-wise. The rbind() function is used
to combine datasets row-wise. The use of factors is important in several R statistical modeling functions,
such as analysis of variance, aov(), presented later in this chapter, and the use of contingency tables,
discussed next.
Contingency Tables
In R, table refers to a class of objects used to store the observed counts across the factors for a given dataset.
Such a table is commonly referred to as a contingency table and is the basis for performing a statistical
test on the independence of the factors used to build the table. The following R code builds a contingency
table based on the sales$gender and sales$spender factors.
# build a contingency table based on the gender and spender factors
sales_table <- table(sales$gender, sales$spender)
sales_table
small medium big
F 1726 2746 563
M 1656 2723 586
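The chi-squared test described in the next paragraph is not shown in the excerpt; a minimal sketch, using the sales_table object just built:
# test the independence of the gender and spender factors
summary(sales_table)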
Based on the observed counts in the table, the summary() function performs a chi-squared test
on the independence of the two factors. Because the reported p-value is greater than 0.05, the assumed
independence of the two factors is not rejected. Hypothesis testing and p-values are covered in more detail
later in this chapter. Next, applying descriptive statistics in R is examined.
The following code provides some common R functions that include descriptive statistics. In
parentheses, the comments describe the functions.
# to simplify the function calls, assign the sales columns to x and y
x <- sales$sales_total
y <- sales$num_of_orders
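The rest of that listing does not survive in the excerpt; a sketch of the kinds of calls it refers to, applied to the x vector defined above (the exact set of functions shown is an assumption):
summary(x)    # summary statistics (min, quartiles, mean, max)
mean(x)       # arithmetic mean
median(x)     # median
range(x)      # minimum and maximum
var(x)        # sample variance
sd(x)         # standard deviation
IQR(x)        # interquartile range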
The IQR() function provides the difference between the third and the first quartiles. The other
functions are fairly self-explanatory by their names. The reader is encouraged to review the available help files
for acceptable inputs and possible options.
The function apply() is useful when the same function is to be applied to several variables in a data
frame. For example, the following R code calculates the standard deviation for the first three variables in
sales. In the code, setting MARGIN=2 specifies that the sd() function is applied over the columns.
Other functions, such as lapply() and sapply(), apply a function to a list or vector. Readers can refer
to the R help files to learn how to use these functions.
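The apply() call itself is omitted here; a minimal sketch, assuming the first three columns of sales are the numeric ones referred to:
# compute the standard deviation of the first three variables in sales
apply(sales[, c(1:3)], MARGIN = 2, FUN = sd)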
Additional descriptive statistics can be applied with user-defined functions. The following R code
defines a function, my_range(), to compute the difference between the maximum and minimum values
returned by the range() function. In general, user-defined functions are useful for any task or operation
that needs to be frequently repeated. More information on user-defined functions is available by entering
help("function") in the console.
# build a function to provide the difference between
# the maximum and the minimum values
my_range <- function(v) {range(v)[2] - range(v)[1]}
my_range(x)
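The data frame data that is summarized and plotted next is not constructed anywhere in this excerpt; a minimal sketch that would produce two roughly linearly related variables (the sample size and noise level are assumptions):
# reuse x and y for a new, small synthetic example
x <- rnorm(50)
y <- x + rnorm(50, mean = 0, sd = 0.5)
data <- as.data.frame(cbind(x, y))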
summary(data)    # summary statistics of x and y
A useful way to detect patterns and anomalies in the data is through exploratory data analysis with
visualization. Visualization gives a succinct, holistic view of the data that may be difficult to grasp from the
numbers and summaries alone. Variables x and y of the data frame data can instead be visualized in a
scatterplot (Figure 3-5), which easily depicts the relationship between two variables. An important facet
of the initial data exploration, visualization assesses data cleanliness and suggests potentially important
relationships in the data prior to the model planning and building phases.
[FIGURE 3-5 Scatterplot of X and Y]
summary(data)
library(ggplot2)
ggplot(data, aes(x=x, y=y)) +
  geom_point(size=2) +
  ggtitle("Scatterplot of X and Y") +
  theme(axis.text=element_text(size=12),
        axis.title=element_text(size=14),
        plot.title=element_text(size=20, face="bold"))
Exploratory data analysis [9] is a data analysis approach to reveal the important characteristics of a
dataset, mainly through visualization. This section discusses how to use some basic visualization techniques
and the plotting feature in R to perform exploratory data analysis.
[Table: the x and y values of the four datasets (#1 through #4) in Anscombe's quartet]
The four data sets in Anscombe's quartet have nearly identical statistical properties, as shown in Table 3-3.
[Table 3-3: the shared statistical properties of the four datasets, such as the mean and variance of x and y, their correlation, and the fitted regression line]
Based on the nearly identical statistical properties across each dataset, one might conclude that these
four datasets are quite similar. However, the scatterplots in Figure 3-7 tell a different story. Each dataset is
plotted as a scatterplot, and the fitted lines are the result of applying linear regression models. The estimated
regression line fits Dataset 1 reasonably well. Dataset 2 is definitely nonlinear. Dataset 3 exhibits a linear
trend, with one apparent outlier at x = 13. For Dataset 4, the regression line fits the dataset quite well.
However, with only points at two x values, it is not possible to determine that the linearity assumption is
proper.
[FIGURE 3-7 Anscombe's quartet visualized as scatterplots with fitted regression lines]
The R code for generating Figure 3-7 is shown next. It requires the R package ggplot2 [11], which can
be installed simply by running the command install.packages("ggplot2"). The anscombe
dataset for the plot is included in the standard R distribution. Enter data() for a list of datasets included
in the R base distribution. Enter data(DatasetName) to make a dataset available in the current
workspace.
In the code that follows, variable levels is created using the gl() function, which generates
factors of four levels (1, 2, 3, and 4), each repeating 11 times. Variable mydata is created using the
with(data, expression) function, which evaluates an expression in an environment constructed
from data. In this example, the data is the anscombe dataset, which includes eight attributes:
x1, x2, x3, x4, y1, y2, y3, and y4. The expression part in the code creates a data frame from the
anscombe dataset, and it only includes three attributes: x, y, and the group each data point belongs
to (mygroup).
install.packages("ggplot2")   # not required if package has been installed
data(anscombe)                # load the anscombe dataset into the current workspace
anscombe
   x1 x2 x3 x4    y1   y2    y3    y4
1  10 10 10  8  8.04 9.14  7.46  6.58
2   8  8  8  8  6.95 8.14  6.77  5.76
3  13 13 13  8  7.58 8.74 12.74  7.71
4   9  9  9  8  8.81 8.77  7.11  8.84
5  11 11 11  8  8.33 9.26  7.81  8.47
6  14 14 14  8  9.96 8.10  8.84  7.04
7   6  6  6  8  7.24 6.13  6.08  5.25
8   4  4  4 19  4.26 3.10  5.39 12.50
9  12 12 12  8 10.84 9.13  8.15  5.56
10  7  7  7  8  4.82 7.26  6.42  7.91
11  5  5  5  8  5.68 4.74  5.73  6.89
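The construction of levels and mydata described in the preceding paragraph is not shown in the excerpt; a sketch consistent with that description:
levels <- gl(4, 11)   # four group labels (1-4), each repeated 11 times
# stack the four (x, y) pairs and record the group each point belongs to
mydata <- with(anscombe, data.frame(x = c(x1, x2, x3, x4),
                                    y = c(y1, y2, y3, y4),
                                    mygroup = levels))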
mydata
    x     y mygroup
1  10  8.04       1
2   8  6.95       1
3  13  7.58       1
4   9  8.81       1
...
43  8  7.91       4
44  8  6.89       4
library(ggplot2)
theme_set(theme_bw())   # set plot color theme
# draw the four panels of Figure 3-7, one per group
ggplot(mydata, aes(x, y)) +
  geom_point(size=4) +
  geom_smooth(method="lm", fill=NA, fullrange=TRUE) +
  facet_wrap(~mygroup)
FIGURE 3-8 Age distribution of bank account holders
If the age data is in a vector called age, the graph can be created with the following R script:
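A minimal sketch of such a script (the number of breaks and the labels are assumptions):
# histogram of account holder ages
hist(age, breaks = 100, main = "Age Distribution of Account Holders",
     xlab = "Age", ylab = "Frequency", col = "gray")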
The figure shows that the median age of the account holders is around 40. A few accounts with account
holder age less than 10 are unusual but plausible. These could be custodial accounts or college savings
accounts set up by the parents of young children. These accounts should be retained for future analyses.
However, the left side of the graph shows a huge spike of customers who are zero years old or have
negative ages. This is likely to be evidence of missing data. One possible explanation is that the null age
values could have been replaced by 0 or negative values during the data input. Such an occurrence may
be caused by entering age in a text box that only allows numbers and does not accept empty values. Or it
might be caused by transferring data among several systems that have different definitions for null values
(such as NULL, NA, 0, -1, or -2). Therefore, data cleansing needs to be performed over the accounts with
abnormal age values. Analysts should take a closer look at the records to decide if the missing data should
be eliminated or if an appropriate age value can be determined using other available information for each
of the accounts.
In R, the is.na() function provides tests for missing values. The following example creates a vector
x where the fourth value is not available (NA). The is.na() function returns TRUE at each NA value
and FALSE otherwise.
x <- c(1, 2, 3, NA, 4)
is.na(x)
[1] FALSE FALSE FALSE  TRUE FALSE
Some arithmetic functions, such as mean(), applied to data containing missing values can yield an
NA result. To prevent this, set the na.rm parameter to TRUE to remove the missing value during the
function's execution.
mean(x)
[1] NA
mean(x, na.rm=TRUE)
[1] 2.5
The na.exclude() function returns the object with incomplete cases removed.
DF <- data.frame(x = c(1, 2, 3), y = c(10, 20, NA))
DF
  x  y
1 1 10
2 2 20
3 3 NA
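As an assumed continuation of the example, applying na.exclude() drops the incomplete row:
DF1 <- na.exclude(DF)   # remove rows containing NA values
DF1
  x  y
1 1 10
2 2 20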
Account holders older than 100 may be due to bad data caused by typos. Another possibility is that these
accounts may have been passed down to the heirs of the original account holders without being updated.
In this case, one needs to further examine the data and conduct data cleansing if necessary. The dirty data
could be simply removed or filtered out with an age threshold for future analyses. If removing records is
not an option, the analysts can look for patterns within the data and develop a set of heuristics to attack
the problem of dirty data. For example, wrong age values could be replaced with an approximation based
on the nearest neighbor, that is, the record that is most similar to the record in question based on analyzing
the differences in all the other variables besides age.
Figure 3-9 presents another example of dirty data. The distribution shown here corresponds to the age
of mortgages in a bank's home loan portfolio. The mortgage age is calculated by subtracting the
origination date of the loan from the current date. The vertical axis corresponds to the number of mortgages at
each mortgage age.
FIGURE 3-9 Distribution of mortgage in years since origination from a bank's home loan portfolio
If the data is in a vector called mortgage, Figure 3-9 can be produced by the following R script.
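A minimal sketch of such a script (the break count and labels are assumptions):
# histogram of mortgage ages in years since origination
hist(mortgage, breaks = 10, main = "Portfolio Distribution, Years Since Origination",
     xlab = "Mortgage Age", ylab = "Frequency", col = "gray")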
Figure 3-9 shows that the loans are no more than 10 years old, and these 10-year-old loans have a
disproportionate frequency compared to the rest of the population. One possible explanation is that the
10-year-old loans do not only include loans originated 10 years ago, but also those originated earlier than
that. In other words, the 10 on the x-axis actually means "10 or more". This sometimes happens when data is ported
from one system to another or because the data provider decided, for some reason, not to distinguish loans
that are more than 10 years old. Analysts need to study the data further and decide the most appropriate
way to perform data cleansing.
Data analysts should perform sanity checks against domain knowledge and decide if the dirty data
needs to be eliminated. Consider the task to find out the probability of mortgage loan default. If the
past observations suggest that most defaults occur before about the 4th year and 10-year-old mortgages
rarely default, it may be safe to eliminate the dirty data and assume that the defaulted loans are less than
10 years old. For other analyses, it may become necessary to track down the source and find out the true
origination dates.
Dirty data can occur due to acts of omission. In the sales data used at the beginning of this chapter,
it was seen that the minimum number of orders was 1 and the minimum annual sales amount was $30.02.
Thus, there is a strong possibility that the provided dataset did not include the sales data on all customers,
just the customers who purchased something during the past year.
Function      Purpose
plot(data)    Scatterplot where x is the index and y is the value; suitable for low-volume data
data(mtcars)
dotchart(mtcars$mpg, labels=row.names(mtcars), cex=.7,
         main="Miles Per Gallon (MPG) of Car Models",
         xlab="MPG")
barplot(table(mtcars$cyl), main="Distribution of Car Cylinder Counts",
        xlab="Number of Cylinders")
FIGURE 3-10 (a) Dotchart on the miles per gallon of cars and (b) Barplot on the distribution of car cylinder counts
FIGURE 3-11 (a) Histogram and (b) Density plot of household income
Figure 3-11 (b) shows a density plot of the logarithm of household income values, which emphasizes
the distribution. The income distribution is concentrated in the center portion of the graph. The code to
generate the two plots in Figure 3-11 is provided next. The rug() function creates a one-dimensional
density plot on the bottom of the graph to emphasize the distribution of the observations.
# randomly generate 4000 observations from the log normal distribution
income <- rlnorm(4000, meanlog = 4, sdlog = 0.7)
summary(income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.301  33.720  54.970  70.320  88.800 659.800
income <- 1000*income
summary(income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   4301   33720   54970   70320   88800  659800
# plot the histogram
hist(income, breaks=500, xlab="Income", main="Histogram of Income")
# density plot
plot(density(log10(income), adjust=0.5),
     main="Distribution of Income (log10 scale)")
# add rug to the density plot
rug(log10(income))
In the data preparation phase of the Data Analytics Lifecycle, the data range and distribution can be
obtained. If the data is skewed, viewing the logarithm of the data (if it's all positive) can help detect struc-
tures that might otherwise be overlooked in a graph with a regular, nonlogarithmic scale.
When preparing the data, one should look for signs of dirty data, as explained in the previous section.
Examining if the data is unimodal or multimodal will give an idea of how many distinct populations with
different behavior patterns might be mixed into the overall population. Many modeling techniques assume
that the data follows a normal distribution. Therefore, it is important to know if the available dataset can
match that assumption before applying any of those modeling techniques.
Consider a density plot of diamond prices (in USD). Figure 3-12(a) contains two density plots for pre-
mium and ideal cuts of diamonds. The group of premium cuts is shown in red, and the group of ideal cuts
is shown in blue. The range of diamond prices is wide, in this case ranging from around $300 to almost
$20,000. Extreme values are typical of monetary data such as income, customer value, tax liabilities, and
bank account sizes.
Figure 3-12(b) shows more detail of the diamond prices than Figure 3-12(a) by taking the logarithm. The
two humps in the premium cut represent two distinct groups of diamond prices: One group centers around
log10 price= 2.9 (where the price is about $794), and the other centers around log 10 price= 3.7 (where the
price is about $5,012). The ideal cut contains three humps, centering around 2.9, 3.3, and 3.7 respectively.
The R script to generate the plots in Figure 3-12 is shown next. The diamonds dataset comes with
the ggplot2 package.
library("ggplot2")
data(diamonds) # load the diamonds dataset from ggplot2
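The subsetting step that produces the niceDiamonds data frame used below is not shown in the excerpt; a sketch consistent with the summary output that follows:
# keep only the premium and ideal cuts
niceDiamonds <- diamonds[diamonds$cut == "Premium" | diamonds$cut == "Ideal", ]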
summary(niceDiamonds$cut)
     Fair      Good Very Good   Premium     Ideal
        0         0         0     13791     21551
FIGURE 3-12 Density plots of (a) diamond prices and (b) the logarithm of diamond prices
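The plotting calls themselves do not survive in the excerpt; a sketch of how the two density plots of Figure 3-12 might be produced with ggplot2 (the transparency setting is an assumption):
# density of price, by cut
ggplot(niceDiamonds, aes(x = price, fill = cut)) +
  geom_density(alpha = 0.3, color = NA)
# density of log10(price), by cut
ggplot(niceDiamonds, aes(x = log10(price), fill = cut)) +
  geom_density(alpha = 0.3, color = NA)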
When examining a scatterplot, one needs to pay close attention
to the possible relationship between the variables. If the functional relationship between the variables is
somewhat pronounced, the data may roughly lie along a straight line, a parabola, or an exponential curve. If the
plot looks more like a cluster without a pattern, the corresponding variables may have a weak relationship.
The scatterplot in Figure 3-13 portrays the relationship of two variables: x and y. The red line shown
on the graph is the fitted line from the linear regression. Linear regression will be revisited in Chapter 6,
"Advanced Analytical Theory and Methods: Regression." Figure 3-13 shows that the regression line does
not fit the data well. This is a case in which linear regression cannot model the relationship between the
variables. Alternative methods such as the loess() function can be used to fit a nonlinear line to the
data. The blue curve shown on the graph represents the LOESS curve, which fits the data better than linear
regression.
[FIGURE 3-13 Scatterplot of y versus x with the fitted linear regression line and the LOESS curve]
The R code to produce Figure 3-13 is as follows. The runif(75, 0, 10) call generates 75 numbers
between 0 and 10 with random deviates, and the numbers conform to the uniform distribution. The
rnorm(75, 0, 20) call generates 75 numbers that conform to the normal distribution, with the mean equal
to 0 and the standard deviation equal to 20. The points() function is a generic function that draws a
sequence of points at the specified coordinates. Parameter type="l" tells the function to draw a solid
line. The col parameter sets the color of the line, where 2 represents the red color and 4 represents the
blue color.
x <- runif(75, 0, 10)   # uniform deviates, as described above
x <- sort(x)
y <- 200 + x^3 - 10 * x^2 + x + rnorm(75, 0, 20)
plot(x, y)
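The lines that add the two fitted curves described above are missing from the excerpt; a sketch consistent with that description:
# fit a linear regression and a LOESS curve to the data
lr <- lm(y ~ x)        # linear regression
poly <- loess(y ~ x)   # LOESS (local regression)
fit <- predict(poly)   # fitted values of the LOESS curve
# draw the fitted regression line (col=2, red) and the LOESS curve (col=4, blue)
points(x, lr$coefficients[1] + lr$coefficients[2] * x, type = "l", col = 2)
points(x, fit, type = "l", col = 4)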
[Figure: dotchart of miles per gallon for each car model in mtcars, grouped by the number of cylinders (4, 6, and 8)]
The barplot in Figure 3-15 visualizes the distribution of car cylinder counts and number of gears. The
x-axis represents the number of cylinders, and the color represents the number of gears. The code to
generate Figure 3-15 is shown next.
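A sketch of such code (the original listing is not preserved here; the colors are assumptions):
# cross-tabulate gears (rows) against cylinders (columns)
counts <- table(mtcars$gear, mtcars$cyl)
barplot(counts, main = "Distribution of Car Cylinder Counts and Gears",
        xlab = "Number of Cylinders", ylab = "Counts",
        col = c("gray80", "gray50", "gray20"),
        legend.text = rownames(counts),
        args.legend = list(title = "Number of Gears"))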
[FIGURE 3-15 Barplot of car cylinder counts colored by number of gears (3, 4, and 5)]
Box-and-Whisker Plot
Box-and-whisker plots show the distribution of a continuous variable for each value of a discrete variable.
The box-and-whisker plot in Figure 3-16 visualizes mean household incomes as a function of region in
the United States. The first digit of the U.S. postal ("ZIP") code corresponds to a geographical region
in the United States. In Figure 3-16, each data point corresponds to the mean household income from a
particular zip code. The horizontal axis represents the first digit of a zip code, ranging from 0 to 9, where
0 corresponds to the northeast region of the United States (such as Maine, Vermont, and Massachusetts),
and 9 corresponds to the southwest region (such as California and Hawaii). The vertical axis represents
the logarithm of mean household incomes. The logarithm is taken to better visualize the distribution
of the mean household incomes.
FIGURE 3-16 A box-and-whisker plot of mean household income and geographical region
In this figure, the scatterplot is displayed beneath the box-and-whisker plot, with some jittering for the
overlap points so that each line of points widens into a strip. The "box" of the box-and-whisker shows the
range that contains the central 50% of the data, and the line inside the box is the location of the median
value. The upper and lower hinges of the boxes correspond to the first and third quartiles of the data. The
upper whisker extends from the hinge to the highest value that is within 1.5 * IQR of the hinge. The lower
whisker extends from the hinge to the lowest value within 1.5 * IQR of the hinge. IQR is the inter-quartile
range, as discussed in Section 3.1.4. The points outside the whiskers can be considered possible outliers.
The graph shows how household income varies by region. The highest median incomes are in region
0 and region 9. Region 0 is slightly higher, but the boxes for the two regions overlap enough that the
difference between the two regions probably is not significant. The lowest household incomes tend to be in
region 7, which includes states such as Louisiana, Arkansas, and Oklahoma.
Assuming a data frame called DF contains two columns (MeanHouseholdIncome and Zip1), the
following R script uses the ggplot2 library [11] to plot a graph that is similar to Figure 3-16.
library(ggplot2)
# plot the jittered scatterplot with the boxplot
# color-code points with zip codes
# the outlier.size parameter prevents the boxplot from plotting the outlier
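The ggplot() call that follows these comments does not survive in the excerpt; a sketch consistent with them (the transparency and jitter settings are assumptions):
ggplot(data = DF, aes(x = as.factor(Zip1), y = log10(MeanHouseholdIncome))) +
  geom_point(aes(color = factor(Zip1)), alpha = 0.2, position = "jitter") +
  geom_boxplot(outlier.size = 0, alpha = 0.1) +
  guides(color = "none") +
  ggtitle("Mean Household Income by Zip Code")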
Alternatively, one can create a simple box-and-whisker plot with the boxplot() function provided
by the R base package.
FIGURE 3-17 (a) Scatterplot and (b) Hexbinplot of household income against years of education
Although color and transparency can be used in a scatterplot to address this issue, a hexbinplot is
sometimes a better alternative. A hexbinplot combines the ideas of scatterplot and histogram. Similar to
a scatterplot, a hexbinplot visualizes data in the x-axis and y-axis. Data is placed into hexbins, and the third
dimension uses shading to represent the concentration of data in each hexbin.
In Figure 3-17(b), the same data is plotted using a hexbinplot. The hexbinplot shows that the data is
more densely clustered in a streak that runs through the center of the cluster, roughly along the regression
line. The biggest concentration is around 12 years of education, extending to about 15 years.
In Figure 3-17, note the outlier data at MeanEducation=O. These data points may correspond to
some missing data that needs further cleansing.
Assuming the two variables MeanHouseholdIncome and MeanEducation are from a data
frame named zcta, the scatterplot of Figure 3-17(a) is plotted by the following R code.
# plot the data points
plot(log10(MeanHouseholdIncome) ~ MeanEducation, data=zcta)
# add a straight fitted line of the linear regression
abline(lm(log10(MeanHouseholdIncome) ~ MeanEducation, data=zcta), col='red')
Using the zcta data frame, the hexbinplot of Figure 3-17(b) is plotted by the following R code.
Running the code requires the use of the hexbin package, which can be installed by running
install.packages("hexbin").
library(hexbin)
# "g" adds the grid, "r" adds the regression line
# sqrt transform on the count gives more dynamic range to the shading
# inv provides the inverse transformation function of trans
hexbinplot(log10(MeanHouseholdIncome) ~ MeanEducation,
           data=zcta, trans = sqrt, inv = function(x) x^2, type = c("g", "r"))
Scatterplot Matrix
A scatterplot matrix shows many scatterplots in a compact, side-by-side fashion. The scatterplot matrix,
therefore, can visually represent multiple attributes of a dataset to explore their relationships, magnify
differences, and disclose hidden patterns.
Fisher's iris dataset [13] includes the measurements in centimeters of the sepal length, sepal width,
petal length, and petal width for 50 flowers from three species of iris. The three species are setosa, versicolor,
and virginica. The iris dataset comes with the standard R distribution.
In Figure 3-18, all the variables of Fisher's iris dataset (sepal length, sepal width, petal length, and
petal width) are compared in a scatterplot matrix. The three different colors represent three species of iris
flowers. The scatterplot matrix in Figure 3-18 allows its viewers to compare the differences across the iris
species for any pairs of attributes.
[FIGURE 3-18 Scatterplot matrix of Fisher's iris dataset: Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width, colored by species]
Consider the scatterplot from the first row and third column of Figure 3-18, where sepal length is
compared against petal length. The horizontal axis is the petal length, and the vertical axis is the sepal length.
The scatterplot shows that versicolor and virginica share similar sepal and petal lengths, although the latter
has longer petals. The petal lengths of all setosa are about the same, and the petal lengths are remarkably
shorter than the other two species. The scatterplot shows that for versicolor and virginica, sepal length
grows linearly with the petal length.
The R code for generating the scatterplot matrix is provided next.
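A sketch of such code using pairs() (only a comment about clipping survives from the original listing; the specific colors and legend coordinates are assumptions):
# define the color scheme for the three species
colors <- c("red", "green3", "blue")
# draw the scatterplot matrix of the four measurements
pairs(iris[1:4], main = "Fisher's Iris Dataset",
      pch = 21, bg = colors[unclass(iris$Species)])
# set graphical parameter to clip plotting to the figure region
par(xpd = TRUE)
# add a legend for the three species
legend(0.2, 0.02, horiz = TRUE, as.vector(unique(iris$Species)),
       fill = colors, bty = "n")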
The vector colors defines the color scheme for the plot. It could be changed to something like
colors <- c("gray50", "white", "black") to make the scatterplots grayscale.
[Figure: time series plot of the number of air passengers versus time, 1949 to 1960]
Additionally, the overall trend is that the number of air passengers steadily increased from 1949 to
1960. Chapter 8, "Advanced Analytical Theory and Methods: Time Series Analysis," discusses the analysis
of such data sets in greater detail.
can be relevant to the downstream analysis. The graph shows that the transformed account values follow
an approximate normal distribution, in the range from $100 to $10,000,000. The median account value is
approximately $30,000 (10^4.5), with the majority of the accounts between $1,000 (10^3) and $1,000,000 (10^6).
[FIGURE 3-20 Density plot of account values on the log10 scale]
Density plots are fairly technical, and they contain so much information that they would be difficult to
explain to less technical stakeholders. For example, it would be challenging to explain why the account
values are in the log 10 scale, and such information is not relevant to stakeholders. The same message can
be conveyed by partitioning the data into log-like bins and presenting it as a histogram. As can be seen in
Figure 3-21, the bulk of the accounts are in the $1,000-$1,000,000 range, with the peak concentration in the
$10-50K range, extending to $500K. This portrayal gives the stakeholders a better sense of the customer
base than the density plot shown in Figure 3-20.
Note that the bin sizes should be carefully chosen to avoid distortion of the data. In this example, the bins
in Figure 3-21 are chosen based on observations from the density plot in Figure 3-20. Without the density
plot, the peak concentration might be just due to the somewhat arbitrary appearing choices for the bin sizes.
This simple example addresses the different needs of two groups of audience: analysts and stakehold-
ers. Chapter 12, "The Endgame, or Putting It All Together," further discusses the best practices of delivering
presentations to these two groups.
Following is the R code to generate the plots in Figure 3-20 and Figure 3-21.
# Generate random log normal income data
income = rlnorm(5000, meanlog=log(40000), sdlog=log(5))
# Create the density plot of account values on the log10 scale (Figure 3-20)
plot(density(log10(income), adjust=0.5),
     main="Distribution of Account Values (log10 scale)")
# add rug to the density plot
rug(log10(income))
# Create "log-like" bins for the histogram (Figure 3-21)
breaks = c(0, 1000, 5000, 10000, 50000, 100000, 5e5, 1e6, 2e7)
bins = cut(income, breaks, include.lowest=T,
           labels = c("< 1K", "1-5K", "5-10K", "10-50K",
                      "50-100K", "100-500K", "500K-1M", "> 1M"))
plot(bins, main = "Distribution of Account Values",
     xlab = "Account value ($ USD)",
     ylab = "Number of Accounts", col="blue")
[FIGURE 3-21 Histogram of account values partitioned into log-like bins]
• Model Evaluation
• Does the model perform better than another candidate model?
• Model Deployment
• Does the model have the desired effect (such as reducing the cost)?
This section discusses some useful statistical tools that may answer these questions.
The basic concept of hypothesis testing is to form an assertion and test it with data. When performing
hypothesis tests, the common assumption is that there is no difference between two samples. This
assumption is used as the default position for building the test or conducting a scientific experiment.
Statisticians refer to this as the null hypothesis (H0). The alternative hypothesis (HA) is that there is a
difference between two samples. For example, if the task is to identify the effect of drug A compared to
drug B on patients, the null hypothesis and alternative hypothesis would be this.
• H0: Drug A does not have a different effect on patients than drug B.
• HA: Drug A has a different effect on patients than drug B.
If the task is to identify whether advertising Campaign C is effective on reducing customer churn, the
null hypothesis and alternative hypothesis would be as follows.
• H0: Campaign C does not reduce customer churn better than the current campaign method.
• HA: Campaign C does reduce customer churn better than the current campaign.
It is important to state the null hypothesis and alternative hypothesis, because misstating them is likely
to undermine the subsequent steps of the hypothesis testing process. A hypothesis test leads to either
rejecting the null hypothesis in favor of the alternative or not rejecting the null hypothesis.
Table 3-5 includes some examples of null and alternative hypotheses for questions that arise during
the analytic lifecycle.
Regression Modeling. H0: This variable does not affect the outcome because its coefficient is zero. HA: This variable affects the outcome because its coefficient is not zero.
Once a model is built over the training data, it needs to be evaluated over the testing data to see if the
proposed model predicts better than the existing model currently being used. The null hypothesis is that
the proposed model does not predict better than the existing model. The alternative hypothesis is that
the proposed model indeed predicts better than the existing model. In an accuracy forecast, the null model
could be that the sales of the next month are the same as the prior month. The hypothesis test needs to
evaluate if the proposed model provides a better prediction. Take a recommendation engine as an example.
The null hypothesis could be that the new algorithm does not produce better recommendations than the
current algorithm being deployed. The alternative hypothesis is that the new algorithm produces better
recommendations than the old algorithm.
When evaluating a model, sometimes it needs to be determined if a given input variable improves the
model. In regression analysis (Chapter 6), for example, this is the same as asking if the regression coefficient
for a variable is zero. The null hypothesis is that the coefficient is zero, which means the variable does not
have an impact on the outcome. The alternative hypothesis is that the coefficient is nonzero, which means
the variable does have an impact on the outcome.
A common hypothesis test is to compare the means of two populations. Two such hypothesis tests are
discussed in Section 3.3.2.
• H0: μ1 = μ2
• HA: μ1 ≠ μ2
Here μ1 and μ2 denote the population means of pop1 and pop2, respectively.
The basic testing approach is to compare the observed sample means, X̄1 and X̄2, corresponding to each
population. If the values of X̄1 and X̄2 are approximately equal to each other, the distributions of X̄1 and
X̄2 overlap substantially (Figure 3-23), and the null hypothesis is supported. A large observed difference
between the sample means indicates that the null hypothesis should be rejected. Formally, the difference
in means can be tested using Student's t-test or Welch's t-test.
Student's t-test
Student's t-test assumes that distributions of the two populations have equal but unknown
variances. Suppose n1 and n2 samples are randomly and independently selected from two populations,
pop1 and pop2, respectively. If each population is normally distributed with the same mean (μ1 = μ2) and
with the same variance, then T (the t-statistic), given in Equation 3-1, follows a t-distribution with
n1 + n2 - 2 degrees of freedom (df).

T = \frac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \quad \text{where} \quad S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}    (3-1)
The shape of the t-distribution is similar to the normal distribution. In fact, as the degrees of freedom
approaches 30 or more, the t-distribution is nearly identical to the normal distribution. Because the
numerator of T is the difference of the sample means, if the observed value of T is far enough from zero such that
the probability of observing such a value of T is unlikely, one would reject the null hypothesis that the
population means are equal. Thus, for a small probability, say α = 0.05, T* is determined such that
P(|T| ≥ T*) = 0.05. After the samples are collected and the observed value of T is calculated according to
Equation 3-1, the null hypothesis (μ1 = μ2) is rejected if |T| ≥ T*.
In hypothesis testing, in general, the small probability, α, is known as the significance level of the test.
The significance level of the test is the probability of rejecting the null hypothesis, when the null hypothesis
is actually TRUE. In other words, for α = 0.05, if the means from the two populations are truly equal, then
in repeated random sampling, the observed magnitude of T would only exceed T* 5% of the time.
In the following R code example, 10 and 20 observations are randomly selected from two normally distributed
populations and assigned to the variables x and y, respectively. The two populations have a mean of 100 and 105,
respectively, and a standard deviation equal to 5. Student's t-test is then conducted to determine if the
obtained random samples support the rejection of the null hypothesis.
# generate random observations from the two populations
x <- rnorm(10, mean=100, sd=5)    # normal distribution centered at 100
y <- rnorm(20, mean=105, sd=5)    # normal distribution centered at 105
t.test(x, y, var.equal=TRUE)      # run the Student's t-test
data: x and y
t = -1.7828, df = 28, p-value = 0.08547
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.1611557 0.4271393
sample estimates:
mean of x mean of y
102.2136 105.0806
From the R output, the observed value of T is t = -1.7828. The negative sign is due to the fact that the
sample mean of x is less than the sample mean of y. Using the qt() function in R, a T value of 2.0484
corresponds to a 0.05 significance level.
# obtain t value for a two-sided test at a 0.05 significance level
qt(p=0.05/2, df=28, lower.tail=FALSE)
2.048407
Because the magnitude of the observed T statistic is less than the T value corresponding to the 0.05
significance level (|-1.7828| < 2.0484), the null hypothesis is not rejected. Because the alternative hypothesis
is that the means are not equal (μ1 ≠ μ2), the possibilities of both μ1 > μ2 and μ1 < μ2 need to be considered.
This form of Student's t-test is known as a two-sided hypothesis test, and it is necessary for the sum of the
probabilities under both tails of the t-distribution to equal the significance level. It is customary to evenly
divide the significance level between both tails. So, p = 0.05/2 = 0.025 was used in the qt() function to
obtain the appropriate t-value.
To simplify the comparison of the t-test results to the significance level, the R output includes a quantity
known as the p-value. In the preceding example, the p-value is 0.08547, which is the sum of P(T ≤ -1.7828)
and P(T ≥ 1.7828). Figure 3-24 illustrates the t-statistic for the area under the tail of a t-distribution. The -t
and t are the observed values of the t-statistic. In the R output, t = 1.7828. The left shaded area corresponds
to P(T ≤ -1.7828), and the right shaded area corresponds to P(T ≥ 1.7828).
FIGURE 3-24 Area under the tails (shaded) of a Student's t-distribution
In the R output, for a significance level of 0.05, the null hypothesis would not be rejected because the
likelihood of a T value of magnitude 1.7828 or greater would occur at higher probability than 0.05. However,
based on the p-value, if the significance level was chosen to be 0.10, instead of 0.05, the null hypothesis
would be rejected. In general, the p-value offers the probability of observing such a sample result given
the null hypothesis is TRUE.
A key assumption in using Student's t-test is that the population variances are equal. In the previous
example, the t.test() function call includes var.equal=TRUE to specify that equality of the variances
should be assumed. If that assumption is not appropriate, then Welch's t-test should be used.
T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}    (3-2)

where X̄i, Si^2, and ni correspond to the i-th sample mean, sample variance, and sample size. Notice that
Welch's t-test uses the sample variance (Si^2) for each population instead of the pooled sample variance.
In Welch's test, under the remaining assumptions of random samples from two normal populations with
the same mean, the distribution of T is approximated by the t-distribution. The following R code performs
Welch's t-test on the same set of data analyzed in the earlier Student's t-test example.
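The call producing the output below is not preserved in the excerpt; in R it amounts to running t.test() without assuming equal variances:
t.test(x, y, var.equal=FALSE)   # Welch's t-test on the same samples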
data: x and y
t = -1.6596, df = 15.118, p-value = 0.1176
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.546629 0.812663
sample estimates:
mean of x mean of y
102.2136 105.0806
In this particular example of using Welch's t-test, the p-value is 0.1176, which is greater than the p-value
of 0.08547 observed in the Student's t-test example. In this case, the null hypothesis would not be rejected
at a 0.10 or 0.05 significance level.
It should be noted that the degrees of freedom calculation is not as straightforward as in the Student's
t-test. In fact, the degrees of freedom calculation often results in a non-integer value, as in this example.
The degrees of freedom for Welch's t-test is defined in Equation 3-3.
df = \frac{\left(\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}\right)^2}{\frac{\left(S_1^2/n_1\right)^2}{n_1 - 1} + \frac{\left(S_2^2/n_2\right)^2}{n_2 - 1}}    (3-3)
In both the Student's and Welch's t-test examples, the R output provides 95% confidence intervals on
the difference of the means. In both examples, the confidence intervals straddle zero. Regardless of the
result of the hypothesis test, the confidence interval provides an interval estimate of the difference of the
population means, not just a point estimate.
A confidence interval is an interval estimate of a population parameter or characteristic based on
sample data. A confidence interval is used to indicate the uncertainty of a point estimate. If x̄ is the estimate
of some unknown population mean μ, the confidence interval provides an idea of how close x̄ is to the
unknown μ. For example, a 95% confidence interval for a population mean straddles the TRUE, but
unknown mean 95% of the time. Consider Figure 3-25 as an example. Assume the confidence level is 95%.
If the task is to estimate the mean of an unknown value μ in a normal distribution with known standard
deviation σ and the estimate based on n observations is x̄, then the interval x̄ ± 2σ/√n straddles the unknown
value of μ with about a 95% chance. If one takes 100 different samples and computes the 95%
confidence interval for the mean, 95 of the 100 confidence intervals will be expected to straddle the population
mean μ.
FIGURE 3-25 A 95% confidence interval straddling the unknown population mean μ
Confidence intervals appear again in Section 3.3.6 on ANOVA. Returning to the discussion of hypothesis
testing, a key assumption in both the Student's and Welch's t-test is that the relevant population
attribute is normally distributed. For non-normally distributed data, it is sometimes possible to transform
the collected data to approximate a normal distribution. For example, taking the logarithm of a dataset
can often transform skewed data to a dataset that is at least symmetric around its mean. However, if such
transformations are ineffective, there are tests like the Wilcoxon rank-sum test that can be applied to see
if two population distributions are different.
significance of the observed rank-sums. The following R code performs the test on the same dataset used
for the previous t-test.
wilcox.test(x, y)   # Wilcoxon rank-sum test
The wilcox.test() function ranks the observations, determines the respective rank-sums cor-
responding to each population's sample, and then determines the probability of such rank-sums of such
magnitude being observed assuming that the population distributions are identical. In this example, the
probability is given by the p-value of 0.04903. Thus, the null hypothesis would be rejected at a 0.05 sig-
nificance level. The reader is cautioned against interpreting that one hypothesis test is clearly better than
another test based solely on the examples given in this section.
Because the Wilcoxon test does not assume anything about the population distribution, it is generally
considered more robust than the t-test. In other words, there are fewer assumptions to violate. However,
when it is reasonable to assume that the data is normally distributed, Student's or Welch's t-test is an
appropriate hypothesis test to consider.
• A type I error is the rejection of the null hypothesis when the null hypothesis is TRUE. The probability
of the type I error is denoted by the Greek letter α.
• A type II error is the acceptance of a null hypothesis when the null hypothesis is FALSE. The probability
of the type II error is denoted by the Greek letter β.
Table 3-6 lists the four possible states of a hypothesis test, including the two types of errors.
If H0 is true, accepting H0 is the correct decision and rejecting H0 is a type I error. If H0 is false, accepting H0 is a type II error and rejecting H0 is the correct decision.
The significance level, as mentioned in the Student's t-test discussion, is equivalent to the type I error.
For a significance level such as α = 0.05, if the null hypothesis (μ1 = μ2) is TRUE, there is a 5% chance that
the observed T value based on the sample data will be large enough to reject the null hypothesis. By
selecting an appropriate significance level, the probability of committing a type I error can be defined before
any data is collected or analyzed.
The probability of committing a Type II error is somewhat more difficult to determine. If two population
means are truly not equal, the probability of committing a type II error will depend on how far apart the
means truly are. To reduce the probability of a type II error to a reasonable level, it is often necessary to
increase the sample size. This topic is addressed in the next section.
FIGURE 3-26 A larger sample size better identifies a fixed effect size
With a large enough sample size, almost any effect size can appear statistically significant. However, a
very small effect size may be useless in a practical sense. It is important to consider an appropriate effect
size for the problem at hand.
3.3.6 ANOVA
The hypothesis tests presented in the previous sections are good for analyzing means between two popu-
lations. But what if there are more than two populations? Consider an example of testing the impact of
nutrition and exercise on 60 candidates between age 18 and 50. The candidates are randomly split into six
groups, each assigned with a different weight loss strategy, and the goal is to determine which strategy
is the most effective.
o Group 1 only eats junk food.
o Group 2 only eats healthy food.
o Group 3 eats junk food and does cardio exercise every other day.
o Group 4 eats healthy food and does cardio exercise every other day.
o Group 5 eats junk food and does both cardio and strength training every other day.
o Group 6 eats healthy food and does both cardio and strength training every other day.
Multiple t-tests could be applied to each pair of weight loss strategies. In this example, the weight loss
of Group 1 is compared with the weight loss of Group 2, 3, 4, 5, or 6. Similarly, the weight loss of Group 2 is
compared with that of the next 4 groups. Therefore, a total of 15 t-tests would be performed.
However, multiple t-tests may not perform well on several populations for two reasons. First, because the number of t-tests increases as the number of groups increases, analysis using multiple t-tests becomes
cognitively more difficult. Second, by doing a greater number of analyses, the probability of committing
at least one type I error somewhere in the analysis greatly increases.
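To make the second concern concrete, the short calculation below counts the pairwise comparisons among six groups and, under the simplifying assumption that the 15 tests are independent and each uses a 0.05 significance level, estimates the chance of at least one false positive.

choose(6, 2)        # number of pairwise comparisons: 15
1 - (1 - 0.05)^15   # probability of at least one type I error: about 0.54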
Analysis of Variance (ANOVA) is designed to address these issues. ANOVA is a generalization of the hypothesis testing of the difference of two population means. ANOVA tests if any of the population means
differ from the other population means. The null hypothesis of ANOVA is that all the population means are
equal. The alternative hypothesis is that at least one pair of the population means is not equal. In other
words,
o H0: μ1 = μ2 = ··· = μn
o HA: μi ≠ μj for at least one pair of i, j
As seen in Section 3.3.2, "Difference of Means," each population is assumed to be normally distributed
with the same variance.
The first thing to calculate for the ANOVA is the test statistic. Essentially, the goal is to test whether the clusters formed by each population are more tightly grouped than the spread across all the populations. Let the total number of populations be k. The total number of samples N is randomly split into the k groups. The number of samples in the i-th group is denoted as $n_i$, and the mean of the group is $\bar{x}_i$, where $i \in [1, k]$. The mean of all the samples is denoted as $\bar{x}_0$.
The between-groups mean sum of squares, $S_B^2$, is an estimate of the between-groups variance. It measures how the population means vary with respect to the grand mean, or the mean spread across all the populations. Formally, this is presented as shown in Equation 3-4.

$$S_B^2 = \frac{1}{k-1} \sum_{i=1}^{k} n_i \left(\bar{x}_i - \bar{x}_0\right)^2 \qquad (3-4)$$
The within-group mean sum of squares, $S_W^2$, is an estimate of the within-group variance. It quantifies the spread of values within groups. Formally, this is presented as shown in Equation 3-5.
$$S_W^2 = \frac{1}{N-k} \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left(x_{ij} - \bar{x}_i\right)^2 \qquad (3-5)$$

If $S_B^2$ is much larger than $S_W^2$, then some of the population means are different from each other.
The F-test statistic is defined as the ratio of the between-groups mean sum of squares and the within-group mean sum of squares. Formally, this is presented as shown in Equation 3-6.

$$F = \frac{S_B^2}{S_W^2} \qquad (3-6)$$
The F-test statistic in ANOVA can be thought of as a measure of how different the means are relative to
the variability within each group. The larger the observed F-test statistic, the greater the likelihood that
the differences between the means are due to something other than chance alone. The F-test statistic is used to test the hypothesis that the observed effects are not due to chance; that is, it tests whether the means are significantly different from one another.
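To connect Equations 3-4 through 3-6 to code, the following sketch computes the F-test statistic by hand for a small set of hypothetical observations; the three groups and their values are invented purely for illustration.

# hypothetical data: k = 3 groups of observations
groups <- list(g1 = c(5, 7, 6, 9),
               g2 = c(12, 10, 11, 13),
               g3 = c(6, 8, 7, 5))
k  <- length(groups)                  # number of groups
N  <- sum(lengths(groups))            # total number of samples
x0 <- mean(unlist(groups))            # grand mean

# between-groups mean sum of squares (Equation 3-4)
sb2 <- sum(lengths(groups) * (sapply(groups, mean) - x0)^2) / (k - 1)

# within-group mean sum of squares (Equation 3-5)
sw2 <- sum(sapply(groups, function(g) sum((g - mean(g))^2))) / (N - k)

# F-test statistic (Equation 3-6); compare with the output of summary(aov(...))
F_stat <- sb2 / sw2
F_stat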
Consider an example in which every customer who visits a retail website gets one of two promotional offers
or gets no promotion at all. The goal is to see if making the promotional offers makes a difference. ANOVA
could be used, and the null hypothesis is that neither promotion makes a difference. The code that follows
randomly generates a total of 500 observations of purchase sizes on three different offer options.
offers <- sample(c("offer1", "offer2", "nopromo"), size=500, replace=TRUE)
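The construction of the purchase amounts and the offertest data frame is not reproduced above; the sketch below completes the example with assumed (illustrative) distribution parameters, so it will not recreate the exact summary figures quoted next, and it fits the aov() model referenced in the rest of this section.

# simulate purchase amounts; means and standard deviation are assumptions
purchase_amt <- ifelse(offers == "offer1", rnorm(500, mean = 80, sd = 30),
                ifelse(offers == "offer2", rnorm(500, mean = 90, sd = 30),
                                           rnorm(500, mean = 60, sd = 30)))
offertest <- data.frame(offer = as.factor(offers),
                        purchase_amt = purchase_amt)

# fit the one-way ANOVA model used in the remainder of the example
model <- aov(purchase_amt ~ offers)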
The summary of the offertest data frame shows that 170 offer1, 161 offer2, and 169 nopromo (no promotion) offers have been made. It also shows the range of purchase size (purchase_amt) for each of the three offer options.
# display a summary of offertest where offer="offer1"
summary(offertest[offertest$offer=="offer1",])
     offer     purchase_amt
 nopromo:  0   Min.   :  4.521
 offer1 :170   1st Qu.: 58.158
 offer2 :  0   Median : 76.944
               Mean   : 81.936
               3rd Qu.:104.959
               Max.   :130.507
# display a summary of offertest where offer="offer2"
summary(offertest[offertest$offer=="offer2",])

     offer     purchase_amt
 nopromo:  0   Min.   : 14.04
 offer1 :  0   1st Qu.: 69.46
 offer2 :161   Median : 90.20
               Mean   : 89.09
               3rd Qu.:107.48
               Max.   :154.33
The summary() function shows a summary of the model. The degrees of freedom for offers is 2, which corresponds to the k - 1 in the denominator of Equation 3-4. The degrees of freedom for residuals is 497, which corresponds to the N - k in the denominator of Equation 3-5.
summary(model)
             Df Sum Sq Mean Sq F value Pr(>F)
offers        2 225222  112611   130.6 <2e-16 ***
Residuals   497 428470     862
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The output also includes $S_B^2$ (112,611), $S_W^2$ (862), the F-test statistic (130.6), and the p-value (< 2e-16). The F-test statistic is much greater than 1, and the p-value is much smaller than the 0.05 significance level. Thus, the null hypothesis that the means are equal should be rejected.
However, the result does not show whether offer1 is different from offer2, which requires additional tests. The TukeyHSD() function implements Tukey's Honest Significant Difference (HSD) on all pair-wise tests for difference of means.
TukeyHSD(model)
Tukey multiple comparisons of means
95% family-wise confidence level
$offers
diff lwr upr p adj
offer1-nopromo 40.961437 33.4638483 48.45903 0.0000000
The result includes p-values of pair-wise comparisons of the three offer options. The p-values for offer1-nopromo and offer2-nopromo are equal to 0, smaller than the significance level 0.05. This suggests that both offer1 and offer2 are significantly different from nopromo. A p-value of 0.0692895 for offer2 against offer1 is greater than the significance level 0.05. This suggests that offer2 is not significantly different from offer1.
Because only the influence of one factor (offers) was examined, the presented ANOVA is known as one-way ANOVA. If the goal is to analyze two factors, such as offers and day of week, that would be a two-way ANOVA [16]. If the goal is to model more than one outcome variable, then multivariate ANOVA (or MANOVA) could be used.
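A minimal sketch of such a two-way extension is given below; the day_of_week factor is a hypothetical variable added only to show the form of the model call.

# hypothetical second factor for a two-way ANOVA
day_of_week <- sample(c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
                      size = 500, replace = TRUE)
model2 <- aov(purchase_amt ~ offers * day_of_week)  # main effects plus interaction
summary(model2)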
Summary
R is a popular package and programming language for data exploration, analytics, and visualization. As an introduction to R, this chapter covers the R GUI, data I/O, attribute and data types, and descriptive statistics. This chapter also discusses how to use R to perform exploratory data analysis, including the discovery of dirty data, visualization of one or more variables, and customization of visualization for different audiences. Finally, the chapter introduces some basic statistical methods. The first statistical method presented in the chapter is hypothesis testing. The Student's t-test and Welch's t-test are included as two example hypothesis tests designed for testing the difference of means. Other statistical methods and tools presented in this chapter include confidence intervals, the Wilcoxon rank-sum test, type I and II errors, effect size, and ANOVA.
Exercises
1. How many levels does fdata contain in the following R code?
2. Two vectors, v1 and v2, are created with the following R code:
v1 <- 1:5
v2 <- 6:2
What are the results of cbind(v1, v2) and rbind(v1, v2)?
3. What R command(s) would you use to remove null values from a dataset?
7. An online retailer wants to study the purchase behaviors of its customers. Figure 3-27 shows the density plot of the purchase sizes (in dollars). What would be your recommendation to enhance the plot to detect more structures that otherwise might be missed?
FIGURE 3-27 Density plot of the purchase sizes (in dollars)
8. How many sections does a box-and-whisker plot divide the data into? What are these sections?
9. What attributes are correlated according to Figure 3-18? How would you describe their relationships?
10. What function can be used to fit a nonlinear line to the data?
11. If a graph of data is skewed and all the data is positive, what mathematical technique may be used to
help detect structures that might otherwise be overlooked?
12. What is a type I error? What is a type II error? Is one always more serious than the other? Why?
13. Suppose everyone who visits a retail website gets one promotional offer or no promotion at all. We want to see if making a promotional offer makes a difference. What statistical method would you recommend for this analysis?
14. You are analyzing two normally distributed populations, and your null hypothesis is that the mean μ1 of the first population is equal to the mean μ2 of the second. Assume the significance level is set at 0.05. If the observed p-value is 4.33e-05, what will be your decision regarding the null hypothesis?
Bibliography
[1] The R Project for Statistical Computing, "R Licenses." [Online]. Available: http://www.r-project.org/Licenses/. [Accessed 10 December 2013].
[2] The R Project for Statistical Computing, "The Comprehensive R Archive Network." [Online]. Available: http://cran.r-project.org/. [Accessed 10 December 2013].