Data Smart Book Summary
Abstract
The purpose of this document is to summarize the book Data Smart, written by John W. Foreman, and to provide some additional R code to work along with the book.
Summary
Data Science is a loosely used term and can mean different things in different situations. However, one thing is certain: the principles used in tackling problems come from diverse fields. Drew Conway has a well-known Venn diagram on his blog illustrating this, and someone has also made a nice metro map of the field. In such a diverse field one does not know where or how to start. All said and done, this is a field with considerable entry barriers: one needs to spend at least a few years getting the basics right in order to understand the standard algorithms.
Where does this book fit in? It is apt for people who want to see what is going on behind various algorithms without the math. The book touches upon a dozen topics in data mining and explains the main principles of each of them via Excel. By restricting itself to Excel, the book lets a wider audience get a glimpse of the various concepts. The ideal way to read this book is by working out the various case studies mentioned in it. I could not motivate myself to do the analysis in Excel, so I replicated the analysis in R. In this document I have listed some of the code to work through the book, which essentially replicates the results of the analysis done via Excel. But first, a brief summary of the chapters in the book.
Chapter 1 is on Excel and can be speed-read, as I cannot imagine someone reading this book without ever having worked in Excel. Chapter 2 discusses k-means clustering. It uses an offer-purchases dataset to segment the customers into various clusters for better marketing. k-means needs a distance metric, and there are many to choose from depending on the situation. The book shows that for this particular dataset, a correlation-based distance (cosine similarity) is a better metric than Euclidean distance.
Chapter 3 is on Naive Bayes, a simple method that surprisingly performs better than many other algorithms. In fact, the reason for its ubiquity stems from its simplicity: it does not overfit the data. The Naive Bayes principle is applied to a set of tweets to classify them as business-related or junk. As expected, there is not much math in this book, so the results from this chapter will motivate anyone to understand why Naive Bayes works and why the bias-variance tradeoff plays out very differently in a classification setting than in a regression setting.
Chapter 4 is about optimization, a quintessential skill that any data scientist needs to have. Using a case study, the author introduces linear programming, integer programming, mixed integer programming and ways to convert a nonlinear optimization problem into a linear one. The good thing about this book, and this chapter in particular, is the sense of humor the author brings along while explaining the principles, which makes for an immensely readable book.
Chapter 5 discusses graph analysis and uses the dataset from one of the previous chapters to do unsupervised learning. k-neighborhood and modularity-maximization procedures are used to group the customers into communities. Even though Gephi is used for visualization, igraph is powerful enough to give all the visualization features to an R user. Chapter 6 is about regression. The book uses a sample dataset to explain the concepts of linear and logistic regression. All the creation of dummy variables, setting up of the objective function, etc. is done in Excel, and the reader is made to understand the basic steps behind regression modeling.
Chapter 7 gives the reader an insight into wisdom-of-crowds type models. The models discussed are Random Forests and boosting. A reader who reaches this point of the book is abundantly convinced that Excel is too painful to use for such techniques, where every model built on a bootstrapped sample has to be recorded as a macro and run manually to get estimates. In any case, the chapter does a wonderful job of explaining the nuts and bolts of boosting.
Chapter 8 gives a crash course on exponential smoothing. It starts off with simple exponential smoothing, then moves on to Holt's trend-corrected exponential smoothing, and finally ends with multiplicative Holt-Winters exponential smoothing. The basic limitation of these models is that there is no probabilistic framework around them. Hyndman has written a book on exponential smoothing in which he casts all the models in a state-space framework that makes them far richer.
Chapter 9 talks about outlier detection and introduces three methods: the indegree method, the k-distance method and the local outlier factor method. Chapter 10 introduces some basic commands in R and then works out the k-means model, the regression model, the random forests model, the forecasting model and the outlier detection methods in R. Chapter 11 is the concluding chapter of the book and talks about some soft skills that a data scientist should have in order to be effective in an organization.
Contents
1 Everything You Ever Needed to Know about Spreadsheets but Were Too Afraid to Ask
4 Optimization Modeling: Because That Fresh Squeezed Orange Juice Ain't Gonna Blend Itself
5 Cluster Analysis Part II: Network Graphs and Community Detection
9 Outlier Detection: Just Because They're Odd Doesn't Mean They're Unimportant
11 Conclusion
The first chapter is a basic crash course on Excel that teaches common functionality to an Excel newbie: MATCH, INDEX, OFFSET, SMALL, VLOOKUP, filtering, sorting, pivot tables, array formulas and Solver. It also suggests that the reader install the OpenSolver plug-in (a Solver on steroids) to work through some of the content in the book.
This chapter is about clustering a set of customers based on the transactions they made in response to various deals. Each deal has certain characteristics and there are p offers over a certain time period; in the ML literature, p is the number of features. Every customer can be viewed as a point in this p-dimensional space, where the jth component is 1 or 0 depending on whether the customer responded to the jth deal. The book describes the algorithm in plain English, but the precise algorithm is also not very mathematical. It goes like this, for K clusters (a small R sketch of the iteration follows the steps):
1. Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments for the observations.
2. Iterate until the cluster assignments stop changing:
(a) For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster.
(b) Assign each observation to the cluster whose centroid is closest.
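To make the two-step iteration concrete, here is a minimal R sketch of the procedure (an illustration only, not the book's code nor the kmeans() implementation; empty clusters are not handled):

lloyd.kmeans <- function(X, K, max.iter = 100) {
  assign <- sample(1:K, nrow(X), replace = TRUE)                 # step 1: random assignment
  for (iter in 1:max.iter) {
    # step 2(a): centroid of each cluster = vector of feature means
    centroids <- t(sapply(1:K, function(k) colMeans(X[assign == k, , drop = FALSE])))
    # step 2(b): reassign every observation to its closest centroid
    d          <- as.matrix(dist(rbind(centroids, X)))[-(1:K), 1:K]
    new.assign <- apply(d, 1, which.min)
    if (all(new.assign == assign)) break                         # assignments stopped changing
    assign <- new.assign
  }
  list(cluster = assign, centers = centroids)
}
# e.g. lloyd.kmeans(matrix(rnorm(200), ncol = 2), K = 3)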
Why does the above work? For any cluster configuration, one needs to formulate an objective function and then minimize or maximize it. In this case, you are trying to minimize the within-cluster variation. If we denote the within-cluster variation for cluster $C_k$ by a measure $W(C_k)$, then the objective is

$$\min_{C_1,\ldots,C_K} \sum_{k=1}^{K} W(C_k), \qquad W(C_k) = \frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2,$$

i.e. the within-cluster variation $W(C_k)$ is the (scaled) sum of pairwise squared Euclidean distances between all observations in the kth cluster.
If you look at the algorithm and the method followed in the book, there is no mention of computing pairwise distances within each cluster; instead, the distances minimized are the distances to the centroids. The key identity that makes the pairwise distance computation redundant in each cluster is

$$\frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = 2 \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^2,$$

where $\bar{x}_{kj}$ denotes the mean of feature $j$ in cluster $k$. Ok, now crunching the data.
input.file <- "data/ch02/deals.csv"
deals      <- read.csv(input.file, header = TRUE, stringsAsFactors = TRUE)
input.file <- "data/ch02/offer.csv"
offers     <- read.csv(input.file, header = TRUE, stringsAsFactors = TRUE)
colnames(offers) <- c("name", "offer")
offers$value <- 1
# Reshape the long transaction list into a customer-by-offer 0/1 matrix
# (reconstructed step; the reshape package's cast() is one way to do it)
library(reshape)
offers.cast <- cast(offers, name ~ offer, fill = 0)
offers.data <- as.matrix(offers.cast[, -1])
rownames(offers.data) <- as.character(offers.cast$name)
km.out <- kmeans(offers.data, 4, nstart = 25)
# Append, for each deal, the number of purchases coming from each of the four
# clusters (reconstruction of the broken listing), then inspect each cluster's
# top deals
cluster          <- data.frame(name    = rownames(offers.data),
                               cluster = km.out$cluster)
deals.by.cluster <- merge(offers, cluster, by = "name")
deals.temp       <- cbind(deals,
                          as.data.frame.matrix(table(deals.by.cluster$offer,
                                                     deals.by.cluster$cluster)))
deals.temp[order(deals.temp$"1", decreasing = TRUE), 1:6][1:10, ]
deals.temp[order(deals.temp$"2", decreasing = TRUE), 1:6][1:10, ]

## (output omitted: the top 10 deals for clusters 1 and 2, showing the Campaign,
## Varietal, Minimum Qty, Discount, Origin and PastPeak columns)
deals.temp[order(deals.temp$"3",decreasing = TRUE),1:6][1:10,]
## (output omitted: the corresponding top 10 deals for the remaining clusters)
As one can see, the above analysis does not give any conclusive results. Instead, one can look at the deal counts in each cluster:
cluster          <- data.frame(name    = rownames(offers.data),
                               cluster = km.out$cluster)
deals.by.cluster <- merge(offers, cluster, by = "name")
temp             <- as.data.frame.matrix(table(deals.by.cluster$offer,
                                               deals.by.cluster$cluster))
temp             <- cbind(deals, temp)
## (output omitted: the deals table with per-cluster purchase counts appended,
## displayed once for each of the four clusters in decreasing order of count)
One can try k-means for varying K and pick one of the values. The chapter suggests a way to compare the k-means solutions across values of K, i.e. by computing a score for your clusters called the silhouette. The following R code computes the metric; you can also use the silhouette function from the cluster package.
# Reconstruction of the hand-rolled average silhouette width; it should agree
# with cluster::silhouette(). 'cluster' is a data frame with columns id and cluster.
silhouette.rk <- function(cluster, dist.euclidean){
  d        <- as.matrix(dist.euclidean)
  clusters <- sort(unique(cluster$cluster))
  silh     <- numeric(nrow(cluster))
  for(i in cluster$id){
    own  <- cluster$cluster[i]
    same <- cluster$id[cluster$cluster == own & cluster$id != i]
    if(length(same) == 0){ silh[i] <- 0; next }               # singleton cluster
    within   <- mean(d[i, same])                              # a(i)
    neighbor <- min(sapply(setdiff(clusters, own),            # b(i)
                           function(k) mean(d[i, cluster$id[cluster$cluster == k]])))
    silh[i]  <- (neighbor - within) / max(within, neighbor)
  }
  mean(silh)
}

dist.euclidean <- dist(offers.data)
library(cluster)          # for silhouette()
km.out  <- kmeans(offers.data, 4, nstart = 25)
cluster <- data.frame(name    = rownames(offers.data),
                      cluster = km.out$cluster,
                      id      = 1:nrow(offers.data))
print(silhouette.rk(cluster, dist.euclidean))

## [1] 0.1243

print(summary(silhouette(km.out$cluster, dist.euclidean))$avg.width)

## [1] 0.1243
For K = 5 clusters, one can calculate silhouette as follows:
set.seed(1)
km.out  <- kmeans(offers.data, 5, nstart = 25)
cluster <- data.frame(name    = rownames(offers.data),
                      cluster = km.out$cluster,
                      id      = 1:nrow(offers.data))
print(silhouette.rk(cluster, dist.euclidean))

## [1] 0.1231

print(summary(silhouette(km.out$cluster, dist.euclidean))$avg.width)

## [1] 0.1231
The above metric shows that 5 clusters is no better than 4 clusters.
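As a quick extension (not in the book), one can loop the same computation over a range of K values and compare the average silhouette widths before settling on a segmentation:

library(cluster)
avg.width <- sapply(2:8, function(k) {
  km <- kmeans(offers.data, k, nstart = 25)
  summary(silhouette(km$cluster, dist.euclidean))$avg.width
})
plot(2:8, avg.width, type = "b", xlab = "K", ylab = "average silhouette width")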
The chapter subsequently introduces a different way to do k-means clustering, i.e. spherical k-means, a method whose dissimilarity measure is a correlation-based (cosine) distance. The R package that does spherical k-means is skmeans.
library(skmeans)
set.seed(1)
sk.out  <- skmeans(offers.data, 5)
cluster <- data.frame(name    = rownames(offers.data),
                      cluster = sk.out$cluster,
                      id      = 1:nrow(offers.data))
deals.by.cluster <- merge(offers, cluster, by = "name")
temp             <- as.data.frame.matrix(table(deals.by.cluster$offer,
                                               deals.by.cluster$cluster))
temp             <- cbind(deals, temp)
temp[order(temp$"1", decreasing = TRUE), 1:6][1:10, ]   # and similarly for clusters 2 to 4
## (output omitted: the top 10 deals for each of the first four spherical
## k-means clusters)
The fifth cluster consists of those who buy only sparkling wine:
temp[order(temp$"5",decreasing = TRUE),1:6][1:10,]
## (output omitted: the top 10 deals for cluster 5, dominated by sparkling wines
## such as Prosecco, Espumante and Champagne)
This chapter talks about using a particular form of Bayes' theorem that makes it extremely easy to filter specific types of messages. The author uses a set of tweets containing words relating to an app. Needless to say, there is a lot of noise in Twitter data. Given a tweet, how does one identify whether it is about the specific app or about something else? This question is taken up in the chapter: Bayes' rule is used to compute the posterior probability of "app" given a tweet and the posterior probability of "noise" given a tweet; the two are compared and the tweet is assigned to the class with the higher probability. The logic is extremely simple and is based on the bag-of-words model. The following are the basic steps:
Model Building
Collect all the words from the app tweets and compute the likelihood of each word given that it comes from an app tweet.
Collect all the words from the non-app tweets and compute the likelihood of each word given that it comes from a non-app tweet.
Model Prediction
For any test tweet, compute the posterior probability of "app" given the tweet and the posterior probability of "other" given the tweet, and assign the tweet to the class with the larger posterior.
# helper to tokenize a tweet into words (reconstructed from the broken listing)
cleanup <- function(x){
  x <- gsub("[.]\\s", " ", x)
  x <- strsplit(x, "(\\s)+")
  return(x[[1]])
}

getFrequencyCounts <- function(input.file){
  tweets    <- read.csv(input.file, header = FALSE, stringsAsFactors = FALSE)
  tweets    <- as.character(tweets[, 1])
  words     <- sapply(tweets, cleanup)
  words     <- do.call("c", words)
  wordcount <- table(words)
  wordfreq  <- as.data.frame(wordcount, stringsAsFactors = FALSE)
  colnames(wordfreq) <- c("words", "freq")
  wordfreq$freq    <- wordfreq$freq + 1        # additive smoothing
  total            <- sum(wordfreq$freq)
  wordfreq$logprob <- log(wordfreq$freq/total)
  result           <- wordfreq
  return(result)
}
input.file <- "data/ch03/app.csv"
app.res    <- getFrequencyCounts(input.file)
input.file <- "data/ch03/other.csv"
other.res  <- getFrequencyCounts(input.file)

input.file <- "data/ch03/test_set.csv"
tweets     <- read.csv(input.file, header = FALSE, stringsAsFactors = FALSE)
tweets     <- as.character(tweets[, 1])
words      <- sapply(tweets, cleanup)
# classify a tokenized tweet by comparing the summed log-likelihoods under the
# two word models (reconstruction; unseen words fall back to a smoothing floor)
getScores <- function(input){
  floor.app   <- log(1/sum(app.res$freq))
  floor.other <- log(1/sum(other.res$freq))
  temp1 <- ifelse(input %in% app.res$words,
                  app.res$logprob[match(input, app.res$words)], floor.app)
  temp2 <- ifelse(input %in% other.res$words,
                  other.res$logprob[match(input, other.res$words)], floor.other)
  return(sum(temp1) > sum(temp2))
}
predicted.result <- sapply(words, getScores)
attributes(predicted.result) <- NULL
model.predict <- ifelse(predicted.result == TRUE, "APP", "OTHERS")
actual.result <- c(rep(TRUE, 10), rep(FALSE, 10))
model.actual  <- c(rep("APP", 10), rep("OTHERS", 10))
model.comparison <- data.frame(predicted = model.predict, actual = model.actual)
model.comparison$value <- 1

## (output omitted: aggregated counts of predicted vs. actual classes; 19 of the
## 20 test tweets are classified correctly)
Naive Bayes seems to be doing really well on this toy dataset, with just one false positive.
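For reference, the comparison implemented above can be written compactly (assuming equal priors for the two classes, which is effectively what the chapter does): a tweet with words $w_1, \ldots, w_n$ is labelled as app whenever

$$\sum_{i=1}^{n} \log P(w_i \mid \text{app}) > \sum_{i=1}^{n} \log P(w_i \mid \text{other}),$$

with the class-conditional word probabilities estimated, with add-one smoothing, from the two training files.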
The author explains the principles of LP via a case study. The case involves deciding the procurement of juice for January, February and March across 11 orange varieties. There is a set of constraints under which the procurement can be made.
input.file <- "data/ch04/specs.csv"
data       <- read.csv(input.file, header = TRUE, stringsAsFactors = TRUE)
head(data)

## (output omitted: one row per orange variety - Hamlin, Mosambi, Valencia, etc. -
## with its region, available quantity, cost and the four juice quality
## attributes used in the constraints)
# Decision variables: tons purchased of each of the 11 varieties in each of the
# three months, laid out as (Jan 1-11, Feb 1-11, Mar 1-11), i.e. 33 variables.
# The constraint matrix below is a reconstruction of the broken code listing.
library(lpSolve)
lhs  <- matrix(0, nrow = 50, ncol = 33)
rhs  <- numeric(50)
dir  <- character(50)
reqd <- c(600, 600, 700)          # total juice required in Jan, Feb, Mar
lt   <- "<="
gt   <- ">="
et   <- "="
j    <- 1

# total procured each month must equal the requirement
for(k in 1:3) {
  lhs[j,] <- c(rep(0, (k-1)*11), rep(1, 11), rep(0, (3-k)*11))
  rhs[j]  <- reqd[k]
  dir[j]  <- et
  j       <- j + 1
}

# Valencia constraint: a minimum amount of Valencia (variety 3) every month
lhs[j,] <- c(rep(0,2), 1, rep(0,8), rep(0,22));            rhs[j] <- 240; dir[j] <- gt; j <- j + 1
lhs[j,] <- c(rep(0,11), rep(0,2), 1, rep(0,8), rep(0,11)); rhs[j] <- 240; dir[j] <- gt; j <- j + 1
lhs[j,] <- c(rep(0,22), rep(0,2), 1, rep(0,8));            rhs[j] <- 280; dir[j] <- gt; j <- j + 1

# Availability: total purchased of each variety cannot exceed the quantity on offer
j <- 7
for(i in 1:11) {
  x       <- rep(0, 11)
  x[i]    <- 1
  lhs[j,] <- rep(x, times = 3)
  dir[j]  <- lt
  rhs[j]  <- data$qty[i]
  j       <- j + 1
}

# Quality: each month's blended juice must keep its four quality attributes
# (columns 4 to 7 of the data) within the specified bands
lowerlim <- c(11.5, 0.0075, 0, 4.5)
upperlim <- c(12.5, 0.01, 4, 5.5)
for(k in 1:3) {
  if(k==1){ temp <- c(rep(1,11), rep(0,11), rep(0,11)) }
  if(k==2){ temp <- c(rep(0,11), rep(1,11), rep(0,11)) }
  if(k==3){ temp <- c(rep(0,11), rep(0,11), rep(1,11)) }
  for(l in 1:4){
    temp1   <- temp * data[, l+3] / reqd[k]
    lhs[j,] <- temp1
    rhs[j]  <- lowerlim[l]
    dir[j]  <- gt
    j       <- j + 1
  }
  for(u in 1:4){
    temp2   <- temp * data[, u+3] / reqd[k]
    lhs[j,] <- temp2
    rhs[j]  <- upperlim[u]
    dir[j]  <- lt
    j       <- j + 1
  }
}

lhs <- lhs[1:(j-1), ]
rhs <- rhs[1:(j-1)]
dir <- dir[1:(j-1)]

# objective: minimize total cost (the name of the per-ton cost column in
# specs.csv is an assumption here)
obj <- rep(data$cost, times = 3)
sol <- lp("min", obj, lhs, dir, rhs)
df  <- matrix(sol$sol, nrow = 11, ncol = 3)
colnames(df) <- c("Jan", "Feb", "Mar")
rownames(df) <- as.character(data$variety)
sol$objval

## [1] 1226505
The solution from the constrained optimization is:
df

## (output omitted: the optimal purchase plan - tons of each variety to buy in
## January, February and March)
# Second model: cap the spend at a budget and instead minimize how much the four
# quality bands must be widened. This is a condensed reconstruction of the broken
# listing: four relaxation variables delta_1..delta_4 are appended to the 33
# purchase variables, every quality constraint is loosened by its delta, and a
# budget constraint on total cost is added.
BUDGET    <- 1170000
extra.var <- matrix(0, nrow = nrow(lhs) + 1, ncol = 4)
for(k in 1:3) for(l in 1:4) {
  extra.var[17 + (k-1)*8 + l,     l] <-  1    # lower-bound rows: ... + delta_l >= lowerlim[l]
  extra.var[17 + (k-1)*8 + 4 + l, l] <- -1    # upper-bound rows: ... - delta_l <= upperlim[l]
}
lhs.b <- cbind(rbind(lhs, rep(data$cost, times = 3)), extra.var)  # last row = budget constraint
rhs.b <- c(rhs, BUDGET)
dir.b <- c(dir, lt)
obj   <- c(rep(0, 33), rep(0.25, 4))          # only the relaxations are penalized
sol   <- lp("min", obj, lhs.b, dir.b, rhs.b)
df    <- matrix(sol$sol[1:33], nrow = 11, ncol = 3,
                dimnames = list(as.character(data$variety), c("Jan", "Feb", "Mar")))
relax <- matrix(sol$sol[34:37])
df

## (output omitted: the purchase plan under the relaxed-quality model)
relax

##        [,1]
## [1,] 0.4095
## [2,] 0.0000
## [3,] 0.0000
## [4,] 0.5874

mean(relax)

## [1] 0.2492
One can also draw a curve between various budget levels and quality deterioration.
BUDGET  <- seq(1226505, 1110000, -10000)
quality <- numeric(length(BUDGET))
for(b in seq_along(BUDGET)){
  rhs.b[length(rhs.b)] <- BUDGET[b]        # the budget sits in the last constraint row
  sol        <- lp("min", obj, lhs.b, dir.b, rhs.b)
  quality[b] <- mean(sol$sol[34:37])
}
par(mfrow = c(1, 1))
plot(BUDGET/1000, quality, type = "l", xlab = "K$",
     ylab = "Broadening of quality bands", col = "blue", cex.lab = 0.8)
(Figure: average broadening of the quality bands as the budget is reduced from roughly $1,226K to $1,110K.)
MiniMax
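A short note on the trick used below, stated here for reference since the book builds it directly in the spreadsheet: to minimize the worst quality relaxation $\max_l \delta_l$ rather than their sum, introduce one extra variable $M$, add the linear constraints $\delta_l \le M$ for $l = 1, \ldots, 4$, and minimize $M$. The problem stays linear, which is why the objective in the listing is a vector of zeros with a single 1 in the position of $M$.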
# MiniMax variant (condensed reconstruction): instead of minimizing the total
# relaxation, minimize the single largest relaxation. One extra variable M is
# appended, constrained to be at least as large as every delta, and the
# objective becomes "minimize M".
BUDGET <- 1170000
rhs.b[length(rhs.b)] <- BUDGET                  # reset the budget
lhs.m  <- cbind(rbind(lhs.b, matrix(0, 4, ncol(lhs.b))), 0)
for(l in 1:4) {                                 # delta_l - M <= 0
  lhs.m[nrow(lhs.b) + l, 33 + l]      <-  1
  lhs.m[nrow(lhs.b) + l, ncol(lhs.m)] <- -1
}
rhs.m <- c(rhs.b, rep(0, 4))
dir.m <- c(dir.b, rep(lt, 4))
obj   <- c(rep(0, 37), 1)
sol   <- lp("min", obj, lhs.m, dir.m, rhs.m)
df    <- matrix(sol$sol[1:33], nrow = 11, ncol = 3,
                dimnames = list(as.character(data$variety), c("Jan", "Feb", "Mar")))
relax <- matrix(sol$sol[34:37])
sol$objval

## [1] 0.5874
For the specified budget, the relaxation in quality constraints is
df

## (output omitted: the purchase plan under the MiniMax model)

relax

##        [,1]
## [1,] 0.5874
## [2,] 0.0000
## [3,] 0.0000
## [4,] 0.5874
For some reason, I could not match the exact results for this section as given in the book (the book uses the OpenSolver plugin). The author then goes into integer and mixed integer programming for the same dataset.
# Integer-programming variant (condensed reconstruction): 33 binary variables
# are appended, one per variety-month combination, indicating whether that
# variety is used at all in that month; each month may use at most 4 varieties,
# and a purchase can be non-zero only if its indicator is switched on.
n.bin <- 33

# at most 4 varieties per month (constraints on the binary block)
lhs1 <- matrix(0, 3, n.bin)
for(k in 1:3) lhs1[k, ] <- c(rep(0, (k-1)*11), rep(1, 11), rep(0, (3-k)*11))
rhs1 <- rep(4, 3)
dir1 <- rep(lt, 3)

# stack the original constraints (zero-padded for the binaries) with the new ones
lhs2 <- cbind(lhs, matrix(0, nrow(lhs), n.bin))
lhs3 <- cbind(matrix(0, 3, 33), lhs1)
lhs4 <- rbind(lhs2, lhs3)
rhs2 <- c(rhs, rhs1)
dir2 <- c(dir, dir1)

# linking constraints: -purchase_i + qty_i * binary_i >= 0
for(i in 1:33) {
  x    <- rep(0, 33); x[i] <- 1
  y    <- rep(0, 33); y[i] <- 1
  lhs4 <- rbind(lhs4, c(-x, y * data$qty))
  dir2 <- c(dir2, gt)
  rhs2 <- c(rhs2, 0)
}

obj  <- c(rep(data$cost, times = 3), rep(0, n.bin))   # cost column name is an assumption
bvec <- 34:66
sol  <- lp("min", obj, lhs4, dir2, rhs2, binary.vec = bvec)
df   <- matrix(sol$sol[1:33], nrow = 11, ncol = 3,
               dimnames = list(as.character(data$variety), c("Jan", "Feb", "Mar")))
df

## (output omitted: the integer purchase plan, with at most four varieties used
## in any month)
Towards the end of the chapter, the author talks about risk and, using basic simulation, tries to get a handle on the risk distribution. I think the great thing about this chapter is the way the author communicates the various ideas of linear programming in Excel effortlessly, with a healthy dose of humor.
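The book's simulation is not replicated here, but a minimal hedged sketch of the idea, re-solving the first model under randomly perturbed monthly requirements and looking at the resulting distribution of optimal cost, could look like the following (the plus/minus 10% demand shock and the cost column name are assumptions for illustration):

set.seed(1)
obj.cost <- rep(data$cost, times = 3)          # assumed per-ton cost column
n.sim    <- 500
costs    <- numeric(n.sim)
for (s in 1:n.sim) {
  rhs.sim      <- rhs
  rhs.sim[1:3] <- reqd * runif(3, 0.9, 1.1)    # shock the Jan/Feb/Mar requirements
  sol.s        <- lp("min", obj.cost, lhs, dir, rhs.sim)
  costs[s]     <- ifelse(sol.s$status == 0, sol.s$objval, NA)   # NA if infeasible
}
hist(costs, main = "Simulated cost distribution", xlab = "optimal cost")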
This chapter introduces graph analysis via Excel and graph visualization via Gephi. However, one can stay within R and do the analysis and visualization in one go: the igraph package has extensive functionality for graph processing. The example used to illustrate the principles is the same one used in the chapter on clustering, i.e. the wine-deal prospect data.
library(igraph)
input.file <- "data/ch02/deals.csv"
deals      <- read.csv(input.file, header = TRUE, stringsAsFactors = TRUE)
input.file <- "data/ch02/offer.csv"
offers     <- read.csv(input.file, header = TRUE, stringsAsFactors = TRUE)
colnames(offers) <- c("name", "offer")
offers$value <- 1
offers.cast  <- cast(offers, name ~ offer, fill = 0)   # reshape package, as in Chapter 2
offers.data  <- as.matrix(offers.cast[, -1])
rownames(offers.data) <- as.character(offers.cast$name)

# cosine similarity between customers (reconstruction; the original fragment
# only shows the '1 - as.matrix(dist.cosine)' step)
norm.rows   <- offers.data / sqrt(rowSums(offers.data^2))
dist.cosine <- as.dist(1 - norm.rows %*% t(norm.rows))
mat         <- 1 - as.matrix(dist.cosine)
diag(mat)   <- 0
pct.80      <- quantile(mat[which(mat > 0)], prob = 0.8)
mat[mat < pct.80] <- 0                # keep only the strongest 20% of similarities
rownames(mat) <- rownames(offers.data)
colnames(mat) <- rownames(offers.data)
graph.wine    <- graph.adjacency(mat, mode = "undirected", weighted = TRUE)
par(mfrow=c(1,1))
set.seed(12)
plot.igraph(graph.wine,
vertex.label.cex=0.8,vertex.label= rownames(offers.data),
layout=layout.fruchterman.reingold(graph.wine, niter=10000))
(Figure: the thresholded customer-similarity graph, laid out with the Fruchterman-Reingold algorithm; vertices are labelled with customer last names.)
The above visualization is just the starting point. The author uses a method called modularity maximization to identify communities; igraph provides this via the optimal.community function. The good thing about the book, as I keep repeating, is that you can see exactly what goes on behind optimal.community in Excel. The fact that the algorithm uses mixed integer programming, and that it uses a certain metric (modularity) as the objective function, is easy to see in a spreadsheet. Once you have seen the workings in Excel, you will not hesitate to use a third-party function that does the job.
set.seed(1)
out <- optimal.community(graph.wine)
sizes(out)

## (output omitted: the sizes of the detected communities)

mem         <- membership(out)
communities <- data.frame(name = rownames(offers.data), community = as.vector(mem))
deals.by.community <- merge(offers, communities, by = "name")
temp <- as.data.frame.matrix(table(deals.by.community$offer,
                                   deals.by.community$community))
temp <- cbind(deals, temp)
## (output omitted: the top deals for each of the larger communities, again
## showing the Campaign, Varietal, Minimum Qty, Discount, Origin and PastPeak
## columns)
plot(out,graph.wine)
(Figure: the same customer graph with vertices colored by detected community.)
This nice visualization of the communities is the highlight of using network analysis. Bootstrapped classification methods do not give this ease of interpretation.
The chapter talks about the most widespread tool of any statistician: regression. The dataset used to illustrate the concepts is a set of customer records and their purchases. Regression is a supervised learning algorithm, and in this case the criterion, or outcome variable, is dichotomous: whether the customer is pregnant or not. Since the modeling is done in Excel, the author manually shows how to create dummy variables and how to use Solver to compute the coefficients of the regression model. In the first go, a linear regression is fit with the outcome variable taking the numerical values 0 or 1. This is obviously not ideal, since a linear regression model assumes the outcome to be a normal random variable rather than a variable whose support is between 0 and 1. One needs to set up the RSS function manually and minimize it. Using Solver one can get the coefficients, but one also needs a manual setup in Excel to compute the following:
Total sum of squares
$$TSS = \sum_i (y_i - \bar{y})^2$$
Residual sum of squares
$$RSS = \sum_i (y_i - \hat{y}_i)^2$$
$R^2$
$$R^2 = 1 - \frac{RSS}{TSS}$$
F statistic
$$F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)}$$
Covariance matrix of the coefficients
$$\mathrm{Cov}(\hat{\beta}) = \hat{\sigma}^2 (X'X)^{-1}$$
t statistic
$$t_j = \frac{\hat{\beta}_j}{\sqrt{\hat{\sigma}^2\,[(X'X)^{-1}]_{jj}}}$$
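These quantities can also be computed directly from the design matrix in R; the following is a small hedged sketch (using the Pregnancy data loaded just below) rather than the book's spreadsheet formulas:

X      <- model.matrix(PREGNANT ~ ., data = data)   # dummy-coded design matrix
y      <- data$PREGNANT
beta   <- solve(t(X) %*% X, t(X) %*% y)             # least-squares coefficients
yhat   <- X %*% beta
RSS    <- sum((y - yhat)^2)
TSS    <- sum((y - mean(y))^2)
R2     <- 1 - RSS / TSS
n      <- nrow(X); p <- ncol(X) - 1
Fstat  <- ((TSS - RSS) / p) / (RSS / (n - p - 1))
sigma2 <- RSS / (n - p - 1)
cov.beta <- sigma2 * solve(t(X) %*% X)              # covariance matrix of the coefficients
t.stat   <- beta / sqrt(diag(cov.beta))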
All the hard work that one needs to do in Excel can be done in just a few lines of R code.

input.file <- "data/ch06/Pregnancy.csv"
data       <- read.csv(input.file, header = TRUE, stringsAsFactors = TRUE)
test.file  <- "data/ch06/Pregnancy_Test.csv"
test       <- read.csv(test.file, header = TRUE, stringsAsFactors = TRUE)
fit1       <- lm(PREGNANT ~ ., data = data)
summary(fit1)
##
## Call:
## lm(formula = PREGNANT ~ ., data = data)
##
## (coefficient table omitted: predictors such as Pregnancy.Test, Folic.Acid,
## Prenatal.Vitamins, Prenatal.Yoga, Ginger.Ale and Maternity.Clothes carry
## significant positive coefficients, while Birth.Control, Feminine.Hygiene,
## Cigarettes and Wine carry significant negative ones)
##
## Residual standard error: 0.372 on 980 degrees of freedom
## Multiple R-squared: 0.458, Adjusted R-squared: 0.447
## F-statistic p-value: <2e-16
To draw the ROC curve for the above model, one can use the ROCR package.
library(ROCR)
test.pred <- predict(fit1, test)
pred.lm   <- prediction(test.pred, test$PREGNANT)
perf.lm   <- performance(pred.lm, "tpr", "fpr")
plot(perf.lm, xlim = c(0, 1), ylim = c(0, 1))
abline(h = seq(0, 1, 0.1), col = "grey", lty = "dashed")
(Figure: ROC curve for the linear regression model.)

The ROC curve of a good model should hug the top-left corner, so there is scope to improve this model. The fact that the predicted values of the model do not stay in [0, 1] also means that the model needs to be improved. The first tweak is to model a link function of the outcome variable rather than the outcome variable itself. The chapter first shows the incorrect way of handling this situation and then shows the right way, estimating via a likelihood procedure. Why the MLE method works while the previous method fails is not explained, but any data analyst will realize that sum-of-squares minimization is in general justified by a probability model, and hence one should maximize the log-likelihood rather than an arbitrary objective function.
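As a hedged illustration of the point (not the book's code), the Bernoulli log-likelihood that glm() maximizes can be optimized directly, and the result matches the glm coefficients below:

X <- model.matrix(PREGNANT ~ ., data = data)
y <- data$PREGNANT
negloglik <- function(beta) {
  eta <- X %*% beta
  -sum(y * eta - log(1 + exp(eta)))     # negative Bernoulli log-likelihood
}
fit.ml <- optim(rep(0, ncol(X)), negloglik, method = "BFGS")
# fit.ml$par should be close to coef(glm(PREGNANT ~ ., family = binomial(), data = data))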
fit <- glm(PREGNANT ~ ., family = binomial(), data = data)
summary(fit)
##
## Call:
## glm(formula = PREGNANT ~ ., family = binomial(), data = data)
##
## Deviance Residuals:
##    Min      1Q  Median      3Q     Max
## -3.201  -0.557  -0.025   0.513   2.866
##
## (coefficient table omitted: the same predictors are significant as in the
## linear model, e.g. Pregnancy.Test, Folic.Acid and Prenatal.Vitamins with
## large positive coefficients and Birth.Control and Feminine.Hygiene with
## large negative ones)
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Residual deviance: 744.11 on 980 degrees of freedom
## AIC: 784.1
##
## Number of Fisher Scoring iterations: 7
For testing the performance of the logistic regression, the following code can be used, and ROC curves can be drawn to get a comparative view of performance:
test.pred <- predict(fit, test)
pred.logr <- prediction(ilogit(test.pred), test$PREGNANT)   # ilogit from the faraway package; plogis() also works
perf.logr <- performance(pred.logr, "tpr", "fpr")
plot(perf.lm, xlim = c(0, 1), ylim = c(0, 1), col = "blue")
plot(perf.logr, xlim = c(0, 1), ylim = c(0, 1), add = TRUE, col = "red")
abline(h = seq(0, 1, 0.1), col = "grey", lty = "dashed")
legend("center", legend = c("Linear Regression", "Logistic Regression"),
       fill = c("blue", "red"), cex = 0.7)
In this case, there does not seem to be much difference between linear and logistic regression. To be technically correct, it is anyway better to use logistic regression.
(Figure: ROC curves for the linear (blue) and logistic (red) regression models.)
The models introduced in this chapter belong to the category of voting models. Such methods would have been unthinkable a few decades ago; with fast and cheap computing power, they have become extremely popular. The chapter walks through two such methods (a small hand-rolled sketch of the voting idea follows the list):
1. Random Forests
2. Boosting
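To make the voting idea concrete before the package-based replication, here is a minimal hand-rolled bagging sketch (an illustration only; bag.predict and the rpart stumps are assumptions, not the book's method):

library(rpart)
bag.predict <- function(formula, train, test, B = 200) {
  votes <- replicate(B, {
    boot  <- train[sample(nrow(train), replace = TRUE), ]                 # bootstrap sample
    stump <- rpart(formula, data = boot, control = rpart.control(maxdepth = 1))
    predict(stump, test)                                                  # each stump votes
  })
  rowMeans(votes)   # averaged vote, interpretable as a probability
}
# e.g., after loading the chapter's data: probs <- bag.predict(PREGNANT ~ ., data, test)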
It is really commendable that the author has taken pains to do everything in Excel to show how the methods really work. I came across these methods in the statistical learning books by Trevor Hastie and Rob Tibshirani; if you want the math behind them, those books are priceless. However, if you are the kind who wants to see something running first and then explore the math, this chapter is perfect.
The dataset used in this chapter is the same as in the previous chapter. The author spends little time on data preparation and shows all the steps needed to run bagging and boosting. In the last chapter of the book, he provides the code for random forests but not for boosting.
Here is the R code that replicates Random Forest results from the chapter
library(randomForest)
input.file <- "data/ch06/Pregnancy.csv"
data       <- read.csv(input.file, header = TRUE, stringsAsFactors = TRUE)
test.file  <- "data/ch06/Pregnancy_Test.csv"
test       <- read.csv(test.file, header = TRUE, stringsAsFactors = TRUE)
data$PREGNANT <- factor(data$PREGNANT)
test$PREGNANT <- factor(test$PREGNANT)
data.rf <- randomForest(PREGNANT ~ ., data = data, importance = TRUE,
                        replace = FALSE, maxnodes = 2, mtry = 17)
test.rf <- predict(data.rf, test, type = "prob")
pred.rf <- prediction(test.rf[, 2], test$PREGNANT)
perf.rf <- performance(pred.rf, "tpr", "fpr")
par(mfrow = c(1, 2))
varImpPlot(data.rf, type = 2, cex = 0.6, main = "")
plot(perf.rf, xlim = c(0, 1), ylim = c(0, 1), lty = 2)
(Figure: variable importance (mean decrease in Gini) from the random forest, led by Folic.Acid and Birth.Control, alongside the random forest's ROC curve.)
Here is the R code that replicates Boosting results from the chapter
library(gbm)
input.file <- "data/ch06/Pregnancy.csv"
data       <- read.csv(input.file, header = TRUE, stringsAsFactors = TRUE)
test.file  <- "data/ch06/Pregnancy_Test.csv"
test       <- read.csv(test.file, header = TRUE, stringsAsFactors = TRUE)
# the gbm settings below (number of trees, depth, shrinkage) are assumptions;
# the original values were lost in extraction
data.boost <- gbm(PREGNANT ~ ., data = data, distribution = "bernoulli",
                  n.trees = 500, interaction.depth = 1, shrinkage = 0.05)
test.boost <- predict(data.boost, test, n.trees = 500, type = "response")
pred.boost <- prediction(test.boost, test$PREGNANT)
perf.boost <- performance(pred.boost, "tpr", "fpr")
summary(data.boost, plotit = FALSE)      # relative influence of each variable
##                                           var rel.inf
## Folic.Acid                         Folic.Acid  70.713
## Birth.Control                   Birth.Control  19.989
## Prenatal.Vitamins           Prenatal.Vitamins   4.938
## Maternity.Clothes           Maternity.Clothes   3.282
## Stopped.buying.wine       Stopped.buying.wine   1.078
## Implied.Gender                 Implied.Gender   0.000
## Home.Apt..PO.Box             Home.Apt..PO.Box   0.000
## Pregnancy.Test                 Pregnancy.Test   0.000
## Feminine.Hygiene             Feminine.Hygiene   0.000
## Prenatal.Yoga                   Prenatal.Yoga   0.000
## Body.Pillow                       Body.Pillow   0.000
## Ginger.Ale                         Ginger.Ale   0.000
## Sea.Bands                           Sea.Bands   0.000
## Stopped.buying.ciggies Stopped.buying.ciggies   0.000
## Cigarettes                         Cigarettes   0.000
## Smoking.Cessation           Smoking.Cessation   0.000
## Wine                                     Wine   0.000
(Figure: ROC curves comparing the random forest and boosting models.)
This chapter starts off by introducing simple exponential smoothing, which is basically a changing-level model. Here is the R code to replicate the results from the book.
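Before the replication, a minimal sketch of the level-update recursion itself (an illustration, not the book's code): the smoothed level follows $\ell_t = \alpha y_t + (1-\alpha)\ell_{t-1}$, and the forecast is simply the last level.

ses <- function(y, alpha) {
  level    <- numeric(length(y))
  level[1] <- y[1]
  for (t in 2:length(y)) level[t] <- alpha * y[t] + (1 - alpha) * level[t - 1]
  level
}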
input.file <- "data/ch10/SwordDemand.csv"
data       <- read.csv(input.file, stringsAsFactors = TRUE, header = TRUE)
# monthly demand starting January 2010 (frequency and start reconstructed)
data.ts    <- ts(data[, 1], frequency = 12, start = c(2010, 1))
fit        <- HoltWinters(data.ts, beta = FALSE, gamma = FALSE)   # simple exponential smoothing

par(mfrow = c(1, 1))
plot.ts(data.ts, xaxt = "n", lwd = 2, col = "grey",
        xlim = c(2010, 2014), xlab = "")
timeidx <- seq(2010, 2014, length.out = 48)
axis(1, at = 2010:2014)     # the original custom year-month tick labels were lost
lines(fitted(fit)[, 1], col = "blue", lwd = 2)
(Figure: monthly sword demand (grey) with the simple exponential smoothing fit.)
# Holt's trend-corrected exponential smoothing
fit <- HoltWinters(data.ts, gamma = FALSE)
fit$beta

##    beta
## 0.05766

sqrt(fit$SSE/34)

## [1] 19.92

par(mfrow = c(1, 1))
plot.ts(data.ts, xaxt = "n", lwd = 2, col = "grey",
        xlim = c(2010, 2014), ylim = c(100, 400), xlab = "")
axis(1, at = 2010:2014)
lines(fitted(fit)[, 1], col = "blue", lwd = 2)
(Figure: sword demand with Holt's trend-corrected fit.)
# multiplicative Holt-Winters exponential smoothing
fit <- HoltWinters(data.ts, seasonal = "multiplicative")
fit$alpha

##  alpha
## 0.2316

fit$beta

##   beta
## 0.1148

fit$gamma

## gamma
##

sqrt(fit$SSE/33)

## [1] 9.282
Check residuals of the model using the acf plot
acf(data.ts[14:36]- c(fitted(fit)[1:23,1]) , main = "", xlim=c(1,15))
par(mfrow = c(1, 1))
plot.ts(data.ts, xaxt = "n", lwd = 2, col = "grey",
        xlim = c(2010, 2014), ylim = c(100, 400), xlab = "")
axis(1, at = 2010:2014)
lines(fitted(fit)[, 1], col = "blue", lwd = 2)
(Figure 8.4: Checking residuals - ACF of the model residuals.)
The last part of the chapter puts confidence intervals around the prediction.
pred <- predict(fit, n.ahead = 12, prediction.interval = TRUE, level = 0.95)
par(mfrow = c(1, 1))
plot.ts(data.ts, xaxt = "n", lwd = 2, col = "grey",
        xlim = c(2010, 2014), ylim = c(100, 400), xlab = "")
axis(1, at = 2010:2014)
lines(fitted(fit)[, 1], col = "blue", lwd = 2)
lines(pred[, "fit"], col = "red", lwd = 1)
lines(pred[, "upr"], col = "red", lwd = 1, lty = 2)
lines(pred[, "lwr"], col = "red", lwd = 1, lty = 2)
legend("topleft",
       legend = c("original time series", "smoothed", "forecast",
                  "forecast-upper", "forecast-lower"),
       col = c("grey", "blue", "red", "red", "red"), lty = 1, cex = 0.7)
(Figure 8.5: Checking residuals.)
library(forecast)
sword.forecast <- forecast(data.ts)    # automatic ETS model selection
sword.forecast$method

## [1] "ETS(M,M,M)"

plot(sword.forecast)

(Figure: the ETS(M,M,M) forecast for sword demand with prediction intervals.)
The chapter starts by discussing Tukey fences as a visual tool to spot outliers. They come in two varieties: inner fences sit 1.5 IQR beyond the first and third quartiles, and outer fences sit 3 IQR beyond them.
input.file <- "data/ch10/PregnancyDuration.csv"
data       <- read.csv(input.file, header = TRUE, stringsAsFactors = TRUE)
summary(data)

##  GestationDays
##  Min.   :240
##  1st Qu.:260
##  Median :267
##  Mean   :267
##  3rd Qu.:272
##  Max.   :349

IQR(data[, 1])

## [1] 12

quantile(data[, 1], prob = 0.75) - quantile(data[, 1], prob = 0.25)

## 75%
##  12
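The fences themselves can be written down directly from the quartiles (a small addition, not in the book's code); note that the boxplot whiskers below stop at the most extreme observations inside these limits:

q     <- quantile(data$GestationDays, c(0.25, 0.75))
iqr   <- diff(q)
inner <- c(q[1] - 1.5 * iqr, q[2] + 1.5 * iqr)   # 242 and 290
outer <- c(q[1] - 3 * iqr,   q[2] + 3 * iqr)     # 224 and 308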
par(mfrow = c(1, 2))
boxplot(data[, 1], main = "Inner fences", cex.main = 0.8)
boxplot(data[, 1], range = 3, main = "Outer fences", cex.main = 0.8)

# tabulate the boxplot statistics (whisker ends, quartiles and median);
# reconstruction of the broken lines
labels <- c("lower inner fence", "25th percentile", "median",
            "75th percentile", "upper inner fence")
stats  <- boxplot(data[, 1], plot = FALSE)$stats
data.frame(labels = labels, value = as.vector(stats))

##              labels value
## 1 lower inner fence   242
## 2   25th percentile   260
## 3            median   267
## 4   75th percentile   272
## 5 upper inner fence   290

labels <- c("lower outer fence", "25th percentile", "median",
            "75th percentile", "upper outer fence")
stats  <- boxplot(data[, 1], range = 3, plot = FALSE)$stats
data.frame(labels = labels, value = as.vector(stats))

##              labels value
## 1 lower outer fence   240
## 2   25th percentile   260
## 3            median   267
## 4   75th percentile   272
## 5 upper outer fence   303

# the rest of the chapter uses the call-center employee dataset
input.file  <- "data/ch10/CallCenter.csv"
data        <- read.csv(input.file, header = TRUE, stringsAsFactors = TRUE)
data.scaled <- scale(data[, 2:11])
distance    <- as.matrix(dist(data.scaled))
Method 1: indegree

# For each employee, count how often they appear among the k nearest neighbours
# of the other employees (k = 5, 10, 20); a low indegree flags an outlier.
# (Reconstruction of the broken listing.)
ranks  <- t(apply(distance, 1, rank))    # ranks[i, j]: rank of j by distance from i
top5s  <- colSums(ranks <= 6)  - 1       # subtract 1 for the self-match
top10s <- colSums(ranks <= 11) - 1
top20s <- colSums(ranks <= 21) - 1
temp   <- cbind(data, top5s, top10s, top20s)
temp   <- temp[order(temp[, 12], temp[, 13], temp[, 14]), ]
head(temp)
Method 2: k-distance

# Score each employee by the distance to their 5th-ranked neighbour (rank 1 is
# the employee itself); large values flag outliers.
ranks[ranks != 5] <- 0
ranks[ranks == 5] <- 1
temp <- rowSums(ranks * distance)
temp <- cbind(data, temp)
temp <- temp[order(temp[, 12], decreasing = TRUE), ]
head(temp)
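The third method discussed in the chapter, the local outlier factor, is available in the DMwR package (a hedged sketch; the choice of k = 5 is an assumption):

library(DMwR)
lof.scores <- lofactor(data.scaled, k = 5)
head(data[order(lof.scores, decreasing = TRUE), 1:6])   # most outlying employees first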
## (the top rows of the outlier-ordered employee table)
##       ID AvgTix Rating Tardies Graveyards Weekends SickDays
## 1 137155 165.30   4.49       1          3        2        1
## 2 143406 145.00   2.33       3          1        0        6
##   PercSickOnFri EmployeeDevHrs ShiftSwapsReq ShiftSwapsOffered
## 1          0.00             30             1                 7
## 2          0.83             30             4                 0
This chapter introduces some basic commands in R and then works out the k-means model, the regression model, the random forests model, the forecasting model and the outlier detection methods in R. In each case, the author provides abundant commentary on the packages and commands being used. The packages used to replicate the Excel results from the previous chapters include
skmeans
randomForest
ROCR
forecast
DMwR
11 Conclusion
The book concludes with some soft skills that a data scientist should have in order to be effective in an organization.