0% found this document useful (1 vote)

344 views53 pages

Data Smart Book Summary

The document summarizes a book about using data science concepts to analyze information. It discusses clustering customers using k-means clustering and analyzing transactions. It also covers other techniques like naive Bayes classification, regression, and time series forecasting. R code is provided to replicate analyses in the book.

Uploaded by

Raimundo

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (1 vote)

344 views53 pages

Data Smart Book Summary

Uploaded by

Raimundo

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 53

Data Smart

Using Data Science to Transform Information into Insight

RK
April 29, 2014

Abstract
The purpose of this document is to summarize the book Data Smart, written by John W. Foreman
and provide some additional R code to work with the book.

Data Smart

Summary
Data Science is a very loose word and can mean different things in different situations. However one thing is
certain, the principles used in tacking problems are from diverse fields. Drew Conway has this Venn diagram
on his blog :

In such a diverse field one does not know where to start and how to start. Someone has made a nice Metromap
too. All said and done, this is a field that has considerable entry barriers. One needs to spend at least a few
years to get the basics right to understand some basic algorithms.
Where does this book fit in? This book is apt for people who want to see whats going on behind various
algorithms without the math. The book touches upon a dozen topics in data mining and explains the main
principles of each of those topics via Excel. By restricting to Excel, the author enables a wider audience to
get a glimpse of the various concepts. The ideal way to to read this book is by working out the various case
studies that are mentioned in the book. I could not motivate myself to do the analysis in Excel, so replicated
the analysis in R. In this document I have listed down some of the code to work through the book, that essentially replicates the results of the analysis done via Excel. But first a brief summary of the chapters in the book.
Chapter 1 is on Excel and can be speedread as I cannot imagine someone reading this book without ever
working on Excel. Chapter 2 discusses k-means clustering. It uses an offer-purchases dataset to segment
the customers in to various clusters for better marketing. The k-means needs a distance metric and there are
many to choose from based on the situation. The book shows that for the specific dataset used, correlation
based distance or cosine similarity score is a better metric than euclidean distance.
Chapter 3 is on Naive Bayes, a simple method that surprisingly performs better than many other algorithms. In fact the reason for its ubiquity stems from its simplicity; it does not overfit the data.Naive Bayes
principle is applied on a set of tweets to classify them as business-related or junk. Obviously there is not much
of math in this book as expected. So, the results from this chapter will motivate anyone to understand the

Data Smart

reason why Naive Bayes works and understand why bias-variance tradeoff works very differently in a classification setting than a regression setting.
Chapter 4 is about optimization, the quintessential skillset that any data scientist needs to have. Using a
case study, the author introduces Linear Programming, Integer programming, Mixed Integer programming
and ways to convert a nonlinear optimization problem in to Linear Optimization problem. The good thing
about this book and this chapter in particular is that there is a good sense of humor that the author brings
along while explaining principles. That makes the book an immensely readable book.
Chapter 5 discusses graph analysis and uses the same dataset from one of the previous chapters to do an
unsupervised learning. k-neighborhood and Modularity maximization procedures are used to group the customers in to communities. Even though Gephi is used for Visualization, igraph is powerful enough to give
all the visualization features to an R user. Chapter 6 is about regression. The book uses a sample dataset
to explain the concepts of regression and logistic regression.All the creation of dummy variables, setting up
the objective function etc. are done in Excel and the reader is made to understand the basic steps behind
regression modeling.
Chapter 7 gives the reader an insight in to wisdom of crowds type models. The models discussed are
Random Forest and Boosting. A reader who reaches until this point of the book is abundantly convinced that
Excel is too painful use boosting techniques, where every model built on a bootstrapped sample has to be
recorded as a macro and one has to run it manually to get estimates. In any case, the chapter does a wonderful
job of explaining the nuts and bolts of Boosting.
Chapter 8 gives a crash course on exponential smoothing. It starts off with simple exponential smoothing
and then moves on to Holts trend-corrected exponential smoothing and finally ending with multiplicative
Holt-Winters exponential smoothing. The basic limitation of these models is that there is no probabilistic
framework around them. Hyndman has written a book on Exponential smoothing where he casts all the
models in a State space framework that makes the models far more richer.
Chapter 9 talks about outlier detection and introduces three methods: indegree method, k-distance method
, local outlier factor method. Chapter 10 introduces some basic commands in R and then works out the
k-means model, the regression model, the random forests model, forecasting model and outlier detection methods in R. Chapter 11 is the concluding chapter in the book that talks about some soft skills that a data
scientist should have in order to be effective in an organization.

Whats in this document ?

The author does provide R code to work through some of the chapters covered in the book. In one sense, he
handholds the reader in running functions from various libraries. However there are certain sections of the
book for which R code is not made available or some different software like Gephi is used. In this document,
I have included some additional comments for each of the chapters and R code to replicate almost all the
analysis that is done through out the book.

Data Smart

Contents
1

Everything You Ever Needed to Know about Spreadsheets but Were Too Afraid to Ask

Cluster Analysis Part I: Using K-Means to Segment Your Customer Base

3 Naive Bayes and the Incredible Lightness of Being an Idiot

4 Optimization Modeling: Because That Fresh Squeezed Orange Juice Aint Gonna Blend
Itself
5 Cluster Analysis Part II : Network Graphs and Community Detection

15
30

6 The Granddaddy of Supervised Artificial Intelligence Regression

7 Ensemble Models: A Whole Lot of Bad Pizza

8 Forecasting: Breathe Easy; You Cant Win

9 Outlier Detection: Just Because Theyre Odd Doesnt Mean Theyre Unimportant

10 Moving from Spreadsheets into R

11 Conclusion

Data Smart

Everything You Ever Needed to Know about Spreadsheets but

Were Too Afraid to Ask

The first chapter is a basic crash course on excel that teaches common functions to an excel newbie. The
functions explained are sort, match, index, offset, small, vlookup, filtering, sorting, pivot tables, array formulas,
solver. It also suggests that the reader install OpenSolver(a solver on steroids) plug-in to work through some
of the content in the book.

Cluster Analysis Part I: Using K-Means to Segment Your Customer Base

This chapter is about clustering a set of customers based on their transactions done in response to various deals.
Each deal has certain characteristics and there are p offers over a certain time period. In the ML literature,
p is the number of features. Every customer can be viewed in this p dimensional space.The j th component is
assigned 0 or 1 based on a customer response to the j th deal. The example describes the algorithm in plain
english. However the precise algo is also not very mathematical. It goes like this(for K clusters):
1. Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster
assignments for the observations.
2. Iterate until the cluster assignments stop changing
(a) For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of
the p feature means for the observations in the kth cluster
(b) Assign each observation to the cluster whose centroid is closest.
Why does the above work ? For whatever cluster configurations, one needs to formulate an objective function
and minimize/maximize it. In this case, you are trying to minimize within-cluster variation. If we denote the
within-cluster variation for Ck is a measure W (Ck ) then the objective function is
min

C1 ,C2 ,...CK

A common choice for W (Ck ) is

K
X

W (Ck )

k=1

p
X X
1
(xij xi0 j )2
|Ck | 0
j=1
i,i Ck

The within-cluster variation W (Ck )is the sum of pairwise squared Euclidean distances between the observations of all the kthe cluster.
If you look at the algo and the method that is followed in the book, there is no mention of computing pairwise
distances in each cluster. Instead the distance minimized are the distances to the centroid. The key identity
that makes the pairwise distance computation redundant in each cluster is the following
p
p
X X
XX
1
(xij xi0 j )2 = 2
(xij xkj )2
|Ck | 0
j=1
j=1
i,i Ck

iCk

where xjk denotes the mean of feature j in cluster k. Ok, now crunching the data.

Data Smart

input.file

<- "data/ch02/deals.csv"

deals

<- read.csv(input.file,stringsAsFactors = FALSE,header = TRUE)

input.file

<- "data/ch02/offer.csv"

offers

<- read.csv(input.file,stringsAsFactors = FALSE,header = TRUE)

colnames(offers)
offers$value

<- c("name","offer")
<- 1

offers.data

<- cast(offers, name~offer, sum)[,2:33]

rownames(offers.data) <- cast(offers, name~offer, sum)[,1]

set.seed(1)
km.out

<- kmeans(offers.data,4,nstart=25)

deals.temp

<- cbind(deals, (t(km.out$centers)))

Analyzing based on first cluster centre

deals.temp[order(deals.temp$"1",decreasing = TRUE),1:6][1:10,]
##

Campaign

## 30 December

Variety Minimum Discount

Origin PastPeak

Malbec

France

FALSE

## 29 November Pinot Grigio

France

FALSE

## 7

March

Prosecco

Australia

TRUE

## 8

March

Espumante

45 South Africa

FALSE

## 18
## 13

July
May

Espumante
Merlot

6
6

50
43

Oregon
Chile

FALSE
FALSE

## 1

January

Malbec

France

FALSE

## 2

January

Pinot Noir

France

FALSE

## 3

February

Espumante

144

Oregon

TRUE

## 4

February

Champagne

France

TRUE

Analyzing based on second cluster centre

deals.temp[order(deals.temp$"2",decreasing = TRUE),1:6][1:10,]
##
## 22

Campaign

Variety Minimum Discount Origin PastPeak

August

Champagne

63 France

FALSE

## 31 December

Champagne

89 France

FALSE

Prosecco

144

Chile

FALSE

## 6

March

## 1

January

Malbec

56 France

FALSE

May

Champagne

85 France

FALSE

February
February

Espumante
Champagne

144
72

32 Oregon
48 France

TRUE
TRUE

## 14

June

Merlot

Chile

FALSE

## 15

June Cabernet Sauvignon

144

Italy

FALSE

54 France

FALSE

## 11
## 3
## 4

## 30 December

Malbec

Analyzing based on third cluster centre

Data Smart

deals.temp[order(deals.temp$"3",decreasing = TRUE),1:6][1:10,]
##

Campaign

Variety Minimum Discount

## 24 September Pinot Noir

## 26
## 17

Origin PastPeak

Italy

FALSE

October Pinot Noir

144

Australia

FALSE

July Pinot Noir

Germany

FALSE

## 2

January Pinot Noir

France

FALSE

## 1

January

Malbec

France

FALSE

April

Prosecco

## 23 September Chardonnay

144

## 10

California

FALSE

39 South Africa

FALSE

## 27

October

Champagne

New Zealand

FALSE

## 3

February

Espumante

144

Oregon

TRUE

## 4

February

Champagne

France

TRUE

Analyzing based on fourth cluster centre

deals.temp[order(deals.temp$"4",decreasing = TRUE),1:6][1:10,]
##

Campaign

## 31 December
## 4
## 9

Variety Minimum Discount

Champagne

France

FALSE

February
April

Champagne
Chardonnay

72
144

48
57

France
Chile

TRUE
FALSE

May

Champagne

France

FALSE

## 11
## 6

March

Prosecco

144

## 8

March

Espumante

## 14

June

Merlot

## 16

June

Merlot

## 20
## 27

Origin PastPeak

August Cabernet Sauvignon

October

Champagne

Chile

FALSE

45 South Africa

FALSE

Chile

FALSE

California

FALSE

Italy

FALSE

New Zealand

FALSE

As one can see, the above analysis does not given any conclusive results. Instead one can look at deal counts
in each cluster
cluster

data.frame(name = (rownames(offers.data)),
cluster = km.out$cluster)

deals.by.cluster

<- merge(offers, cluster, all.x= T)

temp

<- cast(deals.by.cluster, offer~cluster, sum)

temp

<- cbind(deals,temp)

The first cluster is small timers

temp[order(temp$"1",decreasing = TRUE),1:6][1:10,]
##

Campaign

## 30 December

Variety Minimum Discount

Origin PastPeak

Malbec

France

FALSE

## 29 November Pinot Grigio

France

FALSE

Data Smart

## 7

March

Prosecco

## 8

March

Espumante

45 South Africa

FALSE

## 18

July

Espumante

Oregon

FALSE

## 13

May

Merlot

Chile

FALSE

January

Malbec

France

FALSE

## 2

January

Pinot Noir

France

FALSE

## 3

February

Espumante

144

Oregon

TRUE

## 4

February

Champagne

France

TRUE

## 1

Australia

TRUE

The second cluster is not clear

temp[order(temp$"2",decreasing = TRUE),1:6][1:10,]
##
## 22

Campaign

Variety Minimum Discount Origin PastPeak

August

Champagne

63 France

FALSE

## 31 December

Champagne

89 France

FALSE

Prosecco

144

Chile

FALSE

## 6

March

## 1

January

Malbec

56 France

FALSE

May

Champagne

85 France

FALSE

## 3

February

Espumante

144

32 Oregon

TRUE

## 4

February

Champagne

48 France

TRUE

## 14

June

Merlot

Chile

FALSE

## 15

June Cabernet Sauvignon

144

Italy

FALSE

54 France

FALSE

## 11

## 30 December

Malbec

The third cluster is Pinot Noir variety

temp[order(temp$"3",decreasing = TRUE),1:6][1:10,]
##

Campaign

Variety Minimum Discount

## 24 September Pinot Noir

Origin PastPeak

Italy

FALSE

## 26

October Pinot Noir

144

Australia

FALSE

## 17
## 2

July Pinot Noir

January Pinot Noir

12
72

47
17

Germany
France

FALSE
FALSE

## 1

January

Malbec

France

FALSE

April

Prosecco

## 23 September Chardonnay

144

## 10

California

FALSE

39 South Africa

FALSE

## 27

October

Champagne

New Zealand

FALSE

## 3

February

Espumante

144

Oregon

TRUE

## 4

February

Champagne

France

TRUE

The fourth cluster seems to like August Champaign

temp[order(temp$"4",decreasing = TRUE),1:6][1:10,]
##

Campaign

Variety Minimum Discount

Origin PastPeak

Data Smart

## 31 December

Champagne

France

FALSE

## 4

February

Champagne

France

TRUE

## 9

April

Chardonnay

144

Chile

FALSE

May

Champagne

France

FALSE

## 6

March

Prosecco

144

Chile

FALSE

## 8

March

Espumante

45 South Africa

FALSE

## 14

June

Merlot

Chile

FALSE

## 16

June

Merlot

California

FALSE

## 11

## 20
## 27

August Cabernet Sauvignon

October

Champagne

Italy

FALSE

New Zealand

FALSE

One can try K means for varying K and pick one of the k values. The chapter suggests another way to
compare the K means across various k values, i.e by computing a score for your clusters called the silhouette.
The following R code gives the metric. You can also use the silhouette function from the cluster package.
silhouette.rk

<- function(cluster,dist.euclidean){

clusters

<- sort(unique(cluster$cluster))

silh

<- numeric()

for(i in cluster$id){
temp

<- subset(cluster, id!=i)

temp.cluster

<- subset(cluster, id==i)$cluster

same.cluster

<- subset(temp, cluster == temp.cluster)

diff.cluster

<- subset(temp, cluster != temp.cluster)

i.star

<- pmin(i,same.cluster$id)

j.star

<- pmax(i,same.cluster$id)

within

<- mean(dist.euclidean[ n(i.star-1) i.star(i.star-1)/2 + j.star-i.star ])

neighbor

<- min( sapply( clusters[-temp.cluster],function(j)

{
i.star

<- pmin(i,subset(diff.cluster, cluster== j)$id)

j.star

<- pmax(i,subset(diff.cluster, cluster== j)$id)

mean(dist.euclidean[ n(i.star-1) - i.star(i.star-1)/2 + j.star-i.star ])

}
) )
silh <- c(silh , (neighbor-within)/max(within, neighbor))
}
mean(silh)
}
For K = 4 clusters, one can calculate silhouette as follows:
set.seed(1)
dist.euclidean

<- dist(offers.data)

<- attr(dist.euclidean, "Size")

Data Smart

km.out

<- kmeans(offers.data,4,nstart=25)

cluster

data.frame(name = (rownames(offers.data)),
cluster = km.out$cluster, id = 1:nrow(cluster) )

print(silhouette.rk(cluster,dist.euclidean))
## [1] 0.1243
print((summary(silhouette(km.out$cluster,dist.euclidean)))$avg.width)
## [1] 0.1243
For K = 5 clusters, one can calculate silhouette as follows:
set.seed(1)
km.out

<- kmeans(offers.data,5,nstart=25)

cluster

data.frame(name = (rownames(offers.data)),
cluster = km.out$cluster, id = 1:nrow(cluster) )

print(silhouette.rk(cluster,dist.euclidean))
## [1] 0.1231
print((summary(silhouette(km.out$cluster,dist.euclidean)))$avg.width)
## [1] 0.1231
The above metric shows that 5 clusters is no better than 4 clusters.
The chapter subsequently introduces a different way to do K means clustering, i.e. Spherical K means. This
is a method where the dissimilarity measure is based on correlation-based distance. The package in R that does
spherical K means is skmeans.
set.seed(1)
sk.out

<- skmeans(as.matrix(offers.data), 5, method="genetic")

cluster

data.frame(name = (rownames(offers.data)),
cluster = sk.out$cluster, id = 1:nrow(cluster) )

deals.by.cluster

<- merge(offers, cluster, all.x= T)

temp

<- cast(deals.by.cluster, offer~cluster, sum)

temp

<- cbind(deals,temp)

temp

<- cbind(deals,temp)

The first cluster is Pinot Noir gang

temp[order(temp$"1",decreasing = TRUE),1:6][1:10,]
##

Campaign

Variety Minimum Discount

## 24 September Pinot Noir

Origin PastPeak

Italy

FALSE

## 26

October Pinot Noir

144

Australia

FALSE

## 2

January Pinot Noir

France

FALSE

Data Smart

## 17
## 1

July Pinot Noir

Germany

FALSE

January

Malbec

France

FALSE

## 10

April

Prosecco

California

FALSE

## 12

May

Prosecco

Australia

FALSE

## 16

June

Merlot

California

FALSE

## 23 September Chardonnay

144

## 27

October

Champagne

39 South Africa

FALSE

New Zealand

The second cluster looks like small timers

temp[order(temp$"2",decreasing = TRUE),1:6][1:10,]
##
## 8

Campaign

Variety Minimum Discount

Origin PastPeak

March

Espumante

45 South Africa

FALSE

## 30 December

Malbec

France

FALSE

Espumante

Oregon

FALSE

## 29 November Pinot Grigio

France

FALSE

## 18
## 7

July
March

Prosecco

Australia

TRUE

May

Merlot

Chile

FALSE

## 6

March

Prosecco

144

Chile

FALSE

## 10

April

Prosecco

California

FALSE

## 11

May

Champagne

France

FALSE

## 21

August

Champagne

California

FALSE

## 13

The third cluster is is high volume deals segment

temp[order(temp$"3",decreasing = TRUE),1:6][1:10,]
##

Campaign

Variety Minimum Discount

Origin PastPeak

## 9

April

Chardonnay

144

Chile

FALSE

## 14

June

Merlot

Chile

FALSE

January

Malbec

France

FALSE

## 5

February Cabernet Sauvignon

144

New Zealand

TRUE

## 4

February

Champagne

France

TRUE

May

Champagne

France

FALSE

June Cabernet Sauvignon

144

Italy

FALSE
FALSE

## 1

## 11
## 15

## 23 September

Chardonnay

144

39 South Africa

## 26

October

Pinot Noir

144

Australia

FALSE

March

Prosecco

144

Chile

FALSE

## 6

The fourth cluster is France buyer segment

temp[order(temp$"4",decreasing = TRUE),1:6][1:10,]
##
## 22

Campaign
August

Variety Minimum Discount

Champagne

Origin PastPeak
France

FALSE

Data Smart

## 31 December

Champagne

France

## 4

February

## 6

March

Champagne

France

TRUE

Prosecco

144

Chile

FALSE

May

Champagne

France

FALSE

## 3

February

Espumante

144

Oregon

TRUE

## 27

October

## 11

FALSE

Champagne

New Zealand

FALSE

## 28 November Cabernet Sauvignon

France

TRUE

## 1

January

Malbec

France

FALSE

## 8

March

Espumante

45 South Africa

FALSE

The fifth cluster are those who buy only sparkling wine
temp[order(temp$"5",decreasing = TRUE),1:6][1:10,]
##
## 7

Campaign

Variety Minimum Discount

Origin PastPeak

March

Prosecco

Australia

TRUE

## 29 November

Pinot Grigio

France

FALSE

## 30 December

Malbec

France

FALSE

## 18

July

Espumante

## 10

April

Prosecco

## 13

May

Merlot

February

## 3

Oregon

FALSE

52 California

FALSE

Chile

FALSE

Espumante

144

Oregon

TRUE

## 12

May

Prosecco

Australia

FALSE

## 19

July

Champagne

Germany

FALSE

## 28 November Cabernet Sauvignon

France

TRUE

Data Smart

Naive Bayes and the Incredible Lightness of Being an Idiot

This chapter talks about using a particular form of Bayes theorem that makes it extremely easy to filter
specific type of messages. The Author uses a set of tweets that has words relating to an App. Needless to say
there is a lot of noise in twitter data. Given a tweet, how does one identify whether the tweet is about a specific
App or about something else ?. This question is taken up in the chapter where Bayes is used to compute the
posterior probabilities of a App given a tweet and posterior probabilities of a Noise given a tweet, compares
both the probabilities and assigns it to the respective class. The logic is extremely simple. It is based on bag
of words model. The following are the basic steps :
Model Building
Collect all the words from the app tweets and compute the likelihood of the word given it comes from

an app tweet
Collect all the words from the non-app tweets and compute the likelihood of the word given it comes

from an non-app tweet

To deal with floating point underflow, calculate log likelihood of all probabilities
Do additive smoothing to take care of rare words

Model Prediction
For any test tweet, compute the posterior probability of app given test tweet and compute the posterior

probability of non-app given test tweet

Compute the Bayes factor and report the prediction class

The whole model can be coded in a few lines of R code

cleanup <- function(x){
x

gsub("[.]\\s"," ",x)

<- gsub("[?!,;]"," ",x)

<- strsplit(x,"(\\s)+")

return(x[[1]])
}
getFrequencyCounts

<- function(input.file){

tweets

<- read.csv(input.file,stringsAsFactors = FALSE,header = FALSE)

tweets

<- apply(tweets, 1, tolower)

words

<- sapply(tweets,cleanup)

words

<- do.call("c",words)

wordcount

<- table(words)

wordfreq

<- as.data.frame(wordcount)

colnames(wordfreq) <- c("words","freq")

wordfreq$freq

<- wordfreq$freq + 1

total

<- sum(wordfreq$freq)

wordfreq$logprob

<- log(wordfreq$freq/total)

result
return(result)

<- list(logprob = wordfreq, total = total)

}
13

Data Smart

input.file

<- "data/ch03/app.csv"

app.res

<- getFrequencyCounts(input.file)

input.file
other.res

<- "data/ch03/other.csv"
<- getFrequencyCounts(input.file)

input.file

<- "data/ch03/test_set.csv"

tweets

<- read.csv(input.file,stringsAsFactors = FALSE,header = FALSE)

tweets

<- apply(tweets, 1, tolower)

words

<- sapply(tweets,cleanup)

getScores

<- function(input){

temp

<- data.frame ( words=input)

temp1

<- merge(temp,app.res$logprob, all.x=T)

temp1$logprob[is.na(temp1$logprob)] <- -log(app.res$total)

app.score

<- sum(temp1$logprob)

temp2

<- merge(temp,other.res$logprob, all.x=T)

temp2$logprob[is.na(temp2$logprob)] <- -log(other.res$total)

other.score

<- sum(temp2$logprob)

return(app.score > other.score)

}
predicted.result

<- sapply(words,getScores)

attributes(predicted.result)

<- NULL

model.predict

<- cbind(ifelse(predicted.result==TRUE,"APP","OTHERS"))

actual.result

<- c(rep(TRUE,10),rep(FALSE,10))

model.actual

<- c(rep("APP",10),rep("OTHERS",10))

model.comparison

<- data.frame(predict = model.predict, actual = model.actual)

model.comparison$value

<- 1

Here is the confusion matrix

print(cast(model.comparison,predict~actual,length))
##

predict APP OTHERS

## 1

APP

## 2

OTHERS

Naive Bayes seems to do be doing really well for this toydata set with just 1 false positive.

Data Smart

Optimization Modeling: Because That Fresh Squeezed Orange

Juice Aint Gonna Blend Itself

The author explains the principles of LP via a case study. The case involves deciding the procurement of juice
in the Jan, Feb, Mar across 11 varieties. There a set of constraints under which the procurement can be made.
input.file

<- "data/ch04/specs.csv"

data

<- read.csv(input.file,stringsAsFactors = FALSE,header = TRUE)

colnames(data) <- c("variety","region","qty",

"ba","acid","astr","colr",
"price","shipping")
head(data)
##

variety

region

qty

## 1

Hamlin

Brazil

672 10.5 0.01

500

100

## 2

Mosambi

India

400

6.5 0.01

310

150

Florida 1200 12.0 0.01

750

168 11.0 0.01

600

## 3 Valencia
## 4

Hamlin California

ba acid astr colr price shipping

## 5

Gardner

Arizona

84 12.0 0.01

600

## 6

Sunstar

Texas

210 10.0 0.01

625

Objective & Constraints

obj

<- rep(data$price + data$shipping, times = 3)

lhs

<- matrix(data = 0, nrow= 100, ncol = 33)

rhs

<- matrix(data = 0, nrow= 100, ncol = 1)

dir

<- matrix(data = "", nrow= 100, ncol = 1)

reqd

<- c(600,600,700)

<- "<="

<- ">="

<- "="

<- 1

# Total procurement constraint

for(k in 1:3) {
if(k==1){
temp = c(rep(1,11), rep(0,11),rep(0,11))
}
if(k==2){
temp = c(rep(0,11), rep(1,11),rep(0,11))
}
if(k==3){
15

Data Smart

temp = c(rep(0,11), rep(0,11),rep(1,11))

}
lhs[j,] <- temp
rhs[j]

<- reqd[k]

dir[j]
j

<- et
<- j+1

# Valencia constraint
lhs[j,] <- c(rep(0,2),1,rep(0,8), rep(0,22))
rhs[j]

<- 240

dir[j]

<- gt

<- j+1

lhs[j,] <- c(rep(0,11),rep(0,2),1,rep(0,8), rep(0,11))

rhs[j]

<- 240

dir[j]

<- gt

<- j+1

lhs[j,] <- c(rep(0,22),rep(0,2),1,rep(0,8) )

rhs[j]

<- 280

dir[j]

<- gt

<- j+1

# Availability
j

<- 7

for(i in 1:11) {
x

<-rep(0,11)

x[i]
lhs[j,]

<- 1
<- rep(x,times= 3)

dir[j]

<- lt

rhs[j]

<- data$qty[i]

<- j+1

}
lowerlim

<- c(11.5,0.0075,0,4.5)

upperlim

<- c(12.5,0.01,4,5.5)

for(k in 1:3) {
if(k==1){

Data Smart

temp = c(rep(1,11), rep(0,11),rep(0,11))

}
if(k==2){
temp = c(rep(0,11), rep(1,11),rep(0,11))
}
if(k==3){
temp = c(rep(0,11), rep(0,11),rep(1,11))
}
l <- 1
for(l in 1:4){
temp1

<- temp*data[,l+3]/reqd[k]

lhs[j,] <- temp1

rhs[j]

<- lowerlim[l]

dir[j]

<- gt

<- j+1

}
u <- 1
for(u in 1:4){
temp2

<- temp*data[,u+3]/reqd[k]

lhs[j,]

<- temp2

rhs[j]

<- upperlim[u]

dir[j]

<- lt

<- j+1

}
}
lhs

<- lhs[1:(j-1),]

rhs

<- rhs[1:(j-1)]

dir

<- dir[1:(j-1)]

sol

<- lp(objective.in = obj, const.mat = lhs,

const.rhs = rhs,const.dir = dir)

df
<- (matrix(sol$sol, nrow=11, ncol = 3))
colnames(df) <- c("Jan","Feb","Mar")
rownames(df) <- as.character(data$variety)
sol$objval
## [1] 1226505
The solution from solving the constraint optimization is
df
##
## Hamlin

Jan Feb
0.00
0

Mar
0.00

## Mosambi

0.00

Data Smart

## Valencia

240.00 240 280.00

## Hamlin

124.59

43.41

## Gardner

0.00

## Sunstar

3.00

0.00

## Jincheng

27.00

0.00

## Berna

49.76

0 118.24

## Verna

89.65 138

## Biondo Commune
## Belladonna

0.00

72.35

24 186.00

66.00 114

0.00

Objective becomes constraint

The previous LP problem is cast in recast where there is a constraint on the budget and there is leeway in
relaxing quality standards.
lhs

<- matrix(data = 0, nrow= 100, ncol = 33)

rhs

<- matrix(data = 0, nrow= 100, ncol = 1)

dir

<- matrix(data = "", nrow= 100, ncol = 1)

reqd

<- c(600,600,700)

<- "<="

<- ">="

et
j

<- "="
<- 1

# Total procurement constraint

for(k in 1:3) {
if(k==1){
temp = c(rep(1,11), rep(0,11),rep(0,11))
}
if(k==2){
temp = c(rep(0,11), rep(1,11),rep(0,11))
}
if(k==3){
temp = c(rep(0,11), rep(0,11),rep(1,11))
}
lhs[j,] <- temp
rhs[j]

<- reqd[k]

dir[j]

<- et

<- j+1

}
18

Data Smart

# Valencia constraint
lhs[j,] <- c(rep(0,2),1,rep(0,8), rep(0,22))
rhs[j] <- 240
dir[j]

<- gt

<- j+1

lhs[j,] <- c(rep(0,11),rep(0,2),1,rep(0,8), rep(0,11))

rhs[j]

<- 240

dir[j]

<- gt

<- j+1

lhs[j,] <- c(rep(0,22),rep(0,2),1,rep(0,8) )

rhs[j]

<- 280

dir[j]

<- gt

<- j+1

# Availability
j

<- 7

for(i in 1:11) {
x

<-rep(0,11)

x[i]

<- 1

lhs[j,]

<- rep(x,times= 3)

dir[j]

<- lt

rhs[j]

<- data$qty[i]

<- j+1

}
extra.var

<- matrix(0,nrow = 100, ncol = 4)

lowerlim
upperlim

<- c(11.5,0.0075,0,4.5)
<- c(12.5,0.01,4,5.5)

delta

<- upperlim- lowerlim

for(k in 1:3) {
if(k==1){
temp = c(rep(1,11), rep(0,11),rep(0,11))
}
if(k==2){
temp = c(rep(0,11), rep(1,11),rep(0,11))
}
if(k==3){
temp = c(rep(0,11), rep(0,11),rep(1,11))

Data Smart

}
l <- 1
for(l in 1:4){
temp1

<- temp*data[,l+3]/reqd[k]

lhs[j,] <- temp1

extra.var[j,l] <- delta[l]
rhs[j]

<- lowerlim[l]

dir[j]

<- gt

<- j+1

}
u <- 1
for(u in 1:4){
temp2

<- temp*data[,u+3]/reqd[k]

lhs[j,]

<- temp2

extra.var[j,u] <- - delta[u]

rhs[j]

<- upperlim[u]

dir[j]

<- lt

<- j+1

}
}
lhs[j,]

<- rep(data$price + data$shipping, times = 3)

dir[j,]

<- lt

BUDGET

rhs[j]

<- BUDGET

lhs

<- lhs[1:j,]

rhs

<- rhs[1:j]

dir

<- dir[1:j]

extra.var
lhs

<- extra.var[1:j,]
<- cbind(lhs, extra.var)

obj

<- c(rep(0,33),rep(0.25,4))

sol

<- lp(objective.in = obj, const.mat = lhs,

<- (matrix(sol$sol[1:33], nrow=11, ncol = 3))

relax

<- (matrix(sol$sol[34:37], nrow=4, ncol = 1))

1170000

const.rhs = rhs,const.dir = dir)

colnames(df) <- c("Jan","Feb","Mar")

rownames(df) <- as.character(data$variety)
sol$objval
## [1] 0.2793
20

Data Smart

For the specified budget, the relaxation in quality constraints is

df
##

Jan

Feb

Mar

## Hamlin

0.00

0.000

0.00

## Mosambi

68.53

## Valencia

2.842 216.63

240.00 240.000 280.00

## Hamlin

0.00

0.000

0.00

## Gardner

0.00

0.000

0.00

## Sunstar

0.00

0.000

0.00

## Jincheng

0.00

0.000

0.00

## Berna

13.89

## Verna

0.000 148.11

111.79 188.211

0.00

## Biondo Commune

94.89 115.105

0.00

## Belladonna

70.89

53.842

55.26

relax
##

[,1]

## [1,] 0.4095
## [2,] 0.0000
## [3,] 0.0000
## [4,] 0.5874
mean(relax)
## [1] 0.2492
One can also draw a curve between various budget levels and quality deterioration.
BUDGET

<- seq(1226505,1110000,-10000)

quality

<- numeric(length(BUDGET))

for(b in seq_along(BUDGET)){
rhs[j]

<- BUDGET[b]

sol

<- lp(objective.in = obj, const.mat = lhs,

const.rhs = rhs,const.dir = dir)

relax

<- (mean(sol$sol[34:37]))

quality[b]

relax

par(mfrow=c(1,1))
plot(BUDGET/1000,quality, type ="l", xlab="K$",
ylab = "Broadening of Quality

bands ",col="blue",cex.lab=0.8)

0.6
0.4
0.2
0.0

Broadening of Quality bands

Data Smart

1140

1160

1180

1200

1220

Figure 4.1: Quality Vs Cost

MiniMax
lhs

<- matrix(data = 0, nrow= 100, ncol = 33)

rhs

<- matrix(data = 0, nrow= 100, ncol = 1)

dir

<- matrix(data = "", nrow= 100, ncol = 1)

reqd

<- c(600,600,700)

<- "<="

<- ">="

<- "="

<- 1

# Total procurement constraint

for(k in 1:3) {
if(k==1){
temp = c(rep(1,11), rep(0,11),rep(0,11))
}
if(k==2){
temp = c(rep(0,11), rep(1,11),rep(0,11))
}
if(k==3){
temp = c(rep(0,11), rep(0,11),rep(1,11))
}

Data Smart

lhs[j,] <- temp

rhs[j]

<- reqd[k]

dir[j]

<- et

<- j+1

# Valencia constraint
lhs[j,] <- c(rep(0,2),1,rep(0,8), rep(0,22))
rhs[j]

<- 240

dir[j]

<- gt

<- j+1

lhs[j,] <- c(rep(0,11),rep(0,2),1,rep(0,8), rep(0,11))

rhs[j]

<- 240

dir[j]

<- gt

<- j+1

lhs[j,] <- c(rep(0,22),rep(0,2),1,rep(0,8) )

rhs[j]

<- 280

dir[j]

<- gt

<- j+1

# Availability
j

<- 7

for(i in 1:11) {
x

<-rep(0,11)

x[i]

<- 1

lhs[j,]

<- rep(x,times= 3)

dir[j]

<- lt

rhs[j]
j

<- data$qty[i]
<- j+1

}
extra.var

<- matrix(0,nrow = 100, ncol = 4)

lowerlim

<- c(11.5,0.0075,0,4.5)

upperlim

<- c(12.5,0.01,4,5.5)

delta

<- upperlim- lowerlim

for(k in 1:3) {
if(k==1){

Data Smart

temp = c(rep(1,11), rep(0,11),rep(0,11))

}
if(k==2){
temp = c(rep(0,11), rep(1,11),rep(0,11))
}
if(k==3){
temp = c(rep(0,11), rep(0,11),rep(1,11))
}
l <- 1
for(l in 1:4){
temp1

<- temp*data[,l+3]/reqd[k]

lhs[j,] <- temp1

extra.var[j,l] <- delta[l]
rhs[j]

<- lowerlim[l]

dir[j]

<- gt

<- j+1

}
u <- 1
for(u in 1:4){
temp2

<- temp*data[,u+3]/reqd[k]

lhs[j,]

<- temp2

extra.var[j,u] <- - delta[u]

rhs[j]

<- upperlim[u]

dir[j]

<- lt

<- j+1

}
}
lhs[j,]

<- rep(data$price + data$shipping, times = 3)

dir[j]

<- lt

BUDGET
rhs[j]

<- 1170000
<- BUDGET

lhs

<- lhs[1:(j+4),]

extra.var

<- extra.var[1:(j+4),]

extra.var[(j+1):(j+4),c(1:4)] <- diag(4)

dir[(j+1):(j+4)] <- lt
max.var

<- c(rep(0,j), rep(-1,4) )

lhs

<- cbind(lhs, extra.var,max.var)

rhs

<- rhs[1:(j+4)]

dir

<- dir[1:(j+4)]

Data Smart

obj

<- c(rep(0,37),1)

sol

<- lp(objective.in = obj, const.mat = lhs,

const.rhs = rhs,const.dir = dir)

df
relax

<- (matrix(sol$sol[1:33], nrow=11, ncol = 3))

<- (matrix(sol$sol[34:37], nrow=4, ncol = 1))

colnames(df)

<- c("Jan","Feb","Mar")

rownames(df)

<- as.character(data$variety)

sol$objval
## [1] 0.5874
For the specified budget, the relaxation in quality constraints is
df
##

Jan

Feb

Mar

## Hamlin

0.00

## Mosambi

73.11

## Valencia

0.00 214.89

240.00 240.00 280.00

## Hamlin

0.00

## Gardner
## Sunstar

0.00
0.00

0.00
0.00
0.00

## Jincheng

0.00

## Berna

0.00

0.00 162.00

## Verna

109.52 190.48

0.00

## Biondo Commune

92.62 117.38

0.00

## Belladonna

84.76

52.14

43.11

relax
##

[,1]

## [1,] 0.5874
## [2,] 0.0000
## [3,] 0.0000
## [4,] 0.5874
For some reason, I could not match the exact results for this section as given in the book. The book uses
OpenSolver plugin. The author then goes in to integer and mixed integer programming for the same dataset.

Mixed Integer Programming

lhs

<- matrix(data = 0, nrow= 100, ncol = 33)

rhs

<- matrix(data = 0, nrow= 100, ncol = 1)

dir

<- matrix(data = "", nrow= 100, ncol = 1)

Data Smart

reqd

<- c(600,600,700)

<- "<="

<- ">="

<- "="

<- 1

# Total procurement constraint

for(k in 1:3) {
if(k==1){
temp = c(rep(1,11), rep(0,11),rep(0,11))
}
if(k==2){
temp = c(rep(0,11), rep(1,11),rep(0,11))
}
if(k==3){
temp = c(rep(0,11), rep(0,11),rep(1,11))
}
lhs[j,] <- temp
rhs[j]

<- reqd[k]

dir[j]

<- et

<- j+1

# Valencia constraint
lhs[j,] <- c(rep(0,2),1,rep(0,8), rep(0,22))
rhs[j]
dir[j]

<- 240
<- gt

<- j+1

lhs[j,] <- c(rep(0,11),rep(0,2),1,rep(0,8), rep(0,11))

rhs[j]

<- 240

dir[j]

<- gt

<- j+1

lhs[j,] <- c(rep(0,22),rep(0,2),1,rep(0,8) )

rhs[j]

<- 280

dir[j]

<- gt

<- j+1

Data Smart

# Availability
j
<- 7
for(i in 1:11) {
x

<-rep(0,11)

x[i]

<- 1

lhs[j,]

<- rep(x,times= 3)

dir[j]

<- lt

rhs[j]

<- data$qty[i]

<- j+1

}
lowerlim

<- c(11.5,0.0075,0,4.5)

upperlim

<- c(12.5,0.01,4,5.5)

for(k in 1:3) {
if(k==1){
temp = c(rep(1,11), rep(0,11),rep(0,11))
}
if(k==2){
temp = c(rep(0,11), rep(1,11),rep(0,11))
}
if(k==3){
temp = c(rep(0,11), rep(0,11),rep(1,11))
}
l <- 1
for(l in 1:4){
temp1

<- temp*data[,l+3]/reqd[k]

lhs[j,] <- temp1

rhs[j]
dir[j]

<- lowerlim[l]
<- gt

<- j+1

}
u <- 1
for(u in 1:4){
temp2

<- temp*data[,u+3]/reqd[k]

lhs[j,]

<- temp2

rhs[j]

<- upperlim[u]

dir[j]

<- lt

<- j+1

}
}

Data Smart

lhs
rhs

<- lhs[1:(j-1),]
<- rhs[1:(j-1)]

dir

<- dir[1:(j-1)]

#---------- ADD THE INTEGER CONSTRAINTS

lhs1

<- matrix(data = 0, nrow= 100, ncol = 33)

rhs1

<- matrix(data = 0, nrow= 100, ncol = 1)

dir1

<- matrix(data = "", nrow= 100, ncol = 1)

<- 1

for(k in 1:3) {
if(k==1){
temp = c(rep(1,11), rep(0,11),rep(0,11))
}
if(k==2){
temp = c(rep(0,11), rep(1,11),rep(0,11))
}
if(k==3){
temp = c(rep(0,11), rep(0,11),rep(1,11))
}
lhs1[j,] <- temp
rhs1[j]

<- 4

dir1[j]

<- lt

<- j+1

}
lhs1

<- lhs1[1:(j-1),]

rhs1

<- rhs1[1:(j-1)]

dir1

<- dir1[1:(j-1)]

lhs2

<- cbind(lhs,matrix(0,nrow = dim(lhs)[1],ncol=33))

lhs3

<- cbind(matrix(0,nrow = dim(lhs1)[1],ncol=33),lhs1)

lhs4

<- rbind(lhs2,lhs3)

rhs2

<- c(rhs,rhs1)

dir2

<- c(dir,dir1)

<- 1

Data Smart

for(i in 1:33) {
x

<- rep(0,33)

x[i]

<- 1

<- rep(0,33)

y[i]
lhs4

<- 1
<- rbind(lhs4,c(-x,y*data$qty))

dir2

<- c(dir2,gt)

rhs2

<- c(rhs2,0)

obj

<- c(rep(data$price + data$shipping, times = 3),rep(0,33))

bvec

<- 34:66

sol

<- lp(objective.in = obj, const.mat = lhs4,

const.rhs = rhs2,const.dir = dir2,binary.vec = bvec)

<- (matrix(sol$sol, nrow=11, ncol = 3))

colnames(df) <- c("Jan","Feb","Mar")

rownames(df) <- as.character(data$variety)
sol$objval
## [1] 1230285
The solution from solving the constraint optimization that there should be only four suppliers every month.This
objective values is better than that given in the book.
round(df)
##

Jan Feb Mar

## Hamlin

## Mosambi

## Valencia

240 249 280

## Hamlin

0 168

## Gardner

## Sunstar

0 105

## Jincheng

## Berna

0 168

## Verna

83 147

## Biondo Commune 210

## Belladonna

80 100

Towards the end of the section, the chapter talks about risk and using basic simulation , the author tries
to get a handle on the risk distribution. I think the great thing about this chapter is the way the author
communicates the various ideas of linear programming using excel effortlessly with a healthy dose of humor.

Data Smart

Cluster Analysis Part II : Network Graphs and Community Detection

This chapter introduces Graph analysis via Excel and Graph visualization via Gephi. However one can stay
with in R and do the analysis and visualization all at one go. igraph package in R has extensive functionality
for graph processing. The example used to illustrate the principles is the same example that is used in the
chapter on clustering, i.e. wine data prospect clustering.
input.file

<- "data/ch02/deals.csv"

deals

<- read.csv(input.file,stringsAsFactors = FALSE,header = TRUE)

input.file

<- "data/ch02/offer.csv"

offers

<- read.csv(input.file,stringsAsFactors = FALSE,header = TRUE)

colnames(offers)

<- c("name","offer")

offers$value

<- 1

offers.data

<- cast(offers, name~offer, sum)[,2:33]

rownames(offers.data) <- cast(offers, name~offer, sum)[,1]

dist.cosine

<- dist(offers.data, method="cosine")

mat

<- 1- as.matrix(dist.cosine)

diag(mat)

<- 0

graph.wine

<- graph.adjacency(mat,mode = c("undirected"))

pct.80

<- quantile(mat[which(mat>0)],prob=0.8)

mat

<- apply(mat,1,function(z) ifelse(z>pct.80,1,0))

rownames(mat)

<- rownames(offers.data)

colnames(mat)

<- rownames(offers.data)

graph.wine

<- graph.adjacency(mat,mode = c("undirected"))

Data Smart

par(mfrow=c(1,1))
set.seed(12)
plot.igraph(graph.wine,
vertex.label.cex=0.8,vertex.label= rownames(offers.data),
layout=layout.fruchterman.reingold(graph.wine, niter=10000))

Gutierrez
Edwards
Phillips
Moore
Morris
Flores
Jenkins
Johnson
Anderson Cox
Bell
Cook
RussellCampbell
Smith
Gray

Jones
Ortiz
Sullivan
Torres
Wilson
Morgan
Hill

Perez
Myers
Lewis
James
Adams
Hughes
Diaz
Walker
Taylor
Cruz
Carter King
Brown
Bailey

Bennett StewartPerry

Rodriguez

Thompson

Watson
Robinson
Rivera
Green

Peterson
Murphy

Parker

Cooper
Mitchell Collins
Long

Clark
Ward
Martin

Fisher

Ross

Butler

Jackson

Price

Davis

Howard

Ramirez
Gonzalez

Nguyen

Scott

Morales

Rogers

Young

Foster

Miller

Richardson

Wood

Nelson

Harris

Evans

Thomas

Kelly
Garcia

Lee
Gomez

Reyes

Barnes

Brooks

Martinez

Allen Lopez

Baker
Williams
Roberts

Hall

White

Sanders

Wright
Turner

Sanchez
Reed
Powell

Hernandez

The above visualization is just the starting point. The author uses a method called modularity maximization
for identifying communities. There is a function in the igraph package. The good thing about the book, as
I keep repeating is, that you can see what exactly goes on behind the optimal.community function in excel.
The fact that the algo uses mixed integer programming and that is uses a certain metric as objective function
is all easy to see in an excel. Once you see that the workings in an excel, you will not hesitate to use a third
party function that does the job.

Data Smart

set.seed(1)
out

<- optimal.community(graph.wine, weights = NULL)

sizes(out)
## Community sizes
##

## 23 20 15 12

9 10 11 12 13 14

8 14

mem

<- (membership(out))

communities

<- data.frame(name = attr(mem,"names"), community = mem)

deals.by.community

<- merge(offers, communities, all.x= T)

temp

<- cast(deals.by.community, offer~community, sum)

temp

<- cbind(deals,temp)

Sum of orders by community

size.c <- colSums(temp[,8:21])
size.c
##

9 10 11 12 13 14

## 67 76 37 49 16 53

Community coded 2 with size 76: the high volume community

temp[order(temp$"2",decreasing = TRUE),1:6][1:10,]
##
## 9

Campaign
April

Variety Minimum Discount

Chardonnay

144

Origin PastPeak
Chile

FALSE

## 14

June

Merlot

Chile

FALSE

## 31

December

Champagne

France

FALSE

## 22

August

Champagne

France

FALSE

## 6

March

Prosecco

144

Chile

FALSE

June Cabernet Sauvignon

144

Italy

FALSE

Malbec

France

FALSE

February Cabernet Sauvignon

144

New Zealand

TRUE

39 South Africa

FALSE

## 15
## 1
## 5

January

## 23 September
## 4

February

Chardonnay

144

Champagne

France

TRUE

Community coded 1 with size 67: the small timer community

temp[order(temp$"1",decreasing = TRUE),1:6][1:10,]
##

Campaign

## 30 December

Variety Minimum Discount

Origin PastPeak

Malbec

France

FALSE

Prosecco

Australia

TRUE

## 29 November Pinot Grigio

France

FALSE

## 7

March

Data Smart

## 18

July

Espumante

## 8

Oregon

FALSE

March

Espumante

45 South Africa

FALSE

## 13

May

Merlot

Chile

FALSE

## 12

May

Prosecco

Australia

FALSE

January

Malbec

France

FALSE

## 2

January

Pinot Noir

France

FALSE

## 3

February

Espumante

144

Oregon

TRUE

## 1

Community coded 6 with size 53 : the France community

temp[order(temp$"6",decreasing = TRUE),1:6][1:10,]
##

Campaign

Variety Minimum Discount

Origin PastPeak

## 22

August

Champagne

France

FALSE

## 11

May

Champagne

France

FALSE

## 1

January

Malbec

France

FALSE

## 2

January

Pinot Noir

France

FALSE

## 28 November Cabernet Sauvignon

France

TRUE

Malbec

France

FALSE

Prosecco

83 Australia

FALSE

October Cabernet Sauvignon

## 30 December
## 12
## 25

May

Oregon

TRUE

## 31 December

Champagne

France

FALSE

## 4

Champagne

France

TRUE

February

Community coded 4 with size 49 : Champagne community

temp[order(temp$"4",decreasing = TRUE),1:6][1:10,]
##
## 22
## 6
## 3

Campaign

Variety Minimum Discount

August Champagne

Origin PastPeak

France

FALSE

Prosecco

144

Chile

FALSE

February Espumante

March

144

Oregon

TRUE

Prosecco

Australia

TRUE

## 19

July Champagne

Germany

FALSE

## 27

October Champagne

New Zealand

FALSE

## 4

February Champagne

France

TRUE

## 31 December Champagne

France

FALSE

45 South Africa

FALSE

## 7

March

## 8

March Espumante

## 10

April

Prosecco

6
72

California

Community coded 3 with size 37 ,the Pinot Noir community

temp[order(temp$"3",decreasing = TRUE),1:6][1:10,]
##

Campaign

## 24 September

Variety Minimum Discount

Pinot Noir

Origin PastPeak
Italy

FALSE

Data Smart

## 26
## 17

October

Pinot Noir

144

Australia

FALSE

July

Pinot Noir

Germany

FALSE

January

Pinot Noir

France

FALSE

## 12

May

Prosecco

Australia

FALSE

## 16

June

Merlot

California

FALSE

## 2

## 1

January

Malbec

France

FALSE

## 3

February

Espumante

144

Oregon

TRUE

## 4

February

Champagne

France

TRUE

## 5

February Cabernet Sauvignon

144

44 New Zealand

TRUE

plot(out,graph.wine)

Powell Reed

Sanchez

Sanders
Thomas

Garcia

Gomez

ScottFoster
Miller

Barnes

Wood

Williams
Collins Butler

Price

Lee
Clark

Fisher
Richardson

Martin
Jackson Hall
Mitchell

Roberts

Wright
Turner
Harris

Evans
Rogers

Allen
Lopez

Baker

Brooks
Long
Davis

Murphy

White
Martinez

Gray

Ramirez

Kelly

Nguyen
Morales

Howard

Peterson

Gonzalez
Young

Reyes

Rodriguez
Smith
Campbell
Russell
Anderson
Cox
Bell
Jenkins
Johnson Moore
Cook
Morris
Flores
Phillips

Rivera
Green Watson
Robinson
King Walker
Carter
Bailey
Taylor
Brown
Cruz
AdamsMyers
Diaz
James
Hughes
Lewis Perry Perez
Stewart
Hill
Morgan
Wilson
Bennett
Sullivan
Torres
Jones
Ortiz

Parker
Edwards
Cooper

Gutierrez

Ward

Nelson
ThompsonHernandez Ross

This nice visualization of the communities is the highlight of using Network analysis. Bootstrapped classification methods do not give this ease of interpretation.

Data Smart

The Granddaddy of Supervised Artificial Intelligence Regression

The chapter talks about the most widespread tool of any statistician - Regression. The dataset used to
illustrate the concepts is a set of customer records and their purchases. Regression is a supervised learning
algorithm and in this case, the criterion or the outcome variable is a dichotomous variable, i.e. whether the
customer is pregnant or not. Since the modeling is done via Excel, the author manually shows the way to
create dummy variables, use solver to compute the coefficients of the regression model. In the first go, the
regression variable is fit with outcome variable taking a numerical value of 0 or 1. Obviously this is not a good
idea when a linear regression model assumes the outcome to be a normal random variable and not a variable
whose support is between 0 and 1. One needs to set up manually the RSS function to minimize it. Using
solver one can get the coefficients, but one needs to do a manual set up in Excel to compute the following :
Total sum of squares

X
(yi y)2

TSS =

Residual Sum of Squares

RSS =

X
(yi yi )2
i

R test
2

R2 = 1

RSS
T SS

F test (Try deriving the following expression from scratch)

T SSRSS
(n1)(np)
RSS
np

Fstat =
Covariance matrix of the coefficients

Cov =
2 (X 0 X)1
t stat

tj =

(X 0 X)1
jj

All the hardwork that one needs to do in excel can be done in just a few lines of R code.
input.file

<- "data/ch06/Pregnancy.csv"

data

<- read.csv(input.file,stringsAsFactors = TRUE,header = TRUE)

test.file

<- "data/ch06/Pregnancy_Test.csv"

test

<- read.csv(test.file,stringsAsFactors = TRUE,header = TRUE)

fit1

<- lm(PREGNANT~., data)

summary(fit1)
##
## Call:
## lm(formula = PREGNANT ~ ., data = data)

Data Smart

##
## Residuals:
##

Min

Median

Max

## -0.9814 -0.3400 -0.0301

0.3008

0.9758

##
## Coefficients:
##

Estimate Std. Error t value Pr(>|t|)

## (Intercept)

0.4441

0.0260

17.10

< 2e-16 ***

## Implied.GenderM

-0.0715

0.0251

-2.85

0.00451 **

## Implied.GenderU
## Home.Apt..PO.BoxH

0.0268
-0.0146

0.0403
0.0250

0.67
-0.59

0.50615
0.55856

## Home.Apt..PO.BoxP

0.0133

0.0433

0.31

0.75884

## Pregnancy.Test

0.2164

0.0465

4.65

3.7e-06 ***

## Birth.Control

-0.2741

0.0348

-7.87

9.1e-15 ***

## Feminine.Hygiene

-0.2381

0.0343

-6.94

7.2e-12 ***

0.3456

0.0392

8.83

< 2e-16 ***

## Folic.Acid
## Prenatal.Vitamins

0.2941

0.0360

8.16

1.0e-15 ***

## Prenatal.Yoga

0.3253

0.0893

3.64

0.00028 ***

## Body.Pillow

0.1936

0.0894

2.17

0.03057 *

## Ginger.Ale

0.2299

0.0471

4.88

1.2e-06 ***

## Sea.Bands

0.1458

0.0698

2.09

0.03706 *

## Stopped.buying.ciggies

0.1605

0.0417

3.85

0.00013 ***

-0.1591

0.0404

-3.94

8.7e-05 ***

0.1647

0.0516

3.19

0.00146 **

## Cigarettes
## Smoking.Cessation
## Stopped.buying.wine
## Wine
## Maternity.Clothes

0.1878

0.0359

5.23

2.1e-07 ***

-0.2075

0.0366

-5.66

2.0e-08 ***

0.2399

0.0357

6.72

3.2e-11 ***

## --## Signif. codes:

0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##
## Residual standard error: 0.372 on 980 degrees of freedom
## Multiple R-squared:

0.458,Adjusted R-squared:

## F-statistic: 43.6 on 19 and 980 DF,

0.447

p-value: <2e-16

To draw the ROC curves for above model, one can use ROCR package
test.pred

<- predict(fit1,test)

pred.lm

<- prediction(test.pred,test$PREGNANT)

perf.lm

<- performance(pred.lm,"tpr","fpr")

plot(perf.lm,xlim=c(0,1),ylim=c(0,1))
abline(h=seq(0,1,0.1), col="grey",lty="dashed")
The above ROC curve under a good model should hug the top left corner. So, there is a chance to improve
this model. The fact that the predicted values of the model does not stay in [0, 1] means that the model needs
36

0.8
0.6
0.4
0.2
0.0

True positive rate

1.0

Data Smart

0.0

0.2

0.4

0.6

0.8

1.0

False positive rate

Figure 6.1: ROC curve

to be improved. The first tweak is to minimize a link function of the outcome variable than the outcome
variable itself. The chapter shows the incorrect way of handling this situation and then shows the right way
of estimating via likelihood procedure.Why does the MLE method work and the previous method fail is not
explained, but any data analyst newbie will realize that sum of squares minimization in general is driven by
a probability model and hence one needs to maximize log likelihood rather than any arbitrary function.
fit

<- glm(PREGNANT~., data= data, family=binomial())

summary(fit)
##
## Call:
## glm(formula = PREGNANT ~ ., family = binomial(), data = data)
##
## Deviance Residuals:
##

Min

Median

Max

## -3.201

-0.557

-0.025

0.513

2.866

##
## Coefficients:
##

Estimate Std. Error z value Pr(>|z|)

## (Intercept)

-0.34360

## Implied.GenderM
## Implied.GenderU

0.18075

-1.90

0.05731 .

-0.45388

0.19757

-2.30

0.02160 *

0.14194

0.30759

0.46

0.64447

## Home.Apt..PO.BoxH

-0.17293

0.19459

-0.89

0.37418

## Home.Apt..PO.BoxP

-0.00281

0.33643

-0.01

0.99333

## Pregnancy.Test

2.37055

0.52178

4.54

5.5e-06 ***

## Birth.Control

-2.30027

0.36527

-6.30

3.0e-10 ***

Data Smart

## Feminine.Hygiene

-2.02856

0.34240

-5.92

3.1e-09 ***

## Folic.Acid

4.07767

0.76189

5.35

8.7e-08 ***

## Prenatal.Vitamins

2.47947

0.36906

6.72

1.8e-11 ***

## Prenatal.Yoga

2.92297

1.14699

2.55

0.01082 *

## Body.Pillow

1.26104

0.86062

1.47

0.14285

## Ginger.Ale

1.93850

0.42673

4.54

5.6e-06 ***

## Sea.Bands

1.10753

0.67343

1.64

0.10005

## Stopped.buying.ciggies

1.30222

0.34235

3.80

0.00014 ***

-1.44302

0.37012

-3.90

9.7e-05 ***

1.79078
1.38389

0.51261
0.30588

3.49
4.52

0.00048 ***
6.1e-06 ***

-1.56554

0.34891

-4.49

7.2e-06 ***

2.07820

0.32943

6.31

2.8e-10 ***

## Cigarettes
## Smoking.Cessation
## Stopped.buying.wine
## Wine
## Maternity.Clothes
## --## Signif. codes:

0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##
## (Dispersion parameter for binomial family taken to be 1)
##
##

Null deviance: 1386.29

## Residual deviance:

744.11

on 999

degrees of freedom

on 980

degrees of freedom

## AIC: 784.1
##
## Number of Fisher Scoring iterations: 7
For testing the performance of the logistic regression, the following code can be used and ROC curves can be
drawn to get a comparitive performance
test.pred

<- predict(fit,newdata = test)

pred.logr

<- prediction(ilogit(test.pred),test$PREGNANT)

perf.logr

<- performance(pred.logr,"tpr","fpr")

plot(perf.lm,xlim=c(0,1),ylim=c(0,1),col="blue")
plot(perf.logr,xlim=c(0,1),ylim=c(0,1),add=T, col="red")
abline(h=seq(0,1,0.1), col="grey",lty="dashed")
legend("center",legend=c("Linear Regression","Logistic Regression"),fill=c("blue","red"),
cex=0.7)
In this case, there does not seem to be much difference in using Linear Vs Logistic regression. To be technically
correct, it is anyways better to use Logistic regression.

0.8
0.6
0.2

0.4

Linear Regression
Logistic Regression

0.0

True positive rate

1.0

Data Smart

0.0

0.2

0.4

0.6

0.8

1.0

False positive rate

Figure 6.2: ROC curve comparison

Ensemble Models: A Whole Lot of Bad Pizza

The models introduced in this chapter belong to the category of voting models. Such methods would have
been unthinkable a few decades ago. With fast and cheap computing power, the methods mentioned in this
chapter have become extremely popular. The chapters walks through two such methods
1. Random Forests
2. Boosting
Its really commendable that the author has taken pains to do everything in excel to show how the method
really work. I came across these methods in the statistical learning books by Trevor Hastie and Rob Tibshirani.
If you want the math behind it, the books by Trevor and Bob are priceless. However if you are the kind where
you want to see something running and then want to explore the math, this chapter is perfect.
The dataset used in this chapter is the same as that used in the previous chapter. The author spends less time
on data preparation and shows all the steps that are needed to run random bagging and boosting. In the last
chapter of the book, he provides the code for random forests but not boosting.
Here is the R code that replicates Random Forest results from the chapter
input.file

<- "data/ch06/Pregnancy.csv"

data

<- read.csv(input.file,stringsAsFactors = TRUE,header = TRUE)

test.file

<- "data/ch06/Pregnancy_Test.csv"

test

<- read.csv(test.file,stringsAsFactors = TRUE,header = TRUE)

data$PREGNANT

<- factor(data$PREGNANT)

test$PREGNANT

<- factor(test$PREGNANT)

data.rf

<- randomForest(PREGNANT~.,data=data,importance=TRUE,
replace=FALSE,maxnodes=2,mtry=17)

Data Smart

test.rf

<- predict(data.rf,test,type="prob")

pred.rf

<- prediction(test.rf[,2],test$PREGNANT)

perf.rf

<- performance(pred.rf,"tpr","fpr")

par(mfrow=c(1,2))
varImpPlot(data.rf, type=2,cex=0.6,main="")
plot(perf.rf,xlim=c(0,1),ylim=c(0,1),lty=2)

Folic.Acid

1.0

Birth.Control
Prenatal.Vitamins

Feminine.Hygiene
Prenatal.Yoga
Body.Pillow
Ginger.Ale
Sea.Bands

0.4

Pregnancy.Test

0.2

Home.Apt..PO.Box

0.6

True positive rate

Implied.Gender

0.8

Maternity.Clothes

Stopped.buying.ciggies

0.0

Cigarettes
Smoking.Cessation
Stopped.buying.wine
Wine

0.0
0

0.2

0.4

0.6

0.8

1.0

False positive rate

MeanDecreaseGini

Figure 7.1: Importance Plot

Here is the R code that replicates Boosting results from the chapter
input.file

<- "data/ch06/Pregnancy.csv"

data

<- read.csv(input.file,stringsAsFactors = TRUE,header = TRUE)

test.file

<- "data/ch06/Pregnancy_Test.csv"

test

<- read.csv(test.file,stringsAsFactors = TRUE,header = TRUE)

data.boost

<- gbm(PREGNANT~.,data=data, distribution="bernoulli")

test.boost

<- predict(data.boost,newdata = test,n.trees = 200,type="response")

pred.boost

<- prediction(test.boost,test$PREGNANT)

perf.boost

<- performance(pred.boost,"tpr","fpr")

Here is the list of influence variables

summary(data.boost,plotit=F)
##
## Folic.Acid

var rel.inf
Folic.Acid

70.713

Data Smart

## Birth.Control

Birth.Control

19.989

## Prenatal.Vitamins

Prenatal.Vitamins

4.938

## Maternity.Clothes

Maternity.Clothes

3.282

Stopped.buying.wine

1.078

Implied.Gender

0.000

Home.Apt..PO.Box

0.000

Pregnancy.Test

0.000

Feminine.Hygiene

0.000

Prenatal.Yoga

0.000

Body.Pillow
Ginger.Ale

0.000
0.000

Sea.Bands

0.000

## Stopped.buying.wine
## Implied.Gender
## Home.Apt..PO.Box
## Pregnancy.Test
## Feminine.Hygiene
## Prenatal.Yoga
## Body.Pillow
## Ginger.Ale
## Sea.Bands

## Stopped.buying.ciggies Stopped.buying.ciggies

0.000

## Cigarettes

Cigarettes

0.000

Smoking.Cessation

0.000

Wine

0.000

## Smoking.Cessation
## Wine

Here is the comparison of ROC curves

plot(perf.rf,xlim=c(0,1),ylim=c(0,1),col="blue")
plot(perf.boost,xlim=c(0,1),ylim=c(0,1),add=T, col="red")
abline(h=seq(0,1,0.1), col="grey",lty="dashed")

0.8
0.6
0.2

0.4

Random Forest
Boosting

0.0

True positive rate

1.0

legend("center",legend=c("Random Forest","Boosting"),fill=c("blue","red"), cex=0.7)

0.0

0.2

0.4

0.6

0.8

1.0

False positive rate

Figure 7.2: ROC curve

Data Smart

Forecasting: Breathe Easy; You Cant Win

This chapter starts off by introducing simple exponential smoothing that is basically a changing level model.Here
is the R code to replicate the results from the book.
input.file
data

<- "data/ch10/SwordDemand.csv"
<- read.csv(input.file,stringsAsFactors = TRUE,header = TRUE)

data.ts

<- ts(data, start=c(2010),frequency=12)

fit

<- HoltWinters(data.ts, beta=FALSE,

gamma=FALSE, l.start = 163)

The smoothing parameter and RMSE are

fit$alpha
## [1] 0.7297
sqrt(fit$SSE/35)
## [1] 20.39

par(mfrow=c(1,1))
plot.ts(data.ts , xaxt="n", lwd =2 , col = "grey",xlab="")
time

<- seq(as.Date("2010/1/1"),length.out = 36, by="1 month")

time

<- format(time, "%Y-%m")

axis(1, labels=time, at=time(data.ts))

points(fitted(fit), type="l",lty="dashed", col = "blue")
legend("topleft", legend=c("original time series","smoothed"),
col

= c("grey","blue"), lty = c(1,2), lwd = c(2,1), cex = 0.8)

par(mfrow=c(1,1))
plot.ts(data.ts , xaxt="n", lwd =2 , col = "grey",
xlim=c(2010,2014),xlab="")
timeidx <-seq( 2010, 2014,length.out = 48)
time
time

<- seq(as.Date("2010/1/1"),length.out = 48, by="1 month")

<- format(time, "%Y-%m")

axis(1, labels=time, at=timeidx)

points(fitted(fit), type="l",lty="dashed", col = "blue")
points(predict(fit, n.ahead=12), type="l", lty="dashed", col = "red", lwd

= 1)

legend("topleft", legend=c("original time series","smoothed","forecast"),

col

= c("grey","blue","red"), lty = c(1,2,2), lwd = c(2,1), cex = 0.8)

Subsequently, a trend corrected exponential smoothing is done

original time series

smoothed

150

200

250

SwordDemand

300

Data Smart

201001

201007

201101

201107

201201

201207

Figure 8.1: Sword Demand

fit

<- HoltWinters(data.ts, gamma=FALSE,l.start = 155.88,b.start = 0.8369 )

The smoothing parameter , and RMSE are

fit$alpha
## alpha
## 0.666
fit$beta
##

beta

## 0.05766
sqrt(fit$SSE/34)
## [1] 19.92

par(mfrow=c(1,1))
plot.ts(data.ts , xaxt="n", lwd =2 , col = "grey",
xlim=c(2010,2014),ylim=c(100,400),xlab="")
timeidx <-seq( 2010, 2014,length.out = 48)
time

<- seq(as.Date("2010/1/1"),length.out = 48, by="1 month")

time

<- format(time, "%Y-%m")

axis(1, labels=time, at=timeidx)

points(fitted(fit), type="l",lty="dashed", col = "blue")
points(predict(fit, n.ahead=12), type="l", lty="dashed", col = "red", lwd

= 1)

200

250

original time series

smoothed
forecast

150

SwordDemand

300

Data Smart

201001 201008 201103 201110 201205 201212 201307

Figure 8.2: Sword Demand Prediction

legend("topleft", legend=c("original time series","smoothed","forecast"),

col

= c("grey","blue","red"), lty = c(1,2,2), lwd = c(2,1), cex = 0.8)

One check residuals of the model using the acf plot

acf(data.ts[4:36]- c(fitted(fit)[1:33,1]) , main = "", xlim=c(1,15))
Clearly there is a seasonality in the data that needs to be modeled. The chapter uses Multiplicative HoltWinters Exponential smoothing method
#Starting values as given in the book
s.start <- c(0.988233399,1.039459514,0.932933292,0.912597756,
1.043010605,0.906442452,0.920837589,0.926620944,
0.988490753,1.016201453,1.048052656,1.204004908)
fit

<- HoltWinters(data.ts, seasonal="multiplicative",

l.start = 144.42,b.start = 2.2095 ,s.start= s.start)

The smoothing parameter , and RMSE are

fit$alpha
##

alpha

## 0.2316
fit$beta

original time series

smoothed
forecast

100

200

300

SwordDemand

400

Data Smart

201001 201008 201103 201110 201205 201212 201307

Figure 8.3: Sword Demand Prediction - Trend corrected smoothing

beta

## 0.1148
fit$gamma
## gamma
##

sqrt(fit$SSE/33)
## [1] 9.282
Check residuals of the model using the acf plot
acf(data.ts[14:36]- c(fitted(fit)[1:23,1]) , main = "", xlim=c(1,15))

par(mfrow=c(1,1))
plot.ts(data.ts , xaxt="n", lwd =2 , col = "grey",
xlim=c(2010,2014),ylim=c(100,400),xlab="")
timeidx <-seq( 2010, 2014,length.out = 48)
time

<- seq(as.Date("2010/1/1"),length.out = 48, by="1 month")

time

<- format(time, "%Y-%m")

axis(1, labels=time, at=timeidx)

points(fitted(fit), type="l",lty="dashed", col = "blue")
points(predict(fit, n.ahead=12), type="l", lty="dashed", col = "red", lwd

= 1)

legend("topleft", legend=c("original time series","smoothed","forecast"),

col

= c("grey","blue","red"), lty = c(1,2,2), lwd = c(2,1,1), cex = 0.8)

0.2
0.2

ACF

0.6

1.0

Data Smart

Lag
Figure 8.4: Checking residuals

The last part of the chapter puts confidence intervals around the prediction.
pred <- predict(fit, n.ahead=12, prediction.interval=T, level =0.95)

par(mfrow=c(1,1))
plot.ts(data.ts , xaxt="n", lwd =2 , col = "grey",
xlim=c(2010,2014),ylim=c(100,400),xlab="")
timeidx <-seq( 2010, 2014,length.out = 48)
time

<- seq(as.Date("2010/1/1"),length.out = 48, by="1 month")

time

<- format(time, "%Y-%m")

axis(1, labels=time, at=timeidx)

points(fitted(fit), type="l",lty="dashed", col = "blue")
points(pred[,1], type="l", col = "red", lwd

= 2)

points(pred[,2], type="l", lty="dashed", col = "red", lwd

= 1)

points(pred[,3], type="l", lty="dashed", col = "red", lwd

= 1)

legend("topleft",
legend=c("original time series","smoothed","forecast",
"forecast-upper","forecast-lower"),
col

= c("grey","blue","red","red","red"),

lty = c(1,2,1,2,2), lwd = c(2,1,2,1,1), cex = 0.8)

The above analysis is also done in chapter 10 of this book using forecast package
sword.forecast <- forecast(data.ts)
results <- cbind(sword.forecast$mean,sword.forecast$lower[,2],sword.forecast$upper[,2])
colnames(results) <- c("mu","lower","upper")

0.4
0.4

0.0

ACF

0.8

Data Smart

Lag
Figure 8.5: Checking residuals

sword.forecast$method
## [1] "ETS(M,M,M)"

plot(sword.forecast)

200

300

original time series

smoothed
forecast

100

SwordDemand

400

Data Smart

201001 201008 201103 201110 201205 201212 201307

Figure 8.6: Sword Demand Prediction - Multiplicative HoltWinters smoothing

Outlier Detection: Just Because Theyre Odd Doesnt Mean

Theyre Unimportant

The chapter starts by discussing Tukey fences as a visualizing tool to spot outliers. They come in two varieties.
Innerfences are 1.5 IQR and Outer fences are 3 IQR on either side of the median
input.file

<- "data/ch10/PregnancyDuration.csv"

data

<- read.csv(input.file,stringsAsFactors = TRUE,header = TRUE)

summary(data)
##

GestationDays

Min.

1st Qu.:260

Median :267

Mean

3rd Qu.:272

Max.

:240

:267
:349

IQR(data[,1])
## [1] 12
quantile(data[,1],prob=0.75)- quantile(data[,1],prob=0.25)
## 75%
##

200

300

original time series

smoothed
forecast
forecastupper
forecastlower

100

SwordDemand

400

Data Smart

201001 201008 201103 201110 201205 201212 201307

Figure 8.7: Sword Demand Prediction - Confidence bands

par(mfrow=c(1,2))
boxplot(data[,1], main="Inner fences",cex.main = 0.8)
boxplot(data[,1], range =3, main

= "Outer fences",cex.main =0.8)

One can directly see the stats with out plotting

stats

<- c(boxplot(data[,1], plot=F)$stats)

labels <- c("lower inner fence" ,"25th percentile" ,"median","75th percentile",

"upper inner fence")
res

<- data.frame(labels = labels,value =stats)

stats

<- c(boxplot(data[,1],range = 3, plot=F)$stats)

labels <- c("lower outer fence" ,"25th percentile" ,"median","75th percentile",

"upper outer fence")
res

<- cbind(labels,stats)

colnames(res) <- c("labels","value")

The chapter subsequently discussed three methods to quantify the outliers.
Method 1 : Indegree
input.file

<- "data/ch10/CallCenter.csv"

data

<- read.csv(input.file,stringsAsFactors = TRUE,header = TRUE)

data.scaled

<- scale(data[,2:11])

distance

<- as.matrix(dist(data.scaled))

Data Smart

mu
242.64
256.24
235.20
233.61
271.24
253.09
259.96
257.66
282.44
296.10
325.04
374.54
313.38
330.95
303.78
301.73
350.33
326.88
335.76
332.79
364.79
382.44
419.82
483.74

lower
223.11
236.05
215.63
212.15
244.07
223.96
226.11
220.13
235.95
240.39
258.75
289.27
236.83
242.13
216.45
207.30
232.55
208.83
209.70
199.98
210.30
214.52
225.40
249.18

upper
260.94
276.48
255.07
255.82
300.13
282.91
296.84
299.94
336.20
362.72
406.33
477.64
414.19
451.16
425.56
439.97
522.92
508.46
538.28
553.81
630.72
686.55
783.55
940.29

Table 1: Mean and confidence intervals for the prediction

labels
lower inner fence
25th percentile
median
75th percentile
upper inner fence

value
242.00
260.00
267.00
272.00
290.00

Table 2: Tukey Inner fences

ranks

<- t(apply(distance, 1, rank)-1)

top5s

<- apply(ranks, 2, function(z) sum(z <=5))-1

top10s

<- apply(ranks, 2, function(z) sum(z <=10))-1

top20s

<- apply(ranks, 2, function(z) sum(z <=20))-1

temp

<- cbind(data, top5s,top10s,top20s)

temp

<- temp[order(temp[,12],temp[,13],temp[,14]),]

Method 2 : k-distance
ranks

<- t(apply(distance, 1, rank)-1)

ranks[ranks!=5] <- 0
ranks[ranks==5] <- 1
temp
temp
temp

<- rowSums(ranks*distance)
<- cbind(data, temp)
<- temp[order(temp[,12],decreasing=T),]

Data Smart

200

400

600

800

Forecasts from ETS(M,M,M)

2010

2011

2012

2013

2014

2015

Figure 8.8: Sword Demand Prediction - Confidence bands

labels
lower outer fence
25th percentile
median
75th percentile
upper outer fence

value
240
260
267
272
303

Table 3: Tukey Outer fences

Method 3 : Local Outlier Factors

callcenter.lof <- lofactor(data.scaled, 5)
local.outliers <- data[which(callcenter.lof > 1.5),]

Data Smart

280
240

240

280

320

Outer fences

320

Inner fences

Figure 9.1: Tukey Fences

ID
137155
143406

AvgTix
165.30
145.00

Rating
4.49
2.33

Tardies
1
3

Graveyards
3
1

Weekends
2
0

SickDays
1
6

Table 4: Outliers based on Indegree

Moving from Spreadsheets into R

This chapter introduces some basic commands in R and then works out the Kmeans model, the regression
model, the random forests model, forecasting model and outlier detection methods in R. In each of the case,
the author provides abundant commentary for each of the packages and commands being used. The packages
used to replicate the excel results from the previous chapter include
skmeans
randomForest
ROCR
forecast
DMwR

Conclusion

The book concludes with some soft skills that a data scientist should have for him / her to be effective in an
organization.

Data Smart

ID
137155
143406

PercSickOnFri
0.00
0.83

EmployeeDevHrs
30
30

ShiftSwapsReq
1
4

ShiftSwapsOffered
7
0

Table 5: Outliers based on Indegree

ID
143406
137155

AvgTix
145.00
165.30

Rating
2.33
4.49

Tardies
3
1

Graveyards
1
3

Weekends
0
2

SickDays
6
1

Table 6: Outliers based on k distance

ID
143406
137155

PercSickOnFri
0.83
0.00

EmployeeDevHrs
30
30

ShiftSwapsReq
4
1

ShiftSwapsOffered
0
7

Table 7: Outliers based on k distance

ID
137155
143406

AvgTix
165.30
145.00

Rating
4.49
2.33

Tardies
1
3

Graveyards
3
1

Weekends
2
0

SickDays
1
6

Table 8: Local Outlier Factors

ID
137155
143406

PercSickOnFri
0.00
0.83

EmployeeDevHrs
30
30

ShiftSwapsReq
1
4

ShiftSwapsOffered
7
0

Table 9: Local Outlier Factors

更新的红酒培训WSET Level 2 中级教材Looking Behind the Labels
100% (1)
更新的红酒培训WSET Level 2 中级教材Looking Behind the Labels
83 pages
Turning Numbers Into Knowledge - Mastering The Art of Problem Solving PDF
No ratings yet
Turning Numbers Into Knowledge - Mastering The Art of Problem Solving PDF
269 pages
WSET® Level 1 Award In: Wines
No ratings yet
WSET® Level 1 Award In: Wines
36 pages
Pganalyze Advanced+Database+Programming+With+Rails
No ratings yet
Pganalyze Advanced+Database+Programming+With+Rails
45 pages
(Systems Thinking) Angela Espinosa - Sustainable Self-Governance in Businesses and Society - The Viable System Model in Action (2022, Routledge) - Libgen - Li
No ratings yet
(Systems Thinking) Angela Espinosa - Sustainable Self-Governance in Businesses and Society - The Viable System Model in Action (2022, Routledge) - Libgen - Li
313 pages
The Power of Many Book Summary
No ratings yet
The Power of Many Book Summary
7 pages
1963 Creativity Is Not Enough PDF
No ratings yet
1963 Creativity Is Not Enough PDF
8 pages
Selling Is Hard. Buying Is Harder Hess en 39945
No ratings yet
Selling Is Hard. Buying Is Harder Hess en 39945
6 pages
Knygos
No ratings yet
Knygos
7 pages
Winning by Jack Welch A&d
100% (1)
Winning by Jack Welch A&d
42 pages
Anaconda's Guide To Open-Source: Tools and Libraries For Enterprise Data Science and Machine Learning
No ratings yet
Anaconda's Guide To Open-Source: Tools and Libraries For Enterprise Data Science and Machine Learning
29 pages
Connect, Then Lead
No ratings yet
Connect, Then Lead
13 pages
Herding Cats A Primer For Programmers Who Lead Programmers
No ratings yet
Herding Cats A Primer For Programmers Who Lead Programmers
188 pages
Brain Station Course Curriculum
100% (1)
Brain Station Course Curriculum
21 pages
Location Is (Still) Everything PDF
No ratings yet
Location Is (Still) Everything PDF
24 pages
Dan Ariely - The Upside of Irrationality
No ratings yet
Dan Ariely - The Upside of Irrationality
11 pages
Algorithms To Live by
0% (1)
Algorithms To Live by
1 page
(Skiena, 2017) - Book - The Data Science Design Manual - 2
No ratings yet
(Skiena, 2017) - Book - The Data Science Design Manual - 2
1 page
Applied Analytics Using Enterprise Miner5
No ratings yet
Applied Analytics Using Enterprise Miner5
3 pages
EAN Nalytics: Using Data To Build A Startup Faster
100% (1)
EAN Nalytics: Using Data To Build A Startup Faster
2 pages
Andrew Treadway - Software Engineering For Data Scientists (MEAP V03) - Manning Publications (2023)
100% (1)
Andrew Treadway - Software Engineering For Data Scientists (MEAP V03) - Manning Publications (2023)
319 pages
The Insider Advantage
No ratings yet
The Insider Advantage
21 pages
The Visual Display of Quantitative Information
67% (9)
The Visual Display of Quantitative Information
39 pages
The Mythical Man-Month
86% (7)
The Mythical Man-Month
56 pages
MBA of Success Brochure
No ratings yet
MBA of Success Brochure
12 pages
Think Stats: Probability and Statistics For Programmers
No ratings yet
Think Stats: Probability and Statistics For Programmers
140 pages
Ten Mistakes To Avoid: When Launching Your Data Governance Program
No ratings yet
Ten Mistakes To Avoid: When Launching Your Data Governance Program
16 pages
Power Bi Basic To Advance
No ratings yet
Power Bi Basic To Advance
3 pages
VTI Benefits
100% (1)
VTI Benefits
73 pages
Business Ethics
No ratings yet
Business Ethics
21 pages
BK Rand 002806
100% (1)
BK Rand 002806
11 pages
What Is Data Science?: Michael L. Brodie
No ratings yet
What Is Data Science?: Michael L. Brodie
21 pages
Data Storytelling
100% (1)
Data Storytelling
29 pages
Show Me The Numbers - Stephen Few - 555
100% (5)
Show Me The Numbers - Stephen Few - 555
128 pages
A Grammar of Graphics
0% (1)
A Grammar of Graphics
45 pages
Model Thinking
100% (1)
Model Thinking
103 pages
Howard Wainer-Truth or Truthiness - Distinguishing Fact From Fiction by Learning To Think Like A Data Scientist-Cambridge University Press (2015)
No ratings yet
Howard Wainer-Truth or Truthiness - Distinguishing Fact From Fiction by Learning To Think Like A Data Scientist-Cambridge University Press (2015)
230 pages
CleanCodeHandbook v1.0.1 PDF
100% (1)
CleanCodeHandbook v1.0.1 PDF
100 pages
The Design of Business Why Design Thinking Is The Next
No ratings yet
The Design of Business Why Design Thinking Is The Next
11 pages
Ruthless Consistency
No ratings yet
Ruthless Consistency
5 pages
In Search of Excellence Final
67% (3)
In Search of Excellence Final
29 pages
Trust-In Machine Learning Models
No ratings yet
Trust-In Machine Learning Models
11 pages
Competing in The Digital Age
67% (3)
Competing in The Digital Age
4 pages
1.1 Algorithm Paradigms
100% (1)
1.1 Algorithm Paradigms
32 pages
Keeping Up With Quants
No ratings yet
Keeping Up With Quants
13 pages
Data Science Books
100% (1)
Data Science Books
11 pages
QlikSense TopologiesV0 - 18 PDF
No ratings yet
QlikSense TopologiesV0 - 18 PDF
1 page
The Origin of Wealth
100% (2)
The Origin of Wealth
37 pages
Cracking The PM Career Jackie Bavaro
No ratings yet
Cracking The PM Career Jackie Bavaro
1 page
FIT9136 Algorithm and Programming Foundation in Python
No ratings yet
FIT9136 Algorithm and Programming Foundation in Python
29 pages
Fooled by Randomness
100% (3)
Fooled by Randomness
9 pages
E7fASQoeT6S3wEkKHi k2Q - Machine Learning in The Enterprise - Course Summary
No ratings yet
E7fASQoeT6S3wEkKHi k2Q - Machine Learning in The Enterprise - Course Summary
17 pages
Effective Executive - Peter Drucker
No ratings yet
Effective Executive - Peter Drucker
7 pages
Viralnomics Final PDF
100% (1)
Viralnomics Final PDF
146 pages
2010, Mckinsey Quarterly, Strategic Decisions
No ratings yet
2010, Mckinsey Quarterly, Strategic Decisions
10 pages
Operations Manager or Program Manager or Property Manager or Exe
No ratings yet
Operations Manager or Program Manager or Property Manager or Exe
2 pages
Big Data Solution For Tourism PDF
No ratings yet
Big Data Solution For Tourism PDF
10 pages
DATA 2024 - Dist
No ratings yet
DATA 2024 - Dist
97 pages
Lecture Week 2 KNN and Model Evaluation PDF
100% (1)
Lecture Week 2 KNN and Model Evaluation PDF
53 pages
6220010
No ratings yet
6220010
37 pages
Data Science: Concepts and Practice: Course Slides
No ratings yet
Data Science: Concepts and Practice: Course Slides
9 pages
Unsupervised Machine Learning in Python
100% (1)
Unsupervised Machine Learning in Python
89 pages
Ribera Del Duero: by Tim Atkin MW
No ratings yet
Ribera Del Duero: by Tim Atkin MW
35 pages
Glossary Wine Terms
No ratings yet
Glossary Wine Terms
9 pages
SFWine18 Results by Winery
No ratings yet
SFWine18 Results by Winery
91 pages
The Business of Wine An Encyclopedia PDF
No ratings yet
The Business of Wine An Encyclopedia PDF
328 pages
The Lounge
No ratings yet
The Lounge
4 pages
Price List
No ratings yet
Price List
21 pages
Harrods Store Guide
No ratings yet
Harrods Store Guide
53 pages
IkinariSteak DinnerMenu
No ratings yet
IkinariSteak DinnerMenu
2 pages
Cherry Bar Menu
No ratings yet
Cherry Bar Menu
34 pages
Menu Completo Casa Blanca2023
No ratings yet
Menu Completo Casa Blanca2023
39 pages
Wine Cellar - Budureasca
No ratings yet
Wine Cellar - Budureasca
11 pages
Australian Wine Regions Student Workbook Yr9 10 Geography
No ratings yet
Australian Wine Regions Student Workbook Yr9 10 Geography
55 pages
Menu Maguwo Oktober 23-1
No ratings yet
Menu Maguwo Oktober 23-1
16 pages
Wine Tasting Menu
No ratings yet
Wine Tasting Menu
7 pages
Capital Grille Wine List
0% (1)
Capital Grille Wine List
5 pages
The Baron Wine List Woodmead
No ratings yet
The Baron Wine List Woodmead
3 pages
Lagarto Brasserie Main Menu EN
No ratings yet
Lagarto Brasserie Main Menu EN
2 pages
No Price Kask&Cork A4 Portrait Newest
No ratings yet
No Price Kask&Cork A4 Portrait Newest
24 pages
The Bar Menu
No ratings yet
The Bar Menu
13 pages
English Lesson On Maltese Wine For Intermediate Level and Above
No ratings yet
English Lesson On Maltese Wine For Intermediate Level and Above
6 pages
Wine Menu
No ratings yet
Wine Menu
3 pages
Share The-Peak-Grill-Menu
No ratings yet
Share The-Peak-Grill-Menu
2 pages
Server Training Manual Version 4 - March2019
No ratings yet
Server Training Manual Version 4 - March2019
22 pages
Pocket Sommelier - Wine Tasting Guide
100% (3)
Pocket Sommelier - Wine Tasting Guide
13 pages
TOMA Menu Compilation (Feb 24)
No ratings yet
TOMA Menu Compilation (Feb 24)
21 pages
Bar and Beverage Management (Lecture) Online Activity: Republic of The Philippines
No ratings yet
Bar and Beverage Management (Lecture) Online Activity: Republic of The Philippines
40 pages
Wine Tour Report: Fratelli Vineyards
No ratings yet
Wine Tour Report: Fratelli Vineyards
5 pages
5th and Taylor - Menu PDF
No ratings yet
5th and Taylor - Menu PDF
2 pages

Data Smart Book Summary

Uploaded by

Data Smart Book Summary

Uploaded by

Data Smart

Using Data Science to Transform Information into Insight

Whats in this document ?

Cluster Analysis Part I: Using K-Means to Segment Your Customer Base

3 Naive Bayes and the Incredible Lightness of Being an Idiot

6 The Granddaddy of Supervised Artificial Intelligence Regression

7 Ensemble Models: A Whole Lot of Bad Pizza

8 Forecasting: Breathe Easy; You Cant Win

10 Moving from Spreadsheets into R

Everything You Ever Needed to Know about Spreadsheets but

Cluster Analysis Part I: Using K-Means to Segment Your Customer Base

A common choice for W (Ck ) is

<- read.csv(input.file,stringsAsFactors = FALSE,header = TRUE)

<- read.csv(input.file,stringsAsFactors = FALSE,header = TRUE)

<- cast(offers, name~offer, sum)[,2:33]

rownames(offers.data) <- cast(offers, name~offer, sum)[,1]

<- cbind(deals, (t(km.out$centers)))

Analyzing based on first cluster centre

Variety Minimum Discount

## 29 November Pinot Grigio

Analyzing based on second cluster centre

Variety Minimum Discount Origin PastPeak

June Cabernet Sauvignon

Analyzing based on third cluster centre

Variety Minimum Discount

## 24 September Pinot Noir

October Pinot Noir

July Pinot Noir

January Pinot Noir

Analyzing based on fourth cluster centre

Variety Minimum Discount

August Cabernet Sauvignon

<- merge(offers, cluster, all.x= T)

<- cast(deals.by.cluster, offer~cluster, sum)

The first cluster is small timers

Variety Minimum Discount

## 29 November Pinot Grigio

The second cluster is not clear

Variety Minimum Discount Origin PastPeak

June Cabernet Sauvignon

The third cluster is Pinot Noir variety

Variety Minimum Discount

## 24 September Pinot Noir

October Pinot Noir

July Pinot Noir

The fourth cluster seems to like August Champaign

Variety Minimum Discount

August Cabernet Sauvignon

<- subset(cluster, id!=i)

<- subset(cluster, id==i)$cluster

<- subset(temp, cluster == temp.cluster)

<- subset(temp, cluster != temp.cluster)

<- mean(dist.euclidean[ n*(i.star-1) i.star*(i.star-1)/2 + j.star-i.star ])

<- min( sapply( clusters[-temp.cluster],function(j)

<- pmin(i,subset(diff.cluster, cluster== j)$id)

<- pmax(i,subset(diff.cluster, cluster== j)$id)

mean(dist.euclidean[ n*(i.star-1) - i.star*(i.star-1)/2 + j.star-i.star ])

<- attr(dist.euclidean, "Size")

<- skmeans(as.matrix(offers.data), 5, method="genetic")

<- merge(offers, cluster, all.x= T)

<- cast(deals.by.cluster, offer~cluster, sum)

The first cluster is Pinot Noir gang

Variety Minimum Discount

## 24 September Pinot Noir

October Pinot Noir

January Pinot Noir

July Pinot Noir

The second cluster looks like small timers

Variety Minimum Discount

## 29 November Pinot Grigio

The third cluster is is high volume deals segment

Variety Minimum Discount

February Cabernet Sauvignon

June Cabernet Sauvignon

The fourth cluster is France buyer segment

Variety Minimum Discount

<- mean(dist.euclidean[ n(i.star-1) i.star(i.star-1)/2 + j.star-i.star ])

mean(dist.euclidean[ n(i.star-1) - i.star(i.star-1)/2 + j.star-i.star ])