Module 4: Advanced Analytics – Theory and Methods
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 1
Module 4: Advanced Analytics – Theory and Methods
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 2
Where “R” we?
• In Module 3 we reviewed R skills and basic statistics
• You can use R to:
Generate summary statistics to investigate a data set
Visualize Data
Perform statistical tests to analyze data and evaluate models
• Now that you have data, and you can see it, you need to plan
the analytic model and determine the analytic method to be
used
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 3
Phase 3 - Model Planning
(Figure: Data Analytics Lifecycle – Discovery, Data Prep, Model Planning, Communicate Results, Operationalize)
• How do people generally solve this problem with the kind of data and resources I have?
• Does that work well enough? Or do I have to come up with something new?
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 5
What Kind of Problem do I Need to Solve?
How do I Solve it?
The Problem to Solve | The Category of Techniques | Covered in this Course
I want to group items by similarity; I want to find structure (commonalities) in the data | Clustering | K-means clustering
I want to discover relationships between actions or items | Association Rules | Apriori
I want to determine the relationship between the outcome and the input variables | Regression | Linear Regression, Logistic Regression
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 6
Why These Example Techniques?
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 7
Module 4: Advanced Analytics – Theory and Methods
Lesson 1: K-means Clustering
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 8
Clustering
How do I group these documents by topic?
How do I group my customers by purchase patterns?
• Sort items into groups by similarity:
Items in a cluster are more similar to each other than they are to
items in other clusters.
Need to detail the properties that characterize "similarity" (or, equivalently, "distance", the inverse of similarity)
• Not a predictive method; finds similarities, relationships
• Our Example: K-means Clustering
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 9
K-Means Clustering - What is it?
• Used for clustering numerical data, usually a set of
measurements about objects of interest.
• Input: numerical. There must be a distance metric defined over
the variable space.
Euclidean distance
• Output: The centers of each discovered cluster, and the
assignment of each input datum to a cluster.
Centroid
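The deck gives no code here; as a rough illustration of these inputs and outputs, a minimal R sketch using the base kmeans() function on made-up customer measurements (the data frame and column names are hypothetical) might look like:

  # Hypothetical customer measurements (income in $K, yearly purchases in $K, household size)
  set.seed(42)
  customers <- data.frame(income    = rnorm(200, 60, 15),
                          purchases = rnorm(200, 20, 5),
                          members   = sample(1:6, 200, replace = TRUE))
  km <- kmeans(scale(customers), centers = 3, nstart = 25)   # Euclidean distance on scaled inputs
  km$centers         # output: the centers (centroids) of the discovered clusters
  head(km$cluster)   # output: the cluster assignment of each input datum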
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 10
Use Cases
• Often an exploratory technique:
Discover structure in the data
Summarize the properties of each cluster
• Sometimes a prelude to classification:
"Discovering the classes“
• Examples
The height, weight and average lifespan of animals
Household income, yearly purchase amount in dollars, number of
household members of customer households
Patient record with measures of BMI, HBA1C, HDL
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 11
The Algorithm
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 13
The Algorithm (Continued)
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 14
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 15
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 16
Picking K
Heuristic: find the "elbow" of the within-sum-of-squares (wss) plot
as a function of K.
K: # of clusters
ni: # points in ith cluster
ci: centroid of ith cluster
xij: jth point of ith cluster
"Elbows" at k=2,4,6
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 17
Diagnostics – Evaluating the Model
• Do the clusters look separated in at least some of the plots
when you do pair-wise plots of the clusters?
Pair-wise plots can be used when there are not many variables
• Do you have any clusters with few data points?
Try decreasing the value of K
• Are there splits on variables that you would expect, but don't
see?
Try increasing the value of K
• Do any of the centroids seem too close to each other?
Try decreasing the value of K
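A minimal R sketch of the pair-wise plot check described above (again using the built-in mtcars data as a stand-in):

  # Pair-wise scatterplots colored by cluster assignment; separated clusters
  # should be visible in at least some of the panels
  d  <- scale(mtcars)
  km <- kmeans(d, centers = 3, nstart = 25)
  pairs(mtcars[, 1:4], col = km$cluster, pch = 19)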
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 18
K-Means Clustering - Reasons to Choose (+) and
Cautions (-)
Reasons to Choose (+):
• Easy to implement
• Easy to assign new data to existing clusters (which is the nearest cluster center?)
• Concise output: coordinates of the K cluster centers

Cautions (-):
• Doesn't handle categorical variables
• Sensitive to initialization (first guess)
• Variables should all be measured on similar or compatible scales (not scale-invariant!)
• K (the number of clusters) must be known or decided a priori (wrong guess: possibly poor results)
• Tends to produce "round", equi-sized clusters (not always desirable)
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 19
Check Your Knowledge
Your Thoughts?
1. Why do we consider K-means clustering an unsupervised machine learning algorithm?
2. How do you use "pair-wise" plots to evaluate the effectiveness of the clustering?
3. Detail the four steps in the K-means clustering algorithm.
4. How do we use WSS to pick the value of K?
5. What is the most common measure of distance used with K-means clustering algorithms?
6. The attributes of a data set are "purchase decision (Yes/No), Gender (M/F), income group (<10K, 10-50K, >50K)". Can you use K-means to cluster this data set?
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 20
Module 4: Advanced Analytics – Theory and Methods
Lesson 2: Association Rules
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 24
Association Rules
Which of my products tend to be purchased together?
What do other people like this person tend to like/buy/watch?
• Discover "interesting" relationships among variables in a large
database
Rules of the form “If X is observed, then Y is also observed"
The definition of "interesting" varies with the algorithm used for
discovery
• Not a predictive method; finds similarities, relationships
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 25
Association Rules - Apriori
• Specifically designed for mining over transactions in databases
• Used over itemsets: sets of discrete variables that are linked:
Retail items that are purchased together
A set of tasks done in one day
A set of links clicked on by one user in a single session
• Our Example: Apriori
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 26
Apriori Algorithm - What is it?
Support
• Earliest of the association rule algorithms
• Frequent itemset: a set of items L that appears together "often enough":
Formally: meets a minimum support criterion
Support: the % of transactions that contain L
• Apriori Property: Any subset of a frequent itemset is also
frequent
It has at least the support of its superset
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 27
Apriori Algorithm (Continued)
Confidence
• Iteratively grow the frequent itemsets from size 1 to size K (or
until we run out of support).
Apriori property tells us how to prune the search space
• Frequent itemsets are used to find rules X->Y with a minimum
confidence:
Confidence: of the transactions that contain X, the % that also contain Y
• Output: The set of all rules X -> Y with minimum support and
confidence
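As a sketch of how this looks in practice, the arules package (which the lab for this lesson installs) can mine such rules; the market baskets below are made up:

  library(arules)
  # Hypothetical market baskets
  baskets <- list(c("milk", "bread", "cookies"),
                  c("milk", "cookies"),
                  c("bread", "butter"),
                  c("milk", "bread", "butter", "cookies"))
  trans <- as(baskets, "transactions")
  # Keep only rules meeting the minimum support and confidence thresholds
  rules <- apriori(trans, parameter = list(support = 0.5, confidence = 0.8))
  inspect(sort(rules, by = "lift"))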
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 28
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 29
Lift and Leverage
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 30
Association Rules Implementations
• Market Basket Analysis
People who buy milk also buy cookies 60% of the time.
• Recommender Systems
"People who bought what you bought also purchased….“.
• Discovering web usage patterns
People who land on page X click on link Y 76% of the time.
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 31
Use Case Example: Credit Records
ID | Credit Attributes
1 | credit_good, female_married, job_skilled, home_owner, …
2 | credit_bad, male_single, job_unskilled, renter, …
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 32
Computing Confidence and Lift
Suppose we have 1000 credit records:
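The worked table from the slide is not reproduced in this text; as a hedged sketch of the arithmetic, with hypothetical counts out of the 1000 records:

  # Hypothetical counts out of N = 1000 credit records
  N    <- 1000
  n_X  <- 300   # records containing X (e.g., home_owner)
  n_Y  <- 700   # records containing Y (e.g., credit_good)
  n_XY <- 240   # records containing both X and Y

  support_XY <- n_XY / N                               # 0.24
  confidence <- n_XY / n_X                             # 0.80
  lift       <- support_XY / ((n_X / N) * (n_Y / N))   # about 1.14
  leverage   <- support_XY - (n_X / N) * (n_Y / N)     # 0.03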
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 33
A Sketch of the Algorithm
• If Lk is the set of frequent k-itemsets:
Generate the candidate set Ck+1 by joining Lk to itself
Prune out the (k+1)-itemsets that don't have minimum support
Now we have Lk+1
• We know this catches all the frequent (k+1)-itemsets by the
apriori property
a (k+1)-itemset can't be frequent if any of its subsets aren't
frequent
• Continue until we reach kmax, or run out of support
• From the union of all the Lk, find all the rules with minimum
confidence
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 34
Step 1: 1-itemsets (L1)
1-itemset | Count
male_mar_or_wid | 92  ← prune
female | 310
job_skilled | 631
job_unskilled | 200
home_owner | 710
renter | 179
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 35
Step 2: 2-itemsets (L2)
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 36
Step 3: 3-itemsets
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 37
Finally: Find Confidence Rules
Rule | Set | Cnt | Set | Cnt | Confidence
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 38
Diagnostics
• Do the rules make sense?
What does the domain expert say?
• Make a "test set" from hold-out data:
Enter some market baskets with a few items missing (selected at
random). Can the rules determine the missing items?
Remember, some of the test data may not cause a rule to fire.
• Evaluate the rules by lift or leverage.
Some associations may be coincidental (or obvious).
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 39
Apriori - Reasons to Choose (+) and Cautions (-)
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 40
Check Your Knowledge
Your Thoughts?
1. What is the Apriori property and how is it used in the Apriori algorithm?
2. List three popular use cases of the Association Rules mining algorithms.
3. What is the difference between Lift and Leverage? How is Lift used in evaluating the quality of rules discovered?
4. Define Support and Confidence.
5. How do you use a "hold-out" dataset to evaluate the effectiveness of the rules generated?
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 41
Module 4: Advanced Analytics – Theory and Methods
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 42
Lab Exercise 5 - Association Rules
• This Lab is designed to investigate and practice
Association Rules.
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 43
Lab Exercise 5 - Association Rules - Workflow
Step 1: Set the Working Directory and install the "arules" and "arulesViz" packages
…
Step 4: Plot Transactions
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 44
Module 4: Advanced Analytics – Theory and Methods
Lesson 3: Linear Regression
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 45
Regression
• Regression focuses on the relationship between an outcome and
its input variables.
Provides an estimate of the outcome based on the input values.
Models how changes in the input variables affect the outcome.
• The outcome can be continuous or discrete.
• Possible use cases:
Estimate the lifetime value (LTV) of a customer and understand
what influences LTV.
Estimate the probability that a loan will default and understand
what leads to default.
• Our approaches: linear regression and logistic regression
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 46
Linear Regression
• Used to estimate a continuous value as a linear (additive)
function of other variables
Income as a function of years of education, age, and gender
House sales price as function of square footage, number of
bedrooms/bathrooms, and lot size
• Outcome variable is continuous.
• Input variables can be continuous or discrete.
• Model Output:
A set of estimated coefficients that indicate the relative impact of
each input variable on the outcome
A linear expression for estimating the outcome as a function of
input variables
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 47
Linear Regression Model
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 48
Example: Linear Regression with One Input Variable
• x1 - the number of employees reporting to a manager
• y - the hours per week spent in meetings by the manager
y = β0 + β1·x1 + ε
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 49
Representing Categorical Attributes
y = β0 + β1·employees + β2·finance + β3·mfg + β4·sales

Possible Situation | Input Variables (employees, finance, mfg, sales)
Finance manager with 8 employees | (8, 1, 0, 0)
Manufacturing manager with 8 employees | (8, 0, 1, 0)
Sales manager with 8 employees | (8, 0, 0, 1)
Engineering manager with 8 employees | (8, 0, 0, 0)
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 50
Fitting a Line with Ordinary Least Squares (OLS)
• Choose the line that minimizes:  Σ_{i=1..n} [ y_i − (β0 + β1·x_i1 + … + β(p−1)·x_i,(p−1)) ]²
• Fitted line for the example:  ŷ = 3.21 + 2.19·x1
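A minimal R sketch of fitting such a one-variable line with lm(); the data below are synthetic and only approximate the slide's example:

  # Hypothetical data: direct reports (x1) and weekly meeting hours (y)
  set.seed(1)
  x1  <- sample(1:15, 100, replace = TRUE)
  y   <- 3.2 + 2.2 * x1 + rnorm(100, sd = 4)
  fit <- lm(y ~ x1)
  coef(fit)       # estimated intercept and slope (compare with the slide's 3.21 and 2.19)
  summary(fit)    # estimates, significance, and R-squared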
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 51
Interpreting the Estimated Coefficients, bj
ŷ = 4.0 + 2.2·employees + 0.5·finance + 1.9·mfg + 0.6·sales
• Coefficients for numeric input variables
Change in outcome due to a unit change in input variable*
Example: b1 = 2.2
Extra 2.2 hrs/wk in meetings for each additional employee managed *
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 52
Diagnostics – Examining Residuals
• Residuals
Differences between the observed and estimated outcomes
The observed values of the error term, ε, in the regression model
Expressed as: e_i = y_i − ŷ_i, for i = 1, 2, …, n
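A sketch of how these residual diagnostics are typically produced in R, continuing the hypothetical lm() fit from the earlier sketch:

  res <- resid(fit)                       # e_i = y_i - y_hat_i
  plot(fitted(fit), res,
       xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)                  # residuals should scatter evenly around zero
  hist(res, main = "Residuals")           # should look roughly normal, centered at zero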
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 53
Diagnostics – Plotting Residuals
(Figure: an ideal residual plot, scattered evenly around zero, vs. a non-centered residual plot)
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 54
Diagnostics – Residual Normality Assumption
(Figure: ideal histogram of residuals – approximately normal, centered at zero)
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 55
Diagnostics – Using Hold-out Data
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 56
Diagnostics – Other Considerations
• R2
The fraction of the variability in the outcome variable explained
by the fitted regression model.
Attains values from 0 (poorest fit) to 1 (perfect fit)
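A short R sketch covering both the hold-out check from the previous slide and R², continuing the hypothetical x1/y data above (the train/test split is illustrative):

  summary(fit)$r.squared          # fraction of variability explained by the fitted model
  # Hold-out check: fit on training rows, score data the model has not seen
  train <- data.frame(x1 = x1[1:70],   y = y[1:70])
  test  <- data.frame(x1 = x1[71:100], y = y[71:100])
  m     <- lm(y ~ x1, data = train)
  pred  <- predict(m, newdata = test)
  sqrt(mean((test$y - pred)^2))   # RMSE on the hold-out data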
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 57
Linear Regression - Reasons to Choose (+) and
Cautions (-)
Reasons to Choose (+):
• Concise representation (the coefficients)
• Robust to redundant or correlated variables (lose some explanatory value)
• Explanatory value: relative impact of each variable on the outcome
• Easy to score data

Cautions (-):
• Does not handle missing values well
• Assumes that each variable affects the outcome linearly and additively (variable transformations and modeling variable interactions can alleviate this; a good idea to take the log of monetary amounts or any variable with a wide dynamic range)
• Does not easily handle variables that affect the outcome in a discontinuous way (step functions)
• Does not work well with categorical attributes with a lot of distinct values (for example, ZIP code)
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 58
Check Your Knowledge
Your Thoughts?
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 59
Module 4: Advanced Analytics – Theory and Methods
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 60
Lab Exercise 6: Linear Regression
This Lab is designed to investigate and practice Linear
Regression.
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 61
Lab Exercise 6: Linear Regression - Workflow
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 62
Module 4: Advanced Analytics – Theory and Methods
Lesson 4: Logistic Regression
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 63
Logistic Regression
• Used to estimate the probability that an event will occur as a
function of other variables
The probability that a borrower will default as a function of his
credit score, income, the size of the loan, and his existing debts
• Can be considered a classifier, as well
Assign the class label with the highest probability
• Input variables can be continuous or discrete
• Output:
A set of coefficients that indicate the relative impact of each driver
A linear expression for predicting the log-odds ratio of outcome as
a function of drivers. (Binary classification case)
Log-odds ratio easily converted to the probability of the outcome
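As a sketch, such a model is fit in R with glm(); the loan data frame and column names below are hypothetical:

  # Hypothetical loan data: default is a 0/1 outcome
  set.seed(7)
  loans <- data.frame(creditScore = rnorm(500, 650, 60),
                      income      = rnorm(500, 70, 20),
                      loanAmount  = rnorm(500, 30, 10))
  loans$default <- rbinom(500, 1, plogis(-0.02 * (loans$creditScore - 650)))
  fit <- glm(default ~ creditScore + income + loanAmount, data = loans, family = binomial)
  coef(fit)                              # coefficients on the log-odds scale
  head(predict(fit, type = "response"))  # log-odds converted to probabilities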
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 64
Logistic Regression Use Cases
• The preferred method for many binary classification problems:
Especially if you are interested in the probability of an event, not just predicting the "yes or no"
Try this first; if it fails, then try something more complicated
• Binary Classification examples:
The probability that a borrower will default
The probability that a customer will churn
• Multi-class example
The probability that a politician will vote yes/vote no/not show up
to vote on a given bill
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 65
Logistic Regression Model - Example
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 66
Logistic Regression- Visualizing the Model
(Figure: visualization of the fitted model; blue = defaulters)
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 67
Technical Description (Binary Case)
ln[ P(y=1) / (1 − P(y=1)) ] = β0 + β1·x1 + β2·x2 + … + β(p−1)·x(p−1)
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 68
Interpreting the Estimated Coefficients, bi
• Invert the logit expression:  P(y=1) = 1 / (1 + exp(−(β0 + β1·x1 + … + β(p−1)·x(p−1))))
• exp(bj) tells us how the odds-ratio of y=1 changes for every unit change in xj
• Example: bcreditScore = -0.69
• exp(bcreditScore) = 0.5 = 1/2
• for the same income, loan, and existing debt, the odds-ratio of default is
halved for every point increase in credit score
• Standard packages return the significance of the coefficients in the same
way as in linear regression
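In R, this interpretation can be read off the fitted object (continuing the hypothetical glm() fit above):

  exp(coef(fit))   # multiplicative change in the odds of y = 1 per unit change in each input
  summary(fit)     # coefficient estimates and significance, reported as in linear regression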
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 69
An Interesting Fact About Logistic Regression
"The probability mass equals the
counts"
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 70
Diagnostics
• Hold-out data:
Does the model predict well on data it hasn't seen?
• N-fold cross-validation: Formal estimate of generalization error
• "Pseudo-R2" : 1 – (deviance/null deviance)
Deviance, null deviance both reported by most standard packages
The fraction of "variance" that is explained by the model
Used the way R2 is used
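A one-line sketch of the pseudo-R² computation from the deviances a fitted glm object reports (continuing the hypothetical fit above):

  1 - fit$deviance / fit$null.deviance   # pseudo R-squared: 1 - (deviance / null deviance)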
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 71
Diagnostics (Cont.)
• Sanity check the coefficients
Do the signs make sense? Are the coefficients excessively large?
Wrong sign is an indication of correlated inputs, but doesn't
necessarily affect predictive power.
Excessively large coefficient magnitudes may indicate strongly
correlated inputs; you may want to consider eliminating some
variables, or using regularized regression techniques.
Infinite magnitude coefficients could indicate a variable that strongly
predicts a subset of the output (and doesn't predict well on the rest).
Try a Decision Tree on that variable, to see if you should segment the
data before regressing.
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 72
Diagnostics: ROC Curve
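The slide's plot is not reproduced in this text; one common way to produce an ROC curve and AUC in R is the ROCR package, sketched here on the hypothetical model above (hold-out scores should be used in practice):

  # install.packages("ROCR") if needed
  library(ROCR)
  scores <- predict(fit, type = "response")           # fitted probabilities
  pred   <- prediction(predictions = scores, labels = loans$default)
  perf   <- performance(pred, measure = "tpr", x.measure = "fpr")
  plot(perf)                                          # the ROC curve
  performance(pred, measure = "auc")@y.values[[1]]    # area under the curve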
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 73
Diagnostics: Plot the Histograms of Scores
(Figure: histograms of scores for each class, showing good separation)
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 74
Logistic Regression - Reasons to Choose (+) and
Cautions (-)
Reasons to Choose (+):
• Explanatory value: relative impact of each variable on the outcome (in a more complicated way than linear regression)
• Robust with redundant variables, correlated variables (lose some explanatory value)
• Concise representation with the coefficients
• Easy to score data
• Returns good probability estimates of an event

Cautions (-):
• Does not handle missing values well
• Assumes that each variable affects the log-odds of the outcome linearly and additively (variable transformations and modeling variable interactions can alleviate this; a good idea to take the log of monetary amounts or any variable with a wide dynamic range)
• Cannot handle variables that affect the outcome in a discontinuous way (step functions)
• Doesn't work well with discrete drivers that have a lot of distinct values (for example, ZIP code)
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 75
Check Your Knowledge
Your Thoughts?
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 76
Module 4: Advanced Analytics – Theory and Methods
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 77
Lab Exercise 7: Logistic Regression
This Lab is designed to investigate and practice Logistic
Regression.
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 78
Lab Exercise 7: Logistic Regression - Workflow
Step 1: Define the problem and review input data
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 79
Module 4: Advanced Analytics – Theory and Methods
Lesson 5: Naïve Bayesian Classifier
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 80
Classifiers
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 81
Naïve Bayesian Classifier
• Determine the most probable class label for each object
Based on the observed object attributes
Naïvely assumed to be conditionally independent of each other
Example:
Based on the object's attributes {shape, color, weight}
A given object that is {spherical, yellow, < 60 grams},
may be classified (labeled) as a tennis ball
Class label probabilities are determined using Bayes’ Law
• Input variables are discrete
• Output:
Probability score – proportional to the true probability
Class label – based on the highest probability score
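As a hedged sketch, one common R implementation is naiveBayes() from the e1071 package; the credit data frame below is made up:

  # install.packages("e1071") if needed
  library(e1071)
  set.seed(3)
  credit <- data.frame(
    housing = factor(sample(c("own", "rent", "free"), 300, replace = TRUE)),
    job     = factor(sample(c("skilled", "unskilled"), 300, replace = TRUE)),
    label   = factor(sample(c("good", "bad"), 300, replace = TRUE, prob = c(0.3, 0.7))))
  nb <- naiveBayes(label ~ housing + job, data = credit, laplace = 1)  # laplace: smoothing
  predict(nb, newdata = credit[1:5, ], type = "class")   # most probable class label
  predict(nb, newdata = credit[1:5, ], type = "raw")     # probability scores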
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 82
Naïve Bayesian Classifier - Use Cases
• Preferred method for many text classification problems.
Try this first; if it doesn't work, try something more complicated
• Use cases
Spam filtering, other text classification tasks
Fraud detection
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 83
Building a Training Dataset to Predict Good or Bad Credit
• Predict the credit behavior of
a credit card applicant from
applicant's attributes:
Personal status
Job type
Housing type
Savings amount
• These are all categorical
variables and are better suited
to Naïve Bayesian Classifier
than to logistic regression.
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 84
Technical Description - Bayes' Law
P(C | A) = P(A ∩ C) / P(A) = P(A | C)·P(C) / P(A)
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 85
Apply the Naïve Assumption and Remove a Constant
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 87
Building a Naïve Bayesian Classifier
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 88
Naïve Bayesian Classifiers for the Credit Example
• Class labels: {good, bad}
P(good) = 0.7
P(bad) = 0.3
• Conditional Probabilities
P(own|bad) = 0.62
P(own|good) = 0.75
P(rent|bad) = 0.23
P(rent|good) = 0.14
… and so on
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 89
Naïve Bayesian Classifier for a Particular Applicant
• Given applicant attributes:
  A = {female single, owns home, self-employed, savings > $1000}

aj | Ci | P(aj | Ci)
female single | good | 0.28
female single | bad | 0.36
own | good | 0.75
own | bad | 0.62
self emp | good | 0.14
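Only part of the slide's table survives above; as a sketch of the mechanics, the classifier compares un-normalized scores such as the following. The values for P(self emp | bad) and both savings probabilities are placeholders, not from the slide:

  # Score(C) is proportional to P(C) times the product of P(a_j | C)
  # P(self emp | bad) and both P(savings > 1000 | .) values below are placeholders
  score_good <- 0.7 * 0.28 * 0.75 * 0.14 * 0.30
  score_bad  <- 0.3 * 0.36 * 0.62 * 0.10 * 0.20
  score_good > score_bad   # TRUE here: assign the class with the higher score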
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 90
Naïve Bayesian Implementation Considerations
• Numerical underflow
Resulting from multiplying several probabilities near zero
Preventable by computing the logarithm of the products
• Zero probabilities due to unobserved attribute/classifier pairs
Resulting from rare events
Handled by smoothing (adjusting each probability by a small amount)
• Assign the class label, Ci, that maximizes the value of
  Σ_{j=1..m} log P'(aj | Ci) + log P'(Ci)
  where i = 1, 2, …, n and P' denotes the adjusted probabilities
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 91
Diagnostics
• Hold-out data
How well does the model classify new instances?
• Cross-validation
• ROC curve/AUC
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 92
Diagnostics: Confusion Matrix
Actual Class | Predicted: good | Predicted: bad | Total
good | 671 (true positives, TP) | 29 (false negatives, FN) | 700
bad | 38 (false positives, FP) | 262 (true negatives, TN) | 300
Total | 709 | 291 | 1000
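A sketch of producing such a table in R (here scored back on the hypothetical e1071 objects from the earlier sketch; in practice use hold-out data):

  # Rows: actual class, columns: predicted class
  table(actual    = credit$label,
        predicted = predict(nb, newdata = credit, type = "class"))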
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 93
Naïve Bayesian Classifier - Reasons to Choose (+)
and Cautions (-)
Reasons to Choose (+):
• Handles missing values quite well
• Robust to irrelevant variables
• Easy to implement
• Easy to score data
• Resistant to over-fitting
• Computationally efficient
• Handles very high dimensional problems
• Handles categorical variables with a lot of levels

Cautions (-):
• Numeric variables have to be discretized (categorized into intervals)
• Sensitive to correlated variables ("double-counting")
• Not good for estimating probabilities (stick to the class label or yes/no)
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 94
Check Your Knowledge
Your Thoughts?
1. Consider the following Training Data Set:
   • Apply the Naïve Bayesian Classifier to this data set and compute the probability score for P(y = 1|X) for X = (1,0,0). Show your work.

Training Data Set:
X1 | X2 | X3 | Y
1 | 1 | 1 | 0
1 | 1 | 0 | 0
0 | 0 | 0 | 0
0 | 1 | 0 | 1
1 | 0 | 1 | 1
0 | 1 | 1 | 1
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 95
Check Your Knowledge (Continued)
Your Thoughts?
5. What is a confusion matrix and how is it used to evaluate the effectiveness of the model?
6. Consider the following data set with two input features
temperature and season
• What is the Naïve Bayesian assumption?
• Is the Naïve Bayesian assumption satisfied for this problem?
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 96
Module 4: Advanced Analytics – Theory and Methods
Lesson 5: Naïve Bayesian Classifiers - Summary
During this lesson the following topics were covered:
• Naïve Bayesian Classifier
• Theoretical foundations of the classifier
• Use cases
• Evaluating the effectiveness of the classifier
• The Reasons to Choose (+) and Cautions (-) with the use of
the classifier
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 97
Lab Exercise 8: Naïve Bayesian Classifier
This Lab is designed to investigate and practice the
Naïve Bayesian Classifier analytic technique.
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 98
Lab Exercise 8: Naïve Bayesian Classifier Part 1 - Workflow
Step 1: Set working directory and review training and test data
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 99
Lab Exercise 8: Naïve Bayesian Classifier Part 2 - Workflow
Step 1: Define the problem (translating to an analytics question)
Step 3: Build the training dataset and the test dataset from the database
Step 4: Extract the first 10000 records for the training data set and the remaining 10 for the test
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 100
Module 4: Advanced Analytics – Theory and Methods
Lesson 6: Decision Trees
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 101
Decision Tree Classifier - What is it?
• Used for classification:
Returns probability scores of class membership
Well-calibrated, like logistic regression
Assigns label based on highest scoring class
Some Decision Tree algorithms return simply the most likely class
Regression Trees: a variation for regression
Returns average value at every node
Predictions can be discontinuous at the decision boundaries
• Input variables can be continuous or discrete
• Output:
A tree that describes the decision flow.
Leaf nodes return either a probability score, or simply a classification.
Trees can be converted to a set of "decision rules"
"IF income < $50,000 AND mortgage_amt > $100K THEN default=T with 75%
probability“
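As a sketch, a common R implementation is rpart; the example below uses the small kyphosis data set that ships with the package, not the course's credit data:

  library(rpart)
  fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")
  plot(fit); text(fit)                 # the tree describing the decision flow
  head(predict(fit, type = "prob"))    # probability scores at the leaves
  head(predict(fit, type = "class"))   # class labels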
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 102
Decision Tree – Example of Visual Structure
(Figure: example decision tree. The root node tests Gender; the Female branch leads to a node that tests Income, and the Male branch to a node that tests Age. A branch is the outcome of a test.)
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 103
Decision Tree Classifier - Use Cases
• When a series of questions (yes/no) are answered to arrive at a
classification
Biological species classification
Checklist of symptoms during a doctor’s evaluation of a patient
• When “if-then” conditions are preferred to linear models.
Customer segmentation to predict response rates
Financial decisions such as loan approval
Fraud detection
• Short Decision Trees are the most popular "weak learner" in
ensemble learning techniques
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 104
Example: The Credit Prediction Problem
(Figure: decision tree for the credit example. Root node: 700/1000 good, p(good) = 0.70. A split on housing (own vs. free/rent) produces nodes such as 245/294 with p(good) = 0.83 and 349/501 with p(good) = 0.70; a further split on personal status (female or male div/sep vs. male mar/wid or male single) produces nodes such as 36/88 with p(good) = 0.41 and 70/117 with p(good) = 0.60.)
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 105
General Algorithm
• To construct tree T from training set S
If all the examples in S belong to the same class, or S is sufficiently "pure", then make a leaf labeled with that class.
Otherwise:
select the “most informative” attribute A
partition S according to A’s values
recursively construct sub-trees T1, T2, ..., for the subsets of S
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 106
Step 1: Pick the Most “Informative" Attribute
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 107
Step 1: Pick the most "informative" attribute (Continued)
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 108
Step 1: Pick the Most “Informative" Attribute (Continued)
Conditional Entropy
• The weighted sum of the class entropies for each value of the
attribute
• In English: attribute values (home owner vs. renter) give more
information about class membership
"Home owners are more likely to have good credit than renters"
• Conditional entropy should be lower than unconditioned
entropy
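A small R sketch of these quantities under the usual definitions (entropy in bits; x is a vector of attribute values, y the class labels):

  entropy <- function(counts) {            # class entropy from a vector of counts
    p <- counts / sum(counts)
    p <- p[p > 0]
    -sum(p * log2(p))
  }
  info_gain <- function(x, y) {
    base <- entropy(table(y))              # unconditioned class entropy
    tab  <- table(x, y)                    # rows: attribute values, columns: classes
    w    <- rowSums(tab) / sum(tab)        # weight of each attribute value
    base - sum(w * apply(tab, 1, entropy)) # InfoGain = base entropy - conditional entropy
  }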
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 109
Conditional Entropy Example
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 110
Step 1: Pick the Most “Informative" Attribute (Continued)
Information Gain
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 111
Back to the Credit Prediction Example
Attribute InfoGain
job 0.001
housing 0.013
personal_status 0.006
savings_status 0.028
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 112
Step 2 & 3: Partition on the Selected Variable
• Step 2: Find the partition with the highest InfoGain (in our example the selected partition, on savings_status, has InfoGain = 0.028)
• Step 3: At each resulting node, repeat Steps 1 and 2 until the node is "pure enough"
• Pure nodes => no information gain by splitting on other attributes
(Figure: the root node, 700/1000 with p(good) = 0.70, is split on savings: {<100, (100:500)} vs. {(500:1000), >=1000, no known savings}; one resulting node is 245/294 with p(good) = 0.83.)
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 113
Diagnostics
• Hold-out data
• ROC/AUC
• Confusion Matrix
• FPR/FNR, Precision/Recall
• Do the splits (or the "rules") make sense?
What does the domain expert say?
• How deep is the tree?
Too many layers are prone to over-fit
• Do you get nodes with very few members?
Over-fit
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 114
Decision Tree Classifier - Reasons to Choose (+)
& Cautions (-)
Reasons to Choose (+):
• Takes any input type (numeric, categorical); in principle, can handle categorical variables with many distinct values (ZIP code)
• Robust with redundant variables, correlated variables
• Naturally handles variable interaction
• Handles variables that have a non-linear effect on the outcome
• Computationally efficient to build
• Easy to score data

Cautions (-):
• Decision surfaces can only be axis-aligned
• Tree structure is sensitive to small changes in the training data
• A "deep" tree is probably over-fit (because each split reduces the training data for subsequent splits)
• Not good for outcomes that are dependent on many variables (related to the over-fit problem, above)
• Doesn't naturally handle missing values; however, most implementations include a method for dealing with this
• In practice, decision rules can be fairly complex
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 115
Which Classifier Should I Try?
Typical Questions | Recommended Method
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 117
Check Your Knowledge
Your Thoughts?
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 118
Module 4: Advanced Analytics – Theory and Methods
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 119
Lab Exercise 9: Decision Trees
This lab is designed to investigate and practice Decision
Tree models covered in the course work.
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 120
Lab Exercise 9: Decision Trees - Workflow
Step 1: Set the Working Directory
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 121
Module 4: Advanced Analytics – Theory and Methods
Lesson 7: Time Series Analysis
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 122
Time Series Analysis
• Time Series: Ordered sequence of equally spaced values over time
• Time Series Analysis: Accounts for the internal structure of
observations taken over time
Trend
Seasonality
Cycles
Random
• Goals
To identify the internal structure of the time series
To forecast future events
Example: Based on sales history, what will next December sales be?
• Method: Box-Jenkins (ARMA)
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 123
Box-Jenkins Method: What is it?
• Models historical behavior to forecast the future
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 125
Use Cases
Forecast:
• Next month's sales
• Tomorrow's stock price
• Hourly power demand
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 126
Modeling a Time Series
• Let's model the time series as
Y_t = T_t + S_t + R_t,  t = 1, ..., n  (trend, seasonal, and random components)
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 127
Stationary Sequences
• Box-Jenkins methodology assumes the random component is a
stationary sequence
Constant mean
Constant variance
Autocorrelation does not change over time
Constant correlation of a variable with itself at different times
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 128
De-trending
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 129
Seasonal Adjustment
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 130
ARMA(p, q) Model
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 131
ARIMA(p, d, q) Model
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 132
ACF & PACF
• Auto Correlation Function (ACF)
Correlation of the values of the time series with itself
Autocorrelation "carries over"
Helps to determine the order, q, of a MA model
Where does ACF go to zero?
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 133
Model Selection
• Based on the data, the Data Scientist selects p, d and q
An "art form" that requires domain knowledge, modeling
experience, and a few iterations
Use a simple model when possible
AR model (q = 0)
MA model (p = 0)
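A sketch of fitting a chosen model and forecasting in R; the order (p, d, q) = (1, 1, 1) is only a placeholder the Data Scientist would select, and the built-in AirPassengers series stands in for real data:

  fit <- arima(AirPassengers, order = c(1, 1, 1))  # ARIMA(p = 1, d = 1, q = 1) as an illustration
  fc  <- predict(fit, n.ahead = 12)                # forecast the next 12 periods
  fc$pred                                          # point forecasts
  fc$se                                            # standard errors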
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 134
Time Series Analysis - Reasons to Choose (+) &
Cautions (-)
Reasons to Choose (+):
• Minimal data collection: only have to collect the series itself; do not need to input drivers
• Designed to handle the inherent autocorrelation of lagged time series
• Accounts for trends and seasonality

Cautions (-):
• No meaningful drivers: prediction is based only on past performance (no explanatory value; can't do "what-if" scenarios; can't stress test)
• It's an "art form" to select appropriate parameters
• Only suitable for short term predictions
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 135
Time Series Analysis with R
• The function ts() is used to create time series objects
  mydata <- ts(mydata, start = c(1999, 1), frequency = 12)
• Visualize data
  plot(mydata)
• De-trend using differencing
  diff(mydata)
• Examine ACF and PACF
  acf(mydata)   # computes and plots estimates of the autocorrelations
  pacf(mydata)  # computes and plots estimates of the partial autocorrelations
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 136
Other Useful R Functions in Time Series Analysis
• ar(): Fit an autoregressive time series model to the data
• arima(): Fit an ARIMA model
• predict(): Makes predictions
“predict” is a generic function for predictions from the results of various
model fitting functions. The function invokes particular methods which
depend on the class of the first argument
• arima.sim(): Simulate a time series from an ARIMA model
• decompose(): Decompose a time series into seasonal, trend and
irregular components using moving averages
Deals with additive or multiplicative seasonal component
• stl(): Decompose a time series into seasonal, trend and irregular
components using loess
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 137
Check Your Knowledge
Your Thoughts?
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 138
Module 4: Advanced Analytics – Theory and Methods
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 139
Lab Exercise 10: Time Series Analysis
This Lab is designed to investigate and practice Time
Series Analysis with ARIMA models (Box-Jenkins methodology).
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 140
Lab Exercise 10: Time Series Analysis - Workflow
Step 1: Set the Working Directory
…
Step 12: Generate Predictions
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 141
Module 4: Advanced Analytics – Theory and Methods
Lesson 8: Text Analysis
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 142
Text Analysis
Encompasses the processing and representation of text for
analysis and learning tasks
• High-dimensionality
Every distinct term is a dimension
Green Eggs and Ham: A 50-D problem!
• Data is unstructured
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 143
Text Analysis – Problem-solving Tasks
• Parsing
  Impose a structure on the unstructured/semi-structured text for downstream analysis
• Search/Retrieval
  Which documents have this word or phrase?
  Which documents are about this topic or this entity?
• Text-mining
  "Understand" the content
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 144
Example: Brand Management
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 145
Buzz Tracking: The Process
1. Monitor social networks, review sites for mentions of our products. → Parse the data feeds to get actual content. Find and filter the raw text for product names (use Regular Expressions).
2. Collect the reviews. → Extract the relevant raw text. Convert the raw text into a suitable document representation. Index into our review corpus.
3. Sort the reviews by product. → Classification (or "Topic Tagging").
4. Are they good reviews or bad reviews? → Classification (sentiment analysis). We can keep a simple count here, for trend analysis.
5. Marketing calls up and reads selected reviews in full, for greater insight. → Search/Information Retrieval.
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 146
Parsing the Feeds [Parsing]
• Impose structure on semi-structured data.
• We need to know where to look for what we are looking for.
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 147
Regular Expressions [Parsing]
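The slide's example is not reproduced in this text; a minimal R sketch of filtering raw text for product names with a regular expression (the pattern and posts are made up, reusing the bPhone names from later slides):

  posts   <- c("Loving my new bPhone-5X!",
               "My old bPhone-4G kept dropping calls.",
               "Nice weather today.")
  pattern <- "bPhone-(5X|4G)"                      # hypothetical product-name pattern
  posts[grepl(pattern, posts)]                     # keep only posts that mention a product
  regmatches(posts, regexpr(pattern, posts))       # extract the matched product names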
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 148
Extract and Represent Text [Parsing]
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 149
Document Representation - Other Features [Parsing]
• Feature:
Anything about the document that is used for search or
analysis.
• Title
• Keywords or tags
• Date information
• Source information
• Named entities
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 150
Representing a Corpus (Collection of Documents) [Parsing]
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 151
Text Classification (I) - "Topic Tagging" [Text Mining]
"The bPhone-5X has coverage everywhere. It's much less flaky than
my old bPhone-4G."
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 152
"Topic Tagging" Text
Mining
3. Sort the Reviews by Product
Judicious choice of features
Product mentioned in title?
Tweet, or review?
Term frequency
Canonicalize abbreviations
"5X" = "bPhone-5X"
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 153
Text Classification (II) - Sentiment Analysis [Text Mining]
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 154
Search and Information Retrieval [Search & Retrieval]
5. Marketing calls up and reads selected reviews in full, for greater insight.
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 155
Quality of Search Results [Search & Retrieval]
5. Marketing calls up and reads selected reviews in full, for greater insight.
• Relevance
Is this document what I wanted?
Used to rank search results
• Precision
What % of documents in the result are relevant?
• Recall
Of all the relevant documents in the corpus, what % were returned
to me?
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 156
Computing Relevance (Term Frequency) [Search & Retrieval]
5. Marketing calls up and reads selected reviews in full, for greater insight.
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 157
Inverse Document Frequency (idf) [Search & Retrieval]
5. Marketing calls up and reads selected reviews in full, for greater insight.
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 158
TF-IDF and Modified Retrieval Algorithm [Search & Retrieval]
5. Marketing calls up and reads selected reviews in full, for greater insight.
• Term frequency – inverse document frequency (tf-idf or tfidf) of
term t in document d:
tfidf(t, d) = tf (t, d) * idf(t)
query: brick, phone
• Document with "brick" a few times more relevant than
document with "phone" many times
• Measure of Relevance with tf-idf
• Call up all the documents that have any of the terms from the
query, and sum up the tf-idf of each term:
Relevance(d) = Σ_{i=1..n} tfidf(t_i, d)
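A small R sketch of these quantities on a toy corpus (the documents are made up):

  docs  <- c(d1 = "phone brick phone case", d2 = "phone case", d3 = "brick wall brick")
  toks  <- strsplit(docs, " ")
  vocab <- sort(unique(unlist(toks)))
  tf    <- t(sapply(toks, function(x) table(factor(x, levels = vocab))))  # term frequency per document
  idf   <- log(length(docs) / colSums(tf > 0))                            # inverse document frequency
  tfidf <- sweep(tf, 2, idf, "*")                                         # tfidf(t, d) = tf(t, d) * idf(t)
  rowSums(tfidf[, c("brick", "phone"), drop = FALSE])                     # Relevance(d) for the query {brick, phone}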
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 159
Other Relevance Metrics [Search & Retrieval]
5. Marketing calls up and reads selected reviews in full, for greater insight.
• "Authoritativeness" of source
PageRank is an example of this
• Recency of document
• How often the document has been retrieved by other users
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 160
Effectiveness of Search and Retrieval [Search & Retrieval]
• Relevance metric
important for precision, user experience
• Effective crawl, extraction, indexing
important for recall (and precision)
more important, often, than retrieval algorithm
• MapReduce
Reverse index, corpus term frequencies, idf
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 161
Natural Language Processing
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 162
Example: UFOs Attack
Q:
What is the witness describing?
A: An encounter with a UFO.
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 163
Example: UFOs Attack
(Witness report, with annotations marking the challenges for text analysis: typos, machine errors, turns of phrase, ambiguous meaning)
"When I fist noticed it, I wanted to freak out. There it was an object floating in on a direct path, It didn't move side to side or volley up and down. It moved as if though it had a mission or purpose. I was nervous, and scared, So afraid in fact that I could feel my knees buckling. I guess because I didn't know what to expect and I wanted to act non aggressive. I though that I was either going to be taken, blasted into nothing, or…"
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 164
Example: UFOs Attack
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 165
Challenges - Text Analysis
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 166
Check Your Knowledge
Your Thoughts?
1. What are the two major challenges in the problem of text analysis?
2. What is a reverse index?
3. Why are the corpus metrics dynamic? Provide an example and a scenario that explains the dynamism of the corpus metrics.
4. How does tf-idf enhance the relevance of a search result?
5. List and discuss a few methods that are deployed in text
analysis to reduce the dimensions.
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 167
Module 4: Advanced Analytics – Theory and Methods
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 168
Module 4: Summary
Key Topics Covered in this module:
• Algorithms and technical foundations
• Reasons to Choose (+) and Cautions (-) of each model
• Fitting, scoring and validating models in R and with in-database functions

Methods Covered in this module:
• Categorization (unsupervised): K-means clustering, Association Rules
• Time Series Analysis
• Text Analysis
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 169