Module 4 - Theory and Methods

big data

Uploaded by Mahmoud Elnahas

Module 4 – Advanced Analytics – Theory and Methods
Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 1
Module 4: Advanced Analytics – Theory and Methods

Upon completion of this module, you should be able to:


• Examine analytic needs and select an appropriate technique based on
business objectives; initial hypotheses; and the data's structure and volume
• Apply some of the more commonly used methods in Analytics solutions
• Explain the algorithms and the technical foundations for the commonly used
methods
• Explain the environment (use case) in which each technique can provide the
most value
• Use appropriate diagnostic methods to validate the models created
• Use R and in-database analytical functions to fit, score and evaluate models

Where “R” we?
• In Module 3 we reviewed R skills and basic statistics
• You can use R to:
 Generate summary statistics to investigate a data set
 Visualize Data
 Perform statistical tests to analyze data and evaluate models
• Now that you have data, and you can see it, you need to plan
the analytic model and determine the analytic method to be
used

Phase 3 - Model Planning

(Data Analytics Lifecycle: Discovery → Data Prep → Model Planning → Model Building → Communicate Results → Operationalize)

• How do people generally solve this problem with the kind of data and resources I have?
 Does that work well enough? Or do I have to come up with something new?
 What are related or analogous problems? How are they solved? Can I do that?
• Model Planning gate: Do I have a good idea about the type of model to try? Can I refine the analytic plan?
• Model Building gate: Is the model robust enough? Have we failed for sure?
What Kind of Problem do I Need to Solve?
How do I Solve it?
The Problem to Solve                                    Category of Techniques   Covered in this Course
I want to group items by similarity;                    Clustering               K-means clustering
I want to find structure (commonalities) in the data
I want to discover relationships between                Association Rules        Apriori
actions or items
I want to determine the relationship between            Regression               Linear Regression,
the outcome and the input variables                                              Logistic Regression
I want to assign (known) labels to objects              Classification           Naïve Bayes,
                                                                                 Decision Trees
I want to find the structure in a temporal process;     Time Series Analysis     ACF, PACF, ARIMA
I want to forecast the behavior of a temporal process
I want to analyze my text data                          Text Analysis            Regular expressions,
                                                                                 Document representation
                                                                                 (Bag of Words), TF-IDF
Why These Example Techniques?

• Most popular, frequently used:
 Provide the foundation for Data Science skills on which to build
• Relatively easy for new Data Scientists to understand
• Applicable to a broad range of problems in several verticals
Module 4: Advanced Analytics – Theory and Methods
Lesson 1: K-means Clustering

During this lesson the following topics are covered:


• Clustering – Unsupervised learning method
• K-means clustering:
• Use cases
• The algorithm
• Determining the optimum value for K
• Diagnostics to evaluate the effectiveness of the method
• Reasons to Choose (+) and Cautions (-) of the method

Clustering
How do I group these documents by topic?
How do I group my customers by purchase patterns?
• Sort items into groups by similarity:
 Items in a cluster are more similar to each other than they are to
items in other clusters.
 Need to detail the properties that characterize “similarity”
 Or the properties of "distance", the inverse of similarity
• Not a predictive method; finds similarities and relationships
• Our Example: K-means Clustering

K-Means Clustering - What is it?
• Used for clustering numerical data, usually a set of
measurements about objects of interest.
• Input: numerical. There must be a distance metric defined over
the variable space.
 Euclidean distance
• Output: The centers of each discovered cluster, and the
assignment of each input datum to a cluster.
 Centroid

Use Cases
• Often an exploratory technique:
 Discover structure in the data
 Summarize the properties of each cluster
• Sometimes a prelude to classification:
 "Discovering the classes"
• Examples
 The height, weight and average lifespan of animals
 Household income, yearly purchase amount in dollars, number of
household members of customer households
 Patient record with measures of BMI, HBA1C, HDL

The Algorithm

1. Choose K; then select K random "centroids"
 In our example, K=3
2. Assign records to the cluster with the closest centroid

The Algorithm (Continued)

3. Recalculate the resulting centroids
 Centroid: the mean value of all the records in the cluster
4. Repeat steps 2 & 3 until record assignments no longer change

Model Output:
• The final cluster centers
• The final cluster assignments of the training data
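The four steps above can be sketched in code. The course labs use R; the following is an illustrative Python/numpy version, and the sample points and K=2 are assumptions, not course data:

```python
import numpy as np

def kmeans(points, k, seed=0, max_iter=100):
    """Minimal K-means following the four steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: choose K records at random as the initial "centroids"
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 2: assign each record to the cluster with the closest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        # Step 3: recalculate each centroid as the mean of its cluster's records
        new_centroids = np.array([points[assignments == i].mean(axis=0)
                                  if np.any(assignments == i) else centroids[i]
                                  for i in range(k)])
        # Step 4: repeat until the centroids (and hence assignments) stabilize
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assignments

# Two well-separated groups of points
pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
centers, labels = kmeans(pts, k=2)
```

With well-separated data like this, the algorithm recovers the two groups regardless of which points are drawn as initial centroids.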

Picking K
Heuristic: find the "elbow" of the within-sum-of-squares (WSS) plot
as a function of K.

WSS = Σ(i=1..K) Σ(j=1..n_i) |x_ij − c_i|²

K: # of clusters
n_i: # points in ith cluster
c_i: centroid of ith cluster
x_ij: jth point of ith cluster

"Elbows" at k=2,4,6
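The WSS quantity can be computed directly from the cluster assignments; a small sketch with assumed toy data (not course data):

```python
import numpy as np

def wss(points, assignments, centroids):
    """Within-sum-of-squares: total squared distance of each point
    x_ij to the centroid c_i of its assigned cluster."""
    return sum(np.sum((points[assignments == i] - c) ** 2)
               for i, c in enumerate(centroids))

pts = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
# The correct two-cluster assignment gives a much smaller WSS
# than forcing everything into a single cluster.
wss_k2 = wss(pts, np.array([0, 0, 1, 1]), np.array([[0.0, 1.0], [10.0, 1.0]]))
wss_k1 = wss(pts, np.array([0, 0, 0, 0]), np.array([[5.0, 1.0]]))
```

Plotting WSS against k = 1, 2, 3, … and looking for the bend is exactly the elbow heuristic above.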

Diagnostics – Evaluating the Model
• Do the clusters look separated in at least some of the plots
when you do pair-wise plots of the clusters?
 Pair-wise plots can be used when there are not many variables
• Do you have any clusters with few data points?
 Try decreasing the value of K
• Are there splits on variables that you would expect, but don't
see?
 Try increasing the value of K
• Do any of the centroids seem too close to each other?
 Try decreasing the value of K

K-Means Clustering - Reasons to Choose (+) and Cautions (-)

Reasons to Choose (+):
• Easy to implement
• Easy to assign new data to existing clusters (which is the nearest cluster center?)
• Concise output: the coordinates of the K cluster centers

Cautions (-):
• Doesn't handle categorical variables
• Sensitive to initialization (the first guess at the centroids)
• Variables should all be measured on similar or compatible scales (not scale-invariant!)
• K (the number of clusters) must be known or decided a priori (a wrong guess can give poor results)
• Tends to produce "round", equi-sized clusters (not always desirable)

Check Your Knowledge
Your Thoughts?

1. Why do we consider K-means clustering an unsupervised
machine learning algorithm?
2. How do you use “pair-wise” plots to evaluate the effectiveness
of the clustering?
3. Detail the four steps in the K-means clustering algorithm.
4. How do we use WSS to pick the value of K?
5. What is the most common measure of distance used with K-
means clustering algorithms?
6. The attributes of a data set are "purchase decision (Yes/No),
Gender (M/F), income group (<10K, 10-50K, >50K)". Can you use
K-means to cluster this data set?

Module 4: Advanced Analytics – Theory and Methods

Lesson 2: Association Rules


During this lesson the following topics are covered:
 Association Rules mining
 Apriori Algorithm
 Prominent use cases of Association Rules
 Support and Confidence parameters
 Lift and Leverage
 Diagnostics to evaluate the effectiveness of rules generated
 Reasons to Choose (+) and Cautions (-) of the Apriori algorithm

Association Rules
Which of my products tend to be purchased together?
What do other people like this person tend to like/buy/watch?
• Discover "interesting" relationships among variables in a large
database
 Rules of the form “If X is observed, then Y is also observed"
 The definition of "interesting" varies with the algorithm used for
discovery
• Not a predictive method; finds similarities, relationships

Association Rules - Apriori
• Specifically designed for mining over transactions in databases
• Used over itemsets: sets of discrete variables that are linked:
 Retail items that are purchased together
 A set of tasks done in one day
 A set of links clicked on by one user in a single session
• Our Example: Apriori

Apriori Algorithm - What is it?
Support
• Earliest of the association rule algorithms
• Frequent itemset: a set of items L that appears together "often enough":
 Formally: meets a minimum support criterion
 Support: the % of transactions that contain L
• Apriori Property: Any subset of a frequent itemset is also
frequent
 It has at least the support of its superset

Apriori Algorithm (Continued)
Confidence
• Iteratively grow the frequent itemsets from size 1 to size K (or
until we run out of support).
 Apriori property tells us how to prune the search space
• Frequent itemsets are used to find rules X->Y with a minimum
confidence:
 Confidence: of the transactions that contain X, the % that also contain Y
• Output: The set of all rules X -> Y with minimum support and
confidence

Lift and Leverage

Lift(X -> Y) = P(X and Y) / (P(X) · P(Y))
 Lift = 1 when X and Y are independent; lift substantially greater than 1 suggests a genuine association
Leverage(X -> Y) = P(X and Y) − P(X) · P(Y)
 Leverage = 0 when X and Y are independent
Association Rules Implementations
• Market Basket Analysis
 People who buy milk also buy cookies 60% of the time.
• Recommender Systems
 "People who bought what you bought also purchased…"
• Discovering web usage patterns
 People who land on page X click on link Y 76% of the time.

Use Case Example: Credit Records

ID  Credit Attributes
1   credit_good, female_married, job_skilled, home_owner, …
2   credit_bad, male_single, job_unskilled, renter, …

Minimum Support: 50%

Frequent Itemset          Support
credit_good               70%
male_single               55%
job_skilled               63%
home_owner                71%
home_owner, credit_good   53%

The itemset {home_owner, credit_good} has minimum support. The possible rules are
credit_good -> home_owner and home_owner -> credit_good.

Computing Confidence and Lift
Suppose we have 1000 credit records:

             free_housing  home_owner  renter  total
credit_bad             44         186      70     300
credit_good            64         527     109     700
total                 108         713     179    1000

Of the 713 home_owners, 527 have good credit:
 home_owner -> credit_good has confidence 527/713 = 74%
Of the 700 with good credit, 527 are home_owners:
 credit_good -> home_owner has confidence 527/700 = 75%
The lift of these two rules is 0.527 / (0.700 × 0.713) = 1.056
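The confidence and lift computations above can be reproduced in a few lines (counts taken from the contingency table):

```python
# Counts from the 1000 credit records
n = 1000
home_owner, credit_good, both = 713, 700, 527

conf_ho_cg = both / home_owner    # confidence of home_owner -> credit_good
conf_cg_ho = both / credit_good   # confidence of credit_good -> home_owner
# Lift is symmetric: the same value for both rules
lift = (both / n) / ((credit_good / n) * (home_owner / n))
```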

A Sketch of the Algorithm
• If Lk is the set of frequent k-itemsets:
 Generate the candidate set Ck+1 by joining Lk to itself
 Prune out the (k+1)-itemsets that don't have minimum support
Now we have Lk+1
• We know this catches all the frequent (k+1)-itemsets by the
apriori property
 a (k+1)-itemset can't be frequent if any of its subsets aren't
frequent
• Continue until we reach kmax, or run out of support
• From the union of all the Lk, find all the rules with minimum
confidence
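The sketch above can be expressed as a short Python function. This is illustrative only (the baskets are hypothetical, and a production implementation would avoid rescanning the transactions for every candidate):

```python
from itertools import combinations

def apriori_frequent(transactions, min_support):
    """Level-wise search: build candidate (k+1)-itemsets by joining Lk
    with itself, prune by the Apriori property, then check support."""
    n = len(transactions)
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n
    items = sorted({i for t in transactions for i in t})
    L = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = list(L)
    k = 1
    while L:
        candidates = {a | b for a in L for b in L if len(a | b) == k + 1}
        Lset = set(L)
        # A (k+1)-itemset can't be frequent if any of its k-subsets isn't
        L = [c for c in candidates
             if all(frozenset(s) in Lset for s in combinations(c, k))
             and support(c) >= min_support]
        frequent.extend(L)
        k += 1
    return frequent

baskets = [frozenset(b) for b in (
    {"milk", "bread"}, {"milk", "bread", "beer"},
    {"milk", "eggs"}, {"bread", "eggs"})]
freq = apriori_frequent(baskets, min_support=0.5)
```

Rules X -> Y with minimum confidence would then be extracted from `freq`, as in the credit example that follows.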

Step 1: 1-itemsets (L1)

• Let min_support = 0.5 (1000 credit records, so an itemset needs at least 500 occurrences)
• Scan the database to count each 1-itemset
• Prune the itemsets below minimum support

Itemset          Count
credit_good      700
credit_bad       300  (pruned)
male_single      550
male_mar_or_wid   92  (pruned)
female           310  (pruned)
job_skilled      631
job_unskilled    200  (pruned)
home_owner       710
renter           179  (pruned)

Step 2: 2-itemsets (L2)

• Join L1 to itself
• Scan the database to get the counts
• Prune the itemsets below minimum support (500)

Itemset                    Count
credit_good, male_single   402  (pruned)
credit_good, job_skilled   544
credit_good, home_owner    527
male_single, job_skilled   340  (pruned)
male_single, home_owner    408  (pruned)
job_skilled, home_owner    452  (pruned)

Step 3: 3-itemsets

Itemset                                Count
credit_good, job_skilled, home_owner   428  (below minimum support)

• We have run out of support.
• Candidate rules come from L2:
 credit_good -> job_skilled
 job_skilled -> credit_good
 credit_good -> home_owner
 home_owner -> credit_good

Finally: Find Confidence Rules
Rule                             Antecedent    Cnt   Rule Itemset                  Cnt   Confidence
IF credit_good THEN job_skilled  credit_good   700   credit_good AND job_skilled   544   544/700 = 77%
IF credit_good THEN home_owner   credit_good   700   credit_good AND home_owner    527   527/700 = 75%
IF job_skilled THEN credit_good  job_skilled   631   job_skilled AND credit_good   544   544/631 = 86%
IF home_owner THEN credit_good   home_owner    710   home_owner AND credit_good    527   527/710 = 74%

If we want confidence > 80%, only one rule qualifies:
 IF job_skilled THEN credit_good

Diagnostics
• Do the rules make sense?
 What does the domain expert say?
• Make a "test set" from hold-out data:
 Enter some market baskets with a few items missing (selected at
random). Can the rules determine the missing items?
 Remember, some of the test data may not cause a rule to fire.
• Evaluate the rules by lift or leverage.
 Some associations may be coincidental (or obvious).

Apriori - Reasons to Choose (+) and Cautions (-)

Reasons to Choose (+):
• Easy to implement
• Uses a clever observation (the Apriori property) to prune the search space
• Easy to parallelize

Cautions (-):
• Requires many database scans
• Exponential time complexity
• Can mistakenly find spurious (or coincidental) relationships
 Addressed with the Lift and Leverage measures
Check Your Knowledge
Your Thoughts?

1. What is the Apriori property and how is it used in the Apriori
algorithm?
2. List three popular use cases of the Association Rules mining
algorithms.
3. What is the difference between Lift and Leverage? How is Lift
used in evaluating the quality of rules discovered?
4. Define Support and Confidence.
5. How do you use a “hold-out” dataset to evaluate the
effectiveness of the rules generated?

Module 4: Advanced Analytics – Theory and Methods

Lesson 2: Association Rules - Summary


During this lesson the following topics were covered:
 Association Rules mining
 Apriori Algorithm
 Prominent use cases of Association Rules
 Support and Confidence parameters
 Lift and Leverage
 Diagnostics to evaluate the effectiveness of rules generated
 Reasons to Choose (+) and Cautions (-) of the Apriori algorithm

Lab Exercise 5 - Association Rules
• This Lab is designed to investigate and practice Association Rules.

After completing the tasks in this lab you should be able to:
• Use R functions for Association Rule based models

Lab Exercise 5 - Association Rules - Workflow
1. Set the working directory and install the "arules" and "arulesViz" packages
2. Read in the data for modeling
3. Review the transaction data
4. Plot transactions
5. Mine the association rules
6. Read in the Groceries dataset
7. Mine the rules for the Groceries data
8. Extract rules with confidence > 0.8

Module 4: Advanced Analytics – Theory and Methods
Lesson 3: Linear Regression

During this lesson the following topics are covered:


• General description of regression models
• Technical description of a linear regression model
• Common use cases for the linear regression model
• Interpretation and scoring with the linear regression model
• Diagnostics for validating the linear regression model
• The Reasons to Choose (+) and Cautions (-) of the linear
regression model

Regression
• Regression focuses on the relationship between an outcome and
its input variables.
 Provides an estimate of the outcome based on the input values.
 Models how changes in the input variables affect the outcome.
• The outcome can be continuous or discrete.
• Possible use cases:
 Estimate the lifetime value (LTV) of a customer and understand
what influences LTV.
 Estimate the probability that a loan will default and understand
what leads to default.
• Our approaches: linear regression and logistic regression

Linear Regression
• Used to estimate a continuous value as a linear (additive)
function of other variables
 Income as a function of years of education, age, and gender
 House sales price as function of square footage, number of
bedrooms/bathrooms, and lot size
• Outcome variable is continuous.
• Input variables can be continuous or discrete.
• Model Output:
 A set of estimated coefficients that indicate the relative impact of
each input variable on the outcome
 A linear expression for estimating the outcome as a function of
input variables

Linear Regression Model

y = β0 + β1x1 + β2x2 + … + β(p−1)x(p−1) + ε

• The βj are the coefficients to be estimated; ε is the error term
Example: Linear Regression with One Input Variable
• x1 - the number of employees reporting to a manager
• y - the hours per week spent in meetings by the manager

y = β0 + β1x1 + ε

Representing Categorical Attributes
y = β0 + β1·employees + β2·finance + β3·mfg + β4·sales + ε

Possible Situation                        Input Variables
Finance manager with 8 employees          (8, 1, 0, 0)
Manufacturing manager with 8 employees    (8, 0, 1, 0)
Sales manager with 8 employees            (8, 0, 0, 1)
Engineering manager with 8 employees      (8, 0, 0, 0)

• For a categorical attribute with m possible values:
 Add m−1 binary (0/1) variables to the regression model
 The remaining category (the reference level) is represented by setting the m−1 binary
variables equal to zero
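The encoding in the table can be sketched as follows (the department names match the example above; the helper function itself is hypothetical):

```python
def encode_manager(employees, department):
    """Encode a 4-level department attribute as m-1 = 3 binary variables.
    'engineering' is the reference level: all three dummies are zero."""
    dummies = ("finance", "mfg", "sales")
    return (employees,) + tuple(int(department == d) for d in dummies)

finance_row = encode_manager(8, "finance")          # (8, 1, 0, 0)
engineering_row = encode_manager(8, "engineering")  # (8, 0, 0, 0)
```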

Fitting a Line with Ordinary Least Squares (OLS)
• Choose the line that minimizes: Σ(i=1..n) [yi − (β0 + β1xi1 + … + β(p−1)xi,(p−1))]²
• Provides the coefficient estimates, denoted bj

ŷ = 3.21 + 2.19x1
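A minimal OLS fit using numpy's least-squares solver. The fitted line shown above comes from the course's own data; the synthetic data here (true line y = 3 + 2x) is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.uniform(1, 10, size=100)                   # e.g. employees managed
y = 3.0 + 2.0 * x1 + rng.normal(0, 0.5, size=100)   # true line plus noise

# Design matrix with an intercept column; lstsq minimizes ||y - Xb||^2,
# exactly the sum of squared residuals above
X = np.column_stack([np.ones_like(x1), x1])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1 = b   # estimates of beta_0 and beta_1
```

With this much data and modest noise, the estimates land close to the true intercept and slope.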

Interpreting the Estimated Coefficients, bj
ŷ = 4.0 + 2.2·employees + 0.5·finance + 1.9·mfg + 0.6·sales
• Coefficients for numeric input variables
 Change in outcome due to a unit change in input variable*
 Example: b1 = 2.2
 Extra 2.2 hrs/wk in meetings for each additional employee managed *

• Coefficients for binary input variables


 Represent the additive difference from the reference level *
 Example: b2 = 0.5
 Finance managers meet 0.5 hr/wk more than engineering managers do *

• Statistical significance of each coefficient


 Are the coefficients significantly different from zero?
 For small p-values (say < 0.05), the coefficient is statistically significant
*
when all other input values remain the same

Diagnostics – Examining Residuals
• Residuals
 Differences between the observed and estimated outcomes
 The observed values of the error term, ε, in the regression model

 Expressed as: ei = yi − ŷi, for i = 1, 2, …, n

• Errors are assumed to be normally distributed with


 A mean of zero
 Constant variance

Diagnostics – Plotting Residuals
(Four residual plots: the ideal pattern, plus three problem patterns: non-centered residuals, a quadratic trend, and non-constant variance.)

Diagnostics – Residual Normality Assumption
(Figures: an ideal histogram of residuals and an ideal Q-Q plot, both consistent with normally distributed errors.)

Diagnostics – Using Hold-out Data

• Hold-out data
 Training and testing datasets
 Does the model predict well on data it hasn't seen?
• N-fold cross validation
 Partition the data into N groups
 Holding out each group in turn:
  Fit the model on the remaining groups
  Calculate the residuals on the held-out group
 Estimated prediction error is the average over all the residuals

(Diagram: the data is split into three folds D1, D2, D3; training sets #1, #2, #3 each train on two folds and test on the remaining one.)
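The N-fold procedure can be sketched for an OLS model; the synthetic data and the mean-squared-error metric here are assumptions for illustration:

```python
import numpy as np

def nfold_cv_mse(X, y, n_folds=3, seed=0):
    """Estimate prediction error: hold out each fold, fit OLS on the
    rest, then average the squared residuals over the held-out folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    errs = []
    for test_idx in folds:
        train_idx = np.setdiff1d(idx, test_idx)
        b, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        resid = y[test_idx] - X[test_idx] @ b
        errs.extend(resid ** 2)
    return np.mean(errs)

# Assumed synthetic data: y = 1 + 2x plus noise with sd 0.1
rng = np.random.default_rng(2)
x = rng.uniform(0, 5, 60)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(0, 0.1, 60)
cv_err = nfold_cv_mse(X, y, n_folds=3)
```

Because the model matches the data-generating process, the cross-validated error lands near the noise variance (about 0.01).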

Diagnostics – Other Considerations
• R2
 The fraction of the variability in the outcome variable explained
by the fitted regression model.
 Attains values from 0 (poorest fit) to 1 (perfect fit)

• Identify correlated input variables


 Pair-wise scatterplots
 Sanity check the coefficients
 Are the magnitudes excessively large?
 Do the signs make sense?

Linear Regression - Reasons to Choose (+) and
Cautions (-)
Reasons to Choose (+):
• Concise representation (the coefficients)
• Robust to redundant or correlated variables (though you lose some explanatory value)
• Explanatory value: the relative impact of each variable on the outcome
• Easy to score data

Cautions (-):
• Does not handle missing values well
• Assumes that each variable affects the outcome linearly and additively
 Variable transformations and modeling variable interactions can alleviate this
 A good idea to take the log of monetary amounts or any variable with a wide dynamic range
• Does not easily handle variables that affect the outcome in a discontinuous way (step functions)
• Does not work well with categorical attributes with a lot of distinct values (for example, ZIP code)

Check Your Knowledge

Your Thoughts?

1. How is the measure of significance used in determining the


explanatory value of a driver (input variable) with linear
regression models?
2. Detail the challenges with categorical values in linear
regression model.
3. Describe N-Fold cross validation method used for diagnosing a
fitted model.
4. List two use cases of linear regression models.
5. List and discuss two standard checks that you will perform on
the coefficients derived from a linear regression model.

Module 4: Advanced Analytics – Theory and Methods

Lesson 3: Linear Regression - Summary


During this lesson the following topics were covered:
• General description of regression models
• Technical description of a linear regression model
• Common use cases for the linear regression model
• Interpretation and scoring with the linear regression model
• Diagnostics for validating the linear regression model
• The Reasons to Choose (+) and Cautions (-) of the linear
regression model

Lab Exercise 6: Linear Regression
This Lab is designed to investigate and practice Linear Regression.

After completing the tasks in this lab you should be able to:
• Use R functions for Linear Regression (Ordinary
Least Squares – OLS)
• Predict the dependent variables based on the
model
• Investigate different statistical parameter tests
that measure the effectiveness of the model

Lab Exercise 6: Linear Regression - Workflow

1. Set working directory
2. Generate random data to model
3. Generate the OLS model
4. Print and visualize the results
5. Generate summary outputs
6. Introduce a slight non-linearity
7. Perform in-database analysis of linear regression

Module 4: Advanced Analytics – Theory and Methods

Lesson 4: Logistic Regression


During this lesson the following topics are covered:
• Technical description of a logistic regression model
• Common use cases for the logistic regression model
• Interpretation and scoring with the logistic regression model
• Diagnostics for validating the logistic regression model
• Reasons to Choose (+) and Cautions (-) of the logistic
regression model

Logistic Regression
• Used to estimate the probability that an event will occur as a
function of other variables
 The probability that a borrower will default as a function of his
credit score, income, the size of the loan, and his existing debts
• Can be considered a classifier, as well
 Assign the class label with the highest probability
• Input variables can be continuous or discrete
• Output:
 A set of coefficients that indicate the relative impact of each driver
 A linear expression for predicting the log-odds ratio of outcome as
a function of drivers. (Binary classification case)
 Log-odds ratio easily converted to the probability of the outcome

Logistic Regression Use Cases
• The preferred method for many binary classification problems:
 Especially if you are interested in the probability of an event, not
just predicting the "yes or no"
 Try this first; if it fails, then try something more complicated
• Binary Classification examples:
 The probability that a borrower will default
 The probability that a customer will churn
• Multi-class example
 The probability that a politician will vote yes/vote no/not show up
to vote on a given bill

Logistic Regression Model - Example

• Training data: default is 0/1


 default=1 if loan defaulted
• The model will return the probability that a loan with given
characteristics will default
• If you only want a "yes/no" answer, you need a threshold
 The standard threshold is 0.5

Logistic Regression- Visualizing the Model

• Overall fraction of default: ~20%

• Logistic regression returns a score that estimates the probability that a borrower will default

• The graph compares the distribution of defaulters and non-defaulters as a function of the model's predicted probability, for borrowers scoring higher than 0.1 (blue = defaulters)

Technical Description (Binary Case)
ln( P(y=1) / (1 − P(y=1)) ) = β0 + β1x1 + β2x2 + … + β(p−1)x(p−1)

• y=1 is the case of interest: 'TRUE'


• LHS is called logit(P(y=1))
 hence, "logistic regression"
• logit(P(y=1)) is inverted by the sigmoid function
 standard packages can return probability for you
• Categorical variables are expanded as with linear regression
• Iterative solution to obtain coefficient estimates, denoted bj
 "Iteratively re-weighted least squares"
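A quick sketch of the logit and its inverse, the sigmoid (pure Python; no modeling library assumed):

```python
import math

def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of the logit: converts log-odds back to a probability."""
    return 1 / (1 + math.exp(-z))

p_even = sigmoid(0.0)            # log-odds of 0 corresponds to p = 0.5
roundtrip = sigmoid(logit(0.2))  # sigmoid inverts logit, recovering 0.2
```

This inversion is what standard packages do internally when they report probabilities instead of log-odds.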

Interpreting the Estimated Coefficients, bi
• Invert the logit expression:
 P(y=1) = 1 / (1 + exp(−(β0 + β1x1 + … + β(p−1)x(p−1))))
• exp(bj) tells us how the odds-ratio of y=1 changes for every unit change in xj
• Example: bcreditScore = -0.69
• exp(bcreditScore) = 0.5 = 1/2
• for the same income, loan, and existing debt, the odds-ratio of default is
halved for every point increase in credit score
• Standard packages return the significance of the coefficients in the same
way as in linear regression
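The credit-score example can be checked numerically:

```python
import math

b_credit_score = -0.69
odds_multiplier = math.exp(b_credit_score)
# exp(-0.69) is approximately 0.5: each one-point increase in credit
# score halves the odds of default, other inputs held fixed
```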

An Interesting Fact About Logistic Regression
"The probability mass equals the counts"

• If 20% of our loan risk training set defaults
 The sum of all the training set scores will be 20% of the number of training examples

• If 40% of applicants with income < $50,000 default
 The sum of all the training set scores of people in this income category will be 40% of the number of examples in this income category

Diagnostics
• Hold-out data:
 Does the model predict well on data it hasn't seen?
• N-fold cross-validation: Formal estimate of generalization error
• "Pseudo-R2" : 1 – (deviance/null deviance)
 Deviance, null deviance both reported by most standard packages
 The fraction of "variance" that is explained by the model
 Used the way R2 is used

Diagnostics (Cont.)
• Sanity check the coefficients
 Do the signs make sense? Are the coefficients excessively large?
 Wrong sign is an indication of correlated inputs, but doesn't
necessarily affect predictive power.
 Excessively large coefficient magnitudes may indicate strongly
correlated inputs; you may want to consider eliminating some
variables, or using regularized regression techniques.
 Infinite magnitude coefficients could indicate a variable that strongly
predicts a subset of the output (and doesn't predict well on the rest).
 Try a Decision Tree on that variable, to see if you should segment the
data before regressing.

Diagnostics: ROC Curve

• Area under the curve (AUC) tells you how well the model predicts (ideal AUC = 1)
• For logistic regression, the ROC curve can help set the classifier threshold
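AUC can be illustrated without any library using its rank interpretation: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The scores and labels below are hypothetical, and the Python is only a sketch of the idea (the labs use R):

```python
def auc(scores, labels):
    # Rank-comparison form of AUC: count how often a positive example
    # outscores a negative one (ties count half)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Perfectly separated scores give the ideal AUC of 1
perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
```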

Diagnostics: Plot the Histograms of Scores
(Figure: histograms of scores for the two classes, showing good separation)

Logistic Regression - Reasons to Choose (+) and
Cautions (-)
Reasons to Choose (+):
• Explanatory value: relative impact of each variable on the outcome, in a more complicated way than linear regression
• Robust with redundant variables, correlated variables
 Lose some explanatory value
• Concise representation with the coefficients
• Easy to score data
• Returns good probability estimates of an event
• Preserves the summary statistics of the training data
 "The probabilities equal the counts"

Cautions (-):
• Does not handle missing values well
• Assumes that each variable affects the log-odds of the outcome linearly and additively
 Variable transformations and modeling variable interactions can alleviate this
 A good idea to take the log of monetary amounts or any variable with a wide dynamic range
• Cannot handle variables that affect the outcome in a discontinuous way
 Step functions
• Doesn't work well with discrete drivers that have a lot of distinct values
 For example, ZIP code

Check Your Knowledge

Your Thoughts?

1. What is a logit, and how do we compute class probabilities from the logit?
2. How is the ROC curve used to diagnose the effectiveness of the logistic regression model?
3. What is pseudo-R² and what does it measure in a logistic regression model?
4. How do you describe a binary class problem?
5. Compare and contrast linear and logistic regression methods.

Module 4: Advanced Analytics – Theory and Methods

Lesson 4: Logistic Regression - Summary


During this lesson the following topics were covered:
• Technical description of a logistic regression model
• Common use cases for the logistic regression model
• Interpretation and scoring with the logistic regression model
• Diagnostics for validating the logistic regression model
• Reasons to Choose (+) and Cautions (-) of the logistic
regression model

Lab Exercise 7: Logistic Regression
This Lab is designed to investigate and practice Logistic
Regression.

After completing the tasks in this lab you should be able to:
• Use R functions for Logistic Regression (also known as Logit)
• Predict the dependent variables based on the model
• Investigate different statistical parameter tests that measure the effectiveness of the model

Lab Exercise 7: Logistic Regression - Workflow
1. Define the problem and review input data
2. Set the Working Directory
3. Read in and examine the data
4. Build and review the logistic regression model
5. Review the results and interpret the coefficients
6. Visualize the model using the plot function
7. Use the relevel function to re-level the Price factor with value 30 as the base reference
8. Plot the ROC curve
9. Predict the outcome given Age and Income
10. Predict the outcome for a sequence of Age values at price 30 and mean income
11. Predict the outcome for a sequence of income values at price 30 and mean age
12. Use logistic regression as a classifier

Module 4: Advanced Analytics – Theory and Methods

Lesson 5: Naïve Bayesian Classifiers

During this lesson the following topics are covered:


• Naïve Bayesian Classifier
• Theoretical foundations of the classifier
• Use cases
• Evaluating the effectiveness of the classifier
• The Reasons to Choose (+) and Cautions (-) with the use of
the classifier

Classifiers

Where in the catalog should I place this product listing?
Is this email spam?
Is this politician Democrat/Republican/Green?

• Classification: assign labels to objects.
• Usually supervised: training set of pre-classified examples.
• Our examples:
 Naïve Bayesian
 Decision Trees
 (and Logistic Regression)

Naïve Bayesian Classifier
• Determine the most probable class label for each object
 Based on the observed object attributes
 Naïvely assumed to be conditionally independent of each other
 Example:
 Based on the object's attributes {shape, color, weight}
 A given object that is {spherical, yellow, < 60 grams},
may be classified (labeled) as a tennis ball
 Class label probabilities are determined using Bayes’ Law
• Input variables are discrete
• Output:
 Probability score – proportional to the true probability
 Class label – based on the highest probability score

Naïve Bayesian Classifier - Use Cases
• Preferred method for many text classification problems.
 Try this first; if it doesn't work, try something more complicated
• Use cases
 Spam filtering, other text classification tasks
 Fraud detection

Building a Training Dataset to Predict Good or Bad Credit
• Predict the credit behavior of
a credit card applicant from
applicant's attributes:
 Personal status
 Job type
 Housing type
 Savings amount
• These are all categorical
variables and are better suited
to Naïve Bayesian Classifier
than to logistic regression.

Technical Description - Bayes' Law

P( A  C ) P( A | C ) P(C )
P(C | A)  
P( A) P( A)

• C is the class label:


 C ϵ {C1, C2, … Cn}
• A is the observed object attributes
 A = (a1, a2, … am)
• P(C | A) is the probability of C given A is observed
 Called the conditional probability

Apply the Naïve Assumption and Remove a Constant

• For observed attributes A = (a1, a2, … am), we want to compute

 P(Ci | A) = P(a1, a2, …, am | Ci) P(Ci) / P(a1, a2, …, am),  i = 1, 2, …, n

 and assign the class label Ci with the largest P(Ci | A)

• Two simplifications to the calculations:
 Apply the naïve assumption – each aj is conditionally independent of the others – so

 P(a1, a2, …, am | Ci) = P(a1 | Ci) P(a2 | Ci) ⋯ P(am | Ci) = ∏(j = 1..m) P(aj | Ci)

 The denominator P(a1, a2, …, am) is a constant and can be ignored

Building a Naïve Bayesian Classifier

• Applying the two simplifications:

 P(Ci | a1, a2, …, am) ∝ [ ∏(j = 1..m) P(aj | Ci) ] · P(Ci),  i = 1, 2, …, n

• To build a Naïve Bayesian Classifier, collect the following statistics from the training data:
 P(Ci) for all the class labels
 P(aj | Ci) for all possible aj and Ci
• Assign the class label Ci that maximizes the value of [ ∏(j = 1..m) P(aj | Ci) ] · P(Ci)

Naïve Bayesian Classifiers for the Credit Example
• Class labels: {good, bad}
 P(good) = 0.7
 P(bad) = 0.3
• Conditional Probabilities
 P(own|bad) = 0.62
 P(own|good) = 0.75
 P(rent|bad) = 0.23
 P(rent|good) = 0.14
 … and so on

Naïve Bayesian Classifier for a Particular Applicant
• Given applicant attributes:
 A = {female single, owns home, self-employed, savings > $1000}

 aj | Ci | P(aj | Ci)
 female single | good | 0.28
 female single | bad | 0.36
 own | good | 0.75
 own | bad | 0.62
 self emp | good | 0.14
 self emp | bad | 0.17
 savings > 1K | good | 0.06
 savings > 1K | bad | 0.02

 P(good | A) ∝ (0.28 × 0.75 × 0.14 × 0.06) × 0.7 ≈ 0.0012
 P(bad | A) ∝ (0.36 × 0.62 × 0.17 × 0.02) × 0.3 ≈ 0.0002

• Since P(good | A) > P(bad | A), assign the applicant the label "good" credit
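The two scores on this slide can be checked by multiplying out the conditional probabilities from the table; a quick Python sketch (the values are the slide's, the variable names are ours):

```python
# Prior and conditional probabilities taken from the slide's table
p_good, p_bad = 0.7, 0.3
attrs_good = [0.28, 0.75, 0.14, 0.06]   # P(a_j | good) for the four attributes
attrs_bad  = [0.36, 0.62, 0.17, 0.02]   # P(a_j | bad)

score_good = p_good
for p in attrs_good:
    score_good *= p

score_bad = p_bad
for p in attrs_bad:
    score_bad *= p

# score_good (~0.0012) exceeds score_bad (~0.0002), so the label is "good"
```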

Naïve Bayesian Implementation Considerations

• Numerical underflow
 Resulting from multiplying several probabilities near zero
 Preventable by computing the logarithm of the products
• Zero probabilities due to unobserved attribute/classifier pairs
 Resulting from rare events
 Handled by smoothing (adjusting each probability by a small amount)
• Assign the class label Ci that maximizes the value of

 ∑(j = 1..m) log P′(aj | Ci) + log P′(Ci),  i = 1, 2, …, n

 where P′ denotes the adjusted (smoothed) probabilities
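Both fixes can be sketched together: Laplace (add-one) smoothing so unseen attribute/class pairs get a small non-zero probability, and summing logarithms instead of multiplying raw probabilities. The counts below are made up for illustration, and the Python is only a sketch (the labs use R's naiveBayes with its laplace argument):

```python
import math

def smoothed_log_prob(count, class_total, n_values):
    # Laplace smoothing: add 1 to every count so an unseen
    # attribute/class pair gets probability 1/(total + n_values)
    # instead of 0; take the log to avoid numerical underflow
    return math.log((count + 1) / (class_total + n_values))

# Hypothetical: attribute value seen 0 times among 100 "good" examples,
# for an attribute with 3 possible values
lp = smoothed_log_prob(0, 100, 3)
# The smoothed probability is 1/103 rather than 0, so the log stays finite
```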

Diagnostics
• Hold-out data
 How well does the model classify new instances?
• Cross-validation
• ROC curve/AUC

Diagnostics: Confusion Matrix

Actual Class | Predicted: good | Predicted: bad | Total
good | 671 (TP) | 29 (FN) | 700
bad | 38 (FP) | 262 (TN) | 300
Total | 709 | 291 | 1000

• Overall success rate (or accuracy): (TP + TN) / (TP + TN + FP + FN) = (671 + 262)/1000 ≈ 0.93
• TPR: TP / (TP + FN) = 671/700 ≈ 0.96
• FPR: FP / (FP + TN) = 38/300 ≈ 0.13
• FNR: FN / (TP + FN) = 29/700 ≈ 0.04
• Precision: TP / (TP + FP) = 671/709 ≈ 0.95
• Recall (or TPR): TP / (TP + FN) ≈ 0.96
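These metrics can be reproduced from the four cells of the matrix; a small Python sketch using the slide's numbers (the labs themselves use R):

```python
# Cell counts from the confusion matrix on this slide
TP, FN, FP, TN = 671, 29, 38, 262

accuracy  = (TP + TN) / (TP + TN + FP + FN)
tpr       = TP / (TP + FN)   # true positive rate = recall
fpr       = FP / (FP + TN)   # false positive rate
fnr       = FN / (TP + FN)   # false negative rate
precision = TP / (TP + FP)
```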

Naïve Bayesian Classifier - Reasons to Choose (+)
and Cautions (-)
Reasons to Choose (+):
• Handles missing values quite well
• Robust to irrelevant variables
• Easy to implement
• Easy to score data
• Resistant to over-fitting
• Computationally efficient
• Handles very high-dimensional problems
• Handles categorical variables with a lot of levels

Cautions (-):
• Numeric variables have to be discrete (categorized into intervals)
• Sensitive to correlated variables ("double-counting")
• Not good for estimating probabilities (stick to the class label or yes/no)

Check Your Knowledge
Your Thoughts?

1. Consider the following training data set. Apply the Naïve Bayesian Classifier to this data set and compute the probability score P(y = 1 | X) for X = (1, 0, 0). Show your work.

 X1 X2 X3 Y
 1 1 1 0
 1 1 0 0
 0 0 0 0
 0 1 0 1
 1 0 1 1
 0 1 1 1

2. List some prominent use cases of the Naïve Bayesian Classifier.
3. What gives the Naïve Bayesian Classifier the advantage of being computationally inexpensive?
4. Why should we use log-likelihoods rather than pure probability values in the Naïve Bayesian Classifier?

Check Your Knowledge (Continued)
Your Thoughts?

5. What is a confusion matrix and how is it used to evaluate the effectiveness of the model?
6. Consider the following data set with two input features, temperature and season:
 • What is the Naïve Bayesian assumption?
 • Is the Naïve Bayesian assumption satisfied for this problem?

 Temperature | Season | Electricity Usage
 -10 to 50 F | Winter | High
 50 to 70 F | Winter | Low
 70 to 85 F | Summer | Low
 85 to 110 F | Summer | High

Module 4: Advanced Analytics – Theory and Methods
Lesson 5: Naïve Bayesian Classifiers - Summary
During this lesson the following topics were covered:
• Naïve Bayesian Classifier
• Theoretical foundations of the classifier
• Use cases
• Evaluating the effectiveness of the classifier
• The Reasons to Choose (+) and Cautions (-) with the use of
the classifier

Lab Exercise 8: Naïve Bayesian Classifier
This Lab is designed to investigate and practice the
Naïve Bayesian Classifier analytic technique.

After completing the tasks in this lab you should be able to:
• Use R functions for Naïve Bayesian Classification
• Apply the requirements for generating appropriate training data
• Validate the effectiveness of the Naïve Bayesian Classifier with big data

Lab Exercise 8: Naïve Bayesian Classifier Part1 - Workflow
1. Set the working directory and review the training and test data
2. Install and load the library "e1071"
3. Read in and review the data
4. Build the Naïve Bayesian classifier model from first principles
5. Predict the results
6. Use the naiveBayes function
7. Predict the outcome of "Enrolls" with the test data
8. Use Laplace smoothing

Lab Exercise 8: Naïve Bayesian Classifier Part2 - Workflow
1. Define the problem (translating to an analytics question)
2. Open the ODBC connection
3. Build the training dataset and the test dataset from the database
4. Extract the first 10000 records for the training data set and the remaining 10 for the test
5. Execute the NB classifier
6. Validate the effectiveness of the NB classifier with a confusion matrix
7. Execute the NB classifier with MADlib function calls within the database

Module 4: Advanced Analytics – Theory and Methods

Lesson 6: Decision Trees

During this lesson the following topics are covered:


• Overview of Decision Tree classifier
• General algorithm for Decision Trees
• Decision Tree use cases
• Entropy, Information gain
• Reasons to Choose (+) and Cautions (-) of Decision Tree
classifier
• Classifier methods and conditions in which they are best
suited

Decision Tree Classifier - What is it?
• Used for classification:
 Returns probability scores of class membership
 Well-calibrated, like logistic regression
 Assigns label based on highest scoring class
 Some Decision Tree algorithms return simply the most likely class
 Regression Trees: a variation for regression
 Returns average value at every node
 Predictions can be discontinuous at the decision boundaries
• Input variables can be continuous or discrete
• Output:
 A tree that describes the decision flow.
 Leaf nodes return either a probability score, or simply a classification.
 Trees can be converted to a set of "decision rules"
 "IF income < $50,000 AND mortgage_amt > $100K THEN default=T with 75% probability"

Decision Tree – Example of Visual Structure

Gender (Internal Node – decision on variable)
 ├─ Female → Income
 │   ├─ <= 45,000 → Yes
 │   └─ > 45,000 → No
 └─ Male → Age
     ├─ <= 40 → Yes
     └─ > 40 → No

Branch – outcome of test
Internal Node – decision on variable (Gender, Income, Age)
Leaf Node – class label (Yes/No)

Decision Tree Classifier - Use Cases
• When a series of questions (yes/no) are answered to arrive at a
classification
 Biological species classification
 Checklist of symptoms during a doctor’s evaluation of a patient
• When “if-then” conditions are preferred to linear models.
 Customer segmentation to predict response rates
 Financial decisions such as loan approval
 Fraud detection
• Short Decision Trees are the most popular "weak learner" in
ensemble learning techniques

Example: The Credit Prediction Problem
Root: 700/1000, p(good) = 0.70; split on savings
 ├─ savings = (500:1000), >=1000, no known savings → leaf: 245/294, p(good) = 0.83
 └─ savings = <100, (100:500); split on housing
     ├─ housing = own → leaf: 349/501, p(good) = 0.70
     └─ housing = free, rent; split on personal status
         ├─ personal = female, male div/sep → leaf: 36/88, p(good) = 0.41
         └─ personal = male mar/wid, male single → leaf: 70/117, p(good) = 0.60

General Algorithm
• To construct tree T from training set S
 If all examples in S belong to some class in C, or S is sufficiently
"pure", then make a leaf labeled C.
 Otherwise:
 select the “most informative” attribute A
 partition S according to A’s values
 recursively construct sub-trees T1, T2, ..., for the subsets of S

• The details vary according to the specific algorithm – CART, ID3, C4.5 – but the general idea is the same

Step 1: Pick the Most “Informative" Attribute

• Entropy-based methods are one common way
 H = −Σc p(c) log2 p(c)
• H = 0 if p(c) = 0 or 1 for any class
 So for binary classification, H = 0 is a "pure" node
• H is maximum when all classes are equally probable
 For binary classification, H = 1 when classes are 50/50
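Entropy for a class distribution can be sketched in a few lines; the function below is the standard Shannon entropy, with the usual convention 0·log₂ 0 = 0 handled explicitly (Python used for illustration; the labs use R):

```python
import math

def entropy(probs):
    # H = -sum(p * log2(p)), with the convention 0 * log2(0) = 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A pure node has H = 0; a 50/50 binary split has H = 1 (the maximum)
h_pure = entropy([1.0, 0.0])
h_even = entropy([0.5, 0.5])
```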

Step 1: Pick the most "informative" attribute (Continued)

• First, we need to get the base entropy of the data
 For the credit example, p(good) = 0.7, so H = −(0.7 log2 0.7 + 0.3 log2 0.3) ≈ 0.88

Step 1: Pick the Most “Informative" Attribute (Continued)
Conditional Entropy

• The weighted sum of the class entropies for each value of the attribute:
 H(class | A) = Σa P(a) · H(class | A = a)
• In English: attribute values (home owner vs. renter) give more information about class membership
 "Home owners are more likely to have good credit than renters"
• Conditional entropy should be lower than unconditioned entropy

Conditional Entropy Example

housing | free | own | rent
P(housing) | 0.108 | 0.713 | 0.179
P(bad | housing) | 0.407 | 0.261 | 0.391
P(good | housing) | 0.592 | 0.739 | 0.601

Step 1: Pick the Most “Informative" Attribute (Continued)
Information Gain

• The information that you gain by knowing the value of an attribute:
 InfoGain(A) = H(class) − H(class | A)
• So the "most informative" attribute is the attribute with the highest InfoGain
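Using the housing figures from the conditional entropy table above, the gain works out to roughly the 0.013 reported for housing on the next slide (small differences come from the rounded probabilities in the table). A Python sketch of the calculation (the course uses R; this is only an illustration):

```python
import math

def entropy(probs):
    # Shannon entropy with the convention 0 * log2(0) = 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Base entropy of the credit data: p(good) = 0.7
h_base = entropy([0.7, 0.3])

# Housing table: value -> (P(housing), P(good | housing), P(bad | housing))
housing = {
    "free": (0.108, 0.592, 0.407),
    "own":  (0.713, 0.739, 0.261),
    "rent": (0.179, 0.601, 0.391),
}

# Conditional entropy: weighted sum of the per-value class entropies
h_cond = sum(w * entropy([pg, pb]) for w, pg, pb in housing.values())

info_gain = h_base - h_cond   # ~0.01 with these rounded inputs
```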

Back to the Credit Prediction Example

Attribute InfoGain
job 0.001
housing 0.013
personal_status 0.006
savings_status 0.028

Step 2 & 3: Partition on the Selected Variable
• Step 2: Find the partition with the highest InfoGain
 In our example, the selected partition (savings_status) has InfoGain = 0.028
• Step 3: At each resulting node, repeat Steps 1 and 2 until the node is "pure enough"
• Pure nodes => no information gain by splitting on other attributes

(Figure: the root node, 700/1000, p(good) = 0.70, splits on savings; the branch savings = (500:1000), >=1000, no known savings leads to a node with 245/294, p(good) = 0.83)

Diagnostics
• Hold-out data
• ROC/AUC
• Confusion Matrix
• FPR/FNR, Precision/Recall
• Do the splits (or the "rules") make sense?
 What does the domain expert say?
• How deep is the tree?
 Too many layers are prone to over-fit
• Do you get nodes with very few members?
 Over-fit

Decision Tree Classifier - Reasons to Choose (+)
& Cautions (-)
Reasons to Choose (+):
• Takes any input type (numeric, categorical)
 In principle, can handle categorical variables with many distinct values (ZIP code)
• Robust with redundant variables, correlated variables
• Naturally handles variable interaction
• Handles variables that have a non-linear effect on the outcome
• Computationally efficient to build
• Easy to score data
• Many algorithms can return a measure of variable importance
• In principle, decision rules are easy to understand

Cautions (-):
• Decision surfaces can only be axis-aligned
• Tree structure is sensitive to small changes in the training data
• A "deep" tree is probably over-fit
 Because each split reduces the training data for subsequent splits
• Not good for outcomes that are dependent on many variables
 Related to the over-fit problem, above
• Doesn't naturally handle missing values
 However, most implementations include a method for dealing with this
• In practice, decision rules can be fairly complex

Which Classifier Should I Try?
Typical Question | Recommended Method
Do I want class probabilities, rather than just class labels? | Logistic regression, Decision Tree
Do I want insight into how the variables affect the model? | Logistic regression, Decision Tree
Is the problem high-dimensional? | Naïve Bayes
Do I suspect some of the inputs are correlated? | Decision Tree, Logistic regression
Do I suspect some of the inputs are irrelevant? | Decision Tree, Naïve Bayes
Are there categorical variables with a large number of levels? | Naïve Bayes, Decision Tree
Are there mixed variable types? | Decision Tree, Logistic regression
Is there non-linear data or discontinuities in the inputs that will affect the outputs? | Decision Tree

Check Your Knowledge

Your Thoughts?

1. How do you define information gain?
2. For what conditions is the value of entropy at a maximum, and when is it at a minimum?
3. List three use cases of Decision Trees.
4. What are weak learners, and how are they used in ensemble methods?
5. Why do we end up with an over-fitted model with deep trees, and in data sets where outcomes are dependent on many variables?
6. What classification method would you recommend for the following cases:
  High-dimensional data
  Data in which outputs are affected by non-linearity and discontinuity in the inputs

Module 4: Advanced Analytics – Theory and Methods

Lesson 6: Decision Trees - Summary

During this lesson the following topics were covered:


• Overview of Decision Tree classifier
• General algorithm for Decision Trees
• Decision Tree use cases
• Entropy, Information gain
• Reasons to Choose (+) and Cautions (-) of Decision Tree
classifier
• Classifier methods and conditions in which they are best
suited

Lab Exercise 9: Decision Trees
This lab is designed to investigate and practice Decision
Tree models covered in the course work.

After completing the tasks in this lab you should be able to:
• Use R functions for Decision Tree models
• Predict the outcome of an attribute based on the model

Lab Exercise 9: Decision Trees - Workflow
1. Set the Working Directory
2. Read in the Data
3. Build the Decision Tree
4. Plot the Decision Tree
5. Prepare Data to Test the Fitted Model
6. Predict a Decision from the Fitted Model

Module 4: Advanced Analytics – Theory and Methods

Lesson 7: Time Series Analysis


During this lesson the following topics are covered:
• Time Series Analysis and its applications in forecasting
• ARMA and ARIMA Models
• Implementing the Box-Jenkins Methodology using R
• Reasons to Choose (+) and Cautions (-) with Time Series Analysis

Time Series Analysis
• Time Series: Ordered sequence of equally spaced values over time
• Time Series Analysis: Accounts for the internal structure of
observations taken over time
 Trend
 Seasonality
 Cycles
 Random
• Goals
 To identify the internal structure of the time series
 To forecast future events
 Example: Based on sales history, what will next December sales be?
• Method: Box-Jenkins (ARMA)

Box-Jenkins Method: What is it?
• Models historical behavior to forecast the future

• Applies ARMA (Autoregressive Moving Average) models
 Input: the time series, accounting for trend and seasonality components
 Output: expected future values of the time series

Use Cases
Forecast:
• Next month's sales
• Tomorrow's stock price
• Hourly power demand

Modeling a Time Series
• Let's model the time series as Yt = Tt + St + Rt, t = 1, …, n
• Tt: the trend term
 Air travel steadily increased over the last few years
• St: the seasonal term
 Air travel fluctuates in a regular pattern over the course of a year
• Rt: the random component
 To be modeled with ARMA

Stationary Sequences
• Box-Jenkins methodology assumes the random component is a stationary sequence
 Constant mean
 Constant variance
 Autocorrelation does not change over time
 (constant correlation of a variable with itself at different times)
• In practice, to obtain a stationary sequence, the data must be:
 De-trended
 Seasonally adjusted

De-trending

• In this example, we see a linear trend, so we fit a linear model: Tt = m·t + b
• The de-trended series is then Y1t = Yt – Tt
• In some cases, we may have to fit a non-linear model
 Quadratic
 Exponential

Seasonal Adjustment

• Plotting the de-trended series identifies seasons
 For CO2 concentration, we can model the period as being a year, with variation at the month level
• Simple ad-hoc adjustment: take several years of data, calculate the average value for each month, and subtract that from Y1t:
 Y2t = Y1t – St

ARMA(p, q) Model

• The simplest Box-Jenkins model
 Yt is de-trended and seasonally adjusted
• Combination of two process models
 Autoregressive: Yt is a linear combination of its last p values
 Moving average: Yt is a constant value plus the effects of a dampened white noise process over the last q time values (lags)

ARIMA(p, d, q) Model

• ARIMA (Autoregressive Integrated Moving Average) adds a differencing term, d, to the ARMA model
 Includes the de-trending as part of the model
  A linear trend can be removed by d = 1
  A quadratic trend by d = 2
  And so on for higher-order trends
• The general non-seasonal model is known as ARIMA(p, d, q):
 p is the number of autoregressive terms
 d is the number of differences
 q is the number of moving average terms
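The differencing term can be illustrated without any time series library: first differences (d = 1) turn a linear trend into a constant, and differencing twice (d = 2) removes a quadratic trend. A minimal Python sketch (the labs use R's diff()):

```python
def diff(series):
    # First differences: y'_t = y_t - y_(t-1)
    return [b - a for a, b in zip(series, series[1:])]

linear = [2 * t + 5 for t in range(10)]   # series with a linear trend
quadratic = [t * t for t in range(10)]    # series with a quadratic trend

d1 = diff(linear)            # constant sequence: trend removed with d = 1
d2 = diff(diff(quadratic))   # constant sequence: trend removed with d = 2
```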

ACF & PACF
• Autocorrelation Function (ACF)
 Correlation of the values of the time series with itself
 Autocorrelation "carries over"
 Helps to determine the order, q, of an MA model: where does the ACF go to zero?
• Partial Autocorrelation Function (PACF)
 An autocorrelation calculated after removing the linear dependence of the previous terms
 Helps to determine the order, p, of an AR model: where does the PACF go to zero?
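The sample ACF itself is a short formula (the same quantity R's acf() estimates); a Python sketch with a made-up, strongly autocorrelated series, offered only as an illustration:

```python
def sample_acf(x, lag):
    # Sample autocorrelation at a given lag: the sum of lagged
    # cross-products of deviations over the total sum of squares
    n = len(x)
    mean = sum(x) / n
    num = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    den = sum((v - mean) ** 2 for v in x)
    return num / den

# A slowly varying series has a large positive lag-1 autocorrelation
x = [1, 2, 3, 4, 5, 4, 3, 2, 1, 2, 3, 4, 5, 4, 3, 2]
r1 = sample_acf(x, 1)   # positive and large (~0.58 for this series)
```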

Model Selection
• Based on the data, the Data Scientist selects p, d and q
 An "art form" that requires domain knowledge, modeling
experience, and a few iterations
 Use a simple model when possible
 AR model (q = 0)
 MA model (p = 0)

• Multiple models need to be built and compared
 Using the ACF and PACF

Time Series Analysis - Reasons to Choose (+) &
Cautions (-)
Reasons to Choose (+):
• Minimal data collection
 Only have to collect the series itself; do not need to input drivers
• Designed to handle the inherent autocorrelation of lagged time series
• Accounts for trends and seasonality

Cautions (-):
• No meaningful drivers: prediction based only on past performance
 No explanatory value; can't do "what-if" scenarios; can't stress test
• It's an "art form" to select appropriate parameters
• Only suitable for short-term predictions

Time Series Analysis with R
• The function ts() is used to create time series objects
 mydata <- ts(mydata, start=c(1999,1), frequency=12)
• Visualize data
 plot(mydata)
• De-trend using differencing
 diff(mydata)
• Examine ACF and PACF
 acf(mydata): computes and plots estimates of the autocorrelations
 pacf(mydata): computes and plots estimates of the partial autocorrelations

Other Useful R Functions in Time Series Analysis
• ar(): Fit an autoregressive time series model to the data
• arima(): Fit an ARIMA model
• predict(): Makes predictions
“predict” is a generic function for predictions from the results of various
model fitting functions. The function invokes particular methods which
depend on the class of the first argument
• arima.sim(): Simulate a time series from an ARIMA model
• decompose(): Decompose a time series into seasonal, trend and
irregular components using moving averages
Deals with additive or multiplicative seasonal component
• stl(): Decompose a time series into seasonal, trend and irregular
components using loess

Check Your Knowledge

Your Thoughts?

1. What is a time series, and what are the key components of a time series?
2. How do we "de-trend" time series data?
3. What makes data stationary?
4. How is seasonality removed from the data?
5. What are the modeling parameters in ARIMA?
6. How do you use the ACF and PACF to determine the "stationarity" of time series data?

Module 4: Advanced Analytics – Theory and Methods

Lesson 7: Time Series Analysis - Summary

During this lesson the following topics were covered:


• Time Series Analysis and its applications in forecasting
• ARMA and ARIMA Models
• Implementing the Box-Jenkins Methodology using R
• Reasons to Choose (+) and Cautions (-) with Time Series Analysis

Lab Exercise 10: Time Series Analysis
This Lab is designed to investigate and practice Time Series Analysis with ARIMA models (Box-Jenkins methodology).

After completing the tasks in this lab you should be able to:
• Use R functions for ARIMA models
• Apply the requirements for generating appropriate training data
• Validate the effectiveness of the ARIMA models

Lab Exercise 10: Time Series Analysis - Workflow
1. Set the Working Directory
2. Open Connection to Database
3. Get Data from the Database
4. Import the Table
5. Review, Update, and Prepare DataFrame "msales" File for ARIMA Modeling
6. Convert "sales" into Time Series Object
7. Plot the Time Series
8. Analyze the ACF and PACF
9. Difference the Data to Make it Stationary
10. Plot ACF and PACF for the Differenced Data
11. Fit the ARIMA Model
12. Generate Predictions
13. Compare Predicted Values with Actual Values

Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 141
Module 4: Advanced Analytics – Theory and Methods
Lesson 8: Text Analysis

During this lesson the following topics are covered:


• Challenges with text analysis
• Key tasks in text analysis
• Definition of terms used in text analysis
• Term frequency, inverse document frequency
• Representation and features of documents and corpus
• Use of regular expressions in parsing text
• Metrics used to measure the quality of search results
• Relevance with tf-idf, precision and recall

Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 142
Text Analysis
Encompasses the processing and representation of text for
analysis and learning tasks

• High-dimensionality
 Every distinct term is a dimension
 Green Eggs and Ham: A 50-D problem!
• Data is unstructured

Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 143
Text Analysis – Problem-solving Tasks
• Parsing
 Impose a structure on the unstructured/semi-structured text for
downstream analysis
• Search/Retrieval
 Which documents have this word or phrase?
 Which documents are about this topic or this entity?
• Text Mining
 "Understand" the content
 Clustering, classification
• Tasks are not an ordered list
 Does not represent a process
 The set of tasks is used as appropriate for the problem
addressed

Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 144
Example: Brand Management

• Acme currently makes two products


 bPhone
 bEbook
• They have lots of competition. They want to maintain their
reputation for excellent products and keep their sales high.
• What is the buzz on Acme?
 Search for mentions of Acme products
 Twitter, Facebook, Review Sites, etc.
 What do people say?
 Positive or negative?
 What do people think is good or bad about the products?

Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 145
Buzz Tracking: The Process

1. Monitor social networks and review sites for mentions of our products.
 Parse the data feeds to get actual content. Find and filter the raw
text for product names (use regular expressions).
2. Collect the reviews.
 Extract the relevant raw text. Convert the raw text into a suitable
document representation. Index into our review corpus.
3. Sort the reviews by product.
 Classification (or "topic tagging").
4. Are they good reviews or bad reviews?
 Classification (sentiment analysis). We can keep a simple count
here, for trend analysis.
5. Marketing calls up and reads selected reviews in full, for greater insight.
 Search/Information Retrieval.

Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 146
Parsing the Feeds [Parsing]

1. Monitor social networks, review sites for mentions of our products

• Impose structure on semi-structured data.
• We need to know where to look for what we are looking for.

Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 147
Regular Expressions [Parsing]

1. Monitor social networks, review sites for mentions of our products

• Regular Expressions (regexp) are a means for finding words,
strings or particular patterns in text.
• A match is a Boolean response. The basic use is to ask “does this
regexp match this string?”

regexp matches Note
b[Pp]hone bPhone, bphone Character class “[Pp]” matches “P” or “p”
bEbo*k bEbk, bEbok, bEbook, bEboook … “*” matches 0 or more repetitions of the preceding letter
^I love A line starting with "I love" “^” means the start of a string
Acme$ A line ending with “Acme” “$” means the end of a string

Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 148
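The patterns above can be sketched in code. Here is a minimal, illustrative example using Python's `re` module; the product patterns follow the table, while the sample posts are invented for the example:

```python
import re

# Patterns for the two product names, as shown in the table above
patterns = {
    "bPhone": re.compile(r"b[Pp]hone"),   # [Pp] matches "P" or "p"
    "bEbook": re.compile(r"bEbo*k"),      # "*" allows 0 or more "o"s
}

# Hypothetical social-media posts to filter
posts = [
    "I love my bPhone!",
    "the bEbook is terrible",
    "Nothing to see here",
]

# For each post, list the products whose pattern matches anywhere in it
mentions = {
    post: [name for name, pat in patterns.items() if pat.search(post)]
    for post in posts
}
```

A match here is Boolean, exactly as the slide describes: `pat.search(post)` is truthy when the pattern occurs anywhere in the string.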
Extract and Represent Text [Parsing]

2. Collect the reviews

Document Representation: a structure for analysis

Example: "I love LOVE my bPhone!"
Convert this to a vector in the term space:
 acme 0, bebook 0, bphone 1, fantastic 0, love 2, slow 0, terrible 0, terrific 0

• "Bag of words"
 Common representation
 A vector with one dimension for every unique term in the space
 term frequency (tf): the number of times a term occurs
 Good for basic search, classification
• Reduce Dimensionality
 Term space contains not ALL terms
 No stop words: "the", "a"
 Often no pronouns
 Stemming
 "phone" = "phones"

Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 149
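A bag-of-words vector can be built in a few lines. This is a simplified sketch: the tokenizer and the tiny stop-word list are assumptions for illustration, not a production pipeline (no stemming is applied here):

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "i", "my"}  # a tiny illustrative stop list

def bag_of_words(text):
    """Lowercase, tokenize on letters, drop stop words, count term frequencies."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

vec = bag_of_words("I love LOVE my bPhone!")
```

After vectorization, `vec["love"]` is 2 and `vec["bphone"]` is 1, matching the term-space vector shown on the slide; dimensions for terms that do not occur are simply absent (Counter returns 0 for them).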
Document Representation - Other Features [Parsing]

2. Collect the reviews

• Feature:
 Anything about the document that is used for search or
analysis.
• Title
• Keywords or tags
• Date information
• Source information
• Named entities

Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 150
Representing a Corpus (Collection of Documents) [Parsing]

2. Collect the reviews


• Reverse index
 For every possible feature, a list of all the documents that contain
that feature
• Corpus metrics
 Volume
 Corpus-wide term frequencies
 Inverse Document Frequency (IDF)
 more on this later
• Challenge: a Corpus is dynamic
 Index, metrics must be updated continuously

Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 151
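A reverse (inverted) index maps each feature to the documents containing it. A minimal sketch, with an invented three-review corpus of tokenized documents:

```python
from collections import defaultdict

# Hypothetical corpus: document id -> list of terms
docs = {
    "r1": ["love", "bphone"],
    "r2": ["bebook", "slow"],
    "r3": ["bphone", "slow"],
}

# Build the reverse index: term -> sorted list of documents containing it
index = defaultdict(set)
for doc_id, terms in docs.items():
    for t in terms:
        index[t].add(doc_id)
reverse_index = {t: sorted(ids) for t, ids in index.items()}
```

Because the corpus is dynamic, a real system must update this index (and corpus metrics such as document frequencies) continuously as new reviews arrive.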
Text Classification (I) - "Topic Tagging" [Text Mining]

3. Sort the Reviews by Product


Not as straightforward as it seems

"The bPhone-5X has coverage everywhere. It's much less flaky than
my old bPhone-4G."

"While I love Acme's bPhone series, I've been quite disappointed by


the bEbook. The text is illegible, and it makes even my old
Newton look blazingly fast."

Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 152
"Topic Tagging" [Text Mining]
3. Sort the Reviews by Product
Judicious choice of features
 Product mentioned in title?
 Tweet, or review?
 Term frequency
 Canonicalize abbreviations
 "5X" = "bPhone-5X"

Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 153
Text Classification (II) - Sentiment Analysis [Text Mining]

4. Are they good reviews or bad reviews?

• Naïve Bayes is a good first attempt


• But you need tagged training data!
 The major bottleneck in text classification
• What to do?
 Hand-tagging
 Clues from review sites
 thumbs-up or down, # of stars
 Cluster documents, then label the clusters

Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 154
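A Naïve Bayes sentiment classifier can be sketched directly from counts over hand-tagged training data. This is an illustrative multinomial Naïve Bayes with add-one (Laplace) smoothing; the tiny training set is invented for the example and far smaller than anything usable in practice:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (token_list, label). Returns class priors, per-class
    token counts, and the vocabulary."""
    priors = Counter(label for _, label in docs)
    counts = defaultdict(Counter)
    for tokens, label in docs:
        counts[label].update(tokens)
    vocab = {t for tokens, _ in docs for t in tokens}
    return priors, counts, vocab

def classify_nb(tokens, priors, counts, vocab):
    """Pick the label maximizing log P(label) + sum of log P(token | label),
    with add-one smoothing so unseen tokens do not zero out a class."""
    total_docs = sum(priors.values())
    best, best_score = None, -math.inf
    for label in priors:
        score = math.log(priors[label] / total_docs)
        denom = sum(counts[label].values()) + len(vocab)
        for t in tokens:
            score += math.log((counts[label][t] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

# Hand-tagged training data (invented; in practice, the tagging bottleneck
# might be eased with review-site stars or by labeling clusters)
training = [
    (["love", "great", "battery"], "pos"),
    (["terrific", "screen"], "pos"),
    (["terrible", "slow"], "neg"),
    (["illegible", "disappointed", "slow"], "neg"),
]
model = train_nb(training)
label = classify_nb(["slow", "terrible", "screen"], *model)
```

The classifier scores each class by its prior plus the smoothed log-likelihood of the observed tokens, so a review dominated by negative vocabulary lands in the negative class even if it mentions a neutral term like "screen".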
Search and Information Retrieval [Search & Retrieval]

5. Marketing calls up and reads selected reviews in full, for greater insight.

• Marketing calls up documents with queries:


 Collection of search terms
 "bPhone battery life"
 Can also be represented as "bag of words"
 Possibly restricted by other attributes
 within the last month
 from this review site

Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 155
Quality of Search Results [Search & Retrieval]

5. Marketing calls up and reads selected reviews in full, for greater insight.
• Relevance
 Is this document what I wanted?
 Used to rank search results
• Precision
 What % of documents in the result are relevant?
• Recall
 Of all the relevant documents in the corpus, what % were returned
to me?

Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 156
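Precision and recall follow directly from the definitions above. A small sketch, with invented document ids:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved documents that are relevant.
    Recall: fraction of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents returned, 3 relevant documents exist in the corpus, 2 overlap
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```

Here precision is 2/4 and recall is 2/3: half the returned documents were useful, and one relevant document ("d5") was missed.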
Computing Relevance (Term Frequency) [Search & Retrieval]

5. Marketing calls up and reads selected reviews in full, for greater insight.

• Assign each term in a document a weight for that term.


• The weight of a term t in a document d is a function of the
number of times t appears in d.
 The weight can be simply set to the number of occurrences of t
in d :

tf (t, d) = count (t, d)

 The term frequency may optionally be normalized.

Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 157
Inverse Document Frequency (idf) [Search & Retrieval]

5. Marketing calls up and reads selected reviews in full, for greater insight.

idf(t) = log [N/df(t)]


 N: Number of documents in the corpus
 df(t): Number of documents in the corpus that contain a term t
• Measures term uniqueness in corpus
 "phone" vs. "brick"
• Indicates the importance of the term
 Search (relevance)
 Classification (discriminatory power)

Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 158
TF-IDF and Modified Retrieval Algorithm [Search & Retrieval]

5. Marketing calls up and reads selected reviews in full, for greater insight.
• Term frequency - inverse document frequency (tf-idf or tfidf) of
term t in document d:
tfidf(t, d) = tf(t, d) * idf(t)
• Example query: "brick, phone"
 A document with "brick" a few times is more relevant than a
document with "phone" many times
• Measure of relevance with tf-idf:
 Call up all the documents that contain any of the query terms
t_1, …, t_n, and sum the tf-idf of each term:

Relevance(d) = Σ_{i ∈ [1, n]} tfidf(t_i, d)

Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 159
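The tf-idf relevance sum can be computed with a few lines of code. A toy sketch over an invented three-document corpus (documents as token lists), using the plain, unnormalized definitions from the previous slides:

```python
import math
from collections import Counter

# Hypothetical corpus: document id -> tokenized text
corpus = {
    "d1": ["phone", "phone", "phone", "brick"],
    "d2": ["phone", "phone", "phone", "phone", "phone"],
    "d3": ["brick", "brick", "phone"],
}

def idf(term, corpus):
    """idf(t) = log(N / df(t)), where df(t) counts documents containing t."""
    n = len(corpus)
    df = sum(1 for doc in corpus.values() if term in doc)
    return math.log(n / df)

def relevance(query_terms, doc, corpus):
    """Sum of tf(t, d) * idf(t) over the query terms present in the document."""
    tf = Counter(doc)
    return sum(tf[t] * idf(t, corpus) for t in query_terms if t in tf)

scores = {d: relevance(["brick", "phone"], doc, corpus)
          for d, doc in corpus.items()}
```

Note how "phone", which appears in every document, gets idf = log(3/3) = 0 and contributes nothing, so "d2" scores zero despite five occurrences of "phone", while the rarer term "brick" drives the ranking: exactly the "brick vs. phone" behavior the slide describes.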
Other Relevance Metrics [Search & Retrieval]

5. Marketing calls up and reads selected reviews in full, for greater insight.

• "Authoritativeness" of source
 PageRank is an example of this
• Recency of document
• How often the document has been retrieved by other users

Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 160
Effectiveness of Search and Retrieval [Search & Retrieval]

• Relevance metric
 important for precision, user experience
• Effective crawl, extraction, indexing
 important for recall (and precision)
 more important, often, than retrieval algorithm
• MapReduce
 Reverse index, corpus term frequencies, idf

Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 161
Natural Language Processing

• Unstructured text mining means extracting “features”


 Features are structured meta-data representing the document
 Goal: “vectorize” the documents

• After vectorization, apply advanced machine learning techniques


 Clustering
 Classification
 Decision Trees
 Naïve Bayesian Classifier
 Scoring
 Once models have been built, use them to automatically categorize
incoming documents

Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 162
Example: UFOs Attack

July 15th, 2010. Raytown, Missouri


When I fist noticed it, I wanted to freak out. There it was an object
floating in on a direct path, It didn't move side to side or volley up
and down. It moved as if though it had a mission or purpose. I was
nervous, and scared, So afraid in fact that I could feel my knees
buckling. I guess because I didn't know what to expect and I
wanted to act non aggressive. I though that I was either going to be
taken, blasted into nothing, or…

Q: What is the witness describing?
A: An encounter with a UFO.

Q: What is the emotional state of the witness?
A: Frightened, ready to flee.
Source: http://www.infochimps.com/datasets/60000-documented-ufo-sightings-with-text-descriptions-and-metada

Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 163
Example: UFOs Attack

If we really are on the cusp of a major alien
invasion, eyewitness testimony is the key to our
survival as a species.

Strangely, the computer finds this account unreliable!

"When I fist noticed it, I wanted to freak out. There it was an object
floating in on a direct path, It didn't move side to side or volley up
and down. It moved as if though it had a mission or purpose. I was
nervous, and scared, So afraid in fact that I could feel my knees
buckling. I guess because I didn't know what to expect and I
wanted to act non aggressive. I though that I was either going to be
taken, blasted into nothing, or…"

Sources of machine error flagged in this account: typos ("fist",
"though"), turns of phrase, ambiguous meaning, and the keyword
"UFO" is missing.


Source: http://www.infochimps.com/datasets/60000-documented-ufo-sightings-with-text-descriptions-and-metada

Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 164
Example: UFOs Attack

Investigators need to…

• Search for keywords and phrases, but your topic may be
very complicated or keywords may be misspelled
within the document.

• Manage document meta-data like time, location and
author. Identifying this meta-data early may be
key to later retrieval, and the document may be
amenable to structure.

• Understand content via sentiment analysis, custom
dictionaries, natural language processing,
clustering, classification and good ol' domain
expertise.

…with computer-aided text mining

Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 165
Challenges - Text Analysis

1. Finding the right structure for your unstructured data


2. Very high dimensionality
3. Thinking about your problem the right way

Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 166
Check Your Knowledge

Your Thoughts?

1. What are the two major challenges in the problem of text analysis?
2. What is a reverse index?
3. Why are corpus metrics dynamic? Provide an example and a
scenario that explains the dynamism of the corpus metrics.
4. How does tf-idf enhance the relevance of a search result?
5. List and discuss a few methods that are deployed in text
analysis to reduce the dimensions.

Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 167
Module 4: Advanced Analytics – Theory and Methods

Lesson 8: Text Analysis - Summary

During this lesson the following topics were covered:


• Challenges with text analysis
• Key tasks in text analysis
• Definition of terms used in text analysis
• Term frequency, inverse document frequency
• Representation and features of documents and corpus
• Use of regular expressions in parsing text
• Metrics used to measure the quality of search results
• Relevance with tf-idf, precision and recall

Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 168
Module 4: Summary
Key Topics Covered in this module:
• Algorithms and technical foundations
• Key use cases
• Diagnostics and validation of the model
• Reasons to Choose (+) and Cautions (-) of each model
• Fitting, scoring and validating models in R and with in-database functions

Methods Covered in this module:
• Categorization (unsupervised): K-means clustering, Association Rules
• Regression: Linear, Logistic
• Classification (supervised): Naïve Bayesian classifier, Decision Trees
• Time Series Analysis
• Text Analysis

Copyright © 2014 EMC Corporation. All Rights Reserved. Module 4: Analytics Theory/Methods 169
