4.3 BSMM-8710 - Introduction To Data Analytics (2023S) - Lecture 7 - Classification Models - v1.0

This document summarizes a lecture on classification models. It introduces Naive Bayesian classifiers, including their theoretical foundations, use cases, and how to evaluate their effectiveness. It explains that Naive Bayesian classifiers assign probabilities to class membership based on applying Bayes' theorem and making a conditional independence assumption between predictor variables. The document provides an example of how to build a Naive Bayesian classifier to predict credit risk from applicant attributes.


Lecture 6:

Advanced Analytical Theory and Methods:


Classification Models
Dr. Andreas S. Maniatis
Adjunct Instructor

Master of Management
BSMM-8710 – Introduction to Data Analytics [2023S]
Wednesday, June 14th 2023 | 12:30 – 14:20 | OB-507
Module 4 – Advanced Analytical Theory and Methods
Module 4: Advanced Analytical Theory and Methods

Upon completion of this module, you should be able to:


• Examine analytic needs and select an appropriate technique based on business objectives, initial hypotheses, and the data's structure and volume
• Apply some of the more commonly used methods in Analytics solutions
• Explain the algorithms and the technical foundations for the commonly used methods
• Explain the environment (use case) in which each technique can provide the most value
• Use appropriate diagnostic methods to validate the models created
• Use R and in-database analytical functions to fit, score and evaluate models
Where “R” we?
• In Module 3 we reviewed R skills and basic statistics
• You can use R to:
  ▪ Generate summary statistics to investigate a data set
  ▪ Visualize data
  ▪ Perform statistical tests to analyze data and evaluate models
• Now that you have data, and you can see it, you need to plan the analytic model and determine the analytic method to be used

Module 4: Analytics Theory/Methods 4


Applying the Data Analytics Lifecycle

[Figure: Data Analytics Lifecycle – Discovery → Data Prep → Model Planning → Model Building → Communicate Results → Operationalize]

• In a typical Data Analytics problem you would already have gone through:
  ▪ Phase 1 – Discovery: have the problem framed
  ▪ Phase 2 – Data Preparation: have the data prepared
• Now you need to plan the model and determine the method to be used.

Module 4: Analytics Theory/Methods 5


Phase 3 – Model Planning

[Figure: Data Analytics Lifecycle with the Model Planning phase highlighted]

• How do people generally solve this problem with the kind of data and resources I have?
  ▪ Does that work well enough? Or do I have to come up with something new?
• What are related or analogous problems? How are they solved? Can I do that?
• Do I have a good idea about the type of model to try? Can I refine the analytic plan?
• Is the model robust enough? Have we failed for sure?

Module 4: Analytics Theory/Methods 6


What Kind of Problem do I Need to Solve? How do I Solve it?

The Problem to Solve | The Category of Techniques | Covered in this Course
I want to group items by similarity; I want to find structure (commonalities) in the data | Clustering | K-means clustering
I want to discover relationships between actions or items | Association Rules | Apriori
I want to determine the relationship between the outcome and the input variables | Regression | Linear Regression, Logistic Regression
I want to assign (known) labels to objects | Classification | Naïve Bayes, Decision Trees
I want to find the structure in a temporal process; I want to forecast the behavior of a temporal process | Time Series Analysis | ACF, PACF, ARIMA
I want to analyze my text data | Text Analysis | Regular expressions, document representation (Bag of Words)
I want to analyze my social media network data | Network Analytics |

Module 4: Analytics Theory/Methods 7


Why These Example Techniques?
• Most popular, frequently used:
  ▪ Provide the foundation for Data Science skills on which to build
• Relatively easy for new Data Scientists to understand
• Applicable to a broad range of problems in several verticals

Module 4: Analytics Theory/Methods 8


Module 4: Advanced Analytical Theory and Methods

Lesson 5: Naïve Bayesian Classifiers

During this lesson the following topics are covered:


• Naïve Bayesian Classifier
• Theoretical foundations of the classifier
• Use cases
• Evaluating the effectiveness of the classifier
• The Reasons to Choose (+) and Cautions (-) with the use of
the classifier

Module 4: Analytics Theory/Methods 9


Classifiers

Where in the catalog should I place this product listing?
Is this email spam?
Is this politician Democrat/Republican/Green?

• Classification: assign labels to objects.
• Usually supervised: training set of pre-classified examples.
• Our examples:
  ▪ Naïve Bayes
  ▪ Decision Trees
  ▪ (and Logistic Regression)

10
Naïve Bayesian Classifier – What is it?
• Used for classification
  ▪ Actually returns a probability score on class membership
    - In practice, probabilities are generally close to either 0 or 1
    - Not as well calibrated as Logistic Regression
• Input variables are discrete
  ▪ Popular for text classification
• Output:
  ▪ Most implementations: the log probability for each class
    - You could convert it to a probability, but in practice we stay in log space

11
Naïve Bayesian Classifier - Use Cases
• Preferred method for many text classification problems
  ▪ Try this first; if it doesn't work, try something more complicated
• Use cases
  ▪ Spam filtering, other text classification tasks
  ▪ Fraud detection

12
Building a Training Dataset

Example: Predicting Good or Bad Credit

Predict the credit behavior of a credit card applicant from the applicant's attributes:
• personal status
• job type
• housing type
• savings account

These are all categorical variables, better suited to a Naïve Bayesian classifier than to logistic regression.

Module 4: Analytics Theory/Methods 13


Technical Description – Bayes' Law

• B is the class label:
  ▪ B ∈ {b1, b2, …, bn}
• A is the specific assignment of input variables:
  ▪ A = (a1, a2, …, am)
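
The equation referenced here is Bayes' theorem, which in this notation reads:

$$P(b_j \mid A) \;=\; \frac{P(A \mid b_j)\,P(b_j)}{P(A)}$$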

14
The "Naïve" Assumption: Conditional Independence

so:

Independent of class – so it cancels out
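
Under this assumption the likelihood factors, and the class score becomes:

$$P(A \mid b_j) \;=\; \prod_{i=1}^{m} P(a_i \mid b_j)
\qquad\Longrightarrow\qquad
P(b_j \mid A) \;\propto\; P(b_j)\prod_{i=1}^{m} P(a_i \mid b_j)$$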

Module 4: Analytics Theory/Methods 15


Building a Naïve Bayesian Classifier

• To build a Naïve Bayesian classifier, collect the following statistics from the training data:
  ▪ P(bj) for all the class labels
  ▪ P(ai | bj) for all possible assignments of the input variables and class labels

Credit example:
• class labels: {good, bad}
  ▪ P(good) = 0.7
  ▪ P(bad) = 0.3
• aggregates for housing
  ▪ P(own | bad) = 0.62
  ▪ P(own | good) = 0.75
  ▪ P(rent | bad) = 0.23
  ▪ P(rent | good) = 0.14
  ▪ … and so on

Module 4: Analytics Theory/Methods 16


Building a Naïve Bayesian Classifier (Continued)

• Assign the label that maximizes the value
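
Written out from the preceding slides, the decision rule is:

$$\hat{b} \;=\; \arg\max_{b_j}\; P(b_j)\prod_{i=1}^{m} P(a_i \mid b_j)$$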

17
Back to the Credit Example

Credit Example: X
• female
• owns home
• self-employed
• savings > $1000

ai         | bj   | P(ai | bj)
female     | good | 0.28
female     | bad  | 0.36
own        | good | 0.75
own        | bad  | 0.62
self emp   | good | 0.14
self emp   | bad  | 0.17
savings>1K | good | 0.06
savings>1K | bad  | 0.02

P(good|X) ∝ (0.28 * 0.75 * 0.14 * 0.06) * 0.7 = 0.0012
P(bad|X)  ∝ (0.36 * 0.62 * 0.17 * 0.02) * 0.3 = 0.0002

P(good|X) > P(bad|X): assign X the label "good"

Module 4: Analytics Theory/Methods 18


Implementation Guideline

• High-dimensional problems are prone to numerical underflow and unobserved events; it's better to calculate the log probability (with smoothing).

(The smoothing technique varies with the implementation; a sketch follows below.)
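
A minimal sketch of this guideline in R, assuming a data frame `train` of categorical predictors plus a class column; the function and column names are illustrative, not the lab's code:

# Laplace-smoothed, log-space scoring for a Naive Bayesian classifier.
naive_bayes_score <- function(train, target, x, laplace = 1) {
  classes  <- unique(train[[target]])
  features <- setdiff(names(train), target)

  sapply(classes, function(cl) {
    subset_cl <- train[train[[target]] == cl, ]
    # log prior
    score <- log(nrow(subset_cl) / nrow(train))
    # add log-likelihood of each observed feature value, with Laplace smoothing
    for (f in features) {
      n_levels <- length(unique(train[[f]]))
      count    <- sum(subset_cl[[f]] == x[[f]])
      score    <- score + log((count + laplace) / (nrow(subset_cl) + laplace * n_levels))
    }
    score
  })
}

# Usage: pick the class with the highest log score, e.g.
# scores <- naive_bayes_score(train, "credit",
#                             list(housing = "own", personal = "female",
#                                  job = "self emp", savings = ">=1000"))
# names(which.max(scores))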

19
Diagnostics
• Hold-out data
  ▪ How well does the model classify new instances?
• Cross-validation
• ROC curve/AUC

20
Diagnostics: Confusion Matrix

                  Prediction
True Class      bad    good   total
bad             262      38     300
good             29     671     700
total           291     709    1000

(38 = false positives, 29 = false negatives, taking "good" as the positive class)

accuracy: sum of diagonals / sum of table = (262 + 671)/1000 = 0.93
FPR: false positives / sum of first row = 38/300 = 0.13
FNR: false negatives / sum of second row = 29/700 = 0.04
Precision: true positives / sum of second column = 671/709 = 0.95
Recall: true positives / sum of second row = 671/700 = 0.96

Module 4: Analytics Theory/Methods 21


Naïve Bayesian Classifier – Reasons to Choose (+) and Cautions (-)

Reasons to Choose (+)                               | Cautions (-)
Handles missing values quite well                   | Numeric variables have to be discretized (categorized into intervals)
Robust to irrelevant variables                      | Sensitive to correlated variables ("double-counting")
Easy to implement                                   | Not good for estimating probabilities (stick to the class label or yes/no)
Easy to score data                                  |
Resistant to over-fitting                           |
Computationally efficient                           |
Handles very high dimensional problems              |
Handles categorical variables with a lot of levels  |

Module 4: Analytics Theory/Methods 22


Check Your Knowledge

Your Thoughts?

1. Consider the following Training Data Set. Apply the Naïve Bayesian Classifier to this data set and compute P(y = 1 | X) for X = (1, 0, 0). Show your work.

Training Data Set:
X1 | X2 | X3 | Y
 1 |  1 |  1 | 0
 1 |  1 |  0 | 0
 0 |  0 |  0 | 0
 0 |  1 |  0 | 1
 1 |  0 |  1 | 1
 0 |  1 |  1 | 1

2. List some prominent Use Cases of the Naïve Bayesian Classifier.
3. What gives the Naïve Bayesian Classifier the advantage of being computationally inexpensive?
4. Why should we use log-likelihoods rather than pure probability values in the Naïve Bayesian Classifier?

23
Check Your Knowledge (Continued)

Your Thoughts?

5. What is a confusion matrix and how is it used to evaluate the effectiveness of the model?
6. Consider the following data set with two input features, temperature and season:
   • What is the Naïve Bayesian assumption?
   • Is the Naïve Bayesian assumption satisfied for this problem?

Temperature   | Season | Electricity Usage (Class)
Below Average | Winter | High
Above Average | Winter | Low
Below Average | Summer | Low
Above Average | Summer | High

24
Module 4: Advanced Analytics – Theory and Methods
Lesson 5: Naïve Bayesian Classifiers - Summary

During this lesson the following topics were covered:


• Naïve Bayesian Classifier
• Theoretical foundations of the classifier
• Use cases
• Evaluating the effectiveness of the classifier
• The Reasons to Choose (+) and Cautions (-) with the use of the classifier

Module 4: Analytics Theory/Methods 25


Lab Exercise 8: Naïve Bayesian Classifier
This lab is designed to investigate and practice the Naïve Bayesian Classifier analytic technique.

After completing the tasks in this lab you should be able to:
• Use R functions for Naïve Bayesian Classification
• Apply the requirements for generating appropriate training data
• Validate the effectiveness of the Naïve Bayesian Classifier with big data

Module 4: Analytics Theory/Methods 26


Lab Exercise 8: Naïve Bayesian Classifier Part 1 – Workflow

1. Set working directory and review training and test data
2. Install and load the "e1071" library
3. Read in and review data
4. Build the Naïve Bayesian classifier model from first principles
5. Predict the results
6. Execute the Naïve Bayesian Classifier with the e1071 package (a sketch follows below)
7. Predict the outcome of "Enrolls" with the test data
8. Review results
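
A minimal sketch of steps 6–8, assuming training and test CSV files and an "Enrolls" outcome column; the file names and data handling are illustrative assumptions, not the lab's exact code:

library(e1071)

# Read in the (hypothetical) training and test files; the names are placeholders
traindata <- read.csv("nbtrain.csv", stringsAsFactors = TRUE)
testdata  <- read.csv("nbtest.csv",  stringsAsFactors = TRUE)

# Fit the Naive Bayes model; laplace > 0 applies add-one smoothing to the counts
model <- naiveBayes(Enrolls ~ ., data = traindata, laplace = 1)

# Predict class labels for the test data (type = "raw" would return per-class probabilities)
results <- predict(model, newdata = testdata, type = "class")

# Review the results against the known labels
table(predicted = results, actual = testdata$Enrolls)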

Module 4: Analytics Theory/Methods 27


Lab Exercise 8: Naïve Bayesian Classifier Part 2 – Workflow

1. Define the problem (translating to an analytics question)
2. Establish the ODBC connection
3. Open connections to the ODBC database
4. Build the training dataset and the test dataset from the database
5. Extract the first 10,000 records for the training data set and the remaining 10 for the test
6. Execute the NB Classifier
7. Validate the effectiveness of the NB Classifier with a confusion matrix
8. Execute the NB Classifier with MADlib function calls within the database

Module 4: Analytics Theory/Methods 28


Module 4: Advanced Analytics – Theory and Methods
Lesson 6: Decision Trees

During this lesson the following topics are covered:


• Overview of Decision Tree classifier
• General algorithm for Decision Trees
• Decision Tree use cases
• Entropy, Information gain
• Reasons to Choose (+) and Cautions (-) of Decision Tree classifier
• Classifier methods and conditions in which they are best suited

Module 4: Analytics Theory/Methods 29


Decision Tree Classifier – What is it?
• Used for classification:
  ▪ Returns probability scores of class membership
    - Well-calibrated, like logistic regression
  ▪ Assigns label based on highest scoring class
    - Some Decision Tree algorithms return simply the most likely class
  ▪ Regression Trees: a variation for regression
    - Returns the average value at every node
    - Predictions can be discontinuous at the decision boundaries
• Input variables can be continuous or discrete
• Output:
  ▪ A tree that describes the decision flow
  ▪ Leaf nodes return either a probability score, or simply a classification
  ▪ Trees can be converted to a set of "decision rules"
    - "IF income < $50,000 AND mortgage_amt > $100K THEN default=T with 75% probability"
30
Decision Tree – Example of Visual Structure

[Figure: example decision tree]
  Gender (internal node – decision on variable)
    ├─ Female → Income: <= 45,000 → Yes | > 45,000 → No
    └─ Male   → Age:    <= 40 → Yes | > 40 → No

  Branch – outcome of test
  Internal Node – decision on variable
  Leaf Node – class label

Module 4: Analytics Theory/Methods 31


Decision Tree Classifier – Use Cases
• When a series of questions (yes/no) are answered to arrive at a classification
  ▪ Biological species classification
  ▪ Checklist of symptoms during a doctor's evaluation of a patient
• When "if-then" conditions are preferred to linear models
  ▪ Customer segmentation to predict response rates
  ▪ Financial decisions such as loan approval
  ▪ Fraud detection
• Short Decision Trees are the most popular "weak learner" in ensemble learning techniques



32
Example: The Credit Prediction Problem

[Figure: decision tree fitted to the credit data]
Root: good, 700/1000, p(good) = 0.7
├─ savings = (500:1000), >=1000, no known savings → leaf: good, 245/294, p(good) = 0.83
└─ savings = <100, (100:500)
   ├─ housing = own → leaf: good, 349/501, p(good) = 0.7
   └─ housing = free, rent
      ├─ personal = male mar/wid, male single → leaf: good, 70/119, p(good) = 0.6
      └─ personal = female, male div/sep → leaf: bad, 36/88, p(good) = 0.42

Module 4: Analytics Theory/Methods 33


General Algorithm
• To construct tree T from training set S:
  ▪ If all examples in S belong to some class in C, or S is sufficiently "pure", then make a leaf labeled C.
  ▪ Otherwise:
    - select the "most informative" attribute A
    - partition S according to A's values
    - recursively construct sub-trees T1, T2, ..., for the subsets of S
• The details vary according to the specific algorithm – CART, ID3, C4.5 – but the general idea is the same (a sketch follows below)
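
A rough sketch of this recursion in R, using the entropy/InfoGain criteria introduced on the following slides; the data frame layout, function names, and the "pure enough" threshold are illustrative assumptions, not any specific algorithm's implementation:

# Entropy of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of splitting 'data' on one categorical attribute
info_gain <- function(data, attribute, target) {
  h_base <- entropy(data[[target]])
  h_cond <- sum(sapply(split(data, data[[attribute]]), function(s) {
    (nrow(s) / nrow(data)) * entropy(s[[target]])
  }))
  h_base - h_cond
}

# Recursively grow a tree from a data frame of categorical predictors
grow_tree <- function(data, target, min_purity = 0.95) {
  counts   <- table(data[[target]])
  majority <- names(which.max(counts))
  purity   <- max(counts) / nrow(data)
  attributes <- setdiff(names(data), target)

  # Leaf if the node is pure enough or there is nothing left to split on
  if (purity >= min_purity || length(attributes) == 0) {
    return(list(leaf = TRUE, label = majority, p = purity))
  }

  # Otherwise pick the most informative attribute and recurse on its values
  gains <- sapply(attributes, function(a) info_gain(data, a, target))
  best  <- attributes[which.max(gains)]
  children <- lapply(split(data, data[[best]]), function(s) {
    grow_tree(s[, names(s) != best, drop = FALSE], target, min_purity)
  })
  list(leaf = FALSE, split_on = best, children = children)
}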



34
Step 1: Pick the Most "Informative" Attribute

• Entropy-based methods are one common way
• H = 0 if p(c) = 0 or 1 for any class
  ▪ So for binary classification, H = 0 is a "pure" node
• H is maximum when all classes are equally probable
  ▪ For binary classification, H = 1 when classes are 50/50
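
The standard entropy definition used here is:

$$H \;=\; -\sum_{c \in C} p(c)\,\log_2 p(c)$$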



35
Step 1: Pick the most "informative" attribute (Continued)

• First, we need to get the base entropy of the data
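
For the credit data introduced earlier (70% good, 30% bad), this works out to approximately:

$$H_{\text{base}} \;=\; -0.7\,\log_2 0.7 \;-\; 0.3\,\log_2 0.3 \;\approx\; 0.88$$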



36
Step 1: Pick the Most "Informative" Attribute (Continued) – Conditional Entropy

• The weighted sum of the class entropies for each value of the attribute
• In English: attribute values (home owner vs. renter) give more information about class membership
  ▪ "Home owners are more likely to have good credit than renters"
• Conditional entropy should be lower than unconditioned entropy
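
For an attribute A with values a, the conditional entropy is:

$$H(\text{class} \mid A) \;=\; \sum_{a} P(A = a)\; H(\text{class} \mid A = a)$$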



37
Conditional Entropy Example

housing =           free    own    rent
P(housing)          0.108   0.713  0.179
P(bad | housing)    0.407   0.261  0.391
P(good | housing)   0.592   0.739  0.601
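
As a rough check, plugging these values into the conditional entropy formula, with H(p) denoting the binary entropy −p log₂ p − (1−p) log₂(1−p):

$$H(\text{class} \mid \text{housing}) \;\approx\; 0.108\,H(0.407) + 0.713\,H(0.261) + 0.179\,H(0.391) \;\approx\; 0.87$$

Against the base entropy of about 0.88, that gives an information gain of roughly 0.01, in line with the 0.013 reported for housing in the InfoGain table that follows.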



38
Step 1: Pick the Most "Informative" Attribute (Continued) – Information Gain

• The information that you gain by knowing the value of an attribute
• So the "most informative" attribute is the attribute with the highest InfoGain
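
In formula form:

$$\text{InfoGain}(A) \;=\; H_{\text{base}} \;-\; H(\text{class} \mid A)$$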



39
Back to the Credit Prediction Example

Attribute          InfoGain
job                0.001
housing            0.013
personal_status    0.006
savings_status     0.028

Module 4: Analytics Theory/Methods 40


Step 2 & 3: Partition on the Selected Variable

• Step 2: Find the partition


with the highest InfoGain
 In our example the selected good
700/1000
partition has InfoGain = p(good)=0.7
savings=(500:100),>
0.028 =1000,no known
savings= <100, (100:500) savings

• Step 3: At each resulting good


node, repeat Steps 1 and 2 245/294
p(good)=0.83
 until node is "pure enough"
• Pure nodes => no
information gain by
splitting on other
attributes

Module 4: Analytics Theory/Methods 41


Diagnostics
• Hold-out data
• ROC/AUC
• Confusion Matrix
• FPR/FNR, Precision/Recall
• Do the splits (or the "rules") make sense?
  ▪ What does the domain expert say?
• How deep is the tree?
  ▪ Too many layers are prone to over-fit
• Do you get nodes with very few members?
  ▪ Over-fit



42
Decision Tree Classifier – Reasons to Choose (+) and Cautions (-)

Reasons to Choose (+) | Cautions (-)
Takes any input type (numeric, categorical); in principle, can handle categorical variables with many distinct values (ZIP code) | Decision surfaces can only be axis-aligned
Robust with redundant variables, correlated variables | Tree structure is sensitive to small changes in the training data
Naturally handles variable interaction | A "deep" tree is probably over-fit, because each split reduces the training data for subsequent splits
Handles variables that have a non-linear effect on outcome | Not good for outcomes that are dependent on many variables (related to the over-fit problem, above)
Computationally efficient to build | Doesn't naturally handle missing values; however most implementations include a method for dealing with this
Easy to score data | In practice, decision rules can be fairly complex
Many algorithms can return a measure of variable importance |
In principle, decision rules are easy to understand |

Module 4: Analytics Theory/Methods 43




Which Classifier Should I Try?

Typical Questions | Recommended Method
Do I want class probabilities, rather than just class labels? | Logistic Regression, Decision Tree
Do I want insight into how the variables affect the model? | Logistic Regression, Decision Tree
Is the problem high-dimensional? | Naïve Bayes
Do I suspect some of the inputs are correlated? | Decision Tree, Logistic Regression
Do I suspect some of the inputs are irrelevant? | Decision Tree, Naïve Bayes
Are there categorical variables with a large number of levels? | Naïve Bayes, Decision Tree
Are there mixed variable types? | Decision Tree, Logistic Regression
Is there non-linear data or discontinuities in the inputs that will affect the outputs? | Decision Tree

Module 4: Analytics Theory/Methods 45


Check Your Knowledge

Your Thoughts?

1. How do you define information gain?
2. For what conditions is the value of entropy at a maximum, and when is it at a minimum?
3. List three use cases of Decision Trees.
4. What are weak learners and how are they used in ensemble methods?
5. Why do we end up with an over-fitted model with deep trees, and in data sets where outcomes depend on many variables?
6. What classification method would you recommend for the following cases:
  ▪ High-dimensional data
  ▪ Data in which outputs are affected by non-linearity and discontinuity in the inputs

46
Module 4: Advanced Analytics – Theory and Methods
Lesson 6: Decision Trees - Summary

During this lesson the following topics were covered:


• Overview of Decision Tree classifier
• General algorithm for Decision Trees
• Decision Tree use cases
• Entropy, Information gain
• Reasons to Choose (+) and Cautions (-) of Decision Tree classifier
• Classifier methods and conditions in which they are best suited

Module 4: Analytics Theory/Methods 47


Lab Exercise 9: Decision Trees
This lab is designed to investigate and practice the Decision Tree (DT) models covered in the course work.

After completing the tasks in this lab you should be able to:
• Use R functions for Decision Tree models
• Predict the outcome of an attribute based on the model

Module 4: Analytics Theory/Methods 48


Lab Exercise 9: Decision Trees – Workflow

1. Set the working directory
2. Read in the data
3. Build the decision tree (a sketch follows below)
4. Plot the decision tree
5. Prepare data to test the fitted model
6. Predict a decision from the fitted model
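
A minimal sketch of this workflow, assuming a CSV of categorical predictors with a "Play_Decision" outcome column; the file name, column names, values, and control settings are illustrative assumptions, not necessarily the lab's exact code:

library(rpart)
library(rpart.plot)

# Steps 1-2: set the working directory and read in the (hypothetical) data file
setwd("~/lab9")
dtdata <- read.csv("DTdata.csv", stringsAsFactors = TRUE)

# Step 3: build the decision tree (method = "class" for a classification tree)
fit <- rpart(Play_Decision ~ ., data = dtdata, method = "class",
             control = rpart.control(minsplit = 2, cp = 0.001))

# Step 4: plot the decision tree
rpart.plot(fit, type = 4, extra = 1)

# Steps 5-6: prepare a test observation and predict a decision from the fitted model
newdata <- data.frame(Outlook = "rainy", Temperature = "mild",
                      Humidity = "high", Wind = "weak")
predict(fit, newdata = newdata, type = "class")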

Module 4: Analytics Theory/Methods 49


Thank you!
Questions?

Dr. Andreas S. Maniatis


Adjunct Instructor
[email protected]
http://www.linkedin.com/in/andreasmaniatis
