4.3 BSMM-8710 - Introduction To Data Analytics (2023S) - Lecture 7 - Classification Models - v1.0
Master of Management
BSMM-8710 – Introduction to Data Analytics [2023S]
Wednesday, June 14th 2023 | 12:30 – 14:20 | OB-507
Module 4 – Advanced Analytical Theory and Methods
Discovery
• How do people generally solve this problem with the kind of data and resources I have?
• Does that work well enough? Or do I have to come up with something new?
[Figure: data analytics lifecycle, with phases labeled Discovery, Data Prep, Model Planning, Communicate Results, Operationalize]
Naïve Bayesian Classifier: What is it?
• Used for classification
  - Actually returns a probability score on class membership:
    • In practice, probabilities generally close to either 0 or 1
    • Not as well calibrated as Logistic Regression
• Input variables are discrete
  - Popular for text classification
• Output:
  - Most implementations: log probability for each class
    • You could convert it to a probability, but in practice, we stay in the log space
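The point about staying in log space can be illustrated with a minimal sketch. The function name and the log scores below are hypothetical, not taken from any particular implementation; the sketch only shows that the predicted class can be read directly from the log scores, and how they could be converted to probabilities if needed.

```python
import math

def log_scores_to_probs(log_scores):
    """Convert per-class log scores into probabilities that sum to 1."""
    # Subtract the largest log score before exponentiating so that
    # very negative scores do not underflow to zero.
    m = max(log_scores.values())
    shifted = {c: math.exp(s - m) for c, s in log_scores.items()}
    total = sum(shifted.values())
    return {c: v / total for c, v in shifted.items()}

# Hypothetical log scores for two classes (illustrative values only)
log_scores = {"good": -6.7, "bad": -9.2}

# The predicted class is simply the one with the largest log score,
# so no conversion is needed for classification.
predicted = max(log_scores, key=log_scores.get)

# Conversion is only needed when an actual probability is wanted.
probs = log_scores_to_probs(log_scores)
```

Because the logarithm is monotonic, ranking classes by log score gives the same answer as ranking them by probability.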
Naïve Bayesian Classifier - Use Cases
• Preferred method for many text classification problems
  - Try this first; if it doesn't work, try something more complicated
• Use cases
  - Spam filtering, other text classification tasks
  - Fraud detection
Building a Training Dataset
Example: Predicting Good or Bad credit
Predict the credit behavior of a credit card applicant from the applicant's attributes:
• personal status
• job type
• housing type
• savings account
These are all categorical variables, so they are better suited to a Naïve Bayesian classifier than to logistic regression.
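With categorical attributes, training amounts to counting. The sketch below assumes a tiny, made-up set of records (the attribute values and labels are hypothetical, not the course data) and estimates the class priors and the conditional probabilities P(value | class) from simple frequency counts.

```python
from collections import Counter, defaultdict

# Hypothetical training records: (attributes, class label)
records = [
    ({"housing": "own", "job": "skilled"}, "good"),
    ({"housing": "own", "job": "unskilled"}, "good"),
    ({"housing": "rent", "job": "skilled"}, "good"),
    ({"housing": "rent", "job": "unskilled"}, "bad"),
    ({"housing": "free", "job": "skilled"}, "bad"),
]

def fit_counts(records):
    """Count class frequencies and attribute values within each class."""
    class_counts = Counter(label for _, label in records)
    # value_counts[label][attribute][value] = number of matching records
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for attrs, label in records:
        for attr, value in attrs.items():
            value_counts[label][attr][value] += 1
    priors = {c: n / len(records) for c, n in class_counts.items()}
    return priors, value_counts, class_counts

def conditional(value_counts, class_counts, attr, value, label):
    """Estimate P(attr = value | class = label) as a relative frequency."""
    return value_counts[label][attr][value] / class_counts[label]

priors, value_counts, class_counts = fit_counts(records)
p_own_given_good = conditional(value_counts, class_counts, "housing", "own", "good")
```

This sketch uses raw frequencies, so a value never seen with a class gets probability zero; practical implementations usually add a smoothing term to avoid that.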
The "Naïve" Assumption: Conditional Independence
so:
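In standard notation, the conditional independence assumption can be written as below. The symbols are an assumption chosen to match the P(good|X) notation of the credit example: C is a class label and X = (x_1, ..., x_n) is the vector of attribute values.

```latex
P(C \mid X) = \frac{P(X \mid C)\,P(C)}{P(X)},
\qquad
P(X \mid C) = \prod_{i=1}^{n} P(x_i \mid C)
\;\;\Longrightarrow\;\;
P(C \mid X) \propto P(C)\,\prod_{i=1}^{n} P(x_i \mid C)
```

The denominator P(X) is the same for every class, so it can be dropped when the goal is only to pick the most likely class.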
Back to Credit Example
P(good|X) ∝ (0.28*0.75*0.14*0.06)*0.7 ≈ 0.0012
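The arithmetic in the credit example can be checked with a short sketch. The numbers are the ones shown above; the variable names are illustrative, and the same score is also computed in log space as a sum of logarithms.

```python
import math

# Values taken from the credit example above
conditionals = [0.28, 0.75, 0.14, 0.06]  # P(x_i | good) for the four attributes
prior_good = 0.7                         # P(good)

# Unnormalized score: product of the conditionals times the prior
score_good = math.prod(conditionals) * prior_good

# Same quantity in log space: a sum of logs instead of a product
log_score_good = sum(math.log(p) for p in conditionals) + math.log(prior_good)

print(round(score_good, 4))  # 0.0012
```

The result is an unnormalized score, not a probability; it only becomes meaningful when compared with the corresponding score for the other class.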
Diagnostics
• Hold-out data
How well does the model classify new instances?
• Cross-validation
• ROC curve/AUC
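Two of the diagnostics listed above can be sketched in plain Python. The helper names, labels, and scores are hypothetical. The AUC is computed here as the fraction of (positive, negative) pairs in which the positive instance receives the higher score, which is one standard way to define the area under the ROC curve.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))

# Hypothetical hold-out labels (1 = positive class) and model scores
y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
area = auc(y_true, y_score)

splits = list(k_fold_indices(10, 5))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.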
Diagnostics: Confusion Matrix
[Figure: confusion matrix; only the "Prediction" axis label is recoverable]
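A confusion matrix for a two-class problem can be built by counting the four combinations of actual and predicted class. The sketch below uses made-up actual and predicted labels, and the function name is hypothetical.

```python
from collections import Counter

def confusion_matrix(actual, predicted, positive):
    """Count true/false positives and negatives for a two-class problem."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            counts["TP"] += 1
        elif a == positive:
            counts["FN"] += 1
        elif p == positive:
            counts["FP"] += 1
        else:
            counts["TN"] += 1
    return counts

# Hypothetical actual and predicted classes for six instances
actual = ["good", "good", "bad", "bad", "good", "bad"]
predicted = ["good", "bad", "bad", "good", "good", "bad"]

cm = confusion_matrix(actual, predicted, positive="good")
accuracy = (cm["TP"] + cm["TN"]) / len(actual)
precision = cm["TP"] / (cm["TP"] + cm["FP"])
recall = cm["TP"] / (cm["TP"] + cm["FN"])
```

Accuracy, precision, and recall are all derived from the same four counts, which is why the confusion matrix is a common starting point for evaluating a classifier.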
• Apply the Naïve Bayesian Classifier to this data set and compute Training Data Set
Check Your Knowledge (Continued)
Your Thoughts?
5. What is a confusion matrix, and how is it used to evaluate the effectiveness of the model?
6. Consider the following data set with two input features, temperature and season:

Temperature      Season   Electricity Usage (Class)
Below Average    Winter   High
Above Average    Winter   Low
Below Average    Summer   Low
Above Average    Summer   High

• What is the Naïve Bayesian assumption?
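As a starting point for this exercise, the sketch below applies the Naïve Bayesian scoring rule to the four rows of the data set, using plain frequency counts with no smoothing. The function name is hypothetical.

```python
from collections import Counter

# The four training rows from the data set above: (temperature, season, usage)
rows = [
    ("Below Average", "Winter", "High"),
    ("Above Average", "Winter", "Low"),
    ("Below Average", "Summer", "Low"),
    ("Above Average", "Summer", "High"),
]

def naive_bayes_scores(rows, temperature, season):
    """Return P(class) * P(temperature | class) * P(season | class) per class."""
    class_counts = Counter(usage for _, _, usage in rows)
    scores = {}
    for c, n_c in class_counts.items():
        in_class = [r for r in rows if r[2] == c]
        p_temp = sum(r[0] == temperature for r in in_class) / n_c
        p_season = sum(r[1] == season for r in in_class) / n_c
        scores[c] = (n_c / len(rows)) * p_temp * p_season
    return scores

scores = naive_bayes_scores(rows, "Below Average", "Winter")
print(scores)  # {'High': 0.125, 'Low': 0.125}
```

Both classes receive the same score because, in this data set, usage depends on the combination of temperature and season, while each feature on its own carries no information about the class. That is exactly the kind of dependence the conditional independence assumption ignores.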
Module 4: Advanced Analytics – Theory and Methods
Lesson 5: Naïve Bayesian Classifiers - Summary
8. Review results
4. Build the Training Dataset and the Test Dataset from the Database
5. Extract the first 10000 records for the training data set and the remaining 10 for the test
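The split described in this step can be sketched with list slicing. The records below are a hypothetical stand-in for the rows pulled from the database; only the sizes (10000 for training, 10 for test) come from the step above.

```python
# Hypothetical stand-in for the rows pulled from the database
records = [{"id": i} for i in range(10010)]

TRAIN_SIZE = 10000

# First 10000 records for training, the remaining 10 for testing
train = records[:TRAIN_SIZE]
test = records[TRAIN_SIZE:]
```

A positional split like this assumes the records are not sorted in a way that correlates with the class; if they are, shuffling before splitting is the usual precaution.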
Decision Tree – Example of Visual Structure
[Figure: example tree. The root node tests Gender; its branches (Female, Male) lead to further tests on Income and Age.]
Branch – outcome of test
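The structure in the figure (a test at each internal node, one branch per outcome, a label at each leaf) can be sketched as a nested dictionary. Only the Gender, Income, and Age tests come from the slide; the specific conditions and leaf labels below are made up for illustration.

```python
# Hypothetical tree: the income/age conditions and leaf labels are invented
tree = {
    "test": "gender",
    "branches": {
        "Female": {
            "test": "income_high",
            "branches": {True: {"leaf": "yes"}, False: {"leaf": "no"}},
        },
        "Male": {
            "test": "age_over_40",
            "branches": {True: {"leaf": "no"}, False: {"leaf": "yes"}},
        },
    },
}

def classify(node, record):
    """Follow the branch matching each test outcome until a leaf is reached."""
    while "leaf" not in node:
        outcome = record[node["test"]]      # evaluate the node's test
        node = node["branches"][outcome]    # follow the matching branch
    return node["leaf"]

label = classify(tree, {"gender": "Female", "income_high": True})
```

Classifying a record is a single walk from the root to a leaf, which is what makes a fitted tree easy to read as a set of decision rules.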
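[Figure: decision tree for the credit example. Recoverable node labels:
  good, 245/294, p(good)=0.83
  good, 349/501, p(good)=0.7
  bad, 36/88, p(good)=0.42
  good, 70/119, p(good)=0.6
Recoverable split conditions: housing=free, rent; housing=own; personal=female, male div/sep]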
• The details vary according to the specific algorithm – CART, ID3, C4.5 – but the
general idea is the same
Attribute InfoGain
job 0.001
housing 0.013
personal_status 0.006
savings_status 0.028
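Information gain is the reduction in class entropy obtained by splitting on an attribute. The sketch below shows one way to compute it; the function names and the row format (a list of dictionaries) are assumptions. It also picks the attribute with the largest gain from the table above, which is what an ID3-style algorithm would choose for the first split.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    """Entropy of the target minus the weighted entropy after splitting on attr."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# InfoGain values from the table above
gains = {
    "job": 0.001,
    "housing": 0.013,
    "personal_status": 0.006,
    "savings_status": 0.028,
}
best = max(gains, key=gains.get)  # 'savings_status'
```

The gains in the table are all small, which suggests that no single attribute separates good from bad credit well on its own.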
• Do I want class probabilities, rather than just class labels?
  Logistic regression, Decision Tree
• Do I want insight into how the variables affect the model?
  Logistic regression, Decision Tree
• Are there categorical variables with a large number of levels?
  Naïve Bayes, Decision Tree
Your Thoughts?
Module 4: Advanced Analytics – Theory and Methods
Lesson 6: Decision Trees - Summary