4.3 BSMM-8710 - Introduction To Data Analytics (2023S) - Lecture 7 - Classification Models - v1.0

This document summarizes a lecture on classification models. It introduces Naive Bayesian classifiers, including their theoretical foundations, use cases, and how to evaluate their effectiveness. It explains that Naive Bayesian classifiers assign probabilities to class membership based on applying Bayes' theorem and making a conditional independence assumption between predictor variables. The document provides an example of how to build a Naive Bayesian classifier to predict credit risk from applicant attributes.


Lecture 6:

Advanced Analytical Theory and Methods:


Classification Models
Dr. Andreas S. Maniatis
Adjunct Instructor

Master of Management
BSMM-8710 – Introduction to Data Analytics [2023S]
Wednesday, June 14th 2023 | 12:30 – 14:20 | OB-507
Module 4 – Advanced Analytical Theory and Methods
Module 4: Advanced Analytical Theory and Methods

Upon completion of this module, you should be able to:


• Examine analytic needs and select an appropriate technique based on business objectives, initial hypotheses, and the data's structure and volume
• Apply some of the more commonly used methods in Analytics solutions
• Explain the algorithms and the technical foundations for the commonly used methods
• Explain the environment (use case) in which each technique can provide the most value
• Use appropriate diagnostic methods to validate the models created
• Use R and in-database analytical functions to fit, score and evaluate models
Where “R” we?
• In Module 3 we reviewed R skills and basic statistics
• You can use R to:
  ▪ Generate summary statistics to investigate a data set
  ▪ Visualize data
  ▪ Perform statistical tests to analyze data and evaluate models
• Now that you have data, and you can see it, you need to plan the analytic model and determine the analytic method to be used

Module 4: Analytics Theory/Methods 4


Applying the Data Analytics Lifecycle

[Figure: Data Analytics Lifecycle – Discovery → Data Prep → Model Planning → Model Building → Communicate Results → Operationalize]

• In a typical Data Analytics problem you would already have gone through:
  ▪ Phase 1 – Discovery: have the problem framed
  ▪ Phase 2 – Data Preparation: have the data prepared
• Now you need to plan the model and determine the method to be used.

Module 4: Analytics Theory/Methods 5


Phase 3 – Model Planning

[Figure: Data Analytics Lifecycle with the Model Planning phase highlighted]

• How do people generally solve this problem with the kind of data and resources I have?
  ▪ Does that work well enough? Or do I have to come up with something new?
• What are related or analogous problems? How are they solved? Can I do that?
• Do I have a good idea about the type of model to try? Can I refine the analytic plan?
• Is the model robust enough? Have we failed for sure?

Module 4: Analytics Theory/Methods 6


What Kind of Problem do I Need to Solve? How do I Solve it?

The Problem to Solve | The Category of Techniques | Covered in this Course
I want to group items by similarity; I want to find structure (commonalities) in the data | Clustering | K-means clustering
I want to discover relationships between actions or items | Association Rules | Apriori
I want to determine the relationship between the outcome and the input variables | Regression | Linear Regression, Logistic Regression
I want to assign (known) labels to objects | Classification | Naïve Bayes, Decision Trees
I want to find the structure in a temporal process; I want to forecast the behavior of a temporal process | Time Series Analysis | ACF, PACF, ARIMA
I want to analyze my text data | Text Analysis | Regular expressions, document representation (Bag of Words)
I want to analyze my social media network data | Network Analytics |

Module 4: Analytics Theory/Methods 7


Why These Example Techniques?
• Most popular, frequently used:
  ▪ Provide the foundation for Data Science skills on which to build
• Relatively easy for new Data Scientists to understand
• Applicable to a broad range of problems in several verticals

Module 4: Analytics Theory/Methods 8


Module 4: Advanced Analytical Theory and Methods

Lesson 5: Naïve Bayesian Classifiers

During this lesson the following topics are covered:


• Naïve Bayesian Classifier
• Theoretical foundations of the classifier
• Use cases
• Evaluating the effectiveness of the classifier
• The Reasons to Choose (+) and Cautions (-) with the use of
the classifier

Module 4: Analytics Theory/Methods 9


Classifiers

Where in the catalog should I place this product listing?
Is this email spam?
Is this politician Democrat/Republican/Green?

• Classification: assign labels to objects.
• Usually supervised: training set of pre-classified examples.
• Our examples:
  ▪ Naïve Bayes
  ▪ Decision Trees
  ▪ (and Logistic Regression)

10
Naïve Bayesian Classifier – What is it?
• Used for classification
  ▪ Actually returns a probability score on class membership
    - In practice, probabilities are generally close to either 0 or 1
    - Not as well calibrated as Logistic Regression
• Input variables are discrete
  ▪ Popular for text classification
• Output:
  ▪ Most implementations: the log probability for each class
    - You could convert it to a probability, but in practice we stay in log space

11
Naïve Bayesian Classifier - Use Cases
• Preferred method for many text classification problems
  ▪ Try this first; if it doesn't work, try something more complicated
• Use cases
  ▪ Spam filtering, other text classification tasks
  ▪ Fraud detection

12
Building a Training Dataset

Example: Predicting Good or Bad Credit

Predict the credit behavior of a credit card applicant from the applicant's attributes:
• personal status
• job type
• housing type
• savings account

These are all categorical variables, better suited to a Naïve Bayesian classifier than to logistic regression.

Module 4: Analytics Theory/Methods 13


Technical Description – Bayes' Law

• B is the class label:
  ▪ B ∈ {b1, b2, …, bn}
• A is the specific assignment of input variables:
  ▪ A = (a1, a2, …, am)
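
The equation referenced here is Bayes' theorem, which in this notation reads:

$$P(b_j \mid A) \;=\; \frac{P(A \mid b_j)\,P(b_j)}{P(A)}$$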

14
The "Naïve" Assumption: Conditional Independence

so:

Independent of class – so it cancels out
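
Under this assumption the likelihood factors, and the class score becomes:

$$P(A \mid b_j) \;=\; \prod_{i=1}^{m} P(a_i \mid b_j)
\qquad\Longrightarrow\qquad
P(b_j \mid A) \;\propto\; P(b_j)\prod_{i=1}^{m} P(a_i \mid b_j)$$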

Module 4: Analytics Theory/Methods 15


Building a Naïve Bayesian Classifier

• To build a Naïve Bayesian classifier, collect the following statistics from the training data:
  ▪ P(bj) for all the class labels
  ▪ P(ai | bj) for all possible assignments of the input variables and class labels

Credit example:
• class labels: {good, bad}
  ▪ P(good) = 0.7
  ▪ P(bad) = 0.3
• aggregates for housing
  ▪ P(own | bad) = 0.62
  ▪ P(own | good) = 0.75
  ▪ P(rent | bad) = 0.23
  ▪ P(rent | good) = 0.14
  ▪ … and so on

Module 4: Analytics Theory/Methods 16


Building a Naïve Bayesian Classifier (Continued)

• Assign the label that maximizes the value
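
Written out from the preceding slides, the decision rule is:

$$\hat{b} \;=\; \arg\max_{b_j}\; P(b_j)\prod_{i=1}^{m} P(a_i \mid b_j)$$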

17
Back to the Credit Example

Credit Example: X
• female
• owns home
• self-employed
• savings > $1000

ai         | bj   | P(ai | bj)
female     | good | 0.28
female     | bad  | 0.36
own        | good | 0.75
own        | bad  | 0.62
self emp   | good | 0.14
self emp   | bad  | 0.17
savings>1K | good | 0.06
savings>1K | bad  | 0.02

P(good|X) ∝ (0.28 * 0.75 * 0.14 * 0.06) * 0.7 = 0.0012
P(bad|X)  ∝ (0.36 * 0.62 * 0.17 * 0.02) * 0.3 = 0.0002

P(good|X) > P(bad|X): assign X the label "good"

Module 4: Analytics Theory/Methods 18


Implementation Guideline

• High-dimensional problems are prone to numerical underflow and unobserved events; it's better to calculate the log probability (with smoothing).

(The smoothing technique varies with the implementation; a sketch follows below.)
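
A minimal sketch of this guideline in R, assuming a data frame `train` of categorical predictors plus a class column; the function and column names are illustrative, not the lab's code:

# Laplace-smoothed, log-space scoring for a Naive Bayesian classifier.
naive_bayes_score <- function(train, target, x, laplace = 1) {
  classes  <- unique(train[[target]])
  features <- setdiff(names(train), target)

  sapply(classes, function(cl) {
    subset_cl <- train[train[[target]] == cl, ]
    # log prior
    score <- log(nrow(subset_cl) / nrow(train))
    # add log-likelihood of each observed feature value, with Laplace smoothing
    for (f in features) {
      n_levels <- length(unique(train[[f]]))
      count    <- sum(subset_cl[[f]] == x[[f]])
      score    <- score + log((count + laplace) / (nrow(subset_cl) + laplace * n_levels))
    }
    score
  })
}

# Usage: pick the class with the highest log score, e.g.
# scores <- naive_bayes_score(train, "credit",
#                             list(housing = "own", personal = "female",
#                                  job = "self emp", savings = ">=1000"))
# names(which.max(scores))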

19
Diagnostics
• Hold-out data
  ▪ How well does the model classify new instances?
• Cross-validation
• ROC curve/AUC

20
Diagnostics: Confusion Matrix

                  Prediction
True Class      bad    good   total
bad             262      38     300
good             29     671     700
total           291     709    1000

(38 = false positives, 29 = false negatives, taking "good" as the positive class)

accuracy: sum of diagonals / sum of table = (262 + 671)/1000 = 0.93
FPR: false positives / sum of first row = 38/300 = 0.13
FNR: false negatives / sum of second row = 29/700 = 0.04
Precision: true positives / sum of second column = 671/709 = 0.95
Recall: true positives / sum of second row = 671/700 = 0.96

Module 4: Analytics Theory/Methods 21


Naïve Bayesian Classifier – Reasons to Choose (+) and Cautions (-)

Reasons to Choose (+)                               | Cautions (-)
Handles missing values quite well                   | Numeric variables have to be discretized (categorized into intervals)
Robust to irrelevant variables                      | Sensitive to correlated variables ("double-counting")
Easy to implement                                   | Not good for estimating probabilities (stick to the class label or yes/no)
Easy to score data                                  |
Resistant to over-fitting                           |
Computationally efficient                           |
Handles very high dimensional problems              |
Handles categorical variables with a lot of levels  |

Module 4: Analytics Theory/Methods 22


Check Your Knowledge

Your Thoughts?

1. Consider the following Training Data Set. Apply the Naïve Bayesian Classifier to this data set and compute P(y = 1 | X) for X = (1, 0, 0). Show your work.

Training Data Set:
X1 | X2 | X3 | Y
 1 |  1 |  1 | 0
 1 |  1 |  0 | 0
 0 |  0 |  0 | 0
 0 |  1 |  0 | 1
 1 |  0 |  1 | 1
 0 |  1 |  1 | 1

2. List some prominent Use Cases of the Naïve Bayesian Classifier.
3. What gives the Naïve Bayesian Classifier the advantage of being computationally inexpensive?
4. Why should we use log-likelihoods rather than pure probability values in the Naïve Bayesian Classifier?

23
Check Your Knowledge (Continued)

Your Thoughts?

5. What is a confusion matrix and how is it used to evaluate the effectiveness of the model?
6. Consider the following data set with two input features, temperature and season:
   • What is the Naïve Bayesian assumption?
   • Is the Naïve Bayesian assumption satisfied for this problem?

Temperature   | Season | Electricity Usage (Class)
Below Average | Winter | High
Above Average | Winter | Low
Below Average | Summer | Low
Above Average | Summer | High

24
Module 4: Advanced Analytics – Theory and Methods
Lesson 5: Naïve Bayesian Classifiers - Summary

During this lesson the following topics were covered:


• Naïve Bayesian Classifier
• Theoretical foundations of the classifier
• Use cases
• Evaluating the effectiveness of the classifier
• The Reasons to Choose (+) and Cautions (-) with the use of the classifier

Module 4: Analytics Theory/Methods 25


Lab Exercise 8: Naïve Bayesian Classifier
This lab is designed to investigate and practice the Naïve Bayesian Classifier analytic technique.

After completing the tasks in this lab you should be able to:
• Use R functions for Naïve Bayesian Classification
• Apply the requirements for generating appropriate training data
• Validate the effectiveness of the Naïve Bayesian Classifier with big data

Module 4: Analytics Theory/Methods 26


Lab Exercise 8: Naïve Bayesian Classifier Part 1 – Workflow

1. Set working directory and review training and test data
2. Install and load the "e1071" library
3. Read in and review data
4. Build the Naïve Bayesian classifier model from first principles
5. Predict the results
6. Execute the Naïve Bayesian Classifier with the e1071 package (a sketch follows below)
7. Predict the outcome of "Enrolls" with the test data
8. Review results
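
A minimal sketch of steps 6–8, assuming training and test CSV files and an "Enrolls" outcome column; the file names and data handling are illustrative assumptions, not the lab's exact code:

library(e1071)

# Read in the (hypothetical) training and test files; the names are placeholders
traindata <- read.csv("nbtrain.csv", stringsAsFactors = TRUE)
testdata  <- read.csv("nbtest.csv",  stringsAsFactors = TRUE)

# Fit the Naive Bayes model; laplace > 0 applies add-one smoothing to the counts
model <- naiveBayes(Enrolls ~ ., data = traindata, laplace = 1)

# Predict class labels for the test data (type = "raw" would return per-class probabilities)
results <- predict(model, newdata = testdata, type = "class")

# Review the results against the known labels
table(predicted = results, actual = testdata$Enrolls)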

Module 4: Analytics Theory/Methods 27


Lab Exercise 8: Naïve Bayesian Classifier Part 2 – Workflow

1. Define the problem (translating to an analytics question)
2. Establish the ODBC connection
3. Open connections to the ODBC database
4. Build the training dataset and the test dataset from the database
5. Extract the first 10,000 records for the training data set and the remaining 10 for the test
6. Execute the NB Classifier
7. Validate the effectiveness of the NB Classifier with a confusion matrix
8. Execute the NB Classifier with MADlib function calls within the database

Module 4: Analytics Theory/Methods 28


Module 4: Advanced Analytics – Theory and Methods
Lesson 6: Decision Trees

During this lesson the following topics are covered:


• Overview of Decision Tree classifier
• General algorithm for Decision Trees
• Decision Tree use cases
• Entropy, Information gain
• Reasons to Choose (+) and Cautions (-) of Decision Tree classifier
• Classifier methods and conditions in which they are best suited

Module 4: Analytics Theory/Methods 29


Decision Tree Classifier – What is it?
• Used for classification:
  ▪ Returns probability scores of class membership
    - Well-calibrated, like logistic regression
  ▪ Assigns label based on highest scoring class
    - Some Decision Tree algorithms return simply the most likely class
  ▪ Regression Trees: a variation for regression
    - Returns the average value at every node
    - Predictions can be discontinuous at the decision boundaries
• Input variables can be continuous or discrete
• Output:
  ▪ A tree that describes the decision flow
  ▪ Leaf nodes return either a probability score, or simply a classification
  ▪ Trees can be converted to a set of "decision rules"
    - "IF income < $50,000 AND mortgage_amt > $100K THEN default=T with 75% probability"
30
Decision Tree – Example of Visual Structure

[Figure: example decision tree]
  Gender (internal node – decision on variable)
    ├─ Female → Income: <= 45,000 → Yes | > 45,000 → No
    └─ Male   → Age:    <= 40 → Yes | > 40 → No

  Branch – outcome of test
  Internal Node – decision on variable
  Leaf Node – class label

Module 4: Analytics Theory/Methods 31


Decision Tree Classifier – Use Cases
• When a series of questions (yes/no) are answered to arrive at a classification
  ▪ Biological species classification
  ▪ Checklist of symptoms during a doctor's evaluation of a patient
• When "if-then" conditions are preferred to linear models
  ▪ Customer segmentation to predict response rates
  ▪ Financial decisions such as loan approval
  ▪ Fraud detection
• Short Decision Trees are the most popular "weak learner" in ensemble learning techniques



32
Example: The Credit Prediction Problem

[Figure: decision tree fitted to the credit data]
Root: good, 700/1000, p(good) = 0.7
├─ savings = (500:1000), >=1000, no known savings → leaf: good, 245/294, p(good) = 0.83
└─ savings = <100, (100:500)
   ├─ housing = own → leaf: good, 349/501, p(good) = 0.7
   └─ housing = free, rent
      ├─ personal = male mar/wid, male single → leaf: good, 70/119, p(good) = 0.6
      └─ personal = female, male div/sep → leaf: bad, 36/88, p(good) = 0.42

Module 4: Analytics Theory/Methods 33


General Algorithm
• To construct tree T from training set S:
  ▪ If all examples in S belong to some class in C, or S is sufficiently "pure", then make a leaf labeled C.
  ▪ Otherwise:
    - select the "most informative" attribute A
    - partition S according to A's values
    - recursively construct sub-trees T1, T2, ..., for the subsets of S
• The details vary according to the specific algorithm – CART, ID3, C4.5 – but the general idea is the same (a sketch follows below)
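
A rough sketch of this recursion in R, using the entropy/InfoGain criteria introduced on the following slides; the data frame layout, function names, and the "pure enough" threshold are illustrative assumptions, not any specific algorithm's implementation:

# Entropy of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of splitting 'data' on one categorical attribute
info_gain <- function(data, attribute, target) {
  h_base <- entropy(data[[target]])
  h_cond <- sum(sapply(split(data, data[[attribute]]), function(s) {
    (nrow(s) / nrow(data)) * entropy(s[[target]])
  }))
  h_base - h_cond
}

# Recursively grow a tree from a data frame of categorical predictors
grow_tree <- function(data, target, min_purity = 0.95) {
  counts   <- table(data[[target]])
  majority <- names(which.max(counts))
  purity   <- max(counts) / nrow(data)
  attributes <- setdiff(names(data), target)

  # Leaf if the node is pure enough or there is nothing left to split on
  if (purity >= min_purity || length(attributes) == 0) {
    return(list(leaf = TRUE, label = majority, p = purity))
  }

  # Otherwise pick the most informative attribute and recurse on its values
  gains <- sapply(attributes, function(a) info_gain(data, a, target))
  best  <- attributes[which.max(gains)]
  children <- lapply(split(data, data[[best]]), function(s) {
    grow_tree(s[, names(s) != best, drop = FALSE], target, min_purity)
  })
  list(leaf = FALSE, split_on = best, children = children)
}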



34
Step 1: Pick the Most "Informative" Attribute

• Entropy-based methods are one common way
• H = 0 if p(c) = 0 or 1 for any class
  ▪ So for binary classification, H = 0 is a "pure" node
• H is maximum when all classes are equally probable
  ▪ For binary classification, H = 1 when classes are 50/50
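
The standard entropy definition used here is:

$$H \;=\; -\sum_{c \in C} p(c)\,\log_2 p(c)$$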



35
Step 1: Pick the most "informative" attribute (Continued)

• First, we need to get the base entropy of the data
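
For the credit data introduced earlier (70% good, 30% bad), this works out to approximately:

$$H_{\text{base}} \;=\; -0.7\,\log_2 0.7 \;-\; 0.3\,\log_2 0.3 \;\approx\; 0.88$$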



36
Step 1: Pick the Most "Informative" Attribute (Continued) – Conditional Entropy

• The weighted sum of the class entropies for each value of the attribute
• In English: attribute values (home owner vs. renter) give more information about class membership
  ▪ "Home owners are more likely to have good credit than renters"
• Conditional entropy should be lower than unconditioned entropy
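
For an attribute A with values a, the conditional entropy is:

$$H(\text{class} \mid A) \;=\; \sum_{a} P(A = a)\; H(\text{class} \mid A = a)$$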



37
Conditional Entropy Example

housing =           free    own    rent
P(housing)          0.108   0.713  0.179
P(bad | housing)    0.407   0.261  0.391
P(good | housing)   0.592   0.739  0.601
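
As a rough check, plugging these values into the conditional entropy formula, with H(p) denoting the binary entropy −p log₂ p − (1−p) log₂(1−p):

$$H(\text{class} \mid \text{housing}) \;\approx\; 0.108\,H(0.407) + 0.713\,H(0.261) + 0.179\,H(0.391) \;\approx\; 0.87$$

Against the base entropy of about 0.88, that gives an information gain of roughly 0.01, in line with the 0.013 reported for housing in the InfoGain table that follows.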



38
Step 1: Pick the Most "Informative" Attribute (Continued) – Information Gain

• The information that you gain by knowing the value of an attribute
• So the "most informative" attribute is the attribute with the highest InfoGain
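
In formula form:

$$\text{InfoGain}(A) \;=\; H_{\text{base}} \;-\; H(\text{class} \mid A)$$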



39
Back to the Credit Prediction Example

Attribute          InfoGain
job                0.001
housing            0.013
personal_status    0.006
savings_status     0.028

Module 4: Analytics Theory/Methods 40


Step 2 & 3: Partition on the Selected Variable

• Step 2: Find the partition


with the highest InfoGain
 In our example the selected good
700/1000
partition has InfoGain = p(good)=0.7
savings=(500:100),>
0.028 =1000,no known
savings= <100, (100:500) savings

• Step 3: At each resulting good


node, repeat Steps 1 and 2 245/294
p(good)=0.83
 until node is "pure enough"
• Pure nodes => no
information gain by
splitting on other
attributes

Module 4: Analytics Theory/Methods 41


Diagnostics
• Hold-out data
• ROC/AUC
• Confusion Matrix
• FPR/FNR, Precision/Recall
• Do the splits (or the "rules") make sense?
  ▪ What does the domain expert say?
• How deep is the tree?
  ▪ Too many layers are prone to over-fit
• Do you get nodes with very few members?
  ▪ Over-fit



42
Decision Tree Classifier – Reasons to Choose (+) and Cautions (-)

Reasons to Choose (+) | Cautions (-)
Takes any input type (numeric, categorical); in principle, can handle categorical variables with many distinct values (ZIP code) | Decision surfaces can only be axis-aligned
Robust with redundant variables, correlated variables | Tree structure is sensitive to small changes in the training data
Naturally handles variable interaction | A "deep" tree is probably over-fit, because each split reduces the training data for subsequent splits
Handles variables that have a non-linear effect on outcome | Not good for outcomes that are dependent on many variables (related to the over-fit problem, above)
Computationally efficient to build | Doesn't naturally handle missing values; however most implementations include a method for dealing with this
Easy to score data | In practice, decision rules can be fairly complex
Many algorithms can return a measure of variable importance |
In principle, decision rules are easy to understand |

Module 4: Analytics Theory/Methods 43




Which Classifier Should I Try?

Typical Questions | Recommended Method
Do I want class probabilities, rather than just class labels? | Logistic Regression, Decision Tree
Do I want insight into how the variables affect the model? | Logistic Regression, Decision Tree
Is the problem high-dimensional? | Naïve Bayes
Do I suspect some of the inputs are correlated? | Decision Tree, Logistic Regression
Do I suspect some of the inputs are irrelevant? | Decision Tree, Naïve Bayes
Are there categorical variables with a large number of levels? | Naïve Bayes, Decision Tree
Are there mixed variable types? | Decision Tree, Logistic Regression
Is there non-linear data or discontinuities in the inputs that will affect the outputs? | Decision Tree

Module 4: Analytics Theory/Methods 45


Check Your Knowledge

Your Thoughts?

1. How do you define information gain?
2. For what conditions is the value of entropy at a maximum, and when is it at a minimum?
3. List three use cases of Decision Trees.
4. What are weak learners and how are they used in ensemble methods?
5. Why do we end up with an over-fitted model with deep trees, and in data sets where outcomes depend on many variables?
6. What classification method would you recommend for the following cases:
  ▪ High-dimensional data
  ▪ Data in which outputs are affected by non-linearity and discontinuity in the inputs

46
Module 4: Advanced Analytics – Theory and Methods
Lesson 6: Decision Trees - Summary

During this lesson the following topics were covered:


• Overview of Decision Tree classifier
• General algorithm for Decision Trees
• Decision Tree use cases
• Entropy, Information gain
• Reasons to Choose (+) and Cautions (-) of Decision Tree classifier
• Classifier methods and conditions in which they are best suited

Module 4: Analytics Theory/Methods 47


Lab Exercise 9: Decision Trees
This lab is designed to investigate and practice the Decision Tree (DT) models covered in the course work.

After completing the tasks in this lab you should be able to:
• Use R functions for Decision Tree models
• Predict the outcome of an attribute based on the model

Module 4: Analytics Theory/Methods 48


Lab Exercise 9: Decision Trees – Workflow

1. Set the working directory
2. Read in the data
3. Build the decision tree (a sketch follows below)
4. Plot the decision tree
5. Prepare data to test the fitted model
6. Predict a decision from the fitted model
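
A minimal sketch of this workflow, assuming a CSV of categorical predictors with a "Play_Decision" outcome column; the file name, column names, values, and control settings are illustrative assumptions, not necessarily the lab's exact code:

library(rpart)
library(rpart.plot)

# Steps 1-2: set the working directory and read in the (hypothetical) data file
setwd("~/lab9")
dtdata <- read.csv("DTdata.csv", stringsAsFactors = TRUE)

# Step 3: build the decision tree (method = "class" for a classification tree)
fit <- rpart(Play_Decision ~ ., data = dtdata, method = "class",
             control = rpart.control(minsplit = 2, cp = 0.001))

# Step 4: plot the decision tree
rpart.plot(fit, type = 4, extra = 1)

# Steps 5-6: prepare a test observation and predict a decision from the fitted model
newdata <- data.frame(Outlook = "rainy", Temperature = "mild",
                      Humidity = "high", Wind = "weak")
predict(fit, newdata = newdata, type = "class")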

Module 4: Analytics Theory/Methods 49


Thank you!
Questions?

Dr. Andreas S. Maniatis


Adjunct Instructor
[email protected]
http://www.linkedin.com/in/andreasmaniatis
