

Machine Learning and Data Mining

Bayes Classifiers

Prof. Alexander Ihler


A basic classifier
•  Training data D={x(i),y(i)}, Classifier f(x ; D)
–  Discrete feature vector x
–  f(x ; D) is a contingency table
•  Ex: credit rating prediction (bad/good)
–  X1 = income (low/med/high)
–  How can we make the largest number of correct predictions?

Features   # bad   # good
X=0        42      15
X=1        338     287
X=2        3       5



A basic classifier
•  Training data D={x(i),y(i)}, Classifier f(x ; D)
–  Discrete feature vector x
–  f(x ; D) is a contingency table
•  Ex: credit rating prediction (bad/good)
–  X1 = income (low/med/high)
–  How can we make the largest number of correct predictions?
–  Predict the more likely outcome for each possible observation

Features   # bad   # good
X=0        42      15
X=1        338     287
X=2        3       5



A basic classifier
•  Training data D={x(i),y(i)}, Classifier f(x ; D)
–  Discrete feature vector x
–  f(x ; D) is a contingency table
•  Ex: credit rating prediction (bad/good)
–  X1 = income (low/med/high)
–  How can we make the largest number of correct predictions?
–  Predict the more likely outcome for each possible observation
–  Can normalize into a probability: p( y=good | X=c )  (see the sketch below)
–  How to generalize?

Features   p( y=bad | X )   p( y=good | X )
X=0        .7368            .2632
X=1        .5408            .4592
X=2        .3750            .6250

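To make the table concrete, here is a small sketch in Python/numpy (not part of the original slides; the counts are the ones shown above) that normalizes the counts and predicts the more likely class for each income level:

import numpy as np

# Counts from the credit-rating example above: rows are X=0,1,2 (low/med/high income),
# columns are the classes (bad, good).
counts = np.array([[ 42,  15],
                   [338, 287],
                   [  3,   5]], dtype=float)

# Normalize each row to get p(y | X=c); predict the more likely class for each row.
p_y_given_x = counts / counts.sum(axis=1, keepdims=True)
prediction = p_y_given_x.argmax(axis=1)        # 0 = bad, 1 = good

print(np.round(p_y_given_x, 4))   # [[0.7368 0.2632] [0.5408 0.4592] [0.375 0.625]]
print(prediction)                 # [0 0 1] -> predict "bad" for X=0,1 and "good" for X=2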

Bayes rule
•  Two events: headache, flu
•  p(H) = 1/10
•  p(F) = 1/40
•  p(H|F) = 1/2

•  You wake up with a headache – what is the chance that you have the flu?

(Example from Andrew Moore’s slides)
Bayes rule
•  Two events: headache, flu
•  p(H) = 1/10
•  p(F) = 1/40
•  p(H|F) = 1/2
•  P(H & F) = ?
•  P(F|H) = ?

(Example from Andrew Moore’s slides)
Bayes rule
•  Two events: headache, flu
•  p(H) = 1/10
•  p(F) = 1/40
•  p(H|F) = 1/2
•  P(H & F) = p(F) p(H|F) = (1/2) * (1/40) = 1/80
•  P(F|H) = ?

(Example from Andrew Moore’s slides)
Bayes rule
•  Two events: headache, flu
•  p(H) = 1/10
•  p(F) = 1/40
•  p(H|F) = 1/2
•  P(H & F) = p(F) p(H|F) = (1/2) * (1/40) = 1/80
•  P(F|H) = p(H & F) / p(H) = (1/80) / (1/10) = 1/8

(Example from Andrew Moore’s slides)
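For reference, the calculation above is just Bayes’ rule; written out (the general expression appears on the slide only as an image):

\[
p(F \mid H) \;=\; \frac{p(H \mid F)\, p(F)}{p(H)} \;=\; \frac{(1/2)\,(1/40)}{1/10} \;=\; \frac{1}{8}.
\]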
Classification and probability
•  Suppose we want to model the data
•  Prior probability of each class, p(y)
–  E.g., fraction of applicants that have good credit
•  Distribution of features given the class, p(x | y=c)
–  How likely are we to see “x” in users with good credit?
•  Joint distribution
•  Bayes rule (written out below)
(Use the rule of total probability to calculate the denominator!)
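The joint distribution and Bayes rule referenced above appear on the slide only as images; in standard form:

\[
p(x, y) = p(y)\, p(x \mid y), \qquad
p(y = c \mid x) = \frac{p(x \mid y = c)\, p(y = c)}{p(x)}, \qquad
p(x) = \sum_{c'} p(x \mid y = c')\, p(y = c').
\]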
Bayes classifiers
•  Learn “class conditional” models
–  Estimate a probability model for each class
•  Training data
–  Split by class: Dc = { x(j) : y(j) = c }
•  Estimate p(x | y=c) using Dc
•  For a discrete x, this recalculates the same table… (see the sketch below)

Features   # bad   # good   p(x | y=0)   p(x | y=1)   p(y=0|x)   p(y=1|x)
X=0        42      15       42 / 383     15 / 307     .7368      .2632
X=1        338     287      338 / 383    287 / 307    .5408      .4592
X=2        3       5        3 / 383      5 / 307      .3750      .6250
p(y)                        383 / 690    307 / 690
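A sketch in Python/numpy (not from the slides; it uses the counts above) of the class-conditional route: estimate p(y) and p(x | y=c), then recover the same posteriors with Bayes rule:

import numpy as np

# Rows are X=0,1,2; columns are classes (y=0 "bad", y=1 "good").
counts = np.array([[ 42,  15],
                   [338, 287],
                   [  3,   5]], dtype=float)

p_y = counts.sum(axis=0) / counts.sum()        # class priors: [383/690, 307/690]
p_x_given_y = counts / counts.sum(axis=0)      # class conditionals, one column per class

# Bayes rule: p(y|x) is proportional to p(x|y) p(y), normalized over the classes
joint = p_x_given_y * p_y
p_y_given_x = joint / joint.sum(axis=1, keepdims=True)

print(np.round(p_y_given_x, 4))   # matches the table: [[.7368 .2632] [.5408 .4592] [.375 .625]]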


Bayes classifiers
•  Learn “class conditional” models
–  Estimate a probability model for each class
•  Training data
–  Split by class: Dc = { x(j) : y(j) = c }
•  Estimate p(x | y=c) using Dc
•  For continuous x, can use any density estimate we like
–  Histogram
–  Gaussian
–  …

[Figure: histogram density estimate of a continuous feature]
Gaussian models
•  Estimate parameters of the Gaussians from the data

[Figure: Gaussian class-conditional densities fit along feature x1]
Multivariate Gaussian models
•  Similar to the univariate case
–  µ = length-d column vector
–  Σ = d x d covariance matrix
–  |Σ| = matrix determinant
•  Maximum likelihood estimate: see the expressions below

[Figure: contours of a 2-D Gaussian fit to scattered data]
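The density and maximum-likelihood estimates referred to above are shown on the slide only as images; the standard expressions are:

\[
p(x \mid y = c) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}
\exp\!\Big(-\tfrac{1}{2}\,(x-\mu)^{T}\Sigma^{-1}(x-\mu)\Big),
\qquad
\hat{\mu} = \frac{1}{m}\sum_{j=1}^{m} x^{(j)},
\qquad
\hat{\Sigma} = \frac{1}{m}\sum_{j=1}^{m} \big(x^{(j)}-\hat{\mu}\big)\big(x^{(j)}-\hat{\mu}\big)^{T}.
\]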
Example: Gaussian Bayes for Iris Data
•  Fit a Gaussian distribution to each class {0,1,2}

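A minimal sketch of the per-class fit in Python/numpy (not the course code; it assumes scikit-learn is available only as a convenient way to load the Iris data):

import numpy as np
from sklearn.datasets import load_iris     # assumption: used here just as a data source

X, y = load_iris(return_X_y=True)          # 150 x 4 features, labels in {0,1,2}

params = {}
for c in np.unique(y):
    Xc = X[y == c]                                   # split the training data by class
    mu = Xc.mean(axis=0)                             # ML estimate of the mean
    Sigma = np.cov(Xc, rowvar=False, bias=True)      # ML estimate of the covariance (divide by m)
    params[c] = (mu, Sigma)
    print(c, np.round(mu, 2))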



Machine Learning and Data Mining

Bayes Classifiers: Naïve Bayes

Prof. Alexander Ihler


Bayes classifiers
•  Estimate p(y) = [ p(y=0) , p(y=1) … ]
•  Estimate p(x | y=c) for each class c
•  Calculate p(y=c | x) using Bayes rule
•  Choose the most likely class c
•  For a discrete x, can represent as a contingency table…
–  What about if we have more discrete features?

Features   # bad   # good   p(x | y=0)   p(x | y=1)   p(y=0|x)   p(y=1|x)
X=0        42      15       42 / 383     15 / 307     .7368      .2632
X=1        338     287      338 / 383    287 / 307    .5408      .4592
X=2        3       5        3 / 383      5 / 307      .3750      .6250
p(y)                        383 / 690    307 / 690


Joint distributions
•  Make a truth table of all combinations of values

A B C
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
Joint distributions
•  Make a truth table of all combinations of values
•  For each combination of values, determine how probable it is
•  Total probability must sum to one
•  How many values did we specify?

A B C   p(A,B,C | y=1)
0 0 0   0.50
0 0 1   0.05
0 1 0   0.01
0 1 1   0.10
1 0 0   0.04
1 0 1   0.15
1 1 0   0.05
1 1 1   0.10

Overfitting and density estimation
•  Estimate probabilities from the data
–  E.g., how many times (what fraction) did each outcome occur?
•  M data << 2^N parameters?
•  What about the zeros?
–  We learn that certain combinations are impossible?
–  What if we see these later in test data?
•  Overfitting!

A B C   p(A,B,C | y=1)
0 0 0   4/10
0 0 1   1/10
0 1 0   0/10
0 1 1   0/10
1 0 0   1/10
1 0 1   2/10
1 1 0   1/10
1 1 1   1/10
Overfitting and density estimation
•  Estimate probabilities from the data
–  E.g., how many times (what fraction) did each outcome occur?
•  M data << 2^N parameters?
•  What about the zeros?
–  We learn that certain combinations are impossible?
–  What if we see these later in test data?
•  One option: regularize (see the sketch below)
•  Normalize to make sure values sum to one…

A B C   p(A,B,C | y=1)
0 0 0   4/10
0 0 1   1/10
0 1 0   0/10
0 1 1   0/10
1 0 0   1/10
1 0 1   2/10
1 1 0   1/10
1 1 1   1/10
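The slide does not spell out the regularizer; a common choice is to add a small pseudo-count to every outcome before normalizing (add-alpha / Laplace smoothing). A sketch in Python/numpy, using the counts above:

import numpy as np

counts = np.array([4, 1, 0, 0, 1, 2, 1, 1], dtype=float)   # outcomes 000..111, 10 samples total

alpha = 1.0                                                 # pseudo-count; alpha=1 is "add-one" smoothing
p_smoothed = (counts + alpha) / (counts.sum() + alpha * counts.size)

print(np.round(p_smoothed, 3))    # no zero probabilities, and the values still sum to one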
Overfitting and density estimation
•  Another option: reduce the model complexity
–  E.g., assume that features are independent of one another
•  Independence: p(a,b) = p(a) p(b)
•  p(x1, x2, … xN | y=1) = p(x1 | y=1) p(x2 | y=1) … p(xN | y=1)
•  Only need to estimate each factor individually (see the sketch below)

A  p(A|y=1)      B  p(B|y=1)      C  p(C|y=1)
0  .4            0  .7            0  .1
1  .6            1  .3            1  .9

A B C   p(A,B,C | y=1)
0 0 0   .4 * .7 * .1
0 0 1   .4 * .7 * .9
0 1 0   .4 * .3 * .1
0 1 1   …
1 0 0   …
1 0 1   …
1 1 0   …
1 1 1   …
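A sketch in Python (not from the slides) that rebuilds the joint table from the three single-feature tables above, using the independence assumption:

import numpy as np
from itertools import product

# Per-feature conditionals p(A|y=1), p(B|y=1), p(C|y=1) from the tables above (index = value 0/1)
pA = np.array([0.4, 0.6])
pB = np.array([0.7, 0.3])
pC = np.array([0.1, 0.9])

# Naive Bayes: the joint over (A,B,C) given y=1 is just the product of the factors
for a, b, c in product([0, 1], repeat=3):
    print(a, b, c, round(pA[a] * pB[b] * pC[c], 3))
# e.g. (0,0,0) -> 0.4 * 0.7 * 0.1 = 0.028; the 8 values sum to 1 but need only 3 free parameters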
Example: Naïve Bayes
Observed Data:

x1 x2 y
1  1  0
1  0  0
1  0  1
0  0  0
0  1  1
1  1  0
0  0  1
1  0  1

Prediction given some observation x?
Compare p(y=0) p(x | y=0) vs. p(y=1) p(x | y=1)  (worked through in the sketch below):
Decide class 0
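A worked version of this example in Python/numpy (not from the slides). The query point is not visible in the extracted text; x = (1, 1) is an illustrative choice that reproduces the "decide class 0" conclusion:

import numpy as np

# Observed data from the slide: columns are x1, x2, y
D = np.array([[1,1,0],[1,0,0],[1,0,1],[0,0,0],
              [0,1,1],[1,1,0],[0,0,1],[1,0,1]])
X, y = D[:, :2], D[:, 2]

def nb_scores(x_new):
    """Naive Bayes scores p(y=c) * prod_i p(x_i = x_new_i | y=c) for c = 0, 1."""
    scores = []
    for c in (0, 1):
        Xc = X[y == c]
        prior = Xc.shape[0] / X.shape[0]
        cond = [np.mean(Xc[:, i] == x_new[i]) for i in range(X.shape[1])]
        scores.append(prior * np.prod(cond))
    return scores

print(nb_scores([1, 1]))   # [0.1875, 0.0625] -> class 0 wins, matching "decide class 0"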
Example: Naïve Bayes
Observed Data:

x1 x2 y
1  1  0
1  0  0
1  0  1
0  0  0
0  1  1
1  1  0
0  0  1
1  0  1


Example: Joint Bayes
Observed Data:

x1 x2 y
1  1  0
1  0  0
1  0  1
0  0  0
0  1  1
1  1  0
0  0  1
1  0  1

x1 x2   p(x | y=0)        x1 x2   p(x | y=1)
0  0    1/4               0  0    1/4
0  1    0/4               0  1    1/4
1  0    1/4               1  0    2/4
1  1    2/4               1  1    0/4



Naïve Bayes models
•  Variable y to predict, e.g. “auto accident in next year?”
•  We have *many* co-observed vars x=[x1…xn]
–  Age, income, education, zip code, …
•  Want to learn p(y | x1…xn), to predict y
–  Arbitrary distribution: O(d^n) values!
•  Naïve Bayes:
–  p(y|x) = p(x|y) p(y) / p(x) ;   p(x|y) = ∏i p(xi|y)
–  Covariates are independent given the “cause”
•  Note: may not be a good model of the data
–  Doesn’t capture correlations in x’s
–  Can’t capture some dependencies
•  But in practice it often does quite well!
Naïve Bayes models for spam
•  y ∈ {spam, not spam}
•  X = observed words in email
–  Ex: [“the” … “probabilistic” … “lottery” …]
–  “1” if word appears; “0” if not
•  1000’s of possible words: 2^1000’s of parameters?
•  # of atoms in the universe: ≈ 2^270…
•  Model words given email type as independent
•  Some words more likely for spam (“lottery”)
•  Some more likely for real (“probabilistic”)
•  Only 1000’s of parameters now… (see the sketch below)
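A minimal Bernoulli naive-Bayes sketch in Python/numpy for the spam setting (illustrative only; the function names, the add-alpha smoothing, and the data layout are assumptions, not the course's code):

import numpy as np

def train_bernoulli_nb(X, y, alpha=1.0):
    """X: (N, V) binary word-presence matrix; y: (N,) labels in {0, 1}, 1 = spam.
    Returns log-priors and log p(word present/absent | class), with add-alpha smoothing."""
    log_prior = np.log(np.bincount(y, minlength=2) / len(y))
    theta = np.vstack([(X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
                       for c in (0, 1)])              # (2, V) per-class word probabilities
    return log_prior, np.log(theta), np.log1p(-theta)

def predict_spam(x, log_prior, log_theta, log_1m_theta):
    # Independence assumption: sum per-word log-likelihoods, whether each word is present or absent
    ll = log_prior + log_theta @ x + log_1m_theta @ (1 - x)
    return int(np.argmax(ll))

With a vocabulary of size V this needs on the order of 2V + 1 parameters rather than 2^V, which is the point of the slide.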
Naïve Bayes Gaussian models
•  Model each feature with its own independent Gaussian, i.e. a diagonal covariance:

    Σ = [ σ²₁₁    0   ]
        [  0    σ²₂₂  ]

•  Again, reduces the number of parameters of the model:
–  Bayes: n²/2 covariance parameters;  Naïve Bayes: n

[Figure: axis-aligned Gaussian contours over features x1, x2, with σ²₁₁ > σ²₂₂]
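Written out (a reconstruction; the slide shows only the picture and the covariance matrix), the naïve Bayes Gaussian model factors each class-conditional across features:

\[
p(x \mid y = c) \;=\; \prod_{i=1}^{n} \mathcal{N}\!\big(x_i \,;\, \mu_{c,i},\, \sigma_{c,i}^{2}\big),
\]

so each class needs only n means and n variances instead of a full n x n covariance matrix.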
You should know…
•  Bayes rule; p( y | x )
•  Bayes classifiers
–  Learn p( x | y=C ) , p( y=C )
•  Naïve Bayes classifiers
–  Assume features are independent given class:
p( x | y=C ) = p( x1 | y=C ) p( x2 | y=C ) …

•  Maximum likelihood (empirical) estimators for


–  Discrete variables
–  Gaussian variables
–  Overfitting; simplifying assumptions or regularization

Machine Learning and Data Mining

Bayes Classifiers: Measuring Error

Prof. Alexander Ihler


A Bayes classifier
•  Given training data, compute p( y=c | x ) and choose the largest
•  What’s the (training) error rate of this method?

Features   # bad   # good
X=0        42      15
X=1        338     287
X=2        3       5



A Bayes classifier
•  Given training data, compute p( y=c | x ) and choose the largest
•  What’s the (training) error rate of this method?

Features   # bad   # good
X=0        42      15
X=1        338     287
X=2        3       5

Gets these examples wrong: Pr[ error ] = (15 + 287 + 3) / 690 ≈ 0.44
(empirically, on training data; better to use test data)



Bayes Error Rate
•  Suppose that we knew the true probabilities:
–  Observe any x: we know p( y | x ) at that x
–  Optimal decision at that particular x: choose the most probable class
–  Error rate at that x: the probability of the other class(es)
= “Bayes error rate”  (see the expressions below)
•  This is the best that any classifier can do!
•  Measures the fundamental hardness of separating y-values given only the features x

•  Note: conceptual only!
–  Probabilities p(x,y) must be estimated from data
–  Form of p(x,y) is not known and may be very complex
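The expressions referenced above are images on the slide; a standard way to write them is:

\[
\hat{y}(x) \;=\; \arg\max_{c}\, p(y = c \mid x),
\qquad
\text{Bayes error rate} \;=\; \mathbb{E}_{x}\Big[\, 1 - \max_{c}\, p(y = c \mid x) \,\Big].
\]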
A Bayes classifier
•  Bayes classification decision rule compares probabilities:
   choose ŷ = 1 where p(x, y=1) > p(x, y=0), and ŷ = 0 otherwise
•  Can visualize this nicely if x is a scalar:

[Figure: the two joint densities p(x, y=0) and p(x, y=1) plotted along feature x1;
 each curve has shape p(x | y=c) and area p(y=c); the decision boundary is where
 the curves cross]
A Bayes classifier
•  Add a multiplier alpha to the decision rule:
   choose ŷ = 1 where p(x, y=1) > α · p(x, y=0)  (larger α favors class 0)
•  Not all errors are created equally…
•  Risk associated with each outcome?

[Figure: p(x, y=0) and p(x, y=1) with the shifted decision boundary;
 Type 1 errors: false positives; Type 2 errors: false negatives]

False positive rate: (# y=0, ŷ=1) / (# y=0)
False negative rate: (# y=1, ŷ=0) / (# y=1)
A Bayes classifier
•  With the multiplier alpha: choose ŷ = 1 where p(x, y=1) > α · p(x, y=0)
•  Increase alpha: prefer class 0
•  Example: spam detection (avoid discarding real email)

[Figure: decision boundary shifted relative to the α = 1 case;
 Type 1 errors: false positives; Type 2 errors: false negatives]

False positive rate: (# y=0, ŷ=1) / (# y=0)
False negative rate: (# y=1, ŷ=0) / (# y=1)
A Bayes classifier
•  With the multiplier alpha: choose ŷ = 1 where p(x, y=1) > α · p(x, y=0)
•  Decrease alpha: prefer class 1
•  Example: cancer detection (a missed positive is far more costly than a false alarm)

[Figure: decision boundary shifted relative to the α = 1 case;
 Type 1 errors: false positives; Type 2 errors: false negatives]

False positive rate: (# y=0, ŷ=1) / (# y=0)
False negative rate: (# y=1, ŷ=0) / (# y=1)
(see the sketch below)
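A small sketch in Python/numpy of the thresholded decision and the two error rates defined above (illustrative; the function names and the convention that larger alpha favors class 0 are assumptions consistent with the slides):

import numpy as np

def decide(p_x_y0, p_x_y1, alpha=1.0):
    """Decide yhat = 1 where p(x, y=1) > alpha * p(x, y=0); larger alpha favors class 0."""
    return (np.asarray(p_x_y1) > alpha * np.asarray(p_x_y0)).astype(int)

def error_rates(y, yhat):
    y, yhat = np.asarray(y), np.asarray(yhat)
    fpr = np.mean(yhat[y == 0] == 1)    # false positive rate: (# y=0, yhat=1) / (# y=0)
    fnr = np.mean(yhat[y == 1] == 0)    # false negative rate: (# y=1, yhat=0) / (# y=1)
    return fpr, fnr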
Measuring errors
•  Confusion matrix (can extend to more classes):

         Predict 0   Predict 1
Y=0      380         5
Y=1      338         3

•  True positive rate: #(y=1, ŷ=1) / #(y=1) -- “sensitivity”
•  False negative rate: #(y=1, ŷ=0) / #(y=1)
•  False positive rate: #(y=0, ŷ=1) / #(y=0)
•  True negative rate: #(y=0, ŷ=0) / #(y=0) -- “specificity”
Likelihood ratio tests
•  Connection to classical, statistical decision theory: the “log likelihood ratio”
(written out below)
•  Likelihood ratio: relative support for observation “x” under the
“alternative hypothesis” y=1, compared to the “null hypothesis” y=0
•  Can vary the decision threshold gamma
•  Classical testing:
–  Choose gamma so that the FPR is fixed (“p-value”)
–  Given that y=0 is true, what’s the probability we decide y=1?
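The test statistic itself is an image on the slide; the standard form of the comparison is:

\[
\log \frac{p(x \mid y = 1)}{p(x \mid y = 0)}
\;\; \underset{\hat{y}=0}{\overset{\hat{y}=1}{\gtrless}} \;\; \gamma .
\]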


ROC Curves
•  Characterize performance as we vary the decision threshold
(e.g., the Bayes classifier’s multiplier alpha)

[Figure: ROC curve traced by the Bayes classifier as alpha varies.
 Vertical axis: true positive rate (sensitivity); horizontal axis: false positive rate
 (1 - specificity). “Guess all 1” sits at the top right, “guess all 0” at the bottom
 left, and “guess at random, proportion alpha” traces the diagonal.]
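A sketch in Python/numpy (not the course code) of tracing an ROC curve by sweeping the threshold over per-example scores such as p(y=1 | x):

import numpy as np

def roc_curve(y, score):
    """Sweep a threshold over the scores and record (false positive rate, true positive rate)."""
    y, score = np.asarray(y), np.asarray(score)
    fpr, tpr = [], []
    for t in np.unique(score)[::-1]:              # from the highest score downward
        yhat = (score >= t).astype(int)
        tpr.append(np.mean(yhat[y == 1] == 1))    # sensitivity
        fpr.append(np.mean(yhat[y == 0] == 1))    # 1 - specificity
    return np.array(fpr), np.array(tpr)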
Probabilistic vs. Discriminative learning
•  “Discriminative” learning: output a prediction ŷ(x)
•  “Probabilistic” learning: output a probability p(y|x)
(expresses confidence in outcomes)
•  “Probabilistic” learning
–  Conditional models just explain y: p(y|x)
–  Generative models also explain x: p(x,y)
•  Often a component of unsupervised or semi-supervised learning
–  Bayes and Naïve Bayes classifiers are generative models


Probabilistic vs. Discriminative learning
•  “Discriminative” learning: output a prediction ŷ(x)
•  “Probabilistic” learning: output a probability p(y|x)
(expresses confidence in outcomes)
•  Can use ROC curves for discriminative models also:
–  Some notion of confidence, but it doesn’t correspond to a probability
–  In our code: “predictSoft” (vs. hard prediction, “predict”)

>> learner = gaussianBayesClassify(X,Y);  % build a classifier
>> Ysoft = predictSoft(learner, X);       % N x C matrix of confidences
>> plotSoftClassify2D(learner,X,Y);       % shaded confidence plot
ROC Curves
•  Characterize performance as we vary our confidence threshold
•  Reduce performance to one number?
–  AUC = “area under the ROC curve”
–  0.5 < AUC < 1

[Figure: ROC curves for two classifiers, A and B; the better classifier’s curve lies
 closer to the top-left corner. “Guess all 1”, “guess all 0”, and “guess at random,
 proportion alpha” mark the corners and the diagonal. Axes: true positive rate
 (sensitivity) vs. false positive rate (1 - specificity).]
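Given the (FPR, TPR) points from a curve such as the sketch earlier, the AUC is just the area under them, e.g. by the trapezoid rule (the numbers here are illustrative):

import numpy as np

def auc(fpr, tpr):
    """Area under an ROC curve, trapezoid rule, after sorting the points by FPR."""
    order = np.argsort(fpr)
    return np.trapz(np.asarray(tpr)[order], np.asarray(fpr)[order])

print(auc([0.0, 0.2, 0.6, 1.0], [0.0, 0.7, 0.9, 1.0]))   # -> 0.77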

Machine Learning and Data Mining

Gaussian Bayes Classifiers

Prof. Alexander Ihler


Gaussian models
•  “Bayes optimal” decision
–  Choose most likely class
•  Decision boundary
–  Places where probabilities equal

•  What shape is the boundary?


Gaussian models
•  Bayes optimal decision boundary
–  p(y=0 | x) = p(y=1 | x)
–  Transition point between p(y=0|x) >/< p(y=1|x)
•  Assume Gaussian models with equal covariances
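A sketch of why equal covariances give a linear boundary (the slides show this only graphically): comparing the two class scores on a log scale,

\[
\log \frac{p(x \mid y=1)\, p(y=1)}{p(x \mid y=0)\, p(y=0)}
= (\mu_1 - \mu_0)^{T} \Sigma^{-1} x
\;-\; \tfrac{1}{2}\big(\mu_1^{T}\Sigma^{-1}\mu_1 - \mu_0^{T}\Sigma^{-1}\mu_0\big)
\;+\; \log \frac{p(y=1)}{p(y=0)}
\;=\; a^{T}x + b,
\]

which is linear in x, so the set where it equals zero (the decision boundary) is a hyperplane.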
Gaussian example
•  Spherical covariance: Σ = σ² I
•  Decision rule

[Figure: two spherical Gaussian classes in 2-D separated by a linear decision boundary]
Non-spherical Gaussian distributions
•  Equal covariances => still a linear decision rule
–  May be “modulated” by the variance direction
–  Scales; rotates (if correlated)

Ex:  Variance = [ 3    0   ]
                [ 0    .25 ]

[Figure: elongated Gaussian contours for each class and the resulting linear decision
 boundary]
Class posterior probabilities
•  Useful to also know class probabilities
•  Some notation
–  p(y=0) , p(y=1) – class prior probabilities
•  How likely is each class in general?
–  p(x | y=c) – class conditional probabilities
•  How likely are observations “x” in that class?
–  p(y=c | x) – class posterior probability
•  How likely is class c given an observation x?
Class posterior probabilities
•  Useful to also know class probabilities
•  Some notation
–  p(y=0) , p(y=1) – class prior probabilities
•  How likely is each class in general?
–  p(x | y=c) – class conditional probabilities
•  How likely are observations “x” in that class?
–  p(y=c | x) – class posterior probability
•  How likely is class c given an observation x?

•  We can compute posterior using Bayes’ rule


–  p(y=c | x) = p(x|y=c) p(y=c) / p(x)
•  Compute p(x) using sum rule / law of total prob.
–  p(x) = p(x|y=0) p(y=0) + p(x|y=1)p(y=1)
Class posterior probabilities
•  Consider comparing two classes
–  p(x | y=0) * p(y=0) vs p(x | y=1) * p(y=1)
–  Write probability of each class as
–  p(y=0 | x) = p(y=0, x) / p(x)
–  = p(y=0, x) / ( p(y=0,x) + p(y=1,x) )
–  = 1 / (1 + exp( -a ) ) (**)

–  a = log [ p(x|y=0) p(y=0) / p(x|y=1) p(y=1) ]


–  (**) called the logistic function, or logistic sigmoid.
Gaussian models
•  Return to Gaussian models with equal covariances

Now we also know that the probability of each class is given by:
p(y=0 | x) = Logistic( aᵀx + b ), where Logistic is the sigmoid (**) from the previous slide

We’ll see this form again soon…  (A small numerical sketch follows below.)
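A small numerical sketch in Python/numpy of the logistic form of the posterior, assuming two 1-D Gaussian classes with a shared variance (the parameter values are illustrative, not from the slides):

import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

mu0, mu1, var, p0, p1 = 0.0, 2.0, 1.0, 0.5, 0.5    # illustrative class means, shared variance, priors

def posterior_y0(x):
    """p(y=0 | x) = Logistic(a), with a = log[ p(x|y=0) p(y=0) / p(x|y=1) p(y=1) ];
    with equal variances, a is linear in x."""
    a = (mu0 - mu1) / var * x + (mu1**2 - mu0**2) / (2 * var) + np.log(p0 / p1)
    return logistic(a)

print(posterior_y0(np.array([-1.0, 1.0, 3.0])))   # ~0.98, 0.5, ~0.02: confident, uncertain, confident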
