
Week 3 - 5-Bayesian Methods

The document discusses using Bayes' theorem to build a spam classifier by calculating the probability of emails containing certain words being spam. It explains constructing probabilities for each word found in training data and multiplying them to analyze new emails. Scikit-learn tools like CountVectorizer and MultinomialNB can help implement this naive Bayes approach for spam filtering.


Bayes’ Theorem

 Let’s use it for machine learning! I want a spam classifier.


 Example: how would we express the probability of an email
being spam if it contains the word “free”?

 The numerator is the probability of a message being spam and
containing the word “free” (this is P(Spam and Free), subtly
different from the P(Spam|Free) we are looking for)
 The denominator is the overall probability of an email
containing the word “free”. This is equivalent to
P(Free|Spam)P(Spam) + P(Free|Not Spam)P(Not Spam)
 So together, this ratio (written out below) is the percentage of
emails with the word “free” that are spam.
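
In symbols, the ratio just described is Bayes’ theorem applied to
this example:

P(Spam|Free) = P(Free|Spam)P(Spam) / P(Free)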
What about all the other words?
We can construct P(Spam|Word) for every
(meaningful) word we encounter during
training
Then multiply these together when analysing a
new email to get the probability of it being spam.
This assumes the presence of different words is
independent of each other – one reason this
approach is called “Naïve Bayes”.
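
In symbols, for an email containing the words w1, …, wn, the product
described above is:

P(Spam|w1, …, wn) ∝ P(Spam) × P(w1|Spam) × … × P(wn|Spam)

The independence assumption is exactly what lets the joint likelihood
factor into this simple per-word product.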
Sounds like a lot of work
Scikit-learn to the rescue!
The CountVectorizer lets us operate on lots of words
at once, and MultinomialNB does all the heavy lifting
on Naïve Bayes.
We’ll train it on known sets of spam and “ham”
(non-spam) emails.
So this is supervised learning.
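
A minimal sketch of this workflow: CountVectorizer and MultinomialNB
are the scikit-learn classes named above, but the tiny training set
here is made up for illustration (this is not the code from the
notebook referenced below).

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labelled training data: 1 = spam, 0 = ham
train_emails = [
    "Free money!!! Click now to claim your prize",
    "You have won a free vacation, reply today",
    "Meeting agenda for Monday attached",
    "Can we reschedule lunch to Thursday?",
]
train_labels = [1, 1, 0, 0]

# CountVectorizer builds a vocabulary and turns each email into
# a vector of per-word counts.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(train_emails)

# MultinomialNB estimates P(word|class) and P(class) from the counts.
classifier = MultinomialNB()
classifier.fit(counts, train_labels)

# A new email must be transformed with the same fitted vectorizer.
new_counts = vectorizer.transform(["claim your free prize now"])
print(classifier.predict(new_counts))        # e.g. [1], i.e. spam
print(classifier.predict_proba(new_counts))  # per-class probabilities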
Example: Naivebayes.ipynb
