The document discusses using Bayes' theorem to build a spam classifier by calculating the probability of emails containing certain words being spam. It explains constructing probabilities for each word found in training data and multiplying them to analyze new emails. Scikit-learn tools like CountVectorizer and MultinomialNB can help implement this naive Bayes approach for spam filtering.
Week 3 - 5-Bayesian Methods
Bayes Theorem
Let's use it for machine learning! I want a spam classifier.
Example: how would we express the probability of an email being spam if it contains the word "free"?

P(Spam | Free) = P(Free | Spam) P(Spam) / P(Free)

The numerator is the probability of a message being spam and containing the word "free" (this is subtly different from what we are looking for).
The denominator is the overall probability of an email containing the word "free", equivalent to P(Free | Spam) P(Spam) + P(Free | Not Spam) P(Not Spam).
So together, this ratio is the percentage of emails with the word "free" that are spam.

What about all the other words?
We can construct P(Spam | Word) for every (meaningful) word we encounter during training, then multiply these together when analysing a new email to get the probability of it being spam.
This assumes the presence of different words is independent of each other, which is one reason this is called "Naïve Bayes".

Sounds like a lot of work? Scikit-learn to the rescue!
CountVectorizer lets us operate on lots of words at once, and MultinomialNB does all the heavy lifting on Naïve Bayes.
We'll train it on known sets of spam and "ham" (non-spam) emails, so this is supervised learning.
Example: Naivebayes.ipynb
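The per-word construction described above can be sketched in plain Python. This is a toy illustration in the standard formulation (multiply the prior by P(word | class) for each word, then normalise); the word counts and training-set sizes are made up for the example:

```python
# Toy Naive Bayes sketch with made-up training counts.
# counts[word] = (spam emails containing word, ham emails containing word)
counts = {"free": (40, 5), "meeting": (2, 30)}
n_spam, n_ham = 50, 50  # hypothetical training set sizes

def p_spam_given_words(words):
    # Start from the priors P(Spam) and P(Not Spam)...
    p_spam = n_spam / (n_spam + n_ham)
    p_ham = n_ham / (n_spam + n_ham)
    # ...then multiply in P(word | class) for each word, treating the
    # words as independent (the "naive" assumption).
    for w in words:
        spam_c, ham_c = counts.get(w, (0, 0))
        # Add-one (Laplace) smoothing so an unseen word doesn't
        # zero out the whole product.
        p_spam *= (spam_c + 1) / (n_spam + 2)
        p_ham *= (ham_c + 1) / (n_ham + 2)
    # Normalise the two scores into a probability.
    return p_spam / (p_spam + p_ham)

print(p_spam_given_words(["free"]))     # high: "free" appears mostly in spam
print(p_spam_given_words(["meeting"]))  # low: "meeting" appears mostly in ham
```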
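A minimal scikit-learn version of the same idea, using the CountVectorizer and MultinomialNB classes mentioned above. The tiny hand-made training set here is a stand-in for the real spam/ham corpus used in Naivebayes.ipynb:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up miniature training set; a real classifier would be trained
# on thousands of known spam and ham emails.
emails = [
    "free money win cash now",
    "free prize claim your reward",
    "meeting agenda for tomorrow",
    "lunch with the project team",
]
labels = ["spam", "spam", "ham", "ham"]

# CountVectorizer turns each email into a vector of word counts...
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# ...and MultinomialNB fits the Naive Bayes model on those counts.
clf = MultinomialNB()
clf.fit(X, labels)

# New emails must go through the SAME fitted vectorizer.
test = vectorizer.transform(["win free cash prize"])
print(clf.predict(test))
```

Note that `transform` (not `fit_transform`) is used on new emails, so they are mapped onto the vocabulary learned during training.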