Bayesian Methods

March 18, 2024

0.1 Bayes' Theorem


• P(A|B) = (P(A)*P(B|A)) / P(B), where P(A|B) is the probability of A occurring given that B has occurred.
• Example: a spam detector.
– “What is the probability that an email is spam, given that it contains the word ‘free’?”
– P(Spam | Free) = (P(Spam)*P(Free | Spam)) / P(Free)
• Scikit-learn makes this easy to work with.
• CountVectorizer converts each message into counts for every word at once, and MultinomialNB does all the “heavy lifting” of Naive Bayes.
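As a rough worked example of the formula above, the sketch below estimates P(Spam | Free) from email counts. All counts are invented for illustration and do not come from the dataset used in this notebook.

```python
# Hypothetical counts, chosen only to illustrate the formula.
total_emails = 1000
spam_emails = 200        # emails labelled as spam
spam_with_free = 120     # spam emails containing the word "free"
ham_with_free = 40       # non-spam emails containing the word "free"

p_spam = spam_emails / total_emails                        # P(Spam)
p_free_given_spam = spam_with_free / spam_emails           # P(Free | Spam)
p_free = (spam_with_free + ham_with_free) / total_emails   # P(Free)

# Bayes' theorem: P(Spam | Free) = P(Spam) * P(Free | Spam) / P(Free)
p_spam_given_free = p_spam * p_free_given_spam / p_free

print(f"P(Spam | Free) = {p_spam_given_free:.2f}")
```

With these made-up numbers, the result equals the share of "free" emails that are spam (120 out of 160), which is a quick sanity check on the formula.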

0.2 Naive Bayes (the easy way)


[4]: import os
     import io
     import numpy
     import pandas as pd
     from pandas import DataFrame
     from sklearn.feature_extraction.text import CountVectorizer
     from sklearn.naive_bayes import MultinomialNB

     def readFiles(path):
         # Walk the directory tree and yield (file path, message body) pairs.
         for root, dirnames, filenames in os.walk(path):
             for filename in filenames:
                 path = os.path.join(root, filename)

                 # Skip the email headers: the body starts after the first blank line.
                 inBody = False
                 lines = []
                 with io.open(path, 'r', encoding='latin1') as f:
                     for line in f:
                         if inBody:
                             lines.append(line)
                         elif line == '\n':
                             inBody = True
                 message = '\n'.join(lines)
                 yield path, message

     def dataFrameFromDirectory(path, classification):
         # Build a DataFrame with one row per email, indexed by file path.
         rows = []
         index = []
         for filename, message in readFiles(path):
             rows.append({'message': message, 'class': classification})
             index.append(filename)

         return DataFrame(rows, index=index)

     # Combine the spam and ham emails into a single labelled DataFrame.
     data = pd.concat([
         dataFrameFromDirectory("emails/spam", "spam"),
         dataFrameFromDirectory("emails/ham", "ham"),
     ])

[14]: data.head()

[14]:                                                                                           message  \
emails/spam/00217.43b4ef3d9c56cf42be9c37b546a19e78  <html><xbody>\n\n<hr width = "100%">\n\n<cente…
emails/spam/00328.73c1a9f83d3b1247522c26eb6d74c215  \n\n Socijalisticka partija Srbije, pred…
emails/spam/00408.22230b84aee00e439ae1938e025d5005  \n\n<html>\n\n<body bgcolor="#FFFFFF">\n\n<TAB…
emails/spam/00383.1aa9a8211d1de540d6e3852e230e5a9d  <html>\n\n<head>\n\n<title>FREE* Liz Claiborne…
emails/spam/00390.ce19abc8034db9e6b435d494a91db87a  This message is in MIME format. Since your mai…

class
emails/spam/00217.43b4ef3d9c56cf42be9c37b546a19e78 spam
emails/spam/00328.73c1a9f83d3b1247522c26eb6d74c215 spam
emails/spam/00408.22230b84aee00e439ae1938e025d5005 spam
emails/spam/00383.1aa9a8211d1de540d6e3852e230e5a9d spam
emails/spam/00390.ce19abc8034db9e6b435d494a91db87a spam

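[8]: vectorizer = CountVectorizer()
     # Convert each message into a sparse vector of word counts.
     counts = vectorizer.fit_transform(data['message'].values)

     # Train the Naive Bayes classifier on the word counts and their labels.
     classifier = MultinomialNB()
     targets = data['class'].values
     classifier.fit(counts, targets)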

[8]: MultinomialNB()

[11]: examples = ['Free Viagra now!!!', "Hi Bob, how about a game of golf tomorrow?"]
      example_counts = vectorizer.transform(examples)
      predictions = classifier.predict(example_counts)
      predictions

[11]: array(['spam', 'ham'], dtype='<U4')

The output above shows that the model classifies both example sentences correctly.
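The predicted labels alone do not show how confident the model is. As a hedged sketch, predict_proba returns the estimated probability of each class. The tiny training corpus below is invented so that the example runs on its own; it stands in for the emails/ dataset and its numbers say nothing about the real model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented corpus, standing in for the emails/ dataset.
train_messages = [
    "free viagra now",
    "win free money",
    "golf tomorrow with bob",
    "lunch meeting tomorrow",
]
train_labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
classifier = MultinomialNB()
classifier.fit(vectorizer.fit_transform(train_messages), train_labels)

examples = ["Free money!!!", "Golf with Bob tomorrow?"]
probabilities = classifier.predict_proba(vectorizer.transform(examples))

# classifier.classes_ gives the column order of the probabilities.
for text, row in zip(examples, probabilities):
    scores = ", ".join(f"{c}={p:.3f}" for c, p in zip(classifier.classes_, row))
    print(f"{text!r}: {scores}")
```

Two correctly classified sentences are only a quick sanity check; a more reliable assessment would score the classifier on emails held out from training.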
