Spam SMS Detection 2
March 2, 2024
[4]: v1 v2 Unnamed: 2 \
0 ham Go until jurong point, crazy.. Available only … NaN
1 ham Ok lar… Joking wif u oni… NaN
2 spam Free entry in 2 a wkly comp to win FA Cup fina… NaN
3 ham U dun say so early hor… U c already then say… NaN
4 ham Nah I don't think he goes to usf, he lives aro… NaN
Unnamed: 3 Unnamed: 4
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
[6]: df_sms.head()
[6]: v1 v2 Unnamed: 2 \
0 ham Go until jurong point, crazy.. Available only … NaN
1 ham Ok lar… Joking wif u oni… NaN
2 spam Free entry in 2 a wkly comp to win FA Cup fina… NaN
3 ham U dun say so early hor… U c already then say… NaN
4 ham Nah I don't think he goes to usf, he lives aro… NaN
Unnamed: 3 Unnamed: 4
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5572
[9]: df_sms.tail()
[9]: v1 v2 Unnamed: 2 \
5567 spam This is the 2nd time we have tried 2 contact u… NaN
5568 ham Will Ì_ b going to esplanade fr home? NaN
5569 ham Pity, * was in mood for that. So…any other s… NaN
5570 ham The guy did some bitching but I acted like i'd… NaN
5571 ham Rofl. Its true to its name NaN
Unnamed: 3 Unnamed: 4
5567 NaN NaN
5568 NaN NaN
5569 NaN NaN
5570 NaN NaN
5571 NaN NaN
[12]: df_sms.describe()
[12]: v1 v2 \
count 5572 5572
unique 2 5169
top ham Sorry, I'll call later
freq 4825 30
Unnamed: 2 \
count 50
unique 43
top bt not his girlfrnd… G o o d n i g h t . . .@"
freq 3
Unnamed: 3 Unnamed: 4
count 12 6
unique 10 5
top MK17 92H. 450Ppw 16" GNT:-)"
freq 2 2
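The input of cell [4] is not shown above, but the `Unnamed: 2`–`Unnamed: 4` columns in its output are the telltale result of trailing commas in the raw CSV. A minimal sketch of the loading-and-cleanup step (the inline excerpt via `io.StringIO` and the renamed columns are assumptions for illustration; the real notebook presumably reads the latin-1 encoded `spam.csv` from disk):

```python
import io
import pandas as pd

# Simulated excerpt of the raw file: the trailing commas on every row
# are what produce the mostly-empty "Unnamed" columns seen above.
raw = (
    "v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4\n"
    'ham,"Go until jurong point, crazy..",,,\n'
    'spam,"Free entry in 2 a wkly comp",,,\n'
)
df_sms = pd.read_csv(io.StringIO(raw))

# Drop the filler columns and give the remaining two meaningful names.
df = (df_sms.drop(columns=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"])
            .rename(columns={"v1": "label", "v2": "sms_message"}))
print(df.columns.tolist())  # ['label', 'sms_message']
```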
df['length'].plot(bins=50, kind='hist')
[17]: df.hist(column='length', by='label', bins=50,figsize=(10,4))
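The `length` column plotted in the two histogram cells above is never created in the cells that survive in this export. Presumably it is the character count of each message; a hypothetical reconstruction of that step on a couple of sample rows:

```python
import pandas as pd

df = pd.DataFrame({"v2": ["Ok lar... Joking wif u oni...",
                          "Free entry in 2 a wkly comp"]})
# Character count of each message -- presumably how the 'length'
# column used by the histogram cells was derived.
df["length"] = df["v2"].apply(len)
print(df["length"].tolist())  # [29, 27]
```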
[18]: df.loc[:,'label'] = df.label.map({'ham':0, 'spam':1})
print(df.shape)
df.head()
(5572, 6)
C:\Users\ADMIN\AppData\Local\Temp\ipykernel_8320\933899167.py:1:
DeprecationWarning: In a future version, `df.iloc[:, i] = newvals` will attempt
to set the values inplace instead of always setting a new array. To retain the
old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-
unique, `df.isetitem(i, newvals)`
df.loc[:,'label'] = df.label.map({'ham':0, 'spam':1})
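The `DeprecationWarning` above comes from assigning through `.loc[:, 'label']`. As the message itself suggests, plain column assignment sidesteps it; a small self-contained sketch:

```python
import pandas as pd

df = pd.DataFrame({"label": ["ham", "spam", "ham"]})
# Direct column assignment avoids the deprecated .loc[:, col] = ... path
# while producing the same 0/1 encoding.
df["label"] = df["label"].map({"ham": 0, "spam": 1})
print(df["label"].tolist())  # [0, 1, 0]
```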
lower_case_documents = [d.lower() for d in documents]
print(lower_case_documents)
['hello, how are you!', 'win money, win from home.', 'call me now.', 'hello,
call hello you tomorrow?']
[20]: sans_punctuation_documents = []
import string
for i in lower_case_documents:
    sans_punctuation_documents.append(i.translate(str.maketrans("", "", string.punctuation)))
sans_punctuation_documents
Step 3: Tokenization
[21]: preprocessed_documents = [d.split() for d in sans_punctuation_documents]
preprocessed_documents
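The manual bag-of-words walk-through normally ends with a counting step before handing off to scikit-learn; no such cell survives in this export, so the following is a hypothetical sketch using `collections.Counter` on the tokenized documents:

```python
from collections import Counter

preprocessed_documents = [
    ["hello", "how", "are", "you"],
    ["win", "money", "win", "from", "home"],
]
# One word-frequency dictionary per tokenized document.
frequency_list = [Counter(tokens) for tokens in preprocessed_documents]
print(frequency_list[1]["win"])  # 2
```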
from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer()
count_vector.fit(documents)
feature_names = count_vector.get_feature_names_out()
[24]: doc_array = count_vector.transform(documents).toarray()
doc_array
count_vector = CountVectorizer()
doc_array = count_vector.fit_transform(documents).toarray()
Naive Bayes algorithm
[29]: from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)
[29]: MultinomialNB()
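The `training_data` and `y_train` used in cell [29] come from cells that did not survive in this export, presumably a train/test split followed by vectorizing the training text. A hedged end-to-end sketch on tiny synthetic data (the messages, labels, and variable names are illustrative stand-ins, not the notebook's actual data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Tiny synthetic stand-in for df['sms_message'] / df['label'] (0 = ham, 1 = spam).
messages = ["win cash now", "free prize win", "see you at lunch",
            "call me later", "free cash prize call now", "lunch tomorrow then"]
labels = [1, 1, 0, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.33, random_state=1)

# Fit the vectorizer on training text only, then reuse it on the test set
# so both share the same vocabulary.
count_vector = CountVectorizer()
training_data = count_vector.fit_transform(X_train)
testing_data = count_vector.transform(X_test)

naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)
predictions = naive_bayes.predict(testing_data)
print(accuracy_score(y_test, predictions))
```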