notebook - text classification
August 6, 2024
0.1 OBJECTIVE
Text Classification:
Build a text classification model that classifies an email as spam or not spam. Do not use LLMs or
any transformer-based model; instead, use a machine learning classification algorithm after
vectorising the text data. Submit your code along with a document explaining what approach you
took and why. The attached spam.csv file contains the data for this task.
import joblib
2 spam Free entry in 2 a wkly comp to win FA Cup fina…
3 ham U dun say so early hor… U c already then say…
4 ham Nah I don't think he goes to usf, he lives aro…
… … …
5567 spam This is the 2nd time we have tried 2 contact u…
5568 ham Will Ì_ b going to esplanade fr home?
5569 ham Pity, * was in mood for that. So…any other s…
5570 ham The guy did some bitching but I acted like i'd…
5571 ham Rofl. Its true to its name
[3]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 label 5572 non-null object
1 text 5572 non-null object
dtypes: object(2)
memory usage: 87.2+ KB
[4]: df.describe(include='object')
[5]: df['label'].value_counts()
label
ham 4825
spam 747
Name: count, dtype: int64
0.4 Data Cleaning
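[6]: import string
     from nltk.corpus import stopwords  # assumed stop-word source; needs nltk.download('stopwords')

     STOP_WORDS = set(stopwords.words('english'))

     def clean_text(mail):
         # Remove the punctuation
         without_punctuation = ''.join(char for char in mail if char not in string.punctuation)
         # Assumption: drop stop words case-insensitively (these lines are cut off in the export; the cleaned output shows stop words removed)
         result = ' '.join(word for word in without_punctuation.split() if word.lower() not in STOP_WORDS)
         return result

     df['text'] = df['text'].apply(clean_text)
     df['text']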
…
5567 2nd time tried 2 contact u U å£750 Pound prize…
5568 Ì b going esplanade fr home
5569 Pity mood Soany suggestions
5570 guy bitching acted like id interested buying s…
5571 Rofl true name
Name: text, Length: 5572, dtype: object
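As a quick check of the cleaning step, the sketch below applies the same two operations (strip punctuation, then drop stop words case-insensitively) to a single message. The small `STOP_WORDS` set is a hypothetical stand-in, not the list used in the notebook, chosen so the example runs without any downloads.

```python
import string

# Hypothetical stand-in stop-word list (assumption: the notebook's actual
# list is not shown in this export).
STOP_WORDS = {"its", "to", "the", "is", "this", "we", "have"}

def clean_text(mail: str) -> str:
    # Remove every ASCII punctuation character.
    without_punctuation = "".join(ch for ch in mail if ch not in string.punctuation)
    # Keep only tokens whose lower-cased form is not a stop word.
    kept = [w for w in without_punctuation.split() if w.lower() not in STOP_WORDS]
    return " ".join(kept)

cleaned = clean_text("Rofl. Its true to its name")
print(cleaned)
```

Note that the original casing of the kept words is preserved; only the stop-word comparison is case-insensitive.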
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
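The vectorizer is fitted on the training split only and then reused on the test split, so the vocabulary and IDF weights never see test data. A minimal sketch of that pattern on a made-up corpus follows; the messages and variable names are illustrative assumptions, not data from spam.csv.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data (hypothetical, not taken from spam.csv).
train_texts = ["free prize win now", "see you at home", "win free entry"]
test_texts = ["free lunch at home"]

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights from the training texts.
X_train_vec = vectorizer.fit_transform(train_texts)
# transform reuses that vocabulary; unseen words such as "lunch" are ignored.
X_test_vec = vectorizer.transform(test_texts)

vocab = vectorizer.vocabulary_
print(X_train_vec.shape, X_test_vec.shape)
print("lunch" in vocab)
```

Both matrices have the same number of columns, which is what lets a model trained on the first one score the second.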
# The spam and ham classes are imbalanced, so oversample the training set with SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
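SMOTE balances the classes by creating synthetic minority samples: each new point lies on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below illustrates only that interpolation step in NumPy. It is a simplified illustration, not imbalanced-learn's implementation, and `smote_like_samples`, `k`, and the toy points are hypothetical.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=2, seed=42):
    """Generate n_new synthetic points by interpolating between minority
    samples and their k nearest minority neighbours (simplified sketch)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))         # random minority sample
        nn = neighbours[base, rng.integers(k)]  # one of its k neighbours
        gap = rng.random()                      # interpolation factor in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nn] - X_min[base])
    return synthetic

X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like_samples(X_minority, n_new=5)
print(new_points.shape)
```

In the notebook, resampling is applied to the training split only, so the test set keeps its original class ratio.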
              precision    recall  f1-score   support

           0       1.00      0.98      0.99      1623
           1       0.84      0.98      0.91       216
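The columns in a report like this are precision, recall, F1-score, and support for each class. As a reminder of how the first three are derived from prediction counts, here is a small sketch. The labels are made up for illustration and are not the notebook's predictions; `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    # Count true positives, false positives, and false negatives for one class.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels: 1 = spam, 0 = ham.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

For an imbalanced spam problem, recall on the spam class shows how many spam messages are caught, while precision shows how many flagged messages really are spam.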
0.7.1 BEST PERFORMING MODEL
BernoulliNB is the best-performing model, since it achieves better recall and a better
F1-score than the other models.
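For reference, a minimal sketch of fitting BernoulliNB on TF-IDF features follows. The toy messages and labels are hypothetical, and the notebook's actual training call is not shown in this export. Note that scikit-learn's BernoulliNB binarizes its inputs (threshold 0.0 by default), so TF-IDF weights are effectively reduced to word presence or absence.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy data (hypothetical): 1 = spam, 0 = ham.
texts = ["win free prize now", "free entry win cash",
         "see you at home", "call me when you are home"]
labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# binarize=0.0 by default: any nonzero weight counts as "word present".
model = BernoulliNB()
model.fit(X, labels)

prediction = model.predict(vectorizer.transform(["free prize"]))
print(prediction)
```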