0% found this document useful (0 votes)

3 views

notebook - text classification

Uploaded by

golgothgolgoth039

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

3 views

notebook - text classification

Uploaded by

golgothgolgoth039

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 7

text-classification

August 6, 2024

0.1 OBJECTIVE
Text Classification:
Build a text classification model to classify an email as spam or not without using llms or any
transformer based model but use a machine learning classification algorithm after vectorising the
text data. Submit your code along with a document explaining what approach you took and why?
Attached is spam.csv file containing data for this

[1]: # Importing required libraries

import pandas as pd
from sklearn.preprocessing import LabelEncoder
import seaborn as sns
from nltk.corpus import stopwords
import string
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import SMOTE

from sklearn.linear_model import LogisticRegression

from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

import joblib

from sklearn.metrics import classification_report

0.2 Data Explorartion

[2]: #Loading the dataset
df = pd.read_csv('/kaggle/input/spam-csv/spam.csv')
df

[2]: label text

0 ham Go until jurong point, crazy.. Available only …
1 ham Ok lar… Joking wif u oni…

1
2 spam Free entry in 2 a wkly comp to win FA Cup fina…
3 ham U dun say so early hor… U c already then say…
4 ham Nah I don't think he goes to usf, he lives aro…
… … …
5567 spam This is the 2nd time we have tried 2 contact u…
5568 ham Will Ì_ b going to esplanade fr home?
5569 ham Pity, * was in mood for that. So…any other s…
5570 ham The guy did some bitching but I acted like i'd…
5571 ham Rofl. Its true to its name

[5572 rows x 2 columns]

[3]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 label 5572 non-null object
1 text 5572 non-null object
dtypes: object(2)
memory usage: 87.2+ KB

[4]: df.describe(include='object')

[4]: label text

count 5572 5572
unique 2 5158
top ham Sorry, I'll call later
freq 4825 30

0.3 Data Visualization

[5]: #Visualzing the class labels
sns.countplot(df,x='label')
print(df['label'].value_counts())

label
ham 4825
spam 747
Name: count, dtype: int64

2
0.4 Data Cleaning
[6]: def clean_text(mail):
#Remove the punctuation
without_punctuation = ''.join(char for char in mail if char not in string.
↪punctuation)

#Remove the stopwords

result = ' '.join(word for word in without_punctuation.split() if word.
↪lower() not in set(stopwords.words('english')))

return result

df['text'] = df['text'].apply(clean_text)
df['text']

[6]: 0 Go jurong point crazy Available bugis n great …

1 Ok lar Joking wif u oni
2 Free entry 2 wkly comp win FA Cup final tkts 2…
3 U dun say early hor U c already say
4 Nah dont think goes usf lives around though

3
…
5567 2nd time tried 2 contact u U å£750 Pound prize…
5568 Ì b going esplanade fr home
5569 Pity mood Soany suggestions
5570 guy bitching acted like id interested buying s…
5571 Rofl true name
Name: text, Length: 5572, dtype: object

0.5 Feature Encoding

[7]: # Ecoding the label feature
saved_encoder = {}
label_encoder = LabelEncoder()
label_encoder.fit(df['label'])
saved_encoder[df['label'].name] = label_encoder
df['label'] = label_encoder.transform(df['label'])

print([(index, label) for index, label in enumerate(label_encoder.classes_)])

[(0, 'ham'), (1, 'spam')]

[7]: label text

0 0 Go jurong point crazy Available bugis n great …
1 0 Ok lar Joking wif u oni
2 1 Free entry 2 wkly comp win FA Cup final tkts 2…
3 0 U dun say early hor U c already say
4 0 Nah dont think goes usf lives around though
… … …
5567 1 2nd time tried 2 contact u U å£750 Pound prize…
5568 0 Ì b going esplanade fr home
5569 0 Pity mood Soany suggestions
5570 0 guy bitching acted like id interested buying s…
5571 0 Rofl true name

[5572 rows x 2 columns]

0.6 Spliting the Data, Vectorizing and Resampling

[8]: X = df['text']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,␣

↪random_state=42)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)

4
X_test = vectorizer.transform(X_test)

#Since there is class imbalance between Spam vs Ham, resampling using SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

0.7 Model Training, Saving and Evaluation

[9]: algorithms = ␣
↪[LogisticRegression(),MultinomialNB(),SVC(),BernoulliNB(),KNeighborsClassifier(),DecisionTre

for algorithm in algorithms:

model = algorithm
model.fit(X_train_resampled,y_train_resampled)
joblib.dump(model,'model_'+ f'{model}' + '.pkl')