
spam-detection-1

May 5, 2025

[206]: import numpy as np


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

[208]: df = pd.read_csv('spam.csv')

[209]: df.sample(5)

[209]: v1 v2 Unnamed: 2 \
4809 ham Honey, can you pls find out how much they sell… NaN
392 ham Morning only i can ok. NaN
4231 ham I'm at home. Please call NaN
2520 ham Misplaced your number and was sending texts to… NaN
2270 ham U know we watchin at lido? NaN

Unnamed: 3 Unnamed: 4
4809 NaN NaN
392 NaN NaN
4231 NaN NaN
2520 NaN NaN
2270 NaN NaN

[210]: #1. Data cleaning


#2. EDA
#3. Text Preprocessing
#4. Model building
#5. Evaluation
#6. Improvement
#7. Website
#8. Deploy

[211]: ## Data cleaning

[212]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 v1 5572 non-null object
1 v2 5572 non-null object
2 Unnamed: 2 50 non-null object
3 Unnamed: 3 12 non-null object
4 Unnamed: 4 6 non-null object
dtypes: object(5)
memory usage: 217.8+ KB

[213]: df.drop(df.columns[-3:], axis=1, inplace=True)

[214]: df.sample(5)

[214]: v1 v2
5080 ham Yeah, give me a call if you've got a minute
2805 ham Can a not?
3223 ham Sorry da thangam.it's my mistake.
4499 ham Nvm take ur time.
589 ham I'm in a meeting, call me later at

[215]: df.rename(columns={'v1':'target','v2':'text'},inplace=True)
df.sample(5)

[215]: target text


3920 ham Do 1 thing! Change that sentence into: \Becaus…
3582 ham I sent your maga that money yesterday oh.
3309 ham Oh ho. Is this the first time u use these type…
4562 ham Come around &lt;DECIMAL&gt; pm vikky..i'm ots…
5288 ham An excellent thought by a misundrstud frnd: I …

[216]: from sklearn.preprocessing import LabelEncoder


encoder = LabelEncoder()

[217]: df.head()

[217]: target text


0 ham Go until jurong point, crazy.. Available only …
1 ham Ok lar… Joking wif u oni…
2 spam Free entry in 2 a wkly comp to win FA Cup fina…
3 ham U dun say so early hor… U c already then say…
4 ham Nah I don't think he goes to usf, he lives aro…

[218]: df.isnull().sum()

[218]: target 0
text 0
dtype: int64

[219]: df.duplicated().sum()

[219]: np.int64(403)

[220]: df = df.drop_duplicates(keep='first')

[221]: df.duplicated().sum()

[221]: np.int64(0)

[222]: ## EDA

[223]: df.head()

[223]: target text


0 ham Go until jurong point, crazy.. Available only …
1 ham Ok lar… Joking wif u oni…
2 spam Free entry in 2 a wkly comp to win FA Cup fina…
3 ham U dun say so early hor… U c already then say…
4 ham Nah I don't think he goes to usf, he lives aro…

[ ]:

[ ]:

[224]: df['target'].value_counts()

[224]: target
ham 4516
spam 653
Name: count, dtype: int64

[225]: import matplotlib.pyplot as plt


plt.pie(df['target'].value_counts(),labels=['ham','spam'],autopct = "%0.2f")
plt.show()

[226]: # Data is imbalanced
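With roughly 87% ham and 13% spam, the chart above confirms the imbalance. A minimal sketch (not part of this notebook's pipeline, which splits the data later without stratification) of quantifying the ratio and preserving it in a train/test split:

# Sketch: check the class ratio, then keep it intact when splitting (assumes df as loaded above)
print(df['target'].value_counts(normalize=True))   # ~0.87 ham vs ~0.13 spam

from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=2,
                                     stratify=df['target'])  # same ham/spam ratio in both splits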

[227]: import nltk

[228]: !pip install nltk

Requirement already satisfied: nltk in


c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (3.9.1)
Requirement already satisfied: click in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
nltk) (8.1.8)
Requirement already satisfied: joblib in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
nltk) (1.4.2)
Requirement already satisfied: regex>=2021.8.3 in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
nltk) (2024.11.6)
Requirement already satisfied: tqdm in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
nltk) (4.67.1)
Requirement already satisfied: colorama in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
click->nltk) (0.4.6)

[notice] A new release of pip is available: 25.1 -> 25.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip

[229]: nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to


[nltk_data] C:\Users\ank94\AppData\Roaming\nltk_data…
[nltk_data] Package punkt_tab is already up-to-date!

[229]: True

[230]: df['num_characters'] = df['text'].apply(len)

[231]: df.head()

[231]: target text num_characters


0 ham Go until jurong point, crazy.. Available only … 111
1 ham Ok lar… Joking wif u oni… 29
2 spam Free entry in 2 a wkly comp to win FA Cup fina… 155
3 ham U dun say so early hor… U c already then say… 49
4 ham Nah I don't think he goes to usf, he lives aro… 61

[232]: df['num_words'] = df['text'].apply(lambda x: len(nltk.word_tokenize(x)))

[233]: df.head()

[233]: target text num_characters \


0 ham Go until jurong point, crazy.. Available only … 111
1 ham Ok lar… Joking wif u oni… 29
2 spam Free entry in 2 a wkly comp to win FA Cup fina… 155
3 ham U dun say so early hor… U c already then say… 49
4 ham Nah I don't think he goes to usf, he lives aro… 61

num_words
0 24
1 8
2 37
3 13
4 15

[234]: df['num_sentences'] = df['text'].apply(lambda x: len(nltk.sent_tokenize(x)))

[235]: df[['num_characters','num_words','num_sentences']].describe()
numeric_df = df.select_dtypes(include=['number'])

[236]: # For spam messages
df[df['target'] == 'spam'][['num_characters','num_words','num_sentences']].describe()

[236]: num_characters num_words num_sentences


count 653.000000 653.000000 653.000000
mean 137.479326 27.675345 2.978560
std 30.014336 7.011513 1.493185
min 13.000000 2.000000 1.000000
25% 131.000000 25.000000 2.000000
50% 148.000000 29.000000 3.000000
75% 157.000000 32.000000 4.000000
max 223.000000 46.000000 9.000000

[237]: # For ham messages
df[df['target'] == 'ham'][['num_characters','num_words','num_sentences']].describe()

[237]: num_characters num_words num_sentences


count 4516.000000 4516.000000 4516.000000
mean 70.457263 17.123782 1.820195
std 56.357463 13.493970 1.383657
min 2.000000 1.000000 1.000000
25% 34.000000 8.000000 1.000000
50% 52.000000 13.000000 1.000000
75% 90.000000 22.000000 2.000000
max 910.000000 220.000000 38.000000

[238]: import seaborn as sns

[239]: # For ham messages


sns.histplot(df[df['target'] == 'ham']['num_characters'])
# For spam messages
sns.histplot(df[df['target'] == 'spam']['num_characters'],color = 'red')

[239]: <Axes: xlabel='num_characters', ylabel='Count'>

[240]: sns.pairplot(df,hue='target')

[240]: <seaborn.axisgrid.PairGrid at 0x24a38fbce10>

[241]: # Filter to only include numeric columns
numeric_df = df.select_dtypes(include=['number'])
sns.heatmap(numeric_df.corr(),annot=True)

[241]: <Axes: >

[242]: # 3. Data Preprocessing

[243]: # Lower case


# Tokenization
# Removing special characters
# Removing stop words and punctuation
# Stemming

[244]: import nltk
import string
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Download required resources - run this only once
try:
    nltk.data.find('corpora/stopwords')
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('stopwords')
    nltk.download('punkt')

# Initialize stemmer
ps = PorterStemmer()

[245]: def transform_text(text):
    text = text.lower()
    text = nltk.word_tokenize(text)

    # Keep only alphanumeric tokens (drops punctuation and special characters)
    y = []
    for i in text:
        if i.isalnum():
            y.append(i)

    # Remove stop words and punctuation
    text = y[:]
    y.clear()
    for i in text:
        if i not in stopwords.words('english') and i not in string.punctuation:
            y.append(i)

    # Apply Porter stemming
    text = y[:]
    y.clear()
    for i in text:
        y.append(ps.stem(i))

    return " ".join(y)
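A quick sanity check on a made-up message (a hypothetical example, not a cell from the original run): the function lowercases, tokenizes, keeps alphanumeric tokens, drops stop words, and stems.

transform_text("Free entry in 2 a weekly competition!!")
# roughly 'free entri 2 weekli competit' (exact tokens depend on the NLTK stopword list)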

[134]: df['transformed_text'] = df['text'].apply(transform_text)

[135]: df.head()

[135]: target text num_characters \


0 ham Go until jurong point, crazy.. Available only … 111
1 ham Ok lar… Joking wif u oni… 29
2 spam Free entry in 2 a wkly comp to win FA Cup fina… 155
3 ham U dun say so early hor… U c already then say… 49
4 ham Nah I don't think he goes to usf, he lives aro… 61

num_words num_sentences transformed_text


0 24 2 go jurong point crazi avail bugi n great world…
1 8 2 ok lar joke wif u oni
2 37 2 free entri 2 wkli comp win fa cup final tkt 21…
3 13 1 u dun say earli hor u c alreadi say
4 15 1 nah think goe usf live around though

[136]: pip install wordcloud

Requirement already satisfied: wordcloud in

c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (1.9.4)
Requirement already satisfied: numpy>=1.6.1 in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
wordcloud) (2.2.5)
Requirement already satisfied: pillow in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
wordcloud) (11.2.1)
Requirement already satisfied: matplotlib in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
wordcloud) (3.10.1)
Requirement already satisfied: contourpy>=1.0.1 in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
matplotlib->wordcloud) (1.3.2)
Requirement already satisfied: cycler>=0.10 in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
matplotlib->wordcloud) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
matplotlib->wordcloud) (4.57.0)
Requirement already satisfied: kiwisolver>=1.3.1 in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
matplotlib->wordcloud) (1.4.8)
Requirement already satisfied: packaging>=20.0 in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
matplotlib->wordcloud) (24.2)
Requirement already satisfied: pyparsing>=2.3.1 in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
matplotlib->wordcloud) (3.2.3)
Requirement already satisfied: python-dateutil>=2.7 in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
matplotlib->wordcloud) (2.9.0.post0)
Requirement already satisfied: six>=1.5 in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
python-dateutil>=2.7->matplotlib->wordcloud) (1.17.0)
Note: you may need to restart the kernel to use updated packages.

[notice] A new release of pip is available: 25.1 -> 25.1.1


[notice] To update, run: python.exe -m pip install --upgrade pip

[137]: from wordcloud import WordCloud


wc = WordCloud(width=500, height=500, min_font_size=10, background_color='white')

[143]: # Generate word cloud for spam messages


wc.generate(df[df['target'] == 'spam']['transformed_text'].str.cat(sep=" "))

[143]: <wordcloud.wordcloud.WordCloud at 0x24aa74e6850>

[144]: # Display the word cloud
plt.imshow(wc)

[144]: <matplotlib.image.AxesImage at 0x24a398b2ad0>

[145]: # Generate word cloud for ham messages


wc.generate(df[df['target'] == 'ham']['transformed_text'].str.cat(sep=" "))

[145]: <wordcloud.wordcloud.WordCloud at 0x24aa74e6850>

[146]: # Display the word cloud


plt.imshow(wc)

[146]: <matplotlib.image.AxesImage at 0x24a3991e490>

[147]: df.head()

[147]: target text num_characters \


0 ham Go until jurong point, crazy.. Available only … 111
1 ham Ok lar… Joking wif u oni… 29
2 spam Free entry in 2 a wkly comp to win FA Cup fina… 155
3 ham U dun say so early hor… U c already then say… 49
4 ham Nah I don't think he goes to usf, he lives aro… 61

num_words num_sentences transformed_text


0 24 2 go jurong point crazi avail bugi n great world…
1 8 2 ok lar joke wif u oni
2 37 2 free entri 2 wkli comp win fa cup final tkt 21…
3 13 1 u dun say earli hor u c alreadi say
4 15 1 nah think goe usf live around though

[148]: spam_corpus = []
for msg in df[df['target'] == 'spam']['transformed_text'].tolist():
    for word in msg.split():
        spam_corpus.append(word)

[149]: len(spam_corpus)

[149]: 9939

[150]: from collections import Counter


import seaborn as sns
import matplotlib.pyplot as plt

# Get the most common words and create DataFrame


spam_common = Counter(spam_corpus).most_common(30) # Get top 30 words
spam_df = pd.DataFrame(spam_common, columns=['Word', 'Count'])

# Create the plot


plt.figure(figsize=(6, 6))
sns.barplot(data=spam_df, x='Word', y='Count')

# Rotate x-axis labels


plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

[151]: ham_corpus = []
for msg in df[df['target'] == 'ham']['transformed_text'].tolist():
    for word in msg.split():
        ham_corpus.append(word)

[152]: len(ham_corpus)

[152]: 35404

[153]: from collections import Counter


import seaborn as sns
import matplotlib.pyplot as plt

# Get the most common words and create DataFrame
ham_common = Counter(ham_corpus).most_common(30) # Get top 30 words
ham_df = pd.DataFrame(ham_common, columns=['Word', 'Count'])

# Create the plot


plt.figure(figsize=(6, 6))
sns.barplot(data=ham_df, x='Word', y='Count')

# Rotate x-axis labels


plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

[154]: # 4. Model Building

[155]: from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


cv = CountVectorizer()
tfidf = TfidfVectorizer(max_features=3000)

[156]: x = tfidf.fit_transform(df['transformed_text']).toarray()  # note: X below is rebuilt from CountVectorizer

[157]: X = cv.fit_transform(df['transformed_text']).toarray()

[158]: X = np.hstack((X, df['num_characters'].values.reshape(-1,1)))

[159]: X.shape

[159]: (5169, 6709)

[160]: # from sklearn.preprocessing import MinMaxScaler


# scaler = MinMaxScaler()
# X_scaled = scaler.fit_transform(X)  # Ensure X is properly defined before this step

[161]: y = df['target'].values

[162]: y

[162]: array(['ham', 'ham', 'spam', …, 'ham', 'ham', 'ham'],


shape=(5169,), dtype=object)

[163]: import sklearn


from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
# Convert string labels to numbers
le = LabelEncoder()
y = le.fit_transform(df['target']) # 'ham' becomes 0, 'spam' becomes 1

# Then split into train/test and proceed with modeling

[164]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

[165]: from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB


from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

[166]: gnb = GaussianNB()


mnb = MultinomialNB()
bnb = BernoulliNB()

[167]: gnb.fit(X_train,y_train)
y_pred1 = gnb.predict(X_test)
print (accuracy_score(y_test,y_pred1))
print(confusion_matrix(y_test,y_pred1))
print(precision_score(y_test,y_pred1))

0.8800773694390716
[[792 104]
[ 20 118]]
0.5315315315315315

[168]: mnb.fit(X_train,y_train)
y_pred2 = mnb.predict(X_test)
print (accuracy_score(y_test,y_pred2))
print(confusion_matrix(y_test,y_pred2))
print(precision_score(y_test,y_pred2))

0.9690522243713733
[[883 13]
[ 19 119]]
0.9015151515151515

[169]: bnb.fit(X_train,y_train)
y_pred3 = bnb.predict(X_test)
print (accuracy_score(y_test,y_pred3))

print(confusion_matrix(y_test,y_pred3))
print(precision_score(y_test,y_pred3))

0.9700193423597679
[[893 3]
[ 28 110]]
0.9734513274336283

[170]: # tfidf --> BNB: TF-IDF features with Bernoulli Naive Bayes look like the best precision/accuracy trade-off so far
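The note above records the working decision: TF-IDF features paired with Bernoulli Naive Bayes. The feature matrix built in cells [157]-[158] actually came from CountVectorizer plus a character count, so the following is a hedged sketch of the TF-IDF variant rather than a reproduction of the run above:

# Sketch: TF-IDF (capped at 3000 features) feeding BernoulliNB, same split parameters as cell [164]
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score

X_tfidf = TfidfVectorizer(max_features=3000).fit_transform(df['transformed_text']).toarray()
Xtr, Xte, ytr, yte = train_test_split(X_tfidf, y, test_size=0.2, random_state=2)

bnb_tfidf = BernoulliNB().fit(Xtr, ytr)
pred = bnb_tfidf.predict(Xte)
print(accuracy_score(yte, pred), precision_score(yte, pred))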

[171]: pip install xgboost

Requirement already satisfied: xgboost in


c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (3.0.0)
Requirement already satisfied: numpy in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
xgboost) (2.2.5)
Requirement already satisfied: scipy in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
xgboost) (1.15.2)
Note: you may need to restart the kernel to use updated packages.

[notice] A new release of pip is available: 25.1 -> 25.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip

[172]: # Imports section


from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

[173]: # Model initialization


svc = SVC(kernel='sigmoid', gamma=1.0)
knc = KNeighborsClassifier()
bnb = BernoulliNB()
dtc = DecisionTreeClassifier(max_depth=5)
lrc = LogisticRegression(solver='liblinear', penalty='l1')
rfc = RandomForestClassifier(n_estimators=50, random_state=2)
abc = AdaBoostClassifier(n_estimators=50,random_state=2)
bc = BaggingClassifier(n_estimators=50, random_state=2)
etc = ExtraTreesClassifier(n_estimators =50, random_state =2 )
gbdt= GradientBoostingClassifier(n_estimators=50,random_state=2)
xgb = XGBClassifier(n_estimators=50,random_state=2)

[174]: clfs = {
'SVC': svc,
'KNR': knc,
'NB': bnb,
'DTC': dtc,
'lr':lrc,
'RF': rfc,
'AdaBoost': abc,
'BgC': bc,
'GBDT': gbdt,
'XGB': xgb
}

[175]: from sklearn.metrics import accuracy_score, precision_score

def train_classifier(clf, X_train, y_train, X_test, y_test):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)

    return accuracy, precision

[176]: from sklearn.svm import SVC


from sklearn.metrics import precision_score

# Initialize the model


svc = SVC()

# Train the model


svc.fit(X_train, y_train)

# Make predictions
y_pred = svc.predict(X_test)

# Now calculate precision


precision = precision_score(y_test, y_pred, zero_division=1)

[177]: train_classifier(svc,X_train,y_train,X_test,y_test)

C:\Users\ank94\AppData\Local\Programs\Python\Python313\Lib\site-
packages\sklearn\metrics\_classification.py:1565: UndefinedMetricWarning:
Precision is ill-defined and being set to 0.0 due to no predicted samples. Use
`zero_division` parameter to control this behavior.
_warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))

[177]: (0.8665377176015474, 0.0)

[178]: def evaluate_model(model, X_train, y_train, X_test, y_test):
    # Train the model
    model.fit(X_train, y_train)

    # Make predictions
    y_pred = model.predict(X_test)

    # Calculate precision; zero_division=1 means that if the model predicts
    # no spam at all, the undefined precision is reported as 1.0 instead of 0.0
    precision = precision_score(y_test, y_pred, zero_division=1)

    return precision

# Usage
svc = SVC()
precision = evaluate_model(svc, X_train, y_train, X_test, y_test)
print(f"Precision: {precision:.4f}")

Precision: 1.0000

[179]: accuracy_scores = []
precision_scores = []

for name, clf in clfs.items():
    current_accuracy, current_precision = train_classifier(clf, X_train, y_train, X_test, y_test)

    print(f"For {name}:")
    print(f"Accuracy - {current_accuracy}")
    print(f"Precision - {current_precision}")

    accuracy_scores.append(current_accuracy)
    precision_scores.append(current_precision)

C:\Users\ank94\AppData\Local\Programs\Python\Python313\Lib\site-
packages\sklearn\metrics\_classification.py:1565: UndefinedMetricWarning:
Precision is ill-defined and being set to 0.0 due to no predicted samples. Use
`zero_division` parameter to control this behavior.
_warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
For SVC:
Accuracy - 0.8665377176015474
Precision - 0.0
For KNR:
Accuracy - 0.9313346228239845
Precision - 0.7768595041322314
For NB:
Accuracy - 0.9700193423597679
Precision - 0.9734513274336283
For DTC:
Accuracy - 0.9516441005802708
Precision - 0.8928571428571429
For lr:
Accuracy - 0.97678916827853
Precision - 0.9523809523809523
For RF:
Accuracy - 0.971953578336557
Precision - 0.990990990990991
For AdaBoost:
Accuracy - 0.9448742746615088
Precision - 0.8932038834951457

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
Cell In[179], line 5
      4 for name, clf in clfs.items():
----> 5     current_accuracy, current_precision = train_classifier(clf, X_train, y_train, X_test, y_test)

Cell In[175], line 4, in train_classifier(clf, X_train, y_train, X_test, y_test)
----> 4     clf.fit(X_train, y_train)

(sklearn/joblib internal frames elided: the run was interrupted with a
KeyboardInterrupt while the BaggingClassifier was fitting, so BgC, GBDT and
XGB never produced scores)

KeyboardInterrupt:

[180]: # Convert dictionary keys to a list


algorithms = list(clfs.keys())

# Ensure all lists have the same length (use the minimum length)
min_length = min(len(algorithms), len(accuracy_scores), len(precision_scores))

# Create the DataFrame with consistently sized lists


performance_df = pd.DataFrame({
'Algorithm': algorithms[:min_length],
'Accuracy': accuracy_scores[:min_length],
'Precision': precision_scores[:min_length]
}).sort_values('Precision', ascending=False)

[181]: performance_df

[181]: Algorithm Accuracy Precision


5 RF 0.971954 0.990991
2 NB 0.970019 0.973451
4 lr 0.976789 0.952381
6 AdaBoost 0.944874 0.893204
3 DTC 0.951644 0.892857
1 KNR 0.931335 0.776860
0 SVC 0.866538 0.000000

[182]: performance_df1 = pd.melt(performance_df,id_vars = 'Algorithm')

[183]: performance_df1

[183]: Algorithm variable value
0 RF Accuracy 0.971954
1 NB Accuracy 0.970019
2 lr Accuracy 0.976789
3 AdaBoost Accuracy 0.944874
4 DTC Accuracy 0.951644
5 KNR Accuracy 0.931335
6 SVC Accuracy 0.866538
7 RF Precision 0.990991
8 NB Precision 0.973451
9 lr Precision 0.952381
10 AdaBoost Precision 0.893204
11 DTC Precision 0.892857
12 KNR Precision 0.776860
13 SVC Precision 0.000000

[184]: # Create the categorical plot


sns.catplot(x='Algorithm', y='value',
hue='variable', data=performance_df1,
kind='bar', height=5)

# Set y-axis limits


plt.ylim(0.5, 1.0)

# Rotate x-axis labels


plt.xticks(rotation='vertical')

# Show the plot


plt.show()

[185]: # First convert clfs.keys() to a list
algorithms = list(clfs.keys())

# Calculate the minimum length


min_length = min(len(algorithms), len(accuracy_scores), len(precision_scores))

# Now create the DataFrame with consistent lengths


temp_df = pd.DataFrame({
'Algorithm': algorithms[:min_length],
'Accuracy_max_ft_3000': accuracy_scores[:min_length],
'Precision_max_ft_3000': precision_scores[:min_length]
}).sort_values('Precision_max_ft_3000', ascending=False)

[186]: # First, convert clfs.keys() to a list


algorithms = list(clfs.keys())

# Calculate the minimum length
min_length = min(len(algorithms), len(accuracy_scores), len(precision_scores))

# Now create the DataFrame using only the first min_length elements of each list
temp_df = pd.DataFrame({
'Algorithm': algorithms[:min_length],
'Accuracy_scaling': accuracy_scores[:min_length],
'Precision_scaling': precision_scores[:min_length]
}).sort_values('Precision_scaling', ascending=False)

[187]: new_df = performance_df.merge(temp_df,on='Algorithm')

[188]: new_df_scaled = new_df.merge(temp_df,on='Algorithm')

[189]: new_df_scaled = new_df.merge(temp_df,on='Algorithm')

[190]: # First, convert clfs.keys() to a list


algorithms = list(clfs.keys())

# Calculate the minimum length


min_length = min(len(algorithms), len(accuracy_scores), len(precision_scores))

# Now create the DataFrame using only the first min_length elements of each list
temp_df = pd.DataFrame({
'Algorithm': algorithms[:min_length],
'Accuracy_num_chars': accuracy_scores[:min_length],
'Precision_num_chars': precision_scores[:min_length]
}).sort_values('Precision_num_chars', ascending=False)

[191]: new_df_scaled.merge(temp_df,on='Algorithm')

[191]: Algorithm Accuracy Precision Accuracy_scaling_x Precision_scaling_x \


0 RF 0.971954 0.990991 0.971954 0.990991
1 NB 0.970019 0.973451 0.970019 0.973451
2 lr 0.976789 0.952381 0.976789 0.952381
3 AdaBoost 0.944874 0.893204 0.944874 0.893204
4 DTC 0.951644 0.892857 0.951644 0.892857
5 KNR 0.931335 0.776860 0.931335 0.776860
6 SVC 0.866538 0.000000 0.866538 0.000000

Accuracy_scaling_y Precision_scaling_y Accuracy_num_chars \


0 0.971954 0.990991 0.971954
1 0.970019 0.973451 0.970019
2 0.976789 0.952381 0.976789
3 0.944874 0.893204 0.944874
4 0.951644 0.892857 0.951644

5 0.931335 0.776860 0.931335
6 0.866538 0.000000 0.866538

Precision_num_chars
0 0.990991
1 0.973451
2 0.952381
3 0.893204
4 0.892857
5 0.776860
6 0.000000

[192]: new_df_scaled

[192]: Algorithm Accuracy Precision Accuracy_scaling_x Precision_scaling_x \


0 RF 0.971954 0.990991 0.971954 0.990991
1 NB 0.970019 0.973451 0.970019 0.973451
2 lr 0.976789 0.952381 0.976789 0.952381
3 AdaBoost 0.944874 0.893204 0.944874 0.893204
4 DTC 0.951644 0.892857 0.951644 0.892857
5 KNR 0.931335 0.776860 0.931335 0.776860
6 SVC 0.866538 0.000000 0.866538 0.000000

Accuracy_scaling_y Precision_scaling_y
0 0.971954 0.990991
1 0.970019 0.973451
2 0.976789 0.952381
3 0.944874 0.893204
4 0.951644 0.892857
5 0.931335 0.776860
6 0.866538 0.000000

[193]: # Model initialization


rf = RandomForestClassifier(n_estimators=50, random_state=2)
bnb = BernoulliNB()
lr = LogisticRegression(solver='liblinear', penalty='l1')
from sklearn.ensemble import VotingClassifier

[194]: voting = VotingClassifier(estimators=[('rf', rf), ('bnb', bnb), ('lr', lr)], voting='soft')

[195]: voting.fit(X_train,y_train)

[195]: VotingClassifier(estimators=[('rf',
RandomForestClassifier(n_estimators=50,
random_state=2)),
('bnb', BernoulliNB()),

('lr',
LogisticRegression(penalty='l1',
solver='liblinear'))],
voting='soft')

[196]: VotingClassifier(estimators=[('rf',
RandomForestClassifier(n_estimators=100,
random_state=2)),
('bnb', BernoulliNB()),
('lr', LogisticRegression())],
voting='soft')

[196]: VotingClassifier(estimators=[('rf', RandomForestClassifier(random_state=2)),


('bnb', BernoulliNB()),
('lr', LogisticRegression())],
voting='soft')

[197]: y_pred = voting.predict(X_test)


print("Accuracy",accuracy_score(y_test,y_pred))
print("Precision",precision_score(y_test,y_pred))

Accuracy 0.9758220502901354
Precision 1.0

[198]: estimators = [('rf', RandomForestClassifier(n_estimators=100, random_state=2)),


('bnb', BernoulliNB()),
('lr', LogisticRegression())]
final_estimator = RandomForestClassifier()

[199]: from sklearn.ensemble import StackingClassifier

[200]: clf = StackingClassifier(estimators=estimators, final_estimator=final_estimator)
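The stacking model defined here is never actually fit in this run; the next cell reassigns clf to a plain LogisticRegression. A minimal sketch of training and scoring it (assuming the X_train/X_test split from cell [164]):

# Sketch: fit and evaluate the stacking ensemble before clf is reassigned below
stack = StackingClassifier(estimators=estimators, final_estimator=final_estimator)
stack.fit(X_train, y_train)
y_pred_stack = stack.predict(X_test)
print("Accuracy", accuracy_score(y_test, y_pred_stack))
print("Precision", precision_score(y_test, y_pred_stack))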

[201]: # If you're using LogisticRegression, increase max_iter


from sklearn.linear_model import LogisticRegression

# Increase max_iter (default is usually 100)


clf = LogisticRegression(max_iter=1000, solver='lbfgs')

[202]: clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("Accuracy",accuracy_score(y_test,y_pred))
print("Precision",precision_score(y_test,y_pred))

Accuracy 0.97678916827853
Precision 0.9672131147540983

[246]: import pickle

# Save the TF-IDF vectorizer
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(tfidf, f)

# Save the Bernoulli Naive Bayes model
with open('model.pkl', 'wb') as f:
    pickle.dump(bnb, f)
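For the planned website/deploy steps (#7 and #8 in the outline), the inference side would load these two files and reuse transform_text. A minimal sketch, assuming the pickled model was fit on features produced by the pickled vectorizer:

# Sketch: load the saved artifacts and classify one new message
import pickle

with open('vectorizer.pkl', 'rb') as f:
    loaded_tfidf = pickle.load(f)
with open('model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

msg = "Congratulations! You have won a free prize, call now"  # hypothetical input
vec = loaded_tfidf.transform([transform_text(msg)]).toarray()
print("spam" if loaded_model.predict(vec)[0] == 1 else "ham")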

[ ]:

[ ]:

[ ]:

[ ]:
