0% found this document useful (0 votes)

17 views31 pages

Spam Detection 1

The document outlines a spam detection project using Python, including data cleaning, exploratory data analysis (EDA), text preprocessing, model building, and evaluation. It utilizes libraries such as pandas, sklearn, and nltk for handling and analyzing text data. The dataset consists of 5572 entries labeled as 'ham' or 'spam', with various preprocessing steps applied to prepare the data for modeling.

Uploaded by

d4664360

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

17 views31 pages

Spam Detection 1

Uploaded by

d4664360

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 31

spam-detection-1

May 5, 2025

[206]: import numpy as np

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

[208]: df = pd.read_csv('spam.csv')

[209]: df.sample(5)

[209]: v1 v2 Unnamed: 2 \
4809 ham Honey, can you pls find out how much they sell… NaN
392 ham Morning only i can ok. NaN
4231 ham I'm at home. Please call NaN
2520 ham Misplaced your number and was sending texts to… NaN
2270 ham U know we watchin at lido? NaN

Unnamed: 3 Unnamed: 4
4809 NaN NaN
392 NaN NaN
4231 NaN NaN
2520 NaN NaN
2270 NaN NaN

[210]: #1. Data clening

#2. EDA
#3. Text Preprocessing
#4. Model building
#5. Evaluation
#6. Improvement
#7. website
#8. Deploy

[211]: ## Data cleaning

[212]: df.info()

1
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 v1 5572 non-null object
1 v2 5572 non-null object
2 Unnamed: 2 50 non-null object
3 Unnamed: 3 12 non-null object
4 Unnamed: 4 6 non-null object
dtypes: object(5)
memory usage: 217.8+ KB

[213]: df.drop(df.columns[-3:], axis=1, inplace=True)

[214]: df.sample(5)

[214]: v1 v2
5080 ham Yeah, give me a call if you've got a minute
2805 ham Can a not?
3223 ham Sorry da thangam.it's my mistake.
4499 ham Nvm take ur time.
589 ham I'm in a meeting, call me later at

[215]: df.rename(columns={'v1':'target','v2':'text'},inplace=True)
df.sample(5)

[215]: target text

3920 ham Do 1 thing! Change that sentence into: \Becaus…
3582 ham I sent your maga that money yesterday oh.
3309 ham Oh ho. Is this the first time u use these type…
4562 ham Come around <DECIMAL> pm vikky..i'm ots…
5288 ham An excellent thought by a misundrstud frnd: I …

[216]: from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

[217]: df.head()

[217]: target text

0 ham Go until jurong point, crazy.. Available only …
1 ham Ok lar… Joking wif u oni…
2 spam Free entry in 2 a wkly comp to win FA Cup fina…
3 ham U dun say so early hor… U c already then say…
4 ham Nah I don't think he goes to usf, he lives aro…

[218]: df.isnull().sum()

2
[218]: target 0
text 0
dtype: int64

[219]: df.duplicated().sum()

[219]: np.int64(403)

[220]: df = df.drop_duplicates(keep='first')

[221]: df.duplicated().sum()

[221]: np.int64(0)

[222]: ## EDA

[223]: df.head()

[223]: target text

[ ]:

[224]: df['target'].value_counts()

[224]: target
ham 4516
spam 653
Name: count, dtype: int64

[225]: import matplotlib.pyplot as plt

plt.pie(df['target'].value_counts(),labels=['ham','spam'],autopct = "%0.2f")
plt.show()

3
[226]: # Data is imbalanced

[227]: import nltk

[228]: !pip install nltk

Requirement already satisfied: nltk in

c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (3.9.1)
Requirement already satisfied: click in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
nltk) (8.1.8)
Requirement already satisfied: joblib in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
nltk) (1.4.2)
Requirement already satisfied: regex>=2021.8.3 in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
nltk) (2024.11.6)
Requirement already satisfied: tqdm in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
nltk) (4.67.1)
Requirement already satisfied: colorama in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
click->nltk) (0.4.6)

4
[notice] A new release of pip is available: 25.1 -> 25.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip

[229]: nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to

[nltk_data] C:\Users\ank94\AppData\Roaming\nltk_data…
[nltk_data] Package punkt_tab is already up-to-date!

[229]: True

[230]: df['num_characters'] = df['text'].apply(len)

[231]: df.head()

[231]: target text num_characters

0 ham Go until jurong point, crazy.. Available only … 111
1 ham Ok lar… Joking wif u oni… 29
2 spam Free entry in 2 a wkly comp to win FA Cup fina… 155
3 ham U dun say so early hor… U c already then say… 49
4 ham Nah I don't think he goes to usf, he lives aro… 61

[232]: df['num_words'] = df['text'].apply(lambda x: len(nltk.word_tokenize(x)))

[233]: df.head()

[233]: target text num_characters \

num_words
0 24
1 8
2 37
3 13
4 15

[234]: df['num_sentences'] = df['text'].apply(lambda x: len(nltk.sent_tokenize(x)))

[235]: df[['num_characters','num_words','num_sentences']].describe()
numeric_df = df.select_dtypes(include=['number'])

5
[236]: # For spam messages
df[df['target'] == 'spam' ][['num_characters','num_words','num_sentences']].
↪describe()

[236]: num_characters num_words num_sentences

count 653.000000 653.000000 653.000000
mean 137.479326 27.675345 2.978560
std 30.014336 7.011513 1.493185
min 13.000000 2.000000 1.000000
25% 131.000000 25.000000 2.000000
50% 148.000000 29.000000 3.000000
75% 157.000000 32.000000 4.000000
max 223.000000 46.000000 9.000000

[237]: # For spam messages

df[df['target'] == 'ham' ][['num_characters','num_words','num_sentences']].
↪describe()

[237]: num_characters num_words num_sentences

count 4516.000000 4516.000000 4516.000000
mean 70.457263 17.123782 1.820195
std 56.357463 13.493970 1.383657
min 2.000000 1.000000 1.000000
25% 34.000000 8.000000 1.000000
50% 52.000000 13.000000 1.000000
75% 90.000000 22.000000 2.000000
max 910.000000 220.000000 38.000000

[238]: import seaborn as sns

[239]: # For ham messages

sns.histplot(df[df['target'] == 'ham']['num_characters'])
# For spam messages
sns.histplot(df[df['target'] == 'spam']['num_characters'],color = 'red')

[239]: <Axes: xlabel='num_characters', ylabel='Count'>

6
[240]: sns.pairplot(df,hue='target')

[240]: <seaborn.axisgrid.PairGrid at 0x24a38fbce10>

7
[241]: # Filter to only include numeric columns
numeric_df = df.select_dtypes(include=['number'])
sns.heatmap(numeric_df.corr(),annot=True)

[241]: <Axes: >

8
[242]: # 3. Data Preprocessing

[243]: # Lower case

# Tokenization
# Removing special characters
# Removing stop words and punctuation
# Stemming

[244]: import nltk

import string
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Download required resources - run this only once

try:
nltk.data.find('corpora/stopwords')
nltk.data.find('tokenizers/punkt')
except LookupError:
nltk.download('stopwords')
nltk.download('punkt')

# Initialize stemmer

9
ps = PorterStemmer()

[245]: def transform_text(text):

text = text.lower()
text = nltk.word_tokenize(text)

y = []
for i in text:
if i.isalnum():
y.append(i)

text = y[:]
y.clear()

for i in text:
if i not in stopwords.words('english') and i not in string.punctuation:
y.append(i)

text = y[:]
y.clear()

for i in text:
y.append(ps.stem(i))

return " ".join(y)

[134]: df['transformed_text'] = df['text'].apply(transform_text)

[135]: df.head()

[135]: target text num_characters \

num_words num_sentences transformed_text

0 24 2 go jurong point crazi avail bugi n great world…
1 8 2 ok lar joke wif u oni
2 37 2 free entri 2 wkli comp win fa cup final tkt 21…
3 13 1 u dun say earli hor u c alreadi say
4 15 1 nah think goe usf live around though

[136]: pip install wordcloud

Requirement already satisfied: wordcloud in

10
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (1.9.4)
Requirement already satisfied: numpy>=1.6.1 in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
wordcloud) (2.2.5)
Requirement already satisfied: pillow in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
wordcloud) (11.2.1)
Requirement already satisfied: matplotlib in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
wordcloud) (3.10.1)
Requirement already satisfied: contourpy>=1.0.1 in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
matplotlib->wordcloud) (1.3.2)
Requirement already satisfied: cycler>=0.10 in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
matplotlib->wordcloud) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
matplotlib->wordcloud) (4.57.0)
Requirement already satisfied: kiwisolver>=1.3.1 in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
matplotlib->wordcloud) (1.4.8)
Requirement already satisfied: packaging>=20.0 in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
matplotlib->wordcloud) (24.2)
Requirement already satisfied: pyparsing>=2.3.1 in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
matplotlib->wordcloud) (3.2.3)
Requirement already satisfied: python-dateutil>=2.7 in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
matplotlib->wordcloud) (2.9.0.post0)
Requirement already satisfied: six>=1.5 in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
python-dateutil>=2.7->matplotlib->wordcloud) (1.17.0)
Note: you may need to restart the kernel to use updated packages.

[notice] A new release of pip is available: 25.1 -> 25.1.1

[notice] To update, run: python.exe -m pip install --upgrade pip

[137]: from wordcloud import WordCloud

wc = WordCloud(width=500, height=500, min_font_size=10,␣
↪background_color='white')

[143]: # Generate word cloud for spam messages

wc.generate(df[df['target'] == 'spam']['transformed_text'].str.cat(sep=" "))

[143]: <wordcloud.wordcloud.WordCloud at 0x24aa74e6850>

11
[144]: # Display the word cloud
plt.imshow(wc)

[144]: <matplotlib.image.AxesImage at 0x24a398b2ad0>

[145]: # Generate word cloud for spam messages

wc.generate(df[df['target'] == 'ham']['transformed_text'].str.cat(sep=" "))

[145]: <wordcloud.wordcloud.WordCloud at 0x24aa74e6850>

[146]: # Display the word cloud

plt.imshow(wc)

[146]: <matplotlib.image.AxesImage at 0x24a3991e490>

12
[147]: df.head()

[147]: target text num_characters \

num_words num_sentences transformed_text

[148]: spam_corpus = []
for msg in df[df['target'] == 'spam'] ['transformed_text'].tolist():
for word in msg.split():
spam_corpus.append(word)

[149]: len(spam_corpus)

13
[149]: 9939

[150]: from collections import Counter

import seaborn as sns
import matplotlib.pyplot as plt

# Get the most common words and create DataFrame

spam_common = Counter(spam_corpus).most_common(30) # Get top 30 words
spam_df = pd.DataFrame(spam_common, columns=['Word', 'Count'])

# Create the plot

plt.figure(figsize=(6, 6))
sns.barplot(data=spam_df, x='Word', y='Count')

# Rotate x-axis labels

plt.xticks(rotation=45) # not 'rotations'

plt.tight_layout()
plt.show()

14
[151]: ham_corpus = []
for msg in df[df['target'] == 'ham'] ['transformed_text'].tolist():
for word in msg.split():
ham_corpus.append(word)

[152]: len(ham_corpus)

[152]: 35404

[153]: from collections import Counter

import seaborn as sns
import matplotlib.pyplot as plt

15
# Get the most common words and create DataFrame
ham_common = Counter(ham_corpus).most_common(30) # Get top 30 words
ham_df = pd.DataFrame(ham_common, columns=['Word', 'Count'])

# Create the plot

plt.figure(figsize=(6, 6))
sns.barplot(data=ham_df, x='Word', y='Count')

# Rotate x-axis labels

plt.xticks(rotation=45) # not 'rotations'

plt.tight_layout()
plt.show()

16
[154]: # 4. Model Building

[155]: from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

cv = CountVectorizer()
tfidf = TfidfVectorizer(max_features=3000)

[156]: x = tfidf.fit_transform(df['transformed_text']).toarray

[157]: X = cv.fit_transform(df['transformed_text']).toarray()

[158]: X = np.hstack((X, df['num_characters'].values.reshape(-1,1)))

[159]: X.shape

[159]: (5169, 6709)

[160]: # from sklearn.preprocessing import MinMaxScaler

# # scaler = MinMaxScaler()
# X_scaled = scaler.fit_transform(X) # Ensure X is properly defined before␣
↪this step

[161]: y = df['target'].values

[162]: y

[162]: array(['ham', 'ham', 'spam', …, 'ham', 'ham', 'ham'],

shape=(5169,), dtype=object)

[163]: import sklearn

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
# Convert string labels to numbers
le = LabelEncoder()
y = le.fit_transform(df['target']) # 'ham' becomes 0, 'spam' becomes 1

# Then split into train/test and proceed with modeling

[164]: X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.

↪2,random_state=2)

[165]: from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

[166]: gnb = GaussianNB()

mnb = MultinomialNB()
bnb = BernoulliNB ()

17
[167]: gnb.fit(X_train,y_train)
y_pred1 = gnb.predict(X_test)
print (accuracy_score(y_test,y_pred1))
print(confusion_matrix(y_test,y_pred1))
print(precision_score(y_test,y_pred1))

0.8800773694390716
[[792 104]
[ 20 118]]
0.5315315315315315

[168]: mnb.fit(X_train,y_train)
y_pred2 = mnb.predict(X_test)
print (accuracy_score(y_test,y_pred2))
print(confusion_matrix(y_test,y_pred2))
print(precision_score(y_test,y_pred2))

0.9690522243713733
[[883 13]
[ 19 119]]
0.9015151515151515

[169]: bnb.fit(X_train,y_train)
y_pred3 = bnb.predict(X_test)
print (accuracy_score(y_test,y_pred3))

print(confusion_matrix(y_test,y_pred3))
print(precision_score(y_test,y_pred3))

0.9700193423597679
[[893 3]
[ 28 110]]
0.9734513274336283

[170]: #tfidf --> BNB

[171]: pip install xgboost

Requirement already satisfied: xgboost in

c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (3.0.0)
Requirement already satisfied: numpy in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
xgboost) (2.2.5)
Requirement already satisfied: scipy in
c:\users\ank94\appdata\local\programs\python\python313\lib\site-packages (from
xgboost) (1.15.2)
Note: you may need to restart the kernel to use updated packages.

18
[notice] A new release of pip is available: 25.1 -> 25.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip

[172]: # Imports section

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

[173]: # Model initialization

svc = SVC(kernel='sigmoid', gamma=1.0)
knc = KNeighborsClassifier()
bnb = BernoulliNB()
dtc = DecisionTreeClassifier(max_depth=5)
lrc = LogisticRegression(solver='liblinear', penalty='l1')
rfc = RandomForestClassifier(n_estimators=50, random_state=2)
abc = AdaBoostClassifier(n_estimators=50,random_state=2)
bc = BaggingClassifier(n_estimators=50, random_state=2)
etc = ExtraTreesClassifier(n_estimators =50, random_state =2 )
gbdt= GradientBoostingClassifier(n_estimators=50,random_state=2)
xgb = XGBClassifier(n_estimators=50,random_state=2)

[174]: clfs = {
'SVC': svc,
'KNR': knc,
'NB': bnb,
'DTC': dtc,
'lr':lrc,
'RF': rfc,
'AdaBoost': abc,
'BgC': bc,
'GBDT': gbdt,
'XGB': xgb # Changed 'xgb' to 'XGB' for consistency
}

[175]: from sklearn.metrics import accuracy_score, precision_score

def train_classifier(clf, X_train, y_train, X_test, y_test):

clf.fit(X_train, y_train)

19
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)

return accuracy, precision

[176]: from sklearn.svm import SVC

from sklearn.metrics import precision_score

# Initialize the model

svc = SVC()

# Train the model

svc.fit(X_train, y_train)

# Make predictions
y_pred = svc.predict(X_test)

# Now calculate precision

precision = precision_score(y_test, y_pred, zero_division=1)

[177]: train_classifier(svc,X_train,y_train,X_test,y_test)

[177]: (0.8665377176015474, 0.0)

[178]: def evaluate_model(model, X_train, y_train, X_test, y_test):

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate precision
precision = precision_score(y_test, y_pred, zero_division=1)

return precision

# Usage
svc = SVC()
precision = evaluate_model(svc, X_train, y_train, X_test, y_test)
print(f"Precision: {precision:.4f}")

20
Precision: 1.0000

[179]: accuracy_scores = []
precision_scores = []

for name, clf in clfs.items():

current_accuracy, current_precision = train_classifier(clf, X_train,␣
↪y_train, X_test, y_test)

print(f"For {name}:")
print(f"Accuracy - {current_accuracy}")
print(f"Precision - {current_precision}")

accuracy_scores.append(current_accuracy)
precision_scores.append(current_precision)

C:\Users\ank94\AppData\Local\Programs\Python\Python313\Lib\site-
packages\sklearn\metrics\_classification.py:1565: UndefinedMetricWarning:
Precision is ill-defined and being set to 0.0 due to no predicted samples. Use
`zero_division` parameter to control this behavior.
_warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
For SVC:
Accuracy - 0.8665377176015474
Precision - 0.0
For KNR:
Accuracy - 0.9313346228239845
Precision - 0.7768595041322314
For NB:
Accuracy - 0.9700193423597679
Precision - 0.9734513274336283
For DTC:
Accuracy - 0.9516441005802708
Precision - 0.8928571428571429
For lr:
Accuracy - 0.97678916827853
Precision - 0.9523809523809523
For RF:
Accuracy - 0.971953578336557
Precision - 0.990990990990991
For AdaBoost:
Accuracy - 0.9448742746615088
Precision - 0.8932038834951457

---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
Cell In[179], line 5
2 precision_scores = []

21
4 for name, clf in clfs.items():
----> 5 current_accuracy, current_precision =␣
↪train_classifier(clf, X_train, y_train, X_test, y_test)

7 print(f"For {name}:")
8 print(f"Accuracy - {current_accuracy}")

Cell In[175], line 4, in train_classifier(clf, X_train, y_train, X_test, y_test)

3 def train_classifier(clf, X_train, y_train, X_test, y_test):
----> 4 clf.fit(X_train, y_train)
5 y_pred = clf.predict(X_test)
6 accuracy = accuracy_score(y_test, y_pred)

File␣
↪~\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\utils\validation.
↪py:63, in _deprecate_positional_args.<locals>._inner_deprecate_positional_args.

↪<locals>.inner_f(*args, **kwargs)

61 extra_args = len(args) - len(all_args)

62 if extra_args <= 0:
---> 63 return f(*args, **kwargs)
65 # extra_args > 0
66 args_msg = [
67 "{}={}".format(name, arg)
68 for name, arg in zip(kwonly_args[:extra_args], args[-extra_args:])
69 ]

File ~\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\base.py:
↪1389, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args,␣

↪**kwargs)

1382 estimator._validate_params()
1384 with config_context(
1385 skip_parameter_validation=(
1386 prefer_skip_nested_validation or global_skip_validation
1387 )
1388 ):
-> 1389 return fit_method(estimator, *args, **kwargs)

File␣
↪~\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\ensemble\_bagging.

↪py:389, in BaseBagging.fit(self, X, y, sample_weight, **fit_params)

386 sample_weight = _check_sample_weight(sample_weight, X, dtype=None)

387 fit_params["sample_weight"] = sample_weight
--> 389 return self._fit(X, y, max_samples=self.max_samples, **fit_params)

File␣
↪~\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\ensemble\_bagging.
↪py:532, in BaseBagging._fit(self, X, y, max_samples, max_depth, check_input,␣

↪**fit_params)

529 seeds = random_state.randint(MAX_INT, size=n_more_estimators)

530 self._seeds = seeds

22
--> 532 all_results = Parallel(
533 n_jobs=n_jobs, verbose=self.verbose, **self._parallel_args()
534 )(
535 delayed(_parallel_build_estimators)(
536 n_estimators[i],
537 self,
538 X,
539 y,
540 seeds[starts[i] : starts[i + 1]],
541 total_n_estimators,
542 verbose=self.verbose,
543 check_input=check_input,
544 fit_params=routed_params.estimator.fit,
545 )
546 for i in range(n_jobs)
547 )
549 # Reduce
550 self.estimators_ += list(
551 itertools.chain.from_iterable(t[0] for t in all_results)
552 )

File␣
↪~\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\utils\parallel.

↪py:77, in Parallel.call(self, iterable)

72 config = get_config()
73 iterable_with_config = (
74 (_with_config(delayed_func, config), args, kwargs)
75 for delayed_func, args, kwargs in iterable
76 )
---> 77 return super().__call__(iterable_with_config)

File ~\AppData\Local\Programs\Python\Python313\Lib\site-packages\joblib\parallel.
↪py:1918, in Parallel.__call__(self, iterable)

1916 output = self._get_sequential_output(iterable)

1917 next(output)
-> 1918 return output if self.return_generator else list(output)
1920 # Let's create an ID that uniquely identifies the current call. If the
1921 # call is interrupted early and that the same instance is immediately
1922 # re-used, this id will be used to prevent workers that were
1923 # concurrently finalizing a task from the previous call to run the
1924 # callback.
1925 with self._lock:

File ~\AppData\Local\Programs\Python\Python313\Lib\site-packages\joblib\parallel.
↪py:1847, in Parallel._get_sequential_output(self, iterable)

1845 self.n_dispatched_batches += 1
1846 self.n_dispatched_tasks += 1
-> 1847 res = func(*args, **kwargs)

23
1848 self.n_completed_tasks += 1
1849 self.print_progress()

File␣
↪~\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\utils\parallel.

↪py:139, in _FuncWrapper.call(self, *args, **kwargs)

137 config = {}
138 with config_context(**config):
--> 139 return self.function(*args, **kwargs)

File␣
↪~\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\ensemble\_bagging.
↪py:189, in _parallel_build_estimators(n_estimators, ensemble, X, y, seeds,␣

↪total_n_estimators, verbose, check_input, fit_params)

187 fit_params_["sample_weight"] = curr_sample_weight

188 X_ = X[:, features] if requires_feature_indexing else X
--> 189 estimator_fit(X_, y, **fit_params_)
190 else:
191 # cannot use sample_weight, so use indexing
192 y_ = _safe_indexing(y, indices)

File ~\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\base.py:
↪1389, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args,␣

↪**kwargs)

File␣
↪~\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\tree\_classes.

↪py:1024, in DecisionTreeClassifier.fit(self, X, y, sample_weight, check_input)

993 @_fit_context(prefer_skip_nested_validation=True)
994 def fit(self, X, y, sample_weight=None, check_input=True):
995 """Build a decision tree classifier from the training set (X, y).
996
997 Parameters
(…) 1021 Fitted estimator.
1022 """
-> 1024 super()._fit(
1025 X,
1026 y,
1027 sample_weight=sample_weight,
1028 check_input=check_input,
1029 )
1030 return self

24
File␣
↪~\AppData\Local\Programs\Python\Python313\Lib\site-packages\sklearn\tree\_classes.
↪py:472, in BaseDecisionTree._fit(self, X, y, sample_weight, check_input,␣

↪missing_values_in_feature_mask)

461 else:
462 builder = BestFirstTreeBuilder(
463 splitter,
464 min_samples_split,
(…) 469 self.min_impurity_decrease,
470 )
--> 472␣
↪builder.build(self.tree_, X, y, sample_weight, missing_values_in_feature_mask)

474 if self.n_outputs_ == 1 and is_classifier(self):

475 self.n_classes_ = self.n_classes_[0]

KeyboardInterrupt:

[180]: # Convert dictionary keys to a list

algorithms = list(clfs.keys())

# Ensure all lists have the same length (use the minimum length)
min_length = min(len(algorithms), len(accuracy_scores), len(precision_scores))

# Create the DataFrame with consistently sized lists

performance_df = pd.DataFrame({
'Algorithm': algorithms[:min_length],
'Accuracy': accuracy_scores[:min_length],
'Precision': precision_scores[:min_length]
}).sort_values('Precision', ascending=False)

[181]: performance_df

[181]: Algorithm Accuracy Precision

5 RF 0.971954 0.990991
2 NB 0.970019 0.973451
4 lr 0.976789 0.952381
6 AdaBoost 0.944874 0.893204
3 DTC 0.951644 0.892857
1 KNR 0.931335 0.776860
0 SVC 0.866538 0.000000

[182]: performance_df1 = pd.melt(performance_df,id_vars = 'Algorithm')

[183]: performance_df1

25
[183]: Algorithm variable value
0 RF Accuracy 0.971954
1 NB Accuracy 0.970019
2 lr Accuracy 0.976789
3 AdaBoost Accuracy 0.944874
4 DTC Accuracy 0.951644
5 KNR Accuracy 0.931335
6 SVC Accuracy 0.866538
7 RF Precision 0.990991
8 NB Precision 0.973451
9 lr Precision 0.952381
10 AdaBoost Precision 0.893204
11 DTC Precision 0.892857
12 KNR Precision 0.776860
13 SVC Precision 0.000000

[184]: # Create the categorical plot

sns.catplot(x='Algorithm', y='value',
hue='variable', data=performance_df1,
kind='bar', height=5)

# Set y-axis limits (use ylim, not lim)

plt.ylim(0.5, 1.0)

# Rotate x-axis labels

plt.xticks(rotation='vertical')

# Show the plot

plt.show()

26
[185]: # First convert clfs.keys() to a list
algorithms = list(clfs.keys())

# Calculate the minimum length

min_length = min(len(algorithms), len(accuracy_scores), len(precision_scores))

# Now create the DataFrame with consistent lengths

temp_df = pd.DataFrame({
'Algorithm': algorithms[:min_length],
'Accuracy_max_ft_3000': accuracy_scores[:min_length],
'Precision_max_ft_3000': precision_scores[:min_length]
}).sort_values('Precision_max_ft_3000', ascending=False)

[186]: # First, convert clfs.keys() to a list

algorithms = list(clfs.keys())

27
# Calculate the minimum length
min_length = min(len(algorithms), len(accuracy_scores), len(precision_scores))

# Now create the DataFrame using only the first min_length elements of each list
temp_df = pd.DataFrame({
'Algorithm': algorithms[:min_length],
'Accuracy_scaling': accuracy_scores[:min_length],
'Precision_scaling': precision_scores[:min_length]
}).sort_values('Precision_scaling', ascending=False)

[187]: new_df = performance_df.merge(temp_df,on='Algorithm')

[188]: new_df_scaled = new_df.merge(temp_df,on='Algorithm')

[189]: new_df_scaled = new_df.merge(temp_df,on='Algorithm')

[190]: # First, convert clfs.keys() to a list

algorithms = list(clfs.keys())

# Calculate the minimum length

min_length = min(len(algorithms), len(accuracy_scores), len(precision_scores))

# Now create the DataFrame using only the first min_length elements of each list
temp_df = pd.DataFrame({
'Algorithm': algorithms[:min_length],
'Accuracy_num_chars': accuracy_scores[:min_length],
'Precision_num_chars': precision_scores[:min_length]
}).sort_values('Precision_num_chars', ascending=False)

[191]: new_df_scaled.merge(temp_df,on='Algorithm')

[191]: Algorithm Accuracy Precision Accuracy_scaling_x Precision_scaling_x \

0 RF 0.971954 0.990991 0.971954 0.990991
1 NB 0.970019 0.973451 0.970019 0.973451
2 lr 0.976789 0.952381 0.976789 0.952381
3 AdaBoost 0.944874 0.893204 0.944874 0.893204
4 DTC 0.951644 0.892857 0.951644 0.892857
5 KNR 0.931335 0.776860 0.931335 0.776860
6 SVC 0.866538 0.000000 0.866538 0.000000

Accuracy_scaling_y Precision_scaling_y Accuracy_num_chars \

0 0.971954 0.990991 0.971954
1 0.970019 0.973451 0.970019
2 0.976789 0.952381 0.976789
3 0.944874 0.893204 0.944874
4 0.951644 0.892857 0.951644

28
5 0.931335 0.776860 0.931335
6 0.866538 0.000000 0.866538

Precision_num_chars
0 0.990991
1 0.973451
2 0.952381
3 0.893204
4 0.892857
5 0.776860
6 0.000000

[192]: new_df_scaled

[192]: Algorithm Accuracy Precision Accuracy_scaling_x Precision_scaling_x \

Accuracy_scaling_y Precision_scaling_y
0 0.971954 0.990991
1 0.970019 0.973451
2 0.976789 0.952381
3 0.944874 0.893204
4 0.951644 0.892857
5 0.931335 0.776860
6 0.866538 0.000000

[193]: # Model initialization

rf = RandomForestClassifier(n_estimators=50, random_state=2)
bnb = BernoulliNB()
lr = LogisticRegression(solver='liblinear', penalty='l1')
from sklearn.ensemble import VotingClassifier

[194]: voting = VotingClassifier(estimators=[('rf', rf), ('bnb', bnb), ('lr', lr)],␣

↪voting='soft')

[195]: voting.fit(X_train,y_train)

[195]: VotingClassifier(estimators=[('rf',
RandomForestClassifier(n_estimators=50,
random_state=2)),
('bnb', BernoulliNB()),

29
('lr',
LogisticRegression(penalty='l1',
solver='liblinear'))],
voting='soft')

[196]: VotingClassifier(estimators=[('rf',
RandomForestClassifier(n_estimators=100,
random_state=2)),
('bnb', BernoulliNB()),
('lr', LogisticRegression())],
voting='soft')

[196]: VotingClassifier(estimators=[('rf', RandomForestClassifier(random_state=2)),

('bnb', BernoulliNB()),
('lr', LogisticRegression())],
voting='soft')

[197]: y_pred = voting.predict(X_test)

print("Accuracy",accuracy_score(y_test,y_pred))
print("Precision",precision_score(y_test,y_pred))

Accuracy 0.9758220502901354
Precision 1.0

[198]: estimators = [('rf', RandomForestClassifier(n_estimators=100, random_state=2)),

('bnb', BernoulliNB()),
('lr', LogisticRegression())]
final_estimator = RandomForestClassifier()

[199]: from sklearn.ensemble import StackingClassifier

[200]: clf = StackingClassifier(estimators=estimators, final_estimator=final_estimator)

[201]: # If you're using LogisticRegression, increase max_iter

from sklearn.linear_model import LogisticRegression

# Increase max_iter (default is usually 100)

clf = LogisticRegression(max_iter=1000, solver='lbfgs')

[202]: clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("Accuracy",accuracy_score(y_test,y_pred))
print("Precision",precision_score(y_test,y_pred))

Accuracy 0.97678916827853
Precision 0.9672131147540983

30
[246]: import pickle

# Save the TF-IDF vectorizer

with open('vectorizer.pkl', 'wb') as f:
pickle.dump(tfidf, f)

# Save the Bernoulli Naive Bayes model

with open('model.pkl', 'wb') as f:
pickle.dump(bnb, f)

[ ]:

Backup Exec Licensing Guide
No ratings yet
Backup Exec Licensing Guide
12 pages
Sms Spam Detection
No ratings yet
Sms Spam Detection
23 pages
Notebook - Text Classification
No ratings yet
Notebook - Text Classification
7 pages
SMS Spam Prediction
No ratings yet
SMS Spam Prediction
18 pages
Building A Powered Ai and Spam Caller
No ratings yet
Building A Powered Ai and Spam Caller
7 pages
Machine Learning Project Spam SMS Classification 1684945672
No ratings yet
Machine Learning Project Spam SMS Classification 1684945672
18 pages
AI Phash3
No ratings yet
AI Phash3
11 pages
Sms Spam Detection
No ratings yet
Sms Spam Detection
7 pages
Fam PR-10
No ratings yet
Fam PR-10
4 pages
Spam Detector
No ratings yet
Spam Detector
4 pages
ML Assignment 4
No ratings yet
ML Assignment 4
10 pages
Sms-Spam-Filter Code
No ratings yet
Sms-Spam-Filter Code
1 page
Code
No ratings yet
Code
6 pages
Information Security Awareness - Refresher Course
100% (2)
Information Security Awareness - Refresher Course
83 pages
Spam Detection
No ratings yet
Spam Detection
10 pages
NLP - Colaboratory
No ratings yet
NLP - Colaboratory
14 pages
Sentiments Analysis Code Analysis
No ratings yet
Sentiments Analysis Code Analysis
42 pages
Spam Sms Detection 2
No ratings yet
Spam Sms Detection 2
8 pages
Python CA 4
No ratings yet
Python CA 4
9 pages
Twitter Sentiment Analysis
No ratings yet
Twitter Sentiment Analysis
13 pages
Email Spam Detection
No ratings yet
Email Spam Detection
3 pages
Shreya Srivastava-27
No ratings yet
Shreya Srivastava-27
3 pages
Mail Spam
No ratings yet
Mail Spam
4 pages
Ai - Phase 3
No ratings yet
Ai - Phase 3
9 pages
NLP - (1) (1) .Ipynb - Colab
No ratings yet
NLP - (1) (1) .Ipynb - Colab
10 pages
AI Phase4
No ratings yet
AI Phase4
11 pages
Email Spam Detection Final Presentation-21BSCHH010002
No ratings yet
Email Spam Detection Final Presentation-21BSCHH010002
17 pages
ML Practical 2D
No ratings yet
ML Practical 2D
6 pages
Emotion Classification With DistilBERT
No ratings yet
Emotion Classification With DistilBERT
25 pages
Thesis Final - Pham Dung - Quang Anh - Ver2
No ratings yet
Thesis Final - Pham Dung - Quang Anh - Ver2
30 pages
Artificial Neural Network Code
No ratings yet
Artificial Neural Network Code
3 pages
Methodology
No ratings yet
Methodology
9 pages
Lab 78
No ratings yet
Lab 78
6 pages
Adithiyaa BR 23MBA0018 SMA DA Text Mining PDF
No ratings yet
Adithiyaa BR 23MBA0018 SMA DA Text Mining PDF
6 pages
Spam Classifier With LSTM - Ipynb
No ratings yet
Spam Classifier With LSTM - Ipynb
44 pages
DSBA+Master+Codebook+ +Text+Mining+&+TSF
No ratings yet
DSBA+Master+Codebook+ +Text+Mining+&+TSF
11 pages
Python 21to30
No ratings yet
Python 21to30
9 pages
Tsarecord
No ratings yet
Tsarecord
22 pages
Machine Learning Learning With Email Spam Detection
No ratings yet
Machine Learning Learning With Email Spam Detection
5 pages
Lab 3 ML
No ratings yet
Lab 3 ML
19 pages
Manual
No ratings yet
Manual
48 pages
Tweet Emotion Recognition: NLP With Tensorflow
No ratings yet
Tweet Emotion Recognition: NLP With Tensorflow
10 pages
Unstructured
No ratings yet
Unstructured
37 pages
Data Science Project
No ratings yet
Data Science Project
34 pages
Email Spam Detection System Using Logistic Regression
No ratings yet
Email Spam Detection System Using Logistic Regression
6 pages
Naive Bayes
No ratings yet
Naive Bayes
11 pages
AI-PRACTICE SHEET (Annual Exam)
No ratings yet
AI-PRACTICE SHEET (Annual Exam)
2 pages
Tweet-Sentiment-Extraction - Exploratory Data Analysis
No ratings yet
Tweet-Sentiment-Extraction - Exploratory Data Analysis
11 pages
2 - Jupyter Notebook
No ratings yet
2 - Jupyter Notebook
6 pages
Email Spam Classifier
No ratings yet
Email Spam Classifier
22 pages
Tutorial 2
No ratings yet
Tutorial 2
82 pages
Quiz 2
No ratings yet
Quiz 2
11 pages
Implemention of Sms Spam Filtering
No ratings yet
Implemention of Sms Spam Filtering
27 pages
Codesrepl
No ratings yet
Codesrepl
16 pages
BAET Record
No ratings yet
BAET Record
19 pages
Machine Learning Code Explanation
No ratings yet
Machine Learning Code Explanation
33 pages
7.email Spam Filtering Using Naive Bayes Classifier
No ratings yet
7.email Spam Filtering Using Naive Bayes Classifier
14 pages
DSBD 7 Ass
No ratings yet
DSBD 7 Ass
9 pages
NLP Practical Three
No ratings yet
NLP Practical Three
8 pages
Profound Python Libraries
From Everand
Profound Python Libraries
Onder Teker
No ratings yet
Microsoft Visual Basic Interview Questions: Microsoft VB Certification Review
From Everand
Microsoft Visual Basic Interview Questions: Microsoft VB Certification Review
Equity Press
No ratings yet
Chapter 13
No ratings yet
Chapter 13
10 pages
Gillette Pepsi Cola Media Kit
No ratings yet
Gillette Pepsi Cola Media Kit
7 pages
Python Syntax
No ratings yet
Python Syntax
22 pages
FIT Syllabus
No ratings yet
FIT Syllabus
3 pages
(Patchapk) Rebuiding Apk - Error - Rebuilding The APK May Have Failed. Read The Following Output To Determine If Apktool Actually Had An Error
No ratings yet
(Patchapk) Rebuiding Apk - Error - Rebuilding The APK May Have Failed. Read The Following Output To Determine If Apktool Actually Had An Error
4 pages
Wolftrack: Installation Manual
No ratings yet
Wolftrack: Installation Manual
19 pages
Polynomials 3
No ratings yet
Polynomials 3
11 pages
CheXpress CX30 User Manual - Rev 042516
No ratings yet
CheXpress CX30 User Manual - Rev 042516
24 pages
Brochure Rosemount Level Measurement Solutions en 76356
No ratings yet
Brochure Rosemount Level Measurement Solutions en 76356
20 pages
Cmos Design Using LT Spice 6
No ratings yet
Cmos Design Using LT Spice 6
7 pages
Untitled
No ratings yet
Untitled
2 pages
How Hybrid Working From Home Works Out
No ratings yet
How Hybrid Working From Home Works Out
52 pages
Autocad 2d 2
No ratings yet
Autocad 2d 2
15 pages
Capstone Project 2-SQL-DataETL
No ratings yet
Capstone Project 2-SQL-DataETL
7 pages
Video Visit Guide
No ratings yet
Video Visit Guide
12 pages
Social Media Regulation Freedom of Expression and Civic Space in Nigeria
No ratings yet
Social Media Regulation Freedom of Expression and Civic Space in Nigeria
11 pages
Cs It - Post Gate 2023 Iit Made Easy
No ratings yet
Cs It - Post Gate 2023 Iit Made Easy
59 pages
LG Xh-t9546 DVD Home Theater
No ratings yet
LG Xh-t9546 DVD Home Theater
57 pages
Channel Assignment: Frequency Management and
No ratings yet
Channel Assignment: Frequency Management and
12 pages
BTCO12107 Pps
No ratings yet
BTCO12107 Pps
9 pages
Pattern Program
No ratings yet
Pattern Program
7 pages
Student Login Details For PG SEM I 2023 - GEOGRAPHY
No ratings yet
Student Login Details For PG SEM I 2023 - GEOGRAPHY
36 pages
Vitara Service Manual
No ratings yet
Vitara Service Manual
835 pages
Oracle - Overview of Oracle Spatial
No ratings yet
Oracle - Overview of Oracle Spatial
20 pages
Mastercam - X4 - Art Training Tutorial
No ratings yet
Mastercam - X4 - Art Training Tutorial
28 pages
Ljs Trace
No ratings yet
Ljs Trace
15 pages
The Impact of Product
No ratings yet
The Impact of Product
26 pages
Bugreport NAM L29 HUAWEINAM L29 2025 02 04 08 20 40 Dumpstate - Log 9524
No ratings yet
Bugreport NAM L29 HUAWEINAM L29 2025 02 04 08 20 40 Dumpstate - Log 9524
40 pages
Prefilled
No ratings yet
Prefilled
18 pages