Spam Detector

The document describes building a spam detector using Naive Bayes classification on SMS text data. It loads and explores an SMS dataset labeled as spam or ham. It preprocesses the text by tokenizing, removing stop words, building count vectors and weighting them by IDF. It trains a Naive Bayes model on the cleaned data and evaluates it on a held-out test split, scoring about 0.93 (the evaluator is used with its default metric, which is F1 rather than accuracy).

Uploaded by

Mahmoud Abdel Ghani

1/17/23, 9:58 AM Spam Detector

Ham vs Spam Detector with Naive Bayes

Naive Bayes applies Bayes' theorem, under the assumption that the attributes are independent of each other given the class, to predict the probability of each class from a set of attributes. The algorithm is mostly used in text classification and in problems with multiple classes.
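
As a rough illustration of the idea (not the Spark implementation), a multinomial Naive Bayes classifier scores each class by adding the log prior to the log likelihoods of the tokens in the message, here with add-one smoothing. The toy messages and the helper names `train_nb` and `predict_nb` are made up for this sketch.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, smoothing=1.0):
    """docs is a list of (label, tokens); returns (log priors, log likelihoods)."""
    class_counts = Counter(label for label, _ in docs)
    token_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in docs:
        token_counts[label].update(tokens)
        vocab.update(tokens)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        # smoothed denominator: tokens seen in class c plus one pseudo-count per vocab term
        total = sum(token_counts[c].values()) + smoothing * len(vocab)
        log_like[c] = {t: math.log((token_counts[c][t] + smoothing) / total)
                       for t in vocab}
    return log_prior, log_like

def predict_nb(model, tokens):
    log_prior, log_like = model
    scores = {}
    for c in log_prior:
        # tokens never seen during training are skipped
        scores[c] = log_prior[c] + sum(log_like[c][t] for t in tokens if t in log_like[c])
    return max(scores, key=scores.get)

# hypothetical toy data, not taken from the SMS dataset
toy_docs = [
    ("spam", ["free", "entry", "win", "prize"]),
    ("spam", ["urgent", "win", "cash", "free"]),
    ("ham", ["ok", "see", "you", "home"]),
    ("ham", ["home", "soon", "ok"]),
]
model = train_nb(toy_docs)
print(predict_nb(model, ["free", "cash"]))
print(predict_nb(model, ["ok", "home"]))
```

Working in log space avoids underflow when many small probabilities are multiplied together.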

In [1]: import findspark

findspark.init(r'C:\spark')  # raw string so the backslash is not treated as an escape

In [2]: from pyspark.sql import SparkSession

In [3]: spark = SparkSession.builder.appName('spamdetector').master('local[*]').getOrCreate()

Load and Explore Data

In [6]: data = spark.read.csv('data/SMSSpamCollection', inferSchema=True, sep='\t')

In [7]: data.show(5)

+----+--------------------+
| _c0|                 _c1|
+----+--------------------+
| ham|Go until jurong p...|
| ham|Ok lar... Joking ...|
|spam|Free entry in 2 a...|
| ham|U dun say so earl...|
| ham|Nah I don't think...|
+----+--------------------+
only showing top 5 rows

In [8]: data = data.withColumnRenamed('_c0', 'class').withColumnRenamed('_c1', 'text')

In [9]: # we have sentences that are labelled as ham or spam

data.show()

localhost:8888/nbconvert/html/Documents/Spark_tut_usecase/Spam Detector.ipynb?download=false 1/4

| spam|England v Macedon...|
+-----+--------------------+
only showing top 20 rows

In [10]: from pyspark.sql.functions import length

In [11]: data = data.withColumn('length', length(data['text']))

In [12]: data.show()

+-----+--------------------+------+
|class|                text|length|
+-----+--------------------+------+
|  ham|Go until jurong p...|   111|
|  ham|Ok lar... Joking ...|    29|
| spam|Free entry in 2 a...|   155|
|  ham|U dun say so earl...|    49|
|  ham|Nah I don't think...|    61|
| spam|FreeMsg Hey there...|   147|
|  ham|Even my brother i...|    77|
|  ham|As per your reque...|   160|
| spam|WINNER!! As a val...|   157|
| spam|Had your mobile 1...|   154|
|  ham|I'm gonna be home...|   109|
| spam|SIX chances to wi...|   136|
| spam|URGENT! You have ...|   155|
|  ham|I've been searchi...|   196|
|  ham|I HAVE A DATE ON ...|    35|
| spam|XXXMobileMovieClu...|   149|
|  ham|Oh k...i'm watchi...|    26|
|  ham|Eh u remember how...|    81|
|  ham| Fine if thats th...|    56|
| spam|England v Macedon...|   155|
+-----+--------------------+------+
only showing top 20 rows

In [13]: # on average we see that spam is almost twice as long as ham!

data.groupBy('class').mean().show()

+-----+-----------------+
|class|      avg(length)|
+-----+-----------------+
|  ham|71.45431945307645|
| spam|138.6706827309237|
+-----+-----------------+

Pre-processing Text

In [14]: from pyspark.ml.feature import (
    Tokenizer, StopWordsRemover, CountVectorizer, IDF, StringIndexer)

In [17]: # tokenize each sentence into words

tokenizer = Tokenizer(inputCol='text', outputCol='token_text')
# remove stop words
stop_remove = StopWordsRemover(inputCol='token_text', outputCol='stop_tokens')
# convert each sentence to a vector of token counts
# (the vector length equals the vocabulary size)
count_vec = CountVectorizer(inputCol='stop_tokens', outputCol='c_vec')
# weight the count vectors by inverse document frequency (IDF), giving TF-IDF values
idf = IDF(inputCol='c_vec', outputCol='idf')
# convert the class label to a numeric target
ham_spam_to_numeric = StringIndexer(inputCol='class', outputCol='label')
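
To make the text stages concrete, here is a small pure-Python sketch of tokenizing, stop-word removal, count vectors and IDF weighting. It is only an approximation of what the Spark stages do: the stop-word set is a tiny made-up list, the helper names are hypothetical, and the IDF formula used is the smoothed form log((m + 1) / (df + 1)) given in the Spark MLlib documentation, where m is the number of documents and df the number of documents containing the term.

```python
import math
from collections import Counter

STOP_WORDS = {"a", "the", "to", "in", "is", "you"}  # tiny illustrative list, not Spark's default

def tokenize(text):
    # lowercase and split on whitespace
    return text.lower().split()

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def build_vocab(docs):
    # index terms by total count across all documents, most frequent first
    counts = Counter(t for doc in docs for t in doc)
    return {term: i for i, (term, _) in enumerate(counts.most_common())}

def count_vector(tokens, vocab):
    # one slot per vocabulary term, holding how often it occurs in this message
    vec = [0] * len(vocab)
    for t in tokens:
        if t in vocab:
            vec[vocab[t]] += 1
    return vec

def idf_weights(count_vectors):
    m = len(count_vectors)
    n_terms = len(count_vectors[0])
    doc_freq = [sum(1 for v in count_vectors if v[j] > 0) for j in range(n_terms)]
    return [math.log((m + 1) / (df + 1)) for df in doc_freq]

# hypothetical messages, not taken from the dataset
messages = ["Free entry to win a prize", "You win free cash", "See you in the office"]
docs = [remove_stop_words(tokenize(m)) for m in messages]
vocab = build_vocab(docs)
c_vecs = [count_vector(d, vocab) for d in docs]
idf = idf_weights(c_vecs)
# the IDF stage outputs the counts scaled by the IDF weights (TF-IDF)
tfidf = [[c * w for c, w in zip(v, idf)] for v in c_vecs]
print(vocab)
print(tfidf[1])
```

Terms that appear in many messages get a smaller weight than terms that appear in only a few, which is why the notebook feeds the IDF-weighted vectors to the classifier alongside the raw counts.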

In [18]: from pyspark.ml.feature import VectorAssembler

In [20]: clean_up = VectorAssembler(inputCols=['c_vec', 'idf', 'length'], outputCol='features')
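
`VectorAssembler` concatenates its input columns into a single feature vector. The sketch below mimics that on sparse vectors written as `(size, {index: value})` pairs; the `assemble` helper and the toy numbers are hypothetical and not part of PySpark. By the same logic, the feature size of 26847 shown in the output further down is consistent with two vocabulary-sized vectors of 13423 entries each plus one slot for `length`.

```python
def assemble(parts):
    """Concatenate sparse vectors, given as (size, {index: value}), and plain numbers."""
    size = 0
    entries = {}
    for part in parts:
        if isinstance(part, tuple):
            part_size, values = part
            # shift each index by the number of slots already used
            for i, v in values.items():
                entries[size + i] = v
            size += part_size
        else:
            # a scalar column takes exactly one slot
            entries[size] = float(part)
            size += 1
    return size, entries

# toy inputs: a 5-term vocabulary and a message of 29 characters
c_vec = (5, {0: 1.0, 3: 2.0})
idf = (5, {0: 0.29, 3: 1.39})
length = 29
size, entries = assemble([c_vec, idf, length])
print(size, entries)
```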

In [21]: from pyspark.ml.classification import NaiveBayes

In [22]: nb = NaiveBayes()

In [23]: from pyspark.ml import Pipeline

In [24]: data_prep_pipe = Pipeline(stages=[ham_spam_to_numeric, tokenizer, stop_remove, count_ve

In [25]: cleaner = data_prep_pipe.fit(data)

In [26]: clean_data = cleaner.transform(data)

In [27]: clean_data = clean_data.select('label', 'features')

In [28]: clean_data.show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(26847,[7,11,31,6...|
|  0.0|(26847,[0,24,297,...|
|  1.0|(26847,[2,13,19,3...|
|  0.0|(26847,[0,70,80,1...|
|  0.0|(26847,[36,134,31...|
|  1.0|(26847,[10,60,139...|
|  0.0|(26847,[10,53,103...|
|  0.0|(26847,[125,184,4...|
|  1.0|(26847,[1,47,118,...|
|  1.0|(26847,[0,1,13,27...|
|  0.0|(26847,[18,43,120...|
|  1.0|(26847,[8,17,37,8...|
|  1.0|(26847,[13,30,47,...|
|  0.0|(26847,[39,96,217...|
|  0.0|(26847,[552,1697,...|
|  1.0|(26847,[30,109,11...|
|  0.0|(26847,[82,214,47...|
|  0.0|(26847,[0,2,49,13...|
|  0.0|(26847,[0,74,105,...|
|  1.0|(26847,[4,30,33,5...|
+-----+--------------------+
only showing top 20 rows

Train Naive Bayes Classifier

In [29]: training, test = clean_data.randomSplit([0.7, 0.3])
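
The split above is random and no seed is set, so the rows in each split (and therefore the final score) can change between runs; `randomSplit` accepts an optional `seed` argument for repeatable splits. The pure-Python sketch below shows the general idea of a weighted random split in which each row is assigned independently, so the split sizes only approximate the 70/30 weights. The `random_split` helper is hypothetical and is not Spark's implementation.

```python
import random

def random_split(rows, weights, seed=None):
    rng = random.Random(seed)
    total = sum(weights)
    # cumulative upper bounds of each split within [0, 1]
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for k, b in enumerate(bounds):
            # the last split catches any rounding error in the bounds
            if x < b or k == len(bounds) - 1:
                splits[k].append(row)
                break
    return splits

rows = list(range(1000))
train, test = random_split(rows, [0.7, 0.3], seed=42)
train_again, _ = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test), train == train_again)
```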

In [30]: spam_detector = nb.fit(training)


In [31]: data.printSchema()

root
|-- class: string (nullable = true)
|-- text: string (nullable = true)
|-- length: integer (nullable = true)

In [32]: test_results = spam_detector.transform(test)

Evaluate Results

In [33]: from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [34]: # the default metricName is 'f1'; pass metricName='accuracy' to report accuracy instead
acc_eval = MulticlassClassificationEvaluator()

In [35]: acc = acc_eval.evaluate(test_results)

In [36]: acc

Out[36]: 0.9341932172490841
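
Because `MulticlassClassificationEvaluator` reports a weighted F1 score unless `metricName` is set, the value above is not necessarily the plain accuracy. The sketch below computes both metrics by hand on made-up labels and predictions to show that they can differ; the function names and the data are illustrative only.

```python
from collections import Counter

def accuracy(labels, preds):
    return sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)

def weighted_f1(labels, preds):
    # per-class F1, averaged with weights proportional to each class's share of the labels
    support = Counter(labels)
    total = 0.0
    for c, n in support.items():
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * n / len(labels)
    return total

# hypothetical test set: eight ham (0) and two spam (1) messages
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
preds = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
print(accuracy(labels, preds))
print(weighted_f1(labels, preds))
```

On imbalanced data such as ham versus spam, comparing the two numbers helps show whether errors are concentrated in the minority class.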
