Spam Detector
In [7]: data.show(5)
+----+--------------------+
| _c0|                 _c1|
+----+--------------------+
| ham|Go until jurong p...|
| ham|Ok lar... Joking ...|
|spam|Free entry in 2 a...|
| ham|U dun say so earl...|
| ham|Nah I don't think...|
+----+--------------------+
only showing top 5 rows
+-----+--------------------+
|class|                text|
+-----+--------------------+
|  ham|Go until jurong p...|
|  ham|Ok lar... Joking ...|
| spam|Free entry in 2 a...|
|  ham|U dun say so earl...|
|  ham|Nah I don't think...|
| spam|FreeMsg Hey there...|
|  ham|Even my brother i...|
|  ham|As per your reque...|
| spam|WINNER!! As a val...|
| spam|Had your mobile 1...|
|  ham|I'm gonna be home...|
| spam|SIX chances to wi...|
| spam|URGENT! You have ...|
|  ham|I've been searchi...|
|  ham|I HAVE A DATE ON ...|
| spam|XXXMobileMovieClu...|
|  ham|Oh k...i'm watchi...|
|  ham|Eh u remember how...|
|  ham|Fine if that's th...|
+-----+--------------------+
In [12]: data.show()
+-----+--------------------+------+
|class|                text|length|
+-----+--------------------+------+
|  ham|Go until jurong p...|   111|
|  ham|Ok lar... Joking ...|    29|
| spam|Free entry in 2 a...|   155|
|  ham|U dun say so earl...|    49|
|  ham|Nah I don't think...|    61|
| spam|FreeMsg Hey there...|   147|
|  ham|Even my brother i...|    77|
|  ham|As per your reque...|   160|
| spam|WINNER!! As a val...|   157|
| spam|Had your mobile 1...|   154|
|  ham|I'm gonna be home...|   109|
| spam|SIX chances to wi...|   136|
| spam|URGENT! You have ...|   155|
|  ham|I've been searchi...|   196|
|  ham|I HAVE A DATE ON ...|    35|
| spam|XXXMobileMovieClu...|   149|
|  ham|Oh k...i'm watchi...|    26|
|  ham|Eh u remember how...|    81|
|  ham|Fine if that's th...|    56|
| spam|England v Macedon...|   155|
+-----+--------------------+------+
only showing top 20 rows
+-----+-----------------+
|class|      avg(length)|
+-----+-----------------+
|  ham|71.45431945307645|
| spam|138.6706827309237|
+-----+-----------------+
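The table above compares the average message length per class. The cells that produced it are not shown, so the following is only a plain-Python sketch of the same calculation on a few made-up rows (the messages and the `average_length_by_class` helper are hypothetical). In PySpark this step is commonly written with `length` from `pyspark.sql.functions` followed by `groupBy('class').mean()`, but that is an assumption about the missing cells.

```python
from collections import defaultdict

# Hypothetical (class, text) rows, for illustration only.
rows = [
    ("ham", "See you at lunch"),
    ("ham", "Ok"),
    ("spam", "WINNER! Claim your free prize now"),
    ("spam", "Free entry, text WIN to enter"),
]


def average_length_by_class(rows):
    """Return {class: mean character length of its messages}."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for label, text in rows:
        totals[label] += len(text)  # character count, like the length column
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}


averages = average_length_by_class(rows)
print(averages)
```

On the real data the gap is large (about 139 characters for spam against about 71 for ham), which is why message length looks like a useful extra feature.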
Pre-processing Text
In [14]: from pyspark.ml.feature import (
             Tokenizer, StopWordsRemover, CountVectorizer, IDF, StringIndexer)
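To make the role of these feature stages concrete, here is a rough plain-Python sketch of what tokenizing, stop-word removal, term counting, and IDF weighting do to a handful of messages. The messages and the stop-word list are made up, and the vocabulary is sorted alphabetically for readability (Spark's `CountVectorizer` orders terms by frequency instead). The IDF formula used is log((m + 1) / (df + 1)), which is how Spark ML documents its `IDF` stage.

```python
import math
from collections import Counter

# Hypothetical messages and stop-word list, for illustration only.
messages = ["free prize for you", "are you home", "free free entry"]
stop_words = {"for", "you", "are"}

# Tokenizer-like step: lowercase and split on whitespace.
tokens = [m.lower().split() for m in messages]

# StopWordsRemover-like step: drop very common words.
filtered = [[t for t in doc if t not in stop_words] for doc in tokens]

# CountVectorizer-like step: build a vocabulary, then count terms per message.
vocab = sorted({t for doc in filtered for t in doc})
counts = [Counter(doc) for doc in filtered]

# IDF-like step: terms that occur in many messages get a smaller weight.
n_docs = len(filtered)
doc_freq = {t: sum(1 for doc in filtered if t in doc) for t in vocab}
idf = {t: math.log((n_docs + 1) / (doc_freq[t] + 1)) for t in vocab}

# TF-IDF weight of each term in each message.
tfidf = [{t: c[t] * idf[t] for t in c} for c in counts]

for message, weights in zip(messages, tfidf):
    print(message, "->", weights)
```

Here "free" appears in two of the three messages, so its IDF is lower than that of "home" or "prize", which each appear in only one.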
In [22]: from pyspark.ml.classification import NaiveBayes
         nb = NaiveBayes()
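For intuition about what the classifier learns, the following is a minimal multinomial Naive Bayes sketch in plain Python. It is not Spark's implementation: it works on raw token counts rather than TF-IDF vectors, the training data is made up, and `train_naive_bayes` and `predict` are hypothetical helper names. The `smoothing=1.0` default is chosen to match what Spark's `NaiveBayes` uses by default, as far as I know.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels, smoothing=1.0):
    """Fit class priors and smoothed per-class term likelihoods (in log space)."""
    vocab = {t for doc in docs for t in doc}
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Additive (Laplace) smoothing keeps unseen terms from zeroing a class.
        total = sum(term_counts[c].values()) + smoothing * len(vocab)
        log_likelihood[c] = {
            t: math.log((term_counts[c][t] + smoothing) / total) for t in vocab
        }
    return log_prior, log_likelihood


def predict(model, doc):
    """Return the class with the highest log posterior; unknown tokens are skipped."""
    log_prior, log_likelihood = model
    scores = {
        c: log_prior[c]
        + sum(log_likelihood[c][t] for t in doc if t in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Hypothetical tokenized training messages, for illustration only.
docs = [["free", "prize"], ["free", "entry"], ["home", "soon"], ["lunch", "soon"]]
labels = ["spam", "spam", "ham", "ham"]

model = train_naive_bayes(docs, labels)
print(predict(model, ["free", "entry", "prize"]))
print(predict(model, ["soon", "home"]))
```

Working in log space avoids multiplying many small probabilities together, which would underflow on longer messages.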
In [28]: clean_data.show()
+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(26847,[7,11,31,6...|
|  0.0|(26847,[0,24,297,...|
|  1.0|(26847,[2,13,19,3...|
|  0.0|(26847,[0,70,80,1...|
|  0.0|(26847,[36,134,31...|
|  1.0|(26847,[10,60,139...|
|  0.0|(26847,[10,53,103...|
|  0.0|(26847,[125,184,4...|
|  1.0|(26847,[1,47,118,...|
|  1.0|(26847,[0,1,13,27...|
|  0.0|(26847,[18,43,120...|
|  1.0|(26847,[8,17,37,8...|
|  1.0|(26847,[13,30,47,...|
|  0.0|(26847,[39,96,217...|
|  0.0|(26847,[552,1697,...|
|  1.0|(26847,[30,109,11...|
|  0.0|(26847,[82,214,47...|
|  0.0|(26847,[0,2,49,13...|
|  0.0|(26847,[0,74,105,...|
|  1.0|(26847,[4,30,33,5...|
+-----+--------------------+
only showing top 20 rows
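Two things in this output are worth decoding. The `label` column lines up row for row with the earlier `class` column: 0.0 is ham and 1.0 is spam, consistent with `StringIndexer` giving index 0 to the most frequent value by default. Each `features` cell is a sparse vector printed as `(size, [indices], [values])`, so 26847 is the total vector length and only the listed positions are non-zero. The sketch below mimics both ideas in plain Python; `index_labels` and `to_dense` are hypothetical helpers, and the tiny inputs are made up.

```python
from collections import Counter


def index_labels(labels):
    """Map each label to a float index, most frequent label first."""
    ordered = [label for label, _ in Counter(labels).most_common()]
    return {label: float(i) for i, label in enumerate(ordered)}


def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse vector into a plain list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# Hypothetical class column: ham is more frequent, so it gets 0.0.
mapping = index_labels(["ham", "ham", "spam", "ham", "spam"])
print(mapping)

# Hypothetical sparse vector: 8 slots, non-zero entries at positions 1 and 5.
dense = to_dense(8, [1, 5], [2.0, 0.5])
print(dense)
```

Storing only the non-zero entries matters here because each message uses a tiny fraction of the 26847 positions.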
In [31]: data.printSchema()
root
|-- class: string (nullable = true)
|-- text: string (nullable = true)
|-- length: integer (nullable = true)
Evaluate Results
In [33]: from pyspark.ml.evaluation import MulticlassClassificationEvaluator
In [36]: acc
Out[36]: 0.9341932172490841
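The name `acc` suggests accuracy, meaning the fraction of test messages whose predicted label matches the true one. A minimal plain-Python sketch of that metric follows, on made-up labels and predictions. Note that the cell that computed `acc` is not shown, and `MulticlassClassificationEvaluator` reports F1 unless `metricName` is set to `"accuracy"`, so the value above may be an F1 score instead.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that equal their true label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("need at least one example")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Hypothetical true labels and predictions (0.0 = ham, 1.0 = spam).
y_true = [0.0, 0.0, 1.0, 0.0, 1.0]
y_pred = [0.0, 1.0, 1.0, 0.0, 1.0]

score = accuracy(y_true, y_pred)
print(score)  # 4 of 5 predictions match
```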