Introduction
The upsurge in the volume of unwanted emails, called spam, has created an
intense need for more dependable and robust anti-spam filters. Any
promotional message or advertisement that ends up in our inbox can be
categorised as spam, since it provides no value and often irritates us.
The SMS Spam Collection is a set of SMS messages tagged for SMS spam
research. It contains 5,574 SMS messages in English, each tagged as ham
(legitimate) or spam.
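To follow along, first load the dataset into a pandas DataFrame. A minimal
loading sketch, assuming the common spam.csv distribution of the corpus (the
file name, encoding, and column layout are assumptions; adjust them to your
copy):
import pandas as pd
# Load the SMS Spam Collection; the commonly distributed CSV ships with
# extra empty columns and a latin-1 encoding (assumptions about your copy)
data = pd.read_csv('spam.csv', encoding='latin-1')
data = data.iloc[:, :2]            # keep only the label and message columns
data.columns = ['label', 'text']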
data['label'].value_counts()
# OUTPUT
ham 4825
spam 747
Name: label, dtype: int64
Preprocessing and Exploring the Dataset
If you are completely new to NLTK and Natural Language Processing (NLP),
I would recommend checking out this short article before
continuing: Introduction to Word Frequencies in NLP
Let's use the above functions to create the spam word cloud and the ham
word cloud (a minimal sketch follows below).
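The word-cloud helpers themselves aren't shown in this excerpt; here is a
minimal sketch using the wordcloud package (the package choice and plotting
details are assumptions, not necessarily the author's exact code):
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def show_wordcloud(messages, title):
    # Join all messages of one class into a single string and render it
    wc = WordCloud(width=600, height=400, background_color='white')
    wc.generate(' '.join(messages))
    plt.figure(figsize=(6, 4))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(title)
    plt.show()

show_wordcloud(data[data['label'] == 'spam']['text'], 'Spam word cloud')
show_wordcloud(data[data['label'] == 'ham']['text'], 'Ham word cloud')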
From the spam word cloud, we can see that "free" is among the words most
often used in spam.
Now, we can convert the labels ham and spam into 0 and 1 respectively so
that the machine can understand them.
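A minimal sketch of the label encoding, together with a typical
text_process helper (the helper applied in the next cell isn't defined in
this excerpt; stripping punctuation and English stopwords is the usual
approach, so treat this implementation as an assumption):
import string
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # also downloaded below; cached after the first run

# Encode the labels: ham -> 0, spam -> 1
# (spam = 1 matches the prediction helper used later in the article)
data['label'] = data['label'].map({'ham': 0, 'spam': 1})

stop_words = set(stopwords.words('english'))

def text_process(text):
    # Assumed implementation: strip punctuation, then drop English stopwords
    nopunc = ''.join(ch for ch in text if ch not in string.punctuation)
    return ' '.join(word for word in nopunc.split()
                    if word.lower() not in stop_words)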
import nltk
nltk.download('stopwords')
data['text'] = data['text'].apply(text_process)
data.head()
Now, create a data frame from the processed data before moving to the
next step.
text = pd.DataFrame(data['text'])
label = pd.DataFrame(data['label'])
TF-IDF is better than a plain CountVectorizer because it not only captures
the frequency of words in the corpus but also weights words by their
importance. We can then remove the words that matter less for the
analysis, which makes model building less complex by reducing the input
dimensions.
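Concretely, the classic TF-IDF weight of a term t in a document d is the
term frequency scaled by the log of the inverse document frequency:
\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{\text{df}(t)}
where N is the total number of messages and df(t) is the number of messages
containing t. Words that appear in almost every message get a weight near
zero, while distinctive words are boosted. (scikit-learn's TfidfVectorizer
applies a smoothed variant of the idf term by default.)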
from collections import Counter

# Count how often each word appears across all processed messages
total_counts = Counter()
for i in range(len(text)):
    for word in text.values[i][0].split(" "):
        total_counts[word] += 1
print("Total words in data set:", len(total_counts))
# OUTPUT
Total words in data set: 11305
# Sorting in decreasing order (word with the highest frequency appears first)
vocab = sorted(total_counts, key=total_counts.get, reverse=True)
print(vocab[:60])
# OUTPUT
['u', '2', 'call', 'U', 'get', 'Im', 'ur', '4', 'ltgt', 'know', 'go',
'like', 'dont', 'come', 'got', 'time', 'day', 'want', 'Ill', 'lor',
'Call', 'home', 'send', 'going', 'one', 'need', 'Ok', 'good', 'love',
'back', 'n', 'still', 'text', 'im', 'later', 'see', 'da', 'ok',
'think', 'Ì', 'free', 'FREE', 'r', 'today', 'Sorry', 'week', 'phone',
'mobile', 'cant', 'tell', 'take', 'much', 'night', 'way', 'Hey',
'reply', 'work', 'make', 'give', 'new']
# Mapping from words to index
vocab_size = len(vocab)
word2idx = {}
for i, word in enumerate(vocab):
    word2idx[word] = i
import numpy as np

# Text to Vector
def text_to_vector(text):
    # Bag-of-words count vector over the vocabulary; unknown words are skipped
    word_vector = np.zeros(vocab_size)
    for word in text.split(" "):
        idx = word2idx.get(word)
        if idx is not None:
            word_vector[idx] += 1
    return word_vector
# Convert all messages to count vectors
word_vectors = np.zeros((len(text), len(vocab)), dtype=np.int_)
for i, (_, text_) in enumerate(text.iterrows()):
    word_vectors[i] = text_to_vector(text_.iloc[0])
word_vectors.shape
# OUTPUT
(5572, 11305)
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(data['text'])
vectors.shape
# OUTPUT
(5572, 9376)
# Use the TF-IDF vectors (rather than the raw count vectors) as features
#features = word_vectors
features = vectors
Splitting into training and test set
# Split the dataset into train and test sets (15% held out for testing)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    features, data['label'], test_size=0.15, random_state=111)
Classifiers used:
1. Spam classifier using Logistic Regression
2. Spam classifier using Support Vector Machine (SVM)
3. Spam classifier using Naive Bayes
4. Spam classifier using Decision Tree
5. Spam classifier using K-Nearest Neighbors (KNN)
6. Spam classifier using Random Forest
We will make use of the scikit-learn library. It implements all of the
above algorithms; we just have to import them, and it is as easy as that.
There is no need to worry about the maths and statistics behind them.
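The training loop that produces the scores below isn't shown in this
excerpt; here is a minimal sketch that matches the output format. The
abbreviations follow the output, while the hyperparameters are illustrative
assumptions:
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# One classifier per abbreviation in the output below; hyperparameters
# here are assumptions, not necessarily the author's exact settings
clfs = {
    'SVC': SVC(kernel='sigmoid', gamma=1.0),
    'KN': KNeighborsClassifier(),
    'NB': MultinomialNB(),
    'DT': DecisionTreeClassifier(),
    'LR': LogisticRegression(solver='liblinear'),
    'RF': RandomForestClassifier(n_estimators=31),
}

pred_scores = []
for name, clf in clfs.items():
    clf.fit(X_train, y_train)
    pred_scores.append((name, [accuracy_score(y_test, clf.predict(X_test))]))

mnb = clfs['NB']  # the Naive Bayes model is reused for the predictions below
pred_scores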
# OUTPUT
[('SVC', [0.9784688995215312]),
('KN', [0.9330143540669856]),
('NB', [0.9880382775119617]),
('DT', [0.9605263157894737]),
('LR', [0.9533492822966507]),
('RF', [0.9796650717703349])]
Model predictions
# Helper to report whether a predicted label is spam (1) or ham (0)
def find(x):
    if x == 1:
        print("Message is SPAM")
    else:
        print("Message is NOT Spam")

newtext = ["Free entry"]
integers = vectorizer.transform(newtext)  # vectorize with the fitted TF-IDF
x = mnb.predict(integers)
find(x[0])  # predict returns an array; take the single prediction
# OUTPUT
Message is SPAM
Checking Classification Results with a Confusion Matrix
If you are confused by the confusion matrix, read this small article
before proceeding.
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Naive Bayes
y_pred_nb = mnb.predict(X_test)
y_true_nb = y_test
cm = confusion_matrix(y_true_nb, y_pred_nb)
f, ax = plt.subplots(figsize=(5, 5))
sns.heatmap(cm, annot=True, linewidths=0.5, linecolor="red", fmt=".0f", ax=ax)
plt.xlabel("y_pred_nb")
plt.ylabel("y_true_nb")
plt.show()
[Output: confusion-matrix heatmap for the Naive Bayes classifier]