Unsolicited Spam Detection
ABSTRACT
The frequency of cyber security incidents has increased in recent years. Attackers use spam emails as a gateway to
infiltrate government systems, renowned companies, and the websites of politicians and social organizations across
multiple nations. Identifying spam emails within large email datasets has drawn significant public attention. Existing
detection methods increasingly struggle to cope with the growing array of deceptive tactics used in spam and with
the surge in email volume. The objective of this study is to develop a novel and efficient
method for classifying large email datasets into four separate categories: Normal, Fraudulent, Harassment, and
Suspicious. In order to achieve this classification, Long Short-Term Memory (LSTM) based Gated Recurrent Units
(GRUs) are used. The proposed LSTM based GRU proves adept at capturing meaningful information from emails,
which proves valuable for forensic analysis and evidentiary purposes. The technique involves two crucial stages:
sample expansion and testing. By combining LSTMs with gated recurrent units, Spam Spoiler outperforms
existing machine learning algorithms with an accuracy of 98%. Spam Spoiler excels at analyzing e-mail content
across diverse topics, maintaining a robust and reliable classification system.
Keywords: Email Classification, Spam, Phishing, Machine Learning, Random Forest, Cyber Security, Text
Classification, Naïve Bayes.
I. INTRODUCTION
In examining electronic mail (e-mail) related crimes, a comprehensive analysis of both the email header and body becomes imperative, as the semantics of communication play a crucial role in identifying potential evidence sources. The objective is to choose the optimal model for e-mail forensic tools. This project introduces a novel and efficient approach called the E-Mail Sink API, utilizing a Long Short-Term Memory (LSTM)-based Gated Recurrent Unit (GRU) for multiclass email classification. The primary focus is on identifying harmful or unfavorable e-mails received at the e-mail server end through a deep learning-based architecture. The proposed approach concurrently models emails at various levels, including the email header, email body, character level, and word level, with the goal of distinguishing whether an email exhibits characteristics indicative of cybercrime.

Email messages traverse email servers using multiple protocols within the TCP/IP suite [1], [10], [14]. For instance, SMTP (Simple Mail Transfer Protocol) is used to send messages, while other protocols like IMAP or POP are employed to retrieve messages from a mail server. Accessing a mail account typically involves entering a valid email address, password, and mail server details for sending and receiving messages. While webmail servers often configure accounts automatically, manual configuration may be necessary when using email clients like Microsoft Outlook or Apple Mail. Additionally, entering incoming and outgoing mail servers along with correct port numbers may be required.

Despite the widespread use of the Internet for professional, social, and personal activities, there exists a subset of individuals attempting to compromise Internet-connected devices, violate privacy, and disrupt online services. Email, as a universal service utilized by over a billion people worldwide, has become a significant vulnerability. Startling statistics reveal that email remains the primary threat vector for data breaches, serving as the entry point for ninety-four percent of breaches, with an attack occurring every 39 seconds. Over 30% of phishing messages are opened, and 12% of users click on malicious links. In response to the escalating sophistication of cybercrime and its ability to bypass legacy controls, security measures must evolve accordingly.

II. RELATED WORK
In this section, we delve into previous work related to the identification of compromised machines. Our primary focus is on studies utilizing spamming activities for bot detection, followed by a brief overview of various efforts in detecting general botnets. Two recent studies [19], [20], based on email messages received by a large email service provider, investigated the global characteristics of spamming botnets, including botnet size and spamming patterns. These studies employed clustering techniques on spam messages to reveal insights into the aggregate global characteristics of spamming botnets. However, their applicability is more suited to large email service providers for understanding global botnet characteristics rather than being deployed by individual networks to identify internal compromised machines. Additionally, their approaches lack support for the online detection requirement in the network environment considered in this paper, where we aim to develop a tool for system administrators to automatically detect compromised machines.

Xie et al. developed DBSpam, an effective tool for detecting proxy-based spamming activities in a network, relying on the packet symmetry property of such activities [13]. While DBSpam identifies spam proxies translating and forwarding non-SMTP packets upstream, our goal is to identify all types of compromised machines involved in spamming. Moving on to general botnet detection schemes, BotHunter [8], developed by Gu et al., correlates the Intrusion Detection System (IDS) dialog trace in a network to detect compromised machines. It is designed based on the observation that a complete malware infection process has well-defined stages, and by correlating inbound intrusion alarms with outbound communication patterns, BotHunter identifies potentially infected machines.

In contrast to BotHunter, SPOT focuses on the economic incentives behind compromised machines and their involvement in spamming. BotSniffer [9], an anomaly-based detection system, identifies botnets by exploring spatial-temporal behavioral similarities commonly observed in IRC-based and HTTP-based botnets. On the other hand, BotMiner [7], one of the first protocol- and structure-independent botnet detection systems, classifies flows into groups based on communication and malicious activity patterns. The intersection of these groups identifies compromised machines.

Compared to existing general botnet detection systems like BotHunter, BotSniffer, and BotMiner, SPOT distinguishes itself as a lightweight compromised machine detection scheme, focusing on the economic incentives driving attackers to recruit a large number of compromised machines. SPOT leverages the Sequential Probability Ratio Test (SPRT), a simple yet powerful statistical method that has found successful application in various areas of network security, including portscan activity detection, proxy-based spamming activity detection, anomaly-based botnet detection, and MAC protocol misbehavior in wireless networks.

III. RESULTS AND DISCUSSION
The proposed approach involves data collection, preprocessing, feature extraction, parameter tuning, and classification through the LSTM-GRU model. E-mail datasets in the project are categorized into normal, harassing, suspicious, and fraudulent classes. The e-mail body is segmented at the word level, and an embedding layer is trained to map the resulting words to a sequence of vectors.

A. Long Short-Term Memory (LSTM)
LSTMs, a specialized type of Recurrent Neural Network (RNN), excel in learning long-term dependencies, overcoming the challenge of retaining information over extended periods. LSTM units, also called blocks, are the building components for the layers of an RNN, which is then referred to as an LSTM network [8]. A standard LSTM unit consists of a cell, an input gate, an output gate, and a forget gate. The cell's function involves retaining values across arbitrary time intervals, addressing the long-term memory aspect of LSTMs. The gates, resembling conventional neurons, regulate the flow of values through connections in the LSTM.
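As a rough illustration of how the cell and the three gates interact, the following is a minimal sketch of a single LSTM time step in plain Python. It assumes the commonly used textbook formulation; the names `lstm_step`, `affine`, and `zero_params`, and the layout of the parameter dictionary, are hypothetical and are not taken from the system described in this paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(gate, x, h):
    """Compute W*x + U*h + b for one gate; matrices are lists of rows."""
    W, U, b = gate
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + bi
        for w_row, u_row, bi in zip(W, U, b)
    ]

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step; returns the new hidden state and cell state."""
    i = [sigmoid(v) for v in affine(p["input"], x, h_prev)]        # input gate
    f = [sigmoid(v) for v in affine(p["forget"], x, h_prev)]       # forget gate
    o = [sigmoid(v) for v in affine(p["output"], x, h_prev)]       # output gate
    g = [math.tanh(v) for v in affine(p["candidate"], x, h_prev)]  # candidate values
    # The cell keeps a gated share of its old value and adds gated new input.
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    # The output gate decides how much of the cell is exposed as hidden state.
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c

def zero_params(n_in, n_hid):
    """Hypothetical helper: all-zero weights and biases for every gate."""
    names = ("input", "forget", "output", "candidate")
    return {
        name: (
            [[0.0] * n_in for _ in range(n_hid)],
            [[0.0] * n_hid for _ in range(n_hid)],
            [0.0] * n_hid,
        )
        for name in names
    }
```

With all-zero parameters every gate evaluates to 0.5, so the cell retains half of its previous value at each step. Pushing the forget-gate bias high and the input-gate bias low leaves the cell value essentially unchanged, which is the mechanism that lets the cell retain values across long intervals.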
The term "long short-term" signifies that LSTMs model short-term memory capable of lasting for an extended duration [18]. LSTMs are adept at classifying, processing, and predicting time series data, particularly in scenarios with unknown time lags and durations between crucial events. Developed to address the challenges of exploding and vanishing gradient problems in training traditional RNNs, LSTMs have proven effective in handling complex temporal relationships.

The GRU's effectiveness in overcoming the vanishing gradient problem is attributed to the utilization of an update gate and a reset gate. The update gate manages the information flowing into memory, while the reset gate governs the information flowing out of memory. Both gates, represented as vectors, determine the information transmitted to the output. Their training can prioritize retaining relevant past information or discarding irrelevant details. This is what allows the GRU to mitigate the vanishing gradient problem, which arises when the gradient diminishes so much that it impedes weight adjustments. Notably, GRUs can outperform LSTMs, especially when handling smaller datasets.
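The interaction of the update and reset gates can be sketched as a single GRU time step in plain Python. This is an illustrative sketch under stated assumptions, not the implementation used in this work: it follows one common convention in which the update gate weights the previous state, and the names `gru_step`, `affine`, and `zero_params` are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(gate, x, h):
    """Compute W*x + U*h + b for one gate; matrices are lists of rows."""
    W, U, b = gate
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + bi
        for w_row, u_row, bi in zip(W, U, b)
    ]

def gru_step(x, h_prev, p):
    """One GRU time step; returns the new hidden state."""
    z = [sigmoid(v) for v in affine(p["update"], x, h_prev)]  # update gate
    r = [sigmoid(v) for v in affine(p["reset"], x, h_prev)]   # reset gate
    # The reset gate scales how much of the old state feeds the candidate.
    gated_h = [rj * hj for rj, hj in zip(r, h_prev)]
    h_cand = [math.tanh(v) for v in affine(p["candidate"], x, gated_h)]
    # Assumed convention: z keeps the old state, (1 - z) admits the candidate.
    return [zj * hj + (1.0 - zj) * cj for zj, hj, cj in zip(z, h_prev, h_cand)]

def zero_params(n_in, n_hid):
    """Hypothetical helper: all-zero weights and biases for every gate."""
    names = ("update", "reset", "candidate")
    return {
        name: (
            [[0.0] * n_in for _ in range(n_hid)],
            [[0.0] * n_hid for _ in range(n_hid)],
            [0.0] * n_hid,
        )
        for name in names
    }
```

When the update gate saturates near 1, the previous state passes through almost unchanged, so gradients flowing back through that path are not repeatedly shrunk. Some libraries swap the roles of `z` and `1 - z`; the two conventions are equivalent up to relabeling.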