Application of AI-based Models For Online Fraud Detection and Analysis
SYSTEMATIC REVIEW*
Abstract
Background: Fraud is a prevalent offence that extends beyond financial loss, impacting victims emotionally,
psychologically, and physically. The advancements in online communication technologies allow online fraud to thrive in this vast network, with fraudsters increasingly using these channels for deception. With the progression of technologies like Generative Artificial Intelligence (GenAI), there is a growing concern that fraud will increase in scale, using advanced methods such as deep-fakes in phishing campaigns. However, the
application of AI in detecting and analyzing online fraud remains understudied. This review addresses this gap
by investigating AI’s role in analyzing online fraud using text data.
Methods: We conducted a Systematic Literature Review (SLR) on AI and Natural Language Processing (NLP)
techniques for online fraud detection. The review adhered to the PRISMA-ScR protocol, with eligibility criteria
including language, publication type, relevance to online fraud, use of text data, and AI methodologies. Out of
2,457 academic records screened, 350 met our eligibility criteria, and 223 were analyzed and included herein.
Results: We report the state-of-the-art NLP techniques used to analyse various online fraud categories; the
data sources used for training the NLP models; the NLP algorithms and models built; and the performance
metrics employed for model evaluation. We find that current research on online fraud is fragmented across the various scam activities that take place; more specifically, we identify 16 different types of fraud that researchers focus on. Finally, we present the latest and best-performing AI methods used to detect online scam and fraud activities.
Conclusions: This SLR enhances the academic understanding of AI-based detection methods for online fraud
and offers insights for policymakers, law enforcement, and businesses on safeguarding against such activities.
We conclude that focusing on individual scam types limits generalization, as multiple models are required to cover different fraud types. Furthermore, the evolving nature of scams limits the effectiveness of models trained on outdated data. We also identify shortcomings in how researchers report data limitations and training bias, as well as in model performance reporting, with some studies selectively presenting metrics, which can bias model evaluation.
Keywords: Artificial Intelligence; Natural Language Processing; Online Fraud; Systematic Literature Review
certain socio-demographic groups face higher risks of fraud (such as women aged 25-34 or 35-44, who experience fraud at rates of 7.6%, and individuals in the highest income bracket, at 8.3%), fraud is impacting individuals across all demographics [2] and sometimes in different ways. For example, victims earning £20,000 or less, those aged 65 and over, and female victims reported impacts on their self-confidence at rates of 46%, 43%, and 41%, respectively, compared to the overall victim rate of 35%.
For the year ending March 2023, the Crime Survey for England and Wales estimated 3.5 million fraud offences including online fraud [3]. For example, advance fee fraud increased significantly, from 60,000 to 391,000 offences, compared to the year ending March 2020. This increase is largely due to society's growing reliance on the Internet and digital platforms for everyday services, transactions and communications. According to The Office of Communications (Ofcom) [4], 92% of adults in the UK use the Internet for a wide variety of activities including communication, education, and entertainment. Activities such as banking, shopping, and socialising are increasingly happening via online platforms, expanding the landscape for fraudsters to exploit vulnerabilities or use these platforms to deceive victims. In 2020, online shopping scams made up 38% of all reported scams worldwide, an increase of 6% compared to the pre-Covid-19 outbreak [5].
Online fraud encompasses a wide range of deceptive activities, including identity theft, phishing, advance fee fraud, romance scams, fraudulent investment scams, and more. It is important to highlight that there is no universally accepted definition of "online fraud," and the term is often used interchangeably with the term "scam." Legally, "fraud is defined as false representation to cause loss to another or to expose another to a risk of loss" [6], whereas a scam is the process where criminals gain the trust of victims in order to deceive or cheat them [7], through false representation and other means, so that victims trust them, which in turn results in various kinds of losses for the victims. The National Fraud Authority published a literature review [8] in which they compared the definition of fraud in the amended Fraud Act 2006 [6] with the typology produced by Levi [9]. They found that fraud embraces a broad scope of different crimes, whereas scams are often focused on fraud against individuals and small firms. For example, different scams like advance fee, romance, tech support, etc., all fall under the fraud umbrella, but they are also deception methods, which are in part scams. Hence, in this SLR, we use both terms: the various scams are the different deception methods scammers use to trick victims, and fraud is the term that includes all scams.
Online frauds exploit the virtual nature of the Internet and the anonymity it provides to reach out to victims. This virtual environment, coupled with jurisdictional challenges (where offenders and victims may be in different regions of the world), makes fraud difficult to detect and prevent using traditional policing techniques. The complexity of online fraud is further heightened by its evolving nature, as fraudsters continuously adapt their techniques to bypass new security measures and exploit emerging technologies to target new victims [10].
Given the scale and impact of online fraud, there is a need for new methods to detect and prevent such activities. The use of NLP, in combination with other AI, has been proposed for identifying, characterising, and detecting fraudulent patterns in applications like phishing [11], fake job advertisements [12] and analysing scam patterns [13], which could help develop preventive measures and mitigate risks of online fraud. However, understanding the current state of AI techniques in combating online fraud, the data sources used, the evaluation methods for AI models, and the specific types of fraud that are most prevalent remains a significant challenge. This is due to the constant emergence of new fraud activities that use various communication mediums and social engineering attacks, in an attempt by fraudsters to remain undetected. Therefore, there is a pressing need to shift from detecting and analysing the effects of fraud to the early detection of emerging fraudulent activities online and new methods of social engineering.
This systematic literature review (SLR) aims to address these challenges by providing a comprehensive overview of the state-of-the-art AI techniques used to detect fraudulent activities online. More specifically, we explore the data sources widely used by researchers to study online fraud, the methodologies researchers employ to evaluate AI models, and the most popular types of online fraud targeted by researchers. By synthesising findings from academic papers, this review aims to provide a thorough understanding of the current landscape of online fraud detection and prevention, highlighting gaps in existing research, and proposing directions for future studies.
Research Questions. The research questions for the Systematic Literature Review (SLR) are as follows:
– RQ1: What is the state-of-the-art of AI techniques used to detect online fraud?
– RQ2: What are the data sources that researchers use to analyse online fraud?
– RQ3: How do researchers evaluate their AI models?
– RQ4: What are the most popular fraud activities that researchers study that fraudsters use to approach their potential victims?
Although a wide number of studies have explored the application of AI for fraud detection and other types of cybercrime, we are not aware of any systematic literature reviews that have examined the application of AI models using text data. The focus of this SLR is AI-based models that study textual data to detect and gain insights related to online fraud. Thus, the focus of this study is to identify Natural Language Processing models that are used to detect online fraud.

1 Online Fraud and AI
We now discuss online fraud and various AI methodologies in detail for further reader information. Online fraud refers to any deliberate act of deception conducted over the Internet to obtain an unlawful or unfair gain. It involves exploiting online platforms, services, and technologies to deceive individuals or organisations for financial, personal, or material gain. Online fraud can take many forms, each characterised by the method of deception and the medium used to perpetrate the fraud.

1.1 Fraud Categories
The list of online fraud activities is extensive and constantly evolving, with new types or sub-types emerging. Developing a comprehensive taxonomy or classification for all online fraud activities requires special attention, which is beyond the scope of this work. Below we outline some of the well-known online fraud types.
– Phishing is the process where fraudsters impersonate representatives of legitimate organisations or acquaintances of the targeted victim to trick them into providing personal information such as usernames, passwords, credit card details, or bank account details. This activity can be done through various mediums, like email, phone calls (aka Vishing), SMS (aka Smishing), and any other means of online communication. Various phishing scams have surfaced over the years, including the Royal Mail scam [14], banking scams [15], HMRC scams [16], and many others. Notably, phishing scams often include deceptive web addresses created by cybercriminals to trick victims into believing they are visiting legitimate websites. The primary goal of these URLs is to steal personal data including usernames, passwords and credit card details for financial gain.
– Fake Reviews are deceptive or fraudulent reviews created to mislead potential customers about the quality, reliability, or legitimacy of a product, service, or app. On fraudulent e-commerce websites and app stores, fake reviews play a crucial role in tricking victims into trusting and using fraudulent apps or purchasing substandard or non-existent products. This leads to potential victims trusting fraudulent websites, services, or apps and providing them with their credit card details for a purchase, which leaves the victim in a vulnerable position [17].
– Recruitment Fraud is a type of online scam where fraudsters pose as legitimate employers or recruiters to deceive job seekers. The primary goal of these scams is to receive "fees" for a job application, steal personal information, extort money, or exploit the victim in some other way. This type of fraud preys on individuals seeking employment, often targeting those who are most vulnerable or desperate for work [18].
– Romance fraud (aka romance scams or dating scams) involves fraudsters creating fake profiles on dating websites, social media, or other online platforms to deceive victims into believing they are in a genuine romantic relationship. The primary objective is to exploit the victim's emotions to extort money, personal information, or other benefits. This is an elaborate scam that is extremely difficult to detect, and it is also underreported due to victims feeling ashamed and hurt at being deceived by someone they considered to be a romantic partner [19]. In these scams, fraudsters communicate with victims for a long time before presenting them with an "investment opportunity" or requesting their financial aid. Romance scams are closely related to Cryptocurrency Pig Butchering scams [20], where victims are gradually lured into making increasing contributions over a long period of time, usually in cryptocurrency, to a fraudulent scheme [21].
– Fraudulent Investment includes scams where fraudsters promise victims significant winnings or lucrative opportunities [22]. These scams are usually associated with the romance scams discussed above. Once the victims try to withdraw their "winnings", the scammers will extort them by asking for "fees" and "taxes" to be paid in advance. The promised benefits and winnings never materialize, and the initial investment sums and fees are lost [23]. Fraudulent investment is the umbrella that covers the Cryptocurrency Pig Butchering scams explained above, and various Ponzi schemes [24], also known as pyramid schemes, where early investors greatly benefit from the investments of later investors.
– Crypto market manipulation involves artificially increasing or decreasing the price of cryptocurrencies to achieve financial gain. It often involves coordinated efforts by individuals or groups to manipulate the market to create false perceptions of supply, demand, or market sentiment. Some common techniques used in crypto market manipulation include: Pump and Dump, which inflates the price of
a cryptocurrency through misleading or false statements (pumping), encouraging others to buy, and then selling off the cryptocurrency at a profit once the price has been pumped up (dumping); Wash Trading, which occurs when a trader buys and sells the same cryptocurrency simultaneously to create deceptive activity on the market; Spoofing, which involves placing significant buy or sell orders and withdrawing them before execution to mislead perceptions of market demand or supply; Front-running, which involves placing orders ahead of a large trade that is known to occur, to benefit from the subsequent price movement caused by the large trade; and many others [25].
– Fraudulent e-commerce involves deceptive practices or scams conducted through online e-commerce platforms. These scams aim to exploit digital payment systems to deceive consumers or businesses into paying for a fraudulent product or service.
– Fraudulent crowdfunding refers to the misuse of crowdfunding platforms to deceive donors or backers, often by providing false or misleading information about the nature, purpose, or outcome of a crowdfunding campaign. Crowdfunding itself is a method of raising money from a large number of people via online platforms to fund projects, products, or causes [26]. A similar fraud is Charity Fraud and Disaster Scams, where scammers seek donations for organisations that do not exist or do little or no work at all. These scams are particularly common after high-profile disasters, as criminals often use tragedies to exploit people who are looking to donate [27].
– Gambling Fraud is any illegal activity that is intended to cheat players or an online gambling platform. Fraudsters manage to trick victims and platforms in different ways, including rigged games, fake websites (phishing URLs, described above), account takeovers (via stealing legitimate users' access codes), and via creating fake apps with fake reviews, as discussed above, to gain the trust of users. Online gambling fraud can happen on multiple platforms and involve a wide variety of games including casino scams, sports betting scams, and lottery scams [28].
– Tax Scams occur when scammers falsify information regarding pending tax money, or maliciously impersonate tax officials to trick individuals or business entities into willfully paying them "fees" [29]. Scams similar to tax scams are Council Tax Scams, various Utility bill scams, Insurance Scams, etc., where scammers impersonate officials and trick the victim into believing that they owe money. These scams fall under the umbrella of Phishing as they often take place via SMS, phone calls, or emails.
– Pension Scams are similar to Tax Scams. Scammers aim to make money through fees, direct access to pension savings or by receiving investments [30].
Overall, in this section we briefly described some well-known scams and online fraud activities. The complexity and interconnected nature of scams and frauds make it challenging to categorise them under a single typology. Phishing scams, for instance, serve as a broad umbrella that currently encompasses phishing conducted via various methods like vishing (voice) and smishing (SMS), yet they can also be integral parts of investment scams when scammers develop phishing websites to gain the trust of victims. Similarly, scams often involve tricking victims through deceptive tactics, ultimately leading to defrauding them of their money or personal information. As such, different scam types frequently overlap, blurring the lines between distinct categories and demonstrating the intricate web of fraudulent activities that exist today. The multifaceted nature of these scams highlights the difficulty in creating a comprehensive classification system that can effectively encompass all types of fraudulent schemes.

1.2 AI techniques
This study investigates AI-based techniques for processing unstructured text data to analyse fraud. Much of this text data, such as news articles, research papers, government reports, books, social media posts (such as tweets and Facebook comments), communications (such as emails, SMS messages, and chat logs), and web content (such as reviews on online marketplaces, travel and hospitality platforms, and comments on video sharing platforms), is inherently unstructured. Statista reported that the global volume of data created, captured, copied, and consumed was 64.2 zettabytes in 2020, and it is expected to exceed 180 zettabytes by 2025 [31]. With each new digital platform or communication channel, this data is increasing. Most of this created data is unstructured text data that provides opportunities for understanding human behaviour, habits, opinions and experiences. It contains information about users' experiences, events, themes, opinions and sentiments that can be important for deriving meaningful insight from their experience related to fraudulent activities. Manual data analysis techniques, like keyword searches and coding of themes, are often too limited to extract meaningful insights at this scale, making advanced computer-driven automated techniques necessary.
However, this data often presents significant challenges due to the diversity of natural (human) language. This includes dealing with noise, which includes irrelevant data, a wide array of linguistic variations
Spatial Clustering); and topic modelling using algorithms like Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA).
– Model training: Often machine learning algorithms will have parameters that need to be tuned before learning begins; these are known as hyperparameters. The tuning process involves re-training the model multiple times using different values for these hyperparameters and selecting the best combination of values based on model performance on a metric of interest. In the case of supervised machine learning, the hyperparameters might be tuned using model performance on different "folds" of the data in an approach known as cross-validation, with some randomly-selected proportion of the data kept separate from the training data, known as a test dataset, for final model evaluation, giving the best available indication of how the model is likely to perform on new, unseen data. In the case of unsupervised modelling, heuristics will be used to identify the optimal number of clusters or topics.
– Model evaluation: The performance of the model needs to be evaluated. In the case of supervised modelling, this will involve measuring the performance of the model on the test data. The classic supervised machine learning algorithms can be evaluated using performance metrics such as the confusion matrix (Figure 2), accuracy, precision, recall, F1-score, sensitivity, specificity, the Receiver Operating Characteristic (ROC) curve, and the Area Under the Curve (AUC):

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision = TP / (TP + FP)

False Negatives (FN): the model incorrectly predicts the negative class when the true class is positive (e.g. a scam predicted as not scam).
Sensitivity and specificity are two other performance measures. Sensitivity is the same as recall, or the true positive rate, and captures the model's ability to correctly identify the positive class (i.e. scam cases):

Sensitivity (TPR) = Recall = TP / (TP + FN)

Specificity, also known as the true negative rate (TNR), measures the proportion of true negatives, and it captures the model's ability to correctly identify the negative class (i.e. not-scam cases):

Specificity = TN / (FP + TN)

The ROC curve is a graph illustrating the performance of one or more binary classifiers. It plots the sensitivity against 1 - specificity for various thresholds. The AUC is calculated as the area under the ROC curve.
– Deploy model: Once the models are working well they can be deployed. When considering deployment of the model, one must address questions regarding why others should trust the model, how the model arrived at its conclusions, and usability, and carefully assess the ethical implications of AI to ensure its suitability for deployment and that it is not biased.
In the case of unsupervised machine learning models, due to a lack of ground truth labels, model evaluation may involve subjective interpretation of the outputs (e.g. clusters or topics) generated by the model.
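To make the tuning and evaluation steps above concrete, the following is a minimal sketch of cross-validated hyperparameter tuning and held-out test-set evaluation for a binary scam/not-scam text classifier using scikit-learn. The pipeline, parameter grid, file name, and column names are illustrative assumptions, not taken from any of the reviewed studies.

```python
# Minimal sketch: hyperparameter tuning via cross-validation, then evaluation
# on a held-out test set using the metrics defined above.
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

df = pd.read_csv("labelled_scam_texts.csv")  # hypothetical file with "text" and "label" columns
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42, stratify=df["label"])

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hyperparameters tuned with 5-fold cross-validation on the training data only.
param_grid = {"tfidf__ngram_range": [(1, 1), (1, 2)], "clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

# Final evaluation on the unseen test set.
y_pred = search.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("accuracy            ", accuracy_score(y_test, y_pred))
print("precision           ", precision_score(y_test, y_pred))
print("recall / sensitivity", recall_score(y_test, y_pred))
print("specificity         ", tn / (tn + fp))
print("F1-score            ", f1_score(y_test, y_pred))
print("ROC AUC             ", roc_auc_score(y_test, search.predict_proba(X_test)[:, 1]))
```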
based on its surrounding context words, whereas Skip-gram predicts surrounding context words based on a given target word. In the sentence 'The quick brown fox jumps over the lazy dog', if 'fox' is used as the target word, the CBOW model uses 'The', 'quick', 'brown', 'jumps', 'over', 'the', 'lazy', and 'dog' as context and predicts the word 'fox'. In Skip-gram, 'fox' is used to predict the surrounding words 'The', 'quick', 'brown', 'jumps', 'over', 'the', 'lazy', and 'dog'.
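As an illustration of the two Word2Vec training objectives just described, the short gensim sketch below trains a CBOW model and a Skip-gram model on a toy corpus; the sg flag selects the objective. The corpus and parameter values are purely illustrative.

```python
# Minimal sketch: CBOW vs Skip-gram word embeddings with gensim's Word2Vec.
from gensim.models import Word2Vec

corpus = [["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]]

# sg=0 -> CBOW: predict the target word ("fox") from its surrounding context words.
cbow = Word2Vec(sentences=corpus, vector_size=50, window=4, min_count=1, sg=0)

# sg=1 -> Skip-gram: predict the surrounding context words from the target word.
skipgram = Word2Vec(sentences=corpus, vector_size=50, window=4, min_count=1, sg=1)

print(cbow.wv["fox"][:5])                        # first dimensions of the embedding
print(skipgram.wv.most_similar("fox", topn=3))   # nearest neighbours in vector space
```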
Global Vectors for Word Representation (GloVe) [33] learns the vector representation of words using global word-word co-occurrence statistics obtained from the training data to capture the semantic relationships between words.
Word embeddings are often used by Large Language Models to comprehend and respond to language. Large Language Models (LLMs) [34] are advanced NLP tools trained on billions of words from a wide variety of sources such as newspapers, books and websites, which are designed to perform complex tasks such as translation, summarisation and human-like conversation. Most LLMs are developed using a transformer-based architecture (transformers) [35] and they have billions of parameters used for training. Transformers are a type of deep-learning neural network model, and they are more efficient compared with predecessor state-of-the-art models, which were based on Recurrent Neural Networks (RNN). Transformers use an architecture with encoder and decoder layers to understand sequences of words and provide an output [35]. While the encoder layers process input text data, extracting hierarchical representations through mechanisms like self-attention, the decoder layers generate output sequences based on the input received from the encoder. Transformer-based LLMs include GPT models such as GPT-3 (Generative Pre-trained Transformer 3), GPT-4 and GPT-4o by OpenAI. Both GPT-4 and GPT-4o are multimodal models that accept text and images and produce text [36]. Other transformers include Bidirectional Encoder Representations from Transformers (BERT), and its smaller and lighter version DistilBERT, which is designed for applications where computational resources might be limited. BERT and DistilBERT are also designed to understand context in language processing and are suitable for NLP tasks like text classification, question answering and named entity recognition. The difference between models like BERT and GPT lies in the way their architecture is designed and in their learning objectives. BERT uses only the encoder of the transformer architecture and processes text bidirectionally by considering both preceding and following words. On the other hand, GPT models use only the decoder of the transformer architecture and process text in a unidirectional manner, based only on preceding words.
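To illustrate how an encoder model such as DistilBERT is typically adapted to a fraud-related text classification task of the kind surveyed later in this review, the following minimal sketch fine-tunes a DistilBERT classifier with the Hugging Face transformers and datasets libraries. The two example texts, the label scheme, and the training settings are illustrative assumptions, not the setup of any specific reviewed study.

```python
# Minimal sketch: fine-tuning DistilBERT for binary scam/not-scam classification.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# Hypothetical labelled examples (1 = scam, 0 = legitimate).
data = Dataset.from_dict({
    "text": ["You won a prize, send a fee to claim it", "Meeting moved to 3pm"],
    "label": [1, 0],
})

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

def tokenize(batch):
    # Pad/truncate so the default data collator can batch the examples directly.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

encoded = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="scam-distilbert", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=encoded,
)
trainer.train()  # in practice, a much larger labelled corpus is required
```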
Search String:
("Online Fraud*" OR "scam*") AND (("machine learning") OR "NLP" OR ("natural language processing") OR "classifier" OR ("Large Language Models") OR "LLM" OR ("Generative Artificial Intelligence") OR "GenAI" OR "GAI")
Table 1 Query for the literature selection in various academic libraries.

LLMs can assist in analysing large amounts of text data and identifying patterns automatically, which can be useful when dealing with fraud and other crime-related data; however, they can also be misused by criminals to generate content for fraudulent activities. Generative AI refers to AI techniques that can be used to generate new text, audio, images and video that closely resemble human-generated content. LLMs have been successfully applied in various areas of human communication, including chatbots in customer support systems that generate human-like text, content generation, and language translation. However, this same technology can be exploited by criminals to create content for deception, including fake websites, targeted phishing emails and scam advertisements to deceive potential victims.

1.4 Research approach
Although there are some literature reviews related to the application of AI for fraud and crime, to the best of our knowledge there are currently no SLRs that attempt to understand the state-of-the-art towards detecting online fraud in general. Overall, the literature reviews that we found discuss the state-of-the-art towards detecting a specific online fraud or scam, e.g., credit card fraud alone. In this work, we aim to understand whether there are universal AI methodologies that attempt to detect online fraud, focusing on textual data.

2 Systematic Review
Systematic reviews differ from traditional literature reviews as they aim to identify all relevant studies that address a set of research questions using a methodology that is structured and can be replicated [37].

2.1 Methods
The following methodology was employed to conduct the SLR, addressing the selection process for identifying relevant publications and avoiding biases.

2.1.1 Protocol
For this SLR we follow the Preferred Reporting Items for Systematic Reviews and Meta-Analysis extension for Scoping Reviews (PRISMA-ScR), as proposed by Moher et al. [38].

Figure 3 PRISMA Chart (records screened for eligibility check: N=2457; records excluded: N=2107; full paper records for eligibility check: N=350; full-text records excluded: N=127; studies included in qualitative synthesis: N=223)

calculated between each pair of annotators. The agreement between annotator A and annotator B was 0.65 (substantial agreement), between A and annotator C was 0.66, and between B and C was 0.52 (moderate agreement).[1] The three annotators compared their annotation process and reviewed the eligibility criteria and goals of this review. Following this discussion, the lead annotator moved on to perform the rest of the annotation of the papers to be included in this review.
[1] Annotators A, B, and C will be replaced with the initials of the authors after the reviewing process.
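Pairwise inter-annotator agreement of this kind is commonly computed as Cohen's kappa (assumed here, given the "substantial"/"moderate agreement" interpretation of the scores). The sketch below shows the computation for two annotators over the same set of papers using scikit-learn; the example labels are invented for illustration.

```python
# Minimal sketch: inter-annotator agreement (Cohen's kappa) between two annotators.
from sklearn.metrics import cohen_kappa_score

# Hypothetical include/exclude decisions for the same ten papers.
annotator_a = ["include", "exclude", "include", "include", "exclude",
               "include", "exclude", "exclude", "include", "include"]
annotator_b = ["include", "exclude", "include", "exclude", "exclude",
               "include", "exclude", "include", "include", "include"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values above 0.61 are usually read as substantial agreement
```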
paper with the scam type that the study is focusing on, based on the title, abstract, and methodology of the studies. The majority of studies that were included for qualitative analysis focus on phishing detection, with about a third (29%) of the studies focusing on detecting phishing URLs online (N = 64). More specifically, these works tackle the problem of automatically detecting whether a given URL is likely to be fraudulent. A large number of papers were related to detecting phishing emails (N = 29), followed by studies on SMS phishing detection (N = 20). Other studies on phishing include phone call transcripts towards understanding and detecting voice call phishing (N = 12) and a small number of studies that attempt to understand phishing methods via victim reports (N = 4).
Moving on to other types of fraud, we found many studies that attempted to detect fake reviews on various platforms like Google Play Store, Apple App Store, Yelp, and TripAdvisor (N = 23). Another widely studied scam focus was recruitment fraud (N = 20). We also found a number of studies that employed AI techniques to detect fake accounts on Facebook, Instagram, and Twitter (N = 18).
Figure 4 Percentage of scam types analyzed in the studies included for qualitative analysis
A few studies used Generative AI (GenAI) to gain a better understanding of some of the latest methods of phishing and other fraud types. GenAI represents cases where advanced LLMs are misused towards social engineering attacks. GenAI models have revolutionised various industries as they are now able to promptly generate human-like text in response to input commands. 3 studies employed GenAI models to automatically interact with scammers to waste the scammers' resources or time, and to collect information regarding the methods with which the scammers try to defraud users, in order to disrupt their operation. This is defined as scam baiting, the process of using generative AI models to deceive and engage with scammers. Interestingly, this is a countermeasure against online fraud (GenAI Scam Bait in the figure).
Our search also returned many studies that discuss the misuses of GenAI towards Social Engineering Attacks (N = 13), where scammers use these advanced models to create legitimate-looking emails or SMS to earn the trust of potential victims. While GenAI models offer numerous benefits, these studies show that they pose significant risks when leveraged for malicious purposes, particularly in the realm of social engineering. GenAI can generate coherent, contextually relevant, and grammatically correct emails that mimic the style and tone of professional communication. This increases the likelihood of victims perceiving fraudulent emails as legitimate and trusting the message [39].
Regarding other scams, we found a variety of studies that try to detect Social Media Scams (N = 6). These scams included a variety of fraudulent activities, including fake user accounts and online groups, and the advertisement of fraudulent apps or fake phishing websites aimed at stealing personal information or money.
Similarly, 3 studies focus on Romance Fraud via analysing profiles on social media and discussions with victims. We also identified 3 studies that try to understand fraudulent cryptocurrency investment scams and 2 studies that attempt to detect the likelihood of cryptocurrency manipulation. Finally, we identified 2 studies that analysed fraudulent crowdfunding online and 1 that studies fraudulent e-commerce websites.

4 Summary of Findings
We now provide a summary of our findings, categorized per fraud activity analyzed within the papers that are included in our SLR for qualitative analysis.

4.1 Data Sources
First, we report the most popular data sources used and the datasets analyzed.
Phishing URLs. We start with the preferable data sources and methodologies employed for the analysis and detection of Phishing URLs, as it is the most popular scam type category we detected in our SLR. In total, we analyze the data sources and detection methodologies of the identified 63 papers that attempt to tackle this issue. Table 3 summarizes the data sources and methodologies used to detect Phishing URLs, as found in the literature.
First, we find that researchers used various websites that offer information on URLs for the analysis of malicious and legitimate domains. This information may be webpage rankings (how trusted the webpage is), phishing reports, and historical data. By far, the most popular data source used was PhishTank,[2] a website
that allows users to report webpages that might be malevolent or suspicious, with 25 studies using it as a source of already labeled malicious websites [40–65].
[2] https://phishtank.org/
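As an illustration of how such a feed is typically ingested, the sketch below loads a CSV export of reported phishing URLs (for example, one downloaded from PhishTank) with pandas and combines it with a list of legitimate URLs to form a labelled dataset. The file names and column names are assumptions for illustration; the actual feed format should be checked against the provider's documentation.

```python
# Minimal sketch: building a labelled URL dataset from a phishing feed export
# and a list of legitimate URLs.
import pandas as pd

phishing = pd.read_csv("phishtank_export.csv")   # hypothetical export file with a "url" column
legit = pd.read_csv("legitimate_urls.csv")       # e.g. drawn from a top-sites ranking

dataset = pd.concat([
    pd.DataFrame({"url": phishing["url"], "label": 1}),   # 1 = phishing
    pd.DataFrame({"url": legit["url"], "label": 0}),      # 0 = legitimate
], ignore_index=True).drop_duplicates(subset="url")

print(dataset["label"].value_counts())
```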
Another website that offers a list of phishing URLs is OpenPhish,[3] and it was used by 3 studies [41, 58, 59]. Two studies used URLhaus,[4] a project for sharing malicious URLs, for the collection and analysis of phishing URLs [46, 64]. We also find one work that used SpamHaus[5] for the collection of IP and domain reputation [66], and one that used URLscan[6] [67]. Interestingly, [67] also collected user-reported domains from ScammerInfo,[7] a forum where users post and discuss various scams. Lastly, WhoIs,[8] a webpage that offers historical data on webpages, was used for feature collection by [51, 68].
The most used data source for the collection of legitimate webpages was Amazon Alexa, a webpage that offers webpage rankings, with 10 studies using it to collect legitimate annotated webpages [40, 52, 55, 56, 62–64, 69–71, 71]. Google's Search Engine was used for one study [72], while another used the Majestic Million site,[9] a site similar to Alexa, for legitimate webpage collection [44].
We also find that many records use previously publicly available datasets for their analysis. More specifically, 11 studies [65, 73, 74, 74–81] used already publicly available datasets published on Kaggle. Similarly, another 5 studies [45, 69, 82–84] used the UCI Phishing Dataset.[10] Other studies used publicly available datasets from other sources [65, 85–90].
The most recent studies (published in 2023) that attempt to detect Phishing URLs automatically do so by collecting data from alternative sources like social networks and user-reported phishing URLs [70, 70, 91, 91–93], while others used datasets from Telecom and Security institutions [71, 75, 94].
Alas, we failed to detect the data source used for 9 studies, as they do not clearly report how or from where they acquired the dataset they use for their study [95–103].
Phishing Emails. We now discuss the data sources used in the 29 works that tackle the Phishing Email detection problem. The data extracted from the literature and presented in this section are depicted in Table 4.
The overwhelming majority of the works focusing on phishing email detection use datasets made available by previous works [104–111], datasets published on Kaggle [106, 111–119], or datasets published at the UCI ML repository [120–124].
Two works [125, 126] used emails received in the authors' personal or professional email spam folders. Another work that includes datasets from alternative sources is that of Mehdi et al. [127], which used various techniques to develop their dataset, like GPT-2-generated synthetic phishing emails made available by previous research [128], along with TextAttack,[11] a Python framework for adversarial attacks, data augmentation, and model training in NLP, Textfooler,[12] a model for natural language attacks on text classification, and PWWS [129]. Another alternative data source for phishing email detection was used by Janez et al. [111], who used data from the SPAM Archive,[13] a website that publishes spam email repositories at the end of every month and is constantly updated. [130] used user-reported e-mails. Lastly, the data source used by three works was not clearly stated within the manuscript [131–133].
Phishing SMS. Regarding Phishing SMS (smishing), we included and analyzed 20 papers in this SLR. For the detailed data, refer to Table 5.
Similarly to previous analyses, we find that the overwhelming majority of works opted for using already publicly available datasets for their analysis and for training a model to detect fraudulent SMS automatically. More specifically, a variety of subsets from a publicly available dataset on Kaggle[14] was used by 14 studies [134–147]. Another study [148] used the Kaggle dataset, but incorporated Fake Base Station data and made it available to researchers.[15] Similarly, one work [134] used a subset of the Kaggle dataset in combination with emails and YouTube comments for spam content detection, while Lai et al. [149] used data provided by users.[16] Tang et al. [150] collected tweets where users were reporting smishing for their analysis.
Two other works used data from the Korean Internet and Security Agency [151] and 360 Mobile Safe [152]. Lastly, Timko et al. [153] proposed a platform where users can freely post Phishing SMS for researchers to use.[17]
[3] https://openphish.com/
[4] https://urlhaus.abuse.ch/
[5] https://www.spamhaus.org/
[6] https://urlscan.io/
[7] https://scammer.info/
[8] https://who.is/
[9] https://majestic.com/reports/majestic-million
[10] https://archive.ics.uci.edu/dataset/327/phishing+
[11] https://github.com/QData/TextAttack
[12] https://github.com/jind11/TextFooler
[13] http://untroubled.org/spam/
[14] https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
[15] https://github.com/Cypher-Z/FBS_SMS_Dataset
[16] https://www.datafountain.cn/competitions/508
Phishing Phone Calls. We identified 12 studies that used phone call transcripts to understand voice call-enabled phishing (vishing).
Derakhshan et al. [154] used the CallHome dataset, which includes 120 unscripted 30-minute telephone conversations between native speakers of English.[18] Another work [155] used AI-generated deepfake voice recordings (Tacotron 2,[19] Deepvoice 3,[20] and FastSpeech 2 [156]). For authentic voice recordings they used the synplaflex dataset [157], a corpus of 87 hours of audiobooks in French.
Various works used telecommunication operator datasets: [158] used fraudulent caller IDs and phone transcripts [159] from telecommunication operators in China, [160] used data from the Public Security Bureau in Zhejiang Province, China, and [161] used data from the Korean Financial Supervisory Service. Kale et al. [162] developed their dataset via questionnaires and victim testimonies. Other works collected data from various social networks, like YouTube transcripts [163], Facebook, online blogs and forums, and public datasets, as well as some that were developed based on studies of scammers' activities and behaviors [164]. Others opted for using previously analyzed and publicly available data [165, 166], while the data used by Zhong et al. [167] was unclear.
User Reports. Four studies used user reports to understand phishing activities.
First, [168] constructed a fraud complaint dataset from an Internet finance service in China. Similarly, [169] used court documents from Chinese online judgement records, while [170] used incident record forms from victims and interviews in the Philippines. Lastly, [13] launch and introduce a website operated by the National Crime Prevention Council (NCPC) in Singapore, where users can report and get informed on the latest phishing activities.[21]
Fake Reviews. We move to the Fake/Fraudulent Review scam type. We include 23 studies in our analysis, and the data extracted from each record is depicted in Table 8.
The overwhelming majority of papers used the previously published YELP dataset[22] [171–178], or the publicly available OTT dataset on Kaggle[23] [175, 178].
Other studies used application reviews from the Google Play Store or Apple's App Store [179–181]. Others used previously available Amazon product reviews,[24] or collected Amazon reviews [182–189]. Others collected Amazon hotel and holiday package reviews [190, 191], or TripAdvisor review data [174]. Lastly, [192] used YouTube transcripts to interpret false review exaggeration. The source of the data used by Ganesh et al. [193] was unclear.
Fraudulent Recruitment. We detected 19 studies that focus on the detection of fraudulent job postings. The related data is listed in Table 7.
The overwhelming majority (N = 16) of papers used the same publicly available dataset from Kaggle,[25] which holds about 18K job postings, out of which 800 are fraudulent. Notably, this dataset includes data from 2012 to 2014 [12, 194–208].
The other three studies developed custom crawlers to collect data from various job posting sites in the UK (SEEK, Glassdoor, Indeed, and Gumtree) [209], in Bangladesh (job.com.bd, bdjobstoday, deshijob) [210], and in China (China–Boss, Zhipin, Liepin, and 51job) [211].
Fake Accounts Online. Various studies attempt to tackle the automated detection of fake profiles online. In Table 6 we depict the data extracted from the identified records.
We find that many works collect user profile data from various online social networks for this analysis, like Twitter [212–218], Instagram [219–221], Facebook [222–224], YouTube [225], and Sina Weibo [226]. A different approach was used by a study [227] that collected real names from various webpages, schools, and other sources to automatically detect fake names online.
Other works used previously published and openly accessible datasets that included user data from various social networks [228, 229].
GenAI-facilitated Social Engineering Attacks. Under this category, we find many works that investigate how generative AI models can be misused to defraud people.
Various studies [39, 230, 231] develop and discuss an initial taxonomy where they present how AI-generated content can be misused by scammers. At the same time, Carlini et al. [232] test various membership inference attacks on OpenAI's GPT-2 model and confirm that the model is vulnerable to this kind of attack, which poses privacy risks. Similarly, Kumar et al. [233] discuss the significant implications for cybersecurity, privacy, and ethics that should be considered when developing and using these models.
[18] https://catalog.ldc.upenn.edu/LDC97S42
[19] https://pytorch.org/hub/nvidia_deeplearningexamples_tacotron2/
[20] https://r9y9.github.io/deepvoice3_pytorch/
[21] https://www.scamalert.sg/
[22] http://odds.cs.stonybrook.edu/yelpzip-dataset/
[23] https://www.kaggle.com/discussions/general/281540
[24] https://snap.stanford.edu/data/web-Amazon.html
[25] https://www.kaggle.com/datasets/amruthjithrajvr/recruitment-scam
Moving on to misuse cases of these models, [234] discuss how LLMs and GenAI can be used to create fake professional profile bios to trick users into believing that the account is legitimate. Similarly, [235] show that scammers can use these models to create AI-generated images to be posted on various social networks. Their case study shows that these images tend to receive high volumes of engagement on Facebook, as many users do not seem to recognize that the images are synthetic. Other works show how these models can be jailbroken to produce code that imitates legitimate webpages (phishing URLs) [236], malware code, phishing emails, phishing SMS, SQL injection attacks, and much more potentially dangerous material [237–239].
At the same time, other works show that humans may be able to accurately detect AI-generated phishing content [240], while Roy et al. [241] discuss and experiment with countermeasures to prevent malicious prompts (jailbreaking) in GPT, and they provide valuable insights into how the model can become more robust against this vulnerability.
Social Media Scams. Six studies examine various scams and spammers facilitated through various social networks. [242] used data from WeChat and the Konect repository to detect users that use WeChat to defraud people. [243] and [244] used Telegram data to characterize and detect fake Telegram channels, while [245] collected data from Telegram and compared it to Twitter data to understand and detect fake users. Similarly, [246] collected and analyzed Twitter and Institute of Informatics and Telematics data to detect scammers on Twitter. [247] collect YouTube data to detect scammers that attempt to lure in victims via YouTube comments on videos.
GenAI Scam Baiting. We identified three works that used LLMs and Generative AI to automatically engage with scammers online towards wasting their resources and collecting data on various fraud activities. For this section we do not identify the data sources used, as researchers use data received on their baiting accounts.
Romance Fraud. Moving on to Romance Fraud, [248] attempt to automatically detect malicious accounts on Momo, a dating website. Similarly, [249] collect data from DatingMore and scamdigger.com to develop automated methods for understanding fraudulent profiles within dating social networks. Lastly, [250] analyzed the sentiment of tweets with the hashtag #tinderswindler to provide an understanding of users sharing their experiences regarding romance fraud.
Fraudulent Investment. Studies that attempted to understand fraudulent investment scams did so employing data from different sources. First, [251] analyze investment scam advertisements found in Bitcointalk.[26] Li et al. [252] collect YouTube comments to automatically detect bots that advertise fraudulent investment content. Lastly, [253] develop a scam detection model based on emotional fluctuations in user discussions collected from one of the most popular instant messaging apps in Taiwan.
[26] https://bitcointalk.org/
Crypto Manipulation. Market manipulation, and more specifically cryptocurrency coin manipulation, is when users collectively attempt to alter investor interactions towards manipulating the price of a coin.
[254] discuss this process in detail via data acquired from Twitter, Telegram, and Discord channels. Similarly, [255] identify and analyze cryptocurrency manipulations from user activity collected from Telegram and Twitter.
Fraudulent Crowdfunding. Both works that analyse fraudulent crowdfunding [256, 257] collect the descriptions and metadata from hundreds of Kickstarter campaigns.[27]
Fraudulent E-commerce. [120] analyze the Terms and Conditions of various websites that sell various products towards understanding and detecting obscured financial obligations in online agreements.

4.2 Methodologies employed
We now discuss the most popular AI and NLP methodologies employed per online fraud type.
Phishing URLs. We find that the records included in our SLR attempt to automatically detect phishing URLs using a variety of NLP and AI methodologies. These models included classic supervised machine learning algorithms such as Naive Bayes (NB), Random Forest (RF), Decision Tree (DT), Support Vector Machines (SVM), XGBoost (Extreme Gradient Boosting), and KNN (k-Nearest Neighbors), as well as Artificial Neural Networks (ANN) and more advanced deep learning models such as Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNN). LSTMs are a type of Recurrent Neural Network (RNN) designed to capture long-range dependencies in sequential data, making them suitable for handling and predicting sequences of text. On the other hand, CNNs aim to identify key features in the text by capturing local patterns. These models were developed for binary classification tasks, i.e., whether a URL is fraudulent or not.
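A typical instantiation of such a binary URL classifier is sketched below: a handful of simple lexical features are extracted from each URL and fed to a Random Forest, mirroring the kind of feature-based pipeline described in the next paragraph. The feature set and the tiny example data are illustrative assumptions, not a reproduction of any reviewed study.

```python
# Minimal sketch: lexical URL features + Random Forest for phishing URL detection.
from urllib.parse import urlparse
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def url_features(url: str) -> dict:
    parsed = urlparse(url)
    return {
        "length": len(url),                                  # total character count
        "num_digits": sum(c.isdigit() for c in url),
        "num_special": sum(not c.isalnum() for c in url),    # special-character count
        "uses_https": int(parsed.scheme == "https"),         # secure scheme or not
        "num_subdirs": parsed.path.count("/"),               # sub-directory depth
        "num_subdomains": max(parsed.netloc.count(".") - 1, 0),
    }

# Tiny illustrative dataset (1 = phishing, 0 = legitimate).
urls = ["http://secure-login.example-bank.top/verify/account/123",
        "https://www.example.com/about"]
labels = [1, 0]

X = pd.DataFrame([url_features(u) for u in urls])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
print(clf.predict(pd.DataFrame([url_features("http://example-bank.top/login/verify")])))
```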
NLP techniques related to text mining have been used to extract features from URLs, which are then used to train the AI-based classification models. More specifically, these features include counts of characters and special characters, n-grams, and URL deconstruction: detection of a secure scheme or not, the domain and top-level domain, along with sub-directories. Various works used hybrid combinations of methodologies for their detection, including some more advanced techniques: [40] used Bi-LSTM (Bidirectional Long Short-Term Memory, which can process sequences of text in both forward and backward directions) with VGG (Visual Geometry Group, a type of CNN), [41] used W2V (Word2Vec) for feature extraction along with CNN, GRU (Gated Recurrent Unit, a simplified RNN similar to LSTM), and Bi-LSTM, [92] used BERT with RF, [91] developed a hybrid system that uses RF, SVM, FNN, and XGBoost, [47] used Linear Regression (LR) and DT, and [86] used LSTM and GRU.
Other stand-alone AI methodologies, like LightGBM (Light Gradient Boosting Machine, an ensemble learning technique designed for handling large datasets with many features), RF, NB, and ANN also seem to work well for detecting phishing URLs, but by far the one with the best performance seems to be RF.
Phishing Emails. The methodologies used for phishing email recognition weigh more on the NLP analysis. The majority of studies used various NLP methodologies for feature extraction, including, but not limited to, topic analysis (LDA, BERT, BERTLARGE), n-gram extraction (TF-IDF, BOW, clustering, W2V), and sentiment analysis (VADER, WordNet).
Then, the studies that also opted for automated detection of phishing emails employed LLM analysis, RF, NB, SVM, CatBoost, LSTM, RNN, and many more.
Phishing SMS. Similar to phishing email detection, phishing SMS detection also relies on state-of-the-art NLP methodologies, including LLMs, LDA, BERT and W2V. The existing literature also used AI methodologies like LR, SVM, CNN, GNN, LSTM, NB, and KNN for automated detection.
Phishing Phone Calls. Studies on vishing used various AI methods for automated detection. The majority of studies used transcript text data for their analysis, with the exception of [155], who analyzed deepfake voice recordings. For that study, the authors found that RNN had the best performance.
On text data, various NLP and AI methodologies were used, including but not limited to SVM, NB, LSTM, CNN, RF, BERT, W2V, LR, and KNN.
Phishing User Reports. We find that for the analysis of user reports of various phishing activities, BERT, SMO, J48, NB, RF, XGBoost, D2V, Jaccard, NER, and TF-IDF were applied.
Fake Reviews. The most popular NLP methodologies applied to Fake Review detection lean heavily on sentiment detection methodologies, including VADER and WordNet. Similar to the previous fraud analyses, the AI methods applied include various neural network models like CNN.
Fraudulent Job Postings. The methodologies employed for the automated detection of Fraudulent Job Postings include stand-alone machine learning algorithms used for classification tasks, including LR, SVM, KNN, RF, XGBoost, and ANN, and deep learning models including Bi-LSTM.
Scam Baiting. One of the Scam Baiting studies, [258], used OpenAI's ChatGPT to reply to scammer emails. Similarly, [259] experimented with OpenAI's ChatGPT and DistilBERT to categorize various scam emails they received and provided a qualitative analysis of how well the two models perform. The other study, [260], set up a mailserver and used data from the Scambaiter mailserver, the Enron Email Dataset, and ScamLetters.Info, while employing their own DistilBERT model to engage with scammers towards categorizing and analyzing the various emails.
Romance Fraud. Studies on Romance Fraud detection used various NLP methodologies, including sentiment detection using BOW, TF-IDF, and textBlob, finding that RF works best. For the detection of malicious accounts in dating applications, LSTM works best, while another study [249] found that EML also works well.
Fraudulent Investment. Three of the studies related to Fraudulent Investment used data from different sources (forum, messaging app, and YouTube). The reported studies indicate that DT was the best performing machine learning model for detecting emotional fluctuations in victims, while XGBoost worked well for detecting malicious advertisements of fraudulent investment websites.
Crypto Manipulation. The two identified studies on Cryptocurrency Market Manipulation used pre-existing methods for detecting fake users, along with CorEx Topic Analysis [254]. The other study found that SVM with SGD and TF-IDF works best for detecting discussions that aim to manipulate the market [255].
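Several of the configurations above, such as the SVM-with-SGD-on-TF-IDF setup reported for crypto manipulation and the SVM-based smishing detectors, follow the same basic pattern. The sketch below shows that pattern with scikit-learn: TF-IDF features feeding a linear SVM trained by stochastic gradient descent. The example messages and labels are invented for illustration.

```python
# Minimal sketch: TF-IDF features + linear SVM trained with SGD (hinge loss),
# the kind of lightweight classifier reported for short scam messages and posts.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

texts = [
    "URGENT: your parcel is held, pay the release fee at this link",
    "Coin X is about to moon, everyone buy at 9pm sharp!",
    "Lunch at noon tomorrow?",
    "Your invoice for March is attached as discussed",
]
labels = [1, 1, 0, 0]  # 1 = scam/manipulative, 0 = benign

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("svm", SGDClassifier(loss="hinge", random_state=0)),  # hinge loss = linear SVM
])
model.fit(texts, labels)
print(model.predict(["Send a small verification fee to unlock your winnings"]))
```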
well [256]. The other study, [257] developed an LSTM- Phishing SMS By this point we established that
LDA topic detection model that analyses the crowd- many works in various scam type analysis rely on es-
funding campaign, along with the comments of people tablished datasets. Alas, the rapidly evolving nature
towards estimating whether a campaign is a scam or of smishing requires a more dynamic and diversified
not. approach to data collection. By integrating publicly
available datasets with real-time, user-reported data
Fraudulent E-Commerce. Finally, the only study
and specialized security sources, researchers can de-
that we identified that used text data towards un-
velop more effective and resilient smishing detection
derstanding Fraudulent E-Commerce activities, used
models. This approach will ensure that models remain
OpenAI’s GPT-4 model to automatically detect ob-
relevant and capable of addressing new and sophisti-
scure financial obligations in the Terms and Conditions cated smishing threats as they arise.
of the focused websites [120]. Contrary to the phishing email detection, we find
that in the case of phishing SMS detection, which tends
4.3 Key Findings to be much smaller text, SVM and various applications
Given the findings of this SLR, we attempt to address of Neural Networks perform best.
our Research Questions.
Phishing Phone Calls. Regarding the data sources
Phishing URLs. While established datasets have used for automated vishing detection, most works
played a crucial role in the development of phishing used text data via transcribing voice recordings. Other
URL detection models, there is a clear need to incorpo- works also used caller ID information from various
rate more dynamic and current data sources. Leverag- telecommunication operators. Some works collected
ing user-reported phishing URLs from social networks, data from user reports and only one work attempted
along with data from Telecom and Security organiza- to detect deepfake signals in voice recordings. The best
tions, offers a more effective approach to combating performing methodology used for automated vishing
phishing attacks. These sources provide real-time, di- detection is SVM.
verse, and relevant data that enhance the robustness
and accuracy of detection models, keeping pace with Phishing User Reports Reviewing the four works
the evolving nature of phishing threats. By combin- that studied phishing via user reports, we find that
ing the strengths of both traditional and modern data researchers used data from various sources, includ-
sources, researchers can develop more comprehensive ing court judgements, public data from forums, and
and adaptive phishing detection systems, better pro- user reports from Financial Institutions. The method-
tecting users from phishing URLs. ologies applied vary as there is use of Named En-
Regarding the methodologies used, we find that the tity Recognition, various NLP methodologies, and AI
existing literature used state-of-the-art methodologies methodologies.
for the analysis and detection of phishing URLs. No- Fake Reviews. Although many works used previously
tably, RF seems to be the stand-alone model that available datasets to establish and compare their de-
works best, while other hybrid methodologies also re- tection models, we notice a clear trend where later
port promising performance. studies tend to collect data from platforms like Ama-
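For illustration, a minimal sketch of such a pipeline is given below, pairing character n-gram TF-IDF features with an RF classifier; the file name, column names, and hyper-parameters are illustrative assumptions rather than the setup of any specific reviewed study.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical labelled URL data: columns "url" and "label" (1 = phishing, 0 = benign).
data = pd.read_csv("urls_labelled.csv")
X_train, X_test, y_train, y_test = train_test_split(
    data["url"], data["label"], test_size=0.2, stratify=data["label"], random_state=42)

pipeline = Pipeline([
    # Character n-grams capture lexical cues in URLs (digits, hyphens, look-alike tokens).
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5), max_features=50000)),
    ("rf", RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=42)),
])
pipeline.fit(X_train, y_train)
print(classification_report(y_test, pipeline.predict(X_test), digits=3))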
Phishing Emails. While publicly available datasets have laid the groundwork for phishing email detection research, the rapidly evolving nature of phishing attacks requires the use of more dynamic and up-to-date data sources. Leveraging user-reported emails, real-time spam collections, and advanced synthetic data generation techniques can significantly enhance the robustness and accuracy of phishing detection models. By combining traditional datasets with innovative data sources, researchers can develop more comprehensive and adaptive phishing detection systems, better equipped to detect phishing activities via email.
All of the works that opted for automated phishing email detection report very good performance on their detection models, with RF, BERT, LSTM, RNN, and SVM being the most popular.
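As a hedged illustration of the transformer-based approaches reported in these works, the sketch below fine-tunes a pre-trained BERT encoder for binary phishing-email classification; the model checkpoint, toy data, and training settings are assumptions for demonstration only.

import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Toy examples; 1 = phishing, 0 = legitimate.
emails = ["Your account is locked, verify your details here ...",
          "Agenda for Monday's project meeting attached."]
labels = torch.tensor([1, 0])

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

enc = tokenizer(emails, truncation=True, padding=True, max_length=256, return_tensors="pt")
loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"], labels),
                    batch_size=8, shuffle=True)

model.train()
for epoch in range(3):
    for input_ids, attention_mask, y in loader:
        optimizer.zero_grad()
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
        out.loss.backward()   # cross-entropy loss computed by the classification head
        optimizer.step()

model.eval()
with torch.no_grad():
    probe = tokenizer(["Urgent: confirm your password to avoid suspension"], return_tensors="pt")
    print(model(**probe).logits.softmax(dim=-1))   # class probabilities for the new email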
Phishing SMS. By this point, we have established that many works across the various scam types rely on established datasets. However, the rapidly evolving nature of smishing requires a more dynamic and diversified approach to data collection. By integrating publicly available datasets with real-time, user-reported data and specialized security sources, researchers can develop more effective and resilient smishing detection models. This approach will ensure that models remain relevant and capable of addressing new and sophisticated smishing threats as they arise.
In contrast to phishing email detection, we find that for phishing SMS detection, where the texts tend to be much shorter, SVM and various applications of Neural Networks perform best.
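A minimal sketch of an SVM-based approach for such short texts is shown below; the combination of word- and character-level TF-IDF features, the toy messages, and the parameters are assumptions for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

sms_detector = Pipeline([
    ("features", FeatureUnion([
        ("word", TfidfVectorizer(ngram_range=(1, 2))),
        ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ])),
    # class_weight="balanced" compensates for the usual scarcity of smishing examples.
    ("svm", LinearSVC(C=1.0, class_weight="balanced")),
])

# Toy data; 1 = smishing, 0 = legitimate.
train_texts = ["You have won a prize, claim it at hxxp://short.url/xyz", "See you at 7 for dinner"]
train_labels = [1, 0]
sms_detector.fit(train_texts, train_labels)
print(sms_detector.predict(["Your parcel is on hold, pay the customs fee at the link"]))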
Phishing Phone Calls. Regarding the data sources used for automated vishing detection, most works used text data obtained by transcribing voice recordings. Other works also used caller ID information from various telecommunication operators. Some works collected data from user reports, and only one work attempted to detect deepfake signals in voice recordings. The best performing methodology used for automated vishing detection is SVM.

Phishing User Reports. Reviewing the four works that studied phishing via user reports, we find that researchers used data from various sources, including court judgements, public data from forums, and user reports from Financial Institutions. The methodologies applied vary, with use of Named Entity Recognition alongside a range of other NLP and AI techniques.
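As a small illustration of how NER can be applied to free-text fraud reports, the sketch below uses an off-the-shelf spaCy model; the report text is invented, and the entity types extracted will depend on the model actually used.

import spacy

# Assumes the small English model is installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
report = ("I received a call claiming to be from Barclays on 12 March, asking me to "
          "transfer £2,400 to a 'safe account'.")

for ent in nlp(report).ents:
    print(ent.text, ent.label_)   # e.g. Barclays ORG, 12 March DATE, £2,400 MONEY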
Fake Reviews. Although many works used previously available datasets to establish and compare their detection models, we notice a clear trend where later studies tend to collect data from platforms like Amazon, Google Play, Apple App Store, and YouTube for their analysis. This is extremely encouraging, as the data used for these detection models need to be constantly updated. Hybrid models including LR, SVM, CNN, and LSTM seem to perform best for the detection of fake reviews.
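A minimal sketch of such a hybrid setup is given below, combining TF-IDF features with VADER sentiment scores and an LR classifier; the toy reviews, labels, and parameter choices are assumptions for illustration.

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Toy reviews; 1 = suspected fake, 0 = genuine.
reviews = ["Best product ever!!! Totally amazing, buy it now!!!",
           "Decent battery life, but the screen scratches easily."]
labels = [1, 0]

tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_text = tfidf.fit_transform(reviews)

analyzer = SentimentIntensityAnalyzer()
X_sent = csr_matrix([[analyzer.polarity_scores(r)["compound"]] for r in reviews])

X = hstack([X_text, X_sent])          # lexical features plus a sentiment-intensity feature
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X))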
Fraudulent Job Postings. While the Kaggle dataset has been pivotal in advancing research on fraudulent job postings, the rapidly evolving nature of job scams necessitates the use of more current and diverse data sources. Custom data collection methods, which tap into active job posting sites, represent a crucial step forward in enhancing the effectiveness of detection models. By leveraging a mix of established and new data sources, researchers can develop more comprehensive systems to effectively combat fraudulent job postings. The models that perform best for this fraudulent activity vary; we find that Bi-LSTM, KNN, RF, DNN, and LightGBM perform well.

GenAI Social Engineering. Only two works here use data collected from real use cases: Ayoobi et al. [234] discuss AI-generated fake profile content on professional social networks, while Diresta et al. [235] present deepfakes posted on Facebook. Other works discuss the potential cybersecurity, privacy, and ethical issues with these models and how they can be misused to automatically create fraudulent content.

5 Discussion
We now discuss our findings, detailing recognised limitations and shortcomings detected in the reporting of AI models related to the performance and data sources used. We also provide recommendations for researchers developing detection models for online fraud.

5.1 Data Sources
Overall, we analyzed the data sources and detection methodologies of 222 papers that aim to address a range of online fraud problems. Although our findings reveal a preference for well-established datasets, especially in the automated detection of various phishing activities and of fake reviews, more recent studies (published after 2023) seem to shift towards more dynamic and recent sources.
We found that the overwhelming majority of the phishing domain detection studies relied on well-known and extensively studied datasets from websites like PhishTank, OpenPhish, and SpamHaus. In contrast, others used datasets made available from previous studies or publicly available repositories like Kaggle, the University of California Irvine (UCI) Machine Learning Repository, and GitHub. This is also the case for studies that focused on phishing email detection, fake review detection, and fraudulent recruitment detection. However, while these established datasets provide a valuable foundation for research, they come with limitations.
Online fraud is dynamic, with new scam techniques frequently emerging and older methods, like phishing URLs and websites, continually evolving. Studies show that LLM-empowered bots or scammers could be deployed to generate and automate sophisticated and targeted fraudulent and phishing content online, be that an email, a professional profile, or deceptive Terms and Conditions for fake e-commerce websites. Hence, relying on outdated datasets may limit the effectiveness of detection models when applied to current or evolving threats. At the same time, the static nature of previously used datasets does not capture the latest trends and variations of the various online fraud techniques and activities.
To address these limitations, recent studies have started leveraging more dynamic and real-time data sources. Regarding automated phishing detection, recent studies used user-reported phishing URLs from social networks, as well as data from Telecom and Security organizations. For instance, studies published in 2023 utilized data from sources like Twitter, Facebook, and specialized security institutions [70, 71, 75, 91–94].
At the same time, recent works on automated phishing email detection have utilised user-reported emails, providing a real-time perspective on phishing threats [130]. Genc and Jiang [125, 126] used emails from their personal or professional spam folders, capturing a more realistic and up-to-date snapshot of phishing attacks. Mehdi et al. [127] took an innovative approach by incorporating various techniques to develop their dataset: GPT-2 generated synthetic phishing emails, along with tools like TextAttack, TextFooler, and PWWS. This approach not only provides a diverse dataset but also ensures that the model is robust against sophisticated phishing techniques. Janez et al. [111] used data from the SPAM Archive[28], a continuously updated repository of spam emails.
[28] http://untroubled.org/spam/
Similarly, several studies on automated phishing SMS detection have extended their datasets by combining pre-existing publicly available datasets with additional sources to improve the robustness and generalizability of their models, for example incorporating additional data from Fake Base Stations [148], emails and YouTube comments [134], and user-reported data collected for research [149] or on Twitter [150]. Notably, Timko et al. [153] proposed a platform, SmishTank, where users can post phishing SMS messages, creating an ongoing and up-to-date repository for researchers. We also observed studies using data from specialized security agencies [151] and mobile security services [152], which offer a more targeted collection of smishing examples.
On Fake Review detection, one study [192] used YouTube transcripts to interpret false review exaggeration, showcasing an innovative approach to identifying fraudulent content in multimedia contexts. Others have adopted similar techniques and developed their own data collection methodologies to collect reviews from sources like app stores, e-commerce websites, and location and travel research platforms.
The overwhelming majority of studies focus on Fraudulent Recruitment detection using the same dataset. In contrast, three studies employed custom crawlers to gather data from various job posting websites in different countries, providing a more diverse and up-to-date perspective. Mahbub et al. [209] collected data from job posting sites in the UK, Tabassum et al. [210] gathered job postings from Bangladeshi sites, and Zhang et al. [211] sourced data from Chinese job sites. The use of up-to-date data collection from these sources offers some advantages. For one, by scraping data from active job posting sites, these studies can capture the most recent and relevant data, reflecting current fraudulent practices. Secondly, collecting data from multiple sources across different regions provides a richer and more varied dataset, which can enhance the robustness and generalizability of detection models. Finally, custom datasets often include a wider variety of job postings, including niche or less common types of employment scams, which can be critical for developing more comprehensive detection systems.
Overall, combining publicly available datasets with recent data from other sources, such as social media, user reports, and specialized agencies, may significantly enhance the robustness and relevance of detection models. This approach could ensure that models are exposed to a wider variety of fraud tactics and can adapt to new threats more effectively.
There are various advantages of using such dynamic data sources:
– Real-time Updates: Social networks and security organizations provide data that is continuously updated. This ensures that the detection models are trained on the most recent phishing URLs, making them more robust against new and emerging threats.
– Diverse Data: User-reported data from social networks and institutions often include a wide variety of phishing techniques and strategies. This diversity enhances the model's ability to generalize and detect a broader range of phishing attacks.
– Early Detection: These sources can help in the early detection of new phishing campaigns. Social networks, in particular, can act as early warning systems where new phishing URLs are often first reported.
– Enhanced Relevance: Data from Telecom and Security organizations are often more relevant to current threats and can include targeted phishing attacks that are not present in older datasets.
However, despite the advancements in data collection methods, there are still gaps. For instance, the data sources for several works across the online fraud categories were not clearly stated within the manuscripts. This lack of transparency can hinder reproducibility and the ability to compare results across different studies.
5.2 Methodologies
The methodologies employed across the various studies on phishing and fraudulent activities involve a wide range of AI techniques that incorporate NLP, machine learning, and deep learning. Most of these approaches involve extracting features using NLP techniques and then applying supervised machine learning (i.e. using labelled data) and deep learning algorithms to build binary classifiers.
For phishing URLs, machine learning and deep learning algorithms such as CNN, ANN, KNN, LSTM, NB, RF, DT, SVM, and XGBoost were commonly used, often with URL feature extraction through NLP methods like character counts and n-grams. We also found various works that applied hybrid approaches, combining multiple methodologies and demonstrating strong detection capabilities. Similarly, phishing email detection heavily relied on NLP for feature extraction (e.g., LDA, BERT) followed by AI models like RF, NB, and SVM for classification. Similar techniques were applied for phishing SMS detection.
In vishing (phishing phone calls), transcript analysis was primarily conducted using NLP and AI models, with some studies examining deepfake voice detection. User reports on phishing activities utilized models like BERT and RF, while fake review detection often involved sentiment analysis with methods like VADER.
Fraudulent job posting detection involved both machine learning and deep learning models, such as LR and Bi-LSTM. Romance fraud detection used sentiment detection methods, machine learning models (e.g. RF), and deep learning models (e.g. LSTM). For fraudulent investment and crypto manipulation, classic machine learning models like DT, XGBoost, and SVM were employed. Studies on fraudulent e-commerce and crowdfunding leveraged advanced NLP and machine learning techniques, including GPT-4 and LR, respectively.
Although the detection methodologies applied in the identified literature perform well, they are not without their limitations. Overall, popular machine learning algorithms like RF and SVM often rely heavily on the quality of the features extracted, which can be labor-intensive to engineer and may miss subtle indicators when dealing with large sets of data.
In addition, complex models based on deep learning techniques, like Bi-LSTM with VGG or hybrid approaches, can be computationally expensive and difficult to implement in real-time systems due to the large amount of data processing resources required.
Notably, models that work well for one kind of fraud might not generalize well to different fraud activities. For example, SMS messages are typically short, providing limited data for accurate feature extraction and classification. In addition, the natural language processing techniques used may struggle to capture features for understanding the context (semantics) or syntax of phishing content, leading to potential false positives and false negatives (i.e. incorrect misclassifications).
Models trained on specific languages or datasets may not perform well on emails in other languages or with different styles. Many of the challenges that may arise regarding the trained AI models are often a result of poor data quality. More specifically, effective feature extraction is critical but can be difficult due to the varied nature of how natural language is used. At the same time, the textual content used in online fraud activities - such as fraudulent emails, SMS, or job postings - keeps evolving, making it challenging for static models to remain effective over time.
Although collecting data from various websites, forums, social networks, and telecommunication operators for online fraud detection is invaluable, different platforms hold data inconsistently, or have unique features and user behaviors, complicating model generalization. Also, identifying fraudulent activities in real time is challenging due to the dynamic nature of these scams and the lack of real-time observability in the various data sources.
Overall, while these methodologies offer powerful tools for detecting phishing and fraudulent activities, they face challenges related to feature extraction, model complexity, generalizability, and data quality. Finally, one of the major issues highlighted in many of these studies is that the models rely on supervised machine learning, which requires labeled data. Creating labeled data is often challenging and time-consuming. As a result, researchers have been using existing labeled data, which may become less effective as fraud evolves over time (e.g. the content of phishing emails changing).
5.3 Recommendations
Datasets. The reliance on older, established datasets for training AI-based models is a double-edged sword. While they offer a solid foundation for model development and are used for comparative analysis, their static nature may limit their effectiveness in detecting evolving or emerging fraud trends, hence limiting their effectiveness for detecting new types of fraud. Therefore, there is a strong case for incorporating more dynamic and diverse data sources. Recent studies that use custom crawlers to gather data from a variety of online platforms covering a range of fraud types exemplify best practices in this area. These approaches provide real-time, relevant data that can significantly improve the adaptability and accuracy of detection models. Going forward, it is recommended that researchers consider combining established datasets with freshly collected data to create more robust and resilient AI models.
Methodologies Used. Overall, the majority of works used stand-alone machine learning and deep learning models for the detection of online fraud. In many cases, the use of NLP for feature extraction was under-reported or ignored. Work on online fraud activities that involve textual data should utilize more sophisticated NLP techniques, such as transformer-based models (e.g., BERT, GPT), for deeper semantic understanding and better context handling. Although LLMs on their own have limitations, such as generating hallucinated or inconsistent results, they are extremely powerful for context extraction.
Recent works demonstrate hybrid models that combine different AI and NLP methodologies to leverage their strengths, and these perform very well. Most existing studies use supervised machine learning models that require labeled data to detect fraudulent activities. Due to challenges in obtaining new labeled data, researchers often rely on existing datasets, which may not capture the content of new techniques and tactics employed by scammers. To address these challenges, further exploration of active learning, semi-supervised learning, and anomaly-based models that rely on small amounts of labelled data, or no labelled data at all, is needed. For example, unsupervised or semi-supervised anomaly detection techniques could be studied to identify outliers and novel fraud patterns that may not be present in the training data. Finally, we observed that almost none of these models had real-time applications. There should be a shift in focus where works attempt to optimize models for real-time processing to ensure timely detection and mitigation of fraudulent activities.
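As a hedged sketch of this anomaly-based direction, the example below trains an Isolation Forest on (assumed) legitimate messages only and flags unusual new messages for review; the data, feature pipeline, and thresholds are illustrative assumptions, not a method taken from the reviewed studies.

from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Messages assumed to be legitimate; no fraud labels are needed for training.
legitimate_messages = ["Invoice attached for last month's order.",
                       "Are we still meeting for lunch on Friday?"]

detector = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svd", TruncatedSVD(n_components=2, random_state=0)),   # use ~100 components on real data
    ("iforest", IsolationForest(contamination=0.05, random_state=0)),
])
detector.fit(legitimate_messages)

new_messages = ["URGENT: verify your account now or it will be suspended",
                "Minutes from today's team meeting"]
print(detector.predict(new_messages))   # -1 flags a potential anomaly, 1 looks normal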
Model Performance Reporting. While going through the literature identified for this SLR, we observed many works that reported only a subset of performance metrics for their models, frequently relying on accuracy. However, using accuracy on its own, especially when the dataset is unbalanced, can be misleading. In a dataset where one class has more observations than another (for example, having fewer phishing emails compared to non-phishing emails), a model could achieve very high accuracy simply by predicting the majority class (i.e. non-phishing emails) without doing a good job of detecting the phishing emails.
Overall, it is essential that researchers report a more comprehensive range of performance metrics beyond accuracy alone. These should include precision, recall, F1-score, and the AUC score or ROC curve. These metrics provide a more complete and nuanced picture of a model's performance, especially when dealing with imbalanced datasets. In addition, there is a need to conduct and report detailed error analysis to identify common failure cases and the reasons behind them. This can help in understanding the limitations of the model and areas for improvement. Finally, models need to be cross-validated to ensure the robustness of the reported performance metrics. For example, reporting results from multiple folds of data samples can provide a more reliable estimate of the model's performance.
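The short example below illustrates the point with synthetic numbers: a majority-class predictor on an imbalanced email set attains high accuracy while missing every phishing email, whereas stratified cross-validation over several metrics gives a more honest picture. All values, and the placeholder features, are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold, cross_validate

# 1,000 emails, only 50 phishing; a "model" that always predicts the majority class.
y_true = np.array([1] * 50 + [0] * 950)
y_pred = np.zeros_like(y_true)
print("accuracy :", accuracy_score(y_true, y_pred))                      # 0.95 despite catching nothing
print("precision:", precision_score(y_true, y_pred, zero_division=0))    # 0.0
print("recall   :", recall_score(y_true, y_pred))                        # 0.0 - every phishing email missed
print("f1       :", f1_score(y_true, y_pred, zero_division=0))           # 0.0

# Stratified cross-validation over several metrics gives a more reliable estimate.
X = np.random.RandomState(0).normal(size=(1000, 20))    # placeholder features for illustration
scores = cross_validate(LogisticRegression(max_iter=1000), X, y_true,
                        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
                        scoring=["precision", "recall", "f1", "roc_auc"])
print({k: round(v.mean(), 3) for k, v in scores.items() if k.startswith("test_")})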
Reproducibility. Many of the studies failed to clearly explain key and critical aspects of their model development. These include the features engineered and selected, the methods used for extracting features, the data used, the size of the dataset, the partition of the data into training and test sets, and the hyper-parameters used for tuning and training the models.
To this end, we recommend researchers provide access to the code, datasets, and pre-trained models used in their studies through platforms like GitHub, GitLab, or institutional repositories. This would help improve the reproducibility of their work. Researchers should also ensure that the methodology section is sufficiently detailed to allow others to replicate their study and model. This should include a clear description of the pre-processing steps, feature engineering and selection, model hyper-parameters, the training-testing data split, and the training protocols. Furthermore, the use of standardized frameworks and libraries for model implementation (e.g., TensorFlow, PyTorch) could improve reproducibility. Providing comprehensive documentation and setup instructions will further help others understand and reproduce the work easily.
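A minimal sketch of such reproducibility hygiene is shown below: fixing random seeds and recording the exact data split and hyper-parameters alongside the results. The file names and parameter values are illustrative assumptions.

import json
import random

import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
# torch.manual_seed(SEED)  # add framework-specific seeding if deep learning is used

config = {
    "seed": SEED,
    "test_size": 0.2,
    "vectorizer": {"analyzer": "word", "ngram_range": [1, 2]},
    "model": {"type": "RandomForestClassifier", "n_estimators": 300},
}

indices = np.arange(10000)    # indices into the full dataset
train_idx, test_idx = train_test_split(indices, test_size=config["test_size"], random_state=SEED)

with open("experiment_config.json", "w") as fh:
    json.dump({**config, "train_idx": train_idx.tolist(), "test_idx": test_idx.tolist()}, fh)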
Usability. Most of the studied papers are proofs-of-concept and, as a result, the usability of AI-based models has not been addressed. The effective application and use of AI-based approaches depend on successful usability studies that enable users to develop these models into toolkits and provide user feedback. Usability goals are generally determined by efficiency, effectiveness, engagement, error tolerance, and ease of use. It is thus imperative to ensure collaboration between the developers of AI-based tools and practitioners. However, while the field of technology usability assessment in front-line policing is growing [261, 262], there is a lack of usability studies considering the use of AI in preventive policing, including cybercrimes like online fraud.
Limitation/Bias discussion. The majority of studies do not discuss the limitations of their models or data. Researchers should clearly identify and report the limitations of the study, including any assumptions made, potential biases in the data, and limitations of the methodologies used. The overuse of existing labeled datasets could impact the performance of the models, potentially leading to issues such as overfitting. However, issues like this are not discussed in the majority of works in our SLR. Researchers should examine the distribution of the different classes and any potential sources of bias in the data used for training and testing. Lastly, there is limited discussion on the generalizability of the models across different datasets, contexts, and evolving fraud patterns. Hence, research studies should test their models in a variety of experiments to evaluate the generalization performance of the models.

6 Conclusion
In this systematic literature review, we have examined a wide range of studies focusing on the detection of various fraudulent activities using AI-based models that use techniques from Natural Language Processing, including machine learning and deep learning. Our goal was to examine the current state-of-the-art AI-based models used for development and training, investigate the sources of data, and assess how these models are evaluated for effectiveness in analysing and detecting fraudulent activities. Due to limited resources, we have restricted the data collection for the SLR to the years 2019-2024. The studies we identified highlight a focus on a wide range of fraudulent activities, with particular attention given to phishing attacks. However, there is growing interest in the use of more advanced AI, like ChatGPT, for creating deceptive content, as well as in its use as a tool for scam-baiting.
While significant attention has been given to building classification models that could be used to detect fraudulent activities, particularly with hybrid and advanced NLP techniques and deep learning, including LLMs, there remains considerable room for improvement. The key AI-model development areas that require attention include performance reporting, reproducibility, and transparency. Providing detailed performance reporting will help compare and evaluate different models. Improving reproducibility requires ensuring that studies can be replicated and that there is sufficient content for others to achieve this. Increasing transparency means providing clear information on how the AI-based models work and make decisions. This will help fraud practitioners to interpret and understand the models, and mitigate any biases in AI-based models.
Furthermore, most existing models rely heavily on labelled data and supervised machine-learning techniques. Future studies should give some attention to the application of unsupervised and semi-supervised machine learning for detecting fraud. Similarly, the data sources used for training these models are not suitable for capturing the dynamic nature of fraud. Future
Appendix
Performance
# URL Source Collection Method Models Used Best Model
P R A F1 AUC
[47](2019) phishtank.org* Custom crawler RF RF 0.98
[56](2019) phishtank.org and Alexa* UNK J48, RF, SMO, LR, MLP, RF 0.99 0.99
BN, SVM, AdaBoost
[58](2019) phishtank.org and Previous work [59] BNET, NB, J48, LR, RF, RF 0.98
openphish.com* MLP
[59](2019) Alexa, phishtank.org, UNK RF, SVM, NB, C4.5, RF 0.94
openphish.com, and JRip, PART
commoncrawl.org*
[60](2019) Alexa and phishtank.org* UNK SVM, KNN, DT, RF, Hybrid (GBoost, 0.97
GBDT, XGBoost, LGB, XGBoost, and
Hybrid (GBoost, XG- LightGBM)
Boost, and LightGBM)
[101](2019) UNK UNK C4.5, AdaBoost, KNN, Hybrid 0.98 0.98 0.98 0.99
RF, SMO, NB (XCS/UNK)
[61](2019) phishtank.org, Yandex Search Open dataset [263] DT, AdaBoost, Kstar, DT 0.96 0.97 0.97
API, and GitHub and custom crawler KNN, RF, SMO, NB
[71](2019) NetLab360 and Alexa* UNK LR, SVM, LSTM LSTM 0.98 0.98
[81](2020) Kaggle Open dataset [264] NB, KNN, SVM, RF, Hybrid (NN, RF, 0.95 0.98 0.97 0.96
Bagging, NN and Bagging)
[62](2020) Alexa and phishtank.org* Previous work [56] DNN, LSTM, CNN LSTM 0.99
[63](2020) Alexa, phishtank.org, Mende- Open dataset [265] NB, SVC, KNN SVC, KNN
ley, openphish.com, and and custom crawler
commoncrawl.org*
[64](2020) phishtank.org, URLHaus, Open datasets [266– SVM, RF RF 0.98 0.97 0.99
Majestic, Kaggle 268] and custom
crawler
[88](2020) Refer to open dataset GitHub open NB NB 1 0.95 0.97
dataset [269]
[77](2020) Kaggle* UNK open dataset MLP MLP 0.93
[42](2020) UNK* Previous works [270, RF, RNN, CNN CNN 0.94 0.91
271]
[84](2020) UCI ML Repository Open dataset [272] RF, DT, ANN, KNN RF 0.95
[53](2020) phishtank.org and Google UNK SVM, DT, LR, RF, XG- Hybrid (RF, XG- 0.98
Search* Boost, AdaBoost, ET Boost and ET)
[98](2020) phishtank.org* UNK KNN KNN 0.98
[100](2021) Kaggle and Canadian Institute UNK SVC, LR, KNN, NB, RF KNN 0.98 0.98 0.98 0.98
of Cybersecurity*
[67](2021) scammer.info and urlscan. Custom crawler LGBM LGBM 1 0.96 0.98 0.97 0.98
io*
[103](2021) UNK* Previous work [273] RF, DT, NB, LR RF 0.83
[96](2021) UNK* UNK LR, DT, NB LR, DT 1 1 1 1
[99](2021) Alexa and cryptoscamdb.org* Custom crawler NB, SVM, KNN, RF RF 0.98 0.95 0.97 0.96
[54](2021) Alexa, phishtank.org, and Custom crawler and XBoost, RF, SVM, KNN, ANN 0.96 0.97 0.97 0.97
Mendeley* open dataset [274] ANN, LR, DT, NB
[57](2021) phishtank.org, relbanks.com, Custom crawler ANFIS, NB, PART, J48, PART 0.98 0.99 0.99 0.99 0.98
and millersmiles.co.uk* JRip
[72](2022) Google Rankings and Custom crawler RF, DNN RF 1 0.99 0.99
whoscall.com
[40](2022) Alexa and phishtank.org* Custom crawler Bi-LSTM, Hybrid (Bi- Hybrid (Bi-LSTM 0.96 0.96 0.96
LSTM and CNN), Hybrid and VGG)
(Bi-LSTM and VGG)
[68](2022) who.is* Custom crawler BPNN, RBFN, SVM, NB 0.96
NB, DT, RF, KNN
[95](2022) UNK* UNK DT, RF RF 0.8
[69](2022) Alexa, UCI, phishtank.org UNK open dataset KNN, RF, DT, CBoost, CBoost 0.98 0.98
and Kaggle* LGBM, ABoost, VC
[45](2022) phishtank.org and UCI ML Custom Crawler and Hybrid (CNN) Hybrid (CNN) 0.97
Repository open dataset [272]
[87](2022) Canadian Institute for Cyber- Open dataset [275] LSTM LSTM 0.99 0.99 0.99
security
[73](2022) Kaggle* UNK RF, KNN, XGBoost XGBoost 0.96 0.96
[48](2022) phishtank.org* Custom crawler RF, DT RF 0.87
[86](2022) GitHub Open dataset [263] LR, NB, LSTM, GRU LSTM or GRU 0.95
[83](2022) UCI ML Repository Open datasets [272] AdaBoost, CART, SEM 0.98
and UNK GBoost, MLP, SVM, RF,
NB, SEM
[50](2022) phishtank.org and Alexa* Custom crawler DT, RF RF 0.87
[66](2022) Farsight SIE [276], spamhaus. UNK J48, RF RF
org, and surbl.org*
[52](2022) UNK* Previous work [273] Hybrid (CNN and (CNN and LSTM) 0.98 0.99 0.99 0.99 0.99
LSTM)
[85](2022) Mendeley and previous works Mendeley open Hybrid DLM, Stack Hybrid DLM 0.93 0.93
dataset [277] and pre- model, URLNet
vious works [278, 279]
[55](2022) phishtank.org and Alexa* UNK SVM, NB Hybrid (UNK) 0.99 0.99 0.99 0.99
[65](2022) Canadian Institute of Cyber- UNK SVM, RF RF 0.99 0.99 0.99
security, phishtank.org, and
Kaggle*
[93](2022) Twitch* Twitch API XGBoost, RF, NB RF 0.93 0.93 0.93
[41](2023) phishtank.org and Custom crawler CNN SharkEyes (CNN, 0.94 0.94 0.95 0.94
openphish.com W2V, GRU, Bi-
LSTM)
[92](2023) Tweets, spamhunter.io, and Twitter APi and cus- Hybrid (BERT and RF) Hybrid (BERT and 0.96 0.95 0.95
tweetfeed.live* tom crawler RF)
[70](2023) Twitter and Meta’s Twitter API and cus- UNK Pre-trained 0.96 0.97 0.97 0.96
crowdtangle.com* tom crawler model [60]
[74](2023) Kaggle* Data no longer avail- RF, LR, KNN RF 0.97 0.99 0.97
able
[91](2023) reddit.com/r/Scams/ and Custom crawler RF, XGBoost, SVM, BeyondPhish (RF 0.98
Palo Alto Networks* FFNN and XGBoost and
SVM and FFNN)
[75](2023) Kaggle and Canadian Institute Open datasets [275, KNN, LR KNN 0.9
of Cybersecurity 280]
[76](2023) Kaggle* UNK open dataset DT, KNN, RF, SVM SVM 0.99 0.96 0.98 0.97
[89](2023) Mendeley Open dataset [265] RF, J48, NB, KNN, LR RF 0.97 0.9 0.94
[78](2023) Kaggle Open dataset [281] DT, KNN, RF, GBoost UNK hybrid 0.98
[43](2023) Mendeley Open dataset [265] MLP, RF, RT, KNN, RF 0.98 0.98 0.98
SVM
[90](2023) UNK* Previous work [90] DT, KNN, SVM, NB, Hybrid (DT, SVM, 0.99 0.98 0.99 0.99
LR, XBoost, Aboost LR)
[44](2023) phishTank, Kaggle, and Majes- UNK BERT BERT 0.97 0.96 0.97 0.97
tic*
[46](2023) URLHaus and phishtank.org Custom crawler Custom rule based Custom rule based 0.93 0.93 0.93 0.93
[82](2023) UCI ML Repository and Open dataset [265, LGBM, XGBoost, Ad- Hybrid 0.97 0.97 0.97 0.97
Mendeley 272, 274] aBoost, CatBoost, GB, (BMLSELM)
Hybrid (BMLSELM)
[79](2023) Kaggle* UNK LR, NB, DT, SVM, RF, KNN 0.99
KNN
[80](2023) Kaggle* UNK RF, XGBoost, LightGBM RF 0.99 0.94 0.96 0.96
[97](2023) UNK* UNK RF, AdaBoost, XGBoost, RF 0.91
GBoost, KNN
[49](2023) phishtank.org* UNK LR, RF RF 0.93 0.79 0.96 0.85
[102](2023) PubMed Open dataset [282] RF, NB, LSTM, CNN UNK
[51](2023) phishtank.org and who.is* Custom crawler LSTM, CNN, Hybrid Hybrid (LSTM and 0.93
(LSTM and CNN) CNN)
[94](2024) Zhejiang Mobile Innovation UNK MBERT, XGBoost, MBERT 0.94 0.94 0.94
Research Institute* LBoost, LSTM, NB, LR,
RF, SVM, KNN
Table 3: Data sources and Detection Methods used for Phishing URL detection.
A single asterisk (*) indicates that the data is not publicly available. UNK
indicates Unclear details. Empty cells indicate missing values. P: Precision, R:
Recall, A: Accuracy, F1: F1 Score, AUC: Area under the Curve.
Performance
# URL Source Collection Method Models Used Best Model
P R A F1 AUC
[120](2019) UNK* UNK RF, KNN, ANN, SVM, RF 0.97
LR, NB
[130](2019) Spam emails received by a UNK GNB, DT, SVM, NN, RF RF, SVM 0.92 0.97 0.89
company*
[113](2019) cs.cmu.edu Open dataset [283] RF, KNN, SVM, DT RF 0.92 0.94 0.91
[114](2020) aclweb.org and previous work Open dataset [284] NB, DT, RF, SVM SVM 0.98 0.97 0.98 0.97
and previous
work [285]
[133](2020) Spam emails received by a UNK Clustering Clustering 0.89
company*
[115](2021) Open datasets* UNK LR, SVM, RF, XGBoost XGBoost
[105](2021) cs.cmu.edu Open dataset [283] LSTM LSTM 0.97
[107](2021) UNK UNK Various topic modelling N/A N/A N/A N/A N/A N/A
[125](2021) Author’s spam folder* Custom LDA, Jaccard N/A N/A N/A N/A N/A N/A
[110](2021) UNK* UNK RNN, LSTM, CNN, UNK
BERT
[111](2021) Questionnaires and Open dataset[29] NB, SVM, RF, LR NB 0.88 0.8
untroubled.org/spam
[132](2022) UNK* Custom crawler BOW (Rule based) BOW (Rule based) 0.99
[29] https://untroubled.org/spam/
[117](2022) Kaggle Open dataset [286] CBoost, LR, DT, RF, CBoost 0.97 0.96 0.97
GNB, SVM, KNN, XG-
Boost, LGBM, AdaBoost
[119](2022) Kaggle Open dataset [287] RF, NB, SVM, Ad- RF 0.99
aBoost, LR
[104](2022) Previous work* Previous work [288] RF, LR, SVM, MNB RF, LR, SVM, 0.95 0.95 0.95
[108](2022) GitHub, monkey.org, cs.cmu. Open datasets [283, K-Means, DBSCAN, and Agglomerative N/A N/A N/A N/A
edu 289, 290] Agglomerative Clustering Clustering
[131](2022) UNK UNK SVM, DT, LR, DNN, RF DT 1 1 1 1
[121](2022) UCI ML Repository* UNK NB, SVM, KNN, J48, DT DT 0.98
[106](2023) Previous work, cs.cmu.edu, Previous NB, SVM UNK
spamassassin.apache.org, work [291, 292], open
and csmining.org[30] datasets [283, 293],
and UNK
[127](2023) Synthetic data Data generated ALBERT, RoBERTa, ALBERT 0.94 0.95
using various tech- BERT, DBERT, SQ,
niques [294] YOSO
[116](2023) Kaggle* UNK RNN, LSTM, CNN RNN 0.99 0.92 0.99 0.95
[122](2023) UCI ML Repository Open dataset[295] BERT BERT 0.95 0.93 0.98 0.94
[123](2023) UCI ML Repository* Previous work [123] SVM, RF, NB RF 0.95
[109](2023) Previous work* Previous work [296] KNN, NB, DT, RF, BERT 0.97 0.97 0.97 0.97
SVM, LR, XGBoost,
BERT
[124](2023) UCI ML Repository* UNK CatBoost CatBoost 0.97 0.96 0.96 0.97 0.99
[297](2023) Previous work, Kaggle, and Previous work [298– Various topic modelling N/A N/A N/A N/A N/A N/A
monkey.org 300] and open
datasets [290, 301]
[118](2023) Kaggle Open dataset [302] MLP, DT, LR, RF, KNN, MLP, SVM 0.99 0.99 0.99 0.99
SVM
[112](2024) Kaggle Open dataset [303] GPT-3.5, GPT-4, Cus- Custom (Cy- 0.97
tom (CyberGPT) berGPT)
[126](2024) UNK* UNK GPT-3.5, GPT-4 UNK
Table 4: Data sources and Detection Methods used for Phishing Email detec-
tion. A single asterisk (*) indicates that the data is not publicly available. UNK
indicates Unclear details. Empty cells indicate missing values. P: Precision, R:
Recall, A: Accuracy, F1: F1 Score, AUC: Area under the Curve.
Performance
# URL Source Collection Method Models Used Best Model
P R A F1 AUC
[136](2019) Previous work* Previous work [285] SVM, LR, NN, NB, RF RF 0.98
[148](2020) 360 Mobile Safe* UNK SVM, NB, LR, RF SVM 0.96 0.96 0.96
[152](2021) 360 Mobile Safe* UNK LR, DT, NB, SVM LR 0.93 0.93 0.93
[146](2021) UCI ML repository Open dataset [295] CNN, GRU, MLP, SVM, XGBoost, Hybrid (CNN, GRU) Hybrid (CNN, GRU) 0.99 0.96 0.98
[149](2022) https://www.datafountain.cn/* Custom crawler CNN, BERT, RoBERTa, ChineseBERT Hybrid (Semorph/UNK) 0.96 0.84 0.89
Table 5: Data sources and Detection Methods used for Phishing SMS detection. A single asterisk (*) indicates that the data is not publicly available. UNK indicates Unclear details. Empty cells indicate missing values. P: Precision, R: Recall, A: Accuracy, F1: F1 Score, AUC: Area under the Curve.
Performance
# Data Collection Method Models Used Best Model
P R A F1 AUC
[216](2019) Twitter user profiles Open dataset [305] GCNN, MLP, BP GCNN 0.94
[222](2019) Facebook user profiles* UNK ID3, KNN, SVM ID3 0.98 0.98 0.97
[214](2019) Twitter User Profiles Open dataset [306] SVM, RF, MADAFE MADAFE
(NN and LR)
[223](2019) Twitter and Facebook (UNK)* UNK HDBSCAN HDBSCAN N/A N/A N/A N/A N/A
[212](2020) Tweets* Twitter API KNN, RF, NB, DT RF 0.95
[226](2021) Sina Weibo User profiles* Custom crawler CatBoost, RF CatBoost 0.87
[229](2021) Twitter user profiles Open dataset [307] NB, QDA, SVM, KNN, RF 0.87 0.88 0.94
RF, NN
[221](2021) Instagram user profiles* Instagram API RF, AdaBoost, MLP, RF 0.99 0.98 0.98 0.98 0.98
ANN, SGD
[224](2022) Facebook user profiles* Manual collection ANN, SVM, RF ANN 0.96
[219](2022) Instagram user profiles* UNK LR, KNN, SVM, RF, NB RF 0.99 0.97 0.94 0.98
[218](2022) Twitter user profiles Open dataset [308] GA, GP GP 0.76 0.78
[217](2022) Twitter user profiles Open dataset [309] SVM, CNB, BNB, MP, DT, RF TweezBot (Unclear) 0.99 0.93 0.98 0.99
[225](2023) YouTube video and user meta- YouTube API Sentence-BERT, YouTuBERT 0.63 0.81 0.90 0.71
data and comments* RoBERTa, YouTuBERT (LLM and DB-
SCAN)
[227](2023) List of names* UNK NB, KNN, SVM, LR, RF NB 0.94 0.94 0.95 0.94
[213](2023) Twitter user profiles* UNK NB, DT, NN, Ensem- Ensemstack 0.98
stack
[220](2023) Instagram user profiles* UNK ANN ANN 0.74
[228](2023) Instagram user profiles* UNK LR, DT, RF RF 0.9
[215](2023) Twitter user profiles and tweets* Twitter API LR LR 0.93 0.93 0.93 0.93
Table 6: Data sources and Detection Methods used for Fake User detection.
A single asterisk (*) indicates that the data is not publicly available. UNK
indicates Unclear details. Empty cells indicate missing values. P: Precision, R:
Recall, A: Accuracy, F1: F1 Score, AUC: Area under the Curve.
Performance
# Job postings source Collection Method Models Used Best Model
P R A F1 AUC
[201](2019) Kaggle Open dataset [310] J48, LR, RF Ensemble 0.95 0.94
[197](2020) Kaggle Open dataset [310] GLoVE GLoVE 0.99
[202](2021) Kaggle Open dataset [310] ANN ANN 0.91 0.96 0.93
[204](2021) Kaggle Open dataset [310] LR LR 0.89 0.92 0.96
[208](2021) Kaggle Open dataset [310] LightGBM, LR, DT, LightGBM 0.93 0.94 0.95 0.93
XGBoost, AdaBoost
[210](2021) job.com.bd, bdjobstoday, and Custom crawler LR, AdaBoost, DT, LightGBM or 0.95
deshijob* RF, VC, LGBM, GB GBoost
[198](2021) Kaggle Open dataset [310] KNN, RF, DT, SVM, DNN 0.97
NB, DNN
[12](2022) Kaggle Open dataset [310] RF, LR, SVM, ETC, ETC 0.99
KNN, MP
[207](2022) Kaggle Open dataset [310] KNN, RF KNN 0.79 0.73 0.98 0.76
[209](2022) SEEK, Glassdoor, Indeed, and Custom crawler RF, JRip, NB, J48 RF 0.82 0.69 0.91
Gumtree job postings*
[194](2022) Kaggle Open dataset [310] GRU GRU 0.93
[195](2022) Kaggle Open dataset [310] LR, NB, MLP, KNN, RF 0.98 0.97 0.97 0.98
RF, DT, Adaboost,
GB, NLP
[196](2022) Kaggle Open dataset [310] RF, SVM, Bi-LSTM Bi-LSTM 0.98 0.98
[311](2023) Kaggle Open dataset [310] SVM, NB, RF, Bi- RF
LSTM, LR
[199](2023) Kaggle Open dataset [310] RF, XBoost, Light- XGBoost 0.95 0.9 0.96 0.92
GBM, CatBoost,
DT
[200](2023) Kaggle Open dataset [310] RF, SVM, NB, Ensem- RF 0.98
ble
[203](2023) Kaggle Open dataset [310] RF, NB, SVM, DT, RF 0.97
KNN
[205](2023) Kaggle Open dataset [310] LR, DT, RF, NB, GLM GLM 0.96 0.78 0.86 0.98
[206](2023) Kaggle Open dataset [310] AdaBoost, XGBoost, AdaBoost 0.99 0.97 0.98 0.98
RF, Voting
[211](2023) Boss Zhipin, Liepin, 51job* Custom crawler NB, XGBoost, SVM, DRLM (DT and 0.98 0.94 0.92
LightGBM, DT, RF RF and Light-
GBM)
Table 7: Data sources and Detection Methods used for Fraudulent Recruitment
detection. The asterisk (*) indicates that the data is not publicly available.
Empty cells indicate missing values. P: Precision, R: Recall, A: Accuracy, F1:
F1-Score, AUC: Area Under the Curve.
Performance
# Data Collection Method Models Used Best Model
P R A F1 AUC
[185](2020) Amazon reviews* Custom crawler SVM, LR, RF, DT, 3LP 0.98 0.98 0.98 0.98
GNBSGD, KNN, 3LP,
4LP, XGBoost
[184](2020) Amazon reviews* Custom crawler SVM, KNN, NB, En- Ensemble 0.81 0.81 0.81 0.81
semble
[177](2021) Yelp and JD.com reviews Open dataset[312, 313] and GraphSAGE, Cluster- C-FATH (Un- 0.68- 0.95-
JD.com custom crawler GCN, HGT, C-FATH clear)* 0.87 0.97
(Custom)
[187](2021) Amazon reviews Open dataset [314] RF RF* 1 0.85 0.98
[171](2021) Yelp reviews* UNK CNN, SVM, LR, MLP CNN 0.93 0.92 0.92
[175](2022) Yelp reviews Open dataset [315, 316] WaveNet, LDA WaveNet, LDA N/A N/A N/A N/A N/A
[189](2022) Amazon reviews* UNK BERT, VADER, LR 0.81
LSTM, WordNet,
SGD, SVM, LR
[190](2022) Amazon hotel reviews* UNK KNN, NB, SVM SVM* 0.93
[180](2022) Smartphone App reviews* Web Scraping LDA, keyATM keyATM N/A N/A N/A N/A N/A
[181](2022) Smartphone App reviews Open dataset [317, 318] SVM, DT, NN, LR, SVM 0.94 0.84 0.89
GBT
[179](2022) Google Play reviews* Custom crawler DT, RF, MLP MLP 0.97
[182](2022) Amazon reviews* UNK CNN, SVM, NB CNN 1 1 1
[174](2022) Hotel reviews Open dataset [316, 319] SVM, KNN, LR SKL (SVM and 0.95
KNN and LR)
[173](2022) Yelp reviews Open dataset [312] Bi-LSTM Bi-LSTM 0.89
[188](2023) Amazon book reviews* UNK SVM, LR LR 0.86
[178](2023) Reviews Open dataset [316] and CNN, LSTM, KNN, CNN, LSTM 0.93
UNK NB, SVM, W2V
[183](2023) Amazon reviews* Custom crawler AdaBoost, RF, LR, RF 0.99 0.99 0.99 0.99
SVM, KNN
[176](2023) Yelp reviews* UNK SVM, MLP, CNN, LR CNN 0.85 0.85 0.85 0.85
[191](2023) Yelp reviews* UNK NB, LR, SVM, DT SVM 0.96 0.98 0.97
[172](2023) Yelp reviews Open dataset [312] GPT-3, BERT, RF, GPT-3 0.73 0.64 0.68 0.75
XGBoost
[193](2023) Undefined reviews* UNK ANN, CNN, LR, SVM, LR 0.89
NB, KNN, RF, DT,
SGD
[186](2023) Hotel reviews* UNK SVM SVM
[192](2024) Product reviews* YouTube API SVM, LR LR 0.74 0.99 0.85 0.95
Table 8: Data sources and Detection Methods used for Fake Review detection.
A single asterisk (*) indicates that the data is not publicly available. UNK
indicates Unclear details. Empty cells indicate missing values. P: Precision, R:
Recall, A: Accuracy, F1: F1-Score, AUC: Area Under the Curve.
26. Cumming, D., Hornuf, L., Karami, M., Schweizer, D.: Disentangling Conference on Computer Applications & Information Security
crowdfunding from fraudfunding. Journal of Business Ethics, 1–26 (ICCAIS), pp. 1–6 (2019). IEEE. PHISHING URLs
(2021) 48. Mandadi, A., Boppana, S., Ravella, V., Kavitha, R.: Phishing website
27. FBI: Charity and Disaster Fraud. detection using machine learning. In: 2022 IEEE 7th International
https://www.fbi.gov/how-we-can-help-you/scams-and- Conference for Convergence in Technology (I2CT), pp. 1–4 (2022).
safety/common-scams-and-crimes/charity-and-disaster-fraud. IEEE. PHISHING URLs
Accessed: 2024-07-09 49. Jha, A.K., Muthalagu, R., Pawar, P.M.: Intelligent phishing website
28. Hong, G., Yang, Z., Yang, S., Liaoy, X., Du, X., Yang, M., Duan, H.: detection using machine learning. Multimedia Tools and Applications
Analyzing ground-truth data of mobile gambling scams. In: 2022 82(19), 29431–29456 (2023). PHISHING URLs
IEEE Symposium on Security and Privacy (SP), pp. 2176–2193 50. Marimuthu, S.K., Kalampatti Gopalasamy, S., Ben-Othman, J.:
(2022). IEEE Intelligent antiphishing framework to detect phishing scam: A hybrid
29. Brody, R.G., Haynes, C.M., Mejia, H.: Income tax return scams and classification approach. Software: Practice and Experience 52(2),
identity theft. Accounting and Finance Research 3(1), 90–95 (2014) 459–481 (2022). PHISHING URLs
30. Mirza-Davies, J.: Pension scams. House of Commons (2023) 51. Adebowale, M.A., Lwin, K.T., Hossain, M.A.: Intelligent phishing
31. Taylor, P.: Amount of data created, consumed, and stored 2010-2020, detection scheme using deep learning algorithms. Journal of
with forecasts to 2025. https://www.statista.com/statistics/ Enterprise Information Management 36(3), 747–766 (2023).
871513/worldwide-data-created/. Accessed: 2024-07-01 (2023) PHISHING URLs
32. Rong, X.: word2vec parameter learning explained. CoRR 52. Shaiba, H., Alzahrani, J.S., Eltahir, M.M., Marzouk, R., Mohsen, H.,
abs/1411.2738 (2014). 1411.2738 Hamza, M.A.: Hunger search optimization with hybrid deep learning
33. Stanford NLP Group: GloVe: Global Vectors for Word Representation. enabled phishing detection and classification model. Computers
https://nlp.stanford.edu/projects/glove/. Accessed: Materials & Continua 73(3), 6425–6441 (2022). PHISHING URLs
2024-07-03 53. Rao, R.S., Pais, A.R.: Two level filtering mechanism to detect
34. Zhao, W.X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., phishing sites using lightweight visual similarity approach. Journal of
Zhang, B., Zhang, J., Dong, Z., Du, Y., Yang, C., Chen, Y., Chen, Ambient Intelligence and Humanized Computing 11(9), 3853–3872
Z., Jiang, J., Ren, R., Li, Y., Tang, X., Liu, Z., Liu, P., Nie, J.-Y., (2020). PHISHING URLs
Wen, J.-R.: A Survey of Large Language Models (2023). 2303.18223. 54. Salloum, S., Gaber, T., Vadera, S., Shaalan, K.: Phishing website
https://arxiv.org/abs/2303.18223 detection from urls using classical machine learning ann model. In:
35. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., International Conference on Security and Privacy in Communication
Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. Systems, pp. 509–523 (2021). Springer. PHISHING URLs
CoRR abs/1706.03762 (2017). 1706.03762 55. Orunsolu, A.A., Sodiya, A.S., Akinwale, A.: A predictive model for
36. OpenAI: ChatGPT. Accessed: 2024-07-03. https://chatgpt.com/ phishing detection. Journal of King Saud University-Computer and
37. Nightingale, A.: A guide to systematic literature reviews. Surgery Information Sciences 34(2), 232–247 (2022). PHISHING URLs
(Oxford) 27(9), 381–384 (2009) 56. Rao, R.S., Pais, A.R.: Detection of phishing websites using an
38. Moher, D., Liberati, A., Tetzlaff, J., Altman, D.G., Group, P., et al.: efficient feature-based machine learning framework. Neural
Preferred reporting items for systematic reviews and meta-analyses: Computing and applications 31, 3851–3873 (2019). PHISHING URLs
the prisma statement. International journal of surgery 8(5), 336–341 57. Barraclough, P.A., Fehringer, G., Woodward, J.: Intelligent
(2010) cyber-phishing detection for online. computers & security 104,
39. Schmitt, M., Flechais, I.: Digital deception: Generative artificial 102123 (2021). PHISHING URLs
intelligence in social engineering and phishing. arXiv preprint 58. Almseidin, M., Zuraiq, A.A., Al-Kasassbeh, M., Alnidami, N.:
arXiv:2310.13715 (2023). GENAI SE ATTACK Phishing detection based on machine learning and feature selection
40. Li, J., Wang, D., Zhao, C., Tang, J.: Mui-vb: Malicious url methods (2019). PHISHING URLs
identification model combining vgg and bi-lstm. In: Proceedings of 59. Chiew, K.L., Tan, C.L., Wong, K., Yong, K.S., Tiong, W.K.: A new
the 2022 3rd International Conference on Control, Robotics and hybrid ensemble feature selection framework for machine
Intelligent System, pp. 141–148 (2022). PHISHING URLs learning-based phishing detection system. Information Sciences 484,
41. Vo Quang, M., Bui Tan Hai, D., Tran Kim Ngoc, N., Ngo 153–166 (2019). PHISHING URLs
Duc Hoang, S., Nguyen Huu, Q., Phan The, D., Pham, V.-H.: 60. Li, Y., Yang, Z., Chen, X., Yuan, H., Liu, W.: A stacking model using
Shark-eyes: A multimodal fusion framework for multi-view-based url and html features for phishing webpage detection. Future
phishing website detection. In: Proceedings of the 12th International Generation Computer Systems 94, 27–39 (2019). PHISHING URLs
Symposium on Information and Communication Technology, pp. 61. Sahingoz, O.K., Buber, E., Demir, O., Diri, B.: Machine learning
793–800 (2023). PHISHING URLs based phishing detection from urls. Expert Systems with Applications
42. Al-Milli, N., Hammo, B.H.: A convolutional neural network model to 117, 345–357 (2019). PHISHING URLs
detect illegitimate urls. In: 2020 11th International Conference on 62. Somesha, M., Pais, A.R., Rao, R.S., Rathour, V.S.: Efficient deep
Information and Communication Systems (ICICS), pp. 220–225 learning techniques for the detection of phishing websites. Sādhanā
(2020). IEEE. PHISHING URLs 45, 1–18 (2020). PHISHING URLs
43. Aslam, S., Nassif, A.B.: Phish-identifier: Machine learning based 63. Tharani, J.S., Arachchilage, N.A.: Understanding phishers’ strategies
classification of phishing attacks. In: 2023 Advances in Science and of mimicking uniform resource locators to leverage phishing attacks:
Engineering Technology International Conferences (ASET), pp. 1–6 A machine learning approach. Security and Privacy 3(5), 120 (2020).
(2023). IEEE. PHISHING URLs PHISHING URLs
44. Jishnu, K., Arthi, B.: Enhanced phishing url detection using leveraging 64. Do Xuan, C., Nguyen, H.D., Tisenko, V.N.: Malicious url detection
bert with additional url feature extraction. In: 2023 5th International based on machine learning. International Journal of Advanced
Conference on Inventive Research in Computing Applications Computer Science and Applications 11(1) (2020). PHISHING URLs
(ICIRCA), pp. 1745–1750 (2023). IEEE. PHISHING URLs 65. Pradeepa, G., Devi, R.: Lightweight approach for malicious domain
45. Jaber, A.N., Fritsch, L., Haugerud, H.: Improving phishing detection detection using machine learning. Scientific and Technical Bulletin of
with the grey wolf optimizer. In: 2022 International Conference on Information Technologies, Mechanics and Optics 22(2), 262–268
Electronics, Information, and Communication (ICEIC), pp. 1–6 (2022). PHISHING URLs
(2022). IEEE. PHISHING URLs 66. Fernandez, S., Korczyński, M., Duda, A.: Early detection of spam
46. Rafsanjani, A.S., Kamaruddin, N.B., Rusli, H.M., Dabbagh, M.: domains with passive dns and spf. In: International Conference on
Qsecr: Secure qr code scanner according to a novel malicious url Passive and Active Network Measurement, pp. 30–49 (2022).
detection framework. IEEE Access (2023). PHISHING URLs Springer. PHISHING URLs
47. Alswailem, A., Alabdullah, B., Alrumayh, N., Alsedrani, A.: Detecting 67. Chen, Y.-C., Chen, J.-L., Ma, Y.-W.: Ai@ tss-intelligent technical
phishing websites using machine learning. In: 2019 2nd International support scam detection system. Journal of Information Security and
Applications 61, 102921 (2021). PHISHING URLs collection of phishing data to fill a research gap in the phishing
68. Shalke, C.J., Achary, R.: Social engineering attack and scam domain. International Journal of Advanced Computer Science and
detection using advanced natural langugae processing algorithm. In: Applications 13(5) (2022). PHISHING URLs
2022 6th International Conference on Trends in Electronics and 86. Villanueva, A., Atibagos, C., De Guzman, J., Cruz, J.C.D., Rosales,
Informatics (ICOEI), pp. 1749–1754 (2022). IEEE. PHISHING URLs M., Francisco, R.: Application of natural language processing for
69. Puri, N., Saggar, P., Kaur, A., Garg, P.: Application of ensemble phishing detection using machine and deep learning models. In: 2022
machine learning models for phishing detection on web networks. In: International Conference on ICT for Smart Society (ICISS), pp. 01–06
2022 Fifth International Conference on Computational Intelligence (2022). IEEE. PHISHING URLs
and Communication Technologies (CCICT), pp. 296–303 (2022). 87. Vecile, S., Lacroix, K., Grolinger, K., Samarabandu, J.: Malicious and
doi:10.1109/CCiCT56684.2022.00062. PHISHING URLs benign url dataset generation using character-level lstm models. In:
70. Saha Roy, S., Karanjit, U., Nilizadeh, S.: Phishing in the free waters: 2022 IEEE Conference on Dependable and Secure Computing (DSC),
A study of phishing attacks created using free website building pp. 1–8 (2022). IEEE. PHISHING URLs
services. In: Proceedings of the 2023 ACM on Internet Measurement 88. Kumar, J., Santhanavijayan, A., Janet, B., Rajendran, B.,
Conference, pp. 268–281 (2023). PHISHING URLs Bindhumadhava, B.: Phishing website classification and detection
71. Liang, Y., Yan, X.: Using deep learning to detect malicious urls. In: using machine learning. In: 2020 International Conference on
2019 IEEE International Conference on Energy Internet (ICEI), pp. Computer Communication and Informatics (ICCCI), pp. 1–6 (2020).
487–492 (2019). IEEE. PHISHING URLs IEEE. PHISHING URLs
72. Chen, S.-W., Chen, P.-H., Tsai, C.-T., Liu, C.-H.: Development of 89. Zin, N.A.B.M., Ab Razak, M.F., Firdaus, A., Ernawan, F., Zulkifli,
machine learning based fraudulent website detection scheme. In: 2022 N.S.A.: Machine learning technique for phishing website detection. In:
IEEE 5th International Conference on Knowledge Innovation and 2023 IEEE 8th International Conference On Software Engineering and
Invention (ICKII), pp. 108–110 (2022). IEEE. PHISHING URLs Computer Systems (ICSECS), pp. 235–239 (2023). IEEE. PHISHING
73. Gu, J., Xu, H.: An ensemble method for phishing websites detection URLs
based on xgboost. In: 2022 14th International Conference on 90. Pathak, P., Shrivas, A.K.: Classification of phishing website using
Computer Research and Development (ICCRD), pp. 214–219 (2022). machine learning based proposed ensemble model. In: 2022 OPJU
IEEE. PHISHING URLs International Technology Conference on Emerging Technologies for
74. Jha, R., Kunwar, G.: Machine learning based url analysis for phishing Sustainable Development (OTCON), pp. 1–6 (2023). IEEE.
detection. In: 2023 6th International Conference on Information PHISHING URLs
Systems and Computer Networks (ISCON), pp. 1–5 (2023). IEEE. 91. Bitaab, M., Cho, H., Oest, A., Lyu, Z., Wang, W., Abraham, J.,
PHISHING URLs Wang, R., Bao, T., Shoshitaishvili, Y., Doupé, A.: Beyond phish:
75. Mehndiratta, M., Jain, N., Malhotra, A., Gupta, I., Narula, R.: Toward detecting fraudulent e-commerce websites at scale. In: 2023
Malicious url: Analysis and detection using machine learning. In: 2023 IEEE Symposium on Security and Privacy (SP), pp. 2566–2583
10th International Conference on Computing for Sustainable Global (2023). IEEE. PHISHING URLs
Development (INDIACom), pp. 1461–1465 (2023). IEEE. PHISHING 92. Nakano, H., Chiba, D., Koide, T., Fukushi, N., Yagi, T., Hariu, T.,
URLs Yoshioka, K., Matsumoto, T.: Canary in twitter mine: collecting
76. Jain, S., Gupta, C.: A support vector machine learning technique for phishing reports from experts and non-experts. In: Proceedings of the
detection of phishing websites. In: 2023 6th International Conference 18th International Conference on Availability, Reliability and Security,
on Information Systems and Computer Networks (ISCON), pp. 1–6 pp. 1–12 (2023). PHISHING URLs
(2023). IEEE. PHISHING URLs 93. Janet, B., Nikam, A., et al.: Real time malicious url detection on
77. Saha, I., Sarma, D., Chakma, R.J., Alam, M.N., Sultana, A., Hossain, twitch using machine learning. In: 2022 International Conference on
S.: Phishing attacks detection using deep learning approach. In: 2020 Electronics and Renewable Systems (ICEARS), pp. 1185–1189
Third International Conference on Smart Systems and Inventive (2022). IEEE. PHISHING URLs
Technology (ICSSIT), pp. 1180–1185 (2020). IEEE. PHISHING URLs 94. Yu, B., Tang, F., Ergu, D., Zeng, R., Ma, B., Liu, F.: Efficient
78. Kumar, S., Dubey, G.P., Gupta, B.: Hybrid machine learning classification of malicious urls: M-bert-a modified bert variant for
technique for prediction of phishing websites. In: 2023 6th enhanced semantic understanding. IEEE Access (2024). PHISHING
International Conference on Information Systems and Computer URLs
Networks (ISCON), pp. 1–4 (2023). IEEE. PHISHING URLs 95. Alkawaz, M.H., Steven, S.J., Mohammad, O.F., Johar, M.G.M.:
79. P, A.N., V, H.V., H, S.P.: Phishing perception and prediction. In: Identification and analysis of phishing website based on machine
2023 4th International Conference on Innovative Trends in learning methods. In: 2022 IEEE 12th Symposium on Computer
Information Technology (ICITIIT), pp. 1–6 (2023). Applications & Industrial Electronics (ISCAIE), pp. 246–251 (2022).
doi:10.1109/ICITIIT57246.2023.10068585. PHISHING URLs IEEE. PHISHING URLs
80. DR, U.S., Patil, A., et al.: Malicious url detection and classification 96. El-Din, A.E., Hemdan, E.E.-D., El-Sayed, A.: Malweb: An efficient
analysis using machine learning models. In: 2023 International malicious websites detection system using machine learning
Conference on Intelligent Data Communication Technologies and algorithms. In: 2021 International Conference on Electronic
Internet of Things (IDCIoT), pp. 470–476 (2023). IEEE. PHISHING Engineering (ICEEM), pp. 1–6 (2021). IEEE. PHISHING URLs
URLs 97. Kundra, D.: Identification and classification of malicious and benign
81. Zamir, A., Khan, H.U., Iqbal, T., Yousaf, N., Aslam, F., Anjum, A., url using machine learning classifiers. In: 2023 7th International
Hamdani, M.: Phishing web site detection using diverse machine Conference on I-SMAC (IoT in Social, Mobile, Analytics and
learning algorithms. The Electronic Library 38(1), 65–80 (2020). Cloud)(I-SMAC), pp. 160–165 (2023). IEEE. PHISHING URLs
PHISHING URLs 98. Chen, J.-L., Ma, Y.-W., Huang, K.-L.: Intelligent visual
82. Kalabarige, L.R., Rao, R.S., Pais, A.R., Gabralla, L.A.: A boosting similarity-based phishing websites detection. Symmetry 12(10), 1681
based hybrid feature selection and multi-layer stacked ensemble (2020). PHISHING URLs
learning model to detect phishing websites. IEEE Access (2023). 99. Ou, H., Guo, Y., Huang, C., Zhao, Z., Guo, W., Fang, Y., Huang, C.:
PHISHING URLs No pie in the sky: The digital currency fraud website detection. In:
83. Mohammed, B.A., Al-Mekhlafi, Z.G.: Accuracy of phishing websites International Conference on Digital Forensics and Cyber Crime, pp.
detection algorithms by using three ranking techniques. In: IJCSNS, 176–193 (2021). Springer. PHISHING URLs
vol. 22, p. 272 (2022). PHISHING URLs 100. Raja, A.S., Vinodini, R., Kavitha, A.: Lexical features based malicious
84. Priya, S., Selvakumar, S., Velusamy, R.L.: Gravitational search based url detection using machine learning techniques. Materials Today:
feature selection for enhanced phishing websites detection. In: 2020 Proceedings 47, 163–166 (2021). PHISHING URLs
2nd International Conference on Innovative Mechanisms for Industry 101. Yadollahi, M.M., Shoeleh, F., Serkani, E., Madani, A., Gharaee, H.:
Applications (ICIMIA), pp. 453–458 (2020). IEEE. PHISHING URLs An adaptive machine learning based approach for phishing detection
85. Ariyadasa, S., Fernando, S., Fernando, S.: Phishrepo: a seamless using hybrid features. In: 2019 5th International Conference on Web
Research (ICWR), pp. 281–286 (2019). IEEE. PHISHING URLs
102. Nagy, N., Aljabri, M., Shaahid, A., Ahmed, A.A., Alnasser, F., Almakramy, L., Alhadab, M., Alfaddagh, S.: Phishing urls detection using sequential and parallel ml techniques: comparative analysis. Sensors 23(7), 3467 (2023). PHISHING URLs
103. Geyik, B., Erensoy, K., Kocyigit, E.: Detection of phishing websites from urls by using classification techniques on weka. In: 2021 6th International Conference on Inventive Computation Technologies (ICICT), pp. 120–125 (2021). IEEE. PHISHING URLs
104. Al-Ghamdi, N., Alsubait, T.: Digital forensics and machine learning to fraudulent email prediction. In: 2022 Fifth National Conference of Saudi Computers Colleges (NCCC), pp. 99–106 (2022). IEEE. PHISHING-Emails
105. Bhatti, P., Jalil, Z., Majeed, A.: Email classification using lstm: A deep learning technique. In: 2021 International Conference on Cyber Warfare and Security (ICCWS), pp. 100–105 (2021). IEEE. PHISHING-Emails
106. Jáñez-Martino, F., Alaiz-Rodríguez, R., González-Castro, V., Fidalgo, E., Alegre, E.: A review of spam email detection: analysis of spammer strategies and the dataset shift problem. Artificial Intelligence Review 56(2), 1145–1173 (2023). PHISHING-Emails
107. Stojnic, T., Vatsalan, D., Arachchilage, N.A.: Phishing email strategies: understanding cybercriminals’ strategies of crafting phishing emails. Security and Privacy 4(5), 165 (2021). PHISHING-Emails
108. Saka, T., Vaniea, K., Kökciyan, N.: Context-based clustering to mitigate phishing attacks. In: Proceedings of the 15th ACM Workshop on Artificial Intelligence and Security, pp. 115–126 (2022). PHISHING-Emails
109. Jena, D., Kumari, A., Tejaswini, K., Ankita, A., Kumar, B.: Malicious spam detection to avoid vicious attack. In: 2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT), pp. 1–7 (2023). IEEE. PHISHING-Emails
110. Jonker, R.A.A., Poudel, R., Pedrosa, T., Lopes, R.P.: Using natural language processing for phishing detection. In: International Conference on Optimization, Learning Algorithms and Applications, pp. 540–552 (2021). Springer. PHISHING-Emails
111. Jáñez-Martino, F., Alaiz-Rodríguez, R., González-Castro, V., Fidalgo, E.: Trustworthiness of spam email addresses using machine learning. In: Proceedings of the 21st ACM Symposium on Document Engineering, pp. 1–4 (2021). PHISHING-Emails
112. Chataut, R., Gyawali, P.K., Usman, Y.: Can ai keep you safe? a study of large language models for phishing detection. In: 2024 IEEE 14th Annual Computing and Communication Workshop and Conference (CCWC), pp. 0548–0554 (2024). IEEE. PHISHING-Emails
113. Marková, E., Bajtoš, T., Sokol, P., Mézešová, T.: Classification of malicious emails. In: 2019 IEEE 15th International Scientific Conference on Informatics, pp. 000279–000284 (2019). IEEE. PHISHING-Emails
114. Al-Haddad, R., Sahwan, F., Aboalmakarem, A., Latif, G., Alufaisan, Y.M.: Email text analysis for fraud detection through machine learning techniques. In: 3rd Smart Cities Symposium (SCS 2020), vol. 2020, pp. 613–616 (2020). IET. PHISHING-Emails
115. Islam, M.K., Al Amin, M., Islam, M.R., Mahbub, M.N.I., Showrov, M.I.H., Kaushal, C.: Spam-detection with comparative analysis and spamming words extractions. In: 2021 9th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), pp. 1–9 (2021). IEEE. PHISHING-Emails
116. Ramprasath, J., Priyanka, S., Manudev, R., Gokul, M.: Identification and mitigation of phishing email attacks using deep learning. In: 2023 3rd International Conference on Advance Computing and Innovative Technologies in Engineering (ICACITE), pp. 466–470 (2023). IEEE. PHISHING-Emails
117. Singh, U., Singh, V., Gourisaria, M.K., Das, H.: Spam email assessment using machine learning and data mining approach. In: 2022 Fifth International Conference on Computational Intelligence and Communication Technologies (CCICT), pp. 350–357 (2022). IEEE. PHISHING-Emails
118. Emmanuel, A.A., Yamazaki, T.: Information security in social media sites: Sentiment analysis of email. In: 2023 IEEE 18th International Conference on Computer Science and Information Technologies (CSIT), pp. 1–5 (2023). IEEE. PHISHING-Emails
119. Livara, A., Hernandez, R.: An empirical analysis of machine learning techniques in phishing e-mail detection. In: 2022 International Conference for Advancement in Technology (ICONAT), pp. 1–6 (2022). IEEE. PHISHING-Emails
120. Salihovic, I., Serdarevic, H., Kevric, J.: The role of feature selection in machine learning for detection of spam and phishing attacks. In: Advanced Technologies, Systems, and Applications III: Proceedings of the International Symposium on Innovative and Interdisciplinary Applications of Advanced Technologies (IAT), Volume 2, pp. 476–483 (2019). Springer. FRAUDULENT ECOMMERCE
121. Ismail, S.S., Mansour, R.F., Abd El-Aziz, R.M., Taloba, A.I.: Efficient e-mail spam detection strategy using genetic decision tree processing with nlp features. Computational Intelligence and Neuroscience 2022(1), 7710005 (2022). PHISHING-Emails
122. Kushwaha, A., Dutta, K., Maheshwari, V.: Analysis of bert email spam classifier against adversarial attacks. In: 2023 International Conference on Artificial Intelligence and Smart Communication (AISC), pp. 485–490 (2023). IEEE. PHISHING-Emails
123. Saini, A., Guleria, K., Sharma, S.: Machine learning approaches for an automatic email spam detection. In: 2023 International Conference on Artificial Intelligence and Applications (ICAIA) Alliance Technology Conference (ATCON-1), pp. 1–5 (2023). IEEE. PHISHING-Emails
124. Mittal, K., Gill, K.S., Chauhan, R., Joshi, K., Banerjee, D.: Blockage of phishing attacks through machine learning classification techniques and fine tuning its accuracy. In: 2023 3rd International Conference on Smart Generation Computing, Communication and Networking (SMART GENCON), pp. 1–5 (2023). IEEE. PHISHING-Emails
125. Genc, Y., Kour, H., Arslan, H.T., Chen, L.-C.: Understanding nigerian e-mail scams: A computational content analysis approach. Information Security Journal: A Global Perspective 30(2), 88–99 (2021). PHISHING-Emails
126. Jiang, L.: Detecting scams using large language models. arXiv preprint arXiv:2402.03147 (2024). PHISHING-Emails
127. Mehdi Gholampour, P., Verma, R.M.: Adversarial robustness of phishing email detection models. In: Proceedings of the 9th ACM International Workshop on Security and Privacy Analytics, pp. 67–76 (2023). PHISHING-Emails
128. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019)
129. Ren, S., Deng, Y., He, K., Che, W.: Generating natural language adversarial examples through probability weighted word saliency. In: Korhonen, A., Traum, D., Màrquez, L. (eds.) Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1085–1097. Association for Computational Linguistics, Florence, Italy (2019). doi:10.18653/v1/P19-1103. https://aclanthology.org/P19-1103
130. Gallo, L., Botta, A., Ventre, G.: Identifying threats in a large company’s inbox. In: Proceedings of the 3rd ACM CoNEXT Workshop on Big DAta, Machine Learning and Artificial Intelligence for Data Communication Networks, pp. 1–7 (2019). PHISHING-Emails
131. Mughaid, A., AlZu’bi, S., Hnaif, A., Taamneh, S., Alnajjar, A., Elsoud, E.A.: An intelligent cyber security phishing detection system using deep learning techniques. Cluster Computing 25(6), 3819–3828 (2022). PHISHING-Emails
132. Venugopal, I., Bhaskari, D.L., Seetaramanath, M.: Detection of severity-based email spam messages using adaptive threshold driven clustering. International Journal of Advanced Computer Science and Applications 13(10) (2022). PHISHING-Emails
133. Rahmad, F., Suryanto, Y., Ramli, K.: Performance comparison of anti-spam technology using confusion matrix classification. In: IOP Conference Series: Materials Science and Engineering, vol. 879, p. 012076 (2020). IOP Publishing. PHISHING-Emails
134. Vinothkumar, S., Varadhaganapathy, S., Shanthakumari, R., Ramkishore, D., Rithik, S., Tharanies, K.: Detection of spam messages in e-messaging platform using machine learning. In: 2022 Fifth International Conference on Computational Intelligence and Communication Technologies (CCICT), pp. 283–287 (2022). IEEE.
complaint text of internet fraud. In: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 3268–3272 (2021). PHISHING USER REPORTS
169. Zhou, T., Zhao, H., Zhang, X.: Keyword extraction based on random forest and xgboost-an example of fraud judgment document. In: 2022 European Conference on Natural Language Processing and Information Retrieval (ECNLPIR), pp. 17–22 (2022). IEEE. PHISHING USER REPORTS
170. Palad, E.B.B., Tangkeko, M.S., Magpantay, L.A.K., Sipin, G.L.: Document classification of filipino online scam incident text using data mining techniques. In: 2019 19th International Symposium on Communications and Information Technologies (ISCIT), pp. 232–237 (2019). IEEE. PHISHING USER REPORTS
171. Javed, M.S., Majeed, H., Mujtaba, H., Beg, M.O.: Fake reviews classification using deep learning ensemble of shallow convolutions. Journal of Computational Social Science, 1–20 (2021). FAKE REVIEWS
172. Pengqi, W., Yue, L., Junyi, C.: Unmasking deception: A comparative study of tree-based and transformer-based models for fake review detection on yelp. In: 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 1848–1853 (2023). IEEE. FAKE REVIEWS
173. Harris, C.G.: Combining linguistic and behavioral clues to detect spam in online reviews. In: 2022 IEEE International Conference on e-Business Engineering (ICEBE), pp. 38–44 (2022). IEEE. FAKE REVIEWS
174. Tufail, H., Ashraf, M.U., Alsubhi, K., Aljahdali, H.M.: The effect of fake reviews on e-commerce during and after covid-19 pandemic: Skl-based fake reviews detection. IEEE Access 10, 25555–25564 (2022). FAKE REVIEWS
175. Balakrishna, V., Bag, S., Sarkar, S.: Identifying spammer groups in consumer reviews using meta-data via bipartite graph approach. In: 2022 International Conference on Data Analytics for Business and Industry (ICDABI), pp. 650–654 (2022). IEEE. FAKE REVIEWS
176. Ashraf, S., Rehman, F., Sharif, H., Kirn, H., Arshad, H., Manzoor, H.: Fake reviews classification using deep learning. In: 2023 International Multi-disciplinary Conference in Emerging Research Trends (IMCERT), vol. 1, pp. 1–8 (2023). IEEE. FAKE REVIEWS
177. Wang, L., Li, P., Xiong, K., Zhao, J., Lin, R.: Modeling heterogeneous graph network on fraud detection: A community-based framework with attention mechanism. In: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 1959–1968 (2021). FAKE REVIEWS
178. Singh, D., Memoria, M., Kumar, R.: Deep learning based model for fake review detection. In: 2023 International Conference on Advancement in Computation & Computer Technologies (InCACCT), pp. 92–95 (2023). IEEE. FAKE REVIEWS
179. Yugeshwaran, G., Benitta, D.A., Eliyas, S., et al.: Rank fraud and malware detection in google play using fairplay. In: 2022 2nd International Conference on Advance Computing and Innovative Technologies in Engineering (ICACITE), pp. 1356–1359 (2022). IEEE. FAKE REVIEWS
180. Tushev, M., Ebrahimi, F., Mahmoud, A.: Domain-specific analysis of mobile app reviews using keyword-assisted topic models. In: Proceedings of the 44th International Conference on Software Engineering, pp. 762–773 (2022). FAKE REVIEWS
181. Obie, H.O., Ilekura, I., Du, H., Shahin, M., Grundy, J., Li, L., Whittle, J., Turhan, B.: On the violation of honesty in mobile apps: Automated detection and categories. In: Proceedings of the 19th International Conference on Mining Software Repositories, pp. 321–332 (2022). FAKE REVIEWS
182. Rangar, K.P., Khan, A.: A machine learning model for spam reviews and spammer community detection. In: 2022 IEEE World Conference on Applied Intelligence and Computing (AIC), pp. 632–638 (2022). IEEE. FAKE REVIEWS
183. Iqbal, A., Rauf, M.A., Zubair, M., Younis, T.: An efficient ensemble approach for fake reviews detection. In: 2023 3rd International Conference on Artificial Intelligence (ICAI), pp. 70–75 (2023). IEEE. FAKE REVIEWS
184. Furia, R., Gaikwad, K., Mandalya, K., Godbole, A.: Tool for review analysis of product. In: 2020 International Conference on Emerging Trends in Information Technology and Engineering (ic-ETITE), pp. 1–6 (2020). IEEE. FAKE REVIEWS
185. Gupta, V., Aggarwal, A., Chakraborty, T.: Detecting and characterizing extremist reviewer groups in online product reviews. IEEE Transactions on Computational Social Systems 7(3), 741–750 (2020). FAKE REVIEWS
186. Thilagavathy, A., Therasa, P., Jasmine, J.J., Sneha, M., Lakshmi, R.S., Yuvanthika, S.: Fake product review detection and elimination using opinion mining. In: 2023 World Conference on Communication & Computing (WCONF), pp. 1–5 (2023). IEEE. FAKE REVIEWS
187. Chandana, P., Sree, N.P., Ramya, V., Bhavana, G.: Analyzing the extremist reviewer groups on online products. In: 2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA), pp. 1–4 (2021). IEEE. FAKE REVIEWS
188. Akshara, S., Shiva, S., Kubireddy, S., Arun, T., Kanthety, V.L.: A small comparative study of machine learning algorithms in the detection of fake reviews of amazon products. In: 2023 6th International Conference on Contemporary Computing and Informatics (IC3I), vol. 6, pp. 2258–2263 (2023). IEEE. FAKE REVIEWS
189. Deekshan, S., PK, A.D., et al.: Detection and summarization of honest reviews using text mining. In: 2022 8th International Conference on Smart Structures and Systems (ICSSS), pp. 01–05 (2022). IEEE. FAKE REVIEWS
190. Rangari, K., Khan, A.: An empirical analysis of different techniques for spam detection. In: 2022 8th International Conference on Advanced Computing and Communication Systems (ICACCS), vol. 1, pp. 947–953 (2022). IEEE. FAKE REVIEWS
191. Silpa, C., Prasanth, P., Sowmya, S., Bhumika, Y., Pavan, C.S., Naveed, M.: Detection of fake online reviews by using machine learning. In: 2023 International Conference on Innovative Data Communication Technologies and Application (ICIDCA), pp. 71–77 (2023). IEEE. FAKE REVIEWS
192. Bevendorff, J., Wiegmann, M., Potthast, M., Stein, B.: Product spam on youtube: A case study. In: Proceedings of the 2024 Conference on Human Information Interaction and Retrieval, pp. 358–363 (2024). FAKE REVIEWS
193. Ganesh, D., Rao, K.J., Kumar, M.S., Vinitha, M., Anitha, M., Likith, S.S., Taralitha, R.: Implementation of novel machine learning methods for analysis and detection of fake reviews in social media. In: 2023 International Conference on Sustainable Computing and Data Communication Systems (ICSCDS), pp. 243–250 (2023). IEEE. FAKE REVIEWS
194. Nessa, I., Zabin, B., Faruk, K.O., Rahman, A., Nahar, K., Iqbal, S., Hossain, M.S., Mehedi, M.H.K., Rasel, A.A.: Recruitment scam detection using gated recurrent unit. In: 2022 IEEE 10th Region 10 Humanitarian Technology Conference (R10-HTC), pp. 445–449 (2022). IEEE. RECRUITMENT FRAUD
195. Prathaban, B.P., Rajendran, S., Lakshmi, G., Menaka, D.: Verification of job authenticity using prediction of online employment scam model (poesm). In: 2022 1st International Conference on Computational Science and Technology (ICCST), pp. 1–6 (2022). IEEE. RECRUITMENT FRAUD
196. Pandey, B., Kala, T., Bhoj, N., Gohel, H., Kumar, A., Sivaram, P.: Effective identification of spam jobs postings using employer defined linguistic feature. In: 2022 1st International Conference on AI in Cybersecurity (ICAIC), pp. 1–6 (2022). IEEE. RECRUITMENT FRAUD
197. Ranparia, D., Kumari, S., Sahani, A.: Fake job prediction using sequential network. In: 2020 IEEE 15th International Conference on Industrial and Information Systems (ICIIS), pp. 339–343 (2020). IEEE. RECRUITMENT FRAUD
198. Habiba, S.U., Islam, M.K., Tasnim, F.: A comparative study on fake job post prediction using different data mining techniques. In: 2021 2nd International Conference on Robotics, Electrical and Signal Processing Techniques (ICREST), pp. 543–546 (2021). IEEE. RECRUITMENT FRAUD
199. Reddy, S.M., Ali, S.M., Battula, K.M., lakshmana Charan, P., Rashmi, M.: Web app for predicting fake job posts using ensemble classifiers. In: 2023 4th International Conference for Emerging Technology (INCET), pp. 1–5 (2023). IEEE. RECRUITMENT
FRAUD
200. Santhiya, P., Kavitha, S., Aravindh, T., Archana, S., Praveen, A.V.: Fake news detection using machine learning. In: 2023 International Conference on Computer Communication and Informatics (ICCCI), pp. 1–8 (2023). IEEE. RECRUITMENT FRAUD
201. Lal, S., Jiaswal, R., Sardana, N., Verma, A., Kaur, A., Mourya, R.: Orfdetector: ensemble learning based online recruitment fraud detection. In: 2019 Twelfth International Conference on Contemporary Computing (IC3), pp. 1–5 (2019). IEEE. RECRUITMENT FRAUD
202. Nasser, I.M., Alzaanin, A.H., Maghari, A.Y.: Online recruitment fraud detection using ann. In: 2021 Palestinian International Conference on Information and Communication Technology (PICICT), pp. 13–17 (2021). IEEE. RECRUITMENT FRAUD
203. Sofy, M.A., Khafagy, M.H., Badry, R.M.: An intelligent arabic model for recruitment fraud detection using machine learning. Journal of Advances in Information Technology 14(1) (2023). RECRUITMENT FRAUD
204. Vo, M.T., Vo, A.H., Nguyen, T., Sharma, R., Le, T.: Dealing with the class imbalance problem in the detection of fake job descriptions. Computers, Materials & Continua 68(1), 521–535 (2021). RECRUITMENT FRAUD
205. Nanath, K., Olney, L.: An investigation of crowdsourcing methods in enhancing the machine learning approach for detecting online recruitment fraud. International Journal of Information Management Data Insights 3(1), 100167 (2023). RECRUITMENT FRAUD
206. Ullah, Z., Jamjoom, M.: A smart secured framework for detecting and averting online recruitment fraud using ensemble machine learning techniques. PeerJ Computer Science 9, 1234 (2023). RECRUITMENT FRAUD
207. Bhatia, T., Meena, J.: Detection of fake online recruitment using machine learning techniques. In: 2022 4th International Conference on Advances in Computing, Communication Control and Networking (ICAC3N), pp. 300–304 (2022). IEEE. RECRUITMENT FRAUD
208. Li, J., Li, Y., Han, H., Lu, X.: Exploratory methods for imbalanced data classification in online recruitment fraud detection: A comparative analysis. In: 2021 4th International Conference on Computing and Big Data, pp. 75–81 (2021). RECRUITMENT FRAUD
209. Mahbub, S., Pardede, E., Kayes, A.: Online recruitment fraud detection: A study on contextual features in australian job industries. IEEE Access 10, 82776–82787 (2022). RECRUITMENT FRAUD
210. Tabassum, H., Ghosh, G., Atika, A., Chakrabarty, A.: Detecting online recruitment fraud using machine learning. In: 2021 9th International Conference on Information and Communication Technology (ICoICT), pp. 472–477 (2021). IEEE. RECRUITMENT FRAUD
211. Zhang, H., Wang, M., Wang, Y., Li, Y., Gu, D., Zhu, Y.: Orfpprediction: Machine learning based online recruitment fraud probability prediction. In: 2023 International Conference on the Cognitive Computing and Complex Data (ICCD), pp. 139–144 (2023). IEEE. RECRUITMENT FRAUD
212. Raj, R.J.R., Srinivasulu, S., Ashutosh, A.: A multi-classifier framework for detecting spam and fake spam messages in twitter. In: 2020 IEEE 9th International Conference on Communication Systems and Network Technologies (CSNT), pp. 266–270 (2020). IEEE. FAKE ACCOUNTS
213. Gangan, J., Suprith, K., Jamdar, N., Bharne, S.: Detection of fake twitter accounts using ensemble learning model. In: 2023 7th International Conference On Computing, Communication, Control And Automation (ICCUBEA), pp. 1–6 (2023). IEEE. FAKE REVIEWS
214. Yue, H., Zhou, L., Xue, K., Li, H.: Madafe: Malicious account detection on twitter with automated feature extraction. In: 2019 11th International Conference on Wireless Communications and Signal Processing (WCSP), pp. 1–6 (2019). IEEE. FAKE ACCOUNTS
215. Singh, M., Singh, A.: How safe you are on social networks? Cybernetics and Systems 54(7), 1154–1171 (2023). FAKE ACCOUNTS
216. Ali Alhosseini, S., Bin Tareaf, R., Najafi, P., Meinel, C.: Detect me if you can: Spam bot detection using inductive representation learning. In: Companion Proceedings of the 2019 World Wide Web Conference, pp. 148–153 (2019). FAKE ACCOUNTS
217. Shukla, R., Sinha, A., Chaudhary, A.: Tweezbot: An ai-driven online media bot identification algorithm for twitter social networks. Electronics 11(5), 743 (2022). FAKE ACCOUNTS
218. Rovito, L., Bonin, L., Manzoni, L., De Lorenzo, A.: An evolutionary computation approach for twitter bot detection. Applied Sciences 12(12), 5915 (2022). FAKE ACCOUNTS
219. Das, S., Saha, S., Vijayalakshmi, S., Jaiswal, J.: An efficient approach to detect fraud instagram accounts using supervised ml algorithms. In: 2022 4th International Conference on Advances in Computing, Communication Control and Networking (ICAC3N), pp. 760–764 (2022). IEEE. FAKE ACCOUNTS
220. Fathima, A.S., Reema, S., Ahmed, S.T.: Ann based fake profile detection and categorization using premetric paradigms on instagram. In: 2023 Innovations in Power and Advanced Computing Technologies (i-PACT), pp. 1–6 (2023). IEEE. FAKE ACCOUNTS
221. Anklesaria, K., Desai, Z., Kulkarni, V., Balasubramaniam, H.: A survey on machine learning algorithms for detecting fake instagram accounts. In: 2021 3rd International Conference on Advances in Computing, Communication Control and Networking (ICAC3N), pp. 141–144 (2021). IEEE. FAKE ACCOUNTS
222. Albayati, M.B., Altamimi, A.M.: Identifying fake facebook profiles using data mining techniques. Journal of ICT Research & Applications 13(2) (2019). FAKE ACCOUNTS
223. Venkatesan, M., Prabhavathy, P.: Graph based unsupervised learning methods for edge and node anomaly detection in social network. In: 2019 IEEE 1st International Conference on Energy, Systems and Information Processing (ICESIP), pp. 1–5 (2019). IEEE. FAKE ACCOUNTS
224. Shreya, K., Kothapelly, A., Deepika, V., Shanmugasundaram, H.: Identification of fake accounts in social media using machine learning. In: 2022 Fourth International Conference on Emerging Research in Electronics, Computer Science and Technology (ICERECT), pp. 1–4 (2022). IEEE. FAKE ACCOUNTS
225. Na, S.H., Cho, S., Shin, S.: Evolving bots: The new generation of comment bots and their underlying scam campaigns in youtube. In: Proceedings of the 2023 ACM on Internet Measurement Conference, pp. 297–312 (2023). FAKE ACCOUNTS
226. Zhang, X., Jiang, F., Zhang, R., Li, S., Zhou, Y.: Social spammer detection based on semi-supervised learning. In: 2021 IEEE 20th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), pp. 849–855 (2021). IEEE. FAKE ACCOUNTS
227. Haq, I., Qiu, W., Guo, J., Peng, T.: Spammy names detection in pashto language to prevent fake accounts creation on social media. In: 2023 8th International Conference on Signal and Image Processing (ICSIP), pp. 614–618 (2023). IEEE. FAKE ACCOUNTS
228. Nikhitha, K.V., Bhavya, K., Nandini, D.U.: Fake account detection on social media using random forest classifier. In: 2023 7th International Conference on Intelligent Computing and Control Systems (ICICCS), pp. 806–811 (2023). IEEE. FAKE ACCOUNTS
229. Bebensee, B., Nazarov, N., Zhang, B.-T.: Leveraging node neighborhoods and egograph topology for better bot detection in social graphs. Social Network Analysis and Mining 11(1), 10 (2021). FAKE ACCOUNTS
230. Janjeva, A., Harris, A., Mercer, S., Kasprzyk, A., Gausen, A.: The rapid rise of generative ai: Assessing risks to safety and security (2023). GENAI SE ATTACK
231. Ferrara, E.: Genai against humanity: Nefarious applications of generative artificial intelligence and large language models. Journal of Computational Social Science, 1–21 (2024). GENAI SE ATTACK
232. Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, U., et al.: Extracting training data from large language models. In: 30th USENIX Security Symposium (USENIX Security 21), pp. 2633–2650 (2021). GENAI SE ATTACK
233. Kumar, K., Bhushan, B., et al.: Augmenting cybersecurity and fraud detection using artificial intelligence advancements. In: 2023 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS), pp. 1207–1212 (2023). IEEE. GENAI SE ATTACK
234. Ayoobi, N., Shahriar, S., Mukherjee, A.: The looming threat of fake
and llm-generated linkedin profiles: Challenges and opportunities for detection and prevention. In: Proceedings of the 34th ACM Conference on Hypertext and Social Media, pp. 1–10 (2023). GENAI SE ATTACK
235. DiResta, R., Goldstein, J.A.: How spammers and scammers leverage ai-generated images on facebook for audience growth. arXiv preprint arXiv:2403.12838 (2024). GENAI SE ATTACK
236. Grbic, D.V., Dujlovic, I.: Social engineering with chatgpt. In: 2023 22nd International Symposium INFOTEH-JAHORINA (INFOTEH), pp. 1–5 (2023). IEEE. GENAI SE ATTACK
237. Shibli, A.M., Pritom, M.M.A., Gupta, M.: Abusegpt: Abuse of generative ai chatbots to create smishing campaigns. In: 2024 12th International Symposium on Digital Forensics and Security (ISDFS), pp. 1–6 (2024). IEEE. GENAI SE ATTACK
238. Alotaibi, L., Seher, S., Mohammad, N.: Cyberattacks using chatgpt: Exploring malicious content generation through prompt engineering. In: 2024 ASU International Conference in Emerging Technologies for Sustainability and Intelligent Systems (ICETSIS), pp. 1304–1311 (2024). IEEE. GENAI SE ATTACK
239. Alawida, M., Abu Shawar, B., Abiodun, O.I., Mehmood, A., Omolara, A.E., Al Hwaitat, A.K.: Unveiling the dark side of chatgpt: Exploring cyberattacks and enhancing user awareness. Information 15(1), 27 (2024). GENAI SE ATTACK
240. Sharma, M., Singh, K., Aggarwal, P., Dutt, V.: How well does gpt phish people? an investigation involving cognitive biases and feedback. In: 2023 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), pp. 451–457 (2023). IEEE. GENAI SE ATTACK
241. Roy, S.S., Thota, P., Naragam, K.V., Nilizadeh, S.: From chatbots to phishbots?–preventing phishing scams created using chatgpt, google bard and claude. arXiv preprint arXiv:2310.19181 (2023). GENAI SE ATTACK
242. Xu, Z., Luo, S., Shi, J., Li, H., Lin, C., Sun, Q., Hu, S.: Efficiently answering k-hop reachability queries in large dynamic graphs for fraud feature extraction. In: 2022 23rd IEEE International Conference on Mobile Data Management (MDM), pp. 238–245 (2022). IEEE. SOCIAL MEDIA SCAMS
243. La Morgia, M., Mei, A., Mongardini, A.M., Wu, J.: It’s a trap! detection and analysis of fake channels on telegram. In: 2023 IEEE International Conference on Web Services (ICWS), pp. 97–104 (2023). IEEE. SOCIAL MEDIA SCAMS
244. La Morgia, M., Mei, A., Mongardini, A.M., Wu, J.: Uncovering the dark side of telegram: Fakes, clones, scams, and conspiracy movements. arXiv preprint arXiv:2111.13530 (2021). SOCIAL MEDIA SCAMS
245. Shah, D., Harrison, T., Freas, C.B., Maimon, D., Harrison, R.W.: Illicit activity detection in large-scale dark and opaque web social networks. In: 2020 IEEE International Conference on Big Data (Big Data), pp. 4341–4350 (2020). IEEE. SOCIAL MEDIA SCAMS
246. Al-Hassan, M., Abu-Salih, B., Al Hwaitat, A.: Dspamonto: An ontology modelling for domain-specific social spammers in microblogging. Big Data and Cognitive Computing 7(2), 109 (2023). SOCIAL MEDIA SCAMS
247. Tripathi, A., Ghosh, M., Bharti, K.: Analyzing the uncharted territory of monetizing scam videos on youtube. Social Network Analysis and Mining 12(1), 119 (2022). SOCIAL MEDIA SCAMS
248. He, X., Gong, Q., Chen, Y., Zhang, Y., Wang, X., Fu, X.: Datingsec: Detecting malicious accounts in dating apps using a content-based attention network. IEEE Transactions on Dependable and Secure Computing 18(5), 2193–2208 (2021). ROMANCE FRAUD
249. Suarez-Tangil, G., Edwards, M., Peersman, C., Stringhini, G., Rashid, A., Whitty, M.: Automatically dismantling online dating fraud. IEEE Transactions on Information Forensics and Security 15, 1128–1137 (2019). ROMANCE FRAUD
250. Lokanan, M.E.: The tinder swindler: Analyzing public sentiments of romance fraud using machine learning and artificial intelligence. Journal of Economic Criminology 2, 100023 (2023). ROMANCE FRAUD
251. Siu, G.A., Hutchings, A., Vasek, M., Moore, T.: “invest in crypto!”: An analysis of investment scam advertisements found in bitcointalk. In: 2022 APWG Symposium on Electronic Crime Research (eCrime), pp. 1–12 (2022). IEEE. FRAUDULENT INVESTMENT
252. Li, K., Guan, S., Lee, D.: Towards understanding and characterizing the arbitrage bot scam in the wild. Proceedings of the ACM on Measurement and Analysis of Computing Systems 7(3), 1–29 (2023). FRAUDULENT INVESTMENT
253. Kuo, C., Tsang, S.-S.: Constructing an investment scam detection model based on emotional fluctuations throughout the investment scam life cycle. Deviant Behavior 45(2), 204–225 (2024). FRAUDULENT INVESTMENT
254. Nizzoli, L., Tardelli, S., Avvenuti, M., Cresci, S., Tesconi, M., Ferrara, E.: Charting the landscape of online cryptocurrency manipulation. IEEE Access 8, 113230–113245 (2020). CRYPTO MANIPULATION
255. Mirtaheri, M., Abu-El-Haija, S., Morstatter, F., Ver Steeg, G., Galstyan, A.: Identifying and analyzing cryptocurrency manipulations in social media. IEEE Transactions on Computational Social Systems 8(3), 607–617 (2021). CRYPTO MANIPULATION
256. Lee, S., Shafqat, W., Kim, H.-c.: Backers beware: Characteristics and detection of fraudulent crowdfunding campaigns. Sensors 22(19), 7677 (2022). FRAUDULENT CROWDFUNDING
257. Shafqat, W., Byun, Y.-C.: Topic predictions and optimized recommendation mechanism based on integrated topic modeling and deep neural networks in crowdfunding platforms. Applied Sciences 9(24), 5496 (2019). FRAUDULENT CROWDFUNDING
258. Cambiaso, E., Caviglione, L.: Scamming the scammers: Using chatgpt to reply mails for wasting time and resources. arXiv preprint arXiv:2303.13521 (2023). SCAMBAITING
259. Bajaj, P., Edwards, M.: Automatic scam-baiting using chatgpt. In: 2023 IEEE 22nd International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), pp. 1941–1946 (2023). IEEE. SCAMBAITING
260. Chen, W., Wang, F., Edwards, M.: Active countermeasures for email fraud. In: 2023 IEEE 8th European Symposium on Security and Privacy (EuroS&P), pp. 39–55 (2023). IEEE. SCAMBAITING
261. Farzaneh Shahini, D.W., Zahabi, M.: Usability evaluation of police mobile computer terminals: A focus group study. International Journal of Human–Computer Interaction 37(15), 1478–1487 (2021). doi:10.1080/10447318.2021.1894801
262. Zahabi, M., Kaber, D.: Identification of task demands and usability issues in police use of mobile computing terminals. Applied Ergonomics 66, 161–171 (2018). doi:10.1016/j.apergo.2017.08.013
263. Ebubekirbbr: Phishing Detection. GitHub. https://github.com/ebubekirbbr/pdd/tree/master/input (2018)
264. Akash Kumar: Phishing website dataset. https://www.kaggle.com/datasets/akashkr/phishing-website-dataset#dataset.csv (2017)
265. Choon Lin Tan: Phishing Dataset for Machine Learning: Feature Evaluation. https://data.mendeley.com/datasets/h3cgnj8hft/1 (2018)
266. Antony J: Malicious n Non-Malicious URL. https://www.kaggle.com/datasets/antonyj453/urldataset (2017)
267. Majestic: Majestic Dataset. http://downloads.majestic.com/majestic_million.csv
268. URLhaus: URLhaus Database Dump. https://urlhaus.abuse.ch/downloads/csv/
269. Lilo, J.: Detecting Malicious URL Using Pyspark. https://github.com/rlilojr/Detecting-Malicious-URL-Machine-Learning/tree/master (2018)
270. Mohammad, R.M., Thabtah, F., McCluskey, L.: Predicting phishing websites based on self-structuring neural network. Neural Computing and Applications 25, 443–458 (2014)
271. Mohammad, R.M., Thabtah, F., McCluskey, L.: Intelligent rule-based phishing websites classification. IET Information Security 8(3), 153–160 (2014)
272. Mohammad, R., McCluskey, L.: Phishing Websites. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C51W2X (2015)
273. Rao, R.S., Vaishnavi, T., Pais, A.R.: Catchphish: detection of phishing websites by inspecting urls. Journal of Ambient Intelligence
and Humanized Computing 11, 813–825 (2020)
274. Vrbančič, G.: Phishing Websites Dataset. Mendeley Data. DOI: 10.17632/72ptz43s9v.1 (2020)
275. Canadian Institute of Cybersecurity: URL dataset (ISCX-URL2016). https://www.unb.ca/cic/datasets/url-2016.html (2016)
276. Farsight Inc: Farsight SIE. https://www.domaintools.com/resources/user-guides/?_resources_products=sie
277. Ariyadasa, S., Fernando, S., Fernando, S.: Phishrepo dataset. Mendeley Data. DOI: 10.17632/ttmmtsgbs8.4 (2022)
278. Lin, Y., Liu, R., Divakaran, D.M., Ng, J.Y., Chan, Q.Z., Lu, Y., Si, Y., Zhang, F., Dong, J.S.: Phishpedia: A hybrid deep learning based approach to visually identify phishing webpages. In: 30th USENIX Security Symposium (USENIX Security 21), pp. 3793–3810. USENIX Association (2021). https://www.usenix.org/conference/usenixsecurity21/presentation/lin
279. Feng, J., Zou, L., Ye, O., Han, J.: Web2vec: Phishing webpage detection method based on multidimensional features driven by deep learning. IEEE Access 8, 221214–221224 (2020)
280. Kumar, S.: Detect Malicious URL using ML. https://www.kaggle.com/code/siddharthkumar25/detect-malicious-url-using-ml (2019)
281. Satish Yadav: Phishing Dataset UCI ML CSV. https://www.kaggle.com/datasets/isatish/phishing-dataset-uci-ml-csv (2020)
282. AK., S.: Malicious and Benign Webpages Dataset. PubMed. DOI: 10.1016/j.dib.2020.106304 (2020)
283. William W. Cohen: Enron Email Dataset. https://www.cs.cmu.edu/~enron/ (2020)
284. Dragomir Radev: CLAIR collection of fraud email (Repository). https://aclweb.org/aclwiki/CLAIR_collection_of_fraud_email_(Repository) (2008)
285. Almeida, T.A., Hidalgo, J.M.G., Yamakami, A.: Contributions to the study of sms spam filtering: new collection and results. In: Proceedings of the 11th ACM Symposium on Document Engineering, pp. 259–262 (2011)
286. M Yasser H: Spam Emails Dataset. https://www.kaggle.com/datasets/yasserh/spamemailsdataset (2021)
287. Akashsurya and Gokhan Kul: Phishing Email Collection. https://www.kaggle.com/akashsurya156/phishing-paper1 (2019)
288. Hina, M., Ali, M., Javed, A.R., Ghabban, F., Khan, L.A., Jalil, Z.: Sefaced: Semantic-based forensic analysis and classification of e-mail data using deep learning. IEEE Access 9, 98398–98411 (2021)
289. Diegoocampoh Ocampo: MachineLearningPhishing. https://github.com/diegoocampoh/MachineLearningPhishing (2017)
290. J Nazario: Nazario Phishing Corpus. https://monkey.org/~jose/phishing/ (2005)
291. Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Sakkis, G., Spyropoulos, C.D., Stamatopoulos, P.: Learning to filter spam e-mail: A comparison of a naive bayesian and a memory-based approach. arXiv preprint cs/0009009 (2000)
292. Cormack, G.V., Gómez Hidalgo, J.M., Sánz, E.P.: Spam filtering for short messages. In: Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management, pp. 313–320 (2007)
293. spamassassin: Index of /old/publiccorpus. https://spamassassin.apache.org/old/publiccorpus/
294. Gholampour, M.P., Verma, R.M.: IWSPA-2023-Adversarial-Synthetic-Dataset. https://github.com/ReDASers/IWSPA-2023-Adversarial-Synthetic-Dataset (2023)
295. Almeida, T., Hidalgo, J.: SMS Spam Collection. https://archive.ics.uci.edu/dataset/228/sms+spam+collection (2012)
296. Yerima, S.Y., Bashar, A.: Semi-supervised novelty detection with one class svm for sms spam detection. In: 2022 29th International Conference on Systems, Signals and Image Processing (IWSSIP), pp. 1–4 (2022). IEEE
297. Bera, D., Ogbanufe, O., Kim, D.J.: Towards a thematic dimensional framework of online fraud: An exploration of fraudulent email attack tactics and intentions. Decision Support Systems 171, 113977 (2023). PHISHING-Emails
298. El Aassal, A., Baki, S., Das, A., Verma, R.M.: An in-depth benchmarking and evaluation of phishing detection research for security needs. IEEE Access 8, 22170–22192 (2020)
299. Sakkis, G., Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Spyropoulos, C.D., Stamatopoulos, P.: A memory-based approach to anti-spam filtering for mailing lists. Information Retrieval 6, 49–73 (2003)
300. Metsis, V., Androutsopoulos, I., Paliouras, G.: Spam filtering with naive bayes-which naive bayes? In: CEAS, vol. 17, pp. 28–69 (2006). Mountain View, CA
301. littleRound: 19 Fall Spear Phishing Detection. https://www.kaggle.com/c/19fall-spear-phishing-detection/ (2019)
302. Abhishek Verma: Fraud Email Dataset. https://www.kaggle.com/datasets/llabhishekll/fraud-email-dataset (2018)
303. Cyber Cop: Phishing Email Detection. https://www.kaggle.com/dsv/6090437 (2023)
304. UCI Machine Learning and Esther Kim: SMS Spam Collection Dataset. https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset (2016)
305. Yang, C., Harkreader, R., Gu, G.: Empirical evaluation and new design for fighting evolving twitter spammers. IEEE Transactions on Information Forensics and Security 8(8), 1280–1293 (2013)
306. Wu, T., Wen, S., Xiang, Y., Zhou, W.: Twitter spam detection: Survey of new approaches and comparative study. Computers & Security 76, 265–284 (2018). doi:10.1016/j.cose.2017.11.013
307. Cresci, S., Lillo, F., Regoli, D., Tardelli, S., Tesconi, M.: $FAKE: Evidence of spam and bot activity in stock microblogs on twitter. Proceedings of the International AAAI Conference on Web and Social Media 12(1) (2018)
308. Feng, S., Wan, H., Wang, N., Li, J., Luo, M.: Twibot-20: A comprehensive twitter bot detection benchmark. In: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 4485–4494 (2021)
309. Jain, C.: Training data 2 csv UTF. https://www.kaggle.com/datasets/charvijain27/training-data-2-csv-utfcsv (2018)
310. Recruitment Scam. https://www.kaggle.com/datasets/amruthjithrajvr/recruitment-scam
311. Yang, Y., Zhang, Y., Zhu, C.: Improved job scam detection methods using machine learning and resampling techniques. In: 2023 9th International Conference on Systems and Informatics (ICSAI), pp. 1–5 (2023). IEEE. RECRUITMENT FRAUD
312. Rayana, S., Akoglu, L.: Collective opinion spam detection: Bridging review networks and metadata. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 985–994 (2015)
313. Rayana, S., Akoglu, L.: Collective opinion spam detection using active inference. In: Proceedings of the 2016 SIAM International Conference on Data Mining, pp. 630–638 (2016). SIAM
314. Liu, W., He, J., Han, S., Cai, F., Yang, Z., Zhu, N.: A method for the detection of fake reviews based on temporal features of reviews and comments. IEEE Engineering Management Review 47(4), 67–79 (2019)
315. Asghar, N.: Yelp dataset challenge: Review rating prediction. arXiv preprint arXiv:1605.05362 (2016)
316. Ott, M., Cardie, C., Hancock, J.T.: Negative deceptive opinion spam. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 497–501 (2013)
317. Eler, M.M., Orlandin, L., Oliveira, A.D.A.: Do android app users care about accessibility? an analysis of user reviews on the google play store. In: Proceedings of the 18th Brazilian Symposium on Human Factors in Computing Systems, pp. 1–11 (2019)
318. Obie, H.O., Hussain, W., Xia, X., Grundy, J., Li, L., Turhan, B., Whittle, J., Shahin, M.: A first look at human values-violation in app reviews. In: 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Society (ICSE-SEIS),