
Spam Text Detection Over Social Media Usage

The document presents a novel approach for spam text detection on social media, utilizing a downsampling strategy based on negative selection density clustering (NSDC-DS) to enhance classifier performance. It discusses various methods for addressing imbalanced datasets, including the use of Naïve Bayes Support Vector Machine (NBSVM) and principal component analysis (PCA) for optimizing model parameters. Experimental results indicate that the proposed NSDC-DS method significantly improves the accuracy and efficiency of communication spam text recognition compared to traditional models.

IEEE Transaction Systems Man and Cybernetics Magazine. Issue Date: April 2024

Spam Text Detection Over Social Media Usage
A Supervised Sampling Approach for the Social Web of Things

by Haewon Byeon, Sameer Jha, Ismail Keshta, Mohammed Wasim Bhatt, Pavitar Parkash Singh, Latika Jindal, and T. R. Vijaya Lakshmi

A downsampling strategy based on negative selection density clustering (NSDC-DS) is proposed to improve classifier performance relative to random downsampling of unbalanced communication text. The discovery of self-anomalies via negative selection enhances traditional clustering. The detector and the self-set are the sample center point and the samples to be clustered, respectively; anomalous matching is performed on the two; and the NSDC technique analyzes sample similarity. To improve on the traditional downsampling method, we use the Naïve Bayes Support Vector Machine (NBSVM) classifier to identify garbage in sampled communication samples, use principal component analysis (PCA) to evaluate sample information content, propose an improved PCA-stochastic gradient descent (PCA-SGD) algorithm to optimize model parameters, and complete semisupervised communication spam text recognition over the Social Web of Things. Several datasets, including unbalanced communication text, were used to evaluate NSDC, NSDC-DS, and PCA-SGD against standard models. According to the trials, the improved model converges faster and more consistently.

Digital Object Identifier 10.1109/MSMC.2023.3343950
Date of current version: 25 April 2024

Today, with the growing volume of junk data on livelihood service platforms, preventing interference from garbage data to improve system efficiency and service quality has become an active research subject. Imbalanced sample classification, which underlies junk text recognition, suffers from poor classification performance. Classifier ensembles, cost-sensitive methods, and feature selection methods are the key algorithms aimed at this problem. Chen et al. [1] propose employing k-reverse nearest neighbors and one-class support vector machines (SVMs) in series to solve the data imbalance problem. Data balancing via sampling has been shown in several research works to enhance the performance of classifiers. Decision tree classifiers, like C4.5 and C5.0, were employed in more than half of the investigations. The k-nearest neighbor approach and the naive Bayes classifier were both employed in several research works. Random forests (RFs), convolutional neural networks, and linear discriminant analysis were used in other research to assess the sampling technique.

Imbalanced datasets have a significant skew in the class distribution, such as 1:100 or 1:1,000 samples in the minority class compared to the majority class. One method for dealing with class imbalance is to randomly resample the training dataset. The two basic ways to do so are to eliminate instances from the majority class, known as undersampling, and to duplicate examples from the minority class, known as oversampling. Random resampling is a simple strategy for rebalancing an unbalanced dataset's class distribution, but both variants have costs. Overfitting can occur when random oversampling repeats samples from the minority class in the training dataset. Random undersampling removes samples from the majority class, which can discard information that is vital to the model.

For data with heavily unbalanced classes, it can be challenging to build classification models. Accuracy can be increased by employing strategies like oversampling, undersampling, combinations of resampling methods, and custom filtering. By incorporating feature scaling, suppression, or neutralization of mean absolute error, Ponmalar [2] substantially enhances the accuracy of credit card fraud detection models. Yang et al. [3] modified the settings of the distance from the imbalanced data to the classification surface to correct the offset problem of the classification surface. Agnihotri [16] offers a novel variable global feature selection strategy, which outperforms the global feature selection technique when dealing with unbalanced data. In the multiclassification issue of mechanical defect diagnostics, this technique has good classification accuracy [5].

Naive Bayes is a supervised machine learning method used for classification problems. It is based on Bayes' theorem, and its name comes from the naive assumption of conditional independence among predictors: within a class, every feature is assumed to be independent of every other feature. Naive Bayes can be applied to both binary (two-class) and multiclass problems. SVM is a supervised machine learning method that can perform both regression and classification. It plots the data points in an n-dimensional space, where n is the number of features, and then chooses a hyperplane that separates the two classes to complete the classification.


Liu et al. [6] used the class overlap method and the importance of sample points to design the sample fuzzy membership function and assign the membership value, and proposed a fuzzy multiclass SVM algorithm, which can more effectively solve the problem of multiclass unbalanced data and noise. Gao et al. [7] used the prior class probability to weight the posterior class probability to deal with the problem that the minority class is misclassified when the neural network is trained with unbalanced data; the average recall rate of this algorithm is improved. Aradhye et al. [8] used Bayesian minimum risk theory to find the correct classification threshold and adjusted the imbalanced data after undersampling, reducing the impact of undersampling on classification accuracy and probability calibration.

Dhah et al. [9] proposed combining undersampling and oversampling, using the genetic algorithm to determine the optimal imbalance rate, which significantly improved the rare pattern detection rate and classification performance. Lu et al. [10] concentrated the data by constraining the range of synthetic data and proposed the Targeted Synthetic Minority Oversampling Technique and Margin Distribution-Sensitive Synthetic Minority Oversampling Technique algorithms, which address the marginal-distribution weakness of the classifier and of the Synthetic Minority Oversampling Technique on imbalanced datasets. Dhanaraj and Karthikeyan [11] added samples with a strong correlation with the minority class and deleted samples with a weak correlation with the majority class to balance the class distribution, and they proposed the critical value sampling method to improve the accuracy of the association classification method in dealing with imbalanced data. Wan and Uehara [12] proposed a combination strategy using the k-means sampling method and classification guidance words, which improved the classification accuracy of imbalanced data. Das et al. [13] adjusted the weights of the majority and minority classes and proposed a novel undersampling method based on the AdaBoost algorithm to improve the classification effect of imbalanced data.

To identify communication junk text, this research integrates supervised and unsupervised learning, improves the algorithm model to optimize parameters, and boosts the effectiveness of junk text identification. When downsampling is improved through clustering, traditional clustering algorithms make it difficult to determine the number of clusters and have high algorithm complexity; to avoid these problems, this article proposes NSDC-DS. According to the experimental results, the negative selection density clustering approach is more effective and less complex, and the improved downsampling method NSDC-DS enhances the classifier's performance. Additionally, the enhanced stochastic gradient descent approach PCA-SGD performs better than the traditional algorithm. With a steady convergence trend and a faster convergence pace, the communication spam text recognition model created in this study under semisupervised learning shows high recognition performance.

We must choose the ideal number of clusters for our dataset when using clustering methods like k-means clustering, so that the data are separated effectively and correctly. An adequate value for k, the number of clusters, ensures that the clusters are of the right granularity and helps to maintain a healthy balance between the clusters' compressibility and accuracy.

Improved Random Sampling Algorithms

Density Clustering Algorithm Based on Negative Selection

The number of clusters in the k-means algorithm is not easy to determine, and the algorithm is only suitable for datasets with a convex sample space; it does not cluster unbalanced datasets well. The spectral clustering algorithm is suitable for datasets with any shape of sample space, but it still has the disadvantage that the number of clusters is not easy to determine.

The appropriate similarity threshold c is obtained through multiple experiments. The sample with the highest density is taken as the cluster center point and serves as the detector; the remaining samples to be clustered form the self-set. When the detector and a self-set sample meet the matching condition, that is, when the similarity between them is greater than or equal to the similarity threshold c, the sample is gathered into the detector's cluster. After each cluster is formed, the point with the highest density among the samples still to be clustered becomes the next cluster center point; it is updated as the detector, and the other unclustered samples become the self-set. The similarity is calculated for these sample points, the samples that meet the similarity threshold's matching condition are gathered into the next cluster, and the process repeats until the termination condition is met.
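The clustering loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the similarity function (1 / (1 + Euclidean distance)), the density definition (number of unclustered points that match a candidate), and the name `nsdc_like_clustering` are all choices made here for the sake of a runnable example. The article says its similarity integrates distance and density, which this sketch simplifies to distance only.

```python
import math


def similarity(a, b):
    """Assumed similarity in (0, 1]: 1 / (1 + Euclidean distance)."""
    return 1.0 / (1.0 + math.dist(a, b))


def nsdc_like_clustering(points, threshold):
    """Repeatedly take the densest unclustered point as the detector and
    gather every unclustered point whose similarity to it is >= threshold."""
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must lie in (0, 1] for this similarity")
    remaining = list(range(len(points)))
    clusters = []
    while remaining:  # termination condition: no unclustered samples are left
        def density(i):
            # Assumed density: how many unclustered points match point i.
            return sum(similarity(points[i], points[j]) >= threshold
                       for j in remaining)

        detector = max(remaining, key=density)
        members = [j for j in remaining
                   if similarity(points[detector], points[j]) >= threshold]
        clusters.append(members)
        matched = set(members)
        remaining = [j for j in remaining if j not in matched]
    return clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
clusters = nsdc_like_clustering(points, threshold=0.4)
print(clusters)
```

Because the detector always matches itself (similarity 1), every pass removes at least one point, so the loop terminates without the number of clusters being fixed in advance.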
The data are given a relatively broad feature space that is c ho s e n for s up er v i s e d le a r n i n g. F i n a l ly, t he
describes them in many machine learning and pattern rec- revised PCA-SGD approach optimizes the total model
ognition applications. The following four categories of for semisupervised completion.
characteristics [2], [3] may be
found within this area: entirely Model Description
irrelevant, redundant and weakly When identifying communication
relevant, nonredundant but weakly To further improve junk text, the majority of classes in
relevant, and extremely relevant the classification the training set are first subjected
characteristics. The first two cate- to unsupervised learning using the
gories have the potential to consid- effect of the model improved NSDC algor ithm to
erably reduce the performance of and achieve a better improve the accuracy of identifica-
learning algorithms (classification, tion. Then, several representative
regression, and clustering) as well
junk text recognition samples are sampled from each cat-
as their computational effective- effect, we use the egory. Next, it is reorganized with
ness, probability of overfitting, and the minority classes in the training
PCA-SGD algorithm
generalization capacity (Figure 1). set into a balanced training set, and
to optimize the model the NBSVM classifier is selected for
Imbalanced Data Processing parameters. supervised learning. Finally, the
for NSDC improved PCA-SGD algorithm opti-
This article uses the communica- mizes the overall model to be com-
tion text dataset. To improve the pletely semisupervised.
learning effect, it is necessary to solve the imbalance For the task of garbage text recognition under super-
between the communication spam text and the communica- vised learning, the overall solution is shown in Figure 2.
tion nonspam text in the training set samples. Statistics may
be used to adjust for this in investigations where there is Experiment and Result Analysis
clustering. A type of standard error known as cluster-robust
standard error takes the effects of clustering into account, Experimental Description and Experimental Data
resulting in higher values, broader confidence ranges, and To verify the effectiveness of the three improvements dis-
more conservative p values. Fixed-effects models, which cussed in this article, the following three experiments are
include the cluster itself as a factor in a typical regression designed. First, by using datasets with different attributes,
model, or random effects models, which take into account comparing the clustering purity and time (time complexity
the commonalities among individuals with-
in clusters in a multilevel model, can both
be used to modify regression models to Sample Set
account for clustering.

Optimization of the Yes No


Maximum
Communication Spam Text Density Sample
Recognition Model
Cluster Center Samples to be
Point (Detector) Clustered (Self-Set)
Optimization of the Garbage
Identification Model
To further improve the classification effect Similarity Threshold
(Match Condition)
of the model and achieve a better junk text
recognition effect, we use the PCA-SGD algo-
rithm to optimize the model parameters.
Match Mismatched
To increase the accuracy of identifi-
cation while recognizing communication
Cluster into the Unclustered
trash text, the majority of classes in the i th Cluster Samples
training set are first exposed to unsu-
per v ised lea r n i ng employ i ng t he
improved NSDC technique. Then, from Not Null
each group, numerous representative
sa mples a re selected. The training set Figure 1. The NSDC-DS algorithm.

35
IEEE Transaction Systems Man and Cybernetics Magazine.Issue Date:April.2024

and space complexity) of the traditional algorithm and the estimators are contrasted with SVM, Naïve Bayes, and RF,
NSDC algorithm under other datasets, it is verified that the three well-known classification techniques. Different
latter has higher and better efficiency. For strong robust- input characteristics and percentages of originally
ness, we use the random downsampling method, NSDC algo- labeled data examples are investigated for each semisu-
rithm, and traditional clustering algorithm to sample the pervised learning technique. In addition, each semisuper-
majority class in the unbalanced v ised model’s top-per for ming
communication data, respectively, experimental configuration is
and use the reorganized balanced compared to the same model using
sample as the training set to use the In this article, only positive or negative polarity
NBSVM classifier. data examples.
experiments are
To perform learning classifica- The NBSVM classifier is used
tion, we use the validation set to designed to evaluate to learn the balanced training set
verify the effectiveness of the the improved after NSDC and downsampling to
improved downsampling method, create balanced samples, and the
use the enhanced stochastic gradi- algorithm using three semisupervised learning approach
ent descent algorithm to optimize indicators: clustering is utilized to recognize the commu-
the model, and verify the effective- nication junk text. To increase the
ness of the PCA-SGD algorithm by
purity, accuracy, model’s classification impact and
comparing the convergence speed and time. get better garbage text recognition,
and model training speed with we apply the PCA-SGD approach to
the traditional algorithm perfor- ad just the model parameters.
mance. The communication spam When recognizing communication
text recognition model developed in this research under trash text, the majority of classes in the training set are
semisupervised learning has good recognition perfor- initially exposed to unsupervised learning using the
mance, with a consistent convergence trend and quicker enhanced NSDC technique to increase identification
convergence speed. The density clustering approach accuracy. Then, from each group, multiple representative
based on adverse selection obtains the similarity thresh- samples are taken. Following that, it is reconstructed
old through repeated tests, which demands a significant with the minority classes in the training set into a bal-
amount of manual labor. anced training set, and the NBSVM classifier is chosen
The difficulty of classifying opinion spam under super- for supervised learning. Finally, the revised PCA-SGD
vision is a topic of extensive research. Algorithms may method optimizes the total model to be completely
find trends in spammer reviews using labeled data exam- semisupervised.
ples. Four semisupervised techniques using various base In experiment 1, to verify the robustness and effec-
tiveness of t he i mproved clu ster i ng
algorithm, this ar ticle selects noncon-
vex, high-dimensional, and unbalanced
Input Communication sample space datasets with similar sam-
Training Set ple nu mber s: Double - Ci rcles, Wi ne,
Glass, and the comparative dataset Iris.
Experiments 2 and 3 use unbalanced
Chinese Word
Segmentation, Stop Spam Text Recognition communication text data to evaluate
the performance of NSDC downsam-
pling and the PCA-SGD algorithm. The
Communication Text Ling Spam and Spam Base are common-
Vectorization ly used communication datasets. Uni-
PCA-SGD Model com dat a a re t he u nba la n c e d d a t a
Optimization
consulted by Minsheng platform cus-
NSDC-DS Sampling
tomers. We balance the communication
text data.

Recombined With Minority


NBSVM Classifier to Evaluation Indicators
Class Samples into Balanced
Identify Junk Text
Training Set In this article, experiments are designed
Garbage Identification to evaluate the improved algorithm using
Balanced Training Set three indicators: clustering purity, accu-
racy, and time. The specific instructions
Figure 2. The communication spam text recognition model. are as follows:


◆ Cluster purity: $\mathrm{Purity} = \sum_{i=1}^{k} \frac{m_i}{m} p_i$, where $k$ is the number of clusters, $m_i$ is the number of texts in cluster $i$, $m$ is the total number of samples, $p_i = \max_j p_{ij}$ with $p_{ij} = m_{ij}/m_i$, and $m_{ij}$ is the number of members in cluster $i$ that belong to class $j$. Purity is used to describe the accuracy of the clustering algorithm. The higher the purity, the better the clustering effect; the maximum value is one.

Algorithm Performance Verification

Experiment 1: NSDC Algorithm Performance Verification

The k-means algorithm is only suitable for convex sample space datasets, and it clusters unbalanced datasets poorly. Moreover, for high-dimensional datasets, the k-means and spectral clustering algorithms have the disadvantages of reduced clustering accuracy and long running time. With cross-validation, the training dataset is split into groups, one group is held out for prediction in each run, and the held-out group is rotated with every run, which allows the model to be trained with better data. When several datasets are used, cross-validation can be performed across datasets.

To verify that the negative selection density clustering algorithm can improve the discussed shortcomings, this experiment uses the nonconvex sample space dataset Double-Circles, the high-dimensional dataset Wine, the unbalanced dataset Glass, and the comparative dataset Iris. Traditional k-means and spectral clustering are selected as the comparison algorithms, and their performance is compared with that of the negative selection clustering algorithm. Clustering purity and time are used as evaluation indicators. The specific experimental results are shown in Figure 3 and Table 1.

Figure 3. A comparison of the clustering accuracy of each algorithm under different datasets. (Bar groups: k-Means, Spectral, and NSDC for each of Double-Circles, Iris, Wine, and Glass.)

Table 1. A comparison of the clustering accuracy of each algorithm under different datasets.

Double-Circles             Iris [10]                  Wine [8]                   Glass [4]
k-Means  Spectral NSDC     k-Means  Spectral NSDC     k-Means  Spectral NSDC     k-Means  Spectral NSDC
0        0.4      0.4      0.36     0.44     0.46     0.396    0.484    0.506    0.32     0.5324   0.5566
0.5      0.5      0.5      0.45     0.55     0.575    0.495    0.605    0.6325   0.4      0.6655   0.69575
0        0.6      0.6      0.54     0.66     0.69     0.594    0.726    0.759    0.48     0.7986   0.8349
0        0.7      0.7      0.63     0.77     0.805    0.693    0.847    0.8855   0.56     0.9317   0.97405
0        0.8      0.8      0.72     0.88     0.92     0.792    0.968    1.012    0.64     1.0648   1.1132
0        0.9      0.9      0.81     0.99     1.035    0.891    1.089    1.1385   0.72     1.1979   1.25235
0        1        1        0.9      0        0        0        0        0        0        0        0
0        0        0        0        0        0        0        0        0        0        0

It can be seen from the results that k-means is not suitable for the nonconvex sample space dataset Double-Circles, so it has a lower purity there. For the unbalanced dataset Glass, k-means yields lower cluster purity than for the balanced dataset Iris. For the high-dimensional dataset Wine, the k-means algorithm must repeatedly update the cluster center points, and the spectral clustering algorithm incurs a significant time overhead because of the high-dimensional matrix operation. The NSDC algorithm proposed in this article calculates the similarity by integrating distance and density, which addresses the inapplicability of k-means to nonconvex sample spaces and requires less time. Because the mean-square-error minimization process of k-means is avoided, the influence of the unbalanced dataset Glass on clustering purity is reduced. In addition, the negative selection density clustering algorithm avoids both the repeated updating of cluster center points in the traditional k-means algorithm and the high-dimensional matrix calculation of spectral clustering, which leads to high time complexity; this reduces the impact of high-dimensional data on clustering time.
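Cluster purity, as defined under Evaluation Indicators, can be computed directly from cluster assignments and true class labels. The sketch below uses the simplification sum_i (m_i / m) * max_j (m_ij / m_i) = (1 / m) * sum_i max_j m_ij; the function name `cluster_purity` and the toy labels are hypothetical.

```python
from collections import Counter


def cluster_purity(cluster_ids, class_labels):
    """Purity = (1/m) * sum over clusters of the size of the dominant class."""
    if len(cluster_ids) != len(class_labels) or not class_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    m = len(class_labels)
    per_cluster = {}
    for cluster, label in zip(cluster_ids, class_labels):
        # per_cluster[i][j] counts m_ij, the members of cluster i in class j.
        per_cluster.setdefault(cluster, Counter())[label] += 1
    return sum(max(counts.values()) for counts in per_cluster.values()) / m


# Toy example (assumed): two clusters of three samples each.
purity = cluster_purity([0, 0, 0, 1, 1, 1],
                        ["spam", "spam", "ham", "ham", "ham", "ham"])
print(purity)
```

In the toy example the first cluster contributes 2 correctly grouped samples and the second contributes 3, so the purity is 5/6; a clustering in which every cluster holds a single class reaches the maximum value of one.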


For clustering, the experimental results show that the negative selection density clustering algorithm achieves higher clustering purity, better time performance, and stronger robustness.

Experiment 2: Performance Comparison of Improved Downsampling Methods

To compare the random downsampling method with the cluster downsampling methods in treating imbalanced data, this experiment compares random downsampling with downsampling by the k-means clustering algorithm and by the NSDC algorithm.

In machine learning, resampling is a typical method for addressing class imbalance. By choosing instances from the original dataset, a new training dataset with a different class distribution is created. Random resampling is a widely used resampling technique in which instances for the changed dataset are selected at random. Because it enables the model to take instances from multiple classes into account equally during training, resampling is frequently seen as a straightforward and efficient approach for unbalanced classification issues. By intentionally raising the proportion of minority class instances in the dataset, oversampling can aid in the model's training by giving it more precise information about such cases. Changing the parameter starting values of the model to more accurately represent the sample size of the training data is known as bias initialization. We adjust the bias of the final layer in more detail.

With k-means downsampling, the misclassification rate of spam text is reduced from 0.49 to 0.23, and the nonspam text misclassification rate is reduced from 0.40 to 0.21. The accuracy rate increases from 59.62% to 79.22%, a significant improvement in classification accuracy. Compared with processing the imbalanced dataset by the k-means algorithm, using the improved NSDC clustering algorithm in this article reduces the misclassification rate of garbage text from 0.23 to 0.15 and the nonspam text misclassification rate from 0.21 to 0.14, and the accuracy rate reaches 85.62%.

As can be seen from Figure 4, since batch gradient descent uses all samples to train the model, each iteration is carried out in the direction of the overall optimization, and, for a sufficiently small learning rate, the loss value decreases monotonically. The experimental results show that the PCA-SGD algorithm has high stability and fast convergence speed. In conclusion, this algorithm has high feasibility.

Figure 4. A comparison of the model classification accuracy under different training times. Series: SGD, PCA-SGD, MBGD, BGD. MBGD: mini-batch gradient descent.

Experiment 3: Performance Comparison of Spam Text Recognition Models

To verify the effectiveness of the semisupervised communication spam text recognition model proposed in this article, three communication text datasets (Ling Spam, Spam Base, and Unicom) are selected, and the accuracy of the improved model in this article is compared with that of the template-free adversarial network evolution model [14] (listed as TFGE in Table 2) and the ID-RF [15] model. The experimental results are shown in Table 2.

Table 2. A comparison of the text recognition accuracy of three methods.

Dataset          TFGE [14]   ID-RF [15]   Semisupervised Model
Ling Spam [5]    85          87           91
Spam Base [6]    86          89           90
Unicom [5]       82          80           89

It can be seen that because the Unicom dataset has more samples and a higher imbalance ratio than the Ling Spam and Spam Base datasets, the accuracy of the three models decreases on it. Still, the semisupervised model proposed in this article has the smallest drop in accuracy. In addition, when the model presented in this article handles the imbalanced samples, it not only uses the improved NSDC-DS undersampling method to undersample the majority class but also uses the NBSVM classifier to classify the reorganized balanced samples and then uses the improved optimization algorithm PCA-SGD, which optimizes the model and obtains a better garbage text recognition effect. The experimental results show that the semisupervised model proposed in this article is better than the other two models on the three datasets for solving the imbalance problem.
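The cluster-based undersampling step that the model applies to the majority class can be sketched in a few lines. This is only an illustration under assumptions: the article does not specify here how representatives are chosen within a cluster, so the sketch draws them at random, in proportion to cluster size; the function name `cluster_downsample` and the toy clusters are hypothetical.

```python
import random


def cluster_downsample(majority, clusters, target_size, seed=0):
    """Pick roughly target_size majority samples, spread over the clusters
    in proportion to cluster size (assumed selection rule)."""
    rng = random.Random(seed)
    total = sum(len(members) for members in clusters)
    chosen = []
    for members in clusters:
        # At least one representative per cluster, never more than it holds.
        quota = max(1, round(target_size * len(members) / total))
        quota = min(quota, len(members))
        chosen.extend(rng.sample(members, quota))
    return [majority[i] for i in chosen]


# Toy data (assumed): 100 majority texts in three clusters, 10 minority texts.
majority = [f"nonspam {i}" for i in range(100)]
minority = [f"spam {i}" for i in range(10)]
clusters = [list(range(0, 60)), list(range(60, 90)), list(range(90, 100))]

representatives = cluster_downsample(majority, clusters, target_size=len(minority))
balanced_texts = representatives + minority
balanced_labels = ["nonspam"] * len(representatives) + ["spam"] * len(minority)
print(len(representatives), len(balanced_texts))
```

Compared with plain random undersampling, every cluster of the majority class stays represented in the balanced training set, which is the property the downsampling comparison relies on. Because of rounding and the one-per-cluster minimum, the number of representatives can differ slightly from `target_size` for other cluster sizes.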


The proposed model also offers better performance in communication spam text recognition.

Conclusion

This research combines unsupervised and supervised learning to recognize communication junk text, improves the algorithm model to optimize parameters, and increases the effect of junk text identification. The specifics are as follows.
◆ Unsupervised learning: To address the drawbacks of standard clustering algorithms, which are sensitive to clustering center points and make it difficult to identify the number of clusters, an NSDC approach is presented.
◆ Supervised learning: The NSDC technique improves on the classic random downsampling approach. As a consequence, the collected samples have more comprehensive overall features, the classifier performs better, and the semisupervised learning approach is utilized to finish the junk detection of the communication text.
◆ Model improvement: Finally, the revised PCA-SGD technique is employed to complete the optimization work of the text junk recognition model, improving the model's recognition performance. The experimental findings reveal that the negative selection density clustering technique is more efficient and has lower complexity, and the enhanced downsampling method NSDC-DS improves the performance of the classifier. Furthermore, the improved stochastic gradient descent technique PCA-SGD performs better than the traditional algorithm.

The communication spam text recognition model developed in this research under semisupervised learning has good recognition performance, with a consistent convergence trend and quicker convergence speed. The density clustering approach based on negative selection obtains the similarity threshold through repeated tests, which demands a significant amount of manual labor. Adapting the threshold automatically to different datasets will be the primary focus of future work.

About the Authors

Haewon Byeon ([email protected]) is with the Department of Digital Anti-Aging Healthcare, Inje University, Gimhae 50834, South Korea.
Sameer Jha ([email protected]) is with the Computer Science Department, De Montfort University, Almaty 050044, Kazakhstan.
Ismail Keshta ([email protected]) is with the Computer Science and Information Systems Department, College of Applied Sciences, AlMaarefa University, Riyadh 11597, Saudi Arabia.
Mohammed Wasim Bhatt (wasimmohammad71@gmail.com) is with the Model Institute of Engineering and Technology Jammu, J&K 181122, India.
Pavitar Parkash Singh ([email protected]) is with the Department of Management, Lovely Professional University, Phagwara 144001, India.
Latika Jindal ([email protected]) is with the Department of Computer Science and Engineering, Medi-Caps University, Indore 453331, India.
T. R. Vijaya Lakshmi ([email protected].in) is with the Mahatma Gandhi Institute of Technology, Gandipet, Hyderabad 500075, India.

References

[1] J. Chen, L. Zhang, and Y. Lu, “Application of scale invariant feature transform to image spam filter,” in Proc. 2nd Int. Conf. Future Gen. Commun. Netw. Symp., 2008, pp. 55–58, doi: 10.1109/FGCNS.2008.24.
[2] A. Ponmalar, K. Rajkumar, U. Hariharan, V. K. G. Kalaiselvi, and S. Deeba, “Analysis of spam detection using integration of logistic regression and PSO algorithm,” in Proc. 4th Int. Conf. Comput. Commun. Technol. (ICT), 2021, pp. 396–402, doi: 10.1109/ICCCT53315.2021.9711903.
[3] X. Yang, T. Zhang, and C. Xu, “Cross-domain feature learning in multimedia,” IEEE Trans. Multimedia, vol. 17, no. 1, pp. 64–78, Jan. 2015, doi: 10.1109/TMM.2014.2375793.
[4] Abhishek, A. Dhankar, and N. Gupta, “A systematic review of techniques, tools and applications of machine learning,” in Proc. 3rd Int. Conf. Intell. Commun. Technol. Virtual Mobile Netw. (ICICV), 2021, pp. 764–768, doi: 10.1109/ICICV50876.2021.9388637.
[5] M. I. Prabha and G. Umarani Srikanth, “Survey of sentiment analysis using deep learning techniques,” in Proc. 1st Int. Conf. Innov. Inf. Commun. Technol. (ICIICT), 2019, pp. 1–9, doi: 10.1109/ICIICT1.2019.8741438.
[6] W. Liu, L. Wang, and F. Hu, “CESMP: Chinese-English segment-aligned multi-field patent data,” in Proc. IEEE 7th Int. Conf. Cloud Comput. Intell. Syst. (CCIS), 2021, pp. 37–41, doi: 10.1109/CCIS53392.2021.9754662.
[7] Y. Gao, A. Choudhary, and G. Hua, “A comprehensive approach to image spam detection: From server to client solution,” IEEE Trans. Inf. Forensics Security, vol. 5, no. 4, pp. 826–836, Dec. 2010, doi: 10.1109/TIFS.2010.2080267.
[8] H. B. Aradhye, G. K. Myers, and J. A. Herson, “Image analysis for efficient categorization of image-based spam e-mail,” in Proc. 8th Int. Conf. Document Anal. Recognit. (ICDAR), 2005, vol. 2, pp. 914–918, doi: 10.1109/ICDAR.2005.135.
[9] E. H. Dhah, M. A. Naser, and S. A. Ali, “Spam email image classification based on text and image features,” in Proc. 1st Int. Conf. Comput. Appl. Sci. (CAS), 2019, pp. 148–153, doi: 10.1109/CAS47993.2019.9075725.
[10] Z. Lu, H. Yu, D. Fan, and C. Yuan, “Spam filtering based on improved CHI feature selection method,” in Proc. Chinese Conf. Pattern Recognit., 2009, pp. 1–3, doi: 10.1109/CCPR.2009.5344010.
[11] S. Dhanaraj and V. Karthikeyan, “A study on e-mail image spam filtering techniques,” in Proc. Int. Conf. Pattern Recognit., Inform. Mobile Eng., 2013, pp. 49–55, doi: 10.1109/ICPRIME.2013.6496446.
[12] P. Wan and M. Uehara, “Spam detection using Sobel operators and OCR,” in Proc. 26th Int. Conf. Adv. Inf. Netw. Appl. Workshops, 2012, pp. 1017–1022, doi: 10.1109/WAINA.2012.24.
[13] M. Das, A. Bhomick, Y. J. Singh, and V. Prasad, “A modular approach towards image spam filtering using multiple classifiers,” in Proc. IEEE Int. Conf. Comput. Intell. Comput. Res., 2014, pp. 1–8, doi: 10.1109/ICCIC.2014.7238323.
[14] X. M. Li and U. M. Kim, “A hierarchical framework for content-based image spam filtering,” in Proc. 8th Int. Conf. Inf. Sci. Digit. Content Technol. (ICIDT), 2012, pp. 149–155.
[15] A. Rusu and V. Govindaraju, “Handwritten CAPTCHA: Using the difference in the abilities of humans and machines in reading handwritten words,” in Proc. 9th Int. Workshop Frontiers Handwriting Recognit., 2004, pp. 226–231, doi: 10.1109/IWFHR.2004.54.
[16] D. Agnihotri, K. Verma, and P. Tripathi, “Variable global feature selection scheme for automatic classification of text documents,” Expert Syst. Appl., vol. 81, pp. 268–281, 2017, doi: 10.1016/j.eswa.2017.03.057.
