Spam Text Detection Over Social Media Usage
A Supervised Sampling Approach for the Social Web of Things
2024
A downsampling strategy based on negative selection density clustering (NSDC-DS) is proposed to improve classifier performance while employing random downsampling for unbalanced communication text. The discovery of self-anomalies via negative selection enhances traditional clustering. The detector and the self-set are the sample center point and the sample to be clustered, respectively; anomalous matching is performed on the two; and the NSDC technique analyzes sample similarity. To improve on the traditional downsampling method, we use the Naïve Bayes support vector machine (NBSVM) classifier to identify garbage in sampled communication samples, use principal component analysis (PCA) to evaluate sample information content, propose an improved PCA-stochastic gradient descent (SGD) algorithm to optimize model parameters, and complete semisupervised communication spam text recognition over the Social Web of Things. Several datasets, including unbalanced communication text, were used to compare the improved approach against the NSDC, NSDC-DS, PCA-SGD, and standard models. According to the trials, the improved model has a quicker and more consistent convergence speed.

Digital Object Identifier 10.1109/MSMC.2023.3343950
Date of current version: 25 April 2024

Today, with the growing presence of trash data on livelihood platforms, preventing garbage data interference to improve system efficiency and service quality has become a hot research topic. Poor classification performance on imbalanced samples is the fundamental problem underlying trash text recognition. The classification integration approach, the cost-sensitive method, and the feature selection method are the key algorithms aimed at this problem. Chen et al. [1] propose employing k-reverse nearest neighbors and one-class support vector machines (SVMs) in series to solve the data imbalance problem. Data balancing via sampling has been shown in several research works to enhance the performance of classifiers. Decision tree classifiers, like C4.5 and C5.0, were employed in more than half of the investigations. The k-nearest neighbor approach and the naive Bayes classifier were both employed in several research works. Random forests (RFs), convolutional neural networks, and linear
discriminant analysis were used in other research to assess the sampling technique.

Imbalanced datasets have a significant skew in the class distribution, such as 1:100 or 1:1,000 samples in the minority class compared to the majority class. One method for dealing with class imbalance is to randomly resample the training dataset. The two basic ways to randomly resample an unbalanced dataset are to eliminate instances from the majority class, known as undersampling, and to duplicate examples from the minority class, known as oversampling. Random resampling is a simple strategy for rebalancing an unbalanced dataset's class distribution. However, overfitting can occur when random oversampling repeats samples from the minority class in the training dataset, and random undersampling removes samples from the majority class, which can result in the loss of information vital to the model.
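As a concrete illustration of these two basic moves (not drawn from the article's own code), here is a minimal NumPy sketch; the array names and the 1:10 imbalance are invented for the example.

```python
import numpy as np

def random_undersample(X_majority, n_minority, rng):
    """Keep a random subset of majority rows, matching the minority count."""
    keep = rng.choice(len(X_majority), size=n_minority, replace=False)
    return X_majority[keep]

def random_oversample(X_minority, n_majority, rng):
    """Duplicate random minority rows until they match the majority count."""
    pick = rng.choice(len(X_minority), size=n_majority, replace=True)
    return X_minority[pick]

rng = np.random.default_rng(42)
X_maj = rng.normal(0.0, 1.0, size=(1000, 5))   # majority class, e.g., nonspam
X_min = rng.normal(3.0, 1.0, size=(100, 5))    # minority class, e.g., spam

print(random_undersample(X_maj, len(X_min), rng).shape)  # (100, 5)
print(random_oversample(X_min, len(X_maj), rng).shape)   # (1000, 5), with repeats
```

Undersampling discards majority information, while oversampling repeats minority rows verbatim, which is exactly where the overfitting and information-loss risks described above come from.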
For data with strongly unbalanced classes, it can be challenging to build classification models. Accuracy can be increased by employing strategies like oversampling, undersampling, resampling combinations, and custom filtering. By incorporating feature scaling, suppression, or neutralization of mean absolute error, Ponmalar [2] substantially enhances the accuracy of credit card fraud detection models. Yang et al. [3] modified the settings of the distance from the imbalanced data to the classification surface to correct the offset problem of the classification surface. Agnihotri [16] offers a novel variable global feature selection strategy, which outperforms the international feature selection technique when dealing with unbalanced data. In the multiclassification issue of mechanical defect diagnostics, this technique has good classification accuracy [5].

For classification issues, the supervised machine learning method naive Bayes is utilized. It is based on the Bayes theorem, and the naive assumption of conditional independence among predictors gives it its name: it works on the premise that every feature in a class is unconnected to every other feature. Naive Bayes applies to both binary (two-class) and multiclass categorization problems. Both regression and classification may be performed using the supervised machine learning method called SVM. In SVM, data points are plotted in n-dimensional space, where n is the number of features, and an appropriate hyperplane that distinguishes between the two classes is then chosen to complete the categorization.
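The article names the NBSVM classifier but does not spell out its construction. One common formulation (due to Wang and Manning) scales binary term features by naive Bayes log-count ratios and then trains a linear SVM on the result; the sketch below follows that assumption, with a toy corpus and scikit-learn as the assumed tooling.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def nb_log_count_ratio(X, y, alpha=1.0):
    """Naive Bayes log-count ratio r = log(p / ||p||_1) - log(q / ||q||_1)."""
    p = alpha + X[y == 1].sum(axis=0)   # smoothed spam term counts
    q = alpha + X[y == 0].sum(axis=0)   # smoothed nonspam term counts
    return np.log(p / p.sum()) - np.log(q / q.sum())

texts = ["win a free prize now", "meeting at noon tomorrow",
         "free money claim now", "lunch with the project team"]
y = np.array([1, 0, 1, 0])              # 1 = spam, 0 = nonspam

X = CountVectorizer(binary=True).fit_transform(texts).toarray().astype(float)
r = nb_log_count_ratio(X, y)

clf = LinearSVC().fit(X * r, y)         # linear SVM on NB-scaled features
print(clf.predict((X * r)[:1]))         # expected [1] on this toy corpus
```

The naive Bayes ratios inject class-conditional evidence into each feature, while the SVM supplies the margin-based decision boundary, which is the usual motivation for combining the two.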
Liu et al. [6] used the class overlap method and the importance of sample points to design the sample fuzzy membership function and assign the membership value, proposing a fuzzy multiclass SVM algorithm that more effectively solves the problem of multiclass unbalanced data and noise. Gao et al. [7] used the prior class probability to weight the posterior class probability to deal with the problem that the minority class is misclassified when a neural network is trained with unbalanced data; the average recall rate of this algorithm has been improved. Aradhye et al. [8] used Bayesian minimum risk theory to find the correct classification threshold and adjusted the imbalanced data after undersampling processing, reducing the impact of undersampling on classification accuracy and probability calibration.

Dhah et al. [9] proposed combining undersampling and oversampling using the genetic algorithm to determine the optimal imbalance rate, which significantly improved the rare-pattern detection rate and classification performance. Lu et al. [10] centralized the data by constraining the range of synthetic data and proposed the Targeted Synthetic Minority Oversampling Technique and the Margin Distribution-Sensitive Synthetic Minority Oversampling Technique, which remedy the marginal-distribution weakness of the standard Synthetic Minority Oversampling Technique on imbalanced datasets. Dhanaraj and Karthikeyan [11] added samples strongly correlated with the minority class and deleted samples weakly correlated with the majority class to achieve class distribution balance, and they proposed the critical value sampling method to improve the accuracy of association classification on imbalanced data. Wan and Uehara [12] proposed a combination strategy using the k-means sampling method and classification guidance words, which improved the classification accuracy on imbalanced data. Das et al. [13] adjusted the weights of the majority and minority classes and proposed a novel undersampling method based on the AdaBoost algorithm to improve the classification effect on imbalanced data.

To identify communication junk text, this research integrates supervised and unsupervised learning, improves the algorithm model to optimize parameters, and boosts the effectiveness of junk text identification. When improving downsampling through clustering, to avoid the difficulty of determining the number of clusters and the high algorithm complexity of traditional clustering algorithms, this article proposes NSDC-DS. According to the experimental results, the negative selection density clustering approach is more effective and less complex, the improved downsampling methodology NSDC-DS enhances the classifier's performance, and the enhanced PCA-SGD stochastic gradient descent approach outperforms its traditional counterpart. With a steady convergence trend and a quickening convergence pace, the communication spam text recognition model created in this study under semisupervised learning shows high recognition performance.

We must choose a suitable number of clusters for our dataset when using clustering methods like k-means; doing so separates the data effectively and correctly. An adequate value for k, the number of clusters, ensures that the clusters are of the right granularity and helps to maintain a healthy balance between the clusters' compressibility and accuracy.
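The article does not say how k should be chosen; one standard heuristic is the elbow method, sketched below on synthetic data with scikit-learn (the dataset and candidate range are illustrative only).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four true clusters (illustrative only).
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Elbow heuristic: inertia (within-cluster sum of squares) for each candidate k.
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
# The "elbow" where inertia stops dropping sharply (here, around k = 4)
# is a reasonable choice for the number of clusters.
```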
Improved Random Sampling Algorithms

Density Clustering Algorithm Based on Negative Selection
The number of clusters in the k-means algorithm is not easy to determine, the algorithm is only suitable for convex sample space datasets, and its clustering effect on unbalanced datasets is poor. The spectral clustering algorithm is suitable for datasets with any shape of sample space, but it still has the disadvantage that the number of clusters is not easy to determine.

The appropriate similarity threshold c is obtained through multiple experiments. When the detector and the self-set meet the matching condition, that is, when the similarity between the detector and the self-set is greater than or equal to the similarity threshold c, the sample center point density can be found, and all matching samples are gathered into a cluster. After each group is clustered, we take the point with the highest density among the samples still to be clustered as the next cluster center point, updating it as the detector and treating the remaining samples as the self-set to be clustered. The similarities of the sample points are then calculated, the samples that meet the similarity threshold's matching condition are found and gathered into the next cluster, and the process repeats until the termination condition is met.
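To make the preceding loop concrete, here is a minimal sketch of the described procedure. The exact similarity and density measures are not given in the text, so the inverse-distance similarity, the fixed-radius density, and the threshold values below are assumptions.

```python
import numpy as np

def nsdc_cluster(X, c=0.5, radius=1.0):
    """Sketch of the loop above: the densest unclustered point acts as the
    detector (cluster center); unclustered samples (the self-set) whose
    similarity to it reaches the threshold c are matched into its cluster.
    Requires c <= 1 so the detector always matches itself and the loop ends."""
    labels = np.full(len(X), -1)
    cluster_id = 0
    while (labels == -1).any():
        rest = np.where(labels == -1)[0]
        # Density of each unclustered point = neighbors within `radius` (assumed).
        dists = np.linalg.norm(X[rest][:, None, :] - X[rest][None, :, :], axis=2)
        detector = rest[np.argmax((dists < radius).sum(axis=1))]
        # Similarity between detector and self-set: inverse distance (assumed).
        sim = 1.0 / (1.0 + np.linalg.norm(X[rest] - X[detector], axis=1))
        labels[rest[sim >= c]] = cluster_id
        cluster_id += 1
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(4, 0.3, (50, 2))])
print(np.bincount(nsdc_cluster(X)))  # roughly [50, 50]: two well-separated blobs
```

With c = 0.5, matching corresponds to a detector-to-sample distance of at most 1 under the assumed similarity, so the two well-separated blobs come out as two clusters without the number of clusters being fixed in advance.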
The data are given a relatively broad feature space that describes them in many machine learning and pattern recognition applications. Four categories of characteristics [2], [3] may be found within this space: entirely irrelevant, redundant and weakly relevant, nonredundant but weakly relevant, and extremely relevant characteristics. The first two categories can considerably reduce the performance of learning algorithms (classification, regression, and clustering) as well as their computational effectiveness, probability of overfitting, and generalization capacity (Figure 1).

Imbalanced Data Processing for NSDC
This article uses the communication text dataset. To improve the learning effect, it is necessary to resolve the imbalance between the communication spam text and the communication nonspam text in the training set samples. Statistics may be used to adjust for this in investigations where there is clustering. The cluster-robust standard error takes the effects of clustering into account, resulting in larger standard errors, broader confidence intervals, and more conservative p values. Fixed-effects models, which include the cluster itself as a factor in a typical regression model, and random-effects models, which account for the commonalities among individuals within clusters in a multilevel model, can both be used to modify regression models to account for clustering.
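As a hedged illustration of the cluster-robust standard errors mentioned here (the article gives no worked example), the statsmodels sketch below compares naive and cluster-robust standard errors on synthetic grouped data; all names and sizes are invented.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
groups = np.repeat(np.arange(30), 20)                         # 30 clusters of 20 obs.
x = rng.normal(size=30)[groups] + 0.5 * rng.normal(size=600)  # x varies by cluster
y = 1.0 + 0.5 * x + rng.normal(size=30)[groups] + rng.normal(size=600)

X = sm.add_constant(x)
naive = sm.OLS(y, X).fit()                                    # i.i.d. standard errors
robust = sm.OLS(y, X).fit(cov_type="cluster",
                          cov_kwds={"groups": groups})
# The cluster-robust standard error on the slope is larger, widening the
# confidence interval and making the p value more conservative.
print(naive.bse[1], robust.bse[1])
```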
Model Description
When identifying communication junk text, the majority classes in the training set are first subjected to unsupervised learning using the improved NSDC algorithm to improve the accuracy of identification. Then, several representative samples are sampled from each category. Next, the sampled set is reorganized with the minority classes in the training set into a balanced training set, and the NBSVM classifier is selected for supervised learning. Finally, the improved PCA-SGD algorithm optimizes the overall model to be completely semisupervised. For the task of garbage text recognition under supervised learning, the overall solution is shown in Figure 2.

Experiment and Result Analysis

Experimental Description and Experimental Data
To verify the effectiveness of the three improvements discussed in this article, the following three experiments are designed. First, by using datasets with different attributes and comparing the clustering purity and time (time complexity
and space complexity) of the traditional algorithms and the NSDC algorithm on different datasets, it is verified that the latter achieves higher purity and better efficiency. For strong robustness, we use the random downsampling method, the NSDC algorithm, and a traditional clustering algorithm, respectively, to sample the majority class in the unbalanced communication data and use the reorganized balanced sample as the training set for the NBSVM classifier.

To perform learning classification, we use the validation set to verify the effectiveness of the improved downsampling method, use the enhanced stochastic gradient descent algorithm to optimize the model, and verify the effectiveness of the PCA-SGD algorithm by comparing its convergence speed and model training speed with the traditional algorithm's performance. The communication spam text recognition model developed in this research under semisupervised learning has good recognition performance, with a consistent convergence trend and a quicker convergence speed. The density clustering approach based on negative selection obtains the similarity threshold through repeated tests, which demands a significant amount of manual labor.

The difficulty of classifying opinion spam under supervision is a topic of extensive research. Algorithms may find trends in spammer reviews using labeled data examples. Four semisupervised techniques using various base
estimators are contrasted with SVM, naive Bayes, and RF, three well-known classification techniques. Different input characteristics and percentages of originally labeled data examples are investigated for each semisupervised learning technique. In addition, each semisupervised model's top-performing experimental configuration is compared to the same model using only positive- or negative-polarity data examples.

The NBSVM classifier is used to learn the balanced training set after NSDC and downsampling create balanced samples, and the semisupervised learning approach is utilized to recognize the communication junk text. To increase the model's classification impact and get better garbage text recognition, we apply the PCA-SGD approach to adjust the model parameters. When recognizing communication trash text, the majority classes in the training set are initially exposed to unsupervised learning using the enhanced NSDC technique to increase identification accuracy. Then, from each group, multiple representative samples are taken. Following that, the sampled set is reconstructed with the minority classes in the training set into a balanced training set, and the NBSVM classifier is chosen for supervised learning. Finally, the revised PCA-SGD method optimizes the total model to be completely semisupervised.

In experiment 1, to verify the robustness and effectiveness of the improved clustering algorithm, this article selects nonconvex, high-dimensional, and unbalanced sample space datasets with similar sample numbers: Double-Circles, Wine, and Glass, plus the comparative dataset Iris. Experiments 2 and 3 use unbalanced communication text data to evaluate the performance of NSDC downsampling and the PCA-SGD algorithm. Ling Spam and Spam Base are commonly used communication datasets. The Unicom data are the unbalanced data consulted by Minsheng platform customers. We balance the communication text data.

Figure 2. The overall solution for spam text recognition: the input communication training set undergoes Chinese word segmentation and stop-word removal, communication text vectorization, NSDC-DS sampling, and PCA-SGD model optimization before spam text recognition.
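Below is a compact end-to-end sketch of the Figure 2 flow, with stand-ins for the article's components: plain random undersampling in place of NSDC-DS representative sampling, TruncatedSVD as a PCA-style projection suited to sparse text, and a hinge-loss SGD linear classifier in place of NBSVM with PCA-SGD tuning. The toy corpus and all names are invented.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for segmented, stop-word-filtered communication text.
spam = ["win cash now"] * 5
ham = ["meeting notes attached"] * 50                # majority (nonspam) class

# Step 1 (stand-in for NSDC-DS): undersample the majority class to the
# minority size; the article clusters first and samples representatives.
rng = np.random.default_rng(0)
ham_kept = [ham[i] for i in rng.choice(len(ham), size=len(spam), replace=False)]

texts = spam + ham_kept                              # rebalanced training set
y = [1] * len(spam) + [0] * len(ham_kept)

# Steps 2-3: vectorization, PCA-style projection, and an SGD-trained
# linear classifier standing in for NBSVM optimized via PCA-SGD.
model = make_pipeline(TfidfVectorizer(),
                      TruncatedSVD(n_components=2, random_state=0),
                      SGDClassifier(loss="hinge", random_state=0))
model.fit(texts, y)
print(model.predict(["win a prize now", "agenda for the meeting"]))
```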
◆ Cluster purity: Purity = Σ_{i=1}^{k} (m_i/m) p_i, where k is the number of clusters, m_i is the number of texts in cluster i, m is the total number of samples, p_i = max_j (p_ij), p_ij = m_ij/m_i, and m_ij is the number of members in cluster i that belong to class j. Purity describes the accuracy of the clustering algorithm: the higher the purity, the better the clustering effect, and the maximum value is one.
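A direct implementation of this purity definition, with toy labels invented for the example:

```python
import numpy as np

def purity(labels_pred, labels_true):
    """Purity = sum_i (m_i/m) * p_i with p_i = max_j m_ij / m_i, i.e., the
    fraction of samples assigned to their cluster's majority class."""
    total = 0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        total += np.bincount(members).max()   # m_ij of the majority class j
    return total / len(labels_true)

pred = np.array([0, 0, 0, 1, 1, 1])           # cluster assignments
true = np.array([0, 0, 1, 1, 1, 1])           # ground-truth classes
print(purity(pred, true))                     # (2 + 3) / 6 ~ 0.833
```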
Algorithm Performance Verification

Experiment 1: NSDC Algorithm Performance Verification
The k-means algorithm is only suitable for convex sample space datasets, and its clustering effect on unbalanced datasets is not good. Moreover, for high-dimensional datasets, the k-means and spectral clustering algorithms have the disadvantages of reduced clustering accuracy and long time consumption. To cross-validate, save your training dataset in groups, reserve a group for prediction at all times, and switch the groups with every run; by doing so, you will be able to train a model with better data. Use cross-datasets to perform cross-validation while utilizing several datasets.
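That procedure is k-fold cross-validation; a minimal scikit-learn sketch on synthetic data follows (the classifier choice and fold count are illustrative only).

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(120, 20))   # toy count features
y = rng.integers(0, 2, size=120)

# Five folds: each group is held out for prediction exactly once while
# the model trains on the remaining four groups.
scores = cross_val_score(MultinomialNB(), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores.round(3), scores.mean().round(3))
```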
To verify that the negative selection density clustering algorithm can remedy the discussed shortcomings, this experiment uses the nonconvex sample space dataset Double-Circles as well as the high-dimensional dataset Wine, the unbalanced dataset Glass, and the comparative dataset Iris. In the experiment, traditional k-means and spectral clustering are selected as the comparison algorithms, and their performance is compared with that of the negative selection clustering algorithm. In addition, the clustering purity and time are used as evaluation indicators. The specific experimental results are shown in Figure 3 and Table 1.

Figure 3. A comparison of the clustering accuracy of each algorithm under different datasets.

Table 1. The clustering performance of the k-means, spectral clustering, and NSDC algorithms on the Double-Circles, Iris, Wine, and Glass datasets.

It can be seen from the results that for the nonconvex sample space dataset Double-Circles, k-means is not suitable for such datasets, so it has a lower purity. The unbalanced class distribution of Glass is likewise ignored by the algorithm, resulting in lower cluster purity than on the balanced dataset Iris. For the high-dimensional dataset Wine, the k-means algorithm must repeatedly update the cluster center points; furthermore, the spectral clustering algorithm needs a significant time overhead because of the high-dimensional matrix operation. The NSDC algorithm proposed in this article calculates the similarity by integrating distance and density, which remedies the inapplicability of k-means to nonconvex spherical sample spaces while avoiding a loss of purity and requiring less time. Because the k-means mean-square-error minimization process is avoided, the influence of the unbalanced dataset Glass on its clustering purity is reduced. In addition, the negative selection density clustering algorithm avoids both the repeated updating of cluster center points in the traditional k-means algorithm and the high-dimensional matrix calculation of spectral clustering that leads to high time complexity, which reduces the impact of high-dimensional data on the time required for
clustering. It can be seen from the experimental results that the negative selection density clustering algorithm has higher clustering purity, a better time effect, and stronger robustness.

Experiment 2: Performance Comparison of Improved Downsampling Methods
To compare the differences between the random downsampling method and cluster-based downsampling methods in treating imbalanced data, this experiment uses random downsampling, downsampling by the k-means clustering algorithm, and downsampling by the NSDC algorithm.

In machine learning, resampling is a typical method for addressing class imbalance. By choosing instances from the original dataset, a new training dataset with a different class distribution is created. Random resampling is a well-liked resampling technique in which instances for the changed dataset are selected at random. Because it enables the model to equally take into account instances from multiple classes during training, resampling is frequently seen as a straightforward and efficient approach for unbalanced classification issues. By intentionally raising the proportion of minority class instances in the dataset, oversampling can aid in the model's training by giving it more precise information about such cases. Changing the parameter starting values of the model to more accurately represent the class distribution of the training data is known as bias initialization; we adjust the bias of the final layer in more detail.

The misclassification rate of spam text is reduced from 0.49 to 0.23, the nonspam text misclassification rate is reduced from 0.40 to 0.21, and the accuracy rate increased from 59.62% to 79.22%, significantly improving classification accuracy. At the same time, when the imbalanced dataset processed by the k-means algorithm is instead processed with the improved NSDC clustering algorithm proposed in this article, the misclassification rate of garbage text falls from 0.23 to 0.15, the nonspam text misclassification rate falls from 0.21 to 0.14, and the accuracy rate reaches 85.62%.

As can be seen from Figure 4, since batch gradient descent uses the total samples to train the model, it ensures that each iteration is carried out in the direction of the overall optimum, and the loss value is guaranteed to decrease monotonically. The experimental results show that the PCA-SGD algorithm has high stability and a fast convergence speed. In conclusion, this algorithm has high feasibility.
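To make the batch-versus-stochastic contrast concrete, here is a small self-contained comparison on a synthetic least-squares problem (illustrative only; the article's PCA-SGD operates on PCA-processed text features, and its exact update rule is not given).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=500)

def mse(w):
    return np.mean((X @ w - y) ** 2)

# Batch gradient descent: the full-sample gradient makes the loss
# decrease monotonically, as the text notes for batch updates.
w = np.zeros(3)
for _ in range(100):
    w -= 0.1 * (2 / len(X)) * X.T @ (X @ w - y)
print("batch GD loss:", round(mse(w), 4))

# Stochastic gradient descent: one random sample per update -- cheaper
# steps and faster early progress, at the cost of a noisier loss curve.
w = np.zeros(3)
for _ in range(2000):
    i = rng.integers(len(X))
    w -= 0.05 * 2 * X[i] * (X[i] @ w - y[i])
print("SGD loss:", round(mse(w), 4))
```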
Experiment 3: Performance Comparison of Spam Text Recognition Models
To verify the effectiveness of the semisupervised communication spam text recognition model proposed in this article, three communication text datasets—Ling Spam, Spam Base, and Unicom—are selected, and the accuracy of the improved model in this article is compared with that of template-free adversarial network evolution [14] and the ID-RF [15] model; the experimental results are shown in Table 2.

It can be seen that because the Unicom dataset has more samples and a higher imbalance ratio than the Ling Spam and Spam Base datasets, the accuracy of all three models decreases on it. Still, the semisupervised model proposed in this article shows the smallest drop in accuracy. In addition, when the model presented in this article handles imbalanced samples, it not only uses the improved NSDC-DS undersampling method to undersample the majority class but also uses the NBSVM classifier to classify the reorganized balanced samples and then applies the improved optimization algorithm PCA-SGD, which optimizes the model and obtains a better garbage text recognition effect. The experimental results show that for solving the imbalance problem, the semisupervised model proposed in this article is better than the other two models on the three datasets.