Spam Text Detection Over Social Media Usage
A Supervised Sampling Approach for the Social Web of Things
2024
A downsampling strategy based on negative selection density clustering (NSDC-DS) is proposed to improve classifier performance while employing random downsampling for unbalanced communication text. The discovery of self-anomalies via negative selection enhances traditional clustering. The detector and the self-set are the sample center point and the sample to be clustered, respectively; anomalous matching is performed on the two; and the NSDC technique analyzes sample similarity. To improve on the traditional downsampling method, we use the Naïve Bayes support vector machine (NBSVM) classifier to identify garbage in sampled communication samples, use principal component analysis (PCA) to evaluate sample information content, propose an improved PCA-stochastic gradient descent (SGD) algorithm to optimize model parameters, and complete semisupervised communication spam text recognition over the Social Web of Things. Several datasets, including unbalanced communication text, were used to compare the improved approach against the NSDC, NSDC-DS, PCA-SGD, and standard models. According to the trials, the improved model has a quicker and more consistent convergence speed.

Digital Object Identifier 10.1109/MSMC.2023.3343950
Date of current version: 25 April 2024

Today, with the growing presence of trash data on livelihood platforms, preventing garbage data interference to improve system efficiency and service quality has become a hot research topic. Poor classification performance on imbalanced samples is the fundamental problem underlying trash text recognition. The classification integration approach, the cost-sensitive method, and the feature selection method are the key algorithms aimed at this problem. Chen et al. [1] propose employing k-reverse nearest neighbors and one-class support vector machines (SVMs) in series to solve the data imbalance problem. Data balancing via sampling has been shown in several research works to enhance the performance of classifiers. Decision tree classifiers, like C4.5 and C5.0, were employed in more than half of the investigations. The k-nearest neighbor approach and the naive Bayes classifier were both employed in several research works. Random forests (RFs), convolutional neural networks, and linear
discriminant analysis were used in other research to assess the sampling technique.

Imbalanced datasets have a significant skew in the class distribution, such as 1:100 or 1:1,000 samples in the minority class compared to the majority class. One method for dealing with class imbalance is to randomly resample the training dataset. The two basic ways to randomly resample an unbalanced dataset are to eliminate instances from the majority class, known as undersampling, and to duplicate examples from the minority class, known as oversampling. Random resampling is a simple strategy for rebalancing an unbalanced dataset's class distribution. However, overfitting can occur when random oversampling repeats samples from the minority class in the training dataset, and random undersampling removes samples from the majority class, which can result in the loss of information vital to the model.
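As a concrete illustration of these two basic moves (not drawn from the article's own code), here is a minimal NumPy sketch; the array names and the 1:10 imbalance are invented for the example.

```python
import numpy as np

def random_undersample(X_majority, n_minority, rng):
    """Keep a random subset of majority rows, matching the minority count."""
    keep = rng.choice(len(X_majority), size=n_minority, replace=False)
    return X_majority[keep]

def random_oversample(X_minority, n_majority, rng):
    """Duplicate random minority rows until they match the majority count."""
    pick = rng.choice(len(X_minority), size=n_majority, replace=True)
    return X_minority[pick]

rng = np.random.default_rng(42)
X_maj = rng.normal(0.0, 1.0, size=(1000, 5))   # majority class, e.g., nonspam
X_min = rng.normal(3.0, 1.0, size=(100, 5))    # minority class, e.g., spam

print(random_undersample(X_maj, len(X_min), rng).shape)  # (100, 5)
print(random_oversample(X_min, len(X_maj), rng).shape)   # (1000, 5), with repeats
```

Undersampling discards majority information, while oversampling repeats minority rows verbatim, which is exactly where the overfitting and information-loss risks described above come from.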
For data with strongly unbalanced classes, it can be challenging to build classification models. Accuracy can be increased by employing strategies like oversampling, undersampling, resampling combinations, and custom filtering. By incorporating feature scaling, suppression, or neutralization of mean absolute error, Ponmalar [2] substantially enhances the accuracy of credit card fraud detection models. Yang et al. [3] modified the settings of the distance from the imbalanced data to the classification surface to correct the offset problem of the classification surface. Agnihotri [16] offers a novel variable global feature selection strategy, which outperforms the international feature selection technique when dealing with unbalanced data. In the multiclassification issue of mechanical defect diagnostics, this technique has good classification accuracy [5].

For classification issues, the supervised machine learning method naive Bayes is utilized. It is based on the Bayes theorem, and the naive assumption of conditional independence among predictors gives it its name: it works on the premise that every feature in a class is unconnected to every other feature. Naive Bayes applies to both binary (two-class) and multiclass categorization problems. Both regression and classification may be performed using the supervised machine learning method called SVM. In SVM, data points are plotted in n-dimensional space, where n is the number of features, and an appropriate hyperplane that distinguishes between the two classes is then chosen to complete the categorization.
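The article names the NBSVM classifier but does not spell out its construction. One common formulation (due to Wang and Manning) scales binary term features by naive Bayes log-count ratios and then trains a linear SVM on the result; the sketch below follows that assumption, with a toy corpus and scikit-learn as the assumed tooling.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def nb_log_count_ratio(X, y, alpha=1.0):
    """Naive Bayes log-count ratio r = log(p / ||p||_1) - log(q / ||q||_1)."""
    p = alpha + X[y == 1].sum(axis=0)   # smoothed spam term counts
    q = alpha + X[y == 0].sum(axis=0)   # smoothed nonspam term counts
    return np.log(p / p.sum()) - np.log(q / q.sum())

texts = ["win a free prize now", "meeting at noon tomorrow",
         "free money claim now", "lunch with the project team"]
y = np.array([1, 0, 1, 0])              # 1 = spam, 0 = nonspam

X = CountVectorizer(binary=True).fit_transform(texts).toarray().astype(float)
r = nb_log_count_ratio(X, y)

clf = LinearSVC().fit(X * r, y)         # linear SVM on NB-scaled features
print(clf.predict((X * r)[:1]))         # expected [1] on this toy corpus
```

The naive Bayes ratios inject class-conditional evidence into each feature, while the SVM supplies the margin-based decision boundary, which is the usual motivation for combining the two.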
Liu et al. [6] used the class overlap method and the importance of sample points to design the sample fuzzy membership function and assign the membership value, proposing a fuzzy multiclass SVM algorithm that more effectively solves the problem of multiclass unbalanced data and noise. Gao et al. [7] used the prior class probability to weight the posterior class probability to deal with the problem that the minority class is misclassified when a neural network is trained with unbalanced data; the average recall rate of this algorithm has been improved. Aradhye et al. [8] used Bayesian minimum risk theory to find the correct classification threshold and adjusted the imbalanced data after undersampling processing, reducing the impact of undersampling on classification accuracy and probability calibration.

Dhah et al. [9] proposed combining undersampling and oversampling using the genetic algorithm to determine the optimal imbalance rate, which significantly improved the rare-pattern detection rate and classification performance. Lu et al. [10] centralized the data by constraining the range of synthetic data and proposed the Targeted Synthetic Minority Oversampling Technique and the Margin Distribution-Sensitive Synthetic Minority Oversampling Technique, which remedy the marginal-distribution weakness of the standard Synthetic Minority Oversampling Technique on imbalanced datasets. Dhanaraj and Karthikeyan [11] added samples strongly correlated with the minority class and deleted samples weakly correlated with the majority class to achieve class distribution balance, and they proposed the critical value sampling method to improve the accuracy of association classification on imbalanced data. Wan and Uehara [12] proposed a combination strategy using the k-means sampling method and classification guidance words, which improved the classification accuracy on imbalanced data. Das et al. [13] adjusted the weights of the majority and minority classes and proposed a novel undersampling method based on the AdaBoost algorithm to improve the classification effect on imbalanced data.

To identify communication junk text, this research integrates supervised and unsupervised learning, improves the algorithm model to optimize parameters, and boosts the effectiveness of junk text identification. When improving downsampling through clustering, to avoid the difficulty of determining the number of clusters and the high algorithm complexity of traditional clustering algorithms, this article proposes NSDC-DS. According to the experimental results, the negative selection density clustering approach is more effective and less complex, the improved downsampling methodology NSDC-DS enhances the classifier's performance, and the enhanced PCA-SGD stochastic gradient descent approach outperforms its traditional counterpart. With a steady convergence trend and a quickening convergence pace, the communication spam text recognition model created in this study under semisupervised learning shows high recognition performance.

We must choose a suitable number of clusters for our dataset when using clustering methods like k-means; doing so separates the data effectively and correctly. An adequate value for k, the number of clusters, ensures that the clusters are of the right granularity and helps to maintain a healthy balance between the clusters' compressibility and accuracy.
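The article does not say how k should be chosen; one standard heuristic is the elbow method, sketched below on synthetic data with scikit-learn (the dataset and candidate range are illustrative only).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four true clusters (illustrative only).
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Elbow heuristic: inertia (within-cluster sum of squares) for each candidate k.
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
# The "elbow" where inertia stops dropping sharply (here, around k = 4)
# is a reasonable choice for the number of clusters.
```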
Improved Random Sampling Algorithms

Density Clustering Algorithm Based on Negative Selection
The number of clusters in the k-means algorithm is not easy to determine, the algorithm is only suitable for convex sample space datasets, and its clustering effect on unbalanced datasets is poor. The spectral clustering algorithm is suitable for datasets with any shape of sample space, but it still has the disadvantage that the number of clusters is not easy to determine.

The appropriate similarity threshold c is obtained through multiple experiments. When the detector and the self-set meet the matching condition, that is, when the similarity between the detector and the self-set is greater than or equal to the similarity threshold c, the sample center point density can be found, and all matching samples are gathered into a cluster. After each group is clustered, we take the point with the highest density among the samples still to be clustered as the next cluster center point, updating it as the detector and treating the remaining samples as the self-set to be clustered. The similarities of the sample points are then calculated, the samples that meet the similarity threshold's matching condition are found and gathered into the next cluster, and the process repeats until the termination condition is met.
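To make the preceding loop concrete, here is a minimal sketch of the described procedure. The exact similarity and density measures are not given in the text, so the inverse-distance similarity, the fixed-radius density, and the threshold values below are assumptions.

```python
import numpy as np

def nsdc_cluster(X, c=0.5, radius=1.0):
    """Sketch of the loop above: the densest unclustered point acts as the
    detector (cluster center); unclustered samples (the self-set) whose
    similarity to it reaches the threshold c are matched into its cluster.
    Requires c <= 1 so the detector always matches itself and the loop ends."""
    labels = np.full(len(X), -1)
    cluster_id = 0
    while (labels == -1).any():
        rest = np.where(labels == -1)[0]
        # Density of each unclustered point = neighbors within `radius` (assumed).
        dists = np.linalg.norm(X[rest][:, None, :] - X[rest][None, :, :], axis=2)
        detector = rest[np.argmax((dists < radius).sum(axis=1))]
        # Similarity between detector and self-set: inverse distance (assumed).
        sim = 1.0 / (1.0 + np.linalg.norm(X[rest] - X[detector], axis=1))
        labels[rest[sim >= c]] = cluster_id
        cluster_id += 1
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(4, 0.3, (50, 2))])
print(np.bincount(nsdc_cluster(X)))  # roughly [50, 50]: two well-separated blobs
```

With c = 0.5, matching corresponds to a detector-to-sample distance of at most 1 under the assumed similarity, so the two well-separated blobs come out as two clusters without the number of clusters being fixed in advance.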
The data are given a relatively broad feature space that describes them in many machine learning and pattern recognition applications. Four categories of characteristics [2], [3] may be found within this space: entirely irrelevant, redundant and weakly relevant, nonredundant but weakly relevant, and extremely relevant characteristics. The first two categories can considerably reduce the performance of learning algorithms (classification, regression, and clustering) as well as their computational effectiveness, probability of overfitting, and generalization capacity (Figure 1).

Imbalanced Data Processing for NSDC
This article uses the communication text dataset. To improve the learning effect, it is necessary to resolve the imbalance between the communication spam text and the communication nonspam text in the training set samples. Statistics may be used to adjust for this in investigations where there is clustering. The cluster-robust standard error takes the effects of clustering into account, resulting in larger standard errors, broader confidence intervals, and more conservative p values. Fixed-effects models, which include the cluster itself as a factor in a typical regression model, and random-effects models, which account for the commonalities among individuals within clusters in a multilevel model, can both be used to modify regression models to account for clustering.
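As a hedged illustration of the cluster-robust standard errors mentioned here (the article gives no worked example), the statsmodels sketch below compares naive and cluster-robust standard errors on synthetic grouped data; all names and sizes are invented.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
groups = np.repeat(np.arange(30), 20)                         # 30 clusters of 20 obs.
x = rng.normal(size=30)[groups] + 0.5 * rng.normal(size=600)  # x varies by cluster
y = 1.0 + 0.5 * x + rng.normal(size=30)[groups] + rng.normal(size=600)

X = sm.add_constant(x)
naive = sm.OLS(y, X).fit()                                    # i.i.d. standard errors
robust = sm.OLS(y, X).fit(cov_type="cluster",
                          cov_kwds={"groups": groups})
# The cluster-robust standard error on the slope is larger, widening the
# confidence interval and making the p value more conservative.
print(naive.bse[1], robust.bse[1])
```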
Model Description
When identifying communication junk text, the majority classes in the training set are first subjected to unsupervised learning using the improved NSDC algorithm to improve the accuracy of identification. Then, several representative samples are sampled from each category. Next, the sampled set is reorganized with the minority classes in the training set into a balanced training set, and the NBSVM classifier is selected for supervised learning. Finally, the improved PCA-SGD algorithm optimizes the overall model to be completely semisupervised. For the task of garbage text recognition under supervised learning, the overall solution is shown in Figure 2.

Experiment and Result Analysis

Experimental Description and Experimental Data
To verify the effectiveness of the three improvements discussed in this article, the following three experiments are designed. First, by using datasets with different attributes and comparing the clustering purity and time (time complexity
and space complexity) of the traditional algorithms and the NSDC algorithm on different datasets, it is verified that the latter achieves higher purity and better efficiency. For strong robustness, we use the random downsampling method, the NSDC algorithm, and a traditional clustering algorithm, respectively, to sample the majority class in the unbalanced communication data and use the reorganized balanced sample as the training set for the NBSVM classifier.

To perform learning classification, we use the validation set to verify the effectiveness of the improved downsampling method, use the enhanced stochastic gradient descent algorithm to optimize the model, and verify the effectiveness of the PCA-SGD algorithm by comparing its convergence speed and model training speed with the traditional algorithm's performance. The communication spam text recognition model developed in this research under semisupervised learning has good recognition performance, with a consistent convergence trend and a quicker convergence speed. The density clustering approach based on negative selection obtains the similarity threshold through repeated tests, which demands a significant amount of manual labor.

The difficulty of classifying opinion spam under supervision is a topic of extensive research. Algorithms may find trends in spammer reviews using labeled data examples. Four semisupervised techniques using various base
estimators are contrasted with SVM, naive Bayes, and RF, three well-known classification techniques. Different input characteristics and percentages of originally labeled data examples are investigated for each semisupervised learning technique. In addition, each semisupervised model's top-performing experimental configuration is compared to the same model using only positive- or negative-polarity data examples.

The NBSVM classifier is used to learn the balanced training set after NSDC and downsampling create balanced samples, and the semisupervised learning approach is utilized to recognize the communication junk text. To increase the model's classification impact and get better garbage text recognition, we apply the PCA-SGD approach to adjust the model parameters. When recognizing communication trash text, the majority classes in the training set are initially exposed to unsupervised learning using the enhanced NSDC technique to increase identification accuracy. Then, from each group, multiple representative samples are taken. Following that, the sampled set is reconstructed with the minority classes in the training set into a balanced training set, and the NBSVM classifier is chosen for supervised learning. Finally, the revised PCA-SGD method optimizes the total model to be completely semisupervised.

In experiment 1, to verify the robustness and effectiveness of the improved clustering algorithm, this article selects nonconvex, high-dimensional, and unbalanced sample space datasets with similar sample numbers: Double-Circles, Wine, and Glass, plus the comparative dataset Iris. Experiments 2 and 3 use unbalanced communication text data to evaluate the performance of NSDC downsampling and the PCA-SGD algorithm. Ling Spam and Spam Base are commonly used communication datasets. The Unicom data are the unbalanced data consulted by Minsheng platform customers. We balance the communication text data.

Figure 2. The overall solution for spam text recognition: the input communication training set undergoes Chinese word segmentation and stop-word removal, communication text vectorization, NSDC-DS sampling, and PCA-SGD model optimization before spam text recognition.
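Below is a compact end-to-end sketch of the Figure 2 flow, with stand-ins for the article's components: plain random undersampling in place of NSDC-DS representative sampling, TruncatedSVD as a PCA-style projection suited to sparse text, and a hinge-loss SGD linear classifier in place of NBSVM with PCA-SGD tuning. The toy corpus and all names are invented.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for segmented, stop-word-filtered communication text.
spam = ["win cash now"] * 5
ham = ["meeting notes attached"] * 50                # majority (nonspam) class

# Step 1 (stand-in for NSDC-DS): undersample the majority class to the
# minority size; the article clusters first and samples representatives.
rng = np.random.default_rng(0)
ham_kept = [ham[i] for i in rng.choice(len(ham), size=len(spam), replace=False)]

texts = spam + ham_kept                              # rebalanced training set
y = [1] * len(spam) + [0] * len(ham_kept)

# Steps 2-3: vectorization, PCA-style projection, and an SGD-trained
# linear classifier standing in for NBSVM optimized via PCA-SGD.
model = make_pipeline(TfidfVectorizer(),
                      TruncatedSVD(n_components=2, random_state=0),
                      SGDClassifier(loss="hinge", random_state=0))
model.fit(texts, y)
print(model.predict(["win a prize now", "agenda for the meeting"]))
```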
◆ Cluster purity: Purity = Σ_{i=1}^{k} (m_i/m) p_i, where k is the number of clusters, m_i is the number of texts in cluster i, m is the total number of samples, p_i = max_j (p_ij), p_ij = m_ij/m_i, and m_ij is the number of members in cluster i that belong to class j. Purity describes the accuracy of the clustering algorithm: the higher the purity, the better the clustering effect, and the maximum value is one.
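A direct implementation of this purity definition, with toy labels invented for the example:

```python
import numpy as np

def purity(labels_pred, labels_true):
    """Purity = sum_i (m_i/m) * p_i with p_i = max_j m_ij / m_i, i.e., the
    fraction of samples assigned to their cluster's majority class."""
    total = 0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        total += np.bincount(members).max()   # m_ij of the majority class j
    return total / len(labels_true)

pred = np.array([0, 0, 0, 1, 1, 1])           # cluster assignments
true = np.array([0, 0, 1, 1, 1, 1])           # ground-truth classes
print(purity(pred, true))                     # (2 + 3) / 6 ~ 0.833
```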
Algorithm Performance Verification

Experiment 1: NSDC Algorithm Performance Verification
The k-means algorithm is only suitable for convex sample space datasets, and its clustering effect on unbalanced datasets is not good. Moreover, for high-dimensional datasets, the k-means and spectral clustering algorithms have the disadvantages of reduced clustering accuracy and long time consumption. To cross-validate, save your training dataset in groups, reserve a group for prediction at all times, and switch the groups with every run; by doing so, you will be able to train a model with better data. Use cross-datasets to perform cross-validation while utilizing several datasets.
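That procedure is k-fold cross-validation; a minimal scikit-learn sketch on synthetic data follows (the classifier choice and fold count are illustrative only).

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(120, 20))   # toy count features
y = rng.integers(0, 2, size=120)

# Five folds: each group is held out for prediction exactly once while
# the model trains on the remaining four groups.
scores = cross_val_score(MultinomialNB(), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores.round(3), scores.mean().round(3))
```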
To verify that the negative selection density clustering algorithm can remedy the discussed shortcomings, this experiment uses the nonconvex sample space dataset Double-Circles as well as the high-dimensional dataset Wine, the unbalanced dataset Glass, and the comparative dataset Iris. In the experiment, traditional k-means and spectral clustering are selected as the comparison algorithms, and their performance is compared with that of the negative selection clustering algorithm. In addition, the clustering purity and time are used as evaluation indicators. The specific experimental results are shown in Figure 3 and Table 1.

Figure 3. A comparison of the clustering accuracy of each algorithm under different datasets.

Table 1. The clustering performance of the k-means, spectral clustering, and NSDC algorithms on the Double-Circles, Iris, Wine, and Glass datasets.

It can be seen from the results that for the nonconvex sample space dataset Double-Circles, k-means is not suitable for such datasets, so it has a lower purity. The unbalanced class distribution of Glass is likewise ignored by the algorithm, resulting in lower cluster purity than on the balanced dataset Iris. For the high-dimensional dataset Wine, the k-means algorithm must repeatedly update the cluster center points; furthermore, the spectral clustering algorithm needs a significant time overhead because of the high-dimensional matrix operation. The NSDC algorithm proposed in this article calculates the similarity by integrating distance and density, which remedies the inapplicability of k-means to nonconvex spherical sample spaces while avoiding a loss of purity and requiring less time. Because the k-means mean-square-error minimization process is avoided, the influence of the unbalanced dataset Glass on its clustering purity is reduced. In addition, the negative selection density clustering algorithm avoids both the repeated updating of cluster center points in the traditional k-means algorithm and the high-dimensional matrix calculation of spectral clustering that leads to high time complexity, which reduces the impact of high-dimensional data on the time required for
clustering. It can be seen from the experimental results that the negative selection density clustering algorithm has higher clustering purity, a better time effect, and stronger robustness.

Experiment 2: Performance Comparison of Improved Downsampling Methods
To compare the differences between the random downsampling method and cluster-based downsampling methods in treating imbalanced data, this experiment uses random downsampling, downsampling by the k-means clustering algorithm, and downsampling by the NSDC algorithm.

In machine learning, resampling is a typical method for addressing class imbalance. By choosing instances from the original dataset, a new training dataset with a different class distribution is created. Random resampling is a well-liked resampling technique in which instances for the changed dataset are selected at random. Because it enables the model to equally take into account instances from multiple classes during training, resampling is frequently seen as a straightforward and efficient approach for unbalanced classification issues. By intentionally raising the proportion of minority class instances in the dataset, oversampling can aid in the model's training by giving it more precise information about such cases. Changing the parameter starting values of the model to more accurately represent the class distribution of the training data is known as bias initialization; we adjust the bias of the final layer in more detail.

The misclassification rate of spam text is reduced from 0.49 to 0.23, the nonspam text misclassification rate is reduced from 0.40 to 0.21, and the accuracy rate increased from 59.62% to 79.22%, significantly improving classification accuracy. At the same time, when the imbalanced dataset processed by the k-means algorithm is instead processed with the improved NSDC clustering algorithm proposed in this article, the misclassification rate of garbage text falls from 0.23 to 0.15, the nonspam text misclassification rate falls from 0.21 to 0.14, and the accuracy rate reaches 85.62%.

As can be seen from Figure 4, since batch gradient descent uses the total samples to train the model, it ensures that each iteration is carried out in the direction of the overall optimum, and the loss value is guaranteed to decrease monotonically. The experimental results show that the PCA-SGD algorithm has high stability and a fast convergence speed. In conclusion, this algorithm has high feasibility.
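To make the batch-versus-stochastic contrast concrete, here is a small self-contained comparison on a synthetic least-squares problem (illustrative only; the article's PCA-SGD operates on PCA-processed text features, and its exact update rule is not given).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=500)

def mse(w):
    return np.mean((X @ w - y) ** 2)

# Batch gradient descent: the full-sample gradient makes the loss
# decrease monotonically, as the text notes for batch updates.
w = np.zeros(3)
for _ in range(100):
    w -= 0.1 * (2 / len(X)) * X.T @ (X @ w - y)
print("batch GD loss:", round(mse(w), 4))

# Stochastic gradient descent: one random sample per update -- cheaper
# steps and faster early progress, at the cost of a noisier loss curve.
w = np.zeros(3)
for _ in range(2000):
    i = rng.integers(len(X))
    w -= 0.05 * 2 * X[i] * (X[i] @ w - y[i])
print("SGD loss:", round(mse(w), 4))
```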
Experiment 3: Performance Comparison of Spam Text Recognition Models
To verify the effectiveness of the semisupervised communication spam text recognition model proposed in this article, three communication text datasets—Ling Spam, Spam Base, and Unicom—are selected, and the accuracy of the improved model in this article is compared with that of template-free adversarial network evolution [14] and the ID-RF [15] model; the experimental results are shown in Table 2.

It can be seen that because the Unicom dataset has more samples and a higher imbalance ratio than the Ling Spam and Spam Base datasets, the accuracy of all three models decreases on it. Still, the semisupervised model proposed in this article shows the smallest drop in accuracy. In addition, when the model presented in this article handles imbalanced samples, it not only uses the improved NSDC-DS undersampling method to undersample the majority class but also uses the NBSVM classifier to classify the reorganized balanced samples and then applies the improved optimization algorithm PCA-SGD, which optimizes the model and obtains a better garbage text recognition effect. The experimental results show that for solving the imbalance problem, the semisupervised model proposed in this article is better than the other two models on the three datasets.