
A Novel Machine Learning Algorithm For Spammer Identification in Industrial Mobile Cloud Computing

This document proposes a new machine learning algorithm called SIGMM to identify spammers in industrial mobile cloud computing networks. SIGMM uses Gaussian mixture modeling to classify users as spammers or normal without relying on relationships between users. It was tested on a real industrial mobile network dataset and was more accurate and faster than other algorithms for spammer identification.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TII.2018.2799907, IEEE Transactions on Industrial Informatics

A Novel Machine Learning Algorithm for Spammer Identification in Industrial Mobile Cloud Computing

Tie Qiu, Senior Member, IEEE, Heyuan Wang, Keqiu Li, Senior Member, IEEE, Huansheng Ning, Senior Member, IEEE, Arun Kumar Sangaiah, Baochao Chen

Tie Qiu and Keqiu Li are with the School of Computer Science and Technology, Tianjin University, Tianjin 300050, China. (E-mail: [email protected]; [email protected])
Heyuan Wang and Baochao Chen are with the School of Software, Dalian University of Technology, Dalian 116620, China. (E-mail: [email protected]; [email protected])
Huansheng Ning is with the School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China. (E-mail: [email protected])
Arun Kumar Sangaiah is with the School of Computer Science and Engineering, Vellore Institute of Technology (VIT), Vellore 632014, India. (E-mail: arunk-[email protected])

Abstract: An industrial mobile network is crucial for industrial production in the Internet of Things. It guarantees the normal function of machines and the normalization of industrial production. However, this characteristic can be utilized by spammers to attack others and influence industrial production. Users who only share spam, such as links to viruses and advertisements, are called spammers. With the growth of mobile network membership, spammers have organized into groups for the purpose of benefit maximization, which has caused confusion and heavy losses to industrial production. It is difficult to distinguish spammers from normal users owing to the characteristics of multidimensional data. To address this problem, this paper proposes a Spammer Identification scheme based on Gaussian Mixture Model (SIGMM) that utilizes machine learning for industrial mobile networks. It provides intelligent identification of spammers without relying on flexible and unreliable relationships. SIGMM combines the presentation of data, where each user node is classified into one class in the construction process of the model. We validate SIGMM by comparing it with the reality mining algorithm and hybrid FCM clustering algorithm using a mobile network dataset from a cloud server. Simulation results show that SIGMM outperforms these previous schemes in terms of recall, precision, and time complexity.

Index Terms: Industrial mobile network, Internet of Things, spammers, intelligent identification, machine learning

I. INTRODUCTION

The Internet of Things (IoT) [1] is an important component of the new generation of information technology. It is widely used in many fields such as industrial control, cyber-physical systems, and military investigation through the techniques of intelligent perception, identification technology, and pervasive computing [2]. The basic idea of IoT is to understand and measure the environment through the interconnection of objects around people [3]; its foundation is the internet and terminals that provide communication between objects [4]. It connects humans and objects, objects with objects, provides remote control, and controls intelligent networks in new ways through enabling technologies [5].

An important branch of IoT is control, including human to object control and human control of machines, which is an important foundation for achieving intelligence [6]. This is particularly important in modern industrial production. The mobile network becomes a target of spammers due to its importance in industrial production control. Spam is one of the most common forms of attack in mobile networks. Spammers pretend to be normal users and only send spam [7, 8], and these are the users we aim to detect. A serious problem caused by spam is that links leading to viruses are selected by mistake, and then users' personal information is stolen or production control is interfered with. These malicious nodes communicate with each other and spammers hide in them, as shown in Fig. 1. The net structure with character graphics on the left side represents mobile cloud computing; the right side describes the behavior data of each user. The red character graphics represent spammers, and the green ones represent normal users.

Proposals from industry and academia discuss solutions for shielding against spam. Classification based on machine learning is a learning process for mapping data samples into two classes. However, it has limitations. One is data imbalance: unlabeled data are present in a much larger amount than labeled data, which hinders direct model construction. Another limitation is multidimensional data: too many features can lead to overfitting. Hence intelligent feature selection is necessary. In this paper, we first investigate the characteristics of spammers and normal users in an industrial mobile network. Then, the SIGMM model is proposed and developed based on the Gaussian Mixture Model, which focuses on spammer identification. The paper contains the following three main contributions.

• Based on the Gaussian Mixture Model, we propose a recognition process named the SIGMM model for classification without relying on users' unreliable relationships.
• SIGMM can label data automatically, which increases the precision of the model by expanding the training set. It labels large amounts of unlabeled data based on a few labeled data and solves the problem of the imbalance between labeled data and unlabeled data.
• We use an industrial mobile network dataset from a cloud server to perform simulations. The results show that SIGMM performs better than two other models in terms of identifying spammers and reducing time complexity.

The rest of this paper is organized as follows. Related work is introduced in Section II. In Section III, preliminaries are

1551-3203 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Fig. 1: Spammers in mobile cloud computing

discussed. In Section IV, we provide a detailed description of the proposed SIGMM model. Simulations are performed to present SIGMM's performance in identifying spammers, and we compare SIGMM with the reality mining algorithm (RMA) and hybrid FCM clustering algorithm (HFCM) in Section V. We implement spammer identification on the industrial mobile data to verify our proposed algorithm. Finally, the conclusions and future work are presented in Section VI.

II. RELATED WORK

For existing algorithms, there are three types of machine learning: supervised learning, unsupervised learning, and reinforcement learning.

1) Supervised Learning: The main goal of supervised learning is to learn a model from labeled training data that allows us to make predictions about unseen or future data [9]. Supervised refers to a set of samples where the desired output labels are already known [10]. In the Spammer Detection algorithm based on Logistic Regression [11], a spammer classifier is built for an online network with some features as inputs, and the algorithm output is 1 if a spammer is suspected. The model is trained on a large training set; however, collection of labeled data is rather difficult because of the recent emphasis on the secrecy of user data.

2) Unsupervised Learning: Using unsupervised learning techniques, we are able to explore the structure of our data to extract meaningful information without the guidance of a known outcome variable or reward function [12]. A clustering algorithm is the main algorithm for unsupervised learning [13]. Clustering is a technique that allows us to find groups of similar members [14]. In the RMA based on K-means in [15], the algorithm proposes a silhouette function which accepts the number of clusters as a parameter to judge the accuracy of clustering. Then it uses a matrix of means to record the mean silhouette values for each value of k and finally determines the best value of k. But the clustering result depends on the k centroids. Therefore it must consume extra time to determine the value of k. Furthermore, experimental results are unstable: the same k used in several experiments produces different results. In [16], a prediction model based on Big Data analysis using a hybrid FCM clustering algorithm (HFCM) is proposed. It works by repeating arithmetic operations to minimize the objective function and updating the membership function, which is very time-consuming. Its experimental performance depends on the dataset being small; otherwise its accuracy is sharply reduced.

3) Reinforcement learning: Reinforcement learning algorithms are very suitable for learning to control an agent by allowing it to interact with an environment [17]. The goal is to choose an action to maximize an expected long-term reward. In [18], a recursive least squares (RLS) algorithm based on the reinforcement algorithm is proposed. It applies Q-Learning by choosing a policy which is the best selection for a specific user. But it starts from a random user and explores the network through friendship relationships, which restricts the scope of the exploration and leads to decreased detection efficiency.

Existing methods [19, 20] mainly depend on the relationships among users. However, owing to the development of intelligent recommendation mechanisms, user associations are not based on their true preferences or intentions. In the case where the users are not very clearly defined, a user might follow or be followed by an automatic link. Therefore many users represent fuzzy relationships. In order to solve the above problems, we propose SIGMM for industrial mobile networks.

III. PRELIMINARIES

In order to learn the data construction and rules, and owing to the limited access to original data, we preprocess the data after extracting any available original data in an industrial mobile network.

A. Data description

Our data contain the following contents: the user's ID, the relationship with other users, the time-stamped post record, and the activity in the past three months. From the post record, we calculate the frequency of posting, the proportion of posts with a URL or @, and the average similarity among the user's posts. The activity reflects whether the account is normal or not. It indicates the frequency of following others, which is necessary because spammers tend to follow others all the time.

B. Feature scaling

The data we obtained have the following two constraints. First, the labeled data are far fewer than the unlabeled data, which severely decreases the precision of training. Second, there is large data noise that may cause incorrect factors in


the parameters of the model. Data points that do not belong to any class are defined as data noise. The values of some data may greatly differ from the mean of samples. SIGMM reduces data noise by calculating the similarity among users to increase the precision of training.

TABLE I (group F3): Pearson correlation coefficients.
F3      friend_portion   neighborhood's posts   neighborhood's fans
label   -0.84            0.20                   0.11

In order to remove data noise from large datasets, the similarity is calculated first according to the vectors that describe user behaviors. The similarity measurement is usually related to distance. Commonly used methods are based on Euclidean and cosine distances. The cosine method uses vector angles to represent the distance between two data objects. The Euclidean distance method calculates the absolute distance between data objects in space. If we arbitrarily extend the modulus of vectors, Euclidean distance is sensitive to differences, while cosine distance does not detect any change. Therefore, the method based on Euclidean distance is more appropriate for our purpose.
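The contrast between the two distances can be checked with a small sketch. The two user vectors below are hypothetical and the helper names are illustrative, not part of SIGMM; the point is only that scaling the modulus of a vector changes the Euclidean distance while leaving the cosine distance essentially unchanged.

```python
import math


def euclidean_distance(u, v):
    """Absolute distance between two behavior vectors in feature space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """One minus the cosine of the angle between u and v (direction only)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical users with the same behavior pattern at ten times the volume.
light_user = [1.0, 2.0, 3.0]
heavy_user = [10.0, 20.0, 30.0]

d_euclid = euclidean_distance(light_user, heavy_user)  # grows with the modulus
d_cosine = cosine_distance(light_user, heavy_user)     # stays at about zero
```

A user who posts ten times as often with the same pattern is therefore far away in Euclidean terms but indistinguishable in cosine terms, which is why a volume-sensitive task such as spammer detection favors the former.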
Fig. 2: PCA for dimension reduction
In order to learn model parameters accurately, we employ the Pearson correlation coefficient and Principal Component Analysis (PCA) to characterize features. The Pearson correlation coefficient is used to obtain the linear correlation coefficient between two variables X and Y. It ranges from +1 to -1, where +1 is a positive linear correlation, 0 is no linear correlation, and -1 is a negative linear correlation. Pearson's correlation coefficient is calculated by Eq. (1):

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y} \tag{1}$$

Cov(X, Y) is the covariance of variables X and Y, and σX and σY are the standard deviations of X and Y, respectively. The correlation coefficient r between two n-dimensional vectors can be calculated by Eq. (2):

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{2}$$

(a) Normal users' basic features after PCA.

Table I shows the correlation coefficients between each feature and the label. We exclude the features whose correlation coefficients have absolute values below the threshold of 0.2: account_age and fans/follow in group 1, @number and forward_number in group 2, and neighborhood's posts and neighborhood's fans in group 3.
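The selection step can be sketched as follows. The function names are illustrative. The coefficients are the rounded values reported in Table I; because neighborhood's posts is listed as 0.20 yet is excluded in the text, this sketch assumes a strict comparison (`abs(r) > threshold`) on the rounded values, which may differ from how the unrounded coefficients were actually compared.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences, following Eq. (2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (math.sqrt(var_x) * math.sqrt(var_y))


def select_features(correlations, threshold=0.2):
    """Keep features whose |r| with the label exceeds the threshold."""
    return [name for name, r in correlations.items() if abs(r) > threshold]


# Rounded feature-label coefficients as reported in Table I.
table_1 = {
    "fans": -0.72, "follow": 0.57, "post": 0.37, "fre_follow": 0.58,
    "accountage": 0.11, "fans/follow": 0.14,
    "fre_post": 0.68, "similarity": 0.91, "url_portion": 0.88,
    "@number": 0.15, "forwardnumber": 0.08,
    "friend_portion": -0.84, "neighborhood's posts": 0.20,
    "neighborhood's fans": 0.11,
}
kept = select_features(table_1)

# Toy check of Eq. (2): a perfectly linear feature/label pair (made-up data).
r_toy = pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

Under these assumptions `kept` holds four basic features, three content features, and one network feature, matching the exclusions listed above.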

C. Feature grouping

By utilizing standardization and the Pearson correlation coefficient, the features remain simple. The multidimensional feature is divided into three parts to indicate three user perspectives: basic features, content features, and network features.
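Since PCA is employed to characterize features, a group of four retained features can be compressed to a single score per user. The sketch below is one minimal way to do this, assuming a power-iteration estimate of the first principal component; the sample values and function names are hypothetical, and a production implementation would normally use a linear algebra library instead.

```python
import math


def first_principal_component(samples, iterations=200):
    """Return (mean, unit direction of maximum variance) for a list of vectors."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(d)]
    centered = [[s[j] - mean[j] for j in range(d)] for s in samples]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the eigenvector of the largest eigenvalue,
    # provided the start vector is not orthogonal to it.
    w = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * w[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
    return mean, w


def project(sample, mean, w):
    """One-dimensional PCA score of a single sample."""
    return sum((x - m) * wi for x, m, wi in zip(sample, mean, w))


# Hypothetical users described by four features each.
users = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [3.0, 6.0, 9.0, 12.0]]
mean, w = first_principal_component(users)
scores = [project(u, mean, w) for u in users]
```

Because the toy users lie on one line, the recovered direction is proportional to (1, 2, 3, 4) and the scores are symmetric around zero.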
TABLE I: Pearson correlation coefficients.
F1      fans    follow   post   fre_follow   accountage   fans/follow
label   -0.72   0.57     0.37   0.58         0.11         0.14

F2      fre_post   similarity   url_portion   @number   forwardnumber
label   0.68       0.91         0.88          0.15      0.08

(b) Spammers' basic features after PCA.
Fig. 3: Basic features after PCA.

• The basic features include the number of fans, following users, posts, and the frequency of following. Previous studies show that spammers tend to follow a large number of users, and their fans are rare. The proportion of fans to following is particularly low. These characteristics reflect whether the user is normal or not. Spammers' frequency


in following others is much higher than that of normal users.
• The content features mainly reflect the characteristics of the information sent by a user in the most recent three months. The user's activity can be analyzed by the content features.
• The network feature mainly describes user characteristics in an industrial mobile network. The number and proportion of following each other represent the degree of intimacy between users. Spammers usually follow a large number of normal users to attack. Therefore, their proportion of following each other is lower than that of normal users.

We perform data visualization to evaluate the characteristics of users. PCA is utilized for data compression [21] and to find the directions of maximum variance in high-dimensional data and project these onto a new subspace with fewer dimensions. The principal components of the new subspace can be interpreted as the directions of maximum variance given the constraint that the new feature axes are orthogonal to each other, as illustrated in Fig. 2. Here, x1 and x2 are the original feature axes, and PC1 and PC2 are the principal components. For the basic features, PCA is utilized for dimensionality reduction. We construct a d×4 dimensional transformation matrix W to map a sample of four-dimensional features into one-dimensional features. Fig. 3 shows the training dataset after dimensionality reduction. The data of two classes have a peak and a slope, which are approximately subject to the Gaussian distribution with different parameters.

IV. SIGMM MECHANISM

The SIGMM mechanism fits the behavior data of normal users and spammers, where the behavior data of normal users and spammers are a mixed random sample. The SIGMM mechanism learns the parameters of the two distributions (normal users and spammers) to obtain the classification model.

A. Parameters estimation based on Expectation-maximization

The data are approximately subject to the Gaussian distribution. The mean and variance must be estimated for initializing the model. According to the probability density p(x|θ), we independently extract some samples to constitute the training sample set X. Parameter θ represents the mean and variance of the dataset, and is estimated through the sample set X. Consider X = {x1, x2, ..., xn} as a set of extracted samples, where xi represents the i-th user data and n represents the number of samples. Because they are independent, the probability that xi and xj are extracted simultaneously is p(xi|θ) ∗ p(xj|θ). Similarly, the probability that n samples are extracted simultaneously is the product of their respective probabilities, as shown in Eq. (3).

$$L(\theta) = L(x_1, x_2, \ldots, x_n; \theta) = \prod_{i=1}^{n} p(x_i; \theta) \tag{3}$$

L(θ) is called the likelihood function related to the sample set X and parameter θ. θ̂ is the value that maximizes the likelihood function L(θ). When θ = θ̂, the probability of simultaneously obtaining n samples is the largest. θ̂ is called the maximum likelihood estimator, which is defined in Eq. (4).

$$\hat{\theta} = \arg\max_{\theta} L(\theta) \tag{4}$$

For parameter estimation, the goal is to determine a parameter that maximizes the likelihood function. The relationship between parameter θ and likelihood function L(θ) can be expressed as Eq. (5).

$$\begin{cases} L(\theta) = \prod_{i} P(x_i; \theta) \\ \theta = [\mu, \sigma] \end{cases} \tag{5}$$

It can be seen that the likelihood function is a continuous multiplication. We take the logarithm of L(θ) and transform it to a continuous addition format using Eq. (6).

$$\begin{aligned} \ln L(\theta_1) &= \ln \prod_{i=1}^{k} P(x_i; \theta_1) \\ &= \ln\left[\left(\frac{1}{\sqrt{2\pi}\,\sigma_1}\right)^{k} \exp\left(-\frac{\sum_{i=1}^{k}(x_i - \mu_1)^2}{2\sigma_1^2}\right)\right] \\ &= -\left[k \ln\left(\sqrt{2\pi}\,\sigma_1\right) + \frac{\sum_{i=1}^{k}(x_i - \mu_1)^2}{2\sigma_1^2}\right] \end{aligned} \tag{6}$$

To obtain the maximum value of ln L(θ), we calculate its partial derivatives by Eq. (7).

$$\begin{cases} \dfrac{\partial \ln L(\theta_1)}{\partial \mu_1} = \dfrac{\partial}{\partial \mu_1}\left(-k \ln\left(\sqrt{2\pi}\,\sigma_1\right) - \dfrac{\sum_{i=1}^{k}(x_i - \mu_1)^2}{2\sigma_1^2}\right) = 0 \\ \dfrac{\partial \ln L(\theta_1)}{\partial \sigma_1} = \dfrac{\partial}{\partial \sigma_1}\left(-k \ln\left(\sqrt{2\pi}\,\sigma_1\right) - \dfrac{\sum_{i=1}^{k}(x_i - \mu_1)^2}{2\sigma_1^2}\right) = 0 \end{cases} \tag{7}$$

Eq. (7) can be simplified to Eq. (8).

$$\begin{cases} \dfrac{\sum_{i=1}^{k}(x_i - \mu_1)}{\sigma_1^2} = 0 \\ -\dfrac{k}{\sigma_1} + \dfrac{\sum_{i=1}^{k}(x_i - \mu_1)^2}{\sigma_1^3} = 0 \end{cases} \tag{8}$$

The solution of Eq. (7) is shown in Eq. (9).

$$\begin{cases} \mu_1 = \dfrac{1}{k}\sum_{i=1}^{k} x_i \\ \sigma_1 = \sqrt{\dfrac{\sum_{i=1}^{k}(x_i - \mu_1)^2}{k}} \end{cases} \tag{9}$$

TABLE II: Variables definition.
Symbol    Description
θ         θ = [µ, σ]; µ is the mean of samples, σ is the variance of samples
L         The labeled dataset
U         The unlabeled dataset
lhs       The likelihood function of spammers' set
lhn       The likelihood function of normal users' set
Models    The model of spammers' distribution
Modeln    The model of normal users' distribution
elln      The ellipsoid that represents Modeln
ells      The ellipsoid that represents Models
dn        The Euclidean distance between one user point and elln
ds        The Euclidean distance between one user point and ells
p1        Probability of normal user
p2        Probability of spammer
λ         One user point
lbl       User's label

Algorithm 1 describes the process of maximizing the likelihood function, after which the model's parameters are calculated from samples. All variables are shown in Table II. The samples


can be scanned by the two loops, with the goal of establishing the maximum likelihood function. We scan all samples of the two classes respectively to establish the two likelihood functions (Lines 2 to 7). Next, we maximize each likelihood function by partial derivatives (Lines 8 and 9). Finally, we obtain the optimal solutions θ1 and θ2. The time consumption of this process mainly involves establishing likelihood functions and calculating likelihood functions. The number of users determines the complexity of Algorithm 1, so its complexity is O(n).

Algorithm 1 Maximum likelihood function.
Input: Labeled dataset L
Output: Parameters θ1, θ2
1: procedure LIKELIHOOD_ESTIMATION(L)
2:   for xi in L1[normal] do
3:     lhn ← lhn ∗ f(xi; θ1)
4:   end for
5:   for xi in L1[spammer] do
6:     lhs ← lhs ∗ f(xi; θ2)
7:   end for
8:   θ1 ← argmax ln(lhn)
9:   θ2 ← argmax ln(lhs)
10:  return θ1, θ2
11: end procedure

B. Analysis of Gaussian Mixture Model in dataset

There is another unknown variable z in the likelihood function, which indicates the class to which xi belongs. Our goal is to find the proper θ and z that maximize the value of L(θ). Eq. (10) defines the likelihood function with the variable z.

$$\ln \prod_{i} p(x_i; \theta) = \sum_{i} \ln \sum_{z_i} p(x_i, z_i; \theta) \tag{10}$$

The parameters can be estimated according to the data with the same Gaussian distribution. The description of each sample is represented by a triple yi = {xi, zi1, zi2}, where xi is the i-th sample, and zi1 and zi2 indicate which Gaussian distribution produces xi. They indicate whether the user is normal or not. For example, if xi is equal to 1.8 and belongs to the Gaussian distribution of normal users, then we can describe the sample as {1.8, 1, 0}. If the values of zi1 and zi2 are known, the parameters can then be estimated by the maximum likelihood algorithm. The process below describes how to calculate zi1 and zi2 based on labeled samples.

(1) Initialize distribution parameter θ and repeat the following steps until θ converges.
(2) Calculate the posterior probability of the variable z according to the initial parameters or the parameters from the last iteration. Qi(zi) stands for the expectation of the implicit variable z: Qi(zi) := P(zi | xi; θ).
(3) Maximize the likelihood function to obtain new parameters:

$$\theta := \arg\max_{\theta} \sum_{i} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \tag{11}$$

C. Implementation on Semi-supervised learning

Semi-supervised learning [22] is suitable for data with few labels. Therefore, how to use a large number of unlabeled samples to improve learning performance has become one of the most important issues in current machine learning research. The semi-supervised learning algorithm takes full advantage of labeled data [23, 24]. It is a training process with an initial model constructed from a small group of labeled data. It continues by predicting unlabeled data first, and then transfers the data into the labeled dataset. The parameters of the model are updated and optimized until the model reaches a stable and optimal state.

According to the judgment standard based on the EM algorithm, the proportion of normal users to spammers is required as prior knowledge. The judgment standard is defined by Eq. (12), where πk is the proportion of class k.

$$r(i, k) = \frac{\pi_k N(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j N(x_i \mid \mu_j, \Sigma_j)} \tag{12}$$

Fig. 4: Accuracy of two methods.

The proportion cannot be estimated due to the large number of unlabeled data. πk is an uncertain variable, therefore the value of r(i, k) cannot be calculated. To solve the problem, two solutions are proposed. First, roughly calculate πk according to the samples to obtain the probability of each class. Second, through data visualization we note that the unlabeled data points are distributed inside and outside two ellipsoids. Each ellipsoid is a distribution. Thus, the determination of class depends on the distance from a data point to the ellipsoid surface. Compared with the first method, the second method is more convenient to calculate. We conducted the following experiment to identify which one performs better. Fig. 4 illustrates the precision of the two methods with a random selection of 1,000 labeled data. The X-axis represents the number of iterations, and the Y-axis is the precision of the two methods.
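The two candidate judgment rules can be sketched side by side. This is an illustration under stated assumptions, not the authors' implementation: the models are hypothetical, the probability rule is shown in one dimension, and the distance to an axis-aligned ellipsoid is measured along the ray from its center, which simplifies the perpendicular distance to the surface.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a one-dimensional Gaussian N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)


def responsibilities(x, weights, params):
    """r(i, k) of Eq. (12) for one sample; params is a list of (mu, sigma)."""
    joint = [w * gaussian_pdf(x, mu, sigma)
             for w, (mu, sigma) in zip(weights, params)]
    total = sum(joint)
    return [j / total for j in joint]


def radial_distance(point, center, radii):
    """Distance from a point to an axis-aligned ellipsoid along the center ray."""
    offset = [p - c for p, c in zip(point, center)]
    scale = math.sqrt(sum((o / r) ** 2 for o, r in zip(offset, radii)))
    length = math.sqrt(sum(o * o for o in offset))
    if scale == 0.0:  # the point sits exactly at the center
        return min(radii)
    return abs(length - length / scale)


# Probability rule: the result depends on the assumed class proportions.
params = [(0.0, 1.0), (4.0, 1.0)]  # hypothetical (mu, sigma): normal, spammer
r_equal = responsibilities(1.0, [0.5, 0.5], params)
r_skewed = responsibilities(1.0, [0.1, 0.9], params)

# Distance rule: no class proportions are needed; the closer ellipsoid wins.
point = (3.0, 0.0)
d_n = radial_distance(point, center=(0.0, 0.0), radii=(2.0, 1.0))
d_s = radial_distance(point, center=(6.0, 0.0), radii=(1.0, 1.0))
label = "normal" if d_n < d_s else "spammer"
```

The same sample receives a noticeably lower normal-user responsibility once the mixing weights are skewed, which illustrates why a rough estimate of πk can hurt the probability rule, whereas the distance rule does not use πk at all.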

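The semi-supervised loop described in this section can be sketched in one dimension. The data, the function names, and the use of the interval [µ − σ, µ + σ] as a one-dimensional stand-in for an ellipsoid are assumptions made for illustration; the fit uses the closed-form maximum likelihood estimates of a Gaussian (sample mean and root mean squared deviation).

```python
import math


def fit_gaussian(values):
    """Closed-form maximum likelihood estimates (mean, standard deviation)."""
    k = len(values)
    mu = sum(values) / k
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / k)
    return mu, sigma


def self_train(labeled, unlabeled):
    """Label each unlabeled score, add it to the training set, and refit.

    labeled: dict mapping class name to a list of one-dimensional scores.
    unlabeled: list of one-dimensional scores.
    """
    labeled = {cls: list(vals) for cls, vals in labeled.items()}
    models = {cls: fit_gaussian(vals) for cls, vals in labeled.items()}
    for x in unlabeled:
        # Distance to the boundary of [mu - sigma, mu + sigma] for each class.
        dist = {cls: abs(abs(x - mu) - sigma)
                for cls, (mu, sigma) in models.items()}
        cls = min(dist, key=dist.get)
        labeled[cls].append(x)
        # Only the class that received the point changes, so refit that one.
        models[cls] = fit_gaussian(labeled[cls])
    return models, labeled


# Hypothetical one-dimensional scores, e.g. after dimensionality reduction.
seed = {"normal": [-1.0, 0.0, 1.0], "spammer": [4.5, 5.0, 5.5]}
models, grown = self_train(seed, [0.5, 4.8, -0.7, 5.2])
```

Each newly labeled point enlarges the training set, so the two models drift gradually toward the parameters of the full dataset rather than staying tied to the small labeled seed.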

It can be observed from Fig. 4 that the pink line, which represents the distance method, fluctuates because of the unstable feature at the beginning of the iteration. Although the probability method is slightly higher than the distance method, it exhibits little change in the later iterations and is lower than the distance method at the end of the iterations. When the probability r(i, k) is calculated, the roughly estimated πk does not represent the proportion of each distribution in the whole dataset owing to the large amount of unlabeled data, which decreases precision.

We choose the distance solution, which calculates the distance between data points and the two ellipsoids, as our improved judgment standard. This prediction process is detailed in Algorithm 2.

Algorithm 2 Predicting process for unlabeled data
Input: Modeln, Models, User point λ
Output: prediction result
1: procedure GET_PROBABILITY(model, λ)
2:   elln ← ellipsoid(Modeln.µ, Modeln.σ)
3:   ells ← ellipsoid(Models.µ, Models.σ)
4:   Define two perpendicular lines to the two ellipsoids from the λ point. A, B are the intersection points
5:   Construct the tangent planes at the two points A, B
6:   F′A(xA) ∗ (x − xA) + F′A(yA) ∗ (y − yA) + F′A(zA) ∗ (z − zA) ← 0
7:   F′B(xB) ∗ (x − xB) + F′B(yB) ∗ (y − yB) + F′B(zB) ∗ (z − zB) ← 0
8:   F′A(A) ← λ − A, F′B(B) ← λ − B
9:   dn ← linalg.norm(λ − A)
10:  ds ← linalg.norm(λ − B)
11:  d ← min(dn, ds)
12:  return (1 − d/(dn + ds))
13: end procedure

Algorithm 2 describes the process of prediction for unlabeled data. It uses a trained model as its input and returns judgment results. First, we use a model's parameters to construct two ellipsoids (Lines 2 and 3). Then we define two perpendicular lines to the two ellipsoids from the λ point, and construct tangent planes on the ellipsoids (Lines 4 and 5). Lines 6 to 8 describe the relationships of points and tangent planes. Next, we calculate the Euclidean distance between each point and ellipsoid with the linalg.norm() function in the SciPy package (Lines 9 and 10). Next, we choose the closer distance to the model as the classification result (Line 11). Finally, the prediction result is returned. The complexity of Algorithm 2 is O(1).

If a node has been classified as one of the two classes, it is removed from the unlabeled dataset and joins the training set. The node that joined most recently might cause the model to adjust parameters. Through the entire iteration process, the parameters of the two models are gradually adjusted to the optimal result. The process of training and prediction is shown in Fig. 5.

Fig. 5: Semi-supervised process. [Diagram: labeled nodes are randomly split into Part 1, which yields the initial parameters of the normal user's model and the spammer's model, and Part 2, which is used for testing; the models predict unlabeled nodes, and newly labeled nodes are added until the model is stable.]

Algorithm 3 Training and predicting process of SIGMM.
Input: Labeled dataset L, Unlabeled dataset U
Output: Parameters of model
1: procedure GET_PARAMS(L, U)
2:   L1, L2 ← L.randomlySplit()
3:   θ1 ← Call Algorithm 1 with L1[normal]
4:     as the parameter
5:   θ2 ← Call Algorithm 1 with L1[spammer]
6:     as the parameter
7:   µ1 ← θ1.µ, σ1 ← θ1.σ
8:   µ2 ← θ2.µ, σ2 ← θ2.σ
9:   Modeln ← (µ1, σ1), Models ← (µ2, σ2)
10:  for λ in U do
11:    p1 ← Call Algorithm 2 with Modeln, λ
12:      as parameters
13:    p2 ← Call Algorithm 2 with Models, λ
14:      as parameters
15:    p ← max(p1, p2)
16:    M ← p response to Model {Modeln, Models}
17:    lbl ← M response to label {normal, spammer}
18:    label λ lbl
19:    add λ to L1
20:    θ′1 ← Call Algorithm 1 with L1[normal]
21:      as the parameter
22:    θ′2 ← Call Algorithm 1 with L1[spammer]
23:      as the parameter
24:    µ′1 ← θ′1.µ, σ′1 ← θ′1.σ
25:    µ′2 ← θ′2.µ, σ′2 ← θ′2.σ
26:    Modeln ← (µ′1, σ′1), Models ← (µ′2, σ′2)
27:  end for
28:  return Modeln.µ, Modeln.σ, Models.µ, Models.σ
29: end procedure

Algorithm 3 describes the entire process of training and predicting. It uses the majority of the dataset as its input and then returns the model's parameters as well as prediction results. First, the labeled dataset is randomly divided into two


parts, L1 and L2. These parts have no intersection with each other (Line 2). One is the training set for generating initial parameters, while the other is the test set. Algorithm 1 is called to construct an initial model (Lines 3 to 9). By calling Algorithm 2, unlabeled data in the remaining dataset are labeled (Lines 10 to 18). Line 19 is an important step in semi-supervised training. It adds the recently labeled data to the training set and updates the model using Algorithm 1 (Lines 20 to 26). Thus, the complexity of Algorithm 3 is O(n).

(a) Position of center’s change

(b) Radius’ change

Fig. 6: Change of center and radius.

The above training process avoids relying solely on the labeled dataset. When new data are added to the training set, the parameters are adjusted at each iteration. The final parameters gradually converge to a fixed range. Take one group of features as an example, for which the changing trend of the parameters can be observed in Fig. 6. The mean of this one-dimensional feature is a component of the ellipsoid's center coordinates. The variance is the radius of the ellipsoid. The figure shows the variation of the parameters over 100 iterations.
It can be seen from Fig. 6 that both the center coordinate and the radius change significantly at the beginning of the iterations. With new data joining, the change gradually decreases and eventually converges to a fixed range.

D. Classification of SIGMM Model
A Gaussian mixture model is a probabilistic model for statistical learning [25]. Through the estimation of the probability density distribution of samples, each Gaussian model represents a class. By matching samples with several Gaussian models to obtain probabilities, the class with the largest probability is chosen as the classification result.
With continuous iteration, the ellipsoid’s center and radius change slowly, as shown in Fig. 7(a), and converge to a stable position. The following conclusions can be obtained from the figure. (1) The data of the two classes are clearly separated. (2) The radius of the green ellipsoid, which represents normal users, is larger than that of the red one because the number of normal users is large. Their behavior data are not similar to each other and deviate from the center. The radius of the red ellipsoid is smaller than that of the green one because spammer behaviors for attacking others are similar. (3) The radius and location of the two ellipsoids vary only slightly. Therefore, the model is stable.

(a) Iterating ellipsoid

(b) Stable model

Fig. 7: Stable model after iterating

V. SIMULATION RESULTS
In this section, we perform simulations using SIGMM. The stable parameters are confirmed through simulation experiments. Taking the accuracy, recall, and precision as metrics, we compare SIGMM with RMA and HFCM. In addition, the performances of the three schemes are compared in terms of recall, which represents the percentage of real spammers identified.
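To make concrete the kind of procedure being simulated, the training-and-prediction loop described for Algorithm 3 (build an initial model from labeled data, label each unlabeled node by its distance to the two ellipsoids, then let the node join the training set and update the model) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each class is modeled as an axis-aligned Gaussian (per-feature mean and variance), the "distance to the ellipsoid" is taken to be a variance-normalized squared distance, and the names `fit_gaussian`, `ellipsoid_distance`, and `self_train` are hypothetical.

```python
def fit_gaussian(points):
    """Per-feature mean and variance of a list of equal-length feature vectors."""
    n = len(points)
    dim = len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    # A small floor keeps every variance strictly positive (assumption for safety).
    var = [max(sum((p[j] - mean[j]) ** 2 for p in points) / n, 1e-9)
           for j in range(dim)]
    return mean, var


def ellipsoid_distance(x, model):
    """Variance-normalized squared distance from x to the ellipsoid center."""
    mean, var = model
    return sum((xi - m) ** 2 / v for xi, m, v in zip(x, mean, var))


def self_train(normal, spammer, unlabeled):
    """Label each unlabeled point, add it to the training set, refit its class."""
    train = {"normal": list(normal), "spammer": list(spammer)}
    models = {name: fit_gaussian(points) for name, points in train.items()}
    labels = []
    for x in unlabeled:
        # Prediction: the class whose ellipsoid is closest wins.
        label = min(models, key=lambda name: ellipsoid_distance(x, models[name]))
        labels.append(label)
        train[label].append(x)                      # the node joins the training set
        models[label] = fit_gaussian(train[label])  # only that class is re-estimated
    return models, labels


if __name__ == "__main__":
    normal = [(0.0, 0.0), (2.0, 1.0), (-2.0, -1.0), (1.0, -2.0), (-1.0, 2.0)]
    spammer = [(10.0, 10.0), (10.5, 9.5), (9.5, 10.5), (10.2, 10.1), (9.8, 9.9)]
    unlabeled = [(0.5, 0.5), (10.1, 9.9), (-1.5, 1.0), (9.9, 10.2)]
    models, labels = self_train(normal, spammer, unlabeled)
    print(labels)
```

For readability the sketch refits a class from scratch after every new label, which costs time proportional to the class size. Reaching a constant cost per update would require maintaining the mean and variance incrementally.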


A. Simulation Setup

The number of iterations is set to 100. In the comparison experiment, the labeled data are divided into two parts: one part generates the initial values, and the other is the test set. All the data are preprocessed identically.
Precision, accuracy, and recall are the main basis for judging performance. SIGMM focuses on identifying spammers from normal users, and the accuracy represents the proportion of all the spammers predicted by the model. Recall represents the proportion of real spammers that have been correctly identified. Precision is the proportion of correct results in the overall predictions. We first compare the three solutions from these three aspects, then replace the semi-supervised training process with supervised training, and compare the two schemes in terms of the three measures.
It can be seen from Fig. 8(a) that, for all three algorithms, recall is higher than the other two metrics, as the behavior of the spammers is similar and data dispersion is low. From Table III we can see that semi-supervised training performs better than supervised training because the small amount of labeled data does not represent the characteristics of all data well.

(a) Comparison of the three metrics of the three schemes

TABLE III: Training process comparison.


Training          Accuracy   Recall   Precision
Supervised        0.564      0.739    0.625
Semi-supervised   0.758      0.862    0.812
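The gap between the two rows of Table III can be quantified directly. The short sketch below only re-uses the values printed in the table; the helper name `gains` is hypothetical and is introduced purely for illustration. It returns, for each metric, the absolute and the relative improvement of semi-supervised over supervised training.

```python
# Values copied from Table III.
TABLE_III = {
    "supervised": {"accuracy": 0.564, "recall": 0.739, "precision": 0.625},
    "semi-supervised": {"accuracy": 0.758, "recall": 0.862, "precision": 0.812},
}


def gains(baseline, improved):
    """Absolute and relative change of each metric, rounded to 3 decimals."""
    result = {}
    for name, base_value in baseline.items():
        diff = improved[name] - base_value
        result[name] = (round(diff, 3), round(diff / base_value, 3))
    return result


if __name__ == "__main__":
    for name, (absolute, relative) in gains(
        TABLE_III["supervised"], TABLE_III["semi-supervised"]
    ).items():
        print(f"{name}: +{absolute:.3f} absolute, +{relative:.1%} relative")
```

On these figures the absolute gains are 0.194 (accuracy), 0.123 (recall), and 0.187 (precision).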

(b) Comparison of recall in the three schemes

(c) Runtime of the three schemes

Fig. 8: Performance of SIGMM.

B. Time complexity

From Fig. 8(b) we can see that the recall of the SIGMM model is slightly higher than that of the other two schemes. Moreover, HFCM and RMA are influenced by the initial values of the algorithms, so their two polylines fluctuate to a certain degree.
When the size of the data is very large, the complexity of an algorithm is also an important metric for judging its performance. Thus, the three schemes are compared in terms of time complexity. The SIGMM model is established by the process of prediction and updating. The number of user samples is n, and the labeled training set contains approximately 0.4n samples. The labeled set is randomly divided into two parts, each of approximately 0.2n samples.
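For the updating step to stay cheap as samples keep joining the training set, the per-class mean and variance would have to be maintained incrementally rather than recomputed over the whole set. The paper does not state how the update is implemented, so the following is only one possible scheme, assumed here for illustration: Welford's running update, applied to a single feature. The class name `RunningGaussian` is hypothetical.

```python
class RunningGaussian:
    """Running mean and population variance of one feature (Welford's method)."""

    def __init__(self):
        self.n = 0        # number of samples seen so far
        self.mean = 0.0   # current mean
        self.m2 = 0.0     # sum of squared deviations from the current mean

    def update(self, x):
        """Fold one new sample into the statistics in constant time."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / self.n if self.n else 0.0


if __name__ == "__main__":
    feature = RunningGaussian()
    for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        feature.update(value)
    print(feature.mean, feature.variance)
```

Each call to `update` does a fixed amount of work regardless of how many samples were seen before, so folding in all newly labeled samples costs time linear in their number.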
The complexity of the algorithm is determined by the following factors: (1) It can be concluded from experiments that the initial values have a significant influence on the accuracy of the model. We iterate k times to select the best model and calculate the initial value, which costs O(1). The complexity of testing is O(n). The total complexity is then approximately O(n). (2) The calculation of each sample’s distance to the two distributions is approximately O(1). Adjusting the parameters for a total of 0.6n data points gives a complexity of O(n). So theoretically, the overall complexity is approximately O(n).
The algorithm complexity of HFCM comes from the establishment of the fuzzy matrix and the updating of membership. For n samples, d-dimensional features, k iterations, and C clusters, the complexity is O(n2k).
Similarly, the complexity of RMA comes from computing the mean of all dimensions, the cluster centers, and the samples in each cluster. The complexity is then approximately O(n).
Fig. 8(c) shows a comparison of the time complexity for the three schemes with a dataset of 1,000 users.


The X-axis represents the amount of data in Fig. 8(c), and the Y-axis represents running time. With the increase of the data, the difference in running times among the three algorithms gradually increases, and the SIGMM model is significantly better than the other two for larger datasets in terms of running time.

VI. CONCLUSION
In order to solve the malicious attack problem in industrial mobile networks and reduce the computational complexity of using large cloud server datasets, this paper proposes SIGMM, a spammer identification model based on the Gaussian Mixture Model. We extract features related to labels from originally labeled data in a given dataset containing both labeled and unlabeled data, and visualize the data to add labels to the unlabeled data.
According to the characteristics of data presentation, each user data belongs to one distribution. Multidimensional features are divided into three groups, and SIGMM separates the two distributions based on these features. Finally, we performed simulations to evaluate the performance of SIGMM. The results show that even if the relationships among users are not taken into account, it can implement classification.
Our work is based on binary classification, whereas in large networks, the types of users are varied and complex. Our future work will extend the categories of users to multi-classifications such as celebrity, advertiser, hacker, etc.

ACKNOWLEDGMENT

This work is supported by National Natural Science Foundation of China (Grant Nos. 61672131 and 61672379), the National Key Research and Development Program of China (Grant No. 2016YFB1000205) and the State Key Program of National Natural Science Foundation of China (Grant No. 61432002).

REFERENCES
[1] J. Miranda, N. Makitalo, J. Garcia-Alonso, J. Berrocal, T. Mikkonen, C. Canal, and J. M. Murillo, “From the internet of things to the internet of people,” IEEE Internet Computing, vol. 19, no. 2, pp. 40–47, 2015.
[2] T. Qiu, A. Zhao, F. Xia, W. Si, and D. O. Wu, “Rose: Robustness strategy for scale-free wireless sensor networks,” IEEE/ACM Transactions on Networking, vol. 25, no. 5, pp. 2944–2959, 2017.
[3] L. Yao, Q. Z. Sheng, and S. Dustdar, “Web-based management of the internet of things,” IEEE Internet Computing, vol. 19, no. 4, pp. 60–67, 2015.
[4] T. Qiu, R. Qiao, and D. O. Wu, “Eabs: An event-aware backpressure scheduling scheme for emergency internet of things,” IEEE Transactions on Mobile Computing, vol. 17, no. 1, pp. 72–84, 2017.
[5] T. Qiu, K. Zheng, H. Song, M. Han, and B. Kantarci, “A local-optimization emergency scheduling scheme with self-recovery for smart grid,” IEEE Transactions on Industrial Informatics, vol. 13, no. 6, pp. 3195–3205, 2017.
[6] S. Lu, V. H. Nascimento, J. Sun, and Z. Wang, “Sparsity-aware adaptive link combination approach over distributed networks,” Electronics Letters, vol. 50, no. 18, pp. 1285–1287, 2014.
[7] E. Tan, L. Guo, S. Chen, X. Zhang, and Y. Zhao, “Spammer behavior analysis and detection in user generated content on social networks,” in IEEE International Conference on Distributed Computing Systems, May. 16-18, 2012, pp. 305–314.
[8] P. Heymann, G. Koutrika, and H. Garcia-Molina, “Fighting spam on social web sites,” IEEE Internet Computing, vol. 11, no. 6, pp. 36–45, 2007.
[9] M. Al Hasan, V. Chaoji, S. Salem, and M. Zaki, “Link prediction using supervised learning,” in Proc of Sdm Workshop on Link Analysis Counterterrorism and Security, Apr. 26-28, 2006, pp. 798–805.
[10] F. Pedregosa, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, and J. Vanderplas, “Scikit-learn: Machine learning in python,” Journal of Machine Learning Research, vol. 12, no. 10, pp. 2825–2830, 2013.
[11] X. Zhu, Y. Nie, S. Jin, A. Li, and Y. Jia, “Spammer detection on online social networks based on logistic regression,” in International Conference on Web-Age Information Management, Aug. 11-13, 2015, pp. 29–40.
[12] H. Jia, Y. M. Cheung, and J. Liu, “A new distance metric for unsupervised learning of categorical data,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 5, pp. 1065–1079, 2016.
[13] S. D. Xenaki, K. D. Koutroumbas, and A. A. Rontogiannis, “A novel adaptive possibilistic clustering algorithm,” IEEE Transactions on Fuzzy Systems, vol. 24, no. 4, pp. 791–810, 2015.
[14] K. V. Laerhoven, “Combining the self-organizing map and k-means clustering for on-line classification of sensor data,” in International Conference on Artificial Neural Networks, Aug. 21-25, 2001, pp. 464–469.
[15] X. Yang, Y. Wang, D. Wu, and A. Ma, “K-means based clustering on mobile usage for social network analysis purpose,” in International Conference on Advanced Information Management and Service, Nov. 30-Dec. 2, 2010, pp. 223–228.
[16] S. Yang, J. Kim, and M. Chung, “A prediction model based on big data analysis using hybrid fcm clustering,” in International Journal of Internet Technology and Secured Transactions, Dec. 14-16, 2015, pp. 337–339.
[17] M. A. Wiering and H. V. Hasselt, “Ensemble algorithms in reinforcement learning,” IEEE Transactions on Systems Man and Cybernetics Part B Cybernetics, vol. 38, no. 4, pp. 930–936, 2008.
[18] F. Peyravi, V. Derhami, and A. Latif, “Reinforcement learning based search (rls) algorithm in social networks,” in International Symposium on Artificial Intelligence and Signal Processing, Mar. 3-5, 2015, pp. 206–210.
[19] S. Bhagat, G. Cormode, and S. Muthukrishnan, “Node classification in social networks,” Computer Science, vol. 16, no. 3, pp. 115–148, 2011.
[20] T. Kajdanowicz and P. Doskocz, “Label-dependent feature extraction in social networks for node classification,” in International Conference on Social Informatics, Oct. 27-29, 2010, pp. 89–102.
[21] S. Lu and Z. Wang, “Accelerated algorithms for eigenvalue decomposition with application to spectral clustering,” in The 49th Asilomar Conference on Signals, Systems and Computers (Asilomar), Pacific Grove, CA, USA, 2015, pp. 355–359.
[22] T. Chen, C. Huang, E. Chang, and J. Wang, “Automatic accent identification using gaussian mixture models,” in Automatic Speech Recognition and Understanding, 2001. ASRU ’01. IEEE Workshop on, Dec. 9-13, 2001, pp. 343–346.
[23] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko, “Semi-supervised learning with ladder networks,” in Advances in Neural Information Processing Systems, Dec. 7-12, 2015, pp. 3546–3554.
[24] P. K. Mallapragada, R. Jin, A. K. Jain, and Y. Liu, “Semiboost: Boosting for semi-supervised learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 11, pp. 2000–2014, 2009.
[25] S. J. Roberts, D. Husmeier, I. Rezek, and W. Penny, “Bayesian approaches to gaussian mixture modeling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1133–1142, 1998.

Tie Qiu Dr. Tie Qiu (M’12-SM’16) received M.Sc and Ph.D degree in computer science from Dalian University of Technology, in 2005 and 2012, respectively. He is currently Full Professor at School of Computer Science and Technology, Tianjin University. Prior to this position, he held assistant professor (2008) and associate professor (2013) at School of Software, Dalian University of Technology. He was a visiting professor at electrical and computer engineering at Iowa State University in U.S. (Jan. 2014 to Jan. 2015). He serves as an Associate Editor of IEEE Access Journal, Computers and Electrical Engineering (Elsevier journal) and Human-centric Computing and Information Sciences (Springer Journal), an Editorial Board Member of Ad Hoc Networks (Elsevier journal) and International Journal on AdHoc Networking Systems, a Guest Editor of Future Generation Computer Systems (Elsevier journal). He serves as General Chair, PC Chair, Workshop Chair, Publicity Chair, Publication Chair or TPC Member of a number of conferences. He has authored/co-authored 8 books, over 100 scientific papers in international journals and conference proceedings, such as IEEE ToN, TMC, TII, TIP, IEEE Communications, IEEE Systems Journal, IEEE IoT Journal, Computer Networks etc. He has contributed to the development of 4 copyrighted software systems and invented 15 patents. He is a senior member of China Computer Federation (CCF) and a Senior Member of IEEE and ACM.

Heyuan Wang received B.E. degree in mathematics from Changchun University of Science and Technology (CUST), China, in 2015. She is a master student in school of software engineering, Dalian University of Technology (DUT), China. She is a member of the Smart Cyber-Physical Systems Laboratory (SmartCPS Lab). She is an excellent graduate student of CUST and has been awarded several scholarships in academic excellence. Her research interests cover machine learning and internet of things.

Keqiu Li received the bachelors and masters degrees from the Department of Applied Mathematics at the Dalian University of Technology in 1994 and 1997, respectively. He received the Ph.D. degree from the Graduate School of Information Science, Japan Advanced Institute of Science and Technology in 2005. He also has two-year postdoctoral experience in the University of Tokyo, Japan. He is currently a professor with the School of Computer Science and Technology, Tianjin University, China. He has published more than 100 technical papers, such as IEEE TPDS, ACM TOIT, and ACM TOMCCAP. He is an Associate Editor of IEEE TPDS and IEEE TC. He is a senior member of IEEE. His research interests include internet technology, data center networks, cloud computing and wireless networks.

Huansheng Ning received a B.S. degree from Anhui University in 1996 and Ph.D. degree in Beihang University in 2001. Now, he is a professor and vice dean of School of Computer and Communication Engineering, University of Science and Technology Beijing, China. His current research focuses on Internet of Things, cyber-physical modeling. He is the founder of Cyberspace and Cybermatics and Cyberspace International Science and Technology Cooperation Base. He serves as an associate editor of IEEE System Journal and IEEE Internet of Things Journal. He is the Co-Chair of IEEE Systems, Man, and Cybernetics Society Technical Committee on Cybermatics. He has hosted the 2013 World Cybermatics Congress (WCC2013/iThings2013/CPSCom2013/Greencom2013), and the 2015 Smart World Congress (SmartWorld2015/UIC2015/ATC2015/ScalCom2015/CBDCom2015/IoP2015) as the joint executive chair. He gained the IEEE Computer Society Meritorious Service Award in 2013, IEEE Computer Society Golden Core Award in 2014.

Arun Kumar Sangaiah received his Ph.D from VIT University and Master of Engineering from Anna University, in 2007 and 2014, respectively. He is currently Associate Professor at School of Computing Science and Engineering, VIT University, Vellore, India. He was a visiting professor at School of computer engineering at Nanhai Dongruan Information Technology Institute in China (September. 2016-Jan. 2017). He has published more than 130 scientific papers in high standard SCI journals like IEEE-TII, IEEE-Communication Magazine, IEEE systems, IEEE-IoT, IEEE TSC, IEEE ETC and etc. In addition he has authored/edited over 8 books (Elsevier, Springer, Wiley, Taylor and Francis) and 50 journal special issues such as IEEE-Communication Magazine, IEEE-IoT, IEEE consumer electronic magazine etc. His area of interest includes software engineering, computational intelligence, wireless networks, bio-informatics, and embedded systems. Also, he was registered a one Indian patent in the area of Computational Intelligence. Besides, Prof. Sangaiah is responsible for Editorial Board Member/Associate Editor of various international SCI journals.

Baochao Chen received B.E. from Dalian University of Technology, China, in 2016. He is currently master student in School of Software, Dalian University of Technology (DUT), China. He is an excellent graduate student of DUT and has been awarded several scholarships in school work excellence and spiritual civilization. He has participated in science and technology competition many times and achieved good results, for example, "Citi Cup" financial innovation application contest third prize. His research interests cover embedded system, large-scale internet of things, distributed computing and internet of vehicle.
