Classification Multi
ARTICLE INFO

Keywords: Social media; Account classification; Multi-label; ML-KNN algorithm; Heterogeneous network

ABSTRACT

Several research studies have been conducted on multi-label classification algorithms for text and images, but few have been conducted on multi-label classification for users. Moreover, the existing multi-label user classification algorithm does not provide an effective representation of users, and it is difficult to use directly in social media scenarios. By analyzing complex social networks, this paper aims to achieve multi-label classification of users based on research in single-label classification.
Considering the limitations of existing research, this paper proposes a user topic classification method based
on heterogeneous networks as well as a user multi-label classification method based on community detection.
The model is trained using the ML-KNN multi-label classification algorithm. In actual scenarios, the algorithm is
more effective than existing multi-label classification methods when applied to multi-label classification tasks for
social media users.
According to the results of the analysis, the algorithm classifies users of different themes with high accuracy across a variety of scenarios. Furthermore, this study contributes to the advancement of classification research by expanding its perspective.
The classification of social media users is an effective method for analyzing and managing social media. Most existing methods, however, classify users by a single label. Although single-label classification can serve the needs of social media management to some degree, in practice a user is associated with multiple topics of information. Classifying users by only one label therefore often misses their other topical attributes. It is thus of great importance, in both academia and industry, to study the multi-label classification of social media users. Nonetheless, most existing multi-label classification algorithms focus on text, image, and audio processing, and little research has been conducted on multi-label classification for social media users. The purpose of this paper is therefore to classify social media users with multiple labels.

Algorithms for the multi-label classification problem can be categorized into two types according to how they solve the problem: one is based on problem transformation, and the other is based on algorithm application.

The problem transformation method converts the multi-label problem into problems that existing single-label algorithms can solve. Problem transformation methods can be divided into three categories: binary relevance methods, classifier chain methods, and Label PowerSet methods. In the binary relevance method, each label is treated as a separate single-label problem and a binary classifier is trained for it. A sample is then fed to every binary classifier, and the labels whose classifiers predict the positive class form the label set of the sample to be predicted (Montañes et al., 2014; Yuan et al., 2018; Hadi and Kusprasapta, 2021). The classifier chain method trains as many classifiers as there are labels. The first classifier is trained on the input data only, the second classifier is trained on the input data together with the prediction output of the first, and so on,
* Corresponding author.
E-mail address: [email protected] (M. Guo).
https://doi.org/10.1016/j.techfore.2022.122271
Received 9 May 2022; Received in revised form 26 August 2022; Accepted 29 August 2022
Available online 30 December 2022
0040-1625/© 2023 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
A. Huang et al. Technological Forecasting & Social Change 188 (2023) 122271
leading to the creation of multiple classifiers during training that together generate multiple labels (Lughofer, 2022; Nitin and Pramod, 2022; Douglas et al., 2022). In the Label PowerSet method, each multi-label combination in the training set is converted into a unique single label, so that a multi-topic classification model can be trained in the single-label scenario using a traditional machine learning algorithm or deep learning algorithm (L. and Rokach, 2010; Abdullahi et al., 2021; Yap and Raymer, 2021).

In the algorithm application method, a particular algorithm is extended to handle the multi-label problem directly rather than transforming the problem into multiple binary subproblems. Several classical multi-label classification algorithms are adapted from traditional machine learning algorithms, such as ML-KNN (M. and Zhou, 2007; Zhu et al., 2021; Agarwa et al., 2021), Rank-SVM (Jiarong et al., 2014; Kassim et al., 2021; Guoqiang et al., 2022), and ML-DT (Victor et al., 2022; Helena et al., 2021). The traditional KNN, SVM, and decision tree algorithms are only applicable to single-label classification problems, but their improved versions can be applied directly to multi-label classification problems. In recent years, several novel multi-label classification methods have been developed using deep learning models, owing to the improvement of hardware computing capabilities. Because of the way deep learning models are constructed, multi-label classification algorithms based on deep learning can take into account how labels are related and learn a finer-grained representation of samples at the same time.

1.2. Multi-label classification methods

Most existing multi-label user classification algorithms construct user interaction networks in order to propagate labels and extract user characteristics.

Some scholars proposed an edge-centered clustering method based on user relationship networks to extract the social dimension features of each user node, and a supervised multi-label user classification algorithm (Edgecluster) has been developed on top of the SVM algorithm (Tang and L., 2009; Guoqiang et al., 2020; Zhongwei et al., 2020). A multi-label relational neighborhood classifier (SCRN) using social context features has also been proposed. It propagates node-class labels iteratively using the idea of the RL algorithm and, assuming that the number of labels per node is fixed, selects the k categories with the highest probability as the final labels of each node (Wang and Sukthankar, 2013). As another option, the overlapping community division method (MORC) uses layer clustering to characterize each user, after which the model is trained (Lu et al., 2018; Palla and Derényi, 2005; Gregory, 2007). The graph embedding algorithm HCGE takes advantage of Gaussian distributions for heterogeneous networks and optimizes two objective functions in order to train a model that embeds each node in the network into a latent space and a classification model (Santos et al., 2016). A concise summary of all algorithms and methods is presented in Table 1.

Table 1
The existing multi-label classification algorithms and methods.

Multi-label classification | Main ideas and methods
Algorithms | Problem conversion method: It converts the problem into a single label algorithm. The main methods include the binary correlation method, the classifier chain method, and the Label PowerSet method.
Algorithms | Algorithm application method: Specifically, it extends a specific algorithm to handle multi-label problems. The main methods include ML-KNN, Rank-SVM, and ML-DT.
Methods | Currently, multi-label user classification algorithms are mainly used to construct user interaction networks for label propagation and feature extraction. The main methods include edge-centered clustering, a multi-tag relational neighborhood classifier (SCRN), an overlapping community division method (MORC), and a graph embedding algorithm (HCGE).

To summarize, the existing multi-label user classification algorithms primarily rely on network structure to characterize users. Building a network can effectively describe the multiple relationships between users, which makes the extracted features effective in describing the differences and similarities between them. Describing users with multiple labels is therefore necessary. The current methods are primarily affected by two problems when describing users: high time and space complexity, and features that are too simple. For model training, existing methods rely on SVMs and similar algorithms to train multi-label classification models. To address these shortcomings, this paper examines the selection of user representations and training models and proposes an algorithm that can effectively solve the problem of multi-label classification of social media users.

Given the inadequacy of existing research, this paper proposes a supervised multi-label user classification algorithm based on community detection. Firstly, the heterogeneous information network is constructed under the multi-label scenario. As users with similar beliefs and interests tend to interact, this study divides the user nodes in the network into communities using an overlapping community detection algorithm and uses the results of the community division to represent the community characteristics of each user. Secondly, both textual and behavioral information are used to describe user topic information. This paper extracts users' multi-label relationship features in linear time in order to compensate for the insufficient representation strength of community features. Finally, the ML-KNN multi-label classification algorithm is used to train the user multi-label classification model. Compared with other multi-label classification algorithms, this method is more efficient, training the model in O(n) time. A comparison of the proposed method with existing multi-label user classification algorithms on actual data sets shows that the proposed method classifies social media users more effectively.

2. Problem description

2.1. The problem

Classifying social media users is intended to screen out abnormal users or to identify the set of users relevant to a particular topic from a large number of social media users. Generally, each social media user belongs to multiple themes. The goal of multi-label classification of social media users is to assign multiple theme labels to each user at the same time.

Consider a classification task in which the number of instances is N and the number of labels is q. The sample set can be expressed as $V_I = (I_1, I_2, \ldots, I_N)$, every instance has a feature vector in the input space, and the set of all labels can be expressed as $V_t = (l_1, l_2, \ldots, l_q)$.

Now, it is assumed that the first $n_L$ instances in the instance set are labeled multi-label samples, expressed as $V_I^L = (I_1, I_2, \ldots, I_{n_L})$, in which each instance has a multi-label set $Y_i = (Y_i^1, Y_i^2, \ldots, Y_i^q)^T \in \{0, 1\}^q$. If $I_i$ belongs to the jth label in the theme set $V_t$, then $Y_i^j = 1$. Otherwise, $Y_i^j = 0$. For the remaining unlabeled samples $V_I^u = V_I - V_I^L$, the task of the multi-label classification algorithm is to make multi-label predictions.

The main difficulty of multi-label classification is that the number of possible output label sets grows exponentially with the number of labels q. To solve this problem, most multi-label classification algorithms convert the multi-label classification task into multiple binary classification tasks. In this scheme, however, the binary tasks are independent of each other, which disregards the correlation between labels. It
is therefore necessary to implement the classifier chain method. However, the results of this method depend on the order of the labels.

2.2. The whole process

A general description of how to solve the problem of multi-label classification of social media users can be found in Fig. 1. Firstly, heterogeneous networks are constructed in multi-label scenarios. In contrast to the binary classification problem, these networks contain users and entities of different topics. Secondly, multi-label relationship features and community features are extracted from the heterogeneous network. The multi-label relationship features comprise the user relationship features and the user entity relationship features in the multi-label scenario, and the community features are obtained by applying an overlapping community detection algorithm to the heterogeneous network to divide user nodes into overlapping communities. Thirdly, each user's community characteristics are represented using the results of the community division. Lastly, the ML-KNN multi-label classification algorithm is used to train the model and predict the unlabeled users.

Because each user belongs to multiple topic labels, it is difficult to distinguish the topic information of different users based solely on attribute information, the number of neighbor nodes in a network of users, or PageRank values. Although Word2vec and LDA topic models are capable of directly representing user themes for identifying different users, such features are limited to extracting user theme information at the text level and cannot analyze the possibility of a user belonging to multiple themes from the perspective of user behavior. In addition, this type of algorithm has high time complexity when processing text.

Multi-label relationship features and community features are extracted from heterogeneous networks. The multi-label relationship features effectively represent the relationship between users of different topics in the network as well as the relationship between users and entities of different topics. The community features are represented by the result of the overlapping community division.

The fact that a user belongs to more than one community indicates that the user has a high probability of belonging to more than one topic. In contrast to the LDA and Word2vec models, these features consider the text and behavioral information of the user simultaneously, which enables them to describe whether the user belongs to several topics at the same time. Additionally, considering that supervised learning has more reliable and stable classification performance, the supervised ML-KNN algorithm is selected in this paper to train the user multi-label classification model (M. and Zhou, 2007). In essence, each sample is labeled by determining its nearest neighbors in the training set and then obtaining statistical information from their label sets, namely the number of neighbors belonging to each possible category. The labels of the sample to be labeled are then determined according to the maximum a posteriori principle.

For a given sample x, y represents its binary label vector: y(l) = 1 signifies that sample x belongs to the lth label, and otherwise y(l) = 0 (where l ∈ Y, and Y represents the set of all labels). Let N(x) denote the K samples in the training set closest to x, and let $C_x(l)$ count how many of these K neighbors carry label l:

$$C_x(l) = \sum_{a \in N(x)} y_a(l), \quad l \in Y \tag{1}$$

For every sample t to be tested, the ML-KNN algorithm first determines N(t) and $C_t(l)$. Let $H_1^l$ and $H_0^l$ denote the events that sample t does and does not belong to label l, respectively. Meanwhile, let $E_j^l$ $(j \in \{0, 1, \ldots, K\})$ denote the event that exactly j of the K neighbors belong to label l. The label of sample t is then obtained from $C_t(l)$ and $E_j^l$:

$$y_t(l) = \arg\max_{b \in \{0,1\}} \frac{P(H_b^l)\, P(E_{C_t(l)}^l \mid H_b^l)}{P(E_{C_t(l)}^l)} = \arg\max_{b \in \{0,1\}} P(H_b^l)\, P(E_{C_t(l)}^l \mid H_b^l) \tag{2}$$

Algorithm 1.
Input: Initializing the node
Output: Random sequence array P(H)

The smoothing factor s is used in calculating the prior probability $P(H_b^l)$.
(1) For $l \in Y$ do
(2)     $P(H_1^l) = \left(s + \sum_{i=1}^{m} y_{x_i}(l)\right) / (s \times 2 + m)$; $P(H_0^l) = 1 - P(H_1^l)$

The posterior probability $P(E_{C_t(l)}^l \mid H_b^l)$ is calculated as follows.
(3) For $l \in Y$ do
(4)     For $j \in \{0, 1, \ldots, K\}$ do
(5)         $c[j] = 0$; $c'[j] = 0$
(6)     For $i \in \{1, \ldots, m\}$ do
(7)         $\delta = C_{x_i}(l) = \sum_{a \in N(x_i)} y_a(l)$
(8)         If $y_{x_i}(l) == 1$, then $c[\delta] = c[\delta] + 1$
(9)         Otherwise, $c'[\delta] = c'[\delta] + 1$
(10)    For $j \in \{0, 1, \ldots, K\}$ do
(11)        $P(E_j^l \mid H_1^l) = (s + c[j]) / \left(s \times (K + 1) + \sum_{p=0}^{K} c[p]\right)$
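The steps of Algorithm 1, together with the decision rule in Eq. (2), can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `mlknn_fit` and `mlknn_predict`, the brute-force Euclidean neighbor search, and the default values of `k` and `s` are all hypothetical choices, and the estimate of $P(E_j^l \mid H_0^l)$ simply mirrors step (11) using the counts c'[j].

```python
import numpy as np


def _neighbor_label_counts(X_query, X_train, Y_train, k, exclude_self=False):
    """C_x(l) from Eq. (1): per query row, how many of its k nearest
    training neighbors carry each label."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    d = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    if exclude_self:
        # A training sample must not count as its own neighbor.
        np.fill_diagonal(d, np.inf)
    idx = np.argsort(d, axis=1, kind="stable")[:, :k]
    return Y_train[idx].sum(axis=1)  # shape (n_query, q)


def mlknn_fit(X, Y, k=3, s=1.0):
    """Estimate the priors (steps 1-2) and the posteriors (steps 3-11)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=int)
    m, q = Y.shape
    # Step (2): smoothed prior P(H_1^l) for every label l.
    prior1 = (s + Y.sum(axis=0)) / (2 * s + m)
    counts = _neighbor_label_counts(X, X, Y, k, exclude_self=True)
    c1 = np.zeros((q, k + 1))  # c[j]:  samples WITH label l and j such neighbors
    c0 = np.zeros((q, k + 1))  # c'[j]: samples WITHOUT label l and j such neighbors
    for l in range(q):
        for i in range(m):
            delta = counts[i, l]
            if Y[i, l] == 1:
                c1[l, delta] += 1
            else:
                c0[l, delta] += 1
    # Step (11) for H_1, and the analogous estimate for H_0 (assumption).
    post1 = (s + c1) / (s * (k + 1) + c1.sum(axis=1, keepdims=True))
    post0 = (s + c0) / (s * (k + 1) + c0.sum(axis=1, keepdims=True))
    return {"X": X, "Y": Y, "k": k,
            "prior1": prior1, "post1": post1, "post0": post0}


def mlknn_predict(model, X_test):
    """Eq. (2): choose b in {0, 1} maximizing P(H_b^l) * P(E_{C_t(l)}^l | H_b^l)."""
    X_test = np.asarray(X_test, dtype=float)
    counts = _neighbor_label_counts(X_test, model["X"], model["Y"], model["k"])
    labels = np.arange(model["Y"].shape[1])
    p1 = model["prior1"] * model["post1"][labels, counts]
    p0 = (1.0 - model["prior1"]) * model["post0"][labels, counts]
    return (p1 > p0).astype(int)


if __name__ == "__main__":
    # Toy data: two well-separated groups, each carrying one of two labels.
    X_train = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                        [5, 5], [5, 6], [6, 5], [6, 6]], dtype=float)
    Y_train = np.array([[1, 0]] * 4 + [[0, 1]] * 4)
    model = mlknn_fit(X_train, Y_train, k=3, s=1.0)
    print(mlknn_predict(model, [[0.2, 0.3], [5.8, 5.9]]))
```

The brute-force distance matrix keeps the sketch short but costs O(m²) memory on m training samples; a spatial index or an approximate neighbor search would be the usual replacement for networks of realistic size.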
4. Multi-label user classification based on ML-KNN

4.1. Calculating multi-label correlation coefficients

This paper uses a heterogeneous network to describe user interaction and extracts user features using RS scores in multi-label scenarios. The heterogeneous network must therefore be constructed in the multi-label scenario. As a first step, 60 seed users are manually selected for various topics. The second step is to extract the 300 most recent tweets published by each seed user and then extract all @-mentioned user names from these 300 tweets, which results in a set U. Finally, the seed users are deleted from U, and the remaining users are used as user nodes to build the heterogeneous network.

This paper selects some users and manually labels them for each topic, for use in training the follow-up model and calculating the RS score. The heterogeneous relational network in the multi-label scenario is then constructed. In the multi-label scenario there are user nodes under various topics and entity nodes under various topics. The paper connects user nodes in the network based on three types of user connection relationships: following/being followed, RT forwarding/being forwarded, and @ mentioning/being mentioned, and connects user nodes and entity nodes based on affiliation.

Finally, the RS score of each node in the network is calculated. Rather than a single correlation coefficient, a separate value is calculated for each topic. Each value indicates the degree of interest of a user in the corresponding topic or the possibility that an entity belongs to that topic. The values are expressed as $RS_{Topic_1}, RS_{Topic_2}, \ldots, RS_{Topic_k}$, which are the correlation coefficients of topics 1, 2, …, k, respectively.

4.2. Extracting features of multi-label relationship

In the extraction of multi-label user relationship features, the user relationship features do not consist of a single K-dimension vector, but are composed of many K-dimension vectors, which represent the probability distribution for each topic to which the user belongs. To avoid the problems caused by feature redundancy and sparse connections between users of different topics, when extracting RS scores for a user under each topic, this section takes into account all neighbor nodes under the three user connection relationships: the following relationship, the RT forwarding relationship, and the @ mention relationship.

Fig. 2 illustrates how user relationship features can be extracted in multi-label scenarios. An RT forwarding relationship between the user and U2 indicates that the message sent by the user is forwarded by U2, and an @ mention relationship between the user and U3 and U5 indicates that the user @-mentions U3 and U5 when sending the message. The following relationship between the user and U1 and U4 indicates that the user follows U1 and U4. The figure also depicts the RS scores for U1, U2, U3, U4 and U5.

As a result, the 8-dimensional user relationship characteristics on topic 1 are [0,1,0,1,1,1], the 8-dimensional user relationship characteristics on topic 2 are [0,2,0,1,1,0,1,2,0], and the 8-dimensional user relationship characteristics on topic k are [0,1,1,0,1,2,0].

During multi-label user entity relationship feature extraction, the RS scores of each entity are first calculated for every topic. For each user, the distribution of the RS scores of all of the user's entities is then collected for each of the k topics, giving the user entity relationship features under each topic. The user entity relationship features of each user in the multi-label scenario are obtained by concatenating these per-topic features in series. In multi-label scenarios, the user entity relationship features show how messages posted by users under different topics are spread.

4.3. Extracting the user community

Prior to dividing nodes into communities in a heterogeneous network, it is necessary to calculate the similarity between any two nodes, because only similar nodes can be placed in the same community. To determine user similarity, each user is first represented as a feature vector, and the similarity of any two users is then calculated based on Euclidean distance or cosine similarity. By representing users as topic distributions, the Latent Dirichlet Allocation (LDA) topic model can extract shallow semantic information from texts. Therefore, this paper uses this model to represent user interests.

Consider a set of documents D which contains M documents, where each document d has $N_d$ words and all the words contained in D constitute a dictionary. First of all, we suppose that the number of topics is K and that the topic distribution follows a Dirichlet distribution. Therefore, for any document, the paper uses the Dirichlet distribution as the prior distribution of its topic distribution $\theta_d = \langle p_1, p_2, \ldots, p_K \rangle$, which can be calculated as follows:
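For reference, the usual form of this prior in the standard LDA model is sketched below. The concentration parameter $\alpha = (\alpha_1, \ldots, \alpha_K)$ is an assumption of this sketch: it is the conventional symbol and is not defined in the text above, so the notation may differ from the authors' own.

```latex
\theta_d = \langle p_1, p_2, \ldots, p_K \rangle \sim \mathrm{Dir}(\alpha),
\qquad
p(\theta_d \mid \alpha)
  = \frac{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)}
    \prod_{k=1}^{K} p_k^{\alpha_k - 1},
\qquad
p_k \ge 0, \quad \sum_{k=1}^{K} p_k = 1 .
```

With a symmetric choice $\alpha_1 = \cdots = \alpha_K < 1$, the prior favors documents concentrated on a few topics, which is a common setting for short social media texts.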
$$L' = \left\{ l(c_1, b'_1),\ l(c_2, b'_2),\ \ldots,\ l(c_{|L'|}, b'_{|L'|}) \right\}, \qquad \sum_{c, b'} b'(c, u) = 1 \tag{13}$$

three themes.

During the extraction of users' community features, the overlapping community detection algorithm is used to divide the multi-label heterogeneous network into overlapping communities. The network is divided into 616 communities, so the final community feature is a 616-dimensional one-hot vector. In multi-label scenarios, as shown in Table 2, the final feature vector for each node is obtained by concatenating the community features, user relationship features, and user entity relationship features.

Table 2
Summary of multi-label classification features.

Category | Feature | Dimension

y represents a binary label vector. Suppose, for example, that the sample belongs to both labels 2 and 3 in a classification task with 3 labels; then y = [0, 1, 1]. A supervised multi-label classification model is trained on the labeled training set during the training of the ML-KNN model. For a user $x_{test}$ to be labeled, the input of the ML-KNN model is a 664-dimensional feature vector and the output is $y = [l_1, l_2, l_3]$. If $l_i = 1$, the sample is predicted to belong to label i. Otherwise, the sample is predicted not to belong to label i.

$$Recall_{multi} = \frac{1}{N} \sum_{i=1}^{N} \frac{\| h(x_i) \cap Y_i \|_1}{\| Y_i \|_1} \tag{16}$$

$$Accuracy_{multi} = \frac{1}{N} \sum_{i=1}^{N} \frac{\| h(x_i) \cap Y_i \|_1}{\| h(x_i) \cup Y_i \|_1} \tag{17}$$

$$F1\ score = \frac{1}{N} \sum_{i=1}^{N} \frac{2\, | h(x_i) \cap Y_i |}{| Y_i | + | h(x_i) |} \tag{18}$$

where $h(x_i)$ represents the predicted results and $Y_i$ represents the manual marking results.

5.2. Experimental results

5.2.1. Evaluating the performance of the MLUCHNCD algorithm

Table 3 presents the test results. The proposed multi-label classification algorithm for social media users performs well in a variety of classification tasks. Classification performance is better on data set 1 than on data set 2. This is because, in data set 1, the
contents of each topic overlap, suggesting some interaction between users of each topic, resulting in a denser heterogeneous network that can better represent users. In data set 2, by contrast, the content of each topic is very different and topics rarely overlap, indicating that users of different topics do not have interactive relationships. The resulting heterogeneous network is sparse and cannot represent users effectively.

Table 3
Results of the performance evaluation of multi-label classification.

           | Hamming Loss | Accuracy_multi | Precision_multi | Recall_multi | F1_score
Data set 1 | 0.0357       | 0.911          | 0.879           | 0.862        | 0.868
Data set 2 | 0.048        | 0.911          | 0.890           | 0.803        | 0.829

Tables 4 and 5 present the results for each category on data sets 1 and 2. Some users have no neighbors during feature extraction, which causes the feature extraction to fail. As a result, the numbers of labeled users in data set 1 and data set 2 used in the experiment are 1470 and 1462, respectively. For training and prediction, 70 % of the labeled data is used as the training set, and 30 % of the labeled data is used as the test set to evaluate the model's performance.

Table 4
Matrix of multi-label classification confusion (data set 1).

  | 1   | 2  | 3  | 4   | 5  | 6  | 7     | 8     | 9
1 | 117 | 0  | 9  | 0   | 0  | 0  | 0.911 | 0.929 | 0.908
2 | 0   | 31 | 1  | 1   | 0  | 1  | 0.969 | 0.912 | 0.938
3 | 9   | 0  | 33 | 0   | 0  | 0  | 0.750 | 0.799 | 0.769
4 | 3   | 2  | 0  | 134 | 0  | 1  | 0.950 | 0.968 | 0.949
5 | 1   | 0  | 1  | 0   | 96 | 3  | 1.00  | 0.948 | 0.980
6 | 4   | 0  | 0  | 8   | 0  | 17 | 0.770 | 0.619 | 0.678

Note: 1, 2, 3, 4, 5 and 6 represent the economy, economy-military, economy-political, military, political, and political-military, respectively. 7, 8 and 9 represent Precision, Recall Rate and F1 value, respectively.

Table 5
Matrix of multi-label classification confusion (data set 2).

  | 1   | 2  | 3  | 4   | 5  | 6  | 7     | 8     | 9
1 | 147 | 1  | 0  | 0   | 0  | 0  | 0.879 | 0.991 | 0.929
2 | 6   | 16 | 0  | 0   | 2  | 0  | 0.340 | 0.712 | 0.79
3 | 5   | 0  | 29 | 6   | 0  | 0  | 0.677 | 0.619 | 0.648
4 | 3   | 0  | 2  | 138 | 0  | 0  | 0.931 | 0.842 | 0.891
5 | 2   | 3  | 0  | 0   | 41 | 0  | 1.00  | 0.630 | 0.771
6 | 1   | 3  | 2  | 5   | 1  | 21 | 0.928 | 0.969 | 0.950

Note: 1, 2, 3, 4, 5 and 6 represent economy, economy-military, economy-political, military, political, and political-military, respectively. 7, 8 and 9 represent Precision, Recall Rate, and F1 value, respectively.

Tables 4 and 5 show that classification performance for single-label users is superior to that for multi-label users. This is consistent with practical applications, because for a multi-label user the algorithm must predict every label correctly, whereas for a single-label user it needs to predict only one. It should be noted, though, that in practice, no matter how good the algorithm is, it will always make a wrong prediction for one or more labels in a set with more than one label.

5.2.2. Validation tests of various features

Because the number of labels in the multi-label scenario is large, treating different kinds of neighbor user nodes differently gives the extracted multi-label relationship features a higher dimension and can cause feature redundancy. Therefore, this paper does not distinguish the types of neighbor user nodes when extracting users' multi-label relationship features, so that following/followed, RT forward/forwarded, and @ Mention/Mentioned neighbor nodes are unified into one node type. By treating all types of neighbor users as one, the value brought by different types of neighbor users can still be fully utilized. Experiments are therefore conducted to demonstrate that multi-label relational features constructed from union-type-neighbor nodes perform similarly to multi-label relational features constructed by distinguishing single-type-neighbor nodes.

Fig. 3 shows the performance difference between the two ways of constructing multi-label relational features on data set 1. From the figure, it can be seen that both exhibit strong classification performance on the five evaluation indices. The multi-label relationship features constructed from single-type neighbors perform slightly better. This method will, however, result in feature redundancy and increase computational complexity when the number of multi-label topics is large. Therefore, in contrast to the binary classification scenario, multi-label relationship features can be extracted directly based on union-type-neighbor nodes in the multi-label scenario.

To demonstrate the validity of the community characteristics proposed in this paper, the performance of user representations based on the community feature alone, the relation feature alone, and the hybrid feature is compared.

Fig. 4 illustrates the performance evaluation results for each of the three characteristics. A user representation based on multi-label relational features performs much better in classification than a user representation based on community features. The reason is that the community detection algorithm is unsupervised, and it is difficult to carry out accurate topic division without supervision. In practical applications, the majority of community detection algorithms are used to perform preliminary network community division. In contrast, multi-label relationship features represent the distribution of users' interests across different topics, which is why they are better able to represent the possibility that users have interests in different topics.

Although community features alone perform slightly worse, adding the community detection results to the hybrid features improves the performance of the final algorithm, thereby demonstrating the effectiveness of community detection applied to user multi-label classification.

5.2.3. Performance comparison of different multi-label classification algorithms

To demonstrate the effectiveness of ML-KNN as the multi-label learning algorithm in this paper, the existing multi-label classification algorithms BR (Binary Relevance) and LP (Label Power Set) are also used to train the model, and the performance of the three algorithms is compared.

Fig. 5 presents the performance evaluation results for the ML-KNN multi-label learning algorithm as well as the BR and LP multi-label learning algorithms, where Precision, Recall, and F1 values computed with the Micro Average strategy serve as the performance evaluation indexes. The figure shows that the ML-KNN algorithm has the best performance, followed by the BR algorithm, and the LP algorithm has the worst performance. The LP algorithm converts each label combination into a single label, so that the multi-label problem can be solved as a single-label classification problem. However, this approach has limitations, including the need for a large amount of training data and the fact that it can only predict label combinations that already appear in the training set. Because the BR algorithm uses multiple binary classifiers to classify multi-label problems, it achieves a performance between the two. Since the BR algorithm essentially solves binary classification problems, it requires less training data than the LP algorithm. Additionally, binary classification problems generally have higher classification performance than multi-classification
A. Huang et al. Technological Forecasting & Social Change 188 (2023) 122271
Fig. 6. Evaluation results of the multi-label classification algorithms for different users.
problems, which is also the reason why BR is superior to LP.

5.2.4. Performance comparison of algorithms for different users
To test the effectiveness of the proposed algorithm, this paper compares the performance of the MLUCHNCD algorithm, the MROC algorithm, and the edgecluster algorithm.
The MROC algorithm applies a layer clustering strategy to the constructed topological network in order to divide the nodes into communities. The partition results are used to represent each user, and an SVM algorithm is then applied to train the model and make predictions. The edgecluster algorithm expresses each edge in the network as a one-hot vector based on its nodes, clusters the edges with the k-means algorithm, and then expresses each node as a k-dimensional feature vector (k being the number of k-means clusters) based on the types of the edges connected to that user node. Finally, the SVM algorithm is used to train and predict multi-label models for users.
Fig. 6 illustrates the performance simulation results of the multi-label classification of social media users under different multi-label classification algorithms. As shown in the figure, MLUCHNCD, the multi-label classification algorithm proposed in this paper, performs better than MROC and edgecluster.
The poor performance of the MROC algorithm can be attributed to the fact that the algorithm divides communities based only on user connections, which makes the community division unreliable. In addition, during community integration, the center community is merged with the neighbor community that has the highest Jaccard coefficient. However, this only shows that, compared with the other neighbor communities, that community has the highest similarity with the center community. It remains unknown whether the neighbor community with the highest Jaccard coefficient is actually related to the center community. The edgecluster algorithm first characterizes the edges by their nodes, then clusters the edges, and finally uses the types of the edges connected to a node as the final representation of that node. Compared with a representation based merely on community division, this method of characterizing nodes is more interpretable and provides more useful information. However, because the method clusters edges, it needs powerful hardware, which makes it hard to use on large networks.

6. Conclusion
Considering the complexity of existing multi-label classification algorithms, feature extraction and model training are conducted in linear time in this paper. Because existing multi-label classification algorithms are limited in how they characterize users, the present study extracts user relationship features and user entity relationship features in multi-label scenarios using heterogeneous networks. These features can effectively describe situations in which users belong to different topics. In addition, since users with common beliefs and hobbies tend to contact each other, users with common themes can be divided into one community. To accomplish this, overlapping communities are detected. A user who belongs to more than one community is assigned to multiple communities at the same time. Therefore, the results of overlapping community detection can be used as the community characteristics of users to identify, in the most direct manner, which topics a user belongs to. The method can also effectively describe the topic similarity between two users without a direct connection. A comparison of the proposed user classification method with existing user classification methods shows that the proposed method significantly improves social media user classification.
Although the heterogeneous network-based social media user classification algorithm proposed in this paper improves classification performance to some extent, it also reveals some shortcomings. Future studies can improve on the following aspects:
Firstly, as part of feature extraction based on heterogeneous networks, this paper considers not only the jump correlation between neighbor nodes, but also the heterogeneous network, which contains a great deal of valuable information.
Secondly, users are represented in this paper only by network structure features. In the future, users could be classified based on text information, personal attribute information, and network information.
Thirdly, this paper uses ML-KNN to train the supervised user classification model for the multi-label classification problem. In contrast, deep learning frameworks, such as RNN and LSTM, are capable of representing users more finely while taking label correlation into account. Therefore, multi-label classification utilizing deep learning models may be able to achieve better classification results if future conditions permit.
Because of time constraints, only a few topics were examined in the data set for this study. Future studies can be conducted to determine whether or not the proposed method is effective on additional topics.

CRediT authorship contribution statement
Anzhong Huang and Meiwen Guo contributed equally to this work. Rui Xu and Xiaofang Liu analyzed the data.

Funding information
This study was supported by the “13th Five-Year” plan research project of philosophy and social sciences of Guangdong province (grant number: GD17YGL03) and by the collaborative education project of industry-university cooperation of the Ministry of Education, “Research on the curriculum reform of online marketing based on the integration of online and offline interaction” (grant number: 202102209025).

Declaration of competing interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.

Data availability
No data was used for the research described in the article.
Anzhong Huang. E-mail: [email protected]. Detailed-Address: 666#, Changhui Road, Dant District, Zhenjiang City, China, Postcode 212100. Degree: Ph.D. Affiliation: School of Economics and Management, Jiangsu University of Science and Technology. Expertise: Inclusive finance and financial technology. Anzhong Huang received his doctorate from Nanjing University in June 2006. After graduating in 2006, he worked in the School of Economics of Anhui University of Technology as an associate professor and master tutor. In April 2012, he became a postdoctoral researcher at the School of Economics and Management, Beijing Jiaotong University. After leaving the station in 2014, he worked as a full professor and master tutor at Jiangsu Normal University. Since 2019, he has worked in the School of Economics and Management, Jiangsu University of Science and Technology. He has been devoted to research on inclusive finance and financial technology. From 2006 to now, he has published 42 papers in SSCI, SCI, and CSSCI journals as an independent author or the first author, and has published a monograph.

Rui Xu. E-mail: [email protected]. Detailed-Address: 666#, Changhui Road, Dant District, Zhenjiang City, China, Postcode 212100. Degree: Master. Affiliation: School of Economics and Management, Jiangsu University of Science and Technology. Expertise: Financial and regional economic development. Rui Xu graduated from Nanjing Institute of Technology, majoring in material mechanics, and entered the School of Economics and Management, Jiangsu University of Science and Technology, in 2020 to pursue a master's degree in economics. He applies system simulation and other foundations from his materials major to the development of finance and regional economics. In 2021, his thesis “Research on Micro-credit Practice Mode Innovation and Poverty Alleviation Efficiency” won the Excellence Award of Jiangsu Graduate Forum.

Yu Chen. E-mail: [email protected]. Detailed-Address: 100#, Outside the West Road, University town, Panyu District, Guangzhou City, Guangdong Province, China, Postcode 510006. Degree: Master. Affiliation: School of Electromechanical Engineering, Guangdong University of Technology. Expertise: System simulation and financial innovation. Yu Chen is currently a third-year master's student in the School of Mechanical and Electrical Engineering of Guangdong University of Technology. His research direction is system simulation and financial innovation. During the first year, he completed his academic courses and studied papers in his major and in finance through the Internet. During the second year, he was actively involved in laboratory experiments and testing work. At present, the theory and process research on the parts of aircraft engine guide has been completed. During his school years, he won a university-level scholarship, participated in the publication of an SCI paper and a journal paper, and became a CPC member.

Meiwen Guo. E-mail: [email protected]. Detailed-Address: No. 19, Huamei Road, Tianhe District, Guangzhou City, China, Postcode 510520. Degree: Master, PhD candidate. Affiliation: Guangzhou Xinhua University. Expertise: Management science, information management, business intelligence system. Meiwen Guo is among the first group of e-commerce engineers in China and the Top 100 e-commerce masters in Guangdong Province. She is a PhD candidate at USM, an Associate Professor in the School of Management, Guangzhou Xinhua University, and a Senior Research Professor at the Entrepreneurship Center, Sun Yat-sen University. She has published more than 20 papers in the field of management science and information management, indexed by SCI, SSCI, EI, CSSCI, and other databases, as well as three books and a patent. She has presided over and participated in 10 projects, all of which have been completed; some of these projects have won excellence awards. Specialization: management science and engineering, sustainable development, information management, intelligent system.