1 Introduction

Online social networks (OSNs) constitute an integral aspect of many individuals’ lives. However, these platforms are plagued by numerous social bots orchestrated by automated algorithms to mimic human behavior. Social bots often engage in malevolent activities, including disseminating false information and disrupting or deceiving users [1]. These deleterious actions have adverse repercussions on genuine users.

Previous bot detection methods relied on feature extraction or design, coupled with the training of machine learning or neural network classifiers. Early studies [1, 2] incorporated features like follower and friend counts, tweet frequency, and account creation dates, among others. Subsequent studies concentrated on feature extraction from bot-generated posts [1, 3], leading to enhanced detection accuracy. Since 2016, a third generation of social bots has emerged, blending human-operated and automated behaviors to effectively disguise themselves and evade detection by traditional classifier-based systems [4]. This ongoing “cat-and-mouse game” has continuously driven the development of new solutions, disguises, and countermeasures. Recent advancements in Graph Neural Networks (GNNs) offer a promising approach to enhance detection efficacy by better capturing the implicit relationships between abnormal and legitimate users [4]. GNN-based methods [4,5,6,7] model social bot detection as a node classification task, where social network users are treated as nodes in a social graph and relationships between users, such as followers and friends, are represented as edges. These approaches effectively exploit the interaction patterns among users, resulting in improved detection performance compared to traditional feature-based methods [7].

Ensemble Learning is a methodology that improves decision accuracy by combining multiple models. It can employ diverse strategies for model integration, such as voting, averaging, or weight allocation. Ensemble learning can harness the unique strengths of each model by fusing predictions from multiple models, thereby mitigating individual model biases and variances and elevating overall predictive efficacy. Ensemble learning has been widely applied in various fields, including machine learning, data mining, and artificial intelligence [8,9,10,11].

Recent studies [12,13,14] have combined ensemble learning with Graph Neural Networks (GNNs) to further enhance their performance. Previous graph ensemble methods primarily relied on boosting and bagging techniques, using voting or averaging mechanisms to aggregate base classifier outputs, and therefore often fell short of optimal aggregation. Stacking [15] is a classical ensemble learning algorithm that substantially improves the performance of base classifiers by training a secondary classifier on their outputs. Inspired by conventional stacking, we introduce a learning framework named StackGNN, which integrates GNNs as the base classifiers and an MLP as the secondary classifier. To mitigate overfitting and construct diverse base classifiers, StackGNN relies on cross-validation, which, however, significantly increases its computational cost. We therefore introduce the Simplified Stacking Graph Neural Network (SStackGNN) for social bot detection: instead of the K-fold cross-validation used in StackGNN, we apply graph data augmentation to train different base classifiers. This approach markedly reduces computational time while improving social bot detection performance.

In this paper, we propose SStackGNN, a Simplified Stacking Graph Neural Network with graph data augmentation for social bot detection. Specifically, SStackGNN utilizes GNNs as the base classifiers and an MLP as the secondary classifier. We simplify the stacking process using graph data augmentation, eliminating the need for K-fold cross-validation. This significantly reduces computational time and improves the accuracy of social bot detection. Our proposed framework is flexible and can be used with various widely used backbones. Extensive experimental results demonstrate that our framework improves the performance of GNNs on different social bot detection benchmark datasets. The main contributions of this work are summarized as follows:

  • We combine stacking with GNNs and propose the StackGNN framework for social bot detection, which enhances the performance of GNN models in detecting social bot accounts.

  • Based on StackGNN, we propose SStackGNN, which uses graph data augmentation instead of K-fold cross-validation to increase the diversity of base classifiers, significantly reducing computational complexity and further improving classification performance.

  • Through extensive experiments on three real-world Twitter bot detection datasets, we demonstrate that our proposed SStackGNN outperforms state-of-the-art graph ensemble learning models and GNN-based social bot detection methods.

2 Related Work

2.1 Graph Neural Networks

Traditional graph embedding algorithms such as DeepWalk [16] and Node2Vec utilize random walks to obtain node embedding vectors. The Graph Convolutional Network (GCN) [17] is a well-known spectral graph convolution method that generates node embeddings by truncating the Chebyshev polynomial to first-order neighborhoods. However, relying solely on full-graph convolution operations to obtain a node's global representation vector may harm the model's generalization ability. The Simple Graph Convolution (SGC) [18] removes the non-linear activation functions from GCN; although this slightly reduces accuracy across all datasets, SGC performs comparably to GCN. In contrast, the Graph Attention Network (GAT) [19] introduces an attention mechanism to GCN, enabling adaptive parameter adjustment during convolution and feature fusion by assigning a learnable coefficient to each edge. Subsequent models improve the neighborhood aggregation and architecture of GNNs. APPNP [20] constructs a simple model that leverages the propagation scheme of PageRank [21], using a large, adjustable neighborhood to better propagate information from neighboring nodes. JK-Nets [22] employs jumping knowledge to obtain a more effective structure-aware representation by flexibly combining the distinct neighborhood ranges of each node, making the model more adaptive to the specific task. More recently, several models have further improved upon GCN, boosting performance even more [23,24,25].

2.2 Graph Ensemble Learning

Recently, ensemble learning has been applied to enhance the performance of GNNs. AdaGCN [12] pioneered the fusion of ensemble learning and GNNs by designing a GNN with an RNN-like graph structure. Graph convolutional layers are utilized as base classifiers in the adaptive enhancement process. Similarly, Boosting-GNN [13] utilizes GNN classifiers as base classifiers and assigns higher weights to training samples previously misclassified, thereby improving GNN performance in class imbalance scenarios. In addition to boosting methods, BGNN [14] incorporates Gradient Boosting Decision Trees (GBDT) [26] into GNNs to handle heterogeneous tabular data. RF-GNN [6] combines Random Forest [27] with GNNs, introducing the pioneering graph ensemble learning classifier for Twitter bot detection.

While these methods enhance the classification accuracy of GNNs through ensemble learning, they primarily aggregate base classifier outputs via voting or averaging during result generation. In our proposed SStackGNN, we aggregate base classifier outputs using a secondary classifier, providing enhanced flexibility and elevating classification accuracy. Furthermore, we employ graph data augmentation instead of K-fold cross-validation, enriching the diversity of base classifiers while significantly diminishing the model’s computational complexity.

2.3 GNN-Based Twitter Bot Detection

Graph-based methodologies have demonstrated remarkable efficacy in social bot detection compared to feature-based methods [28, 29]. Specifically, START [30] constructs a self-supervised task for Twitter user representation learning and applies it to fine-tune bot detection. Recently, GNN methods have achieved significant advances in social bot detection: accounts are modeled as nodes in a graph, while relationships such as friends and followers are modeled as edges. Alhosseini et al. [31] pioneered the application of Graph Convolutional Networks (GCN) to social bot detection, effectively leveraging account interaction connections. Furthermore, Feng et al. [7] applied RGCN [38] to social bot detection, considering both friend and follower relationships. With the growing interest in graph ensemble learning, Shi et al. [6] recently proposed RF-GNN, a Random Forest-based GNN ensemble model that improves social bot detection accuracy. However, RF-GNN aggregates the base classifiers through averaging; treating each base classifier equally diminishes the expressive capacity of graph ensemble learning, as different base classifiers may exhibit diverse performance levels.

3 Preliminaries

3.1 Stacking

Stacking is a classical ensemble learning algorithm in which the individual learners are referred to as base classifiers, and the learner responsible for aggregating their outputs is known as the secondary classifier or meta-classifier. The datasets used to train and evaluate this secondary classifier are referred to as the secondary training and test sets, respectively.

In the stacking method, the data are divided into training and test sets. The training set is further partitioned into K folds, and the base classifiers are trained with K-fold cross-validation: each base classifier is trained on K-1 folds and generates predictions on the held-out fold, and this process is repeated until every fold has been predicted. The matrix formed by combining the out-of-fold predictions of all base classifiers on the training set constitutes the secondary training set. The predictions of the base classifiers on the test set, averaged over the K folds, form the secondary test set. Stacking learns a weighting of the base classifiers through the secondary model, offering greater flexibility than the voting and averaging schemes employed in bagging and boosting.
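To make this procedure concrete, below is a minimal, generic sketch of stacking with K-fold cross-validation for a binary task. It is illustrative only, not the implementation used in this paper: it assumes NumPy inputs and scikit-learn-style base estimators exposing fit and predict_proba, and the names stacking_train_predict, base_models, and meta_model are hypothetical.

import numpy as np
from sklearn.model_selection import KFold

def stacking_train_predict(base_models, meta_model, X_train, y_train, X_test, k=5):
    """Generic stacking with K-fold cross-validation: out-of-fold predictions
    of the base models form the secondary training set, and their test-set
    predictions are averaged over the K folds to form the secondary test set."""
    meta_train = np.zeros((len(X_train), len(base_models)))
    meta_test = np.zeros((len(X_test), len(base_models)))
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    for m, model in enumerate(base_models):
        fold_preds = np.zeros((k, len(X_test)))
        for f, (tr_idx, va_idx) in enumerate(kf.split(X_train)):
            model.fit(X_train[tr_idx], y_train[tr_idx])
            meta_train[va_idx, m] = model.predict_proba(X_train[va_idx])[:, 1]
            fold_preds[f] = model.predict_proba(X_test)[:, 1]
        meta_test[:, m] = fold_preds.mean(axis=0)   # average the K fold predictions
    meta_model.fit(meta_train, y_train)             # train the secondary classifier
    return meta_model.predict(meta_test)            # final predictions on the test set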

Fig. 1 Schematic of StackGNN. StackGNN introduces stacking into graph neural networks to construct a new graph ensemble learning framework

To construct StackGNN, we integrate stacking with GNNs, as depicted in Fig. 1. Specifically, we utilize S GNNs as the base classifiers and an MLP as the secondary classifier.

4 The Proposed Method

4.1 Motivation

In StackGNN, K-fold cross-validation is used to partition the training set and train K base classifiers, whose outputs are then averaged to obtain the secondary training set. Training S base classifiers K times each greatly increases the computational complexity. To address this issue, we eliminate the K-fold cross-validation from the stacking procedure, significantly reducing its computational cost.

In the stacking approach, it is crucial to select diverse base classifiers. However, if all the base classifiers have the same structure, they may produce similar results during training. Consequently, the difference between the average of the outputs from the ensemble and the output of a single base classifier would be minimal, leading to limited performance improvement. To overcome this limitation, we propose SStackGNN, which employs graph data augmentation to preprocess the training data before training the base classifiers. By integrating S base classifiers, SStackGNN can learn from various graph structures and feature representations, resulting in enhanced machine account feature extraction. Please refer to Fig. 2 for an overview of the SStackGNN architecture.

Fig. 2 Schematic of SStackGNN. SStackGNN leverages node-level, edge-level, and feature-level augmentation techniques to construct different graphs as training sets for diverse base classifiers. This approach helps increase the diversity among the base classifiers. Subsequently, the outputs of the base classifiers are concatenated with the original features and fed into the secondary classifier. The secondary classifier aggregates the outputs of the base classifiers to obtain the final classification result

4.2 Graph Data Augmentation

In SStackGNN, we enhance the original graph \(G=(\mathcal {E}, \textbf{X})\) to generate S different subgraphs, with each subgraph used for training a separate base classifier. This approach increases the diversity among the base classifiers, thus improving the effectiveness of the ensemble. In addition, by replacing K-fold cross-validation with data augmentation, we significantly reduce computational complexity while enhancing the performance of SStackGNN.

Node-Level Augmentation. For node-level data augmentation, we employ node dropping: a certain proportion of vertices, along with their incident edges, is randomly removed without replacement. Each node is retained with probability \(\alpha \), and the retention decisions follow an independent and identically distributed (i.i.d.) uniform distribution.
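A minimal sketch of this node-dropping step, assuming a PyTorch edge_index in COO format; the helper name drop_nodes and the default value of \(\alpha \) are illustrative.

import torch

def drop_nodes(edge_index, num_nodes, alpha=0.9):
    """Node-level augmentation: keep each node independently with
    probability alpha and discard edges incident to dropped nodes."""
    keep = torch.rand(num_nodes) < alpha           # i.i.d. retention decisions
    row, col = edge_index
    edge_mask = keep[row] & keep[col]              # an edge survives only if both endpoints do
    return keep, edge_index[:, edge_mask]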

Edge-Level Augmentation. In addition to node dropping, we introduce a method to corrupt the graph structure by randomly removing a portion of edges in the subgraph. The proportion of edges to be dropped follows a normal distribution, with 1-\(\beta \) representing the proportion of edges retained. This approach allows us to introduce randomness and variability in the subgraph’s edge connections.
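Under the same assumptions, a sketch of the edge-dropping step, where roughly \(1-\beta \) of the edges are retained; the helper name drop_edges and the default value of \(\beta \) are illustrative.

import torch

def drop_edges(edge_index, beta=0.1):
    """Edge-level augmentation: randomly drop a fraction of edges so that
    roughly 1 - beta of the edges are retained."""
    keep = torch.rand(edge_index.size(1)) < (1.0 - beta)
    return edge_index[:, keep]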

Feature-Level Augmentation. The Mixup strategy [32] is a simple yet powerful technique for improving the generalization capacity of models by applying a linear transformation to the input data. In our case, we apply Mixup to the node features, enabling feature-level data augmentation. The process involves randomly selecting two nodes, \(v_i\) and \(v_j\), and performing linear interpolation on their feature vectors and corresponding labels. This method generates a new vector and label, which serve as augmented data.

$$\begin{aligned} \tilde{x}=\lambda x_{i}+(1-\lambda ) x_{j}, \end{aligned}$$
(1)
$$\begin{aligned} \tilde{y}=\lambda y_{i}+(1-\lambda ) y_{j}, \end{aligned}$$
(2)

where \(\tilde{x}\) and \(\tilde{y}\) represent the augmented node features and labels, while \(x_{i}\) and \(x_{j}\) represent the feature vectors of nodes i and j, respectively. We sample the Mixup weight \(\lambda \) from the distribution \(Beta(\gamma , \gamma )\) with a hyperparameter \(\gamma \).
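The following sketch implements the node-feature Mixup of Eqs. 1 and 2, assuming one-hot label vectors; nodes are paired through a random permutation and \(\lambda \) is drawn from \(Beta(\gamma , \gamma )\). The helper name mixup_nodes is illustrative.

import torch

def mixup_nodes(x, y_onehot, gamma=4.0):
    """Feature-level Mixup (Eqs. 1 and 2): interpolate the feature vectors
    and one-hot labels of randomly paired nodes with lambda ~ Beta(gamma, gamma)."""
    lam = torch.distributions.Beta(gamma, gamma).sample()
    perm = torch.randperm(x.size(0))               # random partner node j for each node i
    x_tilde = lam * x + (1.0 - lam) * x[perm]
    y_tilde = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_tilde, y_tilde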

4.3 Training of the Base Learner

By applying graph data augmentation, we obtain S distinct subgraphs \(G_{sub1}, G_{sub2}, \ldots , G_{subS}\), where \(G_{subi}=(\mathcal {E}_{subi}, \textbf{X}_{subi})\) for \(1 \le i \le S\), with \(\mathcal {E}_{subi}\) and \(\textbf{X}_{subi}\) denoting the edge set and feature matrix of the ith subgraph, respectively. We use \(G_{subi}\) to train the ith base classifier.

$$\begin{aligned} \textbf{Z}_{\text{subi}}=G_{\theta _{i}}\left( \mathcal {E}_{\text{subi}}, \textbf{X}_{\text{subi}}\right) \end{aligned}$$
(3)

Each GNN base classifier has \(L_G\) layers and makes its final-layer prediction for node i based on the node's \(L_G\)-hop neighborhood. We use the output embedding \(\textbf{Z}_{\text{subi}}\) in Eq. 3 for semi-supervised classification with a linear transformation and a softmax function.

$$\begin{aligned} \hat{\textbf{Y}}_{\text{subi}}={\text {softmax}}\left( \textbf{W}_{\text{subi}}\cdot \textbf{Z}_{\text{subi}}+\textbf{b}_{\text{subi}}\right) , \end{aligned}$$
(4)

where \(\textbf{W}_{\text{subi}}\) and \(\textbf{b}_{\text{subi}}\) are learnable parameters, and softmax normalizes the outputs across all classes. \(\textbf{Y}=\left[ y_{1}, y_{2}, \ldots , y_{n}\right] \) denotes the labels and \(\hat{\textbf{Y}}_{\text{subi}}=\left[ \hat{y}_{\text{subi},1}, \hat{y}_{\text{subi},2}, \ldots , \hat{y}_{\text{subi},n}\right] \) the corresponding predictions. \(\textrm{V}_{L}\) denotes the training set. The loss function of the ith base classifier is as follows:

$$\begin{aligned} \mathcal {L}_{i}=-\sum _{v_{n} \in \textrm{V}_{L}} {\text {loss}}\left( \textbf{y}_{n}, \hat{\textbf{y}}_{\text{subi},n}\right) . \end{aligned}$$
(5)
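A minimal sketch of a base classifier corresponding to Eqs. 3-5, assuming PyTorch Geometric's GCNConv as the backbone layer; the names BaseGNN and train_base and the hidden dimension are illustrative, and the cross-entropy in train_base plays the role of the loss in Eq. 5.

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class BaseGNN(torch.nn.Module):
    """Two-layer GCN base classifier G_theta_i (Eq. 3) with a linear
    softmax head (Eq. 4)."""
    def __init__(self, in_dim, hid_dim, num_classes):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hid_dim)
        self.conv2 = GCNConv(hid_dim, hid_dim)
        self.head = torch.nn.Linear(hid_dim, num_classes)

    def forward(self, x, edge_index):
        z = F.relu(self.conv1(x, edge_index))
        z = self.conv2(z, edge_index)              # node embeddings Z_subi
        return z, F.log_softmax(self.head(z), dim=-1)

def train_base(model, x, edge_index, y, train_idx, epochs=200, lr=0.01):
    """Minimise the supervised loss of Eq. 5 over the labelled nodes V_L."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=5e-4)
    for _ in range(epochs):
        opt.zero_grad()
        _, log_probs = model(x, edge_index)
        loss = F.nll_loss(log_probs[train_idx], y[train_idx])
        loss.backward()
        opt.step()
    return model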

4.4 Fitting the Secondary Classifier

The S base classifiers can be trained in parallel. After training the base classifiers, SStackGNN utilizes the predictions from the multiple base models to construct a secondary classifier, which generates the final predictions. The base-classifier embeddings \(\textbf{Z}_{1}, \textbf{Z}_{2}, \ldots , \textbf{Z}_{S}\), concatenated with the original features \(\textbf{X}\), serve as the training set for the secondary classifier.

$$\begin{aligned} \textbf{X}_{\textrm{M}}={\text {concat}}\left( \textbf{Z}_{1}, \textbf{Z}_{2}, \ldots , \textbf{Z}_{S}, \textbf{X}\right) \end{aligned}$$
(6)

where \(M_{\theta }\) denotes the secondary classifier, which has \(L_M\) layers, and \(\textbf{Z}_{M}=M_{\theta }\left( \textbf{X}_{M}\right) \). As in the base classifier training, a linear transformation and a softmax function are used to obtain the final prediction.

$$\begin{aligned} \hat{\textbf{Y}}={\text {softmax}}\left( \textbf{W}_{M} \cdot \textbf{Z}_{M}+\textbf{b}_{M}\right) \end{aligned}$$
(7)

The loss function of the secondary classifier is as follows:

$$\begin{aligned} \mathcal {L}=-\sum _{v_{n} \in \textrm{V}_{L}} {\text {loss}}\left( \textbf{y}_{n}, \hat{\textbf{y}}_{n}\right) . \end{aligned}$$
(8)

The secondary classifier is trained by minimizing the loss function \(\mathcal {L}\).
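A sketch of the secondary classifier of Eqs. 6 and 7, assuming the base-classifier embeddings are available as a list of tensors; the names build_secondary_input and SecondaryMLP are illustrative.

import torch
import torch.nn.functional as F

def build_secondary_input(embeddings, x):
    """Eq. 6: concatenate the S base-classifier embeddings with the
    original node features X to obtain X_M."""
    return torch.cat(list(embeddings) + [x], dim=-1)

class SecondaryMLP(torch.nn.Module):
    """Two-layer MLP secondary classifier M_theta with a softmax head (Eq. 7)."""
    def __init__(self, in_dim, hid_dim, num_classes):
        super().__init__()
        self.fc1 = torch.nn.Linear(in_dim, hid_dim)
        self.fc2 = torch.nn.Linear(hid_dim, num_classes)

    def forward(self, x_m):
        z_m = F.relu(self.fc1(x_m))                # Z_M = M_theta(X_M)
        return F.log_softmax(self.fc2(z_m), dim=-1)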

Algorithm 1 SStackGNN Training Strategy
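Since Algorithm 1 is provided as a figure, the following high-level sketch outlines the training strategy by reusing the helper sketches above (drop_nodes, drop_edges, mixup_nodes, BaseGNN, train_base, build_secondary_input, SecondaryMLP). It is a simplification: node dropping is realized by removing edges incident to dropped nodes rather than relabelling the graph, the base-classifier loss keeps the hard labels rather than the interpolated labels of Eq. 2, and the embeddings fed to the secondary classifier are assumed to be computed on the original graph; all names and default values are illustrative.

import torch
import torch.nn.functional as F

def train_sstackgnn(x, edge_index, y, train_idx, S=6,
                    alpha=0.9, beta=0.1, gamma=4.0):
    """High-level sketch of the SStackGNN training strategy."""
    num_classes = int(y.max()) + 1
    y_onehot = F.one_hot(y, num_classes).float()
    embeddings = []
    for _ in range(S):                                       # one augmented subgraph per base classifier
        _, ei = drop_nodes(edge_index, x.size(0), alpha)     # node-level augmentation
        ei = drop_edges(ei, beta)                            # edge-level augmentation
        x_aug, _ = mixup_nodes(x, y_onehot, gamma)           # feature-level augmentation
        base = train_base(BaseGNN(x.size(1), 128, num_classes), x_aug, ei, y, train_idx)
        with torch.no_grad():
            z, _ = base(x, edge_index)                       # embeddings on the original graph
        embeddings.append(z)
    x_m = build_secondary_input(embeddings, x)               # Eq. 6
    meta = SecondaryMLP(x_m.size(1), 128, num_classes)
    opt = torch.optim.AdamW(meta.parameters(), lr=0.01, weight_decay=5e-4)
    for _ in range(200):                                     # minimise the loss of Eq. 8
        opt.zero_grad()
        loss = F.nll_loss(meta(x_m)[train_idx], y[train_idx])
        loss.backward()
        opt.step()
    return meta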

5 Experiment Setup

5.1 Dataset

We evaluate Twitter bot detection models on three datasets that have graph structures: Cresci-15 [33], Twibot-20 [28], and MGTAB [29]. Here is an overview of each dataset:

  • Cresci-15: This dataset consists of 5,301 accounts that have been classified as either real human or automated accounts. It provides details on the friendship and follower connections between these accounts.

  • Twibot-20: This dataset includes 229,580 users and 227,979 edges. It contains 11,826 accounts that have been classified as either automated or real. The dataset provides information on the friend and follower relationships between these users.

  • MGTAB: This dataset is much larger, with 410,199 users and close to 100 million edges, featuring seven types of relationships. It includes 10,199 labeled users, annotated as either humans or bots.

We utilize all labeled users to construct user social graphs for these datasets. For Cresci-15, we follow the processing approach in [34] and use six user attribute features: followers_count, active_days, screen_name_length, following_count, listed_count, and is_default_profile_image. In addition, we incorporate two 768-dimensional user description features and the user tweet features extracted by RoBERTa [35]. For Twibot-20, we again follow the processing approach in [34] and employ 17 user attribute features, including protected, geo_enabled, verified, contributors_enabled, is_translator, is_translation_enabled, profile_background_tile, profile_user_background_image, has_extended_profile, default_profile, and default_profile_image. Similar to Cresci-15, we incorporate two 768-dimensional user description features and the user tweet features extracted by RoBERTa. As for MGTAB, we use the 20 user attribute features with the highest information gain and the 768-dimensional user tweet features extracted by LaBSE [36].

The statistics of these datasets are summarized in Table 1. For all datasets, we randomly partition the data into training, validation, and test sets with a 1:1:8 ratio.
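A minimal sketch of such a 1:1:8 random partition over the labeled nodes; the function name random_split and the seed are illustrative.

import torch

def random_split(num_nodes, train_ratio=0.1, val_ratio=0.1, seed=0):
    """1:1:8 random partition of the labelled nodes into training,
    validation, and test indices."""
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(num_nodes, generator=g)
    n_train = int(train_ratio * num_nodes)
    n_val = int(val_ratio * num_nodes)
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]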

Table 1 Statistics of datasets used in the paper

5.2 Baseline Methods

To validate the effectiveness of our proposed SStackGNN, we compare it with several baseline models for semi-supervised learning and ensemble GNNs. Specifically, we select Node2Vec [17], APPNP [20], GCN [37], SGC [18], GAT [19], JK-Nets [22], LA-GCN [23], S\(^2\)GC [24], GCN II [25], AdaGCN [12], Boosting-GNN [13], BGNN [14], and RF-GNN [6] as baseline models. SStackGNN utilizes GCN, SGC, and GAT as its backbone models. In addition, we extend SStackGNN to heterogeneous graphs using RGCN [38] and RGAT [39] as backbone models.

5.3 Parameter Settings

In our experiments, we train our models with the AdamW optimizer for 200 epochs. The learning rate ranges from 0.005 to 0.01. The GAT and RGAT models use four attention heads. The L2 weight decay factor is set to 5e-4 for all datasets. The dropout rate ranges from 0.3 to 0.5. The input and output dimensions of the GNN layers are consistent across all models, set to either 128 or 256. An MLP is utilized as the secondary model in both StackGNN and SStackGNN. The number of GNN layers \(L_G\) and the number of layers in the secondary model \(L_M\) are both set to 2. The hyperparameters \(\alpha \) and \(\beta \) are set within the range of 0.1 to 1. In addition, we set the value of \(\gamma \) to 4.0 in our experiments. StackGNN adopts a K-fold cross-validation approach with K set to 5. Our implementations of StackGNN and SStackGNN use PyTorch 1.8.0 and Python 3.7.10. PyTorch Geometric [16] is used for sparse matrix multiplication. For the baselines, all results presented in the tables were obtained by running the official source code and fine-tuning the parameters to achieve the best classification performance.

5.4 Evaluation Metrics

Since the numbers of bot and human accounts on social media are not roughly equal, we use Accuracy and F1-score to measure the overall performance of a classifier:

$$\begin{aligned} \text {Accuracy}= & {} \frac{TP+TN}{TP+FP+FN+TN}, \end{aligned}$$
(9)
$$\begin{aligned} \text {Precision}= & {} \frac{TP}{TP+FP}, \end{aligned}$$
(10)
$$\begin{aligned} \text {Recall}= & {} \frac{TP}{TP+FN}, \end{aligned}$$
(11)
$$\begin{aligned} \text {F1}= & {} \frac{2\text {Precision}\times \text {Recall}}{\text {Precision}+\text {Recall}}, \end{aligned}$$
(12)

where TP is True Positive, TN is True Negative, FP is False Positive, and FN is False Negative.
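A small sketch computing these metrics (Eqs. 9-12) from predicted and true labels, treating the bot class as the positive label (label 1); the function name bot_metrics is illustrative.

import torch

def bot_metrics(pred, target):
    """Accuracy, Precision, Recall, and F1 for binary bot detection,
    with bots as the positive class."""
    tp = ((pred == 1) & (target == 1)).sum().item()
    tn = ((pred == 0) & (target == 0)).sum().item()
    fp = ((pred == 1) & (target == 0)).sum().item()
    fn = ((pred == 0) & (target == 1)).sum().item()
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return acc, precision, recall, f1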

6 Experiment Results

In this section, we conduct several experiments to evaluate SStackGNN. We mainly answer the following questions:

  • Q1: How do different algorithms perform in different scenarios? (Sect. 6.1).

  • Q2: How does the proportion of the training set affect the performance of SStackGNN? (Sect. 6.2).

  • Q3: How does SStackGNN perform when applied to heterogeneous GNNs? (Sect. 6.3).

  • Q4: How does each module of SStackGNN contribute to the overall effectiveness? (Sect. 6.4).

  • Q5: How does the number of base classifiers affect the performance of SStackGNN? (Sect. 6.5).

  • Q6: How does graph data augmentation affect the average similarity of the base classifier outputs? (Sect. 6.6).

  • Q7: How effective is SStackGNN in reducing computing costs? (Sect. 6.7).

  • Q8: How sensitive is SStackGNN to graph data augmentation parameters? (Sect. 6.8).

6.1 Overall Performance

In this section, we conduct experiments on publicly available social bot detection datasets to assess the effectiveness of our proposed method. To reduce randomness and ensure result stability, each method was evaluated five times with different seeds. We present the average test results of the baselines, StackGNN, and SStackGNN. StackGNN and SStackGNN utilize GCN, SGC, and GAT as backbone models. As depicted in Table 2, SStackGNN demonstrates superior performance compared to StackGNN and other baselines in all scenarios.

Table 2 Comparison of the average performance of different methods for social bot detection using GCN, SGC, and GAT as backbone models.

The performance of SStackGNN shows significant improvement compared to the base classifier on all datasets. Since the Cresci-15 dataset is from an earlier period and the bot features are more distinguishable, bot detection is relatively straightforward. Consequently, the improvements on the Cresci-15 dataset are relatively limited. On average, SStackGNN achieves improvements of 3.88%, 15.48%, and 1.97% on the MGTAB, Twibot-20, and Cresci-15 datasets, respectively, in comparison to the backbone models. SStackGNN improves by 0.79%, 2.66%, and 1.29% on the same datasets compared to state-of-the-art graph ensemble learning methods. Notably, SStackGCN, SStackSGC, and SStackGAT achieve maximum improvements for accuracy of 17.11%, 16.79%, and 12.53%, respectively, on the Twibot-20 dataset when compared to the baseline model GCN.

StackGNN combines stacking with GNNs directly, without data augmentation, and performs K-fold cross-validation when training the base classifiers. SStackGNN consistently outperforms StackGNN while significantly reducing computation time. A detailed analysis of these performance improvements is presented in the following subsections.

6.2 Impact of Training Set Proportion

Fig. 3 Impact of training proportion for SStackGNN

In this subsection, we investigate the impact of varying training set proportions on prediction performance using the bot detection datasets. We conduct comparative experiments with different training proportions to demonstrate the stability and effectiveness of our method. Specifically, we use 10% of the samples as the validation set, 50% as the test set, and vary the training set proportion from 10% to 40%. The results are illustrated in Fig. 3. We compare the performance of SStackGNN with the GCN and GAT models. As the training proportion increases, the performance of all models improves. Across different training set proportions on all datasets, SStackGNN consistently achieves significant improvements in detection accuracy compared to the backbone models. This suggests that our method is robust and effective across varying training set sizes, showcasing its potential for reliable performance across different data scenarios.

6.3 Extend to Heterogeneous GNNs

In Sect. 6.1, the GNN models used were homogeneous graph models that did not distinguish between different types of edges. In this section, we extend StackGNN and SStackGNN to heterogeneous graph neural networks. We select RGCN and RGAT, commonly used multi-relational graph neural networks in social bot detection, as the backbone models. We consider two types of relationships, friends and followers, as the edges in the graph. The results are shown in Table 3.

Table 3 Comparison of the average performance of different methods for social bot detection using RGCN and RGAT as backbone models.
Table 4 Comparison of the average performance of SStackGNN and variants

From Table 3, we observe the following phenomena:

  • RGCN and RGAT exhibit significantly better detection performance compared to homogeneous graph models like GCN and GAT. When utilizing RGCN and RGAT as backbone models, StackGNN and SStackGNN achieve higher classification accuracy.

  • StackGNN shows limited performance improvement compared to the backbone models. On the Cresci-15, Twibot-20, and MGTAB datasets, StackGNN achieves an average accuracy improvement of only 0.54%, 1.13%, and 0.43%, respectively.

  • Compared to StackGNN, SStackGNN demonstrates a significant performance advantage over the backbone models on all social bot detection datasets.

Due to their ability to leverage different types of relations, RGCN and RGAT outperform GCN and GAT. We speculate that RGCN and RGAT are essentially models that integrate different relation views, similar to the integration effect of StackGNN. Hence, the performance improvement of StackGNN is limited. By utilizing graph data augmentation to construct S different subgraphs for training classifiers, SStackGNN enhances the diversity of base classifiers and further improves the ensemble effect.

6.4 Ablation Study

SStackGNN employs different backbone models, all of which outperform StackGNN, validating the effectiveness of graph data augmentation. To gain a more comprehensive understanding of how different data augmentation techniques impact the overall learning framework and to better evaluate their individual contributions to performance improvement, we generated several variants of the full SStackGNN model. Below is a detailed description of these variations:

  • SStackGNN-w/o-NA: This variant removes node-level data augmentation, specifically node dropping.

  • SStackGNN-w/o-FA: This variant removes feature-level data augmentation, specifically node feature mixup.

  • SStackGNN-w/o-EA: This variant removes edge-level data augmentation, specifically edge dropping.

SStackGNN includes all modules in the graph learning framework. Table 4 presents the performance of these various variants, highlighting the roles of different levels of graph data augmentation methods in our proposed learning framework.

SStackGNN demonstrated improved performance across all variations, achieving higher accuracy and F1-score, thus validating the effectiveness of data augmentation at different levels. On the Twibot-20 and Cresci-15 datasets, SStackGCN-w/o-FA had the lowest classification accuracy, indicating that feature-level data augmentation contributed the most to enhancing SStackGNN’s performance. On the MGTAB dataset, SStackGCN-w/o-NA exhibited the poorest performance, suggesting that node-level data augmentation significantly improved SStackGNN’s performance.

6.5 The Effect of the Number of Base Classifiers

In this section, we investigate the impact of the number of base classifiers, denoted as S, on the classification performance of StackGNN and SStackGNN. As shown in Fig. 4, the number of base classifiers S is a crucial parameter affecting the performance of SStackGNN. Across all datasets, without exception, every model instance gains accuracy as the number of base classifiers grows. When using the same backbone model, SStackGNN consistently outperforms StackGNN across different numbers of base classifiers. On all datasets, the accuracy of SStackGNN increases significantly as the number of base classifiers grows from 2 to 6; beyond 6, adding more base classifiers does not yield a significant further improvement in classification accuracy. In contrast, for StackGNN, accuracy keeps increasing with the number of base classifiers only on the Cresci-15 dataset.

Fig. 4 StackGNN and SStackGNN performance trends w.r.t. the number of base classifiers on the MGTAB (a), Twibot-20 (b), and Cresci-15 (c) datasets

6.6 Similarity of the Output of the Base Classifiers

By applying node-level, edge-level, and feature-level augmentation techniques, the methodology generates diverse versions of the original graph data, each providing a distinct view of the underlying relationships and patterns. This diversity enhances the ensemble learning process by allowing the base classifiers to learn from different graph structures and feature representations, leading to improved predictive performance. To validate the enhancement of base classifier diversity through graph data augmentation, we computed the average similarity of the outputs of the base classifiers. We conducted this validation on the MGTAB dataset, using the same training and testing settings as described in Sect. 6.1. We utilized cosine similarity to measure the diversity of the base classifier outputs on the test samples. The results are presented in Fig. 5.

Fig. 5 Output similarity of base classifiers for SStackGNN-w/o-GA and SStackGNN on the MGTAB dataset

SStackGNN-w/o-GA refers to the SStackGNN model without graph data augmentation. In both SStackGNN-w/o-GA and SStackGNN, the number of base classifiers is set to 10, and the backbone models used for these classifiers are GCN, SGC, and GAT. The base classifier outputs of SStackGNN-w/o-GA exhibited significantly higher similarity compared to SStackGNN. This indicates that incorporating graph data augmentation can reduce the correlation among base classifier outputs, thereby increasing the base classifiers’ diversity and improving the model’s performance.
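A small sketch of this diversity measure, assuming the base-classifier outputs on the test samples are collected as a list of tensors of identical shape; the function name mean_pairwise_cosine is illustrative.

import itertools
import torch
import torch.nn.functional as F

def mean_pairwise_cosine(outputs):
    """Average pairwise cosine similarity between the flattened test-set
    outputs of the base classifiers; higher values indicate less diverse
    base classifiers. `outputs` is a list of [num_test, num_classes] tensors."""
    sims = [F.cosine_similarity(a.flatten(), b.flatten(), dim=0)
            for a, b in itertools.combinations(outputs, 2)]
    return torch.stack(sims).mean()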

6.7 Reduce Computational Complexity

In this section, we take GCN as the backbone model and analyze the computational complexity of SStackGNN and StackGNN. Let \(d_{G, l}\) denote the dimension of node representations at layer l of a GNN base classifier and \(d_{M, l}\) that of the secondary classifier. The time complexity of GCN is \(O\left( |\mathcal {E}| \sum _{l=1}^{L_{G}} d_{G, l}+|\mathcal {V}| \sum _{l=1}^{L_{G}} d_{G, l-1} d_{G, l}\right) \). The computational complexity of the secondary classifier in both SStackGNN and StackGNN is \(O\left( |\mathcal {V}| \sum _{l=1}^{L_{M}} d_{M, l-1} d_{M, l}\right) \). Since our graph data augmentation introduces minimal additional computation, we neglect its cost. StackGCN requires training KS base GCN classifiers and one MLP secondary classifier, resulting in a computational complexity of \(O\Big (K S\Big (|\mathcal {E}| \sum _{l=1}^{L_{G}} d_{G, l}+|\mathcal {V}| \sum _{l=1}^{L_{G}} d_{G, l-1} d_{G, l}\Big )+|\mathcal {V}| \sum _{l=1}^{L_{M}} d_{M, l-1} d_{M, l}\Big )\). Similarly, SStackGCN trains S base GCN classifiers and one MLP secondary classifier, giving a computational complexity of \(O\Big (S\Big (|\mathcal {E}| \sum _{l=1}^{L_{G}} d_{G, l}+|\mathcal {V}| \sum _{l=1}^{L_{G}} d_{G, l-1} d_{G, l}\Big )+|\mathcal {V}| \sum _{l=1}^{L_{M}} d_{M, l-1} d_{M, l}\Big )\). Per base classifier, the computational load of SStackGNN is thus 1/K of that of StackGNN. The average computation time for a single base classifier in SStackGNN and StackGNN is shown in Table 5. Since StackGNN also involves time-consuming operations such as K-fold training and validation set partitioning and repeated model initialization, SStackGNN in practice reduces the average computation time for training a single base classifier to below 1/K.

Table 5 The time consumption to train a base classifier on the MGTAB, Twibot-20, and Cresci-15 datasets, respectively. Training is conducted for 10 base classifiers, and the average time is reported (unit: seconds)

6.8 Parameters Sensitivity Analysis

In this section, we investigate the hyperparameter sensitivity of SStackGNN. The hyperparameters include \(\alpha \), which controls the proportion of nodes retained when constructing subgraphs; \(\beta \), which controls the proportion of edges dropped (with \(1-\beta \) of the edges retained); and \(\gamma \), which parameterizes the Beta distribution from which the Mixup weight \(\lambda \) is sampled.

First, we use GCN as the backbone model of SStackGNN and vary the hyperparameters \(\alpha \) and \(\beta \) from 0.1 to 0.9; the results are shown in Fig. 6. On the MGTAB and Cresci-15 datasets, SStackGCN performs better when \(\alpha \) and \(\beta \) are larger. On the Twibot-20 dataset, SStackGCN performs better when \(\beta \) is large and \(\alpha \) is small.

Fig. 6 Sensitivity analysis for the graph data augmentation parameters \(\alpha \) and \(\beta \) on (a) MGTAB, (b) Twibot-20, and (c) Cresci-15

Table 6 Test accuracy (%) of bot detection achieved by SStackGNN with a GCN backbone for different \(\gamma \) values

Finally, we evaluate how sensitive our node feature Mixup method is to the hyperparameter \(\gamma \), which controls the distribution from which the Mixup weights are sampled. The accuracy of SStackGCN with varying \(\gamma \) is shown in Table 6. We observe that, in most cases, the performance of SStackGNN remains relatively stable within a certain range of the parameter. However, when \(\gamma \) is too large or too small, performance decreases, which should be avoided in practice. Based on this empirical evidence, we choose \(\gamma =4.0\) as the default setting in our experiments; this setting yields satisfactory performance.

7 Conclusion

This paper proposes SStackGNN, a Simplified Stacking Graph Neural Network with graph data augmentation for social bot detection. The method uses graph neural networks (GNNs) as base classifiers and trains a series of them on subgraphs produced by data augmentation techniques. A multi-layer perceptron (MLP) serves as the secondary classifier and effectively aggregates the outputs of the different base classifiers. The proposed approach combines the strengths of graph neural networks and ensemble learning, resulting in improved detection performance. Compared to previous stacking methods, our method significantly reduces computational complexity. We have demonstrated its consistent superiority over state-of-the-art GNN baselines on social bot detection datasets.