
Received 19 July 2022, accepted 8 August 2022, date of publication 17 August 2022, date of current version 22 August 2022.

Digital Object Identifier 10.1109/ACCESS.2022.3199372

Network Traffic Prediction Based on LSTM and Transfer Learning
XIANBIN WAN1, HUI LIU1, HAO XU1, AND XINCHANG ZHANG2 (Senior Member, IEEE)
1 Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan 250014, China
2 College of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250306, China

Corresponding author: Xianbin Wan ([email protected])


This work was supported in part by the National Natural Science Foundation of China under Grant 92067108, in part by the Shandong
Provincial Natural Science Foundation of China under Grant ZR2020MF057, and in part by the Basic Research Promotion Plan of Qilu
University of Technology (Shandong Academy of Sciences) under Grant 2021JC03001.

ABSTRACT The increasing amount of traffic in recent years has led to increasingly complex network
problems. To be able to improve overall network performance and increase network utilization, it is valuable
to take measures to capture future trends in network traffic. In traditional machine learning, to guarantee
the accuracy and high reliability of the models obtained through training, there are two basic assumptions:
(1) the training samples used for learning and the new test samples satisfy the condition of independent
identical distribution; and (2) there must be enough training samples to learn a good model. However, time-
series data are not easily accessible in real life, and even after putting in a lot of time and effort to collect
them, the data may be unavailable due to confidentiality. In this paper, a neural network model based on long
short-term memory (LSTM) and transfer learning is proposed to address the problem of small sample
size in network traffic prediction. Knowledge in the source domain is transferred to the target domain using
transfer learning, and a prediction model with good performance is constructed with a small amount of target
domain data. The results show that the performance of the transfer learning model improves by more than
40% over the direct training model when using the same samples for predicting 10,000 rows of data, resulting
in better performance of the network traffic prediction task.

INDEX TERMS LSTM, network traffic prediction, transfer learning.

The associate editor coordinating the review of this manuscript and approving it for publication was Mu-Yen Chen.

I. INTRODUCTION
Due to the rapid development of society, the network is getting more and more traffic. According to the latest Visual Networking Index (VNI) report [1] by Cisco, in 2022, more traffic will flow through the global network than in all 32 years combined from the first year of the Internet to the end of 2016. Global traffic will more than triple, and by 2022, traffic flowing through the worldwide network will reach 4.8 zettabytes (ZB) per year or 396 exabytes per month. In 2017, the annual run rate of global traffic was 1.5 ZB per year or 122 exabytes per month. The increase in network traffic makes the network situation more and more complex. Therefore, a large number of solutions (e.g., [2], [3]) have been proposed to optimize network traffic.

Analyzing traffic data can improve network quality, enhance network security, and prevent congestion. Future traffic data can be obtained through network traffic prediction. It plays an essential role in network management, network design, short- and long-term resource allocation, traffic rerouting, anomaly detection, and other network areas. Accurate traffic prediction can smooth out delay-sensitive traffic, dynamically allocate bandwidth services, achieve congestion control on the network, and enhance the overall user experience. To improve overall network performance and enhance network utilization, it is valuable to take steps to capture the future trend of network traffic.

In traffic prediction, scholars at home and abroad have conducted extensive research for a long time and put forward many effective methods.


The main models include the multivariate linear AR model based on time points, the segmented autoregressive sliding average (ARMA) model [4], the differential autoregressive moving average (ARIMA) model [5], the differential autoregressive summation sliding average (FARIMA) model [6], etc. In addition, some scholars apply nonlinear theory to network traffic prediction and propose prediction models based on support vector machines (SVM) [7], gray models (GM) [8], Gaussian processes (GP) [9], and neural networks (NN) [10]. For example, there are gray models based on support vector machine compensation, Gaussian process hybrid prediction models based on the Gaussian distribution, and traffic prediction models based on the long short-term memory (LSTM) neural network.

Although the prediction effects of the above models are satisfactory, there are still shortcomings. With the increase in network complexity, the distribution characteristics of network traffic have exceeded the traditional sense of the Poisson distribution or Markov distribution, so it is difficult to ensure the accuracy of linear model prediction. Increasingly mature machine learning-based traffic prediction methods have received great attention, and many traffic prediction models based on vector machines and artificial neural networks have emerged that greatly improve the prediction of complex traffic at present. For traditional machine learning models based on vector machines and artificial neural networks, to guarantee the accuracy and high reliability of the models obtained by training, there are two basic assumptions: (1) the training samples used for learning and the new test samples satisfy the condition of independent identical distribution; and (2) there must be enough training samples to learn a good model. However, in practical applications, these two conditions are often not satisfied. Many fields eager to use machine learning do not have enough data to train a model. In this context, transfer learning was born. Transfer learning is a term used in machine learning to refer to the effect of one type of learning on another type of learning, or the effect of an acquired experience on the completion of other activities. It can transfer existing knowledge to solve the problem of having only a small amount of labeled data in the target region [11].

In this paper, we propose a network traffic prediction method based on the LSTM neural network and transfer learning. The method uses the idea of transfer learning to save the knowledge acquired during the execution of the source task in the source domain. When the knowledge in the target domain is insufficient to complete the target task, the saved knowledge is applied to complete the target task. Specifically, the parameters of a network traffic prediction model trained with sufficient source domain data are transferred to a network traffic prediction model that lacks sufficient target domain training data; the transferred model is then trained with the smaller amount of target domain data, finally yielding a network traffic prediction model with more accurate predictions.

The network prediction task involved in this paper is a single-indicator time series prediction task, i.e., given the historical change of a certain indicator, predict its change in the future period. The acquisition of traffic sequence data requires a lot of time and effort, and even then, the acquired traffic sequence data may not be usable due to the inclusion of private information. In traditional network traffic prediction methods, neural network models can use network layers to extract features from sufficient data and show good performance in performing prediction tasks. However, when there is insufficient data, the neural network model is unable to create attributes that are not present in the data. If the neural network model obtains training data that is not representative, it models the unique attributes in these training data as general attributes, which is often referred to as the overfitting problem. The overfitting problem will result in a neural network model that predicts the training data more accurately but has a higher error rate for other data and poor generalization performance.

The method proposed in this paper uses the idea of transfer learning to transfer the parameters of a network traffic prediction model, which has been trained in other domains, to the original LSTM model. The constructed LSTM model is then trained using the pre-processed target domain data. The method proposed in this paper results in more accurate network traffic prediction and better generalization of the network model when using the same size of data.

Considering the previous studies, the key contributions of this work can be summarized as follows:
1. Building a network traffic prediction architecture based on LSTM and transfer learning.
2. By adding transfer learning, a neural network model can be trained using a small amount of data. Our method is able to produce more accurate predictions than the method without transfer learning.
3. Transfer learning has been applied to solve classification problems, usually in combination with CNN neural networks. The combination of transfer learning and LSTM proposed in this paper extends the application area of transfer learning and, at the same time, proposes a new method to solve the prediction problem.

The paper is organized as follows. Section II briefly summarises LSTM and transfer learning and why transfer learning should be used in network traffic prediction. Section III describes the network traffic prediction architecture based on LSTM and transfer learning that will be used in this paper. Section IV presents performance results from specific test scenarios, and conclusions are presented in Section V.

II. RELATED WORK
As mentioned in the previous section, the use of linear and nonlinear models for network traffic prediction has been extensively studied in the literature, mainly by constructing fine-grained neural network models and then training them using sufficient amounts of data. In contrast to the above literature, we will use a small amount of data to construct well-performing network traffic prediction models, addressing the problem of data not being easily available in the network domain. In this section, we present the background of network traffic prediction research based on LSTM networks, and why we use LSTM networks and transfer learning for network traffic prediction research.


A. LSTM
Long short-term memory (LSTM) is a modified recurrent neural network that is suitable for processing and predicting important events with very long intervals and delays in time series. The LSTM network contains LSTM blocks, which may be described as intelligent network units because they can remember values of indefinite duration, and there are "gates" in the blocks that can determine whether the "input" needs to be remembered and whether it can be output to the "output". The structure is shown in Fig. 1.

FIGURE 1. Typical LSTM network structure diagram.

The hidden layer of the recurrent neural network has only one state, "h", which is very sensitive to short-term input. For the LSTM network, three "gates" are set: the forget gate, which is responsible for controlling the continued preservation of the long-term state "c"; the input gate, which is responsible for controlling the input of the immediate state to the long-term state "c"; and the output gate, which is responsible for controlling whether to use the long-term state "c" as the output of the current LSTM.

Network traffic prediction is based on data prediction, and models that perform well in the field of data prediction will also be applicable in the field of network traffic. To guarantee high-accuracy vessel trajectory prediction, [12] proposes an AIS data-driven trajectory prediction framework, whose main component is a long short-term memory network. The vessel traffic conflict situation modeling, generated using the dynamic AIS data and the social force concept, is embedded into the LSTM network. [13] proposed a spatio-temporal multi-graph convolutional network (STMGCN) based vessel trajectory prediction framework using the mobile edge computing (MEC) paradigm. It is mainly composed of three different graphs, which are, respectively, reconstructed according to the social force, the time to the closest point of approach (TCPA), and the size of surrounding vessels. These three graphs are then jointly embedded into the prediction framework by introducing the spatio-temporal multi-graph convolutional layer (STMGCL).

In general, network traffic prediction is a well-researched area, and LSTM-based prediction models have a wide range of applications. In [14], the authors designed an LSTM neural network-based traffic prediction system using mobile services at LTE base stations as the research object. Ramakrishnan and Soni [15] proposed several recurrent neural network (RNN) structures (the standard RNN, the long short-term memory (LSTM) network, and the gate recurrent unit (GRU)) to solve the network traffic prediction problem. The performance of these models was analyzed for three important problems in network traffic prediction: traffic prediction, packet protocol prediction, and packet distribution prediction. Recent results were obtained on traffic prediction problems on public datasets such as the GEANT and Abilene networks. To enhance the robustness of the real-time network traffic prediction model [16], Lu and Yang modified the loss function of the LSTM network. Unlike the traditional LSTM model, their model was continuously updated with the arrival of new traffic. The experimental results showed that the model has better prediction accuracy compared with models constructed by support vector regression and BP neural networks.

It is clear from the literature that LSTM neural networks are widely used for network traffic prediction. However, the performance of LSTM neural network models is also limited by the amount of training data, and they cannot perform the prediction task well when the training data is too small. Therefore, transfer learning, which requires only a small amount of data, is crucial for network traffic prediction.

B. TRANSFER LEARNING
Weiss et al. [17] had given a unified definition of transfer learning.
Definition (Transfer Learning): Given a source domain DS and learning task TS, and a target domain DT and learning task TT, transfer learning aims to help improve the learning of the target predictive function fT(·) in DT using the knowledge in DS and TS, where DS ≠ DT, or TS ≠ TT.
Transfer learning can be divided into the following three categories.
• Inductive transfer learning: Whether the source domain is the same as the target domain or not, the source task is different from the target task.
• Transductive transfer learning: The source task is the same as the target task, but the source domain and target domain are different.
• Unsupervised transfer learning: The source task is relevant to the target task regardless of whether the source and target domains are the same.
Pan and Yang [18] had given a unified definition of transductive transfer learning.


Definition for Transductive Transfer Learning: Given a source domain DS and a corresponding learning task TS, and a target domain DT and a corresponding learning task TT, transductive transfer learning aims to improve the learning of the target predictive function fT(·) in DT using the knowledge in DS and TS, where DS ≠ DT and TS = TT. In addition, some unlabeled target domain data must be available at training time.

The difference between traditional learning and transfer learning is shown in Fig. 2. In the study of this paper, network traffic prediction is performed in different network environments, i.e., DS ≠ DT, while network traffic has the same data dimension and the same feature space, i.e., TS = TT. From the definitions, we can see that transductive transfer learning is the best choice where the source domain is different from the target domain, but the source task is the same as the target task.

FIGURE 2. The difference between traditional machine learning and transfer learning.

Transfer learning of neural networks first trains a base network on a source dataset and then transfers the learned features (the weights of the network) to a second network, trained on the target dataset. This idea has been shown to improve the generalization ability of deep neural networks in many computer vision tasks, such as image recognition and object localization. However, unlike the image recognition problem, transfer learning techniques have not been thoroughly investigated in time series classification tasks. Based on this, Fawaz et al. in [19] construct deep convolutional neural networks to solve the time series classification problem. In [20], Kashiparekh et al. proposed a deep convolutional neural network trained on a different univariate time series classification task. Once trained, the model can be easily adapted to the new time series classification target task by performing a small amount of fine-tuning using labeled instances of the target task. The authors observe a significant improvement in classification accuracy and computational efficiency when using a pre-trained deep convolutional neural network as a starting point for subsequent task-specific fine-tuning, compared to existing state-of-the-art time series classification methods. [21] investigated whether the application of transfer learning to the electroencephalogram time series classification problem could conveniently replace the feature engineering involved with direct data visualization. The model achieved more than 80% classification accuracy, but the trained neural network exhibited overfitting characteristics. The authors suggest that alternative data visualization techniques and modifications of transfer learning methods may yield better results for multichannel neural time series data.

The above literature describes how to use transfer learning in time series data to transfer knowledge from one domain (i.e., the source domain) to another domain (i.e., the target domain) so that the target domain can achieve better learning results. Usually, the source domain has a sufficient data volume and the target domain has a smaller data volume, and transfer learning needs to take the knowledge learned in the case of sufficient data volume and transfer it to the new environment with a small data volume.

Transfer is widespread in the learning of various knowledge, skills, and social norms. Transfer learning focuses on storing a solution model for an existing problem and using it for other different but related problems. The literature on network traffic prediction based on small samples is very sparse, mainly due to the difficulty of obtaining data. We aim to address this shortcoming by tackling this prediction problem with transfer learning and LSTM neural networks.
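To make the weight-transfer idea concrete, the sketch below shows the generic pattern in PyTorch: train a base network on the source data, copy its learned parameters into a fresh network of identical structure, and fine-tune on the (much smaller) target data. This is an illustrative sketch rather than the authors' released code; the tensors `src_x`, `src_y`, `tgt_x`, and `tgt_y` are hypothetical stand-ins for preprocessed source- and target-domain series.

```python
import torch
import torch.nn as nn

class LSTMRegressor(nn.Module):
    """Single-layer LSTM followed by a linear read-out."""
    def __init__(self, input_size=1, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                  # x: (batch, steps, features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])    # predict from the last time step

def fit(model, x, y, epochs=50, lr=0.02):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()

base = LSTMRegressor()
fit(base, src_x, src_y)                    # 1. train on the source domain

target = LSTMRegressor()                   # 2. build an identical architecture
target.load_state_dict(base.state_dict())  #    and transfer the learned weights

fit(target, tgt_x, tgt_y)                  # 3. fine-tune on scarce target data
```

The hyperparameters (Adam with a learning rate of 0.02, 50 epochs) follow the experimental setup reported in Section IV.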


III. NETWORK TRAFFIC PREDICTION ARCHITECTURE BASED ON LSTM AND TRANSFER LEARNING
In this section, a network traffic prediction architecture based on LSTM and transfer learning is built and displayed in Fig. 3. The architecture is divided into a data processing module, a model building module, and a parameter transfer module. The data processing module processes data into time series data more suitable for neural network models to capture features, which includes processing outliers, complementing missing values, scaling data, and building supervised data from the raw time series. The model building module constructs the LSTM neural network model and uses the processed data for training. The parameter transfer module transfers the parameters of the neural network model performing the source task to the neural network model performing the target task. The following sections describe how the data is preprocessed, how the model is built, and how the parameters are transferred.

FIGURE 3. Network traffic prediction architecture based on LSTM and transfer learning.

A. DATA PREPROCESSING
1) PROCESS OUTLIERS
An outlier is an observation that deviates too much from other observations; it is far from the general level of the series, may be generated by a different process, and is often a very large or very small value. Due to the complex network environment of the industrial Internet, outliers may be generated by errors in the data acquisition process, or may be caused by unreliable network equipment and unreliable network transmission. In the general data collection process, outliers appear frequently, often making it difficult to build the data model later. Therefore, outliers in the data set need to be processed: identify and remove the outliers, or use other values to replace them, to obtain a stable data set and better construct the data model.

The logic of the percentile algorithm is to sort the factor values in ascending or descending order and to process the factor values whose ranking percentile is higher than a set percentage or lower than a set percentage, similar to the practice of "removing the highest scores and the lowest scores" in some competitions. The set percentages need to be analyzed on a case-by-case basis. Due to the uncertainty of the percentages, this paper decided to use the median absolute deviation algorithm for the outliers.

The median absolute deviation (MAD) algorithm determines whether each element is an outlier by checking whether its deviation from the median value is within a reasonable range.
1. Calculate the median value of all elements: Xmedian.
2. Calculate the absolute deviation of each element from the median; for a single element Xi: bias_i = |Xi − Xmedian|.
3. Obtain the median value of the absolute deviations: MAD = median(bias).
4. Determine the parameter n; then all the data can be adjusted as (1).

X'_i = Xmedian + n·MAD,  if Xi > Xmedian + n·MAD;
X'_i = Xmedian − n·MAD,  if Xi < Xmedian − n·MAD;        (1)
X'_i = Xi,  if Xmedian − n·MAD ≤ Xi ≤ Xmedian + n·MAD.
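As a concrete illustration of steps 1–4, the following NumPy sketch clips a series to the bounds given by (1). It is a minimal example of the technique; n is treated as a user-chosen parameter, since the text does not fix a value here.

```python
import numpy as np

def mad_clip(x, n=3.0):
    """Clip outliers to the bounds Xmedian +/- n*MAD, as in (1)."""
    x = np.asarray(x, dtype=float)
    median = np.median(x)              # step 1: median of all elements
    bias = np.abs(x - median)          # step 2: absolute deviations
    mad = np.median(bias)              # step 3: median of the deviations
    lo, hi = median - n * mad, median + n * mad
    return np.clip(x, lo, hi)          # step 4: adjust the data as (1)

series = np.array([10.0, 12.0, 11.0, 300.0, 9.0, 13.0])
print(mad_clip(series))  # the spike at 300.0 is pulled down to the upper bound
```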


2) COMPLEMENTARY MISSING VALUES
There are many reasons for missing values. Broadly speaking, information may be temporarily unavailable; data may not be recorded, or may be omitted or lost due to human factors, which is the main reason for missing data; data may be lost due to the failure of data collection equipment, storage media, or transmission media; the cost of acquiring the information may be too high; or the real-time requirements of the system may demand that judgments or decisions be made quickly, before the information can be obtained. The presence of missing values causes the system to lose a large amount of useful information, making the certainty exhibited by the system weaker and the uncertainty component in the system more prominent. Data containing null values will cause the data analysis process to fall into chaos and lead to unreliable outputs.

To avoid the problems caused by missing values, the missing values are often removed to obtain a complete data set. Alternatively, other approaches are used for completion, such as the Mean/Mode Completer and K-means clustering. The Mean/Mode Completer method divides the attributes in the initial dataset into numerical and non-numerical attributes, to be processed separately. If the null value is numeric, the missing attribute is filled based on the average of the values of that attribute in all other objects; if the null value is non-numeric, the missing attribute is filled with the value that occurs most frequently in all other objects, based on the statistical principle of plurality.

Another similar method is the Conditional Mean Completer. In this method, the value used for averaging is not taken from all the objects in the data set, but from those that have the same decision attribute value as that object. The basic starting point of these two averaging methods is the same, namely to fill the missing attribute values with their maximum-probability values; they differ only a little in the specific method. Compared with the other methods, this approach uses the majority of the information from the existing data to infer the missing values. The dataset used in this paper is a network traffic dataset, so we use a more implementable approach. The k nearest distance method first determines the K nearest samples to the one with missing data, based on Euclidean distance or correlation analysis, and then uses the weighted average of these K values to estimate the missing data for that sample. In this method, k "neighbors" are first selected based on some distance measure, and their average values are used to interpolate the missing data. The distance measures vary depending on the type of data: 1. Continuous data: the most commonly used distance measures are the Euclidean distance, Manhattan distance, and cosine distance. 2. Categorical data: the Hamming distance is more commonly used in this case. For all categorical attributes, if the values of two data points are different, the distance between them is increased by one. The Hamming distance is thus the same as the number of attributes with different values.

3) DATA SCALING
Data scaling, in statistics, means that the original data are transformed mathematically so that they fall into a small specific interval, such as 0 to 1 or −1 to 1. The purpose is to eliminate the differences in characteristics, orders of magnitude, and other characteristic attributes between different samples and to transform them into dimensionless relative values, with the resulting values of each characteristic quantity being of the same order of magnitude. There are many methods of data scaling. Min-Max Normalization, also known as the extreme difference method, is the simplest way to deal with the magnitude problem; it scales the values of a column in the data set to between 0 and 1. It is calculated as (2). A single element is denoted as X, the minimum value in the dataset is denoted as Xmin, and the maximum value in the data set is denoted as Xmax.

X' = (X − Xmin)/(Xmax − Xmin).        (2)

This is a linear transformation of the original data. The Min-Max normalization method preserves the interrelationship between the original data, but if, after normalization, new input data exceeds the range of values of the original data, i.e., it is not in the original interval [Xmin, Xmax], an out-of-bounds error will be generated. Therefore, this method is suitable for cases where the range of values of the original data has been determined.

Mean normalization is similar to Min-Max normalization, with the difference that the value subtracted in the numerator is the mean value u. It can be calculated using (3).

X' = (X − u)/(Xmax − Xmin).        (3)

This method scales the data to the interval [−1, 1] with an average value of 0. In this paper, the data are scaled to [0, 1] using the extreme difference method.

4) RAW TIME SERIES TO CONSTRUCT SUPERVISED DATA
Supervised learning is a problem with an input variable (X) and an output variable (Y), where an algorithm can be used to learn the mapping function y = f(x) from x to y. The goal of the algorithm is to approximate the true mapping relationship well enough that, when new input data (X) is available, the output variable (Y) of that data can be predicted. A supervised learning problem is obtained by shifting the time series forward by one time step.

5) DATASET
There are two datasets used for the experiments in this paper: the "int" traffic dataset and the "isp" traffic dataset. The "int" traffic dataset was collected from 09:30 on November 19, 2004, to 11:11 on January 27, 2005. As shown in Fig. 4, data were collected every five minutes. The "isp" traffic dataset is from a private ISP with centers in 11 European cities. These data correspond to a transatlantic line and were collected from 06:57 on June 7, 2005, to 11:17 on July 31, 2005. Data were collected every five minutes, as shown in Fig. 5.

FIGURE 4. Dataset of academic backbone network traffic in the UK.
FIGURE 5. Dataset of core network traffic in a European city.
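The two preprocessing steps that feed the model, extreme difference scaling as in (2) and the one-step shift that turns the raw series into supervised pairs, can be sketched as follows. This is an illustrative example; `series` is a hypothetical 1-D traffic array, and the window length of 10 mirrors the step size used in Section IV.

```python
import numpy as np

def min_max_scale(x):
    """Scale a series to [0, 1] as in (2): X' = (X - Xmin)/(Xmax - Xmin)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def to_supervised(x, window=10):
    """Build (input window, next value) pairs by shifting the series one step."""
    inputs, targets = [], []
    for i in range(len(x) - window):
        inputs.append(x[i:i + window])   # the previous `window` observations
        targets.append(x[i + window])    # the value one step ahead
    return np.array(inputs), np.array(targets)

scaled = min_max_scale(series)           # `series`: hypothetical 1-D traffic data
X, y = to_supervised(scaled, window=10)  # X: (N, 10), y: (N,)
```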


B. MODEL BUILDING
In this paper, we use the LSTM network to construct a network traffic prediction model. The input of the neural network based on transfer learning is the network traffic of the backbone network at the previous time, and the output result is the network traffic at the latter time. After completing the corresponding training, the network traffic of the core network at the previous time is used as the input, and the output result is the traffic of the core network at the later time; the LSTM network traffic prediction model based on transfer learning is obtained after the training is completed. The neural network model based on transfer learning is shown in Fig. 6. The input of the directly trained neural network is the network traffic of the core network at the previous time, and the output result is the network traffic at the later time; the directly trained LSTM network traffic prediction model is obtained after the training is completed.

FIGURE 6. Network traffic prediction model based on LSTM and transfer learning.

The forward propagation algorithm of LSTM is shown in Fig. 6. Update the output of the forget gate: the forget gate controls whether to forget the hidden cell state of the previous layer. The input is the hidden state ht−1 of the previous moment and the input data xt of the current moment; defining the weight Wf, the bias bf, and the weight Uf, and applying a selected activation function, generally the sigmoid, the output ft of the forget gate can be obtained. Since the sigmoid function has an output between [0, 1], the output ft here represents the probability of forgetting the state of the hidden cell in the previous layer.

Update the two outputs of the input gate: the input gate consists of two parts. The first part defines the weight Wi, the bias bi, and the weight Ui, and then uses the sigmoid activation function; its output is it. The second part defines the weight WC, the bias bC, and the weight UC, and uses the tanh activation function; its output is C̃t. The two outputs are multiplied together to update the cell state.

Update the cell state: the cell state Ct consists of two parts. The first part is the product of Ct−1 and the output ft of the forget gate, and the second part is the product of it and C̃t from the input gate.

Update the output gate output: the update of the hidden state ht consists of two parts. The first part is ot, which is obtained from the hidden state ht−1 of the previous moment and the input data xt of the current moment, defining the weight Wo, the bias bo, the weight Uo, and the sigmoid activation function; the second part consists of the cell state Ct and the tanh activation function.

The last step is to update the predicted output of the current moment: define the weight V and the bias c, and then apply an activation function, generally the sigmoid function, to get the predicted output of the current moment.

The backpropagation algorithm of LSTM is also illustrated in Fig. 6. It defines L as the loss function and updates the parameters through the chain rule of derivatives until the stopping condition is satisfied. Although the structure of LSTM is quite complex, we can use it effectively with some API support.
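For reference, the gate updates described above can be written compactly in the standard LSTM form, using the same symbols (Wf, Uf, bf, and so on) as the text; σ denotes the sigmoid function and ⊙ element-wise multiplication. The final read-out follows the text's choice of a sigmoid output activation.

ft = σ(Wf·ht−1 + Uf·xt + bf)        (forget gate)
it = σ(Wi·ht−1 + Ui·xt + bi)        (input gate, part 1)
C̃t = tanh(WC·ht−1 + UC·xt + bC)    (input gate, part 2)
Ct = ft ⊙ Ct−1 + it ⊙ C̃t           (cell state)
ot = σ(Wo·ht−1 + Uo·xt + bo)        (output gate)
ht = ot ⊙ tanh(Ct)                  (hidden state)
ŷt = σ(V·ht + c)                    (predicted output)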


The network traffic prediction model based on LSTM and transfer learning constructed in this paper uses the mean squared error (MSE) as the loss function. In mathematical statistics, the mean squared error refers to the expected value of the square of the difference between the estimated value of a parameter and its true value. MSE can evaluate the degree of change in the data: the smaller the value of MSE, the better the accuracy of the prediction model in describing the experimental data. Moreover, as the error decreases, the gradient also decreases, which is beneficial to convergence; even with a fixed learning rate, the model can converge to the minimum value faster. It can be calculated by (4). The actual value is represented by yi in the equation, the predicted value is represented by ŷi, and the amount of data in the data set is denoted by m. The model uses Adam as the optimizer and sets the learning rate to 0.02. The main advantage of Adam is that, after bias correction, the learning rate of each iteration stays within a certain range, which makes the parameters relatively stable.

MSE = (1/m) Σ_{i=1}^{m} (yi − ŷi)².        (4)

C. PARAMETER TRANSFER
1) MODEL SAVING
In the use of transfer learning, the data in the source domain is used to train the model to get a better model, but in an actual application it is not possible to train it first and then use it, as this would increase the time consumption. Therefore, it is possible to save the previously trained model and then load it when it is needed. One way is to save the whole model and then load it directly, but this consumes more memory; the other way is to save only the parameters of the model. In that case, all we have to do is save the parameter dictionary, then create a new model with the same structure when we need it and import the saved parameters into the new model.

2) MODEL LOADING
A neural network is an operational model that consists of a large number of nodes and their mutual connections. Each node represents a specific output function, called the activation function. Each connection between two nodes represents a weighted value for the signal passing through the connection, called the weight, which is equivalent to the memory of an artificial neural network. Depending on the model saving method, there are some differences in the model loading method. Saving the complete model means that both the model structure and the model parameters are saved; when loading the model, you can choose to load all of them or transfer only the model parameters to a model with the same structure as the original model. When only the model parameters are saved, you can only reconstruct a model with the same structure as the original model and then transfer the parameters; you cannot load the model directly. The method proposed in this paper saves the complete model and chooses to load all of it when the model is loaded.
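The two saving strategies described above map directly onto the standard PyTorch calls. The sketch below is illustrative; `LSTMRegressor` and `base` are the hypothetical model class and trained source model from the earlier sketch.

```python
import torch

# Option 1: save the complete model (structure + parameters); larger on disk,
# but it can be loaded back in a single step, as done in this paper.
torch.save(base, "source_model.pt")
model = torch.load("source_model.pt")

# Option 2: save only the parameter dictionary; a model with the same
# structure must be rebuilt before the weights can be imported.
torch.save(base.state_dict(), "source_weights.pt")
model = LSTMRegressor()
model.load_state_dict(torch.load("source_weights.pt"))
```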


IV. EVALUATION RESULTS
This section establishes two network traffic prediction models based on LSTM, namely the transfer learning model and the direct training model. After the training is completed, the network traffic prediction models are obtained and their loss function curves are drawn. The advantages of the transfer learning model in the training process are demonstrated by comparing the loss function curves. The prediction task is then performed with the two models separately, and the results are listed in tables. By comparing the evaluation indexes of the two models, the advantages of the transfer learning model in the prediction process are demonstrated.

A. EXPERIMENTAL SETUP
We performed ablation experiments to ensure that the parameters used were superior before building the transfer learning model and the direct training model.

1) TRANSFER LEARNING MODEL
1. An LSTM neural network model is constructed using PyTorch, with the step size set to 10, i.e., the first 10 rows of the dataset are used for prediction; the batch size set to 10, i.e., the number of samples processed in each batch is 10; and the input size set to 1. The model optimizer is Adam, and the learning rate is set to 0.02.
2. Intercept 10,000 rows of samples from the "int" dataset in the academic backbone network domain to train it. Set the number of epochs to 50.
3. Transfer the model to the core network domain and train it again using 1000 rows of samples from the "isp" dataset. Set the number of epochs to 50.
4. Test the prediction accuracy of the model with 10,000 samples from the "isp" dataset.

2) DIRECT TRAINING MODEL
1. An LSTM neural network model is constructed using PyTorch, with the step size set to 10, the batch size set to 10, and the input size set to 1. The model optimizer is Adam, and the learning rate is set to 0.02.
2. Train with 1000 rows of samples from the "isp" dataset. Set the number of epochs to 50.
3. Test the prediction accuracy of the model with samples from the 10,000-row "isp" dataset.

B. EXPERIMENTAL EFFECT DEMONSTRATION
In this paper, the prediction accuracy of the transfer learning model and the direct training model after 50 epochs of training using 1000 rows of data is shown in TABLE 1, and it can be seen that the transfer learning model shows better results than the direct training model.

TABLE 1. Prediction accuracy of the transfer learning model and the direct training model (1000 rows).

The root mean square error (RMSE) is the square root of the ratio of the sum of squared deviations of the predicted values from the true values to the number of observations. It measures the deviation between the predicted and true values and is more sensitive to outliers in the data. The calculation of RMSE is shown in (5). The actual value is represented by yi in the equation, the predicted value is represented by ŷi, and the amount of data in the data set is denoted by m.

RMSE = √[ (1/m) Σ_{i=1}^{m} (yi − ŷi)² ].        (5)

The loss function curves of the transfer learning model and the direct training model are printed in Figs. 7a and 7d, from which it can be seen that, after removing outliers, the loss function of the transfer learning model starts to decrease from around a value of 0.08, while the loss function of the direct training model starts to decrease from around 0.4. This shows that the transfer learning model learns knowledge during training in the source domain that can be applied to the target domain to better perform the target-domain tasks. Therefore, the transfer learning model has both a better starting point and a better ending point in the training process.

FIGURE 7. Loss functions for different training data volumes for the transfer learning model and the direct training model.

The implementation of transfer learning comes from the similarity between the source domain and the target domain. The data in DT can be learned with the knowledge in DS and, similarly, the data in DS can be learned with the knowledge in DT; we list the evidence for reverse transfer in TABLE 2. In this paper, the transfer of learned knowledge from the backbone network to the core network is called forward transfer and, conversely, the transfer of learned knowledge from the core network to the backbone network is called reverse transfer.

TABLE 2. Prediction accuracy of reverse transfer (1000 rows).

We change the amount of training data to 100 rows and 10 rows to see the effect. The loss function curves of the transfer learning model and the direct training model are printed in Figs. 7b, 7e, 7c, and 7f. From the figures, we can see that the starting point of the loss function during training becomes higher for both the transfer learning model and the direct training model after reducing the amount of training data. Although the training effect of both models decreases, the training effect of the direct training model decreases more. The curve of the transfer learning model, although more unstable, has a lower starting point and reaches a lower endpoint at the end of training. The accuracy of the transfer learning model and the direct training model after the amount of training data changes is given in TABLE 3 and TABLE 4.

TABLE 3. Prediction accuracy of the transfer learning model and the direct training model (100 rows).
TABLE 4. Prediction accuracy of the transfer learning model and the direct training model (10 rows).

As can be seen from the tables, after reducing the amount of training data, the prediction effects of both the transfer learning model and the direct training model decrease, and the prediction effect of the direct training model decreases more significantly. When the amount of training data decreases to a certain extent, the prediction model loses its prediction ability. Therefore, the use of the transfer model can reduce the impact of the reduction in the amount of training data on the prediction ability, so that an acceptable level of error can be obtained using less data.
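As a minimal illustration, the two error metrics reported in the tables, MSE from (4) and RMSE from (5), can be computed as follows; `y_true` and `y_pred` are hypothetical arrays of actual and predicted traffic values.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error, as in (4)."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def rmse(y_true, y_pred):
    """Root mean squared error, as in (5)."""
    return np.sqrt(mse(y_true, y_pred))
```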


V. CONCLUSION
In this paper, we construct a network traffic prediction architecture based on LSTM and transfer learning, apply transfer learning to continuous time series problems, and build a prediction model with good performance in a network traffic prediction scenario. From the forward transfer experiment and the reverse transfer experiment, we can see that the knowledge acquired from the source domain can be applied in the target domain, while the knowledge acquired from the target domain can also be applied in the source domain, so the source and target domains are similar. The results of the comparison experiments show that the transfer learning model has better starting and ending points than the direct training model in the training process with the same amount of data. Compared with the direct training model without transfer learning, the performance of the transfer learning model can be improved by more than 40% in completing the target task after training with the source domain data, which leads to the performance improvement of the network traffic prediction task.

REFERENCES
[1] Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2017–2022, Cisco, San Jose, CA, USA, 2019.
[2] X. Zhang, Y. Wang, M. Yang, and G. Geng, "Toward concurrent video multicast orchestration for caching-assisted mobile networks," IEEE Trans. Veh. Technol., vol. 70, no. 12, pp. 13205–13220, Dec. 2021, doi: 10.1109/TVT.2021.3119429.
[3] X. Zhang, Y. Wang, G. Geng, and J. Yu, "Delay-optimized multicast tree packing in software-defined networks," IEEE Trans. Services Comput., early access, Aug. 20, 2021, doi: 10.1109/TSC.2021.3106264.
[4] N. Sadek and A. Khotanzad, "Multi-scale high-speed network traffic prediction using k-factor Gegenbauer ARMA model," in Proc. IEEE Int. Conf. Commun., Jun. 2004, pp. 2148–2152.
[5] H. Z. Moayedi and M. A. Masnadi-Shirazi, "ARIMA model for network traffic prediction and anomaly detection," in Proc. Int. Symp. Inf. Technol., vol. 4, Aug. 2008, pp. 1–6.
[6] C. G. Dethe and D. G. Wakde, "On the prediction of packet process in network traffic using FARIMA time-series model," J. Indian Inst. Sci., vol. 84, nos. 1–2, p. 31, 2013.
[7] W. Chen, Z. Shang, and Y. Chen, "A novel hybrid network traffic prediction approach based on support vector machines," J. Comput. Netw. Commun., vol. 2019, pp. 1–10, Feb. 2019.
[8] X. Xiao, H. Duan, and J. Wen, "A novel car-following inertia gray model and its application in forecasting short-term traffic flow," Appl. Math. Model., vol. 87, pp. 546–570, Nov. 2020.
[9] Y. Xu, F. Yin, W. Xu, J. Lin, and S. Cui, "Wireless traffic prediction with scalable Gaussian process: Framework, algorithms, and verification," IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1291–1306, Jun. 2019.
[10] B. G. Çetiner, M. Sari, and O. Borat, "A neural network based traffic-flow prediction model," Math. Comput. Appl., vol. 15, no. 2, pp. 269–278, 2010.
[11] A. Sherstinsky, "Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network," Phys. D, Nonlinear Phenomena, vol. 404, Mar. 2020, Art. no. 132306.
[12] R. W. Liu, M. Liang, J. Nie, W. Y. B. Lim, Y. Zhang, and M. Guizani, "Deep learning-powered vessel trajectory prediction for improving smart traffic services in maritime Internet of Things," IEEE Trans. Netw. Sci. Eng., early access, Jan. 7, 2022, doi: 10.1109/TNSE.2022.3140529.
[13] R. W. Liu, M. Liang, J. Nie, Y. Yuan, Z. Xiong, H. Yu, and N. Guizani, "STMGCN: Mobile edge computing-empowered vessel trajectory prediction using spatio-temporal multi-graph convolutional network," IEEE Trans. Ind. Informat., early access, Apr. 8, 2022, doi: 10.1109/TII.2022.3165886.
[14] H. D. Trinh, L. Giupponi, and P. Dini, "Mobile traffic prediction from raw data using LSTM networks," in Proc. IEEE 29th Annu. Int. Symp. Pers., Indoor Mobile Radio Commun. (PIMRC), Sep. 2018, pp. 1827–1832.
[15] N. Ramakrishnan and T. Soni, "Network traffic prediction using recurrent neural networks," in Proc. 17th IEEE Int. Conf. Mach. Learn. Appl. (ICMLA), Dec. 2018, pp. 187–193.
[16] H. Lu and F. Yang, "Research on network traffic prediction based on long short-term memory neural network," in Proc. IEEE 4th Int. Conf. Comput. Commun. (ICCC), Dec. 2018, pp. 1109–1113.
[17] K. Weiss, T. M. Khoshgoftaar, and D. Wang, "A survey of transfer learning," J. Big Data, vol. 3, no. 1, pp. 1–40, Dec. 2016.
[18] S. Pan and Q. Yang, "A survey on transfer learning," IEEE Trans. Knowl. Data Eng., vol. 22, pp. 1345–1359, Nov. 2010.
[19] H. Ismail Fawaz, G. Forestier, J. Weber, L. Idoumghar, and P.-A. Müller, "Transfer learning for time series classification," in Proc. IEEE Int. Conf. Big Data (Big Data), Dec. 2018, pp. 1367–1376.
[20] K. Kashiparekh, J. Narwariya, P. Malhotra, L. Vig, and G. Shroff, "ConvTimeNet: A pre-trained deep convolutional neural network for time series classification," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2019, pp. 607–612.
[21] D. Kearney, S. McLoone, and T. E. Ward, "Investigating the application of transfer learning to neural time series classification," in Proc. 30th Irish Signals Syst. Conf. (ISSC), Jun. 2019, pp. 1–5.
[22] Y. Yu, X. Si, C. Hu, and Z. Jianxun, "A review of recurrent neural networks: LSTM cells and network architectures," Neural Comput., vol. 31, no. 7, pp. 1235–1270, Jul. 2019.

XIANBIN WAN was born in Heze, Shandong, China, in 1998. He received the B.S. degree from the Qilu University of Technology, where he is currently pursuing the M.S. degree. His main research interests include machine learning, cloud computing, and network resource management.

HUI LIU was born in Linyi, Shandong, China, in 1995. He received the B.S. degree from Weifang University. He is currently pursuing the M.S. degree with the Qilu University of Technology. His main research interests include computer networks, networked control systems, and computer network reliability.

HAO XU was born in Weifang, Shandong, China, in 1997. He received the B.S. degree from the Qilu University of Technology, where he is currently pursuing the M.S. degree. His main research interests include computer networks, computer network reliability, and machine learning.

XINCHANG ZHANG (Senior Member, IEEE) received the M.S. degree from the Shandong University of Science and Technology, China, in 2005, and the Ph.D. degree from the Computer Network Information Center, Chinese Academy of Sciences, China, in 2010. He is currently a Professor at the Qilu University of Technology (Shandong Academy of Sciences). He has over 40 papers in research journals, such as IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS (JSAC), IEEE TRANSACTIONS ON SERVICES COMPUTING (TSC), IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY (TVT), and IEEE Communications Magazine, and in international conference proceedings. His research interests include network protocols and architectures, and cloud computing. He won the Shandong (in China) Science and Technology Progress Awards in 2013, 2018, and 2019.
