Long Term 5G Network Traffic Forecasting Via Modeling Non-Stationarity with Deep Learning
https://doi.org/10.1038/s44172-023-00081-4 OPEN
5G cellular networks have recently fostered a wide range of emerging applications, but their
popularity has led to traffic growth that far outpaces network expansion. This mismatch may
decrease network quality and cause severe performance problems. To reduce the risk,
operators need long term traffic prediction to perform network expansion schemes months
ahead. However, long term prediction horizon exposes the non-stationarity of series data,
which deteriorates the performance of existing approaches. We deal with this problem by
developing a deep learning model, Diviner, that incorporates stationary processes into a well-
designed hierarchical structure and models non-stationary time series with multi-scale stable
features. We demonstrate substantial performance improvement of Diviner over the current
state of the art in 5G network traffic forecasting with detailed months-level forecasting for
massive ports with complex flow patterns. Extensive experiments further present its
applicability to various predictive scenarios without any modification, showing potential to
address broader engineering problems.
1 Beihang University, 100191 Beijing, China. 2 Zhongguancun Laboratory, 100094 Beijing, China. 3 China Unicom, 100037 Beijing, China. 4 University at Buffalo,
14260 Buffalo, NY, USA. 5These authors contributed equally: Yuguang Yang and Shupeng Geng. ✉email: [email protected]; [email protected]
5G technology has recently gained popularity worldwide for its faster transfer speed, broader bandwidth, reliability, and security. 5G technology can achieve a 20× faster theoretical peak speed over 4G with lower latency, promoting applications like online gaming, HD streaming services, and video conferences1–3. The development of 5G is changing the world at an incredible pace and fostering emerging industries such as telemedicine, autonomous driving, and extended reality4–6. These and other industries are estimated to bring a 1000-fold boost in network traffic, requiring additional capacity to accommodate these growing services and applications7. Nevertheless, 5G infrastructure, such as board cards and routers, must be deployed and managed with strict cost considerations8,9. Therefore, operators often adopt a distributed architecture to avoid massive back-to-back devices and links among fragmented networks10–13. As shown in Fig. 1a, the emerging metropolitan router is the hub that links urban access routers, where services can be accessed and integrated effectively. However, the construction cycle of 5G devices requires about three months to schedule, procure, and deploy. Planning new infrastructure requires accurate network traffic forecasts months ahead to anticipate the moment when capacity utilization surpasses the preset threshold, where overloaded capacity utilization might ultimately lead to performance problems. Another issue concerns the resource excess caused by building coarse-grained 5G infrastructures. To mitigate these hazards, operators formulate network expansion schemes months ahead with long-term network traffic prediction, which can facilitate long-period planning for upgrading and scaling the network infrastructure and prepare it for the next planning period14–17.

In industry, a common practice is to calculate network traffic's potential growth rate by analyzing historical traffic data18. However, this approach cannot scale to predict the demand for new services and is less than satisfactory for long-term forecasting. Prediction-based methods have been introduced to solve this dilemma by exploring the potential dependencies involved in historical network traffic, which provides both a constraint and a source for assessing future traffic volume. Network planners can harness these dependencies to extrapolate traffic forecasts long enough to develop sustainable expansion schemes and mitigation strategies. The key issue in this task is to obtain an accurate long-term network traffic prediction. However, directly extending the prediction horizon of existing methods is ineffective for long-term forecasting since these methods suffer a severe performance degeneration, where the long-term prediction horizon exposes the non-stationarity of time series. This inherent non-stationarity of real-world time series data is caused by multi-scale temporal variations, random perturbations, and outliers, which present various challenges, summarized as follows. (a) Multi-scale temporal variations. Multi-scale (daily/weekly/monthly/yearly) variations throughout long-term time series indicate multi-scale non-stationary latent patterns within the time series, which should be taken into account comprehensively in the model design. The seasonal component, for example, merely presents variations at particular scales.
Fig. 1 Schematic illustration for the workflow of Diviner. a We collect the data from MAR–MER links. The orange cylinder depicts the metropolitan emerging routers (MER), and the pale blue cylinder depicts metropolitan accessing routers (MAR). b The illustration of the introduced 2D → 3D transformation process. Specifically, given a time series of network traffic data spanning K days, we construct a time series matrix $\widetilde{X} = [\tilde{x}_1\ \tilde{x}_2\ \cdots\ \tilde{x}_K]$, where each $\tilde{x}_i$ represents the traffic data for a single day of length T. The resulting 3D plot displays time steps across each day, daily time steps, and inbits traffic along the x, y, and z axes, respectively, with the inbits traffic standardized. The blue line in the 2D plot and the side near the origin of the pale red plane in the 3D plot represent historical network traffic, while the yellowish line in the 2D plot and the side far from the origin of the pale red plane in the 3D plot represent the future network traffic to predict. c The overall working flow of the proposed Diviner. The blue solid line indicates the data stream direction. Both the encoder and decoder blocks of Diviner contain a smoothing filter attention mechanism (yellowish block), a difference attention module (pale purple block), a residual structure (pale green block), and a feed-forward layer (gray block). Finally, a one-step convolution generator (magenta block) is employed to convert the dynamic decoding into a sequence-generating procedure.
(b) Random factors. Random perturbations and outliers interfere with the discovery of stable regularities, which entails higher robustness in prediction models. (c) Data distribution shift. Non-stationarity of the time series inevitably results in a dataset shift problem with the input data distribution varying over time. Figure 1b illustrates these challenges.

Next, we review the shortcomings of existing methods concerning addressing non-stationarity issues. Existing time series prediction methods generally fall into two categories, conventional models and deep learning models. Most conventional models, such as ARIMA19,20 and HoltWinters21–25, are built with some insight into the time series but implemented linearly, causing problems for modeling non-stationary time series. Furthermore, these models rely on manually tuned parameters to fit the time series, which impedes their application in large-scale prediction scenarios. Although Prophet26 uses nonlinear modular and interpretive parameters to address these problems, its hand-crafted nonlinear modules struggle to model non-stationary time series, whose complex patterns make it inefficient to embed diverse factors in hand-crafted functions. This dilemma boosts the development of deep learning methods. Deep learning models can utilize multiple layers to represent latent features at a higher and more abstract level27, enabling us to recognize deep latent patterns in non-stationary time series. Recurrent neural networks (RNNs) and Transformer networks are two main deep learning forecasting frameworks. RNN-based models28–34 feature a feedback loop that allows models to memorize historical data and process variable-length sequences as inputs and outputs, which calculates the cumulative dependency between time steps. Nevertheless, such indirect modeling of temporal dependencies cannot disentangle information from different scales within historical data and thus fails to capture multi-scale variations within non-stationary time series. Transformer-based models35–37 solve this problem using a global self-attention mechanism rather than feedback loops. Doing so enhances the network's ability to capture longer dependencies and interactions within series data and thus brings exciting progress in various time series applications38. For more efficient long-term time series processing, some studies39–41 turn the self-attention mechanism into a sparse version. However, despite their promising long-term forecasting results, time series' specialization is not taken into account during their modeling process, where the varying distributions of non-stationary time series deteriorate their predictive performances. Recent research attempts to incorporate time series decomposition into deep learning models42–47. Although their results are encouraging and bring more interpretive and reasonable predictions, their limited decomposition, e.g., trend-seasonal decomposition, reverses the correlation between components and merely presents the variation of time series at particular scales.

In this work, we incorporate deep stationary processes into neural networks to achieve precise long-term 5G network traffic forecasts, where stochastic process theories can guarantee the prediction of stationary events48–50. Specifically, as shown in Fig. 1c, we develop a deep learning model, Diviner, that incorporates stationary processes into a well-designed hierarchical structure and models non-stationary time series with multi-scale stable features. To validate the effectiveness, we collect an extensive network port traffic dataset (NPT) from the intelligent metropolitan network delivering 5G services of China Unicom and compare the proposed model with numerous current arts over multiple applications. We make two distinct research contributions to time series forecasting: (1) We explore an avenue to solve the challenges presented in long-term time series prediction by modeling non-stationarity in the deep learning paradigm. This line is much more universal and effective than the previous works incorporating temporal decomposition for their limited decomposition that merely presents the temporal variation at particular scales. (2) We develop a deep learning framework with a well-designed hierarchical structure to model the multi-scale stable regularities within non-stationary time series. In contrast to previous methods employing various modules in the same layer, we perform a dynamical scale transformation between different layers and model stable temporal dependencies in the corresponding layer. This hierarchical deep stationary process synchronizes with the cascading feature embedding of deep neural networks, which enables us to capture complex regularities contained in the long-term histories and achieve precise long-term network traffic forecasting. Our experiment demonstrates that the robustness and predictive accuracy significantly improve as we consider more factors concerning non-stationarity, which provides an avenue to improve the long-term forecast ability of deep learning methods. Besides, we also show that the modeling of non-stationarity can help discover nonlinear latent regularities within network traffic and achieve a quality long-term 5G network traffic forecast for up to three months. Furthermore, we expand our solution to climate, control, electricity, economic, energy, and transportation fields, which shows the applicability of this solution to multiple predictive scenarios, showing valuable potential to solve broader engineering problems.

Results
Diviner with deep stationary processes. In this Section, we introduce our proposed deep learning model, Diviner, which tackles the non-stationarity of long-term time series prediction with deep stationary processes, capturing multi-scale stable features and modeling multi-scale stable regularities to achieve long-term time series prediction.

Smoothing filter attention mechanism as a scale converter. As shown in Fig. 2a, the smoothing filter attention mechanism adjusts the feature scale and enables Diviner to model time series from different scales and access the multi-scale variation features within non-stationary time series. We build this component based on Nadaraya–Watson regression51,52, a classical algorithm for non-parametric regression. Given the sample space $\Omega = \{(x_i, y_i) \mid 1 \le i \le n,\ x_i \in \mathbb{R},\ y_i \in \mathbb{R}\}$, window size h, and kernel function K(⋅), the Nadaraya–Watson regression has the following expression:

$$\hat{y} = \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) y_i \Big/ \sum_{j=1}^{n} K\!\left(\frac{x - x_j}{h}\right), \qquad (1)$$

where the kernel function K(⋅) is subject to $\int_{-\infty}^{\infty} K(x)\,dx = 1$ and n, x, y denote sample size, independent variable, and dependent variable, respectively.

The Nadaraya–Watson regression estimates the regression value $\hat{y}$ using a local weighted average method, where the weight of a sample $(x_i, y_i)$, $K\!\left(\frac{x - x_i}{h}\right) / \sum_{j=1}^{n} K\!\left(\frac{x - x_j}{h}\right)$, decays with the distance of $x_i$ from x. Consequently, the primary sample $(x_i, y_i)$ is closer to samples in its vicinity. This process implies the basic notion of scale transformation, where adjacent samples get closer on a more significant visual scale. Inspired by this thought, we can reformulate the Nadaraya–Watson regression from the perspective of scale transformation. We incorporate it into the attention structure to design a learnable scale adjustment unit. Concretely, we introduce the smoothing filter attention mechanism with a learnable kernel function and self-masked operation, where the former shrinks (or magnifies) variations for adaptive feature-scale adjustment, and the latter eliminates outliers. To ease understanding, we consider the 1D time series case here, and the high-dimensional case can be easily extrapolated (shown mathematically in Section "Methods").
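As a reference point for the scale-transformation view above, the following self-contained NumPy sketch implements the classical Nadaraya–Watson estimator of Eq. (1) with a Gaussian kernel. The kernel choice, window size, and toy data are illustrative assumptions; this is not the learnable kernel used inside Diviner.

```python
import numpy as np

def gaussian_kernel(u: np.ndarray) -> np.ndarray:
    """Standard normal density; integrates to 1, as Eq. (1) requires."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def nadaraya_watson(x_query: float, x: np.ndarray, y: np.ndarray, h: float) -> float:
    """Locally weighted average of y: weights decay with |x_query - x_i| / h."""
    weights = gaussian_kernel((x_query - x) / h)
    return float(np.sum(weights * y) / np.sum(weights))

# Toy usage: smooth a noisy daily traffic curve at one time step.
t = np.arange(96, dtype=float)                       # 15-minute steps in a day
y = np.sin(2 * np.pi * t / 96) + 0.1 * np.random.randn(96)
print(nadaraya_watson(48.0, t, y, h=4.0))
```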
Fig. 2 Illustration of the structure of smoothing filter attention mechanism and difference attention module. a This panel displays the smoothing filter attention mechanism, which involves computing adaptive weights K(ξi, ξj) (orange block) and employing a self-masked structure (gray block with dashed lines) to filter out the outliers, where ξi denotes the ith embedded time series period (yellow block). The adaptive weights serve to adjust the feature scale of the input series and obtain the scale-transformed period embedding hi (pink block). b This diagram illustrates the difference attention module. The Matrix-Difference Transformation (pale blue block) subtracts adjacent columns of a matrix to obtain the shifted query, key, and value items (ΔQ, ΔK, and ΔV). Then, an autoregressive multi-head self-attention is performed (in the pale blue background) to capture the correlation of time series across different time steps, resulting in $\widetilde{V}^{(i)}_s$ for the ith attention head. Here, $Q^{(i)}_s$, $K^{(i)}_s$, $V^{(i)}_s$, and $\widetilde{V}^{(i)}_s$ represent the query, key, value, and resulting items, respectively. The SoftMax is applied to the scaled dot-product between the query and key vectors to obtain attention weights (the pale yellow block). The formula for the SoftMax function is $\mathrm{SoftMax}(k_i) = e^{k_i} / \sum_{j=1}^{n} e^{k_j}$, where $k_i$ is the ith element of the input vector, and n is the length of the input vector. Lastly, the Matrix-CumSum operation (light orange block) accumulates the shifted features using the ConCat operation, and Ws denotes the learnable aggregation parameters.
Given the time step $t_i$, we estimate its regression value $\hat{y}_i$ with an adaptive-weighted average of the values $\{y_t \mid t \ne t_i\}$, $\hat{y}_i = \sum_{j \ne i} \alpha_j y_j$, where the adaptive weights α are obtained by a learnable kernel function f. The punctured window $\{t_j \mid t_j \ne t_i\}$ of size n − 1 denotes our self-masked operation, and $f(y_i, y)_{w_i} = \exp(-w_i (y_i - y)^2)$, $\alpha_i = f(y_i, y)_{w_i} / \sum_{j \ne i} f(y_j, y)_{w_i}$. Our adaptive weights vary with the inner variation $\{(y_i - y)^2 \mid t_i \ne t\}$ (decreased or increased), which adjusts (shrinking or magnifying) the distance of points across each time step and achieves an adaptive feature-scale transformation. Specifically, the minor variation gets further shrunk at a large feature scale, magnified at a small feature scale, and vice versa. Concerning random components, global attention can serve as an average smoothing method to help filter small perturbations. As for outliers, their large margin against regular items leads to minor weights, which eliminates the interference of outliers. Especially when the sample $(t_i, y_i)$ comes to be an outlier, this structure brushes itself aside. Thus, the smoothing filter attention mechanism filters out random components and dynamically adjusts feature scales. This way, we can dynamically transform non-stationary time series according to different scales, which accesses time series under comprehensive sights.

Difference attention module to discover stable regularities. The difference attention module calculates the internal connections among stable shifted features to discover stable regularities within the non-stationary time series and thereby overcomes the interference of uneven distributions. Concretely, as shown in Fig. 2b, this module includes the difference and CumSum operations at both ends of the self-attention mechanism35, which interconnects the shift across each time step to capture internal connections within non-stationary time series. The difference operation separates the shifts from the long-term trends, where the shift refers to the minor difference in the trends between adjacent time steps. Considering that trends lead the data distribution to change over time, the difference operation makes the time series stable and lets it vary around a fixed mean level with minor distribution shifts. Subsequently, we use a self-attention mechanism to interconnect shifts, which captures the temporal dependencies within the time series variation. Last, we employ a CumSum operation to accumulate shifted features and generate a non-stationary time series conforming to the discovered regularities.

Modeling and generating non-stationary time series in the Diviner framework. The smoothing filter attention mechanism filters out random components and dynamically adjusts the feature scale. Subsequently, the difference attention module calculates internal connections and captures the stable regularity within the time series at the corresponding scale. Cascading these two modules, one Diviner block can discover stable regularities within non-stationary time series at one scale. Then, we stack Diviner blocks in a multilayer structure to achieve multi-scale transformation layers and capture multi-scale stable features from non-stationary time series. Such a multilayer structure is organized in an encoder-decoder architecture with asymmetric input lengths for efficient data utilization. The encoder takes a long historical series to embed trends, and the decoder receives a relatively short time series. With the cross-attention between the encoder and decoder, we can pair the latest time series with pertinent variation patterns from long historical series and make inferences about future trends, improving calculation efficiency and reducing redundant historical information. The point is that the latest time series is more conducive to anticipating the immediate future than the remote-past time series, where the correlation across time steps generally degrades with the length of the interval53–57. Additionally, we design a generator to obtain prediction results in one step to avoid dynamic cumulative error problems39. The generator is built with a ConvNet sharing parameters across each time step, based on the linear projection generator39,58,59, which saves hardware resources.
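The PyTorch-style sketch below illustrates how the two modules could be cascaded into one block and stacked, as described above. It is a structural illustration under our own simplifying assumptions (single-head attention, a scalar-kernel variant of the smoothing filter, no residual or feed-forward details), not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class DivinerBlockSketch(nn.Module):
    """One block: smoothing filter attention (scale converter) followed by a
    difference attention module (difference -> self-attention -> CumSum)."""
    def __init__(self, d_model: int, n_steps: int):
        super().__init__()
        self.kernel_w = nn.Parameter(torch.ones(n_steps, d_model))   # learnable kernel weights w_i
        self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)

    def smoothing_filter(self, e: torch.Tensor) -> torch.Tensor:
        # e: (batch, K, d_model); kernel logits from squared differences of period embeddings.
        diff2 = (e.unsqueeze(2) - e.unsqueeze(1)) ** 2                # (B, K, K, d)
        logits = -(self.kernel_w.unsqueeze(1) * diff2).sum(-1)        # (B, K, K)
        mask = torch.eye(e.size(1), dtype=torch.bool, device=e.device)
        logits = logits.masked_fill(mask, float("-inf"))              # self-mask: a step never attends to itself
        return torch.softmax(logits, dim=-1) @ e                      # scale-transformed embeddings

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        h = self.smoothing_filter(e)
        d = h[:, 1:] - h[:, :-1]
        shifts = torch.cat([d, d[:, -1:]], dim=1)                     # difference operation (padded)
        v, _ = self.attn(shifts, shifts, shifts)                      # interconnect the shifts
        return torch.cumsum(v, dim=1)                                 # CumSum back to the series level

# Toy usage: a batch of 4 sequences, 30 daily embeddings of dimension 64, two stacked blocks.
blocks = nn.Sequential(*[DivinerBlockSketch(64, 30) for _ in range(2)])
print(blocks(torch.randn(4, 30, 64)).shape)
```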
Metric MSE MAE MASE MSE MAE MASE MSE MAE MASE MSE MAE MASE MSE MAE MASE
NPT-1 96 0.256 0.340 1.391 0.456 0.511 2.090 0.264 0.349 1.427 0.259 0.333 1.362 0.491 0.509 2.082
288 0.277 0.379 1.598 0.431 0.499 2.104 0.611 0.590 2.488 0.376 0.445 1.876 0.624 0.694 2.927
672 0.263 0.367 1.601 0.446 0.522 2.278 1.680 0.885 3.862 0.365 0.437 1.907 0.680 0.615 2.684
1344 0.275 0.367 1.585 0.400 0.467 2.017 1.307 0.923 3.987 0.448 0.462 1.996 0.883 0.692 2.989
2880 0.318 0.390 1.613 0.674 0.629 2.601 1.590 1.050 4.343 0.811 0.652 2.697 1.257 0.844 3.491
NPT-2 96 0.370 0.405 1.800 0.605 0.603 2.681 0.760 0.646 2.870 0.458 0.470 2.088 0.539 0.476 2.116
288 0.394 0.431 1.977 0.579 0.607 2.786 1.131 0.826 3.788 0.415 0.454 2.082 0.589 0.541 2.481
672 0.484 0.462 2.074 0.541 0.525 2.357 1.149 0.861 3.864 0.548 0.546 2.453 0.734 0.598 2.685
1344 0.314 0.372 1.814 0.437 0.472 2.301 1.129 0.858 4.181 0.705 0.593 2.889 0.583 0.532 2.593
2880 0.378 0.390 1.861 0.750 0.644 3.072 1.342 0.935 4.457 0.458 0.470 2.240 0.934 0.725 3.459
NPT-3 96 0.177 0.323 1.672 0.272 0.401 2.076 0.664 0.656 3.397 0.300 0.415 2.150 0.227 0.347 1.797
288 0.193 0.301 1.558 0.579 0.607 3.144 0.880 0.721 3.736 0.458 0.478 2.478 0.486 0.498 2.579
672 0.187 0.305 1.599 0.541 0.525 2.753 0.931 0.771 4.044 0.327 0.409 2.147 0.455 0.488 2.558
1344 0.204 0.335 1.822 0.437 0.472 2.569 1.023 0.831 4.520 0.362 0.434 2.363 0.622 0.575 3.128
2880 0.240 0.350 1.756 0.750 0.644 3.228 1.196 0.922 4.622 0.362 0.434 2.177 0.816 0.673 3.374
The traffic forecast accuracy is assessed by MSE, MAE, and MASE: $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$, $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$, $\mathrm{MASE} = \frac{\frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|}{\frac{1}{n-1}\sum_{j=2}^{n}|y_j - y_{j-1}|}$, where $\hat{y} \in \mathbb{R}^n$ denotes the forecast and $y \in \mathbb{R}^n$ denotes the ground truth. All datasets were standardized using the mean and standard deviation values of the training set. The best predictive performance over the comparison is shown in bold.
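For reference, a small NumPy helper computing the three metrics exactly as defined in the table note; this is our own utility sketch, not code from the paper.

```python
import numpy as np

def mse(y: np.ndarray, y_hat: np.ndarray) -> float:
    return float(np.mean((y - y_hat) ** 2))

def mae(y: np.ndarray, y_hat: np.ndarray) -> float:
    return float(np.mean(np.abs(y - y_hat)))

def mase(y: np.ndarray, y_hat: np.ndarray) -> float:
    # Scale the forecast MAE by the in-sample naive one-step error of the ground truth.
    naive_error = np.mean(np.abs(np.diff(y)))
    return float(np.mean(np.abs(y - y_hat)) / naive_error)

# Toy usage on a standardized series.
y_true = np.random.randn(288)
y_pred = y_true + 0.1 * np.random.randn(288)
print(mse(y_true, y_pred), mae(y_true, y_pred), mase(y_true, y_pred))
```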
These techniques enable deep learning methods to model non-stationary time series with multi-scale stable features and produce forecasting results in a generative paradigm, which is an attempt to tackle long-term time series prediction problems.

Performance of the 5G network traffic forecasting. To validate the effectiveness of the proposed techniques, we collect extensive NPTs from China Unicom. The NPT datasets include data recorded every 15 minutes for the whole 2021 year from three groups of real-world metropolitan network traffic ports {NPT-1, NPT-2, NPT-3}, where each sub-dataset contains {18, 5, 5} ports, respectively. We split them chronologically with a 9:1 proportion for training and testing. In addition, we prepare 16 network ports for parameter-searching. The main difficulties lie in the explicit shift of the distribution and numerous outliers. This Section elaborates on the comprehensive comparison of our model with prediction-based and growth-rate-based models in applying 5G network traffic forecasting.

Experiment 1: We first compare Diviner to other time series prediction-based methods; we note these baseline models as Baselines-T for clarity. Baselines-T include the traditional models ARIMA19,20 and Prophet26; the classic machine learning model LSTMa60; and the deep learning-based models Transformer35, Informer39, Autoformer42, and NBeats61. These models are required to predict the entire network traffic series {1, 3, 7, 14, 30} days ahead, aligned with the {96, 288, 672, 1344, 2880} prediction spans in Table 1, and inbits is the target feature. In terms of the evaluation, although the MAE, MSE, and MASE predictive accuracy generally decrease with prediction intervals, the degradation rate varies between models. Therefore, we introduce an exponential velocity indicator to measure the rate of accuracy degradation. Specifically, given time spans [t1, t2] and the corresponding MSE, MAE, and MASE errors, we have the following:

$$\mathrm{dMSE}_{t_1}^{t_2} = \left(\sqrt[t_2 - t_1]{\mathrm{MSE}_{t_2}/\mathrm{MSE}_{t_1}} - 1\right) \times 100\%, \qquad (2)$$

$$\mathrm{dMAE}_{t_1}^{t_2} = \left(\sqrt[t_2 - t_1]{\mathrm{MAE}_{t_2}/\mathrm{MAE}_{t_1}} - 1\right) \times 100\%, \qquad (3)$$

$$\mathrm{dMASE}_{t_1}^{t_2} = \left(\sqrt[t_2 - t_1]{\mathrm{MASE}_{t_2}/\mathrm{MASE}_{t_1}} - 1\right) \times 100\%, \qquad (4)$$

where $\mathrm{dMSE}_{t_1}^{t_2}, \mathrm{dMAE}_{t_1}^{t_2}, \mathrm{dMASE}_{t_1}^{t_2} \in \mathbb{R}$.
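A direct NumPy transcription of the degradation-rate indicator of Eqs. (2)–(4), shown here as our own illustrative helper:

```python
import numpy as np

def degradation_rate(err_t1: float, err_t2: float, t1: int, t2: int) -> float:
    """Exponential (per-step) rate at which an error metric grows between
    prediction spans t1 and t2, in percent (Eqs. 2-4)."""
    return ((err_t2 / err_t1) ** (1.0 / (t2 - t1)) - 1.0) * 100.0

# Example with the NPT-1 Diviner MSE values from Table 1 at spans of 1 and 30 days.
print(degradation_rate(err_t1=0.256, err_t2=0.318, t1=1, t2=30))  # ~0.75 %
```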
Concerning the close experimental results between {NPT-1, NPT-2, and NPT-3}, we focus mainly on the result of the NPT-1 dataset, and the experimental results are summarized in Table 1. Although there exist quantities of outliers and frequent oscillations in the NPT dataset, Diviner achieves a 38.58% average MSE reduction (0.451 → 0.277) and a 20.86% average MAE reduction (0.465 → 0.368) over the prior art. In terms of scalability to different prediction spans, Diviner has a much lower $\mathrm{dMSE}_{1}^{30}$ (4.014% → 0.750%) and $\mathrm{dMAE}_{1}^{30}$ (2.343% → 0.474%) than the prior art, which exhibits a slight performance degradation with a substantial improvement in predictive robustness when the prediction horizon becomes longer. The degradation rates and predictive performance of all baseline approaches are provided in Supplementary Table S1 owing to the space limitation.

The experiments on NPT-2 and NPT-3 shown in Supplementary Data 1 reproduce the above results, where Diviner can support accurate long-term network traffic prediction and exceed the current art in accuracy and robustness by a large margin. In addition, we have the following results by sorting the comprehensive performances (obtained by the average MASE errors) of the baselines established with the Transformer framework: Diviner > Autoformer > Transformer > Informer. This order aligns with the non-stationary factors considered in these models and verifies our proposal that incorporating non-stationarity promotes neural networks' adaptive abilities to model time series, and that modeling multi-scale non-stationarity further breaks through the ceiling of prediction abilities for deep learning models.

Experiment 2: The second experiment compares Diviner with two other industrial methods, which aim to predict the capacity utilization of inbits and outbits with historical growth rates. The experiment shares the same network port traffic data as in Experiment 1, while the split ratio is changed to 3:1 chronologically for a longer prediction horizon. Furthermore, we use a long construction cycle of {30, 60, 90} days (aligned with {2880, 5760, 8640} time steps) to ensure the validity of such growth-rate-based methods under the law of large numbers. Here we first define capacity utilization mathematically:

Given a fixed bandwidth $B \in \mathbb{R}$ and the traffic flow of the kth construction cycle $\widetilde{X}(k) = [\tilde{x}_{kC+1}\ \tilde{x}_{kC+2}\ \cdots\ \tilde{x}_{(k+1)C}]$, $\widetilde{X}(k) \in \mathbb{R}^{T \times C}$, where $\tilde{x}_i \in \mathbb{R}^T$ is a column vector of length T representing the time series per day and C denotes the number of days in one construction cycle.
Table 2 Long-term (1–3 months) capacity utilization forecasting results on the NPT dataset.
The long-term capacity utilization forecasting results on the NPT dataset were evaluated using MAE and MASE error metrics. All datasets were standardized using the mean and standard deviation
values of the training set. The best predictive performance over the comparison is shown in bold.
Then the capacity utilization (CU) of the kth construction cycle is defined as follows:

$$\mathrm{CU}(k) = \frac{\|\widetilde{X}(k)\|_{m1}}{BCT}, \qquad (5)$$

where $\mathrm{CU}(k) \in \mathbb{R}$. As shown in the definition, capacity utilization is directly related to network traffic, so a precise network traffic prediction leads to a quality prediction of capacity utilization. We compare the proposed predictive method with two commonly used moving average growth rate predictive methods in the industry, the additive and the multiplicative moving average growth rate predictive methods. For clarity, we note the additive method as Baseline-A and the multiplicative method as Baseline-M. Baseline-A calculates an additive growth rate with the difference of adjacent construction cycles. Given the capacity utilization of the last two construction cycles CU(k − 1), CU(k − 2), we have the following:

$$\widehat{\mathrm{CU}}_A(k) = 2\,\mathrm{CU}(k-1) - \mathrm{CU}(k-2). \qquad (6)$$

Baseline-M calculates a multiplicative growth rate with the quotient of adjacent construction cycles. Given the capacity utilization of the last two construction cycles CU(k − 1), CU(k − 2), we have the following:

$$\widehat{\mathrm{CU}}_M(k) = \mathrm{CU}(k-1)\cdot\frac{\mathrm{CU}(k-1)}{\mathrm{CU}(k-2)}. \qquad (7)$$

Different from the above two baselines, we calculate the capacity utilization of the network with the network traffic forecast. Given the network traffic of the last K construction cycles $\widetilde{X} = [\tilde{x}_{(k-K)C+1}\ \cdots\ \tilde{x}_{(k-K+1)C}\ \cdots\ \tilde{x}_{(k-1)C}\ \cdots\ \tilde{x}_{kC}]$, we have the following:

$$\widetilde{X}(k) = \mathrm{Diviner}(\widetilde{X}), \qquad (8)$$

$$\widehat{\mathrm{CU}}_D(k) = \frac{\|\widetilde{X}(k)\|_{m1}}{BCT}. \qquad (9)$$
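To illustrate Eqs. (5)–(9), here is a hedged NumPy sketch of the capacity-utilization computation and the two growth-rate baselines. The `diviner_forecast` callable is a stand-in for the trained model, not an implementation of it, and the toy data and bandwidth are assumptions.

```python
import numpy as np
from typing import Callable

def capacity_utilization(traffic: np.ndarray, bandwidth: float) -> float:
    """Eq. (5)/(9): entry-wise L1 norm of a (T, C) traffic matrix over B*C*T."""
    T, C = traffic.shape
    return float(np.abs(traffic).sum() / (bandwidth * C * T))

def baseline_additive(cu_prev: float, cu_prev2: float) -> float:
    return 2.0 * cu_prev - cu_prev2                     # Eq. (6)

def baseline_multiplicative(cu_prev: float, cu_prev2: float) -> float:
    return cu_prev * (cu_prev / cu_prev2)               # Eq. (7)

def diviner_based(history: np.ndarray, bandwidth: float,
                  diviner_forecast: Callable[[np.ndarray], np.ndarray]) -> float:
    forecast = diviner_forecast(history)                # Eq. (8): (T, C) matrix for cycle k
    return capacity_utilization(forecast, bandwidth)    # Eq. (9)

# Toy usage, with a persistence forecast standing in for the trained model.
history = np.abs(np.random.randn(96, 60))               # two 30-day cycles of 15-min samples
print(baseline_additive(0.42, 0.40), baseline_multiplicative(0.42, 0.40))
print(diviner_based(history, bandwidth=10.0, diviner_forecast=lambda h: h[:, -30:]))
```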
We summarize the experimental results in Table 2. Concerning the close experimental results between {NPT-1, NPT-2, and NPT-3}, we focus mainly on the result of the NPT-1 dataset, which has the most network traffic ports. Diviner achieves a substantial reduction of 31.67% MAE (0.846 → 0.578) on inbits and a reduction of 24.25% MAE (0.944 → 0.715) on outbits over Baseline-A. An intuitive explanation is that the growth-rate-based methods extract particular historical features but lack adaptability. We notice that Baseline-A has a much better performance of 0.045× average inbits-MAE and 0.074× average outbits-MAE over Baseline-M. This result suggests that network traffic tends to increase linearly rather than exponentially. Nevertheless, there remain inherent multi-scale variations of network traffic series, so Diviner still exceeds Baseline-A, suggesting the necessity of applying deep learning models such as Diviner to discover nonlinear latent regularities within network traffic.

When analyzing the results of these two experiments jointly, we present that Diviner possesses a relatively low degradation rate for a prediction of 90 days, $\mathrm{dMASE}_{1}^{90} = 1.034\%$. In contrast, the degradation rate of the prior art comes to $\mathrm{dMASE}_{1}^{30} = 2.343\%$ for a three-times shorter prediction horizon of 30 days. Furthermore, considering the diverse network traffic patterns in the provided datasets (about 50 ports), the proposed method can deal with a wide range of non-stationary time series, validating its applicability without modification. These experiments witness Diviner's success in providing quality long-term network traffic forecasting and extending the effective prediction spans of deep learning models for up to three months.

Application on other real-world datasets. We validate our method on benchmark datasets for the weather (WTH), electricity transformer temperature (ETT), electricity (ECL), and exchange (Exchange). We summarize the experimental results in Table 3. We follow the standard protocol and divide them into training, validation, and test sets in chronological order with a proportion of 7:1:2 unless otherwise specified. Due to the space limitation, the complete experimental results are shown in Supplementary Data 2.

Weather temperature prediction. The WTH dataset42 records 21 meteorological indicators for Jena 2020, including air temperature and humidity, and WetBulbFarenheit is the target. This dataset is finely quantified to the 10-min level, which means that there are 144 steps for one day and 4320 steps for one month, thereby challenging the capacity of models to process long sequences. Among all baselines, NBeats and Informer have the lowest error in terms of MSE and MAE metrics, respectively. However, we notice a contrast between these two models when extending prediction spans. Informer degrades precipitously when the prediction spans increase from 2016 to 4032 (MAE: 0.417 → 0.853), but on the contrary, NBeats gains a performance improvement (MAE: 0.635 → 0.434). We attribute this to a trade-off between pursuing context and texture. Informer has an advantage over the texture in the short-term case. Still, it needs to capture the context dependency of the series, considering that the length of the input history series should extend in pace with the prediction spans and vice versa. As for Diviner, it achieves a remarkable 29.30% average MAE reduction (0.488 → 0.345) and 41.54% average MSE reduction (0.491 → 0.287) over both Informer and NBeats. Additionally, Diviner gains a low degradation rate of $\mathrm{dMSE}_{1}^{30} = 0.439\%$,
Solar 144 0.348 0.326 7.461 0.431 0.485 11.091 0.365 0.362 8.290 0.546 0.513 11.742 0.351 0.371 8.487
288 0.312 0.331 8.355 0.437 0.477 12.035 0.405 0.397 10.007 0.368 0.368 9.289 0.345 0.356 8.988
720 0.315 0.342 8.793 0.400 0.525 13.497 0.577 0.537 13.803 0.339 0.441 11.352 0.350 0.357 9.176
864 0.310 0.297 7.053 0.546 0.607 14.423 0.994 0.897 21.299 0.813 0.478 11.367 0.349 0.357 8.488
Traffic 168 0.156 0.259 0.835 0.431 0.485 1.561 1.814 1.159 3.729 0.750 0.644 2.071 0.509 0.528 1.700
336 0.158 0.261 0.847 0.437 0.477 1.548 1.799 1.153 3.738 0.629 0.573 1.857 0.517 0.529 1.714
720 0.318 0.437 1.457 0.400 0.525 1.751 1.817 1.150 3.836 0.671 0.604 2.014 0.526 0.533 1.779
960 0.277 0.397 1.299 0.546 0.607 1.986 1.821 1.165 3.809 1.950 1.116 3.649 0.523 0.532 1.740
The model’s predictive accuracy is assessed by MSE, MAE, and MASE. All datasets were standardized using the mean and standard deviation values of the training set. The best and suboptimal predictive performance over the comparison is shown in bold and italics,
techniques for practical applications such as bandwidth management14,15, resource allocation16, and resource provisioning17, where the time series prediction-based methods can provide detailed network traffic forecasts. However, existing time series forecasting methods suffer a severe performance degeneration since the long-term prediction horizon exposes the non-stationarity of time series, which raises several challenges: (a) Multi-scale temporal variations. (b) Random factors. (c) Data distribution shift.

Therefore, this paper attempts to challenge the problem of achieving a precise long-term prediction for non-stationary time series. We start from the fundamental property of time series non-stationarity and introduce deep stationary processes into a neural network, which models multi-scale stable regularities within non-stationary time series. We argue that capturing the stable features is a recipe for generating non-stationary forecasts conforming to historical regularities. The stable features enable networks to restrict the latent space of time series, which deals with varying distribution problems. Extensive experiments on network traffic prediction and other real-world scenarios demonstrate its advances over existing prediction-based models. Its advantages are summarized as follows. (a) Diviner brings a salient improvement on both long- and short-term prediction and achieves state-of-the-art performance. (b) Diviner can perform robustly regardless of the selection of prediction span and granularity, showing great potential for long-term forecasting. (c) Diviner maintains a strong generalization in various fields. The performance of most baselines might degrade precipitously in some or other areas. In contrast, our model distinguishes itself for consistent performance on each benchmark.

This work explores an avenue to obtain detailed and precise long-term 5G network traffic forecasts, which can be used to calculate the time at which network traffic might overflow the capacity and helps operators formulate network construction schemes months in advance. Furthermore, Diviner generates long-term network traffic forecasts at the minute level, facilitating its broader applications for resource provisioning, allocating, and monitoring. Decision-makers can harness long-term predictions to allocate and optimize network resources. Another practical application is to achieve an automatic network status monitoring system, which automatically alarms when real network traffic exceeds a permitted range around predictions. This system supports targeted port-level early warning and assists workers in troubleshooting in time, which can bring substantial efficiency improvement considering the tens of millions of network ports running online. In addition to 5G networks, we have expanded our solution to broader engineering fields such as electricity, climate, control, economics, energy, and transportation. Predicting oil temperature can help prevent the transformer from overheating, which affects the insulation life of the transformer and ensures proper operation66,67. In addition, long-term meteorological prediction helps to select and seed crops in agriculture. As such, we can discover unnoticed regularities within historical series data, which might bring opportunities to traditional industries.

One limitation of our proposed model is that it suffers from critical transitions of data patterns. We attribute this to external factors, whose information is generally not included in the measured data53,55,68. Our method is helpful in the intrinsic regularity discovery within the time series but cannot predict patterns not previously recorded in the real world. Alternatively, we can use dynamic network methods69–71 to detect such critical transitions in the time series53. Furthermore, the performance of Diviner might be similar to other deep learning models if given a few history series or in the short-term prediction case. The former contains insufficient information to be exploited, and the short-term prediction needs more problem scalability, whereas the advantages of our model become apparent in long-term forecasting scenarios.

Methods
Preliminaries. We denote the original form of the time-series data as $X = [x_1\ x_2\ \cdots\ x_n]$, $x_i \in \mathbb{R}$. The original time series data X is reshaped to a matrix form as $\widetilde{X} = [\tilde{x}_1\ \tilde{x}_2\ \cdots\ \tilde{x}_K]$, where $\tilde{x}_i$ is a vector of length T with the time series data per day/week/month/year, K denotes the number of days/weeks/months/years, and $\tilde{x}_i \in \mathbb{R}^T$. After that, we can represent the seasonal pattern as $\tilde{x}_i$ and use its variation between adjacent time steps to model trends, shown as follows:

$$\tilde{x}_{t_2} = \tilde{x}_{t_1} + \sum_{t=t_1}^{t_2-1} \Delta\tilde{s}_t, \qquad \Delta\tilde{s}_t = \tilde{x}_{t+1} - \tilde{x}_t, \qquad (10)$$

where $\Delta\tilde{s}_t$ denotes the change of the seasonal pattern, $\Delta\tilde{s}_t \in \mathbb{R}^T$. The shift reflects the variation between small time steps, but when such variation (shift) builds up over a rather long period, the trend d comes out. It can be achieved as $d = \sum_{t=t_1}^{t_2-1} \Delta\tilde{s}_t$. Therefore, we can model trends by capturing the long- and short-range dependencies of shifts among different time steps.

Next, we introduce a smoothing filter attention mechanism to construct multi-scale transformation layers. A difference attention module is mounted to capture and interconnect shifts of the corresponding scale. These mechanisms make our Diviner capture multi-scale variations in non-stationary time series, and the mathematical description is listed below.

Diviner input layer. Given the time series data X, we transform X into $\widetilde{X} = [\tilde{x}_1\ \tilde{x}_2\ \cdots\ \tilde{x}_K]$, where $\tilde{x}_i$ is a vector of length T with the time series data per day (seasonal), K denotes the number of days, $\tilde{x}_i \in \mathbb{R}^T$, and $\widetilde{X} \in \mathbb{R}^{T \times K}$. Then we construct the dual input for Diviner. Noticing that Diviner adopts an encoder-decoder architecture, we construct $X^{in}_{en}$ for the encoder and $X^{in}_{de}$ for the decoder, where $X^{in}_{en} = [\tilde{x}_1\ \tilde{x}_2\ \cdots\ \tilde{x}_K]$, $X^{in}_{de} = [\tilde{x}_{K-K_{de}+1}\ \tilde{x}_{K-K_{de}+2}\ \cdots\ \tilde{x}_K]$, and $X^{in}_{en} \in \mathbb{R}^{T \times K}$, $X^{in}_{de} \in \mathbb{R}^{T \times K_{de}}$. This means that $X^{in}_{en}$ takes all elements from $\widetilde{X}$ while $X^{in}_{de}$ takes only the latest $K_{de}$ elements. After that, a fully connected layer on $X^{in}_{en}$ and $X^{in}_{de}$ is used to obtain $E^{in}_{en}$ and $E^{in}_{de}$, where $E^{in}_{en} \in \mathbb{R}^{d_m \times K}$, $E^{in}_{de} \in \mathbb{R}^{d_m \times K_{de}}$, and $d_m$ denotes the model dimension.
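A minimal PyTorch-style sketch of the input layer just described. The per-day period length, decoder window size, and class and variable names are assumptions made for illustration; the sketch keeps tensors in (steps, features) orientation for simplicity.

```python
import torch
import torch.nn as nn

class DivinerInputLayer(nn.Module):
    """Reshape a flat series into day columns and embed encoder/decoder inputs."""
    def __init__(self, period_len: int = 96, d_model: int = 64, k_decoder: int = 28):
        super().__init__()
        self.period_len = period_len        # T: samples per day (15-minute sampling)
        self.k_decoder = k_decoder          # K_de: latest days fed to the decoder
        self.embed = nn.Linear(period_len, d_model)   # fully connected embedding

    def forward(self, series: torch.Tensor):
        # series: (n,) flat traffic record; keep only whole days.
        k = series.numel() // self.period_len
        x = series[: k * self.period_len].reshape(k, self.period_len)  # (K, T), one row per day
        enc_in = self.embed(x)                                         # E_en: (K, d_model)
        dec_in = self.embed(x[-self.k_decoder:])                       # E_de: (K_de, d_model)
        return enc_in, dec_in

# Toy usage: half a year of 15-minute samples.
enc, dec = DivinerInputLayer()(torch.randn(180 * 96))
print(enc.shape, dec.shape)   # torch.Size([180, 64]) torch.Size([28, 64])
```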
Smoothing filter attention mechanism. Inspired by Nadaraya–Watson regression51,52 bringing the adjacent points closer together, we introduce the smoothing filter attention mechanism with a learnable kernel function and a self-masked architecture, where the former brings similar items closer to filter out the random component and adjust the non-stationary data to stable features, and the latter reduces outliers. The smoothing filter attention mechanism is implemented based on the input $E = [\xi_1\ \xi_2\ \cdots\ \xi_{K_{in}}]$, where $\xi_i \in \mathbb{R}^{d_m}$, E is the general reference to the input of each layer, for the encoder $K_{in} = K$, and for the decoder $K_{in} = K_{de}$. Specifically, $E^{in}_{en}$ and $E^{in}_{de}$ are, respectively, the input of the first encoder and decoder layer. The calculation process is shown as follows:

$$\eta_i = \frac{\sum_{j \ne i} K(\xi_i, \xi_j) \odot \xi_j}{\sum_{j \ne i} K(\xi_i, \xi_j)}, \qquad (11)$$

$$K(\xi_i, \xi_j) = \exp\!\left(-w_i \odot (\xi_i - \xi_j)^2\right), \qquad (12)$$

where $w_i \in \mathbb{R}^{d_m}$, $i \in [1, K_{in}]$ denotes the learnable parameters, ⊙ denotes the element-wise multiplication, and $(\cdot)^2$ denotes the element-wise square (the square of a vector here represents the element-wise square). To simplify the representation, we denote the smoothing filter attention mechanism as Smoothing-Filter(E) and its output as $H_s$. Before introducing our difference attention module, we first define the difference of a matrix and its inverse operation, CumSum.
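The element-wise form of Eqs. (11)–(12) in a short PyTorch sketch; this is our own illustrative module, assuming a fixed number of time steps so that one learnable vector $w_i$ exists per step.

```python
import torch
import torch.nn as nn

class SmoothingFilterAttention(nn.Module):
    """Element-wise form of Eqs. (11)-(12): a learnable Gaussian-like kernel
    with a self-mask that excludes each step from its own weighted average."""
    def __init__(self, k_in: int, d_model: int):
        super().__init__()
        self.w = nn.Parameter(torch.ones(k_in, d_model))   # w_i, one vector per time step

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        # e: (K_in, d_model) period embeddings.
        diff2 = (e.unsqueeze(1) - e.unsqueeze(0)) ** 2       # (K, K, d): (xi_i - xi_j)^2
        kernel = torch.exp(-self.w.unsqueeze(1) * diff2)     # (K, K, d): Eq. (12)
        mask = torch.eye(e.size(0), device=e.device).bool().unsqueeze(-1)
        kernel = kernel.masked_fill(mask, 0.0)               # self-masked operation (j != i)
        h = (kernel * e.unsqueeze(0)).sum(dim=1) / kernel.sum(dim=1).clamp_min(1e-12)
        return h                                             # Eq. (11), shape (K, d)

print(SmoothingFilterAttention(k_in=30, d_model=64)(torch.randn(30, 64)).shape)
```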
Difference and CumSum operation. Given a matrix $M \in \mathbb{R}^{m \times n}$, $M = [m_1\ m_2\ \cdots\ m_n]$, the difference of M is defined as:

$$\Delta M = [\Delta m_1\ \Delta m_2\ \cdots\ \Delta m_n], \qquad (13)$$

where $\Delta m_i = m_{i+1} - m_i$, $\Delta m_i \in \mathbb{R}^m$, $i \in [1, n)$, and we pad $\Delta m_n$ with $\Delta m_{n-1}$ to keep a fixed length before and after the difference operation. The CumSum operation Σ toward M is defined as:

$$\Sigma M = [\Sigma m_1\ \Sigma m_2\ \cdots\ \Sigma m_n], \qquad (14)$$

where $\Sigma m_i = \sum_{j=1}^{i} m_j$, $\Sigma m_i \in \mathbb{R}^m$.
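The two operations of Eqs. (13)–(14) take only a few lines of PyTorch, including the padding convention stated above; this is a sketch for clarity rather than the authors' code.

```python
import torch

def matrix_difference(m: torch.Tensor) -> torch.Tensor:
    """Eq. (13): column-wise difference, padding the last column with the previous one."""
    d = m[:, 1:] - m[:, :-1]                  # delta_m_i = m_{i+1} - m_i
    return torch.cat([d, d[:, -1:]], dim=1)   # pad delta_m_n with delta_m_{n-1}

def matrix_cumsum(m: torch.Tensor) -> torch.Tensor:
    """Eq. (14): column-wise cumulative sum, the inverse-style operation of the difference."""
    return torch.cumsum(m, dim=1)

M = torch.randn(4, 6)
print(matrix_difference(M).shape, matrix_cumsum(M).shape)   # both torch.Size([4, 6])
```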
The differential attention module, intuitively, can be seen as an attention mechanism plugged between these two operations, mathematically described as follows.

Differential attention module. The input of this module involves three elements: Q, K, V. The (Q, K, V) triple varies between the encoder and decoder, being $(H^{en}_s, H^{en}_s, H^{en}_s)$ for the encoder and $(H^{de}_s, E^{out}_{en}, E^{out}_{en})$ for the decoder, where $E^{out}_{en}$ is the embedded result of the final encoder block (assigned in the pseudo-code), $H^{en}_s \in \mathbb{R}^{d_m \times K}$, $H^{de}_s \in \mathbb{R}^{d_m \times K_{de}}$, $E^{out}_{en} \in \mathbb{R}^{d_m \times K}$.

$$Q^{(i)}_s,\ K^{(i)}_s,\ V^{(i)}_s = W^{(i)}_q \Delta Q + b^{(i)}_q,\ W^{(i)}_k \Delta K + b^{(i)}_k,\ W^{(i)}_v \Delta V + b^{(i)}_v, \qquad (15)$$
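A compact sketch of how the differenced queries, keys, and values of Eq. (15) could feed a standard scaled-dot-product attention and then be accumulated back with CumSum. Head counts, shapes, and the output projection standing in for the aggregation parameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DifferenceAttention(nn.Module):
    """Difference -> multi-head self-attention -> CumSum, in the spirit of Eqs. (13)-(15)."""
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(d_model, d_model)     # aggregation of the concatenated heads

    @staticmethod
    def diff(x: torch.Tensor) -> torch.Tensor:
        d = x[:, 1:] - x[:, :-1]
        return torch.cat([d, d[:, -1:]], dim=1)    # pad to keep the sequence length

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # q, k, v: (batch, steps, d_model); the encoder uses q = k = v = H_s.
        dq, dk, dv = self.diff(q), self.diff(k), self.diff(v)
        shifted, _ = self.attn(dq, dk, dv)          # per-head projections play the role of Eq. (15)
        return torch.cumsum(self.out(shifted), dim=1)

h = torch.randn(2, 30, 64)
print(DifferenceAttention()(h, h, h).shape)         # torch.Size([2, 30, 64])
```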
References
1. Jovović, I., Husnjak, S., Forenbacher, I. & Maček, S. Innovative application of 5G and blockchain technology in industry 4.0. EAI Endorsed Trans. Ind. Netw. Intell. Syst. 6, e4 (2019).
2. Osseiran, A. et al. Scenarios for 5G mobile and wireless communications: the vision of the METIS project. IEEE Commun. Mag. 52, 26–35 (2014).
3. Wu, G., Yang, C., Li, S. & Li, G. Y. Recent advances in energy-efficient networks and their application in 5G systems. IEEE Wirel. Commun. 22,
30. Qin, Y. et al. A dual-stage attention-based recurrent neural network for time series prediction. in Proceedings of International Joint Conference on Artificial Intelligence, 2627–2633 (2017).
31. Mona, S., Mazin, E., Stefan, L. & Maja, R. Modeling irregular time series with continuous recurrent units. Proc. Int. Conf. Mach. Learn. 162, 19388–19405 (2022).
32. Kashif, R., Calvin, S., Ingmar, S. & Roland, V. Autoregressive denoising diffusion models for multivariate probabilistic time series forecasting. in Proceedings of International Conference on Machine Learning, vol. 139, 8857–8868 (2021).
33. Alasdair, T., Alexander, P. M., Cheng, S. O. & Xie, L. Radflow: a recurrent, aggregated, and decomposable model for networks of time series. in Proceedings of International World Wide Web Conference, 730–742 (2021).
34. Ling, F. et al. Multi-task machine learning improves multi-seasonal prediction of the Indian Ocean Dipole. Nat. Commun. 13, 1–9 (2022).
35. Vaswani, A. et al. Attention is all you need. Proc. Annu. Conf. Neural Inf. Process. Syst. 30, 5998–6008 (2017).
36. Alexandre, D., Étienne, M. & Nicolas, C. TACTiS: transformer-attentional copulas for time series. in Proceedings of International Conference on Machine Learning, vol. 162, 5447–5493 (2022).
37. Tung, N. & Aditya, G. Transformer neural processes: uncertainty-aware meta learning via sequence modeling. in Proceedings of International Conference on Machine Learning, vol. 162, 16569–16594 (2022).
38. Wen, Q. et al. Transformers in time series: a survey. CoRR (2022).
39. Zhou, H. et al. Informer: beyond efficient transformer for long sequence time-series forecasting. in Proceedings of AAAI Conference on Artificial Intelligence (2021).
40. Kitaev, N., Kaiser, L. & Levskaya, A. Reformer: the efficient transformer. in Proceedings of International Conference on Learning Representations (2019).
41. Li, S. et al. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. in Proceedings of the 33rd Annual Conference on Neural Information Processing Systems, vol. 32, 5244–5254 (2019).
42. Wu, H., Xu, J., Wang, J. & Long, M. Autoformer: decomposition transformers with auto-correlation for long-term series forecasting. in Proceedings of Annual Conference on Neural Information Processing Systems, vol. 34, 22419–22430 (2021).
43. Zhou, T. et al. FEDformer: frequency enhanced decomposed transformer for long-term series forecasting. in Proceedings of International Conference on Machine Learning, vol. 162, 27268–27286 (2022).
44. Liu, S. et al. Pyraformer: low-complexity pyramidal attention for long-range time series modeling and forecasting. in Proceedings of International Conference on Learning Representations (ICLR) (2021).
45. Liu, M. et al. SCINet: time series modeling and forecasting with sample convolution and interaction. in Proceedings of Annual Conference on Neural Information Processing Systems (2022).
46. Wang, Z. et al. Learning latent seasonal-trend representations for time series forecasting. in Proceedings of Annual Conference on Neural Information Processing Systems (2022).
47. Xie, C. et al. Trend analysis and forecast of daily reported incidence of hand, foot and mouth disease in Hubei, China by Prophet model. Sci. Rep. 11, 1–8 (2021).
48. Cox, D. R. & Miller, H. D. The Theory of Stochastic Processes (Routledge, London, 2017).
49. Dette, H. & Wu, W. Prediction in locally stationary time series. J. Bus. Econ. Stat. 40, 370–381 (2022).
50. Wold, H. O. On prediction in stationary time series. Ann. Math. Stat. 19, 558–567 (1948).
51. Watson, G. S. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, 359–372 (1964).
52. Nadaraya, E. A. On estimating regression. Theory Probab. Appl. 9, 141–142 (1964).
53. Chen, P., Liu, R., Aihara, K. & Chen, L. Autoreservoir computing for multistep ahead prediction based on the spatiotemporal information transformation. Nat. Commun. 11, 1–15 (2020).
54. Lu, J., Wang, Z., Cao, J., Ho, D. W. & Kurths, J. Pinning impulsive stabilization of nonlinear dynamical networks with time-varying delay. Int. J. Bifurc. Chaos 22, 1250176 (2012).
55. Malik, N., Marwan, N., Zou, Y., Mucha, P. J. & Kurths, J. Fluctuation of similarity to detect transitions between distinct dynamical regimes in short time series. Phys. Rev. E 89, 062908 (2014).
56. Yang, R., Lai, Y. & Grebogi, C. Forecasting the future: is it possible for adiabatically time-varying nonlinear dynamical systems? Chaos 22, 033119 (2012).
57. Henkel, S. J., Martin, J. S. & Nardari, F. Time-varying short-horizon predictability. J. Financ. Econ. 99, 560–580 (2011).
58. Wu, N., Green, B., Ben, X. & O'Banion, S. Deep transformer models for time series forecasting: the influenza prevalence case. Preprint at arXiv https://doi.org/10.48550/arXiv.2001.08317 (2020).
59. Lea, C., Flynn, M. D., Vidal, R., Reiter, A. & Hager, G. D. Temporal convolutional networks for action segmentation and detection. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 156–165 (2017).
60. Bahdanau, D., Cho, K. & Bengio, Y. Neural machine translation by jointly learning to align and translate. in Proceedings of International Conference on Learning Representations (2015).
61. Oreshkin, B. N., Carpov, D., Chapados, N. & Bengio, Y. N-BEATS: neural basis expansion analysis for interpretable time series forecasting. in Proceedings of International Conference on Learning Representations (2020).
62. Li, S. et al. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. in Proceedings of Annual Conference on Neural Information Processing Systems 32 (2019).
63. Geary, N., Antonopoulos, A., Drakopoulos, E., O'Reilly, J. & Mitchell, J. A framework for optical network planning under traffic uncertainty. in Proceedings of International Workshop on Design of Reliable Communication Networks, 50–56 (2001).
64. Laguna, M. Applying robust optimization to capacity expansion of one location in telecommunications with demand uncertainty. Manag. Sci. 44, S101–S110 (1998).
65. Bauschert, T. et al. Network planning under demand uncertainty with robust optimization. IEEE Commun. Mag. 52, 178–185 (2014).
66. Radakovic, Z. & Feser, K. A new method for the calculation of the hot-spot temperature in power transformers with ONAN cooling. IEEE Trans. Power Deliv. 18, 1284–1292 (2003).
67. Zhou, L. J., Wu, G. N., Tang, H., Su, C. & Wang, H. L. Heat circuit method for calculating temperature rise of Scott traction transformer. High Volt. Eng. 33, 136–139 (2007).
68. Jiang, J. et al. Predicting tipping points in mutualistic networks through dimension reduction. Proc. Natl Acad. Sci. USA 115, E639–E647 (2018).
69. Chen, L., Liu, R., Liu, Z., Li, M. & Aihara, K. Detecting early-warning signals for sudden deterioration of complex diseases by dynamical network biomarkers. Sci. Rep. 2, 1–8 (2012).
70. Yang, B. et al. Dynamic network biomarker indicates pulmonary metastasis at the tipping point of hepatocellular carcinoma. Nat. Commun. 9, 1–14 (2018).
71. Liu, R., Chen, P. & Chen, L. Single-sample landscape entropy reveals the imminent phase transition during disease progression. Bioinformatics 36, 1522–1532 (2020).

Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grants 62076016 and 12201024, and Beijing Natural Science Foundation L223024.

Author contributions
Y.Y., S.G., B.Z., J.Z., and D.D. conceived the research. All authors worked on the writing of the article. Y.Y. and S.G. contributed equally to this work by performing experiments and results analysis. Z.W. and Y.Z. collected the 5G network traffic data. All authors read and approved the final paper.

Competing interests
The authors declare no competing interests.

Inclusion and ethics
No 'ethics dumping' or 'helicopter research' cases occurred in our research.

Additional information
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s44172-023-00081-4.

Correspondence and requests for materials should be addressed to Baochang Zhang or Juan Zhang.

Peer review information Communications Engineering thanks Akhil Gupta, Erol Egrioglu, and the other anonymous reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Miranda Vinay and Rosamund Daw. A peer review file is available.

Reprints and permission information is available at http://www.nature.com/reprints

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.