Modeling Extreme Events in Time Series Prediction
Daizong Ding, Mi Zhang, Xudong Pan, Min Yang, Xiangnan He
2 PRELIMINARIES

In this section, we briefly describe the time-series prediction problem and introduce extreme events in time-series data.
The assumption on learning capacity is a widely adopted assumption in previous research [3, 21] and can be implemented with a deep neural network structure in practice. Furthermore, if P(Y|X) has been perfectly learned, so are the distributions P(Y), P(X), P(X|Y), which are therefore totally independent of the inputs X. By considering the following observations,

• the discriminative model (Eq. 2) has no prior on y_t;
• the output o_t is learned under a likelihood taken to be a normal distribution;

it is therefore reasonable to state that the empirical distribution P(Y) after optimization should be of the following form,

\hat{P}(Y) = \frac{1}{N} \sum_{t=1}^{T} \mathcal{N}(y_t, \hat{\tau}^2)    (9)

where the constant τ̂ is an unknown variance. In consideration of its similarity to a Kernel Density Estimator (KDE) with a Gaussian kernel [38], we can reach an intermediate conclusion that such a model would perform relatively poorly if the true distribution of the data in the series is heavy-tailed, according to [7].

3.2 Why Deep Neural Networks Could Suffer from the Extreme Event Problem
tory, and for the latter factor we propose to impose approximated
As discussed above, the distribution of output from a learning model
tailed distribution on observations and provide a novel classifica-
with optimal parameters can be regarded as a KDE with Gaussian
tion called Extreme Value Loss (EVL). Finally we combine these two
Kernel (Eq. 7).
factors and introduce the full solution for predicting time series
Since nonparametric kernel density estimator only works well
data with extreme values.
with sufficient samples, the performance therefore is expected to
decrease at the tail part of the data, where sampled data points
would be rather limited [7]. The main reason is that the range of
4.1 Memory Network Module
extreme values are commonly very large, thus few samples hardly As pointed out by Ghil et al., extreme events in time-series data
can cover the range. As depicted in Fig. 2(b), we sample yt from the often show some form of temporal regularity [19]. Inspired from
true distribution and fit a KDE with Gaussian Kernel. As is shown, this point, we propose to use memory network to memorize these
since there are only two samples with yt > 1.5, the shape of fitted extreme events, which is proved to be effective in recognizing
KDE peaks inconsistently around these points. Moreover, as a large inherent patterns contained in historical information [45]. First,
majority of samples are centered at 0, therefore the probability define the concept of window in our context.
density around origin estimated by KDE tends to be much higher 4.1.1 Historical Window. For each time step t, we first randomly
than the true distribution. sample a sequence of windows by W = {w 1 , · · · , w M }, where M
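To make the argument concrete, the following minimal sketch (ours, not from the paper) fits a Gaussian-kernel KDE, i.e. exactly the form of Eq. 9, to samples drawn from an assumed heavy-tailed distribution and compares it with the true density; the Student-t choice, the sample size, and the evaluation points are illustrative assumptions rather than the exact setup of Fig. 2(b).

```python
from scipy.stats import gaussian_kde, t

# Heavy-tailed "true" distribution (illustrative: Student-t with 2 degrees of freedom).
true_dist = t(df=2)
samples = true_dist.rvs(size=200, random_state=0)

# Nonparametric estimate: a Gaussian-kernel KDE, the form of Eq. (9).
kde = gaussian_kde(samples)

# Compare the estimated and the true density near the mode and deep in the tail.
for y in [0.0, 3.0, 15.0]:
    print(f"y = {y:5.1f}   kde = {kde(y)[0]:.5f}   true = {true_dist.pdf(y):.5f}")

# Deep in the tail (e.g. y = 15) almost no samples fall nearby, so the KDE density
# collapses toward zero or peaks erratically around the few observed extremes,
# while the true heavy-tailed density is still non-negligible.
```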
Formally, let us suppose x_1, x_2 are two test samples with corresponding outputs o_1 = 0.5 and o_2 = 1.5. As our studied model is assumed to have sufficient learning capacity for modeling P(X) and P(X|Y), we thus have

P(y_1 | x_1, \theta) = \frac{P(X|Y)\,\hat{P}(Y)}{P(X)} \ge \frac{P(X|Y)\,P_{\mathrm{true}}(Y)}{P(X)} = P_{\mathrm{true}}(y_1 | x_1)    (10)

Similarly, P(y_2 | x_2, θ) ≤ P_true(y_2 | x_2). Therefore, in this case, the values predicted by the deep neural network are always bounded, which immediately prevents the deep model from predicting extreme events, i.e. it causes the underfitting phenomenon.

On the other side, as we have discussed in the related work, several methods propose to accent extreme points during training by, for example, increasing the weight on their corresponding training losses. In our formulation, these methods are equivalent to repeating the extreme points several times in the dataset when fitting the KDE. The outcome is illustrated by the dotted line in Fig. 2(b). As a consequence, we have

P(y_2 | x_2, \theta) = \frac{P(X|Y)\,\hat{P}(Y)}{P(X)} \ge \frac{P(X|Y)\,P_{\mathrm{true}}(Y)}{P(X)} = P_{\mathrm{true}}(y_2 | x_2)    (11)

Intuitively, the inequality above indicates that, with the estimated probability of extreme events being inflated, the estimation of normal events simultaneously becomes inaccurate. Therefore, normal data in the test set are prone to be mis-classified as extreme events, which marks the overfitting phenomenon.

As we can see, the extreme event problem in DNN is mainly caused by the lack of a sufficient prior on the tail part of the observations y_t. Maximizing the likelihood alone leads to a nonparametric estimation of y_t, which can easily cause the underfitting problem. On the other side, if we increase the weight on those large values, DNN can easily suffer from the overfitting problem. In order to alleviate these problems in DNN, we will provide an elegant solution, which aims at imposing a prior on extreme events for DNN in predicting time series data.

4 PREDICTING TIME-SERIES DATA WITH EXTREME EVENTS

In order to impose prior information on the tail part of observations for DNN, we focus on two factors: memorizing extreme events and modeling the tail distribution. For the first factor we propose to use a memory network to memorize the characteristics of extreme events in history, and for the latter factor we propose to impose an approximated tail distribution on the observations and provide a novel classification loss called Extreme Value Loss (EVL). Finally, we combine these two factors and introduce the full solution for predicting time series data with extreme values.

4.1 Memory Network Module

As pointed out by Ghil et al., extreme events in time-series data often show some form of temporal regularity [19]. Inspired by this point, we propose to use a memory network to memorize these extreme events, which has proved effective in recognizing inherent patterns contained in historical information [45]. First, we define the concept of a window in our context.

4.1.1 Historical Window. For each time step t, we first randomly sample a sequence of windows W = {w_1, ..., w_M}, where M is the size of the memory network. Each window w_j is formally defined as w_j = [x_{t_j}, x_{t_j+1}, ..., x_{t_j+∆}], where ∆ is the size of the window, satisfying 0 < t_j < t − ∆.

Then we apply a GRU module to embed each window into the feature space. Specifically, we use w_j as the input and regard the last hidden state as the latent representation of this window, denoted as s_j = GRU([x_{t_j}, x_{t_j+1}, ..., x_{t_j+∆}]) ∈ R^H. Meanwhile, we apply a memory network module to memorize whether there is an extreme event at t_j + ∆ + 1 for each window w_j. In the implementation, we propose to feed the memory module with q_j = v_{t_j+∆+1} ∈ {−1, 0, 1}.

For an overview of our memory-network-based module, please see Fig. 3(a). In summary, at each time step t, the memory of our proposed architecture consists of the following two parts (a code sketch follows the list):

• Embedding Module S ∈ R^{M×H}: s_j is the latent representation of history window j.
• History Module Q ∈ {−1, 0, 1}^M: q_j is the label of whether there is an extreme event after window j.
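As a concrete illustration of how S and Q can be built, here is a small PyTorch sketch (our own; the paper does not prescribe this code, and names such as `build_memory`, `window_size`, and `memory_size`, as well as the 0-based indexing convention, are our assumptions). It samples M historical windows, embeds each with a shared GRU, and stores the extreme-event label observed right after each window.

```python
import torch
import torch.nn as nn

class MemoryModule(nn.Module):
    """Builds the embedding module S (M x H) and the history module Q (M,)."""

    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        # One shared GRU; the prediction branch reuses the same unit (Sec. 4.1.2).
        self.gru = nn.GRU(input_dim, hidden_dim, batch_first=True)

    def build_memory(self, x, v, t, memory_size, window_size):
        # x: (T, input_dim) input series; v: (T,) extreme-event indicators in {-1, 0, 1}.
        # With 0-based indexing, x[s : s + window_size] plays the role of w_j and
        # v[s + window_size] is the indicator of the step right after that window.
        starts = torch.randint(0, t - window_size, (memory_size,))       # window starts t_j
        windows = torch.stack([x[s:s + window_size] for s in starts])    # (M, delta, input_dim)
        _, h_last = self.gru(windows)                                    # (1, M, hidden_dim)
        S = h_last.squeeze(0)                                            # rows are the s_j
        Q = v[starts + window_size]                                      # labels q_j
        return S, Q
```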
Figure 3: Illustration of the predicting process: (a) the memory network module at time step t; (b) the attention mechanism for prediction. For each time step t, we sample M windows and use GRU to build the memory network module.
4.1.2 Attention Mechanism. In this part, we further incorporate the module demonstrated above into our framework for imbalanced time-series prediction. At each time step t, we use a GRU to produce the output value:

\tilde{o}_t = W_o^T h_t + b_o, \quad \text{where } h_t = \mathrm{GRU}([x_1, x_2, \cdots, x_t])    (12)

where h_t and s_j share the same GRU units. As we have discussed previously, the prediction õ_t may lack the ability to recognize extreme events in the future. Therefore we also require our model to retrospect its memory to check whether there is a similarity between the target event and extreme events in history. To achieve this, we utilize the attention mechanism [4] to our ends. Formally,

\alpha_{tj} = \frac{\exp(c_{tj})}{\sum_{j=1}^{M} \exp(c_{tj})}, \quad \text{where } c_{tj} = h_t^T s_j    (13)

Intuitively, if there is a similarity between the current time step and certain extreme events in history, then the predicted indicator u_t will help detect such a pumping point by being non-vanishing, while when the current event is observed to hardly have any relation with the history, the output will mainly depend on õ_t, i.e. the value predicted by a standard GRU gate. The loss function can be written as the square loss defined in Eq. 2, in order to minimize the distance between the output o_t and the observation y_t.
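A sketch of how the attention over the memory could be wired is given below (PyTorch, our own illustration). Eq. 13 supplies the attention weights; the step that turns them and the stored labels q_j into the indicator u_t and the final output o_t is not visible in this excerpt, so the aggregation u_t = Σ_j α_tj q_j and the additive combination with õ_t are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class AttentivePredictor(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, batch_first=True)   # shared with the memory module
        self.out = nn.Linear(hidden_dim, 1)                          # o~_t = W_o^T h_t + b_o  (Eq. 12)

    def forward(self, x_hist, S, Q):
        # x_hist: (1, t, input_dim) inputs up to time t
        # S: (M, hidden_dim) window embeddings, Q: (M,) labels in {-1, 0, 1}
        _, h_t = self.gru(x_hist)                     # (1, 1, hidden_dim)
        h_t = h_t.squeeze(0).squeeze(0)               # (hidden_dim,)
        o_tilde = self.out(h_t)                       # standard GRU prediction (Eq. 12)

        c = S @ h_t                                   # c_tj = h_t^T s_j
        alpha = torch.softmax(c, dim=0)               # attention weights alpha_tj (Eq. 13)

        # Assumed aggregation: attention-weighted vote over the historical extreme labels.
        u_t = (alpha * Q.float()).sum()

        # Assumed combination of the two branches; the paper's exact formula may differ.
        o_t = o_tilde + u_t
        return o_t, u_t
```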
4.2 Extreme Value Loss

Although the memory network can forecast some extreme events, such a loss function still suffers from the extreme event problem. Therefore we continue to model the second factor. As we have discussed in Sec. 3, the common square loss can lead to a nonparametric approximation of y_t. Without an imposed prior P(Y), the empirical estimation P̂(Y) can easily lead to the two kinds of phenomena above. Therefore, in order to influence the distribution of P(Y), we propose to impose the prior of tailed data on the loss function. Rather than modeling the output o_t directly, we pay our attention to the extreme event indicator u_t. For the sake of simplicity, we first consider right extreme events.

In order to incorporate the tail distribution into P(Y), we first consider the approximation defined in Eq. 6, which can approximate the tail distribution of the observations. In our problem, for an observation y_t, the approximation can be written as

1 - F(y_t) \approx \left[1 - P(v_t = 1)\right] \log G\!\left(\frac{y_t - \epsilon_1}{f(\epsilon_1)}\right)    (15)

where the positive function f is the scale function. Further, consider a binary classification task for detecting right extreme events. In our model the predicted indicator is u_t, which can be regarded as a hard approximation of (y_t − ϵ_1)/f(ϵ_1). We regard
the approximation as weights and add them to each term of the binary cross entropy,

\mathrm{EVL}(u_t) = -\left[1 - P(v_t = 1)\right] \log G(u_t)\, v_t \log(u_t) - \cdots
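For intuition, here is a small PyTorch sketch of a tail-weighted binary cross entropy in the spirit of EVL (our own simplification, not the paper's exact loss): the positive term is scaled by one minus the empirical extreme-event proportion, mirroring the factor 1 − P(v_t = 1) above, while the modulation through log G and the weight of the negative term are not recoverable from this excerpt and are replaced here by a plain symmetric class weighting.

```python
import torch

def evl_like_loss(u, v, p_extreme):
    """Weighted binary cross entropy sketch (simplified stand-in for EVL).

    u: predicted extreme-event indicator in (0, 1), shape (batch,)
    v: ground-truth indicator in {0, 1}, shape (batch,)
    p_extreme: empirical proportion of extreme events, a float in (0, 1)
    """
    eps = 1e-7
    u = u.clamp(eps, 1.0 - eps)
    pos_weight = 1.0 - p_extreme   # rare positives receive the larger weight
    neg_weight = p_extreme         # assumed counterpart for the negative term
    loss = -(pos_weight * v * torch.log(u) + neg_weight * (1.0 - v) * torch.log(1.0 - u))
    return loss.mean()

# Example: with 5% extreme events, missing an extreme point costs far more
# than mis-flagging a normal one.
u = torch.tensor([0.1, 0.8, 0.4])
v = torch.tensor([1.0, 1.0, 0.0])
print(evl_like_loss(u, v, p_extreme=0.05))
```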
Algorithm 1 The proposed framework (non-batch).
Input: X_{1:T+K} = [x_1, ..., x_T, x_{T+1}, ..., x_{T+K}], extreme indicators V_{1:T} = [v_1, ..., v_T], training outputs Y_{1:T} = [y_1, ..., y_T], hyper-parameter γ, memory size M, window size ∆, learning rate and regularization parameter.
Output: Predicted values [o_{T+1}, ..., o_{T+K}]
Initialize: Parameters θ, ϕ, ψ, b
1: procedure Train(Sample i)
2:   for t = 1, ..., T do
3:     Calculate the weights β in Eq. 16
4:     Construct the memory module S, Q (Sec. 4.1)
5:     for j = 1, ..., M do
          ...
19: end function

5 EMPIRICAL RESULTS

In this section, we present empirical results for our proposed framework. In particular, the main research questions are:

Figure 4: Case study of our framework via visualizations: (a) output from GRU; (b) output from our model; (c) output from õ_t. Although the outputs from õ_t still suffer from the extreme event problem, our model improves on them through EVL and the memory network module.
Figure 5: Influence of different hyper-parameters in the memory network module: (a) influence of memory size M; (b) influence of window size ∆. We only demonstrate the results on the climate datasets due to the limit of space.

of the prediction, while on the other hand, when the length is set too large, the GRU module in the memory network would lack the ability to memorize all the characteristics inside the window.
6 RELATED WORK

Time series prediction is a classical task in machine learning. Traditional methods such as the autoregressive moving average model [46] use linear regression on the latest inputs to predict the next value, while nonlinear autoregressive exogenous (NARX) models adopt a non-linear mapping to improve the prediction [31]. However, their learning capacity is limited because traditional methods use shallow architectures [6]. Recently, deep neural networks (DNN) have achieved great success in many fields, e.g., computer vision [28] and natural language processing [4], mainly because of their deep nonlinear architecture, which allows the extraction of better features from the inputs [39]. A number of DNN-based models have also been applied to time series prediction. Dasgupta and Osogami proposed to use a Boltzmann Machine [11], while Yang et al. proposed to use a Convolutional Network [49] to improve the performance. Among these deep learning models, the recurrent neural network (RNN) shows superior advantages in modeling time-series data. For instance, Diaconescu proposes to combine RNN with the NARX model, which largely reduces prediction errors [13, 48]. Numerous improvements have been made on RNN, such as Long Short-Term Memory [22] and the Gated Recurrent Unit [9], with various empirical results reporting the effectiveness of these deep recurrent models [30, 36, 50].

Data imbalance is always an important issue in machine learning and data mining. For instance, considering a certain classification task, if there exists one label which only has few samples in the training set, then the model will lack the ability to detect them [15]. Actually, cutting-edge deep learning models are also observed to suffer from the data imbalance problem in most cases. As pointed out by Lin et al., the data imbalance problem mainly causes two haunting phenomena in DNN: (1) the model lacks the ability to model rarely occurring samples; (2) the generalization performance of the model is degraded [32]. As we have discussed in the introductory part, in the context of time-series data, these imbalanced samples are often called extreme events. Previous studies mainly focus on modeling the distributions of these extreme values [1]. Intuitively, due to the rare and irregular essence of extreme events, it is always hard to forecast these pumping points [19].

7 CONCLUSION

In this paper, we focus on improving the performance of deep learning methods on time series prediction, specifically with a more fine-grained modeling of the extreme-event part. As discussed in Sec. 3, without prior knowledge on tailed observations, DNN is innately weak in capturing characteristics of the occurrences of extreme events. Therefore, as a novel technique delicately designed for extreme events, we have proposed a framework for forecasting time series data with extreme events. Specifically, we consider two factors for imposing tailed priors: (1) memorizing extreme events in historical data; (2) modeling the tail distribution of observations. For the first factor we utilize the recently proposed Memory Network technique by storing historical patterns of extreme events for future reference. For the second factor we propose a novel classification loss function called Extreme Value Loss (EVL) for detecting extreme events in the future, which incorporates the approximated tail distribution of observations. Finally, we combine them together and form our end-to-end solution for the prediction of time series with extreme events. With intensive experiments, we have validated both the effectiveness of EVL on extreme event detection and the superior performance of our proposed framework on time series prediction, compared with the state of the art. For future work, we plan to work on an efficient extension of our framework to multi-dimensional time series prediction, in consideration of its significance in practice and its challenges in theory. Furthermore, we suggest it would also be interesting to exploit the possibility of applying our proposed EVL to derive solutions for other tasks featured with data imbalance, such as Point-of-Interest (POI) recommendation [14].
ACKNOWLEDGMENTS

The work is supported in part by the National Program on Key Basic Research (NO. 2015CB358800), the National Natural Science Foundation of China (U1636204, 61602121, U1736208, 61602123, U1836213, U1836210) and the Thousand Youth Talents Program 2018. Min Yang is also a member of the Shanghai Institute of Intelligent Electronics & Systems, the Shanghai Institute for Advanced Communication and Data Science, and the Engineering Research Center of Cyber Security Auditing and Monitoring, Ministry of Education, China.
REFERENCES
[1] Sergio Albeverio, Volker Jentsch, and Holger Kantz. 2006. Extreme events in nature and society. Springer Science & Business Media.
[2] Eduardo Altmann and Holger Kantz. 2005. Recurrence time analysis, long-term correlations, and extreme events. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics (2005).
[3] Martin Arjovsky and Léon Bottou. 2017. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862 (2017).
[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[5] Arindam Banerjee, Srujana Merugu, Inderjit S Dhillon, and Joydeep Ghosh. 2005. Clustering with Bregman divergences. Journal of Machine Learning Research 6, Oct (2005), 1705–1749.
[6] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. 2007. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems. 153–160.
[7] Tine Buch-Larsen, Jens Perch Nielsen, Montserrat Guillén, and Catalina Bolancé. 2005. Kernel density estimation for heavy-tailed distributions using the Champernowne transformation. Statistics 39, 6 (2005), 503–516.
[8] Armin Bunde, Jan F Eichner, Shlomo Havlin, and Jan W Kantelhardt. 2003. The effect of long-term correlations on the return periods of rare events. Physica A: Statistical Mechanics and its Applications 330, 1-2 (2003), 1–7.
[9] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014).
[10] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
[11] Sakyasingha Dasgupta and Takayuki Osogami. 2017. Nonlinear Dynamic Boltzmann Machines for Time-Series Prediction. In AAAI. 1833–1839.
[12] Laurens De Haan and Ana Ferreira. 2007. Extreme value theory: an introduction. Springer Science & Business Media.
[13] Eugen Diaconescu. 2008. The use of NARX neural networks to predict chaotic time series. WSEAS Transactions on Computer Research 3, 3 (2008), 182–191.
[14] Daizong Ding, Mi Zhang, Xudong Pan, Duocai Wu, and Pearl Pu. 2018. Geographical Feature Extraction for Entities in Location-based Social Networks. In Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 833–842.
[15] Li Fei-Fei, Rob Fergus, and Pietro Perona. 2006. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 4 (2006), 594–611.
[16] Fuli Feng, Xiangnan He, Xiang Wang, Cheng Luo, Yiqun Liu, and Tat-Seng Chua. 2019. Temporal Relational Ranking for Stock Prediction. ACM Transactions on Information Systems (TOIS) 37, 2 (2019), 27.
[17] R. A. Fisher and L. H. C. Tippett. 1928. Limiting forms of the frequency distribution of the largest or smallest member of a sample. Mathematical Proceedings of the Cambridge Philosophical Society 24, 2 (1928), 180–190.
[18] Janos Galambos. 1977. The asymptotic theory of extreme order statistics. In The Theory and Applications of Reliability with Emphasis on Bayesian and Nonparametric Methods. Elsevier, 151–164.
[19] M Ghil, P Yiou, Stéphane Hallegatte, BD Malamud, P Naveau, A Soloviev, P Friederichs, V Keilis-Borok, D Kondrashov, V Kossobokov, et al. 2011. Extreme events: dynamics, statistics and prediction. Nonlinear Processes in Geophysics 18, 3 (2011), 295–350.
[20] Gnedenko. 1943. Sur la distribution limite du terme maximum d'une série aléatoire. Annals of Mathematics 44, 3 (1943), 423–453.
[21] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems. 2672–2680.
[22] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[23] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 5 (1989), 359–366.
[24] Holger Kantz, Eduardo G Altmann, Sarah Hallerberg, Detlef Holstein, and Anja Riegert. 2006. Dynamical interpretation of extreme events: predictability and predictions. In Extreme events in nature and society. Springer, 69–93.
[25] Charles David Keeling and Timothy P Whorf. 2004. Atmospheric CO2 concentrations derived from flask air samples at sites in the SIO network. Trends: a compendium of data on Global Change (2004).
[26] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[27] Samuel Kotz and Saralees Nadarajah. 2000. Extreme value distributions. Theory and applications. Prentice Hall, 207–243 pages.
[28] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[29] Holden Lee, Rong Ge, Tengyu Ma, Andrej Risteski, and Sanjeev Arora. 2017. On the ability of neural nets to express distributions. arXiv preprint arXiv:1702.07028 (2017).
[30] Tao Lin, Tian Guo, and Karl Aberer. 2017. Hybrid neural networks for learning the trend in time series. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17. 2273–2279.
[31] Tsungnan Lin, Bill G Horne, Peter Tino, and C Lee Giles. 1996. Learning long-term dependencies in NARX recurrent neural networks. IEEE Transactions on Neural Networks 7, 6 (1996), 1329–1338.
[32] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2018. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018).
[33] Edward N Lorenz. 1963. Deterministic nonperiodic flow. Journal of the Atmospheric Sciences 20, 2 (1963), 130–141.
[34] DD Lucas, C Yver Kwok, P Cameron-Smith, H Graven, D Bergmann, TP Guilderson, R Weiss, and R Keeling. 2015. Designing optimal greenhouse gas observing networks that consider performance and cost. Geoscientific Instrumentation, Methods and Data Systems 4, 1 (2015), 121–137.
[35] T Okubo and N Narita. 1980. On the distribution of extreme winds expected in Japan. National Bureau of Standards Special Publication 560 (1980), 1.
[36] Yao Qin, Dongjin Song, Haifeng Chen, Wei Cheng, Guofei Jiang, and Garrison Cottrell. 2017. A dual-stage attention-based recurrent neural network for time series prediction. arXiv preprint arXiv:1704.02971 (2017).
[37] Tomasz Rolski, Hanspeter Schmidli, Volker Schmidt, and Jozef L Teugels. 2009. Stochastic processes for insurance and finance. Vol. 505. John Wiley & Sons.
[38] Murray Rosenblatt. 1956. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics (1956), 832–837.
[39] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. 2017. Dynamic routing between capsules. In Advances in Neural Information Processing Systems. 3856–3866.
[40] Ajit P Singh and Geoffrey J Gordon. 2008. Relational learning via collective matrix factorization. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 650–658.
[41] Jeroen Van den Berg, Bertrand Candelon, and Jean-Pierre Urbain. 2008. A cautious note on the use of panel models to predict financial crises. Economics Letters 101, 1 (2008), 80–83.
[42] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems. 3630–3638.
[43] Ladislaus von Bortkiewicz. 1921. Variationsbreite und mittlerer Fehler. Berliner Mathematische Gesellschaft.
[44] Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. Neural Graph Collaborative Filtering. In SIGIR.
[45] Jason Weston, Sumit Chopra, and Antoine Bordes. 2014. Memory Networks. CoRR abs/1410.3916 (2014).
[46] Peter Whittle. 1951. Hypothesis testing in time series analysis. Vol. 4. Almqvist & Wiksells.
[47] Rym Worms. 1998. Propriété asymptotique des excès additifs et valeurs extrêmes: le cas de la loi de Gumbel. Comptes Rendus de l'Académie des Sciences Series I Mathematics 5, 327 (1998), 509–514.
[48] Linjun Yan, Ahmed Elgamal, and Garrison W Cottrell. 2011. Substructure vibration NARX neural network approach for statistical damage inference. Journal of Engineering Mechanics 139, 6 (2011), 737–747.
[49] Jianbo Yang, Minh Nhut Nguyen, Phyo Phyo San, Xiaoli Li, and Shonali Krishnaswamy. 2015. Deep Convolutional Neural Networks on Multichannel Time Series for Human Activity Recognition. In IJCAI, Vol. 15. 3995–4001.
[50] Yu Zhu, Hao Li, Yikang Liao, Beidou Wang, Ziyu Guan, Haifeng Liu, and Deng Cai. 2017. What to do next: Modeling user behaviors by time-lstm. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17. 3602–3608.