
Modeling Extreme Events in Time Series Prediction

Daizong Ding, Mi Zhang∗
Fudan University, China
{17110240010,mi_zhang}@fudan.edu.cn

Xudong Pan, Min Yang
Fudan University, China
{18110240010,m_yang}@fudan.edu.cn

Xiangnan He
University of Science and Technology of China, China

∗ The corresponding author is Mi Zhang.

ABSTRACT
Time series prediction is an intensively studied topic in data mining. In spite of considerable improvements, recent deep learning-based methods overlook the existence of extreme events, which results in weak performance when they are applied to real time series. Extreme events are rare and random, but play a critical role in many real applications, such as the forecasting of financial crises and natural disasters. In this paper, we explore the central theme of improving the ability of deep learning to model extreme events for time series prediction.
Through the lens of formal analysis, we first find that the weakness of deep learning methods roots in the conventional form of quadratic loss. To address this issue, we take inspiration from Extreme Value Theory, developing a new form of loss called Extreme Value Loss (EVL) for detecting the future occurrence of extreme events. Furthermore, we propose to employ a Memory Network in order to memorize extreme events in historical records. By incorporating EVL with an adapted memory network module, we achieve an end-to-end framework for time series prediction with extreme events. Through extensive experiments on synthetic data and two real datasets of stock and climate, we empirically validate the effectiveness of our framework. Besides, we also provide a proper choice for the hyper-parameters of our proposed framework by conducting several additional experiments.

CCS CONCEPTS
• Mathematics of computing → Probabilistic algorithms; • Computing methodologies → Neural networks;

KEYWORDS
Extreme Event, Memory Network, Attention Model

KDD '19, August 4–8, 2019, Anchorage, AK, USA. © 2019 Association for Computing Machinery. ACM ISBN 978-1-4503-6201-6/19/08. https://doi.org/10.1145/3292500.3330896

1 INTRODUCTION
Time series prediction, as a classical research topic, has been intensively studied by interdisciplinary researchers over the past several decades. As its application increasingly ventures into safety-critical real-world scenarios, such as climate prediction [35] and stock price monitoring [16], how to obtain more accurate predictions remains an open problem.
Historically, traditional methods such as autoregressive moving average (ARMA) [46] and nonlinear autoregressive exogenous (NARX) [31] use statistical models with few parameters to exploit patterns in time series data. Recently, with the celebrated success of the Deep Neural Network (DNN) in many fields such as image classification [28] and machine translation [4], a number of DNN-based techniques have been subsequently developed for time-series prediction tasks, achieving noticeable improvements over traditional methods [11, 49]. As a basic component of these models, the Recurrent Neural Network (RNN) module serves as an indispensable factor for these noteworthy improvements [31, 48]. Compared with traditional methods, one major advantage of the RNN structure is that it enables deep non-linear modeling of temporal patterns. In recent literature, some of its variants show even better empirical performance, such as the well-known Long Short-Term Memory (LSTM) [22, 36, 50] and the Gated Recurrent Unit (GRU) [10], while the latter appears to be more efficient on smaller and simpler datasets [10]. However, most previously studied DNNs are observed to have trouble dealing with data imbalance [15, 42, 44]. Illustratively, let us consider a binary classification task whose training set includes 99% positive samples and only 1% negative samples, which is said to contain data imbalance. Following the discussion in Lin et al., such an imbalance in data will potentially bring any classifier into either of two unexpected situations: a. the model hardly learns any pattern and simply chooses to recognize all samples as positive; b. the model memorizes the training set perfectly while it generalizes poorly to the test set.
In fact, we have observed that, in the context of time-series prediction, imbalanced data in time series, or extreme events, is also harmful to deep learning models. Intuitively, an extreme event in time series is usually featured by extremely small or large values of irregular and rare occurrence [24]. As an empirical justification of its harmfulness to deep learning models, we train a standard GRU to predict one-dimensional time series, where certain thresholds are used to label a small proportion of the dataset as extreme events in prior (dotted line in Fig. 1). As clearly shown, the learning model indeed falls into the two previously discussed situations: a. In Fig. 1(a), most of its predictions are bounded by the thresholds and therefore it fails to recognize future extreme events; we call this the underfitting phenomenon. b. In Fig. 1(b), although the model learns extreme events in the training set correctly, it behaves poorly on the test set; we call this the overfitting phenomenon. Previously, people always tended to tolerate the underfitting phenomenon since the models would still have an averagely tolerable performance on test sets. However, from our perspective, it would be really valuable if a time-series prediction model could recognize future extreme events with reasonable predictions.

With more accurate modeling of extreme events in many real-world cases, prediction models are expected to aid influential decisions by providing alarms on future incidents such as extreme winds [35] or financial crises [41].
With the motivations above, in this paper we focus on improving the performance of DNNs on predicting time series with extreme values. First, besides the empirical validation above, we present a formal analysis of why a DNN can easily fall into the underfitting or overfitting phenomenon when it is trained on time series with extreme events. Through the lens of Extreme Value Theory (EVT), we observe that the main reason lies in previous choices of loss function, which inherently lack the ability to model extreme events in a fine-grained way. Therefore we propose a novel form of loss called Extreme Value Loss (EVL) to improve predictions on occurrences of extreme events. Furthermore, we take inspiration from previous studies on the dynamics of extreme events, which pointed out that the randomness of extreme events has limited degrees of freedom (DOF) [33] and that, as a result, their patterns could indeed be memorized [2, 8]. We therefore propose a neural architecture to memorize extreme events from historical information, with the aid of a Memory Network [45]. Together with our proposed EVL, our end-to-end framework is thus constructed for better predictions on time series data with extreme events. Our main contributions are:
• We provide a formal analysis of why a deep neural network suffers the underfitting or overfitting phenomenon when predicting time series data with extreme values.
• We propose a novel loss function called Extreme Value Loss (EVL) based on extreme value theory, which provides better predictions on future occurrences of extreme events.
• We propose a brand-new Memory Network based neural architecture to memorize extreme events in history for better predictions of future extreme values. Experimental results validate the superiority of our framework in prediction accuracy compared with the state-of-the-arts.

Figure 1: The extreme event problem in time-series prediction. The data are sampled from the climate dataset. (a) Underfitting phenomenon. (b) Overfitting phenomenon.

2 PRELIMINARIES
In this section, we briefly describe the time-series prediction problem and introduce extreme events in time-series data. Table 1 summarizes the mathematical notations used throughout the paper.

Table 1: Mathematical Notations
Symbol       Size           Description
x_t^{(i)}    R              Input at time t in the i-th sequence
y_t^{(i)}    R              Output at time t in the i-th sequence
v_t^{(i)}    {-1, 0, 1}     Extreme event indicator at time t in the i-th sequence
N            N              Number of sequences
T            N              Training length of each sequence
H            N              Size of latent factors in the GRU
M            N              Size of the memory module
∆            N              Size of each window
o_t          R              Prediction at time t in the i-th sequence
h_t          R^H            Hidden state from the GRU at time t
w_j          R^∆            Window j of the memory network
s_j          R^H            Latent representation of window j
q_j          {-1, 0, 1}     Extreme event indicator of window j
p_j          [-1, 1]        Prediction of the extreme event indicator of window j
õ_t          R              Prediction from the GRU part at time t
α_t          R^M            Attentive weights at time t
u_t          [-1, 1]        Prediction of the extreme event at time t

2.1 Time Series Prediction
Suppose there are N sequences of fixed length T. For the i-th sequence, the time series data can be described as

\left(X^{(i)}_{1:T}, Y^{(i)}_{1:T}\right) = \left\{(x^{(i)}_1, y^{(i)}_1), (x^{(i)}_2, y^{(i)}_2), \cdots, (x^{(i)}_T, y^{(i)}_T)\right\}    (1)

where x^{(i)}_t and y^{(i)}_t are the input and output at time t respectively. In one-dimensional time series prediction we have x^{(i)}_t, y^{(i)}_t ∈ R and y^{(i)}_t := x^{(i)}_{t+1}. For the sake of convenience, we will use X_{1:T} = [x_1, ..., x_T] and Y_{1:T} = [y_1, ..., y_T] to denote general sequences without referring to specific sequences.
The goal of time-series prediction is, given observations (X_{1:T}, Y_{1:T}) and future inputs X_{T:T+K}, to predict the outputs Y_{T:T+K} in the future. Suppose a model predicts o_t at time t given input x_t; the common optimization goal can be written as

\min \sum_{t=1}^{T} \| o_t - y_t \|^2    (2)

Then, after inference, the model can predict the corresponding outputs O_{1:T+K} given inputs X_{1:T+K}. Traditional methods such as the autoregressive moving average model (ARMA) [46] and nonlinear autoregressive exogenous (NARX) [31] predict outputs by conducting linear or non-linear regression on past inputs. Recently, deep neural networks such as the Recurrent Neural Network (RNN) have shown superior advantages compared with traditional methods in modeling time-series data. Numerous improvements have been made on the RNN, such as Long Short-Term Memory [22] and the Gated Recurrent Unit [9].

2.2 Extreme Events
Although DNNs such as the GRU have achieved noticeable improvements in predicting time-series data, such models tend to fall into either overfitting or underfitting if trained with imbalanced time series, as we have demonstrated in the introductory part. We will refer to such a phenomenon as the Extreme Event Problem. Towards a formal understanding of this phenomenon, it will be convenient to introduce an auxiliary indicator sequence V_{1:T} = [v_1, ..., v_T]:

v_t = \begin{cases} 1 & y_t > \epsilon_1 \\ 0 & y_t \in [-\epsilon_2, \epsilon_1] \\ -1 & y_t < -\epsilon_2 \end{cases}    (3)

where the large constants ε_1, ε_2 > 0 are called thresholds. For time step t, if v_t = 0, we define the output y_t as a normal event. If v_t > 0, we define the output y_t as a right extreme event. If v_t < 0, we define the output y_t as a left extreme event.
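As a concrete illustration of Eq. 3, the indicator sequence can be computed directly from the observed outputs. The sketch below is a minimal example, not code from the paper; it assumes one-dimensional observations in a NumPy array and thresholds eps1 and eps2 chosen by the user (here from empirical quantiles, a hypothetical choice).

```python
import numpy as np

def extreme_event_indicator(y, eps1, eps2):
    """Compute v_t in {-1, 0, 1} for each observation y_t (Eq. 3).

    y    : 1-D array of outputs y_1, ..., y_T
    eps1 : threshold for right extreme events (y_t >  eps1)
    eps2 : threshold for left extreme events  (y_t < -eps2)
    """
    v = np.zeros_like(y, dtype=int)
    v[y > eps1] = 1        # right extreme event
    v[y < -eps2] = -1      # left extreme event
    return v               # everything else stays 0 (normal event)

# Example: label roughly the top/bottom 5% of values as extreme events.
y = np.random.randn(500)
eps1 = np.quantile(y, 0.95)
eps2 = -np.quantile(y, 0.05)
v = extreme_event_indicator(y, eps1, eps2)
```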
2.2.1 Heavy-tailed Distributions. Much research in other tasks pays attention to such large observations; for example, previous work notices that the empirical distribution of real-world data often appears to be heavy-tailed [37]. Intuitively, if a random variable Y is said to follow a heavy-tailed distribution, then it usually has a non-negligible probability of taking large values (larger than a threshold) [37]. In fact, a majority of widely applied distributions, including the Gaussian and the Poisson, are not heavy-tailed and are therefore light-tailed. Only a few parametric distributions are heavy-tailed, e.g. the Pareto distribution and the log-Cauchy distribution. Therefore, modeling with light-tailed parametric distributions brings unavoidable losses in the tail part of the data. Such a statement can be illustratively presented with Fig. 2(a), where we fit a light-tailed truncated normal distribution to the data: although the light-tailed distribution fits the data around the center quite well, the inaccuracy on the tail part is intolerable.

Figure 2: Distributions of y_t in time-series data, where y_t are sampled from the climate dataset as introduced in the experiments. (a) Illustration of a heavy-tail distribution. (b) Illustration of the optimized P(o).

2.2.2 Extreme Value Theory. Historically, Extreme Value Theory (EVT) takes a further step in studying such heavy-tailed data. EVT studies the distribution of the maximum of observed samples [43]. Formally speaking, suppose T random variables y_1, ..., y_T are i.i.d. samples from distribution F_Y; then the distribution of the maximum is

\lim_{T \to \infty} P\{\max(y_1, \cdots, y_T) \le y\} = \lim_{T \to \infty} F^T(y) = 0    (4)

In order to obtain a non-vanishing form of P{max(y_1, ..., y_T) ≤ y}, previous research proceeded by performing a linear transformation on the maximum. As a fundamental result in EVT, the following theorem states that the distribution of Y after the linear transformation is always limited to a few cases.

Theorem 2.1 ([17, 20]). If there exists a linear transformation on Y which makes the distribution in Eq. 4 non-degenerate to 0, then the class of the non-degenerate distribution G(y) after the transformation must be of the following form:

G(y) = \begin{cases} \exp\left(-\left(1 - \frac{1}{\gamma} y\right)^{\gamma}\right), & \gamma \neq 0,\ 1 - \frac{1}{\gamma} y > 0 \\ \exp\left(-e^{-y}\right), & \gamma = 0 \end{cases}    (5)

Usually, the form G(y) is called the Generalized Extreme Value distribution, with γ ≠ 0 as the extreme value index. Such a statement is sometimes also regarded as the law of large numbers for the maximum [27]. In fact, the theorem above has a natural extension to observations which exceed a certain fixed threshold, which will be useful in the next part.

2.2.3 Modeling The Tail. Previous works extend the above theorem to model the tail distribution of real-world data by [18, 47]

1 - F(y) \approx (1 - F(\xi)) \left[ 1 - \log G\left( \frac{y - \xi}{f(\xi)} \right) \right], \quad y > \xi    (6)

where ξ > 0 is a sufficiently large threshold. Previous research points out that the approximation in Eq. 6 can fit the tail distribution well [12]. Although there are many methods for modeling the distributions of extreme values [1], due to the rare and irregular nature of extreme events, it is always hard to forecast these pumping points [19]. What is worse, these extreme events can affect the learning of deep neural networks, for reasons we discuss in detail in the next section.
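For intuition, the GEV form of Eq. 5 and the tail approximation of Eq. 6 can be written down directly. The sketch below is only an illustration under the parameterization printed above, restricted to γ > 0 (the range used later for EVL); the scale value f(ξ) is assumed to be supplied by the user and is a hypothetical constant here.

```python
import numpy as np

def gev_cdf(y, gamma):
    """GEV CDF G(y) of Eq. 5, written for gamma > 0."""
    z = np.maximum(1.0 - y / gamma, 0.0)   # outside the support (y >= gamma) G(y) = 1
    return np.exp(-(z ** gamma))

def tail_approx(y, xi, F_xi, f_xi, gamma):
    """Approximate 1 - F(y) for y > xi following Eq. 6.

    xi   : large threshold
    F_xi : empirical CDF evaluated at xi, i.e. F(xi)
    f_xi : positive scale value f(xi), assumed given
    """
    g = gev_cdf((y - xi) / f_xi, gamma)
    return (1.0 - F_xi) * (1.0 - np.log(g))
```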
3 PROBLEMS CAUSED BY EXTREME EVENTS
In this part we explain why the extreme event problem is almost inevitable for previously studied DNN models in time-series prediction.

3.1 Empirical Distribution After Optimization
We further investigate the influence of extreme events in time series prediction. For the sake of simplicity, we only pay attention to one sequence, that is, X_{1:T} and Y_{1:T}. From the probabilistic perspective, minimizing the loss function in Eq. 2 is in essence equivalent to maximizing the likelihood P(y_t | x_t). Based on Bregman's theory [5, 40], minimizing such a square loss always corresponds to a Gaussian likelihood with variance τ, that is, p(y_t | x_t, θ) = N(o_t, τ²), where θ is the parameter of the predicting model and O_{1:T} are the outputs from the model.
Therefore, Eq. 2 can be replaced with its equivalent optimization problem as follows:

\max_{\theta} \prod_{t=1}^{T} P(y_t | x_t, \theta)    (7)

With Bayes' theorem, the likelihood above can be written as

P(Y | X) = \frac{P(X | Y) P(Y)}{P(X)}    (8)

By assuming the model has sufficient learning capacity with parameter θ [23, 29], we claim the inference problem will yield an optimal approximation to P(Y | X).

It is worth noticing that our assumption on learning capacity is widely adopted in previous research [3, 21] and can be implemented with a deep neural network structure in practice. Furthermore, if P(Y | X) has been perfectly learned, so have the distributions P(Y), P(X), P(X | Y), which are therefore totally independent of the inputs X. By considering the following observations,
• The discriminative model (Eq. 2) has no prior on y_t;
• The output o_t is learned under a likelihood in the form of a normal distribution;
it is therefore reasonable to state that the empirical distribution P(Y) after optimization should be of the following form,

\hat{P}(Y) = \frac{1}{N} \sum_{t=1}^{T} N(y_t, \hat{\tau}^2)    (9)

where the constant τ̂ is an unknown variance. In consideration of its similarity to a Kernel Density Estimator (KDE) with a Gaussian kernel [38], we can reach an intermediate conclusion that such a model would perform relatively poorly if the true distribution of the data in the series is heavy-tailed, according to [7].
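The effect can be reproduced numerically by constructing the Gaussian mixture of Eq. 9 around a heavy-tailed sample. The snippet below is a self-contained illustration, not code from the paper; the sample size, bandwidth and reference distribution are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.standard_t(df=2, size=500)      # heavy-tailed "observations" y_1..y_T
tau_hat = 0.2                            # bandwidth of each Gaussian component

def p_hat(grid, samples, tau):
    """The optimized empirical distribution of Eq. 9: an average of Gaussians centered at the y_t."""
    z = (grid[:, None] - samples[None, :]) / tau
    return np.exp(-0.5 * z ** 2).mean(axis=1) / (tau * np.sqrt(2.0 * np.pi))

grid = np.linspace(-15.0, 15.0, 3001)
est = p_hat(grid, y, tau_hat)
true = stats.t.pdf(grid, df=2)
# Plotting est against true reproduces the qualitative picture of Fig. 2(b): the estimate
# is spiky around the few extreme samples and collapses between them, instead of following
# the slowly decaying true tail.
```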
3.2 Why Deep Neural Networks Suffer from the Extreme Event Problem
As discussed above, the distribution of the output from a learning model with optimal parameters can be regarded as a KDE with a Gaussian kernel (Eq. 9).
Since a nonparametric kernel density estimator only works well with sufficient samples, its performance is expected to decrease at the tail part of the data, where sampled data points are rather limited [7]. The main reason is that the range of extreme values is commonly very large, so a few samples can hardly cover it. As depicted in Fig. 2(b), we sample y_t from the true distribution and fit a KDE with a Gaussian kernel. As shown, since there are only two samples with y_t > 1.5, the shape of the fitted KDE peaks inconsistently around these points. Moreover, as a large majority of samples are centered at 0, the probability density around the origin estimated by the KDE tends to be much higher than the true distribution.
Formally, let us suppose x_1, x_2 are two test samples with corresponding outputs o_1 = 0.5, o_2 = 1.5. As our studied model is assumed to have sufficient learning capacity for modeling P(X), P(X | Y), we have

P(y_1 | x_1, \theta) = \frac{P(X | Y) \hat{P}(Y)}{P(X)} \ge \frac{P(X | Y) P_{true}(Y)}{P(X)} = P_{true}(y_1 | x_1)    (10)

Similarly, P(y_2 | x_2, θ) ≤ P_true(y_2 | x_2). Therefore, in this case, the predicted values from the deep neural network are always bounded, which immediately disables the deep model from predicting extreme events, i.e. causes the underfitting phenomenon.
On the other side, as discussed in the related work, several methods propose to accent extreme points during training by, for example, increasing the weight on their corresponding training losses. In our formulation, these methods are equivalent to repeating the extreme points several times in the dataset when fitting the KDE. Its outcome is illustrated by the dotted line in Fig. 2(b). As a consequence, we have

P(y_2 | x_2, \theta) = \frac{P(X | Y) \hat{P}(Y)}{P(X)} \ge \frac{P(X | Y) P_{true}(Y)}{P(X)} = P_{true}(y_2 | x_2)    (11)

Intuitively, the inequality above indicates that, with the estimated probability of extreme events being added up, the estimation of normal events simultaneously becomes inaccurate. Therefore, normal data in the test set are prone to be mis-classified as extreme events, which marks the overfitting phenomenon.
As we can see, the extreme event problem in DNNs is mainly caused by the lack of a sufficient prior on the tail part of the observations y_t. Purely maximizing the likelihood leads to a nonparametric estimation of y_t, which can easily cause the underfitting problem. On the other side, if we increase the weight on those large values, the DNN can easily suffer the overfitting problem. In order to alleviate these problems in DNNs, we provide a solution which aims at imposing a prior on extreme events for DNNs in predicting time series data.

4 PREDICTING TIME-SERIES DATA WITH EXTREME EVENTS
In order to impose prior information on the tail part of observations for DNNs, we focus on two factors: memorizing extreme events and modeling the tail distribution. For the first factor we propose to use a memory network to memorize the characteristics of extreme events in history, and for the latter factor we propose to impose an approximated tail distribution on the observations and provide a novel classification loss called Extreme Value Loss (EVL). Finally we combine these two factors and introduce the full solution for predicting time series data with extreme values.

4.1 Memory Network Module
As pointed out by Ghil et al., extreme events in time-series data often show some form of temporal regularity [19]. Inspired by this point, we propose to use a memory network to memorize these extreme events, which has proved to be effective in recognizing inherent patterns contained in historical information [45]. First, we define the concept of a window in our context.

4.1.1 Historical Window. For each time step t, we first randomly sample a set of windows W = {w_1, ..., w_M}, where M is the size of the memory network. Each window w_j is formally defined as w_j = [x_{t_j}, x_{t_j+1}, ..., x_{t_j+∆}], where ∆ is the size of the window and 0 < t_j < t − ∆.
Then we propose to apply a GRU module to embed each window into a feature space. Specifically, we use w_j as the input and regard the last hidden state as the latent representation of this window, denoted as s_j = GRU([x_{t_j}, x_{t_j+1}, ..., x_{t_j+∆}]) ∈ R^H. Meanwhile, we apply a memory network module to memorize whether there is an extreme event at t_j+∆+1 for each window w_j. In implementation, we propose to feed the memory module with q_j = v_{t_j+∆+1} ∈ {−1, 0, 1}.
For an overview of our memory network based module, please see Fig. 3(a). In summary, at each time step t, the memory of our proposed architecture consists of the following two parts (a code sketch follows the list):
• Embedding Module S ∈ R^{M×H}: s_j is the latent representation of history window j.
• History Module Q ∈ {−1, 0, 1}^M: q_j is the label of whether there is an extreme event after window j.
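The following sketch shows one way the two memory parts could be built at a given time step, assuming a PyTorch GRU shared with the main prediction branch. The sampling strategy, tensor shapes and 0-based indexing are our own illustrative choices, not the authors' released code.

```python
import torch
import torch.nn as nn

H, M, DELTA = 32, 80, 50          # latent size, memory size, window size (M and DELTA as in Sec. 5.1)
gru = nn.GRU(input_size=1, hidden_size=H, batch_first=True)   # GRU encoder shared with the prediction branch

def build_memory(x, v, t):
    """Build the embedding module S (M x H) and the history module Q (M,) at time step t.

    x : 1-D float tensor of inputs x_1..x_T;  v : 1-D tensor of indicators v_1..v_T in {-1, 0, 1}.
    """
    starts = torch.randint(1, t - DELTA, (M,)).tolist()         # random t_j with 0 < t_j < t - DELTA
    windows = torch.stack([x[tj:tj + DELTA] for tj in starts])  # (M, DELTA)
    _, h_last = gru(windows.unsqueeze(-1))                      # last hidden state per window: (1, M, H)
    S = h_last.squeeze(0)                                       # s_j = GRU(w_j)
    Q = torch.stack([v[tj + DELTA] for tj in starts])           # q_j: indicator right after window j (0-indexed)
    return S, Q
```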
Figure 3: Illustration of the predicting process. For each time step t, we sample M windows and use the GRU to build the memory network module. (a) Illustration of the memory network module at time step t. (b) Attention mechanism for prediction.

4.1.2 Attention Mechanism. In this part, we further incorporate the module demonstrated above into our framework for imbalanced time-series prediction. At each time step t, we use the GRU to produce the output value:

\tilde{o}_t = W_o^T h_t + b_o, \quad \text{where } h_t = \text{GRU}([x_1, x_2, \cdots, x_t])    (12)

where h_t and s_j share the same GRU units. As we have discussed previously, the prediction õ_t may lack the ability to recognize extreme events in the future. Therefore we also require our model to retrospect its memory to check whether there is a similarity between the target event and extreme events in history. To achieve this, we propose to utilize the attention mechanism [4]. Formally,

\alpha_{tj} = \frac{\exp(c_{tj})}{\sum_{j=1}^{M} \exp(c_{tj})}, \quad \text{where } c_{tj} = h_t^T s_j    (13)

Finally, the prediction of whether an extreme event will happen after referring to historical information can be measured by imposing the attentive weights on q_j. The output of our model at time step t is calculated as

o_t = \tilde{o}_t + b^T \cdot u_t, \quad \text{where } u_t = \sum_{j=1}^{M} \alpha_{tj} \cdot q_j    (14)

In the definition, u_t ∈ [−1, 1] is the prediction of whether there will be an extreme event after time step t, and b ∈ R_+ is the scale parameter. Intuitively, the main advantage of our model lies in enabling a flexible switch between yielding predictions of normal values and of extreme values. When there is a similarity between the current time step and certain extreme events in history, u_t helps detect such a pumping point by becoming non-vanishing, while when the current event is observed to hardly have any relation with the history, the output mainly depends on õ_t, i.e. the value predicted by a standard GRU gate. The loss function can be written as the square loss defined in Eq. 2 in order to minimize the distance between the output o_t and the observation y_t.
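Putting Eqs. 12–14 together, one prediction step could look like the sketch below. It reuses the shared gru, build_memory and H from the previous snippet and is only a schematic reading of the equations; the output-layer shapes and the scalar treatment of b are our simplifications.

```python
import torch
import torch.nn as nn

out_layer = nn.Linear(H, 1)               # W_o, b_o of Eq. 12
b_scale = nn.Parameter(torch.ones(1))     # scale parameter b of Eq. 14, treated as a scalar here

def predict_step(x, v, t):
    """One prediction o_t following Eqs. 12-14."""
    h_seq, _ = gru(x[:t].view(1, -1, 1))          # encode x_1..x_t
    h_t = h_seq[0, -1]                            # hidden state h_t, shape (H,)
    o_tilde = out_layer(h_t)                      # Eq. 12: value predicted by the GRU branch

    S, Q = build_memory(x, v, t)                  # memory over M historical windows
    c = S @ h_t                                   # c_{tj} = h_t^T s_j
    alpha = torch.softmax(c, dim=0)               # Eq. 13: attention weights over windows
    u_t = (alpha * Q.float()).sum()               # Eq. 14: soft vote of history labels, in [-1, 1]
    return o_tilde + b_scale * u_t, u_t
```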
4.2 Extreme Value Loss
Although the memory network can forecast some extreme events, such a loss function still suffers from the extreme event problem. Therefore we continue to model the second factor. As we have discussed in Sec. 3, the common square loss leads to a nonparametric approximation of y_t. Without an imposed prior P(Y), the empirical estimation P̂(Y) can easily lead to the two kinds of phenomena. Therefore, in order to influence the distribution P(Y), we propose to impose a prior of tailed data on the loss function. Rather than modeling the output o_t directly, we pay attention to the extreme event indicator u_t. For the sake of simplicity, we first consider right extreme events.
In order to incorporate the tail distribution into P(Y), we first consider the approximation defined in Eq. 6, which can approximate the tail distribution of observations. In our problem, for observation y_t, the approximation can be written as

1 - F(y_t) \approx \left[ 1 - P(v_t = 1) \log G\left( \frac{y_t - \epsilon_1}{f(\epsilon_1)} \right) \right]    (15)

where the positive function f is the scale function. Further, we consider a binary classification task for detecting right extreme events. In our model the predicted indicator is u_t, which can be regarded as a hard approximation of (y_t − ε_1)/f(ε_1). We regard the approximation as weights and add them to each term of the binary cross entropy:

EVL(u_t) = -\left[ 1 - P(v_t = 1) \log G(u_t) \right] v_t \log(u_t) - \left[ 1 - P(v_t = 0) \log G(1 - u_t) \right] (1 - v_t) \log(1 - u_t)
         = -\beta_0 \left[ 1 - \frac{u_t}{\gamma} \right]^{\gamma} v_t \log(u_t) - \beta_1 \left[ 1 - \frac{1 - u_t}{\gamma} \right]^{\gamma} (1 - v_t) \log(1 - u_t)    (16)

where β_0 = P(v_t = 0) is the proportion of normal events in the dataset and β_1 = P(v_t = 1) is the proportion of right extreme events in the dataset. γ is a hyper-parameter, namely the extreme value index in the approximation. We call the proposed classification loss function the Extreme Value Loss (EVL). Similarly, we have the binary classification loss function for detecting whether there will be a left extreme event in the future. Combining the two loss functions together, we can extend EVL to the situation v_t ∈ {−1, 0, 1}.
As we have discussed in Sec. 3, without proper weights on the nonparametric estimator the DNN will suffer the overfitting problem. The key point of EVL is to find the proper weights by adding the approximation of the tail distribution of the observations, e.g. β_0[1 − u_t/γ]^γ, through extreme value theory. Intuitively, for detecting a right extreme event, the term β_0 increases the penalty when the model recognizes the event as a normal event. Meanwhile, the term [1 − u_t/γ]^γ also increases the penalty when the model recognizes the extreme event with little confidence. In the following part we demonstrate how to incorporate EVL into the proposed memory network based framework.
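To make Eq. 16 concrete, the binary (right-extreme) version of EVL can be written as a weighted cross entropy, as in the hedged PyTorch sketch below. beta0, beta1 and gamma are the quantities defined above; the clamping constant and the mean reduction are our own choices.

```python
import torch

def extreme_value_loss(u, v, beta0, beta1, gamma=2.0, eps=1e-7):
    """Binary EVL of Eq. 16 for right extreme events.

    u     : predicted indicators in (0, 1), e.g. a sigmoid of the model score
    v     : ground-truth labels in {0, 1} (1 = right extreme event), as a float tensor
    beta0 : proportion of normal events;  beta1 : proportion of right extreme events
    """
    u = u.clamp(eps, 1.0 - eps)
    pos = beta0 * (1.0 - u / gamma) ** gamma * v * torch.log(u)                      # penalty for missed extremes
    neg = beta1 * (1.0 - (1.0 - u) / gamma) ** gamma * (1.0 - v) * torch.log(1.0 - u)
    return -(pos + neg).mean()
```

In Sec. 4.3 below, this term enters the training objective next to the square loss with weight λ_1 (Eq. 17), and the same form is applied per window for L_2 (Eq. 18).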
4.3 Optimization
In this part, we present the optimization algorithm for our framework. First, in order to incorporate EVL into the proposed memory network, a direct thought is to combine the predicted outputs o_t with the prediction of the occurrence of extreme events:

L_1 = \sum_{t=1}^{T} \| o_t - y_t \|^2 + \lambda_1 \, EVL(u_t, v_t)    (17)

Furthermore, in order to enhance the performance of the GRU units, we propose to add a penalty term for each window j, which aims at predicting the extreme indicator q_j of each window j:

L_2 = \sum_{t=1}^{T} \sum_{j=1}^{M} EVL(p_j, q_j)    (18)

where p_j ∈ [−1, 1] is calculated from s_j, the embedded representation of window j, by a fully connected layer. Finally, we list the whole set of parameters to learn as follows:
• Parameters θ in the GRU units: W_z, W_r, W_h, b_z, b_r, b_h ∈ R^H, U_z, U_r, U_h ∈ R^{H×H}
• Parameters φ in the output gate: W_o ∈ R^H, b_o ∈ R
• Parameters ψ for predicting the occurrence of extreme events in history data: W_p, b_p ∈ R^H
• Parameters b in the attention model: b ∈ R^3_+
We use Adam [26] to learn the parameters. For the whole learning process, please refer to Algorithm 1.

Algorithm 1 The proposed framework (non-batch).
Input: X_{1:T+K} = [x_1, ..., x_T, x_{T+1}, ..., x_{T+K}], extreme indicators V_{1:T} = [v_1, ..., v_T], training outputs Y_{1:T} = [y_1, ..., y_T], hyper-parameter γ, memory size M, size of the window ∆, learning rate and regularization parameter.
Output: Predicted values [o_{T+1}, ..., o_{T+K}]
Initialize: Parameters θ, φ, ψ, b
1: procedure Train(Sample i)
2:   for t = 1, ..., T do
3:     Calculate the weights β in Eq. 16
4:     Construct the memory modules S, Q (Sec. 4.1)
5:     for j = 1, ..., M do
6:       Calculate p_j for window j
7:     end for
8:     Minimize the loss function L_2(θ, ψ) (Eq. 18)
9:     Calculate u_t and the output o_t (Eq. 14)
10:   end for
11:   Minimize the loss function L_1(θ, φ, b) (Eq. 17)
12: end procedure
13: function Predict(Sample i)
14:   Construct the memory modules S, Q with t ≤ T (Sec. 4.1)
15:   for t = 1, ..., T + K do
16:     Calculate u_t and the output o_t (Eq. 14)
17:   end for
18:   return {o_t}_{t=1}^{T+K}
19: end function

5 EMPIRICAL RESULTS
In this section, we present empirical results for our proposed framework. In particular, the main research questions are:
• RQ1: Is our proposed framework effective in time series prediction?
• RQ2: Does our proposed loss function work in detecting extreme events?
• RQ3: What is the influence of the hyper-parameters in the framework?

5.1 Experimental Settings
We conduct experiments on three different kinds of datasets:
• Stock Dataset: We collect the stock prices of 564 corporations in the Nasdaq Stock Market, with one sample per week. Our collected data covers the time duration from September 30, 2003 to December 29, 2017.
• Climate Datasets: The climate datasets are composed of the "Green Gas Observing Network Dataset" and the "Atmospheric CO2 Dataset", built by Keeling and Whorf and by Lucas et al. respectively [25, 34]. The greenhouse dataset contains greenhouse gas concentrations at 2921 grid cells in California covering an area of 12 km by 12 km, spaced 6 hours apart (4 samples per day), over the period May 10 – July 31, 2010. The CO2 dataset contains atmospheric CO2 concentrations collected from Mauna Loa in Hawaii week by week over the period March 1958 – December 2001.
• Pseudo Periodic Synthetic Dataset: The original dataset contains one million data points which have been split into 10 sections. All values are in the range [−0.5, 0.5].
For the first two datasets, we set the time length as 500 for training and 200 for testing, while for the last dataset we randomly extract 150 time series per section with 400 data points, setting the time length as 300 for training and 100 for testing. It is worth noticing that the regularity of our three selected datasets increases in this order, i.e. the pattern of the pseudo dataset is totally fixed while the stock data has intensive noise. We further preprocessed the data by replacing the raw data with the difference between time t and t − 1. By a subsequent normalization of the data values to a fixed range, the final datasets were constructed.
For each experiment, we conducted a 10-fold cross validation and reported the averaged results. For the choice of the hyper-parameter γ in EVL, on one hand, previous work has pointed out that maximum-likelihood methods only work well when γ > 2 for estimating the extreme value distribution [12]. On the other hand, we have noticed that it is improper to set γ to a large value due to the fact that p ∈ [0, 1]. Therefore in most cases we chose γ optimally from (2, 3]. An analysis of the influence of different γ will be presented in the experiments. We set the memory size M as 80 and the window size ∆ as 50; we will also analyze different choices of M and ∆ later. The learning rate was commonly set as 0.0005.
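The preprocessing described above (first-order differencing followed by scaling to a fixed range) can be sketched as follows; the target range and the min-max scaling are our assumptions, since the paper only states that the differenced values are normalized to a fixed range.

```python
import numpy as np

def preprocess(series, lo=-0.5, hi=0.5):
    """Difference the raw series and rescale the differences to [lo, hi]."""
    diff = np.diff(series)                          # x_t - x_{t-1}
    d_min, d_max = diff.min(), diff.max()
    scaled = (diff - d_min) / (d_max - d_min)       # to [0, 1]
    return scaled * (hi - lo) + lo                  # to [lo, hi]
```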
5.2 Effectiveness of Time Series Prediction
We first validate our complete framework for time series prediction. We chose the Root Mean Square Error (RMSE) as the metric, where a smaller RMSE means a better performance.
We compared our model with several state-of-the-art baselines: GRU, LSTM and Time-LSTM [50], where Time-LSTM considers the differences between the x_t. We also compare our model with the memory network without EVL, by replacing EVL with cross entropy (CE). The results are in Table 3. Surprisingly, GRU outperformed the other baselines on real-world data although it has the simplest structure. We infer the reason is that there is much noise in real-world data, which can easily cause the overfitting problem on one-dimensional data as we have described before. Furthermore, as we can see, the RMSE of our model was uniformly lower than that of GRU. Noticeably, on the synthetic dataset we achieved a nearly 50% improvement in RMSE.

Table 3: RMSE results of time series prediction.
Model        Climate   Stock   Pseudo
LSTM         0.188     0.249   5.2 × 10⁻³
Time-LSTM    0.193     0.256   4.7 × 10⁻³
GRU          0.174     0.223   5.3 × 10⁻³
Mem          0.181     0.197   3.6 × 10⁻³
Mem+EVL      0.125     0.168   2.5 × 10⁻³

We also visualize the output from each module as a case study (Fig. 4). As we can see from the results, the empirical success of our model can be mainly attributed to two parts: the predicted value õ_t and the extreme event label u_t. The output õ_t approximated the trend of the data correctly but the value it predicted was commonly small. As a complement, u_t came to the rescue by amplifying the prediction value when it predicted the occurrence of an extreme event at the current step with high probability. Illustratively, it is worth noticing the visualization around time step 600 in Fig. 4. Although õ_t predicted the trend as up, it only gave a small positive value. As a complement, the memory module detected that there would be a right extreme event at this time step; therefore it yielded a near-1 output and imposed an amplification on õ_t to form the final output, while a plain GRU could hardly make such a complicated decision.

Figure 4: Case study for our framework via visualizations. Although outputs from õ_t still suffer from the extreme event problem, our model improves them through EVL and the memory network module. (a) Output from GRU. (b) Output from our model. (c) Output from õ_t.

5.3 Effectiveness of EVL
As we can see from Table 3, EVL plays an important role in the prediction. We further validated the effectiveness of EVL on predicting future occurrences of extreme events. We used the F1 score to measure the effectiveness of the predictions. Specifically, we adopted the Macro, Micro and Weighted F1 scores for an overall evaluation. The results are presented in Table 2. We compared our proposed EVL with a GRU classifier and an LSTM classifier, and also studied the influence of the hyper-parameter γ on EVL. First, as we can see from Table 2, our proposed loss function outperformed all the baselines on each dataset. Especially on the climate dataset, EVL outperformed the best baseline by 47% in Micro F1 score.

Table 2: F1 scores for detecting extreme events with different classification loss functions.
                      Climate                    Stock                      Pseudo
Model                 Micro  Macro  Weighted     Micro  Macro  Weighted     Micro  Macro  Weighted
LSTM+CE               0.435  0.833  0.786        0.247  0.617  0.527        0.830  0.900  0.899
GRU+CE                0.471  0.717  0.733        0.250  0.617  0.547        0.854  0.917  0.917
GRU+EVL (γ = 0.5)     0.644  0.883  0.859        0.281  0.583  0.523        0.833  0.900  0.902
GRU+EVL (γ = 1.0)     0.690  0.900  0.881        0.267  0.667  0.547        0.874  0.933  0.933
GRU+EVL (γ = 2.0)     0.646  0.867  0.851        0.324  0.617  0.555        0.869  0.917  0.920
GRU+EVL (γ = 3.0)     0.508  0.867  0.825        0.295  0.617  0.548        0.810  0.900  0.897
GRU+EVL (γ = 4.0)     0.617  0.817  0.813        0.295  0.617  0.543        0.825  0.883  0.886

Interestingly, we observed that γ indeed influenced the final classification results a lot. For example, when γ = 4.0, the performance of EVL was worse than the baseline on the synthetic dataset. As we have discussed before, γ intuitively describes the characteristics of the tail distribution of the data; therefore, an improper γ could mislead the model into an incorrect modeling of the tail distribution. Furthermore, the experimental results also support our suggestion for γ discussed in Sec. 5.1, as the optimal γ was always found around 2.0 on each dataset (Table 2). On the other hand, γ was also observed to capture the characteristics of the heavy-tailed data distribution well. For instance, there is more noise in the stock dataset, i.e. extreme events occur more frequently; in this case, the observed optimal γ, which took a larger value than on the other datasets, corresponds well to this prior knowledge. Therefore, in the following experiments, we commonly set the hyper-parameter γ to 2.0.

5.4 Influence of Hyper-parameters in Memory Network
Finally, we investigate the influence of the hyper-parameters in the memory module, with the results shown in Fig. 5. As we can see from Fig. 5(a), the optimal choice of memory size varied across datasets: M = 100 for the climate dataset, M = 80 for the stock dataset, and M = 60 for the synthetic dataset, with an average optimal memory size around 80. We infer the reason is that, with a small memory size, the model is not able to memorize extreme events in the history sufficiently. On the contrary, if we keep enlarging the size of the memory, the model would pay inessential attention to historical events rather than making the right decision based on the current situation. One special case is the synthetic dataset, where a smaller memory size performs better. We speculate the reason is that, since the inherent regularity of this dataset is extremely high, a learning model with sufficient learning capacity can easily capture the regularity without referring to historical records.

Figure 5: Influence of different hyper-parameters in the memory network module. We only demonstrate the results on the climate datasets due to the limit of space. (a) Influence of the memory size M. (b) Influence of the window size ∆.

After investigating the influence of M for the different datasets, we would also like to understand the influence of different window sizes on our memory network module. As we observe in Fig. 5(b), the influence of the window size is much smaller than that of the memory size. However, the window size ∆ still has a non-negligible effect on the final prediction. For instance, the best ∆ was 60 for the stock dataset.
Such a phenomenon mainly comes from the fact that, when the window is too small, the pattern of extreme events inside the window is not modeled well, which influences the effectiveness of the prediction, while on the other hand, when the length is set too large, the GRU module in the memory network lacks the ability to memorize all the characteristics inside the window.

6 RELATED WORK
Time series prediction is a classical task in machine learning. Traditional methods such as the autoregressive moving average model [46] use linear regression on the latest inputs to predict the next value, while nonlinear autoregressive exogenous (NARX) models adopt a non-linear mapping to improve the prediction [31]. However, their learning capacity is limited because traditional methods use shallow architectures [6]. Recently, deep neural networks (DNN) have achieved great success in many fields, e.g., computer vision [28] and natural language processing [4], mainly because of their deep nonlinear architecture which allows extraction of better features from the inputs [39]. A number of DNN based models have also been applied to time series prediction. Dasgupta and Osogami proposed to use a Boltzmann Machine [11], while Yang et al. proposed to use a Convolutional Network [49] to improve the performance. Among these deep learning models, the recurrent neural network (RNN) shows superior advantages in modeling time-series data. For instance, Diaconescu proposes to combine RNN with the NARX model, which largely reduces the errors in prediction [13, 48]. Numerous improvements have been made on RNN, such as Long Short-Term Memory [22] and the Gated Recurrent Unit [9], with various empirical results reporting the effectiveness of these deep recurrent models [30, 36, 50].
Data imbalance is always an important issue in machine learning and data mining. For instance, considering a certain classification task, if there exists one label which only has few samples in the training set, then the model will lack the ability to detect it [15]. Actually, cutting-edge deep learning models are also observed to suffer from the data imbalance problem in most cases. As pointed out by Lin et al., the data imbalance problem mainly causes two haunting phenomena in DNNs: (1) the model lacks the ability to model rarely occurring samples; (2) the generalization performance of the model degenerates [32]. As we have discussed in the introductory part, in the context of time-series data, these imbalanced samples are often called extreme events. Previous studies mainly focus on modeling the distributions of these extreme values [1]. Intuitively, due to the rare and irregular nature of extreme events, it is always hard to forecast these pumping points [19].

7 CONCLUSION
In this paper, we focus on improving the performance of deep learning methods on time series prediction, specifically with a more fine-grained modeling of extreme events. As discussed in Sec. 3, without prior knowledge of the tailed observations, a DNN is innately weak in capturing the characteristics of the occurrences of extreme events. Therefore, as a novel technique delicately designed for extreme events, we have proposed a framework for forecasting time series data with extreme events. Specifically, we consider two factors for imposing tailed priors: (1) memorizing extreme events in historical data and (2) modeling the tail distribution of observations. For the first factor we utilize the recently proposed Memory Network technique by storing historical patterns of extreme events for future reference. For the second factor we propose a novel classification loss function called Extreme Value Loss (EVL) for detecting extreme events in the future, which contains the approximated tail distribution of the observations. Finally, we combine them together and form our end-to-end solution for predicting time series with extreme events. With intensive experiments, we have validated both the effectiveness of EVL on extreme event detection and the superior performance of our proposed framework on time series prediction, compared with the state-of-the-arts. For future work, we plan to work on an efficient extension of our framework to multi-dimensional time series prediction in consideration of its significance in practice and its challenges in theory. Furthermore, we suggest it would also be interesting to exploit the possibility of applying our proposed EVL to derive solutions for other tasks featured with data imbalance, such as Point-of-Interest (POI) recommendation [14].
ACKNOWLEDGMENTS
The work is supported in part by the National Program on Key Basic Research (NO. 2015CB358800), the National Natural Science Foundation of China (U1636204, 61602121, U1736208, 61602123, U1836213, U1836210) and the Thousand Youth Talents Program 2018. Min Yang is also a member of the Shanghai Institute of Intelligent Electronics & Systems, the Shanghai Institute for Advanced Communication and Data Science, and the Engineering Research Center of Cyber Security Auditing and Monitoring, Ministry of Education, China.

REFERENCES
[1] Sergio Albeverio, Volker Jentsch, and Holger Kantz. 2006. Extreme events in nature and society. Springer Science & Business Media.
[2] Eduardo Altmann and Holger Kantz. 2005. Recurrence time analysis, long-term correlations, and extreme events. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics (2005).
[3] Martin Arjovsky and Léon Bottou. 2017. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862 (2017).
[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[5] Arindam Banerjee, Srujana Merugu, Inderjit S Dhillon, and Joydeep Ghosh. 2005. Clustering with Bregman divergences. Journal of Machine Learning Research 6, Oct (2005), 1705–1749.
[6] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. 2007. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems. 153–160.
[7] Tine Buch-Larsen, Jens Perch Nielsen, Montserrat Guillén, and Catalina Bolancé. 2005. Kernel density estimation for heavy-tailed distributions using the Champernowne transformation. Statistics 39, 6 (2005), 503–516.
[8] Armin Bunde, Jan F Eichner, Shlomo Havlin, and Jan W Kantelhardt. 2003. The effect of long-term correlations on the return periods of rare events. Physica A: Statistical Mechanics and its Applications 330, 1-2 (2003), 1–7.
[9] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014).
[10] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
[11] Sakyasingha Dasgupta and Takayuki Osogami. 2017. Nonlinear Dynamic Boltzmann Machines for Time-Series Prediction. In AAAI. 1833–1839.
[12] Laurens De Haan and Ana Ferreira. 2007. Extreme value theory: an introduction. Springer Science & Business Media.
[13] Eugen Diaconescu. 2008. The use of NARX neural networks to predict chaotic time series. WSEAS Transactions on Computer Research 3, 3 (2008), 182–191.
[14] Daizong Ding, Mi Zhang, Xudong Pan, Duocai Wu, and Pearl Pu. 2018. Geographical Feature Extraction for Entities in Location-based Social Networks. In Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 833–842.
[15] Li Fei-Fei, Rob Fergus, and Pietro Perona. 2006. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 4 (2006), 594–611.
[16] Fuli Feng, Xiangnan He, Xiang Wang, Cheng Luo, Yiqun Liu, and Tat-Seng Chua. 2019. Temporal Relational Ranking for Stock Prediction. ACM Transactions on Information Systems (TOIS) 37, 2 (2019), 27.
[17] R. A. Fisher and L. H. C. Tippett. 1928. Limiting forms of the frequency distribution of the largest or smallest member of a sample. Mathematical Proceedings of the Cambridge Philosophical Society 24, 2 (1928), 180–190.
[18] Janos Galambos. 1977. The asymptotic theory of extreme order statistics. In The Theory and Applications of Reliability with Emphasis on Bayesian and Nonparametric Methods. Elsevier, 151–164.
[19] M Ghil, P Yiou, Stéphane Hallegatte, BD Malamud, P Naveau, A Soloviev, P Friederichs, V Keilis-Borok, D Kondrashov, V Kossobokov, et al. 2011. Extreme events: dynamics, statistics and prediction. Nonlinear Processes in Geophysics 18, 3 (2011), 295–350.
[20] Boris Gnedenko. 1943. Sur la distribution limite du terme maximum d'une série aléatoire. Annals of Mathematics 44, 3 (1943), 423–453.
[21] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems. 2672–2680.
[22] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[23] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 5 (1989), 359–366.
[24] Holger Kantz, Eduardo G Altmann, Sarah Hallerberg, Detlef Holstein, and Anja Riegert. 2006. Dynamical interpretation of extreme events: predictability and predictions. In Extreme Events in Nature and Society. Springer, 69–93.
[25] Charles David Keeling and Timothy P Whorf. 2004. Atmospheric CO2 concentrations derived from flask air samples at sites in the SIO network. Trends: A Compendium of Data on Global Change (2004).
[26] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[27] Samuel Kotz and Saralees Nadarajah. 2000. Extreme value distributions: Theory and applications. Prentice Hall. 207–243 pages.
[28] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[29] Holden Lee, Rong Ge, Tengyu Ma, Andrej Risteski, and Sanjeev Arora. 2017. On the ability of neural nets to express distributions. arXiv preprint arXiv:1702.07028 (2017).
[30] Tao Lin, Tian Guo, and Karl Aberer. 2017. Hybrid neural networks for learning the trend in time series. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17. 2273–2279.
[31] Tsungnan Lin, Bill G Horne, Peter Tino, and C Lee Giles. 1996. Learning long-term dependencies in NARX recurrent neural networks. IEEE Transactions on Neural Networks 7, 6 (1996), 1329–1338.
[32] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2018. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018).
[33] Edward N Lorenz. 1963. Deterministic nonperiodic flow. Journal of the Atmospheric Sciences 20, 2 (1963), 130–141.
[34] DD Lucas, C Yver Kwok, P Cameron-Smith, H Graven, D Bergmann, TP Guilderson, R Weiss, and R Keeling. 2015. Designing optimal greenhouse gas observing networks that consider performance and cost. Geoscientific Instrumentation, Methods and Data Systems 4, 1 (2015), 121–137.
[35] T Okubo and N Narita. 1980. On the distribution of extreme winds expected in Japan. National Bureau of Standards Special Publication 560 (1980), 1.
[36] Yao Qin, Dongjin Song, Haifeng Chen, Wei Cheng, Guofei Jiang, and Garrison Cottrell. 2017. A dual-stage attention-based recurrent neural network for time series prediction. arXiv preprint arXiv:1704.02971 (2017).
[37] Tomasz Rolski, Hanspeter Schmidli, Volker Schmidt, and Jozef L Teugels. 2009. Stochastic processes for insurance and finance. Vol. 505. John Wiley & Sons.
[38] Murray Rosenblatt. 1956. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics (1956), 832–837.
[39] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. 2017. Dynamic routing between capsules. In Advances in Neural Information Processing Systems. 3856–3866.
[40] Ajit P Singh and Geoffrey J Gordon. 2008. Relational learning via collective matrix factorization. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 650–658.
[41] Jeroen Van den Berg, Bertrand Candelon, and Jean-Pierre Urbain. 2008. A cautious note on the use of panel models to predict financial crises. Economics Letters 101, 1 (2008), 80–83.
[42] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems. 3630–3638.
[43] Ladislaus von Bortkiewicz. 1921. Variationsbreite und mittlerer Fehler. Berliner Mathematische Gesellschaft.
[44] Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. Neural Graph Collaborative Filtering. In SIGIR.
[45] Jason Weston, Sumit Chopra, and Antoine Bordes. 2014. Memory Networks. CoRR abs/1410.3916 (2014).
[46] Peter Whittle. 1951. Hypothesis testing in time series analysis. Vol. 4. Almqvist & Wiksells.
[47] Rym Worms. 1998. Propriété asymptotique des excès additifs et valeurs extrêmes: le cas de la loi de Gumbel. Comptes Rendus de l'Académie des Sciences, Series I, Mathematics 5, 327 (1998), 509–514.
[48] Linjun Yan, Ahmed Elgamal, and Garrison W Cottrell. 2011. Substructure vibration NARX neural network approach for statistical damage inference. Journal of Engineering Mechanics 139, 6 (2011), 737–747.
[49] Jianbo Yang, Minh Nhut Nguyen, Phyo Phyo San, Xiaoli Li, and Shonali Krishnaswamy. 2015. Deep Convolutional Neural Networks on Multichannel Time Series for Human Activity Recognition. In IJCAI, Vol. 15. 3995–4001.
[50] Yu Zhu, Hao Li, Yikang Liao, Beidou Wang, Ziyu Guan, Haifeng Liu, and Deng Cai. 2017. What to do next: Modeling user behaviors by Time-LSTM. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17. 3602–3608.
