LSTM Model Architecture for Rare Event Time Series Forecasting - MachineLearningMastery.com
LSTM Model Architecture for Rare Event Time Series Forecasting - MachineLearningMastery.com
Time series forecasting with LSTMs directly has shown little success.
This is surprising as neural networks are known to be able to learn complex non-linear relationships and the LSTM is perhaps the most successful
type of recurrent neural network that is capable of directly supporting multivariate sequence prediction problems.
A recent study performed at Uber AI Labs demonstrates how both the automatic feature learning capabilities of LSTMs and their ability to handle input
sequences can be harnessed in an end-to-end model that can be used for drive demand forecasting for rare events like public holidays.
In this post, you will discover an approach to developing a scalable end-to-end LSTM model for time series forecasting.
The challenge of multivariate, multi-step forecasting across multiple sites, in this case cities.
An LSTM model architecture for time series forecasting comprised of separate autoencoder and forecasting sub-models.
The skill of the proposed LSTM architecture at rare event demand forecasting and the ability to reuse the trained model on unrelated
forecasting problems.
Kick-start your project with my new book Deep Learning for Time Series Forecasting, including step-by-step tutorials and the Python source code
files for all examples.
Overview
In this post, we will review the 2017 paper titled “Time-series Extreme Event Forecasting with Neural Networks at Uber” by Nikolay Laptev, et al.
presented at the Time Series Workshop, ICML 2017.
1. Motivation
2. Datasets
3. Model
4. Findings
Motivation
The goal of the work was to develop an end-to-end forecast model for multi-step time series forecasting that can handle multivariate inputs (e.g.
multiple input time series).
The intent of the model was to forecast driver demand at Uber for ride sharing, specifically to forecast demand on challenging days such as holidays
where the uncertainty for classical models was high.
Generally, this type of demand forecasting for holidays belongs to an area of study called extreme event prediction.
Extreme event prediction has become a popular topic for estimating peak electricity demand, traffic jam severity and surge pricing for
ride sharing and other applications. In fact there is a branch of statistics known as extreme value theory (EVT) that deals directly with
this challenge.
The difficulty of these existing models motivated the desire for a single end-to-end model.
Further, a model was required that could generalize across locales, specifically across data collected for each city. This means a model trained on
some or all cities with data available and used to make forecasts across some or all cities.
We can summarize this as the general need for a model that supports multivariate inputs, makes multi-step forecasts, and generalizes across multiple
sites, in this case cities.
Click to sign-up and also get a free PDF Ebook version of the course.
Datasets
The model was fit in a propitiatory Uber dataset comprised of five years of anonymized ride sharing data across top cities in the US.
A five year daily history of completed trips across top US cities in terms of population was used to provide forecasts across all major
US holidays.
The input to each forecast consisted of both the information about each ride, as well as weather, city, and holiday variables.
To circumvent the lack of data we use additional features including weather information (e.g., precipitation, wind speed, temperature)
and city level information (e.g., current trips, current users, local holidays).
The figure below taken from the paper provides a sample of six variables for one year.
Scaled Multivariate Input for Model
Taken from “Time-series Extreme Event Forecasting with Neural Networks at Uber”.
A training dataset was created by splitting the historical data into sliding windows of input and output variables.
The specific size of the look-back and forecast horizon used in the experiments were not specified in the paper.
Time series data was scaled by normalizing observations per batch of samples and each input series was de-trended, but not deseasonalized.
Neural networks are sensitive to unscaled data, therefore we normalize every minibatch. Furthermore, we found that de-trending the
data, as opposed to de-seasoning, produces better results.
Model
LSTMs, e.g. Vanilla LSTMs, were evaluated on the problem and show relatively poor performance.
Our initial LSTM implementation did not show superior performance relative to the state of the art approach.
Feature Extractor: Model for distilling an input sequence down to a feature vector that may be used as input for making a forecast.
Forecaster: Model that uses the extracted features and other inputs to make a forecast.
An LSTM autoencoder model was developed for use as the feature extraction model and a Stacked LSTM was used as the forecast model.
We found that the vanilla LSTM model’s performance is worse than our baseline. Thus, we propose a new architecture, that leverages
an autoencoder for feature extraction, achieving superior performance compared to our baseline.
When making a forecast, time series data is first provided to the autoencoders, which is compressed to multiple feature vectors that are averaged and
concatenated. The feature vectors are then provided as input to the forecast model in order to make a prediction.
… the model first primes the network by auto feature extraction, which is critical to capture complex time-series dynamics during
special events at scale. […] Features vectors are then aggregated via an ensemble technique (e.g., averaging or other methods). The
final vector is then concatenated with the new input and fed to LSTM forecaster for prediction.
It is not clear what exactly is provided to the autoencoder when making a prediction, although we may guess that it is a multivariate time series for the
city being forecasted with observations prior to the interval being forecasted.
A multivariate time series as input to the autoencoder will result in multiple encoded vectors (one for each series) that could be concatenated. It is not
clear what role averaging may take at this point, although we may guess that it is an averaging of multiple models performing the autoencoding
process.
The authors comment that it would be possible to make the autoencoder a part of the forecast model, and that this was evaluated, but the separate
model resulted in better performance.
Having a separate auto-encoder module, however, produced better results in our experience.
More details of the developed model were made available in the slides used when presenting the paper.
The input for the autoencoder was 512 LSTM units and the bottleneck in the autoencoder used to create the encoded feature vectors as 32 or 64
LSTM units.
Details of LSTM Autoencoder for Feature Extraction
Taken from “Time-series Extreme Event Forecasting with Neural Networks at Uber.”
The encoded feature vectors are provided to the forecast model with ‘new input‘, although it is not specified what this new input is; we could guess
that it is a time series, perhaps a multivariate time series of the city being forecasted with observations prior to the forecast interval. Or, features
extracted from this series as the blog post on the paper suggests (although I’m skeptical as the paper and slides contradict this).
The model was trained on a lot of data, which is a general requirement of stacked LSTMs or perhaps LSTMs in general.
The described production Neural Network Model was trained on thousands of time-series with thousands of data points each.
— Time-series Extreme Event Forecasting with Neural Networks at Uber, 2017.
An interesting approach to estimating forecast uncertainty was also implemented that used the bootstrap.
It involved estimating model uncertainty and forecast uncertainty separately, using the autoencoder and the forecast model respectively. Inputs were
provided to a given model and dropout of the activations (as commented in the slides) was used. This process was repeated 100 times, and the model
and forecast error terms were used in an estimate of the forecast uncertainty.
This approach to forecast uncertainty may be better described in the 2017 paper “Deep and Confident Prediction for Time Series at Uber.”
Findings
The model was evaluated with a special focus on demand forecasting for U.S. holidays by U.S. city.
The new generalized LSTM forecast model was found to outperform the existing model used at Uber, which may be impressive if we assume that the
existing model was well tuned.
The results presented show a 2%-18% forecast accuracy improvement compared to the current proprietary method comprising a
univariate timeseries and machine learned model.
The model trained on the Uber dataset was then applied directly to a subset of the M3-Competition dataset comprised of about 1,500 monthly
univariate time series forecasting datasets.
This is a type of transfer learning, a highly-desirable goal that allows the reuse of deep learning models across problem domains.
Surprisingly, the model performed well, not great compared to the top performing methods, but better than many sophisticated models. The result is
suggests that perhaps with fine tuning (e.g. as is done in other transfer learning case studies) the model could be reused and be skillful.
Performance of LSTM Model Trained on Uber Data and Evaluated on the M3 Datasets
Taken from “Time-series Extreme Event Forecasting with Neural Networks at Uber.”
Importantly, the authors suggest that perhaps the most beneficial application of deep LSTM models to time series forecasting are situations where:
From our experience there are three criteria for picking a neural network model for time-series: (a) number of timeseries (b) length of
time-series and (c) correlation among the time-series. If (a), (b) and (c) are high then the neural network might be the right choice,
otherwise classical timeseries approach may work best.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Summary
In this post, you discovered a scalable end-to-end LSTM model for time series forecasting.
The challenge of multivariate, multi-step forecasting across multiple sites, in this case cities.
An LSTM model architecture for time series forecasting comprised of separate autoencoder and forecasting sub-models.
The skill of the proposed LSTM architecture at rare event demand forecasting and the ability to reuse the trained model on unrelated
forecasting problems.
Project Spotlight: Event How to Tune LSTM How to Update LSTM Networks How to Use Timesteps in LSTM
Recommendation in Python… Hyperparameters with Keras for During Training for Time… Networks for Time…
Time…
Comparing Classical and Machine Learning Algorithms for Time Series Forecasting A Gentle Introduction to LSTM Autoencoders
71 Responses to LSTM Model Architecture for Rare Event Time Series Forecasting
Hi,
Is there a way to identify and remove outliers from data sets without affecting rare events?
Or how not to mistakenly have outliers as rare events ?
Thanks
Vali
You must carefully define what you mean by “outlier” and “rare event” so that the methods that detect the former don’t detect the latter.
Outliers usually are anomalies which are abnormal ie. outside a normal distribution. Something like mean+/-2*std. in a time series
outliers are sparks, with much higher freq than the normal signal even with rare events.
For instance a Black Friday is rare event but fits in the normal frequency whereas an outlier is much higher frequency.
So how can I bring the frequency part in the equation?
Thanks
Good question, I don’t have material on this topic so I can’t give you good off the cuff advice.
“Higher frequency” means more often. I think you mean that true outliers have a much *lower* frequency. But even that isn’t
necessarily accurate. I’ve seen web traffic time series that have occasional spikes that correspond to no known event, occurring in some
cases more commonly than the few known special events.
Thanks for the post. Do you know where an implementation for this algorithm can be found?
I don’t understood this paper as it includes terms like time series multivariate lstm recurrent model
Hi,
Thanks for this article. I’m trying to implement this paper using the Tensorflow low-level api.
Can you explain more about the confident interval computation, please.
I mean one you got the uncertainty error and the irreducible error, how can you get the interval through MC Dropout
Thank a lot
Perhaps check the paper or contact the author of the paper, it has been months since I read the paper.
very well explained, as always! a lot of your other articles contain code that help us understand the concepts better. I’m sure you’re very
busy, but it’d be great if you could add code to this post, or point me to some articles/repos that have some code related to this post. Thanks,
Hi Jason,
I Master Degree student and I got interested in aply this approach in climate data series. I my research I have an adiction challenge (dimesions) the are
latitude and longitude of an extrem or rare event.
I created a time series downloading 10 years of ERA Interim, Daily: Pressure and sufarce data form ECMWF
(https://apps.ecmwf.int/datasets/data/interim-full-daily/levtype=sfc/).
So, as an example, I’m interested in predict an extreme rain (> 50mm in 24h) for a selected area (0.75 resolution): latitude from-18.75 to -20.25 and
longitude from 315.0 to 316.5., It’s a grid 3 x 3 = 9 grids.
The rain (total precipitation in mm) doesn’t have a gaussian distribution, so, there are a lot of 00mm days, and the time series of rain are not a
continuos sequence.
In your experience, this “Ubber” approach can fit despite of distribution problem? I have some doubts about the approach, like how this “LSTM
Autoencoder for Feature Extraction” works. Do you expected do code a complete example like this “Uber” approach?
I don’t know how this approach will fair with your data, perhaps try it and see?
Hi Jason,
I have one question and maybe you could help me with that. The LSTM Autoencoder that I created looks like this —
repeat = RepeatVector(10)(encoder3)
Should we extract the feature from the “repeat” layer or the “encoder3” layer?
Could you please give me a hint for plotting/visualization of the extracted features please?
I’m eager to help, but I don’t have the capacity to debug your code, sorry.
Thank you for the answer Jason! You need not be sorry 🙂 Do you have any example code or could you suggest me some methods
with which I can visualize the feature vectors?
Thanks you 🙂
where to find the dataset for this paper of uber could you please send me
ansd how to implement this
please send me dataset for this paper . i need this desperately for my research work please help me
can you provide me thesis work related to this topic of rare events please help me.with implementation
I cannot, sorry.
Hi Jason, thank you for the post. Please! what is the difference between Monte Carlo dropout and normal dropout? Do you have a link to any
tutorial that shows how to add Monte Carlo dropout to the LSTM model implementation?
Thank you!
It is a stochastic dropout used as Bayesian approximation for model uncertainty estimation. It is equivalent to performing T
stochastic forward passes through the Neural Network and averaging the result. It can also be approximated by averaging the weights of the
NN (i.e.multiplying each weight by a probability p at test time). MC dropout s used for model uncertainty estimation in the paper you
elaborated and the one you provided as reference (“Deep and Confident Prediction for Time Series at Uber”) in this post.
Thanks!
Please! help me with any tutorial that shows how it can be implemented using the LSTM model
I made a post where I replicate these results. You can find the article here: https://towardsdatascience.com/extreme-event-forecasting-with-
lstm-autoencoders-297492485037 (with Python Code)
Papers are always incomplete, they are just enough to give you a rough idea – which might be enough.
It’s a pain. And unless a paper has associated code it is almost fraud – they can make up anything.
Thankfully, most good papers have associated github project – this never used to be the case.
If you are transforming new input data from different series then avg + concatenate to prior to making a new prediction then why
do they add another new input? Is this new input the same input as the one prior to transformation by the encoder? I still do not
understand this.
I’d recommend reading the paper, it may have more details and make the situation clearer.
How many time series are sufficient enough for these network training? (Author suggest that more number of time series needed for these
type of network to succeed, but how many?)
It really depends.
If you don’t have a lot of data, you can avoid overfitting with regularization:
https://machinelearningmastery.com/introduction-to-regularization-to-reduce-overfitting-and-improve-generalization-error/
Not off hand, some research may be required. Perhaps try some searches on scholar.google.com
Okay
Thanks 😀
I wanted to make a prediction of a t+1 every 15 minutes, (in real time: Run forever).
will be better to look for a good model, then I predict the next step (off line), Or, at each prediction I update my model with the new prediction (on
line)?
Test a few approaches and see what works best for your specific dataset.
Hi Jason,
I have to perform Anomaly detection and I only have a univariate Time series data (~1 year).
Does it make sense to create lagged and derived features from the same time series (such as mean, min, max, sd, deviation etc. for different
windows) and train a LSTM autoencoder model on it?
The idea is that if I score/predict the new data point using the lagged and derived features and the reconstruction error is > threshold then it’s an
anomaly.Do you recommend this approach?
I recommend testing a suite of framings of the problem and models in order to discover what works best.
Hi Jason,
But it is not conceptually wrong if I create features that are essentially from 1 univariate time series and then use autoencoder right?
Regards,
Emily
Hmmm, there is no real right and wrong, there are only models that work and ones that do not.
You’re idea is complex, but perhaps it will work – give it a shot. The great thing about these libraries is that testing ideas is very fast – like just a
few minutes.
Thanks a lot!
Adi November 21, 2019 at 2:25 am # REPLY
Hi Jason,
I’m working on a problem where I have a daily time series with a set of ~100 features associated for every day. There is also an 0/1 event associated
with each day. There are days missing in the data. I want to predict, based on the future features whether or not the event will occur on that day. I’m
not really getting how can I do it?
Thanks a lot!
I recommend testing many different framings of the dataset and see what works.
Thanks for the prompt reply. I’m not getting what might be a good approach to start with? Should I frame it as a simple classification
problem or a time series approach
Start with classification, e.g. prior days features as input todays label as output, or something. Perhaps explore feature selection on this.
Then start adding in more history to some/all features, for different prior intervals. Discover what results in skillful models on your data.
Hello Jason,
Is univariate LSTM RNN capable of giving good results with 1200 observation of daily sales data with 20 percent of observations have sales happened
and other 80 percent don’t have any sales happened so taken as zero. The sales data is in the form of daily number of units sold.
I am trying with this model. Is there any other time series model you can suggest me for this kind of problem where there is daily sales but happened
for few days only . An accuracy of 60 Percent as a start will be good .
for the1. part I have given 0 flag to the day where sales didn’t happened and 1 where sales happened irrespective of how much .
Is univariate LSTM helpful in pattern recognization of 0 and 1…? after that I will go for 2 part . i.e. how much …If you have any other technique let me
know ..
Regards
Bandeep
Perhaps test the model on your data and evaluate the result?
I would strongly encourage you to test other models as LSTMs are generally terrible at univariate time series forecasting.
Thanks for this, and the many other useful articles that you publish.
In the Uber study, did your network identify spikes and dips NOT associated with events known beforehand: public holidays?
I am working towards a network that identifies rare events (demand spikes) before they occur. I have demand spikes that just seem to appear out of
the blue.
I have a simple neural network that predicts when an order is coming in, but predicting whether the next order is a spike has resisted analysis thus far.
A runs test shows that order size is not random and intuition after many years in the business tells me there’s a model out there somewhere.
As luck would have it, a vanilla LSTM network gave astonishingly good results on my data: really exciting. I’m guessing that, if I can
do it, an expert can do it even better.
Accordingly, I think the guys working for Uber would have forecast random demand spikes not related to holidays. Perhaps it’s so obvious,
they didn’t feel the need to mention it.
As with them, the uncertainty on the actual level of demand shows up in my model. It’s a less urgent issue for me but further improvement
gives me a chance to upgrade my skills.
Hi
my dataset is 11000*6
LSTM is an RNN.
Perhaps test a suite of models on your dataset and discover what works best.
How much data points in daily time series data will be there to call it as a long time series data
It really depends.
Leave a Reply
Name (required)
SUBMIT COMMENT
Welcome!
I'm Jason Brownlee PhD
and I help developers get results with machine learning.
Read more
How to Develop Convolutional Neural Network Models for Time Series Forecasting