Time Series and Survival Analysis
Forecasting Problems
There are two types of forecasting problems. The vast majority of applications employ univariate models, since it is harder to combine variables when using time series data.
1. Univariate: a single series modeled on its own.
2. Multiple related series: think of groups such as customer types, departments or channels, or geographies, with joint estimation across the series.
Generally, models perform better if we can first remove known sources of variation such as trend and seasonality. The main motivation for decomposition is to improve model performance: we identify the known sources of variation and remove them, leaving a resulting series (the residuals) that we can fit with time series models.
Trend captures the general direction of the time series. For example, increasing job growth year over year despite seasonal fluctuation.
Seasonality captures effects that occur with a specific frequency. It can be driven by many factors, such as naturally occurring events (weather, etc.), business or administrative procedures (school calendars, etc.), social/cultural behavior (holidays), and calendar events (holidays that shift from year to year).
Residuals are the random fluctuations left over once trend and seasonality are removed. There should not be a trend or seasonal pattern in the residuals; they represent short-term, possibly random fluctuations. There may be a portion of the trend or seasonal components missed in the decomposition.
Decomposition Models
Multiplicative models are used when the magnitudes of the seasonal and residual values fluctuate with the trend. A multiplicative decomposition, Trend * Seasonality * Residual, can be turned into an additive one by taking logs:
log(Trend * Seasonality * Residual) = log(Trend) + log(Seasonality) + log(Residual)
If there are trend and seasonality effects, we can also use exponential smoothing. If the trend is exponential, the choice for the trend component should be MULTIPLICATIVE; if the magnitude of the seasonal spikes stays roughly constant, the choice should be ADDITIVE.
Pseudo-additive Decomposition Model
In the multiplicative model, the seasonal and residual components should be centered around one, since multiplying any value by one leaves it unchanged. In the additive model, the seasonal and residual components are centered around zero, since adding zero to any value leaves it unchanged.
Decomposition of time series allows us to remove deterministic components, which
would otherwise complicate modeling.
After removing these components, the main focus is to model the residual.
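As a concrete illustration, here is a minimal sketch of classical decomposition using statsmodels; the variable series and the period of 12 (monthly data with yearly seasonality) are illustrative assumptions, not part of the original notes:

    # series is assumed to be a monthly pandas Series with a DatetimeIndex.
    from statsmodels.tsa.seasonal import seasonal_decompose

    # model='additive' or model='multiplicative', depending on whether the
    # seasonal swings grow with the trend (see the discussion above).
    result = seasonal_decompose(series, model='additive', period=12)

    trend = result.trend        # estimated trend component
    seasonal = result.seasonal  # estimated seasonal component
    residual = result.resid     # what is left over; this is what we model next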
Other Methods
Stationarity: A stationary series has the same mean and variance over time. Non-stationary series are much harder to model.
Stationarity means that the statistical properties of a time series (or rather the process
generating it) do not change over time. Stationarity is important because many useful
analytical tools and statistical tests and models rely on it.
Common approach to handle non-stationary series:
Identify sources of non-stationarity.
Find some type of transformation that makes the series stationary.
Build models for that series.
There are four key properties that a time series must exhibit over time for it to be
stationary. They are:
Constant mean
Constant variance
Constant autocorrelation structure
No periodic (seasonal) component
Autocorrelation is a key concept in time series: today's measurement is highly dependent on past values. The time interval between two correlated values is called the lag. If the autocorrelation structure remains constant throughout the series, a simple transformation can yield a stationary series.
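As a quick illustration of lagged correlation, the sketch below computes the autocorrelation at a few lags with pandas; the variable series and the lags chosen are assumptions for the example:

    # series is assumed to be a pandas Series of observations ordered in time.
    for lag in (1, 2, 12):
        # correlation between the series and itself shifted back by `lag` steps
        print(lag, series.autocorr(lag=lag))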
There are several ways to check whether a series is stationary:
1. Run-sequence plot:
Plot the series over time and look for obvious trend, seasonality, or changes in variance.
2. Summary statistics:
Calculate mean and variance over time.
A simple but effective way to do this is to split the data into chunks and compute statistics for each chunk.
Large deviations in either the mean or the variance amongst chunks are problematic and suggest the data is non-stationary.
In the diagram, the mean and variance are roughly constant across chunks, so the series is stationary.
3. Histogram plot:
A histogram plot gives important clues about a time series' underlying structure.
If the distribution is approximately normal, the time series is likely stationary.
If the distribution is non-normal, the time series is likely non-stationary.
Trend, seasonality, and non-constant variance always indicate a non-stationary series.
4. Dickey-Fuller test:
The Augmented Dickey-Fuller (ADF) test specifically tests for stationarity.
It is a hypothesis test: the null hypothesis is that the series is non-stationary (has a unit root). The test returns a p-value, and we generally say the series is stationary if the p-value is less than 0.05.
It is a less appropriate test to use with small datasets, or with data where heteroscedasticity (different variance across observations) is present.
It is best to pair ADF with other techniques such as: run-sequence plots,
summary statistics, or histograms.
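A minimal sketch of the summary-statistics check and the ADF test with numpy and statsmodels; the variable series and the choice of 8 chunks are assumptions for the example:

    import numpy as np
    from statsmodels.tsa.stattools import adfuller

    # Summary statistics: split the series into chunks and compare their statistics.
    # Large differences in mean or variance across chunks suggest non-stationarity.
    for i, chunk in enumerate(np.array_split(series, 8)):
        print(i, chunk.mean(), chunk.var())

    # Augmented Dickey-Fuller test.
    # Null hypothesis: the series is non-stationary (has a unit root);
    # a p-value below 0.05 lets us reject the null and treat the series as stationary.
    stat, pvalue = adfuller(series)[:2]
    print('ADF statistic:', stat, 'p-value:', pvalue)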
Common transformations from non-stationarity to stationarity:
Heteroscedasticity refers to differing variances across observations. The opposite, which is what we are looking for, is homoscedasticity (constant variance).
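For example, a log transform can stabilize growing variance, and differencing can remove trend (with seasonal differencing removing seasonality). A sketch, assuming series is a positive-valued monthly pandas Series:

    import numpy as np

    log_series = np.log(series)                      # stabilizes growing variance (heteroscedasticity)
    diff_series = log_series.diff().dropna()         # first difference removes trend
    seasonal_diff = log_series.diff(12).dropna()     # lag-12 difference removes yearly seasonality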
The nice thing about using a metric like MSE is that we can compare different models or estimates to see which is doing the best job.
Lowest MSE = better model.
Advanced Smoothing:
Single Exponential Smoothing:
The smoothed value is a weighted average of the current observation and the previous smoothed value; the smoothing parameter alpha is between 0 and 1.
Single exponential smoothing produces the same value pushed out over the forecast horizon. Clearly it picks up neither trend nor seasonality, it just repeats that one estimate into the horizon. Therefore, we expand on this and turn to double exponential smoothing.
Double Exponential Smoothing: Double exponential smoothing has the ability to pick up trend. It does so by adding a second component to its formulation that smooths out trend. Here, b denotes the trend component.
Triple Exponential Smoothing: It has the ability to pick up both trend and seasonality. It does this by adding a third component to its formulation that smooths out seasonality.
The most recent observations tend to impact the current value to a much larger degree than older ones do, so we may want to give more weight to more recent observations.
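A minimal sketch of single, double, and triple exponential smoothing with statsmodels; the variable series, the additive trend/seasonal choices, and seasonal_periods=12 are assumptions for the example:

    from statsmodels.tsa.holtwinters import (
        SimpleExpSmoothing, Holt, ExponentialSmoothing)

    single = SimpleExpSmoothing(series).fit()                      # level only
    double = Holt(series).fit()                                    # level + trend
    triple = ExponentialSmoothing(series, trend='add',
                                  seasonal='add',
                                  seasonal_periods=12).fit()       # level + trend + seasonality

    print(triple.forecast(12))   # forecast the next 12 periods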
What about autocorrelated data?
MA(q) models assume that the series depends on the last q error terms (shocks). For q = 2, the forecast has the form:
X_t = mu + e_t + theta_1 * e_(t-1) + theta_2 * e_(t-2)
There are some things to keep in mind when working with ARMA models:
1. The time series is assumed to be stationary (same mean and same
variance over time).
2. A good rule of thumb is to have at least 100 observations when fitting an ARMA model, so that we can properly estimate the past autocorrelations.
There will be three stages in building an ARMA model:
1. Identification
2. Estimation
3. Evaluation
ARMA Identification:
Confirm the following:
The time series is stationary.
Whether the time series contains a seasonal component - the time series used for an ARMA model should not contain a seasonal component.
How to determine seasonality?
We can determine whether seasonality is present using:
1. Autocorrelation and partial autocorrelation plots – how correlated one full seasonal cycle in the past is with the current period.
2. Seasonal subseries plot – shows the average and variation for each season.
3. Intuition – possible in some cases, e.g., seasonal sales of consumer products, holidays, etc.
These plots are an initial step that helps us understand, through visual representation, what type of seasonal patterns impact our data.
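A sketch of the first two checks with statsmodels, assuming series is a monthly pandas Series with a DatetimeIndex (the name and the lag count are assumptions):

    import matplotlib.pyplot as plt
    from statsmodels.graphics.tsaplots import plot_acf, month_plot

    plot_acf(series, lags=36)   # a spike near lag 12 suggests yearly seasonality in monthly data
    month_plot(series)          # seasonal subseries plot: average and variation per month
    plt.show()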
If there is correlation with lag 1, there will also be some residual correlation with lag 2: lag 2 is automatically correlated with the current value if lag 1 is. Therefore an AR(1) model, which is correlated with just one past value, will have a slowly decaying autocorrelation function, because the series is highly correlated with lag 1, which is somewhat correlated with lag 2, lag 3, and so on. Since there is some autocorrelation between lag 1 and lag 2, lag 2 and lag 3, and so on, we cannot tell from the autocorrelation plot alone whether the model is AR(1), AR(2), AR(3), or even AR(5).
For this we have the partial autocorrelation. It measures a partial result: it accounts for the other lags, removes their effects, and allows us to look at each lag's correlation independently.
Identifying p and q for ARMA model:
Once we have a stationary series, we can estimate AR and MA models. We need to
determine p and q, the order of the AR and MA models.
One approach is to look at the:
Autocorrelation plot
Partial Autocorrelation plot
Another approach is to treat p and q as hyperparameters and apply standard
approaches (grid search, cross validation etc.).
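A sketch of both approaches, assuming series is a stationary pandas Series; the lag counts and the search ranges for p and q are illustrative assumptions:

    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
    from statsmodels.tsa.arima.model import ARIMA

    # Visual identification: MA order from the ACF cut-off, AR order from the PACF cut-off.
    plot_acf(series, lags=40)
    plot_pacf(series, lags=40)

    # Hyperparameter search: pick (p, q) with the lowest AIC.
    best = None
    for p in range(4):
        for q in range(4):
            aic = ARIMA(series, order=(p, 0, q)).fit().aic
            if best is None or aic < best[0]:
                best = (aic, p, q)
    print('best (aic, p, q):', best)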
ARMA Estimation:
Estimating the parameters of an ARMA model is a complicated non-linear estimation problem:
Non-linear least squares and Maximum Likelihood Estimation (MLE) are the most common approaches.
Most statistical software will fit the ARMA model, and can potentially help choose the order.
ARMA Validation:
The residuals should approximate white noise (roughly Gaussian, with no remaining autocorrelation).
Otherwise we need to iterate to obtain a better model.
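A sketch of fitting a candidate model and checking its residuals; the variable series and the order (2, 0, 1) are hypothetical choices from the identification step:

    from statsmodels.tsa.arima.model import ARIMA
    from statsmodels.stats.diagnostic import acorr_ljungbox

    res = ARIMA(series, order=(2, 0, 1)).fit()   # (p, q) chosen during identification
    resid = res.resid

    # Residuals should look like white noise: roughly zero mean, constant variance,
    # and no remaining autocorrelation (high Ljung-Box p-values).
    print(resid.mean(), resid.std())
    print(acorr_ljungbox(resid, lags=[10]))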
AR(p): p is the number of coefficients φ1, φ2, ..., φp. If p = 2, the model has coefficients φ1 and φ2.
In AR models, essentially we are just applying linear regression on prior terms.
In MA models, the errors propagate directly to future values of the time series. For example, the error at time t-1 appears directly on the right-hand side of the equation for X_t. With the AR model, the error at time t-1 does not appear directly on the right-hand side; rather, it enters as part of X_(t-1), which is whatever that value is plus that shock. So the shock does not directly affect the equation in the same way it does in the MA model.
In an MA model, a value's error (shock) affects the current value of X and q periods into the future (assuming a model beyond MA(1)). In contrast, in an AR model a shock affects the X values infinitely far into the future, because that shock, and the shocks before it, are built into X_t.
If the autocorrelation plot is slowly decaying, the lags are correlated with each other. In an AR model, past errors propagate into the future, leading to a slowly decaying plot.
If the plot instead jumps back and forth (cuts off sharply), we can be confident in using that model.
If the spike at lag k is smaller than the spike at lag k+1, then the order of the model is k+1; but experiment with the order.
P, D, Q represent the same things as p, d, q, but they are applied across a season (e.g., yearly patterns in monthly data).
o For example, if we are working with yearly seasonality in monthly data, P would be how correlated the series is with the value 12 months prior, D would be how much differencing we should do year over year (should we subtract out the series at lag 12?), and Q would be the correlation with the error term at lag 12, 12 being the number of periods within our season.
m = one season, i.e., the number of periods in one season.
o We have to pass this through to our model, so we need to know the frequency of our seasonality: something like 12 for monthly data; for quarterly data we know it is 4; for weekly data over a year it would be 52.
SARIMA Assumptions:
It is useful to keep a few ARMA, ARIMA, and SARIMA assumptions in mind:
Time series models require that the data be stationary.
If the data are not stationary, remove trend and seasonality, apply differencing, and so on.
Stationary data has no trend or seasonality, and has constant mean and constant variance.
The past is assumed to represent what will happen in the future, in a probabilistic sense.
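A sketch of fitting a seasonal ARIMA with statsmodels; the variable series, the (p, d, q) and (P, D, Q, m) orders, and m = 12 for monthly data are assumptions for the example:

    from statsmodels.tsa.statespace.sarimax import SARIMAX

    model = SARIMAX(series,
                    order=(1, 1, 1),               # non-seasonal (p, d, q)
                    seasonal_order=(1, 1, 1, 12))  # seasonal (P, D, Q, m), m = 12 for monthly data
    res = model.fit(disp=False)
    print(res.forecast(steps=12))   # forecast one full season ahead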
An autocorrelation plot summarizes the two-way correlation between a variable and its past values.
What is RNN?
The most common format is "many-to-one", which maps an input sequence to one output value. The input at each time step sequentially updates the RNN cell's hidden state (memory). After processing the input sequence, the hidden state information is used to predict the output.
The output is between 0 and 1 for a sigmoid activation.
How do we obtain the weight matrices U, V, and W?
When we train a recurrent neural net, we are actually finding weights via the back
propagation algorithm.
In back propagation, we repeatedly process the training data, updating the
weights in order to minimize some cost function.
For time series forecasting, a typical cost function would be the mean squared error or some similar metric that compares the predicted output with the value at the next time step: how far off were we?
Intuitively, we find values of U, V, and W that cause our predicted outputs to be as close as possible to the true target values, i.e., the next step in our sequence.
RNNs often struggle to process long input sequences. It is mathematically difficult for RNNs to capture long-term dependencies over many time steps, which is a problem for time series, as sequences are often hundreds of steps long. Another type of neural network, the long short-term memory network (LSTM), can mitigate these issues with a better memory system.
What is LSTM?
LSTM cells have the same role as RNN cells in sequentially processing the input sequence.
These are internally more complex with gating mechanisms and two states (hidden
state and cell state) that allow for longer term memory.
These networks regulate information flow and memory storage from past time steps.
LSTM cells use forget, input, and output gates that control how memory states are updated and how information is passed forward.
At each time step, the input and current states determine the gate computations.
LSTMs vs RNNs
LSTMs are better suited for handling long-term dependencies than RNNs. However,
they are much more complex, requiring many more trainable weights. As a result,
LSTMs tend to take longer to train (slower backpropagation) and can be more prone to
overfitting.
These are some guidelines on how to choose between LSTMs and RNNs for a forecasting task:
If sequences are many time steps long, an RNN may perform poorly, as RNNs have trouble with longer-term memory.
If training time is an issue, an LSTM may be too cumbersome, as it requires learning more parameters.
Graphics processing units (GPUs) speed up all neural network training but
are especially recommended when training LSTMs on large datasets.
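A minimal many-to-one sketch in Keras; the window length, layer sizes, and the variables X (shape: samples, time steps, features) and y are assumptions for the example, not a prescribed architecture:

    import tensorflow as tf

    n_steps = 24   # length of each input window (hypothetical)
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_steps, 1)),
        # Swap LSTM for tf.keras.layers.SimpleRNN to compare against a plain RNN.
        tf.keras.layers.LSTM(32),
        tf.keras.layers.Dense(1),   # single forecast value (many-to-one)
    ])
    model.compile(optimizer='adam', loss='mse')   # MSE cost, as discussed above
    # X: (n_samples, n_steps, 1) sliding windows; y: (n_samples,) next-step targets
    # model.fit(X, y, epochs=20, batch_size=32, validation_split=0.2)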
Survival Analysis
The hazard function h(t) represents the instantaneous rate at which events occur at time t, given that the event has not already occurred.
f(t) = the probability density function: the probability that the subject will survive for exactly a very specific interval of time (i.e., the density of the event time at t).
S(t) = the survival function: the probability that the subject will survive until a time greater than t.
Some well-known survival models for estimating hazard rates include the following survival regression approaches. These models differ with respect to the assumptions they make about the hazard rate function and the impact of features.
Cox Proportional Hazard (CPH) model
This is one of the most common survival models. It assumes features have a constant
proportional impact on the hazard rate.
For a single non-time-varying feature X, the hazard rate h(t) is modeled as:
h(t | X) = h_0(t) * exp(b * X)
where h_0(t) is the baseline hazard and b is the coefficient on X.
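A minimal sketch with the lifelines library, assuming df is a DataFrame with one row per subject, a duration column, an event indicator (1 = event observed, 0 = censored), and feature columns (all column names here are assumptions):

    from lifelines import CoxPHFitter

    cph = CoxPHFitter()
    cph.fit(df, duration_col='duration', event_col='event')
    cph.print_summary()   # exp(coef) gives each feature's multiplicative effect on the hazard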
When to use survival analysis?
Survival analysis is useful when we want to measure the risk of events occurring and
our data are censored.
Accelerated Failure Time (AFT) models (several variants, including the Weibull AFT model) are another common choice; they make different assumptions about the hazard rate function and the impact of features than the CPH model.
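A sketch of a Weibull AFT fit with lifelines, under the same assumed df layout as in the CPH example:

    from lifelines import WeibullAFTFitter

    aft = WeibullAFTFitter()
    aft.fit(df, duration_col='duration', event_col='event')
    aft.print_summary()
    print(aft.predict_median(df))   # predicted median survival time per subject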