Time Series and Survival Analysis

Introduction to forecasting and time series analysis:

Better forecasts lead to better performance in a company.

Standard regression approaches do not work well for time series data.

In time series, the features and the target come from the same series:
 Data are correlated over time.
 Data are often non-stationary (hard to model): whether we have a stable distribution of data that we can model without fluctuations (such as seasonal fluctuations or trends) matters, and most time series models rely on stationarity assumptions.
 We need a lot of data to capture years of patterns and trends.
Forecast outcomes of one period are used in subsequent periods.
Adding factors and multiple variables can be tricky.
What is Time Series?
A time series is a sequence of data points organized in time order.
The sequence captures data at equally spaced points in time.
Data collected irregularly is not considered a time series.
Why is time series data different?
Evaluating forecast results can be challenging.
Standard measures can be misleading. Standard measures include:
 Forecast miss (%)
 Error rates by horizon
Custom measures based on business impact are often required.
It also takes longer to learn outcomes, as we have to wait for the future period to arrive before we can check the outcome and learn from it.

Forecasting Problems

These are the two types of forecasting problems. Note that the vast majority of applications employ univariate models, since it is harder to combine variables when using time series data.

1. Univariate

Think of a single data series consisting of:

 Continuous data, binary data, or categorical data


 Multiple unrelated series
 Conditional series
2. Panel or Multivariate

Think of multiple related series identifying groups such as customer types, departments or channels, or geographies, with joint estimation across the series.

Time series data is common across many industries.


For example:

 Finance: stock prices, asset prices, macroeconomic factors
 E-Commerce: page views, new users, searches
 Business: transactions, revenue, inventory levels

Time series methods are used to:

 Understand the processes driving observed data.
 Fit models to monitor or forecast a process.
 Understand what influences future results of various series.
 Anticipate events that require management intervention.
In a time series, there must be no duplicate time/date index values and no missing time/date index values, and a frequency must be assigned. For more clarity, see the LAB1 notebook and observe every code cell in it carefully.
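A minimal sketch of these requirements in pandas (the file name, column names, and daily frequency are assumptions for illustration, not taken from the LAB1 notebook):

import pandas as pd

df = pd.read_csv("data.csv", parse_dates=["date"])   # hypothetical file and column names
df = df.drop_duplicates(subset="date")               # no duplicate time/date index values
df = df.set_index("date").sort_index()
y = df["value"].asfreq("D")                          # assign a frequency; missing dates show up as NaN
y = y.interpolate()                                  # one possible way to handle missing index values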

Time series Decomposition:


A time series can be decomposed into several components:

Trend – long term direction

Seasonality – periodic behavior

Residual – irregular fluctuations

Generally, models perform better if we can first remove known sources of variation such as trend and seasonality. The main motivation for doing decomposition is to improve model performance. Usually, we try to identify the known sources and remove them, leaving a resulting series (the residuals) that we can fit with time series models.

Trend captures the general direction of the time series. For example, increasing job growth year over year despite seasonal fluctuations.
Seasonality captures effects that occur with a specific frequency. It can be driven by many factors, such as naturally occurring events (weather), business or administrative procedures (the school year), social or cultural behavior (holidays), and fluctuations due to calendar events (holidays that shift from year to year).
Residuals are the random fluctuations left over once trend and seasonality are removed. There should not be a trend or seasonal pattern in the residuals. They represent short-term fluctuations and may be random, although a portion of the trend or seasonal components may have been missed by the decomposition.
Decomposition Models

These are the main models to decompose Time Series components:

Additive model decomposition:

Additive models assume the observed time series is the sum of its components,

i.e., Observation = Trend + Seasonality + Residual

These models are used when the magnitudes of the seasonal and residual values are independent of the trend.

Multiplicative model decomposition:

Multiplicative models assume the observed time series is the product of its components,

i.e., Observation = Trend * Seasonality * Residual

A multiplicative model can be transformed to an additive one by applying a log transformation:

log(Trend * Seasonality * Residual) = log(Trend) + log(Seasonality) + log(Residual)

These models are used if the magnitudes of the seasonal and residual values fluctuate with the trend.
If there are trend and seasonality effects, we can use exponential smoothing. If the trend is exponential, the choice for the trend component will be MULTIPLICATIVE; if the magnitude of the spikes stays constant, the choice will be ADDITIVE.
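As a sketch (not taken from the LAB notebook), statsmodels' seasonal_decompose can perform either decomposition on a series y; the monthly period of 12 is an assumption for illustration:

from statsmodels.tsa.seasonal import seasonal_decompose

add = seasonal_decompose(y, model="additive", period=12)        # Observation = Trend + Seasonality + Residual
mul = seasonal_decompose(y, model="multiplicative", period=12)  # Observation = Trend * Seasonality * Residual
add.plot()                                                      # inspect the trend, seasonal and residual panels
residuals = add.resid                                           # the series we would go on to model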
Pseudo-additive Decomposition Model

Pseudo-additive models combine elements of the additive and multiplicative models.
They can be useful when:
 Time series values are close to or equal to zero.
 We expect features related to a multiplicative model.
 A division by zero needs to be avoided.
The model takes the form Ot = Tt + Tt(St – 1) + Tt(Rt – 1) = Tt(St + Rt – 1), where Tt is the trend, St is the seasonal component, and Rt is the residual.

In the multiplicative model, the seasonal and residual components should be centered around one, since one times any value equals that value. In the additive model, the seasonal and residual components are centered around zero, since any value plus zero equals that value.
Decomposition of a time series allows us to remove deterministic components, which would otherwise complicate modeling.

After removing these components, the main focus is to model the residual.

Other Methods

These are some other approaches to time series decomposition:

 Single, double, or triple exponential smoothing
 Locally Estimated Scatterplot Smoothing (LOESS)  better for estimating seasonal components; it allows the user to control the rate of change and is fairly robust to outliers. It can only handle additive decompositions.
 Frequency-based methods  use spectral analysis. The goal is to find the underlying recurring patterns, or the periodic seasonal component, without having to specify any frequency such as yearly or monthly.
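A minimal sketch of a LOESS-based (STL) decomposition with statsmodels, assuming a monthly series y with yearly seasonality:

from statsmodels.tsa.seasonal import STL

res = STL(y, period=12).fit()   # additive-only, LOESS-based decomposition
res.plot()                      # trend, seasonal and residual panels
residuals = res.resid           # fairly robust to outliers in y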

Stationarity and Autocorrelation

Stationarity: A stationary series has the same mean and variance over time. Non-stationary series are much harder to model.
Stationarity means that the statistical properties of a time series (or rather the process
generating it) do not change over time. Stationarity is important because many useful
analytical tools and statistical tests and models rely on it.
Common approach to handle non-stationary series:
 Identify sources of non-stationarity.
 Find some type of transformation that makes the series stationary.
 Build models for that series.
There are four key properties that a time series must exhibit over time for it to be
stationary. They are:
 Constant mean
 Constant variance
 Constant autocorrelation structure
 No periodic (seasonal) component
Autocorrelation is a key concept in time series. It means today’s measurement is highly dependent on past values. The time interval between two correlated values is called the lag. If the autocorrelation structure remains constant throughout the series, a simple transformation will often yield a stationary series.


Check for non-stationarity:


1. Run-sequence plot:
A run-sequence plot is simply a plot of unadjusted time series data.
 This is often the first step in time series analysis to check for non-stationarity.
 It often shows if there is underlying structure.
 Be on the lookout for trend, seasonality, and autocorrelation.

2. Summary statistics:
Calculate mean and variance over time.
 A simple but effective way to do this is to split the data into chunks and
compute statistics for each chunk.
 Large deviations in either the mean or the variance among chunks are problematic and suggest the data is non-stationary.
In the first diagram, the mean and variance are somewhat constant  stationary.

In the second diagram, the mean is not constant but the variance is somewhat constant  non-stationary.

3. Histogram plot:
A histogram plot gives important clues into a time series' underlying structure.
 If the distribution is approximately normal, it is likely the time series is stationary.
 If we see a non-normal distribution, it indicates the time series is likely non-stationary.

Trends, seasonality and non-constant variance always indicate non-stationarity.

4. Dickey-Fuller test:
The Augmented Dickey-Fuller (ADF) test specifically tests for stationarity.

 It is a hypothesis test: the test returns a p-value, and we generally say the series is stationary (rejecting the null hypothesis of a unit root) if the p-value is less than 0.05.
 It is a less appropriate test to use with small datasets, or with data where heteroscedasticity (different variance across observations) is present.
 It is best to pair the ADF test with other techniques such as run-sequence plots, summary statistics, or histograms.
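A minimal sketch of the ADF test with statsmodels, assuming the series under test is y:

from statsmodels.tsa.stattools import adfuller

adf_stat, p_value = adfuller(y)[:2]
# a small p-value (e.g. < 0.05) rejects the unit-root null hypothesis,
# which is evidence that the series is stationary
print(adf_stat, p_value)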
Common transformations from non-stationarity to stationarity:

Heteroscedasticity refers to differing variances across observations. The opposite, which is what we are looking for, is homoscedasticity.
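The table of transformations itself is not reproduced here; a sketch of ones mentioned elsewhere in these notes (differencing, seasonal differencing, and a log transform), assuming a pandas Series y:

import numpy as np

log_y = np.log(y)                    # log transform: stabilizes growing variance (heteroscedasticity)
diff_y = y.diff().dropna()           # first difference: removes trend
seasonal_diff = y.diff(12).dropna()  # lag-12 difference: removes yearly seasonality in monthly data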

Time series smoothing:

Smoothing is a process that often improves our ability to forecast a series by reducing the impact of noise.
There are many ways to smooth data. Some examples:
 Simple average smoothing – taking the average of the series.
 Equally weighted moving average – e.g., rolling averages (taking a certain window and finding the average within that window).
 Exponentially weighted moving average – weighting certain lags more or less heavily, depending on certain criteria.
Smoothing is one important tool that allows us to improve forward-looking forecasts.
 Consider the stationary data to the right.
 How might we forecast what will happen one, two, or more steps into the future?

Solution: calculate the mean of the series and predict that value into the future.
o This looks quite reasonable in this case.
o However, we should be more rigorous and calculate how far off our estimate is from reality.

Mean Squared Error (MSE) is a metric commonly employed to quantitatively measure the efficacy of an estimate.

The nice thing about using a metric like MSE is that we can compare different models or estimates to see which is doing the best job.
Lowest MSE  better model
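For example (with made-up numbers), the mean forecast and its MSE could be computed as:

import numpy as np

y = np.array([5.1, 4.8, 5.3, 5.0, 4.9, 5.2])   # hypothetical stationary series
forecast = y.mean()                             # predict the series mean into the future
mse = np.mean((y - forecast) ** 2)              # mean squared error of that estimate
print(forecast, mse)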

What if there is trend?

We cannot use the mean value method: it overestimates in the first half and underestimates in the second half.
What is a better approach?  The solution is moving averages.
Moving average:
Moving average smoothing techniques allow us to avoid sensitivity to local fluctuations.
Two primary approaches:
 Equally weighted – each past value is weighted equally within the window.
 Exponentially weighted – a weighting system that weights more recent lags more heavily compared to those further back in the past.
This method works well for trend, seasonality, and trend + seasonality (since they are linear or follow a wave-like pattern).
But if the data is exponential, an equally weighted moving average will not work very well. The solution is the exponentially weighted moving average.
Exponentially weighted smoothing works by smoothing the entire series, and it does a better job than the equally weighted moving average.
These two techniques (the equally weighted moving average and the exponentially weighted moving average) are still insufficient for forecasting heavy trends and seasonality.
A small example:
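The original example is not reproduced here; an illustrative sketch with pandas (the series values and window size are assumptions):

import pandas as pd

y = pd.Series(range(24), index=pd.date_range("2020-01-01", periods=24, freq="MS"))
equal_weighted = y.rolling(window=3).mean()   # equally weighted moving average over a 3-period window
exp_weighted = y.ewm(alpha=0.5).mean()        # exponentially weighted: recent lags weighted more heavily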

Advanced Smoothing:
Single Exponential Smoothing:
Alpha is between 0 and 1.
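The recursion itself is not reproduced in these notes; the standard textbook form (an assumption here), with smoothing level $\ell_t$ and parameter $\alpha$, is:

$$\ell_t = \alpha x_t + (1 - \alpha)\,\ell_{t-1}, \qquad \hat{x}_{t+h} = \ell_t \ \text{for every horizon } h,$$

so the forecast for any horizon is simply the latest smoothed level.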

Single exponential smoothing produces the same value pushed out over the forecast horizon.
Clearly it picks up neither trend nor seasonality, but just pushes out that one estimate repeatedly into the horizon. Therefore, we will expand on this and turn to double exponential smoothing.

Double Exponential Smoothing: Double exponential smoothing has the ability to pick up trend. It does this by adding a second component to its formulation that smooths out trend.

Here, b is the trend component.

Double exponential smoothing can pick up trend. This is a step in the right direction. However, it fails to pick up seasonality.

Triple Exponential Smoothing: It has the ability to pick up trend and seasonality. It does this by adding a third component to its formulation that smooths out seasonality.

Here, m is however far into the future we want to predict. This gives an even lower MSE than the previous methods.
If the data lacks a trend  use Single Exponential Smoothing.
If the data has a trend but no seasonality  use Double Exponential Smoothing.
If the data has trend and seasonality  use Triple Exponential Smoothing.
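A sketch of all three in statsmodels (the additive trend/seasonality choice and the seasonal period of 12 are assumptions for illustration):

from statsmodels.tsa.holtwinters import SimpleExpSmoothing, ExponentialSmoothing

single = SimpleExpSmoothing(y).fit()                      # no trend, no seasonality
double = ExponentialSmoothing(y, trend="add").fit()       # picks up trend
triple = ExponentialSmoothing(y, trend="add", seasonal="add",
                              seasonal_periods=12).fit()  # picks up trend and seasonality
print(triple.forecast(12))                                # forecast m = 12 steps into the future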

The most recent observations tend to really impact the current ones to a much larger
degree than the older ones do. So maybe we want to add more weight to more recent
trends.
What about autocorrelated data?

Autoregressive Models and Moving Average Models:


These models leverage the autocorrelation within the series.
ARMA models combine two models:
 The first is the Autoregressive (AR) model. It anticipates the series' dependence on its own past values. If we are working with a stationary series and a value is a bit higher than the mean, then the following value (assuming a positive correlation) is likely also to be higher than the mean of the series.
 The second is the Moving Average (MA) model. It anticipates the series' dependence on past forecast errors. Again, with a stationary series, a deviation from the non-changing mean will lead to some jump in reaction to that error.
 The combination (ARMA) is also known as the Box-Jenkins approach.
ARMA models are often expressed using the orders p and q for the AR and MA
components. For a time series variable X that we want to predict for time t, the last few
observations are Xt-3, Xt-2, Xt-1.
AR(p) models assume the forecast depends on the last p values of the time series. For p=2, the forecast has the form given below: the future value is a linear combination of Xt-1 and Xt-2.

MA(q) models assume the forecast depends on the last q forecast errors. For q=2, the forecast has the form given below.
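Neither equation is reproduced in these notes; in standard textbook notation (an assumption here), with white-noise errors $\varepsilon$, they are:

AR(2): $X_t = c + \phi_1 X_{t-1} + \phi_2 X_{t-2} + \varepsilon_t$
MA(2): $X_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2}$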
There are some things to keep in mind when working with ARMA models:
1. The time series is assumed to be stationary (same mean and same
variance over time).
2. A good rule of thumb is to have at least 100 observations when fitting an ARMA model, so that past autocorrelations can be properly estimated.
There will be three stages in building an ARMA model:
1. Identification
2. Estimation
3. Evaluation
ARMA Identification:
Confirm the following:
 The time series is stationary.
 Whether the time series contain some seasonal components - The Time Series
in ARMA model should not contain a seasonal component.
How to determine seasonality?
We can determine whether seasonality is present using:
1. Autocorrelation and partial autocorrelation plots – how correlated one period of the full seasonal cycle in the past is with the current period.
2. Seasonal subseries plot – shows the average and variation for each different season.
3. Intuition – possible in some cases, e.g., seasonal sales of consumer products, holidays, etc.
These plots are an initial step to help us understand, visually, what type of seasonal patterns impact our data.
If there is correlation with lag 1, there will also be some residual correlation with lag 2, because lag 2 is automatically correlated with the current value if lag 1 is. An AR(1) model, for example, which is correlated with just one past value, will therefore have a slowly decaying autocorrelation: the current value is highly correlated with lag 1, which is somewhat correlated with lag 2, and lag 3, and so on  there is carried-over correlation between lag 1 and lag 2, lag 2 and lag 3, etc. So we cannot automatically tell whether it is an AR(1), AR(2), AR(3) or even AR(5) model.
For this we have the partial autocorrelation.

It measures a partial result: it considers the other lags, removes their effects, and allows you to look at the correlation at each lag independently.
Identifying p and q for ARMA model:
Once we have a stationary series, we can estimate AR and MA models. We need to
determine p and q, the order of the AR and MA models.
One approach is to look at the:
 Autocorrelation plot
 Partial Autocorrelation plot
Another approach is to treat p and q as hyperparameters and apply standard
approaches (grid search, cross validation etc.).
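A sketch of the two plots with statsmodels, assuming a stationary series y:

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(2, 1)
plot_acf(y, lags=40, ax=axes[0])    # a sharp cutoff after lag q suggests MA(q); slow decay suggests an AR part
plot_pacf(y, lags=40, ax=axes[1])   # a sharp cutoff after lag p suggests AR(p)
plt.show()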
ARMA Estimation:
Estimating the parameters of an ARMA model is a complicated non-linear estimation problem:
 Non-linear least squares and Maximum Likelihood Estimation (MLE) are the most common approaches.
 Most statistical software will fit the ARMA model, and can potentially help choose the order.
ARMA Validation:
If the model fits well, the residuals should approximate a Gaussian white-noise process.
Otherwise, we need to iterate to obtain a better model.
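A minimal sketch of estimation and validation with statsmodels (the orders p = 2, q = 1 are purely illustrative):

from statsmodels.tsa.arima.model import ARIMA

result = ARIMA(y, order=(2, 0, 1)).fit()   # ARMA(2, 1): d = 0 means no differencing
print(result.summary())                    # coefficient estimates plus AIC/BIC
residuals = result.resid                   # should look like white noise; otherwise iterate on p and q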
AR(p)  p is number of coefficients ᶲ1, ᶲ2 and so on till ᶲp. If p = 2  ᶲ1, ᶲ2
In AR models, essentialy we are just applying linear regression on prior terms.
In MA  The errors propagate to future values of time series directly. So, for example,
the error at time t -1 appears directly on that right side of our equation for X sub
t. Whereas with the AR model via error at time t- 1 does not appear directly on the right
side. Or rather as a factor or a part of the Xt- 1, and the Xt- 1 will be whatever that value is
plus that shock. So, it does not directly affect the equation in the same way that the MA
model does.
MA value and the error (shock) affects the current values of X and q periods into the
future assuming that you are using more than MA1 model. And in contrast, the AR
model, that shock is going to be affecting the X value infinitely far into the future.
Because if you think about that shock and the shock before and the shock before that,
those are going to be built into the Xt.
If the autocorrelation plot decays slowly, the lags are correlated with each other; in an AR model, a past shock propagates into the future, which leads to a slowly decaying plot. If the plot instead cuts off and jumps back and forth around zero after a certain lag, we can be much more confident about which model to use. If the spike at lag k is shorter than the spike at lag k+1, then the order of the model is k+1, but experiment with the order.

ARIMA and SARIMA Models:


Why do we need ARIMA models?
ARIMA stands for Autoregressive Integrated Moving Average.
Observed data is often the result of an integrated series. An integrated series results from adding previous values together. Examples are stock prices (integrated) vs stock returns, and product sales year to date (integrated) vs product sales per day.
Such series can be transformed into stationary series by differencing (subtracting the previous value from each observation).
ARIMA models extend AR/MA models to allow for integrated data.
ARIMA has three components:
1. AR model
2. Integrated component
3. MA model
ARIMA model is denoted by ARIMA(p, d, q).
 p is the order of the AR model
 d is the number of times to difference the data
 q is the order of MA model
 p, d and q are all non-negative integers.
Differencing nonstationary time series data one or more times can make it stationary.
That is the Integrated (I) component of ARIMA.
 d is going to be the number of times to perform a lag1 difference on the data.
 d = 0  no differencing  ARMA model
 d=1  differencing once
 d=2  differencing twice  Use on data with exponential growth.
SARIMA models:
SARIMA is short form of Seasonal ARIMA, an extension of ARIMA models to address
seasonality.
A SARIMA model is denoted by SARIMA(p, d, q)(P, D, Q)m.

 P, D, Q represent the same as p, d, q, but they are applied across a season (e.g., yearly seasonality in monthly data).
o For example, with monthly data across a year, P would be how correlated the series is with its value 12 months prior, D would be how much differencing we should do year over year (i.e., whether we subtract out the series at lag 12), and Q would be the correlation with the error term at lag 12, where 12 is the number of periods within our season.
 m = the number of periods in one season.
o We have to pass this to the model, so we need to know the frequency of our seasonality: something like 12 for monthly data, 4 for quarterly data, and 52 for weekly data over a year.
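A sketch with statsmodels' SARIMAX, assuming monthly data with yearly seasonality (all orders are illustrative):

from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))  # (p,d,q)(P,D,Q)m with m = 12
result = model.fit(disp=False)
print(result.aic, result.bic)        # information criteria for comparing candidate orders
forecast = result.forecast(steps=12)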

ARIMA and SARIMA Estimation

These are the steps to estimate p, d, q and P, D, Q:

 Visually inspect a run sequence plot for trend and seasonality.


 Generate an ACF Plot.
 Generate a PACF Plot.
 Treat as hyperparameters (cross validate).
 Examine information criteria (AIC, BIC), which penalize the number of parameters the model uses  used for automated model selection.

AIC is Akaike Information Criteria and BIC is Bayesian Information Criteria.


The AIC and the BIC are more useful if you are evaluating on the entire training set and want to ensure that the model doesn't overfit.
ARIMA summary:
 Flexible family of models that capture autocorrelation.
 Based on strong statistical foundation.
 Requires stationary time series.
 Choosing optimal parameters manually requires care.
 Some software will automatically find parameters.
 Can be challenging to explain and interpret.
 Can be prone to overfitting.

SARIMA Assumptions:
It is useful to keep few ARMA, ARIMA and SARIMA assumptions in mind:
 Time series models require that the data is stationary.
 If it is not stationary, remove trend and seasonality, apply differencing, and so on.
 Stationary data has no trend or seasonality, and has a constant mean and constant variance.
 The past is assumed to represent what will happen in future, in a probabilistic
sense.

An autocorrelation plot summarizes the two-way correlation between a variable and its past values.

We need to get low AIC and BIC values.


Deep Learning and forecasting:

Neural Networks do not come for free:

 Models can be complicated and computationally expensive to build (GPUs can help).
 Deep learning models often overfit.
 It is very challenging to explain/interpret predictions made by the models (“black box”).
 They tend to perform best with much larger training datasets.

What is RNN?

Used in time series analysis.

RNN map sequence of inputs to predicted output.

The most common format is “many-to-one”, which maps an input sequence to one output value. The input at each time step sequentially updates the RNN cell’s hidden state (memory). After processing the input sequence, the hidden state information is used to predict the output.
The output is between 0 and 1 for a sigmoid activation.
How do we obtain the weight matrices U, V, and W?
 When we train a recurrent neural net, we are actually finding the weights via the backpropagation algorithm.
 In backpropagation, we repeatedly process the training data, updating the weights in order to minimize some cost function.
 For time series forecasting, a typical cost function would be the mean squared error, or a similar metric that compares the predicted output against the true next time step: how far off were we?
 Intuitively, we find values of U, V, and W that cause our predicted outputs to be as close as possible to the true target values at the next step in our sequence.

RNNs often struggle to process long input sequences. It is mathematically difficult for
RNNs to capture long-term dependencies over many time steps, which is a problem for
Time Series, as sequences are often hundreds of steps. Another type of Neural
Networks, long short-term memory networks (LSTMs) can mitigate these issues with a
better memory system.

What is LSTM?
LSTM cells play the same role as RNN cells in the sequential processing of the input sequence.

These are internally more complex with gating mechanisms and two states (hidden
state and cell state) that allow for longer term memory.

These networks regulate information flow and memory storage from past time steps.

LSTM cells use forget, input and output gates that control how the memory states are updated and how information is passed forward.

At each time step, the input and current states determine the gate computations.
LSTMs vs RNNs

LSTMs are better suited for handling long-term dependencies than RNNs. However,
they are much more complex, requiring many more trainable weights. As a result,
LSTMs tend to take longer to train (slower backpropagation) and can be more prone to
overfitting.

These are some guidelines on how to choose LSTMs or RNNs in a Forecasting task:

Always consider the problem at hand:

 If sequences are many time steps long, an RNN may perform poorly as they
have trouble with longer term memory.
 If training time is an issue, using a LSTM may be too cumbersome as it
requires learning more parameters.
 Graphics processing units (GPUs) speed up all neural network training but
are especially recommended when training LSTMs on large datasets.
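As an illustration, a minimal many-to-one LSTM forecaster in Keras might look like the sketch below; the window length, layer size, and data shapes are assumptions, not taken from any lab:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

window, n_features = 24, 1                        # each sample: 24 past values of a single series
model = Sequential([
    LSTM(32, input_shape=(window, n_features)),   # hidden/cell state summarizes the input sequence
    Dense(1),                                     # one output: the forecast for the next step
])
model.compile(optimizer="adam", loss="mse")       # MSE cost, minimized by backpropagation
# model.fit(X_train, y_train, epochs=20)          # X_train shape: (samples, window, n_features)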

Survival Analysis and Censoring


Survival Analysis focuses on estimating the length of time until an event occurs. It is
called ‘survival analysis’ because it was largely developed by medical researchers
interested in estimating the expected lifetime of different cohorts. Today, these methods
are applied to many types of events in the business domain.

Examples:

 How long will a customer remain before churning?


 How long until equipment needs repairs?
Standard regression-based approaches do not work due to the issue of censoring. For example, not all customers have churned by the time we collect the data, so their churn times are unknown. This is censoring.
Survival Analysis is useful when we want to measure the risk of events occurring and
our data are Censored.

 This can be referred to as failure time, event time, or survival time.


 If our data are complete and unbiased, standard regression methods may work.
 Survival Analysis allows us to consider cases with incomplete or censored data.

The Survival Function is defined as S(t) = P(T > t). It measures the probability that a subject will survive past time t.

T is the time of the event.

If t = 5 years  S(5) is the probability of survival beyond 5 years, i.e., P(T > 5).

This function:

 Is decreasing (non-increasing) over time.


 Starts at 1 for all observations when t=0
 Ends at 0 for a high-enough t
The Hazard Rate is defined as h(t) = f(t) / S(t).

 It represents the instantaneous rate at which events occur, given that the event has not occurred already.
f(t) = the probability density function of the event time, i.e., the probability that the event occurs within a very small, specific interval around t.
S(t) = the probability that the subject will survive until a time greater than that time t.

 The cumulative hazard rate (the sum of h(t) from time 0 to time t) represents the accumulated risk over time.

The Kaplan-Meier estimator is a non-parametric estimator. It allows us to use observed data to estimate the survival distribution. The Kaplan-Meier curve plots the cumulative probability of survival beyond each given time period.

Using the Kaplan-Meier curve allows us to visually inspect differences in survival rates by category. We can use Kaplan-Meier curves to examine whether there appear to be differences based on a given feature. For example, to see whether survival rates differ based on the number of services, we estimate Kaplan-Meier curves for the different groups.
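A sketch with the lifelines library, assuming a DataFrame df with hypothetical columns "tenure" (duration), "churned" (event indicator) and "num_services" (grouping feature):

import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

kmf = KaplanMeierFitter()
ax = plt.subplot()
for name, grp in df.groupby("num_services"):                            # one curve per group
    kmf.fit(grp["tenure"], event_observed=grp["churned"], label=str(name))
    kmf.plot_survival_function(ax=ax)                                   # compare survival by category
plt.show()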
Survival Analysis Approach:

The Kaplan-Meier approach provides sample averages. However, we may want to make use of individual-level data to predict survival rates.

Here we turn to Survival Regression approaches, which:

 Allow us to generate estimates of total risk as a function of time.


 Make use of censored and uncensored observations to predict hazard rates.
 Allow us to estimate feature effects.
Although these methods use time, they are not generally predicting the time to an event; rather, they predict survival risk (or hazard) as a function of time.

Some well-known Survival models for estimating Hazard Rates include these Survival Regression approaches:

 Cox Proportional Hazards model
 Accelerated Failure Time (AFT) models

These models differ with respect to the assumptions they make about the hazard rate function and the impact of features.
Cox Proportional Hazard (CPH) model

This is one of the most common survival models. It assumes features have a constant
proportional impact on the hazard rate.

For a single non-time-varying feature X, the hazard rate h(t) is modeled as:
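The equation itself is not reproduced in these notes; in its standard form (an assumption here), with baseline hazard $h_0(t)$ and coefficient $\beta$:

$$h(t \mid X) = h_0(t)\, e^{\beta X},$$

so a one-unit increase in X scales the hazard by the constant factor $e^{\beta}$. A minimal sketch of fitting this with the lifelines library (column names are hypothetical):

from lifelines import CoxPHFitter

cph = CoxPHFitter()
cph.fit(df, duration_col="tenure", event_col="churned")   # df holds durations, event flags and features
cph.print_summary()                                       # estimated hazard ratios per feature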

Accelerated Failure Time (AFT) models (several variants including the Weibull AFT
model)

These models differ with respect to assumptions they make about the hazard rate
function, and the impact of features.
