Week 10 Intro Time Series
Import Modules
In [1]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
Read data
We will work with the El Nino temperature data set. The data are available from
statsmodels . A link with a description of the data is provided below.
https://www.statsmodels.org/stable/datasets/generated/elnino.html
The data are imported via a statsmodels function in the cell below.
In [3]: dta = sm.datasets.elnino.load_pandas().data
The data set consists of 13 columns: one column for each month of the year plus a
column that stores the year. Thus, one row corresponds to all monthly measurements
within a single year!
In [4]: dta.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61 entries, 0 to 60
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 YEAR 61 non-null float64
1 JAN 61 non-null float64
2 FEB 61 non-null float64
3 MAR 61 non-null float64
4 APR 61 non-null float64
5 MAY 61 non-null float64
6 JUN 61 non-null float64
7 JUL 61 non-null float64
8 AUG 61 non-null float64
9 SEP 61 non-null float64
10 OCT 61 non-null float64
11 NOV 61 non-null float64
12 DEC 61 non-null float64
dtypes: float64(13)
memory usage: 6.3 KB
Reorganize data
The first few rows of the data set are shown below with the .head() method.
In [5]: dta.head()
Out[5]: YEAR JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV
0 1950.0 23.11 24.20 25.37 23.86 23.03 21.57 20.63 20.15 19.67 20.03 20.02
1 1951.0 24.19 25.28 25.60 25.37 24.79 24.69 23.86 22.32 21.44 21.77 22.33
2 1952.0 24.52 26.21 26.37 24.73 23.71 22.34 20.89 20.02 19.63 20.40 20.77
3 1953.0 24.15 26.34 27.36 27.03 25.47 23.49 22.20 21.45 21.25 20.95 21.60
4 1954.0 23.02 25.00 25.33 22.97 21.73 20.77 19.52 19.33 18.95 19.11 20.27
The last few rows are shown via the .tail() method below.
In [6]: dta.tail()
Out[6]: YEAR JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NO
56 2006.0 24.76 26.52 26.22 24.29 23.84 22.82 22.20 21.89 21.93 22.46 22.6
57 2007.0 25.82 26.81 26.41 24.96 23.05 21.61 21.05 19.95 19.85 19.31 19.8
58 2008.0 24.24 26.39 26.91 25.68 24.43 23.19 23.02 22.14 21.60 21.39 21.5
59 2009.0 24.39 25.53 25.48 25.84 24.95 24.09 23.09 22.03 21.48 21.64 21.9
60 2010.0 24.70 26.16 26.54 26.04 24.75 23.26 21.11 19.49 19.28 19.73 20.4
As previously mentioned, the data has one column for each month in a year. Although
this looks like a well organized data set, we cannot use it for time series analysis in this
state! We must reshape the data into long-format! The wide-format approach is
common because it "looks" well organized. However, it is not TIDY! CMPINF 2110 dives
into the concepts of TIDY data in more detail. However, for our purposes the key reason
why the data are not TIDY is because the column names are values. The column names
store the month the measurement was recorded. Instead, we need the data organized
such that the month is a value contained within the rows of a column.
We will use the .melt() method to reshape the data. The argument id_vars is set
to YEAR . The argument value_vars is set to all columns except YEAR . If you look
closely at the previous output, YEAR is the zeroth column in the DataFrame.
Thus, we can access all the other column names easily as shown below.
In [7]: dta.columns[1:].to_list()
Out[7]: ['JAN',
'FEB',
'MAR',
'APR',
'MAY',
'JUN',
'JUL',
'AUG',
'SEP',
'OCT',
'NOV',
'DEC']
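The .melt() call itself does not appear in the rendered output. A minimal sketch of how the long-format DataFrame lf could have been created, using the arguments described above, is:
# reshape wide -> long: month names go into 'variable', measurements into 'value'
lf = dta.melt( id_vars=['YEAR'], value_vars=dta.columns[1:].to_list() )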
The long-format data has 3 columns. The YEAR column and two new columns,
variable and value .
In [9]: lf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 732 entries, 0 to 731
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 YEAR 732 non-null float64
1 variable 732 non-null object
2 value 732 non-null float64
dtypes: float64(2), object(1)
memory usage: 17.3+ KB
The variable column stores the names of the months (the original wide-format
column names). The value column stores the values associated with the months.
In [12]: lf.variable.value_counts()
Out[12]: variable
JAN 61
FEB 61
MAR 61
APR 61
MAY 61
JUN 61
JUL 61
AUG 61
SEP 61
OCT 61
NOV 61
DEC 61
Name: count, dtype: int64
One row now corresponds to a measurement in a month within a year! Thus, the
measurements within 1950 are spread across 12 rows in the long-format data:
In [13]: lf.loc[ lf.YEAR == dta.YEAR.min(), : ]
(the output displays 12 long-format rows, one per month of 1950, each with YEAR , variable , and value )
Why would we want to consider wide-format data? One reason is wide-format makes it
easy to explore correlation between the months across years! The figure below creates
a heatmap to show the correlation plot between all pairs of months. Such a figure can
only be made with a wide-format version of the data. We see that the months are highly
correlated!
In [15]: fig, ax = plt.subplots()
sns.heatmap( dta.drop(columns=['YEAR']).corr(), ax=ax )  # correlation between all pairs of months
plt.show()
Long-format data is easier to group and aggregate and therefore summarize the data!
For example, we can count the number of rows and number of unique years associated
with each month. The cell below accomplishes this via the
.groupby().aggregate() "chain". You should notice the results are the same as
.value_counts() !
In [16]: lf.groupby(['variable']).\
aggregate(num_rows = ('value', 'size'),
num_years = ('YEAR', 'nunique')).\
reset_index()
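The cell that adds the my_year column is not shown in the rendered output. Based on the .info() result below (an int64 column), a plausible reconstruction simply casts YEAR to an integer:
# hypothetical reconstruction: store the year as an integer
lf['my_year'] = lf.YEAR.astype(int)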
In [18]: lf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 732 entries, 0 to 731
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 YEAR 732 non-null float64
1 variable 732 non-null object
2 value 732 non-null float64
3 my_year 732 non-null int64
dtypes: float64(2), int64(1), object(1)
memory usage: 23.0+ KB
In [19]: lf.head()
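Between this point and the next .info() call, the my_year column changes from int64 to object, so it was presumably converted to a string, for example:
# hypothetical reconstruction: convert the integer year to text
lf['my_year'] = lf.my_year.astype(str)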
In [21]: lf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 732 entries, 0 to 731
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 YEAR 732 non-null float64
1 variable 732 non-null object
2 value 732 non-null float64
3 my_year 732 non-null object
dtypes: float64(2), object(2)
memory usage: 23.0+ KB
In [22]: lf.head()
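The my_date column that appears next stores text in the YYYY-MON format (for example, '1950-JAN'). A plausible sketch of the missing cell combines the year string with the month abbreviation:
# hypothetical reconstruction: build a 'YYYY-MON' string from the year and month columns
lf['my_date'] = lf.my_year + '-' + lf.variable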
In [24]: lf.head()
In [24]: lf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 732 entries, 0 to 731
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 YEAR 732 non-null float64
1 variable 732 non-null object
2 value 732 non-null float64
3 my_year 732 non-null object
4 my_date 732 non-null object
dtypes: float64(2), object(3)
memory usage: 28.7+ KB
Let's now convert my_date into a date time object! The conversion is executed with
the pd.to_datetime() function. This function will try to "guess" the appropriate date
conversion based on some simple checks. The current format of YYYY-MON is an easy
format for the function to figure out.
In [25]: pd.to_datetime( lf.my_date )
C:\Users\XPS15\AppData\Local\Temp\ipykernel_21032\2942030940.py:1: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
pd.to_datetime( lf.my_date )
Out[25]: 0 1950-01-01
1 1951-01-01
2 1952-01-01
3 1953-01-01
4 1954-01-01
...
727 2006-12-01
728 2007-12-01
729 2008-12-01
730 2009-12-01
731 2010-12-01
Name: my_date, Length: 732, dtype: datetime64[ns]
A warning is displayed, but the conversion executed successfully. You can remove the
warning by specifying the format argument. The specific value to set for format
depends on how the date is "written". Special codes must be provided to parse the date.
The my_date column is written as YYYY-MON and so we need to specify the special
code for that format. There are many such special codes to denote years, months, days,
and other aspects of a date time. For our specific purposes we need to use '%Y-%b'
because of the YYYY-MON format. Please see the documentation for a complete list of
the special codes for parsing date times.
https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
Providing the format argument with the appropriate special code removes the warning message, as
shown below.
In [26]: pd.to_datetime( lf.my_date, format='%Y-%b')
Out[26]: 0 1950-01-01
1 1951-01-01
2 1952-01-01
3 1953-01-01
4 1954-01-01
...
727 2006-12-01
728 2007-12-01
729 2008-12-01
730 2009-12-01
731 2010-12-01
Name: my_date, Length: 732, dtype: datetime64[ns]
Let's store the converted dates in a new column named date_dt . The format argument is omitted below, so the same warning appears again, but the conversion is still correct.
lf['date_dt'] = pd.to_datetime( lf.my_date )
C:\Users\XPS15\AppData\Local\Temp\ipykernel_21032\2897020388.py:1: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
lf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 732 entries, 0 to 731
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 YEAR 732 non-null float64
1 variable 732 non-null object
2 value 732 non-null float64
3 my_year 732 non-null object
4 my_date 732 non-null object
5 date_dt 732 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(2), object(3)
memory usage: 34.4+ KB
Displaying the head and tail of the DataFrame reveals the date_dt column "looks"
different from the previous my_date column!
In [29]: lf
Since date_dt is now a true datetime column, we have full access to the calendar information! Calendar
arithmetic is tricky. After all, the day after February 28 is usually March 1, unless it's a leap year!
We would have to manage calendar arithmetic ourselves if we kept the date as a string. However, we do
not have to worry about such issues! The datetime object handles all the weird calendar problems
for us!
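As a quick illustration (not part of the original notebook), pandas datetime arithmetic handles the leap-year boundary automatically:
# the day after February 28 depends on the year; the datetime machinery gets it right
print( pd.Timestamp('2010-02-28') + pd.Timedelta(days=1) )   # 2010-03-01 (not a leap year)
print( pd.Timestamp('2008-02-28') + pd.Timedelta(days=1) )   # 2008-02-29 (leap year)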
The date time components can be accessed using the .dt accessor. For example,
the YEAR component is extracted with .dt.year .
In [30]: lf.date_dt.dt.year
Out[30]: 0 1950
1 1951
2 1952
3 1953
4 1954
...
727 2006
728 2007
729 2008
730 2009
731 2010
Name: date_dt, Length: 732, dtype: int32
Similarly, the MONTH component is extracted with .dt.month .
In [31]: lf.date_dt.dt.month
Out[31]: 0 1
1 1
2 1
3 1
4 1
..
727 12
728 12
729 12
730 12
731 12
Name: date_dt, Length: 732, dtype: int32
These might not seem interesting since we already have the YEAR and MONTH in the
data. However, we can extract other components like the QUARTER!
In [32]: lf.date_dt.dt.quarter
Out[32]: 0 1
1 1
2 1
3 1
4 1
..
727 4
728 4
729 4
730 4
731 4
Name: date_dt, Length: 732, dtype: int32
Even though there are 12 MONTHS in a YEAR, there are only 4 QUARTERS in the YEAR.
In [33]: lf.date_dt.dt.quarter.value_counts()
Out[33]: date_dt
1 183
2 183
3 183
4 183
Name: count, dtype: int64
The date time components are very useful because you do not need to figure out properties
of the date yourself. The date time components will give them to you!
We have created the necessary datetime object, but the data are still not ready for the
time series methods. Most time series methods work with Pandas Series objects, not DataFrames.
Therefore, we need to separate out the value column into its own Series object.
In [34]: my_series = lf.value.copy()
In [35]: my_series
Out[35]: 0 23.11
1 24.19
2 24.52
3 24.15
4 23.02
...
727 24.15
728 21.15
729 22.73
730 23.21
731 22.07
Name: value, Length: 732, dtype: float64
Time series methods require the index of a Series to be a DatetimeIndex. As shown
below, the index of my_series is just the default RangeIndex.
In [36]: print( my_series.index )
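The cell that assigns the DatetimeIndex to my_series is not shown in the rendered output. A minimal sketch, assuming the date_dt column created earlier becomes the index, is:
# hypothetical reconstruction: use the datetime column as the Series index
my_series.index = pd.DatetimeIndex( lf.date_dt )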
After assigning the datetime column as the index, my_series has a DatetimeIndex! We are one step closer, but this
index does not have a defined frequency. We can see that because the attribute freq is
equal to None. The frequency is VERY IMPORTANT in time series methods! It
specifies the sampling frequency, or the rate at which the data are collected. Let's go ahead
and enforce that the frequency is the start of each month. In Pandas, we enforce or change
the sampling frequency with the .resample() method. It is important to note that the
.resample() method does NOT refer to resampling methods like cross-validation. It
has to do with altering how time series data are stored.
The frequency is specified by an offset alias. Please see the link below for the many
different options that are available.
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-
aliases
We are working with monthly data, and so our offset alias must refer to the month, either
the start or the end. We will use the start-of-month frequency and so we will set
the rule argument to 'MS' . The resampled data must be summarized in order to
aggregate repeat observations within the desired sampling frequency. The current
example does not have any repeat observations within a single month, but the summary
method is applied after the .resample() method. For example, the .mean()
method is applied below to calculate the MEAN value associated with a month within a
year if there are multiple measurements per month. The average of a single value is the
value itself and so applying .mean() will not change the monthly values in the
example.
In [39]: ready_series = my_series.copy().resample('MS').mean()
The index attribute is displayed for the original my_series object and the resampled
ready_series object. Notice that the freq is now displayed as 'MS' to represent
that the ready_series index "knows" that each observation corresponds to the start
of the month.
In [40]: print( my_series.index )
print( ready_series.index )
Printing the first few elements of the Series object will tell us the frequency of the
DateTimeIndex as well.
In [42]: print( '-- original series --')
print( my_series.head() )
print( ' ' )
print( '-- after resampling --' )
print( ready_series.head() )
-- original series --
date_dt
1950-01-01 23.11
1951-01-01 24.19
1952-01-01 24.52
1953-01-01 24.15
1954-01-01 23.02
Name: value, dtype: float64
-- after resampling --
date_dt
1950-01-01 23.11
1950-02-01 24.20
1950-03-01 25.37
1950-04-01 23.86
1950-05-01 23.03
Freq: MS, Name: value, dtype: float64
If you are still wondering why we need a defined sampling frequency, it is because most
time series methods require the data to be collected at regular or fixed intervals. Irregular time
series frequencies are very challenging. We will not discuss such methods in this
course. The .resample() method therefore ensures all measurements exist at some
fixed regular interval.
Visualizations
Let's start out by simply visualizing the time series. Series objects in Pandas have a
default plot method which plots the value of the elements with respect to the index.
Since we have a DateTimeIndex, the value in ready_series will be plotted with
respect to the calendar date.
In [37]: ready_series.plot( figsize=(15, 6) )
plt.show()
We could have created the above figure with the DataFrame and un-resampled data
using Seaborn. There's nothing wrong with the figure below, after all the data are the
same as those shown in the previous figure. The previous figure though was created
using the appropriately organized and resampled Pandas Series.
In [45]: sns.relplot(data = lf, x='date_dt', y='value', kind='line', aspect=2.25)
plt.show()
C:\Users\XPS15\Anaconda3\envs\cmpinf2120_2024\lib\site-packages\seaborn\axisgrid.py:118: UserWarning: The figure layout has changed to tight
self._figure.tight_layout(*args, **kwargs)
Let's look at the lag plot to get an idea of the autocorrelation structure of the data. We
can easily make the lag plot with the Pandas pd.plotting.lag_plot() function. The
lag argument is how we specify the lag to consider. Below, we plot the value with
respect to its lagged value. The lag plot reveals the linear relationship between two
sequential measurements. The lag plot is nothing more than a scatter plot, but it helps
us understand relationships between observations in the time series. Please note the lag
plot is created using the appropriately organized time series data and not the
DataFrame.
In [46]: fig, ax = plt.subplots(figsize=(8, 8))
pd.plotting.lag_plot( ready_series, lag=1, ax=ax )
plt.show()
We could also consider 2 lags. In our current example this would correspond to 2
months prior. The lag argument in the pd.plotting.lag_plot() function is
set to lag=2 below. This plot examines the relationship between measurements
that are two months apart. For example, we are comparing measurements in June to
measurements in April.
In [47]: fig, ax = plt.subplots(figsize=(8, 8))
pd.plotting.lag_plot( ready_series, lag=2, ax=ax )
plt.show()
What if we considered a lag of 12? In our application, that would correspond to the
previous year! We would thus be comparing the temperature value at the same month,
one year prior! Therefore, we would be looking at the correlation between observations
of the same season!
In [49]: fig, ax = plt.subplots(figsize=(8, 8))
pd.plotting.lag_plot( ready_series, lag=12, ax=ax )
plt.show()
Let's create a series of lag plots and visualize them within a plot grid. The faceted lag
plot lets us examine the autocorrelation structure across numerous lags!
In [50]: lags_use = [1, 3, 6, 9, 12, 15, 18, 21, 24]

# 3x3 grid to hold one lag plot per entry in lags_use
fig, ax = plt.subplots(3, 3, figsize=(12, 12))
ax = ax.ravel()

for k in range(len(lags_use)):
    pd.plotting.lag_plot( ready_series, lag=lags_use[k], ax=ax[k] )
    ax[k].plot( ax[k].get_xlim(), ax[k].get_ylim(), 'k--')
    ax[k].set_title('lag: ' + str(lags_use[k]) )

plt.show()
Pandas also provides the pd.plotting.autocorrelation_plot() function, which summarizes the autocorrelation across all lags in a single figure.
fig, ax = plt.subplots(figsize=(12, 6))
pd.plotting.autocorrelation_plot( ready_series, ax = ax )
plt.show()
I typically use the statsmodels option since I feel it is easier to control than the Pandas
method when visualizing the autocorrelation.
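A minimal sketch of the statsmodels version, assuming 36 lags are displayed (the exact number used in class is not recorded here):
fig, ax = plt.subplots(figsize=(12, 6))
sm.graphics.tsa.plot_acf( ready_series, lags=36, ax=ax )  # autocorrelation function with confidence bands
plt.show()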
Let's now decompose the time series into a trend, a seasonal component, and a residual. statsmodels provides a classical seasonal decomposition function. A link to its documentation is provided below.
https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html
The function is sm.tsa.seasonal_decompose() . The first argument is a Pandas
Series with the DateTimeIndex properly identified.
In [53]: my_decomposition = sm.tsa.seasonal_decompose(ready_series, model='additive')
We can plot the decomposition with the built-in plotting method associated with the
decomposition object.
fig = my_decomposition.plot()
We can now use the wide-format plotting in Seaborn to easily compare the observed and
seasonally adjusted trends.
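The cell that assembles df_decomp is not visible in the rendered output. A plausible sketch, mirroring the df_stl construction used later, is:
# hypothetical reconstruction: observed series and the seasonally adjusted series
df_decomp = pd.DataFrame({'observed': my_decomposition.observed,
                          'seasonal_adjusted': my_decomposition.observed - my_decomposition.seasonal},
                         index=ready_series.index)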
In [56]: sns.relplot( data = df_decomp, kind='line', aspect=2.5 )
plt.show()
C:\Users\XPS15\Anaconda3\envs\cmpinf2120_2024\lib\site-packages\seaborn\axisgrid.py:118: UserWarning: The figure layout has changed to tight
self._figure.tight_layout(*args, **kwargs)
Let's take a look at the variation across years within each month. The
sm.graphics.tsa.month_plot() function shows the average value per month in
red. The black line is the observed value within the month across the years. The month
plot shows the seasonal pattern in the data. There is a difference in the average
monthly values in the second half of the calendar year compared to the first half!
However, the plot below includes the trend and thus makes it difficult to visualize whether the
seasonal pattern changes over time!
In [49]: fig, ax = plt.subplots(figsize=(12, 8))
sm.graphics.tsa.month_plot( ready_series, ax=ax )
plt.show()
The classical additive decomposition method returns NaN values for the .trend
attribute at the beginning and end of the series. This is because of how the Moving
Average procedure works. The NaN values are shown at the head and tail of .trend
below.
In [57]: my_decomposition.trend
Out[57]: date_dt
1950-01-01 NaN
1950-02-01 NaN
1950-03-01 NaN
1950-04-01 NaN
1950-05-01 NaN
..
2010-08-01 NaN
2010-09-01 NaN
2010-10-01 NaN
2010-11-01 NaN
2010-12-01 NaN
Freq: MS, Name: trend, Length: 732, dtype: float64
The presentation slides for this week discuss why this happens in detail. Briefly, the
Moving Average cannot calculate the smooth trend at the beginning and end of the series;
it only produces the smooth trend "in the middle".
The first non-NaN value occurs in July of the first year.
In [58]: my_decomposition.trend[:12]
Out[58]: date_dt
1950-01-01 NaN
1950-02-01 NaN
1950-03-01 NaN
1950-04-01 NaN
1950-05-01 NaN
1950-06-01 NaN
1950-07-01 21.998333
1950-08-01 22.088333
1950-09-01 22.142917
1950-10-01 22.215417
1950-11-01 22.351667
1950-12-01 22.555000
Freq: MS, Name: trend, dtype: float64
The last non NaN value occurs in June of the last year.
In [59]: my_decomposition.trend[-12:]
Out[59]: date_dt
2010-01-01 23.658333
2010-02-01 23.470000
2010-03-01 23.272500
2010-04-01 23.101250
2010-05-01 22.957083
2010-06-01 22.845000
2010-07-01 NaN
2010-08-01 NaN
2010-09-01 NaN
2010-10-01 NaN
2010-11-01 NaN
2010-12-01 NaN
Freq: MS, Name: trend, dtype: float64
Why does this matter? We have already explored the seasonally adjusted behavior, but
there are other components we can examine. The detrended behavior subtracts the
"macro" trend from the observed values, which isolates the changes in each
month over time. The classical additive method cannot calculate the detrended values
for all months because of the NaN values in the trend.
However, we can still examine the seasonal behavior by using the .seasonal attribute.
This is the change within a month after removing the "macro" trend and the noisy
residual (error). The month plot for the seasonal component is shown below, but it is
rather boring. That is because the classical additive method assumes the seasonal
component does not change over the years. The value per month is the same for all
years! For our current example, that means March is always greater than February!
In [60]: fig, ax = plt.subplots(figsize=(12, 8))
sm.graphics.tsa.month_plot( my_decomposition.seasonal, ax=ax )
plt.show()
If we would like to see if the seasonal behavior changes over time, we must use a more
complicated decomposition method. statsmodels includes the Seasonal and Trend
decomposition using LOESS (STL) method. So let's try that approach below.
In [61]: from statsmodels.tsa.seasonal import STL
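The cell that creates the STL object is not shown in the rendered output. A minimal sketch, assuming the default settings (the monthly period is inferred from the DatetimeIndex), is:
# hypothetical reconstruction: set up the STL decomposition for the monthly series
ready_stl = STL( ready_series )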
The STL model is fit by calling the .fit() method. In the decomposition plot shown below,
it is a little difficult to see, but if you look closely the seasonal pattern is NOT constant over time!
The amplitude is "undulating", or varying up and down, as the years change.
In [63]: ready_stl_fit = ready_stl.fit()
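The decomposition plot referenced above can be drawn with the fitted object's built-in .plot() method, for example:
fig = ready_stl_fit.plot()
plt.show()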
Unlike the classical decomposition, the STL trend estimate is available at every time point. The .trend attribute is displayed below.
In [65]: ready_stl_fit.trend
Out[65]: date_dt
1950-01-01 21.387036
1950-02-01 21.486001
1950-03-01 21.588166
1950-04-01 21.693680
1950-05-01 21.802330
...
2010-08-01 22.526164
2010-09-01 22.377637
2010-10-01 22.227452
2010-11-01 22.076353
2010-12-01 21.925026
Freq: MS, Name: trend, Length: 732, dtype: float64
We can therefore explore the detrended data and the seasonally adjusted data!
In [66]: df_stl = pd.DataFrame({'observed': ready_stl_fit.observed,
                                'seasonal_adjusted': ready_stl_fit.observed - ready_stl_fit.seasonal,
                                'detrend': ready_stl_fit.observed - ready_stl_fit.trend},
                               index=ready_series.index)
Plotting the observed data, seasonally adjusted data, and the detrended data on the
same chart looks wrong! That is because the detrended data has the "macro trend"
removed! The magnitude and scale can be different compared to the raw data. A
detrended value of 0 does not mean the raw data are zero. A detrended value of zero
means that month does not deviate from the "macro trend".
In [67]: sns.relplot( data = df_stl, kind='line', aspect=2.5 )
plt.show()
C:\Users\XPS15\Anaconda3\envs\cmpinf2120_2024\lib\site-packages\seaborn\axisgrid.py:118: UserWarning: The figure layout has changed to tight
self._figure.tight_layout(*args, **kwargs)
Thus, when plotting on the original scale, we should instead focus on comparing the observed and
seasonally adjusted series:
In [68]: sns.relplot( data = df_stl.loc[:, ['observed', 'seasonal_adjusted']], kind='line', aspect=2.5 )
plt.show()
C:\Users\XPS15\Anaconda3\envs\cmpinf2120_2024\lib\site-packages\seaborn\axisgrid.py:118: UserWarning: The figure layout has changed to tight
self._figure.tight_layout(*args, **kwargs)
The detrended data allow us to study whether the seasonal effects change over time. The
detrended data can be easily visualized as the variation around the monthly average
change!
In [69]: fig, ax = plt.subplots(figsize=(12, 8))
sm.graphics.tsa.month_plot( df_stl.detrend, ax=ax )
plt.show()
But, the above figure is equivalent to visualizing the .seasonal component with a
month plot.
In [72]: fig, ax = plt.subplots(figsize=(12, 8))
sm.graphics.tsa.month_plot( ready_stl_fit.seasonal, ax=ax )
plt.show()
Conclusion
This report introduced organizing data for time series analysis, exploring the time series,
and visually finding seasonal patterns with decompositions.