
CMPINF 2120 - Week 10


Introduction to Time Series Methods
Before we start forecasting, we need to know how to organize the data for time series
methods! This notebook introduces how to format time series objects correctly and
introduces the idea of the sampling frequency associated with measurements.
Common visualizations are demonstrated, including lag plots and autocorrelation plots.
Time series decomposition with the classic additive decomposition method is
demonstrated on a simple data set. The decomposition results are visualized, as is the
seasonally adjusted trend.

Import Modules
In [1]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns

Most of the time series methods are part of statsmodels .


In [2]: import statsmodels.api as sm

Read data
We will work with the El Nino temperature data set. The data are available from
statsmodels . A link with a description of the data is provided below.

https://www.statsmodels.org/stable/datasets/generated/elnino.html
The data are imported via a statsmodels function in the cell below.
In [3]: dta = sm.datasets.elnino.load_pandas().data

The data consists of 13 columns. There is 1 column for each month in the year and a
column which stores the year. Thus, one row corresponds to all monthly measurements
within a single year!
In [4]: dta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61 entries, 0 to 60
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 YEAR 61 non-null float64
1 JAN 61 non-null float64
2 FEB 61 non-null float64
3 MAR 61 non-null float64
4 APR 61 non-null float64
5 MAY 61 non-null float64
6 JUN 61 non-null float64
7 JUL 61 non-null float64
8 AUG 61 non-null float64
9 SEP 61 non-null float64
10 OCT 61 non-null float64
11 NOV 61 non-null float64
12 DEC 61 non-null float64
dtypes: float64(13)
memory usage: 6.3 KB

Reorganize data
The first few rows of the data set are shown below with the .head() method.
In [5]: dta.head()

Out[5]: YEAR JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV
0 1950.0 23.11 24.20 25.37 23.86 23.03 21.57 20.63 20.15 19.67 20.03 20.02
1 1951.0 24.19 25.28 25.60 25.37 24.79 24.69 23.86 22.32 21.44 21.77 22.33
2 1952.0 24.52 26.21 26.37 24.73 23.71 22.34 20.89 20.02 19.63 20.40 20.77
3 1953.0 24.15 26.34 27.36 27.03 25.47 23.49 22.20 21.45 21.25 20.95 21.60
4 1954.0 23.02 25.00 25.33 22.97 21.73 20.77 19.52 19.33 18.95 19.11 20.27
The last few rows are shown via the .tail() method below.
In [6]: dta.tail()

Out[6]: YEAR JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NO
56 2006.0 24.76 26.52 26.22 24.29 23.84 22.82 22.20 21.89 21.93 22.46 22.6
57 2007.0 25.82 26.81 26.41 24.96 23.05 21.61 21.05 19.95 19.85 19.31 19.8
58 2008.0 24.24 26.39 26.91 25.68 24.43 23.19 23.02 22.14 21.60 21.39 21.5
59 2009.0 24.39 25.53 25.48 25.84 24.95 24.09 23.09 22.03 21.48 21.64 21.9
60 2010.0 24.70 26.16 26.54 26.04 24.75 23.26 21.11 19.49 19.28 19.73 20.4

As previously mentioned, the data has one column for each month in a year. Although
this looks like a well organized data set, we cannot use it for time series analysis in this
state! We must reshape the data into long-format! The wide-format approach is
common because it "looks" well organized. However, it is not TIDY! CMPINF 2110 dives
into the concepts of TIDY data in more detail. However, for our purposes the key reason
why the data are not TIDY is because the column names are values. The column names
store the month the measurement was recorded. Instead, we need the data organized
such that the month is a value contained within the rows of a column.
We will use the .melt() method to reshape the data. The argument id_vars is set
to YEAR . The argument value_vars is set to all columns except YEAR . If you look
closely at the previous output, YEAR is the zeroth column in the DataFrame.
Thus, we can access all other column names easily as shown below.
In [7]: dta.columns[1:].to_list()

Out[7]: ['JAN',
'FEB',
'MAR',
'APR',
'MAY',
'JUN',
'JUL',
'AUG',
'SEP',
'OCT',
'NOV',
'DEC']

The data are reshaped to long-format below.


In [8]: lf = dta.melt( id_vars = ['YEAR'], value_vars = dta.columns[1:].to_list(), ignore_index = True )

The long-format data has 3 columns. The YEAR column and two new columns,
variable and value .

In [9]: lf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 732 entries, 0 to 731
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 YEAR 732 non-null float64
1 variable 732 non-null object
2 value 732 non-null float64
dtypes: float64(2), object(1)
memory usage: 17.3+ KB

The variable column stores the names of the months (the original wide-format
column names). The value column stores the values associated with the months
within a given year.


In [10]: lf.head()

Out[10]: YEAR variable value


0 1950.0 JAN 23.11
1 1951.0 JAN 24.19
2 1952.0 JAN 24.52
3 1953.0 JAN 24.15
4 1954.0 JAN 23.02
In [11]: lf.tail()

Out[11]: YEAR variable value


727 2006.0 DEC 24.15
728 2007.0 DEC 21.15
729 2008.0 DEC 22.73
730 2009.0 DEC 23.21
731 2010.0 DEC 22.07
We can confirm that the variable column consists of all 12 months using
.value_counts() .

In [12]: lf.variable.value_counts()

Out[12]: variable
JAN 61
FEB 61
MAR 61
APR 61
MAY 61
JUN 61
JUL 61
AUG 61
SEP 61
OCT 61
NOV 61
DEC 61
Name: count, dtype: int64

One row now corresponds to a measurement in a month within a year! Thus, the
measurements within 1950 are spread across 12 rows in the long-format data:
In [13]: lf.loc[ lf.YEAR == dta.YEAR.min(), : ]

Out[13]: YEAR variable value


0 1950.0 JAN 23.11
61 1950.0 FEB 24.20
122 1950.0 MAR 25.37
183 1950.0 APR 23.86
244 1950.0 MAY 23.03
305 1950.0 JUN 21.57
366 1950.0 JUL 20.63
427 1950.0 AUG 20.15
488 1950.0 SEP 19.67
549 1950.0 OCT 20.03
610 1950.0 NOV 20.02
671 1950.0 DEC 21.80
Contrast the above with the original wide-format data which spread the measurements
across 12 columns in a single row.
In [14]: dta.loc[ dta.YEAR == dta.YEAR.min(), :]

Out[14]: YEAR JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC
0 1950.0 23.11 24.2 25.37 23.86 23.03 21.57 20.63 20.15 19.67 20.03 20.02 21.8
Why would we want to consider wide-format data? One reason is wide-format makes it
easy to explore correlation between the months across years! The figure below creates
a heatmap to show the correlation plot between all pairs of months. Such a figure can
only be made with a wide-format version of the data. We see that the months are highly
correlated!
In [15]: fig, ax = plt.subplots()

sns.heatmap( dta.iloc[:, 1:].corr(),


vmin=-1, vmax=1, center=0,
cmap='coolwarm',
ax=ax)

plt.show()

Long-format data is easier to group, aggregate, and therefore summarize!
For example, we can count the number of rows and number of unique years associated
with each month. The cell below accomplishes this via the
.groupby().aggregate() "chain". You should notice the results are the same as
.value_counts() !

In [16]: lf.groupby(['variable']).\
aggregate(num_rows = ('value', 'size'),
num_years = ('YEAR', 'nunique')).\
reset_index()

Out[16]: variable num_rows num_years


0 APR 61 61
1 AUG 61 61
2 DEC 61 61
3 FEB 61 61
4 JAN 61 61
5 JUL 61 61
6 JUN 61 61
7 MAR 61 61
8 MAY 61 61
9 NOV 61 61
10 OCT 61 61
11 SEP 61 61
The data still cannot be provided to time series methods, though. We must first create a date
time index. At the moment we do not have one. We simply have the year, as a double precision
floating point number (a decimal), and the abbreviated name of the month as text (a
Pandas object data type). We will create a new column, my_date , which has the format
YYYY-Month. There are several steps to creating the datetime column in this format.
First, let's convert YEAR to an integer. The result is assigned to a NEW column
my_year within the long-format data.

In [17]: lf['my_year'] = lf.YEAR.astype('int64')

In [18]: lf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 732 entries, 0 to 731
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 YEAR 732 non-null float64
1 variable 732 non-null object
2 value 732 non-null float64
3 my_year 732 non-null int64
dtypes: float64(2), int64(1), object(1)
memory usage: 23.0+ KB

In [19]: lf.head()

Out[19]: YEAR variable value my_year


0 1950.0 JAN 23.11 1950
1 1951.0 JAN 24.19 1951
2 1952.0 JAN 24.52 1952
3 1953.0 JAN 24.15 1953
4 1954.0 JAN 23.02 1954
Next, let's convert my_year to a string (object) data type. Just why we are creating a
string will be apparent shortly.
In [20]: lf['my_year'] = lf.my_year.astype('str')

In [21]: lf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 732 entries, 0 to 731
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 YEAR 732 non-null float64
1 variable 732 non-null object
2 value 732 non-null float64
3 my_year 732 non-null object
dtypes: float64(2), object(2)
memory usage: 23.0+ KB

In [22]: lf.head()

Out[22]: YEAR variable value my_year


0 1950.0 JAN 23.11 1950
1 1951.0 JAN 24.19 1951
2 1952.0 JAN 24.52 1952
3 1953.0 JAN 24.15 1953
4 1954.0 JAN 23.02 1954
Next, concatenate the my_year string with the month abbreviation, within the
variable column. Assign the result to a new column my_date . The .str.cat()
method is used to concatenate or combine the two strings together. This method is only
available for string data types and not integer data types. That's why the my_year
column was forced to be a string! The .str.cat() method includes an argument
sep which denotes how the combined strings are separated. I want the strings
separated by a - character.

In [23]: lf['my_date'] = lf.my_year.str.cat( lf.variable, sep = '-' )

In [24]: lf.head()

Out[24]: YEAR variable value my_year my_date


0 1950.0 JAN 23.11 1950 1950-JAN
1 1951.0 JAN 24.19 1951 1951-JAN
2 1952.0 JAN 24.52 1952 1952-JAN
3 1953.0 JAN 24.15 1953 1953-JAN
4 1954.0 JAN 23.02 1954 1954-JAN
The above display shows my_date has the YEAR and MONTH abbreviation separated
by a - character. The my_date column now includes both the YEAR and MONTH the
measurement is associated with! Although this looks like everything we need...the
my_date column is a string! The my_date column does not "know" it's a date!

In [24]: lf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 732 entries, 0 to 731
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 YEAR 732 non-null float64
1 variable 732 non-null object
2 value 732 non-null float64
3 my_year 732 non-null object
4 my_date 732 non-null object
dtypes: float64(2), object(3)
memory usage: 28.7+ KB

Let's now convert my_date into a date time object! The conversion is executed with
the pd.to_datetime() function. This function will try to "guess" the appropriate date
conversion based on some simple checks. The current format of YYYY-MON is an easy
format for the function to figure out.
In [25]: pd.to_datetime( lf.my_date )

C:\Users\XPS15\AppData\Local\Temp\ipykernel_21032\2942030940.py:1: UserWarni
ng: Could not infer format, so each element will be parsed individually, fal
ling back to `dateutil`. To ensure parsing is consistent and as-expected, pl
ease specify a format.
pd.to_datetime( lf.my_date )

Out[25]: 0 1950-01-01
1 1951-01-01
2 1952-01-01
3 1953-01-01
4 1954-01-01
...
727 2006-12-01
728 2007-12-01
729 2008-12-01
730 2009-12-01
731 2010-12-01
Name: my_date, Length: 732, dtype: datetime64[ns]

A warning is displayed, but the conversion executed successfully. You can remove the
warning by specifying the format argument. The specific value to set for format
depends on how the date is "written". Special codes must be provided to parse the date.
The my_date column is written as YYYY-MON and so we need to specify the special
code for that format. There are many such special codes to denote years, months, days,
and other aspects of a date time. For our specific purposes we need to use '%Y-%b'
because of the YYYY-MON format. Please see the documentation for a complete list of
the special codes for parsing date times.
https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
Providing the format for the special code removes the warning message, as shown
below.
In [26]: pd.to_datetime( lf.my_date, format='%Y-%b')

Out[26]: 0 1950-01-01
1 1951-01-01
2 1952-01-01
3 1953-01-01
4 1954-01-01
...
727 2006-12-01
728 2007-12-01
729 2008-12-01
730 2009-12-01
731 2010-12-01
Name: my_date, Length: 732, dtype: datetime64[ns]

Let's assign the result of the pd.to_datetime() function to a new column,
date_dt , within lf . The cell below does not set the format argument and
therefore requires pd.to_datetime() to guess the appropriate parsing action. I
recommend first trying the default behavior and checking if the conversion worked
correctly. This way you do not need to look through the special codes unless you have
to.
In [27]: lf['date_dt'] = pd.to_datetime( lf.my_date )

C:\Users\XPS15\AppData\Local\Temp\ipykernel_21032\2897020388.py:1: UserWarni
ng: Could not infer format, so each element will be parsed individually, fal
ling back to `dateutil`. To ensure parsing is consistent and as-expected, pl
ease specify a format.
lf['date_dt'] = pd.to_datetime( lf.my_date )

As shown below date_dt is a datetime object!


In [28]: lf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 732 entries, 0 to 731
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 YEAR 732 non-null float64
1 variable 732 non-null object
2 value 732 non-null float64
3 my_year 732 non-null object
4 my_date 732 non-null object
5 date_dt 732 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(2), object(3)
memory usage: 34.4+ KB

Displaying the head and tail of the DataFrame reveals the date_dt column "looks"
different from the previous my_date column!
In [29]: lf

Out[29]: YEAR variable value my_year my_date date_dt


0 1950.0 JAN 23.11 1950 1950-JAN 1950-01-01
1 1951.0 JAN 24.19 1951 1951-JAN 1951-01-01
2 1952.0 JAN 24.52 1952 1952-JAN 1952-01-01
3 1953.0 JAN 24.15 1953 1953-JAN 1953-01-01
4 1954.0 JAN 23.02 1954 1954-JAN 1954-01-01
... ... ... ... ... ... ...
727 2006.0 DEC 24.15 2006 2006-DEC 2006-12-01
728 2007.0 DEC 21.15 2007 2007-DEC 2007-12-01
729 2008.0 DEC 22.73 2008 2008-DEC 2008-12-01
730 2009.0 DEC 23.21 2009 2009-DEC 2009-12-01
731 2010.0 DEC 22.07 2010 2010-DEC 2010-12-01
732 rows × 6 columns
The date_dt column is in a new format, YYYY-MM-DD. Even though our data was not
provided with a day, a date time object requires a year, a month, and a day. In this way,
we have full access to the calendar information! Calendar arithmetic is tricky. After all,
the day after February 28 is usually March 1, unless it's a leap year! We would have to
manage calendar arithmetic ourselves if we kept the date as a string. However, we do not have to
worry about such issues! The datetime object handles all the weird calendar problems
for us!
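As a small illustration of this (the check below is an aside, not part of the original notebook), the datetime machinery knows which years are leap years without any extra work from us:

toy_day = pd.Timestamp('1951-02-28') + pd.Timedelta(days=1)
print( toy_day )   # 1951-03-01, since 1951 is a common year

toy_day = pd.Timestamp('1952-02-28') + pd.Timedelta(days=1)
print( toy_day )   # 1952-02-29, since 1952 is a leap year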
The date time components can be separated using the .dt. attributes. For example,
the YEAR component is extracted with .dt.year .
In [30]: lf.date_dt.dt.year

Out[30]: 0 1950
1 1951
2 1952
3 1953
4 1954
...
727 2006
728 2007
729 2008
730 2009
731 2010
Name: date_dt, Length: 732, dtype: int32

While the MONTH component is extracted with .dt.month .


In [31]: lf.date_dt.dt.month

Out[31]: 0 1
1 1
2 1
3 1
4 1
..
727 12
728 12
729 12
730 12
731 12
Name: date_dt, Length: 732, dtype: int32

These might not seem interesting since we already have the YEAR and MONTH in the
data. However, we can extract other components like the QUARTER!
In [32]: lf.date_dt.dt.quarter

Out[32]: 0 1
1 1
2 1
3 1
4 1
..
727 4
728 4
729 4
730 4
731 4
Name: date_dt, Length: 732, dtype: int32

Even though there are 12 MONTHS in a YEAR, there are only 4 QUARTERS in the YEAR.
In [33]: lf.date_dt.dt.quarter.value_counts()

Out[33]: date_dt
1 183
2 183
3 183
4 183
Name: count, dtype: int64

The date time components are very useful because you do not need to figure out properties
of the date yourself. The date time components will give them to you!
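A few other .dt accessors worth knowing about (a quick aside; these are not used in the rest of this notebook):

lf.date_dt.dt.day_name()       # the weekday name of each date
lf.date_dt.dt.days_in_month    # the number of days in that month
lf.date_dt.dt.is_leap_year     # True if the date falls in a leap year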
We have created the necessary datetime object, but the data are still not ready for the
time series methods. Most time series methods work with Pandas Series objects, not DataFrames.
Therefore, we need to separate out the value column into its own Series object.
In [34]: my_series = lf.value.copy()

In [35]: my_series

Out[35]: 0 23.11
1 24.19
2 24.52
3 24.15
4 23.02
...
727 24.15
728 21.15
729 22.73
730 23.21
731 22.07
Name: value, Length: 732, dtype: float64

Time series methods require the index of a Series to be a datetime index. As shown
below the index to my_series is just the default range index.
In [36]: print( my_series.index )

RangeIndex(start=0, stop=732, step=1)

Let's assign the .index to be the date_dt column from lf .


In [37]: my_series.index = lf.date_dt

In [38]: print( my_series.index )

DatetimeIndex(['1950-01-01', '1951-01-01', '1952-01-01', '1953-01-01',


'1954-01-01', '1955-01-01', '1956-01-01', '1957-01-01',
'1958-01-01', '1959-01-01',
...
'2001-12-01', '2002-12-01', '2003-12-01', '2004-12-01',
'2005-12-01', '2006-12-01', '2007-12-01', '2008-12-01',
'2009-12-01', '2010-12-01'],
dtype='datetime64[ns]', name='date_dt', length=732, freq=None)

As we can see above, the index is now a DatetimeIndex! We are one step closer, but this
index does not have a defined frequency. We can see that by the attribute freq being
equal to None above. The frequency is VERY IMPORTANT in time series methods! It
specifies the sampling frequency, or rate at which the data are collected. Let's go ahead
and force the frequency to be the start of each month. In Pandas, we enforce or change
the sampling frequency with the .resample() method. It is important to note that the
.resample() method does NOT refer to resampling methods like cross-validation. It
has to do with altering how time series data are stored.
The frequency is specified by an offset alias. Please see the link below for the many
different options that are available.
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
We are working with monthly data and so our offset alias must refer to either the start or
the end of the month. We will use the start of the month frequency and so we will set
the rule argument to 'MS' . The resampled data must be summarized in order to
aggregate repeat observations within the desired sampling frequency. The current
example does not have any repeat observations within a single month, but the summary
method is applied after the .resample() method. For example, the .mean()
method is applied below to calculate the MEAN value associated with a month within a
year if there are multiple measurements per month. The average of a single value is the
value itself and so applying .mean() will not change the monthly values in the
example.
In [39]: ready_series = my_series.copy().resample('MS').mean()

The index attribute is displayed for the original my_series object and the resampled
ready_series object. Notice that the freq is now displayed as 'MS' to represent
that the ready_series index "knows" that each observation corresponds to the start
of the month.
In [40]: print( my_series.index )

DatetimeIndex(['1950-01-01', '1951-01-01', '1952-01-01', '1953-01-01',


'1954-01-01', '1955-01-01', '1956-01-01', '1957-01-01',
'1958-01-01', '1959-01-01',
...
'2001-12-01', '2002-12-01', '2003-12-01', '2004-12-01',
'2005-12-01', '2006-12-01', '2007-12-01', '2008-12-01',
'2009-12-01', '2010-12-01'],
dtype='datetime64[ns]', name='date_dt', length=732, freq=None)

In [41]: print( ready_series.index )

DatetimeIndex(['1950-01-01', '1950-02-01', '1950-03-01', '1950-04-01',


'1950-05-01', '1950-06-01', '1950-07-01', '1950-08-01',
'1950-09-01', '1950-10-01',
...
'2010-03-01', '2010-04-01', '2010-05-01', '2010-06-01',
'2010-07-01', '2010-08-01', '2010-09-01', '2010-10-01',
'2010-11-01', '2010-12-01'],
dtype='datetime64[ns]', name='date_dt', length=732, freq='MS')

Printing the first few elements of the Series object will tell us the frequency of the
DateTimeIndex as well.
In [42]: print( '-- original series --')
print( my_series.head() )
print( ' ' )
print( '-- after resampling --' )
print( ready_series.head() )

-- original series --
date_dt
1950-01-01 23.11
1951-01-01 24.19
1952-01-01 24.52
1953-01-01 24.15
1954-01-01 23.02
Name: value, dtype: float64

-- after resampling --
date_dt
1950-01-01 23.11
1950-02-01 24.20
1950-03-01 25.37
1950-04-01 23.86
1950-05-01 23.03
Freq: MS, Name: value, dtype: float64

Our time series object is now ready for modeling!

If you are still wondering why we need a defined sampling frequency, it is because most
time series methods require the data to be collected at regular or fixed intervals. Irregular time
series frequencies are very challenging. We will not discuss such methods in this
course. The .resample() method therefore ensures all measurements exist at some
fixed, regular interval.
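As a quick sketch of why this matters (using a tiny made-up series, not the El Nino data), notice how .resample() creates a missing month as NaN when an interval has no observations:

toy = pd.Series( [1.0, 2.0, 4.0],
                 index=pd.to_datetime(['2000-01-15', '2000-02-20', '2000-04-10']) )

# January, February, and April get values; March is created and filled with NaN
toy.resample('MS').mean()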

Visualizations
Let's start out by simply visualizing the time series. Series objects in Pandas have a
default plot method which plots the value of the elements with respect to the index.
Since we have a DateTimeIndex, the value in ready_series will be plotted with
respect to the calendar date.
In [37]: ready_series.plot( figsize=(15, 6) )

plt.show()

We could have created the above figure with the DataFrame and un-resampled data
using Seaborn. There's nothing wrong with the figure below; after all, the data are the
same as those shown in the previous figure. The previous figure, though, was created
using the appropriately organized and resampled Pandas Series.
In [45]: sns.relplot(data = lf, x='date_dt', y='value', kind='line', aspect=2.25)

plt.show()

C:\Users\XPS15\Anaconda3\envs\cmpinf2120_2024\lib\site-packages\seaborn\axis
grid.py:118: UserWarning: The figure layout has changed to tight
self._figure.tight_layout(*args, **kwargs)

Let's look at the lag-plot to get an idea of the autocorrelation structure of the data. We
can easily make the lag plot with the Pandas pd.plotting.lag_plot() function. The
lag argument is how we specify the lag to consider. Below, we plot the value with
respect to its lagged value. The lag-plot reveals the linear relationship between two
sequential measurements. The lag-plot is nothing more than a scatter plot, but it helps
us understand relationships between observations in the time series. Please note the lag
plot is created using the appropriately organized time series data and not the
DataFrame.
In [46]: fig, ax = plt.subplots(figsize=(8, 8))

pd.plotting.lag_plot( ready_series, lag=1, ax=ax )

ax.plot( ax.get_xlim(), ax.get_ylim(), 'k--')

plt.show()
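
For reference, the lag plot above is essentially just a scatter plot of the series against a shifted copy of itself. A rough sketch of the same idea built by hand (this cell is an aside, not part of the original notebook):

fig, ax = plt.subplots(figsize=(8, 8))

# plot the current value against the value one observation earlier
ax.scatter( ready_series.shift(1), ready_series )
ax.set_xlabel('value at lag 1')
ax.set_ylabel('value')

plt.show()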

We could also consider 2 lags. In our current example this would correspond to 2
months prior. The lag argument in the pd.plotting.lag_plot() function is
set to lag=2 below. This plot examines the relationship between measurements
that are two months apart. For example, we are comparing measurements in June to
measurements in April.
In [47]: fig, ax = plt.subplots(figsize=(8, 8))

pd.plotting.lag_plot( ready_series, lag=2, ax=ax )

ax.plot( ax.get_xlim(), ax.get_ylim(), 'k--')

plt.show()

We can keep going and consider 3 lags.


In [48]: fig, ax = plt.subplots(figsize=(8, 8))

pd.plotting.lag_plot( ready_series, lag=3, ax=ax )

ax.plot( ax.get_xlim(), ax.get_ylim(), 'k--')

plt.show()

What if we considered a lag of 12? In our application, that would correspond to the
previous year! We would thus be comparing the temperature value in the same month,
one year prior! Therefore, we would be looking at the correlation between observations
from the same season!
In [49]: fig, ax = plt.subplots(figsize=(8, 8))

pd.plotting.lag_plot( ready_series, lag=12, ax=ax )

ax.plot( ax.get_xlim(), ax.get_ylim(), 'k--')

plt.show()

Let's create a series of lag plots and visualize them within a plot grid. The faceted lag
plot lets us examine the autocorrelation structure across numerous lags!
In [50]: lags_use = [1, 3, 6, 9, 12, 15, 18, 21, 24]

fig, ax = plt.subplots(3, 3, figsize=(12, 12), sharex=True, sharey=True)

ax = ax.ravel()

for k in range(len(lags_use)):
pd.plotting.lag_plot( ready_series, lag=lags_use[k], ax=ax[k] )
ax[k].plot( ax[k].get_xlim(), ax[k].get_ylim(), 'k--')
ax[k].set_title('lag: ' + str(lags_use[k]) )

plt.show()

Alternatively, we can examine the autocorrelation behavior with autocorrelation plots.


The autocorrelation plot contains the same information as examining numerous different
lag plots. However, the autocorrelation plot takes some getting used to. We can create
the autocorrelation plot in two different ways. The first approach shown below uses
statsmodels plotting functions.

In [51]: fig, ax = plt.subplots( figsize = (12, 8) )

sm.graphics.tsa.plot_acf( ready_series.values.squeeze(), lags=72, ax = ax)

plt.show()

Alternatively, we can use the Pandas pd.plotting.autocorrelation_plot() function.


In [52]: fig, ax = plt.subplots(figsize=(12, 8))

pd.plotting.autocorrelation_plot( ready_series, ax = ax )

plt.show()

I typically use the statsmodels option since I feel it is easier to control than the Pandas
function when visualizing the autocorrelation.
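
If you want the autocorrelation values themselves, rather than a figure, statsmodels can also return them numerically. A quick sketch (the choice of nlags=24 below is just for illustration):

acf_values = sm.tsa.acf( ready_series, nlags=24 )

# acf_values[k] is the autocorrelation at lag k; lag 0 is always exactly 1
print( acf_values[:13] )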

Time series decomposition


We have explored the time series data. We have learned that values are correlated with
values from the previous year. That suggests there could be a regularly occurring pattern
that repeats each year. Time series methods refer to such patterns as seasonal
patterns. A season corresponds to the sampling frequency of the data. Thus, one
season in our current application is 1 month. Let's apply techniques to explore the
seasonality and identify the commonly repeating pattern present in the data. We will do
so by decomposing the data!
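For reference, the classic additive model treats each observation as the sum of three pieces: observed = trend + seasonal + residual. The decomposition's job is to estimate each of those pieces from the data.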
Let's decompose our time series with the classic additive decomposition approach. The
statsmodels method documentation is below.

https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html
The function is sm.tsa.seasonal_decompose() . The first argument is a Pandas
Series with the DateTimeIndex properly identified.
In [53]: my_decomposition = sm.tsa.seasonal_decompose(ready_series, model='additive')

We can plot the decomposition with the built-in plotting method associated with the
decomposition object.
In [54]: plt.rcParams['figure.figsize'] = 18, 8

fig = my_decomposition.plot()

plt.show()

The decomposition allows us to easily calculate the seasonally adjusted trend by
subtracting the seasonal component from the observed values. Let's create a new
DataFrame to help with the following visualizations so we can use Seaborn to make the
figures.
In [55]: df_decomp = pd.DataFrame({'observed': my_decomposition.observed,
                                   'seasonal_adjusted': my_decomposition.observed - my_decomposition.seasonal},
                                  index=ready_series.index)

We can now use the wide-format plotting in Seaborn to easily compare the observed and
seasonally adjusted trends.
In [56]: sns.relplot( data = df_decomp, kind='line', aspect=2.5 )

plt.show()

C:\Users\XPS15\Anaconda3\envs\cmpinf2120_2024\lib\site-packages\seaborn\axis
grid.py:118: UserWarning: The figure layout has changed to tight
self._figure.tight_layout(*args, **kwargs)

Let's take a look at the variation across years within each month. The
sm.graphics.tsa.month_plot() function shows the average value per month in
red. The black line is the observed value within the month across the years. The month
plot shows the seasonal pattern in the data. There is a difference in the average
monthly values in the second half of the calendar year compared to the first half!
However, the plot below includes the trend and thus makes it difficult to visualize whether the
seasonal pattern changes over time!
In [49]: fig, ax = plt.subplots(figsize=(12, 8))

sm.graphics.tsa.month_plot( ready_series, ylabel='Temperature', ax=ax )

plt.show()

The classical additive decomposition method returns NaN values for the .trend
attribute at the beginning and end of the series. This is because of how the Moving
Average procedure works. The NaN values are shown at the head and tail of .trend
below.
In [57]: my_decomposition.trend

Out[57]: date_dt
1950-01-01 NaN
1950-02-01 NaN
1950-03-01 NaN
1950-04-01 NaN
1950-05-01 NaN
..
2010-08-01 NaN
2010-09-01 NaN
2010-10-01 NaN
2010-11-01 NaN
2010-12-01 NaN
Freq: MS, Name: trend, Length: 732, dtype: float64

The presentation slides for this week discuss why this happens in detail. The short
explanation is that the Moving Average cannot calculate the smooth trend at
the beginning and end of the series; it only produces a smooth trend "in the middle".
The first non NaN value occurs in July of the first year.
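If you want to see roughly where the trend values come from, the classical method for monthly data uses a centered "2 x 12" moving average. The sketch below (not part of the original notebook) approximately reproduces the .trend values using the standard weights for an even seasonal period of 12:

# half weight on the two end points, full weight on the 11 interior points
w = np.array([0.5] + [1.0]*11 + [0.5]) / 12.0

trend_sketch = ready_series.rolling(window=13, center=True).apply(
    lambda v: np.dot(w, v), raw=True )

trend_sketch[:12]   # the first 6 values are NaN, just like my_decomposition.trend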
In [58]: my_decomposition.trend[:12]

Out[58]: date_dt
1950-01-01 NaN
1950-02-01 NaN
1950-03-01 NaN
1950-04-01 NaN
1950-05-01 NaN
1950-06-01 NaN
1950-07-01 21.998333
1950-08-01 22.088333
1950-09-01 22.142917
1950-10-01 22.215417
1950-11-01 22.351667
1950-12-01 22.555000
Freq: MS, Name: trend, dtype: float64

The last non NaN value occurs in June of the last year.
In [59]: my_decomposition.trend[-12:]

Out[59]: date_dt
2010-01-01 23.658333
2010-02-01 23.470000
2010-03-01 23.272500
2010-04-01 23.101250
2010-05-01 22.957083
2010-06-01 22.845000
2010-07-01 NaN
2010-08-01 NaN
2010-09-01 NaN
2010-10-01 NaN
2010-11-01 NaN
2010-12-01 NaN
Freq: MS, Name: trend, dtype: float64

Why does this matter? We have already explored the seasonally adjusted behavior, but
there are other components we can examine. The detrended behavior subtracts the
"macro" trend from the observed values. This allows us to isolate the changes in each
month over time. The classical additive method cannot calculate the detrended values
for all months because of the NaN values.
However, we can still examine the seasonal behavior by using the .seasonal attribute.
This is the change within a month after removing the "macro" trend and the noisy
residual (error). The month plot for the seasonal component is shown below, but it is
rather boring. That is because the classical additive method assumes the seasonal
component does not change over the years. The value per month is the same for all
years! For our current example, that means March is always greater than February!
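We can quickly confirm that the classical seasonal component simply repeats every 12 observations (this check is an aside, not in the original notebook):

# compare the first year of seasonal values to the second year
np.allclose( my_decomposition.seasonal[:12].values,
             my_decomposition.seasonal[12:24].values )   # True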
In [60]: fig, ax = plt.subplots(figsize=(12, 8))

sm.graphics.tsa.month_plot( my_decomposition.seasonal, ylabel='Temperature', ax=ax )

plt.show()

If we would like to see if the seasonal behavior changes over time, we must use a more
complicated decomposition method. statsmodels includes the Seasonal and Trend
decomposition using LOESS (STL) method. So let's try that approach below.
In [61]: from statsmodels.tsa.seasonal import STL

The STL decomposition object is created and assigned to ready_stl below.


In [62]: ready_stl = STL( ready_series )

The STL method is fit by calling the .fit() method. It's a little difficult to see, but if
you look closely the Seasonal pattern is NOT constant over time! The amplitude is
"undulating", or varying up and down, as the years change.
In [63]: ready_stl_fit = ready_stl.fit()

The decomposition figure is shown below.


In [64]: fig = ready_stl_fit.plot()

The STL fitted object includes the .trend attribute, and it has no NaN values at the start or end!


In [65]: ready_stl_fit.trend

Out[65]: date_dt
1950-01-01 21.387036
1950-02-01 21.486001
1950-03-01 21.588166
1950-04-01 21.693680
1950-05-01 21.802330
...
2010-08-01 22.526164
2010-09-01 22.377637
2010-10-01 22.227452
2010-11-01 22.076353
2010-12-01 21.925026
Freq: MS, Name: trend, Length: 732, dtype: float64

We can therefore explore the detrended data and the seasonally adjusted data!
In [66]: df_stl = pd.DataFrame({'observed': ready_stl_fit.observed,
                                'seasonal_adjusted': ready_stl_fit.observed - ready_stl_fit.seasonal,
                                'detrend': ready_stl_fit.observed - ready_stl_fit.trend},
                               index=ready_series.index)

Plotting the observed data, seasonally adjusted data, and the detrended data on the
same chart looks wrong! That is because the detrended data has the "macro trend"
removed! The magnitude and scale can be different compared to the raw data. A
detrended value of 0 does not mean the raw data are zero. A detrended value of zero
means that month does not deviate from the "macro trend".
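As a sanity check on what these columns mean, the three STL components add back up to the observed values (an aside, not part of the original notebook):

recombined = ready_stl_fit.trend + ready_stl_fit.seasonal + ready_stl_fit.resid

np.allclose( recombined.values, ready_stl_fit.observed.values )   # True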
In [67]: sns.relplot( data = df_stl, kind='line', aspect=2.5 )

plt.show()

C:\Users\XPS15\Anaconda3\envs\cmpinf2120_2024\lib\site-packages\seaborn\axis
grid.py:118: UserWarning: The figure layout has changed to tight
self._figure.tight_layout(*args, **kwargs)

Thus, even when we detrend the data, we should instead focus on the observed and
seasonally adjusted comparison:
In [68]: sns.relplot( data = df_stl.loc[:, ['observed', 'seasonal_adjusted']], kind='line', aspect=2.5 )

plt.show()

C:\Users\XPS15\Anaconda3\envs\cmpinf2120_2024\lib\site-packages\seaborn\axis
grid.py:118: UserWarning: The figure layout has changed to tight
self._figure.tight_layout(*args, **kwargs)

The detrended data allow us to study whether the seasonal effects change over time. The
detrended data can be easily visualized as the variation around the monthly average
change!
In [69]: fig, ax = plt.subplots(figsize=(12, 8))

sm.graphics.tsa.month_plot( df_stl.detrend, ylabel='Temperature', ax=ax )

plt.show()

The detrended data can be further smoothed by removing the residual!


In [70]: df_stl['detrend_smooth'] = ready_stl_fit.observed - (ready_stl_fit.trend + ready_stl_fit.resid)

In [71]: fig, ax = plt.subplots(figsize=(12, 8))

sm.graphics.tsa.month_plot( df_stl.detrend_smooth, ylabel='Temperature', ax=ax )

plt.show()

But, the above figure is equivalent to visualizing the .seasonal component with a
month plot.
In [72]: fig, ax = plt.subplots(figsize=(12, 8))

sm.graphics.tsa.month_plot( ready_stl_fit.seasonal, ylabel='Temperature', ax=ax )

plt.show()

Conclusion
This report introduced organizing data for time series analysis, exploring the time series,
and visually finding seasonal patterns with decompositions.
In [ ]:
