Unit 5 Time Series Data Analysis

data visualization (Anna University)

AD3301
DATA EXPLORATION AND VISUALIZATION

Unit 5

TIME SERIES ANALYSIS


Fundamentals of TSA – Characteristics of time
series data – Data Cleaning – Time-based
indexing – Visualizing – Grouping –
Resampling.


Time series data


Time series data consists of timestamped observations, such as those
generated while monitoring an industrial process or tracking a
business metric.

An ordered sequence of values recorded at equally spaced time
intervals is referred to as a time series.

Time series analysis is used in many applications, such as sales
forecasting, utility studies, budget analysis, economic forecasting,
and inventory studies.

There are many methods that can be used to model and forecast time
series.


Fundamentals of TSA
1. We can generate the dataset using the numpy library:

import os
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns

# Draw 50 samples from a normal distribution with mean 0 and standard deviation 1
zero_mean_series = np.random.normal(loc=0.0, scale=1., size=50)
print(zero_mean_series)


The output of the preceding code is given here (the exact values differ
on every run because no random seed is set):

[ 0.91315139 0.51955858 -1.03172053 -0.725203 1.88933611 -0.39631515
0.71957305 0.01773119 -1.88369523 0.62272576 -1.22417583 -0.3920638
0.45239854 0.15720562 0.11885262 -0.96940705 -1.20997492 0.93202519
-0.37246211 1.11134324 0.15633954 -0.5439416 0.16875613 0.2826228
0.58295158 0.3245175 0.42985676 0.97500729 0.24721019 -0.45684401
-0.58347696 -0.68752098 0.82822652 -0.72181389 0.39490961 -1.792727
-0.6237392 -0.24644562 -0.22952135 3.06311553 -3.05745406 1.37894995
-0.39553 -0.26359025 -0.21658428 0.63820235 -1.7740917 0.66671788
-0.89029947 0.39759542]


2. Next, we use the seaborn library to plot the time series data:
plt.figure(figsize=(16, 8))
g = sns.lineplot(data=zero_mean_series)
g.set_title('Zero mean model')
g.set_xlabel('Time index')
plt.show()


We plotted the time series using seaborn.lineplot(), a function
provided by the seaborn library. The output of the preceding code is
given here:

[Figure: line plot titled 'Zero mean model' with 'Time index' on the x-axis]


3. We can compute the cumulative sum of the series and then plot the
result as a time series. The plot gives more interesting results:
random_walk = np.cumsum(zero_mean_series)
print(random_walk)

It generates an array of the cumulative sum, as shown here:

[ 0.91315139 1.43270997 0.40098944 -0.32421356 1.56512255 1.1688074
1.88838045 1.90611164 0.0224164 0.64514216 -0.57903366 -0.97109746
-0.51869892 -0.36149331 -0.24264069 -1.21204774 -2.42202265 -1.48999747
-1.86245958 -0.75111634 -0.5947768 -1.1387184 -0.96996227 -0.68733947
-0.10438789 0.22012962 0.64998637 1.62499367 1.87220386 1.41535986
0.8318829 0.14436192 0.97258843 0.25077455 0.64568416 -1.14704284
-1.77078204 -2.01722767 -2.24674902 0.81636651 -2.24108755 -0.86213759
-1.25766759 -1.52125784 -1.73784212 -1.09963977 -2.87373147 -2.20701359
-3.09731306 -2.69971764]

Note that each value in this array is the sum of all values of the original
series up to and including that position. For example, the second value is
0.91315139 + 0.51955858 = 1.43270997.
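
The relationship between the zero-mean series and its cumulative sum can be checked directly. The following is a minimal sketch, assuming a fixed seed (42, chosen here only for reproducibility and not part of the original example) and NumPy's default_rng generator in place of np.random.normal:

```python
import numpy as np

# Assumption: a fixed seed so that repeated runs give the same numbers.
rng = np.random.default_rng(42)
zero_mean_series = rng.normal(loc=0.0, scale=1.0, size=50)

# The random walk is the running total of the zero-mean series.
random_walk = np.cumsum(zero_mean_series)

# Element k of the walk equals the sum of the first k + 1 samples.
k = 10
manual_sum = zero_mean_series[: k + 1].sum()
print(np.isclose(random_walk[k], manual_sum))

# Differencing the walk recovers the original series (from the second element on).
recovered = np.diff(random_walk)
print(np.allclose(recovered, zero_mean_series[1:]))
```

Both checks should print True, which is simply another way of saying that the steps of a random walk are the underlying random samples.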


4. Now, if we plot the cumulative sum as a time series, as shown here,
we get a graph that shows the change in values over time:

plt.figure(figsize=(16, 8))
g = sns.lineplot(data=random_walk)
g.set_title('Random Walk')
g.set_xlabel('Time index')
plt.show()


The output of the preceding code is given here:

[Figure: line plot titled 'Random Walk' with 'Time index' on the x-axis]

The graph shows how the cumulative values change over time.


Univariate time series


• When we capture a sequence of observations
for the same variable over a particular
duration of time, the series is referred to as
univariate time series.
• In general, in a univariate time series, the
observations are taken over regular time
periods.
• Example: the change in temperature over the
course of a day.
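
As an illustration of the definition above, a univariate time series can be represented as a pandas Series with a regular datetime index. This is only a sketch: the date, the variable name temperature_c, and the temperature values are all hypothetical and generated synthetically.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 24 hourly temperature readings for one day (synthetic values).
hours = pd.date_range("2024-01-01", periods=24, freq=pd.Timedelta(hours=1))
temps = 10 + 5 * np.sin(2 * np.pi * (np.arange(24) - 9) / 24)
temperature = pd.Series(temps, index=hours, name="temperature_c")

# One variable observed at regular intervals: a univariate time series.
step = temperature.index[1] - temperature.index[0]
print(temperature.head())
print("Sampling interval:", step)
```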


Characteristics of time series data


• Trend: When looking at time series data, it is essential to check for
a trend. A trend means that the average measurement values tend either
to decrease or to increase over time.
• Outliers: Time series data may contain a notable number of outliers.
These outliers can be spotted when the data is plotted on a graph.
• Seasonality: Some time series data tends to repeat in a pattern over a
certain interval. We refer to such repeating patterns as seasonality.
• Abrupt changes: Sometimes there is a sudden, uneven change in time
series data. We refer to such changes as abrupt changes. Observing
abrupt changes is important because they can reveal underlying
phenomena.
• Constant variance over time: It is essential to check whether or not
the time series data exhibits constant variance over time.
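
Some of these characteristics can be inspected numerically as well as visually. The sketch below builds a hypothetical series (synthetic upward trend, weekly seasonality, and noise; none of it comes from the course material) and uses rolling statistics as one rough way to look at trend and variance. The 28-day window is an assumption, chosen as a multiple of the 7-day seasonal period.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
days = pd.date_range("2020-01-01", periods=365, freq="D")
t = np.arange(365)

# Hypothetical series: upward trend + weekly seasonality + random noise.
trend = 0.05 * t
seasonal = 2.0 * np.sin(2 * np.pi * t / 7)
noise = rng.normal(scale=0.5, size=365)
series = pd.Series(trend + seasonal + noise, index=days)

# A rolling mean over whole seasonal periods averages out the
# seasonality and exposes the trend.
rolling_mean = series.rolling(window=28).mean()

# A rolling standard deviation is a rough check of whether the
# variance stays roughly constant over time.
rolling_std = series.rolling(window=28).std()

print(rolling_mean.dropna().iloc[[0, -1]])
print(rolling_std.dropna().describe())
```

If the rolling mean drifts steadily in one direction, that suggests a trend; if the rolling standard deviation stays in a narrow band, that is consistent with constant variance.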


Time Series Analysis (TSA) with Open Power System Data

• We can use the Open Power System dataset to discover how electricity
consumption and production vary over time in Germany.
• Importing the dataset:
import pandas as pd

# load time series dataset
df_power = pd.read_csv("https://raw.githubusercontent.com/jenfly/opsd/master/opsd_germany_daily.csv")
print(df_power.columns)


The output of the preceding code is given here:

Index(['Date', 'Consumption', 'Wind', 'Solar', 'Wind+Solar'],
dtype='object')
The columns of the dataframe are described here:
• Date: The date is in the format yyyy-mm-dd.
• Consumption: This indicates electricity consumption in GWh.
• Wind: This indicates wind power production in GWh.
• Solar: This indicates solar power production in GWh.
• Wind+Solar: This represents the sum of solar and wind power
production in GWh.


Data cleaning
1. We can start by checking the shape of the dataset:
df_power.shape
The output of the preceding code is given here:
(4383, 5)
The dataframe contains 4,383 rows and 5 columns.

2. We can also check a few entries inside the dataframe. Let's examine
the last 10 entries:
print(df_power.tail(10))


The output of the preceding code is given here:

[Screenshot: the last 10 rows of df_power]


3. Next, let's review the data types of each column in our df_power
dataframe:
print(df_power.dtypes)
The output of the preceding code is given here:
Date            object
Consumption    float64
Wind           float64
Solar          float64
Wind+Solar     float64
dtype: object


4. Note that the Date column has a data type of object, which means the
dates are stored as plain strings. This is not suitable for time-based
operations, so the next step is to convert the Date column, as shown
here:
# convert object to datetime format
df_power['Date'] = pd.to_datetime(df_power['Date'])
5. This converts the Date column to datetime format. We can verify this
again:
print(df_power.dtypes)
The output of the preceding code is given here:
Date           datetime64[ns]
Consumption           float64
Wind                  float64
Solar                 float64
Wind+Solar            float64
dtype: object
Note that the Date column has been changed into the correct data type.

6. Next, let's change the index of our dataframe to the Date column:
df_power = df_power.set_index('Date')
df_power.tail(3)
The output of the preceding code is given here:

[Screenshot: the last 3 rows of df_power, indexed by Date]

Note from the preceding screenshot that the Date column has been set as
the index of the dataframe, which is now a DatetimeIndex.


7. We can verify this by using the code snippet given here:
print(df_power.index)
The output of the preceding code is given here:
DatetimeIndex(['2006-01-01', '2006-01-02', '2006-01-03', '2006-01-04',
               '2006-01-05', '2006-01-06', '2006-01-07', '2006-01-08',
               '2006-01-09', '2006-01-10',
               ...
               '2017-12-22', '2017-12-23', '2017-12-24', '2017-12-25',
               '2017-12-26', '2017-12-27', '2017-12-28', '2017-12-29',
               '2017-12-30', '2017-12-31'],
              dtype='datetime64[ns]', name='Date', length=4383, freq=None)
8. Since our index is now a DatetimeIndex object, we can use it to
analyze the dataframe. To make the analysis easier, let's add three more
columns to the dataframe: Year, Month, and Weekday Name:
# Add columns with year, month, and weekday name
df_power['Year'] = df_power.index.year
df_power['Month'] = df_power.index.month
df_power['Weekday Name'] = df_power.index.day_name()


9. Let's display five random rows from the dataframe:
# Display a random sampling of 5 rows
print(df_power.sample(5, random_state=0))
The output of this code is given here:

[Screenshot: five randomly sampled rows of df_power]

Note that we added three more columns: Year, Month, and Weekday Name.
Adding these columns makes the analysis of the data easier.
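
The cleaning steps above can also be condensed, because pandas can parse and index the date column while reading the file. The following is a sketch under stated assumptions: it uses a small in-memory CSV with the same column layout as the dataset, and the three rows of values are made up for illustration rather than taken from the real file.

```python
import io
import pandas as pd

# Hypothetical three-row extract in the same layout as the dataset (made-up values).
csv_text = """Date,Consumption,Wind,Solar,Wind+Solar
2017-12-29,1295.1,584.3,29.9,614.2
2017-12-30,1215.4,721.2,7.5,728.7
2017-12-31,1107.1,721.2,19.9,741.1
"""

# parse_dates converts the Date column to datetime and index_col makes it
# the index, both while the file is being read.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"], index_col="Date")

# Derive calendar columns from the DatetimeIndex, as in step 8.
df["Year"] = df.index.year
df["Month"] = df.index.month
df["Weekday Name"] = df.index.day_name()

print(df.dtypes)
print(df)
```

Whether to convert at read time or afterwards with pd.to_datetime is largely a matter of preference; the step-by-step version makes each transformation easier to inspect.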


Time-based indexing
Time-based indexing is a very powerful feature of the pandas library.
With a time-based index, we can use a formatted date string to select
data.
See the following code, for example:
print(df_power.loc['2015-10-02'])
The output of the preceding code is given here:
Consumption     1391.05
Wind             81.229
Solar           160.641
Wind+Solar       241.87
Year               2015
Month                10
Weekday Name     Friday
Name: 2015-10-02 00:00:00, dtype: object
Note that we used the pandas dataframe loc accessor. In the preceding
example, we used a date given as a string to select a row. We can use
the same techniques to access rows as we can with a normal dataframe
index.
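
The different forms of string-based selection can be sketched as follows. The dataframe here is hypothetical (a synthetic daily series covering 2015 and 2016), so the values carry no meaning; only the selection behavior is of interest.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series covering 2015 and 2016 (synthetic values).
idx = pd.date_range("2015-01-01", "2016-12-31", freq="D")
df = pd.DataFrame({"Consumption": np.arange(len(idx), dtype=float)}, index=idx)

# A full date string selects a single row.
one_day = df.loc["2015-10-02"]

# A partial string selects every row in that month or year.
one_month = df.loc["2016-12"]
one_year = df.loc["2016"]

# A slice of date strings includes both endpoints.
date_range = df.loc["2016-12-23":"2016-12-30"]

print(len(one_month), len(one_year), len(date_range))
```

Note that, unlike ordinary Python slicing, a label-based slice with loc includes the end date.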


Visualizing time series


Let's visualize the time series dataset. We will continue using the
same df_power dataframe:
1. The first step is to import the seaborn and matplotlib libraries and
set the default figure options:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(rc={'figure.figsize':(11, 4)})
# The next line overrides the figure size set through sns.set() above
plt.rcParams['figure.figsize'] = (8,5)
plt.rcParams['figure.dpi'] = 150

2. Next, let's generate a line plot of the full time series of Germany's
daily electricity consumption:
df_power['Consumption'].plot(linewidth=0.5)


The output of the preceding code is given here:

[Figure: line plot of daily electricity consumption over the full date range]

As depicted in the preceding screenshot, the y-axis shows the
electricity consumption and the x-axis shows the year. However, there
are too many data points to make out the details of any individual year.


3. Let's plot the Consumption, Solar, and Wind columns as dots instead
of lines:
cols_to_plot = ['Consumption', 'Solar', 'Wind']
axes = df_power[cols_to_plot].plot(marker='.', alpha=0.5, linestyle='None',
                                   figsize=(14, 6), subplots=True)
for ax in axes:
    ax.set_ylabel('Daily Totals (GWh)')
The output of the preceding code is given here:

[Figure: three dot plots of daily totals for Consumption, Solar, and Wind]

The output shows that electricity consumption falls into two distinct
clusters:
• One cluster roughly at 1,400 GWh and above
• Another cluster roughly below 1,400 GWh
Moreover, solar production is higher in summer and lower in winter. Over the years,
there seems to have been a strong increasing trend in the output of wind power.


4. We can further investigate a single year to have a closer look.
Check the code given here:
ax = df_power.loc['2016', 'Consumption'].plot()
ax.set_ylabel('Daily Consumption (GWh)');
The output of the preceding code is given here:

[Figure: line plot of daily consumption for 2016]

From the preceding screenshot, we can clearly see the consumption of
electricity for 2016. The graph shows a drastic decrease in consumption
at the end of the year (December) and during August.


Let's examine the month of December 2016 with the following code block:
ax = df_power.loc['2016-12', 'Consumption'].plot(marker='o', linestyle='-')
ax.set_ylabel('Daily Consumption (GWh)');
The output of the preceding code is given here:

[Figure: daily consumption for December 2016, one marker per day]

As shown in the preceding graph, electricity consumption is higher on
weekdays and lower at the weekends. We can see the consumption for each
day of the month. We can zoom in further to see how consumption plays
out in the last week of December.


To select a particular week of December, we can supply a specific date
range, as shown here:
ax = df_power.loc['2016-12-23':'2016-12-30', 'Consumption'].plot(marker='o', linestyle='-')
ax.set_ylabel('Daily Consumption (GWh)');

As illustrated in the preceding code, we want to see the electricity
consumption between 2016-12-23 and 2016-12-30, both dates included. The
output of the preceding code is given here:

[Figure: daily consumption from 2016-12-23 to 2016-12-30]

As illustrated in the preceding screenshot, electricity consumption was
lowest on Christmas Day, probably because people were busy celebrating.
After Christmas, the consumption increased.
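
A reading taken from a plot, such as "lowest on Christmas Day", can be confirmed numerically with idxmin(). The sketch below uses a hypothetical December 2016 series (synthetic values with an artificial dip placed on 25 December), not the real dataset, and selects the non-interactive Agg backend as an assumption so that it also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical daily consumption for December 2016 (synthetic values):
# weekdays high, weekends lower, and an artificial dip on 25 December.
idx = pd.date_range("2016-12-01", "2016-12-31", freq="D")
values = np.where(idx.dayofweek < 5, 1500.0, 1200.0)
consumption = pd.Series(values, index=idx, name="Consumption")
consumption.loc["2016-12-25"] = 1000.0

# Select the week with a date-string slice and plot it with markers.
week = consumption.loc["2016-12-23":"2016-12-30"]
ax = week.plot(marker="o", linestyle="-")
ax.set_ylabel("Daily Consumption (GWh)")

# idxmin() returns the timestamp of the lowest value in the slice.
lowest_day = week.idxmin()
print(lowest_day.date(), week.min())
plt.close("all")
```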


Grouping time series data


1. We can first group the data by month and then use box plots to
visualize the data:
fig, axes = plt.subplots(3, 1, figsize=(8, 7), sharex=True)
for name, ax in zip(['Consumption', 'Solar', 'Wind'], axes):
    sns.boxplot(data=df_power, x='Month', y=name, ax=ax)
    ax.set_ylabel('GWh')
    ax.set_title(name)
    # Keep the x-axis label only on the bottom subplot
    if ax != axes[-1]:
        ax.set_xlabel('')
The output of the preceding code is given here:

[Figure: monthly box plots for Consumption, Solar, and Wind]


2. Next, we can group the consumption of electricity by the day of the
week and present it in a box plot:
sns.boxplot(data=df_power, x='Weekday Name', y='Consumption');

The output of the preceding code is given here:

[Figure: box plot of consumption for each day of the week]

The preceding screenshot shows that electricity consumption is higher on
weekdays than on weekends. Interestingly, there are more outliers on
weekdays.
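
The box plots group the data visually; the same grouping can be expressed numerically with groupby(). This is a sketch on a hypothetical dataframe (synthetic values in which weekdays are set higher than weekends by construction), so the numbers only demonstrate the mechanics.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data for 2016: weekdays 1500, weekends 1200 (synthetic values).
idx = pd.date_range("2016-01-01", "2016-12-31", freq="D")
values = np.where(idx.dayofweek < 5, 1500.0, 1200.0)
df = pd.DataFrame({"Consumption": values}, index=idx)
df["Month"] = df.index.month
df["Weekday Name"] = df.index.day_name()

# Per-month summary statistics, comparable to what each monthly box shows.
monthly = df.groupby("Month")["Consumption"].agg(["median", "min", "max"])

# Median per weekday, reordered from alphabetical to calendar order.
order = ["Monday", "Tuesday", "Wednesday", "Thursday",
         "Friday", "Saturday", "Sunday"]
by_weekday = df.groupby("Weekday Name")["Consumption"].median().reindex(order)

print(monthly.head(3))
print(by_weekday)
```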


Resampling time series data


It is often necessary to resample the dataset at a lower or higher frequency.
Resampling is based on aggregation or grouping operations. For example, we can
resample the daily data to a weekly mean time series as follows:
1. We can use the code given here to resample our data:
columns = ['Consumption', 'Wind', 'Solar', 'Wind+Solar']
power_weekly_mean = df_power[columns].resample('W').mean()
power_weekly_mean
The output of the preceding code is given here:

[Screenshot: the weekly mean dataframe, with the first row labeled 2006-01-01]

As shown in the preceding screenshot, each row is labeled with the date that
closes its weekly bin and holds the mean of the data in that bin; the first row
is labeled 2006-01-01. We can plot the daily and weekly time series together to
compare them over a six-month period.


2. Let's look at the first six months of 2016. Let's start by
initializing the variables:

start, end = '2016-01', '2016-06'

3. Next, let's plot the graph using the code given here:
fig, ax = plt.subplots()
ax.plot(df_power.loc[start:end, 'Solar'],
        marker='.', linestyle='-', linewidth=0.5, label='Daily')
ax.plot(power_weekly_mean.loc[start:end, 'Solar'],
        marker='o', markersize=8, linestyle='-', label='Weekly Mean Resample')
ax.set_ylabel('Solar Production (GWh)')
ax.legend();


The output of the preceding code is given here:

[Figure: daily and weekly mean solar production, January to June 2016]

The preceding screenshot shows that the weekly mean time series is
increasing over time and is much smoother than the daily time series.
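
How resample('W') forms its bins can be checked on a small example. The sketch below assumes a hypothetical four-week daily series (synthetic values drawn with a fixed seed) that starts on 2006-01-01, a Sunday, and relies on the pandas default for 'W': weeks end on Sunday and each bin is labeled with its last day.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: 28 days starting on Sunday 2006-01-01 (synthetic values).
rng = np.random.default_rng(1)
idx = pd.date_range("2006-01-01", periods=28, freq="D")
daily = pd.Series(rng.normal(loc=100.0, scale=10.0, size=28), index=idx, name="Solar")

# Downsample to weekly means.
weekly_mean = daily.resample("W").mean()

# The first bin is labeled 2006-01-01 and contains only that single day.
print(weekly_mean.index[0].date(), weekly_mean.index[0].day_name())
print(np.isclose(weekly_mean.iloc[0], daily.iloc[0]))

# The second bin, labeled 2006-01-08, averages Monday 2006-01-02 to Sunday 2006-01-08.
second_week = daily.loc["2006-01-02":"2006-01-08"].mean()
print(np.isclose(weekly_mean.iloc[1], second_week))
```

Because each weekly value averages up to seven daily values, day-to-day fluctuations largely cancel out, which is why a weekly mean curve looks smoother than the daily one.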
