Unit 5 Time Series Data Analysis
AD3301
DATA EXPLORATION AND VISUALIZATION
Unit 5
Fundamentals of TSA
1. We can generate the dataset using the
numpy library:
Note that each value is the previous value plus a new random step, so every value is the cumulative sum of all the steps up to that point.
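The generating code is not shown above. A minimal sketch of how such a series might be built, assuming normally distributed steps accumulated with np.cumsum (the seed, step distribution, and length are illustrative assumptions, not taken from the original):

```python
import numpy as np

# Assumption: 100 zero-mean, unit-variance random steps; the seed is
# fixed only so that the example is reproducible.
rng = np.random.default_rng(seed=42)
steps = rng.normal(loc=0.0, scale=1.0, size=100)

# Each value of the walk is the running (cumulative) sum of the steps so far.
random_walk = np.cumsum(steps)
```

The resulting random_walk array can be passed to the plotting code that follows.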
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(16, 8))
g = sns.lineplot(data=random_walk)
g.set_title('Random Walk')
g.set_xlabel('Time index')
plt.show()
Data cleaning
1. We can start by checking the shape of the dataset:
df_power.shape
The output of the preceding code is given here:
(4383, 5)
The dataframe contains 4,383 rows and 5 columns.
4. Note that the Date column has a data type of object, which is not
appropriate for dates. The next step is to convert the Date column
to a datetime type, as shown here:
# Convert the Date column from object to datetime format
df_power['Date'] = pd.to_datetime(df_power['Date'])
5. This converts the Date column to datetime format. We can verify
this by checking the data types again:
print(df_power.dtypes)
The output of the preceding code is given here:
Date           datetime64[ns]
Consumption           float64
Wind                  float64
Solar                 float64
Wind+Solar            float64
dtype: object
Note that the Date column has been changed into the correct data
type.
Note from the preceding screenshot that the Date column has
been set as the DatetimeIndex of the dataframe.
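The step that sets the index is not shown above. A minimal sketch of one way to do it, assuming set_index is used; df_demo is a hypothetical stand-in for df_power, and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical stand-in for df_power with a few made-up rows.
df_demo = pd.DataFrame({
    'Date': ['2006-01-01', '2006-01-02', '2006-01-03'],
    'Consumption': [1000.0, 1100.0, 1200.0],
})
df_demo['Date'] = pd.to_datetime(df_demo['Date'])

# Make the Date column the index; a datetime column becomes a DatetimeIndex.
df_demo = df_demo.set_index('Date')
print(type(df_demo.index).__name__)  # DatetimeIndex
```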
7. We can simply verify this by using the code snippet given here:
print(df_power.index)
The output of the preceding code is given here:
DatetimeIndex(['2006-01-01', '2006-01-02', '2006-01-03',
'2006-01-04', '2006-01-05', '2006-01-06', '2006-01-07',
'2006-01-08', '2006-01-09', '2006-01-10', ... '2017-12-22',
'2017-12-23', '2017-12-24', '2017-12-25', '2017-12-26',
'2017-12-27', '2017-12-28', '2017-12-29', '2017-12-30',
'2017-12-31'],dtype='datetime64[ns]', name='Date', length=4383,
freq=None)
8. Since our index is now a DatetimeIndex object, we can use it to
analyze the dataframe. Let's add Year, Month, and Weekday Name
columns to make the analysis easier:
# Add columns with year, month, and weekday name
df_power['Year'] = df_power.index.year
df_power['Month'] = df_power.index.month
df_power['Weekday Name'] = df_power.index.day_name()
Time-based indexing
Time-based indexing is a very powerful feature of the pandas
library. With a DatetimeIndex, we can select data by passing a
date-formatted string.
See the following code, for example:
print(df_power.loc['2015-10-02'])
The output of the preceding code is given here:
Consumption     1391.05
Wind             81.229
Solar           160.641
Wind+Solar       241.87
Year               2015
Month                10
Weekday Name     Friday
Name: 2015-10-02 00:00:00, dtype: object
Note that we used the pandas dataframe loc accessor. In the preceding
example, we used a date as a string to select a row. We can use all sorts of
techniques to access rows just as we can do with a normal dataframe
index.
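As an illustration of those other techniques, the sketch below selects a range of dates and a whole month using date strings. It uses a hypothetical df_demo with synthetic values rather than the real df_power, so only the selection pattern carries over:

```python
import numpy as np
import pandas as pd

# Hypothetical daily dataframe standing in for df_power (synthetic values).
idx = pd.date_range('2015-01-01', '2015-12-31', freq='D')
df_demo = pd.DataFrame({'Consumption': np.arange(len(idx), dtype=float)},
                       index=idx)

# Slice a range of dates; with .loc, both endpoints are included.
week = df_demo.loc['2015-10-02':'2015-10-08']

# Partial-string indexing: a year-month string selects the whole month.
october = df_demo.loc['2015-10']

print(len(week), len(october))  # 7 31
```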
2. Next, let's generate a line plot of the full time series of Germany's
daily electricity consumption:
df_power['Consumption'].plot(linewidth=0.5)
3. Let's plot the Consumption, Solar, and Wind columns as dots, each in its own subplot:
cols_to_plot = ['Consumption', 'Solar', 'Wind']
axes = df_power[cols_to_plot].plot(marker='.', alpha=0.5,
    linestyle='None', figsize=(14, 6), subplots=True)
# Label the y axis of each subplot
for ax in axes:
    ax.set_ylabel('Daily Totals (GWh)')
The output of the preceding code is given here:
The output shows that electricity consumption can be broken down into two
distinct patterns:
One cluster roughly from 1,400 GWh and above
Another cluster roughly below 1,400 GWh
Moreover, solar production is higher in summer and lower in winter. Over the years,
there seems to have been a strong increasing trend in the output of wind power.
As shown in the preceding screenshot, the first row, labeled 2006-01-01, contains the
average of all the data in that weekly time bin. We can plot the daily and weekly time
series together to compare them over a six-month period.
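The resampling step that produces the weekly means is not shown above. A minimal sketch, assuming the weekly series is computed with resample('W').mean(); df_demo and weekly_mean_demo are hypothetical names, and the data is synthetic:

```python
import numpy as np
import pandas as pd

# Hypothetical daily data standing in for df_power (synthetic values).
idx = pd.date_range('2016-07-01', '2016-12-31', freq='D')
df_demo = pd.DataFrame({'Solar': np.arange(len(idx), dtype=float)}, index=idx)

# Assumption: weekly means via resample('W'), which groups the rows into
# weekly bins labeled by the Sunday that ends each week.
weekly_mean_demo = df_demo[['Solar']].resample('W').mean()

# The weekly series has one row per bin, so far fewer rows than the daily data.
print(df_demo.shape[0], weekly_mean_demo.shape[0])
print(weekly_mean_demo.head(3))
```

Each row of the resampled dataframe holds the mean of the daily values that fall in its bin, which is why the weekly line looks smoother than the daily one.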
2. Let's look at the last six months of 2016. We start by initializing
the start and end variables:
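The initialization itself is not shown in the original. One plausible form, assuming partial date strings for July through December 2016 (the values are an assumption based on the description above; demo is a hypothetical series with synthetic values):

```python
import pandas as pd

# Assumption: year-month strings bounding the last six months of 2016.
start, end = '2016-07', '2016-12'

# With a DatetimeIndex, .loc[start:end] covers every day from the first
# day of the start month through the last day of the end month.
idx = pd.date_range('2016-01-01', '2016-12-31', freq='D')
demo = pd.Series(range(len(idx)), index=idx)
selected = demo.loc[start:end]
print(selected.index.min().date(), selected.index.max().date())
# 2016-07-01 2016-12-31
```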
3. Next, let's plot the graph using the code given here:
fig, ax = plt.subplots()
# Daily values as a thin line with small dots
ax.plot(df_power.loc[start:end, 'Solar'],
    marker='.', linestyle='-', linewidth=0.5, label='Daily')
# Weekly means as larger markers
ax.plot(power_weekly_mean.loc[start:end, 'Solar'],
    marker='o', markersize=8, linestyle='-', label='Weekly Mean Resample')
ax.set_ylabel('Solar Production (GWh)')
ax.legend()