Visualizing Time Series Data in Python
1.1 Introduction
1.1.1 Load your time series data
[ ]: # Load the discoveries dataset: great inventions and scientific
# discoveries per year (the import and read_csv lines are reconstructed
# to match the load cells in the later chapters)
import pandas as pd
discoveries = pd.read_csv('https://raw.githubusercontent.com/ozlerhakan/datacamp/
↪master/Visualizing%20Time%20Series%20Data%20in%20Python/ch1_discoveries.csv')
discoveries.head()
date Y
0 01-01-1860 5
1 01-01-1861 3
2 01-01-1862 0
3 01-01-1863 2
4 01-01-1864 0
[ ]: discoveries.dtypes
[ ]: date object
Y int64
dtype: object
[ ]: # Convert the date column from object to datetime64
discoveries['date'] = pd.to_datetime(discoveries['date'])
discoveries.dtypes
[ ]: date datetime64[ns]
Y int64
dtype: object
[ ]: # Enable inline plotting; the pyplot import is added here since plt is
# used throughout the notebook
%matplotlib inline
import matplotlib.pyplot as plt
1.1.2 Specify plot styles
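The cell for this section did not survive export; presumably it set a matplotlib style, along these lines (fivethirtyeight is the style used later in the notebook):
[ ]: # Use the fivethirtyeight style for all subsequent plots
plt.style.use('fivethirtyeight')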
1.1.3 Display and label plots
[ ]: # Plot a line chart of the discoveries DataFrame using the specified arguments
ax = discoveries.plot(color='blue', figsize=(8, 3), linewidth=2, fontsize=6)
# Label the x-axis and display the plot
ax.set_xlabel('Date')
plt.show()
1.1.4 Subset time series data
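Date-based slicing such as discoveries['1945':'1950'] requires a DatetimeIndex, so a cell along these lines (lost in export, but implied by the date-indexed head() output below) presumably ran first:
[ ]: # Use the datetime64 date column as the index
discoveries = discoveries.set_index('date')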
[ ]: # Select the subset of data between 1945 and 1950
discoveries_subset_1 = discoveries['1945':'1950']
[ ]: # Select the subset of data between 1939 and 1958
discoveries_subset_2 = discoveries['1939':'1958']
[ ]: discoveries.head()
[ ]: Y
date
1860-01-01 5
1861-01-01 3
1862-01-01 0
1863-01-01 2
1864-01-01 0
1.1.5 Add vertical and horizontal markers
[ ]: # Plot the discoveries time series
ax = discoveries.plot(color='blue', fontsize=6)
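The marker lines themselves did not survive export; the usual pattern looks like this (the specific date and y-value here are illustrative assumptions):
[ ]: # Add a red vertical line at a date of interest
ax.axvline('1939-01-01', color='red', linestyle='--')
# Add a green horizontal line at a given y-value
ax.axhline(4, color='green', linestyle='--')
plt.show()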
[ ]: # Side exercise on a custom dataset; the input cells were lost in export
# and are reconstructed here from the printed outputs. SoluongSV
# ("số lượng sinh viên") is Vietnamese for "number of students"; the
# identifier is kept as-is.
df = pd.DataFrame({'Year': [1945, 1946, 1947, 1948, 1949, 1950],
                   'SoluongSV': [1000, 2000, 1500, 1700, 2500, 2700]})
df
[ ]: Year SoluongSV
0 1945 1000
1 1946 2000
2 1947 1500
3 1948 1700
4 1949 2500
5 1950 2700
[ ]: # Set Year as the index and preview the result
df = df.set_index('Year')
df.head()
[ ]: SoluongSV
Year
1945 1000
1946 2000
1947 1500
1948 1700
1949 2500
[ ]: df.dtypes
[ ]: SoluongSV int64
dtype: object
[ ]: # Plot the series
df.plot()
[ ]: <Axes: xlabel='Year'>
1.2 Summary Statistics and Diagnostics
1.2.1 Find missing values
[ ]: co2_levels = pd.read_csv('https://raw.githubusercontent.com/ozlerhakan/datacamp/
↪master/Visualizing%20Time%20Series%20Data%20in%20Python/ch2_co2_levels.csv')
co2_levels.head(n=8)
datestamp co2
0 1958-03-29 316.1
1 1958-04-05 317.3
2 1958-04-12 317.6
3 1958-04-19 317.5
4 1958-04-26 316.4
5 1958-05-03 316.9
6 1958-05-10 NaN
7 1958-05-17 317.5
[ ]: # Set datestamp column as index
co2_levels = co2_levels.set_index('datestamp')
# Print out the number of missing values
print(co2_levels.isnull().sum())
co2 59
dtype: int64
1.2.2 Handle missing values
[ ]: # Impute missing values by backfilling; fillna(method='bfill') is
# deprecated, so the current .bfill() idiom is used here
co2_levels = co2_levels.bfill()
print(co2_levels.isnull().sum())
co2 0
dtype: int64
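1.2.3 Display rolling averages
The cell for this section did not survive export; a minimal sketch of the usual rolling-average plot (the 52-week window is an assumption for this weekly series):
[ ]: # Compute and plot the 52-week rolling mean of the co2 series
ax = co2_levels.rolling(window=52).mean().plot(fontsize=6)
ax.set_xlabel('Date')
plt.show()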
1.2.4 Display aggregated values
[ ]: co2_levels.dtypes
[ ]: co2 float64
dtype: object
[ ]: co2_levels.reset_index('datestamp',inplace=True)
[ ]: co2_levels['datestamp'] = pd.to_datetime(co2_levels.datestamp)
[ ]: co2_levels.set_index('datestamp',inplace=True)
# Extract the month from each date in the index (this line is
# reconstructed; index_month was not defined anywhere in the export)
index_month = co2_levels.index.month
# Compute the mean CO2 levels for each month of the year
mean_co2_levels_by_month = co2_levels.groupby(index_month).mean()
# Plot the mean CO2 levels for each month of the year
mean_co2_levels_by_month.plot(fontsize=6)
plt.show()
[ ]: # Print out summary statistics of the co2_levels DataFrame (this first
# line is reconstructed; only its output survived export)
print(co2_levels.describe())
# Print out the minima of the co2 column in the co2_levels DataFrame
print(co2_levels.co2.min())
# Print out the maxima of the co2 column in the co2_levels DataFrame
print(co2_levels.co2.max())
co2
count 2284.000000
mean 339.657750
std 17.100899
min 313.000000
25% 323.975000
50% 337.700000
75% 354.500000
max 373.900000
313.0
373.9
[ ]: # Generate a boxplot
ax = co2_levels.boxplot()
[ ]: # Generate a histogram
ax = co2_levels.plot(kind='hist', bins=50, fontsize=6)
plt.legend(fontsize=10)
plt.show()
1.3 Seasonality, Trend and Noise
1.3.1 Autocorrelation in time series data
[ ]: # Import required libraries
import matplotlib.pyplot as plt
from statsmodels.graphics import tsaplots
plt.style.use('fivethirtyeight')
# Display the autocorrelation plot of the co2 series (the plot_acf call
# is reconstructed; the discussion below refers to it)
fig = tsaplots.plot_acf(co2_levels['co2'], lags=24)
# Show plot
plt.show()
To help you assess how trustworthy these autocorrelation values are, the plot_acf() function also returns confidence intervals (represented as blue shaded regions). If an autocorrelation value goes beyond the confidence interval region, you can assume that the observed autocorrelation value is statistically significant.
For the co2 series, the lagged values are highly correlated and statistically significant.
By contrast, autocorrelation is weak when the autocorrelation values do not go beyond the confidence intervals (the blue shaded regions) and the correlations (check the lines and their corresponding values on the y-axis) are not greater than 0.5.
1.3.2 Partial autocorrelation in time series data
[ ]: # Display the partial autocorrelation plot of the co2 series (the
# plot_pacf call is reconstructed; only the show lines survived export)
fig = tsaplots.plot_pacf(co2_levels['co2'], lags=24)
# Show plot
plt.show()
If partial autocorrelation values are close to 0, then the values of observations and lagged observations are not correlated with one another. Conversely, partial autocorrelations close to 1 or -1 indicate strong positive or negative correlations between the lagged observations of the time series.
At which lag values do we have statistically significant partial autocorrelations? At lags 0, 1, 4, 5 and 6: these are the lag values that go beyond the confidence intervals.
1.3.3 Time series decomposition
[ ]: # Import statsmodels.api as sm
import statsmodels.api as sm
# Decompose the co2 series and print its seasonal component (these two
# lines are reconstructed; the output below implies them)
decomposition = sm.tsa.seasonal_decompose(co2_levels)
print(decomposition.seasonal)
datestamp
1958-03-29 1.028042
1958-04-05 1.235242
1958-04-12 1.412344
1958-04-19 1.701186
1958-04-26 1.950694
…
2001-12-01 -0.525044
2001-12-08 -0.392799
2001-12-15 -0.134838
2001-12-22 0.116056
2001-12-29 0.285354
Name: seasonal, Length: 2284, dtype: float64
1.3.5 Visualize the airline dataset
[ ]: airline = pd.read_csv('https://raw.githubusercontent.com/ozlerhakan/datacamp/
↪master/Visualizing%20Time%20Series%20Data%20in%20Python/
↪ch3_airline_passengers.csv')
airline.head()
[ ]: Month AirPassengers
0 1949-01 112
1 1949-02 118
2 1949-03 132
3 1949-04 129
4 1949-05 121
1.3.6 Analyze the airline dataset
[ ]: # Print out the number of missing values and the summary statistics
# (this input cell is reconstructed; only its output survived export)
print(airline.isnull().sum())
print(airline.describe())
Month 0
AirPassengers 0
dtype: int64
AirPassengers
count 144.000000
mean 280.298611
std 119.966317
min 104.000000
25% 180.000000
50% 265.500000
75% 360.500000
max 622.000000
[ ]: # Display boxplot of airline values
ax = airline.boxplot()
[ ]: airline['Month'] = pd.to_datetime(airline.Month)
[ ]: airline.set_index('Month', inplace=True)
[ ]: airline.index
[ ]: DatetimeIndex(['1949-01-01', '1949-02-01', '1949-03-01', '1949-04-01',
…
'1960-11-01', '1960-12-01'],
dtype='datetime64[ns]', name='Month', length=144, freq=None)
[ ]: # Extract the month from each date in the index (this line is
# reconstructed; the groupby below requires index_month)
index_month = airline.index.month
# Compute the mean number of passengers for each month of the year
mean_airline_by_month = airline.groupby(index_month).mean()
# Plot the mean number of passengers for each month of the year
mean_airline_by_month.plot()
plt.legend(fontsize=20)
plt.show()
1.3.7 Time series decomposition of the airline dataset
[ ]: # Import statsmodels.api as sm
import statsmodels.api as sm
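The cell that builds airline_decomposed did not survive export; a sketch consistent with the head() output below, assuming the standard seasonal_decompose call:
[ ]: # Decompose the passenger series and collect its trend and seasonal parts
decomposition = sm.tsa.seasonal_decompose(airline['AirPassengers'])
airline_decomposed = pd.DataFrame({'trend': decomposition.trend,
                                   'seasonal': decomposition.seasonal})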
[ ]: import numpy as np
airline_decomposed.head()
[ ]: trend seasonal
Month
1949-01-01 NaN -24.748737
1949-02-01 NaN -36.188131
1949-03-01 NaN -2.241162
1949-04-01 NaN -8.036616
1949-05-01 NaN -4.506313
1.4 Work with Multiple Time Series
1.4.1 Load multiple time series
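The load cell did not survive export; presumably it mirrored the earlier read_csv calls (the ch4_meat.csv filename is inferred from the repository's chapter naming pattern and is an assumption):
[ ]: meat = pd.read_csv('https://raw.githubusercontent.com/ozlerhakan/datacamp/
↪master/Visualizing%20Time%20Series%20Data%20in%20Python/ch4_meat.csv')
meat.head()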
(head() output truncated in export; only the last column, turkey, survived — rows 0 to 4 are all NaN)
[ ]: meat.describe()
(output truncated in export; only the last two columns survived)
other_chicken turkey
count 143.000000 635.000000
mean 43.033566 292.814646
std 3.867141 162.482638
min 32.300000 12.400000
25% 40.200000 154.150000
50% 43.400000 278.300000
75% 45.650000 449.150000
max 51.100000 585.100000
[ ]: # Convert date to datetime64 and set it as the index (these two lines
# are reconstructed; the date-indexed output below implies them)
meat['date'] = pd.to_datetime(meat['date'])
meat = meat.set_index('date')
meat.head()
(output truncated in export; only the turkey column survived)
turkey
date
1944-01-01 NaN
1944-02-01 NaN
1944-03-01 NaN
1944-04-01 NaN
1944-05-01 NaN
1.4.2 Visualize multiple time series
[ ]: # Plot all the time series in the meat DataFrame (the plot call is
# reconstructed; only the customizations survived export)
ax = meat.plot(linewidth=2, fontsize=12)
# Additional customizations
ax.set_xlabel('Date')
ax.legend(fontsize=15)
# Show plot
plt.show()
[ ]: # Plot an area chart of the meat DataFrame (the area-chart call is an
# assumption; only the customizations survived export)
ax = meat.plot.area(fontsize=12)
# Additional customizations
ax.set_xlabel('Date')
ax.legend(fontsize=15)
# Show plot
plt.show()
1.4.3 Define the color palette of your plots
[ ]: # Plot time series dataset using the cubehelix color palette (the plot
# call is reconstructed; only the customizations survived export)
ax = meat.plot(colormap='cubehelix', fontsize=15, figsize=(15,10))
# Additional customizations
ax.set_xlabel('Date')
ax.legend(fontsize=18)
# Show plot
plt.show()
[ ]: # Plot time series dataset using the PuOr color palette
ax = meat.plot(colormap='PuOr', fontsize=15, figsize=(15,10))
# Additional customizations
ax.set_xlabel('Date')
ax.legend(fontsize=18)
# Show plot
plt.show()
1.4.4 Add summary statistics to your time series plot
[ ]: des = meat.describe().loc['mean']
meat_mean = pd.DataFrame([des.values], columns=des.index.values, index=['mean'])
meat_mean
(output truncated in export; only the last two columns survived)
other_chicken turkey
mean 43.033566 292.814646
[ ]: # Plot the meat time series and add the table of means to the plot
# (the plot and table calls are reconstructed; only the legend and show
# lines survived export)
ax = meat.plot(fontsize=6, linewidth=1)
ax.table(cellText=meat_mean.values,
         colWidths=[0.15]*len(meat_mean.columns),
         rowLabels=meat_mean.index,
         colLabels=meat_mean.columns,
         loc='top')
# Specify the fontsize and location of your legend
ax.legend(loc='upper center', bbox_to_anchor=(0.5, 0.95), ncol=3, fontsize=12)
# Show plot
plt.show()
1.4.6 Compute correlations between time series
The pearson method should be used when the relationship between your variables is thought to be linear, while the rank-based kendall and spearman methods should be used when the relationship is thought to be non-linear.
[ ]: # Print the correlation matrix between the beef and pork columns using the␣
↪spearman method
print(meat[['beef', 'pork']].corr(method='spearman'))
beef pork
beef 1.000000 0.827587
pork 0.827587 1.000000
0.827587
[ ]: # Compute the correlation between the pork, veal and turkey columns using the␣
↪pearson method (the corr call is reconstructed; it was lost in export)
print(meat[['pork', 'veal', 'turkey']].corr(method='pearson'))
# The lowest correlation value found above
print(-0.768366)
1.4.7 Visualize correlation matrices
[ ]: # Compute the correlation matrix and draw it as a heatmap (the import,
# corr and heatmap lines are reconstructed; only the tick-rotation and
# show calls survived export)
import seaborn as sns
corr_meat = meat.corr(method='spearman')
sns.heatmap(corr_meat, annot=True)
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.show()
1.4.8 Clustered heatmaps
[ ]: # Customize the heatmap of the corr_meat correlation matrix and rotate the x-axis labels
fig = sns.clustermap(corr_meat,
row_cluster=True,
col_cluster=True,
figsize=(10, 10))
plt.setp(fig.ax_heatmap.xaxis.get_majorticklabels(), rotation=90)
plt.setp(fig.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)
plt.show()
1.5 Case Study
[ ]: import seaborn as sns
sns.regplot(x=meat["veal"], y=meat["lamb_and_mutton"])
1.5.1 Explore the Jobs dataset
[ ]: jobs = pd.read_csv('https://raw.githubusercontent.com/ozlerhakan/datacamp/
↪master/Visualizing%20Time%20Series%20Data%20in%20Python/ch5_employment.csv')
jobs.head()
(head() output truncated in export; the 16 sector columns spilled past the page edge and their headers did not survive)
[ ]: jobs.tail()
(tail() output truncated in export; the column headers did not survive)
[ ]: jobs.dtypes
datestamp object
Agriculture float64
Business services float64
Construction float64
Durable goods manufacturing float64
Education and Health float64
Finance float64
Government float64
Information float64
Leisure and hospitality float64
Manufacturing float64
Mining and Extraction float64
Nondurable goods manufacturing float64
Other float64
Self-employed float64
Transportation and Utilities float64
Wholesale and Retail Trade float64
dtype: object
[ ]: # Count the missing values in each column (input cell reconstructed;
# only its output survived export)
print(jobs.isnull().sum())
Agriculture 0
Business services 0
Construction 0
Durable goods manufacturing 0
Education and Health 0
Finance 0
Government 0
Information 0
Leisure and hospitality 0
Manufacturing 0
Mining and Extraction 0
Nondurable goods manufacturing 0
Other 0
Self-employed 0
Transportation and Utilities 0
Wholesale and Retail Trade 0
dtype: int64
[ ]: # Generate a boxplot
jobs.boxplot(fontsize=6, vert=False)
plt.show()
# Print summary statistics of the jobs DataFrame (this line is
# reconstructed; only its output survived export)
print(jobs.describe())
# Print the highest mean: Agriculture (see the summary statistics below)
print(9.840984)
# Print the highest variability: Construction's standard deviation
print(4.587619)
Agriculture Business services Construction \
count 122.000000 122.000000 122.000000
mean 9.840984 6.919672 9.426230
std 3.962067 1.862534 4.587619
min 2.400000 4.100000 4.400000
25% 6.900000 5.600000 6.100000
50% 9.600000 6.450000 8.100000
75% 11.950000 7.875000 10.975000
max 21.300000 12.000000 27.100000
(the remaining column blocks were truncated in export)
9.840984
4.587619
1.5.4 Annotate significant events in time series data
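Only the closing plt.show() survived from this section's cells; a sketch of the usual pattern (the datestamp indexing is implied by jobs.index.year in the next section, while the annotation date and colors are illustrative assumptions):
[ ]: # Use datestamp as a datetime index
jobs['datestamp'] = pd.to_datetime(jobs['datestamp'])
jobs = jobs.set_index('datestamp')
[ ]: # Plot all series and mark the onset of the 2008 financial crisis
ax = jobs.plot(colormap='Dark2', fontsize=6, figsize=(10, 6))
ax.axvline('2008-01-01', color='red', linestyle='--')
# Show plot
plt.show()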
1.5.5 Plot monthly and yearly trends
[ ]: # Extract the year from each date in the index of the jobs DataFrame
index_year = jobs.index.year
# Compute and plot the mean unemployment rate for each year (the groupby
# and plot lines are reconstructed; only the legend and show survived)
jobs_by_year = jobs.groupby(index_year).mean()
ax = jobs_by_year.plot(fontsize=6, linewidth=1)
ax.legend(bbox_to_anchor=(0.1, 0.5), fontsize=10)
plt.show()
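The monthly counterpart referenced in the note below was also lost in export; a sketch, assuming the same pattern:
[ ]: # Compute and plot the mean unemployment rate for each calendar month
index_month = jobs.index.month
jobs_by_month = jobs.groupby(index_month).mean()
ax = jobs_by_month.plot(fontsize=6, linewidth=1)
ax.legend(bbox_to_anchor=(0.1, 0.5), fontsize=10)
plt.show()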
Averaging the time series values by month shows that the unemployment rate tends to be much higher during the winter months for the Agriculture and Construction industries. The increase in the unemployment rate after 2008 is very clear when averaging the time series values by year.
1.5.6 Apply time series decomposition to your dataset
[ ]: # Initialize the dictionary of decompositions and list the series names
# (these two lines are reconstructed; the loop below requires them)
jobs_names = jobs.columns
jobs_decomp = {}
# Run time series decomposition on each time series of the DataFrame
for ts in jobs_names:
    ts_decomposition = sm.tsa.seasonal_decompose(jobs[ts])
    jobs_decomp[ts] = ts_decomposition
1.5.7 Visualize the seasonality of multiple time series
[ ]: jobs_seasonal = {}
[ ]: # Extract the seasonal values for the decomposition of each time series
for ts in jobs_names:
    jobs_seasonal[ts] = jobs_decomp[ts].seasonal
# Build a DataFrame from the jobs_seasonal dictionary and facet-plot it
# (these plotting lines are reconstructed; only plt.show survived export)
seasonality_df = pd.DataFrame.from_dict(jobs_seasonal)
seasonality_df.plot(subplots=True, layout=(4, 4), figsize=(16, 16), legend=False)
# Show plot
plt.show()
1.5.8 Correlations between multiple time series
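The closing cell is empty in the export; the course's final step computes the correlations between the seasonal components and plots a clustered heatmap, along these lines (the correlation method and figure options are assumptions):
[ ]: # Correlation matrix of the seasonal components of all job series
corr_seasonality = seasonality_df.corr(method='spearman')
# Clustered heatmap of the seasonality correlations
fig = sns.clustermap(corr_seasonality, annot=True, annot_kws={'size': 4},
                     figsize=(15, 10))
plt.setp(fig.ax_heatmap.xaxis.get_majorticklabels(), rotation=90)
plt.setp(fig.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)
plt.show()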