Time Series Analysis

The document provides a comprehensive guide on time series analysis using Python, including data preprocessing, visualization, and testing for stationarity. It covers techniques such as differencing, seasonal decomposition, and model fitting using ARIMA and SARIMAX. The document also highlights the importance of determining the right order of differencing and includes practical examples with code snippets.

Uploaded by

Daniel Wu

---------------------------------------Time Series Analysis----------------------------------

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
%matplotlib inline

# date_parser: specifies a function that converts an input string into a
# datetime value. By default pandas reads dates in the format
# 'YYYY-MM-DD HH:MM:SS'. If the data is not in this format, the format has
# to be defined manually. Something similar to the dateparse function
# defined here can be used for this purpose.

# Convert the date to the correct time series date format if needed
# (pd.datetime is no longer available in current pandas, so use datetime directly)
dateparse = lambda dates: datetime.strptime(dates, '%Y-%m')

# Read in the time series
df = pd.read_csv('D:\\For Dan\\Learning\\Web\\AirPassengers.csv',
                 parse_dates=['Month'], index_col='Month')
# To parse with the lambda function above, also pass date_parser=dateparse
# (pandas 2.0 deprecates date_parser in favor of date_format='%Y-%m')
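
The date handling described above can be tried without the CSV file. The snippet below is a minimal sketch on a few in-memory strings; the month strings, the values and the name `Passengers` are illustrative assumptions, not taken from the data set.

```python
import pandas as pd

# Hypothetical month strings in the same 'YYYY-MM' layout as the Month column
months = ['1949-01', '1949-02', '1949-03']

# An explicit format avoids guessing and raises on malformed values
parsed = pd.to_datetime(months, format='%Y-%m')

# Small series indexed by the parsed dates (illustrative values)
series = pd.Series([112, 118, 132], index=parsed, name='Passengers')
print(series.index)
```

Each string is mapped to the first day of its month, which is what a monthly time series index usually looks like.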

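# 1 Read in time series

# Check the data, drop any N/A rows (for example with df.dropna())

# 2 Preprocessing, also check for missing values
# The steps below apply only if 'Month' was read as an ordinary column,
# that is, read_csv was called without parse_dates/index_col.

# Option A: convert the column and set it as the index
df['Month'] = pd.to_datetime(df['Month'])
df.set_index('Month', inplace=True)

# Option B: build a timestamp column with an explicit format, use it as
# the index, then drop the original column
df['timestamp'] = pd.to_datetime(df['Month'], format='%Y-%m')
df.index = df['timestamp']
df.drop('Month', axis=1, inplace=True)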

----------------Visualize the Data

plt.rcParams['figure.figsize'] = 8,4
df.plot()

-----------------Testing For Stationarity


### Dickey-Fuller test for time series stationarity
from statsmodels.tsa.stattools import adfuller

# Check whether the p-value is below 5% (or 1%), i.e. whether Ho can be
# rejected at the 95% (or 99%) confidence level
adfuller(df['Sales'])

# Ho: the series is non-stationary (it has a unit root)
# H1: the series is stationary

def adf_test(values):
    result = adfuller(values)
    labels = ['ADF Test Statistic', 'p-value', '#Lags Used',
              'Number of Observations Used']
    for value, label in zip(result, labels):
        print(label + ' : ' + str(value))
    if result[1] <= 0.05:
        print("strong evidence against the null hypothesis (Ho), reject the null "
              "hypothesis. Data has no unit root and is stationary")
    else:
        print("weak evidence against the null hypothesis, cannot reject it: the time "
              "series may have a unit root, indicating it is non-stationary")

adf_test(df['Sales'])

------------Difference
df['Sales First Difference'] = df['Sales'] - df['Sales'].shift(1)
# The data is seasonal (the sales cycle usually repeats every year, so the
# seasonal period is 12 months)
df['Seasonal First Difference'] = df['Sales'] - df['Sales'].shift(12)

# Or, equivalently, use diff()
df['Sales'].diff()      # same as the first difference
df['Sales'].diff(12)    # same as the seasonal difference
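
The two ways of differencing shown above give the same result. The following is a small sketch on made-up monthly values (the series `sales` is hypothetical) that checks this for both the first and the seasonal difference.

```python
import pandas as pd

# Hypothetical monthly values, two years long so a 12-month lag exists
idx = pd.date_range('2020-01-01', periods=24, freq='MS')
sales = pd.Series(range(100, 124), index=idx, dtype='float64')

first_manual = sales - sales.shift(1)
first_diff = sales.diff()

seasonal_manual = sales - sales.shift(12)
seasonal_diff = sales.diff(12)

# equals() treats NaN values in the same position as equal
print(first_manual.equals(first_diff))         # True
print(seasonal_manual.equals(seasonal_diff))   # True
print(seasonal_diff.isna().sum())              # 12 leading NaN values
```

The leading NaN values are the reason the later steps call dropna before testing or plotting a differenced column.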

# Check whether the rolling mean/std are constant (run the same plot on a
# differenced column to check the series after differencing)

plt.rcParams['figure.figsize'] = 8,4
rolmean = df['Sales'].rolling(12).mean()
rolstd = df['Sales'].rolling(12).std()
orig = plt.plot(df['Sales'], color='blue',label='Original')
mean = plt.plot(rolmean, color='red', label='Rolling Mean')
std = plt.plot(rolstd, color='black', label = 'Rolling Std')
plt.legend()
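
The rolling-statistics check above can also be done numerically instead of by eye. This is a minimal sketch on a synthetic trending series (the series `trend` is an assumption made for illustration): the rolling mean drifts for the raw series and becomes flat after one difference.

```python
import numpy as np
import pandas as pd

# Synthetic series with a steady upward trend, four years of monthly data
idx = pd.date_range('2015-01-01', periods=48, freq='MS')
trend = pd.Series(np.arange(48, dtype=float) * 2.0, index=idx)

roll_mean_raw = trend.rolling(12).mean().dropna()
roll_mean_diff = trend.diff().rolling(12).mean().dropna()

# Spread of the rolling mean: large for the trending series,
# essentially zero once the series has been differenced
spread_raw = roll_mean_raw.max() - roll_mean_raw.min()
spread_diff = roll_mean_diff.max() - roll_mean_diff.min()
print(spread_raw, spread_diff)
```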

# Run the adfuller test again; make sure to dropna first
# (adf_test prints its own report and returns nothing, so call it directly)
adf_test(df['Sales'].dropna())
print('\n')
adf_test(df['Sales First Difference'].dropna())
print('\n')
adf_test(df['Seasonal First Difference'].dropna())

df['Seasonal First Difference'].plot()

----------------Decomposing
from statsmodels.tsa.seasonal import seasonal_decompose

#Also make sure to dropna


dfs = df['Seasonal First Difference'].dropna()
decomposition = seasonal_decompose(dfs)
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

How to determine the right order of differencing?

The right order of differencing is the minimum differencing required to get a
near-stationary series that roams around a well-defined mean and whose ACF plot
drops to zero fairly quickly.

If the autocorrelations are positive for many lags (10 or more), the series needs
further differencing. On the other hand, if the lag-1 autocorrelation itself is too
negative, the series is probably over-differenced.
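
The autocorrelation rule of thumb above can be illustrated on synthetic data. The sketch below uses a hand-written sample autocorrelation (the helper `autocorr` is hypothetical, not a statsmodels function) and a simulated random walk, which needs exactly one difference.

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag (lag >= 1)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

rng = np.random.default_rng(0)      # fixed seed, synthetic data only
walk = np.cumsum(rng.normal(size=500))

# Under-differenced: autocorrelations stay large and positive for many lags
print([round(autocorr(walk, k), 2) for k in (1, 5, 10)])

# Differenced once: the result is white noise, so lag 1 is close to zero
print(round(autocorr(np.diff(walk), 1), 2))

# Differenced twice: over-differencing pushes lag 1 toward -0.5
print(round(autocorr(np.diff(walk, n=2), 1), 2))
```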

If you can't decide between two orders of differencing, go with the order that
gives the smallest standard deviation in the differenced series.
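
The standard-deviation tie-breaker above can be sketched as follows, again on a simulated random walk (synthetic data, chosen here only because its correct order of differencing is known to be 1).

```python
import numpy as np

rng = np.random.default_rng(1)                    # synthetic data only
series = np.cumsum(rng.normal(size=400)) + 50.0   # random walk, true d = 1

# Standard deviation of the series after d rounds of differencing
stds = {d: float(np.std(np.diff(series, n=d))) for d in range(3)}
for d, s in stds.items():
    print(f"d={d}: std={s:.3f}")

best_d = min(stds, key=stds.get)
print("order with the smallest std:", best_d)
```

This is only a heuristic for choosing between close candidates; it does not replace the stationarity test or the ACF check.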

If your series is slightly under-differenced, adding one or more additional AR
terms usually makes up for it. Likewise, if it is slightly over-differenced, try
adding an additional MA term.

------------fit the model find p d q


---1plot acf /pacf

from statsmodels.graphics.tsaplots import plot_acf,plot_pacf


plt.figure(figsize=(12,8))
plot_acf(df['Seasonal First Difference'].iloc[13:],lags=40)
plot_pacf(df['Seasonal First Difference'].iloc[13:],lags=40)

---2plot acf /pacf


# PACF plot of 1st differenced series
plt.rcParams.update({'figure.figsize':(9,3), 'figure.dpi':120})

fig, axes = plt.subplots(1, 2, sharex=True)


axes[0].plot(df.value.diff()); axes[0].set_title('1st Differencing')
axes[1].set(ylim=(0,5))
plot_pacf(df.value.diff().dropna(), ax=axes[1])

plt.show()

plt.rcParams.update({'figure.figsize':(9,3), 'figure.dpi':120})

fig, axes = plt.subplots(1, 2, sharex=True)


axes[0].plot(df.value.diff()); axes[0].set_title('1st Differencing')
axes[1].set(ylim=(0,1.2))
plot_acf(df.value.diff().dropna(), ax=axes[1])

plt.show()

# For non-seasonal data


#p=1, d=1, q=0 or 1
#p is the order of the AR term
#q is the order of the MA term
#d is the number of differences required to make the time series stationary

# statsmodels.tsa.arima_model was removed in statsmodels 0.13; use the new module
from statsmodels.tsa.arima.model import ARIMA

model=ARIMA(df['Sales'],order=(1,1,1))
model_fit=model.fit()
model_fit.summary()

df['forecast']=model_fit.predict(start=90,end=103,dynamic=True)
df[['Sales','forecast']].plot(figsize=(12,8))
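
To make the roles of p and d in the (p, d, q) order more concrete, here is a rough sketch that simulates an ARIMA(1,1,0)-style series with NumPy and recovers the AR coefficient by plain least squares. This is not how statsmodels fits the model; the coefficient `phi_true` and all data are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(42)   # synthetic data only
phi_true = 0.6                    # assumed AR(1) coefficient for the demo
n = 2000

# AR part (p=1): each change depends on the previous change plus noise
e = rng.normal(size=n)
w = np.zeros(n)
for t in range(1, n):
    w[t] = phi_true * w[t - 1] + e[t]

# Integrated part (d=1): the observed series is the running sum of the changes
y = np.cumsum(w)

# Difference once to get back a stationary series, then estimate phi
dy = np.diff(y)
phi_hat = float(np.dot(dy[1:], dy[:-1]) / np.dot(dy[:-1], dy[:-1]))
print(round(phi_hat, 2))

# One-step-ahead forecast of the level: last value plus the predicted change
next_value = y[-1] + phi_hat * dy[-1]
print(next_value)
```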

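# seasonal_order=(P, D, Q, s): seasonal AR order, seasonal differencing order,
# seasonal MA order, and the number of periods per season (12 for monthly data)
import statsmodels.api as sm

model = sm.tsa.statespace.SARIMAX(df['Sales'], order=(1, 1, 1),
                                  seasonal_order=(1, 1, 1, 12))
results = model.fit()

# See how our forecast fits the actual values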


df['forecast']=results.predict(start=90,end=103,dynamic=True)
df[['Sales','forecast']].plot(figsize=(12,8))

# Predict the future months


from pandas.tseries.offsets import DateOffset
future_dates=[df.index[-1]+ DateOffset(months=x)for x in range(0,24)]

future_datest_df=pd.DataFrame(index=future_dates[1:],columns=df.columns)

future_df=pd.concat([df,future_datest_df])
future_df['forecast'] = results.predict(start = 104, end = 120, dynamic= True)
future_df[['Sales', 'forecast']].plot(figsize=(12, 8))
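
The step of extending the frame with empty future rows can be tried in isolation. The sketch below is a variant that uses reindex instead of concat; the frame `history` and its values are hypothetical.

```python
import pandas as pd
from pandas.tseries.offsets import DateOffset

# Hypothetical history: three monthly observations
history = pd.DataFrame({'Sales': [10.0, 12.0, 11.0]},
                       index=pd.date_range('2021-10-01', periods=3, freq='MS'))

# Six month-start dates beginning one month after the last observation
future_index = pd.date_range(history.index[-1] + DateOffset(months=1),
                             periods=6, freq='MS')

# Reindexing onto the combined index adds the future rows, filled with NaN
extended = history.reindex(history.index.append(future_index))
print(extended.index[-1])               # 2022-06-01 00:00:00
print(extended['Sales'].isna().sum())   # 6
```

The NaN rows act as placeholders that a forecast column can later fill in.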

---------------------------An End-to-End Project on Time Series Analysis and Forecasting with Python---------------------------------------------------------

import warnings
import itertools
import numpy as np
import matplotlib.pyplot as plt
warnings.filterwarnings("ignore")
plt.style.use('fivethirtyeight')
import pandas as pd
import statsmodels.api as sm
import matplotlib
matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['text.color'] = 'k'

# You can specify usecols when reading in the file, or drop the unused columns later
df = pd.read_excel('C:\\Users\\wooju\\Desktop\\Python Programing\\Python Learning Journey\\Dataset\\Superstore.xls',
                   sheet_name='Orders', usecols=['Order Date', 'Segment'])

If you are interested in summarizing all of the sales by month, you can use the
resample function. The tricky part about resample is that by default it only
operates on a datetime-like index. In this data set the data is not indexed by the
date column, so resample would not work without restructuring the data. To make it
work, use set_index to make the date column the index and then resample.
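
The set_index-then-resample step described above can be sketched like this. The rows and the column name `Sales` are illustrative assumptions, not read from the Superstore file.

```python
import pandas as pd

# Hypothetical orders; 'Sales' is an illustrative column name
orders = pd.DataFrame({
    'Order Date': pd.to_datetime(['2017-01-03', '2017-01-20',
                                  '2017-02-11', '2017-03-05', '2017-03-28']),
    'Sales': [100.0, 50.0, 75.0, 20.0, 30.0],
})

# resample needs a datetime index, so move the date column into the index first
monthly = orders.set_index('Order Date')['Sales'].resample('MS').sum()
print(monthly)
```

The 'MS' alias labels each bucket with the first day of its month; the result here has one row each for January, February and March 2017.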
