4/23/23, 5:37 PM TSA Project Ultimate
In [2]: # Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.stats.diagnostic import acorr_ljungbox
from scipy.stats import ttest_ind
import statsmodels.api as sm
In [3]: # Read The Data From CSV File
df = pd.read_csv("C:/Jupyter Lab/data/SPX Monthly Data 2000 To 2019.csv")
# set date column as index
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
# Extract Monthly Close Prices
close_monthly = df['Close'].resample('M').last()
In [4]: # Plot The Monthly Close Prices
plt.plot(close_monthly)
plt.xlabel('Year')
plt.ylabel('Close Price')
plt.title('Monthly Close Prices from 2000 to 2019')
plt.show()
file:///C:/Users/Ishan/Desktop/TSA Project Python Code.html 1/6
Since the plot shows a clear upward trend, the series is not stationary. We therefore difference
the data before modelling, since the ARMA part of a SARIMA model assumes a stationary series.
In [5]: # Perform Differencing To Make The Data Stationary
diff_monthly = close_monthly.diff().dropna()
# Plot The Differenced Data
plt.plot(diff_monthly)
plt.xlabel('Year')
plt.ylabel('Differenced Close Price')
plt.title('Differenced Monthly Close Prices from 2000 to 2019')
plt.show()
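To make the effect of differencing concrete, here is a minimal sketch in plain Python. The helper name `first_difference` and the toy series are illustrative assumptions, not part of the notebook; the point is only that a steadily rising series turns into changes that fluctuate around a constant.

```python
def first_difference(series):
    """Return [y[1]-y[0], y[2]-y[1], ...]; one element shorter than the input."""
    return [curr - prev for prev, curr in zip(series, series[1:])]


# Hypothetical trending series: a straight line (slope 5) plus a small wiggle.
levels = [100 + 5 * t + (1 if t % 2 else -1) for t in range(8)]
changes = first_difference(levels)

print(levels)   # rises steadily, so the mean depends on the time window
print(changes)  # fluctuates around the slope, with no trend left
```

This is the same operation that `close_monthly.diff().dropna()` performs on the pandas series.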
In [6]: # Plot ACF And PACF To Determine SARIMA Parameters
fig, ax = plt.subplots(2, figsize=(12,8))
sm.graphics.tsa.plot_acf(diff_monthly, lags=30, ax=ax[0])
sm.graphics.tsa.plot_pacf(diff_monthly, lags=30, method='ywm', ax=ax[1])
plt.show()
# Print ACF and PACF values
acf_values = sm.tsa.stattools.acf(diff_monthly, nlags=30)
pacf_values = sm.tsa.stattools.pacf(diff_monthly, nlags=30, method='ywm')
print('ACF values:', acf_values)
print('PACF values:', pacf_values)
ACF values: [ 1. -0.05424451 0.01423008 0.01758467 -0.03799791 0.1462674
-0.07793367 0.06328135 0.13414535 -0.01721492 0.05255444 -0.02164625
-0.02944189 -0.02383239 -0.01902415 0.06196196 0.0384558 -0.00116111
-0.02437942 0.1108885 -0.05274602 -0.01206332 -0.07061918 0.01159541
0.03963217 -0.03796366 0.00800788 -0.02832515 -0.02506188 -0.02671741
0.00515232]
PACF values: [ 1. -0.05424451 0.01132093 0.01902033 -0.03631952 0.14248593
-0.06425536 0.05628265 0.13865735 0.00467306 0.02479729 0.00333099
-0.04712943 -0.0601706 -0.00357006 0.03472574 0.03285513 0.00629871
-0.02994631 0.12247964 -0.04395441 -0.00999909 -0.07376671 0.0032677
-0.00915684 -0.01225068 -0.01271558 -0.03274584 -0.01351006 -0.02857296
0.03796791]
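One rough way to screen the printed ACF values is to compare them with the approximate 95% band for white noise, ±1.96/√n. The sketch below does this in plain Python. The sample size `n_obs = 239` is an assumption (240 monthly closes for 2000 to 2019, minus one observation lost to differencing), and the names `bound` and `flagged` are illustrative.

```python
import math

# ACF of the differenced series at lags 0..30, copied from the printed output.
acf_values = [
    1.0, -0.05424451, 0.01423008, 0.01758467, -0.03799791, 0.1462674,
    -0.07793367, 0.06328135, 0.13414535, -0.01721492, 0.05255444, -0.02164625,
    -0.02944189, -0.02383239, -0.01902415, 0.06196196, 0.0384558, -0.00116111,
    -0.02437942, 0.1108885, -0.05274602, -0.01206332, -0.07061918, 0.01159541,
    0.03963217, -0.03796366, 0.00800788, -0.02832515, -0.02506188, -0.02671741,
    0.00515232,
]

n_obs = 239  # ASSUMPTION: 240 monthly observations minus one lost to differencing
bound = 1.96 / math.sqrt(n_obs)  # approximate 95% band under white noise

# Lag 0 is always 1 by construction, so it is skipped.
flagged = [lag for lag, r in enumerate(acf_values) if lag > 0 and abs(r) > bound]

print(f"band: +/-{bound:.4f}")
print("lags outside the band:", flagged)
```

With these inputs only lags 5 and 8 fall outside the band, and only narrowly; the seasonal lags 12 and 24 are well inside it.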
In [7]: # Fit A SARIMA Model
model = SARIMAX(diff_monthly, order=(1,1,1), seasonal_order=(1,0,1,12))
results = model.fit()
Based on the ACF and PACF plots, there was no clear evidence of strong seasonality: neither
plot shows a pronounced spike at lag 12 or lag 24. The largest values are small spikes of
about 0.14 at lags 5 and 8 in both plots, and there are no other strong or consistent spikes
that suggest a clear pattern in the autocorrelations. The SARIMA(1,1,1)(1,0,1)12 model
parameters were therefore chosen as a simple starting point.
The chosen model has the following components:
Non-seasonal component:
Autoregressive term (p=1): This accounts for the direct relationship between the current
value and the previous value in the time series.
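Differencing term (d=1): This takes the first difference of the series passed to the model in
order to remove trend. Note that the model above is fitted to diff_monthly, which has already
been differenced once, so d=1 differences the close prices a second time.
Moving average term (q=1): This captures the relationship between the current value and
the forecast error from the previous time step.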
Seasonal component:
Seasonal autoregressive term (P=1): This accounts for the direct relationship between the
current seasonal value and the seasonal value from the previous cycle.
Seasonal differencing term (D=0): No seasonal differencing is applied, as there is no clear
evidence of strong seasonality in the ACF and PACF plots.
Seasonal moving average term (Q=1): This captures the relationship between the current
seasonal value and the residual error from the previous seasonal value.
Seasonal period (s=12): This sets the seasonal period to 12 months, which is typical for
monthly data with potential yearly seasonality.
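The components listed above can be combined into a single equation. The following is a sketch in backshift notation, assuming the sign convention in which the moving average polynomials are written with plus signs and ignoring any constant term. The symbols are introduced here for illustration and do not appear in the notebook: $B$ is the backshift operator ($B y_t = y_{t-1}$), $y_t$ is the series passed to the model, $\varepsilon_t$ is white noise, and $\phi_1$, $\theta_1$, $\Phi_1$, $\Theta_1$ are the non-seasonal and seasonal AR and MA coefficients.

```latex
% SARIMA(1,1,1)(1,0,1)_{12}: p=1, d=1, q=1, P=1, D=0, Q=1, s=12
(1 - \phi_1 B)\,(1 - \Phi_1 B^{12})\,(1 - B)\, y_t
  = (1 + \theta_1 B)\,(1 + \Theta_1 B^{12})\, \varepsilon_t
```

The factor $(1 - B)$ is the d=1 differencing term; because D=0 there is no seasonal differencing factor $(1 - B^{12})$ on the left-hand side.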
In [8]: # Make Predictions For The Next 4 Years
start_date = '2020-01-31'
end_date = '2023-12-31'
pred_monthly = results.predict(start=start_date, end=end_date)
# Plot The Predicted Values
plt.plot(pred_monthly)
plt.xlabel('Year')
plt.ylabel('Predicted Differenced Close Price')
plt.title('Predicted Monthly Price Changes from 2020 to 2023')
plt.show()
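The forecasts in this cell are on the scale of diff_monthly, that is, month-to-month changes in the close rather than price levels. A minimal sketch of mapping a sequence of predicted changes back to levels, assuming the last observed close is known. The helper name `undifference` and the numbers are hypothetical.

```python
from itertools import accumulate


def undifference(last_level, predicted_changes):
    """Rebuild levels from a starting level and a sequence of one-step changes."""
    return [last_level + total for total in accumulate(predicted_changes)]


# Hypothetical values: a last observed close of 100 and three predicted changes.
print(undifference(100.0, [1.0, -2.0, 3.0]))
```

With the notebook's objects, the starting level would be the final value of `close_monthly` and the changes would be `pred_monthly`.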
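In [13]: # Perform t-test On January Returns
jan_returns = pred_monthly[pred_monthly.index.month == 1]
other_returns = pred_monthly[pred_monthly.index.month != 1]
# equal_var=False selects Welch's t-test (no equal-variance assumption)
t_stat, p_value = ttest_ind(jan_returns, other_returns, equal_var=False)
print('\nNull hypothesis (H0): There is no significant difference between the mean returns '
      'of January and the mean returns of other months. In other words, the January Effect '
      'does not exist.')
print('\nAlternative hypothesis (H1): There is a significant difference between the mean '
      'returns of January and the mean returns of other\n'
      'months. This suggests that the January Effect exists.')
print('\nt-statistic:', t_stat)
print('\np-value:', p_value)
if p_value < 0.05:
    print('\nThe January effect exists')
else:
    print('\nThe January effect does not exist for the predicted values')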
Null hypothesis (H0): There is no significant difference between the mean returns
of January and the mean returns of other months. In other words, the January Effect
does not exist.
Alternative hypothesis (H1): There is a significant difference between the mean
returns of January and the mean returns of other
months. This suggests that the January Effect exists.
t-statistic: -0.1618069189156481
p-value: 0.8804849944025712
The January effect does not exist for the predicted values
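Because `ttest_ind` is called with `equal_var=False`, the statistic above is Welch's unequal-variance t-statistic. The sketch below spells out that calculation in plain Python with hypothetical data; the helper name `welch_t` is an assumption for illustration. Note that the forecast window of 48 months contains only 4 January values against 44 others, so the January group mean rests on very few points.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t_stat = (mean(a) - mean(b)) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t_stat, dof


# Hypothetical samples, not taken from the notebook.
t_stat, dof = welch_t([1.0, 2.0, 3.0], [2.0, 4.0, 6.0, 8.0])
print(t_stat, dof)
```

The p-value reported by `ttest_ind` is the two-sided tail probability of a t distribution with these degrees of freedom.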
In [10]: # Perform Ljung-Box Test For Autocorrelations
# return_df=False is needed for the tuple unpacking below (statistics, then p-values)
lb_stat, lb_p_value = acorr_ljungbox(results.resid, lags=[12, 24, 36], return_df=False)
print('\nLjung-Box statistic (lag 12):', lb_stat[0])
print('\np-value (lag 12):', lb_p_value[0])
print('\nLjung-Box statistic (lag 24):', lb_stat[1])
print('\np-value (lag 24):', lb_p_value[1])
print('\nLjung-Box statistic (lag 36):', lb_stat[2])
print('\np-value (lag 36):', lb_p_value[2])
# Check For Significant Autocorrelation
if any(lb_p_value < 0.05):
    print('\nThere is significant autocorrelation in the residuals')
else:
    print('\nThere is no significant autocorrelation in the residuals')
Ljung-Box statistic (lag 12): 10.388090875990883
p-value (lag 12): 0.5819538907237682
Ljung-Box statistic (lag 24): 17.20902033750296
p-value (lag 24): 0.8396114070768761
Ljung-Box statistic (lag 36): 23.87893364865142
p-value (lag 36): 0.9393244429177144
There is no significant autocorrelation in the residuals
Since the Ljung-Box test finds no significant autocorrelation in the residuals at lags 12, 24
and 36 (p-values of about 0.58, 0.84 and 0.94), the residuals are consistent with white noise
and the SARIMA(1,1,1)(1,0,1)12 model appears to be an adequate fit for the data. In other
words, the model leaves no significant serial dependence, seasonal or otherwise, unexplained
in the residuals. This supports using the model for forecasting, although the test by itself
does not measure forecast accuracy.
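For reference, the Ljung-Box statistic reported above is Q = n(n+2) Σ r_k² / (n−k), summed over lags k = 1..h, where r_k is the lag-k sample autocorrelation of the residuals. The sketch below computes Q in plain Python on a toy series. The function name `ljung_box_q` is an assumption for illustration, and the p-value step (a chi-square tail probability) is left out because it needs a statistics library.

```python
def ljung_box_q(x, max_lag):
    """Ljung-Box Q statistic for lags 1..max_lag of the sequence x."""
    n = len(x)
    m = sum(x) / n
    dev = [v - m for v in x]
    denom = sum(d * d for d in dev)
    total = 0.0
    for k in range(1, max_lag + 1):
        # Lag-k sample autocorrelation.
        r_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        total += r_k ** 2 / (n - k)
    return n * (n + 2) * total


# Toy input, not the notebook's residuals: r_1 = 0.25, so Q = 4*6*0.0625/3.
print(ljung_box_q([1.0, 2.0, 3.0, 4.0], max_lag=1))
```

Large values of Q relative to the chi-square reference distribution indicate residual autocorrelation; the values printed above are small relative to it, which is why the p-values are high.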