Statistics Project SEM1 Notes

Uploaded by

mrarcadian26
Statistics Project SEM1 notes:

Part-A:
To address the requirements outlined in your project, let's break down the tasks step by step:

### Preliminary Assessment of the Time Series:


1. Data Loading: Load the 'CocoaPrices.csv' dataset into your preferred data analysis environment (e.g., Python
with pandas library, R, etc.).

2. Data Exploration: Visualize the time series data to understand its characteristics, including trends, seasonality,
and any potential outliers. You can use line plots, histograms, or other appropriate visualizations.
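As a minimal sketch of the loading step, assuming the CSV has a 'Date' column and a hypothetical 'Price' column (the real layout of 'CocoaPrices.csv' is not shown in these notes), the data could be read and summarised like this. A small in-memory CSV stands in for the real file so the example runs on its own.

```python
import io

import pandas as pd


def load_prices(source, date_col="Date"):
    """Read a CSV of prices and return a DataFrame indexed by parsed dates."""
    df = pd.read_csv(source, parse_dates=[date_col])
    return df.set_index(date_col).sort_index()


# Stand-in for 'CocoaPrices.csv'; the column names here are assumptions.
sample_csv = io.StringIO(
    "Date,Price\n"
    "2023-01-01,2600.0\n"
    "2023-02-01,2700.5\n"
    "2023-03-01,2810.0\n"
)

df = load_prices(sample_csv)
print(df.head())        # first rows, to confirm the dates were parsed
print(df.describe())    # summary statistics for each numeric column
print(df.isna().sum())  # count of missing values per column
```

With the real file, `load_prices("CocoaPrices.csv")` would be the equivalent call, adjusting `date_col` if the date column has a different name.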

### Estimation and Discussion of Suitable Time Series Models:


1. Simple Time Series Models: Consider basic models such as the mean, naive, or random walk models as
baseline benchmarks.

2. Exponential Smoothing: Estimate exponential smoothing models such as Simple Exponential Smoothing
(SES), Holt's Linear Trend method, or Holt-Winters' seasonal method to capture trends and seasonality.

3. ARIMA/SARIMA: Fit Autoregressive Integrated Moving Average (ARIMA) or Seasonal ARIMA (SARIMA)
models to capture any autocorrelation, trends, and seasonality in the data. Inspect ACF and PACF plots to
identify candidate model orders, and check the residuals of each fitted model (e.g., with a Ljung-Box test).

### Model Evaluation and Forecasting:


1. Training Set Selection: Use data up to and including September 2023 as the training set.

2. Forecasting: Forecast the average prices for the 6 months from October 2023 to March 2024 using the chosen
models.

3. Evaluation: Evaluate the accuracy of the forecasts against the actual data for the period October 2023 to
March 2024. Calculate relevant metrics (e.g., Mean Absolute Error, Mean Squared Error) to assess forecast
performance.
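The split and evaluation steps above can be sketched as follows. The series is synthetic, and a naive forecast is used only as a placeholder for whichever model is being evaluated.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series ending in March 2024, standing in for the real data.
rng = np.random.default_rng(1)
index = pd.date_range("2020-01-01", "2024-03-01", freq="MS")
prices = pd.Series(2500 + np.cumsum(rng.normal(10, 50, size=len(index))), index=index)

# Training set: everything up to and including September 2023.
train = prices.loc[:"2023-09"]
# Hold-out set: October 2023 to March 2024 (6 months).
test = prices.loc["2023-10":"2024-03"]

# Placeholder forecast: repeat the last training value (naive model).
forecast = pd.Series(train.iloc[-1], index=test.index)

errors = test - forecast
mae = errors.abs().mean()    # Mean Absolute Error
mse = (errors ** 2).mean()   # Mean Squared Error
rmse = np.sqrt(mse)          # Root Mean Squared Error, in the units of the data

print(f"MAE={mae:.1f}  MSE={mse:.1f}  RMSE={rmse:.1f}")
```

Replacing `forecast` with the output of each candidate model gives directly comparable error metrics on the same hold-out period.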

### Discussion of Optimal Model:


1. Model Selection: Discuss your choice of an 'optimum' model based on forecast accuracy, diagnostic tests, and
model simplicity.

2. Adequacy for Forecasting: Provide commentary on the adequacy of the chosen optimal model for forecasting
purposes. Consider factors such as model assumptions, forecast horizon, and robustness.

### Report Writing:


1. Organization: Structure your report with clear sections for each task, including introduction, data description,
model estimation, forecasting, evaluation, and conclusion.

2. Clarity and Interpretation: Clearly present your findings, interpretations, and conclusions in a concise and
understandable manner.

3. Visualizations: Include relevant visualizations (e.g., time series plots, forecast vs. actual plots) to support your
analysis and conclusions.

4. References: Provide proper citations for data sources, models, and methodologies used in your analysis.

Be sure to document your process thoroughly, including any assumptions made, methodology choices, and
interpretations of results.

EDA process to do:

In the preliminary assessment step of time series analysis, exploratory data analysis (EDA) involves examining
the characteristics of the time series data to gain insights into its structure, patterns, and potential issues.

Here are some common techniques for conducting EDA on time series data:

1. Time Series Plot: Plot the time series data over time to visualize its general trend, seasonality, and any
outliers or irregularities. This can be done using a simple line plot with time on the x-axis and the variable of
interest on the y-axis.

2. Seasonal Decomposition: Decompose the time series into its trend, seasonal, and residual components using
methods like classical seasonal decomposition or seasonal-trend decomposition using LOESS (STL).
This helps identify underlying patterns and seasonal fluctuations.

3. Histogram and Density Plot: Examine the distribution of the data using histograms or density plots to
understand its variability and skewness. This can indicate whether a transformation (e.g., a log transform) is
needed; stationarity itself is better judged from the time series plot and the ACF.

4. Autocorrelation and Partial Autocorrelation Plots: Plot the autocorrelation function (ACF) and partial
autocorrelation function (PACF) to identify the presence of autocorrelation in the data. This helps in determining
the order of autoregressive (AR) and moving average (MA) components in ARIMA modeling.

5. Box Plot or Violin Plot: Visualize the distribution of the data across different time periods, such as months or
seasons, using box plots or violin plots. This can reveal any systematic patterns or differences between time
periods.

6. Time Series Decomposition: Decompose the time series into trend, seasonality, and noise components using
methods like moving averages or exponentially weighted moving averages (EWMA). This can help in
understanding the underlying patterns and trends.

7. Summary Statistics: Calculate summary statistics such as mean, median, standard deviation, minimum, and
maximum values to describe the central tendency and variability of the data.

8. Lag Plots: Create lag plots to visualize the relationship between the time series data and its lagged values.
This can help identify potential autocorrelation and guide the selection of lag orders in ARIMA modeling.
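Several of the checks in this list can be sketched with pandas alone. The series below is synthetic (linear trend plus a 12-month cycle plus noise), and the centered 12-month rolling mean is only a rough approximation of a proper trend estimate.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: trend + yearly cycle + noise.
rng = np.random.default_rng(2)
index = pd.date_range("2016-01-01", periods=96, freq="MS")
t = np.arange(96)
prices = pd.Series(
    2000 + 8 * t + 100 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 20, 96),
    index=index,
    name="Price",
)

# Summary statistics (item 7).
summary = prices.describe()

# Moving-average trend estimate (item 6); edges are NaN where the window is incomplete.
trend = prices.rolling(window=12, center=True).mean()

# Average detrended value per calendar month: a numeric view of what a
# month-by-month box plot would show (item 5).
detrended = prices - trend
monthly_profile = detrended.groupby(detrended.index.month).mean()

# Autocorrelation at a few lags (items 4 and 8).
lag_corr = pd.Series({lag: prices.autocorr(lag=lag) for lag in (1, 6, 12)})

print(summary)
print(monthly_profile.round(1))
print(lag_corr.round(3))
```

For full ACF/PACF plots and an STL decomposition, the statsmodels library provides dedicated functions; they are left out here to keep the sketch dependency-light.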

Code Explanation:
1) Syntax: df.set_index('Date', inplace=True) #If df is our data frame name

This line sets the index of the DataFrame df to the 'Date' column.

When you use df.set_index('Date', inplace=True), it sets the 'Date' column as the index of the DataFrame in
place, meaning it modifies the DataFrame directly without creating a new DataFrame.

Here's what each part of the code does:

• df: This is your DataFrame containing the data.
• set_index('Date'): This method sets the 'Date' column as the index of the DataFrame.
• inplace=True: This parameter is optional. When set to True, it modifies the DataFrame in place,
meaning it doesn't return a new DataFrame but instead modifies the existing one.

So, after executing this line of code, your DataFrame df will have the 'Date' column as its index. This can be
helpful for time series analysis because you can easily access data based on dates.
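A small illustration of the effect, using a made-up three-row DataFrame (the 'Price' column and its values are hypothetical). The dates are converted to datetime first, since date-based selection relies on a datetime index.

```python
import pandas as pd

# Made-up data; only the 'Date' column name comes from the notes above.
df = pd.DataFrame(
    {
        "Date": ["2023-07-01", "2023-08-01", "2023-09-01"],
        "Price": [3300.0, 3450.0, 3600.0],
    }
)

# Convert strings to datetimes so the index becomes a DatetimeIndex.
df["Date"] = pd.to_datetime(df["Date"])

# Move 'Date' from a regular column to the index, modifying df itself.
df.set_index("Date", inplace=True)

# Date-based selection: all rows in August 2023.
print(df.loc["2023-08"])

# Equivalent without inplace: set_index returns a new DataFrame.
indexed = df.reset_index().set_index("Date")
```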

2) Syntax: plt.gca().xaxis.set_major_locator(YearLocator())
• plt.gca(): This function gets the current Axes instance in the current figure. "gca" stands for
"get current axes".
• xaxis: This attribute of the Axes instance represents the x-axis.
• set_major_locator(YearLocator()): This method sets the major locator for the x-axis ticks.
YearLocator() is a locator that places ticks at regular intervals of years.

So, by calling plt.gca().xaxis.set_major_locator(YearLocator()), you're setting the major locator for the
x-axis ticks to show only the years in your plot, which can be useful for better readability and
understanding of the time series data.
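A minimal plotting sketch showing the locator in context. The data are synthetic, and the non-interactive "Agg" backend is selected only so the example runs without a display; in a notebook that line can be dropped.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: no display available; omit in a notebook

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.dates import DateFormatter, YearLocator

# Synthetic monthly series covering five years.
index = pd.date_range("2019-01-01", periods=60, freq="MS")
prices = pd.Series(np.linspace(2300, 3800, 60), index=index)

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(prices.index, prices.values)

# One major tick per year; here ax is the same object plt.gca() would return.
ax.xaxis.set_major_locator(YearLocator())
# Label each tick with the four-digit year only.
ax.xaxis.set_major_formatter(DateFormatter("%Y"))

ax.set_xlabel("Year")
ax.set_ylabel("Price")
fig.tight_layout()
# plt.show() would display the figure in an interactive session.
```

`YearLocator` also takes a `base` argument (e.g., `YearLocator(base=2)`) to place a tick every second year when the series is long.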

TIME SERIES MODELS


1. Mean Model:
• Description: The mean model is one of the simplest time series models, where the prediction
for each future time point is simply the mean of all past observations. It assumes that the time
series data fluctuates around a constant average value, and future values are expected to be
similar to the historical average.
• Formula: $\hat{Y}_{t+1} = \frac{1}{n}\sum_{i=1}^{n} Y_i$, where $\hat{Y}_{t+1}$ is the predicted value
for the next time point, $Y_i$ represents the observed values in the historical data, and $n$ is
the total number of observations.
• Usage: The mean model is often used as a baseline or benchmark model for time series
forecasting. It provides a simple reference point for evaluating the performance of more
complex models.
• Assumptions: The mean model assumes that the underlying process generating the time
series data is stationary and does not exhibit any trend or seasonality. It also assumes that the
mean of the time series remains constant over time.
2. Naive Model:
• Description: The naive model is even simpler than the mean model, where the prediction for
each future time point is equal to the last observed value in the time series. It assumes that
future values will remain constant and equal to the most recent observation.
• Formula: $\hat{Y}_{t+1} = Y_t$, where $\hat{Y}_{t+1}$ is the predicted value for the next time
point, and $Y_t$ represents the last observed value in the time series.
• Usage: The naive model is often used as a baseline for comparison with more sophisticated
forecasting methods. Despite its simplicity, it can perform well for highly persistent series
that behave like a random walk, as many price series do.
• Assumptions: The naive model assumes that there are no systematic patterns or trends in the
time series data, and that future values will be similar to the most recent observation.
3. Random Walk Model:
• Description: The random walk model is a stochastic process where each future value in the
time series is equal to the previous value plus some random noise. It assumes that future
values are influenced by past observations, but also incorporates random fluctuations or
shocks.
• Formula: $Y_t = Y_{t-1} + \epsilon_t$, where $Y_t$ represents the value at time $t$, $Y_{t-1}$
represents the value at the previous time point, and $\epsilon_t$ represents a random error term.
• Usage: The random walk model is commonly used for modeling and forecasting processes
that exhibit persistence or autocorrelation in their time series structure. It is the ARIMA(0,1,0)
model, and so extends naturally to more general autoregressive integrated moving average (ARIMA) models.
• Assumptions: The random walk model assumes that the random error terms $\epsilon_t$ are
independent and identically distributed (i.i.d.) with zero mean, and that there are no systematic
trends or patterns in the time series data.

These three simple models provide a starting point for time series forecasting and can be
useful for establishing baseline performance metrics. However, they may not capture more
complex patterns or dynamics present in real-world time series data. Hence, more
sophisticated models are often required for accurate forecasting in practical applications.
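The three baseline models above are short enough to write by hand. The sketch below uses hypothetical function names and a made-up four-point history. Note that the point forecast of a random walk without drift is the same as the naive forecast, so the random walk is shown as a simulation of $Y_t = Y_{t-1} + \epsilon_t$ instead.

```python
import numpy as np
import pandas as pd


def mean_forecast(history, horizon):
    """Mean model: every future value equals the historical average."""
    return np.full(horizon, history.mean())


def naive_forecast(history, horizon):
    """Naive model: every future value equals the last observation."""
    return np.full(horizon, history.iloc[-1])


def simulate_random_walk(start, n_steps, sigma, seed=0):
    """Simulate Y_t = Y_{t-1} + eps_t with i.i.d. normal shocks (an assumption)."""
    rng = np.random.default_rng(seed)
    shocks = rng.normal(0.0, sigma, size=n_steps)
    return start + np.cumsum(shocks)


history = pd.Series([10.0, 12.0, 11.0, 13.0])

print(mean_forecast(history, 3))   # [11.5 11.5 11.5]
print(naive_forecast(history, 3))  # [13. 13. 13.]
print(simulate_random_walk(history.iloc[-1], n_steps=5, sigma=1.0))
```

Comparing the hold-out error of these baselines with that of a fitted model shows whether the extra complexity actually improves the forecasts.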
Exponential Smoothing:
Simple Exponential Smoothing (SES):
Holt's Linear Trend method:
Holt-Winters' seasonal method:
