DATA ANALYTICS Unit III
Unit III
Working with time series data and regression analysis.
1) Introduction to time series data - Time series data is a type of data that is
collected or recorded over time at regular intervals. In time series analysis, the
order of observations is crucial, as they are taken at successive points in time. This
type of data is commonly used in various fields, including finance, economics,
signal processing, environmental science, and many others.
Bhise N K
DATA ANALYTICS BCS SY
❖ Pattern Identification.
❖ Visualization.
❖ Forecasting.
❖ Business Intelligence.
2) Introduction to time series forecasting - Time series forecasting is a specialized area
of predictive analytics that involves making predictions about future values based on
historical data points ordered chronologically. In a time series, each data point is associated
with a specific timestamp, and the goal is to use the patterns and trends within the
historical data to make accurate predictions for future time points.
Key aspects of time series analysis:
❖ Trend
❖ Seasonality
There are different types of moving averages, but the most common one is the
Simple Moving Average (SMA). The Simple Moving Average is calculated by
taking the average of a set of data points over a specified period and then
moving the average to the next set of data points. The formula for calculating the
Simple Moving Average for a given data set is :
SMA = Sum of data points in the specified period / Number of data points in the
specified period.
2) Choose a period - Decide on the period for your moving average. For example, if
you want a 3-period moving average, you would use the average of the first 3
data points, then slide the window forward by one point (data points 2 to 4), and so on.
3) Calculate the moving average - Place the formula in the cell where you want
the moving average to start.
If your data is in column A and you are calculating a 3-period moving average,
and your first data point is in cell A2, the formula in cell B4 would be:
=AVERAGE(A2:A4)
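The same 3-period calculation can be sketched outside Excel. The following is a minimal Python illustration, not part of the original notes; the function name simple_moving_average and the sample values are hypothetical. It mirrors the Excel layout, where the first full window ends at the third data point.

```python
def simple_moving_average(values, period):
    """Return trailing simple moving averages.

    The first (period - 1) positions have no full window, so they are None,
    mirroring the empty cells above the first AVERAGE formula in Excel.
    """
    if period < 1:
        raise ValueError("period must be at least 1")
    result = []
    for i in range(len(values)):
        if i + 1 < period:
            result.append(None)  # not enough data points yet
        else:
            window = values[i + 1 - period : i + 1]
            result.append(sum(window) / period)
    return result


# Hypothetical data, as if stored in cells A2:A6
sales = [10, 12, 14, 16, 18]
print(simple_moving_average(sales, 3))  # [None, None, 12.0, 14.0, 16.0]
```

Each value is the average of the current point and the two points before it, which is what dragging =AVERAGE(A2:A4) down the column produces.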
Y = B0 + B1X + E
The goal of simple linear regression is to estimate the values of the coefficients B0 and
B1 that minimize the sum of the squared differences between the observed values of Y
and the values predicted by the regression model. This is often done using the method
of least squares.
Once the coefficients are estimated, the regression model can be used to make
predictions about the dependent variable based on values of the independent variable.
Additionally, the fit of the model can be assessed using various metrics such as the
coefficient of determination (R²) and hypothesis tests for the significance of the
coefficients.
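As an illustration of least-squares estimation, prediction, and R², here is a minimal Python sketch. It is not taken from the original notes: the function names and the sample data are hypothetical, and it uses the standard closed-form least-squares formulas for a single independent variable.

```python
def fit_simple_linear_regression(xs, ys):
    """Estimate B0 (intercept) and B1 (slope) by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def r_squared(xs, ys, b0, b1):
    """Share of the variance in Y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data: advertising spend (X) and sales (Y)
x_values = [1, 2, 3, 4, 5]
y_values = [2.1, 4.3, 5.9, 8.2, 9.8]

b0, b1 = fit_simple_linear_regression(x_values, y_values)
print("intercept:", round(b0, 3), "slope:", round(b1, 3))
print("R squared:", round(r_squared(x_values, y_values, b0, b1), 4))
print("prediction for X = 6:", round(b0 + b1 * 6, 3))
```

In Excel, the SLOPE, INTERCEPT, and RSQ functions return the same three quantities for a single independent variable.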
Model diagnostics and validation in Excel typically involve assessing the performance
and accuracy of a model built within Excel, such as a financial model, forecasting
model, or regression analysis. Here's a general overview of steps you can take for
diagnostics and validation:
Data Preparation:
● Ensure your data is clean, organized, and appropriately formatted.
● Split your data into training and testing sets if applicable.
Model Building:
● Construct your model using Excel functions, formulas, or add-ins.
● Document your model's assumptions, methodology, and limitations.
Diagnostic Checks:
● Perform basic checks to ensure your model is functioning correctly, such
as:
● Verifying formulas and references.
● Checking for errors or inconsistencies.
● Assessing outliers or anomalies in the data.
Model Evaluation:
● Evaluate the performance of your model using appropriate metrics.
● For forecasting or regression models, consider metrics like Mean Absolute
Error (MAE), Root Mean Squared Error (RMSE), or R-squared (R²).
● For financial models, assess metrics such as Net Present Value (NPV),
Internal Rate of Return (IRR), or Payback Period.
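For the forecasting and regression metrics named above, a minimal Python sketch may help show what is being computed. The function names and sample numbers are hypothetical and not from the original notes; the formulas are the standard definitions of MAE and RMSE.

```python
import math


def mean_absolute_error(observed, predicted):
    """Average size of the errors, ignoring their sign."""
    errors = [abs(o - p) for o, p in zip(observed, predicted)]
    return sum(errors) / len(errors)


def root_mean_squared_error(observed, predicted):
    """Square root of the average squared error; penalizes large errors more."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical observed values and model predictions
observed = [3, 5, 7]
predicted = [2, 5, 9]
print("MAE:", mean_absolute_error(observed, predicted))  # 1.0
print("RMSE:", root_mean_squared_error(observed, predicted))
```

Both metrics are expressed in the units of the dependent variable, so lower values mean predictions closer to the observed data.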
Validation:
● Validate your model against real-world data or known outcomes.
● Compare model predictions or outputs with observed results.
● Use techniques like cross-validation if applicable.
Sensitivity Analysis:
● Conduct sensitivity analysis to understand how changes in input variables
affect model outputs.
● Use Excel's built-in tools like Data Tables or Scenario Manager for
sensitivity analysis.
Visualizations:
● Create visualizations to present your model's outputs and insights
effectively.
● Excel offers various chart types and formatting options for visual
representation.
Documentation:
● Document your findings, assumptions, methodologies, and validation
results thoroughly.
● Include notes within your Excel file or create a separate documentation
file.
Peer Review:
● Have your model reviewed by colleagues or subject matter experts to
identify potential errors or areas for improvement.
Revision and Iteration:
● Based on feedback and validation results, revise and refine your model as
needed.
● Iteratively improve your model to enhance its accuracy and reliability.
Version Control:
● Maintain version control to track changes and ensure traceability of model
revisions.
Final Review and Approval:
● Conduct a final review of your model before deployment or presentation.
● Obtain necessary approvals or sign-offs from stakeholders.
Assessing the quality of a model based on R-squared, adjusted R-squared, and standard
error involves checking how well the model fits the data and whether it provides
meaningful insights. Here's how you can interpret these metrics:
R-squared (R²):
● R-squared is a statistical measure that represents the proportion of the
variance in the dependent variable that is explained by the independent
variables in the model.
● It ranges from 0 to 1, where 1 indicates that the model explains all the
variability of the response data around its mean.
● Higher R-squared values generally indicate a better fit of the model to the
data.
● However, R-squared alone does not determine whether a model is good or
bad; it should be interpreted in conjunction with other metrics.
Interpretation:
● R-squared values closer to 1 imply that the model explains a large portion
of the variability in the data and is considered desirable.
● R-squared values closer to 0 suggest that the model does not explain
much of the variability in the data and may not be useful for prediction.
Adjusted R-squared:
● Adjusted R-squared is similar to R-squared but adjusts for the number of
predictors in the model.
● It penalizes excessive use of predictors and provides a more accurate
measure of model fit, especially when comparing models with different
numbers of predictors.
● Look for higher R-squared and adjusted R-squared values, indicating better model
fit.
● Compare adjusted R-squared values across models to assess the trade-off
between model complexity and explanatory power.
● Aim for lower standard error values, indicating more accurate predictions.
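To make these comparisons concrete, the sketch below computes adjusted R-squared and the standard error of the regression. It is an illustration only: the formulas are the standard textbook definitions (they are not written out in these notes), and the function names and numbers are hypothetical. Here n is the number of observations and p is the number of predictors.

```python
import math


def adjusted_r_squared(r2, n, p):
    """Adjust R-squared for the number of predictors p, given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def standard_error_of_regression(observed, predicted, p):
    """Typical distance between observed and predicted values.

    Uses n - p - 1 degrees of freedom (p slopes plus one intercept).
    """
    n = len(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, predicted))
    return math.sqrt(ss_res / (n - p - 1))


# Hypothetical comparison: adding a third predictor barely raises R-squared
print(adjusted_r_squared(0.900, n=11, p=2))  # model with 2 predictors
print(adjusted_r_squared(0.905, n=11, p=3))  # model with 3 predictors
```

If the second value is lower than the first, the extra predictor did not add enough explanatory power to justify the added complexity.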
Normality Assumption:
● Normality of residuals matters mainly for the hypothesis tests and confidence
intervals built on the regression; it is not needed to compute the least-squares
coefficients themselves. You can assess this assumption by examining the
distribution of residuals.
● After running your regression model, calculate the residuals (the
differences between the observed and predicted values).
● Use Excel to create a histogram or a Q-Q plot of the residuals to visually
inspect their distribution.
● Additionally, you can perform a formal test for normality, such as the
Shapiro-Wilk test. Excel has no built-in function for this test, so it
typically requires an add-in such as the Real Statistics Resource Pack.
Linearity Assumption:
● The relationship between the independent and dependent variables should
be linear. You can check this assumption by plotting the observed values
of the dependent variable against the predicted values from your
regression model.
● After running the regression, create a scatter plot in Excel with the
observed values on the y-axis and the predicted values on the x-axis.
● Ensure that the points on the scatter plot are randomly distributed around
a diagonal line, indicating linearity.
● You can also check for linearity by examining residual plots, where
residuals should be randomly distributed around zero for different values
of the independent variables.
Multicollinearity Assumption:
● Multicollinearity occurs when independent variables in a regression model
are highly correlated with each other.
● Calculate correlation coefficients between independent variables using
Excel's CORREL function.
● Alternatively, you can use Excel's Data Analysis Toolpak to perform a
correlation analysis.
● Look for high correlation coefficients (close to +1 or -1) between pairs of
independent variables, indicating potential multicollinearity issues.
● Consider using variance inflation factor (VIF) calculations to quantitatively
assess multicollinearity, which can be computed using Excel formulas
after estimating your regression model.
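The correlation and VIF checks can be sketched in Python as follows. This is a simplified illustration with hypothetical function names and data: it covers only the case of two predictors, where the VIF of each predictor reduces to 1 / (1 - r²) with r the correlation between them. With more predictors, each VIF requires regressing that predictor on all the others.

```python
import math


def pearson_correlation(a, b):
    """Pearson correlation coefficient, the quantity Excel's CORREL returns."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def vif_two_predictors(x1, x2):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_correlation(x1, x2)
    return 1 / (1 - r ** 2)


# Hypothetical predictors
x1 = [1, 2, 3, 4]
x2 = [1, 3, 2, 4]
print("correlation:", pearson_correlation(x1, x2))
print("VIF:", vif_two_predictors(x1, x2))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity, although the exact cutoff is a matter of convention.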
Homoscedasticity Assumption:
● Homoscedasticity means that the variance of the residuals is constant
across all levels of the independent variables.
● After running the regression, plot the residuals against the predicted
values or against each independent variable.
● Ensure that there are no discernible patterns or trends in the residual plot,
indicating constant variance.
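● You can also perform formal tests for homoscedasticity, such as the
Breusch-Pagan test or White's test. Excel has no built-in function for
either, so they require an add-in or a manual calculation based on an
auxiliary regression of the squared residuals.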
Cross-validation and model selection testing are both important techniques used in
machine learning to evaluate and select the best-performing model for a given dataset.
Cross-validation:
● Cross-validation is a resampling technique used to assess how well a
model generalizes to an independent dataset.
● The basic idea is to partition the dataset into multiple subsets or folds.
The model is trained on a portion of the data and validated on the
remaining portion.
● Common types of cross-validation include k-fold cross-validation,
stratified k-fold cross-validation, leave-one-out cross-validation (LOOCV),
etc.
● By repeating this process with different partitions of the data, we can
obtain multiple estimates of model performance. The final performance
metric is often computed as the average across all folds.
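The k-fold procedure described above can be sketched in plain Python. This is a minimal illustration with hypothetical function names and data, using a straight-line model and mean squared error as the performance metric. The folds here are contiguous blocks without shuffling; shuffling before splitting is common for data that is not ordered in time.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_line(xs, ys):
    """Least-squares intercept and slope for one independent variable."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cross_validated_mse(xs, ys, k):
    """Average test-fold mean squared error across k folds."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        b0, b1 = fit_line(train_x, train_y)  # train on the other folds
        fold_mse = sum((ys[i] - (b0 + b1 * xs[i])) ** 2 for i in test_idx) / len(test_idx)
        scores.append(fold_mse)
    return sum(scores) / len(scores)


# Hypothetical data
xs = [1, 2, 3, 4, 5, 6]
ys = [2.9, 5.2, 6.8, 9.1, 11.2, 12.8]
print("3-fold CV mean squared error:", cross_validated_mse(xs, ys, 3))
```

Each observation is used for testing exactly once and for training in the remaining folds, so the averaged score reflects performance on data the model did not see while fitting.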
Model selection testing:
● Model selection refers to the process of choosing the best model or
algorithm from a set of candidate models.
● Model selection testing involves evaluating different models using a
performance metric and selecting the one that performs best on unseen
data.
● This process typically involves comparing the performance of models
using techniques such as cross-validation, holdout validation, or other
validation strategies.
● Performance metrics used for model selection testing depend on the
problem at hand but often include accuracy, precision, recall, F1 score,
ROC AUC, etc.
Each model candidate is trained and evaluated using cross-validation, and the model with the
best average performance across the folds is selected as the final model.
Non-linear regression is used when the relationship between the independent
variables and the dependent variable is not linear. Non-linear
models can capture more complex patterns and relationships in the data compared to
linear models.
Model Representation:
y = f(x, β) + ε
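● Where: y is the dependent variable, x is the independent variable (or set of
independent variables), f is a non-linear function, β is the set of model
parameters, and ε is the error term.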
Model Fitting:
optimization methods.
Model Evaluation:
model fits the data and how well it generalizes to unseen data.
(MAE), etc.
Applications:
complex relationships in the data and are valuable tools for modeling real-world
evaluation techniques.
add-in.
Implementing a non-linear regression model in Excel using the Solver Add-In involves
fitting a curve to data points by minimizing the sum of squared differences between the
observed and predicted values.
Organize your data: Have your independent variable (X) in one column and your
dependent variable (Y) in another.
Choose a Model: Decide on the type of non-linear model you want to fit to your
data, for example the exponential model Y = A * EXP(B * X).
Initial Guess: Provide initial guesses for the parameters A and B, each in its own cell.
Set up the Model in Excel: In another column, calculate the predicted Y values
from the model formula and the current values of A and B.
Sum of Squared Residuals (SSR): Square each residual (observed Y minus predicted Y)
and sum them up. This is the quantity Solver will minimize.
Use Solver Add-In: Go to the "Data" tab, click on "Solver" (if you haven't installed it
yet, you may need to add it from Excel Add-Ins), and set up Solver to minimize the
SSR cell by changing the cells that hold A and B.
Run Solver: Click Solve, and Solver will try different values of A and B to minimize
the SSR.
Analyze Results: Once Solver converges, you'll get the optimal values of
parameters A and B.
Let's say your data is in columns A and B, with X values in column A and Y values in
column B.
In another column, calculate predicted Y values using the formula Y = A * EXP(B * X).
Run Solver, and it will find the optimal values for A and B.
This approach can be generalized to any non-linear model. Just replace the model
formula with the one you want to fit.
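For comparison, the same fit of Y = A * EXP(B * X) can be sketched in Python. This is an illustration under stated assumptions, not a reproduction of what Solver does internally: it swaps in Gauss-Newton iterations with step halving, and it takes its initial guess from a straight-line fit of ln(Y) on X, which requires all Y values to be positive. All function names and data are hypothetical.

```python
import math


def ssr(a, b, xs, ys):
    """Sum of squared residuals for the model y = a * exp(b * x)."""
    return sum((y - a * math.exp(b * x)) ** 2 for x, y in zip(xs, ys))


def initial_guess(xs, ys):
    """Starting values from a linear fit of ln(y) on x (needs y > 0)."""
    if any(y <= 0 for y in ys):
        raise ValueError("this starting rule needs strictly positive y values")
    n = len(xs)
    logs = [math.log(y) for y in ys]
    mean_x, mean_l = sum(xs) / n, sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxl = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
    b = sxl / sxx
    return math.exp(mean_l - b * mean_x), b


def fit_exponential(xs, ys, iterations=100):
    """Minimize the SSR over A and B with Gauss-Newton steps."""
    a, b = initial_guess(xs, ys)
    for _ in range(iterations):
        jaa = jab = jbb = ga = gb = 0.0
        for x, y in zip(xs, ys):
            e = math.exp(b * x)
            r = y - a * e      # residual
            da = e             # derivative of the model with respect to A
            db = a * x * e     # derivative of the model with respect to B
            jaa += da * da
            jab += da * db
            jbb += db * db
            ga += da * r
            gb += db * r
        det = jaa * jbb - jab * jab
        if abs(det) < 1e-12:
            break
        # Solve the 2x2 normal equations for the parameter step
        step_a = (jbb * ga - jab * gb) / det
        step_b = (jaa * gb - jab * ga) / det
        # Halve the step until the SSR does not increase
        current, scale = ssr(a, b, xs, ys), 1.0
        while scale > 1e-8 and ssr(a + scale * step_a, b + scale * step_b, xs, ys) > current:
            scale /= 2
        if scale <= 1e-8:
            break
        a += scale * step_a
        b += scale * step_b
        if abs(step_a) + abs(step_b) < 1e-12:
            break
    return a, b


# Hypothetical data, as if X were in column A and Y in column B
xs = [0, 1, 2, 3, 4, 5]
ys = [2.1, 3.2, 5.6, 8.8, 15.1, 24.0]
a, b = fit_exponential(xs, ys)
print("A:", round(a, 4), "B:", round(b, 4), "SSR:", round(ssr(a, b, xs, ys), 4))
```

As with Solver, the result can depend on the starting values, so it is worth checking that the final SSR is small relative to the scale of the data.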
Time series decomposition separates a series into its underlying
components, typically trend, seasonal, and irregular components. While Excel doesn't
have a built-in function specifically for time series decomposition, you can use some of
its formulas and tools to perform it.
One common method for time series decomposition is the classical decomposition
method.
Here's a general guide on how to perform time series decomposition in Excel using
these steps:
Import Your Time Series Data: Input your time series data into Excel. Typically,
you'll have two columns: one for the dates (time) and another for the
corresponding values.
Estimate the Trend: You can use various methods to estimate the trend, such as
moving averages or linear regression. For instance, you could calculate a moving
average over a certain window of time to smooth out fluctuations and estimate
the trend.
Seasonal Adjustment: To adjust for seasonal effects, you'll need to calculate
seasonal indices. One simple method is to calculate the average value of the
time series for each season (e.g., each month or each quarter) and then calculate
seasonal indices by dividing each seasonal average by the overall average of the
series. Dividing the original values by these seasonal indices gives you
the seasonally adjusted series.
Residual Calculation: Once you have estimated the trend and adjusted for
seasonal effects, you can calculate the residuals (irregular component). With
ratio-based seasonal indices (a multiplicative model), divide the original values
by the product of the trend and the seasonal index. In an additive model, where the
seasonal component is expressed in the same units as the data, subtract the sum of
the trend and seasonal components from the original values.
Visualization and Analysis: Plot the original time series data along with the
estimated trend, seasonal, and irregular components to visualize the
decomposition. You can use Excel charting features for this.
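While performing time series decomposition in Excel can be
laborious, it is possible with careful use of formulas and data manipulation techniques.
For larger or more complex analyses, consider
languages like Python or R that have built-in functions and libraries for time series
analysis.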
Trend:
● Definition: The long-term movement or direction of the data over
time. It represents the underlying pattern in the data that persists
over a long period.
● Characteristics:
● Trends can be increasing, decreasing, or stable over time.
● They reflect changes due to underlying factors such as
population growth, economic cycles, technological
advancements, etc.
● Identification:
● Visual inspection of the time series plot.
Decomposing time series data in Excel using moving averages and seasonal indices
involves estimating the trend and seasonal components separately. Here's how you can
do it step by step:
Import Your Time Series Data: Input your time series data into Excel. You should
have two columns: one for the dates (time) and another for the corresponding
values.
Calculate Moving Averages for Trend Estimation:
● Choose a window size for your moving average. The window size
determines how many consecutive data points are averaged.
● In a new column, calculate the moving average for each data point using
Excel's AVERAGE function combined with relative cell references. For
example, if your time series values are in column B starting from B2, and
you've chosen a window size of 5, in cell C4 (the center of the first window)
you would input =AVERAGE(B2:B6) and drag this formula down to calculate
moving averages for the remaining data points. The first two and last two rows
have no complete window and are left empty.
Calculate Seasonal Indices:
● Determine the periodicity of your seasonal component (e.g., monthly,
quarterly).
● Calculate the average value for each season. For instance, if you have
monthly data, calculate the average value for each month across all years.
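● Divide each seasonal average by the overall average of the series to
obtain seasonal indices. An index above 1 marks a season that runs above
the overall average, and an index below 1 marks one that runs below it.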
● In Excel, you can calculate these seasonal averages manually or use
functions like AVERAGEIFS or PivotTables.
● Once you have the seasonal indices, expand them to match the length of
your time series data.
Calculate Seasonally Adjusted Values:
● Divide the original time series values by the seasonal indices to obtain
seasonally adjusted values. You can do this in a new column.
● This step removes the seasonal component from the original data, leaving
the trend and irregular components.
Calculate Residuals (Irregular Component):
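● Divide the seasonally adjusted values by the trend (moving averages)
to obtain residuals, which keeps the calculation consistent with the
ratio-based seasonal indices. Values close to 1 indicate little irregular variation.
● Residuals represent the irregular component of the time series data.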
Visualize the Components:
● Plot the original time series data, moving averages (trend), seasonal
indices, and residuals to visualize how each component contributes to the
overall series.
● Excel's charting features can be used for this purpose.
By following these steps, you can decompose your time series data into its trend,
seasonal, and irregular components using moving averages and seasonal indices in
Excel.
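The decomposition steps can also be sketched in Python. This is a simplified illustration with hypothetical function names and data, not a full classical decomposition: it assumes a multiplicative model, an odd moving-average window, and seasonal indices computed as each seasonal average divided by the overall average.

```python
def centered_moving_average(values, window):
    """Centered moving average for an odd window; edge positions are None."""
    if window % 2 == 0:
        raise ValueError("this sketch only handles odd window sizes")
    half = window // 2
    out = [None] * len(values)
    for i in range(half, len(values) - half):
        out[i] = sum(values[i - half : i + half + 1]) / window
    return out


def seasonal_indices(values, period):
    """One index per season: seasonal average divided by overall average."""
    overall = sum(values) / len(values)
    indices = []
    for season in range(period):
        season_values = values[season::period]
        indices.append((sum(season_values) / len(season_values)) / overall)
    return indices


def decompose(values, period, window):
    """Return trend, seasonal, seasonally adjusted, and irregular series."""
    trend = centered_moving_average(values, window)
    idx = seasonal_indices(values, period)
    seasonal = [idx[i % period] for i in range(len(values))]
    adjusted = [v / s for v, s in zip(values, seasonal)]
    irregular = [None if t is None else a / t for a, t in zip(adjusted, trend)]
    return trend, seasonal, adjusted, irregular


# Hypothetical quarterly data for three years
sales = [20, 35, 50, 30, 24, 40, 56, 33, 27, 46, 62, 38]
trend, seasonal, adjusted, irregular = decompose(sales, period=4, window=5)
print("seasonal indices:", [round(s, 3) for s in seasonal[:4]])
print("seasonally adjusted:", [round(a, 1) for a in adjusted])
```

Libraries such as statsmodels provide ready-made decomposition routines; the sketch above only mirrors the manual spreadsheet steps.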
and models. Here are some advanced techniques commonly used for time series
forecasting:
When applying advanced time series forecasting techniques, it's essential to evaluate
model performance using appropriate metrics and consider factors such as data
quality, seasonality, trend patterns, and the forecasting horizon. Additionally, model
taken into account when selecting the most suitable technique for a particular
forecasting task.
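$$Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \cdots + \phi_p Y_{t-p} + \varepsilon_t$$
Where: $Y_t$ is the value of the series at time t, $c$ is a constant,
$\phi_1, \ldots, \phi_p$ are the autoregressive coefficients, $p$ is the order of
the model, and $\varepsilon_t$ is the random error term at time t.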
The AR model captures the linear relationship between the current value of the
time series and its past values.
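A one-step-ahead forecast from an AR model can be sketched in Python as follows. The function name, coefficients, and data are hypothetical, and the coefficients are assumed to be already estimated; the sketch only evaluates the AR equation with the error term set to its expected value of zero.

```python
def ar_forecast(history, c, phi):
    """One-step-ahead AR(p) forecast: c + phi_1*Y[t-1] + ... + phi_p*Y[t-p]."""
    p = len(phi)
    if p == 0 or len(history) < p:
        raise ValueError("need p >= 1 and at least p past values")
    most_recent_first = history[-p:][::-1]  # Y[t-1], Y[t-2], ..., Y[t-p]
    return c + sum(coef * value for coef, value in zip(phi, most_recent_first))


# Hypothetical AR(2) model and series
history = [1.0, 2.0, 3.0]
print(ar_forecast(history, c=0.5, phi=[0.6, 0.3]))  # 0.5 + 0.6*3.0 + 0.3*2.0
```

Forecasts further ahead can be produced by appending each forecast to the history and calling the function again.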
The Moving Average (MA) model is another time series forecasting technique
that predicts future values based on the weighted sum of past prediction errors.
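In an MA model of order q (denoted MA(q)), the current value $Y_t$ is
modeled as a function of the q most recent prediction errors.
$$Y_t = \mu + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q} + \varepsilon_t$$
Where: $\mu$ is the mean of the series, $\theta_1, \ldots, \theta_q$ are the
moving average coefficients, $q$ is the order of the model, and
$\varepsilon_t, \varepsilon_{t-1}, \ldots$ are the random error terms at the current
and previous time points.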
The MA model captures the dependence between the current value of the time series
and the residual errors from previous predictions.
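The MA equation can be evaluated in the same way as a short Python sketch. The function name, coefficients, and error values are hypothetical; the past errors are assumed to be known from earlier forecasts, and the current error term is replaced by its expected value of zero.

```python
def ma_forecast(past_errors, mu, theta):
    """One-step-ahead MA(q) forecast: mu + theta_1*e[t-1] + ... + theta_q*e[t-q]."""
    q = len(theta)
    if q == 0 or len(past_errors) < q:
        raise ValueError("need q >= 1 and at least q past errors")
    most_recent_first = past_errors[-q:][::-1]  # e[t-1], e[t-2], ..., e[t-q]
    return mu + sum(coef * err for coef, err in zip(theta, most_recent_first))


# Hypothetical MA(2) model: the last two forecast errors were 0.5, then -1.0
print(ma_forecast([0.5, -1.0], mu=10.0, theta=[0.4, 0.2]))  # 10 + 0.4*(-1.0) + 0.2*0.5
```

Beyond q steps ahead, no past errors remain in the equation, so the forecast of an MA(q) model settles at the mean of the series.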
Implementing advanced forecasting models such as ARIMA in Excel
using custom formulas alone can be quite challenging due to the complexity of these
models. However, you can use Excel in conjunction with add-ins or external tools to
perform advanced forecasting. One such popular add-in for Excel is the "Solver" add-in,
which can be used to optimize parameters for simpler models like exponential
smoothing.
Here's a general approach using Solver and an external tool like R or Python for ARIMA
forecasting:
This approach leverages the strengths of both Excel and external tools like R or Python
to perform advanced time series forecasting. While Excel may not be suitable for
directly implementing complex forecasting models, it can still be a valuable tool for data
software.