Data Science Analytics Finals Reviewer
In statistical modeling, regression analysis is used to estimate the relationships between two
or more variables:
Dependent variable (aka criterion variable) is the main factor you are trying to understand and
predict.
Independent variables (aka explanatory variables, or predictors) are the factors that might
influence the dependent variable.
Regression analysis helps you understand how the dependent variable changes when one of
the independent variables varies, and it allows you to determine mathematically which of those
variables really has an impact.
Statisticians differentiate between simple and multiple linear regression. Simple linear
regression models the relationship between a dependent variable and one independent
variable using a linear function. If you use two or more explanatory variables to predict the
dependent variable, you are dealing with multiple linear regression. If the dependent variable is
modeled as a non-linear function because the data relationships do not follow a straight line,
use nonlinear regression instead.
As an example, let's take sales numbers for umbrellas for the last 24 months and find out the
average monthly rainfall for the same period. Plot this information on a chart, and the regression
line will demonstrate the relationship between the independent variable (rainfall) and dependent
variable (umbrella sales):
y = bx + a + ε
Where:
x is an independent variable.
y is a dependent variable.
a is the Y-intercept, which is the expected mean value of y when all x variables are equal
to 0. On a regression graph, it's the point where the line crosses the Y axis.
b is the slope of a regression line, which is the rate of change for y as x changes.
ε is the random error term, which is the difference between the actual value of a
dependent variable and its predicted value.
The linear regression equation always has an error term because, in real life, predictors are
never perfectly precise. However, some programs, including Excel, do the error term calculation
behind the scenes. So, in Excel, you do linear regression using the least squares method and
seek coefficients a and b such that:
y = bx + a
For our example, the linear regression equation takes the following shape:
Umbrellas sold = b * rainfall + a
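As a rough illustration of what the least squares method does, the sketch below computes a and b with the closed-form formulas for simple linear regression. It is written in Python rather than Excel, and the function name `fit_line` and the five data points are hypothetical; they are not the reviewer's 24-month data set.

```python
def fit_line(xs, ys):
    """Return (a, b) that minimize the sum of squared residuals for y = b*x + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line always passes through the point of means.
    a = mean_y - b * mean_x
    return a, b


# Hypothetical rainfall (mm) and umbrella sales, chosen to lie on a straight line.
rainfall = [50, 60, 70, 80, 90]
umbrellas = [6, 10, 14, 18, 22]

a, b = fit_line(rainfall, umbrellas)
print(f"Umbrellas sold = {b:.2f} * rainfall + {a:.2f}")
```

Because the sample points were chosen to be perfectly linear, the error term is zero here; real data would leave non-zero residuals.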
There are several ways to find a and b. The three main methods to perform linear
regression analysis in Excel are:
● Regression tool included with the Analysis ToolPak
● Scatter chart with a trendline
● Linear regression formulas
Below you will find the detailed instructions on using each method.
Analysis ToolPak is available in all versions of Excel 365 to 2003 but is not enabled by default.
So, you need to turn it on manually. Here's how:
This will add the Data Analysis tools to the Data tab of your Excel ribbon.
In this example, we are going to do a simple linear regression in Excel. What we have is a list of
average monthly rainfall for the last 24 months in column B, which is our independent variable
(predictor), and the number of umbrellas sold in column C, which is the dependent variable. Of
course, there are many other factors that can affect sales, but for now we focus only on these
two variables:
With the Analysis ToolPak enabled, carry out these steps to perform regression analysis in
Excel:
On the Data tab, in the Analysis group, click the Data Analysis button.
As you have just seen, running regression in Excel is easy because all calculations are
performed automatically. The interpretation of the results is a bit trickier because you need to
know what is behind each number. Below you will find a breakdown of 4 major parts of the
regression analysis output.
Multiple R. It is the Correlation Coefficient that measures the strength of a linear relationship
between two variables. The correlation coefficient can be any value between -1 and 1, and its
absolute value indicates the relationship strength. The larger the absolute value, the stronger
the relationship:
● 1 means a strong positive relationship
● -1 means a strong negative relationship
● 0 means no relationship at all
R Square. It is the coefficient of determination, which is used as an indicator of goodness of fit.
In our example, R² is 0.91 (rounded to 2 digits), which is fairly good. It means that 91% of the
variation in the dependent variable (y-values) is explained by the independent variable
(x-values). Generally, an R Squared of 95% or more is considered a good fit.
Adjusted R Square. It is the R square adjusted for the number of independent variables in the
model. You will want to use this value instead of R square for multiple regression analysis.
Standard Error. It is another goodness-of-fit measure that shows the precision of your
regression analysis: the smaller the number, the more certain you can be about your regression
equation. While R² represents the percentage of the dependent variable's variance that is
explained by the model, Standard Error is an absolute measure that shows the average
distance that the data points fall from the regression line.
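To make the summary statistics concrete, here is a minimal Python sketch that computes Multiple R, R Square, Adjusted R Square, and Standard Error for a simple regression. The function name `regression_statistics` and the five sample points are hypothetical, and the formulas are the standard textbook definitions, assumed to match what Excel's summary output reports.

```python
import math


def regression_statistics(xs, ys):
    """Goodness-of-fit measures for a simple linear regression y = b*x + a."""
    n = len(xs)
    k = 1  # number of independent variables in a simple regression
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x

    # Residual and total sums of squares.
    ss_residual = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_total = sum((y - mean_y) ** 2 for y in ys)

    r_square = 1 - ss_residual / ss_total
    return {
        "multiple_r": math.sqrt(r_square),
        "r_square": r_square,
        # Penalizes R Square for each additional predictor.
        "adjusted_r_square": 1 - (1 - r_square) * (n - 1) / (n - k - 1),
        # Typical distance of the data points from the regression line.
        "standard_error": math.sqrt(ss_residual / (n - k - 1)),
    }


# Hypothetical sample, not the reviewer's rainfall data.
stats = regression_statistics([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```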
The ANOVA (analysis of variance) section splits the sum of squares into individual components
that give information about the levels of variability within your regression model:
df is the number of the degrees of freedom associated with the sources of variance.
SS is the sum of squares. The smaller the Residual SS compared with the Total SS, the
better your model fits the data.
MS is the mean square.
F is the F statistic, or F-test for the null hypothesis. It is used to test the overall
significance of the model.
Significance F is the P-value of F.
The ANOVA part is rarely used for a simple linear regression analysis in Excel, but you should
definitely have a close look at the last component. The Significance F value gives an idea of
how reliable (statistically significant) your results are. If Significance F is less than 0.05 (5%),
your model is OK. If it is greater than 0.05, you'd probably better choose another independent
variable.
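The ANOVA columns described above can be reproduced by hand. The following Python sketch computes df, SS, MS, and F for a simple regression on a hypothetical five-point sample; the function name `anova_table` is an assumption made for illustration. Significance F is not computed because it requires the tail probability of the F distribution, which Python's standard library does not provide.

```python
def anova_table(xs, ys):
    """Return df, SS, MS and F for a simple linear regression."""
    n = len(xs)
    k = 1  # one independent variable
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x

    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_residual = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_regression = ss_total - ss_residual

    df_regression = k
    df_residual = n - k - 1
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    return {
        "df": (df_regression, df_residual, n - 1),
        "SS": (ss_regression, ss_residual, ss_total),
        "MS": (ms_regression, ms_residual),
        "F": ms_regression / ms_residual,
    }


# Hypothetical sample, not the reviewer's rainfall data.
table = anova_table([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(table)
```

A larger F means the regression explains much more variability than is left in the residuals.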
This section provides specific information about the components of your analysis:
The most useful component in this section is Coefficients. It enables you to build a linear
regression equation in Excel:
y = bx + a
For our data set, where y is the number of umbrellas sold and x is an average monthly rainfall,
our linear regression formula goes as follows:
Y=0.45*x-19.074
For example, with the average monthly rainfall equal to 82 mm, the umbrella sales would be
approximately 17.8:
0.45*82-19.074=17.8
In a similar manner, you can find out how many umbrellas are going to be sold with any other
monthly rainfall (x variable) you specify.
If you compare the estimated and actual number of sold umbrellas corresponding to the monthly
rainfall of 82 mm, you will see that these numbers are slightly different:
Why the difference? Because independent variables are never perfect predictors of the
dependent variables. The residuals help you understand how far the actual values are from
the predicted values.
For the first data point (rainfall of 82 mm), the residual is approximately -2.8. So, we add this
number to the predicted value, and get the actual value: 17.8 - 2.8 = 15.
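The prediction and residual arithmetic above can be checked with a few lines of Python. The coefficients 0.45 and -19.074, the rainfall of 82 mm, and the actual sales of 15 come from the example; the function and variable names are hypothetical.

```python
SLOPE = 0.45         # b, from the Coefficients output in the example
INTERCEPT = -19.074  # a, from the Coefficients output in the example


def predict_umbrellas(rainfall_mm):
    """Estimate umbrella sales from monthly rainfall using y = b*x + a."""
    return SLOPE * rainfall_mm + INTERCEPT


predicted = predict_umbrellas(82)
actual = 15
# Residual = actual value minus predicted value.
residual = actual - predicted

print(f"predicted: {predicted:.1f}")
print(f"residual:  {residual:.1f}")
```

Adding the residual back to the predicted value recovers the actual value, which is the relationship the text describes.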
How to make a linear regression graph in Excel
If you need to quickly visualize the relationship between the two variables, draw a linear
regression chart. That's very easy! Here's how:
This will insert a scatter plot in your worksheet, which will resemble this one:
Now, we need to draw the least squares regression line. To have it done, right click
on any point and choose Add Trendline… from the context menu.
On the right pane, select the Linear trendline shape and, optionally, check Display
Equation on Chart to get your regression formula:
As you may notice, the regression equation Excel has created for us is the same as
the linear regression formula we built based on the Coefficients output.
Switch to the Fill & Line tab and customize the line to your liking. For example, you
can choose a different line color and use a solid line instead of a dashed line (select
Important note! In the regression graph, the independent variable should always be on
the X axis and the dependent variable on the Y axis. If your graph is plotted in the
reverse order, swap the columns in your worksheet, and then draw the chart anew. If you
are not allowed to rearrange the source data, then you can switch the X and Y axes
directly in a chart.
Microsoft Excel has a few statistical functions that can help you to do linear regression analysis
such as LINEST, SLOPE, INTERCEPT, and CORREL.
The LINEST function uses the least squares regression method to calculate a straight line that
best explains the relationship between your variables and returns an array describing that line.
For now, let's just make a formula for our sample dataset:
=LINEST(C2:C25, B2:B25)
Because the LINEST function returns an array of values, you must enter it as an array formula.
Select two adjacent cells in the same row, E2:F2 in our case, type the formula, and press Ctrl +
Shift + Enter to complete it.
The formula returns the b coefficient (E2) and the a constant (F2) for the already familiar linear
regression equation:
y = bx + a
If you avoid using array formulas in your worksheets, you can calculate a and b individually with
regular formulas:
Get the Y-intercept (a):
=INTERCEPT(C2:C25, B2:B25)
Get the slope (b):
=SLOPE(C2:C25, B2:B25)
Additionally, you can find the correlation coefficient (Multiple R in the regression analysis
summary output) that indicates how strongly the two variables are related to each other:
=CORREL(B2:B25,C2:C25)
The following screenshot shows all these Excel regression formulas in action:
Analyze Time Series Data in Excel
Understanding Time Series Data
Time series data is a sequence of observations or measurements taken at successive points
in time. This time order distinguishes it from cross-sectional data, which captures many
subjects at a single moment, and provides a dynamic perspective on the evolution of a
phenomenon. Common examples include stock prices, temperature readings, monthly sales
figures, and daily website traffic statistics.
● Temporal Order: Time series data follows a clear sequence, with each data point
corresponding to a specific point in time.
● Seasonality: Certain patterns or trends may repeat at regular intervals, reflecting
seasonal variations or recurring cycles.
● Irregularity: Unpredictable and random fluctuations, known as irregular components, may
be present in time series data.
● Trends and Patterns: Time series data frequently exhibits trends, cycles, or other
patterns that reflect underlying dynamics or recurring phenomena.
Simply put, time series data tells a story about changes, trends, and unusual events over time.
Analyzing this data helps decision-makers make informed choices for the future, using lessons
from the past.
Data Preparation for Time Series Analysis
Time series analysis involves examining and modelling data points collected over time to
identify patterns, trends, and make predictions. However, before delving into the analysis itself,
it is crucial to ensure that the time series data is clean, well-organized, and free from anomalies.
This section will discuss the essential steps required to prepare time series data, addressing
issues such as missing values, outliers, and irregularities.
● Exploratory Data Analysis – Identify the time variable, assess data distributions, and gain
insights into the overall data patterns.
● Duplicate Record Removal – Check for and eliminate duplicate records. Duplicate
entries can distort analyses, and their presence may be indicative of data entry errors or
system malfunctions.
Managing Outliers
Addressing Irregularities
● Time Irregularities – Inspect the time sequence for irregularities such as gaps or
overlaps. Ensure a consistent time frequency and address any irregularities by adjusting
timestamps or interpolating missing time points.
● Decomposition of Components – Decompose the time series into its underlying
components, including seasonal and trend elements.
Documentation and Logging
In summary, thorough data preparation is fundamental for accurate and meaningful time series
analysis. Addressing missing values, outliers, and irregularities ensures that the data accurately
represents the underlying patterns, allowing for more reliable insights and predictions.
● Line Chart – Excel’s Line Chart is a fundamental tool for visualizing time series data. It
connects data points with a line, making it easy to observe trends, fluctuations, and
patterns over time.
● Scatter Plot – Scatter plots in Excel allow the display of individual data points, offering a
clear representation of how each observation contributes to the overall time series. This
is particularly useful for identifying outliers or anomalies.
● Area Chart – Area charts can be employed to illustrate cumulative changes over time.
They are effective in showcasing trends and variations while providing a sense of the
overall magnitude of the time series.
● Mean (Average): The arithmetic mean represents the central tendency of the data. It is
calculated by summing all values and dividing by the number of observations.
Excel Function: =AVERAGE(data_range)
● Median: The median is the middle value in a dataset when it is ordered. It is less
sensitive to extreme values than the mean and provides a robust measure of central
tendency.
● Box Plots – Create box plots in Excel to visualize the distribution, central tendency, and
variability of the time series data. Box plots display the median, quartiles, and potential
outliers.
● Histograms – Excel’s histogram tool allows for the visualization of the frequency
distribution of time series data. This provides insights into the shape of the distribution.
● Trends and Seasonality – Analyze the mean and standard deviation over time to identify
trends and seasonality patterns within the time series.
● Outliers – Examine skewness and kurtosis, along with visualizations, to detect outliers or
extreme values that may impact the analysis.
Time Series Decomposition
Time series decomposition is a powerful technique used to break down a time series into its
constituent components. This process helps analysts understand and separate the underlying
patterns, trends, seasonality, and random noise present in the data. In this section, we will
introduce the concept of time series decomposition and show how to perform it using Excel.
● Trend – The long-term movement or direction in the time series. It represents the
underlying growth or decline in the data.
● Seasonality – The repetitive and predictable patterns that occur at fixed intervals within
the time series. Seasonality often corresponds to regular, recurring events such as daily,
weekly, or yearly cycles.
● Noise (Residual) – The irregular and unpredictable fluctuations in the time series that
cannot be attributed to the trend or seasonality. It represents random variation or
measurement errors.
Moving Averages
Moving averages are valuable tools for revealing underlying trends and patterns within data by
reducing noise and short-term fluctuations. This section will also guide you through the practical
application of moving averages and smoothing techniques using Excel.
● Simple Moving Average (SMA) – A basic moving average calculated by averaging a set
of values over a specified time window. It is useful for smoothing out short-term
fluctuations and highlighting long-term trends.
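To calculate the simple moving average in Excel, use the AVERAGE function over a window of
consecutive cells. Create a new column and input a formula like the one below, where
data_range covers only the cells in the current window (for example, the three most recent
values for a 3-period average), then fill the formula down the column:
=AVERAGE(data_range)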
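The windowing idea behind the simple moving average can be sketched in Python as follows. The function name `simple_moving_average` and the sample values are hypothetical; each output value is the mean of the current observation and the `window - 1` observations before it.

```python
def simple_moving_average(values, window):
    """Return the trailing moving average for each full window in values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    # Start at the first position that has a full window behind it.
    return [
        sum(values[end - window + 1:end + 1]) / window
        for end in range(window - 1, len(values))
    ]


# Hypothetical monthly figures.
sales = [10, 20, 30, 40, 50]
print(simple_moving_average(sales, 3))  # [20.0, 30.0, 40.0]
```

The result is shorter than the input by `window - 1` values, just as the first rows of the Excel column have no full window to average.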
● Exponential Moving Average (EMA) – A weighted moving average that gives more
importance to recent observations. It responds more quickly to changes, making it
suitable for capturing trends in rapidly changing data.
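Excel has no built-in EMA worksheet function. Compute the EMA recursively instead: seed the
first value with the first observation, then weight each new observation by the smoothing
factor (α, a constant between 0 and 1) and the previous EMA by 1 - α, filling the formula down:
=smoothing_factor*current_value+(1-smoothing_factor)*previous_EMA
Alternatively, the Exponential Smoothing tool in the Analysis ToolPak produces a comparable
series; it asks for a damping factor, which equals 1 - α.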
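A minimal Python sketch of the exponential moving average recursion, EMA_t = α·x_t + (1 - α)·EMA_(t-1), is shown below. Seeding the series with the first observation is one common convention and is an assumption here; the function name and sample values are hypothetical.

```python
def exponential_moving_average(values, alpha):
    """Return the EMA series, seeded with the first observation."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in the interval (0, 1]")
    ema = [values[0]]
    for x in values[1:]:
        # Recent data gets weight alpha; the running average gets the rest.
        ema.append(alpha * x + (1 - alpha) * ema[-1])
    return ema


# Hypothetical series with a smoothing factor of 0.5.
print(exponential_moving_average([10, 20, 30], 0.5))  # [10, 15.0, 22.5]
```

A larger α makes the series follow recent observations more closely, while a smaller α smooths more heavily.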
Smoothing Techniques
● Double Exponential Smoothing (Holt’s Method) – Holt’s method extends the concept of
exponential smoothing to capture both the level and the trend in time series data. The
Exponential Smoothing tool in Excel’s Data Analysis ToolPak performs only single smoothing,
so double exponential smoothing has to be built with worksheet formulas.
● Triple Exponential Smoothing (Holt-Winters Method) – The Holt-Winters method includes an
additional component to account for seasonality. In Excel 2016 and later, the FORECAST.ETS
function applies an additive form of triple exponential smoothing.
● Plotting Original vs. Smoothed Data – Create a time series plot that overlays the original
data with the smoothed data. This visual representation helps in comparing the
effectiveness of smoothing techniques.
● Assessing Trend and Seasonality – Analyze the smoothed data to identify underlying
trends and seasonality. Smoothing techniques reveal patterns that may be obscured by
noise in raw time series data.
● Dynamic Windows – Use dynamic window sizes for moving averages by incorporating
Excel functions like OFFSET or INDEX. This allows for flexibility in adapting to different
time series characteristics.
● Visualization Tools – Leverage Excel’s charting capabilities to visualize the impact of
moving averages and smoothing techniques on the time series data. Consider creating
side-by-side plots for easy comparison.
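Returning to Holt's double exponential smoothing from the Smoothing Techniques list, the sketch below tracks a level and a trend and projects them forward. The initialization (level from the first value, trend from the first difference) is one simple convention and is an assumption, as are the function names and sample series.

```python
def holt_linear(values, alpha, beta):
    """Return the final (level, trend) after Holt's double exponential smoothing."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    level = values[0]
    trend = values[1] - values[0]
    for x in values[1:]:
        previous_level = level
        # Level: blend the new observation with the previous one-step forecast.
        level = alpha * x + (1 - alpha) * (previous_level + trend)
        # Trend: blend the latest change in level with the previous trend.
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return level, trend


def holt_forecast(values, alpha, beta, steps_ahead):
    """Forecast steps_ahead periods past the end of the series."""
    level, trend = holt_linear(values, alpha, beta)
    return level + steps_ahead * trend


# Hypothetical series that rises by 2 each period.
series = [10, 12, 14, 16, 18]
print(holt_forecast(series, alpha=0.5, beta=0.5, steps_ahead=1))
```

On a perfectly linear series the method reproduces the line, so the one-step forecast simply continues the pattern.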
=TREND(data_range, timeline_range)
● Noise (Residual) – The residual component represents the noise or random fluctuations.
It can be obtained by subtracting the sum of the trend and seasonal components from
the original data.
● Visualization – Create charts or plots in Excel to visualize the trend, seasonality, and
residual components separately. This allows for a clear understanding of each
component’s contribution to the overall time series.
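The trend, seasonal, and residual components described above can be sketched in Python as an additive decomposition. This version fits a straight-line trend (similar in spirit to a linear trend function), averages the detrended values by position in the cycle to get the seasonal component, and takes the remainder as the residual. The linear trend is a simplifying assumption; classical decomposition often uses a centered moving average instead. The function name and the quarterly figures are hypothetical.

```python
def decompose_additive(values, period):
    """Split values into (trend, seasonal, residual) with value = sum of the three."""
    n = len(values)
    times = list(range(n))
    mean_t = sum(times) / n
    mean_y = sum(values) / n
    # Least squares straight line through (time, value) as the trend.
    slope = (sum((t - mean_t) * (y - mean_y) for t, y in zip(times, values))
             / sum((t - mean_t) ** 2 for t in times))
    intercept = mean_y - slope * mean_t
    trend = [intercept + slope * t for t in times]

    # Average the detrended values at each position within the cycle.
    detrended = [y - tr for y, tr in zip(values, trend)]
    seasonal_means = []
    for position in range(period):
        same_position = detrended[position::period]
        seasonal_means.append(sum(same_position) / len(same_position))
    seasonal = [seasonal_means[i % period] for i in range(n)]

    # Residual: original data minus the sum of trend and seasonal components.
    residual = [y - tr - s for y, tr, s in zip(values, trend, seasonal)]
    return trend, seasonal, residual


# Hypothetical quarterly revenue for two years (period of 4).
revenue = [10, 14, 8, 12, 14, 18, 12, 16]
trend, seasonal, residual = decompose_additive(revenue, period=4)
print([round(s, 2) for s in seasonal[:4]])
```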
ARIMA Forecasting
● Mean Absolute Error (MAE) – MAE represents the average absolute difference between
actual and forecasted values. It is expressed in the same units as the data, making it
easy to interpret.
● Root Mean Square Error (RMSE) – RMSE penalizes larger errors more significantly than
MAE. It provides a measure of the typical size of the forecast errors.
● Mean Absolute Percentage Error (MAPE) – MAPE expresses the average percentage
difference between actual and forecasted values. It is particularly useful when dealing
with datasets with varying scales.
Implementing Accuracy Metrics in Excel
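● Calculating MAE in Excel – Use the ABS function to calculate absolute differences and
AVERAGE to find the mean. In versions before Excel 365, confirm this and the following
formulas with Ctrl + Shift + Enter, because they operate on whole ranges.
= AVERAGE(ABS(Actual_range - Forecast_range))
● Calculating RMSE in Excel – Excel does not have a built-in RMSE function, but it can be
computed using the following formula:
= SQRT(AVERAGE((Actual_range - Forecast_range)^2))
● Calculating MAPE in Excel – Similarly, calculate MAPE using the following formula, which
assumes that no actual value is zero:
= AVERAGE(ABS((Actual_range - Forecast_range) / Actual_range)) * 100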
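The three accuracy metrics can also be sketched in Python from their standard definitions. The function names and the sample actual and forecast values are hypothetical. MAPE divides by the actual values, so it is undefined when any actual value is zero.

```python
import math


def mae(actual, forecast):
    """Mean Absolute Error: average size of the errors, in the data's units."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    """Root Mean Square Error: squaring penalizes large errors more heavily."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent; actual values must be non-zero."""
    return 100 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


# Hypothetical actual and forecasted values.
actual = [100, 200, 300]
forecast = [110, 190, 330]

print(f"MAE:  {mae(actual, forecast):.2f}")
print(f"RMSE: {rmse(actual, forecast):.2f}")
print(f"MAPE: {mape(actual, forecast):.2f}%")
```

RMSE is never smaller than MAE on the same data, and the gap widens when a few errors are much larger than the rest.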
● Financial Time Series Analysis – Analyzing daily stock prices to predict short-term trends
and volatility.
● Sales Forecasting for Retail – Predicting future sales for a retail business based on
historical sales data.
● Energy Consumption Prediction – Forecasting future energy consumption to optimize
resource allocation.
● Website Traffic Analysis – Analyzing daily website traffic to identify patterns and plan
server capacity.
● Temperature Forecasting for Agriculture – Predicting future temperatures to assist
farmers in planning crop cycles.
● Inventory Management – Forecasting inventory demand to optimize stock levels and
reduce holding costs.
Step 1 – Input Time Series Data
We are going to use a company’s quarterly revenue in two specific years.
● Put the year series data in column B. In our case, the data covers only two years.
● Input the quarter of each year. You can use a repeating sequence for that or use
AutoFill.
● Insert the total revenue in every quarter.
=0.7*D6+0.3*E6
=SQRT(SUMXMY2(D6:D8,E6:E8)/3)
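The GROWTH function calculates predicted exponential growth from existing data: it returns
the y-values for a series of new x-values. We can also use this function to fit an exponential
curve to already-existing x- and y-values.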
=GROWTH($D$5:$D$12,$C$5:$C$12,C5,TRUE)
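To show the idea behind an exponential curve fit, the Python sketch below fits y = b·m^x by running a least squares line through the natural logarithm of the y-values, then predicts a new point. This is assumed to mirror the general approach of GROWTH with the constant enabled; the function names and the sample data are hypothetical, and all y-values must be positive.

```python
import math


def fit_exponential(xs, ys):
    """Return (b, m) for the curve y = b * m**x, fitted on ln(y)."""
    logs = [math.log(y) for y in ys]  # requires every y > 0
    n = len(xs)
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    slope = (sum((x - mean_x) * (lg - mean_log) for x, lg in zip(xs, logs))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_log - slope * mean_x
    # Undo the logarithm to get the base m and the constant b.
    return math.exp(intercept), math.exp(slope)


def growth(xs, ys, new_x):
    """Predict y at new_x from the fitted exponential curve."""
    b, m = fit_exponential(xs, ys)
    return b * m ** new_x


# Hypothetical data that doubles every period.
periods = [0, 1, 2, 3]
revenue = [2, 4, 8, 16]
print(round(growth(periods, revenue, 4), 2))
```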
Prescriptive Analytics is the third and most advanced phase in the data analytics process,
following Descriptive and Predictive Analytics. While Descriptive Analytics answers the question
"What happened?" and Predictive Analytics answers "What is likely to happen?", Prescriptive
Analytics answers "What should be done?"
It involves the use of data, algorithms, and advanced computational techniques to recommend
actions and predict their outcomes. By integrating insights from past and future trends,
Key Characteristics
various decisions.
3. Optimization: Uses algorithms to find the best course of action among multiple options.
4. Integration with Systems: Can work alongside operational systems like supply chain
Applications in Business
effectively.
● Transportation: Scheduling delivery routes to minimize fuel costs and meet delivery
deadlines.
sources:
2. Analytical Models
● Machine Learning Models: Adapt and improve recommendations over time by learning
1. What-If Analysis
A technique that evaluates how different decisions impact outcomes by changing input
variables. It is commonly used in Excel through tools like Data Tables and Scenario Manager.
Example: A retail store wants to know how changes in the price of a product affect total
revenue. Using What-If Analysis, they can simulate various price points and observe the
2. Goal Seek
Goal Seek in Excel determines the input value required to achieve a specific outcome.
Example: A company wants to find the sales volume needed to achieve a profit of $50,000. By
using Goal Seek, they can calculate the exact number of units required.
3. Solver
Solver is an Excel add-in for solving optimization problems. It identifies the best decision by
Example: A logistics company needs to minimize transportation costs while ensuring timely
1. Maximize profit.
1. Data Collection: Gather historical sales data, food preparation costs, and customer
preferences.
2. Model Development: Use Solver to determine optimal menu pricing and inventory
management strategies.
Challenges
Emerging Trends
Conclusion
Prescriptive Analytics represents the pinnacle of data analytics, transforming raw data into
actionable strategies. With its power to optimize outcomes and guide decision-making, it is an
What-If Analysis is an option on Excel's Data tab. By changing the input values in some cells,
you can see the effect on the output, which reveals the relationship between input values and
output values. In this article, we will learn how to use what-if analysis with data tables
effectively.
What-if analysis is a procedure in Excel that works on data in tabular form. It lets you try a
variety of values in the cells of a single sheet and see the different results without creating
separate sheets. There are three tools of what-if analysis.
1. Goal seek
2. Scenario manager
3. Data Table
Goal seek
In Goal Seek, we already know the output value and have to find the input value that produces
it. For example, a student wants to find his English marks when he knows the marks in all the
other subjects and the total marks across all subjects.
Step 1: Enter all subjects and their marks in an Excel sheet and total them with the SUM
function.
Step 2: Go to the Data tab on the ribbon.
Step 3: Select What-If Analysis.
Step 6: In the To value box, enter the target value. The target value for this example is
440.
Step 7: In the By changing cell box, enter the cell that should receive the marks in English.
Provide an absolute cell reference, i.e. $D$5.
Step 8: Click OK and see the result. The estimated marks for English are 71.
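The idea behind Goal Seek, searching for the input that produces a known output, can be sketched in Python with a bisection search. Bisection is a stand-in chosen for clarity and is not claimed to be the method Excel uses. The target of 440 and the result of 71 come from the example; the marks for the other four subjects are hypothetical values chosen to sum to 369.

```python
def goal_seek(func, target, low, high, tolerance=1e-9, max_iterations=200):
    """Find x in [low, high] with func(x) close to target, by bisection.

    Assumes func is continuous and that func(low) and func(high)
    lie on opposite sides of the target.
    """
    f_low = func(low) - target
    f_high = func(high) - target
    if f_low * f_high > 0:
        raise ValueError("target is not bracketed by low and high")
    for _ in range(max_iterations):
        mid = (low + high) / 2
        f_mid = func(mid) - target
        if abs(f_mid) < tolerance:
            return mid
        if f_low * f_mid <= 0:
            high = mid                # the solution lies in the lower half
        else:
            low, f_low = mid, f_mid   # the solution lies in the upper half
    return (low + high) / 2


# Hypothetical marks for the four known subjects.
other_marks = [90, 95, 92, 92]


def total_marks(english):
    """Output cell: the sum of all marks, with English as the changing cell."""
    return sum(other_marks) + english


english = goal_seek(total_marks, target=440, low=0, high=100)
print(round(english, 2))
```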
Scenario Manager
In Scenario Manager, we create different scenarios by providing different input values for the
same variables, then compare the scenarios to choose the best result. For example, to
check the cost of revenue for three different months.
Step 1: Start with a data set for the Revenue Cost of Jan, with Expenses and Cost as its columns.
Step 2: Select the cell with the numerical value and go to the Data tab.
Step 6: A new dialog box appears. Under Scenario name, write “Revenue of Feb”.
Step 7: Under Changing cells, select the cells to change. The changing cells for this example
are $E$5:$E$9.
Step 8: A new dialog box named Scenario Values appears. Enter the changed values as shown
in the image, then click OK.
Data Table
In a data table, we list different input values for the same variables. It is one of the most
helpful features in what-if analysis. You can vary the input values and obtain the
corresponding outputs, for research as well as business-driven purposes.
In a one-variable data table, we can change only one input value, arranged either in a row or in
a column. It includes only one input cell. For example, a company wants to know how its
revenue changes with the cost of raw materials. Start with a data set of materials and their
costs.
Step 3: Write the values you want to test for the input cell in a column or in a row.
Step 4: Go to the Data tab on the ribbon.
Step 5: Select What-If Analysis.
In a two-variable data table, we can change two input values, one along a row and one along a
column. It includes two input cells. For example, a person wants to know the monthly
installments of a loan at different rates of interest and for different time periods, for the same
principal amount.
Step 3: Write the values for one input across a row and the values for the other input down a
column.
Step 7: A dialog box appears in which you select the input cell for the row values and the input
cell for the column values. The Row input cell is $D$5 and the Column input cell is $D$6.
Step 8: Click OK and see the result.
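What a two-variable data table computes for the loan example can be sketched in Python: one output formula evaluated for every combination of interest rate and loan term. The payment formula is the standard fixed-rate annuity formula; the principal, rates, and terms below are hypothetical, as are the function names. Unlike Excel's PMT function, which reports payments as negative cash flows, this sketch returns positive amounts.

```python
def monthly_payment(principal, annual_rate, months):
    """Fixed monthly installment for a loan with monthly compounding."""
    monthly_rate = annual_rate / 12
    if monthly_rate == 0:
        return principal / months  # no interest: repay the principal evenly
    return principal * monthly_rate / (1 - (1 + monthly_rate) ** -months)


def two_variable_table(principal, annual_rates, terms_in_months):
    """Payment for each (rate, term) pair, keyed first by rate, then by term."""
    return {
        rate: {
            months: round(monthly_payment(principal, rate, months), 2)
            for months in terms_in_months
        }
        for rate in annual_rates
    }


# Hypothetical loan: same principal, varying rate (rows) and term (columns).
table = two_variable_table(100_000, annual_rates=[0.06, 0.08], terms_in_months=[12, 24])
for rate, row in table.items():
    print(f"{rate:.0%}", row)
```

Each row of the printed result corresponds to one rate and each entry to one term, matching the grid layout of the data table.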