Guideline For Regression Models in Microsoft Excel
The purpose of this manual is to learn how to use Excel to build linear regression models based on
a chosen data set. The manual will guide you step-by-step through the Excel modelling process.
However, please remember to take a step back and think about what you are doing, and about the
significance and interpretation of your results.
Microsoft Excel has a built-in feature to perform regression analysis. This feature is available in the
Analysis Toolpak.
1A. If you have Excel on your computer: First, check that the option Data Analysis is available
under the Tools menu. If it is not there, select Tools/Add-Ins. In the ensuing dialog box, select
Analysis Toolpak and Analysis Toolpak-VBA. Now Data Analysis should appear as an option
in the Tools menu.
Here is a link to a Microsoft Support Article with more details about adding the Analysis Toolpak.
Copy the Excel workbook “Heating Cost.xls” from Blackboard onto the computer on which you are
running Excel. The spreadsheet contains heating cost data for 20 small houses in different
geographical regions, together with details on local average minimum external temperature, inches
of insulation in the house, the age of the central heating equipment and the number of windows.
1B. If you do not have Excel on your computer, please go to: https://desktops.ie.edu/
You need to be logged in with your student email. Once logged in, select MS Excel. If Excel does
not appear as an option, select @Risk, a program that starts Excel.
If you are running Excel on the cloud (through desktops.ie.edu), you will need to upload your data
to the cloud. Here is how to do that:
i. From Excel on the Cloud, choose Open → Add a Place → OneDrive For Business. Then
log in with your student username/email address.
ii. Log into your IE OneDrive: https://onedrive.live.com/login/ (Make sure you log in with your
IE username, not with a personal username or one that you used in your company)
iii. Upload datasets (e.g., Heating Cost.xls) to OneDrive.
iv. From Excel on the Cloud, choose Open → OneDrive IE University.
Regression Analysis in Excel Masters in Management
Data Analytics for Decision Making IE Business School
The objective of the regression analysis in this case is to discover if these variables explain
the differences in the heating costs for the 20 houses, and hence if the approach would be
useful for predicting heating costs for other similar properties.
You can compute summary statistics for data sets by using appropriate functions, such as:
• Average(B4..B23), which yields the average heating cost for all the houses;
• Min(B4..B23), which yields the minimum heating cost for all the houses;
• Max(B4..B23), which yields the maximum heating cost for all the houses;
• Stdev(B4..B23), which yields the standard deviation in heating costs; etc.
If you enter, for instance, =Average(B4..B23) in cell B24, the average heating cost will be displayed
in that cell. If you select the cell and drag its fill handle across the adjacent cells, the formula is
copied automatically and displays the average temperature, insulation, age and number of
windows.
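To cross-check the spreadsheet functions outside Excel, the same four summary statistics can be sketched in Python. This is only an illustration: the cost values are made-up placeholders, not the data in “Heating Cost.xls”, and the dictionary name summary is hypothetical. Like Excel's Stdev, statistics.stdev computes the sample standard deviation.

```python
import statistics

# Hypothetical heating costs (placeholders, NOT the values in Heating Cost.xls).
heating_cost = [250.0, 360.0, 165.0, 43.0, 92.0]

summary = {
    "average": statistics.mean(heating_cost),  # Excel: Average(...)
    "min": min(heating_cost),                  # Excel: Min(...)
    "max": max(heating_cost),                  # Excel: Max(...)
    "stdev": statistics.stdev(heating_cost),   # Excel: Stdev(...), sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```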
The resulting spreadsheet is shown below (you may need to reformat the cells and column widths).
3. Correlation Analysis
You can compute correlation statistics for a data set by using the following function:
Correl(B4..B23;C4..C23), which yields the correlation between the heating cost and the minimum
outside temperature. A correlation coefficient indicates the level of linear association between a
pair of variables. In this case, the correlation between the heating cost and the minimum outside
temperature is –0.81, implying a rather strong negative correlation in the sense that if the outside
temperature is low, the heating cost is high and vice-versa.
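The calculation behind Correl can be sketched in a few lines of Python. The function name pearson_correlation and the data are hypothetical placeholders (not the values in “Heating Cost.xls”), so the printed coefficient will not equal –0.81; the sketch only shows how the covariation of two columns is scaled by their individual spreads.

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: low temperatures paired with high costs.
temperature = [35.0, 29.0, 36.0, 60.0, 65.0]
heating_cost = [250.0, 360.0, 165.0, 43.0, 92.0]

r = pearson_correlation(temperature, heating_cost)
print(f"correlation = {r:.2f}")
```

A value close to -1 or +1 indicates a strong linear association, and a value close to 0 indicates a weak one.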
Again, you can use an automated tool by selecting Tools\Data Analysis\Correlation. The
following dialog box should appear:
Notice the high correlation between heating cost and the minimum temperature (negative) and the
age of the heating installation (positive). Also notice the sometimes high correlations between the
independent (explanatory) variables (min temp, insulation, age, and windows) themselves.
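Screening the explanatory variables for high mutual correlations can also be done programmatically. The sketch below is an illustration under stated assumptions: the data are made-up placeholders, and the cut-off THRESHOLD = 0.7 is an arbitrary choice for the example, not a rule taken from this manual.

```python
import itertools
import math

def correlation(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical explanatory variables (placeholders, NOT the workbook data).
explanatory = {
    "min temp": [35.0, 29.0, 36.0, 60.0, 65.0],
    "insulation": [3.0, 4.0, 7.0, 6.0, 5.0],
    "age": [6.0, 10.0, 3.0, 9.0, 6.0],
    "windows": [10.0, 12.0, 17.0, 15.0, 14.0],
}
THRESHOLD = 0.7  # assumed cut-off for "high" correlation, for illustration only

# Collect every pair of explanatory variables whose |r| exceeds the cut-off.
flagged = {}
for name_a, name_b in itertools.combinations(explanatory, 2):
    r = correlation(explanatory[name_a], explanatory[name_b])
    if abs(r) > THRESHOLD:
        flagged[(name_a, name_b)] = r

for (name_a, name_b), r in flagged.items():
    print(f"{name_a} / {name_b}: r = {r:.2f}")
```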
4. Scatter Plots
Scatter plots are of great help in identifying the strength, nature and direction of relationships
between variables. In particular, they can highlight non-linear relationships, which will not
necessarily be apparent from the correlation values. Since the observed correlation, -0.81,
between the heating cost and the minimum outside temperature suggests a strong (linear)
relationship, let us examine their scatter plot:
• Select Insert\Chart;
• In Step 1 of Chart Wizard, select Chart type XY (Scatter);
• In Step 2, specify the Data range as B3:C23 (using the mouse) and select Series in Columns;
(in order to have Temperature on the x-axis and Cost on y-axis you need to go into Series
and let the X values refer to column C and Y values to column B)
• In Step 3, specify the Chart title as “Cost & Temperature”, Value (X) Axis as Temperature,
and Value (Y) Axis as Cost;
• In Step 4, specify that the chart should be placed As object in the Data worksheet.
The scatter plot confirms the rather strong, linear relationship between heating cost and
temperature, with heating cost declining as the temperature increases. Similar scatter plots can be
examined for other pairs of variables.
A regression analysis estimates the linear equation that ‘best fits’ a set of data, in the sense that it
minimises the sum of the squared residuals. Let us perform a regression analysis of heating cost as a
function of the temperature, i.e.
Heating cost = a + b ⋅ (Temperature) + e
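The least-squares estimates of a and b in this equation can be sketched outside Excel as follows. This is a minimal Python illustration, not a description of how Excel computes them internally; the function name fit_line and the data are hypothetical placeholders.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept: the line passes through the means
    return a, b

# Hypothetical data (placeholders, NOT the values in Heating Cost.xls).
temperature = [35.0, 29.0, 36.0, 60.0, 65.0]
heating_cost = [250.0, 360.0, 165.0, 43.0, 92.0]

a, b = fit_line(temperature, heating_cost)
print(f"heating cost = {a:.2f} + ({b:.2f}) * temperature")
```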
• Summary output, containing summary statistics for the regression as a whole, of which
Adjusted R Square (R²) and Standard Error (the standard deviation of the residuals) are the
most important;
• ANOVA (Analysis of Variance), which can be ignored when performing a regression analysis;
• a table with the actual regression model;
• Residual Output, containing the predicted values from the regression model for each of the
observations in the data set (how are they calculated?), the prediction errors (residuals, which tell
us how far the predicted value is from the observation), and the standardized prediction
errors (residuals, standardized by their standard deviation);
• a Residual Plot;
• a Line Fit Plot.
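As a hint for the question of how the Residual Output columns are calculated, here is a minimal Python sketch. The coefficients, the data and the function name residual_output are hypothetical placeholders. Dividing by the sample standard deviation of the residuals is an assumption about the exact scaling used for the standardized column.

```python
import statistics

def residual_output(x, y, intercept, slope):
    """Return (predicted, residual, standardized residual) for each observation."""
    predicted = [intercept + slope * xi for xi in x]
    residuals = [yi - pi for yi, pi in zip(y, predicted)]
    # Assumption: standardize by the sample standard deviation of the residuals.
    spread = statistics.stdev(residuals)
    standardized = [r / spread for r in residuals]
    return list(zip(predicted, residuals, standardized))

# Hypothetical observations and coefficients (NOT taken from Heating Cost.xls).
temperature = [30.0, 40.0, 50.0, 60.0]
heating_cost = [260.0, 180.0, 150.0, 90.0]

rows = residual_output(temperature, heating_cost, intercept=410.0, slope=-5.0)
for predicted, residual, standardized in rows:
    print(f"predicted={predicted:.1f}  residual={residual:.1f}  standardized={standardized:.2f}")
```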
The slope, -4.93, has a t-value of -5.89 (in absolute terms larger than 2) and a very small p-value
(smaller than our significance level of 5%). The coefficient related to the temperature variable is
therefore significantly different from zero, which can also be seen from the confidence interval
[-6.69; -3.17], which does not include zero. We may conclude that there is a significant effect of
temperature on heating cost.
The regression model is able to explain 64% of the variability in heating cost in terms of differences
in outside temperature (Adjusted R²). The standard error of the forecasts is 63.55, implying
that if we want to make a prediction with confidence (95%), we should subtract and add 127.10
(2*63.55) to the prediction to obtain an approximate confidence interval. For instance, for an outside
temperature of 50, we predict the heating costs to be in the region of [142.30-127.10; 142.30+127.10] =
[15.20; 269.40].
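The interval arithmetic can be reproduced in a few lines of Python. The numbers are those quoted above; the variable names are illustrative, and the multiplier 2 is the rule of thumb used in this manual rather than an exact critical value.

```python
standard_error = 63.55  # standard error of the regression, from the Summary output
prediction = 142.30     # predicted heating cost at an outside temperature of 50

margin = 2 * standard_error  # rule-of-thumb multiplier for roughly 95% confidence
lower = prediction - margin
upper = prediction + margin

print(f"[{lower:.2f}; {upper:.2f}]")  # prints [15.20; 269.40]
```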
The Regression tool also displays several charts (you may have to move them to make them
visible):
• The Line Fit Plot (see below) shows actual costs and predicted costs, plotted for different
values of temperature. This plot is identical to the scatter plot of cost and temperature we
constructed earlier, with the predicted points superimposed. The regression line is shown
as points rather than as a line. This can be changed by double-clicking the estimated
points, and selecting Patterns\Line\Automatic.
• The Residual Plot shows the forecast errors versus temperature. If this plot exhibits an
obvious pattern, it would suggest that the model is ill-specified. Ideally, the residuals should
be random. Residual plots are also useful for spotting outliers - data points much further
from the regression line than others.
6. Multiple Linear Regression Analysis
By adding extra independent (explanatory) variables in the regression model, we may be able to
improve our predictions of heating cost. However, including extra explanatory variables may also
cause problems such as multicollinearity¹. We therefore have to find the best possible regression
model for the purpose of predicting heating costs using one or more explanatory variables.
Let us perform a regression analysis of heating cost as a function of all the available explanatory
variables, i.e.
¹ Multicollinearity is an issue that can come up in multiple regression. It refers to the situation in which some of the
independent (explanatory) variables are correlated and thus bring similar information into the regression model.
This correlation makes the regression model, in fact, less accurate. Hence, it makes sense to delete one of these
correlated variables from the model; by doing so, the significance of the other variables should improve.
In some cases a linear model is not suitable for modelling the relationship between two variables.
Let us have a look at another example: General Public Electric (GPE). GPE operates 11 thermal
power stations of basically the same design. We will investigate the relationship between the cost
efficiency (pence per Kilowatt-hour) of the electricity generating plants, as a function of their
generating capacity (Megawatts installed). The object of the exercise is to model the “economy of
scale” effect that allows larger plants to generate electricity at lower marginal cost per unit. In
practice, this analysis might be part of a larger exercise in which “economy of scale” would be one
of a variety of factors that would be taken into account in deciding between alternative
development plans. A more accurate understanding of the relative efficiency of different size plants
would make it easier to balance this factor against capital investment costs, environmental factors,
construction time, etc. The data can be found in the Excel file “GPE.xls” on Blackboard.
The spreadsheet, scatter plot and regression results are shown below.
seems reasonable because of the high t-statistic related to the Capacity variable and the high
Adjusted R² (86%), implying that there is a significant relationship between capacity and cost, and
that we are able to explain a lot of the variability in costs purely by examining the capacity. Also,
the result is logical in the sense that we indeed observe an economies-of-scale effect: cost
decreases as capacity increases. For every additional Megawatt of generating capacity, the unit cost
decreases by 0.00053 pence per Kilowatt-hour.
However, if we examine the line fit plot and the residual plot (see below), we observe the following:
• the line does not perfectly fit the data: it slightly underestimates the cost for small capacity
values, overestimates it for medium capacity values and again underestimates it for high
capacity values;
• the residual plot clearly exhibits a pattern: the errors are positive for small capacity values,
negative for medium capacity values and again positive for high capacity values.
This indicates that the model is ill-specified, and more specifically that we have been trying to fit a
line to data which exhibits a non-linear relationship. We therefore should look for a more suitable
specification of the model.
Let us try to regress costs to the reciprocal of the plant capacity, i.e.
Cost = a + b ⋅ (1/Capacity) + e
We therefore have to transform the capacity data. The proposed relationship (Cost as a function
of 1/Capacity) resembles the shape of the curve we observe in the scatter plot. Sometimes,
however, several candidate transformations exist.
In order to transform the capacity data in our model, we add another column. In cell D3, we enter
the title “1/Capacity”. In cell D4, we enter the formula “=1/B4”. The value 0.0019 should appear (=
1/525). We select cell D4 and drag the handle down to fill the entire column with the transformed
capacity data. Now, we can run another regression analysis, using the new column as the
explanatory variable.
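The effect of the transformation can be illustrated with a small Python sketch. The data are generated from a made-up reciprocal relationship (Cost = 2 + 300/Capacity), not taken from “GPE.xls”, and the helper names fit_line and residuals are hypothetical. With such data, the straight-line fit leaves the positive-negative-positive residual pattern described earlier, while the fit on 1/Capacity leaves essentially no residual.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def residuals(x, y, a, b):
    """Observed value minus fitted value for each observation."""
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Hypothetical capacities (MW) and unit costs; placeholders, NOT the GPE data.
capacity = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0]
cost = [2.0 + 300.0 / c for c in capacity]

# Model 1: Cost = a + b * Capacity
a1, b1 = fit_line(capacity, cost)
res_linear = residuals(capacity, cost, a1, b1)

# Model 2: Cost = a + b * (1/Capacity); the list below mirrors "=1/B4" filled down
inv_capacity = [1.0 / c for c in capacity]
a2, b2 = fit_line(inv_capacity, cost)
res_reciprocal = residuals(inv_capacity, cost, a2, b2)

print("linear residual signs:", ["+" if r > 0 else "-" for r in res_linear])
print("largest reciprocal residual:", max(abs(r) for r in res_reciprocal))
```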
• The line fit plot indicates near-perfect fit of line and data.
Below, you will also find a plot of the estimated costs as a function of Capacity, revealing the non-
linear nature of the estimated relationship. In order to draw such a graph, add another column in
your spreadsheet with the cost predictions, computed using the regression coefficients and the
explanatory variables data. Then draw a scatter plot of the predictions as well as the actual
electricity cost data versus capacity. Again, the predictions will be displayed as points rather than a
line, but this can be changed by double-clicking the points and selecting a different format.