L08 - Advanced Analytical Theory and Methods - Regression Analysis
Introduction
• Regression analysis makes it possible to infer or predict a variable based on one or more other
variables.
• Regression models are used to fit a relationship between a numerical outcome variable (also called the
dependent variable, response, or target) and a set of predictors (also referred to as independent variables,
input variables, regressors, or covariates).
Independent variables
Dependent variable
Use of Regression Analysis
• Explanatory task: measuring the influence of one or more variables on another variable.
• What influences children’s ability to concentrate?
• Do parents’ education level and place of residence affect children’s future education?
• Predictive task: predicting a variable from one or more other variables. For new records, the outcome
is predicted from the inputs provided.
• How much revenue can an online store generate next month?
• How long will a new patient stay in the hospital?
Forms of Regression Analysis
• Regression analysis takes many forms, depending on the shape of the curve and other parameters.
• The major types are simple linear regression, multiple linear regression, and logistic regression.
• Simple linear regression – only one independent variable is used to infer/predict the dependent
variable.
Forms of Regression Analysis Ctd.
• Multiple linear regression – several independent variables are used to infer/predict the
dependent variable.
• In both simple and multiple linear regression, the dependent variable is metric (numerical). The
independent variables can be of any type.
Forms of Regression Analysis Ctd.
• Logistic regression – used when the dependent variable is categorical. For example, when the
dependent variable has a yes-or-no answer, logistic regression can be used.
Simple Linear Regression with One Variable
• Let's assume that you are running a small restaurant. You want to know how much your waiters receive in
tips from customers. Most of the time, the amount of the tip is related to the amount of the bill. As the
owner, you would like to develop a model that lets you predict what tip amount you can expect on the
next bill.
• So, you have collected data for six meals in one day. Unfortunately, you recorded only the tip
amounts and not the meal (bill) amounts.
Meal Number   Tip Amount ($)
1             5.00
2             17.00
3             11.00
4             8.00
5             14.00
6             5.00
Simple Linear Regression with One Variable (2)
• So, you have data from a random sample of 6 meals, with only the tip amount recorded. How can you
predict the tip amount for future meals?
[Figure: plot of Tip Amount ($) for each of the six meals]
Simple Linear Regression with One Variable (3)
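• Since you have one variable (Tip amount) the best you can do is to calculate the mean value.
\[ \bar{y} = \frac{5 + 17 + 11 + 8 + 14 + 5}{6} = 10 \]
[Figure: tip amount for each meal, with a horizontal line at the mean value of 10]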
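The mean-only prediction above can be checked with a few lines of Python. This is a minimal sketch using the six tip amounts from the table; the variable names are chosen here for illustration.

```python
# Tip amounts ($) for the six meals in the table
tips = [5.00, 17.00, 11.00, 8.00, 14.00, 5.00]

# With no second variable available, the mean is the baseline prediction
mean_tip = sum(tips) / len(tips)
print(mean_tip)  # 10.0
```

With only this one variable, every future tip would be predicted as 10.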
Simple Linear Regression with One Variable (4)
[Figure: tip amounts plotted around the mean line at 10; the residuals (tip minus mean) for meals 1 to 6 are -5, +7, +1, -2, +4, -5]
Simple Linear Regression with One Variable (5)
• Square the residuals (errors). Squaring makes them all positive and emphasizes the larger
deviations.
Meal Number   Tip Amount ($)   Residual   Residual Squared
1             5.00             -5         25
2             17.00            +7         49
3             11.00            +1         1
4             8.00             -2         4
5             14.00            +4         16
6             5.00             -5         25
Sum of squared errors (SSE) = 120
• The goal of simple linear regression is to create a model that minimizes the sum of the squared
errors.
• When conducting simple linear regression with two variables, we judge how well the regression line fits
the data by comparing it with this mean-only model, in which we pretend the second variable does not exist.
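The residuals around the mean and their squares can be reproduced as follows. This is an illustrative sketch based on the six tip amounts; `sse_mean` is a name introduced here for the sum of squared errors of the mean-only model.

```python
# Tip amounts ($) for the six meals
tips = [5.00, 17.00, 11.00, 8.00, 14.00, 5.00]
mean_tip = sum(tips) / len(tips)

# Residual = observed tip minus the mean prediction
residuals = [t - mean_tip for t in tips]

# Squaring removes the sign and weights large deviations more heavily
squared = [r ** 2 for r in residuals]
sse_mean = sum(squared)

print(residuals)  # [-5.0, 7.0, 1.0, -2.0, 4.0, -5.0]
print(squared)    # [25.0, 49.0, 1.0, 4.0, 16.0, 25.0]
print(sse_mean)   # 120.0
```

The value 120 is the baseline against which a regression line using a second variable can be compared.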
Algebra Review - Lines
y = mx + b
y = dependent variable
x = independent variable
m = slope
b = y-intercept (where the line crosses the y-axis, i.e. the value of y when x = 0)
• Eg: 3
Simple Linear Regression Model
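• The simple linear regression model has the form \( y = mx + b \). More precisely, it can be defined as
\( y = b_0 + b_1 x + \varepsilon \), where \( b_0 \) is the intercept, \( b_1 \) is the slope, and \( \varepsilon \) is the error term.
• So the problem is to find the values of \( b_0 \) and \( b_1 \) that minimize the sum of squared errors.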
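Written out, the minimization problem described above looks roughly as follows. The symbols \( b_0 \) (intercept), \( b_1 \) (slope) and \( \hat{y}_i \) (predicted value) are notation assumed here, since the original formulas did not survive on the slide.

```latex
\[
  \hat{y}_i = b_0 + b_1 x_i ,
  \qquad
  SSE(b_0, b_1) = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
               = \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2 ,
  \qquad
  (b_0, b_1) = \arg\min_{b_0,\, b_1} SSE(b_0, b_1)
\]
```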
Different Types of Regression Lines
[Figure: example regression lines labelled by slope, including a positive slope (+) and a negative slope (-)]
Simple Linear Regression with Two Variables
• Suppose that in our previous example we had also collected the bill amount along with the tip amount. We
need to check how the independent variable (bill amount) can be used to predict the dependent variable
(tip amount).
Total Bill ($)   Tip Amount ($)
34               5.00
108              17.00
64               11.00
88               8.00
99               14.00
51               5.00
X̄ = 74           Ȳ = 10
• The point (X̄, Ȳ) = (74, 10) is important, and it is called the centroid. The best-fit regression line
must pass through this centroid.
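The centroid is simply the pair of sample means. A minimal sketch, using the bill and tip amounts from the table (variable names are illustrative):

```python
# Total bill ($) and tip amount ($) for the six meals
bills = [34, 108, 64, 88, 99, 51]
tips = [5.00, 17.00, 11.00, 8.00, 14.00, 5.00]

# The centroid is the point (mean of x, mean of y)
x_bar = sum(bills) / len(bills)
y_bar = sum(tips) / len(tips)
centroid = (x_bar, y_bar)

print(centroid)  # (74.0, 10.0)
```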
Simple Linear Regression with Two Variables (4)
[Figure: scatter plot of tip amount against total bill, with the centroid marked]
Simple Linear Regression with Two Variables (5)
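Total Bill ($)   Tip Amount ($)   Bill Deviation   Tip Deviation   Deviation Products    Squared Bill Deviation
                                  (xi – x̄)         (yi – ȳ)        (xi – x̄)(yi – ȳ)      (xi – x̄)²
34               5.00             -40              -5              200                   1600
108              17.00            34               7               238                   1156
64               11.00            -10              1               -10                   100
88               8.00             14               -2              -28                   196
99               14.00            25               4               100                   625
51               5.00             -23              -5              115                   529
X̄ = 74           Ȳ = 10                                            Ʃ = 615               Ʃ = 4206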
Simple Linear Regression with Two Variables (6)
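\[ b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{615}{4206} = 0.1462 \]
\[ b_0 = \bar{y} - b_1 \bar{x} = 10 - 0.1462(74) = -0.8188 \]
\[ \hat{y} = 0.1462x - 0.8188 \]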
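The slope and intercept can be recomputed from the deviation sums. The sketch below assumes the usual least-squares formulas (slope = sum of deviation products divided by sum of squared bill deviations, intercept = ȳ minus slope times x̄); the names `b0`, `b1`, `sxy` and `sxx` are chosen here for illustration. Note that the intercept of -0.8188 arises when the slope is first rounded to 0.1462; with the unrounded slope it is about -0.8203.

```python
# Total bill ($) and tip amount ($) for the six meals
bills = [34, 108, 64, 88, 99, 51]
tips = [5.00, 17.00, 11.00, 8.00, 14.00, 5.00]

x_bar = sum(bills) / len(bills)
y_bar = sum(tips) / len(tips)

# Sum of deviation products and sum of squared bill deviations
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(bills, tips))
sxx = sum((x - x_bar) ** 2 for x in bills)

b1 = sxy / sxx           # slope
b0 = y_bar - b1 * x_bar  # intercept, using the unrounded slope

# Intercept as obtained when the slope is rounded to 4 decimals first
b0_rounded_slope = y_bar - round(b1, 4) * x_bar

print(sxy, sxx)                    # 615.0 4206.0
print(round(b1, 4))                # 0.1462
print(round(b0, 4))                # -0.8203
print(round(b0_rounded_slope, 4))  # -0.8188
```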
Simple Linear Regression with Two Variables (7)
• For the fitted regression line, with X̄ = 74 and Ȳ = 10, the sum of squared errors is SSE = Ʃ (yi – ŷi)² = 30.075.
Simple Linear Regression
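• So, the main idea in regression is to create a model that minimizes the SSE (sum of squared errors).
Using the bill amount as a predictor reduces the SSE from 120 (mean-only model) to 30.075.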
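The reduction in SSE can be checked numerically. This sketch fits the least-squares line to the bill and tip data with unrounded coefficients and compares its SSE with that of the mean-only model; all variable names are illustrative.

```python
# Total bill ($) and tip amount ($) for the six meals
bills = [34, 108, 64, 88, 99, 51]
tips = [5.00, 17.00, 11.00, 8.00, 14.00, 5.00]

x_bar = sum(bills) / len(bills)
y_bar = sum(tips) / len(tips)

# Least-squares slope and intercept
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(bills, tips))
      / sum((x - x_bar) ** 2 for x in bills))
b0 = y_bar - b1 * x_bar

# SSE of the mean-only model versus SSE of the regression line
sse_mean = sum((y - y_bar) ** 2 for y in tips)
sse_line = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(bills, tips))

print(sse_mean)            # 120.0
print(round(sse_line, 3))  # 30.075
```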
SSE, SSR, SST, and R-Squared
• R-Squared
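The formulas for these quantities did not survive on the slide, so the sketch below uses the common textbook definitions as an assumption: SST is the total sum of squares around the mean, SSE is the sum of squared errors around the regression line, SSR is the sum of squares explained by the regression (SST = SSR + SSE), and R-squared = SSR / SST. Some texts use "SSR" for the residual sum instead, so check which convention the course follows.

```python
# Total bill ($) and tip amount ($) for the six meals
bills = [34, 108, 64, 88, 99, 51]
tips = [5.00, 17.00, 11.00, 8.00, 14.00, 5.00]

x_bar = sum(bills) / len(bills)
y_bar = sum(tips) / len(tips)

# Least-squares slope and intercept (unrounded)
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(bills, tips))
      / sum((x - x_bar) ** 2 for x in bills))
b0 = y_bar - b1 * x_bar
predicted = [b0 + b1 * x for x in bills]

sst = sum((y - y_bar) ** 2 for y in tips)                 # total variation
sse = sum((y - p) ** 2 for y, p in zip(tips, predicted))  # unexplained
ssr = sum((p - y_bar) ** 2 for p in predicted)            # explained
r_squared = ssr / sst

print(round(sst, 3), round(sse, 3), round(ssr, 3))  # 120.0 30.075 89.925
print(round(r_squared, 4))                          # 0.7494
```

Under these definitions, about 75% of the variation in tip amount is explained by the bill amount.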
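• Mean Square Error (MSE) is an estimate of \( \sigma^2 \), the variance of the error. In other words, it measures how
spread out the data points are around the regression line.
• Below is the equation for MSE in simple linear regression, followed by its square root, the standard error of the estimate \( s \):
\[ MSE = \frac{SSE}{n - 2} = \frac{30.075}{4} = 7.519 \]
\[ s = \sqrt{MSE} = 2.742 \]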
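The value 2.742 matches the square root of the MSE when MSE is taken as SSE / (n - 2), which is the usual definition for simple linear regression and is assumed in this sketch. Variable names are illustrative.

```python
import math

# Total bill ($) and tip amount ($) for the six meals
bills = [34, 108, 64, 88, 99, 51]
tips = [5.00, 17.00, 11.00, 8.00, 14.00, 5.00]
n = len(bills)

x_bar = sum(bills) / n
y_bar = sum(tips) / n

# Least-squares slope and intercept (unrounded)
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(bills, tips))
      / sum((x - x_bar) ** 2 for x in bills))
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(bills, tips))

# Two parameters (b0, b1) are estimated, leaving n - 2 degrees of freedom
mse = sse / (n - 2)
std_error = math.sqrt(mse)

print(round(mse, 3))        # 7.519
print(round(std_error, 3))  # 2.742
```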