Regression
Regression Analysis
• In statistical modeling, regression analysis is a statistical process for estimating the relationships among
variables. It includes many techniques for modeling and analyzing several variables when the focus is
on the relationship between a dependent variable and one or more independent variables (or
predictors). More specifically, regression analysis helps one understand how the typical value of the
dependent variable (or criterion variable) changes when the independent variables are varied. Given below
are some common applications of regression analysis in business and the social sciences.
o The marketing manager wants to know whether sales depend on factors such as advertising
spend, the number of products introduced, the number of sales personnel, etc.
o The HR department wants to predict the efficiency of management trainees based on
their academic performance, leadership abilities, IQ level, etc.
o A social researcher wants to predict the age at marriage of a girl based on characteristics
such as her education level, her parents’ education level, number of siblings, and her parents’
annual income.
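• The simple linear regression model describes how the dependent variable is related to a single
independent variable:
𝑦 = 𝛽0 + 𝛽1 𝑥 + 𝜀
where:
𝑦 is the dependent variable;
𝑥 is the independent variable;
𝛽0 and 𝛽1 are the parameters; and
𝜀 is the random error.
• The estimated simple linear regression equation is used to predict the value of the dependent
variable based on the value of the independent variable:
𝑦̂ = 𝑏0 + 𝑏1 𝑥
where:
𝑦̂ is the estimated (predicted) value of 𝑦;
𝑏0 is the y-intercept; and
𝑏1 is the slope.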
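• The estimated simple linear regression equation can be formulated by finding the slope and the
y-intercept of the equation.
o Slope of the equation
$$b_1 = \frac{n(\sum XY) - (\sum X)(\sum Y)}{n(\sum X^2) - (\sum X)^2}$$
o y-intercept of the equation
$$b_0 = \frac{\sum Y - b_1 \sum X}{n}$$
where:
n is the number of paired observations;
X is the value of the independent variable; and
Y is the value of the dependent variable.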
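The slope formula above, together with the usual least-squares intercept (the mean of Y minus the slope times the mean of X), can be sketched in a few lines of Python. This is only an illustration and not part of the original handout: the function name `fit_simple_linear_regression` and the three sample points are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b0, b1) for the line y_hat = b0 + b1 * x using the sum formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # Slope: b1 = [n(sum XY) - (sum X)(sum Y)] / [n(sum X^2) - (sum X)^2]
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: b0 = mean(Y) - b1 * mean(X)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


# Hypothetical points that lie exactly on the line y = 1 + 2x
b0, b1 = fit_simple_linear_regression([1, 2, 3], [3, 5, 7])
print(b0, b1)
```

Because the sample points lie exactly on a line, the fitted intercept and slope should recover that line's coefficients.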
Coefficient of Determination (𝑟²)
• It is used to determine how well the estimated regression line fits the sample data. It is
useful in assessing how much the errors in predicting the dependent variable (y) can be reduced
by using the information provided by the independent variable (x). It can be computed using the
following formula:
$$r^2 = \left(\frac{n(\sum XY) - (\sum X)(\sum Y)}{\sqrt{[n(\sum X^2) - (\sum X)^2][n(\sum Y^2) - (\sum Y)^2]}}\right)^2$$
Example:
• The manager at a construction company wants to predict sales based on advertising
expenditures. The company’s general manager has collected the following data on advertising
expenditures and gross sales for the past 12 months.
Month        Advertising                    Sales
             (in hundred thousand pesos)    (in hundred thousand pesos)
January      1.2                            21.2
February     1.4                            21.8
March        0.5                            17.0
April        2.1                            25.5
May          2.0                            26.2
June         1.6                            22.5
July         1.0                            19.5
August       0.6                            17.3
September    0.8                            17.5
October      1.8                            24.0
November     1.9                            23.8
December     1.5                            22.3
Solution:
• The independent variable is advertising expenditures, while the dependent variable is
sales.
• The estimated simple linear regression equation can be used to predict sales based on
advertising expenditures. To formulate this equation, first find the slope and the y-intercept,
then substitute these values into the equation 𝑦̂ = 𝑏0 + 𝑏1 𝑥.
𝑦̂ = 13.80 + 5.67𝑥
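As a worked check of these coefficients, the sums for the 12 months can be taken from the data table (n = 12, ∑X = 16.4, ∑Y = 258.6, ∑XY = 372.16, ∑X² = 25.72) and substituted into the slope formula and the usual least-squares intercept formula. The intermediate values below are rounded:

```latex
\begin{align*}
b_1 &= \frac{n(\sum XY) - (\sum X)(\sum Y)}{n(\sum X^2) - (\sum X)^2}
     = \frac{12(372.16) - (16.4)(258.6)}{12(25.72) - (16.4)^2}
     = \frac{224.88}{39.68} \approx 5.67 \\
b_0 &= \frac{\sum Y}{n} - b_1 \cdot \frac{\sum X}{n}
     = \frac{258.6}{12} - 5.6673\left(\frac{16.4}{12}\right)
     \approx 21.55 - 7.75 \approx 13.80
\end{align*}
```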
[Figure: plot of the data; horizontal axis: Advertising Expenditures (in hundred thousand)]
$$r^2 = \left(\frac{n(\sum XY) - (\sum X)(\sum Y)}{\sqrt{[n(\sum X^2) - (\sum X)^2][n(\sum Y^2) - (\sum Y)^2]}}\right)^2$$
$$r^2 = \left(\frac{12(372.16) - (16.4)(258.6)}{\sqrt{[12(25.72) - (16.4)^2][12(5682.14) - (258.6)^2]}}\right)^2 = 0.9716 \text{ or } 97.16\%$$
Therefore, approximately 97.16% of the variability in sales can be explained by the advertising
expenditures. The estimated linear regression equation therefore fits the data well: only about
2.84% of the variability in sales is left unexplained by the advertising expenditures.
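The computation above can be reproduced directly from the monthly data. The following is a minimal sketch using only the Python standard library; the variable names are illustrative and not from the handout.

```python
from math import sqrt

# Advertising (x) and sales (y) for January to December, from the data table
x = [1.2, 1.4, 0.5, 2.1, 2.0, 1.6, 1.0, 0.6, 0.8, 1.8, 1.9, 1.5]
y = [21.2, 21.8, 17.0, 25.5, 26.2, 22.5, 19.5, 17.3, 17.5, 24.0, 23.8, 22.3]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

# Correlation coefficient r from the sum formula, then squared to get r^2
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r_squared = (numerator / denominator) ** 2

print(round(r_squared, 4))  # 0.9716
```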
• The sum of squares due to regression (SSR) is the explained variation: the sum of the squared
deviations between the predicted/estimated values of y and the average value of y. SSR = ∑(ŷ − ȳ)²
• The sum of squares due to error (SSE) is the unexplained variation: the sum of the squared
deviations between the actual/observed values of y and the predicted/estimated values of y. SSE = ∑(Y − ŷ)²
• The total sum of squares (SST) is the sum of the squared deviations between the actual/observed
values of y and the average value of y, or simply the sum of SSE and SSR. SST = ∑(Y − ȳ)² or SST = SSE + SSR
*Note: The coefficient of determination can also be calculated using either of the following formulas:
$$R^2 = 1 - \frac{SSE}{SST} \quad \text{or} \quad R^2 = \frac{SSR}{SST}$$
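To illustrate how the three sums of squares relate, the sketch below fits the advertising example and computes SSR, SSE, and SST. It is an illustration under stated assumptions: the variable names are hypothetical, and the slope is computed in its deviation form, which is algebraically equivalent to the sum formula for the slope.

```python
# Advertising (x) and sales (y) for the 12 months in the example
x = [1.2, 1.4, 0.5, 2.1, 2.0, 1.6, 1.0, 0.6, 0.8, 1.8, 1.9, 1.5]
y = [21.2, 21.8, 17.0, 25.5, 26.2, 22.5, 19.5, 17.3, 17.5, 24.0, 23.8, 22.3]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope (deviation form) and intercept
b1 = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / sum((a - x_bar) ** 2 for a in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * a for a in x]

ssr = sum((p - y_bar) ** 2 for p in y_hat)          # explained variation
sse = sum((b - p) ** 2 for b, p in zip(y, y_hat))   # unexplained variation
sst = sum((b - y_bar) ** 2 for b in y)              # total variation

print(round(ssr / sst, 4), round(1 - sse / sst, 4))
```

Both expressions for R² should agree with each other and with the value obtained from the correlation-based formula, because SST = SSE + SSR holds for a least-squares line with an intercept.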
Multiple Regression
• Multiple regression is the extension of simple linear regression and focuses on assessing the
strength of the relationship between each of a set of independent variables and a single dependent
variable.
𝑦 = 𝛽0 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + ⋯ + 𝛽𝑝 𝑥𝑝 + 𝜀
where:
𝛽0 , 𝛽1 , 𝛽2 ,…, 𝛽𝑝 are the parameters; and
𝜀 is the random error.
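• The estimated multiple regression equation is:
𝑦̂ = 𝑏0 + 𝑏1 𝑥1 + 𝑏2 𝑥2 + ⋯ + 𝑏𝑝 𝑥𝑝
where:
𝑦̂ is the estimated value of the dependent variable; and
𝑏0 , 𝑏1 , 𝑏2 ,…, 𝑏𝑝 are the estimates of 𝛽0 , 𝛽1 , 𝛽2 ,…, 𝛽𝑝 .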
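One way to obtain the estimates in an estimated multiple regression equation is to solve the least-squares normal equations (X'X)b = X'y. The sketch below does this in plain Python. It is not the method used in the handout (which relies on MS Excel); the function names and the small noise-free data set are hypothetical, and the code assumes the independent variables are not perfectly collinear.

```python
def solve_linear_system(a, b):
    """Solve a·z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - known) / m[r][r]
    return z


def fit_multiple_regression(rows, ys):
    """Return [b0, b1, ..., bp] from the normal equations (X'X) b = X'y."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical, noise-free data generated from y = 2 + 3*x1 - 1*x2
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
ys = [2 + 3 * x1 - x2 for x1, x2 in rows]
coefficients = fit_multiple_regression(rows, ys)
print([round(b, 6) for b in coefficients])
```

Because the hypothetical data contain no random error, the fitted coefficients should recover the generating values (2, 3, and −1) up to floating-point rounding.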
Example:
Suppose we are interested in predicting the current market value of houses in a particular city. Data are
collected from a random sample of 30 houses: the current value of each house (in ₱100,000s) together
with the corresponding living area (in 100 square feet) and the distance in miles from the city center.
The data are shown in the following table:
Solution:
• The given data set has two (2) independent variables (area (𝑥1 ) and distance from the city center
(𝑥2 )) and one (1) dependent variable (value of the house (y)).
• The current market value of a specific house can be estimated by formulating the estimated
multiple regression equation.
• MS Excel can be used to calculate the sample statistics and formulate the estimated multiple
regression equation.
a. Input the data in a worksheet. Go to Data, then click Data Analysis.
b. In the Data Analysis box, choose Regression.
c. Set the Input Y Range and Input X Range by selecting the observations of the dependent variable
and the independent variables, respectively. Check the Labels checkbox and Confidence Level
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.956726
R Square             0.915325
Adjusted R Square    0.909053
Standard Error       4.231613
Observations         30

ANOVA
              df    SS          MS          F              Significance F
Regression    2     5226.31     2613.155    145.9329457    3.34736E-15
Residual      27    483.4768    17.90655
Total         29    5709.787

                        Coefficients    Standard Error    t Stat      P-value                Lower 95%
Intercept               46.79047        7.038031          6.648234    0.00000039087831467    32.34962768
Area                    7.328059        0.477722          15.3396     0.00000000000000748    6.34785547
City Center Distance    -5.5225         0.563573          -9.79907    0.00000000021906578    -6.678855156
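Several entries in the summary output can be cross-checked from the ANOVA sums of squares alone. The sketch below recomputes them using the standard definitions (R² = SSR/SST, adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), standard error = √MSE, F = MSR/MSE). The variable names are illustrative, and small differences in the last decimal place can arise because the table values are rounded.

```python
from math import sqrt

# Values copied from the ANOVA table above
n_obs = 30                 # observations
p = 2                      # number of independent variables
ss_regression = 5226.31
ss_residual = 483.4768
ss_total = ss_regression + ss_residual

ms_regression = ss_regression / p              # MS = SS / df
ms_residual = ss_residual / (n_obs - p - 1)

r_square = ss_regression / ss_total
adjusted_r_square = 1 - (1 - r_square) * (n_obs - 1) / (n_obs - p - 1)
standard_error = sqrt(ms_residual)
f_stat = ms_regression / ms_residual

print(round(r_square, 6), round(adjusted_r_square, 6))
print(round(standard_error, 6), round(f_stat, 4))
```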
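• From the coefficients in the output, the estimated multiple regression equation is
𝑦̂ = 46.79047 + 7.328059𝑥1 − 5.5225𝑥2. This equation can be used to estimate the current market
value of a house given its living area and its distance from the city center.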
o Ex. What is the estimated house value if its area is 1200 square feet and the distance
from the city center is 1.2 miles?
Solution:
𝑦̂ – estimated house value (in ₱100,000s)
𝑥1 = area (in 100 square feet)= 12
𝑥2 = distance from city center (miles) = 1.2
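Substituting these values into the estimated equation, with the coefficients taken from the Excel output, gives the estimate. The sketch below carries out that arithmetic; the variable names are illustrative.

```python
# Coefficients from the Excel output
b0 = 46.79047      # intercept
b1 = 7.328059      # area (per 100 square feet)
b2 = -5.5225       # distance from the city center (per mile)

x1 = 1200 / 100    # 1200 square feet expressed in units of 100 square feet
x2 = 1.2           # miles from the city center

# y_hat = b0 + b1*x1 + b2*x2, in units of 100,000 pesos
y_hat = b0 + b1 * x1 + b2 * x2
print(round(y_hat, 2))  # 128.1
```

Since house values are recorded in ₱100,000s, an estimate of about 128.10 corresponds to roughly ₱12.81 million.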
• To test the individual significance of each of the beta parameters (𝛽1 , 𝛽2 , …, 𝛽𝑝 ), test whether each
parameter is equal to zero. Remember that these values are the slopes, and if a slope is equal
to 0, then there is no linear relationship between the corresponding x and y. In general, we can
state the following hypotheses:
𝐻0 : 𝛽𝑖 = 0 (There is no significant relationship between 𝑥𝑖 and y.)
𝐻𝑎 : 𝛽𝑖 ≠ 0 (There is a significant relationship between 𝑥𝑖 and y.)
• The t-stat or p-value can be used to test the significance of an individual parameter.
o Rejection rule using the p-value.
If the computed p-value is less than or equal to the set significance level, the
decision is to “Reject the null hypothesis (𝐻0 )”. On the other hand, if the p-value is greater than
the set significance level, the decision is “Fail to reject the null hypothesis (𝐻0 )”.
o Rejection rule using the t-stat.
If the absolute value of the computed t-value/t-stat is less than or equal to the critical value, the
decision is “Fail to reject the null hypothesis (𝐻0 )”. On the other hand, if the absolute value of the
t-value is greater than the critical value, the decision is “Reject the null hypothesis (𝐻0 )”.
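The two rejection rules can be applied to the slopes in the house-value example as follows. This is a sketch under stated assumptions: the 0.05 significance level is assumed (the handout does not fix one), the critical value 2.052 is the two-tailed t-table value for 27 degrees of freedom at that level, and the function names are hypothetical. The t-stat is compared in absolute value because the test is two-tailed.

```python
ALPHA = 0.05        # assumed significance level
T_CRITICAL = 2.052  # assumed: two-tailed t-table value for alpha = 0.05, df = 30 - 2 - 1 = 27

# (t Stat, P-value) pairs copied from the Excel output
slopes = {
    "Area": (15.3396, 0.00000000000000748),
    "City Center Distance": (-9.79907, 0.00000000021906578),
}


def decide_by_p_value(p_value, alpha=ALPHA):
    """Reject H0 when the p-value is at or below the significance level."""
    return "Reject H0" if p_value <= alpha else "Fail to reject H0"


def decide_by_t_stat(t_stat, t_critical=T_CRITICAL):
    """Reject H0 when |t| exceeds the critical value (two-tailed test)."""
    return "Reject H0" if abs(t_stat) > t_critical else "Fail to reject H0"


decisions = {
    name: (decide_by_p_value(p_value), decide_by_t_stat(t_stat))
    for name, (t_stat, p_value) in slopes.items()
}
for name, (by_p, by_t) in decisions.items():
    print(f"{name}: p-value rule -> {by_p}; t-stat rule -> {by_t}")
```

Under these assumptions both rules lead to the same decision for each slope, which is expected since the p-value and the t-stat are two views of the same test.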