Chapter 3 Complete
Parishwar Acharya
Scatter Diagram
A scatter diagram is a graphical method for displaying the relationship
between two variables
A scatter diagram plots pairs of bivariate observations (x, y) on the XY
plane
Y is called the dependent variable
X is called the independent variable
Correlation
Correlation analysis is used to measure the degree of
relationship between two or more variables.
Only concerned with strength of the relationship
No causal effect is implied
It may be Simple, Partial or Multiple
Scatter Plot Examples
[Figure: scatter plots of y against x showing linear relationships and curvilinear relationships]
[Figure: scatter plots of y against x showing strong relationships and weak relationships]
[Figure: scatter plots of y against x showing no relationship]
Simple Correlation coefficient (r)
It is also called Pearson's correlation or product moment
correlation coefficient.
It measures the nature and degree of relationships between
two variables of the quantitative type.
If the sign is positive this means the relation is direct (an increase in
one variable is associated with an increase in the other variable and a
decrease in one variable is associated with a decrease in the other
variable).
Player 1 2 3 4 5 6 7 8 9 10
Height (x) 8 9 7 6 13 7 11 12 9 14
Weight (y) 35 49 27 33 60 21 45 51 46 65
• Calculate the correlation coefficient and interpret your result.
• Is there evidence of a linear relationship between weight
and height at the 0.05 level of significance?
• Also plot the scatter diagram.
Solution:
X     Y      X²     Y²      XY
8     35     64     1225    280
9     49     81     2401    441
7     27     49     729     189
6     33     36     1089    198
13    60     169    3600    780
7     21     49     441     147
11    45     121    2025    495
12    51     144    2601    612
9     46     81     2116    414
14    65     196    4225    910
∑X = 96   ∑Y = 432   ∑X² = 990   ∑Y² = 20452   ∑XY = 4466
As t cal > t tab, H0 is rejected.
There is evidence of a linear relationship between weight
and height at 0.05 level of significance.
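The working between the totals and the decision can be sketched in a few lines of Python. This is only a sketch: it assumes the usual computing formula for r and the test statistic t = r√(n – 2)/√(1 – r²). The tabulated value 2.306 (two-tailed, 8 degrees of freedom, 5% level) is taken from a standard t-table, and the variable names are illustrative.

```python
import math

# Totals from the calculation table (n = 10 players)
n = 10
sum_x, sum_y = 96, 432
sum_x2, sum_y2, sum_xy = 990, 20452, 4466

# Pearson's r from the computing formula
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

# t statistic for H0: no linear relationship, with n - 2 degrees of freedom
t_cal = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
t_tab = 2.306  # assumed: two-tailed 5% critical value for 8 df (standard table)

print(f"r = {r:.4f}, t_cal = {t_cal:.3f}, reject H0: {abs(t_cal) > t_tab}")
```

With these totals r is about 0.91 and t cal is about 6.26, which exceeds the tabulated value.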
Spearman Rank Correlation Coefficient
This procedure makes use of the two sets of ranks that may be assigned to the
sample values of x and y.
4. Square each di and compute ∑di², which is the sum of the squared
values.
Example
Twelve entries in a painting competition were ranked by two judges as
shown below
Entry 1 2 3 4 5 6 7 8 9 10 11 12
Judge I 5 2 3 4 1 6 8 7 10 9 12 11
Judge II 4 5 2 1 6 7 10 9 11 12 3 8
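A possible working for this example is sketched below. It assumes the standard formula for untied ranks, rs = 1 – 6∑di²/[n(n² – 1)], which is not shown in the surviving text. The variable names are illustrative.

```python
# Ranks given by the two judges to the 12 entries
judge_1 = [5, 2, 3, 4, 1, 6, 8, 7, 10, 9, 12, 11]
judge_2 = [4, 5, 2, 1, 6, 7, 10, 9, 11, 12, 3, 8]
n = len(judge_1)

# Rank differences d_i and the sum of their squares
d = [a - b for a, b in zip(judge_1, judge_2)]
sum_d2 = sum(di ** 2 for di in d)

# Spearman rank correlation (assumed formula for ranks without ties)
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(f"sum of d^2 = {sum_d2}, r_s = {r_s:.4f}")
```

Here ∑di² = 154 and rs is about 0.46, a moderate positive agreement between the two judges.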
Regression tells us how to draw the straight line described by the correlation
Height 68 64 62 65 66
Weight 132 108 102 115 128
[Figure: scatter plot of Weight (Y) against Height (X)]
Calculation table: n = 5
Height (X)   Weight (Y)   Y²       XY
68           132          17424    8976
64           108          11664    6912
62           102          10404    6324
65           115          13225    7475
66           128          16384    8448
∑X = 325   ∑Y = 585   ∑Y² = 69101   ∑XY = 38135
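As an illustration, the regression line of weight on height (y on x) for these five swimmers can be fitted as sketched below. The sketch assumes the usual least squares formulas b = [n∑XY – ∑X∑Y]/[n∑X² – (∑X)²] and a = ȳ – b x̄. ∑X² is not listed in the table, so it is computed from the raw data.

```python
# Height (X, inches) and weight (Y, pounds) of the five swimmers
height = [68, 64, 62, 65, 66]
weight = [132, 108, 102, 115, 128]
n = len(height)

sum_x, sum_y = sum(height), sum(weight)
sum_x2 = sum(x ** 2 for x in height)  # not in the table; computed here
sum_xy = sum(x * y for x, y in zip(height, weight))

# Least squares slope and intercept for the line y = a + b x
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

print(f"y = {a:.1f} + {b:.1f} x")
```

For these data the fitted line is y = –240.5 + 5.5x, so each extra inch of height is associated with about 5.5 pounds of extra weight.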
Questions
1. The annual advertising expenditures (in lakhs of rupees) and the corresponding annual sales (in
crores of rupees) for the past 10 years of a company are presented in the following table.
a. Find the correlation coefficient between annual advertising expenditure and annual
sales revenue and comment on the result.
b. Develop a regression model of sales as a function of advertising expenditure.
Predict the value of annual sales when advertising expenditure is Rs. 27 lakhs.
Regression line of x on y
The regression line of x on y gives the best estimated value of x for
given values of y.
The regression equation of x on y is
x = a’ + b’y …..(i)
where a’ is constant or x-intercept and b’ is the slope of regression
line (i) or regression coefficient of x on y which is denoted by bxy.
Computing Formula
Here we simply interchange x and y in every formula for the line of y on x,
which gives
bxy = [n∑xy – ∑x∑y] / [n∑y² – (∑y)²] and a’ = x̄ – bxy ȳ
In the linear regression model we usually predict the value of the dependent variable (Y), so we use
the regression equation of y on x. If instead we have to predict the value of the independent variable (X),
we use the regression equation of x on y. For example:
The following data were collected on the height in inches (X) and weight in pounds (Y) of women
swimmers.
Height (X) 68 64 62 65 66
Weight (Y) 132 108 102 115 128
Predict the height for a given weight of 105 pounds.
Here height is the variable to be predicted, so it takes the role of the dependent variable and weight the
role of the independent variable.
The regression equation is then
x = a’ + b’y. Apply the same rules to estimate the parameters, then predict x for the given y.
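The prediction asked for here can be sketched as follows. The sketch assumes the least squares formulas b’ = [n∑xy – ∑x∑y]/[n∑y² – (∑y)²] and a’ = x̄ – b’ȳ for the line of x on y. The variable names are illustrative.

```python
# Height (X, inches) and weight (Y, pounds) of the five swimmers
height = [68, 64, 62, 65, 66]
weight = [132, 108, 102, 115, 128]
n = len(height)

sum_x, sum_y = sum(height), sum(weight)
sum_y2 = sum(y ** 2 for y in weight)
sum_xy = sum(x * y for x, y in zip(height, weight))

# Regression of x on y: x = a_prime + b_prime * y
b_prime = (n * sum_xy - sum_x * sum_y) / (n * sum_y2 - sum_y ** 2)
a_prime = sum_x / n - b_prime * sum_y / n

# Predicted height for a swimmer weighing 105 pounds
height_at_105 = a_prime + b_prime * 105

print(f"x = {a_prime:.4f} + {b_prime:.4f} y, height at 105 lb = {height_at_105:.2f}")
```

The fitted line is roughly x = 45.38 + 0.168y, giving a predicted height of about 63 inches for a weight of 105 pounds.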
Standard Error of the estimate
If the least squares regression line of y on x is
y = a + bx, then the standard error of estimate se is given
by
se² = [Syy – (Sxy)²/Sxx] / (n – 2)
For the line x = a’ + b’y,
se² = [Sxx – (Sxy)²/Syy] / (n – 2)
where Sxx = ∑(x – x̄)², Syy = ∑(y – ȳ)² and Sxy = ∑(x – x̄)(y – ȳ).
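A short numerical illustration of both formulas, using the height and weight data of the five swimmers from this chapter. It assumes Sxx, Syy and Sxy denote sums of squares and products of deviations from the means. The variable names are illustrative.

```python
import math

height = [68, 64, 62, 65, 66]       # X
weight = [132, 108, 102, 115, 128]  # Y
n = len(height)
x_bar, y_bar = sum(height) / n, sum(weight) / n

# Sums of squares and products of deviations (assumed meaning of Sxx, Syy, Sxy)
s_xx = sum((x - x_bar) ** 2 for x in height)
s_yy = sum((y - y_bar) ** 2 for y in weight)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(height, weight))

# Squared standard error of estimate for each regression line
se2_y_on_x = (s_yy - s_xy ** 2 / s_xx) / (n - 2)
se2_x_on_y = (s_xx - s_xy ** 2 / s_yy) / (n - 2)

print(f"y on x: se = {math.sqrt(se2_y_on_x):.3f} pounds")
print(f"x on y: se = {math.sqrt(se2_x_on_y):.3f} inches")
```

Note the parentheses around n – 2: the whole bracket is divided by the degrees of freedom.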
Test of significance of slope parameter β
To test the significance of the regression coefficient of the simple linear regression model
Y = α + βX,
the following two statistical tests are applied.
I. t-test for significance in simple linear regression model
II. F-test for significance in simple linear regression model
t-test for significance in simple linear regression model
H0 : β = 0 (There is no significant linear relationship between the dependent and independent variables).
H1 : β ≠ 0 (There is a significant linear relationship between the dependent and independent variables).
Degrees of freedom = n – 2
Choose the level of significance.
Under H0, the test statistic follows the t-distribution with n – 2 degrees of freedom.
You can use any one of the formulas below.
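One common form of this test statistic is t = b / SE(b) with SE(b) = se/√Sxx. The sketch below applies it to the swimmers' height and weight data from this chapter. This form is an assumption, since the original formulas are not reproduced in the text. The tabulated value 3.182 (two-tailed, 3 degrees of freedom, 5% level) comes from a standard t-table.

```python
import math

height = [68, 64, 62, 65, 66]       # X
weight = [132, 108, 102, 115, 128]  # Y
n = len(height)
x_bar, y_bar = sum(height) / n, sum(weight) / n

s_xx = sum((x - x_bar) ** 2 for x in height)
s_yy = sum((y - y_bar) ** 2 for y in weight)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(height, weight))

b = s_xy / s_xx                                       # estimated slope
se = math.sqrt((s_yy - s_xy ** 2 / s_xx) / (n - 2))   # standard error of estimate
se_b = se / math.sqrt(s_xx)                           # standard error of the slope

t_cal = b / se_b
t_tab = 3.182  # assumed: two-tailed 5% critical value for n - 2 = 3 df

print(f"b = {b:.2f}, SE(b) = {se_b:.4f}, t_cal = {t_cal:.3f}, reject H0: {abs(t_cal) > t_tab}")
```

For these data t cal is about 5.97, so H0 : β = 0 is rejected at the 5% level.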
Error n – 2
Total n – 1
where
k = number of independent variables
n = number of observations
Decision Rule
Reject H0 if the computed value of F exceeds the tabulated value of F with 1 degree of freedom in the numerator and
(n – 2) degrees of freedom in the denominator at the α level of significance; otherwise do not reject H0.
Using the p-value, reject the null hypothesis if p-value < α.
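The decision rule can be illustrated with the swimmers' height and weight data from this chapter. The sketch assumes the usual ANOVA decomposition for simple regression, SSR = (Sxy)²/Sxx and SSE = Syy – SSR. The tabulated value 10.13 for (1, 3) degrees of freedom at the 5% level is taken from a standard F-table.

```python
height = [68, 64, 62, 65, 66]       # X
weight = [132, 108, 102, 115, 128]  # Y
n, k = len(height), 1               # k = number of independent variables
x_bar, y_bar = sum(height) / n, sum(weight) / n

s_xx = sum((x - x_bar) ** 2 for x in height)
s_yy = sum((y - y_bar) ** 2 for y in weight)   # total sum of squares
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(height, weight))

ssr = s_xy ** 2 / s_xx      # regression sum of squares, df = 1
sse = s_yy - ssr            # error sum of squares, df = n - 2

f_cal = (ssr / k) / (sse / (n - 2))
f_tab = 10.13  # assumed: 5% critical value of F with (1, 3) df

print(f"SSR = {ssr:.0f}, SSE = {sse:.0f}, F = {f_cal:.2f}, reject H0: {f_cal > f_tab}")
```

In simple linear regression F equals the square of the t statistic for the slope, so the two tests lead to the same decision.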
Example
Cost accountants often estimate overhead based on the level of production.
At the Standard Knitting Co., they have collected information on
overhead expenses and units produced at different plants and want to
estimate a regression equation to predict future overhead.
Overhead 191 170 272 155 280 173 234 116 153 178
Units 40 42 53 35 56 39 48 30 37 40
Develop the regression equations for the cost accountants.
Predict overhead when 50 units are produced.
Estimate the number of units produced when the overhead is 400.
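One way to work this example is sketched below. It assumes overhead is regressed on units for the first prediction and units on overhead for the second, following the two-regression-lines approach of this chapter. The helper `fit_line` is a hypothetical name introduced for illustration.

```python
overhead = [191, 170, 272, 155, 280, 173, 234, 116, 153, 178]
units = [40, 42, 53, 35, 56, 39, 48, 30, 37, 40]

def fit_line(x, y):
    """Least squares intercept and slope for the regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

# Overhead on units, used to predict overhead at 50 units
a, b = fit_line(units, overhead)
overhead_at_50 = a + b * 50

# Units on overhead, used to estimate units when overhead is 400
a_prime, b_prime = fit_line(overhead, units)
units_at_400 = a_prime + b_prime * 400

print(f"overhead = {a:.4f} + {b:.4f} units  -> {overhead_at_50:.1f} at 50 units")
print(f"units = {a_prime:.4f} + {b_prime:.4f} overhead -> {units_at_400:.1f} at overhead 400")
```

An overhead of 400 lies well outside the observed range (116 to 280), so the second estimate is an extrapolation and should be read with caution.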
Example
The following measurements show the respective heights in inches of 10
fathers and their eldest sons.
Height of father (X) 66 67 63 71 69 65 62 70 61 72
Height of son (Y) 65 68 66 65 70 67 67 71 62 63
Obtain the regression line of son’s height on father’s height and estimate the height of
the son when his father is 70 inches tall.
Obtain the regression line of father’s height on son’s height and estimate the height of
the father when his son is 80 inches tall.
Multiple Regression
Multiple regression analysis is a straightforward extension of simple
regression analysis which allows more than one independent variable.
It is used to estimate or predict the value of one dependent variable
when the values of two or more independent variables are known.
Multiple regression equation
The multiple regression equation of dependent variable y on two independent
variables x1 and x2 is given by
y = a + b x1 + c x2 ……(i)
where
a = value of y when x1 = 0 and x2 =0
b = Partial regression coefficient of
y on x1 when x2 is constant.
c = Partial regression coefficient of
y on x2 when x1 is constant.
Note that a, b, c are parameters of the equation whose values are to be
determined.
Using the Principle of least square estimation , the normal equations of
line (i) are
∑y = na + b∑x1 + c∑x2 ….(ii)
∑yx1 = a∑x1 + b∑x1² + c∑x1x2 ….(iii)
∑yx2 = a∑x2 + b∑x1x2 + c∑x2² ….(iv)
Solving (ii), (iii), (iv), we get the values of a, b , c.
Substituting the values of a, b , c in (i), we get required multiple
regression equation of y on x1 and x2.
The multiple regression equation of dependent variable Y on k
independent variables x1, x2, x3, …, xk is given by
Y = α + b1 x1 + b2 x2 + … + bk xk …(v)
The values of α, b1 and b2 can be estimated by using the least squares method.
The normal equations with two independent variables are given below
∑Y = nα + b1∑X1 + b2∑X2
∑X1Y = α∑X1 + b1∑X1² + b2∑X1X2
∑X2Y = α∑X2 + b1∑X1X2 + b2∑X2²
Solving these three equations gives the values of α, b1 and b2.
Source of variation   Sum of squares (SS)   Degrees of freedom (df)   Mean square (MS)   F
Regression            SSR                   k
Family                    A    B    C    D    E
Saving (Y) (Rs. 000)      6    12   10   7    3
Income (X1) (Rs. 000)     8    11   9    6    6
No. of children (X2)      5    2    1    3    4
Fcal = 8.60 and Ftab at the 5% level of significance with (k, n – k – 1) = (2, 2) degrees of freedom = 19.00
Decision
Fcal (8.60) < Ftab (19.00), so we do not reject H0.
Conclusion
At the 5% level there is no significant evidence of a linear relationship between the dependent variable and the independent variables.
If the question also asks which independent variable is significant, we carry out an
individual t-test for each regression coefficient, as in the simple linear regression model.
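The figures for the family savings example can be checked with the sketch below. It solves the normal equations in deviation form, which is algebraically equivalent to solving the three normal equations after eliminating the intercept. The tabulated value F(2, 2) = 19.00 at the 5% level is taken from a standard F-table. The helper names `mean` and `s` are illustrative.

```python
saving = [6, 12, 10, 7, 3]    # Y (Rs. 000)
income = [8, 11, 9, 6, 6]     # X1 (Rs. 000)
children = [5, 2, 1, 3, 4]    # X2
n, k = len(saving), 2

def mean(v):
    return sum(v) / len(v)

def s(u, v):
    """Sum of products of deviations from the means."""
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))

s11, s22, s12 = s(income, income), s(children, children), s(income, children)
s1y, s2y, syy = s(income, saving), s(children, saving), s(saving, saving)

# Partial regression coefficients and intercept of y = a + b x1 + c x2
det = s11 * s22 - s12 ** 2
b = (s1y * s22 - s2y * s12) / det
c = (s2y * s11 - s1y * s12) / det
a = mean(saving) - b * mean(income) - c * mean(children)

# ANOVA: regression and error sums of squares, then F with (k, n - k - 1) df
ssr = b * s1y + c * s2y
sse = syy - ssr
f_cal = (ssr / k) / (sse / (n - k - 1))
f_tab = 19.00  # assumed: 5% critical value of F with (2, 2) df

print(f"y = {a:.4f} + {b:.4f} x1 + {c:.4f} x2")
print(f"F_cal = {f_cal:.2f}, reject H0: {f_cal > f_tab}")
```

With only five families the error term has just 2 degrees of freedom, which is why the critical value is so large.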
Question
1. Police stations across the country are interested in predicting the number of arrests they can expect to process each month so as
to better schedule office employees. Historically the average number of arrests (Y) each month is influenced by the number of
officers on the police force (X1), the population of the city in thousands (X2) and the percentage of unemployed people in the city
(X3). The partial SPSS output for these factors in 15 cities is presented below.
Coefficient Table
            Coefficients (bi)   Standard error (sbi)   t        P-value
Constant    142.4363            25.96474               5.49     <0.001
X1          3.2741              0.2814354              11.64    <0.001
X2          0.5269              0.4693494              1.12     0.287
X3          -0.3203             1.295351               -0.24    0.812
ANOVA Table
Source of variation   Sum of squares   df   Mean Square   F
Regression            230500.663       3    76833.5544    246.41 (p-value <0.001)
Residual              3429.9279        11   311.811627
Total                 233930.591       14   16709.3279
a. Using the above output, determine the best-fitting regression equation.
b. What percentage of the total variation in the number of arrests (Y) is explained by
this equation?
c. The police department in a city is trying to predict the number of monthly arrests.
The city has a population of 75,000, a police force of 82 and an unemployment
rate of 10.5 percent. How many arrests do you predict for each month?
Solution
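A possible working for parts (a) to (c), read directly from the SPSS output, is sketched below. It assumes R² = SSR/SST for part (b). It also assumes X2 is entered as 75 because population is measured in thousands. The variable names are illustrative.

```python
# (a) Coefficients read from the coefficient table
const, b1, b2, b3 = 142.4363, 3.2741, 0.5269, -0.3203
print(f"Y = {const} + {b1} X1 + {b2} X2 + ({b3}) X3")

# (b) Share of total variation explained: R^2 = SSR / SST (from the ANOVA table)
ssr, sst = 230500.663, 233930.591
r_squared = ssr / sst
print(f"R^2 = {r_squared:.4f}")

# (c) Prediction for 82 officers, population 75 (thousands), 10.5% unemployment
officers, population_thousands, unemployment_pct = 82, 75, 10.5
arrests = const + b1 * officers + b2 * population_thousands + b3 * unemployment_pct
print(f"Predicted monthly arrests = {arrests:.1f}")
```

About 98.5% of the variation in arrests is explained by the equation, and the predicted figure is roughly 447 arrests per month.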
2. The following is the partially developed SPSS output of a multiple regression where the outcome variable (Y)
represents the scores made by 10 assembly line employees on a test designed to measure job satisfaction. The
scores are affected by two factors: an aptitude test (X1) and the number of days absent (X2) during the past
year (excluding vacation).
Coefficients Table
                              Coefficients (bi)   Standard error (sbi)   t
Intercept                     36.2083             7.3441                 ?
Aptitude test (X1)            5.3882              0.9900                 ?
Number of days absent (X2)    -1.6191             0.3909                 ?
ANOVA Table
Source of variation   Sum of squares   df   Mean Square   F
Regression            1016.26949       2    ?             ?
Residual              62.6305138       ?    ?
Total                 1078.9           9
Answer the following questions
I. Complete the above ANOVA table and coefficient table.
II. Fit a multiple regression model and predict the value of Y when the aptitude test score is 7
and the number of days absent is 6.
III. Is there any significant relationship between the dependent variable and the two independent
variables? (Test at the 5% level of significance.)
IV. Test the significance of the estimated regression coefficient of X2 at the 5%
significance level.
V. What proportion of the variation in scores (Y) is explained by the two independent
variables?
VI. Compute the standard error of estimate and interpret its meaning.
Coefficients Table
                              Coefficients (bi)   Standard error (sbi)   t
Intercept                     36.2083             7.3441                 ? (4.93)
Aptitude test (X1)            5.3882              0.9900                 ? (5.44)
Number of days absent (X2)    -1.6191             0.3909                 ? (-4.14)
ANOVA Table
Source of variation   Sum of squares   df      Mean Square     F
Regression            1016.26949       2       ? (508.1347)    ? (56.79)
Residual              62.6305138       ? (7)   ? (8.9472)
Total                 1078.9           9
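The completed entries and the remaining parts of the question can be reproduced as sketched below. It assumes t = bi/sbi, MS = SS/df, F = MSR/MSE, R² = SSR/SST and se = √MSE. The critical values F(2, 7) = 4.74 and t(7) = 2.365 at the 5% level are taken from standard tables. The variable names are illustrative.

```python
import math

# Coefficients and standard errors from the coefficients table
coef = {"intercept": 36.2083, "x1": 5.3882, "x2": -1.6191}
std_err = {"intercept": 7.3441, "x1": 0.9900, "x2": 0.3909}
t_values = {name: coef[name] / std_err[name] for name in coef}

# ANOVA table: n = 10 employees, k = 2 independent variables
n, k = 10, 2
ssr, sse, sst = 1016.26949, 62.6305138, 1078.9
df_residual = n - k - 1
msr, mse = ssr / k, sse / df_residual
f_cal = msr / mse

# Prediction for aptitude score 7 and 6 days absent
y_hat = coef["intercept"] + coef["x1"] * 7 + coef["x2"] * 6

r_squared = ssr / sst          # proportion of variation explained
se = math.sqrt(mse)            # standard error of estimate

f_tab, t_tab = 4.74, 2.365     # assumed 5% table values: F(2, 7) and two-tailed t(7)
print({name: round(t, 2) for name, t in t_values.items()})
print(f"MSR = {msr:.4f}, MSE = {mse:.4f}, F = {f_cal:.2f}, significant: {f_cal > f_tab}")
print(f"X2 significant: {abs(t_values['x2']) > t_tab}")
print(f"Y_hat = {y_hat:.4f}, R^2 = {r_squared:.4f}, se = {se:.4f}")
```

The standard error of estimate, about 2.99, means the observed satisfaction scores typically deviate from the fitted values by roughly 3 points.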
Regression Analysis Using the Excel Analysis ToolPak
Run the regression and get the output
The output has three components:
• Regression Statistics table