
Unit 2: Simple Linear Correlation and Regression Analysis

Class: 19EC/AC/SE15
Credits: 5
End Sem Exam: Yes
Semester: 1

Introduction to Correlation Analysis


Definitions
Correlation is a measure of degree of relatedness of variables. - Ken Black

If two or more quantities vary in sympathy so that movements in one tend to be accompanied by corresponding movements in the other(s), then they are said to be correlated. - L.R. Conner

When the relationship is of a quantitative nature, the appropriate statistical tool for discovering and measuring the relationship and expressing it in a brief formula is known as correlation. - Croxton and Cowden

Correlation analysis attempts to determine the 'degree of relationship' between variables. - Ya-Lun Chou

Correlation is an analysis of the covariation between two or more variables.

A lack of relationship between variables implies that they are independent of each other.

Significance of the Study of Correlation


Most variables show some kind of relationship

Once correlation is shown, it is useful for regression



Understanding of economic behaviour

Contributes to progress in science and philosophy

Reduces range of uncertainty

Correlation and Causation


Why correlation does not imply causation:

Correlation may be due to pure chance, especially in a small sample

Both correlated variables may be influenced by one or more other variables

Both variables may be mutually influencing each other so neither can be designated
as the cause and the other the effect

2.2 Types of Correlation


There are three ways to classify correlation.

Positive and Negative Correlation


Whether correlation is positive (direct relationship) or negative (inverse relationship)
would depend upon the direction of change in the variables.

If both the variables are varying in the same direction, i.e., if increase in one
variable results in increase in the other variable, on average, or if decrease in one
variable results in decrease in the other variable on average, correlation is said to
be positive.

If both the variables are varying in opposite directions, i.e., if increase in one
variable results in decrease in the other variable, on average, or if decrease in one
variable results in increase in the other variable on average, correlation is said to be
negative.

Simple, Partial and Multiple Correlation


If only two variables are studied, it is a problem of simple correlation. (yield of rice
per acre and amount of rainfall)



When three or more variables are studied it is a problem of either partial or multiple
correlation.

In multiple correlation, three or more variables are studied simultaneously. (yield of


rice per acre, amount of rainfall, amount of fertilizers)

In partial correlation, we recognize more than two variables, but consider only two
variables to be influencing each other simultaneously, the effect of other influencing
variables being kept constant. (analysis of yield of rice per acre and amount of
rainfall limited to periods with a certain constant temperature)

Linear and Non-Linear (Curvilinear) Correlation


The distinction between linear and non-linear correlation is based on the constancy
of the ratio of change between the two variables.

If the amount of change in one variable tends to bear constant ratio to the amount of
change in the other variable, then correlation is said to be linear. If the variables
were plotted, all the plotted points would fall on a straight line.

If the amount of change in one variable does not tend to bear constant ratio to the
amount of change in the other variable, then correlation is said to be non-linear or
curvilinear.

In most practical situations, the relationship is curvilinear, which makes analysis


much more difficult. Thus, there is a tendency to assume that the relationship is
linear in nature.

2.3 Methods to Estimate Correlation


Scatter Diagram Method
The simplest way to ascertain whether two variables are related is to prepare a dot
chart called a scatter diagram.

For each pair of X and Y observations, we put a dot, and therefore obtain as many
dots as there are observations.

The greater the scatter of points on the chart, the lesser the relationship between
both variables.



If the points are widely scattered over the diagram it indicates very little relationship.
If the points are vaguely sloping upwards, it is a low degree of positive correlation
and if the points are vaguely sloping downwards, it is a low degree of negative
correlation.

The more closely the points come to a straight line from the lower left hand corner
to the upper right hand corner, correlation is said to be perfectly positive ( r=+1 ).

If the points resemble perfectly positive correlation but instead form a narrow band,
there is said to be a high degree of positive correlation.

The more closely the points come to a straight line from the upper left hand corner
to the lower right hand corner, correlation is said to be perfectly negative ( r=-1 ).

If the points resemble perfectly negative correlation but instead form a narrow band,
there is said to be a high degree of negative correlation.

If the points are widely scattered about the line of best fit, there is a low degree of
correlation, and if they are closely clustered about the line of best fit, there is a high
degree of correlation.

Merits
Simple and non-mathematical way of studying correlation between the variables. It
can be easily understood and a rough idea can be formed from a single glance.

Not influenced by the presence of extreme values or outliers, as opposed to most


mathematical methods.

Usually the first step in investigating a relationship between variables.

If the variables are related, it is possible to see whether a line or estimating equation
describes the relationship.

Limitations
Cannot establish the exact degree of correlation between the variables.
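As a minimal illustration of the scatter diagram method, the following Python sketch (assuming numpy and matplotlib are installed; the data values are hypothetical) plots one dot per (X, Y) pair:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: rainfall (X) and rice yield (Y)
x = np.array([20, 25, 30, 35, 40, 45, 50])
y = np.array([18, 22, 27, 30, 36, 39, 45])

plt.scatter(x, y)                        # one dot per (X, Y) pair
plt.xlabel("Amount of rainfall (X)")
plt.ylabel("Yield of rice per acre (Y)")
plt.title("Scatter diagram")
plt.show()
# Points rising from the lower left to the upper right in a narrow band
# suggest a high degree of positive correlation.
```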

Graphic Method
The individual values of the two variables are plotted, forming two individual curves
for X and Y.



By examining direction and closeness of the two curves, inference is drawn about
whether they are closely related or not.

If both curves are moving in the same direction, either upward or downward,
correlation is said to be positive.

If the curves are moving in the opposite direction, correlation is said to be negative.

This method is normally used when we are given data over a period of time (time
series).

However, as with the scatter diagram, we cannot get the exact degree of relatedness.
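A minimal sketch of the graphic method, assuming time-series data (both series below are hypothetical):

```python
import matplotlib.pyplot as plt

years = [2016, 2017, 2018, 2019, 2020]   # hypothetical period
x_series = [100, 110, 105, 120, 130]     # e.g., income
y_series = [80, 88, 85, 95, 104]         # e.g., expenditure

plt.plot(years, x_series, label="X (income)")
plt.plot(years, y_series, label="Y (expenditure)")
plt.xlabel("Year")
plt.legend()
plt.show()
# Both curves moving in the same direction suggests positive correlation;
# opposite directions would suggest negative correlation.
```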

Karl Pearson's Coefficient of Correlation


Widely used method, denoted by r. Also called the simple correlation coefficient or
zero-order correlation coefficient.

It is one of the very few symbols used universally to denote the degree of correlation.

$$r = \frac{\sum xy}{N \sigma_x \sigma_y}$$

where

$x = (X - \bar{X})$; $y = (Y - \bar{Y})$

$\sigma_x$ = standard deviation of X

$\sigma_y$ = standard deviation of Y

$N$ = number of pairs of observations

$r$ = correlation coefficient

This formula is only to be used when the deviations of X and Y are taken from
their ACTUAL mean and not an assumed mean.

The value of the correlation coefficient shall always lie between −1 and +1.

When r = +1, there is perfectly positive correlation, and when r = −1, there is
perfectly negative correlation. When r = 0, there is no relationship between the
variables.

However, in practice, these three values are rare.

The coefficient describes not only the magnitude but also direction of relationship.



Parameters for judging correlation (in absolute value):

|r| less than 0.5: low correlation

|r| between 0.5 and 0.8: medium correlation

|r| greater than 0.8: high correlation

Direct Method
Revised formula widely used:

$$r = \frac{\sum xy}{\sqrt{\sum x^2 \times \sum y^2}}$$

that is,

$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$$

It greatly simplifies the process as we do not have to calculate the standard deviations.
We can ONLY use the actual mean.
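A minimal sketch of the direct method in Python, assuming numpy is available (the data are hypothetical):

```python
import numpy as np

def pearson_r(X, Y):
    """Karl Pearson's coefficient of correlation by the direct method."""
    x = X - X.mean()            # deviations from the actual mean of X
    y = Y - Y.mean()            # deviations from the actual mean of Y
    return (x * y).sum() / np.sqrt((x**2).sum() * (y**2).sum())

X = np.array([20, 25, 30, 35, 40, 45, 50], dtype=float)
Y = np.array([18, 22, 27, 30, 36, 39, 45], dtype=float)
print(pearson_r(X, Y))          # close to +1: high positive correlation
# Cross-check against numpy's built-in:
print(np.corrcoef(X, Y)[0, 1])
```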

Assumed Mean Method

$$r = \frac{N \sum d_x d_y - (\sum d_x)(\sum d_y)}{\sqrt{N \sum d_x^2 - (\sum d_x)^2}\,\sqrt{N \sum d_y^2 - (\sum d_y)^2}}$$

where

$d_x$ = X − assumed mean of X

$d_y$ = Y − assumed mean of Y


Calculation of Correlation in Grouped Data

$$r = \frac{N \sum f d_x d_y - (\sum f d_x)(\sum f d_y)}{\sqrt{N \sum f d_x^2 - (\sum f d_x)^2}\,\sqrt{N \sum f d_y^2 - (\sum f d_y)^2}}$$

where $f$ is the cell frequency and $N = \sum f$ is the total frequency.
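A minimal sketch of the grouped-data formula, assuming a small hypothetical bivariate frequency table of step deviations:

```python
import numpy as np

# Hypothetical bivariate frequency table: rows index d_x, columns index d_y.
dx = np.array([-1, 0, 1])                 # step deviations of X mid-points
dy = np.array([-1, 0, 1])                 # step deviations of Y mid-points
f = np.array([[4, 2, 0],                  # f[i, j] = frequency of (dx[i], dy[j])
              [2, 6, 2],
              [0, 2, 4]])

N = f.sum()
s_fdx = (f.sum(axis=1) * dx).sum()        # sum of f*dx
s_fdy = (f.sum(axis=0) * dy).sum()        # sum of f*dy
s_fdx2 = (f.sum(axis=1) * dx**2).sum()    # sum of f*dx^2
s_fdy2 = (f.sum(axis=0) * dy**2).sum()    # sum of f*dy^2
s_fdxdy = (f * np.outer(dx, dy)).sum()    # sum of f*dx*dy

r = (N * s_fdxdy - s_fdx * s_fdy) / (
    np.sqrt(N * s_fdx2 - s_fdx**2) * np.sqrt(N * s_fdy2 - s_fdy**2))
print(r)   # positive here: frequencies are concentrated along the diagonal
```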


Assumptions underlying r
There is a linear relationship between the variables.

The variables are affected by a large number of independent causes so as to form a
normal distribution.

There is a cause and effect relationship between the forces affecting the distribution
of items in X and Y. If such a relationship is not formed between the variables, i.e.,
they are independent, there cannot be any correlation. For example, there is no
relationship between income and height because the forces that affect them are not common.



Merits and Limitations of r
Merits: Most popular for mathematically measuring correlation, summarizes both
magnitude as well as direction.
Limitations:

Always assumes linear relationship, regardless of whether it is true or not

Very often misinterpreted, and thus great care must be taken in interpreting it

Value of coefficient unduly affected by outliers

Time-consuming method of measuring correlation

Interpreting r
General rules to interpret r:

When r = +1, there is a perfectly positive relationship

When r = −1, there is a perfectly negative relationship

When r = 0, there is no relationship between the variables; they are uncorrelated

The closer r is to +1 or -1, the closer the relationship between the variables, and the
closer r is to 0, the less close the relationship. It is not safe to go beyond this with
just r.

The closeness of the relationship is not proportional to r . For example, if the value
of r is 0.8, it does not indicate a relationship twice as close as 0.4. It is in fact much
closer.

Properties of r
The coefficient of correlation lies between -1 and +1. Symbolically, $-1 \le r \le +1$.

The coefficient of correlation is independent of change of scale and origin of the
variables X and Y. (A change of origin means adding or subtracting a constant, and a
change of scale means multiplying or dividing by a constant. The mean, the standard
deviation, and thus the deviations from the mean are shifted or multiplied by that factor,
but the factor cancels between the numerator and the denominator.)



The coefficient of correlation is the geometric mean of the two regression
coefficients.

$$r = \sqrt{b_{xy} \times b_{yx}}$$

Each regression coefficient is the change in one variable per unit change in the other,
i.e., the slope $\hat{\beta}_2$, with the roles of X and Y interchanged.

The degree of relationship between the two variables is symmetric.

$$r_{xy} = r_{yx}$$

$$r_{xy} = \frac{\sum xy}{N \sigma_x \sigma_y} = \frac{\sum yx}{N \sigma_y \sigma_x} = r_{yx}$$
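A quick numerical check of the origin-and-scale property, assuming numpy and the hypothetical data used earlier:

```python
import numpy as np

X = np.array([20, 25, 30, 35, 40, 45, 50], dtype=float)
Y = np.array([18, 22, 27, 30, 36, 39, 45], dtype=float)

r_original = np.corrcoef(X, Y)[0, 1]
# Shift the origin of X by 100 and its scale by 2; shift Y's scale and origin too.
r_transformed = np.corrcoef(2 * X + 100, 0.5 * Y - 7)[0, 1]
print(np.isclose(r_original, r_transformed))   # True: r is unchanged
```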

Two other methods of estimating correlation are the Concurrent Deviation Method and
the Method of Least Squares.

2.4 Testing the Significance of the Correlation Coefficient

Suppose you computed r = 0.801 using n = 10 data points; df = n − 2 = 10 − 2 = 8. The
critical values associated with df = 8 are −0.632 and +0.632. If r is less than the
negative critical value or greater than the positive critical value, then r is significant.
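A minimal sketch of this critical-value test in Python, assuming scipy is available; the critical value for r follows from the t distribution as $r_{crit} = t_{crit}/\sqrt{t_{crit}^2 + df}$:

```python
from scipy import stats

r, n = 0.801, 10
df = n - 2
t_crit = stats.t.ppf(0.975, df)          # two-tailed test at alpha = 0.05
r_crit = t_crit / (t_crit**2 + df) ** 0.5
print(round(r_crit, 3))                  # 0.632, matching the table value
print(abs(r) > r_crit)                   # True: r is significant
```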

2.5 Introduction to Regression Analysis


Regression analysis is concerned with the study of the dependence of one variable
(the dependent variable: y) on one or more other variables (explanatory variables:
x), with a view to estimating and/or predicting the mean/average value of the
dependent variable in terms of known/fixed values of the explanatory variables.

We deal with stochastic variables, i.e., variables that are random and have some
intrinsic random variability within them.

Regression itself NEVER implies causation. As Kendall and Stuart said, ideas of
causation can only arise outside of statistics, from some related theory.

A statistical relationship in itself cannot logically imply causation. To ascribe
causality, one must appeal to a priori or theoretical considerations.



Differences between Regression Analysis and Correlation

| Basis for Comparison | Regression Analysis | Correlation |
| --- | --- | --- |
| Meaning | Regression is concerned with the study of the dependence of one variable (dependent) on one or more other variables (explanatory), with a view to estimating/predicting the mean/average value of the dependent variable in terms of fixed/known values of the explanatory variables. | Correlation is a measure of the degree of relatedness between variables; it measures the strength of linear association. |
| Treatment of Variables | There is an asymmetry between the variables: the dependent variable is generally taken to be stochastic/random and the explanatory variable is generally taken to have fixed values (as per definition). The theory of regression is conditional on this assumption that the dependent variable is stochastic and the explanatory variable is fixed/non-stochastic. | The two variables are treated symmetrically and there is no distinction between the dependent and explanatory variable. Both variables are assumed to be random. The theory of correlation is based on the assumption of randomness of both variables, which is why the correlation coefficient is symmetrical. |
| Studying Causation | Given that one variable is said to explain the other(s), it is possible to study causation with the AID of regression. Regression analysis alone cannot imply causation. | It is not possible to study causation with the help of correlation as there is no establishment of a dependent and an explanatory variable. They are simply related variables. |
| Symmetry of Coefficients | The regression coefficients b(y,x) and b(x,y) are not symmetric, because the first assumes y to be the dependent and x the explanatory variable, while the second assumes the opposite. | The correlation coefficients r(x,y) and r(y,x) are symmetric and thus equal, as it is immaterial which variable is dependent on the other. |
| Element of Chance | There is no such thing as nonsense regression. | There may be nonsense correlation between two variables purely due to chance, having no practical relevance. |
| Change in Origin and Scale | Regression coefficients are independent of change in origin but not of scale, since the predicted values of the dependent variable are expressed in terms of the original units of the explanatory variable. | Correlation coefficients are independent of change in both origin and scale. |

All the predicted mean values of Y with respect to fixed values of X are called
conditional expected values, as they depend on the given values of X. They are
denoted by $E(Y|X)$, read as the expected value of Y given the value of X. The
knowledge of X helps to better predict the value of Y.

This is distinguished from the unconditional expected value of Y, which is $\bar{Y}$,
i.e., the total of all observations divided by the number of observations.

If we plot the conditional expected values of Y against the various X values and join
them, we get the population regression line (PRL) or the population regression
curve. It is the regression of Y on X.

The population regression curve is the locus of the conditional means of the
dependent variable for the fixed values of the explanatory variable(s).

Population Regression Function

Each conditional mean $E(Y|X_i)$ is a function of $X_i$, where $X_i$ is a given value of X.

$$E(Y|X_i) = f(X_i)$$

where $f(X_i)$ denotes some function of the explanatory variable X.

This equation is called the conditional expectation function (CEF) or population
regression function (PRF) or population regression (PR) for short.

It states merely that the expected value of the distribution of Y given $X_i$ is
functionally related to $X_i$. It tells us how the mean or average response of Y varies
with X.

As a working hypothesis, assume that the PRF is a linear function of $X_i$, say:

$$E(Y|X_i) = \beta_1 + \beta_2 X_i$$

where $\beta_1$ and $\beta_2$ are unknown but fixed parameters known as the regression
coefficients. This equation is called the linear PRF.

$\beta_1$: intercept coefficient

$\beta_2$: slope coefficient

Stochastic Specification of PRF

Although the average values of Y may show some characteristics, there is some
deviation of an individual $Y_i$ around its expected value:

$$u_i = Y_i - E(Y|X_i)$$

or

$$Y_i = E(Y|X_i) + u_i$$

where the deviation $u_i$ is an unobservable random variable taking positive or
negative values and is called the stochastic disturbance or stochastic error term.

The equation can be interpreted as: the expenditure of an individual family, given its
income level, can be expressed as the sum of two components:

i. $E(Y|X_i)$, which is the mean consumption expenditure of all the families with
the same level of income, and is a systematic or deterministic component

ii. $u_i$, which is a random or stochastic or nonsystematic component. We can
assume it to be a surrogate or proxy for all the omitted or neglected variables
that may affect Y but are not/cannot be included in the PRF.

Thus, the equation may be written as:

$$Y_i = E(Y|X_i) + u_i$$

$$Y_i = \beta_1 + \beta_2 X_i + u_i$$

Sample Regression Function (SRF)

Mostly only samples are available for study, and thus a way must be found to best
estimate the PRF given this constraint.

It is not possible to accurately estimate the PRF from samples due to sampling
fluctuations.

For example, given two regression models drawn from two samples of the same
population, there is no way to tell for certain which model is more accurate. There
would be N different SRFs for N different samples, and these SRFs are not likely to
be the same.

Analogous to the PRF, there exists the sample regression function (SRF):

$$\hat{Y}_i = \hat{\beta}_1 + \hat{\beta}_2 X_i$$

where:

$\hat{Y}_i$ = estimator of $E(Y|X_i)$

$\hat{\beta}_1$ = estimator of $\beta_1$

$\hat{\beta}_2$ = estimator of $\beta_2$

An estimator, also known as a sample statistic, is a rule/formula/method that says
how to estimate the population parameter from the sample information. The
estimator is a function of random variables, and thus is random itself.

The particular numerical value obtained by the estimator when applied is called an
estimate. An estimate is non-random as it is a particular point value obtained from
the estimator.

Thus, in stochastic form, the SRF is:

$$Y_i = \hat{\beta}_1 + \hat{\beta}_2 X_i + \hat{u}_i$$

where $\hat{u}_i$ denotes the sample residual term.

Thus, the primary objective of regression analysis is to estimate the PRF:

$$Y_i = \beta_1 + \beta_2 X_i + u_i$$

on the basis of the SRF:

$$Y_i = \hat{\beta}_1 + \hat{\beta}_2 X_i + \hat{u}_i$$
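A minimal simulation sketch of this point, assuming a known (hypothetical) PRF $Y = 3 + 0.5X + u$: each sample drawn from the same population yields a different SRF.

```python
import numpy as np

rng = np.random.default_rng(0)
beta1, beta2 = 3.0, 0.5                   # hypothetical true PRF parameters
X = np.linspace(10, 100, 20)              # fixed X values (fixed regressor)

for sample in range(3):                   # three samples from the same population
    u = rng.normal(0, 5, size=X.size)     # stochastic disturbance term
    Y = beta1 + beta2 * X + u             # PRF plus error generates the sample
    x, y = X - X.mean(), Y - Y.mean()
    b2 = (x * y).sum() / (x**2).sum()     # SRF slope for this sample
    b1 = Y.mean() - b2 * X.mean()         # SRF intercept for this sample
    print(round(b1, 3), round(b2, 3))     # estimates differ from sample to sample
```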

2.6 Method of Least Squares - Estimation

The PRF is not directly observable and is estimated from the SRF:

$$Y_i = \hat{\beta}_1 + \hat{\beta}_2 X_i + \hat{u}_i = \hat{Y}_i + \hat{u}_i$$

where $\hat{Y}_i$ is the estimated conditional mean value of $Y_i$.

But to actually determine the SRF, express the above equation as:

$$\hat{u}_i = Y_i - \hat{Y}_i = Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_i$$

This shows that the residual term is simply the difference between the actual and
estimated Y values.

Given n pairs of observations on Y and X, the SRF must be as close as possible to
the actual Y.

A simple sum of residuals is not satisfactory, as all residuals are given equal weight
in the sum no matter how close they are to the SRF. This is avoided by fixing the
SRF in such a way that the quantity below is as small as possible:

$$\sum \hat{u}_i^2 = \sum (Y_i - \hat{Y}_i)^2 = \sum (Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_i)^2$$

By squaring $\hat{u}_i$, this method gives proportionately more weight to residuals that
lie farther from the SRF: the larger $\hat{u}_i$ is in absolute value, the larger $\hat{u}_i^2$.

This beats a simple sum of residuals, which could be small even when the residuals are
widely spread about the SRF, because positive and negative residuals cancel.

Direct Method
To estimate $\hat{\beta}_1$ and $\hat{\beta}_2$, differentiate the above sum with respect to
each estimator and set the derivatives to zero, which yields:

$$\sum Y_i = n\hat{\beta}_1 + \hat{\beta}_2 \sum X_i$$

$$\sum Y_i X_i = \hat{\beta}_1 \sum X_i + \hat{\beta}_2 \sum X_i^2$$

These simultaneous equations are called the normal equations.

Solving the above equations simultaneously gives the values of both estimators.

Indirect Method
By solving the normal equations simultaneously:

$$\hat{\beta}_2 = \frac{\sum x_i y_i}{\sum x_i^2}$$

$$\hat{\beta}_1 = \bar{Y} - \hat{\beta}_2 \bar{X}$$

where:

$\bar{X}$ = mean of X

$\bar{Y}$ = mean of Y

$x_i = X_i - \bar{X}$

$y_i = Y_i - \bar{Y}$

Estimators obtained from these methods are known as least-squares estimators, as they
are obtained using the least squares principle.
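A minimal sketch of these least-squares formulas in Python, assuming numpy and the same hypothetical data as the earlier sketches:

```python
import numpy as np

def ols_fit(X, Y):
    """Least-squares estimators for the simple linear SRF: Y = b1 + b2*X + u."""
    x = X - X.mean()                      # deviations from the mean of X
    y = Y - Y.mean()                      # deviations from the mean of Y
    b2 = (x * y).sum() / (x**2).sum()     # slope estimator
    b1 = Y.mean() - b2 * X.mean()         # intercept estimator
    return b1, b2

X = np.array([20, 25, 30, 35, 40, 45, 50], dtype=float)
Y = np.array([18, 22, 27, 30, 36, 39, 45], dtype=float)
b1, b2 = ols_fit(X, Y)
Y_hat = b1 + b2 * X                       # fitted (estimated conditional mean) values
resid = Y - Y_hat                         # sample residuals u-hat
print(b1, b2)
print(abs(resid.mean()) < 1e-10)          # True: the mean of the residuals is zero
```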

Properties of Least Squares Estimators, i.e., Regression Coefficients

1. They are expressed solely in observable/sample quantities of X and Y and are thus
easily computed.

2. They are point estimators, i.e., they provide only a single value, as compared to
interval estimators which provide a range of values.

3. Once the estimators are derived, they can be substituted in the equation to form the
SRF. The SRF thus derived has these properties:

   i. It passes through ($\bar{X}$, $\bar{Y}$). This is because the method of deriving the
   estimators involves both, as can be shown.

   ii. $\bar{\hat{Y}} = \bar{Y}$, by substituting in the equation.

   iii. The mean value of the residuals is zero.

Some Assumptions Underlying the Method of Least Squares

Linear Regression Model

The regression model is linear in parameters, even if not in variables.

$$Y_i = \beta_1 + \beta_2 X_i + u_i$$

This model can be extended to include more explanatory variables.

Fixed X values or X Values Independent of the Error Term

Values taken by X may be considered fixed in repeated samples (fixed
regressor) or may be sampled along with the dependent variable Y (stochastic
regressor).

Here, X values are considered to be non-stochastic.

Number of observations must be greater than the number of parameters

Alternatively, the number of observations must be greater than the number of
explanatory variables.

Sometimes the intercept term is not counted, as it does not vary.

The Nature of X Variables

If all the X values were identical, $X_i = \bar{X}$ and the denominator of the slope
formula would be 0, which would make it impossible to find the estimators.

There should also not be any outliers in X.

Precision or Standard Errors of Least-Squares Estimates

The variances and standard errors of the estimators can be obtained as follows:

$$\mathrm{var}(\hat{\beta}_2) = \frac{\sigma^2}{\sum x_i^2}$$

$$\mathrm{se}(\hat{\beta}_2) = \frac{\sigma}{\sqrt{\sum x_i^2}}$$

$$\mathrm{var}(\hat{\beta}_1) = \frac{\sum X_i^2}{n \sum x_i^2}\,\sigma^2$$

$$\mathrm{se}(\hat{\beta}_1) = \sqrt{\frac{\sum X_i^2}{n \sum x_i^2}}\,\sigma$$

where:

var = variance

se = standard error

$\sigma^2$ = constant variance of $u_i$

All values in the above can be estimated from the data except $\sigma^2$, which is
estimated by:

$$\hat{\sigma}^2 = \frac{\sum \hat{u}_i^2}{n-2}$$

where:

$\hat{\sigma}^2$ = estimator of the true but unknown $\sigma^2$

$n - 2$ = number of degrees of freedom

$\sum \hat{u}_i^2$ = residual sum of squares (RSS)

$\sum \hat{u}_i^2$ can be computed as:

$$\sum \hat{u}_i^2 = \sum y_i^2 - \hat{\beta}_2^2 \sum x_i^2$$

$$\hat{\sigma} = \sqrt{\frac{\sum \hat{u}_i^2}{n-2}}$$

$\hat{\sigma}$ is the standard error of estimate or the standard error of the regression (se). It is
the standard deviation of the Y values about the estimated regression line.
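A minimal sketch of these standard-error formulas, reusing the same hypothetical data:

```python
import numpy as np

# Hypothetical data and least-squares fit, as in the earlier sketches.
X = np.array([20, 25, 30, 35, 40, 45, 50], dtype=float)
Y = np.array([18, 22, 27, 30, 36, 39, 45], dtype=float)
n = len(X)
x, y = X - X.mean(), Y - Y.mean()
b2 = (x * y).sum() / (x**2).sum()
b1 = Y.mean() - b2 * X.mean()
resid = Y - (b1 + b2 * X)

sigma2_hat = (resid**2).sum() / (n - 2)          # sigma-hat^2 = RSS / (n - 2)
se_b2 = np.sqrt(sigma2_hat / (x**2).sum())       # standard error of the slope
se_b1 = np.sqrt(sigma2_hat * (X**2).sum() / (n * (x**2).sum()))  # se of intercept
print(se_b1, se_b2)
```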

2.7 Goodness of Fit Measures

The coefficient of determination ($r^2$) is a summary measure that says how well the
sample regression line fits the data.

By simplifying the CLRM,

$$\sum y_i^2 = \hat{\beta}_2^2 \sum x_i^2 + \sum \hat{u}_i^2$$

$$\mathrm{TSS} = \mathrm{ESS} + \mathrm{RSS}$$

where:

$\sum y_i^2 = \sum (Y_i - \bar{Y})^2$ = TSS = total sum of squares = total variation of the
actual Y values around their sample mean

$\hat{\beta}_2^2 \sum x_i^2 = \sum \hat{y}_i^2 = \sum (\hat{Y}_i - \bar{\hat{Y}})^2 = \sum (\hat{Y}_i - \bar{Y})^2$ = ESS = explained sum of
squares = sum of squares due to regression, i.e., due to the explanatory variable

$\sum \hat{u}_i^2$ = RSS = residual sum of squares = unexplained variation of Y about the
regression line

This shows that the total variation in observed Y values about their mean can be
partitioned into two parts: one attributable to the regression line and the other to
random forces, because not all actual Y observations lie on the fitted line.

$$\mathrm{TSS} = \mathrm{ESS} + \mathrm{RSS}$$

$$1 = \frac{\mathrm{ESS}}{\mathrm{TSS}} + \frac{\mathrm{RSS}}{\mathrm{TSS}}$$

$$1 = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} + \frac{\sum \hat{u}_i^2}{\sum (Y_i - \bar{Y})^2}$$

Thus, the goodness of fit measure $r^2$ can be defined as:

$$r^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\sum \hat{u}_i^2}{\sum (Y_i - \bar{Y})^2}$$

The coefficient of determination measures the proportion or percentage of the total
variation in Y explained by the regression model.

Properties of $r^2$:

It is a non-negative quantity.

$0 \le r^2 \le 1$. An $r^2$ of 1 means a perfect fit, i.e., $\hat{Y}_i = Y_i$ for each i. An
$r^2$ of 0 means that there is no relationship between the variables, i.e., $\hat{\beta}_2 = 0$.

$r^2$ can be computed quickly using this:

$$r^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} = \frac{\sum \hat{y}_i^2}{\sum y_i^2} = \frac{\hat{\beta}_2^2 \sum x_i^2}{\sum y_i^2} = \hat{\beta}_2^2 \left(\frac{\sum x_i^2}{\sum y_i^2}\right)$$
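A minimal sketch of the goodness-of-fit computation, reusing the same hypothetical data; the three forms of $r^2$ should agree:

```python
import numpy as np

X = np.array([20, 25, 30, 35, 40, 45, 50], dtype=float)
Y = np.array([18, 22, 27, 30, 36, 39, 45], dtype=float)
x, y = X - X.mean(), Y - Y.mean()
b2 = (x * y).sum() / (x**2).sum()
Y_hat = Y.mean() + b2 * x                 # fitted values of the SRF

TSS = (y**2).sum()                        # total sum of squares
ESS = ((Y_hat - Y.mean())**2).sum()       # explained sum of squares
RSS = ((Y - Y_hat)**2).sum()              # residual sum of squares
print(ESS / TSS, 1 - RSS / TSS)           # the two forms of r^2 agree
print(b2**2 * (x**2).sum() / TSS)         # the quick-computation form agrees too
```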



2.8 Testing Overall Significance of the Model - ANOVA

α, The Level of Significance & Interval Estimation

In estimation, sampling fluctuations can cause a single estimate to differ from the
true value, although in repeated sampling its mean value is expected to be equal
to the true value.

$$E(\hat{\beta}_2) = \beta_2$$

The reliability of a point estimator is measured by its standard error.

Thus, instead of relying on the point estimate alone, an interval may be constructed
around the point estimator, extending a small number of standard errors on either
side, such that, say, the interval has a 95% probability of containing the true value.

Assume we want to know how close $\hat{\beta}_2$ is to $\beta_2$, so we try to find two
positive numbers $\delta$ and $\alpha$, with $\alpha$ lying between 0 and 1, such that the
probability that the random interval $(\hat{\beta}_2 - \delta, \hat{\beta}_2 + \delta)$ contains
the true $\beta_2$ is $1 - \alpha$.

Symbolically,

$$\Pr(\hat{\beta}_2 - \delta \le \beta_2 \le \hat{\beta}_2 + \delta) = 1 - \alpha$$

Such an interval is called a confidence interval; $1 - \alpha$ is known as the confidence
coefficient and $\alpha$ is known as the level of significance.

For example, if $\alpha = 0.05$ or 5%, it would be read as: the probability that the
random interval shown includes the true $\beta_2$ is 0.95 or 95%.
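A minimal sketch of such a confidence interval for $\beta_2$, assuming scipy and reusing the same hypothetical data:

```python
import numpy as np
from scipy import stats

X = np.array([20, 25, 30, 35, 40, 45, 50], dtype=float)
Y = np.array([18, 22, 27, 30, 36, 39, 45], dtype=float)
n = len(X)
x, y = X - X.mean(), Y - Y.mean()
b2 = (x * y).sum() / (x**2).sum()
resid = y - b2 * x                            # residuals (intercept cancels in deviations)
se_b2 = np.sqrt((resid**2).sum() / (n - 2) / (x**2).sum())

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, n - 2)    # two-tailed critical value, df = n - 2
delta = t_crit * se_b2
print(b2 - delta, b2 + delta)                 # 95% confidence interval for beta_2
```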

Hypothesis Testing: The Confidence-Interval Approach


The stated hypothesis is known as the null hypothesis and is denoted by the symbol
$H_0$.

The null hypothesis is usually tested against an alternative hypothesis (maintained
hypothesis) denoted $H_1$.

The null hypothesis is always a simple hypothesis, whereas the alternative
hypothesis is usually composite; such an alternative is known as a two-sided hypothesis.



ANOVA Table, Analysis of Variance

Ref: Degrees of Freedom


It refers to the number of classes to which values can be assigned arbitrarily without
violating the restrictions or limitations placed.

For example, to choose any five numbers whose total is 100, there is independent choice
only for four numbers, as the fifth has the restriction that it should be 100 minus the
remaining chosen numbers. Choice was reduced due to the one restriction placed, i.e.,
df = 5 − 1 = 4.

Similarly, if there are 10 classes into which frequencies must be assigned such that
the number of cases, the mean and the standard deviation agree with the original
distribution, then there are three restrictions placed and thus df = 10 − 3 = 7.

Thus, df = $\nu$.

The term number of degrees of freedom means the total number of observations in
the sample (= n) less the number of independent (linear) constraints or restrictions put
on them. In other words, it is the number of independent observations out of a total of
n observations.

For example, before the RSS can be computed, $\hat{\beta}_1$ and $\hat{\beta}_2$ must first
be obtained. These two estimates therefore put two restrictions on the RSS. Therefore,
there are n − 2, not n, independent observations from which to compute the RSS. The
general rule is: df = (n − number of parameters estimated).



Regression Full Sum: Statements

The estimated regression model is specified as follows:

$$\text{(Y in words)} = \hat{\beta}_1 + \hat{\beta}_2\,\text{(X in words)}$$

The slope coefficient is positive/negative. This shows that there is a direct/inverse
relationship between X in words and Y in words. For every additional unit of X, the
average value of Y increases/decreases by $\hat{\beta}_2$.

$r^2 = 0.ab$ suggests that ab% of the variation in average Y is explained by X.

To test the overall significance of the model, the following hypotheses are stated:

$H_0: \beta_2 = 0$ (There is no significant relationship between Y and X in words)

$H_1: \beta_2 \neq 0$ (There is a significant relationship between X and Y in words)

Since the F value computed is greater than/less than the F-table value at the α =
0.05/0.01 level of significance for degrees of freedom $df_1$ and $df_2$, the sample
evidence shows that either the model is highly statistically significant and the null
hypothesis $H_0$ is rejected, or the model is statistically insignificant, there is no
significant relationship between Y and X in words, and the alternative or maintained
hypothesis $H_1$ is rejected.
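A minimal sketch of the overall F test, assuming scipy and reusing the same hypothetical data:

```python
import numpy as np
from scipy import stats

X = np.array([20, 25, 30, 35, 40, 45, 50], dtype=float)
Y = np.array([18, 22, 27, 30, 36, 39, 45], dtype=float)
n = len(X)
x, y = X - X.mean(), Y - Y.mean()
b2 = (x * y).sum() / (x**2).sum()

ESS = b2**2 * (x**2).sum()               # explained sum of squares
RSS = (y**2).sum() - ESS                 # residual sum of squares (TSS - ESS)
df1, df2 = 1, n - 2                      # one slope parameter; n - 2 residual df
F = (ESS / df1) / (RSS / df2)            # F statistic for H0: beta_2 = 0
F_table = stats.f.ppf(0.95, df1, df2)    # table value at alpha = 0.05
print(F > F_table)                       # True here => reject H0; model is significant
```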

