Correlation and Regression 2

The document discusses the relationship between two interval scale variables and the use of correlation and regression analysis to understand and predict the relationship. It defines linear regression as estimating the means of the dependent variable (Y) for different values of the independent variable (X) to find the best fitting regression line. The key assumptions of the linear regression model are that the relationship is linear, the variables are normally distributed, and the variances are equal. The method of least squares is described as the approach to estimating the regression line parameters (a and b) by minimizing the vertical distances between the data points and the line. An example calculation is provided. Correlation is then introduced as a measure of the strength of the relationship between the variables.


CORRELATION AND REGRESSION
RELATIONSHIP BETWEEN TWO INTERVAL SCALES

Ramakant Agrawal 14-09-2020


Correlation and regression
Introduction
• Relationship between two interval scales, e.g. the number of years of education completed (X) and the annual income of adults (Y), or the percentage of the labor force engaged in manufacturing (X) and a city’s population growth (Y).
• We may be interested not only in measures of the degree of relationship but also in describing the nature of the relationship between the two variables, so that we can predict the value of one variable if we know the other, e.g. we may want to predict a person’s future income from his/her education.
• When interest is focused primarily on the exploratory task of finding out which variables are related to a given variable, we
are likely to be mainly interested in measures of degree or strength of relationship such as correlation coefficients. On the
other hand, once we have found the significant variables we are more likely to turn our attention to regression analysis in
which we attempt to predict the exact value of one variable from the other.
• It will be advisable to begin our discussion with the prediction problem, because the notion of regression is both logically prior to and theoretically more important than that of correlation. After discussing the prediction problem, we shall turn our attention to measuring the strength of relationship.
Linear Regression and Least Squares
The Prediction Problem

• The ultimate goal of all sciences is that of prediction. This does not imply, of course, that one is only secondarily interested in “understanding” why two or more variables are interrelated as they are. Perhaps it is correct to say that such “understanding” is the ultimate goal and that, to the degree that understanding becomes perfected, prediction becomes more and more accurate.
• Suppose there is a dependent variable Y which is to be predicted from an independent variable X. In some problems X will clearly precede Y in time. For example, a person usually completes his education before earning his income. We want to be careful not to imply a necessary or causal relationship, or that X is the only variable influencing the value of Y. We may be equally interested in predicting Y from X or X from Y. Let us assume, however, that Y is taken as the dependent variable.
• If X and Y are independent, we cannot predict Y from X; or, more exactly, knowledge of X does not improve our prediction of Y. For example, we may wish to estimate a person’s future income, given that he/she has completed three years of college. Without this knowledge of education, our best guess would be the mean income of all adults. Knowing his/her education, however, ought to enable us to obtain a better prediction.
The Regression Equation
Conceptualising the Problem

• Let us conceptualize the problem in the following way. We imagine that for every fixed value of the
independent variable X (education) we have a distribution of Y’s (incomes). In other words, for each
educational level there will be a certain income distribution in the population. Not all persons who
have completed high school will have exactly the same income, but these incomes will be distributed
about some mean. There will be similar income distributions for school pass outs, college graduates,
post-graduates, etc.
• Each of these separate income distributions (for fixed X’s) will have a mean, and we can plot the position of these means in the familiar X-Y coordinates. We refer to the resulting path of these means of the Y’s for fixed X’s as the regression equation of Y on X. Diagram 17.1 is shown on the next slide. In Fig. 17.1 we have indicated the general nature of regression equations as involving the paths of the means of Y values for given values of X.
ASSUMPTIONS OF THE REGRESSION MODEL
• That the form of the regression equation is linear,
• That the distributions of the Y values for each X are normal, and
• That the variances of the Y distributions are the same for each value of X.
• If the regression of Y on X is linear, or a straight-line relationship, we can write an equation as follows:

Y = 𝜶 + βX
• Where both 𝜶 and β are constants. Greek letters have been used since for the present we are dealing with the total
population. If we set X equal to zero, we see that Y= 𝜶. Therefore, 𝜶 represents the point where the regression line
crosses the Y axis (i.e., where X = 0).
• The slope of the regression line is given by β since this constant indicates the magnitude of the change in Y for a
given change in X.
Assumptions Contd….
• It was indicated that we shall assume that the Y’s are distributed normally about each value of X. It will also be
convenient to assume that for each fixed value of Y the X’s are also distributed normally. We say that the joint
distribution of X and Y is a bivariate normal distribution, meaning that there are two variables, each of which is
distributed about the other normally.
• The bivariate normal distribution has the property that the regression of Y on X is linear. Therefore if we have a
bivariate normal distribution, we know that if we trace the means of Y’s for each X the result will be a straight line.
It does not follow, however, that if the regression is linear, the joint distribution is necessarily bivariate normal.
Bivariate normality will have to be assumed when we come to tests of significance.
• We shall also need to assume that the standard deviations of the Y’s for each X are the same regardless of the value of X. If the joint distribution is bivariate normal, the standard deviations of the Y’s for each X will in fact all be identical. This property of equal variances is referred to as homoscedasticity.
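These properties can be illustrated with a small simulation. The sketch below uses hypothetical parameter values (𝜶 = 2, β = 0.5, σ = 1, none of which come from the text) to generate bivariate normal data, then checks that the means of the Y’s within narrow bands of X lie close to a straight line, and that the spread of the Y’s is roughly the same in every band (homoscedasticity):

```python
import random
import statistics

random.seed(1)

# Hypothetical population parameters (illustrative only, not from the text).
# Y has a linear conditional mean in X, and the noise standard deviation
# does not depend on X -- this is the homoscedasticity assumption.
alpha, beta, sigma = 2.0, 0.5, 1.0

data = []
for _ in range(20000):
    x = random.gauss(0, 1)
    y = alpha + beta * x + random.gauss(0, sigma)
    data.append((x, y))

# Group the Y values into narrow bands of X and inspect the conditional
# means (they should track the line) and standard deviations (roughly equal).
for lo in (-1.0, 0.0, 1.0):
    band = [y for x, y in data if lo <= x < lo + 0.5]
    mid = lo + 0.25  # midpoint of the band
    print(f"band [{lo:.1f}, {lo + 0.5:.1f}): mean Y = {statistics.mean(band):.2f} "
          f"(line predicts {alpha + beta * mid:.2f}), "
          f"SD = {statistics.stdev(band):.2f}")
```

The printed conditional means should fall near the line 2 + 0.5X, and the standard deviations should all be near 1, in line with the assumptions above.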
LINEAR LEAST SQUARES
STEPS TO FOLLOW

• Draw a scatter diagram from the data pairs of X and Y. This will help identify whether an
association exists or not.
• Having plotted the scores on a scattergram, approximate these points by some sort of a best-
fitting curve
• One way of doing this is to draw a curve (in this case a straight line) by inspection (diagram in the next slide). There are other, more precise methods of doing this, however. One of these is the method of least squares, which will be discussed in the present section.
Least Squares Theory
• This means that we shall fit the data with a best-fitting straight line according to the least-squares criterion, getting an equation of the form

Ŷ = a + bX
• It will then turn out that the a and b obtained by this method are the most efficient unbiased estimates of the
population parameters 𝜶 and β if the regression equation actually is a straight line.
• The least-squares criterion involves our finding the unique straight line which has the property that the sum of
the squares of the deviations of the actual Y values from this line is a minimum.
• Thus if we draw vertical lines from each of the points to the least-squares line, and if we square these
distances and add, the resulting sum will be less than a comparable sum of squares from any other possible
straight line (see diagram in the next slide).
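The criterion can be checked numerically. In this sketch (with hypothetical data points, not the text’s Table 17.1), a brute-force search over a grid of candidate lines shows that a single line minimizes the sum of squared vertical deviations:

```python
# Hypothetical data points (illustrative only, not from the text).
pts = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6), (6, 7)]

def sse(a, b):
    """Sum of squared vertical deviations of the points from the line Y = a + b*X."""
    return sum((y - (a + b * x)) ** 2 for x, y in pts)

# Brute-force search over a grid of candidate intercepts and slopes;
# the least-squares line is the one with the smallest sum of squares.
best = min(
    ((a / 100, b / 100) for a in range(0, 301) for b in range(0, 201)),
    key=lambda ab: sse(*ab),
)
print(best)  # close to the analytic least-squares solution for these points
```

Of course, the grid search is only for illustration; the least-squares line is found directly from the formulas for a and b rather than by search.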
Formulae to compute a and b

b = (NΣXY − ΣX ΣY) / (NΣX² − (ΣX)²)

a = (ΣY − bΣX) / N = Ȳ − bX̄
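These formulae can be sketched directly in code. The (X, Y) values below are hypothetical stand-ins, since the actual Table 17.1 figures are not reproduced here:

```python
# Hypothetical (X, Y) pairs standing in for Table 17.1 (illustrative only).
xs = [10, 15, 20, 25, 30]       # e.g. per cent minority
ys = [150, 280, 390, 520, 610]  # e.g. income differential

# The sums that, together with N, are all we need for regression.
n = len(xs)
sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

# Least-squares slope and intercept from the computational formulae.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

print(f"b = {b:.2f}, a = {a:.2f}")
```

Note that only four of the five sums mentioned in the text are needed for a and b; the fifth, ΣY², is used later for the correlation coefficient.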
Numerical Example of Regression Equation Estimation

• Suppose we have the data given in Table 17.1 below, with X representing the percentage of minorities in certain
North Indian cities and Y indicating the difference between majority and minority median incomes as a measure of
economic discrimination.

• From the raw data we can compute five sums which, together with N, are all that we need in order to handle
regression and correlation problems. All but one of these sums will be used in computing a and b. Computations can
be summarized as follows:
Predicting from the regression equation

• Since discrimination scores indicate differences (in dollars) between the median incomes of
Majority and Minority, we see that an increase of 1 per cent Minority corresponds to a
difference of $19.93 in the median incomes of majority and minority. A scattergram and the
least-squares equation have been drawn in Fig. 17.6. To illustrate the use of such a prediction
equation, if we knew that there were 8 per cent minorities in a given city, the estimated median
income differential would be
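A prediction from such an equation can be sketched as follows. The slope of 19.93 is the value stated in the text; the intercept used here is a hypothetical placeholder, since the computations from Table 17.1 are not reproduced on this slide:

```python
# Slope from the text; the intercept is a HYPOTHETICAL placeholder value,
# used only to make the prediction function runnable for illustration.
b = 19.93
a = 50.0  # hypothetical intercept, for illustration only

def predict(percent_minority):
    """Estimated median income differential (dollars) for a given per cent minority."""
    return a + b * percent_minority

print(f"${predict(8):.2f}")  # estimate for a city with 8 per cent minority
```

With the actual intercept computed from Table 17.1 in place of the placeholder, this reproduces the estimate discussed in the text.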
Figure 17.6: Scattergram and least-squares line for data of Table 17.1 (horizontal axis: per cent minority).
Correlation

• It is necessary to know the degree or strength of the relationship. Obviously, if the relationship is very weak there is no point in trying to predict Y from X.
• Correlations of a very high order are necessary for even moderately accurate prediction.
• The correlation coefficient was introduced by Karl Pearson and is often referred to as product-
moment correlation in order to distinguish it from other measures of association.
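The product-moment coefficient can be computed from the same sums used for regression, plus the fifth sum ΣY². A sketch with hypothetical paired observations:

```python
import math

# Hypothetical paired observations (illustrative only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
sx, sy = sum(xs), sum(ys)
sxy = sum(x * y for x, y in zip(xs, ys))
sxx = sum(x * x for x in xs)
syy = sum(y * y for y in ys)  # the fifth sum, needed only for correlation

# Pearson product-moment correlation coefficient.
r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(f"r = {r:.3f}")
```

Note that the numerator is the same quantity as the numerator of the slope b, which reflects the close connection between regression and correlation: r measures how strong the linear relationship is, while b gives its magnitude.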
THANK YOU