
Regression and correlation

Regression analysis, in the general sense, means the estimation or prediction of the unknown values of one variable from known values of another variable. In regression analysis there are two types of variables. The variable whose value is influenced or to be predicted is called the dependent (regressed or explained) variable, and the variable which influences the values or is used for prediction is called the independent variable (regressor, predictor or explanatory variable). If the regression curve is a straight line, we say that there is a linear relationship between the variables under study; otherwise the relationship is non-linear. When only two variables are involved, the functional relationship is known as simple regression. If the relationship between the two variables is a straight line, it is known as simple linear regression; otherwise it is called simple non-linear regression. When there are more than two variables and one of them is assumed to depend upon the others, the functional relationship between the variables is known as multiple regression. Examples: Regression analysis is performed if one wants to know the relationship between:
a) Income vs consumption
b) Blood pressure and age
c) Skin response and concentration
d) Industrial production vs consumption of electricity
e) Yield of crops vs amount of rainfall, type of fertilizer, humidity, etc.
Types Of Regression:
Regression analysis can be classified into:
a) Simple and Multiple
b) Linear and Non-Linear
c) Total and Partial
a) Simple and Multiple: In the case of a simple relationship only two variables are considered, for example, skin response and concentration. In the case of a multiple relationship, more than two variables are involved; one variable is the dependent variable and the remaining variables are independent ones, for example, blood pressure, age and weight.
b) Linear and Non-linear: Linear relationships are based on a straight-line trend, the equation of which has no power higher than one. But remember, a linear relationship can be both simple and multiple. Normally a linear relationship is taken into account because, besides its simplicity, it has better predictive value: a linear trend can be easily projected into the future. In the case of a non-linear relationship, curved trend lines are derived; the equations of these are, for example, parabolic.

c) Total and Partial: In the case of total relationships all the important variables are considered; normally, they take the form of a multiple relationship. In the case of a partial relationship one or more variables are considered, but not all, thus excluding the influence of those not found relevant for a given purpose.
Methods of regression analysis

A simple linear regression can be positive or negative. A positive regression, β1 > 0, is represented by an upward sloping line, and y increases as x increases. A negative regression, β1 < 0, is represented by a downward sloping line, and y decreases as x increases. A regression with slope β1 = 0 indicates no linear relationship between the variables.

Two main applications of regression analysis are:


1) Estimation of a function of dependency between variables
2) Prediction of future measurements or means of the dependent variable using new measurements of the
independent variable(s).

The Simple Regression Model

A regression that explains linear change of a dependent variable based on changes of one independent
variable is called a simple linear regression. For example, the weight of cows can be predicted by using
measurements of heart girth. The aim is to determine a linear function that will explain changes in weight
as heart girth changes. Heart girth is the independent variable and weight is the dependent variable. To
estimate the function it is necessary to choose a sample of cows and to measure the heart girth and weight
of each cow. In other words, pairs of measurements of the dependent variable y and independent variable x
are needed.

In this example it can be assumed that the relationship between the x and y variables is linear and that each
value of variable y can be shown using the following model:
y = β0 + β1x + ε
where:
y = dependent variable
x = independent variable
β0, β1 = regression parameters
ε = random error (y − ŷ)
Here, β0 and β1 are unknown constants called regression parameters. They describe the location and shape
of the linear function. Often, the parameter β1 is called the regression coefficient, because it explains the
slope. The random error ε is included in the model because changes of the values of the dependent variable
are usually not completely explained by changes of values of the independent variable, but there is also an
unknown part of that change. The random error describes deviations from the model due to factors
unaccounted for in the equation, for example, differences among animals, environments, measurement
errors, etc. Generally, a mathematical model in which we allow the existence of random error is called a statistical model. If a model exactly describes the dependent variable by using a mathematical function of the independent variable, the model is deterministic. For example, if the relationship is linear, the model is:
y = β0 + β1x
Note again, the existence of random deviations is the main difference between the deterministic and the
statistical model. In the deterministic model the x variable exactly explains the y variable, and in the
statistical model the x variable explains the y variable, but with random error. A regression model uses
pairs of measurements (x1,y1),(x2,y2),...,(xn,yn). According to the model each observation yi can be shown as:
yi = β0 + β1xi + εi i = 1,..., n

that is:
y1 = β0 + β1x1 + ε1
y2 = β0 + β1x2 + ε2
...
yn = β0 + β1xn + εn
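
To make the contrast between the statistical and the deterministic model concrete, here is a minimal Python sketch that generates observations according to yi = β0 + β1xi + εi; all parameter values and the x grid are hypothetical, chosen only for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true parameters; illustration only.
beta0, beta1, sigma = 10.0, 2.5, 3.0

x = np.linspace(0, 10, 20)                 # values of the independent variable
eps = rng.normal(0.0, sigma, size=x.size)  # random errors, eps ~ N(0, sigma^2)
y = beta0 + beta1 * x + eps                # statistical model: y = b0 + b1*x + e
y_det = beta0 + beta1 * x                  # deterministic model: no random error

Removing eps collapses the statistical model to the deterministic one; the scatter of y around y_det is exactly the random error the text describes.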
For a regression model, assumptions and properties must also be defined. The assumptions describe the expectation and variance of the random error.
Model assumptions:
1) There is a linear relationship between the dependent variable y and the explanatory variable x
2) E(εi) = 0, the mean of the errors is equal to zero
3) Var(εi) = σ², the variance is constant for every εi, that is, the variance is homogeneous
4) Cov(εi, εi') = 0, i ≠ i', the errors are independent, the covariance between them is zero
5) Usually, it is assumed that the εi are normally distributed, εi ~ N(0, σ²). When that assumption is met, the regression model is said to be normal.
The following model properties follow directly from these model assumptions.
Model properties:
1) E(yi|xi) = β0 + β1xi, for a given value xi, the expected mean of yi is β0 + β1xi
2) Var(yi) = σ², the variance of any yi is equal to the variance of εi and is homogeneous
3) Cov(yi, yi') = 0, i ≠ i', the yi are independent, the covariance between them is zero.
The expectation (mean) of the dependent variable y for a given value of x, denoted by E(y|x), is a straight line. Often, the mean of y for given x is also denoted by μy|x.
Estimation of the Regression Parameters – Least Squares Estimation
The most widely applied method for estimation of parameters in linear regression is least squares estimation. The least squares estimators b0 and b1 for a given set of observations from a sample are the estimators that minimize the sum of squared deviations:
∑ εi² = ∑ (yi − b0 − b1xi)²
Therefore, the regression line E(y|x) is unknown, but can be estimated by using a sample with:
ŷ = b0 + b1x
where
b1 = ∑(xi − x̄)(yi − ȳ) / ∑(xi − x̄)²
b0 = ȳ − b1x̄
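A minimal sketch of these estimators in Python (using NumPy); the paired measurements below are hypothetical placeholders, not data from the text.

import numpy as np

# Hypothetical pairs (x = heart girth in cm, y = weight in kg); illustration only.
x = np.array([170.0, 175.0, 180.0, 185.0, 190.0, 195.0])
y = np.array([430.0, 455.0, 470.0, 500.0, 510.0, 540.0])

# Least squares estimators: b1 = SSxy / SSxx, b0 = ybar - b1 * xbar
xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

y_hat = b0 + b1 * x        # estimated regression line evaluated at each x
residuals = y - y_hat      # sample estimates of the random errors
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")

The same estimates can be checked against np.polyfit(x, y, 1), which fits a first-degree polynomial by least squares.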
a) Determine the least squares regression equation of blood pressure on age of women.
b) Estimate the blood pressure of a woman whose age is 45 years.

Coefficient of determination
It is defined as the proportion of the variation in the dependent variable Y that is explained, or accounted for, by the variation of the independent variable X. Its value is the square of the coefficient of correlation; thus we denote it by r², and it is usually expressed in the form of a percentage.
Example 1: Compute and interpret the coefficient of determination for the example on age and blood pressure.
Solution: Given that the simple correlation coefficient between blood pressure and age is 0.89, the coefficient of determination is the square of the correlation coefficient: r² = (0.89)² = 0.7921, or 79.21%. This implies that 79.21% of the variation in the blood pressure of women is accounted for by the variation in their age.
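
This arithmetic is easy to reproduce; the sketch below computes r and r² in Python, with hypothetical (age, blood pressure) pairs standing in for the data of the example.

import numpy as np

# Hypothetical (age, blood pressure) pairs; illustration only.
age = np.array([35, 40, 45, 50, 55, 60], dtype=float)
bp = np.array([118, 125, 130, 138, 146, 152], dtype=float)

r = np.corrcoef(age, bp)[0, 1]   # Pearson correlation coefficient
r2 = r ** 2                      # coefficient of determination
print(f"r = {r:.2f}, r^2 = {r2:.4f} ({100 * r2:.2f}% of variation explained)")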
Student t test in Testing Hypotheses about the Parameters
If changing the variable x produces a change in the variable y, then the regression line has a slope, that is, the parameter β1 is different from zero. To test whether there is a regression in a population, the following hypotheses about the parameter β1 are stated:
1. H0: β1 = 0
H1: β1 ≠ 0
The null hypothesis H0 states that the slope of the regression line is not different from zero, and that there is no linear association between the variables. The alternative hypothesis H1 states that the regression line is not horizontal and there is a linear association between the variables.
2. Fix the level of significance, α.
3. Assuming that the dependent variable y is normally distributed, the hypotheses about the parameter β1 can be tested using a t distribution. It can be proved that the test statistic
t = b1 / s(b1)
has a t distribution with (n − 2) degrees of freedom, where s(b1) = √(MSE / ∑(xi − x̄)²) is the estimated standard error of b1 and MSE = ∑(yi − ŷi)² / (n − 2). (A computational sketch follows this list.)
4. Find the critical value, tα/2,(n-2)


5. Decision: The null hypothesis H0 is rejected if the computed value from a sample |t| is “large”. For a
level of significance α, H0 is rejected if |t| ≥ tα/2,(n-2), where tα/2,(n-2) is a critical value.
6. Give conclusion.
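The whole procedure can be sketched in a few lines of Python; SciPy supplies the critical value, and the data arrays are hypothetical placeholders.

import numpy as np
from scipy import stats

# Hypothetical (heart girth, weight) pairs; illustration only.
x = np.array([170, 175, 180, 185, 190, 195], dtype=float)
y = np.array([430, 455, 470, 500, 510, 540], dtype=float)
n = len(x)

# Least squares estimates of intercept and slope
xbar, ybar = x.mean(), y.mean()
ss_xx = np.sum((x - xbar) ** 2)
b1 = np.sum((x - xbar) * (y - ybar)) / ss_xx
b0 = ybar - b1 * xbar

# Residual mean square and standard error of the slope
mse = np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2)
se_b1 = np.sqrt(mse / ss_xx)

t_stat = b1 / se_b1                             # test statistic for H0: beta1 = 0
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)    # critical value, alpha = 0.05
print(f"t = {t_stat:.3f}, critical value = {t_crit:.3f}")
print("Reject H0" if abs(t_stat) >= t_crit else "Fail to reject H0")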
Example: Consider the following data on weight and heart girth of cows. Test the hypothesis about the regression slope.

Solution

Correlation
INTRODUCTION
Statistical methods of measures of central tendency, dispersion, skewness and kurtosis are helpful for the
purpose of comparison and analysis of distributions involving only one variable i.e. univariate
distributions. However, describing the relationship between two or more variables is another important
part of statistics.
Correlation is a measure of association between two or more variables.
“The correlation between variables is a measure of the nature and degree of association between the
variables”. As a measure of the degree of relatedness of two variables, correlation is widely used in
exploratory research when the objective is to locate variables that might be related in some way to the
variable of interest.
TYPES OF CORRELATION
Correlation can be classified in several ways. The important ways of classifying correlation are:
 Positive and negative,
 Linear and non-linear (curvilinear) and
 Simple, partial and multiple.
Positive and Negative Correlation
If both the variables move in the same direction, we say that there is a positive correlation, i.e., if one
variable increases, the other variable also increases on an average or if one variable decreases, the other
variable also decreases on an average. On the other hand, if the variables are varying in opposite direction,
we say that it is a case of negative correlation; e.g., movements of demand and supply.
Linear and Non-linear (Curvilinear) Correlation
If the change in one variable is accompanied by change in another variable in a constant ratio,
it is a case of linear correlation. On the other hand, if the amount of change in one variable does not follow
a constant ratio with the change in another variable, it is a case of non-linear or curvilinear correlation.
Simple, Partial and Multiple Correlation
The distinction amongst these three types of correlation depends upon the number of variables involved in
a study. If only two variables are involved in a study, then the correlation is said to be simple correlation.
When three or more variables are involved in a study, then it is a problem of either partial or multiple correlation. In multiple correlation, three or more variables are studied simultaneously. But in partial correlation we consider only two variables influencing each other, while the effect of the other variable(s) is held constant.
Correlation analysis, in discovering the nature and degree of relationship between variables, does not necessarily imply any cause-and-effect relationship between the variables. Two variables may be related to each other, but this does not mean that one variable causes the other. In other words, causation always implies correlation; however, the converse is not true.
CORRELATION ANALYSIS
Correlation Analysis is a statistical technique used to indicate the nature and degree of relationship existing
between one variable and the other(s). It is also used along with regression analysis to measure how well
the regression line explains the variations of the dependent variable with the independent variable.
The commonly used methods for studying linear relationship between two variables involve both graphic
and algebraic methods. Some of the widely used methods include:
1. Scatter Diagram
2. Correlation Graph
3. Pearson’s Coefficient of Correlation
4. Spearman’s Rank Correlation
SCATTER DIAGRAM
This method is also known as Dotogram or Dot diagram. Scatter diagram is one of the simplest methods of
diagrammatic representation of a bivariate distribution. Under this method, both the variables are plotted on
the graph paper by putting dots. The diagram so obtained is called a "Scatter Diagram". By studying the diagram, we can get a rough idea about the nature and degree of relationship between the two variables. The term scatter refers to the spreading of dots on the graph. We should keep the following points in mind
while interpreting correlation:
 if the plotted points are very close to each other, it indicates a high degree of correlation. If the plotted points are far away from each other, it indicates a low degree of correlation.
 if the points on the diagram reveal any trend (either upward or downward), the variables are said to
be correlated and if no trend is revealed, the variables are uncorrelated.
 if there is an upward trend rising from lower left hand corner and going upward to the upper right
hand corner, the correlation is positive since this reveals that the values of the two variables move
in the same direction. If, on the other hand, the points depict a downward trend from the upper left
hand corner to the lower right hand corner, the correlation is negative since in this case the values
of the two variables move in the opposite directions.
 in particular, if all the points lie on a straight line starting from the left bottom and going up towards the right top, the correlation is perfect and positive, and if all the points lie on a straight line starting from the left top and coming down to the right bottom, the correlation is perfect and negative.
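A minimal matplotlib sketch of a scatter diagram; the two lists are hypothetical placeholders, and an upward-trending cloud of dots would suggest positive correlation.

import matplotlib.pyplot as plt

# Hypothetical bivariate data; illustration only.
x = [35, 40, 45, 50, 55, 60]
y = [118, 125, 130, 138, 146, 152]

plt.scatter(x, y)              # each dot represents one (x, y) pair
plt.xlabel("Variable X")
plt.ylabel("Variable Y")
plt.title("Scatter Diagram")
plt.show()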
PEARSON'S COEFFICIENT OF CORRELATION
Pearson's coefficient of correlation between two variables x and y is defined as:
r = ∑(xi − x̄)(yi − ȳ) / √[∑(xi − x̄)² ∑(yi − ȳ)²]
Properties of Pearsonian Correlation Coefficient

The following are important properties of Pearsonian correlation coefficient:


1. The Pearsonian correlation coefficient cannot exceed 1 numerically. In other words, it lies between –1 and +1. Symbolically, −1 ≤ r ≤ 1.
2. The sign of r indicates the nature of the correlation. A positive value of r indicates positive correlation, whereas a negative value indicates negative correlation; r = 0 indicates the absence of correlation. The following table sums up the degrees of correlation corresponding to various values of r:

3. Two independent variables are uncorrelated, but the converse is not true. If X and Y are independent variables, then rxy = 0. However, the converse of this theorem is not true, i.e., uncorrelated variables need not necessarily be independent.
4. The square of the Pearsonian correlation coefficient is known as the coefficient of determination. The coefficient of determination, which measures the percentage of variation in the dependent variable that is accounted for by the independent variable, is a much more useful measure for interpreting the value of r.
5. The correlation coefficient of x and y is symmetric: rxy = ryx.
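A small Python sketch computing r from the definitional formula and checking the symmetry property (rxy = ryx); the paired observations are hypothetical.

import numpy as np

# Hypothetical paired observations; illustration only.
x = np.array([35, 40, 45, 50, 55, 60], dtype=float)
y = np.array([118, 125, 130, 138, 146, 152], dtype=float)

# Pearson's r from the definition
xd, yd = x - x.mean(), y - y.mean()
r_xy = np.sum(xd * yd) / np.sqrt(np.sum(xd ** 2) * np.sum(yd ** 2))

print(f"r_xy = {r_xy:.4f}")
print(np.isclose(r_xy, np.corrcoef(y, x)[0, 1]))   # symmetry: r_xy equals r_yx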
Example: Calculate and interpret the simple correlation coefficient for data on blood pressure and age of 10 women.

Rank Correlation:
It is studied when no assumptions about the parameters of the population are made. This method is based on ranks. It is useful for studying qualitative attributes like honesty, colour, beauty, intelligence, character, morality, etc. The individuals in the group can be arranged in order, thereby obtaining for each individual a number showing his/her rank in the group. This method was developed by Charles Spearman in 1904. It is defined as:
r = 1 − (6 ∑D²) / (n(n² − 1))
where ∑D² = sum of squares of the differences between the pairs of ranks and n = number of pairs of observations. The value of r lies between –1 and +1. If r = +1, there is complete agreement in the order of ranks and the direction of the ranks is also the same. If r = −1, there is complete disagreement in the order of ranks and they are in opposite directions.
Computation for tied observations: There may be two or more items having equal values. In such a case the same rank is to be given; the ranking is said to be tied. In such circumstances an average rank is given to each tied item. For example, if a value is repeated twice at the 5th rank, the common rank to be assigned to each item is (5 + 6)/2 = 5.5, which is the average of ranks 5 and 6.
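A sketch of this computation in Python; scipy.stats.rankdata assigns average ranks to ties exactly as described above, and the two price lists are hypothetical placeholders.

import numpy as np
from scipy.stats import rankdata

# Hypothetical quality-based prices of tea and coffee; illustration only.
tea = np.array([88, 90, 95, 70, 60, 75, 50])
coffee = np.array([120, 134, 150, 115, 110, 140, 100])

# Average ranks; ties such as two items at rank 5 would both receive 5.5
rank_tea = rankdata(tea)
rank_coffee = rankdata(coffee)

d = rank_tea - rank_coffee     # differences between the pairs of ranks
n = len(d)
r_s = 1 - (6 * np.sum(d ** 2)) / (n * (n ** 2 - 1))   # exact only without ties
print(f"Spearman's rank correlation = {r_s:.2f}")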
Example:
In a marketing survey, the prices of tea and coffee in a town, based on quality, were found as shown below. Can you find any relation between tea and coffee prices?

The rank correlation between the price of tea and the price of coffee is 0.89. Based on quality, the association between the price of tea and the price of coffee is highly positive.

