Unit 3 Covariance and Correlation
Covariance
When analyzing data, it’s often useful to be able to investigate the relationship
between two numeric variables to assess trends. For example, you might expect
height and weight observations to have a noticeable positive relationship—taller
people tend to weigh more. One of the simplest and most common ways such
associations are quantified and compared is through the idea of correlation, for
which you need the covariance.
The covariance expresses how much two numeric variables “change together”
and the nature of that relationship, whether it is positive or negative. Suppose
for n individuals you have a sample of observations for two variables, labeled
$x = \{x_1, x_2, \ldots, x_n\}$ and $y = \{y_1, y_2, \ldots, y_n\}$, where $x_i$
is paired with $y_i$ for $i = 1, \ldots, n$.
The sample covariance $r_{xy}$ is computed with the following, where $\bar{x}$ and $\bar{y}$
represent the respective sample means of both sets of observations:
$$r_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
When you get a positive result for $r_{xy}$, it shows that there is a positive linear
relationship: as x increases, y increases. When you get a negative result, it
shows a negative linear relationship: as x increases, y decreases, and vice versa.
When $r_{xy} = 0$, this indicates that there is no linear relationship between the values
of x and y.
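To make the covariance formula concrete, here is a minimal sketch that computes it step by step and compares the result with R's built-in cov function. The vectors x and y are hypothetical values chosen only for illustration; they are not data from this unit.

```r
# Hypothetical paired observations, chosen only for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

n <- length(x)

# Sample covariance from the definition: sum of the products of the
# deviations from each mean, divided by n - 1
manual_cov <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Built-in equivalent
builtin_cov <- cov(x, y)

manual_cov   # 1.5
builtin_cov  # 1.5
```

The positive result reflects that larger values of x tend to be paired with larger values of y in this small example.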
Let's take two numeric vectors, xdata and ydata.
Although these are two different collections of numbers, note that they have an
identical arithmetic mean.
R> mean(xdata)
[1] 2.825
R> mean(ydata)
[1] 2.825
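The values stored in xdata and ydata are not listed in these notes. The vectors below are an assumption: eight paired values that reproduce the means shown above, as well as the standard deviations, covariance, and correlation printed in the rest of this unit. Treat them as a stand-in if you want to run the commands yourself.

```r
# Assumed values: not given in the notes, chosen to match the printed output
xdata <- c(2, 4.4, 3, 3, 2, 2.2, 2, 4)
ydata <- c(1, 4.4, 1, 3, 2, 2.2, 2, 7)

mean(xdata)  # 2.825
mean(ydata)  # 2.825
```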
Correlation
The correlation coefficient $\rho_{xy}$ always lies between -1 and 1.
When $\rho_{xy} = -1$, a perfect negative linear relationship exists. Any result less than
zero shows a negative relationship, and the relationship gets weaker the nearer
to zero the coefficient gets, until $\rho_{xy} = 0$, showing no linear relationship at all. As the
coefficient increases above zero, a positive relationship is shown, until $\rho_{xy} = 1$,
which is a perfect positive linear relationship.
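The extremes of this scale can be checked with a short sketch. All vectors here are hypothetical and built only to show the boundary cases: two exact straight-line relationships and one symmetric curved pattern.

```r
# Hypothetical data: y values that are exact linear functions of x
x <- 1:10
y_up   <- 2 * x + 3     # increases exactly linearly with x
y_down <- -0.5 * x + 1  # decreases exactly linearly with x

cor(x, y_up)    # 1
cor(x, y_down)  # -1

# A symmetric curved pattern: clearly related to x, but not linearly
x_sym <- -3:3
cor(x_sym, x_sym^2)  # 0
```

The last case is a reminder that a coefficient of zero rules out a linear relationship only; the two variables can still be related in some other way.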
R> sd(xdata)
[1] 0.9528154
R> sd(ydata)
[1] 2.012639
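The correlation coefficient divides the covariance by the product of the sample
standard deviations $s_x$ and $s_y$:
$$\rho_{xy} = \frac{r_{xy}}{s_x s_y}$$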
R> cov(xdata,ydata)
[1] 1.479286
R> cov(xdata,ydata)/(sd(xdata)*sd(ydata))
[1] 0.7713962
R> cor(xdata,ydata)
[1] 0.7713962
R> plot(xdata,ydata)
As discussed earlier, the correlation coefficient estimates the nature of the
linear relationship between two sets of observations, so if you look at the
pattern formed by the points in the resulting figure and imagine drawing a perfectly
straight line that best represents all the points, you can determine the strength of the
linear association by how close those points are to your line. Points closer to a
perfect straight line will have a value of $\rho_{xy}$ closer to either -1 or 1. The
direction is determined by how the line is sloped: an increasing trend, with the
line sloping upward toward the right, indicates positive correlation; a negative
trend would be shown by the line sloping downward toward the right.
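The imagined line can be drawn explicitly. The sketch below uses a least-squares fit from lm as one possible choice of "best" straight line; the notes do not say which line is meant, so that choice is an assumption. The xdata and ydata values are also assumed, since they are not listed in these notes; they were chosen to match the printed output.

```r
# Assumed values: not given in the notes, chosen to match the printed output
xdata <- c(2, 4.4, 3, 3, 2, 2.2, 2, 4)
ydata <- c(1, 4.4, 1, 3, 2, 2.2, 2, 7)

# Least-squares line as one concrete choice of "best" straight line
fit <- lm(ydata ~ xdata)
slope <- unname(coef(fit)["xdata"])

plot(xdata, ydata)
abline(fit)  # overlay the fitted line on the scatterplot

# The slope of the line has the same sign as the correlation coefficient
sign(slope) == sign(cor(xdata, ydata))  # TRUE
```

The line slopes upward toward the right, which agrees with the positive correlation computed earlier.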
R> plot(quakes$mag,quakes$stations,xlab="Magnitude",ylab="No. of stations")
R> cov(quakes$mag,quakes$stations)
[1] 7.508181
R> cor(quakes$mag,quakes$stations)
[1] 0.8511824
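The covariance above is hard to judge on its own because its size depends on the units of measurement, whereas the correlation does not. The sketch below illustrates this with the quakes data, which ships with R. The rescaled variable stations_tens is a hypothetical change of units introduced only for this comparison.

```r
# quakes ships with R's datasets package
mag      <- quakes$mag
stations <- quakes$stations

# Hypothetical change of units: count stations in tens
stations_tens <- stations / 10

cov(mag, stations)       # covariance in the original units
cov(mag, stations_tens)  # one tenth of the value above
cor(mag, stations)       # correlation in the original units
cor(mag, stations_tens)  # same as the line above
```

This is why correlation, rather than covariance, is the usual choice when comparing the strength of association across different pairs of variables.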