
Unit 3: Covariance and Correlation

Covariance

When analyzing data, it’s often useful to be able to investigate the relationship
between two numeric variables to assess trends. For example, you might expect
height and weight observations to have a noticeable positive relationship—taller
people tend to weigh more. One of the simplest and most common ways such
associations are quantified and compared is through the idea of correlation, for
which you need the covariance.

The covariance expresses how much two numeric variables “change together”
and the nature of that relationship, whether it is positive or negative. Suppose
for n individuals you have a sample of observations for two variables, labeled

$$x = \{x_1, x_2, \ldots, x_n\} \quad \text{and} \quad y = \{y_1, y_2, \ldots, y_n\}$$

where x_i corresponds to y_i for i = 1, ..., n.

The sample covariance r_xy is computed with the following, where x̄ and ȳ
represent the respective sample means of both sets of observations:

$$r_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

When you get a positive result for r_xy, it shows that there is a positive linear
relationship—as x increases, y increases. When you get a negative result, it
shows a negative linear relationship—as x increases, y decreases, and vice versa.
When r_xy = 0, this indicates that there is no linear relationship between the values
of x and y.
Let's take two vectors:

R> xdata <- c(2,4.4,3,3,2,2.2,2,4)

R> ydata <- c(1,4.4,1,3,2,2.2,2,7)

Although these are two different collections of numbers, note that they have an
identical arithmetic mean.

R> mean(xdata)
[1] 2.825
R> mean(ydata)
[1] 2.825

The sample covariance of these two sets of observations is as follows:

$$r_{xy} = \frac{1}{8-1} \sum_{i=1}^{8} (x_i - 2.825)(y_i - 2.825) \approx 1.479$$

The obtained value is a positive number, so this suggests there is a positive
relationship based on the observations in x and y.
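
As a quick check, the same value can be computed directly from the formula above (a minimal sketch; n is simply the length of either vector):

R> n <- length(xdata)
R> sum((xdata - mean(xdata)) * (ydata - mean(ydata))) / (n - 1)
[1] 1.479286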
Correlation

Correlation allows you to interpret the covariance further by identifying both
the direction and the strength of any association. There are different types of
correlation coefficients, but the most common of these is Pearson's product-
moment correlation coefficient, the default implemented by R. Pearson's
sample correlation coefficient ρ_xy is computed by dividing the sample covariance
by the product of the standard deviation of each data set:

$$\rho_{xy} = \frac{r_{xy}}{s_x s_y}$$

where s_x and s_y are the sample standard deviations of the observations in x and y.

When ρ_xy = -1, a perfect negative linear relationship exists. Any result less than
zero shows a negative relationship, and the relationship gets weaker the nearer
to zero the coefficient gets, until ρ_xy = 0, showing no relationship at all. As the
coefficient increases above zero, a positive relationship is shown, until ρ_xy = 1,
which is a perfect positive linear relationship.

R> sd(xdata)
[1] 0.9528154
R> sd(ydata)
[1] 2.012639

The correlation is therefore

$$\rho_{xy} = \frac{1.479286}{0.9528154 \times 2.012639} \approx 0.771$$

This is positive just like r_xy, and the value of 0.771 indicates a moderate-to-strong
positive association between the observations in x and y.
The R commands cov and cor are used for the sample covariance and
correlation; you need only supply the two corresponding vectors of data.

R> xdata <- c(2,4.4,3,3,2,2.2,2,4)

R> ydata <- c(1,4.4,1,3,2,2.2,2,7)

R> cov(xdata,ydata)

[1] 1.479286

R> cov(xdata,ydata)/(sd(xdata)*sd(ydata))

[1] 0.7713962

R> cor(xdata,ydata)

[1] 0.7713962

To plot these bivariate observations as a coordinate-based plot (a scatterplot), use the following:

R> plot(xdata,ydata)
As discussed earlier, the correlation coefficient estimates the nature of the
linear relationship between two sets of observations, so if you look at the
pattern formed by the points in the resulting figure and imagine drawing a perfectly straight
line that best represents all the points, you can determine the strength of the
linear association by how close those points are to your line. Points closer to a
perfect straight line will have a value of ρ_xy closer to either -1 or 1. The
direction is determined by how the line is sloped—an increasing trend, with the
line sloping upward toward the right, indicates positive correlation; a negative
trend would be shown by the line sloping downward toward the right.
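
One way to make that imagined line concrete (a sketch, not part of the original example) is to overlay the least-squares line of best fit using R's lm and abline:

R> plot(xdata, ydata)
R> abline(lm(ydata ~ xdata))  # draw the fitted straight line through the points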

To aid your understanding of the idea of correlation, the figure below displays
different scatterplots, each showing 100 points. These observations have been
randomly and artificially generated to follow preset “true” values of ρ_xy, labeled
above each plot.
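
Data like these can be simulated by sampling from a bivariate normal distribution with a chosen correlation. A minimal sketch using MASS::mvrnorm (the function choice and the value of rho are assumptions for illustration, not taken from the original text):

R> library(MASS)  # provides mvrnorm() for multivariate normal sampling
R> rho <- 0.9  # hypothetical preset "true" correlation
R> Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)  # unit variances, so covariance = correlation
R> xy <- mvrnorm(n = 100, mu = c(0, 0), Sigma = Sigma)
R> plot(xy[, 1], xy[, 2], main = paste("rho =", rho))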
The first row of scatterplots shows negatively correlated data; the second shows
positively correlated data. These match what you would expect to see—the
direction of the line shows the negative or positive correlation of the trend, and
the extremity of the coefficient corresponds to the closeness to a “perfect line.”
The third and final row shows data sets generated with a correlation coefficient
set to zero, implying no linear relationship between the observations in x and y.
The middle and rightmost plots are particularly important because they
highlight the fact that Pearson’s correlation coefficient identifies only “straight-
line” relationships; these last two plots clearly show some kind of trend or
pattern, but this particular statistic cannot be used to detect such a trend. To
wrap up this section, look again at the quakes data. Two of the variables are mag
(the magnitude of each event) and stations (the number of stations that reported
detection of the event). A plot of stations on the y-axis against mag on the x-
axis can be produced with the following:

R> plot(quakes$mag, quakes$stations, xlab="Magnitude", ylab="No. of stations")

The resulting scatterplot is shown in the figure below.


You can see by the vertical patterning that the magnitudes appear to have
been recorded to a certain specific level of precision. Nevertheless, a positive
relationship (more stations tend to detect events of higher magnitude) is
clearly visible in the scatterplot, a feature that is confirmed by a positive
covariance.

R> cov(quakes$mag,quakes$stations)

[1] 7.508181

As you might expect from examining the pattern, Pearson's correlation
coefficient confirms that the linear association is quite strong.

R> cor(quakes$mag,quakes$stations)

[1] 0.8511824
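
As in the earlier example, this is just the covariance rescaled by the two standard deviations, so the same value can be recovered manually:

R> cov(quakes$mag, quakes$stations) / (sd(quakes$mag) * sd(quakes$stations))
[1] 0.8511824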
