CORRELATION ANALYSIS
PROF. DR. MUHAMMAD AZAM
SCATTER PLOT
• It is a graph of bivariate data in which one variable (with values x1, x2, …, xn) is
taken on the X-axis and the second variable (with values y1, y2, …, yn) is taken on the Y-axis.
• The paired values (x1, y1), (x2, y2), …, (xn, yn) are plotted on graph paper in the form of
dots.
• This dotted graph is called a scatter plot or scatter diagram.
• A scatter plot helps us to observe the type of relationship between the two variables X and Y.
• The owner of an organization is interested in the relationship between the price at which a
commodity is offered for sale and the quantity sold. The data are:
Price Quantity
25 118
45 105
30 112
50 100
35 111
40 108
65 95
75 88
70 91
60 96
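To see the shape of this relationship without graph paper, the pairs can be plotted on a coarse character grid. The following is only a rough sketch: `text_scatter` is a hypothetical helper written for this illustration, and the grid size (26 columns by 11 rows) is an arbitrary choice.

```python
# Price/quantity pairs copied from the table above.
price = [25, 45, 30, 50, 35, 40, 65, 75, 70, 60]
quantity = [118, 105, 112, 100, 111, 108, 95, 88, 91, 96]


def text_scatter(xs, ys, cols=26, rows=11):
    """Return the lines of a rough character-grid scatter plot.

    Hypothetical helper: each (x, y) pair is scaled to one grid cell and
    marked with '*'. The first line is the top of the plot (largest y).
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * cols for _ in range(rows)]
    for x, y in zip(xs, ys):
        # Integer scaling keeps the cell index exact for integer data.
        col = (x - x_min) * (cols - 1) // (x_max - x_min)
        row = (y_max - y) * (rows - 1) // (y_max - y_min)
        grid[row][col] = "*"
    return ["".join(line) for line in grid]


plot_lines = text_scatter(price, quantity)
for line in plot_lines:
    print("|" + line)          # Y-axis on the left
print("+" + "-" * 26)          # X-axis at the bottom
```

The dots run from the top left to the bottom right, which suggests that quantity sold falls as price rises.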
CORRELATION
• Correlation: It is the relationship or interdependence between two variables. Two
variables are said to be correlated if they tend to vary together, in the same or in opposite
directions. Correlation may be positive or negative, depending on its direction.
• Positive Correlation: The correlation is said to be positive or direct if both
variables tend to vary in the same direction, e.g. wheat yield increases as the amount of
fertilizer increases, up to a certain level.
• Negative Correlation: The correlation is said to be negative or indirect if the
variables tend to vary in opposite directions, i.e. as one increases, the other decreases. For
example, the death rates in different age groups may decrease as medical or health
facilities in the society increase.
COEFFICIENT OF CORRELATION
• Coefficient of Correlation: It is a measure of the degree of relationship or
interdependence between two variables. Let X and Y be two variables; the
correlation between X and Y is denoted by “r” for sample data, where “r” is
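\[
r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}
\]
x     y     xy     x²     y²
⋮     ⋮     ⋮      ⋮      ⋮
Σx    Σy    Σxy    Σx²    Σy²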
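The formula translates directly into a few lines of code. The sketch below is one possible transcription, not a reference implementation: the function name `pearson_r` is chosen here for illustration, and the data are the price and quantity values from the scatter plot example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation coefficient from the five column sums."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Price and quantity sold, from the scatter plot example.
price = [25, 45, 30, 50, 35, 40, 65, 75, 70, 60]
quantity = [118, 105, 112, 100, 111, 108, 95, 88, 91, 96]

r_price_quantity = pearson_r(price, quantity)
print(round(r_price_quantity, 2))
```

For these data r comes out close to -1, which agrees with the downward pattern of the dots in the scatter plot.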
• The formula for the coefficient of correlation is called Karl Pearson's product-moment formula.
• The value of r lies between -1 and +1.
• If r = -1, there is perfect negative correlation.
• If r = 0, there is no linear correlation.
• If r = +1, there is perfect positive correlation.
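The three special values can be checked on small made-up data sets. The sketch below uses the deviation form of the coefficient, which is algebraically equivalent to the product-moment formula; the data and the names `r_up`, `r_down` and `r_curve` are invented for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation coefficient in deviation form (equivalent to the sum form)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


x = [-2, -1, 0, 1, 2]

r_up = pearson_r(x, [2 * v + 1 for v in x])      # points on a rising line
r_down = pearson_r(x, [-3 * v + 4 for v in x])   # points on a falling line
r_curve = pearson_r(x, [v * v for v in x])       # points on the parabola y = x^2

print(r_up, r_down, r_curve)
```

The last case is worth noting: y is completely determined by x, yet r is 0, because r measures only the linear part of a relationship.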
EXAMPLE
• Researchers at the European Centre for road safety testing are trying to find out how the
age (in months) of cars affects their braking capability. They test a group of 10 cars of
different ages and find the minimum stopping distance that each car can achieve. The
results are given below.
Car                                 A     B     C     D     E     F     G     H     I     J
Age (months)                        9     15    24    30    38    46    53    60    64    76
Min stopping distance at 40 km/h    28.4  29.3  37.6  36.2  36.5  35.3  36.2  44.1  44.8  47.2
SOLUTION
Car   x     y       xy        x²      y²
A     9     28.4    255.6     81      806.56
B     15    29.3    439.5     225     858.49
C     24    37.6    902.4     576     1413.76
D     30    36.2    1086      900     1310.44
E     38    36.5    1387      1444    1332.25
F     46    35.3    1623.8    2116    1246.09
G     53    36.2    1918.6    2809    1310.44
H     60    44.1    2646      3600    1944.81
I     64    44.8    2867.2    4096    2007.04
J     76    47.2    3587.2    5776    2227.84
Sum   415   375.6   16713.3   21623   14457.72

\[
r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}
\]
\[
r = \frac{10(16713.3) - (415)(375.6)}{\sqrt{\left[10(21623) - (415)^2\right]\left[10(14457.72) - (375.6)^2\right]}}
  = \frac{11259}{\sqrt{(44005)(3501.84)}}
\]
\[
r = 0.91
\]
There is evidence of a strong positive correlation between the age of a car and its stopping
distance. In other words, the older the car, the longer the distance it needs to stop.
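The working table and the value of r can be reproduced programmatically as a check on the hand calculation. This is a minimal sketch; the variable names are chosen here for illustration.

```python
from math import sqrt

# Age (months) and minimum stopping distance for cars A to J.
age = [9, 15, 24, 30, 38, 46, 53, 60, 64, 76]
distance = [28.4, 29.3, 37.6, 36.2, 36.5, 35.3, 36.2, 44.1, 44.8, 47.2]

# One row per car: x, y, xy, x^2, y^2 (the columns of the working table).
rows = [(x, y, x * y, x * x, y * y) for x, y in zip(age, distance)]

# Column totals: the "Sum" row of the table.
sum_x, sum_y, sum_xy, sum_x2, sum_y2 = (sum(col) for col in zip(*rows))
n = len(rows)

r = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

for car, row in zip("ABCDEFGHIJ", rows):
    print(car, *(round(value, 2) for value in row))
print("Sum", *(round(total, 2) for total in (sum_x, sum_y, sum_xy, sum_x2, sum_y2)))
print("r =", round(r, 2))
```

Rounding is applied only when printing, so small floating-point differences in the last decimal places do not affect the comparison with the table.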
HOMEWORK
Measurements of serum cholesterol (mg/100 ml) and arterial calcium deposition (mg/100 g dry
weight of tissue) were made on eight animals. The data are as follows:
Calcium 59 52 42 24 40 32 63 36
Cholesterol 298 303 233 236 265 233 286 264
Calculate and interpret the correlation coefficient between calcium and cholesterol.
PROPERTIES OF CORRELATION COEFFICIENT
• The value of “r” is symmetric in the variables X and Y, that is, \( r_{xy} = r_{yx} \).
• The value of “r” is a pure number, i.e. it is free from the unit of measurement.
• The value of “r” lies between -1 and +1, i.e. \( -1 \le r \le +1 \).
• The value of “r” remains unchanged by a change of origin, scale or both, i.e. if \( U = aX \pm b \)
and \( V = cY \pm d \), where a and c have the same sign, then \( r_{xy} = r_{uv} \).
• The value of “r” is the geometric mean of the two regression coefficients, i.e.
\( r_{xy} = \pm\sqrt{b_{xy} \cdot b_{yx}} \), where the sign is the common sign of \( b_{xy} \) and \( b_{yx} \).
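These properties can be checked numerically. The sketch below reuses the car data from the worked example; the helper names `cross_dev`, `pearson_r` and `slope`, and the constants used for the change of origin and scale, are arbitrary choices made for illustration. It also shows that the sign of r flips when the two scale factors have opposite signs, so the invariance property holds as stated only when they share a sign.

```python
from math import sqrt


def cross_dev(xs, ys):
    """Sum of products of deviations from the means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))


def pearson_r(xs, ys):
    return cross_dev(xs, ys) / sqrt(cross_dev(xs, xs) * cross_dev(ys, ys))


def slope(xs, ys):
    """Least-squares regression coefficient of ys on xs."""
    return cross_dev(xs, ys) / cross_dev(xs, xs)


age = [9, 15, 24, 30, 38, 46, 53, 60, 64, 76]
distance = [28.4, 29.3, 37.6, 36.2, 36.5, 35.3, 36.2, 44.1, 44.8, 47.2]

# Symmetry: swapping the roles of X and Y leaves r unchanged.
r_xy = pearson_r(age, distance)
r_yx = pearson_r(distance, age)

# Change of origin and scale with scale factors of the same sign.
u = [2 * x + 5 for x in age]
v = [0.5 * y - 3 for y in distance]
r_uv = pearson_r(u, v)

# Scale factors of opposite sign: the magnitude is kept, the sign flips.
r_flipped = pearson_r(age, [-y for y in distance])

# Geometric mean of the two regression coefficients (both positive here).
b_yx = slope(age, distance)   # regression of Y on X
b_xy = slope(distance, age)   # regression of X on Y
r_from_slopes = sqrt(b_yx * b_xy)

print(round(r_xy, 4), round(r_yx, 4), round(r_uv, 4),
      round(r_flipped, 4), round(r_from_slopes, 4))
```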
EXAMPLE
• The data below are for 7 patients. The variable “X” is the length of stay in
hospital (in days) and “Y” is the average cost of the stay, in thousands. Find the
correlation between X and Y and interpret your result.
Days (X)                     2    4    5    6    8    11   17
Average cost “000” (Y)       19   14   11   10   9    7    5
EXAMPLE
• The rainfall and the output of wheat per acre for a certain area are as follows:
Rainfall (cm) 40 20 32 35 40 45 43 30 25 50
Wheat Yield (mds) 120 120 145 150 100 120 12 135 130 140
• Calculate the correlation coefficient and discuss the result.
EXAMPLE
• The weights (in kg) of fathers and their sons are as follows:
Weight of father 65 66 67 67 68 69 71 73 85 120
Weight of son 67 68 67 68 72 70 69 70 82 100
• Calculate the correlation coefficient and discuss the result.