Module 4 Statistics and Probability
Prepared by:
MS. EVELYN R. MATOS
Teacher
09674233745
Introduction
Our particular area of interest whenever we conduct a research investigation is
confirming whether the assumptions on the variables involved are baseless or grounded. In
relation to the assumptions, we often surmise that there must be some contributing factors
affecting the occurrence of any observed pattern or trend. For example, one may assume that
students' high test scores are directly related to how long they study their lessons, or one may
assume that there is a relationship between the sales of products and their price. In
attempting to ascertain the possible reasons for such naturally occurring trends, we resort to
statistics. One way to objectively determine the reasons is to look for a relationship or
association between the variables.
In this module, we will deal with pairs of variables, referred to as bivariate data,
with the aim of computing the degree and describing the extent of their relationship. We
will also learn how to make inferences and confirm any assumptions made about the
subject of interest. Finally, we will learn how to generate an equation that can be used to predict a
numerical value of the dependent variable given a value of the independent variable.
Lesson 1
SCATTER PLOTS
Learning Objectives
At the end of the lesson, the learner will be able to:
a. illustrate the nature of bivariate data;
b. construct a scatter plot;
c. describe shape (form), trend (direction), and variation (strength) based on the scatter plot; and
d. estimate the strength of association between the variables based on a scatter plot.
Learning Content
Some research studies involve two variables. One of these is the independent variable
and the other is the dependent variable. The independent variable is the variable that may
cause the dependent variable to change. The dependent variable is the variable that is
influenced or affected by the independent variable. The data collected in a study that
involves two variables are called bivariate data. Bivariate data always come in pairs. For
instance, a researcher wants to find out if there is a relationship between height and weight. Here,
height is the independent variable and weight is the dependent variable. If a person gets taller, his
weight may increase, but an increase in his weight will not make the person taller. This does
not mean that one variable causes the other; it simply means that there may be an
association between the two. The heights of the students, which may be in centimeters, and the
weights of the students, which may be in kilograms, are the bivariate data.
Scatter plots are diagrams that are used to show the degree and pattern of relationship
between the two sets of data. They are constructed on the xy coordinate plane. Each data point
on a scatter plot represents two values (x, y). The abscissa of the point is a value of the
independent variable (x) and the ordinate is a value of the dependent variable (y).
• Rectangular coordinate system
• Two quantitative variables
• One variable is called independent (X) and the second is called dependent (Y)
• Points are not joined
Example 1
The table below shows the time in hours (x) spent by six grade 11 students in studying
their lessons and their scores (y) on a test. Construct a scatter plot.
Time Spent (x)    1    2    3    4    5    6
Score (y)         5    15   10   15   30   35
SOLUTION
First, draw the horizontal axis for the values of the independent variable and the vertical
axis for the values of the dependent variable. In Example 1, the independent variable is x
representing the time spent studying and the dependent variable is y representing the scores
on a test. Then, plot the data points. Each data point is a pair of coordinates (x, y). The first
coordinate which is the abscissa of the point is the time in hours of studying and the second
coordinate which is the ordinate of the point is the score on the test. Shown below is the
scatter plot.
The points plotted on the xy coordinate plane seem to follow a straight line that rises
to the right. This indicates that the two variables are to some extent linearly related
and that the relationship between them is positive. The scatter plot represents a positive
correlation: as the time spent studying increases, the score also tends to increase. The
correlation between the two variables is strong because the points lie close to a
straight line.
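The construction described above can also be sketched in a few lines of Python. This is only an illustration, not part of the module: the function name ascii_scatter is hypothetical, and the sketch assumes that the x-values are positive whole numbers and that the scores are multiples of 5, as in Example 1. Each data pair is drawn as one unconnected point.

```python
def ascii_scatter(xs, ys, y_step=5):
    """Return a text scatter plot as a list of rows (hypothetical helper).

    Assumes xs are positive whole numbers and ys are multiples of y_step.
    """
    points = set(zip(xs, ys))          # each data point is a pair (x, y)
    x_values = range(1, max(xs) + 1)   # horizontal axis: independent variable
    rows = []
    # Vertical axis: dependent variable, highest level printed first.
    for level in range(max(ys), 0, -y_step):
        cells = ["*" if (x, level) in points else " " for x in x_values]
        rows.append(f"{level:>3} | " + " ".join(cells))
    rows.append("    +" + "--" * max(xs))
    rows.append("      " + " ".join(str(x) for x in x_values))
    return rows


time_spent = [1, 2, 3, 4, 5, 6]      # x, hours of study (Example 1)
scores = [5, 15, 10, 15, 30, 35]     # y, test scores (Example 1)

plot_rows = ascii_scatter(time_spent, scores)
print("\n".join(plot_rows))
```

The points are marked but never joined, and the printed picture rises toward the right, which matches the positive trend described above.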
Lesson 2
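The Pearson Product Moment Correlation Coefficient, r, is computed using the formula

\[ r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{[n(\sum x^2) - (\sum x)^2][n(\sum y^2) - (\sum y)^2]}} \]

where:
n = number of paired values
∑x = sum of x-values
∑y = sum of y-values
∑xy = sum of the products of paired values x and y
∑x² = sum of the squares of the x-values
∑y² = sum of the squares of the y-values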
The value of r ranges from -1 to +1. If the value of r is exactly +1, then the variables have
a perfect positive correlation. If it is exactly -1, then the variables have a perfect negative
correlation. A value of r that is close to +1 indicates a strong positive correlation, and a value of
r that is close to -1 indicates a strong negative correlation.
The following table for interpretation of r can be used in interpreting the degree of linear
relationship existing between the two variables.
-0.71 to -0.99 Strong Negative Correlation
-1 Perfect Negative Correlation
EXAMPLE 1
The table below shows the time in hours spent in studying (x) by six grade 11 students and their
scores on a test (y). Solve for the Pearson Product Moment Correlation Coefficient, r.
x 1 2 3 4 5 6
y 5 10 15 15 25 35
SOLUTION
x     y     xy     x²     y²
1     5     5      1      25
2     10    20     4      100
3     15    45     9      225
4     15    60     16     225
5     25    125    25     625
6     35    210    36     1,225
∑x = 21   ∑y = 105   ∑xy = 465   ∑x² = 91   ∑y² = 2,425
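\[ \begin{aligned}
r &= \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{[n(\sum x^2) - (\sum x)^2][n(\sum y^2) - (\sum y)^2]}} \\
  &= \frac{6(465) - (21)(105)}{\sqrt{[6(91) - (21)^2][6(2{,}425) - (105)^2]}} \\
  &= \frac{2{,}790 - 2{,}205}{\sqrt{[546 - 441][14{,}550 - 11{,}025]}} \\
  &= \frac{585}{\sqrt{370{,}125}} \\
  &= 0.96157 \text{ or } 0.96
\end{aligned} \]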
The value r = 0.96 is between +0.71 and +0.99 in the table for interpretation of r. It indicates that
there is a strong positive correlation between the time in hours spent in studying and the scores
on a test.
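The computation above can be checked with a short Python sketch. The function name pearson_r is an illustrative choice, not something defined in this module; the body simply applies the formula for r term by term to the sums used in the solution table.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson r from paired values, using the sums in the solution table."""
    n = len(xs)                                    # number of paired values
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))    # sum of products xy
    sum_x2 = sum(x * x for x in xs)                # sum of squares of x
    sum_y2 = sum(y * y for y in ys)                # sum of squares of y
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


hours = [1, 2, 3, 4, 5, 6]           # x, hours spent studying
scores = [5, 10, 15, 15, 25, 35]     # y, scores on the test

r = pearson_r(hours, scores)
print(round(r, 2))  # 0.96
```

The same function can be applied to any other table of paired values in this lesson by changing the two lists. It assumes that neither list is constant, since a constant list would make the denominator zero.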
EXAMPLE 2
The table below shows the time in hours (x) spent by six students in playing computer games and
the scores these students got on a math test. Solve for Pearson Product Moment Correlation
Coefficient.
x 1 2 3 4 5 6
y 30 25 25 10 15 5
SOLUTION
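x     y     xy     x²     y²
1     30    30     1      900
2     25    50     4      625
3     25    75     9      625
4     10    40     16     100
5     15    75     25     225
6     5     30     36     25
∑x = 21   ∑y = 110   ∑xy = 300   ∑x² = 91   ∑y² = 2,500

\[ \begin{aligned}
r &= \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{[n(\sum x^2) - (\sum x)^2][n(\sum y^2) - (\sum y)^2]}} \\
  &= \frac{6(300) - (21)(110)}{\sqrt{[6(91) - (21)^2][6(2{,}500) - (110)^2]}} \\
  &= \frac{1{,}800 - 2{,}310}{\sqrt{[546 - 441][15{,}000 - 12{,}100]}} \\
  &= \frac{-510}{\sqrt{304{,}500}} \\
  &= -0.92422 \text{ or } -0.92
\end{aligned} \]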
The value of r = -0.92 indicates a strong negative correlation between the time in hours
spent in playing computer games and the scores on a test.
EXAMPLE 3
The table below shows the number of selfies (x) posted online and the scores (y) obtained
from a Science test. Solve for the Pearson Product Moment Correlation Coefficient.
x 1 2 3 4 5 6
y 25 5 20 40 25 9
SOLUTION
x     y     xy     x²     y²
1     25    25     1      625
2     5     10     4      25
3     20    60     9      400
4     40    160    16     1,600
5     25    125    25     625
6     9     54     36     81
∑x = 21   ∑y = 124   ∑xy = 434   ∑x² = 91   ∑y² = 3,356

\[ \begin{aligned}
r &= \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{[n(\sum x^2) - (\sum x)^2][n(\sum y^2) - (\sum y)^2]}} \\
  &= \frac{6(434) - (21)(124)}{\sqrt{[6(91) - (21)^2][6(3{,}356) - (124)^2]}} \\
  &= \frac{2{,}604 - 2{,}604}{\sqrt{[546 - 441][20{,}136 - 15{,}376]}} \\
  &= \frac{0}{\sqrt{499{,}800}} \\
  &= 0
\end{aligned} \]
Since r = 0, there is no linear correlation between the number of selfies (x) posted online
and the scores (y) obtained from a Science test.
2.
x 1 2 3 4 5
y 3 6 10 8 12
3.
x 2 4 6 7 10
y 8 10 12 6 16
4.
x 1 4 3 7 8 9 10
y 10 12 14 8 6 4 4
5.
x 10 8 12 15 16 18 20
y 20 30 25 22 18 16 35
Lesson 3
REGRESSION ANALYSIS
Learning Objectives
When bivariate data are displayed on the xy coordinate plane by using a scatter plot, each
data point represents two variables. One of these two variables is the independent variable and
the other one is the dependent variable. The dependent variable is affected by the independent
variable, but the dependent variable does not affect the independent variable. To identify which
of the variables is independent and which is dependent, think about which one is affected
and which one contributes to the change.
For example, suppose the two variables are the number of hours spent studying and the scores
on the periodic test. Here the scores are the ones being affected, and the number of hours spent
studying contributes to the change. The numbers of hours spent studying are the
values of the independent variable and the scores are the values of the dependent variable. The
values of the independent variable are plotted along the x-axis and the values of the
dependent variable are plotted along the y-axis.
Best-Fit Line
A scatter plot displays a group of data points that may appear to be following a straight
line pointing either upward to the right or to the left. The straight line that best illustrates the
trend or direction that the data points seem to follow is called the best-fit line or line of best fit.
This line of best fit can be estimated from two data points on the scatter plot. First,
draw a line that divides the data points into two sets with equal, or nearly equal,
numbers of points. It is better if the line passes through two of the points. Next,
calculate the slope using the slope formula, and then find the equation of the
line.
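As a rough sketch of this two-point estimate, the slope and the y-intercept of the line through two chosen points can be computed as follows. The function name line_through is hypothetical, and the two points, (1, 5) and (6, 35), are picked from the Lesson 1 scatter plot data purely for illustration; a different choice of points gives a different estimated line.

```python
def line_through(point_1, point_2):
    """Slope and y-intercept of the line through two points (x1, y1), (x2, y2).

    Assumes the two points have different x-values.
    """
    (x1, y1), (x2, y2) = point_1, point_2
    slope = (y2 - y1) / (x2 - x1)     # slope formula: rise over run
    intercept = y1 - slope * x1       # solve y1 = slope * x1 + intercept
    return slope, intercept


# Illustrative choice of two data points from the Lesson 1 scatter plot.
slope, intercept = line_through((1, 5), (6, 35))
print(slope, intercept)  # 6.0 -1.0
```

With these two points the estimated line is y = 6x - 1. Because the result depends on which points are chosen, this is only an estimate of the line of best fit.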
There is a better way to find the equation of the line of best fit. This equation is called the
equation of the regression line or simply regression equation.
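\[ \hat{y} = a + bx \]
where:
a = y-intercept of the regression line, and
b = slope of the regression line.
The y-intercept can be computed using the following formula.
\[ a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2} \]
The slope can be computed using the following formula.
\[ b = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2} \]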
EXAMPLE 1
Consider the following data.
a. Find the equation of the regression line.
b. Draw the graph of the regression equation on a scatter plot.
x 1 2 3 4 5 6 7
y 4 3 8 6 12 10 8
SOLUTION
x y xy x²
1 4 4 1
2 3 6 4
3 8 24 9
4 6 24 16
5 12 60 25
6 10 60 36
7 8 56 49
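Σx = 28   Σy = 51   Σxy = 234   Σx² = 140

\[ \begin{aligned}
a &= \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2} \\
  &= \frac{(51)(140) - (28)(234)}{7(140) - (28)^2} \\
  &= \frac{7{,}140 - 6{,}552}{980 - 784} \\
  &= \frac{588}{196} \\
  &= 3
\end{aligned} \]

\[ \begin{aligned}
b &= \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2} \\
  &= \frac{7(234) - (28)(51)}{7(140) - (28)^2} \\
  &= \frac{1{,}638 - 1{,}428}{980 - 784} \\
  &= \frac{210}{196} \\
  &= 1.07142 \text{ or } 1.07
\end{aligned} \]

The regression equation is
\[ \hat{y} = 3 + 1.07x \]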
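The two formulas for a and b can be checked with a short Python sketch. The function name regression_line is an illustrative choice, and the value x = 8 used for the prediction is a hypothetical input that does not appear in the example.

```python
def regression_line(xs, ys):
    """Return (a, b) for the regression equation y-hat = a + b*x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2          # shared by both formulas
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator   # y-intercept
    b = (n * sum_xy - sum_x * sum_y) / denominator        # slope
    return a, b


xs = [1, 2, 3, 4, 5, 6, 7]
ys = [4, 3, 8, 6, 12, 10, 8]

a, b = regression_line(xs, ys)
print(a, round(b, 2))  # 3.0 1.07

# Predict y for a hypothetical x = 8 using the unrounded coefficients.
predicted = a + b * 8
print(round(predicted, 2))  # 11.57
```

Using the unrounded slope in the prediction avoids carrying the rounding error of 1.07 into the predicted value.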
EXAMPLE 2
x 5 10 20 8 15 25
y 40 26 18 30 20 15
SOLUTION
x y xy x²
5 40 200 25
10 26 260 100
20 18 360 400
8 30 240 64
15 20 300 225
25 15 375 625
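Σx = 83   Σy = 149   Σxy = 1,735   Σx² = 1,439

\[ \begin{aligned}
a &= \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2} \\
  &= \frac{(149)(1{,}439) - (83)(1{,}735)}{6(1{,}439) - (83)^2} \\
  &= \frac{214{,}411 - 144{,}005}{8{,}634 - 6{,}889} \\
  &= \frac{70{,}406}{1{,}745} \\
  &= 40.34727794 \text{ or } 40.35
\end{aligned} \]

\[ \begin{aligned}
b &= \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2} \\
  &= \frac{6(1{,}735) - (83)(149)}{6(1{,}439) - (83)^2} \\
  &= \frac{10{,}410 - 12{,}367}{8{,}634 - 6{,}889} \\
  &= \frac{-1{,}957}{1{,}745} \\
  &= -1.121489971 \text{ or } -1.12
\end{aligned} \]

The regression equation is
\[ \hat{y} = 40.35 - 1.12x \]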
1.
x 1 2 3 4 5 6 7
y 15 10 20 5 25 20 35
2.
x 1 2 3 4 5 6 7
y 20 15 25 22 30 35 30
3.
x 2 4 6 8 10 12 14
y 15 20 15 20 25 22 35
4.
x 10 20 30 40 50 60 70
y 60 50 50 20 30 20 30
5.
x 5 10 15 20 25 35 30
y 15 20 10 20 30 33 35
Name: __________________________________ Section: ___________________________
Date: ___________________________________ Score: ______________________
x 1 2 3 4 5 6
y 6 4 8 6 10 12
x 1 2 3 4 5 6 7
y 2 6 4 8 12 10 12
x 5 15 20 10 25 30 40
y 20 25 30 20 35 25 35
x 10 20 30 40 50 60 70 80 90
y 10 20 80 25 70 90 80 85 88
x 30 35 45 50 40 65 55 60 70
y 40 45 50 60 35 65 55 65 70
Lesson 4: Spearman’s Rank Correlation Coefficient
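Spearman’s rank correlation coefficient, also called Spearman’s rho, was developed by
Charles E. Spearman. Spearman’s rho is used to measure the degree of association between
two ranked variables. It is computed with the use of the following formula.
\[ \rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)} \]
where:
d = difference between the two ranks in each pair
n = number of paired ranks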
The Spearman’s rho can take values from +1 to -1. The value of +1 indicates a perfect
positive correlation between the two ranked variables and a value of -1 indicates a perfect
negative correlation. The closer the value of Spearman’s rho is to zero, the weaker the correlation
between the ranks, and a value of zero indicates no correlation between the ranks. The table that
was used to interpret Pearson r may be used to interpret Spearman’s rho.
Example 1
In a beauty contest, two judges were asked to rank eight candidates (A, B, C, …, H) in order of
preference. In the table shown below are the resulting ranks.
Candidate           A    B    C    D    E    F    G    H
First Judge (x)     5    2    4    3    6    1    8    7
Second Judge (y)    3    4    5    2    6    1    7    8
a. Compute the Spearman’s rho.
b. How strong is the correlation between the choices of the two judges?
SOLUTION
a.
Candidate    x    y    d     d²
A            5    3    2     4
B            2    4    -2    4
C            4    5    -1    1
D            3    2    1     1
E            6    6    0     0
F            1    1    0     0
G            8    7    1     1
H            7    8    -1    1
∑d² = 12

\[ \begin{aligned}
\rho &= 1 - \frac{6\sum d^2}{n(n^2 - 1)} \\
     &= 1 - \frac{6(12)}{8(8^2 - 1)} \\
     &= 1 - \frac{72}{504} \\
     &= 0.8571
\end{aligned} \]
b. The correlation between the choices of the two judges is strong and positive.
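The computation in part a can be checked with a short Python sketch. The function name spearman_rho is an illustrative choice; it takes the two lists of ranks directly and assumes that no ranks are tied, as in this example.

```python
def spearman_rho(ranks_x, ranks_y):
    """Spearman's rho from two lists of ranks (assumes no tied ranks)."""
    n = len(ranks_x)
    # d is the difference between the two ranks of each candidate.
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))


first_judge = [5, 2, 4, 3, 6, 1, 8, 7]     # ranks given to candidates A to H
second_judge = [3, 4, 5, 2, 6, 1, 7, 8]

rho = spearman_rho(first_judge, second_judge)
print(round(rho, 4))  # 0.8571
```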
EXAMPLE 2
Calculate the Spearman’s rank correlation coefficient of the data in the table:
x 10 6 9 12 8
y 8 7 5 6 9
First, arrange the values of x and the values of y in descending order and rank them.
x     Rx    y    Ry
12    1     9    1
10    2     8    2
9     3     7    3
8     4     6    4
6     5     5    5
Now assign each original value its rank. For example, the first x-value is 10 and its rank is 2,
the second x-value is 6 and its rank is 5, and so on.
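Rx    Ry    d     d²
2     2     0     0
5     3     2     4
3     5     -2    4
1     4     -3    9
4     1     3     9
∑d² = 26

\[ \begin{aligned}
\rho &= 1 - \frac{6\sum d^2}{n(n^2 - 1)} \\
     &= 1 - \frac{6(26)}{5(5^2 - 1)} \\
     &= 1 - \frac{156}{120} \\
     &= 1 - 1.3 \\
     &= -0.3
\end{aligned} \]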
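The ranking step and the formula can be combined in a short Python sketch. The helper names descending_ranks and spearman_rho_from_values are hypothetical, and the sketch assumes that no value is repeated within a list, as in this example; tied values would need averaged ranks, which this sketch does not handle.

```python
def descending_ranks(values):
    """Rank values so that the largest gets rank 1 (assumes no repeated values)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(value) + 1 for value in values]


def spearman_rho_from_values(xs, ys):
    """Rank the raw values, then apply the Spearman's rho formula."""
    ranks_x = descending_ranks(xs)
    ranks_y = descending_ranks(ys)
    n = len(xs)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))


x_values = [10, 6, 9, 12, 8]
y_values = [8, 7, 5, 6, 9]

print(descending_ranks(x_values))  # [2, 5, 3, 1, 4]
print(descending_ranks(y_values))  # [2, 3, 5, 4, 1]

rho_2 = spearman_rho_from_values(x_values, y_values)
print(round(rho_2, 2))  # -0.3
```

The result is rounded before printing because floating-point subtraction may leave a tiny error in the last decimal places.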
Name: __________________________________ Section: ___________________________
Date: ___________________________________ Score: ______________________
x 15 20 18 25 16 10
y 16 25 10 16 12 14
2. In the table below are the scores of students in General Mathematics and Physical Science.
Calculate the Spearman’s rho.