Unit 4: Normal Curve and Linear Regression
Learning Objectives
Upon the completion of this topic, you are expected to:
A. Determine the corresponding z-score of a given raw data;
B. Identify the location of a given data in terms of corresponding z-score and
quantiles;
C. Interpret the location of raw data in the Box Whisker’s plot; and
D. Solve word problems involving the concepts of the measures of relative
position.
Presentation of Content
I. Z-score
What do you know about z-score? Do you know how z-score is determined?
A z-score indicates how many standard deviations a data point is from the
mean. A raw score can be converted to its z-score using the formula:
$$z = \frac{x - \mu}{\sigma}$$
Where:
x = value of the raw data
µ = mean of the set of data where the given data belongs
σ = standard deviation of the set of data where the given data
belongs
z = corresponding z-score of the raw data
Hence, we can convert a raw score to a z-score if we know the mean and
standard deviation of the set to which the raw score belongs.
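The conversion can be sketched in a few lines of Python. This is only an illustration: the function name `z_score` and the sample values are hypothetical and do not come from the module.

```python
def z_score(x: float, mean: float, std_dev: float) -> float:
    """Return how many standard deviations x lies from the mean."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / std_dev


# Hypothetical values chosen only for illustration.
print(z_score(70, 60, 5))   # 2.0: two standard deviations above the mean
print(z_score(55, 60, 5))   # -1.0: one standard deviation below the mean
```

A positive result places the raw score above the mean, a negative result places it below, and zero means the raw score equals the mean.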
How will you utilize the formula in converting a given raw score into its
corresponding z score?
Interpreting Z-scores
After determining the corresponding z-score of a given raw score, we need to
interpret it to be able to identify its location.
Unit 4: Data Management
Example:
a. A z-score equal to 1 represents a data point that is 1 standard deviation
above the mean; a z-score equal to 2, 2 standard deviations above the
mean; etc.
b. A z-score equal to -1 represents a data point that is 1 standard deviation
below the mean; a z-score equal to -2, 2 standard deviations below the
mean; etc.
II. Quantiles
Do you know anything about quantiles? Aside from z-scores, we can use
quantiles as measures of location. Quantiles extend the concept of the median:
they divide the items in a distribution into equal parts.
Types of Quantiles
There are three types of quantiles, namely quartiles, deciles, and percentiles.
A. Quartiles divide the distribution into four equal parts. The values that
divide the parts are called the first, second, and third quartiles. These are
denoted by Q1, Q2, and Q3, respectively. The line below represents a set
of observations divided into quartiles.
Q1 Q2 Q3
B. Deciles divide the distribution into 10 equal parts. The values that divide
the parts are called the first, second, third, fourth, fifth, sixth, seventh,
eighth, and ninth deciles. These are denoted by D1, D2, D3, D4, D5, D6,
D7, D8, and D9, respectively.
D1 D2 D3 D4 D5 D6 D7 D8 D9
C. Percentiles divide the set of observations into 100 divisions. These are
the points or values separating the scores into 100 parts. A percentile
indicates the value below which a given percentage of observations in
a group of observations fall.
Note: Quantiles are used in reporting scores from norm-referenced tests. For
example, if a score is at the 60th percentile, where 60 is the percentile rank, it
is equal to the value below which 60% of the observations may be found.
How comparable is the Box Whisker’s Plot to the other measures of relative
location?
Note: The percentage of data between Q1 and Q3 is about 50%. Thus, only
about 25% of the data are found at each end of the distribution (below Q1 and
above Q3).
From the previous Box Whisker’s Plot, we can say that the distribution of
Group A is skewed to the right, the distribution of Group B is symmetric about
the mean, and the distribution of Group C is skewed to the left.
How can we use the Box Whisker’s Plot in determining the location of
observation in the distribution?
Procedure
A Box Whisker’s Plot is developed from five statistics.
1. Minimum value – the smallest value in the data set
2. First quartile – the value below which the lower 25% of the data are
contained
3. Median value – the middle value of the ordered data set
4. Third quartile – the value above which the upper 25% of the data are
contained
5. Maximum value – the largest value in the data set
For example, given the following 16 data points, the five required statistics are
displayed.
Note: For a data set with an even number of values, the median is
calculated as the average of the two middle values.
From the observations given on the previous page, the values of the five
statistics are:
1. Minimum value = 50
2. First quartile = 54.5
3. Median value = 58
4. Third quartile = 62.5
5. Maximum value = 65
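As a rough sketch, the five statistics can be computed with Python's standard library. The data below are hypothetical (they are not the 16 observations referred to above), and the quartile convention is an assumption: `method="exclusive"` places the k-th quartile at position k(N + 1)/4 and interpolates between neighbouring observations, which is one of several conventions in use.

```python
import statistics


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(data)
    # method="exclusive" locates the k-th quartile at position k(N + 1)/4
    # and interpolates linearly between the neighbouring observations.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="exclusive")
    return ordered[0], q1, q2, q3, ordered[-1]


# Hypothetical data, used only to illustrate the call.
sample = [7, 2, 9, 4, 5, 8, 4, 6]
print(five_number_summary(sample))  # (2, 4.0, 5.5, 7.75, 9)
```

Other software may use a different quartile convention and report slightly different values for Q1 and Q3 on the same data.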
A boxplot splits the data set into quartiles. The body of the boxplot consists of
a "box" (hence, the name), which goes from the first quartile (Q1) to the third
quartile (Q3).
Within the box, a horizontal line is drawn at the Q2, the median of the data
set. Two vertical lines, called whiskers, extend from the front and back of the
box. The front whisker goes from Q1 to the smallest non-outlier in the data set,
and the back whisker goes from Q3 to the largest non-outlier.
If the data set includes one or more outliers, how will they be plotted on the
Box Whisker’s Plot?
Application
Activity 1
Directions: Based on what we have learned about z-score, convert the
following raw scores to their corresponding z-scores. Use the formula
presented in the discussion and identify its location.
Solution:
After accomplishing the previous activity, compare your answers to the
following solutions.
1. x = 24, μ = 20, σ = 2
$$z = \frac{x - \mu}{\sigma} = \frac{24 - 20}{2} = \frac{4}{2} = 2$$
Interpretation: The corresponding z-score of the raw score is 2. The raw score
lies 2 standard deviations above the mean.
2. x = 16, μ = 16, σ = 1
$$z = \frac{x - \mu}{\sigma} = \frac{16 - 16}{1} = \frac{0}{1} = 0$$
Interpretation: The corresponding z-score of the raw score is 0. The raw score
is equal to the mean.
3. x = 18, μ = 24, σ = 8
$$z = \frac{x - \mu}{\sigma} = \frac{18 - 24}{8} = \frac{-6}{8} = -0.75$$
Interpretation: The corresponding z-score of the raw score is -0.75. The raw
score lies 0.75 standard deviation below the mean.
Good job! Did you get the same answers? If not, what part do you need to
improve?
Can you determine the raw score given the mean, the standard deviation, and
the z-score? In what way?
Activity 2
Let us have another activity. This time, you can seek the help of your friends
to answer the following problems.
Directions: Given the mean and standard deviation of the distribution, convert
the following raw scores to their corresponding z scores and interpret their
location relative to the distribution. Good luck!
1. mean = 120 standard deviation = 10 raw score = 100
2. mean = 50 standard deviation = 5 raw score = 55
3. mean = 35 standard deviation = 4 raw score = 40
You have just learned to measure the relative position of data through the
z-score. Congratulations!
Activity 3
Let us try to follow the procedure in determining a quantile value. Do your
best in answering the following:
1. Find the 30th percentile of the set {12, 15, 17, 20, 25, 27, 29, 30, 30,
34, 36, 36, 37, 38, 39, 40, 41, 42, 43}
2. Determine the 2nd quartile from the set {30, 34, 36, 36, 37, 38, 39, 40,
41, 42, 12, 15, 17, 20, 25, 27, 29, 30}
3. Determine the 2nd decile from the set {20, 25, 27, 29, 30, 30, 34, 36,
36, 12, 15, 17, 37, 38, 39, 40, 41, 42}
Solutions:
You may compare yours to the following solutions.
1. Note: The observations are already arranged from lowest to highest.
The given are: P = 30
N = 19
n = unknown
Value of the 30th percentile = unknown
$$n = \frac{P}{100}(N + 1) = \frac{30}{100}(19 + 1) = 6$$
The 30th percentile is the 6th observation in the set of data, which is 27.
2. We arrange the observations from lowest to highest as: {12, 15, 17, 20,
25, 27, 29, 30, 30, 34, 36, 36, 37, 38, 39, 40, 41, 42}.
The given are: Q=2
N = 18
n = unknown
Value of the 2nd quartile = unknown
$$n = \frac{Q}{4}(N + 1) = \frac{2}{4}(18 + 1) = 9.5$$
The whole-number part of the ordinal rank (n) is 9, and the 9th observation is
30. The value next to 30 is 34, and their difference is 4. The product of this
difference and the decimal part of the ordinal rank (n), which is 0.5, is 2.
Thus, the value of the 2nd quartile is 30 + 2 = 32.
3. We arrange the observations from lowest to highest as: {12, 15, 17, 20,
25, 27, 29, 30, 30, 34, 36, 36, 37, 38, 39, 40, 41, 42}
The given are: D=2
N = 18
n = unknown
Value of the 2nd decile = unknown
$$n = \frac{D}{10}(N + 1) = \frac{2}{10}(18 + 1) = 3.8$$
The whole-number part of the ordinal rank (n) is 3, and the 3rd observation is
17. The value next to 17 is 20, and their difference is 3. The product of this
difference and the decimal part of the ordinal rank (n), which is 0.8, is 2.4.
Thus, the value of the 2nd decile is 17 + 2.4 = 19.4.
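The procedure used in these three solutions can be sketched as one Python function. The name `quantile_value` and its parameters are hypothetical; the function locates the ordinal rank n = (k / parts)(N + 1) and then interpolates between the neighbouring observations, as in the worked solutions. Clamping ranks that fall outside the data to the smallest or largest observation is an assumption added for completeness.

```python
def quantile_value(data, k, parts):
    """Value of the k-th quantile when the data are split into `parts` parts.

    Use parts=4 for quartiles, 10 for deciles, and 100 for percentiles.
    """
    ordered = sorted(data)
    count = len(ordered)
    # Ordinal rank n = k(N + 1)/parts, split into whole and fractional parts
    # with integer arithmetic to avoid floating-point rounding surprises.
    whole, remainder = divmod(k * (count + 1), parts)
    if whole < 1:
        return ordered[0]       # rank falls before the first observation
    if whole >= count:
        return ordered[-1]      # rank falls at or after the last observation
    lower, upper = ordered[whole - 1], ordered[whole]
    return lower + (upper - lower) * remainder / parts


set_1 = [12, 15, 17, 20, 25, 27, 29, 30, 30, 34,
         36, 36, 37, 38, 39, 40, 41, 42, 43]
set_2 = [30, 34, 36, 36, 37, 38, 39, 40, 41, 42,
         12, 15, 17, 20, 25, 27, 29, 30]

print(quantile_value(set_1, 30, 100))          # 27.0 (30th percentile)
print(quantile_value(set_2, 2, 4))             # 32.0 (2nd quartile)
print(round(quantile_value(set_2, 2, 10), 2))  # 19.4 (2nd decile)
```

The function sorts the data first, so the observations may be supplied in any order.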
Activity 4
Let us have another activity to test your understanding and mastery of the
topic.
This time, call for a friend to help you answer the items. Good luck!
1. The following are the scores of 19 students of the College of
Agriculture: 40, 32, 32, 30, 45, 44, 43, 35, 39, 23, 25, 36, 37, 28, 33,
27, 30, 29, and 20. Calculate Q1, D3, and P40.
2. Determine the 3rd quartile, 2nd decile, and 10th percentile of the
number of siblings of the 11 students of the College of Teacher
Education.
2 1 3 7
2 6 4 5
3 4 2
Activity 5
Identifying the Location of Observation
If you are done answering the activity, you can now compare your answers to
the solutions.
Solution:
A. Identifying the Location of Data
1. 48 is found above the third quartile and below the maximum score
2. 40 is the median of the distribution
3. 35 is located above the first quartile and below the median
4. 30 is positioned just below the first quartile
5. 20 is the lowest score in the distribution
Activity 6
With the concepts that you have learned, interpret the following Box
Whisker’s Plot. This time you can ask the help of your friends in this activity.
Good luck!
1. Determine the skewness of the three groups.
2. Identify the location of the score 20 in the three distributions.
Feedback/ Assessment
Test I.
Directions: Supply the information being required by each item.
1. Given the mean of the distribution as 30 with a standard deviation of 5,
determine the corresponding z-score of the following raw scores.
a. 15
b. 30
c. 35
Test II.
Directions: Determine whether the following are correct or not. Write True if
the statement is true and False if it is false.
_____1. The 3rd quartile corresponds to the 30th percentile.
_____2. The 25th percentile is the observation below which 75% of the
observations may be found.
_____3. The Box Whisker’s Plot uses five Statistics namely: Q1, maximum,
mean, Q3, and minimum.
_____4. If the box of the Box Whisker’s Plot is situated on the upper part of
the line, then the distribution is skewed to the right.
_____5. Outliers can lie inside the box of the Box Whisker’s Plot.
Test III.
Directions: Below is the list of daily allowances (in peso) of 29 first year
students in Cagayan State University. Determine the value of:
1. 10th percentile
2. 3rd decile
3. 1st quartile
50 65 75 95 110 150
55 65 80 95 110 170
55 70 80 100 120 180
60 70 80 100 130 200
60 75 90 100 140
Test IV.
A. Directions: Given the Box Whisker’s Plot in the next page:
1. Identify the location of these observations; and
a.) 30
b.) 35
c.) 40
Problem: Albert’s teacher revealed that the mean score of their previous exam
is 60 with a standard deviation of 10. Instead of their raw scores, she gave
the students their z-scores.
Albert got a z-score of –0.5. If the passing score is 50, did he pass the exam?
Why or why not?
Learning Objectives
Upon the completion of this topic, you are expected to:
A. Identify the properties of the normal distribution curve;
B. Determine the areas under the normal distribution curve given a portion of
the z table;
C. Determine the probability of cases in the normal distribution curve; and
D. Solve word problems involving the concepts of normal distribution.
Presentation of Content
I. Normal Distribution
A random variable x whose distribution has the shape of a normal curve is
called a normal random variable. Its equation is as follows:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
Note: The random variable x is said to be normally distributed with mean μ and
standard deviation σ if its probability distribution is the above equation.
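To see how the equation behaves, it can be evaluated directly. The sketch below is illustrative only; the function name `normal_pdf` is hypothetical, and the standard normal case (mean 0, standard deviation 1) is used as the example.

```python
import math


def normal_pdf(x, mean, std_dev):
    """Height of the normal curve at x for the given mean and standard deviation."""
    coefficient = 1.0 / (std_dev * math.sqrt(2.0 * math.pi))
    exponent = -((x - mean) ** 2) / (2.0 * std_dev ** 2)
    return coefficient * math.exp(exponent)


# The curve is highest at the mean and symmetric about it.
print(round(normal_pdf(0.0, 0.0, 1.0), 4))                  # 0.3989
print(normal_pdf(1.0, 0.0, 1.0) == normal_pdf(-1.0, 0.0, 1.0))  # True
```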
[Figure: normal curve with the horizontal axis marked at -2, -1, 0, 1, 2]
Note: The shape of the normal distribution depends only on two parameters:
the population mean and the population standard deviation.
You can observe how the mean and standard deviation of different
distributions affect the size and location of the curve.
The first figure in the next page shows normal distributions with the same
mean but different standard deviations while the second figure presents
distributions with different means but the same standard deviation.
Figure I: Two distributions with equal mean but different standard deviation
How can the mean and standard deviation of the distribution affect its shape?
Standard Scores
A standard score is the position of a raw score in terms of the number of
standard deviations it lies from the mean of the distribution.
Given the raw scores, we can convert them to their corresponding standard
scores or z scores. This means that the empirical distribution will be
standardized to the theoretical normal curve.
We can use the formula:
$$z = \frac{x - \mu}{\sigma}$$
Where:
z = standard score
µ = mean
σ = standard deviation
x = raw score
After determining the corresponding value of the raw score, we need a z table
to determine the area between the given two values. Here is a portion of the
table.
How will we use the z table to determine the area under the normal curve?
Application
Activity 1
Directions: Answer the following problem.
Problem: Suppose that in a given test, the mean is 45 and the standard
deviation is 5. If Mario obtained a score of 50, what is his standard score?
Solution:
Given:
µ = 45 x = 50 σ=5 z = unknown
$$z = \frac{x - \mu}{\sigma} = \frac{50 - 45}{5} = \frac{5}{5} = 1$$
Interpretation: A standard score of 1 means that the score of 50 lies 1
standard deviation above the mean of 45. The given observation is therefore
greater than the mean of the distribution.
Activity 2
Directions: Given the mean and standard deviation of the distribution as 20
and 4 respectively, convert the following raw scores into their corresponding
standard scores and interpret their location relative to the mean.
1. 18
2. 20
3. 16
4. 24
5. 28
Activity 3
Using the z table, let us determine the areas of the following:
1. Between 0.1 and 0
2. Between 0.03 and 0
3. Between 0.3 and 0
4. Between 0.45 and 0
5. Between 0.32 and 0
Answers to Activity 3
Have you tried to answer the activity? Here are the answers.
1. Between 0.1 and 0 = 0.0398
2. Between 0.03 and 0 = 0.0120
3. Between 0.3 and 0 = 0.1179
4. Between 0.45 and 0 = 0.1736
5. Between 0.32 and 0 = 0.1255
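These table values can be cross-checked without a printed z table. The sketch below assumes the standard identity that the area between 0 and z under the standard normal curve equals 0.5 · erf(z / √2); the function name `area_from_mean` is hypothetical.

```python
import math


def area_from_mean(z):
    """Area under the standard normal curve between 0 and z."""
    # The curve is symmetric, so the sign of z does not change the area.
    return 0.5 * math.erf(abs(z) / math.sqrt(2.0))


for z in (0.1, 0.03, 0.3, 0.45, 0.32):
    print(z, round(area_from_mean(z), 4))
```

Rounded to four decimal places, the results agree with the five answers listed above.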
Can you follow the guidelines? Are there items that you are not sure about?
Activity 4
To understand the guidelines, let us determine the areas of the following. Try
to answer them before comparing your answers to answers provided in the
next page.
1. To the right of 0.1
2. To the left of –0.3
3. Between –0.2 and –0.4
How many of the guidelines did you apply? Congratulations! You can now
compare your answers to see how much you have understood.
Solutions to Activity 4
The following are the solutions to the previous activity.
1. Area to the right of 0.1 = unknown
Area between 0.1 and 0 = 0.0398
Area to the right of 0.1 = 0.5 − 0.0398 = 0.4602
Note: The areas between 0.2 and 0.4 and –0.2 and –0.4 are equal since they
are symmetrical about the mean.
Note: The probability of the occurrence of a case is its area under the curve!
Example:
If x is a normal random variable with a mean of 90 and standard deviation of
4, find the probability that x is:
1. Greater than 92
2. Less than 89
3. Between 89 and 92
Solution:
Given: mean = 90, standard deviation = 4
1. P(x > 92) = unknown
Convert the raw score 92 to a z-score:
$$z = \frac{92 - 90}{4} = \frac{2}{4} = 0.5$$
Determine the area to the right of 0.5: P(z > 0.5) = 0.5 − 0.1915 = 0.3085
The probability that x is greater than 92 is 30.85%.
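The same kind of probability can be obtained with Python's standard library, which avoids reading a table. This is a sketch rather than the module's prescribed method; the variable names are hypothetical, and `NormalDist.cdf(v)` returns the area to the left of v.

```python
from statistics import NormalDist

scores = NormalDist(mu=90, sigma=4)

# Area to the right of 92 is 1 minus the area to its left.
p_greater_than_92 = 1 - scores.cdf(92)
# Area to the left of 89.
p_less_than_89 = scores.cdf(89)
# Area between two values is the difference of the two left-hand areas.
p_between_89_and_92 = scores.cdf(92) - scores.cdf(89)

print(round(p_greater_than_92, 4))    # 0.3085
print(round(p_less_than_89, 4))
print(round(p_between_89_and_92, 4))
```

The three areas together cover the whole curve, so they add up to 1.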
Congratulations! You finished learning the topics. Did you enjoy it?
What are other applications of the areas under the normal distribution curve in
real-life settings?
Feedback/ Assessment
Test I.
Directions: Supply the information being required by each item.
_____1. It has the same value and location as the median and mode of the
normal distribution.
_____2. It is the total area under the normal distribution curve.
_____3. It is the shape of a normal distribution curve.
_____4. The shape of the normal curve depends on these parameters.
_____5. It is the equivalent standard score of the mean of the distribution.
Test II.
Directions: Given a portion of the z table, determine the areas of the
following z scores.
z (±) 0.00 0.01 0.02 0.03 0.04 0.05
0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088
1. Between 0.11 and 0
2. To the right of 0.12
3. Between 0.12 and 0.34
Test III.
Directions: Provide the information required by each item. Show all pertinent
solutions. The rubric below will be used to evaluate your answers.
Criteria | Exceeds Expectation (3 points) | Meets Expectation (2 points) | Approaches Expectation (1 point)
Understanding | The given and the unknown were identified and properly labelled. | The given were identified. | Some of the given were not identified.
Solution | The problem was solved efficiently and systematically with the use of appropriate solution. | The problem was solved with the use of appropriate solution. | The problem was solved inefficiently with the use of inappropriate solution.
Answer | The problem was | The requirements | The problem was
Problem 1: Juan and Jimmy took a test in Geometry. Their teacher revealed
their scores but in terms of standard scores. Juan has z score of 1 while Jimmy
has 1.5. If the mean score of the class is 60 with a standard deviation of 6, who
got a higher raw score and how much higher?
Learning Objectives
Upon the completion of this topic, you are expected to:
A. Recall concepts on linear correlation and the least squares line;
B. Determine the value of the linear correlation coefficient;
C. Identify what relationship exists between two variables; and
D. Estimate a value of the dependent variable that corresponds to the
independent variable.
Presentation of Content
I. Linear Correlation
The correlation coefficient measures the strength and direction of the linear
relationship between two variables (Larson and Farber, 2000; Pagala, 2011). We
will use the formula below to determine the value of the linear correlation
coefficient.
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$
Where:
n = number of ordered pairs
x = value of independent variable
y = value of dependent variable
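The formula can be sketched as a short Python function. The function name `pearson_r` and the sample data are hypothetical and serve only to show how the sums enter the computation.

```python
import math


def pearson_r(xs, ys):
    """Linear correlation coefficient of paired observations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 pairs")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    # The denominator is zero if either variable never changes.
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical paired data chosen only for illustration.
hours = [1, 2, 3, 4, 5]
marks = [2, 4, 5, 4, 5]
print(round(pearson_r(hours, marks), 4))  # 0.7746
```

The result always lies between -1 and 1; the sign gives the direction of the relationship and the magnitude gives its strength.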
How will you use the formula to determine the relationship of two variables?
11. Square the total value of y: (∑y)².
12. Substitute the values in the formula to determine the value of the
coefficient.
Note: We can only employ correlation when data are in interval or ratio scale.
y = α + βx + ε
Where:
y = the value of dependent variable
α = the y-intercept
β = the slope of the regression line
x = the value of the independent variable
ε = the random error term
How can we apply the formula to predict values of the dependent variable?
Note: The standard approach to estimating α and β is the method of least
squares, which minimizes the sum of the squared errors for the data points.
The sum of the squared deviations between the line and the scatter of points
should be minimized. The values of α and β that achieve this are given by:
$$\beta = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$
$$\alpha = \bar{y} - \beta\bar{x}$$
Note: Here, $\bar{x}$ and $\bar{y}$ denote the sample means of x and y.
Alternative Formula
The alternative formulas for α and β are as follows.
$$\beta = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$
$$\alpha = \frac{\sum y - \beta\sum x}{n}$$
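A minimal sketch of the alternative formulas in Python follows. The function name `least_squares_line` is hypothetical, and the sample data are made up so that the points lie exactly on the line y = 1 + 2x, which makes the result easy to check by hand.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares line through the data."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 pairs")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope (beta) first, because the intercept (alpha) depends on it.
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return intercept, slope


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
print(least_squares_line(xs, ys))  # (1.0, 2.0)
```

With real data the points scatter around the line, and the returned values are the estimates of α and β that minimize the sum of squared errors.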
Application
Activity 1
Now, let us apply what we have learned. Here is an activity where we can
utilize the formula given. Remember to follow the guidelines in determining
the linear correlation coefficient. Try to solve the problem independently
before comparing your answers to the answers provided.
Have you tried answering the problem? Great! Now, we can compare your
answers.
Solution:
We determine the values of the variables.
Height (X)   Weight (Y)   XY   X²   Y²
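(∑x)² = 478,864 and (∑y)² = 461,041
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$
$$r = \frac{470{,}500 - (692)(679)}{\sqrt{\left[480{,}360 - (692)^2\right]\left[461{,}690 - (679)^2\right]}}$$
$$r = \frac{470{,}500 - 469{,}868}{\sqrt{[480{,}360 - 478{,}864][461{,}690 - 461{,}041]}}$$
$$r = \frac{632}{\sqrt{(1{,}496)(649)}} = \frac{632}{\sqrt{970{,}904}} = \frac{632}{985.34}$$
$$r = 0.64$$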
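The arithmetic above can be checked from the column totals alone. In the sketch below, the function name is hypothetical and the totals are inferred from the working, assuming n = 10 players: ∑xy = 47,050, ∑x² = 48,036 and ∑y² = 46,169 (one tenth of the products shown), ∑x = 692, and ∑y = 679, the value consistent with (∑y)² = 461,041.

```python
import math


def pearson_r_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Linear correlation coefficient computed from column totals."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Totals inferred from the worked solution (assumption: n = 10).
r = pearson_r_from_sums(10, 692, 679, 47_050, 48_036, 46_169)
print(round(r, 2))  # 0.64
```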
[Table: Correlations output for Heights and Weights, N = 10 for each variable]
From the previous activity, the correlation coefficient is 0.64, which can be
interpreted as a moderate positive correlation. There is a substantial degree
of correlation between the height and weight of the ten basketball players.
Activity 2
Let us put your understanding into practice. Below are the test results of 10
students in their Mathematics and English examinations. With a partner,
determine the linear correlation coefficient and interpret its value.
X (Score in Mathematics): 34 23 45 44 37 46 23 41 40 35
Y (Score in English):     35 21 43 42 32 45 23 47 43 37
Activity 3
Using the given formulas, try to determine the values of the variables to come
up with the least squares regression equation.
Problem:
The Cagayan State University officials wished to determine if the CSU College
Admission Test (CSU-CAT) score is a good indicator of the Grade Point Average
(GPA) of the 16 scholars selected at random from the first-year class. Their
GPA and CSU-CAT scores are shown on the next page:
How can one predict and estimate GPA from CAT scores?
Solution
Now, we need to obtain the equation for the line that best fits the sample
data.
Student   GPA (y)   CAT Score (x)   xy   y²   x²
1 2.45 85 208.25 6.00 7225
2 2.59 92 238.28 6.71 8464
3 1.95 57 111.15 3.80 3249
4 2.11 64 135.04 4.45 4096
5 1.94 73 141.62 3.76 5329
6 2.12 62 131.44 4.49 3844
7 2.71 54 146.34 7.34 2916
8 2.63 56 147.28 6.92 3136
Solution:
Using the formulas:
$$\bar{y} = \frac{34.99}{16} = 2.187$$
$$\bar{x} = \frac{1{,}144}{16} = 71.50$$
$$\beta = \frac{2{,}457.48 - \dfrac{(1{,}144)(34.99)}{16}}{84{,}984 - \dfrac{(1{,}144)^2}{16}} = -0.0139$$
$$\alpha = 2.187 - (-0.0139)(71.50) = 3.181$$
The fitted equation describing the relationship between GPA and CAT scores
is: GPA = 3.181 − 0.014x, where x is the CAT score.
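The slope and intercept can be reproduced from the totals used in the solution (n = 16, ∑x = 1,144, ∑y = 34.99, ∑xy = 2,457.48, ∑x² = 84,984). The sketch below is illustrative; the variable names and the helper `predict_gpa` are hypothetical.

```python
n = 16
sum_x = 1144        # total of the CAT scores
sum_y = 34.99       # total of the GPAs
sum_xy = 2457.48    # total of the products xy
sum_x2 = 84984      # total of the squared CAT scores

slope = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
intercept = sum_y / n - slope * sum_x / n

print(round(slope, 4))      # -0.0139
print(round(intercept, 3))  # 3.181


def predict_gpa(cat_score):
    """Estimated GPA for a given CAT score using the fitted line."""
    return intercept + slope * cat_score


# At the mean CAT score the line returns the mean GPA.
print(round(predict_gpa(71.5), 3))  # 2.187
```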
Congratulations! You just learned to predict the future Grade Point Average of
the students.
Activity 4
With a partner, determine the equation that would fit the following set of
observations.
Age (x):   10 12 11 26 28 21 22 18 16 15
Score (y): 32 30 34 39 38 32 29 28 25 20
Feedback/ Assessment
Test II.
Directions: Write TRUE if the statement is correct and FALSE if the
statement is wrong on the space provided before each question.
_____1. Beta is the y-intercept in regression analysis.
_____2. In the regression analysis, it is the dependent variable that we want to
predict.
_____3. The slope of the regression line is denoted by alpha.
_____4. The ultimate goal of regression analysis is to predict or estimate the
value of one variable corresponding to a given value of another
variable.
_____5. The sample regression equation may be used to predict or estimate
outside the range of values of the independent variable represented in
the sample.
Test III.
Directions: Provide the information required by the problem in the next page.
The rubric below will be used to evaluate your answers.
Criteria | Exceeds Expectation (3 points) | Meets Expectation (2 points) | Approaches Expectation (1 point)
Understanding | The given and the unknown were identified and properly labelled. | The given were identified. | Some of the given were not identified.
Solution | The problem was solved efficiently and systematically with the use of appropriate solution. | The problem was solved with the use of appropriate solution. | The problem was solved inefficiently with the use of inappropriate solution.
Problem 1: The raw scores obtained by 10 students in a quiz are given below.
What relationship exists between their performance in Biology and
Chemistry?
X (Biology):   12 11 19 20 15 17 18 12 14 15
Y (Chemistry): 16 17 13 19 15 16 19 10 15 13
Summary
Reflection
A. How much have you learned in this unit? Are there things that you
didn’t understand?
o I cannot understand the topic on _____________________.
o Now, I understand what the topics are all about.
B. Directions: Write your thoughts on the things that you have learned
and what you still need to improve by completing the following.
_______________________________________________________________
_______________________________________________________________