Q4 STATISTICS C.1 Learning Modules Quarter 4 Learning Information and Course Activity

The document provides an overview of normal distribution in statistics, explaining its properties, significance, and the calculation of z-scores. It details the shape and characteristics of the normal curve, including its symmetry and the relationship between mean and standard deviation. Additionally, it includes examples and steps for finding areas under the normal curve using a z-table.

Uploaded by

aimeopalla1

MATH 2 |Page |1

NORTHLINK
TECHNOLOGICAL COLLEGE
LEARNING MODULE
MATH 2 – Statistics and Probability

Learning Information 2.1


Learning Content 1: Normal Distribution

Normal Curve Distribution


There are many events in real life that generate random variables that have the natural tendency to approximate the shape
of a bell. For example, the heights of a large number of seedlings that we see in fields normally consist of a few tall ones, a
few short ones, and most of them having heights between tall and short. If a well-prepared test is administered to a class of
100 students, there will be few high scores, as well as few low scores. Most of the scores will be found in between these two
extreme scores. In reality, if a distribution consists of a very large number of cases and the three measures of averages
(mean, median, and mode) are equal, then the distribution is symmetrical and the skewness is 0. In Statistics, such a bell-shaped
distribution is called a normal distribution or simply a normal curve.
The normal curve has a very important role in inferential statistics. It provides a graphical representation of statistical
values that are needed in describing the characteristics of populations as well as in making decisions. It is defined by an
equation that uses the population mean, $\mu$, and the standard deviation, $\sigma$. There is no single curve, but rather a whole
family of normal curves that have the same basic characteristics but have different means and standard deviations.
Properties of the Normal Probability Distribution
1. The distribution curve is bell-shaped.
2. The curve is symmetrical about its center.
3. The mean, median, and the mode coincide at the center.
4. The width of the curve is determined by the standard deviation
of the distribution.
5. The tails of the curve flatten out indefinitely along the horizontal
axis, always approaching the axis but never touching it. That is,
the curve is asymptotic to the base line.
6. The area under the curve is 1. Thus, it represents the
probability or proportion or the percentage associated with
specific sets of measurement values.
A change in the value of the mean shifts the graph of the normal curve to the right or to the left.
The standard deviation determines the shape of the graph (particularly the height and width of the curve). When the
standard deviation is large, the normal curve is short and wide, while a small value for the standard deviation yields a narrower
and taller graph.
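The effect of the mean and the standard deviation on the curve can be checked numerically. The sketch below evaluates the usual normal-curve equation, $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$, which the module refers to but does not print. The function name normal_pdf and the values 50, 60, 2, and 8 are illustrative assumptions, not data from the module.

```python
import math


def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-0.5 * ((x - mu) / sigma) ** 2)


# Peak height (at x = mu) for a small and a large standard deviation.
peak_narrow = normal_pdf(50, mu=50, sigma=2)   # small SD: taller curve
peak_wide = normal_pdf(50, mu=50, sigma=8)     # large SD: shorter, wider curve

# Changing only the mean moves the peak but keeps its height.
peak_shifted = normal_pdf(60, mu=60, sigma=2)

print("peak with sigma = 2:", round(peak_narrow, 4))
print("peak with sigma = 8:", round(peak_wide, 4))
print("peak after shifting the mean to 60:", round(peak_shifted, 4))
```

The curve with the smaller standard deviation has the higher peak, and shifting the mean leaves the peak height unchanged, which matches the description above.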

The Table of Areas under the Normal Curve is also known as the z-table. The z-score is a measure of relative standing.
It is calculated by subtracting the mean ($\bar{x}$ or $\mu$) from the measurement X and then dividing the result by s or $\sigma$. The final result, the z-score,
represents the distance between the given measurement X and the mean, expressed in standard deviations. The z-score
locates X either within a sample or within a population.
To find the area that corresponds to a z-score, simply find the area between z = 0 and the given z-value using the z-table.
Four Steps in Finding the Area Under the Normal Curve Given a z-Value
Step 1: Express the given z-value into a three-digit form.
Step 2: Using the z-table, find the first two digits on the left column.
Step 3: Match the third digit with the appropriate column on the right.
Step 4: Read the area (or probability) at the intersection of the row and the column.
This is the required area.

Table of Areas under the Normal Curve


Z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
2.4 0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936
2.5 0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952
2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964
2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974
2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981
2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986
3.0 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990
Example 1. Find the area that corresponds to z = 1.36.


Solution: Finding the area that corresponds to z = 1.36 is the same as finding the area between z = 0 and z = 1.36.
Steps Solution
1. Express the given z-value in three-digit form. z = 1.36 (as is)
2. In the table, find the row z = 1.3.
3. In the table, find the column with heading 0.06.
4. Read the area at the intersection. This area is 0.4131. This is the required area.

Example 2. Find the area that corresponds to z = -2.58.


Solution: Because the normal curve is symmetrical, the area that corresponds to z = -2.58 is the same as the area that
corresponds to z = 2.58 in the z-table. In the graph, this region is located to the left of the mean.
Steps Solution
1. Express the given z-value in three-digit form. z = -2.58 (look up 2.58 in the table)
2. In the table, find the row z = 2.5.
3. In the table, find the column with heading 0.08.
4. Read the area at the intersection. This area is 0.4951. This is the required area.

Example 3. Find the area that corresponds to z = 1.7.


Solution: Finding the area that corresponds to z = 1.7 is the same as finding the area between z = 0 and z = 1.7.
Steps Solution
1. Express the given z-value in three-digit form. z = 1.70
2. In the table, find the row z = 1.7.
3. In the table, find the column with heading 0.00.
4. Read the area at the intersection. This area is 0.4554. This is the required area.
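The table look-ups in the three examples can be cross-checked by computing the area directly. The sketch below assumes the standard identity that the area between 0 and z equals 0.5 · erf(|z| / √2); the function name area_zero_to_z is hypothetical and is not part of the module.

```python
import math


def area_zero_to_z(z):
    """Area under the standard normal curve between 0 and |z|.

    By symmetry, a negative z gives the same area as the positive z.
    """
    return 0.5 * math.erf(abs(z) / math.sqrt(2.0))


# The z-values used in Examples 1 to 3.
for z in (1.36, -2.58, 1.7):
    print(f"z = {z:5.2f}  area = {area_zero_to_z(z):.4f}")
```

Rounded to four decimal places, the computed areas agree with the table entries 0.4131, 0.4951, and 0.4554.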

Z-Scores
The areas under the normal curve are given in terms of z-values or z-scores. The z-score locates X either within a sample or
within a population. The formula for calculating z is:
$Z = \frac{X - \mu}{\sigma}$ (z-score for population data)
$Z = \frac{X - \bar{x}}{s}$ (z-score for sample data)
where: X = given measurement
$\mu$ = population mean
$\sigma$ = population standard deviation
$\bar{x}$ = sample mean
s = sample standard deviation
Example 1: Given the mean, $\mu$ = 50, and the standard deviation, $\sigma$ = 4, of a population of Reading scores, find the z-value
that corresponds to a score X = 58.
Steps Solution
1. Use the computing formula for finding z-scores of population data. $Z = \frac{X - \mu}{\sigma}$
2. Check the given values. Since these are population values, the z-score locates X within a population. $\mu$ = 50, $\sigma$ = 4, and X = 58.
3. Substitute the given values in the computing formula. $Z = \frac{58 - 50}{4}$
4. Compute the z-value. $Z = \frac{8}{4} = 2$
Thus, the z-value that corresponds to the raw score 58 is 2 in a population distribution.

Interpretation: The score is 2 SD units above the mean.

Example 2: Given the sample mean, $\bar{x}$ = 26, and the sample standard deviation, s = 6, find the z-value that corresponds to
a score X = 20.
Steps Solution
1. Use the computing formula for finding z-scores of sample data. $Z = \frac{X - \bar{x}}{s}$
2. Check the given values. Since these are sample values, the z-score locates X within a sample. $\bar{x}$ = 26, s = 6, and X = 20.
3. Substitute the given values in the computing formula. $Z = \frac{20 - 26}{6}$
4. Compute the z-value. $Z = \frac{-6}{6} = -1$
Thus, the z-value that corresponds to the raw score 20 is -1 in a sample distribution.

Interpretation: The score is 1 SD unit below the mean.
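The two worked examples use the same arithmetic, so a single helper covers both the population and the sample formula. This is a minimal sketch; the function name z_score and its argument names are assumptions made for illustration.

```python
def z_score(x, mean, sd):
    """Distance of x from the mean, expressed in standard deviations.

    Pass the population mean and SD, or the sample mean and SD,
    depending on which data the score is located in.
    """
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd


# Example 1: population values (mean 50, SD 4, score 58).
print(z_score(58, 50, 4))   # 2.0

# Example 2: sample values (mean 26, SD 6, score 20).
print(z_score(20, 26, 6))   # -1.0
```

A positive result places the score above the mean and a negative result places it below the mean.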

Regions of Areas Under the Normal Curve


We have learned that the area under the curve is 1. So, we can make the correspondence between area and probability.
We have also learned how to use the z-table so that we can identify the areas of regions under the normal curve. When
we speak of a region under the curve, we are in fact interested in the area of that region.
We see that using the z-table, we can determine specific regions under the normal curve. For example, 50% of the region
under the curve is below the mean and 50% is above the mean. Specific regions can be determined in terms of their
usefulness in a situation.
Since the z-table provides the proportion of the area
between two specific values under the curve, regions
under the curve can be described in terms of area.
Example 3. Determine the area under the standard normal curve to the right of z = 1.63
Solution.
The value that corresponds to z = 1.63 is 0.4484
A = 0.5 – 0.4484
A = 0.0516

Example 4. Determine the area under the standard normal curve to the right of z = -0.52
Solution.
The value that corresponds to z = -0.52 is 0.1985
A = 0.5 + 0.1985
A = 0.6985

Example 5. Determine the area under the standard normal curve to the left of z = 1.25
Solution.
The value that corresponds to z = 1.25 is 0.3944
A = 0.5 + 0.3944
A = 0.8944

Example 6. Determine the area under the standard normal curve to the left of z = -0.95
Solution.
The value that corresponds to z = -0.95 is 0.3289
A = 0.5 – 0.3289
A = 0.1711

Area between a and b

Sign              Notation                          Operation
same sign         P(a < z < b)                      subtract the smaller area from the larger area
different sign    P(a < z < - b), P(- a < z < b)    add the larger area and the smaller area

Example 7. Find the area under the standard normal curve between z = 1.03 and z = -0.37.
Solution.
The value that corresponds to z = 1.03 is 0.3485 and the value that corresponds to z = -0.37 is 0.1443.
A = 0.3485 + 0.1443
A = 0.4928

Example 8. Find the area under the standard normal curve between z = 0.32 and z = 2.42.
Solution.
The value that corresponds to z = 0.32 is 0.1255 and the value that corresponds to z = 2.42 is 0.4922.
A = 0.4922 – 0.1255
A = 0.3667
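The add-or-subtract rules in Examples 3 to 8 all follow from the cumulative area to the left of a z-value. The sketch below uses NormalDist from Python's standard statistics module (Python 3.8 or later) instead of the printed z-table; the helper names area_left, area_right, and area_between are assumptions for illustration.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def area_left(z):
    """Area under the standard normal curve to the left of z."""
    return STANDARD_NORMAL.cdf(z)


def area_right(z):
    """Area under the standard normal curve to the right of z."""
    return 1.0 - STANDARD_NORMAL.cdf(z)


def area_between(a, b):
    """Area between two z-values, whatever their signs or order."""
    low, high = sorted((a, b))
    return STANDARD_NORMAL.cdf(high) - STANDARD_NORMAL.cdf(low)


print(f"right of  1.63: {area_right(1.63):.4f}")
print(f"right of -0.52: {area_right(-0.52):.4f}")
print(f"left of   1.25: {area_left(1.25):.4f}")
print(f"left of  -0.95: {area_left(-0.95):.4f}")
print(f"between -0.37 and 1.03: {area_between(1.03, -0.37):.4f}")
print(f"between  0.32 and 2.42: {area_between(0.32, 2.42):.4f}")
```

Working from cumulative areas removes the need to decide between adding and subtracting: the same subtraction handles same-sign and different-sign cases, and the results agree with the table-based answers to about four decimal places.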
Written Works 1

A. Find the area that corresponds to each of the following z-values:


1. Z = 0.3
2. Z = 1.96
3. Z = -1.15
4. Z = 2.58
5. Z = -0.99
B. Find the area of the region between each of the following pairs of z-values:
1. Z = 1.21, Z = -2.12
2. Z = 0.4, Z = 1.99
3. Z = -2.56, Z = -1.67
4. Z = 0, Z = 1.5
5. Z = 1.78, Z = 1.23

Learning Information 2.2


Learning Content 1: Sampling and Sampling Distribution

Sampling Distribution of Sample Means


A sampling distribution of sample means is a frequency distribution using the means computed from all possible random
samples of a specific size taken from a population. The mean of a sample may be less than, greater than, or equal to the mean of the
population.
Sampling error. The difference between the sample mean and the population mean.

Steps in Constructing the Sampling Distribution of the Means


1. Determine the number of possible samples that can be drawn from the population using
the formula NCn, where N is the size of the population and n is the size of the sample.
2. List all the possible samples and compute the mean of each sample.
3. Construct a frequency distribution of the sample means obtained in Step 2.
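Example: A population consists of the numbers 2, 4, 9, 10, and 5. Let us list all possible samples of size 3 from this population
and compute the mean of each sample.
1. Determine the number of sets of all possible random samples that can be drawn from the given population by using the
formula NCn, where N is the population size and n is the sample size.

$_NC_n = \frac{N!}{n!(N-n)!}$, with N = 5 and n = 3

$_5C_3 = \frac{5!}{3!(5-3)!} = \frac{5!}{3! \cdot 2!} = \frac{5 \cdot 4 \cdot 3!}{3! \cdot 2 \cdot 1} = \frac{20}{2} = 10$

There are 10 samples of size 3 that can be drawn from the given population.
2. List all the possible samples and compute the mean of each sample.

Sample        Mean
2, 4, 9       5.00
2, 4, 10      5.33
2, 4, 5       3.67
2, 9, 10      7.00
2, 9, 5       5.33
2, 10, 5      5.67
4, 9, 10      7.67
4, 9, 5       6.00
4, 10, 5      6.33
9, 10, 5      8.00

3. Construct a frequency distribution of the sample means obtained in Step 2.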


Sample Mean Frequency Probability
3.67 1 1/10 = 0.10
5.00 1 1/10 = 0.10
5.33 2 2/10 = 0.20
5.67 1 1/10 = 0.10
6.00 1 1/10 = 0.10
6.33 1 1/10 = 0.10
7.00 1 1/10 = 0.10
7.67 1 1/10 = 0.10
8.00 1 1/10 = 0.10
Total n = 10 1.00
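The three construction steps can be reproduced with a short script. This sketch uses the population from the example above and only the Python standard library; the variable names are illustrative.

```python
from collections import Counter
from itertools import combinations
from math import comb

population = [2, 4, 9, 10, 5]
sample_size = 3

# Step 1: number of possible samples, NCn.
number_of_samples = comb(len(population), sample_size)

# Step 2: list every sample and compute its mean (rounded to 2 decimals).
samples = list(combinations(population, sample_size))
means = [round(sum(sample) / sample_size, 2) for sample in samples]

# Step 3: frequency distribution of the sample means.
frequency = Counter(means)

print("possible samples:", number_of_samples)
for mean in sorted(frequency):
    count = frequency[mean]
    print(f"{mean:.2f}  {count}  {count}/{len(samples)} = {count / len(samples):.2f}")
```

The script lists 10 samples and 9 distinct means, with 5.33 occurring twice, in agreement with the frequency table.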
Solving Problems Involving Sampling Distribution


The Central Limit Theorem is of fundamental importance in statistics because it justifies the use of normal curve methods
for a wide range of problems. This theorem applies automatically to sampling from an infinite population. It also assures us that
no matter what the shape of the population distribution is, the sampling distribution of the sample means is
approximately normal whenever n is large.

$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$

Where $\bar{X}$ = sample mean
$\mu$ = population mean
$\sigma$ = population standard deviation
n = sample size
Example: The average time it takes a group of college students to complete a certain examination is 46.2 minutes. The
standard deviation is 8 minutes. Assume that the variable is normally distributed.

a. What is the probability that a randomly selected college student will complete the examination in less than 43 minutes?

Steps Solutions
1. Identify the given information. X = 43, $\mu$ = 46.2, $\sigma$ = 8
2. Identify what is asked for. P(X<43)
3. Identify the formula to be used. Here, we are dealing with an individual value obtained from a population. So, we will use the formula $Z = \frac{X - \mu}{\sigma}$ to standardize 43.
4. Solve the problem.
$Z = \frac{X - \mu}{\sigma}$
$Z = \frac{43 - 46.2}{8}$
Z = -0.40
We shall find P(X<43) by getting the area under the normal curve.
P(X<43) = P(z<-0.40)
= 0.5000 – 0.1554
P(X<43) = 0.3446
5. State the final answer. So, the probability that a randomly selected college student will complete the examination in less than 43
minutes is 0.3446 or 34.46%.

b. If 50 randomly selected college students take the examination, what is the probability that the mean time it takes the
group to complete the test will be less than 43 minutes?

Steps Solutions
1. Identify the given information. $\bar{X}$ = 43, $\mu$ = 46.2, $\sigma$ = 8, n = 50
2. Identify what is asked for. P($\bar{X}$<43)
3. Identify the formula to be used. Here, we are dealing with data about the sample means. So, we will use the formula $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$ to standardize 43.
4. Solve the problem.
$Z = \frac{43 - 46.2}{8 / \sqrt{50}}$
$Z = \frac{-3.2}{1.13}$
Z = -2.83
We shall find P($\bar{X}$<43) by getting the area under the normal curve.
P($\bar{X}$<43) = P(z<-2.83)
= 0.5000 – 0.4977
P($\bar{X}$<43) = 0.0023
5. State the final answer. So, the probability that the mean time of 50 randomly selected college students to complete the examination is
less than 43 minutes is 0.0023 or 0.23%.

c. Does it seem reasonable that a college student would finish the examination in less than 43 minutes? YES
d. Does it seem reasonable that the mean of the 50 college students could be less than 43 minutes? NO, very unlikely.
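Parts a and b differ only in the standard deviation used to standardize 43: the population standard deviation for one student, and the standard deviation divided by √n for the mean of n students. The sketch below recomputes both probabilities with NormalDist from the standard statistics module rather than the z-table, so the last digits may differ slightly from table-based answers; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu = 46.2      # mean completion time in minutes
sigma = 8.0    # standard deviation in minutes
cutoff = 43.0

# a. One randomly selected student: standardize with sigma.
z_individual = (cutoff - mu) / sigma
p_individual = NormalDist().cdf(z_individual)

# b. Mean of n = 50 students: standardize with sigma / sqrt(n).
n = 50
z_mean = (cutoff - mu) / (sigma / sqrt(n))
p_mean = NormalDist().cdf(z_mean)

print(f"a. z = {z_individual:.2f}, P(X < 43) = {p_individual:.4f}")
print(f"b. z = {z_mean:.2f}, P(mean < 43) = {p_mean:.4f}")
```

The much smaller probability in part b reflects the narrower spread of sample means compared with individual scores.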

Written Work 2

A. Problem Solving.
1. A population consists of numbers from 1-5. Consider samples of size 4 that can be drawn from this population.
a. List all the possible samples and the corresponding mean.
b. Construct the sampling distribution means.
2. The average number of milligrams of cholesterol in a cup of a certain brand of ice cream is 660 mg, and the
standard deviation is 35. Assume the variable is normally distributed.
a. If a cup of ice cream is selected, what is the probability that the cholesterol content will be more than 670
mg?
b. If a sample of 10 cups of ice cream is selected, what is the probability that the mean of the sample will be
larger than 670 mg?

Performance Task 1

Instructions.
The teacher will provide a dataset containing test scores of 40 students. Randomly select a sample of 15 students from this
dataset. Calculate the mean and standard deviation of their test scores. Create a histogram to visualize the distribution of
scores in your sample. Explain the concept of sampling distribution and its importance in statistics. Summarize your
findings, discussing how these sample statistics might reflect the overall test scores of all students in the dataset.

Learning Information 2.3


Learning Content 1: Estimation of Parameters

Point Estimate of a Population Mean


An estimate is a value or a range of values that approximates a parameter. It is based on sample statistics
computed from sample data. Estimation is the process of determining parameter values.
Recall that parameters are numerical descriptive measures of populations and they are usually unknown. However, we
can estimate population parameters from sample values. In Statistics, sample measures, such as the sample means and
standard deviations, are used to estimate population values.
A point estimate is a specific numerical value used to estimate a population parameter. The sample mean is the best point estimate of
the population mean.
Example: Mr. Santiago’s company sells bottled coconut juice. He claims that the bottle contains 500 ml of such
juice. A consumer group wanted to know if his claim is true. They took 4 random samples of 10 such bottles and obtained
the capacity, in ml, of each bottle. The result is shown as follows:

Sample 1 500 498 497 503 499 497 497 497 497 495
Sample 2 500 500 495 494 500 500 500 500 498 497
Sample 3 497 502 497 496 497 497 497 497 497 495
Sample 4 501 495 500 497 497 500 500 495 497 497

Assuming that the measurement was carefully obtained and that the only kind of error present is the sampling error, what is
the point estimate of the population mean?
Solution: When dealing with a large number of values, the mean of small samples may be obtained. These means
constitute a sampling distribution of means. To find the overall mean, simply find the sum of the mean values. Then divide
the sum by the total number of sample means.
MeanRow 1 = $\frac{\sum X}{n}$

MeanRow 1 = $\frac{500 + 498 + 497 + 503 + 499 + 497 + 497 + 497 + 497 + 495}{10}$
MeanRow 1 = 498, and so on

Say, we already obtained the sample mean of each row; we will now find the overall mean.
Sample Row Mean
1 498.0
2 498.4
3 497.2
4 497.9

MeanOverall = $\frac{\text{sum of the row means}}{\text{number of row means}}$

MeanOverall = $\frac{498.0 + 498.4 + 497.2 + 497.9}{4}$
MeanOverall = 497.875. This will be our point estimate of the population mean.
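The row means and the overall mean can be recomputed from the bottle capacities given in the example. This is a minimal sketch; the dictionary layout and variable names are assumptions made for illustration.

```python
# Capacities in ml, copied from the four samples in the example.
samples = {
    1: [500, 498, 497, 503, 499, 497, 497, 497, 497, 495],
    2: [500, 500, 495, 494, 500, 500, 500, 500, 498, 497],
    3: [497, 502, 497, 496, 497, 497, 497, 497, 497, 495],
    4: [501, 495, 500, 497, 497, 500, 500, 495, 497, 497],
}

# Mean of each sample (row).
row_means = {row: sum(values) / len(values) for row, values in samples.items()}

# Overall mean: the mean of the row means.
point_estimate = sum(row_means.values()) / len(row_means)

for row, mean in row_means.items():
    print(f"Sample {row}: {mean:.1f}")
print(f"Point estimate of the population mean: {point_estimate:.3f}")
```

Because every sample has the same size, the mean of the row means equals the mean of all 40 observations.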

Confidence Interval Estimate


When we describe population values, we want to be confident about our estimates. Other than the point estimate, we can
use a range of values. This range of values is called an interval estimate. An interval estimate, called a confidence interval,
is a range of values that is used to estimate a parameter. This estimate may or may not contain the true parameter value.
For example, the age of a beginning grade 1 pupil is any value between 6 and 7 years. In determining an interval estimate,
a degree of confidence (expressed as a percentage such as 95%) that the interval contains the true and fixed parameter is
made. That is, if we collect several random samples and then calculate a confidence interval from each sample, these
confidence intervals are constructed wide enough so that 95% of them contain the true population parameter and 5% do
not. The value 95% is also known as the confidence level.
The confidence level of an interval estimate of a parameter is the probability that the interval estimate contains the
parameter. It describes what percentage of intervals from many different samples contain the unknown population
parameter.
There are three commonly used confidence intervals: the 90%, the 95%, and the 99% confidence intervals. Shorter intervals
are more informative than longer ones. A short confidence interval can be obtained by having a large sample or by using a
lower confidence level.
Four-step Process in Computing the Interval Estimate
Step 1: Describe the population parameter of interest.
Step 2: Specify the confidence interval criteria.
a. Check the assumptions.
b. Determine the test statistic to be used.
c. State the level of confidence.
Step 3: Collect and present sample evidence.
a. Collect the sample information.
b. Find the point estimate.
Step 4: Determine the confidence interval.
a. Determine the confidence coefficient.
b. Find the maximum error E of the estimates.
c. Find the upper and lower confidence limits.
d. Describe/interpret the results.

Example: Find the estimate of the population mean using the 95% confidence level.
Given: Sample Mean = 72, n = 120, and $\sigma$ = 3.
Solution: With the large sample, by the Central Limit Theorem, the sampling distribution of the mean is approximately normal.

a. Point Estimate
Steps Solution
1. Describe the population parameter of interest. ● The parameter of interest is the mean $\mu$ of the population to which
the sample purportedly belongs.
2. Specify the confidence interval criteria.
a. Check the assumptions. ● The population standard deviation $\sigma$ is given.
● The sampling distribution of the mean is approximately normal, as guaranteed by the CLT.
b. Determine the test statistic to be used. ● The test statistic is z with $\sigma$ = 3.
c. State the level of confidence. ● The question asks for 95% confidence, or
$\alpha$ = 0.05. This means that if more random samples
were taken from the target population, and an
interval estimate is made for each sample, then 95%
of the intervals will contain the true parameter value.
3. Collect and present sample evidence.
a. Collect the sample information. ● The sample information consists of sample mean =
72, n = 120, and $\sigma$ = 3.
b. Find the point estimate. ● The point estimate for the population mean $\mu$ is
72 (the sample mean).
b. 95% Confidence Interval
4. Determine the confidence interval.
a. Determine the confidence coefficient. ● The confidence coefficient is 1.96 for 95% (2.58 for
99% and 1.65 for 90%).
b. Find the maximum error E. ● $E = z_{\alpha/2} \left( \frac{\sigma}{\sqrt{n}} \right)$
$= 1.96 \left( \frac{3}{\sqrt{120}} \right)$
$= 1.96 \left( \frac{3}{10.95} \right)$
= 1.96 (0.274)
E = 0.54
c. Find the lower and upper confidence limits. ● $\bar{X} - z_{\alpha/2} \left( \frac{\sigma}{\sqrt{n}} \right) < \mu < \bar{X} + z_{\alpha/2} \left( \frac{\sigma}{\sqrt{n}} \right)$
72 – 0.54 to 72 + 0.54
71.46 to 72.54
d. Describe the results. ● We can say with 95% confidence that the interval
between 71.46 and 72.54 contains the population
mean based on the sample size of 120.
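Step 4 of the worked example can be reproduced in a few lines. The sketch below assumes the known-sigma z-interval used in the example; the function name z_confidence_interval is hypothetical. It carries full precision through the calculation, which gives a maximum error of about 0.537, whereas rounding the quotient 3/√120 to 0.27 before multiplying gives 0.53.

```python
from math import sqrt


def z_confidence_interval(sample_mean, sigma, n, z):
    """Return (lower limit, upper limit, maximum error E) for a z-interval."""
    max_error = z * sigma / sqrt(n)
    return sample_mean - max_error, sample_mean + max_error, max_error


# Values from the example: mean 72, sigma 3, n 120, coefficient 1.96 for 95%.
lower, upper, max_error = z_confidence_interval(72, 3, 120, 1.96)

print(f"E = {max_error:.4f}")
print(f"95% confidence interval: {lower:.2f} to {upper:.2f}")
```

Changing the last argument to 2.58 or 1.65 gives the 99% and 90% intervals with the coefficients listed in the example.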

Confidence Level and Sample Size


There are two things to remember when we decide on the sample size we need: confidence and narrowness
of the interval. The formula for determining the minimum sample size needed when estimating the population mean is
$n = \left( \frac{z_{\alpha/2} \cdot \sigma}{E} \right)^2$
Since the value of $\sigma$ is usually unknown, it can be estimated by the standard deviation s from a prior sample. Alternatively,

we may approximate the range R of observations in the population and make a conservative estimate of $\sigma$. In any case,
round up the obtained result to ensure that the sample size will be sufficient to achieve the specified reliability.
Example 1: In a certain village, Leony wants to estimate the mean weight $\mu$, in kilograms, of all six-year old children
to be included in a feeding program. She wants to be 99% confident that the estimate of $\mu$ is accurate to within 0.06 kg.
Suppose that, from a previous study, the standard deviation of the weights of the target population was 0.5 kg. What should
the sample size be?
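One way to work Example 1 is sketched below. It assumes the usual minimum-sample-size formula n = (z · σ / E)², and it takes z = 2.58 for 99% confidence from the confidence coefficients listed earlier in this module; the function name min_sample_size is hypothetical, and the module itself does not show a worked solution here.

```python
from math import ceil


def min_sample_size(z, sigma, max_error):
    """Smallest whole sample size whose maximum error does not exceed max_error."""
    raw = (z * sigma / max_error) ** 2
    return ceil(raw)  # always round up so the sample is large enough


# Example 1: 99% confidence (z = 2.58, assumed), sigma = 0.5 kg, E = 0.06 kg.
raw_value = (2.58 * 0.5 / 0.06) ** 2
n = min_sample_size(2.58, 0.5, 0.06)

print(f"computed value: {raw_value:.2f}")
print(f"minimum sample size: {n}")
```

Under these assumptions the computed value is about 462.25, so Leony would need at least 463 children. Using a slightly different coefficient such as 2.575 would change the result by a few units.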
Written Works 3

A. Problem Solving.
1. A researcher wants to estimate the number of hours that 5-year old children spend watching television. A
sample of 50 five-year old children was observed to have a mean viewing time of 3 hours. If the population is
normally distributed with a population standard deviation of 0.5 hours, find:
a. The best point estimate of the population mean.
b. The 95% confidence interval of the population mean.
2. A random selection of 40 entering Mathematics majors has the following GPAs. Assuming that the standard
deviation is 0.46, estimate the true mean GPA with 99% confidence.

4.0 3.5 3.0 3.3 3.8 3.1 3.6 4.0 3.9 3.5
3.2 3.0 3.5 3.2 3.0 3.2 4.0 3.0 3.4 3.0
3.0 2.8 5.6 3.0 3.2 3.5 3.2 2.8 3.3 3.1
3.2 2.9 3.0 2.8 4.0 3.7 3.0 3.3 3.2 2.8

Learning Information 2.4


Learning Content 1: Hypothesis Testing

Understanding Hypothesis Testing


We make decisions every day. Some of these are important while others are not. In decision-making, we usually follow
simple processes: weigh alternatives, collect evidence, and make a decision. After a decision is made, an appropriate
interpretation is made (or an action is undertaken). We follow these basic processes in testing hypothesis in Statistics.
In Statistics, decision-making starts with a concern about a population regarding its characteristics denoted by parameter
values. We might be interested in a population parameter such as the mean or the proportion. Hypothesis testing is another
area of Inferential Statistics. It is a decision-making process for evaluating claims about a population based on the
characteristics of a sample purportedly coming from that population. The decision is whether the characteristic is acceptable
or not. How does it differ from estimation? While estimation is concerned with determining specific parameter values, testing
hypothesis is hypothesizing about the population parameter and subjecting this hypothesis to a test.
There are two types of hypotheses: the null hypothesis and the alternative hypothesis. The null hypothesis, denoted by
H0, is a statement that there is no difference between a parameter and a specific value, or that there is no difference
between two parameters. The alternative hypothesis, denoted by H1, is a statement that there is a difference between a
parameter and a specific value, or that there is a difference between two parameters. The null hypothesis is what we want
to test. It states an exact value about the parameter. When the null hypothesis is accepted, the buck stops there! But when
the null hypothesis is rejected, this leads to another option, which is the alternative hypothesis that allows for the possibility
of many values.

Formulating Hypotheses
Example 1: The owner of a factory that sells a particular bottled fruit juice claims that the average capacity of a bottle of
their product is 250 ml. Is the claim true?
To test the claim, the members of a consumer group did the following:
1. Get a sample of 100 such bottles.
2. Measure the capacity of each bottle.
3. Compare the sample mean and the claim.
The observed mean capacity of the 100 bottles is 243 ml. The sample standard deviation is 10 ml. In the example, the
owner’s statement (called the claim) is a general statement. The claim is that the capacity of all their bottled products is 250 ml
per bottle. So, the population mean is $\mu$ = 250 ml. On the other hand, the consumer group has a sample value, 243
ml, which is clearly the sample mean. Thus, the hypotheses would be:
H0: The bottled drinks contain 250 ml per bottle. (This is the claim.)
H1: The bottled drinks do not contain 250 ml per bottle. (This is the opposite of the claim.)
But these statements should be written in symbols. For now, let us drop the unit of measure and simply write:
H0: $\mu = 250$
H1: $\mu \neq 250$
In mathematics, the symbol $\neq$ in the alternative hypothesis suggests either a greater than relation or a less than relation. What
is the interpretation of the symbol $\neq$ in the example? It means that the consumer group is interested in getting either a sample
mean greater than 250 or a sample mean less than 250. However, this does not make sense in the given
exercise. The consumer group has a purpose, a direction. The consumer group may want to refute the claim. So, the
appropriate alternative hypothesis is:
H1: $\mu < 250$

● When the alternative hypothesis utilizes the $\neq$ symbol, the test is said to be non-directional. When the alternative
hypothesis utilizes the < or the > symbol, the test is said to be directional.
● In problems that involve hypothesis testing, there are words like greater, efficient, improves, effective,
increases, and so on that suggest a right-tailed direction on the formulation of the alternative hypothesis. Words
like decrease, less than, smaller, and the like suggest a left-tailed direction.

Example 2: A teacher wants to know if listening to popular music affects the performance of pupils. A class of 50 grade
1 pupils was used in the experiment. The mean score was 83 and the standard deviation was 5. A previous study revealed
that $\mu$ = 82 and the standard deviation $\sigma$ is 10.
1. State the null and alternative hypotheses in words and in symbols.
2. State whether the test is directional or non-directional.
Solution: The parameter of interest is the population mean $\mu$ = 82.
1. In words, the hypotheses are:
H0: The sample comes from a population whose mean is 82.
H1: The sample does not come from a population whose mean is 82.
In symbols, we write: H0: $\mu = 82$
H1: $\mu \neq 82$
2. There is no clue as to the direction of the investigation. The phrase affects performance implies either an increase
or a decrease in performance. So, the test is non-directional.

Exploring More Elements of Hypothesis Testing


In hypothesis testing, we make decisions about the null hypothesis. Of course, there are risks when we make decisions.
When we conduct a hypothesis test, there are four possible outcomes. The following decision grid shows these four
outcomes. Again, note that the decision is focused on the null hypothesis.

Decisions about the H0
Reality          Reject H0           Do Not Reject H0
H0 is TRUE.      Type I Error        Correct Decision
H0 is FALSE.     Correct Decision    Type II Error
If the null hypothesis is true and accepted, or if it is false and rejected, the decision is correct. If the null hypothesis is true
and rejected, the decision is incorrect and this is a Type I Error. If the null hypothesis is false and accepted, the decision is
incorrect and this is a Type II Error. In an ideal situation, there is no error when we accept the truth and reject what is false.
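The decision grid can be written as a small function that returns the outcome for each combination of reality and decision. This is only an illustrative sketch; the function name classify_decision and its boolean arguments are assumptions, not part of the module.

```python
def classify_decision(h0_is_true, h0_rejected):
    """Return the outcome of a hypothesis test decision.

    h0_is_true  -- the reality: whether the null hypothesis is true
    h0_rejected -- the decision: whether the null hypothesis was rejected
    """
    if h0_is_true and h0_rejected:
        return "Type I Error"        # a true H0 was rejected
    if not h0_is_true and not h0_rejected:
        return "Type II Error"       # a false H0 was not rejected
    return "Correct Decision"


# Print the four cells of the decision grid.
for h0_is_true in (True, False):
    for h0_rejected in (True, False):
        reality = "TRUE" if h0_is_true else "FALSE"
        decision = "Reject H0" if h0_rejected else "Do not reject H0"
        print(f"H0 is {reality}, {decision}: {classify_decision(h0_is_true, h0_rejected)}")
```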
Example 1: Maria insists that she is 30 years old when, in fact, she is 32 years old. What error is Maria committing?
Solution: Maria is rejecting the truth. She is committing a Type I Error.
Example 2: Stephen says that he is not bald. His hairline is just receding. Is he committing an error? If so, what type?
Solution: Yes. A receding hairline indicates balding. This is a Type I Error. Stephen's action may be to find remedial
measures to stop falling hair.

Example 3: A man plans to go hunting the Philippine monkey-eating eagle, believing that it is proof of his mettle.
What type of error is this?
Solution: Hunting the Philippine eagle is prohibited by the law. Thus, it is not a good sport. It is a Type II error. Since hunting
the Philippine monkey-eating eagle is against the law, the man may find himself in jail if he goes out of his way hunting
endangered species.

Critical Regions/Values (Z-TEST)


Level of Significance    One-tailed Test                  Two-tailed Test
                         Left-tailed     Right-tailed
α = 0.05                 Z < -1.645      Z > 1.645        Z > 1.96 or Z < -1.96
α = 0.01                 Z < -2.33       Z > 2.33         Z > 2.575 or Z < -2.575
α = 0.1                  Z < -1.28       Z > 1.28         Z > 1.645 or Z < -1.645
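The critical values in this table come from the standard normal distribution, so they can be reproduced rather than memorized. The following is a minimal sketch using Python's standard library; the helper name `z_critical` is an assumption made for illustration. The computed values agree with the table up to rounding (for example, the table shows 2.33 and 2.575 where the computation gives about 2.326 and 2.576).

```python
from statistics import NormalDist


def z_critical(alpha: float, tail: str) -> float:
    """Critical z-value for a 'left', 'right', or 'two' tailed test."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "left":
        return std_normal.inv_cdf(alpha)
    if tail == "right":
        return std_normal.inv_cdf(1 - alpha)
    if tail == "two":
        # Positive cutoff; the rejection region is |Z| greater than this value.
        return std_normal.inv_cdf(1 - alpha / 2)
    raise ValueError("tail must be 'left', 'right', or 'two'")


for alpha in (0.05, 0.01, 0.1):
    left = z_critical(alpha, "left")
    right = z_critical(alpha, "right")
    two = z_critical(alpha, "two")
    print(f"alpha = {alpha}: Z < {left:.3f}, Z > {right:.3f}, |Z| > {two:.3f}")
```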

Hypothesis Testing Procedure


1. Formulate the null and alternative hypotheses.
2. Decide the level of significance, α.
3. Choose the appropriate test statistic.
4. Establish the critical region.
5. Compute the value of the test statistic.
6. Decide whether to accept or reject the null hypothesis.
7. Draw a conclusion.

Z-Test (Comparing Sample Mean and Population Mean)


1. $Z = \dfrac{\bar{X} - \mu}{\sigma / \sqrt{n}}$        2. $Z = \dfrac{\bar{X} - \mu}{s / \sqrt{n}}$
Where:
Z = z-test value
X̄ = sample mean
μ = population mean or claimed mean in H0.
σ = population standard deviation
s = sample standard deviation
n = number of cases (n ≥ 30).

Example 1: The treasurer of a certain university claims that the mean monthly salary of their college professors is 21,750.00
with a standard deviation of 6,000.00. A researcher takes a random sample of 75 college professors, who are found to have a
mean monthly salary of 19,375.00. Do the 75 college professors have lower salaries than the rest? Test the claim at α =
0.05 level of significance.
Solution: Apply the different steps in testing hypothesis to solve the given problem.
Step 1: H0: The mean monthly salary of the college professors is 21,750 (μ = 21,750).
H1: The mean monthly salary of the college professors is lower than 21,750 (μ < 21,750).
Step 2: α = 0.05
Step 3: One-tailed test is used because H1 is directional.
Step 4: The tabular value or critical value of z at 0.05 level of significance is -1.645.

Step 5: Compute the z-value.


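X̄ = 19,375.00
μ = 21,750.00
σ = 6,000.00
n = 75
$Z = \dfrac{\bar{X} - \mu}{\sigma / \sqrt{n}}$

$Z = \dfrac{19{,}375 - 21{,}750}{6{,}000 / \sqrt{75}}$

$Z = \dfrac{-2{,}375}{692.82}$

Z = -3.43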
Step 6: The computed value of z = -3.43 lies in the rejection region, therefore reject H0 and accept H1.
Step 7: Conclusion: The mean monthly salary of the college professors is lower than 21,750.
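Steps 4 to 6 of this example can be checked with a few lines of code. This is a sketch under the assumptions of the example (population standard deviation known, left-tailed test at the 0.05 level); the function name `z_statistic` is hypothetical.

```python
from math import sqrt


def z_statistic(sample_mean: float, claimed_mean: float,
                std_dev: float, n: int) -> float:
    """Z = (sample mean - claimed mean) / (standard deviation / sqrt(n))."""
    return (sample_mean - claimed_mean) / (std_dev / sqrt(n))


# Values taken from Example 1 (professors' monthly salaries).
z = z_statistic(sample_mean=19375.00, claimed_mean=21750.00,
                std_dev=6000.00, n=75)

critical_value = -1.645          # left-tailed cutoff at the 0.05 level
reject_h0 = z < critical_value   # left-tailed rejection region: Z < -1.645

print(f"z = {z:.2f}, reject H0: {reject_h0}")
```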
Example 2: The mean weight of the baggage carried into an airplane by individual passengers at Davao International
Airport is 19.8 kg. A statistician takes a random sample of 110 passengers and obtains a sample mean weight of 18.5 kg
with a standard deviation of 8.5 kg. Test the claim at α = 0.01 level of significance.
Step 1: H0: μ = 19.8
H1: μ ≠ 19.8
Step 2: α = 0.01
Step 3: Two-tailed test is used because H1 is non-directional.
Step 4: The tabular values or critical values of z at 0.01 level of significance are -2.575 and 2.575.
Step 5: Compute the z-value.
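X̄ = 18.5
μ = 19.8
s = 8.5
n = 110
$Z = \dfrac{\bar{X} - \mu}{s / \sqrt{n}}$

$Z = \dfrac{18.5 - 19.8}{8.5 / \sqrt{110}}$

$Z = \dfrac{-1.3}{0.8104}$

Z = -1.60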
Step 6: The computed value of z = -1.60 lies in the non-rejection region, therefore accept H0.
Step 7: Conclusion: There is no significant difference between the sample mean weight and the claimed mean weight of
baggage carried by individual passengers.
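For a two-tailed test the computed z is compared against both ends of the critical region. The sketch below reworks this example; the helper `rejects_h0` and its `tail` argument are hypothetical names, and the cutoff 2.575 is the two-tailed value for the 0.01 level from the critical-values table given earlier.

```python
from math import sqrt


def rejects_h0(z: float, critical_value: float, tail: str) -> bool:
    """Apply the rejection rule for a 'left', 'right', or 'two' tailed z-test."""
    cutoff = abs(critical_value)
    if tail == "left":
        return z < -cutoff
    if tail == "right":
        return z > cutoff
    if tail == "two":
        return abs(z) > cutoff
    raise ValueError("tail must be 'left', 'right', or 'two'")


# Values taken from Example 2 (baggage weights); s is the sample standard deviation.
sample_mean, claimed_mean = 18.5, 19.8
s, n = 8.5, 110

z = (sample_mean - claimed_mean) / (s / sqrt(n))
reject_h0 = rejects_h0(z, critical_value=2.575, tail="two")

print(f"z = {z:.2f}, reject H0: {reject_h0}")
```

The decision is the same for any two-tailed cutoff larger than 1.60, so the conclusion does not hinge on the exact tabular value used.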

Written Works 4

A. Problem Solving. Write the null and alternative hypotheses in symbols for each of the following:
1. A farmer believes that using organic fertilizers on his plants will yield greater income. His average income from
the past was 200,000.00 per year.
2. The net weight of a packet of a snack brand is 130 g. A sample of 80 packets yielded a sample mean weight of
112 g with a standard deviation of 15 grams.
3. In a graduate college, the average length of the registration time during a semester is 120 minutes with a
standard deviation of 25 minutes. With the introduction of a new registration procedure, the administration
thinks that it would take less than or equal to 120 minutes.
4. The average height of Grade 8 female students is 158.2 cm. The mean height of a sample of 100 female
students is 160 cm with a standard deviation of 6 cm.

Performance Task 2

Testing Hypothesis.
1. A researcher administered a developed problem-solving test to 50 randomly selected Grade 11 pupils. In this sample, the
mean is 80. The mean and the standard deviation of the population used in the standardization of the test were
75 and 15, respectively. Use the 95% confidence level to test if the sample mean differs significantly from the
population mean.
2. The owner of a factory that sells a particular bottled fruit juice claims that the average capacity of their product
is 250 ml. To test whether the true mean is less than the claimed mean, a consumer group gets a sample of 100
such bottles, measures the capacity of each bottle, and then finds the mean capacity to be 248 ml and the
standard deviation to be 5 ml. Is the claim true? (Use 0.05 level of significance.)


Learning Information 2.5


Learning Content 1: Hypothesis Testing 2

T-Test (Comparing Sample Mean and Population Mean)


When the sample size is small (n < 30) and the population standard deviation is unknown, use the sample
standard deviation s as an estimator of the population standard deviation σ. In cases like this, the t-distribution/t-test is
appropriate as the test statistic. When using the t-test, it is always assumed that the sampled population is
normal or approximately normal.
The t-distribution was developed by William Gosset, an employee of an Irish brewery. He chose to publish
using the pen name “Student”. To honor his work, the distribution is known today as Student’s t-distribution.
The formula for the t-test is,
$T = \dfrac{\bar{X} - \mu}{s / \sqrt{n}}$

Where:
T = t-test value
X̄ = sample mean
μ = population mean or claimed mean in H0.
s = sample standard deviation
n = number of cases (n < 30)
df = n - 1 (for the critical region).
T-Distribution Table (One Tail)
df α = 0.1 0.05 0.025 0.01 0.005 0.001 0.0005
∞ tα = 1.282 1.645 1.960 2.326 2.576 3.091 3.291
1 3.078 6.314 12.706 31.821 63.656 318.289 636.578
2 1.886 2.920 4.303 6.965 9.925 22.328 31.600
3 1.638 2.353 3.182 4.541 5.841 10.214 12.924
4 1.533 2.132 2.776 3.747 4.604 7.173 8.610
5 1.476 2.015 2.571 3.365 4.032 5.894 6.869
6 1.440 1.943 2.447 3.143 3.707 5.208 5.959
7 1.415 1.895 2.365 2.998 3.499 4.785 5.408
8 1.397 1.860 2.306 2.896 3.355 4.501 5.041
9 1.383 1.833 2.262 2.821 3.250 4.297 4.781
10 1.372 1.812 2.228 2.764 3.169 4.144 4.587
11 1.363 1.796 2.201 2.718 3.106 4.025 4.437
12 1.356 1.782 2.179 2.681 3.055 3.930 4.318
13 1.350 1.771 2.160 2.650 3.012 3.852 4.221
14 1.345 1.761 2.145 2.624 2.977 3.787 4.140
15 1.341 1.753 2.131 2.602 2.947 3.733 4.073
16 1.337 1.746 2.120 2.583 2.921 3.686 4.015
17 1.333 1.740 2.110 2.567 2.898 3.646 3.965


18 1.330 1.734 2.101 2.552 2.878 3.610 3.922
19 1.328 1.729 2.093 2.539 2.861 3.579 3.883
20 1.325 1.725 2.086 2.528 2.845 3.552 3.850
21 1.323 1.721 2.080 2.518 2.831 3.527 3.819
22 1.321 1.717 2.074 2.508 2.819 3.505 3.792
23 1.319 1.714 2.069 2.500 2.807 3.485 3.768
24 1.318 1.711 2.064 2.492 2.797 3.467 3.745
25 1.316 1.708 2.060 2.485 2.787 3.450 3.725
26 1.315 1.706 2.056 2.479 2.779 3.435 3.707
27 1.314 1.703 2.052 2.473 2.771 3.421 3.689
28 1.313 1.701 2.048 2.467 2.763 3.408 3.674
29 1.311 1.699 2.045 2.462 2.756 3.396 3.660
30 1.310 1.697 2.042 2.457 2.750 3.385 3.646
60 1.296 1.671 2.000 2.390 2.660 3.232 3.460
120 1.289 1.658 1.980 2.358 2.617 3.160 3.373
1000 1.282 1.646 1.962 2.330 2.581 3.098 3.300

Two Tails T Distribution Table


df α = 0.2 0.10 0.05 0.02 0.01 0.002 0.001
∞ tα = 1.282 1.645 1.960 2.326 2.576 3.091 3.291
1 3.078 6.314 12.706 31.821 63.656 318.289 636.578
2 1.886 2.920 4.303 6.965 9.925 22.328 31.600
3 1.638 2.353 3.182 4.541 5.841 10.214 12.924
4 1.533 2.132 2.776 3.747 4.604 7.173 8.610
5 1.476 2.015 2.571 3.365 4.032 5.894 6.869
6 1.440 1.943 2.447 3.143 3.707 5.208 5.959
7 1.415 1.895 2.365 2.998 3.499 4.785 5.408
8 1.397 1.860 2.306 2.896 3.355 4.501 5.041
9 1.383 1.833 2.262 2.821 3.250 4.297 4.781
10 1.372 1.812 2.228 2.764 3.169 4.144 4.587
11 1.363 1.796 2.201 2.718 3.106 4.025 4.437
12 1.356 1.782 2.179 2.681 3.055 3.930 4.318
13 1.350 1.771 2.160 2.650 3.012 3.852 4.221
14 1.345 1.761 2.145 2.624 2.977 3.787 4.140
15 1.341 1.753 2.131 2.602 2.947 3.733 4.073
16 1.337 1.746 2.120 2.583 2.921 3.686 4.015
17 1.333 1.740 2.110 2.567 2.898 3.646 3.965
18 1.330 1.734 2.101 2.552 2.878 3.610 3.922
19 1.328 1.729 2.093 2.539 2.861 3.579 3.883
20 1.325 1.725 2.086 2.528 2.845 3.552 3.850
21 1.323 1.721 2.080 2.518 2.831 3.527 3.819


22 1.321 1.717 2.074 2.508 2.819 3.505 3.792
23 1.319 1.714 2.069 2.500 2.807 3.485 3.768
24 1.318 1.711 2.064 2.492 2.797 3.467 3.745
25 1.316 1.708 2.060 2.485 2.787 3.450 3.725
26 1.315 1.706 2.056 2.479 2.779 3.435 3.707
27 1.314 1.703 2.052 2.473 2.771 3.421 3.689
28 1.313 1.701 2.048 2.467 2.763 3.408 3.674
29 1.311 1.699 2.045 2.462 2.756 3.396 3.660
30 1.310 1.697 2.042 2.457 2.750 3.385 3.646
60 1.296 1.671 2.000 2.390 2.660 3.232 3.460
120 1.289 1.658 1.980 2.358 2.617 3.160 3.373
Example 1: According to the Department of Education, high school teachers work an average of 40 hours per week
during the school year. A district supervisor surveyed 28 randomly selected teachers and found that
they work an average of 42.6 hours a week with a standard deviation of 3.75. Test if the mean number of hours worked
by teachers in the supervisor’s school district differs from the national average. Use α = 0.01.
Solution: Apply the different steps in testing hypothesis to solve the given problem.
Step 1: H0: μ = 40
H1: μ ≠ 40
Step 2: α = 0.01
Step 3: Two-tailed test is used because H1 is non-directional.
Step 4: df = n – 1
= 28 – 1
df = 27
Using the t-distribution table, df = 27, α = 0.01 for two-tailed test, the tabular value is 2.771.
Step 5: Compute the t-value.
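X̄ = 42.6
μ = 40
s = 3.75
n = 28
$T = \dfrac{\bar{X} - \mu}{s / \sqrt{n}}$

$T = \dfrac{42.6 - 40}{3.75 / \sqrt{28}}$

$T = \dfrac{2.6}{0.7087}$

T = 3.67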
Step 6: The computed value of t = 3.67 lies in the rejection region, therefore reject H0.
Step 7: Conclusion: The mean weekly working hours of the 28 sampled teachers differ significantly from the national
average.
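The same example can be sketched in code. Python's standard library has no t-distribution, so the critical value is simply copied from the two-tailed table above (df = 27, 0.01 level); the function name `t_statistic` is an assumption made for illustration.

```python
from math import sqrt


def t_statistic(sample_mean: float, claimed_mean: float,
                s: float, n: int) -> float:
    """T = (sample mean - claimed mean) / (s / sqrt(n))."""
    return (sample_mean - claimed_mean) / (s / sqrt(n))


# Values taken from Example 1 (teachers' weekly working hours).
n = 28
t = t_statistic(sample_mean=42.6, claimed_mean=40, s=3.75, n=n)
df = n - 1

# Tabular value read from the two-tailed table at df = 27 and the 0.01 level.
critical_value = 2.771
reject_h0 = abs(t) > critical_value  # two-tailed rejection rule

print(f"df = {df}, t = {t:.2f}, reject H0: {reject_h0}")
```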

Example 2: A sample of five measurements, randomly selected from an approximately normally distributed population,
resulted in the summary statistics: Sample mean is 4.6 and its standard deviation is 1.5. Test the null hypothesis that the
mean of the population is 6 against the alternative hypothesis μ < 6. Use α = 0.05.
Solution: Apply the different steps in testing hypothesis to solve the given problem.
Step 1: H0: μ = 6
H1: μ < 6
Step 2: α = 0.05
Step 3: One-tailed test is used because H1 is directional and it is left-tailed.
Step 4: df = n – 1
= 5 – 1
df = 4
Using the t-distribution table, df = 4, α = 0.05 for one-tailed test, the tabular value is -2.132.
Step 5: Compute the t-value.
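X̄ = 4.6
μ = 6
s = 1.5
n = 5
$T = \dfrac{\bar{X} - \mu}{s / \sqrt{n}}$

$T = \dfrac{4.6 - 6}{1.5 / \sqrt{5}}$

$T = \dfrac{-1.4}{0.6708}$

T = -2.087
Step 6: The computed value of t = -2.087 lies in the non-rejection region, therefore accept H0.
Step 7: Conclusion: There is no significant difference between the sample mean and the hypothesized population mean.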
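Because the computed t in this example falls close to the critical value, it is worth checking the arithmetic without intermediate rounding. The short sketch below does that for the left-tailed rule; the variable names are illustrative only, and the critical value is the tabular one for df = 4 at the 0.05 level.

```python
from math import sqrt

# Summary statistics from Example 2.
sample_mean, hypothesized_mean = 4.6, 6
s, n = 1.5, 5

standard_error = s / sqrt(n)
t = (sample_mean - hypothesized_mean) / standard_error

critical_value = -2.132         # one-tailed table, df = 4, 0.05 level
reject_h0 = t < critical_value  # left-tailed rejection rule

print(f"standard error = {standard_error:.4f}")
print(f"t = {t:.3f}, reject H0: {reject_h0}")
```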

Chi-Square Test
When the data are nominal or ordinal, the hypothesis tests used with this type of data are called non-parametric or
distribution-free tests. This implies that these tests are free of assumptions regarding the distribution of a population.
The chi-square goodness-of-fit test, developed by Karl Pearson, is one of the most commonly used non-parametric tests.
The purpose of the goodness-of-fit test is to determine how well an observed set of data fits an expected set of data. The
formula for the chi-square test is,
$\chi^2 = \sum \dfrac{(O - E)^2}{E}$
Where:
O is an observed frequency in a particular category.
E is an expected frequency in a particular category.
Example 1: There are three gates at the University of the East. The building maintenance supervisor would like to know
if the gates are equally utilized. As an experiment, 600 students are observed as they enter the school. The number of
students using each gate is reported below. At 0.01 significance level, can we conclude that there is a difference in the use
of three gates?

Gate Number of Students


Recto 245
Lepanto 205
Gastambide 150
Because there are 600 students in the sample, we expect that 200 students fall in each of the three categories. These
categories are called cells.
Gate O E
Recto 245 200
Lepanto 205 200
Gastambide 150 200
Solution: Apply the steps in hypothesis testing.
Step 1: H0: There is no difference between the set of observed frequencies and the set of expected frequencies.
H1: There is a difference between the set of observed frequencies and the set of expected frequencies.
Step 2: α = 0.01
Step 3: Critical Region
df = k - 1, where k is the number of categories
= 3 - 1
df = 2
Using the chi-square distribution table, the critical value is 9.210.
(See http://www3.med.unipmn.it/~magnani/pdf/Tavole_chi-quadrato.pdf for the complete table).
Step 4: The test statistic is the chi-square distribution.
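$\chi^2 = \sum \dfrac{(O - E)^2}{E}$
$= \dfrac{(245 - 200)^2}{200} + \dfrac{(205 - 200)^2}{200} + \dfrac{(150 - 200)^2}{200}$
$= \dfrac{2{,}025}{200} + \dfrac{25}{200} + \dfrac{2{,}500}{200}$
= 10.125 + 0.125 + 12.5
χ² = 22.75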
Step 5: Formulate the decision rule.
Since the computed value, 22.75, is greater than the critical value, 9.210, we reject H0.
Step 6: Conclusion: There is a significant difference between the set of observed frequencies and the set of expected
frequencies. The three gates are not equally utilized.
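The goodness-of-fit computation for the three gates can be sketched as follows. The dictionary and variable names are illustrative; the critical value 9.210 is the tabular chi-square value for df = 2 at the 0.01 level quoted in the example.

```python
# Observed counts of students per gate, from the example.
observed = {"Recto": 245, "Lepanto": 205, "Gastambide": 150}

total = sum(observed.values())
k = len(observed)

# Under H0 the gates are equally utilized, so each expected count is total / k.
expected = total / k

# Chi-square statistic: sum of (O - E)^2 / E over all categories.
chi_square = sum((o - expected) ** 2 / expected for o in observed.values())

df = k - 1
critical_value = 9.210  # chi-square table, df = 2, 0.01 level
reject_h0 = chi_square > critical_value

print(f"expected per gate = {expected:.0f}, df = {df}")
print(f"chi-square = {chi_square:.2f}, reject H0: {reject_h0}")
```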

Written Works 5

A. Problem Solving. Test the hypotheses using the appropriate statistical test.
1. Drinking water has become an important concern among people. The quality of drinking water must be
monitored as often as possible during the day for possible contamination. Another variable of concern is the pH
level, which measures the alkalinity or the acidity of the water. A pH below 7.0 is acidic while a pH above 7.0 is
alkaline. A pH 7.0 is neutral. A water-treatment plant has a target pH of 8.0. Based on 16 random water
samples, the mean and the standard deviation were 7.6 and 0.4, respectively. Does the sample mean provide
enough evidence that it differs significantly from the target mean? Use α = 0.05.
2. A spinning wheel is divided into 5 colored sectors of the same size. The wheel is spun 1,000 times, and the
number of occurrences for each of the 5 colors appears below. Test at α = 0.01 whether the colors occur equally often.

Color of Sector Number of Occurrences


White 207
Blue 164
Red 188
Green 213
Black 228

3. An agronomist believes that a newly developed fertilizer will increase the mean harvest of eggplants by more
than 2.5 kg. Twenty-six plants were treated with fertilizer and have a mean of 10.5 kg with standard deviation of
1.2. It is known that the population mean was 7.5 kg. Test the claim at 0.01 level of significance.


Learning Information 2.6


Learning Content 1: Correlation and Regression Analysis

Correlation Analysis is a statistical method used to determine whether a relationship between two variables exists. The
relationship or correlation between two variables may be described in terms of direction and strength.
The direction of correlation may be positive, negative or zero.
• A positive correlation exists when high values of one variable correspond to high values in the other variable or
low values in one variable correspond to low values in the other variable.
• A negative correlation exists when high values in one variable correspond to low values in the other variable or
low values in one variable correspond to high values in the other variable.
• A zero correlation exists when high values in one variable correspond to either high or low values in the other variable.
The strength of correlation may be perfect, very high, moderately high, very low or zero.

Linear Regression
Redman offers this example scenario: Suppose you’re a sales manager trying to predict next month’s numbers. You know
that dozens, perhaps even hundreds of factors from the weather to a competitor’s promotion to the rumor of a new and
improved model can impact the number. Perhaps people in your organization even have a theory about what will have the
biggest effect on sales. “Trust me. The more rain we have, the more we sell.” “Six weeks after the competitor’s promotion,
sales jump.”
Regression analysis is a way of mathematically sorting out which of those variables does indeed have an impact. It
answers the questions: Which factors matter most? Which can we ignore? How do those factors interact with each other?
And, perhaps most importantly, how certain are we about all of these factors?
In regression analysis, those factors are called variables. You have your dependent variable — the main factor that you’re
trying to understand or predict. In Redman’s example above, the dependent variable is monthly sales. And then you have
your independent variables — the factors you suspect have an impact on your dependent variable.
How does regression analysis work?
In order to conduct a regression analysis, you’ll need to define a dependent variable that you hypothesize is being
influenced by one or several independent variables.

You’ll then need to establish a comprehensive dataset to work with. Administering surveys to your audiences of interest is a
terrific way to establish this dataset. Your survey should include questions addressing all of the independent variables that
you are interested in.
Consider an example about satisfaction with an organization’s events. In this case, we’d want to measure the historical
levels of satisfaction with the events from the past three years or so (or however long you deem appropriate), as well as any
available information on the independent variables.
Perhaps we’re particularly curious about how the price of a ticket to the event has impacted levels of satisfaction.
To begin investigating whether or not there is a relationship between these two variables, we would begin by plotting these
data points on a chart, which would look like the following theoretical example.
(Plotting your data is the first step in figuring out if there is a relationship between your independent and dependent
variables.)
Our dependent variable (in this case, the level of event satisfaction) should be plotted on the y-axis, while our
independent variable (the price of the event ticket) should be plotted on the x-axis.
Once your data is plotted, you may begin to see correlations. If the theoretical chart above did indeed represent the impact
of ticket prices on event satisfaction, then we’d be able to confidently say that the higher the ticket price, the higher the
levels of event satisfaction.
But how can we tell the degree to which ticket price affects event satisfaction?
To begin answering this question, draw a line through the middle of all of the data points on the chart. This line is referred to
as your regression line, and it can be precisely calculated using a standard statistics program like Excel.
We’ll use a theoretical chart once more to depict what a regression line should look like.
(The regression line represents the relationship between your independent variable and your dependent variable.)
Excel will even provide a formula for the regression line, which adds further context to the relationship between your
independent and dependent variables.
The formula for a regression line might look something like Y = 100 + 7X + error term.
This tells you that if X = 0, then the predicted Y is 100. If X is our increase in ticket price, this informs us that if there is no
increase in ticket price, event satisfaction is still predicted to be 100 points, and each one-unit increase in ticket price raises
the predicted satisfaction by 7 points.
You’ll notice that the formula calculated by Excel includes an error term. Regression lines always consider an
error term because in reality, independent variables are never perfect predictors of dependent variables. This
makes sense when looking at the impact of ticket prices on event satisfaction: there are clearly other variables that are
contributing to event satisfaction outside of price.
Your regression line is simply an estimate based on the data available to you. So, the larger your error term, the less
certain your regression line is.
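A regression line of this kind can be obtained with the ordinary least-squares formulas for slope and intercept. The sketch below uses hypothetical data that are not from the module: five made-up points chosen so that the fitted line comes out as Y = 100 + 7X, with small residuals playing the role of the error term. The function name `fit_line` is also an assumption.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for the line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of cross-deviations divided by sum of squared x-deviations.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: increase in ticket price (x) and event satisfaction (y).
price_increase = [0, 1, 2, 3, 4]
satisfaction = [101, 105, 116, 119, 129]

slope, intercept = fit_line(price_increase, satisfaction)

# Residuals are the part of each observation the line does not explain.
residuals = [y - (intercept + slope * x)
             for x, y in zip(price_increase, satisfaction)]

print(f"Y = {intercept:.0f} + {slope:.0f}X")
print("residuals:", [round(e, 2) for e in residuals])
```

Larger residuals would mean the points sit further from the line, which is the sense in which a larger error term makes the line a less certain estimate.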

Pearson r Correlation
The Pearson product-moment correlation coefficient (or Pearson correlation coefficient, for short) is a measure of the
strength of a linear association between two variables and is denoted by r. Basically, a Pearson product-moment correlation
attempts to draw a line of best fit through the data of two variables, and the Pearson correlation coefficient, r, indicates how
far away all these data points are from this line of best fit (i.e., how well the data points fit this new model/line of best fit).
What values can the Pearson correlation coefficient take?
The Pearson correlation coefficient, r, can take a range of values from +1 to -1. A value of 0 indicates that there is no
association between the two variables. A value greater than 0 indicates a positive association; that is, as the value of one
variable increases, so does the value of the other variable. A value less than 0 indicates a negative association; that is, as
the value of one variable increases, the value of the other variable decreases.
How can we determine the strength of association based on the Pearson correlation coefficient?
The stronger the association of the two variables, the closer the Pearson correlation coefficient, r, will be to either +1 or -1
depending on whether the relationship is positive or negative, respectively. Achieving a value of +1 or -1 means that all your
data points are included on the line of best fit – there are no data points that show any variation away from this line. Values
for r between +1 and -1 (for example, r = 0.8 or -0.4) indicate that there is variation around the line of best fit. The closer the
value of r to 0 the greater the variation around the line of best fit. Different relationships and their correlation coefficients are
shown in the diagram below:
The Formula
$r = \dfrac{N\sum XY - \sum X \sum Y}{\sqrt{[N\sum X^2 - (\sum X)^2][N\sum Y^2 - (\sum Y)^2]}}$

Where:
X = the observed data for the independent variable
Y = the observed data for the dependent variable
N = sample size
r = degree of relationship between X and Y
Example: Suppose we want to find out if a relationship exists between the height and weight of 10 UE college students.
Student Height (x) Weight (Y)
1 170 72
2 172 70
3 158 60
4 165 73
5 180 85
6 195 98
7 183 78
8 175 76
9 182 82
10 190 90
For the heights and weights of 10 college students, the computation in r is tabulated as follows:
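Student    x      y     xy        x²        y²
1          170    72    12,240    28,900    5,184
2          172    70    12,040    29,584    4,900
3          158    60    9,480     24,964    3,600
4          165    73    12,045    27,225    5,329
5          180    85    15,300    32,400    7,225
6          195    98    19,110    38,025    9,604
7          183    78    14,274    33,489    6,084
8          175    76    13,300    30,625    5,776
9          182    82    14,924    33,124    6,724
10         190    90    17,100    36,100    8,100
∑x = 1,770    ∑y = 784    ∑xy = 139,813    ∑x² = 314,436    ∑y² = 62,526

$r = \dfrac{N\sum XY - \sum X \sum Y}{\sqrt{[N\sum X^2 - (\sum X)^2][N\sum Y^2 - (\sum Y)^2]}}$

$r = \dfrac{10(139{,}813) - (1{,}770)(784)}{\sqrt{[10(314{,}436) - (1{,}770)^2][10(62{,}526) - (784)^2]}}$

$r = \dfrac{1{,}398{,}130 - 1{,}387{,}680}{\sqrt{(3{,}144{,}360 - 3{,}132{,}900)(625{,}260 - 614{,}656)}}$

$r = \dfrac{10{,}450}{\sqrt{(11{,}460)(10{,}604)}}$

$r = \dfrac{10{,}450}{11{,}023.69}$
r = 0.95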
Referring to our result, we can conclude that the heights and weights of the 10 students in the sample have a very high
positive correlation.
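The column sums and the coefficient can be recomputed directly from the raw heights and weights, which is a useful check on hand arithmetic of this length. The sketch below applies the formula given above; the variable names are illustrative.

```python
from math import sqrt

# Heights (x) and weights (y) of the 10 students in the example.
heights = [170, 172, 158, 165, 180, 195, 183, 175, 182, 190]
weights = [72, 70, 60, 73, 85, 98, 78, 76, 82, 90]

n = len(heights)
sum_x = sum(heights)
sum_y = sum(weights)
sum_xy = sum(x * y for x, y in zip(heights, weights))
sum_x2 = sum(x * x for x in heights)
sum_y2 = sum(y * y for y in weights)

# Pearson r = [N*Sxy - Sx*Sy] / sqrt([N*Sx2 - Sx^2][N*Sy2 - Sy^2])
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(f"sum x = {sum_x}, sum y = {sum_y}, sum xy = {sum_xy}")
print(f"sum x^2 = {sum_x2}, sum y^2 = {sum_y2}")
print(f"r = {r:.4f}")
```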

Written Works 6

A. Problem Solving.
1. Below are the data for six participants giving their number of years in college (X) and their subsequent yearly
income (Y). Income here is in thousands of dollars, but this fact does not require any changes in our
computations. Test whether there is a relationship between the two variables.
Student Number of years in college (x) Income(Y)
1 0 15
2 1 15
3 3 20
4 4 25
5 4 30
6 6 35
