Q4 STATISTICS C.1 Learning Modules Quarter 4 Learning Information and Course Activity
NORTHLINK
TECHNOLOGICAL COLLEGE
LEARNING MODULE
MATH 2 – Statistics and Probability
The Table of Areas under the Normal Curve is also known as the z-table. The z-score is a measure of relative standing.
It is calculated by subtracting the mean ($\bar{x}$ or $\mu$) from the measurement X and then dividing the result by the standard deviation (s or $\sigma$). The final result, the z-score,
represents the distance between the given measurement X and the mean, expressed in standard deviations. The z-score
locates X within either a sample or a population.
To find the area that corresponds to a z-score, simply find the area between z = 0 and the given z-value using the z-table.
Four Steps in Finding the Area Under the Normal Curve Given a z-Value
Step 1: Express the given z-value in three-digit form (for example, z = 1.25).
Step 2: Using the z-table, find the first two digits in the left column.
Step 3: Match the third digit with the appropriate column to the right.
Step 4: Read the area (or probability) at the intersection of the row and the column.
This is the required area.
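The table lookup above can be cross-checked numerically. The following is a minimal sketch, assuming the z-table lists the area between z = 0 and the given z; the function name `area_zero_to_z` is illustrative and not part of the module.

```python
import math

def area_zero_to_z(z: float) -> float:
    """Area under the standard normal curve between 0 and z (the z-table entry)."""
    # The curve is symmetric about 0, so only the magnitude of z matters.
    return 0.5 * math.erf(abs(z) / math.sqrt(2))

# Row 1.2, column .05 of the table
print(round(area_zero_to_z(1.25), 4))
```

The result should agree with the printed table to four decimal places; small differences can appear when a table was rounded differently.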
Z-Scores
The areas under the normal curve are given in terms of z-values or z-scores. The z-score locates X within either a sample or
a population. The formula for calculating z is:
$Z = \frac{X - \mu}{\sigma}$ (z-score for population data)
$Z = \frac{X - \bar{x}}{s}$ (z-score for sample data)
where: X = given measurement
$\mu$ = population mean
$\sigma$ = population standard deviation
$\bar{x}$ = sample mean
s = sample standard deviation
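Both formulas have the same shape, so a single helper can cover them. This is a small sketch; the name `z_score` is illustrative, and the two calls use the values from the worked examples that follow.

```python
def z_score(x: float, mean: float, sd: float) -> float:
    """Distance of x from the mean, expressed in standard deviations."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd

print(z_score(58, 50, 4))  # population values: mu = 50, sigma = 4
print(z_score(20, 26, 6))  # sample values: x-bar = 26, s = 6
```

Pass the population mean and standard deviation for population data, or the sample mean and sample standard deviation for sample data.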
Example 1: Given the mean $\mu$ = 50 and the standard deviation $\sigma$ = 4 of a population of Reading scores, find the z-value
that corresponds to a score X = 58.
Steps and Solution
1. Use the computing formula for finding z-scores of population data.
   $Z = \frac{X - \mu}{\sigma}$
2. Check the given values. Since these are population values, the z-score locates X within a population.
   $\mu$ = 50, $\sigma$ = 4, and X = 58.
3. Substitute the given values in the computing formula.
   $Z = \frac{58 - 50}{4}$
4. Compute the z-value.
   $Z = \frac{8}{4} = 2$
   Thus, the z-value that corresponds to the raw score 58 is 2 in a population distribution.
Example 2: Given the sample mean $\bar{x}$ = 26 and the sample standard deviation s = 6, find the z-value that corresponds to
a score X = 20.
Steps and Solution
1. Use the computing formula for finding z-scores of sample data.
   $Z = \frac{X - \bar{x}}{s}$
2. Check the given values. Since these are sample values, the z-score locates X within a sample.
   $\bar{x}$ = 26, s = 6, and X = 20.
3. Substitute the given values in the computing formula.
   $Z = \frac{20 - 26}{6}$
4. Compute the z-value.
   $Z = \frac{-6}{6} = -1$
   Thus, the z-value that corresponds to the raw score 20 is -1 in a sample distribution.
Example 5. Determine the area under the standard normal curve to the left of z = 1.25.
Solution.
The value that corresponds to z = 1.25 is 0.3944.
A = 0.5 + 0.3944
A = 0.8944
Example 6. Determine the area under the standard normal curve to the left of z = -0.95.
Solution.
The value that corresponds to z = -0.95 is 0.3289.
A = 0.5 – 0.3289
A = 0.1711
Example 7. Find the area under the standard normal curve between z = 1.03 and z = -0.37.
Solution.
The value that corresponds to z = 1.03 is 0.3485 and the value that corresponds to z = -0.37 is 0.1443.
A = 0.3485 + 0.1443
A = 0.4928
Example 8. Find the area under the standard normal curve between z = 0.32 and z = 2.42.
Solution.
The value that corresponds to z = 0.32 is 0.1255 and the value that corresponds to z = 2.42 is 0.4922.
A = 0.4922 – 0.1255
A = 0.3667
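The four area problems above can be checked with Python's standard library. This sketch uses `statistics.NormalDist`, whose `cdf` method gives the area to the left of a value; the helper names `area_left_of` and `area_between` are illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def area_left_of(z: float) -> float:
    """Area under the standard normal curve to the left of z."""
    return STD_NORMAL.cdf(z)

def area_between(z1: float, z2: float) -> float:
    """Area under the standard normal curve between z1 and z2."""
    low, high = sorted((z1, z2))
    return STD_NORMAL.cdf(high) - STD_NORMAL.cdf(low)

print(round(area_left_of(1.25), 4))         # Example 5
print(round(area_left_of(-0.95), 4))        # Example 6
print(round(area_between(1.03, -0.37), 4))  # Example 7
print(round(area_between(0.32, 2.42), 4))   # Example 8
```

Working from the cumulative area removes the need to decide whether to add or subtract table entries: the area between two z-values is always the difference of the two left-tail areas.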
Written Works 1
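$_{N}C_{n} = \frac{N!}{n!(N-n)!}$, where N = 5 and n = 3
$_{5}C_{3} = \frac{5!}{3!(5-3)!} = \frac{5!}{3!\,2!} = \frac{120}{(6)(2)} = \frac{120}{12} = 10$
There are 10 samples of size 3 that can be drawn from the given population.
2. List all the possible samples and compute the mean of each sample.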
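Counting and listing the samples can be sketched as follows. The population values here are hypothetical placeholders, since the population itself is not shown on this page; only N = 5 and n = 3 come from the text.

```python
from itertools import combinations
from math import comb
from statistics import mean

population = [2, 4, 6, 8, 10]  # hypothetical values with N = 5
n = 3                          # sample size from the text

# Every sample of size n drawn without replacement, order ignored
samples = list(combinations(population, n))
print(comb(len(population), n), len(samples))

for sample in samples:
    print(sample, round(mean(sample), 2))
```

Tabulating how often each mean occurs gives the sampling distribution of the sample means.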
$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
a. What is the probability that a randomly selected college student will complete the examination in less than 43 minutes?
Steps and Solutions
1. Identify the given information.
   X = 43, $\mu$ = 46.2, $\sigma$ = 8
2. Identify what is asked for.
   P(X < 43)
3. Identify the formula to be used.
   Here, we are dealing with individual data obtained from a population. So, we will use the formula $Z = \frac{X - \mu}{\sigma}$ to standardize 43.
4. Solve the problem.
   $Z = \frac{X - \mu}{\sigma}$
   $Z = \frac{43 - 46.2}{8}$
   Z = -0.40
   We find P(X < 43) by getting the area under the normal curve.
   P(X < 43) = P(z < -0.40)
   = 0.5000 – 0.1554
   P(X < 43) = 0.3446
5. State the final answer.
   So, the probability that a randomly selected college student will complete the examination in less than 43
   minutes is 0.3446 or 34.46%.
b. If 50 randomly selected college students take the examination, what is the probability that the mean time it takes the
group to complete the test will be less than 43 minutes?
Steps and Solutions
1. Identify the given information.
   $\bar{x}$ = 43, $\mu$ = 46.2, $\sigma$ = 8, n = 50
2. Identify what is asked for.
   P($\bar{x}$ < 43)
3. Identify the formula to be used.
   Here, we are dealing with data about the sample means. So, we will use the formula $Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$ to standardize 43.
4. Solve the problem.
   $Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
   $Z = \frac{43 - 46.2}{8 / \sqrt{50}}$
   Z = -2.83
   We find P($\bar{x}$ < 43) by getting the area under the normal curve.
   P($\bar{x}$ < 43) = P(z < -2.83)
   = 0.5000 – 0.4977
   P($\bar{x}$ < 43) = 0.0023
5. State the final answer.
   So, the probability that the mean completion time of 50 randomly selected college students is
   less than 43 minutes is 0.0023 or 0.23%.
c. Does it seem reasonable that a college student would finish the examination in less than 43 minutes? Yes.
d. Does it seem reasonable that the mean of the 50 college students could be less than 43 minutes? No, it is very unlikely.
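The contrast between parts a and b comes from the standard error $\sigma / \sqrt{n}$, which shrinks as n grows. A minimal sketch using the values from the problem (variable names are illustrative):

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 46.2, 8  # population mean and standard deviation, in minutes
n = 50               # sample size for part b

# Part a: one student, so the spread is sigma itself
p_individual = NormalDist(mu, sigma).cdf(43)

# Part b: mean of n students, so the spread is the standard error sigma / sqrt(n)
p_sample_mean = NormalDist(mu, sigma / sqrt(n)).cdf(43)

print(round(p_individual, 4), round(p_sample_mean, 4))
```

The second probability may differ from the table-based answer in the last decimal place because the table rounds z to two decimals before the lookup.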
Written Work 2
A. Problem Solving.
1. A population consists of the numbers 1, 2, 3, 4, and 5. Consider samples of size 4 that can be drawn from this population.
a. List all the possible samples and the corresponding means.
b. Construct the sampling distribution of the sample means.
2. The average number of milligrams of cholesterol in a cup of a certain brand of ice cream is 660 mg, and the
standard deviation is 35 mg. Assume the variable is normally distributed.
a. If a cup of ice cream is selected, what is the probability that the cholesterol content will be more than 670
mg?
b. If a sample of 10 cups of ice cream is selected, what is the probability that the mean of the sample will be
larger than 670 mg?
Performance Task 1
Instructions.
The teacher will provide a dataset containing test scores of 40 students. Randomly select a sample of 15 students from this
dataset. Calculate the mean and standard deviation of their test scores. Create a histogram to visualize the distribution of
scores in your sample. Explain the concept of sampling distribution and its importance in statistics. Summarize your
findings, discussing how these sample statistics might reflect the overall test scores of all students in the dataset.
Sample 1 500 498 497 503 499 497 497 497 497 495
Sample 2 500 500 495 494 500 500 500 500 498 497
Sample 3 497 502 497 496 497 497 497 497 497 495
Sample 4 501 495 500 497 497 500 500 495 497 497
Assuming that the measurement was carefully obtained and that the only kind of error present is the sampling error, what is
the point estimate of the population mean?
Solution: When dealing with a large number of values, the mean of small samples may be obtained. These means
constitute a sampling distribution of means. To find the overall mean, simply find the sum of the mean values. Then divide
the sum by the total number of sample means.
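Mean of Row 1 = $\frac{500 + 498 + 497 + 503 + 499 + 497 + 497 + 497 + 497 + 495}{10}$
Mean of Row 1 = $\frac{4980}{10}$
Mean of Row 1 = 498, and so on.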
Say, we already obtained the sample mean of each row; we will now find the overall mean.
Sample Row Mean
1 498.0
2 498.4
3 497.2
4 497.9
Overall Mean = $\frac{498.0 + 498.4 + 497.2 + 497.9}{4}$
Overall Mean = $\frac{1991.5}{4}$
Overall Mean = 497.875. This will be our point estimate for the population mean.
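The same point estimate can be reproduced from the four samples in the table. A short sketch (variable names are illustrative):

```python
from statistics import mean

samples = [
    [500, 498, 497, 503, 499, 497, 497, 497, 497, 495],  # Sample 1
    [500, 500, 495, 494, 500, 500, 500, 500, 498, 497],  # Sample 2
    [497, 502, 497, 496, 497, 497, 497, 497, 497, 495],  # Sample 3
    [501, 495, 500, 497, 497, 500, 500, 495, 497, 497],  # Sample 4
]

row_means = [mean(row) for row in samples]
point_estimate = mean(row_means)  # mean of the sample means

print(row_means)
print(point_estimate)
```

Because every sample here has the same size, the mean of the row means equals the mean of all 40 measurements.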
Example: Find the estimate of the population mean using the 95% confidence level.
Given: Sample mean $\bar{x}$ = 72, n = 120, and $\sigma$ = 3.
Solution: Because the sample is large, the Central Limit Theorem guarantees that the sampling distribution of the mean is approximately normal.
a. Point Estimate
Steps and Solution
1. Describe the population parameter of interest.
   ● The parameter of interest is the mean $\mu$ of the population to which the sample purportedly belongs.
2. Specify the confidence interval criteria.
   a. Check the assumptions.
      ● The population standard deviation $\sigma$ is given.
      ● The sampling distribution is approximately normal, as guaranteed by the CLT.
   b. Determine the test statistic to be used.
      ● The test statistic is z with $\sigma$ = 3.
   c. State the level of confidence.
      ● The question asks for 95% confidence, or $\alpha$ = 0.05. This means that if more random samples
        were taken from the target population, and an interval estimate is made for each sample, then 95%
        of the intervals will contain the true parameter value.
3. Collect and present sample evidence.
   a. Collect the sample information.
      ● The sample information consists of the sample mean $\bar{x}$ = 72, n = 120, and $\sigma$ = 3.
   b. Find the point estimate.
      ● The point estimate for the population mean is 72 (the sample mean).
b. 95% Confidence Interval
4. Determine the confidence interval.
   a. Determine the confidence coefficient.
      ● The confidence coefficient is 1.96 for 95% (2.58 for 99% and 1.65 for 90%).
   b. Find the maximum error E.
      ● $E = z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)$
        $= 1.96\left(\frac{3}{\sqrt{120}}\right)$
        $= 1.96\left(\frac{3}{10.95}\right)$
        = 1.96 (0.27)
        E = 0.53
   c. Find the lower and upper confidence limits.
      ● $\bar{x} - z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right) < \mu < \bar{x} + z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)$
        72 – 0.53 to 72 + 0.53
        71.47 to 72.53
   d. Describe the results.
      ● We can say with 95% confidence that the interval between 71.47 and 72.53 contains the population
        mean, based on the sample size of 120.
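The interval computation can be sketched in a few lines. The function name `z_confidence_interval` is illustrative; the confidence coefficient 1.96 is the value quoted in the text.

```python
from math import sqrt

def z_confidence_interval(x_bar, sigma, n, z_crit=1.96):
    """Return (lower, upper, margin) for a mean when sigma is known."""
    margin = z_crit * sigma / sqrt(n)  # maximum error E
    return x_bar - margin, x_bar + margin, margin

lower, upper, margin = z_confidence_interval(72, 3, 120)
print(round(margin, 4), round(lower, 2), round(upper, 2))
```

The code carries full precision through the calculation, so its margin (about 0.537) is slightly larger than the 0.53 obtained by hand after rounding $3/\sqrt{120}$ to 0.27. Rounding only at the final step is generally the safer habit.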
we may approximate the range R of observations in the population and make a conservative estimate of $\sigma$. In any case,
round up the obtained value to ensure that the sample size will be sufficient to achieve the specified reliability.
Example 1: In a certain village, Leony wants to estimate the mean weight $\mu$, in kilograms, of all six-year-old children
to be included in a feeding program. She wants to be 99% confident that the estimate of $\mu$ is accurate to within 0.06 kg.
Suppose that, from a previous study, the standard deviation of the weights of the target population was 0.5 kg. What should
the sample size be?
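One way to work this example is with the standard sample-size formula for estimating a mean, $n = \left(\frac{z_{\alpha/2}\,\sigma}{E}\right)^2$. The formula is not shown on this page, so treat it as an assumption about the intended method. The sketch below uses 2.58 as the 99% confidence coefficient and rounds up, as the text advises.

```python
from math import ceil

def required_sample_size(z_crit: float, sigma: float, max_error: float) -> int:
    """Smallest n for which z_crit * sigma / sqrt(n) does not exceed max_error."""
    return ceil((z_crit * sigma / max_error) ** 2)

print(required_sample_size(2.58, 0.5, 0.06))
```

Rounding up rather than to the nearest integer guarantees that the error bound is met.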
Written Works 3
A. Problem Solving.
1. A researcher wants to estimate the number of hours that 5-year old children spend watching television. A
sample of 50 five-year old children was observed to have a mean viewing time of 3 hours. The population is
normally distributed with a population standard deviation of 0.5 hours. Find:
a. The best point estimate of the population mean.
b. The 95% confidence interval of the population mean.
2. A random selection of 40 entering Mathematics majors has the following GPAs. Assuming that the standard
deviation is 0.46, estimate the true mean GPA with 99% confidence.
4.0 3.5 3.0 3.3 3.8 3.1 3.6 4.0 3.9 3.5
3.2 3.0 3.5 3.2 3.0 3.2 4.0 3.0 3.4 3.0
3.0 2.8 5.6 3.0 3.2 3.5 3.2 2.8 3.3 3.1
3.2 2.9 3.0 2.8 4.0 3.7 3.0 3.3 3.2 2.8
Formulating Hypotheses
Example 1: The owner of a factory that sells a particular bottled fruit juice claims that the average capacity of a bottle of
their product is 250 ml. Is the claim true?
To test the claim, the members of a consumer group did the following:
1. Get a sample of 100 such bottles.
2. Calculate the capacity of each bottle.
3. Compare the sample mean and the claim.
The observed mean capacity of the 100 bottles is 243 ml. The sample standard deviation is 10 ml. In the example, the
owner’s statement (called the claim) is a general statement. The claim is that the capacity of all their bottled products is 250 ml
per bottle. So, the population mean is $\mu$ = 250 ml. On the other hand, the consumer group has a sample value, 243
ml, which is clearly the sample mean. Thus the hypotheses would be:
H0: The bottled drinks contain 250 ml per bottle. (This is the claim.)
H1: The bottled drinks do not contain 250 ml per bottle. (This is the opposite of the claim.)
But these statements should be written in symbols. For now, let us drop the unit of measure and simply write:
H0: $\mu = 250$
H1: $\mu \neq 250$
In mathematics, the symbol $\neq$ in the alternative hypothesis suggests either a greater-than relation or a less-than relation. What
is the interpretation of the $\neq$ symbol in the example? It means that the consumer group is not interested in whether the sample
mean is greater than 250 or less than 250, only in whether it differs from 250. However, this does not make sense in the given
exercise. The consumer group has a purpose, a direction. The consumer group may want to refute the claim. So, the
appropriate alternative hypothesis is:
H1: $\mu < 250$
● When the alternative hypothesis utilizes the $\neq$ symbol, the test is said to be non-directional. When the alternative
hypothesis utilizes the < or the > symbol, the test is said to be directional.
● In problems that involve hypothesis testing, there are words like greater, efficient, improves, effective,
increases, and so on that suggest a right-tailed direction on the formulation of the alternative hypothesis. Words
like decrease, less than, smaller, and the like suggest a left-tailed direction.
Example 2: A teacher wants to know if listening to popular music affects the performance of pupils. A class of 50 Grade
1 pupils was used in the experiment. The mean score was 83 and the standard deviation was 5. A previous study revealed
that $\mu$ = 82 and the standard deviation is 10.
1. State the null and alternative hypotheses in words and in symbols.
2. State whether the test is directional or non-directional.
Solution: The parameter of interest is the population mean $\mu$ = 82.
1. In words, the hypotheses are:
H0: The sample comes from a population whose mean is 82.
H1: The sample does not come from a population whose mean is 82.
In symbols, we write: H0: $\mu = 82$
H1: $\mu \neq 82$
2. There is no clue as to the direction of the investigation. The phrase “affects the performance” implies either an increase
or a decrease in performance. So, the test is non-directional.
Example 3: A man plans to go hunting the Philippine monkey-eating eagle, believing that it is proof of his mettle.
What type of error is this?
Solution: Hunting the Philippine eagle is prohibited by law. Thus, it is not a good sport. It is a Type II error, because the man
accepts as true a belief that is false. Since hunting the Philippine monkey-eating eagle is against the law, the man may find
himself in jail if he goes out of his way to hunt an endangered species.
Example 1: The treasurer of a certain university claims that the mean monthly salary of their college professors is 21,750.00
with a standard deviation of 6,000.00. A researcher takes a random sample of 75 college professors and finds a
mean monthly salary of 19,375.00. Do the 75 college professors have lower salaries than the rest? Test the claim at the $\alpha$ =
0.05 level of significance.
Solution: Apply the different steps in testing hypothesis to solve the given problem.
Step 1: H0: The mean monthly salary of the college professors is 21,750 ($\mu$ = 21,750).
H1: The mean monthly salary of the college professors is lower than 21,750 ($\mu$ < 21,750).
Step 2: $\alpha$ = 0.05
Step 3: A one-tailed test is used because H1 is directional.
Step 4: The tabular value or critical value of z at the 0.05 level of significance is -1.645.
Step 5: Compute the z-value.
$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
$Z = \frac{19375 - 21750}{6000 / \sqrt{75}}$
$Z = \frac{-2375}{692.82}$
Z = -3.43
Step 6: The computed value of z = -3.43 lies in the rejection region (-3.43 < -1.645); therefore, reject H0 and accept H1.
Step 7: Conclusion: The mean monthly salary of the college professors is lower than 21,750.
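Steps 5 and 6 of the salary example can be sketched as follows. The function name `one_sample_z` is illustrative, and the critical value is the tabular value quoted in Step 4.

```python
from math import sqrt

def one_sample_z(x_bar: float, mu0: float, sigma: float, n: int) -> float:
    """z statistic for a sample mean when the population sigma is known."""
    return (x_bar - mu0) / (sigma / sqrt(n))

z = one_sample_z(19_375, 21_750, 6_000, 75)
critical = -1.645  # left-tailed critical value at alpha = 0.05, from Step 4

# Left-tailed test: reject H0 when z falls below the critical value
decision = "reject H0" if z < critical else "do not reject H0"
print(round(z, 2), decision)
```

For a right-tailed test the comparison flips to `z > critical`, with a positive critical value.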
Example 2: The mean weight of the baggage carried into an airplane by individual passengers at Davao International
Airport is 19.8 kg. A statistician takes a random sample of 110 passengers and obtains a sample mean weight of 18.5 kg
with a standard deviation of 8.5 kg. Test the claim at the $\alpha$ = 0.01 level of significance.
Step 1: H0: $\mu = 19.8$
H1: $\mu \neq 19.8$
Step 2: $\alpha$ = 0.01
Step 3: A two-tailed test is used because H1 is non-directional.
Step 4: The tabular values or critical values of z at the 0.01 level of significance for a two-tailed test are -2.58 and 2.58.
Step 5: Compute the z-value.
$\bar{x}$ = 18.5
$\mu$ = 19.8
s = 8.5
n = 110
$Z = \frac{\bar{x} - \mu}{s / \sqrt{n}}$
$Z = \frac{18.5 - 19.8}{8.5 / \sqrt{110}}$
$Z = \frac{-1.3}{0.81}$
Z = -1.60
Step 6: The computed value of z = -1.60 lies in the non-rejection region; therefore, do not reject H0.
Step 7: Conclusion: There is no significant difference between the sample mean weight and the claimed mean weight of 19.8 kg.
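Critical values of z can also be computed instead of read from a table. This sketch uses `statistics.NormalDist().inv_cdf`; the helper name `z_critical` is illustrative. In a two-tailed test the significance level is split equally between the two tails.

```python
from statistics import NormalDist

def z_critical(alpha: float, two_tailed: bool) -> float:
    """Positive critical value of z for the given significance level."""
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)

print(round(z_critical(0.05, two_tailed=False), 3))  # one-tailed, alpha = 0.05
print(round(z_critical(0.01, two_tailed=False), 3))  # one-tailed, alpha = 0.01
print(round(z_critical(0.01, two_tailed=True), 3))   # two-tailed, alpha = 0.01
```

For a left-tailed test, use the negative of the returned value; for a two-tailed test, use both signs.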
Written Works 4
A. Problem Solving. Write the null and alternative hypothesis in symbols of each of the following:
1. A farmer believes that using organic fertilizers on his plants will yield greater income. His average income from
the past was 200,000.00 per year.
2. The net weight of a packet of a snack brand is 130 g. A sample of 80 packets yielded a sample mean weight of
112 g with a standard deviation of 15 grams.
3. In a graduate college, the average length of the registration time during a semester is 120 minutes with a
standard deviation of 25 minutes. With the introduction of a new registration procedure, the administration
thinks that it would take less than or equal to 120 minutes.
4. The average height of Grade 8 female students is 158.2 cm. The mean height of a sample of 100 female
students is 160 cm with a standard deviation of 6 cm.
Performance Task 2
Testing Hypothesis.
1. A researcher administered a developed problem-solving test to 50 randomly selected Grade 11 pupils. In this sample, the
mean is 80. The mean and the standard deviation of the population used in the standardization of the test were
75 and 15, respectively. Use the 95% confidence level to test if the sample mean differs significantly from the
population mean.
2. The owner of a factory that sells a particular bottled fruit juice claims that the average capacity of their product
is 250 ml. To test whether the mean capacity is less than the claimed value, a consumer group gets a sample of 100
such bottles, calculates the capacity of each bottle, and then finds the mean capacity to be 248 ml and the
standard deviation to be 5 ml. Is the claim true? (Use the 0.05 level of significance.)
$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$
t = 3.67
Step 6: The computed value of t = 3.67 lies in the rejection region; therefore, reject H0.
Step 7: Conclusion: There is a significant difference between the weekly working hours of the 28 teachers and the national
average.
Example 2: A sample of five measurements, randomly selected from an approximately normally distributed population,
resulted in the following summary statistics: the sample mean is 4.6 and the sample standard deviation is 1.5. Test the null hypothesis that the
mean of the population is 6 against the alternative hypothesis $\mu$ < 6. Use $\alpha$ = 0.05.
Solution: Apply the different steps in testing hypothesis to solve the given problem.
Step 1: H0: $\mu = 6$
H1: $\mu < 6$
Step 2: $\alpha$ = 0.05
Step 3: A one-tailed test is used because H1 is directional, and it is left-tailed.
Step 4: df = n – 1
= 5 – 1
df = 4
Using the t-distribution table with df = 4 and $\alpha$ = 0.05 for a one-tailed test, the tabular value is -2.132.
Step 5: Compute the t-value.
$\bar{x}$ = 4.6
$\mu$ = 6
s = 1.5
n = 5
$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$
$t = \frac{4.6 - 6}{1.5 / \sqrt{5}}$
$t = \frac{-1.4}{0.6708}$
t = -2.087
Step 6: The computed value of t = -2.087 lies in the non-rejection region (-2.087 > -2.132); therefore, do not reject H0.
Step 7: Conclusion: There is no significant difference between the sample mean and the hypothesized population mean of 6.
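The t-test steps can be sketched in the same way as the z-test. The function name `one_sample_t` is illustrative, and the critical value is hard-coded from the t-table value quoted in Step 4, because the Python standard library has no t-distribution.

```python
from math import sqrt

def one_sample_t(x_bar: float, mu0: float, s: float, n: int):
    """Return the t statistic and its degrees of freedom."""
    t = (x_bar - mu0) / (s / sqrt(n))
    return t, n - 1

t, df = one_sample_t(4.6, 6, 1.5, 5)
critical = -2.132  # left-tailed tabular value for df = 4, alpha = 0.05 (from Step 4)

# Left-tailed test: reject H0 only when t falls below the critical value
decision = "reject H0" if t < critical else "do not reject H0"
print(round(t, 3), df, decision)
```

The only differences from the z-test are the use of the sample standard deviation s and a critical value that depends on the degrees of freedom.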
Chi-Square Test
When the data are nominal or ordinal, the hypothesis tests used for this type of data are called non-parametric or distribution-free
tests. This implies that these tests are free of assumptions regarding the distribution of a population.
The chi-square goodness-of-fit test, developed by Karl Pearson, is one of the most commonly used non-parametric tests.
The purpose of the goodness-of-fit test is to determine how well an observed set of data fits an expected set of data. The
formula for the chi-square test is:
$\chi^2 = \sum \frac{(O - E)^2}{E}$
Where:
O is an observed frequency in a particular category.
E is an expected frequency in a particular category.
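The formula translates directly into code. In this sketch the observed counts are hypothetical, chosen only to illustrate the arithmetic for three equally likely categories; they are not data from the module.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [220, 190, 190]  # hypothetical counts for 600 observations
expected = [sum(observed) / len(observed)] * len(observed)  # equal use: 200 each

print(chi_square_statistic(observed, expected))  # 3.0
```

The computed statistic is then compared with the tabular chi-square value for k – 1 degrees of freedom, where k is the number of categories.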
Example 1: There are three gates at the University of the East. The building maintenance supervisor would like to know
if the gates are equally utilized. As an experiment, 600 students are observed as they enter the school. The number of
students using each gate is reported below. At 0.01 significance level, can we conclude that there is a difference in the use
of three gates?
Written Works 5
A. Problem Solving. Test the hypotheses using the appropriate statistical test.
1. Drinking water has become an important concern among people. The quality of drinking water must be
monitored as often as possible during the day for possible contamination. Another variable of concern is the pH
level, which measures the alkalinity or the acidity of the water. A pH below 7.0 is acidic while a pH above 7.0 is
alkaline. A pH 7.0 is neutral. A water-treatment plant has a target pH of 8.0. Based on 16 random water
samples, the mean and the standard deviation were 7.6 and 0.4, respectively. Does the sample mean provide
enough evidence that it differs significantly from the target mean? Use $\alpha$ = 0.05.
2. A spinning wheel is divided into 5 colored sectors of the same size. The wheel is spun 1,000 times, and the
number of occurrences for each of the 5 colors appears below. Test at $\alpha$ = 0.01.
3. An agronomist believes that a newly developed fertilizer will increase the mean harvest of eggplants by more
than 2.5 kg. Twenty-six plants were treated with fertilizer and have a mean of 10.5 kg with standard deviation of
1.2. It is known that the population mean was 7.5 kg. Test the claim at 0.01 level of significance.
Correlation Analysis is a statistical method used to determine whether a relationship between two variables exists. The
relationship or correlation between two variables may be described in terms of direction and strength.
The direction of correlation may be positive, negative or zero.
A positive correlation exists when high values of one variable correspond to high values of the other variable and
low values of one variable correspond to low values of the other variable.
A negative correlation exists when high values of one variable correspond to low values of the other variable and
low values of one variable correspond to high values of the other variable.
A zero correlation exists when high values of one variable correspond to either high or low values of the other variable.
The strength of correlation may be perfect, very high, moderately high, very low, or zero.
Linear Regression
Redman offers this example scenario: Suppose you’re a sales manager trying to predict next month’s numbers. You know
that dozens, perhaps even hundreds of factors from the weather to a competitor’s promotion to the rumor of a new and
improved model can impact the number. Perhaps people in your organization even have a theory about what will have the
biggest effect on sales. “Trust me. The more rain we have, the more we sell.” “Six weeks after the competitor’s promotion,
sales jump.”
Regression analysis is a way of mathematically sorting out which of those variables does indeed have an impact. It
answers the questions: Which factors matter most? Which can we ignore? How do those factors interact with each other?
And, perhaps most importantly, how certain are we about all of these factors?
In regression analysis, those factors are called variables. You have your dependent variable — the main factor that you’re
trying to understand or predict. In Redman’s example above, the dependent variable is monthly sales. And then you have
your independent variables — the factors you suspect have an impact on your dependent variable.
How does regression analysis work?
In order to conduct a regression analysis, you’ll need to define a dependent variable that you hypothesize is being
influenced by one or several independent variables.
You’ll then need to establish a comprehensive dataset to work with. Administering surveys to your audiences of interest is a
terrific way to establish this dataset. Your survey should include questions addressing all of the independent variables that
you are interested in.
Let’s use an event satisfaction example. In this case, we’d want to measure the historical levels of satisfaction
with the events from the past three years or so (or however long you deem statistically significant), as well as any
information possible in regards to the independent variables.
Perhaps we’re particularly curious about how the price of a ticket to the event has impacted levels of satisfaction.
To begin investigating whether or not there is a relationship between these two variables, we would begin by plotting these
data points on a chart, which would look like the following theoretical example.
(Plotting your data is the first step in figuring out if there is a
relationship between your independent and dependent
variables)
Our dependent variable (in this case, the level of event
satisfaction) should be plotted on the y-axis, while our
independent variable (the price of the event ticket) should be
plotted on the x-axis.
Once your data is plotted, you may begin to see correlations.
If the theoretical chart above did indeed represent the impact
of ticket prices on event satisfaction, then we’d be able to
confidently say that the higher the ticket price, the higher the
levels of event satisfaction.
But how can we tell the degree to which ticket price affects event satisfaction?
To begin answering this question, draw a line through the middle of all of the data points on the chart. This line is referred to
as your regression line, and it can be precisely calculated using a standard statistics program like Excel.
We’ll use a theoretical chart once more to depict what a
regression line should look like.
The regression line represents the relationship between your
independent variable and your dependent variable.
Excel will even provide a formula for the slope of the line,
which adds further context to the relationship between your
independent and dependent variables.
The formula for a regression line might look something like Y
= 100 + 7X + error term.
This tells you that if X = 0, then Y = 100. If X is our
increase in ticket price, this informs us that if there is no
increase in ticket price, event satisfaction will still be
100 points, and each one-unit increase in X raises Y by about 7 points.
You’ll notice that the slope formula calculated by Excel
includes an error term. Regression lines always consider an
error term because in reality, independent variables are never precisely perfect predictors of dependent variables. This
makes sense while looking at the impact of ticket prices on event satisfaction — there are clearly other variables that are
contributing to event satisfaction outside of price.
Your regression line is simply an estimate based on the data available to you. So, the larger your error term, the less
definitively certain your regression line is.
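The regression line a spreadsheet reports is typically the least-squares line, which can be computed by hand. The sketch below assumes ordinary least squares with one independent variable; the data are hypothetical and the function name `fit_line` is illustrative.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: increase in ticket price (X) and event satisfaction (Y)
price_increase = [0, 1, 2, 3, 4]
satisfaction = [101, 106, 115, 120, 128]

intercept, slope = fit_line(price_increase, satisfaction)
print(round(intercept, 1), round(slope, 1))

# Residuals are the part of Y the line does not explain (the error term)
residuals = [y - (intercept + slope * x) for x, y in zip(price_increase, satisfaction)]
print([round(r, 1) for r in residuals])
```

The intercept is the predicted Y when X = 0, the slope is the predicted change in Y per one-unit change in X, and the residuals show how far each observation falls from the line.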
Pearson r Correlation
The Pearson product-moment correlation coefficient (or Pearson correlation coefficient, for short) is a measure of the
strength of a linear association between two variables and is denoted by r. Basically, a Pearson product-moment correlation
attempts to draw a line of best fit through the data of two variables, and the Pearson correlation coefficient, r, indicates how
far away all these data points are from this line of best fit (i.e., how well the data points fit this new model/line of best fit).
What values can the Pearson correlation coefficient take?
The Pearson correlation coefficient, r, can take a range of values from +1 to -1. A value of 0 indicates that there is no
association between the two variables. A value greater than 0 indicates a positive association; that is, as the value of one
variable increases, so does the value of the other variable. A value less than 0 indicates a negative association; that is, as
the value of one variable increases, the value of the other variable decreases.
How can we determine the strength of association based on the Pearson correlation coefficient?
The stronger the association of the two variables, the closer the Pearson correlation coefficient, r, will be to either +1 or -1
depending on whether the relationship is positive or negative, respectively. Achieving a value of +1 or -1 means that all your
data points are included on the line of best fit – there are no data points that show any variation away from this line. Values
for r between +1 and -1 (for example, r = 0.8 or -0.4) indicate that there is variation around the line of best fit. The closer the
value of r to 0 the greater the variation around the line of best fit. Different relationships and their correlation coefficients are
shown in the diagram below:
The Formula
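$r = \frac{N\sum XY - \sum X \sum Y}{\sqrt{\left[N\sum X^2 - (\sum X)^2\right]\left[N\sum Y^2 - (\sum Y)^2\right]}}$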
Where:
X = the observed data for the independent variable
Y = the observed data for the dependent variable
N = sample size
r = degree of relationship between x and y
Example: Suppose we want to find out if a relationship exists between the height and weight of 10 UE college students.
Student Height (x) Weight (Y)
1 170 72
2 172 70
3 158 60
4 165 73
5 180 85
6 195 98
7 183 78
8 175 76
9 182 82
10 190 90
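For the heights and weights of 10 college students, the computation of r is tabulated as follows:
Student X Y XY X² Y²
1 170 72 12240 28900 5184
2 172 70 12040 29584 4900
3 158 60 9480 24964 3600
4 165 73 12045 27225 5329
5 180 85 15300 32400 7225
6 195 98 19110 38025 9604
7 183 78 14274 33489 6084
8 175 76 13300 30625 5776
9 182 82 14924 33124 6724
10 190 90 17100 36100 8100
Total 1770 784 139813 314436 62526
$r = \frac{N\sum XY - \sum X \sum Y}{\sqrt{\left[N\sum X^2 - (\sum X)^2\right]\left[N\sum Y^2 - (\sum Y)^2\right]}}$
$r = \frac{10(139813) - (1770)(784)}{\sqrt{\left[10(314436) - (1770)^2\right]\left[10(62526) - (784)^2\right]}}$
$r = \frac{1398130 - 1387680}{\sqrt{(3144360 - 3132900)(625260 - 614656)}}$
$r = \frac{10450}{\sqrt{(11460)(10604)}}$
$r = \frac{10450}{11023.69}$
r = 0.95
Referring to our result, we can conclude that the heights and weights of the 10 students in the sample have a very high
positive correlation.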
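The computing formula for r can be applied to the height and weight table directly. A minimal sketch (the function name `pearson_r` is illustrative):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient using the raw-score computing formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

heights = [170, 172, 158, 165, 180, 195, 183, 175, 182, 190]
weights = [72, 70, 60, 73, 85, 98, 78, 76, 82, 90]

r = pearson_r(heights, weights)
print(round(r, 2))
```

Swapping the two lists gives the same value, because the formula treats X and Y symmetrically.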
Written Works 6
A. Problem Solving.
1. Below are the data for six participants giving their number of years in college (X) and their subsequent yearly
income (Y). Income here is in thousands of dollars, but this fact does not require any changes in our
computations. Test whether there is a relationship between the two variables.
Student Number of years in college (X) Income (Y)
1 0 15
2 1 15
3 3 20
4 4 25
5 4 30
6 6 35