
Study Schedule Topic Learning Outcomes Activities Week 4

This document provides an overview of a study schedule for a module on data management in statistics. It includes 3 learning activities to be completed on September 11, 2020 focused on basic statistical terms. Students are expected to define terms like data, variables, populations, samples, parameters, and scales of measurement. They will complete exercises to explore and engage with the topics as well as clarify their understanding. On completion, students should file their work in a red notebook.

Uploaded by

Sunny Egghead

Week 4: Data Management
Module 4.1: Basic Terms Related to Statistics

Study Schedule: September 11, 2020
Topic Learning Outcomes:
1. Define statistical terms: Data, Variables, Independent & Dependent variables, Population
& Sample, Parameter and Statistic, Scale of Measurements, Sampling Techniques
(Probability and non-probability).
Activities:
Explore: Discover This!
Engage: Let’s Try This!
Explain: Clarify Your Lesson!
Elaborate: Challenge Yourself!
Evaluate: Gauge Your Learning!

Study Schedule: September 11, 2020
Topic Learning Outcomes: Completion of Let’s Try This and Gauge Your Learning Activities
Activities: File your activity in your red long clear book.

MATHEMATICS AS A TOOL

Overview

You may be familiar with statistics through radio, television, newspapers, and
magazines. For example, you may have read statements like the following found in
newspapers. According to the PISA 2018 profile of the Philippines, socio-economic status
accounts for 18% of the variance in reading performance in the country, compared to the
OECD (Organization for Economic Cooperation and Development) average of 12%.

Statistics is used in almost all fields of human endeavor. In sports, for example, a
statistician may keep records of the number of yards a running back gains during a football
game, or the number of hits a baseball player gets in a season. In other areas, such as public
health, an administrator might be concerned with the number of residents who contract a
new strain of flu virus during a certain year. In education, a researcher might want to know if
new methods of teaching are better than old ones. These are only a few examples of how
statistics can be used in various occupations. Furthermore, statistics is used to analyze the
results of surveys and as a tool in scientific research to make decisions based on controlled
experiments. Other uses of statistics include operations research, quality control, estimation,
and prediction.

Module 4: DATA MANAGEMENT

Introduction

Statistics is a branch of applied mathematics that deals with gathering, organizing,
presenting, analyzing and interpreting collected data. There are two branches of statistics:
descriptive statistics and inferential statistics. Performing all of these processes requires the
application of statistical tools and techniques. Statistical tools derived from mathematics are
useful in processing and managing numerical data in order to describe a phenomenon and
predict values.

This chapter covers four topics on statistical tools. Lesson 1 discusses the basic terms
in statistics. Lesson 2 tackles measures of central tendency and measures of dispersion.
Lesson 3 focuses on hypothesis testing, and Lesson 4 discusses correlation and regression
analyses.

Module 4.1: BASIC TERMS RELATED TO STATISTICS

Learning Objective: At the end of the lesson, the students are expected to:

1. Define statistical terms: Data, Variables, Independent & Dependent variables,
Population & Sample, Parameter and Statistic, Scale of Measurements,
Sampling Techniques (Probability and non-probability).

Let’s Try This!


1. List five (5) terms related to statistics that you encountered in high school and give the
meaning of each term.

Discover This!
Statistics is a branch of mathematics that deals with the systematic method of
collecting, classifying, presenting, analyzing and interpreting quantitative or numerical data.
Variable is a characteristic of interest measurable on each and every individual in the
universe. It refers to a property that can take on different values or categories which cannot
be predicted with certainty.
A variable may also be called a data item. Age, sex, business income and expenses,
country of birth, capital expenditure, class grades, eye color and vehicle type are examples of
variables.

Types of Variable:
a) Dependent variable is what you measure in the experiment and what is affected during
the experiment. The dependent variable responds to the independent variable; it is the
“assumed effect” of another variable. It is an outcome of interest (e.g. a characteristic of
behavior) that is observed and measured in order to assess the effects of the
independent variable. It shows the effect of manipulating or introducing the independent
variable. For example, if the independent variable is the use or non-use of a new language
teaching procedure, then the dependent variable might be students’ scores on a test of
the content taught using that procedure. In other words, the variation in the dependent
variable depends on the variation in the independent variable.

b) Independent variable is the variable you have control over, what you can choose and
manipulate. It is usually what you think will affect the dependent variable. It is the
“assumed cause” of a problem, the assumed reason for a change. It is a variable that is
examined in order to determine its effects on an outcome of interest (the dependent
variable). Examples: type of incentive, instructional materials, pharmaceutical compound.
Independent variables are those that the researcher has control over. This “control”
may involve manipulating existing variables (e.g. modifying existing methods of
instruction) or introducing variables (e.g. adopting a totally new method for some sections
of a class) in the research setting. In some cases, you may not be able to manipulate the
independent variable. Whatever the case may be, the researcher expects that the
independent variable(s) will have some effect on (or relationship with) the dependent
variables.

Population (N) is a collection, or set, of individuals, objects, or measurements whose
properties are to be analyzed. It is the totality of the observations.

Examples are all students of CHMSC-Talisay Campus enrolled in the first semester-
academic year 2020-2021, all members of a particular club, all mobile phones sold by a certain
cell shop in one month, all babies born in a particular year, etc.

Sample (n) is a subset of a population. It is a smaller group that represents the population
from which it was taken and shares its characteristics. A sample is taken because studying the
complete population may be too costly, time-consuming and full of unpredictable
inaccuracies. The sample is selected in accordance with the research design. To illustrate, if
we wished to estimate the number of viewers of a new situation comedy, we might select a
subset of television viewers during the hours the sitcom is airing.

Parameter is a numerical measurement describing some characteristic of a population.

Statistic is a numerical measurement describing some characteristic of a sample.
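
To make the distinction concrete, here is a minimal Python sketch that treats a small list of ages as the entire population and then draws a sample from it. The ages, the sample size and the random seed are made-up assumptions used only for illustration; they do not come from the module.

```python
import random
import statistics

# Hypothetical population: ages of all 12 members of a small club (made-up data).
population = [18, 19, 19, 20, 20, 21, 21, 22, 23, 24, 25, 26]

# Parameter: a numerical measurement computed from the entire population.
population_mean = statistics.mean(population)

# Statistic: the same measurement computed from a sample (a subset of the population).
rng = random.Random(2020)            # fixed seed, assumed only so the sketch is repeatable
sample = rng.sample(population, 5)   # n = 5, drawn without replacement
sample_mean = statistics.mean(sample)

print("Parameter (population mean):", population_mean)
print("Statistic (sample mean):", sample_mean)
```

The population mean stays fixed, while the sample mean will generally change from one sample to another.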

The data (Asaad, 2004) are the quantities (numbers) or qualities (attributes) measured
or observed that are to be collected and/or analyzed. The two categories of data are
categorical and continuous data.

Categorical data are measured on nominal and ordinal scales, while continuous data are
measured on ratio and interval scales.

Scale of Measurement
Variables can be classified according to how they are categorized, counted or
measured.

1) Nominal Scale
This is characterized by data that consist of names, labels, or categories only. The
data cannot be arranged in an ordering scheme: there is no intrinsic ordering to the categories
and no criterion by which values can be identified as greater than or less than other values.
It is used for labeling variables. Observations of unordered variables constitute a very low
level of measurement. When numbers are used, they act only as names; they have no
quantitative properties and serve only to identify the class.
Examples are gender, mode of transportation, nationality, occupation, civil status,
course, specialization, etc.

2) Ordinal Scale
This involves data that may be arranged in some order, but differences between data
values either cannot be determined or are meaningless. An ordinal scale produces a distinct
ordering or arrangement of data in which the observations may be ranked based on some
criteria such as good, better and best. The values represent position in a series. It is a scale in
which the classes stand in a relationship to one another that is expressed in terms of the
algebra of inequalities (less than or greater than).
Examples are pain level, social status, attitude towards a subject, satisfaction level,
etc.

3) Interval Scale
This is the same as the ordinal level, with the additional property that the differences
between data values are meaningful. Variables on an interval scale are measured numerically
and, like ordinal data, carry an inherent ranking or ordering. If we have data with ordinal
properties (>, <, =) and can also measure the distance between two data items, we have an
interval measurement. It is a quantitative scale that requires a constant unit of measurement
and permits the use of arithmetic operations. Data at this level may lack an inherent zero
starting point: the zero point in this scale is arbitrary and does not represent the complete
absence of the attribute being measured.
Examples are temperature (in degrees Celsius), test result, IQ, general weighted
average (GWA), etc.

4) Ratio Scale
This is the interval level modified to include an inherent zero starting point, so both the
differences and the ratios of data values are meaningful. Of the four scales of measurement,
only the ratio scale is based on a number system in which zero is meaningful: data measured
on a ratio scale have a fixed, non-arbitrary zero point. Arithmetic operations such as
multiplication, division, addition and subtraction therefore have a meaningful interpretation.
It is the highest level of measurement. The ratio scale is used to measure several types of data
found in business, such as cost, profit and inventory.
Examples are income, profit, dimensions, weight, height, age (in years), number of
years in education, etc.

Sampling Techniques
Various sampling techniques or sample designs can be used by the researcher. The
choice of technique will depend on the nature of the problem at hand and the kind of
population to which the sample results will be applied.
The techniques can be grouped according to how the items are selected: probability
sampling and non-probability sampling.

1) Probability Sampling.
In probability sampling, the sample is a proportion of the population, selected in a
systematic way in which every element of the population has a chance of being included in
the sample. The chances need not be equal: as long as every element has a nonzero chance
of being selected for the study, the procedure is considered probability sampling.

a) Random Sampling
This type of sampling is one in which everyone in the population of the study has an
equal chance of being selected to be included in the sample. Examples are the lottery method
and the use of a table of random numbers or computer-generated random numbers.
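
As a rough illustration of computer-generated random selection, the sketch below draws a simple random sample with Python's standard `random` module. The frame of 50 ID numbers, the sample size of 10 and the seed are assumptions chosen for the example only.

```python
import random

# Hypothetical sampling frame: ID numbers 1..50 for a population of N = 50.
population_ids = list(range(1, 51))

rng = random.Random(11)  # fixed seed, assumed only so the sketch is repeatable

# Lottery-style draw: n = 10 IDs without replacement; every ID has the same chance.
simple_random_sample = rng.sample(population_ids, 10)

print(sorted(simple_random_sample))
```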

b) Systematic Sampling
This is a technique of sampling in which every nth name in the list is selected to
be included in the sample. For example, a mall owner wants to conduct a study on the
satisfaction of customers with the service provided by the mall, so he asks his
employees to conduct the study. Since it may be impossible to use random sampling, the
owner decides to use systematic sampling, where every 5th customer who enters the mall
is surveyed and asked about their level of satisfaction.
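
The every-5th-customer rule can be sketched in a few lines of Python. The list of 30 customers and the random starting point are assumptions for illustration; the helper name `systematic_sample` is hypothetical.

```python
import random

def systematic_sample(frame, k, start):
    """Return every k-th element of frame, beginning at 0-based index start."""
    return frame[start::k]

# Hypothetical list of 30 customers in order of arrival.
customers = [f"customer_{i}" for i in range(1, 31)]

k = 5                                  # sampling interval: every 5th customer
start = random.Random(5).randrange(k)  # random starting point within the first interval
surveyed = systematic_sample(customers, k, start)

print(surveyed)
```

Choosing the starting point at random within the first interval is a common convention; starting at index 4 would pick exactly the 5th, 10th, 15th customer and so on.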

c) Stratified Random Sampling


It is a more efficient sampling procedure wherein the population is grouped into
more or less homogeneous classes, or strata, in order to avoid the possibility of drawing a
sample whose members all come from one stratum. For example, if roughly half of your
population is male and the other half is female (e.g. of 5000 students, 2100 are male and 2900
are female), you can apply stratified random sampling and group the population
according to sex. Sex is then your stratification variable. This is practical only if your
population is almost equally divided. As another example, if 100 students belong to
high-income families, 4000 to average-income families and 900 to low-income families,
economic status is not applicable (or is impractical) to use as your strata.
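
A minimal sketch of the male/female example follows. It assumes a total sample size of 500 and proportional allocation (each stratum contributes in proportion to its size); both are illustrative assumptions, since the module does not specify a sample size or an allocation rule.

```python
import random

def proportional_allocation(strata_sizes, n):
    """Split a total sample size n across strata in proportion to stratum size.

    Rounding is used, so for other inputs the parts may not add up to n exactly.
    """
    total = sum(strata_sizes.values())
    return {name: round(n * size / total) for name, size in strata_sizes.items()}

# Strata from the example: 2100 male and 2900 female students (N = 5000).
strata_sizes = {"male": 2100, "female": 2900}

# Assumed total sample size of 500.
allocation = proportional_allocation(strata_sizes, 500)

# Draw a simple random sample of the allocated size inside each stratum.
rng = random.Random(4)
frames = {name: [f"{name}_{i}" for i in range(size)] for name, size in strata_sizes.items()}
stratified_sample = {name: rng.sample(frames[name], allocation[name]) for name in frames}

print(allocation)  # {'male': 210, 'female': 290}
```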

d) Cluster Sampling
It is sometimes called area sampling because it is applied on a geographical basis.
Cluster sampling gives more precise results when each cluster contains a varied mixture
and when one cluster is nearly like the others. This kind of sampling is used if
you have a huge population. For example, suppose all teachers in Negros Occidental are your
population. You can group them into clusters: cluster 1 for the Division of Bacolod, cluster 2
for the Division of Silay, and so on.
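
One way to picture cluster sampling is the sketch below: whole clusters are chosen at random, then everyone inside the chosen clusters is taken. The teacher labels, the two extra divisions ("Division C", "Division D") and the choice of 2 clusters are hypothetical.

```python
import random

# Hypothetical clusters: school divisions, each listing its teachers (made-up labels).
clusters = {
    "Division of Bacolod": ["B1", "B2", "B3", "B4"],
    "Division of Silay": ["S1", "S2", "S3"],
    "Division C": ["C1", "C2", "C3", "C4", "C5"],
    "Division D": ["D1", "D2"],
}

rng = random.Random(7)  # fixed seed, assumed only so the sketch is repeatable

# Step 1: randomly choose whole clusters (here 2 of the 4 divisions).
chosen_divisions = rng.sample(sorted(clusters), 2)

# Step 2: include every teacher who belongs to a chosen cluster.
cluster_sample = [teacher for d in chosen_divisions for teacher in clusters[d]]

print(chosen_divisions, cluster_sample)
```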

2) Non-Probability Sampling
In a non-probability sampling, the sample is not a proportion of the population and
there is no system in selecting the sample. The selection depends on the situation.

a) Purposive Sampling
It is based on certain criteria laid down by the researcher. People who satisfy the
criteria are interviewed. Purposive sampling determines the target population of those
who will be taken for the study. The respondents are chosen on the basis of their knowledge
of the information desired. For example, suppose you want to determine the performance of
honor students in their major subjects. In this case, the only respondents of your study are
students with honors, which makes this a purposive kind of sampling.

b) Convenience Sampling
It is a process of picking out people in the most convenient and fastest way to get
reactions immediately. For example, you want to conduct a survey about political views and
opinions in a particular barangay. Since it is impractical to use probability sampling,
convenience sampling is practically valid in this case. Again, it depends upon your kind of
respondents, your research design and statement of the problem.

c) Quota Sampling
In this type of sampling, a specified number of persons of certain types is included in the
sample. In quota sampling many sectors of the population are represented. However, the
representation is doubtful because there are no guidelines in the selection of the respondents.
Unlike in cluster sampling, in quota sampling some elements in the population have no chance
of being selected for the study.

Clarify Your Lesson! (Let’s Try This #2)

A. Give one example for each scale of measurement and discuss in not less than 3
sentences why the example is suited to it. (5 points each)

B. Discuss the difference between probability and non-probability sampling, and
clarify when to use each sampling technique. Give an example. (5 points)

5 points Rubrics
5 The student clearly understands the concept. Some minor mistakes and
careless errors may appear, insofar as they do not indicate a conceptual
misunderstanding.
4 The student understands the main concepts, but has some minor yet non-trivial gaps
in their reasoning.
3 The student has partially understood the concepts. The student may have started
out correctly, but gone off on a tangent from the concepts.
2 The student has a poor understanding of the concepts.
1 The student did not understand the concepts.
0 The student wrote nothing or almost nothing.

Challenge Yourself! (Let’s Try This #3)

1. Indicate whether each of the following describes a statistic or a parameter.
a. The average income of 100 Filipinos selected at random from various telephone books
_________________
b. The average income of call centers in the province of Negros Occidental
__________________
c. The highest age among the respondents of a survey in a popular magazine
__________________

2. Give two examples of each of the following:


a. Nominal scale __________________, __________________
b. Ordinal scale __________________, __________________
c. Interval scale __________________, __________________
d. Ratio scale __________________, __________________

3. What type of scale is being used for each of the following measurements?
a. Number of arithmetic problems correctly solved _______________
b. Class standing (i.e., one’s rank in the graduating class) _______________
c. Type of phobia _______________

d. Body temperature (in °F) _______________
e. Self-esteem, as measured by self-report questionnaire _______________
f. Annual income in dollars _______________
g. Theoretical orientation toward psychotherapy _______________
h. Place in a dog show _______________
i. Heart rate in beats per minute _______________

4. A psychologist records how many words participants recall from a list under three
different conditions: large reward for each word recalled, small reward for each word
recalled, and no reward.

a. What is the independent variable? _______________


b. What is the dependent variable? _______________
c. What kind of scale is being used to measure the dependent variable? ______________

Gauge Your Learning!

A. Tell whether each of the following statements describes a population or a sample.

1. The College Registrar wants to know the reasons freshman students transfer or shift
to another course, so she interviewed 150 freshman students who shifted from
another course. _______________
2. When Ricky bought a sack of rice, he examined a handful from the sack to check if it
is the variety he wants. _______________
3. The school guidance counselor would like to know the course preference of the
graduating students in their school so she interviewed all graduating students.
_______________
4. The chef wants to check if the food being cooked tastes as he wants it to be so he
tasted a spoonful of it. _______________
5. A doctor wants to know what causes the infection in a patient so he requested for the
patient’s blood examination. The medical technologist extracted only 10 cubic
centimeters of blood from the patient for examination. _______________

B. Encircle your answer from the words enclosed in the parenthesis.


6. A researcher is interested in the texting habits of college students in CHMSC. If the
researcher measures the number of text messages that each individual sends each day
and calculates the average number for the entire group of college students, the
average number would be an example of a (parameter, statistic).
7. A researcher is interested in how watching a reality television show featuring fashion
models influences the eating behavior of 13-year-old girls. A group of 30, 13-year-old
girls is selected to participate in a research study. The group of 30, 13-year-old girls is
an example of (population, sample).

8. In the same study, the amount of food eaten in one day is measured for each girl and
the researcher computes the average score for the 30, 13-year-old girls. The average
score is an example of a (parameter, statistic).
9. Out of 500 students of BS Psych, the desirable members of the club are only 10 per
section. 500 students is a (parameter, statistic).
10. 60% of products sold are food-related products and only 2% of these food-related
products are healthy. 2% is a (parameter, statistic).

C. Classify each as nominal, ordinal, interval or ratio scale of measurement.


11. Marital status of College of Arts and Sciences faculty _______________
12. Weight (in kgs.) of students _______________
13. Test scores _______________
14. Ages of students in a classroom _______________
15. Race results _______________
16. Time of day on a 12-hour clock _______________
17. Number of siblings _______________
18. General weighted average _______________
19. An English professor uses letter grades (A, B, C, D, and F) to evaluate a set of student
essays. What kind of scale is being used to measure the quality of essays?
_______________
20. The teacher in a communications class asks students to identify their favorite reality
television show. The different television shows make up what scale of measurement?
_______________

D. Determine which kind of sampling was used in each of the following scenarios.
(Random, Stratified Random, Systematic, Cluster, Purposive, Convenience, Quota)
21. Chosen at random, 300 students who received a scholarship from CHMSC-Talisay
participated in a study. _______________
22. A survey to find out if teachers in CHMSC-Talisay are in favor of Outcome-Based
Education (OBE) will be conducted. To ensure that all faculty in each department are
represented, teachers will be divided into COEd, CAS, and CIT Departments.
_______________
23. You would like to know the level of satisfaction of students in terms of school canteen
service. You decided to have interview of students who are eating at the school
canteen every lunch time. _______________
24. To get the most popular online game, each field student-researcher is given a quota
of 50 students per course. _______________
25. In a study wherein a researcher wants to know what it takes to graduate with honors
in CHMSC-Talisay, the only people who can give the researcher firsthand advice are
the individuals who graduated with honors. _______________

E. Identify the independent and dependent variables in the following descriptions of
experiments. Underline the phrase or group of words that tells about independent
variable (dependent variable) and write IV (or DV) below it.

26. The more time people spend using social media, the less able they are to express
themselves in conversation.
27. Taking a nap in the afternoon makes people more relaxed and less irritable for the rest
of the day.
28. The relationship between the amount of violence that children see on television and
the amount of aggressive behavior they display.
29. Does attentiveness in class influence teacher effectiveness?
30. What are the effects of psychological variables on teacher’s productivity?

Week 4: Data Management
Module 4.2: Measures of Central Tendency and Measures of Dispersion

Study Schedule: September 14, 2020
Topic Learning Outcomes:
1. Define statistical terms: Data, Variables, Independent & Dependent variables, Population
& Sample, Parameter and Statistic, Scale of Measurements, Sampling Techniques
(Probability and non-probability).
Activities:
Explore: Discover This!
Engage: Let’s Try This!
Explain: Clarify Your Lesson!
Elaborate: Challenge Yourself!
Evaluate: Gauge Your Learning!

Study Schedule: September 14, 2020
Topic Learning Outcomes: Completion of Let’s Try This and Gauge Your Learning Activities
Activities: File your activity in your red long clear book.

Module 4.2: MEASURES OF CENTRAL TENDENCY AND MEASURES OF DISPERSION

Learning Objectives: At the end of the lesson, the students are expected to:

1. Find the measures of central tendency and dispersion of the given data.

Let’s Try This!


1. Determine the following:
1. The exercises below are based on the following values:
$X_1 = 2$, $X_2 = 4$, $X_3 = 6$, $X_4 = 8$, $X_5 = 10$. Find the value of each of the following expressions:
a. $\sum_{i=2}^{5} X_i$
b. $\sum 5X_i$
c. $\sum X_i^2$
2. Make up your own set of at least five numbers and demonstrate that $\sum X_i^2 \neq \left(\sum X_i\right)^2$.
3. Round off the following numbers to two decimal places (assume digits to the right of those
shown are zero):
a. 144.0135 _______________
b. 67.245 _______________
c. 99.707 _______________
d. 13.345 _______________
e. 7.3451 _______________
f. 5.9817 _______________
g. 5.9977 _______________
4. Round off the following numbers to four decimal places (assume digits to the right of
those shown are zero):
a. .76995 _______________
b. 3.141627 _______________
c. 2.7182818 _______________
d. 6.89996 _______________
e. 1.000819 _______________
f. 22.55555 _______________
5. Round off the following numbers to one decimal place (assume digits to the right of those
shown are zero):
a. 55.555 _______________
b. 267.1919 _______________
c. 98.951 _______________
d. 99.95 _______________
e. 1.444 _______________
f. 22.14999 _______________

Discover This!
A measure of central tendency is any single value that is used to identify the “center”
of the data or the typical value. It is called measure of central tendency because when the
data points are arranged according to magnitude, it tends to lie centrally within the set.
1. Mean or Arithmetic Mean ($\bar{X}$)
Mean is the sum of all the values of the observations divided by the number of
observations: $\bar{X} = \frac{\sum X}{n}$, where $n$ is the number of observations in the sample.
Example 1: What is the mean age (in years) of a group of children whose ages are 9, 11, 7, 10, 9,
8, 8, 7, 12, 7 and 13?
Solution: $\bar{X} = \frac{9 + 11 + 7 + 10 + 9 + 8 + 8 + 7 + 12 + 7 + 13}{11} = \frac{101}{11} \approx 9.18$ years
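
The computation in Example 1 can be checked with a short Python sketch. It uses only the ages given in the example; the variable names are illustrative.

```python
import statistics

ages = [9, 11, 7, 10, 9, 8, 8, 7, 12, 7, 13]  # ages from Example 1

mean_by_formula = sum(ages) / len(ages)  # sum of the values divided by n
mean_by_library = statistics.mean(ages)  # same quantity from the standard library

print(sum(ages), len(ages))        # 101 11
print(round(mean_by_formula, 2))   # 9.18
```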
2. Median ($\tilde{X}$)
Median is the positional middle of an array. In an array, one-half of the values precede
the median and one-half follow it. The first step in calculating the median, denoted by $\tilde{X}$, is
to arrange the data in an array. Let $X_{(i)}$ be the $i$th observation in the array, $i = 1, 2, \ldots, N$.
If $N$ is odd, the median position equals $\frac{N+1}{2}$, and the value of the $\left(\frac{N+1}{2}\right)$th observation in
the array is taken as the median, i.e. $\tilde{X} = X_{\left(\frac{N+1}{2}\right)}$.
If $N$ is even, the mean of the two middle values in the array is the median, i.e.
$\tilde{X} = \frac{X_{\left(\frac{N}{2}\right)} + X_{\left(\frac{N}{2}+1\right)}}{2}$.

Example 2: Find the median of the given data set: 75, 67, 71, 75, and 72
Solution: First, arrange the data set in ascending order: 67, 71, 72, 75, 75.
Since $N = 5$ is odd, we use $\tilde{X} = X_{\left(\frac{N+1}{2}\right)}$; hence
$\tilde{X} = X_{\left(\frac{5+1}{2}\right)} = X_{(3)} = 72$.
Therefore, $\tilde{X} = 72$.
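
The odd and even rules for the median can be sketched as one small Python function. The function name `median` is illustrative; Python's standard library offers `statistics.median` for the same purpose.

```python
def median(values):
    """Median by position: the middle value if N is odd,
    the mean of the two middle values if N is even."""
    ordered = sorted(values)  # arrange the data in an array
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # position (N+1)/2, 0-based index N//2
    return (ordered[mid - 1] + ordered[mid]) / 2  # positions N/2 and N/2 + 1

print(median([75, 67, 71, 75, 72]))  # 72
```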
3. Mode ($\hat{X}$)
Mode is the observed value that occurs most frequently. It locates the point where the
observation values occur with the greatest density. It does not always exist, and if it does, it
may not be unique. A data set is said to be unimodal if there is only one mode, bimodal if
there are two modes, and multimodal if there are three or more. It is not affected by extreme
values. It can be used for qualitative as well as quantitative data.

Example 3: Identify the mode(s) of the following data sets.

a) 2, 5, 2, 3, 5, 2, and 1
Solution: $\hat{X} = 2$
b) 2, 5, 5, 2, 2, 5, 1, 3, 5, 4, 2, 5, 5, and 2
Solution: $\hat{X} = 5$
c) 1, 2, 3, 3, 2, 1, 2, 3, 1, 4, 4, 5, 5, and 5
Solution: $\hat{X} = 1, 2, 3,$ and $5$
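
The three data sets in Example 3 can be checked by counting frequencies, as in the sketch below. The helper name `modes` is illustrative. Note one simplification: when every value occurs exactly once, this helper returns all values, whereas the lesson treats such a data set as having no mode.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the highest frequency, in sorted order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([2, 5, 2, 3, 5, 2, 1]))                       # [2]
print(modes([2, 5, 5, 2, 2, 5, 1, 3, 5, 4, 2, 5, 5, 2]))  # [5]
print(modes([1, 2, 3, 3, 2, 1, 2, 3, 1, 4, 4, 5, 5, 5]))  # [1, 2, 3, 5]
```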

Example 4: Find the mean, median, and mode of the following ages in years.
1.) 3, 4, 5, 5, 6, 7, 9, 10, 14
2.) 7, 8, 9, 9, 10, 10, 11, 12
Solution:
1.) 3, 4, 5, 5, 6, 7, 9, 10, 14
Mean: $\bar{X} = \frac{\sum X}{n} = \frac{3+4+5+5+6+7+9+10+14}{9} = \frac{63}{9} = 7$ years
Median: Since N is 9 (which is odd), use the first formula:
$\tilde{X} = X_{\left(\frac{N+1}{2}\right)} = X_{\left(\frac{9+1}{2}\right)} = X_{\left(\frac{10}{2}\right)} = X_{(5)}$.
What is the 5th score in the ordered distribution? The answer is 6. Therefore, the median is 6.
Mode: The mode is 5 since it has the highest frequency (it appears twice).

2.) 7, 8, 9, 9, 10, 10, 11, 12
Mean: $\bar{X} = \frac{\sum X}{n} = \frac{7+8+9+9+10+10+11+12}{8} = \frac{76}{8} = 9.5$ years
Median: Since N is 8 (which is even), use the second formula:
$\tilde{X} = \frac{X_{\left(\frac{N}{2}\right)} + X_{\left(\frac{N}{2}+1\right)}}{2} = \frac{X_{(4)} + X_{(5)}}{2} = \frac{9+10}{2} = 9.5$; therefore, the median is 9.5.
Mode: The modes are 9 and 10 since they have the highest frequency (each appears
twice). The data set is bimodal.

Weighted mean ($\bar{X}_w$) is the sum of the mean of each group multiplied by its respective
weight, divided by the sum of the weights. (For the ordinary mean, the weights of all values
are equal.) An example of a weighted mean is computing your weighted average in a
semester to determine if you belong to the dean’s list. Each of your grades has a corresponding
number of units (for example, GECMAT is 3 units, a major subject is 4 or 5 units, and so on).

The formula of the weighted mean is

$\bar{X}_w = \frac{x_1 w_1 + x_2 w_2 + \cdots + x_n w_n}{w_1 + w_2 + \cdots + w_n}$
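
The weighted mean formula can be sketched as a small Python function. The grades and unit loads below are made up for illustration and are not taken from the module; the function name `weighted_mean` is likewise an assumption.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)

# Hypothetical semester: three subjects with their grades (x) and units (w).
grades = [1.5, 2.0, 1.75]
units = [3, 4, 5]

gwa = weighted_mean(grades, units)
print(round(gwa, 2))  # 1.77
```

When all weights are equal, the function reduces to the ordinary mean, for example `weighted_mean([2, 4], [1, 1])` gives 3.0.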

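Example 5: Francis answered 20 calculus problems. He spent 1 1/2 hours for the first 6 problems;
45 minutes for the next 3; and 3 hours for the last 11 problems. What was the average time
(in minutes) he spent for the 20 problems?
Solution: This problem requires the weighted average time because each set of problems
carries a different weight (the number of problems in the set). Converting to minutes, the mean
time per problem is $x_1 = 90/6 = 15$ for the first set, $x_2 = 45/3 = 15$ for the second set and
$x_3 = 180/11$ for the third set, with weights $w_1 = 6$, $w_2 = 3$ and $w_3 = 11$.

$\bar{X}_w = \frac{x_1 w_1 + x_2 w_2 + x_3 w_3}{w_1 + w_2 + w_3} = \frac{15(6) + 15(3) + \frac{180}{11}(11)}{6 + 3 + 11} = \frac{90 + 45 + 180}{20} = \frac{315}{20} = 15.75$ minutes per problem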

Measures of Dispersion
Measures of dispersion/variability indicate the extent to which individual items in a
series are scattered about an average. It is used to determine the extent of the scatter so that
steps may be taken to control the existing variation. It is also used as a measure of reliability
of the average value.

1. Range
The range of a set of measurements is the difference between the largest and the
smallest values. Range (𝑅) = Maximum value − Minimum value

Example 6: The IQ scores of 5 members of the CHMSC men’s varsity basketball team are 108,
112, 127, 116, and 113. Find the range.
Solution: R = 127 − 108 = 19

2. The Variance and Standard Deviation

The sample variance $s^2$ is $s^2 = \frac{\sum (X_i - \bar{X})^2}{n-1}$ and the sample standard deviation $s$ is
$s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n-1}}$, where $X_i$ is the $i$th score of the ordered observations, $\bar{X}$ is the mean of the
observations and $n$ is the number of observations.
Example 7: A sample of 5 households showed the following number of household members:
3, 8, 5, 4, and 4. Find the variance and standard deviation.
Solution:
Score ($X$)    Mean ($\bar{X}$)    $(X - \bar{X})$    $(X - \bar{X})^2$
3              4.8                 −1.8               3.24
8              4.8                 3.2                10.24
5              4.8                 0.2                0.04
4              4.8                 −0.8               0.64
4              4.8                 −0.8               0.64
$\sum X = 24$                                          $\sum (X - \bar{X})^2 = 14.8$

Mean: $\bar{X} = \frac{3+8+5+4+4}{5} = \frac{24}{5} = 4.8$
Variance: $s^2 = \frac{\sum (X_i - \bar{X})^2}{n-1} = \frac{14.8}{5-1} = 3.7$
Standard Deviation: $s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n-1}} = \sqrt{3.7} \approx 1.92$
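
The steps of Example 7 can be reproduced with a short Python sketch that follows the same table: compute the mean, the squared deviations, then divide by n − 1. It uses only the data from the example; the variable names are illustrative, and the range is included for comparison.

```python
import math
import statistics

members = [3, 8, 5, 4, 4]  # household sizes from Example 7

n = len(members)
mean = sum(members) / n                                  # 24 / 5
squared_deviations = [(x - mean) ** 2 for x in members]  # the (X - mean)^2 column

sample_variance = sum(squared_deviations) / (n - 1)  # divide by n - 1, not n
sample_sd = math.sqrt(sample_variance)
data_range = max(members) - min(members)             # largest minus smallest value

print(round(sample_variance, 2))  # 3.7
print(round(sample_sd, 2))        # 1.92
print(data_range)                 # 5
```

The standard library functions `statistics.variance` and `statistics.stdev` use the same n − 1 divisor and should agree with these values.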

Clarify Your Lesson!

1. Select the measure of central tendency (mean, median, or mode) that would be most
appropriate for describing each of the following hypothetical sets of data and give your
reason.
a. Religious preferences of delegates to the United Nations.
b. Heart rates for a group of women before they start their first aerobics class.
c. Types of phobias exhibited by patients attending a phobia clinic.
d. Amounts of time participants spend solving a classic cognitive problem, with some of
the participants unable to solve it.
e. Height in inches for a group of boys in the first grade
2. A veterinarian is interested in the life span of golden retrievers. She recorded the age at
death (in years) of the retrievers treated in her clinic. The ages were 12, 9, 11, 10, 8, 14,
12, 1, 9, 12.
a. Calculate the mean, median, and mode for age at death.
b. After examining her records, the veterinarian determined that the dog that had died
at 1 year was killed by a car. Recalculate the mean, median, and mode without that
dog’s data.
c. Which measure of central tendency in part b changed the most, compared to the
values originally calculated in part a?

Challenge Yourself!
(Let’s Try This #2)
Solve the following problems with complete process of the solution.

A study was conducted to determine the level of awareness of the residents of a certain
municipality on the causes of hypertension. The accompanying table shows the results
with respect to the gender of the respondents.

Indicator                                            Male       Female     Weighted Mean
                                                     (n = 32)   (n = 48)   ($\bar{X}_w$)
1. Intake of food high in salt                       2.8        2.8
2. Intake of food high in saturated fats             2.4        3.0
3. Caffeine consumed in more than 4 cups of          2.5        2.4
   coffee a day
4. Drinking too much alcohol                         2.9        3.0
5. Smoking                                           3.2        3.2
6. Intake of drugs or medications that may           2.8        2.8
   increase blood pressure
7. Lack of physical exercise                         3.0        3.1
8. Stress                                            2.8        3.0
9. Family history of hypertension                    2.6        2.7
10. Present medical conditions like kidney           2.5        2.5
    disorder, diabetes mellitus and others
Average
a. Find the average for the male and for the female.
b. Find the weighted mean, $\bar{X}_w$, for each indicator using $n$ as the weight.
c. Find the standard deviation for the male and female.

Gauge Your Learning!

Solve the following problems and show your complete solution.


1. The number of confirmed flu cases (in thousands) for the 10 neighboring countries is
shown. Find the mean, median, mode, variance, and standard deviation. (15 points)
25, 23, 22, 21, 26, 27, 29, 23, 21, 23
2. Grade Point Average (5 points)
A student received an A in English Composition (3 credits), a C in Introduction to
Psychology (3 credits), a B in Biology (4 credits), and a D in Physical Education
(2 credits). Assuming A = 4 grade points, B = 3 grade points, C = 2 grade points, D = 1
grade point, and F = 0 grade points, find the student’s grade point average.
3. A random sample of employees from Carlos Hilado Memorial State College pledged
the following donations to the projects of the Faculty Association:

₱ 200 ₱ 250 ₱ 500 ₱ 750 ₱ 250


₱ 1000 ₱ 1250 ₱ 1500 ₱ 700 ₱ 750
₱ 750 ₱ 2000 ₱ 1000 ₱ 1250 ₱ 1250
₱ 250 ₱ 750 ₱ 250 ₱ 1200 ₱ 250
₱ 500 ₱ 1250 ₱ 750 ₱ 500 ₱ 1200

Find the mean, median and mode.

Study Schedule           Topic Learning Outcomes                        Activities
Week 4                   Data Management
                         Module 4.3 Hypothesis Testing
September 15-16, 2020    1. Formulate a hypothesis utilizing the        Explore: Discover This!
                            five steps on testing of hypothesis         Engage: Let’s Try This!
                         2. Perform a hypothesis testing                Explain: Clarify Your Lesson!
                         3. Test the significant difference             Elaborate: Challenge Yourself!
                            between groups                              Evaluate: Gauge Your Learning!
September 16, 2020       Completion of Let’s Try This and               File your activity in your red long
                         Gauge Your Learning Activities                 clear book.

Lesson 3: HYPOTHESIS TESTING

Learning Objectives: At the end of the lesson, the students are expected to:

1. Formulate a hypothesis utilizing the five steps on testing of hypothesis


2. Perform a hypothesis testing
3. Test the significant difference between groups

Let’s Try This!

1. A small warehouse employs a supervisor at ₱6,000 a week, an inventory manager at
₱3,500 a week, six stock boys at ₱2,000 a week, and four drivers at ₱2,500 a week.
a. Find the mean and median wage.
b. How many employees earn more than the mean wage?
c. Which measure of center best describes a typical wage at this company, the mean
or the median? Why?
d. Which measure of spread would best describe the payroll, the range or the
standard deviation? Why?

Discover This!

Hypothesis testing deals with the problem of testing specific assertions about the
population regarding the value of the unknown parameter or the distributional properties of
the population. The statement is stated in the form of a hypothesis and the statistical tool
used to decide whether or not to reject said statement is a test of hypothesis.
This is the process of making an inference or generalization on population parameters
based on the results of the study on samples. It is a procedure for deciding if the null
hypothesis should be rejected in favor of an alternative hypothesis, or will not be rejected. It
is a statistical procedure that allows researchers to use sample data to draw inferences about
the population of interest. It is a statistical method that uses sample data to evaluate a
hypothesis about a population.

Definition:
1. Statistical hypothesis – a statement or conjecture concerning one or more populations.
2. Null hypothesis (Ho) – the hypothesis that is being tested; it represents what the
experimenter doubts to be true.
3. Alternative hypothesis (Ha) – the operational statement of the theory that the
experimenter believes to be true and wishes to prove.
4. Type I error – the error made by rejecting the null hypothesis when it is true. The
probability of a Type I error is denoted by α (alpha).
5. Type II error – the error made by accepting (not rejecting) the null hypothesis when
it is false. The probability of a Type II error is denoted by β (beta).
6. Level of significance (α) – the maximum probability of a Type I error the researcher is
willing to commit.

Five Steps in Hypothesis Testing


Step 1: State the null hypothesis (Ho) and the alternative hypothesis (Ha).
Step 2: Choose the level of significance α.
Step 3: Select the appropriate test statistic and establish the critical region.
Step 4: Collect the data and compute the value of the test statistic from the sample data.
Step 5: Make the decision. Reject Ho if the value of the test statistic is in the critical region.
Otherwise, do not reject Ho.

Step 1: Determine the variable of interest $X$. State the null hypothesis ($H_0$) and alternative
hypothesis ($H_a$) in words and in symbols.
Null hypothesis ($H_0$) is always hoped to be rejected. It always contains the “=” sign.
The null hypothesis states that there is no statistically significant difference (effect,
change, relationship) between the variables. (It uses the “=” symbol; sometimes ≤ or ≥.)
The population mean value is equal to a hypothesized (standard) value. (The new
vaccine is as effective as the one commonly used.) ($\mu = 24$ hours)
There is no significant difference between the two parameters. (Male students are
as intelligent as female students in Mathematics.) ($\mu_M = \mu_F$)
There is no significant relationship between two variables. (There is no significant
relationship between the effect of the new antianxiety drug and the heart rate of a person.)
The experimental treatment on a group of students has had no effect on its
performance. ($\mu_{\text{after}} = \mu_{\text{before}}$ or $\mu_d = 0$)
Alternative hypothesis ($H_a$) challenges $H_0$. It never contains the “=” sign; it uses “<”, “>”, or
“≠”. It generally represents the idea which the researcher wants to prove. It is a statement that
there is a relationship between variables.
It is a statement specifying that the population parameter is some value other than
the one specified under the null hypothesis. It states that there is a change, a difference, or a
relationship for the general population. In the context of an experiment, $H_a$ predicts that the
independent variable (treatment) does have an effect on the dependent variable.
The population mean value is greater than (less than, not equal to) a hypothesized
(standard) value. (The new vaccine is not as (more / less) effective as the one commonly used.)
($\mu \ne 24$ hours, $\mu > 24$, $\mu < 24$)
There is a significant difference between the two parameters. (Male students are not as
(more / less) intelligent as female students in Mathematics.) ($\mu_M \ne \mu_F$, $\mu_M > \mu_F$, $\mu_M < \mu_F$)
There is a significant relationship between two variables. (There is a significant
relationship between the effect of the new antianxiety drug and the heart rate of a person.)
The experimental treatment on a group of students has had an effect on its
performance. ($\mu_{\text{after}} > \mu_{\text{before}}$ or $\mu_d > 0$)

A one-tailed test of hypothesis is a test where the alternative hypothesis specifies a
one-directional difference for the parameter of interest.
A two-tailed test of hypothesis is a test where the alternative hypothesis does not
specify a directional difference for the parameter of interest.

Example 1: Variable of interest: Lifespan of battery (in years)


H0 : Brand A battery is as effective as the ordinary battery in the market. μ = 4 years
Ha : Brand A battery is more (or less or not as) effective than the ordinary battery in the
market. μ > 4 years (or μ < 4 years or μ ≠ 4 years)

Example 2: Variable of interest: Lifespan of battery (in years)


Null hypothesis: H0 : Brand A battery is as effective as Brand B battery. μA = μB
Alternative hypothesis: Ha : Brand A battery is more (or less or not as) effective than Brand B
battery. (μA > μB or μA < μB or μA ≠ μB)

Step 2: Choose the level of significance α (usually 0.05 or 0.01)


The level of significance (α) is the maximum probability of a Type I error the researcher
is willing to commit.
The alpha level, or the level of significance, is a probability value that is used to define
the concept of “very unlikely” in a hypothesis test. The lower the significance level, the more
the data must diverge from the null hypothesis to be significant. Therefore, the 0.01 level is
more conservative than the 0.05 level.

Step 3: Determine the following:


a) Type of test: (i) two-tailed test (≠), (ii) one-tailed left test (<), or (iii) one-tailed
right test (>)
b) z-test, t-test, or F-test and critical value/s

Statistical Hypotheses Tests

Test             Sample size          σ               Means compared     Distribution
z-test           n ≥ 30               σ is known      (2 means)          Normal distribution
t-test           n < 30 & n ≥ 30      σ is unknown    (2 means)          Student t-distribution
                                                                         (df = n − 1 or df = n1 + n2 − 2)
F-test (ANOVA)                                        2 or more means    F-distribution

A test statistic is a statistic whose value is calculated from sample measurements and
on which the statistical decision will be based.
The critical region or rejection region is the set of values of the test statistic for which
the null hypothesis will be rejected. The acceptance region is the set of values of the test
statistic for which the null hypothesis will not be rejected. The acceptance and rejection
regions are separated by a critical value of the test statistic.

For z-test: Critical z-values


Level of Significance Two-Tailed Test One-Tailed Test
0.10 = 10% ±1.65 +1.28 or -1.28
0.05 = 5% ±1.96 +1.65 or -1.65
0.01 = 1% ±2.58 +2.33 or -2.33
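
The tabulated critical z-values can be reproduced approximately with Python's standard library. This is only a sketch: `critical_z` is a hypothetical helper name, and `NormalDist().inv_cdf` gives values to more decimal places than the table (for example 1.645, which the table rounds to 1.65).

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def critical_z(alpha, two_tailed):
    """Return the positive critical z-value for a given level of significance."""
    # A two-tailed test splits alpha equally between the two tails.
    tail_area = alpha / 2 if two_tailed else alpha
    return standard_normal.inv_cdf(1 - tail_area)


for alpha in (0.10, 0.05, 0.01):
    two = round(critical_z(alpha, two_tailed=True), 3)
    one = round(critical_z(alpha, two_tailed=False), 3)
    print(f"alpha = {alpha}: two-tailed ±{two}, one-tailed ±{one}")
```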

For t-test: (Use t-distribution table) 𝑑𝑓 = 𝑛 – 1 or 𝑑𝑓 = 𝒏𝟏 + 𝒏𝟐 – 2


Note: The z-test is used if n ≥ 30 and σ is known; if n < 30, the population should be
normally distributed and σ known.
The t-test is used if n ≥ 30 and σ is unknown; if n < 30, the population should
be normally distributed and σ unknown.

c) Critical region
The region of rejection can be found in terms of critical z-scores: the z-scores that cut
off an area of the normal distribution that is exactly equal to alpha. It has a size equal to α. It
covers the range of values of the test statistic that indicate the difference was probably not due to
chance and that H0 should be rejected. The noncritical (acceptance) region has a size equal
to 1 − α.
Decision rule:
Critical value approach:
Two-tailed test: Reject 𝑯𝟎 , if the |Computed value CV| ≥ |Critical value|, otherwise, do not
reject 𝑯𝟎 .
One-tailed right test: Reject 𝑯𝟎 , if the 𝑪𝑽 ≥ Critical value, otherwise, do not reject 𝑯𝟎 .
One-tailed left test: Reject 𝑯𝟎 , if the 𝑪𝑽 ≤ Critical value, otherwise, do not reject 𝑯𝟎 .
𝒑-value approach: Reject 𝑯𝟎 , if 𝒑-value is less than or equal to 𝜶 (𝒑 ≤ 𝜶), otherwise, do not
reject 𝑯𝟎 .
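
The decision rules above can be sketched as two small Python functions. The names `reject_h0` and `z_p_value` are hypothetical, the p-value helper assumes a z statistic (standard normal distribution), and the computed value 2.10 in the usage lines is an arbitrary illustration, not a result from this module.

```python
from statistics import NormalDist


def reject_h0(computed, critical, tail):
    """Critical value approach. tail must be 'two', 'right', or 'left'."""
    if tail == "two":
        return abs(computed) >= abs(critical)
    if tail == "right":
        return computed >= critical
    if tail == "left":
        return computed <= critical
    raise ValueError("tail must be 'two', 'right', or 'left'")


def z_p_value(z, tail):
    """p-value of a computed z statistic under the standard normal distribution."""
    cdf = NormalDist().cdf(z)
    if tail == "right":
        return 1 - cdf
    if tail == "left":
        return cdf
    return 2 * min(cdf, 1 - cdf)  # two-tailed


# Illustrative two-tailed test at alpha = 0.05 (critical value 1.96)
alpha = 0.05
print(reject_h0(2.10, 1.96, "two"))       # critical value approach
print(z_p_value(2.10, "two") <= alpha)    # p-value approach
```

Both approaches lead to the same decision when the critical value and the p-value come from the same distribution and the same α.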

Step 4: Statistical tool and computation of calculated value


Test for one sample group (Population mean $\mu$ vs Sample mean $\bar{X}$)
z-test for one sample group is used to compare the perceived population mean $\mu$ against
the sample mean $\bar{X}$ with known $\sigma$:

$$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$

t-test for one sample group is used to compare the perceived population mean $\mu$ against
the sample mean $\bar{X}$ with unknown $\sigma$:

$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}}, \qquad df = n - 1$$

Test on the difference of two independent sample means (𝝁𝟏 vs 𝝁𝟐 )


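z-test for two independent populations is used to compare the means of two
independent groups of samples drawn from a normal population

when a common $\sigma$ is known:

$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

when $\sigma_1$ and $\sigma_2$ are known:

$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$

t-test for two independent samples is a test of difference between two independent sample
groups. The two means are compared (when $\sigma$ is unknown):

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}\,\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad df = n_1 + n_2 - 2$$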

Step 5: Make a decision. (Reject or Do not reject 𝑯𝟎 ) and conclusion

Example 3: Identify the variable of interest that you are going to use to represent information.
Formulate the appropriate null hypothesis (𝐻0 ) and the appropriate alternative hypothesis
(𝐻𝑎 ).
a. The soft drink dispenser of a fast food center was just readjusted. The manager,
wanting to know if the dispenser is really in good condition, got a sample of 50 cups
filled by the dispenser. He would only classify the dispenser as “in good condition”
(and therefore it need not be readjusted again) if the average fill per cup of the
dispenser is 8 ounces.

Solution: Variable of interest: fill per cup of the dispenser


𝐻0 : μ = 8 ounces (The dispenser is “in good condition”)
𝐻𝑎 : μ ≠ 8 ounces (The dispenser is not “in good condition”)

b. A common measure of intelligence is the intelligence quotient (IQ) test (Castles, 2012;
Spinks et al., 2007) in which scores in the general healthy population are
approximately normally distributed with 100 ± 15 ($\mu \pm \sigma$). Suppose we select a sample
of 100 graduate students to identify if the IQ of those students is significantly
different from that of the general healthy adult population. In this sample, we record
a sample mean equal to 103 (M = 103). Determine the null hypothesis and alternative
hypothesis.
Solution: Variable of interest: IQ score
$H_0$: $\mu = 100$: The mean IQ score is equal to 100 in the population of
graduate students.
$H_a$: $\mu \ne 100$: The mean IQ score is not equal to 100 in the population
of graduate students.

TEST FOR ONE SAMPLE GROUP (POPULATION MEAN $\mu$ vs SAMPLE MEAN $\bar{X}$)

If the population standard deviation ($\sigma$) is known:

$$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$

If the population standard deviation ($\sigma$) is unknown:

$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}}$$

where $\bar{X}$ is the sample mean, $\mu$ is the population mean, $\sigma$ is the population standard
deviation, $s$ is the sample standard deviation, $n$ is the sample size, $z$ is the z-value and $t$ is
the t-value.

ONE-SAMPLE Z-TEST (𝒏 ≥ 𝟑𝟎)
A one-sample 𝒛-test works when you have a single group of people (or things) and you
wonder whether they are different in some way from some hypothesized population. The
more common research question of interest is whether a sample matches the characteristics
that one would expect if that sample wasn’t different in some way from this imagined
population.

Example 4: According to a dietary study, high sodium intake may be related to ulcers, stomach
cancer, and migraine headaches. The human requirement for salt is only 220 milligrams per
day and a standard deviation of 24.5 milligrams, which is surpassed in most single servings of
ready-to-eat cereals. If a random sample of 50 similar servings of a certain cereal has a mean
sodium content of 244 milligrams, does this suggest at the 0.05 level of significance that the
average sodium content for a single serving of such cereal is greater than 220 milligrams?
Assume the distribution of sodium contents to be normal.

Solution:
Step 1: Variable of interest $X$: sodium content of a certain cereal (in milligrams)
$H_0$: The mean sodium content of a certain cereal is 220 mg. ($\mu = 220$ mg)
$H_a$: The mean sodium content of a certain cereal is greater than 220 mg. ($\mu > 220$ mg)
Step 2: $\alpha = 0.05$
Step 3:
a) Type of test: One-tailed right test (directional)
b) z-test: Critical value = 1.65
c) Critical region: $z \ge 1.65$ (right tail)
d) Decision rule: Reject $H_0$ if the computed value (CV) $\ge 1.65$; otherwise, do not reject $H_0$.
Step 4: Statistical tool and computation of computed value.
Given: $\mu = 220$ mg, $\sigma = 24.5$ mg, $\bar{X} = 244$ mg, $n = 50$ servings

$$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{244 - 220}{24.5 / \sqrt{50}} \approx 6.93$$

Step 5: Decision: Reject $H_0$, since $6.93 > 1.65$.
Conclusion: Therefore, it does suggest at the 0.05 level of significance that the mean
sodium content of a certain cereal is greater than 220 mg.
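
The computation in Example 4 can be checked with a minimal Python sketch. The function name `one_sample_z` is an illustrative assumption; the critical value here is obtained from `NormalDist().inv_cdf` rather than from the table, so it appears as about 1.645 instead of the rounded 1.65.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(sample_mean, pop_mean, pop_sd, n):
    """z = (sample mean - population mean) / (sigma / sqrt(n))"""
    return (sample_mean - pop_mean) / (pop_sd / sqrt(n))


# Values given in Example 4
z = one_sample_z(sample_mean=244, pop_mean=220, pop_sd=24.5, n=50)

# One-tailed right test at alpha = 0.05
critical = NormalDist().inv_cdf(1 - 0.05)
reject = z >= critical

print("z =", round(z, 2))
print("critical value =", round(critical, 3))
print("Reject H0:", reject)
```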
ONE-SAMPLE T-TEST (𝒏 < 𝟑𝟎)
Example 5: An expert typist can type 65 words per minute. A random sample of 16 applicants
took the typing test and an average speed of 62 words per minute with a standard deviation
of 8 words was obtained. Can we say that the applicants’ performance is below the standard
at the 0.05 level?
Solution:
1. Variable of interest: typing speed (in words per minute)
Hypothesis: 𝐻0 : 𝜇 = 65 words per minute (claim)
𝐻𝑎 : 𝜇 < 65 words per minute

2. Level of significance: 𝛼 = 5% or 0.05

3. Statistical Tool: One-sample t-test
Type of test: One-tailed left test
(Degrees of freedom) df = 16 − 1 = 15
Critical value CV = −1.753 (Refer to Appendix A, t-Distribution)

Decision rule: Reject 𝐻0 if the test or computed value is less than or equal to − 1.753
(𝑡 ≤ −1.753), otherwise, do not reject 𝐻0 .

4. Collect and Compute:

$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}} = \frac{62 - 65}{8 / \sqrt{16}} = -1.50$$

5. Decision: Since $-1.50 > -1.753$, do not reject $H_0$.
Conclusion: Therefore, there was not enough evidence to say that the applicants’
performance is below the standard at the 0.05 level.
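
Example 5 can be verified the same way. In this sketch `one_sample_t` is a hypothetical helper, and the critical value is typed in from the t-table entry quoted above (df = 15, one-tailed, α = 0.05), because Python's standard library has no t-distribution.

```python
from math import sqrt


def one_sample_t(sample_mean, pop_mean, sample_sd, n):
    """t = (sample mean - population mean) / (s / sqrt(n)), with df = n - 1"""
    return (sample_mean - pop_mean) / (sample_sd / sqrt(n))


# Values given in Example 5
t = one_sample_t(sample_mean=62, pop_mean=65, sample_sd=8, n=16)
df = 16 - 1

# Critical value taken from the t-table (df = 15, alpha = 0.05, left tail)
critical_value = -1.753

# One-tailed left test: reject H0 only if t <= critical value
reject = t <= critical_value

print("t =", t, "df =", df)
print("Reject H0:", reject)
```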

TEST ON THE DIFFERENCE OF MEANS OF TWO POPULATIONS (INDEPENDENT SAMPLES T-TEST)
The independent samples t-test compares the means of two independent groups in
order to determine whether there is statistical evidence that the associated population
means are significantly different. The Independent Samples t-test is a parametric test.

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{(n_1 - 1)(s_1)^2 + (n_2 - 1)(s_2)^2}{n_1 + n_2 - 2}}\,\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$
where
𝑋̅1 is the mean of sample 1
𝑋̅2 is the mean of sample 2
𝑠1 is the standard deviation of sample 1
𝑠2 is the standard deviation of sample 2
𝑛1 is the sample size of sample 1
𝑛2 is the sample size of sample 2

Example 6: Samples of the weights of male and female students were obtained with the
following sample statistics. Using α = 0.01, do the sample means provide sufficient evidence
that the weights of male and female students are equal?

Sample    n     $\bar{X}$     s
Female    12    112.8 lbs     12.8
Male      12    148.3 lbs     7.8

We will apply the five steps in hypothesis testing in this problem. But before that, let us first
determine the groups.
Let: $X_1$ be the group of female students
$X_2$ be the group of male students

Solution:
Step 1: Variable of interest: weight of students (in lbs.)

(Null hypothesis) 𝐻0 : The weights of female and male students are the same. 𝜇1 = 𝜇2
(Alternative hypothesis) 𝐻𝑎 : The weights of female and male students are different. 𝜇1 ≠ 𝜇2
Step 2: Level of Significance α
α = 1% or 0.01
α/2 = 0.01/2 = 0.005
Step 3:
Type of test: Two-tailed test
Test statistic: Independent samples t-test
(Degrees of freedom) df = $n_1 + n_2 - 2$ = 12 + 12 − 2 = 22
Critical value CV = 2.82 (Refer to Appendix A)
Decision rule: Reject $H_0$ if the absolute value of the test or computed value is greater than
or equal to 2.82 ($|t| \ge 2.82$); otherwise, do not reject $H_0$.

Step 4: Computation of computed value

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{(n_1 - 1)(s_1)^2 + (n_2 - 1)(s_2)^2}{n_1 + n_2 - 2}}\,\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

$$t = \frac{112.8 - 148.3}{\sqrt{\frac{(12 - 1)(12.8)^2 + (12 - 1)(7.8)^2}{12 + 12 - 2}}\,\sqrt{\frac{1}{12} + \frac{1}{12}}}$$

$$= \frac{-35.5}{\sqrt{112.34}\,\sqrt{0.1667}} = \frac{-35.5}{4.327} \approx -8.20$$

Step 5:
Decision: Since $|-8.20| > 2.82$, reject $H_0$.

Conclusion: Therefore, the sample means provide sufficient evidence that the mean weights of
female and male students are different at the 0.01 level.
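
The arithmetic of Step 4 can be rechecked from the summary statistics in the table with a short Python sketch. The helper name `pooled_t` is an assumption made for illustration; the critical value 2.82 is the table value quoted in Step 3.

```python
from math import sqrt


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Independent samples t statistic using the pooled variance; returns (t, df)."""
    pooled_variance = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    standard_error = sqrt(pooled_variance) * sqrt(1 / n1 + 1 / n2)
    return (mean1 - mean2) / standard_error, n1 + n2 - 2


# Group 1: female students, group 2: male students (Example 6)
t, df = pooled_t(112.8, 12.8, 12, 148.3, 7.8, 12)

critical_value = 2.82  # t-table value for df = 22, alpha/2 = 0.005
reject = abs(t) >= critical_value

print("t =", round(t, 2), "df =", df)
print("Reject H0:", reject)
```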

Clarify Your Lesson! (Let’s Try This #2)

1. A psychiatrist is testing a new antianxiety drug, which seems to have the potentially
harmful side effect of lowering the heart rate. For a sample of 50 medical students whose
pulse was measured after 6 weeks of taking the drug, the mean heart rate was 70 beats
per minute (bpm). If the mean heart rate for the population is 72 bpm with a standard
deviation of 12, can the psychiatrist conclude that the new drug lowers heart rate
significantly?
2. Imagine that you are testing a new drug that seems to raise the number of T cells in the
blood and therefore has enormous potential for the treatment of disease. After treating
100 patients, you find that their mean (𝑋̅) T cell count is 29.1. Assume that 𝜇 and
𝜎 (hypothetically) are 28 and 6, respectively.

a. Test the null hypothesis at the .05 level, two-tailed.
b. Test the same hypothesis at the .01 level, two-tailed.
c. Describe in practical terms what it would mean to commit a Type I error in this
example.
d. Describe in practical terms what it would mean to commit a Type II error in this
example.
e. How might you justify the use of .01 for alpha in similar experiments?

Gauge Your Learning!


Carry out a complete test of hypothesis for the following problems. Show complete
solution.
1. A physician claims that joggers’ maximal volume oxygen uptake is greater than the
average of all adults. A sample of 15 joggers has a mean of 40.6 ml/kg and a standard
deviation of 6 ml/kg. If the average of all adults is 36.7 ml/kg, is there enough evidence
to support the physician’s claim at 𝛼=0.05?
2. On the first day of class, a third-grade teacher is told that 12 of his students are
“gifted,” as determined by IQ tests, and the remaining 12 are not. In reality, the two
groups have been carefully matched on IQ and previous school performance. At the
end of the school year, the gifted students have a grade average of 87.2 with s = 5.3,
whereas the other students have an average of 82.9, with s = 4.4. Perform a t-test to
decide whether you can conclude from these data that false expectations can affect
student performance; use alpha = .05, two-tailed.
3. An electrical firm manufactures light bulbs that have a lifetime that is approximately
normally distributed with a mean of 800 hours and a standard deviation of 40 hours.
Test the hypothesis that μ = 800 hours against the alternative, μ ≠ 800 hours, if a
random sample of 25 bulbs has an average life of 788 hours (using 0.05 level of
significance).
4. The average height of females in the freshman class of a certain college has historically
been 162.5 centimeters with a standard deviation of 6.9 centimeters. Is there reason
to believe that there has been a change in the average height if a random sample of
50 females in the present freshman class has an average height of 165.2 centimeters?
(using 0.05 level of significance)
5. The Tekeltronic Company manufactures car batteries with two different production
methods. The lives (in years) of the batteries are found for a sample from each group,
with the following results.
Traditional Method Experimental Method
n = 15 n = 15
X = 4.31 Y = 4.31
s = 0.37 s = 0.31

At the 0.05 significance level, test the claim that the two production methods yield
batteries with the same mean life. Based on the results, if you were buying a battery for your
car, would you prefer a battery manufactured by the traditional method or the
experimental method?

Gauge Your Learning!

A. State the null and alternative hypotheses and identify the following as one-tailed or two-
tailed.
1. A researcher studies gambling in young people. She thinks those who gamble spend
more than $30 per day.

2. A researcher wishes to see if police officers whose spouses work in law enforcement
have a lower score on a work stress questionnaire than the average score of 120.

3. A teacher feels that if an online textbook is used for a course instead of a hardback
book, it may change the students’ scores on a final exam. In the past, the average final
exam score for the students was 83.

4. A medical researcher is interested in finding out whether a new medication will have
any undesirable side effects. The researcher is particularly concerned with the pulse
rate of the patients who take the medication. The mean pulse rate for the population
under study is 82 beats per minute.
5. A chemist invents an additive to increase the life of an automobile battery. The mean
lifetime of the automobile battery without the additive is 36 months.

6. A contractor wishes to lower heating bills by using a special type of insulation in
houses. If the average of the monthly heating bills is $78, state her hypotheses about
heating costs with the use of insulation.

B. Carry out a complete test of hypothesis for the following problems. Show complete
solution.
1. According to a dietary study, high sodium intake may be related to ulcers, stomach
cancer, and migraine headaches. The human requirement for salt is only 220
milligrams per day, which is surpassed in most single servings of ready-to-eat cereals.
If a random sample of 36 similar servings of a certain cereal has a mean sodium
content of 244 milligrams and a standard deviation of 24.5 milligrams, does this
suggest at the 0.05 level of significance that the average sodium content for a single
serving of such cereal is greater than 220 milligrams? Assume the distribution of
sodium contents to be normal.

2. Self-Esteem Scores
In a study of a group of women science majors who remained in their profession and
a group who left their profession within a few months of graduation, the researchers
collected the data shown here on a self-esteem questionnaire. At α = 0.05, can it be
concluded that there is a difference in the self-esteem scores of the two groups?

Leavers Stayers
𝑋̅1 = 3.05 𝑋̅2 = 2.96
𝑠1 = 0.75 𝑠2 = 0.82
𝑛1 = 41 𝑛2 = 41

Study Schedule           Topic Learning Outcomes                        Activities
Week 4                   Data Management
                         Module 4.4 Correlation and Regression Analyses
September 17-18, 2020    1. Perform the regression and correlation      Explore: Discover This!
                            analysis of the data.                       Engage: Let’s Try This!
                         2. Test the significance of the regression     Explain: Clarify Your Lesson!
                            and the population correlation              Elaborate: Challenge Yourself!
                            coefficient                                 Evaluate: Gauge Your Learning!
September 18, 2020       Completion of Let’s Try This and Gauge         File your activity in your red long
                         Your Learning Activities                       clear book.

Lesson 4.4: CORRELATION AND REGRESSION ANALYSES

Learning Objectives: At the end of the lesson, the students are expected to:

1. Perform the regression and correlation analysis of the data.


2. Test the significance of the regression and the population correlation coefficient
Introduction

When performing research studies, scientists often wish to know whether two
variables are related. If the variables are determined to be related, a scientist may then wish
to find an equation that can be used to model the relationship.

Let’s Try This!
1. Twenty students take a Spanish
written test. The scatter diagram
shows their marks and the number of
Spanish lessons they had missed
during the year.
a) Write down the mark of the
student who missed most
lessons.
b) Write down the number of
lessons missed by the student
having a mark of 36.
c) One student missed many
lessons but still had a high
mark in the test. Write down
the mark and number of
lessons missed by this
student.
d) The teacher looks at the scatter diagram and concludes: "The more Spanish lessons a
student attends, the higher their mark in the written test." Does the information in
the scatter diagram support this conclusion? Give a reason for your answer.

Discover This!

Correlation Analysis
Correlation analysis determines the strength and degree of relationship between
two variables and tests whether that relationship is significant, while
regression analysis predicts the dependent variable using the independent variable when a
relationship exists between the two variables.
Correlation Analysis is concerned with the relationship in the changes of the given
variables. The relationship can be computed and may be shown in a scatter diagram. If y
increases as x increases the correlation is called positive or direct correlation. If y increases as
x decreases the correlation is negative or inverse correlation.
If there is no relationship indicated between x and y variables then we say that there
is no correlation between them. There are degrees of correlation between two variables. The
value of r ranges from – 1 to + 1, the degrees of correlation are the following:

±1.00             Perfect correlation
±0.76 to ±0.99    High positive/negative correlation
±0.51 to ±0.75    Moderate high positive/negative correlation
±0.26 to ±0.50    Moderate low positive/negative correlation
±0.01 to ±0.25    Low positive/negative correlation
0.0               No correlation

Formula for correlation (Pearson $r$):

$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$$

where: 𝑥 = The observed data for the independent variable


𝑦 = The observed data for the dependent variable
𝑛 = Size of the sample
𝑟 = The degree of relationship between x and y

To determine whether the obtained correlation coefficient is significant, i.e., that a
real correlation exists and that the obtained $r$ is not merely due to sampling variation, a t-test
for testing the significance of $r$ can be used. The formula is as follows:

Formula to test the significance of $r$

$$t = r\sqrt{\frac{n - 2}{1 - r^2}}, \qquad df = n - 2$$

where: $r$ = obtained Pearson $r$ value
$n$ = sample size

Example 1: A study was made to determine the relationship existing between the grades in
Trigonometry and Drafting 101. A random sample of 10 first year BSIT students of Carlos
Hilado Memorial State College, were taken and the following are the results of the sampling.

Students No: 1 2 3 4 5 6 7 8 9 10
Trigonometry (𝑥) 75 83 80 89 77 78 92 86 93 84
Drafting (𝑦) 78 87 78 92 76 81 89 89 91 84
Is the obtained relationship significant at 0.05 level?
Solution:
Step 1: Variable of interest: grades in Trigonometry and Drafting
𝐻0 : There is no significant relationship between the grades in Trigonometry and
Drafting 101. (𝜌 = 0)
𝐻𝑎 : There is a significant relationship between the grades in Trigonometry and
Drafting 101. (𝜌 ≠ 0)
Step 2: 𝛼 = 0.05
Step 3: Type of test: Two-tailed test (Why a two-tailed test?)
Test statistic: Pearson Product Moment Correlation Coefficient $r$ and the t-test $t = r\sqrt{\frac{n - 2}{1 - r^2}}$
(Degrees of freedom) df = $n - 2$ = 10 − 2 = 8
Since it is a two-tailed test, we divide the level of significance by 2, because we are
testing for the possibility of a relationship in both directions. Therefore,

$$\frac{\alpha}{2} = \frac{0.05}{2} = 0.025$$

Using the t-distribution table with df = 8 and α/2 = 0.025, the critical value is ±2.31.
(Refer to Appendix A)
Decision rule: Reject $H_0$ if the absolute value of the computed $t$ is at least 2.31 ($|t| \ge 2.31$); otherwise, do not reject $H_0$.

Step 4: First, let us solve the degree of relationship $r$.

$x$     $y$     $xy$     $x^2$     $y^2$
75      78      5850     5625      6084
83      87      7221     6889      7569
80      78      6240     6400      6084
89      92      8188     7921      8464
77      76      5852     5929      5776
78      81      6318     6084      6561
92      89      8188     8464      7921
86      89      7654     7396      7921
93      91      8463     8649      8281
84      84      7056     7056      7056
$\sum x = 837$    $\sum y = 845$    $\sum xy = 71030$    $\sum x^2 = 70413$    $\sum y^2 = 71717$

Therefore,

$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$$

$$r = \frac{10(71030) - (837)(845)}{\sqrt{[10(70413) - (837)^2][10(71717) - (845)^2]}}$$

$$r = \frac{710300 - 707265}{\sqrt{[704130 - 700569][717170 - 714025]}}$$

$$r = \frac{3035}{\sqrt{(3561)(3145)}} = \frac{3035}{\sqrt{11199345}} \approx 0.907 \quad \text{(High positive correlation)}$$

To determine if the relationship is significant, we have

$$t = r\sqrt{\frac{n - 2}{1 - r^2}} = 0.907\sqrt{\frac{10 - 2}{1 - (0.907)^2}} = 0.907\sqrt{\frac{8}{0.177351}} \approx 6.09$$

Step 5: Decision: Since |6.09| > |±2.31|, reject the 𝐻0 .
Conclusion: Therefore, there was a significant relationship existing between the
grades in Trigonometry and Drafting 101 at 0.05 level.
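
The whole of Example 1 can be recomputed from the raw grades with a short Python sketch that follows the summation formulas for $r$ and $t$. The function names `pearson_r` and `t_for_r` are illustrative assumptions. The sketch carries $r$ at full precision, so intermediate digits can differ slightly from the hand computation that rounds $r$ to 0.907 first.

```python
from math import sqrt

trigonometry = [75, 83, 80, 89, 77, 78, 92, 86, 93, 84]   # x
drafting = [78, 87, 78, 92, 76, 81, 89, 89, 91, 84]       # y


def pearson_r(xs, ys):
    """Pearson r from the raw-score (summation) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
    return numerator / denominator


def t_for_r(r, n):
    """t statistic for testing the significance of r, with df = n - 2."""
    return r * sqrt((n - 2) / (1 - r**2))


r = pearson_r(trigonometry, drafting)
t = t_for_r(r, len(trigonometry))

print("r =", round(r, 3))
print("t =", round(t, 2))
```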

Regression Analysis
This topic discusses the simplest type of prediction, that of predicting one variable (𝑦)
with the knowledge of another variable (𝑥). Prediction refers to the process of calculating
scores of the criterion variable (𝑦), on the basis of the knowledge of the predictor variable
(𝑥). The concepts of prediction and correlation are closely related: a more accurate prediction
of 𝑦 can be made from 𝑥 when the correlation coefficient has a greater absolute value.
A simple technique for prediction is linear regression analysis, which uses an
equation of the form
𝑦 = 𝑎 + 𝑏𝑥
Where: 𝑥 is the predictor variable (independent variable)
𝑦 is the criterion variable (dependent variable)
𝑎 is the 𝑦-intercept
𝑏 is the slope of the line

This is the equation of the line fitted to the given data. It is called
the least-squares line or the simple regression line. In this method, 𝑦 is called the dependent
variable and 𝑥, the independent variable. The slope of the regression line for predicting 𝑦
from 𝑥 is represented by 𝑏, and the point where the line intersects the 𝑦-axis (the
𝑦-intercept) is represented by 𝑎. Both can be determined using the following
formulas:
$$b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$$
$$a = \bar{y} - b\bar{x}$$

Where: 𝑦̅ is the mean of the 𝑦 values
𝑥̅ is the mean of the 𝑥 values
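The two formulas for 𝑏 and 𝑎 translate directly into code. The sketch below is one possible way to write them; the function name `fit_line` and the small data set are made up for illustration and do not come from the module.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # Slope: b = [n*sum(xy) - sum(x)*sum(y)] / [n*sum(x^2) - (sum(x))^2]
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: a = y_bar - b * x_bar
    a = sum_y / n - b * (sum_x / n)
    return a, b

# Made-up points that lie exactly on y = 1 + 2x, used only as a sanity check
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

Since the sample points lie exactly on a line, the function should recover that line's intercept and slope, which makes it easy to confirm that the formulas were typed correctly.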

Example 2: A researcher wants to know if there is a relationship between the hours spent
studying a particular subject at home and the student's achievement grade in the subject.
If a significant relationship can be established at the 0.05 level of significance, what
prediction equation could be used to estimate the achievement grade in the subject from the
number of hours spent studying the subject at home?

Student No.             1     2     3     4     5     6     7     8     9     10
Hours Spent (𝑋)         2.5   2.75  1.5   1.0   3.0   2.5   1.25  3.5   1.5   2.0
Achievement Grade (𝑌)   89    88    82    77    90    91    80    93    81    86

Solution:

Step 1: Variable of interest: Hours spent in studying and Achievement grade


𝐻0 : There is no significant relationship between the number of hours spent in studying
the subject and the achievement grade of students. (𝜌 = 0)
𝐻𝑎 : There is a significant relationship between the number of hours spent in studying
the subject and the achievement grade of students. (𝜌 ≠ 0)

Step 2: 𝛼 = 0.05

Step 3: Type of test: Two-tailed test


Test statistic: Pearson Product-Moment Correlation Coefficient 𝑟 and t-test
$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} \quad \text{and} \quad t = r\sqrt{\frac{n-2}{1-r^2}}$$
df = 𝑛 − 2 = 10 − 2 = 8
Critical values = ±2.31
Decision rule: Reject 𝐻0 if the absolute value of the computed 𝑡 is at least 2.31 (|𝑡| ≥ 2.31); otherwise, do not reject 𝐻0.

Step 4: First, let us solve for the degree of relationship r.

𝑥 𝑦 𝑥𝑦 𝑥2 𝑦2
2.5 89 222.5 6.25 7921
2.75 88 242 7.5625 7744
1.5 82 123 2.25 6724
1.0 77 77 1 5929
3.0 90 270 9 8100
2.5 91 227.5 6.25 8281
1.25 80 100 1.5625 6400
3.5 93 325.5 12.25 8649
1.5 81 121.5 2.25 6561
2.0 86 172 4 7396
∑ 𝑥 = 21.5 ∑ 𝑦 = 857 ∑ 𝑥𝑦 = 1881 ∑ 𝑥 2 = 52.375 ∑ 𝑦 2 = 73705

$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$$
$$r = \frac{10(1881) - (21.5)(857)}{\sqrt{[10(52.375) - (21.5)^2][10(73705) - (857)^2]}}$$
$$r = \frac{18810 - 18425.5}{\sqrt{[523.75 - 462.25][737050 - 734449]}}$$
$$r = \frac{384.5}{\sqrt{(61.5)(2601)}} = \frac{384.5}{\sqrt{159961.5}} = 0.961 \quad \text{(high positive correlation)}$$
Then, let us determine the computed t-value.
$$t = r\sqrt{\frac{n-2}{1-r^2}} = 0.961\sqrt{\frac{10-2}{1-(0.961)^2}} = 0.961\sqrt{\frac{8}{0.076479}} = 9.83$$

Step 5: Decision: Since |9.83| > 2.31, reject 𝐻0.


Conclusion: Therefore, there is a significant relationship between the number of hours
spent in studying the subject and the achievement grade of students at 0.05 level.

Take note: We can proceed to regression or prediction only if we have established a significant
relationship between the two variables. If there is no significant relationship between the two
variables, we do not proceed to regression or prediction analysis.

REGRESSION
To determine the linear regression equation 𝑦 = 𝑎 + 𝑏𝑥
Where:
𝑦 is the achievement grade
𝑥 is the number of hours spent studying
𝑎 is the 𝑦-intercept
𝑏 is the slope of the line

Identify the following:
$$b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} \qquad a = \bar{y} - b\bar{x}$$

𝑥 𝑦 𝑥2 𝑦2 𝑥𝑦
2.5 89 6.25 7921 222.5
2.75 88 7.5625 7744 242
1.5 82 2.25 6724 123
1.0 77 1 5929 77
3.0 90 9 8100 270
2.5 91 6.25 8281 227.5
1.25 80 1.5625 6400 100
3.5 93 12.25 8649 325.5
1.5 81 2.25 6561 121.5
2.0 86 4 7396 172
∑ 𝑥 = 21.5 ∑ 𝑦 = 857 ∑ 𝑥 2 = 52.375 ∑ 𝑦 2 = 73705 ∑ 𝑥𝑦 = 1881
𝑋̅ = 2.15 𝑌̅ = 85.7

Let us solve for the value of 𝑏.
$$b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = \frac{10(1881) - (21.5)(857)}{10(52.375) - (21.5)^2} = \frac{18810 - 18425.5}{523.75 - 462.25} = \frac{384.5}{61.5} = 6.25$$
So, 𝑏 = 6.25.
For 𝑦̅ and 𝑥̅, we have
$$\bar{y} = \frac{\sum y}{n} = \frac{857}{10} = 85.7 \qquad \bar{x} = \frac{\sum x}{n} = \frac{21.5}{10} = 2.15$$
Therefore,
$$a = \bar{y} - b\bar{x} = 85.7 - 6.25(2.15) = 72.26$$
Hence, 𝑏 = 6.25 and 𝑎 = 72.26.
Using 𝑦 = 𝑎 + 𝑏𝑥, the linear regression equation is
𝑦 = 72.26 + 6.25𝑥.
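The values of 𝑏 and 𝑎 can be verified from the table totals with a short Python sketch. The variable names below are our own, and the code keeps full precision until the final rounding, which is a slight departure from the hand computation (where 𝑏 is rounded before 𝑎 is computed).

```python
# Totals from the Example 2 table
n = 10
sum_x, sum_y = 21.5, 857
sum_xy, sum_x2 = 1881, 52.375

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # slope
x_bar, y_bar = sum_x / n, sum_y / n                            # means of x and y
a = y_bar - b * x_bar                                          # y-intercept

print(round(b, 2), round(a, 2))
```

Rounded to two decimal places, both values agree with the hand computation, so the regression line 𝑦 = 72.26 + 6.25𝑥 is confirmed.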
Question: What is the predicted achievement grade of a student who spent 2.25 hours in
studying the subject?
Solution: Substitute 𝑥 = 2.25 in the equation of linear regression and solve for 𝑦.
𝑦 = 72.26 + 6.25𝑥
= 72.26 + 6.25 (2.25)
= 86.32
Therefore, a student who spent 2.25 hours in studying the subject has a predicted
achievement grade of 86.32.
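The substitution above can be wrapped in a small helper for predicting other values of 𝑥. This is only a sketch: the function name `predict_grade` is hypothetical, and the default coefficients are the rounded values 𝑎 = 72.26 and 𝑏 = 6.25 obtained in this example.

```python
def predict_grade(hours, a=72.26, b=6.25):
    """Predicted achievement grade from the fitted line y = a + b*x."""
    return a + b * hours

print(round(predict_grade(2.25), 2))
```

If unrounded coefficients are used instead, the prediction comes out at about 86.33, so a difference in the last digit is due to rounding and not to an error in the method.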

Take note: To solve problems involving correlation and regression, we have the following
steps.
Step 1: Using 5 steps of hypothesis testing, solve for the degree r to determine the
strength of relationship between the two variables (see the formula for Pearson r) and
determine whether there is a significant relationship between two variables by solving
the computed t-value (see the formula for t).
Step 2: If a significant relationship exists between the two variables, proceed to
regression analysis; otherwise, do not proceed to regression analysis.
Step 3: For regression analysis, solve first for the mean of 𝑦 (𝑦̅) and the mean of 𝑥 (𝑥̅ )
Step 4: Solve for the value of b (slope of the line) and a (y-intercept).
Step 5: Determine the equation y = a + bx. Remember that y is the dependent variable
and x is the independent variable. In our example, the achievement grade (𝑦) is
dependent on the number of hours spent studying (𝑥). Using the simple regression
line, we can predict the dependent variable y (the achievement grade in our
example) from the independent variable x (the number of hours spent
studying in our example) with the equation y = a + bx.
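The five steps above can be sketched as a single Python function, shown here with the data of Example 2. The function name `correlate_then_regress` and the dictionary keys are our own choices, the critical value must still be read from the t-table, and the sketch assumes the data are not perfectly correlated (|𝑟| < 1).

```python
import math

def correlate_then_regress(xs, ys, critical_value):
    """Test r for significance; fit y = a + b*x only if H0 is rejected."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    # Step 1: correlation coefficient and computed t-value
    sxy = n * sum_xy - sum_x * sum_y
    sxx = n * sum_x2 - sum_x ** 2
    syy = n * sum_y2 - sum_y ** 2
    r = sxy / math.sqrt(sxx * syy)
    t = r * math.sqrt((n - 2) / (1 - r ** 2))

    # Step 2: stop here if the relationship is not significant
    if abs(t) < critical_value:
        return {"r": r, "t": t, "significant": False}

    # Steps 3 and 4: means, slope, and intercept
    b = sxy / sxx
    a = sum_y / n - b * (sum_x / n)

    # Step 5: the regression line is y = a + b*x
    return {"r": r, "t": t, "significant": True, "a": a, "b": b}

hours = [2.5, 2.75, 1.5, 1.0, 3.0, 2.5, 1.25, 3.5, 1.5, 2.0]
grades = [89, 88, 82, 77, 90, 91, 80, 93, 81, 86]
result = correlate_then_regress(hours, grades, critical_value=2.31)
print(result)
```

Because the function does not round 𝑟 before computing 𝑡, its t-value is slightly above the hand-computed 9.83; the decision and the rounded values of 𝑎 and 𝑏 are the same.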

Clarify Your Lesson! (Let’s Try This # 2)

1. In correlational analysis, discuss the steps to determine whether there is a significant
relationship or correlation between the two variables 𝑥 and 𝑦.
2. In regression analysis, discuss the steps to determine the predicted variable 𝑦 using
the independent variable 𝑥.

Challenge Yourself!

(Let’s Try This #3) Solve the following problems:

1. The following table shows the amount of converted sugar in a chemical process at
different temperatures.

Temperature, 𝑥 Converted Sugar, 𝑦


1.2 8.2
1.4 8.5
1.6 8.4
1.8 9.3
2.0 8.9
2.2 10.5
2.4 9.3

a) Compute the correlation coefficient.
b) Estimate the linear regression line.
c) Estimate the mean amount of sugar produced when the recorded temperature is
1.7.

2. Find out if there is a correlation (positive, negative, or none) between the
length of your hand and your height. Measure the length of your right hand, and those of
five to ten other persons, from the wrist to the tip of the middle finger, in centimeters.
Then, determine the corresponding heights in cm. Make a table of the two
variables. Calculate the Pearson correlation coefficient 𝑟𝑥𝑦 of the data set.

Gauge Your Learning!
Solve the following problems and show your complete solution.

1. SAT (Scholastic Ability Test)
Educational researchers desired to find out if a relationship exists between the average SAT
verbal score and the average SAT mathematical score. A random sample of ten was selected,
and their average SAT scores are recorded below. Is there sufficient evidence to conclude
that a relationship exists between the two scores at the 0.05 level?

Verbal 𝑥    Math 𝑦
95          87
89          88
76          73
65          60
72          75
80          69
73          76
71          78
66          62
90          87

2. Absences and Final Grades
An educator wants to see how the number of absences for a student in her class affects the
student’s final grade. The data obtained from a sample are shown below. Is there a significant
relationship between the number of absences and the student’s final grade at the 0.05 level?
What is the predicted final grade of a student who has 7 absences?

No. of absences 𝑥    Final grade 𝑦
10                   73
12                   70
2                    95
0                    92
8                    80
5                    82
4                    90
1                    91
2                    87
3                    85
