Module in Advanced Statistics Revised
MEYCAUAYAN COLLEGE
Graduate School Department
Calvario, City of Meycauayan, Bulacan
MODULE
in
STAT 202
(Advanced Statistics)
Prepared by
Table of Contents

Introduction to Statistics
Statistical Terms
What is probability?
Methods of Sampling from a Population
Data Types
Descriptive Statistics
Data Presentation
Central Tendency
Describing other Locations in a Distribution
Measures of Variability
Inferential Statistics
Normal Distribution
Normal Distribution, Standard Deviations
Standard Scores
t - score
Skewness
Kurtosis
Boxplot
Measures of Uncertainty
Hypothesis Testing
T-Tests and Z-Tests
Effect Size
Analysis of Variance (ANOVA)
Post-hoc test
Sign Tests
Mann-Whitney U test
The Wilcoxon Sign Test
Kruskal-Wallis Test
Friedman Test
Chi-Square
Correlation
The Pearson Product-Moment Correlation
Regression
Spearman Rank Correlation Coefficient
Phi Coefficient
Point-Biserial Correlation Coefficient
Tetrachoric Correlation Coefficient
Kendall Rank Correlation
References
INTRODUCTION TO STATISTICS
What is 'Statistics'?
Statistics is the science of collecting, analyzing, and making inferences from data. Statistical methods and
analyses are often used to communicate research findings, to support hypotheses, and to give credibility to
research methodology and conclusions. It is important for researchers, and also for consumers of research, to
understand statistics so that they can be informed, evaluate the credibility and usefulness of information, and make
appropriate decisions.
History of Statistics
Simple forms of statistics have been used since the beginning of civilization, when pictorial representations
or other symbols were used to record numbers of people, animals, and inanimate objects on skins, slabs, or sticks
of wood and the walls of caves.
Before 3000 BC the Babylonians used small clay tablets to record tabulations of agricultural yields and of
commodities bartered or sold.
The Egyptians analyzed the population and material wealth of their country before beginning to build the
pyramids in the 31st century BC.
The ancient Greeks held censuses to be used as bases for taxation as early as 594 BC.
The Roman Empire was the first government to gather extensive data about the population, area, and wealth
of the territories that it controlled.
Some scholars pinpoint the origin of statistics to 1662, with the publication of Natural and Political
Observations upon the Bills of Mortality by John Graunt.
Early applications of statistical thinking revolved around the needs of states to base policy on demographic
and economic data, hence its stat- etymology.
The scope of the discipline of statistics broadened in the early 19th century to include the collection and
analysis of data in general. Today, statistics is widely employed in government, business, and the natural and
social sciences.
Because of its empirical roots and its focus on applications, statistics is usually considered to be a distinct
mathematical science rather than a branch of mathematics. Its mathematical foundations were laid in the 17th
century with the development of probability theory by Blaise Pascal and Pierre de Fermat.
At present, statistics is a reliable means of describing accurately the values of economic, political, social,
psychological, biological, and physical data and serves as a tool to correlate and analyze such data. The work of
the statistician is no longer confined to gathering and tabulating data, but is chiefly a process of interpreting the
information.
Purposes of Statistics
Some of the major purposes of statistics are to help us understand and describe phenomena in our world
and to help us draw reliable conclusions about those phenomena.
Descriptive statistics aim to describe a large body of data with summary charts and tables, but do not
attempt to draw conclusions about the population from which the sample was taken. You are simply
summarizing the data you have with charts and graphs, like telling someone the key points of a book (an
executive summary) as opposed to handing them the whole book (the raw data).
STATISTICAL TERMS
Example 1:
What is the prevalence of smoking at Penn State University?
The main campus at Penn State University has a population of approximately 42,000 students. A research
question is "what proportion of these students smoke regularly?" A survey was administered to a sample
of 987 Penn State students. Forty-three percent (43%) of the sampled students reported that they smoked
regularly. How confident can we be that 43% is close to the actual proportion of all Penn State students
who smoke?
The population is all 42,000 students at Penn State University.
The parameter of interest is p, the proportion of students at Penn State University who smoke regularly.
The sample is a random selection of 987 students at Penn State University.
The statistic is the proportion, p̂, of the sample of 987 students who smoke regularly. The value of the
sample proportion is 0.43.
Example 2:
Are the grades of college students inflated?
Let's suppose that there exists a population of 7 million college students in the United States today. (The
actual number depends on how you define "college student.") And, let's assume that the average GPA
of all of these college students is 2.7 (on a 4-point scale). If we take a random sample of 100 college
students, how likely is it that the sampled 100 students would have an average GPA as large as 2.9 if the
population average was 2.7?
The population is all 7 million college students in the United States today.
The parameter of interest is µ, the average GPA of all college students in the United States today.
The sample is a random selection of 100 college students in the United States.
The statistic is the mean grade point average, x̅, of the sample of 100 college students. The value of the
sample mean is 2.9.
Variable – any characteristics, number, or quantity that can be measured or counted. A variable may also be called
a data item. Age, sex, business income and expenses, country of birth, capital expenditure, class grades, eye-color
and vehicle type are examples of variables.
Bias – refers to the tendency of a measurement process to over- or under-estimate the value of a population
parameter. A bias is the intentional or unintentional favoring of one group or outcome over other potential groups
or outcomes in the population.
Selection bias:
Non-representative sample – refers to when the method with which a sample is selected specifically
excludes certain groups from the research, whether intentionally or unintentionally.
Non-response bias – describes the members of a sample that do not choose to respond or participate
in the research and the characteristics of those members.
Voluntary bias – describes the members of a sample that choose to respond or participate; such
self-selected respondents may differ systematically from the rest of the population.
Response bias – can happen through design faults, such as constructing a survey with leading questions.
These questions encourage the answer expected from the researcher or try to get the participant to answer
in a certain way.
Exercises 1:
1. A school takes a poll to find out what students want to eat at lunch. Seventy (70) students are randomly
chosen from 1,400 to answer the poll questions. Write the correct answer from the given choices below
in each of the following items:
a. The population is the __________ students.
(70 or 1,400)
b. The sample is the ____________ students
(70 or 1,400)
c. The responses of the 70 students are a __________________.
(parameter or statistic)
2. A survey is handed out by loud volunteers on a street corner. Some people are suspicious of the volunteers
and choose not to participate in the survey. Which type of bias is illustrated?
WHAT IS PROBABILITY?
Probability theory – branch of mathematics concerned with the analysis of random phenomena. The outcome of
a random event cannot be determined before it occurs, but it may be any one of several possible outcomes. The
actual outcome is considered to be determined by chance. Probability theory provides the foundation for doing
statistics.
Random experiment – a physical situation whose outcome cannot be predicted until it is observed (e.g. toss a coin
once).
Sample space – a set of all possible outcomes of a random experiment (e.g. head and tail are the possible
outcomes).
Probability – how likely something is to happen or the chance that something will happen. How likely it is that
some event will occur. Sometimes probability is measured with a number like “10% chance”, or we can use words
such as impossible, unlikely, possibly, even chance, likely and certain.
Example 3:
Tossing a Coin: When a coin is tossed, there are two possible outcomes: heads (H) or tails (T). The
probability of the coin landing H is ½ and landing T is ½.
Throwing Dice: When a single die is thrown, there are six possible outcomes: 1, 2, 3, 4, 5, 6. The probability
of any one of them is ⅙.
In general, the probability of an event is:

P(event) = (number of favorable outcomes) ÷ (total number of possible outcomes)

In other words, it is the number of ways the event can happen divided by the total number of possible outcomes.
Example 4:
You are given a bag containing 15 equally sized marbles. You know there are 10 yellow marbles and 5 green
marbles in the bag. What is the probability that you would pull a yellow marble out, if you reach in the bag
and grab a marble at random?

P(yellow) = 10/15 = 2/3, or 66.67%
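The calculation above is simple enough to verify directly. Here is a minimal Python sketch (the marble counts come from Example 4; the code itself is illustrative, not part of the module):

```python
from fractions import Fraction

# Favorable outcomes over total outcomes.
yellow, green = 10, 5
p_yellow = Fraction(yellow, yellow + green)
print(p_yellow)          # 2/3
print(float(p_yellow))   # 0.6666... (about 66.67%)
```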
When we calculate probabilities involving one event AND another event occurring, we multiply their
probabilities.
In some cases, the first event happening impacts the probability of the second event. We call these dependent
events.
In the case of dependent events, the probability that both events occur is:

P(A and B) = P(A) ⋅ P(B∣A)

The vertical bar in P(B∣A) means "given," so this could also be read as "the probability that B occurs given
that A has occurred."
In other cases, the first event happening does not impact the probability of the second. We call these
independent events.
If A and B are two independent events in a probability experiment, then the probability that both events occur
simultaneously is:
P(A and B) = P(A) ⋅ P(B)
Example 5:
Suppose you take out two cards from a standard pack of cards one after another, without replacing the first
card. What is the probability that the first card is the ace of spades, and the second card is a heart?

The two events are dependent events because the first card is not replaced.

There is only one ace of spades in a deck of 52 cards. So:

P(1st card is the ace of spades) = 1/52

If the ace of spades is drawn first, then there are 51 cards left in the deck, of which 13 are hearts:

P(2nd card is a heart | 1st card is the ace of spades) = 13/51

Therefore, P(ace of spades, then a heart) = 1/52 × 13/51 = 13/2652 = 1/204.
Example 6: What would be the theoretical probability of randomly pulling a queen from a deck of 52 cards,
putting it back, randomly pulling a queen again, and so on until you have pulled 5 queens in a
row?
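Both examples can be checked with exact fractions. A minimal sketch (the card counts come from Examples 5 and 6; the code is illustrative):

```python
from fractions import Fraction

# Example 5 (dependent events): the first card is not replaced,
# so the second draw is conditioned on the first.
p_ace_of_spades = Fraction(1, 52)
p_heart_given_ace = Fraction(13, 51)        # 51 cards left, 13 hearts
print(p_ace_of_spades * p_heart_given_ace)  # 1/204

# Example 6 (independent events): the queen is replaced each time,
# so every one of the 5 draws has the same probability.
p_queen = Fraction(4, 52)
print(p_queen ** 5)                         # 1/371293
```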
Exercises 2:
1. What is the probability of drawing the ace of diamond from a deck of 52 cards?
2. There are 5 marbles in a box: 3 are blue, 1 is yellow and 1 is red. What is the probability that a blue marble
gets picked?
3. What is the probability of picking a red or green marble from a bag with 5 red, 7 green, 6 blue, and 14 yellow
marbles in it?
4. What is the probability of shaking the hand of a student wearing red if you randomly shake the hand of
one person in a room containing the following mix of students?
13 female students wearing blue
7 male students wearing blue
6 female students wearing red
9 male students wearing red
18 female students wearing green
21 male students wearing green
5. A quiz has 2 true-false questions and 3 multiple choice questions that each have 4 choices of answers.
What is the probability that someone guessing all 5 answers will get 100% on the quiz?
METHODS OF SAMPLING FROM A POPULATION

Probability Sampling

1. Random sampling. The most popular type of random or probability sampling. In this technique, each member
of the population has an equal chance of being selected as a subject. The entire process of sampling is done in a
single step, with each subject selected independently of the other members of the population. The most
primitive and mechanical example is the lottery method.
2. Systematic sampling. Frequently chosen by researchers for its simplicity and its periodic quality. The
researcher first randomly picks the first item or subject from the population. Then, the researcher will select
each nth subject from the list. For example, the researcher has a population total of 100 individuals and needs
12 subjects. He first picks 5 as his starting number and 8 as his interval. Hence, the members of his sample
will be individuals 5, 13, 21, 29, 37, 45, 53, 61, 69, 77, 85, 93.
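The arithmetic of the example is easy to reproduce. A minimal sketch (start, interval, and sample size are taken from the example above; the code is illustrative):

```python
# Systematic sampling: start at a randomly chosen index, then take
# every k-th member (start = 5, interval = 8, 12 subjects out of 100).
start, interval, n_subjects = 5, 8, 12
sample = [start + interval * i for i in range(n_subjects)]
print(sample)   # [5, 13, 21, 29, 37, 45, 53, 61, 69, 77, 85, 93]
```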
3. Stratified sampling. The researcher divides the entire population into different subgroups or strata, then
randomly selects the final subjects proportionally from the different strata. The most common strata used in
stratified random sampling are age, gender, socioeconomic status, religion, nationality and educational
attainment.
Proportionate Stratified Random Sampling. The sample size of each stratum is proportionate to the population
size of the stratum when viewed against the entire population. This means that each stratum has the same
sampling fraction. For example, you have 3 strata with 100, 200 and 300 population sizes respectively. And
the researcher chose a sampling fraction of ½. Then, the researcher must randomly sample 50, 100 and 150
subjects from each stratum respectively.
Stratum              A     B     C
Population Size      100   200   300
Sampling Fraction    ½     ½     ½
Final Sample Size    50    100   150
The important thing to remember in this technique is to use the same sampling fraction for each stratum
regardless of the differences in population size of the strata.
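The proportionate allocation can be computed mechanically. A minimal sketch (the stratum sizes and the ½ fraction come from the table above; the code is illustrative):

```python
from fractions import Fraction

# The same sampling fraction is applied to every stratum.
strata = {"A": 100, "B": 200, "C": 300}
fraction = Fraction(1, 2)
final_sizes = {name: int(pop * fraction) for name, pop in strata.items()}
print(final_sizes)   # {'A': 50, 'B': 100, 'C': 150}
```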
Disproportionate Stratified Random Sampling. The difference between proportionate and disproportionate
stratified random sampling is their sampling fractions. With disproportionate sampling, the different strata
have different sampling fractions.
4. Cluster sampling. The researcher takes several steps in gathering his sample population. First, the researcher
selects groups or clusters then from each cluster, the researcher selects the individual subjects by either simple
random or systematic random sampling. The researcher can even opt to include the entire cluster and not just
a subset from it. The most common cluster used in research is a geographical cluster. For example, a researcher
wants to survey academic performance of high school students in Spain. He can divide the entire population
into different clusters (cities). Then he selects a number of clusters, depending on his research, through simple or
systematic random sampling. From the selected clusters (randomly selected cities) the researcher can either
include all the high school students as subjects or he can select a number of subjects from each cluster through
simple or systematic random sampling.
Non-probability Sampling

1. Convenience sampling - subjects are selected because of their convenient accessibility and proximity to the
researcher. One of the most common examples of convenience sampling is using student volunteers as subjects
for the research. Another example is using subjects that are selected from a clinic, a class or an institution that
is easily accessible to the researcher. A more concrete example is choosing five people from a class or choosing
the first five names from the list of patients.
2. Sequential sampling - the researcher picks a single subject or a group of subjects in a given time interval, conducts his
study, analyzes the results, then picks another group of subjects if needed, and so on. In the sequential sampling
technique there exists another step, a third option: the researcher can accept the null hypothesis, accept his
alternative hypothesis, or select another pool of subjects and conduct the experiment once again. This entails
that the researcher can obtain a limitless number of subjects before finally making a decision whether to accept
his null or alternative hypothesis.
3. Quota sampling - the assembled sample has the same proportions of individuals as the entire population with
respect to known characteristics, traits or focused phenomenon. For example, an interviewer has to survey
people about a cosmetic brand. His population is people in a certain city between 35 and 45 years old. The
interviewer might decide they want two survey subgroups — one male, and the other female — each with 100
people. (These subgroups are mutually exclusive since people cannot be male and female at the same time.)
After choosing these subgroups, the interviewer has the liberty to rely on his convenience or judgment factors
to find people for each subset. For example, the interviewer could stand on the street and interview people or
he can interview people at his workplace who fit the subgroup criteria.
4. Judgmental (also known as purposive) sampling - the researcher selects units to be sampled based on their
knowledge and professional judgment. It is used in cases where the specialty of an authority can select a more
representative sample, bringing more accurate results than other sampling techniques.
It is also possible to use judgmental sampling if the researcher knows a reliable professional or authority that
he thinks is capable of assembling a representative sample. For example, in a study wherein a researcher wants
to know what it takes to graduate summa cum laude in college, the only people who can give the researcher
firsthand advice are the individuals who graduated summa cum laude.
5. Snowball sampling - is used by researchers to identify potential subjects in studies where subjects are hard to
locate. Researchers use this sampling method if the sample for the study is very rare or is limited to a very small
subgroup of the population. This type of sampling technique works like chain referral. After observing the initial
subject, the researcher asks for assistance from the subject to help identify people with a similar trait of interest.
The process of snowball sampling is much like asking your subjects to nominate another person with the same
trait as your next subject. The researcher then observes the nominated subjects and continues in the same way
until obtaining a sufficient number of subjects. For example, when obtaining subjects for a study that aims to
observe a rare disease, the researcher may opt to use snowball sampling since it will be difficult to obtain
subjects otherwise.
DATA TYPES
Data types are an important concept in statistics, which you need to understand in order to apply the correct
statistical measurements to your data and draw valid conclusions from it.
1. Quantitative (or numeric) data deals with numbers and things you can measure objectively: dimensions such
as height, width and length; temperature; humidity; prices; area and volume. Quantitative data could be discrete
or continuous. Discrete data is a count that can't be made more precise. Typically, it involves integers. For
instance, the number of children (or adults, or pets) in your family is discrete data, because you are counting
whole, indivisible entities: you can't have 2.5 kids, or 1.3 pets. Continuous data, on the other hand, could be
divided and reduced to finer and finer levels. For example, you can measure the height of your kids at
progressively more precise scales—meters, centimeters, millimeters, and beyond—so height is continuous data.
Interval are numeric scales in which we know not only the order, but also the exact differences between the
values. The classic example of an interval scale is Celsius temperature because the difference between each
value is the same. For example, the difference between 60 and 50 degrees is a measurable 10 degrees, as is
the difference between 80 and 70 degrees. Time is another good example of an interval scale in which the
increments are known, consistent, and measurable.
You can remember the key points of an “interval scale” pretty easily. “Interval” itself means “space in
between,” which is the important thing to remember–interval scales not only tell us about order, but also
about the value between each item. The problem with interval scales is they don’t have a “true zero.” For
example, there is no such thing as “no temperature.” Without a true zero, it is impossible to compute ratios.
With interval data, we can add and subtract, but cannot multiply or divide. Confused? Ok, consider this:
10 degrees + 10 degrees = 20 degrees. No problem there. 20 degrees is not twice as hot as 10 degrees,
however, because there is no such thing as “no temperature” when it comes to the Celsius scale.
Ratio scales are the ultimate nirvana when it comes to measurement scales because they tell us about the
order, they tell us the exact value between units, AND they also have an absolute zero–which allows for a
wide range of both descriptive and inferential statistics to be applied. Ratio scales have a clear definition of
zero. Good examples of ratio variables include height and weight.
Ratio scales provide a wealth of possibilities when it comes to statistical analysis. These variables can be
meaningfully added, subtracted, multiplied, divided (ratios).
2. Qualitative data deals with characteristics and descriptors that can't be easily measured, but can be observed
subjectively—such as smells, tastes, textures, attractiveness, and color.
Nominal scales are used for labeling variables, without any quantitative value. “Nominal” scales could
simply be called “labels.” Here are some examples, below. Notice that all of these scales are mutually
exclusive (no overlap) and none of them have any numerical significance. A good way to remember all of
this is that “nominal” sounds a lot like “name” and nominal scales are kind of like “names” or labels. A sub-
type of nominal scale with only two categories (e.g., male/female) is called “dichotomous.”
Ordinal scales in which the order of the values is what’s important and significant, but the differences
between each one is not really known. Take a look at the example below. In each case, we know that a #4
is better than a #3 or #2, but we don’t know–and cannot quantify–how much better it is. For example, is
the difference between “OK” and “Unhappy” the same as the difference between “Very Happy” and
“Happy?” We can’t say. Ordinal scales are typically measures of non-numeric concepts like satisfaction,
happiness, discomfort, etc.
“Ordinal” is easy to remember because it sounds like “order” and that’s the key to remember with “ordinal
scales”–it is the order that matters, but that’s all you really get from these.
Broadly speaking, when you measure something and give it a number value, you create quantitative data. When you
classify or judge something, you create qualitative data. But this is just the highest level of data: there are also different
types of quantitative and qualitative data.
DESCRIPTIVE STATISTICS
Data Presentation
Data can be defined as groups of information that represent the qualitative or quantitative attributes of a
variable or set of variables, which is the same as saying that data can be any set of information that describes a
given entity. Data in statistics can be classified into grouped data and ungrouped data.
Ungrouped data - Any data that you first gather. It is data in its raw form. An example of ungrouped data is any
list of numbers that you can think of.
Array - An arrangement of ungrouped data in ascending or descending order of magnitude is called an
array.
Example 7: The marks obtained by 20 students in a class in a certain examination are given below:
21, 23, 19, 17, 12, 15, 15, 17, 17, 19, 23, 23, 21, 23, 25, 25, 21, 19, 19, 19
Arranging the marks of the 20 students in ascending order, we get the following array:
12, 15, 15, 17, 17, 17, 19, 19, 19, 19, 19, 21, 21, 21, 23, 23, 23, 23, 25, 25
Frequency distribution table or Frequency chart for raw data using tally mark
A frequency is the number of times a data value occurs. For example, if ten students score 80 in statistics,
then the score of 80 has a frequency of 10. Frequency is often represented by the letter f.
A frequency chart is made by arranging data values in ascending order of magnitude along with their
frequencies.
We take each observation from the data, one at a time, and indicate the frequency (the number of times
the observation has occurred in the data) by small lines, called tally marks. For convenience, we write tally
marks in bunches of five, the fifth one crossing the fourth diagonally. In the table formed, the sum of all
the frequencies is equal to the total number of observations in the given data.
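Such a chart is also easy to build programmatically. A minimal sketch using made-up scores (not the module's data), for illustration only:

```python
from collections import Counter

scores = [7, 9, 7, 8, 10, 9, 7, 8, 9, 9]
freq = Counter(sorted(scores))
for value, f in freq.items():
    print(value, "|" * f, f)   # value, tally marks, frequency

# The frequencies sum to the total number of observations.
assert sum(freq.values()) == len(scores)
```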
Exercises 3: In the table below, make a frequency chart of the data in Example 7.
Grouped data is data that has been organized into groups known as classes. Grouped data has been 'classified'
and thus some level of data analysis has taken place, which means that the data is no longer raw.
When the set of data values are spread out, it is difficult to set up a frequency table for every data value as
there will be too many rows in the table. So, we group the data into class intervals (or groups) to help us
organize, interpret and analyze the data.
Each class is bounded by two figures, which are called class limits. The figure on the left side of a class is called
its lower limit and that on its right is called its upper limit. Ideally, we should have between five and ten rows
in a frequency table. Bear this in mind when deciding the size of the class interval (or group).
Types of Grouped Frequency Distribution
1. Exclusive Form (or Continuous Interval Form): A frequency distribution in which the upper limit of
each class is excluded and lower limit is included, is called an exclusive form.
Example 8: Suppose the marks obtained by some students in an examination are given. We may
consider the classes 0 – 10, 10 – 20, etc. In class 0 – 10, we include 0 and exclude 10. In class 10 –
20, we include 10 and exclude 20.
2. Inclusive Form (or Discontinuous Interval Form): A frequency distribution in which each upper limit
as well as each lower limit is included, is called an inclusive form.
Example 9: Classes of the form 0 – 10, 11 – 20, 21 – 30, etc. In 0 – 10, both 0 and 10 are included.
Here, a class 360 – 369 means marks obtained from 360 to 369, including both.
Given a set of raw or ungrouped data, how would you group that data into suitable classes that are easy to
work with and at the same time meaningful?
The first step is to determine how many classes you want to have. Next, you subtract the lowest value in
the data set from the highest value in the data set and then you divide by the number of classes that you
want to have. For example, with a highest value of 48, a lowest value of 8, and 10 desired classes:

c.i. = (48 – 8) ÷ 10 = 4
Class Interval Tally Frequency
Boundaries
The true values which describe the actual class limits of a class are called class boundaries. The smallest
true value is called the lower-class boundary and the largest true value is called the upper-class boundary
of the class.
In exclusive form, the lower and upper limits are known as true lower limit (lower boundary) and true
upper limit (upper boundary) of the class interval. Thus, class limits of 10 - 20 class intervals in the
exclusive form are 10 and 20.
In inclusive form, class boundaries are obtained by subtracting 0.5 from the lower limit and adding
0.5 to the upper limit. Thus, the class boundaries of the 10 – 20 class interval in the inclusive form are 9.5 and 20.5.
Example 10: Inclusive class intervals and true/actual class limits or boundaries
Consider the class interval 170 – 174: its lower boundary is 169.5 and its upper boundary is 174.5.
Cumulative Frequency is the total of a frequency and all frequencies so far in a frequency distribution. It is the
'running total' of frequencies. The last entry of the cumulative frequency column equals the total
number of observations in the data.
A relative frequency is the fraction of times an answer occurs. To find the relative frequencies, divide each
frequency by the total frequency in the sample. Relative frequencies can be written as fractions, percent, or
decimals. Cumulative relative frequency is the accumulation of the previous relative frequencies. The last
entry of the cumulative relative frequency column is one, indicating that one hundred percent of the data
has been accumulated.
Example 11: The Cumulative Frequency column lists the total of each frequency added to its
predecessor. Thus, the class interval 20 – 24 has a cumulative frequency of 15, calculated
as 4+6+3+2 or simply 13+2.
The Relative Frequency of class interval 20 – 24 is 2 divided by 25, or 0.08. Its Cumulative
Relative Frequency is 0.60, calculated as 0.16+0.24+0.12+0.08 or simply 0.52+0.08.
Class Interval (ci)   Frequency (f)   Cumulative Frequency (F)   Relative Frequency   Cumulative Relative Frequency
5 – 9                 4               4                          0.16                 0.16
10 – 14               6               10                         0.24                 0.40
15 – 19               3               13                         0.12                 0.52
20 – 24               2               15                         0.08                 0.60
25 – 29               6               21                         0.24                 0.84
30 – 34               4               25                         0.16                 1.00
n or Σf = 25
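The whole table can be generated from the raw frequencies. A minimal sketch (the intervals and frequencies come from Example 11; the code is illustrative):

```python
intervals = ["5-9", "10-14", "15-19", "20-24", "25-29", "30-34"]
freqs = [4, 6, 3, 2, 6, 4]
n = sum(freqs)   # 25

running = 0
for ci, f in zip(intervals, freqs):
    running += f   # cumulative frequency
    print(ci, f, running, round(f / n, 2), round(running / n, 2))
```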
Bar graph - A bar graph is a way of summarizing a set of categorical data. It displays the data using a number
of rectangles, of the same width, each of which represents a particular category. Bar graphs can be displayed
horizontally or vertically and they are usually drawn with a gap between the bars (rectangles).

Histogram - A histogram is a way of summarizing data that are measured on an interval scale (either discrete
or continuous). It is often used in exploratory data analysis to illustrate the features of the distribution of the
data in a convenient form.

Pie chart - A pie chart is used to display a set of categorical data. It is a circle, which is divided into segments.
Each segment represents a particular category.

Line graph - A line graph is particularly useful when we want to show the trend of a variable over time. Time
is displayed on the horizontal axis (x-axis) and the variable is displayed on the vertical axis (y-axis).
Stem and leaf plot graphs – are usually used when there are large amounts of numbers to analyze. Some examples
of common uses of these graphs are to track series of scores on sports teams, series of temperatures or rainfall over
a period of time, and series of classroom test scores.
Example 12:
Consider the following test scores:
50, 50, 52, 58, 58, 61, 64, 64, 67, 68, 72, 74, 76, 78, 78, 79, 83, 85, 92, 92, 96, 98
Test Scores Out Of 100

Stem   Leaf
5      00288
6      14478
7      246889
8      35
9      2268

Here, the stem shows the 'tens' digit and the leaf the 'ones' digit. At a glance, one can see that 4 students
got a mark in the 90s on their test out of 100. Two students received the same mark of 92; no marks were
received that fell below 50; and no mark of 100 was received.

When you count the total number of leaves, you know how many students took the test. As you can tell,
stem and leaf plots provide an "at a glance" tool for specific information in large sets of data. Otherwise,
one would have a long list of marks to sift through and analyze.
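Constructing the display is a one-pass grouping of each score by its tens digit. A minimal sketch (the scores come from Example 12; the code is illustrative):

```python
from collections import defaultdict

scores = [50, 50, 52, 58, 58, 61, 64, 64, 67, 68, 72,
          74, 76, 78, 78, 79, 83, 85, 92, 92, 96, 98]

leaves = defaultdict(list)
for s in scores:
    leaves[s // 10].append(s % 10)   # stem = tens digit, leaf = ones digit

for stem in sorted(leaves):
    print(stem, "".join(str(leaf) for leaf in sorted(leaves[stem])))
# Prints 5 00288 through 9 2268, matching the display above.
```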
There is a variation of stem and leaf displays that is useful for comparing distributions. The two distributions are
placed back-to-back along a common column of stems. The result is a “back-to-back stem and leaf graph.”
Example 13: The back-to-back stem and leaf graph shown below compares the numbers of touchdown passes
(TD passes) in the 1998 and 2000 seasons in the National Football League. The stems are in the middle, the
leaves to the left are for the 1998 data, and the leaves to the right are for the 2000 data. For example, the
second-to-last row shows that in 1998 there were teams with 11, 12, and 13 TD passes, and in 2000 there were
two teams with 12 and three teams with 14 TD passes.
Back-to-back stem and leaf display. The left side shows the 1998 TD data and the right side shows the 2000
TD data.

1998 leaves   Stem   2000 leaves
       11      4
               3     7
      332      3     233
     8865      2     889
 44331110      2     001112223
987776665      1     56888899
      321      1     22444
        7      0     69

The graph helps us see that the two seasons were similar, but that only in 1998 did any teams throw more
than 40 TD passes.
Exercises 5: Complete a back-to-back stem-and-leaf plot for the following two lists of class sizes:
Economics: 9, 13, 14, 15, 16, 16, 17, 19, 20, 21, 21, 22, 25, 25, 26
Central Tendency
Central tendency (sometimes called “measures of location,” “central location,” or just “center”) is a way to
describe what’s typical for a set of data. Central tendency doesn’t tell you specifics about the individual pieces
of data, but it does give you an overall picture of what is going on in the entire data set. It is a single value that
attempts to describe a set of data by identifying the central position within that set of data. As such, measures
of central tendency are sometimes called measures of central location.
There are three main measures of central tendency: the mode, the median and the mean. Each of these measures
describes a different indication of the typical or central value in the distribution.
1. Mean
The sum of the value of each observation in a dataset divided by the number of observations. This is
also known as the arithmetic average.
Advantage of the Mean: The mean can be used for both continuous and discrete numeric data.
Limitations of the Mean:
The mean cannot be calculated for categorical data, as the values cannot be summed.
As the mean includes every value in the distribution, the mean is influenced by outliers and skewed
distributions.

2. Median
The median is the middle value in a distribution when the values are arranged in ascending or
descending order. It divides the distribution in half (there are 50% of observations on either side of
the median value). In a distribution with an odd number of observations, the median value is the
middle value; with an even number of observations, the median is the average of the two middle values.
Advantage of the Median: The median is less affected by outliers and skewed data than the mean,
and is usually the preferred measure of central tendency when the distribution is not symmetrical.
Limitation of the Median: The median cannot be identified for categorical nominal data, as it cannot be
logically ordered.

3. Mode
The mode is the value that occurs most often in the data.
Advantage of the Mode: It can be found for both numerical and categorical (non-numerical) data.
Limitations of the Mode:
In some distributions, the mode may not reflect the center of the distribution very well.
It is also possible for there to be more than one mode for the same distribution of data (bi-modal,
or multi-modal).
In some cases, particularly where the data are continuous, the distribution may have no mode at
all (i.e., if all values are different).
How does the shape of a distribution influence the Measures of Central Tendency?

Symmetrical distributions: When a distribution is symmetrical, the mode, median and mean are all in the
middle of the distribution.

Skewed distributions: When a distribution is skewed, the mode remains the most commonly occurring value,
the median remains the middle value in the distribution, but the mean is generally 'pulled' in the direction of
the tails. In a skewed distribution, the median is often a preferred measure of central tendency, as the mean is
not usually in the middle of the distribution.

Positively or Right Skewed: The tail on the right side of the distribution is longer than the left side. The mean
is 'pulled' toward the right tail of the distribution. Generally, most of the values, including the median value,
tend to be less than the mean value.

Negatively or Left Skewed: The tail on the left side of the distribution is longer than the right side. The mean
is 'pulled' toward the left tail of the distribution. Generally, most of the values, including the median value,
tend to be greater than the mean value.

The position of the median is the {(n + 1) ÷ 2}th value, where n is the number of values in a set of data.
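Python's standard library computes all three measures directly. A minimal sketch with made-up values (not the exercise data), for illustration:

```python
import statistics

data = [3, 7, 7, 9, 12]
print(statistics.mean(data))     # 7.6  (sum divided by count)
print(statistics.median(data))   # 7    (middle of the sorted values)
print(statistics.mode(data))     # 7    (most frequent value)
```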
Exercises 6 (Raw Data): The ages of 15 randomly selected customers at a local Best Buy are listed below:
23, 21, 29, 24, 31, 21, 27, 23, 24, 32, 33, 19, 24, 21, 31
Determine the mean, median and mode of the data.
Exercises 7 (Frequency Table): Determine the mean, median and mode of a simple frequency distribution of
the retirement age data.
Mean - There are also some special mathematical symbols used in the MEAN formula: the sample mean is
written x̅ = Σx ÷ n, and for a frequency distribution x̅ = Σ(f·x) ÷ Σf, where f is the frequency of each value x.
Median
Step 1: Construct the cumulative frequency distribution (Cumulative means increasing or "how much so far").
In statistics, it is the running total of all frequencies. Cumulative frequency corresponding to a
particular value is the sum of all the frequencies up to and including that value.
Step 2: Decide the class that contains the median. The class median is the first class whose cumulative
frequency is at least n/2.
Step 3: Find the median by using the following formula:

Median = Lmed + ((n/2 – Fbef) ÷ fm) × i

where:
Lmed = the lower boundary of the class median (e.g. for classes 1-5 and 6-10, the
lower boundary of c.i. 6-10 is 5.5)
n = the total frequency
Fbef = the cumulative frequency before the class median
fm = the frequency of the class median
i = the class width (size of the class interval)
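The formula translates directly into a small function. A sketch with hypothetical values (the class figures below are made up for illustration):

```python
def grouped_median(L_med, n, F_bef, f_m, i):
    # Median = L_med + ((n/2 - F_bef) / f_m) * i
    return L_med + ((n / 2 - F_bef) / f_m) * i

# Hypothetical: median class 6-10 (lower boundary 5.5), n = 25,
# cumulative frequency before the class = 10, class frequency = 6,
# class width = 5.
print(grouped_median(5.5, 25, 10, 6, 5))   # 7.58...
```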
Mode
Mode is the value that has the highest frequency in a data set.
For grouped data, class mode (or, modal class) is the class with the highest frequency.
To find the mode for grouped data, use the following formula:

Mode = Lmo + (Δ1 ÷ (Δ1 + Δ2)) × i

where:
Lmo = the lower boundary of the modal class
Δ1 = the frequency of the modal class minus the frequency of the class before it
Δ2 = the frequency of the modal class minus the frequency of the class after it
i = the class width
Describing other Locations in a Distribution

Example 14: The following are the numbers of defective items produced in a month by a machine for the last 24
months. The data are arranged in ascending order. Calculate the second quartile (Q2) and the fifth
decile (D5).
10, 11, 14, 16, 17, 18, 19, 21, 21, 23, 24, 26, 29, 30, 32, 33, 34, 35, 36, 37, 39, 40, 42, 45
The test scores of a sample of 20 students in a class are as follows:
20, 30, 21, 29, 10, 17, 18, 15, 27, 25, 16, 15, 19, 22, 13, 17, 14, 18, 12, 9
Find the value of P10, P20 and P80.

Solution:
Arrange the data in ascending order:
9, 10, 12, 13, 14, 15, 15, 16, 17, 17, 18, 18, 19, 20, 21, 22, 25, 27, 29, 30

Tenth percentile P10: the position of P10 is 10(n + 1)/100 = 10(21)/100 = 2.1, so P10 = 10 + 0.1(12 – 10) = 10.2.
Thus, the lower 10% of the students had test scores less than or equal to 10.2.

Twentieth percentile P20: the position of P20 is 20(21)/100 = 4.2, so P20 = 13 + 0.2(14 – 13) = 13.2.
Thus, the lower 20% of the students had test scores less than or equal to 13.2.

Eightieth percentile P80: the position of P80 is 80(21)/100 = 16.8, so P80 = 22 + 0.8(25 – 22) = 24.4.
Thus, the lower 80% of the students had test scores less than or equal to 24.4.
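The position rule p(n + 1)/100 with linear interpolation can be coded once and reused. A minimal sketch (the data come from the example above; the helper function is illustrative):

```python
def percentile(sorted_data, p):
    # Position = p(n + 1)/100, linearly interpolated between the two
    # neighboring observations.
    pos = p * (len(sorted_data) + 1) / 100
    k, frac = int(pos), pos - int(pos)
    if k == 0:
        return sorted_data[0]
    if k >= len(sorted_data):
        return sorted_data[-1]
    return sorted_data[k - 1] + frac * (sorted_data[k] - sorted_data[k - 1])

scores = sorted([20, 30, 21, 29, 10, 17, 18, 15, 27, 25,
                 16, 15, 19, 22, 13, 17, 14, 18, 12, 9])
print(percentile(scores, 10))   # 10.2
print(percentile(scores, 20))   # 13.2
print(percentile(scores, 80))   # 24.4
```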
Exercises 9: The following data are marks obtained by 20 students in a test of statistics: (a) determine the Q1,
and Q3; (b) Calculate third decile; and (c) Calculate 70th percentiles
53 74 82 42 39 20 81 68 58 28
67 54 93 70 30 55 36 38 29 61
Exercises 10:
Shown below is a data on Number of births to women by current age. Calculate (a) Q1 and Q3; (b) D2 and D8;
(c) and P10 and P85.
Measures of Variability
Variability refers to how spread out a group of data is. In other words, variability measures how much your scores
differ from each other. Variability is also referred to as dispersion or spread. Data sets with similar values are said
to have little variability, while data sets that have values that are spread out have high variability.
There are four measures of variability: the range, interquartile range, variance, and standard deviation.
Range is simply the highest score minus the lowest score. It is easy to calculate but very much affected by
extreme values (range is not a resistant measure of variability).
Range of ungrouped data:
R = maximum – minimum
Range of grouped data:
R = upper boundary of the highest interval – lower boundary of the lowest interval
The interquartile range (IQR) is the difference between upper and lower quartiles and denoted as IQR. In
some texts the interquartile range is defined differently. It is defined as the difference between the largest and
smallest values in the middle 50% of a set of data. IQR is not affected by extreme values. It is thus a resistant
measure of variability.
The standard deviation is a measure that summarizes the amount by which every value within a dataset varies
from the mean. Effectively it indicates how tightly the values in the dataset are bunched around the mean
value. It is the most robust and widely used measure of dispersion since, unlike the range and inter-quartile
range, it takes into account every variable in the dataset. When the values in a dataset are pretty tightly
bunched together the standard deviation is small. When the values are spread apart the standard deviation will
be relatively large. The standard deviation is usually presented in conjunction with the mean and is measured
in the same units.
The Variance is defined as the average of the squared differences from the Mean.
The variance of an entire population is known as σ² and is calculated using:

σ² = Σ(x – μ)² ÷ N

where x represents each value in the population, μ is the mean value of the population, Σ is the summation
(or total), and N is the number of values in the population.

The variance of a sample is known as s² and is calculated using:

s² = Σ(x – x̅)² ÷ (n – 1)

where x represents each value in the sample, x̅ is the mean value of the sample, Σ is the summation (or
total), and n – 1 is the number of values in the sample minus 1.

The standard deviation s of a sample is calculated in a similar manner as for the entire population, σ.
However, while the sample mean is an unbiased estimator of the population mean, the same is not true for
the standard deviation. If one looked at all possible samples of size n, the value would not be equal to the
true value of the population standard deviation; it would be biased. This bias can be corrected by using n – 1.

The formulas for variance and standard deviation change slightly if observations are grouped into a
frequency table. If the data are grouped into class intervals, x is the midpoint of the class:

s² = Σf(x – x̅)² ÷ (Σf – 1) and s = √[Σf(x – x̅)² ÷ (Σf – 1)]
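The n – 1 divisor is easy to see in code. A minimal sketch with made-up values (not the exercise data):

```python
import statistics

data = [4, 8, 6, 5, 3, 7]
mean = statistics.mean(data)                                  # 5.5
var_s = sum((x - mean) ** 2 for x in data) / (len(data) - 1)  # divide by n - 1
print(var_s)                     # 3.5, same as statistics.variance(data)
print(statistics.stdev(data))    # about 1.87, the square root of 3.5
```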
Exercises 11 (Ungrouped Data): Shown below are the examination marks for 20 students following a particular module.
60 74 76 78 66 68 50 56 58 43 48 50 59 62 70 71 80 65 52 52
Determine the values of (a) Range; (b) IQR; (c) Standard Deviation; and (d) Variance using the table below
n = 20
Mark (x)        x – x̅        (x – x̅)²
Σx =                          Σ(x – x̅)² =
220 students were asked the number of hours per week they spent watching television. With this
information, calculate the mean and standard deviation of hours spent watching television by the 220
students. Determine the values of (a) Range; (b) IQR; (c) Standard Deviation; and (d) Variance
INFERENTIAL STATISTICS
Normal Distribution
Data can be "distributed" (spread out) in different ways.
It can be spread out more on the left, or more on the right, or it can be all jumbled up.
But there are many cases where the data tends to be around a central value with no bias left or right, and it gets
close to a "Normal Distribution" like this:
68% of values are within 1 standard deviation of the mean; 95% of values are within 2 standard deviations
of the mean; and 99.7% of values are within 3 standard deviations of the mean.
Example 16: 95% of students at school are between 1.1 m and 1.7 m tall. Assuming this data is normally
distributed, can you calculate the mean and standard deviation?

The mean is halfway between 1.1 m and 1.7 m:
Mean = (1.1 m + 1.7 m) ÷ 2 = 1.4 m

95% is 2 standard deviations either side of the mean (a total of 4 standard deviations), so:
1 SD = (1.7 m – 1.1 m) ÷ 4 = 0.6 m ÷ 4 = 0.15 m
It is good to know the standard deviation, because we can say that any value is:
likely to be within 1 standard deviation (68 out of 100 should be)
very likely to be within 2 standard deviations (95 out of 100 should be)
almost certainly within 3 standard deviations (997 out of 1000 should be)
Standard Scores
The number of standard deviations from the mean is also called the "Standard Score", "sigma" or "z-score".
A raw value x is converted to a z-score with z = (x – mean) ÷ SD, which re-expresses a Normal Distribution
as a Standard Normal Distribution (mean 0, standard deviation 1).

How to Use the Z-Score Table:
Find the area corresponding to the z-score.
Draw a valid conclusion.
The Standard Normal Table provides three significant digits: two in the first column and the third along the
top row. (Example: for a z of 1.64 the probability is 0.9495.) The table entry for z is the area under the
standard normal curve to the left of z.
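The table lookup can be reproduced with scipy's normal CDF. A minimal sketch (the 1.64 entry comes from the text above; the height values reuse Example 16 for illustration):

```python
from scipy.stats import norm

# Area under the standard normal curve to the left of z = 1.64.
print(round(norm.cdf(1.64), 4))   # 0.9495

# Converting a raw value to a z-score: z = (x - mean) / SD.
x, mean, sd = 1.55, 1.4, 0.15
print((x - mean) / sd)            # 1.0, i.e. one SD above the mean
```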
Z-Score Table
Exercises 14: Refer to example 16. In that same school your friends Sam and Anthony are 1.75 m and 1.35
m tall respectively. Suppose there are 2000 students in the school, how tall are they compared to
the other students?
Exercises 15: What is the probability of selecting a positive z-score less than z= 1.00?
t - score
In probability and statistics, the t-distribution is any member of a family of continuous probability distributions
that arises when estimating the mean of a normally distributed population in situations where the sample size is
small and population standard deviation is unknown.
A t-score is a form of a standardized test statistic, which allows you to take an individual score and transform it
into a standardized form to make comparison easier.
The shape of the t distribution depends on the degrees of freedom (df) that went into the estimate of the standard
deviation.
Degrees of freedom (df) is the number of values in the final calculation of a statistic that are free to vary.
The overall shape of the probability density function of the t distribution
resembles the bell shape of a normally distributed variable with mean 0 and
variance 1, except that it is a bit lower and wider. As the number of degrees
of freedom grows, the t distribution approaches the standard normal
distribution, and in fact the approximation is quite close for df ≥ 30.
P(T ≤ t)

df    p=.55   p=.60   p=.65   p=.70   p=.75   p=.80   p=.85   p=.90   p=.95   p=.99   p=.995   p=.996
9     0.129   0.261   0.398   0.543   0.703   0.883   1.100   1.383   1.833   2.821   3.250    3.390
14    0.128   0.258   0.393   0.537   0.692   0.868   1.076   1.345   1.761   2.624   2.977    3.089
19    0.127   0.257   0.391   0.533   0.688   0.861   1.066   1.328   1.729   2.539   2.861    2.962
t-scores (Generated from T Distribution Calculator @ https://stattrek.com/online-calculator/t-distribution.aspx)
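The same values can be generated with scipy instead of an online calculator. A minimal sketch reproducing one table entry:

```python
from scipy.stats import t

# At df = 9, the t-value with P(T <= t) = 0.95 is about 1.833,
# and the CDF maps it back to 0.95.
print(round(t.ppf(0.95, df=9), 3))    # 1.833
print(round(t.cdf(1.833, df=9), 3))   # 0.95
```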
Exercises 17: Suppose scores on an IQ test are normally distributed, with a mean of 100. Suppose 20 people
are randomly selected and tested. The standard deviation in the sample group is 15. What is the
probability that the average test score in the sample group will be at most 110?
Skewness
Data can be "skewed", meaning it tends to have a long tail on one side or the other:
https://www.mathsisfun.com/data/skewness.html
The mean, mode and median can be used to figure out if you have a positively or negatively skewed distribution.
If the mean is greater than the mode, the
distribution is positively skewed.
If the mean is less than the mode, the
distribution is negatively skewed.
If the mean is greater than the median, the
distribution is positively skewed.
If the mean is less than the median, the
distribution is negatively skewed.
Most software packages use a formula for the skewness that takes into account sample size.
Let x₁, x₂, ..., xₙ be n observations. Then:

Skewness = [√n × Σᵢ₌₁ⁿ (xᵢ – x̅)³] ÷ [Σᵢ₌₁ⁿ (xᵢ – x̅)²]^(3/2)
How to interpret:
If skewness is less than −1 or greater than +1, the distribution is highly skewed.
If skewness is between −1 and −½ or between +½ and +1, the distribution is moderately skewed.
If skewness is between −½ and +½, the distribution is approximately symmetric.
Caution: This is an interpretation of the data you actually have. When you have data for the whole population,
that’s fine. But when you have a sample, the sample skewness doesn’t necessarily apply to the whole population.
In that case the question is, from the sample skewness, can you conclude anything about the population skewness?
To answer that question, see the next section.
Exercises 18: Calculate the skewness of the data in (a) exercises 6; and (b) Exercises 7 using Pearson’s
Coefficient of Skewness. Describe the data.
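Both the software formula and Pearson's coefficient are quick to compute. A minimal sketch with made-up, right-tailed data (not the exercise data); Pearson's second coefficient, 3(mean – median) ÷ SD, is assumed here as the version meant in Exercises 18:

```python
import statistics
from scipy.stats import skew

data = [2, 3, 3, 4, 4, 4, 5, 5, 6, 9]   # made-up, long right tail

# Moment-based skewness (the software formula above).
print(skew(data))                        # about 1.12: positively skewed

# Pearson's second coefficient of skewness: 3(mean - median) / SD.
sk = 3 * (statistics.mean(data) - statistics.median(data)) / statistics.stdev(data)
print(sk)                                # about 0.77: also positive
```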
Kurtosis
Kurtosis tells you the height and sharpness of the central peak, relative to that of a standard bell curve.
Platykurtic - Short and fat compared to the Normal bell curve with
fewer extreme values causing the tails to be thinner than the
Normal bell curve.
"When both skewness and kurtosis are zero (a situation that researchers are very unlikely to ever encounter), the
pattern of responses is considered a normal distribution. A general guideline for skewness is that if the number is
greater than +1 or lower than –1, this is an indication of a substantially skewed distribution. For kurtosis, the
general guideline is that if the number is greater than +1, the distribution is too peaked. Likewise, a kurtosis of less
than –1 indicates a distribution that is too flat. Distributions exhibiting skewness and/or kurtosis that exceed these
guidelines are considered non-normal." (Hair et al., 2017, p. 61).
How to interpret:
A normal distribution has an excess kurtosis of exactly 0. Any distribution with excess kurtosis ≈ 0 is called
mesokurtic.
A distribution with excess kurtosis <0 is called platykurtic. Compared to a normal distribution, its tails are
shorter and thinner, and often its central peak is lower and broader.
A distribution with excess kurtosis >0 is called leptokurtic. Compared to a normal distribution, its tails are
longer and fatter, and often its central peak is higher and sharper.
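scipy reports excess kurtosis (normal = 0) by default, which matches the interpretation rules above. A minimal sketch with simulated data:

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)

print(kurtosis(rng.normal(size=10_000)))    # near 0    -> mesokurtic
print(kurtosis(rng.uniform(size=10_000)))   # near -1.2 -> platykurtic
```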
Exercises 19: Calculate the kurtosis of the data in Exercises 9. Describe the distribution based on the
results.
Boxplot
A boxplot can give you information regarding the shape, variability, and center (or median) of a statistical
data set. It is particularly useful for displaying skewed data. Statistical data also can be displayed with other charts
and graphs.
A boxplot, sometimes called a box and whisker plot, is a type of graph used to display patterns of
quantitative data.
A boxplot splits the data set into quartiles. The body of the boxplot consists of a "box" (hence, the name),
which goes from the first quartile (Q1) to the third quartile (Q3).
Within the box, a vertical line is drawn at the Q2, the median of the data set. Two horizontal lines, called
whiskers, extend from the front and back of the box. The front whisker goes from Q1 to the smallest non-outlier
in the data set, and the back whisker goes from Q3 to the largest non-outlier.
If the data set includes one or more outliers, they are plotted separately as points on the chart. In the
boxplot above, two outliers are shown to the right of the second whisker.
Boxplots often provide information about the shape of a data set. The examples below show some common
patterns.
Each of the above boxplots illustrates a different skewness pattern. If most of the observations are concentrated
on the low end of the scale, the distribution is skewed right; and vice versa. If a distribution is symmetric, the
observations will be evenly split at the median, as shown above in the middle figure.
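Any plotting library will compute the quartiles and flag outliers for you. A minimal matplotlib sketch with made-up values (one large value is included so an outlier point appears):

```python
import matplotlib.pyplot as plt

values = [12, 15, 17, 18, 19, 21, 22, 24, 25, 41]
plt.boxplot(values, vert=False)   # box = Q1..Q3, line = median,
plt.xlabel("value")               # whiskers = non-outlier range
plt.show()                        # 41 is drawn as a separate outlier point
```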
Exercises 21: A salesperson recorded the number of sales he made each month. In the past 12 months, he sold
the following numbers of computers. Make a boxplot for the sales
51, 17, 25, 39, 7, 49, 62, 41, 20, 6, 43, 13.
MEASURES OF UNCERTAINTY
The standard error (SE) of the mean, also called the standard deviation of the mean, is the approximate
standard deviation of a statistical sample mean about the population mean. The standard error measures the
accuracy with which the sample represents a population. In statistics, the sample mean deviates from the actual
mean of the population; this deviation is the standard error.
The formula for the standard error of the mean in a population: SE = σ ÷ √n
Estimate of the standard error of the mean from a sample: SE = s ÷ √n
where SE = standard error of the mean; σ = population standard deviation; s = sample standard deviation;
n = number of measurements.
The standard error is most useful as a means of calculating a confidence interval. A confidence interval is a
range, or interval, of values used to estimate the true value of a population parameter. Confidence intervals
are associated with confidence levels, such as 95%, which tell us the percentage of times the confidence interval
actually contains the true population parameter we seek. Confidence interval CI is calculated for any desired
degree of confidence by using sample size and variability (SD) of the sample.
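Putting the pieces together: SE, margin of error, and a 95% confidence interval. A minimal sketch with hypothetical numbers (not the exercise data), assuming the population SD is known so the z-value 1.96 applies:

```python
import math

mean, sigma, n = 50, 10, 25
se = sigma / math.sqrt(n)               # standard error = 2.0
margin = 1.96 * se                      # z = 1.96 at 95% confidence
print(se, margin)                       # 2.0 3.92
print((mean - margin, mean + margin))   # 95% CI: (46.08, 53.92)
```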
Exercises 22: Calculating the SE of the sample mean when the population standard deviation is known.
In a certain property investment company, workers have a mean hourly wage of Php 600 with a population
standard deviation of Php 150. Given a sample size of 30, (a) estimate and interpret the SE of the sample
mean. (b) Find the margin of error at confidence level of 95%. (c) Determine the confidence interval.
Exercises 23: A researcher wants to determine the mean height of basketball players in the collegiate
tournament. The researcher selects random samples of 12 players. Find the confidence interval.
HYPOTHESIS TESTING
A statistical hypothesis is an assumption about a population parameter. This assumption may or may not
be true. Hypothesis testing refers to the formal procedures used by statisticians to accept or reject statistical
hypotheses.
Statistical Hypotheses
The best way to determine whether a statistical hypothesis is true would be to examine the entire
population. Since that is often impractical, researchers typically examine a random sample from the population.
If sample data are not consistent with the statistical hypothesis, the hypothesis is rejected.
There are two types of statistical hypotheses.
Null hypothesis. The null hypothesis, denoted by Ho, is usually the hypothesis that sample observations result
purely from chance.
The null hypothesis (H0): This is a statement that there is NO significant difference in groups, or more
generally, that there is no association between two groups. In other words, it is describing an outcome that is
the opposite of the research hypothesis. The original speculation is not supported.
Alternative hypothesis. The alternative hypothesis, denoted by H1 or Ha, is the hypothesis that sample
observations are influenced by some non-random cause.
The alternative hypothesis (H1): This is the research hypothesis. It is the researcher’s/scientist’s
speculation/prediction at the heart of the experiment.
Example 17:
Null Hypotheses (H0):
Undertaking seminar classes has no significant effect on students' performance.
Alternative Hypothesis (HA):
Undertaking seminar class has a significant effect on students' performance.
Hypothesis Tests
Statisticians follow a formal process to determine whether to reject a null hypothesis, based on sample
data. This process, called hypothesis testing, consists of four steps.
1. State the hypotheses. This involves stating the null and alternative hypotheses. The hypotheses are stated in
such a way that they are mutually exclusive. That is, if one is true, the other must be false.
2. Formulate an analysis plan. The analysis plan describes how to use sample data to evaluate the null hypothesis.
The evaluation often focuses around a single test statistic.
3. Analyze sample data. Find the value of the test statistic (mean score, proportion, t statistic, z-score, etc.)
described in the analysis plan.
4. Interpret results. Apply the decision rule described in the analysis plan. If the value of the test statistic is
unlikely, based on the null hypothesis, reject the null hypothesis.
Decision Errors
Two types of errors can result from a hypothesis test.
Type I error. A Type I error occurs when the researcher rejects a null hypothesis when it is true. The probability
of committing a Type I error is called the significance level. This probability is also called alpha, and is often
denoted by α.
Type II error. A Type II error occurs when the researcher fails to reject a null hypothesis that is false. The
probability of committing a Type II error is called Beta, and is often denoted by β. The probability of not
committing a Type II error is called the Power of the test.
We use the following terminology:
Significance level is the acceptable level of type I error, denoted α. Typically, a significance level of α = .05 is
used (although sometimes other levels such as α = .01 may be employed). This means that we are willing to
tolerate up to 5% of type I errors, i.e., we are willing to accept the fact that in 1 out of every 20 samples we
reject the null hypothesis even though it is true.
P-value (the probability value) is the probability, computed assuming the null hypothesis is true, of obtaining
a test statistic at least as extreme as the one observed. If p < α then we reject the null hypothesis.
Critical region is the part of the sample space that corresponds to the rejection of the null hypothesis, i.e., the
set of possible values of the test statistic which are better explained by the alternative hypothesis. The
significance level is the probability that the test statistic will fall within the critical region when the null
hypothesis is assumed to be true.
Usually, the critical region is depicted as a region under a curve for continuous distributions (or a portion of a
bar chart for discrete distributions).
The typical approach for testing a null hypothesis is to select a statistic based on a sample of fixed size, calculate
the value of the statistic for the sample and then reject the null hypothesis if and only if the statistic falls in the
critical region.
The “tails” of a test are the values outside of the critical values. In other words, the tails are the ends of the
distribution, and they begin at the greatest or least value included in the alternative hypothesis (the critical
values). The rejection region is the set of test-statistic values for which the null hypothesis is rejected.
One-tailed hypothesis testing specifies a direction of the statistical test. For example, to test whether cloud
seeding increases the average annual rainfall in an area which usually has an average annual rainfall of 20 cm,
we define the null and alternative hypotheses as follows, where μ represents the average rainfall after cloud
seeding.
H0: µ = 20 (i.e., average rainfall is equal to the average annual rainfall after cloud seeding)
H1: µ > 20 (i.e., average rainfall increases after cloud seeding)
Here the experimenters are quite sure that the cloud seeding will not significantly reduce rainfall, and so a
one-tailed test is used where the critical region is the shaded area in Figure 1. The null hypothesis is rejected
only if the test statistic falls in the critical region, i.e., the test statistic has a value larger than the
critical value.
It is quite possible to have a one-sided test where the critical region is in the left (or lower) tail. For
example, suppose the cloud seeding is expected to decrease rainfall. Then the hypotheses could be as follows:
H0: µ = 20 (i.e., average rainfall is equal to the average annual rainfall after cloud seeding)
H1: µ < 20 (i.e., average rainfall decreases after cloud seeding)
Two-tailed hypothesis testing doesn’t specify a direction of the test. For the cloud seeding example, it is more
common to use a two-tailed test. Here the null and alternative hypotheses are as follows.
H0: µ = 20 (There is no significant difference between the average rainfall and the average annual rainfall
after cloud seeding)
H1: µ ≠ 20 (There is a significant difference between the average rainfall and the average annual rainfall
after cloud seeding)
In this case we reject the null hypothesis if the test statistic falls in either side of the critical region. To
achieve a significance level of α, the critical region in each tail must have size α /2.
Statistical power is 1 – β. Thus, power is the probability that you find an effect when one exists, i.e., the
probability of correctly rejecting a false null hypothesis. While a significance level for type I error of α = .05 is
typically used, generally the target for β is .20 or .10, and so .80 or .90 is used as the target value for power.
Observation: Suppose we perform a statistical test of the null hypothesis with α = .05 and obtain a p-value of p =
.04, thereby rejecting the null hypothesis. This does not mean that there is a 4% probability of the null hypothesis
being true, i.e., P(H0) =.04. What we have shown instead is that assuming the null hypothesis is true, the
conditional probability that the sample data exhibits the obtained test statistic is 0.04; i.e., P(D|H0) =.04 where
D = the event that the sample data exhibits the observed test statistic.
Decision Rules
The analysis plan includes decision rules for rejecting the null hypothesis. In practice, statisticians describe these
decision rules in two ways - with reference to a P-value or with reference to a region of acceptance.
P-value. The strength of evidence in support of a null hypothesis is measured by the P-value. Suppose the test
statistic is equal to S. The P-value is the probability of observing a test statistic as extreme as S, assuming the
null hypothesis is true. If the P-value is less than the significance level, we reject the null hypothesis.
Region of acceptance. The region of acceptance is a range of values. If the test statistic falls within the region
of acceptance, the null hypothesis is not rejected. The region of acceptance is defined so that the chance of
making a Type I error is equal to the significance level.
The set of values outside the region of acceptance is called the region of rejection. If the test statistic falls within
the region of rejection, the null hypothesis is rejected. In such cases, we say that the hypothesis has been
rejected at the α level of significance.
A potential source of confusion in working out what statistics to use in analyzing data is whether your
data allows for parametric or non-parametric statistics.
Parametric statistics are any statistical tests based on underlying assumptions about data’s distribution. In other
words, parametric statistics are based on the parameters of the normal curve. Because parametric statistics are
based on the normal curve, data must meet certain assumptions, or parametric statistics cannot be calculated.
Prior to running any parametric statistics, you should always be sure to test the assumptions for the tests that you
are planning to run.
Non-Parametric statistics are not based on the parameters of the normal curve. Therefore, if your data violate the
assumptions of a parametric test and nonparametric statistics might better describe the data, try running the
nonparametric equivalent of the parametric test. You should also consider using the nonparametric equivalent tests
when you have limited sample sizes (e.g., n < 30). Though nonparametric statistical tests have more flexibility
than do parametric statistical tests, nonparametric tests are not as robust; therefore, most statisticians recommend
that when appropriate, parametric statistics are preferred.
The Traditional Method (Critical Value Approach):
Step 1: State the hypotheses.
Step 2: Identify α (level of significance).
Step 3: Compute the test value.
Step 4: Find the critical value(s) from the appropriate table.
Step 5: Make the decision to reject or not reject the null hypothesis.
Step 6: Summarize the results.

The P-Value Approach:
Step 1: State the hypotheses.
Step 2: Identify α (level of significance).
Step 3: Compute the test value.
Step 4: Calculate the p-value.
Step 5: Accept or reject the hypothesis. (A p-value less than α means that there is stronger evidence in favor of
the alternative hypothesis.)

The Confidence Interval Approach:
Step 1: State the hypotheses.
Step 2: Determine the test size (or 1 − test size) and the hypothesized value.
Step 3: Construct the confidence interval.
Step 4: Reject the null hypothesis if the hypothesized value does not fall within the confidence interval.
Step 5: Make the substantive interpretation.
T-Tests and Z-Tests
A z-test is used for testing the mean of a population versus a standard, or comparing the means of two
populations, with large (n ≥ 30) samples whether you know the population standard deviation or not. It is also
used for testing the proportion of some characteristic versus a standard proportion, or comparing the proportions
of two populations.
A t-test is used for testing the mean of one population against a standard or comparing the means of two
populations if you do not know the populations’ standard deviation and when you have a limited sample (n < 30).
If you know the populations’ standard deviation, you may use a z-test.
Both z-tests and t-tests require data with a normal distribution, which means that the sample (or
population) data are distributed symmetrically around the mean. Both compare two averages (means) and tell you
whether they are different from each other and how significant the difference is; in other words, they let you know
whether the difference could have happened by chance. They are therefore used to evaluate whether the means of two
sets of data are statistically significantly different from each other.
Where:
x̅ = sample mean
s = sample standard deviation
n = sample size
µ = specified population mean
t = Student t quantile with n − 1 degrees of freedom
ΣD = sum of the differences (x − y)
ΣD² = sum of the squared differences
(ΣD)² = square of the sum of the differences
df = nx + ny − 2 (for two independent samples)
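Reconstructed from the symbol definitions above (the pooled standard deviation s_p is an added symbol, not defined in the original), the usual one-sample, two-independent-sample, and paired test statistics are:

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}} \quad (df = n - 1)$$

$$t = \frac{\bar{x}_x - \bar{x}_y}{s_p\sqrt{\frac{1}{n_x} + \frac{1}{n_y}}} \quad (df = n_x + n_y - 2)$$

$$t = \frac{\Sigma D}{\sqrt{\dfrac{n\,\Sigma D^2 - (\Sigma D)^2}{n - 1}}} \quad (df = n - 1)$$

The z versions of these tests replace s with the known population standard deviation σ and use the standard normal table instead of the t table.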
To evaluate whether the difference is statistically significant using the Traditional Method approach, you
first have to look up in the t table the critical value of Student’s t distribution corresponding to the significance level
alpha of your choice. The significance level is the probability of rejecting the null hypothesis when it is true. For
example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no
actual difference.
[Student’s t table: critical values at the 95, 97.5, 99, 99.5, 99.75, 99.9 and 99.95 percent levels.]
[Formulas: one large-sample Z test; large independent-samples Z test; paired-sample Z test.]
The following figures illustrate the rejection regions using the Traditional Method approach defined by
the decision rule for upper-, lower- and two-tailed Z tests with α=0.05. Notice that the rejection regions are in the
upper, lower and both tails of the curves, respectively. The decision rules are written below each figure.
Rejection Region for an Upper-Tailed Z Test (H1: μ > μ0) with α = 0.05 — the decision rule is: Reject H0 if Z > 1.645.
Rejection Region for a Lower-Tailed Z Test (H1: μ < μ0) with α = 0.05 — the decision rule is: Reject H0 if Z < −1.645.
Rejection Region for a Two-Tailed Z Test (H1: μ ≠ μ0) with α = 0.05 — the decision rule is: Reject H0 if Z < −1.960 or if Z > 1.960.
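These cutoffs are just standard normal quantiles; a quick sketch (assuming Python with scipy) reproduces them:

```python
# Reproducing the standard-normal critical values used above.
from scipy.stats import norm

alpha = 0.05
print(norm.ppf(1 - alpha))       # upper-tailed cutoff:  1.645
print(norm.ppf(alpha))           # lower-tailed cutoff: -1.645
print(norm.ppf(1 - alpha / 2))   # two-tailed cutoff:   +/-1.960
```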
Effect Size
A common measure of effect size is d, sometimes known as Cohen's d (as you might have guessed by
now, Cohen was quite influential in the field of effect sizes).
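Cohen’s d expresses the difference between two group means in units of their pooled standard deviation:

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}}, \qquad s_{pooled} = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}}$$

By Cohen’s widely cited benchmarks, d ≈ 0.2 is a small effect, 0.5 a medium effect, and 0.8 a large effect.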
Exercises 25: A researcher would like to know who types faster on a PC, men or women. The data on the
typing speed are shown below. At α = 0.05, test whether there is a significant difference in the
typing speed on a PC between men and women.
Typing Speed on a PC
                          Men       Women
Mean (x̅)                  65 wpm    68 wpm
Standard deviation (s)    10 wpm    14 wpm
Number of cases (n)       50        60
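A quick check of this exercise (a sketch assuming Python with scipy; the large samples justify the z approximation):

```python
# Two-sample z-test for Exercise 25 using the summary statistics above.
import math
from scipy.stats import norm

mean_m, s_m, n_m = 65, 10, 50    # men
mean_w, s_w, n_w = 68, 14, 60    # women

# z = (x1 - x2) / sqrt(s1^2/n1 + s2^2/n2)
z = (mean_m - mean_w) / math.sqrt(s_m**2 / n_m + s_w**2 / n_w)
p = 2 * norm.sf(abs(z))          # two-tailed p-value

print(f"z = {z:.2f}, p = {p:.3f}")   # z is about -1.31, p is about 0.19
# Since p > .05, the difference in typing speed is not significant.
```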
Exercises 26: The data below were recorded from a study in which the oxygen uptake (mL) during incubation
of two sets of cell suspensions, one buffered and one unbuffered, was measured. Is there a difference
in the mean oxygen uptake between the buffered and unbuffered cell suspensions?
Exercises 27: The data below were taken from a study in which the performance of ten students in an
examination was recorded before and after the intake of a memory pill. Is there a difference in
the mean score of the students before and after the intake of the memory pill?
Scores in an Examination
Student Before After
A 72 75
B 61 60
C 48 37
D 55 64
E 81 76
F 50 59
G 42 49
H 64 55
I 77 75
J 69 75
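Since the same students were measured twice, a paired test applies; a sketch in Python with scipy using the scores tabulated above:

```python
# Paired t-test for Exercise 27 (before/after scores of the ten students).
from scipy import stats

before = [72, 61, 48, 55, 81, 50, 42, 64, 77, 69]
after  = [75, 60, 37, 64, 76, 59, 49, 55, 75, 75]

t, p = stats.ttest_rel(after, before)   # paired (dependent) samples t-test
print(f"t = {t:.2f}, p = {p:.3f}")      # t is about 0.26, p is about 0.80
# p > .05: no significant mean change after the memory pill.
```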
Analysis of Variance (ANOVA) is a hypothesis-testing technique used to test the equality of two or more
population (or treatment) means by examining the variances of samples that are taken. ANOVA allows one to
determine whether the differences between the samples are simply due to random error (sampling errors) or
whether there are systematic treatment effects that cause the mean in one group to differ from the mean in
another. Assumptions of ANOVA: (i) All populations involved follow a normal distribution; (ii) All populations
have the same variance (or standard deviation); and (iii) The samples are randomly selected and independent of
one another.2
ANOVA tests include one-way ANOVA, two-way ANOVA, or higher-order (factorial) ANOVA, depending upon the type and
arrangement of the data.
The one-way ANOVA compares the means between the groups you are interested in and determines whether
any of those means are statistically significantly different from each other. Specifically, it tests the null
hypothesis:
H0: µ1 = µ2 = µ3 = … = µk
where µ = group mean and k = number of groups. If, however, the one-way ANOVA returns a statistically
significant result, we accept the alternative hypothesis (HA), which is that there are at least two group means
that are statistically significantly different from each other.
The two-way ANOVA is a means of comparing multiple levels of two independent variables (called factors).
The two-way ANOVA is grounded in the idea that there are two variables, referred to as factors, affecting the
outcome of the dependent variable. To be effective, a two-way ANOVA assumes population samples are
normally distributed, independent, equal in variance, and contain sample groups of equal size.5
For example: A researcher would use a two-way ANOVA to examine whether product use (heavy, medium,
light and non-drinkers) and brand loyalty (loyal and non-loyal) affect preference towards a brand of
soft drink.
Note: If you fail to reject the null, then there are no differences to find. If otherwise, a post-hoc test is needed.
To test the hypothesis using one-way ANOVA, use the table below (where X = an individual observation):
Exercises 28: In an experiment, groups of plants were treated with different fertilizers, and the heights (in
feet) of the plants were recorded. Is there a statistically significant difference in the mean height among the
four groups of plants at α = .05?
a. Compute the sample means for each group and the overall mean based on the total sample.
Source of Variation    Sums of Squares (SS)      Degrees of Freedom (df)    Mean Squares (MS)       F
Between Treatments     SSB = Σ nj (x̅j − x̅)²      k − 1                      MSB = SSB / (k − 1)     F = MSB / MSE
Error (or Residual)    SSE = Σ Σ (X − x̅j)²       N − k                      MSE = SSE / (N − k)
Total                  SST = SSB + SSE           N − 1
(k = number of groups; N = total number of observations; nj and x̅j = size and mean of group j; x̅ = overall mean)
6. Draw a conclusion.
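A sketch of the whole computation in Python with scipy (the heights below are hypothetical stand-ins, since the exercise’s data table is not reproduced here):

```python
# One-way ANOVA sketch for the plant-fertilizer experiment.
from scipy import stats

group1 = [4.8, 5.1, 5.3, 4.9]   # fertilizer 1 (hypothetical heights, ft)
group2 = [5.6, 5.9, 6.1, 5.8]   # fertilizer 2
group3 = [4.2, 4.5, 4.1, 4.4]   # fertilizer 3
group4 = [5.0, 5.2, 4.8, 5.1]   # fertilizer 4

f, p = stats.f_oneway(group1, group2, group3, group4)
print(f"F = {f:.2f}, p = {p:.4f}")   # reject H0 of equal means if p < .05
```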
Exercises 29: Two-way ANOVA: (problem lifted from tutorvista.com and procedure lifted from
stanford.edu)
There are three groups of plants that were exposed to 8 hours, 12 hours and 16 hours of sunlight per day during
a given growing period, and two brands of fertilizer (X and Y) were used. The heights of the plants (in feet)
after six months are tabulated as follows.

                    (B) Exposure to Sun
(A) Fertilizer      8 hours    12 hours    16 hours
Brand X             12         15          17
                    13         15          18
                    11         14          16
Brand Y             14         16          19
                    14         17          20
                    13         18          19

At α = 0.05, find the test values FA and FB. Is there any difference in the mean height of plants depending on
(i) the type of fertilizer and (ii) the duration of exposure to sun, using a two-way ANOVA? (A: Fertilizer;
B: Exposure to sun)
Running the Two-way ANOVA:
6. Compare the calculated values of F to the critical value and draw conclusions
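The same two-way layout can be checked in Python (a sketch assuming pandas and statsmodels are available, using the plant heights tabulated above):

```python
# Two-way ANOVA for Exercise 29 (main effects of fertilizer and sun).
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

heights = [12, 13, 11, 15, 15, 14, 17, 18, 16,   # Brand X: 8h, 12h, 16h
           14, 14, 13, 16, 17, 18, 19, 20, 19]   # Brand Y: 8h, 12h, 16h
df = pd.DataFrame({
    "height": heights,
    "fertilizer": ["X"] * 9 + ["Y"] * 9,
    "sun": (["8h"] * 3 + ["12h"] * 3 + ["16h"] * 3) * 2,
})

model = ols("height ~ C(fertilizer) + C(sun)", data=df).fit()
print(anova_lm(model))   # F values for factor A (fertilizer) and B (sun)
```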
The obtained F is significant at a given level if it is equal to or greater than the value shown in the table. The α
=0.05 points on the F distribution are shown in the top row, and the α =0.01 points are shown in the bottom
row.
Post-hoc test
For comparison of three or more group means we apply the analysis of variance (ANOVA) method to
decide if all means are equal or there is at least one mean which is different from others. If we get a significant
result, we can conclude a global decision that there is difference in group means. However, then we need to know
what specific pairs of group means show differences and what pairs do not. The procedure is performed by post-
hoc multiple comparison procedures.
Tukey’s HSD (Honestly Significant Difference) determines differences between means in terms of standard
error; it is “honest” because it adjusts for making multiple comparisons. Each pairwise difference between
means is compared to the HSD critical value.
Where: t is the critical, tabled value of the t-distribution with the df = N – k associated with MSE, and n is the
number of values we are dealing with in each group (not the total n). The Mean Square Error value is taken from
the ANOVA you already computed. Find tcritical at α (usually .05), two-tailed, with this df.
The first step is to compute all possible differences between means (x̅1 – x̅2; x̅2 – x̅3; x̅1 – x̅3 …)
We will only be concerned with the absolute differences, so you can ignore any negative signs. Next, we
compute the HSD.
Compare the difference scores we computed with the HSD value. If the difference is larger than the HSD,
then we say the difference is significant.
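In practice the whole procedure can be delegated to software; a sketch assuming statsmodels is available (the scores and group labels below are hypothetical):

```python
# Tukey HSD sketch: all pairwise comparisons after a significant ANOVA.
from statsmodels.stats.multicomp import pairwise_tukeyhsd

scores = [4.8, 5.1, 5.3, 5.6, 5.9, 6.1, 4.2, 4.5, 4.1]
groups = ["A"] * 3 + ["B"] * 3 + ["C"] * 3

result = pairwise_tukeyhsd(scores, groups, alpha=0.05)
print(result)   # one row per pair: mean difference and reject True/False
```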
Non-Parametric Test – sometimes called a distribution-free test. It does not assume anything about the underlying
distribution (for example, that the data come from a normal distribution). That is in contrast to a parametric
test, which makes assumptions about a population’s parameters (for example, the mean or standard deviation). When
the word “non-parametric” is used in stats, it doesn’t quite mean that you know nothing about the population; it
usually means that you know the population data do not have a normal distribution.
Sign Tests
The sign test compares the sizes of two groups. It is a non-parametric or “distribution free” test, which means the
test doesn’t assume the data comes from a particular distribution, like the normal distribution. The sign test is an
alternative to a one sample t test or a paired t test. It can also be used for ordered (ranked) categorical data.
The 1-sample sign test is a nonparametric hypothesis test used to determine whether a statistically significant
difference exists between the median of a non-normally distributed continuous data set and a standard. This test
basically concerns the median of a continuous population.
The one sample sign test is a nonparametric version of one sample t test. Similar to one sample t-test, the sign test
for a population median can be one-tailed (right or left tailed) or two-tailed distribution based on the hypothesis.
Paired Sample Sign Test. This test is also called an alternative to the paired t-test. It uses the + and – signs
of the differences in paired samples or in a before-after study. The null hypothesis is set up so that the + and –
signs are equally likely, i.e., the two population medians are equal.
Paired Samples Sign Test is used to compare the medians of two related (paired) continuous populations, such as
'Before' and 'After' measurements of test scores for the same group of subjects.
Procedure to execute:
1. State the claim of the test and determine the null hypothesis and alternative hypothesis:
2. Calculate the + and – signs for the given distribution. Put a + sign for a value greater than the hypothesized
median, and put a – sign for a value less than the hypothesized median. Put a 0 when the value is equal to the
hypothesized median; observations (or pairs) scored 0 are considered ties.
3. Denote the total number of signs by ‘n’ (ignore the zero sign) and the number of less frequent signs by ‘y.’
4. Find critical value:
a. From Cumulative Binomial probability table if n≤ 25 (approx.)
b. You may obtain the critical value (K) at the .05 significance level by using the following formula in the case
of small samples:
5. Compare the value of y with the critical value from the table or from K. In case of large samples, y is compared
with Z value.
6. Make a decision, the null hypothesis will be rejected if the test statistic is less than or equal to the critical value.
7. Interpret the decision in the context of the original claim.
Exercises 30 (One-Sample Sign Test): Students who failed the examination were given remediation, after
which a removal examination was administered. In the previous year, the performance of students after the
remediation process showed a median score of 64. A teacher randomly picked 12 students who took the
removal examination to test, at the .05 level of significance, whether their performance is the same as the
previous batch. The data are shown below:
Student 1 2 3 4 5 6 7 8 9 10 11 12
Score 38 60 66 65 70 68 72 46 76 77 75 64
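A quick check of this exercise (a sketch assuming Python with scipy 1.7+ for binomtest):

```python
# One-sample sign test for Exercise 30 (hypothesized median = 64).
from scipy.stats import binomtest

scores = [38, 60, 66, 65, 70, 68, 72, 46, 76, 77, 75, 64]
median0 = 64

plus  = sum(s > median0 for s in scores)   # 8 scores above the median
minus = sum(s < median0 for s in scores)   # 3 scores below the median
n = plus + minus                           # the tie (score = 64) is dropped
y = min(plus, minus)                       # less frequent sign

p = binomtest(y, n, 0.5).pvalue            # two-tailed binomial test
print(f"+: {plus}, -: {minus}, p = {p:.3f}")   # p is about 0.23
# p > .05: no evidence the median differs from the previous batch's 64.
```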
Exercises 31 (Paired Sample Sign Test): The data below show the scores of ten students before and after the
intervention program. Test if there is a significant difference between the median scores before and after the
program at α = .05.
Student   Before   After
1         84       91
2         103      100
3         111      117
4         89       89
5         116      130
6         89       96
7         105      95
8         91       98
9         99       76
10        96       104
Mann-Whitney U test
Mann-Whitney U test is the non-parametric alternative test to the independent sample t-test. It is used to
compare two independent groups of sampled data. Mann-Whitney test is also called the rank sum test. The test
statistic for the Mann-Whitney test is U. This value is compared to a table of critical values for U based on the
sample size of each group. It is a statistical hypothesis test for assessing whether one of two samples of independent
observations tends to have larger values than the other.
Usually, the Mann-Whitney U test is used when the data is ordinal or when the assumptions of the t-test
are not met.
The Mann-Whitney U test is a non-parametric test, so it does not make assumptions about the distribution of
scores. There are, however, some assumptions:
1. The sample drawn from the population is random.
2. Independence within the samples and mutual independence is assumed. That means that an observation is in
one group or the other (it cannot be in both).
3. Ordinal measurement scale is assumed.
Summary points
1. The Mann-Whitney test is used as an alternative to a t test when the data are not normally distributed
2. The test can detect differences in shape and spread as well as just differences in medians
3. Differences in population medians are often accompanied by equally important differences in shape
In their standard form, the two U statistics are

$$U_1 = n_1 n_2 + \frac{n_1(n_1+1)}{2} - R_1, \qquad U_2 = n_1 n_2 + \frac{n_2(n_2+1)}{2} - R_2$$

Where:
R1 = sum of the ranks for group 1
R2 = sum of the ranks for group 2
Hint: We reject H0 if Ucalculated < Ucritical. Use the smaller calculated U value for a two-tailed test.
Exercises 32: A study was conducted to test the effectiveness of teaching strategy Y. The results of the test
after implementing the strategy were presented below. Is there a difference in the scores of the
two groups at α = 0.05? The data deviated from the normal. Hence, Mann-Whitney U test must
be used.
Experimental Group 7 5 6 4 12 7 5 6 4 12
Control Group 4 7 5 3 2 6 8 6 4 3
4. Calculate the U:
http://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_nonparametric/BS704_Nonparametric4.html
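A sketch of this exercise in Python with scipy, using the scores listed above:

```python
# Mann-Whitney U test for Exercise 32.
from scipy.stats import mannwhitneyu

experimental = [7, 5, 6, 4, 12, 7, 5, 6, 4, 12]
control      = [4, 7, 5, 3, 2, 6, 8, 6, 4, 3]

u, p = mannwhitneyu(experimental, control, alternative="two-sided")
print(f"U = {u}, p = {p:.3f}")
# Reject H0 of equal distributions only if p < .05.
```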
The Wilcoxon Sign Test
The Wilcoxon signed rank test is a close sibling of the dependent samples t-test. The dependent samples t-test
analyzes whether the average difference of two repeated measures is zero and requires metric (interval or ratio),
normally distributed data; the Wilcoxon sign test instead uses ranked or ordinal data. Thus, it is a common
alternative to the dependent samples t-test when its assumptions are not met.
The test statistic for the Wilcoxon Signed Rank Test is W, defined as the smaller of W+ (sum of the positive
ranks) and W- (sum of the negative ranks).
The hypotheses for the Wilcoxon Signed Rank Test concern the population median of the difference
scores. The research hypothesis can be one- or two-sided. Here we consider a one-sided test.
Your obtained value is statistically significant if it is equal to or SMALLER than the value in the table.
Exercises 33: A group of ten students were used as respondents in a study. The students were given a pre-
test before taking up a course with a new teaching strategy, and the same test was administered as a
post-test at the end of the course. The results were tabulated. The differences between the pre- and
post-test scores deviate from the normal. Is there a difference in the pre- and post-test scores of
the students at α = 0.05?
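A sketch in Python with scipy (the pre/post scores below are hypothetical, since the exercise’s data table is not reproduced here):

```python
# Wilcoxon signed-rank sketch for Exercise 33.
from scipy.stats import wilcoxon

pre  = [61, 55, 70, 48, 66, 59, 72, 50, 64, 57]   # hypothetical scores
post = [68, 60, 69, 55, 71, 66, 74, 58, 63, 65]

w, p = wilcoxon(pre, post)     # W = smaller of the signed-rank sums
print(f"W = {w}, p = {p:.3f}")
# Reject H0 of no median difference if p < .05.
```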
Kruskal-Wallis Test
The Kruskal-Wallis Test is used to analyze the effects of more than two levels of just one factor on the
experimental result. It is the non-parametric equivalent of the One Way ANOVA.
One independent variable with two or more levels (independent groups). The test is more commonly used
when you have three or more levels. For two levels, consider using the Mann Whitney U Test instead.
Ordinal scale, Ratio Scale or Interval scale dependent variables.
Your observations should be independent. In other words, there should be no relationship between the
members in each group or between groups.
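The Kruskal-Wallis statistic, in its usual form, is

$$H = \frac{12}{N(N+1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3(N+1)$$

where N is the total sample size, and n_j and R_j are the size and rank sum of group j. H is compared to the tabled (or chi-square, df = k − 1) critical value.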
Exercises 34: The data in the Table below, gives the efficiency of a chemical process using three different
catalysts (A, B and C). Is there evidence that the different catalysts result in different
efficiencies? Assume in this example, that the data may not be normally distributed and that it
is necessary to use non-parametric statistics.
3. Rank all the data together, ignoring which group they belong to. The lowest score gets the lowest rank (the
lowest value is ranked 1).
4. Find the total of the ranks for each group. Just add together all of the ranks for each group in turn.
7. Compare the computed H value to the critical value then draw a conclusion.
(If the critical value is less than the H statistic, reject the null hypothesis that the medians are equal. If the
critical value is not less than the H statistic, there is not enough evidence to suggest that the medians are
unequal.)
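A sketch in Python with scipy (the efficiencies below are hypothetical, since the catalyst data table is not reproduced here):

```python
# Kruskal-Wallis test for Exercise 34 (three catalysts).
from scipy.stats import kruskal

catalyst_a = [88, 91, 85, 90]   # hypothetical efficiencies
catalyst_b = [79, 82, 84, 80]
catalyst_c = [92, 95, 93, 94]

h, p = kruskal(catalyst_a, catalyst_b, catalyst_c)
print(f"H = {h:.2f}, p = {p:.4f}")   # reject equal medians if p < .05
```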
Friedman Test
The Friedman test is the non-parametric alternative to the one-way ANOVA with repeated measures. It
is used to test for differences between groups when the dependent variable being measured is ordinal. It can also
be used for continuous data that has violated the assumptions necessary to run the one-way ANOVA with repeated
measures (e.g., data that has marked deviations from normality).1
Assumptions
When you choose to analyze your data using a Friedman test, part of the process involves checking to
make sure that the data you want to analyze can actually be analyzed using a Friedman test. You need to do this
because it is only appropriate to use a Friedman test if your data "passes" the following four assumptions:
Assumption #1: One group that is measured on three or more different occasions.
Assumption #2: Group is a random sample from the population.
Assumption #3: Your dependent variable should be measured at the ordinal or continuous level. Examples of
ordinal variables include Likert scales (e.g., a 7-point scale from strongly agree through to strongly disagree),
amongst other ways of ranking categories (e.g., a 5-point scale explaining how much a customer liked a
product, ranging from "Not very much" to "Yes, a lot"). Examples of continuous variables include revision
time (measured in hours), intelligence (measured using IQ score), exam performance (measured from 0 to
100), weight (measured in kg), and so forth.
Assumption #4: Samples do NOT need to be normally distributed.
The null hypothesis for the test is that the treatments all have identical effects. The alternative hypothesis
is that the treatments do not have identical effects; for example, they have different centers, spreads, or
shapes.2
The formula:
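In its usual form (n = number of subjects or rows, k = number of treatments or columns, R_j = rank sum of column j):

$$\chi_F^2 = \frac{12}{nk(k+1)} \sum_{j=1}^{k} R_j^2 \;-\; 3n(k+1)$$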
Exercises 35: The scores given to ten candidates by three interviewers were recorded. Is there a difference in
the ratings of the three interviewers?
(Steps for using the Friedman test were lifted from Statistics How To2)
Candidate   Interviewer 1   Interviewer 2   Interviewer 3   Rank Int. 1   Rank Int. 2   Rank Int. 3
1           21              8               11
2           41              39              14
3           32              45              16
4           39              33              12
5           21              13              8
6           14              12              5
7           18              23              10
8           23              10              5
9           24              21              14
10          11              9               5
Totals
3. Rank each row separately. The smallest score in a row should get a rank of 1 (rank across each row so each
candidate’s three ratings are ranked 1, 2, and 3).
6. Find the FM critical value from the table of critical values for Friedman.
7. Compare the calculated FM test statistic to the FM critical value. (Reject the null hypothesis if the calculated
FM value is larger than the FM critical value).
8. Draw a conclusion.
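A sketch of this exercise in Python with scipy, using the ratings tabulated above:

```python
# Friedman test for Exercise 35 (ten candidates, three interviewers).
from scipy.stats import friedmanchisquare

interviewer1 = [21, 41, 32, 39, 21, 14, 18, 23, 24, 11]
interviewer2 = [8, 39, 45, 33, 13, 12, 23, 10, 21, 9]
interviewer3 = [11, 14, 16, 12, 8, 5, 10, 5, 14, 5]

fm, p = friedmanchisquare(interviewer1, interviewer2, interviewer3)
print(f"FM = {fm:.2f}, p = {p:.4f}")
# Reject H0 of identical interviewer effects if p < .05.
```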
Chi-Square
Chi-square is used when the variables being considered are categorical variables (nominal or ordinal). If
we want to determine if two categorical variables are related or if we want to see if a distribution of data falls into
a prescribed distribution, then we use the Chi-Square as our test statistic. In either case, there are two pieces of
information one needs: (1) the actual frequency for each "cell" (the actual or observed frequency), and (2) the expected cell
frequency, which comes from either your theory (goodness-of-fit problem) or via a formula (test of independence).1
The Chi-Square test is one of the most important and most used methods in statistical testing. The purpose of the
Chi-Square test is to assess the difference between an observed frequency and an expected frequency. The test is
also sometimes used to test the differences among two or more sets of observed data.2
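In symbols, the test statistic compares observed (O) and expected (E) counts:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

with df = (rows − 1)(columns − 1) for a test of independence.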
Exercises 36: A poll was conducted on a political issue. A number of males and females were included in the
study.
6. Make a decision.
7. Formulate a conclusion:
https://people.smp.uq.edu.au/YoniNazarathy/stat_models_B_course_spring_07/distributions/chisqtab.pdf
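A sketch in Python with scipy (the observed counts below are hypothetical, since the poll table is not reproduced here):

```python
# Chi-square test of independence sketch for Exercise 36.
from scipy.stats import chi2_contingency

#                 agree  disagree  (hypothetical counts)
observed = [[40, 60],    # male
            [70, 30]]    # female

chi2, p, df, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, df = {df}, p = {p:.4f}")
# Reject H0 of independence between gender and opinion if p < .05.
```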
CORRELATION
Correlation is a measure of association between two variables. The variables are not designated as dependent
or independent.
The word correlation is used in everyday life to denote some form of association. We might say that we have
noticed a correlation between foggy days and attacks of wheeziness. However, in statistical terms we use
correlation to denote association between two quantitative variables. We also assume that the association is
linear, that is, that one variable increases or decreases a fixed amount for a unit increase or decrease in the other. The
other technique that is often used in these circumstances is regression, which involves estimating the best
straight line to summarize the association.
Correlation is a bivariate (only two variables are being analyzed) analysis that measures the strength of
association between two variables and the direction of the relationship. The degree of association is measured
by a correlation coefficient, denoted by r.
In terms of the strength of relationship, the value of the correlation coefficient varies between +1 and -1. A
value of ± 1 indicates a perfect degree of association between the two variables. As the correlation coefficient
value goes towards 0, the relationship between the two variables will be weaker. The direction of the
relationship is indicated by the sign of the coefficient; a + sign indicates a positive relationship and a – sign
indicates a negative relationship.
Correlation analysis does not test for causal relationships; it tests only whether there is a relationship and, to
some extent, the strength of this relationship.
The Pearson Product-Moment Correlation
$$r = \frac{\sum (x-\bar{x})(y-\bar{y})}{(n-1)\,s_x s_y} \quad\text{or}\quad r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[\,n\sum x^2-(\sum x)^2\,][\,n\sum y^2-(\sum y)^2\,]}}$$

Where:
n is the number of pairs of data;
x̅ and ȳ are the sample means of all the x-values and all the y-values, respectively;
sx and sy are the sample standard deviations of all the x- and y-values, respectively.
The value of r is also known as the effect size.
T-test for correlation coefficients
(Test for the significance of relationships between two CONTINUOUS variables)
The correlation coefficient, r, tells us about the strength and direction of the linear relationship between x
and y. However, the reliability of the linear model also depends on how many observed data points are in the
sample. We need to look at both the value of the correlation coefficient r and the sample size n, together.
We perform a hypothesis test of the “significance of the correlation coefficient” to decide whether the
linear relationship in the sample data is strong enough to use to model the relationship in the population.
Remember!
o A relationship can be strong and yet not significant
o A relationship can be weak but significant
o The smaller the p-level, the more significant the relationship
o The larger the correlation, the stronger the relationship
The appropriate t value to test significance of a correlation coefficient employs the t distribution:
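$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}, \qquad df = n - 2$$

where r is the sample correlation coefficient and n is the number of pairs of observations.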
The null hypothesis to be tested: The correlation coefficient is not significantly different from zero.
The test is to see whether the variables under study have a positive or negative go-togetherness or not.
Therefore, one-tailed hypothesis testing is necessary.
To make a decision:
If the P-value is smaller than the significance level α, we reject the null hypothesis in favor of the
alternative. We conclude "there is sufficient evidence at the α level to conclude that there is a linear
relationship in the population between the predictor x and response y."
If the P-value is larger than the significance level α, we fail to reject the null hypothesis. We conclude
"there is not enough evidence at the α level to conclude that there is a linear relationship in the
population between the predictor x and response y."
REGRESSION
If there is a “significant” linear correlation between two variables, the next step is to find the equation of
a line that “best” fits the data. Such an equation can be used for prediction: given a new x-value, this equation can
predict the y-value that is consistent with the information known about the data. This predicted y-value will be
denoted by ŷ. The line represented by such an equation is called the linear regression line.
The equation for a line is y = mx + b + e, where:
Y = the value of the dependent variable; what is being predicted or explained
m = the slope of the regression line; how much Y changes for each one-unit change in X
X = the value of the independent variable; what is predicting or explaining the value of Y
b = the intercept (or Alpha, a constant); the value of Y when X = 0
e = the error term; the error in predicting the value of Y given the value of X (in practice, under ordinary
circumstances, we do not know the value of the error; that is why it is not displayed in most regression
equations)
In general, the regression line will not pass through each data point. For each data point, there is an
error: the difference between the y-value from the data and the y-value on the line, ŷ. By definition, the linear
regression line is such that the sum of the squares of the errors is the least possible. It turns out that, given
a set of data, there is only one such line.
The residual, d, is the difference between the observed y-value and the predicted y-value:
d = (observed y-value) − (predicted y-value).
The coefficient of determination, r2, is the proportion of the variation that is explained by the regression line.
Exercises 37: The time (x) in years that an employee spent at a company and the employee’s hourly pay (y)
for 5 employees are listed in the table below.
a. Calculate the correlation coefficient r;
b. Test if the calculated r is significant at α = .05 using the t-test for correlation coefficients.
c. Test if the calculated r is significant at α = .05 by simply looking up for the critical value of
r in a table.
d. What percentage of the variation in the dependent variable can be explained by the independent
variable (as indicated by r2)?
e. If highly correlated, (b.i) find the equation of the regression line; and (b.ii) Predict the hourly
pay rate of an employee who has worked for 15 and 20 years
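A sketch of parts (a), (d) and (e) in Python with scipy (the five (x, y) pairs below are hypothetical, since the exercise’s table is not reproduced here):

```python
# Correlation and regression sketch for Exercise 37 (years vs hourly pay).
from scipy.stats import linregress

years = [3, 5, 8, 10, 14]              # hypothetical data
pay   = [12.50, 14.00, 16.75, 18.20, 21.90]

res = linregress(years, pay)
print(f"r = {res.rvalue:.3f}, p = {res.pvalue:.4f}")   # (a), (b)
print(f"r^2 = {res.rvalue**2:.3f}")                    # (d) variation explained
print(f"y = {res.slope:.3f}x + {res.intercept:.3f}")   # (e.i) regression line
print(res.slope * 15 + res.intercept)                  # (e.ii) pay at 15 years
```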
Spearman Rank Correlation Coefficient
Spearman's correlation measures the strength and direction of the monotonic association between two
variables. Monotonicity is "less restrictive" than a linear relationship.
A monotonic relationship is not strictly an assumption of Spearman's correlation. That is, you can run a
Spearman's correlation on a non-monotonic relationship to determine if there is a monotonic component to
the association.
[Figure: three scatterplots. Left: as the x variable increases the y variable never decreases (monotonically
increasing). Middle: as the x variable increases the y variable never increases (monotonically decreasing).
Right: as the x variable increases the y variable sometimes decreases and sometimes increases (not monotonic).]

If your data do not have tied ranks, use

$$\rho = 1 - \frac{6\sum d_i^2}{n(n^2-1)}$$

where d_i is the difference between the two ranks of each observation and n is the number of observations. If
your data have tied ranks, apply Pearson's formula to the ranks. [A guide table for interpreting the absolute
value of the coefficient accompanied the original.]
Exercises 38: The data shows the marks achieved in a Math and English exam. Compute the Spearman rank
correlation.
Student 1 2 3 4 5 6 7 8 9 10
English 56 75 45 71 62 64 58 80 76 61
Math 66 70 40 60 65 56 59 77 67 63
To determine the Spearman rank correlation, complete this table and use the first equation.
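A quick check of this exercise in Python with scipy, using the marks listed above:

```python
# Spearman rank correlation for Exercise 38 (English vs Math marks).
from scipy.stats import spearmanr

english = [56, 75, 45, 71, 62, 64, 58, 80, 76, 61]
math    = [66, 70, 40, 60, 65, 56, 59, 77, 67, 63]

rho, p = spearmanr(english, math)
print(f"rho = {rho:.3f}, p = {p:.4f}")   # rho is about 0.67
```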
PHI COEFFICIENT
The Phi Coefficient is a measure of association between two binary variables (i.e., living/dead,
black/white, success/failure). It is also called the Yule phi or Mean Square Contingency Coefficient and is
used for contingency tables when:
At least one variable is a nominal variable.
Both variables are dichotomous variables.
The phi coefficient is a symmetrical statistic, which means the independent and dependent variables are
interchangeable.
The interpretation for the phi coefficient is similar to the Pearson Correlation Coefficient. The range is from -1 to
1, where:
0 is no relationship/association.
+1 is a perfect positive relationship/association: most of your data falls along the diagonal cells.
-1 is a perfect negative relationship/association: most of your data is not on the diagonal.
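For a 2 × 2 table with cell counts a, b in the first row and c, d in the second, the phi coefficient is usually
computed as

$$\phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}$$

(equivalently, φ = √(χ²/n)).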
Exercises 39: A survey was conducted on one political issue. The results were shown below. Find phi.
Point-Biserial Correlation Coefficient
The point biserial correlation coefficient, rpbi, is a special case of Pearson’s correlation coefficient. It
measures the relationship between two variables:
One continuous variable (must be ratio scale or interval scale).
One naturally binary variable.
Formula (the standard form for the symbols defined below):

$$r_{pbi} = \frac{M_1 - M_0}{S_n}\sqrt{pq}$$

Where:
M1 = mean (for the entire test) of the group that received the positive binary variable (i.e., the “1”).
M0 = mean (for the entire test) of the group that received the negative binary variable (i.e., the “0”).
Sn = standard deviation for the entire test.
p = proportion of cases in the “0” group.
q = proportion of cases in the “1” group.
Exercises 40: A researcher wishes to determine if a significant relationship exists between the gender of the
worker and the number of years they have been performing the tasks.
Tetrachoric Correlation Coefficient
The tetrachoric correlation coefficient is a parametric test used to determine, in the population, if the
correlation between values on two variables is some value other than zero. More specifically, it is used to
determine if there is a significant linear relationship between the two variables. The tetrachoric correlation
coefficient requires both variables to be interval or ratio data, but also that both of them have been transformed
into dichotomous nominal or ordinal scale variables.
Test Assumptions
The sample has been randomly selected from the population it represents.
The underlying distributions for both of the variables involved are assumed to be continuous and normal.
Contingency table: “0” and “1” are the coded values of the dichotomous responses for X and Y, and the values a,
b, c, and d represent the number of points in the sample that belong to the different combinations of 0 and 1
for the two variables, for example:

          X = 0    X = 1
Y = 0     a        b
Y = 1     c        d

A commonly used approximation for the tetrachoric correlation (assuming the layout above) is

$$r_{tet} = \cos\left(\frac{\pi}{1+\sqrt{ad/bc}}\right)$$
Exercises 41: Let’s look at the exam grades for one of the old STAT 213 classes. I want to see if there is a
significant correlation between the grades on midterm 1 and midterm 2 as far as whether they
got a grade higher than a C+. I will code a grade higher than a C+ as 1 and a grade equal to or
lower than a C+ as a 0. Let the X variable be the grade on the first midterm, and the Y variable
be the grade on the second midterm. I suspect a negative correlation between X and Y, since a
lot of students who did poorly on the first midterm either dropped the class or worked really
hard to do well on the second one.
Kendall Rank Correlation
Kendall’s Tau is a non-parametric measure of relationships between columns of ranked data. Calculations are
based on concordant and discordant pairs. It is insensitive to error, and its values are more accurate with
smaller sample sizes. A concordant pair of cases is one in which one case is higher on both variables than the
other case. A discordant pair of cases is one in which one case is higher on one variable than the other case
but lower on the other variable.
Formula:

$$\tau = \frac{C - D}{C + D}$$

Where:
C = number of concordant pairs
D = number of discordant pairs
Exercises 42: Two judges ranked 10 candidates (A through J). The results from most preferred to least
preferred are:
Judge 1: ABCDEFGHIJ.
Judge 2: BADCFEHGJI.
Candidate Judge 1 Judge 2 Concordant Discordant
A
B
C
D
E
F
G
H
I
J
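A quick check of this exercise in Python with scipy, coding each judge’s ordering as ranks (Judge 1: A through J = 1 through 10; Judge 2: B A D C F E H G J I):

```python
# Kendall's tau for Exercise 42 (two judges ranking ten candidates).
from scipy.stats import kendalltau

judge1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]    # candidates A, B, ..., J
judge2 = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]    # Judge 2's rank of each candidate

tau, p = kendalltau(judge1, judge2)
print(f"tau = {tau:.3f}, p = {p:.4f}")      # tau is about 0.78
```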
References:
Credits for the Contents and images on this module are given to the following…
Anna Hart (BMJ. 2001 Aug 18; 323(7309): 391–393). Mann-Whitney test is not just a test of medians: differences
in spread can be important. Retrieved from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1120984/
David M. Lane and Heidi Ziemer. What is Central Tendency? Retrieved from
http://onlinestatbook.com/2/summarizing_distributions/what_is_ct.html
http://blog.minitab.com/blog/understanding-statistics/understanding-qualitative-quantitative-attribute-discrete-
and-continuous-data-types
http://math.ucdenver.edu/~ssantori/MATH2830SP13/Math2830-Chapter-08.pdf
http://www.statisticssolutions.com/kendalls-tau-and-spearmans-rank-correlation-coefficient/
https://sol.du.ac.in/mod/book/view.php?id=1317&chapterid=1067
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3576830/
MARKET RESEARCH GUY (2020). Types of Data & Measurement Scales: Nominal, Ordinal, Interval and
Ratio. Retrieved from http://www.mymarketresearchmethods.com/types-of-data-nominal-ordinal-
interval-ratio/
Math Only. Class Limits in Exclusive and Inclusive Form. Retrieved from. https://www.math-only-
math.com/class-limits-in-exclusive-and-inclusive-form.html
Math Only. Frequency Distribution of Ungrouped and Grouped Data. Retrieved from http://www.math-only-
math.com/frequency-distribution-of-ungrouped-and-grouped-data.html
mypolyuweb.hk. http://www.mypolyuweb.hk/machanck/lectnotes/c1_des.pdf
Phil Factor (2017) @ Redgate. Statistics in SQL: Kendall’s Tau Rank Correlation. Retrieved from
https://www.red-gate.com/simple-talk/blogs/statistics-sql-kendalls-tau-rank-correlation/
Real Statistics Using Excel. Null and Alternative Hypothesis. Retrieved from http://www.real-
statistics.com/hypothesis-testing/null-hypothesis/
riosalado.edu. Measures of Central Tendency: Mean, Median, and Mode. Retrieved from
http://www.riosalado.edu/web/oer/WRKDEV100-
20011_INTER_0000_v1/lessons/Mod05_MeanMedianMode.shtml
SCRIBD. Rank Biserial Correlation Coefficient. Retrieved from
https://www.scribd.com/document/242029206/Rank-Biserial-Correlation
Stack Exchange. How to I use the standard normal table to get the following Z value? Retrieved from
https://math.stackexchange.com/questions/1369432/how-to-i-use-the-standard-normal-table-to-get-the-
following-z-value
Statistics Canada. Relationship between Quartiles Deciles and Percentiles. Retrieved from
https://www.statcan.gc.ca/edu/power-pouvoir/ch12/5214891-eng.htm
Statistics How To. Central Tendency (Measures of Location): Definition and Examples. Retrieved from
http://www.statisticshowto.com/central-tendency-2/
Statistics How To. Friedman’s Test / Two Way Analysis of Variance by Ranks (edited by Stephanie in April 9,
2014). Retrieved from http://www.statisticshowto.com/friedmans-test/
Statistics How To. Kendall’s Tau (Kendall Rank Correlation Coefficient). Retrieved from
http://www.statisticshowto.com/kendalls-tau/
Statistics How To. Phi Coefficient (Mean Square Contingency Coefficient). Retrieved from
http://www.statisticshowto.com/phi-coefficient-mean-square-contingency-coefficient/
Statistics How To. Point-Biserial Correlation & Biserial Correlation: Definition, Examples. Retrieved from
http://www.statisticshowto.com/point-biserial-correlation/
Statistics How To. T Score Formula: Calculate in Easy Steps. Retrieved from
https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/t-distribution/t-score-
formula/
Statistics How To. What is the Kruskal Wallis Test? Retrieved from
https://www.statisticshowto.datasciencecentral.com/kruskal-wallis/
UF. http://users.stat.ufl.edu/~athienit/Tables/Ztable.pdf
University of Leicester. Measures of variability: the range, inter-quartile range and standard deviation. Retrieved
from https://www2.le.ac.uk/offices/ld/resources/numerical-data/variability
Wayne LaMorte (2017). Mann Whitney U Test (Wilcoxon Rank Sum Test). Retrieved from
http://sphweb.bumc.bu.edu/otlt/mph-
modules/bs/bs704_nonparametric/BS704_Nonparametric4.html
www.hhs.iup.edu/.../Phi%20Coefficient%20Example%20Power%20point%2097-03.p...
www.pkwy.k12.mo.us/homepage/jtheobald/.../statsch3/meanmedmodegrouped.ppt
https://www.statisticshowto.com/sign-test/
http://www.csun.edu/~mr31841/documents/thesigntest.pdf
https://www.real-statistics.com/non-parametric-tests/moods-median-test-two-samples/
https://www.dummies.com/article/academics-the-arts/math/statistics/figuring-binomial-probabilities-
using-the-binomial-table-147222/