Hypothesis Testing
Learning Objectives:
1. Understand the fundamentals of hypothesis testing
2. Learn how hypothesis testing works
3. Be able to differentiate between z-test, t-test, and other statistics concepts
Statistical inference of samples involves using statistical methods to conclude, make predictions,
or generalize information about a population based on data collected from one or more samples. In
essence, it is the process of making educated and quantified guesses about a larger group (the
population) by analyzing a subset of that group (the sample).
Hypothesis testing for a single sample is a statistical method used to make inferences about a
population parameter based on data from a single sample. The goal is to assess whether the sample
data provides enough evidence to support or reject a specific hypothesis about the population
parameter.
Statistical inference for two samples involves using statistical methods to compare two
independent samples from different populations to make inferences or draw conclusions about the
population characteristics or differences between them. This is a common statistical analysis used in
various fields to answer questions such as whether there is a significant difference between two groups
or populations.
Parametric Hypothesis Tests
There are several types of parametric hypothesis tests. These tests are used to make inferences
about population parameters while assuming certain distributional data properties. These tests are
used both for single-sample and two-sample tests.
1. T-Test
2. Z-Test
3. Variance Test (Chi-Squared Test)
4. F-Test
Each parametric test has specific assumptions and conditions that must be met for valid
inference. The appropriate test choice depends on the data's characteristics and the parameter
being tested.
Nonparametric Hypothesis Tests
Nonparametric tests make no strict assumptions about the underlying distribution and are used when the assumptions of parametric tests are not met:
1. Sign Test
2. Wilcoxon Signed-Rank Test
3. Runs Test
4. Kolmogorov-Smirnov Test
5. Kruskal–Wallis Test (H-test)
6. Friedman Test
7. Spearman Rank Correlation Coefficient
8. Phi Correlation Coefficient
9. Point-Biserial Correlation Coefficient
Example:
In one Engineering Data Analysis class, the mean score was 40 marks out of 100. The
professor decided that extra classes were necessary in order to improve the performance of the class.
After taking the extra classes, the class scored an average of 45 marks out of 100. Can we be sure that
the increase in marks is a result of the extra classes, or is it just random variation?
Hypothesis testing lets us answer that question. It allows a sample statistic to be checked against a
population statistic, or against the statistic of another sample, to study the effect of an intervention
(the extra classes in the example above).
Hypothesis testing is defined in two terms – Null Hypothesis and Alternate Hypothesis.
• Null Hypothesis (Ho): the sample statistic is equal to the population statistic. For
example, the null hypothesis for the example above would be that the average marks after the
extra classes are the same as before the classes.
• Alternative Hypothesis (Ha): for this example, the marks after the extra classes significantly
differ from those before the classes.
Hypothesis testing is done at different confidence levels and uses a z-score (or another test statistic)
to calculate the probability. So, at a 95% confidence level, any test statistic above the z-threshold for
95% would lead us to reject the null hypothesis.
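The reasoning above can be sketched numerically. The example supplies only the two means (40 and 45), so the class size and population standard deviation below are purely hypothetical assumptions chosen for illustration:

```python
from math import sqrt

# Hypothetical numbers: the example gives only the two means (40 and 45).
# The class size n and population standard deviation sigma are assumptions
# made purely for illustration.
mu0 = 40        # mean before the extra classes (null-hypothesis value)
xbar = 45       # mean after the extra classes
sigma = 10      # assumed population standard deviation
n = 40          # assumed class size

# z-score of the observed sample mean under the null hypothesis
z = (xbar - mu0) / (sigma / sqrt(n))
z_crit = 1.645  # one-tailed critical value at the 95% confidence level

print(f"z = {z:.2f}")  # 3.16
print("reject H0" if z > z_crit else "fail to reject H0")
```

With these assumed values the z-score exceeds the threshold, so we would reject the null hypothesis; with a smaller class or a larger spread, the same 5-mark increase could easily be non-significant.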
NOTE:
We CANNOT ACCEPT the Null hypothesis, only REJECT it or FAIL TO REJECT it. Why?
The concept that null hypotheses cannot be "accepted" but can only be "rejected" or "fail to
be rejected" is a fundamental principle in hypothesis testing in statistics. This concept is rooted in the
philosophy of science and the way statistical inference works. Here's why it's the case:
• Burden of Proof: In hypothesis testing, the null hypothesis (often denoted as H0) states that there
is no effect or difference. It represents a default or null position, and it's up to the researcher to
provide evidence against this null hypothesis. The burden of proof is on the researcher to show
that there is a statistically significant effect or difference.
• Uncertainty: In statistical analysis, we deal with uncertainty. We use sample data to make
inferences about the entire population. Since we're dealing with sample data, some degree of
uncertainty is always involved. Even if we collect data and find that it doesn't strongly contradict
the null hypothesis, we can't definitively say that the null hypothesis is true. We can only say we
haven't found enough evidence to reject it.
• Type I and Type II Errors: When conducting hypothesis tests, two types of errors can occur: Type
I error (rejecting a true null hypothesis) and Type II error (failing to reject a false null hypothesis).
Researchers control the risk of Type I errors by choosing a significance level (alpha). However,
they cannot control Type II errors directly; the probability of a Type II error depends on sample size,
effect size, and variability. Saying that we "fail to reject" (rather than "accept") the null hypothesis
acknowledges that a non-significant result may reflect insufficient evidence, not a true null hypothesis.
• Continuous Testing: Scientific research is an ongoing process. New evidence, data, and
research can always emerge. Therefore, we typically don't definitively "accept" or "prove"
hypotheses; we gather evidence to support or refute them. Even if we fail to reject the null
hypothesis in one study, future studies might provide stronger evidence to reject it.
The language of hypothesis testing is rooted in the philosophy of scientific inquiry and the
recognition of uncertainty in data analysis. Rather than definitively "accepting" the null hypothesis, we
say that we either "reject" it based on the available evidence or "fail to reject" it because we haven't
found enough evidence to do so. This approach helps maintain a cautious and rigorous standard for
scientific conclusions.
ERRORS IN HYPOTHESIS TESTS
Let’s take an example to understand the concept of Hypothesis Testing. A person is on trial for
a criminal offense, and the judge needs to provide a verdict on his case. Now, there are four possible
combinations in such a case:
• First Case: The person is innocent, and the judge identifies the person as innocent
• Second Case: The person is innocent, and the judge identifies the person as guilty
• Third Case: The person is guilty, and the judge identifies the person as innocent
• Fourth Case: The person is guilty, and the judge identifies the person as guilty
As you can see, there can be two types of error in the judgment – Type 1 error, when the verdict
is against the person while he was innocent, and Type 2 error, when the verdict is in favor of the person
while he was guilty.
According to the Presumption of Innocence, the person is considered innocent until proven
guilty. That means the judge must find the evidence which convinces him “beyond a reasonable
doubt." This "beyond a reasonable doubt" standard can be understood as: the probability P(judge
decides guilty | person is innocent) should be small.
Another Example:
• A male human tested positive for being pregnant. Is it even possible? This indeed looks like a
case of False Positive. More formally, it is the incorrect rejection of a True Null Hypothesis. The Null
Hypothesis, in this case, would be that a male Human is not pregnant. This is a Type I Error
• A male human is pregnant, but the test supports the Null Hypothesis. This looks like a case of False
Negative. More formally, it is the failure to reject a false Null Hypothesis. This is a Type II Error
Now, we have defined a basic Hypothesis Testing framework. It is important to look into some of
the mistakes that are committed while performing Hypothesis Testing and try to classify those mistakes
if possible.
Now, look at the Null Hypothesis definition above. At first glance, we notice that it is a statement
subjective to the tester like you and me and not a fact. That means there is a possibility that the Null
Hypothesis can be true or false, and we may end up committing some mistakes on the same lines.
Levene's Test
Levene's test is used to test the null hypothesis that the samples to be compared come
from a population with the same variance. In this case, possible variance differences occur only by
chance, since each sampling has small differences.
If the p-value for the Levene test is greater than .05, then the variances are not significantly
different (i.e., the homogeneity assumption of the variance is met). If the p-value for Levene's test is less
than .05, then there is a significant difference between the variances.
H0: Groups have equal variances
H1: Groups have different variances
It is important to note that the mean values of the individual groups have no influence on the
result; they may differ. A big advantage of Levene's test is that it is very stable against violations of the
normal distribution. Therefore, Levene's test is used in many statistics programs.
Furthermore, the variance equality can also be checked graphically; this is usually done with a
grouped box plot or with a Scatterplot.
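As a sketch of how this works in practice, Levene's test is available in SciPy as `scipy.stats.levene`; the two groups below are made-up numbers chosen so that their spreads clearly differ:

```python
from scipy import stats

# Two small illustrative samples: group_a has visibly smaller spread than group_b.
group_a = [6, 7, 8, 9, 10, 11]
group_b = [1, 5, 9, 13, 17, 21]

# H0: the groups have equal variances; H1: they differ.
stat, p = stats.levene(group_a, group_b)

print(f"W = {stat:.2f}, p = {p:.4f}")
if p < 0.05:
    print("Variances differ significantly (homogeneity assumption violated)")
else:
    print("No significant difference in variances")
```

Note that SciPy's default uses the median-centered (Brown–Forsythe) variant, which is what gives the test its robustness against non-normality mentioned above.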
Significance Level
The significance level is determined before the test. If the calculated p-value is below this value,
the null hypothesis is rejected, otherwise, it is retained. As a rule, a significance level of 5 % is chosen.
• p < 0.01: very significant result.
• p < 0.05: significant result.
• p > 0.05: not significant result.
The significance level thus indicates the probability of a Type I error. What does this mean?
Suppose the significance level is 5% and the null hypothesis is rejected. Then there is at most a 5%
probability of having rejected a null hypothesis that is actually true, i.e., a 5% probability of making a
Type I error. If the critical value is reduced to 1%, this error probability drops to 1%, but it also becomes
harder to confirm the alternative hypothesis.
One-Tailed and Two-Tailed p Values
What Is A One-Tailed Test?
Let’s discuss the meaning of a one-tailed test. If you use a significance level of .05, a one-tailed
test allows all of your alpha to test the statistical significance in the one direction of interest. This means
that .05 is in one tail of the distribution of your test statistic. When using a one-tailed test, you are testing
for the possibility of the relationship in one direction and completely disregarding the possibility of a
relationship in the other direction. The one-tailed test provides more power to detect an effect in one
direction by not testing the effect in the other direction.
In a one-tailed z-test, the value 1.645 is the critical value associated with a significance level,
denoted alpha (α), of 0.05. If your test statistic (calculated from your sample data) is greater than
1.645, you reject the null hypothesis at the 0.05 significance level. You fail to reject the null hypothesis
if your test statistic is less than or equal to 1.645.
What Is A Two-Tailed Test?
Suppose you are using a significance level of 0.05. In that case, a two-tailed test allots half of
your alpha to test the statistical significance in one direction and half of your alpha to testing statistical
significance in the other direction. This means that .025 is in each tail of the distribution of your test
statistic. When using a two-tailed test, regardless of the direction of the relationship you hypothesize,
you are testing for the possibility of the relationship in both directions.
For a two-tailed test at a 95% confidence level, you typically use a significance level of 0.05
(5%). This significance level is split evenly between the two tails of the distribution, with 2.5% in each tail.
The critical value of 1.96 is chosen because it corresponds to the 2.5% cutoff in the tails of a standard
normal distribution (z-distribution). This means that if your test statistic (calculated from your sample
data) falls below -1.96 or above 1.96, you will reject the null hypothesis at the 0.05 significance level,
indicating a significant difference between your sample and the population parameter in either
direction.
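The one- and two-tailed critical values quoted above (1.645 and 1.96) can be recovered from the standard normal distribution, for example with SciPy:

```python
from scipy.stats import norm

alpha = 0.05
one_tailed = norm.ppf(1 - alpha)        # all of alpha in one tail
two_tailed = norm.ppf(1 - alpha / 2)    # alpha split between both tails

print(f"one-tailed critical z: {one_tailed:.3f}")  # 1.645
print(f"two-tailed critical z: {two_tailed:.2f}")  # 1.96
```

The same `ppf` (inverse CDF) call reproduces the critical value for any other significance level, e.g. `norm.ppf(0.99)` for a one-tailed test at α = 0.01.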
Single Sample and Two Samples
In the context of hypothesis testing, the terms "single sample" and "two samples" denote the
number of groups or sets of data being compared. These distinctions hold significance as they
determine the type of statistical tests that are appropriate for the analysis.
In a single sample, one is engaged in comparing the characteristics of a particular group to a
known value or a theoretical expectation. The data collected emanates from a singular group or
population. For instance, a single sample test would be employed when investigating whether the
average height of a group of individuals differs significantly from a known average height, such as the
average height of the general population.
On the other hand, when there are two distinct groups or populations and one wishes to compare
the characteristics of these two groups, a two-sample test is employed. The data derived from each
group is independent of one another, signifying that the observations in one group bear no relationship
to the observations in the other group.
Dependent and Independent Samples
In a DEPENDENT SAMPLE, the measures are related. It involves comparing two sets of
measurements derived from the same group or related individuals. These measurements are frequently
taken at distinct times or under varying conditions. This type of test is commonly referred to as a paired
sample or matched-pairs test.
For example, if an individual seeks to ascertain whether there is a noteworthy difference in the
blood pressure of individuals before and after a treatment, a dependent sample test would be
employed.
Example: If you take a sample of people who have had a knee operation and interview them before
and after the operation, this is a dependent sample. This is because the same person was interviewed
at two different times.
In INDEPENDENT SAMPLES, the values come from two or more different groups. Independent
samples refer to sets of data where the observations in one sample are not related or paired with the
observations in the other sample. An independent samples test is used when comparing the means or
other characteristics of two separate and unrelated groups. Another term often used to refer to
independent samples is unpaired samples. In the context of statistical analysis, independent samples
and unpaired samples are often used interchangeably to describe situations where observations in
one sample are not related or paired with observations in another sample.
More Than Two Dependent or Independent Samples
In the case of independent and dependent sampling, there can be more than two samples.
The important thing is that in the case of independent sampling, the individual groups or samples have
nothing to do with each other. In the case of dependent sampling, a respondent appears in all groups.
Why Is It Important to Know the Difference?
Whether the data at hand are from a dependent or an independent sample determines which
hypothesis test is used. For example, an independent samples t-test or an ANOVA without repeated
measures is calculated if the data are independent. If the data are dependent, a t-test for dependent
samples or an ANOVA with repeated measures is calculated.
Hypothesis Testing for Dependent and Independent Samples
T-Test
A t-test is a statistical hypothesis test that assesses sample means to draw conclusions about
population means. Frequently, analysts use a t-test to determine whether the population means for
two groups are different. The t-test is a statistical test procedure that tests whether there is a significant
difference between the means of two groups.
Figure 2. Sample of Groups. The two groups could be, for example, patients who received drug A once and drug
B once, and you want to know if there is a difference in blood pressure between these two groups
There are three types of t-tests. They all evaluate sample means using t-values, t-distributions, and
degrees of freedom to calculate statistical significance. It is a parametric analysis that compares one
or two group means.
Standard t-test
There are three standard t-tests: the one-sample t-test, the independent-samples t-test, and the
paired-samples t-test.
Suppose, as in the example above, we want to compare two pain-relief drugs. To do this, we
randomly divide 60 test subjects into two groups: the first group receives drug A; the second group
receives drug B. With an independent t-test, we can then test whether there is a significant
difference in pain relief between the two drugs.
Paired Samples t-Test
The t-test for dependent samples is used to compare the means of two dependent groups.
For example, if each subject is weighed before and after a diet, we can see for each person how big
the weight difference is between before and after. With a dependent t-test, we can then check
whether the mean difference is significant.
EXERCISES
a) Single Sample t-Test
An engineer measured the Brinell hardness of 25 pieces of ductile iron that were sub-critically
annealed. The resulting data were:
The engineer hypothesized that the mean Brinell hardness of all such ductile iron pieces is
greater than 170. Therefore, he was interested in testing the hypotheses:
H0 : μ = 170
HA: μ > 170
The engineer entered his data into Minitab and requested that the "one-sample t-test" be
conducted for the above hypotheses. He obtained the following output:
Descriptive Statistics
N    Mean     StDev   SE Mean
25   172.52   10.31   2.06
μ: mean of Brinell
Test
T-Value   P-Value
1.22      0.117
The output tells us that the average Brinell hardness of the n = 25 pieces of ductile iron was 172.52
with a standard deviation of 10.31. (The standard error of the mean "SE Mean", calculated by dividing
the standard deviation 10.31 by the square root of n = 25, is 2.06). The test statistic t* is 1.22, and the P-
value is 0.117.
If the engineer set his significance level α at 0.05 and used the critical value approach to
conduct his hypothesis test, he would reject the null hypothesis if his test statistic t* were greater than
1.7109 (determined using statistical software or a t-table):
Since the engineer's test statistic, t* = 1.22, is not greater than 1.7109, the engineer fails to
reject the null hypothesis. The test statistic does not fall in the "critical region." At the α = 0.05 level, there
is insufficient evidence to conclude that the mean Brinell hardness of all such ductile iron pieces is
greater than 170.
If the engineer used the P-value approach to conduct his hypothesis test, he would determine the
area under a t(n−1) = t24 curve and to the right of the test statistic t* = 1.22:
In the output above, Minitab reports that the P-value is 0.117. Since the P-value, 0.117, is greater
than α = 0.05, the engineer fails to reject the null hypothesis. There is insufficient evidence, at the α =
0.05 level, to conclude that the mean Brinell hardness of all such ductile iron pieces is greater than 170.
Note that the engineer obtains the same scientific conclusion regardless of the approach used. This
will always be the case.
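The critical value and P-value quoted in this example can be reproduced from the t-distribution with 24 degrees of freedom, for example with SciPy (the raw data are not shown, so we start from the reported test statistic):

```python
from scipy.stats import t

df = 24          # n - 1 = 25 - 1
t_star = 1.22    # test statistic reported by Minitab

t_crit = t.ppf(0.95, df)    # one-tailed critical value at alpha = 0.05
p_value = t.sf(t_star, df)  # area to the right of t* under the t24 curve

print(f"critical value: {t_crit:.4f}")  # 1.7109
print(f"P-value: {p_value:.3f}")        # 0.117
print("reject H0" if t_star > t_crit else "fail to reject H0")
```

Both decision rules agree, as the text notes: t* falls below the critical value exactly when the P-value exceeds α.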
Height of Sunflowers
A biologist was interested in determining whether sunflower seedlings treated with an extract
from Vinca minor roots resulted in a lower average height of sunflower seedlings than the standard
height of 15.7 cm. The biologist treated a random sample of n = 33 seedlings with the extract and
subsequently obtained the following heights:
H0 : μ = 15.7
HA: μ < 15.7
The biologist entered her data into Minitab and requested that the "one-sample t-test" be conducted
for the above hypotheses. She obtained the following output:
Descriptive Statistics
N    Mean     StDev   SE Mean
33   13.664   2.544   0.443
μ: mean of Height
Test
T-Value   P-Value
-4.60     0.000
The output tells us that the average height of the n = 33 sunflower seedlings was 13.664 with a
standard deviation of 2.544. (The standard error of the mean "SE Mean", calculated by dividing the
standard deviation 2.544 by the square root of n = 33, is 0.443). The test statistic t* is -4.60, and the P-
value is 0.000 to three decimal places.
Minitab Note. Minitab will always report P-values to only 3 decimal places. If Minitab reports the P-value
as 0.000, it really means that the P-value is 0.000....something. Throughout this course (and your future
research!), when you see that Minitab reports the P-value as 0.000, you should report the P-value as
being "< 0.001."
If the biologist set her significance level α at 0.05 and used the critical value approach to
conduct her hypothesis test, she would reject the null hypothesis if her test statistic t* were less than -
1.6939 (determined using statistical software or a t-table):
Since the biologist's test statistic, t* = -4.60, is less than -1.6939, the biologist rejects the null
hypothesis. That is, the test statistic falls in the "critical region." There is sufficient evidence, at the α =
0.05 level, to conclude that the mean height of all such sunflower seedlings is less than 15.7 cm.
If the biologist used the P-value approach to conduct her hypothesis test, she would determine
the area under a t(n−1) = t32 curve and to the left of the test statistic t* = -4.60:
In the output above, Minitab reports that the P-value is 0.000, which we take to mean < 0.001.
Since the P-value is less than 0.001, it is clearly less than α = 0.05, and the biologist rejects the null
hypothesis. There is sufficient evidence, at the α = 0.05 level, to conclude that the mean height of all
such sunflower seedlings is less than 15.7 cm.
Note again that the biologist obtains the same scientific conclusion regardless of the approach used.
This will always be the case.
Gum Thickness
A manufacturer claims that the thickness of the spearmint gum it produces is 7.5 one-hundredths of an
inch. A quality control specialist regularly checks this claim. On one production run, he took a random
sample of n = 10 pieces of gum and measured their thickness. He obtained:
H0 : μ = 7.5
HA: μ ≠ 7.5
The quality control specialist entered his data into Minitab and requested that the "one-sample t-test"
be conducted for the above hypotheses. He obtained the following output:
Descriptive Statistics
N    Mean   StDev    SE Mean
10   7.55   0.1027   0.0325
μ: mean of Thickness
Test
T-Value   P-Value
1.54      0.158
The output tells us that the average thickness of the n = 10 pieces of gums was 7.55 one-
hundredths of an inch with a standard deviation of 0.1027. (The standard error of the mean "SE Mean",
calculated by dividing the standard deviation 0.1027 by the square root of n = 10, is 0.0325). The test
statistic t* is 1.54, and the P-value is 0.158.
If the quality control specialist sets his significance level α at 0.05 and uses the critical value
approach to conduct his hypothesis test, he would reject the null hypothesis if his test statistic t* were
less than -2.2622 or greater than 2.2622 (determined using statistical software or a t-table):
Since the quality control specialist's test statistic, t* = 1.54, is not less than -2.2622 nor greater than
2.2622, the quality control specialist fails to reject the null hypothesis. That is, the test statistic does not
fall in the "critical region." There is insufficient evidence, at the α = 0.05 level, to conclude that the mean
thickness of all of the manufacturer's spearmint gum differs from 7.5 one-hundredths of an inch.
If the quality control specialist used the P-value approach to conduct his hypothesis test, he would
determine the area under a t(n−1) = t9 curve, to the right of 1.54 and to the left of -1.54:
In the output above, Minitab reports that the P-value is 0.158. Since the P-value, 0.158, is greater
than α = 0.05, the quality control specialist fails to reject the null hypothesis. There is insufficient
evidence, at the α = 0.05 level, to conclude that the mean thickness of all pieces of spearmint gum
differs from 7.5 one-hundredths of an inch.
Note that the quality control specialist obtains the same scientific conclusion regardless of the
approach used. This will always be the case.
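Here, too, the two-tailed critical values and P-value can be reproduced from the t-distribution with 9 degrees of freedom, starting from the reported test statistic:

```python
from scipy.stats import t

df = 9           # n - 1 = 10 - 1
t_star = 1.54    # test statistic reported by Minitab

t_crit = t.ppf(1 - 0.05 / 2, df)      # alpha/2 = 0.025 in each tail
p_value = 2 * t.sf(abs(t_star), df)   # two-tailed P-value

print(f"critical values: ±{t_crit:.3f}")  # ±2.262
print(f"P-value: {p_value:.3f}")          # matches Minitab's 0.158 up to rounding
print("reject H0" if abs(t_star) > t_crit else "fail to reject H0")
```

Note the two-tailed pattern: α is split into α/2 per tail, and the P-value doubles the one-tail area, exactly as described in the one- versus two-tailed section above.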
b) Paired t-test
To test the null hypothesis that the true mean difference is zero, the procedure is as follows:
1) State the null hypothesis and alternate hypothesis.
2) Choose an alpha level, α.
3) Find the critical value of t in a t-table.
4) Calculate the t-test statistic:
• Calculate the difference (di = yi − xi) between the two observations on each pair, making
sure you distinguish between positive and negative differences.
• Calculate the mean difference, d̄.
• Calculate the standard deviation of the differences, sd, and use it to calculate the
standard error of the mean difference, SE(d̄) = sd / √n.
• Calculate the t-statistic, t = d̄ / SE(d̄).
Under the null hypothesis, this statistic follows a t-distribution with n − 1 degrees of freedom.
5) Use t-distribution tables to compare your value of t to the t(n−1) distribution. This gives the
p-value for the paired t-test. Interpret the result.
Example:
Suppose a sample of n=20 students were given a diagnostic test before studying a particular
module and then again after completing the module. We want to find out if, in general, our teaching
leads to improvements in students’ knowledge/skills (i.e., test scores). We can use the results from our
sample of students to conclude the impact of this module in general.
Let x = test score before the module, y = test score after the module.
Student Pre-Module Score Post Module Score Difference
1 18 22 4
2 21 25 4
3 16 17 1
4 22 24 2
5 19 16 -3
6 24 29 5
7 17 20 3
8 21 23 2
9 23 19 -4
10 18 20 2
11 14 15 1
12 16 15 -1
13 16 18 2
14 19 26 7
15 18 18 0
16 20 24 4
17 12 18 6
18 22 25 3
19 15 19 4
20 17 16 -1
Mean difference d̄ = 2.05; standard deviation of the differences sd = 2.837
1) State the null and alternative hypotheses.
Null Hypothesis: The module has no significant impact on students' knowledge/skills, and
there is no improvement in test scores after completing the module.
Ho: μafter − μbefore = 0
Alternative Hypothesis: The module significantly improves students' knowledge/skills and
increases test scores after completing the module.
Ha: μafter − μbefore > 0
2) Choose the alpha value, α = 0.05
3) Find the critical value from the t-table with α = 0.05 and df = 20 − 1 = 19
4) tcritical = 1.729
5) Calculating the mean and standard deviation of the differences gives:
d̄ = 2.05 and sd = 2.837. Therefore,
SE(d̄) = sd / √n = 2.837 / √20 = 0.634
So, we have
t = d̄ / SE(d̄) = 2.05 / 0.634 = 3.23
The computed t is 3.23, which is greater than the critical t-value of 1.729, so the null
hypothesis is rejected. Therefore, there is strong evidence that, on average, the module
does lead to improvements.
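The whole calculation can be checked with SciPy's paired t-test (`scipy.stats.ttest_rel`; the one-sided `alternative` argument requires a reasonably recent SciPy). With these data the exact statistic is t ≈ 3.23:

```python
import numpy as np
from scipy import stats

# Pre- and post-module scores for the 20 students from the table above.
pre  = np.array([18, 21, 16, 22, 19, 24, 17, 21, 23, 18,
                 14, 16, 16, 19, 18, 20, 12, 22, 15, 17])
post = np.array([22, 25, 17, 24, 16, 29, 20, 23, 19, 20,
                 15, 15, 18, 26, 18, 24, 18, 25, 19, 16])

diff = post - pre
print(f"mean difference: {diff.mean():.2f}")         # 2.05
print(f"sd of differences: {diff.std(ddof=1):.3f}")  # 2.837

# One-sided paired t-test: H1 is that post - pre > 0.
t_stat, p_value = stats.ttest_rel(post, pre, alternative="greater")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

Since the one-sided p-value is far below α = 0.05 (equivalently, t exceeds the critical value 1.729), the null hypothesis is rejected, matching the hand calculation.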
c) Unpaired t-test
An unpaired t-test is used to compare two population means; the procedure is as follows:
1) State the null hypothesis and alternate hypothesis.
2) Choose an alpha level, α.
3) Find the critical value of t in a t-table.
4) Calculate the t-test statistic:
• Calculate the difference between the two sample means, x̄1 − x̄2.
• Calculate the standard error of the difference using the pooled standard deviation sp:
SE(x̄1 − x̄2) = sp √(1/n1 + 1/n2)
For the unpaired t-test to be valid, the two samples should be roughly normally distributed and
should have approximately equal variances. If the variances are clearly unequal, we must instead use:
SE(x̄1 − x̄2) = √(s1²/n1 + s2²/n2)
Then,
(x̄1 − x̄2) / SE(x̄1 − x̄2) ~ N(0, 1) if n1 and n2 are reasonably large.
Otherwise,
(x̄1 − x̄2) / SE(x̄1 − x̄2) ~ t(n′), where
n′ = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ],
rounded down to the nearest integer.
Example:
A U.S. magazine, Consumer Reports, carried out a survey of the calorie and sodium content of
a number of different brands of hotdog. There were three types of hotdog: beef, 'meat' (mainly pork
and beef but can contain up to 15% poultry) and poultry. The results below are the calorie content of
the different brands of beef and poultry hotdogs.
Beef hotdogs:
186, 181, 176, 149, 184, 190, 158, 139, 175, 148, 152, 111, 141, 153, 190, 157, 131, 149, 135, 132
Poultry hotdogs:
129, 132, 102, 106, 94, 102, 87, 99, 170, 113, 135, 142, 86, 143, 152, 146, 144
Before carrying out a t-test you should check whether the two samples are roughly normally
distributed. This can be done by looking at histograms of the data. In this case there are no outliers and
the data look reasonably close to a normal distribution; the t-test is therefore appropriate. So, first we
need to calculate the sample mean and standard deviation in each group:
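A sketch of the remaining steps using SciPy: `ttest_ind` performs the pooled-variance unpaired t-test, and `equal_var=False` switches to Welch's version for unequal variances:

```python
import numpy as np
from scipy import stats

# Calorie content of the beef and poultry hotdog brands listed above.
beef = np.array([186, 181, 176, 149, 184, 190, 158, 139, 175, 148,
                 152, 111, 141, 153, 190, 157, 131, 149, 135, 132])
poultry = np.array([129, 132, 102, 106, 94, 102, 87, 99, 170, 113,
                    135, 142, 86, 143, 152, 146, 144])

# Sample mean and standard deviation in each group.
print(f"beef: mean = {beef.mean():.2f}, sd = {beef.std(ddof=1):.2f}")
print(f"poultry: mean = {poultry.mean():.2f}, sd = {poultry.std(ddof=1):.2f}")

# Pooled-variance (standard) unpaired t-test:
t_pooled, p_pooled = stats.ttest_ind(beef, poultry, equal_var=True)
# Welch's t-test, used when the variances look clearly unequal:
t_welch, p_welch = stats.ttest_ind(beef, poultry, equal_var=False)

print(f"pooled:  t = {t_pooled:.2f}, p = {p_pooled:.4f}")
print(f"Welch's: t = {t_welch:.2f}, p = {p_welch:.4f}")
```

Welch's variant implements the unequal-variance formula and degrees-of-freedom adjustment given above; here both versions agree that the beef and poultry means differ significantly.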
Z Test
Z-test is a statistical method used for inferential analysis, specifically when comparing means of
two large data samples with a known standard deviation. It is applicable to populations that follow a
normal distribution and is commonly employed when the sample sizes are larger than 30. The test can
be utilized in two ways: the 1-sample analysis helps determine if a population mean differs from a
hypothesized value, while the 2-sample version assesses whether two population means differ.
Additionally, Z-tests are effective for comparing group means in statistical analysis. The z-test
definition also stresses an important assumption: the sample data come from a normally distributed
population, and no external factor influences the sample.
Z = (x̄ − μ) / (σ / √n)
Where:
x̄ – mean of the sample
μ – mean of the population
σ – standard deviation of the population
n – number of samples
The Formula for Z-Test (Two Proportions)
Z = (p̂1 − p̂2) / √( p̂ q̂ (1/n1 + 1/n2) )
where
p̂ = (x1 + x2) / (n1 + n2)
q̂ = 1 − p̂
and:
p̂1 and p̂2 are the sample proportions,
n1 and n2 are the sample sizes,
x1 and x2 are the numbers of successes in each sample.
Note: For a two-tailed test, the alpha level is split into α/2 per tail. Dividing by 2 is a standard
practice in two-tailed hypothesis testing to ensure that both extremes of the distribution are
considered, and it helps maintain the desired overall significance level for the test.
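A minimal sketch of the two-proportion z-test, using made-up counts (45 of 100 successes in one group, 30 of 100 in the other) purely for illustration:

```python
from math import sqrt

# Hypothetical counts, for illustration only:
x1, n1 = 45, 100   # successes and sample size, group 1
x2, n2 = 30, 100   # successes and sample size, group 2

p1_hat = x1 / n1
p2_hat = x2 / n2
p_hat = (x1 + x2) / (n1 + n2)   # pooled proportion under H0
q_hat = 1 - p_hat

# Standard error using the pooled proportion.
se = sqrt(p_hat * q_hat * (1 / n1 + 1 / n2))
z = (p1_hat - p2_hat) / se

print(f"z = {z:.2f}")  # 2.19
z_crit = 1.96          # two-tailed critical value at alpha = 0.05
print("reject H0" if abs(z) > z_crit else "fail to reject H0")
```

With these assumed counts, |z| = 2.19 exceeds 1.96, so the two proportions would be judged significantly different at α = 0.05.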
Calculating a Z-test requires the following steps:
1) State the null hypothesis and alternate hypothesis.
2) Choose an alpha level.
3) Find the critical value of z in a z table.
4) Calculate the z-test statistic (see below).
5) Compare the test statistic to the critical z value and decide whether to reject or fail to reject the
null hypothesis.
6) Interpret the result
Example:
The Dean claims that students in the College are above average in intelligence. A random
sample of 30 students' IQ scores has a mean of 112.5, and the mean population IQ is 100 with
a standard deviation of 15. Is there sufficient evidence to support the Dean's claim?
Given:
x̄ – mean of sample = 112.5
μ – mean of population = 100
σ – standard deviation of the population = 15
n – number of samples = 30
Z = (x̄ − μ) / (σ / √n) = (112.5 − 100) / (15 / √30) = 4.56
Procedure:
1) State the Null and Alternative hypothesis of the statement.
Null Hypothesis (Ho): The mean IQ of students in the College is equal to the mean population IQ.
(Meaning, the null hypothesis assumes that there is no significant difference in the mean
population IQ)
Ho: μ = 100
Alternative Hypothesis (Ha): The mean IQ of students in the College is above the mean population.
(Meaning, the alternative hypothesis suggests that the mean IQ of the students in the College is
greater than the mean population IQ)
Ha: μ > 100
2) Choose the alpha level, α = 0.05
3) Find the critical value of z in a z table, Zcrit = 1.645
4) Compute Z:
Z = (x̄ − μ) / (σ / √n) = (112.5 − 100) / (15 / √30) = 4.56
5) The computed z-value (Zcomp = 4.56) is greater than the critical value (Zcrit = 1.645), therefore, reject
the null hypothesis.
6) Conclusion: There is sufficient evidence to support the Dean's claim that the mean IQ of students
in the College is above the population average.
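The procedure above can be reproduced in a few lines of Python:

```python
from math import sqrt

# Values given in the example.
xbar, mu, sigma, n = 112.5, 100, 15, 30

# One-sample z-test statistic.
z = (xbar - mu) / (sigma / sqrt(n))
z_crit = 1.645  # one-tailed critical value at alpha = 0.05

print(f"z = {z:.2f}")  # 4.56
print("reject H0" if z > z_crit else "fail to reject H0")
```

Since 4.56 is far beyond the critical value 1.645, the null hypothesis is rejected, agreeing with the hand calculation.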
7) Conclusion:
1. The study conducted by the Internal Affairs Committee showed that the proportion of
dismissals to cases brought does not affect firms in the home districts.
2. The study conducted by the Education Committee showed that the proportion of
dismissals to cases brought does not affect firms in the home districts.
3. The study conducted by the ICT Committee showed that the proportion of dismissals to
cases brought affects firms in the home districts.
4. The study conducted by the Judicial Affairs Committee showed that the proportion of
dismissals to cases brought affects firms in the home districts.
5. The study conducted by the other committees showed that the proportion of dismissals
to cases brought does not affect firms in the home districts.
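The one-sample z-test from the Dean's-claim example can be sketched in Python. This is a minimal illustration; the function name z_test is ours, not from a library:

```python
# One-sample z-test for the Dean's-claim example above.
import math

def z_test(sample_mean, pop_mean, pop_sd, n):
    """Return the z statistic for a one-sample z-test."""
    return (sample_mean - pop_mean) / (pop_sd / math.sqrt(n))

z = z_test(112.5, 100, 15, 30)
z_crit = 1.645  # one-tailed critical value at alpha = 0.05

print(round(z, 2))   # 4.56
print(z > z_crit)    # True -> reject the null hypothesis
```

The same decision rule as in step 5 applies: since the computed z exceeds the critical value, the null hypothesis is rejected.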
CHI-SQUARE
The Chi-square test is a hypothesis test used to determine whether there is a relationship
between two categorical variables. The chi-square test checks whether the frequencies occurring in
the sample differ significantly from the frequencies one would expect. Thus, the observed frequencies
are compared with the expected frequencies and their deviations are examined. Categorical
variables are, for example, a person's gender, preferred newspaper, frequency of television viewing,
or their highest level of education.
Figure 3. The Chi-square test is used to investigate whether there is a relationship between gender and the
highest level of education.
Example
A study investigates whether a new drug reduces the incidence of a certain disease. Two groups
of patients are considered: one receiving the new drug and the other receiving a placebo.
Null Hypothesis (H0): There is no difference in disease incidence between the new drug and
the placebo groups.
Alternative Hypothesis (H1): There is a difference in disease incidence between the new drug
and the placebo groups.
Result: If the p-value is less than the significance level (usually 0.05), the null hypothesis is rejected,
indicating a significant difference in disease incidence between the
two groups.
Conclusion: In this hypothetical example, the calculated p-value is 0.026, which is less than 0.05.
Therefore, we reject the null hypothesis and conclude that the new drug significantly reduces the
incidence of the disease compared to the placebo.
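As a sketch of how such a 2x2 comparison could be computed, the following uses invented counts (10 of 100 diseased in the drug group, 22 of 100 in the placebo group); these numbers, and the resulting p-value, are illustrative only and do not reproduce the 0.026 quoted above:

```python
# Chi-square test for a hypothetical 2x2 drug vs. placebo table.
# The counts are invented for illustration, not taken from the text.
import math

def chi2_2x2(a, b, c, d):
    """Chi-square statistic and p-value (df = 1, no continuity correction)
    for a 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function for df = 1
    return chi2, p

chi2, p = chi2_2x2(10, 90, 22, 78)  # diseased/healthy in drug vs. placebo group
print(round(chi2, 2), round(p, 3))
print(p < 0.05)  # True -> reject H0 for these invented counts
```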
Calculate chi-squared
The chi-squared value is calculated with the equation:
χ² = Σₖ₌₁ⁿ (Oₖ − Eₖ)² / Eₖ

Where:
Oₖ – observed frequency of category k
Eₖ – expected frequency of category k
Example:
From the table of the chi-squared distribution, for a significance level of 5% and df = 1, the
critical value is 3.841. Since the calculated chi-squared value is smaller than this, there is no
significant difference. As a prerequisite for this test, note that all expected frequencies must be
greater than 5.
Example:
Question: Does gender have an influence on whether a person has a Netflix subscription or not? For
the two variables gender (male, female) and Netflix subscription (yes, no), it is tested whether they
are independent. If this is not the case, there is a relationship between the characteristics.
The research question that can be answered with the Chi-square test is: Are the characteristics
of gender and ownership of a Netflix subscription independent of each other?
In order to calculate the chi-square, an observed and an expected frequency must be given.
In the independence test, the expected frequency is the one that results when both variables are
independent. If two variables are independent, the expected frequencies of the individual cells are
obtained with the equation below.
f(i, j) = (RowSum(i) × ColumnSum(j)) / N

Where i and j are the row and column of the table, respectively.
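A minimal sketch of this expected-frequency computation (the table values below are invented for illustration):

```python
# Expected cell frequencies under independence:
# E(i, j) = RowSum(i) * ColumnSum(j) / N
def expected_frequencies(observed):
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    n = sum(row_sums)
    return [[r * c / n for c in col_sums] for r in row_sums]

# Illustrative 2x2 table (gender x Netflix subscription); counts are invented.
observed = [[40, 60],
            [35, 65]]
for row in expected_frequencies(observed):
    print([round(x, 1) for x in row])
```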
For the fictitious Netflix example, the following tables could be used. On the left is the table with
the frequencies observed in the sample, and on the right is the table that would result if perfect
independence existed.
From the Chi-square table, read the critical value and compare it with the computed result.
The assumptions for the Chi-square independence test are that the observations are from a
random sample and that the expected frequencies per cell are greater than 5.
If a variable is present with two or more values, the differences in the frequency of the individual
values can be examined.
The Chi-square distribution test, or Goodness-of-fit test, checks whether the frequencies of the
individual characteristic values in the sample correspond to the frequencies of a defined distribution.
In most cases, this defined distribution corresponds to that of the population. In this case, it is tested
whether the sample comes from the respective population.
For market researchers it could be of interest whether there is a difference in the market
penetration of the three video streaming services YouTube, Netflix, and NBA between Manila and the
whole of the Philippines. The expected frequency is then the distribution of streaming services
throughout the Philippines, and the observed frequency results from a survey in Manila. The
fictitious results are shown in the following tables:

Observed (Manila)        Expected (Philippines)
YouTube   25             YouTube   23
Netflix   29             Netflix   26
NBA       13             NBA       16
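Using the figures in the table above, the goodness-of-fit statistic can be computed as a quick sketch (treating the Philippine-wide column as the given expected frequencies):

```python
# Goodness-of-fit check for the Manila streaming survey above.
observed = {"YouTube": 25, "Netflix": 29, "NBA": 13}   # Manila sample
expected = {"YouTube": 23, "Netflix": 26, "NBA": 16}   # Philippine-wide distribution

chi2 = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)
df = len(observed) - 1          # 3 categories -> df = 2
crit = 5.991                    # chi-square critical value, df = 2, alpha = 0.05

print(round(chi2, 3))           # 1.083
print(chi2 > crit)              # False -> no significant difference
```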
The Chi-square homogeneity test can be used to check whether two or more samples come
from the same population. One question could be whether the subscription frequencies of the video
streaming services YouTube, Netflix, and NBA differ across age groups. As a fictitious example, a
survey is made in three age groups with the following result:
Observed Frequency
Age               15 - 25   26 - 35   36 - 45
YouTube              25        23        20
Netflix              29        30        33
NBA                  11        13        12
Others or None       16        24        26
As with the Chi-square independence test, this result is compared with the table that would
result if the distributions of Streaming providers were independent of age.
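A sketch of that comparison for the table above, computing each expected frequency from the row and column sums and summing the chi-square contributions:

```python
# Chi-square homogeneity test for the age-group table above.
observed = [
    [25, 23, 20],   # YouTube
    [29, 30, 33],   # Netflix
    [11, 13, 12],   # NBA
    [16, 24, 26],   # Others or None
]
row_sums = [sum(r) for r in observed]
col_sums = [sum(c) for c in zip(*observed)]
n = sum(row_sums)

# Expected frequency per cell: RowSum(i) * ColumnSum(j) / N
chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_sums[i] * col_sums[j] / n
        chi2 += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)  # (4 - 1) * (3 - 1) = 6
crit = 12.592  # chi-square critical value, df = 6, alpha = 0.05

print(round(chi2, 2), df)
print(chi2 > crit)   # False -> no significant age effect in this fictitious data
```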
ANALYSIS OF VARIANCE (ANOVA)
There are different types of analysis of variance; the most common are the one-way and two-
way analysis of variance, each of which can be calculated either with or without repeated
measurements.
• One-factor (or one-way) ANOVA
• One-factor ANOVA with repeated measurements
• Two-factors (or two-way) ANOVA
• Two-factors ANOVA with repeated measurements
Difference between One-Way and Two-Way ANOVA
The one-way analysis of variance only checks whether an independent variable has an
influence on a metric dependent variable. This is the case, for example, if it is to be examined whether
the place of residence (independent variable) has an influence on the salary (dependent variable).
However, if two factors, i.e. two independent variables, are considered, a two-factor analysis of
variance must be used.
Two-factor analysis of variance tests whether there is a difference between more than two
independent samples that are split between two variables or factors.
EXAMPLE:
With the help of the independent variable, e.g. highest educational qualification with the three
levels Group 1, Group 2, and Group 3, as much variance as possible of the dependent variable
salary should be explained. In the graphic below, under A) a lot of variance can be explained by
the three groups, and under B) only very little variance.
Accordingly, in case A) the groups have a very high influence on the salary and in case B) they
do not.
In the case of A), the values in the respective groups deviate only slightly from the group mean,
the variance within the groups is therefore very small. In the case of B), however, the variance within
the groups is large. The variance between the groups is the other way around; it is large in the case of
A) and small in the case of B). In the case of B) the group means are close together, in the case of A)
they are not.
            Variance within the groups    Variance between group means
Case A)     Small                         Large
Case B)     Large                         Small
ANALYSIS OF VARIANCE HYPOTHESES
The null hypothesis and the alternative hypothesis result from a one-way analysis of variance as
follows:
• Null hypothesis H0: The mean value of all groups is the same.
• Alternative hypothesis H1: There are differences in the mean values of the groups.
The results of the ANOVA can only make a statement about whether there are differences
between at least two groups. However, it cannot be determined which groups are exactly different. A
post-hoc test is needed to determine which groups differ. There are various methods to choose from,
with Duncan, Dunnett's C, and Scheffé being among the most common methods.
If the computed F is less than the F critical value in the F distribution table, we fail to reject the
null hypothesis. This means we don’t have sufficient evidence to say that there is a statistically
significant difference between the mean scores of the groups.
Degrees of Freedom, df
Total df: dft = total number of items − 1 = 15 − 1 = 14
Between-groups df: dfr = number of columns − 1 = 3 − 1 = 2
Within-groups df: dfe = total df − between-groups df = 14 − 2 = 12
Compute the Mean Sum of Squares, MSS
Mean Regression Sum of Squares (MSSR):
MSR = SSR / dfr = 0.5362 / 2 = 0.2681
Mean Sum of Squares of Error (MSSE):
MSE = SSE / dfe = 0.3451 / 12 = 0.02876
Compute the F statistic:
F = MSR / MSE = 0.2681 / 0.02876 = 9.32
We now compare the computed F value with the tabular F value taken from the table of critical
values of F. This value is at the intersection of the df of the MSSR (dfr = 2) and the df of the MSSE
(dfe = 12). From the table, at dfr = 2 and dfe = 12, the critical value at the 5% significance level is 3.88.
Since the F test statistic in the ANOVA table is greater than the F critical value in the F distribution
table, we reject the null hypothesis and retain the alternative hypothesis. This means we have sufficient
evidence to say that there is a statistically significant difference between the mean moisture content
of arrow root starch dried in 3, 4, and 5 drying days.
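The arithmetic of this ANOVA decision can be sketched directly from the sums of squares given above (SSR = 0.5362, SSE = 0.3451):

```python
# F statistic from the sums of squares in the worked example.
SSR, SSE = 0.5362, 0.3451   # between-groups and within-groups sums of squares
df_r, df_e = 2, 12

MSR = SSR / df_r
MSE = SSE / df_e
F = MSR / MSE

F_crit = 3.88               # F(2, 12) critical value at the 5% level
print(round(F, 2))          # 9.32
print(F > F_crit)           # True -> reject H0
```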
Tukey’s Test
To find which of the means are significantly different from each other, Tukey’s test was
conducted using an online calculator.
The means of the following pairs are significantly different: Sample A and Sample B, and
Sample A and Sample C.
References:
DATAtab Team (2023). DATAtab: Online Statistics Calculator. DATAtab e.U. Graz, Austria. URL
https://datatab.net
Kumari, Kajal (May 18, 2022). Hypothesis Testing for Data Science and Analytics. Retrieved in August
2023 from https://www.analyticsvidhya.com/blog/2022/05/hypothesis-testing-for-data-
science-and-analytics/
Meena, Subhash (May 17, 2023). Difference Between Z-Test and t-Test. Retrieved in August 2023 from
https://www.analyticsvidhya.com/blog/2020/06/statistics-analytics-hypothesis-testing-z-test-t-
test/
Muwaya, Monica Seles (June 23, 2022). Hypothesis Testing in Inferential Statistics. Retrieved in August
2023 from https://www.analyticsvidhya.com/blog/2022/06/hypothesis-testing-in-inferential-
statistics
Thakur, Madhuri (July 26, 2023). Z-test statistics formula. EDUCBA. Retrieved on November 7, 2023
from https://www.educba.com/z-test-statistics-formula/.