Parametric - Statistical Analysis PDF
ANALYSIS
PARAMETRIC STATISTICS
Parametric Statistics
Parametric statistical procedures are
inferential procedures that rely on testing
claims regarding parameters such as the
population mean, the population standard
deviation, or the population proportion.
In some circumstances, the use of parametric
procedures requires that certain requirements
regarding the distribution of the population,
such as normality, be satisfied.
Parametric Statistics
✦ Assume underlying statistical
distributions in the data. Therefore,
several conditions of validity must be met
so that the result of a parametric test is
reliable.
✦ Apply to data in ratio scale, and some
apply to data in interval scale.
Two Common Forms of
Statistical Inference
1. Estimation
2. Hypothesis Testing
Estimating the Value of
a Parameter
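Margin of Error:
$E = z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$ or $E = z_{\alpha/2}\,\frac{s}{\sqrt{n}}$
Confidence interval:
$\bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$
where $\bar{x}$ is the Point Estimator and $z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$ is the Margin of Error.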
Confidence intervals about a
population Mean where the Population
Standard Deviation is Unknown
Case 2:
σ is unknown and n ≥ 30
$\bar{x} \pm z_{\alpha/2}\,\frac{s}{\sqrt{n}}$
Note:
$\bar{x} \pm t_{\alpha/2}\,\frac{s}{\sqrt{n}}$
$8.17 \pm 1.96\left(\frac{1.2}{\sqrt{1120}}\right)$
$8.17 \pm 0.0703 = (8.0997, 8.2403)$
… the population mean is between 8.10 and 8.24”.
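The interval above can be checked with a short calculation. This is a minimal Python sketch, not part of the original slides; the function name z_confidence_interval is hypothetical, and the inputs (mean 8.17, standard deviation 1.2, n = 1120, 95% confidence) are read off the worked numbers.

```python
import math
from statistics import NormalDist

def z_confidence_interval(mean, sd, n, confidence=0.95):
    """Return (lower, upper, margin) for a z-based interval about a mean."""
    alpha = 1 - confidence
    # z_{alpha/2}: the standard normal quantile, about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sd / math.sqrt(n)  # E = z_{alpha/2} * sd / sqrt(n)
    return mean - margin, mean + margin, margin

# Values assumed from the worked example above
lower, upper, margin = z_confidence_interval(8.17, 1.2, 1120)
print(f"margin = {margin:.4f}, interval = ({lower:.4f}, {upper:.4f})")
```

The same function covers Case 2 (σ unknown, n ≥ 30) by passing the sample standard deviation s as sd.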
Example 2
Suppose we would like to
estimate the mean
amount of money spent on
books by BS Statistics
students in a semester. We
have data from 20
randomly selected
students. Construct and
interpret a 95%
confidence interval.
Solution:
We will apply Case 3, since n <30 and σ is
unknown.
$20.1 \pm 1.645\left(\frac{3.2}{\sqrt{40}}\right)$
$20.1 \pm 0.8323 = (19.2677, 20.9323)$
… population mean is between 19.27 and 20.93”.
Example 4
A corporation monitors time spent by office
workers browsing the web on their computers
instead of working. In a sample of computer
records of 15 workers, construct a 99%
confidence interval for the mean time spent by
selected office workers in browsing the web in an
eight-hour day.
Solution:
We will apply Case 3, since n <30 and σ is
unknown.
The point estimate for the population
proportion is
$\hat{p} = \frac{x}{n}$
where x is the number of individuals in the
sample with the specified characteristic and n is
the sample size.
Confidence Intervals About
a Population Proportion
Suppose a simple random sample of size n is
taken from a population. A (1 − α) × 100% confidence
interval for p is given by the following
quantities:
$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
Note:
$\frac{(n-1)s^2}{\chi^2_{\alpha/2}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}$
$13.2503 < \sigma^2 < 56.9836$
Definition:
1. Null Hypothesis
2. Alternative Hypothesis
Null Hypothesis
• Denoted by Ho
• The statement being tested.
• Assumed true until evidence indicates
otherwise.
• Must contain the condition of equality
and must be written with the symbol =, ≤ ,
or ≥ .
Example:
✦ Students who eat breakfast and students who do not
will perform the same on a math exam.
✦ Students who experience test anxiety prior to an
English exam and students who do not will get the
same scores.
✦ Motorists who talk on the phone while driving and
motorists who do not will make the same number of
errors on a driving course.
Alternative Hypothesis
• Denoted by Ha
• Statement that must be true if the null
hypothesis is false.
• Sometimes referred to as the research
hypothesis.
• Must not contain the condition of equality and
must be written with the symbol ≠, < or >.
Example:
✦ Students who eat breakfast will perform better
on a math exam than students who do not eat
breakfast.
✦ Students who experience test anxiety prior to an
English exam will get higher scores than students
who do not experience test anxiety.
✦ Motorists who talk on the phone while driving will
be more likely to make errors on a driving course
than those who do not talk on the phone.
Remember:
If you are conducting a research study
and you want to use a hypothesis test to
support your claim, the claim must be
stated in such a way that it becomes
the alternative hypothesis, so it
cannot contain the condition of
equality.
Two Types of Alternative Test
Type II Error
BFAD does not allow the release of an
effective drug.
Remember:
It is important to note that we want to
set α before we start our study because
the Type I error is the more ‘grievous’
error to make.
The smaller ( α ) is, the smaller the region
of rejection.
3. Determine the Test
Distribution to Use
Decision Rule:
Reject the null hypothesis if the hypothesized
value of the parameter is not within the range
specified by the confidence interval.
Using P - Value Approach
Decision Rule:
Reject the null hypothesis if the computed
p-value is less than or equal to the set
significance level; otherwise, do not reject
the null hypothesis.
Example:
If the level of significance α = 0.05,
P-value   Decision
0.01      Reject Ho
0.05      Reject Ho
0.10      Fail to reject Ho
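The decision rule can be written as a one-line comparison. The sketch below is illustrative only; the function name p_value_decision is hypothetical, and α = 0.05 is taken from the example.

```python
def p_value_decision(p_value, alpha=0.05):
    """Apply the p-value rule: reject Ho when p-value <= alpha."""
    return "Reject Ho" if p_value <= alpha else "Fail to reject Ho"

# The three p-values from the example table
for p in (0.01, 0.05, 0.10):
    print(p, p_value_decision(p))
```

Note that a p-value exactly equal to α falls in the "reject" branch, as in the second row of the table.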
Using Traditional Method
Decision Rule:
Normal Curve
Properties of a Normal Curve
1. The normal curve is bell-shaped and
symmetric about the mean.
2. The mean, median and mode are equal.
3. The total area under the curve is equal to
one.
4. The normal curve approaches, but never
touches, the x-axis as it extends farther and
farther away from the mean.
Testing Normality of the Data
To determine if the data follow a
normal distribution, we can use the
graphical or numerical method.
Graphical:
Histogram and Normal Q-Q Plot
Numerical:
Kolmogorov Smirnov Test
Lilliefors
Anderson - Darling Test
Shapiro Wilk Test
How to Check Normality?
A histogram plots the observed values against their
frequency and gives a visual indication of whether the
distribution is bell-shaped or not.
How to Check Normality?
Q-Q probability plots display the observed values
against normally distributed data (represented by the
line).
Remember:
This test has been shown to be less powerful than the other
tests in most situations. It is included only because of its
historical popularity. Some published articles would say “The
Kolmogorov-Smirnov test is only a historical curiosity. It
should never be used."
Test                 p-value   Decision              Conclusion
Kolmogorov-Smirnov   < 0.000   Reject Ho             Not Normal
Lilliefors           0.0571    Failed to Reject Ho   Normal
Anderson-Darling     0.2178    Failed to Reject Ho   Normal
Ho : μ = μo Ho : μ = μo Ho : μ = μo
Ha : μ ≠ μo Ha : μ < μo Ha : μ > μo
$372.5 - 1.96\left(\frac{15}{\sqrt{25}}\right) \le \mu \le 372.5 + 1.96\left(\frac{15}{\sqrt{25}}\right)$
$366.62 \le \mu \le 378.38$
If this interval contains the hypothesized mean
(368), we do not reject the null hypothesis. Since the
computed interval contains the hypothesised mean
(368), we fail to reject the null hypothesis.
Testing a Claim About a
Proportion
We can test a claim about a proportion, percentage,
or probability, as illustrated in these examples:
Based on a sample survey, fewer than ¼ of all
college graduates smoke.
The percentage of physicians leaving the country
is equal to 15%.
If a driver is fatally injured in a car crash, there is
a 0.35 probability that the driver was legally
impaired.
One Sample Proportion
Test
The One-Sample Proportion Test is used to
assess whether a population proportion (P1)
is significantly different from a
hypothesized value (P0). The hypotheses
may be stated in terms of the proportions,
their difference, their ratio, or their odds
ratio, but all four hypotheses result in the
same test statistics.
Assumptions
1. The conditions for a binomial experiment are
satisfied. That is, we have a fixed number of
independent trials having constant
probabilities, and each trial has two outcome
categories, which we classify as “success” and
“failure”.
2. The conditions npo ≥ 5 and n(1 − po) ≥ 5 are
both satisfied, so the binomial distribution of
sample proportions can be approximated by a
normal distribution with $\mu = np$ and
$\sigma = \sqrt{np(1 - p)}$.
Hypotheses
Two-Tailed      Left-Tailed     Right-Tailed
Ho : p = po     Ho : p = po     Ho : p = po
Ha : p ≠ po     Ha : p < po     Ha : p > po
Student 1 2 3 4 5 6
Weight 135 119 106 135 180 108
Solution:
Step 1:
Ho : μ = 140 lbs
Ha : μ > 140 lbs
Step 2:
α = 0.10
Step 3:
Since σ is not given and n is less than 30, we
will use the One-Sample t-Test (Case 3) and a
right-tailed test.
Step 4:
df = 6 − 1 = 5
t0.10,5 = 1.476 (rejection region: t > 1.476)
Solution:
Step 5: If the test statistic is greater than CV(1.476), reject
the null hypothesis; otherwise, fail to reject the
null hypothesis.
Step 6:
We can solve the test statistic and p-value of
One - Sample t - Test using RStudio. TV(-0.852)
and p-value(0.784)
Step 7:
Since test statistic (-0.852) is less than CV(1.476), we
fail to reject Ho, therefore we don’t have enough evidence
to support the claim of the teacher.
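The slides obtain the test statistic from RStudio. As a cross-check, the same statistic can be computed by hand; the Python sketch below is an illustration under the values given in the example (six weights, hypothesized mean 140 lbs, critical value 1.476 from Step 4), and the variable names are hypothetical.

```python
import math
from statistics import mean, stdev

weights = [135, 119, 106, 135, 180, 108]  # sample data from the example
mu0 = 140                                  # hypothesized mean under Ho
n = len(weights)

# One-sample t statistic: (x-bar - mu0) / (s / sqrt(n))
t_stat = (mean(weights) - mu0) / (stdev(weights) / math.sqrt(n))

critical = 1.476            # t_{0.10, 5}, taken from Step 4
reject = t_stat > critical  # right-tailed decision rule from Step 5
print(f"t = {t_stat:.3f}, reject Ho: {reject}")
```

The p-value is not computed here because the standard library has no t distribution; a statistics package would be needed for that part.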
INFERENCE ABOUT TWO
MEANS
• INDEPENDENT SAMPLE Z- TEST
• INDEPENDENT SAMPLE T - TEST
• PAIRED SAMPLE T - TEST
Inference About Two
Means
H0 : μ1 − μ2 = 0 and Ha : μ1 − μ2 > 0
H0 : μ1 − μ2 = 0 and Ha : μ1 − μ2 ≠ 0
Independent Sample z -
Test
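Case 1: σ1 = σ2 = σ
$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
Case 2: σ1 ≠ σ2
$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$
where:
$\sigma_1 = \sqrt{\frac{\sum (x - \bar{x}_1)^2}{N_1}}$ and $\sigma_2 = \sqrt{\frac{\sum (x - \bar{x}_2)^2}{N_2}}$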
Rejection Region (two-tailed): z < −2.576 or z > 2.576
Solution:
Step 5:
If the test statistic is less than CV(−2.576) or
greater than CV(2.576), reject the null
hypothesis; otherwise, fail to reject the null
hypothesis.
Step 6:
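$z = \frac{(\bar{x}_A - \bar{x}_B) - (\mu_A - \mu_B)}{\sqrt{\frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B}}} = \frac{(250 - 240) - 0}{\sqrt{\frac{20^2}{40} + \frac{15^2}{30}}} = 2.39$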
Solution:
Step 7:
Since the test statistic (2.39) is greater than
CV(−2.576) and less than CV(2.576), we fail to
reject Ho; therefore, there is no significant
difference in the yield per hectare of varieties A
and B. The reputation that both are high-yielding
varieties is shown to be consistent. The studies
suggesting that the difference in yield is
significant are not conclusive.
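The Step 6 arithmetic can be reproduced with a few lines. This Python sketch assumes the values shown in the formula (means 250 and 240, standard deviations 20 and 15, sample sizes 40 and 30); the function name two_sample_z is hypothetical.

```python
import math

def two_sample_z(mean1, mean2, sd1, sd2, n1, n2, diff0=0.0):
    """z statistic for two independent samples with known, unequal sigmas."""
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)  # standard error of the difference
    return ((mean1 - mean2) - diff0) / se

z = two_sample_z(250, 240, 20, 15, 40, 30)
critical = 2.576  # two-tailed critical value used in Step 5
print(f"z = {z:.2f}, reject Ho: {abs(z) > critical}")
```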
Example 2:
Suppose we put people on 2 diets, “the fruit diet and
the bread diet”. Participants are randomly assigned
to either 7-days of eating exclusively fruits or 7-
week of exclusively eating bread. At the end of the
day, we measure weight gain by each participant.
Does the bread diet cause more weight gain compared to
the fruit diet? Test the claim using 10% level of
significance.
Fruit Diet 3 4 4 4 5 6 6
Bread Diet 1 2 2 2 3 4 4
Solution:
Step 1:
Ho : μF − μB = 0
There is no significant difference between bread and fruit
diet.
Ha : μF − μB < 0
Bread diet causes more weight gain compared to fruits diet.
Step 2:
α = 0.10
Solution:
Step 3:
Since σ is not given and n is less than 30, we
will use the Independent-Sample t-Test and a
left-tailed test, but we first need to use the
F-test to determine whether Case 1 or Case 2 applies.
Step 4:
df = 7 + 7 − 2 = 12
t0.10,12 = −1.356 (rejection region: t < −1.356)
Before we proceed to the t.test() command, we must first
check whether the variances are homogeneous. Use the
var.test() command for Fisher's F-test.
We obtained a p-value greater than 0.10, so the two
variances are homogeneous; therefore, we will use the
Independent-Sample t-Test (Case 1).
Solution:
Step 5:
If the test statistic is less than CV(−1.356), reject the
null hypothesis; otherwise, fail to reject the null
hypothesis.
Step 6:
We can solve the test statistic and p-value of
independent - Sample t - Test using RStudio.
TV(3.300) and p-value(0.997)
Solution:
Step 7:
Since the test statistic (3.300) is greater than
CV(−1.356), we fail to reject Ho; therefore,
there is no significant difference between
bread and fruit diet.
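The test value reported from RStudio can be reproduced with the pooled-variance formula (Case 1, equal variances). The Python sketch below is illustrative; it uses the diet data from the example and the critical value from Step 4, and the variable names are hypothetical.

```python
import math
from statistics import mean, variance

fruit = [3, 4, 4, 4, 5, 6, 6]
bread = [1, 2, 2, 2, 3, 4, 4]
n1, n2 = len(fruit), len(bread)

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n1 - 1) * variance(fruit) + (n2 - 1) * variance(bread)) / (n1 + n2 - 2)
se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Statistic for Ho: mu_F - mu_B = 0
t_stat = (mean(fruit) - mean(bread)) / se

critical = -1.356           # t_{0.10, 12} for the left-tailed test
reject = t_stat < critical  # decision rule from Step 5
print(f"t = {t_stat:.3f}, reject Ho: {reject}")
```

The homogeneity check (the F-test done with var.test() in the slides) is not repeated here; equal variances are assumed because that is what the slides concluded.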
Example 3:
An apartment rental agent tells the personnel
manager of a firm thinking of building a plant in
the agent’s city that the mean rental rates for
two-bedroom apartments are the same in sectors
A and B of the city. To test this claim, the
personnel manager randomly samples
apartment complexes in each sector and
obtains the following data.
Example 3 (cont.):
Sector A          Sector B
x̄1 = $595         x̄2 = $580
n1 = 10           n2 = 12
s1 = $62          s2 = $32
s1² = 3,844       s2² = 1,024
H0 : μd = 0 and Ha : μd > 0
H0 : μd = 0 and Ha : μd ≠ 0
Note: μ1 − μ2 = μd
Step 2:
α = 0.05
Solution:
Step 3:
Since there are two groups that are related,
we will use the Paired-Sample t-Test and a
two-tailed test.
Step 4:
df = 8 − 1 = 7
t0.05,7 = ±2.365 (rejection regions: t < −2.365 or t > 2.365)
Solution:
Step 5: If the test statistic is less than CV(−2.365) or greater
than CV(2.365), reject the null hypothesis;
otherwise, fail to reject the null hypothesis.
Step 6:
x1    x2    d     d − d̄    (d − d̄)²
85    80     5      3         9
84    88    −4     −6        36
80    76     4      2         4
93    90     3      1         1
83    74     9      7        49
71    70     1     −1         1
79    81    −2     −4        16
83    83     0     −2         4
Sum         16              120
$\bar{x}_d = \frac{16}{8} = 2.0$
$s_d = \sqrt{\frac{120}{8 - 1}} = 4.1404$
$t = \frac{\bar{x}_d - \mu_d}{s_d / \sqrt{n}} = \frac{2.0 - 0}{4.1404 / \sqrt{8}} = 1.366$
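The table arithmetic in Step 6 can be verified directly from the paired values. This is a small Python sketch using the eight pairs from the table; the critical value 2.365 comes from Step 4, and the variable names are hypothetical.

```python
import math
from statistics import mean, stdev

x1 = [85, 84, 80, 93, 83, 71, 79, 83]
x2 = [80, 88, 76, 90, 74, 70, 81, 83]

d = [a - b for a, b in zip(x1, x2)]  # paired differences
d_bar = mean(d)                      # mean difference
s_d = stdev(d)                       # sample standard deviation of differences

# Paired t statistic for Ho: mu_d = 0
t_stat = (d_bar - 0) / (s_d / math.sqrt(len(d)))

critical = 2.365                  # two-tailed critical value, df = 7
reject = abs(t_stat) > critical   # decision rule from Step 5
print(f"d_bar = {d_bar}, s_d = {s_d:.4f}, t = {t_stat:.3f}, reject Ho: {reject}")
```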
Solution:
Step 7:
Before Training: 85 84 86 87 89 82 80 84 86 82 89 87 82 81 86 89 89 84 85 88
After Training:  95 98 97 92 96 93 94 95 90 82 97 98 95 95 92 91 94 95 96 97
Solution:
Step 1:
Ho : μd = 0
There is no significant difference in the teaching
performance of the teachers before and after training.
Ha : μd < 0
The training course increases the teaching performance of
the teachers who attended the training.
Step 2:
α = 0.10
Solution:
Step 3:
Since there are two groups that are related,
we will use the Paired-Sample t-Test and a
left-tailed test.
Step 4:
df = 20 − 1 = 19
t0.10,19 = −1.729 (rejection region: t < −1.729)
Solution:
Step 5: If the test statistic is less than CV(−1.729), reject the
null hypothesis; otherwise, fail to reject the null
hypothesis.
Step 6:
We can solve the test statistic and p-value of the
Paired-Sample t-Test using RStudio. TV(−9.697)
and p-value (< 0.001)
Step 7:
Since the test statistic (−9.697) is less than CV(−1.729),
we reject Ho; therefore, the training course helps to
increase the teaching performance of the teachers
who attended the training.
Two Sample Proportion
Test
A two-proportion z-test allows you to compare
two proportions to see if they are the same.
It is used when testing a hypothesis about two
population proportions, such as the proportion
of cured patients in a population given some
treatment and in a second population given a
placebo.
Two - Proportion z -
Test
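Test Statistic:
$z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\frac{\hat{p}(1 - \hat{p})}{n_1} + \frac{\hat{p}(1 - \hat{p})}{n_2}}}$
Where:
$\hat{p}_1 = \frac{x_1}{n_1}$, $\hat{p}_2 = \frac{x_2}{n_2}$, $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$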
Assumptions
1. We have two independent sets of
randomly selected sample data.
2. For both samples, the conditions np ≥ 5
and n(1 − p) ≥ 5 are satisfied.
Hypotheses
H0 : p1 − p2 = 0 and Ha : p1 − p2 < 0
H0 : p1 − p2 = 0 and Ha : p1 − p2 > 0
H0 : p1 − p2 = 0 and Ha : p1 − p2 ≠ 0
Step 4:
z0.01 = 2.33 (rejection region: z > 2.33)
Solution:
Step 5: If the test statistic is greater than CV(2.33), reject
the null hypothesis; otherwise, fail to reject the
null hypothesis.
Step 6:
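$z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\frac{\hat{p}(1 - \hat{p})}{n_1} + \frac{\hat{p}(1 - \hat{p})}{n_2}}} = \frac{(0.3333 - 0.16) - 0}{\sqrt{\frac{0.1667(1 - 0.1667)}{30} + \frac{0.1667(1 - 0.1667)}{750}}} = 2.4973$
$\hat{p}_1 = \frac{10}{30} = 0.3333$, $\hat{p}_2 = \frac{120}{750} = 0.16$, $\hat{p} = \frac{10 + 120}{30 + 750} = 0.1667$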
Solution:
Step 7:
Since the test statistic (2.4973) is greater than CV(2.33),
we reject Ho; therefore, we can conclude that the
miscarriage rate is greater for women exposed to
glycol ethers. With this evidence, the Johns Hopkins
researchers concluded that women employees exposed
to glycol ethers “have a significantly increased risk of
miscarriage.” On the basis of these results, IBM
warned its employees of the danger, notified the
Environmental Protection Agency, and greatly reduced
its use of glycol ethers.
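The Step 6 computation can be reproduced from the raw counts (10 of 30 and 120 of 750). The Python sketch below is illustrative, with a hypothetical function name. Because it keeps the proportions unrounded, the result is about 2.498 rather than the 2.4973 obtained above from four-decimal values; the decision is the same.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic for Ho: p1 - p2 = 0."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # pooled estimate p-hat
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

z = two_proportion_z(10, 30, 120, 750)
critical = 2.33  # z_{0.01}, right-tailed, from Step 4
print(f"z = {z:.3f}, reject Ho: {z > critical}")
```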
Exercises
Exercises 1:
The production manager of a fruits canning
factory begins to suspect that, as a result of
observing the machine operators, the 16 oz. can
of fruits may be slightly filled beyond the
required weight. He takes a random sample of 80
packed cans and finds that the mean weight is
16.08 oz. with a standard deviation of 0.04 oz.
At 1% Level of Significance, can the production
manager conclude that the fruit cans were being
overfilled?
Exercises 2:
An insurance executive asserts that the mean
amount paid by his firm for personal injury
resulting from personal accidents is P18,500. An
actuary wants to check the accuracy of this
assertion and is allowed to sample randomly 36
cases involving personal injury. The sample mean
is P19,415. Assuming that σ = P2,600, test the
executive belief with level of significance of
0.05.
Exercises 3:
The manager of the Granite Rock Company
believes that the average truckload delivered
weighs 4,500 lbs. A stockholder, Chip Stone,
argues that this is an inflated figure to lure new
investors. Mr. Stone randomly samples the
records of 25 loads and finds the mean load to be
4,460 lbs with standard deviation (s) of 250 lbs.
Can Mr. Stone reject the manager’s claim using
a significance level of 0.05?
Exercises 4:
A poultry raiser harvests an average of 300 eggs
per day. He has recently experimented with
different types of poultry feeds. As a result, he
noticed some fluctuations in the number of eggs laid
by the chickens, which is neither clearly higher nor
lower than previous weeks. He decides to find out if
there might be a significant change in the number of
eggs laid by the chickens. He records his harvest of
eggs for 20 days. He finds that the average per day
is 290 eggs with a standard deviation of 15. At 5%
Level of Significance, what did the poultry raiser
find out?
Exercises 5:
An experimental diet was followed by a random
sample of 6 people. The cholesterol level for each
was measured before and after the diet as follows:
[Figure: scatter plots of Y against X for r = −1, r = −0.6, r = 0, r = 0.6, and r = 1]
Note:
Features of r
Unit free
Range between −1 and 1
The closer to −1, the stronger the negative
linear relationship.
The closer to 1, the stronger the positive
linear relationship.
The closer to 0, the weaker the linear
relationship.
Caveats
A correlation of 70% does not mean
that 70% of the points are clustered
around a line. Nor should we claim here
that we have twice as much linear
association as with a set of points that
has a correlation of 35%.
Correlation does not imply causation.
Caveats
The presence of outliers easily affects
the correlation of a set of data.
• In some situations, we ought to remove
these outliers from the data set and redo
the correlation analysis.
• In other cases, these outliers ought not to
be removed as there will always be some
points detached from the rest of the
data.
Pearson Product Moment
Correlation Coefficient
Commonly called the Pearson r.
It measures the linear relationship between two
variables.
The level of measurement of the data for the two
variables is either interval or ratio scale.
$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$
where:
x = the observed data for the independent variable
y = the observed data for the dependent variable
n = number of paired observations
Pearson Product Moment
Correlation Coefficient
Test Statistic:
$t = r\sqrt{\frac{df}{1 - r^2}}$
where:
df = degrees of freedom
r = correlation coefficient of Pearson r
Note:
df = n − 2
Qualitative Interpretation
Note:
If r is negative, this means that for every
increase in one variable, there is a
corresponding decrease in the second variable
or that there is an inverse relationship
between variables x and y.
If r is positive, this means that for every
increase in one variable, there is a
corresponding increase in the second variable
or that there is a direct relationship between
variables x and y.
Hypotheses
Ho : ρ = 0
There is no significant relationship
between the two variables.
Ha : ρ ≠ 0
There is a significant relationship
between the two variables.
Example 1:
The Rip-off Vending Machine Company operates coffee vending …
No. of Persons Working   No. of Cups of
at Location (x)          Coffee Sold (y)    x²       y²       xy
 5                       10                  25      100       50
 6                       20                  36      400      120
14                       30                 196      900      420
19                       40                 361    1,600      760
15                       30                 225      900      450
11                       20                 121      400      220
18                       40                 324    1,600      720
22                       40                 484    1,600      880
26                       50                 676    2,500    1,300
Sum: 136                 280              2,448   10,000    4,920
$\sum x = 136$, $\sum y = 280$, $\sum x^2 = 2{,}448$, $\sum y^2 = 10{,}000$, $\sum xy = 4{,}920$
Solution:
$r = \frac{9(4920) - (136)(280)}{\sqrt{[9(2448) - (136)^2][9(10000) - (280)^2]}} = 0.9681$
Strong Positive Correlation
$t = 0.9681\sqrt{\frac{9 - 2}{1 - (0.9681)^2}} = 10.222$
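The two formulas above can be evaluated directly from the column sums. This Python sketch is illustrative and assumes only the totals given in the example (n = 9, Σx = 136, Σy = 280, Σx² = 2,448, Σy² = 10,000, Σxy = 4,920); the function name is hypothetical. Carrying r unrounded gives a t value slightly below the 10.222 obtained with r rounded to 0.9681.

```python
import math

def pearson_r_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Pearson r from the computational (sums) formula."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
    return numerator / denominator

n = 9
r = pearson_r_from_sums(n, 136, 280, 2448, 10000, 4920)

# Test statistic t = r * sqrt(df / (1 - r^2)) with df = n - 2
df = n - 2
t_stat = r * math.sqrt(df / (1 - r**2))
print(f"r = {r:.4f}, t = {t_stat:.3f}")
```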
Solution:
Step 7:
… correlation of the annual
sales of produce stores on
their size in square footage.
Sample data for seven
stores were obtained.
Square Feet   Annual Sales ($1000)
1,726         3,681
1,542         3,395
2,816         6,653
5,555         9,543
1,292         3,318
2,208         5,563
1,313         3,760
Solution:
Step 1:
Ho : ρ = 0
There is no significant relationship between the annual
sales of produce stores and their size in square footage.
Ha : ρ ≠ 0
There is a significant relationship between the annual sales
of produce stores and their size in square footage.
Step 2:
α = 0.05
Solution:
Step 3:
Since we are testing the significance of the relationship
between two variables, we will use Pearson r.
Step 4:
df = 7 − 2 = 5
t0.05,5 = ±2.571 (rejection regions: t < −2.571 or t > 2.571)
Step 5:
If the test statistic is less than CV(−2.571) or greater
than CV(2.571), reject the null hypothesis; otherwise,
fail to reject the null hypothesis.
Solution:
Step 6:
We can solve the test statistic and p-value of
Pearson r using RStudio. TV(9.010) and p-
value(0.0003)
Step 7:
Since the test statistic (9.010) is greater than
CV(2.571), we reject Ho; therefore, there is a
significant relationship between the annual sales
of produce stores and their size in square footage.
Regression Analysis
Regression analysis is used primarily to
model causality and provide prediction.
Predicts the value of a dependent (response)
variable based on the value of at least one
independent (explanatory) variable.
Explains the effect of the independent
variables on the dependent variable
Types of Regression Models
Simple Linear
Regression
Relationship between variables is
described by a linear function.
The change of one variable causes the
change in the other variable.
A dependency of one variable on the
other.
Population Linear Regression
Population regression line is a straight line that
describes the dependence of the average value of
one variable on the other.
Population Linear Regression
Sample Linear Regression
Sample regression line provides an estimate of the
population regression line as well as a predicted
value of Y.
Note:
b0 and b1 are obtained by finding the
values of b0 and b1 that minimize the
sum of the squared residuals:
$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2$
b0 provides an estimate of β0.
b1 provides an estimate of β1.
Interpretation of the
Slope and the Intercept
$b_0 = \hat{E}(Y \mid X = 0)$ is the estimated
average value of Y when the value of X
is zero.
$b_1 = \frac{\Delta \hat{E}(Y \mid X)}{\Delta X}$ is the estimated change in
the average value of Y as a result of a
one-unit change in X.
Note:
When b1 > 0, Y increases as X increases. In this
case, we say that Y is directly or positively
related to X.
When b1 < 0, Y decreases as X increases, and we
say that Y is inversely or negatively related to X.
When b1 = 0, Y is a constant and is equal to the
y-intercept b0. This implies that there is no change
in Y whatever the value of X is; that is,
variables x and y have no linear relationship.
Example:
Examine the linear dependency of the annual
sales of produce stores on their size in square
footage. Find the equation of the straight line that
fits the data best.
Square Feet   Annual Sales ($1000)
1,726         3,681
1,542         3,395
2,816         6,653
5,555         9,543
1,292         3,318
2,208         5,563
1,313         3,760
Solution:
From RStudio Printout:
ŷ = 1636.415 + 1.487(xi)
The slope of 1.487 means that for each
increase of one unit in X, we predict the
average of Y to increase by an estimated
1.487 units.
The model estimates that for each increase of
one square foot in the size of the store, the
expected annual sales are predicted to
increase by $1487.
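The slides take the fitted line from an RStudio printout. As a cross-check, the least-squares slope and intercept can be computed by hand from the seven stores; the Python sketch below is an illustration, with hypothetical variable names, using the closed-form formulas b1 = Sxy / Sxx and b0 = ȳ − b1·x̄.

```python
from statistics import mean

square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]  # in $1000

x_bar, y_bar = mean(square_feet), mean(annual_sales)

# Sums of cross-products and squares about the means
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(square_feet, annual_sales))
s_xx = sum((x - x_bar) ** 2 for x in square_feet)

b1 = s_xy / s_xx          # slope: minimizes the sum of squared residuals
b0 = y_bar - b1 * x_bar   # intercept

def predict(x):
    """Predicted annual sales (in $1000) for a store of x square feet."""
    return b0 + b1 * x

print(f"y-hat = {b0:.3f} + {b1:.3f} x")
```

Since sales are recorded in thousands of dollars, a slope of about 1.487 corresponds to roughly $1,487 per additional square foot, which matches the interpretation above.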
Inference About the
Slope: t-Test
t - test for a population slope
Is there a linear dependency of Y on X ?
Null and Alternative Hypothesis
Ho : β1 = 0 (No linear dependency)
Ha : β1 ≠ 0 (Linear dependency)
Test Statistic:
$t = \frac{b_1 - \beta_1}{s_{b_1}}$
Where:
$s_{b_1} = \frac{s_{xy}}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$
Example:
Given the following information, determine whether
the square footage of the store affects its annual
sales.
Square Feet   Annual Sales ($1000)
1,726         3,681
1,542         3,395
2,816         6,653
5,555         9,543
1,292         3,318
2,208         5,563
1,313         3,760
Solution:
Inference about the slope: Ho : β1 = 0   Ha : β1 ≠ 0
[Figure: plots of Y against X and of residuals (e) against X, contrasting a Not Linear pattern with a Linear (✓) pattern]
Residual Analysis for Homoscedasticity
[Figure: plots of Y against X and of SR against X, contrasting Heteroscedasticity with Homoscedasticity (✓)]
Pitfalls of Regression
Analysis
Lacking an awareness of the assumptions
underlying least-squares regression.
Not knowing how to evaluate assumptions.
Not knowing the alternatives to classical
regression if some assumption is violated.
Using a regression model without knowledge
of the subject matter.
Strategies for Avoiding
the Pitfalls of Regression
Start with a scatter plot of X on Y to observe
possible relationship.
Perform residual analysis to check the
assumptions.
• Use a histogram, stem-and-leaf display, box-
and-whisker plot, or normal probability plot
of the residuals to uncover possible non-
normality.
Strategies for Avoiding
the Pitfalls of Regression
If there is violation of any assumption, use
alternative methods to least-squares
regression or alternative least-squares models
(e.g., curvilinear or multiple regression).
If there is no evidence of assumption violation,
then test for the significance of the regression
coefficients
Exercises
Exercises 1:
Castle Rock Entertainment has produced many
movies over the past few years. The Vice-
President wants to see if there is a relationship
between the total cost of a film (including
production costs, salaries, and marketing
expenses) and the gross income produced by the
film through ticket sales in the American movie
theaters. A random sample of films produced the
following data pairs.
Costs               Gross Income
(Million Dollars)   (Million Dollars)
55                  150.50
42                  123.00
17                   68.00
30                   93.00
43                   16.00
26                    5.00
19                   10.00
35                   35.00
22                   20.00
13                   15.00
1. Predict the gross income for the film with a cost of 27
million.
2. Predict the gross income for the film with a cost of 35
million.
Exercises 2:
The scores of ten randomly selected senior high school
students on the mathematical portion of the National
Admission test (NAT) and mathematical ability part of a
university admission test were recorded as follows:
Student   x    Y
A          5    6
B          7   15
C          9   16
D         10   12
E         11   21
F         12   22
G         15    8
H         17   26
I         20    5
J         26   30
Compute the coefficient of correlation (r).
Exercises 3:
In the following given data, x =number of sessions
attended by 15 trainees in a leadership training
seminar, while y = scores obtained by the same
trainees in a test given after the seminar.
x 3 2 4 5 5 6 6 7 9 7 8 5 6 3 8
y 65 50 75 70 80 85 79 88 91 87 88 70 71 63 85
Note:
$SS_b = n \sum_{i=1}^{k} (\bar{x}_i - \bar{x})^2$
$SS_w = \sum_{i=1}^{k} \sum_{j=1}^{n} (x_{ij} - \bar{x}_i)^2$
Note:
The ANOVA test is applied by calculating two
estimates of the variance of population
distributions: the variance between samples
and the variance within samples.
The variance between samples is also called
the mean square between samples or MSB. The
variance within samples is also called the
mean square within samples or MSW.
Assumptions
1. Your dependent variable should be measured at
the interval or ratio level (i.e., they are
continuous).
2. Your independent variable should consist of two
or more categorical, independent groups.
3. You should have independence of observations,
which means that there is no relationship
between the observations in each group or
between the groups themselves.
Assumptions
4. There should be no significant outliers.
5. Your dependent variable should be
approximately normally distributed
for each category of the independent
variable.
6. There needs to be homogeneity of
variances.
Hypotheses
The analysis of variance is used to test the
hypothesis that the means of three or more
populations are the same against the alternative
hypothesis that the mean of at least one
population is different from the others.
Ho : μ1 = μ2 = . . . = μk
Ha : At least one of the population means is
different from the others.
Rejection Region
One - Way ANOVA is always right-tailed with
the rejection region in the right tail of the F
distribution curve.
Critical Value: $F_{\alpha, df_1, df_2}$
Note:
df1 = k − 1
df2 = n − k
Where:
k = No. of categories.
n = Total number of observations.
Example:
Suppose we have teachers at a school who have
devised three different methods to teach arithmetic.
They want to find out if these three methods produce
different mean scores. Let μ1, μ2 and μ3 be the mean
scores of all students who are taught by Methods I,
II, and III, respectively.
To test if the three teaching methods produce
different means, we test the null hypothesis
Ho : μ1 = μ2 = μ3
Ha : At least one of the population means is
different from the others.
Note:
Using a one-way ANOVA test, we analyze only one
factor or variable.
For instance, in the example of testing for the equality
of mean arithmetic scores of students taught by each
of the three different methods, we are considering only
one factor, which is the effect of different teaching
methods on the scores of students.
Sometimes we may analyze the effects of two factors.
For example, if different teachers teach arithmetic
using these three methods, we can analyze the effects
of teachers and teaching methods on the scores of
students. This is done by using a two-way ANOVA.
Note:
The variance between samples, MSB, gives an estimate of
variance based on the variation among the means of
samples taken from different populations.
For the example of three teaching methods, MSB will be
based on the values of the mean scores of three samples of
students taught by three different methods. If the means
of all populations under consideration are equal, the means
of the respective samples will still be different but the
variation among them is expected to be small, and
consequently, the value of MSB is expected to be small.
However, if the means of populations under consideration
are not all equal, the variation among the means of
respective samples is expected to be large, and consequently,
the value of MSB is expected to be large.
Note:
The variance within samples, MSW, gives an
estimate of variance based on the variation
within the data of different samples.
For the example of three teaching methods,
MSW will be based on the scores of individual
students included in the three samples taken
from three populations.
Example:
Callie Cruz, Vice-President of the Nikel and Dime Savings
Bank, is reviewing employees' performance for possible
salary increase. In evaluating tellers, Callie decides that
an important criterion is the number of customers served each
day. She expects that each teller should handle
approximately the same number of customers daily.
Otherwise, each teller should be rewarded or penalized
accordingly.
Callie randomly selects 6 business days, and customer
traffic for each teller during these days is recorded. The
factor or variable of interest, then, is the number of
customers served. The sample data are shown below:
Example (cont.):
Customer Traffic Data
Day      Teller 1       Teller 2      Teller 3
         (Ms. David)    (Ms. Chua)    (Ms. Lim)
1        45             55            54
2        56             50            61
3        47             53            54
4        51             59            58
5        50             58            52
6        45             49            51
Total    294            324           330
Solution:
Step 1:
Ho : μ1 = μ2 = μ3
All population means are equal; that is, Ms. David, Ms. Chua,
and Ms. Lim serve the same average number of customers
per day and are assumed to have the same workload.
Critical value: $F_{0.05,\,2,\,15} = 3.68$
Solution:
Step 6:
$\bar{x}_1 = \frac{294}{6} = 49, \quad \bar{x}_2 = \frac{324}{6} = 54, \quad \bar{x}_3 = \frac{330}{6} = 55$
$\bar{x} = \frac{49 + 54 + 55}{3} = 52.6667$
$SSB = 6(49 - 52.6667)^2 + 6(54 - 52.6667)^2 + 6(55 - 52.6667)^2 = 124$
Solution:
Step 6:
$SS_1 = (45 - 49)^2 + (56 - 49)^2 + (47 - 49)^2 + (51 - 49)^2 + (50 - 49)^2 + (45 - 49)^2 = 90$
$SS_2 = (55 - 54)^2 + (50 - 54)^2 + (53 - 54)^2 + (59 - 54)^2 + (58 - 54)^2 + (49 - 54)^2 = 84$
$SS_3 = (54 - 55)^2 + (61 - 55)^2 + (54 - 55)^2 + (58 - 55)^2 + (52 - 55)^2 + (51 - 55)^2 = 72$
$SSW = 90 + 84 + 72 = 246$
Solution:
Step 6:
Source     Sum of Squares   Degrees of Freedom   Variance Estimate   F Ratio
Between    124              2                    62                  3.7805
Within     246              15                   16.4
Total      370              17
Solution:
Step 7:
Since the test statistic (3.7805) is greater than the critical
value (3.68), we reject Ho. Therefore, at least one of the tellers
among David, Chua, and Lim is likely to be handling
more or fewer customers than the others.
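The arithmetic in Steps 6 and 7 can be checked with a short script. This is a minimal sketch in plain Python; the function name `one_way_anova` is an illustrative choice, not something from the slides, and the critical value 3.68 is taken from the example rather than computed.

```python
# One-way ANOVA from first principles (no external libraries).
def one_way_anova(groups):
    """Return (SSB, SSW, MSB, MSW, F) for a list of samples."""
    k = len(groups)                      # number of samples
    n = sum(len(g) for g in groups)      # total number of observations
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-samples sum of squares: variation among the sample means.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-samples sum of squares: variation inside each sample.
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    msb = ssb / (k - 1)
    msw = ssw / (n - k)
    return ssb, ssw, msb, msw, msb / msw

tellers = [
    [45, 56, 47, 51, 50, 45],  # Teller 1 (Ms. David)
    [55, 50, 53, 59, 58, 49],  # Teller 2 (Ms. Chua)
    [54, 61, 54, 58, 52, 51],  # Teller 3 (Ms. Lim)
]
ssb, ssw, msb, msw, f_ratio = one_way_anova(tellers)
print(f"SSB={ssb:.0f} SSW={ssw:.0f} MSB={msb:.1f} MSW={msw:.1f} F={f_ratio:.4f}")

critical_value = 3.68  # F(0.05; 2, 15) as given in the example
print("Reject Ho" if f_ratio > critical_value else "Do not reject Ho")
```

The script reproduces the sums of squares used above and compares the F ratio with the tabulated critical value.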
Exercise 1:
A career counselor claims in Career Development
Quarterly that there is no difference in career
decision-making attitudes among the population of
students from various socioeconomic classes.
The scores from an attitudes test given to random
samples of students are as follows:
Lower Middle Upper
32 45 38
36 42 38
40 34 31
32 42 41
33 29
37 33
34
Test the counselor’s claim at the 0.01 level.
Exercise 2:
Fifteen fourth-grade students were randomly assigned to three
groups to experiment with three different methods of teaching
arithmetic. At the end of the semester, the same test was given
to all 15 students. The table gives the scores of students in the
three groups.
Method I    Method II    Method III
48          55           85
73          85           68
51          70           95
65          69           74
87          90           67
Test the claim that the mean scores of all three groups of
fourth-graders taught by three different methods are not all
equal. Assume that all the required assumptions hold true.
Use the 0.01 level of significance.
Post Hoc Tests on One-Way Analysis of Variance
Suppose we perform a one-way ANOVA and
the results lead us to conclude that at least
one population mean is different from the others.
To determine which means differ
significantly, we make additional
comparisons between means. The
procedures for making these comparisons
are called multiple comparison methods.
Tukey Test
The q-test statistic follows a
distribution called the Studentized
range distribution.
Standard Error
$SE = \sqrt{\frac{s^2}{2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$
where:
$s^2$ is the mean square error (MSE) estimate of $\sigma^2$ from
the one-way ANOVA
$n_1$ is the sample size from population 1
$n_2$ is the sample size from population 2.
Test Statistic for
Tukey’s Test
The test statistic for Tukey’s test when
testing Ho : μ1 = μ2 versus Ha : μ1 ≠ μ2 is given
by
$q = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s^2}{2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$
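As a rough illustration, the statistic can be wrapped in a small helper. The function name `tukey_q` and its parameter names are hypothetical; under Ho the term μ1 − μ2 is zero, so it drops out of the numerator. The usage lines apply it to Teller 3 versus Teller 1 from the bank example, with MSE = SSW / (n − k) = 246 / 15.

```python
import math

def tukey_q(mean_i, mean_j, mse, n_i, n_j):
    """q statistic for Ho: mu_i = mu_j (so mu_i - mu_j = 0 under Ho)."""
    # Standard error: sqrt( (s^2 / 2) * (1/n_i + 1/n_j) ), with s^2 = MSE.
    se = math.sqrt(mse / 2 * (1 / n_i + 1 / n_j))
    return (mean_i - mean_j) / se

# Teller 3 (mean 55) versus Teller 1 (mean 49), six days each.
q_31 = tukey_q(55, 49, 246 / 15, 6, 6)
print(round(q_31, 4))
```

The resulting q would then be compared with the Studentized range critical value for the chosen familywise error rate.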
Critical Value for the Tukey’s Test
The critical value is written $q_{\alpha,\nu,k}$, where:
α is the level of significance, called the experimentwise
error rate or familywise error rate.
ν is the degrees of freedom due to error (the total number of
subjects sampled minus the number of means being
compared, or n − k).
k is the total number of means being compared.
Decision Rule
Step 1:
Arrange the sample means in ascending order.
Step 2:
Compute the pairwise differences, x̄i − x̄j,
where x̄i > x̄j.
Procedures Used to Make Multiple
Comparisons Using Tukey Test
Step 3:
Compute the test statistic for each
pairwise difference.
$q = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s^2}{2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$
Step 4:
Determine the Critical Value.
Step 5:
Determine the decision.
Step 6:
Determine the conclusion.
Example 1
Suppose that there is sufficient evidence to
reject Ho : μ1 = μ2 = μ3 = μ4 using a one-way
ANOVA. The mean square error from ANOVA
is determined to be 26.2. The sample means
are x̄1 = 42.6, x̄2 = 49.1, x̄3 = 46.8, x̄4 = 63.7 with
n1 = n2 = n3 = n4 = 6.
Use Tukey’s test to determine which pairwise
means are significantly different using a
familywise error rate of 0.05.
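The pairwise q statistics for this example can be checked numerically. The sketch below follows Steps 1 to 3 of the procedure using only the figures stated in the problem (MSE = 26.2, n = 6 per group); the variable names are illustrative. It computes all six pairs and stops before the critical value comparison.

```python
import math
from itertools import combinations

mse = 26.2   # mean square error from the ANOVA
n = 6        # common sample size
means = {1: 42.6, 2: 49.1, 3: 46.8, 4: 63.7}

# With equal sample sizes, the standard error is the same for every pair.
se = math.sqrt(mse / 2 * (1 / n + 1 / n))

# Step 1: arrange the sample means in ascending order.
ordered = sorted(means.items(), key=lambda item: item[1])

# Steps 2-3: larger-minus-smaller difference and q for each pair.
q_stats = {}
for (small_id, small_mean), (large_id, large_mean) in combinations(ordered, 2):
    q_stats[(large_id, small_id)] = (large_mean - small_mean) / se

for (large_id, small_id), q in sorted(q_stats.items(), key=lambda kv: -kv[1]):
    print(f"mean {large_id} - mean {small_id}: q = {q:.4f}")
```

Each q would then be compared with the critical value from the Studentized range table in Step 4.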
Solution:
Step 1:
$\sqrt{\frac{26.2}{2}\left(\frac{1}{6} + \frac{1}{6}\right)}$
Solution:
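Ho : μ4 − μ3 = 0
$q = \frac{(16.9) - (0)}{\sqrt{\frac{26.2}{2}\left(\frac{1}{6} + \frac{1}{6}\right)}} = 8.0875$
Ho : μ4 − μ2 = 0
$q = \frac{(14.6) - (0)}{\sqrt{\frac{26.2}{2}\left(\frac{1}{6} + \frac{1}{6}\right)}} = 6.9868$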
Solution:
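Ho : μ2 − μ1 = 0
$q = \frac{(6.5) - (0)}{\sqrt{\frac{26.2}{2}\left(\frac{1}{6} + \frac{1}{6}\right)}} = 3.1106$
Ho : μ2 − μ3 = 0
$q = \frac{(2.3) - (0)}{\sqrt{\frac{26.2}{2}\left(\frac{1}{6} + \frac{1}{6}\right)}} = 1.1007$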
Solution:
Ho : μ3 − μ1 = 0
$q = \frac{(4.2) - (0)}{\sqrt{\frac{26.2}{2}\left(\frac{1}{6} + \frac{1}{6}\right)}} = 2.0099$
Step 4: