Business Statistics Notes


Lecture 1: Descriptive Statistics I

Activity 1: Data Classification

Properties of Nominal, Ordinal, Interval and Ratio: Tick where the following concepts are
appropriate (start from the bottom and work to the top)

Examples of each level:
Nominal:  blood type, gender, race
Ordinal:  educational level, grades, satisfaction rating
Interval: temperature
Ratio:    length, height

Property                                            Nominal  Ordinal  Interval  Ratio
Has “true zero” (zero means zero)                   -        -        -         x
Can multiply and divide values (ratio between two)  -        -        -         x
Mean                                                -        -        x         x
Can add or subtract values                          -        -        x         x
Can quantify the difference between each value      -        -        x         x
Median                                              -        x        x         x
The “order” of values is known                      -        x        x         x
Mode                                                x        x        x         x

Determine which level of measurement is used:


a) What is your favourite ice cream flavour?
Nominal
b) What is your total weekly income?
Ratio (0 is meaningful)
c) What is your shoe size?
Interval (the differences between consecutive shoe sizes are equal: they go up by 1)
d) What is your university grade? (i.e. F, P, C, D or HD)
Ordinal

Sample or Population

a) Define the sample and what is the symbol for it?

A random collection of a certain size from the population (the heights of 50 randomly chosen
people; the heights of 50 randomly chosen people aged between 25 and 40). n

b) Define population and what is the symbol for it?

The complete pool of values of a certain random variable (all humans’ heights; the heights of
all females aged between 25 and 40 on earth). N

Sample or Population
a) A city council member wanted to know how residents felt about a planned rezoning. She
randomly selected 75 names from the city phone directory and conducted a phone survey.

sample

b) A teacher is interested in the average grades his geometry students are achieving. The
geometry teacher records all grades in the class consisting of 30 students.

Population
Activity 2: Numerical methods

Using the below exam result data from a class of 6 students, calculate the following:
*Note for this question we use population because there are only 6 people in the class so we
are surveying the whole population.

Student 1 Student 2 Student 3 Student 4 Student 5 Student 6

Mark 4 6 3 4 9 6

MEASURES OF CENTRAL TENDENCY

Measures of central tendency yield information about the centre of the distribution of a
random variable (r.v.). They give us some idea of what a typical, middle or average value of
an r.v. might be. They are sometimes called measures of location.

DEFINITION + Impact of Outliers, followed by ANSWER/CALCULATION

MEAN
- The average of a group of numbers. It is the sum of all the values in a data set divided by
  the total number of values in that data set.
- This measure of central tendency (unlike the others) is affected by outliers and, depending
  on the magnitude of the outlier, it will increase/decrease towards that outlier.
- The symbol for this measure at the population level is μ ”mu”
- The symbol for this measure for a sample set of data is x̄ “x bar”
Calculation:
Mean = (4 + 6 + 3 + 4 + 9 + 6) / 6
     = 32 / 6
     = 5.33 (2 d.p.)

MEDIAN
- The middle value in an array of ordered numbers.
- When working out this measure it is important to put the numbers in order first and then
  find the middle score.
- If there is an odd amount of numbers, the middle score will be easy to identify.
- If there is an even amount of numbers, we take the average of the two middle scores.
Calculation:
Ordered data = 3 4 4 6 6 9
Median = (4 + 6) / 2
       = 5

MODE
- The most frequently occurring value in a set of data.
- Unaffected by outliers.
Calculation:
Mode = both 4 and 6 (both occur twice)
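
The three measures above can be checked with a short sketch in Python, using only the standard `statistics` module. The variable names are illustrative and not part of the original notes.

```python
import statistics

marks = [4, 6, 3, 4, 9, 6]  # exam marks for the 6 students

mean = statistics.mean(marks)        # sum of values divided by the count (32 / 6)
median = statistics.median(marks)    # average of the two middle values when the count is even
modes = statistics.multimode(marks)  # every value tied for most frequent

print(round(mean, 2), median, modes)
```

`multimode` is used rather than `mode` because this data set has two modes, and `mode` would report only one of them.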

MEASURES OF VARIABILITY

Measures of variability yield information about how likely a realisation of the r.v. is to
fall away from the centre of its distribution. They give us some idea of fluctuation and
volatility across realisations of the r.v. They are sometimes called measures of scale, spread,
dispersion, or risk.

*calculation for a sample is included but the correct answer is population variance.
Think about the difference between population and sample variance and try to calculate
it yourself :)
MEASURE/DEFINITION, FORMULA, ANSWER/CALCULATION

RANGE
Difference between the highest and lowest scores.
Formula: Maximum - Minimum
Range = 9 - 3
      = 6

VARIANCE
Average of the squared distance from the mean. It measures how far a set of numbers is
spread out from its average value.
s^2 = [(4 - 5.33)^2 + (6 - 5.33)^2 + (3 - 5.33)^2 + (4 - 5.33)^2 + (9 - 5.33)^2 + (6 - 5.33)^2] / (6 - 1)
    = 23.333 / 5
    = 4.6666 = 4.67 (2 d.p.)
Population variance = 23.333 / 6 = 3.8889

STANDARD DEVIATION
Square root of the variance. The standard deviation is a measure of the amount of variation
or dispersion of a set of values.
SD = √4.6666
   = 2.16024
   = 2.16 (2 d.p.)
SD (population) = √3.8889 = 1.9720

COEFFICIENT OF VARIATION
How large the SD is in relation to the mean.
CV = (2.16024 / 5.3333) x 100
   = 40.50 %
CV (population) = (1.9720 / 5.3333) x 100 = 36.9755%
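
As a hedged cross-check of the sample versus population figures, the sketch below uses Python's `statistics` module. `variance`/`stdev` divide by n - 1 (sample) while `pvariance`/`pstdev` divide by n (population); the variable names are illustrative.

```python
import statistics

marks = [4, 6, 3, 4, 9, 6]
mean = statistics.fmean(marks)

sample_var = statistics.variance(marks)  # divides by n - 1
pop_var = statistics.pvariance(marks)    # divides by n
sample_sd = statistics.stdev(marks)
pop_sd = statistics.pstdev(marks)

# Coefficient of variation: SD as a percentage of the mean
cv_sample = sample_sd / mean * 100
cv_pop = pop_sd / mean * 100

print(round(sample_var, 4), round(pop_var, 4))
print(round(sample_sd, 4), round(pop_sd, 4))
print(round(cv_sample, 2), round(cv_pop, 2))
```

Because the whole class of 6 is surveyed, the population versions are the ones that answer the question.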

Activity 3: Descriptive Statistics Shape

Distribution Shape                Skewness                          Relationship of Mode, Median, Mean

Name: Symmetric Distribution      Skewness = 0                      Mode = Median = Mean
Name: Right-Skewed Distribution   Skewness > 0, positively skewed   Mode < Median < Mean
Name: Left-Skewed Distribution    Skewness < 0, negatively skewed   Mode > Median > Mean

Name: Leptokurtic
Description of Shape: Tall and thin
Probability Description: More probability mass in the centre and in the tails.
Kurtosis: Kurtosis > 3

Name: Mesokurtic
Description of Shape: Normal in shape
Probability Description: ‘Normal’ distribution of probability.
Kurtosis: Kurtosis = 3

Name: Platykurtic
Description of Shape: Flat and spread out
Probability Description: Less probability mass in the centre and in the tails.
Kurtosis: Kurtosis < 3

U:PASS 2 - Descriptive Statistics II and Probability
Theory

Activity 2: Probability Theory
Complete the following table. In the diagram column, shade the areas that represent the
corresponding probability. (HINT: Lecture 2, Slides 7-14)

Type of Probability   Definition                            Formula
Marginal              Probability of A occurring            P(A)
Union (Slide 15)      Probability of A OR B occurring       P(A) + P(B) - P(A and B)
Joint                 Probability of A AND B occurring      P(A) x P(B|A)
                                                            (if independent: P(A) x P(B))
Conditional           Probability of A occurring given      P(A and B) divided by P(B)
                      B has occurred

Diagram column: Venn diagrams of A and B. For the union, the overlap is highlighted to show
students that it must be subtracted.
Activity 3: Probability Theory
A marketer for a bank was asked to help determine which of the loans in its loan range is
the most popular for future development. The marketer conducts a survey of 50 customers and
records the following joint frequencies.
ANZ CBA Total

Small Deposit 11 2 13

Large Deposit 23 14 37

Total 34 16 50

a) Construct a table that shows the probability of people banking with ANZ or CBA, preferring
either small deposit or a large deposit.

ANZ CBA Total

Small Deposit 11/50 = 0.22 2/50 = 0.04 0.26

Large Deposit 23/50 = 0.46 14/50 = 0.28 0.74

Total 0.68 0.32 1

b) Find the probability that someone prefers making a small deposit.

P(small deposit) = 0.26 (marginal)

c) Find the probability that someone does not prefer making a small deposit

P(large deposit) = 0.74 (marginal) (also complement of question b)

d) Find the probability that someone banks with ANZ or makes a large deposit

P(ANZ or large deposit) = P(ANZ) + P(large deposit) - P(ANZ and large deposit) = 0.68 + 0.74 - 0.46 = 0.96

NOTE: the events are not mutually exclusive, so the overlap must be subtracted to avoid double counting

e) Find the probability that someone banks with CBA and makes a large deposit

P(CBA and large deposit) = 0.28 (From table)


f) Find the probability that someone makes a small deposit, conditional on them
banking with ANZ

P(SmallDeposit|ANZ) = P(SmallDeposit and ANZ) divided by P(ANZ) = 0.22/0.68 = 0.3235
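
The marginal, union and conditional answers in this activity can be reproduced from the joint frequency table. The following is a minimal sketch in Python; the dictionary layout and helper names are assumptions made for illustration.

```python
# Joint frequencies from the survey of 50 customers: (deposit size, bank) -> count
counts = {
    ("small", "ANZ"): 11, ("small", "CBA"): 2,
    ("large", "ANZ"): 23, ("large", "CBA"): 14,
}
total = sum(counts.values())
joint = {key: n / total for key, n in counts.items()}  # joint probabilities


def marginal_deposit(size):
    """Marginal probability of a deposit size (row total)."""
    return sum(p for (dep, _), p in joint.items() if dep == size)


def marginal_bank(bank):
    """Marginal probability of a bank (column total)."""
    return sum(p for (_, b), p in joint.items() if b == bank)


p_small = marginal_deposit("small")
# Union: add the marginals, then subtract the overlap that was counted twice
p_anz_or_large = marginal_bank("ANZ") + marginal_deposit("large") - joint[("large", "ANZ")]
# Conditional: joint probability divided by the probability of the condition
p_small_given_anz = joint[("small", "ANZ")] / marginal_bank("ANZ")

print(round(p_small, 2), round(p_anz_or_large, 2), round(p_small_given_anz, 4))
```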

Activity 4: Binomial Distribution

a) A Statistics lecturer examines past mid-semester failure rates and notes that historically,
30% of students fail the mid-semester exam. The lecturer takes a random sample of 10
students this semester.
i) What is the value of n? 10
ii) What is the value of p? 0.3. We’re looking at failure rates, so the ‘success’ condition is
a ‘failed’ mid-semester exam.

b) Using the above values, determine the probability that exactly 4 students will fail, assuming
the chance of failure for any one student is independent of another (round to 4 decimal places).
i) What is the value of x? 4

P(X = 4; n = 10, p = 0.3) = 10C4 * 0.3^4 * (1 - 0.3)^(10 - 4)

-> 10C4 = 10! / (4! * (10 - 4)!) = 210
= 210 * 0.3^4 * (1 - 0.3)^(10 - 4)
= 0.200120949
= 0.2001 (about 20%)
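
The binomial formula nCx * p^x * (1 - p)^(n - x) can be sketched in a few lines of Python. `binomial_pmf` is an illustrative helper name; `math.comb` supplies the nCx term.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Exactly 4 failures among 10 students, each failing with probability 0.3
prob_4_fail = binomial_pmf(4, 10, 0.3)
print(round(prob_4_fail, 4))
```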

c) The product manager at a phone manufacturer examines defective rates at one of their
factories and notes that historically, 25 out of every 1000 products produced are defective.
The manager takes a random sample of 20 products.
i) What is the value of n? 20
ii) What is the value of p? 25/1000 = 0.025

iii) Using the above values find the probability that at least two of the sampled products will
be defective, assuming defectiveness for any product is independent of another (round to 4
decimal place).
Using the values from above
P(at least 2 defective) = P(X ≥ 2)
= 1 - [P(X = 0) + P(X = 1)]
= 1 - [(20C0 x 0.025^0 x 0.975^20) + (20C1 x 0.025^1 x 0.975^19)]
= 1 - [1*1*0.60268768021 + 20*0.025*0.61814121048]
= 1 - [0.60268768021 + 0.30907060524]
= 1 - [0.91175828545]
= 0.08824171455
= 0.0882 (about 8.82%)
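
"At least k" questions are usually easiest through the complement, as in the working above. A minimal Python sketch, with `binomial_pmf` and `prob_at_least` as illustrative helper names:

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def prob_at_least(k, n, p):
    """P(X >= k) computed as 1 - [P(X = 0) + ... + P(X = k - 1)]."""
    return 1 - sum(binomial_pmf(x, n, p) for x in range(k))


# At least 2 defective products in a sample of 20, defect rate 25/1000
p_two_or_more = prob_at_least(2, 20, 25 / 1000)
print(round(p_two_or_more, 4))
```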

U:PASS 3 - Discrete Probability Distributions


Activity 1: Discrete and Continuous Probability Distributions
Discrete
Definition:
○ Categorise
○ Count
○ Finite number of values
○ ½ a unit doesn’t make sense (i.e. whole numbers)
Examples:
○ Female
○ Colours
○ Number of Children
○ Doctor visits in a year

Continuous
Definition:
● Scale
● Measure
● Any value on interval
● ½ unit makes sense (can have decimals)
Examples:
● Time
● Age
● Temperature
● Height
● Distance
Activity 2: Which Discrete Distribution to use?

Distribution and when to use it (matched from the answer bank)

Binomial: X ~ Bin(n,p) (Slide 9)
● if it describes the (random) number of successes out of n trials in a binomial experiment.

Bernoulli: X ~ Ber(p) (Slide 13)
● if it describes the binary outcome 0 (failure) or 1 (success), with probability of
  success p, or P(X = 1) = p.

Discrete Uniform: X ~ DUnif(a,b) (Slide 14)
● if all potential outcomes (realisations) between a and b are equally likely.

Poisson: X ~ Poi(λ) (lambda) (Slide 15)
● if it describes the (random) number of arrivals of events within a given time period.
Activity 3: Discrete Formulas and Notations
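(Screenshots of formulas are fine)
~ means "follows", e.g. X ~ Bin(n,p) means X follows a binomial distribution with parameters
n (number of trials) and p (probability of success)

Binomial: X ~ Bin(n,p) (Slide 9)
Probability distribution function: P(X = x) = nCx * p^x * (1 - p)^(n - x)
Notation: with two parameters;
n: number of trials
p: probability of success, p ∈ (0,1)
E(X) = np; Var(X) = np(1 − p)

Bernoulli: X ~ Ber(p) (Slide 13)
Probability distribution function: P(X = x) = p^x * (1 - p)^(1 - x), where X = 1 (success) or 0 (fail)
Notation: with one parameter
• p: probability of success, p ∈ (0,1)
• E(X) = p; Var(X) = p(1 − p)
• The Bernoulli distribution is a special case of the binomial distribution where the
  number of trials n = 1.

Discrete Uniform: X ~ DUnif(a,b) (Slide 14)
Probability distribution function: P(X = x) = 1 / (b - a + 1)
Notation: with two parameters
• a: the minimum value that X can assume
• b: the maximum value that X can assume; so there are b − a + 1 potential outcomes.
E(X) = (b + a) / 2

Poisson: X ~ Poi(λ) (lambda) (Slide 15)
Probability distribution function: P(X = x) = e^(-λ) * λ^x / x!
Notation: with one parameter
• λ: the average number of arrivals in the time period
• E(X) = λ; Var(X) = λ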

Activity 4: Practise Questions

a) 40% of Xbox 360s fail. Given 10 consoles have been sold, what is the likelihood that
exactly 6 consoles will fail?

Binomial question: each trial has two possible outcomes and the trials are independent


N = 10
X=6
P = 0.4
Q = (1 - P) = 0.6

nCx * p^x * q^(n-x)


10C6 x 0.4^6 x 0.6^4 = 0.111476736 = 11.15%
b) There are 5 chocolates left in a favourites box consisting of 4 Cherry Ripes and 1
Picnic. What is the probability that a person who takes one chocolate at random will
take the picnic?
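Bernoulli question:
x = 1 (success = Picnic)
x = 0 (fail = Cherry Ripe)
p = 1/5 = 0.2

P(X = 1; p = 0.2) = 0.2^1 * (1 - 0.2)^(1 - 1)
= 0.2

If success is instead defined as taking a Cherry Ripe:
x = 1 (success = Cherry Ripe), p = 4/5 = 0.8
P(X = 1; p = 0.8) = 0.8^1 * (1 - 0.8)^(1 - 1)
= 0.8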

c) In a game of Bingo there is an equal probability that a number between 1 through to


75 is called at random. What is the probability that the bingo caller will call the
number ‘24’?
Discrete uniform question: discrete outcomes that are all equally likely to occur
a = min = 1
b = max = 75

P(X = 24; a = 1, b = 75) = 1 / (75 - 1 + 1)
= 1 / 75
= 0.01333
= 1.33%

d) A manager wants to determine the expected rating that employees will receive on
their mid-year performance review. You have been given the following data detailing
the number of employees falling into each rating category (1 = unsatisfactory; 5 =
satisfactory).
Calculate:
i. The expected rating of employees
ii. The variance
iii. The standard deviation

Rating (xi)   Employees   P(xi)     Variance term (round to 4 dp)
1             25          0.00625   (1 - 4.305)^2 x 0.00625 = 0.0683
2             125         0.03125   (2 - 4.305)^2 x 0.03125 = 0.1660
3             522         0.1305    (3 - 4.305)^2 x 0.1305 = 0.2223
4             1262        0.3155    (4 - 4.305)^2 x 0.3155 = 0.0293
5             2066        0.5165    (5 - 4.305)^2 x 0.5165 = 0.2495
Sum           4000

P(xi) is the number of employees with rating xi divided by the total of 4000.

i. E(x) = 4.305 (round to 3 dp)
ii. Var (x) = 0.7355
iii. Std (x) = 0.8576
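
The expected value, variance and standard deviation of a discrete distribution can be sketched directly from the frequency counts. This is an illustrative Python version with assumed variable names; it keeps the mean unrounded, so the fourth decimal place of the variance can differ slightly from a hand calculation that rounds the mean to 4.305 first.

```python
from math import sqrt

# Number of employees per rating
counts = {1: 25, 2: 125, 3: 522, 4: 1262, 5: 2066}
total = sum(counts.values())

# Relative frequencies act as the probabilities P(x)
probs = {rating: n / total for rating, n in counts.items()}

expected = sum(x * p for x, p in probs.items())                    # E(X) = sum of x * P(x)
variance = sum((x - expected) ** 2 * p for x, p in probs.items())  # sum of (x - E(X))^2 * P(x)
std_dev = sqrt(variance)

print(round(expected, 3), round(variance, 4), round(std_dev, 4))
```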
Lecture 4: Continuous Distributions
Activity 1

Uniform Distribution (Slides 17-19)
When is it used? If all potential outcomes (realisations) between a and b are equally likely.
Formula and notation:
P(X < x) = (x - a) / (b - a)
x = value of interest
a = minimum value
b = maximum value
Characteristics: Probability of success remains the same over the entire interval

Normal Distribution (Slides 22-32)
When is it used? When the data follows a normal distribution
Formula and notation (standardisation formula):
Z = (X - μX) / σ
X = value we want the probability for
μX = mean
σ = standard deviation
Characteristics: Distribution is symmetrical about its mean, continuous, uni-modal

Symmetry Rule (Slide 27)
When is it used? When the Z-score is negative; the symmetry rule gets rid of the negative sign.
Example: P(Z<-1)
Why? Because the z-table does not cater for negative z-scores.

Complementary Rule (Slides 15, 28)
When is it used? When the question asks for greater than (>); the complement rule reverts
the sign from ‘>’ to ‘<’.
Example: P(Z>1)
Why? Because the z-table does not cater for > questions. When we use the symmetry rule we
must always use the complement rule after it to ensure we are working out the correct
probability.

Interval Rule (Slides 14, 20, 29)
When is it used? For questions where the probability is over an interval.
Example: P(1<Z<2)
Why? Because the z-table will not give values for intervals.

Key takeaway: z-table only caters for probability of Z<positive number.

Activity 2: Practise Questions


2. It is determined that the cost of conducting a research study is uniformly distributed, with a
minimum cost of $50 and a maximum cost of $120. What is the probability that a research
study will cost somewhere between $60 and $80.

i) Draw the graph representing this information:

A = $50 (min) B = $120 (max) X1 = $60 X2 = $80


ii) What is the probability that a research study will cost somewhere between $60 and $80?
A = $50 B = $120
X1 = $60 X2 = $80

P($60 < x <$80) = P(x<$80) – P(x<$60)

P(X<$80) = (80-50) / (120-50) = 0.4286

P(x<$60) = (60-50) / (120-50) = 0.1429

P($60 < x <$80) = 0.4286 – 0.1429

= 0.2857 = 28.57%.

iii) What is the probability that a research study will cost somewhere between $100 and $130?
Any value above $120 will not occur, as $120 is the upper limit. Therefore we calculate
the probability of $100 to $120
A = $50 B = $120
X1 = $100 X2 = $120

P($100 < x <$120) = P(x<$120) – P(x<$100)

P(X<$120) = (120-50) / (120-50) = 1

P(x<$100) = (100-50) / (120-50) = 0.7143

P($100 < x <$120) = 1 – 0.7143

= 0.2857 = 28.57%
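
Both uniform questions follow the same pattern: evaluate P(X < x) at the two endpoints and subtract. A minimal Python sketch with illustrative helper names; the clipping inside `uniform_cdf` handles the case where an endpoint ($130) lies beyond the maximum.

```python
def uniform_cdf(x, a, b):
    """P(X < x) for X uniformly distributed between a and b."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0  # nothing can occur above the maximum
    return (x - a) / (b - a)


def uniform_between(low, high, a, b):
    """P(low < X < high) = P(X < high) - P(X < low)."""
    return uniform_cdf(high, a, b) - uniform_cdf(low, a, b)


p_60_to_80 = uniform_between(60, 80, a=50, b=120)
p_100_to_130 = uniform_between(100, 130, a=50, b=120)
print(round(p_60_to_80, 4), round(p_100_to_130, 4))
```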

3. The time taken to complete a programming assignment follows a normal distribution with
an average time of 6 hours and a standard deviation of 1 hour and 20 minutes. These statistics
were based on a sample of 100 computer science students. What is the probability that a
student takes more than 8 hours to complete the assignment?

Step 1: Write out the values that are given in the question
In hours:   X = 8, SD = 1.333, mu = 6
In minutes: X = 480, SD = 80, mu = 360
Either way will give the same answer; you just need to be consistent with units.
Step 2: Standardise the information. You need to find the z-score (remember to convert all
numbers to the same units!)

In hours: P(X > 8; 𝜇 = 6, 𝜎 = 1.333)
z = (8 - 6) / 1.3333
= 2 / 1.3333
= 1.5

In minutes: P(X > 480; 𝜇 = 360, 𝜎 = 80)
z = (480 - 360) / 80
= 120 / 80
= 1.5

NOTE: Values from the table are all P(Z < z), where z is the value from the standardisation formula
P(X > 8) = P(Z > 1.5)
= 1 - P(Z < 1.5)
= 1 – 0.9332
= 0.0668 OR 6.68%
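
The standardise-then-complement steps can be sketched with Python's `statistics.NormalDist`, which plays the role of the z-table here. The variable names are illustrative.

```python
from statistics import NormalDist

mu = 6          # mean completion time in hours
sigma = 4 / 3   # 1 hour 20 minutes, expressed in hours

# Standardise: how many standard deviations is 8 hours above the mean?
z = (8 - mu) / sigma

# Complement rule: P(X > 8) = 1 - P(X < 8)
p_more_than_8 = 1 - NormalDist(mu, sigma).cdf(8)

print(round(z, 2), round(p_more_than_8, 4))
```

Working in minutes (`NormalDist(360, 80).cdf(480)`) gives the same probability, which is the point of keeping units consistent.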

Lecture 5: Sampling & Sampling Distributions


ANSWERS

Activity 1: Theory Questions


1. Why do we prefer to take samples of data rather than surveying the whole
population?
a) It is costly. We cannot measure everyone’s height to compute the population
mean height of people: there are 7.7 billion individuals to measure.

b) It is time-consuming. By the time measuring is finished, the world population


has changed.

c) It risks killing the entire population. To measure the average passing rate of
car crash tests, we need to crash all cars.

d) The population size can be infinite. To measure the average spinning speed
of an atom, we would need to find all atoms in the universe.
2. Define statistical inference:
● Statistical inference goes from the sample to the population. We use information from
a sample to summarise/report/estimate/describe/test parameters in the population.
● In other words: Statistical inference (or statistical analysis) uses information from a
sample to infer properties about the population.

3. Define sampling error:


The discrepancy between a sample statistic and the corresponding population parameter.
- The difference between the sample mean and population mean i.e. the difference
between what the estimated mean is and what the actual average is.

Activity 2: Sampling Distributions (refer to the slides for complete details of
each type of distribution)

X̄ (X-bar: Sample Mean), if 𝝈X is known (Hint: Slide 8)
Formula: Z = (X̄ - μX̄) / σX̄ (standardisation formula)
where: σX̄ = σX / √n
Standard Error is a measure of (the variability of) sampling error
Higher Accuracy when:
● n is larger
● 𝝈X is smaller
Note:
- X̄ = random variable we are interested in (sample mean)
- μX̄ = μX

X̄ (X-bar: Sample Mean), if 𝝈X is unknown (Hint: Slide 12)
Formula: t = (X̄ - μX̄) / σX̄
where: σX̄ = sX / √n
t(v): v (dof) = n - 1
Note:
- X̄ = random variable we are interested in (sample mean)
- μX̄ = μX

s2X (Sample Variance): follows Chi-square distribution (Hint: Slide 17)
Formula: (n - 1) x s2X / σ2X
where: s2X = sample variance
χ2(v): v (dof) = n - 1
Activity 3: Practise Questions
3a)

4-step Procedure.
Step 1: Write down the probability statement that represents the answer to the problem:

P(X̄ < 43; μX = 40.5, σX = 7.1)

Step 2: Standardisation; write down the probability statement in a standardised form

P( (X̄ − μX̄) / σX̄ < (43 − μX̄) / σX̄ ; μX = 40.5, σX = 7.1 )

Step 3: Plug in numbers

σX̄ = 7.1 / √40
= 1.12261
z = (43 − 40.5) / 1.12261
= 2.22695609871
= 2.23

P( (X̄ − μX̄) / σX̄ < 2.23 )

Step 4: Transform into Z

P(Z < 2.23) = (using Z-table) = 0.9871
= 98.71% probability that the average age of the firm is less than 43.
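
The four steps can be sketched in Python. The sample size n = 40 is taken from the √40 in the working, and the variable names are illustrative. Note that the code keeps z unrounded, so its probability is marginally below the table value obtained after rounding z to 2.23.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 40.5, 7.1, 40  # population mean, population SD, sample size

# Standard error of the sample mean
standard_error = sigma / sqrt(n)

# Standardise the threshold of 43 and look up P(Z < z)
z = (43 - mu) / standard_error
p_mean_below_43 = NormalDist().cdf(z)

print(round(standard_error, 5), round(z, 2), round(p_mean_below_43, 3))
```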

3b)

Because the population standard deviation (𝜎𝑋) is unknown, the standard error has the
sample standard deviation sX in the formula.

→ Follows t distribution
σX̄ = sX / √n = 3.2 / √31 = 0.5747
Consequently, (X̄ − μX̄) / σX̄ = tv ~ t(v) with (degrees of freedom) v = n - 1 = 31 - 1 = 30.
3c)

Chi-Square distribution

n = 19
s^2X = ??
σ^2X = $200^2

with (degrees of freedom) v = n - 1 = 18.

If the question asked: what is the probability that the sample standard deviation
sX is less than $185?

ANSWER:

((19 - 1) x 185^2) / 200^2 = 15.40125

P(χ2v < 15.40125; v = 18)
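
The chi-square statistic above is a one-line calculation; evaluating the probability normally needs a table or software. The sketch below uses only the Python standard library and relies on a closed form that holds when the degrees of freedom are even (18 here): P(χ2 > x) = e^(-x/2) times the sum of (x/2)^i / i! for i from 0 to v/2 - 1. The function name is illustrative, and this shortcut is an addition for checking the answer, not the method used in the lecture.

```python
from math import exp, factorial


def chi2_cdf_even_dof(x, dof):
    """P(chi-square < x) when dof is even, via the closed-form upper tail."""
    if dof % 2 != 0:
        raise ValueError("this closed form only applies to even degrees of freedom")
    half = x / 2
    upper_tail = exp(-half) * sum(half**i / factorial(i) for i in range(dof // 2))
    return 1 - upper_tail


n, s, sigma = 19, 185, 200

# Test statistic: (n - 1) * s^2 / sigma^2
chi2_stat = (n - 1) * s**2 / sigma**2
p_sd_below_185 = chi2_cdf_even_dof(chi2_stat, n - 1)

print(chi2_stat, round(p_sd_below_185, 3))
```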

To help understand the difference between this week and last week:
THIS QUESTION IS ASKING ABOUT A RANDOMLY CHOSEN STUDENT AND NOT A
MEAN, THUS WE DON'T NEED TO FIND THE STANDARD ERROR

P(X<1350)
z = (1350-1500)/200
z = -0.75
= P(Z<-0.75)
= 1-P(Z<0.75)
= 1-0.7734
= 0.2266
Thus the answer is A

Lecture 6: Point and Interval Estimates


ANSWERS

1. What is the general structure to an interval estimate?

2. Illustrate the general structure graphically.


Example:

3.
(note: confidence interval is a reasonable range for a parameter)
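Parameter                       Distribution              Formula
Population Mean (σ known)       Normal Distribution       x-bar +/- z α/2 x (σ / sqrt n)
Population Mean (σ unknown)     T-Distribution            x-bar +/- t α/2, n-1 x (s / sqrt n)
Population Variance             Chi-square distribution   ( (n-1)s^2 / χ2 α/2, n-1 ; (n-1)s^2 / χ2 1-α/2, n-1 )
                                                          (lower value, upper value)
Population Standard Deviation   Chi-square distribution   square root of each variance limit
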
Summary

4. What does the α=0.05 mean?


𝛼 is the probability of making a mistake. An alpha of 0.05 means there’s a 5% chance
that mu (mean) would be outside the constructed interval.
- Likelihood that the true population parameter lies outside the confidence
interval

5. The confidence interval will be wider if (σX) sigma increases and alpha decreases.
The confidence interval will be tighter if n increases.

Information from lecture:


● 𝜎X increases? Wider. Because if the variation of X increases, so does the variation of X̄. We are less
certain about the value of 𝜇X
● n increases? Tighter. Because more data is used, which improves accuracy.
● α decreases? Wider. α can be thought of as the tolerance we have for making mistakes.
○ α = 0.05 means there is a 5% chance 𝜇X is outside the constructed interval, thus
we make a mistake by using that interval.
○ Decreasing α means we are more conservative, allowing for fewer mistakes, thus
we need to widen the interval.

Practice Questions
1. Using data on checkout time for a sample of 400 consumers, Coles supermarket
determines that it takes on average 6 minutes for consumers to complete their grocery
purchases at the self-checkout counters. The population standard deviation is known
to be 2 minutes. The store layout manager would like to construct an interval estimate
with a 95% confidence interval (CI) of average checkout time in Coles store.

n = 400
Point estimate (x-bar) = 6 mins
Pop. Std Dev. = 2 mins
95% CI: α = 0.05, α/2 = 0.025, z α/2 = 1.96. Since population std dev is known, use z α/2.
CI = 6 +/- (1.96) (2 / sqrt 400)
= 5.804 min ; 6.196 min
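
A confidence interval for the mean with known σ can be sketched as follows. `z_interval` is an illustrative helper name, and `NormalDist().inv_cdf` stands in for the z-table lookup of z α/2.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(x_bar, sigma, n, confidence=0.95):
    """Point estimate +/- z(alpha/2) * sigma / sqrt(n)."""
    alpha = 1 - confidence
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95%
    margin = z_crit * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin


lower, upper = z_interval(x_bar=6, sigma=2, n=400)
print(round(lower, 3), round(upper, 3))
```

Raising `n` or lowering `sigma` shrinks the margin, while raising `confidence` (a smaller α) widens it.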

2. A tutor was interested in estimating how many hours students took to finish a
programming assignment. As he could not ask all students, he only polled his own
class of 30 students. He found the average was 30 hours with a sample standard
deviation of 7 hours. Construct a confidence interval showing how long students across
the subject took on average to finish the assignment. (a=0.05)

n = 30
Point estimate (x-bar) = 30
Sample Std Dev. = 7
95% CI: α = 0.05, α/2 = 0.025. Since pop. std dev is unknown, use t α/2, n-1 = t 0.025, 29 = 2.045
CI = 30 +/- 2.045 (7 / sqrt 30)
= 30 - 2.6135 ; 30 + 2.6135
= 27.3865 ; 32.6135
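
The same interval structure applies when σ is unknown, with the t critical value replacing z. The Python standard library has no t-distribution, so this sketch hard-codes the table value 2.045 used in the working (t with α/2 = 0.025 and 29 degrees of freedom); treat that constant as an input read from a t-table.

```python
from math import sqrt

T_CRIT = 2.045  # t(0.025, 29), read from a t-table as in the working

n, x_bar, s = 30, 30, 7

# Margin of error uses the sample SD in the standard error
margin = T_CRIT * s / sqrt(n)
lower, upper = x_bar - margin, x_bar + margin

print(round(margin, 4), round(lower, 4), round(upper, 4))
```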

3. In a typical car, bell housings are bolted to crankcase castings by means of a series
of 13 mm bolts. A random sample of 12 bolt-hole diameters is checked as part of a
quality control process and found to have a variance of 0.0013 mm2 .
(a) Construct the 95% confidence interval for the variance of the holes.
(b) Find the 95% confidence interval for the standard deviation of the holes.
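
No worked answer is given for question 3, so the following is only a sketch of how the interval could be computed, assuming the usual form ((n-1)s^2 / χ2 upper critical value ; (n-1)s^2 / χ2 lower critical value). The two critical values for 11 degrees of freedom (21.920 and 3.816) are assumed to be read from a standard chi-square table and should be checked against the table used in the subject.

```python
from math import sqrt

n = 12
sample_var = 0.0013  # mm^2

# Assumed table values for v = n - 1 = 11 degrees of freedom
CHI2_UPPER = 21.920  # critical value with 2.5% of the area to its right
CHI2_LOWER = 3.816   # critical value with 97.5% of the area to its right

# (a) 95% interval for the variance
var_lower = (n - 1) * sample_var / CHI2_UPPER
var_upper = (n - 1) * sample_var / CHI2_LOWER

# (b) 95% interval for the standard deviation: square root of each limit
sd_lower, sd_upper = sqrt(var_lower), sqrt(var_upper)

print(round(var_lower, 6), round(var_upper, 6))
print(round(sd_lower, 4), round(sd_upper, 4))
```

Dividing by the larger critical value gives the lower limit, which is why the interval is not symmetric around 0.0013.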
Lecture 7: Hypothesis Testing
ANSWERS:
Activity 1: What five steps do we need to follow when conducting hypothesis testing?
μ (Population Mean): if 𝝈X is known (Hint: Slide 20), if 𝝈X is unknown (Hint: Slide 25)
𝝈2X and 𝝈X (Hint: Slide 33)

STEP ONE: Write down the NULL and ALTERNATIVE hypothesis

For μ:   H0: μ = c ; Ha: μ ≠ c
For 𝝈2X: H0: 𝝈2X = c ; Ha: 𝝈2X ≠ c
For 𝝈X:  H0: 𝝈X = c ; Ha: 𝝈X ≠ c

STEP TWO: Write down the TEST STATISTIC

STEP THREE: Write down the null DISTRIBUTION

STEP FOUR: Write down the REJECTION rule: At the significance level 𝛼, we reject the null if:
STEP FIVE: a) Compute the test STATISTIC, b) write down the test RESULT
c) and CONCLUDE.

CASE 1: Because (REFER TO THE REJECTION RULE FOR RESPECTIVE DISTRIBUTION), there is
statistical evidence suggesting that the null hypothesis does not hold. Thus we reject the null and accept the
alternative.

CASE 2: There is not enough statistical evidence to reject the null, so we fail to reject the null (and we
maintain our null hypothesis).
Xbar = 2.5 hours (150 minutes)
S = 0.75 hours (45 minutes)
N = 61

Step 1:
Ho: mean = 750
Ha: mean =/= 750

Step 2:
zstat = -404.46

Step 3:

Step 4:
Alpha = 0.05
alpha/2 = 0.025

zcritvalue = 1.96

Step 5:
zstat < -Zcritval -404 < -1.96
Zstat > Zscritval 5.21 > 2 YES

Yes, management should be worried because the workers are taking either too long or too
little time on the tasks

Ho: mean = 2
Ha: mean does not = 2

Step2:
Test stat = 5.21

Step 3:
Dof = 61-1 = 60

Step 4:
Alpha = 0.05
alpha/2 = 0.025
T crit value =2
Tstat = 5.21

Reject the null if Tstat < -tcrit value or Tstat > tcrit value
5.21 < -2 NO
5.21 > 2 YES

Statistical evidence to reject the null hypothesis and accept the alternative. This
means that management should be worried because the workers are either taking
too much or too little time completing the tasks.
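
The test statistic of 5.21 can be reproduced with a short sketch. It assumes the figures are in hours (sample mean 2.5, sample SD 0.75, n = 61) and that the hypothesised mean is 2, as the working suggests. The critical value 2 is the table value quoted in the working for 60 degrees of freedom.

```python
from math import sqrt

x_bar, s, n = 2.5, 0.75, 61  # sample mean (hours), sample SD (hours), sample size
mu_0 = 2                     # hypothesised mean under the null
T_CRIT = 2.0                 # t(0.025, 60) as quoted in the working

# Test statistic: distance from the hypothesised mean in standard errors
t_stat = (x_bar - mu_0) / (s / sqrt(n))

# Two-sided rejection rule
reject_null = t_stat < -T_CRIT or t_stat > T_CRIT

print(round(t_stat, 2), reject_null)
```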

Question 2ii)
Xbar = 135
Sx = 50
N = 41

Step 1:
Ho: sd = 100
Ha: sd =/=100

Step2:
Chistat = (41-1)50^2 / 100^2 = 10

Step3:
Dof = 40

Step4:
Alpha = 0.05
Alpha/2 = 0.025
1 - alpha/2 = 1 - 0.025 = 0.975

Lecture 8: Hypothesis Testing II


ANSWERS

Activity 1: What five steps do we need to follow when conducting hypothesis testing?

t-test (tests for equal means): Independent sample (Hint: Slide 12-15), Paired sample (Hint: Slide 13)
F-test (tests for equal variance) (Hint: Slide 23-24)

STEP ONE: Write down the null and alternative hypothesis

Independent sample:
H0: μx = μY
Ha: μx ≠ μY
OR
H0: μx - μY = c (usually 0)
Ha: μx - μY ≠ c

Paired sample (firstly compute D-bar and sD):
H0: μD = c
Ha: μD ≠ c

F-test:
H0: 𝝈2X = 𝝈2Y
Ha: 𝝈2X ≠ 𝝈2Y

STEP TWO: Write down the test statistic

Where;

STEP THREE: Write down the null distribution

Tvu = always provided.

STEP FOUR: Write down the rejection rule: At the significance level 𝛼, we reject the null if:

where

STEP FIVE: a) Compute the test statistic, b) write down the test result
c) and conclude.

Activity 2: Practice Questions

● i) Independent Samples Test

● ii) Paired Sample Test


● iii) F-Test

The old answers are at the bottom if you want to use them :)

Jessica’s Working Out for Activity 2

Question 1:
Question 2:

Question 3

● Finding the critical values on the F table:

○ → CV would be either 2.82 or 2.72


● EXAMPLE OF QUESTION 3 FROM LECTURE:

Lecture 9: Hypothesis Testing III - SOLUTIONS


Activity 1: ANOVA and Contingency Table Test for Independence

ANOVA: used when we want to test the equality of more than two means.
Contingency Table Test: used when we want to test if the row variable and the column
variable are independent.

STEP ONE: Write down the null and alternative hypothesis

ANOVA:
H0: μ1 = μ2 = μ3 = .. μi = c
Ha: at least one μi ≠ c
Contingency Table Test:
H0: 𝑋 row variable and 𝑌 column variable are independent
Ha: 𝑋 and 𝑌 are dependent

STEP TWO: Write down the p-value or the test statistic

ANOVA:
NOTE: You will need to find this from the ANOVA results table
Contingency Table Test:
Construct a chi square table. For each cell that is not in the total row or column of the
table, we calculate (observed - expected)^2 / expected. We then sum up the results to get the
chi test statistic.

STEP THREE: Write down the null distribution

ANOVA:
NOTE: Skip this step for ANOVA analysis :)
Contingency Table Test:
Distribution of the test statistic under the null of independence: χ2(v) distribution with
degrees of freedom v = (#rows − 1) × (#columns − 1).
The critical value χ2 α,v comes from the χ2(v) distribution with the significance level α and
degrees of freedom v = (#rows − 1) × (#columns − 1).

STEP FOUR: Select a level of significance α and look up the critical value

STEP FIVE: Write down the rejection rule and conclude:

ANOVA:
If the 𝑝-value < the chosen 𝛼 → Reject the null hypothesis (at least one mean is different)
If the 𝑝-value > the chosen 𝛼 → Fail to reject the null hypothesis (all means are the same)
Contingency Table Test:
If χ2 α,v < χ2 test statistic → Reject the null hypothesis
If χ2 α,v > χ2 test statistic → Fail to reject the null hypothesis
Conclusions for ANOVA Test
Case 1: At the level of significance α, because _______, we reject the null and accept the
alternative that at least one population mean is not equal to the rest of the population means.
Case 2: At the level of significance α, because _____, we fail to reject the null, and therefore
maintain the hypothesis that all population means tested are equal.

Conclusion for Contingency Table Test


Case 1: At the level of significance α, because ____, there is sufficient statistical evidence to
reject the null hypothesis.
Case 2: At the level of significance α, because ______, there is insufficient statistical evidence
to reject the null, so we fail to reject the null (and we maintain the null hypothesis).

Activity 2: Practise Questions

i) ANOVA
a. What test should you use? Why?
● ANOVA - “test whether the mean pressure applied to the driver’s head during a crash
test is equal for each type of car”
● Testing if the mean of multiple independent samples is equal (more than two).

b. What is null and alternative hypothesis?


𝐻0: 𝜇1=𝜇2=𝜇3 𝐻𝑎: not all mean pressures are equal
where 𝜇𝑖 is the mean pressure applied to the driver’s head during a crash test for each type of
car 𝑖, and 𝑖∈{1,2,3} indicating compact, midsize, and full-size cars, respectively.

c. At the 5% significance level, what is the conclusion of the test?


The 𝑝-value 0.001207 < 0.05, which means there is sufficient statistical evidence to reject the
null hypothesis at the 5% level. We thus reject the null hypothesis and accept the alternative
hypothesis that not all the mean pressure applied for the car types are equal (there is at least
one mean pressure that deviates from the rest of group).

d. In the case of rejection, which type of car has a different mean pressure applied to the
driver’s head during a crash test?
There is at least one mean pressure that is different BUT we do not know which one.
ii) Contingency Table
a. Using the above data, produce a contingency table for expected frequency in terms of
whole numbers.

Formula for the cells:

(column total / grand total) * row total

1st cell ($29 Cap, Mon - Fri): (50 / 800) * 380 = 23.75

Expected $29 Cap $49 Cap $79 Cap Total

Mon - Fri 23.75 71.25 285 380

Sat - Sun 26.25 78.75 315 420

Total 50 150 600 800

b. Write down the null and alternative hypotheses for this question. Prove that the test statistic
= 86.8839 at the significance level alpha = 0.05.

𝐻o: day and phone plan sold are independent


𝐻𝑎: day and phone plan sold are dependent

Formula for the cells:

(observed - expected)^2 / expected


(10-23.75)^2 / 23.75 = 7.9605

$29 Cap $49 Cap $79 Cap Total

Mon - Fri 7.9605 33.3553 4.2982

Sat - Sun 7.2024 30.1786 3.8889

Total 86.8839

c. Create a contingency table for observed frequency in terms of relative frequency.

Formula for the cells:

Each cell divided by 800 (1st cell: 10/800 = 0.0125)

$29 Cap $49 Cap $79 Cap Total

Mon - Fri 0.0125 0.15 0.3125 0.475

Sat - Sun 0.05 0.0375 0.4375 0.525


Total 0.0625 0.1875 0.75 1

d. Create a contingency table under the assumption that the type of phone plan chosen is
independent of the day the plan is purchased.

Formula for the cells:

Multiply the total probabilities together (1st cell: 0.475*0.0625 = 0.029688)

$29 Cap $49 Cap $79 Cap Total

Mon - Fri 0.0297 0.0891 0.3563 0.475

Sat - Sun 0.0328 0.0984 0.3938 0.525

Total 0.0625 0.1875 0.75 1
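
Parts c and d can be reproduced together. The following is a rough sketch with illustrative names. The observed counts are again an assumption, inferred from the relative frequencies in part c multiplied by 800.

```python
# Observed counts (assumed, inferred from the part c table x 800).
observed = [[10, 120, 250],   # Mon - Fri
            [40, 30, 350]]    # Sat - Sun
n = sum(sum(row) for row in observed)

# Part c: relative frequency of each cell.
relative = [[count / n for count in row] for row in observed]

# Marginal (total) probabilities for rows and columns.
row_marginals = [sum(row) for row in relative]
col_marginals = [sum(col) for col in zip(*relative)]

# Part d: under independence, joint probability = product of the marginals.
independent = [[r * c for c in col_marginals] for r in row_marginals]

for row in independent:
    print([round(p, 4) for p in row])
```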

e. Test whether the day (weekday/weekend) a customer purchases a plan and the type of
phone plan sold are independent. What is the conclusion from the test?

The critical value 𝜒²_{𝛼,𝑣} with the significance level 𝛼 = 0.05 and degrees of freedom
𝑣 = (2−1)×(3−1) = 2 equals 𝜒²_{0.05,2} = 5.9915

The test statistic 86.8839 exceeds the critical value 5.9915.

This means that we have sufficient statistical evidence to reject the null hypothesis of
independence; we thus accept the alternative hypothesis - day of the week and the phone
plan sold are dependent.
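
The critical value can also be computed rather than read from a table. The sketch below relies on a shortcut that holds only for 2 degrees of freedom, where the chi-square upper-tail probability is exp(-x/2). For other degrees of freedom a statistical table or library would be needed. Variable names are illustrative.

```python
import math

alpha = 0.05
test_statistic = 86.8839  # from part b

# Valid for 2 degrees of freedom only: P(X > x) = exp(-x / 2),
# so the critical value solves exp(-x / 2) = alpha.
critical_value = -2 * math.log(alpha)
p_value = math.exp(-test_statistic / 2)

reject_null = test_statistic > critical_value
print(round(critical_value, 4), reject_null)
```

Comparing the test statistic with the critical value and comparing the p-value with alpha lead to the same decision.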

EXTRA RESOURCE:
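Process for Computing Test Statistic:
Making the assumption that the variables are independent:
Therefore:

P(A and B) = P(A) × P(B)

To compute the Expected Values:

Expected Frequency of A and B occurring together:

Expected(A and B) = (Count(A) / Total Count) × (Count(B) / Total Count) × Total Count

Which can be simplified to:

Expected(A and B) = (Count(A) × Count(B)) / Total Count

Note: Count refers to the total number of that variable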

Example:
Where:
A - $29 Cap Therefore Count(A) = 50
B - Mon-Fri Therefore Count(B) = 380
Total Count = 800
Substituted into the equation:

Expected = (50 × 380) / 800 = 23.75

$29 Cap $49 Cap $79 Cap Total

Mon - Fri 23.75 71.25 285 380

Sat - Sun 26.25 78.75 315 420

Total 50 150 600 800


Note: only calculate expected values for the joint cells; the totals remain the same as they
are on the observed table

To Calculate the Test Statistic:

Using the equation:

𝜒² = Σ (Observed − Expected)² / Expected

Example: For $29 Cap and Mon-Fri

Observed = 10
Expected = 23.75
Substitute into equation:

(10 − 23.75)² / 23.75 = 7.9605

This value is calculated for each of the joint cells and then added together to find the
total test statistic
$29 Cap $49 Cap $79 Cap Total

Mon - Fri 7.9605 33.3553 4.2982

Sat - Sun 7.2024 30.1786 3.8889

Total 86.8839
U:PASS 10: Regression Analysis I - SOLUTIONS
a) What is regression analysis and why is it useful? Provide some examples.

● Regression analysis quantifies the relationship between a dependent variable (𝑌) and one or
more independent variables (𝑋1, 𝑋2, …, 𝑋𝑗). It is useful for two purposes:

● 1. Hypothesis: some variables (𝑋1, 𝑋2, …, 𝑋𝑗) may have an effect on a variable (𝑌), and
we want to quantify such effects.
○ E.g. we hypothesise that size, age and location may affect property price.
● 2. Prediction: given some values of the independent variables, we want to predict the value
of 𝑌.
○ E.g. given size, age and location, what is a property expected to cost?

b) What is the difference between simple and multivariate regression?

● Simple - only has one independent variable (X) in the regression model
○ In other words: Simple - has only one explanatory variable in the regression
model
● Multivariate - has multiple independent variables (X’s) in the regression model that
may have a causal impact on the dependent variable (Y)
○ In other words: While simple has one explanatory variable multiple regression
has multiple explanatory variables (i.e. more factors affecting yi)

c) Complete the following table.

Used when..
● Test for Significant Effect: determining whether individual independent variables have a
significant effect on the dependent variable 𝑌. (In other words: this tests whether the
j-th independent variable Xj has a significant effect on the dependent variable Y.)
● Test for Joint Significance: determining whether the whole regression model has
explanatory power for 𝑌 (whether a model is useful).

STEP ONE: Write down the null and alternate hypothesis.
● Test for Significant Effect:
○ H0: βj = 0
○ Ha: βj ≠ 0
● Test for Joint Significance:
○ H0: All regression coefficients are zero, or β1 = β2 = … = βj = 0
○ Ha: at least one coefficient is non-zero

STEP TWO: Write down the rejection rule and conclude.
● Test for Significant Effect:
○ Reject the null if: 𝑗-th 𝑝-value < α → 𝑋𝑗 is a significant variable and has a significant
effect on 𝑌.
○ Fail to reject the null if: 𝑗-th 𝑝-value > α → 𝑋𝑗 is not a significant variable. It has no
significant effect on 𝑌.
● Test for Joint Significance:
○ Reject the null if: significance F < α (α > significance F) → at least one coefficient is
non-zero
○ Fail to reject the null if: significance F > α (α < significance F) → 𝐚𝐥𝐥 regression
coefficients are zero

Note: Significance F is the 𝑝-value of the test for joint significance

Activity 2: Practice Questions

Part A: Write out the regression Model (Regression Output Interpretation )


Information

Answer:
𝑦𝑖 = 𝛽0 + 𝛽1𝑥1,𝑖 + 𝛽2𝑥2,𝑖 + 𝛽3𝑥3,𝑖 + 𝛽4𝑥4,𝑖+ 𝛽5𝑥5,𝑖 + 𝜖𝑖

Where i = 1, … , 545 and


● 𝜖𝑖∼𝑁(0,𝜎) is a normally distributed error term
● 𝑦𝑖 is the auction price (in 1000 dollars) in house 𝑖
● 𝑥1,𝑖 is the square metres (in m2) in house 𝑖
● 𝑥2,𝑖 is the distance to schools (in km) for house 𝑖
● 𝑥3,𝑖 is the distance to shops (in km) for house 𝑖
● 𝑥4,𝑖 is the number of bathrooms in house 𝑖
● 𝑥5,i is the number of bedrooms in house 𝑖
Part B: Write out the estimated model
Information

Answer:
Part C: interpret the intercept and all of the estimated coefficients (Interpretation of the
intercept + estimated coefficients)

Information

● The intercept: The auction price is expected to be 60.39093*1000 dollars when
square metres, distance to schools/shops and the number of bathrooms/bedrooms
are all zero.
○ In other words: When square metres, distance to schools/shops and the
number of bathrooms/bedrooms (all independent variables) are equal to 0, the
auction price is expected to equal $60.39 K (60.39093*1000).
That is, $60.39 K is the expected value of the auction price when all X’s are
zero.
● The regression coefficients:
○ Keeping all other variables constant, an extra square metre in the house is
expected to increase the auction price by 0.068209*1000 dollars ($68.21).
○ Keeping all other variables constant, an extra km away from the school is
expected to decrease the auction price by 0.05025*1000 dollars ($50.25).
○ Keeping all other variables constant, an extra km away from the shop is
expected to decrease the auction price by 0.23617*1000 dollars ($236.17).
○ Keeping all other variables constant, an additional bathroom is expected to
increase the auction price by 0.033752*1000 dollars ($33.75).
○ Keeping all other variables constant, an additional bedroom is expected to
increase the auction price by 0.181733*1000 dollars ($181.73)
Part D: Are the variables jointly significant at the 10% level?

● At the 10% level, the null hypothesis is rejected because:

significance F < α → at least one coefficient is non-zero

4.21E-38 < 0.10

(Note: 4.21E-38 = 4.21 × 10^-38)

● This means there is sufficient statistical evidence to reject the null and accept the
alternative that the variables are jointly significant.
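
The rejection rule used here is easy to express in code. The helper below is a hypothetical illustration (the function name is not from the course material). It applies the rule "reject the null if the p-value is below α" to the Significance F quoted above.

```python
def reject_null(p_value: float, alpha: float) -> bool:
    """Rejection rule: reject the null hypothesis when the p-value is below alpha."""
    return p_value < alpha


significance_f = 4.21e-38  # Significance F quoted in Part D

# Apply the joint-significance test at the three usual levels.
decisions = {alpha: reject_null(significance_f, alpha) for alpha in (0.10, 0.05, 0.01)}
print(decisions)
```

The same function applies to the test for a single coefficient when it is given that coefficient's p-value instead of Significance F.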

Part E: At the 10% level, which variables are significant? What about at the 5% and 1%
levels? What does this mean in terms of the null and alternative hypothesis?

● The fact these variables are significant means that their coefficients are significantly
different from zero.
○ In other words: rejecting the null hypothesis means Xj is a significant
variable and has a significant effect on Y. Therefore the coefficient is
significantly different from 0.

10% Significance level At the 10% level, the intercept, square metres, distance to
shops/schools and bedrooms are significant variables.

5% Significance level At the 5% level, the intercept, square metres, distance to
schools/shops and bedrooms are significant variables.

1% Significance level At the 1% level, the intercept, square metres, distance to shops
and bedrooms are significant variables.

Part F: Predict the auction price of a house that has 6 bedrooms, 2 bathrooms, is 5 km
away from both the shop and the school, and is situated on a block of 500 square metres.

The regression line is given by

𝑦auction prices = 60.39093 + 0.068209𝑥1 - 0.05025𝑥2 - 0.23617𝑥3 + 0.033752𝑥4+ 0.187133𝑥5

𝑥1 = 500 𝑥2 = 5 𝑥3 = 5 𝑥4 =2 𝑥5 = 6

𝑦auction prices = 60.39093 + (0.068209 × 500) - (0.05025 × 5) - (0.23617 × 5) + (0.033752 × 2) +
(0.187133 × 6)
= 94.253632 * 1000 dollars
= $94,253.63
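
The substitution can be verified with a short script. This is only a sketch: the dictionary keys are illustrative names, and the coefficients are copied from the regression line written in Part F.

```python
# Coefficients copied from the regression line in Part F (price in $1000s).
intercept = 60.39093
coefficients = {
    "square_metres": 0.068209,
    "distance_to_schools_km": -0.05025,
    "distance_to_shops_km": -0.23617,
    "bathrooms": 0.033752,
    "bedrooms": 0.187133,
}

# The house described in the question.
house = {
    "square_metres": 500,
    "distance_to_schools_km": 5,
    "distance_to_shops_km": 5,
    "bathrooms": 2,
    "bedrooms": 6,
}

# Prediction = intercept + sum of (coefficient * value) over all variables.
predicted_thousands = intercept + sum(
    coefficients[name] * house[name] for name in coefficients
)
predicted_dollars = predicted_thousands * 1000

print(round(predicted_thousands, 6), round(predicted_dollars, 2))
```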
ii) R Square

Part A: Interpret the reported R square.

Information

● The variation in experience, gender and health explains 48.99998% of the variation in
wages.

Part B: If the variables ‘age’ and ‘productivity’ are added, what will happen to the reported
R square?

● Adding the variables ‘age’ and ‘productivity’ will never lower the R square value and will
almost always raise it, simply because more information is used to compute the regression
model. This happens even if the new variables have no real explanatory power.
