ZC-417 Quantitative Methods Exam Notes

Uploaded by

2022mb21048
Copyright
© © All Rights Reserved

Lesson-1

Types of Data

• Statistics begins with data.
• Data can be Qualitative or Quantitative
• Examples of Qualitative or Categorical Data: Name, Gender, Car Colour, Date of Joining
• Quantitative Data can be of two types:
o Discrete: # of defects, # of Children
o Continuous: Inter-arrival Time, Market Share, PE Ratio

Measurement Scale

Summarizing Qualitative Data

Qualitative Data may be summarized by

 Frequency Distribution
 Relative Frequency Distribution
 Cumulative Frequency Distribution
 Percent Frequency Distribution
 Bar Charts / Graph
 Pie Charts
 All have the same information – A matter of taste
Cross Tabulating the Titanic Data (Titanic Data Contingency Table)

A person is randomly picked and the Class & Survival were recorded

• P((F & A) or (S & A)) = P(F & A) + P(S & A) = 202/2201 + 118/2201 = 320/2201 (mutually exclusive events)
• P(Not (F & A)) = 1 – 202/2201 = 1999/2201 (complementary events)
• P(Person was Crew or Person Survived)
 = P(Person was Crew) + P(Person Survived) – P(Person was Crew and survived)
 = 885/2201 + 710/2201 – 212/2201 = 1383 / 2201

Displaying Quantitative Data

 Frequency Distribution
 Relative Frequency Distribution
 Percentage Relative Frequency Distribution
 Cumulative Frequency Distribution
 Histogram
 Ogive
 Dot Plot
Time Taken – Frequency Distributions

Time Taken – The Ogive

Time Taken – Histograms

Skewness

 Skewness measures the asymmetric nature of the distribution.


o A distribution with the peak towards the right and a longer left tail is skewed left or negatively
skewed
o A distribution with the peak towards the left and a longer right tail is skewed right or positively
skewed
o A Symmetric distribution has Skewness = 0

Skewness from Mean & Median

Suppose Median < Mean

• ⇒ More than 50% of the population is to the left of the mean
• ⇒ The histogram may have a longer right tail, and so may be skewed right
• Median > Mean ⇒ The histogram may be skewed left
• Median = Mean ⇒ We may have a symmetric distribution

Time Taken – Dot Plot


Scatter Plot

• Graphical and tabular presentations summarize the entire data set pictorially
• Business may require:
o A measure that summarizes the data with a single number
o A measure that summarizes the spread of the data with a single number

Statistics & Parameter

If this measure summarizes sample data, it is referred to as a Statistic

If this measure summarizes the population, it is referred to as a Parameter
o The sample mean X̄ is said to be a point estimator of the population mean μ
o The sample standard deviation s is said to be a point estimator of the population standard deviation σ.

Measures of Location I

Measures of Location II
Measures of Dispersion

The Range: Maximum Value – Minimum Value

The Interquartile Range (IQR) : Q3 – Q1

The variance is a measure of variability that includes all the available information:
$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$ (population) and $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$ (sample)

The Standard Deviation: Square Root of the Variance

Mean Vs Median

1. Suppose the company knows it will retain customers if the customer satisfaction index is 6 or above (on
a scale of 1 – 10.) The average index for the current survey is 6. Should the company feel comfortable?

Suppose the scores are: 3 4 4 5 5 5 6 8 10 10
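A minimal sketch of why the mean alone can mislead here, using the ten scores listed above. The variable names are illustrative only; the retention threshold of 6 comes from the question.

```python
from statistics import mean, median

# Customer satisfaction scores from the example above
scores = [3, 4, 4, 5, 5, 5, 6, 8, 10, 10]

avg = mean(scores)        # arithmetic mean of the scores
med = median(scores)      # middle value of the sorted scores
# Number of customers whose score falls below the retention threshold of 6
below_target = sum(1 for s in scores if s < 6)

print(avg, med, below_target)
```

The mean meets the threshold, but the median does not, and six of the ten customers score below 6. A few very high scores pull the mean up, which is why the median is worth checking for skewed data.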

2. The shopkeeper maintains a record of the monthly spends by its regular customers. Would he be interested in the median spend or in the average spend?

Box Plot & Five-Number Summary & Outliers

Kentucky Derby Data: Box Plot & Five-Number Summary & Outliers
Comparing Different-Looking Values

• You have a job offer of Rs.8.2 lakh in Hyderabad and another of Rs.8.6 lakh in Bangalore. The mean and variance for similar jobs in Hyderabad were Rs.7 lakh and 9 (lakh²), while for Bangalore these were Rs.7.4 lakh and 12.25 (lakh²). Which is the better job offer?
• The February "high" temperature averaged 30°C with variance 100 (°C)², while in May these were 40°C and 64 (°C)². When is it more unusual to have a high of 35°C?

The Standard Deviation as a Distance Measure


• How many standard deviations is the data point from the mean? $z = \frac{x - \bar{x}}{s}$
• The standardized value, z, is called the z-score
• The z-score is the statistical distance from the mean.

The Standard Deviation as a Ruler – Example 1

You have a job offer of Rs.8.2 lakh in Hyderabad and another of Rs.8.6 lakh in Bangalore. The mean and variance for similar jobs in Hyderabad were Rs.7 lakh and 9 (lakh²), while for Bangalore these were Rs.7.4 lakh and 12.25 (lakh²). Which is the better job offer?

• Z-Scores (the standard deviation is the square root of the variance: 3 for Hyderabad, 3.5 for Bangalore):
• Zhyd = (8.2 − 7)/3 = 0.4
• Zbgl = (8.6 − 7.4)/3.5 = 0.3429
• Hyderabad: The offer is 0.4 σ's above the mean
• Bangalore: The offer is 0.34 σ's above the mean
• Relative to its local market, the Hyderabad offer is the better one.
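The comparison above can be sketched in a few lines. The helper name `z_score` is an assumption made for illustration; the figures are the ones quoted in the example.

```python
import math

def z_score(x, mean, variance):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / math.sqrt(variance)

# Offers, means and variances (in lakh) from the example
z_hyd = z_score(8.2, 7.0, 9.0)      # Hyderabad
z_blr = z_score(8.6, 7.4, 12.25)    # Bangalore

# The offer with the larger z-score is further above its local market
better = "Hyderabad" if z_hyd > z_blr else "Bangalore"
print(round(z_hyd, 4), round(z_blr, 4), better)
```

The same helper applies to the temperature example: only the absolute size of z matters when asking which observation is more unusual.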

The Standard Deviation as a Ruler – Example 2

The February "high" temperature averaged 30°C with variance 100 (°C)², while in May these were 40°C and 64 (°C)². When is it more unusual to have a high of 35°C?

• Z Scores:
• Zfeb = (35 − 30)/10 = 0.5
• Zmay = (35 − 40)/8 = −5/8 = −0.625
• February: 35°C is 0.5 σ's from the mean
• May: 35°C is 0.625 σ's from the mean (ignoring the negative sign), so a high of 35°C is more unusual in May

Z Score & Outliers

 How does this help us?


 If we assume that anything beyond ±ks is an outlier, then we have another tool to analyse data

Z Score & Outliers Example


Chebyshev’s Theorem

A Symmetric Histogram & Chebyshev’s Theorem


Empirical Rule & Outliers

• Recall the team member with a job offer of Rs.15 lakh in Hyderabad.
• Should you try to retain him or start the separation process?
• Recall also that for his level in Hyderabad, μ = Rs.8 lakh and σ = Rs.2 lakh.
• The z score is (15 − 8)/2 = 3.5
• Suppose the salaries for his profile and level are bell shaped

Lesson -2

Random Experiment: Example

The soft drinks manufacturer has recently introduced a new drink Cocofizz targeted at college students. It
has engaged a market research company to do a survey among young people to find out how the product is
being perceived.

In the pilot survey, the MR team decided to meet 20 students. A random college was selected and a team was stationed at the college canteen. The team asked every 10th student to rate the product on a scale of 1 – 5: 1: Yuck!, 2: Poor, 3: Neutral, 4: Good and 5: Excellent.

Consider any one student surveyed:

• The survey result has well-defined outcomes but cannot be predicted with certainty

• The set of all outcomes: {1, 2, 3, 4, 5}

In the pilot survey, the MR team decided to meet 20 students.

A random college was selected and a team was stationed at the college canteen. The team asked every 10th student whether he / she would recommend the drink to a friend – Yes or No

The survey result has well-defined outcomes but cannot be predicted with certainty

The set of all outcomes?

Random Experiment, Sample Space, Events

• Random Experiment: A process that has well-defined outcomes but these cannot be predicted with
certainty
• Sample Space: The set of all outcomes
• Event: Any subset of the sample space
• For the following processes, identify the sample space and one event:

 Flip a coin

 Roll a die

Assigning Probabilities

• Equally Likely
o Assigning probabilities based on the assumption of equally likely outcomes
• Relative Frequency Method
o Assigning probabilities based on historical data
• Subjective Method
o Assigning probabilities based on judgement

Equally Likely Example

Relative Frequency Method Example I

Venn Diagram

The rental classifieds suggest that 50% of the flats are fully furnished, 20% have 24/7 running water, and
10% have both features. What is the probability that a flat for rent:

A. Is fully furnished or has running water?, B. Neither, C. Is fully furnished but no running water

Experiment: Randomly selecting an ad and identifying the features

P(Furnished) = 0.50; P(Water) = 0.20; P(Furnished & Water) = 0.10.

A. P(Furnished or Water) = P(Furnished) + P(Water) – P(Both) = 0.50 + 0.20 – 0.10 = 0.60

B. P(Not Furnished & No 24/7 Water) = 1 – 0.60 = 0.40

C. P(Furnished & No 24/7 Water) = P(Furnished) – P(Both) = 0.50 – 0.10 = 0.40
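A small sketch of the three Venn-diagram calculations asked for in parts A, B and C, using the probabilities given in the example. Variable names are illustrative assumptions.

```python
# Probabilities read from the rental classifieds example
p_furnished = 0.50
p_water = 0.20
p_both = 0.10

# A: furnished or running water (addition rule)
p_either = p_furnished + p_water - p_both
# B: neither feature (complement of A)
p_neither = 1 - p_either
# C: furnished but no running water (furnished minus the overlap)
p_furnished_only = p_furnished - p_both

print(round(p_either, 2), round(p_neither, 2), round(p_furnished_only, 2))
```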
Subjective Method

• Often managers use their experience and intuition (and the data available) to assign probabilities. The probabilities represent their belief in the likelihood of the events.
• Usually, probability estimates combine the Relative Frequency approach with a subjective estimate.
• Example: The firm will shortly launch a variant of the existing model.
• R&A assigned the following probabilities to the possible market share by year-end: P(5%) = 20%, P(10%) = 55% and P(15%) = 25%.
• VP Marketing modified the numbers as follows: P(5%) = 25%, P(10%) = 40% and P(15%) = 35%.

Basic Requirements for Assigning Probabilities

1. 0 ≤ P(E) ≤ 1 for any outcome E

2. ∑ P(Ei) = 1, where the sum runs over all outcomes Ei in the sample space
Conditional Probability


Titanic Data: Joint & Marginal Probabilities

Titanic Data Conditional Probabilities


Titanic Data Independent Events and Multiplicative Law

• Conditional Probability
o P(A | B) is read as "Probability of A given B"
o P(A | B) = P(A & B) / P(B)
• Independent Events
o P(A | B) = P(A), or equivalently P(A & B) = P(A) * P(B)
• Multiplicative Law
o P(A & B) = P(B) * P(A | B)
• Joint Probability
o Probability of two events both occurring
• Marginal Probability
o Probability distribution of one of the variables, obtained by summing the joint probabilities over the other variable

Prior & Posterior Probabilities – Example

 The firm will shortly launch a variant of the existing soft drink.
 Based on past data and industry reports, the R&A Department has assigned the following probabilities to
the possible market share: P(5%) = 0.2, P(10%) = 0.5 and P(15%) = 0.3.
 Since the soft drink was targeted for the younger generation, a taste test was held at a college campus. 35
of the 50 students who took the test said they like the drink.
 This new fact has to be factored into the earlier estimates to refine the probabilities.
 The earlier probabilities are called Prior probabilities and the latter would be called Posterior probabilities.
 Bayes’ theorem provides the tool for revising the prior probabilities.
 Any manager, dealing with a situation of uncertainty, can assign probabilities to the possible outcomes.
 These probabilities may be a combination of subjective and objective probabilities. The latter being based
on historical data and reports.
 These initial probabilities are termed as prior probabilities.
 Later new information is received – Survey or Product Test or some stray event.
 This new information is factored in to generate the posterior probabilities.
 Bayes’ theorem is the tool for revising the prior probabilities.
• The probability model may be continuously updated as new data flow in!

Titanic:

Bayes’ Theorem

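Suppose the Ei's are ALL the possible outcomes and the prior probabilities P(Ei) have been assigned to them. The event F has occurred. Then

$P(E_i \mid F) = \frac{P(E_i)\,P(F \mid E_i)}{P(E_1)\,P(F \mid E_1) + P(E_2)\,P(F \mid E_2) + \cdots + P(E_n)\,P(F \mid E_n)}$

$= \frac{P(E_i)\,P(F \mid E_i)}{P(E_1 \,\&\, F) + P(E_2 \,\&\, F) + \cdots + P(E_n \,\&\, F)}$

$= \frac{P(E_i)\,P(F \mid E_i)}{P(F)}$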

Required:

1. The Ei’s are mutually exclusive

2. The Ei’s are collectively exhaustive – that is, are all the possible events

3. P(F | Ei)’s are available

The MD’s EA is often late returning from lunch. Based on observation, HR assigned the following
probabilities:

Lunch Location      Probability he is late
Out                 40%
Company Canteen     19%
Cubicle             1%

HR knows that all locations are equally likely.

Today the EA came back late from lunch.

This information has to be factored into the data above!

A Simple Example: Symbolic Representation

 Events:
 E1: Lunch Out, E2: Lunch at the canteen, E3: Lunch in the cubicle
 Assumption: These are all the possibilities for the EA to have lunch
 F: EA came late today
 Prior Probabilities: P(E1) = P(E2) = P(E3) = 1/3
 Conditional Probabilities: P(F | E1) = 0.40, P(F | E2) =0.19, P(F | E3) = 0.01

A Simple Example: The Tabular Approach

• Events:
• A1: Lunch Out, A2: Lunch at the canteen, A3: Lunch in the cubicle
• Assumption: These are all the possibilities for the EA to have lunch
• F: EA came late today
• Prior Probabilities: P(A1) = P(A2) = P(A3) = 1/3
• Conditional Probabilities: P(F|A1) = 0.40, P(F|A2) = 0.19, P(F|A3) = 0.01
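The table itself did not survive in these notes, so the following is a worked reconstruction from the priors and conditional probabilities listed above, not the original slide. It applies Bayes' theorem directly.

```latex
\begin{aligned}
P(F) &= \tfrac{1}{3}(0.40) + \tfrac{1}{3}(0.19) + \tfrac{1}{3}(0.01) = \tfrac{0.60}{3} = 0.20 \\
P(A_1 \mid F) &= \frac{\tfrac{1}{3}(0.40)}{0.20} = \frac{0.40}{0.60} \approx 0.6667 \\
P(A_2 \mid F) &= \frac{\tfrac{1}{3}(0.19)}{0.20} = \frac{0.19}{0.60} \approx 0.3167 \\
P(A_3 \mid F) &= \frac{\tfrac{1}{3}(0.01)}{0.20} = \frac{0.01}{0.60} \approx 0.0167
\end{aligned}
```

Knowing that the EA was late raises the probability that he lunched out from 1/3 to about 2/3, and the three posterior probabilities still sum to 1.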

A Non-Trivial Example From Medicine

 The disease is present in 0.5% of the population. It is a deadly disease and death is almost always
inevitable.
 But there is a test that can detect the disease.
 The True Positive is 99% while the False Positive is 5%.
 That is P(Test is Positive given that you have the disease) = 0.99
 And P(Test is Positive given that you do not have the disease) = 0.05
 Question: If you test positive, should you panic?

• Events:
• E1: You have the disease, E2: You do not have the disease
• F: You have tested positive
• Prior Probabilities: P(E1) = 0.005, P(E2) = 0.995
• Conditional Probabilities: P(F|E1) = 0.99, P(F|E2) = 0.05
• Need to compute P(E1|F)

Bayes’ Theorem: Tabular Approach

Step 1

Identify the mutually exclusive events (E’s) that make up the Sample Space; Identify the Fact F.

Note down the prior probabilities P(E) and the conditional probabilities P(F | E)

Prepare a table with 5 columns and (n+1) rows, where n is the number of events E

Step 2

Enter
Column 1 The events E’s

Column 2 The prior probabilities P(E)’s

Column 3 P(F | E)’s

Step 3

Column 4 Compute the joint probabilities P(E&F) using P(E&F) = P(E) * P(F | E)

Step 4

Column 4 Sum the joint probabilities. The last cell will contain ∑ P(E&F) = P(F)

Step 5

Column 5 Compute the posterior probabilities using P(E | F) = P(E & F) / P(F).
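The five steps above can be sketched as a short function. This is an illustrative sketch only: the function name `bayes_table` and the event labels are assumptions, and the inputs are the figures from the medical example earlier in the notes.

```python
def bayes_table(events, priors, likelihoods):
    """Build the 5-column Bayes table.

    Each row holds: event, prior P(E), P(F|E), joint P(E&F), posterior P(E|F).
    Also returns P(F), the total of the joint-probability column.
    """
    joints = [p * l for p, l in zip(priors, likelihoods)]   # Step 3, column 4
    p_f = sum(joints)                                        # Step 4, column total
    posteriors = [j / p_f for j in joints]                   # Step 5, column 5
    rows = list(zip(events, priors, likelihoods, joints, posteriors))
    return rows, p_f

# Medical example: 0.5% prevalence, 99% true positive, 5% false positive
rows, p_f = bayes_table(
    ["Disease", "No disease"],
    [0.005, 0.995],
    [0.99, 0.05],
)
for event, prior, like, joint, post in rows:
    print(f"{event:<12}{prior:>8.3f}{like:>8.2f}{joint:>10.5f}{post:>10.4f}")
print(f"P(F) = {p_f:.4f}")
```

Under these inputs the posterior probability of having the disease after a positive test comes out at roughly 9%, because the disease is rare and false positives among the healthy majority outnumber true positives.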

Lesson -3

Random Variables

A random variable is a numerical description of the outcome of a random experiment.

A discrete random variable may assume a countable number of values

A continuous random variable may assume any numerical value in an interval

The random variable inherits the probabilities of the events of the random experiment

EXAMPLE

Experiment: Toss a Coin.

If the coin turns up Heads, we win Rs.10; O/w we lose Rs.10

Event   P(Event)   X (Rs.)   P(X)
H       0.5        10        0.5
T       0.5        -10       0.5

Random Variables Examples

Discrete Random Variables

• # of dependents of an employee

• # of customers using the ATM in a day

• # of sixes in a T20 match

• # of owners who like the product

Continuous Random Variables

• Life of a tire

• Time between calls at the call centre

• Volume of water in a 1 litre mineral water bottle

• % of owners who like the product

Probability Function of a Discrete Random Variable


# of Children (X) 0 1 2 3

Probability (P(x)) 0.1 0.4 0.4 0.1

The Probability Function lists the possible outcomes and their probabilities

Note: 0 ≤ P(x) ≤ 1 & ∑P(x) = 1

Like frequency distributions, probability distributions have descriptive measures

Common Probability Functions

Probability Distribution of a Continuous Random Variable

Common Probability Density Functions

Area Under the Curve

Discrete Random Variables Examples

X: # of customers serviced by the ATM in a day

Y: # of defective items in the consignment

Both X and Y take discrete values


Both are random variables

1. Until the day ends, X is unknown

2. Until all the items are tested, Y is unknown

Discrete Probability Distributions

The probability function provides the probability for each value of the random variable.

The required conditions for the probability function are: f(x) ≥ 0 & ∑f(x) = 1

The expected value, or mean, of a random variable is a measure of its central location: $E(X) = \mu = \sum x f(x)$

The variance summarizes the variability in the values of a random variable:

$Var(X) = V(X) = \sigma^2 = \sum (x - \mu)^2 f(x)$

The standard deviation, σ, is defined as the positive square root of the variance.
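As an illustration of these definitions, here is a minimal sketch applied to the "# of Children" probability function given earlier (values 0 to 3 with probabilities 0.1, 0.4, 0.4, 0.1). Variable names are illustrative.

```python
# Probability function of X = number of children, from the earlier table
values = [0, 1, 2, 3]
probs = [0.1, 0.4, 0.4, 0.1]

# Expected value: sum of x * f(x)
mu = sum(x * p for x, p in zip(values, probs))
# Variance: sum of (x - mu)^2 * f(x)
var = sum((x - mu) ** 2 * p for x, p in zip(values, probs))
# Standard deviation: positive square root of the variance
sd = var ** 0.5

print(round(mu, 2), round(var, 2), round(sd, 3))
```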

Example: Constructing the Empirical Distribution

Example: Computing Expected Value (μ) & Variance (σ²)

Example: Computing Expected Value (μ) & Variance (σ²) of -3X


A Bivariate Discrete Probability Distribution

Computing the Joint Distribution of two Independent Random Variables

Computing the Joint Distribution of two Independent Random Variables

Computing Expected Value (μ) & Variance (σ²) of (X+Y)


Mean and Variance of Distributions

E(X ± c) = E(X) ± c and Var(X ± c) = Var(X)

Example: Suppose X managed to convince Ace to reduce the charges on every order by Rs.100.

E(αX) = α E(X) and Var(αX) = α² Var(X)

Example: X gets a 10% discount on all orders. In the above equation α = 0.9

Consider two random variables X and Y:

• E(αX + βY) = α E(X) + β E(Y).

• For example E(X – Y) = E(X) – E(Y)

• If the random variables are independent, V(αX + βY) = α² V(X) + β² V(Y).

• For example V(X – Y) = V(X) + V(Y)
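The rules for X – Y can be checked by exact enumeration. The sketch below is illustrative: the two distributions are assumptions borrowed from earlier examples in these notes (the children table and the coin-toss payoff), and the helper names are hypothetical.

```python
from itertools import product

def mean_var(dist):
    """Mean and variance of a discrete distribution given as {value: probability}."""
    mu = sum(x * p for x, p in dist.items())
    var = sum((x - mu) ** 2 * p for x, p in dist.items())
    return mu, var

def combine(dist_x, dist_y, a, b):
    """Distribution of a*X + b*Y, assuming X and Y are independent."""
    out = {}
    for (x, px), (y, py) in product(dist_x.items(), dist_y.items()):
        key = a * x + b * y
        out[key] = out.get(key, 0.0) + px * py   # independence: joint = product
    return out

X = {0: 0.1, 1: 0.4, 2: 0.4, 3: 0.1}   # assumed: number of children
Y = {10: 0.5, -10: 0.5}                # assumed: coin-toss payoff in Rs.

mx, vx = mean_var(X)
my, vy = mean_var(Y)
md, vd = mean_var(combine(X, Y, 1, -1))   # distribution of X - Y

print(md, mx - my)   # means subtract
print(vd, vx + vy)   # variances add, even for a difference
```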

Discrete Uniform Probability Distribution

You are about to launch a new product. The product was test marketed, but preference for body colour
was not.

There are six colours: Violet (1), Blue (2), Green (3), Yellow (4), Orange (5) and Red (6). Initially it must be
assumed that each body colour is equally preferred.

The probability function f(x) = 1/6

Poisson Probability Distribution

A Poisson distributed random variable is used in estimating the number of occurrences in a specified
interval of time or space

Application

Sizing operations at a bank, call centre, service centre, petrol bunk, …

Examples

# of vehicles arriving at a toll booth in one hour

# of patients arriving in an emergency room between 11 and 12 pm


# of typos in a page

Requirements

Events occur independently.

Two events cannot occur at exactly the same instant.

The probability of an event in an interval is proportional to the length of the interval

Poisson Probability Function

I: The specified interval

X = the number of occurrences in an interval

f(x) = the probability of x occurrences in an interval

μ = mean number of occurrences in an interval

X ~ Π(μ)

$P(X = x) = \frac{\mu^x e^{-\mu}}{x!}$, for x = 0, 1, 2, …

μ = E(X) = V(X)

Poisson Probability Distribution

Employees visit the ATM at the average rate of 6 per hour in the post-lunch period. What is the probability of 2 arrivals in 30 minutes in the post-lunch period?

What is the expected # of arrivals? Variance?
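A minimal sketch of the ATM question, assuming the Poisson model applies. The rate is first rescaled to the 30-minute interval, so μ = 6 × 0.5 = 3. The helper `poisson_pmf` is an illustrative name.

```python
import math

def poisson_pmf(x, mu):
    """P(X = x) for a Poisson random variable with mean mu."""
    return mu ** x * math.exp(-mu) / math.factorial(x)

rate_per_hour = 6
mu = rate_per_hour * 0.5      # mean arrivals in a 30-minute interval
p2 = poisson_pmf(2, mu)       # probability of exactly 2 arrivals

expected = mu                 # E(X) = mu for a Poisson variable
variance = mu                 # V(X) = mu as well

print(round(p2, 4), expected, variance)
```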

Binomial Probability Distribution B(n, p)

The random variable X counts the number of successes in n trials

The typical example: Tossing a Coin & we are interested in the number of Heads

Four Properties of a Binomial Experiment

1. The experiment consists of a sequence of n identical trials.

2. Two outcomes, success and failure, are possible on each trial.

3. The probability of a success, denoted by p, does not change from trial to trial.

4. The trials are independent.

B(n, p) – Example 1

Indica sells encyclopedias targeted towards children using door-to-door saleswomen. Ms Rita, a saleswoman with Indica, has randomly selected 20 houses in her neighborhood to sell the product. From past experience, Rita knows that the probability that a sale will be made is 0.1.

# of trials, n, is 20
1. Trials are identical in the sense that each trial is a household

2. Two outcomes: Sale or No Sale

3. The probability may change. Initially p, the probability of success, may be 0.1. But as the day progresses, Rita may get tired and the success rate may decrease

4. Trials are independent – Since the households were randomly selected

B(n, p) – Example 2

A 1000-strong IT firm is concerned about a low retention rate for its employees. In recent years, management
has seen a turnover of 10% of the employees annually. HR takes a random sample of 5 employees and meets
each one separately to understand their concerns and also whether they are planning to leave.

1. # of trials, n, is 5; identical in the sense that each trial is an interview with an employee

2. Two outcomes: Resign or No plans of resigning

3. Trials are independent – Since the employees were randomly selected

4. The probability will change.

B(n, p) – Rule of Thumb

The supplier claims that the defective rate is 1%. We test the consignment of 1000 items by sampling 10
items and classifying each as Defective or Not Defective

Note:
Since this is sampling without replacement, 'p' changes as we sample.
If N >> n, p does not change by much and the binomial model is a reasonable approximation.
Rule of Thumb: n/N < 5%
B(n, p) – Probability Function, Mean and Variance

• X ~ B(n, p)
• p: Probability of Success
• q: Probability of Failure, q = 1 − p
• $f(x) = P(X = x) = \binom{n}{x} p^x q^{n-x} = \frac{n!}{x!\,(n-x)!}\, p^x q^{n-x}$
• μ = np
• σ² = npq
• σ = √(npq)
B(3, 0.1) - Example
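The worked figures for this example are missing from the notes, so the following is a sketch of how the B(3, 0.1) probabilities could be computed from the binomial probability function. The helper name `binom_pmf` is an assumption.

```python
from math import comb, sqrt

def binom_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p): C(n, x) * p^x * q^(n - x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 3, 0.1
pmf = [binom_pmf(x, n, p) for x in range(n + 1)]   # P(X = 0), ..., P(X = 3)
mean = n * p                                       # mu = np
sd = sqrt(n * p * (1 - p))                         # sigma = sqrt(npq)

for x, prob in enumerate(pmf):
    print(x, round(prob, 3))
print(round(mean, 2), round(sd, 4))
```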

B(10, 0.20) - Example


A Digression

Binomial Distribution, B(n, p)

• P(X = x)

• The Mean and the Standard Deviation

• n << N may be required when modelling with B(n, p)

Continuous Random Variables

Common Probability Density Functions

Probability = Area Under the Curve

Uniform Probability Distribution – U(a, b)


U(5, 15) – Example

Exponential Probability Distribution – exp(μ)

• exp(μ) is useful in modeling

• Time between vehicle arrivals at a toll booth

• Time required to complete a questionnaire

• Distance between major defects in a highway

• Exponential Distribution and the Poisson Distribution are related

Average time between vehicle arrivals is μ = 5 min = 1/12 h

Average number of vehicles arriving in 1 hour is 12

Probability Density Function: $f(x) = \frac{1}{\mu} e^{-x/\mu}$, for x ≥ 0

exp(μ) - Properties

• μ is the average waiting time
• The mean and standard deviation are equal.
• The exponential distribution is skewed to the right.
• $P(X < x) = 1 - e^{-x/\mu}$
• The distribution is memoryless!

Exp(3 Minutes): At the Petrol Bunk


Exponential & Poisson Distributions

Suppose the rate at which cars cross the toll booth is 10 cars/h, and the arrival process can be described by
a Poisson Distribution. Write down the Poisson & Exponential distributions that describe the process.

Suppose calls on your cell phone follow an exponential distribution with the average time between calls being 10 min. What are the Poisson & Exponential distributions that describe the process? (For the Poisson distribution, take the time period to be 1 h.) Find the probability that there will be no calls in the next 1 hour.
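One way to approach the cell-phone question is sketched below. An average gap of 10 minutes corresponds to an exponential distribution with μ = 10 min and to a Poisson distribution with mean 60/10 = 6 calls per hour. The two routes to "no calls in the next hour" should agree; variable names are illustrative.

```python
import math

mean_gap_min = 10                       # average minutes between calls
rate_per_hour = 60 / mean_gap_min       # Poisson mean for a 1-hour interval

# Route 1: Poisson, probability of zero calls in one hour, P(X = 0) = e^(-mu)
p_no_calls_poisson = math.exp(-rate_per_hour)

# Route 2: exponential, probability the wait exceeds 60 minutes,
# P(T > t) = 1 - P(T < t) = e^(-t / mu)
p_gap_over_hour = math.exp(-60 / mean_gap_min)

print(rate_per_hour, p_no_calls_poisson, p_gap_over_hour)
```

By the same reasoning, the toll-booth process with 10 cars/h pairs a Poisson distribution with mean 10 per hour with an exponential distribution whose mean gap is 6 minutes.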

Normal Probability Distribution N(μ, σ)

It is widely used in statistical inference.

It has been used in a wide variety of applications including:

Heights of people Rainfall amounts Test scores

We will use this extensively while describing

Distribution of sample mean

Distribution of a sample proportion

N(μ, σ) - Properties

N(μ, σ) – Examples
Normal Probability Distribution

Standard Normal Table


Standard Normal Table

Transforming N(μ, σ) to N(0, 1)


If X ~ N(μ, σ), then $z = \frac{x - \mu}{\sigma}$ and Z ~ N(0, 1).
z is the number of standard deviations x is from μ.

N(4, 2) – Example

N(4, 1) – Example

N(50, 10) – Example


N(50, 10) – Example

Session -4

Purpose of Sampling

The Management has some questions regarding the population.

• Estimation

What is μ / p / σ?

• Hypothesis Testing

Is the process meeting the standards?

Is the actual μ / p / σ way off the standards?

We sample to answer such questions

• From the sample we compute the statistic X̄ / p̄ / s

• The Statistic is a Point Estimate of the Parameter

• For Hypothesis Testing, we examine how “far” the statistic is from the standard

Proper sampling can provide “good” estimates of the population characteristic

Why Not Census

 Destructive Aspect: Life of a battery


 Cost Aspect: Rural Household Surveys: Income / Expenditure / Indebtedness
 Time Aspect: Manager wants a customer feedback survey (by EOD)
 Accuracy Aspect: Customer feedback
 Infinite Population

Sampling Methods

• Probabilistic Sampling
o Simple Random Sampling (SRS)
o Stratified Random Sampling
o Cluster Sampling
o Systematic Sampling
• Non-Probabilistic Sampling
o Convenience Sampling
o Judgment Sampling
Simple Random Sample (SRS)

HR received 900 applications for the advertised job. The applicants were numbered, from 1 to 900, as their
applications arrived. Director HR wanted a simple random sample of 30 applicants to understand the
profiles of the applicants.

A simple random sample of size n from a finite population of size N is a sample selected such that each
possible sample of size n has the same probability of being selected

Conducting an SRS

• Create a numbered data frame

• Use a software package to generate ‘n’ random numbers

• Select the corresponding items from the data frame
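The three steps above can be sketched with Python's standard library for the HR example (900 numbered applicants, sample of 30). The function name and the seed are assumptions chosen for illustration; a seed is used only so the draw can be reproduced.

```python
import random

def simple_random_sample(population_size, n, seed=None):
    """Draw n distinct applicant numbers from 1..population_size.

    random.sample draws without replacement, so every subset of size n
    is equally likely to be chosen.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, population_size + 1), n))

# 900 numbered applications, sample of 30 (seed is an arbitrary choice)
chosen = simple_random_sample(900, 30, seed=42)
print(chosen)
```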

Stratified Sampling

HR wants to introduce a yoga program in the organization. She wants to interview a sample of 40
employees to understand how it will be received and what employees are looking for..

An SRS may, by chance, generate a sample made up mostly of top level managers.

If HR believes that the staff comprises Top, Middle and Lower Level Managers, stratified sampling is
required

Conducting Stratified Sampling

• The population is divided into homogeneous groups (strata)

• A simple random sample is taken from each stratum.

(Formulas are available for combining the stratum sample data into one population parameter estimate)

Cluster Sampling

Once every quarter the large nation-wide fast foods chain conducts a quality check on its restaurants. 30
randomly selected restaurants are audited

An SRS may select 30 restaurants in 30 different cities!

Conducting Cluster Sampling

The population is composed of clusters, and each cluster is a representative of the population on a small
scale.

• A simple random sample of the clusters is then taken.

• All elements within each sampled (chosen) cluster form the sample OR a sample from each cluster
is chosen.

Systematic Sampling

Examples

1. Interview every 10th customer

2. Test a sample every 2 hours

This method has the properties of a simple random sample, especially if the list of the population elements
is a random ordering.

The sample usually will be easy to identify


Convenience Sampling

• It is a nonprobability sampling technique.


• The sample is identified primarily by convenience.
• Example: A student may ask his classmates to constitute a sample.

Judgment Sampling

The person most knowledgeable on the subject of the study selects elements of the population that he or
she feels are most representative of the population.

It is a nonprobability sampling technique.

A reporter might ask three or four senior MPs regarding an issue

Point Estimation

• The sample data is used to compute a value of a sample statistic that serves as an estimate of a
population parameter.
• X̄ is the point estimator of the population mean μ.
• s is the point estimator of the population standard deviation σ.
• s² is the point estimator of the population variance σ².
• p̄ is the point estimator of the population proportion p.

Point Estimation: A Trivial Example

A sample of 5 weeks of call data was collected. Develop a point estimate for μ and σ. If success is getting over 90 calls, estimate the percentage of successful weeks.

Estimators As Random Variables

• A sample is selected and X̄ / p̄ / s is computed
• If another sample is taken, the value of X̄ / p̄ / s would very probably be different
• Therefore X̄ / p̄ / s is a random variable
• And it has
• Measures of Central Tendencies – for example Mean
• Measures of Dispersion – for example, Standard Deviation
• A shape – for example, bell shaped
Mean of X̄ / p̄ / s² / s

• E(X̄) = μ
• E(p̄) = p
• E(s²) = σ²
• Interesting Point to Note
o E(s) ≠ σ!!
o E(s) < σ

Variance of X̄ / p̄

Recall our assumption while sampling: The population is infinite or n / N < 5%

Standard Error of the Mean:

Std Dev of X̄: $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$

Standard Error of the Proportion:

Std Dev of p̄: $\sigma_{\bar{p}} = \sqrt{\frac{pq}{n}}$

Unbiased Estimators

The expected value of the sample statistic is equal to the population parameter being estimated

X̄, p̄ and s² are unbiased estimators:

E(X̄) = μ
E(p̄) = p
E(s²) = σ²

s is not an unbiased estimator of σ: E(s) < σ

• Point Estimators are random variables


• The mean and standard deviation of the sample mean
• The mean and standard deviation of the sample proportion
• Unbiased estimators

Sampling Distribution

• A sampling distribution is the distribution of a statistic that would be produced in repeated random
sampling from the same population.
• For example:
o We collect a sample, record the mean and then discard the sample
o Collect another sample, record the mean and then discard the sample, Do this again and again,
ad nauseam!
• We can then create the histogram and subsequently the distribution

Example 1

• The Central Statistics Office (CSO) estimates the “Per Capita Income” in 2016-17 will be Rs.100,000.
• We will take the following steps ad nauseam:
• Take a random sample of 1000 Indians
• Compute the Average Income
• Plot the value in a Dot Plot
Dot Plot of Sample Mean – 1st Sample

Dot Plot of Sample Mean – 2nd Sample

Dot Plot of Sample Mean – 3rd Sample

Dot Plot of Sample Mean – 4th Sample

Dot Plot of Sample Mean – kth Iteration: Bell Shaped

Sampling Distribution of Per Capita Mean

How to create a Sampling Distribution of X

Sampling Distribution of Proportion

• The population is Binary


o Male / Female
o Yes / No
o Defective Item / Non-Defective Item
• A sampling distribution of the proportion is the distribution of the sample proportion p̄, a statistic that would be produced in repeated random sampling from the same population.
• That is, we collect a sample, record the number of successes, compute the proportion of successes, discard the sample, collect another sample, compute its proportion and discard it, and do this again and again, ad nauseam!
• We can then create the histogram and subsequently the distribution
Example 2: The Binary Random Variable X

Acme Soaps has captured 30% of the bathing soap market


A person is picked at random: X = 0 if he or she does not use Acme, and X = 1 if he or she does use Acme
P(X = 0) = 0.7 and P(X = 1) = 0.3

Example 2: The Distribution of X

P(X = 0) = 0.7 and P(X = 1) = 0.3

Example 2: Constructing the Distribution of X̄ = p̄

Sample Size: n

1. Take a sample of size n

2. Ask each person whether he / she uses Acme

3. Compute ∑Xi

4. Compute X̄ = (∑Xi) / n

5. Repeat Steps 1 – 4 many times

Develop the relative frequency histogram
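The procedure above can be imitated by simulation. The sketch below is illustrative only: it assumes p = 0.3 as in the Acme example, picks n = 20 and 5000 repetitions arbitrarily, and uses a fixed seed so the run can be repeated. The function name is hypothetical.

```python
import random
from statistics import mean, pstdev

def simulate_p_bar(p, n, repetitions, seed=0):
    """Repeat steps 1 to 4: draw n people, count Acme users, record the proportion."""
    rng = random.Random(seed)
    p_bars = []
    for _ in range(repetitions):
        # Each person uses Acme with probability p (X = 1), otherwise X = 0
        successes = sum(1 for _ in range(n) if rng.random() < p)
        p_bars.append(successes / n)
    return p_bars

p_bars = simulate_p_bar(p=0.3, n=20, repetitions=5000)

# The simulated proportions should centre near p, with spread near sqrt(pq/n)
print(round(mean(p_bars), 3), round(pstdev(p_bars), 3))
print(round((0.3 * 0.7 / 20) ** 0.5, 3))
```

Tallying `p_bars` into a relative frequency histogram for n = 5, 10, 20 and 100 would reproduce the progression toward a bell shape that the following slides describe.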

Example 2: Distribution of X̄ = p̄ with n = 5

Example 2: Distribution of X̄ = p̄ with n = 10


Example 2: Distribution of X̄ = p̄ with n = 20

Example 2: Distribution of X̄ = p̄ with n = 100

Example 2: Distribution of X̄ = p̄ from n = 1 to n = 100

Appendix: Mean and Variance of X ~ B(1, 0.3)

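The Estimators: X̄ and p̄

We know:

E(X̄) = μ & $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$

E(p̄) = p & $\sigma_{\bar{p}} = \sqrt{\frac{pq}{n}}$

What is missing is the shape of these random variables

Distribution of X̄ When X is Normal

X ~ N(μ, σ) ⇒ $\bar{X} \sim N(\mu, \sigma/\sqrt{n})$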

The Central Limit Theorem

Regardless of the distribution of X

The sample mean 𝑋 is approximately normally distributed for large sample sizes

𝑋 ~ N(μ, )

Rule of Thumb: If n ≥ 30, 𝑋 ~ N(μ, )


Central Limit Theorem & p̄

If np ≥ 5 and nq ≥ 5

p̄ ~ N(p, √(pq/n))

Takeaways

The Central Limit Theorem

If n ≥ 30, X̄ ~ N(μ, σ/√n)

If X ~ N(μ, σ), X̄ ~ N(μ, σ/√n)

If np ≥ 5 and nq ≥ 5, p̄ ~ N(p, √(pq/n))

Sampling Distribution of X̄

X̄ ~ N(μ, σ/√n)

Bell-Shaped curve: If n ≥ 30 or X is Normal

Mean μ: Always (X̄ is an unbiased estimator)

Standard Deviation σ/√n: Infinite Population or n/N < 5%


Sampling Distribution of X̄: Example

The Central Bank conducted a survey of bank accounts of small farmers in a certain province and found that the
average balance in an account is Rs.1400 with σ = Rs.84. A sample of 36 accounts was taken. What is the
sampling distribution of the sample mean?

What is the probability that a simple random sample of 36 accounts will provide an estimate of the
population mean that is within ±Rs.10 of the actual population mean μ?

In other words, what is the probability that X̄ will be between 1390 and 1410?
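
One way to work this example is sketched below with Python's standard library. Since n = 36 ≥ 30, the sketch assumes X̄ ~ N(1400, 84/√36) by the Central Limit Theorem; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 1400, 84, 36

# Standard error of the mean: 84 / 6 = 14.
std_error = sigma / sqrt(n)

# Sampling distribution of the sample mean under the CLT.
x_bar_dist = NormalDist(mu, std_error)

# P(1390 <= X-bar <= 1410), i.e. within Rs.10 of the population mean.
prob_within_10 = x_bar_dist.cdf(1410) - x_bar_dist.cdf(1390)

print(f"Standard error: {std_error:.2f}")
print(f"P(1390 <= X-bar <= 1410) = {prob_within_10:.4f}")
```

The interval 1390 to 1410 corresponds to z = ±10/14 ≈ ±0.71, so the probability comes out a little above one half.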

Sampling Distribution of p̄

p̄ ~ N(p, √(pq/n))

Bell-Shaped curve: If np ≥ 5 and nq ≥ 5

Mean p: Always (p̄ is an unbiased estimator)

Standard Deviation √(pq/n): Infinite Population or n/N < 5%

Sampling Distribution of 𝒑

The Sampling Distribution of X̄

• We know:
• If n is large enough (n ≥ 30), X̄ ~ N(μ, σ/√n) (approximately)

• If X ~ N(μ, σ), X̄ ~ N(μ, σ/√n)

• What happens if σ is unknown?
• If X is Normal, (X̄ − μ)/(s/√n) ~ t distribution with (n − 1) Degrees of Freedom

The t Distribution with DoF = n

Comparison of t Distributions and N(0, 1)


Reading the t Distribution Tables

Constructing the Chi-Square Distribution χ2

• Suppose the population is normally distributed: X ~ N(μ, σ)

Repeatedly do the following

• Take a sample of size n

• Compute (n − 1)s² = ∑(Xi − X̄)²

• Compute (n − 1)s²/σ²

The sampling distribution of (n − 1)s²/σ² has a χ² distribution with (n − 1) DoF

χ² distribution: (n − 1)s²/σ²
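
The construction above can be checked by simulation. The sketch below assumes a standard normal population and n = 6 (so DoF = 5); the function name, repetitions and seed are illustrative. It relies on the known facts that a χ² distribution with k DoF has mean k and variance 2k.

```python
import random


def simulate_scaled_variance(n, mu=0.0, sigma=1.0, repetitions=20_000, seed=7):
    """Simulate (n - 1) * s^2 / sigma^2 for repeated normal samples of size n."""
    rng = random.Random(seed)
    values = []
    for _ in range(repetitions):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = sum(sample) / n
        # (n - 1) * s^2 is the sum of squared deviations from the sample mean.
        sum_sq = sum((x - x_bar) ** 2 for x in sample)
        values.append(sum_sq / sigma ** 2)
    return values


n = 6  # degrees of freedom = n - 1 = 5
stats = simulate_scaled_variance(n)
sim_mean = sum(stats) / len(stats)
sim_var = sum((v - sim_mean) ** 2 for v in stats) / (len(stats) - 1)

print(f"Simulated mean     : {sim_mean:.3f} (chi-square with 5 DoF has mean 5)")
print(f"Simulated variance : {sim_var:.3f} (chi-square with 5 DoF has variance 10)")
```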

Reading the χ2 Table with DoF = 5


Reading the χ2 Table: Examples

• The rationale for the χ² Distribution
• The sampling distribution of (n − 1)s²/σ²
• How to read χ² Tables

Inferences About Two Population Variances

• We may want to compare the variances in:
o AHT from two different call centers
o Temperatures for two heating devices
o Assembly times for two assembly methods
• We collect data from two independent random samples, one from population 1 and another from
population 2.
• Inferences regarding the two population variances will be based on the ratio of the two sample variances

The F-Distribution

• Assume we repeatedly select a random sample of size n1 from one normal population and another random
sample of size n2 from another normal population.
• And each time, we compute s1²/s2²
• If we do this ad nauseam, we would arrive at the distribution of the ratio of two variances: F = s1²/s2².
• The distribution formed in this manner approximates an F distribution with the following degrees of
freedom:
o v1 = n1 - 1 and v2 = n2 - 1

Reading the F Tables

Reading the F Tables


The rationale for the F Distribution

The sampling distribution of s1²/s2²

How to read the F Tables

Lesson -5

Introduction to Interval Estimation

Setting the Context – 1

Setting the Context - 2

Point Estimate & Interval Estimate

In each case, every sample will generate a point estimate.


• Ozone: The point estimate may or may not equal the population mean (the average for that production
run)

• PollStar: 220/500 = 44% need not equal the percentage of votes polled at election time

That’s because the sample is not the population

An interval estimate qualifies the point estimates

Mineral Water Company Sample Data

PollStar Sample Data

PollStar found that 220 of the 500 voters contacted favored their client.

That is, p̄ = 44%

Other Information Available:

1. The sample size

2. The sampling distribution of p̄

3. The standard deviation of p̄

Interval estimate incorporates

1. The Statistic (X̄ or p̄)

2. The sample size

3. The distribution of X̄ or p̄

4. The standard deviation of X̄ or p̄

Thereby providing more information on the statistic – the point estimate

Ozone Mineral Water Company

Ozone sells mineral water in 12-liter bottles.

The volume of water dispensed is a random variable.

But if the daily calibration has been proper, the population mean would be 12 liters.

Also, the population standard deviation is 0.6 liters.

At the start of production, a random sample of 36 bottles is selected and the sample mean is computed.

Sampling Distribution of X̄

We have the following:

1. μ = 12 liters

2. σ = 0.6 liters

3. n = 36

By the Central Limit Theorem, X̄ ~ N(μ, σ/√n)

That is, X̄ ~ N(12, 0.6/√36) ~ N(12, 0.1)

X̄ ~ N(12, 0.1)

Brief Introduction to the Concept

Suppose the confidence level is 90%, that is, α = 10%.

Let’s Experiment

Let us sample repeatedly

1. Take a random sample of 36 bottles

2. Compute the sample mean

3. Construct the interval centred at X̄ with length 2 * 0.1645 (0.1645 = 1.645 * 0.1, the z value for 90%
confidence times the standard error)

4. Check whether the interval contains μ

X̄ = 12.2

X̄ = 11.8

X̄ = 12.1

Experiment’s Conclusion

90% Confidence Interval (X̄ - 0.1645, X̄ + 0.1645)
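
The experiment can be imitated in code. The sketch below assumes, purely for illustration, that the volume dispensed is normally distributed with μ = 12 and σ = 0.6 (the notes only say it is a random variable); the function name, repetitions and seed are also illustrative. It counts how often the interval X̄ ± 0.1645 contains μ.

```python
import random
from math import sqrt

MU, SIGMA, N = 12.0, 0.6, 36
# Half-width of the 90% interval: 1.645 * 0.6 / sqrt(36) = 0.1645.
HALF_WIDTH = 1.645 * SIGMA / sqrt(N)


def coverage(repetitions=10_000, seed=1):
    """Fraction of simulated intervals that contain the true mean MU."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        # Steps 1-2: take a sample of 36 bottles and compute the sample mean.
        sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
        x_bar = sum(sample) / N
        # Steps 3-4: build the interval and check whether it contains MU.
        if x_bar - HALF_WIDTH <= MU <= x_bar + HALF_WIDTH:
            hits += 1
    return hits / repetitions


coverage_rate = coverage()
print(f"Half-width: {HALF_WIDTH:.4f}")
print(f"Share of intervals containing mu: {coverage_rate:.3f}")
```

The share should settle close to 0.90, which is what the confidence level means.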

Confidence Interval & Confidence level

A confidence interval is a range of values that is likely to contain the (unknown) parameter.

If random samples are drawn repeatedly and, for each sample, the confidence interval is constructed, a certain
percentage of the confidence intervals will contain the population mean. This percentage is the confidence
level.

Mechanics of Interval Estimation

Confidence Interval or Interval Estimate

• Recall X̄ & p̄ are the sample mean and sample proportion respectively.
• X̄ & p̄ cannot be expected to provide the exact value of μ or p.
• An interval estimate is computed using
o The Point Estimate, say X̄ or p̄
o The Sampling Distribution of X̄ or p̄
o The Standard Error of X̄ or p̄: namely σ_X̄ = σ/√n or σ_p̄ = √(pq/n)
o The prescribed Confidence Level

Margin of Error and the Interval Estimate I

An interval estimate can be computed by adding and subtracting a margin of error to the point estimate.

(X̄ - Margin of Error, X̄ + Margin of Error)

(p̄ - Margin of Error, p̄ + Margin of Error)

Sampling Error is synonymous with Margin of Error

Margin of Error and the Interval Estimate II

The Margin of Error is computed as follows:

X̄ & σ is known (& n ≥ 30 or X is Normal)

z_{α/2} · σ/√n, where z_{α/2} is the z value providing an area of α/2 in the upper tail of N(0, 1)

X̄ & σ is unknown but X is Normal

t_{α/2} · s/√n, where t_{α/2} is the t value providing an area of α/2 in the upper tail of the t distribution
with (n − 1) DoF

p̄ with np & nq ≥ 5

z_{α/2} · √(p̄q̄/n), where z_{α/2} is the z value providing an area of α/2 in the upper tail of N(0, 1)

Margin of Error and the Interval Estimate III

The Interval Estimate is computed as follows:

Given α:

X̄ & σ is known (& n ≥ 30 or X is Normal): (X̄ - z_{α/2} · σ/√n, X̄ + z_{α/2} · σ/√n)

X̄ & σ is unknown but X is Normal: (X̄ - t_{α/2} · s/√n, X̄ + t_{α/2} · s/√n)

p̄ with np & nq ≥ 5: (p̄ - z_{α/2} · √(p̄q̄/n), p̄ + z_{α/2} · √(p̄q̄/n))

The formulas to construct Confidence Intervals for μ and p

Computing Interval Estimates

Formulas

Given a confidence level (1 – α), the Confidence Interval is computed as follows:

X̄ & σ is known (& n ≥ 30 or X is Normal): (X̄ - z_{α/2} · σ/√n, X̄ + z_{α/2} · σ/√n)

X̄ & σ is unknown but X is Normal: (X̄ - t_{α/2} · s/√n, X̄ + t_{α/2} · s/√n)

p̄ with np & nq ≥ 5: (p̄ - z_{α/2} · √(p̄q̄/n), p̄ + z_{α/2} · √(p̄q̄/n))

Margin of Error & Sample Size

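X̄ & σ is known (n ≥ 30 or X is Normal): E = z_{α/2} · σ/√n ⇒ n = (z_{α/2})² σ² / E²

p̄ with np* & nq* ≥ 5: E = z_{α/2} · √(p*q*/n) ⇒ n = (z_{α/2})² p* q* / E²

Here p* is a planning value for p, and q* = 1 − p*.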
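
A minimal sketch of these two sample-size formulas follows, using the figures from Exercises 1.1 and 3.1 in these notes. The function names are illustrative, and the result is rounded up because a sample size must be a whole number that still meets the required margin of error.

```python
from math import ceil


def sample_size_for_mean(z, sigma, margin):
    """Smallest n with z * sigma / sqrt(n) <= margin."""
    return ceil((z * sigma / margin) ** 2)


def sample_size_for_proportion(z, p_star, margin):
    """Smallest n with z * sqrt(p* q* / n) <= margin, where q* = 1 - p*."""
    return ceil(z ** 2 * p_star * (1 - p_star) / margin ** 2)


# Ozone: sigma = 0.6, 95% confidence (z = 1.96), required E = 0.15 liters.
n_mean = sample_size_for_mean(z=1.96, sigma=0.6, margin=0.15)

# PollStar: planning value p* = 0.44, 99% confidence (z = 2.576), required E = 0.03.
n_prop = sample_size_for_proportion(z=2.576, p_star=0.44, margin=0.03)

print(n_mean, n_prop)
```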

Exercise 1: Confidence Interval of μ (σ Known)

Ozone sells 12 liter bottles. 36 bottles were sampled and the mean volume of water was found to be
12.17 liters. The population standard deviation is believed to be 0.6 liters. Compute the 95% Confidence
Interval of μ.

1. Point Estimate of μ: 12.17 liters

2. Std Error of the Mean: 0.6/√36 = 0.1

3. Since sample size ≥ 30, the Sampling Distribution of X̄ is: (X̄ − μ)/0.1 ~ N(0, 1)

4. Significance Level: 5%

5. Critical Value: 1.96

6. Margin of Error: 1.96*0.1 = 0.196

7. 95% CI: (12.17 – 0.196, 12.17 + 0.196) = (11.974, 12.366)
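
The seven steps can be reproduced with the standard library, as sketched below. The function name is illustrative; the critical value comes from `NormalDist().inv_cdf` rather than a printed table, so it is 1.95996 rather than 1.96.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(x_bar, sigma, n, confidence):
    """Confidence interval for the mean when sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # upper-tail area alpha / 2
    margin = z * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin


low, high = z_interval(x_bar=12.17, sigma=0.6, n=36, confidence=0.95)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```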


Exercise 1.1: Determining Sample Size

• Recall the previous exercise:
• σ: 0.6; 95% Confidence Interval of μ: (11.974 liters, 12.366 liters). E: 0.196 liters.
• Suppose we wanted the Margin of Error to be 0.15 liters.

Since σ is known and n ≥ 30, the 95% Confidence Interval is given by

(X̄ - z_{α/2} · σ/√n, X̄ + z_{α/2} · σ/√n)

(X̄ - 1.96 · 0.6/√n, X̄ + 1.96 · 0.6/√n)

Required Margin of Error is 0.15 liters ⇒ 1.96 · 0.6/√n = 0.15

That is, n = 1.96² * 0.6² / 0.15² = 61.47, rounded up to 62

Exercise 2: Confidence Interval of μ (σ Unknown)

Ozone sells 12 liter bottles. 36 bottles were sampled and the mean volume of water was found to be
12.17 liters. The sample standard deviation was 0.6 liters. Compute the 95% Confidence Interval of μ.

1. Point Estimate of μ: 12.17 liters

2. Estimate of Std Error of the Mean: 0.6/√36 = 0.1

3. Assuming X is Normally distributed, the Sampling Distribution of X̄ is: (X̄ − μ)/0.1 ~ t distribution with 35 DoF

4. Significance Level: 5%

5. Critical Value: 2.0301

6. Margin of Error: 2.0301*0.1 = 0.20301

7. 95% CI: (12.17 – 0.20301, 12.17 + 0.20301) = (11.9670, 12.3730)

Exercise 3: Confidence Interval of p

In the current by-poll, PollStar’s client wanted a 99% confidence interval for the proportion of voters that
support the client.

PollStar sampled 500 voters and found that 220 would vote for their client.

1. Point Estimate of p: p̄ = 220/500 = 0.44

2. Estimate for Std Error of p̄: √(p̄q̄/n) = √(0.44 * 0.56 / 500) = 0.0222

3. Since np̄ = 500*.44 = 220 ≥ 5 & nq̄ = 500*.56 = 280 ≥ 5, the Sampling Distribution of p̄ is:
(p̄ − p)/0.0222 ~ N(0, 1)

4. Significance Level: 1%

5. Critical Value: 2.576

6. Margin of Error: 2.576*.0222 = 0.0572

7. 99% CI: (38.28%, 49.72%)
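
A short sketch of the same calculation in code, with an illustrative function name. It uses the sample proportion to estimate the standard error, as in step 2 above.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(successes, n, confidence):
    """Large-sample confidence interval for a population proportion."""
    p_bar = successes / n
    std_error = sqrt(p_bar * (1 - p_bar) / n)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * std_error
    return p_bar, margin, (p_bar - margin, p_bar + margin)


p_bar, margin, (low, high) = proportion_interval(successes=220, n=500, confidence=0.99)
print(f"p-bar = {p_bar:.2f}, margin of error = {margin:.4f}")
print(f"99% CI: ({low:.2%}, {high:.2%})")
```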

Exercise 3.1: Determining Sample Size

PollStar’s client was unhappy since the margin of error was 5.72%, while they wanted the sampling error to
be 3%.

The 99% Confidence Interval is given by:

(p̄ - z_{α/2} · √(p̄q̄/n), p̄ + z_{α/2} · √(p̄q̄/n)) or (p̄ - 2.576 · √(0.44 * 0.56 / n), p̄ + 2.576 · √(0.44 * 0.56 / n))

Required Margin of Error is 0.03 ⇒ 2.576 · √(0.44 * 0.56 / n) = 0.03

⇒ n = 2.576² * 0.44 * 0.56 / 0.03² = 1816.7, rounded up to 1817.

Lesson -6:

Introduction to Hypothesis Testing

Setting the Context: Ozone Inc

Ozone sells 12-liter mineral water bottles

Questions that can be raised: Is μ 12 litres?

This could be asked by

1. The CFO, who is concerned with cost cutting

2. The QC Head, who demands that it be neither more nor less than 12 litres

3. The Consumer Forum, which wants to test whether it is as claimed or less!

Similar issues may crop up for proportions or variance or other statistics

Hypothesis Testing

• A hypothesis is an assumption regarding a parameter

• Hypothesis Testing is a formal statistical procedure to accept or reject the hypothesis


• The null hypothesis, H0 , is an assumption about the parameter.

• The alternative hypothesis, Ha, is the opposite of H0.

• The testing procedure samples the population to test the two competing statements H0 and Ha.

The Testing Process

Given the Null Hypothesis H0 and the Alternative Hypothesis Ha

• Data is collected to see whether H0 can be rejected

• There are two possible conclusions

1. There is enough evidence to reject H0 (and accept Ha), OR

2. There is not enough evidence to reject H0

In which situation will a decision be made?

• Usually that situation will suggest Ha

Developing the Hypotheses: Example 1

Developing the Hypotheses: Example 2

 A new drug is developed to lower blood sugar more than the existing drug.
 Alternative Hypothesis: The new drug lowers blood sugar more than the existing drug.
 Null Hypothesis: The new drug does not lower blood sugar more than the existing drug.

Developing the Hypotheses: Examples 3 & 4

The label on a coffee can states that it contains 500g. For the Consumer Forum

• Null Hypothesis: The label is correct. μ ≥ 500g.

• Alternative Hypothesis: The label is incorrect. μ < 500g.

The label on a coffee can states that it contains 500g. For the Quality Inspector

• Null Hypothesis: The label is correct. μ = 500g.

• Alternative Hypothesis: The label is incorrect. μ ≠ 500g.

Forms for Null and Alternative Hypotheses

 The equality part of the hypotheses always appears in the null hypothesis.
 H0 and Ha take one of the following three forms:
o H0: μ ≥ μ0 & Ha: μ < μ0, One-Tailed Test (Lower Tail or Left Tail)
o H0: μ ≤ μ0 & Ha: μ > μ0, One-Tailed Test (Upper Tail or Right Tail)
o H0: μ = μ0 & Ha: μ ≠ μ0, Two-Tailed Test
 μ0 is the hypothesized value of the population mean

Type I & Type II Errors

• Recall Ozone Mineral Water Company.
• The CFO will shut down production if he feels μ > 12 liters
• The volume dispensed is a random variable. So some bottles have less than 12 liters and others have
more. Chances are that no bottle has exactly 12 liters of water.
• Since hypothesis tests are based on sample data, there is the possibility of errors.
• Type I Error: The calibration is perfect. But the average of the sample of 36 bottles is much greater
than 12 liters. Production is shut down, but actually nothing was wrong.
• Type II Error: The machine is in bad shape and almost every bottle contains much more than 12
liters. But the sample average is less than 12 liters, since the volume dispensed is random, and production
continues.

Type I & Type II Errors

A Type I error is rejecting H0 when it is true.

• The probability of making a Type I error when the null hypothesis is true as an equality is called the
level of significance.
• In this course we will control only the Type I error.
• Such tests are also called significance tests.

A Type II error is accepting H0 when it is false.

• It is difficult to control for the probability of making a Type II error.

We will avoid the risk of making a Type II error by using “do not reject H0” and not “accept H0”.

Approaches to Hypothesis Testing – I

Three Approaches

a. The Critical Value Approach
i. This is more intuitive.
b. The p-value Approach
i. All software packages generate the p-value.
ii. And all you need to do is compare the p-value with the significance level.
c. Using Confidence Intervals

EXAMPLE

• Recall Ozone sells 12 liter bottles of mineral water.
• The CFO is focusing on cost reduction. The CFO believes that excess water is being dispensed. If that is the
case, he wants to shut down production for a major overhaul of the machinery.
• 36 bottles were sampled and the mean volume of water was found to be 12.17 liters. The population
standard deviation is believed to be 0.6 liters.

Distance in σ’s

The Critical Value Approach I

The Critical Value Approach II

Approaches to Hypothesis Testing – II

Recall Ozone sells 12 liter bottles of mineral water.

The CFO is focusing on cost reduction. CFO believes that excess water is being dispensed. If that is the
case, he wants to shut down production for a major overhaul of the machinery.

36 bottles were sampled and the mean volume of water was found to be 12.17. The population standard
deviation is believed to be 0.6 liters.

The p-value approach


Ozone Example: Two Tail Test

p-Value Approach to Hypothesis Testing

Computer packages compute the p-value and we need to compare this with α

Recall α is the area of the critical region

For a One-Tailed Test

• For a Lower-Tailed Test, the p-value is the area to the left of the test statistic

• For an Upper-Tailed Test, the p-value is the area to the right of the test statistic

• Reject H0 if the p-value < α

• What about a Two-Tailed Test?

• The critical region comprises 2 tails and each tail has area α / 2

• If test statistic < 0, we need to compare the area to the left of the test statistic with α / 2

• If test statistic > 0, we need to compare the area to the right of the test statistic with α / 2

• Computer packages compute the p-value as 2 * (area to the left / right of the test statistic)

• Comparing the tail area with α / 2 is therefore the same as comparing the reported p-value with α

• Thus Reject H0 if the p-value < α
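
The rules above can be sketched in code using the Ozone figures (X̄ = 12.17, μ0 = 12, σ = 0.6, n = 36). The function name and the `tail` argument are illustrative, not taken from any particular package.

```python
from math import sqrt
from statistics import NormalDist


def z_test_p_value(x_bar, mu0, sigma, n, tail):
    """Return the z statistic and p-value for a lower, upper or two-tailed test."""
    z = (x_bar - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()
    if tail == "lower":
        p = std_normal.cdf(z)                     # area to the left
    elif tail == "upper":
        p = 1 - std_normal.cdf(z)                 # area to the right
    elif tail == "two":
        p = 2 * (1 - std_normal.cdf(abs(z)))      # twice the tail area
    else:
        raise ValueError("tail must be 'lower', 'upper' or 'two'")
    return z, p


z, p_upper = z_test_p_value(12.17, 12, 0.6, 36, tail="upper")
_, p_two = z_test_p_value(12.17, 12, 0.6, 36, tail="two")
print(f"z = {z:.2f}, upper-tail p-value = {p_upper:.4f}, two-tailed p-value = {p_two:.4f}")
```

Assuming α = 5% (an assumption here, consistent with the 12.1645 cut-off used later in these notes), the upper-tail p-value of about 0.045 leads to rejecting H0, while the two-tailed p-value of about 0.089 does not.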

Testing A Mean

Testing a Population Mean

σ Known & (X is Normal or n ≥ 30)

• Sampling Distribution: (X̄ − μ0)/(σ/√n) ~ N(0, 1)
• Test Statistic: z = (X̄ − μ0)/(σ/√n)
• Decision Rule: Reject H0 if p-value < α, else Do Not Reject H0

σ Unknown & X is Normal

• Sampling Distribution: (X̄ − μ0)/(s/√n) ~ t distribution with (n − 1) degrees of freedom
• Test Statistic: t = (X̄ − μ0)/(s/√n)
• Decision Rule: Reject H0 if p-value < α, else Do Not Reject H0

EXAMPLE 1

Example 1 LIGHTBOARD

Example 2

Example 2 (σ Unknown) LIGHTBOARD


Testing A Proportion

• After a series of high profile TV Ads, has our market share increased from the previous level of 20%?
• HR has successfully implemented some team initiatives. Has the satisfaction level gone up?
• The politician has announced some populist measures. Has his popularity index gone up?

Testing Population Proportion: The Hypotheses

• H0: p ≥ p0 Vs Ha: p < p0 – One-Tailed or Lower Tail or Left Tail Test
• H0: p ≤ p0 Vs Ha: p > p0 – One-Tailed or Upper Tail or Right Tail Test
• H0: p = p0 Vs Ha: p ≠ p0 – Two-Tailed
• p0 is the hypothesized value of the population proportion

Testing Population Proportion: The Mechanics

Example 1 (One Tailed Test)

Example 1 (One Tailed Test) LIGHTBOARD

Example 2 (Two Tailed Test)


Example 2 (Two Tailed Test) LIGHTBOARD

Revisiting Type I and Type II Errors

Ozone’s CFO: Critical Value Approach

Possibility 1: The Machine is Working Perfectly

Possibility 2: The Machine is Working Imperfectly

• μ > 12 liters
• Event:
• Sample Mean > 12.1645 ⇒ Stop production ⇒ Right Decision
• Sample Mean ≤ 12.1645 ⇒ Continue production ⇒ Wrong Decision (Type II Error)

Trade Off between α and β


Type I and Type II Errors Table

Lesson -7

Comparing Two Populations

• Compare
o Productivity of two shifts
o Customer satisfaction levels of two competing mobile service providers
o Volume of liquid dispensed at the bottling plant before and after overhaul of machinery
o Marital Happiness Levels of married couples
o Sugar Levels before and after insulin injection
o Proportion of successful cold calls before and after Voice Modulation training
o Defect Rate before and after the implementation of a Six Sigma project

Practical Considerations

• Are the measurements Quantitative or Categorical?

If the measurements are quantitative ⇒ The means can be compared

• Are the samples independent?

• Are the standard deviations of the two populations known?

• Are the samples large?

• Or, are the populations normally distributed?

If the measurements are categorical ⇒ The proportions can be compared

• Need to define what constitutes success

• We will insist both samples are large and independent

Testing μ1 - μ2

Let

• {X1} be the measurements for the 1st population with mean μ1

• {X2} be the measurements for the 2nd population with mean μ2

Form of Null Hypothesis: μ1 - μ2 ≤ D0, μ1 - μ2 = D0, μ1 - μ2 ≥ D0

Point Estimator of μ1 – μ2: X̄1 − X̄2

Four Cases

• Case 1: σ1 and σ2 Known

• Case 2: σ1 and σ2 Unknown but thought to be equal

• Case 3: σ1 and σ2 Unknown but thought to be unequal


• Case 4: Matched Samples

Testing p1 - p2

Let

• {X1} be the measurements for the 1st population with proportion of success p1

• {X2} be the measurements for the 2nd population with proportion of success p2

Form of Null Hypothesis: p1 - p2 ≤ 0, p1 - p2 = 0, p1 - p2 ≥ 0

Point Estimator of p1 – p2: p̄1 − p̄2

Requirement: n1*p1 ≥ 5, n1*q1 ≥ 5, n2*p2 ≥ 5, n2*q2 ≥ 5

The Scenario and the Data Set

• TS Ltd does a form of Level 0 support for major consumer goods manufacturers.

TS operates two shifts, wherein customers call in toll-free, and TS operatives register the issues
along with the customer details.

• There are a few metrics that the clients want the call center to track

The monthly Average Call Handling Time (AHT) for each operator

The monthly AHT across the team

The variance of the monthly AHT across the team

1. The SLA for the monthly AHT across the team is 50 seconds

2. The SLA for the variance of the monthly AHT across the team is 9 s² (that is, a standard deviation of 3 seconds)

The Scenario – II

• TS management has received complaints regarding the quality of service, especially the curtness of the
operators

• They brought in Kumar to clean up the operations

• Operators were re-trained on good telephone etiquette and on why the script must be followed

• Kumar also brought a sharp focus on basic metrics and published a weekly dashboard

• Three months have passed since Kumar arrived

• Kumar needs to know whether there has been any improvement

The Data – I

Testing Means – Case 1

Testing μ1 - μ2: Four Cases


 Case 1: σ1 and σ2 Known
 Case 2: σ1 and σ2 Unknown but thought to be equal
 Case 3: σ1 and σ2 Unknown but thought to be unequal
 Case 4: Matched Samples

Testing Means: σ1 and σ2 Known

It is rare that σ1 and σ2 are known

Point Estimate of Difference of Means: X̄1 − X̄2

Standard Error of Difference of Means: √(σ1²/n1 + σ2²/n2)

If n1 & n2 ≥ 30 OR X1 and X2 are Normal:

Sampling Distribution: ((X̄1 − X̄2) − (μ1 − μ2)) / √(σ1²/n1 + σ2²/n2) ~ N(0, 1)

AHT of Shifts in the First Month

Management informed Kumar that Shift 1 had better supervisors compared to Shift 2. He felt that this
would be reflected in the AHTs.

The tab “Known Variances” in AHT.xls has sample data of AHTs for each shift for the 1st month. The sample
statistics appear below.

                 Shift 1          Shift 2

Sample Size      30 Operators     45 Operators

Sample Mean      42.4000 s        38.6444 s

The supervisors informed him that the standard deviation for the Shift 1 and Shift 2 were 5.5 s and 6.5 s
respectively. Kumar felt it was reasonable to work with these estimates.

Kumar realized that both teams are well below the 1st SLA. But can we conclude at α = 1%, that Shift 1 is
doing better than Shift 2 for this SLA?

Summary Report: The Critical Value Approach

1. The Hypothesis: H0: µ1 – µ2 ≤ 0 & H1: µ1 – µ2 > 0

2. Data: n1 = 30, n2 = 45, X̄1 = 42.4, X̄2 = 38.6444, σ1 = 5.5, σ2 = 6.5, α = 1%.

3. Right-Tail Test

4. As n1 & n2 ≥ 30, sampling distribution is
((X̄1 − X̄2) − (μ1 − μ2)) / √(σ1²/n1 + σ2²/n2) = (X̄1 − X̄2)/1.3954 ~ N(0, 1)

5. Test Statistic: (42.4 − 38.6444) / 1.3954 = 2.69

6. The critical value = 2.33 and the critical region is (X̄1 − X̄2)/1.3954 ≥ 2.33

7. Since the test statistic falls in the critical region, we reject H0.

There is sufficient statistical evidence to infer that the average AHT for entire Shift 1 is greater than the
average AHT for the entire Shift 2.

Summary Report: The p-value Approach

• The Hypothesis: H0: µ1 – µ2 ≤ 0 & H1: µ1 – µ2 > 0
• Data: n1 = 30, n2 = 45, X̄1 = 42.4, X̄2 = 38.6444, σ1 = 5.5, σ2 = 6.5, α = 1%.
• Right-Tail Test
• As n1 & n2 ≥ 30, sampling distribution is
((X̄1 − X̄2) − (μ1 − μ2)) / √(σ1²/n1 + σ2²/n2) = (X̄1 − X̄2)/1.3954 ~ N(0, 1)
• Test Statistic: (42.4 − 38.6444) / 1.3954 = 2.69
• p-value = P(Z > 2.69) = 0.0036
• Since p-value ≤ 0.01, we reject H0.

There is sufficient statistical evidence to infer that the average AHT for entire Shift 1 is greater than the
average AHT for the entire Shift 2.
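
Both summary reports can be reproduced with a few lines of Python. This is a sketch with an illustrative function name; it takes the supervisors' standard deviations as the known σ1 and σ2, as Kumar did.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(x1, x2, sigma1, sigma2, n1, n2, d0=0.0):
    """Standard error and z statistic for H0: mu1 - mu2 = d0, sigmas known."""
    std_error = sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    z = ((x1 - x2) - d0) / std_error
    return std_error, z


alpha = 0.01
std_error, z = two_sample_z(42.4, 38.6444, 5.5, 6.5, 30, 45)

# Critical value approach (right-tail test).
critical_value = NormalDist().inv_cdf(1 - alpha)

# p-value approach: area to the right of the test statistic.
p_value = 1 - NormalDist().cdf(z)
reject_h0 = p_value < alpha

print(f"SE = {std_error:.4f}, z = {z:.2f}, critical value = {critical_value:.2f}")
print(f"p-value = {p_value:.4f}, reject H0: {reject_h0}")
```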

Testing Means – Case 2

Testing μ1 - μ2: Four Cases

Case 1: σ1 and σ2 Known

Case 2: σ1 and σ2 Unknown but thought to be equal

Case 3: σ1 and σ2 Unknown but thought to be unequal

Case 4: Matched Samples

AHT of Shifts in the Third Month

Three months have passed since Kumar joined. He wants to know whether Shift 2 has caught up with Shift
1 with regard to AHT.

Kumar realizes that the earlier standard deviations may not be applicable. He also believes that because of
the standardization of processes that have been implemented, the variances may be equal.

The tab “Unknown Variances” in AHT.xls has sample data of AHTs for each shift for the 3rd month. The
sample statistics appear below.

        Shift 1          Shift 2

n       17 Operators     34 Operators

X̄       41.8824 s        51.5882 s

s²      27.9853 s²       24.4314 s²

Test at the 1% significance level (99% confidence) whether Shift 2 has caught up with Shift 1.
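
For the equal-variance case, the pooled variance and the t statistic can be computed from the sample statistics above, as sketched below. The function name is illustrative, and the formulas are the standard pooled two-sample t test rather than a reproduction of the Excel output. The resulting statistic would then be compared with a t table (or a package's p-value) for the hypotheses chosen.

```python
from math import sqrt


def pooled_t(x1, x2, var1, var2, n1, n2):
    """Pooled-variance t statistic for H0: mu1 - mu2 = 0."""
    df = n1 + n2 - 2
    # Weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    std_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    t_stat = (x1 - x2) / std_error
    return pooled_var, std_error, t_stat, df


pooled_var, std_error, t_stat, df = pooled_t(41.8824, 51.5882, 27.9853, 24.4314, 17, 34)
print(f"pooled s^2 = {pooled_var:.4f}, SE = {std_error:.4f}, t = {t_stat:.3f}, DoF = {df}")
```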

The Excel Output

The Summary Report


Testing Means – Case 3

Testing μ1 - μ2: Four Cases

Case 1: σ1 and σ2 Known

Case 2: σ1 and σ2 Unknown but thought to be equal

Case 3: σ1 and σ2 Unknown but thought to be unequal

Case 4: Matched Samples

AHT of Shifts in the Third Month

Kumar was surprised that Shift 2 was in fact doing better than Shift 1.

He wondered whether his assumption that the variances are equal may be wrong. He decided to test the
same hypothesis taking variances to be unknown but not equal.

The tab “Unknown Variances” in AHT.xls has sample data of AHTs for each shift for the 3rd month. The
sample statistics appear below.

        Shift 1          Shift 2

n       17 Operators     34 Operators

X̄       41.8824 s        51.5882 s

s²      27.9853 s²       24.4314 s²

Test at 1% significance level whether Shift 2 has caught up with Shift 1.
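
For the unequal-variance case, a common choice is Welch's t statistic with the Welch-Satterthwaite degrees of freedom. The sketch below assumes that is the method intended here (the Excel output is not reproduced in these notes); the function name is illustrative.

```python
from math import sqrt


def welch_t(x1, x2, var1, var2, n1, n2):
    """Unequal-variance t statistic with Welch-Satterthwaite degrees of freedom."""
    a, b = var1 / n1, var2 / n2
    std_error = sqrt(a + b)
    t_stat = (x1 - x2) / std_error
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return std_error, t_stat, df


std_error, t_stat, df = welch_t(41.8824, 51.5882, 27.9853, 24.4314, 17, 34)
print(f"SE = {std_error:.4f}, t = {t_stat:.3f}, approximate DoF = {df:.1f}")
```

The statistic is close to the pooled result, which suggests the conclusion does not hinge on the equal-variance assumption.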

The Excel Output

The Summary Report


Testing Means – Case 4

Testing μ1 - μ2: Four Cases

Case 1: σ1 and σ2 Known

Case 2: σ1 and σ2 Unknown but thought to be equal

Case 3: σ1 and σ2 Unknown but thought to be unequal

Case 4: Matched Samples

AHT of Shift 2 in the Third Month

The tests have shown that Shift 2 has done exceedingly well. They have met the SLA of 50 seconds while
Shift 1 has regressed.

Kumar wanted to understand the efficacy of his interventions by identifying operators in Shift 2 who were
common in the samples from the 1st and 3rd months

The tab “Matched Samples Shift 2” in AHT.xls has the sample data that Kumar wants. The sample statistics
appear below.

        Month 1          Month 3          Difference

n       18 Operators     18 Operators     18 Observations

X̄       39.3889 s        53.1111 s        13.7222 s

s²      46.0163 s²       25.6340 s²       17.7418 s²

Has Shift 2 significantly improved in the 3rd month? Test at 1% significance level.
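
With matched samples, the test works on the 18 differences alone. A minimal sketch of the paired t statistic, using the Difference column above (Month 3 minus Month 1) and illustrative variable names:

```python
from math import sqrt

n = 18
mean_diff = 13.7222   # mean of the 18 differences, in seconds
var_diff = 17.7418    # sample variance of the differences

# Standard error of the mean difference and the paired t statistic.
std_error = sqrt(var_diff / n)
t_stat = mean_diff / std_error
df = n - 1

print(f"SE = {std_error:.4f}, t = {t_stat:.2f}, DoF = {df}")
```

The statistic would then be compared with the t distribution with 17 DoF at the 1% level.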

Extract of the Data

The Excel Outputs


Testing Proportions – Introduction

Forms for Null and Alternative Hypotheses

Estimating the Difference Between Two Population Proportions

p̄1 − p̄2: Sampling Distribution under H0


Testing Proportions – Example

Kumar was still not convinced that Shift 2 had improved so much while Shift 1 had slipped. He decided to
test whether the proportion of competent operators was the same in the two shifts.

He defined the acceptable range of AHT as [50 – 3, 50 + 3], where 50 seconds was the target AHT and 3
seconds was the target standard deviation of the AHT.

He asked his EA to test whether the proportions were the same using the sample data for the 3rd month at
α = 1%, and submit the Summary Report.

The tab “Proportions” in AHT.xls has the data and the sample statistics appear below.

                 Shift 1          Shift 2

n                17 Operators     34 Operators

# of Successes   2                20

p̄                0.1176           0.5882

The Summary Report

1. The hypotheses: H0: p1 – p2 = 0 Vs Ha: p1 – p2 ≠ 0

2. Data: n1 = 17, p̄1 = 0.1176, n2 = 34, p̄2 = 0.5882, α = 1%

3. Two Tail Test

4. Pooled Estimator of p: p̄ = (n1p̄1 + n2p̄2)/(n1 + n2) = 22/51 = 0.4314

5. Standard Error of p̄1 − p̄2: σ = √(p̄q̄(1/n1 + 1/n2)) = 0.1471

6. Since n1p̄, n1q̄, n2p̄, n2q̄ ≥ 5, Sampling Distribution:

(p̄1 − p̄2)/√(p̄q̄(1/n1 + 1/n2)) = (p̄1 − p̄2)/0.1471 ~ N(0, 1)

7. Test Statistic: (0.1176 − 0.5882)/0.1471 = −3.1991 ~ −3.2

8. p-value: 2 * P(Z < −3.2) = 2 * 0.0007 = 0.0014

9. Since p-value ≤ α, Reject H0
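
The report can be checked with the sketch below, which works from the raw counts (2 of 17 and 20 of 34). The function name is illustrative; the pooled standard error is the one used in step 5.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(successes1, n1, successes2, n2):
    """Pooled two-proportion z test of H0: p1 - p2 = 0 (two-tailed)."""
    p1, p2 = successes1 / n1, successes2 / n2
    pooled = (successes1 + successes2) / (n1 + n2)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_error
    p_value = 2 * NormalDist().cdf(-abs(z))   # twice the tail area
    return pooled, std_error, z, p_value


pooled, std_error, z, p_value = two_proportion_z(2, 17, 20, 34)
print(f"pooled p = {pooled:.4f}, SE = {std_error:.4f}, z = {z:.2f}, p-value = {p_value:.4f}")
```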

Lesson -8

Testing Variances – Introduction

Variance is an important part of the decision-making process.

• The average is 1 liter as claimed by the bottling plant but what about the variance?

• The average strength in the production of 1mg Amaryl is 1 mg but what about the variance of drug
weight?

• In Finance, risk is measured by the variance

Forms for Null and Alternative Hypotheses

• H0 and Ha take one of the following three forms:
o H0: σ² ≥ σ0² & Ha: σ² < σ0², One-Tailed Test (Lower Tail or Left Tail)
o H0: σ² ≤ σ0² & Ha: σ² > σ0², One-Tailed Test (Upper Tail or Right Tail)
o H0: σ² = σ0² & Ha: σ² ≠ σ0², Two-Tailed Test

As always, tests are performed assuming H0 is true – as an equality

Sampling Distribution of s²

o Let X be a population that is normally distributed
o Let the variance of this population be σ²
o Samples of size n are repeatedly taken from X
o For each sample, (n − 1)s²/σ² is computed
o The (Sampling) Distribution of (n − 1)s²/σ² is a χ² with (n − 1) degrees of freedom

Reading the χ2 Table

Testing a Variance – Illustration

The project manager of Shift 2 was excited that Shift 2 had improved so much while Shift 1 had slipped. He
also claimed that they were meeting the second SLA: the variance was less than the target of 9 s².

Test the validity of this claim at 5% significance level

Summary Data

The tab “variances” in AHT.xls has the data and the sample statistics appear below.

            Shift 1          Shift 2

n           17 Operators     34 Operators

Variance    27.9853          24.4314

Summary Report

1. The Hypothesis: H0: σ² ≤ 9 s² Vs Ha: σ² > 9 s²

2. Data: n = 34, s² = 24.4314, α = 5%.

3. Right-Tail Test

4. Sampling distribution is (n − 1)s²/σ² = 33s²/9 ~ χ² with 33 degrees of freedom

5. Test Statistic: 33*24.4314/9 = 89.5818

6. Critical Value: 47.3999

7. Since Test Statistic falls in the critical region, Reject H0.

There is statistical evidence to show that the claim is not true

Testing Variances – Introduction

Sampling Distribution of s1²/s2²

Suppose both populations, say X1 and X2, are normally distributed

Random samples of size n1 from X1 and n2 from X2 are repeatedly taken

For each pair of samples, s1²/s2² is computed

The distribution of s1²/s2² is an F distribution with

(n1 – 1) numerator DoF and (n2 – 1) denominator DoF

The Test Statistic will be s1²/s2²

Reading the F Tables

Forms for Null and Alternative Hypotheses

1. H0: σ1² ≥ σ2² & Ha: σ1² < σ2², One-Tailed Test (Lower Tail or Left Tail)

2. H0: σ1² ≤ σ2² & Ha: σ1² > σ2², One-Tailed Test (Upper Tail or Right Tail)

3. H0: σ1² = σ2² & Ha: σ1² ≠ σ2², Two-Tailed Test

• To avoid a left-tailed test

• Label the populations so that the Right Tailed Test will prevail!

• If Ha: σ²(Men) < σ²(Women), label Women as Population 1 and Men as Population 2 so that Ha: σ1² > σ2²

• To manage a Two Tailed Test

• Label the populations so that Population 1 has the higher s²

Summary Report – Structure

Example

• Kumar was still not convinced that Shift 2 had improved so much while Shift 1 had slipped.
• He wanted to test at 10% significance level whether the variances of both shifts were equal for the 3rd
month
• Note that this is a Two-Tailed Test
• The population with the greater sample variance will be labelled 1

Summary Data

The tab “variances” in AHT.xls has the data and the sample statistics appear below.

            Shift 1          Shift 2

n           17 Operators     34 Operators

Variance    27.9853          24.4314


Shift 1 will be labelled 1

Shift 2 will be labelled 2

Summary Report

1. The hypotheses: H0: σ1² = σ2² Vs Ha: σ1² ≠ σ2²

2. Data: n1 = 17, n2 = 34, s1² = 27.9853, s2² = 24.4314, α = 10%

Notice that s1² > s2²

3. Two Tailed Test

4. Sampling Distribution: s1²/s2² ~ F(16, 33)

5. Test Statistic: s1²/s2² = 27.9853/24.4314 = 1.1454 ~ 1.15

6. Critical Value: 1.96 (upper-tail area α/2 = 5%); Critical Region: F ≥ 1.96

7. Conclusion: Since Test Statistic does not fall in the critical region, do not reject H0.

There is no evidence to show that the variances are not equal

Example

Test at α = 5% whether the variance of Shift 2 is less than the variance of Shift 1 in the 3rd month.

We want to test Ha: σ²(Shift 2) < σ²(Shift 1)

• Label Shift 1 as Population 1 & Shift 2 as Population 2

• We then have: H0: σ1² ≤ σ2² Vs Ha: σ1² > σ2²

• Test Statistic: s1²/s2²

• If Test Statistic is large enough, H0 is rejected

Summary Data

The tab “variances” in AHT.xls has the data and the sample statistics appear below.

            Shift 1         Shift 2
n           17 Operators    34 Operators
Variance    27.9853         24.4314

Summary Report

1. The hypotheses: H0: σ1² ≤ σ2² Vs Ha: σ1² > σ2²

2. Data: n1 = 17, n2 = 34, s1² = 27.9853, s2² = 24.4314, α = 5%

3. Right Tailed Test

4. Sampling Distribution: s1²/s2² ~ F(16, 33)

5. Test Statistic: s1²/s2² = 27.9853 / 24.4314 = 1.1454 ≈ 1.15

6. Critical Value: 1.96 (Approx Value from Table = 2.01)

7. Conclusion: Since the Test Statistic does not fall in the critical region, do not reject H0.
There is no evidence to show that the variance of Shift 2 is less than the variance of Shift 1

Lesson – 9

The Χ2 Test

In this module, we will discuss

• Test of Independence

• Test of Homogeneity

• Goodness of Fit Test: Multinomial Distribution

All three tests are similar:

• The sampling distribution is the Χ2 distribution

• The Test Statistic is computed in similar fashion

• The data are categorical

These features will be presented in a detailed manner in the Test of Independence

The Test of Independence:

The Framework

The Framework for the Χ2 Testing

• The sampling distribution (Χ2 distribution)

• The computation of the Test Statistic

• Categorical Data

• Does Gender influence Buying Decisions?

• On the Titanic, was the Survival Rate dependent on Class?

• Are Buying Behaviour and Geographical Location independent?

• Are Productivity and Shift independent?

Kumar’s, a large supermarket chain, wanted to minimize the time customers spent at check-out time – there
were separate counters for cash and plastic payments.
The fresh MBA graduate wondered whether customers chose the mode of payment based on the bill size.
Accordingly he grouped the billing amount into 3 categories:
< 500, 500 – 2000 and >2000.
He collected the data from the ERP and created the following contingency table.

• There are two variables: Bill Size and Mode of Payment
• Both variables are Categorical
• Both are random
• The data represent counts of observations falling in each cell
• The data were collected to study whether “customers chose the mode of payment based on the bill size”

Example I: Developing the Hypothesis

Concern: “Whether customers chose the mode of payment based on the bill size”

This is equivalent to

Are the two variables, Bill Size and Mode of Payment, dependent?

X: Mode of Payment: Plastic or Cash

Y: Bill Size: Less than Rs.500; Between Rs.500 and Rs.2000; Greater than Rs.2000

H0: X and Y are independent

Ha: X and Y are not independent

Example I: The Observed Frequencies

Example I: The Expected Frequencies I

Example I: The Expected Frequencies II

Example I: The Expected Frequencies III


Example I: Independence Requires …

Test of Independence: Framework

Let X and Y be two categorical variables.

1. Set up the null and alternative hypotheses.

H0: X and Y are independent

Ha: X and Y are not independent

2. Select a random sample and record oij , for each cell of the contingency table

3. Compute the expected frequency, eij , for each cell.

eij = (Row i Total) * (Column j Total) / Total Sample Size


4. Compute the test statistic: χ² = Σi Σj (oij – eij)² / eij

5. Compute the p-value with sampling distribution χ² with (r-1)*(c-1) DoF

6. Reject H0 if p-value ≤ α. Otherwise do not reject H0

Note of Caution on the Χ2 Test

Example I
Kumar’s, a large supermarket chain, wanted to minimize the time customers spent at check-out
time – there were separate counters for cash and plastic payments.

The fresh MBA graduate wondered whether customers chose the mode of payment based on the
bill size. Accordingly he grouped the billing amount into 3 categories: < 500, 500 – 2000 and >2000.

He collected the data from the ERP and created the following contingency table.

           Bill Size
           < 500    500 - 2000    > 2000
Plastic    100      200           500
Cash       120      600           480

Test the hypothesis at 5% significance level

Step 1: The Expected Frequencies

Step 2: The Test Statistic

The Report

X: Mode of Payment: Plastic or Cash

Y: Bill Size: Less than Rs.500; Between Rs.500 and Rs.2000; Greater than Rs.2000

1. H0: X and Y are independent Vs Ha: X and Y are not independent

2. α = 0.05

3. Right Tail Test

4. Sampling Distribution: Χ2 with (r-1)*(c-1) = (2-1)*(3-1) = 2 degrees of freedom

5. Test Statistic: 127.32

6. p-value = P(Χ2 > 127.32) < 0.05 = α

7. Since p-value ≤ 0.05, Reject H0
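
The expected frequencies and the test statistic in this report can be reproduced with a short Python sketch (Python is an assumption; the notes use Excel). It applies eij = (Row i Total) * (Column j Total) / Total Sample Size to the contingency table above. The p-value line relies on the fact that, for exactly 2 degrees of freedom, the upper-tail probability of the Χ2 distribution is exp(-x/2), so no statistics library is needed.

```python
import math

# Observed counts: rows = mode of payment, columns = bill size
observed = [
    [100, 200, 500],  # Plastic: < 500, 500 - 2000, > 2000
    [120, 600, 480],  # Cash
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# e_ij = (Row i Total) * (Column j Total) / Total Sample Size
expected = [[r * c / n for c in col_totals] for r in row_totals]

# chi-square = sum over all cells of (o_ij - e_ij)^2 / e_ij
chi_sq = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

# Valid only because dof == 2 here: P(chi-square > x) = exp(-x / 2)
p_value = math.exp(-chi_sq / 2)

print("Expected:", expected)
print("Test statistic:", round(chi_sq, 2), "DoF:", dof, "p-value:", p_value)
```

For tables with a different number of rows or columns the closed-form p-value no longer applies, and a Χ2 table or a statistics package would be needed instead.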


Test of Homogeneity: Introduction

Test of Homogeneity

Example 2

What is the Difference Between Examples?

Test of Homogeneity: The Hypothesis

           Bill Size
           < 500    500 - 2000    > 2000
Plastic    100      200           500
Cash       120      600           480

• Each row in the contingency table is a population

• Each population is segmented into 3 sub-populations

• Let pij = the population proportion in cell (i, j)

• If the two populations are perfectly homogeneous then

• p11 = p21 & p12 = p22 & p13 = p23


Test of Homogeneity: The General Hypothesis

If the rows define the populations, then

• The Null Hypothesis comprises c Null Hypotheses, 1 for each column

• H0: p11 = p21 = . . . = pr1

• H0: p12 = p22 = . . . = pr2

• …..

• H0: p1c = p2c = . . . = prc

• Ha: At least one H0 is false

If the columns define the populations, then

• The Null Hypothesis comprises r Null Hypotheses, 1 for each row

• H0: p11 = p12 = . . . = p1c

• H0: p21 = p22 = . . . = p2c

• …..

• H0: pr1 = pr2 = . . . = prc

• Ha: At least one H0 is false

Test of Homogeneity: Expected Counts

Test of Homogeneity: The Hypothesis

The Null Hypothesis comprises 3 Null Hypotheses

• H0: p11 = p21 pertaining to Bills of amount less than 500

• H0: p12 = p22 pertaining to Bills of amount between 500 & 2000

• H0: p13 = p23 pertaining to Bills of amount greater than 2000

Versus

• Ha: At least one H0 is false

Example 2: The Observed Frequencies

Example 2: The Expected Frequencies I


Example I: The Expected Frequencies III

Test of Homogeneity: Framework

Kumar’s, a large supermarket chain, wanted to minimize the time customers spent at check-out time –
there were separate counters for cash and plastic payments.

The fresh MBA graduate wondered whether customers chose the mode of payment based on the bill size.
Accordingly he grouped the billing amount into 3 categories: < 500, 500 – 2000 and >2000.

The ERP was down. So he collected the following data from each check-out counter

           Bill Size
           < 500    500 - 2000    > 2000
Plastic    100      200           500
Cash       120      600           480

Example I: Homogeneity Requires …


Test of Homogeneity: Application

Step 1: The Expected Frequencies

Step 2: The Test Statistic

The Report

X: Mode of Payment: Plastic or Cash

Y: Bill Size: Less than Rs.500; Between Rs.500 and Rs.2000; Greater than Rs.2000

1. H0: p11 = p21 (Bills < 500), H0: p12 = p22 (Bills between 500 & 2000) & H0: p13 = p23 (Bills > 2000) Versus Ha: At
least one H0 is false

2. α = 0.05

3. Right Tail Test

4. Sampling Distribution: Χ2 with (r-1)*(c-1) = (2-1)*(3-1) = 2 degrees of freedom

5. Test Statistic: 127.32

6. p-value = P(Χ2 > 127.32) < 0.05 = α

7. Since p-value ≤ 0.05, Reject H0

• The three soft drinks majors have a market share of 25%, 35% and 40%
• One major went on an ad blitz
• Have the market shares changed?

The fresh MBA graduate at Kumar’s noted that H0 was rejected in the previous tests. An article in a
business journal discussed cash payment versus plastic payment in a general scenario. From the report,
the graduate inferred that 15% of payments made by plastic were for bills less than 500, 25% were for bills
between 500 and 2000 and 60% were for bills over 2000.

He wondered whether that was the case at Kumar’s. To test this hypothesis, he collected the following
data.

           Bill Size
           < 500    500 - 2000    > 2000
Plastic    100      200           500

Test the hypothesis at 5% significance level

Goodness of Fit Test*: The Hypothesis

• Let p1 be the proportion of payments made by plastic for bills < 500
• Let p2 be the proportion of payments made by plastic for bills between 500 & 2000
• Let p3 be the proportion of payments made by plastic for bills > 2000
• H0: p1 = 0.15, p2 = 0.25, p3 = 0.60
• Ha: At least one equality is false

Step 1: The Expected Frequencies


Step 2: The Test Statistic
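
The two steps can be sketched in Python as follows (an assumption; the notes use Excel, and the numbers printed here come from this sketch rather than from the original slides). The expected count in each category is n * pi under H0, the test statistic is the usual Σ (oi – ei)² / ei, and the degrees of freedom are (number of categories – 1) = 2. As before, the p-value uses the closed form exp(-x/2), which holds only for 2 degrees of freedom.

```python
import math

observed = [100, 200, 500]          # plastic payments: < 500, 500 - 2000, > 2000
hypothesized = [0.15, 0.25, 0.60]   # proportions claimed under H0
alpha = 0.05

n = sum(observed)
expected = [n * p for p in hypothesized]   # e_i = n * p_i

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
dof = len(observed) - 1

# Valid only because dof == 2 here: P(chi-square > x) = exp(-x / 2)
p_value = math.exp(-chi_sq / 2)
reject_h0 = p_value <= alpha

print("Expected:", expected)
print(f"Test statistic = {chi_sq:.4f}, DoF = {dof}, p-value = {p_value:.4f}")
print("Reject H0:", reject_h0)
```

Under these assumptions the p-value exceeds α = 0.05, so the sketch does not reject H0.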

The Report

Introduction to Design of Experiments

Cause & Effect

• The Consumer Forum wants to test the effect of 4 fuel additives on mileage
• Does an MBA specialization have any effect on the starting salary?
• The effect of 5 diets on liver cholesterol
• The Call center needs to evaluate 3 different training methods
• Online Retailer: Do sharper visuals of products lead to higher sales?
• Catalogue Retailers: Which call to action leads to higher sales?

Statistical Studies

 Statistical Studies: Experimental or Observational


o Observational study: No attempt is made to control the causes
o Experimental study: One or more causes are controlled (manipulated) to study how
they influence the dependent variable
 Cause-and-effect relationships are easier to identify and establish in experimental studies
Constituents of an Experiment: Independent Variables

o Factors are independent variables that are manipulated


 Factors are treated as categorical variables
 Treatments are different levels (values) of the Factors
o Blocks are independent variables that cannot be manipulated
o Extraneous Variables are independent variables that cannot be controlled but may affect the
dependent variable

Constituents of an Experiment: Dependent Variable

Dependent variable

– Measures the effect of the treatments on the test units

– It is a continuous variable

Constituents of an Experiment: Test Units

Test units

– The subjects whose responses to the treatments are measured

– The control group comprises the test units that do not receive any treatment; it acts as a
benchmark

Statistical Designs

Statistical designs allow for statistical control of independent variables

 Completely Randomized Design


 Randomized Block Design
 Factorial Design

The Completely Randomized Design

The Randomized Block Design

The Factorial Design


Introduction to One-Way ANOVA

The Hypotheses

• ANOVA tests for the equality of three or more population means

H0: μ1 = μ2 = μ3 = … . = μk

Ha: Not all population means are equal

• Data obtained from observational or experimental studies can be used

• If H0 is rejected, we cannot conclude that all population means are different.

Rejecting H0 means that at least two population means have different values.

Xi ~ N(μi, σ), i = 1, 2, 3, …, k

The observations are independent

Three Estimators of σ²

We have 3 populations: Xj ~ N(μj, σ), j = 1, 2, 3 & 3 samples, one from each population

We compute 3 statistics – all estimators of σ² (x̄j = mean of sample j, x̿ = mean of all nT observations):

1. Combine all 3 samples and compute the sample variance = SST / (nT – 1) = Σj Σi (xij – x̿)² / (nT – 1)

2. (Between-Treatments Estimate) MSTR = SSTR / (k – 1) = Σj nj (x̄j – x̿)² / (k – 1)

3. (Within-Treatments Estimate) MSE = SSE / (nT – k) = Σj Σi (xij – x̄j)² / (nT – k)

• Under H0 all these estimators are unbiased estimators of σ².

• If H0 is False, only the third (MSE) remains unbiased. The first and second will overestimate σ²

∴ The test statistic is the ratio of the 2nd to the 3rd!
Mean Square due to Treatments (MSTR)

MSTR (Mean Square due to Treatments) denotes the weighted variance of the sample means:

MSTR = Σj nj (x̄j – x̿)² / (k – 1), where x̄j is the mean of sample j and x̿ is the overall mean

• Under H0, SSTR/σ² follows a χ2 distribution with (k-1) DoF

• Numerator is called Treatment Sum of Squares (SSTR)

• Denominator is the degrees of freedom associated with SSTR

Mean Square Error (MSE)

MSE (Mean Square Error) denotes the weighted mean of the sample variances:

MSE = Σj Σi (xij – x̄j)² / (nT – k) = Σj (nj – 1) sj² / (nT – k)

• SSE/σ² follows a χ2 distribution with (nT - k) DoF

• Numerator is called Error Sum of Squares (SSE)

• Denominator is the degrees of freedom associated with SSE

The Sampling Distribution

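MSTR = Σj nj (x̄j – x̿)² / (k – 1) & MSE = Σj Σi (xij – x̄j)² / (nT – k)

Under H0, SSTR/σ² and SSE/σ² follow χ2 distributions with (k - 1) and (nT - k) DoF respectively

F = MSTR / MSE follows an F((k-1), (nT-k)) Distribution

Under H0, SST/σ² also follows a χ2 distribution with (nT - 1) DoF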

SST is called Total Sum of Squares

The F Test: MSTR / MSE

• If H0 is true, MSTR and MSE are unbiased estimators of σ2

• If H0 is false, MSTR overestimates σ2 and

The Test Statistic = MSTR/MSE will be large

• H0 is rejected if MSTR / MSE appears to be too large

• Hence the test is a Right Tail Test
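
A minimal Python sketch of the F statistic follows (Python is an assumption; the notes use Excel). The function name `one_way_anova` and the three small samples are hypothetical and chosen only to keep the arithmetic easy to follow; they are not the shift data analysed in these notes.

```python
def one_way_anova(samples):
    """Return (SSTR, SSE, MSTR, MSE, F) for a list of samples, one per treatment."""
    k = len(samples)
    n_total = sum(len(s) for s in samples)
    grand_mean = sum(sum(s) for s in samples) / n_total
    means = [sum(s) / len(s) for s in samples]

    # SSTR: weighted squared deviations of the sample means from the overall mean
    sstr = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
    # SSE: squared deviations of each observation from its own sample mean
    sse = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)

    mstr = sstr / (k - 1)        # (k - 1) numerator DoF
    mse = sse / (n_total - k)    # (nT - k) denominator DoF
    return sstr, sse, mstr, mse, mstr / mse


# Hypothetical data: 3 treatments, 3 observations each
samples = [[1, 2, 3], [2, 3, 4], [4, 5, 6]]
sstr, sse, mstr, mse, f_stat = one_way_anova(samples)
print(f"SSTR = {sstr:.2f}, SSE = {sse:.2f}, MSTR = {mstr:.2f}, MSE = {mse:.2f}, F = {f_stat:.2f}")
```

The resulting F would then be compared with the right-tail critical value of F((k-1), (nT-k)) taken from the F tables.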

One-Way ANOVA – The Summary Report

The Analysis – So Far

 One Response Variable – Dependent, Continuous


 One Factor – Independent, Categorical with k Levels (Treatments)
 Test Units – nT, the total sample size
 MSTR & MSE – Two estimators of σ2
 MSTR / MSE ~ F with (k-1) numerator DoF and (nT – k) denominator DoF
 Null Hypothesis: H0: μ1 = μ2 = μ3 = … . = μk
 A Right Tail Test
Summary Report Structure – Critical Value Approach

Summary Report Structure – p-value Approach

ANOVA Table – Excel Output

One Way ANOVA Illustration – The Computations

Computations Required

• F = MSTR / MSE
• MSTR = SSTR / (k – 1) & MSE = SSE / (nT – k)
• SSTR = Σj nj (x̄j – x̿)² & SSE = Σj Σi (xij – x̄j)²
• x̄j (the sample means) and x̿ (the overall mean)
The Computations

One Way ANOVA Illustration - The Summary Report

Preliminary Remarks

Factor: Shift

Treatments: Shift 1, Shift 2, Shift 3

Experimental Units: Operators

Response Variable: Number of tickets closed

Statistical Design: Completely Randomized Design

µ1 = mean number of tickets closed by operators in the 1st shift

µ2 = mean number of tickets closed by operators in the 2nd shift

µ3 = mean number of tickets closed by operators in the 3rd shift

The Summary Report

H0: μ1 = μ2 = μ3 Vs Ha: Not all population means are equal

Right Tail Test with α = 5%

Sampling Distribution: MSTR / MSE ~ F(2, 12)

Test Statistic = MSTR / MSE = 1.76

Critical Value: 3.88

Since Test Statistic does not fall in the critical region, Do not reject H0

The ANOVA TABLE - Excel Output

The Summary Report – From the Excel Output


o H0: μ1 = μ2 = μ3 Vs Ha: Not all population means are equal
o Right Tail Test with α = 5%
o Sampling Distribution: ~F(2, 12)
o Test Statistic = 1.76
o p-value = 0.2129
o Since p-value > α, Do not reject H0

Lesson – 11

Linear Regression & Scatter Plots

 Linear Regression and Scatter Plots


 Simple Linear Regression Model
 Estimating the Model – The Least Squares Method
 Applying the Least Squares Method
 Correlation Coefficient & Coefficient of Determination
 Model Assumptions
 Testing for Significance
 Understanding Excel Outputs
o Simple Linear Regression
o Multiple Linear Regression

 Is there a relationship between ad expenses and sales


 If an additional crore for ad expenses is sanctioned, what is the expected increase in sales?
 Demand analysis predicts how many units will be sold in the next quarter
 Auto insurance premium tables may be built by regressing Claims against demographics, engine size,
automobile price etc.
 Salary Structure Management

Linear Regression

 Simple linear regression


o One independent variable (denoted by X)
o One dependent variable (denoted by Y)
o Y is considered to be linearly dependent on X
 Multiple Linear Regression
o More than one independent variable (X1, X2, X3, …)
o Y is considered to be linearly dependent on X1, X2, X3, …

Scatter Plot (Son’s Height Against Father’s Height)


Scatter Plot (Son’s Height Against Father’s Height)

Scatter Plots

Scatter Plot: Sales Vs Ad Expenses

Simple Linear Regression Model

y = 0 + 1x +ε,

Or E(y) = 0 + 1x

o where:
o 0 and 1 are the parameters of the model,
o ε is a random variable called the error term
o The model represents a linear relationship.
o 0 is the y intercept of the regression line.
o 1 is the slope of the regression line.
o E(y) is the expected value of y for a given x value

E(y) = β0 + β1x: Different Scenarios


Estimated Model: ŷ = b0 + b1x

o The Model: E(y) = β0 + β1x
o The Estimated Model: ŷ = b0 + b1x
o Where ŷ is the predicted value of the dependent variable at ‘x’

The Estimated Model: ŷ = b0 + b1x

The Least Squares Method I

ŷ = b0 + b1x

The coefficients are computed by the Least Squares Criterion: Min Σ(yi – ŷi)²

Where

yi = observed value of the dependent variable for the ith observation

ŷi = estimated value of the dependent variable for the ith observation

Least Squares Method II

Least Squares Method III

o The Estimated Model: ŷ = b0 + b1x
o Slope for the Estimated Regression Equation
b1 = Σ(xi – x̄)(yi – ȳ) / Σ(xi – x̄)² = sxy / sx² ≡ Covariance(X, Y) / Variance(X)
o Y Intercept for the Estimated Regression Equation
b0 = ȳ – b1x̄
o where:
o xi = value of independent variable for ith observation
o yi = value of dependent variable for ith observation
o x̄ = mean value for independent variable
o ȳ = mean value for dependent variable
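
The slope and intercept formulas can be sketched in Python as follows (Python is an assumption; the notes use Excel). The function name `least_squares` and the five (x, y) pairs are hypothetical; they are not the Kumar’s Clothing Emporium data, which are not reproduced in these notes.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the estimated line y_hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx            # slope
    b0 = y_bar - b1 * x_bar     # y intercept
    return b0, b1


# Hypothetical observations
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = least_squares(xs, ys)
print(f"y_hat = {b0:.2f} + {b1:.2f} x")
```

The common factor 1/(n – 1) in the covariance and the variance cancels, which is why the sketch works with the raw sums.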

Applying the Least Squares Method

Least Squares Method: ŷ = b0 + b1x

Slope for the Estimated Regression Equation

b1 = Σ(xi – x̄)(yi – ȳ) / Σ(xi – x̄)² = sxy / sx² ≡ Covariance(X, Y) / Variance(X)
Y Intercept for the Estimated Regression Equation
b0 = ȳ – b1x̄
where:
xi = value of independent variable for ith observation
yi = value of dependent variable for ith observation
x̄ = mean value for independent variable
ȳ = mean value for dependent variable

Simple Linear Regression Example

Kumar’s Clothing Emporium periodically has a special week-end sale. As part of the advertising campaign
Kumar runs one or more TV commercials on Friday preceding the sale. Data from a sample of 5 previous
sales are shown below.

Applying the Least Squares Method I

Applying the Least Squares Method II


Applying the Least Squares Method III

Applying the Least Squares Method IV

Applying the Least Squares Method V

Applying the Least Squares Method VI


The Estimated Regression Line 𝒚 = 8 + 6x

Correlation Coefficient

r = sxy / (sx sy) = Σ(xi – x̄)(yi – ȳ) / √[Σ(xi – x̄)² Σ(yi – ȳ)²]

Correlation measures linear relationship

Does not explain causality

Scatter Plots

Scatter Plot: Sales Vs Ad Expenses


Computing the Correlation Coefficient: Example

Coefficient of Determination

R2 - Coefficient of Determination

SST = Σ(yi – ȳ)²

SSR = Σ(ŷi – ȳ)²

SSE = Σ(yi – ŷi)²

where:

SST = total sum of squares

SSR = sum of squares due to regression

SSE = sum of squares due to error

SST = SSR + SSE

R2 = SSR / SST

SST, SSR & SSE

R2 - Coefficient of Determination
Computing R2: Example

Interpreting R2 & r

0 ≤ R2 ≤ 1 & −1 ≤ r ≤ 1
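
The relationship between SST, SSR, SSE, R2 and r can be checked with a short Python sketch (Python is an assumption; the notes use Excel). The function name `fit_summary` and the data are hypothetical and are not the Kumar’s Clothing Emporium figures.

```python
import math


def fit_summary(xs, ys):
    """Fit y_hat = b0 + b1 * x by least squares and return the sums of squares, R2 and r."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)

    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * x for x in xs]

    sst = s_yy                                            # total sum of squares
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # due to regression
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # due to error
    r = s_xy / math.sqrt(s_xx * s_yy)                     # correlation coefficient
    return {"SST": sst, "SSR": ssr, "SSE": sse, "R2": ssr / sst, "r": r}


# Hypothetical observations
summary = fit_summary([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print({name: round(value, 4) for name, value in summary.items()})
```

In simple linear regression R2 equals r squared, and r carries the sign of the slope b1.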


The Context

Social Sciences: R2 = 0.3 may be considered as significant

Science: R2 may be required to be ≥ 0.6

Model Assumptions I

y = β0 + β1x +ε

1. The error ε is a random variable with mean of zero.

This allowed us to rewrite the model as E(y) = β0 + β1x

Model Assumptions II

y = β0 + β 1x +ε

2. The variance of ε, denoted by σ2, is the same for all values of the independent variable.

• X is a deterministic variable

• For each value of X, Y is a random variable


Model Assumptions III

Model Assumptions IV

y = β0 + β 1x +ε

4. The values of ε are independent

This requires the Y’s to be independent

Estimate for σ2

Recall the Error Sum of Squares – SSE

s2 = MSE = SSE / (n – k – 1) is the estimate for σ2

Where

n: The number of observations

k: The number of independent variables

Interpreting Excel Outputs I

Kumar’s Clothing Emporium

Kumar’s Clothing Emporium periodically has a special week-end sale. As part of the advertising campaign
Kumar runs one or more TV commercials on Friday preceding the sale. Data from a sample of 5 previous
sales are shown below.
Understanding the Summary Output

Excel Summary Output: The F test


Interpreting Excel Outputs II
Excel Summary Output: Testing using Confidence Intervals

Lesson -12

Introduction to Linear Programming

Example: A Blending Problem


The neighbourhood shop produces two blends of coffee: Nilgiri AA & Nilgiri A

Three types of coffee beans are required: Bean 1, Bean 2 & Bean 3

• Four Kg of Grade AA require 2 Kg of Bean 1, 1 Kg of Bean 2 and 1 Kg of Bean 3

• Four Kg of Grade A require 2 Kg of Bean 1, 0 Kg of Bean 2 and 2 Kg of Bean 3

He makes a profit of

• Rs.4 on every Kg of Grade AA sold and

• Rs.2 on every Kg of Grade A sold

The shopkeeper has 3 Kg of Bean 1, 1.25 Kg of Bean 2 and 2 Kg of Bean 3.

Constituents of a Linear Programming Problem (LPP)

o Objective Function
o Maximizing Profit or Minimizing Cost
o The objective function must be expressed as an algebraic expression
o This expression must be linear
o Decision Variables
o Variables whose values can be controlled by the manager
o # of units to produce
o Constraints
o Resource availability
o Company Policy
o Each constraint must be expressed as an algebraic expression
o This expression must be linear

Towards a Solution

o A feasible solution: A set of values of the decision variables that satisfies all the constraints.
o The Feasible Region: The set of all feasible solutions
o An optimal solution: A feasible solution that yields the optimum objective function value (the largest
value when maximizing or the smallest when minimizing).
o Special Cases
o An LPP may have
 Exactly one solution
 Infinite number of solutions
 No solution
 An unbounded solution

Guidelines for Model Formulation

o Problem Modeling is the process of translating a verbal description of the problem into a
mathematical statement.
o The Process
o Understand the problem thoroughly
o Describe the objective
o Describe each constraint
o Get a Sign off on the Problem Statement from the management
o Define the decision variables
o Express the objective function algebraically in terms of the decision variables
o Express the constraints algebraically in terms of the decision variables
o Finally use a computer package to solve the problem
Introduction to Linear Programming

Constituents of an LPP – A Recap

Decision Variables

• Variables whose values can be controlled by the manager

• Optimal values of which the manager wants to determine

Objective Function

• The goal of the manager

Constraints

• Imposed by resource availability or company policy

The objective function and the constraints must be expressed as linear functions of the decision variables

Example 1: A Simple Maximization Problem

Product Mix Problem

The neighbourhood shop produces two blends of coffee: Nilgiri AA & Nilgiri A. Three types of coffee beans
are required.

• 4 Kg of Grade AA requires 2 Kg of Bean 1, 1 Kg of Bean 2 and 1 Kg of Bean 3.

• 4 Kg of Grade A requires 2 Kg of Bean 1 and 2 Kg of Bean 3.

• There are 3 Kg of Bean 1, 1.25 Kg of Bean 2 and 2 Kg of Bean 3.

• Profit contributions are Rs4/Kg on Grade AA and Rs2/Kg on Grade A.
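
One way to check a formulation of this problem is to enumerate the extreme points in Python (an assumption; the notes use Excel Solver). The sketch assumes x = Kg of Grade AA and y = Kg of Grade A. Dividing the 4 Kg recipes by 4 gives per-Kg bean requirements, and clearing the fractions yields x + y ≤ 6 (Bean 1), x + 2y ≤ 8 (Bean 3) and x ≤ 5 (Bean 2), with profit 4x + 2y. The helper names are hypothetical.

```python
from itertools import combinations

# Each constraint is (a, b, c), meaning a*x + b*y <= c
constraints = [
    (1.0, 1.0, 6.0),    # Bean 1: 0.5x + 0.5y <= 3
    (1.0, 2.0, 8.0),    # Bean 3: 0.25x + 0.5y <= 2
    (1.0, 0.0, 5.0),    # Bean 2: 0.25x <= 1.25
    (-1.0, 0.0, 0.0),   # x >= 0
    (0.0, -1.0, 0.0),   # y >= 0
]


def profit(x, y):
    return 4 * x + 2 * y


def feasible(x, y, tol=1e-9):
    return all(a * x + b * y <= c + tol for a, b, c in constraints)


def extreme_points():
    """Intersect every pair of constraint lines and keep the feasible intersections."""
    points = []
    for (a1, b1, c1), (a2, b2, c2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < 1e-12:
            continue  # parallel lines never intersect
        x = (c1 * b2 - c2 * b1) / det
        y = (a1 * c2 - a2 * c1) / det
        if feasible(x, y):
            points.append((x, y))
    return points


best = max(extreme_points(), key=lambda p: profit(*p))
best_profit = profit(*best)
print("Optimal (x, y):", best, "Profit:", best_profit)
```

Enumeration is practical only for tiny problems like this one; it is shown here to illustrate the idea that an optimum lies at an extreme point of the feasible region.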

Example 1: A Simple Maximization Problem


Example 1: Final Formulation

Introduction to Linear Programming

Standard Form

o An LPP is said to be in Standard Form when


 All the decision variables are non-negative
 All the constraints are equalities.
o The standard form is required if we were to solve the LPP using the SIMPLEX Algorithm

Slack and Surplus Variables


o 2x + 3y ≤ 20 was rewritten as 2x + 3y + s1 = 20
o Where “s1” is a non-negative variable called a “slack” variable
o x + y ≥ 8 was rewritten as x + y – s2 = 8
o Where “s2” is a non-negative variable called a “surplus” variable
o Notice that in both cases, the new variables are non-negative!
o To insert these variables into the formulation, their coefficients in the objective function will be set
to 0

Converting to The Standard Form

The Standard Form

Constructing the Feasible Region Graphically

Example 1: Final Formulation


Constructing the Feasible Region~ x + y ≤ 6

Constructing the Feasible Region~ x + 2y ≤ 8

Constructing the Feasible Region~ x ≤ 5

Constructing the Feasible Region for the LPP


The Feasible Region

Solving an LPP Graphically

Example 1: Final Formulation

The Feasible Region (A Recap)

Inserting the lines 4x + 2y = k in the Feasible Region


The Optimal Solution

An Algebraic Approach to Finding the Optimal Solution

Impracticality of the Graphical Approach

o Suppose there are n decision variables and m constraints.


o The number of extreme points can be as high as (n + m)! / (n! m!)
o And this can be a huge number
o OR professionals use computer packages to solve LP problems
o We will use Excel Solver

Example 1: Final Formulation


Binding & Non-Binding Constraints

o A constraint is said to be Binding when


o LHS of the constraint = RHS of the constraint
o A binding constraint is one where the slack or surplus variable equals 0.
o A non-binding constraint has a non-zero slack / surplus variable
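
The slack in each constraint can be computed directly, as in the Python sketch below (Python is an assumption). It assumes the coffee-blend constraints x + y ≤ 6, x + 2y ≤ 8 and x ≤ 5 and takes (x, y) = (5, 1) as the optimal solution; both are derived from the problem data rather than quoted from the slides.

```python
# Each entry is (a, b, rhs), meaning a*x + b*y <= rhs
constraints = {
    "Bean 1: x + y <= 6": (1, 1, 6),
    "Bean 3: x + 2y <= 8": (1, 2, 8),
    "Bean 2: x <= 5": (1, 0, 5),
}

x, y = 5, 1  # assumed optimal solution of the coffee-blend problem

# Slack = RHS - LHS; a constraint is binding when its slack is 0
slack = {name: rhs - (a * x + b * y) for name, (a, b, rhs) in constraints.items()}
binding = {name: value == 0 for name, value in slack.items()}

for name in constraints:
    status = "binding" if binding[name] else "non-binding"
    print(f"{name}: slack = {slack[name]} ({status})")
```
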

Varying the RHS of a Constraint

Varying the Coefficients of the objective function


Varying the Coefficients of the objective function
The feasible region may be

– Bounded

– Unbounded

– May not exist

Even if the feasible region is unbounded, the LPP may have an optimal solution

An LPP may have:

– A unique optimal solution

– Alternative optimal solutions

– An unbounded solution

Alternative Optimal Solutions

Infeasibility
Unbounded Solution

Some special cases

Feasible Space: Bounded or Unbounded or Does not Exist

An LPP may have

• Unique solution

• Alternative solutions

• Unbounded solution

Algebraically identifying the optimum point

• Impracticality of these approaches

Lesson -14

A Simple Maximization Problem

Introduction to Sensitivity Analysis

o Post-Optimality Analysis measures the robustness of the solution – how sensitive the solution is to
changes in the input data
o This is important
o When the business environment is subject to change
o Because it is difficult to get precise data
o Questions we will examine are:
o If C is the objective coefficient for a decision variable, what is the Range of Optimality for C
over which the optimal solution does not change (although the optimal value of the
objective function may change).
o If the right-hand side of a constraint changes within the Range of Feasibility, how much will
the objective function change by (Shadow Price)
o If a decision variable is not in the solution (i.e. its value is 0), what is the change (Reduced
Cost) to be made to its coefficient in the objective function so that it enters the solution
o Excel Solver provides the relevant information

Sensitivity Analysis: Range of Optimality

Range of Optimality

The Range of Optimality for an Objective Function coefficient is the range of values this coefficient can
assume without changing the current solution.

A narrow Range of Optimality for a decision variable should be a cause of concern. Especially if the
objective coefficient is near the endpoints of the range.

Sensitivity Analysis: Range of Feasibility

Range of Feasibility and the Shadow price

o The Shadow Price of a Resource (Constraint) is the change in the optimal objective value for a unit
change in the Resource (Constraint)
o As the RHS increases / decreases, other constraints will become binding and limit the change in the
value of the objective function.
o Thus the Shadow Price is effective only in the range where the current binding constraints remain
binding and the non-binding constraints remain non-binding – The Range of Feasibility

o Note that there is a positive effect if the feasible space increases and a negative effect otherwise

o The shadow price for a nonbinding constraint is 0.

o The shadow price for a binding constraint will be non-zero.
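
Shadow prices can be approximated numerically by nudging one right-hand side at a time and re-solving, as in the Python sketch below (Python is an assumption; the notes read these values from the Excel Solver report). It reuses the coffee-blend constraints x + y ≤ 6, x + 2y ≤ 8 and x ≤ 5 with profit 4x + 2y. The function name `solve` and the step size are hypothetical, and the step is kept small so that the change stays inside the Range of Feasibility.

```python
from itertools import combinations


def solve(rhs):
    """Maximise 4x + 2y by enumerating extreme points; rhs holds the three RHS values."""
    lines = [
        (1.0, 1.0, rhs[0]),   # x + y <= rhs[0]
        (1.0, 2.0, rhs[1]),   # x + 2y <= rhs[1]
        (1.0, 0.0, rhs[2]),   # x <= rhs[2]
        (-1.0, 0.0, 0.0),     # x >= 0
        (0.0, -1.0, 0.0),     # y >= 0
    ]
    best = None
    for (a1, b1, c1), (a2, b2, c2) in combinations(lines, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < 1e-12:
            continue  # parallel lines
        x = (c1 * b2 - c2 * b1) / det
        y = (a1 * c2 - a2 * c1) / det
        if all(a * x + b * y <= c + 1e-9 for a, b, c in lines):
            value = 4 * x + 2 * y
            if best is None or value > best:
                best = value
    return best


base_rhs = [6.0, 8.0, 5.0]
base = solve(base_rhs)
step = 0.1  # small change in one RHS at a time

shadow_prices = []
for i in range(len(base_rhs)):
    bumped = list(base_rhs)
    bumped[i] += step
    shadow_prices.append((solve(bumped) - base) / step)

print("Optimal profit:", base)
print("Approximate shadow prices:", [round(p, 4) for p in shadow_prices])
```

The values are per unit of the right-hand sides as written in the scaled constraints, not per Kg of bean. The non-binding constraint x + 2y ≤ 8 comes out with a shadow price of 0, in line with the statement above.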

Sensitivity Analysis: Reduced Costs


o Suppose a decision variable X is not appearing in the optimal solution
o (The value of the decision variable is 0 in the optimal solution)
o The Reduced Cost for X is:
o The amount by which the coefficient of X in the objective function would have to improve (increase
for maximization problems, decrease for minimization problems) before this variable appears in
the solution.
o Reduced Cost = 0 means either:

• X is in the optimal solution, or

• X is not in the optimal solution and there are alternative optimal solutions

Quartiles Definition

Quartiles are a set of three values that divides a data set into four parts such that each part has an equal number of data
values.

Method 1

Step 1: First, arrange the data set in ascending order and calculate its median to divide the data set into two halves.

Step 2: Make a data set of each half and do not consider the median in either of the two parts.

Step 3: Now calculate the median of each lower and upper half of the sets. The median of the lower half of the set is called the
first quartile, and the median of the upper half of the set is called the third quartile.

Example: Consider the production of whole milk powder from 2008 to 2018 in the UK in millions.

15, 10, 16, 8, 7, 5, 14, 10, 9, 5, 12

Calculate the first and third quartiles as follows:

First, arrange the terms in ascending order.

5, 5, 7, 8, 9, 10, 10, 12, 14, 15, 16

The data set has 11 values. Hence, the median is as follows:


Method 2

Step 1: First, arrange the data set in ascending order and calculate its median to divide the data set into two halves.

Step 2: Make a data set of each half and consider the median in both the parts.

Now, calculate the median in both halves. The median of the lower half is called the first quartile, and that of the upper half is
called the third quartile.

Example: Consider the production of whole milk powder from 2008 to 2018 in UK in millions.

15, 10, 16, 8, 7, 5, 14, 10, 9, 5, 12

Calculate the quartiles as follows:

First, arrange the terms in ascending order.

5, 5, 7, 8, 9, 10, 10, 12, 14, 15, 16

The data set has 11 values. Hence, the median is calculated as follows

IQR = Q3 – Q1

1.5*IQR

Upper limit = Q3+(1.5*IQR)

Lower Limit = Q1- (1.5*IQR)
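
Both quartile methods and the outlier limits can be sketched in Python for the milk powder data (Python is an assumption). The function name `quartiles` and its `include_median` flag are hypothetical: `False` corresponds to Method 1 and `True` to Method 2.

```python
from statistics import median

data = [15, 10, 16, 8, 7, 5, 14, 10, 9, 5, 12]


def quartiles(values, include_median):
    """Return (Q1, median, Q3); include_median=False is Method 1, True is Method 2."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1 and include_median:
        # Method 2: the middle value belongs to both halves
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # Method 1 (or an even number of values): the halves exclude the middle value
        lower, upper = ordered[:half], ordered[n - half:]
    return median(lower), median(ordered), median(upper)


for label, flag in (("Method 1", False), ("Method 2", True)):
    q1, q2, q3 = quartiles(data, include_median=flag)
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    print(f"{label}: Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}, "
          f"limits = ({lower_limit}, {upper_limit})")
```

Values outside the lower and upper limits would be flagged as potential outliers; with either method, none of the 11 values in this data set falls outside them.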

The Standard Deviation as a Ruler – Example 2


The Feb “high” temperature averaged 30°C with variance 100 (°C)², while in May these were 40°C and 64 (°C)². When is it
more unusual to have a high of 35°C?
Z Scores
Zfeb = (35-30)/10 = 0.5 &
Zmay = (35-40)/8 = -5/8 = -0.625
February: 35°C is 0.5 σ’s from the mean
May: 35°C is 0.625 σ’s from the mean (Ignoring the negative sign)
Hence a high of 35°C is slightly more unusual in May.
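
The same comparison in Python (an assumption; the helper name `z_score` is hypothetical). Note that the standard deviation, not the variance, goes in the denominator.

```python
import math


def z_score(x, mean, variance):
    """Number of standard deviations by which x lies above (+) or below (-) the mean."""
    return (x - mean) / math.sqrt(variance)


z_feb = z_score(35, mean=30, variance=100)
z_may = z_score(35, mean=40, variance=64)

# The month with the larger |z| is the one where 35 degrees is more unusual
more_unusual = "February" if abs(z_feb) > abs(z_may) else "May"
print(z_feb, z_may, more_unusual)
```
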

Z Score & Outliers Example


Your team member says he has received a job offer of Rs.15 lakh in Hyderabad and would like to put in his papers.
He is a good resource but he is a little pushy.
Should you negotiate with him and try to increase his salary, or tell him you cannot match that offer and he should
put in his papers?
HR says that for his level in Hyd, μ = Rs.8 lakh and σ = Rs.2 lakh.
z score of the offer = (15 – 8) / 2 = 3.5
Is the employee bluffing?

Chebyshev’s Theorem
At least (1 – 1/z²) of the items in any data set will be within z standard deviations of the mean, where z is any value greater
than 1.
Chebyshev’s theorem requires z > 1, but z need not be an integer.
• At least 75% of the data values must be within 2 standard deviations from the mean
• At least 89% of the data values must be within 3 standard deviations from the mean
• At least 94% of the data values must be within 4 standard deviations from the mean
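
The three percentages above follow directly from the bound 1 – 1/z², as this short Python sketch shows (Python is an assumption; the function name `chebyshev_lower_bound` is hypothetical).

```python
def chebyshev_lower_bound(z):
    """Minimum proportion of data values within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z > 1")
    return 1 - 1 / z ** 2


for z in (2, 3, 4, 1.5):
    print(f"z = {z}: at least {chebyshev_lower_bound(z):.1%} of the data")
```

The last value, z = 1.5, illustrates that z need not be an integer.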

Empirical Rule & Outliers

Descriptive Statistics
Describing Central Tendency
• In addition to describing the shape of a distribution, we want to describe
the data set’s central tendency
• A measure of central tendency represents the center or middle of the data

What is Average?
• If ‘n’ observations need to be replaced by a single
observation, then the average is the most suitable number.
• It is a number around which all the observations lie.

Parameters and Statistics


• A population parameter is a number calculated from all the
population measurements that describes some aspect of the
population
• A sample statistic is a number calculated using the sample
measurements that describes some aspect of the sample

Measures of Central Tendency

The Mean

The Sample Mean


Relationships Among Mean, Median and Mode

Geometric Mean

Harmonic Mean

Dispersion

Quartile Deviation
Inter Quartile Range = (Q3 – Q1)
Quartile deviation = (Q3 – Q1)/2.
It is also known as semi-inter quartile range.

Coefficient of Quartile Deviation = (Q3 – Q1)/(Q3 + Q1)


Complement
• The complement (Ā) of an event A is the set of all
sample space outcomes not in A
• P(Ā) = 1 – P(A)

Union and Intersection


• The union of A and B consists of the elementary events that belong to either A or
B or both
• Written as A ∪ B
• The intersection of A and B consists of the elementary events that belong to
both A and B
• Written as A ∩ B

Birla Institute of Technology & Science, Pilani


Work-Integrated Learning Programmes Division
Second Semester 2021-2022

Comprehensive Examination
(EC-3 Regular)

Course No. : MBA ZC417


Course Title : QUANTITATIVE METHODS
Nature of Exam : Open Book
Weightage : 45% No. of Pages =6
Duration : 2 Hours
Date of Exam : Saturday, 21/05/2022 (AN) No. of Questions = 8
Note:
1. Please follow all the Instructions to Candidates given on the cover page of the answer book.
2. All parts of a question should be answered consecutively. Each answer should start from a fresh page.
3. Assumptions made if any, should be stated clearly at the beginning of your answer.

Q.1 Set. (A) Explain and Compare- a) Covariance and Correlation, b) Normal Distribution and Sampling
Distribution, and c) One-tail and Two-tail hypothesis tests. Do the comparison in a table with
columns and rows, that is- side-by-side comparison.
[9]

[Common instructions for all questions- Upload only hand-written material; only hand-written material
will be evaluated. 2. Do not type the answer in the space provided below the question in the exam portal.
3. Do not attach any screenshot or file of EXCEL/PDF/PPT/any software].

a)

Covariance:-

Covariance is a statistical measure that shows whether two variables are related by measuring
how the variables change in relation to each other. This is clear when you break down the word.
Co- as a prefix often indicates some sort of joint action (like co-workers, co-owners, coordinate)
and variance refers to variation or change. So, covariance measures how two things change
together. It tells you if there is a relationship between two things and which direction that
relationship is in.

Correlation:-

Correlation, like covariance, is a measure of how two variables change in relation to each other,
but it goes one step further than covariance in that correlation tells how strong the relationship is.

COVARIANCE VS. CORRELATION

 Both covariance and correlation measure the relationship and the dependency between
two variables.
 Covariance indicates the direction of the linear relationship between variables.
 Correlation measures both the strength and direction of the linear relationship between
two variables.
 Correlation values are standardized.
 Covariance values are not standardized

b).

Normal distribution:-

Normal distribution, also known as the Gaussian distribution, is a probability distribution that is
symmetric about the mean, showing that data near the mean are more frequent in occurrence
than data far from the mean. In graph form, normal distribution will appear as a bell curve.

sampling distribution:-

A sampling distribution is the probability distribution of a statistic obtained from a large number of
samples drawn from a specific population. It describes the range of values the statistic could take
across those samples and how frequently each value occurs.

Normal distribution VS sampling distribution

If the population is normally distributed, the sampling distribution of the sample mean will be normal. If the population
is not normally distributed, the sampling distribution of the sample mean will be
approximately normal when the samples taken are large (central limit theorem).

c).

one tailed hypothesis test:-

A one-tailed test results from an alternative hypothesis which specifies a direction. i.e. when the
alternative hypothesis states that the parameter is in fact either bigger or smaller than the value
specified in the null hypothesis.

two-tailed hypothesis test:-

A two-tailed hypothesis test is designed to show whether the sample mean is significantly greater
than or significantly less than the mean of a population. The two-tailed test gets its name from
testing the area under both tails (sides) of a normal distribution.

one-tailed vs two-tailed hypothesis test

If the alternative hypothesis specifies a direction, the test is one-tailed and the critical region lies in one tail.
However, if the alternative hypothesis is not directional, the test is a two-tailed test of the null hypothesis,
wherein the critical region is on both the tails.
...
Comparison Chart.
Basis of Comparison One-tailed Test Two-tailed Test
Sign in alternative hypothesis > or < ≠

Q.2 Set. (A) ThirdEyeCare NGO has provided spectacles at no-profit-no-loss basis to 60 workers who are
involved in precision jobs. A summary of the number of spectacles provided to the workers are
given in the table below

Profession Frequency
Jewellery making 14
Embroidery 8
Wood carving 10
Miniature painting 9
Stone carving 11
Watch restoration 8
Is there an evidence that the NGO was fair in distribution of spectacles? Use alpha=0.05. Assume the
proportion workers in the 6 professions are equal in the population. (Do this problem using formulas (no
Excel or any other software’s utilities). Clearly write the hypothesis, all formulas, all steps, and all
calculations. Underline the final result).
[6]

[Common instructions for all questions- Upload only hand-written material; only hand-written material
will be evaluated. 2. Do not type the answer in the space provided below the question in the exam portal.
3. Do not attach any screenshot or file of EXCEL/PDF/PPT/any software].

Step 1: H0 : the proportions of workers in the 6 professions are equal in the population.

Ha : the proportions of workers in the 6 professions are not all equal in the population.
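
The remaining steps of this chi-square goodness-of-fit test can be sketched as follows. The exam asks for a hand calculation, so this is only a way to check the arithmetic; the critical value 11.070 (alpha = 0.05, 5 degrees of freedom) is taken from a standard chi-square table and is an assumed input, not something computed here.

```python
observed = {
    "Jewellery making": 14,
    "Embroidery": 8,
    "Wood carving": 10,
    "Miniature painting": 9,
    "Stone carving": 11,
    "Watch restoration": 8,
}

total = sum(observed.values())            # 60 workers
expected = total / len(observed)          # equal proportions under H0
chi_square = sum((o - expected) ** 2 / expected for o in observed.values())
df = len(observed) - 1

CRITICAL_VALUE = 11.070  # chi-square table, alpha = 0.05, df = 5 (assumed input)
reject_h0 = chi_square > CRITICAL_VALUE

print("Expected per profession:", expected)
print("Chi-square statistic:", chi_square, "df:", df)
print("Reject H0:", reject_h0)
```

Under these assumptions the statistic is well below the critical value, so H0 is not rejected: the data do not show evidence of unfair distribution.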
Q.3 Set. (A) StartUp Storage Co. has launched a new model of mobile battery in the market. Its
advertisement claims that the average life of the new model is 600 minutes under standard
operating conditions.

StartUp’s new model performance has surprised the mobile battery industry. The R&D
department of MoreLife, the largest manufacturer of mobile phone batteries, purchased 10
batteries manufactured by StartUp and tested them in its lab under standard operating
conditions. The results of the tests are given below-
Life (minutes)
630
620
650
620
600
590
640
590
580
630
Count= 10
Sum= 6150
Sample variance= 561.11

Test the claim made by StartUp’s advertisement. Use alpha =0.05. (Do this problem using formulas (no
Excel or any other software’s utilities). Clearly write the hypothesis, all formulas, all steps, and all
calculations. Underline the final result on the answer sheet).
[7]

[Common instructions for all questions- Upload only hand-written material; only hand-written material
will be evaluated. 2. Do not type the answer in the space provided below the question in the exam portal.
3. Do not attach any screenshot or file of EXCEL/PDF/PPT/any software].
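
A minimal sketch of a one-sample t test for Q.3 is given below, only as a check on the hand calculation. It assumes a two-tailed test (H0: mean = 600, Ha: mean ≠ 600) and uses the t-table value 2.262 for alpha = 0.05 with 9 degrees of freedom as an assumed input.

```python
from math import sqrt
from statistics import mean, variance

life = [630, 620, 650, 620, 600, 590, 640, 590, 580, 630]
mu0 = 600                      # claimed average life (minutes)

n = len(life)
x_bar = mean(life)
s2 = variance(life)            # sample variance (divides by n - 1)
std_error = sqrt(s2 / n)
t_stat = (x_bar - mu0) / std_error

T_CRITICAL = 2.262             # t table, two-tailed, alpha = 0.05, df = 9 (assumed input)
reject_h0 = abs(t_stat) > T_CRITICAL

print("Mean:", x_bar, "Sample variance:", round(s2, 2))
print("t statistic:", round(t_stat, 4), "Reject H0:", reject_h0)
```

Under these assumptions the t statistic is about 2.00, which is less than 2.262, so the claim of 600 minutes is not rejected.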
Q.4 Set. (A) WeTrainWell Consultants has imparted training to 50 randomly selected production workers
of a packaging material manufacturer. Before proceeding to train the remaining 950 workers,
the manufacturer would like to know whether the training by WeTrainWell changes productivity.

The productivity of 6 randomly selected workers before they underwent training and another 6
workers who underwent training is given in the table below-
Before After
40 50
35 40
35 55
45 50
40 35
45 70
Sum 240 300
Sample Stdev 4.47 12.25

Should the manufacturer ask WeTrainWell to train the remaining 950 workers? Use
alpha=0.05. Assume equal variance. (Do this problem using formulas (no Excel or any other
software’s utilities). Clearly write the hypothesis, all formulas, all steps, and all calculations.
Underline the final result).
[7]

[Common instructions for all questions- Upload only hand-written material; only hand-written material
will be evaluated. 2. Do not type the answer in the space provided below the question in the exam portal.
3. Do not attach any screenshot or file of EXCEL/PDF/PPT/any software].
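
For Q.4, a pooled-variance two-sample t test can be sketched as below, as a check on the hand calculation. It assumes two independent samples (different workers before and after), a two-tailed test of equal means, and the t-table value 2.228 for alpha = 0.05 with 10 degrees of freedom as an assumed input.

```python
from math import sqrt
from statistics import mean, variance

before = [40, 35, 35, 45, 40, 45]
after = [50, 40, 55, 50, 35, 70]

n1, n2 = len(before), len(after)
df = n1 + n2 - 2

# Pooled variance, valid under the equal-variance assumption stated in the question
pooled_var = ((n1 - 1) * variance(before) + (n2 - 1) * variance(after)) / df
std_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
t_stat = (mean(before) - mean(after)) / std_error

T_CRITICAL = 2.228   # t table, two-tailed, alpha = 0.05, df = 10 (assumed input)
reject_h0 = abs(t_stat) > T_CRITICAL

print("Pooled variance:", pooled_var, "df:", df)
print("t statistic:", round(t_stat, 4), "Reject H0:", reject_h0)
```

Under these assumptions |t| is about 1.88, below 2.228, so this sample does not show a significant change in productivity.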
Q.5 Set. (A) SpendMore, a credit card company would like to know whether there is a relationship between
the age of the customers and their spending. The results of 5 randomly selected customers are
given in the table below- [8]

Age Spending
2 3
3 5
4 6
5 6
8 7

(a) What is the Covariance and Coefficient of Correlation between Age and Spending?
(b) What is the Covariance and Coefficient of Correlation between Spending and Age?
(c) What is the Slope and the Intercept of Simple Linear Regression equation considering Age as
an independent variable (X).
(d) Draw a neat (approximate is ok) scatter chart with the regression line on it.

(Do this problem using formulas--- no Excel function/utility, no utility of any other software. Clearly
write all formulas, all steps, and all calculations. Underline the final result).

[Common instructions for all questions- Upload only hand-written material; only hand-written
material will be evaluated. 2. Do not type the answer in the space provided below the question in the
exam portal. 3. Do not attach any screenshot or file of EXCEL/PDF/PPT/any software].
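
The quantities asked for in Q.5 can be checked with the sketch below. It assumes the sample covariance (dividing by n – 1); if the population covariance is wanted, divide by n instead. Covariance and correlation are symmetric, so parts (a) and (b) give the same values.

```python
from math import sqrt
from statistics import mean

age = [2, 3, 4, 5, 8]          # X, independent variable
spending = [3, 5, 6, 6, 7]     # Y, dependent variable

n = len(age)
x_bar, y_bar = mean(age), mean(spending)

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, spending))
sxx = sum((x - x_bar) ** 2 for x in age)
syy = sum((y - y_bar) ** 2 for y in spending)

covariance = sxy / (n - 1)          # sample covariance (assumption)
correlation = sxy / sqrt(sxx * syy)
slope = sxy / sxx                   # b1 in Y = b0 + b1 * X
intercept = y_bar - slope * x_bar   # b0

print("Covariance:", round(covariance, 4))
print("Correlation:", round(correlation, 4))
print("Slope:", round(slope, 4), "Intercept:", round(intercept, 4))
```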
Q.6 Set. (A) OnlyForMen Garments Co. produces three designs of men’s shirts- Fancy, Office, and Casual.
The material required to produce a Fancy shirt is 2m, an Office shirt is 2.5m, and a Casual shirt
is 1.25m. The manpower required to produce a Fancy shirt is 3 hours, an Office shirt is 2 hours,
and a Casual shirt is 1 hour.

In the meeting held for planning production quantities for the next month, the production manager
informed that a maximum of 3000 hours of manpower will be available, and the purchase manager
informed that a maximum of 5000 m of material will be available. The marketing department
reminded that a minimum of 900 nos. of Office shirts and a minimum of 500 nos. of Casual shirts
must be produced to meet prior commitments, and the demand for Fancy shirts will not exceed 1200
shirts and that of Casual shirts will not exceed 600 shirts. The marketing manager also informed that
the selling prices will remain same in the next month- Rs 1,500 for a Fancy shirt, Rs 1,200 for an
Office shirt and Rs 800 for a Casual shirt.

Write a set of linear programming equations to determine the number of Fancy, Office, and Casual
shirts to be produced with an aim to maximize revenue. [8]

[Common instructions for all questions- Upload only hand-written material; only hand-written material
will be evaluated. 2. Do not type the answer in the space provided below the question in the exam portal.
3. Do not attach any screenshot or file of EXCEL/PDF/PPT/any software].
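
One way to read the Q.6 statement as a linear programme is sketched below. The variable and function names are hypothetical, and the code only evaluates the objective and checks the constraints for a candidate production plan; it does not solve the programme.

```python
# Decision variables: fancy, office, casual = number of shirts of each design.
# Objective: maximise 1500*fancy + 1200*office + 800*casual (revenue in Rs).
PRICE = {"fancy": 1500, "office": 1200, "casual": 800}

def revenue(fancy, office, casual):
    """Objective function value for a production plan."""
    return PRICE["fancy"] * fancy + PRICE["office"] * office + PRICE["casual"] * casual

def is_feasible(fancy, office, casual):
    """Check every constraint stated in the question."""
    return (
        2 * fancy + 2.5 * office + 1.25 * casual <= 5000   # material (m)
        and 3 * fancy + 2 * office + 1 * casual <= 3000    # manpower (hours)
        and office >= 900                                  # prior commitment
        and 500 <= casual <= 600                           # commitment and demand cap
        and 0 <= fancy <= 1200                             # demand cap, non-negativity
    )

plan = (0, 900, 500)   # an arbitrary candidate plan, for illustration only
print(plan, is_feasible(*plan), revenue(*plan))
```

The candidate plan is only an example of a feasible point; finding the revenue-maximising plan needs a solver or the usual hand methods.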
1. Explain important similarities and differences between the following; give an example and make charts
a. Normal distribution and t distribution
b. Box plot and histogram
c.

Comparison Between Normal and t Distribution:


Both normal and t distributions are used in statistical analyses for normally distributed data. The t curve
is relatively flatter than the normal curve, with heavier tails.

Answer and Explanation:


1. The t-distribution has a larger variance than the standard normal
distribution.

• The standard normal distribution is used when the population standard deviation is known, or when the sample
size is sufficiently large. If the population standard deviation is unknown and the sample size is
small, Student's t distribution is used. Its heavier tails reflect the extra variability that comes from estimating the standard deviation.
The t distribution depends on its degrees of freedom, while the normal distribution does not.

2.
Q) The average time spent by a guest at QM resort is 12 days with a variance of 4. What is the probability
that a guest selected at random will stay less than 8 days, between 7 and 10 days, and more than 13
days?
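
The question does not state the distribution of the stay. Assuming it is Normal with mean 12 and standard deviation √4 = 2, the three probabilities can be sketched with Python's `statistics.NormalDist`:

```python
from math import sqrt
from statistics import NormalDist

# Assumption: length of stay is Normally distributed
stay = NormalDist(mu=12, sigma=sqrt(4))

p_less_than_8 = stay.cdf(8)                    # z = -2
p_between_7_and_10 = stay.cdf(10) - stay.cdf(7)  # z from -2.5 to -1
p_more_than_13 = 1 - stay.cdf(13)              # z = 0.5

print(round(p_less_than_8, 4))
print(round(p_between_7_and_10, 4))
print(round(p_more_than_13, 4))
```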
Q. The probability that a candidate gets selected for Commando training is 2%. What is the
probability that from a group of 3 friends, a) 2 friends get selected, and b) all 3 friends get selected?
Also, make a probability tree and show every relevant detail on the tree.
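
Assuming the three friends are selected independently, each with probability 0.02, the number selected follows a binomial distribution. A short sketch (the function name is hypothetical):

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p_two_selected = binomial_pmf(2, 3, 0.02)     # 3 * 0.02^2 * 0.98
p_three_selected = binomial_pmf(3, 3, 0.02)   # 0.02^3

print(p_two_selected, p_three_selected)
```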
Q. QM Mobiles, a manufacturer of mobile handsets, gets batteries from MNC Batteries Ltd.
MNC Batteries has two manufacturing plants, located in South Korea and Japan. Past
records show that 2% of the batteries supplied by the South Korean plant are defective
and that 3% of the batteries supplied by the Japanese plant are defective. MNC supplies
80% of the requirements of mobile manufacturers from its South Korean plant and
remaining 20% from its Japanese plant.

(A) If a battery selected by QM Mobiles at random turns out to be good quality, what is the
probability that it was supplied from the Japanese plant?
(B) If a battery selected by QM Mobiles at random turns out to be of defective, what is the
probability that it was from the Korean plant?
(C) Draw a neat probability tree(s)/network(s) and show all probabilities and conditional
probabilities.

ANSWER:

(A)

Probability that a battery selected is of good quality =

probability that the battery is from the Japanese plant and good + probability that the battery is from the South Korean plant
and good

= 0.2*0.97 + 0.8*0.98

= 0.978

Now,
Probability that a good-quality battery was supplied from the Japanese plant =

probability that the battery is from the Japanese plant and good / probability that a battery selected is of good
quality

= 0.2*0.97/0.978

= 0.198

Probability that a good-quality battery was supplied from the Japanese plant =
0.198

(B)

Probability that a battery selected is defective =

probability that the battery is from the Japanese plant and defective + probability that the battery is from the South Korean plant
and defective

= 0.2*0.03 + 0.8*0.02

= 0.022

Now,

Probability that a defective battery was supplied from the Korean plant =

probability that the battery is from the Korean plant and defective / probability that a battery selected is
defective

= 0.8*0.02/0.022

= 0.727

Probability that a defective battery was supplied from the Korean plant =
0.727

(C)

Tree Diagram:
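
The branches of the probability tree and the two posterior probabilities can be checked with a small Bayes' theorem sketch. The dictionary names are hypothetical; the inputs are the supply shares (80% and 20%) and defect rates (2% and 3%) from the question.

```python
prior = {"South Korea": 0.80, "Japan": 0.20}          # share of supply
p_defective = {"South Korea": 0.02, "Japan": 0.03}    # P(defective | plant)

# Joint probabilities: one leaf of the tree per (plant, quality) pair
joint = {}
for plant in prior:
    joint[(plant, "defective")] = prior[plant] * p_defective[plant]
    joint[(plant, "good")] = prior[plant] * (1 - p_defective[plant])

p_good = sum(p for (plant, q), p in joint.items() if q == "good")
p_def = sum(p for (plant, q), p in joint.items() if q == "defective")

# Bayes' theorem: posterior = joint / marginal
p_japan_given_good = joint[("Japan", "good")] / p_good
p_korea_given_defective = joint[("South Korea", "defective")] / p_def

for leaf, p in joint.items():
    print(leaf, round(p, 4))
print("P(Japan | good):", round(p_japan_given_good, 3))
print("P(South Korea | defective):", round(p_korea_given_defective, 3))
```

Note that the numerator for part (B) uses the South Korean share 0.8, giving 0.016/0.022 ≈ 0.727.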
Q. A manufacturer of washing machines plans to introduce a new entry-level model to boost
its sales. The company has hired two independent market research firms, MRA and MRB,
to estimate the proportion of households that currently use washing machines. MRA
conducted a survey of 400 households and found that 20% of the households currently
use washing machines. MRB conducted a survey of 600 households and found that 18%
of the households currently use washing machines.

A) Develop 95% confidence intervals for the two surveys

B) Explain the reason for the difference between the two surveys

Ans :-

A)

The calculation uses the survey results given in the question (MRA: n = 400, sample proportion = 0.20;
MRB: n = 600, sample proportion = 0.18).


B)

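The 95% confidence intervals for the two surveys differ because the surveys were conducted independently on
different random samples with different sample sizes, so the sample proportions (20% and 18%) differ through
sampling variation. The larger MRB sample gives the narrower interval.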
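
The two intervals for part A can be sketched with the usual large-sample formula p ± z·√(p(1 – p)/n). This is an assumed method (the normal approximation for a proportion), and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(p_hat, n, confidence=0.95):
    """Large-sample (normal approximation) confidence interval for a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95%
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

mra_low, mra_high = proportion_ci(0.20, 400)
mrb_low, mrb_high = proportion_ci(0.18, 600)

print("MRA:", round(mra_low, 4), "to", round(mra_high, 4))
print("MRB:", round(mrb_low, 4), "to", round(mrb_high, 4))
```

The two intervals overlap, which is consistent with both surveys estimating the same population proportion.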

Q. The details of the patients treated by a cough or cold clinic are given in the table below. Make
neat contingency tables and show on the tables the joint probabilities, marginal probabilities
and conditional probabilities.
Q. The Insurance Regulatory Authority regularly conducts surveys on the coverage of
insurance. Latest survey indicates that 30% of population has medical insurance.
Suppose a random sample of 3 persons is selected.

a. Draw a neat and complete probability tree for this problem. Show all possible
outcomes and their probabilities.
b. What is the probability that only one person has medical insurance coverage? Show
all calculations.

c. What is the probability that exactly two persons have medical insurance coverage?
Show all calculations.
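
The probability tree for this question has 2 × 2 × 2 = 8 leaves. Assuming the three persons are independent, the leaves and the requested probabilities can be enumerated as in this sketch (names are hypothetical):

```python
from itertools import product

P_INSURED = 0.30

# Each leaf is a string such as "INI": I = insured, N = not insured
leaves = {}
for outcome in product("IN", repeat=3):
    prob = 1.0
    for person in outcome:
        prob *= P_INSURED if person == "I" else 1 - P_INSURED
    leaves["".join(outcome)] = prob

def p_exactly(k):
    """Probability that exactly k of the 3 persons are insured."""
    return sum(p for leaf, p in leaves.items() if leaf.count("I") == k)

for leaf, p in leaves.items():
    print(leaf, round(p, 3))
print("Exactly one insured:", round(p_exactly(1), 3))
print("Exactly two insured:", round(p_exactly(2), 3))
```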

Birla Institute of Technology & Science, Pilani


Work-Integrated Learning Programmes Division
Second Semester 2021-2022

Mid-Semester Test
(EC-2 Regular)
Course No. : MBA ZC417
Course Title : QUANTITATIVE METHODS
Nature of Exam : Open Book
Weightage : 35% No. of Pages =4
Duration : 2 Hours
Date of Exam : Saturday, 12/03/2022 (AN) No. of Questions = 5
Note:
1. Please follow all the Instructions to Candidates given on the cover page of the answer book.
2. All parts of a question should be answered consecutively. Each answer should start from a fresh page.
3. Assumptions made if any, should be stated clearly at the beginning of your answer.

Q.1 The table below gives 8 summary measures of the duration of stay of 60 patients who were discharged by
MaxiMax Hospital in the last week. [7]

S No. Summary measure Days


1 Minimum 1
2 Maximum 16
3 Quartile-1 3
4 Quartile-3 8
5 Mean 6
6 Median 5
7 Standard deviation 3.9
8 Skewness 1.0

(a) Draw a hand-drawn (approximate is ok), neat, and well-labeled Boxplot chart.
(b) Discuss two or more insights drawn from the Boxplot chart (give precise answers, preferably in points).
(c) Is the maximum number of days of stay an outlier?
(d) Comment on the importance of 3.9 days value, given in the table, to the hospital administrator (give precise
answers, preferably in points).
ANSWER
(a): A box plot, also known as a box-and-whisker diagram, presents the five summary measures of the data: minimum,
first quartile, median, third quartile and maximum.
It contains a box whose bottom side represents the first quartile and whose top side represents the third quartile. The desired
box plot is shown below:

(b):

As the maximum number of days of stay in hospital is 16 days, all patients stayed 16 days or fewer in
hospital.

The values of the 1st and 3rd quartiles are 3 and 8 respectively, which means about 25% of patients stayed less than 3
days and about 75% of patients stayed less than 8 days.
The median is 5, which means about 50% of patients stayed less than 5 days.

(c):

In the box plot, the outlier limits are Q1 - 1.5*IQR and Q3 + 1.5*IQR, i.e. any value that lies below (Q1 - 1.5*IQR) or
above (Q3 + 1.5*IQR) is called an outlier, where Q1 and Q3 are the 1st and 3rd quartiles respectively and IQR is the
Interquartile Range, which is the difference between the 3rd and 1st quartiles, i.e. IQR = Q3 - Q1.

As the value of 1st and 3rd quartiles is 3 and 8 respectively thus the value of Interquartile Range(IQR) is;

IQR = Q3 - Q1

= 8-3 = 5.

Thus, the outlier limits are:

(Q1 - 1.5*IQR) = 3 - 1.5* 5 = 3 - 7.5 = - 4.5 and

(Q3 + 1.5*IQR) = 8 + 1.5* 5 = 8 + 7.5 = 15.5

Therefore, the outlier limits are -4.5 and 15.5.

As the maximum value, 16 lies outside the upper outlier limit, 15.5 thus it is an outlier.
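
The outlier check in part (c) can be written as a short sketch using only the summary measures from the table (variable names are hypothetical):

```python
q1, q3 = 3, 8            # Quartile-1 and Quartile-3 from the table (days)
minimum, maximum = 1, 16

iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

max_is_outlier = maximum > upper_limit
min_is_outlier = minimum < lower_limit

print("IQR:", iqr, "Limits:", lower_limit, upper_limit)
print("Maximum is an outlier:", max_is_outlier)
print("Minimum is an outlier:", min_is_outlier)
```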

(d):

The sample standard deviation, 's', indicates how far the data points typically are from the mean value (average value) in the
sample.
The value of the standard deviation is 3.9 days. It means the duration of stay of a patient typically differs by about 3.9 days from the
average stay of 6 days (the mean value). It gives an approximate idea about the number of occupied and
empty beds in the hospital next week.

(Do only by hand; do not type the answers. Draw a box over the final numerical answer. Wherever applicable,
show every step, every formula, and every calculation. If MS Excel is used, then mention 100% correct Excel
function. Screenshot of any software is not acceptable).

Q.2 ABC Retail sells Apples, Bananas and Chocolates in 50 stores in a metro city. A representative data of the
purchases made by 20 customers is given in the table below- [7]
Customer No. Apples Bananas Chocolates
Customer #1 Yes Yes Yes
Customer #2 No No Yes
Customer #3 Yes No No
Customer #4 No Yes Yes
Customer #5 Yes Yes No
Customer #6 No Yes No
Customer #7 Yes Yes No
Customer #8 Yes No No
Customer #9 Yes No No
Customer #10 No Yes Yes
Customer #11 No Yes Yes
Customer #12 Yes Yes Yes
Customer #13 Yes No No
Customer #14 No Yes No
Customer #15 No Yes No
Customer #16 Yes Yes Yes
Customer #17 Yes No No
Customer #18 Yes No No
Customer #19 No Yes Yes
Customer #20 Yes No Yes

(a) Make a table for joint and conditional probabilities for the sale of Apples and Bananas.
(b) Make a table for conditional probability for the sale of Apples given that a customer has purchased or
not purchased Bananas.
(c) What is the probability that a customer will buy Apples and Bananas given that he has purchased
Chocolates.

ANSWER
(Do only by hand; do not type the answers. Draw a box over the final numerical answer. Wherever applicable,
show every step, every formula, and every calculation. If MS Excel is used, then mention 100% correct Excel
function. Screenshot of any software is not acceptable).
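
The counts behind the joint and conditional probability tables in Q.2 can be tallied as in the sketch below. Each string encodes one customer from the table as Apples, Bananas, Chocolates (Y = Yes, N = No); exact fractions are used so the results can be compared with a hand calculation.

```python
from collections import Counter
from fractions import Fraction

# Customers #1 to #20 in table order: (Apples, Bananas, Chocolates)
purchases = [
    "YYY", "NNY", "YNN", "NYY", "YYN", "NYN", "YYN", "YNN", "YNN", "NYY",
    "NYY", "YYY", "YNN", "NYN", "NYN", "YYY", "YNN", "YNN", "NYY", "YNY",
]
n = len(purchases)

# (a) Joint probabilities P(Apples = a, Bananas = b)
joint_counts = Counter((row[0], row[1]) for row in purchases)
joint = {key: Fraction(count, n) for key, count in joint_counts.items()}

# Marginal probabilities for Bananas
p_banana = {b: sum(p for (a2, b2), p in joint.items() if b2 == b) for b in "YN"}

# (b) P(Apples = Yes | Bananas = b)
p_apple_given_banana = {b: joint[("Y", b)] / p_banana[b] for b in "YN"}

# (c) P(Apples and Bananas | Chocolates)
chocolate_buyers = [row for row in purchases if row[2] == "Y"]
both = sum(row[:2] == "YY" for row in chocolate_buyers)
p_both_given_chocolate = Fraction(both, len(chocolate_buyers))

print(joint)
print(p_apple_given_banana)
print(p_both_given_chocolate)
```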
Q.3 FireFox General Insurance Company provides insurance services to oil exploration, processing, and
distribution firms. OilInMotion- an oil distribution company- has insured three newly acquired oil tankers
(named- Ace, King and Jack). Past records show the probability that an oil tanker catches fire in a year is 0.05.
Assume incidents of fire are independent.
[7]

(a) What is the probability that only one of three tankers will catch fire in a year.

(b) What is the probability that King and Ace will catch fire in a year, and

(c) What is the probability that FireFox gets insurance claim for King or Jack.

Ace: P(fire) = 0.05, P(no fire) = 0.95
King: P(fire) = 0.05, P(no fire) = 0.95
Jack: P(fire) = 0.05, P(no fire) = 0.95

(a) P(only one of the three tankers catches fire) = 3 * 0.05 * 0.95 * 0.95 = 0.1354
(For comparison: P(no tanker catches fire) = 0.95^3 = 0.8574, so P(at least one tanker catches fire) = 1 – 0.8574 = 0.1426.)
(b) P(King and Ace catch fire) = P(K)*P(A) = 0.05 * 0.05 = 0.0025
(c) P(claim for King or Jack) = P(K) + P(J) – P(K & J) = 0.05 + 0.05 – 0.0025 = 0.0975

Show all the calculations/tables/diagrams, etc.

(Do only by hand; do not type the answers. Draw a box over the final numerical answer. Wherever applicable,
show every step, every formula, and every calculation. If MS Excel is used, then mention 100% correct Excel
function. Screenshot of any software is not acceptable).
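
For Q.3, the difference between "only one tanker" and "at least one tanker" is easy to mix up. The sketch below computes both, using the independence assumption stated in the question (variable names are hypothetical):

```python
p = 0.05   # probability that a given tanker catches fire in a year

p_none = (1 - p) ** 3
p_exactly_one = 3 * p * (1 - p) ** 2    # (a) only one of the three tankers
p_at_least_one = 1 - p_none             # not the same event as "only one"

p_king_and_ace = p * p                  # (b) independence: multiply
p_king_or_jack = p + p - p * p          # (c) addition rule

print("Exactly one:", round(p_exactly_one, 4))
print("At least one:", round(p_at_least_one, 4))
print("King and Ace:", round(p_king_and_ace, 4))
print("King or Jack:", round(p_king_or_jack, 4))
```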

Q.4 The amount of powdered spice filled by FillAndChill, an automatic machine, differs from one packet to
another. The mean amount of chili powder filled by the machine is 160g with standard deviation of 2g. Assume
the amount of chili powder filled by the machine is Normal distributed.
[7]

(a) What is the probability that a packet chosen at random will weigh between 158 to 163g,

(b) What is the probability that the weight of a randomly chosen packet is more than 155g, and

(c) If the packets that weigh less than 158g are sold for Rs 80 and other packets for Rs 100, what is the
expected value of revenue/packet.
(Do only by hand; do not type the answers. Draw a box over the final numerical answer. Wherever applicable,
show every step, every formula, and every calculation. If MS Excel is used, then mention 100% correct Excel
function. Screenshot of any software is not acceptable).
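
The three parts of Q.4 can be checked with `statistics.NormalDist`, using the mean of 160g and standard deviation of 2g given in the question. This is a sketch for verifying a hand calculation done with z tables.

```python
from statistics import NormalDist

fill = NormalDist(mu=160, sigma=2)

# (a) P(158 < X < 163), i.e. z from -1 to 1.5
p_between = fill.cdf(163) - fill.cdf(158)

# (b) P(X > 155), i.e. z > -2.5
p_more_than_155 = 1 - fill.cdf(155)

# (c) Expected revenue per packet: Rs 80 if under 158g, Rs 100 otherwise
p_under_158 = fill.cdf(158)
expected_revenue = 80 * p_under_158 + 100 * (1 - p_under_158)

print(round(p_between, 4), round(p_more_than_155, 4), round(expected_revenue, 2))
```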
Q.5 The Aviation Regulatory Authority (ARA) plans to make it mandatory for the commercial airlines to publish on
their websites the average delay of their flights.

The delays of a randomly selected sample of 16 flights of flight no. F-16 and 12 flights of flight no. AN-12 are
given in the Table-1, and Table-2 gives 11 summary measures of the data. [7]

Table-1: Delays in minutes-


S No. F-16 AN-12
1 47 23
2 33 45
3 10 18
4 14 30
5 10 12
6 20 5
7 17 7
8 13 60
9 15 75
10 12 43
11 32 23
12 43 8
13 24
14 67
15 22
16 18
Table-2: Summary measures
Count 16 12
Sum 397 349
Mean 24.8 29.1
Median 19 23
Mode 10 23.00
Sample Stdev 15.95 22.39
Population Stdev 15.44 21.43
Skewness 1.49 0.90
Kurtosis 1.98 -0.06
Q1 13.75 11
Q3 32.25 43.5

(a) What is the point estimate and interval estimate of the mean delay of flight no. F-16 and flight
no. AN-12, for 95% Confidence Level?
(b) Why is interval estimate preferred over the point estimate (give precise answer)?
(c) What do Skewness values of F-16 and AN-12 given in Table-2 indicate (give precise answers)?
(Do only by hand; do not type the answers. Draw a box over the final numerical answer. Wherever applicable,
show every step, every formula, and every calculation. If MS Excel is used, then mention 100% correct Excel
function. Screenshot of any software is not acceptable).
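
For part (a) of Q.5, the point estimate is the sample mean and a t-based interval is a natural choice because the population standard deviation is unknown and the samples are small. The sketch below assumes that approach; the critical values 2.131 (df = 15) and 2.201 (df = 11) are taken from a standard t table and are assumed inputs.

```python
from math import sqrt
from statistics import mean, stdev

f16 = [47, 33, 10, 14, 10, 20, 17, 13, 15, 12, 32, 43, 24, 67, 22, 18]
an12 = [23, 45, 18, 30, 12, 5, 7, 60, 75, 43, 23, 8]

# Two-tailed 95% critical values from a t table, keyed by degrees of freedom
T_CRITICAL = {15: 2.131, 11: 2.201}

def mean_ci(delays):
    """Return (point estimate, lower limit, upper limit) for the mean delay."""
    n = len(delays)
    x_bar = mean(delays)
    margin = T_CRITICAL[n - 1] * stdev(delays) / sqrt(n)
    return x_bar, x_bar - margin, x_bar + margin

f16_mean, f16_low, f16_high = mean_ci(f16)
an12_mean, an12_low, an12_high = mean_ci(an12)

print("F-16 :", round(f16_mean, 2), round(f16_low, 2), round(f16_high, 2))
print("AN-12:", round(an12_mean, 2), round(an12_low, 2), round(an12_high, 2))
```

The wider interval for AN-12 reflects both its smaller sample and its larger sample standard deviation.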