Data Visualization & Analytics for Decision Making (2) (1)

Uploaded by

ankit
Copyright
© © All Rights Reserved

1

Data Visualization & Analytics for Decision Making

PGCBM 43

Abhishek Chakraborty
[email protected]
2

Course Outline

Descriptive Statistics and Data Representation (3 sessions)

Introduction to Probability (3 sessions)

Sampling and Sampling Distributions (1 session)

Interval Estimation (1 session)

Hypothesis Testing (2 sessions)

Correlation and Regression (2 sessions)


3

Evaluation

1 Quiz: 20 Marks
1 Group Project: 30 Marks
End Term: 50 Marks
4

Software Tools

MICROSOFT EXCEL
R
5

Descriptive Statistics and Data Representation
6

Types of Measurement Scale

Nominal

Ordinal

Interval

Ratio
Nominal Scale
• Used for classification or categorization
• The numbers are used only to differentiate entities, not to make any statement about their values
• For instance, we may ask for an individual's profession, where the options include IT professional, finance professional, lawyer, doctor, educator, and others
• A special case of a nominal variable is the binary variable, which can take only two values

7
Ordinal Scale
• The ordinal scale is used to rank or order entities
• For example, during appraisal, an employee may be graded on a scale of one to five
• The distances between consecutive numbers have no meaning
• Due to the imprecise nature of measurement, both nominal and ordinal data are referred to as non-metric data or qualitative data
• Both nominal and ordinal variables take values from a fixed set

8
Interval Scale
• The distances between consecutive numbers have a meaning and the data is always numerical
• The zero point is a matter of convention or convenience and is not fixed
• Examples include the Celsius scale, GMAT score, and pH value

9
Ratio Scale
• The zero point is fixed and represents the absence of the quantity being studied
• Both interval and ratio scales are referred to as metric data or quantitative data
• Examples include height, weight, total monthly sales, etc.

10
11

Types of Measurement Scale


Scale      Label   Ordering   Differences between measurements   Ratio between measurements / True zero
Nominal    ✓       ✗          ✗                                  ✗
Ordinal    ✓       ✓          ✗                                  ✗
Interval   ✓       ✓          ✓                                  ✗
Ratio      ✓       ✓          ✓                                  ✓
12

Data Summary and Representation for Categorical Variables (Nominal or Ordinal)
13

Frequency Distribution
• Raw data is sometimes referred to as
ungrouped data
• We need to organize the ungrouped
data into grouped data through
frequency distribution
• For instance, consider the NIFTY 50
stocks and their sectors
14
Frequency Distribution
Sectors                  Frequency   Relative Frequency   Percentage Frequency
Consumer Goods           6           0.12                 12.00%
Banking                  6           0.12                 12.00%
Automobile               6           0.12                 12.00%
Information Technology   5           0.10                 10.00%
Financial Services       5           0.10                 10.00%
Pharmaceuticals          4           0.08                 8.00%
Metals                   4           0.08                 8.00%
Energy - Oil & Gas       3           0.06                 6.00%
Cement                   3           0.06                 6.00%
Energy - Power           2           0.04                 4.00%
Telecommunication        1           0.02                 2.00%
Infrastructure           1           0.02                 2.00%
Healthcare               1           0.02                 2.00%
Consumer Durables        1           0.02                 2.00%
Construction             1           0.02                 2.00%
Chemicals                1           0.02                 2.00%
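The relative and percentage frequencies in the table can be reproduced from the raw counts. The following is a minimal Python sketch, not part of the original slides (the course itself uses Excel and R); the variable names are illustrative assumptions.

```python
# Sector counts copied from the NIFTY 50 frequency table.
sector_counts = {
    "Consumer Goods": 6, "Banking": 6, "Automobile": 6,
    "Information Technology": 5, "Financial Services": 5,
    "Pharmaceuticals": 4, "Metals": 4,
    "Energy - Oil & Gas": 3, "Cement": 3, "Energy - Power": 2,
    "Telecommunication": 1, "Infrastructure": 1, "Healthcare": 1,
    "Consumer Durables": 1, "Construction": 1, "Chemicals": 1,
}

total = sum(sector_counts.values())  # total number of stocks

# Relative frequency = frequency / total; percentage = relative frequency * 100.
relative = {sector: count / total for sector, count in sector_counts.items()}
percentage = {sector: 100 * r for sector, r in relative.items()}

for sector, count in sector_counts.items():
    print(f"{sector:25s} {count:3d} {relative[sector]:6.2f} {percentage[sector]:6.2f}%")
```

The relative frequencies always sum to 1, which is a quick sanity check on any frequency table.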
15

Bar Chart
[Figure: bar chart of sector frequencies; vertical axis: Frequency (0 to 7); horizontal axis: sectors]
16

Pie Chart
[Figure: pie chart of sector frequencies; legend: Consumer Goods, Banking, Automobile, Information Technology, Financial Services, Pharmaceuticals, Metals, Energy - Oil & Gas, Cement, Energy - Power, Others]
17
Pie Charts
[Figure: pie chart titled "Dismissals"; legend: lbw, caught, run out, bowled, not out, stumped, hit wicket; slice labels: 0%, 6%, 2%, 16%, 11%, 5%, 60%]
18
Pareto Chart
[Figure: Pareto chart of sector frequencies; vertical axis: Frequency; horizontal axis: sectors]
19

Univariate Analysis for Quantitative Variables
20

Continuous or Quantitative Variables


Frequency Distributions

Histograms

Ogives and Frequency Polygons

Dot Plots

Stem and Leaf Plot


21
Frequency Distributions
Range     Frequency   Relative Frequency   Cumulative Frequency
0-9       57          0.233                57
10-19     24          0.098                81
20-29     23          0.094                104
30-39     23          0.094                127
40-49     13          0.053                140
50-59     12          0.049                152
60-69     16          0.065                168
70-79     13          0.053                181
80-89     15          0.061                196
90-99     6           0.024                202
100-109   13          0.053                215
110-119   12          0.049                227
120-129   8           0.033                235
130-139   5           0.020                240
140-149   1           0.004                241
150-159   2           0.008                243
160-169   1           0.004                244
170-179   0           0.000                244
180-189   1           0.004                245
22

DATA REPRESENTATION

Histogram
Ogives and Frequency Polygons
Dot Plots
Stem and Leaf Plot
23

Histogram
[Figure: histogram titled "Kohli’s Runs"; horizontal axis: Runs (bin upper limits 9, 19, 29, ..., 189, More); vertical axis: Frequency (0 to 60)]
CAUTION!!! In Excel, histograms are represented like this, though a histogram needs to be represented as a series of contiguous rectangles
24

Histogram and Ogive
[Figure: histogram with ogive titled "Kohli’s Runs"; horizontal axis: Runs (bin upper limits 9, 19, ..., 189, More); left vertical axis: Frequency (0 to 60); right vertical axis: Cumulative % (0.00% to 120.00%)]
25
Frequency Polygons
[Figure: frequency polygon titled "Distribution of Kohli's runs"; horizontal axis: Runs Scored (class midpoints 4.5, 14.5, ..., 184.5); vertical axis: Frequency (0 to 30)]
26
Stem and Leaf Plot
STEM   LEAF
0      0 0 0 2 2 8 9
1      0 0 1 2 6 8 8
2      5 7 8
3
4      0 1 1 7 7
5      4 4 7
6      3 4 8
7      1 9
8      2
9      1
10     2 5 7
11     8
27

Bivariate Analysis
28

Crosstabulation

For producing a two-dimensional table that displays the frequency counts of two variables simultaneously
Also referred to as a contingency table

Consider the data on Virat Kohli’s runs

We will look into two dimensions: dismissals and innings using Pivot tables
29
Crosstabulation
• In an IT firm, an IT engineer has been promoted to the role of an HR manager. In the following year, after the appraisal of the 30 associates of the IT firm, some associates accused him of favouring those from the “Mainframes” skill set over those from the “Java/J2E” skill set when promoting associates. One of the associates who was not promoted registered a complaint with the CEO, stating that since the HR manager is himself from the Mainframes background, he has been biased towards those from the Mainframes background while promoting them. In his defence, he also attached the following table, where the background of each associate is mentioned along with whether they have been promoted or not. Is there any reason to believe the claim being made?
30

Cross-tabulation

Employee Code Skill Set If Promoted Employee Code Skill Set If Promoted
Emp 1 Mainframes Yes Emp 16 Java/J2E Yes
Emp 2 Java/J2E No Emp 17 Mainframes No
Emp 3 Java/J2E No Emp 18 Java/J2E Yes
Emp 4 Mainframes No Emp 19 Mainframes Yes
Emp 5 Java/J2E No Emp 20 Mainframes No
Emp 6 Mainframes Yes Emp 21 Java/J2E No
Emp 7 Mainframes Yes Emp 22 Java/J2E Yes
Emp 8 Java/J2E No Emp 23 Mainframes Yes
Emp 9 Java/J2E Yes Emp 24 Java/J2E No
Emp 10 Mainframes No Emp 25 Java/J2E Yes
Emp 11 Java/J2E Yes Emp 26 Mainframes Yes
Emp 12 Mainframes Yes Emp 27 Java/J2E No
Emp 13 Mainframes No Emp 28 Mainframes Yes
Emp 14 Mainframes Yes Emp 29 Mainframes No
Emp 15 Java/J2E No Emp 30 Java/J2E Yes
31

Crosstabulation
               Skill set JAVA/J2E   Skill set Mainframes
Promoted       7                    9
Not Promoted   8                    6
Total          15                   15
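The crosstabulation can be rebuilt from the 30 employee records. The sketch below is an illustration in Python rather than the Pivot-table approach used in the course; the compact string encoding (M = Mainframes, J = Java/J2E, Y = promoted, N = not promoted, in order Emp 1 to Emp 30) is an assumption made only to keep the example short.

```python
from collections import Counter

# One character per employee, Emp 1 to Emp 30, transcribed from the table.
skills = "MJJMJMMJJMJMMMJ" "JMJMMJJMJJMJMMJ"
promoted = "YNNNNYYNYNYYNYN" "YNYYNNYYNYYNYNY"

# Count each (skill set, promotion outcome) pair: this is the contingency table.
counts = Counter(zip(skills, promoted))

for outcome, label in (("Y", "Promoted"), ("N", "Not Promoted")):
    print(f"{label:13s} Java/J2E={counts[('J', outcome)]:2d}  Mainframes={counts[('M', outcome)]:2d}")

# Share of each skill set that was promoted.
rate_java = counts[("J", "Y")] / (counts[("J", "Y")] + counts[("J", "N")])
rate_mainframes = counts[("M", "Y")] / (counts[("M", "Y")] + counts[("M", "N")])
print(f"Promotion rate: Java/J2E={rate_java:.3f}, Mainframes={rate_mainframes:.3f}")
```

Comparing the two promotion rates is only a descriptive first look; whether the difference is large enough to support the complaint is a question for the hypothesis-testing sessions.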
32

Scatter Plot
[Figure: scatter plot titled "Closing Price vs. Total Traded Qty"; horizontal axis: Closing Price (1,900.00 to 2,600.00); vertical axis: Total Traded Quantity (0 to 900,000)]
33

Measures of Central Tendency

Mean (Arithmetic, Geometric, Harmonic)
Median
Mode
Analysing Distributions
Percentiles
• Procedure: Suppose we want to find the pth percentile in a dataset with n data points
• Arrange the data in ascending order
• Compute the percentile location, k = (n+1)*p, with p expressed as a fraction (for example, p = 0.25 for the 25th percentile)
• Divide k into its integer and decimal components (i and d)
• If d=0, the percentile is the ith value in the ordered data
• If d>0, the percentile lies between the ith and (i+1)th values in the ordered data. Suppose their difference is m. Then pth percentile = ith value + m*d
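The percentile procedure above can be sketched in Python as follows. This is an illustrative sketch, not the course's own implementation: the function name `percentile`, the choice to pass p on a 0 to 100 scale, and the clamping to the smallest or largest value when the location falls outside the data are all assumptions. Other software may use a different interpolation rule and give slightly different answers.

```python
def percentile(data, p):
    """Return the pth percentile (p on a 0-100 scale) using location k = (n + 1) * p / 100."""
    values = sorted(data)            # arrange the data in ascending order
    n = len(values)
    k = (n + 1) * p / 100            # percentile location
    i = int(k)                       # integer component
    d = k - i                        # decimal component
    if i < 1:                        # assumption: clamp to the smallest value
        return values[0]
    if i >= n:                       # assumption: clamp to the largest value
        return values[-1]
    lower = values[i - 1]            # ith value (1-based position i)
    upper = values[i]                # (i+1)th value
    return lower + d * (upper - lower)


# Hypothetical data, chosen only to make the arithmetic easy to follow.
sample = [10, 20, 30, 40, 50, 60, 70]
print(percentile(sample, 25), percentile(sample, 50), percentile(sample, 40))
```

With n = 7, the 40th percentile has location k = 3.2, so it lies 0.2 of the way from the 3rd value to the 4th value.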

34
35

Analyzing Distributions

Quartiles
1st Quartile (25th percentile)
2nd Quartile (Median)
3rd Quartile (75th percentile)

z-scores
36

Measures of Dispersion

Range
Variance/ Standard Deviation
Inter quartile range
Mean Absolute Deviation


37
Other measures

Measure of Symmetry
• Coefficient of Skewness

Measure of Peakedness
• Coefficient of Kurtosis

Coefficient of Excess

Coefficient of Variation
38

Measure of Symmetry
Coefficient of Skewness
• Positive Skewness
• Positive Value
• Longer right tail
• Higher data concentration on the left

• Negative Skewness
• Negative Value
• Longer left tail
• Higher data concentration on the right
39

Measure of Peakedness
Coefficient of Kurtosis
• Mesokurtic
• Leptokurtic
• Platykurtic

Coefficient of Excess Kurtosis
40

Measures of Dependence
• Spearman’s rank correlation coefficient: $r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)}$
• Let $d_i$ be the ith difference of ranks and n the number of observations
• Pearson correlation coefficient
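Both measures of dependence can be computed with a few lines of Python. The sketch below is illustrative: the function names and the sample data are hypothetical, the Pearson coefficient uses the standard definition (covariation divided by the product of the spreads), and the Spearman shortcut based on squared rank differences assumes there are no tied values.

```python
import math


def pearson(x, y):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank 1 for the smallest value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation via the sum of squared rank differences."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical data: y grows with x, but not along a straight line.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(round(pearson(x, y), 4), spearman(x, y))
```

In this example the ranks agree perfectly, so the Spearman coefficient is 1 even though the Pearson coefficient is slightly below 1; Spearman captures monotonic association, Pearson captures linear association.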


41

Other Important Plots

Box and Whisker Plot
42

Box and Whisker Plot
43

Basics of Probability Theory


44

Introduction to Probability
• We often deal with the following scenarios:
• How likely is it that the ongoing project will finish on time?
• What are the chances that the salesperson will not be able to meet his monthly target?
• What are the chances that the new product will meet the customers’ expectations?
• How likely is it that the competitor will benefit if the product price is increased?
45

Introduction to Probability
Nissan has launched the Micra. The Nissan Micra is available in 6 different exterior colours (Olive Green, Onyx Black, Blade Silver, Brick Red, Storm White, Turquoise Blue). It is not possible for a Nissan car dealer to keep all these varieties in the showroom. However, based on past experience, dealers have some idea of which varieties are requested more than others. Based on all the customer requests received in the past month at a particular car dealer, the dealer found the following results and will display only the three most requested colours in the showroom.
46

Introduction to Probability

Car Shades       Sales
Olive Green      35
Onyx Black       48
Blade Silver     63
Brick Red        100
Storm White      40
Turquoise Blue   34
[Figure: horizontal bar chart titled "Car Sales"; one bar per car shade; horizontal axis: Sales (0 to 120)]
47

Introduction to Probability

Probability: It is a numerical measure of the likelihood that an event will occur

Probability values are always assigned on a scale of 0 to 1
• A probability value closer to 0 indicates that an event is unlikely to occur while a probability value closer to 1 indicates that an event is almost certain to occur
48
Introduction to Probability
• Suppose the potential customers during the festive season want to book their vehicles the moment they see the car in their most preferred exterior colour, and would be less willing to wait while making the booking decision. If a car of that colour is not on display in the showroom, they could go to some other Nissan showroom. They could even decide to buy a car from a competitor of Nissan. Currently, the car dealer is keeping only Onyx Black, Blade Silver and Brick Red cars in the showroom.
• What is the probability that a randomly arriving customer would demand a car of Storm White colour and in turn move to some other Nissan dealer?
49

Introduction to Probability
• An experiment is any planned process of data collection
• It consists of a number of trials (replications) repeated under the same conditions
• For instance, throwing a die and recording the outcome is an experiment which can be repeated several times
50

Introduction to Probability
• Random Experiment
• The experimental outcomes are well-defined and are known before conducting the experiment
• In a single trial of the experiment, one and only one of the possible experimental outcomes will occur and we don’t know the outcome of a particular trial in advance
• Sample Space
• It is the set of all experimental outcomes
51
Assigning Probabilities to Experimental Outcomes
• Consider a random experiment which has N possible experimental outcomes
• The probability of occurrence of each outcome will be between 0 and 1, and the sum of the probabilities of all the outcomes should be 1
• Subjective Way
• It is an estimate that reflects a person’s opinion, or best guess, about whether an outcome will occur
• Classical Way
• When all the experimental outcomes are equally likely, then we can assign a probability of 1/N to each experimental outcome
• Relative Frequency Way
• When the number of experimental trials is large, then the probability of a given outcome is the number of times that outcome occurs divided by the total number of repetitions
52

Introduction to Probability

• Example: A decision maker randomly assigned the following probabilities to four different outcomes of an experiment. Are these probability assignments valid?
• P(E1)= 0.4
• P(E2)=0.3
• P(E3)=0.2
• P(E4)=0.15
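The two requirements for a valid assignment (each probability between 0 and 1, and all probabilities summing to 1) can be checked mechanically. The following Python sketch is illustrative; the function name `is_valid_assignment` and the small tolerance for floating-point rounding are assumptions.

```python
def is_valid_assignment(probabilities, tolerance=1e-9):
    """Check that every probability lies in [0, 1] and that they sum to 1."""
    in_range = all(0 <= p <= 1 for p in probabilities)
    sums_to_one = abs(sum(probabilities) - 1) <= tolerance
    return in_range and sums_to_one


# Probabilities assigned to E1..E4 in the example.
assigned = [0.4, 0.3, 0.2, 0.15]
print(sum(assigned), is_valid_assignment(assigned))
```

Here each value lies between 0 and 1, but the values add up to 1.05, so the assignment is not valid.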
53

Introduction to Probability
• A company manufacturing toothpaste is studying 5 different package designs. Assuming that one design is just as likely to be selected by a consumer as any other design, what is the selection probability of each design? The firm has gathered data from 100 customers selected at random. Do the data conform to the belief that one design is just as likely to be selected as another?
54

Introduction to Probability

Preferences of Packages
Packages      A      B      C      D      E
Preferences   0.20   0.22   0.34   0.14   0.10
55
Introduction to Probability
A market research firm is assigned the task of finding the acceptance of a certain software solution used for trading stocks, currencies, commodities, etc. among the students specialized in different streams of XYZ University. In that university, the distribution of student enrolment among the different streams/programs is as follows:

Streams/Programs   Enrolments
Science            350
Engineering        450
Medical            120
Arts               150
Commerce           275
Management         180
Law                75
Total              1600
56
Introduction to Probability
First, the firm wants to find out the acceptance of the software solution among the students from the management discipline. What is the probability that a student selected at random will have a specialization in management?
What is the probability that a student selected at random will have a specialization in commerce?
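Since every student is equally likely to be selected, each probability is the stream's enrolment divided by the total enrolment. A minimal Python sketch, using exact fractions to avoid rounding (the helper name `p_stream` is an assumption for illustration):

```python
from fractions import Fraction

# Enrolments from the XYZ University table.
enrolments = {
    "Science": 350, "Engineering": 450, "Medical": 120, "Arts": 150,
    "Commerce": 275, "Management": 180, "Law": 75,
}
total = sum(enrolments.values())


def p_stream(stream):
    """Probability that a randomly selected student belongs to the given stream."""
    return Fraction(enrolments[stream], total)


print(float(p_stream("Management")), float(p_stream("Commerce")))
```

The management probability is 180/1600 = 0.1125 and the commerce probability is 275/1600, roughly 0.172.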
57
Introduction to Probability
The given table shows the percentage of on-time arrivals, the number of mishandled baggage reports per 1000 passengers, and the number of customer complaints per 1000 passengers for 10 airlines

Airline              On-Time Arrivals (%)   Mishandled Baggage per 1000 passengers   Customer Complaints per 1000 passengers
Virgin America       83.5                   0.87                                     1.50
JetBlue              79.1                   1.88                                     0.79
AirTran Airways      87.1                   1.58                                     0.91
Delta Airlines       86.5                   2.10                                     0.73
Alaska Airlines      87.5                   2.93                                     0.51
Frontier Airlines    77.9                   2.22                                     1.05
Southwest Airlines   83.1                   3.08                                     0.25
US Airways           85.9                   2.14                                     1.74
American Airlines    76.9                   2.92                                     1.80
United Airlines      77.4                   3.87                                     4.24

Source: Statistics for Business and Economics by Anderson, Sweeney, Williams, Camm and Cochran, Cengage Publishers 13e
58

Introduction to Probability

• If we randomly choose a Delta Airlines flight, what is the probability that the flight will have an on-time arrival?
• If we randomly select one of the ten airlines, what is the probability that it will be an airline with (a) fewer than two cases of mishandled baggage per 1000 passengers (b) more than one customer complaint per 1000 passengers?
59

Introduction to Probability

• In an IT firm, an IT engineer has been promoted to the role of an HR manager. In the following year, after the appraisal of the 30 associates of the IT firm, some associates accused him of favouring those from the “Mainframes” skill set over those from the “Java/J2E” skill set when promoting associates. One of the associates who was not promoted registered a complaint with the CEO, stating that since the HR manager is himself from the Mainframes background, he has been biased towards those from the Mainframes background while promoting them. In his defence, he also attached the following table, where the background of each associate is mentioned along with whether they have been promoted or not. Is there any reason to believe the claim being made?
60
Introduction to Probability
Employee Code Skill Set If Promoted Employee Code Skill Set If Promoted
Emp 1 Mainframes Yes Emp 16 Java/J2E Yes
Emp 2 Java/J2E No Emp 17 Mainframes No
Emp 3 Java/J2E No Emp 18 Java/J2E Yes
Emp 4 Mainframes No Emp 19 Mainframes Yes
Emp 5 Java/J2E No Emp 20 Mainframes No
Emp 6 Mainframes Yes Emp 21 Java/J2E No
Emp 7 Mainframes Yes Emp 22 Java/J2E Yes
Emp 8 Java/J2E No Emp 23 Mainframes Yes
Emp 9 Java/J2E Yes Emp 24 Java/J2E No
Emp 10 Mainframes No Emp 25 Java/J2E Yes
Emp 11 Java/J2E Yes Emp 26 Mainframes Yes
Emp 12 Mainframes Yes Emp 27 Java/J2E No
Emp 13 Mainframes No Emp 28 Mainframes Yes
Emp 14 Mainframes Yes Emp 29 Mainframes No
Emp 15 Java/J2E No Emp 30 Java/J2E Yes
61

Introduction to Probability

Skill set JAVA/J2E Skill set Mainframes


Promoted 7 9
Not Promoted 8 6
Total 15 15
62

Events

• Event: An event is a collection of sample points


• Revisiting the Market Research Example
• Suppose we are interested in finding the probability that a student selected at random will
have a specialization in any of the management, medical, or law disciplines.
• Or if we are interested in finding the probability that a student selected at random will have
a specialization in either medical or engineering.
• The above two are examples of Events
• In each case an event is said to occur “if any of the sample points mentioned in the case
appears in the experimental outcome”
63
Events
• An app-based cab aggregator has tracked the 12 different geographical locations within a city (sorted based on the last two digits of the PIN code) for the number of cab bookings on a particular day. The following table represents the geographical distribution of the same:

6 digit PIN Code   Geographical Location   Number of bookings
xxxx01             North                   65
xxxx08             East                    43
xxxx09             East                    56
xxxx11             North                   36
xxxx18             South                   49
xxxx19             South                   65
xxxx23             West                    59
xxxx25             South                   43
xxxx30             East                    53
xxxx39             West                    61
xxxx41             West                    33
xxxx50             North                   60
64

Events
• The firm now wishes to analyze each of these bookings in detail. They randomly pick a booking from their database.
• What is the probability that the booking is from the area where the PIN code is “xxxx08”?
• What is the probability that the booking is from the Western part of the city?
• What is the probability that the booking is either from the area having the PIN code “xxxx11” or from the area having the PIN code “xxxx19”?
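Each of the three questions defines an event, that is, a collection of sample points (bookings), and its probability is the number of bookings in the event divided by the total number of bookings. The Python sketch below is illustrative; the helper `p_event` and the predicate style are assumptions, not part of the slides.

```python
from fractions import Fraction

# (PIN code, location, number of bookings), transcribed from the table.
bookings = [
    ("xxxx01", "North", 65), ("xxxx08", "East", 43), ("xxxx09", "East", 56),
    ("xxxx11", "North", 36), ("xxxx18", "South", 49), ("xxxx19", "South", 65),
    ("xxxx23", "West", 59), ("xxxx25", "South", 43), ("xxxx30", "East", 53),
    ("xxxx39", "West", 61), ("xxxx41", "West", 33), ("xxxx50", "North", 60),
]
total = sum(count for _, _, count in bookings)


def p_event(belongs_to_event):
    """Probability that a randomly picked booking satisfies the event's condition."""
    favourable = sum(c for pin, loc, c in bookings if belongs_to_event(pin, loc))
    return Fraction(favourable, total)


p_pin08 = p_event(lambda pin, loc: pin == "xxxx08")
p_west = p_event(lambda pin, loc: loc == "West")
p_11_or_19 = p_event(lambda pin, loc: pin in {"xxxx11", "xxxx19"})
print(float(p_pin08), float(p_west), float(p_11_or_19))
```

The last event combines two PIN codes that cannot occur together for a single booking, so its probability is simply the sum of the two individual probabilities.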
65

Events

• Complement of an Event: For any event A, its complement consists of all those sample points which are not in A
• It is denoted by $A^C$
• Thus, $P(A^C) = 1 - P(A)$
• For instance, consider the event of getting an odd number while rolling a die
• The sample points corresponding to the event are 1, 3 and 5
• Thus, the complement of the above event is to get 2, 4, or 6
66

Events
• Union of Two events: The union of two events A and B is an event consisting of all those sample points which belong to A or B or both
• It is denoted by $A \cup B$

• Intersection of Two events: The intersection of two events A and B is an event consisting of all those sample points which belong to both of them
• It is denoted by $A \cap B$

• Addition Rule: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
67

Example

• A market research firm conducted a study to find out the reasons for attrition in different firms. The study revealed that 30% of the employees who left within the first two years did so because of salary-related issues, 20% left as they found the job to be monotonous, and 12% did so because they were dissatisfied with both their salary and their job profile. What is the probability that a randomly chosen employee who left his/her job at a manufacturing firm within the first two years did so due to (i) dissatisfaction with salary (ii) dissatisfaction with job profile (iii) both?

Link to the article: https://www.livemint.com/Industry/laogg0Cvqc7fOPB6YD6kEM/What-drives-attrition-across-companies.html


68

Solution

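• Let W be the event denoting work issue
• S be the event denoting salary issue
• We wish to find P(W), P(S) and $P(W \cap S)$
• Here P(W)=0.2, P(S)=0.3 and $P(W \cap S)=0.12$
• We know, $P(W \cup S) = P(W) + P(S) - P(W \cap S) = 0.2 + 0.3 - 0.12 = 0.38$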
69

Events

• Mutually Exclusive Events

• Two events are mutually exclusive if they don’t have any sample points in common, i.e., the occurrence of one event ensures that the other event cannot occur
• We can further say that if A and B are two mutually exclusive events then $P(A \cap B) = 0$, so $P(A \cup B) = P(A) + P(B)$
70

Conditional Probability
• It is quite common for the probability of occurrence of an event to be affected by the occurrence of a related event
• Suppose A is the event with a probability of occurrence P(A)
• Further, assume some new information is available for a related event B which has already occurred
• Now, we might be interested in finding the probability of occurrence of A in the presence of the additional information obtained regarding the occurrence of B
71

Conditional Probability

• An IT firm wishes to analyze the data regarding the promotion of its employees over the
past 2 years. The data is presented in the following table:

Skill set JAVA/J2E Skill set Mainframes


Promoted 145 375
Not Promoted 55 125
Total 200 500
72

Conditional Probability

• Let us define the events as follows:


• J: Skill set being JAVA/J2E
• M: Skill set being Mainframes
• A: Promoted
• $A^C$: Not promoted
73

Conditional Probability

• Find the probability that a randomly selected employee is promoted, given that the employee has a skill set of JAVA/J2E
• Solution: We are required to find P(A|J)
• P(A|J)=145/200=0.725
• Let us also find other cases
• P(A|M)=375/500=0.75, $P(A^C|J)$=1-0.725=0.275, $P(A^C|M)$=1-0.75=0.25
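A conditional probability restricts attention to one column of the table: the count in the cell of interest is divided by the column total rather than by the grand total. A minimal Python sketch of this idea follows; the dictionary layout and the function name `conditional` are illustrative assumptions.

```python
# Cell counts from the promotion table: (skill set, outcome) -> count.
counts = {
    ("J", "A"): 145, ("M", "A"): 375,          # promoted
    ("J", "not A"): 55, ("M", "not A"): 125,   # not promoted
}
grand_total = sum(counts.values())


def conditional(outcome, skill):
    """P(outcome | skill): cell count divided by the total for that skill set."""
    skill_total = sum(c for (s, _), c in counts.items() if s == skill)
    return counts[(skill, outcome)] / skill_total


p_promoted_given_java = conditional("A", "J")
p_promoted_given_mainframes = conditional("A", "M")

# For contrast, the joint probability divides by the grand total instead.
p_java_and_promoted = counts[("J", "A")] / grand_total

print(p_promoted_given_java, p_promoted_given_mainframes, round(p_java_and_promoted, 4))
```

The contrast between 145/200 (conditional) and 145/700 (joint) is the key distinction: "promoted given JAVA/J2E" and "JAVA/J2E and promoted" are different questions.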
74

Conditional Probability:
Context of Reneging

• Article Source: https://www.benivo.com/blog/how-to-prevent-employees-from-reneging-on-a-signed-offer-and-pulling-no-shows
75

Conditional Probability

• HR managers often face the problem of offer reneging, which means that candidates, after getting an offer from a firm, don’t end up joining that firm. A particular IT firm has investigated the past data on reneges over the last year to determine whether educational background has something to do with reneging. The following table shows the data gathered:

Science Engineering Medical Arts Commerce Management


Offers Made 40 200 5 10 45 100
Reneges 10 60 1 1 9 40
76

Independence of Events

• In the previous example, we have seen that whether an employee is promoted or not depends
upon the skill set of the individual since P(A|M)≠P(A|J)
• Two events A and B are independent if P(B|A)=P(B) and P(A|B)=P(A)
• Revisiting the Promotion Example with modified data
Skill set JAVA/J2E Skill set Mainframes
Promoted 140 350
Not Promoted 60 150
Total 200 500
77

Independence of Events

• In this case, we have


P(A|J)=140/200=0.7
P(A|M)=350/500=0.7
$P(A^C|J)$=1-0.7=0.3
$P(A^C|M)$=1-0.7=0.3
• Again, P(A)=490/700=0.7 and $P(A^C)$=210/700=0.3
• Thus, P(A)=P(A|J)=P(A|M)=0.7 establishing the independence of events
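The independence check can be automated by comparing each conditional promotion probability with the overall promotion probability. The sketch below is illustrative: the function name and table layout are assumptions, and exact fractions are used so that the equality test is not affected by floating-point rounding.

```python
from fractions import Fraction


def promotion_is_independent(table):
    """table maps skill set -> (promoted, not promoted) counts.

    Returns True when P(promoted | skill set) equals P(promoted) for every skill set.
    """
    promoted_total = sum(p for p, _ in table.values())
    grand_total = sum(p + q for p, q in table.values())
    marginal = Fraction(promoted_total, grand_total)   # P(A)
    return all(Fraction(p, p + q) == marginal for p, q in table.values())


modified = {"JAVA/J2E": (140, 60), "Mainframes": (350, 150)}
original = {"JAVA/J2E": (145, 55), "Mainframes": (375, 125)}

print(promotion_is_independent(modified), promotion_is_independent(original))
```

With real data, conditional and marginal proportions rarely match exactly, so an exact equality test like this one is mainly useful for textbook tables; sampled data calls for a statistical test instead.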
78

Multiplication Law

• Let A and B be two events. Then $P(A \cap B) = P(A)\,P(B|A) = P(B)\,P(A|B)$

• Example: It is known that 80% of the households in a city have television sets. It is also known that out of those having TV sets, 75% also have a connection to satellite channels. What is the probability that a household selected at random will have both the TV set as well as a connection to the satellite channels?
79

Solution

Let us consider the following:


A: Event denoting that a household has a TV set
B: Event denoting that a household has satellite channel connection
P(A)=0.80, P(B | A)=0.75
• We wish to find $P(A \cap B)$
• We know $P(A \cap B) = P(A)\,P(B|A) = 0.80 \times 0.75 = 0.60$
80

Bayes’ Theorem
81

Total Probability Theorem

• Let events C1, C2, ..., Cn form a partition of the sample space S, where all the events have a non-zero probability of occurrence.
• For any event A associated with S, the total probability theorem states $P(A) = \sum_{i=1}^{n} P(A|C_i)\,P(C_i)$
82

Total Probability Theorem

[Figure: sample space partitioned into C1, C2, C3, …, Cn, with event A overlapping the partitions]
83
Example
• A student needs to appear for the DVADM examination. The question paper can be of low, moderate, or high level of difficulty. The probabilities of passing the exam for the student under each of these cases are 0.9, 0.7, and 0.5 respectively. If the probability that the question paper will be of moderate difficulty is 0.45, and of low difficulty is 0.35, what is the probability that the student will pass the exam?
• Sol: Let the events be denoted as: L (Low Difficulty), M (Moderate Difficulty), and H (High Difficulty)
• Here, P(L)=0.35, P(M)=0.45, P(H)=1-0.35-0.45=0.2
• Also, let the event of passing the examination be denoted by P
• Then P(P|L)=0.9, P(P|M)=0.7, and P(P|H)=0.5
• So, P(P)=P(P|L)*P(L)+ P(P|M)*P(M)+ P(P|H)*P(H) = 0.9*0.35+0.7*0.45+0.5*0.2 = 0.73
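The total probability calculation can be sketched in Python as a weighted sum over the partition. This is an illustrative sketch; the dictionary names are assumptions. It uses a pass probability of 0.5 for the high-difficulty paper, as given in the problem statement, and derives P(H) as whatever remains after the low and moderate cases.

```python
# Probability of each difficulty level (the partition of the sample space).
p_difficulty = {"L": 0.35, "M": 0.45}
p_difficulty["H"] = 1 - sum(p_difficulty.values())   # remaining probability

# Probability of passing given each difficulty level, from the problem statement.
p_pass_given = {"L": 0.9, "M": 0.7, "H": 0.5}

# Total probability theorem: sum of P(pass | level) * P(level) over all levels.
p_pass = sum(p_pass_given[level] * p_difficulty[level] for level in p_difficulty)
print(round(p_pass, 4))
```

Each term in the sum is the probability of one route to passing (a given difficulty level and then a pass), and the routes are mutually exclusive, which is why they can be added.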
84

Bayes’ Theorem

• Prior Probability
Let there be some specific events of interest. We have some initial information about the probability of occurrence of such events
We call this the Prior Probability and then seek further information about these events
• Posterior Probability
After obtaining additional information, we update the prior probabilities to get revised probabilities, referred to as posterior probabilities
85

Bayes’ Theorem Example

[Figure: diagram partitioning the output into Machine X, Machine Y and Machine Z, each split into Defective and Non-Defective]
86

Bayes’ Theorem Example
In a factory, there are 3 machines A, B and C producing 50%, 30% and 20% of the total output respectively. Machine A produces 4% of the items as defectives, machine B produces 5% of the items as defectives whereas machine C produces 7% of the items as defectives. An item is drawn at random from the production line. Find the probability that the item is defective. Given that the item is defective, what is the conditional probability that it was produced by machine A?
87

Solution

Let us define the events

• A: Item is produced by machine A
• B: Item is produced by machine B
• C: Item is produced by machine C
• X: Item is defective

We need to find P(X) and P(A|X)


88

Solution

• Tree Diagram Approach

Let there be 1000 items produced by the factory
Output of A = 500, B = 300 and C = 200
Defectives from A = 4% of 500 = 20, from B = 5% of 300 = 15, and from C = 7% of 200 = 14
Total Number of Defectives = 20+15+14=49
P(X)=49/1000 = 0.049
P(A|X)=20/49=0.4082
89

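Bayes’ Theorem
• In the previous case, we wished to find P(A|X), which is the posterior probability given that the item is defective. Likewise, we can also find P(B|X) and P(C|X).
• We know $P(A|X) = \frac{P(A \cap X)}{P(X)}$, i.e. $P(A|X) = \frac{P(X|A)\,P(A)}{P(X)}$
• Similarly, $P(B|X) = \frac{P(X|B)\,P(B)}{P(X)}$ and $P(C|X) = \frac{P(X|C)\,P(C)}{P(X)}$

• We also know that if an item is defective, it must come from one of the three machines A, B or C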
90

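Bayes’ Theorem
From $P(X) = P(X|A)\,P(A) + P(X|B)\,P(B) + P(X|C)\,P(C)$ we get $P(A|X) = \frac{P(X|A)\,P(A)}{P(X|A)\,P(A) + P(X|B)\,P(B) + P(X|C)\,P(C)}$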


91

Solution

• P(A)=0.5, P(B)=0.3 and P(C)=0.2


• Now, P(X|A)=0.04
• P(X|B)=0.05
• P(X|C)=0.07
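With the priors and the defect rates listed above, the posterior probabilities follow from Bayes' theorem: each machine's share of defective output is divided by the total probability of a defective item. A minimal Python sketch (the dictionary names are illustrative assumptions):

```python
# Prior probabilities: share of total output produced by each machine.
prior = {"A": 0.5, "B": 0.3, "C": 0.2}

# Likelihoods: probability that an item is defective given the machine.
defect_rate = {"A": 0.04, "B": 0.05, "C": 0.07}

# Total probability of a defective item, P(X).
p_defective = sum(defect_rate[m] * prior[m] for m in prior)

# Posterior probabilities P(machine | X) via Bayes' theorem.
posterior = {m: defect_rate[m] * prior[m] / p_defective for m in prior}

print(round(p_defective, 4))
print({m: round(p, 4) for m, p in posterior.items()})
```

Machine A has the lowest defect rate but still accounts for the largest share of defectives, because it produces half of all output; the posterior reflects both the prior and the likelihood.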
92

Example
An IT firm has developed its own filter for the emails
received. Emails are classified as Genuine emails and
Junk emails. The firm receives about 10% of Junk
emails. The filter is designed in such a way that if it
detects a Junk email, it will be sent to the “Spam”
email folder else it will be sent to the “Inbox”.
However, the filter is not fool-proof. It has been found
that about 15% of the Junk emails are being sent to the
Inbox folder while about 5% of the Genuine emails are
sent to the Spam folder.
What is the probability that an email received at random will be sent to the Spam folder?
What is the conditional probability that a randomly checked
email from the Spam folder is a Genuine email?
93

Solution
• Events --- S: Spam, I: Inbox, G: Genuine, J: Junk
• We wish to find P(S) and P(G | S)
• P(J)=0.1 and P(G)=0.9
• Also P(I | J)=0.15 and P(S | G)=0.05
94

Solution
• Alternatively, suppose the firm receives 1000 such emails
• Of these, 900 will be genuine and 100 will be junk
• Out of the 900 genuine emails, 5% i.e. 45 emails will be sent to the spam folder and the remaining 855 will be sent to the inbox folder
• Out of the 100 junk emails, 15% i.e. 15 emails will be sent to the inbox folder and the remaining 85 will be sent to the spam folder
• Thus, 45+85 = 130 of the 1000 emails reach the spam folder, so P(S) = 130/1000 = 0.13
• Out of the 130 emails in the spam folder, 45 are genuine
• Required probability is P(G|S) = 45/130 = 0.346
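The counting argument for 1000 emails translates directly into integer arithmetic. The Python sketch below follows that reasoning step by step; the variable names are illustrative assumptions, and the batch size of 1000 is the same convenient choice made in the solution.

```python
total_emails = 1000

# 10% of emails are junk; the rest are genuine.
junk = total_emails * 10 // 100
genuine = total_emails - junk

# Filter errors: 5% of genuine emails go to Spam, 15% of junk emails go to Inbox.
genuine_to_spam = genuine * 5 // 100
junk_to_inbox = junk * 15 // 100
junk_to_spam = junk - junk_to_inbox

# Everything that lands in the Spam folder.
spam_folder = genuine_to_spam + junk_to_spam

p_spam = spam_folder / total_emails                    # P(S)
p_genuine_given_spam = genuine_to_spam / spam_folder   # P(G | S)

print(spam_folder, p_spam, round(p_genuine_given_spam, 3))
```

Roughly one in three emails in the Spam folder is genuine, even though the filter misroutes only 5% of genuine emails, because genuine emails outnumber junk nine to one.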
95

Probability Distributions
96

Random Variables and Probability Distributions
97

Random Variables
• A random variable is a numerical description of the outcome of an experiment, i.e., it is defined over the sample space of the random experiment
• Consider the random experiment of tossing two coins simultaneously
• Its sample space consists of 4 sample points, i.e., S={(H,H),(H,T),(T,H),(T,T)}
• Let x be the random variable denoting the number of heads obtained when the two coins are tossed
• Here, x can take only 3 values: 0, 1 and 2
• P(x=0)=P(x=2)=1/4 and P(x=1)=1/2
98

Random Variables

• Consider the random experiment of rolling a pair of dice. We define the random variable x as the sum of the outcomes of the pair of dice. The sample space is expressed in the following table. The numbers outside the brackets indicate the sum of the outcomes.
     1          2          3          4          5           6
1    (1,1) 2    (1,2) 3    (1,3) 4    (1,4) 5    (1,5) 6     (1,6) 7
2    (2,1) 3    (2,2) 4    (2,3) 5    (2,4) 6    (2,5) 7     (2,6) 8
3    (3,1) 4    (3,2) 5    (3,3) 6    (3,4) 7    (3,5) 8     (3,6) 9
4    (4,1) 5    (4,2) 6    (4,3) 7    (4,4) 8    (4,5) 9     (4,6) 10
5    (5,1) 6    (5,2) 7    (5,3) 8    (5,4) 9    (5,5) 10    (5,6) 11
6    (6,1) 7    (6,2) 8    (6,3) 9    (6,4) 10   (6,5) 11    (6,6) 12
99

Random Variables

• P(x=2)=1/36 • P(x=8)=5/36
• P(x=3)=2/36 • P(x=9)=4/36
• P(x=4)=3/36 • P(x=10)=3/36
• P(x=5)=4/36 • P(x=11)=2/36
• P(x=6)=5/36 • P(x=12)=1/36
• P(x=7)=6/36
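These probabilities can be obtained by enumerating the 36 equally likely sample points and counting how many give each sum. A minimal Python sketch, with illustrative variable names:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling a pair of dice.
sample_space = list(product(range(1, 7), repeat=2))

# Count how many sample points produce each value of x (the sum).
sum_counts = Counter(first + second for first, second in sample_space)

# Classical probability: favourable sample points / total sample points.
pmf = {total: Fraction(count, len(sample_space)) for total, count in sorted(sum_counts.items())}

for total, probability in pmf.items():
    print(f"P(x={total}) = {sum_counts[total]}/36 = {float(probability):.4f}")
```

The distribution is symmetric around 7, the most likely sum, and the probabilities add up to 1 as required.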
100

Random Variables

• Machines can experience breakdowns due to electrical faults, mechanical faults or misuse
• Each cause of breakdown has an associated cost of repair
• The costs are given as follows:
Reason           Electrical   Mechanical   Misuse
Cost of Repair   2000         2500         5000

• We define the random variable x as the cost of repair
• Let the probabilities of a breakdown being caused by electrical faults, mechanical faults and misuse be 0.4, 0.3 and 0.3 respectively
101

Random Variables

• Thus, P(x=2000)= 0.4


• P(x=2500)= 0.3
• P(x=5000)= 0.3
102

Random Variables

• Thus, a random variable associates a numerical value with each and every experimental
outcome
• It can be classified as discrete or continuous depending on the numerical values it
assumes
103

Discrete Random Variables

• A random variable that assumes only a finite set of values or an infinite sequence of
values in the form of 0, 1, 2, …. is referred to as a discrete random variable
• Consider the example of number of cars crossing a toll booth each hour on a particular
day
• Here the random variable x can assume values 0,1,2,3,……
104

Discrete Random Variables

• A call centre firm records the hourly calls received from different clients on a particular
day. The data is presented in the following table:

Time slot      Calls received
9AM-10AM       20
10AM-11AM      35
11AM-12Noon    40
12Noon-1PM     45
1PM-2PM        5
2PM-3PM        35
3PM-4PM        30
4PM-5PM        20
5PM-6PM        10

• Develop Discrete Probability Distribution in the above case


105

Discrete Random Variables

• We assign a probability to each of the values taken by the discrete random variable x, i.e., if i is a value taken by x, then we denote by f(i) the probability of x assuming the value i
• The required conditions for a discrete probability function are:
P(x=i)=f(i)≥0 and ∑P(x=i)= ∑f(i)= 1
106

Probability mass function (p.m.f.) for Discrete Random Variables

• The probability mass function (p.m.f.) gives the probability that a discrete random variable will assume a particular value
• It satisfies the following conditions:
P(x=i)=f(i)≥0 and ∑P(x=i)= ∑f(i)= 1
107

Discrete Random Variables

• Consider the following probability distribution of a random variable x


x 13 14 15 16 17 18
f(x) 0.05 0.1 0.2 0.3 0.25 0.1

• Is this probability distribution valid?


• What is the probability that x=16?
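The two questions above can be checked mechanically against the conditions f(i)≥0 and ∑f(i)=1. This is a small sketch; the helper name `is_valid_pmf` and its tolerance are assumptions made for illustration.

```python
import math

# Probability distribution from the slide.
pmf = {13: 0.05, 14: 0.1, 15: 0.2, 16: 0.3, 17: 0.25, 18: 0.1}

def is_valid_pmf(pmf, tol=1e-9):
    """Check that every probability is non-negative and that they sum to 1."""
    non_negative = all(p >= 0 for p in pmf.values())
    sums_to_one = math.isclose(sum(pmf.values()), 1.0, abs_tol=tol)
    return non_negative and sums_to_one

print("valid:", is_valid_pmf(pmf))
print("P(x=16) =", pmf[16])
```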
108

Probability distribution function or cumulative distribution function (c.d.f.) for Discrete Random Variables
• The Probability distribution function or cumulative distribution function (c.d.f.) for any Discrete
Random Variable x evaluated at any value i denotes the probability that x will take a value less
than or equal to i
• Mathematically, c.d.f. for the random variable x at the point i gives P(x≤i)
• Consider the following probability distribution of a random variable x
x 13 14 15 16 17 18
f(x) 0.05 0.1 0.2 0.3 0.25 0.1

• What is the c.d.f. at the point x=16
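Since the c.d.f. at i is P(x≤i), it can be obtained by adding up f over all values not exceeding i. A minimal sketch follows; the function name `cdf` is illustrative.

```python
# Probability distribution from the slide.
pmf = {13: 0.05, 14: 0.1, 15: 0.2, 16: 0.3, 17: 0.25, 18: 0.1}

def cdf(pmf, i):
    """Return P(x <= i) by summing the p.m.f. over all values up to i."""
    return sum(p for value, p in pmf.items() if value <= i)

# c.d.f. at 16 is f(13) + f(14) + f(15) + f(16)
print(round(cdf(pmf, 16), 4))
```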


109

Expectation or Expected Value of a Discrete Random Variable

• The expectation of a discrete random variable represents the average of many independent realizations of the random variable
• It is the probability-weighted average of all its possible values
110

Expectation or Expected Value of a Discrete Random Variable
• Consider the following probability distribution of a random variable x

x 13 14 15 16 17 18
f(x) 0.05 0.1 0.2 0.3 0.25 0.1

• Compute the expected value and the variance of the random variable x
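The expected value is the probability-weighted average ∑ i·f(i), and the variance can be computed as ∑ (i − E[x])²·f(i). The sketch below applies these two formulas to the table; the variable names are illustrative.

```python
# Probability distribution from the slide.
pmf = {13: 0.05, 14: 0.1, 15: 0.2, 16: 0.3, 17: 0.25, 18: 0.1}

# Expected value: probability-weighted average of the possible values.
expected_value = sum(value * p for value, p in pmf.items())

# Variance: probability-weighted average of squared deviations from the mean.
variance = sum((value - expected_value) ** 2 * p for value, p in pmf.items())

print(round(expected_value, 4), round(variance, 4))
```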
Continuous Random Variable

• A continuous random variable may assume any value in an interval on the real number line or in a collection of intervals
• Experimental outcomes based on weight, height, marks, measurements etc. are described by continuous random variables
• Some continuous random variables of interest include the time between two consecutive calls received by an associate at a call centre, the deviation of the diameter of ball bearings from the specified standards etc.

111
Continuous Random Variable

• A major issue in dealing with continuous random variables is the computation of the probability of the random variable at a particular point in the sample space
• We define the probability density function as the continuous counterpart of the probability mass function
• Unlike the probability mass function, the probability density function f(x) doesn’t provide the probability values directly

112
113

Continuous Random Variable

• The PDF is used to specify the probability of the continuous random variable falling within a particular range of values, as opposed to taking on any one value; for a continuous random variable, the probability of taking on any particular value is 0
• An example of a pdf is as follows:
Some Common Probability Distributions
114
Some Common Probability Distributions

• Discrete
• Binomial

• Continuous
• Normal

115
116

Binomial Probability Distribution

• It is a discrete probability distribution


• Consider an experiment having n identical trials where two outcomes are possible in each trial
• Let one of these outcomes be referred to as a success with a known probability ‘p’; the other outcome is then classified as a failure with probability 1-p
• The probability of success (and also of failure) does not change from one trial to another
• The trials are independent
117

Binomial Probability Distribution

• Here, we might be interested in finding the probability of a given number of successes occurring in n trials
• Ex 1: Consider the case where we are interested in finding the probability of x customers making a purchase at a retail store out of n footfalls, where n≥x and the probability of a randomly chosen customer making a purchase at the store is known
118
Binomial Probability Distribution

• In the previous example, suppose the store manager wants to find the probability that exactly 10 customers will make a purchase at his store out of 15 arrivals, where the probability of a randomly chosen customer making a purchase at the store is 0.6
• Ex 2: 20 different individuals are exposed to a certain drug that could induce a mild headache. It is known that the probability that a randomly chosen individual will get a mild headache after being exposed to that particular drug is 0.8. Compute the probability that out of those 20 individuals at least 12 will get a mild headache after being exposed to that particular drug.
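Both exercises can be worked with the standard binomial probability C(n, k)·p^k·(1-p)^(n-k), which the slides rely on without writing out. The sketch below is one way to evaluate it; the helper `binom_pmf` is a name chosen here, not something from the course material.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Ex 1: exactly 10 purchases out of 15 arrivals, p = 0.6
ex1 = binom_pmf(10, 15, 0.6)

# Ex 2: at least 12 headaches out of 20 individuals, p = 0.8
ex2 = sum(binom_pmf(k, 20, 0.8) for k in range(12, 21))

print(f"Ex 1: {ex1:.4f}")
print(f"Ex 2: {ex2:.4f}")
```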
119

Binomial Probability Distribution

• Ex 3: In a society, the outbreak of COVID-19 is massive. The local authorities decided to conduct random testing in the society, as they assume close to 10% of the people have contracted the disease. They picked 20 persons at random. If the assumption is true, what is the probability that fewer than 5 people will test positive?
• Ex 4: In a board meeting, at least 3/4th of the 16 board members must be present for any decision to be made. Due to the outbreak of the pandemic, the meetings are conducted online, with the possibility of internet disruption at the location of each board member. It is assumed that the chance of internet disruption for any individual member is 15%, in which case their presence will not be counted. What is the probability that the meeting will be attended by enough members to fulfil the quorum requirement?
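Ex 3 and Ex 4 are cumulative binomial probabilities. The sketch below assumes that test results (Ex 3) and internet disruptions (Ex 4) are independent across individuals, and reads the quorum as at least 12 of the 16 members; `binom_pmf` is an illustrative helper name.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Ex 3: fewer than 5 positives among 20 people, p = 0.10
ex3 = sum(binom_pmf(k, 20, 0.10) for k in range(0, 5))

# Ex 4: a member is counted as present with probability 1 - 0.15 = 0.85;
# quorum needs at least 3/4 of 16 = 12 members.
quorum = 3 * 16 // 4
ex4 = sum(binom_pmf(k, 16, 0.85) for k in range(quorum, 17))

print(f"Ex 3: {ex3:.4f}")
print(f"Ex 4: {ex4:.4f}")
```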
120

Normal Distribution

• Typically used where the random variables represent height, weight, marks, precipitation, temperature etc.
• Standard Normal Probability Distribution, i.e. N(0,1)
121

Normal Distribution

• In a university examination, the marks of 730 students are taken and plotted as follows:
Normal Distribution

• Ex 1: An automobile part is designed so that its lifetime (in months) is normally distributed with mean 26.4 months and standard deviation 3.8 months. The manufacturer has decided to use a marketing strategy in which the product is covered by a warranty of 18 months. Approximately what proportion of the products will fail within the warranty period? In addition, suppose the manufacturer is willing to take back only 5% of the sold items under warranty. How long should the warranty period be?
• Ex 2: The marks obtained by several students in a Statistics examination are assumed to be approximately normally distributed with mean 65 and standard deviation 5. What are the minimum marks scored by the top 10 percent of students? If the cut-off for A and A+ is 75, what percentage of students will get A or A+? If 3 students are chosen at random from this set, what is the probability that exactly 2 of them will have marks above 70?
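Ex 1 and Ex 2 reduce to evaluating the normal c.d.f. and its inverse. The sketch below uses `statistics.NormalDist` from the Python standard library (available from Python 3.8); the variable names are illustrative, and the last step assumes the three students' marks are independent.

```python
from statistics import NormalDist

# Ex 1: lifetime ~ N(26.4, 3.8)
lifetime = NormalDist(mu=26.4, sigma=3.8)
p_fail_in_warranty = lifetime.cdf(18)        # P(lifetime < 18 months)
warranty_for_5pct = lifetime.inv_cdf(0.05)   # period that only 5% fail within

# Ex 2: marks ~ N(65, 5)
marks = NormalDist(mu=65, sigma=5)
top10_cutoff = marks.inv_cdf(0.90)           # 90th percentile
p_grade_a = 1 - marks.cdf(75)                # P(marks >= 75)
p_above_70 = 1 - marks.cdf(70)
# Exactly 2 of 3 independent students above 70: binomial with n = 3
p_two_of_three = 3 * p_above_70**2 * (1 - p_above_70)

print(round(p_fail_in_warranty, 4), round(warranty_for_5pct, 2))
print(round(top10_cutoff, 2), round(p_grade_a, 4), round(p_two_of_three, 4))
```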

122
123
124

Normal Distribution

• Ex 3: The daily demand for cola drinks (600 ml) at a retail store is normally distributed with mean 80 bottles and standard deviation 20 bottles. At the start of the day, the inventory at the store shows 100 bottles. What is the probability that the store could face a stock-out, given that there is no other opportunity to place an order that day?
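A stock-out occurs when the day's demand exceeds the 100 bottles in stock, so the required quantity is P(demand > 100). A minimal sketch using the standard library follows; treating demand as continuous is the exercise's own normal approximation.

```python
from statistics import NormalDist

# Daily demand ~ N(80, 20); opening inventory is 100 bottles.
demand = NormalDist(mu=80, sigma=20)
inventory = 100

# Probability that demand exceeds the available stock.
p_stock_out = 1 - demand.cdf(inventory)

print(round(p_stock_out, 4))
```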
Sampling and Sampling Distributions
125
126

Introduction to Sampling

Population: It is the aggregate of units where interest lies in a given context

Sample: A subset of the population

Parameter: A population characteristic like population mean etc.

Statistic: A sample characteristic like sample mean etc.

Variable: A specific population characteristic which varies from one unit to another unit
127

Simple Random Sampling

Simple Random Sampling With Replacement (SRSWR)

Simple Random Sampling Without Replacement (SRSWOR)


128
Simple Random Sampling
129

Simple Random Sampling

• 6 different samples of size 10 are drawn from a population of households in the city of Jamshedpur to estimate the average monthly expenditure of the households. The data is presented below:

Sample  Monthly expenditure (in 000s)
1       6 9 12 10.4 4 9.3 8 8.8 12 8
2       17.3 9 7.9 11 12 11.2 8 10 11.2 16
3       10 7 12.4 6 15 6.6 14 5 8.9 7
4       9 10.4 11 10.6 8 11 8 10.9 11 9
5       14 11 5.2 11 11 7.6 6 6.4 6.2 14
6       10 14.2 13.9 10 13 13.5 10 12 12.5 14
130

Simple Random Sampling

The six samples yielded the following sample means:

Sample  Average Expenditure (in 000s)
1       8.75
2       11.36
3       9.19
4       9.89
5       9.24
6       12.31
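The six sample means can be recomputed directly from the data on the previous slide, which also shows how much the estimate varies from one sample to the next. This is a small sketch; the dictionary name `samples` is illustrative.

```python
from statistics import mean

# Six samples of size 10 (in 000s), as given on the previous slide.
samples = {
    1: [6, 9, 12, 10.4, 4, 9.3, 8, 8.8, 12, 8],
    2: [17.3, 9, 7.9, 11, 12, 11.2, 8, 10, 11.2, 16],
    3: [10, 7, 12.4, 6, 15, 6.6, 14, 5, 8.9, 7],
    4: [9, 10.4, 11, 10.6, 8, 11, 8, 10.9, 11, 9],
    5: [14, 11, 5.2, 11, 11, 7.6, 6, 6.4, 6.2, 14],
    6: [10, 14.2, 13.9, 10, 13, 13.5, 10, 12, 12.5, 14],
}

# Sample mean of each sample, rounded to two decimals.
sample_means = {k: round(mean(v), 2) for k, v in samples.items()}

for k, m in sample_means.items():
    print(f"Sample {k}: {m}")
```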
131

Central Limit Theorem

• Suppose the population has mean μ and standard deviation σ. Then, if the sample size n is large enough, the distribution of the sample mean will have an approximately normal shape, its centre will be the mean of the original population, μ, and its standard deviation will be σ divided by the square root of n.
132

Central Limit Theorem: Example

• Exercise
Consider the set of numbers from 1 to 100. Draw samples of size 2, 3, 4, …, etc.
from it and compute the sample means.
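The exercise can be tried out by simulation. The sketch below assumes sampling with replacement, a fixed random seed and 20000 repetitions per sample size; none of these choices come from the slides. It compares the spread of the simulated sample means with σ/√n.

```python
import random
from statistics import mean, pstdev

population = list(range(1, 101))
mu = mean(population)        # population mean
sigma = pstdev(population)   # population standard deviation

rng = random.Random(42)      # fixed seed so the run is repeatable

def simulate_sample_means(n, repetitions=20000):
    """Draw `repetitions` samples of size n (with replacement) and return their means."""
    return [sum(rng.choices(population, k=n)) / n for _ in range(repetitions)]

for n in (2, 5, 30):
    means = simulate_sample_means(n)
    # Columns: n, mean of sample means, s.d. of sample means, sigma / sqrt(n)
    print(n, round(mean(means), 2), round(pstdev(means), 2), round(sigma / n**0.5, 2))
```

Plotting a histogram of `means` for increasing n is a natural next step to see the shape approach the normal curve.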
133

Central Limit Theorem (CLT)

Let S1, S2, ….., Sk be samples, each of size n, drawn from an independent and identically distributed population with mean µ and standard deviation σ.
According to the CLT, the means of S1, S2, ….., Sk follow an approximately normal distribution with mean µ and standard deviation σ/√n for large values of n.
Independent and identically distributed implies that the random variables are mutually independent and follow the same probability distribution.
134
Central Limit Theorem

The average annual stipend in SIP for a B-school in Eastern India is ₹ 82000 with standard deviation ₹ 5000. A random sample of 36 students is selected from the population. What is the standard error of the mean? What is the probability that the sample mean is less than ₹ 80000?
135

Solution

• Standard error of the mean = standard deviation of the sample mean = 5000/√36 = 833.33
• P(sample mean < 80000) = P(Z < (80000 - 82000)/833.33) = P(Z < -2.4) ≈ 0.0082
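Both parts of the question can be evaluated numerically. The sketch below assumes, as the Central Limit Theorem suggests for n = 36, that the sample mean is approximately normal with mean 82000 and standard deviation equal to the standard error.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 82000, 5000, 36

# Standard error of the mean: sigma / sqrt(n)
standard_error = sigma / sqrt(n)

# Approximate sampling distribution of the sample mean (CLT)
sample_mean_dist = NormalDist(mu=mu, sigma=standard_error)
p_below_80000 = sample_mean_dist.cdf(80000)

print(round(standard_error, 2), round(p_below_80000, 4))
```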
136

Statistical Inference
137

Statistical Inference

Two Types of Inference Problems

• Estimation: Point Estimation, Interval Estimation
• Hypothesis Testing of a population parameter


138

Point Estimation

Unbiasedness Efficiency Consistency


139
140

Interval Estimation

• Confidence Interval for Population mean when the population s.d. is known or the sample size is large
• Based on the Central Limit Theorem: the sampling distribution of the sample mean follows an approximately normal distribution
141
Interval Estimation

• A random sample of 36 students is selected from the population of all B-school students in Eastern India and their stipends during their summer internships are noted. The sample mean is found to be 78000. The population standard deviation is assumed to be 10000. Find a 90% confidence interval for the true (population) mean of the summer internship stipend.
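With a known population standard deviation, the usual interval is sample mean ± z·σ/√n, where z is the standard normal value leaving 5% in each tail for 90% confidence. The following sketch evaluates it; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

sample_mean, sigma, n = 78000, 10000, 36
confidence = 0.90

# Two-sided critical value: leaves (1 - confidence) / 2 in each tail.
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

margin = z * sigma / sqrt(n)
lower, upper = sample_mean - margin, sample_mean + margin

print(round(z, 3), round(lower, 1), round(upper, 1))
```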

142
143

Interval Estimation

• Confidence Interval for Population mean when population s.d. unknown

• Here, the degrees of freedom is ν = n-1, where n is the sample size


144
Interval Estimation

• A random sample of 15 students is selected from the population of all B-school students in Eastern India and their stipends during their summer internships are noted. The sample mean is found to be 81000. The population standard deviation is unknown. Find a 90% confidence interval for the true (population) mean of the summer internship stipend. The sample s.d. is 8000.
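When the population s.d. is unknown, the interval becomes sample mean ± t·s/√n with n-1 = 14 degrees of freedom. The Python standard library has no t-distribution, so the sketch below hard-codes the critical value 1.761 read from a standard t-table (upper 5% point, 14 d.f.); that constant is an input taken from tables, not something computed here.

```python
from math import sqrt

sample_mean, sample_sd, n = 81000, 8000, 15

# Upper 5% point of the t-distribution with n - 1 = 14 degrees of freedom,
# taken from a standard t-table (gives a two-sided 90% interval).
t_crit = 1.761

margin = t_crit * sample_sd / sqrt(n)
lower, upper = sample_mean - margin, sample_mean + margin

print(round(margin, 1), round(lower, 1), round(upper, 1))
```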
145
146
147
Hypothesis Testing
148
What is hypothesis?

• Suppose a group of subjects has been tested for a particular drug to find whether blood pressure increases after intake of the drug. The blood pressure of the subjects was measured before and after introduction of the drug.
• It is claimed that approximately 90% of those who pursue an MBA in premier B-schools have an engineering background. A sample of 100 students across different premier B-schools was collected to test this claim.

149
What is hypothesis?

• In the limited-overs format, the ICC (cricket’s governing body) decided in May 2015 to get rid of the batting power play so as to allow the bowlers a little more breathing space in a format that has been largely dominated by batsmen.
• The economy rates of bowlers are taken both before and after the introduction of the rule to test the claim.

150
151

What is hypothesis?

We define hypothesis as:
“A claim or statement regarding the population parameter which may or may not be true but requires verification from a randomly drawn sample”
152

Hypothesis Testing

• Null Hypothesis
• Dictionary Definition of Null Hypothesis
• “ the hypothesis that there is no significant difference between specified populations,
any observed difference being due to sampling or experimental error”
• Null Hypothesis specifies a population parameter of interest and proposes a value for the same
• It symbolizes status quo and is denoted by H0
153

Hypothesis Testing

• For instance, the average height of MBA students in XLRI is 166 cm

• Mathematically, H0: μ=166 where μ is the average height in cm

• In hypothesis testing we check whether the null hypothesis is “plausible”


• Alternative Hypothesis
• A Null hypothesis is tested against some alternative hypothesis
154

Hypothesis Testing Examples

• A large distributor of automobile parts has been able to maintain the average duration of receipts of interest-free credit allowed by the wholesalers at 20 days since the inception of his company. Following some regulatory changes in the past months, he noticed that for a randomly selected sample of 75 wholesalers, the average duration of receipts of interest-free credit has become 22 days with a sample standard deviation of 2.5 days. Has the regulatory change increased the average credit days, at the 5% level of significance?

• For the given problem, state the null hypothesis.


• Also state the alternate hypothesis.
155

Hypothesis Testing Examples

• It is claimed that increasing the advertisement budget helps in boosting average daily sales. The average daily sales figure prior to the increase in advertisement expenditure was found to be 2500 units. The average daily sales after the increase in advertisement expenditure was found to be 2650 units over 25 days, with a sample standard deviation of 100. Does increasing the advertisement budget help in boosting average daily sales at the 5% level of significance?

• For the given problem, state the null hypothesis.


• Also state the alternate hypothesis.
156

Hypothesis Testing

• Parametric Space
• Let θ1, θ2, ….., θk be the set of unknown population parameters
• The parameter space Pk for the ordered k-tuple (θ1, θ2, ….., θk) of real numbers is the subset of ℝ^k in which each θi is restricted to its set of admissible values

• For instance, take the Normal Distribution N(μ, σ)

• The parametric space in this case is (-∞, ∞) × (0, ∞), since σ must be positive
157

Simple vs Composite Hypothesis

• Consider a parameter θ ∈ Θ, where Θ is the parameter space
• If, under a hypothesis, the parameter space for θ is a singleton set, the hypothesis is called a simple hypothesis; otherwise it is called a composite hypothesis

• For instance, a hypothesis of the form H0: θ = 6 is a simple hypothesis whereas H0: θ > 6 is a composite hypothesis
158

Critical Region and Hypothesis Testing

• Consider any population from which a sample is drawn

• We want to test the following Null hypothesis regarding the parameter θ of the population:
• H0: θ=θ0

• From the sample we can compute some statistic T

• We can state a test rule that partitions all the possible values of the statistic T into two mutually exclusive and exhaustive sets, such that if the value of T falls into one region, we reject the Null hypothesis, and if it falls into the other region, we fail to reject the Null hypothesis
159

Critical Region and Hypothesis Testing

• The region of the values of the statistic defined by the test rule where we reject the Null hypothesis is called the region of rejection or critical region, while the other region is called the region of acceptance
160

Type I and Type II Errors

                    Situation
Decision            H0 is true     H0 is false
H0 is rejected      Type I Error   No Error
H0 is not rejected  No Error       Type II Error


161

Type I and Type II Errors

The probabilities of committing Type I and Type II errors are respectively denoted by α and β

To formulate a good test rule, ideally, we need to minimize the probability of committing both these errors

Simultaneous minimization of both α and β is not possible ….. Why?
162

Type I and Type II Errors

For all practical purposes we fix α and then try to minimize β

The level at which α is fixed is called the level of significance

The value 1-β is called the Power of a test; since we try to minimize β, i.e., maximize 1-β, a test rule achieving this purpose is called the most powerful test
163

Example

• A die has six faces 1,2,3,4,5,6

• Consider the Null Hypothesis, H0: The die is fair, against the alternative H1: It is loaded in favour of the larger numbers

• To test the Null hypothesis, the following experiment was conducted:
The die was rolled twice and a total score (say S) is obtained
A test rule is devised in the following manner: Reject H0 if S≥10, else don’t reject H0
164

Example

• Find the probability of Type I error

• Find the probability of Type II error against the alternative H1: {P(X=1,2,3)=1/9 and P(X=4,5,6)=2/9}, where X is a random variable denoting the face obtained after the die is rolled
165

Solution
Dice 2 \ Dice 1   1   2   3   4   5   6
1                 2   3   4   5   6   7
2                 3   4   5   6   7   8
3                 4   5   6   7   8   9
4                 5   6   7   8   9   10
5                 6   7   8   9   10  11
6                 7   8   9   10  11  12
166

Solution

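• Here α = P(S≥10 | H0) = 6 × 1/36 = 1/6
• Under H1, each of the six outcomes with S≥10 has probability 2/9 × 2/9 = 4/81, so β = P(S<10 | H1) = 1 - 6 × 4/81 = 57/81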
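The two error probabilities can be verified by enumerating all pairs of faces under each hypothesis. The sketch below uses exact fractions; the function name `prob_reject` is illustrative.

```python
from fractions import Fraction
from itertools import product

# Face probabilities under H0 (fair die) and under the stated alternative H1.
fair = {face: Fraction(1, 6) for face in range(1, 7)}
loaded = {1: Fraction(1, 9), 2: Fraction(1, 9), 3: Fraction(1, 9),
          4: Fraction(2, 9), 5: Fraction(2, 9), 6: Fraction(2, 9)}

def prob_reject(face_probs, threshold=10):
    """P(S >= threshold) for two independent rolls with the given face probabilities."""
    return sum(face_probs[a] * face_probs[b]
               for a, b in product(face_probs, repeat=2)
               if a + b >= threshold)

alpha = prob_reject(fair)        # Type I error: reject H0 when the die is fair
beta = 1 - prob_reject(loaded)   # Type II error: fail to reject H0 under H1

print(alpha, beta)
```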
167

Example

• Let p denote the probability of obtaining a head when a coin is tossed once. Consider the Null hypothesis, H0: p=0.5, as against the alternative hypothesis H1: p=0.6. The Null hypothesis is rejected if two consecutive heads are obtained when the coin is tossed twice. Compute the probability of Type I and Type II errors.
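For this exercise, the rejection event is "two heads in two tosses", so α is its probability under H0 and β is one minus its probability under H1. The sketch below assumes the two tosses are independent; the helper name is illustrative.

```python
from fractions import Fraction

def prob_two_heads(p):
    """Probability of two heads in two independent tosses with P(head) = p."""
    return p * p

p_null = Fraction(1, 2)   # H0: p = 0.5
p_alt = Fraction(3, 5)    # H1: p = 0.6

alpha = prob_two_heads(p_null)      # reject H0 although H0 is true
beta = 1 - prob_two_heads(p_alt)    # fail to reject H0 although H1 is true

print(float(alpha), float(beta))
```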
168

Hypothesis Testing
169
Example 1
• An app based food ordering firm, FoodAnytime wants to launch some targeted offers to the customers who will
be availing their services on specific days of the week so as to boost up customer demand for their services on
those specific days. It first wants to find whether customers have a preference for their services more during the
weekends as compared to the weekdays. It collected the data pertaining to the total number of daily orders for
a period of 50 days selected at random for a city. Test at 5% significance level whether the demand for the services
of FoodAnytime is higher during the weekends as compared to the weekdays. If the demand during the weekends
is found to be significantly higher than that during weekdays, then FoodAnytime will launch the targeted offers
during the weekdays to attract more customers to avail their services.
• Further, FoodAnytime is also interested in finding whether the demand for their services is lower on any particular
day during the weekdays so that even better offers can be launched to attract the customers. Test at 5%
significance level, whether the demand for the services of FoodAnytime is different for different days of the week.
170

t-test: Comparison of Means

where (when sample sizes equal)


where (when sample sizes unequal)

Assumptions
• x̄ follows a normal distribution with mean μ and variance σ2/n
• (n-1)s2/σ2 follows a χ2 distribution with n-1 degrees of freedom
• Z and s are independent, where Z = (x̄ - μ)/(σ/√n)
171
172

t-test: Comparison of Means (Assuming equal variance)

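The pooled estimator of the variance is given as
sp2 = [(n1 - 1)s12 + (n2 - 1)s22] / (n1 + n2 - 2), where s12 and s22 are the two sample variances

The t-statistic is given as
t = (x̄1 - x̄2) / (sp √(1/n1 + 1/n2))

which has a t distribution with n1 + n2 - 2 degrees of freedom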
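The pooled-variance calculation can be sketched as below, using the standard form of the pooled estimator and of the two-sample t-statistic under equal variances. The function name `pooled_t` and the two tiny samples are made up for illustration and are not data from the slides.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(sample1, sample2):
    """Return (t statistic, degrees of freedom) for the equal-variance two-sample t-test."""
    n1, n2 = len(sample1), len(sample2)
    s1_sq, s2_sq = variance(sample1), variance(sample2)   # sample variances (n - 1 divisor)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / df
    t = (mean(sample1) - mean(sample2)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df

# Made-up numbers, purely to show the call.
t_stat, df = pooled_t([1, 2, 3], [2, 3, 4])
print(round(t_stat, 4), df)
```

The resulting statistic would then be compared with the t-table value for `df` degrees of freedom at the chosen level of significance.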


173
Example 2
• Bell Products Limited is a manufacturer of office stationery. It is planning to introduce a new type of text liner where the stains left by it on the paper dry up quicker than with their existing product in the market. The firm is also thinking of pricing the product significantly higher than their existing product. Still, some product managers are insisting on increasing the price of the new offering as there could be chances of cannibalization of their existing text liner. In order to find the impact of cannibalization, Bell Products has identified six different stores in a geographical market and collected the daily sales data of their existing product (for a period of 20 days), both prior to and post the launch of their new product only in those three stores. The following table shows the sales figures for the said period of 20 days (10 days prior to the launch and 10 days post the launch).

• The marketing team needs to analyse the data to test whether the sales of the existing product have gone down significantly post the launch of the new product. In such a case, the team will recommend to senior management that the price of the new offering be increased.
174

Example 3

• Ajanta Foods is a key player in the segment of processed foods in the Western part of India. It
wants to penetrate in the neighbouring regions where its presence is limited. For the same, the
company has hired three MBA graduates for the role of sales. Three different regions are
allocated to these three different individuals. It is assumed that the regions are similar
demographically as well as on the socio-economic parameters. The company wishes to evaluate
the performance of these MBA graduates on the basis of the sales that has been generated in
these regions. For the same purpose, the daily sales figures across these three regions are
obtained. The objective is to find whether there is a significant difference in the performance of
the three individuals in terms of the average daily sales. If one of the MBA graduates is
outperforming the rest, then the person will be rewarded. If there is no significant difference, then
no extra compensation will be provided.
177

Hypothesis Testing Examples

• The wages of factory workers are assumed to be Normally distributed with mean ‘m’ and variance 25. A random sample of 60 workers gives total wages equal to 3350 units. Test the hypothesis H0: m=52 against the alternative H1: m>52 at the 1% and 5% levels of significance.
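Because the population variance is given, this is a one-sided z-test: z = (x̄ - 52)/(σ/√n), with H0 rejected when z exceeds the upper critical value. The sketch below carries out the arithmetic; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n, total_wages = 60, 3350
sigma = sqrt(25)          # population s.d. from the given variance
m0 = 52                   # value of m under H0

sample_mean = total_wages / n
z = (sample_mean - m0) / (sigma / sqrt(n))

std_normal = NormalDist()
for alpha in (0.05, 0.01):
    critical = std_normal.inv_cdf(1 - alpha)   # upper-tail critical value
    decision = "reject H0" if z > critical else "fail to reject H0"
    print(f"alpha={alpha}: z={z:.3f}, critical={critical:.3f} -> {decision}")

p_value = 1 - std_normal.cdf(z)
```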
178

Hypothesis Testing Examples

• A manufacturer wants to test whether the hourly output rate of the newly purchased machine is 60. It is however known that such machines have a standard deviation of hourly output of 10. He allowed the machine to be operated for some time and found that 1910 units were produced in 2000 minutes. Test whether the observed output is consistent with the initial claim at the 5% level of significance.
179
Example 4
• FinoTech is an Indian IT firm based out of Bangalore. It serves its clients primarily in the Banking and Financial Services (BFS) sector. Last year it signed a contract with a leading US-based firm in the BFS sector under which any IT-related issues are raised in the form of tickets. The tickets have different levels of criticality: Category A tickets are the most critical and need to be addressed urgently, i.e. within 2 hours of being issued; Category B tickets are less critical than Category A and have to be resolved within 24 hours of being raised; and Category C tickets are the least critical and have to be resolved within one week. As per the contract, FinoTech has to ensure that a service level of 95% is achieved for Category A tickets and an overall service level of 90% is achieved across all tickets taken together. The project manager has to assign the responsibility of handling these tickets to the different associates reporting to him based on their criticality. The associates handling these tickets are either posted at the client’s site or provide the services from the offshore office.
• Can we say that associates at the client’s location are more efficient in handling Category A tickets?
• Can we say that the onsite associates get to handle fewer Category B and Category C tickets than their offshore counterparts?
• Test at 5% level of significance.
180
181

R codes

• t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...)
182

Advanced Topics
183

Sampling Distribution

• A box contains 5 balls with weights in certain units. These balls have weights as 1, 2, 3, 4
and 5. A simple random sample of 2 balls is drawn from the box without replacement. Let
x1 and x2 be the weights of the balls in the sample.

• Find the sampling distribution of the sample mean x̄ and of the sample variance s2

• Find the expected value of x̄ and the variance of x̄
• Find the expected value of s2
184
Solution

x1 x2 x̄ s2
1 2 1.5 0.5
1 3 2 2
1 4 2.5 4.5
1 5 3 8
2 3 2.5 0.5
2 4 3 2
2 5 3.5 4.5
3 4 3.5 0.5
3 5 4 2
4 5 4.5 0.5
185

Solution

• Sampling distribution of x̄
• P(x̄=1.5)=0.1
• P(x̄=2)=0.1
• P(x̄=2.5)=0.2
• P(x̄=3)=0.2
• P(x̄=3.5)=0.2
• P(x̄=4)=0.1
• P(x̄=4.5)=0.1

• Sampling distribution of s2
• P(s2=0.5)=0.4
• P(s2=2)=0.3
• P(s2=4.5)=0.2
• P(s2=8)=0.1
186

Solution

• Expected value of x̄ = 3
• Variance of x̄ = 0.75
• Expected value of s2 = 2.5
• Population mean is 3 and population variance is 2
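These figures can be checked by listing every possible sample of 2 balls drawn without replacement and treating the 10 samples as equally likely. The sketch below uses exact fractions; the names `mean_dist` and `var_dist` are illustrative.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import combinations

weights = [1, 2, 3, 4, 5]
samples = list(combinations(weights, 2))   # 10 equally likely samples (SRSWOR)
prob = Fraction(1, len(samples))

mean_dist = defaultdict(Fraction)   # sampling distribution of the sample mean
var_dist = defaultdict(Fraction)    # sampling distribution of the sample variance

for s in samples:
    m = Fraction(sum(s), len(s))
    s2 = sum((x - m) ** 2 for x in s) / (len(s) - 1)   # n - 1 divisor
    mean_dist[m] += prob
    var_dist[s2] += prob

e_mean = sum(v * p for v, p in mean_dist.items())
var_mean = sum(v**2 * p for v, p in mean_dist.items()) - e_mean**2
e_s2 = sum(v * p for v, p in var_dist.items())

print(float(e_mean), float(var_mean), float(e_s2))
```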
187

Example

• A box contains 4 balls with weights in certain units. These balls have weights as 1, 2, 3 and 4. A simple random sample of 2 balls is drawn from the box with replacement. Let x1 and x2 be the weights of the balls in the sample.

• Find the sampling distribution of the sample mean x̄ and of the sample variance s2

• Find the expected value of x̄ and the variance of x̄
• Find the expected value of s2
188
Solution
x1 x2 x̄ s2
1 1 1 0
1 2 1.5 0.5
1 3 2 2
1 4 2.5 4.5
2 1 1.5 0.5
2 2 2 0
2 3 2.5 0.5
2 4 3 2
3 1 2 2
3 2 2.5 0.5
3 3 3 0
3 4 3.5 0.5
4 1 2.5 4.5
4 2 3 2
4 3 3.5 0.5
4 4 4 0
189

Solution

• Sampling distribution of x̄
• P(x̄=1)=0.0625
• P(x̄=1.5)=0.125
• P(x̄=2)=0.1875
• P(x̄=2.5)=0.25
• P(x̄=3)=0.1875
• P(x̄=3.5)=0.125
• P(x̄=4)=0.0625

• Sampling distribution of s2
• P(s2=0)=0.25
• P(s2=0.5)=0.375
• P(s2=2)=0.25
• P(s2=4.5)=0.125
190

Solution

• Expected value of x̄ = 2.5
• Variance of x̄ = 0.625
• Expected value of s2 = 1.25
• Also, μ=2.5 and σ2=1.25
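For sampling with replacement, the 16 ordered pairs are equally likely, and enumerating them gives the moments of x̄ and s2 directly. In this sketch the variance of x̄ works out to σ2/n = 1.25/2; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

weights = [1, 2, 3, 4]
samples = list(product(weights, repeat=2))   # 16 equally likely ordered samples (SRSWR)
prob = Fraction(1, len(samples))

means = [Fraction(a + b, 2) for a, b in samples]
# For n = 2 the sample variance (n - 1 divisor) simplifies to (x1 - x2)^2 / 2.
variances = [Fraction((a - b) ** 2, 2) for a, b in samples]

e_mean = sum(m * prob for m in means)
var_mean = sum(m**2 * prob for m in means) - e_mean**2
e_s2 = sum(v * prob for v in variances)

# Population mean and variance (N divisor) for comparison.
mu = Fraction(sum(weights), len(weights))
sigma_sq = sum((w - mu) ** 2 for w in weights) / len(weights)

print(float(e_mean), float(var_mean), float(e_s2), float(mu), float(sigma_sq))
```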
