Ca 1 Merged
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE
1
Objective of the course
• The principal focus of this course is to build conceptual understanding
using simple and practical examples rather than a repetitive, point-and-click
mentality
• This course should make you comfortable using analytics in your career
and your life
• You will know how to work with real data; you may have learned many
different methodologies, but choosing the right methodology is important
2
Objective of the course Contd…
3
Learning objectives
1. Define data and its importance
2. Define data analytics and its types
3. Explain why analytics is important in today’s business environment
4. Explain how statistics, analytics and data science are interrelated
5. Why python?
6. Explain the four different levels of Data:
– Nominal
– Ordinal
– Interval and
– Ratio
4
1. Define Data and its importance
5
1.1 Variable, Measurement and Data
6
1.2 What is generating so much data?
7
1.3 How data add value to business?
Data warehouse
Business value
Source:https://datajobs.com/
8
Data Products
9
1.4 Why Data is important?
10
2. Define data analytics and its types
• Define data analytics
• Data analysis
11
2.1. Define data analytics
12
2.2 Why analytics is important?
13
2.3 Data analysis
• Data analysis is the process of examining, transforming, and
arranging raw data in a specific way to generate useful
information from it
• Data analysis allows for the evaluation of data through
analytical and logical reasoning to lead to some sort of
outcome or conclusion in some context
• Data analysis is a multi-faceted process that involves a
number of steps, approaches, and diverse techniques
14
2.4 Data analytics vs. Data analysis
Analysis
• Past
• Explain: How? Why?
15
2.4 Data analytics vs. Data analysis
Analytics
• Future
16
2.4 Data analytics vs. Data analysis
Analytics
• Qualitative = Intuition + analysis
• Quantitative = Formulas + algorithms
17
Analysis
Quantitative
ll
Qualitative Data + how the sale decreased last summer
ll
18
Analysis ≠ Analytics
Data Analysis ≠ Data Analytics
19
2.5 Classification of Data analytics
Based on the phase of workflow and the kind of analysis required, there are
four major types of data analytics.
• Descriptive analytics
• Diagnostic analytics
• Predictive analytics
• Prescriptive analytics
20
Classification of Data analytics
https://www.governanceanalytics.org/knowledge-
base/Main_Tools/Data_classification_and_analysis
21
Descriptive Analytics
• Descriptive analytics is the conventional form of Business Intelligence and
data analysis
• It seeks to provide a depiction or “summary view” of facts and figures in
an understandable format
• It either informs or prepares data for further analysis
• Descriptive analysis or statistics can summarize raw data and convert it
into a form that can be easily understood by humans
• It can describe in detail an event that has occurred in the past
22
Example
Common examples of Descriptive Analytics are company reports that simply
provide a historic review, such as:
• Data Queries
• Reports
• Descriptive Statistics
• Data Visualization
• Data dashboard
Source: https://www.linkedin.com/learning/478e9692-d13d-338f-907e-d76f0724d773
23
Diagnostic analytics
24
Example
1. Data Discovery
2. Data Mining
3. Correlations
25
Predictive analytics
26
Source: https://www.logianalytics.com/wp-content/uploads/2017/11/predictive-1.png
27
Example
• A set of techniques that use models constructed from past data to predict
the future or ascertain the impact of one variable on another:
1. Linear regression
2. Time series analysis and forecasting
3. Data mining
Source: https://bigdata-madesimple.com/5-examples-predictive-analytics-travel-industry/
28
Prescriptive analytics
29
Prescriptive analytics: Example
• Optimization Model
• Simulation
• Decision Analysis
30
3. Explain why analytics is important
31
3. Explain why analytics is important
Data Scientist
Search Trends
Statistician, Operations Researcher
32
https://timesofindia.indiatimes.com/india/Data-scientists-earning-more-than-
CAs-engineers/articleshow/52171064.cms
33
3.1 Demand for Data Analytics
http://timesofindia.indiatimes.com/articleshow/52171064.cms?utm_source=
contentofinterest&utm_medium=text&utm_campaign=cppst
34
3.2 Element of data Analytics
35
4. Data analyst and Data scientist
36
4.1 The requisite skill set
Data Science lies at the intersection of:
• Mathematics expertise
• Technology; hacking skill
• Business and strategy acumen
37
4.2 Difference between Data analyst and Data Scientist
Business Administration
Analyst
Domain-specific responsibility, for example marketing analyst, financial analyst, etc.
Data Scientist
Advanced algorithms and machine learning
Source:https://datajobs.com/
40
5. Why python?
Features
• Simple and easy to learn
• Freeware and Open source
• Interpreted
• Dynamically Typed
• Extensible
• Embedded
• Extensive library
41
5. Why python?
Usability
• Desktop and web applications
• Database applications
• Networking applications
• Data analysis (Data Science)
• Machine learning
• IoT and AI applications
• Games
42
Companies using Python
43
Why Jupyter NoteBook?
Why?
• Client – Server Application
• Edit code on web browser
• Easy in documentation
• Easy in demonstration
• User- friendly Interface
44
6. Explain the four different levels of Data
• Types of Variables
• Levels of Data Measurement
• Compare the four different levels of Data:
Nominal
Ordinal
Interval and
Ratio
• Usage Potential of Various Levels of Data
• Data Level, Operations, and Statistical Methods
45
6.1 Types of Variables
Data
• Categorical (defined categories)
  – Examples: Marital Status, Political Party, Eye Color
• Numerical
  – Discrete (counted items)
    Examples: Number of Children, Defects per hour
  – Continuous (measured characteristics)
    Examples: Weight, Voltage
6.2 Levels of Data Measurement
47
6.3.1 Nominal
48
6.3.2 Ordinal scale
49
6.3.3. Interval scale
50
6.3.4 Ratio scale
51
6.4 Usage Potential of Various
Levels of Data
Ratio
Interval
Ordinal
Nominal
52
6.5 Impact of choice of measurement scale
Data Level | Meaningful Operations | Statistical Methods
53
Thank You
54
Data Analytics with Python
Lecture 2: Python – Fundamentals
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE
1
Learning objectives
1. Installing Python
2. Fundamentals of Python
3. Data Visualisation
2
Python Installation Process
Installation Process –
3
Why Jupyter NoteBook?
Why?
• Edit code on web browser
• Easy in documentation
• Easy in demonstration
• User- friendly Interface
17
Python and Jupyter
18
19
About Jupyter NoteBook
20
Fundamentals of Python
25
26
Loading a simple delimited data file
27
28
• head method shows us only the first 5 rows
29
Get the number of rows and columns
30
get column names
31
get the dtype of each column
32
Pandas Types Versus Python Types
33
get more information about data
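The loading and inspection steps on the preceding slides can be sketched as follows. The slides do not show the data file itself, so this sketch assumes a tab-delimited file with the columns country, continent, year, and lifeExp, and builds a tiny in-memory stand-in with `io.StringIO`. The country names and values are made up for illustration only.

```python
import io

import pandas as pd

# Hypothetical stand-in for a tab-delimited data file.
# The rows are illustrative values, not data from the lecture.
raw = io.StringIO(
    "country\tcontinent\tyear\tlifeExp\n"
    "Aland\tAsia\t1952\t40.0\n"
    "Aland\tAsia\t1957\t42.0\n"
    "Bland\tEurope\t1952\t60.0\n"
    "Bland\tEurope\t1957\t62.0\n"
    "Cland\tAfrica\t1952\t35.0\n"
    "Cland\tAfrica\t1957\t37.0\n"
)

# Load a simple delimited file; sep="\t" says the delimiter is a tab.
df = pd.read_csv(raw, sep="\t")

print(df.head())     # head shows only the first 5 rows
print(df.shape)      # (number of rows, number of columns)
print(df.columns)    # column names
print(df.dtypes)     # dtype of each column
df.info()            # more information about the data
```

With a real file, the `raw` object would be replaced by the file path passed to `pd.read_csv`.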
34
Looking at Columns, Rows, and Cells
35
# show the first 5 observations
36
# show the last 5 observations
37
# Looking at country, continent, and year
38
39
Data Analytics with Python
Lecture 3: Python – Fundamentals - II
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE
1
Looking at Columns, Rows, and Cells
2
get the first row
3
• # get the 100th row
# Python counts from 0
4
• get the last row
5
Subsetting Multiple Rows
6
Subset Rows by Row Number: iloc
7
• get the 100th row
8
• # using -1 to get the last row
9
With iloc, we can pass in the -1 to get the last row—something we couldn’t do with loc.
10
• # get the first, 100th, and 1000th rows
11
Subsetting Columns
12
• # subset columns with loc
# note the position of the colon
# it is used to select all rows
13
14
• # subset columns with iloc
• # iloc will allow us to use integers
• # -1 will select the last column
15
Subsetting Columns by Range
16
• # subset the dataframe with the range
17
Subsetting Rows and Columns
• # using loc
18
• # using iloc
19
Subsetting Multiple Rows and Columns
20
• if we use the column names directly,
# it makes the code a bit easier to read
# note now we have to use loc, instead of iloc
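The row and column subsetting described on these slides can be sketched as below. The code screenshots are not part of this text, so the small DataFrame here is a hypothetical stand-in with made-up values; only the `loc` and `iloc` calls themselves are the point.

```python
import pandas as pd

# Illustrative frame; the values are made up for this sketch.
df = pd.DataFrame(
    {
        "country": ["Aland", "Aland", "Bland", "Bland"],
        "continent": ["Asia", "Asia", "Europe", "Europe"],
        "year": [1952, 1957, 1952, 1957],
        "lifeExp": [40.0, 42.0, 60.0, 62.0],
    }
)

first_row = df.loc[0]            # loc selects by index label
last_row = df.iloc[-1]           # iloc selects by position; -1 is the last row
some_rows = df.iloc[[0, 2, 3]]   # several rows at once

# The colon before the comma selects all rows.
by_name = df.loc[:, ["year", "lifeExp"]]
by_position = df.iloc[:, [2, -1]]    # -1 selects the last column

# Rows and columns together: column names need loc, integers need iloc.
block = df.loc[[0, 3], ["country", "lifeExp"]]
print(block)
```

Using column names with `loc` tends to be easier to read than integer positions with `iloc`, at the cost of a little more typing.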
21
22
23
Grouped Means
• # For each year in our data, what was the average life
expectancy?
# To answer this question,
# we need to split our data into parts by year;
# then we get the 'lifeExp' column and calculate the mean
24
25
26
• If you need to “flatten” the dataframe, you can use the
reset_index method.
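The grouped mean and the `reset_index` step can be sketched as follows. The DataFrame is a hypothetical stand-in with made-up values; the column names year and lifeExp follow the slide text.

```python
import pandas as pd

# Illustrative data; values are made up for this sketch.
df = pd.DataFrame(
    {
        "year": [1952, 1952, 1957, 1957],
        "continent": ["Asia", "Europe", "Asia", "Europe"],
        "lifeExp": [40.0, 60.0, 42.0, 62.0],
    }
)

# Split the data by year, take the lifeExp column, and compute the mean.
mean_by_year = df.groupby("year")["lifeExp"].mean()
print(mean_by_year)

# reset_index turns the group labels back into an ordinary column.
flat = mean_by_year.reset_index()
print(flat)
```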
27
Grouped Frequency Counts
28
Basic Plot
29
30
Visual Representation of the Data
• Histogram -- vertical bar chart of frequencies
• Frequency Polygon -- line graph of frequencies
• Ogive -- line graph of cumulative frequencies
• Pie Chart -- proportional representation for categories of a whole
• Stem and Leaf Plot
• Pareto Chart
• Scatter Plot
31
Methods of visual presentation of data
• Table
32
Methods of visual presentation of data
• Graphs
[Figure: chart of quarterly values (1st Qtr to 4th Qtr) for East, West, and North; vertical axis 0 to 90]
33
Methods of visual presentation of data
• Pie chart
[Figure: pie chart with segments 1st Qtr, 2nd Qtr, 3rd Qtr, and 4th Qtr]
34
Methods of visual presentation of data
• Multiple bar chart
[Figure: multiple bar chart with quarters (1st Qtr to 4th Qtr) on one axis and values from 0 to 100 on the other]
35
Methods of visual presentation of data
• Simple pictogram
[Figure: pictogram of quarterly values (1st Qtr to 4th Qtr) for North, East, and West; vertical axis 0 to 100]
36
Frequency distributions
• Frequency tables
Observation Table
Class Interval    Frequency    Cumulative Frequency
< 20              13           13
< 40              18           31
< 60              25           56
< 80              15           71
< 100             9            80
37
Frequency diagrams
[Figures: Frequency and Cumulative Frequency diagrams plotted over the class intervals < 20, < 40, < 60, < 80, < 100]
38
Histogram
Class Interval    Frequency
20-under 30       6
30-under 40       18
40-under 50       11
50-under 60       11
60-under 70       3
70-under 80       1
[Figure: histogram of Frequency (0 to 20) against Years (0 to 80)]
39
Histogram Construction
Class Interval    Frequency
20-under 30       6
30-under 40       18
40-under 50       11
50-under 60       11
60-under 70       3
70-under 80       1
[Figure: histogram construction, Frequency (0 to 20) against Years (0 to 80)]
40
Frequency Polygon
Class Interval    Frequency
20-under 30       6
30-under 40       18
40-under 50       11
50-under 60       11
60-under 70       3
70-under 80       1
[Figure: frequency polygon of Frequency (0 to 20) against Years (0 to 80)]
41
Ogive
Class Interval    Cumulative Frequency
20-under 30       6
30-under 40       24
40-under 50       35
50-under 60       46
60-under 70       49
70-under 80       50
[Figure: ogive of Cumulative Frequency (0 to 60) against Years (0 to 80)]
42
Relative Frequency Ogive
Cumulative
43
Pareto Chart
[Figure: Pareto chart with categories Poor Wiring, Short in Coil, Defective Plug, and Other; left axis Frequency (0 to 100), right axis cumulative percentage (0% to 100%)]
44
Scatter Plot
Registered Vehicles (1000's)    Gasoline Sales (1000's Gallons)
5                               60
15                              120
9                               90
15                              140
7                               60
[Figure: scatter plot of Gasoline Sales against Registered Vehicles (0 to 20)]
45
Principles of Excellent Graphs
• The graph should not distort the data
• The graph should not contain unnecessary adornments (sometimes
referred to as chart junk)
• The scale on the vertical axis should begin at zero
• All axes should be properly labeled
• The graph should contain a title
• The simplest possible graph should be used for a given set of data
Graphical Errors: Chart Junk
[Figure: two charts of quarterly values (Q1 to Q4) illustrating chart junk]
Graphical Errors: No Zero Point on the Vertical Axis
[Figure: two charts of Monthly Sales ($) for January to June. Bad Presentation: the vertical axis starts at 36. Good Presentation: the vertical axis starts at 0.]
Dr. A. Ramesh
Department of Management Studies
1
Lecture objectives
• Central tendency
• Measures of Dispersion
2
Measures of Central Tendency
3
Summary statistics
4
Arithmetic Mean
• Commonly called ‘the mean’
• It is the average of a group of numbers
• Applicable for interval and ratio data
• Not applicable for nominal or ordinal data
• Affected by each value in the data set, including extreme values
• Computed by summing all values in the data set and dividing the sum by
the number of values in the data set
5
Population Mean
$\mu = \frac{\sum X}{N} = \frac{X_1 + X_2 + X_3 + \dots + X_N}{N}$
$= \frac{24 + 13 + 19 + 26 + 11}{5} = \frac{93}{5} = 18.6$
6
Sample Mean
$\bar{X} = \frac{\sum X}{n} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n}$
$= \frac{57 + 86 + 42 + 38 + 90 + 66}{6} = \frac{379}{6} = 63.167$
7
Mean of Grouped Data
• Weighted average of class midpoints
• Class frequencies are the weights
$\mu_{grouped} = \frac{\sum fM}{\sum f} = \frac{\sum fM}{N} = \frac{f_1 M_1 + f_2 M_2 + f_3 M_3 + \dots + f_i M_i}{f_1 + f_2 + f_3 + \dots + f_i}$
8
Calculation of Grouped Mean
Class Interval    Frequency (f)    Class Midpoint (M)    fM
20-under 30       6                25                    150
30-under 40       18               35                    630
40-under 50       11               45                    495
50-under 60       11               55                    605
60-under 70       3                65                    195
70-under 80       1                75                    75
Total             50                                     2150
$\mu = \frac{\sum fM}{\sum f} = \frac{2150}{50} = 43.0$
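The grouped-mean calculation above can be checked with a few lines of Python. This is a minimal sketch: the function name `grouped_mean` and the tuple layout are choices made here for illustration, not part of the lecture.

```python
# (lower limit, upper limit, frequency) for each class interval in the table
classes = [
    (20, 30, 6),
    (30, 40, 18),
    (40, 50, 11),
    (50, 60, 11),
    (60, 70, 3),
    (70, 80, 1),
]


def grouped_mean(classes):
    """Weighted average of class midpoints, with frequencies as weights."""
    total_f = sum(f for _, _, f in classes)
    total_fm = sum(f * (low + high) / 2 for low, high, f in classes)
    return total_fm / total_f


print(grouped_mean(classes))
```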
9
Weighted Average
$\text{Weighted Average} = \frac{\sum xw}{\sum w}$
where x is a data value and w is the weight assigned to that data
value. The sum is taken over all data values.
Example
Suppose your midterm test score is 83 and your final exam score is 95.
Using weights of 40% for the midterm and 60% for the final exam, compute
the weighted average of your scores. If the minimum average for an A is
90, will you earn an A?
$\text{Weighted Average} = \frac{83(0.40) + 95(0.60)}{0.40 + 0.60} = \frac{33.2 + 57}{1} = 90.2$
You will earn an A!
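The same weighted-average example can be sketched in Python. The helper name `weighted_average` is an illustrative choice.

```python
def weighted_average(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)


# Midterm 83 with weight 40%, final exam 95 with weight 60%.
score = weighted_average([83, 95], [0.40, 0.60])
print(round(score, 1))
print("A" if score >= 90 else "below A")
```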
Median
• Middle value in an ordered array of numbers
13
Median: Computational Procedure
• First Procedure
– Arrange the observations in an ordered array
– If there is an odd number of terms, the median is the middle term of the
ordered array
– If there is an even number of terms, the median is the average of the
middle two terms
• Second Procedure
– The median’s position in an ordered array is given by (n+1)/2.
14
Median: Example with an Odd Number of Terms
Ordered Array
3 4 5 7 8 9 11 14 15 16 16 17 19 19 20 21 22
• There are 17 terms in the ordered array.
• Position of median = (n+1)/2 = (17+1)/2 = 9
• The median is the 9th term, 15.
• If the 22 is replaced by 100, the median is 15.
• If the 3 is replaced by -103, the median is 15.
15
Median: Example with an Even Number of Terms
Ordered Array
3 4 5 7 8 9 11 14 15 16 16 17 19 19 20 21
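Both median examples can be checked with the standard library's `statistics.median`, which averages the two middle terms when the number of terms is even. The variable names are illustrative.

```python
import statistics

odd_array = [3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21, 22]
even_array = odd_array[:-1]   # the 16-term array from the second example

median_odd = statistics.median(odd_array)     # the 9th term
median_even = statistics.median(even_array)   # average of the 8th and 9th terms

# Extreme values do not move the median.
with_100 = statistics.median(odd_array[:-1] + [100])
with_minus_103 = statistics.median([-103] + odd_array[1:])

print(median_odd, median_even, with_100, with_minus_103)
```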
16
Median of Grouped Data
$\text{Median} = L + \frac{\frac{N}{2} - cf_p}{f_{med}}(W)$
Where:
L = the lower limit of the median class
cfp = cumulative frequency of class preceding the median class
fmed = frequency of the median class
W = width of the median class
N = total of frequencies
17
Median of Grouped Data -- Example
Class Interval    Frequency    Cumulative Frequency
20-under 30       6            6
30-under 40       18           24
40-under 50       11           35
50-under 60       11           46
60-under 70       3            49
70-under 80       1            50
N = 50
$Md = L + \frac{\frac{N}{2} - cf_p}{f_{med}}(W) = 40 + \frac{\frac{50}{2} - 24}{11}(10) = 40.909$
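A minimal sketch of the grouped-median formula follows. The function name and the tuple layout are assumptions made for illustration; the median class is taken to be the first class whose cumulative frequency reaches N/2.

```python
def grouped_median(classes):
    """classes: list of (lower, upper, frequency), sorted by lower limit."""
    n = sum(f for _, _, f in classes)
    cum_before = 0  # cumulative frequency of the classes before the current one
    for lower, upper, f in classes:
        if cum_before + f >= n / 2:
            width = upper - lower
            return lower + (n / 2 - cum_before) / f * width
        cum_before += f
    raise ValueError("no classes given")


classes = [
    (20, 30, 6),
    (30, 40, 18),
    (40, 50, 11),
    (50, 60, 11),
    (60, 70, 3),
    (70, 80, 1),
]
print(round(grouped_median(classes), 3))
```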
18
Mode
19
Mode -- Example
• The mode is 44
• There are more 44s
35 41 44 45
37 43 44 46
39 43 44 46
40 43 44 46
40 43 45 48
20
Mode of Grouped Data
• Midpoint of the modal class
• Modal class has the greatest frequency
21
22
Percentiles
• Measures of central tendency that divide a group of data into 100 parts
• Example: 90th percentile indicates that at most 90% of the data lie
below it, and at least 10% of the data lie above it
• The median and the 50th percentile have the same value
23
Percentiles: Computational Procedure
• Organize the data into an ascending ordered array
• Calculate the pth percentile location: $i = \frac{P}{100}(n)$
• Determine the percentile’s location and its value.
24
Percentiles: Example
• Raw Data: 14, 12, 19, 23, 5, 13, 28, 17
• Ordered Array: 5, 12, 13, 14, 17, 19, 23, 28
• Location of 30th percentile: $i = \frac{30}{100}(8) = 2.4$
25
Dispersion
26
Variability
27
Measures of Variability or dispersion
Common Measures of Variability
• Range
• Inter-quartile range
• Mean Absolute Deviation
• Variance
• Standard Deviation
• Z scores
• Coefficient of Variation
28
Range – ungrouped data
40 43 45 48
29
Quartiles
• Measures of central tendency that divide a group of data into four subgroups
30
Quartiles
Q1 Q2 Q3
31
Quartiles: Example
• Ordered array: 106, 109, 114, 116, 121, 122, 125, 129
• Q1: $i = \frac{25}{100}(8) = 2$, $Q_1 = \frac{109 + 114}{2} = 111.5$
• Q2: $i = \frac{50}{100}(8) = 4$, $Q_2 = \frac{116 + 121}{2} = 118.5$
• Q3: $i = \frac{75}{100}(8) = 6$, $Q_3 = \frac{122 + 125}{2} = 123.5$
32
Interquartile Range
$\text{Interquartile Range} = Q_3 - Q_1$
33
Deviation from the Mean
-4 +5
-8 +4
+3
0 5 10 15 20
34
Mean Absolute Deviation
$M.A.D. = \frac{\sum |X - \mu|}{N} = \frac{24}{5} = 4.8$
X      $X - \mu$    $|X - \mu|$
5      -8           +8
9      -4           +4
16     +3           +3
17     +4           +4
18     +5           +5
Total  0            24
35
Population Variance
• Average of the squared deviations from the arithmetic mean
$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{130}{5} = 26.0$
X      $X - \mu$    $(X - \mu)^2$
5      -8           64
9      -4           16
16     +3           9
17     +4           16
18     +5           25
Total  0            130
36
Population Standard Deviation
• Square root of the variance
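$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{130}{5} = 26.0$
$\sigma = \sqrt{\sigma^2} = \sqrt{26.0} = 5.1$
X      $X - \mu$    $(X - \mu)^2$
5      -8           64
9      -4           16
16     +3           9
17     +4           16
18     +5           25
Total  0            130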
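The mean absolute deviation, population variance, and population standard deviation for the five values used on these slides can be verified with the standard library. The variable names are illustrative.

```python
import statistics

data = [5, 9, 16, 17, 18]

mu = statistics.mean(data)

# Mean absolute deviation: average distance from the mean.
mad = sum(abs(x - mu) for x in data) / len(data)

# Population variance and standard deviation (divide by N, not n - 1).
variance = statistics.pvariance(data)
std_dev = statistics.pstdev(data)

print(mu, mad, variance, round(std_dev, 1))
```

For sample data, `statistics.variance` and `statistics.stdev` divide by n - 1 instead.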
37
Sample Variance
• Average of the squared deviations from the arithmetic mean
$S^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{663,866}{3} = 221,288.67$
X        $X - \bar{X}$    $(X - \bar{X})^2$
2,398    625              390,625
1,844    71               5,041
1,539    -234             54,756
1,311    -462             213,444
7,092    0                663,866
38
Sample Standard Deviation
• Square root of the sample variance
$S = \sqrt{S^2} = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$
39
Uses of Standard Deviation
• Indicator of financial risk
• Quality Control
– construction of quality control charts
– process capability studies
• Comparing populations
– household incomes in two cities
– employee absenteeism at two plants
40
Standard Deviation as an Indicator of Financial Risk
A 15% 3%
B 15% 7%
41
Lecture 5: Central Tendency and Dispersion- II
Dr. A. Ramesh
Department of Management Studies
1
The Empirical Rule… If the histogram is bell shaped
2
Empirical Rule
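Distance from the mean    Percentage of values within that distance
$\mu \pm 1\sigma$         68
$\mu \pm 2\sigma$         95
$\mu \pm 3\sigma$         99.7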
3
Chebysheff’s Theorem…Not often used because interval is very wide.
41
Coefficient of Variation
$C.V. = \frac{\sigma}{\mu}(100)$
5
Coefficient of Variation
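$\mu_1 = 29$, $\sigma_1 = 4.6$
$\mu_2 = 84$, $\sigma_2 = 10$
$C.V._1 = \frac{\sigma_1}{\mu_1}(100) = \frac{4.6}{29}(100) = 15.86$
$C.V._2 = \frac{\sigma_2}{\mu_2}(100) = \frac{10}{84}(100) = 11.90$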
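The two coefficients of variation can be reproduced with a one-line function. The function name is an illustrative choice.

```python
def coefficient_of_variation(sigma, mu):
    """Standard deviation as a percentage of the mean."""
    return sigma / mu * 100


cv1 = coefficient_of_variation(4.6, 29)
cv2 = coefficient_of_variation(10, 84)
print(round(cv1, 2), round(cv2, 2))
```

Although the second data set has the larger standard deviation, its spread relative to its mean is smaller.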
6
Variance and Standard Deviation
of Grouped Data
Population:
$\sigma^2 = \frac{\sum f(M - \mu)^2}{N}$
$\sigma = \sqrt{\sigma^2}$
Sample:
$S^2 = \frac{\sum f(M - \bar{X})^2}{n - 1}$
$S = \sqrt{S^2}$
7
Population Variance and Standard Deviation of
Grouped Data ($\mu = 43$)
$\sigma^2 = \frac{\sum f(M - \mu)^2}{N} = \frac{7200}{50} = 144$
$\sigma = \sqrt{\sigma^2} = \sqrt{144} = 12$
8
Measures of Shape
• Skewness
– Absence of symmetry
– Extreme values in one side of a distribution
• Kurtosis
– Peakedness of a distribution
– Leptokurtic: high and thin
– Mesokurtic: normal shape
– Platykurtic: flat and spread out
• Box and Whisker Plots
– Graphic display of a distribution
– Reveals skewness
9
Skewness
10
Skewness..
The skewness of a distribution is measured by comparing the relative positions
of the mean, median and mode.
• Distribution is symmetrical
• Mean = Median = Mode
• Distribution skewed right
• Median lies between mode and mean, and mode is less than mean
• Distribution skewed left
• Median lies between mode and mean, and mode is greater than
mean
11
Skewness
12
Coefficient of Skewness
$S = \frac{3(\mu - M_d)}{\sigma}$
• If S < 0, the distribution is negatively skewed (skewed to the left)
13
Coefficient of Skewness
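Case 1: $\mu_1 = 23$, $M_{d1} = 26$, $\sigma_1 = 12.3$
$S_1 = \frac{3(\mu_1 - M_{d1})}{\sigma_1} = \frac{3(23 - 26)}{12.3} = -0.73$
Case 2: $\mu_2 = 26$, $M_{d2} = 26$, $\sigma_2 = 12.3$
$S_2 = \frac{3(\mu_2 - M_{d2})}{\sigma_2} = \frac{3(26 - 26)}{12.3} = 0$
Case 3: $\mu_3 = 29$, $M_{d3} = 26$, $\sigma_3 = 12.3$
$S_3 = \frac{3(\mu_3 - M_{d3})}{\sigma_3} = \frac{3(29 - 26)}{12.3} = +0.73$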
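The coefficient of skewness for the three cases can be computed directly from the formula. The function name is an illustrative choice.

```python
def coefficient_of_skewness(mu, median, sigma):
    """S = 3 * (mean - median) / standard deviation."""
    return 3 * (mu - median) / sigma


s1 = coefficient_of_skewness(23, 26, 12.3)   # mean below the median
s2 = coefficient_of_skewness(26, 26, 12.3)   # mean equals the median
s3 = coefficient_of_skewness(29, 26, 12.3)   # mean above the median
print(round(s1, 2), round(s2, 2), round(s3, 2))
```

A negative value indicates a distribution skewed to the left, zero indicates symmetry, and a positive value indicates a distribution skewed to the right.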
Leptokurtic
Mesokurtic
Platykurtic
15
Box and Whisker Plot
– Median, Q2
– First quartile, Q1
– Third quartile, Q3
16
Box and Whisker Plot
Minimum Q1 Q2 Q3 Maximum
17
Skewness: Box and Whisker Plots, and Coefficient of
Skewness
S=0 S>0
S<0
18
THANK YOU
19
Lecture 6: Introduction to Probability
Dr. A. Ramesh
Department of Management Studies
1
Lecture objectives
2
Probability
• Probability is the numerical measure of the likelihood that an event will occur.
3
Range of Probability
1 Certain
.5
0 Impossible
4
Methods of Assigning Probabilities
5
Classical Probability
6
Classical Probability
$P(E) = \frac{n_e}{N}$
Where:
N = total number of outcomes
$n_e$ = number of outcomes in E
7
Relative Frequency Probability
8
Relative Frequency Probability
$P(E) = \frac{n_e}{N}$
Where:
N = total number of trials
$n_e$ = number of outcomes producing E
9
Subjective Probability
10
Probability - Terminology
• Experiment
• Event
• Elementary Events
• Sample Space
• Unions and Intersections
• Mutually Exclusive Events
• Independent Events
• Collectively Exhaustive Events
• Complementary Events
11
Experiment, Trial, Elementary Event, Event
• Experiment: a process that produces outcomes
– More than one possible outcome
– Only one outcome per trial
• Trial: one repetition of the process
• Elementary Event: cannot be decomposed or broken down into other
events
• Event: an outcome of an experiment
– may be an elementary event, or
– may be an aggregate of elementary events
– usually represented by an uppercase letter, e.g., A, E1
12
An Example Experiment
• Experiment: randomly select, without replacement, two families
from the residents of Tiny Town
• Elementary Event: the sample includes families A and C
• Event: each family in the sample has children in the household
• Event: the sample families own a total of four automobiles
Tiny Town Population
Family    Children in Household    Number of Automobiles
A         Yes                      3
B         Yes                      2
C         No                       1
D         Yes                      2
13
Sample Space
14
Sample Space: Roster Example
15
Sample Space: Tree Diagram for Random Sample of Two
Families
16
Sample Space: Set Notation for Random Sample of Two
Families
• S = {(x,y) | x is the family selected on the first draw, and y is the family
selected on the second draw}
• Concise description of large sample spaces
17
Sample Space
• Useful for discussion of general principles and concepts
18
Union of Sets
• The union of two sets contains an instance of each element of the two
sets.
$X = \{1, 4, 7, 9\}$
$Y = \{2, 3, 4, 5, 6\}$
$X \cup Y = \{1, 2, 3, 4, 5, 6, 7, 9\}$
19
Intersection of Sets
• The intersection of two sets contains only those elements common to the
two sets.
$X = \{1, 4, 7, 9\}$
$Y = \{2, 3, 4, 5, 6\}$
$X \cap Y = \{4\}$
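Python's built-in `set` type implements union and intersection directly, so the two examples above can be checked in a few lines.

```python
X = {1, 4, 7, 9}
Y = {2, 3, 4, 5, 6}

union = X | Y          # every element that appears in either set, once
intersection = X & Y   # only the elements common to both sets

print(sorted(union))
print(sorted(intersection))
```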
22
Collectively Exhaustive Events
E1 E2 E3
23
Complementary Events
• All elementary events not in the event ‘A’ are in its complementary event.
$P(\text{Sample Space}) = 1$
$P(A') = 1 - P(A)$
[Figure: Venn diagram of event A and its complement A' within the sample space]
24
Counting the Possibilities
• mn Rule
• Sampling from a Population with Replacement
• Combinations: Sampling from a Population without Replacement
25
mn Rule
26
Sampling from a Population with Replacement
27
Combinations
$\binom{N}{n} = \frac{N!}{n!(N - n)!} = \frac{1000!}{3!(1000 - 3)!} = 166,167,000$
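The combinations count can be confirmed with `math.comb` (available in Python 3.8 and later), and cross-checked against the factorial formula.

```python
import math

N, n = 1000, 3

# Number of ways to choose n items from N without replacement, order ignored.
combos = math.comb(N, n)

# The same value from the factorial formula N! / (n! (N - n)!).
by_formula = math.factorial(N) // (math.factorial(n) * math.factorial(N - n))

print(combos)
```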
28
Four Types of Probability
Type           Notation         Meaning
Marginal       $P(X)$           The probability of X occurring
Union          $P(X \cup Y)$    The probability of X or Y occurring
Joint          $P(X \cap Y)$    The probability of X and Y occurring
Conditional    $P(X \mid Y)$    The probability of X occurring given that Y has occurred
29
General Law of Addition
$P(X \cup Y) = P(X) + P(Y) - P(X \cap Y)$
30
Design for improving productivity?
31
Problem
• A company conducted a survey for the American Society of Interior
Designers in which workers were asked which changes in office design
would increase productivity.
• Respondents were allowed to answer more than one type of design
change.
32
Problem
• If one of the survey respondents was randomly selected and asked what
office design changes would increase worker productivity,
– what is the probability that this person would select reducing noise or
more storage space?
33
Solution
34
General Law of Addition -- Example
$P(N \cup S) = P(N) + P(S) - P(N \cap S)$
$P(N) = .70$
$P(S) = .67$
$P(N \cap S) = .56$
$P(N \cup S) = .70 + .67 - .56 = 0.81$
35
Office Design Problem
Probability Matrix
                    Increase Storage Space
Noise Reduction     Yes     No      Total
Yes                 .56     .14     .70
No                  .11     .19     .30
Total               .67     .33     1.00
36
Joint Probability Using a Contingency Table
Event
Event B1 B2 Total
Noise Reduction     Yes     No      Total
Yes                 .56     .14     .70
No                  .11     .19     .30
Total               .67     .33     1.00
$P(N \cup S) = P(N) + P(S) - P(N \cap S) = .70 + .67 - .56 = .81$
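The general law of addition can be applied directly to the probability matrix. In this sketch the joint probabilities are stored in a dictionary keyed by (noise reduction, storage space); the key labels are an illustrative choice.

```python
# Joint probabilities from the office design probability matrix.
# Keys: (noise reduction answer, increased storage space answer).
joint = {
    ("yes", "yes"): 0.56,
    ("yes", "no"): 0.14,
    ("no", "yes"): 0.11,
    ("no", "no"): 0.19,
}

# Marginal probabilities are row and column totals.
p_noise = sum(p for (noise, _), p in joint.items() if noise == "yes")
p_storage = sum(p for (_, storage), p in joint.items() if storage == "yes")

# General law of addition: P(N or S) = P(N) + P(S) - P(N and S).
p_union = p_noise + p_storage - joint[("yes", "yes")]
print(round(p_noise, 2), round(p_storage, 2), round(p_union, 2))
```

The joint probability is subtracted once because it is counted in both marginals.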
38
Law of Conditional Probability
39
Office Design Problem
40
Problem
• A company data reveal that 155 employees worked one of four types of
positions.
• Shown here again is the raw values matrix (also called a contingency table)
with the frequency counts for each category and for subtotals and totals
containing a breakdown of these employees by type of position and by
sex.
41
Contingency Table
42
Solution
43
Problem
• Shown here are the raw values matrix and corresponding probability
matrix for the results of a national survey of 200 executives who were
asked to identify the geographic locale of their company and their
company’s industry type.
• The executives were only allowed to select one locale and one industry
type.
44
Lecture 7: Introduction to Probability-II
Dr. A. Ramesh
Department of Management Studies
1
Problem
• A company data reveal that 155 employees worked one of four types of
positions.
• Shown here again is the raw values matrix (also called a contingency table)
with the frequency counts for each category and for subtotals and totals
containing a breakdown of these employees by type of position and by
sex.
2
Contingency Table
3
Solution
4
Problem
• Shown here are the raw values matrix and corresponding probability
matrix for the results of a national survey of 200 executives who were
asked to identify the geographic locale of their company and their
company’s industry type.
• The executives were only allowed to select one locale and one industry
type.
5
6
Questions
a. What is the probability that the respondent is from the Midwest (F)?
c. What is the probability that the respondent is from the Southeast (E) or
from the finance industry (A)?
7
8
Mutually Exclusive Events
                    Gender
Type of Position    Male    Female    Total
Managerial          8       3         11
Professional        31      13        44
Technical           52      17        69
Clerical            9       22        31
Total               100     55        155
$P(T \cup C) = P(T) + P(C) = \frac{69}{155} + \frac{31}{155} = .645$
9
Mutually Exclusive Events
                    Gender
Type of Position    Male    Female    Total
Managerial          8       3         11
Professional        31      13        44
Technical           52      17        69
Clerical            9       22        31
Total               100     55        155
$P(P \cup C) = P(P) + P(C) = \frac{44}{155} + \frac{31}{155} = .484$
10
Law of Multiplication
$P(X \cap Y) = P(X) \cdot P(Y \mid X) = P(Y) \cdot P(X \mid Y)$
11
Problem
12
                Married
Supervisor      Y         N       Subtotal
Y               0.1143            30
N                                 110
Subtotal        80        60      140
13
$P(M) = \frac{80}{140} = 0.5714$
$P(S \mid M) = 0.20$
$P(M \cap S) = P(M) \cdot P(S \mid M) = (0.5714)(0.20) = 0.1143$
14
Law of Multiplication
Probability Matrix of Employees
                Married
Supervisor      Yes       No        Total
Yes             .1143     .1000     .2143
No              .4571     .3286     .7857
Total           .5714     .4286     1.00
$P(S') = 1 - P(S) = 1 - 0.2143 = 0.7857$
$P(M' \cap S') = P(S') - P(M \cap S') = 0.7857 - 0.4571 = 0.3286$
$P(M \cap S') = P(M) - P(M \cap S) = 0.5714 - 0.1143 = 0.4571$
$P(M' \cap S) = P(S) - P(M \cap S) = 0.2143 - 0.1143 = 0.1000$
$P(M') = 1 - P(M) = 1 - 0.5714 = 0.4286$
15
Special Law of Multiplication for Independent Events
• General Law
$P(X \cap Y) = P(X) \cdot P(Y \mid X) = P(Y) \cdot P(X \mid Y)$
• Special Law
If events X and Y are independent,
$P(X) = P(X \mid Y)$, and $P(Y) = P(Y \mid X)$.
Consequently,
$P(X \cap Y) = P(X) \cdot P(Y)$
16
Law of Conditional Probability
$P(X \mid Y) = \frac{P(X \cap Y)}{P(Y)} = \frac{P(Y \mid X) \cdot P(X)}{P(Y)}$
17
Conditional Probability
• A conditional probability is the probability of one event, given that
another event has occurred:
$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$: the conditional probability of A given
that B has occurred
$P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)}$: the conditional probability of B given
that A has occurred
Where P(A and B) = joint probability of A and B
P(A) = marginal probability of A
P(B) = marginal probability of B
18
Computing Conditional Probability
• Of the cars on a used car lot, 70% have air conditioning (AC)
and 40% have a CD player (CD). 20% of the cars have both.
• What is the probability that a car has a CD player, given that it
has AC ?
• We want to find P(CD | AC).
Computing Conditional Probability
            CD      No CD   Total
AC          .2      .5      .7
No AC       .2      .1      .3
Total       .4      .6      1.0
Given AC, we only consider the top row (70% of the cars). Of all cars,
20% have both AC and a CD player, so $P(CD \mid AC) = \frac{.2}{.7} \approx 28.57\%$.
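The used-car example reduces to a single division of the joint probability by the marginal probability of the given event. The variable names below are illustrative.

```python
p_ac = 0.70      # P(AC): car has air conditioning
p_cd = 0.40      # P(CD): car has a CD player
p_both = 0.20    # P(AC and CD): car has both

# Conditional probability: joint probability divided by the
# marginal probability of the event we are conditioning on.
p_cd_given_ac = p_both / p_ac
p_ac_given_cd = p_both / p_cd

print(round(p_cd_given_ac, 4), round(p_ac_given_cd, 4))
```

The two conditional probabilities differ because each is restricted to a different subset of the cars.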
Computing Conditional Probability: Decision Trees
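All Cars, given AC or no AC:
– AC, then CD: conditional probability .2/.7, P(AC and CD) = .2
– AC, then no CD: conditional probability .5/.7, P(AC and CD') = .5
– No AC, then CD: conditional probability .2/.3, P(AC' and CD) = .2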
• If X and Y are independent events, the occurrence of Y does not affect the
probability of X occurring.
• If X and Y are independent events, the occurrence of X does not affect the
probability of Y occurring.
$P(A \mid B) = P(A)$
$P(A \mid G) = \frac{P(A \cap G)}{P(G)} = \frac{0.07}{0.21} = 0.33$, $P(A) = 0.28$
$P(A \mid G) = 0.33 \neq P(A) = 0.28$
Independent Events
         D      E      Total
A        8      12     20
B        20     30     50
C        6      9      15
Total    34     51     85
$P(A \mid D) = \frac{8}{34} = .2353$
$P(A) = \frac{20}{85} = .2353$
$P(A \mid D) = P(A) = 0.2353$
Revision of Probabilities: Bayes’ Rule
$P(X_i \mid Y) = \frac{P(Y \mid X_i) \cdot P(X_i)}{P(Y \mid X_1) \cdot P(X_1) + P(Y \mid X_2) \cdot P(X_2) + \dots + P(Y \mid X_n) \cdot P(X_n)}$
28
29
30
31
Problem
• A particular type of printer ribbon is produced by only
two companies, Alamo Ribbon Company and South
Jersey Products.
• Suppose Alamo produces 65% of the ribbons and
that South Jersey produces 35%.
• Eight percent of the ribbons produced by Alamo are
defective and 12% of the South Jersey ribbons are
defective
• A customer purchases a new ribbon. What is the
probability that Alamo produced the ribbon? What is
the probability that South Jersey produced the
ribbon?
Revision of Probabilities
with Bayes' Rule: Ribbon Problem
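$P(\text{Alamo}) = 0.65$
$P(\text{South Jersey}) = 0.35$
$P(d \mid \text{Alamo}) = 0.08$
$P(d \mid \text{South Jersey}) = 0.12$
$P(\text{Alamo} \mid d) = \frac{P(d \mid \text{Alamo}) \cdot P(\text{Alamo})}{P(d \mid \text{Alamo}) \cdot P(\text{Alamo}) + P(d \mid \text{South Jersey}) \cdot P(\text{South Jersey})}$
$= \frac{(0.08)(0.65)}{(0.08)(0.65) + (0.12)(0.35)} = 0.553$
$P(\text{South Jersey} \mid d) = \frac{P(d \mid \text{South Jersey}) \cdot P(\text{South Jersey})}{P(d \mid \text{Alamo}) \cdot P(\text{Alamo}) + P(d \mid \text{South Jersey}) \cdot P(\text{South Jersey})}$
$= \frac{(0.12)(0.35)}{(0.08)(0.65) + (0.12)(0.35)} = 0.447$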
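The revision of probabilities in the ribbon problem can be sketched as a small function that applies Bayes' rule to a dictionary of priors and a dictionary of conditional probabilities. The function name `revise` and the dictionary keys are illustrative choices.

```python
def revise(priors, likelihoods):
    """Bayes' rule: posterior is prior * likelihood, divided by the total."""
    joint = {event: priors[event] * likelihoods[event] for event in priors}
    total = sum(joint.values())   # overall probability of the evidence
    return {event: value / total for event, value in joint.items()}


priors = {"Alamo": 0.65, "South Jersey": 0.35}
p_defective_given = {"Alamo": 0.08, "South Jersey": 0.12}

posterior = revise(priors, p_defective_given)
for company, p in posterior.items():
    print(company, round(p, 3))
```

Once a ribbon is known to be defective, South Jersey becomes more likely than its 35% market share suggests, because its defect rate is higher.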
Revision of Probabilities with Bayes’ Rule: Ribbon Problem
Alamo (0.65)
– Defective (0.08): 0.65 × 0.08 = 0.052
– Acceptable (0.92)
South Jersey (0.35)
– Defective (0.12): 0.35 × 0.12 = 0.042
– Acceptable (0.88)
P(defective) = 0.052 + 0.042 = 0.094
THANK YOU
36
Data Analytics with Python
Lecture 8: Probability Distributions
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE
1
Lecture Objectives
• Empirical Distribution
• Discrete Distributions
• Continuous Distributions
2
What is a distribution?
3
Why distribution?
• Can serve as a basis for standardized comparison of empirical
distributions
• Can help us estimate confidence intervals for inferential statistics
• Form a basis for more advanced statistical methods
– ‘fit’ between observed distributions and certain theoretical
distributions is an assumption of many statistical procedures
4
Random variable
• A variable which contains the outcomes of a chance experiment
• “Quantifying the outcomes”
• Example X= (1 = Head, 0 = Tails)
• A variable that can take on different values in the population
according to some “random” mechanism
• Discrete
– Distinct values, countable
– Year
• Continuous
– Mass
5
Probability Distributions
6
PDF of Discrete r.v.
Number of Heads (X): 0 1 2 sum
PDF (P(X)): ¼ ½ ¼ 1
[Figure: The PDF of the Number of Heads in Two Tosses of a Coin; bars of height 0.25, 0.5, and 0.25 at 0, 1, and 2 heads; vertical axis: Probability Density, horizontal axis: Number of Heads]
7
Probability Distribution for the Random Variable X
A probability distribution for a discrete random
variable X:
x           –8     –3     –1     0      1      4      6
P(X = x)    0.13   0.15   0.17   0.20   0.15   0.11   0.09
Find
a. $P(X \le 0) = 0.65$
b. $P(-3 \le X \le 1) = 0.67$
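Both answers can be checked by summing the probabilities of the values that fall in the requested range. The dictionary representation and the helper name `prob_between` are illustrative choices.

```python
import math

# Probability distribution from the slide: value -> probability.
pmf = {-8: 0.13, -3: 0.15, -1: 0.17, 0: 0.20, 1: 0.15, 4: 0.11, 6: 0.09}


def prob_between(pmf, low=-math.inf, high=math.inf):
    """P(low <= X <= high) for a discrete distribution."""
    return sum(p for x, p in pmf.items() if low <= x <= high)


total = prob_between(pmf)              # all probabilities must sum to 1
p_a = prob_between(pmf, high=0)        # P(X <= 0)
p_b = prob_between(pmf, low=-3, high=1)   # P(-3 <= X <= 1)

print(round(total, 2), round(p_a, 2), round(p_b, 2))
```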
8
Discrete Distribution -- Example
Distribution of Daily Crises
Number of Crises    Probability
0                   0.37
1                   0.31
2                   0.18
3                   0.09
4                   0.04
5                   0.01
[Figure: bar chart of Probability (0 to 0.5) against Number of Crises (0 to 5)]
9
Requirements for a Discrete Probability Function
• Probabilities are between 0 and 1, inclusively
$0 \le P(X) \le 1$ for all X
$\sum_{\text{over all } x} P(X) = 1$
10
Cumulative Distribution Function
11
The Expected Value of X
Let X be a discrete rv with set of possible values D and pmf p(x). The
expected value or mean value of X, denoted $E(X)$ or $\mu_X$, is
$E(X) = \mu_X = \sum_{x \in D} x \cdot p(x)$
12
Mean and Variance of a Discrete Random Variable
15
The Variance and Standard Deviation
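The variance of X, denoted $V(X)$ (or $\sigma_X^2$ or $\sigma^2$), is
$V(X) = \sum_{D} (x - \mu)^2 \cdot p(x) = E[(X - \mu)^2]$
The standard deviation (SD) of X is
$\sigma_X = \sqrt{\sigma_X^2}$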
16
The quiz scores for a particular student are given below:
22, 25, 20, 18, 12, 20, 24, 20, 20, 25, 24, 25, 18
Find the variance and standard deviation.
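Value          12     18     20     22     24     25
Frequency      1      2      4      1      2      3
Probability    .08    .15    .31    .08    .15    .23
$\mu = 21$
$V(X) = p_1(x_1 - \mu)^2 + p_2(x_2 - \mu)^2 + \dots + p_n(x_n - \mu)^2$
$\sigma = \sqrt{V(X)}$
17
$V(X) = .08(12 - 21)^2 + .15(18 - 21)^2 + .31(20 - 21)^2 + \dots$
$V(X) = 13.25$
$\sigma = \sqrt{V(X)} = \sqrt{13.25} = 3.64$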
18
Shortcut Formula for Variance
$V(X) = \sigma^2 = \left[ \sum_{D} x^2 \cdot p(x) \right] - \mu^2 = E(X^2) - [E(X)]^2$
19
Mean of a Discrete Distribution
$\mu = E(X) = \sum X \cdot P(X)$
X      P(X)    X·P(X)
-1     .1      -.1
0      .2      .0
1      .4      .4
2      .2      .4
3      .1      .3
Total          1.0
20
Variance and Standard Deviation
of a Discrete Distribution
$\sigma^2 = \sum (X - \mu)^2 \cdot P(X) = 1.2$
$\sigma = \sqrt{\sigma^2} = \sqrt{1.2} = 1.10$
X      P(X)    $X - \mu$    $(X - \mu)^2$    $(X - \mu)^2 \cdot P(X)$
-1     .1      -2           4                .4
0      .2      -1           1                .2
1      .4      0            0                .0
2      .2      1            1                .2
3      .1      2            4                .4
Total                                        1.2
21
Mean of the Data Example
$\mu = E(X) = \sum X \cdot P(X) = 1.15$
X     P(X)    X·P(X)
0     .37     .00
1     .31     .31
2     .18     .36
3     .09     .27
4     .04     .16
5     .01     .05
Total         1.15
[Figure: bar chart of probability against number of crises (0 to 5)]
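As a rough check of the mean for the daily-crises distribution, the sketch below stores the table as a dictionary and also compares the definitional variance with the shortcut formula E(X²) − [E(X)]². The dictionary name `crises_pmf` is made up for illustration.

```python
from math import isclose, sqrt

# Probability table for the number of crises per day (values from the slide).
crises_pmf = {0: 0.37, 1: 0.31, 2: 0.18, 3: 0.09, 4: 0.04, 5: 0.01}

# Requirements for a discrete probability function.
assert all(0 <= p <= 1 for p in crises_pmf.values())
assert isclose(sum(crises_pmf.values()), 1.0)

mean = sum(x * p for x, p in crises_pmf.items())
var_definition = sum((x - mean) ** 2 * p for x, p in crises_pmf.items())
var_shortcut = sum(x ** 2 * p for x, p in crises_pmf.items()) - mean ** 2
sd = sqrt(var_definition)

print(f"mean = {mean:.2f}, variance = {var_definition:.4f}, sd = {sd:.4f}")
```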
22
Properties of Expected Value
3. $E\left( \frac{X}{Y} \right) \ne \frac{E(X)}{E(Y)}$.
23
Properties of Variance
1. Var(constant) = 0
2. If X and Y are two independent random variables, then
Var(X + Y) = Var(X) + Var (Y) and
Var(X - Y) = Var(X) + Var (Y)
3. If b is a constant then Var(b+X) = Var(X)
4. If a is a constant then Var(aX) = a²Var(X)
5. If a and b are constants then Var(aX+b) = a²Var(X)
6. If X and Y are two independent random variables and a and b are constants then Var(aX+bY) = a²Var(X) + b²Var(Y)
24
Covariance
25
Covariance
• In general, the covariance between two random variables can be
positive or negative.
• If two random variables move in the same direction, then the
covariance will be positive, if they move in the opposite direction
the covariance will be negative.
Properties:
1. If X and Y are independent random variables, their covariance is zero, since E(XY) = E(X)E(Y)
2. Cov(X, X) = Var(X)
3. Cov(Y, Y) = Var(Y)
26
Correlation Coefficient
• The covariance gives the sign of the relationship between the variables but not its strength. The correlation coefficient measures how strongly the variables are related to each other.
• For two random variables X and Y with $E(X) = \mu_x$ and $E(Y) = \mu_y$, the correlation coefficient is defined as
$\rho_{xy} = \frac{Cov(X, Y)}{\sigma_x \sigma_y} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
27
28
Thank You
29
Data Analytics with Python
Lecture 9: Probability Distributions-II
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE
1
Some Special Distributions
• Discrete
– Binomial
– Poisson
– Hypergeometric
• Continuous
– Uniform
– Exponential
– Normal
2
Binomial Distribution
• Let us consider the purchase decisions of the next three customers who
enter a store.
• What is the probability that two of the next three customers will make a
purchase?
3
Tree diagram for the Martin clothing store problem
4
Trial Outcomes
5
Graphical representation of the probability distribution
for the number of customers making a purchase
x    P(x)
0    0.7 × 0.7 × 0.7 = 0.343
1    0.3 × 0.7 × 0.7 + 0.7 × 0.3 × 0.7 + 0.7 × 0.7 × 0.3 = 0.441
2    0.189
3    0.027
6
Binomial Distribution: Assumptions
• Experiment involves n identical trials
• Each trial has exactly two possible outcomes: success and failure
• Each trial is independent of the previous trials
• p is the probability of a success on any one trial
q = (1-p) is the probability of a failure on any one trial
• p and q are constant throughout the experiment
• X is the number of successes in the n trials
7
Binomial Distribution
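• Probability function
$P(X) = \frac{n!}{X!(n - X)!} p^X q^{n - X}$ for $0 \le X \le n$
• Mean value
$\mu = n \cdot p$
• Variance and standard deviation
$\sigma^2 = n \cdot p \cdot q$
$\sigma = \sqrt{\sigma^2} = \sqrt{n \cdot p \cdot q}$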
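The binomial formulas above can be sketched with the standard library alone. The function name `binomial_pmf` is an illustrative choice, not something from the lecture; the inputs are the clothing-store case (n = 3, p = 0.3), the table example (n = 10, x = 3, p = .40) and the 1000-customer forecast.

```python
from math import comb, sqrt

def binomial_pmf(x, n, p):
    """P(X = x) for n independent trials with success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Clothing store: number of purchases among the next three customers.
store = [binomial_pmf(x, 3, 0.3) for x in range(4)]

# Binomial table example: n = 10, x = 3, p = .40.
table_value = binomial_pmf(3, 10, 0.40)

# Mean, variance and standard deviation for n = 1000, p = 0.3.
n, p = 1000, 0.3
mean = n * p
variance = n * p * (1 - p)
sd = sqrt(variance)

print([round(v, 3) for v in store], round(table_value, 4), mean, variance, round(sd, 2))
```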
8
Binomial Table
SELECTED VALUES FROM THE BINOMIAL PROBABILITY TABLE
EXAMPLE: n = 10, x = 3, p = .40; f (3) = .2150
9
Mean and Variance
• Suppose that for the next month the Clothing Store forecasts 1000
customers will enter the store.
• What is the expected number of customers who will make a purchase?
• The answer is μ = np = (1000)(.3) = 300.
• For the next 1000 customers entering the store, the variance and standard deviation for the number of customers who will make a purchase are σ² = np(1 − p) = (1000)(.3)(.7) = 210 and σ = √210 ≈ 14.49.
10
Poisson Distribution
11
Poisson Distribution: Applications
• Arrivals at queuing systems
– airports -- people, airplanes, automobiles, baggage
– banks -- people, automobiles, loan applications
– computer file servers -- read and write operations
12
Poisson Distribution
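• Probability function
$P(X) = \frac{\lambda^X e^{-\lambda}}{X!}$ for $X = 0, 1, 2, \dots$
where λ is the mean number of occurrences per interval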
13
Poisson Distribution: Example
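$P(X = 10) = \frac{\lambda^X e^{-\lambda}}{X!} = \frac{6.4^{10} e^{-6.4}}{10!}$
$P(X = 6) = \frac{\lambda^X e^{-\lambda}}{X!} = \frac{6.4^{6} e^{-6.4}}{6!}$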
14
Poisson Probability Table
Example: μ = 10, x = 5; f (5) = .0378
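A minimal sketch of the Poisson probability function, evaluated at the table example (μ = 10, x = 5) and at the two cases with a mean of 6.4. The helper name `poisson_pmf` is hypothetical.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with mean lam."""
    return lam ** x * exp(-lam) / factorial(x)

p_table = poisson_pmf(5, 10)    # table example: mean 10, x = 5
p_ten = poisson_pmf(10, 6.4)    # X = 10 with mean 6.4
p_six = poisson_pmf(6, 6.4)     # X = 6 with mean 6.4

print(round(p_table, 4), round(p_ten, 4), round(p_six, 4))
```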
15
The Hypergeometric Distribution
• Each trial has exactly two possible outcomes, success and failure.
17
Hypergeometric Distribution
• Probability function
$P(x) = \frac{{}_{A}C_{x} \cdot {}_{N-A}C_{n-x}}{{}_{N}C_{n}}$
– N is population size
– n is sample size
– A is number of successes in population
– x is number of successes in sample
• Mean value
$\mu = \frac{A \cdot n}{N}$
• Variance and standard deviation
$\sigma^2 = \frac{A (N - A) n (N - n)}{N^2 (N - 1)}$
$\sigma = \sqrt{\sigma^2}$
18
The Hypergeometric Distribution Example
• Three different computers are checked from the 10 in the department. 4 of the 10 computers have illegal software loaded.
• What is the probability that 2 of the 3 selected computers have illegal
software loaded?
• So, N = 10, n = 3, A = 4, X = 2
$P(X = 2) = \frac{\binom{A}{X} \binom{N - A}{n - X}}{\binom{N}{n}} = \frac{\binom{4}{2} \binom{6}{1}}{\binom{10}{3}} = \frac{(6)(6)}{120} = 0.3$
• The probability that 2 of the 3 selected computers have illegal
software loaded is .30, or 30%.
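The computer-audit example can be reproduced with `math.comb`. This is only a sketch; `hypergeom_pmf` is an illustrative name, and the argument names follow the symbols N, A, n and x used above.

```python
from math import comb

def hypergeom_pmf(x, N, A, n):
    """P(X = x): x successes in a sample of n drawn without replacement
    from a population of N that contains A successes."""
    return comb(A, x) * comb(N - A, n - x) / comb(N, n)

p_two = hypergeom_pmf(2, N=10, A=4, n=3)
mean = 4 * 3 / 10  # A * n / N

# The probabilities over all possible x should add up to 1.
total = sum(hypergeom_pmf(x, N=10, A=4, n=3) for x in range(0, 4))

print(p_two, mean, total)
```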
Continuous Probability Distributions
• Uniform
• Normal
• Exponential
The Uniform Distribution
$f(x) = \frac{1}{b - a}$ for $a \le x \le b$
$f(x) = 0$ for all other values
[Figure: rectangle of height $\frac{1}{b - a}$ over the interval from a to b; area = 1]
Uniform Distribution: Mean and Standard Deviation
Mean
$\mu = \frac{a + b}{2}$
Standard Deviation
$\sigma = \frac{b - a}{\sqrt{12}}$
The Uniform Distribution
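$f(X) = \frac{1}{6 - 2} = .25$ for $2 \le X \le 6$
$\mu = \frac{a + b}{2} = \frac{2 + 6}{2} = 4$
$\sigma = \sqrt{\frac{(b - a)^2}{12}} = \sqrt{\frac{(6 - 2)^2}{12}} = 1.1547$
[Figure: rectangle of height .25 over the interval from 2 to 6]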
Uniform Distribution Example
$f(x) = \frac{1}{47 - 41} = \frac{1}{6}$ for $41 \le x \le 47$
$f(x) = 0$ for all other values
[Figure: rectangle of height $\frac{1}{6}$ over the interval from 41 to 47; area = 1]
Uniform Distribution: Mean and Standard Deviation
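Mean
$\mu = \frac{a + b}{2} = \frac{41 + 47}{2} = \frac{88}{2} = 44$
$P(x_1 \le X \le x_2) = \frac{x_2 - x_1}{b - a}$
$P(42 \le X \le 45) = \frac{45 - 42}{47 - 41} = \frac{1}{2}$
[Figure: area of 0.5 under f(x) between 42 and 45, on the interval from 41 to 47]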
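The uniform results above can be sketched as follows. The helper `uniform_prob` is a hypothetical name; it clips the requested interval to [a, b] so that values outside the range contribute no probability.

```python
from math import sqrt

def uniform_prob(x1, x2, a, b):
    """P(x1 <= X <= x2) for X uniform on [a, b]."""
    low, high = max(x1, a), min(x2, b)
    return max(high - low, 0.0) / (b - a)

a, b = 41, 47
mean = (a + b) / 2
sd = (b - a) / sqrt(12)
p_42_to_45 = uniform_prob(42, 45, a, b)

# Flight-time example: uniform between 120 and 140 minutes.
p_flight = uniform_prob(120, 130, 120, 140)

print(mean, round(sd, 4), p_42_to_45, p_flight)
```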
Example : Uniform Distribution
• Suppose the flight time can be any value in the interval from 120 minutes
to 140 minutes.
• Because the random variable x can assume any value in that interval, x is a
continuous rather than a discrete random variable
29
Example : Uniform Distribution contd….
• Let us assume that sufficient actual flight data are available to conclude
that the probability of a flight time within any 1-minute interval is the
same as the probability of a flight time within any other 1-minute interval
contained in the larger interval from 120 to 140 minutes.
• With every 1-minute interval being equally likely, the random variable x is
said to have a uniform probability distribution.
30
Uniform Probability Distribution for Flight time
31
Probability of a flight time between 120 and 130
minutes
32
Exponential Probability Distribution
• The exponential probability distribution is useful in describing the time it
takes to complete a task.
• The exponential random variables can be used to describe:
• Density Function
$f(x) = \frac{1}{\mu} e^{-x/\mu}$ for $x > 0$, $\mu > 0$
where: μ = mean
e = 2.71828
Exponential Probability Distribution
• Suppose that x represents the loading time for a truck at loading dock and
follows such a distribution.
• If the mean, or average, loading time is 15 minutes (μ = 15), the appropriate probability density function for x is $f(x) = \frac{1}{15} e^{-x/15}$
Exponential Distribution for the loading Dock Example
Exponential Probability Distribution
• Cumulative Probabilities
$P(x \le x_0) = 1 - e^{-x_0/\mu}$
where:
$x_0$ = some specific value of x
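The density and cumulative formulas can be tried on the loading-dock case with μ = 15 minutes. The cut-off values of 6 and 18 minutes below are illustrative assumptions, not values taken from the slides.

```python
from math import exp

def exponential_pdf(x, mu):
    """Density f(x) = (1/mu) * e^(-x/mu) for x >= 0."""
    return exp(-x / mu) / mu if x >= 0 else 0.0

def exponential_cdf(x0, mu):
    """Cumulative probability P(x <= x0) = 1 - e^(-x0/mu)."""
    return 1 - exp(-x0 / mu) if x0 >= 0 else 0.0

mu = 15  # mean loading time in minutes

p_within_6 = exponential_cdf(6, mu)                            # assumed cut-off: 6 minutes
p_6_to_18 = exponential_cdf(18, mu) - exponential_cdf(6, mu)   # assumed interval: 6 to 18 minutes

print(round(p_within_6, 4), round(p_6_to_18, 4))
```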
Example: Exponential Probability Distribution
• The Petrol pump owner would like to know the probability that the time
• Because the average number of arrivals is 10 cars per hour, the average time between cars arriving is 1/10 hour, or 6 minutes.
42
The Normal Distribution: Properties
• ‘Bell Shaped’
• Symmetrical
• Mean, Median and Mode are equal (Mean = Median = Mode)
• Location is characterized by the mean, μ
• Spread is characterized by the standard deviation, σ
• The random variable has an infinite theoretical range: −∞ to +∞
The Normal Distribution: Density Function
The formula for the normal probability density function is
$f(X) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2} \left( \frac{X - \mu}{\sigma} \right)^2}$
Where e = the mathematical constant approximated by 2.71828
π = the mathematical constant approximated by 3.14159
μ = the population mean
σ = the population standard deviation
X = any value of the continuous variable
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE
1
The Normal Distribution: Shape
Changing σ increases or decreases the spread.
[Figure: normal curve over X, centered at μ with spread σ]
The Standardized Normal Distribution
$Z = \frac{X - \mu}{\sigma}$
The Standardized Normal Distribution: Density
Function
[Figure: standardized normal curve f(Z), centered at Z = 0]
Values above the mean have positive Z-values, values below the mean have negative Z-values
The Standardized Normal Distribution: Example
$Z = \frac{X - \mu}{\sigma} = \frac{200 - 100}{50} = 2.0$
• This says that X = 200 is two standard deviations (2 increments of 50
units) above the mean of 100.
The Standardized Normal Distribution: Example
Note that the distribution is the same, only the scale has changed. We
can express the problem in original units (X) or in standardized units (Z)
Normal Probabilities
[Figure: normal curve f(X) with the area between a and b shaded, representing P(a ≤ X ≤ b)]
Normal Probabilities
The total area under the curve is 1.0, and the curve is symmetric,
so half is above the mean, half is below.
$P(-\infty < X < \mu) = 0.5$
$P(\mu < X < \infty) = 0.5$
$P(-\infty < X < \infty) = 1.0$
Normal Probability Tables
Example:
P(Z < 2.00) = .9772
Finding Normal Probability: Example
• Suppose X is normal with mean 8.0 and standard deviation 5.0. Find
P(X < 8.6).
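$Z = \frac{X - \mu}{\sigma} = \frac{8.6 - 8.0}{5.0} = 0.12$
[Figure: the X scale (μ = 8, σ = 5) with 8 and 8.6 marked, next to the Z scale (μ = 0, σ = 1) with 0 and 0.12 marked]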
Finding Normal Probability: Between Two Values
Calculate Z-values:
$Z = \frac{X - \mu}{\sigma} = \frac{8 - 8}{5} = 0$
$Z = \frac{X - \mu}{\sigma} = \frac{8.6 - 8}{5} = 0.12$
$P(8 < X < 8.6) = P(0 < Z < 0.12)$
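Instead of a printed table, these normal probabilities can be looked up with `statistics.NormalDist` from the Python standard library (available from Python 3.8). The sketch below standardizes X = 8.6 for a normal distribution with mean 8.0 and standard deviation 5.0 and evaluates the two probabilities discussed above; the variable names are illustrative.

```python
from statistics import NormalDist

x_dist = NormalDist(mu=8.0, sigma=5.0)

# Standardize X = 8.6.
z = (8.6 - x_dist.mean) / x_dist.stdev

p_below = x_dist.cdf(8.6)                      # P(X < 8.6)
p_between = x_dist.cdf(8.6) - x_dist.cdf(8.0)  # P(8 < X < 8.6)

# The same cumulative lookup on the Z scale, e.g. P(Z < 2.00).
p_z_below_2 = NormalDist().cdf(2.0)

print(round(z, 2), round(p_below, 4), round(p_between, 4), round(p_z_below_2, 4))
```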
Finding Normal Probability
Between Two Values
• Let X represent the time it takes (in seconds) to download an image file
from the internet.
• Suppose X is normal with mean 8.0 and standard deviation 5.0
• Find X such that 20% of download times are less than X.
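[Figure: normal curve with area .2000 in the lower tail, to the left of the unknown X value (and the corresponding Z value)]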
Given Normal Probability, Find the X Value
$X = \mu + Z\sigma = 8.0 + (-0.84)(5.0) = 3.80$
So 20% of the download times from the distribution with mean 8.0
and standard deviation 5.0 are less than 3.80 seconds.
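The reverse lookup (from a probability to an X value) can be sketched with `NormalDist.inv_cdf`. Because the unrounded Z value is about −0.8416 rather than −0.84, the result is about 3.79 seconds instead of the 3.80 obtained from the table.

```python
from statistics import NormalDist

mu, sigma = 8.0, 5.0

# Z value with 20% of the area below it.
z = NormalDist().inv_cdf(0.20)

# Convert back to the original units: X = mu + Z * sigma.
x = mu + z * sigma

# The same answer directly from the distribution of X.
x_direct = NormalDist(mu, sigma).inv_cdf(0.20)

print(round(z, 4), round(x, 2))
```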
Assessing Normality
• It is important to evaluate how well the data set is approximated by a normal
distribution.
• Normally distributed data should approximate the theoretical normal
distribution:
– The normal distribution is bell shaped (symmetrical) where the mean is
equal to the median.
– The empirical rule applies to the normal distribution.
– The interquartile range of a normal distribution is 1.33 standard deviations.
Assessing Normality
• Construct charts or graphs
– For small- or moderate-sized data sets, do stem-and-leaf display
and box-and-whisker plot look symmetric?
– For large data sets, does the histogram or polygon appear bell-
shaped?
• Compute descriptive summary measures
– Do the mean, median and mode have similar values?
– Is the interquartile range approximately 1.33 σ?
– Is the range approximately 6 σ?
Assessing Normality
Table Lookup of a Standard Normal Probability
(area under the standard normal curve between 0 and z; selected rows)
z      0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.00   0.0000  0.0040  0.0080  0.0120  0.0160  0.0199  0.0239  0.0279  0.0319  0.0359
0.10   0.0398  0.0438  0.0478  0.0517  0.0557  0.0596  0.0636  0.0675  0.0714  0.0753
0.20   0.0793  0.0832  0.0871  0.0910  0.0948  0.0987  0.1026  0.1064  0.1103  0.1141
0.30   0.1179  0.1217  0.1255  0.1293  0.1331  0.1368  0.1406  0.1443  0.1480  0.1517
0.90   0.3159  0.3186  0.3212  0.3238  0.3264  0.3289  0.3315  0.3340  0.3365  0.3389
1.00   0.3413  0.3438  0.3461  0.3485  0.3508  0.3531  0.3554  0.3577  0.3599  0.3621
1.10   0.3643  0.3665  0.3686  0.3708  0.3729  0.3749  0.3770  0.3790  0.3810  0.3830
1.20   0.3849  0.3869  0.3888  0.3907  0.3925  0.3944  0.3962  0.3980  0.3997  0.4015
2.00   0.4772  0.4778  0.4783  0.4788  0.4793  0.4798  0.4803  0.4808  0.4812  0.4817
3.00   0.4987  0.4987  0.4987  0.4988  0.4988  0.4989  0.4989  0.4989  0.4990  0.4990
3.40   0.4997  0.4997  0.4997  0.4997  0.4997  0.4997  0.4997  0.4997  0.4997  0.4998
3.50   0.4998  0.4998  0.4998  0.4998  0.4998  0.4998  0.4998  0.4998  0.4998  0.4998
$P(0 \le Z \le 1) = 0.3413$
34
Lecture 11: Python demo for Distribution
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
1
Agenda
• Different numerical problems are solved for the following distributions using Python:
– Discrete
• Binomial
• Poisson
• Hypergeometric
– Continuous
• Uniform
• Exponential
• Normal
2
THANK YOU
3
Lecture 11: Sampling and Sampling Distribution
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT STUDIES
IIT ROORKEE
1
Lecture Objectives
After completing this lecture, you should be able to:
• Describe a simple random sample and why sampling is important
• Explain the difference between descriptive and inferential statistics
• Define the concept of a sampling distribution
• Determine the mean and standard deviation for the sampling distribution of the sample mean, $\bar{X}$
2
Lecture Objectives
3
Descriptive vs Inferential Statistics
• Descriptive statistics
– Collecting, presenting, and describing data
• Inferential statistics
– Drawing conclusions and/or making decisions concerning a population
based only on sample data
4
Populations and Samples
5
Population vs. Sample
• Population: a b c d e f g h i j k l m n o p q r s t u v w x y z
• Sample: b c g i n o r u y
6
Why Sample?
• Less time consuming than a census
• Less costly to administer than a census
• It is possible to obtain statistical results of a sufficiently high precision
based on samples.
• Because the research process is sometimes destructive, sampling can save product
• If accessing the entire population is impossible, sampling is the only option
7
Reasons for Taking a Census
– Proportionate
– Disproportionate
11
Simple Random Sample:
Numbered Population Frame
9 9 4 3 7 8 7 9 6 1 4 5 7 3 7 3 7 5 5 2 9 7 9 6 9 3 9 0 9 4 3 4 4 7 5 3 1 6 1 8
5 0 6 5 6 0 0 1 2 7 6 8 3 6 7 6 6 8 8 2 0 8 1 5 6 8 0 0 1 6 7 8 2 2 4 5 8 3 2 6
8 0 8 8 0 6 3 1 7 1 4 2 8 7 7 6 6 8 3 5 6 0 5 1 5 7 0 2 9 6 5 0 0 2 6 4 5 5 8 7
8 6 4 2 0 4 0 8 5 3 5 3 7 9 8 8 9 4 5 4 6 8 1 3 0 9 1 2 5 3 8 8 1 0 4 7 4 3 1 9
6 0 0 9 7 8 6 4 3 6 0 1 8 6 9 4 7 7 5 8 8 9 5 3 5 9 9 4 0 0 4 8 2 6 8 3 0 6 0 6
5 2 5 8 7 7 1 9 6 5 8 5 4 5 3 4 6 8 3 4 0 0 9 9 1 9 9 7 2 9 7 6 9 4 8 1 5 9 4 1
8 9 1 5 5 9 0 5 5 3 9 0 6 8 9 4 8 6 3 7 0 7 9 5 5 4 7 0 6 2 7 1 1 8 2 6 4 4 9 3
Simple Random Sample:
Sample Members
• N = 20
• n=4
Stratified Random Sample
• 20 - 30 years old (homogeneous within: alike)
• 30 - 40 years old (homogeneous within: alike)
• 40 - 50 years old (homogeneous within: alike)
• The strata are heterogeneous (different) between one another
Systematic Sampling
• Convenient and relatively easy to administer
• Population elements are an ordered sequence (at least, conceptually).
• The first sample element is selected randomly from the first k population elements.
$k = \frac{N}{n}$, where:
n = sample size
N = population size
k = size of selection interval
• Purchase orders for the previous fiscal year are serialized 1 to 10,000 (N =
10,000).
• A sample of fifty (n = 50) purchase orders is needed for an audit.
• k = 10,000/50 = 200
• First sample element randomly selected from the first 200 purchase
orders. Assume the 45th purchase order was selected.
• Subsequent sample elements: 245, 445, 645, . . .
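The purchase-order example can be sketched as a small function. The name `systematic_sample` and its `start` and `seed` parameters are illustrative; when `start` is omitted, the first element is drawn at random from the first k serial numbers.

```python
import random

def systematic_sample(N, n, start=None, seed=None):
    """Return n serial numbers from 1..N, spaced k = N // n apart."""
    k = N // n
    if start is None:
        # Random start among the first k population elements.
        start = random.Random(seed).randint(1, k)
    return [start + i * k for i in range(n)]

# Purchase orders serialized 1 to 10,000; the 45th order is the random start.
sample = systematic_sample(10_000, 50, start=45)

print(sample[:4], "...", sample[-1])
```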
Cluster Sampling
• Quota Sampling: Sample elements are selected until the quota controls are
satisfied
Process of Inferential Statistics
1. Select a random sample from the population
2. Calculate $\bar{x}$ (statistic) from the sample to estimate $\mu$ (parameter) of the population
Inferential Statistics
Drawing conclusions and/or making decisions concerning a
population based on sample results.
• Estimation
– e.g., Estimate the population mean weight
using the sample mean weight
• Hypothesis Testing
– e.g., Use sample evidence to test the claim
that the population mean weight is 120
pounds
25
Sampling Distributions
26
Types of sampling distributions
Sampling
Distributions
27
Sampling Distributions of Sample Means
Sampling
Distributions
28
Developing a Sampling Distribution
29
Developing a Sampling Distribution
(continued)
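$\mu = \frac{\sum X_i}{N} = \frac{18 + 20 + 22 + 24}{4} = 21$
$\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}} = 2.236$
[Figure: uniform distribution with P(x) = .25 at each of x = 18, 20, 22, 24 (A, B, C, D)]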
30
Developing a Sampling Distribution
(continued)
Now consider all possible samples of size n = 2 (16 possible samples, sampling with replacement)
1st Obs    2nd Observation
           18       20       22       24
18         18,18    18,20    18,22    18,24
20         20,18    20,20    20,22    20,24
22         22,18    22,20    22,22    22,24
24         24,18    24,20    24,22    24,24
16 Sample Means
1st Obs    2nd Observation
           18    20    22    24
18         18    19    20    21
20         19    20    21    22
22         20    21    22    23
24         21    22    23    24
31
Developing a Sampling Distribution
(continued)
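$E(\bar{X}) = \frac{\sum \bar{X}_i}{N} = \frac{18 + 19 + 21 + \dots + 24}{16} = 21 = \mu$
$\sigma_{\bar{X}} = \sqrt{\frac{\sum (\bar{X}_i - \mu)^2}{N}} = \sqrt{\frac{(18 - 21)^2 + (19 - 21)^2 + \dots + (24 - 21)^2}{16}} = 1.58$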
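The sampling distribution for this four-element population is small enough to enumerate. The sketch below lists all 16 samples of size 2 drawn with replacement and compares the mean and standard deviation of the sample means with the population values; the helper names are illustrative.

```python
from itertools import product
from math import sqrt

population = [18, 20, 22, 24]

def mean(values):
    return sum(values) / len(values)

def population_sd(values):
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))

# All ordered samples of size n = 2, sampling with replacement.
samples = list(product(population, repeat=2))
sample_means = [mean(s) for s in samples]

mu, sigma = mean(population), population_sd(population)
mu_xbar, sigma_xbar = mean(sample_means), population_sd(sample_means)

print(len(samples), mu, round(sigma, 3), mu_xbar, round(sigma_xbar, 2))
```

The standard deviation of the sample means equals σ/√n, which is the standard error of the mean.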
33
Comparing the Population with its Sampling
Distribution
Population (N = 4): $\mu = 21$, $\sigma = 2.236$
Sample Means Distribution (n = 2): $\mu_{\bar{X}} = 21$, $\sigma_{\bar{X}} = 1.58$
[Figure: P(X) for the population values 18, 20, 22, 24 (A, B, C, D) beside P(X̄) for the sample means 18 to 24]
34
[Figure: histogram of 1,800 randomly selected values from an exponential distribution (frequency against X)]
[Figure: histogram of the means of 60 samples (n = 2) from an exponential distribution (frequency against x̄)]
[Figure: histogram of the means of 60 samples (n = 5) from an exponential distribution (frequency against x̄)]
[Figure: histogram of the means of 60 samples (n = 30) from an exponential distribution (frequency against x̄)]
[Figure: histogram of 1,800 randomly selected values from a uniform distribution (frequency against X)]
[Figure: histogram of the means of 60 samples (n = 2) from a uniform distribution (frequency against x̄)]
[Figure: histogram of the means of 60 samples (n = 5) from a uniform distribution (frequency against x̄)]
[Figure: histogram of the means of 60 samples (n = 30) from a uniform distribution (frequency against x̄)]
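The pattern in these histograms can be imitated with a small simulation. This is only a sketch under stated assumptions: it draws from an exponential distribution with mean 1, uses 2,000 samples per sample size (rather than 60) so that the spread estimates are stable, and fixes the random seed for repeatability.

```python
import random
from math import sqrt
from statistics import mean, pstdev

rng = random.Random(0)  # fixed seed (an arbitrary choice)

def simulate_sample_means(n, num_samples=2000):
    """Means of num_samples samples of size n from an exponential(mean = 1) population."""
    return [mean(rng.expovariate(1.0) for _ in range(n)) for _ in range(num_samples)]

for n in (2, 5, 30):
    means = simulate_sample_means(n)
    # The center stays near the population mean of 1, while the spread
    # shrinks roughly like sigma / sqrt(n) = 1 / sqrt(n).
    print(n, round(mean(means), 3), round(pstdev(means), 3), round(1 / sqrt(n), 3))
```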
Expected Value of Sample Mean
$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$
43
Standard Error of the Mean
• Different samples of the same size from the same population will yield
different sample means
• A measure of the variability in the mean from sample to sample is given by
the Standard Error of the Mean:
$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
• Note that the standard error of the mean decreases as the sample size
increases
44
If sample values are not independent
(continued)
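$Var(\bar{X}) = \frac{\sigma^2}{n} \cdot \frac{N - n}{N - 1}$ or $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}$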
45
If the Population is Normal
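$\mu_{\bar{X}} = \mu$ and $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
• If the sample size n is not small relative to the population size N, then
$\mu_{\bar{X}} = \mu$ and $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}$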
46
Z-value for Sampling Distribution of the Mean
47
Sampling Distribution Properties
• $\mu_{\bar{x}} = \mu$ (i.e. $\bar{x}$ is unbiased)
[Figure: normal population distribution centered at μ, and normal sampling distribution of x̄ (has the same mean)]
48
Sampling Distribution Properties
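As n increases, $\sigma_{\bar{x}}$ decreases
[Figure: sampling distributions of x̄ centered at μ; the larger sample size gives the narrower curve, the smaller sample size the wider curve]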
49
If the Population is not Normal- Central Limit Theorem
We can apply the Central Limit Theorem:
$\mu_{\bar{x}} = \mu$ and $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
50
Central Limit Theorem
As the sample size n gets large enough, the sampling distribution becomes almost normal regardless of the shape of the population.
51
If the Population is not Normal
(continued)
Sampling distribution properties:
• Central Tendency: $\mu_{\bar{x}} = \mu$
• Variation: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
[Figure: population distribution, and sampling distribution of x̄ (becomes normal as n increases); the larger sample size gives the narrower curve]
52
How Large is Large Enough?
53
Example
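• Suppose a population has mean μ = 8 and standard deviation σ = 3, and a random sample of size n = 36 is selected.
• What is the probability that the sample mean is between 7.8 and 8.2?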
54
Example
Solution:
• Even if the population is not normally distributed, the central limit
theorem can be used (n > 25)
• … so the sampling distribution of $\bar{x}$ is approximately normal
• … with mean $\mu_{\bar{x}} = 8$
• … and standard deviation $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{3}{\sqrt{36}} = 0.5$
55
Example (continued)
Solution (continued)
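$P(7.8 < \bar{X} < 8.2) = P\left( \frac{7.8 - 8}{3/\sqrt{36}} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < \frac{8.2 - 8}{3/\sqrt{36}} \right) = P(-0.4 < Z < 0.4) = 0.3108$
[Figure: the sampling distribution of X̄ (mean 8, with 7.8 and 8.2 marked) is standardized to the standard normal distribution (mean 0, with −0.4 and 0.4 marked); the area on each side of the mean is .1554]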
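This probability can be checked numerically. With σ = 3 and n = 36 the standard error is 0.5, so the limits 7.8 and 8.2 lie 0.4 standard errors either side of the mean. The sketch below uses `statistics.NormalDist`; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 8, 3, 36
standard_error = sigma / sqrt(n)

# Approximate sampling distribution of the sample mean (Central Limit Theorem).
xbar_dist = NormalDist(mu, standard_error)

z_low = (7.8 - mu) / standard_error
z_high = (8.2 - mu) / standard_error
prob = xbar_dist.cdf(8.2) - xbar_dist.cdf(7.8)

print(round(z_low, 2), round(z_high, 2), round(prob, 4))
```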
56
Distribution of Sample Mean, proportion,
and variance
Dr. A. Ramesh
DEPARTMENT OF MANAGEMENT
IIT ROORKEE
1
2
Acceptance Intervals
Goal: determine a range within which sample means are likely to occur, given a
population mean and variance
• By the Central Limit Theorem, we know that the distribution of $\bar{X}$ is approximately normal if n is large enough, with mean μ and standard deviation $\sigma_{\bar{X}}$
• Let $z_{\alpha/2}$ be the z-value that leaves area α/2 in the upper tail of the normal distribution (i.e., the interval $-z_{\alpha/2}$ to $z_{\alpha/2}$ encloses probability 1 – α)
• Then
$\mu \pm z_{\alpha/2} \sigma_{\bar{X}}$
is the interval that includes $\bar{X}$ with probability 1 – α
3
Sampling Distributions of Sample Proportions
Sampling
Distributions
4
Sampling Distributions of Sample Proportions
P = the proportion of the population having some characteristic
• Sample proportion (p̂) provides an estimate of P:
5
Sampling Distribution of p̂
• Normal approximation:
[Figure: sampling distribution P(p̂) plotted against p̂ from 0 to 1]
Properties: $E(\hat{p}) = P$ and $\sigma_{\hat{p}}^2 = Var\left( \frac{X}{n} \right) = \frac{P(1 - P)}{n}$
(where P = population proportion)
6
7
Z-Value for Proportions
$Z = \frac{\hat{p} - P}{\sigma_{\hat{p}}} = \frac{\hat{p} - P}{\sqrt{\frac{P(1 - P)}{n}}}$
8
Example
9
Example (continued)
10
Example
(continued)
.4251
Standardize
Sampling
Distributions
12
Sample Variance
• Let x1, x2, . . . , xn be a random sample from a population. The
sample variance is
$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
13
Sampling Distribution of Sample Variances
• The sampling distribution of $s^2$ has mean $\sigma^2$
$E(s^2) = \sigma^2$
14
15
The Chi-square Distribution
[Figure: three chi-square density curves, each plotted against χ² from 0 to 28]
16
Degrees of Freedom (df)
Idea: Number of observations that are free to vary after sample
mean has been calculated
Example: Suppose the mean of 3 numbers is 8.0
Let X1 = 7
Let X2 = 8
What is X3?
If the mean of these three values is 8.0, then X3 must be 9 (i.e., X3 is not free to vary)
18
Finding the Chi-square Value
$\chi^2 = \frac{(n - 1) s^2}{\sigma^2}$ is chi-square distributed with (n – 1) = 13 degrees of freedom
• Use the chi-square distribution with area 0.05 in the upper tail:
$\chi^2_{13} = 22.36$ (α = .05 and 14 – 1 = 13 d.f.)
[Figure: chi-square density with upper-tail probability α = .05 to the right of $\chi^2_{13} = 22.36$]
19
Chi-square Example
(continued)
so $K = \frac{(22.36)(16)}{(14 - 1)} = 27.52$
22
Confidence Interval Estimation: Single
Population
Dr. A. Ramesh
Department of Management Studies
IIT ROORKEE
1
Goals
After completing this lecture, you should be able to:
• Distinguish between a point estimate and a confidence interval estimate
• Construct and interpret a confidence interval estimate for a single
population mean using both the Z and t distributions
• Form and interpret a confidence interval estimate for a single population
proportion
• Create confidence interval estimates for the variance of a normal
population
2
Confidence Intervals
• Confidence Intervals for the Population Mean, μ
– when Population Variance σ2 is Known
– when Population Variance σ2 is Unknown
• Confidence Intervals for the Population Proportion, p̂ (large samples)
• Confidence interval estimates for the variance of a normal population
3
Definitions
4
Point and Interval Estimates
[Figure: number line showing the Lower Confidence Limit, the Point Estimate and the Upper Confidence Limit; the distance between the two limits is the width of the confidence interval]
5
Point Estimates
Mean: population parameter μ, point estimate $\bar{x}$
Proportion: population parameter P, point estimate $\hat{p}$
6
Unbiasedness
$E(\hat{\theta}) = \theta$
• Examples:
– The sample mean $\bar{x}$ is an unbiased estimator of μ
– The sample variance $s^2$ is an unbiased estimator of $\sigma^2$
– The sample proportion $\hat{p}$ is an unbiased estimator of P
7
Unbiasedness
(continued)
• θ̂1 is an unbiased estimator, θ̂2 is biased:
θ̂1 θ̂2
θ θ̂
8
Bias
• Let $\hat{\theta}$ be an estimator of θ
$Bias(\hat{\theta}) = E(\hat{\theta}) - \theta$
• The bias of an unbiased estimator is 0
9
Most Efficient Estimator
• Suppose there are several unbiased estimators of θ
• The most efficient estimator or the minimum variance unbiased estimator of θ is the unbiased estimator with the smallest variance
• Let $\hat{\theta}_1$ and $\hat{\theta}_2$ be two unbiased estimators of θ, based on the same number of sample observations. Then,
– $\hat{\theta}_1$ is said to be more efficient than $\hat{\theta}_2$ if $Var(\hat{\theta}_1) < Var(\hat{\theta}_2)$
10
Confidence Intervals
11
Confidence Interval Estimate
12
Confidence Interval and Confidence Level
13
Estimation Process
Sample
14
Confidence Level, (1-α)
(continued)
• Suppose confidence level = 95%
• Also written (1 - α) = 0.95
• A relative frequency interpretation:
– From repeated samples, 95% of all the confidence intervals that can
be constructed will contain the unknown true parameter
• A specific interval either will contain or will not contain the true
parameter
15
General Formula
• The value of the reliability factor depends on the desired level of confidence
16
Confidence Intervals
Confidence
Intervals
σ2 Known σ2 Unknown
17
Confidence Interval for μ (σ2 Known)
• Assumptions
– Population variance σ2 is known
– Population is normally distributed
– If population is not normal, use large sample
• Confidence interval estimate:
$\bar{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
(where $z_{\alpha/2}$ is the normal distribution value for a probability of α/2 in each tail)
18
Margin of Error
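• The confidence interval,
$\bar{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
• can also be written as $\bar{x} \pm ME$, where ME is the margin of error
$ME = z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$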
19
Reducing the Margin of Error
$ME = z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
The margin of error can be reduced if
20
Finding the Reliability Factor, $z_{\alpha/2}$
• Consider a 95% confidence interval:
$1 - \alpha = .95$, so $\frac{\alpha}{2} = .025$ in each tail
Confidence Level    Confidence Coefficient, 1 − α    $z_{\alpha/2}$ value
80%                 .80                              1.28
90%                 .90                              1.645
95%                 .95                              1.96
98%                 .98                              2.33
99%                 .99                              2.58
99.8%               .998                             3.08
99.9%               .999                             3.27
22
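Intervals and Level of Confidence
• Intervals extend from $LCL = \bar{x} - z \frac{\sigma}{\sqrt{n}}$ to $UCL = \bar{x} + z \frac{\sigma}{\sqrt{n}}$
• 100(1-α)% of intervals constructed contain μ; 100(α)% do not.
[Figure: sampling distribution of the mean, centered at $\mu_{\bar{x}} = \mu$ with area 1 − α in the middle and α/2 in each tail, above confidence intervals built around sample means $\bar{x}_1$, $\bar{x}_2$, …]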
23
Example
• Determine a 95% confidence interval for the true mean resistance of the
population.
24
Example
(continued)
$2.20 \pm .2068$
$1.9932 < \mu < 2.4068$
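A confidence interval of this form can be sketched as below. The sample mean of 2.20 comes from the example, but the values σ = 0.35 and n = 11 are assumptions chosen only because they reproduce the margin of error of .2068; they do not appear in the surviving slide text. The function name is also illustrative.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(xbar, sigma, n, confidence=0.95):
    """Confidence interval for the mean when the population sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # reliability factor z_(alpha/2)
    margin = z * sigma / sqrt(n)              # margin of error
    return xbar - margin, xbar + margin, margin

# sigma and n are assumed values (see the note above).
lower, upper, margin = z_confidence_interval(2.20, sigma=0.35, n=11)

print(round(lower, 4), round(upper, 4), round(margin, 4))
```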
25
Interpretation
26
Confidence Intervals
Confidence
Intervals
σ2 Known σ2 Unknown
27
Confidence Interval Estimation: Single
Population-II
Dr. A. Ramesh
Department of Management Studies
IIT ROORKEE
1
Student’s t Distribution
• Consider a random sample of n observations
– with mean x and standard deviation s
– from a normally distributed population with mean μ
2
Confidence Interval for μ (σ2 Unknown)
3
Confidence Interval for μ (σ Unknown)
(continued)
• Assumptions
– Population standard deviation is unknown
– Population is normally distributed
– If population is not normal, use large sample
• Use Student’s t Distribution
• Confidence Interval Estimate:
$\bar{x} - t_{n-1,\alpha/2} \frac{S}{\sqrt{n}} < \mu < \bar{x} + t_{n-1,\alpha/2} \frac{S}{\sqrt{n}}$
where $t_{n-1,\alpha/2}$ is the critical value of the t distribution with n-1 d.f. and an area of α/2 in each tail
4
Margin of Error
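• The confidence interval,
$\bar{x} - t_{n-1,\alpha/2} \frac{S}{\sqrt{n}} < \mu < \bar{x} + t_{n-1,\alpha/2} \frac{S}{\sqrt{n}}$
• can also be written as $\bar{x} \pm ME$, where ME is the margin of error
$ME = t_{n-1,\alpha/2} \frac{S}{\sqrt{n}}$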
5
Student’s t Distribution
6
Student’s t Distribution
Note: t → Z as n increases
t-distributions are bell-shaped and symmetric, but have ‘fatter’ tails than the normal
[Figure: standard normal curve (t with df = ∞) compared with t (df = 13) and t (df = 5), all centered at t = 0]
7
Student’s t Table
Confidence Level    t (10 d.f.)    t (20 d.f.)    t (30 d.f.)    Z
Note: t → Z as n increases
9
Example
Confidence
Intervals
σ2 Known σ2 Unknown
11
Confidence Intervals for the
Population Proportion
12
Confidence Intervals for the Population
Proportion, p
(continued)
$\sigma_P = \sqrt{\frac{P(1 - P)}{n}}$
• We will estimate this with sample data:
$\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
13
Confidence Interval Endpoints
• Upper and lower confidence limits for the population proportion are
calculated with the formula
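$\hat{p} - z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} < P < \hat{p} + z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
• where
– $z_{\alpha/2}$ is the standard normal value for the level of confidence desired
– $\hat{p}$ is the sample proportion
– n is the sample size
– nP(1−P) > 5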
14
Example
15
Example (continued)
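$\hat{p} - z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} < P < \hat{p} + z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
$\frac{25}{100} - 1.96 \sqrt{\frac{.25(.75)}{100}} < P < \frac{25}{100} + 1.96 \sqrt{\frac{.25(.75)}{100}}$
$0.1651 < P < 0.3349$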
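The same interval can be computed in a few lines. The sketch uses the rounded reliability factor 1.96 from the calculation above; the function name `proportion_confidence_interval` is illustrative.

```python
from math import sqrt

def proportion_confidence_interval(successes, n, z):
    """Large-sample confidence interval for a population proportion."""
    p_hat = successes / n
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# 25 successes in a sample of 100, 95% confidence (z = 1.96).
lower, upper = proportion_confidence_interval(25, 100, z=1.96)

print(round(lower, 4), round(upper, 4))
```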
16
Interpretation
• Although the interval from 0.1651 to 0.3349 may or may not contain the true
proportion, 95% of intervals formed from samples of size 100 in this manner
will contain the true proportion.
17
Confidence Intervals
Confidence
Intervals
σ2 Known σ2 Unknown
18
Confidence Intervals for the Population
Variance
19
Confidence Intervals for the Population Variance
(continued)
20
Confidence Intervals for the Population Variance
$\frac{(n - 1) s^2}{\chi^2_{n-1,\,\alpha/2}} < \sigma^2 < \frac{(n - 1) s^2}{\chi^2_{n-1,\,1-\alpha/2}}$
21
Example
Sample size 17
Sample mean 3004
Sample std dev 74
22
Finding the Chi-square Values
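$\chi^2_{n-1,\,\alpha/2} = \chi^2_{16,\,0.025} = 28.85$
$\chi^2_{n-1,\,1-\alpha/2} = \chi^2_{16,\,0.975} = 6.91$
[Figure: chi-square densities with 16 d.f., showing probability α/2 = .025 below $\chi^2_{16} = 6.91$ and probability α/2 = .025 above $\chi^2_{16} = 28.85$]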
23
Calculating the Confidence Limits
$\frac{(17 - 1)(74)^2}{28.85} < \sigma^2 < \frac{(17 - 1)(74)^2}{6.91}$
$3037 < \sigma^2 < 12683$
Converting to standard deviation, we are 95% confident that the population standard deviation of CPU speed is between 55.1 and 112.6 MHz
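The limits can be reproduced from the sample size, the sample standard deviation and the two chi-square values quoted above. The sketch takes the critical values 28.85 and 6.91 as given rather than computing them, since the Python standard library has no chi-square quantile function; with the rounded value 6.91 the upper limit comes out near 12680 rather than 12683.

```python
from math import sqrt

n = 17           # sample size
s = 74           # sample standard deviation
chi2_upper = 28.85   # chi-square value with area .025 above it (16 d.f.)
chi2_lower = 6.91    # chi-square value with area .025 below it (16 d.f.)

var_lower = (n - 1) * s ** 2 / chi2_upper
var_upper = (n - 1) * s ** 2 / chi2_lower

# Converting the variance limits to standard deviation limits.
sd_lower, sd_upper = sqrt(var_lower), sqrt(var_upper)

print(round(var_lower), round(var_upper), round(sd_lower, 1), round(sd_upper, 1))
```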
24
Finite Populations
25
Finite Population Correction Factor
finite population correction factor $= \frac{N - n}{N - 1}$
26
Estimating the Population Mean
27
Finite Populations: Mean
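$\hat{\sigma}_{\bar{x}}^2 = \frac{s^2}{n} \cdot \frac{N - n}{N - 1}$
• So the 100(1-α)% confidence interval for the population mean is
$\bar{x} - t_{n-1,\alpha/2} \hat{\sigma}_{\bar{x}} < \mu < \bar{x} + t_{n-1,\alpha/2} \hat{\sigma}_{\bar{x}}$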
28
Estimating the Population Proportion
29
Finite Populations: Proportion
$\hat{p} - z_{\alpha/2} \hat{\sigma}_{\hat{p}} < P < \hat{p} + z_{\alpha/2} \hat{\sigma}_{\hat{p}}$
30
Lecture Summary
• Introduced the concept of confidence intervals
• Discussed point estimates
• Developed confidence interval estimates
• Created confidence interval estimates for the mean (σ2
known)
• Introduced the Student’s t distribution
• Determined confidence interval estimates for the mean (σ2
unknown)
31
Lecture Summary
(continued)
• Created confidence interval estimates for the proportion
• Created confidence interval estimates for the variance of a normal
population
• Applied the finite population correction factor to form confidence
intervals when the sample size is not small relative to the population size
32
Summary
• Introduced sampling distributions
• Described the sampling distribution of sample means
– For normal populations
– Using the Central Limit Theorem
• Described the sampling distribution of sample proportions
• Introduced the chi-square distribution
• Examined sampling distributions for sample variances
• Calculated probabilities using sampling distributions
33
Thank You
34
Hypothesis Testing
Class Objectives
• Population Proportion
Hypothesis Testing
• The hypothesis testing procedure uses data from a sample to test the two
competing statements indicated by H0 and Ha.
Developing Null and Alternative Hypotheses
• It is not always obvious how the null and alternative hypotheses should be
formulated
• Care must be taken to structure the hypotheses appropriately so that the test
conclusion provides the information the researcher wants
• The context of the situation is very important in determining how the hypotheses
should be stated
• In such cases, it is often best to begin with the alternative hypothesis and make it the conclusion that
the researcher hopes to support
• The conclusion that the research hypothesis is true is made if the sample data provide sufficient
evidence to show that the null hypothesis can be rejected
Developing Null and Alternative Hypotheses
• Example: A new manufacturing method is believed to be better than the current method.
• Alternative Hypothesis:
• Null Hypothesis:
• Alternative Hypothesis:
• Null Hypothesis:
• Example:
• Alternative Hypothesis:
– The new drug lowers Cholesterol-level more than the existing drug
• Null Hypothesis:
– The new drug does not lower Cholesterol-level more than the existing
drug
Developing Null and Alternative Hypotheses
• We might begin with a belief or assumption that a statement about the value of a population
parameter is true
• We then use a hypothesis test to challenge the assumption and determine if there is statistical evidence to conclude that the assumption is incorrect
• Example:
• Null Hypothesis:
• Alternative Hypothesis:
• The equality part of the hypotheses always appears in the null hypothesis
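• In general, a hypothesis test about the value of a population mean μ must take one of the following
three forms (where μ0 is the hypothesized value of the population mean)
– One-tailed (lower tail): $H_0: \mu \ge \mu_0$, $H_a: \mu < \mu_0$
– One-tailed (upper tail): $H_0: \mu \le \mu_0$, $H_a: \mu > \mu_0$
– Two-tailed: $H_0: \mu = \mu_0$, $H_a: \mu \ne \mu_0$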
• Because hypothesis tests are based on sample data, we must allow for the
possibility of errors
• The probability of making a Type I error when the null hypothesis is true is called
the level of significance α
• Applications of hypothesis testing that only control the Type I error are
often called significance tests
Type II Error
• Statisticians avoid the risk of making a Type II error by using “do not reject H0” and not “accept H0”.
Type I and Type II Errors
                                Population Condition
Conclusion                      H0 True (μ < 8)      H0 False (μ ≥ 8)
Accept H0 (Conclude μ < 8)      Correct Decision     Type II Error
Reject H0 (Conclude μ > 8)      Type I Error         Correct Decision
Three Approaches for Hypothesis Testing
• p-Value
• Critical Value
• Confidence Interval
• The p-value is the probability, computed using the test statistic, that measures the support (or lack of
support) provided by the sample for the null hypothesis
• If the p-value is less than or equal to the level of significance α, the value of the test statistic is in the
rejection region
[Figure: lower-tailed test, p-value approach. Sampling distribution of z with α = .10 (−zα = −1.28); the test statistic z = −1.46 gives p-value = .072.]
p-Value Approach
Upper-Tailed Test About a Population Mean: σ Known
[Figure: upper-tailed test, p-value approach. zα = 1.75; the test statistic z = 2.29 gives p-value = .011.]
p-Value Approach
Critical Value Approach to One-Tailed Hypothesis Testing
• The test statistic z has a standard normal probability distribution.
• We can use the standard normal probability distribution table to find the z-value with an area of α
in the lower (or upper) tail of the distribution.
• The value of the test statistic that establishes the boundary of the rejection region is called the
critical value for the test.
• The rejection rule is:
Lower tail: Reject H0 if z < -zα
Upper tail: Reject H0 if z > zα
Lower-Tailed Test About a Population Mean: σ Known
[Figure: critical value approach, lower-tailed test. Sampling distribution of z; reject H0 if z falls below −zα = −1.28 (α = .10), otherwise do not reject H0.]
Upper-Tailed Test About a Population Mean: σ Known
Critical Value Approach
[Figure: critical value approach, upper-tailed test. Sampling distribution of z; reject H0 if z falls above zα = 1.645, otherwise do not reject H0.]
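Steps of Hypothesis Testing
• Step 1. Develop the null and alternative hypotheses.
• Step 2. Specify the level of significance α.
• Step 3. Collect the sample data and compute the test statistic.
• p-Value Approach
• Step 4. Use the value of the test statistic to compute the p-value.
• Step 5. Reject H0 if the p-value ≤ α.
• Critical Value Approach
• Step 4. Use the level of significance to determine the critical value and the rejection rule.
• Step 5. Use the value of the test statistic and the rejection rule to determine whether to reject H0.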
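The steps of a one-tailed z test can be sketched in Python as follows. This is a minimal illustration using only the standard library; the function name `one_tailed_z_test` and the numbers in the example call are hypothetical and are not taken from the slides.

```python
from math import sqrt
from statistics import NormalDist


def one_tailed_z_test(xbar, mu0, sigma, n, alpha, tail):
    """One-tailed z test about a population mean with sigma known.

    tail is "lower" (Ha: mu < mu0) or "upper" (Ha: mu > mu0).
    """
    std_normal = NormalDist()  # standard normal: mean 0, sd 1
    # Step 3: compute the test statistic
    z = (xbar - mu0) / (sigma / sqrt(n))
    if tail == "lower":
        p_value = std_normal.cdf(z)            # area to the left of z
        critical = std_normal.inv_cdf(alpha)   # -z_alpha
        reject_cv = z < critical
    elif tail == "upper":
        p_value = 1 - std_normal.cdf(z)        # area to the right of z
        critical = std_normal.inv_cdf(1 - alpha)  # z_alpha
        reject_cv = z > critical
    else:
        raise ValueError('tail must be "lower" or "upper"')
    return {
        "z": z,
        "p_value": p_value,
        "critical_value": critical,
        "reject_by_p_value": p_value <= alpha,   # p-value approach
        "reject_by_critical_value": reject_cv,   # critical value approach
    }


# Hypothetical numbers, chosen only to exercise the function
result = one_tailed_z_test(xbar=52, mu0=50, sigma=5, n=25, alpha=0.05, tail="upper")
print(result)
```

Both approaches lead to the same decision, which is why the function reports them side by side.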
1
Class Objectives
2
One-Tailed Tests About a Population Mean: σ Known
Given Values
• Sample
– Sample mean = 32 min
– Sample size = 30
• Population
– Population mean = 30 min
– α = 0.05
4
p -Value Approach
5
One-Tailed Tests About a Population Mean: σ Known
1. Develop the hypotheses.
$H_0: \mu \le 30$, $H_a: \mu > 30$
2. Specify the level of significance.
α = .05
3. Compute the value of the test statistic.
$z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \dfrac{32 - 30}{10/\sqrt{30}} = 1.09$
6
7
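One-Tailed Tests About a Population Mean: σ Known
p-Value Approach
4. Compute the p-value.
– For z = 1.09, p-value = .137
5. Determine whether to reject H0.
– Because p-value = .137 > α = .05, we do not reject H0.
• There is not sufficient statistical evidence to infer that the pizza delivery service is not meeting the response
goal of 30 minutes.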
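The pizza delivery numbers can be checked with a few lines of Python. This is a sketch using the standard library; it assumes an upper-tailed test (Ha: μ > 30) and takes σ = 10 from the test statistic computation on the slide. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Values from the slides: sample mean 32 min, mu0 = 30 min, sigma = 10, n = 30
xbar, mu0, sigma, n, alpha = 32, 30, 10, 30, 0.05

z = (xbar - mu0) / (sigma / sqrt(n))     # test statistic
p_value = 1 - NormalDist().cdf(z)        # upper-tail area (assumes Ha: mu > 30)
reject_h0 = p_value <= alpha

print(f"z = {z:.2f}, p-value = {p_value:.3f}, reject H0: {reject_h0}")
```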
8
One-Tailed Tests About a Population Mean: σ Known
p-Value Approach
[Figure: sampling distribution of z with α = .05 (zα = 1.645); the test statistic z = 1.09 gives p-value = 0.137.]
9
Critical Value Approach
10
One-Tailed Tests About a Population Mean: σ Known
11
p-Value Approach to Two-Tailed Hypothesis Testing
12
Compute the p-value using the following three steps:
1. Compute the value of the test statistic z.
2. If z is in the upper tail (z > 0), find the area under the standard normal curve to the right of z.
If z is in the lower tail (z < 0), find the area under the standard normal curve to the left of z.
3. Double the tail area obtained in step 2 to obtain the p-value.
13
Critical Value Approach to Two-Tailed Hypothesis Testing
• The critical values will occur in both the lower and upper tails of the standard normal curve.
• Use the standard normal probability distribution table to find zα/2 (the z-value with an area of α/2
in the upper tail of the distribution).
14
Two-Tailed Tests About a Population Mean: σ Known
Given Values
• Sample
– Sample size = 30
– Sample mean = 505 ml
• Population
– Population mean = 500 ml
– Standard deviation = 10 ml
• Significance level = 0.03
16
p –Value approach
17
Two-Tailed Tests About a Population Mean: σ Known
1. Determine the hypotheses.
$H_0: \mu = 500$, $H_a: \mu \ne 500$
2. Specify the level of significance.
α = .03
3. Compute the value of the test statistic.
$z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \dfrac{505 - 500}{10/\sqrt{30}} = 2.74$
18
19
Two-Tailed Tests About a Population Mean: σ Known
p-Value Approach
4. Compute the p-value.
– For z = 2.74, p-value = 2(1 - .9969) = .0062
5. Determine whether to reject H0.
– Because p-value = .0062 ≤ α = .03, we reject H0.
There is sufficient statistical evidence to infer that the null hypothesis is not true (i.e. the mean filling
quantity is not 500 ml)
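The two-tailed p-value for the filling example can be reproduced as below. This is a sketch with the standard library; the variable names are illustrative. Because the code uses the exact normal distribution rather than a rounded table value, the p-value differs slightly in the fourth decimal place from a table-based calculation.

```python
from math import sqrt
from statistics import NormalDist

# Values from the slides: sample mean 505 ml, mu0 = 500 ml, sigma = 10 ml, n = 30
xbar, mu0, sigma, n, alpha = 505, 500, 10, 30, 0.03

std_normal = NormalDist()
z = (xbar - mu0) / (sigma / sqrt(n))          # test statistic
tail_area = 1 - std_normal.cdf(abs(z))        # area beyond |z| in one tail
p_value = 2 * tail_area                       # double it for a two-tailed test
z_crit = std_normal.inv_cdf(1 - alpha / 2)    # z_{alpha/2}

print(f"z = {z:.2f}, p-value = {p_value:.4f}, z_crit = {z_crit:.2f}")
print("reject H0:", p_value <= alpha)
```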
20
Two-Tailed Tests About a Population Mean: σ Known
p-Value Approach
[Figure: sampling distribution of z with α/2 = .015 in each tail (−zα/2 = −2.17, zα/2 = 2.17); the test statistic z = ±2.74 cuts off 1/2 p-value = .0031 in each tail.]
21
Critical Value Approach
22
Two-Tailed Tests About a Population Mean: σ Known
Because z = 2.74 > zα/2 = 2.17, we reject H0. There is sufficient statistical evidence to infer that the null hypothesis is not true
23
24
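Two-Tailed Tests About a Population Mean: σ Known
[Figure: critical value approach, two-tailed test. Reject H0 if z < −2.17 or z > 2.17; otherwise do not reject H0.]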
25
Confidence Interval Approach
26
Confidence Interval Approach to
Two-Tailed Tests About a Population Mean
• Select a simple random sample from the population and use the value of the sample mean to
develop the confidence interval for the population mean μ.
• If the confidence interval contains the hypothesized value 500, do not reject H0. Otherwise, reject H0.
• Actually, H0 should be rejected if μ0 happens to be equal to one of the end points of the confidence
interval.
27
Confidence Interval Approach to Two-Tailed Tests About a Population Mean
The 97% confidence interval for μ is
$\bar{x} \pm z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}} = 505 \pm 2.17\dfrac{10}{\sqrt{30}} = 505 \pm 3.9619$
or (501.03814, 508.96186)
Because the hypothesized value for the population mean, μ0 = 500 ml, is not in this interval, the
hypothesis-testing conclusion is that the null hypothesis, H0: μ = 500, is rejected.
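The confidence interval approach can be sketched in Python as follows, using the filling example values (sample mean 505 ml, σ = 10 ml, n = 30, α = .03). Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

xbar, mu0, sigma, n, alpha = 505, 500, 10, 30, 0.03

z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}, about 2.17
margin = z_crit * sigma / sqrt(n)              # half-width of the interval
lower, upper = xbar - margin, xbar + margin    # 97% confidence interval

# Reject H0: mu = mu0 when mu0 lies outside the interval
reject_h0 = not (lower < mu0 < upper)
print(f"97% CI: ({lower:.3f}, {upper:.3f}), reject H0: {reject_h0}")
```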
Hypothesis Testing-III
1
Tests About a Population Mean: σ Unknown
• Test Statistic
$t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$
This test statistic has a t distribution with n - 1 degrees of freedom.
3
4
One-Tailed Test About a Population Mean: σ Unknown
Example: Ice Cream Demand
• In an ice cream parlor at IIT Roorkee, the following data represent the number of ice-creams sold in 20 days
• Test hypothesis H0: μ ≤ 10
• Use α = .05 to test the hypothesis.

Day   No. of ice-creams sold   Day   No. of ice-creams sold
1     13                       11    12
2     8                        12    11
3     10                       13    11
4     10                       14    12
5     8                        15    10
6     9                        16    12
7     10                       17    7
8     11                       18    10
9     6                        19    11
10    8                        20    8
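The t statistic for the ice cream data can be computed with the standard library as sketched below. Two assumptions are made here that the slides do not state explicitly: the test is read as upper-tailed (H0: μ ≤ 10 against Ha: μ > 10), and the critical value 1.729 is the usual t-table entry for an upper-tail area of .05 with 19 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

# Number of ice-creams sold on days 1 to 20 (from the table above)
sales = [13, 8, 10, 10, 8, 9, 10, 11, 6, 8,
         12, 11, 11, 12, 10, 12, 7, 10, 11, 8]

mu0 = 10
n = len(sales)
xbar = mean(sales)      # sample mean
s = stdev(sales)        # sample standard deviation (n - 1 in the denominator)
t_stat = (xbar - mu0) / (s / sqrt(n))

t_crit = 1.729          # assumed t-table value: alpha = .05, df = 19, upper tail
reject_h0 = t_stat > t_crit
print(f"mean = {xbar:.2f}, s = {s:.3f}, t = {t_stat:.3f}, reject H0: {reject_h0}")
```

With a sample mean below 10, the statistic is negative, so under this reading there is no evidence that mean daily sales exceed 10.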
5
Given Data
6
7
One-Tailed Test About a Population Mean: σ Unknown
[Figure: sampling distribution of t showing the "Reject H0" and "Do Not Reject H0" regions.]
Hypothesis Testing – proportion
9
Null and Alternative Hypotheses: Population Proportion
• The equality part of the hypotheses always appears in the null hypothesis.
• In general, a hypothesis test about the value of a population proportion p must take one of the
following three forms (where p0 is the hypothesized value of the population proportion).
– One-tailed (lower tail): $H_0: p \ge p_0$, $H_a: p < p_0$
– One-tailed (upper tail): $H_0: p \le p_0$, $H_a: p > p_0$
– Two-tailed: $H_0: p = p_0$, $H_a: p \ne p_0$
10
Tests About a Population Proportion
Test Statistic
$z = \dfrac{\bar{p} - p_0}{\sigma_{\bar{p}}}$
where:
$\sigma_{\bar{p}} = \sqrt{\dfrac{p_0(1 - p_0)}{n}}$
11
Tests About a Population Proportion
Rejection Rule: p-Value Approach
Reject H0 if p-value ≤ α
Rejection Rule: Critical Value Approach
H0: p ≤ p0: Reject H0 if z > zα
12
Two-Tailed Test About a Population Proportion
Example: City Traffic Police
13
p –Value Approach
14
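Two-Tailed Test About a Population Proportion
1. Determine the hypotheses.
$H_0: p = .5$, $H_a: p \ne .5$
$\sigma_{\bar{p}} = \sqrt{\dfrac{p_0(1 - p_0)}{n}} = \sqrt{\dfrac{.5(1 - .5)}{120}} = .045644$
$z = \dfrac{\bar{p} - p_0}{\sigma_{\bar{p}}} = \dfrac{(67/120) - .5}{.045644} = 1.28$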
15
Two-Tailed Test About a Population Proportion
16
17
Critical Value Approach
18
Two-Tailed Test About a Population Proportion
Because 1.278 > -1.96 and < 1.96, we cannot reject H0.
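The proportion test can be checked in Python as sketched below. The sample counts (67 out of 120) and p0 = .5 come from the slides; α = .05 is inferred from the critical values ±1.96, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

successes, n, p0, alpha = 67, 120, 0.5, 0.05   # alpha inferred from +/-1.96

p_bar = successes / n                          # sample proportion
se = sqrt(p0 * (1 - p0) / n)                   # standard error under H0
z = (p_bar - p0) / se                          # test statistic

std_normal = NormalDist()
p_value = 2 * (1 - std_normal.cdf(abs(z)))     # two-tailed p-value
z_crit = std_normal.inv_cdf(1 - alpha / 2)     # about 1.96

print(f"se = {se:.6f}, z = {z:.3f}, p-value = {p_value:.4f}")
print("reject H0:", abs(z) > z_crit)
```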
19
Errors in Hypothesis Testing
1
Example
• We are interested in burning rate of a solid propellant used to power aircrew escape systems
Reference: Applied statistics and probability for engineers, Douglas C. Montgomery, George C. Runger, John Wiley &
Sons, 2007
2
Value of the null hypothesis
– Past experience or knowledge of the process, or even from the previous tests or experiments
obligations
3
Note: for this example n=10
4
Type I Error
• The true mean burning rate of the propellant could be equal to 50 centimeters per second
• However, for the randomly selected propellant specimens that are tested, we could observe a value of the test
statistic x̄ that falls into the critical region
• We would then reject the null hypothesis H0 in favor of the alternate H1 when, in fact, H0 is really true
5
Type I Error
6
Type II Error
• Now suppose the true mean burning rate is different from 50 centimeters per second, yet the sample
mean x̄ falls in the acceptance region. We would then fail to reject H0 when it is false
7
Type II Error
8
Type I and Type II Errors
H0 is correct H0 is incorrect
9
Type I error
• In the propellant burning rate example, a type I error will occur when either $\bar{x} > 51.5$ or $\bar{x} < 48.5$ while the true mean burning rate is μ = 50
• Suppose the standard deviation of burning rate is σ = 2.5 centimeters per second and n = 10
• Type I error is
$\alpha = P(\bar{x} < 48.5 \text{ when } \mu = 50) + P(\bar{x} > 51.5 \text{ when } \mu = 50)$
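The Type I error probability for the propellant example can be evaluated numerically as sketched below, using the standard library. It models the sample mean as normal with mean 50 and standard deviation σ/√n; variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n = 50, 2.5, 10
lower_crit, upper_crit = 48.5, 51.5

# Sampling distribution of the sample mean when H0 (mu = 50) is true
xbar_dist = NormalDist(mu=mu0, sigma=sigma / sqrt(n))

# alpha = P(xbar < 48.5 | mu = 50) + P(xbar > 51.5 | mu = 50)
alpha = xbar_dist.cdf(lower_crit) + (1 - xbar_dist.cdf(upper_crit))
print(f"alpha = {alpha:.4f}")
```

The result is close to 0.057, i.e. about 5.7% of all random samples would lead to rejecting H0 when μ = 50.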
10
[Figure callouts: "Where does this number come from?" and "We will reject the null hypothesis (μ = 50) if our sample mean is in either of these two regions"]
11
12
Type I error
• This implies that 5.7 % of all random samples would lead to rejection of the hypothesis Ho: µ=50
• We can reduce the type I error by widening the acceptance region. If we make critical value 48 and
13
TYPE II ERROR
14
[Figure: the pink area is the probability of a Type II error if the actual mean is 52.]
15
Type II Error
• Type II error will be committed if the sample mean x-bar falls between 48.5 and 51.5 (critical region
• When µ = 52
• β = P(48.5 ≤ x̄ ≤ 51.5 when µ = 52) = 0.2643
• When µ = 50.5
• β = P(48.5 ≤ x̄ ≤ 51.5 when µ = 50.5) = 0.8923
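The Type II error probabilities for the propellant example can be computed as sketched below, assuming σ = 2.5 and n = 10 as stated earlier in the example. The helper name `type_ii_error` is hypothetical. Exact normal probabilities differ slightly in the third decimal place from values obtained with z rounded to two decimals for a table lookup.

```python
from math import sqrt
from statistics import NormalDist

sigma, n = 2.5, 10
lower_crit, upper_crit = 48.5, 51.5   # acceptance region for the sample mean


def type_ii_error(true_mean):
    """beta = P(48.5 <= xbar <= 51.5) when the true mean is true_mean."""
    xbar_dist = NormalDist(mu=true_mean, sigma=sigma / sqrt(n))
    return xbar_dist.cdf(upper_crit) - xbar_dist.cdf(lower_crit)


for mu in (52, 50.5):
    print(f"mu = {mu}: beta = {type_ii_error(mu):.4f}")
```

The closer the true mean is to the hypothesized value 50, the larger β becomes.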
16
17
18
Computing the probability of a type II error may be the most difficult concept
For constant n, increasing the acceptance region (hence decreasing α) increases β.
20
Type I & II Errors Have an Inverse Relationship
21
Factors Affecting Type II Error
n
22
How to Choose between Type I and Type II Errors
• Choose smaller Type I Error when the cost of rejecting the maintained hypothesis is high
• Choose larger Type I Error when you have an interest in changing the status quo
23
Calculating the probability of Type II Error
Ho: µ = 8.3
H1: µ < 8.3
Determine the probability of Type II error if µ = 7.4 at 5% significance level. σ = 3.1 and n = 60.
24
Solution:
A Type II error will be made when Z ≥ -1.645, because we will then fail to reject Ho.
β = 0.2729
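The calculation behind β can be sketched in two stages: first find the critical sample mean under H0, then find the probability of landing in the non-rejection region when the true mean is 7.4. The variable names below are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu0, mu_true, sigma, n, alpha = 8.3, 7.4, 3.1, 60, 0.05

se = sigma / sqrt(n)                      # standard error of the sample mean
z_alpha = NormalDist().inv_cdf(alpha)     # about -1.645 for a lower-tailed test
xbar_crit = mu0 + z_alpha * se            # reject H0 when xbar < xbar_crit

# Type II error: fail to reject H0 (xbar >= xbar_crit) although mu = 7.4
beta = 1 - NormalDist(mu=mu_true, sigma=se).cdf(xbar_crit)
print(f"critical xbar = {xbar_crit:.4f}, beta = {beta:.4f}")
```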
25
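Solving for Type II Errors: Example
$H_0: \mu \ge 12$, $H_a: \mu < 12$
$Z_c = \dfrac{\bar{X}_c - \mu}{\sigma/\sqrt{n}}$, so $\bar{X}_c = 12 + (-1.645)\dfrac{0.10}{\sqrt{60}} = 11.979$
• Rejection region (α = .05, $Z_c = -1.645$): if $\bar{X} < 11.979$, reject Ho.
• Non-rejection region: if $\bar{X} \ge 11.979$, do not reject Ho.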
26
Type II Error for Example with μ = 11.99 Kg
[Figure: sampling distribution of X̄ when Ho is false (μ = 11.99). Correct decision (reject Ho): 19.77%; Type II error: β = .8023.]
27
28
Type II Error for Demonstration with μ = 11.96 Kg
[Figure: sampling distribution of X̄ when Ho is false (μ = 11.96). Correct decision (reject Ho): 92.92%; Type II error: β = .0708.]
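The two β values in this example can be reproduced as sketched below. The code assumes the reading H0: μ ≥ 12 against Ha: μ < 12 with σ = 0.10, n = 60 and α = .05, and it rounds the critical sample mean to 11.979 as on the slide. The helper name `beta_for` is hypothetical; small differences in the third or fourth decimal place come from table rounding of z.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, alpha = 12, 0.10, 60, 0.05

se = sigma / sqrt(n)
z_c = NormalDist().inv_cdf(alpha)          # about -1.645
xbar_crit = round(mu0 + z_c * se, 3)       # 11.979, rounded as on the slide


def beta_for(true_mean):
    """P(do not reject H0) = P(xbar >= xbar_crit) when mu = true_mean."""
    return 1 - NormalDist(mu=true_mean, sigma=se).cdf(xbar_crit)


for mu in (11.99, 11.96):
    b = beta_for(mu)
    print(f"mu = {mu}: beta = {b:.4f}, power = {1 - b:.4f}")
```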
29
30
Hypothesis Testing and Decision Making
• In the tests, we compared the p-value to a controlled probability of a Type I error, α, which is
called the level of significance for the test
• With a significance test, we control the probability of making the Type I error, but
not the Type II error
• We recommended the conclusion “do not reject H0” rather than “accept H0”
because the latter puts us at risk of making a Type II error
31
Hypothesis Testing and Decision Making
• With the conclusion “do not reject H0”, the statistical evidence is considered inconclusive
• Usually this is an indication to postpone a decision until further research and testing is
undertaken
• In many decision-making situations the decision maker may want, and in some cases may be
forced, to take action with both the conclusion “do not reject H0 “and the conclusion “reject
H0.”
32
Power of a test
33
Calculating the Probability of a Type II Error
34
Calculating the Probability of a Type II Error
Values of μ    z        β        1 - β
14.0           -2.31    .0104    .9896
13.6           -1.52    .0643    .9357
13.2           -0.73    .2327    .7673
12.8323         0.00    .5000    .5000
12.8            0.06    .5239    .4761
12.4            0.85    .8023    .1977
12.0001         1.645   .9500    .0500
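The table can be regenerated with the sketch below. The slides' setup for this example did not survive, so the code only uses what the table itself shows: the critical sample mean 12.8323 (the row with z = 0) and the fact that μ = 12.0001 corresponds to z = 1.645, which implies a standard error of about (12.8323 − 12)/1.645. It also assumes an upper-tailed test, so that β is the probability that the sample mean falls at or below 12.8323.

```python
from statistics import NormalDist

xbar_crit = 12.8323
se = (xbar_crit - 12) / 1.645        # standard error inferred from the table
std_normal = NormalDist()

rows = []
for mu in (14.0, 13.6, 13.2, 12.8323, 12.8, 12.4, 12.0001):
    z = (xbar_crit - mu) / se
    beta = std_normal.cdf(z)         # P(do not reject H0 | true mean = mu)
    rows.append((mu, z, beta, 1 - beta))
    print(f"{mu:8.4f}  {z:6.2f}  {beta:.4f}  {1 - beta:.4f}")
```

The last column is the power of the test, which rises toward 1 as the true mean moves further above 12.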
35
36
Power of the Test
• The probability of correctly rejecting H0 when it is false is called the power of the test.
37
Power Curve
[Figure: power curve. Vertical axis: probability of correctly rejecting H0 when H0 is false (0.00 to 1.00); horizontal axis: μ from 11.5 to 14.5.]
38
Thank You
39
Hypothesis Testing: Two sample test
1
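Hypothesis Testing about the Difference in Two Sample Means
[Figure: a sample of size n1 from Population 1 gives $\bar{X}_1 = \sum X_1 / n_1$; a sample of size n2 from Population 2 gives $\bar{X}_2 = \sum X_2 / n_2$; the statistic of interest is $\bar{X}_1 - \bar{X}_2$.]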
Two Sample Tests
• Population means, independent samples
– Example: Group 1 vs. independent Group 2
• Population means, dependent samples
– Example: Same group before vs. after treatment
• Population proportions
– Example: Proportion 1 vs. Proportion 2
• Population variances
– Example: Variance 1 vs. Variance 2
3
Difference Between Two Means
Population means,
independent samples
5
σ1² and σ2² Known
7
Hypothesis Tests for Two Population Means
8
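Decision Rules
[Figure: rejection regions for the lower-tail test (area α), the upper-tail test (area α), and the two-tail test (area α/2 in each tail).]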
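Hypothesis Testing about the Difference in Two Sample Means
[Figure: sampling distribution of $\bar{X}_1 - \bar{X}_2$, with mean $\mu_{\bar{X}_1 - \bar{X}_2} = \mu_1 - \mu_2$ and standard deviation $\sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$.]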
10
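Sampling Distribution of $\bar{x}_1 - \bar{x}_2$
• Expected Value
$E(\bar{x}_1 - \bar{x}_2) = \mu_1 - \mu_2$
Interval Estimation of μ1 - μ2: σ1 and σ2 Known
• Interval Estimate
$\bar{x}_1 - \bar{x}_2 \pm z_{\alpha/2}\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$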
12
Problem (σ1 and σ2 Known)
• A product developer is interested in reducing the drying time of a primer paint.
• Two formulations of the paint are tested; formulation 1 is the standard chemistry, and
formulation 2 has a new drying ingredient that should reduce the drying time.
• From experience, it is known that the standard deviation of drying time is 8 minutes, and this
inherent variability should be unaffected by the addition of the new ingredient.
• Ten specimens are painted with formulation 1, and another 10 specimens are painted with
formulation 2; the 20 specimens are painted in random order.
• The two sample average drying times are x̄1 = 121 minutes and x̄2 = 112 minutes,
respectively.
• What conclusions can the product developer draw about the effectiveness of the new
ingredient, using alpha = 0.05?
Source: Applied Probability and statistics for Engineers by Douglas C. Montgomery and George C. Runger John Wiley, 3rd Ed. 2003
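Problem (σ1 and σ2 Known)
$z = \dfrac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{\sigma^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \dfrac{121 - 112 - 0}{\sqrt{8^2\left(\dfrac{1}{10} + \dfrac{1}{10}\right)}} = 2.52$
[Figure: upper-tail rejection region with α = .05 and critical value 1.645; the test statistic 2.52 falls in the "Reject H0" region.]
Decision:
Reject H0 at α = 0.05
Conclusion:
There is evidence of a difference in means: the new ingredient reduces the mean drying time.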
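The paint drying-time test can be verified with the sketch below. It assumes an upper-tailed test of H0: μ1 − μ2 = 0 against Ha: μ1 > μ2 (the new formulation dries faster) with σ1 = σ2 = 8 minutes, and uses the standard normal distribution because the standard deviations are known. Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

xbar1, xbar2 = 121, 112      # sample mean drying times (minutes)
sigma1 = sigma2 = 8          # known population standard deviations
n1 = n2 = 10
alpha = 0.05

se = sqrt(sigma1**2 / n1 + sigma2**2 / n2)   # standard error of xbar1 - xbar2
z = (xbar1 - xbar2 - 0) / se                 # hypothesized difference is 0

std_normal = NormalDist()
p_value = 1 - std_normal.cdf(z)              # upper-tail p-value
z_crit = std_normal.inv_cdf(1 - alpha)       # about 1.645

print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {z > z_crit}")
```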
σ1² and σ2² Unknown, Assumed Equal
• The population variances are assumed equal, so use the two sample
standard deviations and pool them to estimate σ
20
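Test Statistic, σ1² and σ2² Unknown, Equal
The test statistic for μ1 – μ2 is:
$t = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}}$
where t has $n_1 + n_2 - 2$ degrees of freedom and the pooled variance is
$s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$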
21
Decision Rules
[Figure: hypotheses and rejection regions for the lower-tail, upper-tail, and two-tail tests about μ1 – μ2.]
σ1² and σ2² Unknown, Assumed Equal
• Two catalysts are being analyzed to determine how they affect the mean yield of a chemical process.
• Specifically, catalyst 1 is currently in use, but catalyst 2 is acceptable.
• Since catalyst 2 is cheaper, it should be adopted, providing it does not change the process yield.
• A test is run in the pilot plant and results in the data shown in the table.
• Is there any difference between the mean yields?
• Use α = 0.05, and assume equal variances.

Observation Number   Catalyst 1   Catalyst 2
1                    91.50        89.19
2                    94.18        90.95
3                    92.18        90.46
4                    95.39        93.21
5                    91.79        97.19
6                    89.07        97.04
7                    94.72        91.07
8                    89.21        92.75

x̄1 = 92.255, x̄2 = 92.733
s1 = 2.39, s2 = 2.98
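The pooled-variance t test for the catalyst data can be sketched with the standard library as below. The critical value 2.145 is an assumption taken from a standard t table (upper-tail area .025, 14 degrees of freedom) for a two-tailed test at α = 0.05; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, variance

catalyst1 = [91.50, 94.18, 92.18, 95.39, 91.79, 89.07, 94.72, 89.21]
catalyst2 = [89.19, 90.95, 90.46, 93.21, 97.19, 97.04, 91.07, 92.75]

n1, n2 = len(catalyst1), len(catalyst2)
xbar1, xbar2 = mean(catalyst1), mean(catalyst2)
s1_sq, s2_sq = variance(catalyst1), variance(catalyst2)   # sample variances

# Pooled variance, valid under the equal-variance assumption
df = n1 + n2 - 2
sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / df

# Test statistic for H0: mu1 - mu2 = 0
t_stat = (xbar1 - xbar2 - 0) / sqrt(sp_sq / n1 + sp_sq / n2)

t_crit = 2.145   # assumed t-table value: alpha/2 = .025, df = 14
reject_h0 = abs(t_stat) > t_crit
print(f"sp = {sqrt(sp_sq):.2f}, t = {t_stat:.3f}, df = {df}, reject H0: {reject_h0}")
```

Since |t| is far below the critical value, the data give no evidence that the two catalysts produce different mean yields.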
Thank You
29