Introduction To Statistics Using R
Series Editor: Steven G. Krantz, Washington University in St. Louis
Mustapha Akinkunmi, American University of Nigeria
About SYNTHESIS
This volume is a printed version of a work that appears in the Synthesis
Digital Library of Engineering and Computer Science. Synthesis
store.morganclaypool.com
Synthesis Lectures on
Mathematics and Statistics
Statistics is Easy!
Dennis Shasha and Manda Wilson
2008
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations
in printed reviews, without the prior permission of the publisher.
DOI 10.2200/S00899ED1V01Y201902MAS024
Lecture #24
Series ISSN
Print 1938-1743 Electronic 1938-1751
Morgan & Claypool Publishers
ABSTRACT
Introduction to Statistics Using R is organized into 13 major chapters. Each chapter is broken
down into many digestible subsections in order to explore the objectives of the book. There are
many real-life practical examples in this book and each of the examples is written in R codes to
acquaint the readers with some statistical methods while simultaneously learning R scripts.
KEYWORDS
descriptive statistics, probability distributions, sampling distribution, hypothesis
testing, regression analysis, correlation analysis, confidence interval
To my son,
Omar Olanrewaju Akinkunmi
Contents
Preface
2 Introduction to R Software
  2.1 How to Download and Install R
  2.2 Using R for Descriptive Statistical and Plots
  2.3 Basics of R
    2.3.1 R is Vectorized
    2.3.2 R Data Types
    2.3.3 Missing Values
    2.3.4 Data Creation
    2.3.5 Data Type Conversion
    2.3.6 Variable Information
  2.4 Basic Operations in R
    2.4.1 Subsetting
    2.4.2 Control Structures
    2.4.3 Built-In Functions in R
    2.4.4 User-Written Functions
    2.4.5 Importing, Reporting, and Writing Data
  2.5 Data Exploration
    2.5.1 Data Exploration through Visualization
  2.6 Exercises
3 Descriptive Data
  3.1 Central Tendency
  3.2 Measure of Dispersion
  3.3 Shapes of the Distribution—Symmetric and Asymmetric
  3.4 Exercises
A Tables
References
Preface
Chapter 10: Hypothesis Testing for Single Population Mean and Proportion
This chapter is all about the decision-making process and it explains how to form and test
hypotheses. It prepares readers to carry out the steps of hypothesis testing and to interpret the results.
This chapter helps the readers to use available data to make a valid conclusion.
Mustapha Akinkunmi
February 2019
CHAPTER 1
Introduction to Statistical
Analysis
Statistics is a science that supports us in making better decisions in business, economics, and
other disciplines. It provides the tools needed to summarize data, analyze data, and draw
meaningful conclusions that lead to better decisions. These better decisions assist
us in running a small business, a corporation, or the economy as a whole. This book is a bit
different from most statistics books in that it focuses on using data to help you do business.
The focus is not on statistics for statistics' sake but rather on what you need to know to
really use statistics.
To help in that endeavor, examples will include the use of the R programming language,
which was created specifically to give statisticians and technicians a tool to implement
standard techniques or to create their own solutions.
Statistics is a word that originated from the Italian word stato, meaning “state,” and a statista
is an individual saddled with the tasks of the state. Thus, statistics is the collection of information
useful to the statista. Its application commenced in Italy during the 16th century and spread
to other countries around the world. At present, statistics covers a wide range of information
in every aspect of human activities. In addition, it is not limited to the collection of numerical
information but includes data summarization, presentation, and analysis in meaningful ways.
Statistical analysis is mainly concerned with how to make generalizations from the data.
Statistics is a science that deals with information. To perform statistical analysis on
information (data) that you have on hand or collect, you may need to transform the data
into a form that can be analyzed using statistical techniques. Information can
be found in qualitative or quantitative form. To explain the difference between these two
types of information, let’s consider an example. Suppose an individual intends to start a business
based on the information in Table 1.1. Which of the variables are quantitative and which are
qualitative? The product price is a quantitative variable because it provides information based
on quantity (the product price in dollars). The number of similar businesses and the rent for
business premises are also quantitative variables. The location used in establishing the business
is a qualitative variable since it provides information about a quality (in this case a location, such
as Nigeria or South Korea). The presence of basic infrastructure requires a Yes or No response,
so it is also a qualitative variable.
Table 1.1: Business feasibility data
A quantitative variable represents a number for which arithmetic operations such as
averaging make sense. A qualitative (or categorical) variable is concerned with quality. When
numbers are used to separate members of different categories of a qualitative variable,
the assigned numbers are arbitrary but generally chosen deliberately. An aspect of statistics is concerned
with measurements, some quantitative and others qualitative. Measurements provide the real
numerical values of a variable. Qualitative variables can be represented with numbers as well;
such a representation is arbitrary but intended to be useful for the purposes at hand.
For instance, you can assign numeric codes to the values of a qualitative variable, such as
Nigeria = 1 and South Korea = 0.
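As a minimal sketch of such coding in R (the vector `location` and its values are made up for illustration and are not taken from Table 1.1), a qualitative variable can be mapped to numeric codes as follows:

```r
# Hypothetical locations for three candidate businesses (illustrative only)
location <- c("Nigeria", "South Korea", "Nigeria")

# Assign the code 1 to Nigeria and 0 to South Korea
location_code <- ifelse(location == "Nigeria", 1, 0)
print(location_code)
```

The codes only label the categories; averaging them would not describe a "typical" location.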
Figure 1.1 shows the histogram representation for the given data in Table 1.2.
A data class refers to a group of data that are similar based on a user-defined property. For
instance, ages of students can be grouped into classes such as those in their teens, 20s, 30s, etc.
Each of these groups is known as a class. Each class has a specific width known as the class
interval or class size. The class interval is crucial in the construction of histograms and
frequency diagrams. Class size depends on how the data is grouped. The class interval is usually
a whole number.
However, an example of grouped data that have different class intervals is presented in
Table 1.4. Tables like Table 1.4 are often called frequency tables, as they show how frequently
the data fall in each class interval.
[Figure: histogram. x-axis: Number of Years (classes 0-2, 3-5, 6-8, 9-11, 12-14); y-axis: Number of Businesses.]
2. Subtract the lowest value from the highest value in the dataset.
3. Divide the outcome under Step 2 by the number of classes you have in Step 1.
For instance, suppose a market survey is conducted on market prices for 20 non-durable goods with
the following prices: US$11, US$5, US$3, US$14, US$1, US$16, US$2, US$12, US$2, US$4,
US$3, US$9, US$4, US$8, US$7, US$5, US$8, US$6, US$10, and US$15. The raw data
indicate that the lowest price is US$1 and the highest price is US$16. In addition, the survey expert
decides to have four classes. The class interval is calculated as follows:
\[
\text{Class interval} = \frac{\text{highest value} - \text{lowest value}}{\text{number of classes}} = \frac{16 - 1}{4} = \frac{15}{4} = 3.75.
\]
The value of the class interval is usually a whole number, but in this case it is a
decimal number. Therefore, the solution is to round it to the nearest whole
number, which is 4. This implies that the raw data can be grouped into four classes, each of
width 4, as presented in Table 1.5.
Table 1.5: Class interval generated from ungrouped data
Price class (US$)    Frequency
1-4                  7
5-8                  6
9-12                 4
13-16                3
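The calculation above can be sketched in R. This is only an illustration, assuming the 20 survey prices listed earlier; the object names (`prices`, `n_classes`, `breaks`, and so on) are chosen here and do not come from the text.

```r
# Prices (US$) from the market survey of 20 non-durable goods
prices <- c(11, 5, 3, 14, 1, 16, 2, 12, 2, 4,
            3, 9, 4, 8, 7, 5, 8, 6, 10, 15)

n_classes <- 4
raw_width <- (max(prices) - min(prices)) / n_classes  # (16 - 1) / 4
width <- round(raw_width)                             # nearest whole number

# Lower limits of the classes, plus one closing break
breaks <- seq(min(prices), by = width, length.out = n_classes + 1)

# right = FALSE makes each class include its lower limit: [1,5), [5,9), ...
classes <- cut(prices, breaks = breaks, right = FALSE,
               labels = c("1-4", "5-8", "9-12", "13-16"))
freq <- table(classes)
print(freq)
```

The resulting counts can be compared with the frequencies in Table 1.5.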
A class interval is the range of data in that particular class. The relationship between the
class boundaries and the class interval is given as follows:
Class Interval = Upper Class Boundary - Lower Class Boundary.
The lower class boundary of one class is the same as the upper class boundary of the previous
class for both integers and non-integers. Class limits and class boundaries are important
in the graphical representation of statistical data.
Figure 1.2: Methods of visualizing data: time plot (monthly inflation rate in %, Jan-18 to Dec-18).
[Figure: pie chart with segments labeled PMS, HHK, and AGO.]
[Figure: chart of yearly values, 2008 to 2017.]
[Figure: histogram. x-axis: Years of Establishment; y-axis: Number of Businesses.]
Note: Charts can often be deceiving. This indicates the disadvantage of merely descriptive
methods of analysis and the need for statistical inference. Statistical tests make
the analysis more objective than eyeball analysis and less prone to deception, provided that the
assumptions of random sampling and others are satisfied.
[Figure: cumulative frequency curve. x-axis: Student Marks; y-axis: Cumulative Frequency.]
from the work of John W. Tukey, “Exploratory Data Analysis,” in 1977 and has made significant
strides in the past five years as software solutions in large data analysis (big data) and in
business intelligence (the reporting of characteristics of the data in meaningful ways) have improved
dramatically.
[Figure: box plot schematic marking the lowest value, Q1, Q2, Q3, and the highest value.]
1.6 EXERCISES
1.1. Explain why statistics are necessary.
1.2. Describe the difference between a quantitative variable and qualitative variable.
1.3. List the four scales of measurements discussed in the chapter and discuss each with
relevant examples.
1.4. The total liters of petroleum products used for both importation and consumption (truck
out) in Nigeria in the second quarter of 2018 is summarized in Table 1.7.
(a) Represent the liters of petroleum products used for both importation and con-
sumption (truck out) in Nigeria in the second quarter of 2018 in a bar chart.
(b) Represent the total liters of petroleum products used for both importation and
consumption (truck out) in Nigeria in the second quarter of 2018 in a pie chart.
CHAPTER 2
Introduction to R Software
R is a programming language designed for statistical analysis and graphics. It is an implementation
of the S language (on which S-PLUS is also based) and was developed by Ross Ihaka and Robert
Gentleman at the University of Auckland, New Zealand. R can work with multiple datasets at the
same time. R is open-source software which
can be downloaded at http://cran.r-project.org/. Other statistical packages include SPSS,
SAS, and Stata, but they are not open source. In addition, there is a large community of R users
online who can answer questions promptly and who contribute packages
to R. Packages increase the functions that are available for use, thus expanding the users’ abilities.
The R Development Core Team is responsible for maintaining the source code of R.
Why turn to R?
R software provides the following advantages.
1. R is free (meaning open-source software).
2. Any type of data analysis can be executed in R.
3. R includes advanced statistical procedures not yet present in other packages.
4. The most comprehensive and powerful feature for visualizing complex data is available in
R.
5. Importing data from a wide variety of sources can be easily done with R.
6. R is able to access data directly from web pages, social media sites, and a wide range of
online data services.
7. R software generates an easy and straightforward platform for programming new statistical
techniques.
8. It is simple to integrate applications written in other programming languages (such as
C++, Java, Python, PHP, Pentaho, SAS, and SPSS) into R.
9. R can operate on any operating system such as Windows, Unix, and Mac OSX. It can
also be installed on an iPhone. It is also possible to use R on an Android phone (see
https://selbydavid.com/2017/12/29/r-android/).
10. R offers a variety of graphic user interfaces (GUIs) if you are not interested in learning a
new language.
Packages in R
Packages are collections of R functions, data, and compiled code in a well-defined format.
Installation of R
1. Under the “download” link, there is another link that provides instructions on how to
install it. These might be useful if you encounter problems during the installation process.
2. To install R, double-click on the executable file and follow the instructions on the screen.
The default settings are usually sufficient. Figure 2.1 shows the first screen that comes up when
installing R on a Windows system.
STEPS TO INSTALL R
Figure 2.1: Click the appropriate operating system based on your computer.
Figure 2.2: Click on “R Sources” to get information on the latest version of R software.
2.3 BASICS OF R
2.3.1 R IS VECTORIZED
Some of the functions in R are vectorized. This means that the functions operate on the
elements of a vector by acting on each element in turn, without requiring an explicit
loop. Vectorization makes code in R easy to read, efficient, precise,
and concise. The following examples demonstrate vectorization in R.
1. Multiplication of vector by a constant.
R language allows multiplying (or dividing) a vector by a constant value. A simple example
is:
x<-c(2, 4, 6, 8, 10)
x*3
[1] 6 12 18 24 30
x/3
[1] 0.6666667 1.3333333 2.0000000 2.6666667 3.3333333
3. Multiplication of vectors.
x<-c(2, 4, 6, 8, 10)
y<-c(1, 2, 3, 4, 5)
x*y
[1] 2 8 18 32 50
4. Logical operations.
a<-(x>=4)
a
[1] FALSE TRUE TRUE TRUE TRUE
5. Matrix operations.
a <- matrix(1:4, 2, 2)
b <- matrix(0:3, 2, 2)
ab<-a*b
ab
[,1] [,2]
[1,] 0 6
[2,] 2 12
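Note that `*` applied to two matrices multiplies them element by element. The sketch below, using the same `a` and `b` as above, contrasts this with the matrix product operator `%*%`; the names `elementwise` and `product` are introduced here only for illustration.

```r
a <- matrix(1:4, 2, 2)
b <- matrix(0:3, 2, 2)

# Element-by-element multiplication (vectorized)
elementwise <- a * b

# Matrix (row-by-column) multiplication
product <- a %*% b

print(elementwise)
print(product)
```

The two results differ, so it is worth checking which operation a formula actually calls for.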
2.3.2.1 Scalar
This is an atomic quantity that can hold only one value at a time. Scalars are the most basic data
types that can be used to construct more complex ones. Scalars in R can be numerical, logical,
and character (string). The following are examples of the different types of scalars in R.
1. Numerical
p<-10
q<-12
class(p)
[1] "numeric"
class(q)
[1] "numeric"
class(p+q)
[1] "numeric"
2. Logical
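As a minimal sketch (the variable names `u` and `v` are illustrative and not from the text), a logical scalar takes the value TRUE or FALSE and is often produced by a comparison:

```r
u <- TRUE
v <- (10 > 12)   # a comparison returns a logical value

print(class(u))
print(v)
print(u & v)     # logical AND of the two scalars
```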
3. Character (string)
m<-"10"
n<-"12"
m
[1] "10"
n
[1] "12"
m+n # this is not the same as p and q in earlier example
Error in m + n : non-numeric argument to binary operator
class(m)
[1] "character"
class(n)
[1] "character"
class(as.numeric(m)) # to coerce this character to number
[1] "numeric"
class(as.character(p)) # to coerce this number to character
[1] "character"
2.3.2.2 Vector
A vector is a sequence of data elements of the same basic type.
d<-c(1, 2, 3, 4, 5, 6) # Numeric vector
class(d)
[1] "numeric"
e<-c("one", "two", "three", "four", "five", "six") # Character vector
class(e)
[1] "character"
2.3.2.3 Matrix
A matrix is a collection of data elements arranged in a two-dimensional rectangular form. In
the same manner as a vector, the components in a matrix must be of the same basic type. An
example of a matrix with 2 rows and 4 columns is created below.
# fill the matrix with elements arranged by column in 2 rows
# and 4 columns
mat<-matrix(1:8, nrow=2, ncol=4, byrow= FALSE)
mat
[,1] [,2] [,3] [,4]
[1,] 1 3 5 7
[2,] 2 4 6 8
Alternatively, it is possible to have the elements of the matrix arranged by rows.
# fill the matrix with elements arranged by row in 2 rows
# and 4 columns
mat<-matrix(1:8, nrow=2, ncol=4, byrow= TRUE)
mat
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
Square brackets [ ] can be used to reference elements in a matrix; this is similar to referencing
elements in vectors.
mat[2,3] # refers to the element in the second row
# and third column
[1] 7
mat[,4] # refers to all elements in the fourth column
[1] 4 8
mat[1,] # refers to all elements in the first row
[1] 1 2 3 4
Table 2.1 contains the basic matrix operations and their respective meanings.
Function        Meaning
t(x)            Transpose of x
diag(x)         Diagonal elements of x
%*%             Matrix multiplication
solve(a, b)     Solves a %*% x = b for x
rowsum(x)       Sum of rows for a matrix-like object; rowSums(x) is a faster version
rowMeans(x)     Fast version of row means
colMeans(x)     Fast version of column means
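A few of these functions are illustrated below on a small matrix. This is a sketch only: the matrix `m` and the right-hand side `rhs` are made-up values chosen so that the results are easy to check by hand.

```r
m <- matrix(c(2, 1, 0, 3), nrow = 2)  # rows: (2, 0) and (1, 3)
rhs <- c(4, 11)

m_t <- t(m)                 # transpose
m_diag <- diag(m)           # diagonal elements
m_sq <- m %*% m             # matrix product of m with itself
x_sol <- solve(m, rhs)      # solves m %*% x = rhs for x
row_totals <- rowSums(m)    # sum of each row
col_means <- colMeans(m)    # mean of each column

print(x_sol)
print(row_totals)
```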
d<-c(1, 2, 3, 4 , 5, 6)
e<-c("one", "two", "three", "four", "five", "six")
f<-c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)
data<-data.frame(d, e, f)
names(data)<-c("ID", "words", "state")
data
ID words state
1 1 one TRUE
2 2 two FALSE
3 3 three TRUE
4 4 four TRUE
5 5 five FALSE
6 6 six TRUE
In addition, components can be extracted from data frames in much the same way as from
matrices, but assigning names to each column makes extraction more flexible.
data$ID
[1] 1 2 3 4 5 6
data[1:2,]
ID words state
1 1 one TRUE
2 2 two FALSE
data[,3]
[1] TRUE FALSE TRUE TRUE FALSE TRUE
2.3.2.5 List
A list is a generic vector containing other objects. There is no restriction on data types or length
of the components. It is easier to work with lists that have named components. List is a special
type of vector. There are two characteristics of this type of vector—it can contain elements of
different classes and each element of a list can have a name.
Consider a list which contains a vector, matrix and data frame.
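A minimal sketch of how such a list could be built is given below. The component names `matrix`, `frame`, and `count` match the printed output that follows; the name `vector`, the object name `mylist`, and the choice of `d` as the vector component are assumptions made here for illustration.

```r
d <- c(1, 2, 3, 4, 5, 6)
e <- c("one", "two", "three", "four", "five", "six")
f <- c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)

mat <- matrix(1:8, nrow = 2, ncol = 4)
frame <- data.frame(ID = d, words = e, state = f)

# Components of different classes, each with its own name
mylist <- list(vector = d, matrix = mat, frame = frame, count = 10)

print(names(mylist))
print(mylist$count)
```

Printing a list displays each component under its name, prefixed with `$`.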
$matrix
[,1] [,2] [,3] [,4]
[1,] 1 3 5 7
[2,] 2 4 6 8
$frame
ID words state
1 1 one TRUE
2 2 two FALSE
3 3 three TRUE
4 4 four TRUE
5 5 five FALSE
6 6 six TRUE
$count
[1] 10
2.3.2.6 Factor
Factors are used to represent categorical data and can be ordered or unordered. We can regard a
factor as an integer vector in which each integer has a label. Using factors with labels is more
appropriate than using integers because factors are self-describing (e.g., a variable with the
values “Male” and “Female” is better than one coded as 1 and 2).
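The following sketch illustrates unordered and ordered factors. The data values and the names `gender` and `size` are invented for illustration.

```r
# Unordered factor: levels are sorted alphabetically by default
gender <- factor(c("Male", "Female", "Female", "Male"))
gender_levels <- levels(gender)
gender_codes <- as.integer(gender)   # underlying integer codes
print(table(gender))

# Ordered factor: the order of the levels is stated explicitly
size <- factor(c("low", "high", "medium"),
               levels = c("low", "medium", "high"), ordered = TRUE)
low_before_high <- size[1] < size[2]  # comparisons respect the level order
print(low_before_high)
```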
2.3.2.7 Coercion
This occurs when objects of different classes are mixed in a vector; R converts them so that
every element in the vector has the same class.
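A short sketch of implicit coercion follows; the object names are illustrative only.

```r
mixed1 <- c(1.5, "a")    # numeric and character -> character
mixed2 <- c(TRUE, 2)     # logical and numeric   -> numeric
mixed3 <- c("a", TRUE)   # character and logical -> character

print(class(mixed1))
print(class(mixed2))
print(mixed3)
```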
x <- Sys.time()
x
[1] "2018-10-04 11:35:15 WAT"
p<-as.POSIXlt (x)
names (unclass(p)) # unclass (p) is a list object
[1] "sec" "min" "hour" "mday" "mon" "year" "wday" "yday"
[9] "isdst" "zone" "gmtoff"
p$sec
[1] 15.44309
p$wday
[1] 4
p$mday
[1] 4
y<-10:14+6
y
[1] 16 17 18 19 20
seq(from, to): generates a sequence; by= specifies the increment; length= specifies the desired
length.
seq() function generates a sequence of numbers.
seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)), length.out = NULL, along.with =
NULL, ...)
from, to: begin and end number of the sequence
by: step, increment (Default is 1)
length.out: length of the sequence
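The arguments described above can be tried out as follows. This is a small sketch; the object names are chosen here for illustration.

```r
by_three <- seq(from = 1, to = 10, by = 3)         # fixed increment
five_points <- seq(from = 0, to = 1, length.out = 5)  # fixed length
default_step <- seq(2, 6)                          # increment defaults to 1

print(by_three)
print(five_points)
print(default_step)
```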
rep(x, times): replicate x times; use each= to repeat each element of x each times.
rep(c(1, 2, 3, 4, 5), 2) # repeat 1 2 3 4 5 twice
[1] 1 2 3 4 5 1 2 3 4 5
rep(c(1, 2, 3, 4, 5), each = 2) # repeat each of 1 2 3 4 5 twice
[1] 1 1 2 2 3 3 4 4 5 5
array(x, dim =): array with data x; specify dimensions like dim=c(3, 4, 2); elements of x
are recycled if x is not long enough. We can give names to the rows, columns, and matrices in the
array by using the dimnames parameter.
# Create two vectors of different lengths.
vector1 <- c(1,2,3)
vector2 <- c(4,5,6,7,8,9)
column.names <- c("Col1","Col2","Col3")
row.names <- c("Row1","Row2","Row3")
matrix.names <- c("Matrix1","Matrix2")
arr <- array(c(vector1,vector2),dim = c(3,3,2),
dimnames = list(row.names,column.names,
matrix.names))
arr
, , Matrix1

     Col1 Col2 Col3
Row1    1    4    7
Row2    2    5    8
Row3    3    6    9

, , Matrix2

     Col1 Col2 Col3
Row1    1    4    7
Row2    2    5    8
Row3    3    6    9
gl(n, k, length=n*k, labels = 1:n): generates factors by specifying the pattern of their
levels; n is the number of levels, and k is the number of replications.
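As a brief sketch, the call below generates a factor with two levels, each repeated three times; the labels "Control" and "Treat" are made up for illustration.

```r
g <- gl(2, 3, labels = c("Control", "Treat"))
print(g)
print(nlevels(g))
```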
2.4.1 SUBSETTING
Subsetting operators are used to extract subsets of R objects. The [ operator always returns an
object of the same class as the original object and, with one exception, can be used to select more
than one element. [[ is used to extract a single element of a list or a data frame. $ is used to extract
elements of a list or data frame by name. The semantics of $ are similar to those of [[.
In addition, an element of a vector v is assigned an index by its position in the sequence,
starting with 1. The basic operator for subsetting is [ ]. v[1] is the first element; v[length(v)]
is the last. The subsetting operator takes input in many forms.
# Example 1
sub<-c("a", "b", "c", "d", "e", "f")
sub[2:5]
[1] "b" "c" "d" "e"
sub>"c"
[1] FALSE FALSE FALSE TRUE TRUE TRUE
sub[sub>"c"]
[1] "d" "e" "f"
# Example 2
sub1<-matrix(1:9,3,3)
sub1
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
sub1[1,3]
[1] 7
sub1[2,] # entire elements in the second row
[1] 2 5 8
sub1[,1] # entire elements in the first column
[1] 1 2 3
# Example 3
sub<-c("a", "b", "c", "d", "e", "f")
v <- c(1, 3, 6)
sub[v]
[1] "a" "c" "f"
sub[1:3]
[1] "a" "b" "c"
# Example 4
sub1<-matrix(1:9,3,3)
sub1[1,3]
[1] 7
sub1[1,2, drop = FALSE] # return as a matrix of 1 by 1
[,1]
[1,] 4
sub1[1,2, drop = TRUE] # return as a single element
[1] 4
sub1[1,, drop = TRUE] # return as a vector
[1] 1 4 7
# Example 5
y<-list(w = 1:3, x = 0.5, z ="gender")
y[1]
$w
[1] 1 2 3
y$w
[1] 1 2 3
y$x
[1] 0.5
y$z
[1] "gender"
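The difference between `[`, `[[`, and `$` on a list can be sketched as follows, reusing the list `y` from Example 5; the variable `name` is introduced here only for illustration.

```r
y <- list(w = 1:3, x = 0.5, z = "gender")

single_bracket <- y[1]       # a list containing the component w
double_bracket <- y[["w"]]   # the component itself (an integer vector)

# [[ accepts a computed name; $ only matches literal names
name <- "x"
by_computed_name <- y[[name]]
by_dollar <- y$name          # there is no component called "name"

print(class(single_bracket))
print(class(double_bracket))
print(by_computed_name)
print(by_dollar)
```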
2.4.2.1 Conditional
if (condition) {
# do something
} else {
# do something else
}
# Example 6
x <- 1:20
if (sample(x, 1) <= 10) {
print("x is less than 10")
} else {
print("x is greater than 10")
}
[1] "x is greater than 10"
2.4.2.2 For-Loop
Loops are used in programming to repeat a specific block of code. A loop works on an iterable
variable and assigns successive values until the end of a sequence.
# Example 7
for (i in 1:10) {
print(i*2)
}
[1] 2
[1] 4
[1] 6
[1] 8
[1] 10
[1] 12
[1] 14
[1] 16
[1] 18
[1] 20
# Example 8
Alternatively,
2.4.2.3 Repeat-Loop
A repeat-loop is used to iterate over a block of code multiple times. There is no
condition check in a repeat-loop to exit the loop. Instead, an explicit condition must be placed
within the body of the loop, together with a break statement to terminate the loop.
# Example 9
r <- 1
repeat {
print(r)
r = r+1
if (r == 10){
break
}
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
2.4.2.4 While-Loop
While loops repeat a block of code as long as a specified condition remains true.
# Example 10
w <- 1
while (w < 10) {
print(w)
w = w+1
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
# probability of at least 3 events with lambda = 4
1-ppois(2,4)
[1] 0.7618967
Uniform distribution (follows the same pattern as the normal distribution above):
dunif(x, min=0, max=1)
punif(q, min=0, max=1)
qunif(p, min=0, max=1)
runif(n, min=0, max=1)
# 5 uniform random variates
y <- runif(5)
y
[1] 0.14192169 0.13701585 0.06418781 0.58657717 0.20230663
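The d/p/q/r naming pattern can be checked with a short sketch. The object names below are illustrative, and `set.seed(1)` is an arbitrary choice used only to make the random draw repeatable.

```r
# Three equivalent ways to get P(X >= 3) for a Poisson variable with lambda = 4
p_direct <- 1 - ppois(2, lambda = 4)
p_upper <- ppois(2, lambda = 4, lower.tail = FALSE)
p_summed <- 1 - sum(dpois(0:2, lambda = 4))

# For the uniform distribution on [0, 1]:
dens <- dunif(0.3)   # density at 0.3
cum <- punif(0.3)    # P(X <= 0.3)
quant <- qunif(cum)  # the quantile function inverts punif

set.seed(1)          # arbitrary seed for repeatability
draws <- runif(5)    # five uniform random variates
print(p_direct)
print(draws)
```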
2.4.3.4 Other Statistical Functions
See Table 2.7.
Table 2.7: Other statistical functions
2.4.5.1 Packages
For almost anything you may think of doing with R, there is a good chance that a package has
already been written to do it. The list of packages can be found in the official repository CRAN:
http://cran.fhcrc.org/web/packages/. Installing a package is very easy in R;
you may use the command: install.packages("packagename").
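A common pattern is to install a package only if it is not already available and then load it. The sketch below assumes the package name is stored in a variable `pkg`; the base package "stats" is used purely so that the example runs without a network connection, and you would replace it with the package you need.

```r
pkg <- "stats"  # placeholder; replace with the name of the package you need

# Install only when the package is not already available
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Load the package by the name stored in pkg
library(pkg, character.only = TRUE)
print(paste0("package:", pkg) %in% search())
```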
setwd("C:\\MyWorkingDirectory") # backslashes must be escaped
setwd("C:/MyWorkingDirectory") # can use forward slash
setwd(choose.dir()) # open a file browser (Windows only)
getwd() # returns a string with the current working directory
To see a list of the files in the current directory, you can use the following R command:
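One base R function for this purpose is list.files(). The sketch below runs in a temporary directory with two made-up files so that it does not depend on the contents of your own working directory; calling list.files() with no arguments lists the current working directory instead.

```r
# Create a scratch directory with two example files (illustrative only)
tmp <- tempfile("demo_dir")
dir.create(tmp)
file.create(file.path(tmp, c("helloworld.R", "exchrt.csv")))

all_files <- list.files(tmp)                     # every file in the directory
r_scripts <- list.files(tmp, pattern = "\\.R$")  # only files ending in .R

print(all_files)
print(r_scripts)
```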
Run a script
source("helloworld.R")
2.4.5.3 Reading and Writing Local Flat Files
Local flat files can be read by using the following: read.table, read.csv, readLines. The acronym
CSV stands for “comma separated values.”
For the purpose of illustration, we will be working with the Nigerian exchange rate dataset
from the Central Bank of Nigeria website. To obtain this dataset, please visit the following link:
https://www.cbn.gov.ng/Functions/export.asp?tablename=exchange.
Right click and select “save as” to save it to your local directory. The code samples in the
remainder of this section assume that you have set your working directory to the location of the
exchrt.csv file.
Reading and writing local flat files:
To write local flat files, you can use functions such as write.table, write.csv, and
writeLines. It is advisable to take note of the parameters before reading
or writing a file.
write.table(exch.rt, "new_exchrt.csv")
Conversely, read.table is one of the most commonly used functions for reading data. Some
of the arguments of read.table() are described below:
function (file, header = FALSE, sep = "", quote = "\"", dec = ".",
numerals = c("allow.loss", "warn.loss", "no.loss"), row.names,
col.names, as.is = !stringsAsFactors, na.strings = "NA",
colClasses = NA, nrows = -1, skip = 0, check.names = TRUE,
fill = !blank.lines.skip, strip.white = FALSE,
blank.lines.skip = TRUE, comment.char = "#", allowEscapes = FALSE,
flush = FALSE, stringsAsFactors = default.stringsAsFactors(),
fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)
1. file is the name of a file, or a connection.
2. header is a logical value indicating whether the first line of the file contains the variable names.
3. sep is a string indicating how the columns are separated.
4. colClasses is a character vector indicating the class of each column in the dataset.
5. nrows is the maximum number of rows to read from the dataset.
6. comment.char is a character string indicating the comment character.
7. skip is the number of lines to skip from the beginning.
8. stringsAsFactors determines whether character variables should be coded as factors.
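Several of these arguments can be seen together in a small round trip: write a data frame to a CSV file and read it back. This is a sketch only; the data frame `rates`, its values, and the temporary file are made up for illustration and are not the Central Bank of Nigeria dataset.

```r
# Toy data frame (illustrative values only)
rates <- data.frame(Currency = c("US DOLLAR", "EURO"),
                    Buying.Rate = c(305.4, 351.18),
                    stringsAsFactors = FALSE)

path <- tempfile(fileext = ".csv")
write.csv(rates, path, row.names = FALSE)

# Read the file back, stating the separator, header, and column classes
back <- read.table(path, header = TRUE, sep = ",",
                   colClasses = c("character", "numeric"),
                   nrows = 2, stringsAsFactors = FALSE)
str(back)
```

Stating colClasses and nrows explicitly is optional, but it documents what the file is expected to contain.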
For a more specific sample of the data, suppose that we want to view the last three columns of
the data frame and the first two rows in those columns. The commands are:
# View the last three columns of the data frame,
# and the first two rows in those columns
head(exch.rt[,5:7],2)
Buying.Rate Central.Rate Selling.Rate
1 305.400 305.9000 306.4000
2 396.623 397.2723 392.9217
The str() function can be used to summarize the data frame. Let us look at the summary of the
exchange rate data.
str(exch.rt)
'data.frame': 38728 obs. of 7 variables:
$ Rate.Date : Factor w/ 4126 levels "1/10/2002","1/10/2003",..:
623 623 623 623 623 623 623 623 623 623 ...
$ Currency : Factor w/ 27 levels "CFA","CFA ","DANISH KRONA",..:
21 12 5 19 25 1 23 27 13 18 ...
$ Rate.Year : int 2018 2018 2018 2018 2018 2018 2018
2018 2018 2018 ...
$ Rate.Month : Factor w/ 15 levels "8","April","August",..:
14 14 14 14 14 14 14 14 14 14 ...
$ Buying.Rate : num 305.4 396.62 351.18 307.65 2.68 ...
$ Central.Rate: num 305.9 397.27 351.75 308.15 2.68 ...
$ Selling.Rate: num 306.4 392.92 352.33 308.65 2.69 ...
[Figure: bar chart of Age by Name (Ayo, John, Olu, Ope, Ade).]
[Figure: pie chart with shares Ope 22%, Ade 24%, Olu 16%, Ayo 20%, John 18%.]
plot widths proportional to the square root of the sample sizes. Add horizontal=TRUE to
reverse the axis orientation.
# simple box plot
age <- c(22, 20, 14, 16, 18)
gender <- c("Female", "Male", "Male", "Male", "Female")
mydata <- data.frame(age, gender)
# groups are ordered alphabetically (Female, Male), so names= and col= follow that order
boxplot(age~gender, data=mydata, names=c("Female","Male"),
xlab="Gender", ylab="Age", col=c("red","blue"))
2.6 EXERCISES
2.1. What are the advantages of R software?
2.2. With the aid of examples, explain the term vectorization in R.
2.3. Enumerate different types of data in R.
2.4. What do you understand by control structure in R and list common control structures
in R?
2.5. What are the methods to use in visualizing data?
[Figure: box plot of Age by Gender (Female, Male).]
CHAPTER 3
Descriptive Data
This chapter presents different statistical measures that can be employed to provide descriptive
analysis of business-related data. In this chapter we will define the statistical measures of central
tendency and dispersion. Explanations of these measures of central tendency and dispersion
using example data are provided in this chapter.
The Mean
Mean or arithmetic mean: this is the average of a set of observations. It is the aggregate of all
observations divided by the number of observations in the data set. Suppose the observations
are denoted by $x_1, x_2, x_3, \ldots, x_n$. The sample mean is represented by $\bar{x}$, which is expressed as
follows.
Mean of a sample,
\[
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}, \tag{3.1}
\]
where $\sum$ is the summation symbol. The summation covers all of the data points.
As an example, if you have 9 houses on your block and the number of people living in the
houses is: 2, 1, 6, 4, 1, 1, 5, 1, and 3 and you want to know the average (mean) number of people
living in the house, you simply add the numbers together and get 24 and divide that by 9 to get
2.67 people per house.
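The same calculation can be sketched in R; the vector name `people` is chosen here for illustration.

```r
# Number of people living in each of the 9 houses
people <- c(2, 1, 6, 4, 1, 1, 5, 1, 3)

total <- sum(people)                  # aggregate of all observations
by_hand <- total / length(people)     # divide by the number of observations
built_in <- mean(people)              # the built-in function gives the same value

print(total)
print(round(built_in, 2))
```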
If the observation set covers a whole population, the symbol $\mu$ (the Greek letter mu) is
used to represent the mean of the entire population. In addition, N is used instead of n to denote
the number of elements. The mean of the population is specified as follows.
Mean of a population:
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}. \tag{3.2}$$
The mean is the most commonly used measure of central tendency. In addition, the mean
relies on the information contained in all the observations in the dataset.
Characteristics of Mean
• It is a single point viewed as the point where all observations are concentrated.
• All observations in the dataset would be equal to the mean if each observation has the
same size or figure.
The Median
This is a special point because it is located in the center of the data, implying that half the data
lies below it and half above it. The median is defined as a measure of the location or centrality
of the observations. This is just an observation lying in the middle of the set.
In order to find the median by hand, given a small data set like this one, you need to
reorder the dataset in numerical order:
1, 1, 1, 1, 2, 3, 4, 5, 6.
If you divide the number of data points by two and round up to the next integer, you will find
the position of the median. So for our example:
9/2 = 4.5.
Rounding up gives 5, so the 5th data point is the median, which is 2.
This works nicely if you have an odd number of data points, but if you have an even
number, then there would be two middle numbers. To resolve this dilemma, you take those two
data points and average them. So let’s add a house with 4 people. That means that the middle
two numbers would be 2 and 3. So averaging them gives you 2.5 as the median.
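The odd and even cases can be sketched in R as follows. The vector names are illustrative
assumptions; the data are the house counts used above, with one extra house of 4 people
for the even case.

```r
# Houses from the running example, then the same block with one more house of 4 people
houses_odd  <- c(2, 1, 6, 4, 1, 1, 5, 1, 3)
houses_even <- c(houses_odd, 4)

# Odd count: the middle value of the sorted data
sorted_odd <- sort(houses_odd)
median_odd <- sorted_odd[ceiling(length(sorted_odd) / 2)]

# Even count: the average of the two middle values
sorted_even <- sort(houses_even)
mid <- length(sorted_even) / 2
median_even <- mean(sorted_even[c(mid, mid + 1)])

median_odd    # same result as median(houses_odd)
median_even   # same result as median(houses_even)
```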
Characteristics of Median
Example 3.1
Calculate the mean, median, and mode of the following observations: 2, 6, 8, 7, 5, 3, 2, and 2.
Solution
Mean of the observations, $\bar{x} = \frac{2+6+8+7+5+3+2+2}{8} = \frac{35}{8} = 4.375 \approx 4.4$.
To solve for the mean of the observations in R, use the following commands.
x <- c(2, 6, 8, 7, 5, 3, 2, 2)
mean(x)
[1] 4.375
Median of the observation requires the re-arrangement of the number from smallest to largest:
2, 2, 2, 3, 5, 6, 7, and 8. There are eight observations, so the median is the value in the middle,
that is, in the fourth and fifth position. Those values are 3 and 5; so you add the two numbers
together and divide the outcome by 2 in order to get the median.
The median is $\frac{3+5}{2} = 4$ because 3 and 5 are at the center of the dataset. In R, the command is:
median (x)
[1] 4
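The mode of the observation set is 2 because it appears three times while the other values occur
once. Base R has no built-in function for the statistical mode (mode(x) returns the storage
type, "numeric"), so the most frequent value is read from a frequency table:
as.numeric(names(which.max(table(x))))
[1] 2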
higher half and lower half and then subtract the lower half median from the higher half
median to get the interquartile range.
Using data from our first example, 1, 1, 1, 1, 2, 3, 4, 5, 6, we first need to split the data
set in two. So we look to the middle and see that the number of data points is odd, so we
throw out the median (2). Then we find the median of 3, 4, 5, and 6, which is 4.5, and subtract
the median of 1, 1, 1, 1, which is 1, to give us 3.5 as the interquartile range. So we see that the
spread between the lower half of this data and the higher half is quite significant.
(c) Variance: this is the average squared deviation of the data points from their mean.
(d) Standard deviation: this can be defined as the square root of the variance of the entire
dataset. In financial analysis, the standard deviation is explored to capture the volatility as
well as the risk associated with financial variables.
(e) The sample variance is calculated by taking the difference from the mean for each data
point, squaring each difference, adding the squares together, and dividing the total by the
number of items in the dataset minus one:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}.$$
(f ) The standard deviation of a sample is the square root of the sample variance while the
standard deviation of a population is the square root of the variance of the population. The
formulae for these two forms of standard deviation are expressed below:
$$\text{Sample standard deviation: } s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
$$\text{Population standard deviation: } \sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}.$$
Statisticians prefer working with the variance because its mathematical features make
computations easy, whereas applied statisticians like to work with the standard deviation be-
cause of its easy interpretation.
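The split-halves procedure for the interquartile range described above can be sketched in R.
This is an illustration under the assumption that the middle value is discarded when the number
of observations is odd, as in the text; the variable names are illustrative.

```r
# Data from the first example, sorted
x <- sort(c(1, 1, 1, 1, 2, 3, 4, 5, 6))
n <- length(x)

# Size of each half; when n is odd the middle value is left out of both halves
half <- floor(n / 2)

lower_half <- x[1:half]
upper_half <- x[(n - half + 1):n]

# Interquartile range as the difference between the two half medians
iqr_halves <- median(upper_half) - median(lower_half)
iqr_halves
```

R's built-in IQR() function uses a different quantile convention by default, so it need not
return the same value as this hand method on small datasets.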
Let’s calculate the variance and standard deviation of the data in Table 3.1. In order to
compute this, a table is used for simplicity (before exploring R software to do everything within
a minute).
x        $x_i - \bar{x}$    $(x_i - \bar{x})^2$
2        -2.375             5.640625
2        -2.375             5.640625
2        -2.375             5.640625
3        -1.375             1.890625
5         0.625             0.390625
6         1.625             2.640625
7         2.625             6.890625
8         3.625            13.140625
Total     0                41.875
With reference to the above equation and the mean $\bar{x} = 4.375$, the variance of the sample
equals the sum of the third column in Table 3.1, 41.875, divided by $n - 1$:
$s^2 = \frac{41.875}{7} = 5.982$. The standard deviation is the square root of the variance:
$s = \sqrt{5.982} = 2.446$ or, to two decimal places, $s = 2.45$.
There is a shortcut formula for the sample variance that is equivalent to the formula above
for variance:
$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n - 1}.$$
Solving the standard deviation and variance in Table 3.1 with the aid of R, the commands are:
x <- c(2, 2, 2, 3, 5, 6, 7, 8)
var(x)
[1] 5.982143
sd(x)
[1] 2.445842
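As a cross-check, the definition of the sample variance and the shortcut formula can both be
evaluated by hand in R and compared with var() and sd(). This is a minimal sketch; the names
var_definition, var_shortcut, and sd_definition are illustrative.

```r
x <- c(2, 2, 2, 3, 5, 6, 7, 8)
n <- length(x)

# Definition: sum of squared deviations from the mean, divided by n - 1
var_definition <- sum((x - mean(x))^2) / (n - 1)

# Shortcut: (sum of squares - (sum)^2 / n) / (n - 1)
var_shortcut <- (sum(x^2) - sum(x)^2 / n) / (n - 1)

# Standard deviation as the square root of the variance
sd_definition <- sqrt(var_definition)

c(var_definition, var_shortcut, var(x))
c(sd_definition, sd(x))
```

All three variance values agree, as do the two standard deviations, which illustrates that the
shortcut is only a rearrangement of the definition.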
(b) Asymmetric distribution: this is the kind of distributional pattern in which one side of the
distribution is not a mirror image of the other. In addition, in its data distribution, the
mean, median, and mode will not all be equal.
Additional attributes of a frequency distribution of a dataset are skewness and kurtosis.
Skewness: measures the degree of asymmetry of a distribution. A right skewness occurs when
the distribution stretches to the right more than it does to the left, while a left-skewed distribu-
tion is one that stretches asymmetrically to the left. Graphs depicting a symmetric distribution,
a right-skewed distribution, a left-skewed distribution, and a symmetrical distribution with two
modes, are presented in Fig. 3.2.
For a right-skewed distribution, the mean is to the right of the median, thus lying to the
right of the mode. The opposite is observed for left-skewed distribution (see Fig. 3.2). The calcu-
lation of skewness is reported by a number that may be positive, negative, or zero. Zero skewness
indicates a symmetric distribution. A positive skewness means a right-skewed distribution while
a negative skewness denotes a left-skewed distribution. Two distributions can differ in skewness,
and therefore in shape, even if they have the same mean and variance.
Kurtosis: this measures how peaked a distribution is. The larger the kurtosis, the more peaked
the distribution. It is reported either as an absolute or a relative value. Absolute kurtosis is
a positive number. For a normal distribution, the absolute kurtosis is 3.
The value of 3 is used as the reference point to compute relative kurtosis. Therefore, relative
kurtosis is the difference between the absolute kurtosis and 3:
$$\text{relative kurtosis} = \text{absolute kurtosis} - 3.$$
The relative kurtosis can be negative (platykurtic), indicating a flatter distribution than the
normal distribution, or positive (leptokurtic), indicating a more peaked distribution than the
normal distribution.
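Base R has no dedicated skewness or kurtosis function, so the sketch below computes both from
central moments. It assumes the simple moment-based definitions (third and fourth central
moments scaled by powers of the second); the function names skewness_moment and
kurtosis_absolute and the two small datasets are hypothetical, and other software may apply
small-sample corrections that give slightly different numbers.

```r
# Moment-based skewness: third central moment over the second central moment to the power 3/2
skewness_moment <- function(x) {
  m2 <- mean((x - mean(x))^2)
  m3 <- mean((x - mean(x))^3)
  m3 / m2^(3 / 2)
}

# Absolute kurtosis: fourth central moment over the squared second central moment
kurtosis_absolute <- function(x) {
  m2 <- mean((x - mean(x))^2)
  m4 <- mean((x - mean(x))^4)
  m4 / m2^2
}

right_skewed <- c(1, 1, 2, 2, 3, 10)   # long tail to the right
symmetric    <- c(1, 2, 3, 4, 5)       # mirror image around 3

skewness_moment(right_skewed)          # positive: right-skewed
skewness_moment(symmetric)             # zero: symmetric
kurtosis_absolute(symmetric) - 3       # relative kurtosis; negative here (platykurtic)
```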
Figure 3.2: (a) Skewness (left and right), (b) symmetric distribution, and (c) asymmetric
distribution. The panels mark the positions of the mode, median, and mean.
3.4 EXERCISES
3.1. Create X that includes all even numbers between 0 and 100.
3.2. Calculate the mean, mode, and median of X .
3.3. Calculate the dispersion measures (standard deviation and variance) for X .
3.4. Explain the characteristics of central tendency measures (mean, mode, and median).
3.5. Explain the two tools discussed here—kurtosis and skewness—for shapes of distribu-
tion.
CHAPTER 4
4.1.1 EXPERIMENT
This is a measurement process that produces quantifiable results. Some typical examples of an
experiment are tossing of a die, tossing of a coin, playing of cards, measuring weight of students,
and recording growth of plants.
4.1.2 OUTCOME
This is a single result from a measurement. Examples of outcomes are getting a sum of 9 when
tossing two dice, a head turning up in a toss of a coin, selecting a spade from a deck of cards,
and getting a weight above a threshold (say 50 kg).
4.7 PROBABILITY
The probability of event A could be defined as the number of ways event A can occur divided by
the total number of possible outcomes. It is mathematically defined as:
$$P(A) = \frac{n(A)}{n(S)}. \tag{4.1}$$
Example 4.1
A fair die is rolled once, what is the probability of: (i) Rolling an even number? (ii) Rolling an
odd number?
Solution:
$$\text{Probability (even number)} = \frac{\text{number of even numbers}}{\text{number of possible outcomes}}.$$
Let A be the set of even numbers and B the set of odd numbers in a toss of a die. Then
$A = \{2, 4, 6\}$, $B = \{1, 3, 5\}$, and the sample space is $S = \{1, 2, 3, 4, 5, 6\}$.
(i) $P(A) = \frac{3}{6} = \frac{1}{2}$.
(ii) $P(B) = \frac{3}{6} = \frac{1}{2}$.
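The counting definition in Eq. (4.1) can be applied directly in R. This is a small sketch; the
object names mirror the sets in the solution, and p_even and p_odd are illustrative.

```r
S <- 1:6                 # sample space of one roll of a die
A <- S[S %% 2 == 0]      # even outcomes: 2, 4, 6
B <- S[S %% 2 == 1]      # odd outcomes: 1, 3, 5

# P(event) = number of outcomes in the event / number of outcomes in S
p_even <- length(A) / length(S)
p_odd  <- length(B) / length(S)

c(p_even, p_odd)
```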
Example 4.2
What is the probability that an applicant resume will be treated within a week of submitting
application if 5,000 graduates applied for a job and the recruitment firm can only treat 1,000 re-
sumes in a week?
Solution:
Probability (treating a resume) $= \frac{1{,}000}{5{,}000} = 0.2$.
Example 4.3
Consider an urn containing one blue ball, one green ball, and one yellow ball. To simulate draw-
ing two balls from the urn with replacement and without replacement, we have:
# Urn contains one blue, one green and one yellow balls
Urn = c('blue', 'green', 'yellow')
# Sampling with replacement
sample(x = Urn, size =2, replace = TRUE)
[1] "yellow" "yellow"
In both cases there is no unique answer: any outcome that satisfies the sampling condition can
be drawn. When sampling with replacement, the result can be any element of the sample space
of:
["yellow" "yellow"], ["yellow" "blue"], ["blue" "yellow"], ["blue" "blue"],
[ "blue" "green"], [ "green" "blue"], [ "green" "yellow"], [ "yellow" "green"],
[ "green" "green"]
However, in the case of sampling without replacement, we can have any of these outcomes:
["yellow" "blue"], ["blue" "yellow"], [ "blue" "green"], [ "green" "blue"],
[ "green" "yellow"], [ "yellow" "green"],
So by putting the yellow ball back in after it was selected the first time, it expands the solution
possibilities significantly. In our case it meant picking the yellow ball twice.
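The two sample spaces listed above can be enumerated in R, together with one draw without
replacement. This is a sketch only: the seed value is arbitrary, and the names draw,
with_replacement, and without_replacement are illustrative.

```r
Urn <- c("blue", "green", "yellow")

# One random draw of two balls without replacement: the two colors always differ
set.seed(1)  # arbitrary seed, used only to make the sketch reproducible
draw <- sample(x = Urn, size = 2, replace = FALSE)

# All ordered pairs when the first ball is put back (3 x 3 = 9 outcomes)
with_replacement <- expand.grid(first = Urn, second = Urn,
                                stringsAsFactors = FALSE)

# Dropping the pairs that repeat a ball leaves the outcomes without replacement (3 x 2 = 6)
without_replacement <- subset(with_replacement, first != second)

nrow(with_replacement)
nrow(without_replacement)
```

The counts 9 and 6 match the two lists of outcomes given in the example.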
Example 4.4
Draw four random numbers between 1 and 20 with replacement and without replacement.
# Sampling with replacement
sample(x = 1:20, size =4, replace = TRUE)
[1] 17 7 7 10
Note: There is no unique answer in either case, as long as the sampling condition is satisfied,
because the samples are generated randomly.
Example 4.5
Simulate rolling a fair die.
# Sample from rolling a fair die
set.seed(25)
sample(x = 1:6, size =1, replace = TRUE)
[1] 3
Note: The replacement option (TRUE or FALSE) does not affect the outcome here since the
experiment is performed once, i.e., size = 1. A seed is set in the code in Example 4.5 above;
this makes the code reproducible and ensures that the same random numbers are
generated each time the script is executed.
Example 4.6
Replicate the experiment in Example 4.5 100 times.
# Sample replication
set.seed(25)
replicate(100, sample(x = 1:6, size =1, replace = TRUE) )
[1] 3 5 1 6 1 6 4 3 1 2 2 3 6 4 5 1 4 5 3 5 1 3 1 1 2 2 1 4
4 2 5 3 2 1 5 3 6 3 1 1 5 2 4 5 4 2 6 5 1 5 1 5 5 3 4
[56] 1 1 4 2 1 1 2 2 5 4 6 3 6 4 2 6 1 1 5 2 3 1 6 2 2 4 3
4 1 5 5 2 3 5 4 3 2 4 4 1 3 4 6 2 2
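One way to relate the 100 simulated rolls to the definition of probability is to compare the
observed proportion of each face with the theoretical value 1/6. This is a hedged sketch; the
names rolls, freq, and rel_freq are illustrative, and the exact proportions depend on the R
version and seed.

```r
set.seed(25)
rolls <- replicate(100, sample(x = 1:6, size = 1, replace = TRUE))

# Count each face; factor() keeps faces that happen never to appear
freq <- table(factor(rolls, levels = 1:6))

# Convert counts to relative frequencies
rel_freq <- prop.table(freq)

rel_freq        # observed proportions from the simulation
rep(1 / 6, 6)   # theoretical probabilities for a fair die
```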
2. $P(A) \le 1$.
3. P .A/ D 0.
4. $P(A) \le P(B)$ if $A \subseteq B$; note that they can be equal.
5. $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
6. $P(A \cup B) \le P(A) + P(B)$ or, more generally,
$P\left(\bigcup_{i=1}^{n} A_i\right) \le \sum_{i=1}^{n} P(A_i)$.
That is, that the probability of A given B is the same as the probability of A happening, because
A and B are independent of one another. Similarly, the probability of B given A is the same as
the probability of B happening, because they are independent events.
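Events are said to be independent if, for any finite subset $A_{i_1}, A_{i_2}, A_{i_3}, \ldots, A_{i_n}$ of them:
$$P(A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_n}) = P(A_{i_1}) \, P(A_{i_2}) \cdots P(A_{i_n}).$$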
Example 4.7
A fruit basket contained 30 pieces of fruit: 8 oranges, 12 apples, and 10 bananas. If two fruits
are taken at random after one another:
(a) What is the probability that the first fruit is a banana and the second fruit is an orange if
the first fruit is returned in the basket before the second fruit is taken?
(b) What is the probability that the first is an orange and the second is an apple if the fruit is
taken without replacement?
Solution:
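Let A denote the event that an apple is drawn.
Let B denote the event that a banana is drawn.
Let O denote the event that an orange is drawn.
(a) $P(B \text{ and } O) = P(B) \cdot P(O) = \frac{10}{30} \times \frac{8}{30} = 0.089$.
(b) $P(O \text{ and } A) = P(O) \cdot P(A \mid O) = \frac{8}{30} \times \frac{12}{29} = 0.11$.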
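The two answers can be checked with a few lines of R. This is a minimal sketch with
illustrative variable names; part (a) multiplies unconditional probabilities because the first
fruit is replaced, while part (b) uses 29 remaining fruits for the second draw.

```r
total   <- 30
oranges <- 8
apples  <- 12
bananas <- 10

# (a) With replacement: the two draws are independent
p_banana_then_orange <- (bananas / total) * (oranges / total)

# (b) Without replacement: the second draw is made from the remaining 29 fruits
p_orange_then_apple <- (oranges / total) * (apples / (total - 1))

round(p_banana_then_orange, 3)
round(p_orange_then_apple, 3)
```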
Example 4.8
Table 4.1 below shows the outcome of a survey conducted in Enugu State, Nigeria to look at the
rate small businesses fail despite the programs of government directed at their survival. Calculate
the probability that a restaurant firm is highly prone to chance of occurrence?
Thus, P .BjA/ indicates the probability of occurrence of event B given event A has occurred.
For an event A and event B are independent.
In general,
[Figure: bar plot with x-axis "Sum of outcomes" (2 to 12) and a vertical axis scaled from 0.00 to 0.10.]
Solution:
# Set the seed so that the randomly generated observations
# are the same on every run
set.seed(1001)
Brands <- sample(c("A", "B", "C"), 1000,
                 prob = c(0.54, 0.36, 0.10), replace = TRUE)
Status <- sample(c("defective", "non-defective"), 1000,
                 prob = c(0.02, 0.98), replace = TRUE)
dataset <- data.frame(Status, Brands)
# Cross-tabulate status by brand
tabular <- with(dataset, table(Status, Brands))
tabular
               Brands
Status            A   B   C
  defective      11   7   0
  non-defective 536 357  89
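A contingency table like this is typically turned into joint and conditional probabilities
with prop.table(). The sketch below is an assumption about how the table might be used, not
the book's solution: it rebuilds the table from the counts printed above (so it does not depend
on the random number generator), and the names counts, joint, and given_brand are hypothetical.

```r
# Counts copied from the printed table (filled column by column: A, B, C)
counts <- matrix(c(11, 536, 7, 357, 0, 89), nrow = 2,
                 dimnames = list(Status = c("defective", "non-defective"),
                                 Brands = c("A", "B", "C")))
tabular <- as.table(counts)

# Joint probabilities: each cell divided by the grand total
joint <- prop.table(tabular)

# Conditional probabilities of status given brand: each column sums to 1
given_brand <- prop.table(tabular, margin = 2)

joint
given_brand
given_brand["defective", "A"]   # P(defective | brand A) = 11 / 547
```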
4.14 EXERCISES
4.1. A fair die is tossed once. Calculate the probability that: (i) exactly 3 comes up, (ii) 3 or
5 come up, and (iii) a prime number comes up.
4.2. Two dice are rolled together once. What is the probability that the sum of the outcome
is: (i) 4, (ii) less than 4, (iii) more than 4, and (iv) between 7 and 12 inclusive.
4.3. A card is drawn from a well shuffled pack of 52 cards. Find the probability of:
(a) a king or a queen,
(b) a black card,
(c) a heart or a red king, and
(d) a spade or a jack.
4.4. A research firm has 30 staff, which consists of 2 research managers, 5 research associates,
3 administrative staff, and 20 fieldworkers. The managing director of the firm wants to
set up a five-person committee. Find the probability that the committee will consist of:
(a) 1 research manager, 1 research associate and 3 fieldworkers;
(b) 1 research manager, 2 research associates and 2 fieldworkers; or
(c) 2 research managers, 3 research associates.
4.5. A firm has 100 employees (65 males and 35 females) and they were asked if they should
adopt the paternal leave or not, their responses are summarized in Table 4.2.
(a) Find the probability that a male opposed to paternal leave policy.
(b) Find the probability that the person is a female, given that the person is in favor of
paternal leave policy.
CHAPTER 5
Discrete Probability
Distributions
Consider a random variable $X$ that takes integer values $X_1, X_2, \ldots, X_n$ with corresponding
probabilities $P(X_1), P(X_2), \ldots, P(X_n)$ such that $\sum_{i=1}^{n} P(X_i) = 1$. Such an
assignment of probabilities is called a discrete probability distribution. The type of the random
variable determines the nature of the probability distribution it follows. A discrete random
variable usually involves counting and takes integer values, while a continuous random variable
involves measuring and can take any real value, with both an integer and a fractional part. When
probabilities are assigned to the values of a random variable, the collection of such
probabilities gives rise to a probability distribution. The probability distribution function can
be abbreviated as pdf. A discrete probability distribution satisfies two conditions:
$$0 \le P(X) \le 1 \tag{5.1}$$
and
$$\sum P(X) = 1. \tag{5.2}$$
• $P(X = x) = f(x) > 0$, if $x \in S$.
All probabilities must be positive for every element x in the sample space S. Hence, if
element x is not in the sample space S, then $f(x) = 0$.
• $\sum_{x \in S} f(x) = 1$.
The sum of probabilities for all of the possible x values in the sample space S must equal 1.
• $P(X \in A) = \sum_{x \in A} f(x)$.
The sum of probabilities of the x values in A is the probability of event A.
Example 5.1
Experiment: toss a fair coin two times.
Sample space: $S = \{HH, HT, TH, TT\}$.
Random variable X is the number of tosses showing heads.
Thus, $X : S \to \mathbb{R}$
$X(HH) = 2$
$X(HT) = X(TH) = 1$
$X(TT) = 0$
$X = \{0, 1, 2\}.$
That is, the random variable X takes the values 0, 1, and 2. Hence, the pmf is given by:
$$P(X = 0) = \frac{1}{4}, \quad P(X = 1) = \frac{1}{2}, \quad \text{and} \quad P(X = 2) = \frac{1}{4}.$$
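The pmf in Example 5.1 can be derived in R by enumerating the sample space and counting heads.
This is a sketch under the assumption that the four outcomes are equally likely (a fair coin);
the names tosses and pmf are illustrative.

```r
# Enumerate the sample space of two tosses: HH, TH, HT, TT
tosses <- expand.grid(first = c("H", "T"), second = c("H", "T"),
                      stringsAsFactors = FALSE)

# Random variable X: number of tosses showing heads
tosses$X <- (tosses$first == "H") + (tosses$second == "H")

# Equally likely outcomes, so the pmf is the proportion of outcomes for each value of X
pmf <- prop.table(table(tosses$X))
pmf

# The two conditions of a discrete probability distribution
all(pmf >= 0 & pmf <= 1)
sum(pmf)
```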
Examples of a discrete probability distribution are: rolling a die, flipping of coins, counting
of car accidents on highways, producing a defective and non-defective goods, etc.
The following are the common discrete probability distributions used in statistics:
Bernoulli distribution, binomial distribution, geometric distribution, hypergeometric distribu-
tion, Poisson distribution, negative binomial distribution, and multinomial distribution.
Example 5.2
A single tossing of a fair die.
Roll (X)   1        2        3        4        5        6
P(X)       0.1667   0.1667   0.1667   0.1667   0.1667   0.1667
This is a discrete probability distribution since the random variable (X) takes the integers
(i.e., 1, 2, 3, 4, 5, and 6) with corresponding probabilities of 0.1667 each. This satisfies
$0 \le P(X) \le 1$ and $\sum P(X) = 1$.
Example 5.3
Consider a distribution of the family size.
Assume an analyst obtained the result in the table below in a survey of 1,000 households
in Nigeria. Let the random variable X be the household size, with the probabilities shown.
Household size (X ) 1 2 3 4 5 6C
P .X / 2/69 4/77
3/28 29/81 1/3 3/25
This satisfies $0 \le P(X) \le 1$ and $\sum P(X) = 1$.
Example 5.4
The number of defectives items per month and the corresponding probabilities in a manufac-
turing firm are given in the table below.
Defective (X)   0      1     2      3     4      5
P(X)            1/15   1/6   3/10   1/5   2/15   2/15
Let X be the number of defective items in a month; then $0 \le P(X) \le 1$ and $\sum P(X) = 1$
are satisfied.
Example 5.5
Using the data Example 5.4 above, calculate:
$$\sigma^2 = 1.43^2 = 2.05.$$
The code below gives how we can use R to get solution to the worked Example 5.5 above.
Using R Code
# To calculate the mean, standard deviation, and variance of the
# distribution in Example 5.5
set.seed(100)
x <- c(0, 1, 2, 3, 4, 5)
prob <- c(0.067, 0.167, 0.30, 0.20, 0.133, 0.133)
weighted.mean(x, prob)
[1] 2.564
x_miu = x - weighted.mean (x, prob)
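A minimal sketch of the full calculation follows, assuming the exact fractions from Example 5.4
rather than rounded decimals. The names mu, sigma2, and sigma are illustrative, and the formulas
used are the usual definitions of the expected value and variance of a discrete random variable.

```r
x <- 0:5
prob <- c(1/15, 1/6, 3/10, 1/5, 2/15, 2/15)   # exact probabilities from Example 5.4

# Expected value: sum of x * P(x)
mu <- sum(x * prob)

# Variance: sum of (x - mu)^2 * P(x)
sigma2 <- sum((x - mu)^2 * prob)

# Standard deviation: square root of the variance
sigma <- sqrt(sigma2)

round(c(mean = mu, variance = sigma2, sd = sigma), 2)
```

Rounded to two decimals, the variance and standard deviation agree with the hand calculation
(2.05 and 1.43).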
Example 5.6
A farmer supplies eggs in crates to his customers in the neighboring city by motorcycle. An
egg gets broken in a crate, due to a bad road system, on his way to the city with the probability
of 0.75 if he transports 10 crates of eggs. Assuming that a number of damaged eggs is binomially
distributed, what is the probability that three eggs will break before he reaches his destination?
Solution:
Given that $n = 10$, $x = 3$, and $p = 0.75$. Then, the pmf of a binomial is given by:
$$f(x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, \ldots, n$$
$$f(x = 3) = \binom{10}{3} (0.75)^3 (1 - 0.75)^{10-3}$$
$$f(x = 3) = \frac{10!}{(10-3)!\,3!} (0.75)^3 (0.25)^7 = 0.0031.$$
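$$E(x) = \sum_{x=0}^{n} x \, \frac{n!}{x!\,(n-x)!} \, p^{x} q^{n-x}$$
$$E(x) = \sum_{x=0}^{n} x \, \frac{n(n-1)!}{x(x-1)!\,(n-x)!} \, p^{x-1} p \, q^{n-x}.$$
So, cancelling the common factor $x$ (the $x = 0$ term contributes nothing),
$$E(x) = \sum_{x=1}^{n} \frac{n(n-1)!}{(x-1)!\,(n-x)!} \, p^{x-1} p \, q^{n-x}$$
and
$$E(x) = np \sum_{x-1=0}^{n-1} \binom{n-1}{x-1} p^{x-1} q^{(n-1)-(x-1)}.$$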
Since the sum of probabilities for all of the possible x values in the sample space S is equal
to 1, the result yields
$$E(x) = np. \tag{5.9}$$
Therefore, the expected value (mean) of the binomial probability function is np where n is the
number of trials in the experiment and p is the probability of success.
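The result E(x) = np can be checked numerically by summing x f(x) over the whole support of a
binomial distribution. The values n = 10 and p = 0.3 below are arbitrary illustrative choices,
not taken from the text.

```r
n <- 10     # number of trials (illustrative)
p <- 0.3    # probability of success (illustrative)

x <- 0:n
pmf <- dbinom(x, size = n, prob = p)

# Expected value from the definition: sum of x * f(x)
expected <- sum(x * pmf)

c(total_probability = sum(pmf), expected = expected, np = n * p)
```

The total probability is 1 and the summed expectation equals np, as the derivation states.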
So,
$$E(x(x-1)) = \sum_{x=0}^{n} x(x-1) \, \frac{n(n-1)(n-2)!}{x(x-1)(x-2)!\,(n-x)!} \, p^{x} (1-p)^{n-x}.$$
Let $p^x = p^{x-2+2} = p^2 p^{x-2}$.
Substituting for $p^x$ gives
$$E(x(x-1)) = \sum_{x=0}^{n} x(x-1) \, \frac{n(n-1)(n-2)!}{x(x-1)(x-2)!\,(n-x)!} \, p^2 p^{x-2} (1-p)^{n-x}.$$
or
Example 5.7
A manufacturing company has 100 employees, and each employee has a 5% chance of being
absent from work on a particular day. It is assumed that the absence of one worker does
not affect another. The company can continue production on a particular day if no more than
20 workers are absent that day. Calculate the probability that, out of 10 workers randomly
selected, 3 workers will be absent from work. Hence, find the expected number of workers that
will be absent from work on that particular day.
Solution:
1. Number of trials, $n = 10$ workers.
Possible outcomes: success ($p$) is the probability that a worker is absent from work and
failure ($q$) is the probability that a worker is not absent from work.
Probability of success: P(worker is absent from work) = 0.05, and it is constant throughout
the trials.
The events are independent, i.e., the presence of one worker in the company does not affect
another.
The pmf of the binomial is:
$$f(x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, \ldots, n$$
$$f(x = 3) = \binom{10}{3} (0.05)^3 (1 - 0.05)^{10-3}$$
$$f(x = 3) = \binom{10}{3} (0.05)^3 (0.95)^7 = 0.0105.$$
2. The expected number of workers absent on that day is $np = 10 \times 0.05 = 0.5$. That is,
on average about half a worker, so at most about one of the 10 selected workers is expected to
be absent on that particular day.
Example 5.8
An airline operator has 10 airplanes. On a rainy day, the probability that an airplane will fly is
0.65. What is the probability that:
1. an airplane (exactly one) will fly;
2. three airplanes will fly;
3. at most, two airplanes will fly; and
4. at least, two airplanes will fly.
Hence, calculate the number of airplanes that are expected to fly.
Solution:
1. Let x be the number of airplanes that fly, with $n = 10$ and $p = 0.65$.
$$f(x = 1) = \binom{10}{1} (0.65)^1 (0.35)^9 = 0.0005.$$
2.
$$f(x = 3) = \binom{10}{3} (0.65)^3 (0.35)^7 = 0.0212.$$
3.
$$f(x \le 2) = f(x = 0) + f(x = 1) + f(x = 2)$$
$$f(x \le 2) = \binom{10}{0} (0.65)^0 (0.35)^{10} + \binom{10}{1} (0.65)^1 (0.35)^9 + \binom{10}{2} (0.65)^2 (0.35)^8$$
$$= 0.000028 + 0.0005123017 + 0.004281378 = 0.0048.$$
4. "At least two" excludes only the outcomes 0 and 1:
$$f(x \ge 2) = 1 - (f(x = 0) + f(x = 1))$$
$$= 1 - (0.000028 + 0.000512) = 0.9995.$$
5. The expected number of airplanes to fly on a rainy day is $np = 10 \times 0.65 = 6.5$. This
indicates that about six or seven airplanes are expected to fly on a rainy day.
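The same quantities can be obtained in R with dbinom() for single values and pbinom() for
cumulative probabilities. This is a sketch assuming n = 10 and p = 0.65 as in the solution;
the variable names are illustrative.

```r
n <- 10
p <- 0.65

p_one   <- dbinom(1, size = n, prob = p)    # exactly one airplane flies
p_three <- dbinom(3, size = n, prob = p)    # exactly three airplanes fly

# At most two: P(X <= 2) is the cumulative probability up to 2
p_at_most_two <- pbinom(2, size = n, prob = p)

# At least two: P(X >= 2) = P(X > 1), the upper tail beyond 1
p_at_least_two <- pbinom(1, size = n, prob = p, lower.tail = FALSE)

expected <- n * p

round(c(p_one, p_three, p_at_most_two, p_at_least_two), 4)
expected
```

Using lower.tail = FALSE with the cut-off at 1 avoids summing the individual terms by hand.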
Example 5.9
A market representative makes a sale on a particular product with a probability of 0.25.
If he has 30 products to sell that day, find the probability that:
1. no sales are made;
2. five sales are made; and
3. more than four sales are made.
Solution:
1. With $n = 30$ and $p = 0.25$,
$$p(x = 0) = \binom{30}{0} (0.25)^0 (0.75)^{30} = 0.00018.$$
2.
$$p(x = 5) = \binom{30}{5} (0.25)^5 (0.75)^{25} = 0.1047.$$
3.
var(rv)
[1] 2.147879
[Figure: histogram of the generated random numbers; x-axis: Random Number (4 to 10), y-axis:
Frequency (0 to 30).]
Example 5.10
Using the data from Example 5.7 above, use R code to obtain the binomial probability distri-
bution and compare the result.
dbinom(3,10,0.05)
[1] 0.01047506
Example 5.11
Use the data on airplane Example 5.8 above, and then use R to obtain the probabilities in Ex-
ample 5.8 (1–4). Hence, compare the results:
# an airplane will fly
dbinom(1,10,0.65)
[1] 0.0005123017
p1<-dbinom(1,30,0.25)
p1
[1] 0.001785821
p2<-dbinom(2,30,0.25)
p2
[1] 0.008631468
p3<-dbinom(3,30,0.25)
p3
[1] 0.02685346
p4<-dbinom(4,30,0.25)
p4
[1] 0.06042027
prob<-1-(p0+p1+p2+p3+p4)
prob
[1] 0.9021304
The results in Example 5.9 (1–3) agree with what we got using R.
5.6 EXERCISES
5.1. (a) What is a discrete probability distribution and what is it used for?
(b) State the conditions to be satisfied for a discrete probability distribution.
5.2. (a) If X is a discrete random variable with probability p(X), what are the expected
value and standard deviation of X?
(b) The table below shows the cases of malaria recorded in a health center with their
respective probabilities.
Malaria Cases (X)   0      1      2      3      4      5      6      7      8
P(X)                0.12   0.20   0.15   0.18   0.12   0.08   0.07   0.03   0.05
CHAPTER 6
Continuous Probability
Distributions
In the previous chapter, we discussed discrete probability distributions and their properties. In
a discrete probability distribution, the random variable takes only integer values, or a finite
or countably infinite number of possible outcomes. However, in a continuous probability
distribution, the random variable takes any real value within a specified range. Typical examples
of continuous random variables are weight, temperature, height, and some economic indicators
(prices, costs, sales, inflation, investments, etc.). A continuous probability distribution
describes the complete range of values a continuous random variable can take with their
associated probabilities along the range of values. This distribution is very useful in
predicting the likelihood of an event within a specified range of values. In this section we
discuss continuous probability distributions, and we will observe that the sum symbol, $\sum$,
which is used in the derivation of the mean and variance of a discrete probability distribution,
is replaced by an integral symbol. The integral sign indicates the sum of a continuous random
variable over an interval of points. Examples of continuous probability distributions are the
normal distribution, exponential distribution, Student's t distribution, chi-square
distribution, etc.
Let X be a continuous random variable that can take any real value within a specified
range; then the probability distribution of X is called a continuous probability distribution.
With x continuous and the probability density function denoted by $f(x)$, the probability of the
event $a \le x \le b$ is represented mathematically as:
$$P(a \le x \le b) = \int_{a}^{b} f(x) \, dx,$$
Example 6.1
The time required by a driver to drive his boss from home to office is described by the function
$\frac{1}{x^2}$. What is the probability that the driver will get to the office between 5–10 min?
Solution:
Let x denote the time required for the driver to move from home to office.
Given that:
$$f(x) = \frac{1}{x^2}; \quad a = 5 \text{ and } b = 10.$$
Then,
$$P(5 \le x \le 10) = \int_{5}^{10} \frac{1}{x^2} \, dx.$$
Integrate the right-hand side (RHS) over the interval,
$$P(5 \le x \le 10) = \left[ -\frac{1}{x} \right]_{5}^{10}.$$
Substitute the values of x and then evaluate the RHS,
$$P(5 \le x \le 10) = -\left( \frac{1}{10} - \frac{1}{5} \right)$$
$$P(5 \le x \le 10) = 0.10.$$
The probability that the driver will reach office from home within 5–10 min is 0.10. This
means that he has a low chance of getting to office under this condition.
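The integral in Example 6.1 can also be evaluated numerically with R's integrate() function.
This is a minimal sketch; the function name f and the object result are illustrative.

```r
# Function from Example 6.1
f <- function(x) 1 / x^2

# Numerical integration over the interval from 5 to 10 minutes
result <- integrate(f, lower = 5, upper = 10)
result$value

# Closed form from the antiderivative -1/x, evaluated at the limits
(-1 / 10) - (-1 / 5)
```

Both approaches give the same probability as the hand calculation.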
where y is a continuous random variable, $\mu$ is the mean, and $\sigma^2$ is the variance of the
distribution.
The standard normal curve is a special case of the normal curve when the mean is 0 and the
standard deviation is 1. A normal distribution can be written in the form $X \sim N(\mu, \sigma^2)$,
which reads: X is normally distributed with mean $\mu$ and variance $\sigma^2$. The total area
under the curve is 100%, i.e., all observations fall under the curve. If a dataset is normally
distributed, about 68% of the observations fall within ±1 SD of the mean, about 95% of the
observations fall within ±2 SD, and about 99.7% of the observations fall within ±3 SD, as
shown in Fig. 6.1.
[Fig. 6.1: normal curve with the horizontal axis marked at -3 SD, -2 SD, -1 SD, 0, +1 SD, +2 SD, +3 SD.]
(d) The curve is denser in the center and less dense in the tails.
(e) The normal distribution has two parameters: mean ($\mu$) and variance ($\sigma^2$).
(f ) About 68% of the area of a normal distribution is within one standard deviation of the
mean.
(g) About 95% of the area of a normal distribution is within two standard deviations of the
mean.
(h) About 99.7% of the area of a normal distribution is within three standard deviations of the
mean.
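The three percentages in the list above can be verified with pnorm(), the cumulative
distribution function of the normal distribution. The helper name within_sd is a hypothetical
choice for this sketch.

```r
# Area of the standard normal curve within k standard deviations of the mean
within_sd <- function(k) pnorm(k) - pnorm(-k)

areas <- sapply(1:3, within_sd)
round(areas, 4)
```

The three areas are approximately 0.68, 0.95, and 0.997, which is where the 68%, 95%, and
99.7% figures come from.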
$$z = \frac{X - \mu}{\sigma} \sim N(0, 1).$$
Thus, z is normally distributed with a mean of 0 and standard deviation of 1.
In the above equation, X is the random variable being standardized, $\mu$ is the mean of the
distribution, and $\sigma$ is the standard deviation of the distribution. Thus, once a normally
distributed random variable is standardized, it has a mean of 0 and standard deviation of 1,
which is the standard normal distribution.
Example 6.2
A class of 30 students sat for examinations on mathematics, the class mean score is 65 and
the standard deviation is 12.5. Assuming that the scores is normally distributed, what is the
percentage of students scoring above 70 in the mathematics examination?
Solution:
Given that: $n = 30$, $\mu = 65$, and $\sigma = 12.5$. $P(x > 70) = ?$
Standardize the students’ scores:
$$P(x > 70) = P\left( \frac{x - 65}{12.5} > \frac{70 - 65}{12.5} \right).$$
Let $z = \frac{x - 65}{12.5}$:
$$P(x > 70) = P(z > 0.4).$$
Since the total probability is 1, the upper-tail probability can be written as
$$P(x > 70) = 1 - P(z \le 0.4).$$
Look up the probability corresponding to $z \le 0.4$ in the standard normal table.
That is, $\Phi(0.4) = 0.6554$.
$$P(x > 70) = 1 - \Phi(0.4) = 1 - 0.6554 = 0.3446.$$
Thus, 34% of the students that sat for the mathematics examination scored above 70.
Example 6.3
A plastic production machine produces plastics with the mean of 80 and standard deviation of 5
in a minute. Assuming that the production of plastics followed a normal distribution, calculate:
(a) the probability that the machine will produce between 78 and 85; and
(b) the probability that the machine will produce less than 90.
Solution:
Given that: $\mu = 80$ and $\sigma = 5$.
First, standardize the production.
Let $z = \frac{X - 80}{5}$.
(a) $P(78 \le x \le 85) = P\left( \frac{78 - 80}{5} \le \frac{X - 80}{5} \le \frac{85 - 80}{5} \right)$.
Substitute $z = \frac{X - 80}{5}$:
$$P(78 \le x \le 85) = P\left( \frac{78 - 80}{5} \le z \le \frac{85 - 80}{5} \right) = P(-0.4 \le z \le 1).$$
Look up the probabilities corresponding to $z \le -0.4$ and $z \le 1$ in the standard normal
table and subtract the smaller value from the larger value:
$$= \Phi(1) - \Phi(-0.4) = 0.8413 - (1 - 0.6554) = 0.4967.$$
This means that the probability that the machine will produce between 78 and 85 plastics in
a minute is about 0.5.
(b) Standardize the production variable:
$$P(x < 90) = P\left( \frac{x - 80}{5} < \frac{90 - 80}{5} \right) = P(z < 2) = \Phi(2) = 0.9772.$$
This indicates that the probability that the machine will produce less than 90 plastics is
0.98; this implies it is almost certain that the machine will produce fewer than 90 plastics
within a minute.
Example 6.4
Suppose that a Corporate Affairs Commission official knows that the monthly registration of
new companies follows a normal distribution with an average of 50 new registrations per month
and a variance of 16. Find the probability that: (a) fewer than 35 new companies will
register, (b) the number of new registered companies will be 45, and (c) between 35 and 45 new
companies will register.
Solution:
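Let x be the number of registered companies.
x is normally distributed with mean 50 and variance 16. That is, $x \sim N(50, 16)$.
Since the standard deviation is the square root of the variance, $\sigma = \sqrt{16} = 4$. Then,
(a) $P(x < 35) = P\left( \frac{x - 50}{4} < \frac{35 - 50}{4} \right) = \Phi(-3.75) \approx 0.0001$.
This means that it is very unlikely that fewer than 35 companies will register within a month.
(b) Because x is treated as continuous, the probability of exactly 45 registrations is zero; the
quantity evaluated here is the probability of at most 45:
$P(x \le 45) = P\left( \frac{x - 50}{4} \le \frac{45 - 50}{4} \right) = \Phi(-1.25) = 1 - \Phi(1.25) = 1 - 0.89435 = 0.1056$.
The result shows that there is a low probability of 45 or fewer companies registering
within a month.
(c) $P(35 \le x \le 45) = P\left( \frac{35 - 50}{4} \le z \le \frac{45 - 50}{4} \right) = \Phi(-1.25) - \Phi(-3.75) = 0.10556$.
This means there is a low chance of between 35 and 45 companies registering within
a month.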
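The probabilities in Example 6.4 can be cross-checked with pnorm(). This is a sketch assuming
mean 50 and standard deviation 4 (the square root of the variance 16); the variable names are
illustrative, and part (b) is evaluated as a cumulative probability, P(x <= 45).

```r
mu    <- 50
sigma <- sqrt(16)   # standard deviation from the variance

# (a) Probability of fewer than 35 registrations
p_less_35 <- pnorm(35, mean = mu, sd = sigma)

# (b) Probability of at most 45 registrations
p_at_most_45 <- pnorm(45, mean = mu, sd = sigma)

# (c) Probability of between 35 and 45 registrations
p_between <- p_at_most_45 - p_less_35

round(c(p_less_35, p_at_most_45, p_between), 5)
```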
This is interpreted that more than 350 staff sitting for the test will pass at first sitting.
where
q           vector of quantiles
mean        vector of means
sd          vector of standard deviations
log.p       logical; if TRUE, probabilities p are given as log(p)
lower.tail  logical; if TRUE (default), probabilities are $P(X \le x)$;
            otherwise, $P(X > x)$.
Note that the lower tail of the normal curve is the part of the curve to the left of a given point
(value), while the upper tail is the part to the right of that point. Thus, the
left-tailed test is when the critical region is on the left side of the distribution of the test value.
The right-tailed test is when the critical region is on the right side of the distribution of the test
value.
The following scripts are the solutions to Examples 6.2, 6.3, and 6.4 above, using
R code. It can be seen that we get the same results in R with less effort.
# Example 6.2
# To calculate the probability of students scoring above 70
prob70<-pnorm (70, mean=65, sd=12.5, lower.tail=FALSE)
prob70
[1] 0.3445783
percent70<- prob70*100
percent70
[1] 34.45783
# Example 6.3
# (a) To calculate the probability that the machine will produce
# between 78 and 85?
btw78_85 <- pnorm(85, 80, 5) - pnorm(78, 80, 5)
btw78_85
[1] 0.4967665
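Example 6.4 can be handled in the same way. The sketch below is one possible way to do it, assuming only the base function pnorm(); the variable names are illustrative. Part (b) is computed as a cumulative probability, because the probability that a continuous variable equals exactly 45 is zero.

```r
# Example 6.4: registrations ~ N(mean = 50, variance = 16), so sd = 4
mu <- 50
sigma <- sqrt(16)

# (a) P(x < 35)
p_below35 <- pnorm(35, mean = mu, sd = sigma)

# (b) P(x <= 45); P(x = 45) itself is zero for a continuous variable
p_below45 <- pnorm(45, mean = mu, sd = sigma)

# (c) P(35 <= x <= 45)
p_35to45 <- p_below45 - p_below35

round(c(a = p_below35, b = p_below45, c = p_35to45), 5)
```

The three values should agree with the table-based answers up to rounding.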
6.5 EXERCISES
6.1. Suppose \(X \sim \mathrm{Bin}(100, 0.8)\). Calculate (i) \(P(x \le 70)\), (ii) \(P(x > 75)\), and
(iii) \(P(70 \le x \le 90)\).
6.2. A machine packs rice nominally in a 50-kg bag. It is observed that there is a variation in
the actual weight that is normally distributed. Records show that the standard deviation
of the distribution is 0.10 kg and the probability that the bag is underweight is 0.05.
Find:
6.3. The Ozone cinemas revealed that movie customers spent an average of $2,000 on
concessions with a standard deviation of $200. If the spending on concessions follows a
normal distribution:
(a) find the percentage of customers that will spend less than $1,800 on concessions;
(b) find the percentage of customers that will spend more than $1,800 on concessions;
and
(c) find the percentage of customers that will spend more than $2,000 on concessions.
6.4. Suppose we know that the survival rate of a new business is normally distributed with
mean 60% and standard deviation 8%. (a) What is the probability that more than 25
out of 30 new businesses registered in the week will survive? (b) What is the probability
that between 20 and 25 new businesses will survive?
6.5. The average lifetime of a light bulb is 4500 h with a standard deviation of 500 h. Assume
that the lifetime of light bulbs is normally distributed. (a) Find the probability that the
lifetime of a bulb is between 4200 h and 4700 h. (b) Find the probability that the
lifetime of a bulb exceeds 4500 h.
6.6. Suppose the time spent (in min) in opening a new bank account is X with a probability
density function of:
kx 2 for x > 0
f .x/ D
0 otherwise:
(a) Find the value of k .
(b) Find the probability that the time spent to open a new account will be less than
10 min.
CHAPTER 7
Figure 7.1: t-distribution with different degrees of freedom (df = 5, 10, and 30) and standard normal distribution. Horizontal axis: x; vertical axis: f(X = x).
When the population standard deviation is unknown, the sample size is less than 30, and
the random variable \(x\) is approximately normally distributed, the standardized sample mean
follows a t-distribution. The test statistic is:
\[ t = \frac{\bar{x} - \mu}{s/\sqrt{n}} \sim t_{(n-1),(\alpha/2)}. \]
When you use a t-distribution to estimate a population mean, the df are equal to one less
than the sample size, \(df = n - 1\).
• When the degrees of freedom of the t-distribution are sufficiently large, the t-distribution
approaches the normal distribution.
7.1. STUDENT-T DISTRIBUTION 105
Example 7.1
What is the value of \(t\) in the following: (i) \(t_{0.05}(20)\), (ii) \(t_{0.01}(20)\), (iii) \(t_{0.975}(25)\), and
(iv) \(t_{0.95}(25)\)?
Solution:
Let’s make use of the t-distribution table (see Appendix A).
(i) \(t_{0.05}(20) = 1.725\).
We look at the t-distribution table where the significance level (\(\alpha\)) is 0.05 and the df (\(v\)) is 20,
and we get 1.725.
(ii) \(t_{0.01}(20) = 2.528\).
We look at the t-distribution table where the significance level (\(\alpha\)) is 0.01 and the df (\(v\)) is 20,
and we get 2.528.
(iii) \(t_{0.975}(25) = 2.060\).
Because of the symmetry of the t-distribution, we look at the t-distribution table where the
significance level (\(\alpha\)) is 0.025 and the df (\(v\)) is 25, and we get 2.060.
(iv) \(t_{0.95}(25) = 1.708\).
Since the t-distribution is symmetric, we use a significance level (\(\alpha\)) of 0.05 with df (\(v\)) of 25,
and we get 1.708.
# Example 7.1(ii)
alpha<- 0.01
df<-20
t.alpha <- qt(alpha, df, lower.tail =TRUE )
t.alpha
[1] -2.527977
t.abs<-abs(t.alpha) # t distribution is symmetric.
t.abs
[1] 2.527977
# Example 7.1(iii)
alpha<- 0.975
df<-25
t.alpha <- qt(alpha, df, lower.tail =TRUE )
t.alpha
[1] 2.059539
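Parts (i) and (iv) of Example 7.1 can be obtained in the same way. This is a minimal sketch using the base function qt(); the variable names are illustrative, and part (i) is read as an upper-tail critical value, which is how the table is used in the worked solution.

```r
# Example 7.1(i): upper-tail critical value for alpha = 0.05 with 20 df
t_i <- qt(0.05, df = 20, lower.tail = FALSE)

# Example 7.1(iv): 0.95 quantile with 25 df
t_iv <- qt(0.95, df = 25, lower.tail = TRUE)

round(c(t_i, t_iv), 3)
```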
Example 7.2
Suppose a random variable \(X\) follows a t-distribution with \(v = 10\) df. Calculate the probability
that the absolute value of \(X\) is less than 2.228.
7.2. CHI-SQUARE DISTRIBUTION 107
Solution:
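(Figure: t-distribution curve with \(v = 10\); the points \(-2.228\), 0, and 2.228 are marked on the horizontal axis.)
Since the t-distribution is symmetric and the t-table does not contain negative t-values, we can
write the probability as:
\[ P(|X| < 2.228) = P(-2.228 < X < 2.228) = P(X < 2.228) - P(X > 2.228). \]
From the t-table, \(P(X < 2.228) = 0.975\) and \(P(X > 2.228) = 0.025\). The required probability is
\[ P(|X| < 2.228) = 0.975 - 0.025 = 0.95. \]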
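The same probability can be checked in R with the cumulative distribution function pt(). This is a short sketch; the variable names are illustrative.

```r
# Example 7.2: P(|X| < 2.228) for X ~ t with 10 degrees of freedom
v <- 10
prob <- pt(2.228, df = v) - pt(-2.228, df = v)
round(prob, 3)
```

Unlike the printed table, pt() accepts negative values directly, so the symmetry argument is not needed in code.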
(Figure: chi-square density curves for df = 1, 5, and 10. Horizontal axis: X; vertical axis: f(X = x).)
\[ X^2 = \sum \frac{(O_i - E_i)^2}{E_i} \sim \chi^2_v, \]
where \(O_i\) is the observed value, \(E_i\) is the expected value, and \(v\) is the df.
• The chi-square distribution is a continuous probability distribution with values ranging
from 0 to \(\infty\) (nonnegative).
Example 7.3
What is the value of chi-square in the following: (i) \(\chi^2_{0.05}(15)\), (ii) \(\chi^2_{0.995}(30)\), and (iii) \(\chi^2_{0.900}(85)\)?
Solution:
We look at a chi-square table for the significance level (\(\alpha\)) and the degrees of freedom (\(v\)).
(i) \(\chi^2_{0.05}(15) = 7.261\).
(ii) \(\chi^2_{0.995}(30) = 53.67\).
(iii) \(\chi^2_{0.900}(85) = 102.07\).
We obtained \(\chi^2_{0.900}(85) = 102.07\) by averaging \(\chi^2_{0.900}(80) = 96.5782\) and
\(\chi^2_{0.900}(90) = 107.565\), since \(\chi^2_{0.900}(85)\) cannot be read directly from
the chi-square table. We use the average method to estimate the value.
Note that we use the chi-square table that gives the probabilities \(P[X \le x]\).
R Codes to Obtain Chi-Square Values from the Table Using Example 7.3
The chi-square value can be obtained in R using the function:
qchisq(p, df, ncp, lower.tail = TRUE, log.p = FALSE)
110 7. OTHER CONTINUOUS PROBABILITY DISTRIBUTIONS
where
qchisq      is the quantile function for the chi-square distribution
p           is the vector of probabilities
df          is the degrees of freedom
ncp         is the non-centrality parameter delta;
            if omitted, use the central \(\chi^2\) distribution
lower.tail  is logical; if TRUE (default), probabilities are \(P[X \le x]\);
            otherwise, \(P[X > x]\)
log.p       is logical; if TRUE, probabilities p are given as log(p).
The following are the codes to solve the Example 7.3(i)–(iii):
qchisq(0.05, df=15, lower.tail = TRUE)
[1] 7.260944
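Parts (ii) and (iii) follow the same pattern. The sketch below also reproduces the table-averaging estimate used for 85 df, so that it can be compared with the value qchisq() returns directly; the variable names are illustrative.

```r
# Example 7.3(ii)
chi_ii <- qchisq(0.995, df = 30, lower.tail = TRUE)

# Example 7.3(iii): value computed directly for 85 df
chi_iii <- qchisq(0.900, df = 85, lower.tail = TRUE)

# Table-style estimate: average of the values for 80 and 90 df
chi_iii_avg <- mean(qchisq(0.900, df = c(80, 90)))

round(c(chi_ii, chi_iii, chi_iii_avg), 2)
```

The averaged value is only an approximation, so small differences from the direct computation are expected.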
7.3 F-DISTRIBUTION
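The F-distribution is the distribution of the ratio of two independent chi-square random
variables, each divided by its degrees of freedom, \(v_1\) and \(v_2\). The probability density
function of the F-distribution is
\[ f(x) = \frac{\Gamma\left(\frac{v_1 + v_2}{2}\right) \left(\frac{v_1}{v_2}\right)^{\frac{v_1}{2}} x^{\frac{v_1}{2} - 1}}{\Gamma\left(\frac{v_1}{2}\right) \Gamma\left(\frac{v_2}{2}\right) \left(1 + \frac{v_1 x}{v_2}\right)^{\frac{v_1 + v_2}{2}}}, \quad x \ge 0, \]
where \(v_1\) and \(v_2\) are the shape parameters and \(\Gamma\) is the gamma function.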
The F-distribution is most spread out when the df are small; as the df decrease, the
distribution becomes more dispersed. The F-distribution is asymmetric, with a minimum value
of 0 and no upper bound. The curve reaches a peak not far to the right of 0 and then gradually
falls toward the horizontal axis as the value of F increases. The curve approaches the
horizontal axis but never touches it (see Fig. 7.4).
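As a quick check on the density, the standard formula for the F-distribution can be written out and compared with R's built-in density function df(). This is a sketch only: f_density is a hypothetical helper name, and the degrees of freedom 8 and 12 are chosen purely for illustration.

```r
# Density of the F-distribution written out from its formula
f_density <- function(x, v1, v2) {
  gamma((v1 + v2) / 2) * (v1 / v2)^(v1 / 2) * x^(v1 / 2 - 1) /
    (gamma(v1 / 2) * gamma(v2 / 2) * (1 + v1 * x / v2)^((v1 + v2) / 2))
}

x <- c(0.5, 1, 2, 4)
manual  <- f_density(x, v1 = 8, v2 = 12)
builtin <- df(x, df1 = 8, df2 = 12)

# Compare the hand-written formula with the built-in density
all.equal(manual, builtin)
```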
(Figure 7.4: density curve of the F-distribution. Horizontal axis: x; vertical axis: f(X = x).)
• A change in the first parameter (\(v_1\)) does not change the mean of the distribution, but
the density of the distribution is shifted from the tail of the distribution toward the center.
Solution:
Let’s look at the F table for the significance level (\(\alpha\)) and the degrees of freedom (\(v_1\) and \(v_2\)) for the
numerator and denominator, respectively.
(i) \(F_{(0.05)}(8, 12) = 0.3045\).
(ii) \(F_{(0.95)}(10, 15) = 2.544\).
(iii) \(F_{(0.995)}(20, 50) = 2.470\).
Note that we use the F table that gives the probabilities \(P[X \le x]\), the left side of the
curve.
R Codes to Obtain F Values from the F-Distribution Table Using Example 7.4
The F value can be obtained in R using the function:
qf(p, df1,df2, ncp, lower.tail = TRUE, log.p = FALSE)
where
qf          is the quantile function for the F-distribution
p           is the vector of probabilities
df1         is the degrees of freedom for the numerator
df2         is the degrees of freedom for the denominator
ncp         is the non-centrality parameter delta;
            if omitted, use the central F-distribution
lower.tail  is logical; if TRUE (default), probabilities are \(P[X \le x]\);
            otherwise, \(P[X > x]\)
log.p       is logical; if TRUE, probabilities p are given as log(p).
With these short R statements we are able to obtain the F values for Example 7.4(i)–(iii).
qf (0.05, 8, 12, lower.tail = TRUE, log.p = FALSE)
[1] 0.3045124
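Parts (ii) and (iii) of Example 7.4 can be obtained with the same call. A minimal sketch, with illustrative variable names:

```r
# Example 7.4(ii): 0.95 quantile with 10 and 15 df
f_ii <- qf(0.95, df1 = 10, df2 = 15, lower.tail = TRUE)

# Example 7.4(iii): 0.995 quantile with 20 and 50 df
f_iii <- qf(0.995, df1 = 20, df2 = 50, lower.tail = TRUE)

round(c(f_ii, f_iii), 3)
```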
CHAPTER 8
Non-Probability Sampling
Non-probability sampling involves non-random selection of the sampling units, so only specific
members of the population have a chance of being selected. The most widely used non-probability
methods are judgment sampling, quota sampling, convenience sampling, and extensive sampling.
A non-probability sampling technique is based on subjective judgment and is used for exploratory
studies, e.g., a pilot survey.
• It saves time.
3. Computer Generation
Computers have built-in programs that generate random samples with ease.
These are mostly used in selecting winners among applicants for plots of land, visa lotteries,
etc.
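In R, computer generation of a random sample can be sketched with the base function sample(). The applicant ID numbers, the number of winners, and the seed below are hypothetical values chosen only for illustration.

```r
set.seed(123)  # arbitrary seed so the draw can be reproduced

applicants <- 1:500  # hypothetical applicant ID numbers

# Draw 10 winners without replacement; each applicant has an equal chance
winners <- sample(applicants, size = 10, replace = FALSE)
winners
```

Setting replace = TRUE instead would allow the same applicant to be drawn more than once.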
Example 8.1
Suppose a population contains the following units: 4, 6, 10, 11, 15, 17, and 20. Select samples
of size 2. What are the possible samples with replacement and without replacement?
8.2. PROBABILITY SAMPLING TECHNIQUES 119
Solution:
Table 8.2: SRSWR and SRSWOR samples
Samples 4 6 10 11 15 17 20
4 (4, 4) (4, 6) (4, 10) (4, 11) (4, 15) (4, 17) (4, 20)
6 (6, 4) (6, 6) (6, 10) (6, 11) (6, 15) (6, 17) (6, 20)
10 (10, 4) (10, 6) (10, 10) (10, 11) (10, 15) (10, 17) (10, 20)
11 (11, 4) (11, 6) (11, 10) (11, 11) (11, 15) (11, 17) (11, 20)
15 (15, 4) (15, 6) (15, 10) (15, 11) (15, 15) (15, 17) (15, 20)
17 (17, 4) (17, 6) (17, 10) (17, 11) (17, 15) (17, 17) (17, 20)
20 (20, 4) (20, 6) (20, 10) (20, 11) (20, 15) (20, 17) (20, 20)
(a) SRSWR: Number of possible samples = \(7^2 = 49\)
Samples 4 6 10 11 15 17 20
4
6 (6, 4)
10 (10, 4) (10, 6)
11 (11, 4) (11, 6) (11, 10)
15 (15, 4) (15, 6) (15, 10) (15, 11)
17 (17, 4) (17, 6) (17, 10) (17, 11) (17, 15)
20 (20, 4) (20, 6) (20, 10) (20, 11) (20, 15) (20, 17)
(b) SRSWOR: Number of possible samples = \(\binom{7}{2} = 21\)
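The two listings can be generated in R rather than written out by hand. This sketch uses the base functions expand.grid() and combn(); the object names are illustrative.

```r
population <- c(4, 6, 10, 11, 15, 17, 20)

# SRSWR: every ordered pair, repeats allowed
srswr <- expand.grid(first = population, second = population)
nrow(srswr)   # number of samples with replacement

# SRSWOR: unordered pairs of distinct units
srswor <- t(combn(population, 2))
nrow(srswor)  # number of samples without replacement
```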
• The samples from systematic sampling depend on precision in the selection of members.
• It has a higher risk of data manipulation, which increases the likelihood of achieving
a targeted outcome rather than a random dataset.
• It gives high estimates of standard errors when compared with other probability sampling
techniques.
(Figure: pie chart of the student population by faculty: Engineering 12%, Science 27%, Law 18%, Social Science 35%, and Medicine 8%.)
If a stratified random sample of 1,000 students is to be selected in the above example,
students are selected in proportion to the size of the faculties. That is, 120 students from
Engineering, 270 students from Science, 180 students from Law, 350 students from Social
Science, and 80 students from Medicine are selected to achieve better
precision.
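The proportional allocation above is a single multiplication per stratum. A minimal sketch in R, with illustrative object names and the faculty shares taken from the example:

```r
# Share of the student population in each faculty
faculty_share <- c(Engineering = 0.12, Science = 0.27, Law = 0.18,
                   "Social Science" = 0.35, Medicine = 0.08)
total_sample <- 1000

# Proportional allocation: stratum sample size = total sample * stratum share
allocation <- total_sample * faculty_share
allocation
```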
122 8. SAMPLING AND SAMPLING DISTRIBUTION
8.2.4.1 Advantages of Stratified Random Sampling
• It gives a high representation of the population.
• It improves the potential for the units to be evenly distributed over the population, hence
improving precision when compared with SRS.
• The list of the population needs to be clearly delineated into each stratum.
• It has no capacity to extrapolate to the population, as the selected samples are not a good
representation of the population.
For SRSWR, the distribution of the sample means appears in Table 8.3.
Samples 4 6 10 11 15 17 20
4 4.0 5.0 7.0 7.5 9.5 10.5 12.0
6 5.0 6.0 8.0 8.5 10.5 11.5 13.0
10 7.0 8.0 10.0 10.5 12.5 13.5 15.0
11 7.5 8.5 10.5 11.0 13.0 14.0 15.5
15 9.5 10.5 12.5 13.0 15.0 16.0 17.5
17 10.5 11.5 13.5 14.0 16.0 17.0 18.5
20 12.0 13.0 15.0 15.5 17.5 18.5 20.0
Samples 4 6 10 11 15 17 20
4
6 5.0
10 7.0 8.0
11 7.5 8.5 10.5
15 9.5 10.5 12.5 13.0
17 10.5 11.5 13.5 14.0 16.0
20 12.0 13.0 15.0 15.5 17.5 18.5
In summary, irrespective of the sampling process, the mean of the distribution of sample means
remains the mean of the population.
2. The standard error of the mean is equal to the ratio of the standard deviation of the
population to the square root of the sample size. This can also be demonstrated as follows using
(SRSWR) Table 8.3. The standard deviation of the population 4, 6, 10, 11, 15, 17, and 20
units is:
\[ \sigma = \sqrt{\frac{(4 - 11.86)^2 + (6 - 11.86)^2 + \cdots + (20 - 11.86)^2}{7}} = 5.38. \]
Similarly, the standard deviation of the distribution of the 49 sample means in Table 8.3 is
\(\sigma_{\bar{X}} = 3.81\), which equals \(\sigma/\sqrt{n} = 5.38/\sqrt{2}\).
3. Variance: \(\mathrm{var}(\bar{X}) = \sigma^2_{\bar{X}} = \frac{\sigma^2}{n}\).
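These properties of the sampling distribution can be verified numerically for the seven-unit population. The sketch below enumerates all 49 SRSWR samples of size 2 and compares the mean and standard deviation of the sample means with the population values; the object names are illustrative, and the standard deviations use the divisor N, as in the hand calculation.

```r
population <- c(4, 6, 10, 11, 15, 17, 20)
n <- 2

# Population mean and standard deviation (divisor N, not N - 1)
mu <- mean(population)
sigma <- sqrt(mean((population - mu)^2))

# All 49 SRSWR samples of size 2 and their means
samples <- expand.grid(first = population, second = population)
sample_means <- rowMeans(samples)

# Mean and standard deviation of the distribution of sample means
mean_of_means <- mean(sample_means)
sd_of_means <- sqrt(mean((sample_means - mean_of_means)^2))

round(c(mu = mu, mean_of_means = mean_of_means), 2)
round(c(sigma = sigma, sd_of_means = sd_of_means,
        sigma_over_root_n = sigma / sqrt(n)), 2)
```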
Alternatively,
\[ z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \to N(0, 1) \quad \text{as } n \to \infty. \]
The main significance of the central limit theorem is that it enables us to make probability
statements about the sample mean in relation to the population mean. Let’s distinguish
two separate variables with their corresponding distributions:
1. Let \(X\) be a random variable that measures a single element from the population; then the
distribution of \(X\) is the same as the distribution of the population, with mean \(\mu\) and standard
deviation \(\sigma\).
2. Let \(\bar{X}\) be a sample mean from the population; then the distribution of \(\bar{X}\) is its sampling
distribution, with mean \(\mu_{\bar{X}} = \mu\) and standard deviation \(\sigma_{\bar{X}} = \sigma/\sqrt{n}\), where \(n\) is the sample size.
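The distinction between the two distributions can be illustrated with a small simulation. The sketch below is an illustration only: the exponential population, the sample size, the number of replications, and the seed are all assumptions, not values from the text.

```r
set.seed(2024)  # arbitrary seed for reproducibility

n <- 50        # sample size
reps <- 5000   # number of simulated samples

# Skewed population: exponential with mean 1 and standard deviation 1
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

mean(sample_means)  # should be close to the population mean, 1
sd(sample_means)    # should be close to sigma / sqrt(n)
1 / sqrt(n)
```

A histogram of sample_means, for example with hist(sample_means), should look roughly bell-shaped even though the population itself is skewed.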
Example 8.2
Let \(\bar{X}\) be the mean of a random sample of 100 selected from a population with mean 30 and
standard deviation 5. (a) What are the mean and standard deviation of \(\bar{X}\)? (b) What is the
probability that the value of \(\bar{X}\) falls between 29 and 31? (c) What is the probability that the
value of \(\bar{X}\) is more than 31?
Solution:
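(a) \(\mu_{\bar{X}} = \mu = 30\); \(\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{5}{\sqrt{100}} = 0.5\).
(b)
\[ P(29 < \bar{X} < 31) = P\left(\frac{29 - \mu_{\bar{X}}}{0.5} \le z \le \frac{31 - \mu_{\bar{X}}}{0.5}\right) = P\left(\frac{29 - 30}{0.5} \le z \le \frac{31 - 30}{0.5}\right) \]
\[ = \Phi(2) - \Phi(-2) = 0.9772 - 0.0228 = 0.9544. \]
(c)
\[ P(\bar{X} > 31) = 1 - P(\bar{X} \le 31) = 1 - P\left(z \le \frac{31 - 30}{0.5}\right) = 1 - \Phi(2) = 1 - 0.9772 = 0.0228. \]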
R Codes
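A minimal sketch of how Example 8.2 could be computed in R, assuming only the base functions sqrt() and pnorm(); the variable names are illustrative.

```r
# Example 8.2: population mean 30, standard deviation 5, sample size 100
mu <- 30
sigma <- 5
n <- 100

# (a) standard deviation of the sample mean
se <- sigma / sqrt(n)

# (b) P(29 < Xbar < 31)
p_b <- pnorm(31, mean = mu, sd = se) - pnorm(29, mean = mu, sd = se)

# (c) P(Xbar > 31)
p_c <- pnorm(31, mean = mu, sd = se, lower.tail = FALSE)

round(c(se = se, b = p_b, c = p_c), 4)
```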
Example 8.3
A branch manager of a microfinance bank claims that the average number of customers that
deposit cash on a monthly basis is 1,250 with a standard deviation of 130 customers.
Assume the distribution of the number of customers depositing cash is normal. Find:
(a) the probability that at most 1,220 customers will deposit cash in a month and
(b) the probability that, for a random sample of 15 months, the mean number of customers
depositing cash is less than 1,200.
Solution:
(a)
\[ P(X \le 1220) = P\left(z \le \frac{1220 - 1250}{130}\right) = P(z \le -0.231) = 0.4090. \]
(b)
\[ P(\bar{X} < 1200) = P\left(z \le \frac{1200 - 1250}{130/\sqrt{15}}\right) = P(z \le -1.49) = 0.0681. \]
R Codes
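A minimal sketch of how Example 8.3 could be computed in R with pnorm(); the variable names are illustrative. Because R does not round the z-value before looking up the probability, the results may differ from the table-based answers in the third or fourth decimal place.

```r
# Example 8.3: monthly depositors ~ N(mean = 1250, sd = 130)
mu <- 1250
sigma <- 130

# (a) P(X <= 1220) for a single month
p_a <- pnorm(1220, mean = mu, sd = sigma)

# (b) P(Xbar < 1200) for the mean of a random sample of 15 months
n <- 15
p_b <- pnorm(1200, mean = mu, sd = sigma / sqrt(n))

round(c(a = p_a, b = p_b), 4)
```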
8.6 EXERCISES
8.1. (a) Explain what you understand by the sampling distribution.
(b) Differentiate between probability sampling and non-probability sampling.
(c) What are the merits and demerits of probability sampling and non-probability
sampling?
8.2. (a) Mention and describe types of probability sampling techniques you know.
(b) State the advantages and disadvantages of the sampling techniques mentioned
in (a).
8.3. List non-probability sampling techniques and state the advantages and disadvantages
of the techniques.
8.4. With the aid of a demonstrative example, explain the concept of sampling distribution
of means.
8.5. (a) State the central limit theorem and its importance.
(b) Assume that the length of time of calls is normal, with average of 60 s and standard
deviation of 10 s. Find the probability that the average time obtained from a sample
of 35 calls is 55 s of the entire population average.
8.6. Suppose a normally distributed population has mean and standard deviation of 75 and
12, respectively.
(a) What is the probability that a random element X selected from the population
falls between 72 and 75?
(b) Calculate the mean and standard deviation of \(\bar{X}\) for a random sample of size 30.
(c) Calculate the probability that the mean of selected sample size of 30 from the
population is between 72 and 75.
CHAPTER 9
(Figure: 95% confidence interval, showing the margin of error between the lower bound and the upper bound, with 2.5% in each tail.)
For example, if the sample size \(n < 30\), \(\sigma\) is unknown, and the population is normally distributed,
then we should use the Student t-distribution.
However, if \(n < 30\), \(\sigma\) is unknown, and the population is not normally distributed, then we
should use nonparametric statistics (not covered in this book). Specifically, the CI for the population
mean is constructed as:
\[ CI = \bar{x} \pm t_{n-1,q} \frac{s}{\sqrt{n}}. \tag{9.1} \]
A \(100(1 - \alpha)\%\) confidence region for \(\mu\) contains:
\[ \bar{x} - t_{n-1,q} \frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{n-1,q} \frac{s}{\sqrt{n}}, \tag{9.2} \]
9.2. CONFIDENCE INTERVALS FOR MEAN 133
where \(\alpha\) represents the level of significance, \(n\) represents the sample size, \(s\) is the sample standard
deviation, \(\bar{x}\) is the sample mean, \(t\) is the critical value from the t-distribution table, and \(q\) is the
quantile (usually \(q = 1 - \frac{\alpha}{2}\) for a two-tailed test and \(q = 1 - \alpha\) for a one-tailed test).
Suppose we want to construct a 95% CI for an unknown population mean; then the 95%
probability that the CI will contain the true population mean can be written as follows:
\[ P\left(\bar{x} - t_{n-1,\alpha/2} \frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{n-1,\alpha/2} \frac{s}{\sqrt{n}}\right) = 0.95. \tag{9.3} \]
Example 9.1
A random sample of 25 customers at a supermarket spent an average of $2,000 with a standard
deviation of $200. Construct a 95% CI estimating the population mean of purchase made at the
supermarket.
Solution:
\[ 100(1 - \alpha)\% = 95\% \Rightarrow 100 - 100\alpha = 95 \Rightarrow \alpha = 0.05. \]
With \(n = 25\), the critical value is \(t_{24, 0.975} = 2.064\), so
\[ CI = 2000 \pm 2.064 \times \frac{200}{\sqrt{25}} = 2000 \pm 82.56. \]
This implies that the average purchase by the customers in the supermarket falls between
$1,917.44 and $2,082.56.
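The interval in Example 9.1 can be reproduced in R with the quantile function qt(). A minimal sketch, with illustrative variable names:

```r
# Example 9.1: n = 25, sample mean 2000, sample sd 200, 95% confidence
n <- 25
xbar <- 2000
s <- 200
alpha <- 0.05

# Two-sided critical value from the t-distribution with n - 1 df
t_crit <- qt(1 - alpha / 2, df = n - 1)

margin <- t_crit * s / sqrt(n)
conf_interval <- c(lower = xbar - margin, upper = xbar + margin)
round(conf_interval, 2)
```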
134 9. CONFIDENCE INTERVALS FOR SINGLE POPULATION MEAN AND PROPORTION
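In addition, if \(n \ge 30\) and \(\sigma\) is known, or \(n < 30\), \(\sigma\) is known, and the population is
normally distributed, then we use:
\[ \bar{x} - z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}}. \tag{9.4} \]
If \(n \ge 30\) and \(\sigma\) is unknown, the standard deviation \(s\) of the sample is used to approximate the
population standard deviation \(\sigma\), and we have:
\[ \bar{x} - z_{1-\alpha/2} \frac{s}{\sqrt{n}} \le \mu \le \bar{x} + z_{1-\alpha/2} \frac{s}{\sqrt{n}}. \tag{9.5} \]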
Example 9.2
A sales manager of a company suspects that there has been a dramatic drop in the sales of a
particular product. He took a simple random sample of 50 sales records from the previous days.
The sales (in dollars) were recorded and some summary measures are provided: \(n = 50\),
\(\bar{x} = 5200\), and \(s = 400\). Assume that the sales are approximately normal. (a) Construct a
95% CI for the mean sales of the product. (b) Interpret your result in (a).
Solution:
(a) \(n = 50\), \(\bar{x} = 5200\), \(s = 400\), and \(z_{0.975} = 1.96\).
\[ 5200 - 1.96 \frac{400}{\sqrt{50}} \le \mu \le 5200 + 1.96 \frac{400}{\sqrt{50}} \]
\[ 5089.13 \le \mu \le 5310.87, \quad \text{or} \quad CI = [5089.13,\ 5310.87] \text{ dollars}. \]
(b) We are 95% confident that the true mean sales of the product lie between $5,089.13 and $5,310.87.
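The interval in Example 9.2 can be checked in R with qnorm(). This sketch takes the sample size as 50, the value used in the worked solution; the variable names are illustrative.

```r
# Example 9.2: n = 50, sample mean 5200, sample sd 400, 95% confidence
n <- 50
xbar <- 5200
s <- 400

z_crit <- qnorm(0.975)          # two-sided 95% critical value
margin <- z_crit * s / sqrt(n)  # margin of error

conf_interval <- c(lower = xbar - margin, upper = xbar + margin)
round(conf_interval, 2)
```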
Example 9.3
In an opinion poll on whether or not to establish a National Grazing Reserve bill in Nigeria,
a random sample of 8,500 participants was selected; 6,250 respondents supported the bill
while the others opposed it. Construct a 95% CI for the population proportion.
Solution:
Sample proportion \(\hat{p} = \frac{6250}{8500} \approx 0.74\).
Since \(n\hat{p} = 8500 \times 0.74 > 5\) and \(n(1 - \hat{p}) = 8500 \times 0.26 > 5\), we can use the normal
distribution table.
From the normal table, \(z_{(1 - \frac{0.05}{2})} = 1.96\).
\[ CI = 0.74 \pm 1.96 \sqrt{\frac{0.74(0.26)}{8500}} \]
\[ 0.74 - 1.96 \sqrt{\frac{0.74(0.26)}{8500}} \le p \le 0.74 + 1.96 \sqrt{\frac{0.74(0.26)}{8500}} \]
\[ 0.7307 \le p \le 0.7493. \]
Hence, the proportion of respondents that supported the National Grazing Reserve bill in Nigeria
lies between 0.7307 and 0.7493.
R Codes for the Computation of Confidence Interval for Proportion in Example 9.3
sample.prop <- 0.74
n <- 8500
std.error <- qnorm(0.975)*sqrt(sample.prop*(1-sample.prop)/n)
lower.limit <- sample.prop-std.error
upper.limit <- sample.prop+std.error
conf.interval<-c(lower.limit, upper.limit)
conf.interval
[1] 0.7306752 0.7493248
The result shows that between 73% and 75% of the respondents supported the National Grazing
Reserve bill in Nigeria.
Example 9.4
The manager of a commercial bank took a random sample of 120 customers’ account numbers
and found that 15 customers did not yet have a bank verification number (BVN). Compute the 90% CI
for the proportion of all the bank customers that are yet to complete the BVN process.
Solution:
\[ \hat{p} = \frac{15}{120} = 0.125 \]
\[ CI = 0.125 \pm 1.645 \sqrt{\frac{0.125(0.875)}{120}} \]
\[ 0.125 - 1.645 \sqrt{\frac{0.125(0.875)}{120}} \le p \le 0.125 + 1.645 \sqrt{\frac{0.125(0.875)}{120}} \]
\[ 0.075 \le p \le 0.1747. \]
The percentage of all the bank customers that are yet to complete the BVN process is between
7.5% and 17.5%.
R Codes for the Computation of Confidence Interval for Proportion in Example 9.4
sample.prop <- 0.125
sample.size <- 120
std.error <- qnorm(0.95) * sqrt(sample.prop*(1-sample.prop)/sample.size)
lower.limit <- sample.prop-std.error
upper.limit <- sample.prop+std.error
conf.interval<-c(lower.limit, upper.limit)
conf.interval
[1] 0.07534126 0.17465874
Example 9.5
A researcher claimed that the standard deviation of the monthly utility bill for an individual
household is $50. He wants to estimate the mean utility bill in the present month at a
95% confidence level with a margin of error of 12. How large a sample is required?
Solution:
\[ n \ge \left(\frac{z_{0.975}}{12}\right)^2 (50)^2 = \left(\frac{1.96}{12}\right)^2 (50)^2 = 66.69 \approx 67 \text{ households}. \]
Therefore, the minimum sample size required is 67 households.
R Codes to Calculate Sample Size Given a Standard Deviation and Margin of Error
pop.std<-50
margin.err<-12
z.normal<-qnorm(0.975)
sample.size <- (pop.std * z.normal / margin.err)^2
sample.size
[1] 66.69199
ceiling(sample.size)  # round up so the margin of error is not exceeded
[1] 67
However, we can calculate the sample size for proportion under two conditions.
Solution:
\(\hat{p} = \frac{60}{100} = 0.6\) and \(\hat{q} = 0.4\).
To verify that the sampling distribution of \(\hat{p}\) can be approximated by the normal distribution, we check
\[ n\hat{p} = 100 \times 0.6 > 5 \quad \text{and} \quad n\hat{q} = 100 \times 0.4 > 5. \]
\[ n \ge \left(\frac{z_{1 - \frac{\alpha}{2}}}{e}\right)^2 \hat{p}(1 - \hat{p}) \]
\[ n \ge \left(\frac{1.96}{0.025}\right)^2 (0.6)(0.4) = 1475.17. \]
Rounding up, the sample size should be at least 1,476 respondents.
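The sample size calculation for a proportion can be sketched in R as follows. The value 1.96 is the table value used in the worked solution, and the variable names are illustrative; the result is rounded up with ceiling() because a fraction of a respondent cannot be sampled.

```r
# Sample size for a proportion: p.hat = 0.6, margin of error = 0.025
p_hat <- 0.6
margin_err <- 0.025
z <- 1.96  # table value; qnorm(0.975) gives a slightly more precise figure

n_required <- (z / margin_err)^2 * p_hat * (1 - p_hat)
n_required

# Round up to the next whole respondent
ceiling(n_required)
```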
9.6 EXERCISES
9.1. The scores of students in a Business Statistics course are normally distributed. If a random
sample of 40 students has a mean of 72 and a standard deviation of 14,
compute the 95% CI for the population mean.
9.2. The research department of a telecommunication company wants to know the customers’
usage (in hours) of a new service. Assuming that the usage of the service is
normally distributed, a random sample of 3,000 customers under the new service is taken.
The mean usage is found to be 7 h with a standard deviation of 1 h 30 min. Construct
a 99% CI for the mean usage of the service.
9.3. During the recession period, the price of a 60 cl bottle of Coke rose. Due to variability
in the price of Coke, retailers sold at different prices. A random sample of 100 retailers
gave a mean price of 150.20 naira and a standard deviation of 23.5 naira. Calculate the
95% CI.
9.4. In the process of manufacturing bulbs, the probability that a bulb will be defective is
0.09. If a random sample of 200 bulbs is selected, compute the 95% confidence limits for the
proportion of defective bulbs.
9.5. In order to know the winner of the next presidential election in a country, a survey poll
was conducted to allow the citizens to express their opinions about the contestants. If the
95% CI is not greater than 0.09, what number of random sample size of the respondents
should be taken with the margin of error within 0.15 if the standard deviation is 20?
9.6. A steel rolling company manufactures cylindrical steel with the same length but
different diameters (in mm). A random sample of 24 steels is checked, and a mean diameter
of 12.5 mm and a standard deviation of 3 mm are observed. Compute the 95% CI for
the mean diameter of the steels.
CHAPTER 10
(Figure: sampling distribution with the acceptance region marked.)
Alternatively, a two-tail test is a hypothesis test in which the rejection region is on both
sides of the sampling distribution. Assume that the null hypothesis states that \(H_0: \mu = \mu_0\)
against the alternative hypothesis \(H_1: \mu \ne \mu_0\). The non-directional sign takes
values from both sides of the sampling distribution; thus, the sets of values on the right
side and on the left side of the distribution are the rejection regions. Hypothesis tests can
be of the form:
\(H_0: \mu = \mu_0\) vs. \(H_1: \mu \ne \mu_0\) (two-tail test)
\(H_0: \mu = \mu_0\) vs. \(H_1: \mu < \mu_0\) (one-tail test)
\(H_0: \mu = \mu_0\) vs. \(H_1: \mu > \mu_0\) (one-tail test)
\(H_0: \mu > \mu_0\) vs. \(H_1: \mu \le \mu_0\) (one-tail test)
\(H_0: \mu \ge \mu_0\) vs. \(H_1: \mu < \mu_0\) (one-tail test)
\(H_0: \mu < \mu_0\) vs. \(H_1: \mu \ge \mu_0\) (one-tail test)
Example 10.1
In a pharmaceutical company, the operations manager claimed that the mean weight of the drugs
produced by the company is 100 mg. If a random sample of 60 drugs is chosen with a mean of 98 mg
and a standard deviation of 14 mg, test the hypothesis to assess the operations manager’s claim;
use \(\alpha = 0.05\).
Solution:
(a) \(H_0: \mu = 100\) vs. \(H_1: \mu \ne 100\).
(b) \(\alpha = 0.05\).
(c) Test statistic: \(z = \frac{\sqrt{60}(98 - 100)}{14} = -1.1066\) (we use the test statistic z since the sample size is
greater than 30).
(d) \(z_{(0.975)} = 1.96\) (from the normal distribution table).
The stated hypothesis is a two-tailed test; therefore, we use \(z_{(1 - \frac{\alpha}{2})}\). However, if the stated
hypothesis is a one-tailed test and the sample size is equal to or greater than 30, then we
use \(z_{(1 - \alpha)}\).
10.4. HYPOTHESIS TESTING PROCEDURES 145
Decision rule: reject the null hypothesis if \(|-1.1066| > 1.96\). Since the absolute value of the test
statistic is not greater than the critical value, we do not reject \(H_0\).
(e) Conclusion: the data do not contradict the operations manager’s claim that the mean of the
drugs produced is 100 mg.
The scripts below show how the question in Example 10.1 can be solved in R. The z -
statistic and critical value are computed in R. This serves as a basis of comparison.
R Codes
# state the given parameters
xbar = 98 # sample mean
mu0 = 100 # hypothesized value
sigma = 14 # sample standard deviation
n = 60 # sample size
# use z-test since sample size is greater than 30
z = (xbar-mu0)/(sigma/sqrt(n))
z # test statistic
[1] -1.106567
alpha = 0.05
z_alpha = qnorm(1-alpha/2)
z_alpha # critical value
[1] 1.959964
The absolute value of the computed z-statistic is 1.1066, which is less than the critical value of 1.96.
Therefore, we do not reject the null hypothesis, and we conclude that the given data are consistent
with the operations manager’s claim.
Example 10.2
A stockbroker claimed that the weekly return on a stock is normally distributed with an average
return of 0.5%. He took the returns for the previous 20 weeks and found that the average weekly
return was 0.48 with a standard deviation of 0.08. At the 5% level of significance, is his claim valid?
If the level of significance is reduced to 1%, compare the results.
Solution:
Decision rule: Reject the null hypothesis if \(|-1.118| > 2.861\). Since the absolute value of the test
statistic is less than the critical value, we do not reject \(H_0\).
Conclusion: We do not reject the null hypothesis based on the data, and conclude that the data are
consistent with an average weekly return of 0.5%.
Comparison: The stockbroker’s claim that the average weekly return is 0.5% is not rejected at either
the 1% or the 5% level of significance.
Let’s demonstrate this Example 10.2 with R codes.
R Codes
# Given parameters
x.bar = 0.48
mu0 = 0.50
sigma = 0.08
sample.size = 20
alpha1 = 0.05
t.alpha1 = qt(1-alpha1/2, sample.size-1 )
t.alpha1
[1] 2.093024
# At 1% level of significance
alpha2 = 0.01
t.alpha2 = qt(1-alpha2/2, sample.size-1 )
t.alpha2
[1] 2.860935
These results are the same as the outcomes we obtained in Example 10.2.
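The t-statistic itself can be computed in the same session and compared with both critical values. The sketch below repeats the given parameters so that it runs on its own; the names t.stat and crit are illustrative.

```r
# Example 10.2: one-sample t-test computed from summary statistics
x.bar <- 0.48
mu0 <- 0.50
s <- 0.08
sample.size <- 20

# Test statistic
t.stat <- (x.bar - mu0) / (s / sqrt(sample.size))
t.stat

# Two-sided critical values at the 5% and 1% levels
crit <- qt(1 - c(0.05, 0.01) / 2, df = sample.size - 1)

# TRUE would mean "reject the null hypothesis" at that level
abs(t.stat) > crit
```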
Example 10.3
The National Bureau of Statistics (NBS) claimed that less than 10% of graduate youths are
unemployed, while the opposition party argued that the percentage of unemployed graduate youths
is more than the NBS claims. To ascertain the validity of the claim, a random sample of 10,000
graduate youths is selected, of whom 1,250 are unemployed. Test the hypothesis that
less than 10% of graduate youths are unemployed. (Hint: use \(\alpha = 0.05\).)
Solution:
(a) Hypothesis: \(H_0: P < 10\%\) vs. \(H_1: P \ge 10\%\).
(b) \(\alpha = 0.05\).
(c) \(\hat{p} = 0.125\).
Test statistic: \(z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}\).
Here, we use the z-statistic because the sample size of 10,000 is large enough, that is,
it is greater than 30.
\[ z = \frac{0.125 - 0.1}{\sqrt{\frac{0.1(1 - 0.1)}{10000}}} = 8.33. \]
(d) Critical value: \(z_{(1 - \alpha)} = z_{(0.95)} = 1.64\) (our hypothesis is one-tailed, thus \(z_{(1 - \alpha)}\) is
used).
Decision rule: Reject the null hypothesis if \(z > z_{(0.95)}\). Since \(8.33 > 1.64\), we reject the null
hypothesis.
(e) Conclusion: Based on the data, there is no evidence to support the claim of the NBS that the
percentage of unemployed graduate youths is less than 10%.
Note: It is possible that another dataset might justify the claim of the NBS.
However, based on the information given in this particular question, there is no justification for
the claim that the percentage of unemployed graduate youths is less than 10%.
148 10. HYPOTHESIS TESTING FOR SINGLE POPULATION MEAN AND PROPORTION
The R codes below describe how the z -statistics for the proportion .p/ and critical value
can be obtained for proper comparison.
R Codes
# compute the z-statistic
p.hat = 0.125
p0 = 0.10
n = 10000
z = (p.hat-p0)/ sqrt (p0*(1-p0)/n)
z
[1] 8.333333
alpha = 0.05
z.alpha = qnorm(1-alpha)
z.alpha
[1] 1.644854
The z -statistic and critical value under z are 8.333333 and 1.644854, respectively.
Example 10.4
There is a popular saying that girls perform better than boys. To test this, a random
sample of 1,500 students, 950 of whom were boys, was selected to participate in an entrance
examination into secondary schools. Of the boys that participated, only 480 passed, and
only 455 girls passed. Test the hypothesis that girls perform better than boys at a 5% level of
significance.
Solution:
(a) Let \(\hat{p}\) be the proportion of girls that passed the examination and \(p_0\) be the
hypothesized proportion.
Hypothesis: \(H_0: p \le 0.5\) vs. \(H_1: p > 0.5\).
(b) \(\alpha = 0.05\).
(c) \(\hat{p} = \frac{455}{550} = 0.83\) and \(p_0 = 0.5\).
Test statistic: \(z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}\).
The z-statistic is used because the sample size is greater than 30:
\[ z = \frac{0.83 - 0.50}{\sqrt{\frac{0.5(1 - 0.5)}{1500}}} = 25.56. \]
(d) Critical value: \(z_{(1 - \alpha)} = z_{(0.95)} = 1.64\) (from the z-table with a one-tailed test).
Decision rule: Reject the null hypothesis if \(z > z_{(0.95)}\). Since \(25.56 > 1.64\), we reject the null
hypothesis.
(e) Conclusion: The data support the general claim that girls perform better than boys.
R Codes for Computing z -test for a Proportion and its Critical Value
# compute the z-test for the proportion
p.cap = 0.83
p0 = 0.5
n = 1500
z = (p.cap-p0)/ sqrt (p0*(1-p0)/n)
z
[1] 25.56169
alpha = 0.05
z.alpha = qnorm(1-alpha)
z.alpha
[1] 1.644854
Since both the z-statistic for the proportion and the critical value are the same as in Example 10.4,
we arrive at the same conclusion.
10.5 EXERCISES
10.1. (a) Define the following:
(i) null hypothesis and alternative hypothesis; and
(ii) type 1 and type 2 error.
(b) The resident doctor of a hospital stated that the weight of a newborn baby is 3.5 kg
and above because of the kind of food pregnant women eat during pregnancy. A
random sample of 20 newborn babies is selected from the records, and the mean is
found to be 2.95 kg with a standard deviation of 0.85 kg. Test the hypothesis
that the weight of newborn babies is greater than or equal to 3.5 kg, assuming the
weight is normally distributed (use \(\alpha = 0.01\)).
10.2. The managing director of a brewery company claims that the sales of the products
decline by less than 20% during Ramadan periods and summons the sales manager
to investigate his claim. The sales manager collated the sales of the products during the
month of Ramadan over the past 30 years. Table 10.2 shows the distribution of the percentage
decrease in sales during Ramadan periods. What would you say about the managing
director’s claim?
Table 10.2: Sales of a brewery company during Ramadan
Year                   1988   1989   1990   1991   1992   1993   1994   1995   1996   1997
Decrease in sales (%)  27.41  9.31   20.36  26.43  22.75  14.48  10.21  12.42  20.83  18.53

Year                   1998   1999   2000   2001   2002   2003   2004   2005   2006   2007
Decrease in sales (%)  19.98  25.22  12.08  26.75  35.43  19.44  19.62  24.06  25.04  10.61

Year                   2008   2009   2010   2011   2012   2013   2014   2015   2016   2017
Decrease in sales (%)  31.32  24.13  16.38  20.08  21.59  13.98  19.80  23.97  29.64  21.29
10.3. A quality assurance manager argued that the average lifetime of light bulbs is 520 h. To
ascertain his claim, he took a random sample of 50 bulbs and the lifetime readings (in
hours) are as follows:
427.82, 425.76, 395.28, 444.67, 437.26, 442.67, 424.36, 416.63, 431.49,
401.58, 407.57, 461.93, 436.58, 423.20, 429.79, 447.74, 430.27, 434.41,
414.26, 435.79, 427.24, 401.04, 433.63, 404.31, 400.14, 437.70, 437.36,
424.59, 410.77, 448.75, 421.52, 416.88, 427.79, 425.87, 412.31, 423.88,
397.12, 430.68, 418.87, 411.51, 418.85, 405.59, 416.06, 388.01, 439.83,
419.70, 443.24, 422.75, 419.85, 420.28
Test whether the quality assurance manager is right in his statement, and test $H_0: p \geq 0.5$ against
$H_1: p < 0.5$; use $\alpha = 0.05$.
CHAPTER 11
Sales revenue ($’million) 115 118 120 125 126 128 131 132
Advertisement expenses ($’million) 4 7 9 14 15 17 20 21
[Figure: scatter plot of sales revenue against advertisement expenses with the fitted line Sales = Advert + 111.]
2. Forecasting an effect: regression is used to predict a response variable when the values of the
independent variables are known.
\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \tag{11.1} \]
where $\beta_0$ is the intercept, $\beta_1$ is the regression coefficient of $X$, and $\varepsilon_i$ is the error term.
11.2. TYPES OF REGRESSION ANALYSIS 155
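The regression coefficients in (11.1) can be estimated using the least squares method. The residuals are
\[ e_i = Y_i - \beta_0 - \beta_1 X_i. \tag{11.2} \]
The least squares estimates minimize the sum of squared residuals,
\[ \sum_{i=1}^{n} (e_i)^2 = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2, \tag{11.3} \]
which is done by setting its partial derivatives with respect to $\beta_0$ and $\beta_1$ equal to zero.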
Therefore,
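\[ \beta_1 = \frac{\sum_{i=1}^{n} X_i Y_i - \bar{Y}\sum_{i=1}^{n} X_i}{\sum_{i=1}^{n} X_i^2 - \bar{X}\sum_{i=1}^{n} X_i} = \frac{\sum_{i=1}^{n} (X_i Y_i) - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2} = \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)}. \tag{11.7} \]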
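As a quick check of (11.7), the sketch below computes the slope as cov(X, Y)/var(X) for the sales revenue and advertisement expenses tabulated at the start of this chapter, and compares it with lm(). The variable names are illustrative choices, not taken from the text, and the intercept uses the usual relation $\beta_0 = \bar{Y} - \beta_1 \bar{X}$.

```r
# Sales revenue and advertisement expenses ($'million) from the table above
sales  <- c(115, 118, 120, 125, 126, 128, 131, 132)
advert <- c(4, 7, 9, 14, 15, 17, 20, 21)

# Slope and intercept from the closed-form least squares solution
b1 <- cov(advert, sales) / var(advert)
b0 <- mean(sales) - b1 * mean(advert)

# The same coefficients estimated by lm()
fit <- lm(sales ~ advert)

print(c(intercept = b0, slope = b1))
print(coef(fit))
```

Both approaches give a slope of 1 and an intercept of 111 for these data, which is the line Sales = Advert + 111.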
6. There is no correlation between the error terms, i.e., no serial autocorrelation in the data.
7. The number of sample observations must be greater than the number of parameters to be
estimated.
8. For each value of X, the distribution of residuals has equal variance, i.e., homoscedasticity.
Example 11.1
In a business statistics class, the weight and height of 30 students were measured, as shown in
Table 11.1.
(i) Find the regression of the weight on the height of the students.
(ii) Use your answer in (i) to estimate a student's weight when the height is 1.80 m.
Student 1 2 3 4 5 6 7 8 9
Height (m) 1.43 1.10 2.24 1.36 2.26 1.25 1.74 1.55 1.51
Weight (kg) 92.18 77.76 65.44 114.19 82.81 106.66 94.44 75.32 67.35
Student 10 11 12 13 14 15 16 17 18
Height (m) 1.82 1.57 1.59 2.19 1.54 2.06 1.86 1.76 1.51
Weight (kg) 101.55 76.37 91.66 75.85 88.82 83.02 74.66 97.57 104.56
Student 19 20 21 22 23 24 25 26 27
Height (m) 2.39 1.83 2.02 1.99 1.40 1.54 1.60 1.88 1.52
Weight (kg) 113.36 64.71 103.79 70.02 78.35 80.70 90.54 91.55 82.57
Student 28 29 30
Height (m) 1.41 1.38 1.18
Weight (kg) 82.49 87.98 67.54
Solution:
From the data above, we obtained the following results:
\[ \sum XY = 4347.56, \quad \sum Y = 2583.81, \quad \sum X = 50.48. \]
Call:
lm(formula = weight ~ height)
Residuals:
158 11. REGRESSION ANALYSIS AND CORRELATION
Min 1Q Median 3Q Max
-21.411 -10.132 -3.192 7.747 28.049
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 86.19791 13.55401 6.360 6.99e-07 ***
height -0.04214 7.90579 -0.005 0.996
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
In the results above, the statistics for residuals, the coefficient estimates (with standard errors
and the associated p-values), and all other statistics (Multiple R-squared, Adjusted R-squared,
F-statistics, etc.) are shown in the output.
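The output above can be reproduced from Table 11.1 with a short script. The following is a minimal sketch: the object names (height, weight, model, pred) are assumptions chosen to match the formula shown in the Call line, and predict() is used for part (ii).

```r
# Data from Table 11.1 (students 1 to 30)
height <- c(1.43, 1.10, 2.24, 1.36, 2.26, 1.25, 1.74, 1.55, 1.51,
            1.82, 1.57, 1.59, 2.19, 1.54, 2.06, 1.86, 1.76, 1.51,
            2.39, 1.83, 2.02, 1.99, 1.40, 1.54, 1.60, 1.88, 1.52,
            1.41, 1.38, 1.18)
weight <- c(92.18, 77.76, 65.44, 114.19, 82.81, 106.66, 94.44, 75.32, 67.35,
            101.55, 76.37, 91.66, 75.85, 88.82, 83.02, 74.66, 97.57, 104.56,
            113.36, 64.71, 103.79, 70.02, 78.35, 80.70, 90.54, 91.55, 82.57,
            82.49, 87.98, 67.54)

# (i) Regress weight on height
model <- lm(weight ~ height)
summary(model)

# (ii) Estimated weight for a student who is 1.80 m tall
pred <- predict(model, newdata = data.frame(height = 1.80))
pred
```

Using the coefficients reported in the output, the estimate for part (ii) works out to about 86.198 - 0.042 × 1.80 ≈ 86.12 kg.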
Example 11.2
Table 11.2 shows the log of gross domestic product (GDP) and the log of government spending
in Nigeria between 1981 and 2015. Regress the log of GDP on the log of government spending and
interpret your result.
Solution:
From the table above, we obtained the following:
\[ \sum XY = 20140.45, \quad \sum Y = 907.92, \quad \sum X = 775.62. \]
Year 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990
LNGDP (Y) 25.545 25.535 25.483 25.463 25.543 25.451 25.337 25.410 25.473 25.593
LNGEXP (X) 21.082 21.105 21.128 21.150 21.171 21.192 21.213 21.233 21.253 21.273
Year 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000
LNGDP (Y) 25.587 25.591 25.612 25.621 25.618 25.666 25.694 25.721 25.725 25.777
LNGEXP (X) 21.283 21.312 21.340 21.354 21.354 21.382 21.399 21.416 21.433 21.449
Year 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
LNGDP (Y) 25.820 25.858 25.956 26.247 26.281 26.360 26.426 26.486 26.554 26.629
LNGEXP (X) 21.320 21.377 21.103 22.998 23.098 23.404 23.854 24.069 24.076 24.188
R Codes for Regressing the Log of Gross Domestic Products on the Log of Government
Spending
Assign the values for the log of GDP and the log of government spending (LNGEXP).
LNGDP<-c(25.545, 25.535, 25.483, 25.463, 25.543, 25.451, 25.337, 25.410,
25.473, 25.593, 25.587, 25.591, 25.612, 25.621, 25.618, 25.666,
25.694, 25.721, 25.725, 25.777, 25.820, 25.858, 25.956, 26.247,
26.281, 26.360, 26.426, 26.486, 26.554, 26.629, 26.677, 26.719,
26.771, 26.832, 26.859)
LNGEXP<-c(21.082, 21.105, 21.128, 21.150, 21.171, 21.192, 21.213, 21.233,
21.253, 21.273, 21.283, 21.312, 21.340, 21.354, 21.354, 21.382,
21.399, 21.416, 21.433, 21.449, 21.320, 21.377, 21.103, 22.998,
23.098, 23.404, 23.854, 24.069, 24.076, 24.188, 24.233, 24.213,
24.105, 24.032, 24.028)
data<-data.frame (LNGDP, LNGEXP)
model<-lm(LNGDP~LNGEXP, data)
summary(model)
Call:
lm(formula = LNGDP~LNGEXP)
Residuals:
Min 1Q Median 3Q Max
-0.25038 -0.06998 -0.01893 0.04646 0.40962
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.68058 0.39751 44.48 <2e-16 ***
LNGEXP 0.37273 0.01791 20.81 <2e-16 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
From the output above, the regression coefficients are 17.68 and 0.37 for the intercept and slope,
respectively.
In simple linear regression, we considered two variables, where one is the response and the other
is the explanatory variable. Multiple linear regression is an extension of simple
linear regression in which two or more explanatory variables account for the variation
in a dependent variable. Each set of values of the explanatory variables $X_{i1}, X_{i2}, \ldots, X_{ik}$ is associated with a value of
the response variable $Y_i$. The multiple linear regression model is of the form:
\[ Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik} + \varepsilon_i, \]
where $\beta_0$ is a constant term, $\beta_1, \beta_2, \ldots, \beta_k$ are regression coefficients, $\varepsilon_i$ is the error term, and
$\varepsilon_i \sim N(0, \sigma^2)$.
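The estimates of the regression coefficients ($\beta_0, \beta_1, \beta_2, \ldots, \beta_k$) are the values that minimize
the sum of squared residuals.
For two independent variables,
\[ \beta_1 = \frac{\left(\sum x_2^2\right)\left(\sum x_1 y\right) - \left(\sum x_1 x_2\right)\left(\sum x_2 y\right)}{\left(\sum x_1^2\right)\left(\sum x_2^2\right) - \left(\sum x_1 x_2\right)^2} \tag{11.10} \]
\[ \beta_2 = \frac{\left(\sum x_1^2\right)\left(\sum x_2 y\right) - \left(\sum x_1 x_2\right)\left(\sum x_1 y\right)}{\left(\sum x_1^2\right)\left(\sum x_2^2\right) - \left(\sum x_1 x_2\right)^2} \tag{11.11} \]
\[ \beta_0 = \bar{Y} - \beta_1 \bar{X}_1 - \beta_2 \bar{X}_2, \tag{11.12} \]
where lowercase letters denote deviations from the sample means, so that
\[ \sum x_1 y = \sum X_1 Y - \frac{\left(\sum X_1\right)\left(\sum Y\right)}{N} \tag{11.13} \]
\[ \sum x_2 y = \sum X_2 Y - \frac{\left(\sum X_2\right)\left(\sum Y\right)}{N} \tag{11.14} \]
\[ \sum x_1 x_2 = \sum X_1 X_2 - \frac{\left(\sum X_1\right)\left(\sum X_2\right)}{N}. \tag{11.15} \]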
Example 11.3
Table 11.3 shows the level of education ($X_1$), years of experience ($X_2$), and the log of monthly
compensation in a company ($Y$). The level of education is coded as $X_1 = 1$ for primary education, $X_1 = 2$ for
secondary education, $X_1 = 3$ for polytechnic graduate, $X_1 = 4$ for university Bachelor's degree,
and $X_1 = 5$ for university Master's degree holder. Regress the log of monthly compensation ($Y$) on
level of education ($X_1$) and years of experience in the company ($X_2$).
Solution:
We obtained the following values from the data above:
\[ \sum X_2^2 = 921, \quad \sum X_1^2 = 387, \quad \sum X_1 Y = 759.61, \quad \sum X_2 Y = 1165.55, \]
\[ \sum X_1 X_2 = 499, \quad n = 35, \quad \sum X_1 = 107, \quad \sum X_2 = 167, \quad \sum Y = 242.76. \]
Table 11.3: Data for Example 11.3
The regression coefficients are 5.577, 0.308, and 0.087 for $\beta_0$, $\beta_1$, and $\beta_2$, respectively.
The model is $Y_i = 5.577 + 0.308 X_{i1} + 0.087 X_{i2}$.
Interpretation: Because the response is the log of compensation, a one-unit increase in the level of education is
associated with approximately a 31% increase in compensation, holding years of experience constant. Similarly, each
additional year of experience is associated with approximately a 9% increase in compensation, holding level of education
constant.
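The coefficients can also be obtained directly from the summary sums listed in the solution, by applying (11.10)-(11.15). The sketch below assumes that the deviation sums of squares follow the same pattern as (11.13)-(11.15), for example $\sum x_1^2 = \sum X_1^2 - (\sum X_1)^2/N$; the variable names are illustrative.

```r
# Summary sums reported in the solution to Example 11.3
n <- 35
sum_x1   <- 107;    sum_x2   <- 167;     sum_y    <- 242.76
sum_x1sq <- 387;    sum_x2sq <- 921;     sum_x1x2 <- 499
sum_x1y  <- 759.61; sum_x2y  <- 1165.55

# Deviation sums, as in (11.13)-(11.15)
s_x1y  <- sum_x1y  - sum_x1 * sum_y  / n
s_x2y  <- sum_x2y  - sum_x2 * sum_y  / n
s_x1x2 <- sum_x1x2 - sum_x1 * sum_x2 / n
s_x1sq <- sum_x1sq - sum_x1^2 / n   # assumed analogue for the squared terms
s_x2sq <- sum_x2sq - sum_x2^2 / n

# Coefficients from (11.10)-(11.12)
den <- s_x1sq * s_x2sq - s_x1x2^2
b1  <- (s_x2sq * s_x1y - s_x1x2 * s_x2y) / den
b2  <- (s_x1sq * s_x2y - s_x1x2 * s_x1y) / den
b0  <- sum_y / n - b1 * sum_x1 / n - b2 * sum_x2 / n

round(c(b0 = b0, b1 = b1, b2 = b2), 3)
```

Because the sums are rounded to two decimal places, the results agree with the lm() estimates only to about two or three decimal places.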
empl<-c(8, 4, 6, 2, 7, 5, 5, 3, 2, 6, 3, 5, 8, 5, 5, 5, 3, 4, 5, 5,
3, 9, 2, 4, 4, 7, 5, 3, 9, 5, 6, 2, 5, 3, 4)
Call:
lm(formula = comp~edu + empl, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-0.25870 -0.09077 -0.01283 0.07499 0.28368
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.57723 0.07656 72.844 < 2e-16 ***
edu 0.30803 0.01540 20.002 < 2e-16 ***
empl 0.08717 0.01069 8.151 2.61e-09 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Example 11.4
In Example 11.3, test whether $\beta_1$ is significantly different from zero,
i.e., $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$.
Solution:
Hypothesis: $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$.
Test statistic: $t = \dfrac{0.30803}{0.01540} = 20.002$.
Critical value: $t_{0.975, 33} = 2.042$.
Decision rule: Reject $H_0$ if $|t| \geq t_{0.975, 33}$. Since $20.002 \geq 2.042$, we reject $H_0$.
Conclusion: The coefficient of level of education $X_1$ is significantly different from zero.
Note: This result agrees with the p-value reported for edu in the R output of Example 11.3.
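The same test can be sketched in R with qt() and pt(). The degrees of freedom below are an assumption: 35 observations minus 3 estimated coefficients gives 32, which yields a critical value slightly different from the tabulated one used above but the same decision.

```r
# Estimate and standard error of the edu coefficient, copied from the R output
estimate  <- 0.30803
std.error <- 0.01540

t.stat <- estimate / std.error

# Assumption: residual degrees of freedom = 35 observations - 3 coefficients
df.resid <- 35 - 3
t.crit   <- qt(0.975, df = df.resid)             # two-sided 5% critical value
p.value  <- 2 * pt(-abs(t.stat), df = df.resid)  # two-sided p-value

round(c(t = t.stat, critical = t.crit), 3)
p.value
abs(t.stat) > t.crit  # TRUE means H0 is rejected
```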
(b) t-value: This is the ratio of the coefficient to the standard error of the coefficient. The
rule of thumb is that the absolute value of t must be about 2 or more to show the significance
of the coefficient. The t-value is used to determine the p-value from the Student t-distribution.
(c) P-value: This is the probability, assuming the true coefficient is zero, of obtaining a t-value
at least as extreme as the one observed. The smaller the p-value, the stronger the evidence that the
coefficient differs from zero. For example, if the level of significance is 5% (or 10%), a p-value
less than 5% (or 10%) indicates that the estimated coefficient is statistically significant; otherwise
it is not, and the variable may be dropped from the model.
(d) Multiple R-squared: This shows the fraction (percentage) of the variation in the response
variable that is accounted for by the independent variables in the model. It indicates how
well the model fits the data. In addition, the adjusted R-squared adjusts for the
number of terms in the model. As you add more independent variables to a model,
the R-squared continues to increase, even when the added variable is useless in the model.
However, the adjusted R-squared increases only if the added independent variable is useful;
otherwise the value of the adjusted R-squared decreases. The R-squared ranges
from 0 to 1, but the adjusted R-squared can be negative.
(e) F-statistic: This tests the overall significance of the regression, that is, whether the regression
model provides a better fit to the data than a model with no independent variables.
(f) Durbin–Watson: This is used to test the assumption of no autocorrelation in the error terms,
that is, $\operatorname{Cov}(\varepsilon_i, \varepsilon_{i-1}) = 0$. Autocorrelation
may be caused by the omission of an important explanatory variable, misspecification
of the model, or systematic error in measurement. The consequences of autocorrelation
are that the least squares estimators will be inefficient and the estimated variances of the
regression coefficients will be biased and inconsistent, so hypothesis testing is no longer
valid. The Durbin–Watson statistic ranges from 0 to 4. A value of 2 indicates no autocorrelation
between the error terms, a value from 0 to below 2 indicates positive autocorrelation, and
a value between 2 and 4 indicates negative autocorrelation. A rule of thumb for Durbin–Watson is that,
for relatively normal data, the test statistic should fall within 1.5 to 2.5.
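Several of the quantities listed above can be computed by hand from a fitted model. The following is a minimal sketch on simulated data (the data, seed, and variable names are illustrative assumptions, not taken from the text); it recomputes the t-values, the adjusted R-squared, and the Durbin–Watson statistic using base R only.

```r
# Simulated data (illustrative only, not from the text)
set.seed(1)
n <- 50
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n)

fit <- lm(y ~ x)
s   <- summary(fit)

# (b) t-value = estimate / standard error
est      <- coef(s)[, "Estimate"]
se       <- coef(s)[, "Std. Error"]
t.manual <- est / se

# (d) Adjusted R-squared for k = 1 independent variable
k      <- 1
r2     <- s$r.squared
adj.r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# (f) Durbin-Watson statistic: squared successive differences of the
# residuals divided by the residual sum of squares
e  <- residuals(fit)
dw <- sum(diff(e)^2) / sum(e^2)

t.manual
c(r.squared = r2, adj.r.squared = adj.r2, durbin.watson = dw)
```

The hand-computed t-values and adjusted R-squared match the corresponding entries of summary(fit), and the Durbin–Watson value always lies between 0 and 4.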
Regression Output in R
After entering the values of the dependent and independent variables and combining the series
into a data frame, the R function lm() is used to regress the dependent variable on the set of
independent variables. The function summary() gives the result below, as obtained in Example 11.2.
We discuss the most important results in this output. The regression coefficients of
the model are 17.68 and 0.37, each with a p-value below 2e-16. This is very close
to 0 and much less than 5%, which indicates that the coefficients are statistically significant. The adjusted R-squared
(0.93) indicates that 93% of the variation in the log of GDP is explained by variation in the log of government expenditure.
This shows that the model fits well. Also, the p-value of the F-statistic is below 2.2e-16, indicating that
the regression as a whole is significant.
Call:
lm(formula = LNGDP~LNGEXP)
Residuals:
11.4. PEARSON CORRELATION COEFFICIENT 167
Min 1Q Median 3Q Max
-0.25038 -0.06998 -0.01893 0.04646 0.40962
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.68058 0.39751 44.48 <2e-16 ***
LNGEXP 0.37273 0.01791 20.81 <2e-16 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
[Figure: scale of the correlation coefficient from -1 (perfect negative association), through negative association, 0 (no association), and positive association, to 1 (perfect positive association), each illustrated with a scatter plot.]
The t-test statistic for the significance of the correlation coefficient r is defined as:
\[ t = r\sqrt{\frac{n-2}{1-r^2}} \sim t_{1-\frac{\alpha}{2},\, n-2}. \tag{11.18} \]
However, when n is sufficiently large ($n \geq 30$), we use the standardized score for r based on the
Fisher z-transformation ($z'$) to test for the significance of the correlation coefficient r.
Example 11.5
Using the compensation data in Example 11.3,
(i) calculate the Pearson’s correlation coefficient between X2 and Y ;
(ii) compute coefficient of determination in (i); and
(iii) test for the significance of the Pearson's correlation coefficient between $X_2$ and $Y$.
Solution:
(i) From the data in Example 11.3, we obtained the following values:
\[ \sum X_2^2 = 921, \quad \sum X_1^2 = 387, \quad \sum X_1 Y = 759.61, \quad \sum X_2 Y = 1165.55, \]
\[ \sum X_1 X_2 = 499, \quad n = 35, \quad \sum X_1 = 107, \quad \sum X_2 = 167, \]
\[ \sum Y = 242.76, \quad \sum Y^2 = 1690.23. \]
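From these sums, the three parts of the example can be sketched in R. The code below assumes the usual computational form of Pearson's coefficient, $r = S_{xy}/\sqrt{S_{xx} S_{yy}}$ with deviation sums computed from the raw sums, and a 5% significance level (the example does not state one); the names are illustrative.

```r
n <- 35
sum_x2   <- 167;  sum_y   <- 242.76
sum_x2sq <- 921;  sum_ysq <- 1690.23
sum_x2y  <- 1165.55

# Deviation sums of squares and cross-products
s_xy <- sum_x2y  - sum_x2 * sum_y / n
s_xx <- sum_x2sq - sum_x2^2 / n
s_yy <- sum_ysq  - sum_y^2 / n

# (i) Pearson's correlation coefficient and (ii) coefficient of determination
r  <- s_xy / sqrt(s_xx * s_yy)
r2 <- r^2

# (iii) t-test of (11.18); alpha = 0.05 is an assumption
t.stat <- r * sqrt((n - 2) / (1 - r^2))
t.crit <- qt(0.975, df = n - 2)

round(c(r = r, r.squared = r2, t = t.stat, critical = t.crit), 3)
```

With the rounded sums above, r comes out at roughly 0.26, and the t statistic is smaller than the 5% critical value.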
Example 11.6
Use the following data to test for the significance of correlation coefficient:
n D 50; r D 0:75 and ˛ D 1%.
Solution:
Hypothesis: $H_0: \rho = 0$ vs. $H_1: \rho \neq 0$.
Test statistic:
\[ z' = 0.5\,(\ln(1 + r) - \ln(1 - r)). \]
Substitute the value of r to get $z'$:
\[ z' = 0.5\,(\ln(1 + 0.75) - \ln(1 - 0.75)) = 0.973. \]
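The transformation and one common way of completing the test can be sketched in R. The standard error $1/\sqrt{n-3}$ for $z'$ and the two-sided normal critical value are assumptions based on the usual Fisher z-test, since the remaining steps are not shown here.

```r
n <- 50; r <- 0.75; alpha <- 0.01

# Fisher z-transformation; identical to atanh(r)
z.prime <- 0.5 * (log(1 + r) - log(1 - r))

# Assumption: z' has standard error 1 / sqrt(n - 3) under H0
z.stat <- z.prime * sqrt(n - 3)
z.crit <- qnorm(1 - alpha / 2)

round(c(z.prime = z.prime, z = z.stat, critical = z.crit), 3)
abs(z.stat) > z.crit  # TRUE means H0 is rejected
```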
11.5 EXERCISES
11.1. (a) What do you understand about the concept of regression analysis?
(b) With the aid of an example, differentiate between simple linear regression and
multiple linear regression.
(c) Table 11.4 shows data on the GDP and the government investment in Nigeria
between 1981 and 2016. Regress GDP growth rate on the total investment (%
GDP) and interpret your result.
11.2. (a) List the types of regression analysis you know.
(b) What are the assumptions of the simple linear regression?
(c) Table 11.5 shows the log of space area (measured in sq. feet), log of number of
bedrooms, and the log of house price (measured in $’million).
(i) Perform a regression analysis to show the relationship between the three (3)
variables. Take a log of house prices as the dependent variable (Y ).
(ii) Interpret your result.
(iii) Comment on the output.
11.3. (a) What do you understand about correlation coefficient?
(b) What are the assumptions of the correlation test?
(c) Explain the coefficient of determination.
(d) The summary statistics for the two variables are given below:
\[ \sum XY = 216.24, \quad \sum X = 21.20, \quad \sum Y = 204.21, \]
\[ \sum X^2 = 22.90, \quad \sum Y^2 = 2187.59, \quad n = 20. \]
(i) Calculate the Pearson’s correlation coefficient (r).
(ii) Test for the significance of r.
11.4. (a) Show that the least squares estimates for the simple linear regression are given as:
\[ \beta_0 = \bar{Y} - \beta_1 \bar{X} \quad \text{and} \quad \beta_1 = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}. \]
Table 11.4: GDP and government investment in Nigeria
(b) With the aid of a diagram, explain the different types of correlation you know.
11.5. The strength (MPa) of a steel and the diameter (mm) are measured, as shown in the
table below.
Log of space area (sq. feet) 8.51 8.77 8.55 8.63 7.49 8.61 8.55 8.44 8.42 8.34
Log of number of bedrooms 1.61 1.39 1.39 0.69 1.10 1.61 1.39 1.39 1.39 1.61
Log of house price ($' m) 0.34 0.22 0.30 0.18 0.20 0.26 0.35 0.27 0.14 0.41
Log of space area (sq. feet) 8.39 8.52 8.69 8.60 8.40 8.39 8.25 8.71 8.55 8.55
Log of number of bedrooms 1.10 1.61 1.10 0.69 1.61 1.10 1.10 1.61 1.39 1.39
Log of house price ($' m) 0.17 0.24 0.22 0.22 0.35 0.19 0.10 0.48 0.43 0.35
Log of space area (sq. feet) 8.48 8.38 8.76 8.59 8.37 8.47 8.63 8.56 8.61 8.22
Log of number of bedrooms 1.61 1.10 1.10 0.69 1.10 1.10 1.10 1.10 1.61 1.39
Log of house price ($' m) 0.36 0.23 0.48 0.25 0.18 0.23 0.29 0.20 0.36 0.34
11.6. The following table shows the prices (N) of bread in a supermarket and the quantity
sold in a week.
Price (X ) 100 200 500 750 1000 1500 1800 2000
Quantity sold (Y ) 55 49 40 37 35 27 15 8
(a) Plot the data on a scatter diagram.
(b) Draw the regression line on your scatter diagram.
(c) From the regression equation obtained in (b), estimate the quantity to be sold if
the price (N) of the bread is N2,200.
(d) Calculate Pearson’s correlation coefficient between price of bread and quantity sold.
11.7. The weekly number of sales in the two complementary commodities (DVD player and
DVD disk) are given in the following table.
Week (X ) 1 2 3 4 5 6 7 8
DVD player 20 23 18 25 15 17 16 20
DVD disks 38 39 25 28 23 26 25 30
(a) Calculate Pearson’s correlation coefficient between the number of sales of a DVD
player and DVD disk.
(b) Give interpretation of your result in (a).
(c) Test hypothesis about r D 0.
CHAPTER 12
Poisson Distribution
The Poisson distribution was developed by the French mathematician Siméon Denis Poisson in
1837. It is a discrete probability distribution used to model the
count of events that occur randomly and independently. It gives the probability of a given
number of events occurring in a fixed amount of time, distance, area, or volume.
For instance, the random variable could be the number of radioactive decays in
a given period of time for a certain amount of radioactive material. If you know the rate of
decay for that amount of material, you can use the Poisson distribution as a good model for
the number of decays. We shall elaborate on the statistical properties of the Poisson distribution, the derivation of its
mean and variance, and an application of the distribution.
Its practicality shall be demonstrated in the worked examples.
[Figure: Poisson probabilities P(X = x) plotted against x from 0 to 20 for lambda = 1, 2, 3, and 10.]
Mean
The following is our standard form of the mean:
\[ E(x) = \sum x f(x). \]
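We will use the following fact to simplify the formula for the variance of the Poisson distribution
in the next few steps.
Since $x^2 = x(x-1) + x$, we have $E(x^2) = E(x(x-1)) + E(x)$.
\[ E(x(x-1)) = \sum_{x=0}^{\infty} x(x-1) f(x) \]
\[ E(x(x-1)) = \sum_{x=0}^{\infty} x(x-1) \frac{\lambda^x e^{-\lambda}}{x!} \]
The terms for $x = 0$ and $x = 1$ vanish, so the sum can start at $x = 2$:
\[ E(x(x-1)) = \sum_{x=2}^{\infty} x(x-1) \frac{\lambda^{x-2} \lambda^2 e^{-\lambda}}{x(x-1)(x-2)!} \]
\[ E(x(x-1)) = \lambda^2 \sum_{x=2}^{\infty} \frac{\lambda^{x-2} e^{-\lambda}}{(x-2)!}. \]
Since $\sum_{x=2}^{\infty} \frac{\lambda^{x-2} e^{-\lambda}}{(x-2)!} = 1$, then $E(x(x-1)) = \lambda^2$.
And, therefore,
\[ \operatorname{var}(x) = E(x(x-1)) + E(x) - (E(x))^2. \]
And since $E(x) = \lambda$ from our derivation of the mean above,
\[ \operatorname{var}(x) = \lambda^2 + \lambda - \lambda^2. \]
Thus,
\[ \operatorname{var}(x) = \lambda. \tag{12.3} \]
So the variance of the Poisson distribution is also $\lambda$. This shows that the mean and variance
of a Poisson distribution are the same.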
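This equality can be checked numerically. The sketch below uses an illustrative rate (lambda = 3, not taken from the text) and truncates the infinite sums at x = 200, where the remaining tail probability is negligible.

```r
lambda <- 3            # illustrative rate, not from the text
x <- 0:200             # the tail beyond 200 is negligible for this lambda
p <- dpois(x, lambda)  # P(X = x)

mean.x <- sum(x * p)               # E(x)
var.x  <- sum(x^2 * p) - mean.x^2  # E(x^2) - (E(x))^2

c(mean = mean.x, variance = var.x)
```

Both values agree with lambda up to floating-point error.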
Example 12.1
A paper mill produces writing pads, and the probability of a writing pad being defective is 0.02.
If a sample of 800 writing pads is selected, what is the probability that: (a) none is defective,
(b) one is defective, (c) two are defective, and (d) three or more are defective?
12.2. MEAN AND VARIANCE OF A POISSON DISTRIBUTION 179
Solution:
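In Example 12.1, the probability of a writing pad being defective is $p = 0.02$ and the sample size is
$n = 800$.
Calculate lambda (the mean):
\[ \lambda = np = 800 \times 0.02 = 16. \]
(a) The probability that none of the selected sample is defective:
\[ f(0) = \frac{16^0 e^{-16}}{0!} = e^{-16}. \]
(b) The probability that one writing pad is defective:
\[ f(1) = \frac{16^1 e^{-16}}{1!} = 16e^{-16}. \]
(c) The probability that two writing pads are defective:
\[ f(2) = \frac{16^2 e^{-16}}{2!} = 128e^{-16}. \]
(d) The probability that three or more writing pads are defective:
\[ f(x \geq 3) = 1 - (f(0) + f(1) + f(2)) = 1 - e^{-16} - 16e^{-16} - 128e^{-16} \]
\[ f(x \geq 3) = 1 - f(x < 3) = 1 - e^{-16} - 16e^{-16} - 128e^{-16} \approx 0.99998. \]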
R Codes
The following are solutions to Example 12.1.
To calculate cumulative probabilities of a Poisson distribution, the general format is:
ppois(q, lambda, lower.tail = TRUE, log.p = FALSE)
where
q vector of quantiles
lambda vector of positive means
lower.tail logical; if TRUE (default), probabilities are $P(X \leq x)$,
otherwise $P(X > x)$
log.p logical; if TRUE, probabilities p are given as log(p).
The upper tail is used for part (d) because we are considering the value of 3 and above, that is, the
right side of the distribution; with q = 2 it gives $P(X > 2) = P(X \geq 3)$. Thus, we are able to use
R code to solve the problem in Example 12.1 above.
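A minimal sketch of the four parts of Example 12.1 follows. It uses dpois() for the point probabilities and ppois() with lower.tail = FALSE for the upper tail; the object names are illustrative choices.

```r
lambda <- 800 * 0.02           # lambda = np = 16

p0 <- dpois(0, lambda)         # (a) none defective
p1 <- dpois(1, lambda)         # (b) one defective
p2 <- dpois(2, lambda)         # (c) two defective

# (d) three or more: upper tail beyond 2, i.e. P(X > 2) = P(X >= 3)
p3.plus <- ppois(2, lambda, lower.tail = FALSE)

c(p0 = p0, p1 = p1, p2 = p2, p3.plus = p3.plus)
```

The first three values equal $e^{-16}$, $16e^{-16}$, and $128e^{-16}$, as in the hand calculation.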
Example 12.2
The number of customers arriving at an eatery per minute follows a Poisson distribution with mean 3. Assume
that the numbers of arrivals in two different minutes are independent. Find the probability
that:
1. no customer arrives in a period of a minute;
2. one customer arrives in a period of a minute; and
3. at least two customers will arrive in a given two-minute period.
Solution:
The mean of the Poisson distribution is 3.
1. $\lambda = 3$.
Substitute $\lambda = 3$ and $x = 0$:
\[ f(0) = \frac{3^0 e^{-3}}{0!} = 0.0498. \]
2. Substitute $\lambda = 3$ and $x = 1$:
\[ f(1) = \frac{3^1 e^{-3}}{1!} = 0.1494. \]
3. The probability that at least two customers will arrive in a given two-minute period:
\[ f(x \geq 2) = 1 - f(0) - f(1) \]
\[ f(x \geq 2) = 1 - 0.0498 - 0.1494 \]
\[ f(x \geq 2) = 0.8009. \]
Despite omitting the option “lower” in the statement above, R assumes lower.tail = TRUE
by default.
p.greater2<- 1- p0 - exact.p1
[1] 0.8008517
The probability that at least two customers will arrive in a given two-minute period is 0.8009.
This agrees with the hand calculation in Example 12.2 above.
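The three probabilities can be reproduced with dpois() and ppois(). The sketch below uses illustrative object names. Its last line is an alternative reading and an assumption beyond the calculation above: if the rate is scaled to a two-minute window, the independence of arrivals in different minutes makes the count Poisson with mean 2 × 3 = 6.

```r
lambda <- 3

p0 <- dpois(0, lambda)               # no arrival in a minute
p1 <- dpois(1, lambda)               # exactly one arrival in a minute
p.atleast2 <- 1 - ppois(1, lambda)   # at least two arrivals, mean 3

# Alternative reading (assumption): mean doubled for a two-minute window
p.atleast2.two.min <- 1 - ppois(1, 2 * lambda)

round(c(p0, p1, p.atleast2, p.atleast2.two.min), 4)
```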
Example 12.3
The operational manager has the option of using one of two machines (A and B) to produce
a particular product. He knows that the energy output of both machines is represented well by
a Poisson distribution, with machine A having a mean of 8.25 and machine B a mean of
7.50. Assuming that the energy input remains the same, the efficiency of machine A is
$f(x) = 2x^2 - 8x + 6$ and the efficiency of machine B is $f(y) = y^2 + 2y + 1$. Which of the machines
has the maximum expected efficiency?
Solution:
Machine A expected efficiency is:
\[ E(f(x)) = E(2x^2 - 8x + 6) \]
\[ E(f(x)) = 2E(x^2) - 8E(x) + E(6). \]
Since $E(x^2) = \operatorname{Var}(x) + (E(x))^2$, substitute the mean and variance of the distribution to get
\[ E(f(x)) = 2(8.25 + 8.25^2) - (8 \times 8.25) + 6 = 92.63. \]
Machine B expected efficiency is:
\[ E(f(y)) = E(y^2 + 2y + 1) \]
\[ E(f(y)) = E(y^2) + 2E(y) + E(1) \]
\[ E(f(y)) = \left[\operatorname{Var}(y) + (E(y))^2\right] + 2E(y) + 1, \]
since
\[ E(y^2) = \operatorname{Var}(y) + (E(y))^2. \]
Therefore,
\[ E(f(y)) = 7.50 + 7.50^2 + (2 \times 7.50) + 1 = 79.75. \]
Machine A is more efficient than machine B.
Similarly, we assign the mean and variance to be 7.50 each and then compute the efficiency
for machine B:
The efficiency of machine A is 92.63 and machine B is 79.75, therefore machine A is more
efficient.
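The two expected efficiencies can be computed as follows. This is a minimal sketch with illustrative object names; it relies on the fact that for a Poisson variable the mean and variance are both lambda, so $E(X^2) = \lambda + \lambda^2$.

```r
# For X ~ Poisson(lambda): E(X^2) = Var(X) + E(X)^2 = lambda + lambda^2
lambda.a <- 8.25
lambda.b <- 7.50

# Machine A: E(2x^2 - 8x + 6)
eff.a <- 2 * (lambda.a + lambda.a^2) - 8 * lambda.a + 6

# Machine B: E(y^2 + 2y + 1)
eff.b <- (lambda.b + lambda.b^2) + 2 * lambda.b + 1

c(machine.A = eff.a, machine.B = eff.b)
```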
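Proof. The Poisson($\lambda$) distribution is an approximation to the binomial($n, p$) distribution for large $n$, small $p$, and
$\lambda = np$.
Then $p = \frac{\lambda}{n}$.
Substitute for $p$ in the binomial distribution and then take the limit as $n$ tends to infinity:
\[ \lim_{n \to \infty} P(X = k) = \lim_{n \to \infty} \frac{n!}{(n-k)!\,k!} \left(\frac{\lambda}{n}\right)^{k} \left(1 - \frac{\lambda}{n}\right)^{n-k}. \tag{12.4} \]
Take out the constant terms in (12.4):
\[ \lim_{n \to \infty} P(X = k) = \frac{\lambda^k}{k!} \lim_{n \to \infty} \frac{n!}{(n-k)!} \left(\frac{1}{n}\right)^{k} \left(1 - \frac{\lambda}{n}\right)^{n} \left(1 - \frac{\lambda}{n}\right)^{-k}. \tag{12.5} \]
We can take the limit of the RHS one term after the other in (12.5):
\[ \lim_{n \to \infty} \frac{n!}{(n-k)!} \left(\frac{1}{n}\right)^{k} \]
\[ = \lim_{n \to \infty} \frac{n(n-1)(n-2) \cdots (n-k)(n-k-1) \cdots (1)}{(n-k)(n-k-1) \cdots (1)} \left(\frac{1}{n}\right)^{k} \]
\[ = \lim_{n \to \infty} \frac{n(n-1)(n-2) \cdots (n-k+1)}{n^k}. \tag{12.6} \]
As $n \to \infty$, each of the $k$ terms tends to 1.
Equation (12.6) can be written as:
\[ \lim_{n \to \infty} \frac{n(n-1)(n-2) \cdots (n-k+1)}{n^k} = \lim_{n \to \infty} \left(\frac{n}{n}\right) \left(\frac{n-1}{n}\right) \left(\frac{n-2}{n}\right) \cdots \left(\frac{n-k+1}{n}\right). \tag{12.7} \]
The second step is to take the limit of the middle term in (12.5):
\[ \lim_{n \to \infty} \left(1 - \frac{\lambda}{n}\right)^{n}. \tag{12.8} \]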
Example 12.4
Suppose a random variable X has a binomial distribution with $n = 120$ and $p = 0.01$. Use
the Poisson distribution to calculate the following: (a) $P(X = 0)$, (b) $P(X = 1)$, (c) $P(X = 2)$,
and (d) $P(X > 2)$.
Solution:
(a) $\lambda = np = 1.2$.
\[ P(X = 0) = \frac{1.2^0 e^{-1.2}}{0!} = 0.3012. \]
(b) $P(X = 1) = \dfrac{1.2^1 e^{-1.2}}{1!} = 0.3614$.
(c) $P(X = 2) = \dfrac{1.2^2 e^{-1.2}}{2!} = 0.2169$.
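The quality of the approximation can be seen by placing the Poisson probabilities next to the exact binomial ones. The sketch below uses dpois() and dbinom() with illustrative object names; part (d) is computed as the Poisson upper tail beyond 2.

```r
n <- 120; p <- 0.01
lambda <- n * p                              # 1.2

pois  <- dpois(0:2, lambda)                  # Poisson approximation
binom <- dbinom(0:2, size = n, prob = p)     # exact binomial probabilities
round(rbind(poisson = pois, binomial = binom), 4)

# (d) P(X > 2) under the Poisson approximation
p.gt2 <- ppois(2, lambda, lower.tail = FALSE)
round(p.gt2, 4)
```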
12.5 EXERCISES
12.1. (a) Define the Poisson distribution and state the properties of the distribution.
(b) Show that the mean and the variance of a Poisson distribution are equal.
(c) A production manager took a sample of 25 textbooks and examined them for the
number of defective pages. The outcome of his findings is summarized in the table
below.
Find the probability of finding a textbook chosen at random that contains two or
more defective pages.
Number of defectives 0 1 2 3 4 5
Frequency 10 4 3 2 3 3
12.2. (a) Show that the Poisson($\lambda$) is an approximation to the binomial($n, p$) as $n \to \infty$
and $p \to 0$.
(b) Consider a random variable X that has a binomial distribution with $n = 100$ and
$p = 0.005$. Use the Poisson distribution to calculate the following: (i) $P(X = 0)$,
(ii) $P(X = 1)$, and (iii) $P(X > 1)$.
12.3. (a) Explain the areas of life in which a Poisson process can be applied.
(b) The ABC company supplies a supermarket with some groceries weekly. Experience
has shown that 3% of the total supply in a given week of the groceries is defective.
If the manager of the supermarket decides to check the number of defective items
for a particular grocery and he took a sample of 105 items, what is the probability
that: (i) none of the product is defective? (ii) one of the products is defective? (iii) 2
or more of the products are defective?
12.4. A specialist hospital recorded 200 deliveries in every 30 days and the management of
the hospital observed that most of the deliveries took place in the early hours of the
day, between 12:00 AM and 3:00 AM. Therefore, the management decided to make
as many staff as possible available during these time periods to show their dedication
to work. Using the Poisson distribution, find the probability of delivering: (a) no baby,
(b) one baby, (c) two babies, and (d) three babies in the early hours of the day. Hence,
in how many days of a 30-day period would 4 or more deliveries be expected?
12.5. An insurance company sells a special life insurance policy to people that are over the age
of 50. The actuarial probability that somebody of age 50 and above will die within one
year of the policy is 0.0008. If the special life insurance policy is sold to 7,500 people
of the same age group, this is an indication that there is possibility that 6 people aged
50 years or older will die within the next year. What is the probability that the insurance
company will pay exactly 6 claims on the 7,500 policies sold in the next year?
12.6. A car dealer claimed that he sold an average of three new brand cars per week. Assuming
that the sales follow a Poisson distribution, what is the probability that: (a) he will sell
exactly three new cars, (b) less than three new cars, or (c) more than three new cars in
a given week?
CHAPTER 13
Uniform Distributions
13.1 UNIFORM DISTRIBUTION AND ITS PROPERTIES
Let X be a continuous random variable. Then X is said to have a uniform distribution over the
interval $[a, b]$ if its probability density function is defined as:
\[ f(x) = \frac{1}{b - a}, \quad a \leq x \leq b. \]
The uniform distribution is denoted as $X \sim U(a, b)$. The uniform distribution is also known as
the rectangular distribution.
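[Figure 13.1: probability density function f(x) of the uniform distribution, a rectangle of height 1/(b - a) over the interval from a to b.]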
Figure 13.1 shows a rectangle with a base of length $(b - a)$ and a height of $\frac{1}{b-a}$. The
total area under the pdf is the product of the height of the rectangle and the length of the
base, thus the total area is 1. Because the area under $f(x)$ between the points a and b is
1 and $f(x) > 0$ on that interval, $f(x)$ is a probability density function. There is an infinite number
of possible values of a and b, thus there is an infinite number of possible uniform distributions.
The most commonly used continuous uniform distribution has $a = 0$ and $b = 1$. The uniform distribution
is useful when every value in the interval is equally likely to occur.
Variance
The variance of a uniform distribution is derived as follows:
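\[ \operatorname{Var}(x) = E(x^2) - (E(x))^2 \]
\[ E(x^2) = \int_a^b x^2 f(x)\,dx \]
\[ E(x^2) = \int_a^b x^2 \frac{1}{b-a}\,dx \]
\[ E(x^2) = \frac{1}{b-a} \int_a^b x^2\,dx. \]
Solving the integral we get
\[ E(x^2) = \frac{1}{b-a} \left[\frac{x^3}{3}\right]_a^b \]
\[ E(x^2) = \frac{1}{3(b-a)} \left[x^3\right]_a^b \]
\[ E(x^2) = \frac{1}{3(b-a)} \left(b^3 - a^3\right), \]
since $b^3 - a^3 = (b^2 + ab + a^2)(b - a)$.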
13.2. MEAN AND VARIANCE OF A UNIFORM DISTRIBUTION 191
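Then,
\[ E(x^2) = \frac{1}{3(b-a)} \left[(b^2 + ab + a^2)(b - a)\right] \]
\[ E(x^2) = \frac{1}{3} \left[b^2 + ab + a^2\right], \]
since
\[ \operatorname{Var}(x) = E(x^2) - (E(x))^2. \]
Then,
\[ \operatorname{Var}(x) = \frac{1}{3} \left[b^2 + ab + a^2\right] - \left(\frac{b+a}{2}\right)^2 \]
\[ \operatorname{Var}(x) = \frac{b^2 + ab + a^2}{3} - \frac{(b+a)^2}{4} \]
\[ \operatorname{Var}(x) = \frac{4(b^2 + ab + a^2) - 3(b+a)^2}{12} \]
\[ \operatorname{Var}(x) = \frac{4b^2 + 4ab + 4a^2 - 3b^2 - 6ab - 3a^2}{12} \]
\[ \operatorname{Var}(x) = \frac{b^2 - 2ab + a^2}{12} \]
\[ \operatorname{Var}(x) = \frac{(b-a)^2}{12}. \]
Hence, the variance of a uniform distribution is $\frac{(b-a)^2}{12}$, that is, the square of the difference
between the points a and b divided by 12.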
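The closed-form mean and variance can be checked by numerical integration. The sketch below uses illustrative endpoints (a = 2, b = 10, not taken from the text) together with dunif() and integrate().

```r
a <- 2; b <- 10    # illustrative endpoints, not from the text

# Density of U(a, b)
f <- function(x) dunif(x, min = a, max = b)

# First and second moments by numerical integration
m1 <- integrate(function(x) x   * f(x), lower = a, upper = b)$value
m2 <- integrate(function(x) x^2 * f(x), lower = a, upper = b)$value

c(mean.numeric = m1,        mean.formula = (a + b) / 2,
  var.numeric  = m2 - m1^2, var.formula  = (b - a)^2 / 12)
```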
Example 13.1
Upon arriving at his office building, John wanted to take an elevator to the 10th floor
where his office is located. Due to the congestion of the building, it takes between 0 and 60 s
before the elevator arrives at the ground floor. Assume that the arrival time of the elevator at the
ground floor is uniformly distributed. (a) What is the probability that the elevator takes less than
15 s to arrive at the ground floor? (b) Find the mean arrival time of the elevator at the ground
floor. (c) Find the standard deviation of the arrival time of the elevator at the ground floor.
Solution:
(a) Let the interval endpoints in seconds be $a = 0$ and $b = 60$, and let $c = 15$.
\[ P(0 \leq x \leq 15) = \int_0^{15} \frac{1}{b-a}\,dx \]
\[ P(0 \leq x \leq 15) = \int_0^{15} \frac{1}{60-0}\,dx \]
\[ P(0 \leq x \leq 15) = \left[\frac{x}{60}\right]_0^{15} \]
\[ P(0 \leq x \leq 15) = \frac{15 - 0}{60} = 0.25. \]
The probability that the elevator will arrive within 15 s is one fourth.
(b) Mean: $E(x) = \frac{b+a}{2}$.
\[ E(x) = \frac{60 + 0}{2} = 30 \text{ s}. \]
The mean arrival time is 30 s.
(c) $\operatorname{Var}(x) = \frac{(b-a)^2}{12} = \frac{(60-0)^2}{12} = \frac{60^2}{12}$.
\[ SD(x) = \sqrt{\frac{60^2}{12}} = \frac{60}{\sqrt{12}} = 17.32 \text{ s}. \]
The standard deviation is just over 17 s.
[Figure: uniform probability density function f(x) with height 1/60 over the interval from 0 to 60.]
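The general format for computing cumulative probabilities of a uniform distribution in R is:
punif(q, min = 0, max = 1, lower.tail = TRUE, log.p = FALSE)
where
q vector of quantiles
min lower limit of the distribution
max upper limit of the distribution
lower.tail logical; if TRUE (default), probabilities are $P[X \leq x]$,
otherwise, $P[X > x]$
log.p logical; if TRUE, probabilities p are given as log(p).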
The R code that provides the solutions to Example 13.1 is as follows.
# Solution to Example 13.1a
punif(15,min=0,max=60)
[1] 0.25
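Parts (b) and (c) follow directly from the formulas for the mean and variance of a uniform distribution. A minimal sketch, with illustrative object names, covering all three parts:

```r
a <- 0; b <- 60

p.15   <- punif(15, min = a, max = b)   # (a) P(X <= 15)
mean.x <- (a + b) / 2                   # (b) mean arrival time in seconds
sd.x   <- sqrt((b - a)^2 / 12)          # (c) standard deviation in seconds

c(p.15 = p.15, mean = mean.x, sd = round(sd.x, 2))
```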
Example 13.2
In a debate competition that consists of 10 participants, the moderator gave each participant
5 min to talk on the debate topic. (a) Find the probability that a participant will finish the
talk within 3 min and 30 s. (b) How many of these participants can be expected to finish the talk within
3 min and 30 s?
Solution:
Let's convert all the time intervals to seconds.
Interval (in seconds) of the probability distribution $= [0, 300]$.
\[ f(x) = \frac{1}{b-a} = \frac{1}{300} \]
\[ P(0 \leq x \leq 210) = \int_0^{210} \frac{1}{300}\,dx \]
\[ P(0 \leq x \leq 210) = \left[\frac{x}{300}\right]_0^{210} \]
\[ P(0 \leq x \leq 210) = \frac{210}{300} = 0.7. \]
Therefore, the probability that a participant will finish the talk within 3 min and 30 s is 0.7.
The number of participants that can finish the talk within 3 min and 30 s is $0.7 \times 10 = 7$
participants.
R Codes
# Solution to Example 13.2a
p.210<-punif(210,min=0,max=300)
p.210
[1] 0.7
The probability that a participant will finish the talk within 3 min and 30 s is 0.7.
# Solution to Example 13.2b
n.participants<- p.210*10
n.participants
[1] 7
Example 13.3
Two friends, Peter and Paul, agreed to meet at the school library for a study session. Both of
them arrive at random times between 5:45 PM and 6:00 PM. What is the probability that Peter arrives
at the venue at least 5 min before Paul?
Solution:
Let x be the arrival time for Peter and y be the arrival time for Paul.
Time interval $= 6{:}00\ \text{PM} - 5{:}45\ \text{PM} = 15$ min.
Time interval of the probability distribution $= [0, 15]$.
\[ f(x) = \frac{1}{b-a} = \frac{1}{15} \]
\[ P(x \geq 5) = \int_5^{15} \frac{1}{15}\,dx \]
\[ P(x \geq 5) = \left[\frac{x}{15}\right]_5^{15} \]
\[ P(x \geq 5) = \frac{15}{15} - \frac{5}{15} = \frac{10}{15} = 0.67. \]
The probability that Peter arrives at the venue at least 5 min before Paul is 0.67.
R Codes
This result (0.67) gives the probability of Peter arriving at the venue at least 5 min before Paul.
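The calculation above can be reproduced with punif(). The second half of the sketch is an alternative reading and an assumption that goes beyond the solution given: if both arrival times are modeled as independent U(0, 15) draws, the event "Peter arrives at least 5 min before Paul" corresponds to a triangular region of the 15 by 15 square, which can be checked by simulation. All object names are illustrative.

```r
# Calculation used in the solution above: P(x >= 5) for x ~ U(0, 15)
p.single <- punif(5, min = 0, max = 15, lower.tail = FALSE)

# Alternative reading (assumption): both arrival times are independent
# U(0, 15) and the event is "Peter's time + 5 <= Paul's time"
set.seed(123)
peter <- runif(1e5, 0, 15)
paul  <- runif(1e5, 0, 15)
p.joint.sim   <- mean(peter + 5 <= paul)
p.joint.exact <- (10^2 / 2) / 15^2   # area of the triangle where y - x >= 5

round(c(single = p.single, joint.sim = p.joint.sim,
        joint.exact = p.joint.exact), 3)
```

Under the single-variable calculation the value is 10/15; under the joint reading the exact value is 50/225, and the simulated proportion should be close to it.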
Example 13.4
A filling station manager claimed that the minimum volume of PMS sold per day is 5,000 L
and the maximum volume sold is 6,500 L per day. Assume that the daily sales volume at the filling
station is uniformly distributed. Find the probability that the volume of the product sold
per day will fall between 5,500 and 6,200 L.
Solution:
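Given $a = 5{,}000$ and $b = 6{,}500$,
\[ f(x) = \frac{1}{b-a} = \frac{1}{1500} \]
\[ P(5500 \leq x \leq 6200) = \int_{5500}^{6200} \frac{1}{1500}\,dx \]
\[ P(5500 \leq x \leq 6200) = \left[\frac{x}{1500}\right]_{5500}^{6200} \]
\[ P(5500 \leq x \leq 6200) = \frac{6200}{1500} - \frac{5500}{1500} = 0.47. \]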
R Codes
p.6200<-punif(6200,min=5000,max=6500)
p.6200
[1] 0.8
The probability that the volume of the product sold per day is at most 6,200 L is 0.8.
p.5500<-punif(5500,min=5000,max=6500)
p.5500
[1] 0.3333333
The probability that the volume of the product sold per day is at most 5,500 L is 0.33.
btw5500_6200<-p.6200-p.5500
btw5500_6200
[1] 0.4666667
The probability that the volume of the product to be sold per day will fall between 5,500 and
6,200 L is 0.47; this is the difference in the probability of sale of 6,200 L and 5,500 L.
13.3 EXERCISES
13.1. (a) Define the continuous uniform distribution.
(b) Suppose X has a uniform distribution on the interval [10, 50], that is, $X \sim U(10, 50)$.
Find: (i) P(X 25), (ii) $P(20 \le X \le 32)$, and (iii) P(X 32).
13.2. (a) Let $X \sim U(\alpha, \beta)$ and show that the mean of a uniform distribution is
$\frac{\beta + \alpha}{2}$ and the variance is $\frac{(\beta - \alpha)^2}{12}$.
(b) If $f(x) = 3x^2$, $0 \le x \le 2$, find the expected value of X.
13.3. The waiting time for a bus to arrive at a bus stop is 10 min. Find the probability that a
bus will come within 7 min of waiting at the bus stop. Assume that the waiting time is
uniformly distributed.
13.4. A recruitment agency conducted an interview for 20 job seekers and scheduled 5 min
for each of the job seekers to express themselves on what changes and contributions
they could bring into the company to which they were applying. Find the probability
that: (a) the job seekers finish within 3 min, and (b) the job seekers finish after 3 min.
(c) How many of these job seekers can express themselves for more than 3 min?
13.5. Suppose $X \sim U(2, 10)$. (a) What is the probability density function $f(x)$ of X?
(b) Sketch the probability density function $f(x)$ of X. (c) What is $P(5 \le X \le 10)$?
(d) What is $E(2X^2 - X)$?
APPENDIX A
Tables
We provide tables in the following pages.
Standard Normal Distribution Values (z ≤ 0): entries give P(Z ≤ −z), where z is the sum of the row and column labels
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.50000 0.49601 0.49202 0.48803 0.48405 0.48006 0.47608 0.47210 0.46812 0.46414
0.1 0.46017 0.45621 0.45224 0.44828 0.44433 0.44038 0.43644 0.43251 0.42858 0.42466
0.2 0.42074 0.41683 0.41294 0.40905 0.40517 0.40129 0.39743 0.39358 0.38974 0.38591
0.3 0.38209 0.37828 0.37448 0.37070 0.36693 0.36317 0.35942 0.35569 0.35197 0.34827
0.4 0.34458 0.34090 0.33724 0.33360 0.32997 0.32636 0.32276 0.31918 0.31561 0.31207
0.5 0.30854 0.30503 0.30153 0.29806 0.29460 0.29116 0.28774 0.28434 0.28096 0.27760
0.6 0.27425 0.27093 0.26763 0.26435 0.26109 0.25785 0.25463 0.25143 0.24825 0.24510
0.7 0.24196 0.23885 0.23576 0.23270 0.22965 0.22663 0.22363 0.22065 0.21770 0.21476
0.8 0.21186 0.20897 0.20611 0.20327 0.20045 0.19766 0.19489 0.19215 0.18943 0.18673
0.9 0.18406 0.18141 0.17879 0.17619 0.17361 0.17106 0.16853 0.16602 0.16354 0.16109
1.0 0.15866 0.15625 0.15386 0.15151 0.14917 0.14686 0.14457 0.14231 0.14007 0.13786
1.1 0.13567 0.13350 0.13136 0.12924 0.12714 0.12507 0.12302 0.12100 0.11900 0.11702
1.2 0.11507 0.11314 0.11123 0.10935 0.10749 0.10565 0.10384 0.10204 0.10027 0.09853
1.3 0.09680 0.09510 0.09342 0.09176 0.09012 0.08851 0.08692 0.08534 0.08379 0.08226
1.4 0.08076 0.07927 0.07780 0.07636 0.07493 0.07353 0.07215 0.07078 0.06944 0.06811
1.5 0.06681 0.06552 0.06426 0.06301 0.06178 0.06057 0.05938 0.05821 0.05705 0.05592
1.6 0.05480 0.05370 0.05262 0.05155 0.05050 0.04947 0.04846 0.04746 0.04648 0.04551
1.7 0.04457 0.04363 0.04272 0.04182 0.04093 0.04006 0.03920 0.03836 0.03754 0.03673
1.8 0.03593 0.03515 0.03438 0.03363 0.03288 0.03216 0.03144 0.03074 0.03005 0.02938
1.9 0.02872 0.02807 0.02743 0.02680 0.02619 0.02559 0.02500 0.02442 0.02385 0.02330
2.0 0.02275 0.02222 0.02169 0.02118 0.02068 0.02018 0.01970 0.01923 0.01876 0.01831
2.1 0.01786 0.01743 0.01700 0.01659 0.01618 0.01578 0.01539 0.01500 0.01463 0.01426
2.2 0.01390 0.01355 0.01321 0.01287 0.01255 0.01222 0.01191 0.01160 0.01130 0.01101
2.3 0.01072 0.01044 0.01017 0.00990 0.00964 0.00939 0.00914 0.00889 0.00866 0.00842
2.4 0.00820 0.00798 0.00776 0.00755 0.00734 0.00714 0.00695 0.00676 0.00657 0.00639
2.5 0.00621 0.00604 0.00587 0.00570 0.00554 0.00539 0.00523 0.00509 0.00494 0.00480
2.6 0.00466 0.00453 0.00440 0.00427 0.00415 0.00403 0.00391 0.00379 0.00368 0.00357
2.7 0.00347 0.00336 0.00326 0.00317 0.00307 0.00298 0.00289 0.00280 0.00272 0.00264
2.8 0.00256 0.00248 0.00240 0.00233 0.00226 0.00219 0.00212 0.00205 0.00199 0.00193
2.9 0.00187 0.00181 0.00175 0.00170 0.00164 0.00159 0.00154 0.00149 0.00144 0.00140
3.0 0.00135 0.00131 0.00126 0.00122 0.00118 0.00114 0.00111 0.00107 0.00104 0.00100
3.1 0.00097 0.00094 0.00090 0.00087 0.00085 0.00082 0.00079 0.00076 0.00074 0.00071
3.2 0.00069 0.00066 0.00064 0.00062 0.00060 0.00058 0.00056 0.00054 0.00052 0.00050
3.3 0.00048 0.00047 0.00045 0.00043 0.00042 0.00040 0.00039 0.00038 0.00036 0.00035
3.4 0.00034 0.00033 0.00031 0.00030 0.00029 0.00028 0.00027 0.00026 0.00025 0.00024
3.5 0.00023 0.00022 0.00022 0.00021 0.00020 0.00019 0.00019 0.00018 0.00017 0.00017
Standard Normal Distribution Values (z ≥ 0): entries give P(Z ≤ z), where z is the sum of the row and column labels
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.50000 0.50399 0.50798 0.51197 0.51595 0.51994 0.52392 0.52790 0.53188 0.53586
0.1 0.53983 0.54380 0.54776 0.55172 0.55567 0.55962 0.56356 0.56749 0.57142 0.57535
0.2 0.57926 0.58317 0.58706 0.59095 0.59483 0.59871 0.60257 0.60642 0.61026 0.61409
0.3 0.61791 0.62172 0.62552 0.62930 0.63307 0.63683 0.64058 0.64431 0.64803 0.65173
0.4 0.65542 0.65910 0.66276 0.66640 0.67003 0.67364 0.67724 0.68082 0.68439 0.68793
0.5 0.69146 0.69497 0.69847 0.70194 0.70540 0.70884 0.71226 0.71566 0.71904 0.72240
0.6 0.72575 0.72907 0.73237 0.73565 0.73891 0.74215 0.74537 0.74857 0.75175 0.75490
0.7 0.75804 0.76115 0.76424 0.76730 0.77035 0.77337 0.77637 0.77935 0.78230 0.78524
0.8 0.78814 0.79103 0.79389 0.79673 0.79955 0.80234 0.80511 0.80785 0.81057 0.81327
0.9 0.81594 0.81859 0.82121 0.82381 0.82639 0.82894 0.83147 0.83398 0.83646 0.83891
1.0 0.84134 0.84375 0.84614 0.84849 0.85083 0.85314 0.85543 0.85769 0.85993 0.86214
1.1 0.86433 0.86650 0.86864 0.87076 0.87286 0.87493 0.87698 0.87900 0.88100 0.88298
1.2 0.88493 0.88686 0.88877 0.89065 0.89251 0.89435 0.89617 0.89796 0.89973 0.90147
1.3 0.90320 0.90490 0.90658 0.90824 0.90988 0.91149 0.91308 0.91466 0.91621 0.91774
1.4 0.91924 0.92073 0.92220 0.92364 0.92507 0.92647 0.92785 0.92922 0.93056 0.93189
1.5 0.93319 0.93448 0.93574 0.93699 0.93822 0.93943 0.94062 0.94179 0.94295 0.94408
1.6 0.94520 0.94630 0.94738 0.94845 0.94950 0.95053 0.95154 0.95254 0.95352 0.95449
1.7 0.95543 0.95637 0.95728 0.95818 0.95907 0.95994 0.96080 0.96164 0.96246 0.96327
1.8 0.96407 0.96485 0.96562 0.96638 0.96712 0.96784 0.96856 0.96926 0.96995 0.97062
1.9 0.97128 0.97193 0.97257 0.97320 0.97381 0.97441 0.97500 0.97558 0.97615 0.97670
2.0 0.97725 0.97778 0.97831 0.97882 0.97932 0.97982 0.98030 0.98077 0.98124 0.98169
2.1 0.98214 0.98257 0.98300 0.98341 0.98382 0.98422 0.98461 0.98500 0.98537 0.98574
2.2 0.98610 0.98645 0.98679 0.98713 0.98745 0.98778 0.98809 0.98840 0.98870 0.98899
2.3 0.98928 0.98956 0.98983 0.99010 0.99036 0.99061 0.99086 0.99111 0.99134 0.99158
2.4 0.99180 0.99202 0.99224 0.99245 0.99266 0.99286 0.99305 0.99324 0.99343 0.99361
2.5 0.99379 0.99396 0.99413 0.99430 0.99446 0.99461 0.99477 0.99492 0.99506 0.99520
2.6 0.99534 0.99547 0.99560 0.99573 0.99585 0.99598 0.99609 0.99621 0.99632 0.99643
2.7 0.99653 0.99664 0.99674 0.99683 0.99693 0.99702 0.99711 0.99720 0.99728 0.99736
2.8 0.99744 0.99752 0.99760 0.99767 0.99774 0.99781 0.99788 0.99795 0.99801 0.99807
2.9 0.99813 0.99819 0.99825 0.99831 0.99836 0.99841 0.99846 0.99851 0.99856 0.99861
3.0 0.99865 0.99869 0.99874 0.99878 0.99882 0.99886 0.99889 0.99893 0.99896 0.99900
3.1 0.99903 0.99906 0.99910 0.99913 0.99916 0.99918 0.99921 0.99924 0.99926 0.99929
3.2 0.99931 0.99934 0.99936 0.99938 0.99940 0.99942 0.99944 0.99946 0.99948 0.99950
3.3 0.99952 0.99953 0.99955 0.99957 0.99958 0.99960 0.99961 0.99962 0.99964 0.99965
3.4 0.99966 0.99968 0.99969 0.99970 0.99971 0.99972 0.99973 0.99974 0.99975 0.99976
3.5 0.99977 0.99978 0.99978 0.99979 0.99980 0.99981 0.99981 0.99982 0.99983 0.99983
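The entries in the two standard normal tables can be reproduced in R. The following is a minimal sketch, assuming base R's `pnorm`; the value z = 1.96 and the variable names are chosen only for illustration.

```r
# Lower-tail probabilities P(Z <= z) for a standard normal variable
p.neg <- pnorm(-1.96)   # compare with the z <= 0 table (row 1.9, column 0.06)
p.pos <- pnorm(1.96)    # compare with the z >= 0 table (row 1.9, column 0.06)

round(p.neg, 5)
round(p.pos, 5)

# The two tables mirror each other: P(Z <= -z) = 1 - P(Z <= z)
all.equal(p.neg, 1 - p.pos)
```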
Student's t-distribution
The table gives the values of $t_{\alpha;\nu}$ where $\Pr(T_\nu > t_{\alpha;\nu}) = \alpha$, with $\nu$ degrees of freedom.
ν\α 0.1 0.05 0.025 0.01 0.005 0.001 0.0005
1 3.078 6.314 12.706 31.821 63.657 318.310 636.620
2 1.886 2.920 4.303 6.965 9.925 22.326 31.598
3 1.638 2.353 3.182 4.541 5.841 10.213 12.924
4 1.533 2.132 2.776 3.747 4.604 7.173 8.610
5 1.476 2.015 2.571 3.365 4.032 5.893 6.869
6 1.440 1.943 2.447 3.143 3.707 5.208 5.959
7 1.415 1.895 2.365 2.998 3.499 4.785 5.408
8 1.397 1.860 2.306 2.896 3.355 4.501 5.041
9 1.383 1.833 2.262 2.821 3.250 4.297 4.781
10 1.372 1.812 2.228 2.764 3.169 4.144 4.587
11 1.363 1.796 2.201 2.718 3.106 4.025 4.437
12 1.356 1.782 2.179 2.681 3.055 3.930 4.318
13 1.350 1.771 2.160 2.650 3.012 3.852 4.221
14 1.345 1.761 2.145 2.624 2.977 3.787 4.140
15 1.341 1.753 2.131 2.602 2.947 3.733 4.073
16 1.337 1.746 2.120 2.583 2.921 3.686 4.015
17 1.333 1.740 2.110 2.567 2.898 3.646 3.965
18 1.330 1.734 2.101 2.552 2.878 3.610 3.922
19 1.328 1.729 2.093 2.539 2.861 3.579 3.883
20 1.325 1.725 2.086 2.528 2.845 3.552 3.850
21 1.323 1.721 2.080 2.518 2.831 3.527 3.819
22 1.321 1.717 2.074 2.508 2.819 3.505 3.792
23 1.319 1.714 2.069 2.500 2.807 3.485 3.767
24 1.318 1.711 2.064 2.492 2.797 3.467 3.745
25 1.316 1.708 2.060 2.485 2.787 3.450 3.725
26 1.315 1.706 2.056 2.479 2.779 3.435 3.707
27 1.314 1.703 2.052 2.473 2.771 3.421 3.690
28 1.313 1.701 2.048 2.467 2.763 3.408 3.674
29 1.311 1.699 2.045 2.462 2.756 3.396 3.659
30 1.310 1.697 2.042 2.457 2.750 3.385 3.646
40 1.303 1.684 2.021 2.423 2.704 3.307 3.551
60 1.296 1.671 2.000 2.390 2.660 3.232 3.460
120 1.289 1.658 1.980 2.358 2.617 3.160 3.373
∞ 1.282 1.645 1.960 2.326 2.576 3.090 3.291
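The critical values in this table are upper-tail quantiles, so they can be obtained in R with `qt` and `lower.tail = FALSE`. The sketch below is illustrative; the choices α = 0.025 and ν = 10 and the variable names are assumptions made for the example.

```r
# Upper-tail critical value t_{alpha; nu}: Pr(T_nu > t) = alpha
alpha <- 0.025
nu <- 10
t.crit <- qt(alpha, df = nu, lower.tail = FALSE)
round(t.crit, 3)

# The last row (nu = infinity) corresponds to the standard normal quantile
z.crit <- qnorm(alpha, lower.tail = FALSE)
round(z.crit, 3)
```

The first value can be compared with the row ν = 10 and the second with the row ν = ∞, both in the column α = 0.025.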
F-distribution (Upper tail probability = 0.05) Numerator df = 1 to 10
df2\df1 1 2 3 4 5 6 7 8 10
1 161.448 199.500 215.707 224.583 230.162 233.986 236.768 238.883 241.882
2 18.513 19.000 19.164 19.247 19.296 19.330 19.353 19.371 19.396
3 10.128 9.552 9.277 9.117 9.013 8.941 8.887 8.845 8.786
4 7.709 6.944 6.591 6.388 6.256 6.163 6.094 6.041 5.964
5 6.608 5.786 5.409 5.192 5.050 4.950 4.876 4.818 4.735
6 5.987 5.143 4.757 4.534 4.387 4.284 4.207 4.147 4.060
7 5.591 4.737 4.347 4.120 3.972 3.866 3.787 3.726 3.637
8 5.318 4.459 4.066 3.838 3.687 3.581 3.500 3.438 3.347
9 5.117 4.256 3.863 3.633 3.482 3.374 3.293 3.230 3.137
10 4.965 4.103 3.708 3.478 3.326 3.217 3.135 3.072 2.978
11 4.844 3.982 3.587 3.357 3.204 3.095 3.012 2.948 2.854
12 4.747 3.885 3.490 3.259 3.106 2.996 2.913 2.849 2.753
13 4.667 3.806 3.411 3.179 3.025 2.915 2.832 2.767 2.671
14 4.600 3.739 3.344 3.112 2.958 2.848 2.764 2.699 2.602
15 4.543 3.682 3.287 3.056 2.901 2.790 2.707 2.641 2.544
16 4.494 3.634 3.239 3.007 2.852 2.741 2.657 2.591 2.494
17 4.451 3.592 3.197 2.965 2.810 2.699 2.614 2.548 2.450
18 4.414 3.555 3.160 2.928 2.773 2.661 2.577 2.510 2.412
19 4.381 3.522 3.127 2.895 2.740 2.628 2.544 2.477 2.378
20 4.351 3.493 3.098 2.866 2.711 2.599 2.514 2.447 2.348
21 4.325 3.467 3.072 2.840 2.685 2.573 2.488 2.420 2.321
22 4.301 3.443 3.049 2.817 2.661 2.549 2.464 2.397 2.297
23 4.279 3.422 3.028 2.796 2.640 2.528 2.442 2.375 2.275
24 4.260 3.403 3.009 2.776 2.621 2.508 2.423 2.355 2.255
25 4.242 3.385 2.991 2.759 2.603 2.490 2.405 2.337 2.236
26 4.225 3.369 2.975 2.743 2.587 2.474 2.388 2.321 2.220
27 4.210 3.354 2.960 2.728 2.572 2.459 2.373 2.305 2.204
28 4.196 3.340 2.947 2.714 2.558 2.445 2.359 2.291 2.190
29 4.183 3.328 2.934 2.701 2.545 2.432 2.346 2.278 2.177
30 4.171 3.316 2.922 2.690 2.534 2.421 2.334 2.266 2.165
35 4.121 3.267 2.874 2.641 2.485 2.372 2.285 2.217 2.114
40 4.085 3.232 2.839 2.606 2.449 2.336 2.249 2.180 2.077
45 4.057 3.204 2.812 2.579 2.422 2.308 2.221 2.152 2.049
50 4.034 3.183 2.790 2.557 2.400 2.286 2.199 2.130 2.026
55 4.016 3.165 2.773 2.540 2.383 2.269 2.181 2.112 2.008
60 4.001 3.150 2.758 2.525 2.368 2.254 2.167 2.097 1.993
70 3.978 3.128 2.736 2.503 2.346 2.231 2.143 2.074 1.969
80 3.960 3.111 2.719 2.486 2.329 2.214 2.126 2.056 1.951
90 3.947 3.098 2.706 2.473 2.316 2.201 2.113 2.043 1.938
100 3.936 3.087 2.696 2.463 2.305 2.191 2.103 2.032 1.927
110 3.927 3.079 2.687 2.454 2.297 2.182 2.094 2.024 1.918
120 3.920 3.072 2.680 2.447 2.290 2.175 2.087 2.016 1.910
130 3.914 3.066 2.674 2.441 2.284 2.169 2.081 2.010 1.904
140 3.909 3.061 2.669 2.436 2.279 2.164 2.076 2.005 1.899
150 3.904 3.056 2.665 2.432 2.274 2.160 2.071 2.001 1.894
160 3.900 3.053 2.661 2.428 2.271 2.156 2.067 1.997 1.890
180 3.894 3.046 2.655 2.422 2.264 2.149 2.061 1.990 1.884
200 3.888 3.041 2.650 2.417 2.259 2.144 2.056 1.985 1.878
220 3.884 3.037 2.646 2.413 2.255 2.140 2.051 1.981 1.874
240 3.880 3.033 2.642 2.409 2.252 2.136 2.048 1.977 1.870
260 3.877 3.031 2.639 2.406 2.249 2.134 2.045 1.974 1.867
280 3.875 3.028 2.637 2.404 2.246 2.131 2.042 1.972 1.865
300 3.873 3.026 2.635 2.402 2.244 2.129 2.040 1.969 1.862
400 3.865 3.018 2.627 2.394 2.237 2.121 2.032 1.962 1.854
500 3.860 3.014 2.623 2.390 2.232 2.117 2.028 1.957 1.850
600 3.857 3.011 2.620 2.387 2.229 2.114 2.025 1.954 1.846
700 3.855 3.009 2.618 2.385 2.227 2.112 2.023 1.952 1.844
800 3.853 3.007 2.616 2.383 2.225 2.110 2.021 1.950 1.843
900 3.852 3.006 2.615 2.382 2.224 2.109 2.020 1.949 1.841
1000 3.851 3.005 2.614 2.381 2.223 2.108 2.019 1.948 1.840
∞ 3.841 2.996 2.605 2.372 2.214 2.099 2.010 1.938 1.831
F-distribution (Upper tail probability = 0.05) Numerator df = 12 to 40
df2\df1 12 14 16 18 20 24 28 32 36 40
1 243.906 245.364 246.464 247.323 248.013 249.052 249.797 250.357 250.793 251.143
2 19.413 19.424 19.433 19.440 19.446 19.454 19.460 19.464 19.468 19.471
3 8.745 8.715 8.692 8.675 8.660 8.639 8.623 8.611 8.602 8.594
4 5.912 5.873 5.844 5.821 5.803 5.774 5.754 5.739 5.727 5.717
5 4.678 4.636 4.604 4.579 4.558 4.527 4.505 4.488 4.474 4.464
6 4.000 3.956 3.922 3.896 3.874 3.841 3.818 3.800 3.786 3.774
7 3.575 3.529 3.494 3.467 3.445 3.410 3.386 3.367 3.352 3.340
8 3.284 3.237 3.202 3.173 3.150 3.115 3.090 3.070 3.055 3.043
9 3.073 3.025 2.989 2.960 2.936 2.900 2.874 2.854 2.839 2.826
10 2.913 2.865 2.828 2.798 2.774 2.737 2.710 2.690 2.674 2.661
11 2.788 2.739 2.701 2.671 2.646 2.609 2.582 2.561 2.544 2.531
12 2.687 2.637 2.599 2.568 2.544 2.505 2.478 2.456 2.439 2.426
13 2.604 2.554 2.515 2.484 2.459 2.420 2.392 2.370 2.353 2.339
14 2.534 2.484 2.445 2.413 2.388 2.349 2.320 2.298 2.280 2.266
15 2.475 2.424 2.385 2.353 2.328 2.288 2.259 2.236 2.219 2.204
16 2.425 2.373 2.333 2.302 2.276 2.235 2.206 2.183 2.165 2.151
17 2.381 2.329 2.289 2.257 2.230 2.190 2.160 2.137 2.119 2.104
18 2.342 2.290 2.250 2.217 2.191 2.150 2.119 2.096 2.078 2.063
19 2.308 2.256 2.215 2.182 2.155 2.114 2.084 2.060 2.042 2.026
20 2.278 2.225 2.184 2.151 2.124 2.082 2.052 2.028 2.009 1.994
21 2.250 2.197 2.156 2.123 2.096 2.054 2.023 1.999 1.980 1.965
22 2.226 2.173 2.131 2.098 2.071 2.028 1.997 1.973 1.954 1.938
23 2.204 2.150 2.109 2.075 2.048 2.005 1.973 1.949 1.930 1.914
24 2.183 2.130 2.088 2.054 2.027 1.984 1.952 1.927 1.908 1.892
25 2.165 2.111 2.069 2.035 2.007 1.964 1.932 1.908 1.888 1.872
26 2.148 2.094 2.052 2.018 1.990 1.946 1.914 1.889 1.869 1.853
27 2.132 2.078 2.036 2.002 1.974 1.930 1.898 1.872 1.852 1.836
28 2.118 2.064 2.021 1.987 1.959 1.915 1.882 1.857 1.837 1.820
29 2.104 2.050 2.007 1.973 1.945 1.901 1.868 1.842 1.822 1.806
30 2.092 2.037 1.995 1.960 1.932 1.887 1.854 1.829 1.808 1.792
35 2.041 1.986 1.942 1.907 1.878 1.833 1.799 1.773 1.752 1.735
40 2.003 1.948 1.904 1.868 1.839 1.793 1.759 1.732 1.710 1.693
45 1.974 1.918 1.874 1.838 1.808 1.762 1.727 1.700 1.678 1.660
50 1.952 1.895 1.850 1.814 1.784 1.737 1.702 1.674 1.652 1.634
55 1.933 1.876 1.831 1.795 1.764 1.717 1.681 1.653 1.631 1.612
60 1.917 1.860 1.815 1.778 1.748 1.700 1.664 1.636 1.613 1.594
70 1.893 1.836 1.790 1.753 1.722 1.674 1.637 1.608 1.585 1.566
80 1.875 1.817 1.772 1.734 1.703 1.654 1.617 1.588 1.564 1.545
90 1.861 1.803 1.757 1.720 1.688 1.639 1.601 1.572 1.548 1.528
100 1.850 1.792 1.746 1.708 1.676 1.627 1.589 1.559 1.535 1.515
110 1.841 1.783 1.736 1.698 1.667 1.617 1.579 1.549 1.524 1.504
120 1.834 1.775 1.728 1.690 1.659 1.608 1.570 1.540 1.516 1.495
130 1.827 1.769 1.722 1.684 1.652 1.601 1.563 1.533 1.508 1.488
140 1.822 1.763 1.716 1.678 1.646 1.595 1.557 1.526 1.502 1.481
150 1.817 1.758 1.711 1.673 1.641 1.590 1.552 1.521 1.496 1.475
160 1.813 1.754 1.707 1.669 1.637 1.586 1.547 1.516 1.491 1.470
180 1.806 1.747 1.700 1.661 1.629 1.578 1.539 1.508 1.483 1.462
200 1.801 1.742 1.694 1.656 1.623 1.572 1.533 1.502 1.476 1.455
220 1.796 1.737 1.690 1.651 1.618 1.567 1.528 1.496 1.471 1.450
240 1.793 1.733 1.686 1.647 1.614 1.563 1.523 1.492 1.466 1.445
260 1.790 1.730 1.683 1.644 1.611 1.559 1.520 1.488 1.463 1.441
280 1.787 1.727 1.680 1.641 1.608 1.556 1.517 1.485 1.459 1.438
300 1.785 1.725 1.677 1.638 1.606 1.554 1.514 1.482 1.456 1.435
400 1.776 1.717 1.669 1.630 1.597 1.545 1.505 1.473 1.447 1.425
500 1.772 1.712 1.664 1.625 1.592 1.539 1.499 1.467 1.441 1.419
600 1.768 1.708 1.660 1.621 1.588 1.536 1.495 1.463 1.437 1.414
700 1.766 1.706 1.658 1.619 1.586 1.533 1.492 1.460 1.434 1.412
800 1.764 1.704 1.656 1.617 1.584 1.531 1.490 1.458 1.432 1.409
900 1.763 1.703 1.655 1.615 1.582 1.529 1.489 1.457 1.430 1.408
1000 1.762 1.702 1.654 1.614 1.581 1.528 1.488 1.455 1.429 1.406
∞ 1.752 1.692 1.644 1.604 1.571 1.517 1.476 1.444 1.417 1.394
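The F tables above list upper 5% critical values, which can be computed in R with `qf`. The following is a minimal sketch; the helper `f.crit` is a hypothetical convenience function written for this example, not part of R or of the text.

```r
# Upper 5% critical value of F with df1 (numerator) and df2 (denominator)
f.crit <- function(df1, df2, alpha = 0.05) {
  qf(alpha, df1 = df1, df2 = df2, lower.tail = FALSE)
}

round(f.crit(3, 10), 3)    # compare with row df2 = 10, column df1 = 3
round(f.crit(12, 20), 3)   # compare with row df2 = 20, column df1 = 12
```

Calling `qf` directly in this way also covers degrees of freedom that the tables do not list.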
[Tables: Cumulative Binomial Distribution - 1 to 8]
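Cumulative binomial probabilities can be computed directly in R with `pbinom`. The sketch below is illustrative only: the values n = 10, p = 0.5, and x = 3 are assumptions chosen for the example and are not taken from the tables.

```r
# Cumulative binomial probability P(X <= x) for X ~ Binomial(n, p)
n <- 10     # number of trials (illustrative)
p <- 0.5    # success probability (illustrative)
x <- 3

p.cum <- pbinom(x, size = n, prob = p)
p.cum

# Same value as the sum of the individual probabilities P(X = 0), ..., P(X = x)
p.sum <- sum(dbinom(0:x, size = n, prob = p))
p.sum
```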
Author’s Biography
MUSTAPHA AKINKUNMI
Dr. Mustapha Akinkunmi is a Financial Economist and Technology Strategist. He has over 25 years
of experience in estimation, planning, and forecasting using statistical and econometric methods,
with particular expertise in risk, expected utility, discounting, binomial-tree valuation methods,
financial econometrics models, Monte Carlo simulations, macroeconomics, and exchange rate modeling.
Dr. Akinkunmi has performed extensive software development for quantitative analysis of capital
markets, revenue and payment gateway, predictive analytics, data science, and credit risk management.
He has a record of success in identifying and implementing change management programs and
institutional development initiatives in both public and private sector organizations. He has been
in high profile positions as a Consultant, Financial Advisor, Project Manager, and Business
Strategist to AT&T, Salomon Brothers, Goldman Sachs, Phibro Energy, First Boston (Credit Suisse
First Boston), World Bank, and Central Bank of Nigeria. He is an internationally recognized
co-author (Introduction to Strategic Financial Management, May 2013) and leader in demand analysis,
specializing in working with very large databases. Furthermore, he has conducted teaching and
applied research in areas that include analyses of expenditure patterns, inflation and exchange
rate modeling for Manhattan College, Riverdale, NY, Fordham University, New York, NY, University
of Lagos, Lagos, Nigeria, State University of New York-FIT, New York, NY, Montclair State
University, Montclair, NJ, and American University, Yola, Nigeria.
In 1990, he founded Technology Solutions Incorporated (TSI) in New York, which focused on data
science and software application development for clients including major financial services
institutions. After ten years of successful operations and rapid growth under Dr. Akinkunmi's
leadership, TSI was acquired by a publicly traded technology company based in the U.S. in a
value-creating transaction. Dr. Akinkunmi was the former Honorable Commissioner for Finance,
Lagos State, Nigeria. He is now an Associate Professor of Finance and Chair of the Accounting
and Finance Department at the American University of Nigeria, Yola, Nigeria.