Statistics For Decision-Making 2024
2
The Economist article. . .
3
DATA are everywhere these days; the problem is
making sense of them. That is the role of
statistics, the university course that so many
people dodge or forget. Charles Wheelan, a
professor at Dartmouth College (and a former
Chicago correspondent for The Economist), does
something unique here: he makes statistics
interesting and fun. His book strips the subject of
its complexity to expose the sexy stuff underneath.
4
But. . .
“I keep saying the sexy job in the next ten years
will be statisticians. People think I’m joking, but
who would’ve guessed that computer engineers
would’ve been the sexy job of the 1990s?” - Hal
Varian, Google’s Chief Economist in
The McKinsey Quarterly, January 2009.
5
6
In print. . .
• In 2014, LinkedIn reported the skill of
“statistical analysis” as the number one
hottest skill that resulted in a job hire.
7
In print. . .
9
So what is Statistics?
• Statistics is the science of learning from data, and
of measuring, controlling and communicating
uncertainty.
• Statistics is the science of drawing conclusions
from data with the aid of the mathematics of
probability.
• Statistics is the explanation of variation in the
context of what remains unexplained.
• Statistics is a collection of procedures and
principles for gaining information in order to
make decisions when faced with uncertainty.
10
Four Key Elements Of Statistics
• These definitions highlight four key elements
of statistics.
• Data – the raw material
• Information – the goal
• Uncertainty – the context
• Probability – the tool
11
A maze or toolbox?
12
13
14
Statistical Decision-making
• A defining business trend in the Digital Age has been
the growth in the volume and the use of quantitative
data.
• Increasingly, decisions once based on management
intuition and experience now rely on empirical
evidence drawn from statistical data.
• As the volume of data sets grows larger, the term "big
data" has become a buzzword.
• Statistical evidence can inform business leaders about
how their companies perform, the effectiveness of
their business operations and information about their
customers.
15
Quote
• In God we trust. The rest should bring data.
16
Data
• Data refers to the individual pieces of information we
collect about an entity, e.g. firm size, annual revenues,
share prices, etc.
• Data is the raw material of knowledge.
• We process data to obtain information and, ultimately, knowledge.
17
For example. . .
18
Where does data come from?
• Data comes from the entities of interest to us
as researchers, managers, observers.
• We might be interested in the entirety of the
observations available to us.
• For reasons of cost and time, we can only
collect data from a subset of the observations
available.
• This requires that we select carefully from these
observations to avoid skewing the results.
19
Classification of Data for Analysis
• Three basic data classifications exist for analysis.
• These are cross-sectional data, time series data, and panel
data.
• Cross-sectional data is taken at a point in time, e.g. how
does income affect consumption? What determines your
wage?
• Time series data is taken over time and can be regular or
irregular, e.g. what is the performance of UNL’s stock price
over the last 5 years?
• Panel data is cross-sectional data over time. It reveals the
changes in a set of related variables over time, e.g. the GLSS.
20
Some Terminology
• Population: All individuals, objects, firms or
measurements whose properties are being
studied.
• Sample: A subset of the population studied.
• Representative Sample: A subset of the
population that has the same characteristics
as the population.
• Variable: A characteristic of interest for each
person or object in a population.
21
Some Terminology
• Numerical Variable: Variables that take on values that are
indicated by numbers.
• Categorical Variable: Variables that take on values that are
names or labels.
• Parameter: A number that is used to represent a population
characteristic and that generally cannot be determined easily.
• Statistic: A numerical characteristic of the sample; a statistic
estimates the corresponding population parameter.
• Proportion: The number of successes divided by the total
number in the sample.
• Probability: A number between zero and one, inclusive, that
gives the likelihood that a specific event will occur.
22
Back to Data. . .
• A statistical analysis starts with a set of data. We
construct a set of data by first deciding what cases or
units we want to study.
• For each case, we record information about
characteristics that we call variables.
• Data is a set of observations (a set of possible outcomes).
• Data can be put into two groups: qualitative (an
attribute whose value is indicated by a label) or
quantitative (an attribute whose value is indicated by a
number).
• Quantitative data can be separated into two subgroups:
discrete and continuous.
23
Back to Data
• Data is discrete if it is the result of counting
(such as the number of students of a given
ethnic group in a class or the number of books
on a shelf).
• Data is continuous if it is the result of
measuring (such as distance traveled or
weight of luggage).
24
Sources of Data
• Anecdotal data come from stories or reports about cases
that do not necessarily represent a larger group of cases.
• Available data are data that were produced for some
other purpose but that may help answer a question of
interest.
• A sample survey collects data from a sample of cases that
represent some larger population of cases.
• A census collects data from all cases in the population of
interest.
• In an experiment, a treatment is imposed and the
responses are recorded.
25
Types of Data
• There are three types of data for our statistical
work.
• Cross-Sectional:
– A set of data values observed at a fixed point in
time (e.g. bank data about its loan customers)
– The wage equation
• Time-Series:
– a set of consecutive data values observed at
successive points in time (e.g. stock price on a
daily basis for a year)
26
Types of Data
• Panel/Longitudinal Data
• When we collect cross-sectional data across
time, we form panel (longitudinal) data.

Sales (in $1000s)
             2009   2010   2011   2012
Accra         435    460    475    490
Ho            320    345    375    395
Cape Coast    405    390    410    395
Koforidua     260    270    285    280

(Each row, a city followed over the years, is a time series;
each column, all cities in a single year, is a cross section.)
27
Levels of Measurement of Data
• Statisticians use different types of variables to
describe the characteristics of a population.
• Usually a more detailed distinction called the
levels of measurement is used when examining
the information that is collected for a variable.
• Nominal
• Ordinal
• Interval
• Ratio
28
Levels of Measurement of Data
• A nominal measurement is one in which the values of the variable
are names.
• Nominal data are considered the lowest or weakest type of data,
since numerical identification is chosen strictly for convenience and
does not imply ranking of responses.
• The values of nominal variables are words that describe the
categories or classes of responses.
• The values of the gender variable are male and female; the values of
“Do you own a car?” are yes and no.
• We arbitrarily assign a code or number to each response. However,
this number has no meaning other than for categorizing.
• For example, we could code gender responses as 1 = Male, 2 = Female,
and yes/no responses as 1 = Yes, 2 = No.
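The coding step above can be sketched in pandas; the column names and responses here are invented for illustration, not data from the text.

```python
import pandas as pd

# Hypothetical survey responses (illustrative only)
responses = pd.DataFrame({
    "gender":   ["Male", "Female", "Female", "Male"],
    "owns_car": ["Yes", "No", "Yes", "Yes"],
})

# Assign arbitrary numeric codes; the numbers only label categories
# and carry no order or magnitude.
responses["gender_code"] = responses["gender"].map({"Male": 1, "Female": 2})
responses["owns_car_code"] = responses["owns_car"].map({"Yes": 1, "No": 2})
print(responses)
```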
29
Levels of Measurement of Data
• Ordinal data indicate the rank ordering of items, and similar to nominal
data the values are words that describe responses.
• Some examples of ordinal data and possible codes are as follows:
• 1. Product quality rating (1: poor; 2: average; 3: good)
• 2. Satisfaction rating with your current Internet provider (1: very
dissatisfied; 2: moderately dissatisfied; 3: no opinion; 4: moderately
satisfied; 5: very satisfied)
• 3. Consumer preference among three different types of soft drink (1:
most preferred; 2: second choice; 3: third choice)
• In these examples the responses are ordinal, or put into a rank order, but
there is no measurable meaning to the “difference” between responses.
• That is, the difference between your first and second choices may not be
the same as the difference between your second and third choices.
30
Levels of Measurement of Data
• Interval and ratio levels of measurement refer to data obtained
from numerical variables, and meaning is given to the difference
between measurements.
• An interval scale indicates rank and distance from an arbitrary zero
measured in unit intervals.
• That is, data are provided relative to an arbitrarily determined
benchmark.
• Temperature is a classic example of this level of measurement,
with arbitrarily determined benchmarks generally based on either
Fahrenheit or Celsius degrees.
• Suppose that it is 50°C in Ouagadougou and only 25°C in Koforidua.
We can conclude that the difference in temperature is 25°C, but
we cannot say that it is twice as warm in Ouagadougou as it is
in Koforidua.
31
Levels of Measurement of Data
• Ratio data indicate both rank and distance
from a natural zero, with ratios of two
measures having meaning.
• A person who weighs 200 pounds is twice the
weight of a person who weighs 100 pounds; a
person who is 40 years old is twice the age of
someone who is 20 years old.
32
Schematically
33
Exercise
• For each of the following samples, state what type of data has been
collected (i.e. nominal, ordinal, interval, ratio, discrete and/or
continuous):
1. The gender of students in a class.
2. The height in millimeters of students in a class.
3. The number of siblings (i.e. brothers and/or sisters) for each individual in a
class.
4. The birth order (i.e. first born, second born) of each individual in a class.
5. The distance that each individual in a class travels to get to college.
6. The type of degree (e.g. BSc, BEng, BA) that each individual in a class is
studying
34
Statistics. . .
35
Sampling
• A sample should have the same characteristics
as the population it is representing.
• Statisticians use various methods of random
sampling in an attempt to achieve this goal.
• There are several different methods of random
sampling.
• In each form of random sampling, each
member of a population initially has an equal
chance of being selected for the sample.
36
Random Sampling Methods
• Simple random sample
• Stratified sample
• Cluster sample
• Systematic sample
• Convenience sample
• Voluntary Response Sample
37
Simple random sample
• It is the easiest method to describe.
• Any group of n individuals is as likely to be
chosen as any other group of n individuals
when simple random sampling is used. In
other words, each sample of the same
size has an equal chance of being selected.
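A minimal sketch of simple random sampling with Python's standard library; the population of 500 numbered records is a made-up illustration.

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

# Hypothetical population: 500 numbered records
population = list(range(1, 501))

# Every subset of size 20 is equally likely to be drawn
sample = random.sample(population, k=20)
print(sorted(sample))
```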
38
Stratified Sampling
• To choose a stratified sample, we divide the
population into groups called strata and then
take a proportionate number from each
stratum.
39
Cluster sample
• To choose a cluster sample, divide the population into
clusters (groups) and then randomly select some of the
clusters.
• All the members from these clusters are in the cluster
sample.
• For example, divide your college faculty by department;
the departments are the clusters. Number each department,
and then choose four different numbers using simple random
sampling. All members of the four departments with those
numbers form the cluster sample.
40
Systematic sample
• To choose a systematic sample, randomly select
a starting point and take every nth piece of data
from a listing of the population. For example,
suppose you have to do a phone survey. Your
phone book contains 20,000 residence listings
and you must choose 400 names for the sample.
Number the population 1–20,000 and use a
simple random sample to pick a number between
1 and 50 that represents the first name in the
sample; then take every 50th name thereafter
(since 20,000 ÷ 400 = 50).
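The phone-book example can be sketched as follows (the interval is 20,000 ÷ 400 = 50, so after a random start we take every 50th listing):

```python
import random

random.seed(1)

N = 20_000  # residence listings in the phone book
n = 400     # required sample size
k = N // n  # sampling interval: every 50th listing

start = random.randint(1, k)               # random start within the first interval
sample_ids = list(range(start, N + 1, k))  # then every kth listing
print(len(sample_ids))  # 400
```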
41
Convenience sampling
• A type of sampling that is non-random is convenience
sampling.
• Convenience sampling involves using results that are
readily available.
• For example, a computer software store conducts a
marketing study by interviewing potential customers who
happen to be in the store browsing through the available
software.
• The results of convenience sampling may be very good in
some cases and highly biased (favor certain outcomes) in
others.
42
Voluntary Response Sample
• A voluntary response sample consists of
people who choose themselves by responding
to a general appeal.
• Voluntary response samples are biased
because people with strong opinions,
especially negative opinions, are most likely to
respond.
43
44
Careful!
• Sampling data should be done very carefully.
• Collecting data carelessly can have devastating
results.
• Surveys sent to households and then returned
may be very biased (they may favor a certain
group).
• It is better for the person conducting the
survey to select the sample respondents.
45
What can go wrong?
• Errors can occur during the sampling process.
• Sampling error can include both systematic
sampling error and random sampling error.
• Systematic sampling error is the fault of the
investigator, but random sampling error is not.
• When errors are systematic, they bias the sample in
one direction.
• Under these circumstances, the sample does not
truly represent the population of interest.
• Systematic error occurs when the sample is not
drawn properly.
46
What can go wrong?
• Systematic error can also occur if names are dropped
from the sample list because some individuals were
difficult to locate or uncooperative.
• Random sampling error, as contrasted to systematic
sampling error, is often referred to as chance error.
• Purely by chance, samples drawn from the same
population will rarely provide identical estimates of
the population parameter of interest.
• These estimates will vary from sample to sample.
47
Problems
• Sampling error can affect inferences based on
sampling in two important situations.
• In one situation, we may wish to generalize from
the sample to a particular population.
• With a small sampling error, we can feel more
confident that our sample is representative of the
population.
• We can therefore feel reasonably comfortable
about generalizing from the sample to the
population. Survey research is most concerned
about this kind of sampling error.
48
Things to consider. . .
• The second situation in which sampling error plays a
role is when we wish to determine whether two or
more samples were drawn from the same or different
populations.
• In this case, we are asking if two or more samples are
sufficiently different to rule out factors due to chance.
• An example of this situation is when we ask the
question “Did the group that received the experimental
treatment really differ from the group that did not
receive the treatment, other than on the basis of
chance?”
49
Descriptive Statistics
50
• Descriptive statistics are very important because,
if we simply presented our raw data, it would be hard to
visualize what the data were showing, especially if there
were a lot of them.
• Descriptive statistics therefore enable us to present
the data in a more meaningful way, which allows
simpler interpretation of the data.
• For example, if we had the results of 100 pieces of
students' coursework, we may be interested in the
overall performance of those students.
• We would also be interested in the distribution or
spread of the marks.
• Descriptive statistics allow us to do this.
51
• We often do not have access to the whole population
for investigation.
• Usually, we have only a limited number of data
instead.
• For example, you might be interested in the exam
marks of all students in Ghana.
• It is not feasible to measure the exam marks of all
students in the whole country, so you measure a
smaller sample of students (e.g., 100 students),
which is used to represent the larger population of
all Ghanaian students.
• Properties of samples, such as the mean or standard
deviation, are not called parameters, but statistics.
52
Descriptive Statistics
• Typically, there are two general types of statistic
that are used to describe data.
• Measures of central tendency: these are ways of
describing the central position of a frequency
distribution for a group of data.
• In this case, the frequency distribution is simply
the distribution and pattern of marks scored by the
100 students from the lowest to the highest.
• We can describe this central position using a
number of statistics, including the mode, median,
and mean.
53
Measures of Spread/Dispersion
• These are ways of summarizing a group of data by
describing how spread out the scores are.
• For example, the mean score of our 100 students may be
65 out of 100. However, not all students will have scored
65 marks.
• Rather, their scores will be spread out.
• Some will be lower and others higher.
• Measures of spread help us to summarize how spread
out these scores are.
• To describe this spread, a number of statistics are
available to us, including the range, quartiles, absolute
deviation, variance and standard deviation.
54
• For example, if you were only interested in the
exam marks of 100 students, the 100 students
would represent your population.
• Descriptive statistics are applied to
populations, and the properties of
populations, like the mean or standard
deviation, are called parameters as they
represent the whole population (i.e.,
everybody you are interested in).
55
Inferential Statistics
56
Inferential Statistics Defined
• Inferential statistics are techniques that allow
us to use samples to make generalizations
about the populations from which the
samples were drawn.
• It is, therefore, important that the sample
accurately represents the population.
• The process of getting the sample is called
sampling.
57
• Inferential statistics arise out of the fact that
sampling naturally incurs sampling error and
thus a sample is not expected to perfectly
represent the population.
• The methods of inferential statistics are
• (1) the estimation of parameter(s) and
• (2) testing of statistical hypotheses.
58
Measures Of Central Tendency
• We move on from visualization techniques to numerical
measures that can be used to quantitatively summarize data.
• We will first describe the three measures and then discuss the
circumstances in which each should be used.
59
Arithmetic Mean
• The arithmetic mean (or simply mean) of a set of data is
the sum of the data values divided by the number of
observations.
• If the data set is the entire population of data, then the
population mean, μ, is a parameter given by
  μ = (x₁ + x₂ + … + x_N) / N = (Σ xᵢ) / N
• For a sample of n observations, the sample mean is
  x̄ = (Σ xᵢ) / n
62
Example 1
• Suppose we record the weights of a sample of
tourists:
• S = {96, 103, 121, 114, 98, 111, 107, 289, 115, 101,
114, 100}, where S denotes the sample.
• Compute the following in Python:
• mean
• median
• mode
• Is there an unusual value in the dataset?
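A sketch of the requested computations, using Python's standard statistics module:

```python
import statistics as st

# Tourist weights from the example
S = [96, 103, 121, 114, 98, 111, 107, 289, 115, 101, 114, 100]

print(st.mean(S))    # about 122.42; pulled upward by one unusual value
print(st.median(S))  # 109.0
print(st.mode(S))    # 114
print(max(S))        # 289, far from the rest of the data: a likely outlier
```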
63
Measure of Central Tendency to Use?
• For continuous or discrete data, the mode is rarely used.
• The use of the mean or median depends upon whether our data
distribution is symmetric or skewed, and whether there are any
outliers.
• The mean, median and mode will all have approximately the same value
if the data are symmetrically distributed.
• If the skew is negative (i.e. the left tail of the distribution is longer than
the right tail), then the mode will be larger than the median, which in
turn will be larger than the mean.
• The converse is true for positively skewed distributions.
• Outliers are data points that are very different from the others in the data
set being analyzed.
• It is important to detect them as they may be due to errors in data
gathering (e.g. a height entered in meters rather than centimeters).
• Outliers should not be removed without there being a good reason to do
so.
64
Outliers
• An outlier is an extreme point that doesn’t really ‘lie’ with the rest
of the data.
• Consider a sample of the litres of water drunk by ten friends in a
month:
• 9, 37, 39, 39, 43, 44, 48, 48, 48, 89
• It’s clear that there seem to be two outliers: the person who
drank only 9 litres and the person who drank a whopping 89
litres.
• However, if someone asked you why these points are outliers,
how would you respond? “Because they are really big or really
small?”
• As in any field of analysis, it’s important to quantify exactly how
we make these decisions.
• We will use two methods: the z-score and Box-Plot methods later.
65
Exercise 1: Demand for Bottled Water
• The demand for bottled water increases
during the harmattan season in Ghana.
• The number of 1-gallon bottles of water sold
for a random sample of n = 12 hours in one
store during the harmattan season is:
60 84 65 67 75 72
80 85 63 82 70 75
• Describe the central tendency of the data.
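One way to answer the exercise in Python, again using the standard statistics module:

```python
import statistics as st

# 1-gallon bottles sold per hour, n = 12
sales = [60, 84, 65, 67, 75, 72, 80, 85, 63, 82, 70, 75]

print(st.mean(sales))    # about 73.17
print(st.median(sales))  # 73.5
print(st.mode(sales))    # 75
# Mean and median are close, suggesting a roughly symmetric distribution.
```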
66
Measures of Dispersion/Variation
• Measures of central tendency only summarize the
typical or average value of the data, and they provide
no information on its spread, variation or dispersion.
• There are two main measures to summarize the
spread of data, which are the standard deviation and
the interquartile range (IQR).
67
Standard Deviation
• The sample standard deviation s is defined by
  s = √[ Σ (xᵢ − x̄)² / (n − 1) ]
• where, as before, n is the sample size, xᵢ are the
individual sample values, and x̄ is the sample mean.
68
• Note the following points about the standard
deviation:
• It has the same units as the data, for example,
calculating s for our height data would result in a
value in centimeters.
• It is always positive.
• It requires calculation of the mean of the data, x̄.
• The division is by (n − 1), not n. This makes the
value of s a better estimate of the population
standard deviation.
• The variance is the standard deviation squared,
that is, s².
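The points above can be checked in Python: statistics.stdev divides by (n − 1), while statistics.pstdev divides by n, so the sample value is always slightly larger.

```python
import statistics as st

# Litres of water drunk by ten friends (data from the outlier example)
data = [9, 37, 39, 39, 43, 44, 48, 48, 48, 89]

s = st.stdev(data)       # sample standard deviation: divides by (n - 1)
sigma = st.pstdev(data)  # population standard deviation: divides by n
print(s, sigma)          # s is slightly larger than sigma

print(st.variance(data))  # the variance is s squared
```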
69
Interquartile Range
• To calculate the interquartile range, we need to calculate
the values of the upper and lower quartiles of our data.
• The concept of a quartile is related to the concept of the
median, as explained below:
• The median is the data value that has 50% of the values
above it and 50% of values below.
• The upper quartile is the data value that has 25% of values
above it and 75% of values below.
• The lower quartile is the data value that has 75% of values
above it and 25% of values below.
• The interquartile range (IQR) is then calculated as
IQR = upper quartile−lower quartile.
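A sketch with NumPy (note that software packages use slightly different conventions for computing quartiles; numpy's default is linear interpolation). The 1.5 × IQR fence shown is the box-plot rule for flagging outliers mentioned earlier:

```python
import numpy as np

# Litres of water drunk by ten friends (from the outlier example)
data = [9, 37, 39, 39, 43, 44, 48, 48, 48, 89]

q1, q3 = np.percentile(data, [25, 75])  # lower and upper quartiles
iqr = q3 - q1
print(q1, q3, iqr)  # 39.0 48.0 9.0

# Box-plot rule: flag points beyond 1.5 * IQR from the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(outliers)  # [9, 89]
```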
70
Quartiles
71
Illustration
72
Range
• As well as the IQR, the overall range of the
data is also regularly reported as a measure of
variation.
• The range is simply
• range = maximum value − minimum value.
73
Which Measure of Variation to Use?
• The answer to this question is similar to the
answer to the question of when to use the
mean or median:
• Use the mean and standard deviation if your
data distribution is symmetric with no outliers.
• Use the median and IQR if your data
distribution is skewed or has outliers.
74
Skewness
• Skewness is basically a measure of asymmetry, and
the easiest way to explain it is by drawing some
pictures.
• If the data tend to have a lot of extremely small values
(i.e., the lower tail is “longer” than the upper tail) and
not so many extremely large values, then
we say that the data are negatively skewed.
• On the other hand, if there are more extremely large
values than extremely small ones, we say
that the data are positively skewed.
• That’s the qualitative idea behind skewness.
75
76
Symmetrical = No skew
77
Left skew
78
Right skew
79
80
81
• The skewness of a data set can be computed as
  skewness = [ (1/n) Σ (xᵢ − x̄)³ ] / s³
  where s is the standard deviation computed with an n
  (rather than n − 1) divisor; this is the default
  definition used by scipy’s skew function.
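The formula can be checked against scipy's implementation; with its default settings, scipy.stats.skew computes exactly this ratio of moments:

```python
import numpy as np
from scipy.stats import skew

data = np.array([9, 37, 39, 39, 43, 44, 48, 48, 48, 89])

# Manual computation: third central moment over the 1.5 power of the second
n = len(data)
xbar = data.mean()
m2 = ((data - xbar) ** 2).mean()
m3 = ((data - xbar) ** 3).mean()
g1 = m3 / m2 ** 1.5

print(g1, skew(data))  # the two values agree; positive, as 89 stretches the right tail
```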
82
Kurtosis
• Kurtosis is a measure of the “tailedness”, or
outlier character, of the data.
• In other words, kurtosis is a statistical measure
that defines how heavily the tails of a
distribution differ from the tails of a normal
distribution.
• The normal distribution is taken as the
standard.
• It has a kurtosis measure of 3. Anything above
or below it is said to deviate from normality.
83
• The value for kurtosis of a normal distribution
is 3, and the shape is referred to as
mesokurtic. The excess kurtosis (kurtosis
minus 3) of a normal distribution is therefore zero.
84
• For kurtosis < 3, the distribution is said to be
platykurtic.
• So platykurtic describes a distribution where
the center of the curve is lower than that of a
normal distribution, and the tails are lighter, with
fewer values in the tails.
85
• Leptokurtic describes a distribution where the
value for kurtosis is greater than 3, i.e. the
excess kurtosis is greater than zero, so the tails
are heavier than those of a normal distribution.
• So for a leptokurtic distribution, kurtosis > 3, or
equivalently excess kurtosis > 0.
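The 3-versus-0 distinction corresponds to scipy's fisher argument: kurtosis reports excess kurtosis (normal ≈ 0) by default, and the raw measure (normal ≈ 3) with fisher=False. A quick check on simulated normal data:

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
sample = rng.normal(size=100_000)  # large sample from a normal distribution

print(kurtosis(sample))                # excess kurtosis: close to 0
print(kurtosis(sample, fisher=False))  # raw kurtosis: close to 3
```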
86
Formulae
• kurtosis = [ (1/n) Σ (xᵢ − x̄)⁴ ] / s⁴, with s again
computed using an n divisor
• excess kurtosis = kurtosis − 3 (this is what scipy’s
kurtosis function reports by default)
87
Mesokurtic
88
Platykurtic
89
Leptokurtic
90
Python Exercise
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline   # Jupyter magic: display plots inline
from scipy.stats import skew, kurtosis
91
urlfile = 'https://raw.githubusercontent.com/burakaydin/materyaller/gh-pages/ARPASS/dataWBT.csv'
data = pd.read_csv(urlfile)
data.head()
data.columns
data['wage'].head()
wage = data['wage']
92
wage.dropna(inplace=True)
mean = np.mean(wage)
mean
Out[39]: 12703.65552623864
median = np.median(wage)
median
Out[41]: 10800.0
plt.hist(wage)
93
plt.xlabel('Wages($)')
plt.ylabel('Frequency')
94
95
skew(wage)
Out[48]: 4.901206069790337
kurtosis(wage)
Out[49]: 64.0760982749893
96
• https://sievo.com/resources/procurement-analytics-demystified
• https://rfp360.com/procurement-analytics/
97
Association in Statistics
• Statisticians, data scientists, business people, etc. are
interested in the relationships between variables.
• For example, if I want to set up a factory to
manufacture luxury cars in Ghana, I have to know
the income of Ghanaians.
• If the price of such cars amounts to a large multiple
of typical incomes, chances are that these vehicles
will not sell.
• A social scientist will be concerned about the
relationship between a family’s income and their
consumption. . .
98
• Covariance is a term used to describe the degree to
which two random variables are related to each other.
• There are three ways two variables can relate to
each other.
• Positive: in a positive relation, when one variable
increases, the other also increases, e.g. when income
increases, consumption also increases on average.
• Negative: here, when one variable increases, the
other decreases, e.g. the price of a normal good and
the quantity demanded.
• Zero: here there is no discernible relationship between
the variables, e.g. the number of babies born in Accra
and the shoe sizes of their parents!
99
Definition - Covariance
• Covariance gives a measure of the direction
with which two variables vary together.
• A positive value means that there is a direct
relationship: that is, they move in the same
direction.
• A negative value means there is an inverse
relationship, or they move in opposite
direction.
• A zero value means there is no discernible
relationship between the variables.
100
Example - Covariance
• Let’s consider two variables: how many hours you
spend studying in school and your GPA.
• Hopefully, these have a positive covariance.
• That would mean that the two vary together: the more
you study, the higher the GPA tends to be (and vice
versa).
• An opposite example would be imagining the
covariance between hours exercised and heartbeats
per minute.
• This would likely be negative, since usually people are
in better shape when they exercise more and thus their
hearts tend to pump less per minute.
101
Formulae - Covariance
• The covariance between two variables x and y is given by:
  cov(x, y) = Σ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)
• In pandas, assuming a DataFrame df with Income and
Consumption columns:
  Income = df['Income']
  Consumption = df['Consumption']
  Income.cov(Consumption)   # sample covariance
103
• Another important reminder for covariance is
that it gives association, NOT causation.
• In the example above, we can’t say that studying
harder caused a higher GPA (notice how we use
the word “tends”), only that when you study a
lot you also seem to get a better GPA.
• Without actually performing an experiment, we
can only say that these things tend to vary
together, but we can never say if one actually
causes the other without performing a
controlled experiment.
104
Correlation
• Correlation is the more interesting and perhaps the
more important of the two measures.
• Instead of only indicating direction, correlation
measures direction and strength of a linear
relationship.
• The sign gives direction, the magnitude strength.
• Correlation is always between −1 and 1.
• A correlation of 0 means there is no linear relationship
between two variables.
• As the correlation moves from 0 to −1 or 1, the
relationship gets stronger and stronger, culminating in
a perfect relationship at either endpoint (−1 or 1).
105
Formulae - Correlation
• The correlation between a pair of variables, x
and y, is given by
  r = cov(x, y) / (sₓ s_y)
  where sₓ and s_y are the sample standard
  deviations of x and y.
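A small numerical sketch; the study-hours and GPA figures below are invented for illustration. It also verifies that correlation is just covariance rescaled by the two standard deviations:

```python
import numpy as np

# Hypothetical data: weekly study hours and GPA for eight students
hours = np.array([2, 5, 8, 10, 12, 15, 18, 20])
gpa   = np.array([2.1, 2.4, 2.8, 3.0, 3.1, 3.4, 3.6, 3.9])

cov_xy = np.cov(hours, gpa)[0, 1]    # sample covariance (n - 1 divisor)
r = np.corrcoef(hours, gpa)[0, 1]    # correlation coefficient
print(cov_xy, r)

# r equals the covariance divided by the product of the standard deviations
r_manual = cov_xy / (hours.std(ddof=1) * gpa.std(ddof=1))
print(np.isclose(r, r_manual))  # True
```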
106
Symmetrical?
• How can we decide if our data distribution is
skewed or symmetric?
• Often it is clear from looking at a histogram, but
a numerical measure, such as the skewness statistic
defined earlier, can be useful in making this assessment.
109
How Do you Do online Research?
• A study of 552 first-year
college students asked
about their preferences
for online resources.
• One question asked
them to pick their
favorite.
• Here are the results:
110
Solutions
• Since the variable is categorical,
we can convert the counts to
percentages and show them
as a pie chart.
• We can also take the
raw counts and plot them as a
bar graph.
111
Solution
112
Pareto Chart for Cost Analysis
• A bar graph whose categories are ordered from
most frequent to least frequent is called a Pareto
chart.
• Pareto charts are frequently used in quality
control settings.
• There, the purpose is often to identify common
types of defects in a manufactured product.
• Deciding upon strategies for corrective action can
then be based on what would be most effective.
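A minimal Pareto-chart sketch in matplotlib; the defect categories and counts are invented for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe without a display
import matplotlib.pyplot as plt

# Hypothetical defect counts from a quality-control inspection
defects = {"Scratches": 7, "Dents": 3, "Misalignment": 12, "Paint": 5, "Cracks": 2}

# Pareto chart: re-order the categories from most to least frequent
ordered = sorted(defects.items(), key=lambda kv: kv[1], reverse=True)
labels = [name for name, _ in ordered]
counts = [count for _, count in ordered]

plt.bar(labels, counts)
plt.ylabel("Number of defects")
plt.title("Pareto chart of defect types")
plt.savefig("pareto.png")
```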
113
Example
114
Solution
115
Quantitative variables: Histograms
• Quantitative variables often take many values.
• A graph of the distribution is clearer if nearby
values are grouped together.
• The most common graph of the distribution of
a single quantitative variable is a histogram.
116
Calls to a Customer Service Center
• Many businesses operate call centers to serve
customers who want to place an order or
make an inquiry.
• Customers want their requests handled
thoroughly. Businesses want to treat
customers well, but they also want to avoid
wasted time on the phone. They, therefore,
monitor the length of calls and encourage
their representatives to keep calls short.
117
Calls to a Customer Service Center
118
Histogram of Call Duration
119
Shape of a Distribution
• We can describe graphically the shape of the
distribution by a histogram.
• That is, we can visually determine whether data
are evenly spread from its middle or center.
• Sometimes the center of the data divides a graph
of the distribution into two “mirror images,” so
that the portion on one side of the middle is
nearly identical to the portion on the other side.
• Graphs that have this shape are symmetric; those
without this shape are asymmetric, or skewed.
120
Shape of a Distribution
• Symmetry: The shape of a distribution is said to be
symmetric if the observations are balanced, or
approximately evenly distributed, about its center.
• Skewness: A distribution is skewed, or asymmetric, if
the observations are not symmetrically distributed on
either side of the center.
• A skewed-right distribution (sometimes called
positively skewed) has a tail that extends farther to the
right.
• A skewed-left distribution (sometimes called negatively
skewed) has a tail that extends farther to the left.
121
122
Quantitative variables: Stem-and-leaf plots
125
Scatter Plot
• We can prepare a scatter plot by locating one point
for each pair of two variables that represent an
observation in the data set.
• The scatter plot provides a picture of the data,
including the following:
1. The range of each variable
2. The pattern of values over the range
3. A suggestion as to a possible relationship between
the two variables
4. An indication of outliers (extreme points)
126
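A scatter plot of this kind can be sketched with matplotlib; the entrance scores and GPAs below are hypothetical values for illustration, not data from the slides.

```python
import matplotlib
matplotlib.use("Agg")            # draw off-screen; no display needed
import matplotlib.pyplot as plt

# Hypothetical entrance scores and college GPAs, one pair per observation
scores = [52, 60, 65, 70, 74, 80, 85, 90]
gpas = [2.0, 2.4, 2.5, 2.8, 3.0, 3.2, 3.5, 3.8]

fig, ax = plt.subplots()
ax.scatter(scores, gpas)         # one point per (score, GPA) pair
ax.set_xlabel("Entrance score")
ax.set_ylabel("College GPA")
fig.savefig("scatter.png")
```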
Entrance Scores and College GPA
127
Scatter Plot
128
Numerical Summaries of Data
• Earlier on we described data graphically, noting
that different graphs are used for categorical and
numerical variables.
• Going forward, we describe data numerically and
observe that different numerical measures are
used for categorical and numerical data.
• In addition, we will look at measures for grouped
data and measures of the direction and strength
of relationships between two variables.
129
Position of the Median
• With n ordered observations, the median is the value in position (n + 1)/2.
130
Some points to note
• The decision as to whether the mean, median, or mode is the
appropriate measure to describe the central tendency of data is
context specific.
• One factor that influences our choice is the type of data, categorical or
numerical.
• Categorical data are best described by the median or the mode, not
the mean.
• If one person strongly agrees (coded 5) with a particular statement and
another person strongly disagrees (coded 1), is the mean “no opinion”?
• An obvious use of median and mode is by clothing retailers considering
inventory of shoes, shirts, and other such items that are available in
various sizes.
• The size of items sold most often, the mode, is then the one in heaviest
demand.
131
Some points to note
• Numerical data are usually best described by the mean.
• However, we have to consider the presence of
outliers—that is, observations that are unusually
large or unusually small in comparison to the rest of the data.
• The median is not affected by outliers, but the mean is.
• Whenever there are outliers in the data, we first need to
look for possible causes.
• One cause could be simply an error in data entry.
• The mean will be greater if unusually large outliers are
present, and the mean will be less when the data contain
outliers that are unusually small compared to the rest of
the data.
132
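A quick sketch (with made-up salary figures) shows how a single outlier drags the mean while barely moving the median:

```python
import statistics

salaries = [40, 42, 45, 47, 50]                 # in thousands; illustrative only
print(statistics.mean(salaries))                # 44.8
print(statistics.median(salaries))              # 45

salaries_with_outlier = salaries + [400]        # one unusually large value
print(statistics.mean(salaries_with_outlier))   # jumps to 104.0
print(statistics.median(salaries_with_outlier)) # barely moves: 46.0
```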
Five-Number Summary
• The five-number summary refers to the five
descriptive measures:
• minimum, first quartile, median, third quartile,
and maximum.
133
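A minimal sketch with NumPy (the nine data values are made up for illustration):

```python
import numpy as np

data = np.array([2, 4, 4, 5, 7, 8, 9, 11, 12])   # illustrative observations

five_number = {
    "min": int(data.min()),
    "Q1": float(np.percentile(data, 25)),
    "median": float(np.median(data)),
    "Q3": float(np.percentile(data, 75)),
    "max": int(data.max()),
}
print(five_number)
```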
Covariance & Correlation
• Is there a relationship between your income
and your expenditure?
• What about your education and the amount
of money your employer pays you?
• Still, what is the relationship between the
sizes of shoes people wear in Cape Coast and
the amount of rainfall?
• What of the amount of goods a household
demand and the price of goods?
134
• In our examples we see that when our incomes
increase, we tend to increase our consumption.
• In the same way, if incomes decrease, we also
decrease our consumption.
• The more educated we are, the more pay (at
least in principle) we expect.
• The cheaper the goods, the more we buy and
vice versa.
• But we can say that there is no relationship
between shoes size and the amount of rainfall
in Cape Coast.
135
• These are relationships.
• We can describe them using the positive,
negative signs and zero.
• So. . .
• Income and consumption (+)
• Education and wage (+)
• Price and good demanded (-)
• Shoe sizes and amount of rainfall (0)
• Positive and negative relations are illustrated
in the diagrams on next slide.
136
137
138
Typical Example of Negative Covariance
139
Definitions
• Covariance is therefore the relationship
between a pair of variables.
• When one variable X increases and at the same
time Y increases, we say they have a positive
covariance.
• When one variable X increases and at the same
time Y decreases, we say they have a negative
covariance.
• If there is no obvious relationship between the
variables X and Y, we say they have zero
covariance.
140
• Statistically, we use the following formulae to
compute covariance:
• Population covariance: Cov(X, Y) = Σ(xi − μx)(yi − μy) / N
• Sample covariance: sxy = Σ(xi − x̄)(yi − ȳ) / (n − 1)
141
Example: Covariance
• Compute the covariance in the returns (%)
between the returns of Crane Analytics and
Heron Computing as shown below.
Year Crane Analytics Heron Computing
2008 1 3
2009 -2 2
2010 3 4
2011 0 6
2012 3 0
144
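We can check the arithmetic with NumPy; note that `np.cov` uses the sample formula, dividing by n − 1.

```python
import numpy as np

crane = np.array([1, -2, 3, 0, 3])   # Crane Analytics returns (%)
heron = np.array([3, 2, 4, 6, 0])    # Heron Computing returns (%)

# Sample covariance: sum of (x - xbar)(y - ybar), divided by n - 1
cov_xy = np.cov(crane, heron)[0, 1]
print(cov_xy)   # -1.0: the two return series move in opposite directions
```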
Statistical Inference and Estimation
145
Statistical Sampling
• Sampling is the foundation of statistical analysis.
• We use sample data in business analytics applications for
many purposes.
• For example, we might wish to estimate the mean,
variance, or proportion of a very large or unknown
population; provide values for inputs in decision models;
understand customer satisfaction; reach a conclusion as to
which of several sales strategies is more effective; or
understand if a change in a process resulted in an
improvement.
• We discussed sampling methods used to estimate
population parameters, and how we can assess the error
inherent in sampling.
146
Estimating Population Parameters
• Sample data provide the basis for many useful
analyses to support decision making.
• Estimation involves assessing the value of an
unknown population parameter—such as a
population mean, population proportion, or
population variance—using sample data.
147
Estimating Population Parameters
• Estimators are the measures used to estimate
population parameters.
• For example, we use the sample mean x̄ to
estimate a population mean µ.
• The sample variance s² estimates a population
variance σ², and the sample proportion p
estimates a population proportion π.
• A point estimate is a single number derived
from sample data that is used to estimate the
value of a population parameter.
148
Unbiased Estimators
• Statisticians develop many types of estimators,
from a theoretical as well as a practical
perspective.
• It is important that they “truly estimate” the
population parameters they are supposed to
estimate.
• Suppose we perform an experiment in which we
repeatedly sampled from a population and
computed a point estimate for a population
parameter.
149
Unbiased Estimators
• Each individual point estimate will vary from the
population parameter.
• However, we would hope that the long-term
average (expected value) of all possible point
estimates would equal the population parameter.
• If the expected value of an estimator equals the
population parameter it is intended to estimate,
the estimator is said to be unbiased.
• If this is not true, the estimator is called biased
and will not provide correct results.
150
For example. . .
• The population variance is computed by σ² = Σ(xi − µ)² / N.
• The unbiased sample variance divides the sum of squared deviations about x̄ by n − 1 rather than n: s² = Σ(xi − x̄)² / (n − 1).
151
Errors in Point Estimation
• One of the drawbacks of using point estimates
is that they do not provide any indication of
the magnitude of the potential error in the
estimate.
152
Look at this story. . .
• A national newspaper in Ghana reported that,
based on a FWSC survey, university teachers
are the highest-paid workers in Ghana,
with an average salary of GHC150,004.
• Actual averages for two local universities were
less than GHC70,000. What happened?
153
Well. . .
• As reported in a follow-up story, the sample
size was very small and included a large
number of highly paid medical school faculty;
as a result, there was a significant error in the
point estimate that was used.
• When we sample, the estimators we use—
such as a sample mean, sample proportion, or
sample variance — are actually random
variables that are characterized by some
distribution.
154
• By knowing what this distribution is, we can
use probability theory to quantify the
uncertainty associated with the estimator.
• To understand this, we first need to discuss
sampling error and sampling distributions
(again).
155
Sampling Error
• Different samples from the same population
have different characteristics—for example,
variations in the mean, standard deviation,
frequency distribution, and so on.
• Sampling (statistical) error occurs because
samples are only a subset of the total
population.
• Sampling error is inherent in any sampling
process, and although it can be minimized, it
cannot be totally avoided.
156
• Another type of error, called non-sampling
error, occurs when the sample does not
represent the target population adequately.
• This is generally a result of poor sample
design, such as using a convenience sample
when a simple random sample would have
been more appropriate or choosing the wrong
population frame.
• It may also result from inadequate data
reliability.
157
Note carefully. . .
158
Sampling Distribution
• Let’s take a population
and sample it.
• The sample is drawn by
simple random sampling
(why SRS?).
• Suppose what we are
looking for is the average
height.
• Again, suppose sampling is
done with replacement.
• We can then draw multiple
samples, can't we?
159
Sampling Distribution of the Mean
• So if we replace the
sample and take
another sample and
find the mean. . .
• We can go on and on,
finding several samples
and their means.
• Suppose these means
are x̄1, x̄2, x̄3, . . ., x̄k.
160
Sampling Distribution of the Proportion
• We can do the same
thing for other sample
statistics like the
proportion as shown on
the right.
• We are going to have a
sample of proportions
p̂1, p̂2, p̂3, . . ., p̂k.
161
Statistics Have a Distribution
• Each of these statistics
from the sample (mean,
proportion, standard
deviation, etc.) varies
with the samples.
• And because they vary,
they have a distribution.
162
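This idea can be simulated directly; the sketch below (with a made-up population of heights) draws many samples and records each sample mean, showing that the means themselves form a distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(170, 10, 100_000)   # heights in cm, simulated

# Repeatedly draw a simple random sample (with replacement) of size 30
sample_means = [rng.choice(population, size=30).mean() for _ in range(2000)]

# The sample means cluster near the population mean, with spread close
# to sigma / sqrt(n) = 10 / sqrt(30), roughly 1.83
print(round(float(np.mean(sample_means)), 1))
print(round(float(np.std(sample_means)), 2))
```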
163
Probability
164
• Definition
– outcome
• A result of a random experiment that cannot be further
decomposed.
166
• Probability
• A probability is a number between 0 and 1 that we
attach to each element of the sample space.
• Informally, that number simply describes the
chance of that event happening.
• A probability of 1 means that the event will
happen for sure.
• A probability of 0 means that we are talking about
an impossible event.
• Numbers in between represent various degrees
of certainty about the occurrence of the event.
167
Definition of a ‘Distribution’
• A statistical distribution or probability
distribution is a mathematical function that
provides the probabilities of the occurrence of
various possible outcomes in an experiment.
• In plain English. . .
– A distribution is simply a collection of data, or
scores, on a variable. Usually, these scores are
arranged in order from smallest to largest and
then they can be presented graphically (Statistics
in Plain English, Third Edition, 2010)
168
Distribution of Heights
169
When we plot the means. . .
170
• This is called the sampling
distribution of the mean.
• Formally, we define the
sampling distribution of a
statistic as distribution of the
statistic for all possible
samples from the same
population of a given size.
• Like all distributions, this
one will also have its own
properties, like its own mean
and standard deviation.
171
Properties of the Sampling Distribution
• The overall shape of the distribution is
symmetric and approximately normal.
• There are no outliers or other important
deviations from the overall pattern.
• The center of the distribution is very close to
the true population mean.
172
Mathematical Properties
• The mean of the sample
means equals the
population mean, i.e. µx̄ = µ.
• The standard deviation
of the sample mean,
called the standard
error, is s/√n, where s is the
sample standard
deviation and n is the
sample size.
173
Statistical Inference
• Once we have gathered our sample data, we
can try to learn something about the larger
population.
• A statistic is a summary measure of the
sample data used to infer something about
the larger population.
• Prior to sampling, the statistic is called an
estimator and is merely a formula.
174
• For example, the sample mean is given by the formula x̄ = (x1 + x2 + . . . + xn)/n.
176
Towards CLT
• Suppose you are
interested in some
population mean μ.
• You might be interested
in the average income,
average hours of sleep,
or the average number
of children of all
Ghanaians.
177
• We draw a random sample from the
population, collecting the sample
observations X1, X2, . . ., Xn.
178
Discussion
• First, the standard deviation of the sampling
distribution of the mean, called the standard error
of the mean, is computed as:
• Standard error of the mean = σ/√n
• where σ is the standard deviation of the population
from which the individual observations are drawn
and n is the sample size.
• We use the sample standard deviation s if σ is not
known.
• From this formula, we see that as n increases, the
standard error decreases.
179
• This suggests that the estimates of the mean
that we obtain from larger sample sizes
provide greater accuracy in estimating the
true population mean. In other words, larger
sample sizes have less sampling error.
180
181
Example
• Suppose the variance of a population is 8.33.
• Compute the standard error of the mean for
each of the sample sizes:
• 10, 25, 100, 500.
182
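A sketch of the computation, with σ = √8.33 ≈ 2.886:

```python
import math

variance = 8.33
sigma = math.sqrt(variance)          # population standard deviation, ~2.886

for n in (10, 25, 100, 500):
    se = sigma / math.sqrt(n)        # standard error shrinks as n grows
    print(n, round(se, 3))
```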
Second Result
• The second result is called the central limit
theorem, one of the most important practical
results in statistics that makes systematic
inference possible.
• The central limit theorem states that if the
sample size is large enough, the sampling
distribution of the mean is approximately
normally distributed, regardless of the
distribution of the population and that the mean
of the sampling distribution will be the same as
that of the population.
183
184
Normal Distribution
185
• Whenever the variates
eg. heights of students,
income of farmers,
number of years people
have worked with us,
etc. follow the nicely
drawn curve on the
right, we say it has a
normal or Gaussian
distribution.
• The normal distribution
is symmetrical about the
line which goes through
the middle.
186
• The points of
inflection from the
mean give you the
standard deviation.
• The second point
of inflection give
the second
standard deviation
from the mean and
so on.
187
• All distributions denote
probabilities.
• The maximum
probability under any
distribution is one.
• In the case of the
normal distribution too,
the area under the curve
(pdf) is one.
188
Formally. . .
• Characteristics of a Normal Curve are. . .
• 1. All normal curves are bell-shaped with
points of inflection at µ − σ and µ + σ.
• 2. All normal curves are symmetric about the
mean µ.
189
• The area under an entire normal curve is 1.
• All normal curves are positive for all x. That is,
f(x) > 0 for all x.
• The height of any normal curve is maximized
at x = µ.
• The shape of any normal curve depends on its
mean µ and standard deviation σ.
190
Finding the Area Under the Curve
• The equation of the normal distribution curve,
also known as the probability density function, is
• f(x) = (1/(σ√(2π))) e^(−(x − µ)²/(2σ²))
• for −∞ < x < ∞, −∞ < µ < ∞, and σ > 0.
• The mean of X is µ and the variance is σ².
• This is written simply as X ~ N(µ, σ²).
191
Exercise
• Let X denote the mark of students on the
Statistics exam. It has long been known that
the marks follow a normal distribution with
mean 68 and standard deviation of 16. That is,
X ~ N(68, 16²). Draw a picture of the normal
curve, that is, the distribution, of X.
192
Finding Normal Probabilities
• Let X equal the IQ of a randomly selected
Ghanaian. Assume X is normally distributed.
What is the probability that a randomly
selected Ghanaian has an IQ below 90?
193
Partial Solution
• As is the case with all continuous distributions,
finding the probability involves finding the
area under the curve and to the left of the line
x = 90.
194
• That is, P(X < 90) = ∫ from −∞ to 90 of (1/(σ√(2π))) e^(−(x − µ)²/(2σ²)) dx.
• That is a mouthful!
• The integration is simply hard to do.
• We can bypass this by the use of the normal table.
• All we need to do is transform our distribution
to a Z (standard normal) distribution and then use
the cumulative probability table for the Z
distribution to calculate our desired probability.
195
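The slide's distribution parameters did not survive conversion; assuming, purely for illustration, that IQ ~ N(100, 16²), the table lookup corresponds to `scipy.stats.norm`:

```python
from scipy.stats import norm

mu, sigma = 100, 16                      # assumed parameters, illustration only

p_direct = norm.cdf(90, loc=mu, scale=sigma)   # P(X < 90), no integration needed
z = (90 - mu) / sigma                          # transform to Z: z = -0.625
p_via_z = norm.cdf(z)                          # standard normal table value

print(round(p_direct, 4))   # same answer either way
```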
Proof
196
Normal versus Standard Normal
Distribution
197
Solved Examples
• Suppose that the starting salary of UCC HR
graduates is normally distributed with a mean
of 54,400GHS and a standard deviation of
11,000GHS. If we randomly select 25 college
graduates, what is the probability that the
average salary of these graduates is between
56,000GHS and 58,000GHS?
198
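A sketch of the solution: the mean of n = 25 salaries has standard error σ/√n = 11,000/5 = 2,200, so we want the area between the two values under N(54,400, 2,200²).

```python
from math import sqrt
from scipy.stats import norm

mu, sigma, n = 54_400, 11_000, 25
se = sigma / sqrt(n)          # standard error of the mean = 2200

# P(56,000 < xbar < 58,000) for xbar ~ N(mu, se^2)
p = norm.cdf(58_000, mu, se) - norm.cdf(56_000, mu, se)
print(round(p, 3))            # roughly 0.183
```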
Solved Examples
• Suppose that GRE scores are normally
distributed with a mean of 500 and a standard
deviation of 100. If we randomly select 10
Ghanaian university students, what is the
probability that their mean GRE score is
greater than 550?
199
Solved Examples
• Suppose that the annual return on
the stock market is normally
distributed with a mean of 12% and a
standard deviation of 22%. What is
the probability that the average rate
of return over an entire decade will
be over 20% a year?
200
The Student t-Distribution
• In the previous exercises, we used the
population standard deviation σ.
203
Properties of the Student's t-Distribution
• The graph for the Student's t-distribution is similar to the standard normal
curve.
• The mean for the Student's t-distribution is zero and the distribution is
symmetric about zero.
• The Student's t-distribution has more probability in its tails than the standard
normal distribution because the spread of the t-distribution is greater than the
spread of the standard normal. So the Student's t-distribution is thicker in the
tails and shorter in the center than the standard normal distribution.
204
205
The Normal Distribution
• Recall that the normal distribution was written as X ~ N(µ, σ²), standardized as Z = (X − µ)/σ ~ N(0, 1).
206
The Student-t Distribution
• The Student-t statistic is t = (x̄ − µ)/(s/√n).
• It is written as t ~ t(n − 1),
• with n − 1 degrees of freedom.
207
Interval Estimates
• An interval estimate provides a range for a
population characteristic based on a sample.
• Intervals are quite useful in statistics because
they provide more information than a point
estimate.
• Intervals specify a range of plausible values for
the characteristic of interest and a way of
assessing “how plausible” they are.
208
• In general, a 100(1 – α)% probability interval
is any interval [A, B] such that the probability
of the quantity falling between A and B is 1 – α.
• Probability intervals are often centered on the
mean or median.
209
Confidence Intervals
• Confidence interval estimates provide a way of
assessing the accuracy of a point estimate.
• A confidence interval is a range of values
between which the value of the population
parameter is believed to be, along with a
probability that the interval correctly estimates
the true (unknown) population parameter.
• This probability is called the level of
confidence, denoted by 1 – α, where α is a
number between 0 and 1.
210
211
• The level of confidence is usually expressed as a
percent; common values are 90%, 95%, or 99%.
• Note that if the level of confidence is 90%, then
α = 0.1.
• The margin of error depends on the level of
confidence and the sample size.
• Many different types of confidence intervals may be
developed.
• The formulas used depend on the population
parameter we are trying to estimate and possibly
other characteristics or assumptions about the
population.
212
Example 1
• Suppose you do a study of acupuncture to
determine how effective it is in relieving pain.
You measure sensory rates for 15 subjects
with the results given:
• 8.6; 9.4; 7.9; 6.8; 8.3; 7.3; 9.2; 9.6; 8.7; 11.4;
10.3; 5.4; 8.1; 5.5; 6.9
• Use the sample data to construct a 95%
confidence interval for the mean sensory rate
for the population from which you took the
data.
213
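A sketch of the calculation using a t-interval (appropriate here because the population standard deviation is unknown):

```python
import numpy as np
from scipy import stats

rates = [8.6, 9.4, 7.9, 6.8, 8.3, 7.3, 9.2, 9.6, 8.7,
         11.4, 10.3, 5.4, 8.1, 5.5, 6.9]

xbar = np.mean(rates)
se = stats.sem(rates)                 # s / sqrt(n), using the n - 1 divisor
lo, hi = stats.t.interval(0.95, df=len(rates) - 1, loc=xbar, scale=se)
print(round(lo, 2), round(hi, 2))     # roughly (7.30, 9.15)
```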
Example II
• You do a study of hypnotherapy to determine
how effective it is in increasing the number of
hours of sleep subjects get each night. You
measure hours of sleep for 12 subjects with the
following results.
• 8.2; 9.1; 7.7; 8.6; 6.9; 11.2; 10.1; 9.9; 8.9; 9.2;
7.5; 10.5
• Construct a 95% confidence interval for the
mean number of hours slept for the population
(assumed normal) from which you took the data.
214
Example III
• A random sample of statistics students were asked to
estimate the total number of hours they spend watching
television in an average week. The responses are recorded
in Table below. Use this sample data to construct a 98%
confidence interval for the mean number of hours statistics
students will spend watching television in one week.
215
Confidence Interval: Mean with Known
Population Standard Deviation
217
• A 100(1 – α)% confidence interval for the
population mean, based on a sample of size n
with a sample mean x̄ and a known population
standard deviation σ, is given by x̄ ± zα/2 · σ/√n.
219
In short. . .
• Let X1, X2, . . ., Xn be a random sample from a normal
population with a mean µ and variance σ². Then the
interval x̄ ± zα/2 · σ/√n contains µ with confidence 1 – α.
220
Example
• In a production process for filling bottles of
liquid detergent, historical data have shown
that the variance in the volume is constant;
however, clogs in the filling machine often
affect the average volume. The historical
standard deviation is 15 milliliters. In filling
800-milliliter bottles, a sample of 25 found an
average volume of 796 milliliters.
• Find the 95% confidence interval for the
population mean.
221
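A sketch of the computation: with σ known, the interval is x̄ ± z0.025 · σ/√n.

```python
from math import sqrt
from scipy.stats import norm

xbar, sigma, n = 796, 15, 25

z = norm.ppf(0.975)               # about 1.96 for 95% confidence
margin = z * sigma / sqrt(n)      # about 1.96 * 15 / 5 = 5.88

print(round(xbar - margin, 2), round(xbar + margin, 2))   # (790.12, 801.88)
```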
Example 2
• A random sample of 126 police officers subjected to
constant inhalation of automobile exhaust fumes in
Accra, Ghana had an average blood lead level
concentration of 29.2 μg/dl. Assume X, the blood lead
level of a randomly selected policeman, is normally
distributed with a standard deviation of σ = 7.5 μg/dl.
Historically, it is known that the average blood lead level
concentration of humans with no exposure to
automobile exhaust is 18.2 μg/dl. Is there convincing
evidence that policemen exposed to constant auto
exhaust have elevated blood lead level concentrations?
222
Solution
• Let's try to answer the question by calculating
a 95% confidence interval for the population
mean. For a 95% confidence interval, 1−α =
0.95, so that α = 0.05 and α/2 = 0.025.
Therefore, as the following diagram illustrates
the situation, z0.025 = 1.96:
223
• Now, substituting in what we know (x̄ = 29.2, n
= 126, σ = 7.5, and z0.025 = 1.96) into the
formula for a Z-interval for a mean, we get:
• [29.2 − 1.96(7.5/√126), 29.2 + 1.96(7.5/√126)]
• Simplifying, we get a 95% confidence interval
for the mean blood lead level concentration of
all policemen exposed to constant auto
exhaust: [27.89,30.51]
224
• That is, we can be 95% confident that the mean
blood lead level concentration of all policemen
exposed to constant auto exhaust is between
27.9 μg/dl and 30.5 μg/dl.
• Note that the interval does not contain the value
18.2, the average blood lead level concentration of
humans with no exposure to automobile exhaust.
• In fact, all of the values in the confidence interval
are much greater than 18.2. Therefore, there is
convincing evidence that policemen exposed to
constant auto exhaust have elevated blood lead
level concentrations.
225
Hypothesis Testing
226
Example 1
• Cowbell, a producer of powdered milk, claims
that, on average, its powdered sachets weigh
at least 16 grams, and thus do not weigh less
than 16 grams.
• The company can test this claim by collecting a
random sample of powdered sachets,
determining the weight of each one, and
computing the sample mean sachet weight
from the data.
227
Example 2
• Accra Brewery is a company that has brewed crisp,
good-tasting beer in Ghana since 1931. It claims
that, on average, the volume of its fill is 625ml.
• It wishes to monitor its brewing process to ensure
that the volume of its fill meets this requirement for
regulation and reputational purposes.
• It could obtain random samples every 2 hours from
the production line and use them to determine if
standards are being maintained.
228
• These examples are a standard industrial
procedure.
• We state a hypothesis about some population
parameter and then collect sample data to
test the validity of our hypothesis.
229
Concepts of Hypothesis Testing
• Earlier on we developed statistical methods of
estimation, primarily in the form of confidence
intervals, for answering the question "what is
the value of a population parameter?"
• In this lecture, we'll seek to answer questions
like "is the value of the parameter θ equal to
a given value?"
230
• For example, rather than attempting to
estimate μ, the mean body temperature of
adults, we might be interested in testing
whether μ, the mean body temperature of
adults, is really 37 degrees Celsius.
• We'll attempt to answer such questions using
a statistical method known as hypothesis
testing.
231
• We'll look at hypothesis tests for the following
population parameters, including:
– a population proportion p, the difference in two
population proportions, p1−p2
– a population mean μ
– the difference in two population means, μ1−μ2,
– a population variance σ2
– the ratio of two population variances, σ1²/σ2²,
– three (or more!) means, μ1, μ2, and μ3.
• regression coefficient β of a least squares regression
line through a set of (x,y) data points as well as the
corresponding population correlation coefficient ρ.
232
*Tests About One Mean
• There are basically three tests related to the mean of
the population.
1. Hypothesis test based on the normal distribution for
the mean μ for the completely unrealistic situation
that the population variance σ2 is known
2. Hypothesis test based on the t-distribution for the
mean μ for the (much more) realistic situation that
the population variance σ2 is unknown.
3. Hypothesis test based on the t-distribution for μD,
the mean difference in the responses of two
dependent populations
233
Hypothesis-Testing Procedure
Conducting a hypothesis test involves several steps:
1. Identifying the population parameter of interest and
formulating the hypotheses to test
2. Selecting a level of significance, which defines the risk of
drawing an incorrect conclusion when the assumed
hypothesis is actually true
3. Determining a decision rule on which to base a
conclusion
4. Collecting data and calculating a test statistic
5. Applying the decision rule to the test statistic and drawing
a conclusion.
234
Significance Level and P-Value
• Before any hypothesis testing, we define a
significance level.
• The significance level is the probability of rejecting the
null hypothesis when it is actually true.
• You can look at the significance level as a boundary between
'rejecting' and 'failing to reject' our null hypothesis.
Reject          Fail to reject
0 . . . significance level . . . 1
235
• In the diagram, the ends of the line are marked
0 and 1 (why?).
• The red vertical line represents the dividing line
that marks the boundary between ‘Reject’ and
‘Fail to reject’.
• This significance level is defined a
priori. The conventional values are 0.01 (1%),
0.05 (5%) and 0.1 (10%).
• The p-value is calculated for our NULL
hypothesis. We then compare the p-value with
our significance level and decide whether to
reject or fail to reject the NULL.
236
Decisions
• Remember
237
• The following diagram will help make the points on the
previous diagrams clear.
238
Summary of Hypothesis Testing
• Every time we perform a hypothesis test, this is the
basic procedure that we will follow:
(1) We'll make an initial assumption about the
population parameter.
(2) We'll collect evidence or else use somebody else's
evidence (in either case, our evidence will come in
the form of data).
(3) We specify the level of significance.
(4) Based on the available evidence (data), we'll
decide whether to "reject" or "not reject" our initial
assumption.
239
One-Sample Hypothesis Tests
• We may conduct three types of one-sample
hypothesis tests:
• H0: population parameter ≥ constant vs. H1:
population parameter < constant
• H0: population parameter ≤ constant vs. H1:
population parameter > constant
• H0: population parameter = constant vs. H1:
population parameter ≠ constant
240
Tests About Proportions
• We perform hypothesis test for a single proportion.
• Recall the hypothesis testing procedure:
(1) State the null hypothesis H0 and the alternative
hypothesis HA.
(2) Calculate the test statistic: Z = (p̂ − p0) / √(p0(1 − p0)/n)
243
• Because we're interested in seeing if the
advertising campaign was successful, that is,
that a greater proportion of people wear seat
belts, the alternative hypothesis is:
• HA: p > 0.14
244
• If we use a significance level of α = 0.01, then the critical
region is Z ≥ z0.01 = 2.326.
• That is, we reject the null hypothesis if the test statistic Z >
2.326. Because the test statistic falls in the critical region,
that is, because Z = 2.52 > 2.326, we can reject the null
hypothesis in favor of the alternative hypothesis. There is
sufficient evidence at the α = 0.01 level to conclude the
campaign was successful (p > 0.14).
245
Example
• Among patients with lung cancer, usually 90%
or more die within three years. As a result of
new forms of treatment, it is felt that this rate
has been reduced. In a recent study of n = 150
lung cancer patients, y = 128 died within three
years. Is there sufficient evidence at the α =
0.05 level, say, to conclude that the death rate
due to lung cancer has been reduced?
246
Solution
• The sample proportion is:
• p̂ = 128/150 ≈ 0.853
• The null and alternative hypotheses are:
• H0: p = 0.90 and HA: p < 0.90
• The test statistic is, therefore:
• Z = (0.853 − 0.90) / √(0.90 × 0.10/150) ≈ −1.92
247
• And, the rejection region is Z ≤ −z0.05 = −1.645. Because
Z = −1.92 falls in the rejection region, we reject H0 and
conclude that the death rate due to lung cancer has been reduced.
250
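The whole test can be sketched in a few lines; note that the slide's −1.92 comes from rounding p̂ to 0.853 before computing, while the unrounded value gives about −1.91.

```python
from math import sqrt

p_hat = 128 / 150        # about 0.8533
p0 = 0.90                # hypothesized death rate

z = (p_hat - p0) / sqrt(p0 * (1 - p0) / 150)
print(round(z, 2))       # about -1.91

# One-sided test at alpha = 0.05: reject H0 if z < -1.645
print(z < -1.645)        # True: the death rate appears reduced
```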
Example
251
Is there sufficient evidence at the α = 0.05 level, say, to conclude
that the two populations — smokers and non-smokers — differ
significantly with respect to their opinions?
252
Solution
• If p1 = the proportion of the non-smoker
population who reply "yes" and p2 = the
proportion of the smoker population who reply
"yes," then we are interested in testing the null
hypothesis:
• H0: p1 = p2
• against the alternative hypothesis:
• HA: p1 ≠ p2
• Before conducting the hypothesis test, we'll have
to derive the appropriate test statistic.
253
• The test statistic for testing the difference in
two population proportions, that is, for testing
the null hypothesis H0: p1 = p2, is
• Z = (p̂1 − p̂2) / √( p̂(1 − p̂)(1/n1 + 1/n2) )
• where p̂ = (y1 + y2)/(n1 + n2)
• is the proportion of ‘successes’ in the two
samples combined.
254
So. . .
• The overall sample proportion is:
255
• Since this is a two-tail test, we put half the
probability in each tail.
257
When Population Variance is Known
• First, it is completely unrealistic to think that
we'd find ourselves in the situation of knowing
the population variance, but not the
population mean.
• Think about it. . . but we have to start from
somewhere. . .
258
Example 1
• Boys of a certain age are known to have a
mean weight of μ = 85kg. A complaint is made
that the boys living in a municipal children's
home are underfed. As one bit of evidence, n
= 25 boys (of the same age) are weighed and
found to have a mean weight of x̄ = 80.94 kg. It
is known that the population standard
deviation σ is 11.6 kg. With a significance level
of α = 0.05, what should be concluded
concerning the complaint?
259
Solution 1
• We formulate the H0 and H1.
• The null hypothesis is H0: μ = 85, and the
alternative hypothesis is H1: μ < 85.
• In general, we know that if the weights are
normally distributed, then Z = (x̄ − μ)/(σ/√n) ~ N(0, 1).
261
• The critical region approach tells us to reject the
null hypothesis at the α = 0.05 level if Z <
−1.645.
• Therefore, we reject the null hypothesis
because Z = −1.75 < −1.645, and therefore falls
in the rejection region:
262
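The computation above can be sketched as:

```python
from math import sqrt

mu0, sigma, n, xbar = 85, 11.6, 25, 80.94

z = (xbar - mu0) / (sigma / sqrt(n))   # = -4.06 / 2.32
print(round(z, 2))                     # -1.75

# One-sided test at alpha = 0.05: reject H0 if z < -1.645
print(z < -1.645)                      # True: evidence the boys are underfed
```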
When Population Variance is Unknown
• Let's look at the realistic situation in which
population variance is unknown.
263
Example 2
• It is assumed that the mean systolic blood
pressure of a population of vegetarians is μ =
120 mm Hg. In a sample study of 100 people,
the average systolic blood pressure was found
to be 130.1 mmHg, with a standard
deviation of 21.21 mmHg. Assuming a 95%
confidence level, is the group significantly
different from the regular population?
264
Solution 2
• The null hypothesis is H0: μ = 120, and because
there is no specific direction implied, the
alternative hypothesis is HA: μ ≠ 120.
• In general, we know that if the data are normally
distributed, then t = (x̄ − μ)/(s/√n) follows a
t-distribution with n − 1 degrees of freedom.
266
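A sketch of the test; since σ is unknown, the sample standard deviation and the t-distribution are used:

```python
from math import sqrt
from scipy.stats import t

mu0, xbar, s, n = 120, 130.1, 21.21, 100

t_stat = (xbar - mu0) / (s / sqrt(n))   # about 10.1 / 2.121, i.e. 4.76
t_crit = t.ppf(0.975, df=n - 1)         # two-sided cutoff at alpha = 0.05

# |t| far exceeds the cutoff: the group differs from the regular population
print(round(t_stat, 2), round(t_crit, 2))
```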
Python Code Hypothesis Testing
• from pandas_datareader import data as pdr
• import yfinance as yf
• import numpy as np
• yf.pdr_override()
• msft = pdr.get_data_yahoo('MSFT')  # download the MSFT price history
• msft['logReturn'] = np.log(msft['Close'] / msft['Close'].shift(1))
• msft.dropna(inplace = True)
• msft.head()
268
• Let’s visualize the logReturn.
• import matplotlib.pyplot as plt
• msft['logReturn'].plot(figsize = (10, 8))
• The figure size is measured in inches.
• plt.ylabel('Returns')
269
We see that the mean return appears to be close to 0
270
Step 1: Set Hypothesis
• We want to test whether indeed the
mean return is 0.
• H0: μ = 0 versus H1: μ ≠ 0
271
Step 2: Calculate test statistic
• sample_mean = msft['logReturn'].mean()
• In calculating the stdev for the sample, remember we
lose 1 degree of freedom.
• sample_std = msft['logReturn'].std(ddof = 1)
• n = msft['logReturn'].shape[0]
• tTest = sample_mean / (sample_std / n ** 0.5)  # t-statistic under H0: μ = 0
• tTest
• Out[207]: 2.160519860261913
272
Step 3: Set decision criteria
• Remember we are using a sample from the
population.
• We will therefore prefer the Student-t tables.
• import scipy as scs
• import scipy.stats  # needed so that scs.stats.t is available
• alpha = 0.05
273
• tLeft = scs.stats.t.ppf(alpha/2, n-1)
• tLeft
• Out[212]: -1.961674579682696
• tRight = scs.stats.t.ppf(1 - alpha/2, n-1)
• tRight
• Out[214]: 1.9616745796826955
274
Step 4: Make decision - reject H0?
• print('At significance level of {}, shall we reject
H0: {}'.format(alpha, tTest > tRight or tTest <
tLeft))
• At significance level of 0.05, shall we reject H0:
True
275
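scipy can also run the whole test in one call via `ttest_1samp`. A sketch with simulated returns standing in for `msft['logReturn']` (the seed and return parameters below are invented for illustration):

```python
import numpy as np
from scipy import stats

# Simulated daily log returns stand in for msft['logReturn'] here
rng = np.random.default_rng(42)
returns = rng.normal(loc=0.0005, scale=0.02, size=1000)

# Two-sided one-sample t-test of H0: mean return = 0
t_stat, p_value = stats.ttest_1samp(returns, popmean=0)

# The statistic agrees with the manual formula from Step 2
manual = returns.mean() / (returns.std(ddof=1) / np.sqrt(len(returns)))
```

This reproduces Steps 2-4 in one line: compare `p_value` with alpha instead of comparing `t_stat` with the tLeft/tRight cutoffs.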
Exercise
• Test the hypothesis
• and
276
*Paired T-Test
• In the two-sample t-test, we compare the means
of two independent populations, but there may
be occasions in which we are interested in
comparing the means of
two dependent populations.
• For example, suppose a researcher is
interested in determining whether the mean
IQ of the population of first-born twins differs
from the mean IQ of the population of
second-born twins.
277
• The researcher identifies a random sample
of n pairs of twins, and measures X, the IQ of
the first-born twin, and Y, the IQ of the
second-born twin.
• In that case, she's interested in determining
whether:
• μX = μY
• or equivalently if:
• μX − μY = 0.
278
• Now, the population of first-born twins is not
independent of the population of second-born
twins.
• Since all of our distributional theory requires
the independence of measurements, we're
rather stuck.
• There's a way out though... we can "remove"
the dependence between X and Y by
subtracting the two measurements Xi and Yi
for each pair of twins i, that is, by considering
the independent measurements Di = Xi − Yi.
279
• Then, our null hypothesis involves just a single mean,
which we'll denote μD, the mean of the differences:
• H0: μD = 0.
280
• We then compare the test statistic
• t = d̄ / (sD/√n)
• to a t-distribution with n−1 degrees of
freedom.
281
Example
• Blood samples from n = 10 people were sent
to each of two laboratories (Lab 1 and Lab 2)
for cholesterol determinations.
• The resulting data are summarized here:
282
Is there a statistically significant difference at the α =
0.01 level, say, in the (population) mean cholesterol
levels reported by Lab 1 and Lab 2?
283
Solution
• The null hypothesis is H0: μD = 0, and the
alternative hypothesis is HA: μD ≠ 0.
• The value of the test statistic is:
285
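In code, the paired test is scipy's `ttest_rel`. The slide's cholesterol table is not reproduced here, so the numbers below are hypothetical:

```python
import numpy as np
from scipy import stats

# Hypothetical cholesterol readings for the same 10 samples at two labs
lab1 = np.array([296., 268, 244, 272, 240, 244, 282, 254, 244, 262])
lab2 = np.array([318., 287, 260, 279, 245, 249, 294, 271, 262, 285])

# Paired t-test of H0: mu_D = 0
t_stat, p_value = stats.ttest_rel(lab1, lab2)

# Equivalent to a one-sample t-test on the differences d = lab1 - lab2
d = lab1 - lab2
t_check, p_check = stats.ttest_1samp(d, popmean=0)
```

The cross-check makes the slide's point concrete: pairing reduces the problem to a single-mean test on the differences.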
Kid stuff. . .
• The size of a contingency table is defined by the
number of rows times the number of columns
associated with the levels of the two categorical
variables.
• The size is notated r * c, where r is the number of
rows of the table and c is the number of columns.
• A cell displays the count for the intersection of a
row and column. Thus the size of a contingency
table also gives the number of cells for that table.
For example, if we have a 2*2 table, then we have 4
cells.
286
Example: Attitude to Campus Food
• A random sample of 500 students were
surveyed on their attitude toward food sold on
campus. The results of this survey are
summarized in the following contingency
table:
Bachelor Masters PhD Total
Dislike 64 67 84 215
287
• The size of this table is 2*3 and NOT 3*4. There
are only two rows of observed data for attitude to
campus food and three columns of observed data
for their level of education.
• We define the level of education as the explanatory
variable and attitude as the response because it is
more natural to analyze how one's attitude is
shaped by their level than the other way around.
• From here, we would want to determine if an
association (relationship) exists between attitude
and level. That is, are the two variables dependent
or independent?
288
Chi-Square Test of Independence
• This test is performed by using a Chi-square
test of independence of two categorical
variables.
• As with all prior statistical tests we need to
define null and alternative hypotheses.
• Also, as we have learned, the null hypothesis
is what is assumed to be true until we have
evidence to go against it.
289
Hypothesis
• Null Hypothesis: The two categorical variables
are independent.
• Alternative Hypothesis: The two categorical
variables are dependent.
• As usual we need a test statistic. It is called the
Chi-Square Test Statistic:
• χ² = Σ (O − E)² / E
• where O represents the observed frequency.
290
• E is the expected frequency under the null
hypothesis, computed as:
• E = (row total × column total) / sample size
291
Procedure
• Once we have gathered our data, we
summarize the data in the two-way
contingency table.
• This table represents the observed counts and
is called the Observed Counts Table or simply
the Observed Table.
• Then from the Observed Table, we compute
our Expected Table.
292
Observed vs Expected Values
Observed
Group 1 A B A+B
Group 2 C D C+D
Expected (each cell = row total × column total / n, where n = A+B+C+D)
Group 1 (A+B)(A+C)/n (A+B)(B+D)/n
Group 2 (C+D)(A+C)/n (C+D)(B+D)/n
294
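The whole procedure is available as `scipy.stats.chi2_contingency`; a sketch with an illustrative 2×2 table (the counts are invented):

```python
import numpy as np
from scipy import stats

# Illustrative 2x2 observed table [[A, B], [C, D]]
observed = np.array([[30, 70],
                     [60, 40]])

chi2, p_value, dof, expected = stats.chi2_contingency(observed, correction=False)

# Expected counts: row total * column total / grand total
row_tot = observed.sum(axis=1, keepdims=True)
col_tot = observed.sum(axis=0, keepdims=True)
manual_expected = row_tot * col_tot / observed.sum()
```

For an r × c table, `dof` is (r − 1)(c − 1); here it is 1, and `expected` matches the hand formula cell by cell.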
• Are gender and education level dependent at
5% level of significance? In other words, given
the data collected above, is there a relationship
between the gender of an individual and the
level of education that they have obtained?
295
Solution
• We build the Expected Table
High School Bachelors Masters PhD Total
296
• Now the critical value at the 5% level with
(2 − 1)(4 − 1) = 3 degrees of freedom is χ² = 7.815.
• Since 8.006 > 7.815, we reject the
null hypothesis and conclude that the
education level depends on gender at a 5%
level of significance.
297
Question continued
• Apply the Chi-square Test of Independence to
our example where we have a random sample
of 500 students who are questioned regarding
their attitude to campus food. Assume a
significance level of 5%. What conclusion will you
reach?
298
Exercise 1
• The operations manager of a company that
manufactures tires wants to determine whether there
are any differences in the quality of work among the
three daily shifts. She randomly selects 496 tires and
carefully inspects them. Each tire is either classified as
perfect, satisfactory, or defective, and the shift that
produced it is also recorded. The two categorical
variables of interest are shift and condition of the tire
produced. Does the data provide sufficient evidence at
the 5% significance level to infer that there are
differences in quality among the three shifts?
299
300
Exercise 2
• A food services manager for a campus joint
wants to know if there is a relationship
between gender (male or female) and the
preferred condiment on a hot dog. The
following table summarizes the results. Test
the hypothesis with a significance level of
10%.
301
302
Statistics for Business Decision
Making
Covariance and Correlation
303
Covariance and Correlation
• How far is it true that when one’s income goes up,
one’s consumption also goes up?
• What about the wage one earns and the ‘amount’ of
education one has?
• And is it true that on a beach on a sunny day, the amount
of ice-cream sold can predict the number of people
drowning in the water?
• Do sales in a shop have anything to do with the amount spent
on advertising?
• How about the price of a stock and the volume of stocks
traded?
• Again, is there a relationship between your height and
weight?
304
Background
• The above and many more such
relationships between variables are quantified
in statistics.
• For some pairs of variables, there is a direct
relationship; for others, there is an
inverse relationship.
• So we express direct relationships as
positive and inverse relationships as
negative.
305
Diagram for Covariance
306
Definition
• We formally define covariance as a measure
of the relationship between two random
variables.
• It is a statistical measure of how much – to
what extent – the variables change together.
• In other words, it is essentially a measure of
the variance between two variables.
• However, the metric does not assess the
dependency between variables.
307
• Positive covariance is an indication that the
two variables tend to move in the same
direction.
• Negative covariance implies that two
variables tend to move in opposite or inverse
directions.
• Can we have a situation where the variables
are unrelated? Yes, we can. How will the
scatter plot of variables X and Y which have no
relationship look like? Next slide. . .
308
• Here there is no obvious pattern. So the
relationship between the variables on the
vertical and horizontal axes is zero.
309
Statistically. . .
• We define the population covariance between two
variables x and y as
• Cov(x, y) = Σ (x_i − μ_x)(y_i − μ_y) / N
• where N is the size of the data.
• However, we work with samples instead of the
population. We define the sample covariance
as
• Cov(x, y) = Σ (x_i − x̄)(y_i − ȳ) / (n − 1)
• where x̄ and ȳ are the sample means of x and y
respectively.
310
Example
• The Table below is the income and
consumption expenditure of a household for
10 years. Is there a relationship between
income and consumption for this household?
Year Income Consumption
2000 8559.4 6830.4
2001 8883.3 7148.8
2002 9060.1 7439.2
2003 9378.1 7804
2004 9937.2 8285.1
2005 10485.9 8819
2006 11268.1 9322.7
2007 11894.1 9826.4
2008 12238.8 10129.9
2009 12030.3 10088.5
311
Solution
• We set up as follows:
• x̄ = 10373.53, ȳ = 8569.40, N = 10
• Σ (x_i − x̄)(y_i − ȳ) = 15683018.9
• Cov(x,y) = 15683018.9 / (10 − 1) = 1742557.66
• Since the covariance is not zero, we conclude that
there is a relationship between income and
consumption which is positive.
• There is a general tendency for consumption to
increase whenever income increases. See attached
Excel sheet.
312
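The calculation can be checked in Python with the data from the table (a sketch; note that numpy's `np.cov` divides by n − 1 by default):

```python
import numpy as np

# Income and consumption, 2000-2009, from the table
income = np.array([8559.4, 8883.3, 9060.1, 9378.1, 9937.2,
                   10485.9, 11268.1, 11894.1, 12238.8, 12030.3])
consumption = np.array([6830.4, 7148.8, 7439.2, 7804, 8285.1,
                        8819, 9322.7, 9826.4, 10129.9, 10088.5])

# Sum of cross-deviations, as on the slide
cross = ((income - income.mean()) * (consumption - consumption.mean())).sum()

# Sample covariance (divide by n - 1); np.cov agrees
cov_sample = cross / (len(income) - 1)
cov_np = np.cov(income, consumption)[0, 1]
```

`cross` reproduces the slide's 15,683,018.9, and the covariance is positive, confirming the direct relationship between income and consumption.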
• We do have negative covariance between
variables.
• Think about it!
• There is a negative relationship between
inflation rate and the strength of the currency
of a country eg. GHC/USD.
• Whenever inflation is high in a country, that
country’s currency depreciates against the
major currencies!
• Can you give examples of two variables with
negative covariance?
313
Some Properties of Covariance
• We state briefly the properties of covariance.
• Cov(X,Y) = Cov(Y,X)
• Cov(X,c) = 0 where c is a constant. A constant
does not vary (Remember?)
• Cov(X,Y+Z) = Cov(X,Y) + Cov(X,Z)
• Cov(X+Y,X+Y) = Cov(X,X) + Cov(Y,Y) + 2Cov(X,Y)
• Cov(X, X) = Var(X)
314
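These properties can be verified numerically; a sketch with arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=1000)
Y = rng.normal(size=1000)
Z = rng.normal(size=1000)

def cov(a, b):
    # Sample covariance with n - 1 in the denominator
    return ((a - a.mean()) * (b - b.mean())).sum() / (len(a) - 1)

sym = np.isclose(cov(X, Y), cov(Y, X))                      # Cov(X,Y) = Cov(Y,X)
const = np.isclose(cov(X, np.full(1000, 3.0)), 0.0)         # Cov(X,c) = 0
additive = np.isclose(cov(X, Y + Z), cov(X, Y) + cov(X, Z)) # bilinearity
variance = np.isclose(cov(X, X), X.var(ddof=1))             # Cov(X,X) = Var(X)
```

Each identity holds exactly in algebra; the numerical check only tolerates floating-point rounding.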
General Comments
• Some covariance can be spurious!
• Is there a relationship between the shoe sizes of
students in UCC and the amount of rainfall in
Cape Coast?
• Hardly! But if we gather data on these two
variables, it may show a relationship.
• If we cannot find any reason for the
relationship, we say that it is spurious.
• Be on the look out for such spurious
relationships.
315
Correlation
• When we computed the covariance between
Income and Consumption, we got 1742557.66.
• Can we say there is a strong relationship
between Income and Consumption? How
strong is the relationship?
• On the face of it, we cannot say because we
have no reference points.
• That is the work of correlation.
• Correlation expresses the direction and
strength of the relationship between variables.
316
Example of Correlation
317
• Correlation builds on covariance by providing
reference values which we use to make
decisions about the strength of the
relationship.
• Correlation spans -1 to +1.
• If the correlation between X and Y is +1, the
relationship is said to be positively perfectly
correlated.
• If the correlation between X and Y is -1, the
relationship is said to be negatively perfectly
correlated.
318
• What about the situation where there is no linear
relationship between the variables?
• Just as we saw with covariance, the
correlation is zero.
• So in general,
• Corr(x, y) = Cov(x, y) / (SD(x) × SD(y)), with
−1 ≤ Corr(x, y) ≤ +1.
320
Examples
321
Scatter plot of Salaries against Educ
322
323
Example
• Continuing from our Income and
Consumption example, calculate the strength
of the relationship between the variables.
• We set up as follows:
• x̄ = 10373.53, ȳ = 8569.40, N = 10
• Σ (x_i − x̄)(y_i − ȳ) = 15683018.9
• Cov(x,y) = 15683018.9 / (10 − 1) = 1742557.66
• SD(x) = 1405.0445, SD(y) = 1244.493478
• Corr(x,y) = 1742557.66 / (1405.0445 × 1244.493478) = 0.9966
324
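The same number can be reproduced with numpy (a sketch; `np.corrcoef` computes the correlation directly):

```python
import numpy as np

income = np.array([8559.4, 8883.3, 9060.1, 9378.1, 9937.2,
                   10485.9, 11268.1, 11894.1, 12238.8, 12030.3])
consumption = np.array([6830.4, 7148.8, 7439.2, 7804, 8285.1,
                        8819, 9322.7, 9826.4, 10129.9, 10088.5])

# Corr = sample covariance / (SD(x) * SD(y))
cov_xy = np.cov(income, consumption)[0, 1]
corr = cov_xy / (income.std(ddof=1) * consumption.std(ddof=1))

# np.corrcoef gives the same answer directly
corr_np = np.corrcoef(income, consumption)[0, 1]
```

Both give about 0.9966, the strong positive correlation reported on the slide.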
• We conclude therefore that with a correlation
of 0.9966, there is a strong positive correlation
between Income and Consumption. The
diagram shows the strong positive relationship.
[Scatter plot of Consumption against Income showing a strong positive linear relationship]
325
Some Observations on Correlation
• Correlation does not imply causation, they say.
• Just because there is correlation between two variables
does not mean one is causing the other!
• For example, there is a correlation between the number
of beach resort drownings each month and the number
of ice-creams sold in the same period.
• It will seem like ice-creams cause drownings. No.
• People eat more ice-creams on hot days when they are
also more likely to go swimming. So the two variables
(ice-cream sales and drownings) are correlated, but one is
not causing the other.
• They are both caused by a third variable (temperature).
326
Coming up
• Next week, we are going to look at the ideas
of linear regression which will look at the
relationships in terms of how one or more
variables ‘cause’ another variable.
• Till then, keep safe and sound.
327
Linear Regression I
• As business people, we rely on a lot of variables to
make decisions.
• Very often, we can see that one or more variables
explain another variable.
• For example, employers determine an employee’s
salary based on the employee’s education,
experience and in some countries gender.
• In effect we say that ‘salary is a function of
education, experience and gender’.
• This is written as salary = f(educ, expr, gender)
328
• But hold on. . . are these the only factors that
determine your salary? What about your
productivity? And ‘who you know’?
• Indeed there are other variables that
determine your salary, but maybe we don’t
want to consider them now, or they are
subjective, ie difficult to measure objectively.
• We have to find a way of accounting for all the
variables that are excluded from our relationships.
• This is done by calling them ‘errors’. So we have
• salary = f(educ, expr, gender) + errors
329
• What other relationship can you think about?
• In microeconomics, you were told that one’s
consumptions depends on one’s income.
• So consumption = f(income).
• But we know that consumption is not just a
function of income. It will also depend on
taste, age, etc. Since we don’t want to
measure that for now, we include errors in the
above relation as
• consumption = f(income) + errors
• Think about other relationships.
330
Terminology
• We name the terms in the functions for ease
of communication.
• In the relation
• salary = f(educ, expr, gender) + error, educ,
expr and gender are called exogenous, input
or independent variables. Another name for
them is regressors.
• salary is called the output, dependent,
endogenous or response variable. Another name
for it is the regressand.
331
• Identify the names in the relations
• consumption = f(income) + errors
• sales = f(radio, TV, newspaper) + error
332
Simple and Multiple Linear Regression
• Linear regression is basically divided into two
and named simple or multiple depending on
the number of regressors.
• If we have only one regressor, then we call it a
simple linear regression eg.
• Consumption = f(Income) + error.
• For multiple regressors, we name it . . . .
• Sales = f(radio, TV, newspaper) + errors
333
So formally. . .
• A simple linear regression is a mathematical
approach for predicting a quantitative
response Y on the basis of a single predictor
variable X.
• It is written as Y = β0 + β1X + ε.
• Compare with Y = f(X) + error.
338
• Remind yourself what we said about the linear
regression.
339
Find the βs
• The whole idea of regression is to find the βs in
the model Y = β0 + β1X + ε.
341
• In the diagram, the ovals are the actual points
depicting sales given a particular temperature.
• We have the line of best fit approximating the
points.
• Roughly, there are the same number of points
above as below the line.
• The distance from a point to the line is the ‘error’.
• Taking points above the line as positive and points
below as negative, summing them will give us
zero.
• That is not what we want. That is why we rather
take the sum of squared errors.
342
Ordinary Least Squares
• The method we have been describing is called
the ordinary least squares method of finding the
βs.
• Before we use the method, let’s rewrite the
linear regression model more generally as
• y_i = β0 + β1 x_i + ε_i. This will make things easy for us.
• From this, we have
• ε_i = y_i − β0 − β1 x_i
• which is the expression for the errors.
• The sum of squared errors is
• S = Σ ε_i² = Σ (y_i − β0 − β1 x_i)². Think about this.
343
• This is all that we are saying.
344
Minimising Sum of Squared Errors
• Minimising in math involves the use of basic
differential calculus. . . Do you remember?
• We take the derivative of S with respect to each β
and equate the result to zero, then solve for the βs.
• But we are not interested in the intermediate
math except to say that we will get
• β̂1 = Σ (x_i − x̄)(y_i − ȳ) / Σ (x_i − x̄)²
• β̂0 = ȳ − β̂1 x̄
345
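These closed-form solutions can be checked against numpy's own least-squares fit; a sketch with made-up temperature/sales numbers:

```python
import numpy as np

# Made-up temperature (x) and sales (y) data
x = np.array([14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1])
y = np.array([215., 325, 185, 332, 406, 522, 412, 614, 544, 421])

# Closed-form OLS estimates for simple linear regression
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()

# np.polyfit(deg=1) minimises the same sum of squared errors
slope, intercept = np.polyfit(x, y, 1)
```

Both routes minimise S, so the two pairs of estimates agree to floating-point precision.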
Example
• We want to model the relationship between the sales
as output and the expenditure on Radio, TV and
Newspaper as inputs.
• The equation is
• Sales = β0 + β1·Radio + β2·TV + β3·Newspaper + ε
347
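The slide's Advertising dataset is not included here, so a sketch with synthetic data shows the mechanics of the multiple-regression fit via `np.linalg.lstsq` (the true coefficients below are invented for the simulation):

```python
import numpy as np

# Synthetic advertising data; the generating coefficients are invented
rng = np.random.default_rng(1)
n = 200
radio = rng.uniform(0, 50, n)
tv = rng.uniform(0, 300, n)
newspaper = rng.uniform(0, 100, n)
sales = 4.6 + 0.11 * radio + 0.055 * tv + rng.normal(0, 1.0, n)  # newspaper has no effect

# Design matrix with an intercept column; solve min ||Xb - y||^2
X = np.column_stack([np.ones(n), radio, tv, newspaper])
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)
```

The recovered `beta` is close to (4.6, 0.11, 0.055, 0), mirroring the near-zero Newspaper coefficient seen in Output II.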
Output I
OLS Regression Results
==================================================
Dep. Variable: Sales R-squared: 0.903
Model: OLS Adj. R-squared: 0.901
Method: Least Squares F-statistic: 605.4
Date: Fri, 22 May 2020 Prob (F-statistic): 8.13e-99
Time: 10:35:15 Log-Likelihood: -383.34
No. Observations: 200 AIC: 774.7
Df Residuals: 196 BIC: 787.9
Df Model: 3
348
Output II
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------
const 4.6251 0.308 15.041 0.000 4.019 5.232
Radio 0.1070 0.008 12.604 0.000 0.090 0.124
TV 0.0544 0.001 39.592 0.000 0.052 0.057
Newspaper 0.0003 0.006 0.058 0.954 -0.011 0.012
349
Checking Model Assumptions
• When you build a model, the first thing is to
check the model assumptions.
• Do you remember the model assumptions?
• Let’s look at Output I.
• Look at the F-statistic. The value is 605.4
• By itself, we cannot say anything. We look at its
probability which is Prob (F-statistic): 8.13e-
99.
• This value is used in hypothesis testing. Do you
remember hypothesis testing?
350
Hypothesis For the Model
• For the model as a whole, we can state our hypothesis
as:
• H0: β1 = β2 = β3 = 0 vs
• HA: at least one βi ≠ 0
353
Distribution of the Errors
354
Homoscedasticity of Errors
355
Comment
• From the histogram, we can see that but for a
few values to the left, the errors would have
been normally distributed.
• A maxim in Statistics says ‘All models are wrong
but some are useful’. So we accept that the
assumption of normality with mean zero holds.
• The same thing applies to the homoscedasticity
of errors. Since the graph is not showing any
pattern, we say that the errors have constant
variance.
356
Autocorrelation of Errors
• By autocorrelation, we mean how the errors are correlated with
their own past values, ie Corr(ε_t, ε_(t−1)).
• We require that value to be zero.
• There is a test known as the Durbin-Watson test which we use to
make this decision.
• It says that D = 2(1 - r), where r is the correlation coefficient.
Recall that correlation coefficient span -1 through 0 to +1.
• For zero correlation, r = 0. When you put r = 0 into that
equation, we get D = 2.
• If our model should have zero auto correlation, D should be
around 2.
• Our model’s Durbin-Watson = 2.251. We accept that
the autocorrelation is approximately zero.
357
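The statistic itself is simple to compute from residuals; a sketch with simulated errors showing both ends of the scale:

```python
import numpy as np

def durbin_watson(resid):
    # D = sum of squared successive differences / sum of squared residuals
    return (np.diff(resid) ** 2).sum() / (resid ** 2).sum()

rng = np.random.default_rng(7)
e = rng.normal(size=5000)

# Independent errors: r is near 0, so D = 2(1 - r) is near 2
d_independent = durbin_watson(e)

# Strongly positively autocorrelated errors (r near 0.9) push D towards 0
ar = np.empty(5000)
ar[0] = e[0]
for t in range(1, 5000):
    ar[t] = 0.9 * ar[t - 1] + e[t]
d_autocorr = durbin_watson(ar)
```

With independent errors D lands near 2 (like the model's 2.251), while the autocorrelated series gives a D well below 2.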
Interpretation of Model
• Bring back Output II
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------------
const 4.6251 0.308 15.041 0.000 4.019 5.232
Radio 0.1070 0.008 12.604 0.000 0.090 0.124
TV 0.0544 0.001 39.592 0.000 0.052 0.057
Newspaper 0.0003 0.006 0.058 0.954 -0.011 0.012
359
• We can also use the p-values written as P >|t|
or the 95% confidence intervals to make decisions
about the significance of the coefficients.
• At the 5% significance level, all the coefficients,
except Newspaper, are significant [Recall what
we said about the p-values].
• On the confidence intervals, look at the values for
Newspaper ie. [-0.011, 0.012].
• In moving from the lower limit -0.011 to the upper
limit 0.012, we cross 0. That is what makes the
regression coefficient insignificant.
360
Conclusion
• Linear regression models are a must have in a
manager’s toolkit.
• We have taken a look at a multiple linear
regression and indeed we regressed Sales on
Radio, TV and Newspaper expenditure and we
saw that Newspaper is not significant.
• In the next lesson, we are going to look at how
good our model is ie. the coefficient of
determination and the related ANOVA issues.
361
Analysis of Variance
• Let’s suppose we have this question: Is there a
relationship between Sales and the
expenditure on TV adverts?
362
[Scatter plot of Sales against TV advertising expenditure]
363
• We can investigate by writing the relationship
as a regression:
• Sales = β0 + β1·TV + ε
• If indeed there is no relationship, then β1 = 0, so
we test H0: β1 = 0 against HA: β1 ≠ 0.
364
• Call:
• lm(formula = Sales ~ TV, data = ad)
• Coefficients:
• Estimate Std. Error t value Pr(>|t|)
• (Intercept) 6.974821 0.322553 21.62 <2e-16 ***
• TV 0.055465 0.001896 29.26 <2e-16 ***
• Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
• Residual standard error: 2.296 on 198 degrees of freedom
• Multiple R-squared: 0.8122, Adjusted R-squared:
0.8112
• F-statistic: 856.2 on 1 and 198 DF, p-value: < 2.2e-16
365
• Indeed both the t-statistic and the p-value
show, at the 5% significance level, that we
reject the null in favour of the alternative
hypothesis.
• So we conclude that our finding that there is a
relationship between Sales and TV adverts is
significant.
• From the relationship, we see that for a unit
increase in TV adverts, Sales increase by
0.055465 units.
366
Analysis of Variance (ANOVA)
• There is an alternative method for answering
the same question, which uses the analysis of
variance based on the F-test.
• Let's first define the term "analysis of
variance“.
• Analysis of Variance (ANOVA) consists of
calculations that provide information about
levels of variability within a regression model.
• It forms a basis for tests of significance.
367
• The basic regression line viewed another way
can be written as:
• DATA = FIT + RESIDUAL
• Let’s put this in a diagram
368
369
• Based on the diagram, we see that
• SST = SSE + SSR, that is
• Σ (y_i − ȳ)² = Σ (y_i − ŷ_i)² + Σ (ŷ_i − ȳ)²
372
ANOVA Table
Source      df          SS    MS                     F
Regression  p           SSR   MSR = SSR/p            MSR/MSE
Error       n − p − 1   SSE   MSE = SSE/(n − p − 1)
Total       n − 1       SST
373
• In fact,
• MSE = SSE / (n − p − 1), ie the ‘mean squared error’.
• MSR = SSR / p, ie the ‘regression mean square’.
• So the total variability, ie SST, is shared between the
regression model and the errors.
• Finally, the ratio of MSR to MSE is called the F-
statistic.
• The F-statistic is used to assess whether the model
as a whole is significant.
374
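The decomposition and the F-statistic can be verified on a small made-up dataset:

```python
import numpy as np

# Made-up data: fit y on x by OLS, then decompose the variation
x = np.array([10., 20, 30, 40, 50, 60, 70, 80])
y = np.array([8., 11, 15, 14, 19, 22, 24, 25])

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = ((y - y.mean()) ** 2).sum()      # total sum of squares
ssr = ((y_hat - y.mean()) ** 2).sum()  # regression (FIT) sum of squares
sse = ((y - y_hat) ** 2).sum()         # error (RESIDUAL) sum of squares

# One regressor: MSR = SSR / 1, MSE = SSE / (n - 2)
n, p = len(x), 1
f_stat = (ssr / p) / (sse / (n - p - 1))
```

For an OLS fit the identity SST = SSR + SSE holds exactly, which is what the slide's diagram (DATA = FIT + RESIDUAL) expresses.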
Example
• Let’s run Sales on TV to see
• options(digits = 8)
• anova.model <- aov(Sales ~ TV, data = ad)
• summary(anova.model)
• > summary(anova.model)
• Df Sum Sq Mean Sq F value Pr(>F)
• TV 1 4512.43517 4512.43517 856.17671 < 2.22e-16 ***
Residuals 198 1043.54878 5.27045
• Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’
0.1 ‘ ’ 1
375