Commerce 1DA3 Notes-6

Chapter 1: Introduction To Statistics

Have You Ever Wondered…

● if the number of children in a household has a relationship with the household's annual income
● whether or not air quality has a relationship with the number of hospital visits due to respiratory problems
● if the general public can tell Pepsi from Coke just by tasting it
● how wide the pay gap is between male and female CEOs in Canada
● what factors contribute to the likelihood of a company filing for bankruptcy in 5 years

Statistical Analysis

● Data is all around us


● We need statistics to extract useful information from data
● Good statistical analysis includes:
○ Finding the right data
○ Categorizing and visualizing the data
○ Applying relevant statistical tools
○ Visualizing and interpreting the results in a meaningful way
○ Solving problems using the data analysis

Chapter 2: Data
What Is Data?

● Data values, or observations, are information collected regarding some subject
● Data can be numbers, names, etc., and tell us the "Who and What"
● Data are useless without their context
What is Data?

● The rows of a data table correspond to individual cases about Whom (or about Which, if they are not people) we record some characteristics.
● The characteristics recorded about each individual or case are called variables.
● These are usually shown as the columns of a data table and identify What has been measured

What is Data?

● Data tables are cumbersome for complex data sets, so often two
or more separate data tables are linked together in a relational
database
● Each data table included in the database is a relation because it is
about a specific set of cases with information about each of these
cases for all of the variables
● Example: A typical relational database is provided consisting of
three relations: customer data, item data, and transaction data

Types of Variables

● Categorical (also known as qualitative): names categories and indicates whether or not a case falls into a certain category.
● Quantitative: measures numerical values with/without units. Tells
us about the quantity of something.
○ Some quantitative variables have units (purchase amount),
and some are unitless (click count).
● Some variables could potentially be categorical and quantitative
(Age).

Types of Variables

● Counting is at the core of statistics. We usually count things to gain insight into the world.
● Example: counting the cases in each category, or how many of something was observed
Types of Variables

● Identifier: identifies cases in databases (datasets). An identifier is unique
○ A type of categorical variable
○ Does not have units
○ Helps to combine different datasets and makes relational databases possible
○ Is not itself analyzed

Types of Variables

● Nominal: categorical variables used only to name categories (school attended)
● Ordinal: a variable whose values can be ordered, categorical or quantitative (satisfaction level, purchase amount)
● Time series: data gathered at regular intervals over time (e.g., daily temperatures in September)
● Cross-sectional data: when data for several variables are
measured at the same point in time, the data is called
cross-sectional. For example, determining sales revenue, number
of customers, and expenses for the last month of business.

Data Collection

● Primary data: collected by the researcher/analyst.


● Secondary data: collected by another party (e.g., Statistics Canada) and obtained by the researcher/analyst.
● When and Where data was collected is important

Chapter 3: Survey and Sampling


Key Words:

● Population vs. Sample


● Sampling
● Sample Statistics (Ex. Sample mean)
● Population Parameter (Ex. Population mean)
● Sample data collected is either:
○ Cross-sectional data
○ Time series data

Key Words:

● Structured data: well-defined length and format
● Unstructured data: no pre-defined format (doctor's notes, reports, video data)

Sampling

● Why do we take samples?


○ Insight into behaviours of a population
○ Population is big
○ Observing the whole population is impossible or costly or too
time consuming.
○ Only a sample of the population is observable
● Statistics helps us draw insight about the population by observing
and analyzing the sample

Features of Sampling

Feature 1: Examine a part of the whole

● We may use sample surveys: questions designed to give us answers on some characteristics of the sample.
● A sample may be biased. A biased sample over- or under-emphasizes certain characteristics of the population
● A biased sample gives us a biased understanding of the characteristics of the population.
● Individuals (cases) for samples must be selected randomly!
Feature 2: Randomize

● Randomizing protects us by giving us a representative sample even for effects we were unaware of.
● Randomization seems fair because nobody can guess the
outcome before it happens and because usually some underlying
set of outcomes will be equally likely.
● Sampling Variability (sampling error): Sample to sample
differences
○ Example: average height of McMaster undergraduate
Business students by drawing samples from different
sections of 1DA3!

Feature 3: Sample size is important!

● The size of the sample determines what we can conclude from the
data regardless of the size of the population

How big a sample do we need?

● It depends on what we are estimating


● A sample that is too small may not be representative of the population
● Naturally, we prefer a sample that represents the population well and is as small as possible!

Population and Parameters

● Census: a sample that includes observations from the entire population.
● A census is usually not the best idea, why?
○ Difficult or impractical or cumbersome to perform one
○ Population characteristics may change. We can’t perform
censuses often.
● Models use mathematics to represent reality
● Parameters: Key numbers in models that represent reality
● Population parameter: A parameter used in a model for a
population
Population and Parameters

● Since we are taking samples, we need to estimate population parameters through the sample data
● Sample Statistic: Anything calculated from a sample
● Representative: A sample statistic that estimates the
corresponding population parameter accurately
● The goal is to use sample statistics from the sample to estimate
population parameter

Simple Random Sample

● A sample drawn so that every possible sample of the size we plan to draw has an equal chance of being selected is called a simple random sample, usually abbreviated SRS.
● To select a sample at random, a sampling frame is first defined.
● Sampling frame: a list of individuals (or cases, record, etc.) from
which the sample will be drawn.
● Once we have a sampling frame, we can assign a sequential
number to each individual in the sampling frame and draw random
numbers to identify those to be sampled.
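As a quick illustration, here is a minimal Python sketch of drawing an SRS; the sampling frame and sample size are hypothetical:

    import random

    # Hypothetical sampling frame: a numbered list of case identifiers.
    sampling_frame = ["case_%d" % i for i in range(1, 501)]

    # random.sample draws without replacement, giving every possible
    # sample of size 25 the same chance of selection -- an SRS.
    srs = random.sample(sampling_frame, k=25)
    print(srs)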

Other Random Sample Designs

Stratified Sampling:

● Slice the population into homogeneous groups, called strata. Use simple random sampling within each stratum to select members. Combine the results at the end.
● Reduced sample variability is one of the most important benefits of
stratified sampling.

Cluster Sampling:

● Split the population into parts or clusters that each represent the
population. Perform a census within one or a few clusters at
random.
● If each cluster fairly represents the population, cluster sampling
will generate an unbiased sample.
Systematic Sampling:

● A systematic approach is used to select individuals. Start from a randomized individual and follow the approach to create the sample.
● Example: Pick every 10th individual from a list of employees to
create a sample of 30 individuals.
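A minimal sketch of this every-10th-individual scheme in Python, assuming a hypothetical list of 300 employees and a randomized starting point:

    import random

    employees = ["employee_%d" % i for i in range(1, 301)]  # hypothetical list

    step = 10                            # pick every 10th individual
    start = random.randrange(step)       # randomized starting individual
    sample = employees[start::step]      # systematic sample of 30 individuals
    print(len(sample), sample[:3])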

Multistage Sampling:

● Sampling schemes that combine several methods are called multistage samples.

The Valid Survey

● A survey that can yield the information you need about the
population in which you are interested is a valid survey
● To help ensure a valid survey, you need to ask four questions:
○ What do I want to know?
○ Who are the right respondents?
○ What are the right questions?
○ What will be done with the results?

The Valid Survey

● Nonresponse bias: when individuals don't respond to questions.
● Voluntary response bias: in volunteer surveys, individuals with the strongest feelings on either side of an issue are more likely to respond; those who don't care may not bother.
● Measurement errors: when a question does not take into account all possible answers
● It is important not to confuse inaccuracy with bias! Both create
errors but the errors are different.
Chapter 4: Displaying and Describing Categorical Data

Displaying Data

● Data visualization is an important part of any statistical or data analysis.
● It summarizes huge amounts of data into easy-to-follow, easy-to-digest graphs and plots. (2.5 billion GB of data is generated every day)
● Visualization plays an important role in telling the story of the data.

Charts

● Bar Charts: A bar chart displays the distribution of one categorical variable, showing the counts for each category next to each other for easy comparison.

● Pie Charts: Pie charts show the whole group as a circle (“pie”)
sliced into pieces. The size of each piece is proportional to the
fraction of the whole in each category. The pie chart for Loblaw
data is displayed below.

Frequency Tables

● A frequency table organizes data by recording counts and category names, as in the table below (frequency table of the number of Loblaw stores in eastern Canada).
● A relative frequency table displays the proportions or
percentages that lie in each category rather than the counts

Frequency Distribution

● Groups data into categories and records the number of (counts the
number of) observations in each category
Contingency Tables

● A contingency table shows how the values of one variable are contingent on the values of another variable (2 variables)
○ Example: Data was collected on the use of social networks in different countries, to show how social network use varies by country

● The marginal distribution of a variable in a contingency table is the total count that occurs without reference to the value of the other variable(s).

● Each cell of a contingency table gives the count for a combination of values of both variables (e.g., country and social network use).
● We may display the data as a percentage: a row percent, column percent, or total percent, which show percentages with respect to the row count, column count, or total count, respectively.

● A segmented bar chart divides a bar proportionally into segments corresponding to the percentage in each group.
○ We could display the Super Bowl viewer data as a segmented bar chart, which treats each bar as the "whole" and divides it proportionally into segments corresponding to the percentage in each group.

Contingency Distribution
● Conditional Distributions: We may want to restrict variables in a
distribution to show the distribution for just those cases that satisfy
a specified condition. This is called a conditional distribution.
(e.g., social networking use given the country of focus is Egypt)

Simpson’s Paradox

● Simpson's Paradox results from inappropriately combining percentages of different groups.
● The paradox appears when a certain trend appears in several
different groups of data, but disappears or reverses when these
groups are combined.

Simpson's Paradox

● Treatment for Kidney stones (small vs. large stones):


○ Treatment A is more comprehensive and involves open
surgical procedures
○ Treatment B is less comprehensive and involves small
punctures
● Of the 350 patients in each treatment (small and large stones combined), the number of successes is:
○ Treatment A: 273, a 78% success rate (273/350 = 78%)
○ Treatment B: 289, an 83% success rate (289/350 = 83%)
● Which treatment is suggested for a patient with a kidney stone of unknown size?

Stone Size      Treatment A              Treatment B
Small Stones    Group 1: 93% (81/87)     Group 2: 87% (234/270)
Large Stones    Group 3: 73% (192/263)   Group 4: 69% (55/80)
Both            78% (273/350)            83% (289/350)
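The reversal can be checked directly from the counts in the table; a short Python verification:

    # Success counts and group sizes from the kidney-stone table above.
    a_small, a_large = (81, 87), (192, 263)    # Treatment A: small, large stones
    b_small, b_large = (234, 270), (55, 80)    # Treatment B: small, large stones

    def rate(groups):
        wins = sum(w for w, n in groups)
        total = sum(n for w, n in groups)
        return wins / total

    # Within each stone size, Treatment A has the higher success rate ...
    print(rate([a_small]), rate([b_small]))    # 0.93 vs 0.87
    print(rate([a_large]), rate([b_large]))    # 0.73 vs 0.69
    # ... but combined, Treatment B looks better: Simpson's paradox.
    print(rate([a_small, a_large]), rate([b_small, b_large]))  # 0.78 vs 0.83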

Reasons for Simpson's Paradox

● Size of the groups: when the effect of the difference in groups is ignored, the groups with a higher sample size have a greater influence on the combined results, proportionate to their size!
● A lurking variable (confounding variable) can influence the results when two groups with significantly different behaviours are combined!

Simpson’s Paradox

What does it mean for data analysis:

● Analysis should be comprehensive and nuanced


● Content knowledge is important: investigate further if data is
showing you results that are counterintuitive.
● Understand the limitations of data: if data is not detailed enough, it
may give misleading results.
● Shows the trade-off between bias and variance (accuracy):
○ Too much aggregation → more accuracy, but may result in more bias
○ Too much disaggregation → less bias, but may result in smaller data, which means less accuracy

Simpson’s Paradox Can Be Avoided By:

● Reviewing frequency tables
● Reviewing correlations among variables
● Investigating any lurking (confounding) variables that may result in significant differences between groups
● A comprehensive and deep level of content knowledge (domain
knowledge)

Chapter 5: Displaying and Describing Quantitative Data

Displaying Quantitative Data

● Frequency tables
● Histograms
● Stem-and-leaf diagrams

Displaying Data Distributions

● Before making a histogram or a stem-and-leaf display, the Quantitative Data Condition must be satisfied: the data values are of a quantitative variable whose units are known.
● Histograms: a histogram plots the bin counts as the heights of bars, and it describes the overall "shape" of the data.

Displaying Data Distributions

A stem-and-leaf display for thirty-six months of stock price change data is shown below, together with a histogram.
How do histograms work?

1) Decide how wide to make the bins – if there are n data points, use about log₂(n) as a guide for the number of bins
2) Determine the count for each bin
3) Decide where to place values that land on the endpoint of a bin.
For example, does a value of $5 go into the $0 to $5 bin or the $5
to $10 bin? The standard rule is to place such values in the higher
bin.
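A minimal Python sketch of these three steps, assuming hypothetical data and the log₂(n) bin-count rule from step 1:

    import math

    data = [2.1, 4.7, 5.0, 7.3, 9.9, 3.2, 6.8, 8.5]     # hypothetical values

    n_bins = math.ceil(math.log2(len(data)))            # step 1: ~log2(n) bins
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins

    counts = [0] * n_bins                               # step 2: count per bin
    for y in data:
        # Step 3: a value landing on a bin endpoint goes into the higher bin;
        # integer division does this, and the max is clamped to the last bin.
        i = min(int((y - lo) // width), n_bins - 1)
        counts[i] += 1
    print(n_bins, counts)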

Stem and Leaf Display

● A stem and leaf display is like a histogram, but it also gives the
individual values
● These are easy to make by hand for data sets that aren’t too large,
so they’re a great way to look at a small batch of values quickly

Describing Data

● When describing a distribution, attention should be paid to


○ Its shape
○ Its center
○ Its spread
Shape

● We describe the shape of a distribution in terms of its modes, its symmetry, and whether it has any gaps or outlying values.
○ Negatively Skewed: skewed to the left
○ Symmetric Distribution: bars are symmetric
○ Positively Skewed: skewed to the right

Shape

● Modes: the peaks or humps seen in a histogram are called the modes of a distribution
● A distribution whose histogram has one main peak is called unimodal, two peaks – bimodal (see figure), three or more peaks – multimodal.
○ Unimodal: One main Peak
○ Bimodal: Two Peaks
○ Multimodal: Three or more peaks
Shape

● Uniform Distribution: a distribution whose histogram doesn't appear to have any clear mode and in which all the bars are approximately the same height

● Symmetric Distribution: a distribution is symmetric if the halves on either side of the centre look, at least approximately, like mirror images.

● Tails: The thinner ends of a distribution


○ If one tail stretches out farther than the other, the distribution
is said to be skewed to the side of the longer tail. The
distribution below is skewed to the right.
Outliers:
● Values that stand off away from the body of the distribution
○ Always be careful to point out the outliers in a distribution
● Outliers can affect every statistical method we will study (e.g., finding the distribution)
● They can be the most informative part of your data (model adjustment)
● They may be an error in the data (find the error and correct it)
● They should be discussed in any conclusions drawn about the data
● Characterizing the shape of a distribution is often a judgment call

Centre

● Mean: a natural summary and the centre point of a unimodal and symmetric distribution
● To find the mean of the variable y, add all the values of the variable and divide that sum by the number of data values, n. The mean is a natural summary for unimodal, symmetric distributions.

If we have the data y1, y2, y3, ..., y10, then the mean is

y(bar) = (y1 + y2 + y3 + ... + y10) / 10


Centre

● The mean of the sample is referred to as y(bar) (pronounced "y-bar"). In general, y(bar) = (y1 + y2 + ... + yn) / n.

Centre

● If a distribution is skewed, contains gaps, or contains outliers, then it might be better to use the median – the value that splits the histogram into two equal areas
● The median is found by counting in from the ends of the data until
we reach the middle value
● The median is said to be resistant because it isn’t affected by
unusual observations or by the shape of the distribution (like
outliers).
● If a distribution is roughly symmetric, we’d expect the mean and
median to be close.

Centre

Calculating the median:

1. Sort the data in ascending order.


2. If the number of observations is odd the median is the middle
value
3. If the number of observations is even, the median is the average
of the two middle values.
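These steps translate directly into code; a minimal Python sketch, using the small dataset 6, 4, 1, 9, 7 that also appears in the spread examples below:

    def mean(values):
        return sum(values) / len(values)

    def median(values):
        ys = sorted(values)                  # step 1: sort ascending
        n = len(ys)
        if n % 2 == 1:                       # step 2: odd count -> middle value
            return ys[n // 2]
        return (ys[n // 2 - 1] + ys[n // 2]) / 2   # step 3: average the middle two

    data = [6, 4, 1, 9, 7]
    print(mean(data), median(data))          # 5.4 and 6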

Centre
Mode – Median – Mean

● Relationship between mean, median, and mode in symmetric or skewed data:
● Negatively skewed: Mean < Median < Mode
● Normal (no skew): mean, median, and mode are on top of each other
● Positively skewed: Mode < Median < Mean
Spread

● We need to determine how spread out the data are because the
more the data vary, the less a measure of centre can tell us.
● One simple measure of spread is the range, defined as the
difference between the extremes (max and min)
● Range = Max - Min

Spread

● Range is a weak measure because it only considers the two endpoints of the data.
● The range is a single value and it is not resistant to unusual observations and outliers. Why?
○ If there are outliers, the min or max will be outliers
● Example: What is the range of the data 6, 4, 1, 9, 7?
○ Range = 9 − 1 = 8

Spread

● The quartiles frame the middle 50% of the data. One-quarter of the data lies below the lower quartile, Q1, and one-quarter lies above the upper quartile, Q3.
● The interquartile range (IQR) is defined as the difference between the two quartiles: IQR = Q3 − Q1
Spread
● Variance
● What is variance?
● Variance: the average of squared deviations between the data points and the mean
○ Variance unit of measurement: (unit of data)^2
● For sample values y1, y2, ..., yn the sample variance (s^2) is calculated as

s^2 = Σ(yi − y(bar))^2 / (n − 1)

Spread
For population values y1, y2, ..., yN the population variance (σ^2) [σ is the Greek letter sigma] is calculated as

σ^2 = Σ(yi − μ)^2 / N

where μ is the population mean.

Spread
● Standard Deviation: the standard deviation represents, on average, how far data points are from the mean
○ Standard deviation unit of measurement: same unit as the data
● The standard deviations for the sample and population are the square roots of the variances calculated on the previous slide:

s = √(s^2)    and    σ = √(σ^2)
Spread
● Coefficient of Variation (CV)
○ The CV for a dataset is a measure of relative spread
● For a sample, CV = s / y(bar); for a population, CV = σ / μ
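A minimal Python sketch of these spread measures for a sample, again using the small dataset 6, 4, 1, 9, 7; the CV computed here assumes the usual s / y(bar) definition:

    import math

    data = [6, 4, 1, 9, 7]
    ybar = sum(data) / len(data)

    # Sample variance: squared deviations from the mean, divided by n - 1.
    s2 = sum((y - ybar) ** 2 for y in data) / (len(data) - 1)
    s = math.sqrt(s2)        # standard deviation: same unit as the data
    cv = s / ybar            # coefficient of variation: relative spread
    print(round(s2, 2), round(s, 2), round(cv, 2))   # 9.3, 3.05, 0.56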

Reporting the Shape, Centre, and Spread

● If the shape is skewed, the median and IQR should be reported.


● If the shape is unimodal and symmetric, the mean and standard
deviation and possibly the median and IQR should be reported.
● If there are unusual observations point them out and report the
mean and standard deviation with and without the values.
● Always pair the median with the IQR and the mean with the
standard deviation.

5-number Summary and Boxplots

● The five-number summary of a distribution reports its median, quartiles, and extremes (maximum and minimum).
● It provides a good overall summary of the distribution of data.
○ Max
○ Q3
○ Median
○ Q1
○ Min
Percentiles

● A percentile shows where a given percentage of the data lies.
● Suppose the numbers of passengers on 12 flights from Hamilton to Ottawa are 24, 18, 31, 27, 15, 16, 26, 15, 24, 26, 25, 30.
● Step 1. We first put the data in ascending order, getting 15, 15, 16, 18, 24, 24, 25, 26, 26, 27, 30, 31.
● Step 2 (option 1): Suppose we want to calculate the 80th
percentile of this data. Since there are 12 data values, we first
calculate 80% of 12, which is 9.6. Since 9.6 is not an integer, we
round it up to 10 and the 80th percentile is the 10th data value, or
27.
● Step 2 (option 2): Suppose we want to calculate the 50th
percentile of this data. Since there are 12 data values, we first
calculate 50% of 12, which is 6. Since 6 is an integer, we do not
round it up. Instead we take the average of the sixth and seventh
data values (24+25)/2=24.5.
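A minimal Python sketch of this percentile rule (round up when p% of n is not an integer; average two values when it is), checked against the flight data above:

    import math

    def percentile(values, p):
        ys = sorted(values)                       # step 1: ascending order
        rank = p / 100 * len(ys)
        if rank != int(rank):                     # not an integer: round up
            return ys[math.ceil(rank) - 1]
        k = int(rank)                             # integer: average kth and (k+1)th
        return (ys[k - 1] + ys[k]) / 2

    flights = [24, 18, 31, 27, 15, 16, 26, 15, 24, 26, 25, 30]
    print(percentile(flights, 80))   # 27, as in step 2 (option 1)
    print(percentile(flights, 50))   # 24.5, as in step 2 (option 2)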

Percentile

● What percentile is the median of a dataset?
○ The 50th percentile
● What are some examples of percentiles used to describe a data point?
○ Credit scores
○ Standardized tests
● What is the Pth percentile of a dataset?
○ The value below which P percent of the data lie
5-number Summary and Boxplots

Once we have a five-number summary of a variable, we can display that information in a boxplot. To make a boxplot (IQR = Q3 − Q1):

1) Draw a single vertical axis spanning the extent of the data

2) Draw short horizontal lines at the lower and upper quartiles and at the median. Then connect them with vertical lines to form a box

3) Erect (but don't show in the final plot) "fences" around the main part of the data, placing the upper fence 1.5 IQRs above the upper quartile and the lower fence 1.5 IQRs below the lower quartile.

● Example: IQR = Q3 − Q1 = 0.297
● Upper fence: Q3 + 1.5·IQR = 1.972 + 0.4455 ≈ 2.42
● Lower fence: Q1 − 1.5·IQR

4) Draw lines (whiskers) from each end of the box up and down to the most extreme data values found within the fences.

5) Add any outliers by displaying data values that lie beyond the fences with special symbols. Erase the fences.
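A minimal Python sketch computing the fences and flagging outliers, reusing the percentile rule from the previous slides for Q1 and Q3 (other quartile conventions exist); the data are hypothetical, with 55 planted as an outlier:

    import math

    def percentile(values, p):                    # same rule as in the slides
        ys = sorted(values)
        rank = p / 100 * len(ys)
        if rank != int(rank):
            return ys[math.ceil(rank) - 1]
        k = int(rank)
        return (ys[k - 1] + ys[k]) / 2

    data = [15, 15, 16, 18, 24, 24, 25, 26, 26, 27, 30, 55]
    q1, q3 = percentile(data, 25), percentile(data, 75)
    iqr = q3 - q1
    upper_fence = q3 + 1.5 * iqr                  # step 3: fences
    lower_fence = q1 - 1.5 * iqr
    outliers = [y for y in data if y < lower_fence or y > upper_fence]
    print(q1, q3, iqr, outliers)                  # 17.0, 26.5, 9.5, [55]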
5-Number Summary and Boxplots

● The centre of a boxplot shows the middle half of the data between
the quartiles – the height of the box equals the IQR.
● If the median is roughly centred between the quartiles, then the
middle half of the data is roughly symmetric. If it is not centred, the
distribution is skewed.
● The whiskers show skewness as well if they are not roughly the
same length.
● The outliers are displayed individually to keep them out of the way
in judging skewness and to display them for special attention

Boxplot

Question: Draw a boxplot that has two outliers on the right hand
side. Show Q1, Q2 and Q3. Show the IQR. Show the max and min
values.
Comparing Groups

● In attempting to understand the data, we may want to look for patterns, differences, and trends over different time periods. We may want to split the data in half and display histograms for each half. Histograms for the six-month split NYSE data are shown below

Comparing Groups

● Histograms work well for comparing two groups, but boxplots tend
to offer better results when side-by-side comparison of several
groups is sought.
● Below the NYSE data is displayed in monthly boxplots.
Standardizing

Example: Compare two companies (from the "top" 100 companies) with respect to the variables New Jobs (jobs created) and Average Pay.

▪ Starbucks created over 2000 jobs and has an average salary of $44,790, while Wrigley created 16 jobs and has an average salary of $56,351.
▪ For all 100 companies, the mean number of new jobs created was 305.9 and the average salary was $73,229.42.
▪ Which company did better?

Example (continued): To compare the two companies based on these variables, we find the mean and standard deviation of each variable for all 100 companies.

● To quantify how well each of the companies did, and to combine the two scores, we determine how many standard deviations each is from the variable means.

Standardizing
● To find how many standard deviations a value is from the mean, we calculate a standardized value, or z-score.
● z-score formula:

z = (y − y(bar)) / s

● For example, a z-score of 2.0 indicates that a data value is two standard deviations above the mean, and a z-score of −2 indicates that the data value is two standard deviations below the mean.
● A rule of thumb for identifying outliers is z > 3 or z < −3

Standardizing

Example (continued): Computing the z-scores for these variables for


Starbucks and Wrigley we obtain the results summarized below.

Analysis of Relative Location

● In the following dataset find the z-score of all sample values (𝑦(bar)
= 6 and 𝑠 = 3.16). This procedure is called standardizing the data.
Standardizing Data
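The original values aren't reproduced here, but the dataset 2, 4, 6, 8, 10 is one dataset consistent with the stated y(bar) = 6 and s = 3.16; a minimal sketch of standardizing it:

    import math

    data = [2, 4, 6, 8, 10]     # consistent with y(bar) = 6, s = 3.16
    ybar = sum(data) / len(data)
    s = math.sqrt(sum((y - ybar) ** 2 for y in data) / (len(data) - 1))

    z_scores = [(y - ybar) / s for y in data]     # SDs above/below the mean
    print([round(z, 2) for z in z_scores])        # [-1.26, -0.63, 0.0, 0.63, 1.26]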

Analysis of Relative Location

● If we know the data come from a symmetric, bell-shaped distribution, they can usually be approximated by the normal distribution.
● The normal distribution is often used as an approximation for many real-world applications.
● The empirical rule: given the sample mean x(bar), the sample standard deviation s, and a relatively symmetric and bell-shaped distribution,
○ approximately 68% of all observations fall in the interval x(bar) ± s
○ approximately 95% of all observations fall in the interval x(bar) ± 2s
○ approximately 99.7% (almost 100%) of all observations fall in the interval x(bar) ± 3s

Outlier identification (fences):

● Upper fence: Q3 + 1.5·IQR
● Lower fence: Q1 − 1.5·IQR

Outliers (Relative Location)

Transforming Skewed Data

Example: Below we display the skewed distribution of total compensation for the CEOs of the 500 largest companies.

Transforming Skewed Data

● When a distribution is skewed, it can be hard to summarize the data simply with a centre and spread, and hard to decide whether the most extreme values are outliers or just part of the stretched-out tail.
● One way to make a skewed distribution more symmetric is to
re-express, or transform, the data by applying a simple function
to all the data values.
● If the distribution is skewed to the right, we often transform using
logarithms or square roots; if it is skewed to the left, we may
square the data values.

Example: Below we display the transformed distribution of total compensation for the CEOs of the 500 largest companies. A simple log function is used to transform the data values.

● This histogram is much more symmetric, and we see that a typical log compensation is between 6.0 and 7.0, or $1 million and $10 million in the original terms.
Chapter 6: Scatterplots, Association and Correlation

Scatterplots

● A scatterplot, which plots one quantitative variable against another, can be an effective display for data
● Scatterplots are the ideal way to picture associations between two quantitative variables.

Scatterplots

● Scatterplots indicate whether two variables are related and, if they are, what the nature of their relationship is.
● Examples where a scatterplot could be used include determining whether:
○ Gas prices vary with average monthly temperature
○ Cholesterol varies with dietary intake
○ Job satisfaction varies with number of employees working in
the company
○ The quality of user experience varies with average annual
profit made by the company

Scatterplots

● Scatterplots showing the relationship between two variables

Scatterplots

● The direction of the association is important


● A pattern that runs from the upper left to the lower right is said to
be negative
● A pattern running from the lower left to the upper right is called
positive
● Look for direction: What’s the sign - positive, negative, or
neither?

Scatterplots

● The second thing to look for in a scatterplot is its form


● If there is a straight line relationship, it will appear as a cloud or swarm of points stretched out in a generally consistent, straight form. This is called linear form.
● Sometimes the relationship curves gently, while still increasing or
decreasing steadily; sometimes it curves sharply up then down
● Look for form: Is it straight, curved, something exotic, or no
pattern?

Scatterplots

● The third feature to look for in a scatterplot is the strength of the relationship
● Do the points appear tightly clustered in a single stream or do the
points seem to be so variable and spread out that we can barely
discern any trend or pattern?
● Look for strength: How much scatter?

Scatterplots

● Finally, always look for the unexpected


● An outlier is an unusual observation, standing away from the
overall pattern of the scatterplot
● Look for unusual features: Are there unusual observations or
subgroups?

Measures of Association

● Using scatterplots, we can visually assess the strength and/or direction of the linear relationship between two variables

Understanding Correlation
● Correlation Coefficient ( r ):

What is the correlation coefficient?

A measure that evaluates the direction and the strength of a linear association between x and y.

● In order to measure the correlation coefficient, the variables need to be quantitative

What is the unit of correlation coefficient?

r has no unit of measure

What are the possible values of the correlation coefficient and what do they indicate?

[-1, +1]

● -1: Perfect negative linear association


● +1: Perfect positive linear association
● 0: no linear association

Understanding Correlation

The sample correlation coefficient (r) is computed as

r = Σ(xi − x(bar))(yi − y(bar)) / [(n − 1)·sx·sy]

where n is the sample size.

Alternative formula for correlation: using z-scores, r = Σ(zx·zy) / (n − 1).

Covariance

● An alternative to the correlation coefficient is the covariance between two variables; the covariance is calculated as

Cov(x, y) = Σ(xi − x(bar))(yi − y(bar)) / (n − 1)

● Unit: (unit of x)·(unit of y)
● Unlike the correlation coefficient, covariance depends on the unit of measurement of both variables.
○ No bound on its value
○ A measure of direction
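A minimal Python sketch computing the covariance and the correlation coefficient from these formulas; the paired data are hypothetical:

    import math

    xs = [1.0, 2.0, 4.0, 6.0, 7.0]      # hypothetical paired observations
    ys = [2.0, 3.5, 4.0, 6.5, 8.0]

    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))

    # Covariance: unit is (unit of x)(unit of y), unbounded.
    cov = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)
    r = cov / (sx * sy)                 # correlation: unitless, in [-1, 1]
    print(round(cov, 3), round(r, 3))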

Measures of Association

Question: Compute the sample correlation coefficient for the following dataset by completing the table (x(bar) = 6, y(bar) = 11, sx = 3.39, sy = 4.36).

Association

● Scatterplots are the ideal way to picture associations between two quantitative variables.
● Association: is a change in the value of one variable associated with a change in the value of the other variable?
● We may wonder if there is any association between the following variables, and if there is one, whether it is positive or negative and how strong it is:
○ Temperature and sales of AC devices!
○ Population of a city and the number of parks in it!
○ Number of employees in a company and annual profit!
○ Price of item and its weekly sales

Assigning Roles to Variables In Scatterplots

● To make a scatterplot of two quantitative variables, assign one to the y-axis and the other to the x-axis
● Since we are investigating two variables, we call this branch of statistics bivariate analysis.
● Each point is placed on a scatterplot at a position that corresponds to the values of the two variables
● The point's horizontal location is specified by its x-value, and its vertical location is specified by its y-value
● Together, these values are known as coordinates and written (x, y)

Assigning Roles to Variables in Scatterplots

● One variable plays the role of the explanatory or predictor variable, while the other takes on the role of the response variable.
○ Horizontal Axis: Explanatory, predictor, independent
● We place the explanatory variable on the x-axis and the response
variable on the y-axis.
○ Vertical Axis: Response, Dependent
● The x- and y-variables are sometimes referred to as the
independent and dependent variables, respectively.

Example
Understanding Correlation

● The ratio of the sum of the products zx·zy for every point in the scatterplot to n − 1 is called the correlation coefficient:

r = Σ(zx·zy) / (n − 1)

Alternative Formulas for Correlation:

Understanding Correlation (An Example)

Finding the Correlation Coefficient

● Suppose the data pairs are:


Understanding Correlation

Correlation measures the strength of the linear association between two quantitative variables.

Before you use correlation, you must check three conditions:

● Quantitative Variables Condition: correlation applies only to quantitative variables
● Linearity Condition: Correlation measures the strength only of
the linear association
● Outlier Condition: Unusual observations can distort the
correlation

Understanding Correlation

Correlation Properties:

● The sign of a correlation coefficient gives the direction of the association
● Correlation is always between −1 and +1
● Correlation treats x and y symmetrically
● Correlation has no units
● Correlation is not affected by changes in the center or scale of
either variable.
● Correlation measures the strength of the linear association
between the two variables.
● Correlation is sensitive to unusual observations

Correlation Coefficient

Understanding Correlation

Correlation Table

Correlation tables are compact and give a lot of summary information at a glance. There, you'll see the correlations between pairs of variables in a data set arranged in a table.

▪ A correlation table for some variables collected on a sample of Amazon books.
Straightening Scatterplots

● Sometimes scatterplots do not show a linear pattern. There are ways to straighten the points
● However, if we look at the logarithm of the values, the plot looks straighter, so the correlation is then a more appropriate measure of association.
● Simple transformations such as the logarithm, square root, and reciprocal can sometimes straighten a scatterplot's form.

Correlation ≠ Causation

● Two variables may be correlated, but that does not mean there is a causal effect between them.
● Some examples include:
○ There is a positive correlation between the sales of ice cream and the number of deaths by drowning.
○ The number of pirates in the world vs. global warming
● Two variables with a strong correlation may both be connected to a third "lurking" variable that is not visible.
● The third variable may be simultaneously affecting both variables.
● Therefore, correlation does not mean causation
Possible Correlation Reasons

A few Notes

● Don’t correlate categorical variables


● Make sure the association is linear
● Beware of outliers and multiple clusters
● The correlation between just two data points is
meaningless.
● Don’t confuse correlation with causation

Selection

● Selection bias: only looking at a fraction of your data and ignoring the rest

Chapter 7: Introduction To Linear Regression

The Linear Model

Example: The scatterplot below shows monthly advertising expenditures against sales over 5 years.

The Linear Model

● Looking at the example, the analyst might be faced with questions like:
○ What is the expected sales volume if the monthly advertising expenditure is $0.3 million (no data points)?
○ What is the expected sales volume if the monthly advertising expenditure is $0.9 million (more than one point)?
○ What level of monthly advertising expenditure can create a sales volume of $35 million?
○ If the current sales volume is $25 million, to double the sales, how much should the monthly advertising expenditure be increased?
○ What is the expected relationship between the monthly advertising expenditure and the sales volume?

The Linear Model

Example (continued): We see that the points don’t all line up, but that a
straight line can summarize the general pattern. We call this line a linear
model. This line can be used to predict sales for the level of advertising
expenses.
The Linear Regression Model

A few other examples of situations where a linear regression model can be used:

● Estimate a person's income based on education and years of experience.
● Predict the selling price of a house based on its size and location.
● Predict auto sales based on consumer income, interest rates and
price discounts.
● Predict hydro consumption based on different electricity pricing
systems.
● Predict a firm’s sales based on its advertising.

The Linear Model

● The regression line: The line that best fits all the points.
● What do we mean by “best fits”?
○ The line is used to predict values of the dependent variable
for values of the independent variable.
● Note: the value predicted by the line is usually not equal to the value of the data. There are some errors (residuals).
● Even though we know that the line is not a "perfect" prediction, we can still work with the linear model and accept some level of error.
● Unless the points form a perfect line, we will always have some errors.
The Linear Model

● Remember there is always a difference between the predicted value of the dependent variable (y-hat) and the actual value of the dependent variable (y) (if there is a point).
● This difference is called the "residual".
● A linear model can be written in the form

y(hat) = b0 + b1·x

where b0 is the intercept and b1 is the slope.
The Linear Model

● The difference between the predicted value, y(hat), and the observed value, y, is called the residual and is denoted e:

e = y − y(hat)

● Residuals are shown in the picture below

The Linear Model

Example (continued): For advertising expenses of $1.42 million, the actual sales are $28.1 million and the predicted sales are $32.9 million. The residual is $28.1 million − $32.9 million = −$4.8 million of sales.
The Linear Model

The Line of “Best Fit”

● Some residuals will be positive and some negative, so adding up all the residuals is not a good assessment of how well the line fits the data
● If we consider the sum of the squares of the residuals, then the
smaller the sum, the better the fit
● The line of best fit is the line for which the sum of the squared
residuals is smallest – often called the least squares line.
● Question: Why do we square the residuals?
○ So that + and - residuals don’t cancel each other out

Correlation and The Line

● The scatterplot of real data won't fall exactly on a line, so we denote the model of predicted values by the equation

y(hat) = b0 + b1·x

● The "hat" on the y is used to represent an approximate (predicted) value
● The predicted value is used to predict the dependent variable (y) based on the value of the independent variable (x). In this example the dependent variable is the sales volume (y) and the independent variable is the monthly advertising expenditure (x).

Correlation and the Line

For our example of sales and advertising expenses, the line shown with the scatterplot has the equation

y(hat) = 21.1 + 8.31·x
Correlation and the Line

a) What is the slope? How can you interpret the slope in this
question?

For every additional unit of increase in advertising expenditure ($ million), we expect sales to increase by 8.31 ($ million).

b) What is the intercept? How can you interpret the intercept?

Expected sales when advertising expenditure = 0 is $21.1 million

The Linear Model

Method of ordinary least squares (OLS):

● Some residuals will be positive and some negative, so adding up


all the residuals is not a good assessment of how well the line fits
the data
● If we consider the sum of the squares of the residuals, then the
smaller the sum, the better the fit
● The regression line is the line for which the sum of the squared
residuals is smallest – often called the least squares line.
Question: Why do we square the residuals?

● So that positive (+) and negative (−) residuals don't cancel each other out

The Linear Model

● The OLS method chooses the line for which the error sum of squares (SSE) is minimized.
● SSE is the sum of the squared differences between the observed values y and their predicted values y(hat).
● The OLS method produces the straight line that is "closest" to the data.
● The OLS method tries to minimize SSE, which is

SSE = Σ(yi − y(hat)i)^2

Correlation and the Line

We can find the slope of the least squares line using the correlation and the standard deviations as follows:

b1 = r · (sy / sx)

● The slope gets its sign from the correlation. If the correlation is
positive, the scatterplot runs from lower left to upper right and the
slope of the line is positive. (remember, standard deviation is
always a positive number).
● The slope gets its units from the ratio of the two standard
deviations, so the units of the slope are a ratio of the units of the
variables.

Correlation and the Line

● To find the intercept of our line, we use the means. If our line estimates the data, then it should predict y(bar) for the x-value x(bar). Thus we get the following relationship from our line:

y(bar) = b0 + b1·x(bar)

● We can now solve this equation for the intercept to obtain the formula for the intercept:

b0 = y(bar) − b1·x(bar)

Correlation and the Line

● In summary, to find the least squares line:
● The slope of the line is found using the formula

b1 = r · (sy / sx)

where r is the correlation coefficient and sx and sy are the standard deviations of the independent and dependent variables, respectively.
● The intercept of the line is found using the formula

b0 = y(bar) − b1·x(bar)

● The least squares line has the formula

y(hat) = b0 + b1·x
Correlation and the Line

Least squares lines are commonly called regression lines. We’ll need to
check the same condition for regression as we did for correlation.
● Quantitative Variables Condition
● Linearity Condition
● Outlier Condition

Correlation and the Line

● (Recall) Previously, we said that the correlation coefficient can also be expressed using standardized values (z):

r = Σ(zx·zy) / (n − 1)

● These z-scores are also useful in interpreting regression models because they have the simple properties that their means are zero and their standard deviations are 1.
● In other words, we have

z(bar)x = z(bar)y = 0    and    s(zx) = s(zy) = 1
Correlation and the Line

● If we consider finding the least squares line for the standardized variables zx and zy, the formula for the slope simplifies to

b1 = r · (s(zy) / s(zx)) = r · (1/1) = r

● The intercept formula can be rewritten as

b0 = z(bar)y − b1·z(bar)x = 0 − r·0 = 0

● So if we are working with standardized values, the slope will be equal to r and the intercept is 0. The formula for the line will be

z(hat)y = r · zx

Correlation and the Line

From the least squares line formula

z(hat)y = r · zx

we can conclude, for example:

● If the value of x is one SD above the mean, the predicted standardized value of y will be equal to r.
● If the value of x is at the mean, the predicted standardized value of y will be equal to 0.
● If the two variables x and y have no linear relationship with each other (r = 0), then even if the value of x is one SD above the mean, the predicted standardized value of y will be equal to 0.
Correlation and the Line

● For our data on advertising costs and sales, the correlation coefficient r is 0.693. We can now express the relationship for the standardized variables as

z(hat)y = 0.693 · zx

● That means, for every standard deviation above (or below) the mean we are in advertising expenses, we'll predict that the sales are 0.693 standard deviations above (or below) their mean.

Regression to the Mean

▪ Regression to the mean indicates that each predicted value y(hat) is closer to its mean (in standardized terms) than its corresponding x.

▪ The reason can be seen from the best-fit line for standardized values, z(hat)y, which is

z(hat)y = r · zx

● For example, if x is 2 SDs above its mean, the prediction will never be more than 2 SDs away from the mean of y, since r can't be bigger than 1. For the majority of x values, the corresponding predicted value y(hat) is closer to its mean than 2 SDs.

Regression Lines

● Between two variables, we have only one correlation coefficient r.
● However, there are two regression lines: one that has x as the explanatory variable and one that has y as the explanatory variable.

Regression

Question: We have the following dataset for the number of salespeople working in an organization and the corresponding sales numbers.

Regression

1. What is the mean and standard deviation of the number of salespeople and of the sales volume?
2. If we employ 8 salespeople, how many standard deviations is
that above or below the mean?
3. Are these two variables correlated? If yes, is that a positive
correlation or a negative one? Is it a strong correlation?
4. If the relationship can be modelled by a regression line, what
is the equation of this line?
5. If the number of salespeople is increased by 1 person, how
much do you expect sales volume to change (increase or
decrease)?
6. By employing 15 salespeople in the organization, what
value of sales do you expect the organization will achieve?
7. If the number of people working is two standard deviations
above the mean, how many standard deviations above or
below the mean do you expect sales to be?
8. What value of sales does the answer to question 7 correspond
to?
Learning from the Residuals

● Remember, residuals are calculated as

e = y − y(hat)

● The residuals are the part of the data that has not been modeled.
● Residuals help us see whether the model makes sense
● A scatterplot of residuals against predicted values should show nothing interesting – no patterns, no direction, no shape (otherwise there's something wrong!)
● If nonlinearities, outliers, or clusters are seen in the residuals, then we must try to determine what the regression model missed.

Learning from the Residuals

The plot of the Amazon residuals is given below. It does not appear that there is anything interesting occurring.
● The variation in the residuals is the key to assessing how well a
model fits.

Learning from the Residuals

● The standard deviation of the residuals, se, gives us a measure of how much the points are spread around the regression line.
● We estimate the standard deviation of the residuals as

se = √( Σe^2 / (n − 2) )

● The standard deviation around the line should be the same wherever we apply the model – this is called the Equal Spread Condition.

Variations in the model and 𝑅^2

● All regression models fall somewhere between the two extremes of zero correlation and perfect correlation of plus or minus 1.
● We consider the square of the correlation coefficient r to get r^2, which is a value between 0 and 1.
● r^2 gives the fraction of the data's variation accounted for by the model, and 1 − r^2 is the fraction of the original variation left in the residuals.
● r^2 is by tradition written R^2 and called "R squared"
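A one-line check using the advertising example's correlation (squaring the rounded r = 0.693 gives about 0.480; the slides report 48.1%, presumably from the unrounded r):

    r = 0.693                        # correlation from the advertising example
    print(round(r ** 2, 3))          # ~0.480: fraction of variation explained
    print(round(1 - r ** 2, 3))      # ~0.520: fraction left in the residuals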

Variations in the model and 𝑅^2

● The majority of the data's variation is accounted for by the linear model when the data points are very close to a line → r is close to −1 or 1.

Results: Simple Regression

● se: you want it to be as small as possible
● R^2: you want it to be as large as possible

Variations in the model and R^2


● Thus 48.1% of the variation in sales (y) is accounted for (or explained) by the advertising expenses (x), and 1 − 0.481 = 0.519, or 51.9%, of the variability in sales has been left in the residuals (not explained by the advertising expenses (x)).

Variations in the model and R^2

Question: An economist wants to investigate whether the disparity in debt payments between different cities is due to differences in average income in those cities. Let variable x represent income and variable y represent debt payments. She studies the sample data containing information on 26 cities and finds the sample correlation coefficient between x and y to be 0.38.

Nonlinear Relationships

A regression model works well if the relationship between the two variables is linear. What should be done if the relationship is nonlinear? (For example, the relationship between the number of cell phones (000s) and HDI for countries.)
Nonlinear Relationships

● To use linear regression models:


● Transform or re-express one or both variables by a function such
as:
○ Logarithm
○ Square root
○ Reciprocal
Chapter 8: Randomness and Probability

Probability

● The (theoretical) probability of event A can be computed with the following equation (assuming all outcomes are equally likely):

P(A) = (number of outcomes in A) / (total number of possible outcomes)
● Example:
○ Picking a student at random, the probability that her/his
birthday is in the month of September.
○ If you draw a card from a standard deck of cards, what is the
probability of drawing a face card?

Probability Rules

Rule 1:

● If the probability of an event occurring is 0, the event can't occur.
○ Picking a student at random, P(he/she is from outer space) = 0.
● If the probability is 1, the event always occurs.
○ Picking a student at random, P(he/she is older than 3) = 1.
● Probability is a number between 0 and 1.
● For any event A, the probability P(A) is always between 0 and 1: 0 ≤ P(A) ≤ 1.
Probability Rules

Rule 2: the Probability Assignment rule

● The probability of the set of all possible outcomes must be 1:

P(S) = 1

where S represents the set of all possible outcomes and is called the sample space.

Probability Rules

Rule 3: the Complement Rule

● The probability of an event occurring is 1 minus the probability that it doesn't occur:

P(A) = 1 − P(A^c)

Complement of an Event

● The event A and its complement A^c

Probability Rules

Rule 4: the Multiplication Rule

● For two independent events A and B, the probability that both A and B occur is the product of the probabilities of the two events:

P(A and B) = P(A) × P(B)

provided that A and B are independent.

● The above equation can also be used to determine whether two events are independent: it holds if and only if they are independent.
Probability Rules

Rule 5: the Addition Rule

● Two events are disjoint (or mutually exclusive) if they have no outcomes in common.
● The Addition Rule allows us to add the probabilities of disjoint events to get the probability that either event occurs:

P(A or B) = P(A) + P(B), where A and B are disjoint

Disjoint Events

When A and B are disjoint: P(A or B) = P(A) + P(B)

When A and B are not disjoint: P(A or B) = P(A) + P(B) − P(A and B)

Probability Rules

Rule 6: the General Addition Rule

▪ The General Addition Rule calculates the probability that either of two events occurs. It does not require that the events be disjoint:

P(A or B) = P(A) + P(B) − P(A and B)

Joint Probability

Events may be placed in a contingency table such as the one in the example below.

Example: As part of a Pick Your Prize promotion, a store invited customers to choose which of three prizes they'd like to win. The responses could be placed in the following contingency table:

Marginal Probability

● Marginal probability depends only on the totals found in the margins of the table.
Joint Probability

● Joint probabilities give the probability of two events occurring together.

Conditional Probability

● Each row or column shows a conditional distribution given one event.

● In the table above, the probability that a selected customer wants a bike, given that we have selected a woman, is:
● P(bike | woman) = 30/251 = 0.120

Conditional Probability

● In general, when we want the probability of an event from a conditional distribution, we write P(B|A) and pronounce it "the probability of B given A."
● A probability that takes into account a given condition is called a conditional probability:

P(B|A) = P(A and B) / P(A)
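A minimal Python sketch of marginal, joint, and conditional probabilities from a contingency table; only the "30 women chose a bike" and "251 women in total" figures come from the slides, the remaining counts are made up for illustration:

    # Hypothetical Pick Your Prize counts: (sex, prize) -> count.
    counts = {
        ("woman", "bike"): 30, ("woman", "skis"): 120, ("woman", "skates"): 101,
        ("man", "bike"): 60,   ("man", "skis"): 130,   ("man", "skates"): 69,
    }
    total = sum(counts.values())

    p_woman = sum(v for (sex, _), v in counts.items() if sex == "woman") / total
    p_woman_and_bike = counts[("woman", "bike")] / total        # joint
    p_bike_given_woman = p_woman_and_bike / p_woman             # conditional
    print(round(p_bike_given_woman, 3))                         # 30/251 = 0.12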
Conditional Probability

● Question: Based on the following contingency table, find the following probabilities:

a. P(bike and woman) (this is a joint probability)
b. P(woman) (this is a marginal probability)
c. P(bike | woman) (this is a conditional probability)
d. Is the event "choosing skis" independent of the event "man"?
Conditional Probability

Rule 7: the General Multiplication Rule

● The General Multiplication Rule calculates the probability that both of two events occur. It does not require that the events be independent:

P(A and B) = P(A) × P(B|A)

Independent vs. Disjoint

● Events are independent if the occurrence of one event does not influence (and is not influenced by) the occurrence of the other(s).
● If two events are independent, P(A and B) = P(A)P(B)
● Independent events should not be confused with disjoint events. Two events are disjoint if only one of them can happen (example: heads and tails are two disjoint events when tossing a coin).

Independent Events

● Using the General Multiplication Rule, show why, if A and B are independent, then P(A and B) = P(A)P(B).
○ Hint: if A and B are independent, P(B|A) = P(B), so P(A and B) = P(A)P(B|A) = P(A)P(B).
Chapter 9: Random Variables and Probability Distributions

Random Variable

What is a random variable?

● A measure that represents the outcome of a random event

● A random variable (usually denoted by X) can be
○ discrete, in which case it assumes a countable number of distinct values,
○ or continuous, in which case it assumes an uncountable number of distinct values.

Random Outcomes of Random Variables

● Yes, the outcome of a random variable is random


○ The weight of a randomly selected student is random
○ The number of accidents on HWY-401 during 24 hours is
random
Probability Distribution

● Every random variable is associated with a probability distribution.


● Probability distributions capture the randomness inherent in
random variables.

Discrete Probability Distribution

● It is common to describe a discrete random variable using its probability mass function (pmf), which is a list of the possible outcomes of the random variable and their associated probabilities.
● In other words, the probability mass function is a list of all possible outcomes and their corresponding probabilities

Discrete Probability Distribution


● Question: Using a histogram, draw the pmf of the random variable X, where X represents the number of courses a full-time student took in Fall 2018.

Discrete Probability Distribution

Cumulative Distribution Function

● Discrete random variables can also be defined by their cumulative distribution function (abbreviated CDF).
● For any value x of the random variable X, the CDF is defined as P(X ≤ x).
● For instance, suppose you throw a fair die and the random variable X is the number showing:
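A minimal sketch of the fair-die CDF in Python, accumulating the pmf:

    # Fair die: pmf is P(X = i) = 1/6 for i = 1..6.
    pmf = {i: 1 / 6 for i in range(1, 7)}

    cdf, running = {}, 0.0
    for x in sorted(pmf):            # CDF: P(X <= x), a running total of the pmf
        running += pmf[x]
        cdf[x] = round(running, 3)
    print(cdf)                       # {1: 0.167, 2: 0.333, ..., 6: 1.0}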
Cumulative Distribution Function

● Question: Using a histogram, draw the CDF of the random variable X, where X represents the number of courses a full-time student took in Fall 2018.

Expected Value of a Random Variable

How would you interpret the expected value of a random variable?

● A weighted average of the outcomes of the random variable
● For a discrete random variable: weight = probability

E(X) = Σ xi · P(X = xi)

● Note: E(X) should not be confused with the most probable value of the random variable. It may not even be one of the possible values of the random variable.
Expected Value of a Random Variable

Expected Value (or why insurance could be a profitable business)

Question: The probability model for a particular life insurance policy is shown. Find the expected annual payout on a policy. How much do we expect this company to pay out per year, in the long run?
Variance and Standard Deviation of a Random Variable

● Variance talks about how the values are dispersed around the
expected value, whether they are closely clustered or scattered
around it.
● It is a measure of dispersion.

Variance and Standard Deviation of a Random Variable

● The variance of X, denoted Var(X), is calculated as:

Var(X) = Σ (xi − E(X))^2 · P(X = xi)

Variance and Standard Deviation of a Random Variable

● Whatever the unit of measurement for X is, the variance is measured in that unit squared (if X is in $, the variance is in $^2). The standard deviation of X is a measure of dispersion with the same unit as X. The standard deviation of the random variable X is

SD(X) = √Var(X)
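A minimal Python sketch computing E(X), Var(X), and SD(X) from a hypothetical pmf:

    import math

    pmf = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}       # hypothetical outcome -> probability
    assert abs(sum(pmf.values()) - 1.0) < 1e-9   # probabilities must sum to 1

    ex = sum(x * p for x, p in pmf.items())                # E(X): weighted average
    var = sum((x - ex) ** 2 * p for x, p in pmf.items())   # Var(X): squared units
    sd = math.sqrt(var)                                    # SD(X): same unit as X
    print(ex, round(var, 3), round(sd, 3))                 # 1.7, 0.81, 0.9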
Var(X) and SD(X)

Example: Find the variance and standard deviation of the annual payout.

● Remember, variance and standard deviation measure the average spread of the data around the mean.
● In the last column, we have the distance (deviation) from the mean.

Var(X) and SD(X)

Discrete Probability (Uniform Distribution)

● If X is a random variable with possible outcomes 1, 2, ..., n and P(X = i) = 1/n for each i, then we say X has a discrete Uniform distribution U[1, ..., n].
● Example: Tossing a die is described by the Uniform model U[1, 2, 3, 4, 5, 6], with P(X = i) = 1/6.
● Question: Is X a random variable?
Binomial Distribution

● An online vendor has 25 daily visitors on average. Each visitor


makes a purchase with probability 0.2. On any given day, what is
the probability that,
○ No customer makes a purchase?
○ Exactly 4 customers make a purchase?
○ Between 10 and 17 customers make a purchase?
○ Etc.
● Each visitor is either a purchasing customer or a non-purchasing customer
● Visitors are independent!
● The probability of purchase is known!
Discrete Probability (Bernoulli Trial)

Definition: A Bernoulli Trial is a trial with the following characteristics:

1. There are only two possible outcomes (success and failure) for
each trial.
2. The probability of success, denoted p, is the same for each trial.
The probability of failure is q = 1 – p.
3. The trials are independent.

Discrete Probability (Bernoulli Trial)

● So each trial has,

○ only two outcomes
○ each outcome has a probability, and the probabilities are complementary.
● Therefore, 𝑝 + 𝑞 = 1
● Or 𝑞 = 1 − 𝑝
Introduction: Binomial Distribution

Question 1: Let’s throw a die 3 times, what is the probability of rolling


the number 5 every time?

● Probability of success 𝑝 = 1/6.


● Probability of failure 𝑞 = 5/6.
● Let’s define:
○ A to be the event that number 5 is rolled the first time
○ B to be the event that number 5 is rolled the second time
○ C to be the event that number 5 is rolled the third time
● The question is, what is 𝑃(𝐴 and 𝐵 and 𝐶)? (Note the joint probability!)

Introduction: Binomial Distribution

Question 2: Now, let’s throw the die 3 times, what is the probability of
rolling the number 5 exactly 2 times?

● Again we have, Probability of success 𝑝 = 1/6.


● Probability of failure 𝑞 = ⅚
Introduction: Binomial Distribution

● Question 3: Now what if we want to find the probability of rolling the number 5 exactly twice in 10 rolls?
● What are some examples of desired events? (SSFFFFFFFF), (FFFFFFFFSS), (FFSSFFFFFF), ...
Binomial Distribution

● Predicting the number of successes in a series of Bernoulli trials.

● n = number of trials
● p = probability of success (and q = 1 − p = probability of failure)
● X = number of successes in n trials
● What is P(X = k)? For the Binomial model,

P(X = k) = C(n, k) × p^k × q^(n−k), where C(n, k) = n!/(k!(n − k)!)
Binomial Distribution

● The mean and standard deviation of the Binomial distribution are 𝜇 = 𝑛𝑝 and 𝜎 = √(𝑛𝑝𝑞).

Binomial Distribution

● An online vendor has 25 daily visitors on average. Each visitor


makes a purchase with probability 0.2. On any given day, what is
the probability that exactly 5 customers make a purchase?
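
A sketch of this calculation with the standard library, using the numbers from the vendor example:

```python
from math import comb, sqrt

n, p = 25, 0.2  # 25 daily visitors, purchase probability 0.2
q = 1 - p

def binom_pmf(k: int) -> float:
    """P(X = k) for a Binomial(n, p) random variable."""
    return comb(n, k) * p**k * q**(n - k)

print(binom_pmf(5))     # P(exactly 5 purchases) ~ 0.196
print(n * p)            # mean: 5.0
print(sqrt(n * p * q))  # standard deviation: 2.0
```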

Binomial Distribution

● An online vendor has 25 daily visitors on average. Each visitor


makes a purchase with probability 0.2. On any given day, what is
the probability that,
Binomial Distribution

Question: Based on historical data, 85% of claims received at a department of a small insurance company can clear the preliminary audit stage without requiring additional documents. Suppose the department usually receives 10 claims every day. What is the probability that on any given day 8 claims will successfully pass the preliminary audit without requiring more documents?
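
A quick numeric check under the Binomial model (a sketch; it treats the 10 claims as independent, each passing with p = 0.85):

```python
from math import comb

n, p, k = 10, 0.85, 8
prob = comb(n, k) * p**k * (1 - p)**(n - k)
print(prob)  # P(X = 8) ~ 0.276
```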
Binomial Distribution

● What is the probability that the number is 7? (x=7)


● What is the probability that the number is 6? (x=6)
● What is the probability that the number is 5? (x=5)
● What is the probability that the number is 4? (x=4)
● What is the probability that the number is 3? (x=3)
● The collection of all probabilities P(X=x) will give the binomial
distribution for this problem

Continuous Random Variables

● A continuous random variable is a random variable that may take


on any value in some interval [a, b].
● The distribution of the probabilities can be shown with a curve, f (x)
called a probability density function (pdf).
Continuous Random Variable

● Therefore, for a continuous r.v. 𝑋, it is only meaningful to find the probability that the value of 𝑋 falls within some specified interval: 𝑃(𝑎 ≤ 𝑋 ≤ 𝑏).

● This is because, for any single value 𝑎, 𝑃(𝑋 = 𝑎) = 0.

Probability Density Function

● For a continuous r.v. 𝑋, the counterpart of the pmf is used.
● It is called the probability density function, denoted by 𝑓(𝑥) (abbr. as PDF).
● Unlike the pmf, the PDF does not provide probabilities directly.
● The area under the PDF 𝑓(𝑥) between the two values 𝑎 and 𝑏 represents the probability 𝑃(𝑎 ≤ 𝑋 ≤ 𝑏).
● The entire area under 𝑓(𝑥) equals 1.
Continuous Random Variables

Density functions must satisfy these requirements:

1. They must be non-negative for every possible value.


2. The area under the curve must exactly equal 1.

● Probabilities like 𝑃(𝑋 ≤ 10) or 𝑃(8 ≤ 𝑋 ≤ 10) can be read from the density function by calculating the area under the density curve 𝑓(𝑥).

Probability Density Function

● Suppose the plot is the PDF of continuous r.v. 𝑋. Highlight the


probabilities.

Continuous Random Variable (Uniform)


Cumulative Distribution Function

● Continuous random variables can also be defined by their


cumulative distribution function (abbr. as CDF).
● For any value 𝑥 of the continuous random variable 𝑋, the CDF, denoted by 𝐹(𝑥), is defined as 𝐹(𝑥) = 𝑃(𝑋 ≤ 𝑥).
Cumulative Distribution Function

● Suppose the plot is the PDF of continuous r.v. 𝑋. Highlight the


probabilities.

The Normal Distribution

● It is the familiar bell-shaped distribution


● It is the most extensively used continuous probability distribution in
statistics
● It closely approximates the probability distribution for a wide range
of different random variables
● It is the cornerstone of statistical inference
● The possibilities for where the normal distribution can be used are endless; a few examples include:
○ Advertising expenditure of firms
○ Revenues generated by firms
○ Return on an investment
Normal Distribution

Characteristics of the normal distribution (PDF): 𝑓(𝑥) = (1/(𝜎√(2𝜋))) 𝑒^(−(𝑥−𝜇)²/(2𝜎²)), a bell-shaped curve symmetric about the mean 𝜇.

Normal Distribution

● How do we calculate these probabilities?


○ Using Integrals
○ Using software
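
As the notes suggest, software is the practical route. A sketch using scipy; the N(100, 15) parameters are assumed purely for illustration:

```python
from scipy.stats import norm

mu, sigma = 100, 15  # assumed mean and standard deviation

print(norm.cdf(120, mu, sigma))        # P(X <= 120) ~ 0.909
print(1 - norm.cdf(120, mu, sigma))    # P(X > 120)  ~ 0.091
print(norm.cdf(120, mu, sigma)
      - norm.cdf(90, mu, sigma))       # P(90 <= X <= 120) ~ 0.656
```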

Standard Normal Distribution

What is the standard normal distribution?

● It is a special case of the normal distribution with a mean equal to zero and a standard deviation equal to one.
● It is denoted by the letter Z: a random variable that is normal with 𝐸(𝑍) = 0 and 𝑉𝑎𝑟(𝑍) = 𝑆𝐷(𝑍) = 1.
● Each value of this random variable (Z) is actually a z-score.
Standard Normal Distribution

Z-Score (reminder)

Remember the formula: 𝑧 = (𝑥 − 𝜇)/𝜎

What does a z-score 2.2 imply?

● z-score 2.2 implies that the point is 2.2 standard deviations to the
right of the mean

What does a z-score -1.8 imply?

● A z-score of -1.8 implies that the point is 1.8 standard deviations to the left of the mean
Standardization

Convert the following x values to z-scores


Standardization

Suppose you have the z-score and want to find the x-value: 𝑥 = 𝜇 + 𝑧𝜎

The empirical rule

The 68-95-99.7 Rule (the Empirical Rule)

● In a normal distribution, about 68% of the values fall within one standard deviation of the mean, about 95% of the values fall within two standard deviations of the mean, and about 99.7% of the values fall within three standard deviations of the mean.
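
A quick sketch verifying the three percentages from the standard normal CDF (scipy assumed available):

```python
from scipy.stats import norm

for k in (1, 2, 3):
    # P(-k <= Z <= k) for a standard normal Z
    print(k, norm.cdf(k) - norm.cdf(-k))
# 1 -> ~0.6827, 2 -> ~0.9545, 3 -> ~0.9973
```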

Other Applications (an example)

● Sometimes, stock markets follow an uptrend (or downtrend) within


2 standard deviations of the mean. This is called moving within the
linear regression channel.
● Here is a chart of the Australian index (the All Ordinaries) from
2003 to Sep 2006.
zTable

● Provides cumulative probabilities, e.g., 𝑃(𝑍 ≤ 𝑧)

● Ex. P(Z ≤ -2.24) = 0.0125


zTable

Using the z table provided on the previous slide, find the following
probabilities.

zTable

● We can also find the z-score if we are given the probability.


For instance,
Adding Normal Variable

Adding and Subtracting Normally Distributed Variables

● An important property of Normal models is that the sum or difference of independent Normal random variables is also Normal.
● If 𝑋 and 𝑌 are independent Normal random variables, the mean of 𝑋 ± 𝑌 is 𝜇X ± 𝜇Y and its variance is 𝑉𝑎𝑟(𝑋) + 𝑉𝑎𝑟(𝑌) (variances add for both sums and differences).

Chapter 10: Sampling Distributions
Background (Example)

● A credit card company's marketing specialist had an idea: offer double air miles to customers with an airline-affiliated card if they increased their spending by at least $800 in the month following the offer.
● Before starting the actual program, in order to forecast the cost and revenue of the offer, the finance department of the credit card company needed to know what percentage of customers would actually qualify for the double miles.
● Two marketing specialists (Alice and Bob) have been tasked with finding out the true proportion of customers who would qualify for double miles.

Background (Example)

● Alice sent the offer to a random sample of 1000 customers. In that sample, she found that 211 (21.1%) of the cardholders increased their spending by more than the required $800.
● Bob sent the offer to a different random sample of 1000 customers. In his sample, 202 (20.2%) of the cardholders increased their spending by more than the required $800.
● Each sample gave a different result!
● What is the true proportion (percentage) of customers who would
increase their spending by $800?

Background (Example)

● If instead of two specialists we had say 100 of them and they each
collected a sample and found a proportion of customers who’d
increase their spending following the offer, what are your thoughts
on the distribution (shape, center, spread) of these different 100
proportion values?
● What would be the shape of this distribution?
● What would affect the center (mean) of this distribution?
● What would affect the spread (std. dev.) of this distribution?

Parameter vs. Statistic

● Recall: a parameter belongs to the population, whereas a statistic belongs to a sample.
● A parameter is a constant, although its value may be unknown.
● A statistic is a random variable, whose value depends on the
chosen random sample.
● There is only one population but many different samples of the
same size can be drawn.
● A statistic used to make inferences about the population parameter
is called an estimator or point estimator.

Sample Proportions

● If events have only two outcomes, we can call them “success” and
“failure”
● The proportion of “success” in a sample is called the “sample
proportion”.
● Examples: We would like to estimate the proportion of smokers
over the age of 25 in a city. We select 100 people from the city
(25+) and measure the proportion of them who smoke. This
proportion will be the sample proportion.
● The sample proportion is most probably different from the
population proportion (true proportion).

Sample Proportions
What are some examples of proportions?

● Proportion of credit card holders who increase spending after


a promotion
● Percentage of passengers who book first-class
● Probability (proportion) of pregnant women requiring a C-section

Why is the distribution of sample proportions important?

Sample Proportions

● The result of the simulation can be summarized in a table, as


below. (𝑝 = 0.25 and 𝑛 = 30)

Sample Proportions
● The result of the simulation can be summarized in a table, as
below. (𝑝 = 0.25 and 𝑛 = 70)

Sample Proportions

● Remember: the difference between sample proportions, referred to as sampling error, is not really an error. It's just the variability you'd expect to see from one sample to another. A better term might be sampling variability.

Sample Proportions

● Sample proportions vary from sample to sample: they have a distribution.

● The sample proportions appear to be distributed around the population proportion.
● Depending on the sample size, their dispersion (spread) around the population proportion is different.
● The distribution of sample proportions can be modeled by a normal distribution if,
○ All samples are independent
○ Sample size is large enough

Sample Proportions

● Based on what we observed, when certain conditions hold (more on that later!), we can model the distribution of sample proportions using a normal distribution.
● The sample proportions (p̂), when modeled by the normal distribution, will have the mean and standard deviation

𝜇(p̂) = 𝑝 and 𝑆𝐷(p̂) = √(𝑝𝑞/𝑛)
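
A sketch of these two formulas in Python, using the simulation settings quoted earlier (p = 0.25 with n = 30 and n = 70):

```python
from math import sqrt

def sd_of_sample_proportion(p: float, n: int) -> float:
    """Standard deviation of the sampling distribution of p-hat."""
    return sqrt(p * (1 - p) / n)

p = 0.25
print(sd_of_sample_proportion(p, 30))  # ~0.079: wider spread for small n
print(sd_of_sample_proportion(p, 70))  # ~0.052: narrower for larger n
```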

Sample Proportions

● In order to model the distribution of sample proportions using the


normal distribution, we need to make sure a number of
assumptions and conditions are satisfied.

1. Independence Assumption: The sampled values must be


independent of each other
2. Sample Size Assumption: The sample size, n, must be large
enough
3. Randomization Condition: If your data come from an experiment,
subjects should have been randomly assigned to treatments
4. 10% Condition: If sampling has not been made with replacement,
then the sample size, n, must be no larger than 10% of the
population.
5. Success/Failure Condition: The sample size must be big enough
so that both the number of “successes,” np, and the number of
“failures,” nq, are expected to be at least 10.

Sample Proportions

Question: Suppose we already know that 15% of the population of a


city smokes. Assuming that the population is homogeneous across
the city,

a. What are the mean and standard deviation of the sample proportion derived from a random sample of 100 randomly selected individuals?
b. What happens to the expected value (mean) and the standard deviation of p̂ if 𝑛 becomes larger?
Sample Distribution

Question: Rogers provides cable, phone, and internet services to customers, some of whom subscribe to packages including several services. Nationwide, suppose that 30% of Rogers customers are "package subscribers" who subscribe to all three types of service. A representative in Toronto wonders if the proportion in his region is the same as the national proportion. If the same proportion holds in his region and he takes a survey of 100 customers at random from his subscriber list:
a) What proportion of customers would you expect to be package
subscribers?

b) What is the standard deviation of the sample proportions?

c) What shape do you expect the sampling distribution of the proportion to have?

d) What is the probability that a sample from this population shows a


sample proportion that is at least 0.49? Would you be surprised to draw
a sample with that proportion or more extreme? What conclusion would
you draw based on this sample?
Sample Mean

Some Notations…

● Let 𝑌 be the random variable representing a certain characteristic of the population (e.g., starting salary of business graduates from Canadian universities). Then 𝜇 and 𝜎 denote the population mean and standard deviation of 𝑌, and ȳ denotes the mean of a random sample of size 𝑛.
Sampling Distribution of Ȳ

● Suppose we have the sampling distribution of Ȳ; what can be derived from this information?

● As you can see, it is extremely useful to find the sampling distribution of Ȳ

Sample Mean

● The distribution of the sample mean, ȳ, when approximated by the normal distribution, has the parameters

𝐸(ȳ) = 𝜇 and 𝑆𝐷(ȳ) = 𝜎/√𝑛

● where 𝜎 is the standard deviation of the population and 𝑛 is the sample size.

Central Limit Theorem

● Central Limit Theorem (CLT): The sampling distribution of any


mean becomes Normal as the sample size grows.
● This is true regardless of the shape of the population distribution!
● The Central Limit Theorem talks about the sample means and
sample proportions of many different random samples drawn from
the same population.
● For large enough sample sizes (at least 30) sampling distribution
of sample means can be approximated by a Normal model. The
larger the sample, the better the approximation will be.

Sample Mean Distribution

● Based on the Central Limit Theorem (CLT), the distribution of the sample mean (ȳ) is approximately normal, regardless of the distribution of the population, if the sample size (n) is sufficiently large.
● If the population is normally distributed, the sample mean (ȳ) is normally distributed, regardless of the sample size.

Sample Mean

Assumptions and Conditions

● Independence Assumption: The sampled values must be


independent of each other
● Randomization Condition: The data values must be sampled
randomly.
● 10% Condition: When the sample is drawn without
replacement, the sample size, n, should be no more than 10%
of the population
● Large Enough Sample Condition: If the population is
unimodal and symmetric, even a fairly small sample is okay.
For skewed distributions, larger sample sizes are required for
distribution of means to be approximately Normal

Sample Mean

Question: A potato chips manufacturer produces bags of chips with


weights that are normally distributed with mean 60 grams and standard
deviation 2 grams.
a) If you select one bag of chips randomly, what is the probability that it
weighs less than 59 grams?

b) If you select 20 bags of chips randomly, what is the probability that the
average weight of this sample is less than 59 grams?

c) If you select 10 bags of chips randomly, what is the probability that the average weight of this sample is greater than 59 grams?
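
A numeric sketch of all three parts (scipy assumed; because the population is Normal, the sample mean is Normal for any sample size):

```python
from math import sqrt
from scipy.stats import norm

mu, sigma = 60, 2

# a) One bag: X ~ N(60, 2)
print(norm.cdf(59, mu, sigma))                 # P(X < 59) ~ 0.309

# b) n = 20: y-bar ~ N(60, 2 / sqrt(20))
print(norm.cdf(59, mu, sigma / sqrt(20)))      # ~0.013

# c) n = 10: P(y-bar > 59)
print(1 - norm.cdf(59, mu, sigma / sqrt(10)))  # ~0.943
```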
Central Limit Theorem

Question: A potato chips manufacturer produces bags of chips which


weigh on average 60 grams with a standard deviation 2 grams.

a) If you select one bag of chips randomly, what is the probability that it
weighs less than 59 grams?

b) If you select 35 bags of chips randomly, what is the probability that the
average weight of this sample is less than 59 grams?
Example

Question: According to a survey from Statistics Canada, in 2011, the


average earnings in Vancouver for educational services was $42,800.
Suppose the standard deviation of earnings in Educational Services for
the whole of Vancouver was $12,500, and we want the standard
deviation of average earnings in our survey to be at most $1000. What is
the smallest sample that will satisfy this criterion?
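
Since 𝑆𝐷(ȳ) = 𝜎/√𝑛, requiring 𝑆𝐷(ȳ) ≤ $1000 means 𝑛 ≥ (𝜎/1000)². A sketch of the arithmetic:

```python
from math import ceil

sigma, target_sd = 12_500, 1_000

# SD(y-bar) = sigma / sqrt(n) <= target_sd  =>  n >= (sigma / target_sd)^2
n_min = ceil((sigma / target_sd) ** 2)
print(n_min)  # 157, since (12500 / 1000)^2 = 156.25
```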
Sampling Distributions

● We now have the sampling distribution of (a) sample proportions,


and (b) sample means.

Standard Error

● Population parameters are usually unknown.


● Therefore, we usually use sample characteristics (statistics) to "estimate" population parameters.
● Whenever we estimate the standard deviation of a sampling distribution, we call it a standard error (SE).
Standard Error

● For a sample proportion, p̂, the standard error is 𝑆𝐸(p̂) = √(p̂q̂/𝑛)

Example

Question: A study commissioned by a clothing manufacturer measured


the “waist sizes” of a random sample of 250 men. The mean and
standard deviation of the waist sizes for all 250 men are 36.33 inches
and 4.019 inches, respectively. What do you expect the standard
deviation of the sampling distribution of the mean to be?
Sampling Distribution

Question: When a truckload of apples arrives at a packing plant, a


random sample of 150 is selected and examined for bruises,
discolouration, and other defects. The whole truckload will be rejected if
more than 5% of the sample is unsatisfactory. Suppose that in fact 8% of
the apples on the truck do not meet the desired standard. What’s the
probability that the shipment will be accepted anyway?

Chapter 11: Confidence Intervals for Proportions

A Confidence Interval

Example: A Gallup Poll found that 1495 out of 3559 respondents thought economic conditions were getting better – a sample proportion of p̂ = 1495/3559 ≈ 0.42.
● We’d like to use this sample proportion to say something about
what proportion, p, of the entire population thinks the economic
conditions are getting better.
● We have information on a sample, we would like to estimate
information about the population (we do not have population
characteristics).

Confidence Interval

● Confidence intervals are about a population parameter.

● They always involve a confidence level (certainty level), for example 90%, 95%, etc.
● They provide an interval (range)

Discussing the C.I.

Why is it a range and not a point?

● Because a point estimate from a single sample will almost never exactly equal the population parameter; a range acknowledges the sampling variability.

OK, a range is fine, but why is it coupled with a probability?

● The probability is the confidence level we have in the interval


Refresher on Z

Question: Using the standard normal table, find an upper and lower
bound such that,

(assume |lower bound| = |upper bound|).

𝑃 (𝑙𝑜𝑤𝑒𝑟 𝑏𝑜𝑢𝑛𝑑 ≤ 𝑍 ≤ 𝑢𝑝𝑝𝑒𝑟 𝑏𝑜𝑢𝑛𝑑) = 0.95

Confidence Intervals

● A C. I. provides a range of values that, with a certain level of confidence, contains the population parameter.
● In this course, we talk about confidence intervals for,
○ Population proportions
○ Population means (when 𝜎 is known)
○ Population means (when 𝜎 is unknown)
● What is the general format of a C. I. for population mean or
proportion?
CI for the p
CI for 𝜇 When 𝜎 is Known

We will return to this (CI for 𝜇 when 𝜎 is known) more formally in


Chapter 13.

CI and other Confidence Levels

● We can have confidence intervals with any desired confidence


levels.
● The higher the confidence level the wider the interval.
● Confidence level = 100 (1 − 𝛼 )%
● 𝛼 is referred to as “probability of error” (why?) or the significance
level.
A Confidence Interval

● We know that our sampling distribution model is centered at the true proportion, p, and we know the standard deviation of the sampling distribution is 𝑆𝐷(p̂) = √(𝑝𝑞/𝑛).

● We also know from the previous chapter that the shape of the
sampling distribution is approximately Normal and we can use p̂ to
find the standard error.

A Confidence Interval

● Remember the example.

Example: A Gallup Poll found that 1495 out of 3559 respondents thought economic conditions were getting better – a sample proportion of p̂ ≈ 0.42.
A Confidence Interval

● We are 95% confident that the population proportion 𝑝 is within


1.96 × 𝑆𝐸(p̂) = 1.96 × 0.008 ≅ 0.016 of p̂ (0.42).
● In other words, we are 95% confident that the population
proportion 𝑝 is between 0.42 − 0.016 = 0.404 and 0.42 + 0.016 =
0.436.

A Confidence Interval

Here are a few things we can say about the population by looking at
this sample.

1. There is no way to be sure that the population proportion is the


same as the sample proportion.
2. We can be pretty certain that whatever the true proportion is, it’s
probably not exactly 42% (but most probably very close to it, but
how close?).
3. We can provide a range (interval) that most probably (not 100%)
contains the true population proportion.
4. We don’t know the exact proportion of adults who thought the
economy was improving but the interval from 40.4% and 43.6%
probably contains the true proportion. What do we mean by
probably?
5. We are 95% confident that between 40.4% and 43.6% of adults
thought the economy was improving.
● What we have here, is an example of a confidence interval. An
interval that with some level of certainty should contain our
desired value.

A Confidence Interval

What does it mean when we say we have 95% confidence that our
interval contains the true proportion?

● Our uncertainty is about whether the particular sample we have at


hand is one of the successful ones or one of the 5% that fail to
produce an interval that captures the true value.
● If other pollsters would have collected samples, their confidence
intervals would have been centered at the proportions they
observed.

Confidence Intervals

Below we see the confidence intervals produced by simulating 20


samples.

● The purple dots are the simulated proportions of adults who


thought the economy was improving (sample proportions). The
orange segments show each sample’s confidence intervals. The
green line represents the true proportion of the entire population.
● Note: Not all confidence intervals capture the true proportion.

Certainty vs. Precision


The confidence interval, in general, can be expressed as below:

point estimate ± margin of error (ME)

● The extent of the interval on either side of p̂ is called the margin of error (ME).

● The more confident we want to be, the larger the margin of error must be (larger ME → more confidence)!
● Smaller ME → less confidence!

Critical Value

● Recall that for the 95% confidence interval, the margin of error was ME = 1.96 × 𝑆𝐸(p̂).

● Now if we want to increase the confidence level, the interval needs


to become wider
● What happens if we increase the sample size?
○ Nothing happens to the confidence level
○ The ME will decrease
(1 − 𝛼)%

● The confidence level of a confidence interval is usually denoted by 100(1 − 𝛼)%.
● For a 95% confidence interval [100(1 − 𝛼)% = 95%], 𝛼 is equal to 0.05.
● Draw a representation of the distribution for p̂. Specify and explain 𝑧𝛼/2.

CI for Population Proportion

● As a result, the confidence interval for 𝑝 becomes

p̂ ± 𝑧* × √(p̂q̂/𝑛)

● This formula is valid only if p̂ (approximately) follows a normal distribution.
● This condition is evaluated at the sample proportion p̂ by checking that 𝑛p̂ ≥ 10 and 𝑛(1 − p̂) ≥ 10.
Certainty vs. Precision

● Every confidence interval is a balance between certainty and


precision.
● As the interval grows (larger ME), certainty grows too, but
precision decreases. The larger the interval the less precision we
have, but we will have more certainty!
● Fortunately, we can usually be both sufficiently certain and
sufficiently precise to make useful statements.
● To change the confidence level, we’ll need to change the number
of SEs to correspond to the new level.
● For any confidence level the number of SEs we must stretch out
on either side of p̂ is called the critical value.

Critical Values for CIs

Question: Using the standard normal table, find the critical value 𝑧∗ for,
Critical Values

● A 90% confidence interval has a critical value of 1.645.

Assumptions and Conditions

● Independence Assumption: Is there any reason to believe that


the data values somehow affect each other?
● Randomization Condition: Proper randomization can help ensure
independence.
● 10% Condition: If the sample exceeds 10% of the population, the
probability of a success changes so much during the sampling that
a Normal model may no longer be appropriate.
● Success/Failure Condition: We must expect our sample to
contain at least 10 “successes” and at least 10 “failures”. So we
check that both np̂ ≥ 10 and nq̂ ≥ 10

Confidence Intervals

Question: A survey of 200 students is selected randomly on a large


university campus. They’re asked if they use a laptop in class to take
notes. Suppose that based on the survey, 70 of the 200 students
responded “yes.”

a) What is the value of the sample proportion?

b) What is the standard error of the sample proportion?

c) Construct an approximate 90% confidence interval for the true


proportion 𝑝.
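
A sketch of all three parts (z* = 1.645 for 90% confidence, as noted above):

```python
from math import sqrt

count, n = 70, 200
p_hat = count / n                   # a) sample proportion = 0.35
se = sqrt(p_hat * (1 - p_hat) / n)  # b) standard error ~ 0.034

z_star = 1.645                      # critical value for 90% confidence
me = z_star * se
print(p_hat, se)
print(p_hat - me, p_hat + me)       # c) 90% CI ~ (0.295, 0.405)
```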
CI for Population Proportion

Question: We would like to know the proportion of households in


Hamilton Mountain neighborhood that do proper recycling of garbage. We randomly select 140 households and ask a resident of each to fill out a survey. The sample proportion turns out to be 92%. Find the,

a) 90% confidence interval for the proportion of all households in


Hamilton Mountain that do proper recycling.
b) 95% confidence interval for the proportion of all households in
Hamilton Mountain that do proper recycling.
Choosing the Sample Size

● To get a narrower confidence interval without giving up confidence,


we must choose a larger sample.
● Example: Suppose a company wants to offer a new service and
wants to estimate, to within 3%, the proportion of customers who
are likely to purchase this new service with 95% confidence. How
large a sample do they need (at least)?
● To answer this question, we look at the margin of error.

Choosing the Sample Size

Example (continued):

● We see that this question can’t be answered because there are


two unknown values, p̂ and n.
● We proceed by guessing the worst case scenario for p̂. We guess
p̂ is 0.50 because this makes the SD (and therefore n) the largest.
● We may now compute n.

Sample Size Selection (for p)

● For a desired margin of error 𝑀𝐸, the minimum sample size 𝑛 required to estimate a 100(1 − 𝛼)% confidence interval for the population proportion 𝑝 is

𝑛 = (𝑧*/𝑀𝐸)² × p̂q̂

● where p̂ is a reasonable estimate of 𝑝 in the planning stage. We usually pick p̂ = 0.5 as a conservative measure.

Choosing the Sample Size

● Usually a margin of error of 5% or less is acceptable.


● However, to cut the margin of error in half, you will have to
quadruple the sample size.
● The sample size in a survey is the number of respondents, not the
number of questionnaires sent or phone numbers dialed, so
increasing the sample size can dramatically increase the cost and
time needed to collect the data.

Sample Size Selection (for p)


Question: We would like to know the 90% confidence interval for the proportion of Uber drivers in Toronto who are satisfied with their job. If we want the margin of error to be within 12%, what is the minimum sample size 𝑛 required?
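
A sketch of the calculation (z* = 1.645 for 90% confidence; p̂ = 0.5 as the conservative planning value):

```python
from math import ceil

z_star, me, p_hat = 1.645, 0.12, 0.5

n_min = ceil((z_star / me) ** 2 * p_hat * (1 - p_hat))
print(n_min)  # 47
```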

Comparing Two Proportions

● So far we have focused on one population parameter only

● But sometimes we need to compare the values of two populations (e.g., two population proportions, two population means, etc.)
● Example: Statistics Canada's Survey of Household Spending reports that 66.5% of households in Ontario spend money on lottery, whereas the figure for Manitoba is 75.2%. How large is the difference between these two provinces? (𝑛𝑂 = 4151 and 𝑛𝑀 = 389)

P𝑀 − P𝑂

What is the sample statistic used to estimate P𝑀 − P𝑂?

Elaborate on the distribution.

Comparing Two Proportions

● Not knowing what the true population proportions for the two
provinces are, what we can do is construct a confidence interval
for the difference between the two true population proportions (P𝑀
−P𝑂).
● But we also need the standard error (SE) of the difference (p̂𝑀 − p̂𝑂). The SE of the difference cannot be calculated by simply subtracting the two standard errors.
● To find the SE of the difference we add the variances:

𝑉𝑎𝑟(p̂𝑀 − p̂𝑂) = 𝑉𝑎𝑟(p̂𝑀) + 𝑉𝑎𝑟(p̂𝑂), so 𝑆𝐸(p̂𝑀 − p̂𝑂) = √(𝑆𝐸(p̂𝑀)² + 𝑆𝐸(p̂𝑂)²)
Comparing Two Proportions

● So the confidence interval for this difference will be: estimate ± critical value × SE
● For a 95% confidence interval we have

(p̂𝑀 − p̂𝑂) ± 1.96 × 𝑆𝐸(p̂𝑀 − p̂𝑂)

● What does this interval mean?

CI for Diff. between Two Proportions

How to develop a confidence interval for the difference between the


proportions of two populations?

1) Take a sample of size n1 from the first population


2) Take a sample of size n2 from the second population
3) Calculate the standard error of proportion for the first sample
4) Calculate the standard error of proportion for the second sample
5) Calculate the standard error of the difference of proportions
6) Determine the critical value based on the certainty level

CI for Diff. between two Proportions

● Use the formula below to calculate the confidence interval for the difference between two population proportions:

(p̂1 − p̂2) ± 𝑧* × √(p̂1q̂1/𝑛1 + p̂2q̂2/𝑛2)
CI for Difference between Two Proportions

● A manager wants to find out if the satisfaction level at the


company’s Boston office is higher (lower) than that in the
company’s Toronto office. She collects feedback from 200
customers at the Boston office and 220 customers at the Toronto
office. The proportion of customers who were “satisfied or highly
satisfied” at the Boston and Toronto offices was found to be 81%
and 85%, respectively. Create a 90% confidence interval for P𝐵 −
P𝑇 and comment on your findings.
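
A sketch of this interval (z* = 1.645 for 90% confidence):

```python
from math import sqrt

p_b, n_b = 0.81, 200  # Boston
p_t, n_t = 0.85, 220  # Toronto

se_diff = sqrt(p_b * (1 - p_b) / n_b + p_t * (1 - p_t) / n_t)
diff = p_b - p_t
me = 1.645 * se_diff

print(diff - me, diff + me)  # ~(-0.100, 0.020); 0 lies inside, so the
                             # data show no significant difference at 90%
```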
Chapter 12: Testing Hypotheses about Proportions

Hypothesis Testing (Proportions)

● Is the Toronto Stock Exchange just as likely to move higher as it is


to move lower on any given day?
● In order to test this out, we collect data for 1000 days and find that
the proportion of up days is 0.515. That is more “up” days than
“down” days.
● But is it far enough from 50% to cast doubt on the assumption of
equally likely up or down movement?
● To test whether the daily fluctuations are equally likely to be up as
down, we assume that they are, and that any apparent difference
from 50% is just random fluctuation.
● We need to test our assumption that the proportions of “up” and
“down” are 50%.
Hypotheses

● Assume that the true proportion of "up" days is 𝑝.

● The null hypothesis, 𝐻0, specifies a population model parameter and proposes a value for that parameter.
● We usually write a null hypothesis about a proportion in the form 𝐻0: 𝑝 = 𝑝0.

● For our hypothesis about the TSX, we need to test 𝐻0: 𝑝 = 0.5.

● The alternative hypothesis, 𝐻A, contains the values of the parameter that we consider plausible if we reject the null hypothesis. Our alternative is 𝐻A: 𝑝 ≠ 0.5.

Hypotheses

● Example: A supplier of stainless steel kitchen utensils is having


0.86% of its merchandise returned as a result of corrosion of the
steel. The company improves the quality control on the production
process, monitors 2000 shipments chosen at random, and has
0.53% of the merchandise returned for corrosion issues. We are
hoping that the proportion of merchandise returned will go down.
● To test this, we would like to test whether the proportion of returns is still 0.86% or whether it has indeed gone down as a result of our improvements. So we have 𝐻0: 𝑝 = 0.0086 and 𝐻A: 𝑝 < 0.0086.

Hypotheses
● In our example about the TSX, the alternative hypothesis is known
as a two-sided alternative, because we are equally interested in
deviations on either side of the null hypothesis value. 𝐻A: 𝑝 ≠ 0.5.
○ Because in the alternative hypothesis any value other than
0.5 would be accepted.
● In our example about returned merchandise, the alternative
hypothesis is called a one-sided alternative because it focuses
on deviations from the null hypothesis value in only one direction.
𝐻A : 𝑝 < 0.0086.
● Because in the alternative hypothesis, only values smaller than
0.0086 will be accepted.

Hypotheses

Things to consider:

● Don’t put the issue that you are investigating into the null
hypothesis.
● The issue that you are investigating should go in the alternative
hypothesis.
● Don’t have different numbers in the null and alternative
hypotheses.
● The numerical values are always the same.

Trial as Hypothesis Test

● This is the logic of jury trials. In British common law, the null
hypothesis is that the defendant is innocent.
● The evidence takes the form of facts that seem to contradict the
presumption of innocence. For us, this means collecting data.
● The jury considers the evidence in light of the presumption of
innocence and judges whether the evidence against the defendant
would be plausible if the defendant were in fact innocent.
● Like the jury, we ask: “Could these data plausibly have happened
by chance if the null hypothesis were true?”
P-Value

● Once we have our null and alternative hypotheses, we will


look at the data from our sample.
● The fundamental question is as follows: "How likely would the data we observed (sample) be if the null hypothesis were true?"
● The answer to this question is a probability.
● In other words, we are trying to figure out “the probability of
seeing a sample like what we found, or something even less likely,
given that the null hypothesis were actually true”.
● This probability is called the p-Value, or the “plausibility value”.

P-Value

● Example: We flipped a coin 1000 times and produced 400 heads and 600 tails. So if we are to measure the proportion of heads, we have p̂ = 0.4
● The hypotheses are 𝐻0: 𝑝 = 0.5 and 𝐻A: 𝑝 ≠ 0.5

● The p-value asks: Let’s first assume that the null hypothesis is
true, now given that assumption, how likely would it be to
produce a sample like we produced, or a sample even less likely,
again given that the null hypothesis were true.

P-Value

● A very low p-Value indicates that the probability of getting a sample like the one we got (or one that's even less likely) is too small; therefore, the assumption that the null hypothesis is true cannot be right.
○ Example: the coin (1000 flips, 45 Heads, 955 Tails)
● A very low p-Value → depending on your threshold or tolerance,
you might (a) conclude that the null hypothesis cannot be true, or
(b) still believe in the null hypothesis to be true and claim that this
was one of those rare cases that a rare sample is drawn.
● If the p-Value is too small for you to believe that null hypothesis
holds → you have reasons to reject the null hypothesis!

P-Value

● If the p-Value is high (or just not low enough), we can conclude that we haven't seen anything unlikely → the data are consistent with the null hypothesis as it stands.
● We have no reason to reject the null hypothesis.
● In other words, we fail to reject the null hypothesis!
○ Example: The coin (1000 flips, 520 Heads, 480 Tails)

P-Value

● Remember the example: Is the Toronto Stock Exchange just as


likely to move higher as it is to move lower on any given day? In
order to test this out, we collect data for 1000 days and find that
the proportion of up days is 0.515. That is more “up” days than
“down” days. (p̂ = 0.515)

● Let’s test to see, how likely it is to get a sample like the one we got,
or one that is less likely, given that we assume the null hypothesis
were true (assume that 𝑝 = 0.5)
P-Value

● To solve this, we have 𝑝 = 0.5 (assumption). So we calculate the standard deviation of the sampling distribution (sample proportions):

𝑆𝐷(p̂) = √(𝑝𝑞/𝑛) = √(0.5 × 0.5/1000) ≈ 0.0158

● Let's first find how many standard deviations p̂ is away from 𝑝:

𝑧 = (p̂ − 𝑝)/𝑆𝐷(p̂) = (0.515 − 0.5)/0.0158 ≈ 0.949

P-Value
● Because this is a two-sided alternative hypothesis, we are basically trying to calculate

𝑃(p̂ ≥ 0.515 or p̂ ≤ 0.485) = 2 × 𝑃(𝑍 ≥ 0.949)

P-Value

● Because this is a two-sided test, we need to add up the


probabilities in the two red areas.
● Because, we are interested in the probability of finding a sample
with the “up” days having a proportion greater than 0.515 or more
extreme in both directions (also less than 0.485).
● From the standard normal table, we can find the value 𝑃(𝑍 < −0.949) = 𝑃(𝑍 > 0.949) = 0.171
● So the p-Value will be equal to 2 × 0.171 = 0.342
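
The same computation as a sketch in Python (scipy assumed):

```python
from math import sqrt
from scipy.stats import norm

p0, n, p_hat = 0.5, 1000, 0.515

sd = sqrt(p0 * (1 - p0) / n)  # ~0.0158
z = (p_hat - p0) / sd         # ~0.949
p_value = 2 * norm.sf(z)      # two-sided p-value ~ 0.343
print(z, p_value)             # large p-value: fail to reject H0
```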

P-Value

● For the two-sided test, the p-Value is the probability of deviation in


both directions from the null hypothesis.
● For the one-sided test, the p-Value is the probability of deviation
only in the direction of the alternative hypothesis, away from
the null hypothesis.
● for example, the graph below shows a one-sided test. The p-Value
is the probability of the red area.
P-Value

Question: The SmartWool analyst is now testing hypotheses about


whether the proportion of website visitors who make a purchase has
increased from 0.2 since the redesign of the website.

She collects a random sample of 50 visits since the new website has gone online and finds that 24% of them made purchases.

● What are the null and alternative hypotheses?


● What conclusions can she draw?
Alpha and Significance

● As humans, we are suspicious of “rare” events.


● If the p-Value is small, we may lean towards concluding that such a sample, with such a small probability, could not have happened by chance.
● Since the data did actually happen, something must be wrong →
null hypothesis cannot be trusted!
● We can define a “rare event” arbitrarily by setting a threshold for
our p-value. If our p-value falls below that point, we’ll reject the null
hypothesis.
● We call such results statistically significant.
● The threshold is called an alpha level. Not surprisingly, it’s labeled
with the Greek letter α.

Alpha and Significance

● Common 𝛼 values are 0.1, 0.05 and 0.01.


● Alpha is determined based on our tolerance for risk.
● Depending on the tolerance for risk the significance value 𝛼 could
be different.
● The alpha level is also called the significance level.
● 𝛼 = 0.01 is also referred to as 1% significance level.
● 𝛼 = 0.05 is also referred to as 5% significance level.
● 𝛼 = 0.1 is also referred to as 10% significance level.

Alpha and p-Value

● If your p-Value is smaller than 𝛼 (it fell below the threshold) → you have enough evidence to reject the null hypothesis.
● If your p-Value is greater than 𝛼 (it did not fall below the threshold) → you do not have enough evidence to reject the null hypothesis → you fail to reject the null hypothesis.
Alpha and p-Value

Question: For each situation (introduced on the previous slide), draw a normal curve representing the sampling distribution of p̂. In each case, specify p̂, 𝛼 and the p-value using a one-sided test.
Critical Value

● Critical values could also be used in hypothesis testing.


● Any z-values more extreme than the critical value, would mean
that we have enough evidence to reject the null hypothesis.
● In other words, any z-value more extreme than the critical value,
means that the p-Value was smaller than the significance level 𝛼.
● The reason is the critical value 𝑧∗ is derived from the significance
level.
● This is an alternative way for hypothesis testing to calculating the
p-Value.

Critical Value

● The critical value is the 𝑧∗ that sits on the border of the 𝛼 region. For a one-sided case, all of the 𝛼 region is on one side.
● Example: For a right-sided (one-sided) test with 𝛼 = 0.05, the critical value will be 𝑧∗ = 1.645.
Critical Value

● The 𝛼 region is divided equally between the two tails if we have a two-sided (two-tailed) test.
● Example: for a two-sided test with 𝛼 = 0.05, the critical values are 𝑧∗ = ±1.96.
Critical value and z

1) If the z for our test statistics (sample) is more extreme than the
critical value (𝑧∗) (z is further away from the mean than 𝑧∗
is)→Then it is in the critical region→we will reject the null
hypothesis.
a) p-value is smaller than 𝛼
2) If the z is less extreme than the critical value (𝑧∗) (z is closer to the
mean than 𝑧∗ is)→then it is not in the critical region→we cannot
(fail to) reject the null hypothesis.
a) p-value is greater than 𝛼
Critical Value

Question: A supplier of stainless steel kitchen utensils is receiving 0.86% of its merchandise returned as a result of corrosion of the steel. The company improves the quality control on the production process, takes a sample of 2000 shipments chosen at random, and receives 0.53% of the merchandise returned for corrosion issues.

a) What are the null and alternative hypotheses?


b) By comparing the z-score and the critical value, conduct the test at
𝛼 = 0.01. Has the process actually improved?
Critical Value

Question: Find the critical value for the following tests:

a) A two-sided test at 𝛼 = 0.05


b) A one-sided (left sided) test at 𝛼 = 0.05
c) A two-sided test at 𝛼 = 0.1
d) A one-sided (right sided) test at 𝛼 = 0.1
e) Compare results in parts (b) and (c). Comment on the comparison!
CI and Hypothesis Tests

● An alternative way to do hypothesis testing is by looking at the


confidence interval constructed around the sample proportion.
● For example, assume we have a two-sided hypothesis test with
a significance level of 5% (𝛼 = 0.05).
● Suppose we construct a 95% confidence interval around the sample proportion p̂, using the formula p̂ ± 1.96 × 𝑆𝐸(p̂).

● Then we should be 95% confident that this interval contains the population proportion 𝑝.
● If the 95% CI does not contain the hypothesized proportion, then at the 5% significance level we can reject the null hypothesis.

CI and Hypothesis Tests

● The idea is: if the sample proportion p̂ we found is so far from the hypothesized proportion that it leads us to reject the null hypothesis, then a confidence interval around p̂ (using SE) should not contain that hypothesized value 𝑝.
Comparing Two Proportions

● Example: A market research company has learned through a


survey taken by 1002 individuals that 36% would consider
purchasing a particular product. The company conducted a similar
survey last year, where they found that 41% of the 1980 individuals
surveyed considered purchasing the product. The question is: has
there been a change in the attractiveness of the product over the
past year (significance level is 5% or 𝛼 = 0.05)?
● 𝐻0: there is no difference between the proportion of customers who'd buy the product today and last year.
● 𝐻A : there is a difference between the proportion of customers
who’d buy the product today and last year.

Comparing Two Proportions

● We have two sample proportions p̂1 = 0.36 and p̂2 = 0.41.


● We have two sample sizes n1 = 1002 and n2 = 1980.

● So we are not making any assumption on the values of population


proportions p1 and p2 . All we are investigating is whether or not
they are equal.
● In this case, we need to find the z-score of the estimate p̂1 − p̂2. The z-score will be

𝑧 = (p̂1 − p̂2)/𝑆𝐸(p̂1 − p̂2)
Comparing Two Proportions

● Remember, in hypothesis tests we use the population value to find the SE. But what is the population value? Let's assume 𝐻0: p1 = p2 = p0.
● Then,

𝑆𝐸(p̂1 − p̂2) = √(p0q0/𝑛1 + p0q0/𝑛2)

● But we don't know what the unknown value p0 is; therefore, we will need to estimate it using the pooled proportion,

p̂pooled = (𝑛1p̂1 + 𝑛2p̂2)/(𝑛1 + 𝑛2)

Comparing Two Proportions

● Therefore, 𝑆𝐸(p̂1 − p̂2) becomes 𝑆𝐸pooled(p̂1 − p̂2) = √(p̂pooled q̂pooled (1/𝑛1 + 1/𝑛2))

● Note that this is only for a hypothesis test investigating whether two population proportions are equal.
● If a hypothesis test instead investigates whether the difference between two population proportions equals a certain number 𝑲, a different formulation should be used!
Comparing Two Proportions

● A summary of formulas used for hypothesis testing for the


difference between two population proportions.

Comparing Two Proportions

● Example: A market research company has learned through a


survey taken by 1002 individuals that 36% would consider
purchasing a particular product. The company conducted a similar
survey last year, where they found that 41% of the 1980 individuals
surveyed considered purchasing the product. The question is: has
there been a change in the attractiveness of the product over the
past year (significance level is 5% or 𝛼 = 0.05)?
● Specify the null and alternative hypothesis.
● Conduct the HT at the specified significance level.
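
A sketch of this test using the pooled standard error (scipy assumed; p̂2 is taken as 0.41, matching the survey figures in the example):

```python
from math import sqrt
from scipy.stats import norm

p1, n1 = 0.36, 1002  # this year
p2, n2 = 0.41, 1980  # last year

pooled = (n1 * p1 + n2 * p2) / (n1 + n2)  # ~0.393
se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se                        # ~ -2.64
p_value = 2 * norm.sf(abs(z))             # two-sided ~ 0.008

print(z, p_value)  # p-value < 0.05: reject H0; attractiveness changed
```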
Hypothesis Tests Outcomes

● In any hypothesis test 4 outcomes are possible:

1. We reject the null hypothesis when the null hypothesis is


actually true (Error type I)
2. We reject the null hypothesis when the null hypothesis is actually
not true (Good!)
3. We fail to reject the null hypothesis (do not reject) when the null
hypothesis is actually true (Good!)
4. We retain the null hypothesis (do not reject) when the null
hypothesis is actually not true (Error type II)

Hypothesis Test Outcomes

Type 1 Error

● Type I error occurs when we reject a null hypothesis, when the null
hypothesis is actually true (false positive).
Type 2 Error

● Type II error occurs when we do not reject the null hypothesis


(retain the null hypothesis), when it is actually not true (false
negative).

Type 1 Error

● When you are choosing the significance level, you are actually
setting the probability of type I error.
● Probability of type I error is equal to 𝛼.
● 𝛼 is the probability that determines the critical region.
● If a sample statistic is one of those “rare” samples that falls into the
critical region, we will reject the null hypothesis while it was
actually true.
● A “rare” and “extreme” sample may fall into the critical region
(probability 𝛼), and lead to a true null hypothesis be rejected by
error.


Type 2 Error

● Type II error (with probability 𝛽) is the probability of retaining


(failing to reject) the null hypothesis, while it is actually not true.

Power

● The power of a test is the probability that it correctly rejects a false


null hypothesis.
● We know that β is the probability that a test fails to reject a false
null hypothesis, so the power of the test is the complement, 1 – β.
● We call the distance between the null hypothesis value, p0, and the
truth, p, the effect size.
● The further p0 is from p, the greater the power.
● It makes intuitive sense that the larger the effect size, the easier it
should be to see it.

Power

● Suppose we are testing the hypotheses 𝐻0: 𝑝 = p0 vs. 𝐻A: 𝑝 > p0.

● We'll reject the null if the observed proportion p̂ is greater than some critical value p*.
● p* can be found based on the significance level of the test. In general, for this right-sided test,

p* = p0 + 𝑧* × √(p0q0/𝑛)

Critical Value p*

● For a right-sided test, we’ll reject the null if the observed proportion
p̂ is greater than some critical value p*.
● For a left-sided test, we’ll reject the null if the observed
proportion p̂ is smaller than some critical value p*.
● For a two-sided test, one of the above holds based on the position
of the p̂.
● If p̂ is more extreme than p* we reject H0

Critical Value p*

● Demonstrating the relationship between p* and the hypothesized


value for the population proportion 𝑝 = p0.
Power

● The upper figure shows the null hypothesis model. The lower
figure shows the true model.
Power

● The power of the test is the green region on the right of the lower
figure.

● Reducing α moves the critical value p* to the right but increases β. It correspondingly reduces the power.

Power

● If the true proportion, p, were further from the hypothesized value,


p0, the bottom curve would shift to the right, making the power
greater (and 𝛽 smaller).
Increasing Sample Size

If we can increase the sample size,

● The curves will become narrower (standard deviation becomes


smaller)
● For a given value of 𝛼, the probability of Type II error will decrease.
● The power of the test will increase.

Power
● Making the standard deviations smaller increases the power
without changing the alpha level.

Summary

● 𝛼 is chosen by the researcher, whereas 𝛽 results from the characteristics of the problem. However, the two are related.
● An extremely low 𝛼 (e.g., 0.001) will result in intolerably high values for 𝛽.
● It is necessary to balance both error probabilities.
● 𝛽 can be controlled by increasing the sample size.
● For a given level of 𝛼, increasing the sample size will decrease 𝛽 and thereby increase the power of the test.
Chapter 13: Confidence Intervals and Hypothesis Tests for
the Means

Sampling Distribution

Reminder:

● The confidence interval for the proportion was p̂ ± 𝑧* × 𝑆𝐸(p̂)

● where 𝑧* is determined by the confidence level

● and the standard error of the sample proportion is 𝑆𝐸(p̂) = √(p̂q̂/𝑛)

CI for the Population Mean

● The confidence interval for the mean follows the same logic.
● For the population mean, there are two cases:
○ If the population standard deviation 𝜎 is known
○ If the population standard deviation 𝜎 is unknown
● Given a sample mean (ȳ), we can estimate the true population mean (𝜇) by creating a confidence interval that will include the population mean with some level of certainty (confidence level):

𝑒𝑠𝑡𝑖𝑚𝑎𝑡𝑒 ± margin of error (ME)


CI for the Population Mean (𝜎 known)

● When the population standard deviation is known, we have 𝑆𝐷(ȳ) = 𝜎/√𝑛.
● And therefore, the confidence interval becomes

ȳ ± 𝑧* × 𝜎/√𝑛

CI for the Population Mean (𝜎 unknown)

● If we do not have the population standard deviation, then instead of 𝑆𝐷(ȳ) we need to estimate it by 𝑆𝐸(ȳ).
● So we have ȳ ± critical value × 𝑆𝐸(ȳ),
● where the standard error is 𝑆𝐸(ȳ) = 𝑠/√𝑛

CI for the Mean (𝜎 unknown)

● But when we use the standard error for the mean, the distribution is no longer normal. The distribution is what we refer to as the "Student's t" distribution.
● The new model, the Student’s t, is a model that is always
bell-shaped, but the details change with the sample sizes.
● The Student’s t-models form a family of related distributions
depending on a parameter known as degrees of freedom.

CI for the Mean (𝜎 unknown)

● When certain conditions are met, the standardized sample mean

𝑡 = (ȳ − 𝜇)/𝑆𝐸(ȳ)

follows a Student's t-model with n − 1 degrees of freedom.
● We find the standard error from 𝑆𝐸(ȳ) = 𝑠/√𝑛
CI for the Mean (𝜎 unknown)

● When the assumptions and conditions are met, we're ready to find the confidence interval for the population mean, 𝜇. The confidence interval extends on either side of the mean by an amount known as the margin of error, ME, and can be calculated as

ȳ ± 𝑡*n−1 × 𝑆𝐸(ȳ)

● where the standard error of the mean is 𝑆𝐸(ȳ) = 𝑠/√𝑛

CI for the Mean (𝜎 unknown)

● The critical value 𝑡*n-1 depends on the particular confidence level


that you specify and on the number of degrees of freedom, n − 1,
which we get from the sample size.
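
A sketch of a t-based confidence interval with scipy; the sample summary used here (n = 28, ȳ = 0.0913, s = 0.0495) is an assumption for illustration, not a value from the notes:

```python
from math import sqrt
from scipy.stats import t

n, y_bar, s = 28, 0.0913, 0.0495  # assumed sample summary

se = s / sqrt(n)
t_star = t.ppf(0.975, df=n - 1)   # critical value for 95% confidence
me = t_star * se

print(y_bar - me, y_bar + me)     # 95% CI for the population mean
```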
Student's t model

● Student’s t-models are unimodal, symmetric, and bell- shaped, just


like the Normal model.
● But t-models (solid curve below) with only a few degrees of
freedom have a narrower peak than the Normal model (dashed
curve below) and have much fatter tails.
● As the degrees of freedom increase, the t-models look more and
more like the Standard Normal model.
● The 68-95-99.7 Rule no longer works for the t-distribution.
● The t-curve has more spread (is fatter) than the z-curve and
depends on the sample size (degrees of freedom).
t-model vs. z-model

● The t-model (solid curve) with 2 degrees of freedom vs. the normal
model (dashed curve).

Assumptions and Conditions

● Independence Assumption: There is no way to check


independence of the data, but we should think about whether the
assumption is reasonable from the business context.
● Randomization Condition: The data arise from a random sample
or suitably randomized experiment.
● 10% Condition: The sample size should be no more than 10% of
the population. For means our samples generally are, so this
condition will only be a problem if our population is small.

Student’s t-models won’t work for data that are badly skewed. We
assume the data come from a population that follows a Normal model.
Data being Normal is idealized, so we have a “nearly normal” condition
we can check.

● Nearly Normal Condition: The data come from a distribution that


is unimodal and symmetric. This can be checked by making a
histogram.
Assumptions and Conditions

Normal Population Assumption

● For very small samples (n < 15), the data should follow a Normal
model very closely. If there are outliers or strong skewness, t
methods shouldn’t be used.
● For moderate sample sizes (n between 15 and 40), t methods will
work well as long as the data are unimodal and reasonably
symmetric.
● For sample sizes larger than 40, t methods are safe to use unless
the data are extremely skewed. If outliers are present, analyses
can be performed twice, with the outliers and without.

Assumptions and Conditions

Normal Population Assumption

● In business, the mean is often the value of consequence.


● Even when we must sample from a very skewed distribution, the
Central Limit Theorem tells us that the sampling distribution of our
sample mean will be close to Normal.
● We can use Student’s t methods without much worry as long as
the sample size is large enough.

Assumptions and Conditions

Example: Researchers purchased whole farmed salmon at random from


51 farms in eight regions in six countries. The histogram shows the
concentrations of the insecticide Mirex in the 150 samples of farmed
salmon.

● Nearly Normal Condition: The histogram of the data looks


bimodal. While it might be interesting to learn the reason for that
and possibly identify the subsets, we can proceed because the
sample size is large.
Critical Value (t)

Using the t tables, estimate the following:

a) The critical value of t for a 95% confidence interval with 𝑑𝑓 = 7.


b) The critical value of t for a 99% confidence interval with 𝑑𝑓 =
28.
CI for the Mean

● Question: In a widely cited study of contaminants in farmed


salmon, fish from many sources were analyzed for 14 organic
contaminants. One of those was the insecticide Mirex, which has
been shown to be carcinogenic and is suspected of being toxic to
the liver, kidneys, and endocrine system. Summaries for 28 Mirex
concentrations (in parts per million) from a variety of farmed
salmon sources were reported as:

● The Environmental Protection Agency (EPA) recommends to


recreational fishers as a “screening value” that Mirex
concentrations be no larger than 0.07 ppm. What does the 95%
confidence interval say about that value?

CI for the Mean
● Question: When homeowners fail to make mortgage payments,
the bank forecloses and sells the home, often at a loss. In one
large community, realtors randomly sampled 36 bids from potential
buyers to determine the average loss in home value. The sample
showed that the average loss was $11,560 with a standard
deviation of $1500.
a) Assuming that conditions to use the t-model are satisfied, find a
95% confidence interval for the mean loss in value per home.
b) Interpret this interval and explain what 95% confidence means.
c) Suppose that, nationally, the average loss in home values at this
time was $10,000. Do you think the loss in the sampled community
differs significantly from the national average? Explain.
HT for the Mean

A random sample of 20 purchases at the internet music site has mean


and standard deviation equal to $45.26 and $20.67, respectively. We
want to investigate if the mean amount of purchase of all transactions is
greater than $40.

a) What are the null and alternative hypotheses?


b) Is the alternative one- or two-sided?
c) What is the value of the test statistic?
d) What is the p-value of the test statistic?
e) What do you conclude at 𝛼 = 0.05?
Hypothesis Test for the Mean

● The Student’s t-model is different for each value of degrees of


freedom.
● Typically we limit ourselves to 80%, 90%, 95%, and 99%
confidence levels.
● We can use technology to give critical values for any number of
degrees of freedom and for any confidence levels we need. More
precision won’t necessarily help make good business decisions.
● P-Value Method
Hypothesis Test for the Mean

Critical value method.

● A typical t-table is shown here. The table shows the critical values
for varying degrees of freedom, df, and for varying confidence
intervals.
● Since the t-models get closer to the normal as df increases, the
final row has critical values from the Normal model and is labeled
“∞”

Hypothesis Test for the Mean

Critical Value Method:

● In this method, the critical value 𝑡* is found based on the test (one-sided, two-sided), the significance level (𝛼), and the degrees of freedom (sample size). The table is used to find this value.
● Then the value of the t-statistic is calculated (the equivalent of the z-statistic in the previous chapters).
● Based on the comparison between 𝑡 and 𝑡*, we can draw a conclusion about the test.
Hypothesis Test for the Mean

For the critical value method:

● Reject the null hypothesis if the test statistic is more extreme than the critical value 𝑡* (with n − 1 degrees of freedom) found from the table.
● By more extreme, we mean farther from the mean (centre).
● Note: The table only gives positive critical values. We need to change the sign of the critical value based on our test (left-sided, right-sided, two-sided).

HT for the Mean

Question: From 30 randomly selected retail sales, an analyst finds that the mean amount spent is $22 with a standard deviation of $10 and is approximately Normally distributed. Using hypothesis tests, investigate if the average spending has decreased from last year (which had an average spending amount equal to $24). Test at 𝛼 = 0.05.
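
A minimal sketch of this left-sided test using the critical value method described above:

```python
# One-sample t-test by the critical value method (left-sided, alpha = 0.05).
import math
from scipy import stats

n, ybar, s, mu0, alpha = 30, 22, 10, 24, 0.05
t = (ybar - mu0) / (s / math.sqrt(n))        # ~ -1.10
t_star = -stats.t.ppf(1 - alpha, df=n - 1)   # left-tail critical value ~ -1.70
# Reject only if t is more extreme (farther left) than t*:
print("reject H0" if t < t_star else "fail to reject H0")
```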
Chapter 14: Comparing Two Means

Comparing Two Means

● Do customers tend to spend more at a retailer in the month of December if they are exposed to Classical music or Christmas music?
● Does the productivity of employees of an organization increase if
an on-site gym is built or not?
● Does working from home increase or decrease efficiency of
employees?
● Etc.
● In these types of questions we are interested in comparing the average of a variable in two different populations (or a population under two different conditions).
Comparing Two Means

● Example: Do customers spend more using their credit card if they are given an incentive such as “double miles” or “double coupons” toward flights, hotel stays, or store purchases?
● To answer questions such as this, credit card issuers often perform
experiments on a sample of customers, making them an offer of an
incentive, while other customers receive no offer.
● By comparing the performance of the two offers on the sample,
they can decide whether the new offer would provide enough
potential profit if they were to “roll it out” and offer it to their entire
customer base.

Comparing Two Means

● For two groups, the statistic of interest is the difference in the observed means of the offer and no offer groups: ȳ1 − ȳ2
● From this estimate, we would like to find the difference between the two population means, 𝜇1 − 𝜇2
● The population model parameter is the difference between the two means.

Comparing Two Means

● In order to tell if a difference we observe in the sample means indicates a real difference in the underlying population means, we’ll need to know the sampling distribution model and standard deviation of the difference.
● Once we know those, we can build a confidence interval and test
a hypothesis just as we did for a single mean.

Comparing Two Means

● It’s easy to find the mean and standard deviation of the spend lift
(increase in spending) for each of these groups (ȳ and 𝑠), but that’s
not what we want.
● We need the standard deviation of the difference in their means.

● For that, we can use a simple rule: If the sample means come from
independent samples, the variance of their sum or difference is
the sum of their variances.

Comparing Two Means

● As long as the two groups are independent, we find the standard deviation of the difference between the two sample means by adding their variances and then taking the square root:

SD(ȳ1 − ȳ2) = √(σ1²/n1 + σ2²/n2)
Comparing Two Means

● Usually we don’t know the true standard deviations of the two groups, σ1 and σ2, so we substitute the estimates, s1 and s2, and find a standard error:

SE(ȳ1 − ȳ2) = √(s1²/n1 + s2²/n2)

● We’ll use the standard error to see how big the difference really is.

Comparing Two Means

● Just as for a single mean, the ratio of the difference in the means
to the standard error of that difference has a sampling model that
follows a Student’s t distribution.
● The sampling model isn’t really Student’s t, but by using a special,
adjusted degrees of freedom value, we can find a Student’s
t-model that is so close to the right sampling distribution model that
nobody can tell the difference.

Comparing Two Means


A Sampling Distribution for the Difference Between Two Means

● When the conditions are met (introduced later), the standardized sample difference between the means of two independent groups,

t = [(ȳ1 − ȳ2) − (𝜇1 − 𝜇2)] / SE(ȳ1 − ȳ2)

● can be modelled by a Student’s t-model with a number of degrees of freedom found with a special formula. We estimate the standard error with

SE(ȳ1 − ȳ2) = √(s1²/n1 + s2²/n2)

Two-Sample Test

● To construct the hypothesis test for the difference of the means, we start by hypothesizing a value for the true difference of the means. We’ll call that hypothesized difference Δ0.
● It’s so common for that hypothesized difference to be zero that we often just assume Δ0 = 0.
● The two-sample t-test is the ratio of the difference in the means from our samples to its standard error compared to a critical value from a Student’s t-model.

Two-Sample Test

Two-Sample t-test

● When the appropriate assumptions and conditions are met, we test the hypothesis 𝐻0: 𝜇1 − 𝜇2 = Δ0
● where the hypothesized difference Δ0 is almost always 0. We use the statistic

t = [(ȳ1 − ȳ2) − Δ0] / SE(ȳ1 − ȳ2)

● The standard error of ȳ1 − ȳ2 is

SE(ȳ1 − ȳ2) = √(s1²/n1 + s2²/n2)

Two-Sample Test

● When the null hypothesis is true, the statistic can be closely modelled by a Student’s t-model with a number of degrees of freedom given by a special formula. We use that model to compare our t-ratio with a critical value for t or to obtain a p-value.
● Formula:

df = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]

● If using the table, round down to be safe
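
A minimal sketch of the whole computation from summary statistics, including the special degrees-of-freedom formula. The group summaries here are made up for illustration:

```python
# Two-sample t-test from summary statistics (unequal variances).
from scipy import stats

m1, s1, n1 = 55.8, 24.3, 500   # hypothetical "offer" group summaries
m2, s2, n2 = 49.9, 23.1, 500   # hypothetical "no offer" group summaries

v1, v2 = s1**2 / n1, s2**2 / n2
se = (v1 + v2) ** 0.5                                         # SE(ybar1 - ybar2)
df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))   # special df formula
t = (m1 - m2) / se                                            # Delta0 = 0

# SciPy performs the same (Welch) computation directly:
res = stats.ttest_ind_from_stats(m1, s1, n1, m2, s2, n2,
                                 equal_var=False, alternative='greater')
print(t, df, res.pvalue)
```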

Assumptions and Conditions

● Independence Assumption: The data in each group must be drawn independently and at random from each group’s own homogeneous population or generated by a randomized comparative experiment.
● Randomization Condition: Without randomization of some sort,
there are no sampling distribution models and no inference.
● 10% Condition: We usually don’t check this condition for
differences of means. We needn’t worry about it at all for
randomized experiments.
● Normal Population Assumption: We need the assumption that
the underlying populations are each Normally distributed.
● Nearly Normal Condition: We must check this for both groups; a
violation by either one violates the condition.
● Independent Groups Assumption: To use the two-sample t
methods, the two groups we are comparing must be independent
of each other.
● No statistical test can verify that the groups are independent. You
need to think about how the data were collected.

HT for Mean Differences

● Question: We want to know if cardholders who are offered a promotion will consequently spend more on their credit card. We have the spend lift (in $) for a random sample of 500 cardholders who were offered the promotion and for a random sample of 500 customers who were not. Here are the results:

Assuming that the conditions are satisfied,

a) What are the null and alternative hypotheses for this test?
b) Use a two-sample t-test to conduct the test. Does the average
spending really increase following a promotion? (use 𝑑𝑓 = 992 and
𝛼 = 0.05)
CI for diff. b/w two Means

● A hypothesis test really says nothing about the size of the difference. All it says is that the observed difference is large enough that we can be confident it isn’t zero.

Confidence Interval for the Difference Between Two Means

● When the conditions are met, we’re ready to find a two-sample t-confidence interval for the difference between means of two independent groups, 𝜇1 − 𝜇2. The confidence interval is

(ȳ1 − ȳ2) ± 𝑡*df × SE(ȳ1 − ȳ2)

CI for diff. b/w two Means

● In the confidence interval given, the standard error of the difference of the means is

SE(ȳ1 − ȳ2) = √(s1²/n1 + s2²/n2)

● The critical value 𝑡*df depends on the particular confidence level and on the number of degrees of freedom.
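
A minimal sketch of the interval computation, reusing the same made-up summary statistics as in the test sketch above:

```python
# Two-sample 95% confidence interval from summary statistics.
from scipy import stats

m1, s1, n1 = 55.8, 24.3, 500   # hypothetical group summaries
m2, s2, n2 = 49.9, 23.1, 500

v1, v2 = s1**2 / n1, s2**2 / n2
se = (v1 + v2) ** 0.5                                         # SE of the difference
df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))   # special df formula
t_star = stats.t.ppf(0.975, df)                               # 95% critical value
diff = m1 - m2
print(diff - t_star * se, diff + t_star * se)
```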

CI for Mean Differences

Question: We want to find a 95% confidence interval for the mean difference in spending between those who are offered a promotion and those who aren’t. Assuming that the conditions are satisfied, find the 95% confidence interval for the true difference in the spending lift between the two populations, 𝜇1 − 𝜇2.
Chapter 15: Designed Experiments and Analysis of Variance (ANOVA)

Why ANOVA?

● In previous lectures, we studied comparisons between two population means.
● Depending on the characteristics of the problem, either a t or z
statistic was used to study hypotheses regarding the two
population means, or proportions.
● We were able to determine whether or not two population means
were equal. Also, we were able to determine if a population mean
was greater than another population mean.
● Sometimes we need to compare the mean of more than two
populations. In these situations, we use a method called the
Analysis of Variance or ANOVA.
Why ANOVA?

● ANOVA is used when we have samples from more than two populations and we would like to study hypotheses regarding those population means.
● ANOVA could be,
○ One-way: Where there is only one independent variable and
you want to measure the response. Example: Studying the
effectiveness of three types of pain relievers; Aspirin,
Tylenol, and Ibuprofen.
○ Two-way: Where there are two independent variables and
you want to measure the response. Example: Studying pain
relief based on pain reliever (Aspirin, Tylenol) and type of
pain (headache, back pain)

Why ANOVA?

● Assume we are studying the effectiveness of three pain relievers A, T and I.
● If we compare them individually, we would need three separate
t-tests to compare (A and T), (A and I) and (T and I) separately.
Each test would involve some risk of error (𝛼). The overall risk of
our analysis would be much more than 𝛼.
● ANOVA keeps the risk of error low.
● To perform ANOVA, we need experimental designs (we need to
design experiments).

Example

● Google is interested in testing three features for the design of its next generation cellphones to maximize its market share. The three features are (a) screen size, (b) existence of a headphone jack, (c) keyboard.
● Each feature (factor) has some levels.
● Screen size: medium, large
● Existence of headphone jack: yes, no.
● Keyboard: qwerty keyboard, no keyboard.
● How can Google use samples of its customer base to find the most
appealing design?

Experiments

● An experiment is a study in which the experimenter manipulates attributes of the study participants and observes the consequences.
● The attributes, called factors, are manipulated by being set to
particular values called levels. These levels are assigned or
allocated to individuals in the study.
● The combination of factor levels assigned to a subject is called that
subject’s treatment.

Experiments

Two key features distinguish an experiment from other investigations:

● The experimenter actively and deliberately manipulates the factors to specify a treatment.
● The experiment assigns subjects to the treatment randomly.

Experimental Design

1. Randomize. Assign treatments to subjects randomly.


2. Control. We control sources of variation other than the factors we
are testing by making conditions as similar as possible for all
treatment groups.
● (There is a second meaning of control. A control treatment is a
special treatment class designed to mark the baseline of the study,
and the group that receives it is called the control group.)
● Example: add high doses of salt to the diet of the test group. The
control group does not receive high doses of salt.
3. Replicate. Repeated observations at each treatment are called
replicates. If the number of replicates is the same for each
treatment combination, the experiment is said to be balanced.
● A second kind of replication is to repeat an entire experiment for a
different group of subjects, under different circumstances, or at a
different time.

Experimental Design

4. Block. Group or block subjects together according to some factor that you cannot control and feel may affect the response. Such factors are called blocking factors, and their levels are called blocks.
● Example blocking factors: sex, ethnicity, marital status, etc.
● In effect, blocking an experiment into n blocks is equivalent to
running n separate experiments.
● Example: you may feel that marital status may affect the
consumption of milk for breakfast. Your blocks will be married vs.
single individuals.

Experimental Design

Completely Randomized Designs

● When each of the possible treatments is assigned to at least one subject at random, the design is called a completely randomized design.
● In the simplest completely randomized design, the subjects are
assigned at random to two treatments.
Example

● A marketer wants to know the effect of two types of offers (double-miles and companion air ticket) on the revenue of the new WestJet cards. The analyst suspects this should be done in each of two segments: a high-spending group and a low-spending group, separately. The marketer selected 12,000 customers from each group at random and then randomly assigned the three treatments (including the Control treatment) to the 12,000 customers in each group so that 4000 customers in each segment received one of the three treatments.

Experimental Design

● Randomized Block Designs: When we have a blocking factor, we randomize the subjects to treatments within each block.
Experimental Design

Factorial Designs

● An experiment with more than one manipulated factor is called a factorial design.
● A full factorial design contains treatments that represent all
possible combinations of all levels of all factors.
● When the combination of two factors has a different effect than you
would expect by adding the effects of the two factors together, that
phenomenon is called an interaction.
● Unless your experiment incorporates a factorial design, you cannot
see interactions, and this may be a serious omission.
Blinding and Placebos

● Blinding: The deliberate withholding of the treatment details from individuals who might affect the outcome.

Two sources of unwanted bias:

1) Those who might influence the results (the subjects, treatment administrators, technicians, etc.)
2) Those who evaluate the results (judges, experimenters, etc.)
● Single-Blind Experiment: one or the other groups is blinded.
● Double-Blind Experiment: both groups are blinded.

Blinding and Placebos

● Often, applying any treatment can alter a response.


● To separate the real treatment effects from imagined ones, use an
ineffective control treatment that mimics the treatments being
tested. Such a “fake” treatment is called a placebo.
● Example: Patients in the test group may think that just because
they are taking a pill, their headache is “supposed” to get better. In
these cases, we define a control group that is given something like
a simple sugar pill rather than the real medication to spread this
effect among all groups.

Confounding

● When the levels of one factor are associated with the levels of
another factor, we say that two factors are confounded.
● Example: A bank offers credit cards with 2 possible treatments:

● There is no way to separate the effect of the rate factor from that of
the fee factor. These two factors are confounded in this design.
One-way ANOVA

● Consider an experiment with a single factor of k levels.

Question of Primary Interest:

● Is there evidence for differences in effectiveness for the treatments?
● Let 𝜇𝑖 be the mean response for treatment group i.
● Then, to answer the question, we must test the hypothesis:

𝐻0: 𝜇1 = 𝜇2 = ⋯ = 𝜇𝑘 versus 𝐻A: not all the 𝜇𝑖 are equal

One-way ANOVA

● What criterion might we use to test the hypothesis?
● The figure shows the means (medians, actually) of each of the four levels of a one-factor experiment. The means are scattered, but so are the underlying data.
● Is the scatter of the means due to treatment effects or to
underlying error?

One-way ANOVA

● Which picture shows more variation between the means in groups?
One-way ANOVA

● Let’s explicitly consider the two classes of variation in a factorial design:
1) Variation (red oval) of the group means 𝜇𝑖 about the grand mean
(dashed line).
2) Variation (blue ovals) of the group data about the group means.
One-way ANOVA

● In order to observe any treatment differences, we suspect that “between group” variation must be large compared to “within group” variation.
● We need to use a tool (test statistic) that measures the ratio of “between group” variation to “within group” variation.
● If the value of this test statistic is higher than a certain threshold, we can conclude that the null hypothesis cannot be true → at least one of the means is not equal to the rest!

One-way ANOVA

● Means widely scattered; treatment differences are likely

● Means not widely scattered; possibly no treatment differences


ANOVA Assumptions

● When using ANOVA, three assumptions are made:
1) Values in each population (response) are normally distributed.
2) Variances of all populations are equal: σ²1 = σ²2 = ⋯ = σ²𝑘
3) The observations must be independent.

● Hypotheses,
○ In ANOVA, generally the null hypothesis assumes that the
mean of all the populations are equal 𝜇1 = 𝜇2 = ⋯ = 𝜇𝑘
○ The alternative hypothesis states that at least one of the population means is not equal to another population mean. In other words, it states that “not all population means are equal”.

Example

● Suppose we want to evaluate how satisfied customers are with three different products.
● Satisfaction rating of a small sample has been collected, where 5
represents the highest satisfaction level.
Example (Scenario 1)

● Suppose the following summarized the observations.
● As you can see, the variation inside each group is significant. The observations are widely spread.
● In other words, customers inside each group showed a lot of variation.
● Within-group variations are significant.

Example (Scenario 2)

● Suppose the following summarized the observations.

● As you can see, the variations within each group are very small.
The observations are clustered tightly together.
● Ratings within groups did not show much variation.
● But the groups are very different from each other.
● The Between-group variations are significant.
Example

In this example we observed that,

● When variations between groups were large compared to variations within groups, we concluded that the satisfaction levels were indeed different.
● When variations between groups were small compared to
variations within groups, we concluded that the satisfaction levels
were not different (null hypothesis is true!).
● To make a conclusion, it was important to study both variation
between and within groups. Only a comparison of both variations
together would lead us to accept or reject the null hypothesis.

F Statistic

● In ANOVA, we use the same concepts, variations between groups and variations within groups, to determine whether or not the population means under study are equal.
● To do that, we use a test statistic called the F statistic. F statistic
includes both between and within variations at the same time and
has the F distribution.

● If the value of the F statistic is greater than a certain value (C.V.), we can conclude that the between variations are much larger than the within variations and we will reject the null hypothesis (we are in the rejection area).

F Statistic

● As a result, we understand that in ANOVA we have only one rejection area, on the right.
● The area under the curve in the rejection area is 𝛼.
● If the value of our test statistic F falls into the rejection area (if F
>C.V.), we can reject the null hypothesis and conclude that not all
population means are equal.
● The F statistic grows when between variations are large compared
to within variations.
● The F statistic shrinks when between variations are small
compared to within variations.

One-Way ANOVA

● To quantify these two classes of variation, we introduce two new measures of variability for one-factor experiments with k levels:
1) The Mean Square due to Treatments (between-group variation measure)
● In each instance, (ȳ𝑖 − ȳ)² measures the between-group variation.

One-Way ANOVA

The Sum of Squares between Treatments (SST) (sometimes called the between sum of squares) is

SST = Σ n𝑖(ȳ𝑖 − ȳ)²

● where ȳ𝑖 is the mean of group i, n𝑖 is the number of observations in a group, and ȳ is the overall mean of all observations.
● To turn this estimate of variation into variance, we divide the sum of squares by its associated degrees of freedom:

MST = SST / (k − 1)

One-Way ANOVA
2) The Mean Square due to Error (within-group variation measure)
● We compare the SST with how much variation there is within each group. The Sum of Squares Error (SSE) captures it like this:

SSE = Σ (n𝑖 − 1)s²𝑖

● where s²𝑖 is the sample variance of group i and N is the total number of observations.

One-Way ANOVA

● To turn this estimate of variation into variance, we divide the sum of squares by its associated degrees of freedom:

MSE = SSE / (N − k)

where 𝑁 is the total number of observations and 𝑘 is the number of groups.

● The total sum of squares will be equal to

SSTotal = SST + SSE

One-Way ANOVA
● When there is no difference in treatments, it can be shown that the ratio MST/MSE follows an F distribution. This ratio is the F-statistic:

F = MST / MSE

● The critical value and P-value depend on the two degrees of freedom 𝑘 − 1 and 𝑁 − 𝑘.
● So, to accept or reject the hypothesis…

...we compute and analyze the F-statistic.

▪ This type of analysis is called an Analysis of Variance (ANOVA)
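
A minimal sketch of a one-way ANOVA computed from the definitions above and checked against SciPy. The three groups are made-up satisfaction ratings:

```python
# One-way ANOVA "by hand" (SST, SSE, MST, MSE, F) vs. scipy.stats.f_oneway.
import numpy as np
from scipy import stats

groups = [np.array([4, 5, 4, 5, 5]),
          np.array([3, 3, 2, 3, 4]),
          np.array([2, 1, 2, 2, 1])]

k = len(groups)
N = sum(len(g) for g in groups)
grand = np.concatenate(groups).mean()

sst = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)   # between groups
sse = sum((len(g) - 1) * g.var(ddof=1) for g in groups)       # within groups
F = (sst / (k - 1)) / (sse / (N - k))                         # MST / MSE

check = stats.f_oneway(*groups)
print(F, check.statistic, check.pvalue)                       # F values agree
print("critical value:", stats.f.ppf(0.95, k - 1, N - k))
```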

F-ratio ( F-Statistic )

● In the One-way ANOVA analysis, we reject the null hypothesis when:
○ The between and within variations are such that the value of the F ratio is greater than a certain critical value.
○ This means that the p-value is smaller than 𝛼 of the test.
● The higher the F ratio, the higher the probability that the null hypothesis is rejected.
● Significantly large F ratios will produce significantly small p-values because, remember, the p-value is the probability of observing an F value at least this large when the null hypothesis is true.
● The rejection area for the one-way ANOVA is always on the right-hand side of the F distribution.

F ratio ( F statistic )

One-Way ANOVA

● An ANOVA table is used to summarize information from experiments and elements of variation seen in different groups.
One-Way ANOVA (Example)

Example: For the summer catalogue of the percussion supply company Tom’s Tom-Toms, 4000 customers were selected at random to receive one of four offers: No Coupon, Free Sticks with purchase, Free Pad with purchase, or Fifty Dollars off next purchase. All the catalogues were sent out on March 15 and sales data for the month following the mailing were recorded.

Offers:

1) Free sticks
2) Free pad
3) Fifty dollars off next purchase
4) No coupon (control group)

One-Way ANOVA (Example)

● Tom’s Tom-Toms is interested to see if there are any differences between the average sales data following any type of offer.
● There are 4 experiment groups (including the control group with no
offers).
● The company would like to test the following hypothesis at 𝛼 =
0.01.

One-Way ANOVA (Example)

● Here is a summary of the spending for the month after the start of
the experiment. A total of 4000 offers were sent, 1000 per
treatment.
One-Way ANOVA ( Example )

● And the data from the experiments are summarized in the table
below.

One-Way ANOVA (Example)

● We construct the ANOVA table!


● The p-value was calculated using software!
● Since p-value is smaller than 𝛼, we reject the null hypothesis!

One-Way ANOVA

Question: A wine producer wants to evaluate the quality of its new wine
across Canada. The dining facilities of a hotel chain in three major cities
are selected for the study (Montreal, Toronto, and Vancouver). On the
same day, a number of patrons in each city are asked to taste the wine
and rate the overall quality of the wine on a 5-point scale, with 1 =
horrible, 3 = moderately good, and 5 = excellent. Assuming that the
average qualities of the wine in Montreal, Toronto and Vancouver are 𝜇1,
𝜇2 and 𝜇3, respectively, at significance level (𝛼 = 0.05), we would like to
test, if the quality of the wine across the three cities is the same.

a) What are the null and alternative hypotheses?
b) Complete the ANOVA table.
c) What is the critical value of the test at 𝛼 = 0.05?
d) Is there a difference between the wines across the three cities?
ANOVA Table
Post hoc Analysis

● If ANOVA states that not all population means are equal, we can
investigate further which population means are different by
pairwise-comparisons between them. This is called the post hoc
analysis to ANOVA.
● A possible method for post hoc analysis is the 𝑡-test of significance
between two population means (𝐻a : 𝜇1 − 𝜇2 > 0, etc.). This
method can be performed for all possible pair combinations.
● This method can be complemented by constructing estimated
confidence intervals around population means and comparing
confidence intervals to figure out the relationship between
population means.
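
One possible post hoc sketch in Python: pairwise t-tests over all group pairs, with a Bonferroni adjustment so the overall error risk stays near 𝛼. (Bonferroni is one common choice; the slides leave the adjustment method open. The ratings below are made up.)

```python
# Pairwise post hoc t-tests with a Bonferroni-adjusted significance level.
from itertools import combinations
from scipy import stats

groups = {"A": [4, 5, 4, 5, 5], "T": [3, 3, 2, 3, 4], "I": [2, 1, 2, 2, 1]}
pairs = list(combinations(groups, 2))
alpha_adj = 0.05 / len(pairs)          # Bonferroni-adjusted level

for g1, g2 in pairs:
    res = stats.ttest_ind(groups[g1], groups[g2])
    verdict = "different" if res.pvalue < alpha_adj else "not significant"
    print(g1, g2, round(res.pvalue, 4), verdict)
```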

Chapter 18: Inference for Regression

Sample Regression Line

● Remember: The regression line is ŷ = b0 + b1x
● The line was derived from fitting a regression line to a sample of “paired data”.
● Suppose we take three samples from the same population and find the regression lines.
Population Regression

● Sample regression lines are useful; however, population regression lines are what we usually need.
● Population regression lines are useful to predict values of 𝑦 in the
population based on values of 𝑥.
● We can imagine a true line that summarizes the relationship
between x and y for the entire population,
● This true line is not the same as the sample regression line we
derive from a sample.

Population Regression

● Assume the true population regression line is

𝜇𝑦 = 𝛽0 + 𝛽1𝑥

where 𝜇𝑦 is the true population mean of all the 𝑦’s of the population for any given 𝑥.

● As we can tell from the sample, each 𝑥 in the population could have multiple 𝑦’s associated with it.
● NOTE: We are assuming an idealized case in which the points (x, 𝜇𝑦) are in fact exactly linear.

Population Regression

● The values on the line 𝜇𝑦 = 𝛽0 + 𝛽1𝑥 will miss most of (if not all of) the 𝑦’s.
● Some 𝑦’s lie above the line and some below the line.
● This line also makes errors.
● If we want to account for each individual 𝑦 in our model, we need to include errors, denoted by 𝜀.
● Therefore,

𝑦 = 𝛽0 + 𝛽1𝑥 + 𝜀

Population vs. Sample

● Regression Inference:
● Collect a sample and estimate the population 𝛽’s by finding a regression line, ŷ = b0 + b1x
● The residuals 𝑒 = 𝑦 − ŷ are the sample-based versions of 𝜀.
● In this model, 𝑏0 estimates 𝛽0,
● b1 estimates 𝛽1,
● and 𝑒 estimates 𝜀.

Population vs. Sample

● Estimate?!
● Using sample information, we can create confidence intervals and hypothesis tests for population parameters.
● We observe 𝑏0; it estimates 𝛽0.
● We observe 𝑏1; it estimates 𝛽1.
● We observe 𝑒; it estimates 𝜀.
Assumptions and Conditions

● Estimating population parameters using sample information is


based on a number of assumptions and conditions.

1. Linearity Assumption
2. Independence Assumption
3. Equal Variance Assumption
4. Normal Population Assumption

Assumptions and Conditions

Testing the Assumptions

1. Make a scatterplot of the data to check for linearity


(Linearity Assumption)
2. Fit a regression and find the residuals, e, and predicted values ŷ
3. Plot the residuals against time (if appropriate) and check for
evidence of patterns
(Independence Assumption).
4. Make a scatterplot of the residuals against x or the predicted
values. This plot should not exhibit a “fan” or “cone” shape. (Equal
Variance Assumption)
5. Make a histogram and Normal probability plot of the residuals
(Normal Population Assumption)
Assumptions and Conditions

● ​Graphical Summary of Assumptions and Conditions

Error of b1 (SE(b1))

● For a sample, we expect b1 to be close to the model 𝛽1.
● For similar samples, the standard error of the slope is a measure of the variability of b1 about the true slope 𝛽1.
● Suppose for the (# of salespeople, sales amount) problem you
took 4 samples and the following were the sample regression line
slopes you found.

Error of b1

● As you can see, every time we sample from the same population
and calculate the regression line, the value of 𝑏1 could be different.
(Why?)
● By observing the value of 1 sample, we would like to estimate the
variation in the values of 𝑏1 .
● This variation is calculated using the standard error of 𝑏1 ,
𝑆𝐸(𝑏1).
Error of b1

● Three aspects of the model that affect 𝑆𝐸(𝑏1) include:
1. Spread of the 𝑦’s around the sample regression line, or the residual standard deviation, 𝑠𝑒
● We already know from previous lectures that this residual standard deviation can be calculated using the formula

𝑠𝑒 = √( Σ(𝑦 − ŷ)² / (𝑛 − 2) )

● Increasing 𝑠e will increase 𝑆𝐸(𝑏1)

Error of b1

● Less scatter around the line means the slope will be more
consistent from sample to sample → lower variation in 𝑏1 . The
picture on the left provides a lower 𝑆𝐸(𝑏1) , and hence is more
accurate.
Error of b1

2. Spread of the 𝑥’s around their mean, 𝑠x
● Increasing 𝑠x will decrease 𝑆𝐸(𝑏1)

Error of b1

● A plot like the one on the right has a broader range of x-values, so it gives a more stable base for the slope. We might expect the slopes of samples from situations like that to vary less from sample to sample. A large standard deviation of 𝑥, 𝑠x, as in the figure on the right, provides a lower 𝑆𝐸(𝑏1) and hence a more accurate regression.
Error of b1

3. Sample size, 𝑛

● Increasing 𝑛 will decrease 𝑆𝐸(𝑏1)


● A larger sample size (scatterplot on the right), 𝑛, gives more
consistent estimates from sample to sample and hence a lower
𝑆𝐸(𝑏1).

Error of b1

● Based on these three factors, the formula for the standard error of 𝑏1, which is 𝑆𝐸(𝑏1), is calculated as

𝑆𝐸(𝑏1) = 𝑠𝑒 / (𝑠x · √(𝑛 − 1))

Distribution of b1

● We know that 𝑏1 varies from one sample to the next.
● Its sampling distribution model is centered at 𝛽1, the slope of the idealized regression line.
● We have also estimated its standard deviation using 𝑆𝐸(𝑏1). What about its shape?
● When we standardize the slopes by subtracting the model slope 𝛽1 and dividing by their standard error, we get a Student’s t-model, this time with 𝑛 − 2 degrees of freedom.
● This t-statistic can be used for hypothesis testing and
confidence intervals.

Hypothesis test for 𝛽1 (Slope)

Error of b1

● Based on these three factors, the formula for the standard error of 𝑏1, which is 𝑆𝐸(𝑏1), is calculated as

𝑆𝐸(𝑏1) = 𝑠𝑒 / (𝑠x · √(𝑛 − 1))

● where 𝑠𝑒 = √( Σ(𝑦 − ŷ)² / (𝑛 − 2) )
● and 𝑠x = √( Σ(𝑥 − x̄)² / (𝑛 − 1) )

Hypothesis test for 𝛽1 (Slope)

● The most common hypothesis test for 𝛽1 is to test if 𝛽1 = 0.
● If 𝛽1 = 0, then that means there is no linear association
between the two variables.
● A slope of zero means that 𝑦 doesn’t change linearly when
𝑥 changes.
● A null hypothesis of a zero slope questions the entire claim
of a linear relationship between the two variables.
● Often that is just what we want to know!

Hypothesis test for 𝛽1 (Slope)

● So when it comes to a hypothesis test for the slope, we are usually testing the following:

𝐻0: 𝛽1 = 0 versus 𝐻A: 𝛽1 ≠ 0

● The test statistic we use is therefore

t = (𝑏1 − 0) / 𝑆𝐸(𝑏1)

● which we would compare with a critical value 𝑡*𝑛−2 at the desired 𝛼 level to draw a conclusion about the test (whether or not 𝛽1 is zero)

CI for 𝛽1 (Slope)

● What if we want to construct a confidence interval around the population regression slope 𝛽1?
● You need your estimate (𝑏1) and your margin of error.
● Therefore, when the assumptions and conditions are met, we can find a confidence interval for the regression slope 𝛽1 from

𝑏1 ± 𝑡* × 𝑆𝐸(𝑏1)

● where the critical value t* depends on the confidence level and has n − 2 degrees of freedom.
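
A minimal sketch of slope inference on made-up (x, y) data: 𝑆𝐸(𝑏1) from the formula above, the t-statistic for 𝐻0: 𝛽1 = 0, and a 95% confidence interval, checked against scipy.stats.linregress:

```python
# Slope standard error, t-test, and CI for a simple regression.
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 2.9, 3.8, 4.2, 5.1, 5.8, 6.9, 7.4])
n = len(x)

fit = stats.linregress(x, y)
resid = y - (fit.intercept + fit.slope * x)
s_e = np.sqrt(np.sum(resid**2) / (n - 2))     # residual standard deviation
s_x = np.std(x, ddof=1)
se_b1 = s_e / (s_x * np.sqrt(n - 1))          # SE(b1); matches fit.stderr

t = fit.slope / se_b1                         # test statistic for beta1 = 0
t_star = stats.t.ppf(0.975, df=n - 2)
print(t, fit.pvalue)                          # two-sided p-value from SciPy
print(fit.slope - t_star * se_b1, fit.slope + t_star * se_b1)
```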

Test for Regression Parameters

Question: Consider the following data set.


Test for Regression Parameters
CI in Regression
Regression Results

● A product manager is interested in learning how sensitive sales are to changes in the unit price of a frozen pizza in downtown Toronto. Here’s the regression of Sales volume on Price for frozen pizza each week for a three-year period:

● What is the regression equation?
● Specify what each column represents. Specify the tests, along with the null and alternative hypotheses and results.
Chapter 20: Multiple Regression

Why Multiple Regression?

● How do Netflix, Amazon, YouTube, and Facebook come up with what material to show to each user?
● How do financial analysts predict stock prices?
● How does Google Maps come up with the fastest route between two locations?
● How do airlines determine the best price of each seat?
● And many more examples are why,
● Multiple regression models are important!

Why Multiple Regression?

● So far we have learned simple regression models!
● Simple regression models we have learned had only 1 predictor variable and 1 response variable.
● In other words, we assumed that the response variable depended on only 1 variable, which was the predictor.
● However,
● The real world is complex.
● Simple models with only one variable are not detailed enough to
be useful for understanding, predicting, and making business
decisions in many real-world situations.
Multiple Regression

● For simple regression, the predicted value depends on only one predictor variable:

ŷ = b0 + b1x

● For multiple regression, we write the regression model with more predictor variables:

ŷ = b0 + b1x1 + b2x2 + ⋯ + bkxk

Example

● Assume we want to predict the price of homes based on a single variable: number of bedrooms.
● We use a sample of 1057 homes and will plot the sales price
against the number of bedrooms.
● We are hoping to find a good linear relationship between these two
variables.
● Let’s start with a boxplot of home prices for 1, 2, ..., 5 bedroom
houses.

Example
Multiple Regression (Example)

● Using software, we fit a regression line to this relationship.

Multiple Regression (Example)

● The variation in Bedrooms accounts for only 21% of the variation in Price.
● Perhaps the inclusion of another factor can account for a portion of
the remaining variation.
● The standard deviation of the residuals is s = 68,432.21 (𝑠e)!
● Let us add another variable to predict prices.
● Let’s add “living area” as the second variable!

Multiple Regression (Example)

● Using software, we fit a regression line to this relationship.

Multiple Regression (Example)

● Now the model accounts for 57.8% of the variation in Price.
● The standard deviation of the residuals has also decreased and is at 𝑠e = 50,142.4. This is indicated by (𝑠) in the results.
● The prediction model is improved as a result of adding an
additional variable to the model.

Multiple Regression Models

● In multiple regressions:
● Residuals are similar to the simple case
● Degrees of freedom is 𝑑𝑓 = 𝑛 − 𝑘 − 1 where 𝑛 is the number of cases and 𝑘 is the number of predictor variables.
● Standard deviation of residuals is

𝑠𝑒 = √( Σ(𝑦 − ŷ)² / (𝑛 − 𝑘 − 1) )
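
A minimal statsmodels sketch on simulated data, confirming the residual degrees of freedom (𝑛 − 𝑘 − 1) and the residual standard deviation for a two-predictor model (the data and coefficients here are made up):

```python
# Multiple regression residual df and residual standard deviation.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, k = 100, 2
X = rng.normal(size=(n, k))                        # two made-up predictors
y = 3 + 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=n)

results = sm.OLS(y, sm.add_constant(X)).fit()
print(results.df_resid)                            # n - k - 1 = 97
print(np.sqrt(results.mse_resid))                  # s_e, close to 0.5
```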

Coefficients

● Take the example of home prices with two predictors.
● The regression line is

● We’d expect both variables to have a positive effect on price: houses with more bedrooms typically sell for more money, as do larger houses.
● How can it be that the coefficient of Bedrooms in the multiple
regression is negative?
● Price drops with increasing bedrooms? Counterintuitive?

Coefficients

● In a multiple regression, coefficients have a more subtle meaning. Each coefficient takes into account the other predictor(s) in the model.
● After accounting for living area, houses with more bedrooms tend
to sell for a lower price.
● Restricting our attention to homes of a certain size, we see that additional bedrooms have a negative impact on price, and this is generally true across all sizes.
● Living Area and Bedrooms are also related.
● Multiple regression coefficients must always be interpreted in
terms of the other predictors in the model (all other predictors are
held constant).

Coefficients

● For houses with a living area between 2,500 and 3,000 square feet, we have,

Coefficients

● For a fixed living area, increasing the number of bedrooms will


lead to smaller bedrooms → lower overall price!
● So does the price of a house increase with an increase in the
number of bedrooms?
○ Yes! If the number of bedrooms is the only predictor.
○ No! If the number of bedrooms increases for a fixed living
area.
● Conclusion: Multiple regression coefficients must be interpreted in
terms of all the other predictors in the model (all other predictors
are held constant).
● Also, as always with regression, be careful not to assume
causation because there is correlation!
Assumptions and Conditions

● Linearity Assumption: Check each of the predictors. Each predictor variable must have a linear relationship with the response variable.
● Independence Assumption: Think about how the data were
collected to decide if the assumption is reasonable
● Linearity Assumption: Also check the residuals plot. Residuals
should show no patterns.
● Equal Variance Assumption: The variability of the residuals
should be about the same for each predictor.
● Normality Assumption: Check to see if the distribution of
residuals is unimodal and symmetric.

Multiple Regression (Example)

● Assuming that all the conditions are satisfied, we fit a multiple regression line to see the relationship between home prices and 5 predictor variables.

Multiple Regression (Example)

The regression model:


● All p-values are small → all 5 predictors are contributing to the model.
● Based on the R² value, more than 60% of the variation in house prices is captured by the regression line. What’s missing?
● Residual standard error is $48,616.

Testing Multiple Regression

● There are several hypothesis tests in multiple regression.
● Each is concerned with whether the underlying parameters (slopes and intercept) are actually zero.
● First, the global test of significance. We ask the global question: Is this multiple regression model any good at all?
● In the home prices example, this is asking, “are home prices determined randomly, or are home prices determined by factors other than those we have in our model?”

Testing Multiple Regression

● The Null Hypothesis for slope coefficients:

𝐻0: 𝛽1 = 𝛽2 = ⋯ = 𝛽𝑘 = 0

● The null hypothesis is stating that none of the predictor variables in the model actually contribute to the response variable.
● This hypothesis is tested against the alternative hypothesis,
● 𝐻A : at least one 𝛽 is not zero.
● We can test this hypothesis with an F-test.
● It’s the generalization of the t-test to more than one predictor.

Testing Multiple Regression

● Similar to the case of ANOVA, the F-statistic here has two degrees of freedom.
● The degree of freedom for the numerator is 𝑘, the number of predictors.
● The degree of freedom for the denominator is 𝑛 − 𝑘 − 1.
● This gives

F = MSR / MSE, with 𝑘 and 𝑛 − 𝑘 − 1 degrees of freedom

● To test the null hypothesis with the F distribution you can:
○ Compare the p-value with 𝛼 (one-sided), or
○ Compare the F statistic (ratio) with the critical F value (one-sided).

Testing Multiple Regression

● What is the next step if the F-test leads to the rejection of the null hypothesis?
● The next step is doing the t-test for each coefficient to see if it is significant. This test has the null hypothesis

𝐻0: 𝛽𝑗 = 0
● The t statistic is the test we already performed in Chapter 18.

● Note that the degrees of freedom for the t-test is n – k – 1.

Testing Multiple Regression

● This test statistic addresses the individual test of significance for each predictor variable.
Testing Multiple Regression

● Remember the “home price” example with 5 predictors: living area, bedrooms, bathrooms, fireplace, and age.

a) State the null and alternative hypotheses for the global as well
as individual test of significance?
b) What is the conclusion on the tests?
CI for Coefficients

● An important inference from the sample is constructing a confidence interval:

𝑏𝑗 ± 𝑡*𝑛−𝑘−1 × 𝑆𝐸(𝑏𝑗)

● Remember, the critical t value comes from the table with 𝑛 − 𝑘 − 1 degrees of freedom.
● The standard error of the slope is usually given in the regression table.

Testing for Parameters


F-Statistic and ANOVA

● An ANOVA result is usually present in the results of a regression computer output.
● It is important to understand the ANOVA analysis in the context of regression.
● What we did before: using ANOVA to test differences between population means.
● ANOVA in the regression context looks at variability in the model and residuals.

F-Statistic and ANOVA

● The key concepts to understand here are the amount of variability that (a) is present in the original data, (b) is explained by the regression, and (c) remains unexplained by the regression.
● It can be shown that the total variability is the sum of the explained and unexplained variabilities:

● SSTotal = SSR + SSE


● Each of these sums of squares is associated with a number of
degrees of freedom.

F-Statistic and ANOVA

● The variability in the original data is measured by the Sum of Squares, Total (SSTotal):

SSTotal = Σ(𝑦 − ȳ)²

● SSTotal measures all the variability that exists in the data and our model combined.
● As before, the degree of freedom for SSTotal is 𝑛 − 1.

F-Statistic and ANOVA

● The variability that is explained by the regression is measured by the Sum of Squares, Regression (SSR):

SSR = Σ(ŷ − ȳ)²

● Note that the model is explaining the variability through 𝑘 predictor variables. The degree of freedom for SSR is 𝑘.
F-Statistic and ANOVA

● The variability that is left unexplained by the regression is measured by the Sum of Squares, Errors (SSE) of the residuals left over between the regression, ŷ, and the data, y:

SSE = Σ(𝑦 − ŷ)²

● The SSE has 𝑛 − 𝑘 − 1 degrees of freedom.

F-Statistic and ANOVA

● When we calculate the variance of some sample data, we divide the total sum of squares by the degrees of freedom 𝑛 − 1, giving the mean sum of squares. We can do the same thing for the other sums of squares in the ANOVA table.
● The mean square, regression (explained), MSR:

MSR = SSR / 𝑘

● The mean square, error (unexplained, residuals), MSE:

MSE = SSE / (𝑛 − 𝑘 − 1)

F-Statistic and ANOVA

● The F-statistic is then calculated as the ratio of the explained and unexplained mean squares:

F = MSR / MSE
● When we see a high value for the F-statistic, we know that a lot of
the variability in the original data has been explained by the
regression.
● higher F values → more significant models.
● A P-value is also given in the ANOVA table, indicating whether the
F-statistic is high enough to imply that the regression is significant
overall.
● Smaller p-values → more significant models.

F ratio (F statistic)

F-Statistic and ANOVA

● An ANOVA table for a multiple regression model.


F-Statistic and R²

● We already know another measure of how much variation in our data is explained by the regression:

R² = SSR / SSTotal

● Or,

R² = 1 − SSE / SSTotal

F-Statistic and R²

● By using the expressions for SSE, SSR, SST, and R², it can be shown that:

F = (R² / 𝑘) / ((1 − R²) / (𝑛 − 𝑘 − 1))
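
A quick numeric check of this identity, plugging in the values from the earlier two-predictor home-price example (R² = 0.578, n = 1057, k = 2):

```python
# Overall F computed directly from R-squared, n, and k.
r2, n, k = 0.578, 1057, 2
F = (r2 / k) / ((1 - r2) / (n - k - 1))
print(F)    # a very large F, so the regression is significant overall
```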

Testing Regression Models

● Question: What can predict how much a motion picture will make?
We have data on a number of releases that includes the USGross
(in $), the Budget ($), the Run Time (minutes), and the average
number of Stars awarded by reviewers. The first several entries in
the data table look like this:
Testing Regression Models

● We want a regression model to predict USGross. Parts of the regression output computed in Excel look like this:

a) What is the null hypothesis tested for the coefficient of Stars in this table? Is “Stars” significant (contributing to the model)?

Testing Regression Models

● Here is another part of the regression output


R² and adjusted R²

● Adding new predictor variables to a model will either keep R² at the same level or increase it.
● Adding a new predictor variable will never decrease the R².
● Adding a new predictor variable will never decrease the R².
● Too many predictor variables add to the complexity of the model.
● Ideally, we want to have a model that is not too complex but at the
same time captures the data well.
● So checking R² only does not help much to determine if a variable
should be added to the model or not.
● For this we use the adjusted R².

R² and adjusted R²

● Adjusted R² imposes a “penalty” on the correlation strength of larger models, depreciating their R² values to account for an undesired increase in complexity.
● If a predictor variable is added to the model, the adjusted R² can,
○ Shrink if the predictor variable does not contribute to the
model, and
○ Grow if the predictor variable contributes to the model
● Adjusted R² can even become negative (unlike R² )
● Adjusted R² can also help compare models with different numbers
of predictor variables!
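
A minimal sketch of the usual adjusted-R² formula. The slides describe the penalty but do not print the formula, so this standard form is an assumption:

```python
# Adjusted R-squared: penalizes R^2 for model size (k predictors, n cases).
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.578, 1057, 2))   # slightly below the raw R^2 of 0.578
```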

Testing Regression Models

Question: In the movies example,


a) What is the R² for this regression? What does it mean?
b) Why is the “Adjusted R-Square” in the table different from the
“R-Square”?

Residuals
Multiple Regression

Example
Example
