Commerce 1DA3 Notes-6
Statistical Analysis
Chapter 2: Data
What Is Data?
● Data tables are cumbersome for complex data sets, so often two
or more separate data tables are linked together in a relational
database
● Each data table included in the database is a relation because it is
about a specific set of cases with information about each of these
cases for all of the variables
● Example: A typical relational database is provided consisting of
three relations: customer data, item data, and transaction data
Types of Variables
Data Collection
Key Words:
Sampling
Features of Sampling
● The size of the sample determines what we can conclude from the
data regardless of the size of the population
Stratified Sampling:
Cluster Sampling:
● Split the population into parts or clusters that each represent the
population. Perform a census within one or a few clusters at
random.
● If each cluster fairly represents the population, cluster sampling
will generate an unbiased sample.
Systematic Sampling:
Multistage Sampling:
● A survey that can yield the information you need about the
population in which you are interested is a valid survey
● To help ensure a valid survey, you need to ask four questions:
○ What do I want to know?
○ Who are the right respondents?
○ What are the right questions?
○ What will be done with the results?
Displaying Data
Charts
● Pie Charts: Pie charts show the whole group as a circle (“pie”)
sliced into pieces. The size of each piece is proportional to the
fraction of the whole in each category. The pie chart for Loblaw
data is displayed below.
Frequency Tables
Frequency Distribution
● Groups data into categories and counts the number of
observations in each category
Contingency Tables
Contingency Distribution
● Conditional Distributions: We may want to restrict variables in a
distribution to show the distribution for just those cases that satisfy
a specified condition. This is called a conditional distribution.
(e.g., social networking use given the country of focus is Egypt)
Simpson’s Paradox
Frequency Table
Histogram
Example of Histogram
● Stem-And-Leaf Diagrams
1) Decide how wide to make the bins – if there are n data points, use
log2 𝑛 for the number of bins
2) Determine the count for each bin
3) Decide where to place values that land on the endpoint of a bin.
For example, does a value of $5 go into the $0 to $5 bin or the $5
to $10 bin? The standard rule is to place such values in the higher
bin.
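The binning steps above can be sketched in code. A minimal sketch — the helper names `suggested_bin_count` and `bin_counts` are ours, not from the course:

```python
import math

def suggested_bin_count(n):
    # Rule of thumb from the notes: about log2(n) bins for n data points
    return max(1, math.ceil(math.log2(n)))

def bin_counts(data, width, start=0.0):
    # Count observations per bin. A value landing exactly on a bin
    # endpoint goes into the higher bin (the standard rule above):
    # e.g. with width 5, the value 5 goes into the $5-to-$10 bin.
    counts = {}
    for y in data:
        lo = start + math.floor((y - start) / width) * width
        counts[lo] = counts.get(lo, 0) + 1
    return counts
```

For example, 64 data points suggest about log2(64) = 6 bins.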
● A stem and leaf display is like a histogram, but it also gives the
individual values
● These are easy to make by hand for data sets that aren’t too large,
so they’re a great way to look at a small batch of values quickly
Describing Data
Shape
Centre
Mode - Median - Mean
● We need to determine how spread out the data are because the
more the data vary, the less a measure of centre can tell us.
● One simple measure of spread is the range, defined as the
difference between the extremes (max and min)
● Range = Max - Min
Spread
● The quartiles frame the middle 50% of the data. One-quarter of
the data lies below the lower quartile, Q1, and one-quarter lies
above the upper quartile, Q3.
● The interquartile range (IQR) is defined as the difference
between the two quartiles: IQR = Q3 − Q1
Spread
● Variance
● What is variance?
● Variance: Average of squared deviations between data points and
the mean
○ Variance Unit of Measurement: (Unit of data)^2
● For sample values 𝑦1,𝑦2,...,𝑦𝑛 the sample variance (𝑠^2) is
calculated as,
Spread
For population values 𝑦1, 𝑦2, ..., 𝑦𝑁 the population variance (𝜎^2) [𝜎 is
the Greek letter sigma] is calculated as,
Spread
● Standard Deviation: Standard deviation represents, on average,
how far data points are from the mean
○ Standard Deviation Unit of Measurement: Same unit as
data
● What are standard deviations for the sample and
population as calculated in the previous slide?
Spread
● Coefficient of Variation (CV)
○ What is the CV for a dataset: Measure of relative spread
● What is the CV for a sample and a population?
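The measures of spread above — sample variance s², standard deviation s, and the CV — can be computed directly. A minimal sketch, using the n − 1 divisor for a sample:

```python
import math

def sample_stats(ys):
    # Sample variance s^2: squared deviations from the mean,
    # divided by n - 1; units are (unit of data)^2.
    n = len(ys)
    mean = sum(ys) / n
    var = sum((y - mean) ** 2 for y in ys) / (n - 1)
    sd = math.sqrt(var)   # standard deviation: same units as the data
    cv = sd / mean        # coefficient of variation: relative spread, unitless
    return mean, var, sd, cv
```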
Percentile
3) Erect (but don’t show in the final plot) “fences” around the main
part of the data, placing the upper fence 1.5 IQRs above the upper
quartile and the lower fence 1.5 IQRs below the lower quartile.
● 𝐼𝑄𝑅 = 𝑄3 − 𝑄1 = 0.297
● 𝑄3 + 1.5𝐼𝑄𝑅 = 1.972 + 0.4455 = 2.41
● Q1 - 1.5*IQR
4) Draw lines (whiskers) from each end of the box up and
down to the most extreme data values found within the
fences.
● The centre of a boxplot shows the middle half of the data between
the quartiles – the height of the box equals the IQR.
● If the median is roughly centred between the quartiles, then the
middle half of the data is roughly symmetric. If it is not centred, the
distribution is skewed.
● The whiskers show skewness as well if they are not roughly the
same length.
● The outliers are displayed individually to keep them out of the way
in judging skewness and to display them for special attention
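The fence construction above can be sketched with Python's standard library. Textbooks and software differ slightly in how they interpolate quartiles; this sketch uses one common convention (`method='inclusive'`):

```python
import statistics

def boxplot_fences(data):
    q1, q2, q3 = statistics.quantiles(data, n=4, method='inclusive')
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr   # fences are not drawn in the final plot
    upper_fence = q3 + 1.5 * iqr
    # Points beyond the fences are displayed individually as outliers
    outliers = [y for y in data if y < lower_fence or y > upper_fence]
    return q1, q2, q3, iqr, lower_fence, upper_fence, outliers
```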
Boxplot
Question: Draw a boxplot that has two outliers on the right hand
side. Show Q1, Q2 and Q3. Show the IQR. Show the max and min
values.
Comparing Groups
Comparing Groups
● Histograms work well for comparing two groups, but boxplots tend
to offer better results when side-by-side comparison of several
groups is sought.
● Below the NYSE data is displayed in monthly boxplots.
Standardizing
Example: Compare two companies (from the “top” 100 companies) with
respect to the variables New Jobs (jobs created) and Average Pay.
Standardizing
● To find how many standard deviations a value is from the
mean we calculate a standardized value or z-score.
● z-Score Formula:
Standardizing
● In the following dataset find the z-score of all sample values (ȳ
= 6 and s = 3.16). This procedure is called standardizing the data.
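Standardizing is a one-line transformation once ȳ and s are known. A minimal sketch using the slide's summary values ȳ = 6 and s = 3.16 (the data values below are illustrative):

```python
def z_scores(ys, mean, sd):
    # z = (y - mean) / sd: how many standard deviations y is from the mean
    return [(y - mean) / sd for y in ys]

# Slide's summary values: ybar = 6, s = 3.16
zs = z_scores([6.0, 9.16, 2.84], 6.0, 3.16)
```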
Standardizing Data
Outlier Identification:
Q3+1.5 * IQR
Q1 - 1.5 * IQR
● Given the sample mean x̄, the sample standard deviation s,
and a relatively symmetric and bell-shaped distribution,
○ Approximately 68% of all observations fall in the interval x̄ ±
s
○ Approximately 95% of all observations fall in the interval x̄ ±
2s
○ Approximately 99.7% (almost 100%) of all observations fall in
the interval x̄ ± 3s
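The 68–95–99.7 rule can be checked against the Normal model using the error function:

```python
import math

def normal_within(k):
    # P(|X - mean| <= k standard deviations) under a Normal model
    return math.erf(k / math.sqrt(2))
```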
Scatterplots
Measures of Association
Understanding Correlation
● Correlation Coefficient ( r ):
[-1, +1]
Understanding Correlation
N: sample size
Covariance
Measures of Association
Measure of Association
Association
Example
Understanding Correlation
● The correlation coefficient is the sum of the products zx·zy over
every point in the scatterplot, divided by n − 1.
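That definition — the sum of z-score products divided by n − 1 — translates directly to code:

```python
import math

def correlation(xs, ys):
    # r = sum(zx * zy) / (n - 1), using sample standard deviations
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    return sum(((x - mx) / sx) * ((y - my) / sy)
               for x, y in zip(xs, ys)) / (n - 1)
```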
Understanding Correlation
Correlation Properties:
Correlation Coefficient
Understanding Correlation
Correlation Table
Correlation ≠ Causation
A few Notes
Selection
Example (continued): We see that the points don’t all line up, but that a
straight line can summarize the general pattern. We call this line a linear
model. This line can be used to predict sales for the level of advertising
expenses.
The Linear Regression Model
● The regression line: The line that best fits all the points.
● What do we mean by “best fits”?
○ The line is used to predict values of the dependent variable
for values of the independent variable.
● Note: The value predicted by the line is usually not equal to the
value of the data. There are some errors (residuals).
● Even though we know that the line is not a “perfect” prediction, we
can still work with the linear model and accept some level of error.
● Unless the points form a perfect line, we will always have some
errors.
The Linear Model
For our example of sales and advertising expenses, the line shown
with the scatterplot has the equation that follows
Correlation and the Line
a) What is the slope? How can you interpret the slope in this
question?
● + and -
● The OLS method chooses the line whereby the error sum of
squares (SSE) is minimized.
● SSE is the sum of the squared differences between the observed
values 𝑦 and their predicted values ŷ.
● The OLS method predicts the straight line that is “closest” to the
data.
● The OLS method tries to minimize SSE, which is,
We can find the slope of the least squares line using the correlation
and the standard deviations as follows,
● The slope gets its sign from the correlation. If the correlation is
positive, the scatterplot runs from lower left to upper right and the
slope of the line is positive. (remember, standard deviation is
always a positive number).
● The slope gets its units from the ratio of the two standard
deviations, so the units of the slope are a ratio of the units of the
variables.
● To find the intercept of our line, we use the means. If our line
estimates the data, then it should predict 𝑦(bar) for the x- value
𝑥(bar). Thus we get the following relationship for 𝑦(bar) from our
line,
● We can now solve this equation for the intercept to obtain the
formula for the intercept
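The slope and intercept formulas above (b1 = r·sy/sx, and b0 = ȳ − b1·x̄) can be sketched as:

```python
def least_squares_line(xs, ys):
    # b1 = r * sy / sx, computed here in the equivalent form sxy / sxx
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar   # the line passes through (xbar, ybar)
    return b0, b1
```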
● In summary, to find
Least squares lines are commonly called regression lines. We’ll need to
check the same condition for regression as we did for correlation.
● Quantitative Variables Condition
● Linearity Condition
● Outlier Condition
Standardizing Data
● That means, for every standard deviation above (or below) the
mean we are in advertising expenses, we’ll predict that the sales
are 0.693 standard deviations above (or below) their mean.
▪ The reason can be seen from the best-fit line for the standardized
values, ẑy, which is,
Regression Lines
Regression
● The residuals are the part of the data that has not been
modeled.
The plot of the Amazon residuals is given below. It does not
appear that there is anything interesting occurring.
● The variation in the residuals is the key to assessing how well a
model fits.
Nonlinear Relationships
Probability
● Example:
○ Picking a student at random, the probability that her/his
birthday is in the month of September.
○ If you draw a card from a standard deck of cards, what is the
probability of drawing a face card?
Probability Rules
Rule 1:
Probability Rules
Disjoint Events
▪ The General Addition Rule calculates the probability that either of two
events occurs. It does not require that the events be disjoint.
Joint Probability
Marginal Probability
Conditional Probability
Independent Events
Random Variable
● Note: 𝐸(𝑋) should not be confused with the most probable value
of the random variable. It may not even be one of the possible
values of the random variable.
Expected Value of a Random Variable
● Variance talks about how the values are dispersed around the
expected value, whether they are closely clustered or scattered
around it.
● It is a measure of dispersion.
Binomial Distribution
1. There are only two possible outcomes (success and failure) for
each trial.
2. The probability of success, denoted p, is the same for each trial.
The probability of failure is q = 1 – p.
3. The trials are independent.
● Or
Introduction: Binomial Distribution
Question 2: Now, let’s throw the die 3 times, what is the probability of
rolling the number 5 exactly 2 times?
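Question 2 is a Binomial calculation with n = 3 trials and success probability p = 1/6. A minimal sketch of the Binomial pmf:

```python
from math import comb

def binomial_pmf(k, n, p):
    # P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# P(rolling a 5 exactly 2 times in 3 throws) = C(3,2) * (1/6)^2 * (5/6)
answer = binomial_pmf(2, 3, 1 / 6)
```

The answer is 15/216, about 0.069.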
Binomial Distribution
● This is because,
● Probabilities like 𝑃(𝑋 ≤ 10) or 𝑃(8 ≤ 𝑋 ≤ 10) can be read from
the density function by calculating the area under the density
curve 𝑓(𝑥).
Normal Distribution
Z-Score (reminder)
● z-score 2.2 implies that the point is 2.2 standard deviations to the
right of the mean
● z-score -1.8 implies that the point is 1.8 standard deviations to the
left of the mean
Standardization
Suppose you have the z-score and want to find the x-value
● In a Normal distribution, about 68% of the values fall within one
standard deviation of the mean, about 95% of the values fall within
two standard deviations of the mean, and about 99.7% of the
values fall within three standard deviations of the mean.
Using the z table provided on the previous slide, find the following
probabilities.
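Instead of a z-table, the same probabilities can be computed from the standard Normal CDF via the error function:

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    # Standardize to a z-score, then P(X <= x) under Normal(mu, sigma)
    z = (x - mu) / sigma
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))
```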
zTable
Background (Example)
● If instead of two specialists we had say 100 of them and they each
collected a sample and found a proportion of customers who’d
increase their spending following the offer, what are your thoughts
on the distribution (shape, center, spread) of these different 100
proportion values?
● What would be the shape of this distribution?
● What would affect the center (mean) of this distribution?
● What would affect the spread (std. dev.) of this distribution?
Sample Proportions
● If events have only two outcomes, we can call them “success” and
“failure”
● The proportion of “success” in a sample is called the “sample
proportion”.
● Examples: We would like to estimate the proportion of smokers
over the age of 25 in a city. We select 100 people from the city
(25+) and measure the proportion of them who smoke. This
proportion will be the sample proportion.
● The sample proportion is most probably different from the
population proportion (true proportion).
Sample Proportions
What are some examples of proportions?
Sample Proportions
● The result of the simulation can be summarized in a table, as
below. (𝑝 = 0.25 and 𝑛 = 70)
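The simulation behind that table can be reproduced with a short script (the seed and repetition count are our choices, not the slide's):

```python
import random

def simulate_sample_proportions(p, n, reps, seed=1):
    # Draw `reps` independent samples of size n from a population with
    # true proportion p; record each sample's proportion of successes.
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n
            for _ in range(reps)]

# Slide's settings: p = 0.25, n = 70
props = simulate_sample_proportions(0.25, 70, 2000)
```

The distribution of these 2000 sample proportions is roughly bell-shaped and centred near p = 0.25.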
Sample Proportions
● Remember
● The difference between sample proportions, referred to as sampling
error, is not really an error. It’s just the variability you’d expect to
see from one sample to another. A better term might be sampling
variability.
Sample Proportions
Some Notations…
Sample mean
b) If you select 20 bags of chips randomly, what is the probability that the
average weight of this sample is less than 59 grams?
c) If you select 10 bags of chips randomly, what is the probability that the
average weight of this sample is greater than 59 grams?
Central Limit Theorem
a) If you select one bag of chips randomly, what is the probability that it
weighs less than 59 grams?
b) If you select 35 bags of chips randomly, what is the probability that the
average weight of this sample is less than 59 grams?
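A sketch of parts a) and b), assuming hypothetical population values — the notes don't restate them here — of mean μ = 60 g and standard deviation σ = 3 g for single bags:

```python
import math

def prob_mean_below(x, mu, sigma, n):
    # CLT: the sample mean is approximately Normal(mu, sigma / sqrt(n)),
    # so standardize with the smaller standard deviation sigma / sqrt(n).
    z = (x - mu) / (sigma / math.sqrt(n))
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical values: mu = 60 g, sigma = 3 g (assumed for illustration)
p_one_bag = prob_mean_below(59, 60, 3, 1)    # part a): n = 1
p_sample = prob_mean_below(59, 60, 3, 35)    # part b): n = 35
```

Averaging 35 bags shrinks the spread by √35, so a sample mean below 59 g is far less likely than a single bag below 59 g.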
Example
Standard Error
Example
A Confidence Interval
Confidence Interval
Question: Using the standard normal table, find an upper and lower
bound such that,
Confidence Intervals
● We also know from the previous chapter that the shape of the
sampling distribution is approximately Normal and we can use p̂ to
find the standard error.
A Confidence Interval
Here are a few things we can say about the population by looking at
this sample.
A Confidence Interval
What does it mean when we say we have 95% confidence that our
interval contains the true proportion?
Confidence Intervals
● The more confident we want to be, the larger the margin of error
must be (larger ME→more confidence)!
● smaller ME→less confidence!
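The interval p̂ ± ME, with ME = z*·SE(p̂), can be sketched as follows (z* = 1.96 for 95% confidence):

```python
import math

def proportion_ci(p_hat, n, z_star=1.96):
    # SE(p_hat) = sqrt(p_hat * (1 - p_hat) / n); ME = z* * SE
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    me = z_star * se
    return p_hat - me, p_hat + me
```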
Critical Value
● Recall for the 95% confidence interval the length of the margin of
error was:
Question: Using the standard normal table, find the critical value 𝑧∗ for,
Critical Values
Confidence Intervals
Example (continued):
P𝑀 − P𝑂
● Not knowing what the true population proportions for the two
provinces are, what we can do is construct a confidence interval
for the difference between the two true population proportions (P𝑀
−P𝑂).
● But we also need the standard error (SE) of the difference (P𝑀 −
P𝑂). The SE of the difference cannot be calculated by simply
subtracting the two standard errors.
● To find the SE of the difference we use the variances,
● Or,
● Use the formula below to calculate the confidence interval for the
difference between two populations:
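A sketch of that formula — the variances add, so the SE of the difference is the square root of the summed variances:

```python
import math

def two_prop_ci(p1, n1, p2, n2, z_star=1.96):
    # SE(p1 - p2) = sqrt(p1(1-p1)/n1 + p2(1-p2)/n2): add variances,
    # never subtract standard errors.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z_star * se, diff + z_star * se
```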
CI for Difference between Two Proportions
Hypotheses
● In our example about the TSX, the alternative hypothesis is known
as a two-sided alternative, because we are equally interested in
deviations on either side of the null hypothesis value. 𝐻A: 𝑝 ≠ 0.5.
○ Because the alternative hypothesis includes any value other
than 0.5.
● In our example about returned merchandise, the alternative
hypothesis is called a one-sided alternative because it focuses
on deviations from the null hypothesis value in only one direction.
𝐻A : 𝑝 < 0.0086.
● Because the alternative hypothesis includes only values smaller
than 0.0086.
Hypotheses
Things to consider:
● Don’t put the issue that you are investigating into the null
hypothesis.
● The issue that you are investigating should go in the alternative
hypothesis.
● Don’t have different numbers in the null and alternative
hypotheses.
● The numerical values are always the same.
● This is the logic of jury trials. In British common law, the null
hypothesis is that the defendant is innocent.
● The evidence takes the form of facts that seem to contradict the
presumption of innocence. For us, this means collecting data.
● The jury considers the evidence in light of the presumption of
innocence and judges whether the evidence against the defendant
would be plausible if the defendant were in fact innocent.
● Like the jury, we ask: “Could these data plausibly have happened
by chance if the null hypothesis were true?”
P-Value
● The p-value asks: first assume that the null hypothesis is
true; now, given that assumption, how likely would it be to
produce a sample like the one we produced, or one even more
extreme?
P-Value
● If the p-value is high (or just not low enough), we conclude that
we haven’t seen anything unlikely → our assumption is consistent
with the data → but this does not prove the null hypothesis is true.
● We have no reason to reject the null hypothesis.
● In other words, we fail to reject the null hypothesis!
○ Example: The coin (1000 flips, 520 Heads, 480 Tails)
P-Value
● Let’s test to see, how likely it is to get a sample like the one we got,
or one that is less likely, given that we assume the null hypothesis
were true (assume that 𝑝 = 0.5)
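For the coin example (1000 flips, 520 heads), the test can be sketched as a one-proportion z-test:

```python
import math

def one_prop_ztest(p_hat, p0, n, two_sided=True):
    # z = (p_hat - p0) / SE, with SE = sqrt(p0 * (1 - p0) / n)
    se = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    tail = 0.5 * (1 - math.erf(abs(z) / math.sqrt(2)))  # one-tail area
    return z, 2 * tail if two_sided else tail

# Coin example: 520 heads out of 1000 flips, H0: p = 0.5 (two-sided)
z, p_value = one_prop_ztest(0.52, 0.5, 1000)
```

The p-value comes out near 0.21, so 520 heads is not unusual under H0 and we fail to reject the null.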
P-Value
● Because this is a two-sided alternative hypothesis, we are
basically trying to calculate,
P-Value
She collects a random sample of 50 visits since the new website has
gone online and finds that 24% of them made purchases.
Critical Value
1) If the z for our test statistics (sample) is more extreme than the
critical value (𝑧∗) (z is further away from the mean than 𝑧∗
is)→Then it is in the critical region→we will reject the null
hypothesis.
a) p-value is smaller than 𝛼
2) If the z is less extreme than the critical value (𝑧∗) (z is closer to the
mean than 𝑧∗ is)→then it is not in the critical region→we cannot
(fail to) reject the null hypothesis.
a) p-value is greater than 𝛼
Critical Value
● But we don’t know what the unknown value p0 is, therefore, we will
need to estimate it using the following,
Type 1 Error
● Type I error occurs when we reject a null hypothesis, when the null
hypothesis is actually true (false positive).
Type 1 Error
● When you are choosing the significance level, you are actually
setting the probability of type I error.
● Probability of type I error is equal to 𝛼.
● 𝛼 is the probability that determines the critical region.
● If a sample statistic is one of those “rare” samples that falls into the
critical region, we will reject the null hypothesis while it was
actually true.
● A “rare” and “extreme” sample may fall into the critical region
(probability 𝛼), and lead to a true null hypothesis be rejected by
error.
Type 1 Error
Type 2 Error
Power
Critical Value p*
● For a right-sided test, we’ll reject the null if the observed proportion
p̂ is greater than some critical value p*.
● For a left-sided test, we’ll reject the null if the observed
proportion p̂ is smaller than some critical value p*.
● For a two-sided test, one of the above holds, depending on which
side of the null value p̂ falls.
● If p̂ is more extreme than p* we reject H0
Critical Value p*
● The upper figure shows the null hypothesis model. The lower
figure shows the true model.
Power
● The power of the test is the green region on the right of the lower
figure.
● Reducing α moves the critical value p* to the right (for a
right-sided test) but increases β, and correspondingly reduces the
power.
Power
● Making the standard deviations smaller increases the power
without changing the alpha level.
Summary
Sampling Distribution
Reminder:
● The confidence interval for the mean follows the same logic.
● For the population mean, there are two cases:
○ If the population standard deviation 𝜎 is known
○ If the population standard deviation 𝜎 is unknown
● Given a sample mean (ȳ), we can estimate the true population
mean (𝜇) by constructing a confidence interval that will include the
population mean with some level of certainty (confidence level).
● Or,
● But when we use the standard error for the mean, the distribution
is no longer normal. The distribution will be something that we
refer to as the “student’s t” distribution.
● The new model, the Student’s t, is a model that is always
bell-shaped, but the details change with the sample sizes.
● The Student’s t-models form a family of related distributions
depending on a parameter known as degrees of freedom.
Student's t model
● The t-model (solid curve) with 2 degrees of freedom vs. the normal
model (dashed curve).
Student’s t-models won’t work for data that are badly skewed. We
assume the data come from a population that follows a Normal model.
Data being Normal is idealized, so we have a “nearly normal” condition
we can check.
● For very small samples (n < 15), the data should follow a Normal
model very closely. If there are outliers or strong skewness, t
methods shouldn’t be used.
● For moderate sample sizes (n between 15 and 40), t methods will
work well as long as the data are unimodal and reasonably
symmetric.
● For sample sizes larger than 40, t methods are safe to use unless
the data are extremely skewed. If outliers are present, analyses
can be performed twice, with the outliers and without.
T-table
CI for the Mean
● Question: When homeowners fail to make mortgage payments,
the bank forecloses and sells the home, often at a loss. In one
large community, realtors randomly sampled 36 bids from potential
buyers to determine the average loss in home value. The sample
showed that the average loss was $11,560 with a standard
deviation of $1500.
a) Assuming that conditions to use the t-model are satisfied, find a
95% confidence interval for the mean loss in value per home.
b) Interpret this interval and explain what 95% confidence means.
c) Suppose that, nationally, the average loss in home values at this
time was $10,000. Do you think the loss in the sampled community
differs significantly from the national average? Explain.
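A sketch of part a), with t* ≈ 2.030 read from a t-table for df = 35 at 95% confidence:

```python
import math

def t_confidence_interval(ybar, s, n, t_star):
    # ybar ± t* * SE(ybar), where SE(ybar) = s / sqrt(n)
    se = s / math.sqrt(n)
    return ybar - t_star * se, ybar + t_star * se

# Question's values: n = 36 bids, ybar = $11,560 loss, s = $1,500
lo, hi = t_confidence_interval(11560, 1500, 36, 2.030)
```

The interval is roughly ($11,052, $12,068); since $10,000 lies outside it, the sampled community's mean loss does appear to differ from the national average (part c).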
HT for the Mean
● A typical t-table is shown here. The table shows the critical values
for varying degrees of freedom, df, and for varying confidence
levels.
● Since the t-models get closer to the normal as df increases, the
final row has critical values from the Normal model and is labeled
“∞”
● It’s easy to find the mean and standard deviation of the spend lift
(increase in spending) for each of these groups (ȳ and 𝑠), but that’s
not what we want.
● We need the standard deviation of the difference in their means.
● For that, we can use a simple rule: If the sample means come from
independent samples, the variance of their sum or difference is
the sum of their variances.
● We’ll use the standard error to see how big the difference really is.
● Just as for a single mean, the ratio of the difference in the means
to the standard error of that difference has a sampling model that
follows a Student’s t distribution.
● The sampling model isn’t really Student’s t, but by using a special,
adjusted degrees of freedom value, we can find a Student’s
t-model that is so close to the right sampling distribution model that
nobody can tell the difference.
Two-Sample Test
Two-Sample t-test
Two-Sample Test
a) What are the null and alternative hypotheses for this test?
b) Use a two-sample t-test to conduct the test. Does the average
spending really increase following a promotion? (use 𝑑𝑓 = 992 and
𝛼 = 0.05)
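The two-sample t statistic — the difference in means over the SE of that difference, where variances add — can be sketched as:

```python
import math

def two_sample_t(ybar1, s1, n1, ybar2, s2, n2):
    # SE of the difference: sqrt(s1^2/n1 + s2^2/n2), since variances
    # of independent sample means add.
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (ybar1 - ybar2) / se
```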
CI for diff. b/w two Means
● When the conditions are met, we’re ready to find a two-sample
t-confidence interval for the difference between the means of two
independent groups, 𝜇1 − 𝜇2. The confidence interval is
Why ANOVA?
Example
Experiments
Experimental Design
Factorial Designs
Confounding
● When the levels of one factor are associated with the levels of
another factor, we say that two factors are confounded.
● Example: A bank offers credit cards with 2 possible treatments:
● There is no way to separate the effect of the rate factor from that of
the fee factor. These two factors are confounded in this design.
One-way ANOVA
● Hypotheses,
○ In ANOVA, the null hypothesis generally assumes that the
means of all the populations are equal: 𝜇1 = 𝜇2 = ⋯ = 𝜇𝑘
○ The alternative hypothesis states that at least one of the
population means is not equal to another population mean.
In other words, it states that “not all population means are
equal”.
Example
Example (Scenario 2)
● As you can see, the variations within each group are very small.
The observations are clustered tightly together.
● Ratings within groups did not show much variation.
● But the groups are very different from each other.
● The Between-group variations are significant.
Example
F Statistic
One-Way ANOVA
2) The Mean Square due to Error (within-group variation
measure)
● We compare the SST with how much variation there is within each
group. The Sum of Squares Error (SSE) captures it like this
One-Way ANOVA
● When there is no difference in treatments, it can be shown that
the ratio MST/MSE follows an F distribution:
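The whole computation — SST and MST between groups, SSE and MSE within groups, then F = MST/MSE — can be sketched as:

```python
def anova_f(groups):
    # groups: a list of lists of observations, one list per treatment
    k = len(groups)
    N = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / N
    means = [sum(g) / len(g) for g in groups]
    sst = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    sse = sum(sum((y - m) ** 2 for y in g) for g, m in zip(groups, means))
    mst = sst / (k - 1)   # between-group mean square
    mse = sse / (N - k)   # within-group mean square
    return mst / mse
```

A large F means the between-group variation dominates the within-group variation, which is evidence against equal means.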
F-ratio ( F-Statistic )
One-Way ANOVA
Offers:
1) Free sticks
2) Free pad
3) Fifty dollars off next purchase
4) No coupon (control group)
● Here is a summary of the spending for the month after the start of
the experiment. A total of 4000 offers were sent, 1000 per
treatment.
One-Way ANOVA ( Example )
● And the data from the experiments are summarized in the table
below.
One-Way ANOVA
Question: A wine producer wants to evaluate the quality of its new wine
across Canada. The dining facilities of a hotel chain in three major cities
are selected for the study (Montreal, Toronto, and Vancouver). On the
same day, a number of patrons in each city are asked to taste the wine
and rate the overall quality of the wine on a 5-point scale, with 1 =
horrible, 3 = moderately good, and 5 = excellent. Assuming that the
average qualities of the wine in Montreal, Toronto and Vancouver are 𝜇1,
𝜇2 and 𝜇3, respectively, at significance level (𝛼 = 0.05), we would like to
test, if the quality of the wine across the three cities is the same.
● If ANOVA states that not all population means are equal, we can
investigate further which population means are different by
pairwise-comparisons between them. This is called the post hoc
analysis to ANOVA.
● A possible method for post hoc analysis is the 𝑡-test of significance
between two population means (𝐻a : 𝜇1 − 𝜇2 > 0, etc.). This
method can be performed for all possible pair combinations.
● This method can be complemented by constructing estimated
confidence intervals around population means and comparing
confidence intervals to figure out the relationship between
population means.
Population Regression
where 𝜇𝑦 is the true population mean of all the 𝑦’s of the population
for any given 𝑥.
Population Regression
● Therefore,
● Regression Inference:
● Collect a sample and estimate the population 𝛽’s by finding
a regression line
● Estimate?!
● Using sample information, we can create confidence
intervals and hypothesis tests for population parameters.
● We observe 𝑏0; it estimates 𝛽0.
● We observe 𝑏1; it estimates 𝛽1.
● We observe 𝑒; it estimates 𝜀.
Assumptions and Conditions
1. Linearity Assumption
2. Independence Assumption
3. Equal Variance Assumption
4. Normal Population Assumption
Error of b1 (SE(b1))
Error of b1
● As you can see, every time we sample from the same population
and calculate the regression line, the value of 𝑏1 could be different.
(Why?)
● By observing the value of 1 sample, we would like to estimate the
variation in the values of 𝑏1 .
● This variation is calculated using the standard error of 𝑏1 ,
𝑆𝐸(𝑏1).
Error of b1
● Less scatter around the line means the slope will be more
consistent from sample to sample → lower variation in 𝑏1 . The
picture on the left provides a lower 𝑆𝐸(𝑏1) , and hence is more
accurate.
Error of b1
● A plot like the one on the right has a broader range of x-values, so
it gives a more stable base for the slope. We might expect the
slopes of samples from situations like that to vary less from
sample to sample. A large standard deviation of 𝑥, 𝑠x, as in the
figure on the right, provides a lower 𝑆𝐸(𝑏1) and hence a more
accurate regression.
Error of b1
3. Sample size, 𝑛
Error of b1
● Based on these three factors, the formula for standard error of 𝑏1,
which is 𝑆𝐸(𝑏1) is calculated as,
Distribution of b1
● Where
● And
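Those three factors — the residual scatter around the line, the spread of x, and the sample size n — combine as SE(b1) = se / (sx·√(n−1)), where se is the residual standard deviation. A minimal sketch (the helper name is ours):

```python
import math

def se_b1(xs, ys, b0, b1):
    # Residual standard deviation se, with n - 2 degrees of freedom
    n = len(xs)
    resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    se = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))
    # SE(b1) = se / (sx * sqrt(n - 1)): less scatter, more spread in x,
    # or a larger sample all shrink the standard error of the slope.
    xbar = sum(xs) / n
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    return se / (sx * math.sqrt(n - 1))
```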
CI for 𝛽1 (Slope)
Example
Multiple Regression (Example)
● In multiple regression:
● Residuals are similar to the simple case
● Degrees of freedom is 𝑑𝑓 = 𝑛 − 𝑘 − 1 where, 𝑛 is the number of
cases and 𝑘 is the number of predictor variables.
● Standard deviation of residuals is
Coefficients
● For houses with living area between 2,500 and 3,000 square
feet, we have,
Coefficients
● Similar to the case of ANOVA, the F-statistic here has two
degrees-of-freedom parameters.
● The degrees of freedom for the numerator is 𝑘, the number of
predictors.
● The degrees of freedom for the denominator is 𝑛 − 𝑘 − 1.
● This gives,
● What is the next step if the F-test leads to the rejection of null
hypothesis?
● The next step is doing the t-test for each coefficient to see if it is
significant. This test has the null hypothesis,
a) State the null and alternative hypotheses for the global as well
as individual test of significance?
b) What is the conclusion on the tests?
CI for Coefficients
● SSTotal measures all the variability that exists in the data and our
model combined.
● As before, the degree of freedom for SSTotal is 𝑛 − 1.
F ratio (F statistic)
● Or,
F-Statistic and R²
● By using the expressions for SSE, SSR, SST, and R², it can be
shown that:
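That relation between the F-statistic and R² can be written out and sanity-checked numerically:

```python
def f_from_r2(r2, n, k):
    # F = (R^2 / k) / ((1 - R^2) / (n - k - 1)):
    # explained variation per predictor over unexplained variation
    # per residual degree of freedom.
    return (r2 / k) / ((1 - r2) / (n - k - 1))
```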
● Question: What can predict how much a motion picture will make?
We have data on a number of releases that includes the USGross
(in $), the Budget ($), the Run Time (minutes), and the average
number of Stars awarded by reviewers. The first several entries in
the data table look like this:
Testing Regression Models
R² and adjusted R²
Residuals
Multiple Regression
Example