Data Highlights Combined

The document discusses key concepts in statistics and probability, including Bayes' rule, z-scores, confidence intervals, hypothesis testing, and regression analysis. It emphasizes the importance of understanding confidence levels, margins of error, and the implications of Type I and Type II errors in hypothesis testing. Additionally, it covers regression techniques, including LASSO regression, and the differences between linear and logistic regression.


Bayes’ rule

• One of the most important formulas in probability and statistics:

  P(A | B) = P(B | A) ∙ P(A) / P(B),   where P(A) is the base rate

• Applying Bayes’ rule helps us avoid the base rate fallacy

• Expanded form:

  P(A | B) = P(B | A) ∙ P(A) / [ P(B | A) ∙ P(A) + P(B | ¬A) ∙ P(¬A) ]
Comparing distributions: z-scores
• Given the mean 𝜇 and standard deviation 𝜎 of a normal distribution, as well as a threshold value
  x, the z-score is given by

  z = (x − 𝜇) / 𝜎

• z-score = number of standard deviations that x is away from the mean (a measure of
  extremeness)

• Temperature example:

  1951–1980: z = (80 − 74.0) / 2.7 = 2.22
  2008–2022: z = (80 − 75.8) / 3.3 = 1.27

  80 deg is a less extreme (more likely) average temperature for October in 2008–2022
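The z-score calculation above can be reproduced in a few lines of Python, using the numbers from the temperature example:

```python
# z-score: number of standard deviations that x is away from the mean
def z_score(x, mean, sd):
    return (x - mean) / sd

# October temperature example from the slide
z_old = z_score(80, 74.0, 2.7)   # 1951-1980
z_new = z_score(80, 75.8, 3.3)   # 2008-2022
print(round(z_old, 2), round(z_new, 2))   # 2.22 1.27: 80 deg is less extreme in 2008-2022
```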
Confidence intervals for means
• The 95% confidence interval for the population mean 𝜇 is:

  𝑥̄ ± 2 ∙ s/√𝑛     (𝑥̄ = sample mean, s = sample standard deviation, 𝑛 = sample size)

• Interpretation: In approx. 95% of cases, the sample mean will be within 2 ∙ s/√𝑛 of the true population mean

• s/√𝑛 = “standard error” = standard deviation of the sampling distribution of 𝑥̄

[Figure: sampling distribution of 𝑥̄ for 𝑛 = 50, with 95% of its mass within 2 standard errors of 𝜇]
Confidence intervals for means
• The 95% confidence interval for the population mean 𝜇, using our sample of 50 rides:

  21.1 ± 2 ∙ 22.6/√50     (sample mean 21.1, sample standard deviation 22.6, sample size 50)

• Interpretation: With 95% confidence, the true population mean is between 14.7 and 27.5 minutes.

• More rigorous interpretation using conditional probability:

  P( 𝜇 between 14.7 and 27.5 | 𝑥̄ = 21.1, s = 22.6, 𝑛 = 50 ) ≈ 0.95

  … also conditional on (1) the sample being random,
  (2) the distribution being approximately Normal, and
  (3) roughly speaking, no other information about 𝜇.
  For more on this last point, see the last three slides. (“Extra Content: Frequentists vs. Bayesians”)
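The interval above can be checked numerically; a minimal sketch using the slide's numbers (z ≈ 2 for 95%):

```python
import math

# 95% CI for a mean: x_bar +/- 2 * s / sqrt(n), with the slide's sample of 50 rides
def mean_ci_95(x_bar, s, n):
    moe = 2 * s / math.sqrt(n)   # margin of error, using z ~ 2 for 95%
    return (x_bar - moe, x_bar + moe)

lo, hi = mean_ci_95(21.1, 22.6, 50)
print(round(lo, 1), round(hi, 1))   # 14.7 27.5, matching the slide
```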
Confidence levels and margin of error
• 95% is a common confidence level; technically any % can be used
• Adjusting the z-score changes the confidence level and margin of error

Margin of error for means: z ∙ s/√𝑛

Confidence level (%)    z
50                      0.7
68                      1.0
80                      1.3
90                      1.6
95                      2.0
99                      2.6
Confidence intervals for proportions
• The 95% confidence interval for the population proportion p is:

  𝑝̄ ± 2 ∙ √( 𝑝̄(1 − 𝑝̄) / 𝑛 )     (𝑝̄ = sample proportion, 𝑛 = sample size; for proportions, √(𝑝̄(1 − 𝑝̄)) plays the role of 𝒔)

• Interpretation: In 95% of cases, the sample proportion will be within 2 ∙ √(𝑝̄(1 − 𝑝̄)/𝑛) of the true pop. proportion

• √(𝑝̄(1 − 𝑝̄)/𝑛) = “standard error” = standard deviation of the sampling distribution of 𝑝̄

[Figure: sampling distribution of 𝑝̄ for 𝑛 = 50, with 95% of its mass within 2 standard errors of p]
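A sketch of the same calculation for proportions; the sample of 30 successes out of 50 is a made-up illustration, not from the slides:

```python
import math

# 95% CI for a proportion: p_bar +/- 2 * sqrt(p_bar * (1 - p_bar) / n)
def prop_ci_95(p_bar, n):
    se = math.sqrt(p_bar * (1 - p_bar) / n)   # standard error
    return (p_bar - 2 * se, p_bar + 2 * se)

# Hypothetical sample: 30 of 50 riders (p_bar = 0.6) rated the ride 5 stars
lo, hi = prop_ci_95(0.6, 50)
print(round(lo, 2), round(hi, 2))   # 0.46 0.74
```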
Hypothesis tests: formal process

1. Formulate your null hypothesis and your alternative hypothesis.
   • Best practice: decide this before looking at data.

2. Assuming that the null hypothesis is true, calculate the z-score of your sample statistic.
   • This is called the test statistic.
   • It quantifies how “surprising” your sample is – assuming the null hypothesis is true.

3. Calculate the probability of getting a test statistic that “surprising”.
   • In other words, P( data with |z-score| > your test statistic | null hypothesis is true)
   • This is called the p-value. It is uniquely determined by the test statistic.

4. Reject the null hypothesis if the p-value is sufficiently small.
   • Equivalently, reject the null hypothesis if the test statistic is sufficiently large.
   • The threshold for “sufficient” depends on the significance level 𝛼.
z-scores, p-values, and significance levels

• Test statistic: the z-score of a sample statistic, assuming the null hypothesis is true
  For proportions:  ( 𝑝̄ − p ) / √( p(1 − p) / 𝑛 )        For means:  ( 𝑥̄ − 𝜇 ) / ( s / √𝑛 )

  In other words, the number of standard errors from the hypothesized population parameter.

• p-value: the probability of a test statistic being at least that surprising. For example, the
  p-value of a test statistic of 2 is (roughly) 0.05.

• significance level: a threshold that we compare against the p-value. The significance
  level is denoted by 𝛼 and often set to 𝛼 = 0.05. If the p-value is below this threshold, then we
  reject the null hypothesis. If not, we fail to reject the null hypothesis.
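The whole four-step procedure can be sketched in Python using the one-sample proportion formula above; the inputs (null p = 0.5, 𝑝̄ = 0.6, n = 100) are hypothetical:

```python
import math
from statistics import NormalDist

# Steps 2-4 of the formal process, for a proportion.
# Hypothetical numbers: null hypothesis p = 0.5, observed p_bar = 0.6, n = 100.
p0, p_bar, n = 0.5, 0.6, 100

z = (p_bar - p0) / math.sqrt(p0 * (1 - p0) / n)   # step 2: test statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z)))      # step 3: two-sided p-value
alpha = 0.05
decision = "reject null" if p_value < alpha else "fail to reject"  # step 4
print(round(z, 2), round(p_value, 4), decision)   # 2.0 0.0455 reject null
```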
Test statistics comparing two samples

• Test statistic (two samples): the z-score of the difference between the two sample
statistics, assuming that there is no difference between the two populations.

  For proportions:                                              For means:

  ( 𝑝̄₁ − 𝑝̄₂ ) / √( 𝑝̄avg (1 − 𝑝̄avg) (1/𝑛₁ + 1/𝑛₂) )            ( 𝑥̄₁ − 𝑥̄₂ ) / √( 𝑠₁²/𝑛₁ + 𝑠₂²/𝑛₂ )

  where 𝑝̄avg = ( 𝑛₁𝑝̄₁ + 𝑛₂𝑝̄₂ ) / ( 𝑛₁ + 𝑛₂ )
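Both two-sample test statistics can be written directly from the formulas above; the inputs below are made-up illustrations:

```python
import math

# Two-sample z test statistics (formulas from the slide)
def two_prop_z(p1, n1, p2, n2):
    p_avg = (n1 * p1 + n2 * p2) / (n1 + n2)   # pooled proportion
    se = math.sqrt(p_avg * (1 - p_avg) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

def two_mean_z(x1, s1, n1, x2, s2, n2):
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    return (x1 - x2) / se

# Hypothetical samples for illustration
z_prop = two_prop_z(0.55, 200, 0.45, 200)
z_mean = two_mean_z(21.1, 22.6, 50, 15.0, 18.0, 60)
print(round(z_prop, 2), round(z_mean, 2))   # 2.0 1.54
```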
Type I and Type II errors
• There are two possible realities:
A. Any difference is due to random chance
B. The null hypothesis is actually false

• Type I error (false positive): rejecting the null hypothesis, even though the reality is A. The
significance level 𝛼 is the Type I error rate.

• Type II error (false negative): failing to reject the null hypothesis, even though the reality is B.
The true positive rate 1 − 𝛽 is the power of a hypothesis test.

• There is a trade-off between the two types of errors.



1) Goodness-of-fit, R2

• A number between 0 and 1 that measures how well our regression model “fits” the data

• R2 = 0.54 → 54% of variation in sales volume explained by price; 46% all other possible variables

• Adding variables to the model always increases R2 (but does not always improve predictive
performance!)
2) Slope/intercept estimates and predicted values

• The regression output reports the intercept estimate b₀ and the slope estimate b₁ for price.

• Slope and intercept estimates give us the prediction line:

  predicted sales volume = b₀ + b₁ ∙ price

• Example: What do we predict for the sales volume when the weekly average price is $1.60?

  𝑦̂ = b₀ + b₁ ∙ 1.60
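A minimal pure-Python least-squares fit illustrating slope/intercept estimates and predicted values; the (price, volume) data below are invented for illustration, not the avocado data:

```python
# Simple linear regression by least squares (sketch on made-up data),
# fitting predicted volume = b0 + b1 * price.
prices = [0.8, 1.0, 1.2, 1.4, 1.6, 1.8]
volumes = [10.0, 9.1, 8.3, 7.2, 6.4, 5.5]

n = len(prices)
mean_x = sum(prices) / n
mean_y = sum(volumes) / n
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(prices, volumes)) / \
     sum((x - mean_x) ** 2 for x in prices)   # slope estimate
b0 = mean_y - b1 * mean_x                      # intercept estimate

# Prediction at a weekly average price of $1.60
y_hat = b0 + b1 * 1.60
print(round(b0, 2), round(b1, 2), round(y_hat, 2))   # 13.64 -4.53 6.39
```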
3a) Standard error of slope estimate

• The standard error se(b₁) of the price variable measures uncertainty in the slope estimate b₁

• 95% confidence interval for the slope estimate in general:  b₁ ± 2 ∙ se(b₁)

• For avocado prices: plug b₁ and se(b₁) from the regression output into the interval above
3b) t-statistic and p-value of slope estimate

• Null hypothesis: there is no relationship between price and volume
• Alternative hypothesis: there is a relationship

• The regression output reports the t-stat and p-value for this test.

• It is extremely unlikely that price has no relationship with quantity sold.

• We can conclude that there is a statistically significant relationship between price and quantity.
  But we can’t conclude causation!

  Note: the p-value is rounded to 0.00. It is actually 0.0000000000000116
4) Residuals and prediction intervals
• Residuals: Vertical distance between each observation and the prediction line:

  residual = y − 𝑦̂

• Histogram of residuals shows variability around the prediction line

• 95% prediction interval:  𝑦̂ ± 2 ∙ S,  where S = standard deviation of the residuals
What does a “bad” fit look like?

• Diagnostic plots can help identify three possible issues:

1) Outliers
2) Obvious non-linearity
3) “Fanning out” of data (heteroscedasticity)

• Each of these can negatively affect prediction quality; proceed with caution
Demand for Shampoo: Practice Questions
• What if we add an interaction term? (Equivalently, let’s run one regression for each group)

• How many bottles do you predict to sell to each group at a price of $0.50 per fluid ounce?
For Group A: 𝑦ො = 11.33 − 12.7 ∙ 0.50 = 4.98, For Group B: 𝑦ො = 11.33 − 12.7 ∙ 0.50 − 1.42 + 4.91 ∙ 0.50 = 6.02

• Extra Content: what is the revenue maximizing price for each group?
  For Group A: 𝑝̂ = 11.33/25.4 = 0.45,   For Group B: 𝑝̂ = 9.91/15.58 = 0.64
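The revenue-maximizing prices follow from a one-line derivation: for linear demand y = a − b∙p, revenue p∙(a − b∙p) peaks at p = a/(2b). A sketch with the slide's coefficients:

```python
# For linear demand y = a - b*p, revenue R(p) = p*(a - b*p) is maximized
# where dR/dp = a - 2*b*p = 0, i.e. at p = a / (2b).
def revenue_max_price(a, b):
    return a / (2 * b)

# Coefficients from the shampoo regression: Group A demand 11.33 - 12.7*p;
# Group B adds the dummy and interaction: (11.33 - 1.42) - (12.7 - 4.91)*p
p_a = revenue_max_price(11.33, 12.7)
p_b = revenue_max_price(11.33 - 1.42, 12.7 - 4.91)
print(round(p_a, 2), round(p_b, 2))   # 0.45 0.64, matching the slide
```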
Model 2 vs Model 3

• Add dummy EUROPE = 1 if country is in Europe

• What is the interpretation of the coefficient for EUROPE?
  Holding GDP and chocolate consumption fixed, European countries are predicted to have 3.2 more laureates per 10 mil. people

• What changes once we add EUROPE to the regression?
  Chocolate is no longer statistically significant at the 5% level.
  𝑅² went up.
  The GDP coefficient and standard error didn’t change much.
Correlation and collinearity
• Correlation is a number between −1 and 1 that measures the strength of
  association between two variables

• Collinearity refers to high correlation between predictors; it can lead to loss of precision in
  estimates if two collinear variables are included

• Often detected by large changes in standard error, t-stat, and/or p-value after adding a
  collinear variable
Interpreting transformed regression results
• Now everything looks good – but how do we interpret the slope parameters?

• If years = 0, we predict Salary = 𝑒^(10.6 + 0.07∙0) = $40,135

• If years = 1, we predict Salary = 𝑒^(10.6 + 0.07∙1) = $43,045

• Each additional year is associated with a ~7% increase in salary

  Equivalently, Salary = 𝑒^(𝛽₀ + 𝛽₁∙Years), so each extra year multiplies salary by 𝑒^𝛽₁ = 𝑒^0.07 ≈ 1.07

Interpreting transformed regression results
• If the regression coefficient is b > 0, then….

                                       Independent variable (x)
                               Original                     Log-transformed
  Dependent      Original      1 unit increase in x →       1% increase in x →
  variable                     b unit increase in y         b/100 unit increase in y
  (y)            Log-          1 unit increase in x →       1% increase in x →
                 transformed   b ×100% increase in y        b% increase in y

  (the bottom-row interpretations are approximations, valid if b is close to 0)
How does each model perform on a new set of 50 Airbnb listings?

• 23 variables: R² = 0.88, avg. prediction error = $15/night on the training data, but error on new data: $77/night

• 1 variable: R² = 0.38, avg. prediction error = $33/night on the training data, and error on new data: $42/night
How do we avoid overfitting?

• Adding more variables to a regression model increases complexity

• This is both good and bad!
  • Good: Can detect patterns in the data
  • Bad: Can overfit to data

• How can we balance the two?
Which model will make the best predictions?

• Training data vs test data: we can build model on training data, and evaluate accuracy
on test data that is “not seen” by the model

• Common rule of thumb: 75/25 for training / testing split

• If test error is much larger than training error, we are probably overfitting → increase λ
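A minimal sketch of the 75/25 split in Python; the 100 "rows" are placeholders for real listings:

```python
import random

# 75/25 train/test split (rule of thumb from the slide); the 100 rows
# are stand-ins for a real data set of listings.
rows = list(range(100))
random.seed(0)            # reproducible shuffle
random.shuffle(rows)      # randomize order before splitting
cut = int(0.75 * len(rows))
train, test = rows[:cut], rows[cut:]
print(len(train), len(test))   # 75 25
```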
LASSO regression: Airbnb Listings

[Figure: cross-validation curve of mean absolute error vs. log(λ) for 50 Airbnb listings, with the number of selected variables shown along the top axis and the minimum-error λ marked]

• R automatically searches over many different values of λ

• Can automatically find “best” λ via cross-validation

More training data allows more variables to be selected
without as much concern for overfitting

[Figure: the same cross-validation curves side by side for 50 Airbnb listings and 500 Airbnb listings, each with its minimum-error λ marked]
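The slides use R's glmnet for this search; as a language-agnostic illustration of what λ does, here is the soft-thresholding rule that LASSO applies to a single standardized predictor (a sketch of the shrinkage idea, not a full solver):

```python
# With one standardized predictor, LASSO replaces the least-squares slope
# b_ols with a "soft-thresholded" version: shrink toward 0 by lam, and set
# to exactly 0 if |b_ols| <= lam. This is why larger lam selects fewer variables.
def soft_threshold(b_ols, lam):
    if b_ols > lam:
        return b_ols - lam
    if b_ols < -lam:
        return b_ols + lam
    return 0.0   # variable dropped from the model

print(soft_threshold(2.5, 1.0))   # 1.5 (shrunk)
print(soft_threshold(2.5, 5.0))   # 0.0 (selected out)
```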
How does each model perform on new data generated identically?

• 14 variables: R² = 0.89, avg. prediction error = $15.69; error on new data: $26.80

• 1 variable: R² = 0.80, avg. prediction error = $21.19; error on new data: $19.81

[Figure: LASSO fits at λ = 1, λ = 5, and λ = 20]

1) Using a LASSO model with λ = 5, how would your prediction for our original
customer change if he had visited before? What if he was single?
71-year-old married man. $102,000 income, visiting the site from “Other”.
His visit is in the spring, 21 days away, and he has not been to this hotel before.
He is not international, and he is coming from 2056 miles away.
If he had visited before, prediction decreases by $0.32. If not married, prediction increases by $4.02.
Linear vs logistic regression

                          Linear regression              Logistic regression
  Independent variables   can take on any value          can take on any value
  Dependent variable      can take on any value          must be a 0 or 1
  Output                  a number between −∞ and ∞      a value between 0 and 1

• Intuition: Logistic regression uses the logistic (inverse logit) function to convert the
  prediction from a linear model to a number between 0 and 1

• Output of logistic regression is interpreted as the probability that an observation
  belongs to the positive class

[Figure: S-shaped logistic curve, P(y = 1) on the vertical axis vs. x from −4 to 4]
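That intuition can be made concrete: the logistic (inverse logit) function maps any linear prediction to (0, 1). The coefficients below are hypothetical:

```python
import math

# The logistic (inverse logit) function maps any real number into (0, 1)
def logistic(t):
    return 1 / (1 + math.exp(-t))

# Hypothetical linear model t = b0 + b1*x with b0 = -2.0, b1 = 1.0
for x in [-2, 0, 2, 4]:
    print(x, round(logistic(-2.0 + 1.0 * x), 3))   # probability that y = 1
```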
Cutoff and predictions
• If cutoff is large, predict CHD (Y = 1) less often

• If cutoff is small, predict CHD (Y = 1) more often

• How we set the cutoff depends on the costs of errors
  • For heart disease, what is the cost of a false positive? A false negative?

• Balance of classes in data also can influence choice of cutoff.

[Figure: logistic curve of P(CHD) vs. SysBP, with the cutoff drawn as a horizontal line]
Cutoff and performance

             cutoff = 0.2                  cutoff = 0.4
             pred. 0    pred. 1            pred. 0    pred. 1
  actual 0   607        156                746        17
  actual 1   73         80                 133        20

For cutoff = 0.2, what is…
• True positive rate?    TPR = 80/153      • False positive rate?   FPR = 156/763
• True negative rate?    TNR = 607/763     • False negative rate?   FNR = 73/153
• Accuracy?              ACC = 687/(763 + 153)

How do performance metrics change under cutoff = 0.4?
The TPR and FPR go way down, the TNR and FNR go up, and accuracy goes up.

Definitions: TP = # of true positives, FP = # of false positives, P = # of actual positives;
TN = # of true negatives, FN = # of false negatives, N = # of actual negatives
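The rates above can be computed directly from the confusion-matrix counts (cutoff = 0.2):

```python
# Performance metrics from the cutoff = 0.2 confusion matrix on the slide
tn, fp, fn, tp = 607, 156, 73, 80    # TN, FP (actual 0); FN, TP (actual 1)

tpr = tp / (tp + fn)                  # true positive rate  = 80/153
fpr = fp / (fp + tn)                  # false positive rate = 156/763
tnr = tn / (tn + fp)                  # true negative rate  = 607/763
fnr = fn / (fn + tp)                  # false negative rate = 73/153
acc = (tp + tn) / (tp + tn + fp + fn) # accuracy = 687/916
print(round(tpr, 3), round(fpr, 3), round(acc, 3))   # 0.523 0.204 0.75
```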
ROC plot

Framingham: AUC = 0.75

• ROC plot shows trade-off between true and false positives

• Area under the curve (AUC) of the ROC plot summarizes model performance
  across all possible cutoff values into a single value

• Highest possible AUC: 1

• Lowest “possible” AUC: 0.5 (flipping a coin)
  • If your model does worse than this, just do the opposite of what it suggests!

[Figure: ROC curve of true positive rate vs. false positive rate, annotated with cutoff values 0.75, 0.46, 0.31, 0.16, and 0.02 along the curve]
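AUC is simply the area under the (FPR, TPR) curve; a sketch using the trapezoid rule over a few hypothetical ROC points:

```python
# AUC via the trapezoid rule over (FPR, TPR) points, swept from high to low
# cutoff. The points below are hypothetical; a real ROC curve has one point
# per cutoff value.
roc_points = [(0.0, 0.0), (0.1, 0.5), (0.3, 0.8), (0.6, 0.95), (1.0, 1.0)]

auc = 0.0
for (x0, y0), (x1, y1) in zip(roc_points, roc_points[1:]):
    auc += (x1 - x0) * (y0 + y1) / 2   # area of one trapezoid slice
print(auc)   # between 0.5 (coin flip) and 1 (perfect)
```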


Reading leaves

[Figure: one leaf of the tree, annotated with the class prediction (majority class), the fraction of data points in each class at this leaf, and the total fraction of all observations contained in this leaf]
What this leaf tells us:


• For data points with petal length between
2.4 and 4.8, versicolor is the majority class

• More specifically, for data points with petal


length between 2.4 and 4.8, 0% are setosa,
96% are versicolor, and 4% are virginica

• 29% of all data points have petal length


between 2.4 and 4.8
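The leaf's logic can be mirrored as plain code; the setosa and virginica branches below are illustrative assumptions beyond what this one leaf states:

```python
# Hand-built rule mirroring the leaf described above: petal length between
# 2.4 and 4.8 -> predict versicolor (the majority class at that leaf). The
# setosa and virginica branches are illustrative assumptions, not read off
# this particular leaf.
def predict_species(petal_length):
    if petal_length < 2.4:
        return "setosa"
    if petal_length <= 4.8:
        return "versicolor"   # 96% of the leaf's points are versicolor
    return "virginica"

print(predict_species(4.0))   # versicolor
```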
What does the model tell us? Trees are interpretable and explainable

• 26% of all customers churned

• Among month-to-month customers (55% of total), 43% churned

• Among month-to-month customers with fiber optic internet (30% of total), 54% churned
Training accuracy always increases in tree depth (model complexity)

Test accuracy generally increases then decreases due to overfitting


Logistic regression vs classification trees
• Classification trees are often easier to interpret and visualize than
logistic regression

• Due to flexibility, classification trees can better fit training data, but are
more prone to overfitting than logistic regression

• Difficult to know in advance which method works better for your data
and prediction task – need to try both!

You might also like