Data Highlights Combined
Comparing distributions: z-scores
• Given the mean 𝜇 and standard deviation 𝜎 of a normal distribution, as well as a threshold value x, the z-score is given by z = (x − 𝜇) / 𝜎
• z-score = number of standard deviations that x is away from the mean (a measure of extremeness)
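A minimal Python sketch of this calculation (the numbers are made up for illustration):

```python
# Z-score: the number of standard deviations that x lies from the mean.
def z_score(x, mean, sd):
    return (x - mean) / sd

# Hypothetical example: x = 85 against a distribution with mean 70, sd 10.
z = z_score(85, 70, 10)  # 1.5 standard deviations above the mean
```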
• Interpretation: In approx. 95% of cases, the sample mean x̄ will be within 2 standard errors of the true population mean (here, n = 50)
• Standard error: the standard deviation of the sampling distribution of x̄, equal to s/√n
Confidence intervals for means
• The 95% confidence interval for the population mean 𝜇, using our sample of 50 rides: x̄ ± 2 ∙ s/√n = (14.7, 27.5)
• Interpretation: With 95% confidence, the true population mean is between 14.7 and 27.5 minutes.
• Multipliers for other confidence levels:
  50% → 0.7
  68% → 1.0
  80% → 1.3
  90% → 1.6
  95% → 2.0
  99% → 2.6
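A small Python sketch of the interval calculation. The slide gives only the resulting interval, so x̄ = 21.1 and s = 22.6 below are assumed values, back-solved here so the output matches (14.7, 27.5):

```python
import math

# Confidence interval for a mean: xbar ± multiplier * s / sqrt(n).
def mean_ci(xbar, s, n, multiplier=2.0):
    se = s / math.sqrt(n)  # standard error of the sample mean
    return xbar - multiplier * se, xbar + multiplier * se

# Assumed sample stats for the 50 rides (not given on the slide):
lo, hi = mean_ci(21.1, 22.6, 50)  # approximately (14.7, 27.5)
```

Swapping the multiplier (e.g. 1.6 for 90%) gives the other confidence levels in the multiplier table.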
Confidence intervals for proportions
• The 95% confidence interval for the population proportion p is: sample proportion p̄ ± 2 ∙ √(p̄(1 − p̄)/n), where n is the sample size
• Interpretation: In 95% of cases, the sample proportion will be within 2 standard errors of the true population proportion (here, n = 50)
• Standard error: the standard deviation of the sampling distribution of p̄, equal to √(p̄(1 − p̄)/n)
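The same idea for a proportion, as a Python sketch (the sample proportion 0.6 is hypothetical):

```python
import math

# Confidence interval for a proportion:
# pbar ± multiplier * sqrt(pbar * (1 - pbar) / n).
def prop_ci(pbar, n, multiplier=2.0):
    se = math.sqrt(pbar * (1 - pbar) / n)  # standard error of pbar
    return pbar - multiplier * se, pbar + multiplier * se

# Hypothetical: 30 of 50 sampled rides were short trips.
lo, hi = prop_ci(0.6, 50)  # approximately (0.46, 0.74)
```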
Hypothesis tests: formal process
2. Assuming that the null hypothesis is true, calculate the z-score of your sample statistic.
• This is called the test statistic.
• It quantifies how “surprising” your sample is – assuming the null hypothesis is true.
• Test statistic: the z-score of a sample statistic, assuming the null hypothesis is true
For proportions: (p̄ − p) / √(p(1 − p)/n)    For means: (x̄ − 𝜇) / (s/√n)
In other words, the number of standard errors from the hypothesized population parameter.
• p-value: the probability of a test statistic being at least that surprising. For example, the p-value of a test statistic of 2 is (roughly) 0.05.
• significance level: a threshold that we compare against the p-value. The significance level is denoted by 𝛼 and often set to 𝛼 = 0.05. If the p-value is below this threshold, then we reject the null hypothesis. If not, we fail to reject the null hypothesis (we never “accept” it).
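Both test statistics, sketched in Python with hypothetical numbers:

```python
import math

# Test statistic for a proportion: (pbar - p) / sqrt(p * (1 - p) / n),
# where p is the population proportion under the null hypothesis.
def z_test_proportion(pbar, p, n):
    return (pbar - p) / math.sqrt(p * (1 - p) / n)

# Test statistic for a mean: (xbar - mu) / (s / sqrt(n)),
# where mu is the population mean under the null hypothesis.
def z_test_mean(xbar, mu, s, n):
    return (xbar - mu) / (s / math.sqrt(n))

# Hypothetical: sample proportion 0.58 with n = 100, null p = 0.5.
z = z_test_proportion(0.58, 0.5, 100)  # 1.6 standard errors above the null
# Since z < 2, the p-value exceeds ~0.05: we fail to reject the null.
```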
Test statistics comparing two samples
• Test statistic (two samples): the z-score of the difference between the two sample
statistics, assuming that there is no difference between the two populations.
For means: (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)
For proportions: (p̄₁ − p̄₂) / √(p̄avg(1 − p̄avg)(1/n₁ + 1/n₂)), where p̄avg = (n₁p̄₁ + n₂p̄₂) / (n₁ + n₂)
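A Python sketch of the two-sample test statistic for proportions (the sample values are hypothetical):

```python
import math

# Z-score of p1 - p2, assuming no difference between the two populations.
def two_sample_prop_z(p1, n1, p2, n2):
    p_avg = (n1 * p1 + n2 * p2) / (n1 + n2)  # pooled proportion
    se = math.sqrt(p_avg * (1 - p_avg) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical: 60/100 successes in sample 1, 45/100 in sample 2.
z = two_sample_prop_z(0.60, 100, 0.45, 100)  # about 2.12
```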
Type I and Type II errors
• There are two possible realities:
A. Any difference is due to random chance (the null hypothesis is true)
B. The null hypothesis is actually false
• Type I error (false positive): rejecting the null hypothesis, even though the reality is A. The
significance level 𝛼 is the Type I error rate.
• Type II error (false negative): failing to reject the null hypothesis, even though the reality is B.
The true positive rate 1 − 𝛽 is the power of a hypothesis test.
• R2: a number between 0 and 1 that measures how well our regression model “fits” the data
• R2 = 0.54 → 54% of the variation in sales volume is explained by price; the other 46% by all other possible variables
• Adding variables to the model always increases R2 (but does not always improve predictive
performance!)
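A Python sketch of the R2 formula (1 − SS_res / SS_tot), with toy data:

```python
# R^2 = 1 - (residual sum of squares) / (total sum of squares).
def r_squared(y, y_pred):
    y_mean = sum(y) / len(y)
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    ss_res = sum((yi - yp) ** 2 for yi, yp in zip(y, y_pred))
    return 1 - ss_res / ss_tot

# Perfect predictions give R^2 = 1; always predicting the mean gives R^2 = 0.
r_squared([1, 2, 3], [1, 2, 3])  # 1.0
r_squared([1, 2, 3], [2, 2, 2])  # 0.0
```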
2) Slope/intercept estimates and predicted values
• Regression output reports the intercept estimate b0 and the slope estimate b1
• Example: What do we predict for the sales volume when the weekly average price is $1.60? ŷ = b0 + b1 ∙ 1.60
3a) Standard error of slope estimate
• Regression output reports the slope estimate b1, its standard error se(b1), the t-stat (b1 divided by se(b1)), and the p-value
• We can conclude that there is a statistically significant relationship between price and quantity.
But we can’t conclude causation!
Note: the p-value is rounded to 0.00. It is actually 0.0000000000000116
4) Residuals and prediction intervals
• Residuals: Vertical distance between
each observation and the prediction line
• S = standard deviation of the residuals
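A Python sketch of residuals and S with toy data; the degrees-of-freedom convention (dividing by n − 2 in simple regression) is an assumption, since the slide does not specify it:

```python
import math

# S: standard deviation of the residuals, here using n - 2 degrees of
# freedom (one slope and one intercept estimated) -- an assumed convention.
def residual_sd(y, y_pred):
    residuals = [yi - yp for yi, yp in zip(y, y_pred)]
    return math.sqrt(sum(r ** 2 for r in residuals) / (len(y) - 2))

# Toy data: residuals are (0, 0, -1, 1), so S = sqrt(2 / 2) = 1.0.
s = residual_sd([1, 2, 3, 5], [1, 2, 4, 4])
```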
What does a “bad” fit look like?
1) Outliers
2) Obvious non-linearity
3) “Fanning out” of data (heteroscedasticity)
• How many bottles do you predict to sell to each group at a price of $0.50 per fluid ounce?
For Group A: ŷ = 11.33 − 12.7 ∙ 0.50 = 4.98
For Group B: ŷ = 11.33 − 12.7 ∙ 0.50 − 1.42 + 4.91 ∙ 0.50 = 6.02
• Extra Content: what is the revenue-maximizing price for each group? Revenue p ∙ (a − b ∙ p) is maximized at p = a / (2b).
For Group A: p̂ = 11.33 / 25.4 = 0.45
For Group B: p̂ = 9.91 / 15.58 = 0.64
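The revenue calculation, sketched in Python using the coefficients from this model:

```python
# Demand y = a - b*p implies revenue R(p) = p * (a - b*p),
# which is maximized at p = a / (2*b).
def revenue_max_price(a, b):
    return a / (2 * b)

# Group B's intercept and slope are shifted by the dummy-variable terms.
price_a = revenue_max_price(11.33, 12.7)                # ~0.45
price_b = revenue_max_price(11.33 - 1.42, 12.7 - 4.91)  # ~0.64
```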
• If years = 0, we predict Salary = e^(10.6 + 0.07 ∙ 0) = $40,135
• If years = 1, we predict Salary = e^(10.6 + 0.07 ∙ 1) = $43,045
Original model: R2 = 0.38, avg. prediction error = $33/night
Log-transformed model: R2 = 0.88, avg. prediction error = $15/night
• Training data vs test data: we can build the model on training data, and evaluate accuracy on test data that is “not seen” by the model
• If test error is much larger than training error, we are probably overfitting → increase λ
LASSO regression: Airbnb Listings
[Plot: cross-validation mean absolute error against log(λ). As λ increases, the number of selected variables shrinks from 30 down to 2; R finds the minimum-error λ via cross-validation.]
More training data allows more variables to be selected
without as much concern for overfitting
[Plots: cross-validation mean absolute error against log(λ) for a larger and a smaller training set, each with its minimum-error λ marked; the larger set retains more variables at the optimal λ.]
14 variables: R2 = 0.89, avg. prediction error = $15.69
1 variable: R2 = 0.80, avg. prediction error = $21.19
1) Using a LASSO model with λ = 5, how would your prediction for our original
customer change if he had visited before? What if he was single?
71-year-old married man. $102,000 income, visiting the site from “Other”.
His visit is in the spring, 21 days away, and he has not been to this hotel before.
He is not international, and he is coming from 2056 miles away.
If he had visited before, the prediction decreases by $0.32. If he were single rather than married, the prediction increases by $4.02.
Linear vs logistic regression
Linear regression:
• Independent variables can take on any value
• Dependent variable can take on any value
• Output is a number between −∞ and ∞

Logistic regression:
• Independent variables can take on any value
• Dependent variable must be a 0 or 1
• Output is a value between 0 and 1
• Intuition: Logistic regression uses the logistic function (the inverse of the logit) to convert the prediction from a linear model into a number between 0 and 1
• The output of logistic regression is interpreted as P(y = 1), the probability that the observation belongs to the positive class
[Plot: S-shaped logistic curve, with P(y = 1) from 0 to 1 on the vertical axis and the linear score from −4 to 4 on the horizontal axis.]
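The conversion itself is just the logistic function; a Python sketch (the coefficients in `predict_prob` are hypothetical placeholders):

```python
import math

# Logistic (sigmoid) function: maps any real number into (0, 1).
def sigmoid(t):
    return 1 / (1 + math.exp(-t))

# P(y = 1) for a one-variable logistic regression with hypothetical
# coefficients b0 (intercept) and b1 (slope).
def predict_prob(x, b0, b1):
    return sigmoid(b0 + b1 * x)

sigmoid(0)  # 0.5: a linear score of 0 maps to a 50% probability
```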
Cutoff and predictions
• If the cutoff is large, we predict CHD (Y = 1) less often
[Plot: predicted probability of CHD against SysBP, with a horizontal line at the cutoff.]
Cutoff and performance
• Definitions: TP = # of true positives, FP = # of false positives, TN = # of true negatives, FN = # of false negatives, P = # of actual positives, N = # of actual negatives
• For cutoff = 0.2, what are the true positive rate, false positive rate, true negative rate, false negative rate, and accuracy?
TPR = 80/153, FPR = 156/763, TNR = 607/763, FNR = 73/153, ACC = 687/(763 + 153)
• How do the performance metrics change under cutoff = 0.4? The TPR and FPR go way down, the TNR and FNR go up, and accuracy goes up.
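These metrics, computed in Python from the cutoff = 0.2 counts on the slide:

```python
# Classification rates from confusion-matrix counts.
def rates(tp, fp, tn, fn):
    p, n = tp + fn, tn + fp  # actual positives, actual negatives
    return {
        "TPR": tp / p, "FNR": fn / p,
        "TNR": tn / n, "FPR": fp / n,
        "ACC": (tp + tn) / (p + n),
    }

# Counts at cutoff = 0.2: TP = 80, FP = 156, TN = 607, FN = 73.
m = rates(80, 156, 607, 73)  # TPR = 80/153, ACC = 687/916, etc.
```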
ROC plot
• The ROC plot shows the trade-off between true and false positives
• The area under the curve (AUC) of the ROC plot summarizes model performance across all possible cutoff values into a single value
• Highest possible AUC: 1
• Lowest “possible” AUC: 0.5 (flipping a coin)
• If your model does worse than this, just do the opposite of what it suggests!
[ROC plot for the Framingham model: true positive rate vs. false positive rate, AUC = 0.75, with cutoffs 0.75, 0.46, 0.31, 0.16, and 0.02 marked along the curve.]
• Due to flexibility, classification trees can better fit training data, but are
more prone to overfitting than logistic regression
• Difficult to know in advance which method works better for your data
and prediction task – need to try both!