Data Highlights Combined

The document discusses key concepts in statistics and probability, including Bayes' rule, z-scores, confidence intervals, hypothesis testing, and regression analysis. It emphasizes the importance of understanding confidence levels, margins of error, and the implications of Type I and Type II errors in hypothesis testing. Additionally, it covers regression techniques, including LASSO regression, and the differences between linear and logistic regression.


Bayes’ rule

• One of the most important formulas in probability and statistics:

  P(A | B) = P(B | A) ∙ P(A) / P(B),   where P(A) is the base rate

• Applying Bayes’ rule helps us avoid the base rate fallacy

• Expanded form:

  P(A | B) = P(B | A) ∙ P(A) / [ P(B | A) ∙ P(A) + P(B | ¬A) ∙ P(¬A) ]
Comparing distributions: z-scores
• Given the mean 𝜇 and standard deviation 𝜎 of a normal distribution, as well as a threshold value
  x, the z-score is given by

  z = (x − 𝜇) / 𝜎

• z-score = number of standard deviations that x is away from the mean (a measure of
  extremeness)

• Temperature example:

  1951–1980: z = (80 − 74.0) / 2.7 = 2.22
  2008–2022: z = (80 − 75.8) / 3.3 = 1.27

  80 deg is a less extreme (more likely) average temperature for October in 2008–2022
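The z-score calculation above can be reproduced in a few lines of Python, using the numbers from the temperature example:

```python
# z-score: number of standard deviations that x is away from the mean
def z_score(x, mean, sd):
    return (x - mean) / sd

# October temperature example from the slide
z_old = z_score(80, 74.0, 2.7)   # 1951-1980
z_new = z_score(80, 75.8, 3.3)   # 2008-2022
print(round(z_old, 2), round(z_new, 2))   # 2.22 1.27: 80 deg is less extreme in 2008-2022
```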
Confidence intervals for means
• The 95% confidence interval for the population mean 𝜇 is:

  𝑥̄ ± 2 ∙ s/√𝑛     (𝑥̄ = sample mean, s = sample standard deviation, 𝑛 = sample size)

• Interpretation: In approx. 95% of cases, the sample mean will be within 2 ∙ s/√𝑛 of the true population mean

• s/√𝑛 = “standard error” = standard deviation of the sampling distribution of 𝑥̄

[Figure: sampling distribution of 𝑥̄ for 𝑛 = 50, with 95% of its mass within 2 standard errors of 𝜇]
Confidence intervals for means
• The 95% confidence interval for the population mean 𝜇, using our sample of 50 rides:

  21.1 ± 2 ∙ 22.6/√50     (sample mean 21.1, sample standard deviation 22.6, sample size 50)

• Interpretation: With 95% confidence, the true population mean is between 14.7 and 27.5 minutes.

• More rigorous interpretation using conditional probability:

  P( 𝜇 between 14.7 and 27.5 | 𝑥̄ = 21.1, s = 22.6, 𝑛 = 50 ) ≈ 0.95

  … also conditional on (1) the sample being random,
  (2) the distribution being approximately Normal, and
  (3) roughly speaking, no other information about 𝜇.
  For more on this last point, see the last three slides. (“Extra Content: Frequentists vs. Bayesians”)
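The interval above can be checked numerically; a minimal sketch using the slide's numbers (z ≈ 2 for 95%):

```python
import math

# 95% CI for a mean: x_bar +/- 2 * s / sqrt(n), with the slide's sample of 50 rides
def mean_ci_95(x_bar, s, n):
    moe = 2 * s / math.sqrt(n)   # margin of error, using z ~ 2 for 95%
    return (x_bar - moe, x_bar + moe)

lo, hi = mean_ci_95(21.1, 22.6, 50)
print(round(lo, 1), round(hi, 1))   # 14.7 27.5, matching the slide
```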
Confidence levels and margin of error
• 95% is a common confidence level; technically any % can be used
• Adjusting the z-score changes the confidence level and margin of error

Margin of error for means: z ∙ s/√𝑛

Confidence level (%)    z
50                      0.7
68                      1.0
80                      1.3
90                      1.6
95                      2.0
99                      2.6
Confidence intervals for proportions
• The 95% confidence interval for the population proportion p is:

  𝑝̄ ± 2 ∙ √( 𝑝̄(1 − 𝑝̄) / 𝑛 )     (𝑝̄ = sample proportion, 𝑛 = sample size; for proportions, √(𝑝̄(1 − 𝑝̄)) plays the role of 𝒔)

• Interpretation: In 95% of cases, the sample proportion will be within 2 ∙ √(𝑝̄(1 − 𝑝̄)/𝑛) of the true pop. proportion

• √(𝑝̄(1 − 𝑝̄)/𝑛) = “standard error” = standard deviation of the sampling distribution of 𝑝̄

[Figure: sampling distribution of 𝑝̄ for 𝑛 = 50, with 95% of its mass within 2 standard errors of p]
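A sketch of the same calculation for proportions; the sample of 30 successes out of 50 is a made-up illustration, not from the slides:

```python
import math

# 95% CI for a proportion: p_bar +/- 2 * sqrt(p_bar * (1 - p_bar) / n)
def prop_ci_95(p_bar, n):
    se = math.sqrt(p_bar * (1 - p_bar) / n)   # standard error
    return (p_bar - 2 * se, p_bar + 2 * se)

# Hypothetical sample: 30 of 50 riders (p_bar = 0.6) rated the ride 5 stars
lo, hi = prop_ci_95(0.6, 50)
print(round(lo, 2), round(hi, 2))   # 0.46 0.74
```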
Hypothesis tests: formal process

1. Formulate your null hypothesis and your alternative hypothesis.
   • Best practice: decide this before looking at data.

2. Assuming that the null hypothesis is true, calculate the z-score of your sample statistic.
   • This is called the test statistic.
   • It quantifies how “surprising” your sample is – assuming the null hypothesis is true.

3. Calculate the probability of getting a test statistic that “surprising”.
   • In other words, P( data with |z-score| > your test statistic | null hypothesis is true)
   • This is called the p-value. It is uniquely determined by the test statistic.

4. Reject the null hypothesis if the p-value is sufficiently small.
   • Equivalently, reject the null hypothesis if the test statistic is sufficiently large.
   • The threshold for “sufficient” depends on the significance level 𝛼.
z-scores, p-values, and significance levels

• Test statistic: the z-score of a sample statistic, assuming the null hypothesis is true
  For proportions:  ( 𝑝̄ − p ) / √( p(1 − p) / 𝑛 )        For means:  ( 𝑥̄ − 𝜇 ) / ( s / √𝑛 )

  In other words, the number of standard errors from the hypothesized population parameter.

• p-value: the probability of a test statistic being at least that surprising. For example, the
  p-value of a test statistic of 2 is (roughly) 0.05.

• significance level: a threshold that we compare against the p-value. The significance
  level is denoted by 𝛼 and often set to 𝛼 = 0.05. If the p-value is below this threshold, then we
  reject the null hypothesis. If not, we fail to reject the null hypothesis.
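The whole four-step procedure can be sketched in Python using the one-sample proportion formula above; the inputs (null p = 0.5, 𝑝̄ = 0.6, n = 100) are hypothetical:

```python
import math
from statistics import NormalDist

# Steps 2-4 of the formal process, for a proportion.
# Hypothetical numbers: null hypothesis p = 0.5, observed p_bar = 0.6, n = 100.
p0, p_bar, n = 0.5, 0.6, 100

z = (p_bar - p0) / math.sqrt(p0 * (1 - p0) / n)   # step 2: test statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z)))      # step 3: two-sided p-value
alpha = 0.05
decision = "reject null" if p_value < alpha else "fail to reject"  # step 4
print(round(z, 2), round(p_value, 4), decision)   # 2.0 0.0455 reject null
```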
Test statistics comparing two samples

• Test statistic (two samples): the z-score of the difference between the two sample
statistics, assuming that there is no difference between the two populations.

  For proportions:                                              For means:

  ( 𝑝̄₁ − 𝑝̄₂ ) / √( 𝑝̄avg (1 − 𝑝̄avg) (1/𝑛₁ + 1/𝑛₂) )            ( 𝑥̄₁ − 𝑥̄₂ ) / √( 𝑠₁²/𝑛₁ + 𝑠₂²/𝑛₂ )

  where 𝑝̄avg = ( 𝑛₁𝑝̄₁ + 𝑛₂𝑝̄₂ ) / ( 𝑛₁ + 𝑛₂ )
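Both two-sample test statistics can be written directly from the formulas above; the inputs below are made-up illustrations:

```python
import math

# Two-sample z test statistics (formulas from the slide)
def two_prop_z(p1, n1, p2, n2):
    p_avg = (n1 * p1 + n2 * p2) / (n1 + n2)   # pooled proportion
    se = math.sqrt(p_avg * (1 - p_avg) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

def two_mean_z(x1, s1, n1, x2, s2, n2):
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    return (x1 - x2) / se

# Hypothetical samples for illustration
z_prop = two_prop_z(0.55, 200, 0.45, 200)
z_mean = two_mean_z(21.1, 22.6, 50, 15.0, 18.0, 60)
print(round(z_prop, 2), round(z_mean, 2))   # 2.0 1.54
```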
Type I and Type II errors
• There are two possible realities:
A. Any difference is due to random chance
B. The null hypothesis is actually false

• Type I error (false positive): rejecting the null hypothesis, even though the reality is A. The
significance level 𝛼 is the Type I error rate.

• Type II error (false negative): failing to reject the null hypothesis, even though the reality is B.
The true positive rate 1 − 𝛽 is the power of a hypothesis test.

• There is a trade-off between the two types of errors.



1) Goodness-of-fit, R2

• A number between 0 and 1 that measures how well our regression model “fits” the data

• R2 = 0.54 → 54% of variation in sales volume explained by price; 46% all other possible variables

• Adding variables to the model always increases R2 (but does not always improve predictive
performance!)
2) Slope/intercept estimates and predicted values

• The regression output reports the intercept estimate b₀ and the slope estimate b₁ for price.

• Slope and intercept estimates give us the prediction line:

  predicted sales volume = b₀ + b₁ ∙ price

• Example: What do we predict for the sales volume when the weekly average price is $1.60?

  𝑦̂ = b₀ + b₁ ∙ 1.60
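A minimal pure-Python least-squares fit illustrating slope/intercept estimates and predicted values; the (price, volume) data below are invented for illustration, not the avocado data:

```python
# Simple linear regression by least squares (sketch on made-up data),
# fitting predicted volume = b0 + b1 * price.
prices = [0.8, 1.0, 1.2, 1.4, 1.6, 1.8]
volumes = [10.0, 9.1, 8.3, 7.2, 6.4, 5.5]

n = len(prices)
mean_x = sum(prices) / n
mean_y = sum(volumes) / n
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(prices, volumes)) / \
     sum((x - mean_x) ** 2 for x in prices)   # slope estimate
b0 = mean_y - b1 * mean_x                      # intercept estimate

# Prediction at a weekly average price of $1.60
y_hat = b0 + b1 * 1.60
print(round(b0, 2), round(b1, 2), round(y_hat, 2))   # 13.64 -4.53 6.39
```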
3a) Standard error of slope estimate

• The standard error se(b₁) of the price variable measures uncertainty in the slope estimate b₁

• 95% confidence interval for the slope estimate in general:  b₁ ± 2 ∙ se(b₁)

• For avocado prices: plug b₁ and se(b₁) from the regression output into the interval above
3b) t-statistic and p-value of slope estimate

• Null hypothesis: there is no relationship between price and volume
• Alternative hypothesis: there is a relationship

• The regression output reports the t-stat and p-value for this test.

• It is extremely unlikely that price has no relationship with quantity sold.

• We can conclude that there is a statistically significant relationship between price and quantity.
  But we can’t conclude causation!

  Note: the p-value is rounded to 0.00. It is actually 0.0000000000000116
4) Residuals and prediction intervals
• Residuals: Vertical distance between each observation and the prediction line:

  residual = y − 𝑦̂

• Histogram of residuals shows variability around the prediction line

• 95% prediction interval:  𝑦̂ ± 2 ∙ S,  where S = standard deviation of the residuals
What does a “bad” fit look like?

• Diagnostic plots can help identify three possible issues:

1) Outliers
2) Obvious non-linearity
3) “Fanning out” of data (heteroscedasticity)

• Each of these can negatively affect prediction quality; proceed with caution
Demand for Shampoo: Practice Questions
• What if we add an interaction term? (Equivalently, let’s run one regression for each group)

• How many bottles do you predict to sell to each group at a price of $0.50 per fluid ounce?
For Group A: 𝑦ො = 11.33 − 12.7 ∙ 0.50 = 4.98, For Group B: 𝑦ො = 11.33 − 12.7 ∙ 0.50 − 1.42 + 4.91 ∙ 0.50 = 6.02

• Extra Content: what is the revenue maximizing price for each group?
  For Group A: 𝑝̂ = 11.33/25.4 = 0.45,   For Group B: 𝑝̂ = 9.91/15.58 = 0.64
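The revenue-maximizing prices follow from a one-line derivation: for linear demand y = a − b∙p, revenue p∙(a − b∙p) peaks at p = a/(2b). A sketch with the slide's coefficients:

```python
# For linear demand y = a - b*p, revenue R(p) = p*(a - b*p) is maximized
# where dR/dp = a - 2*b*p = 0, i.e. at p = a / (2b).
def revenue_max_price(a, b):
    return a / (2 * b)

# Coefficients from the shampoo regression: Group A demand 11.33 - 12.7*p;
# Group B adds the dummy and interaction: (11.33 - 1.42) - (12.7 - 4.91)*p
p_a = revenue_max_price(11.33, 12.7)
p_b = revenue_max_price(11.33 - 1.42, 12.7 - 4.91)
print(round(p_a, 2), round(p_b, 2))   # 0.45 0.64, matching the slide
```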
Model 2 vs Model 3

• Add dummy EUROPE = 1 if country is in Europe

• What is the interpretation of the coefficient for EUROPE?
  Holding GDP and chocolate consumption fixed, European countries are predicted to have 3.2 more laureates per 10 mil. people

• What changes once we add EUROPE to the regression?
  Chocolate is no longer statistically significant at the 5% level.
  𝑅² went up.
  The GDP coefficient and standard error didn’t change much.
Correlation and collinearity
• Correlation is a number between −1 and 1 that measures the strength of
  association between two variables

• Collinearity refers to high correlation between predictors; it can lead to loss of precision in
  estimates if two collinear variables are included

• Often detected by large changes in standard error, t-stat, and/or p-value after adding a
  collinear variable
Interpreting transformed regression results
• Now everything looks good – but how do we interpret the slope parameters?

• If years = 0, we predict Salary = 𝑒^(10.6 + 0.07∙0) = $40,135

• If years = 1, we predict Salary = 𝑒^(10.6 + 0.07∙1) = $43,045

• Each additional year is associated with a ~7% increase in salary

  Equivalently, Salary = 𝑒^(𝛽₀ + 𝛽₁∙Years), so each extra year multiplies salary by 𝑒^𝛽₁ = 𝑒^0.07 ≈ 1.07

Interpreting transformed regression results
• If the regression coefficient is b > 0, then….

                                       Independent variable (x)
                               Original                     Log-transformed
  Dependent      Original      1 unit increase in x →       1% increase in x →
  variable                     b unit increase in y         b/100 unit increase in y
  (y)            Log-          1 unit increase in x →       1% increase in x →
                 transformed   b ×100% increase in y        b% increase in y

  (the bottom-row interpretations are approximations, valid if b is close to 0)
How does each model perform on a new set of 50 Airbnb listings?

• 23 variables: R² = 0.88, avg. prediction error = $15/night on the training data, but error on new data: $77/night

• 1 variable: R² = 0.38, avg. prediction error = $33/night on the training data, and error on new data: $42/night
How do we avoid overfitting?

• Adding more variables to a regression model increases complexity

• This is both good and bad!
  • Good: Can detect patterns in the data
  • Bad: Can overfit to data

• How can we balance the two?
Which model will make the best predictions?

• Training data vs test data: we can build model on training data, and evaluate accuracy
on test data that is “not seen” by the model

• Common rule of thumb: 75/25 for training / testing split

• If test error is much larger than training error, we are probably overfitting → increase λ
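A minimal sketch of the 75/25 split in Python; the 100 "rows" are placeholders for real listings:

```python
import random

# 75/25 train/test split (rule of thumb from the slide); the 100 rows
# are stand-ins for a real data set of listings.
rows = list(range(100))
random.seed(0)            # reproducible shuffle
random.shuffle(rows)      # randomize order before splitting
cut = int(0.75 * len(rows))
train, test = rows[:cut], rows[cut:]
print(len(train), len(test))   # 75 25
```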
LASSO regression: Airbnb Listings

[Figure: cross-validation curve of mean absolute error vs. log(λ) for 50 Airbnb listings, with the number of selected variables shown along the top axis and the minimum-error λ marked]

• R automatically searches over many different values of λ

• Can automatically find “best” λ via cross-validation

More training data allows more variables to be selected
without as much concern for overfitting

[Figure: the same cross-validation curves side by side for 50 Airbnb listings and 500 Airbnb listings, each with its minimum-error λ marked]
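The slides use R's glmnet for this search; as a language-agnostic illustration of what λ does, here is the soft-thresholding rule that LASSO applies to a single standardized predictor (a sketch of the shrinkage idea, not a full solver):

```python
# With one standardized predictor, LASSO replaces the least-squares slope
# b_ols with a "soft-thresholded" version: shrink toward 0 by lam, and set
# to exactly 0 if |b_ols| <= lam. This is why larger lam selects fewer variables.
def soft_threshold(b_ols, lam):
    if b_ols > lam:
        return b_ols - lam
    if b_ols < -lam:
        return b_ols + lam
    return 0.0   # variable dropped from the model

print(soft_threshold(2.5, 1.0))   # 1.5 (shrunk)
print(soft_threshold(2.5, 5.0))   # 0.0 (selected out)
```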
How does each model perform on new data generated identically?

• 14 variables: R² = 0.89, avg. prediction error = $15.69; error on new data: $26.80

• 1 variable: R² = 0.80, avg. prediction error = $21.19; error on new data: $19.81

[Figure: LASSO fits at λ = 1, λ = 5, and λ = 20]

1) Using a LASSO model with λ = 5, how would your prediction for our original
customer change if he had visited before? What if he was single?
71-year-old married man. $102,000 income, visiting the site from “Other”.
His visit is in the spring, 21 days away, and he has not been to this hotel before.
He is not international, and he is coming from 2056 miles away.
If he had visited before, prediction decreases by $0.32. If not married, prediction increases by $4.02.
Linear vs logistic regression

                          Linear regression              Logistic regression
  Independent variables   can take on any value          can take on any value
  Dependent variable      can take on any value          must be a 0 or 1
  Output                  a number between −∞ and ∞      a value between 0 and 1

• Intuition: Logistic regression uses the logistic (inverse logit) function to convert the
  prediction from a linear model to a number between 0 and 1

• Output of logistic regression is interpreted as the probability that an observation
  belongs to the positive class

[Figure: S-shaped logistic curve, P(y = 1) on the vertical axis vs. x from −4 to 4]
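That intuition can be made concrete: the logistic (inverse logit) function maps any linear prediction to (0, 1). The coefficients below are hypothetical:

```python
import math

# The logistic (inverse logit) function maps any real number into (0, 1)
def logistic(t):
    return 1 / (1 + math.exp(-t))

# Hypothetical linear model t = b0 + b1*x with b0 = -2.0, b1 = 1.0
for x in [-2, 0, 2, 4]:
    print(x, round(logistic(-2.0 + 1.0 * x), 3))   # probability that y = 1
```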
Cutoff and predictions
• If cutoff is large, predict CHD (Y = 1) less often

• If cutoff is small, predict CHD (Y = 1) more often

• How we set the cutoff depends on the costs of errors
  • For heart disease, what is the cost of a false positive? A false negative?

• Balance of classes in data also can influence choice of cutoff.

[Figure: logistic curve of P(CHD) vs. SysBP, with the cutoff drawn as a horizontal line]
Cutoff and performance

             cutoff = 0.2                  cutoff = 0.4
             pred. 0    pred. 1            pred. 0    pred. 1
  actual 0   607        156                746        17
  actual 1   73         80                 133        20

For cutoff = 0.2, what is…
• True positive rate?    TPR = 80/153      • False positive rate?   FPR = 156/763
• True negative rate?    TNR = 607/763     • False negative rate?   FNR = 73/153
• Accuracy?              ACC = 687/(763 + 153)

How do performance metrics change under cutoff = 0.4?
The TPR and FPR go way down, the TNR and FNR go up, and accuracy goes up.

Definitions: TP = # of true positives, FP = # of false positives, P = # of actual positives;
TN = # of true negatives, FN = # of false negatives, N = # of actual negatives
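The rates above can be computed directly from the confusion-matrix counts (cutoff = 0.2):

```python
# Performance metrics from the cutoff = 0.2 confusion matrix on the slide
tn, fp, fn, tp = 607, 156, 73, 80    # TN, FP (actual 0); FN, TP (actual 1)

tpr = tp / (tp + fn)                  # true positive rate  = 80/153
fpr = fp / (fp + tn)                  # false positive rate = 156/763
tnr = tn / (tn + fp)                  # true negative rate  = 607/763
fnr = fn / (fn + tp)                  # false negative rate = 73/153
acc = (tp + tn) / (tp + tn + fp + fn) # accuracy = 687/916
print(round(tpr, 3), round(fpr, 3), round(acc, 3))   # 0.523 0.204 0.75
```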
ROC plot

Framingham: AUC = 0.75

• ROC plot shows trade-off between true and false positives

• Area under the curve (AUC) of the ROC plot summarizes model performance
  across all possible cutoff values into a single value

• Highest possible AUC: 1

• Lowest “possible” AUC: 0.5 (flipping a coin)
  • If your model does worse than this, just do the opposite of what it suggests!

[Figure: ROC curve of true positive rate vs. false positive rate, annotated with cutoff values 0.75, 0.46, 0.31, 0.16, and 0.02 along the curve]
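AUC is simply the area under the (FPR, TPR) curve; a sketch using the trapezoid rule over a few hypothetical ROC points:

```python
# AUC via the trapezoid rule over (FPR, TPR) points, swept from high to low
# cutoff. The points below are hypothetical; a real ROC curve has one point
# per cutoff value.
roc_points = [(0.0, 0.0), (0.1, 0.5), (0.3, 0.8), (0.6, 0.95), (1.0, 1.0)]

auc = 0.0
for (x0, y0), (x1, y1) in zip(roc_points, roc_points[1:]):
    auc += (x1 - x0) * (y0 + y1) / 2   # area of one trapezoid slice
print(auc)   # between 0.5 (coin flip) and 1 (perfect)
```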


Reading leaves

[Figure: one leaf of the tree, annotated with the class prediction (majority class), the fraction of data points in each class at this leaf, and the total fraction of all observations contained in this leaf]
What this leaf tells us:


• For data points with petal length between
2.4 and 4.8, versicolor is the majority class

• More specifically, for data points with petal


length between 2.4 and 4.8, 0% are setosa,
96% are versicolor, and 4% are virginica

• 29% of all data points have petal length


between 2.4 and 4.8
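The leaf's logic can be mirrored as plain code; the setosa and virginica branches below are illustrative assumptions beyond what this one leaf states:

```python
# Hand-built rule mirroring the leaf described above: petal length between
# 2.4 and 4.8 -> predict versicolor (the majority class at that leaf). The
# setosa and virginica branches are illustrative assumptions, not read off
# this particular leaf.
def predict_species(petal_length):
    if petal_length < 2.4:
        return "setosa"
    if petal_length <= 4.8:
        return "versicolor"   # 96% of the leaf's points are versicolor
    return "virginica"

print(predict_species(4.0))   # versicolor
```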
What does the model tell us? Trees are interpretable and explainable

• 26% of all customers churned

• Among month-to-month customers (55% of total), 43% churned

• Among month-to-month customers with fiber optic internet (30% of total), 54% churned
Training accuracy always increases in tree depth (model complexity)

Test accuracy generally increases then decreases due to overfitting


Logistic regression vs classification trees
• Classification trees are often easier to interpret and visualize than
logistic regression

• Due to flexibility, classification trees can better fit training data, but are
more prone to overfitting than logistic regression

• Difficult to know in advance which method works better for your data
and prediction task – need to try both!

You might also like