DS-05 Introduction To Machine Learning
Agenda
• Introduction to ML
– Types of ML
– Using ML to make predictions
• Regression
– Defining the problem
– Linear regression models
– Regression model evaluation
• Classification
– Logistic regression
– Classification metrics
Paradigm shift
• Traditional Programming: Data + Program → Computer → Output
• Machine Learning: Data + Output → Computer → Program
Understanding how machines learn
What is machine learning?
• Unsupervised learning
– Labels not provided
– Training data does not include desired outputs
– Making sense of data
– Understanding the past
– Learning the structure of data
• Reinforcement learning
– Rewards and/or punishments from sequence of actions
• Semi-supervised learning
Algorithms
Machine Learning Capabilities
• Decision trees
• Sets of rules / Logic programs
• Instances
• Graphical models (Bayes/Markov nets)
• Neural networks
• Support vector machines
• Model ensembles
• Etc.
Evaluation
• Accuracy
• Precision and recall
• Squared error
• Likelihood
• Posterior probability
• Cost / Utility
• Margin
• Entropy
• K-L divergence
• Etc.
Optimization
• Combinatorial optimization
– E.g.: Greedy search
• Convex optimization
– E.g.: Gradient descent
• Constrained optimization
– E.g.: Linear programming
ML in Practice
• When the relationship between the inputs and the response is
complicated, models such as logistic regression can be limited, and we
need to use more complicated models.
Predicting a Variable
• Let's imagine a scenario where we'd like to predict one variable using another
(or a set of other) variables.
• Thus, we'd like to define two categories of variables:
• variables whose value we want to predict
• variables whose values we use to make our prediction
• Examples:
• Predicting the number of views a YouTube video will get next week based
on video length, the date it was posted, previous number of views, etc.
• Predicting which movies a Netflix user will rate highly, based on their
previous movie ratings, demographic data etc.
Translating Between Statistics and Machine Learning
other terms:
residual, loss (statistics) = error (machine learning)
Data
The Advertising data set consists of the sales of a product in 200
different markets, along with the advertising budgets for the product in
each of those markets for three different media: TV, radio, and
newspaper. Everything is given in units of $1,000.
Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with
applications in R" (Springer, 2013)
Response vs. Predictor Variables
X (p predictors): predictors, features, covariates
Y: outcome, response variable, dependent variable
Response vs. Predictor Variables
X = (X₁, …, Xₚ)
Xⱼ = (x₁ⱼ, …, xᵢⱼ, …, xₙⱼ)   Y = (y₁, …, yₙ)
X (p predictors): predictors, features, covariates
Y: outcome, response variable, dependent variable
True vs. Statistical Model
• We will assume that the response variable, Y, relates to the
predictors, X, through some unknown function expressed generally as:
Y = f(X) + ε
Statistical Model
How do we find f̂(x)?
What is the value of y at this x?
Statistical Model
How do we find f̂(x)?
or this one?
Statistical Model
A simple idea is to take the mean of all y's: f̂(x) = (1/n) Σᵢ yᵢ
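This mean-of-all-y's baseline can be sketched in a few lines (the toy observations are made up for illustration, not taken from the Advertising data):

```python
import numpy as np

# Hypothetical toy observations of the response variable.
y_train = np.array([7.0, 9.0, 12.0, 14.0, 16.0])

def f_hat_mean(x_new):
    """Simplest possible model: ignore x entirely and predict the mean
    of all observed y's, f̂(x) = (1/n) Σ yᵢ."""
    return float(np.mean(y_train))
```

Whatever value of x we ask about, the prediction is the same — which is exactly why this model is only a starting point.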
Prediction vs. Estimation
• When we use a set of measurements, (xᵢ,₁, …, xᵢ,ₚ), to predict a value for the response
variable, we denote the predicted value by:
ŷᵢ = f̂(xᵢ,₁, …, xᵢ,ₚ)
Simple Prediction Model
Simple Prediction Model
Do the same for “all” 𝑥′𝑠
Extend the Prediction Model
Predict ŷ as the average of the k nearest neighbors' responses: ŷ = (1/k) Σᵢ∈N(x) yᵢ
Simple Prediction Models
Simple Prediction Models
We can try different k models on more data.
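The k-nearest-neighbors prediction rule above can be sketched as follows (a minimal 1-D version with made-up toy data):

```python
import numpy as np

def knn_predict(x_new, x_train, y_train, k=3):
    """1-D k-NN regression: average the y's of the k training points
    whose x is closest to x_new."""
    nearest = np.argsort(np.abs(x_train - x_new))[:k]
    return float(np.mean(y_train[nearest]))

# Hypothetical toy data for illustration.
x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_train = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
```

Trying different values of k gives the different models compared on these slides: small k follows the data closely, large k smooths toward the overall mean.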
Error Evaluation
Start with some data.
Error Evaluation
Hide some of the data from the model. This is called train-test split.
We use the train set to estimate ŷ, and the test set to evaluate the model.
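A train-test split like the one described can be sketched as follows (a hand-rolled version for illustration; in practice a library routine such as scikit-learn's would typically be used):

```python
import numpy as np

def train_test_split(x, y, test_frac=0.25, seed=0):
    """Randomly hide a fraction of the data; the held-out (test) part is
    used only to evaluate the model, never to fit it."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))          # shuffle the indices
    n_test = int(len(x) * test_frac)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

x = np.arange(8.0)
y = 2.0 * x
x_tr, x_te, y_tr, y_te = train_test_split(x, y)
```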
Error Evaluation
Estimate ŷ for k=1.
Error Evaluation
Now, we look at the data we have not used, the test data (red crosses).
Error Evaluation
Calculate the residuals (yᵢ − ŷᵢ).
Error Evaluation
Do the same for k=3.
Error Evaluation
• In order to quantify how well a model performs, we define a loss or error function.
• A common loss function for quantitative outcomes is the Mean Squared Error (MSE):
MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
• The quantity yᵢ − ŷᵢ is called a residual and measures the error at the i-th
prediction.
• Note: the square root of the mean of the squared errors (RMSE) is also
commonly used:
RMSE = √MSE = √[(1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²]
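The MSE and RMSE formulas translate directly into code (a minimal sketch):

```python
import numpy as np

def mse(y, y_hat):
    """Mean Squared Error: the average of the squared residuals (yᵢ - ŷᵢ)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean((y - y_hat) ** 2))

def rmse(y, y_hat):
    """Root MSE: same units as the response, so it is easier to interpret."""
    return float(np.sqrt(mse(y, y_hat)))
```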
Model Comparison
Do the same for all k’s and compare the RMSEs. k=3 seems to be the best model.
Model fitness
For a subset of the data, calculate the RMSE for k=3. Is RMSE=5.0 good enough?
Model fitness
What if we measure the Sales in cents instead of dollars?
RMSE is now 5004.93. Is that good?
Model fitness
It is better if we compare it to something.
R-squared - The Coefficient of Determination
R² = 1 − Σᵢ (ŷᵢ − yᵢ)² / Σᵢ (ȳ − yᵢ)²
• If our model is as good as the mean value, ȳ, then R² = 0.
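The R² formula can be sketched as:

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot.
    A model that always predicts the mean of y gets R² = 0;
    a perfect model gets R² = 1."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    ss_res = np.sum((y_hat - y) ** 2)
    ss_tot = np.sum((np.mean(y) - y) ** 2)
    return float(1.0 - ss_res / ss_tot)
```

Because R² is a ratio, it is unchanged if sales are measured in cents rather than dollars — which is exactly the problem with raw RMSE noted above.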
Linear Models
• Note that in building our k-NN model for prediction, we did not
compute a closed form for f̂.
• If we assume a linear form for f:
Y = f(X) + ε = β₁X + β₀ + ε
• … then it follows that our estimate is:
Ŷ = f̂(X) = β̂₁X + β̂₀
• where β̂₁ and β̂₀ are estimates of β₁ and β₀, respectively, that we compute using
observations.
Estimate of the regression coefficients (cont)
Is this line good?
50
Estimate of the regression coefficients (cont)
Maybe this one?
Estimate of the regression coefficients (cont)
Or this one?
Estimate of the regression coefficients (cont)
Question: Which line is the best?
First, calculate the residuals.
Estimate of the regression coefficients (cont)
• We choose β̂₀ and β̂₁ in order to minimize the predictive errors made by our
model, i.e., to minimize our loss function.
• Then the optimal values for β̂₀ and β̂₁ should be:
β̂₀, β̂₁ = argmin_{β₀,β₁} L(β₀, β₁)
Estimate of the regression coefficients: brute force
• One way to estimate argmin_{β₀,β₁} L is to calculate the loss function for every
possible β₀ and β₁, and then select the β₀ and β₁ where the loss function is
minimum.
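The brute-force search can be sketched as follows (the toy data and the grid ranges are made up for illustration):

```python
import numpy as np

def brute_force_fit(x, y, b0_grid, b1_grid):
    """Evaluate the MSE loss at every (beta0, beta1) pair on a grid and
    keep the pair with the smallest loss."""
    best_b0, best_b1, best_loss = None, None, np.inf
    for b0 in b0_grid:
        for b1 in b1_grid:
            loss = np.mean((y - (b0 + b1 * x)) ** 2)
            if loss < best_loss:
                best_b0, best_b1, best_loss = b0, b1, loss
    return best_b0, best_b1, best_loss

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x          # true beta0 = 1, beta1 = 2
b0, b1, loss = brute_force_fit(x, y, np.linspace(0, 2, 21), np.linspace(0, 3, 31))
```

This is only feasible for a couple of parameters and a coarse grid; it motivates the exact and gradient-descent methods that follow.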
Estimate of the regression coefficients: exact method
Take the partial derivatives of L with respect to β₀ and β₁, set them to zero, and
solve. This procedure gives us explicit formulae for β̂₀ and β̂₁:
β̂₁ = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²
β̂₀ = ȳ − β̂₁ x̄
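The closed-form formulae translate directly into code (toy data made up for illustration):

```python
import numpy as np

def ols_fit(x, y):
    """Closed-form least-squares estimates:
    b1 = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²,  b0 = ȳ - b1·x̄."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xbar, ybar = x.mean(), y.mean()
    b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    b0 = ybar - b1 * xbar
    return float(b0), float(b1)

b0, b1 = ols_fit([0.0, 1.0, 2.0, 3.0], [5.0, 8.0, 11.0, 14.0])
```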
Proof:
L(β₀, β₁) = (1/n) Σᵢ [yᵢ − (β₀ + β₁xᵢ)]²

Set both partial derivatives to zero: dL/dβ₀ = 0 and dL/dβ₁ = 0.

From dL/dβ₀ = 0:
−(2/n) Σᵢ (yᵢ − β₀ − β₁xᵢ) = 0
⟹ (1/n) Σᵢ yᵢ − β₀ − β₁ (1/n) Σᵢ xᵢ = 0
⟹ β₀ = ȳ − β₁x̄

From dL/dβ₁ = 0:
−(2/n) Σᵢ (yᵢ − β₀ − β₁xᵢ) xᵢ = 0
⟹ −Σᵢ xᵢyᵢ + β₀ Σᵢ xᵢ + β₁ Σᵢ xᵢ² = 0
⟹ −Σᵢ xᵢyᵢ + (ȳ − β₁x̄) Σᵢ xᵢ + β₁ Σᵢ xᵢ² = 0
⟹ β₁ (Σᵢ xᵢ² − nx̄²) = Σᵢ xᵢyᵢ − nx̄ȳ
⟹ β₁ = (Σᵢ xᵢyᵢ − nx̄ȳ) / (Σᵢ xᵢ² − nx̄²)
⟹ β₁ = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²
Estimate of the regression coefficients: gradient descent
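This slide's figure is not reproduced here; a minimal gradient-descent sketch for the two regression coefficients, on made-up toy data, might look like:

```python
import numpy as np

def gd_fit(x, y, lr=0.05, n_iter=5000):
    """Fit y ≈ b0 + b1·x by gradient descent on the MSE loss.
    Gradients: dL/db0 = -(2/n) Σ rᵢ,  dL/db1 = -(2/n) Σ rᵢ·xᵢ,
    where rᵢ = yᵢ - (b0 + b1·xᵢ)."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        r = y - (b0 + b1 * x)          # current residuals
        b0 += lr * (2.0 / n) * np.sum(r)
        b1 += lr * (2.0 / n) * np.sum(r * x)
    return b0, b1

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x
b0, b1 = gd_fit(x, y)
```

The learning rate and iteration count here are arbitrary choices that happen to converge for this toy problem; in practice they must be tuned.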
Interpretation of Predictors
• If we increase the TV budget by $1,000, what would you expect the increase in sales to be?
• The interpretation of the predictors depends on the estimated values, but decisions depend on
how much we trust these values.
Confidence intervals for the predictors estimates
Confidence intervals for the predictors estimates (cont)
• But due to error, every time we measure the response Y for a fixed value of X,
we will obtain a different observation.
• We have 3 measurements for several different values of X; each set is a “realization” in the
picture.
Confidence intervals for the predictors estimates (cont)
• For each one of those “realizations”, we could fit a model and estimate β̂₀ and β̂₁.
Bootstrapping for Estimating Sampling Error
Definition
• Bootstrapping is the practice of estimating properties of an
estimator by measuring those properties on samples drawn from the
observed data.
• Bootstrap samples are drawn from the data with replacement.
• The region containing 95% of the bootstrap estimates is a 95% confidence interval (CI).
Confidence intervals for the predictors estimates (cont)
We sample multiple times and calculate β̂₀ and β̂₁.
Confidence intervals for the predictors estimates (cont)
Another sample
Confidence intervals for the predictors estimates (cont)
Another sample
Confidence intervals for the predictors estimates (cont)
And another sample
Confidence intervals for the predictors estimates (cont)
Repeat this for 100 times
Confidence intervals for the predictors estimates (cont)
We can now estimate the mean and standard deviation of all the estimates β̂₀ and β̂₁.
The standard deviations of β̂₀ and β̂₁ are also called their standard errors, SE(β̂₀) and SE(β̂₁).
Confidence intervals for the predictors estimates (cont)
Finally, we can calculate the confidence intervals: ranges of values such that
the true value of β₁ is contained in the interval with a given probability
(e.g., 95% or 68%).
Confidence intervals for the predictors estimates: Standard Errors
• If for each bootstrapped sample b the estimated betas are β̂₀,ᵦ and β̂₁,ᵦ, then:
SE(β̂₀) = √var(β̂₀)
SE(β̂₁) = √var(β̂₁)
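The bootstrap standard errors can be sketched as follows (noiseless toy data is used here so that the check is deterministic; real data would give nonzero SEs):

```python
import numpy as np

def bootstrap_se(x, y, n_boot=1000, seed=0):
    """Refit the regression on bootstrap samples (drawn with replacement)
    and report the standard deviation of the estimates: SE(b0), SE(b1)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    b0s, b1s = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # sample indices with replacement
        xb, yb = x[idx], y[idx]
        sxx = np.sum((xb - xb.mean()) ** 2)
        if sxx == 0:                          # degenerate resample (all x equal); skip
            continue
        b1 = np.sum((xb - xb.mean()) * (yb - yb.mean())) / sxx
        b0s.append(yb.mean() - b1 * xb.mean())
        b1s.append(b1)
    return float(np.std(b0s)), float(np.std(b1s))

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x          # noiseless line: every refit recovers the same betas
se_b0, se_b1 = bootstrap_se(x, y)
```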
Confidence intervals for the predictors estimates: Standard Errors
• Alternatively:
• If we know the variance σ² of the noise ε, we can compute SE(β̂₀) and SE(β̂₁)
analytically using the formulae below (no need to bootstrap).
Standard Errors
• Since σ² is usually unknown, we can estimate it empirically from the data and our regression line.
Remember:
yᵢ = f(xᵢ) + εᵢ ⟹ εᵢ = yᵢ − f(xᵢ)
Standard Errors
SE(β̂₀) = σ √( 1/n + x̄² / Σᵢ (xᵢ − x̄)² )
SE(β̂₁) = σ / √( Σᵢ (xᵢ − x̄)² )
σ ≈ √( Σᵢ (f̂(xᵢ) − yᵢ)² / (n − 2) )

• More data: n ↑ and Σᵢ(xᵢ − x̄)² ↑ ⟹ SE ↓
• Larger coverage: var(x) or Σᵢ(xᵢ − x̄)² ↑ ⟹ SE ↓
• Better data: σ ↓ ⟹ SE ↓
• Better model: (f̂ − yᵢ) ↓ ⟹ σ ↓ ⟹ SE ↓
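These formulae can be sketched in code (a perfect-fit toy example is used, so both standard errors come out zero):

```python
import numpy as np

def analytic_se(x, y, y_hat):
    """Analytic SEs for simple linear regression; sigma is estimated from
    the residuals with n - 2 degrees of freedom."""
    x, y, y_hat = (np.asarray(a, float) for a in (x, y, y_hat))
    n = len(x)
    sigma = np.sqrt(np.sum((y_hat - y) ** 2) / (n - 2))
    sxx = np.sum((x - x.mean()) ** 2)
    se_b0 = sigma * np.sqrt(1.0 / n + x.mean() ** 2 / sxx)
    se_b1 = sigma / np.sqrt(sxx)
    return float(se_b0), float(se_b1)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
se_b0, se_b1 = analytic_se(x, y, y)   # perfect fit: zero residuals
```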
Classification
• Up to this point, the methods we have seen have centered around modeling and
predicting a quantitative response variable (e.g., number of taxi pickups,
number of bike rentals, etc.). Linear regression performs well in these
situations.
• When the response variable is categorical, then the problem is no longer called a
regression problem, but instead is called a classification problem.
• The goal is to attempt to classify each observation into a category (also known as a
class or cluster), defined by Y, based on a set of predictor variables X.
Example
Why not Linear Regression?
Logistic Regression
• Rather than modelling this response Y directly, logistic regression
models the probability that Y belongs to a particular category
• For the Default data, logistic regression models the probability of
default.
– The probability of default given balance can be written as
Pr(default = Yes|balance), which we abbreviate as p(balance)
– The values p(balance), will range between 0 and 1.
– Then for any given value of balance, a prediction can be made for default.
– For example, one might predict default = Yes for any individual for whom
p(balance) > 0.5.
– Alternatively, if a company wishes to be conservative in predicting
individuals who are at risk for default, then they may choose to use a lower
threshold, such as p(balance) > 0.1.
The Logistic Model
The Logistic Model
• The quantity p(X)/[1 − p(X)] is called the odds, and can take on any
value between 0 and ∞.
• Values of the odds close to 0 and ∞ indicate very low and very high
probabilities of default, respectively
• For example,
– on average 1 in 5 people with an odds of 1/4 will default, since p(X)=0.2
implies an odds of 0.2/(1−0.2) = 1/4.
– Likewise, on average nine out of every ten people with an odds of 9 will
default, since p(X) = 0.9 implies an odds of 0.9/(1 − 0.9) = 9.
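The odds/probability conversions used in these examples are simple to express in code (function names are illustrative):

```python
def odds(p):
    """Odds for probability p: p / (1 - p); ranges over (0, ∞)."""
    return p / (1.0 - p)

def prob_from_odds(o):
    """Invert: the probability corresponding to odds o."""
    return o / (1.0 + o)
```

For instance, odds(0.2) gives 1/4 and odds(0.9) gives 9, matching the default examples above.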
The Logistic Model
• By taking the logarithm of both sides, we arrive at the log-odds, or logit:
log[ p(X) / (1 − p(X)) ] = β₀ + β₁X
Making Predictions
• For an individual with a low balance, the predicted probability of
default is below 1%.
• In contrast, the predicted probability of default for an individual with a balance
of $2,000 is much higher, and equals 0.586, or 58.6%.
Multiple Logistic Regression
Important measures for classification and diagnostic testing
• Accuracy
• Precision
• Recall
• Sensitivity
• Specificity
Basic quantities for performance evaluation
Two of the most important performance measures
• Precision: by this we mean the percentage of true positives, NTP, among all
examples that the classifier has labeled as positive, NTP + NFP. It is
obtained by the following formula:
Precision = NTP / (NTP + NFP)
Why not use “accuracy” directly
Accuracy = (NTP + NTN) / (NTP + NTN + NFP + NFN)
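These measures can be computed directly from the four confusion-matrix counts; a minimal sketch (the example counts are illustrative, not from the slides' data):

```python
def accuracy(tp, tn, fp, fn):
    """(N_TP + N_TN) / all examples."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """N_TP / (N_TP + N_FP): fraction of predicted positives that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """N_TP / (N_TP + N_FN): fraction of actual positives that are found."""
    return tp / (tp + fn)
```

On a heavily imbalanced data set, accuracy alone can look excellent while precision or recall on the rare class is poor — which is why it should not be used directly.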
Precision and Recall
Important measures for classification and diagnostic testing
Confusion matrix
An ROC curve that rises at 45° is a poor model: it represents a random allocation of
cases to the classes, and is the ROC curve for the baseline model.
ROC AUC