Introduction To Machine Learning: Slides Credit: CMU AI, Zico Kolter, Pat Virtue
Linear regression
But it is relatively easy to record past days of consumption, plus additional features
that affect consumption (e.g., weather)
Peak_Demand ≈ 𝜃1 ⋅ High_Temperature + 𝜃2
If we know the high temperature will be 72 degrees (ignoring for now that this is
also a prediction), then we can predict peak demand to be:
Predicted_Peak_Demand = 𝜃1 ⋅ 72 + 𝜃2 = 1.821 GW
Predicted_Peak_Demand = 𝜃1 ⋅ High_Temperature + 𝜃2
Many possibilities, but a natural objective is to minimize some difference between this line
and the observed data, e.g., the squared loss:
E(θ) = ∑_{i∈days} (Predicted_Peak_Demand^(i) − Peak_Demand^(i))²
     = ∑_{i∈days} (θ1 ⋅ High_Temperature^(i) + θ2 − Peak_Demand^(i))²
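As a concrete illustration, here is a minimal sketch of this objective in Python/NumPy; the data arrays and the θ values are made-up placeholders, not the lecture's actual dataset.

import numpy as np

def squared_error(theta1, theta2, high_temp, peak_demand):
    """E(theta): sum of squared prediction errors over the observed days."""
    predicted = theta1 * high_temp + theta2
    return np.sum((predicted - peak_demand) ** 2)

# hypothetical observations: daily high temperature (°F) and peak demand (GW)
high_temp = np.array([68.0, 75.0, 82.0, 90.0])
peak_demand = np.array([1.60, 1.85, 2.05, 2.40])

print(squared_error(0.03, -0.5, high_temp, peak_demand))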
How do we find parameters?
How do we find the parameters 𝜃1, 𝜃2 that minimize the function
E(θ) = E(θ1, θ2) = ∑_{i∈days} (θ1 ⋅ High_Temperature^(i) + θ2 − Peak_Demand^(i))²
                 ≡ ∑_{i∈days} (θ1 ⋅ x^(i) + θ2 − y^(i))²
[Figure: Peak_Demand (y) versus High_Temperature (x), with a candidate line of slope θ1 and intercept θ2]
[Figure: repeat of the previous slide, now also showing the error surface E(θ) as a function of (θ1, θ2)]
Gradient descent
How do we find the parameters 𝜃1, 𝜃2 that minimize:
E(θ) = E(θ1, θ2) = ∑_{i∈days} (θ1 ⋅ x^(i) + θ2 − y^(i))²
[Figure: the error surface E(θ) over (θ1, θ2) and its contour plot]
Gradient descent
To find a good value of θ, we can repeatedly take steps in the direction of the
negative derivative with respect to each parameter
Repeat:
  θ1 := θ1 − α ⋅ ∂E(θ1, θ2)/∂θ1
  θ2 := θ2 − α ⋅ ∂E(θ1, θ2)/∂θ2
where E(θ1, θ2) ≡ ∑_{i∈days} (θ1 ⋅ x^(i) + θ2 − y^(i))²
B. f(x) = (3 − 5x)²,  df/dx = ?
C. f(x, z) = 2x + 3z + 5x²z,  ∂f/∂z = ?
D. f(x, z) = 2x + 3z + 5x²z,  ∂f/∂x = ?
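(Aside, not from the slides: one way to check answers like these is symbolic differentiation with sympy; the snippet below is just a sanity check.)

import sympy as sp

x, z = sp.symbols('x z')

# B: f(x) = (3 - 5x)^2
print(sp.diff((3 - 5*x)**2, x))      # 2*(3 - 5*x)*(-5), i.e. 50*x - 30

# C and D: f(x, z) = 2x + 3z + 5*x^2*z
f = 2*x + 3*z + 5*x**2*z
print(sp.diff(f, z))                 # 5*x**2 + 3
print(sp.diff(f, x))                 # 10*x*z + 2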
Computing the derivatives
Assume we just have m = 2 points, (x^(1), y^(1)) and (x^(2), y^(2)).
[Figure: the two points (x^(1), y^(1)) and (x^(2), y^(2)) plotted in the (x, y) plane]
∂E(θ)/∂θ1 = ∂/∂θ1 ∑_{i=1}^{m} (θ1 ⋅ x^(i) + θ2 − y^(i))²
Computing the derivatives
What are the derivatives of the error function with respect to each parameter 𝜃1 and 𝜃2?
∂E(θ)/∂θ1 = ∂/∂θ1 ∑_{i∈days} (θ1 ⋅ x^(i) + θ2 − y^(i))²
          = ∑_{i∈days} ∂/∂θ1 (θ1 ⋅ x^(i) + θ2 − y^(i))²
          = ∑_{i∈days} 2 (θ1 ⋅ x^(i) + θ2 − y^(i)) ⋅ ∂/∂θ1 (θ1 ⋅ x^(i))
          = ∑_{i∈days} 2 (θ1 ⋅ x^(i) + θ2 − y^(i)) ⋅ x^(i)

∂E(θ)/∂θ2 = ∑_{i∈days} 2 (θ1 ⋅ x^(i) + θ2 − y^(i))
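To build confidence in these formulas, a quick numerical check against finite differences can be run; the data below are made up for illustration.

import numpy as np

# made-up data: x = daily high temperature, y = peak demand
x = np.array([68.0, 75.0, 82.0, 90.0])
y = np.array([1.60, 1.85, 2.05, 2.40])

def E(theta1, theta2):
    return np.sum((theta1 * x + theta2 - y) ** 2)

def grad(theta1, theta2):
    r = theta1 * x + theta2 - y                 # residuals
    return np.sum(2 * r * x), np.sum(2 * r)     # dE/dtheta1, dE/dtheta2

t1, t2, eps = 0.05, -1.5, 1e-6
g1, g2 = grad(t1, t2)
# central differences should closely agree with the analytic derivatives
print(g1, (E(t1 + eps, t2) - E(t1 - eps, t2)) / (2 * eps))
print(g2, (E(t1, t2 + eps) - E(t1, t2 - eps)) / (2 * eps))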
Gradient descent
To find a good value of θ, we can repeatedly take steps in the direction of the
negative derivative with respect to each parameter
Repeat:
  θ1 := θ1 − α ⋅ ∂E(θ1, θ2)/∂θ1
  θ2 := θ2 − α ⋅ ∂E(θ1, θ2)/∂θ2
Substituting in the derivatives:
Repeat:
  θ1 := θ1 − α ∑_{i∈days} 2 (θ1 ⋅ x^(i) + θ2 − y^(i)) ⋅ x^(i)
  θ2 := θ2 − α ∑_{i∈days} 2 (θ1 ⋅ x^(i) + θ2 − y^(i))
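Putting the two updates together, a minimal sketch of this loop might look as follows; the data, initial θ, step size, and iteration count are illustrative assumptions (the slides run with α = 0.001 on their own dataset).

import numpy as np

# illustrative data: x = daily high temperature (°F), y = peak demand (GW)
x = np.array([68.0, 75.0, 82.0, 90.0])
y = np.array([1.60, 1.85, 2.05, 2.40])

theta1, theta2, alpha = 0.0, 0.0, 1e-5   # small step size keeps these updates stable
for it in range(1000):
    r = theta1 * x + theta2 - y                  # residuals
    theta1 -= alpha * np.sum(2 * r * x)          # step along -dE/dtheta1
    theta2 -= alpha * np.sum(2 * r)              # step along -dE/dtheta2

print(theta1, theta2, np.sum((theta1 * x + theta2 - y) ** 2))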
Gradient descent – Iterations 1–10
[Figures: contour plots of E(θ) over (θ1, θ2) tracing the gradient-descent trajectory with step size α = 0.001; the error drops from E(θ) = 1427.53 at the starting point through 292.18, 64.31, 18.58, and 9.40 to 7.09 by iteration 10]
Fitted line in “original” coordinates
[Figure: the fitted line shown in the original (High_Temperature, Peak_Demand) coordinates, alongside the E(θ) contours over (θ1, θ2)]
Extensions
What if we want to add additional features, e.g. day of week, instead of just
temperature?
What if we want to use a different loss function instead of squared error (e.g.,
absolute error)?
We can easily reason about all these things by adopting some additional notation…
Outline
Least squares regression: a simple example and gradient descent
Prediction
[Diagram: a new input x^(new) is passed to the hypothesis function hθ, producing the predicted output hθ(x^(new))]
Terminology
Input features: x^(i) ∈ ℝⁿ, i = 1, …, m
  E.g.: x^(i) = (High_Temperature^(i), Is_Weekday^(i), 1)
Outputs: y^(i) ∈ 𝒴, i = 1, …, m
  E.g.: y^(i) ∈ ℝ = Peak_Demand^(i)
Model parameters: θ ∈ ℝⁿ
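To make this notation concrete, here is a small sketch of how the features and outputs might be laid out as NumPy arrays; the particular numbers are invented, except that each row follows the (High_Temperature, Is_Weekday, 1) layout used later in the lecture.

import numpy as np

# one row per day: x^(i) = (High_Temperature, Is_Weekday, 1)
X = np.array([
    [77.0, 1.0, 1.0],
    [80.0, 0.0, 1.0],
    [68.0, 1.0, 1.0],
])
# y^(i) = Peak_Demand (GW) for the same days; values made up for illustration
y = np.array([2.08, 2.00, 1.65])

m, n = X.shape   # m examples, n features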
Virtually every machine learning algorithm has this form; we just need to specify:
• What is the hypothesis function?
• What is the loss function?
• How do we solve the optimization problem?
Example machine learning algorithms
Note: we (machine learning researchers) have not been consistent in naming conventions;
many machine learning algorithms actually only specify some of these three elements
• Least squares: {linear hypothesis, squared loss, (usually) analytical solution}
• Linear regression: {linear hypothesis, *, *}
• Support vector machine: {linear or kernel hypothesis, hinge loss, *}
• Neural network: {composed non-linear function, *, (usually) gradient descent}
• Decision tree: {hierarchical axis-aligned half-planes, *, greedy optimization}
• Naïve Bayes: {linear hypothesis, joint probability under certain independence assumptions, analytical solution}
Outline
Least squares regression: a simple example and gradient descent
Setup:
• Linear hypothesis function: hθ(x) = ∑_{j=1}^{n} θj ⋅ xj
• Squared error loss: ℓ(hθ(x), y) = (hθ(x) − y)²
• Resulting machine learning optimization problem: minimize over θ the objective ∑_{i=1}^{m} (hθ(x^(i)) − y^(i))²
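A minimal NumPy sketch of this hypothesis function; the θ values are illustrative, chosen close to the fitted parameters printed later in the lecture.

import numpy as np

def h(theta, x):
    """Linear hypothesis h_theta(x) = sum_j theta_j * x_j."""
    return float(np.dot(theta, x))

# x = (High_Temperature, Is_Weekday, 1); theta roughly matches the later fit
print(h(np.array([0.047, 0.225, -1.80]), np.array([77.0, 1.0, 1.0])))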
Derivative of the least squares objective
Compute the partial derivative with respect to an arbitrary model parameter θj:
∂E(θ)/∂θj = ∑_{i=1}^{m} 2 (hθ(x^(i)) − y^(i)) ⋅ xj^(i)
Gradient descent algorithm
1. Initialize θk := 0, k = 1, …, n
2. Repeat:
   • For k = 1, …, n:
       θk := θk − α ∑_{i=1}^{m} 2 (hθ(x^(i)) − y^(i)) ⋅ xk^(i)
Note: do not actually implement it like this; you'll want to use the matrix/vector
notation we will cover soon
Outline
Least squares regression: a simple example and gradient descent
Gradient in vector notation
We can actually simplify the gradient computation (both notationally and
computationally) substantially using matrix/vector notation:
∇θ f(θ) = (∂f(θ)/∂θ1, …, ∂f(θ)/∂θn)ᵀ ∈ ℝⁿ
Putting things in this form also makes it clearer how to analytically find the
optimal solution for least squares
Matrix notation, one level deeper
Let's define the matrices
X = [ x^(1)ᵀ ; x^(2)ᵀ ; ⋮ ; x^(m)ᵀ ] ∈ ℝ^{m×n},   y = (y^(1), y^(2), …, y^(m))ᵀ ∈ ℝ^m
i.e., the i-th row of X is x^(i)ᵀ, and y stacks the outputs.
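With X and y stacked this way, the objective can be written as ‖Xθ − y‖², and its gradient takes the 2Xᵀ(Xθ − y) form used in the optimality condition below. A sketch of the vectorized gradient and one descent step (function names are mine, not the slides'):

import numpy as np

def gradient(theta, X, y):
    """Gradient of sum((X @ theta - y)**2) with respect to theta."""
    return 2 * X.T @ (X @ theta - y)

def gd_step(theta, X, y, alpha):
    # one gradient-descent update, written in matrix/vector form
    return theta - alpha * gradient(theta, X, y)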
Solving least squares
Gradient also gives a condition for optimality:
• Gradient must equal zero:
  2Xᵀ(Xθ − y) = 0  ⟹  XᵀX θ = Xᵀy
Compute solution:
import numpy as np
# solve least squares via the normal equations X.T @ X @ theta = X.T @ y
# (assumes X has one row per day with columns [High_Temperature, Is_Weekday, 1],
#  and y holds the corresponding Peak_Demand values)
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)
# [ 0.04747948 0.22462824 -1.80260016]
Make predictions:
# predict on new data: each row is [High_Temperature, Is_Weekday, 1]
Xnew = np.array([[77, 1, 1], [80, 0, 1]])
ypred = Xnew @ theta
print(ypred)
# [ 2.07794778 1.99575797]
Scikit-learn
By far the most popular machine learning library in Python is the scikit-learn library
(http://scikit-learn.org/)
Important: you need to understand the very basics of how these algorithms work in
order to use them effectively
if self.fit_intercept:
    # the last weight corresponds to the constant-ones feature:
    # report it as the intercept and keep the rest as coefficients
    self.intercept_ = self.coef_[-1]
    self.coef_ = self.coef_[:-1]
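For comparison, a minimal sketch of fitting the same kind of model with scikit-learn's LinearRegression; here the constant-ones column is dropped and the intercept is handled by the library, and the data values are made up for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

# features without the constant column: [High_Temperature, Is_Weekday]
X_feat = np.array([[77.0, 1.0], [80.0, 0.0], [68.0, 1.0]])
y = np.array([2.08, 2.00, 1.65])         # made-up Peak_Demand values (GW)

model = LinearRegression(fit_intercept=True).fit(X_feat, y)
print(model.coef_, model.intercept_)     # slope terms and intercept
print(model.predict(np.array([[77.0, 1.0], [80.0, 0.0]])))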