
Introduction to Machine Learning

Slides credit: CMU AI, Zico Kolter, Pat Virtue


Readings
Joel Grus, Data Science from Scratch, 2nd Edition:
• Ch. 8 (Gradient Descent)
• Ch. 14 (Simple Linear Regression)
• Ch. 15 (Multiple Linear Regression)
Outline
Least squares regression: a simple example and gradient descent

Machine learning notation

Linear regression revisited

Matrix/vector notation and analytic solutions

Implementing linear regression


Outline
Least squares regression: a simple example and gradient descent

Machine learning notation

Linear regression

Matrix/vector notation and analytic solutions

Implementing linear regression


A simple example: predicting electricity use
What will peak power consumption be tomorrow?

Difficult to hand-build a predictive model of demand, but relatively easy to record past days of consumption, plus additional features that affect consumption (e.g., weather)

Date         High Temperature (°F)   Peak Demand (GW)
2011-06-01   84.0                    2.651
2011-06-02   73.0                    2.081
2011-06-03   75.2                    1.844
2011-06-04   84.9                    1.959
…            …                       …
Plot of consumption vs. temperature
Plot of high temperature vs. peak demand for summer months (June – August) for
past six years
Hypothesis: linear model
Let’s suppose that the peak demand approximately fits a linear model:

Peak_Demand ≈ 𝜃1 ⋅ High_Temperature + 𝜃2

Here 𝜃1 is the “slope” of the line, and 𝜃2 is the intercept


Making predictions
Importantly, our model also lets us make predictions about new days

What will the peak demand be tomorrow?

If we know the high temperature will be 72 degrees (ignoring for now that this is
also a prediction), then we can predict peak demand to be:
Predicted_Peak_Demand = 𝜃1 ⋅ 72 + 𝜃2 = 1.821 GW

Equivalent to just “finding the point on the line”


Predicted output for each data point

Predicted_Peak_Demand^(𝑖) = 𝜃1 ⋅ High_Temperature^(𝑖) + 𝜃2
Hypothesis: linear model
Let’s suppose that the peak demand approximately fits a linear model

Predicted_Peak_Demand = 𝜃1 ⋅ High_Temperature + 𝜃2

Here 𝜃1 is the “slope” of the line, and 𝜃2 is the intercept

How do we find a “good” fit to the data?

Many possibilities, but natural objective is to minimize some difference between this line
and the observed data, e.g. squared loss

𝐸(𝜃) = ∑_{𝑖∈days} (Predicted_Peak_Demand^(𝑖) − Peak_Demand^(𝑖))²

𝐸(𝜃) = ∑_{𝑖∈days} (𝜃1 ⋅ High_Temperature^(𝑖) + 𝜃2 − Peak_Demand^(𝑖))²
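As a side note, this squared-error objective translates directly into code; here is a minimal sketch, assuming the data live in two equal-length Python lists named high_temp and peak_demand (hypothetical names, not from the slides):

# sketch: squared-error objective E(theta) for the linear model
def squared_error(theta1, theta2, high_temp, peak_demand):
    return sum((theta1 * x + theta2 - y) ** 2
               for x, y in zip(high_temp, peak_demand))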
How do we find parameters?
How do we find the parameters 𝜃1, 𝜃2 that minimize the function

𝐸(𝜃) = 𝐸(𝜃1, 𝜃2) = ∑_{𝑖∈days} (𝜃1 ⋅ High_Temperature^(𝑖) + 𝜃2 − Peak_Demand^(𝑖))²
                 ≡ ∑_{𝑖∈days} (𝜃1 ⋅ 𝑥^(𝑖) + 𝜃2 − 𝑦^(𝑖))²

[Plot: 𝑥 = High_Temperature vs. 𝑦 = Peak_Demand, with a candidate line of slope 𝜃1 and intercept 𝜃2]
Gradient descent
How do we find the parameters 𝜃1, 𝜃2 that minimize:

𝐸(𝜃) = 𝐸(𝜃1, 𝜃2) = ∑_{𝑖∈days} (𝜃1 ⋅ 𝑥^(𝑖) + 𝜃2 − 𝑦^(𝑖))²

[Plot: surface and contour views of 𝐸(𝜃) as a function of (𝜃1, 𝜃2)]
Gradient descent
To find a good value of 𝜃, we can repeatedly take steps in the direction of the negative derivatives for each value

Repeat:
    𝜃1 ≔ 𝜃1 − 𝛼 ⋅ ∂𝐸(𝜃1, 𝜃2)/∂𝜃1
    𝜃2 ≔ 𝜃2 − 𝛼 ⋅ ∂𝐸(𝜃1, 𝜃2)/∂𝜃2

where 𝛼 is some small positive number called the step size

This is the gradient descent algorithm, the workhorse of modern machine learning
Computing gradients (partial derivatives)
How do we find the parameters 𝜃1, 𝜃2 that minimize the function

𝐸(𝜃) = 𝐸(𝜃1, 𝜃2) = ∑_{𝑖∈days} (𝜃1 ⋅ High_Temperature^(𝑖) + 𝜃2 − Peak_Demand^(𝑖))²
                 ≡ ∑_{𝑖∈days} (𝜃1 ⋅ 𝑥^(𝑖) + 𝜃2 − 𝑦^(𝑖))²

General idea: suppose we want to minimize some function 𝑓(𝜃)

The derivative is the slope of the function, so the negative derivative points “downhill”
Calculus worksheet
A. 𝑓(𝑥) = 𝑥² + 5𝑥³                d𝑓/d𝑥 = ?
B. 𝑓(𝑥) = (3 − 5𝑥)²               d𝑓/d𝑥 = ?
C. 𝑓(𝑥, 𝑧) = 2𝑥 + 3𝑧 + 5𝑥²𝑧       ∂𝑓/∂𝑧 = ?
D. 𝑓(𝑥, 𝑧) = 2𝑥 + 3𝑧 + 5𝑥²𝑧       ∂𝑓/∂𝑥 = ?
Computing the derivatives
Assume we just have 𝑚 = 2 points, (𝑥^(1), 𝑦^(1)) and (𝑥^(2), 𝑦^(2))

∂𝐸(𝜃)/∂𝜃1 = ∂/∂𝜃1 ∑_{𝑖=1}^{𝑚} (𝜃1 ⋅ 𝑥^(𝑖) + 𝜃2 − 𝑦^(𝑖))²

[Plot: the two data points (𝑥^(1), 𝑦^(1)) and (𝑥^(2), 𝑦^(2)) in the (𝑥, 𝑦) plane]
Computing the derivatives
What are the derivatives of the error function with respect to each parameter 𝜃1 and 𝜃2?

∂𝐸(𝜃)/∂𝜃1 = ∂/∂𝜃1 ∑_{𝑖∈days} (𝜃1 ⋅ 𝑥^(𝑖) + 𝜃2 − 𝑦^(𝑖))²
          = ∑_{𝑖∈days} ∂/∂𝜃1 (𝜃1 ⋅ 𝑥^(𝑖) + 𝜃2 − 𝑦^(𝑖))²
          = ∑_{𝑖∈days} 2 (𝜃1 ⋅ 𝑥^(𝑖) + 𝜃2 − 𝑦^(𝑖)) ⋅ ∂/∂𝜃1 (𝜃1 ⋅ 𝑥^(𝑖))
          = ∑_{𝑖∈days} 2 (𝜃1 ⋅ 𝑥^(𝑖) + 𝜃2 − 𝑦^(𝑖)) ⋅ 𝑥^(𝑖)

∂𝐸(𝜃)/∂𝜃2 = ∑_{𝑖∈days} 2 (𝜃1 ⋅ 𝑥^(𝑖) + 𝜃2 − 𝑦^(𝑖))
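These two sums translate directly into code; a minimal sketch, using the same hypothetical lists of inputs and outputs as before:

# sketch: partial derivatives of E(theta) derived above
def dE_dtheta1(theta1, theta2, xs, ys):
    return sum(2 * (theta1 * x + theta2 - y) * x for x, y in zip(xs, ys))

def dE_dtheta2(theta1, theta2, xs, ys):
    return sum(2 * (theta1 * x + theta2 - y) for x, y in zip(xs, ys))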
Finding the best 𝜃
To find a good value of 𝜃, we can repeatedly take steps in the direction of the negative derivatives for each value

Repeat:
    𝜃1 ≔ 𝜃1 − 𝛼 ∑_{𝑖∈days} 2 (𝜃1 ⋅ 𝑥^(𝑖) + 𝜃2 − 𝑦^(𝑖)) ⋅ 𝑥^(𝑖)
    𝜃2 ≔ 𝜃2 − 𝛼 ∑_{𝑖∈days} 2 (𝜃1 ⋅ 𝑥^(𝑖) + 𝜃2 − 𝑦^(𝑖))

where 𝛼 is some small positive number called the step size

This is the gradient descent algorithm, the workhorse of modern machine learning
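Putting the updates together gives a tiny gradient-descent loop; this is a sketch only (hypothetical variable names, a fixed iteration count instead of a convergence check, and 𝛼 matching the slides’ 0.001):

# sketch: gradient descent on the two-parameter least-squares objective
def gradient_descent(xs, ys, alpha=0.001, iters=100):
    theta1, theta2 = 0.0, 0.0
    for _ in range(iters):
        g1 = sum(2 * (theta1 * x + theta2 - y) * x for x, y in zip(xs, ys))
        g2 = sum(2 * (theta1 * x + theta2 - y) for x, y in zip(xs, ys))
        theta1 -= alpha * g1
        theta2 -= alpha * g2
    return theta1, theta2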
Gradient descent
Normalize the input by subtracting the mean and dividing by the standard deviation
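With numpy the normalization is one line each; a sketch, where temps is assumed to be an array of the raw high temperatures:

import numpy as np

# sketch: z-score normalization of the input before running gradient descent
mu, sigma = temps.mean(), temps.std()
temps_norm = (temps - mu) / sigma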
Gradient descent – Iterations 1 through 10
[Contour plots of 𝐸(𝜃) over (𝜃1, 𝜃2) showing the path taken by gradient descent with step size 𝛼 = 0.001; the objective decreases along the path, 𝐸(𝜃) ≈ 1427.53 → 292.18 → 64.31 → 18.58 → 9.40 → … → 7.09 by iteration 10]
Fitted line in “original” coordinates

Important note: requires that we also rescale 𝜃 when un-normalizing
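Concretely, if the line was fit on the normalized input (𝑥 − 𝜇)/𝜎, then in the original units the slope is 𝜃1/𝜎 and the intercept is 𝜃2 − 𝜃1⋅𝜇/𝜎; a sketch continuing the hypothetical names above:

# sketch: convert parameters fit on normalized inputs back to original units
theta1_orig = theta1 / sigma
theta2_orig = theta2 - theta1 * mu / sigma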


Extensions
What if we want to add additional features, e.g., day of week, instead of just temperature?

What if we want to use a different loss function instead of squared error (e.g., absolute error)?

What if we want to use a non-linear prediction instead of a linear one?

We can easily reason about all these things by adopting some additional notation…
Outline
Least squares regression: a simple example and gradient descent

Machine learning notation

Linear regression revisited

Matrix/vector notation and analytic solutions

Implementing linear regression


Machine learning
Gradient descent to find the parameters to minimize MSE for a linear model is an
example of a machine learning algorithm

Basic idea: in many domains, it is difficult to hand-build a predictive model, but easy to collect lots of data; machine learning provides a way to automatically infer the predictive model from data
Machine learning
The basic process (supervised learning):

Training: training data {(𝑥^(1), 𝑦^(1)), (𝑥^(2), 𝑦^(2)), (𝑥^(3), 𝑦^(3)), …} → machine learning training algorithm (including any parameter settings) → hypothesis function

Prediction: new input 𝑥^(new) → hypothesis function → predicted output ℎ𝜃(𝑥^(new))
Terminology
Input features: 𝑥^(𝑖) ∈ ℝⁿ, 𝑖 = 1, …, 𝑚
    e.g. 𝑥^(𝑖) = [High_Temperature^(𝑖), Is_Weekday^(𝑖), 1]ᵀ

Outputs: 𝑦^(𝑖) ∈ 𝒴, 𝑖 = 1, …, 𝑚
    e.g. 𝑦^(𝑖) ∈ ℝ = Peak_Demand^(𝑖)

Model parameters: 𝜃 ∈ ℝⁿ

Hypothesis function: ℎ𝜃 : ℝⁿ → 𝒴, predicts output given input
    e.g. ℎ𝜃(𝑥) = ∑_{𝑗=1}^{𝑛} 𝜃𝑗 ⋅ 𝑥𝑗
Terminology
Loss function: ℓ : 𝒴 × 𝒴 → ℝ+, measures the difference between a prediction and an actual output

The canonical machine learning optimization problem:

    minimize_𝜃 ∑_{𝑖=1}^{𝑚} ℓ(ℎ𝜃(𝑥^(𝑖)), 𝑦^(𝑖))

Virtually every machine learning algorithm has this form; just specify
• What is the hypothesis function?
• What is the loss function?
• How do we solve the optimization problem?
Example machine learning algorithms
Note: we (machine learning researchers) have not been consistent in naming conventions; many machine learning algorithms actually only specify some of these three elements
• Least squares: {linear hypothesis, squared loss, (usually) analytical solution}
• Linear regression: {linear hypothesis, *, *}
• Support vector machine: {linear or kernel hypothesis, hinge loss, *}
• Neural network: {composed non-linear function, *, (usually) gradient descent}
• Decision tree: {hierarchical axis-aligned halfplanes, *, greedy optimization}
• Naïve Bayes: {linear hypothesis, joint probability under certain independence assumptions, analytical solution}
Outline
Least squares regression: a simple example and gradient descent

Machine learning notation

Linear regression revisited

Matrix/vector notation and analytic solutions

Implementing linear regression


Least squares revisited
Using our new terminology, plus matrix notation, let’s revisit how to solve linear regression with a squared error loss

Setup:
• Linear hypothesis function: ℎ𝜃(𝑥) = ∑_{𝑗=1}^{𝑛} 𝜃𝑗 ⋅ 𝑥𝑗
• Squared error loss: ℓ(ℎ𝜃(𝑥), 𝑦) = (ℎ𝜃(𝑥) − 𝑦)²
• Resulting machine learning optimization problem:
    minimize_𝜃 𝐸(𝜃) = ∑_{𝑖=1}^{𝑚} (ℎ𝜃(𝑥^(𝑖)) − 𝑦^(𝑖))²
Derivative of the least squares objective
Compute the partial derivative with respect to an arbitrary model parameter 𝜃𝑗:

    ∂𝐸(𝜃)/∂𝜃𝑗 = ∑_{𝑖=1}^{𝑚} 2 (ℎ𝜃(𝑥^(𝑖)) − 𝑦^(𝑖)) ⋅ 𝑥𝑗^(𝑖)
Gradient descent algorithm
1. Initialize 𝜃𝑘 ≔ 0, 𝑘 = 1, …, 𝑛

2. Repeat:
   • For 𝑘 = 1, …, 𝑛:
         𝜃𝑘 ≔ 𝜃𝑘 − 𝛼 ∑_{𝑖=1}^{𝑚} 2 (ℎ𝜃(𝑥^(𝑖)) − 𝑦^(𝑖)) ⋅ 𝑥𝑘^(𝑖)

Note: do not actually implement it like this; you’ll want to use the matrix/vector notation we will cover soon
Outline
Least squares regression: a simple example and gradient descent

Machine learning notation

Linear regression revisited

Matrix/vector notation and analytic solutions

Implementing linear regression


The gradient
It is typically more convenient to work with a vector of all partial derivatives, called the gradient

For a function 𝑓 : ℝⁿ → ℝ, the gradient is a vector

    𝛻𝜃𝑓(𝜃) = [ ∂𝑓(𝜃)/∂𝜃1 ; ⋮ ; ∂𝑓(𝜃)/∂𝜃𝑛 ] ∈ ℝⁿ
Gradient in vector notation
We can actually simplify the gradient computation (both notationally and computationally) substantially using matrix/vector notation

Putting things in this form also makes it clearer how to analytically find the optimal solution for least squares
Matrix notation, one level deeper
Let’s define the matrices

    𝑋 = [ 𝑥^(1)ᵀ ; 𝑥^(2)ᵀ ; ⋮ ; 𝑥^(𝑚)ᵀ ] ∈ ℝ^(𝑚×𝑛),    𝑦 = [ 𝑦^(1) ; 𝑦^(2) ; ⋮ ; 𝑦^(𝑚) ] ∈ ℝ^𝑚

(each row of 𝑋 is one example’s feature vector, and 𝑦 stacks the outputs)

Euclidean (L2) norm: ‖𝑧‖₂² = ∑_𝑖 𝑧𝑖², so the least squares objective can be written compactly as 𝐸(𝜃) = ‖𝑋𝜃 − 𝑦‖₂²

Solving least squares
The gradient also gives a condition for optimality:
• The gradient must equal zero

Solving for 𝛻𝜃𝐸(𝜃) = 0:

    𝛻𝜃𝐸(𝜃) = 2𝑋ᵀ(𝑋𝜃 − 𝑦) = 0   ⟹   𝜃⋆ = (𝑋ᵀ𝑋)⁻¹𝑋ᵀ𝑦

These are known as the normal equations, an extremely convenient closed-form solution for least squares
Example: electricity demand
Returning to our electricity demand example:

With temperature as the only feature:
    𝑥^(𝑖) = [ High_Temperature^(𝑖), 1 ]ᵀ,    𝜃⋆ = (𝑋ᵀ𝑋)⁻¹𝑋ᵀ𝑦 = [ 0.046, −1.574 ]ᵀ

Adding a weekday indicator:
    𝑥^(𝑖) = [ High_Temperature^(𝑖), Is_Weekday^(𝑖), 1 ]ᵀ,    𝜃⋆ = (𝑋ᵀ𝑋)⁻¹𝑋ᵀ𝑦 = [ 0.047, 0.225, −1.803 ]ᵀ
Outline
Least squares regression: a simple example and gradient descent

Machine learning notation

Linear regression revisited

Matrix/vector notation and analytic solutions

Implementing linear regression


Manual implementation of linear regression
Create data matrices:
# initialize X matrix and y vector from the df_summer DataFrame
import numpy as np
X = np.array([df_summer["Temp"], df_summer["IsWeekday"], np.ones(len(df_summer))]).T
y = df_summer["Load"].values

Compute solution:
# solve least squares
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)
# [ 0.04747948 0.22462824 -1.80260016]

Make predictions:
# predict on new data
Xnew = np.array([[77, 1, 1], [80, 0, 1]])
ypred = Xnew @ theta
print(ypred)
# [ 2.07794778 1.99575797]
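As an alternative to the closed-form solve, the same parameters can be found by gradient descent in matrix/vector form, using the gradient 2𝑋ᵀ(𝑋𝜃 − 𝑦); this is a sketch only, with an illustrative step size and iteration count that assume reasonably scaled columns of X:

# sketch: vectorized gradient descent as an alternative to the analytic solution
def gradient_descent_ls(X, y, alpha=1e-5, iters=100000):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = 2 * X.T @ (X @ theta - y)   # gradient of ||X theta - y||^2
        theta -= alpha * grad
    return theta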
Scikit-learn
By far the most popular machine learning library in Python is the scikit-learn library (http://scikit-learn.org/)

Reasonable (usually) implementations of many different learning algorithms, usually fast enough for small/medium problems

Important: you need to understand the very basics of how these algorithms work in order to use them effectively

Sadly, a lot of data science in practice seems to be driven by the default parameters for scikit-learn classifiers…
Linear regression in scikit-learn
Fit a model and predict on new data:
from sklearn.linear_model import LinearRegression

# don't include constant term in X
X = np.array([df_summer["Temp"], df_summer["IsWeekday"]]).T
# note: recent scikit-learn versions removed the normalize argument; omit it there
model = LinearRegression(fit_intercept=True, normalize=False)
model.fit(X, y)

# predict on new data
Xnew = np.array([[77, 1], [80, 0]])
model.predict(Xnew)
# [ 2.07794778 1.99575797]

Inspect internal model coefficients:
print(model.coef_, model.intercept_)
# [ 0.04747948 0.22462824] -1.80260016
Scikit-learn-like model, manually
We can easily implement a class that provides a scikit-learn-like interface:
class MyLinearRegression:
    def __init__(self, fit_intercept=True):
        self.fit_intercept = fit_intercept

    def fit(self, X, y):
        # append a column of ones so the last coefficient acts as the intercept
        if self.fit_intercept:
            X = np.hstack([X, np.ones((X.shape[0], 1))])

        # solve the normal equations
        self.coef_ = np.linalg.solve(X.T @ X, X.T @ y)

        if self.fit_intercept:
            self.intercept_ = self.coef_[-1]
            self.coef_ = self.coef_[:-1]

    def predict(self, X):
        pred = X @ self.coef_
        if self.fit_intercept:
            pred += self.intercept_
        return pred
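Used the same way as the scikit-learn version above (continuing with the X, y, and Xnew arrays defined earlier):

# usage sketch: the hand-rolled class mirrors the scikit-learn calls
my_model = MyLinearRegression()
my_model.fit(X, y)
print(my_model.coef_, my_model.intercept_)
print(my_model.predict(Xnew))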
