
Lecture 8: Applications in ML

Nicholas Ruozzi
University of Texas at Dallas
Function Fitting: ML Applications
• A wide variety of machine learning problems can be cast as function fitting problems
• Given data observations with corresponding "labels", find the function that is the best fit for the data
• We saw an example of this on homework 1, the least squares regression problem

  $\min_{a,b} \sum_{i=1}^{M} \big(y_i - (a x_i + b)\big)^2$

• Here the function being fit, $f(x) = a x + b$, is parameterized by two numbers $(a, b)$
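As a concrete illustration (the data and variable names below are my own, not from the slides), the two-parameter least squares fit has a closed-form solution via the normal equations; a minimal NumPy sketch:

```python
import numpy as np

# Synthetic 1-D data: y is roughly a line plus noise (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=100)
y = 1.5 * x - 0.7 + 0.1 * rng.standard_normal(100)

# Least squares fit of f(x) = a*x + b: stack [x, 1] as a design matrix
# and use the built-in least squares solver (normal equations internally).
X = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"a = {a:.3f}, b = {b:.3f}")
```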
L1 Regression
• Suppose that instead of a squared error, we wanted to minimize the absolute error

  $\min_{a,b} \sum_{i=1}^{M} \big|y_i - (a x_i + b)\big|$

• What optimization procedure should we use?
L1 Regression
• Suppose that instead of a squared error, we wanted to minimize the absolute error
• We can reformulate this problem to make it differentiable by introducing one slack variable $\epsilon_i$ per observation

  $\min_{a,b,\epsilon} \sum_{i=1}^{M} \epsilon_i$
  subject to $-\epsilon_i \le y_i - (a x_i + b) \le \epsilon_i$ for all $i$

  (apply existing LP solvers!)
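This reformulation can be handed directly to an off-the-shelf LP solver. A minimal sketch using scipy.optimize.linprog; the variable ordering [a, b, ε₁, ..., ε_M] is my own choice, not from the slides:

```python
import numpy as np
from scipy.optimize import linprog

def l1_line_fit(x, y):
    """Fit y ~ a*x + b by minimizing sum_i |y_i - (a*x_i + b)| as an LP."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    M = len(x)
    # Decision variables z = [a, b, eps_1, ..., eps_M]; minimize sum of eps.
    c = np.concatenate([[0.0, 0.0], np.ones(M)])
    #   y_i - (a x_i + b) <= eps_i   ->  -x_i*a - b - eps_i <= -y_i
    # -(y_i - (a x_i + b)) <= eps_i  ->   x_i*a + b - eps_i <=  y_i
    A_ub = np.zeros((2 * M, 2 + M))
    b_ub = np.zeros(2 * M)
    A_ub[:M, 0], A_ub[:M, 1] = -x, -1.0
    A_ub[M:, 0], A_ub[M:, 1] = x, 1.0
    A_ub[:M, 2:] = -np.eye(M)
    A_ub[M:, 2:] = -np.eye(M)
    b_ub[:M], b_ub[M:] = -y, y
    bounds = [(None, None), (None, None)] + [(0, None)] * M  # a, b free; eps >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[0], res.x[1]
```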
Sparse Least Squares Regression
• Sometimes we might prefer a solution vector $w$ that has a small number of nonzero entries – such vectors are called sparse
• This kind of preference (penalizing the number of nonzero entries directly) does not yield a convex optimization problem
• Instead, it can be shown that, under certain assumptions, the $\ell_1$-norm $\|w\|_1$ is a good surrogate
• Can incorporate this as either a constraint or a penalty
Sparse Least Squares Regression

  $\min_{w} \sum_{i=1}^{M} \big(y_i - w^\top x_i\big)^2 + \lambda \|w\|_1$

• Called the LASSO (least absolute shrinkage and selection operator) optimization problem
• Here, $\lambda \ge 0$ is a constant that controls the trade-off between a solution that achieves a low squared error and one that minimizes the $\ell_1$-norm
• Which optimization procedure should we use to solve this problem?
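One common answer is proximal gradient descent (ISTA), which handles the non-differentiable ℓ1 term via soft-thresholding. A minimal sketch, assuming the observations are stacked as rows of a matrix X and a fixed step size is used:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=1000):
    """Minimize ||X w - y||_2^2 + lam * ||w||_1 by proximal gradient descent."""
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)  # 1/L for the smooth part
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = 2 * X.T @ (X @ w - y)              # gradient of the squared error
        w = soft_threshold(w - step * grad, step * lam)
    return w
```

Subgradient descent also applies, but the proximal step typically converges faster and produces exactly-zero entries.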
Sparse Least Squares Regression

  $\min_{w} \sum_{i=1}^{M} \big(y_i - w^\top x_i\big)^2$
  subject to $\|w\|_1 \le c$

• We could also add this penalty as a hard constraint, where $c$ controls how large of an $\ell_1$-norm is allowed
• Which optimization procedure should we apply here?
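Projected gradient descent (projecting onto the ℓ1 ball) is one answer; another is to hand the problem to a disciplined convex programming modeler. A minimal sketch assuming the cvxpy package is available (the function name and the constant c are illustrative):

```python
import cvxpy as cp
import numpy as np

def constrained_lasso(X, y, c):
    """Minimize ||X w - y||_2^2 subject to ||w||_1 <= c."""
    w = cp.Variable(X.shape[1])
    objective = cp.Minimize(cp.sum_squares(X @ w - y))
    constraints = [cp.norm(w, 1) <= c]
    cp.Problem(objective, constraints).solve()
    return w.value
```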
Maximum Likelihood Estimation
• When fitting a statistical model to data, the principle of maximum likelihood estimation posits that the best fit model is the one that generates the observed data with the highest probability
• Example: suppose that you roll a biased 6-sided die 100 times and observe a sequence of outcomes
• A biased die is described by 6 numbers $p_1, \dots, p_6 \ge 0$ such that $p_1 + \dots + p_6 = 1$
Maximum Likelihood Estimation
• Example: suppose that you roll a biased 6-sided die 100 times and observe a sequence of outcomes
• A biased die is described by 6 numbers $p_1, \dots, p_6 \ge 0$ such that $p_1 + \dots + p_6 = 1$
• Let $x_k$ be equal to the number of data observations that were equal to $k$
• The probability of seeing these observations is then $p_1^{x_1} p_2^{x_2} \cdots p_6^{x_6}$
Maximum Likelihood Estimation

  $\max_{p}\; p_1^{x_1} p_2^{x_2} \cdots p_6^{x_6}$
  subject to $p_1 + \dots + p_6 = 1,\; p_k \ge 0$

• Equivalently, maximize the log-likelihood $\sum_{k=1}^{6} x_k \log p_k$ subject to the same constraints
• Has a closed form solution, but...
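The closed-form solution sets each probability to its empirical frequency, $p_k = x_k / N$ (this is what maximizing $\sum_k x_k \log p_k$ subject to $\sum_k p_k = 1$ gives via a Lagrange multiplier). A minimal sketch with an illustrative roll sequence of my own:

```python
import numpy as np

# Illustrative roll sequence; the lecture's example uses 100 rolls.
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100)

# x_k = number of observations equal to k, for k = 1..6.
counts = np.bincount(rolls, minlength=7)[1:]

# Maximum likelihood estimate: p_k = x_k / N.
p_mle = counts / counts.sum()
print(p_mle, p_mle.sum())
```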
Stochastic Gradient Descent
• These types of problems often can be written as minimizing a sum of a, perhaps large, number of terms:

  $\min_{w} f(w) = \frac{1}{M} \sum_{i=1}^{M} f_i(w)$

• Approximate the gradient of the sum by sampling a few indices (as few as one) uniformly at random and averaging:

  $w^{(t+1)} = w^{(t)} - \gamma_t \nabla f_{i_t}\big(w^{(t)}\big)$

• Each $i_t$ is sampled uniformly at random from $\{1, \dots, M\}$, so the sampled gradient equals the true gradient in expectation (expectation taken over the random samples)
• Stochastic gradient descent converges to the global optimum under certain assumptions on the step size $\gamma_t$
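A minimal sketch of the generic update loop, assuming each component gradient is available as a function grad_fi(w, i) (the names and the decaying-step-size example are illustrative, not from the slides):

```python
import numpy as np

def sgd(grad_fi, w0, M, step_sizes, n_iters=10000, rng=None):
    """Stochastic gradient descent on f(w) = (1/M) * sum_i f_i(w)."""
    rng = rng or np.random.default_rng(0)
    w = np.array(w0, dtype=float)
    for t in range(n_iters):
        i = rng.integers(M)                      # index sampled uniformly at random (0-based)
        w = w - step_sizes(t) * grad_fi(w, i)    # unbiased estimate of the full gradient
    return w

# Example step-size schedule: gamma_t = 1 / (1 + t)
# w_hat = sgd(grad_fi, w0, M, step_sizes=lambda t: 1.0 / (1 + t))
```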
SGD for Least Squares
• Select an index $i_t$ uniformly at random from $\{1, \dots, M\}$
• Update $w^{(t+1)} = w^{(t)} - \gamma_t \cdot 2\big(w^{(t)\top} x_{i_t} - y_{i_t}\big)\, x_{i_t}$
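For $f_i(w) = (w^\top x_i - y_i)^2$, the sampled gradient is $\nabla f_i(w) = 2 (w^\top x_i - y_i) x_i$. A short snippet reusing the sgd helper sketched above, assuming X and y are NumPy arrays with rows as observations:

```python
def least_squares_grad(X, y):
    """Return a grad_fi(w, i) closure for f_i(w) = (w^T x_i - y_i)^2."""
    def grad_fi(w, i):
        return 2.0 * (X[i] @ w - y[i]) * X[i]
    return grad_fi

# Example usage with a decaying step size gamma_t = 0.1 / (1 + t):
# w_hat = sgd(least_squares_grad(X, y), np.zeros(X.shape[1]), M=len(y),
#             step_sizes=lambda t: 0.1 / (1 + t))
```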
Stochastic Gradient Descent
• Often, SGD is simply implemented as a round robin procedure, i.e., instead of picking $i_t$ randomly, you just iterate through all of the indices $1, \dots, M$ in a cyclic fashion
• One pass from $i = 1$ to $M$ is equivalent in terms of computation time to computing the entire gradient once
• What is the terminating condition for stochastic gradient descent?
Logistic Regression
Given $x^{(1)}, \dots, x^{(M)} \in \mathbb{R}^n$ and $y^{(1)}, \dots, y^{(M)} \in \{-1, +1\}$, minimize the logistic loss

  $\min_{w} \sum_{i=1}^{M} \log\Big(1 + \exp\big(-y^{(i)}\, w^\top x^{(i)}\big)\Big)$

What optimization strategies can we apply?
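Gradient descent, SGD, and Newton's method all apply; the common ingredients are the loss value and its gradient. A minimal sketch for ±1 labels (the function name and the stacked-matrix convention are my own; the objective follows the standard formulation above):

```python
import numpy as np

def logistic_loss_and_grad(w, X, y):
    """f(w) = sum_i log(1 + exp(-y_i w^T x_i)) and its gradient."""
    margins = y * (X @ w)                         # y_i * w^T x_i
    loss = np.sum(np.logaddexp(0.0, -margins))    # stable log(1 + exp(-m))
    sigma = np.exp(-np.logaddexp(0.0, margins))   # sigma(-m_i), computed stably
    grad = -X.T @ (y * sigma)                     # sum_i -y_i * sigma(-m_i) * x_i
    return loss, grad
```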
Logistic Regression
Given $x^{(1)}, \dots, x^{(M)} \in \mathbb{R}^n$ and $y^{(1)}, \dots, y^{(M)} \in \{-1, +1\}$, minimize the logistic loss $f$ with Newton's method:

  $w^{(t+1)} = w^{(t)} - \big[\nabla^2 f\big(w^{(t)}\big)\big]^{-1} \nabla f\big(w^{(t)}\big)$

where $\nabla f(w) = -\sum_i y^{(i)} x^{(i)}\, \sigma\big(-y^{(i)} w^\top x^{(i)}\big)$ and $\nabla^2 f(w) = \sum_i \sigma\big(y^{(i)} w^\top x^{(i)}\big)\, \sigma\big(-y^{(i)} w^\top x^{(i)}\big)\, x^{(i)} x^{(i)\top}$, with $\sigma(z) = 1/(1 + e^{-z})$
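A minimal sketch of the undamped Newton update for the logistic loss (in practice a line search or a small ridge term would usually be added, e.g. when the data are separable and the Hessian becomes ill-conditioned):

```python
import numpy as np

def logistic_newton(X, y, n_iters=20):
    """Newton's method for f(w) = sum_i log(1 + exp(-y_i w^T x_i))."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        margins = y * (X @ w)
        p = np.exp(-np.logaddexp(0.0, margins))   # sigma(-y_i w^T x_i)
        grad = -X.T @ (y * p)
        # Hessian: sum_i sigma(m_i) * sigma(-m_i) * x_i x_i^T
        h = p * (1.0 - p)
        hess = X.T @ (h[:, None] * X)
        w = w - np.linalg.solve(hess, grad)       # Newton step
    return w
```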
Solving Linear Systems
• Solving linear systems, e.g., finding a solution to $Ax = b$ or determining that there is no such $x$, can be cast as a convex optimization problem
• If $A$ is positive semidefinite, then we can write this as an unconstrained minimization problem

  $\min_{x}\; \tfrac{1}{2} x^\top A x - b^\top x$

  whose gradient, $Ax - b$, vanishes exactly at solutions of the system

What optimization strategies can we apply?
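Since the gradient of this objective is the residual $Ax - b$, any descent method drives the residual to zero; conjugate gradient would be the usual choice, but a minimal gradient-descent sketch already illustrates the idea:

```python
import numpy as np

def solve_psd_gd(A, b, step=None, n_iters=5000):
    """Gradient descent on f(x) = 0.5 * x^T A x - b^T x (A symmetric PSD)."""
    if step is None:
        step = 1.0 / np.linalg.norm(A, 2)   # 1/L with L = lambda_max(A)
    x = np.zeros_like(b, dtype=float)
    for _ in range(n_iters):
        grad = A @ x - b                    # gradient of f, i.e. the residual
        x = x - step * grad
    return x
```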
Solving Linear Systems
• Solving linear systems, e.g., finding a solution to $Ax = b$ or determining that there is no such $x$, can be cast as a convex optimization problem
• For general $A$, we can write this as a constrained minimization problem, subject to $Ax = b$
• What optimization strategies can we apply?
Sparse Linear Systems
• Solving linear systems, e.g., finding a solution to $Ax = b$ or determining that there is no such $x$, can be cast as a convex optimization problem
• If we are interested in sparse solutions...

  $\min_{x} \|x\|_1$
  subject to $Ax = b$

• Called the basis pursuit problem
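Basis pursuit can be rewritten as an LP by splitting $x = u - v$ with $u, v \ge 0$, so that $\|x\|_1 = \sum_i (u_i + v_i)$; a minimal sketch with scipy.optimize.linprog:

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, b):
    """Minimize ||x||_1 subject to A x = b, via the split x = u - v with u, v >= 0."""
    m, n = A.shape
    c = np.ones(2 * n)                           # objective: sum of u and v entries
    A_eq = np.hstack([A, -A])                    # encodes A(u - v) = b
    res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * n))
    u, v = res.x[:n], res.x[n:]
    return u - v
```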
