Lecture 8: Applications in ML

Nicholas Ruozzi
University of Texas at Dallas
Function Fitting: ML Applications
• A wide variety of machine learning problems can be cast as function fitting problems

• Given data observations with corresponding “labels”, find the function that is the best fit for the data

• We saw an example of this on homework 1, the least squares regression problem

  $\min_{a,b} \sum_{i=1}^{M} (a x_i + b - y_i)^2$

• Here the function being fit is parameterized by two numbers $a$ and $b$, e.g., $f(x) = a x + b$
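As a concrete illustration (not part of the original slides), here is a minimal numpy sketch of the least squares fit $f(x) = ax + b$; the data are made up for the example.

```python
import numpy as np

# Toy data (made up for illustration): noisy observations of y = 2x + 1
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(50)

# Least squares: minimize sum_i (a*x_i + b - y_i)^2.
# Stack a column of ones so the parameter vector is [a, b].
X = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"fitted a={a:.3f}, b={b:.3f}")
```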
L1 Regression
• Suppose that, instead of the squared error, we wanted to minimize the absolute error

  $\min_{a,b} \sum_{i=1}^{M} |a x_i + b - y_i|$

• What optimization procedure should we use?
L1 Regression
• Suppose that, instead of the squared error, we wanted to minimize the absolute error

• We can reformulate this problem to make it differentiable:

  $\min_{a,b,t} \sum_{i=1}^{M} t_i$
  subject to $-t_i \le a x_i + b - y_i \le t_i$ for all $i$

  (apply existing LP solvers!)
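A sketch of how the LP reformulation above could be handed to an off-the-shelf solver, assuming scipy.optimize.linprog; the data and the variable ordering $[a, b, t_1, \dots, t_M]$ are my own choices for illustration.

```python
import numpy as np
from scipy.optimize import linprog

# Toy data (made up for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)
y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(30)
M = len(x)

# Variables: z = [a, b, t_1, ..., t_M]; minimize sum_i t_i
c = np.concatenate([np.zeros(2), np.ones(M)])

# Constraints:  a*x_i + b - y_i <= t_i   and   -(a*x_i + b - y_i) <= t_i
A_ub = np.zeros((2 * M, 2 + M))
A_ub[:M, 0] = x;  A_ub[:M, 1] = 1.0;  A_ub[:M, 2:] = -np.eye(M)
A_ub[M:, 0] = -x; A_ub[M:, 1] = -1.0; A_ub[M:, 2:] = -np.eye(M)
b_ub = np.concatenate([y, -y])

# a and b are free; the t_i are nonnegative
bounds = [(None, None), (None, None)] + [(0, None)] * M
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
a, b = res.x[:2]
print(f"L1 fit: a={a:.3f}, b={b:.3f}")
```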
Sparse Least Squares Regression

• Sometimes we might prefer a solution vector that has a small number of nonzero entries – such vectors are called sparse

• This kind of preference (directly penalizing the number of nonzero entries) does not yield a convex optimization problem

• Instead, it can be shown that, under certain assumptions, the $\ell_1$-norm is a good surrogate

• Can incorporate this as either a constraint or a penalty
Sparse Least Squares Regression

  $\min_{w} \; \|Xw - y\|_2^2 + \lambda \|w\|_1$

• Called the LASSO (least absolute shrinkage and selection operator) optimization problem

• Here, $\lambda \ge 0$ is a constant that controls the trade-off between a solution that achieves a low squared error and one that minimizes the $\ell_1$-norm

• Which optimization procedure should we use to solve this problem?
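The slide leaves the choice of solver open; one common answer is proximal gradient descent (ISTA), since the $\ell_1$ term is non-differentiable but has a cheap proximal operator (soft-thresholding). A minimal numpy sketch under that assumption, with made-up data:

```python
import numpy as np

def lasso_ista(X, y, lam, iters=500):
    """Minimize ||Xw - y||_2^2 + lam * ||w||_1 by proximal gradient (ISTA)."""
    # Step size 1/L, where L is the Lipschitz constant of the smooth part's gradient
    L = 2 * np.linalg.norm(X, 2) ** 2
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = 2 * X.T @ (X @ w - y)                             # gradient of the squared error
        z = w - grad / L                                         # gradient step on the smooth part
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)    # soft-thresholding (prox of the l1 term)
    return w

# Toy data (made up): only the first two features matter
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.05 * rng.standard_normal(100)
print(np.round(lasso_ista(X, y, lam=5.0), 3))
```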
Sparse Least Squares Regression

  $\min_{w} \; \|Xw - y\|_2^2$
  subject to $\|w\|_1 \le c$

• We could also add this penalty as a hard constraint, where $c \ge 0$ controls how large of an $\ell_1$-norm is allowed

• Which optimization procedure should we apply here?
Maximum Likelihood Estimation
• When fitting a statistical model to data, the principle of maximum likelihood estimation posits that the best fit model is the one that generates the data with highest probability

• Example: suppose that you roll a biased 6-sided die 100 times and observe a sequence of outcomes, e.g., $x^{(1)}, \dots, x^{(100)} \in \{1, \dots, 6\}$

• A biased die is described by 6 numbers $\theta_1, \dots, \theta_6 \ge 0$ such that $\theta_1 + \dots + \theta_6 = 1$
Maximum Likelihood Estimation
• Example: suppose that you roll a biased 6-sided die 100 times and observe a sequence of outcomes, e.g., $x^{(1)}, \dots, x^{(100)}$

• A biased die is described by 6 numbers $\theta_1, \dots, \theta_6 \ge 0$ such that $\theta_1 + \dots + \theta_6 = 1$

• Let $n_k$ be equal to the number of data observations that were equal to $k$

• The probability of seeing these observations is then $\prod_{k=1}^{6} \theta_k^{n_k}$
Maximum Likelihood Estimation

  $\max_{\theta} \; \prod_{k=1}^{6} \theta_k^{n_k}$ (equivalently, maximize the log-likelihood $\sum_{k=1}^{6} n_k \log \theta_k$)
  subject to $\theta_1 + \dots + \theta_6 = 1$, $\theta_k \ge 0$

Has a closed form solution ($\theta_k = n_k / 100$), but...

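An illustrative sketch (not from the slides) of the closed-form MLE for the biased die, with simulated rolls:

```python
import numpy as np

# Simulate 100 rolls of a made-up biased die (for illustration only)
rng = np.random.default_rng(0)
true_theta = np.array([0.3, 0.1, 0.1, 0.1, 0.1, 0.3])
rolls = rng.choice(6, size=100, p=true_theta)   # outcomes coded 0..5

# n_k = number of observations equal to k; the MLE is theta_k = n_k / 100,
# which maximizes sum_k n_k * log(theta_k) subject to the thetas summing to 1
counts = np.bincount(rolls, minlength=6)
theta_hat = counts / counts.sum()
print(theta_hat)
```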
Stochastic Gradient Descent
• These types of problems often can be written as minimizing a sum of a, perhaps large, number of terms:

  $\min_{w} \; f(w) = \sum_{m=1}^{M} f_m(w)$

• Approximate the gradient of the sum by sampling a few indices (as few as one) uniformly at random and averaging:

  $\nabla f(w) \approx \frac{M}{K} \sum_{k=1}^{K} \nabla f_{m_k}(w)$

  Each $m_k$ is sampled uniformly at random from $\{1, \dots, M\}$, so the approximation equals the true gradient in expectation (taken over the random samples)

• Stochastic gradient descent converges to the global optimum under certain assumptions on the step size
SGD for Least Squares

  $\min_{w} \; \sum_{i=1}^{M} (w^\top x_i - y_i)^2$

• Select an index $i$ uniformly at random

• Update $w^{(t+1)} = w^{(t)} - \gamma_t \cdot 2 (w^{(t)\top} x_i - y_i) \, x_i$
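A minimal sketch of the update above with made-up data; the decaying step size is one common choice (the slides only say "certain assumptions on the step size"):

```python
import numpy as np

# Toy data (made up for illustration)
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.01 * rng.standard_normal(1000)

w = np.zeros(5)
for t in range(1, 20001):
    i = rng.integers(len(y))                          # pick one index uniformly at random
    gamma = 0.1 / np.sqrt(t)                          # decaying step size (one common choice)
    w = w - gamma * 2 * (w @ X[i] - y[i]) * X[i]      # gradient of the i-th term only
print(np.round(w, 3))
```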
Stochastic Gradient Descent

• Often, SGD is simply implemented as a round-robin procedure, i.e., instead of picking indices randomly, you just iterate through all of the indices $1, \dots, M$ in a cyclic fashion

• One pass from $1$ to $M$ is equivalent, in terms of computation time, to computing the entire gradient once

• What is the terminating condition for stochastic gradient descent?
Logistic Regression
Given $x_1, \dots, x_M \in \mathbb{R}^n$ and $y_1, \dots, y_M \in \{-1, +1\}$,

  $\min_{w} \; \sum_{m=1}^{M} \log\left(1 + \exp(-y_m \, w^\top x_m)\right)$

What optimization strategies can we apply?
Logistic Regression
Given $x_1, \dots, x_M \in \mathbb{R}^n$ and $y_1, \dots, y_M \in \{-1, +1\}$,

  $\min_{w} \; f(w) = \sum_{m=1}^{M} \log\left(1 + \exp(-y_m \, w^\top x_m)\right)$

Newton's Method:

  $w^{(t+1)} = w^{(t)} - \left[\nabla^2 f(w^{(t)})\right]^{-1} \nabla f(w^{(t)})$
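A minimal sketch of Newton's method on the logistic loss with labels in $\{-1, +1\}$ and made-up data; the small ridge added to the Hessian is purely for numerical stability and is my addition, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (made up): labels in {-1, +1}, noisy enough that the MLE is finite
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = np.where(X @ w_true + rng.standard_normal(200) > 0, 1.0, -1.0)

w = np.zeros(3)
for _ in range(10):                              # Newton iterations
    s = sigmoid(-y * (X @ w))                    # sigma(-y_m w^T x_m)
    grad = -X.T @ (y * s)                        # gradient of sum_m log(1 + exp(-y_m w^T x_m))
    H = X.T @ (X * (s * (1 - s))[:, None])       # Hessian: sum_m s_m (1 - s_m) x_m x_m^T
    H += 1e-6 * np.eye(3)                        # tiny ridge for numerical stability (my addition)
    w = w - np.linalg.solve(H, grad)             # w <- w - H^{-1} grad
print(np.round(w, 3))
```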
Solving Linear Systems

• Solving linear systems, e.g., find a solution $x$ to $Ax = b$ or determine that there is no such $x$, can be cast as a convex optimization problem

• If $A$ is positive semidefinite, then we can write this as an unconstrained minimization problem

  $\min_{x} \; \frac{1}{2} x^\top A x - b^\top x$

What optimization strategies can we apply?
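One possible answer to the question is plain gradient descent on the quadratic (conjugate gradient is the classic refinement); a minimal sketch with a made-up positive definite matrix:

```python
import numpy as np

# Made-up system: A = B^T B + I is positive definite, so the quadratic is strictly convex
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B.T @ B + np.eye(5)
b = rng.standard_normal(5)

# Gradient descent on f(x) = 0.5 x^T A x - b^T x; the minimizer satisfies Ax = b
x = np.zeros(5)
step = 1.0 / np.linalg.norm(A, 2)     # step size based on the largest eigenvalue of A
for _ in range(2000):
    x = x - step * (A @ x - b)        # gradient of f is Ax - b
print(np.allclose(A @ x, b, atol=1e-6))
```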
Solving Linear Systems

• Solving linear systems, e.g., find a solution $x$ to $Ax = b$ or determine that there is no such $x$, can be cast as a convex optimization problem

• For general $A$, we can write this as a constrained minimization problem, e.g., the minimum-norm formulation

  $\min_{x} \; \|x\|_2^2$
  subject to $Ax = b$

What optimization strategies can we apply?
Sparse Linear Systems

• Solving linear systems, e.g., find a solution $x$ to $Ax = b$ or determine that there is no such $x$, can be cast as a convex optimization problem

• If we are interested in sparse solutions...

  $\min_{x} \; \|x\|_1$
  subject to $Ax = b$

Called the basis pursuit problem
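After splitting $x$ into its positive and negative parts, basis pursuit is itself an LP; a minimal sketch assuming scipy.optimize.linprog and a made-up underdetermined system:

```python
import numpy as np
from scipy.optimize import linprog

# Made-up underdetermined system with a sparse solution
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
x_true = np.zeros(50)
x_true[[3, 17, 31]] = [1.0, -2.0, 0.5]
b = A @ x_true

# Basis pursuit: min ||x||_1 subject to Ax = b.
# Split x = u - v with u, v >= 0, so ||x||_1 = sum(u) + sum(v) at the optimum.
n = A.shape[1]
c = np.ones(2 * n)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * n))
x_hat = res.x[:n] - res.x[n:]
print(np.flatnonzero(np.abs(x_hat) > 1e-6))   # indices of the recovered nonzeros
```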
