
CS 182/282A: Designing/Visualizing and Understanding Deep Neural Networks

Fall 2022
Lecture 2: August 30 (Tuesday)
Lecturer: Prof. Anant Sahai

Scribes: Connie Huang, Jaewoong Lee

Today
1. Recap. Basic Standard ML Doctrine

2. Empirical Risk Minimization (ERM)-Optimization Perspective (e.g. Generalization)

3. Hyperparameters vs. Parameters

4. Gradient Descent & SGD

5. Intro. to Neural Nets via ReLU Nets

1 Recap. Basic Standard ML Doctrine


1.1 Typical Supervised ML Setup
• Training Data: (xi, yi), where xi is the input (or covariates), yi is the label, and i = 1, 2, ..., n

• Model: fθ (−)

• Loss Function: l(y, ŷ)

• Optimizer

The goal of a supervised ML setup is to make an inference ŷ on new data X as ŷ = fθ̂(X),
where θ̂ denotes the learned parameters.

2 Empirical Risk Minimization (ERM)-Optimization Perspective


How do we learn parameters θ? The basic approach is to find the optimal θ for our optimization
problem,
θ̂ = argmin_θ (1/n) Σ_{i=1}^n l(yi, fθ(xi))    (2.1)
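
To make equation (2.1) concrete, here is a minimal Python sketch (the toy data and one-parameter linear model are made up purely for illustration) that evaluates the empirical risk and approximates the argmin with a crude grid search:

    import numpy as np

    # Toy training data (made up for illustration): y is roughly 2x plus noise.
    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=20)
    y = 2.0 * x + 0.1 * rng.normal(size=20)

    def f(theta, x):
        # A one-parameter linear model f_theta(x) = theta * x.
        return theta * x

    def loss(y, y_hat):
        # Squared-error loss l(y, y_hat).
        return (y - y_hat) ** 2

    def empirical_risk(theta):
        # Equation (2.1): average training loss (1/n) * sum_i l(y_i, f_theta(x_i)).
        return np.mean(loss(y, f(theta, x)))

    # Approximate the argmin over a crude grid of candidate thetas, just to show the idea.
    thetas = np.linspace(-5, 5, 1001)
    theta_hat = thetas[np.argmin([empirical_risk(t) for t in thetas])]
    print(theta_hat)  # close to 2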

We can then extend this to the probabilistic interpretation, maximum likelihood (ML) estimation,
where our loss function l(y, ŷ) is interpreted as the negative log-likelihood function.
The big picture goal for supervised machine learning is to achieve good performance in the real
world when the model is deployed. In practice, this is difficult to achieve because in the real world,
there are unexpected circumstances that we do not have data for and therefore cannot actually
represent in our model. As a result, we must use a mathematical proxy so that our model has a
low generalization error. We can model the real world using a probability distribution P(X, y) and
aim to minimize the expectation of our loss function with respect to this probability function:

E_{X,y}[l(y, fθ̂(X))]

However, our mathematical proxy introduces a few complications.

Complication 1: We do not have access to P (X, y).


Solution: Collect a test set (xi,test, yi,test), i = 1, ..., ntest, to be used once to evaluate our
learned model by computing the test error:

(1/ntest) Σ_{i=1}^{ntest} l(yi,test, fθ̂(xi,test))

The model should be evaluated on the test set only once: not only is test data hard to collect, but
reusing it while designing the model risks "data incest." Test data are not supposed to affect
the model.
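
A minimal sketch of this discipline (toy data; the split sizes and model are made up for illustration): the test set is held out and touched exactly once, after training is finished.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(-1, 1, size=100)
    y = 2.0 * x + 0.1 * rng.normal(size=100)

    # Hold out the last 20 points as a test set; they play no role in training.
    x_train, y_train = x[:80], y[:80]
    x_test, y_test = x[80:], y[80:]

    # Fit theta on the training set only (1-D least squares for f_theta(x) = theta * x).
    theta_hat = np.sum(x_train * y_train) / np.sum(x_train ** 2)

    # Use the test set exactly once, at the very end, to estimate generalization error.
    test_error = np.mean((y_test - theta_hat * x_test) ** 2)
    print(test_error)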

Complication 2: The loss we care about may be incompatible with our optimizer. For example,
our optimizer will use derivatives to find optimal parameters, but our loss function may not have
nice derivatives everywhere.
Solution: Use a surrogate loss function ltrain (., .) that does have nice derivatives, computes
fast, and works with the optimizer. We use this surrogate loss function to guide learning of the
model. The real loss function is used to evaluate the model. Some standard loss functions include
squared error (regression); logistic, hinge, and exponential loss (binary classification); and cross-
entropy loss (multiclass classification). You may want to choose a loss function based on the
application settings of the problem and model.
This surrogate loss function is different from the evaluation loss function from Complication 1:
the surrogate loss is used for training the model, while the evaluation loss is used to see how
well the model performs on new test data. A few things to remember when choosing an appropriate
surrogate loss function: it should be compatible with the optimizer, guide the model toward the
correct solutions, and run fast enough (e.g., have derivatives that are easy to compute). The
squared loss (ltrain = (yi − ŷi)², or in vector form ltrain = ||y − ŷ||²) is a good example of a
loss that runs fast enough.
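
To make the distinction concrete, here is a small illustrative sketch (the labels, scores, and function names are made up) contrasting the non-differentiable 0-1 classification error we actually care about with the smooth logistic surrogate we would train with:

    import numpy as np

    def zero_one_loss(y, score):
        # The loss we actually care about: 1 if the predicted sign is wrong, 0 otherwise.
        # Piecewise constant, so its derivative is 0 or undefined -- useless for gradients.
        return (np.sign(score) != y).astype(float)

    def logistic_loss(y, score):
        # Smooth surrogate used for training: log(1 + exp(-y * score)).
        return np.log1p(np.exp(-y * score))

    y = np.array([+1, -1, +1])            # labels in {-1, +1}
    score = np.array([0.3, 0.4, -2.0])    # f_theta(x) for three examples
    print(zero_one_loss(y, score).mean())   # evaluation loss
    print(logistic_loss(y, score).mean())   # training (surrogate) loss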

3 Hyper-parameters & Parameters


Complication 3: We might get crazy values for θ̂ (e.g. over-fitting). How do we solve this
problem?
Solution A: Add a regularizer during training.
" n #
1X
θ̂ = argmin ltrain (yi , fθ (xi )) + R(θ) (3.1)
θ n
i=1

In the above equation, R(θ) is a regularizer that can be chosen based on what loss function
is used. For example, if squared loss is used as the loss function, then ridge regularization
(R(θ) = λ||θ||²) might be used as the corresponding regularizer. Ridge regularization prevents
the θ̂ values from becoming too big. The probabilistic interpretation of regularization is Maximum
A Posteriori (MAP) estimation, where R(θ) corresponds to a prior; we want the optimal θ to strike
a good balance between the unpenalized loss function and R(θ). Notice, however, that we have added
a new parameter λ to the regularizer. How do we handle λ?
Solution B: Split parameters into two groups: The normal parameters (θ) and hyperparam-
eters (θH ).
A hyperparameter is a parameter that cannot be trained directly, i.e., one that the optimizer
cannot deal with. For example, if λ were treated as a normal parameter in the above ridge
regularization example, then λ would end up being assigned an absurd value (e.g., 0 or −∞).
Another example of a hyperparameter is the model order (degree) of a polynomial function fθ(xi).

How does the optimizer work with hyperparameters?


Figure 3.1 shows the classical division of data into three categories for hyperparameter fitting.
The process of fitting parameters and hyperparameters can be represented as the nested optimization
problem in the equations below. Notice that equation (3.3) is the same as equation (3.1), except
that the regularizer RθH(θ) now depends on the hyperparameters.
θ̂H = argmin_{θH} (1/nval) Σ_{i=1}^{nval} lval(yi,val, fθ̃,θH(xi,val))    (3.2)

θ̃ = argmin_θ [ (1/n) Σ_{i=1}^n ltrain(yi, fθ(xi)) + RθH(θ) ]    (3.3)

Process

1. Initialize the values of the hyperparameters (θH)

2. Based on the current values of the hyperparameters (θH), minimize the regularized training loss
with RθH(θ) on the training data set to obtain θ̃ in equation (3.3)

3. Based on the fitted normal parameters (θ̃) for each candidate θH, find the best θ̂H
on the validation data set using equation (3.2)

Figure 3.1: Partitioning data for hyperparameter tuning

You may split the original training data set into new training and validation data sets. However,
be careful about data contamination (e.g., duplicated data points appearing in both sets); the
training and validation data sets should be disjoint.
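
The nested optimization can be sketched in a few lines of Python (toy 1-D data; the closed-form ridge solution stands in for the inner optimizer, a simplification for illustration): the outer loop sweeps candidate values of the hyperparameter λ, refits θ̃ on the training set for each, and keeps the λ with the lowest validation loss.

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(-1, 1, size=60)
    y = 2.0 * x + 0.3 * rng.normal(size=60)

    # Split into distinct training and validation sets (no shared points).
    x_tr, y_tr = x[:40], y[:40]
    x_val, y_val = x[40:], y[40:]

    def inner_fit(lam):
        # Equation (3.3) for f_theta(x) = theta * x with squared loss and
        # ridge regularizer R(theta) = lam * theta^2; closed form in 1-D.
        return np.sum(x_tr * y_tr) / (np.sum(x_tr ** 2) + lam * len(x_tr))

    def val_loss(theta):
        # Equation (3.2): unregularized loss on the validation set.
        return np.mean((y_val - theta * x_val) ** 2)

    # Outer loop: sweep the hyperparameter lam, refitting theta each time.
    lams = [0.0, 1e-3, 1e-2, 1e-1, 1.0]
    lam_hat = min(lams, key=lambda lam: val_loss(inner_fit(lam)))
    print(lam_hat, inner_fit(lam_hat))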

Complication 4: The optimizer might have "knobs" (other parameters) associated with it. These
might include, for example, the learning rate (or step size) η in gradient descent.
Solution: Include these in θH, or ignore the problem (i.e., pick a value that has worked in
the past; this is a reasonable approach given a limited experimentation budget).

4 Gradient Descent and SGD


Gradient Descent (GD) is an iterative approach to optimization (in the spirit of Newton's
method) that seeks a local optimum by taking repeated steps in the direction opposite to the
gradient at the current point. Gradient Descent also operates under a local linearity assumption.
What does the linear assumption mean? The basic idea is to look at the loss function
Lθ = (1/n) Σ_{i=1}^n ltrain(yi, fθ(Xi)) + R(θ, θH) in the neighborhood of any point θ0 using a
Taylor expansion.

Lθ (θ0 + ∆θ) ≈ Lθ (θ0 ) + (∇θ Lθ (θ0 ))T ∆θ

Here, θ0 , ∆θ, and (∇θ Lθ (θ0 )) are vectors, and Lθ (θ0 ) is a scalar. From the equation above,
(∇θ Lθ (θ0 )) is the gradient around θ0 .
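
As a quick numerical sanity check of this first-order approximation (the one-dimensional loss below is made up purely for illustration), the linearization matches the true loss well for small ∆θ:

    import numpy as np

    def L(theta):
        # A made-up smooth loss of a single parameter.
        return (theta - 3.0) ** 2 + 0.1 * theta ** 4

    def grad_L(theta):
        # Its exact gradient.
        return 2.0 * (theta - 3.0) + 0.4 * theta ** 3

    theta0, dtheta = 1.0, 1e-3
    exact = L(theta0 + dtheta)
    linear = L(theta0) + grad_L(theta0) * dtheta   # first-order Taylor expansion
    print(exact, linear)   # agree up to an error of order dtheta^2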
Using this approximation, we can iterate to find our optimal θ.

θt+1 = θt + η(−∇θ Lθ(θt))

Notice that the gradient ∇θ Lθ(θt) at the current time step t, multiplied by a scalar factor η,
is subtracted from θt (taking a negative step). η is the learning rate, which we set small enough
that the system converges and large enough that optimization is not too slow. One problem this
method introduces is that computing gradients over extremely large datasets can be very
computationally intensive. As a result, we introduce Stochastic Gradient Descent (SGD), where
instead of using the entire dataset of size n, we randomly choose a representative subset of size
nbatch every iteration and compute the gradient only over this batch. Because the batch is drawn
at random each iteration, the overall result over time is a good, trustworthy estimate.
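
A minimal sketch of both updates on a toy least-squares problem (the data, learning rate, batch size, and iteration count are illustrative choices, not recommendations from the lecture):

    import numpy as np

    rng = np.random.default_rng(3)
    n = 1000
    x = rng.uniform(-1, 1, size=n)
    y = 2.0 * x + 0.1 * rng.normal(size=n)

    def grad(theta, xb, yb):
        # Gradient of the mean squared loss (1/m) * sum (y - theta*x)^2 w.r.t. theta.
        return np.mean(-2.0 * xb * (yb - theta * xb))

    eta = 0.5          # learning rate (step size)
    n_batch = 32       # mini-batch size for SGD

    # Full-batch gradient descent: every step uses all n points.
    theta_gd = 0.0
    for _ in range(100):
        theta_gd = theta_gd - eta * grad(theta_gd, x, y)

    # Stochastic gradient descent: every step uses a random subset of size n_batch.
    theta_sgd = 0.0
    for _ in range(100):
        idx = rng.choice(n, size=n_batch, replace=False)
        theta_sgd = theta_sgd - eta * grad(theta_sgd, x[idx], y[idx])

    print(theta_gd, theta_sgd)   # both approach 2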

5 Intro. to Neural Nets via ReLU (Rectified Linear Unit) Nets


5.1 What is a Neural Net (Differentiable Programming)?
A neural net is an object whose derivatives are easy to take (think of analog circuits realized
as computation graphs built from (mostly) differentiable operations that are compatible with
efficient vectorization). Moreover, these differentiable operations can include nonlinearities.

5.2 Two goals of the analog circuits


1. Expressivity: Use the circuit to express the patterns that we want to learn. In other words,
the circuit is a realization of the function fθ(−), and the θs are tunable resistors in the circuit.

2. Reliably Learnable: Think of your machine learning system as a microscope where you
look at data and the right patterns come into focus.

5.3 Example of Neural Networks

Figure 5.1 shows a 1-D nonlinear (Blue) function, piecewise linear (Red) functions, and data
points (Black). As shown in the figure, the piecewise linear functions describe the nonlinear
function pretty well. Our goal is to use a neural net to find a piecewise linear function (Red)
that best matches the nonlinear function (Blue) based on the data points. How do we create such
piecewise linear functions? As a linear combination of elbows (Rectified Linear Units), shown in
Figure 5.2!
The rectifier circuit in Figure 5.2 is composed of a diode and a resistor. The diode prevents
current from flowing in the opposite (negative) direction. Taking the positive direction of the
current to be from left to right, current can never flow from right to left: negative currents are
clipped to zero, resulting in a Vout reading of zero, while positive currents (from left to right)
pass through the diode and produce a Vout reading on the other side of the diode. The standard
ReLU function is shown below.

f(x) = max(0, x) = { 0, if x ≤ 0;  x, if x > 0 }

In this example, Vin is x and Vout is f(x). The standard ReLU function can also be modified if
needed. For example, x can be replaced with wx + b, so that the modified ReLU function becomes
f(x) = max(0, wx + b). Here, w and b are the parameters (θ) we want to learn by minimizing a loss
function, as discussed in the previous sections. More details and visualizations will be covered in
the discussion section and the next lecture.
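
A minimal sketch (with made-up weights; in practice they would be learned) of how a few ReLU elbows combine into a piecewise linear function, in the spirit of Figure 5.1:

    import numpy as np

    def relu(z):
        # The standard ReLU: max(0, z), applied elementwise.
        return np.maximum(0.0, z)

    # Illustrative parameters theta = (w_j, b_j, v_j); in practice these are learned.
    w = np.array([1.0, 1.0, 1.0])
    b = np.array([0.0, -0.5, -1.0])      # elbows placed at x = 0, 0.5, 1.0
    v = np.array([1.0, -2.0, 2.0])       # how much the slope changes at each elbow

    def f(x):
        # A 1-D, one-hidden-layer ReLU net: f(x) = sum_j v_j * max(0, w_j * x + b_j).
        return relu(np.outer(x, w) + b) @ v

    xs = np.linspace(-1, 2, 7)
    print(f(xs))   # piecewise linear in x, with slope changes at the elbows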

Figure 5.1: 1-D nonlinear and piecewise linear functions



Figure 5.2: Elbow and Rectifier

6 What we wish this lecture also had to make things clearer


• It would be helpful if Empirical Risk Minimization (ERM) were covered more directly in this
lecture, potentially with diagrams.
