Fall 2022
Lecture 2: August 30 (Tuesday)
Lecturer: Prof. Anant Sahai
Today
1. Recap: Basic Standard ML Doctrine
• Model: fθ(·)
• Optimizer
Our goal in a supervised ML setup is to make an inference ŷ on new data X as follows: ŷ = fθ̂(X), where θ̂ are the learned parameters.
We can then extend this to the probabilistic interpretation, maximum likelihood (ML) estimation, where our loss function l(y, ŷ) is interpreted as the negative log-likelihood.
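As a concrete instance of this correspondence (a standard fact, noted here as an aside rather than something stated in the lecture): if we assume Gaussian observation noise, y = fθ(x) + ε with ε ∼ N(0, σ²), then the negative log-likelihood is, up to terms that do not depend on θ, exactly the squared error:

−log p(y | x, θ) = (y − fθ(x))² / (2σ²) + (1/2) log(2πσ²)

so maximizing the likelihood under this noise model is the same as minimizing squared loss.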
The big-picture goal of supervised machine learning is to achieve good performance in the real world when the model is deployed. In practice, this is difficult to achieve because the real world contains unexpected circumstances that we do not have data for and therefore cannot actually represent in our model. As a result, we must use a mathematical proxy so that our model has low generalization error. We can model the real world using a probability distribution P(X, y) and aim to minimize the expectation of our loss function with respect to this distribution:

E(X,y)∼P [ l(y, fθ(X)) ]
We want to test the model only once: not only is test data hard to collect, but there is also a risk of "data incest" if the test data influences design decisions while the model is being built. Test data is not supposed to affect the model.
Complication 2: The loss we care about may be incompatible with our optimizer. For example,
our optimizer will use derivatives to find optimal parameters, but our loss function may not have
nice derivatives everywhere.
Solution: Use a surrogate loss function ltrain(·, ·) that does have nice derivatives, computes quickly, and works with the optimizer. We use this surrogate loss function to guide learning of the model; the real loss function is used to evaluate the model. Some standard loss functions include squared error (regression); logistic, hinge, and exponential loss (binary classification); and cross-entropy loss (multiclass classification). You may want to choose a loss function based on the application setting of the problem and model.
This surrogate loss function is different from the evaluation loss function of Complication 1. The surrogate loss function is used for training the model, while the evaluation loss function is used to see how well the model works on new test data. A few things to remember when choosing an appropriate surrogate loss function: it should be compatible with the optimizer, guide the model toward correct solutions, and run fast enough (e.g., its derivatives should be easy to compute). The squared loss (ltrain = (yᵢ − ŷᵢ)², or in vector form ltrain = ||y − ŷ||²) is a good example of one that runs fast enough.
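To make the optimizer-compatibility point concrete, here is a small sketch (our own illustration, not from the lecture) contrasting the 0-1 evaluation loss, whose derivative is zero almost everywhere, with the logistic surrogate, whose derivative is informative everywhere:

```python
import numpy as np

def zero_one_loss(y, score):
    """Evaluation loss: 1 if sign(score) misclassifies y in {-1, +1}.
    Piecewise constant, so its derivative is 0 almost everywhere --
    useless for a gradient-based optimizer."""
    return float(np.sign(score) != y)

def logistic_loss(y, score):
    """Surrogate training loss: a smooth upper bound on the 0-1 loss,
    with useful derivatives everywhere."""
    return np.log(1 + np.exp(-y * score))

def logistic_loss_grad(y, score):
    """d/d(score) of the logistic loss; this is what the optimizer uses."""
    return -y / (1 + np.exp(y * score))

# A confidently wrong prediction gets a large surrogate loss and a
# nonzero gradient, while the 0-1 loss only reports "wrong" (gradient 0).
print(zero_one_loss(+1, -2.0))       # 1.0
print(logistic_loss(+1, -2.0))       # ~2.13
print(logistic_loss_grad(+1, -2.0))  # ~-0.88
```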
Putting loss and regularization together, we minimize the regularized training loss:

θ̂ = argminθ [ (1/n) Σⁿᵢ₌₁ ltrain(yᵢ, fθ(xᵢ)) + R(θ) ]

In the above equation, R(θ) is the regularizer, which can be chosen based on what loss function is used. For example, if squared loss is used as the loss function, then ridge regularization (R(θ) = λ||θ||²) might be used as the corresponding regularizer. Ridge regularization prevents the θ̂ values from becoming too big. The probabilistic interpretation of regularization is Maximum A Posteriori (MAP) estimation, where R(θ) corresponds to a prior: we want an optimal θ̂ that strikes a good balance between the unpenalized loss function and R(θ). Now notice that we have added a new parameter λ to the regularizer. How do we handle λ?
Solution B: Split parameters into two groups: the normal parameters (θ) and the hyperparameters (θH).
A hyperparameter is a parameter that cannot be trained, or that the optimizer cannot deal with. For example, if λ were treated as a normal parameter in the ridge regularization example above, the optimizer would assign it an absurd value (e.g., 0, or −∞ if λ is unconstrained), since shrinking the penalty term always lowers the training objective. Another example of a hyperparameter is the model order (degree) of a polynomial function fθ(xᵢ).
θ̃ = argminθ [ (1/n) Σⁿᵢ₌₁ ltrain(yᵢ, fθ(xᵢ)) + RθH(θ) ]        (3.3)
Process
1. Pick candidate values for the hyperparameters (θH).
2. Based on the values of the hyperparameters (θH), minimize the regularized loss with RθH(θ) on the training data set to get θ̃, as in equation (3.3).
3. Based on the trained parameters (θ̃) and the hyperparameters (θH), find the best θ̂H on the validation data set using equation (3.2).
You may split the original training data set into a new training set and a validation set. However, be careful about data contamination (e.g., duplicated data points across the sets; the training and validation data sets should be disjoint).
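A minimal sketch of this three-step process (our own illustration, assuming ridge regression so that step 2 has a closed-form solution; fit_ridge and the candidate λ grid are hypothetical choices, not from the lecture):

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Step 2: for a fixed hyperparameter lam, minimize the regularized
    training loss (1/n)||y - X theta||^2 + lam*||theta||^2 in closed form."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def validation_loss(X_val, y_val, theta):
    """Evaluation loss (here, mean squared error) on held-out data."""
    return np.mean((y_val - X_val @ theta) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
theta_true = rng.normal(size=5)
y = X @ theta_true + 0.1 * rng.normal(size=100)

# Split the original training data into disjoint training and validation sets.
X_tr, y_tr = X[:80], y[:80]
X_val, y_val = X[80:], y[80:]

# Steps 1-3: propose candidate lambdas, train on the training set for each,
# and pick the lambda whose model does best on the validation set.
candidates = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]
best_lam = min(candidates,
               key=lambda lam: validation_loss(X_val, y_val,
                                               fit_ridge(X_tr, y_tr, lam)))
print("best lambda:", best_lam)
```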
Complication 4: The optimizer might have "knobs" (other parameters) associated with it. These might include, for example, the learning rate (or step size) η in gradient descent.
Solution: Include these in θH, or ignore the problem (i.e., pick a value that has worked in the past; this is a reasonable approach in light of a limited experimentation budget).
The key idea is the first-order Taylor approximation of the loss around a point θ0:

L(θ0 + ∆θ) ≈ L(θ0) + (∇θ L(θ0))ᵀ ∆θ

Here, θ0, ∆θ, and ∇θ L(θ0) are vectors, and L(θ0) is a scalar. In the equation above, ∇θ L(θ0) is the gradient of the loss at θ0. Using this approximation, we can iterate to find our optimal θ:
θt+1 = θt − η ∇θ L(θt)

Notice that the gradient ∇θ L(θt) at the current time step t, multiplied by a scalar factor η, is subtracted from θt (a step in the negative gradient direction). Here η is the learning rate, which we set small enough that the system converges and big enough that optimization is not too slow. One problem with this method is that computing gradients over extremely large datasets can be very computationally intensive. As a result, we introduce Stochastic Gradient Descent (SGD): instead of using the entire dataset of size n, we randomly choose a representative subset of size nbatch at every iteration and compute the gradient only on this batch. Because we randomly choose a new subset of size nbatch at every iteration, the overall result over time is a good, trustworthy estimate.
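A minimal sketch of SGD for the mean squared loss (our own illustration, not from the lecture; the batch size n_batch and learning rate eta below are arbitrary choices):

```python
import numpy as np

def sgd(X, y, eta=0.01, n_batch=10, n_steps=1000, seed=0):
    """Minimize the mean squared loss (1/n)||y - X theta||^2 with SGD.
    Each step estimates the full gradient from a random batch of size n_batch."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_steps):
        # Random batch: a cheap estimate of the full-dataset gradient.
        idx = rng.choice(n, size=n_batch, replace=False)
        Xb, yb = X[idx], y[idx]
        grad = -2 * Xb.T @ (yb - Xb @ theta) / n_batch
        # Take a small step in the negative gradient direction.
        theta -= eta * grad
    return theta

# Usage: recover theta_true from noisy linear data.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=500)
print(sgd(X, y))  # should be close to [1.0, -2.0, 0.5]
```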
2. Reliably Learnable: Think of your machine learning system as a microscope where you
look at data and the right patterns come into focus.
Figure 5.1 shows a 1-D nonlinear function (blue), piecewise linear functions (red), and data points (black). As shown in the figure, the piecewise linear functions describe the nonlinear function quite well. Our goal is to find the set of piecewise linear functions (red) that best matches the nonlinear function (blue), based on the data points, using neural nets. How, then, do we create the piecewise linear functions? With a linear combination of elbows (Rectified Linear Units), as in Figure 5.2!
The rectifier circuit in Figure 5.2 is composed of a diode and a resistor. The diode prevents current from flowing in the opposite (negative) direction. Taking the positive direction of current flow to be from left to right, current can never flow from right to left. Negative inputs therefore produce no current, and the Vout reading is zero; positive inputs drive current through the diode, producing a Vout reading on the other side of the diode. The standard ReLU function is shown below.
f(x) = max(0, x) = { 0 if x ≤ 0;  x if x > 0 }
In this example, Vin is x and Vout is f(x). The standard ReLU function can also be modified if needed. For example, x can be replaced with wx + b, so that the modified ReLU function becomes f(x) = max(0, wx + b). Here, w and b are the parameters (θ) that we want to learn by minimizing a loss function, as mentioned in the previous sections. More details and visualizations will be covered in the discussion section and the next lecture.
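To make the "linear combination of elbows" idea concrete, here is a minimal sketch (our own illustration, not from the lecture; the elbow locations are hand-picked, whereas a neural net would learn w, b, and the combination weights by minimizing a loss):

```python
import numpy as np

def relu(z):
    """Standard ReLU: max(0, z), the 'elbow' building block."""
    return np.maximum(0.0, z)

def piecewise_linear(x, ws, bs, cs, c0=0.0):
    """Linear combination of ReLU elbows:
    f(x) = c0 + sum_k cs[k] * relu(ws[k] * x + bs[k]).
    Each elbow adds one kink, so the result is piecewise linear."""
    return c0 + sum(c * relu(w * x + b) for w, b, c in zip(ws, bs, cs))

# Hand-picked elbows with kinks at x = 0, 1, 2: the slope changes at
# each kink, bending the function like the red curves in Figure 5.1.
x = np.linspace(-1.0, 3.0, 9)
y = piecewise_linear(x, ws=[1.0, 1.0, 1.0], bs=[0.0, -1.0, -2.0],
                     cs=[1.0, -2.0, 2.0])
print(np.round(y, 2))
```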