Fall 2022
Lecture 2: August 30 (Tuesday)
Lecturer: Prof. Anant Sahai
Today
1. Recap: Basic Standard ML Doctrine
• Model: fθ(·)
• Optimizer
Our goal in a supervised ML setup is to make an inference ŷ on new data X as follows: ŷ = fθ̂(X), where θ̂ are the learned parameters.
We can then extend this to the probabilistic interpretation, maximum likelihood (ML) estimation, where our loss function l(y, ŷ) is interpreted as the negative log-likelihood.
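As a concrete instance of this correspondence (a standard fact, noted here as an aside rather than something stated in the lecture): if we assume Gaussian observation noise, y = fθ(x) + ε with ε ∼ N(0, σ²), then the negative log-likelihood is, up to terms that do not depend on θ, exactly the squared error:

−log p(y | x, θ) = (y − fθ(x))² / (2σ²) + (1/2) log(2πσ²)

so maximizing the likelihood under this noise model is the same as minimizing squared loss.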
The big-picture goal of supervised machine learning is to achieve good performance in the real world when the model is deployed. In practice, this is difficult to achieve because the real world contains unexpected circumstances that we do not have data for and therefore cannot actually represent in our model. As a result, we must use a mathematical proxy so that our model has low generalization error. We can model the real world using a probability distribution P(X, y) and aim to minimize the expectation of our loss function with respect to this distribution:

E(X,y)∼P [ l(y, fθ(X)) ]
We want to test the model only once: not only is test data hard to collect, but there is also a risk of "data incest" if the test data influences design decisions while the model is being built. Test data is not supposed to affect the model.
Complication 2: The loss we care about may be incompatible with our optimizer. For example,
our optimizer will use derivatives to find optimal parameters, but our loss function may not have
nice derivatives everywhere.
Solution: Use a surrogate loss function ltrain(·, ·) that does have nice derivatives, computes quickly, and works with the optimizer. We use this surrogate loss function to guide learning of the model; the real loss function is used to evaluate the model. Some standard loss functions include squared error (regression); logistic, hinge, and exponential loss (binary classification); and cross-entropy loss (multiclass classification). You may want to choose a loss function based on the application setting of the problem and model.
This surrogate loss function is different from the evaluation loss function of Complication 1. The surrogate loss function is used for training the model, while the evaluation loss function is used to see how well the model works on new test data. A few things to remember when choosing an appropriate surrogate loss function: it should be compatible with the optimizer, guide the model toward correct solutions, and run fast enough (e.g., its derivatives should be easy to compute). The squared loss (ltrain = (yᵢ − ŷᵢ)², or in vector form ltrain = ||y − ŷ||²) is a good example of one that runs fast enough.
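To make the optimizer-compatibility point concrete, here is a small sketch (our own illustration, not from the lecture) contrasting the 0-1 evaluation loss, whose derivative is zero almost everywhere, with the logistic surrogate, whose derivative is informative everywhere:

```python
import numpy as np

def zero_one_loss(y, score):
    """Evaluation loss: 1 if sign(score) misclassifies y in {-1, +1}.
    Piecewise constant, so its derivative is 0 almost everywhere --
    useless for a gradient-based optimizer."""
    return float(np.sign(score) != y)

def logistic_loss(y, score):
    """Surrogate training loss: a smooth upper bound on the 0-1 loss,
    with useful derivatives everywhere."""
    return np.log(1 + np.exp(-y * score))

def logistic_loss_grad(y, score):
    """d/d(score) of the logistic loss; this is what the optimizer uses."""
    return -y / (1 + np.exp(y * score))

# A confidently wrong prediction gets a large surrogate loss and a
# nonzero gradient, while the 0-1 loss only reports "wrong" (gradient 0).
print(zero_one_loss(+1, -2.0))       # 1.0
print(logistic_loss(+1, -2.0))       # ~2.13
print(logistic_loss_grad(+1, -2.0))  # ~-0.88
```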
Putting loss and regularization together, we minimize the regularized training loss:

θ̂ = argminθ [ (1/n) Σⁿᵢ₌₁ ltrain(yᵢ, fθ(xᵢ)) + R(θ) ]

In the above equation, R(θ) is the regularizer, which can be chosen based on what loss function is used. For example, if squared loss is used as the loss function, then ridge regularization (R(θ) = λ||θ||²) might be used as the corresponding regularizer. Ridge regularization prevents the θ̂ values from becoming too big. The probabilistic interpretation of regularization is Maximum A Posteriori (MAP) estimation, where R(θ) corresponds to a prior: we want an optimal θ̂ that strikes a good balance between the unpenalized loss function and R(θ). Now notice that we have added a new parameter λ to the regularizer. How do we handle λ?
Solution B: Split parameters into two groups: the normal parameters (θ) and the hyperparameters (θH).
A hyperparameter is a parameter that cannot be trained, or that the optimizer cannot deal with. For example, if λ were treated as a normal parameter in the ridge regularization example above, the optimizer would assign it an absurd value (e.g., 0, or −∞ if λ is unconstrained), since shrinking the penalty term always lowers the training objective. Another example of a hyperparameter is the model order (degree) of a polynomial function fθ(xᵢ).
θ̃ = argminθ [ (1/n) Σⁿᵢ₌₁ ltrain(yᵢ, fθ(xᵢ)) + RθH(θ) ]        (3.3)
Process
1. Pick candidate values for the hyperparameters (θH).
2. Based on the values of the hyperparameters (θH), minimize the regularized loss with RθH(θ) on the training data set to get θ̃, as in equation (3.3).
3. Based on the trained parameters (θ̃) and the hyperparameters (θH), find the best θ̂H on the validation data set using equation (3.2).
You may split the original training data set into a new training set and a validation set. However, be careful about data contamination (e.g., duplicated data points across the sets; the training and validation data sets should be disjoint).
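A minimal sketch of this three-step process (our own illustration, assuming ridge regression so that step 2 has a closed-form solution; fit_ridge and the candidate λ grid are hypothetical choices, not from the lecture):

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Step 2: for a fixed hyperparameter lam, minimize the regularized
    training loss (1/n)||y - X theta||^2 + lam*||theta||^2 in closed form."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def validation_loss(X_val, y_val, theta):
    """Evaluation loss (here, mean squared error) on held-out data."""
    return np.mean((y_val - X_val @ theta) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
theta_true = rng.normal(size=5)
y = X @ theta_true + 0.1 * rng.normal(size=100)

# Split the original training data into disjoint training and validation sets.
X_tr, y_tr = X[:80], y[:80]
X_val, y_val = X[80:], y[80:]

# Steps 1-3: propose candidate lambdas, train on the training set for each,
# and pick the lambda whose model does best on the validation set.
candidates = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]
best_lam = min(candidates,
               key=lambda lam: validation_loss(X_val, y_val,
                                               fit_ridge(X_tr, y_tr, lam)))
print("best lambda:", best_lam)
```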
Complication 4: The optimizer might have "knobs" (other parameters) associated with it. These might include, for example, the learning rate (or step size) η in gradient descent.
Solution: Include these in θH, or ignore the problem (i.e., pick a value that has worked in the past; this is a reasonable approach in light of a limited experimentation budget).
The key idea is the first-order Taylor approximation of the loss around a point θ0:

L(θ0 + ∆θ) ≈ L(θ0) + (∇θ L(θ0))ᵀ ∆θ

Here, θ0, ∆θ, and ∇θ L(θ0) are vectors, and L(θ0) is a scalar. In the equation above, ∇θ L(θ0) is the gradient of the loss at θ0. Using this approximation, we can iterate to find our optimal θ:
θt+1 = θt − η ∇θ L(θt)

Notice that the gradient ∇θ L(θt) at the current time step t, multiplied by a scalar factor η, is subtracted from θt (a step in the negative gradient direction). Here η is the learning rate, which we set small enough that the system converges and big enough that optimization is not too slow. One problem with this method is that computing gradients over extremely large datasets can be very computationally intensive. As a result, we introduce Stochastic Gradient Descent (SGD): instead of using the entire dataset of size n, we randomly choose a representative subset of size nbatch at every iteration and compute the gradient only on this batch. Because we randomly choose a new subset of size nbatch at every iteration, the overall result over time is a good, trustworthy estimate.
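A minimal sketch of SGD for the mean squared loss (our own illustration, not from the lecture; the batch size n_batch and learning rate eta below are arbitrary choices):

```python
import numpy as np

def sgd(X, y, eta=0.01, n_batch=10, n_steps=1000, seed=0):
    """Minimize the mean squared loss (1/n)||y - X theta||^2 with SGD.
    Each step estimates the full gradient from a random batch of size n_batch."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_steps):
        # Random batch: a cheap estimate of the full-dataset gradient.
        idx = rng.choice(n, size=n_batch, replace=False)
        Xb, yb = X[idx], y[idx]
        grad = -2 * Xb.T @ (yb - Xb @ theta) / n_batch
        # Take a small step in the negative gradient direction.
        theta -= eta * grad
    return theta

# Usage: recover theta_true from noisy linear data.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=500)
print(sgd(X, y))  # should be close to [1.0, -2.0, 0.5]
```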
2. Reliably Learnable: Think of your machine learning system as a microscope where you
look at data and the right patterns come into focus.
Figure 5.1 shows a 1-D nonlinear function (blue), piecewise linear functions (red), and data points (black). As shown in the figure, the piecewise linear functions describe the nonlinear function quite well. Our goal is to find the set of piecewise linear functions (red) that best matches the nonlinear function (blue), based on the data points, using neural nets. How, then, do we create the piecewise linear functions? With a linear combination of elbows (Rectified Linear Units), as in Figure 5.2!
The rectifier circuit in Figure 5.2 is composed of a diode and a resistor. The diode prevents current from flowing in the opposite (negative) direction. Taking the positive direction of current flow to be from left to right, current can never flow from right to left. Negative inputs therefore produce no current, and the Vout reading is zero; positive inputs drive current through the diode, producing a Vout reading on the other side of the diode. The standard ReLU function is shown below.
f(x) = max(0, x) = { 0 if x ≤ 0;  x if x > 0 }
In this example, Vin is x and Vout is f(x). The standard ReLU function can also be modified if needed. For example, x can be replaced with wx + b, so that the modified ReLU function becomes f(x) = max(0, wx + b). Here, w and b are the parameters (θ) that we want to learn by minimizing a loss function, as mentioned in the previous sections. More details and visualizations will be covered in the discussion section and the next lecture.
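To make the "linear combination of elbows" idea concrete, here is a minimal sketch (our own illustration, not from the lecture; the elbow locations are hand-picked, whereas a neural net would learn w, b, and the combination weights by minimizing a loss):

```python
import numpy as np

def relu(z):
    """Standard ReLU: max(0, z), the 'elbow' building block."""
    return np.maximum(0.0, z)

def piecewise_linear(x, ws, bs, cs, c0=0.0):
    """Linear combination of ReLU elbows:
    f(x) = c0 + sum_k cs[k] * relu(ws[k] * x + bs[k]).
    Each elbow adds one kink, so the result is piecewise linear."""
    return c0 + sum(c * relu(w * x + b) for w, b, c in zip(ws, bs, cs))

# Hand-picked elbows with kinks at x = 0, 1, 2: the slope changes at
# each kink, bending the function like the red curves in Figure 5.1.
x = np.linspace(-1.0, 3.0, 9)
y = piecewise_linear(x, ws=[1.0, 1.0, 1.0], bs=[0.0, -1.0, -2.0],
                     cs=[1.0, -2.0, 2.0])
print(np.round(y, 2))
```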