CS230 Midterm Solutions Fall 2022
Multiple Choice: 14
Short Answer: 30
Feed-Forward Neural Network: 15
Backpropagation: 19
Discrete Functions in Neural Networks: 11
Debugging Code: 18
Total: 88
• This exam is open book, but collaboration with anyone else, either in person or online,
is strictly forbidden pursuant to The Stanford Honor Code.
• In all cases, and especially if you’re stuck or unsure of your answers, explain your
work, including showing your calculations and derivations! We’ll give partial
credit for good explanations of what you were trying to do.
Name:
SUNETID: @stanford.edu
Signature:
For each of the following questions, circle the letter of your choice. Each question has AT
LEAST one correct option unless explicitly mentioned. No explanation is required.
(a) (2 points) Imagine you are tasked with building a model to diagnose COVID-19 using
chest CT images. You are provided with 100,000 chest CT images, 1,000 of which are
labelled. Which learning technique has the best chance of succeeding on this task?
(SELECT ONLY ONE)
(i) Transfer Learning from a ResNet50 that was pre-trained on chest CT images to
detect tumors
(ii) Train a GAN to generate synthetic labeled data and train your model on all the
ground truth and synthetic data
(iii) Supervised Learning directly on the 1,000 labeled images
(iv) Augment the labeled data using random cropping and train using supervised
learning
Solution: (i)
(i) The model was pre-trained on the same type of data, namely CT images.
Therefore, transfer learning from such a model would make the most sense,
despite the pre-training being done for a slightly different task.
(ii) GANs require a lot of data to train, and in this case we would have to generate both COVID-19 and non-COVID CT images. With only 1k labelled images we would have to train separate GANs for the COVID and non-COVID classes, and we do not have enough labelled examples of either.
(iii) There wouldn’t be enough data in this case, and your model would easily
overfit to the 1k examples.
(iv) Although using random cropping can help, similar to (iii), there just isn’t
enough data to train a model in a fully supervised manner from scratch.
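As an illustration of option (i), below is a minimal PyTorch sketch of this kind of transfer learning. It assumes a recent torchvision, a hypothetical checkpoint file ct_pretrained.pth holding the weights of the tumor-detection ResNet50, and a made-up 2-class head and learning rate; none of these specifics come from the exam.

```python
import torch
import torch.nn as nn
from torchvision import models

# Hypothetical sketch: reuse a ResNet50 pre-trained on chest CT images (tumor detection)
# and fine-tune only a new 2-class head (COVID-19 vs. healthy) on the 1,000 labeled images.
model = models.resnet50(weights=None)                    # same architecture as the source model
state = torch.load("ct_pretrained.pth")                  # hypothetical checkpoint name
state = {k: v for k, v in state.items() if not k.startswith("fc.")}  # drop the old task head
model.load_state_dict(state, strict=False)

for p in model.parameters():                             # freeze the pre-trained backbone
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 2)            # new trainable 2-class head
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```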
(b) (2 points) Imagine you are tasked with training a lane detection system that can distinguish between two different types of lanes: lanes in the same direction as the car moves and lanes in the opposite direction. Assume all the images are taken in common two-way streets in California from a car's front-view camera. Which of the following data augmentation techniques can be used for your task?
Solution: (iii) and (iv)
(i) Images flipped vertically no longer make sense, i.e., your car will never be driving upside-down.
(ii) Flipping horizontally breaks the direction of traffic, since lanes going in the opposite direction should always appear on the left side of the image.
(iii) Adding artificial fog to the images is a good technique to apply in this case, as it can simulate more realistic driving conditions.
(iv) Applying random masking is a good data augmentation technique here, since it covers cases where obstacles on the road occlude your view.
(c) (2 points) You are training a binary classifier and are unsatisfied with the F1-score as a way of combining precision and recall into a single number. You are considering alternatives to the F1-score. Which of the following would be reasonable candidate metric(s):
(i) |precision - recall|
(ii) recall/precision
(iii) precision × recall
(iv) max(precision, recall)
Solution: (iii)
Like the F1-score, precision × recall is high only when both precision and recall are high. The other candidates can look good even when one of the two is poor: |precision − recall| is 0 whenever the two are equal (however low), recall/precision rewards trading precision for recall, and max(precision, recall) ignores the weaker of the two.
(d) (2 points) Dropout can be considered as a form of ensembling over variants of a neural
network. Consider a neural network with N nodes, each of which can be dropped
during training independently with a probability 0 < p < 1. What is the total number
of unique models that can be realized on applying dropout?
(i) ⌊N × p⌋
(ii) (⌊N × p⌋)^N
(iii) 2^⌊N×p⌋
(iv) 2^N
Solution: (iv)
Each node has 2 possibilities: to be kept or to be dropped, and we have N nodes
total. Note that this is independent of p, as we ask for the total number of unique
models, not the expected value.
(e) (2 points) In practice, when using Early Stopping one needs to set a "buffer" hyperparameter, which determines how many epochs training is allowed to continue without any improvement in validation performance before it is terminated. After training is terminated, the model with the best validation performance is used. What is the benefit of setting the buffer parameter to a value of k = 5 epochs instead of 0?
Solution: (i)
In real datasets, especially smaller ones, validation performance can exhibit small fluctuations from epoch to epoch. With a buffer (typically referred to as "patience") of 0, training is likely to stop prematurely when validation performance fails to improve in a given epoch purely because of such random fluctuation. Setting the patience to a value such as k = 5 epochs makes it much less likely that we stop too early and underfit.
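To make the role of the patience value concrete, here is a small NumPy sketch on a made-up validation curve (the numbers are purely illustrative, not from any real run):

```python
import numpy as np

# Toy early stopping: scan a noisy validation-loss curve and stop once no improvement
# has been seen for `patience` consecutive epochs; report where training stops and
# which epoch held the best model.
val_losses = [0.90, 0.70, 0.61, 0.62, 0.60, 0.63, 0.64, 0.65, 0.66, 0.67, 0.68]

def stop_epoch(val_losses, patience):
    best, best_epoch, waited = np.inf, 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return epoch, best_epoch

print(stop_epoch(val_losses, patience=0))  # stops at epoch 3 and misses the best epoch (4)
print(stop_epoch(val_losses, patience=5))  # survives the blip at epoch 3 and keeps the best epoch (4)
```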
(f) (2 points) Suppose that you are training a deep neural network and observe that
the training curve contains a lot of oscillations, especially at early stages of training.
Which of the following techniques can help stabilize training?
(i) can help stabilize training at the beginning stages when there is high uncer-
tainty
(ii) helps reduce overfitting, not necessarily the training stability
(iii) constrains the maximum magnitude of the gradients, and hence the step size
(g) (2 points) You have a 2-layer MLP with Sigmoid activations in the hidden layers
that you want to train with SGD. Your network weights are initialized from N (10, 1).
From the very first epoch, you observe that some weights in the first layer are not
getting updated or are updated very slowly compared to the second layer. Which of
the following can fix this issue?
Solution: (i) and (iv) will help. Initializing the weights to very large values causes large pre-activations, so the sigmoid saturates (outputs close to 1) and its derivative is almost 0; the gradients reaching the first layer therefore vanish. Tanh has the same problem. Switching to leaky ReLU or initializing the weights with smaller values fixes this.
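A quick numerical check of this explanation; the layer sizes and the input below are made up, only the N(10, 1) initialization comes from the question:

```python
import numpy as np

# With weights drawn from N(10, 1) and non-negative inputs, first-layer pre-activations
# are large, so sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) is essentially 0 and the
# gradients reaching the first layer vanish.
rng = np.random.default_rng(0)
x = rng.uniform(size=(10, 1))                        # e.g. normalized intensities in [0, 1]
W1 = rng.normal(loc=10.0, scale=1.0, size=(8, 10))   # N(10, 1) initialization
z1 = W1 @ x
a1 = 1.0 / (1.0 + np.exp(-z1))                       # sigmoid

print(z1.min())                  # pre-activations are all far from 0 ...
print((a1 * (1 - a1)).max())     # ... so the largest sigmoid derivative is ~0 (vanishing gradient)
```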
The questions in this section can be answered in less than 3 sentences. Please be concise in
your responses.
(a) Imagine that you are building an app to optimize wait times in US emergency rooms
while prioritizing severe cases. You build a deep learning-based app that works as
follows:
You trained and tested your model using 3 months' worth of data from hospitals in the
US, before deploying it to several hospitals in the San Francisco Bay Area.
(i) (2 points) Now you want to deploy your app internationally. Do you think your
app will work well? Why or why not?
You noticed that the app tends to rank African American and Hispanic patients
lower than patients from other ethnic backgrounds, even if those patients came
into the emergency department with more severe cases.
(ii) (1 point) Why is this a problem?
(b) Graph Neural Networks (GNNs) are a family of neural networks that can operate on
graph-structured data. Here, we describe a basic 2-layer GNN. Consider a graph with
k nodes labeled {1, 2, . . . , k}. For simplicity, assume that each node i is associated
with a scalar input x_i. The first layer of our GNN, parameterized by scalar parameters w[1] and b[1], performs the following operation to compute a_i[1] at each node i:

    a_i[1] = ReLU( x_i + w[1] ∑_{n ∈ N(i)} x_n + b[1] )        (1)

where N(i) is the set of neighbors of node i in the graph (i.e., all nodes that share an edge with node i). The second layer, parameterized by scalar parameters w[2] and b[2], analogously computes a_i[2] for each node i:

    a_i[2] = ReLU( a_i[1] + w[2] ∑_{n ∈ N(i)} a_n[1] + b[2] )        (2)
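Below is a minimal NumPy sketch of the forward pass defined by equations (1) and (2); the example graph, weights, and inputs are made up and are not the graph from the figure.

```python
import numpy as np

# 2-layer scalar GNN: each node combines its own value with the sum over its neighbors,
# exactly as in equations (1) and (2) above.
def gnn_forward(x, neighbors, w1, b1, w2, b2):
    relu = lambda v: max(v, 0.0)
    a1 = [relu(x[i] + w1 * sum(x[n] for n in neighbors[i]) + b1) for i in range(len(x))]
    a2 = [relu(a1[i] + w2 * sum(a1[n] for n in neighbors[i]) + b2) for i in range(len(x))]
    return np.array(a1), np.array(a2)

# Example: a path graph 1 - 2 - 3 (stored 0-indexed).
neighbors = {0: [1], 1: [0, 2], 2: [1]}
x = np.array([1.0, -2.0, 0.5])
a1, a2 = gnn_forward(x, neighbors, w1=0.5, b1=0.1, w2=-0.3, b2=0.0)
print(a1, a2)
```

Note that x_j can influence a_i[2] only if node j is within two hops of node i (and only when the relevant ReLUs are active), which is exactly what part (i) probes.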
Answer the following questions for the graph in the figure below, with labels as shown
in the nodes.
(i) (2 points) What is ∂a_1[2]/∂x_6?
[Figure: the graph for this question, with nodes labeled 1 through 6; the edge structure is not recoverable from the extracted text.]
(ii) (2 points) You are allowed to add one additional node (suppose this is node 7) and accompanying edges such that the value of ∂a_1[2]/∂x_6 changes from the value computed in part (i). Describe how you would do this with the fewest number of edges accompanying node 7.
(c) Consider the figure below, which relates to the training procedure of a GAN. The figure shows the cost function of the generator plotted against the output of the discriminator on a generated image G(z). Concerning the discriminator's output, 0 means that the discriminator thinks the input "has been generated by G", whereas 1 means the discriminator thinks the input "comes from the real data".
(i) (2 points) After one round of training the generator and discriminator, is the
value of D(G(z)) closer to 0 or closer to 1? Explain.
(ii) (2 points) Two cost functions are presented in Figure 1 above. Which one would
you choose to train your GAN? Justify your answer.
Solution: The non-saturating cost, because its gradient is largest exactly where the cost is largest, i.e., when D(G(z)) is close to 0 early in training, so the generator still receives a strong learning signal.
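Since Figure 1 is not reproduced here, assume the two curves are the standard choices: the saturating cost log(1 − D(G(z))) and the non-saturating cost −log D(G(z)). A quick gradient comparison illustrates the reasoning:

```python
import numpy as np

# Gradients of the two generator costs with respect to the discriminator output D(G(z)).
# Early in training D(G(z)) is near 0 (fakes are easy to spot): the saturating cost gives
# a tiny gradient there, while the non-saturating cost gives a very large one.
d = np.array([0.01, 0.1, 0.5, 0.9])   # possible values of D(G(z))

grad_saturating = -1.0 / (1.0 - d)    # d/dD [ log(1 - D) ]
grad_non_saturating = -1.0 / d        # d/dD [ -log(D) ]

print(grad_saturating)      # approx [-1.01, -1.11, -2.0, -10.0]
print(grad_non_saturating)  # approx [-100.0, -10.0, -2.0, -1.11]
```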
(iii) (2 points) True or false. Your GAN is finished training when D(G(z)) is close
to 1. Please explain your answer for full credit.
(d) We would like to train a self-supervised generative model that can learn encodings z of
a given input image X by reconstructing the same input image as X̂. For our example,
let's say our input images are MNIST digits. Consider the architecture shown below:
[Figure: x → q(z | x) → z (latent-space representation) → p(x | z) → x̂]
(i) (3 points) In plain English, intuitively, explain what each loss function is trying
to optimize.
Solution:
Alice: Wants q(z | x) to be as close to N(0, 1) as possible. This is because we want the distribution of encodings to be relatively spread out, so that most of the latent space is "covered" by a digit.
Bob: Wants the same as Alice, but also wants to minimize the reconstruction error (wants X and X̂ to be as similar as possible).
Carol: Wants to minimize reconstruction error of X
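A small sketch of the three objectives as described above, assuming a Gaussian encoder q(z | x) = N(μ, σ²) and a squared-error reconstruction term; the exact loss definitions from the exam are not reproduced in this document, so the formulas below are illustrative.

```python
import numpy as np

# Placeholder names (mu, log_var, x, x_hat) are assumptions, not from the exam.
def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, sigma^2) || N(0, 1) ) summed over latent dimensions
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def reconstruction_error(x, x_hat):
    return np.sum((x - x_hat) ** 2)

def alice_loss(mu, log_var, x, x_hat):
    return kl_to_standard_normal(mu, log_var)                                    # KL only

def bob_loss(mu, log_var, x, x_hat):
    return kl_to_standard_normal(mu, log_var) + reconstruction_error(x, x_hat)  # VAE-style

def carol_loss(mu, log_var, x, x_hat):
    return reconstruction_error(x, x_hat)                                        # plain autoencoder
```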
(ii) (3 points) Say we choose the dimension of z to be 2 so we can plot the z’s on a
graph. Consider the three graphs below where each of the two axes is a dimension
of z. The different colours indicate different MNIST digits as indicated by the
legend. The plots are numbered left to right as (1), (2) and (3).
[Figure 3: Plotted graphs for different loss functions: three scatter plots of the 2-D latent space, numbered left to right as (1), (2) and (3); the axes span roughly ±8 in plot (1) and ±4 in plots (2) and (3).]
Match each graph to Alice, Bob and Carol (draw lines connecting the two columns
if you printed the midterm) and explain your reasoning for each.
Alice (1)
Bob (2)
Carol (3)
Solution:
Alice: 2.
• This is clearly a plot of an N(0, 1) distribution along both axes (most of the points, say around 95% of them, are clustered in a circle at the center where the mean is 0, with the remaining 5% slightly more spread out).
• Notice that there is no separation between digits, as there is nothing in Alice's objective that encourages the encodings of different digits to be distinguishable (there is no reconstruction term).
Consider the following neural network with arbitrary dimensions (i.e., x is not necessarily 5-dimensional, etc.):

where σ is the sigmoid activation function, ⊙ is the operator for element-wise products, and y is a k-dimensional vector of 1's and 0's. Note that yi represents the i-th element of vector y, and likewise for ŷi.
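The figure defining the network is not reproduced in this document. From the solutions below, the forward pass appears to be z[1] = W[1]x + b[1], h = ReLU(z[1]), z[2] = W[2]h + b[2], ŷ = σ(z[2]), with loss L = Σi max(0, 1 − yi ŷi); the NumPy sketch below is written under that assumption.

```python
import numpy as np

# Assumed forward pass (reconstructed from the derivatives below, since the original
# figure is missing): two affine layers, ReLU then sigmoid, and a hinge-style loss.
def forward(x, y, W1, b1, W2, b2):
    z1 = W1 @ x + b1
    h = np.maximum(z1, 0.0)              # ReLU
    z2 = W2 @ h + b2
    y_hat = 1.0 / (1.0 + np.exp(-z2))    # sigmoid
    loss = np.sum(np.maximum(0.0, 1.0 - y * y_hat))
    return z1, h, z2, y_hat, loss
```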
(i) (3 points) What is ∂L/∂ ŷi ? You must write the most reduced form to get full credit.
Solution:
∂L/∂ŷi = ∂/∂ŷi max(0, 1 − yi ŷi)
       = { 0     if 1 − yi ŷi < 0
         { −yi   if 1 − yi ŷi ≥ 0
       = −yi   (refer to the note below)
Notice that because we use the sigmoid activation function and y is a vector of 1’s
and 0’s, 0 ≤ yi ŷi ≤ 1. Therefore, 0 ≤ 1 − yi ŷi ≤ 1. Hence, the first case cannot
ever happen.
(ii) (2 points) What is ∂L/∂ŷ? Refer to this result as dŷ. Please write your answer according to the shape convention, i.e., your result should be the same shape as ŷ.

Solution:

dŷ = −y
(iii) (2 points) What is ∂L/∂z[2]? Refer to this result as dz[2]. To receive full credit, your answer must include dŷ and your answer must be in the most reduced form.
Solution:

dz[2] = dŷ ⊙ ŷ ⊙ (1 − ŷ)
(iv) (2 points) What is ∂L/∂W[2]? Please refer to this result as dW[2]. Please include dz[2] in your answer.

Solution:

dW[2] = dz[2] h⊤
(v) (2 points) What is ∂L/∂b[2]? Please refer to this result as db[2]. Please include dz[2] in your answer.

Solution:

db[2] = dz[2]
(vi) (2 points) What is ∂L/∂h? Please refer to this result as dh. Please include dz[2] in your answer.

Solution:

dh = W[2]⊤ dz[2]
(vii) (2 points) What is ∂L/∂z[1]? Refer to this result as dz[1]. Please include dh in your answer.

Solution:

dz[1] = dh ⊙ δ

where δ is a vector of the same size as z[1] with δi = 1 if z[1]i > 0 and δi = 0 otherwise.
(viii) (2 points) What is ∂L/∂W[1]? Please refer to this result as dW[1]. Please include dz[1] in your answer.

Solution:

dW[1] = dz[1] x⊤
(ix) (2 points) What is ∂L/∂b[1]? Please refer to this result as db[1]. Please include dz[1] in your answer.

Solution:

db[1] = dz[1]
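Putting parts (i) through (ix) together, here is a sketch of the full backward pass matching the assumed forward pass given before part (i):

```python
import numpy as np

# Backward pass assembled from the results above; variable names mirror dŷ, dz[2], dW[2], ...
def backward(x, y, W2, z1, h, y_hat):
    dy_hat = -y                               # (ii)
    dz2 = dy_hat * y_hat * (1.0 - y_hat)      # (iii)
    dW2 = np.outer(dz2, h)                    # (iv)   dz[2] h^T
    db2 = dz2                                 # (v)
    dh = W2.T @ dz2                           # (vi)
    dz1 = dh * (z1 > 0)                       # (vii)  delta = 1[z[1] > 0]
    dW1 = np.outer(dz1, x)                    # (viii) dz[1] x^T
    db1 = dz1                                 # (ix)
    return dW1, db1, dW2, db2
```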
In this problem, we will explore training neural networks with discrete functions. Consider
a neural network encoder z = softmax[fθ (X)]. You can think of fθ as an MLP for this
example. z is the softmax output and we want to discretize this output into a one-hot
representation before we pass it into the next layer. Consider the operation one_hot where
one_hot(z) returns a one-hot vector where the 1 is at the argmax location. For example,
one_hot([0.1, 0.5, 0.4]) = [0, 1, 0]. Say we want to pass this output to another FC layer gϕ to
get a final output y.
(i) (1 point) Is there a problem with the neural network defined below?
y = g(one_hot(softmax(f (X))))
(ii) (2 points) Now consider a temperature-scaled softmax Sτ(v) = softmax(v/τ). Here dividing by τ means every element in the vector is divided by τ. Obviously, when τ = 1, this is exactly the same as the regular softmax function. What happens when τ → ∞? What happens when τ → 0?
Hint: You don't need to prove these limits; showing a trend and justifying it is good enough.
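A quick numerical illustration of the trend (assuming Sτ(v) = softmax(v/τ) as above); the input vector is the one from the one_hot example.

```python
import numpy as np

# Temperature-scaled softmax: large tau flattens the distribution toward uniform,
# small tau sharpens it toward a one-hot vector at the argmax.
def softmax_tau(v, tau):
    e = np.exp((v - v.max()) / tau)   # subtract the max for numerical stability
    return e / e.sum()

v = np.array([0.1, 0.5, 0.4])
print(softmax_tau(v, tau=1.0))    # ordinary softmax
print(softmax_tau(v, tau=100.0))  # tau -> infinity: approaches uniform [1/3, 1/3, 1/3]
print(softmax_tau(v, tau=0.01))   # tau -> 0: approaches the one-hot vector [0, 1, 0]
```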
(iii) (4 points) Assume f (X) = w⊤ X where w is a weight vector. What is the derivative
of Sτ (f (X))i with respect to w for a fixed τ ? In other words, what is ∂Sτ (w⊤ X)i /∂w,
the derivative of the i-th element of Sτ (w⊤ X) with respect to w? You must write your
answer in the most reduced form to receive full credit.
Solution:

∂Sτ(w⊤X)i / ∂w
  = ∂/∂w [ softmax(w⊤X/τ)i ]
  = ∂/∂w [ exp(w⊤Xi/τ) / Σj exp(w⊤Xj/τ) ]
  = [ (Xi/τ) exp(w⊤Xi/τ) · Σj exp(w⊤Xj/τ) − exp(w⊤Xi/τ) · Σj (Xj/τ) exp(w⊤Xj/τ) ] / ( Σj exp(w⊤Xj/τ) )²
  = (1/τ) [ Xi Sτ(w⊤X)i − Sτ(w⊤X)i Σj Xj Sτ(w⊤X)j ]
  = (1/τ) Sτ(w⊤X)i [ Xi − Σj Xj Sτ(w⊤X)j ]
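A finite-difference check of the derived formula; the sizes and random values below are made up, and this is only a sanity check, not part of the exam.

```python
import numpy as np

# Verify d S_tau(w^T X)_i / dw = (1/tau) * S_i * (X_i - sum_j X_j S_j) numerically.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # columns X_j are the per-class feature vectors
w = rng.normal(size=4)
tau, i = 0.7, 1

def S(w):
    e = np.exp((w @ X) / tau)
    return e / e.sum()

analytic = (1.0 / tau) * S(w)[i] * (X[:, i] - X @ S(w))
eps = 1e-6
numeric = np.array([(S(w + eps * np.eye(4)[k])[i] - S(w - eps * np.eye(4)[k])[i]) / (2 * eps)
                    for k in range(4)])
print(np.allclose(analytic, numeric, atol=1e-6))   # True
```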
(iv) (2 points) How can we use this modified softmax function S to get discrete vectors
in our neural networks? Perhaps we cannot get perfect one-hot vectors but can we get
close?
(v) (2 points) What problems could arise by setting τ to very low values?
Solution: Very low values of τ lead to very large, high-variance gradients (the derivative scales with 1/τ), which makes the network difficult to train, i.e., it causes instability.
Consider the pseudocode below for an MLP model to perform regression. The model takes
an input of dim 10, hidden layer of size 20 with ReLU activations and outputs a real number.
There are biases in both layers.
Weights are initialized from the random normal distribution and biases to 0.
Point out the errors in the code with line numbers and suggest fixes to them.
Your fixes should suggest code changes and not just English descriptions.
Functions/classes that are not implemented completely can be assumed to be
correctly written and have no errors in them.
1  import numpy as np
2
3  def mse_loss(predictions, targets):
4      """
5      Returns the Mean Squared Error Loss given the
6      predictions and targets
7
8      Args:
9          predictions (np.ndarray): Model predictions
10         targets (np.ndarray): True outputs
11
12     Returns:
13         Mean squared error loss between predictions and targets
14     """
15     return 0.5 * \
16         (predictions.reshape(-1) - targets.reshape(-1))**2
17
18
19 def dropout(x, p=0.1):
20     """
21     Applies dropout on the input x with a drop
22     probability of p
23
24     Args:
25         x (np.ndarray): 2D array input
26         p (float): dropout probability
27
28     Returns:
29         Array with values dropped out
30     """
31     ind = np.random.choice(x.shape[1]*x.shape[0], replace=False,
32                            size=int(x.shape[1]*x.shape[0]*p))
33     x[np.unravel_index(ind, x.shape)] = 0
34     return x / p
35
36
37 def get_grads(loss, w1, b1, w2, b2):
38     """
39     This function takes the loss and returns the gradients
40     for the weights and biases
41     YOU MAY ASSUME THIS FUNCTION HAS NO ERRORS
42     """
43     ...
44     return dw1, db1, dw2, db2
45
46 def sample_batches(data, batchsize):
47     """
48     This function samples batches of size `batchsize`
49     from the training data.
50     YOU MAY ASSUME THIS FUNCTION HAS NO ERRORS
51     """
52     ...
53     return x, y
54
55 class Adam:
56     """
57     The class for the Adam optimizer that
58     accepts the parameters and updates them.
59     YOU MAY ASSUME THIS CLASS AND ITS METHODS HAVE
60     NO ERRORS
61     """
62     def __init__(self, w1, b1, w2, b2):
63         ...
64
65     def update(self):
66         """
67         Updates the params according to the
68         Adam update rule
69         """
70         ...
71
72 class MLP:
73     """
74     MLP Model to perform regression
75     """
76     def __init__(self):
77         super().__init__()
78         self.w1 = np.random.randn(10, 20)
79         self.b1 = np.zeros(10)
Solution:
Line 15: Add np.mean
Line 19: Update the dropout function to take a training argument and only apply
dropout when training=True
Line 34: Should be x / (1 - p)
Lines 78,80: Bias shapes should be 20 and 1 respectively
Lines 93, 95: Need to do matrix multiplication, not a Hadamard product
Line 93: Missing ReLU activation
Line 121: loss = mse_loss(out, y)
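For reference, here is a sketch of the first few fixes applied to the portion of the code reproduced above (lines 15 and 19-34); the remaining fixes refer to code that is not fully shown in this document.

```python
import numpy as np

def mse_loss(predictions, targets):
    # Line 15 fix: reduce to a scalar with np.mean
    return 0.5 * np.mean((predictions.reshape(-1) - targets.reshape(-1)) ** 2)

def dropout(x, p=0.1, training=True):
    # Line 19 fix: only drop units at training time
    if not training:
        return x
    ind = np.random.choice(x.shape[0] * x.shape[1], replace=False,
                           size=int(x.shape[0] * x.shape[1] * p))
    x[np.unravel_index(ind, x.shape)] = 0
    # Line 34 fix: inverted dropout rescales by the keep probability (1 - p), not p
    return x / (1 - p)
```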
END OF PAPER