CS230 Midterm Solutions Fall 2022
Multiple Choice: 14
Short Answer: 30
Feed-Forward Neural Network: 15
Backpropagation: 19
Discrete Functions in Neural Networks: 11
Debugging Code: 18
Total: 88
• This exam is open book, but collaboration with anyone else, either in person or online,
is strictly forbidden pursuant to The Stanford Honor Code.
• In all cases, and especially if you’re stuck or unsure of your answers, explain your
work, including showing your calculations and derivations! We’ll give partial
credit for good explanations of what you were trying to do.
Name:
SUNETID: @stanford.edu
Signature:
For each of the following questions, circle the letter of your choice. Each question has AT
LEAST one correct option unless explicitly mentioned. No explanation is required.
(a) (2 points) Imagine you are tasked with building a model to diagnose COVID-19 using
chest CT images. You are provided with 100,000 chest CT images, 1,000 of which are
labelled. Which learning technique has the best chance of succeeding on this task?
(SELECT ONLY ONE)
(i) Transfer Learning from a ResNet50 that was pre-trained on chest CT images to
detect tumors
(ii) Train a GAN to generate synthetic labeled data and train your model on all the
ground truth and synthetic data
(iii) Supervised Learning directly on the 1,000 labeled images
(iv) Augment the labeled data using random cropping and train using supervised
learning
Solution: (i)
(i) The model was pre-trained on the same type of data, namely CT images.
Therefore, transfer learning from such a model would make the most sense,
despite the pre-training being done for a slightly different task.
(ii) GANs require a lot of data to train, and in this case we would have to generate both COVID-19 and non-COVID CT images. With only 1k labelled images we would have to train separate GANs for the COVID and non-COVID classes, and we do not have enough labelled examples of either.
(iii) There wouldn’t be enough data in this case, and your model would easily
overfit to the 1k examples.
(iv) Although using random cropping can help, similar to (iii), there just isn’t
enough data to train a model in a fully supervised manner from scratch.
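As an illustration of option (i), below is a minimal PyTorch sketch of this kind of transfer learning. It assumes a recent torchvision, a hypothetical checkpoint file ct_pretrained.pth holding the weights of the tumor-detection ResNet50, and a made-up 2-class head and learning rate; none of these specifics come from the exam.

```python
import torch
import torch.nn as nn
from torchvision import models

# Hypothetical sketch: reuse a ResNet50 pre-trained on chest CT images (tumor detection)
# and fine-tune only a new 2-class head (COVID-19 vs. healthy) on the 1,000 labeled images.
model = models.resnet50(weights=None)                    # same architecture as the source model
state = torch.load("ct_pretrained.pth")                  # hypothetical checkpoint name
state = {k: v for k, v in state.items() if not k.startswith("fc.")}  # drop the old task head
model.load_state_dict(state, strict=False)

for p in model.parameters():                             # freeze the pre-trained backbone
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 2)            # new trainable 2-class head
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```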
(b) (2 points) Imagine you are tasked with training a lane detection system that can distinguish between two different types of lanes: lanes in the same direction as the car moves and lanes in the opposite direction. Assume all the images are taken in common two-way streets in California from a car's front-view camera. Which of the following data augmentation techniques can be used for your task?
Solution: (iii) and (iv)
(i) Images flipped vertically no longer make sense, i.e., your car will never be driving upside-down.
(ii) Flipping horizontally breaks the direction of traffic, since lanes going in the opposite direction should always appear on the left side of the image.
(iii) Adding artificial fog to the images is a good technique to apply in this case, as it can simulate more realistic driving conditions.
(iv) Applying random masking is a good data augmentation technique here, since it covers cases where obstacles on the road occlude your view.
(c) (2 points) You are training a binary classifier and are unsatisfied with the F1-score as a way of combining precision and recall into a single number. You are considering alternatives to the F1-score. Which of the following would be reasonable candidate metric(s):
(i) |precision - recall|
(ii) recall/precision
(iii) precision × recall
(iv) max(precision, recall)
Solution: (iii)
Like the F1-score, precision × recall is high only when both precision and recall are high. The other candidates can look good even when one of the two is poor: |precision − recall| is 0 whenever the two are equal (however low), recall/precision rewards trading precision for recall, and max(precision, recall) ignores the weaker of the two.
(d) (2 points) Dropout can be considered as a form of ensembling over variants of a neural
network. Consider a neural network with N nodes, each of which can be dropped
during training independently with a probability 0 < p < 1. What is the total number
of unique models that can be realized on applying dropout?
(i) ⌊N × p⌋
(ii) (⌊N × p⌋)^N
(iii) 2^⌊N×p⌋
(iv) 2^N
Solution: (iv)
Each node has 2 possibilities: to be kept or to be dropped, and we have N nodes
total. Note that this is independent of p, as we ask for the total number of unique
models, not the expected value.
(e) (2 points) In practice, when using Early Stopping one needs to set a "buffer" hyperparameter, which determines how many epochs training is allowed to continue without any improvement in validation performance before it is terminated. After training is terminated, the model with the best validation performance is used. What is the benefit of setting the buffer parameter to a value of k = 5 epochs instead of 0?
Solution: (i)
In real datasets, especially smaller ones, validation performance can exhibit small fluctuations from epoch to epoch. With a buffer (typically referred to as "patience") of 0, training is likely to stop prematurely when validation performance fails to improve in a given epoch purely because of such random fluctuation. Setting the patience to a value such as k = 5 epochs makes it much less likely that we stop too early and underfit.
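To make the role of the patience value concrete, here is a small NumPy sketch on a made-up validation curve (the numbers are purely illustrative, not from any real run):

```python
import numpy as np

# Toy early stopping: scan a noisy validation-loss curve and stop once no improvement
# has been seen for `patience` consecutive epochs; report where training stops and
# which epoch held the best model.
val_losses = [0.90, 0.70, 0.61, 0.62, 0.60, 0.63, 0.64, 0.65, 0.66, 0.67, 0.68]

def stop_epoch(val_losses, patience):
    best, best_epoch, waited = np.inf, 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return epoch, best_epoch

print(stop_epoch(val_losses, patience=0))  # stops at epoch 3 and misses the best epoch (4)
print(stop_epoch(val_losses, patience=5))  # survives the blip at epoch 3 and keeps the best epoch (4)
```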
(f) (2 points) Suppose that you are training a deep neural network and observe that
the training curve contains a lot of oscillations, especially at early stages of training.
Which of the following techniques can help stabilize training?
(i) can help stabilize training at the beginning stages when there is high uncer-
tainty
(ii) helps reduce overfitting, not necessarily the training stability
(iii) constrains the maximum magnitude of the gradients, and hence the step size
(g) (2 points) You have a 2-layer MLP with Sigmoid activations in the hidden layers
that you want to train with SGD. Your network weights are initialized from N (10, 1).
From the very first epoch, you observe that some weights in the first layer are not
getting updated or are updated very slowly compared to the second layer. Which of
the following can fix this issue?
Solution: (i) and (iv) will help. Initializing the weights to very large values causes large pre-activations, so the sigmoid saturates (outputs close to 1) and its derivative is almost 0; the gradients reaching the first layer therefore vanish. Tanh has the same problem. Switching to leaky ReLU or initializing the weights with smaller values fixes this.
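A quick numerical check of this explanation; the layer sizes and the input below are made up, only the N(10, 1) initialization comes from the question:

```python
import numpy as np

# With weights drawn from N(10, 1) and non-negative inputs, first-layer pre-activations
# are large, so sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) is essentially 0 and the
# gradients reaching the first layer vanish.
rng = np.random.default_rng(0)
x = rng.uniform(size=(10, 1))                        # e.g. normalized intensities in [0, 1]
W1 = rng.normal(loc=10.0, scale=1.0, size=(8, 10))   # N(10, 1) initialization
z1 = W1 @ x
a1 = 1.0 / (1.0 + np.exp(-z1))                       # sigmoid

print(z1.min())                  # pre-activations are all far from 0 ...
print((a1 * (1 - a1)).max())     # ... so the largest sigmoid derivative is ~0 (vanishing gradient)
```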
The questions in this section can be answered in less than 3 sentences. Please be concise in
your responses.
(a) Imagine that you are building an app to optimize wait times in US emergency rooms
while prioritizing severe cases. You build a deep learning-based app that works as
follows:
You trained and tested your model using 3 months' worth of data from hospitals in the
US, before deploying it to several hospitals in the San Francisco Bay Area.
(i) (2 points) Now you want to deploy your app internationally. Do you think your
app will work well? Why or why not?
You noticed that the app tends to rank African American and Hispanic patients
lower than patients from other ethnic backgrounds, even if those patients came
into the emergency department with more severe cases.
(ii) (1 point) Why is this a problem?
(b) Graph Neural Networks (GNNs) are a family of neural networks that can operate on
graph-structured data. Here, we describe a basic 2-layer GNN. Consider a graph with
k nodes labeled {1, 2, . . . , k}. For simplicity, assume that each node i is associated
with a scalar input x_i. The first layer of our GNN, parameterized by scalar parameters w[1] and b[1], performs the following operation to compute a_i[1] at each node i:

    a_i[1] = ReLU( x_i + w[1] ∑_{n ∈ N(i)} x_n + b[1] )        (1)

where N(i) is the set of neighbors of node i in the graph (i.e., all nodes that share an edge with node i). The second layer, parameterized by scalar parameters w[2] and b[2], analogously computes a_i[2] for each node i:

    a_i[2] = ReLU( a_i[1] + w[2] ∑_{n ∈ N(i)} a_n[1] + b[2] )        (2)
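Below is a minimal NumPy sketch of the forward pass defined by equations (1) and (2); the example graph, weights, and inputs are made up and are not the graph from the figure.

```python
import numpy as np

# 2-layer scalar GNN: each node combines its own value with the sum over its neighbors,
# exactly as in equations (1) and (2) above.
def gnn_forward(x, neighbors, w1, b1, w2, b2):
    relu = lambda v: max(v, 0.0)
    a1 = [relu(x[i] + w1 * sum(x[n] for n in neighbors[i]) + b1) for i in range(len(x))]
    a2 = [relu(a1[i] + w2 * sum(a1[n] for n in neighbors[i]) + b2) for i in range(len(x))]
    return np.array(a1), np.array(a2)

# Example: a path graph 1 - 2 - 3 (stored 0-indexed).
neighbors = {0: [1], 1: [0, 2], 2: [1]}
x = np.array([1.0, -2.0, 0.5])
a1, a2 = gnn_forward(x, neighbors, w1=0.5, b1=0.1, w2=-0.3, b2=0.0)
print(a1, a2)
```

Note that x_j can influence a_i[2] only if node j is within two hops of node i (and only when the relevant ReLUs are active), which is exactly what part (i) probes.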
Answer the following questions for the graph in the figure below, with labels as shown
in the nodes.
(i) (2 points) What is ∂a_1[2]/∂x_6?
[Figure: the graph for this question, with nodes labeled 1 through 6; the edge structure is not recoverable from the extracted text.]
(ii) (2 points) You are allowed to add one additional node (suppose this is node 7) and accompanying edges such that the value of ∂a_1[2]/∂x_6 changes from the value computed in part (i). Describe how you would do this with the fewest number of edges accompanying node 7.
(c) Consider the figure below, which relates to the training procedure of a GAN. The figure shows the cost function of the generator plotted against the output of the discriminator on a generated image G(z). Concerning the discriminator's output, 0 means that the discriminator thinks the input "has been generated by G", whereas 1 means the discriminator thinks the input "comes from the real data".
(i) (2 points) After one round of training the generator and discriminator, is the
value of D(G(z)) closer to 0 or closer to 1? Explain.
(ii) (2 points) Two cost functions are presented in Figure 1 above. Which one would
you choose to train your GAN? Justify your answer.
Solution: The non-saturating cost, because its gradient is largest exactly where the cost is largest, i.e., when D(G(z)) is close to 0 early in training, so the generator still receives a strong learning signal.
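Since Figure 1 is not reproduced here, assume the two curves are the standard choices: the saturating cost log(1 − D(G(z))) and the non-saturating cost −log D(G(z)). A quick gradient comparison illustrates the reasoning:

```python
import numpy as np

# Gradients of the two generator costs with respect to the discriminator output D(G(z)).
# Early in training D(G(z)) is near 0 (fakes are easy to spot): the saturating cost gives
# a tiny gradient there, while the non-saturating cost gives a very large one.
d = np.array([0.01, 0.1, 0.5, 0.9])   # possible values of D(G(z))

grad_saturating = -1.0 / (1.0 - d)    # d/dD [ log(1 - D) ]
grad_non_saturating = -1.0 / d        # d/dD [ -log(D) ]

print(grad_saturating)      # approx [-1.01, -1.11, -2.0, -10.0]
print(grad_non_saturating)  # approx [-100.0, -10.0, -2.0, -1.11]
```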
(iii) (2 points) True or false. Your GAN is finished training when D(G(z)) is close
to 1. Please explain your answer for full credit.
(d) We would like to train a self-supervised generative model that can learn encodings z of
a given input image X by reconstructing the same input image as X̂. For our example,
let's say our input images are MNIST digits. Consider the architecture shown below:
[Figure: x → q(z | x) → z (latent-space representation) → p(x | z) → x̂]
(i) (3 points) In plain English, intuitively, explain what each loss function is trying
to optimize.
Solution:
Alice: Wants q(z | x) to be as close to N(0, 1) as possible. This is because we want the distribution of encodings to be relatively spread out, so that most of the latent space is "covered" by a digit.
Bob: Wants the same as Alice, but also wants to minimize the reconstruction error (wants X and X̂ to be as similar as possible).
Carol: Wants to minimize reconstruction error of X
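A small sketch of the three objectives as described above, assuming a Gaussian encoder q(z | x) = N(μ, σ²) and a squared-error reconstruction term; the exact loss definitions from the exam are not reproduced in this document, so the formulas below are illustrative.

```python
import numpy as np

# Placeholder names (mu, log_var, x, x_hat) are assumptions, not from the exam.
def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, sigma^2) || N(0, 1) ) summed over latent dimensions
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def reconstruction_error(x, x_hat):
    return np.sum((x - x_hat) ** 2)

def alice_loss(mu, log_var, x, x_hat):
    return kl_to_standard_normal(mu, log_var)                                    # KL only

def bob_loss(mu, log_var, x, x_hat):
    return kl_to_standard_normal(mu, log_var) + reconstruction_error(x, x_hat)  # VAE-style

def carol_loss(mu, log_var, x, x_hat):
    return reconstruction_error(x, x_hat)                                        # plain autoencoder
```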
(ii) (3 points) Say we choose the dimension of z to be 2 so we can plot the z’s on a
graph. Consider the three graphs below where each of the two axes is a dimension
of z. The different colours indicate different MNIST digits as indicated by the
legend. The plots are numbered left to right as (1), (2) and (3).
[Figure 3: Plotted graphs for different loss functions: three scatter plots of the 2-D latent space, numbered left to right as (1), (2) and (3); the axes span roughly ±8 in plot (1) and ±4 in plots (2) and (3).]
Match each graph to Alice, Bob and Carol (draw lines connecting the two columns
if you printed the midterm) and explain your reasoning for each.
Alice (1)
Bob (2)
Carol (3)
Solution:
Alice: 2.
• This is clearly a plot of an N(0, 1) distribution along both axes (most of the points, say around 95% of them, are clustered in a circle at the center where the mean is 0, with the remaining 5% slightly more spread out).
• Notice that there is no separation between digits, as there is nothing in Alice's objective that encourages the encodings of different digits to be distinguishable (there is no reconstruction term).
Consider the following neural network with arbitrary dimensions (i.e., x is not necessarily 5-dimensional, etc.):

where σ is the sigmoid activation function, ⊙ is the operator for element-wise products, and y is a k-dimensional vector of 1's and 0's. Note that yi represents the i-th element of vector y, and likewise for ŷi.
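The figure defining the network is not reproduced in this document. From the solutions below, the forward pass appears to be z[1] = W[1]x + b[1], h = ReLU(z[1]), z[2] = W[2]h + b[2], ŷ = σ(z[2]), with loss L = Σi max(0, 1 − yi ŷi); the NumPy sketch below is written under that assumption.

```python
import numpy as np

# Assumed forward pass (reconstructed from the derivatives below, since the original
# figure is missing): two affine layers, ReLU then sigmoid, and a hinge-style loss.
def forward(x, y, W1, b1, W2, b2):
    z1 = W1 @ x + b1
    h = np.maximum(z1, 0.0)              # ReLU
    z2 = W2 @ h + b2
    y_hat = 1.0 / (1.0 + np.exp(-z2))    # sigmoid
    loss = np.sum(np.maximum(0.0, 1.0 - y * y_hat))
    return z1, h, z2, y_hat, loss
```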
(i) (3 points) What is ∂L/∂ ŷi ? You must write the most reduced form to get full credit.
Solution:
∂L/∂ŷi = ∂/∂ŷi max(0, 1 − yi ŷi)
       = { 0     if 1 − yi ŷi < 0
         { −yi   if 1 − yi ŷi ≥ 0
       = −yi   (refer to the note below)
Notice that because we use the sigmoid activation function and y is a vector of 1’s
and 0’s, 0 ≤ yi ŷi ≤ 1. Therefore, 0 ≤ 1 − yi ŷi ≤ 1. Hence, the first case cannot
ever happen.
(ii) (2 points) What is ∂L/∂ŷ? Refer to this result as dŷ. Please write your answer according to the shape convention, i.e., your result should be the same shape as ŷ.

Solution:

dŷ = −y
(iii) (2 points) What is ∂L/∂z[2]? Refer to this result as dz[2]. To receive full credit, your answer must include dŷ and your answer must be in the most reduced form.
Solution:

dz[2] = dŷ ⊙ ŷ ⊙ (1 − ŷ)
(iv) (2 points) What is ∂L/∂W[2]? Please refer to this result as dW[2]. Please include dz[2] in your answer.

Solution:

dW[2] = dz[2] h⊤
(v) (2 points) What is ∂L/∂b[2]? Please refer to this result as db[2]. Please include dz[2] in your answer.

Solution:

db[2] = dz[2]
(vi) (2 points) What is ∂L/∂h? Please refer to this result as dh. Please include dz[2] in your answer.

Solution:

dh = W[2]⊤ dz[2]
(vii) (2 points) What is ∂L/∂z[1]? Refer to this result as dz[1]. Please include dh in your answer.

Solution:

dz[1] = dh ⊙ δ

where δ is a vector of the same size as z[1] with δi = 1 if z[1]i > 0 and δi = 0 otherwise.
(viii) (2 points) What is ∂L/∂W[1]? Please refer to this result as dW[1]. Please include dz[1] in your answer.

Solution:

dW[1] = dz[1] x⊤
(ix) (2 points) What is ∂L/∂b[1]? Please refer to this result as db[1]. Please include dz[1] in your answer.

Solution:

db[1] = dz[1]
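Putting parts (i) through (ix) together, here is a sketch of the full backward pass matching the assumed forward pass given before part (i):

```python
import numpy as np

# Backward pass assembled from the results above; variable names mirror dŷ, dz[2], dW[2], ...
def backward(x, y, W2, z1, h, y_hat):
    dy_hat = -y                               # (ii)
    dz2 = dy_hat * y_hat * (1.0 - y_hat)      # (iii)
    dW2 = np.outer(dz2, h)                    # (iv)   dz[2] h^T
    db2 = dz2                                 # (v)
    dh = W2.T @ dz2                           # (vi)
    dz1 = dh * (z1 > 0)                       # (vii)  delta = 1[z[1] > 0]
    dW1 = np.outer(dz1, x)                    # (viii) dz[1] x^T
    db1 = dz1                                 # (ix)
    return dW1, db1, dW2, db2
```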
In this problem, we will explore training neural networks with discrete functions. Consider
a neural network encoder z = softmax[fθ (X)]. You can think of fθ as an MLP for this
example. z is the softmax output and we want to discretize this output into a one-hot
representation before we pass it into the next layer. Consider the operation one_hot where
one_hot(z) returns a one-hot vector where the 1 is at the argmax location. For example,
one_hot([0.1, 0.5, 0.4]) = [0, 1, 0]. Say we want to pass this output to another FC layer gϕ to
get a final output y.
(i) (1 point) Is there a problem with the neural network defined below?
y = g(one_hot(softmax(f (X))))
(ii) (2 points) Now consider a temperature-scaled softmax Sτ(v) = softmax(v/τ). Here dividing by τ means every element in the vector is divided by τ. Obviously, when τ = 1, this is exactly the same as the regular softmax function. What happens when τ → ∞? What happens when τ → 0?
Hint: You don't need to prove these limits; showing a trend and justifying it is good enough.
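A quick numerical illustration of the trend (assuming Sτ(v) = softmax(v/τ) as above); the input vector is the one from the one_hot example.

```python
import numpy as np

# Temperature-scaled softmax: large tau flattens the distribution toward uniform,
# small tau sharpens it toward a one-hot vector at the argmax.
def softmax_tau(v, tau):
    e = np.exp((v - v.max()) / tau)   # subtract the max for numerical stability
    return e / e.sum()

v = np.array([0.1, 0.5, 0.4])
print(softmax_tau(v, tau=1.0))    # ordinary softmax
print(softmax_tau(v, tau=100.0))  # tau -> infinity: approaches uniform [1/3, 1/3, 1/3]
print(softmax_tau(v, tau=0.01))   # tau -> 0: approaches the one-hot vector [0, 1, 0]
```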
(iii) (4 points) Assume f (X) = w⊤ X where w is a weight vector. What is the derivative
of Sτ (f (X))i with respect to w for a fixed τ ? In other words, what is ∂Sτ (w⊤ X)i /∂w,
the derivative of the i-th element of Sτ (w⊤ X) with respect to w? You must write your
answer in the most reduced form to receive full credit.
Solution:

∂Sτ(w⊤X)i / ∂w
  = ∂/∂w [ softmax(w⊤X/τ)i ]
  = ∂/∂w [ exp(w⊤Xi/τ) / Σj exp(w⊤Xj/τ) ]
  = [ (Xi/τ) exp(w⊤Xi/τ) · Σj exp(w⊤Xj/τ) − exp(w⊤Xi/τ) · Σj (Xj/τ) exp(w⊤Xj/τ) ] / ( Σj exp(w⊤Xj/τ) )²
  = (1/τ) [ Xi Sτ(w⊤X)i − Sτ(w⊤X)i Σj Xj Sτ(w⊤X)j ]
  = (1/τ) Sτ(w⊤X)i [ Xi − Σj Xj Sτ(w⊤X)j ]
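A finite-difference check of the derived formula; the sizes and random values below are made up, and this is only a sanity check, not part of the exam.

```python
import numpy as np

# Verify d S_tau(w^T X)_i / dw = (1/tau) * S_i * (X_i - sum_j X_j S_j) numerically.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # columns X_j are the per-class feature vectors
w = rng.normal(size=4)
tau, i = 0.7, 1

def S(w):
    e = np.exp((w @ X) / tau)
    return e / e.sum()

analytic = (1.0 / tau) * S(w)[i] * (X[:, i] - X @ S(w))
eps = 1e-6
numeric = np.array([(S(w + eps * np.eye(4)[k])[i] - S(w - eps * np.eye(4)[k])[i]) / (2 * eps)
                    for k in range(4)])
print(np.allclose(analytic, numeric, atol=1e-6))   # True
```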
(iv) (2 points) How can we use this modified softmax function S to get discrete vectors
in our neural networks? Perhaps we cannot get perfect one-hot vectors but can we get
close?
(v) (2 points) What problems could arise by setting τ to very low values?
Solution: Very low values of τ lead to very large, high-variance gradients (the derivative scales with 1/τ), which makes the network difficult to train, i.e., it causes instability.
Consider the pseudocode below for an MLP model to perform regression. The model takes
an input of dim 10, hidden layer of size 20 with ReLU activations and outputs a real number.
There are biases in both layers.
Weights are initialized from the random normal distribution and biases to 0.
Point out the errors in the code with line numbers and suggest fixes to them.
Your fixes should suggest code changes and not just English descriptions.
Functions/classes that are not implemented completely can be assumed to be
correctly written and have no errors in them.
1  import numpy as np
2
3  def mse_loss(predictions, targets):
4      """
5      Returns the Mean Squared Error Loss given the
6      predictions and targets
7
8      Args:
9          predictions (np.ndarray): Model predictions
10         targets (np.ndarray): True outputs
11
12     Returns:
13         Mean squared error loss between predictions and targets
14     """
15     return 0.5 * \
16         (predictions.reshape(-1) - targets.reshape(-1))**2
17
18
19 def dropout(x, p=0.1):
20     """
21     Applies dropout on the input x with a drop
22     probability of p
23
24     Args:
25         x (np.ndarray): 2D array input
26         p (float): dropout probability
27
28     Returns:
29         Array with values dropped out
30     """
31     ind = np.random.choice(x.shape[1]*x.shape[0], replace=False,
32                            size=int(x.shape[1]*x.shape[0]*p))
33     x[np.unravel_index(ind, x.shape)] = 0
34     return x / p
35
36
37 def get_grads(loss, w1, b1, w2, b2):
38     """
39     This function takes the loss and returns the gradients
40     for the weights and biases
41     YOU MAY ASSUME THIS FUNCTION HAS NO ERRORS
42     """
43     ...
44     return dw1, db1, dw2, db2
45
46 def sample_batches(data, batchsize):
47     """
48     This function samples batches of size `batchsize`
49     from the training data.
50     YOU MAY ASSUME THIS FUNCTION HAS NO ERRORS
51     """
52     ...
53     return x, y
54
55 class Adam:
56     """
57     The class for the Adam optimizer that
58     accepts the parameters and updates them.
59     YOU MAY ASSUME THIS CLASS AND ITS METHODS HAVE
60     NO ERRORS
61     """
62     def __init__(self, w1, b1, w2, b2):
63         ...
64
65     def update(self):
66         """
67         Updates the params according to the
68         Adam update rule
69         """
70         ...
71
72 class MLP:
73     """
74     MLP Model to perform regression
75     """
76     def __init__(self):
77         super().__init__()
78         self.w1 = np.random.randn(10, 20)
79         self.b1 = np.zeros(10)
Solution:
Line 15: Add np.mean
Line 19: Update the dropout function to take a training argument and only apply
dropout when training=True
Line 34: Should be x / (1 - p)
Lines 78,80: Bias shapes should be 20 and 1 respectively
Lines 93, 95: Need to do matrix multiplication, not a Hadamard product
Line 93: Missing ReLU activation
Line 121: loss = mse_loss(out, y)
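For reference, here is a sketch of the first few fixes applied to the portion of the code reproduced above (lines 15 and 19-34); the remaining fixes refer to code that is not fully shown in this document.

```python
import numpy as np

def mse_loss(predictions, targets):
    # Line 15 fix: reduce to a scalar with np.mean
    return 0.5 * np.mean((predictions.reshape(-1) - targets.reshape(-1)) ** 2)

def dropout(x, p=0.1, training=True):
    # Line 19 fix: only drop units at training time
    if not training:
        return x
    ind = np.random.choice(x.shape[0] * x.shape[1], replace=False,
                           size=int(x.shape[0] * x.shape[1] * p))
    x[np.unravel_index(ind, x.shape)] = 0
    # Line 34 fix: inverted dropout rescales by the keep probability (1 - p), not p
    return x / (1 - p)
```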
END OF PAPER