Assignment 3
Due date: 5/21 11:59 PM PST (You are allowed to use three (3) late days maximum for this
assignment)
This handout consists of several homework problems, as well as instructions on the “deliverables”
associated with the coding portions of this assignment.
These questions require thought, but do not require long answers. Please be as concise as possible.
We encourage students to discuss the assignments in groups. However, each student must finish the
problem set and programming assignment individually, and must turn in her/his own assignment. We ask
that you abide by the university Honor Code and that of the Computer Science department, and make
sure that all of your submitted work is done by yourself.
Our RNN has one ReLU layer and one softmax layer, and uses Cross Entropy loss as its cost function.
We follow the parse tree given from the leaf nodes up to the top of the tree and evaluate the cost at each
node. During backprop, we follow the exact opposite path. Figure 1 shows an example of such an RNN
applied to a simple sentence, "I love this assignment". These equations are sufficient to explain our model:
$$CE(y, \hat{y}) = -\sum_i y_i \log(\hat{y}_i)$$
where y is the label represented as a one-hot row vector, and ŷ is a row vector containing the predicted
probabilities for all classes. In our case, $y \in \mathbb{R}^{1 \times 5}$ and $\hat{y} \in \mathbb{R}^{1 \times 5}$ to represent our 5 sentiment classes: Really
Negative, Negative, Neutral, Positive, and Really Positive. Furthermore,
$$h^{(1)} = \max\left(\left[h^{(1)}_{Left}, h^{(1)}_{Right}\right] W^{(1)} + b^{(1)},\ 0\right)$$
$$\hat{y} = \mathrm{softmax}\left(h^{(1)} U + b^{(s)}\right)$$
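where $h^{(1)}_{Left}$ is the vector representation of the left subtree (possibly a word vector) and $h^{(1)}_{Right}$ is that of the
right subtree. For clarity, $L \in \mathbb{R}^{|V| \times d}$, $W^{(1)} \in \mathbb{R}^{2d \times d}$, $b^{(1)} \in \mathbb{R}^{1 \times d}$, $U \in \mathbb{R}^{d \times 5}$, $b^{(s)} \in \mathbb{R}^{1 \times 5}$.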
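
The equations above can be sketched for a single tree node as follows. This is only an illustrative sketch, not the starter code: it uses plain Python lists instead of a numerical library to stay dependency-free, and the helper names (`matvec_row`, `node_forward`) and all toy parameter values are assumptions made up for the example.

```python
import math

def matvec_row(x, W):
    """Row vector x (length n) times matrix W (n x m), giving a row vector of length m."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def relu(v):
    return [max(a, 0.0) for a in v]

def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y, y_hat):
    """CE(y, y_hat) = -sum_i y_i log(y_hat_i); zero entries of y are skipped."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, y_hat) if yi > 0)

def node_forward(h_left, h_right, W1, b1, U, bs):
    """Hypothetical helper: returns (h, y_hat) for one node of the tree."""
    concat = h_left + h_right                      # [h_Left, h_Right], length 2d
    z = matvec_row(concat, W1)                     # shape 1 x d
    h = relu([zi + bi for zi, bi in zip(z, b1)])   # ReLU layer
    scores = [s + b for s, b in zip(matvec_row(h, U), bs)]
    return h, softmax(scores)                      # softmax layer over 5 classes

# Toy example with d = 2 (all values are arbitrary assumptions).
h_left, h_right = [1.0, -2.0], [0.5, 3.0]
W1 = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6], [0.7, 0.8]]   # 2d x d
b1 = [0.0, 0.1]                                           # 1 x d
U = [[0.2, -0.1, 0.0, 0.3, 0.1], [0.0, 0.4, -0.2, 0.1, 0.5]]  # d x 5
bs = [0.0] * 5                                            # 1 x 5
y = [0, 0, 0, 1, 0]                                       # one-hot label: Positive

h, y_hat = node_forward(h_left, h_right, W1, b1, U, bs)
loss = cross_entropy(y, y_hat)
```

The returned `h` would serve as $h^{(1)}_{Left}$ or $h^{(1)}_{Right}$ for the parent node, which is what makes the model recursive.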
(a) (20 points) Follow the example parse tree in Figure 1 in which we are given a parse tree and truth labels
y for each node. Starting with Node 1, then to Node 2, finishing with Node 3, write the update rules for
$W^{(1)}$, $b^{(1)}$, $U$, $b^{(s)}$, and $L$ after the evaluation of ŷ against our truth, y. This means that at each node, we
evaluate:
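$$\delta_3 = \hat{y} - y$$
as our first error vector, and we backpropagate that error through the network, aggregating the gradient at
each node for:
$$\frac{\partial J}{\partial U}, \quad \frac{\partial J}{\partial b^{(s)}}, \quad \frac{\partial J}{\partial W^{(1)}}, \quad \frac{\partial J}{\partial b^{(1)}}, \quad \frac{\partial J}{\partial L_i}$$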
Points will be deducted if you do not express the derivative of activation functions (ReLU) in terms of
their function values (as with Assignment 1 and 2) or do not express the gradients by using an “error
vector” ($\delta_i$) propagated back to each layer. Tip on notation: $\delta_{below}$ and $\delta_{above}$ should be used for error
that is being sent down to the next node, or that came from a node above. This will help you think about
the problem in the right way. Note that you should not be updating gradients for $L_i$ in Node 1. But error
should be leaving Node 1 for sure!
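
One way to sanity-check a derived error vector is to compare it against a finite-difference estimate. The sketch below does this only for the first error vector given above, $\delta_3 = \hat{y} - y$, taken with respect to the softmax input scores. The function `numerical_grad` and the toy scores are assumptions for illustration, and the same idea could be applied to your own derivations for the remaining gradients.

```python
import math

def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y, y_hat):
    return -sum(yi * math.log(pi) for yi, pi in zip(y, y_hat) if yi > 0)

def numerical_grad(f, x, eps=1e-5):
    """Central-difference estimate of df/dx for a scalar function f of a list x."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        grad.append((f(x_plus) - f(x_minus)) / (2 * eps))
    return grad

# Toy softmax inputs and a one-hot label (arbitrary assumed values).
scores = [0.3, -1.2, 0.8, 2.0, -0.5]
y = [0, 0, 0, 1, 0]

def loss(s):
    return cross_entropy(y, softmax(s))

analytic = [p - t for p, t in zip(softmax(scores), y)]   # delta_3 = y_hat - y
numeric = numerical_grad(loss, scores)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
```

If the analytic expression is right, `max_diff` should be very small relative to the gradient entries themselves.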
(b) (80 points) Implementation time! We have simplified the problem to reduce training time by binarizing
the sentiment labels. This means that all the sentences are either positive or negative. The internal
nodes, however, can be positive, negative, or neutral. While training, the cost function includes predictions
over all nodes that have a sentiment associated with them and ignores the neutral nodes. While testing,
we are only interested in the performance at the full sentence level. This is all provided for you in the
starter code.
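
To make the training and testing conventions above concrete, here is a minimal sketch of a cost that sums over labeled nodes and skips neutral ones, plus a sentence-level correctness check at the root. It is not the starter code: the `Node` class, the use of `None` to mark neutral nodes, and the two-class output are all assumptions chosen for illustration, and the provided code may encode these differently.

```python
import math
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    probs: List[float]              # predicted distribution, assumed [negative, positive]
    label: Optional[int]            # 0 = negative, 1 = positive, None = neutral (assumed encoding)
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def tree_cost(node: Optional[Node]) -> float:
    """Sum cross-entropy over all nodes with a sentiment label, ignoring neutral nodes."""
    if node is None:
        return 0.0
    cost = 0.0
    if node.label is not None:
        cost = -math.log(node.probs[node.label])
    return cost + tree_cost(node.left) + tree_cost(node.right)

def sentence_correct(root: Node) -> bool:
    """Test-time check: only the prediction at the root (full sentence) counts."""
    predicted = max(range(len(root.probs)), key=lambda k: root.probs[k])
    return predicted == root.label

# Toy tree: one positive child, one neutral child, positive sentence.
left = Node(probs=[0.2, 0.8], label=1)
right = Node(probs=[0.5, 0.5], label=None)   # neutral: contributes nothing to the cost
root = Node(probs=[0.1, 0.9], label=1, left=left, right=right)

total_cost = tree_cost(root)
is_correct = sentence_correct(root)
```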
CS 224d: Assignment #3