Recurrent Neural Networks (RNN) : Course 5: Sequence Models
**********************
Course 5: Sequence Models
**********************
In this course we'll learn how to build models for natural language, audio, and other sequence data.
Thanks to deep learning, sequence algorithms are working far better than just two years ago, and
this is enabling numerous exciting applications in speech recognition, music synthesis, chatbots,
machine translation, natural language understanding, and many others.
You will:
- Understand how to build and train Recurrent Neural Networks (RNNs), and commonly-used variants
such as GRUs and LSTMs.
- Be able to apply sequence models to natural language problems, including text synthesis.
- Be able to apply sequence models to audio applications, including speech recognition and music
synthesis.
Similarly, y(i)<t> means the t-th element in the output sequence of the i-th training example, and Ty(i) will
be the length of the output sequence of the i-th training example. So in this example, Tx(i) is equal to 9,
whereas a different training example with a sentence of 15 words would have Tx(i) equal to 15.
This is our first serious foray into NLP, or Natural Language
Processing, and one of the things we need to decide is how to represent individual words in the
sequence. So, how do you represent a word like Harry, and what should x<1> really be? Let's next talk
about how we would represent individual words in a sentence.
So, to represent a word in the sentence the first thing you do is come up with a Vocabulary.
Sometimes also called a Dictionary and that means making a list of the words that you will use in
your representations. So the first word in the vocabulary is a, that will be the first word in the
dictionary. The second word is Aaron and then a little bit further down is the word and, and then
eventually you get to the words Harry then eventually the word Potter, and then all the way down to
maybe the last word in the dictionary is Zulu. And so, a will be word one, Aaron is word two, and in my
dictionary the word and appears at positional index 367. Harry appears at position 4075, Potter at
position 6830, and Zulu, the last word in the dictionary, is maybe word 10,000.
So in this example, I'm going to use a dictionary with a size of 10,000 words. This is quite small by
modern NLP standards. For typical commercial applications, dictionary sizes of 30,000 to 50,000 words
are more common, 100,000 is not uncommon, and some of the large Internet companies will use
dictionary sizes that are maybe a million words or even bigger than that. But a lot of commercial
applications use dictionary sizes of maybe 30,000 or maybe 50,000 words. I'm going to use 10,000 for
illustration since it's a nice round number. So, if you have chosen a dictionary of 10,000 words, one way
to build this dictionary would be to look through your training set and find the 10,000 most frequently
occurring words, or to look through some of the online dictionaries that tell you the 10,000 most common
words in the English language. What you can do is then use one-hot representations to represent each of these words. For
example, x<1> which represents the word Harry would be a vector with all zeros except for a 1 in
position 4075 because that was the position of Harry in the dictionary and then x<2> will be again
similarly a vector of all zeros except for a 1 in position 6830 and then zeros everywhere else. The
word and is at position 367, so x<3> would be a vector of zeros with a 1 in position 367
and zeros everywhere else. Each of these would be a 10,000-dimensional vector if your
vocabulary has 10,000 words. And since the word "a" is the first word of the dictionary, x<7>, which
corresponds to the word a, would be a vector with a 1 in the first position (the first element of the dictionary) and
zeros everywhere else. So in this representation, x<t> for each value of t in a sentence will be a
one-hot vector, one-hot because exactly one entry is one and all the others are zero, and you
will have nine of them to represent the nine words in this sentence. The goal is, given this
representation for x, to learn a mapping using a sequence model to the target output y. We will do this
as a supervised learning problem, given labeled data with both x and y.
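As a minimal sketch of this representation, here is one way you might build a vocabulary from a training corpus and turn a word into a one-hot vector with numpy; the word list and sentence below are made up purely for illustration:

```python
import numpy as np
from collections import Counter

def build_vocabulary(corpus_words, vocab_size=10000):
    """Keep the most frequent words and assign indices in alphabetical (dictionary) order."""
    counts = Counter(corpus_words)
    most_common = [w for w, _ in counts.most_common(vocab_size)]
    return {word: idx for idx, word in enumerate(sorted(most_common))}

def one_hot(word, word_to_index):
    """Return a len(vocab)-dimensional vector with a single 1 at the word's index."""
    vec = np.zeros(len(word_to_index))
    vec[word_to_index[word]] = 1.0
    return vec

# Toy example; a real vocabulary would have around 10,000 entries.
corpus = "harry potter and hermione granger invented a new spell".split()
word_to_index = build_vocabulary(corpus)
x1 = one_hot("harry", word_to_index)   # all zeros except a single 1
```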
Then just one last detail, which we'll talk more about in a later section: what if you encounter a
word that is not in your vocabulary? Well, the answer is that you create a new token, a fake word
called the Unknown Word, denoted UNK, to represent words not in your
vocabulary; we'll talk more about this later. So, to summarize, in this section we
described a notation for describing your training set, for both x and y, for sequence data. In the next
section let's start to describe Recurrent Neural Networks for learning the mapping from x to y.
We'll just draw this in green, W_a and b_a, those are the parameters that are used to compute a_1.
And then, these same parameters are used for every single timestep, so they are also used to compute
a_2, a_3, and so on; all the activations up to the last timestep depend on the
parameters W_a and b_a. Let's keep fleshing out this graph. Now, given a_1, your neural network
can then compute the first prediction, y-hat_1, then at the second timestep y-hat_2, then y-hat_3, and
so on, up to y-hat_<Ty>. And let me again draw the parameters in a different color. So, to compute y-hat,
you need the parameters W_y and b_y, and these go into this node as well as all the others.
So, I'll draw this in green as well. Next, in order to compute backpropagation, you need a loss
function.
So let's define an element-wise loss. Suppose that for a certain word in the sequence the label says it is
a person's name, so y<t> is one, and your neural network outputs some probability, maybe 0.1, of
that particular word being a person's name. I'm going to define the loss as the standard logistic
regression loss, also called the cross-entropy loss: L<t>(y-hat<t>, y<t>) = -y<t> log y-hat<t> - (1 - y<t>) log(1 - y-hat<t>).
This may look familiar to you from when we were previously looking at binary classification problems.
So this is the loss associated with a single prediction at a single position, or a single timestep t, for a
single word. Let's now define the overall loss of the entire sequence: L(y-hat, y) = sum from t = 1 to T_x
(or T_y; T_x equals T_y in this example) of L<t>(y-hat<t>, y<t>). So L, without the superscript <t>, is the
loss for the entire sequence. In the computation graph, given y-hat_1 you can compute the loss for the
first timestep, then the loss for the second timestep, the loss for the third timestep, and so on, up to the
loss for the final timestep. And then lastly, to compute the overall loss, we take these and sum them all
up to compute the final L using that equation, the sum of the individual per-timestep losses. So, this is
the computation graph, and from the earlier examples you've seen of
backpropagation, it shouldn't surprise you that backprop then just requires doing
computations or passing messages in the opposite direction. So, you essentially reverse all of the
forward propagation arrows, and that then allows you to compute all the appropriate
quantities that let you take the derivatives with respect to the parameters and update the parameters
using gradient descent. Now, in this backpropagation procedure, the most significant
message, or the most significant recursive calculation, is the one that goes from right to
left, and that's why this algorithm also has a pretty fanciful name:
backpropagation through time. The motivation for this name is that for forward prop,
you are scanning from left to right, increasing indices of time t, whereas in
backpropagation, you're going from right to left; you're kind of going backwards in time.
So this gives this, I think, a really cool name, backpropagation through time, where you're
going backwards in time, right? That phrase really makes it sound like you need a time
machine to implement this algorithm, but I just think that backprop through time is
one of the coolest names for an algorithm. So, I hope that gives you a sense of how forward
prop and backprop in RNN works. Now, so far, you've only seen this main motivating example in RNN,
in which the length of the input sequence was equal to the length of the output sequence. In the next
section, I want to show you a much wider range of RNN architectures, which will let you tackle a much
wider set of applications.
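To make the forward pass and the per-timestep loss concrete, here is a minimal numpy sketch of a basic RNN scanned across time together with the summed cross-entropy loss. It uses separate Waa and Wax matrices, which is equivalent to the concatenated Wa[a<t-1>, x<t>] form that appears later; the parameter names and the binary (name / not-a-name) output are illustrative assumptions, not a prescribed API:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(x_seq, Waa, Wax, Wy, ba, by):
    """Basic RNN forward pass: a<t> = tanh(Waa a<t-1> + Wax x<t> + ba),
    y_hat<t> = sigmoid(Wy a<t> + by) for a binary label at every timestep."""
    a = np.zeros((Waa.shape[0], 1))            # a<0> is the zero vector
    y_hats = []
    for x_t in x_seq:                          # scan left to right over time
        a = np.tanh(Waa @ a + Wax @ x_t + ba)  # same Waa, Wax, ba reused at every timestep
        y_hats.append(sigmoid(Wy @ a + by))    # same Wy, by reused at every timestep
    return y_hats

def sequence_loss(y_hats, y_true):
    """Overall loss: sum over t of the per-timestep cross-entropy losses
    -y<t> log y_hat<t> - (1 - y<t>) log(1 - y_hat<t>)."""
    return sum(
        float(-y * np.log(yh) - (1 - y) * np.log(1 - yh))
        for yh, y in zip(y_hats, y_true)
    )
```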
So in this example, we have y<1> through y<9>: nine tokens in this example if you append
the end-of-sentence token to the end. In the tokenization step, you can decide whether or not
the period should be a token as well. In this example, I'm just ignoring punctuation, so I'm using
day as the last token and omitting the period; if you want to treat the period or other punctuation as
explicit tokens, then you can add the period to your vocabulary as well. Now, one other detail:
what if some of the words in your training set are not in your vocabulary? If your vocabulary uses
10,000 words, maybe the 10,000 most common words in English, then the term Mau, as in the
Egyptian Mau, which is a breed of cat, might not be among your top 10,000 tokens. In that case
you could take the word Mau and replace it with a unique token called UNK, which stands for unknown
word, and you would just model the chance of the unknown word instead of that specific word.
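As a rough sketch of that tokenization step; the vocabulary, the UNK and EOS token names, and the punctuation handling here are illustrative assumptions rather than a fixed recipe:

```python
def tokenize(sentence, word_to_index, unk_token="<UNK>", eos_token="<EOS>"):
    """Map a sentence to a list of vocabulary indices, replacing
    out-of-vocabulary words with UNK and appending EOS."""
    tokens = sentence.lower().rstrip(".").split()      # ignore the final period
    indices = [word_to_index.get(w, word_to_index[unk_token]) for w in tokens]
    indices.append(word_to_index[eos_token])
    return indices

# Toy vocabulary; a real one would have ~10,000 entries plus UNK and EOS.
vocab = ["cats", "average", "15", "hours", "of", "sleep", "a", "day", "<UNK>", "<EOS>"]
word_to_index = {w: i for i, w in enumerate(vocab)}
print(tokenize("Cats average 15 hours of sleep a day.", word_to_index))
```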
Having carried out the tokenization step, which basically means taking the input sentence and
mapping it to the individual tokens, or the individual words in your vocabulary, let's next build an
RNN to model the chance of these different sequences. One of the things we'll see on the next
slide is that you end up setting the input x<t> = y<t-1>; you'll see that in a little bit. So let's go on
to build the RNN model, and I'm going to continue to use this sentence as the running example. This
will be an RNN architecture. At the first timestep you're going to end up computing some activation
a<1> as a function of some input x<1>, and x<1> will just be set to the all-zeros vector; the
previous a<0>, by convention, is also set to a vector of zeros. What a<1> does is feed a softmax
prediction to try to figure out the probability of the first word, and that's going to
be y-hat<1>. So what this step really does is use a softmax to predict the
probability of each word in the dictionary: what's the chance that the first word is a? What's the chance
that the first word is Aaron? What's the chance that the first word is cats? All the way to, what's the
chance that the first word is Zulu? Or what's the chance that the first word is the unknown word token,
or the end-of-sentence token? Right, so y-hat<1> is the output of a softmax; it just predicts the chance
of the first word being whatever it ends up being. In our example, it winds up being the word cats, so
this would be a 10,000-way softmax output if you have a 10,000-word vocabulary, or 10,002-way, I
guess, if you count the unknown word and the end-of-sentence as two additional tokens. Then, the RNN
steps forward to the next step, passing the activation a<1> along, and at this step its job is to try to
figure out what the second word is.
But now we will also give it the correct first word. So we'll tell it that, gee, in reality, the first word
was actually cats; so that's y<1>. So we tell it cats, and this is why x<2> = y<1>, and at the second step the
output is again predicted by a softmax. The RNN's job is to predict the chance of the second word being
each possible word: is it a, or Aaron, or cats, or Zulu, or the unknown word, or EOS, or whatever, given
what has come previously. In this case, I guess the right answer is average, since the sentence
starts with cats average. And then you go on to the next step of the RNN, where you now compute
a<3>. To predict the third word, which is 15, we can now give it the first two words. So
we're going to tell it that cats average are the first two words. So this next input here is x<3> = y<2>;
the word average is the input, and the job of this step is to figure out what the next word in the sequence
is. In other words, it is trying to figure out the probability of each word in the dictionary given that what
just came before was cats
average, right? And in this case, the right answer is 15, and so on.
Until at the end, you end up at, I guess, timestep 9, where you feed it x<9>, which is equal to
y<8>, which is the word day. This computes a<9>, and its job is to output y-hat<9>, which
happens to be the EOS token: what's the chance of each possible word given everything that comes
before, and hopefully it will predict that there's a high chance of the EOS end-of-sentence token. So
each step in the RNN will look at some set of preceding words such as, given the first three words,
what is the distribution over the next word? And so this RNN learns to predict one word at a time
going from left to right. Next, to train this network, we're going to define the cost function. At a
certain time t, if the true word was y<t> and the neural network's softmax predicted some y-hat<t>, then
the loss is the softmax loss function that you should already be familiar with, and the overall loss is
just the sum over all timesteps of the losses associated with the individual predictions. If you train
this RNN on a large training set, what you'll be able to do is, given any initial set of words, such as
cats average 15 hours of, predict the chance of the next word.
And given a new sentence, say y<1>, y<2>, y<3>, with just three words for simplicity, the way you can
figure out the chance of this entire sentence is as follows. The first softmax tells you the chance
of y<1>; that would be the first output. Then the second one tells you
P(y<2> | y<1>), and the third one tells you P(y<3> | y<1>, y<2>). By multiplying out these three
probabilities (and you'll see much more detail on this
in the programming exercise), you end up with the probability of the
three-word sentence. So that's the basic structure of how you can train a language
model using an RNN.
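A tiny sketch of that chain-rule computation, using log probabilities to avoid numerical underflow; the rnn_step function and its signature are assumptions standing in for whatever trained RNN cell you actually have, as long as it returns a softmax over the vocabulary:

```python
import numpy as np

def sentence_log_probability(token_indices, rnn_step, vocab_size, hidden_size):
    """log P(y<1>, ..., y<T>) = sum over t of log P(y<t> | y<1>, ..., y<t-1>),
    where each conditional probability is read off the RNN's softmax at timestep t."""
    a = np.zeros((hidden_size, 1))             # a<0> = 0
    x = np.zeros((vocab_size, 1))              # x<1> = 0
    log_prob = 0.0
    for idx in token_indices:
        y_hat, a = rnn_step(x, a)              # softmax over the vocabulary, next activation
        log_prob += np.log(float(y_hat[idx]))  # log P(y<t> = idx | earlier words)
        x = np.zeros((vocab_size, 1))
        x[idx] = 1.0                           # next input is the one-hot of the true y<t>
    return log_prob
```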
But to sample, you do something slightly different. What you want to do is first sample the
first word you want your model to generate, and for that you input the usual x<1> = 0, a<0> = 0.
Now your first timestep will output some softmax probability over possible outputs. So
what you do is randomly sample according to this softmax distribution. The softmax
distribution tells you: what is the chance that the first word is "a"? What is the
chance that it is "Aaron"? What's the chance that it is "Zulu"? What is the chance that
the first word is the unknown word token, or maybe the end-of-sentence token? You then take this
vector and use, for example, the numpy command np.random.choice to sample
according to the distribution defined by this vector of probabilities, and that lets you sample the first word.
Next you go on to the second timestep, and remember that the second timestep is
expecting y<1> as input; but what you do is take the y-hat<1> that you just sampled and
pass that in as the input to the next timestep. So whatever word you just chose at the first
timestep gets passed as the input at the second position, and then this softmax makes a prediction for
y-hat<2>.
For example, let's say that after you sample the first word, it happened to be "The", which is a
very common choice of first word. Then you pass in "The" as x<2>, which is now equal to y-hat<1>, and
you try to figure out the chance of the second word given that the first word is
"The". This is going to be y-hat<2>. Then you again use this type of sampling function to sample y-hat<2>.
At the next timestep, you take whatever word you chose, represent it, say, as a one-hot
encoding, pass that to the next timestep, and then sample the third word given whatever you
chose, and you keep going until you get to the last timestep. And how do you know when the
sequence ends? Well, one thing you could do is, if the end-of-sentence token is part of your
vocabulary, keep sampling until you generate an EOS token; that tells you you've hit
the end of a sentence and you can stop. Alternatively, if you do not include this in your vocabulary,
then you can just decide to sample 20 words or 100 words or something, and keep going
until you've reached that number of timesteps. This particular procedure will sometimes generate
an unknown word token. If you want to make sure that your algorithm never generates this token,
one thing you could do is just reject any sample that came out as the unknown word token and keep
resampling from the rest of the vocabulary until you get a word that's not an unknown word. Or you
can just leave it in the output if you don't mind having an unknown word in the output. So this is
how you would generate a randomly chosen sentence from your RNN language model.
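Here is a minimal sampling sketch along these lines, using np.random.choice as mentioned above; the rnn_step function, the hidden size, and the token names are assumptions standing in for whatever trained model and vocabulary you actually have:

```python
import numpy as np

def sample_sentence(rnn_step, word_to_index, index_to_word, hidden_size, max_len=100):
    """Sample one word at a time, feeding each sampled word back in as the next
    input, until an <EOS> token is sampled or max_len timesteps are reached."""
    vocab_size = len(word_to_index)
    x = np.zeros((vocab_size, 1))              # x<1> = 0
    a = np.zeros((hidden_size, 1))             # a<0> = 0
    words = []
    for _ in range(max_len):
        y_hat, a = rnn_step(x, a)              # softmax distribution over the vocabulary
        idx = np.random.choice(vocab_size, p=y_hat.ravel())
        while index_to_word[idx] == "<UNK>":   # optionally reject unknown-word samples
            idx = np.random.choice(vocab_size, p=y_hat.ravel())
        if index_to_word[idx] == "<EOS>":
            break
        words.append(index_to_word[idx])
        x = np.zeros((vocab_size, 1))          # next input is the one-hot of the sampled word
        x[idx] = 1.0
    return " ".join(words)
```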
Now, so far we've been building a word-level RNN, by which I mean the vocabulary consists of English words.
Depending on your application, one thing you can do is also build a character-level RNN.
So in this case your vocabulary will just be the alphabet, a up to z, as well as maybe space,
punctuation if you wish, the digits 0 to 9, and if you want to distinguish uppercase and lowercase,
you can include the uppercase letters as well. One thing you can do is just look at your
training set, look at the characters that appear there, and use those to define the vocabulary. And if
you build a character level language model rather than a word level language model, then your
sequence y1, y2, y3, would be the individual characters in your training data, rather than the
individual words in your training data. So for our previous example, the sentence cats average 15
hours of sleep a day. In this example, c would be y1, a would be y2, t will be y3, the space will be y4
and so on. Using a character level language model has some pros and cons. One is that you don't
ever have to worry about unknown word tokens. In particular, a character-level language
model is able to assign a sequence like Mau a non-zero probability, whereas if Mau was
not in your vocabulary for the word-level language model, you would just have to assign it the
unknown word token. But the main disadvantage of the character-level language model is
that you end up with much longer sequences. Many English sentences will have 10 to
20 words but many dozens of characters, and so character-level language
models are not as good as word-level language models at capturing long-range
dependencies between how the earlier parts of the sentence affect the later parts
of the sentence. Character-level models are also just more computationally expensive
to train. So the trend I've been seeing in natural language processing is that, for the most part, word-level
language models are still used, but as computers get faster there are more and more
applications where people are, at least in some special cases, starting to look at character-level
models. They tend to be much harder and much more computationally expensive to train, so they
are not in widespread use today, except for maybe specialized applications where you might need to
deal with unknown or out-of-vocabulary words a lot, or applications
where you have a more specialized vocabulary. So using these methods, what you can
now do is train an RNN on a corpus of English text, build a word-level or character-level
language model, and sample from the language model that you've trained. So here are some examples of
text sampled from a language model, actually from a character-level language model.
If the model was trained on news articles, then it generates text like that shown on the left. It
looks vaguely like news text: not quite grammatical, but maybe it sounds a little bit like things that
could appear in the news, concussion epidemic to be examined. And if it was trained on Shakespearean
text, then it generates stuff that sounds like Shakespeare could have written it. The mortal moon
hath her eclipse in love. And subject of this thou art another this fold. When besser be my love to me
see sabl's. For whose are ruse of mine eyes heaves.
So that's it for the basic RNN, how you can build a language model using it, and how to sample
from the language model that you've trained. In the next few sections, we'll
discuss some of the challenges of training RNNs, as well as how to address some of
these challenges, specifically vanishing gradients, by building even more powerful variants
of the RNN. So in the next section we'll talk about the problem of vanishing gradients,
and we will go on to talk about the GRU, the Gated Recurrent Unit, as well as the LSTM model.
We've already seen the formula for computing the activations at time t of an RNN: a<t> = g(Wa[a<t-1>, x<t>] + ba),
the activation function applied to the parameter Wa times the activations of the previous timestep and the
current input, plus ba. So we're going to draw this as a picture. The RNN unit is drawn
as a box which inputs a<t-1>, the activation for the last timestep, and also inputs x<t>; these two go
together, and after some weights and this type of linear calculation, if g is a tanh
activation function, then after the tanh it computes the output activation a<t>. The output activation
a<t> might also be passed to, say, a softmax unit or something that could then be used to output
y<t>. So this is a visualization of the RNN unit, of the hidden layer of the RNN, in terms of a
picture. I want to show you this picture because we're going to use a similar picture to explain
the GRU, or Gated Recurrent Unit. Many of the ideas of the GRU are due to two papers,
by Cho et al. and by Chung, Gulcehre, Cho, and Bengio. And I'm
sometimes going to refer to this sentence, which we saw in the last section, to motivate that. Given
a sentence like this, you might need to remember that the cat was singular, to make sure you understand
why it was "was" rather than "were": the cat ... was full, or the cats ... were full. So as we read this
sentence from left to right, the GRU unit is going to have a new variable called c, which stands for
cell, for memory cell. What the memory cell does is provide a bit of memory to remember, for
example, whether the cat was singular or plural, so that when it gets much further into the sentence it
can still take into consideration whether the subject of the sentence was singular or plural. So
at time t the memory cell will have some value c<t>. And what we see is that the GRU unit will
actually output an activation value a<t> that is equal to c<t>. For now I want to use the different
symbols c and a to denote the memory cell value and the output activation value, even though they
are the same; I'm using this notation because when we talk about LSTMs a little bit later, these will
be two different values. But for now, for the GRU, c<t> is equal to the output activation a<t>. So
these are the equations that govern the computations of a GRU unit. At every timestep, we're
going to consider overwriting the memory cell with a value c~<t>, which is going to be a
candidate for replacing c<t>. We compute it as c~<t> = tanh(Wc[c<t-1>, x<t>] + bc): the parameter
matrix Wc applied to the previous value of the memory cell (which is also the previous activation value)
and the current input value x<t>, plus the bias. So c~<t> is going to be a candidate for replacing c<t>. And then the key,
really important idea of the GRU is that we have a gate. I'm going to call the gate
gamma_u; this is the capital Greek letter gamma subscript u, where u stands for update gate, and
this will be a value between zero and one. To develop your intuition about how GRUs work, think
of gamma_u, this gate value, as being always zero or one, although in practice you compute it with
a sigmoid function: gamma_u = sigmoid(Wu[c<t-1>, x<t>] + bu). Remember that the sigmoid function's
value is always between zero and one, and for most of the possible range of its input, the sigmoid
function is either very, very close to zero or very, very close to one. So for intuition, think of gamma_u
as being either zero or one most of the time. The letter u stands for update. I chose the letter
gamma for this because if you look at a gate in a fence, which looks a bit like this I guess, there are a lot
of gammas in that fence; that's why we use gamma_u to denote the gate. Also, gamma is the Greek
letter G, right: G for gate. So G for gamma and G for gate. And then next, the key part of the GRU
is this equation: we have come up with a candidate c~<t> for updating c<t>, and the gate decides
whether or not we actually update it. The way to think about it is that maybe this memory cell c is
going to be set to either zero or one depending on whether the word you are considering, really the
subject of the sentence, is singular or plural. Because it's singular here, let's say we set this to one;
if it were plural, maybe we would set it to zero. Then the GRU unit would memorize the value of c<t>
all the way until the verb, where it is still equal to one, and so that tells it, oh, it's singular, so use
the choice "was". The job of the gate, of gamma_u, is to decide when to update these values. In
particular, when you see the phrase "the cat", you know you're talking about a new concept,
specifically the subject of the sentence, cat. That would be a good time to update this bit, and then
maybe when you're done using it, "the cat ... was full", then you know, okay, I don't need to memorize
it anymore, I can just forget it. So the specific equation we'll use for the GRU is the following:
c<t> = gamma_u * c~<t> + (1 - gamma_u) * c<t-1>. So notice that if the gate, this update value, is equal
to one, then it's saying set the new value of c<t> equal to the candidate value; that's like over here,
where we set the gate equal to one and go ahead and update that bit. And then for all of the words in
the middle, the gate should equal zero, which is saying don't update it, don't update it, don't update it,
just hang onto the old value. Because if gamma_u is equal to zero, then the first term is zero and the
coefficient on the old value is one, and so it's just setting c<t> equal to the old value, even as you scan
the sentence from left to right. So when the gate is equal to zero, we're saying don't update it, just hang
on to the value and don't forget what this value was. And so that way, even when you get all the way
down to the verb, hopefully you've just been setting c<t> = c<t-1> all along, and it still memorizes that
the cat was singular. So let me also draw a picture to denote the GRU unit. By the way, when you
look at online blog posts and textbooks and tutorials, these types of pictures are quite popular for
explaining GRUs, as well as, as we'll see later, LSTM units. I personally find the equations easier to
understand than the pictures, so if the picture doesn't make sense, don't worry about it; I'll just draw
it in case it helps some of you. So a GRU unit inputs c<t-1> from the previous timestep, which just
happens to be equal to a<t-1>. It takes that as input and then it also takes as input x<t>, and
these two things get combined together. With some appropriate weighting and a tanh,
this gives you c~<t>, which is a candidate for replacing c<t>, and then with a different set of
parameters and a sigmoid activation function, this gives you gamma_u, which is the update
gate. And then finally, all of these things combine together through another operation. I won't
write out the formula, but this box here, which I shaded in purple, represents the equation which
we had down there. So that's what this purple operation represents. It takes as input the gate
value, the candidate new value, the gate value again, and the old value of c<t-1>; together they
generate the new value for the memory cell.
And so that's c<t>, which equals a<t>. If you wish, you could also pass this through a softmax or something
to make some prediction for y<t>. So that is the GRU unit, or at least a slightly simplified version of
it. What it is remarkably good at is deciding, through the gate, when you're scanning the
sentence from left to right, that this is a good time to update one particular memory cell, and then
not changing it until you get to the point where you really need to use the memory cell
that was set much earlier in the sentence. And because of the sigmoid, the gate is
quite easy to set to zero: so long as the input to the sigmoid is a large negative value, then up to
numerical round-off the update gate will be essentially zero, very, very close to zero. When that's the
case, the update equation ends up setting c<t> = c<t-1>, and so this is very
good at maintaining the value of the cell. And because gamma can be so close to zero, can be
0.000001 or even smaller than that, it doesn't suffer from much of a vanishing gradient problem:
when gamma is so close to zero, this becomes essentially c<t> = c<t-1>,
and the value of c<t> is maintained pretty much exactly even across many, many time-
steps. So this can help significantly with the vanishing gradient problem and therefore allow a neural
network to learn even very long-range dependencies, such as the cat and was being related even though
they're separated by a lot of words in the middle. Now I just want to talk through some more details of how you
implement this. In the equations I've written, c<t> can be a vector. So if you have a 100-dimensional
hidden activation value, then c<t> can be 100-dimensional, say, and c~<t> would be the
same dimension, and gamma would be the same dimension as the other quantities I'm drawing in
boxes. In that case, these asterisks are actually element-wise multiplications. So here, if the gate
gamma_u is a 100-dimensional vector, it is really a 100-dimensional vector of bits whose values are
mostly zero or one, and that tells you, of this 100-dimensional memory cell, which bits you want
to update. Of course, in practice gamma won't be exactly zero or one; sometimes it takes values
in the middle as well, but it is convenient for intuition to think of it as mostly taking on values that are
pretty much exactly zero or pretty much exactly one. What these element-wise
multiplications do is tell your GRU which dimensions of your memory cell vector to update at
every timestep, so you can choose to keep some bits constant while updating other bits. For example,
maybe you use one bit to remember whether the cat is singular or plural and some other bits to
remember that you're talking about food, and because you're talking about eating and food, you'd
expect to talk later about whether the cat is full, right? You can use different bits and change only a
subset of the bits at every point in time. You now understand the most important ideas of the GRU. What I've
presented so far is actually a slightly simplified GRU unit; let me describe the full GRU unit. To
do that, let me copy the three main equations, this one, this one and this one, to the next slide. So
here they are. For the full GRU unit, I'm going to make one change: for the first
equation, which calculates the candidate new value for the memory cell, I'm going to add
one more gate. This is another gate, gamma_r; you can think of r as standing for relevance. This gate
gamma_r tells you how relevant c<t-1> is to computing the next candidate for c<t>, so the candidate
becomes c~<t> = tanh(Wc[gamma_r * c<t-1>, x<t>] + bc). And the gate gamma_r is
computed pretty much as you'd expect, with a new parameter matrix Wr and the same inputs:
gamma_r = sigmoid(Wr[c<t-1>, x<t>] + br). As you can imagine, there are multiple ways to design these
types of neural networks. So why do we have gamma_r? Why not use the simpler version from the previous
slides? It turns out that over many years researchers have experimented with many, many different possible
versions of how to design these units, to try to have longer-range connections, to try to have
longer-range effects, and also to address vanishing gradient problems. And the GRU is one of the
most commonly used versions that researchers have converged to and found robust and useful for
many different problems. If you wish, you could try to invent new versions of these units,
but the GRU is a standard one that's just commonly used, although you can imagine that researchers
have tried other versions that are similar but not exactly the same as what I'm writing down here as
well. The other common version is called the LSTM, which stands for Long Short Term Memory,
which we'll talk about in the next section. GRUs and LSTMs are two specific instantiations of this
set of ideas that are most commonly used. Just one note on notation. I've tried to define a consistent
notation to make these ideas easier to understand. If you look at the academic literature, you
sometimes see people using
alternative notation, like h~, u, r, and h, to refer to these quantities. But I've tried to use a
more consistent notation between GRUs and LSTMs, as well as using the more consistent notation
gamma to refer to the gates, to hopefully make these ideas easier to understand. So that's it for the
GRU, the Gated Recurrent Unit. This is one of the ideas in RNNs that has enabled them to become
much better at capturing very long-range dependencies and has made RNNs much more effective. Next, as
I briefly mentioned, the other most commonly used variation of this class of ideas is something called
the LSTM unit, the Long Short Term Memory unit. Let's take a look at that in the next section.
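As a minimal numpy sketch of the full GRU equations above; the concatenation-based parameter shapes are an illustrative assumption, and element-wise products are written with *:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, Wc, Wu, Wr, bc, bu, br):
    """One full GRU step. For the GRU, the output activation a<t> equals c<t>."""
    concat = np.vstack([c_prev, x_t])                     # [c<t-1>, x<t>]
    gamma_r = sigmoid(Wr @ concat + br)                   # relevance gate
    gamma_u = sigmoid(Wu @ concat + bu)                   # update gate
    c_candidate = np.tanh(Wc @ np.vstack([gamma_r * c_prev, x_t]) + bc)  # c~<t>
    c_t = gamma_u * c_candidate + (1.0 - gamma_u) * c_prev               # keep or overwrite each bit
    return c_t                                            # a<t> = c<t>
```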
Long Short Term Memory (LSTM)
In the last section, we learned about the GRU, the gated recurrent unit, and how it can
allow you to learn very long-range connections in a sequence. The other type of unit that allows you
to do this very well is the LSTM, or long short term memory unit, and this is even more
powerful than the GRU. Let's take a look. Below are the equations from the previous section for the
GRU.
And for the GRU, we had a<t> = c<t>, and two gates, the update gate and the relevance gate, plus c~<t>,
a candidate for replacing the memory cell, and then we used the update gate, gamma_u,
to decide whether or not to update c<t> using c~<t>. The LSTM is an even slightly more powerful and
more general version of the GRU, and is due to Sepp Hochreiter and Jurgen Schmidhuber. It
was a really seminal paper with a huge impact on sequence modelling, though I think it is one of the
more difficult papers to read; it goes quite a long way into the theory of vanishing gradients. And so, I think more
people have learned about the details of the LSTM from other places than from this particular
paper, even though it has had a wonderful impact on the deep learning community.
These are the equations that govern the LSTM. We continue to have a memory cell, c, and
a candidate value for updating it, c~<t> = tanh(Wc[a<t-1>, x<t>] + bc). Notice that for the LSTM, we
no longer have the case that a<t> is equal to c<t>. So this equation is just like the
one for the GRU, except that we now use a<t-1> instead of c<t-1>, and we are not using the relevance
gate gamma_r. You could have a variation of the LSTM that puts that back in, but the more common
version of the LSTM doesn't bother with it. Then we have an update gate, same as before:
gamma_u = sigmoid(Wu[a<t-1>, x<t>] + bu). One new property of the LSTM is that, instead of
having one update gate control both terms, we have two separate gates. So
instead of gamma_u and one minus gamma_u, we have gamma_u together with a forget gate,
which we'll call gamma_f = sigmoid(Wf[a<t-1>, x<t>] + bf). And then we have a new output gate,
gamma_o = sigmoid(Wo[a<t-1>, x<t>] + bo). The update to the
memory cell is then c<t> = gamma_u * c~<t> + gamma_f * c<t-1>, where the asterisk denotes
element-wise (vector-vector) multiplication; so instead of one minus gamma_u, we have
a separate forget gate, gamma_f, times c<t-1>. This gives the memory cell the
option of keeping the old value c<t-1> and then just adding to it this new value, c~<t>.
So we use separate update and forget gates; u, f and o stand for update, forget, and output gate. And
then finally, instead of a<t> = c<t>, we have a<t> = gamma_o * tanh(c<t>), the output gate element-wise
multiplied by a tanh of c<t>. So, these are the equations that govern the LSTM, and you can see it has
three gates instead of two. So it's a bit more complicated, and it places the gates in slightly different places. So, here
again are the equations governing the behavior of the LSTM. Once again, it's traditional to explain
these things using pictures. So let me draw one here. And if these pictures are too complicated, don't
worry about it. I personally find the equations easier to understand than the picture. But I'll just show
the picture here for the intuitions it conveys. The bigger picture here was very much inspired by a
blog post by Chris Olah, titled Understanding LSTM Networks, and the diagram drawn here is quite
similar to one that he drew in his blog post. But the key things to take away from this picture are
maybe that you use a_t minus one and x_t to compute all the gate values.
In this picture, you have a_t minus one and x_t coming together to compute the forget gate,
the update gate, and the output gate. They also go through a tanh to compute
c~<t>. And then these values are combined in these complicated ways with element-wise
multiplies and so on, to get c_t from the previous c_t minus one. Now, one element of this becomes
interesting when you take a bunch of these units and connect them temporally, so the sequence feeds
in x_1, then x_2, then x_3. You can take these units and hook them up as follows, where the output a
(and c) at the previous timestep becomes the input a (and c) at the next timestep. I've simplified the
diagrams a little bit at the bottom. And one cool thing
about this you'll notice is that there's this line at the top that shows how, so long as you set the
forget and the update gate appropriately, it is relatively easy for the LSTM to have some value c_0
and have that be passed all the way to the right to have your, maybe, c_3 equals c_0. And this is why
the LSTM, as well as the GRU, is very good at memorizing certain values even for a long time, for
certain real values stored in the memory cell even for many, many timesteps. So, that's it for the
LSTM. As you can imagine, there are also a few variations on this that people use. Perhaps, the most
common one is that instead of just having the gate values be dependent only on a_t minus one, x_t,
sometimes, people also sneak in there the values c_t minus one as well. This is called a peephole
connection. Not a great name maybe but you'll see, peephole connection. What that means is that
the gate values may depend not just on a_t minus one and on x_t, but also on the previous memory
cell value, and the peephole connection can go into all three of these gates' computations. So that's
one common variation you see of LSTMs. One technical detail is that these are, say, 100-dimensional
vectors. So if you have a 100-dimensional hidden memory cell unit, then so are the gates. And the, say, fifth
element of c_t minus one affects only the fifth element of the corresponding gates, so that
relationship is one-to-one; it is not the case that every element of the 100-dimensional c_t minus one can
affect all elements of the gates. Instead, the first element of c_t minus one affects the first element of
the gates, the second element affects the second element, and so on. But if you ever read a paper and
see someone talk about the peephole connection, that's when they mean that c_t minus one is used
to affect the gate value as well. So, that's it for the LSTM. When should you use a GRU? And when
should you use an LSTM? There isn't widespread consensus on this. And even though I presented
GRUs first, in the history of deep learning LSTMs actually came much earlier, and GRUs are a
relatively recent invention that was perhaps derived as a simplification of the more complicated
LSTM model. Researchers have tried both of these models on many different problems, and on
different problems, different algorithms will win out. So, there isn't a universally-superior algorithm
which is why I want to show you both of them. But I feel like when I am using these, the advantage
of the GRU is that it's a simpler model, so it is actually easier to build a much bigger network; it
only has two gates, so computationally it runs a bit faster. So it scales to building somewhat bigger
models but the LSTM is more powerful and more effective since it has three gates instead of two. If
you want to pick one to use, I think LSTM has been the historically more proven choice. So, if you
had to pick one, I think most people today will still use the LSTM as the default first thing to try.
Although, I think in the last few years, GRUs have been gaining a lot of momentum, and I feel like
more and more teams are also using GRUs because they're a bit simpler but often work just as well.
It might be easier to scale them to even bigger problems. So, that's it for LSTMs. With either GRUs or
LSTMs, you'll be able to build neural networks that can capture much longer-range dependencies.
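Here is a minimal numpy sketch of one LSTM step following the equations above; the concatenated parameter shapes follow the common formulation, and this is an illustrative sketch rather than a reference implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, Wc, Wu, Wf, Wo, bc, bu, bf, bo):
    """One LSTM step with update, forget, and output gates."""
    concat = np.vstack([a_prev, x_t])              # [a<t-1>, x<t>]
    c_candidate = np.tanh(Wc @ concat + bc)        # c~<t>
    gamma_u = sigmoid(Wu @ concat + bu)            # update gate
    gamma_f = sigmoid(Wf @ concat + bf)            # forget gate
    gamma_o = sigmoid(Wo @ concat + bo)            # output gate
    c_t = gamma_u * c_candidate + gamma_f * c_prev # keep and/or overwrite the memory cell
    a_t = gamma_o * np.tanh(c_t)                   # a<t> is no longer equal to c<t>
    return a_t, c_t
```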
Bidirectional RNN
By now, we've seen most of the key building blocks of RNNs. But there are just two more ideas
that let you build much more powerful models. One is bidirectional RNNs, which let you, at a point
in time, take information from both earlier and later in the sequence; we'll talk about that in this
section. The second is deep RNNs, which we'll see in the next section. So let's start with
Bidirectional RNNs. So, to motivate bidirectional RNNs, let's look at this network which you've seen a
few times before in the context of named entity recognition and one of the problems of this network
is that, to figure out whether the third word Teddy is a part of the person's name, it's not enough to
just look at the first part of the sentence.
So to tell whether yhat<3> should be zero or one, you need more information than just the first three
words, because the first three words don't tell you whether they're talking about Teddy bears or about
the former US president, Teddy Roosevelt. So this is a unidirectional, or forward-only, RNN,
and this comment is true whether these cells are standard RNN blocks, GRU units, or
LSTM blocks; all of these blocks are in a forward-only
direction. What a bidirectional RNN, or BRNN, does is fix this issue. A bidirectional RNN works
as follows. I'm going to use a simplified example with four inputs, or maybe a four-word sentence, so we
have four inputs, x<1> through x<4>.
So this network will have a forward recurrent component. We're going to call the activations
a<1>, a<2>, a<3> and a<4>, with a right arrow drawn over them to denote the
forward recurrent component, and they'll be connected as shown in the diagram above. Each of
these four recurrent units takes the current x as input and then feeds in to help predict yhat<1>, yhat<2>,
yhat<3>, and yhat<4>. So far this is the RNN from the previous section, but with the arrows
placed in slightly funny positions. We draw the arrows in these slightly funny positions because
we're going to add a backward recurrent layer. So we'd have a<1> with a left arrow to denote
a backward connection, and then a<2> backward, a<3> backward and a<4> backward; the
left arrow denotes that it is a backward connection. We're then going to connect the network up
as follows, and the backward connections will be connected to each other going backward in time. So,
notice that this network defines an acyclic graph, and so, given an input sequence x<1> through
x<4>, the forward sequence will first compute a forward <1>, then use that to compute a forward <2>,
then a forward <3>, then a forward <4>. Whereas the backward sequence starts by
computing a backward <4>, and then goes back and computes a backward <3>. Notice that as you are
computing the network activations, this is not backprop, this is forward prop; but part of
the forward prop computation goes from left to right and part goes from right to left in
this diagram. Having computed a backward <3>, you can then use those activations to compute
a backward <2>, and then a backward <1>. Finally, having computed all of the
activations, you can then make your predictions.
So, for example, to make the predictions, your network will compute something like
yhat<t> = g(Wy[a forward <t>, a backward <t>] + by), with both the forward activation at time t and the backward
activation at time t being fed in to make the prediction at time t. So, if you look at the prediction at
timestep three, for example, information from x<1> can flow through a forward <1> to a forward
<2> to a forward <3> to yhat<3>. So information from x<1>,
x<2>, x<3> is all taken into account, while information from x<4> can flow through a backward <4> to
a backward <3> to yhat<3>. This allows the prediction at time three to take as input information
from the past, information from the present (which goes into both the forward and the
backward activations at this step), as well as information from the future.
So, in particular, given a phrase like "He said, Teddy Roosevelt...", to predict whether
Teddy is part of a person's name, you take into account information from both the past and
from the future. So this is the bidirectional recurrent neural network, and these blocks can be
not just standard RNN blocks but also GRU blocks or LSTM blocks. In fact, for a lot of
NLP problems, for a lot of text and natural language processing problems, a bidirectional RNN
with LSTM units is commonly used. So if you have an NLP problem where you have the
complete sentence and you are trying to label things in the sentence, a bidirectional RNN with LSTM
blocks, both forward and backward, would be a pretty reasonable first thing to try.
So, that's it for the bidirectional RNN. This is a modification you can make to the basic RNN
architecture, or to the GRU or the LSTM, and by making this change you can have a model that uses
RNNs, GRUs or LSTMs and is able to make predictions anywhere, even in the middle of a sequence,
by taking into account information potentially from the entire sequence. The disadvantage of the
bidirectional RNN is that you do need the entire sequence of data before you can make
predictions anywhere. So, for example, if you're building a speech recognition system, then the
BRNN will let you take into account the entire speech utterance, but with this straightforward
implementation you need to wait for the person to stop talking, to get the entire utterance, before you
can actually process it and make a speech recognition prediction. So for real-time speech
recognition applications, there are somewhat more complex modules used rather than just the
standard bidirectional RNN as you've seen here. But for a lot of natural language processing
applications where you can get the entire sentence all at the same time, the standard BRNN
algorithm is actually very effective. So, that's it for BRNNs, and in the next and final section for this
week, let's talk about how to take all of these ideas, RNNs, LSTMs and GRUs and the bidirectional
versions, and construct deep versions of them.
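A minimal sketch of that bidirectional forward pass, assuming some rnn_step_forward and rnn_step_backward cell functions (standard RNN, GRU, or LSTM cells would all fit this shape); the concatenation and the Wy, by names are illustrative:

```python
import numpy as np

def brnn_forward(x_seq, rnn_step_forward, rnn_step_backward, Wy, by, hidden_size):
    """Compute forward activations left-to-right and backward activations right-to-left,
    then predict yhat<t> from the concatenation [a_fwd<t>, a_bwd<t>]."""
    T = len(x_seq)
    a_fwd, a_bwd = [None] * T, [None] * T
    a = np.zeros((hidden_size, 1))
    for t in range(T):                      # left to right
        a = rnn_step_forward(a, x_seq[t])
        a_fwd[t] = a
    a = np.zeros((hidden_size, 1))
    for t in reversed(range(T)):            # right to left (still forward prop, not backprop)
        a = rnn_step_backward(a, x_seq[t])
        a_bwd[t] = a
    # yhat<t> = g(Wy [a_fwd<t>, a_bwd<t>] + by); here g is a sigmoid for a binary label
    return [1.0 / (1.0 + np.exp(-(Wy @ np.vstack([a_fwd[t], a_bwd[t]]) + by)))
            for t in range(T)]
```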
Deep RNNs
The different versions of RNNs you've seen so far will already work quite well by themselves. But
for learning very complex functions, it is sometimes useful to stack multiple layers of RNNs together to
build even deeper versions of these models. In this section, we'll see how to build these deeper
RNNs. Let's take a look. Remember that for a standard neural network, you have an input x that
feeds into a hidden layer with activations of, say, a1 for the first
hidden layer, which feeds into the next layer with activations a2, then maybe another layer with
activations a3, and then you make a prediction ŷ. A deep RNN is a bit like this
network (shown in the diagram), unrolled in time. So let's take a look. So here's the
standard RNN that you've seen so far. But we've changed the notation a little bit which is that,
instead of writing this as a0 for the activation at time zero, I've added a square bracket 1 to denote
that this is for layer one. So the notation we're going to use is a[l] to denote that it's an activation
associated with layer l and then <t> to denote that that's associated over time t. So this will have an
activation on a[1]<1>, a[1]<2>, a[1]<3>, a[1]<4> and then we can just stack these things on top
and so this will be a new network with three hidden layers. So let's look at an example of how this
value is computed.
So a[2]<3> has two inputs: the input coming from the bottom and the input coming
from the left. To compute it you apply an activation function g to a weight matrix,
a[2]<3> = g(Wa[2][a[2]<2>, a[1]<3>] + ba[2]) (see the equation at the end of the diagram), and the same
parameters Wa[2] and ba[2] are used for every one of these computations at this layer. Whereas, in contrast, the
first layer would have its own parameters Wa[1] and ba[1]. So whereas for standard RNNs like the one
on the left, you know we've seen neural networks that are very, very deep, maybe over 100 layers.
For RNNs, having three layers is already quite a lot. Because of the temporal dimension, these
networks can already get quite big even if you have just a small handful of layers and you don't
usually see these stacked up to be like 100 layers. One thing you do see sometimes is recurrent
layers that are stacked on top of each other, where you then take the output, drop the horizontal
recurrent connections above it, and just have a bunch of deep layers that are not connected horizontally:
a deep network that finally predicts y<1>, and the same deep network
that predicts y<2>. So this is a type of network architecture that we're seeing a little bit more of, where
you have three recurrent layers connected in time, followed by a deep network on top of each output,
as shown for y<3> and y<4>, of course; there's a deep network there, but it does not have
the horizontal connections. So that's one type of architecture we seem to be seeing more of, and quite
often these blocks don't have to be standard RNN blocks, the simple RNN model. They can also be GRU
blocks or LSTM blocks, and finally, you can also build deep versions of the bidirectional RNN. Because
deep RNNs are quite computationally expensive to train, and there's often already a large temporal
extent, you just don't see as many deep recurrent layers; in this case there are three recurrent
layers connected in time, but you don't see as many stacked recurrent layers as you would see layers
in a deep conventional neural network. So that's it for deep RNNs. With what you've
seen this week, ranging from the basic RNN, the basic recurrent unit, to the GRU, to the LSTM, to the
bidirectional RNN, to the deep versions of this that you just saw, you now have a very rich toolbox for
constructing very powerful models for learning sequence models.
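As a minimal sketch of how the stacked layers compose at one timestep (the parameter names and concatenation-based shapes are assumptions for illustration):

```python
import numpy as np

def deep_rnn_step(a_prev_per_layer, x_t, params):
    """One timestep of a stacked RNN: layer l takes its own activation from the
    previous timestep, a[l]<t-1>, and the activation from the layer below, a[l-1]<t>.
    params is a list of (Wa, ba) per layer; a_prev_per_layer holds a[l]<t-1> for each l."""
    a_current = []
    layer_input = x_t                                     # a[0]<t> is just the input x<t>
    for (Wa, ba), a_prev in zip(params, a_prev_per_layer):
        a_l = np.tanh(Wa @ np.vstack([a_prev, layer_input]) + ba)   # a[l]<t>
        a_current.append(a_l)
        layer_input = a_l                                 # feed this layer's output upward
    return a_current                                      # activations for every layer at time t
```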
If man is word number 5391 in this dictionary, then you represent it with a vector with a one in
position 5391. And I'm also going to use O subscript 5391 to represent this vector, where O here
stands for one-hot. And then, if woman is word number 9853, then you represent it with O subscript
9853 which just has a one in position 9853 and zeros elsewhere. And then other words king, queen,
apple, orange will be similarly represented with one-hot vectors. One of the weaknesses of this
representation is that it treats each word as a thing unto itself, and it doesn't allow an algorithm to
easily generalize across words. For example, let's say you have a language model that has learned
that when you see "I want a glass of orange ___", well, what do you think the next word will be? Very
likely, it'll be juice. But even if the learning algorithm has learned that "I want a glass of orange juice"
is a likely sentence, if it sees "I want a glass of apple ___", then as far as it knows the relationship
between apple and orange is not any closer than the relationship between orange and any of the other
words: man, woman, king, queen. And so, it's not easy for the learning algorithm to generalize from
knowing that orange juice is a popular thing to recognizing that apple juice might also be a popular
thing, or a popular phrase. This is because the inner product between any two different one-hot
vectors is zero: if you take any two vectors, say queen and king, their inner product is zero, and if you
take apple and orange, their inner product is zero. The Euclidean distance between any pair of these
vectors is also the same. So it just doesn't know that
somehow apple and orange are much more similar than king and orange or queen and orange.
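A quick numeric illustration of that point; the word indices below are just illustrative positions in a made-up 10,000-word dictionary:

```python
import numpy as np

vocab_size = 10000

def one_hot(index, size=vocab_size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

o_king, o_queen, o_apple, o_orange = (one_hot(i) for i in (4914, 7157, 456, 6257))

# Every pair of distinct one-hot vectors has inner product 0 and the same distance,
# so nothing in the representation says apple is closer to orange than to king.
print(np.dot(o_king, o_queen), np.dot(o_apple, o_orange))                     # 0.0 0.0
print(np.linalg.norm(o_apple - o_orange), np.linalg.norm(o_apple - o_king))   # both sqrt(2)
```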
So, won't it be nice if instead of a one-hot representation we could instead learn a
featurized representation for each of these words: man, woman, king, queen, apple, orange, or
really for every word in the dictionary, we could learn a set of features and values for each of them.
For example, for each of these words we might want to know the gender associated with it.
So, if gender goes from minus one for male to plus one for female, then the gender
associated with man might be minus one, for woman plus one, and then eventually, learning
these things, maybe for king you get minus 0.95, for queen plus 0.97, and apple and orange are sort
of genderless. Another feature might be, well, how royal are these things? The terms man and
woman are not really royal, so they might have feature values close to zero, whereas king and queen
are highly royal, and apple and orange are not really royal. How about age? Well, man and woman
don't connote much about age; maybe they imply adults, but neither
necessarily young nor old, so maybe values close to zero, whereas kings and queens are
almost always adults, and apple and orange might be more neutral with respect to age. And
then another feature here: is this a food? Well, man is not a food, woman is not a food, neither
are kings and queens, but apples and oranges are foods. And they can be many other features as
well ranging from, what is the size of this? What is the cost? Is this something that is a live? Is this
an action, or is this a noun, or is this a verb, or is it something else? And so on. So you can imagine
coming up with many features. And for the sake of the illustration let's say, 300 different features,
and what that does is, it allows you to take this list of numbers, I've only written four here, but this
could be a list of 300 numbers, that then becomes a 300 dimensional vector for representing the
word man. And I'm going to use the notation e5391 to denote a representation like this. And similarly, I'll use e9853 to denote the 300 dimensional vector we could use to represent the word woman, and similarly
for the other examples here. Now, if you use this representation to represent the words orange and
apple, then notice that the representations for orange and apple are now quite similar. Some of the
features will differ because of the color of an orange versus the color of an apple, or the taste; some of the features will differ. But by and large, a lot of the features of apple and orange are actually the same, or take on very similar values, and so this increases the odds that a learning algorithm that has figured out that orange juice is a thing will also quickly figure out that apple juice is a thing. So this
allows it to generalize better across different words. So over the next few sections, we'll find a way to learn word embeddings. We just need to learn high dimensional feature vectors like these that give a better representation than one-hot vectors for representing different words. And the features we'll end up learning won't have an easy to interpret meaning like component one is gender, component two is royalty, component three is age, and so on. Exactly what they're representing will be a bit harder to figure out. But nonetheless, the featurized representations we will learn will allow an algorithm to quickly figure out that apple and orange are more similar than say, king and orange or queen and orange. If we're able to learn a 300 dimensional feature vector, or 300 dimensional embedding, for each word, one of the popular things to do is also to take this 300 dimensional data
and embed it say, in a two dimensional space so that you can visualize them. And so, one common
algorithm for doing this is the t-SNE algorithm due to Laurens van der Maaten and Geoff Hinton
and if you look at one of these embeddings, one of these representations, you find that words like man and woman tend to get grouped together, king and queen tend to get grouped together, words for people tend to get grouped together, animals tend to get grouped together, fruits tend to be close to each other, numbers like one, two, three, four will be close to each other, and then maybe the animate objects as a whole will also tend to be grouped together. You see plots like these sometimes on the internet to visualize some of these 300 or higher dimensional embeddings, and maybe this gives you a sense that word embedding algorithms like this can learn similar features for concepts that feel like they should be more related: concepts that seem to you and me like they should be more similar end up getting mapped to more similar feature vectors. These featurized representations, in maybe a 300 dimensional space, are called embeddings, and the reason we call them embeddings is that you can think of a 300 dimensional space (which of course I can't draw, so picture a 3D one instead), and what you do is take each word, like orange, with its 300 dimensional feature vector, so that the word orange gets embedded to a point in this 300 dimensional space, and the word apple gets embedded to a different point in this 300 dimensional space. And of course, to visualize it, algorithms like t-SNE map this to a much lower dimensional space, so you can actually plot the 2D data and look at it. That's where the term embedding comes from. Word embeddings have been one of the most important ideas in NLP, in Natural Language Processing. In this section, we saw why you might want to learn or use word embeddings. In the next section, let's take a deeper look at how you'll be able to use these algorithms to build NLP algorithms.
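As a hedged illustration of that visualization step, here is a minimal sketch using scikit-learn's t-SNE; the 10,000 by 300 matrix here is a random placeholder standing in for real learned embeddings, and the subset size is an arbitrary choice.

    import numpy as np
    from sklearn.manifold import TSNE

    # Placeholder embedding matrix: one 300-dimensional vector per word (random here;
    # a real one would come from a learned embedding).
    rng = np.random.default_rng(0)
    E = rng.normal(size=(10000, 300))

    # t-SNE maps a subset of the 300-dimensional embeddings down to 2D for plotting.
    subset = E[:500]
    points_2d = TSNE(n_components=2, init="random", perplexity=30).fit_transform(subset)
    print(points_2d.shape)    # (500, 2): one x-y point per word, ready to scatter-plot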
Continuing with the named entity recognition example, suppose you're trying to detect people's names. Given a sentence like "Sally Johnson is an orange farmer", hopefully you'll figure out that Sally Johnson is a person's name, hence the output labels are 1 for those words. And one way to be confident that Sally Johnson has to be a person, rather than, say, the name of a corporation, is that you know an orange farmer is likely to be a person. So previously, we had talked about one-hot representations to represent these words, x<1>, x<2>, and so on. But you can now instead use the featurized representations, the embedding vectors, that we talked about in the last section. Then after having trained a model that uses word embeddings as the inputs,
if you now see a new input, Robert Lin is an apple farmer. Knowing that orange and apple are very
similar will make it easier for your learning algorithm to generalize to figure out that Robert Lin is also
a human, is also a person's name. One of the most interesting cases will be, what if in your test set
you see not Robert Lin is an apple farmer, but you see much less common words? What if you see
Robert Lin is a durian cultivator? A durian is a rare type of fruit, popular in Singapore and a few other
countries. But if you have a small label training set for the named entity recognition task, you
might not even have seen the word durian or seen the word cultivator in your training set. I guess
technically, this should be a durian cultivator. But if you have learned a word embedding that tells you
that durian is a fruit, so it's like an orange, and a cultivator, someone that cultivates is like a farmer,
then you might still be able to generalize from having seen an orange farmer in your training set to knowing that a durian cultivator is also probably a person. So one of the reasons that word embeddings will be able to do this is that the algorithms for learning word embeddings can examine very large text corpora,
maybe found off the Internet. So you can examine very large data sets, maybe a billion words,
maybe even up to 100 billion words would be quite reasonable. So very large training sets of just
unlabeled text and by examining tons of unlabeled text, which you can download more or less for
free, you can figure out that orange and durian are similar and farmer and cultivator are similar, and
therefore, learn embeddings, that groups them together. Now having discovered that orange and
durian are both fruits by reading massive amounts of Internet text, what you can do is then take this
word embedding and apply it to your named entity recognition task, for which you might have a
much smaller training set, maybe just 100,000 words in your training set, or even much smaller and
so this allows you to carry out transfer learning, where you take information you've learned from
huge amounts of unlabeled text that you can suck down essentially for free off the Internet to figure
out that orange, apple, and durian are fruits and then transfer that knowledge to a task, such as
named entity recognition, for which you may have a relatively small labeled training set. And, of course, for simplicity, I drew this as only a unidirectional RNN. If you actually want to carry out
the named entity recognition task, you should, of course, use a bidirectional RNN rather than a
simpler one I've drawn here.
But to summarize, this is how you can carry out transfer learning using word embeddings.
Step 1 is to learn word embeddings from a large text corpus, a very large text corpus or you can also
download pre-trained word embeddings online. There are several word embeddings that you can find
online under very permissive licenses and you can then take these word embeddings and transfer the
embedding to a new task, where you have a much smaller labeled training set, and use this, let's say, 300 dimensional embedding to represent your words. One nice thing also about this is you can now use relatively lower dimensional feature vectors. So rather than using a 10,000 dimensional one-hot vector, you can now instead use maybe a 300 dimensional dense vector. The one-hot vector is very high dimensional and sparse, whereas the 300 dimensional vector that you learn for your embedding will be a dense vector. And then, finally, as you train your model on your new task, on your named entity recognition task with a smaller labeled data set, one thing you can optionally do is to continue to fine tune, continue to adjust, the word embeddings with the new data. In practice, you would do this only if task 2 has a pretty big data set. If your labeled data set for step 2 is quite small, then usually I would not bother to
continue to fine tune the word embeddings. So word embeddings tend to make the biggest difference
when the task you're trying to carry out has a relatively smaller training set. So it has been useful for
many NLP tasks and I'll just name a few. It has been useful for named entity recognition, for text
summarization, for co-reference resolution, for parsing. These are all maybe pretty standard NLP
tasks. It has been less useful for language modeling and machine translation, especially if you're tackling a language modeling or machine translation task for which you have a lot of data just
dedicated to that task. So as seen in other transfer learning settings, if you're trying to transfer from
some task A to some task B, the process of transfer learning is just most useful when you happen to
have a ton of data for A and a relatively smaller data set for B. And so that's true for a lot of NLP
tasks, and just less true for some language modeling and machine translation settings.
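As a hedged sketch of steps 1 to 3, here is one way this transfer could look in Keras; the embedding matrix E, the vocabulary size, and the tiny tagger on top are all illustrative placeholders, not the course's reference implementation.

    import numpy as np
    import tensorflow as tf

    vocab_size, emb_dim, num_classes = 10000, 300, 2      # illustrative sizes

    # Step 1: pretend E holds embeddings learned from (or downloaded for) a huge corpus.
    E = np.random.default_rng(0).normal(size=(vocab_size, emb_dim)).astype("float32")

    # Step 2: transfer the embedding into a smaller labeled task (e.g. named entity tagging).
    # trainable=False freezes the embeddings; set it to True only if the task 2 data set is
    # big enough to make fine-tuning (step 3) worthwhile.
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(
            vocab_size, emb_dim,
            embeddings_initializer=tf.keras.initializers.Constant(E),
            trainable=False),
        tf.keras.layers.Bidirectional(tf.keras.layers.GRU(64, return_sequences=True)),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")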
Finally, word embeddings have an interesting relationship to the face encoding ideas that you learned
about in the previous course, if you took the convolutional neural networks course.
So you will remember that for face recognition, we train this Siamese network architecture that would
learn, say, a 128 dimensional representation for different faces. And then you can compare these
encodings in order to figure out if these two pictures are of the same face. The words encoding and
embedding mean fairly similar things. So in the face recognition literature, people also use the term
encoding to refer to these vectors, f(x(i)) and f(x(j)). One difference between the face recognition
literature and what we do in word embeddings is that, for face recognition, you wanted to train a
neural network that can take as input any face picture, even a picture you've never seen before, and
have a neural network compute an encoding for that new picture. Whereas what we'll do for learning word embeddings, and you'll understand this better when we go through the next few sections, is that we'll have a fixed vocabulary of, say, 10,000 words, and we'll learn vectors e1 through, say, e10,000 that are just a fixed encoding, a fixed embedding, for each of the words in our vocabulary. So that's one difference between the set of ideas you saw for face recognition versus the algorithms we'll discuss in the next few sections. But the terms encoding
and embedding are used somewhat interchangeably. So the difference I just described is not
represented by the difference in terminologies. It's just a difference in how we need to use these algorithms: in face recognition, there's an unlimited sea of pictures you could see in the future, versus natural language processing, where there might be just a fixed vocabulary, and everything else we'll just declare to be an unknown word.
So in this section, you saw how using word embeddings allows you to implement this type of
transfer learning. And how, by replacing the one-hot vectors we're using previously with the
embedding vectors, you can allow your algorithms to generalize much better, or learn from much less labeled data. Next, I want to show you just a few more properties of these word embeddings
and then after that, we will talk about algorithms for actually learning these word embeddings.
Let's say I pose a question, man is to woman as king is to what? Many of you will say, man is to
woman as king is to queen. But is it possible to have an algorithm figure this out automatically?
So one of the remarkable results about word embeddings is the generality of analogy relationships
they can learn. So for example, it can learn that man is to woman as boy is to girl, because the
vector difference between man and woman, similar to king and queen and boy and girl, is primarily
just the gender. It can learn that Ottawa, which is the capital of Canada, is to Canada as Nairobi is to Kenya; so that's the capital city is to the name of the country. It can learn that big is to bigger as tall is to taller, and it can learn things like yen is to Japan, since yen is the currency of Japan, as ruble is to Russia. And all of these things can be learned just by running a word embedding learning algorithm on a large text corpus; it can spot all of these patterns by itself, just by training on very large bodies of text. So in this section, we saw how word embeddings can be used for
analogy reasoning and while you might not be trying to build an analogy reasoning system yourself
as an application, this I hope conveys some intuition about the types of feature-like representations
that these representations can learn. And you also saw how cosine similarity can be a way to
measure the similarity between two different word embeddings. Now, we talked a lot about properties
of these embeddings and how you can use them. Next, let's talk about how you'd actually learn these
word embeddings.
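As a minimal sketch of that analogy reasoning with cosine similarity, here is a hedged example; the tiny hand-made two-dimensional "embeddings" below are purely illustrative placeholders, not learned vectors.

    import numpy as np

    def cosine_similarity(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    def complete_analogy(a, b, c, word_to_vec):
        # "a is to b as c is to ?": pick the word w maximizing sim(e_b - e_a, e_w - e_c).
        e_a, e_b, e_c = word_to_vec[a], word_to_vec[b], word_to_vec[c]
        best_word, best_sim = None, -2.0
        for w, e_w in word_to_vec.items():
            if w in (a, b, c):
                continue
            sim = cosine_similarity(e_b - e_a, e_w - e_c)
            if sim > best_sim:
                best_word, best_sim = w, sim
        return best_word

    # Hand-made toy vectors (feature 0 roughly "gender", feature 1 roughly "royalty").
    word_to_vec = {
        "man":    np.array([-1.00,  0.01]),
        "woman":  np.array([ 1.00,  0.02]),
        "king":   np.array([-0.95,  0.93]),
        "queen":  np.array([ 0.97,  0.95]),
        "apple":  np.array([ 0.00, -0.10]),
        "orange": np.array([ 0.01, -0.12]),
    }
    print(complete_analogy("man", "woman", "king", word_to_vec))   # -> queen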
Embedding matrix
Let's start to formalize the problem of learning a good word embedding. When you implement an
algorithm to learn a word embedding, what you end up learning is an embedding matrix. Let's take a
look at what that means. Let's say, as usual we're using our 10,000-word vocabulary. So, the
vocabulary has a, Aaron, ..., orange, ..., Zulu, and maybe also an unknown word token. What we're going to do is learn an embedding matrix E, which is going to be a 300 by 10,000 dimensional matrix if you have 10,000 words in your vocabulary, or maybe 10,001 if you include the unknown word token as one extra token, and the columns of this matrix would be the different embeddings for the 10,000 different words you have in your vocabulary.
Let me take the phrase, I want a glass of orange, and let's start with the first word, I. So I'm going to construct a one-hot vector corresponding to the word I; that's a one-hot vector with a one in position 4343. This is going to be a 10,000 dimensional vector, and what we're going to do is then have the matrix of parameters E and take E times O4343 to get an embedding vector e4343, and this step really means that e4343 is obtained by the matrix E times the one-hot vector O4343, and then we'll do the same for all of the other words. So the word want, which is word 9665, gives a one-hot vector O9665 that we multiply by E to get the embedding vector e9665, and similarly for all the other words. A is the first word in the dictionary, since alphabetically it comes first, so there is o1, which gives e1. And similarly for the other words in this phrase. So now you have a bunch of 300 dimensional embeddings, so each of these is a 300 dimensional embedding vector, and what we can do is feed all of them into a neural network. So
here is the neural network layer, and then this neural network feeds into a softmax, which has its own parameters as well. And the softmax classifies among the 10,000 possible outputs in the vocab for the final word we're trying to predict. And so, if in the training set we saw the word juice, then the target for the softmax in training will be that it should predict that the word juice was what came after this. So this hidden layer here will have its own parameters, which I'm going to call W[1], and there's also b[1]. The softmax has its own parameters W[2], b[2]. If you're using 300 dimensional word embeddings and here we have six words, then this input would be 6 x 300 = 1,800 dimensional, obtained by taking your six embedding vectors and stacking them together. Well, what's actually more commonly done is to have a fixed history window. So for example, you might decide that you always want to predict the next word given, say, the previous four words, where four here is a hyperparameter of the algorithm. This is how you deal with either very long or very short sentences: you decide to always just look at the previous four words, so you just use those four words, and so, if you're always using a four word history, this means that your neural network will input a 4 x 300 = 1,200 dimensional feature vector, go
into this layer, then have a softmax and try to predict the output. And again, there are a variety of choices, but using a fixed history just means that you can deal with even arbitrarily long sentences because the input sizes are always fixed. So, the parameters of this model are this matrix E, and you use the same matrix E for all the words; you don't have different matrices for different positions in the preceding four words, it's the same matrix E. And then these weights are also parameters of the algorithm, and you can use backprop to perform gradient descent to maximize the likelihood of your training set, to just repeatedly predict, given four words in a sequence, what is the next word in your text corpus. And it turns out that this algorithm will learn pretty decent word embeddings, and the reason is, if you remember our orange juice, apple juice example, it's in the algorithm's interest to learn pretty similar word embeddings for orange and apple, because doing so allows it to fit the training set better: it's going to see orange juice sometimes and apple juice sometimes, and so, if you have only a 300 dimensional feature vector to represent all of these words, the algorithm will find that it fits the training set best if apples, oranges, grapes, pears, and so on, and maybe also durians, which is a very rare fruit, end up with similar feature vectors. So, this is one of the earlier and pretty successful algorithms for learning word embeddings, for learning this matrix E.
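Here is a hedged sketch of that fixed-window neural language model; the dimensions follow the description above, but the random initialization, word ids, and forward pass are illustrative placeholders rather than the paper's exact model.

    import numpy as np

    vocab_size, emb_dim, window, hidden = 10000, 300, 4, 128
    rng = np.random.default_rng(0)

    # Parameters: the embedding matrix E plus the hidden and softmax layer weights.
    E = rng.normal(scale=0.01, size=(emb_dim, vocab_size))        # 300 x 10,000
    W1 = rng.normal(scale=0.01, size=(hidden, window * emb_dim))  # hidden layer, W[1]
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.01, size=(vocab_size, hidden))        # softmax layer, W[2]
    b2 = np.zeros(vocab_size)

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def next_word_distribution(context_word_ids):
        # Looking up a column of E is the same as multiplying E by a one-hot vector o_w.
        embeddings = [E[:, w] for w in context_word_ids]          # four e_w vectors
        x = np.concatenate(embeddings)                            # 4 x 300 = 1,200 dimensions
        h = np.tanh(W1 @ x + b1)
        return softmax(W2 @ h + b2)                               # distribution over 10,000 words

    # Four context word ids (placeholders, except orange = 6257 and juice = 4834 as in the text).
    probs = next_word_distribution([1, 2500, 6100, 6257])
    print(probs.shape, probs[4834])                               # (10000,)  P(next word = juice)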
But now let's generalize this algorithm and see how we can derive even simpler algorithms. So, I
want to illustrate the other algorithms using a more complex sentence as our example. Let's say that
in your training set, you have this longer sentence, I want a glass of orange juice to go along with my
cereal. So, what we saw on the last slide was that the job of the algorithm was to predict some word
juice, which we are going to call the target words, and it was given some context which was the last
four words. And so, if your goal is to learn an embedding, researchers have experimented with many different types of context. If your goal is to build a language model, then it is natural for the context to be a few words right before the target word. But if your goal isn't to learn a language model per se, then
you can choose other contexts. For example, you can pose a learning problem where the context is
the four words on the left and right. So, you can take the four words on the left and right as the
context, and what that means is that we're posing a learning problem where the algorithm is given
four words on the left. So, a glass of orange, and four words on the right, to go along with, and this
has to predict the word in the middle. And posing a learning problem like this where you have the
embeddings of the left four words and the right four words feed into a neural network, similar to what
you saw in the previous slide, to try to predict the target word in the middle, this can also be used to learn word embeddings. Or if you want to use a simpler context,
maybe you'll just use the last one word. So given just the word orange, what comes after orange? So
this will be a different learning problem where you tell it one word, orange, and it has to say, well, what do you think the next word is, and you can construct a neural network that just feeds the one previous word, or the embedding of the one previous word, into a neural network and tries to predict the next word. Or, one thing that works surprisingly well is to take a nearby one word. So you might tell it, well, here's the word glass somewhere close by: I saw the word glass, and there's another word somewhere close to glass; what do you think that word is? So that'll be using a nearby one word as the context, and we'll formalize this in the next section, but this is the idea of a Skip-Gram model. It's just an example of a problem where the context is now much simpler, just one word rather than four words, but this works remarkably
well. So what researchers found was that if you really want to build a language model, it's natural to
use the last few words as a context. But if your main goal is really to learn a word embedding, then
you can use all of these other contexts, and they will result in very meaningful word embeddings as well. I will formalize the details of this in the next section, where we talk about the Word2Vec model. To summarize, in this section you saw how the language modeling problem, which poses a machine learning problem where you input the context, like the last four words, and predict some target word, allows you to learn a good word embedding. In the next section, you'll see how using even simpler contexts and even simpler learning algorithms to map from context to target word can also allow you to learn a good word embedding.
Word2Vec
In the last section, we saw how you can learn a neural language model in order to get good word embeddings. In this section, you'll see the Word2Vec algorithm, which is a simpler and computationally more efficient way to learn these types of embeddings. Let's take a look. Most of the ideas I'll present in this
section are due to Tomas Mikolov, Kai Chen, Greg Corrado, and Jeff Dean. Let's say you're given this
sentence in your training set. In the skip-gram model, what we're going to do is come up with a few context-to-target pairs to create our supervised learning problem. So rather than having the context always be the last four words or the last n words immediately before the target word, what we're going to do is randomly pick a word to be the context word, and let's say we choose the word orange, and then randomly pick another word within some window, say plus or minus 5 or plus or minus 10 words of the context word, and we choose that to be the target word. So maybe just by
chance you might pick juice to be a target word, that's just one word later. Or you might choose two
words before. So you have another pair where the target could be glass or, maybe just by chance you
choose the word my as the target and so we'll set up a supervised learning problem where given the
context word, you're asked to predict what is a randomly chosen word within say, a +-10 word
window, or +-5 or 10 word window of that input context word and obviously, this is not a very easy
learning problem, because within +- 10 words of the word orange, it could be a lot of different words.
But the goal of setting up this supervised learning problem isn't to do well on the supervised learning problem per se; it is that we want to use this learning problem to learn good word embeddings.
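Here is a minimal sketch of how such context-target pairs could be generated; the tokenization, window size, and number of pairs per context are illustrative choices, not the paper's exact preprocessing.

    import random

    def skipgram_pairs(tokens, window=10, pairs_per_context=4, seed=0):
        # For each randomly chosen context word, sample a few target words from
        # within +/- `window` positions of it.
        rng = random.Random(seed)
        pairs = []
        for _ in range(len(tokens)):
            c = rng.randrange(len(tokens))                      # pick a context position
            lo, hi = max(0, c - window), min(len(tokens), c + window + 1)
            for _ in range(pairs_per_context):
                t = rng.randrange(lo, hi)
                if t != c:
                    pairs.append((tokens[c], tokens[t]))        # (context, target)
        return pairs

    sentence = "i want a glass of orange juice to go along with my cereal".split()
    for context, target in skipgram_pairs(sentence)[:5]:
        print(context, "->", target)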
So, here are the details of the model. Let's say that we'll continue to use our vocab of 10,000 words, though some have run this on vocab sizes that exceed a million words. The basic supervised learning problem we're going to solve is that we want to learn the mapping from some context c, such as the word orange, to some target, which we will call t, which might be the word juice or the word glass or the word my, if we use the examples from the previous section. So in our vocabulary, orange is word 6257, and the word juice is word 4834 in our vocab of 10,000 words, and so that's the input x that you want to learn to map to the output y. So to represent the input, such as the word orange, we can start out with a one-hot vector, which I'm going to write as O subscript c, so there's a one-hot vector for the context word, and then, similar to what you saw in the last section, we can take the embedding matrix E, multiply E by the vector O subscript c, and this gives you your embedding vector for the input context word; so here ec is equal to capital E times that one-hot vector. Then in this neural network that we formed, we're going to take this vector ec and feed it to a softmax unit. So I'll draw the softmax unit as a node in a neural network; that's not an O, that's a softmax unit, and the job of the softmax unit is to output y hat.
So to write out this model in detail: this is the softmax model, which gives the probability of a particular target word t given the input context word c as

p(t | c) = exp(θ_t^T e_c) / Σ_{j=1}^{10,000} exp(θ_j^T e_c)

So here θ_t is the parameter vector associated with output t, that is, with the chance of a particular word t being the label. I've left off the bias term in the softmax, but we could include that too if we wished.
And then, finally, the loss function for the softmax will be the usual one. We use y to represent the target word, with a one-hot representation for y, and y hat is the softmax output. Then the loss would be the negative log likelihood:

L(ŷ, y) = - Σ_{i=1}^{10,000} y_i log ŷ_i
So that's the usual loss for a softmax where we're representing the target y as a one-hot vector. So this would be a one-hot vector with just a single 1 and the rest zeros, and if the target word is juice, then it'd be element 4834 from up here that is equal to 1, and the rest will be equal to 0. And similarly, ŷ will be a 10,000 dimensional vector output by the softmax unit with probabilities for all 10,000 possible target words. So to summarize, this is the overall little model, a little neural network that basically looks up the embedding and then has just a softmax unit. And the matrix E has a lot of parameters, corresponding to all of these embedding vectors e_c, and then the softmax unit also has the parameters θ_t. But if you optimize this loss function with respect to all of these parameters, you actually get a pretty good set of embedding vectors. So this is called the skip-gram model because it takes as input one word, like orange, and then tries to predict some word skipping a few words to the left or the right, to predict what comes a little bit before or a little bit after the context word. Now, it turns out
there are a couple problems with using this algorithm. And the primary problem is computational
speed. In particular, for the softmax model, every time you want to evaluate this probability, you
need to carry out a sum over all 10,000 words in your vocabulary. And maybe 10,000 isn't too bad,
but if you're using a vocabulary of size 100,000 or a 1,000,000, it gets really slow to sum up over
this denominator every single time. And, in fact, even 10,000 is already slow enough to be a problem, and it gets even harder to scale to larger vocabularies. So there are a few solutions to this, one of which you see in the literature is to use a hierarchical softmax classifier. And what that means is, instead of trying to categorize something into all 10,000 categories in one go, imagine you have one classifier that tells you, is the target word in the first 5,000 words of the vocabulary, or is it in the second 5,000 words of the vocabulary? And let's say this binary classifier tells you it is in the first 5,000 words; then a second classifier tells you whether it's in the first 2,500 words of the vocab or in the second 2,500 words of the vocab, and so on, until eventually you get down to classifying exactly which word it is, at the leaf of this tree. And so having a tree of classifiers like this means that each of the internal nodes of the tree can be just a binary classifier, so you don't need to sum over all 10,000 words, or over the whole vocab size, in order to make a single classification. In fact, the computational cost of classifying with a tree like this
scales like log of the vocab size rather than linear in vocab size. So this is called a hierarchical
softmax classifier. I should mention in practice, the hierarchical softmax classifier doesn't use a
perfectly balanced tree or this perfectly symmetric tree, with equal numbers of words on the left and
right sides of each branch. In practice, the hierarchical softmax classifier can be developed so that the common words tend to be on top, whereas the less common words like durian can be buried much deeper in the tree, because you see the more common words more often, and so you might need only a few traversals to get to common words like the and of, whereas you see less frequent words like durian much less often, so it's okay that they are buried deeper in the tree, because you don't need to go that deep as often. So there are various heuristics for building the tree that you use to build the hierarchical softmax classifier.
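Here is a hedged sketch of the idea, using the simplified, perfectly balanced tree rather than the frequency-based tree just described; the node parameters, the context embedding, and the heap-style word paths are illustrative placeholders.

    import numpy as np

    vocab_size, emb_dim = 10000, 300
    depth = int(np.ceil(np.log2(vocab_size)))     # ~14 binary decisions instead of a 10,000-way sum
    rng = np.random.default_rng(0)

    # One parameter vector per internal node of a complete binary tree (heap-style indexing).
    node_params = rng.normal(scale=0.01, size=(2 ** depth, emb_dim))

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def hierarchical_prob(word_id, e_c):
        # Walk from the root to the leaf for word_id; at each internal node make a
        # left/right decision with a sigmoid and multiply those probabilities together.
        prob, node = 1.0, 1                        # node 1 is the root
        for bit in format(word_id, f"0{depth}b"):  # the word's path, one bit per level
            p_right = sigmoid(node_params[node] @ e_c)
            prob *= p_right if bit == "1" else 1.0 - p_right
            node = 2 * node + int(bit)
        return prob

    e_c = rng.normal(size=emb_dim)                 # context word embedding (placeholder)
    print(hierarchical_prob(4834, e_c))            # cost is O(log vocab) instead of O(vocab)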
So this is one idea you see in the literature for speeding up the softmax classification, but I won't spend too much more time on it. You can read more details in the paper by Tomas Mikolov and others that I referenced earlier. The reason I won't spend more time on this is that in the next section we'll talk about a different method, called negative sampling, which I think is even simpler and also works really well for speeding up the softmax classifier and the problem of needing to sum over the entire vocab size in the denominator. So you'll see more of that in the next section. But before
moving on, one quick topic I want you to understand is how to sample the context c. So once you sample the context c, the target t can be sampled within, say, a plus or minus ten word window of the context c, but how do you choose the context c? One thing you could do is just sample uniformly at random from your training corpus. But when you do that, you find that there are some words like the, of, a, and, to, and so on that appear extremely frequently, and so your context-to-target mapping pairs would contain these types of words extremely frequently, whereas there are other words like orange, apple, and also durian that don't appear that often. And maybe you don't want your training set to be dominated by these extremely frequently occurring words, because then you would spend almost all the effort updating e_c for those frequently occurring words, but you want to make sure that you spend some time updating the embeddings even for the less common words like durian. So in practice, the distribution of words P(c) isn't taken just entirely uniformly at random from the training set corpus; instead there are different heuristics that you can use in order to balance out sampling the common words together with the less common words.
So that's it for the Word2Vec skip-gram model. If you read the original paper that I referenced earlier, you'll find that that paper actually had two versions of this Word2Vec model; the skip-gram was one, and the other one is called CBOW, the continuous bag-of-words model, which takes the surrounding context around the middle word and uses those surrounding words to try to predict the middle word. That algorithm also works, and it has some advantages and disadvantages. But the key problem with the skip-gram model as presented so far is that the softmax step is very expensive to calculate, because of needing to sum over your entire vocabulary size in the denominator of the softmax. In the next section I'll show you an algorithm that modifies the training objective to make it run much more efficiently, which also lets you apply this to a much bigger training set and therefore learn much better word embeddings.
Negative Sampling
In the last section, we saw how the Skip-Gram model allows us to construct a supervised learning
task. So we map from context to target and how that allows you to learn a useful word embedding.
But the downside of that was the Softmax objective was slow to compute. In this section, we'll see a
modified learning problem called negative sampling that allows us to do something similar to the
Skip-Gram model we saw just now, but with a much more efficient learning algorithm. Let's see
how we can do this. Most of the ideas presented in this section are due to Tomas Mikolov, Ilya
Sutskever, Kai Chen, Greg Corrado, and Jeff Dean. So what we're going to do in this algorithm is
create a new supervised learning problem and the problem is, given a pair of words like orange and
juice, we're going to predict, is this a context-target pair? So in this example, orange juice was a
positive example. And how about orange and king? Well, that's a negative example, so we're going to
write 0 for the target. So what we're going to do is actually sample a context and a target word. So in this case, we have orange and juice and we'll associate that with a label of 1. The positive example is generated exactly how we generated it in the previous section: sample a context word, look around
a window of say, plus-minus ten words and pick a target word. So that's how we generate the first
row of this table with orange, juice, 1 and then to generate a negative example, we're going to take
the same context word and then just pick a word at random from the dictionary.
So in this case, we chose the word king at random and we will label that as 0 and then let's take
orange and let's pick another random word from the dictionary. Under the assumption that if we pick
a random word, it probably won't be associated with the word orange, so orange, book, 0. And let's pick a few others: orange and then, maybe just by chance, the word the, which gets a 0, and then orange and, maybe just by chance, the word of, and we'll put a 0 there. And notice that all of these are labeled as 0 even though the word of actually appears next to orange as well. So to summarize,
the way we generated this data set is, we'll pick a context word and then pick a target word and that
is the first row of this table. That gives us a positive example. So context, target, and then give that
a label of 1 and then what we'll do is for some number of times say, k times, we're going to take the
same context word and then pick random words from the dictionary, king, book, the, of, whatever
comes out at random from the dictionary and label all those 0, and those will be our negative
examples and it's okay if just by chance, one of those words we picked at random from the dictionary
happens to appear in the window, in a plus-minus ten word window say, next to the context word,
orange. Then we're going to create a supervised learning problem where the learning algorithm
inputs x, inputs this pair of words, and it has to predict the target label to predict the output y. So the
problem is really given a pair of words like orange and juice, do you think they appear together? Do
you think I got these two words by sampling two words close to each other? Or do you think I got
them as one word from the text and one word chosen at random from the dictionary? It's really to try
to distinguish between these two types of distributions from which you might sample a pair of
words. So this is how you generate the training set. How do you choose k? Mikolov et al. recommend that maybe k is 5 to 20 for smaller data sets, and if you have a very large data set, then you choose k to be smaller, so k equals 2 to 5 for larger data sets, and larger values of k for smaller data sets. Okay,
and in this example, we'll just use k = 4. Next, let's describe the supervised learning model for
learning a mapping from x to y. So here was the Softmax model you saw from the previous section
and here is the training set we got from the previous section where again, this is going to be the new
input x and this is going to be the value of y you're trying to predict.
So to define the model, I'm going to use c to denote the context word, t to denote the possible target word, and y to denote the 0/1 label of the context-target pair. What we're going to do is define a logistic regression model, saying that the chance of y = 1, given the input c, t pair, is

P(y = 1 | c, t) = σ(θ_t^T e_c)

So the parameters are similar to before: you have one parameter vector θ_t for each possible target word, and a separate parameter vector, really the embedding vector e_c, for each possible context word, and we're going to use this formula to estimate the probability that y is equal to 1. So if you have k negative examples here, then you can think of this as having a k to 1 ratio of negative to positive examples. So for every positive example, you have k negative examples with which to train this logistic regression-like model. And so to draw this as a neural network, if the input word is orange, which is word 6257, then what you do is input the one-hot vector, pass it through E, do the multiplication to get the embedding vector e6257, and then what you have is really 10,000 possible logistic regression classification problems.
Where one of these will be the classifier corresponding to, well, is the target word juice or not? And
then there will be others; for example, there might be one somewhere down here which is predicting, is the word king or not, and so on, for all the possible words in your vocabulary. So think of
this as having 10,000 binary logistic regression classifiers, but instead of training all 10,000 of them
on every iteration, we're only going to train five of them. We're going to train the one corresponding to the actual target word we got and then train four randomly chosen negative examples. And this is for
the case where k is equal to 4. So instead of having one giant 10,000 way Softmax, which is very
expensive to compute, we've instead turned it into 10,000 binary classification problems, each of
which is quite cheap to compute. And on every iteration, we're only going to train five of them or
more generally, k + 1 of them: k negative examples and one positive example. And this is why
the computation cost of this algorithm is much lower because you're updating k + 1, let's just say
units, k + 1 binary classification problems. Which is relatively cheap to do on every iteration rather
than updating a 10,000 way Softmax classifier. So you get to play with this algorithm in the problem
exercise for this week as well. So this technique is called negative sampling because what
you're doing is, you have a positive example, the orange and then juice. And then you will
go and deliberately generate a bunch of negative examples, negative samplings, hence, the
name negative sampling, with which to train four more of these binary classifiers and on every
iteration, you choose four different random negative words with which to train your algorithm on.
Now, before wrapping up, one more important detail with this algorithm is, how do you choose the
negative examples? So after having chosen the context word orange, how do you sample these words
to generate the negative examples? So one thing you could do is sample the words in the middle, the candidate target words, according to the empirical frequency of words in your corpus, so just sample according to how often different words appear. But the
problem with that is that you end up with a very high representation of words like the, of, and, and
so on. One other extreme would be to say, you use 1 over the vocab size, sample the negative
examples uniformly at random, but that's also very non-representative of the distribution of English
words. So the authors, Mikolov et al, reported that empirically, what they found to work best was to
take this heuristic value, which is a little bit in between the two extremes of sampling from the
empirical frequencies, meaning from whatever's the observed distribution in English text to the
uniform distribution. And what they did was sample proportional to the frequency of a word raised to the power of three-fourths:

P(w_i) = f(w_i)^(3/4) / Σ_{j=1}^{10,000} f(w_j)^(3/4)

So if f(w_i) is the observed frequency of a particular word in the English language, or in your training set corpus, then by taking it to the power of three-fourths, this is somewhere in between the extreme of the uniform distribution and the other extreme of just taking whatever was the observed distribution in your training set. I'm not sure this is very theoretically justified, but multiple researchers are now using this heuristic, and it seems to work decently well.
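As a hedged sketch of negative sampling, here is a minimal example; the word frequencies, embeddings, and parameter vectors are random placeholders, and it only shows the σ(θ_t^T e_c) classifier and the f(w)^(3/4) sampling described above, not a full training loop.

    import numpy as np

    vocab_size, emb_dim, k = 10000, 300, 4
    rng = np.random.default_rng(0)

    E = rng.normal(scale=0.01, size=(vocab_size, emb_dim))      # word embeddings e_c (rows)
    theta = rng.normal(scale=0.01, size=(vocab_size, emb_dim))  # one theta_t per target word

    # Sampling distribution for negative words: proportional to f(w)^(3/4).
    word_freq = rng.random(vocab_size)                          # placeholder corpus frequencies
    p_neg = word_freq ** 0.75
    p_neg /= p_neg.sum()

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def training_example(context_id, target_id):
        # One positive pair plus k sampled negative pairs, each scored by sigma(theta_t . e_c).
        negatives = rng.choice(vocab_size, size=k, replace=False, p=p_neg)
        e_c = E[context_id]
        rows = [(target_id, 1)] + [(int(neg), 0) for neg in negatives]
        return [(t, label, sigmoid(theta[t] @ e_c)) for t, label in rows]

    # e.g. context "orange" (word 6257) with true target "juice" (word 4834)
    for t, label, p in training_example(6257, 4834):
        print(f"target word {t}: label={label}, predicted P(y=1)={p:.3f}")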
So to summarize, we've seen how you can learn word vectors with a softmax classifier, but it's very computationally expensive, and in this section, we saw how, by changing that to a bunch of binary classification problems, you can very efficiently learn word vectors, and if you run this algorithm, you'll
be able to learn pretty good word vectors. Now of course, as is the case in other areas of deep
learning as well, there are open source implementations and there are also pre-trained word vectors
that others have trained and released online under permissive licenses.
The input X is a piece of text and the output Y that you want to predict is what is the sentiment, such
as the star rating of, let's say, a restaurant review. So if someone says, "The dessert is excellent" and
they give it a four-star review, "Service was quite slow" two-star review, "Good for a quick meal but
nothing special" three-star review. And this is a pretty harsh review, "Completely lacking in good
taste, good service, and good ambiance." That's a one-star review. So if you can train a system to map from X to Y based on a labeled data set like this, then you could use it to monitor comments that people are saying about maybe a restaurant that you run. People might also post messages about your restaurant on social media, on Twitter, or Facebook, or Instagram, or other forms of social media, and if you have a sentiment classifier, you can look at just a piece of text and figure out how positive or negative the sentiment of the poster is toward your restaurant. Then you'll also be able to keep track of whether or not there are any problems, or whether your restaurant is getting better or worse over time. So one of the challenges of sentiment classification is you might not have a huge labeled data set. For sentiment classification tasks, training sets with maybe anywhere from 10,000 to maybe 100,000 words would not be uncommon, sometimes even smaller than 10,000 words, and word embeddings that you can take can help you do much better, especially when you have a small training set. So here's what you can do. We'll go over a couple of different algorithms in this
section. Here's a simple sentiment classification model.
You can take a sentence like "The dessert is excellent" and look up those words in your dictionary.
We use a 10,000-word dictionary as usual. And let's build a classifier to map it to the output Y that
this was four stars. So given these four words, as usual, we can take these four words and look up the one-hot vectors. So for the word "the" there's o8928, a one-hot vector, which you multiply by the embedding matrix E, which you can learn from a much larger text corpus. It can learn the embedding from, say, a billion
words or a hundred billion words, and use that to extract out the embedding vector for the word
"the", and then do the same for "dessert", do the same for "is" and do the same for "excellent". And
if this was trained on a very large data set, like a hundred billion words, then this allows you to take a
lot of knowledge even from infrequent words and apply them to your problem, even words that
weren't in your labeled training set. Now here's one way to build a classifier, which is that you can
take these vectors, let's say these are 300-dimensional vectors, and you could then just sum or
average them and I'm just going to put a bigger average operator here and you could use sum or
average. And this gives you a 300-dimensional feature vector that you then pass to a soft-max
classifier which then outputs Y-hat and so the softmax can output what are the probabilities of the
five possible outcomes from one-star up to five-star. So this will be a softmax over the five possible outcomes to predict Y. Notice that by using the average operation here, this particular algorithm works for reviews that are short or long, because even if a review is 100 words long, you can just sum or average all the feature vectors for all hundred words, and that gives you a representation, a 300-dimensional feature representation, that you can then pass into your sentiment classifier. So this average will work decently well; what it does is really average the meanings of all the words, or sum the meanings of all the words, in your example. One of the problems with this algorithm, though, is that it ignores word order. In particular, this is a
very negative review, "Completely lacking in good taste, good service, and good ambiance". But the
word good appears a lot.
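Here is a minimal sketch of the averaging classifier just described; the embedding matrix, word ids, and softmax weights are random placeholders, whereas a real system would use learned embeddings and trained weights.

    import numpy as np

    vocab_size, emb_dim, num_stars = 10000, 300, 5
    rng = np.random.default_rng(0)

    E = rng.normal(scale=0.01, size=(vocab_size, emb_dim))   # pretrained embeddings (placeholder)
    W = rng.normal(scale=0.01, size=(num_stars, emb_dim))    # softmax weights
    b = np.zeros(num_stars)

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def predict_star_probs(word_ids):
        # Average the embeddings of all words in the review, regardless of its length,
        # then feed the single 300-dimensional vector into a 5-way softmax.
        avg = E[word_ids].mean(axis=0)
        return softmax(W @ avg + b)

    review = [8928, 1, 2, 3]           # e.g. "the dessert is excellent" as word ids (placeholders)
    print(predict_star_probs(review))  # probabilities for 1 through 5 stars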
So the final detail is, how do you decide what word to neutralize? So for example, the word doctor
seems like a word you should neutralize to make it non-gender-specific or non-ethnicity-specific.
Whereas the words grandmother and grandfather should not be made non-gender-specific, and there
are also words like beard, right, that it's just a statistical fact that men are much more likely to have
beards than women, so maybe beards should be closer to male than female. And so what the authors
did is train a classifier to try to figure out what words are definitional, what words should be gender-
specific and what words should not be. And it turns out that most words in the English language are not definitional, meaning that gender is not part of the definition; it's only a relatively small subset of words, like grandmother-grandfather, girl-boy, sorority-fraternity, and so on, that should not be neutralized. And so a linear classifier can tell you what words to pass through the neutralization step to project out this bias direction, that is, to project them onto this essentially 299-dimensional subspace. And then, finally, the number of pairs you want to equalize is actually also relatively small, and, at least for the gender example, it is quite feasible to hand-pick most of the pairs you want to equalize. So the full algorithm is a bit more complicated than what we have seen here; you can take a look
at the paper for the full details.
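As a hedged sketch of just the neutralization step, here the bias direction is estimated crudely from a single word pair and the embeddings are random placeholders; the paper uses a more careful construction.

    import numpy as np

    def neutralize(e_word, bias_direction):
        # Remove the component of the word vector that lies along the bias direction,
        # leaving its projection onto the remaining (299-dimensional) subspace.
        g = bias_direction / np.linalg.norm(bias_direction)
        return e_word - (e_word @ g) * g

    # Toy 300-dimensional embeddings, purely for illustration.
    rng = np.random.default_rng(0)
    e_woman, e_man = rng.normal(size=300), rng.normal(size=300)
    e_doctor = rng.normal(size=300)

    g = e_woman - e_man                  # crude estimate of the gender direction
    e_doctor_debiased = neutralize(e_doctor, g)
    print(e_doctor_debiased @ g)         # ~0: no component left along the bias direction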
So to summarize, reducing or eliminating bias in our learning algorithms is a very important problem because these algorithms are being asked to help with, or to make, more and more important decisions in society. In this section we have seen just one set of ideas for how to go about trying to address this problem, but this is still very much an ongoing area of active research by many
researchers.
Let's say you want to input a French sentence like Jane visite l'Afrique en septembre, and you want to
translate it to the English sentence, Jane is visiting Africa in September. As usual, let's use x<1>
through x, in this case x<5>, to represent the words in the input sequence, and we'll use y<1>
through y<6> to represent the words in the output sequence. So, how can you train a neural network to input the sequence x and output the sequence y? Well, here's something you could do, and the ideas I'm about to present are mainly from two papers: one due to Sutskever, Oriol Vinyals, and Quoc Le, and the other by Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio.
First, let's have a network, which we're going to call the encoder network, built as an RNN, and this could be a GRU or an LSTM, and feed in the input French words one word at a time. After ingesting the input sequence, the RNN then outputs a vector that represents the input sentence. After that, you can build a decoder network, which I'm going to draw here, which takes as input the encoding output by the encoder network shown in black on the left, and can then be trained to output the translation one word at a time, until eventually it outputs, say, the end-of-sequence or end-of-sentence token, upon which the decoder stops. And as usual, we could take the generated tokens and feed them to the next time step in the sequence, like we were doing before when synthesizing text using the language model.
One of the most remarkable recent results in deep learning is that this model works, given enough
pairs of French and English sentences. If you train the model to input a French sentence and output
the corresponding English translation, this will actually work decently well and this model simply uses
an encoder network, whose job it is to find an encoding of the input French sentence and then use a
decoder network to then generate the corresponding English translation.
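Here is a hedged sketch of that encoder-decoder architecture in Keras; the layer sizes, vocabulary sizes, and the teacher-forcing setup are illustrative choices, not the papers' exact configurations.

    import tensorflow as tf
    from tensorflow.keras import layers

    Tx, Ty = 5, 6                          # e.g. 5 French words in, 6 English words out
    vocab_fr, vocab_en, n_units = 10000, 10000, 256

    # Encoder: reads the French sentence and summarizes it in its final hidden state.
    enc_in = layers.Input(shape=(Tx,), dtype="int32")
    enc_emb = layers.Embedding(vocab_fr, 300)(enc_in)
    _, enc_state = layers.GRU(n_units, return_state=True)(enc_emb)

    # Decoder: starts from the encoder's state and predicts the English words one at a time
    # (fed the previous true words during training, i.e. teacher forcing).
    dec_in = layers.Input(shape=(Ty,), dtype="int32")
    dec_emb = layers.Embedding(vocab_en, 300)(dec_in)
    dec_seq = layers.GRU(n_units, return_sequences=True)(dec_emb, initial_state=enc_state)
    dec_probs = layers.Dense(vocab_en, activation="softmax")(dec_seq)

    model = tf.keras.Model([enc_in, dec_in], dec_probs)
    model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
    model.summary()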
An architecture very similar to this also works for image captioning. So given an image like the one shown here, maybe you want it to be captioned automatically as "a cat sitting on a chair". So how do you train a neural network to input an image and output a caption like that phrase up there? Here's what you can do. From the earlier course on convolutional neural networks, you've seen how you can input an image into a convolutional network, maybe a pre-trained AlexNet, and have that learn an encoding, or learn a set of features, of the input image. So, this is actually the AlexNet architecture, and if we get rid of the final softmax unit, the pre-trained AlexNet can give you a 4096-dimensional feature vector with which to represent this picture of a cat. And so this pre-trained network can be the encoder network for the
image and you now have a 4096-dimensional vector that represents the image. You can then take
this and feed it to an RNN, whose job it is to generate the caption one word at a time. So similar to
what we saw with machine translation translating from French to English, you can now input a feature
vector describing the input and then have it generate an output sequence or output set of words one
word at a time and this actually works pretty well for image captioning, especially if the caption you
want to generate is not too long. As far as I know, this type of model was first proposed by Junhua
Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille, although it turns out there were
multiple groups coming up with very similar models independently and at about the same time. So
two other groups that had done very similar work at about the same time and I think independently
of Mao et al. were Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan, as well as Andrej Karpathy and Fei-Fei Li. So, you've now seen how a basic sequence-to-sequence model works, or how a basic image-to-sequence or image captioning model works, but there are some differences between how you would run a model like this when generating a sequence, compared to how you were synthesizing novel text using a language model. One of the key differences is that you don't want a randomly chosen translation; you maybe want the most likely translation. Or you don't want a randomly chosen caption; you might want the best, most likely caption.
So let's see in the next section how you go about generating that.
The machine translation model looks as follows, and I am going to use a couple of different colors, green and purple, to denote respectively the encoder network in green and the decoder network in purple, and you notice that the decoder network looks pretty much identical to the language model that we had up there. So the machine translation model is very similar to the language model, except that instead of always starting off with a vector of all zeros, it instead has an encoder network that figures out some representation for the input sentence, and it starts off the decoder network with that representation of the input sentence rather than with a representation of all zeros. So, that's why we call this a conditional language model: instead of modeling the probability of any sentence, it is now modeling the probability of, say, the output English translation, conditioned on some input French sentence. In other words, you're trying to estimate the probability of an English translation. Like, what's the chance that the translation is "Jane is visiting Africa in September," conditioned on the input French sentence, "Jane visite l'Afrique en septembre." So, this is really the probability of an English sentence conditioned on an input French sentence, which is why it is a conditional language model. Now, if you
want to apply this model to actually translate a sentence from French into English, given this input
French sentence, the model might tell you the probability of different corresponding English translations. So, x is the French sentence, "Jane visite l'Afrique en septembre." And this now
tells you what is the probability of different English translations of that French input. And, what you
do not want is to sample outputs at random. If you sample words from this distribution, p of y given
x, maybe one time you get a pretty good translation, "Jane is visiting Africa in September." But,
maybe another time you get a different translation, "Jane is going to be visiting Africa in September. "
Which sounds a little awkward but is not a terrible translation, just not the best one. And sometimes,
just by chance, you get, say, others: "In September, Jane will visit Africa." And maybe, just by
chance, sometimes you sample a really bad translation: "Her African friend welcomed Jane in
September."
So, when you're using this model for machine translation, you're not trying to sample at random from
this distribution. Instead, what you would like is to find the English sentence, y, that maximizes that
conditional probability. So in developing a machine translation system, one of the things you need to
do is come up with an algorithm that can actually find the value of y that maximizes this term over
here. The most common algorithm for doing this is called beam search, and it's something you'll see
in the next section. But, before moving on to describe beam search, you might wonder, why not just
use greedy search? So, what is greedy search? Well, greedy search is an algorithm from computer
science which says: to generate the first word, just pick whatever is the most likely first word according to your conditional language model (here, your machine translation model). Then, after having picked the first word, you pick whatever second word seems most likely, then pick the third word that seems most likely, and so on. This algorithm is called greedy search.
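As a hedged sketch of greedy decoding, here the next_word_probs function is a hypothetical placeholder standing in for the trained conditional language model; a real system would call the decoder network instead of returning random probabilities.

    import numpy as np

    EOS = 0                                      # assume word id 0 is the end-of-sentence token
    vocab_size = 10000
    rng = np.random.default_rng(0)

    def next_word_probs(encoded_x, partial_translation):
        # Placeholder for the decoder: returns P(next word | x, words chosen so far).
        logits = rng.normal(size=vocab_size)
        e = np.exp(logits - logits.max())
        return e / e.sum()

    def greedy_decode(encoded_x, max_len=20):
        translation = []
        for _ in range(max_len):
            probs = next_word_probs(encoded_x, translation)
            word = int(np.argmax(probs))         # greedily commit to the single most likely word
            if word == EOS:
                break
            translation.append(word)
        return translation

    print(greedy_decode(encoded_x=None))         # a sequence of word ids (random here)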
Let's go into the second step of beam search. Having picked "in", "Jane", and "September" as the three most likely choices for the first word, what beam search will do now is, for each of these three choices, consider what should be the second word. So after "in", maybe the second word is "a", or maybe it is "Aaron" (I'm just listing words from the vocabulary, from the dictionary), or somewhere down the list will be "September", somewhere down the list there's "visit", and then all the way down to the last word, "zulu". So, to evaluate the probability of the second word, it will use this network fragment, with the encoder in green and the decoder portion in purple, when trying to decide what comes after "in". Remember
the decoder first outputs yhat<1>. So, I'm going to set yhat<1> to the word "in" and feed it back in. So there's the word "in", because, having decided for now that the first word was "in", it is trying to figure out what the second word is, and then this will output yhat<2>. And so by hardwiring yhat<1> here, really the input here, to be the first word "in", this network fragment can be used to evaluate the probability of the second word given the input French sentence and given that the first word of the translation was "in". Now notice that what we also care about in this second step of beam search is to find the pair of the first and second words that is most likely, not just the second word that is most likely. By the rules of conditional probability, P(y<1>, y<2> | x) = P(y<1> | x) * P(y<2> | x, y<1>), where the second factor is what you are getting from this network fragment. And so if, for each of the three words you've chosen, "in", "Jane", and "September", you save away the first probability, then you can multiply it by these second probabilities to get the probability of the first and second words together. So now you've seen how, if the first word was "in", you
can evaluate the probability of the second word. Now at first it was "Jane" you do the same thing.
The sentence could be "Jane a", "Jane Aaron", and so on, down to "Jane is", "Jane visits", and so on.
You use the same kind of network fragment, drawn in as well, where you hardwire yhat<1> to be
"Jane". With yhat<1> hardwired as "Jane", the network fragment tells you the probability of the
second word given the input French sentence and given that the first word is "Jane". And then, same
as above, you can multiply by P(y<1> | x) to get the probability of y<1> and y<2> for each of these
10,000 different possible choices for the second word. Finally, you do the same thing for "September",
considering all words from "a" down to "zulu", using another copy of the network fragment to see, if
the first word was "September", what the most likely options for the second word are. So for this
second step of beam search, because we're continuing to use a beam width of three and because
there are 10,000 words in the vocabulary, you end up considering three times 10,000, or 30,000,
possibilities: 10,000 for each of the three first-word candidates, which is the beam width times the
number of words in the vocabulary. What you do is evaluate all of these 30,000 options according to
the probability of the first and second words, and then pick the top three. So you cut these 30,000
possibilities back down to three, down to the beam width. Let's say that, out of the 30,000 choices,
the most likely were "in September", "Jane is", and "Jane visits". Those are the most likely three out
of the 30,000 choices, so that's what beam search would memorize and carry on to the next step.
Notice one thing: if beam search decides that the most likely choices for the first and second words
are "in September", "Jane is", or "Jane visits", then it is now rejecting "September" as a candidate for
the first word of the output English translation. We're down to two possibilities for the first word, but
we still have a beam width of three, keeping track of three choices for the pair y<1>, y<2> before
going on to the third step of beam search. Also notice that because the beam width is equal to three,
at every step you instantiate three copies of the network to evaluate these partial sentence fragments
and their outputs. It's because the beam width is three that you have three copies of the network
with different choices for the first word, and these three copies can be used very efficiently to
evaluate all 30,000 options for the second word. So you don't instantiate 30,000 copies of the
network; three copies let you very quickly evaluate all 10,000 possible outputs at the softmax output
for yhat<2>.
Let's quickly illustrate one more step of beam search. We said that the most likely choices for the
first two words were "in September", "Jane is", and "Jane visits", and for each of these pairs of words
we have saved away, in computer memory, the probability of y<1>, y<2> given the input French
sentence x. Similar to before, we now want to consider what the third word is: "in September a"? "in
September Aaron"? All the way down to "in September zulu". To evaluate possible choices for the
third word, you use the network fragment where you hardwire the first word to be "in" and the
second word to be "September", and so this network fragment lets you evaluate the probability of
the third word given the input French sentence x and given that the first two words of the English
output are "in September". Then you do the same thing for the second fragment, "Jane is", and the
same thing for "Jane visits". Beam search will then once again pick the top three possibilities: maybe
"in September Jane" is a likely outcome, or "Jane is visiting" is likely, or maybe "Jane visits Africa" is
likely for the first three words. Then it keeps going, and you go on to the fourth step of beam search
and add one more word, and on it goes. The outcome of this process, hopefully, is that, adding one
word at a time, beam search will decide that "Jane visits Africa in September", terminated by the
end-of-sentence symbol (using that symbol is quite common), is a likely output English sentence, and
you'll see more details of this yourself. So with a beam width of three, beam search considers three
possibilities at a time. Notice that if the beam width were set equal to one, considering only one
possibility at a time, then this essentially becomes the greedy search algorithm which we discussed in
the last section. But by considering multiple possibilities, say three or ten or some other number, at
the same time, beam search will usually find a much better output sentence than greedy search.
You've now seen how beam search works, but it turns out there are some additional tips and tricks
that help you make beam search work even better. Let's go on to the next section to take a look.
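Before moving on, here is a minimal, hedged sketch of the beam-search loop just described. It reuses
the hypothetical next_word_probs(x, prefix) from the greedy-search sketch, and it works in log
probabilities to avoid numerical underflow; the beam width, maximum length, and EOS token are
illustrative choices, and the length normalization discussed next is not applied here.

```python
import math

# Minimal beam search sketch (illustrative only); reuses the hypothetical
# next_word_probs(x, prefix) returning P(y<t> | x, y<1>, ..., y<t-1>) as a dict.

def beam_search(x, next_word_probs, beam_width=3, max_len=30, eos="<EOS>"):
    # Each beam entry is (log probability of the partial sentence, list of words).
    beams = [(0.0, [])]
    completed = []
    for _ in range(max_len):
        candidates = []
        for log_p, prefix in beams:
            # Beam-width copies of the network: one distribution per kept prefix.
            probs = next_word_probs(x, prefix)
            for word, p in probs.items():            # B * |vocab| candidates in total
                candidates.append((log_p + math.log(p), prefix + [word]))
        # Keep only the top beam_width partial sentences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for log_p, prefix in candidates[:beam_width]:
            if prefix[-1] == eos:
                completed.append((log_p, prefix))    # finished hypothesis
            else:
                beams.append((log_p, prefix))
        if not beams:
            break
    # Raw log probabilities favor short sentences; length normalization would rescore these.
    return max(completed + beams, key=lambda c: c[0])
```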
So just to wrap up how you run beam search: as you run beam search, you see a lot of sentences
with length 1, a lot of sentences with length 2, a lot of sentences with length 3, and so on, and
maybe you run beam search for 30 steps and consider output sentences up to length 30, let's say.
With a beam width of 3, you will be keeping track of the top three possibilities for each of these
possible sentence lengths, 1, 2, 3, 4 and so on, up to 30. Then, you look at all of the output
sentences and score them with the normalized log-probability objective, so you take your top
sentences and compute this objective function on the sentences you have seen through the beam
search process. And then finally, of all of these sentences that you evaluate this way, you pick the
one that achieves the highest value on this normalized log-probability objective, sometimes called a
normalized log-likelihood objective, and that would be the final translation you output.
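For reference, the length-normalized objective referred to above is typically written as follows, where
T_y is the output length and alpha is a hyperparameter (often somewhere around 0.7 in practice) that
softens the normalization:

\frac{1}{T_y^{\alpha}} \sum_{t=1}^{T_y} \log P\left(y^{<t>} \mid x,\ y^{<1>}, \ldots, y^{<t-1>}\right)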
Finally, a few implementational details: how do you choose the beam width B? Here are the pros and
cons of setting B very large versus very small. If the beam width is very large, then you consider a
lot of possibilities, and so you tend to get a better result because you are considering a lot of
different options, but it will be slower and the memory requirements will also grow. Whereas if you
use a very small beam width, you get a worse result because you're keeping fewer possibilities in
mind as the algorithm runs, but you get a result faster and the memory requirements are lower. In
the previous section we used a beam width of three in our running example, so we were keeping
three possibilities in mind. In practice, that is on the small side. In production systems, it's not
uncommon to see a beam width of maybe around 10, and a beam width of 100 would be considered
very large for a production system, depending on the application. But for research systems, where
people want to squeeze out every last drop of performance in order to publish a paper with the best
possible result, it's not uncommon to see beam widths of 1,000 or 3,000. This is very application-
and domain-dependent, so I would say try a variety of values of B as you work on your application.
But when B gets very large, there are often diminishing returns. For many applications, I would
expect to see a big gain as you go from a beam width of 1, which is just greedy search, to 3, to
maybe 10; but the gain as you go from 1,000 to 3,000 might not be as big. For those of you who
have taken a lot of computer science courses before and are familiar with search algorithms like BFS
(Breadth First Search) or DFS (Depth First Search), the way to think about beam search is that,
unlike those exact search algorithms, beam search runs much faster but is not guaranteed to find the
exact maximum for the arg max you would like to find. If you haven't heard of breadth first search or
depth first search, don't worry about it; it's not important for our purposes. But if you have, this is
how beam search relates to those algorithms.
So that's it for beam search, which is a widely used algorithm in many production and commercial
systems. Now, earlier in this sequence of courses, we talked a lot about error analysis. It turns out
one of the most useful tools I've found is to be able to do error analysis on beam search. You
sometimes wonder: should I increase my beam width? Is my beam width working well enough? There
are some simple things you can compute to give you guidance on whether you need to work on
improving your search algorithm. Let's talk about that in the next section.
Here are the two sentences from the previous slide. Remember, we're going to compute P(y* given x)
and P(y-hat given x) and see which of the two is bigger. There are going to be two cases. In case 1,
P(y* given x) as output by the RNN model is greater than P(y-hat given x). What does this mean?
Well, the beam search algorithm chose y-hat. The way you got y-hat was you had an RNN computing
P(y given x), and beam search's job was to try to find a value of y that gives the arg max. But in this
case, y* actually attains a higher value of P(y given x) than y-hat. So what this allows you to
conclude is that beam search is failing to give you the value of y that maximizes P(y given x),
because the one job beam search had was to find the value of y that makes this really big, but it
chose y-hat even though y* gets a bigger value. In this case, you can conclude that beam search is
at fault. Now, how about the other case? In case 2, P(y* given x) is less than or equal to P(y-hat
given x). Either case 1 or case 2 has to hold true. What do you conclude under case 2? Well, in our
example, y* is a better translation than y-hat, but according to the RNN, P(y* given x) is less than
P(y-hat given x), saying that y* is a less likely output than y-hat. So in this case, it seems that the
RNN model is at fault, and it might be worth spending more time working on the RNN. There are
some subtleties pertaining to length normalization that I'm glossing over: if you are using some sort
of length normalization, then instead of evaluating these probabilities, you should be evaluating the
optimization objective that takes length normalization into account. But ignoring that complication
for now, what case 2 tells you is that even though y* is a better translation, the RNN ascribed y* a
lower probability than the inferior translation, so the RNN model is at fault. The error analysis
process looks as follows. You go through the development set and find the mistakes that the
algorithm made. In this example, let's say P(y* given x) was 2 x 10^-10, whereas P(y-hat given x)
was 1 x 10^-10. Using the logic from the previous slide, in this case we see that beam search
actually chose y-hat, which has a lower probability than y*.
So I will say beam search is at fault, and I'll abbreviate that B. Then you go through a second
mistake, or second bad output by the algorithm, look at these probabilities, and maybe for that
second example you conclude the model is at fault, which I'll abbreviate R for the RNN model. You go
through more examples, and sometimes beam search is at fault, sometimes the model is at fault, and
so on. Through this process you can carry out error analysis to figure out what fraction of errors is
due to beam search versus the RNN model. With an error analysis process like this, for every
example in your dev set where the algorithm gives a much worse output than the human translation,
you can try to ascribe the error either to the search algorithm or to the RNN model that generates
the objective function that beam search is supposed to be maximizing. Through this, you can figure
out which of these two components is responsible for more errors. Only if you find that beam search
is responsible for a lot of errors is it worth working hard to increase the beam width. Whereas, in
contrast, if you find that the RNN model is at fault, then you can do a deeper level of analysis to
figure out whether you want to add regularization, get more training data, try a different network
architecture, or something else, and a lot of the techniques you saw in the third course in this
sequence will be applicable there. So that's it for error analysis using beam search. I have found this
particular error analysis process very useful whenever you have an approximate optimization
algorithm, such as beam search, that is working to optimize some sort of objective, some sort of cost
function, that is output by a learning algorithm, such as the sequence-to-sequence RNN we've been
discussing in these lectures. With that, I hope you'll be more efficient at making these types of
models work well for your applications.
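As a concrete illustration of the bookkeeping involved, here is a minimal sketch of that attribution
loop. It assumes a hypothetical model_log_prob(x, y) that returns the score the RNN assigns to a
translation (length-normalized, if you use that); the example list of dev-set triples is made up.

```python
# Minimal error-attribution sketch (illustrative only).
# Assumes a hypothetical model_log_prob(x, y) returning the score the RNN
# (optionally length-normalized) assigns to translation y of input x.

def attribute_errors(dev_examples, model_log_prob):
    counts = {"beam_search": 0, "rnn_model": 0}
    for x, y_star, y_hat in dev_examples:       # (input, human translation, beam output)
        if model_log_prob(x, y_star) > model_log_prob(x, y_hat):
            counts["beam_search"] += 1          # case 1: search missed a higher-scoring y*
        else:
            counts["rnn_model"] += 1            # case 2: model prefers the worse translation
    return counts
```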
BLEU Score
One of the challenges of machine translation is that, given a French sentence, there could be
multiple English translations that are equally good translations of that French sentence. So how do
you evaluate a machine translation system if there are multiple equally good answers, unlike, say,
image recognition, where there's one right answer and you just measure accuracy? If there are
multiple great answers, how do you measure accuracy? The way this is done conventionally is
through something called the BLEU score. Let's say you are given a French sentence, "Le chat est
sur le tapis," and you are given a reference, human-generated translation of it, which is "The cat is
on the mat." But there are multiple pretty good translations, so a different person might translate it
as "There is a cat on the mat," and both of these are perfectly fine translations of the French
sentence. What the BLEU score does is, given a machine-generated translation, it allows you to
automatically compute a score that measures how good that machine translation is, and the intuition
is that so long as the machine-generated translation is pretty close to any of the references provided
by humans, it will get a high BLEU score. BLEU stands for bilingual evaluation understudy. In the
theater world, an understudy is someone who learns the role of a more senior actor so they can take
over that role if necessary, and the motivation for BLEU is that, whereas you could ask human
evaluators to evaluate the machine translation system, the BLEU score is an understudy: it can be a
substitute for having humans evaluate every output of a machine translation system. The BLEU score
is due to Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. This paper has been
incredibly influential and is actually quite readable, so I encourage you to take a look if you have
time. The intuition behind the BLEU score is that we're going to look at the machine-generated
output and see if the kinds of words it generates appear in at least one of the human-generated
references, and these human-generated references would be provided as part of the dev set or the
test set.
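To make the intuition more concrete, here is a rough sketch of the modified (clipped) n-gram
precision that BLEU builds on; the real metric combines these precisions over several n-gram sizes
with a geometric mean and adds a brevity penalty, which this sketch omits.

```python
from collections import Counter

# Rough sketch of BLEU's modified n-gram precision (illustrative only).
# Full BLEU combines these precisions for n = 1..4 and adds a brevity penalty.

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def modified_precision(candidate, references, n):
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # Each candidate n-gram is credited at most as many times as it appears
    # in the reference that contains it most often ("clipping").
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

# Example with the reference sentences from the text:
refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(modified_precision("the the the the the the the".split(), refs, n=1))  # 2/7
```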
Having a single real-number evaluation metric is very important, because it allows you to try out two
ideas, see which one achieves a higher score, and stick with the one that did better. The reason the
BLEU score was revolutionary for machine translation is that it gave a pretty good, by no means
perfect, but pretty good, single real-number evaluation metric, and that accelerated the progress of
the entire field of machine translation. This section gave us a sense of how the BLEU score works. In
practice, few people implement a BLEU score from scratch; there are open source implementations
that you can download and use to evaluate your own system. Today, the BLEU score is used to
evaluate many systems that generate text, such as machine translation systems, as well as image
captioning systems, the example I mentioned briefly earlier, where you have a neural network
generate an image caption and then use the BLEU score to see how much it overlaps with one or
more reference captions generated by people. So the BLEU score is a useful single real-number
evaluation metric to use whenever you want your algorithm to generate a piece of text and you want
to see whether it has a similar meaning as a reference piece of text generated by humans. It is not
used for speech recognition, because in speech recognition there is usually one ground truth, and you
just use other measures to see if you got the transcription pretty much exactly word-for-word
correct. But for things like image captioning, where multiple captions for a picture can be about
equally good, or machine translation, where there are multiple equally good translations, the BLEU
score gives you a way to evaluate quality automatically and therefore speed up your development.
You kind of work part by part through the sentence, because it's just really difficult to memorize a
whole long sentence like that. What you see for the Encoder-Decoder architecture above is that it
works quite well for short sentences, so we might achieve a relatively high BLEU score, but for very
long sentences, maybe longer than 30 or 40 words, the performance comes down. The BLEU score
dips like this as the sentence length grows: short sentences are just hard to translate, hard to get all
the words right, and long sentences it doesn't do well on because it's difficult to get a neural network
to memorize a super long sentence. In this and the next section, you'll see the Attention Model,
which translates maybe a bit more like humans might, looking at part of the sentence at a time. With
an Attention Model, a machine translation system's performance doesn't show this huge dip, which is
really measuring the ability of a neural network to memorize a long sentence, which maybe isn't
what we most badly need a neural network to do. In this section, I want to give you some intuition
about how attention works, and then we'll flesh out the details in the next section.
The Attention Model is due to Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, and even
though it was originally developed for machine translation, it has spread to many other application
areas as well. This is a very influential, I think seminal, paper in the deep learning literature. Let's
illustrate it with a short sentence, even though these ideas were developed more for long sentences;
it'll be easier to illustrate them with a simpler example. We have our usual sentence, "Jane visite
l'Afrique en septembre." Let's say that we use an RNN, in this case a bidirectional RNN, to compute
some set of features for each of the input words. The bidirectional RNN would normally have outputs
y<1> through y<5>, but we're not doing a word-for-word translation, so let me get rid of the y's on
top. Using a bidirectional RNN, what we've done is, for each of the words, really for each of the five
positions in the sentence, compute a very rich set of features about the word at that position and
maybe the surrounding words. Now, let's go ahead and generate the English translation.
We're going to use another RNN to generate the English translation. Here's my RNN node as usual,
and instead of using a to denote its activations, in order to avoid confusion with the activations down
below, I'm going to use s to denote the hidden state of this RNN, so instead of writing a<1> I'm
going to write s<1>. We hope that the first word this model generates is "Jane", on the way to
generating "Jane visits Africa in September." Now, the question is, when you're trying to generate
this first output word, which part of the input French sentence should you be looking at? It seems like
you should be looking primarily at the first word, and maybe a few words close by, but you don't
need to be looking all the way at the end of the sentence. What the Attention Model computes is a
set of attention weights, and we're going to use alpha<1,1> to denote, when you're generating the
first word, how much you should be paying attention to the first piece of input information. Then
we'll also have a second attention weight, alpha<1,2>, which tells us, when we're trying to compute
the first word, "Jane", how much attention we're paying to the second word of the input, and so on
with alpha<1,3> and so forth. Together these tell us what exactly the context, which we'll denote C,
is that we should be paying attention to, and that context is the input to this RNN unit, which then
tries to generate the first word. That's one step of the RNN; we'll flesh out all these details in the
next section. For the second step of this RNN, we have a new hidden state s<2> and a new set of
attention weights. We have alpha<2,1> to tell us, when we generate the second word, which I guess
will be "visits", that being the ground truth label, how much attention we should pay to the first word
in the French input, and also alpha<2,2> to tell us how much attention we should pay to the word
"visite", how much to "l'Afrique", and so on. Of course, the first word we generated, "Jane", is also an
input to this step, and then there is some context that we're paying attention to at the second step,
which is also an input, and together these generate the second word. That leads us to the third step,
s<3>, where the previously generated word is an input and we have some new context C that
depends on the various alpha<3,t'> for the different time steps, telling us how much we should be
paying attention to the different words of the input French sentence, and so on. There are some
things I haven't specified yet that we'll go into in more detail in the next section, such as how exactly
this context is defined; the goal of the context for, say, the third word is to capture that maybe we
should be looking around this part of the sentence. The formula you use to do that, as well as how
you compute these attention weights, is deferred to the next section. You'll see there that alpha<3,t'>,
the amount of attention that this RNN step, when trying to generate the third word (I guess "Africa"
would be the right output), should pay to the French word at time t', depends on the forward and
backward activations of the bidirectional RNN at time t', and on the state from the previous step,
s<2>. These things together determine how much you pay attention to a specific word in the input
French sentence. But we'll flesh out all these details in the next section. The key intuition to take
away is that, in this way, the RNN marches forward, generating one word at a time, until eventually
it generates maybe the EOS token, and at every step there are attention weights alpha<t,t'> that tell
it, when you're trying to generate the t-th English word, how much you should be paying attention to
the t'-th French word. This allows it, on every time step, to look only within a local window of the
French sentence when generating a specific English word. I hope this section conveys some intuition
about the Attention Model and gives a rough sense of how the algorithm works. Let's go to the next
section to flesh out the details of the Attention Model.
Attention Model
In the previous section, you saw how the attention model allows a neural network to pay attention to
only part of an input sentence while it's generating a translation, much like a human translator might.
Let's now formalize that intuition into the exact details of how you would implement an attention
model. Same as in the previous section, let's assume you have an input sentence and you use a
bidirectional RNN, bidirectional GRU, or bidirectional LSTM to compute features on every word. In
practice, GRUs and LSTMs are often used for this, with LSTMs perhaps being more common. For the
forward recurrence, you have a forward activation at the first time step, a backward activation at the
first time step, a forward activation at the second time step, a backward activation at the second time
step, and so on, up to the forward and backward activations at the fifth time step. Technically, we can
also have a backward activation at time step six, which would be a vector of all zeros. To simplify the
notation going forward, at every time step, even though you have features computed from both the
forward recurrence and the backward recurrence of the bidirectional RNN, I'm just going to use a<t'>
to represent both of these concatenated together. So a<t'> is the feature vector for time step t'; to
keep the notation consistent, I'm going to use t' to index into the words in the French sentence. Next,
we have our forward-only, single-direction RNN with state s to generate the translation. At the first
time step, it should generate y<1>, and it will have as input some context C. If you want to index it
with time, you could write C<1>, but sometimes I'll just write C without the superscript. This context
depends on the attention weights, alpha<1,1>, alpha<1,2>, and so on, which tell us how much the
context depends on the features, or activations, we're getting from the different time steps of the
encoder. So the way we define the context is as a weighted sum of the features from the different
time steps, weighted by these attention weights. More formally, the attention weights satisfy the
property that they are all non-negative and they sum to one; we'll see later how to ensure this is
true. The context at time one (I'll often drop that superscript) is the sum over t', over all values of t',
of alpha<1,t'> times a<t'>, where alpha<1,t'> is the attention weight and a<t'> comes from the
encoder. In other words, alpha<t,t'> is the amount of attention that y<t> should pay to a<t'>: when
you're generating the t-th output word, how much you should be paying attention to the t'-th input
word. So that's one step of generating the output, and then at the next time step you generate the
second output, and it's again a weighted sum, where now you have a new set of attention weights
defining a new weighted sum that generates a new context. This context is also an input, and it
allows you to generate the second word. The context of the second time step is the sum over t' of
alpha<2,t'> times a<t'>. Using these context vectors, c<1>, c<2>, and so on, this network up here
looks like a pretty standard RNN sequence with the context vectors as inputs, and we can just
generate the translation one word at a time.
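In symbols, the context vector and the constraints on the attention weights described above are:

c^{<t>} = \sum_{t'=1}^{T_x} \alpha^{<t,t'>}\, a^{<t'>}, \qquad \alpha^{<t,t'>} \ge 0, \qquad \sum_{t'=1}^{T_x} \alpha^{<t,t'>} = 1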
We have also defined how to compute the context vectors in terms of these attention weights and the
features of the input sentence. So the only remaining thing to do is define how to actually compute
the attention weights. Just to recap, alpha<t,t'> is the amount of attention you should pay to a<t'>
when you're trying to generate the t-th word in the output translation. Let me write down the formula
and then talk about how it works. To compute alpha<t,t'>, you first compute terms e<t,t'> and then
essentially apply a softmax: alpha<t,t'> = exp(e<t,t'>) divided by the sum over t' of exp(e<t,t'>), so
that for every fixed value of t the weights sum to one when you sum over t'. Using this softmax
normalization just ensures they properly sum to one. Now, how do we compute the terms e? One way
is to use a small neural network, as follows. s<t-1> is the decoder's hidden state from the previous
time step. If you're trying to generate y<t>, then s<t-1> is the hidden state from the previous step,
which feeds into s<t>, and that's one input to a very small neural network, usually with just one
hidden layer, because you need to compute these a lot. The other input is a<t'>, the features from
time step t'. The intuition is that if you want to decide how much attention to pay to the activation at
t', the thing it should depend on most is your own hidden state activation from the previous time
step. You don't have the current state activation yet, because the context feeds into it, so you haven't
computed it. But by looking at whatever hidden state this RNN generating the output translation is in,
and, for each of the input positions, looking at their features, it seems pretty natural that alpha<t,t'>
and e<t,t'> should depend on these two quantities. We don't know what the function is, though, so
one thing you can do is just train a very small neural network to learn whatever this function should
be, and trust backpropagation and gradient descent to learn the right function. It turns out that if you
implement this whole model and train it with gradient descent, the whole thing actually works. This
little neural network does a pretty decent job of telling you how much attention y<t> should pay to
a<t'>, and the softmax formula makes sure the attention weights sum to one. Then, as you chug
along generating one word at a time, this neural network pays attention to the right parts of the input
sentence, and it learns all of this automatically using gradient descent. Now, one downside to this
algorithm is that it takes quadratic time, or quadratic cost, to run. If you have T_x words in the input
and T_y words in the output, then the total number of these attention parameters is T_x times T_y,
so this algorithm runs in quadratic cost.
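Here is a minimal numpy sketch of this computation for a single decoder step, under the assumption
that the small scoring network is a one-hidden-layer MLP; its weights Wa, Ua, va and the dimensions
used in the example are illustrative, not the paper's exact parameterization.

```python
import numpy as np

# Minimal single-step attention sketch (illustrative only).
# a:       (Tx, n_a)  encoder features a<t'> (forward + backward concatenated)
# s_prev:  (n_s,)     decoder hidden state s<t-1>
# Wa, Ua, va: weights of a small one-hidden-layer scoring network (assumed shapes)

def attention_step(a, s_prev, Wa, Ua, va):
    # e<t,t'> = va . tanh(Wa s<t-1> + Ua a<t'>), one score per input position
    e = np.array([va @ np.tanh(Wa @ s_prev + Ua @ a_tp) for a_tp in a])
    # softmax over t' so the weights are non-negative and sum to one
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()
    # context c<t> is the attention-weighted sum of the encoder features
    context = alpha @ a                      # shape (n_a,)
    return context, alpha

# Example with made-up sizes: Tx=5 French words, n_a=8 features, n_s=6 decoder units
rng = np.random.default_rng(0)
a = rng.normal(size=(5, 8))
s_prev = rng.normal(size=6)
Wa, Ua, va = rng.normal(size=(10, 6)), rng.normal(size=(10, 8)), rng.normal(size=10)
context, alpha = attention_step(a, s_prev, Wa, Ua, va)
print(alpha.sum())                           # 1.0 (up to floating point)
```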
In machine translation applications, though, where neither the input nor the output sentence is
usually that long, this quadratic cost may be acceptable, although there is also some research work
on trying to reduce it. So far, I've been describing the attention idea in the context of machine
translation, but without going too much into detail, this idea has been applied to other problems as
well, such as image captioning. In the image captioning problem, the task is to look at a picture and
write a caption for it. In the paper cited at the bottom, by Kelvin Xu, Jimmy Ba, Ryan Kiros,
Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio, the
authors showed that you can use a very similar architecture to look at a picture and pay attention
only to parts of the picture at a time while writing its caption. If you're interested, I encourage you to
take a look at that paper as well. You get to play with all of this and more in the programming
exercise. Whereas machine translation is a very complicated problem, in the exercise you get to
implement and play with the attention model yourself on the date normalization problem. The
problem is to input a date like this (this happens to be the date of the Apollo Moon landing) and
normalize it into a standard format, or to take a date like this (believed to be the birthday of William
Shakespeare) and have a sequence-to-sequence model normalize it into that format. What you'll see
in the exercise is that you can train a neural network to input dates in any of these formats and use
an attention model to generate a normalized format for them. One other thing that is sometimes fun
to do is to look at visualizations of the attention weights. Here's a machine translation example,
where we plot, in different colors, the magnitudes of the different attention weights. I don't want to
spend too much time on this, but you'll find that for corresponding input and output words the
attention weights tend to be high, suggesting that when the network is generating a specific output
word, it is usually paying attention to the correct words in the input, and all of this, including learning
where to pay attention, was learned using backpropagation with an attention model.
So, consider "the quick brown fox" represented as phonemes. I'm going to simplify a bit: let's say
"the" has a "de" and "e" sound, and "quick" has a "ku", "wu", "ik", "k" sound, and linguists used to
write out these basic units of sound and try to break the language down into them. Similarly for
"brown"; these aren't the official phonemes, which are written with more complicated notation, but
linguists used to hypothesize that writing down audio in terms of these basic units of sound, called
phonemes, would be the best way to do speech recognition. With end-to-end deep learning, however,
we're finding that phoneme representations are no longer necessary; instead, you can build systems
that input an audio clip and directly output a transcript without needing to use hand-engineered
representations like these. One of the things that made this possible was moving to much larger data
sets. Academic data sets for speech recognition might be as large as 300 hours, and in academia,
3,000-hour data sets of transcribed audio would be considered a reasonable size, so a lot of research
has been done and a lot of research papers have been written on data sets of several thousand
hours. But the best commercial systems are now trained on over 10,000 hours and sometimes over
100,000 hours of audio. It's really this move to much larger transcribed audio data sets, with both x
and y, together with deep learning algorithms, that has driven a lot of progress in speech recognition.
So, how do you build a speech recognition system? In the previous sections, we talked about the
attention model, so one thing you could do is actually use that, where on the horizontal axis you take
in different time frames of the audio input, and you have an attention model try to output the
transcript, like "the quick brown fox", or whatever was said. One other method that works well is to
use the CTC cost for speech recognition. CTC stands for Connectionist Temporal Classification and is
due to Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. So, here's the
idea. Let's say the audio clip was someone saying "the quick brown fox". We're going to use a neural
network structured as shown in the diagram below, with an equal number of input x's and output y's.
We have drawn a simple uni-directional RNN here, but in practice this will usually be a bidirectional
LSTM or bidirectional GRU, and usually a deeper model. Notice that the number of time steps here is
very large: in speech recognition, the number of input time steps is usually much bigger than the
number of output time steps. For example, if you have 10 seconds of audio and your features come
at 100 hertz, that is, 100 samples per second, then a 10-second audio clip ends up with a thousand
inputs (100 hertz times 10 seconds). But your output might not have a thousand characters. So, what
do you do? The CTC cost function allows the RNN to generate an output like "ttt", then a special
character called the blank character, which we're going to write as an underscore, then "h_eee___",
then maybe a space, written like this, and then "___qqq__". This is considered a correct output for
the first part of the sentence, "the q". The basic rule for the CTC cost function is to collapse repeated
characters not separated by "blank". To be clear, I'm using the underscore to denote a special blank
character, and that's different from the space character; there is a space between "the" and "quick",
so the network should output a space. By collapsing repeated characters not separated by blank, this
actually collapses the sequence into t, h, e, then space, then q, and this allows your network to have
a thousand outputs, by repeating characters a lot of the time and inserting a bunch of blank
characters, while still ending up with a much shorter output text transcript. The phrase "the quick
brown fox", including spaces, actually has 19 characters, and even though the neural network is
forced to produce a thousand outputs, by allowing it to insert blanks and repeated characters it can
still represent this 19-character transcript with its 1,000 output values of y.
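As a small illustration of the collapsing rule just described (collapse repeats not separated by blank,
then remove the blanks), here is a sketch; the underscore stands for the special blank character, as
in the text.

```python
# Sketch of the CTC output-collapsing rule (illustrative only):
# 1) collapse repeated characters not separated by the blank "_",
# 2) then remove the blanks.

def ctc_collapse(output, blank="_"):
    collapsed = []
    prev = None
    for ch in output:
        if ch != prev:                 # drop repeats of the previous symbol
            collapsed.append(ch)
        prev = ch
    return "".join(ch for ch in collapsed if ch != blank)

print(ctc_collapse("ttt_h_eee___ ___qqq__"))   # -> "the q"
```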
So, the paper by Alex Graves, as well as the Deep Speech recognition system, which I was involved
in, used this idea to build effective speech recognition systems. I hope that gives you a rough sense
of how speech recognition models work: attention-like models and CTC models present two different
options for how to go about building these systems. Today, building an effective production-scale
speech recognition system is a pretty significant effort and requires a very large data set. What I'd
like to do in the next section is show you how to build a trigger word detection system, or keyword
detection system, which is actually much easier and can be done with an even smaller, more
reasonable amount of data. So, let's talk about that in the next section.
Now that you've seen RNNs like this, what you would actually do is take an audio clip, maybe
compute spectrogram features, which generates audio features x<1>, x<2>, x<3>, and pass those
through an RNN. All that remains to be done is to define the target labels y. If some point in the
audio clip is when someone has just finished saying the trigger word, such as "Hey Siri" or "Okay
Google", then in the training set you can set the target label to be zero for everything before that
point and one right after it. If a little bit later the trigger word is said again, you can again set the
target label to be one right after that point. This type of labeling scheme for an RNN could work;
actually, it will work reasonably well. One slight disadvantage is that it creates a very imbalanced
training set, with a lot more zeros than ones. So one other thing you could do, and it's a bit of a
hack, but it makes the model a little easier to train, is instead of setting only a single time step's
output to one, you output a few ones, for several time steps or for a fixed period of time, before
reverting back to zero. That slightly evens out the ratio of ones to zeros: if this is when the trigger
word was said in the audio clip, then right after that you set the target label to one for a short
stretch, and likewise each time the trigger word is said again.
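Here is a minimal sketch of building such a label vector, assuming hypothetical inputs: the number of
output time steps, the time steps at which the trigger word ends, and how many following steps to
label as one.

```python
import numpy as np

# Sketch of the trigger-word labeling scheme (illustrative only).
# Ty:           number of output time steps of the RNN
# trigger_ends: time steps at which the trigger word has just finished
# ones_span:    how many steps after each trigger end to label with 1 (the "hack")

def make_labels(Ty, trigger_ends, ones_span=50):
    y = np.zeros(Ty, dtype=int)
    for t_end in trigger_ends:
        y[t_end + 1 : min(t_end + 1 + ones_span, Ty)] = 1   # ones right after each trigger
    return y

print(make_labels(Ty=1375, trigger_ends=[400, 900], ones_span=50).sum())  # 100
```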
I think you should feel quite proud of yourself. We've learned enough about deep learning that it
takes just one picture and one slide to describe something as complicated as trigger word detection,
and based on this I hope you'll be able to implement something that works and allows you to detect
trigger words, maybe even make it do something fun in your house, like turning something on or off
when you or someone else says the trigger word. This is the last technical section of this course. To
wrap up, in this course on sequence models you learned about RNNs, including both GRUs and
LSTMs; in the second week you learned a lot about word embeddings and how they learn
representations of words; and in this week you learned about the attention model as well as how to
use it to process audio data.
********************************************* END OF COURSE *********************************************