Understanding GRU Networks
by Simeon Kostadinov | Towards Data Science
To solve the vanishing gradient problem of a standard RNN, the GRU uses two gates: an update gate and a reset gate. Basically, these are two vectors which decide what information should be passed to the output. The special thing about them is that they can be trained to keep information from long ago, without washing it out through time, and to remove information which is irrelevant to the prediction.
If you are not familiar with the above terminology, I recommend watching these tutorials about the “sigmoid” and “tanh” functions and the “Hadamard product” operation.
When x_t is plugged into the network unit, it is multiplied by its own weight W(z). The same goes for h_(t-1), which holds the information for the previous t-1 units and is multiplied by its own weight U(z). Both results are added together and a sigmoid activation function is applied to squash the result between 0 and 1.
The update gate helps the model to determine how much of the past information (from previous time steps) needs to be passed along to the future. That is really powerful, because the model can decide to copy all the information from the past and eliminate the risk of the vanishing gradient problem. We will see the usage of the update gate later on. For now, remember the formula for z_t:

z_t = σ(W(z) x_t + U(z) h_(t-1))
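To make the computation concrete, here is a minimal NumPy sketch of the update gate. The sizes and the variable names (W_z, U_z, x_t, h_prev) are illustrative assumptions, not something taken from the article:

```python
import numpy as np

def sigmoid(x):
    # Squash every element into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sizes: 10-dimensional input, 20-dimensional hidden state
input_size, hidden_size = 10, 20
W_z = np.random.randn(hidden_size, input_size) * 0.1   # weight for x_t
U_z = np.random.randn(hidden_size, hidden_size) * 0.1  # weight for h_(t-1)

x_t = np.random.randn(input_size)   # current input
h_prev = np.zeros(hidden_size)      # h_(t-1), the previous hidden state

# Update gate: z_t = sigmoid(W(z) x_t + U(z) h_(t-1))
z_t = sigmoid(W_z @ x_t + U_z @ h_prev)
```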
Essentially, the reset gate is used by the model to decide how much of the past information to forget. It is calculated as r_t = σ(W(r) x_t + U(r) h_(t-1)). This formula is the same as the one for the update gate; the difference comes in the weights and in the gate’s usage, which we will see in a bit. The schema below shows where the reset gate sits:
As before, we plug in h_(t-1) (blue line) and x_t (purple line), multiply them by their corresponding weights, sum the results and apply the sigmoid function.
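Continuing the same sketch, the reset gate only needs its own (again, hypothetical) weight matrices:

```python
# Reset gate: r_t = sigmoid(W(r) x_t + U(r) h_(t-1))
W_r = np.random.randn(hidden_size, input_size) * 0.1
U_r = np.random.randn(hidden_size, hidden_size) * 0.1
r_t = sigmoid(W_r @ x_t + U_r @ h_prev)
```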
The new memory content h’_t uses the reset gate to store the relevant information from the past. It is calculated as follows:
1. Multiply the input x_t with a weight W and h_(t-1) with a weight U.
2. Calculate the Hadamard (element-wise) product between the reset gate r_t and U h_(t-1). This determines what to remove from the previous time steps.
3. Sum up the results of steps 1 and 2.
4. Apply the nonlinear activation function tanh.
The result is the current memory content: h’_t = tanh(W x_t + r_t ⊙ U h_(t-1)).
Consider, for example, a book review where only the last sentences are relevant to the overall sentiment. As the neural network approaches the end of the text, it will learn to assign an r_t vector close to 0, washing out the past and focusing only on the last sentences.
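Continuing the sketch, the current memory content can be written as follows; W_h and U_h are illustrative names for the candidate-memory weights:

```python
# Current memory content: h'_t = tanh(W x_t + r_t ⊙ U h_(t-1))
W_h = np.random.randn(hidden_size, input_size) * 0.1
U_h = np.random.randn(hidden_size, hidden_size) * 0.1

# The Hadamard product r_t * (U_h @ h_prev) decides how much of the
# past to keep when forming the candidate memory.
h_candidate = np.tanh(W_h @ x_t + r_t * (U_h @ h_prev))
```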
Let’s bring up the example about the book review. This time, the most relevant information is positioned at the beginning of the text. The model can learn to set the vector z_t close to 1 and keep the majority of the previous information. Since z_t will be close to 1 at this time step, 1-z_t will be close to 0, which will ignore a big portion of the current content (in this case the last part of the review, which explains the book plot), since it is irrelevant for our prediction.
Following through, you can see how z_t (green line) is used to calculate 1-z_t which, combined with h’_t (bright green line), produces the result shown by the dark red line. z_t is also used with h_(t-1) (blue line) in an element-wise multiplication. Finally, h_t (blue line) is the result of summing the outputs corresponding to the bright and dark red lines. In other words, h_t = z_t ⊙ h_(t-1) + (1-z_t) ⊙ h’_t, where ⊙ denotes the Hadamard product.
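In the sketch, the final memory step is then a single line (following the article’s convention, where z_t close to 1 keeps the past):

```python
# Final memory: h_t = z_t ⊙ h_(t-1) + (1 - z_t) ⊙ h'_t
h_t = z_t * h_prev + (1.0 - z_t) * h_candidate
```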
Now you can see how GRUs are able to store and filter information using their update and reset gates. That eliminates the vanishing gradient problem, since the model does not wash out the new input every single time but keeps the relevant information and passes it down to the next time steps of the network. If carefully trained, GRUs can perform extremely well even in complex scenarios.
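In practice you rarely implement the cell by hand; frameworks such as PyTorch ship a ready-made GRU layer. A minimal usage sketch, with made-up sizes and random data just for illustration:

```python
import torch
import torch.nn as nn

# A single-layer GRU: 10-dimensional inputs, 20-dimensional hidden state
gru = nn.GRU(input_size=10, hidden_size=20, batch_first=True)

# A toy batch of 4 sequences, each 15 time steps long
x = torch.randn(4, 15, 10)

# output holds the hidden state at every time step; h_n is the final one
output, h_n = gru(x)
print(output.shape)  # torch.Size([4, 15, 20])
print(h_n.shape)     # torch.Size([1, 4, 20])
```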
I hope the article leaves you with a better understanding of this state-of-the-art deep learning model called the GRU.
Thank you for reading. If you enjoyed the article, give it some claps.
Hope you have a great day!
Obsessed with creating a positive impact. Love blogging about AI and reading books. For more content, follow me at https://www.linkedin.com/in/simeonkostadinov/