L2 Neural Network Basics
THUNLP
1
Content
• Neural Network Components
• Simple Neuron; Multilayer; Feedforward; Non-linear; …
• How to Train
• Objective; Gradients; Backpropagation
• Word Representation: Word2Vec
• Common Neural Networks
• RNN
• Sequential Memory; Language Model
• Gradient Problem for RNN
• Variants: GRU; LSTM; Bidirectional
• CNN
• NLP Pipeline Tutorial (PyTorch)
2
Neural Network Components
Shi Yu
THUNLP
3
Neural Network
• (Artificial) Neural Network
Source: Wikipedia
4
(Artificial) Neuron
• A neuron is a computational unit with 𝑛 inputs and 1 output
and parameters 𝒘, 𝑏
$h_{\boldsymbol{w},b}(\boldsymbol{x}) = f(\boldsymbol{w}^\top \boldsymbol{x} + b)$
[Figure: a single neuron with inputs $x_1, x_2, x_3$, a bias unit $+1$, weights $\boldsymbol{w}$ and bias $b$, and output $h_{\boldsymbol{w},b}(\boldsymbol{x})$]
6
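As a concrete illustration (not from the slides), here is a minimal sketch of this neuron in PyTorch; the input size 3 and the sigmoid choice of $f$ are assumptions for the example:

```python
import torch

# A single neuron: n = 3 inputs, parameters w (weights) and b (bias)
w = torch.randn(3)        # weight vector w
b = torch.randn(1)        # bias b
f = torch.sigmoid         # non-linearity f (assumed; could be tanh, ReLU, ...)

x = torch.tensor([1.0, 2.0, 3.0])   # example input
h = f(w @ x + b)                     # h_{w,b}(x) = f(w^T x + b)
print(h)                             # a single scalar output
```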
Matrix Notation
• A single layer neural network: Hooking together many
simple neurons
$a_1 = f(W_{11} x_1 + W_{12} x_2 + W_{13} x_3 + b_1)$
In matrix notation: $\boldsymbol{a} = f(\boldsymbol{W}\boldsymbol{x} + \boldsymbol{b})$
[Figure: inputs $x_1, x_2, x_3$ and a bias unit $+1$ connected to neurons $a_1, a_2, a_3$ by weights $W_{ij}$]
7
Multilayer Neural Network
• Stacking multiple layers of neural networks
[Figure: inputs $x_1, x_2, x_3$ with bias units $+1$ feed two stacked layers of neurons, producing the output $h_{W,b}(x)$; values flow forward through the network (feedforward computation)]
8
Feedforward Computation
[Figure: input layer $x_1, x_2, x_3$ (plus bias units $+1$), multiple hidden layers, and the output $h_{W,b}(x)$]

$\boldsymbol{h}_1 = f(\boldsymbol{W}_1 \boldsymbol{x} + \boldsymbol{b}_1)$
$\boldsymbol{h}_2 = f(\boldsymbol{W}_2 \boldsymbol{h}_1 + \boldsymbol{b}_2)$
$\boldsymbol{h}_3 = f(\boldsymbol{W}_3 \boldsymbol{h}_2 + \boldsymbol{b}_3)$
9
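A minimal sketch of this stacked feedforward computation in PyTorch; the layer sizes (3, 4, 4, 4) and the tanh non-linearity are assumptions for illustration:

```python
import torch
import torch.nn as nn

# Three stacked layers: h1 = f(W1 x + b1), h2 = f(W2 h1 + b2), h3 = f(W3 h2 + b3)
W1, W2, W3 = nn.Linear(3, 4), nn.Linear(4, 4), nn.Linear(4, 4)
f = torch.tanh   # non-linearity (assumed)

x = torch.randn(3)
h1 = f(W1(x))
h2 = f(W2(h1))
h3 = f(W3(h2))
print(h3.shape)  # torch.Size([4])
```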
Why use non-linearities (f)?
• Without non-linearities, deep neural networks cannot do
anything more than a linear transform
• Extra layers could just be compiled down into a single linear
transform
$\boldsymbol{h}_1 = \boldsymbol{W}_1 \boldsymbol{x} + \boldsymbol{b}_1,\quad \boldsymbol{h}_2 = \boldsymbol{W}_2 \boldsymbol{h}_1 + \boldsymbol{b}_2 \;\Rightarrow\; \boldsymbol{h}_2 = \boldsymbol{W}_2 \boldsymbol{W}_1 \boldsymbol{x} + \boldsymbol{W}_2 \boldsymbol{b}_1 + \boldsymbol{b}_2$
10
Choices of non-linearities
• Sigmoid
  $f(z) = \dfrac{1}{1 + e^{-z}}$
• Tanh
  $f(z) = \tanh(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
• ReLU
  $f(z) = \max(z, 0)$
• …
11
Output Layer
[Figure: the same feedforward network (input layer $x_1, x_2, x_3$, bias units $+1$, multiple hidden layers), now with an output layer appended after the hidden layers]
12
Output Layer
• Linear output
  • $y = \boldsymbol{w}^\top \boldsymbol{h} + b$
• Sigmoid
  • $y = \sigma(\boldsymbol{w}^\top \boldsymbol{h} + b)$
  • For binary classification
    • $y$ for one class
    • $1 - y$ for the other
[Figure: output layer mapping the hidden representation $\boldsymbol{h}$ (plus bias $+1$) to a single output $y$]
13
Output Layer
Output layer
• Softmax
  • $y_i = \mathrm{softmax}(\boldsymbol{z})_i = \dfrac{\exp(z_i)}{\sum_j \exp(z_j)}$
  • $\boldsymbol{z} = \boldsymbol{W}\boldsymbol{h} + \boldsymbol{b}$
  • For multi-class classification
[Figure: output layer mapping the hidden representation $\boldsymbol{h}$ (plus bias $+1$) to an output distribution $\boldsymbol{y}$]
14
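A minimal sketch of a softmax output layer in PyTorch; the hidden size 4 and the 3 output classes are assumptions for illustration:

```python
import torch
import torch.nn as nn

hidden_size, num_classes = 4, 3          # assumed sizes
W = nn.Linear(hidden_size, num_classes)  # z = W h + b

h = torch.randn(hidden_size)             # hidden representation h
z = W(h)                                 # logits z
y = torch.softmax(z, dim=-1)             # y_i = exp(z_i) / sum_j exp(z_j)
print(y, y.sum())                        # a probability distribution over 3 classes
```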
Summary
• Simple neuron
• Single layer neural network
• Multilayer neural network
• Stack multiple layers of neural networks
• Non-linear activation functions
• Enable neural nets to represent more complicated
features
• Output layer
• For desired output
15
How to Train a Neural
Network
Shi Yu
THUNLP
16
Training Objective
• Mean Squared Error
• Given $N$ training examples $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ and $y_i$ are the attributes and the price of a computer, we want to train a neural network $F_\theta(\cdot)$ which takes the attributes $x$ as input and predicts the price $y$. A reasonable training objective is the Mean Squared Error:

$\min_\theta J(\theta) = \min_\theta \dfrac{1}{N}\sum_{i=1}^{N} \left(y_i - F_\theta(x_i)\right)^2$
17
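A minimal sketch of this objective in PyTorch; the toy predictions and target prices are made up for illustration:

```python
import torch
import torch.nn.functional as F

# Toy predictions F_theta(x_i) and ground-truth prices y_i (made-up numbers)
pred   = torch.tensor([1200.0, 800.0, 650.0])
target = torch.tensor([1000.0, 900.0, 700.0])

# J(theta) = (1/N) * sum_i (y_i - F_theta(x_i))^2
loss = F.mse_loss(pred, target)
print(loss)  # equivalent to ((target - pred) ** 2).mean()
```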
Training Objective
• Cross-entropy
• Given $N$ training examples $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ and $y_i$ are a sentence and its sentiment label, we want to train a neural network $F_\theta(\cdot)$ which takes the sentence $x$ as input and predicts its sentiment $y$. A reasonable training objective is the Cross-entropy:

$\min_\theta J(\theta) = \min_\theta -\dfrac{1}{N}\sum_{i=1}^{N} \log P_{\text{model}}\left(F_\theta(x_i) = y_i\right)$
18
Training Objective
• Cross-entropy

$\min_\theta J(\theta) = \min_\theta -\dfrac{1}{N}\sum_{i=1}^{N} \log P_{\text{model}}\left(F_\theta(x_i) = y_i\right)$

[Figure: output distribution over three classes: 0.6, 0.3, 0.1]

If the ground truth is $y = 1$ (first class), then the loss for this instance is
$-\log P_{\text{model}}(F_\theta(x) = 1) = -\log 0.6 = 0.74$.
If $y = 2$:
$-\log P_{\text{model}}(F_\theta(x) = 2) = -\log 0.3 = 1.74$.
If $y = 3$:
$-\log P_{\text{model}}(F_\theta(x) = 3) = -\log 0.1 = 3.32$.
(The logarithms here are base 2.)
19
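A minimal sketch reproducing the per-instance losses above in PyTorch; the distribution [0.6, 0.3, 0.1] comes from the slide, and note that PyTorch's built-in losses use natural logs while the slide's numbers use base-2 logs:

```python
import torch

probs = torch.tensor([0.6, 0.3, 0.1])   # output distribution from the slide

# Per-instance cross-entropy loss -log P(F(x) = y) for each possible label y
for y in range(3):
    loss_nat  = -torch.log(probs[y])    # natural log (what F.cross_entropy uses)
    loss_bits = -torch.log2(probs[y])   # base-2 log (matches 0.74 / 1.74 / 3.32)
    print(f"y={y + 1}: {loss_nat.item():.2f} nats, {loss_bits.item():.2f} bits")
```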
Stochastic Gradient Descent
• Update rule: $\theta \leftarrow \theta - \alpha \nabla_\theta J(\theta)$, where $\alpha$ is the learning rate
20
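A minimal sketch of one SGD update in PyTorch; the tiny linear model, batch, and learning rate are assumptions for illustration:

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)                              # a tiny model with parameters theta
opt = torch.optim.SGD(model.parameters(), lr=0.01)   # alpha = 0.01

x, y = torch.randn(8, 3), torch.randn(8, 1)          # one mini-batch (made up)
loss = ((model(x) - y) ** 2).mean()                  # J(theta) on this batch

opt.zero_grad()
loss.backward()   # compute gradients via backpropagation
opt.step()        # theta <- theta - alpha * grad J(theta)
```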
Gradients
• Given a function with 1 output and n inputs:
$F(\boldsymbol{x}) = F(x_1, x_2, \dots, x_n)$
• Its gradient is the vector of partial derivatives with respect to each input:
$\nabla_{\boldsymbol{x}} F = \left(\dfrac{\partial F}{\partial x_1}, \dfrac{\partial F}{\partial x_2}, \dots, \dfrac{\partial F}{\partial x_n}\right)$
21
Jacobian Matrix: Generalization of the Gradient
• For a function $\boldsymbol{F}(\boldsymbol{x})$ with $m$ outputs and $n$ inputs, the Jacobian is the $m \times n$ matrix of partial derivatives, $\left(\dfrac{\partial \boldsymbol{F}}{\partial \boldsymbol{x}}\right)_{ij} = \dfrac{\partial F_i}{\partial x_j}$
22
Chain Rule for Jacobians
• For one-variable functions: multiply derivatives
$z = 3y$
$y = x^2$
$\dfrac{dz}{dx} = \dfrac{dz}{dy}\,\dfrac{dy}{dx} = 3 \times 2x = 6x$
24
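As a sanity check (not from the slides), PyTorch's autograd applies exactly this chain rule; a minimal sketch verifying $dz/dx = 6x$ at $x = 2$:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2       # y = x^2
z = 3 * y        # z = 3y

z.backward()     # dz/dx = dz/dy * dy/dx = 3 * 2x
print(x.grad)    # tensor(12.) = 6 * x at x = 2
```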
Back to Neural Network
• Given $s = \boldsymbol{u}^\top \boldsymbol{h}$, $\boldsymbol{h} = f(\boldsymbol{z})$, $\boldsymbol{z} = \boldsymbol{W}\boldsymbol{x} + \boldsymbol{b}$, what is $\dfrac{\partial s}{\partial \boldsymbol{b}}$?
• Apply the chain rule:
$\dfrac{\partial s}{\partial \boldsymbol{b}} = \dfrac{\partial s}{\partial \boldsymbol{h}}\,\dfrac{\partial \boldsymbol{h}}{\partial \boldsymbol{z}}\,\dfrac{\partial \boldsymbol{z}}{\partial \boldsymbol{b}} = \boldsymbol{u}^\top \operatorname{diag}\left(f'(\boldsymbol{z})\right) \boldsymbol{I}$
25
Backpropagation
• Compute gradients algorithmically
26
Computational Graphs
• Representing our neural net equations as a graph
  • Source nodes: inputs
  • Interior nodes: operations
  • Edges pass along the result of the operation

$s = \boldsymbol{u}^\top \boldsymbol{h}$
$\boldsymbol{h} = f(\boldsymbol{z})$
$\boldsymbol{z} = \boldsymbol{W}\boldsymbol{x} + \boldsymbol{b}$
$\boldsymbol{x}$: input

[Figure: the computational graph: the input $\boldsymbol{x}$ and parameter $\boldsymbol{W}$ feed a product node giving $\boldsymbol{W}\boldsymbol{x}$; adding $\boldsymbol{b}$ gives $\boldsymbol{z}$; applying $f$ gives $\boldsymbol{h}$; the dot product with $\boldsymbol{u}$ gives $s$ ("Forward Propagation")]
27
Backpropagation
• Go backwards along edges
  • Pass along gradients

$s = \boldsymbol{u}^\top \boldsymbol{h}$
$\boldsymbol{h} = f(\boldsymbol{z})$
$\boldsymbol{z} = \boldsymbol{W}\boldsymbol{x} + \boldsymbol{b}$
$\boldsymbol{x}$: input

[Figure: the same computational graph traversed in reverse, passing $\dfrac{\partial s}{\partial \boldsymbol{h}}$, $\dfrac{\partial s}{\partial \boldsymbol{z}}$, and $\dfrac{\partial s}{\partial \boldsymbol{b}}$ back toward the parameters $\boldsymbol{W}$, $\boldsymbol{b}$, $\boldsymbol{u}$]
28
Backpropagation: Single Node
• Node receives an “upstream gradient”
• Goal is to pass on the correct “downstream
gradient”
$\boldsymbol{h} = f(\boldsymbol{z})$

[Figure: the node $f$ maps $\boldsymbol{z}$ to $\boldsymbol{h}$; the downstream gradient $\dfrac{\partial s}{\partial \boldsymbol{z}}$ must be computed from the upstream gradient $\dfrac{\partial s}{\partial \boldsymbol{h}}$]
29
Backpropagation: Single Node
• Each node has a local gradient
• The gradient of its output with respect to its input
$\boldsymbol{h} = f(\boldsymbol{z})$

[Figure: the node's local gradient is $\dfrac{\partial \boldsymbol{h}}{\partial \boldsymbol{z}}$; the downstream gradient $\dfrac{\partial s}{\partial \boldsymbol{z}}$ is obtained from the upstream gradient $\dfrac{\partial s}{\partial \boldsymbol{h}}$ and this local gradient]
30
Backpropagation: Single Node
• Each node has a local gradient
• The gradient of its output with respect to its input
• [downstream gradient] = [upstream gradient] x
[local gradient]
$\boldsymbol{h} = f(\boldsymbol{z})$

Chain Rule: $\dfrac{\partial s}{\partial \boldsymbol{z}} = \dfrac{\partial s}{\partial \boldsymbol{h}} \times \dfrac{\partial \boldsymbol{h}}{\partial \boldsymbol{z}}$ (downstream = upstream × local)
31
An Example
$f(x, y, z) = (x + y)\,\max(y, z)$
$x = 1,\; y = 2,\; z = 0$

Forward prop steps:
$a = x + y = 3$
$b = \max(y, z) = 2$
$f = ab = 6$

Local gradients:
$\dfrac{\partial a}{\partial x} = 1,\quad \dfrac{\partial a}{\partial y} = 1$
$\dfrac{\partial b}{\partial y} = \mathbf{1}[y > z] = 1,\quad \dfrac{\partial b}{\partial z} = \mathbf{1}[z > y] = 0$
$\dfrac{\partial f}{\partial a} = b = 2,\quad \dfrac{\partial f}{\partial b} = a = 3$

Backward pass (upstream × local = downstream), starting from $\dfrac{\partial f}{\partial f} = 1$:
• At the $*$ node: $\dfrac{\partial f}{\partial a} = 1 \times 2 = 2$ and $\dfrac{\partial f}{\partial b} = 1 \times 3 = 3$
• At the $\max$ node: the contribution to $y$ is $3 \times 1 = 3$ and $\dfrac{\partial f}{\partial z} = 3 \times 0 = 0$
• At the $+$ node: $\dfrac{\partial f}{\partial x} = 2 \times 1 = 2$ and the contribution to $y$ is $2 \times 1 = 2$

Final gradients (summing the two paths into $y$):
$\dfrac{\partial f}{\partial x} = 2,\quad \dfrac{\partial f}{\partial y} = 3 + 2 = 5,\quad \dfrac{\partial f}{\partial z} = 0$

[Figure: computational graph with nodes $+$ (giving $a = 3$), $\max$ (giving $b = 2$), and $*$ (giving $f = 6$), annotated step by step with these upstream, local, and downstream gradients]
32-37
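As a sanity check (not from the slides), the same gradients can be obtained with PyTorch's autograd:

```python
import torch

x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
z = torch.tensor(0.0, requires_grad=True)

f = (x + y) * torch.max(y, z)   # f(x, y, z) = (x + y) * max(y, z) = 6
f.backward()

print(x.grad, y.grad, z.grad)   # tensor(2.), tensor(5.), tensor(0.)
```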
Summary
• Forward pass: compute the results of each operation and save the intermediate values
• Backward pass: apply the chain rule, going backwards through the graph, to compute the gradients (reusing the saved values)
38
Word Representation:
Word2Vec
Shi Yu
THUNLP
39
Word2Vec
• Word2vec uses shallow neural networks that
associate words to distributed representations
• It can capture many linguistic regularities, such as the well-known analogy vec("king") − vec("man") + vec("woman") ≈ vec("queen")
40
Typical Models
• Word2vec can utilize two architectures to produce
distributed representations of words:
• Continuous bag-of-words (CBOW)
• Continuous skip-gram
CBOW Skip-Gram
41
Sliding Window
• Word2vec uses a sliding window of a fixed size moving along
a sentence
• In each window, the middle word is the target word and the other words are the context words
  • Given the context words, CBOW predicts the probability of the target word
  • Given the target word, skip-gram predicts the probabilities of the context words
42
An Example of the Sliding Window
43
Continuous Bag-of-Words
• In CBOW architecture, the model predicts the target word
given a window of surrounding context words
• According to the bag-of-words assumption, the order of context words does not influence the prediction
• Suppose the window size is 5
  • Never too late to learn
  • P(late | [never, too, to, learn]), …
44
Continuous Bag-of-Words
• Never too late to learn

[Figure: the one-hot vectors of the context words never, too, to, learn are mapped through the embedding matrix, averaged, combined with the output matrix via a dot product, and passed through a softmax, producing a probability distribution over the vocabulary that should assign high probability to the target word late]
45
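A minimal CBOW sketch in PyTorch, assuming a toy vocabulary and embedding size; this illustrates the idea rather than reproducing the original word2vec implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab = ["never", "too", "late", "to", "learn"]       # toy vocabulary (assumed)
word2id = {w: i for i, w in enumerate(vocab)}
V, d = len(vocab), 16                                 # vocab size, embedding dim (assumed)

emb_in  = nn.Embedding(V, d)            # input (context) embeddings
emb_out = nn.Linear(d, V, bias=False)   # output embeddings, one row per word

# CBOW: average the context embeddings, then predict the target word
context = torch.tensor([word2id[w] for w in ["never", "too", "to", "learn"]])
target  = torch.tensor([word2id["late"]])

h = emb_in(context).mean(dim=0, keepdim=True)   # average of context vectors
logits = emb_out(h)                             # scores over the vocabulary
loss = F.cross_entropy(logits, target)          # -log P(late | context)
```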
Continuous Skip-Gram
• In skip-gram architecture, the model predicts the context
words from the target word
• Suppose the window size is 5
• Never too late to learn
• P([too, late] | Never), P([Never, late, to] | too), …
• Skip-gram predicts one context word at each step, so the training samples are:
  • P(too | Never), P(late | Never), P(Never | too), P(late | too), P(to | too), …
46
Continuous Skip-Gram
• Never too late to learn

[Figure: the one-hot vector of the target word too is mapped to its embedding, combined with the output matrix via a dot product, and passed through a softmax, producing a probability distribution over the vocabulary, e.g. P(Never | too)]
47
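A minimal sketch (an illustration, not the original implementation) of generating skip-gram training pairs with a sliding window of size 5, i.e. two context words on each side:

```python
# Generate (target, context) skip-gram pairs with a sliding window.
sentence = ["never", "too", "late", "to", "learn"]
window = 2   # two context words on each side -> window size 5

pairs = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

print(pairs[:4])  # [('never', 'too'), ('never', 'late'), ('too', 'never'), ('too', 'late')]
```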
Problems of Full Softmax
• When the vocabulary size is very large
• Computing the softmax over all words at every step involves a huge number of model parameters, which is computationally impractical
• We need to improve the computational efficiency
48
Improving Computational Efficiency
• In fact, we do not need a full probabilistic model in
word2vec
• There are two main improvement methods for word2vec:
• Negative sampling
• Hierarchical softmax
49
Negative Sampling
• As discussed before, the vocabulary is very large, which means our model has a tremendous number of weights that need to be updated at every step
• The idea of negative sampling is to update only a small percentage of the weights at each step
50
Negative Sampling
• Since we have the vocabulary and know the context words, we can sample a few words that are not in the context word list, with probability:

$P(w_i) = \dfrac{f(w_i)^{3/4}}{\sum_{j=1}^{V} f(w_j)^{3/4}}$

$f(w_i)$ is the frequency of $w_i$. Compared to $\dfrac{f(w_i)}{\sum_{j=1}^{V} f(w_j)}$, the $3/4$ power increases the probability of sampling low-frequency words.
51
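A minimal sketch of this sampling distribution, assuming some made-up word frequencies:

```python
import torch

# Made-up unigram counts for a toy vocabulary
freq = torch.tensor([100.0, 10.0, 1.0])   # f(w_i)
probs = freq ** 0.75                       # f(w_i)^(3/4)
probs = probs / probs.sum()                # P(w_i)

print(probs)   # low-frequency words get a boost compared to plain f / sum(f)
negatives = torch.multinomial(probs, num_samples=4, replacement=True)  # sample 4 negative words
```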
Negative Sampling
• Here is one training step of the skip-gram model, which computes probabilities for all words at the output layer (full softmax)

[Figure: the one-hot vector of too is mapped to its embedding and projected to a score for every word in the vocabulary (dim = V), followed by a softmax, e.g. P(Never | too)]
52
Negative Sampling
• Suppose we only sample 4 negative words:
[Figure: the same step with negative sampling: the output layer only scores the positive word plus the 4 sampled negative words (dim = 5), followed by a softmax, e.g. P(Never | too)]
53
Negative Sampling
• Then we can compute the loss and optimize the weights (not all of the weights) at each step
• Suppose we have a weight matrix of size 300 × 10,000 and the output size is 5
• We then only need to update 300 × 5 weights, which is only 0.05% of all the weights
54
Other Tips for Learning Word Embeddings
55
Other Tips for Learning Word Embeddings
56
Recurrent Neural Networks
(RNNs)
Chaoqun He
THUNLP
57
Sequential Memory
• Key concept for RNNs: Sequential memory during
processing sequence data
• Sequential memory in humans:
• Say the alphabet in your head
• Pretty easy
58
Sequential Memory
• Key concept for RNNs: Sequential memory during
processing sequence data
• Sequential memory in humans:
• Say the alphabet backward
• Much harder
59
Sequential Memory
• Definition: a mechanism that makes it easier for
your brain to recognize sequence patterns
60
Recurrent Neural Networks
[Figure: an unrolled RNN with hidden states $h_0, h_1, h_2, h_3$ passed from step to step]
61
Recurrent Neural Networks
• RNN Cell
$h_i = \tanh(W_x x_i + W_h h_{i-1} + b)$
$y_i = F(h_i)$

[Figure: an RNN cell takes the input $x_i$ and the previous hidden state $h_{i-1}$, and produces the new hidden state $h_i$ and the output $y_i$]
62
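A minimal sketch of this cell in PyTorch, with assumed input and hidden sizes:

```python
import torch
import torch.nn as nn

d_in, d_h = 8, 16                       # input and hidden sizes (assumed)
W_x = nn.Linear(d_in, d_h, bias=True)   # W_x x_i + b
W_h = nn.Linear(d_h, d_h, bias=False)   # W_h h_{i-1}

def rnn_cell(x_i, h_prev):
    # h_i = tanh(W_x x_i + W_h h_{i-1} + b)
    return torch.tanh(W_x(x_i) + W_h(h_prev))

h = torch.zeros(d_h)                    # h_0
for x_i in torch.randn(5, d_in):        # a sequence of 5 input vectors
    h = rnn_cell(x_i, h)
```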
RNN Language Model
$h_i = \tanh(W_x x_i + W_h h_{i-1} + b_1)$   (hidden states)
$x_i = E\,w_i$   (word embeddings)
$w_i \in \mathbb{R}^{|V|}$   (one-hot vectors)
$\hat{y}_t = \mathrm{softmax}(U h_t + b_2) \in \mathbb{R}^{|V|}$   (output distribution)

[Figure: the sentence "never too late to" is fed in word by word: each one-hot vector $w_i$ is mapped to an embedding $x_i = E w_i$, the hidden state is updated as $h_i = \tanh(W_x x_i + W_h h_{i-1} + b_1)$ starting from $h_0$, and at the last step the output distribution $\hat{y}_4 = \mathrm{softmax}(U h_4 + b_2)$ scores candidate next words such as "read", "code", "a", "zoo"]
63-66
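A minimal sketch of this RNN language model's forward pass in PyTorch, with assumed vocabulary size and dimensions:

```python
import torch
import torch.nn as nn

V, d, d_h = 10000, 128, 256        # vocab size, embedding dim, hidden dim (assumed)
E = nn.Embedding(V, d)             # x_i = E w_i
rnn = nn.RNN(d, d_h, batch_first=True, nonlinearity='tanh')
U = nn.Linear(d_h, V)              # y_t = softmax(U h_t + b_2)

w = torch.randint(0, V, (1, 4))    # token ids for "never too late to" (made-up ids)
x = E(w)                           # (1, 4, d)
h, _ = rnn(x)                      # hidden states h_1 ... h_4
y = torch.softmax(U(h[:, -1]), dim=-1)   # distribution over the next word
```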
Application Scenarios
• Sequence Labeling
• Given a sentence, the lexical properties of each word are
required
• Sequence Prediction
• Given the temperature for seven days a week, predict
the weather conditions for each day
• Photograph Description
• Given a photograph, create a sentence that describes
the photograph
• Text Classification
• Given a sentence, distinguish whether the sentence has
a positive or negative emotion
67
Recurrent Neural Networks
• Advantages:
• Can process any length input
• Model size does not increase for longer input
• Weights are shared across timesteps
• Computation for step 𝑖 can (in theory) use information
from many steps back
• Disadvantages:
• Recurrent computation is slow
• In practice, it’s difficult to access information from many
steps back
68
Gradient Problem for RNN
• Backpropagation through time multiplies many Jacobians of the recurrent step, so gradients tend to vanish (or explode) over long sequences, making long-range dependencies hard to learn
69
RNN Variants
Chaoqun He
THUNLP
70
Solution for Better RNNs
• Better Units!
• The main solution to the Vanishing Gradient
Problem is to use a more complex hidden unit
computation in recurrence
• GRU
• LSTM
• Main ideas:
• Keep around memories to capture long distance
dependencies
71
Gated Recurrent Unit (GRU)
Chaoqun He
THUNLP
72
Gated Recurrent Unit (GRU)
• Vanilla RNN computes hidden layer at next time
step directly:
$h_i = \tanh(W_x x_i + W_h h_{i-1} + b)$
• Introduce a gating mechanism into the RNN
  • Update gate
    $z_i = \sigma(W_x^{(z)} x_i + W_h^{(z)} h_{i-1} + b^{(z)})$
  • Reset gate
    $r_i = \sigma(W_x^{(r)} x_i + W_h^{(r)} h_{i-1} + b^{(r)})$
• Gates are used to balance the influence of the past state and the current input
73
Gated Recurrent Unit (GRU)
• Update gate
  $z_i = \sigma(W_x^{(z)} x_i + W_h^{(z)} h_{i-1} + b^{(z)})$
• Reset gate
  $r_i = \sigma(W_x^{(r)} x_i + W_h^{(r)} h_{i-1} + b^{(r)})$
• New activation $\tilde{h}_i$
  $\tilde{h}_i = \tanh(W_x x_i + r_i * W_h h_{i-1} + b)$
• Final hidden state $h_i$
  $h_i = z_i * h_{i-1} + (1 - z_i) * \tilde{h}_i$
• Where $*$ refers to the element-wise product
74
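A minimal sketch of these GRU equations in PyTorch (PyTorch also provides nn.GRUCell); the sizes are assumptions, and the gate's input and hidden weight matrices are merged here by concatenating $x_i$ and $h_{i-1}$, which is equivalent to the separate $W_x^{(\cdot)}$, $W_h^{(\cdot)}$ form:

```python
import torch
import torch.nn as nn

d_in, d_h = 8, 16                    # input and hidden sizes (assumed)
Wz = nn.Linear(d_in + d_h, d_h)      # update gate parameters
Wr = nn.Linear(d_in + d_h, d_h)      # reset gate parameters
Wx = nn.Linear(d_in, d_h)
Wh = nn.Linear(d_h, d_h, bias=False)

def gru_cell(x, h_prev):
    xh = torch.cat([x, h_prev], dim=-1)
    z = torch.sigmoid(Wz(xh))                     # update gate z_i
    r = torch.sigmoid(Wr(xh))                     # reset gate r_i
    h_tilde = torch.tanh(Wx(x) + r * Wh(h_prev))  # new activation
    return z * h_prev + (1 - z) * h_tilde         # final hidden state h_i

h = torch.zeros(d_h)
for x in torch.randn(5, d_in):
    h = gru_cell(x, h)
```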
Gated Recurrent Unit (GRU)
[Figure: a numerical walk-through of one GRU step: given example values of $x_i$ and $h_{i-1}$, the preparation stage computes $z_i$, $r_i$, and $\tilde{h}_i$ with $\sigma$ and $\tanh$, and the update stage combines $z_i * h_{i-1}$ and $(1 - z_i) * \tilde{h}_i$ into $h_i$]
75
Gated Recurrent Unit (GRU)
• If the reset gate $r_i$ is close to 0:
  $\tilde{h}_i \approx \tanh(W_x x_i + 0 * W_h h_{i-1} + b)$
  $\tilde{h}_i \approx \tanh(W_x x_i + b)$
• The previous hidden state is ignored, which indicates the current activation is irrelevant to the past
76
Gated Recurrent Unit (GRU)
• The update gate $z_i$ controls how much of the past state should matter compared to the current activation
77
Long Short-Term Memory
Network (LSTM)
Chaoqun He
THUNLP
78
Long Short-Term Memory Network
• Long Short-Term Memory network (LSTM)
• LSTMs are a special kind of RNN, capable of learning long-term dependencies, like the GRU
79
By Chris Olah: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Long Short-Term Memory Network
• The key to LSTMs is the cell state $C_t$
80
Long Short-Term Memory Network
83
Long Short-Term Memory Network
84
Long Short-Term Memory Network
• Especially powerful when stacked and made even deeper (each hidden layer is already computed by a deep internal network)
85
Bidirectional RNNs
Chaoqun He
THUNLP
86
Bidirectional RNNs
• In traditional RNNs, the state at time 𝑡 only
captures information from the past
$h_t = f(x_{t-1}, \dots, x_2, x_1)$
• Problem: in many applications, we want an output $y_t$ that depends on the whole input sequence
• For example
• Handwriting recognition
• Speech recognition
87
Bidirectional RNNs
[Figure: a bidirectional RNN: a forward RNN and a backward RNN read $x_1, x_2, x_3$ in opposite directions, and their hidden states are combined to produce $y_1, y_2, y_3$]
88
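A minimal sketch of a bidirectional LSTM in PyTorch (sizes assumed); each output at time $t$ concatenates the forward and backward hidden states:

```python
import torch
import torch.nn as nn

d_in, d_h = 8, 16
birnn = nn.LSTM(d_in, d_h, batch_first=True, bidirectional=True)

x = torch.randn(1, 3, d_in)   # a sequence x_1, x_2, x_3
out, _ = birnn(x)             # out[:, t] = [forward h_t ; backward h_t]
print(out.shape)              # torch.Size([1, 3, 32]) = 2 * d_h
```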
Summary
• Recurrent Neural Network
• Sequential Memory
• Gradient Problem for RNN
• RNN Variants
• Gated Recurrent Unit (GRU)
• Long Short-Term Memory Network (LSTM)
• Bidirectional Recurrent Neural Network
89
Convolutional Neural
Networks (CNNs)
Chaoqun He
THUNLP
90
CNN for Sentence Representation
• Convolutional Neural Networks (CNNs)
• Generally used in Computer Vision (CV)
• Achieve promising results in a variety of NLP tasks:
• Sentiment classification
• Relation classification
• …
• CNNs are good at extracting local and position-
invariant patterns
• In CV, colors, edges, textures, etc.
• In NLP, phrases and other local grammar structures
91
CNN for Sentence Representation
• CNNs extract patterns by:
• Computing representations for all possible n-gram
phrases in a sentence.
• Without relying on external linguistic tools (e.g.,
dependency parser)
Example sentence: The plane is taking off
  Possible n-gram phrases:
  Bigrams: The plane, plane is, is taking, taking off
  Trigrams: The plane is, plane is taking, is taking off
  n-grams: …
92
Architecture
• Input Layer
• Convolutional Layer
• Max-pooling Layer
• Non-linear Layer

[Figure: the input sentence "The students opened their books and …" is turned into an input matrix $\mathbf{x}$; a convolution with filter $\mathbf{w}$ produces the feature map $\mathbf{f}$; max-pooling gives $\mathbf{q}$; and a tanh non-linearity gives the final representation $\mathbf{c}$]
93
Input Layer
• Transform words into input representations 𝐱 via
word embeddings
• $\mathbf{x} \in \mathbb{R}^{m \times d}$: input representation
  • $m$ is the length of the sentence
  • $d$ is the dimension of the word embeddings

[Figure: the words "The students opened their books and" stacked as rows of the embedding matrix $\mathbf{x}$]
94
Convolution Layer
• ⋅ is dot product
95
Convolution Layer
• Extract feature representation from input
representation via a sliding convolving filter
$\mathbf{f}_i = \mathbf{w} \cdot \mathbf{x}_{i:i+h-1} + b,\quad i = 1, 2, \dots, m - h + 1$

[Figure: a filter of height $h$ slides over the rows "The students opened their books and", producing one feature value $\mathbf{f}_i$ per window]
96
Convolution Layer
• Extract feature representation from input
representation via a sliding convolving filter
$\mathbf{f}_i = \mathbf{w} \cdot \mathbf{x}_{i:i+h-1} + b,\quad i = 1, 2, \dots, m - h + 1$

[Figure: the same filter shifted down by one position, now covering "students opened their …", producing the next feature value]
97
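A minimal sketch of this convolution plus max-pooling over word embeddings in PyTorch; the sentence length, embedding size, filter width, and number of filters are assumptions:

```python
import torch
import torch.nn as nn

m, d = 6, 50                 # sentence length, embedding dim (assumed)
h, n_filters = 3, 100        # filter width (trigram) and number of filters (assumed)

x = torch.randn(1, m, d)     # input matrix x (a batch of one sentence)
conv = nn.Conv1d(in_channels=d, out_channels=n_filters, kernel_size=h)

f = conv(x.transpose(1, 2))  # f_i = w . x_{i:i+h-1} + b  -> shape (1, 100, m-h+1)
q = f.max(dim=2).values      # max-pooling over positions -> shape (1, 100)
c = torch.tanh(q)            # non-linear layer -> sentence representation
```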
Application Scenarios
• Object Detection
• You Only Look Once: Unified, Real-Time Object
Detection
• Video Classification
• Large-scale Video Classification with Convolutional
Neural Networks
• Speech Recognition
• Convolutional, Long Short-Term Memory, fully
connected Deep Neural Networks
• Text Classification
• Convolutional Neural Networks for Sentence
Classification
98
Compare CNN with RNN
• CNN vs. RNN
• Advantages: CNNs extract local and position-invariant features; RNNs model long-range context dependencies
• Parameters: CNNs have fewer parameters; RNNs have more parameters
• Parallelization: CNNs parallelize well within sentences; RNNs cannot be parallelized within sentences
99
Summary
100
NLP Pipeline Tutorial
(PyTorch)
Jing Yi
THUNLP
101
Pipeline for Deep Learning
• prepare data
• build model
• train model
• evaluate model
• test model
102
Word Language Model
• target: predict the next word
• input: never too old to learn
• output: too old to learn English
• model: LSTM
• loss: cross_entropy
103
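A minimal training-loop skeleton for such a word-level LSTM language model in PyTorch; the vocabulary size, dimensions, and toy batch are placeholders (assumptions), not the tutorial's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):              # tokens: (batch, seq_len)
        h, _ = self.lstm(self.emb(tokens))
        return self.out(h)                  # next-word logits at every position

vocab_size = 10000                          # placeholder
model = LSTMLanguageModel(vocab_size)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One toy step: input "never too old to", target "too old to learn" (shifted by one)
inp = torch.randint(0, vocab_size, (1, 4))  # placeholder token ids
tgt = torch.randint(0, vocab_size, (1, 4))

logits = model(inp)                         # (1, 4, vocab_size)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), tgt.reshape(-1))
opt.zero_grad()
loss.backward()
opt.step()
```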
Exercise
task: sentiment analysis
• dataset: glue-sst2
• model: RNN, or any other model you are interested in
104
Thanks
THUNLP
105