L2 Neural Network Basics
THUNLP
1
Content
• Neural Network Components
• Simple Neuron; Multilayer; Feedforward; Non-linear; …
• How to Train
• Objective; Gradients; Backpropagation
• Word Representation: Word2Vec
• Common Neural Networks
• RNN
• Sequential Memory; Language Model
• Gradient Problem for RNN
• Variants: GRU; LSTM; Bidirectional
• CNN
• NLP Pipeline Tutorial (PyTorch)
2
Neural Network Components
Shi Yu
THUNLP
3
Neural Network
• (Artificial) Neural Network
Source: Wikipedia
4
(Artificial) Neuron
• A neuron is a computational unit with 𝑛 inputs and 1 output
and parameters 𝒘, 𝑏
$h_{\boldsymbol{w},b}(\boldsymbol{x}) = f(\boldsymbol{w}^\top \boldsymbol{x} + b)$
[Figure: a single neuron with inputs $x_1, x_2, x_3$, a bias unit $+1$, weights $\boldsymbol{w}$ and bias $b$, and output $h_{\boldsymbol{w},b}(\boldsymbol{x})$]
6
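As a concrete illustration (not from the slides), here is a minimal sketch of this neuron in PyTorch; the input size 3 and the sigmoid choice of $f$ are assumptions for the example:

```python
import torch

# A single neuron: n = 3 inputs, parameters w (weights) and b (bias)
w = torch.randn(3)        # weight vector w
b = torch.randn(1)        # bias b
f = torch.sigmoid         # non-linearity f (assumed; could be tanh, ReLU, ...)

x = torch.tensor([1.0, 2.0, 3.0])   # example input
h = f(w @ x + b)                     # h_{w,b}(x) = f(w^T x + b)
print(h)                             # a single scalar output
```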
Matrix Notation
• A single layer neural network: Hooking together many
simple neurons
$a_1 = f(W_{11} x_1 + W_{12} x_2 + W_{13} x_3 + b_1)$
In matrix notation: $\boldsymbol{a} = f(\boldsymbol{W}\boldsymbol{x} + \boldsymbol{b})$
[Figure: inputs $x_1, x_2, x_3$ and a bias unit $+1$ connected to neurons $a_1, a_2, a_3$ by weights $W_{ij}$]
7
Multilayer Neural Network
• Stacking multiple layers of neural networks
[Figure: inputs $x_1, x_2, x_3$ with bias units $+1$ feed two stacked layers of neurons, producing the output $h_{W,b}(x)$; values flow forward through the network (feedforward computation)]
8
Feedforward Computation
[Figure: input layer $x_1, x_2, x_3$ (plus bias units $+1$), multiple hidden layers, and the output $h_{W,b}(x)$]

$\boldsymbol{h}_1 = f(\boldsymbol{W}_1 \boldsymbol{x} + \boldsymbol{b}_1)$
$\boldsymbol{h}_2 = f(\boldsymbol{W}_2 \boldsymbol{h}_1 + \boldsymbol{b}_2)$
$\boldsymbol{h}_3 = f(\boldsymbol{W}_3 \boldsymbol{h}_2 + \boldsymbol{b}_3)$
9
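A minimal sketch of this stacked feedforward computation in PyTorch; the layer sizes (3, 4, 4, 4) and the tanh non-linearity are assumptions for illustration:

```python
import torch
import torch.nn as nn

# Three stacked layers: h1 = f(W1 x + b1), h2 = f(W2 h1 + b2), h3 = f(W3 h2 + b3)
W1, W2, W3 = nn.Linear(3, 4), nn.Linear(4, 4), nn.Linear(4, 4)
f = torch.tanh   # non-linearity (assumed)

x = torch.randn(3)
h1 = f(W1(x))
h2 = f(W2(h1))
h3 = f(W3(h2))
print(h3.shape)  # torch.Size([4])
```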
Why use non-linearities (f)?
• Without non-linearities, deep neural networks cannot do
anything more than a linear transform
• Extra layers could just be compiled down into a single linear
transform
$\boldsymbol{h}_1 = \boldsymbol{W}_1 \boldsymbol{x} + \boldsymbol{b}_1,\quad \boldsymbol{h}_2 = \boldsymbol{W}_2 \boldsymbol{h}_1 + \boldsymbol{b}_2 \;\Rightarrow\; \boldsymbol{h}_2 = \boldsymbol{W}_2 \boldsymbol{W}_1 \boldsymbol{x} + \boldsymbol{W}_2 \boldsymbol{b}_1 + \boldsymbol{b}_2$
10
Choices of non-linearities
• Sigmoid
  $f(z) = \dfrac{1}{1 + e^{-z}}$
• Tanh
  $f(z) = \tanh(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
• ReLU
  $f(z) = \max(z, 0)$
• …
11
Output Layer
[Figure: the same feedforward network (input layer $x_1, x_2, x_3$, bias units $+1$, multiple hidden layers), now with an output layer appended after the hidden layers]
12
Output Layer
• Linear output
  • $y = \boldsymbol{w}^\top \boldsymbol{h} + b$
• Sigmoid
  • $y = \sigma(\boldsymbol{w}^\top \boldsymbol{h} + b)$
  • For binary classification
    • $y$ for one class
    • $1 - y$ for the other
[Figure: output layer mapping the hidden representation $\boldsymbol{h}$ (plus bias $+1$) to a single output $y$]
13
Output Layer
Output layer
• Softmax
  • $y_i = \mathrm{softmax}(\boldsymbol{z})_i = \dfrac{\exp(z_i)}{\sum_j \exp(z_j)}$
  • $\boldsymbol{z} = \boldsymbol{W}\boldsymbol{h} + \boldsymbol{b}$
  • For multi-class classification
[Figure: output layer mapping the hidden representation $\boldsymbol{h}$ (plus bias $+1$) to an output distribution $\boldsymbol{y}$]
14
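A minimal sketch of a softmax output layer in PyTorch; the hidden size 4 and the 3 output classes are assumptions for illustration:

```python
import torch
import torch.nn as nn

hidden_size, num_classes = 4, 3          # assumed sizes
W = nn.Linear(hidden_size, num_classes)  # z = W h + b

h = torch.randn(hidden_size)             # hidden representation h
z = W(h)                                 # logits z
y = torch.softmax(z, dim=-1)             # y_i = exp(z_i) / sum_j exp(z_j)
print(y, y.sum())                        # a probability distribution over 3 classes
```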
Summary
• Simple neuron
• Single layer neural network
• Multilayer neural network
• Stack multiple layers of neural networks
• Non-linear activation functions
• Enable neural nets to represent more complicated
features
• Output layer
• For desired output
15
How to Train a Neural
Network
Shi Yu
THUNLP
16
Training Objective
• Mean Squared Error
• Given $N$ training examples $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ and $y_i$ are the attributes and the price of a computer, we want to train a neural network $F_\theta(\cdot)$ which takes the attributes $x$ as input and predicts the price $y$. A reasonable training objective is the Mean Squared Error:

$\min_\theta J(\theta) = \min_\theta \dfrac{1}{N}\sum_{i=1}^{N} \left(y_i - F_\theta(x_i)\right)^2$
17
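A minimal sketch of this objective in PyTorch; the toy predictions and target prices are made up for illustration:

```python
import torch
import torch.nn.functional as F

# Toy predictions F_theta(x_i) and ground-truth prices y_i (made-up numbers)
pred   = torch.tensor([1200.0, 800.0, 650.0])
target = torch.tensor([1000.0, 900.0, 700.0])

# J(theta) = (1/N) * sum_i (y_i - F_theta(x_i))^2
loss = F.mse_loss(pred, target)
print(loss)  # equivalent to ((target - pred) ** 2).mean()
```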
Training Objective
• Cross-entropy
• Given $N$ training examples $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ and $y_i$ are a sentence and its sentiment label, we want to train a neural network $F_\theta(\cdot)$ which takes the sentence $x$ as input and predicts its sentiment $y$. A reasonable training objective is the Cross-entropy:

$\min_\theta J(\theta) = \min_\theta -\dfrac{1}{N}\sum_{i=1}^{N} \log P_{\text{model}}\left(F_\theta(x_i) = y_i\right)$
18
Training Objective
• Cross-entropy

$\min_\theta J(\theta) = \min_\theta -\dfrac{1}{N}\sum_{i=1}^{N} \log P_{\text{model}}\left(F_\theta(x_i) = y_i\right)$

[Figure: output distribution over three classes: 0.6, 0.3, 0.1]

If the ground truth is $y = 1$ (first class), then the loss for this instance is
$-\log P_{\text{model}}(F_\theta(x) = 1) = -\log 0.6 = 0.74$.
If $y = 2$:
$-\log P_{\text{model}}(F_\theta(x) = 2) = -\log 0.3 = 1.74$.
If $y = 3$:
$-\log P_{\text{model}}(F_\theta(x) = 3) = -\log 0.1 = 3.32$.
(The logarithms here are base 2.)
19
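A minimal sketch reproducing the per-instance losses above in PyTorch; the distribution [0.6, 0.3, 0.1] comes from the slide, and note that PyTorch's built-in losses use natural logs while the slide's numbers use base-2 logs:

```python
import torch

probs = torch.tensor([0.6, 0.3, 0.1])   # output distribution from the slide

# Per-instance cross-entropy loss -log P(F(x) = y) for each possible label y
for y in range(3):
    loss_nat  = -torch.log(probs[y])    # natural log (what F.cross_entropy uses)
    loss_bits = -torch.log2(probs[y])   # base-2 log (matches 0.74 / 1.74 / 3.32)
    print(f"y={y + 1}: {loss_nat.item():.2f} nats, {loss_bits.item():.2f} bits")
```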
Stochastic Gradient Descent
• Update rule: $\theta \leftarrow \theta - \alpha \nabla_\theta J(\theta)$, where $\alpha$ is the learning rate
20
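A minimal sketch of one SGD update in PyTorch; the tiny linear model, batch, and learning rate are assumptions for illustration:

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)                              # a tiny model with parameters theta
opt = torch.optim.SGD(model.parameters(), lr=0.01)   # alpha = 0.01

x, y = torch.randn(8, 3), torch.randn(8, 1)          # one mini-batch (made up)
loss = ((model(x) - y) ** 2).mean()                  # J(theta) on this batch

opt.zero_grad()
loss.backward()   # compute gradients via backpropagation
opt.step()        # theta <- theta - alpha * grad J(theta)
```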
Gradients
• Given a function with 1 output and n inputs:
$F(\boldsymbol{x}) = F(x_1, x_2, \dots, x_n)$
• Its gradient is the vector of partial derivatives with respect to each input:
$\nabla_{\boldsymbol{x}} F = \left(\dfrac{\partial F}{\partial x_1}, \dfrac{\partial F}{\partial x_2}, \dots, \dfrac{\partial F}{\partial x_n}\right)$
21
Jacobian Matrix: Generalization of the Gradient
• For a function $\boldsymbol{F}(\boldsymbol{x})$ with $m$ outputs and $n$ inputs, the Jacobian is the $m \times n$ matrix of partial derivatives, $\left(\dfrac{\partial \boldsymbol{F}}{\partial \boldsymbol{x}}\right)_{ij} = \dfrac{\partial F_i}{\partial x_j}$
22
Chain Rule for Jacobians
• For one-variable functions: multiply derivatives
$z = 3y$
$y = x^2$
$\dfrac{dz}{dx} = \dfrac{dz}{dy}\,\dfrac{dy}{dx} = 3 \times 2x = 6x$
24
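As a sanity check (not from the slides), PyTorch's autograd applies exactly this chain rule; a minimal sketch verifying $dz/dx = 6x$ at $x = 2$:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2       # y = x^2
z = 3 * y        # z = 3y

z.backward()     # dz/dx = dz/dy * dy/dx = 3 * 2x
print(x.grad)    # tensor(12.) = 6 * x at x = 2
```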
Back to Neural Network
• Given $s = \boldsymbol{u}^\top \boldsymbol{h}$, $\boldsymbol{h} = f(\boldsymbol{z})$, $\boldsymbol{z} = \boldsymbol{W}\boldsymbol{x} + \boldsymbol{b}$, what is $\dfrac{\partial s}{\partial \boldsymbol{b}}$?
• Apply the chain rule:
$\dfrac{\partial s}{\partial \boldsymbol{b}} = \dfrac{\partial s}{\partial \boldsymbol{h}}\,\dfrac{\partial \boldsymbol{h}}{\partial \boldsymbol{z}}\,\dfrac{\partial \boldsymbol{z}}{\partial \boldsymbol{b}} = \boldsymbol{u}^\top \operatorname{diag}\left(f'(\boldsymbol{z})\right) \boldsymbol{I}$
25
Backpropagation
• Compute gradients algorithmically
26
Computational Graphs
• Representing our neural net equations as a graph
  • Source nodes: inputs
  • Interior nodes: operations
  • Edges pass along the result of the operation

$s = \boldsymbol{u}^\top \boldsymbol{h}$
$\boldsymbol{h} = f(\boldsymbol{z})$
$\boldsymbol{z} = \boldsymbol{W}\boldsymbol{x} + \boldsymbol{b}$
$\boldsymbol{x}$: input

[Figure: the computational graph: the input $\boldsymbol{x}$ and parameter $\boldsymbol{W}$ feed a product node giving $\boldsymbol{W}\boldsymbol{x}$; adding $\boldsymbol{b}$ gives $\boldsymbol{z}$; applying $f$ gives $\boldsymbol{h}$; the dot product with $\boldsymbol{u}$ gives $s$ ("Forward Propagation")]
27
Backpropagation
• Go backwards along edges
  • Pass along gradients

$s = \boldsymbol{u}^\top \boldsymbol{h}$
$\boldsymbol{h} = f(\boldsymbol{z})$
$\boldsymbol{z} = \boldsymbol{W}\boldsymbol{x} + \boldsymbol{b}$
$\boldsymbol{x}$: input

[Figure: the same computational graph traversed in reverse, passing $\dfrac{\partial s}{\partial \boldsymbol{h}}$, $\dfrac{\partial s}{\partial \boldsymbol{z}}$, and $\dfrac{\partial s}{\partial \boldsymbol{b}}$ back toward the parameters $\boldsymbol{W}$, $\boldsymbol{b}$, $\boldsymbol{u}$]
28
Backpropagation: Single Node
• Node receives an “upstream gradient”
• Goal is to pass on the correct “downstream
gradient”
$\boldsymbol{h} = f(\boldsymbol{z})$

[Figure: the node $f$ maps $\boldsymbol{z}$ to $\boldsymbol{h}$; the downstream gradient $\dfrac{\partial s}{\partial \boldsymbol{z}}$ must be computed from the upstream gradient $\dfrac{\partial s}{\partial \boldsymbol{h}}$]
29
Backpropagation: Single Node
• Each node has a local gradient
• The gradient of its output with respect to its input
$\boldsymbol{h} = f(\boldsymbol{z})$

[Figure: the node's local gradient is $\dfrac{\partial \boldsymbol{h}}{\partial \boldsymbol{z}}$; the downstream gradient $\dfrac{\partial s}{\partial \boldsymbol{z}}$ is obtained from the upstream gradient $\dfrac{\partial s}{\partial \boldsymbol{h}}$ and this local gradient]
30
Backpropagation: Single Node
• Each node has a local gradient
• The gradient of its output with respect to its input
• [downstream gradient] = [upstream gradient] x
[local gradient]
$\boldsymbol{h} = f(\boldsymbol{z})$

Chain Rule: $\dfrac{\partial s}{\partial \boldsymbol{z}} = \dfrac{\partial s}{\partial \boldsymbol{h}} \times \dfrac{\partial \boldsymbol{h}}{\partial \boldsymbol{z}}$ (downstream = upstream × local)
31
An Example
$f(x, y, z) = (x + y)\,\max(y, z)$
$x = 1,\; y = 2,\; z = 0$

Forward prop steps:
$a = x + y = 3$
$b = \max(y, z) = 2$
$f = ab = 6$

Local gradients:
$\dfrac{\partial a}{\partial x} = 1,\quad \dfrac{\partial a}{\partial y} = 1$
$\dfrac{\partial b}{\partial y} = \mathbf{1}[y > z] = 1,\quad \dfrac{\partial b}{\partial z} = \mathbf{1}[z > y] = 0$
$\dfrac{\partial f}{\partial a} = b = 2,\quad \dfrac{\partial f}{\partial b} = a = 3$

Backward pass (upstream × local = downstream), starting from $\dfrac{\partial f}{\partial f} = 1$:
• At the $*$ node: $\dfrac{\partial f}{\partial a} = 1 \times 2 = 2$ and $\dfrac{\partial f}{\partial b} = 1 \times 3 = 3$
• At the $\max$ node: the contribution to $y$ is $3 \times 1 = 3$ and $\dfrac{\partial f}{\partial z} = 3 \times 0 = 0$
• At the $+$ node: $\dfrac{\partial f}{\partial x} = 2 \times 1 = 2$ and the contribution to $y$ is $2 \times 1 = 2$

Final gradients (summing the two paths into $y$):
$\dfrac{\partial f}{\partial x} = 2,\quad \dfrac{\partial f}{\partial y} = 3 + 2 = 5,\quad \dfrac{\partial f}{\partial z} = 0$

[Figure: computational graph with nodes $+$ (giving $a = 3$), $\max$ (giving $b = 2$), and $*$ (giving $f = 6$), annotated step by step with these upstream, local, and downstream gradients]
32-37
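As a sanity check (not from the slides), the same gradients can be obtained with PyTorch's autograd:

```python
import torch

x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
z = torch.tensor(0.0, requires_grad=True)

f = (x + y) * torch.max(y, z)   # f(x, y, z) = (x + y) * max(y, z) = 6
f.backward()

print(x.grad, y.grad, z.grad)   # tensor(2.), tensor(5.), tensor(0.)
```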
Summary
• Forward pass: compute the results of each operation and save the intermediate values
• Backward pass: apply the chain rule, going backwards through the graph, to compute the gradients (reusing the saved values)
38
Word Representation:
Word2Vec
Shi Yu
THUNLP
39
Word2Vec
• Word2vec uses shallow neural networks that
associate words to distributed representations
• It can capture many linguistic regularities, such as the well-known analogy vec("king") − vec("man") + vec("woman") ≈ vec("queen")
40
Typical Models
• Word2vec can utilize two architectures to produce
distributed representations of words:
• Continuous bag-of-words (CBOW)
• Continuous skip-gram
CBOW Skip-Gram
41
Sliding Window
• Word2vec uses a sliding window of a fixed size moving along
a sentence
• In each window, the middle word is the target word and the other words are the context words
  • Given the context words, CBOW predicts the probability of the target word
  • Given the target word, skip-gram predicts the probabilities of the context words
42
An Example of the Sliding Window
43
Continuous Bag-of-Words
• In CBOW architecture, the model predicts the target word
given a window of surrounding context words
• According to the bag-of-words assumption, the order of context words does not influence the prediction
• Suppose the window size is 5
  • Never too late to learn
  • P(late | [never, too, to, learn]), …
44
Continuous Bag-of-Words
• Never too late to learn

[Figure: the one-hot vectors of the context words never, too, to, learn are mapped through the embedding matrix, averaged, combined with the output matrix via a dot product, and passed through a softmax, producing a probability distribution over the vocabulary that should assign high probability to the target word late]
45
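A minimal CBOW sketch in PyTorch, assuming a toy vocabulary and embedding size; this illustrates the idea rather than reproducing the original word2vec implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab = ["never", "too", "late", "to", "learn"]       # toy vocabulary (assumed)
word2id = {w: i for i, w in enumerate(vocab)}
V, d = len(vocab), 16                                 # vocab size, embedding dim (assumed)

emb_in  = nn.Embedding(V, d)            # input (context) embeddings
emb_out = nn.Linear(d, V, bias=False)   # output embeddings, one row per word

# CBOW: average the context embeddings, then predict the target word
context = torch.tensor([word2id[w] for w in ["never", "too", "to", "learn"]])
target  = torch.tensor([word2id["late"]])

h = emb_in(context).mean(dim=0, keepdim=True)   # average of context vectors
logits = emb_out(h)                             # scores over the vocabulary
loss = F.cross_entropy(logits, target)          # -log P(late | context)
```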
Continuous Skip-Gram
• In skip-gram architecture, the model predicts the context
words from the target word
• Suppose the window size is 5
• Never too late to learn
• P([too, late] | Never), P([Never, late, to] | too), …
• Skip-gram predicts one context word at each step, so the training samples are:
  • P(too | Never), P(late | Never), P(Never | too), P(late | too), P(to | too), …
46
Continuous Skip-Gram
• Never too late to learn

[Figure: the one-hot vector of the target word too is mapped to its embedding, combined with the output matrix via a dot product, and passed through a softmax, producing a probability distribution over the vocabulary, e.g. P(Never | too)]
47
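A minimal sketch (an illustration, not the original implementation) of generating skip-gram training pairs with a sliding window of size 5, i.e. two context words on each side:

```python
# Generate (target, context) skip-gram pairs with a sliding window.
sentence = ["never", "too", "late", "to", "learn"]
window = 2   # two context words on each side -> window size 5

pairs = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

print(pairs[:4])  # [('never', 'too'), ('never', 'late'), ('too', 'never'), ('too', 'late')]
```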
Problems of Full Softmax
• When the vocabulary size is very large
• Computing the softmax over all words at every step involves a huge number of model parameters, which is computationally impractical
• We need to improve the computational efficiency
48
Improving Computational Efficiency
• In fact, we do not need a full probabilistic model in
word2vec
• There are two main improvement methods for word2vec:
• Negative sampling
• Hierarchical softmax
49
Negative Sampling
• As discussed before, the vocabulary is very large, which means our model has a tremendous number of weights that need to be updated at every step
• The idea of negative sampling is to update only a small percentage of the weights at each step
50
Negative Sampling
• Since we have the vocabulary and know the context words, we can sample a few words that are not in the context word list, with probability:

$P(w_i) = \dfrac{f(w_i)^{3/4}}{\sum_{j=1}^{V} f(w_j)^{3/4}}$

$f(w_i)$ is the frequency of $w_i$. Compared to $\dfrac{f(w_i)}{\sum_{j=1}^{V} f(w_j)}$, the $3/4$ power increases the probability of sampling low-frequency words.
51
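A minimal sketch of this sampling distribution, assuming some made-up word frequencies:

```python
import torch

# Made-up unigram counts for a toy vocabulary
freq = torch.tensor([100.0, 10.0, 1.0])   # f(w_i)
probs = freq ** 0.75                       # f(w_i)^(3/4)
probs = probs / probs.sum()                # P(w_i)

print(probs)   # low-frequency words get a boost compared to plain f / sum(f)
negatives = torch.multinomial(probs, num_samples=4, replacement=True)  # sample 4 negative words
```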
Negative Sampling
• Here is one training step of the skip-gram model, which computes probabilities for all words at the output layer (full softmax)

[Figure: the one-hot vector of too is mapped to its embedding and projected to a score for every word in the vocabulary (dim = V), followed by a softmax, e.g. P(Never | too)]
52
Negative Sampling
• Suppose we only sample 4 negative words:
[Figure: the same step with negative sampling: the output layer only scores the positive word plus the 4 sampled negative words (dim = 5), followed by a softmax, e.g. P(Never | too)]
53
Negative Sampling
• Then we can compute the loss and optimize the weights (not all of the weights) at each step
• Suppose we have a weight matrix of size 300 × 10,000 and the output size is 5
• We then only need to update 300 × 5 weights, which is only 0.05% of all the weights
54
Other Tips for Learning Word Embeddings
55
Other Tips for Learning Word Embeddings
56
Recurrent Neural Networks
(RNNs)
Chaoqun He
THUNLP
57
Sequential Memory
• Key concept for RNNs: Sequential memory during
processing sequence data
• Sequential memory in humans:
• Say the alphabet in your head
• Pretty easy
58
Sequential Memory
• Key concept for RNNs: Sequential memory during
processing sequence data
• Sequential memory in humans:
• Say the alphabet backward
• Much harder
59
Sequential Memory
• Definition: a mechanism that makes it easier for
your brain to recognize sequence patterns
60
Recurrent Neural Networks
[Figure: an unrolled RNN with hidden states $h_0, h_1, h_2, h_3$ passed from step to step]
61
Recurrent Neural Networks
• RNN Cell
$h_i = \tanh(W_x x_i + W_h h_{i-1} + b)$
$y_i = F(h_i)$

[Figure: an RNN cell takes the input $x_i$ and the previous hidden state $h_{i-1}$, and produces the new hidden state $h_i$ and the output $y_i$]
62
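A minimal sketch of this cell in PyTorch, with assumed input and hidden sizes:

```python
import torch
import torch.nn as nn

d_in, d_h = 8, 16                       # input and hidden sizes (assumed)
W_x = nn.Linear(d_in, d_h, bias=True)   # W_x x_i + b
W_h = nn.Linear(d_h, d_h, bias=False)   # W_h h_{i-1}

def rnn_cell(x_i, h_prev):
    # h_i = tanh(W_x x_i + W_h h_{i-1} + b)
    return torch.tanh(W_x(x_i) + W_h(h_prev))

h = torch.zeros(d_h)                    # h_0
for x_i in torch.randn(5, d_in):        # a sequence of 5 input vectors
    h = rnn_cell(x_i, h)
```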
RNN Language Model
$h_i = \tanh(W_x x_i + W_h h_{i-1} + b_1)$   (hidden states)
$x_i = E\,w_i$   (word embeddings)
$w_i \in \mathbb{R}^{|V|}$   (one-hot vectors)
$\hat{y}_t = \mathrm{softmax}(U h_t + b_2) \in \mathbb{R}^{|V|}$   (output distribution)

[Figure: the sentence "never too late to" is fed in word by word: each one-hot vector $w_i$ is mapped to an embedding $x_i = E w_i$, the hidden state is updated as $h_i = \tanh(W_x x_i + W_h h_{i-1} + b_1)$ starting from $h_0$, and at the last step the output distribution $\hat{y}_4 = \mathrm{softmax}(U h_4 + b_2)$ scores candidate next words such as "read", "code", "a", "zoo"]
63-66
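A minimal sketch of this RNN language model's forward pass in PyTorch, with assumed vocabulary size and dimensions:

```python
import torch
import torch.nn as nn

V, d, d_h = 10000, 128, 256        # vocab size, embedding dim, hidden dim (assumed)
E = nn.Embedding(V, d)             # x_i = E w_i
rnn = nn.RNN(d, d_h, batch_first=True, nonlinearity='tanh')
U = nn.Linear(d_h, V)              # y_t = softmax(U h_t + b_2)

w = torch.randint(0, V, (1, 4))    # token ids for "never too late to" (made-up ids)
x = E(w)                           # (1, 4, d)
h, _ = rnn(x)                      # hidden states h_1 ... h_4
y = torch.softmax(U(h[:, -1]), dim=-1)   # distribution over the next word
```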
Application Scenarios
• Sequence Labeling
• Given a sentence, the lexical properties of each word are
required
• Sequence Prediction
• Given the temperature for seven days a week, predict
the weather conditions for each day
• Photograph Description
• Given a photograph, create a sentence that describes
the photograph
• Text Classification
• Given a sentence, distinguish whether the sentence has
a positive or negative emotion
67
Recurrent Neural Networks
• Advantages:
• Can process any length input
• Model size does not increase for longer input
• Weights are shared across timesteps
• Computation for step 𝑖 can (in theory) use information
from many steps back
• Disadvantages:
• Recurrent computation is slow
• In practice, it’s difficult to access information from many
steps back
68
Gradient Problem for RNN
• Backpropagation through time multiplies many Jacobians of the recurrent step, so gradients tend to vanish (or explode) over long sequences, making long-range dependencies hard to learn
69
RNN Variants
Chaoqun He
THUNLP
70
Solution for Better RNNs
• Better Units!
• The main solution to the Vanishing Gradient
Problem is to use a more complex hidden unit
computation in recurrence
• GRU
• LSTM
• Main ideas:
• Keep around memories to capture long distance
dependencies
71
Gated Recurrent Unit (GRU)
Chaoqun He
THUNLP
72
Gated Recurrent Unit (GRU)
• Vanilla RNN computes hidden layer at next time
step directly:
$h_i = \tanh(W_x x_i + W_h h_{i-1} + b)$
• Introduce a gating mechanism into the RNN
  • Update gate
    $z_i = \sigma(W_x^{(z)} x_i + W_h^{(z)} h_{i-1} + b^{(z)})$
  • Reset gate
    $r_i = \sigma(W_x^{(r)} x_i + W_h^{(r)} h_{i-1} + b^{(r)})$
• Gates are used to balance the influence of the past state and the current input
73
Gated Recurrent Unit (GRU)
• Update gate
  $z_i = \sigma(W_x^{(z)} x_i + W_h^{(z)} h_{i-1} + b^{(z)})$
• Reset gate
  $r_i = \sigma(W_x^{(r)} x_i + W_h^{(r)} h_{i-1} + b^{(r)})$
• New activation $\tilde{h}_i$
  $\tilde{h}_i = \tanh(W_x x_i + r_i * W_h h_{i-1} + b)$
• Final hidden state $h_i$
  $h_i = z_i * h_{i-1} + (1 - z_i) * \tilde{h}_i$
• Where $*$ refers to the element-wise product
74
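A minimal sketch of these GRU equations in PyTorch (PyTorch also provides nn.GRUCell); the sizes are assumptions, and the gate's input and hidden weight matrices are merged here by concatenating $x_i$ and $h_{i-1}$, which is equivalent to the separate $W_x^{(\cdot)}$, $W_h^{(\cdot)}$ form:

```python
import torch
import torch.nn as nn

d_in, d_h = 8, 16                    # input and hidden sizes (assumed)
Wz = nn.Linear(d_in + d_h, d_h)      # update gate parameters
Wr = nn.Linear(d_in + d_h, d_h)      # reset gate parameters
Wx = nn.Linear(d_in, d_h)
Wh = nn.Linear(d_h, d_h, bias=False)

def gru_cell(x, h_prev):
    xh = torch.cat([x, h_prev], dim=-1)
    z = torch.sigmoid(Wz(xh))                     # update gate z_i
    r = torch.sigmoid(Wr(xh))                     # reset gate r_i
    h_tilde = torch.tanh(Wx(x) + r * Wh(h_prev))  # new activation
    return z * h_prev + (1 - z) * h_tilde         # final hidden state h_i

h = torch.zeros(d_h)
for x in torch.randn(5, d_in):
    h = gru_cell(x, h)
```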
Gated Recurrent Unit (GRU)
[Figure: a numerical walk-through of one GRU step: given example values of $x_i$ and $h_{i-1}$, the preparation stage computes $z_i$, $r_i$, and $\tilde{h}_i$ with $\sigma$ and $\tanh$, and the update stage combines $z_i * h_{i-1}$ and $(1 - z_i) * \tilde{h}_i$ into $h_i$]
75
Gated Recurrent Unit (GRU)
• If the reset gate $r_i$ is close to 0:
  $\tilde{h}_i \approx \tanh(W_x x_i + 0 * W_h h_{i-1} + b)$
  $\tilde{h}_i \approx \tanh(W_x x_i + b)$
• The previous hidden state is ignored, which indicates the current activation is irrelevant to the past
76
Gated Recurrent Unit (GRU)
• The update gate $z_i$ controls how much of the past state should matter compared to the current activation
77
Long Short-Term Memory
Network (LSTM)
Chaoqun He
THUNLP
78
Long Short-Term Memory Network
• Long Short-Term Memory network (LSTM)
• LSTMs are a special kind of RNN, capable of learning long-term dependencies, like the GRU
79
By Chris Olah: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Long Short-Term Memory Network
• The key to LSTMs is the cell state $C_t$
80
Long Short-Term Memory Network
83
Long Short-Term Memory Network
84
Long Short-Term Memory Network
• Especially powerful when stacked and made even deeper (each hidden layer is already computed by a deep internal network)
85
Bidirectional RNNs
Chaoqun He
THUNLP
86
Bidirectional RNNs
• In traditional RNNs, the state at time 𝑡 only
captures information from the past
$h_t = f(x_{t-1}, \dots, x_2, x_1)$
• Problem: in many applications, we want an output $y_t$ that depends on the whole input sequence
• For example
• Handwriting recognition
• Speech recognition
87
Bidirectional RNNs
[Figure: a bidirectional RNN: a forward RNN and a backward RNN read $x_1, x_2, x_3$ in opposite directions, and their hidden states are combined to produce $y_1, y_2, y_3$]
88
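A minimal sketch of a bidirectional LSTM in PyTorch (sizes assumed); each output at time $t$ concatenates the forward and backward hidden states:

```python
import torch
import torch.nn as nn

d_in, d_h = 8, 16
birnn = nn.LSTM(d_in, d_h, batch_first=True, bidirectional=True)

x = torch.randn(1, 3, d_in)   # a sequence x_1, x_2, x_3
out, _ = birnn(x)             # out[:, t] = [forward h_t ; backward h_t]
print(out.shape)              # torch.Size([1, 3, 32]) = 2 * d_h
```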
Summary
• Recurrent Neural Network
• Sequential Memory
• Gradient Problem for RNN
• RNN Variants
• Gated Recurrent Unit (GRU)
• Long Short-Term Memory Network (LSTM)
• Bidirectional Recurrent Neural Network
89
Convolutional Neural
Networks (CNNs)
Chaoqun He
THUNLP
90
CNN for Sentence Representation
• Convolutional Neural Networks (CNNs)
• Generally used in Computer Vision (CV)
• Achieve promising results in a variety of NLP tasks:
• Sentiment classification
• Relation classification
• …
• CNNs are good at extracting local and position-
invariant patterns
• In CV, colors, edges, textures, etc.
• In NLP, phrases and other local grammar structures
91
CNN for Sentence Representation
• CNNs extract patterns by:
• Computing representations for all possible n-gram
phrases in a sentence.
• Without relying on external linguistic tools (e.g.,
dependency parser)
Example sentence: The plane is taking off
  Possible n-gram phrases:
  Bigrams: The plane, plane is, is taking, taking off
  Trigrams: The plane is, plane is taking, is taking off
  n-grams: …
92
Architecture
• Input Layer
• Convolutional Layer
• Max-pooling Layer
• Non-linear Layer

[Figure: the input sentence "The students opened their books and …" is turned into an input matrix $\mathbf{x}$; a convolution with filter $\mathbf{w}$ produces the feature map $\mathbf{f}$; max-pooling gives $\mathbf{q}$; and a tanh non-linearity gives the final representation $\mathbf{c}$]
93
Input Layer
• Transform words into input representations 𝐱 via
word embeddings
• $\mathbf{x} \in \mathbb{R}^{m \times d}$: input representation
  • $m$ is the length of the sentence
  • $d$ is the dimension of the word embeddings

[Figure: the words "The students opened their books and" stacked as rows of the embedding matrix $\mathbf{x}$]
94
Convolution Layer
• ⋅ is dot product
95
Convolution Layer
• Extract feature representation from input
representation via a sliding convolving filter
$\mathbf{f}_i = \mathbf{w} \cdot \mathbf{x}_{i:i+h-1} + b,\quad i = 1, 2, \dots, m - h + 1$

[Figure: a filter of height $h$ slides over the rows "The students opened their books and", producing one feature value $\mathbf{f}_i$ per window]
96
Convolution Layer
• Extract feature representation from input
representation via a sliding convolving filter
$\mathbf{f}_i = \mathbf{w} \cdot \mathbf{x}_{i:i+h-1} + b,\quad i = 1, 2, \dots, m - h + 1$

[Figure: the same filter shifted down by one position, now covering "students opened their …", producing the next feature value]
97
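A minimal sketch of this convolution plus max-pooling over word embeddings in PyTorch; the sentence length, embedding size, filter width, and number of filters are assumptions:

```python
import torch
import torch.nn as nn

m, d = 6, 50                 # sentence length, embedding dim (assumed)
h, n_filters = 3, 100        # filter width (trigram) and number of filters (assumed)

x = torch.randn(1, m, d)     # input matrix x (a batch of one sentence)
conv = nn.Conv1d(in_channels=d, out_channels=n_filters, kernel_size=h)

f = conv(x.transpose(1, 2))  # f_i = w . x_{i:i+h-1} + b  -> shape (1, 100, m-h+1)
q = f.max(dim=2).values      # max-pooling over positions -> shape (1, 100)
c = torch.tanh(q)            # non-linear layer -> sentence representation
```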
Application Scenarios
• Object Detection
• You Only Look Once: Unified, Real-Time Object
Detection
• Video Classification
• Large-scale Video Classification with Convolutional
Neural Networks
• Speech Recognition
• Convolutional, Long Short-Term Memory, fully
connected Deep Neural Networks
• Text Classification
• Convolutional Neural Networks for Sentence
Classification
98
Compare CNN with RNN
• CNN vs. RNN
• Advantages: CNNs extract local and position-invariant features; RNNs model long-range context dependencies
• Parameters: CNNs have fewer parameters; RNNs have more parameters
• Parallelization: CNNs parallelize well within sentences; RNNs cannot be parallelized within sentences
99
Summary
100
NLP Pipeline Tutorial
(PyTorch)
Jing Yi
THUNLP
101
Pipeline for Deep Learning
• prepare data
• build model
• train model
• evaluate model
• test model
102
Word Language Model
• target: predict the next word
• input: never too old to learn
• output: too old to learn English
• model: LSTM
• loss: cross_entropy
103
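A minimal training-loop skeleton for such a word-level LSTM language model in PyTorch; the vocabulary size, dimensions, and toy batch are placeholders (assumptions), not the tutorial's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):              # tokens: (batch, seq_len)
        h, _ = self.lstm(self.emb(tokens))
        return self.out(h)                  # next-word logits at every position

vocab_size = 10000                          # placeholder
model = LSTMLanguageModel(vocab_size)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One toy step: input "never too old to", target "too old to learn" (shifted by one)
inp = torch.randint(0, vocab_size, (1, 4))  # placeholder token ids
tgt = torch.randint(0, vocab_size, (1, 4))

logits = model(inp)                         # (1, 4, vocab_size)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), tgt.reshape(-1))
opt.zero_grad()
loss.backward()
opt.step()
```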
Exercise
task: sentiment analysis
• dataset: glue-sst2
• model: RNN, or any other model you are interested in
104
Thanks
THUNLP
105