
CE6146

Introduction to Deep Learning


Recurrent Neural Networks and Graph Neural Networks
Chia-Ru Chung
Department of Computer Science and Information Engineering
National Central University
2024/11/7
Outline

• Recurrent Neural Networks (RNNs)

• Graph Neural Networks (GNNs)

Recurrent Neural Networks (RNNs)
Intended Learning Outcomes

By the end of this lecture, you will be able to:


• Explain the architecture of Recurrent Neural Networks (RNNs), focusing on how they use
hidden states and loops to process sequential data such as time-series or text.
• Identify and describe the vanishing gradient problem in RNNs, and explain its impact on the
network's ability to learn long-term dependencies.
• Differentiate between standard RNNs, Long Short-Term Memory (LSTM) networks, and
Gated Recurrent Units (GRUs), and analyze how each model addresses the limitations of basic
RNNs.
• Apply RNNs to sequence prediction tasks, such as language modeling or time-series
forecasting, and demonstrate their ability to handle sequential data effectively.
What is a Recurrent Neural Network

• A Recurrent Neural Network (RNN) is a type of neural network designed to
handle sequential data by maintaining an internal state, called a hidden state,
that is updated at each time step.
• Unlike standard neural networks, RNNs have loops that allow information to
be passed from one step to the next, making them ideal for tasks where the
order of data matters, such as natural language processing.
Example – Slot Filling

• Slot filling is a task in Natural Language Processing (NLP) where the goal is
to identify and extract specific pieces of information from a given text and
categorize them into predefined slots or categories.
• This task is part of a broader field known as Information Extraction (IE).

I will arrive in Taoyuan on November 10th.

In this scenario, the slots to be filled might include:


• Arrival City: Taoyuan
• Departure Date: November 10th
Solve Slot Filling by FNN

• Input: a word
(Each word is represented as a vector)
• Output: the probability distribution of the input word
belonging to each slot

[Figure: a feedforward network takes the word "Taoyuan" as input (x1, x2) and outputs the probability (y1, y2) of the word belonging to each slot, here "Arrival City" and "Departure Time".]
Solve Slot Filling by FNN

• "I will arrive in Taoyuan on November 10th." → here "Taoyuan" should be tagged as Arrival City.
• "I want to leave Taoyuan on November 10th." → here "Taoyuan" should be tagged as Departure City.
• Problem? A feedforward network sees only the current word, so it assigns the same slot probabilities to "Taoyuan" in both sentences.
• The neural network needs memory!

[Figure: the same FNN receives "Taoyuan" (x1, x2) from both sentences and cannot distinguish Arrival City from Departure City.]
RNN

• The outputs of the hidden layer are stored in the memory, so the network can make use of data from previous steps.
• The memory can be considered as another input.

[Figure: inputs x1 and x2 feed hidden units a1 and a2; the hidden outputs are stored in memory and fed back as extra inputs at the next step, and the outputs are y1 and y2.]
Example

• All the weights are 1, and there is no bias term.
• All activation functions are linear.
• The two memory cells are given initial values of 0. (A standard FNN has no such memory.)
• Input sequence: (1, 0), (1, 1), (1, 2), ……
• Output sequence: (2, 2), (8, 8), (22, 22), ……
• Changing the sequence order will change the output: feeding the same inputs in a different order produces different results.

[Figure: at each step the hidden units sum the current inputs and the stored memory values, giving hidden values (1, 1), then (4, 4), then (11, 11); these are written back to memory and summed to produce the outputs.]
RNN

• The same network is used again and again.

[Figure: the RNN unrolled over "I will arrive in Taoyuan on November 10th.": inputs x1 = "arrive", x2 = "Taoyuan", x3 = "on"; each step outputs the probability of that word in each slot (y1, y2, y3), while the hidden values (a1, a2, a3) are stored and passed to the next step.]
RNN

• The values stored in the memory are different.

[Figure: two input sequences, "arrive Taoyuan ……" and "leave Taoyuan ……"; because "arrive" and "leave" leave different values in the memory, the probabilities predicted for "Taoyuan" differ even though the input word is the same.]


Types of RNN

• Many-to-many (aligned): part-of-speech tagging, named entity recognition
• Many-to-many (encoder-decoder): machine translation, speech recognition
• Many-to-one: sentiment analysis, anomaly detection
• One-to-many: image captioning, music generation

[Figure: four types of RNN structure, where the yellow box represents the input layer, blue the hidden layer, and green the output and prediction layers.]
Source: https://link.springer.com/chapter/10.1007/978-3-030-82184-5_7
Architecture of a Traditional RNN

For each timestep t, the activation a<t> and the output y<t> are expressed as follows:
$a^{<t>} = g_1(W_{aa} a^{<t-1>} + W_{ax} x^{<t>} + b_a)$ and $y^{<t>} = g_2(W_{ya} a^{<t>} + b_y)$
where $W_{ax}$, $W_{aa}$, $W_{ya}$, $b_a$, $b_y$ are coefficients that are shared temporally, and $g_1$ and $g_2$ are activation functions.
Source: https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks
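As a concrete illustration, the following is a minimal NumPy sketch of these two update equations, assuming tanh for g1, softmax for g2, and arbitrary toy dimensions:

```python
import numpy as np

def rnn_step(x_t, a_prev, W_aa, W_ax, W_ya, b_a, b_y):
    """One RNN time step: update the hidden state, then compute the output."""
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)      # g1 = tanh (assumed)
    scores = W_ya @ a_t + b_y
    y_t = np.exp(scores) / np.exp(scores).sum()          # g2 = softmax (assumed)
    return a_t, y_t

# Toy dimensions: 4-dimensional input, 3-dimensional hidden state, 2 output slots.
rng = np.random.default_rng(0)
W_aa, W_ax = rng.normal(size=(3, 3)), rng.normal(size=(3, 4))
W_ya, b_a, b_y = rng.normal(size=(2, 3)), np.zeros(3), np.zeros(2)

a = np.zeros(3)                        # initial hidden state
for x in rng.normal(size=(5, 4)):      # a sequence of 5 input vectors
    a, y = rnn_step(x, a, W_aa, W_ax, W_ya, b_a, b_y)
```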
| | Feedforward Neural Networks | Convolutional Neural Networks | Recurrent Neural Networks |
|---|---|---|---|
| Primary Usage | General-purpose | Image and video processing | Sequential data processing |
| Data Flow | One-way (from input to output) | One-way (from input to output) | Loops (feedback connections), because previous memory is taken into account |
| Memory | None | None | Internal state (sequence memory) |
| Layer Connectivity | Fully connected | Locally connected with shared weights | Connections through time |
| Training Algorithm | Backpropagation | Backpropagation | Backpropagation Through Time (BPTT) |
| Key Feature | Simplicity | Convolutional layers and pooling | Recurrent loops |
| Examples | Basic classification tasks | Image recognition, object detection | Text generation, sentiment analysis |
The Vanishing Gradient Problem

• A problem that occurs during backpropagation where gradients become very
small, especially in deep RNNs or for long sequences.
• Impact:
‐ Gradients diminish as they propagate back through time, making it difficult for RNNs
to learn long-term dependencies, i.e., to learn from data spread over long time spans.
‐ This leads to the network forgetting early information in long sequences.
• Why Does It Happen?
Due to the repeated multiplication of small gradients through many layers (or time steps),
the gradients can shrink exponentially.
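A tiny numeric illustration of this repeated-multiplication effect (the recurrent weight 0.9 and the sigmoid-derivative bound 0.25 are assumed values for the example):

```python
# Backpropagating through T steps multiplies the gradient by roughly
# (weight * activation_derivative) at every step; with a sigmoid, the derivative is at most 0.25.
w, deriv, T = 0.9, 0.25, 50
grad = 1.0
for _ in range(T):
    grad *= w * deriv      # one step of backpropagation through time
print(grad)                # about (0.225)**50, vanishingly small
```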
Overcoming the Vanishing Gradient (1/2)

• Long Short-Term Memory (LSTM):
Designed to handle long-term dependencies by using a memory cell to store
information across time steps.
‐ Memory Cell: Maintains information over long periods.
‐ Gates:
‣ Forget Gate: Controls which information to forget.
‣ Input Gate: Decides which new information to store.
‣ Output Gate: Controls the output based on the memory cell.
Overcoming the Vanishing Gradient (2/2)

• Gated Recurrent Unit (GRU):
Similar to LSTM, but with a simpler architecture. GRUs use gates to control
what information is passed forward.
‐ Hidden State: Maintains information over long periods (there is no separate memory cell).
‐ Gates (note: there is no output gate):
‣ Reset Gate: Controls how much of the previous hidden state to forget.
‣ Update Gate: Decides how much of the current input should update the hidden
state.
LSTM (1/7)

• An LSTM cell is a special neuron with 4 inputs and 1 output.
• It consists of a memory cell plus three gates: an input gate, a forget gate, and an output gate.
• The signals that control the input, forget, and output gates come from other parts of the network.

[Figure: LSTM block diagram; the memory cell sits in the middle, and each gate is driven by its own control signal from the rest of the network.]
LSTM (2/7)

• The gate activation function f is usually the sigmoid function.
• Its output lies between 0 and 1, mimicking an open or closed gate.
• The updated cell value is $c' = g(z)\,f(z_i) + c\,f(z_f)$, where $z$ is the cell input, $z_i$ and $z_f$ are the input-gate and forget-gate signals, and $c$ is the currently stored cell value.
LSTM (3/7)

Source: https://link.springer.com/chapter/10.1007/978-3-030-82184-5_7
LSTM (4/7)

The forget gate controls how much of the previous cell state should be kept.

Source: https://link.springer.com/chapter/10.1007/978-3-030-82184-5_7
LSTM (5/7)

The input gate adds the new information to the cell state.

Source: https://link.springer.com/chapter/10.1007/978-3-030-82184-5_7
LSTM (6/7)

The output gate decides which part of the cell state to output.

Source: https://link.springer.com/chapter/10.1007/978-3-030-82184-5_7
LSTM (7/7)

• At each time step t:
‐ Input Gate: $i_t = \sigma(W_{hi} h_{t-1} + b_{hi} + W_{ii} x_t + b_{ii})$
‐ Forget Gate: $f_t = \sigma(W_{hf} h_{t-1} + b_{hf} + W_{if} x_t + b_{if})$
‐ Cell Candidate: $\tilde{c}_t = \tanh(W_{hg} h_{t-1} + b_{hg} + W_{ig} x_t + b_{ig})$
‐ New Cell State: $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$
‐ Output Gate: $o_t = \sigma(W_{ho} h_{t-1} + b_{ho} + W_{io} x_t + b_{io})$
‐ New Hidden State: $h_t = o_t \circ \tanh(c_t)$

Notation: $\sigma$ is the sigmoid activation function; $W_{mn}$ and $b_{mn}$ are the weight matrices and bias vectors. The first subscript denotes whether the weights are related to the input (i) or the previous hidden state (h). The second subscript denotes which part of the LSTM the weights are associated with: the input gate (i), the forget gate (f), the cell candidate (g), or the output gate (o).
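A minimal NumPy sketch of these equations; the dictionary keys mirror the subscript convention above, and the dimensions are arbitrary toy values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following the equations above."""
    i_t = sigmoid(W['hi'] @ h_prev + b['hi'] + W['ii'] @ x_t + b['ii'])   # input gate
    f_t = sigmoid(W['hf'] @ h_prev + b['hf'] + W['if'] @ x_t + b['if'])   # forget gate
    g_t = np.tanh(W['hg'] @ h_prev + b['hg'] + W['ig'] @ x_t + b['ig'])   # cell candidate
    c_t = f_t * c_prev + i_t * g_t                                        # new cell state
    o_t = sigmoid(W['ho'] @ h_prev + b['ho'] + W['io'] @ x_t + b['io'])   # output gate
    h_t = o_t * np.tanh(c_t)                                              # new hidden state
    return h_t, c_t

# Toy setup: 4-dimensional input, 3-dimensional hidden and cell state.
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(3, 3)) for k in ('hi', 'hf', 'hg', 'ho')}
W.update({k: rng.normal(size=(3, 4)) for k in ('ii', 'if', 'ig', 'io')})
b = {k: np.zeros(3) for k in ('hi', 'hf', 'hg', 'ho', 'ii', 'if', 'ig', 'io')}

h, c = np.zeros(3), np.zeros(3)
for x in rng.normal(size=(5, 4)):      # a sequence of 5 input vectors
    h, c = lstm_step(x, h, c, W, b)
```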
GRU

• At each time step t:
‐ Update Gate: $z_t = \sigma(W_{hz} h_{t-1} + b_{hz} + W_{iz} x_t + b_{iz})$
‐ Reset Gate: $r_t = \sigma(W_{hr} h_{t-1} + b_{hr} + W_{ir} x_t + b_{ir})$
‐ Candidate Activation: $\tilde{h}_t = \tanh(r_t \circ (W_{hh} h_{t-1} + b_{hh}) + W_{ih} x_t + b_{ih})$
‐ New Hidden State: $h_t = (1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t$
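The same style of sketch for the GRU equations above, again with assumed toy dimensions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, b):
    """One GRU time step following the equations above."""
    z_t = sigmoid(W['hz'] @ h_prev + b['hz'] + W['iz'] @ x_t + b['iz'])   # update gate
    r_t = sigmoid(W['hr'] @ h_prev + b['hr'] + W['ir'] @ x_t + b['ir'])   # reset gate
    h_cand = np.tanh(r_t * (W['hh'] @ h_prev + b['hh']) + W['ih'] @ x_t + b['ih'])
    return (1.0 - z_t) * h_prev + z_t * h_cand                            # new hidden state

# Toy setup: 4-dimensional input, 3-dimensional hidden state.
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(3, 3)) for k in ('hz', 'hr', 'hh')}
W.update({k: rng.normal(size=(3, 4)) for k in ('iz', 'ir', 'ih')})
b = {k: np.zeros(3) for k in ('hz', 'hr', 'hh', 'iz', 'ir', 'ih')}

h = np.zeros(3)
for x in rng.normal(size=(5, 4)):
    h = gru_step(x, h, W, b)
```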
LSTM vs. GRU (1/2)

| LSTM | GRU |
|---|---|
| Input Gate: Controls how much of the newly computed information from the current input will be stored in the cell state. Forget Gate: Determines the extent to which information from the previous cell state is retained or forgotten. | Update Gate: Serves a dual purpose: it decides how much of the previous hidden state to retain and how much of the new candidate activation to use for updating the hidden state. It plays a role similar to the combined function of the input and forget gates in LSTM. |
| Output Gate: Modulates the amount of information from the current cell state that will be used to compute the hidden state, which can then be output or passed to the next time step. | Reset Gate: Determines how to combine the new input with the previous hidden state to compute the candidate activation. There isn't a direct equivalent in LSTM, though its function is somewhat akin to how the input gate modulates the influence of the new input in LSTM. |
LSTM vs. GRU (2/2)

• LSTM has three gates (input, forget, and output) that control the flow of information into, within,
and out of the cell, enabling finer-grained control compared to GRU.
• GRU simplifies the gating mechanism with only two gates (update and reset), making it
computationally more efficient but potentially less expressive than LSTM (when compute is limited, GRU is often the first choice).
• The Update Gate in GRU performs a function similar to a combination of the Input and Forget
Gates in LSTM.
• There isn't a direct counterpart in GRU for the Output Gate in LSTM. Instead, the function of
controlling the exposure of the internal state is embedded within the formula for computing the
new hidden state in GRU.
• The Reset Gate in GRU doesn't have a direct counterpart in LSTM, but it plays a somewhat similar
role to the Input Gate in terms of modulating the influence of new input.
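One way to see the efficiency difference is to count parameters. Using the per-gate formulation of the earlier equations (a hidden-to-hidden matrix, an input-to-hidden matrix, and two bias vectors per gate or candidate), a small sketch with assumed dimensions d = 128 and h = 256:

```python
def gated_rnn_params(input_dim, hidden_dim, num_blocks):
    """Parameters per gate/candidate block: W_h (h x h) + b_h (h) + W_i (h x d) + b_i (h)."""
    per_block = hidden_dim * hidden_dim + hidden_dim + hidden_dim * input_dim + hidden_dim
    return num_blocks * per_block

d, h = 128, 256
lstm_params = gated_rnn_params(d, h, num_blocks=4)   # input, forget, cell candidate, output
gru_params = gated_rnn_params(d, h, num_blocks=3)    # update, reset, candidate
print(lstm_params, gru_params, gru_params / lstm_params)   # GRU uses about 75% of LSTM's parameters
```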
Graph Neural Networks (GNNs)
Intended Learning Outcomes

By the end of this lecture, you will be able to:


• Explain the structure and purpose of Graph Neural Networks (GNNs).
• Compare GNNs with CNNs and RNNs in terms of data types and applications.
• Identify key GNN components (nodes, edges, adjacency matrices) and describe data
flow.
• Understand core GNN operations: node aggregation, message passing, and readout.
• Apply basic GNN models to graph-based tasks such as node classification.

What Are Graph Neural Networks

• Graph Neural Networks (GNNs) are neural networks that operate on graph-
structured data, where entities (nodes) are connected by relationships (edges).
(Here a "graph" means a network of nodes and edges, not a picture; any structured graph qualifies.)
• Why Use GNNs?
‐ Limitations of Traditional Networks: Convolutional Neural Networks (CNNs) are best for
grid-like data (e.g., images), and Recurrent Neural Networks (RNNs) excel with sequential
data (e.g., text or time series). However, neither is suitable for graph structures, where
relationships and dependencies between nodes are essential.
‐ GNNs’ Advantage: GNNs can capture complex patterns in non-Euclidean (non-grid-like) data,
making them ideal for tasks where data has inherent relationships and connections.
Examples of Graph-Structured Data

• Social Networks: Nodes represent users, and edges represent friendships.


• Molecular Structures: Nodes represent atoms, and edges represent bonds.
• Recommendation Systems: Nodes represent users or items, and edges
represent interactions (e.g., user purchases item).

Source: https://medium.com/@bscarleth.gtz/introduction-to-graph-neural-networks-an-illustrated-guide-c3f19da2ba39
Comparing CNNs, RNNs, and GNNs

• CNNs are powerful for grid-like data (structured in a fixed format) and
capture spatial patterns.
• RNNs work well with sequential data, capturing temporal patterns and
dependencies in data over time.
• GNNs handle graph-structured data, focusing on learning from entities and
the relationships between them.

Structure of a Graph

• Nodes (Vertices): Represent individual entities (e.g., users in a social network
or atoms in a molecule).
• Edges: Represent relationships or interactions between nodes (e.g.,
friendships between users or bonds between atoms).
• Adjacency Matrix: A matrix where each cell (i, j) indicates if an edge exists
between nodes i and j.

[Figure: a small example graph with nodes A, B, and C shown next to its 3×3 adjacency matrix.]
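A small sketch of building such an adjacency matrix, assuming a hypothetical undirected 3-node graph in which A connects to B and to C:

```python
import numpy as np

nodes = ['A', 'B', 'C']
edges = [('A', 'B'), ('A', 'C')]          # assumed edges for illustration

# Build the adjacency matrix: adj[i, j] = 1 if an edge exists between nodes i and j.
idx = {name: i for i, name in enumerate(nodes)}
adj = np.zeros((len(nodes), len(nodes)), dtype=int)
for u, v in edges:
    adj[idx[u], idx[v]] = 1
    adj[idx[v], idx[u]] = 1               # undirected graph: the matrix is symmetric

print(adj)
# [[0 1 1]
#  [1 0 0]
#  [1 0 0]]
```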
Key Components in GNNs

• Node Feature Matrix:


‐ Each node has a set of features.
‐ Example: In a social network, each user (node) may have features such as age, number
of friends, and activity level.
‐ Shape: (Number of nodes, Number of features per node)
• Adjacency Matrix:
‐ Shows the connections between nodes.
‐ Example: In a molecular graph, this matrix indicates bonds between atoms.
‐ Shape: (Number of nodes, Number of nodes)
Data Flow in GNNs

• In GNNs, the data flows through a series of layers where each layer performs
two main steps:
1) Message Passing (Aggregation):
Nodes gather information from their neighbors.
2) Transformation:
The gathered information is combined and transformed with learnable
weights to update each node’s feature representation.
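A minimal NumPy sketch of one such layer, using mean aggregation with added self-loops (the self-loops and the toy graph are assumptions made for illustration):

```python
import numpy as np

def gnn_layer(X, A, W, b):
    """One GNN layer: mean-aggregate neighbor features, then transform with weights and ReLU."""
    A_hat = A + np.eye(A.shape[0])          # self-loops so each node also keeps its own features
    deg = A_hat.sum(axis=1, keepdims=True)
    H_agg = (A_hat @ X) / deg               # Step 1: message passing / aggregation (mean)
    return np.maximum(H_agg @ W + b, 0.0)   # Step 2: transformation with learnable W and ReLU

# Toy graph: 4 nodes, 3 features per node (X has shape (N, F), A has shape (N, N)).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
W, b = rng.normal(size=(3, 8)), np.zeros(8)

H1 = gnn_layer(X, A, W, b)                  # updated node representations, shape (4, 8)
```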
Step 1: Initial Input Preparation

• In this step, we prepare the Node Feature Matrix and the Adjacency Matrix:
‐ Node Feature Matrix (X): Each row corresponds to a node, and each column represents
a feature (e.g., age, type, or atomic mass, depending on the application). This matrix
has a shape of (N,F), where N is the number of nodes and F is the number of features
per node.
‐ Adjacency Matrix (A): This matrix has a shape (N,N) and indicates the connections
between nodes, where Aij = 1 if there is an edge between nodes i and j, and Aij = 0
otherwise.
Step 2: Layer 1 – Message Passing

• In this step, each node gathers information from its neighboring nodes. This
process is often referred to as message passing or aggregation.
‐ Message Passing: For each node, the GNN collects the feature vectors of all its
neighbors as defined by the adjacency matrix.
‐ Aggregation: Common aggregation functions include sum, mean, or max pooling.
These functions combine the neighboring features to form a single aggregated feature
for each node.

Step 3: Layer 1 – Transformation

• Objective: Transform the aggregated information to update each node’s
feature representation.
• Process:
The aggregated feature vector is multiplied by a learnable weight matrix and passed
through an activation function (e.g., ReLU).
Example:
Node 1’s updated feature vector is calculated by applying the transformation to its
aggregated neighborhood information.
Step 4: Iterative Layers

• If there are multiple GNN layers, the process of message passing, aggregation,
and transformation repeats in each layer. This allows each node to gather
information from nodes that are further away in the graph, not just immediate
neighbors.
‐ Expanding the Neighborhood Influence: Each additional layer enables nodes to
incorporate information from increasingly distant nodes, giving a broader context.
‐ Refined Node Representations: With each layer, nodes gain richer, multi-hop
information from the graph, which makes the node embeddings more representative of
their position and connections in the overall structure.
Step 5: Final Output

• Depending on the task type, the GNN will produce either node-level, link-level, or
graph-level outputs. The readout operation is specifically used for graph-level tasks
and is essential when a single, unified representation of the entire graph is needed.
‐ Node-Level or Link-Level Output (for node or link prediction):
For tasks focused on nodes or edges, each node or edge is treated independently, and the GNN outputs
predictions directly for each node or edge.
‐ Graph-Level Output – Applying Readout:
In graph-level tasks, a single vector representation for the entire graph is required.
The readout operation aggregates the final feature vectors (embeddings) of all nodes in the graph into a
single graph-level vector. This vector represents the overall structure and features of the graph and serves as
input for a final prediction layer.
Common Readout Methods

• Sum Aggregation:
Adds up the embeddings of all nodes.
• Mean Aggregation:
Takes the average of all node embeddings.
• Max Pooling:
Selects the maximum value for each feature dimension across all node
embeddings.
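A short sketch of the three readout methods applied to hypothetical final node embeddings:

```python
import numpy as np

# Hypothetical final node embeddings after the last GNN layer: 5 nodes, 4 features each.
node_embeddings = np.random.default_rng(0).normal(size=(5, 4))

graph_sum = node_embeddings.sum(axis=0)    # sum aggregation
graph_mean = node_embeddings.mean(axis=0)  # mean aggregation
graph_max = node_embeddings.max(axis=0)    # max pooling per feature dimension

# Each result is a single 4-dimensional graph-level vector.
```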
Summary of Steps

| Step | Description |
|---|---|
| Step 1: Initial Input Preparation | Prepare the Node Feature Matrix and the Adjacency Matrix. |
| Step 2: Layer 1 – Message Passing | Each node aggregates feature information from its neighbors. |
| Step 3: Layer 1 – Transformation | Aggregated features are transformed with weights and activation functions. |
| Step 4: Iterative Layers | Additional layers allow nodes to gather information from distant neighbors. |
| Step 5: Final Output | The readout operation aggregates all node embeddings into a single graph-level representation for graph-level predictions. |
Input and Output Data in GNNs

• Input Data in GNNs: A single data point in GNNs is a graph that consists of:
‐ Nodes (Vertices): Each node has associated features (e.g., in a social network, a user profile may have
features like age, location, and interests).
‐ Edges: Edges represent relationships or interactions between nodes (e.g., friendships in a social network
or bonds in a molecular structure).
‐ Adjacency Matrix or Edge List: Specifies which nodes are connected, representing the graph’s structure.

• Output Data in GNNs:


‐ Node-Level Output: A prediction for each node (e.g., classifying users within a network).
‐ Edge-Level Output: Predictions for pairs of nodes (e.g., link prediction tasks).
‐ Graph-Level Output: A single prediction for the entire graph (e.g., graph classification tasks).
Example of a Single Data Point in GNNs

• In a social network GNN model:


‐ Input Data:
A graph where nodes represent users with profile
features, and edges represent friendships.
‐ Output Data:
Node classification labels for each user,
identifying them as “influencer” or “regular user.”

Source: https://medium.com/@bscarleth.gtz/introduction-to-graph-neural-networks-an-illustrated-guide-c3f19da2ba39
Tasks Ideal for GNNs

• Node Prediction: Predict the category of a node based on its features and the features of its
neighbors.
Example: In a social network, classifying users as “influencers” or “regular users” based
on connections.
• Link Prediction: Predict the likelihood or existence of an edge between two nodes.
Example: Predicting future friendships in a social network.
• Graph Prediction: Predict a label for the entire graph.
Example: In chemistry, classifying molecules as toxic or non-toxic.

Node Prediction

• In node prediction tasks, each node in the graph is treated as an individual data point. The
GNN learns to predict each node by considering its features and the features of its
neighbors. This task is often used in scenarios where we want to label each entity in a large,
interconnected network.
‣ GNNs aggregate information from each node’s neighbors, allowing the model to
learn that influencers are often connected to other influencers.
‣ This helps the GNN make more accurate predictions than if it relied on each
node’s features in isolation.
• Data Setup:
‐ Node Feature Matrix: Contains features for each user, such as age, number of friends, and engagement
score.
‐ Adjacency Matrix: Represents connections (friendships) between users.
‐ Labels: Each user (node) has a label, e.g., 1 for influencers and 0 for regular users.
Link Prediction

• In link prediction tasks, each pair of nodes (potential edge) is treated as a data point. The
GNN predicts whether a connection (edge) exists or will form between nodes based on
their features and the structure of the graph. This is useful in scenarios where we want to
understand or predict relationships in a network.
‣ GNNs leverage both node attributes and the graph’s structure to predict links.
‣ By aggregating information from neighboring nodes, GNNs can identify patterns of users who
are likely to connect, such as users with many mutual friends.
• Data Setup:
‐ Node Feature Matrix: Contains features for each user.
‐ Adjacency Matrix: Shows the existing friendships (edges).
‐ Training Data: Pairs of nodes (user pairs) are treated as data points. Positive samples are existing
friendships, and negative samples are randomly selected pairs with no friendship (a sketch of this sampling follows below).
‐ Labels: Each pair has a label: 1 if an edge exists (friendship) and 0 if it doesn’t (no friendship).
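A minimal sketch of this training-data setup, using a hypothetical 4-user friendship matrix and random negative sampling:

```python
import numpy as np

# Hypothetical adjacency matrix of existing friendships (assumed small example).
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 0],
                [1, 0, 0, 1],
                [0, 0, 1, 0]])
n = adj.shape[0]

# Positive samples: pairs (i, j) that already share an edge.
pos = [(i, j) for i in range(n) for j in range(i + 1, n) if adj[i, j] == 1]

# Negative samples: randomly selected pairs with no edge, one per positive pair.
rng = np.random.default_rng(0)
non_edges = [(i, j) for i in range(n) for j in range(i + 1, n) if adj[i, j] == 0]
neg = [non_edges[k] for k in rng.choice(len(non_edges), size=len(pos), replace=False)]

pairs = pos + neg
labels = [1] * len(pos) + [0] * len(neg)   # 1 = friendship exists, 0 = no friendship
```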
Graph Prediction

• In graph-level tasks, each graph is treated as an individual data point. This task is common
in applications where each graph represents a separate entity, such as a molecule or a
social network substructure, and we need to classify or predict properties for the entire
entity.
‣ GNNs can learn from both atomic features and molecular structure.
‣ By aggregating information across all atoms and bonds, GNNs capture complex interactions
within molecules, making them highly effective for predicting chemical properties.
• Data Setup:
‐ Node Feature Matrix: Contains features for each atom within each molecule.
‐ Adjacency Matrix: Represents bonds between atoms in each molecule.
‐ Training Data: Each molecule (graph) is treated as a data point, with its own node features and
adjacency matrix.
‐ Labels: Each molecule has a single label, e.g., 1 for toxic and 0 for non-toxic.
Tasks Ideal for GNNs – Summary Table

| Task Type | Example Scenario | Input Structure | Output |
|---|---|---|---|
| Node Prediction | Classifying users in a social network | Single graph with node and adjacency matrices | Label for each node (e.g., influencer vs. regular user) |
| Link Prediction | Predicting potential friendships | Single graph with node and adjacency matrices | Probability of a link forming between pairs of nodes |
| Graph Prediction | Classifying molecules by toxicity | Multiple graphs (each as a data point) | Label for each graph (e.g., toxic vs. non-toxic) |
Training Process in GNNs

• Forward Pass:
The GNN processes input data through layers, performing aggregation and transformation for
each node.
• Loss Calculation:
The model’s predictions are compared to the true labels, and a loss function (e.g., cross-entropy
for classification) computes the error.
• Backpropagation:
Gradients are calculated and backpropagated to adjust weights.
• Optimization:
The optimizer (e.g., Adam) updates weights based on gradients to minimize the loss.
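A compact PyTorch-style sketch of this training loop for a toy node-classification task; the two-layer model, the random data, and the hyperparameters are all assumptions made for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeanAggLayer(nn.Module):
    """Aggregation (mean over neighbors) followed by a learnable transformation."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        agg = adj @ x / deg                      # gather and average neighbor features
        return F.relu(self.linear(agg))          # transform with learnable weights

class ToyGNN(nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.layer1 = MeanAggLayer(in_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, x, adj):
        return self.out(self.layer1(x, adj))     # logits for each node

# Random toy data: 6 nodes, 4 features, 2 classes (e.g., influencer vs. regular user).
torch.manual_seed(0)
x = torch.randn(6, 4)
adj = (torch.rand(6, 6) > 0.5).float()
adj = ((adj + adj.t()) > 0).float()              # make the graph undirected
labels = torch.randint(0, 2, (6,))

model = ToyGNN(in_dim=4, hidden_dim=8, num_classes=2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(100):
    optimizer.zero_grad()
    logits = model(x, adj)                       # forward pass
    loss = F.cross_entropy(logits, labels)       # loss calculation
    loss.backward()                              # backpropagation
    optimizer.step()                             # optimization (Adam update)
```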
Real-World Applications of GNNs

• Social Networks:
Community detection, friendship prediction, recommendation systems.
• Bioinformatics:
Protein interaction networks, drug discovery (e.g., molecule classification).
• Recommendation Systems:
Collaborative filtering, content recommendations based on user-item interactions.
• Cybersecurity:
Anomaly detection in network traffic, identifying suspicious connections.
Q&A
