CE6146 Lecture 4
Recurrent Neural Networks (RNNs)
Intended Learning Outcomes
Example – Slot Filling
• Slot filling is a task in Natural Language Processing (NLP) where the goal is
to identify and extract specific pieces of information from a given text and
categorize them into predefined slots or categories.
• This task is part of a broader field known as Information Extraction (IE).
• Input: a word (each word is represented as a vector)
• Output: the probability distribution over the slots that the input word belongs to
(Figure: the word "Taoyuan" is fed into the network as an input vector x1, x2.)
Solve Slot Filling by FNN
(Figure: a feedforward network for slot filling. Each word of "I will arrive in Taoyuan on November 10th." is fed in as an input vector x1, x2, passed through a hidden layer a1, a2, and mapped to outputs y1, y2, the probabilities that the word belongs to slots such as Arrival City, Departure City, and Time. Note: the previously collected data can therefore be reused.)
RNN
• The same network is used again and again.
(Figure: the recurrent network applied repeatedly to the inputs x1, x2, x3 of the sequence.)
RNN
• RNNs can be arranged in different input/output configurations, e.g., for image captioning and music generation.
(Figure: four types of RNN structure, where the yellow box represents the input layer, blue the hidden layer, and green the output and prediction layers.)
Source: https://link.springer.com/chapter/10.1007/978-3-030-82184-5_7
Architecture of a Traditional RNN
For each time step t, the activation $a^{<t>}$ and the output $y^{<t>}$ are expressed as follows:
$a^{<t>} = g_1(W_{aa} a^{<t-1>} + W_{ax} x^{<t>} + b_a)$ and $y^{<t>} = g_2(W_{ya} a^{<t>} + b_y)$
where $W_{ax}$, $W_{aa}$, $W_{ya}$, $b_a$, and $b_y$ are coefficients shared across all time steps, and $g_1$ and $g_2$ are activation functions.
Source: https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks 14
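To make these equations concrete, here is a minimal NumPy sketch of one forward step. The choice of tanh for g1, softmax for g2, and all dimensions and variable names are illustrative assumptions, not taken from the cheatsheet.

```python
import numpy as np

def rnn_step(x_t, a_prev, W_aa, W_ax, W_ya, b_a, b_y):
    """One RNN time step: a<t> = g1(Waa a<t-1> + Wax x<t> + ba), y<t> = g2(Wya a<t> + by)."""
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)        # g1 = tanh (a common choice)
    z = W_ya @ a_t + b_y
    y_t = np.exp(z - z.max()) / np.exp(z - z.max()).sum()  # g2 = softmax over slot classes
    return a_t, y_t

# Toy dimensions: 4-dim word vectors, 8-dim hidden state, 3 output slots
rng = np.random.default_rng(0)
W_aa, W_ax = rng.normal(size=(8, 8)), rng.normal(size=(8, 4))
W_ya, b_a, b_y = rng.normal(size=(3, 8)), np.zeros(8), np.zeros(3)

a = np.zeros(8)                      # initial hidden state
for x in rng.normal(size=(5, 4)):    # a sequence of 5 word vectors
    a, y = rnn_step(x, a, W_aa, W_ax, W_ya, b_a, b_y)
print(y)                             # probability distribution over the 3 slots
```

Note how the same weights (W_aa, W_ax, W_ya) are reused at every time step, which is exactly the "shared temporally" property stated above.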
Comparing FNNs, CNNs, and RNNs
• Feedforward Neural Networks – Primary Usage: general-purpose; Data Flow: one-way (from input to output); Memory: none.
• Convolutional Neural Networks – Primary Usage: image and video processing; Data Flow: one-way (from input to output); Memory: none.
• Recurrent Neural Networks – Primary Usage: sequential data processing (because they take previous memory into account); Data Flow: loops (feedback connections); Memory: internal state (sequence memory).
‐ This leads to the network forgetting early information in long sequences. (Note: early memory needs to be forgotten.)
‣ Reset Gate: Controls how much of the previous hidden state to forget.
‣ Update Gate: Decides how much of the current input should update the hidden state.
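To make these two gates concrete, here is a minimal NumPy sketch of one GRU step. The parameter names (Wz, Uz, ...) and the z/(1−z) mixing convention are common choices, not definitions from the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh):
    """One GRU step with an update gate z_t and a reset gate r_t (common formulation)."""
    z = sigmoid(Wz @ x_t + Uz @ h_prev + bz)                 # update gate: how much new content to take
    r = sigmoid(Wr @ x_t + Ur @ h_prev + br)                 # reset gate: how much old state to forget
    h_cand = np.tanh(Wh @ x_t + Uh @ (r * h_prev) + bh)      # candidate activation
    return (1.0 - z) * h_prev + z * h_cand                   # interpolate old state and candidate

# Toy dimensions: 4-dim input, 6-dim hidden state
rng = np.random.default_rng(1)
Wz, Wr, Wh = (rng.normal(size=(6, 4)) for _ in range(3))
Uz, Ur, Uh = (rng.normal(size=(6, 6)) for _ in range(3))
bz = br = bh = np.zeros(6)

h = np.zeros(6)
for x in rng.normal(size=(3, 4)):   # a short input sequence
    h = gru_step(x, h, Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh)
print(h.shape)  # (6,)
```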
LSTM (1/7)
(Figure: an LSTM memory cell is a special neuron with 4 inputs and 1 output. The input from other parts of the network enters through an Input Gate, the value stored in the Memory Cell is controlled by a Forget Gate, and an Output Gate decides what is released to the rest of the network. The control signals for the input, forget, and output gates also come from other parts of the network.)
LSTM (2/7)
• The gate activation function f is usually the sigmoid function.
• Its output lies between 0 and 1.
• This mimics an open (1) or closed (0) gate.
LSTM (3/7)
Source: https://link.springer.com/chapter/10.1007/978-3-030-82184-5_7 21
LSTM (4/7)
Source: https://link.springer.com/chapter/10.1007/978-3-030-82184-5_7 22
LSTM (5/7)
Source: https://link.springer.com/chapter/10.1007/978-3-030-82184-5_7 23
LSTM (6/7)
Source: https://link.springer.com/chapter/10.1007/978-3-030-82184-5_7 24
LSTM (7/7)
• At each time step t:
‐ Input Gate: $i_t = \sigma(W_{hi} h_{t-1} + b_{hi} + W_{ii} x_t + b_{ii})$
‐ Forget Gate: $f_t = \sigma(W_{hf} h_{t-1} + b_{hf} + W_{if} x_t + b_{if})$
‐ Cell Candidate: $\tilde{c}_t = \tanh(W_{hg} h_{t-1} + b_{hg} + W_{ig} x_t + b_{ig})$
‐ New Cell State: $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$
‐ Output Gate: $o_t = \sigma(W_{ho} h_{t-1} + b_{ho} + W_{io} x_t + b_{io})$
‐ New Hidden State: $h_t = o_t \circ \tanh(c_t)$
• Notation:
‐ $\sigma$: the sigmoid activation function.
‐ $W_{mn}$, $b_{mn}$: weight matrices and bias vectors. The first subscript denotes whether the parameters act on the input (i) or the previous hidden state (h); the second subscript denotes which part of the LSTM they are associated with: the input gate (i), the forget gate (f), the cell candidate (g), or the output gate (o).
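A minimal NumPy sketch that follows the six equations above literally; the dimensions, random initialization, and dictionary-based parameter naming are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p is a dict of the W_mn / b_mn parameters named as on the slide."""
    i = sigmoid(p["Whi"] @ h_prev + p["bhi"] + p["Wii"] @ x_t + p["bii"])   # input gate
    f = sigmoid(p["Whf"] @ h_prev + p["bhf"] + p["Wif"] @ x_t + p["bif"])   # forget gate
    g = np.tanh(p["Whg"] @ h_prev + p["bhg"] + p["Wig"] @ x_t + p["big"])   # cell candidate
    o = sigmoid(p["Who"] @ h_prev + p["bho"] + p["Wio"] @ x_t + p["bio"])   # output gate
    c = f * c_prev + i * g                                                  # new cell state
    h = o * np.tanh(c)                                                      # new hidden state
    return h, c

# Toy dimensions: 4-dim input, 5-dim hidden/cell state
rng = np.random.default_rng(2)
H, D = 5, 4
p = {}
for gate in "ifgo":
    p[f"Wh{gate}"], p[f"Wi{gate}"] = rng.normal(size=(H, H)), rng.normal(size=(H, D))
    p[f"bh{gate}"], p[f"bi{gate}"] = np.zeros(H), np.zeros(H)

h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(3, D)):   # a short input sequence
    h, c = lstm_step(x, h, c, p)
print(h.shape, c.shape)             # (5,) (5,)
```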
GRU
LSTM vs. GRU (1/2)
LSTM:
‐ Input Gate: Controls how much of the newly computed information from the current input will be stored in the cell state.
‐ Forget Gate: Determines the extent to which information from the previous cell state is retained or forgotten.
‐ Output Gate: Modulates the amount of information from the current cell state that will be used to compute the hidden state, which can then be output or passed to the next time step.
GRU:
‐ Update Gate: Serves a dual purpose: it decides how much of the previous hidden state to retain and how much of the new candidate activation to use for updating the hidden state. It plays a role similar to the combined function of the input and forget gates in LSTM.
‐ Reset Gate: Determines how to combine the new input with the previous hidden state to compute the candidate activation. There isn't a direct equivalent in LSTM, though its function is somewhat akin to how the input gate modulates the influence of the new input in LSTM. (Note: similar to the input gate; it is responsible for combining the new input with the previous hidden state.)
LSTM vs. GRU (2/2)
• LSTM has three gates (input, forget, and output) that control the flow of information into, within,
and out of the cell, enabling a finer-grained control compared to GRU.
• GRU simplifies the gating mechanism with only two gates (update and reset), making it
computationally more efficient but potentially less expressive than LSTM. (Note: when computing resources are limited, prefer GRU.)
• The Update Gate in GRU performs a function similar to a combination of the Input and Forget
Gates in LSTM.
• There isn't a direct counterpart in GRU for the Output Gate in LSTM. Instead, the function of
controlling the exposure of the internal state is embedded within the formula for computing the
new hidden state in GRU.
• The Reset Gate in GRU doesn't have a direct counterpart in LSTM, but it plays a somewhat similar
role to the Input Gate in terms of modulating the influence of new input.
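One quick way to see the efficiency difference is to count parameters. A small PyTorch sketch, assuming the standard torch.nn.LSTM and torch.nn.GRU layers with identical (arbitrarily chosen) sizes:

```python
import torch.nn as nn

# Same input and hidden sizes for both recurrent layers
lstm = nn.LSTM(input_size=64, hidden_size=128)   # 4 gate blocks (input, forget, cell, output)
gru = nn.GRU(input_size=64, hidden_size=128)     # 3 gate blocks (reset, update, candidate)

count = lambda m: sum(p.numel() for p in m.parameters())
print("LSTM parameters:", count(lstm))   # roughly 4/3 of the GRU's parameter count
print("GRU parameters: ", count(gru))
```

The LSTM carries one extra gate's worth of weights, which is where the additional expressiveness and the additional cost both come from.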
Graph Neural Networks (GNNs)
Intended Learning Outcomes
What Are Graph Neural Networks?
• Graph Neural Networks (GNNs) are neural networks that operate on graph-structured data, where entities (nodes) are connected by relationships (edges). (Note: "graph" here is not limited to visual figures; structural graphs also belong to this category.)
• Why Use GNNs?
‐ Limitations of Traditional Networks: Convolutional Neural Networks (CNNs) are best for
grid-like data (e.g., images), and Recurrent Neural Networks (RNNs) excel with sequential
data (e.g., text or time series). However, neither is suitable for graph structures, where
relationships and dependencies between nodes are essential.
‐ GNNs’ Advantage: GNNs can capture complex patterns in non-Euclidean (non-grid-like) data,
making them ideal for tasks where data has inherent relationships and connections.
Examples of Graph-Structured Data
Source: https://medium.com/@bscarleth.gtz/introduction-to-graph-neural-networks-an-illustrated-guide-c3f19da2ba39 32
Comparing CNNs, RNNs, and GNNs
• CNNs are powerful for grid-like data (structured in a fixed format) and
capture spatial patterns.
• RNNs work well with sequential data, capturing temporal patterns and
dependencies in data over time.
• GNNs handle graph-structured data, focusing on learning from entities and
the relationships between them.
Structure of a Graph
(Figure: an example graph and its adjacency matrix, where an entry of 1 marks an edge between two nodes and 0 marks no edge.)
Key Components in GNNs
• In GNNs, the data flows through a series of layers where each layer performs
two main steps:
1) Message Passing (Aggregation):
Nodes gather information from their neighbors.
2) Transformation:
The gathered information is combined and transformed with learnable
weights to update each node’s feature representation.
Step 1: Initial Input Preparation
• In this step, we prepare the Node Feature Matrix and the Adjacency Matrix:
‐ Node Feature Matrix (X): Each row corresponds to a node, and each column represents
a feature (e.g., age, type, or atomic mass, depending on the application). This matrix
has a shape of (N,F), where N is the number of nodes and F is the number of features
per node.
‐ Adjacency Matrix (A): This matrix has a shape (N,N) and indicates the connections
between nodes, where Aij = 1 if there is an edge between nodes i and j, and Aij = 0
otherwise.
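A small NumPy sketch of this preparation for a made-up 4-node graph; the particular features and edges are invented purely for illustration.

```python
import numpy as np

# Node Feature Matrix X with shape (N, F): N = 4 nodes, F = 3 features per node
X = np.array([
    [25.0, 1.0, 0.3],   # node 0: e.g., age, type, engagement score
    [31.0, 0.0, 0.8],   # node 1
    [19.0, 1.0, 0.5],   # node 2
    [42.0, 0.0, 0.1],   # node 3
])

# Adjacency Matrix A with shape (N, N): A[i, j] = 1 if nodes i and j are connected
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])

print(X.shape, A.shape)  # (4, 3) (4, 4)
```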
Step 2: Layer 1 – Message Passing
• In this step, each node gathers information from its neighboring nodes. This
process is often referred to as message passing or aggregation.
‐ Message Passing: For each node, the GNN collects the feature vectors of all its
neighbors as defined by the adjacency matrix.
‐ Aggregation: Common aggregation functions include sum, mean, or max pooling.
These functions combine the neighboring features to form a single aggregated feature
for each node.
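A minimal NumPy sketch of mean aggregation, one of the aggregation functions mentioned above; adding self-loops is a common but optional convention, and the toy graph is invented for illustration.

```python
import numpy as np

def mean_aggregate(X, A, add_self_loop=True):
    """For every node, average the feature vectors of its neighbors (message passing)."""
    A_hat = A + np.eye(A.shape[0]) if add_self_loop else A   # optionally include the node itself
    degree = A_hat.sum(axis=1, keepdims=True)                # neighbor count per node
    return (A_hat @ X) / degree                              # mean of neighboring features

# Toy graph: 4 nodes with 3 features each, and its adjacency matrix (as in Step 1)
X = np.arange(12, dtype=float).reshape(4, 3)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

print(mean_aggregate(X, A))   # shape (4, 3): one aggregated vector per node
```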
Step 3: Layer 1 – Transformation
• The aggregated features are transformed with learnable weights and an activation function to update each node's feature representation.
Step 4: Iterative Layers
• If there are multiple GNN layers, the process of message passing, aggregation,
and transformation repeats in each layer. This allows each node to gather
information from nodes that are further away in the graph, not just immediate
neighbors.
‐ Expanding the Neighborhood Influence: Each additional layer enables nodes to
incorporate information from increasingly distant nodes, giving a broader context.
‐ Refined Node Representations: With each layer, nodes gain richer, multi-hop
information from the graph, which makes the node embeddings more representative of
their position and connections in the overall structure. 40
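Putting Steps 2 through 4 together, here is a sketch of one full layer (mean aggregation followed by a learned transformation with ReLU), stacked twice so that each node also sees its 2-hop neighborhood. The GCN-style mean aggregation, the weight shapes, and the random data are illustrative assumptions.

```python
import numpy as np

def gnn_layer(H, A, W):
    """One GNN layer: aggregate neighbor features, then transform with weights + ReLU."""
    A_hat = A + np.eye(A.shape[0])                            # include self-loops
    agg = (A_hat @ H) / A_hat.sum(axis=1, keepdims=True)      # mean aggregation (Step 2)
    return np.maximum(agg @ W, 0.0)                           # linear transform + ReLU (Step 3)

rng = np.random.default_rng(3)
N, F, H1, H2 = 4, 3, 8, 4                  # nodes, input features, hidden sizes
X = rng.normal(size=(N, F))                # node feature matrix
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

W1, W2 = rng.normal(size=(F, H1)), rng.normal(size=(H1, H2))
H = gnn_layer(X, A, W1)    # layer 1: each node sees its 1-hop neighbors
H = gnn_layer(H, A, W2)    # layer 2: information now reaches 2-hop neighbors (Step 4)
print(H.shape)             # (4, 4): refined node embeddings
```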
Step 5: Final Output
• Depending on the task type, the GNN will produce either node-level, link-level, or
graph-level outputs. The readout operation is specifically used for graph-level tasks
and is essential when a single, unified representation of the entire graph is needed.
‐ Node-Level or Link-Level Output (for node or link prediction):
For tasks focused on nodes or edges, each node or edge is treated independently, and the GNN outputs
predictions directly for each node or edge.
‐ Graph-Level Output – Applying Readout:
In graph-level tasks, a single vector representation for the entire graph is required.
The readout operation aggregates the final feature vectors (embeddings) of all nodes in the graph into a
single graph-level vector. This vector represents the overall structure and features of the graph and serves as
input for a final prediction layer.
Common Readout Methods
• Sum Aggregation:
Adds up the embeddings of all nodes.
• Mean Aggregation:
Takes the average of all node embeddings.
• Max Pooling:
Selects the maximum value for each feature dimension across all node
embeddings.
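A NumPy sketch of the three readout methods applied to a matrix of final node embeddings (the embeddings here are random placeholders):

```python
import numpy as np

# Final node embeddings after the last GNN layer: N = 5 nodes, d = 4 dimensions
node_embeddings = np.random.default_rng(4).normal(size=(5, 4))

graph_sum = node_embeddings.sum(axis=0)    # Sum Aggregation
graph_mean = node_embeddings.mean(axis=0)  # Mean Aggregation
graph_max = node_embeddings.max(axis=0)    # Max Pooling (per feature dimension)

print(graph_sum.shape, graph_mean.shape, graph_max.shape)  # each is a single (4,) graph vector
```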
Summary of Steps
• Step 1: Initial Input Preparation. Prepare the Node Feature Matrix and the Adjacency Matrix.
• Step 2: Layer 1 – Message Passing. Each node aggregates feature information from its neighbors.
• Step 3: Layer 1 – Transformation. Aggregated features are transformed with weights and activation functions.
• Step 4: Iterative Layers. Additional layers allow nodes to gather information from distant neighbors.
• Step 5: Final Output. The readout operation aggregates all node embeddings into a single graph-level representation for graph-level predictions.
Input and Output Data in GNNs
• Input Data in GNNs: A single data point in GNNs is a graph that consists of:
‐ Nodes (Vertices): Each node has associated features (e.g., in a social network, a user profile may have
features like age, location, and interests).
‐ Edges: Edges represent relationships or interactions between nodes (e.g., friendships in a social network
or bonds in a molecular structure).
‐ Adjacency Matrix or Edge List: Specifies which nodes are connected, representing the graph’s structure.
Source: https://medium.com/@bscarleth.gtz/introduction-to-graph-neural-networks-an-illustrated-guide-c3f19da2ba39 45
Tasks Ideal for GNNs
• Node Prediction: Predict the category of a node based on its features and the features of its
neighbors.
Example: In a social network, classifying users as “influencers” or “regular users” based
on connections.
• Link Prediction: Predict the likelihood or existence of an edge between two nodes.
Example: Predicting future friendships in a social network.
• Graph Prediction: Predict a label for the entire graph.
Example: In chemistry, classifying molecules as toxic or non-toxic.
Node Prediction
• In node prediction tasks, each node in the graph is treated as an individual data point. The GNN learns to predict each node by considering its features and the features of its neighbors. This task is often used in scenarios where we want to label each entity in a large, interconnected network.
‣ GNNs aggregate information from each node's neighbors, allowing the model to learn that influencers are often connected to other influencers.
‣ This helps the GNN make more accurate predictions than if it relied on each node's features in isolation.
• Data Setup:
‐ Node Feature Matrix: Contains features for each user, such as age, number of friends, and engagement score.
‐ Adjacency Matrix: Represents connections (friendships) between users.
‐ Labels: Each user (node) has a label, e.g., 1 for influencers and 0 for regular users.
Link Prediction
• In link prediction tasks, each pair of nodes (potential edge) is treated as a data point. The GNN predicts whether a connection (edge) exists or will form between nodes based on their features and the structure of the graph. This is useful in scenarios where we want to understand or predict relationships in a network.
‣ GNNs leverage both node attributes and the graph's structure to predict links.
‣ By aggregating information from neighboring nodes, GNNs can identify patterns of users who are likely to connect, such as users with many mutual friends.
• Data Setup:
‐ Node Feature Matrix: Contains features for each user.
‐ Adjacency Matrix: Shows the existing friendships (edges).
‐ Training Data: Pairs of nodes (user pairs) are treated as data points. Positive samples are existing friendships, and negative samples are randomly selected pairs with no friendship.
‐ Labels: Each pair has a label: 1 if an edge exists (friendship) and 0 if it doesn't (no friendship).
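One common decoder for this setup, not specified on the slide, scores a candidate pair with the sigmoid of the dot product of the two node embeddings. A minimal sketch with placeholder embeddings:

```python
import numpy as np

def link_score(h_u, h_v):
    """Probability that an edge exists between nodes u and v (dot-product decoder)."""
    return 1.0 / (1.0 + np.exp(-h_u @ h_v))

# Node embeddings produced by a GNN (random here, just for illustration)
H = np.random.default_rng(5).normal(size=(6, 8))   # 6 users, 8-dim embeddings

positive_pair = (0, 1)   # an existing friendship, label 1
negative_pair = (0, 5)   # a randomly sampled non-friend pair, label 0

print(link_score(H[positive_pair[0]], H[positive_pair[1]]))
print(link_score(H[negative_pair[0]], H[negative_pair[1]]))
```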
Graph Prediction
• In graph-level tasks, each graph is treated as an individual data point. This task is common in applications where each graph represents a separate entity, such as a molecule or a social network substructure, and we need to classify or predict properties for the entire entity.
‣ GNNs can learn from both atomic features and molecular structure.
‣ By aggregating information across all atoms and bonds, GNNs capture complex interactions within molecules, making them highly effective for predicting chemical properties.
• Data Setup:
‐ Node Feature Matrix: Contains features for each atom within each molecule.
‐ Adjacency Matrix: Represents bonds between atoms in each molecule.
‐ Training Data: Each molecule (graph) is treated as a data point, with its own node features and adjacency matrix.
‐ Labels: Each molecule has a single label, e.g., 1 for toxic and 0 for non-toxic.
Tasks Ideal for GNNs – Summary Table
Training GNNs
• Forward Pass:
The GNN processes input data through layers, performing aggregation and transformation for
each node.
• Loss Calculation:
The model’s predictions are compared to the true labels, and a loss function (e.g., cross-entropy
for classification) computes the error.
• Backpropagation:
Gradients are calculated and backpropagated to adjust weights.
• Optimization:
The optimizer (e.g., Adam) updates weights based on gradients to minimize the loss.
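A hedged PyTorch sketch of this training loop for node-level classification; the one-layer GNN model, the toy graph, and the hyper-parameters are placeholders rather than the lecture's implementation.

```python
import torch
import torch.nn as nn

class TinyGNN(nn.Module):
    """One GCN-style layer: mean-aggregate neighbors, then a linear classifier."""
    def __init__(self, in_feats, num_classes):
        super().__init__()
        self.linear = nn.Linear(in_feats, num_classes)

    def forward(self, X, A):
        A_hat = A + torch.eye(A.shape[0])                       # add self-loops
        agg = (A_hat @ X) / A_hat.sum(dim=1, keepdim=True)      # aggregation
        return self.linear(agg)                                 # transformation (logits)

# Toy data: 6 nodes, 4 features, binary labels (e.g., influencer vs. regular user)
X = torch.randn(6, 4)
A = torch.tensor([[0., 1, 0, 0, 1, 0], [1, 0, 1, 0, 0, 0], [0, 1, 0, 1, 0, 0],
                  [0, 0, 1, 0, 1, 1], [1, 0, 0, 1, 0, 0], [0, 0, 0, 1, 0, 0]])
labels = torch.tensor([1, 0, 0, 1, 0, 1])

model = TinyGNN(in_feats=4, num_classes=2)
criterion = nn.CrossEntropyLoss()                         # loss calculation
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(100):
    optimizer.zero_grad()
    logits = model(X, A)              # forward pass (aggregation + transformation)
    loss = criterion(logits, labels)  # compare predictions to true labels
    loss.backward()                   # backpropagation: compute gradients
    optimizer.step()                  # optimization: update weights to reduce the loss
```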
Real-World Applications of GNNs
• Social Networks:
Community detection, friendship prediction, recommendation systems.
• Bioinformatics:
Protein interaction networks, drug discovery (e.g., molecule classification).
• Recommendation Systems:
Collaborative filtering, content recommendations based on user-item interactions.
• Cybersecurity:
Anomaly detection in network traffic, identifying suspicious connections.
Q&A