Midterm Study Guide Csci566

The document covers foundational concepts in machine learning, including loss functions, supervised and unsupervised learning tasks, and the importance of regularization. It discusses various algorithms such as decision trees, k-nearest neighbors, and their advantages and disadvantages, along with optimization techniques and hyperparameter selection. Additionally, it highlights the integration of large language models with classical machine learning methods for enhanced performance.


CSCI 566: Deep Learning - Lecture 1

Core Machine Learning Concepts

Definition of ML
• Machine Learning: Algorithms that improve performance at a task with experience
• Three Components: Tasks, Experience, Performance

ML Tasks
• Supervised Learning: Learn from labeled data
• Unsupervised Learning: Learn patterns without labels
• Self-supervised: Generate supervision from data
• Reinforcement Learning: Learn from rewards/penalties
• Semi-supervised: Mix of labeled and unlabeled data

Supervised ML Framework
• Input space X: All possible inputs
• Output space Y: All possible outputs
• Target function c*: Unknown mapping X → Y
• Hypotheses H: Candidate functions h : X → Y
• Training data D: {(x(1), y(1)), ..., (x(N), y(N))} where y(i) = c*(x(i))

Error Evaluation
• Training error: error(h, Dtrain)
• Test error: error(h, Dtest)
• True error: errortrue(h) - unknown in practice

Empirical Risk Minimization
• Empirical Minimizer: Minimizes average loss on training data
• Optimal Predictor: Minimizes expected loss on all data
• Law of Large Numbers: Empirical mean → true mean as n → ∞

Regularization

Purpose
• Prevent overfitting
• Express preferences over weights
• Make models simpler and more generalizable

Types
• L2 regularization: λ Σ wj²
• L1 regularization: λ Σ |wj|
• Elastic net: λ1 Σ |wj| + λ2 Σ wj²
• Advanced: Dropout, batch normalization, stochastic depth
Total Loss
• L(W) = Data Loss + λ · Regularization
• λ = regularization strength (hyperparameter)

Historical Context of Deep Learning

Timeline of DL Development
• Neocognitron (1982) - Fukushima: First self-organizing neural network model
• Handwritten Digit Recognition (1989) - LeCun: First backpropagation application
• AlexNet (2012) - Krizhevsky: Breakthrough in image classification with CNNs on GPUs

Hardware Advancement and DL
• Success depends on both software and hardware advancement
• GPU advancement enabled fast tensor operations
• AlexNet ingredients: Data (ImageNet), Algorithm (Backprop), Compute (GPUs)

Linear Models

Linear Classification
• Form: f(x; W) = Wx + b
• Parameters: W (weights matrix), b (bias vector)
• Example: For a 50×50×3 image input and 5 classes:
  – W is a 5×7500 matrix
  – b is a 5×1 vector

Loss Functions
• Hinge Loss: max(0, 1 − yi · f(xi; W))
• L1 Loss: |f(xi; W) − yi|
• L2 Loss: (f(xi; W) − yi)²
• Cross-entropy: −Σ yi · log(softmax(f(xi; W)))

Optimization

Gradient Descent
• Update rule: W ← W − α∇L(W)
• Learning rate α: Controls step size
• Mini-batch: Use subset of training data for each update
• For a linear model with squared error, the weight update is: Wi ← Wi + α · 2 · xi (y − W^T x), where α is the learning rate

Optimization Approaches
• Random Search: Try random weights (inefficient)
• Analytic Solution: W = (X^T X)^(−1) X^T y (for linear regression)
• Gradient Descent: Iterative numerical approach
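As a concrete illustration of the gradient-descent update rule and the analytic solution above, here is a minimal NumPy sketch. The synthetic data and variable names are illustrative, not from the lecture.

```python
import numpy as np

# Synthetic linear-regression data: y = X w_true + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)

# Gradient descent on the mean squared-error loss
w = np.zeros(3)
alpha = 0.1                                  # learning rate (step size)
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)    # gradient of the loss
    w = w - alpha * grad                     # W <- W - alpha * grad

# Analytic solution: w = (X^T X)^(-1) X^T y
w_closed = np.linalg.solve(X.T @ X, X.T @ y)
print(w, w_closed)                           # both should be close to w_true
```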
Non-linear Models & Deep Learning

Why Non-linearity?
• Linear models cannot separate data with non-linear boundaries
• Real-world relationships are often complex and non-linear

Deep Learning Properties
• Non-linear: High capacity for complex functions
• Hierarchical: Multiple layers of abstraction
• End-to-End learning: Learn features and classifier jointly
• Universal Function Approximation: Can approximate any function

Activation Functions
• ReLU: f(x) = max(0, x)
• Sigmoid: f(x) = 1 / (1 + e^(−x)), range [0, 1]
• Tanh: f(x) = (e^x − e^(−x)) / (e^x + e^(−x)), range [-1, 1]
• ELU: f(x) = x if x > 0, α(e^x − 1) if x ≤ 0

Graph Neural Networks (GNNs)

Key Properties
• Aggregation function: Combines information from a node's neighbors
• Weight sharing: Fixed parameters across different graph parts
• Permutation invariance: Output doesn't change with node ordering

GNN Formula
• hv(0) = xv (Initial node features)
• hv(k+1) = f(AGGREGATE(hu1(k), hu2(k), ..., hun(k), hv(k)))
• zv = hv(K) (Final node embedding)

AI Robustness and Trustworthiness

AI Robustness
• Definition: AI system performing reliably under various conditions
• Importance: Critical for autonomous driving, medical diagnosis

Out-of-Distribution (OOD) Detection
• Definition: Detecting samples with a "shifted distribution"
• MultiOOD: First benchmark for Multimodal OOD Detection

Outlier Detection Algorithms
• k-NN: Distance to k-th nearest neighbor
• Local Outlier Factor (LOF): Compare local density
• Histogram-based OD (HBOS): Independent feature estimation
• Isolation Forest: Random splits (fewer = more anomalous)
• Autoencoder: Reconstruction error as anomaly score

Large Language Models and Applications

TrustLLM Benchmark
• Eight facets: Truthfulness, safety, fairness, robustness, privacy, ethics, transparency, accountability
• Evaluation: 16 LLMs, 30+ datasets, 6 dimensions

LLMs for Anomaly Detection (AD-LLM)
• Zero-shot detection: Using pre-trained LLMs without task-specific training
• Data augmentation: Generating synthetic samples
• Model selection: LLM recommendations for choosing models

Key Numerical Formulas

Parameter Counting
• Linear layer (without bias): input size × output size
• Linear layer (with bias): input size × output size + output size
• Convolutional layer (without bias): in channels × kernel height × kernel width × out channels
• 1×1 convolution (without bias): in channels × out channels

CNN Output Size
• Output size = (Input size − Filter size + 2 × Padding) / Stride + 1

Neural Network Scaling
• For an FC network with input i, hidden layers h1/h2, and output o:
  – Original params: i × h1 + h1 × h2 + h2 × o
  – Double hidden sizes: i × (2h1) + (2h1) × (2h2) + (2h2) × o
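A small sketch of the parameter-counting and output-size rules from the Key Numerical Formulas section above. The helper function names are illustrative, not part of the course material.

```python
def linear_params(n_in, n_out, bias=True):
    """Fully connected layer: n_in*n_out weights (+ n_out biases)."""
    return n_in * n_out + (n_out if bias else 0)

def conv_params(c_in, k_h, k_w, c_out, bias=True):
    """Conv layer: c_in*k_h*k_w*c_out weights (+ c_out biases)."""
    return c_in * k_h * k_w * c_out + (c_out if bias else 0)

def conv_output_size(n, f, padding, stride):
    """Output size = (Input - Filter + 2*Padding)/Stride + 1."""
    return (n - f + 2 * padding) // stride + 1

print(linear_params(784, 128))                 # 100480
print(conv_params(3, 3, 3, 64, bias=False))    # 1728
print(conv_output_size(32, 3, 1, 1))           # 32 (3x3, stride 1, pad 1 keeps size)
```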
ML Applications
• Object Detection: Identify and locate objects
• Semantic Segmentation: Pixel-level classification
• Image Captioning: Generate textual descriptions
• Machine Translation: Translate between languages

USC Fortis Lab Research
• PyGOD/PyOD: Open-source systems for outlier detection
• Three research directions:
  – AI Robustness/Trustworthiness
  – Structured/Generative AI
  – Scalable/Open-Source AI

Crucial Exam Concepts
• Overfitting vs. Underfitting: Bias-variance tradeoff
• Parameter counting: Calculate parameters for different layers
• Vanishing/exploding gradients: Causes and solutions
• Activation function properties: Value ranges and derivatives
• Non-linearity importance: Enables modeling complex patterns
• Grading: May be curved based on score distribution

CSCI 566: Classical ML Algorithms - Lecture 2

Decision Trees

Basic Concept
• Decision tree: Flowchart-like tree structure for classification/regression
• Components: Root node, internal nodes (tests on attributes), leaf nodes (predictions)
• Flow: Start at root, apply tests at each node, follow branches until reaching a leaf

Characteristics
• Max depth for binary attributes: O(k) where k = # of attributes
• Max leaf nodes for binary attributes: 2^k
• Pruning: Reduces tree depth to prevent overfitting

Learning Algorithm
• Greedy approach: Recursively select best feature to split on
• Split selection: Choose feature that minimizes classification error (or other metrics)
• Process:
  – Start with all data at root node
  – For each node, find best feature to split on
  – Create child nodes based on feature values
  – Recursively build tree for each child node
  – Stop when stopping criteria are met

Split Selection Criteria
• Classification Error: Error = # incorrect predictions / # examples
• Gini Impurity: Measures probability of misclassification
• Information Gain: Based on entropy reduction
• Research finding: Gini and Information Gain are statistically equivalent

Stopping Criteria
• All data points have same label: Perfect classification achieved
• All features used: No more features to split on
• Max depth reached: Tree depth limit exceeded
• Min samples per leaf: Too few samples to split further

Handling Different Feature Types
• Categorical features: Create a branch for each category (e.g., Credit = {excellent, fair, poor})
• Continuous features: Create a threshold split (e.g., Income ≥ threshold)
• Multiple thresholds: Continuous features can be reused with different thresholds

Overfitting and Pruning
• Overfitting: Tree too complex, fits noise in training data
• Underfitting: Tree too simple, can't capture patterns
• Pruning: Removing branches to reduce complexity
  – Pre-pruning: Early stopping during tree construction
  – Post-pruning: Grow full tree, then remove branches that don't improve validation error

Decision Boundaries
• Binary attributes create axis-aligned splits (perpendicular to feature axes)
• Continuous attributes allow arbitrary thresholds
• With continuous attributes, max leaf nodes can be N
• Complex boundaries: Deeper trees create more complex decision boundaries
• Depth trade-off: Deeper trees fit training data better but risk overfitting

Decision Tree Advantages
• Easy to understand and interpret (white-box model)
• Handles both numerical and categorical data
• No normalization/scaling required
• Captures non-linear relationships
• Low computational cost for prediction

Decision Tree Disadvantages
• Prone to overfitting, especially with deep trees
• Unstable: Small data changes can create different trees
• Biased toward features with more levels
• Can create biased trees if classes are imbalanced

Decision Forests
• Random Forest: Ensemble of decision trees
• Process: Train multiple trees on random data subsets with random feature subsets
• Prediction: Majority vote (classification) or average (regression) of all trees
• Advantages: Higher accuracy, more stable, better feature importance
• Disadvantages: Less interpretable, higher computational cost, larger model size

Yggdrasil: Distributed Decision Trees
• Column-Partitioning: Distribute features across workers
• Workers: Compute sufficient statistics on local features
• Driver: Coordinates, picks best global split
• Result: Significant speedup (8.5-24×) for large-scale tree training

LLM Applications with Trees
• Chain-of-Thought: Step-by-step reasoning for problem-solving
• Tree-of-Thoughts: Explores multiple reasoning paths in tree structure
• Search strategies: Breadth-first search or depth-first search through reasoning paths
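A minimal sketch of the split-selection criteria above (Gini impurity and entropy-based information gain). The label arrays are toy examples of my own, not from the lecture.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_c p_c^2 (probability of misclassification)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy: -sum_c p_c log2(p_c)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Entropy reduction achieved by a candidate split."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = np.array([0, 0, 0, 1, 1, 1])
print(gini(parent))                                      # 0.5
print(information_gain(parent, parent[:3], parent[3:]))  # 1.0 (perfect split)
```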
k-Nearest Neighbors (kNN)

Basic Concept
• Instance-based learning: Memorize training examples, classify based on similarity
• Lazy learning: No explicit training phase, prediction done at query time
• Prediction process:
  – Find k nearest neighbors to query point
  – Classification: Majority vote among k neighbors
  – Regression: Average/weighted average of k neighbors' values

Algorithm Steps
• Store: Keep all training examples
• Distance calculation: Compute distance between query point and all training points
• Neighbor selection: Find k training points with smallest distances
• Decision: Use majority vote (classification) or average (regression)

Distance Metrics
• Euclidean distance: d(x, y) = √(Σi (xi − yi)²)
• Manhattan distance: d(x, y) = Σi |xi − yi|
• Minkowski distance: d(x, y) = (Σi |xi − yi|^p)^(1/p)
• Cosine similarity: sim(x, y) = (x · y) / (||x|| · ||y||)

Effect of k Value
• Small k: Sensitive to noise, complex decision boundary
• Large k: Smoother decision boundary, may underfit
• k = 1: Always perfect on training data, likely to overfit
• Extreme k: With k = N (dataset size), always predicts the majority class
• Rule of thumb: k ≈ √N (where N is dataset size)

Choosing Hyperparameters
• Bad approach: Optimize on training data (k=1 always perfect)
• Bad approach: Optimize on test data (no generalization estimate)
• Better approach: Use validation set for hyperparameter selection
• Best approach: Cross-validation (especially for small datasets)
• Domain knowledge: Based on application requirements

Cross-Validation
• Process: Split data into n folds, train on n−1 folds, test on remaining fold
• Repeat: Rotate validation fold n times
• Result: More robust estimate of performance
• Common: 5-fold or 10-fold cross-validation

kNN Advantages
• Simple to understand and implement
• No training phase (lazy learner)
• Naturally handles multi-class problems
• Can learn complex decision boundaries
• Adaptable to new data (just add to training set)

kNN Disadvantages
• Computationally expensive during prediction: O(ND + N log k)
• Memory intensive: Stores all training data
• Sensitive to feature scaling
• Curse of dimensionality: Performance degrades in high dimensions
• Sensitive to imbalanced data

kNN for Large Language Models
• kNN-LLMs: Enhance language models with nearest neighbor search
• Process:
  – Store vector representations of training sequences
  – Find similar contexts for a given input
  – Combine neural LM prediction with kNN prediction
• Benefit: Improves prediction by leveraging explicit memorization
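A minimal sketch of the kNN prediction process described above (Euclidean distances, k nearest neighbors, majority vote), on a tiny made-up dataset.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """kNN classification: compute distances, take k nearest, majority vote."""
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))  # distance to every training point
    nearest = np.argsort(dists)[:k]                          # indices of k smallest distances
    votes = Counter(y_train[nearest])                        # majority vote among neighbors
    return votes.most_common(1)[0][0]

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))  # -> 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.5]), k=3))  # -> 1
```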
Clustering

Basic Concept
• Unsupervised learning: Group similar data points without labels
• Goal: Maximize intra-cluster similarity, minimize inter-cluster similarity
• Applications: Document grouping, customer segmentation, pattern discovery

K-Means Algorithm
• Initialization: Select k initial cluster centers
• Assignment: Assign each point to nearest cluster center
• Update: Recalculate cluster centers as mean of assigned points
• Iteration: Repeat assignment and update until convergence

K-Means Mathematical Formulation
• Objective: Minimize sum of squared distances within clusters
• Assignment: zi = argmin_j ||µj − xi||²
• Update: µj = (Σ_{i: zi=j} xi) / |{i : zi = j}|

Choosing K Value
• Elbow method: Plot WSS (Within-Cluster Sum of Squares) vs. k
• Silhouette score: Measures how similar points are to their own cluster vs. other clusters

K-Means Initialization
• Random selection: Choose k random points as initial centers
• K-means++:
  – Choose first center randomly
  – Select subsequent centers with probability proportional to squared distance from existing centers
• Multiple runs: Run algorithm multiple times with different initializations

K-Means Limitations
• Assumes spherical, equally sized clusters
• Sensitive to initialization
• May converge to local optima
• Struggles with different cluster densities
• Cannot handle non-convex cluster shapes

LLMs for Clustering
• Few-shot clustering: Using LLMs to enhance clustering with minimal supervision
• LLM applications in clustering:
  – Before clustering: Generate keyphrases to enrich text representation
  – During clustering: Act as pseudo-expert to provide similarity constraints
  – After clustering: Correct errors in low-confidence cases
• Benefits: Reduces need for manual annotation, improves clustering quality

LLM Integration with Classical ML
• LLMs can enhance classical ML algorithms (kNN-LLM, Few-Shot Clustering)
• Tree-based structures can improve LLM reasoning (Tree-of-Thoughts)
• Hybrid approaches leverage strengths of both paradigms

Choosing the Right Algorithm
• Decision Trees: When interpretability is key, features interact in complex ways
• Random Forests: When accuracy matters more than interpretability
• kNN: When instance-based reasoning is appropriate, dataset is small/medium
• Clustering: When discovering hidden patterns in unlabeled data
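A sketch of the K-Means loop described above (random initialization, assignment step, update step, repeat until convergence). Data and names are illustrative.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means: random init, assign to nearest center, recompute means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]       # random initialization
    for _ in range(n_iters):
        # Assignment step: z_i = argmin_j ||mu_j - x_i||^2
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        z = dists.argmin(axis=1)
        # Update step: mu_j = mean of points assigned to cluster j
        new_centers = np.array([X[z == j].mean(axis=0) if np.any(z == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):                     # converged
            break
        centers = new_centers
    return centers, z

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = kmeans(X, k=2)
print(centers)   # roughly one center near [0, 0] and one near [5, 5]
```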
Model Evaluation & Selection

Overfitting vs. Underfitting
• Underfitting: Model too simple, high bias, high training error
• Overfitting: Model too complex, high variance, low training error but high test error
• Definition: Model overfits if errortrue(h) > error(h, Dtrain)

Model Complexity
• Definition: Capacity of a model to fit data points
• Factors: Number of parameters, parameter value range
• Trade-off: Higher complexity → better fit to training data but risk of overfitting

Data Complexity
• Factors: Number of examples, number of features, class separability
• Relationship with model: Complex models need more data

Regularization
• Purpose: Prevent overfitting by constraining model complexity
• L1 regularization: Promotes sparsity (feature selection)
• L2 regularization: Constrains weights to small values

Model Selection
• Goal: Choose model with proper complexity for your data
• Process: Select model family, tune hyperparameters
• Metrics: Consider both performance metrics and business requirements

Evaluation Metrics for Binary Classification
• Accuracy: # correct predictions / # examples
• Precision: # true positives / # (true positives + false positives)
• Recall: # true positives / # positive examples
• F1 Score: (2 × Precision × Recall) / (Precision + Recall)
• AUC-ROC: Area under ROC curve (TPR vs. FPR at different thresholds)
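The metrics above translate directly into code; a minimal sketch with a toy prediction vector (not from the course):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from the formulas above."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = np.mean(y_pred == y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1])
print(binary_metrics(y_true, y_pred))  # (0.75, 0.75, 0.75, 0.75)
```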
Key Takeaways

Classical ML vs. Deep Learning
• Classical ML: Often interpretable, good for structured data, smaller datasets
• Deep Learning: More powerful, better for unstructured data, requires more data
• Both paradigms have their place in modern ML applications

CSCI 566: Neural Networks - Lecture 3

Neural Network Fundamentals

Neural Networks Concept
• Definition: Set of neurons (atomic functions) connected in a non-linear way
• Biological Inspiration: Loosely based on the brain's neurons, but with significant simplifications
• Core Idea: Learn hierarchical representations through multiple layers of transformations

Neuron Structure
• Input: Vector of values x ∈ R^n
• Weights: Parameter vector w ∈ R^n
• Bias: Scalar parameter b ∈ R
• Net Input: z = w^T x + b
• Activation: a = f(z) where f is a non-linear function

Neural Network Architecture
• Input Layer: Raw features
• Hidden Layers: Intermediate representations
• Output Layer: Final prediction
• Fully Connected/Dense Layer: Every neuron connected to all neurons in previous layer
• Naming Convention:
  – "2-layer Neural Network" = 1 hidden layer + output layer
  – "3-layer Neural Network" = 2 hidden layers + output layer

Neural Network Data Shapes
• Input Shape: Depends on data (e.g., 784 features for 28×28 images)
• Weight Matrix Shapes:
  – First layer: [input size × hidden size]
  – Hidden layers: [previous hidden size × next hidden size]
  – Output layer: [last hidden size × output size]
• Matrix Multiplication: [1 × input size] · [input size × hidden size] = [1 × hidden size]
Activation Functions

Purpose of Activation Functions
• Introduce non-linearity into the network
• Without non-linearity, neural networks reduce to linear models
• Allow networks to learn complex patterns and relationships
• Different activation functions have different properties and use cases

Sigmoid Activation
• Formula: σ(x) = 1 / (1 + e^(−x))
• Range: [0, 1]
• Use Cases: Output layer for binary classification
• Problems:
  – Saturated neurons "kill" gradients
  – Outputs not zero-centered
  – Computationally expensive (exp)

Tanh Activation
• Formula: tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
• Range: [-1, 1]
• Advantage: Zero-centered output
• Problem: Still suffers from vanishing gradient

Activation Function Best Practices
• Use ReLU as default (with careful learning rate tuning)
• Try Leaky ReLU/ELU for potential improvements
• Avoid sigmoid/tanh in hidden layers
• Use appropriate activation for output layer (sigmoid for binary, softmax for multi-class)
• Despite the advantages of alternatives, ReLU can be more efficient for inference due to sparsity

Forward Propagation

Forward Pass Process
• Start with input X
• For each layer l:
  – Compute pre-activation: Z[l] = W[l] A[l−1] + b[l]
  – Apply activation: A[l] = g[l](Z[l])
• Output is final layer activation A[L]

Mathematical Representation
• Single hidden layer: f(x; W[1], b[1], W[2], b[2]) = W[2] g(W[1] x + b[1]) + b[2]
• Multi-layer: Composition of functions f(x) = fL(fL−1(...f1(x)...))
ReLU (Rectified Linear Unit)
• Formula: ReLU(x) = max(0, x)
• Advantages:
  – Does not saturate in positive region
  – Computationally efficient
  – Faster convergence than sigmoid/tanh
  – Induces sparsity in activations
• Problem: "Dying ReLU" - neurons can permanently die if they only receive negative inputs

Leaky ReLU
• Formula: LeakyReLU(x) = max(αx, x) where α is small (e.g., 0.01)
• Advantage: Prevents dying ReLU problem
• Parametric ReLU (PReLU): Learns α parameter during training

ELU (Exponential Linear Unit)
• Formula: ELU(x) = x if x > 0, α(e^x − 1) if x ≤ 0
• Advantages: All benefits of ReLU plus closer to zero-mean outputs
• Disadvantage: Requires computing exp()

GELU (Gaussian Error Linear Unit)
• Used in modern transformer models (BERT, GPT)
• Smoother than ReLU with probabilistic activation mechanism
• Better gradient flow during training

Backpropagation

Backpropagation Concept
• Algorithm to compute gradients efficiently in neural networks
• Based on recursive application of the chain rule
• Core of neural network training

Chain Rule Refresher
• For a composite function f(g(x)): df/dx = (df/dg) · (dg/dx)
• For multiple variables: ∂z/∂x = Σi (∂z/∂yi) · (∂yi/∂x)

Backpropagation Steps
• Forward pass: Compute output of network
• Compute loss: Measure error using loss function
• Backward pass:
  – Start with gradient of loss w.r.t. output
  – Propagate gradients backward through the network
  – For each layer, compute:
    ∗ Gradient w.r.t. layer output
    ∗ Gradient w.r.t. weights and biases
    ∗ Gradient w.r.t. layer input (to propagate further)
• Update parameters: Use computed gradients to adjust weights and biases

Computational Graph Representation
• Represent computation as a directed graph
• Nodes: Variables or operations
• Edges: Data flow between operations
• Enables systematic calculation of gradients
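A minimal end-to-end sketch of the backpropagation steps above for a tiny 2-layer network, written by hand with the chain rule. The toy task, sizes, and variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))                          # batch of inputs
y = (X.sum(axis=1, keepdims=True) > 0).astype(float)  # toy binary targets

W1, b1 = rng.normal(size=(4, 8)) * 0.1, np.zeros((1, 8))
W2, b2 = rng.normal(size=(8, 1)) * 0.1, np.zeros((1, 1))
lr = 0.5

for _ in range(200):
    # Forward pass
    z1 = X @ W1 + b1
    a1 = np.maximum(0, z1)                            # ReLU hidden layer
    z2 = a1 @ W2 + b2
    p = 1 / (1 + np.exp(-z2))                         # sigmoid output
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    # Backward pass (chain rule, layer by layer)
    dz2 = (p - y) / len(X)                            # dL/dz2 for sigmoid + cross-entropy
    dW2, db2 = a1.T @ dz2, dz2.sum(axis=0, keepdims=True)
    da1 = dz2 @ W2.T                                  # propagate gradient to hidden layer
    dz1 = da1 * (z1 > 0)                              # ReLU routes gradient only where z1 > 0
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0, keepdims=True)

    # Parameter update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(round(loss, 4))                                 # loss should have decreased
```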
Patterns in Gradient Flow Neural Network Architecture Types
• Add gate: Acts as gradient distributor (passes upstream gradient unchanged to all inputs) Fully Connected Neural Networks
• Multiply gate: Acts as ”swap multiplier” (scales upstream gradient by the other input)
• Max gate: Routes gradient to the input that was selected during forward pass • Also called Multi-Layer Perceptrons (MLPs)
• Copy gate: Adds gradients from multiple downstream paths • Every neuron connected to all neurons in adjacent layers
• Simple but parameter-intensive
Gradient Calculation in Vector-Valued Functions
Convolutional Neural Networks (CNNs)
• Jacobian matrix: Contains partial derivatives of each output w.r.t. each input
• For backprop with vectors, matrix-vector multiplication is used • Specialized for processing grid-like data (e.g., images)
• Upstream gradient × local Jacobian = downstream gradient • Use convolutional filters instead of full connections
• Parameter sharing and local connectivity
Training Issues • Examples: AlexNet, VGG, ResNet
Vanishing Gradient Problem
Recurrent Neural Networks (RNNs)
• Problem: Gradients become extremely small as they propagate backward
• Causes: • Designed for sequential data
– Repeated multiplication by small numbers (< 1)
• Process one element at a time, carrying information forward
– Saturating activation functions (sigmoid, tanh) • Variants: LSTM, GRU (address vanishing gradient)
• Effects:
Graph Neural Networks (GNNs)
– Earlier layers learn very slowly or not at all
– Model cannot capture long-range dependencies • Process data represented as graphs
• Solutions: • Capture relationships between entities (nodes)
• Aggregate information from neighboring nodes
– Use ReLU activation functions • Applications: social networks, molecules, knowledge graphs
– Batch normalization
– Careful weight initialization Neural Network Practical Tips
– Skip connections (in modern architectures) Neural Network Parameter Counting

Exploding Gradient Problem • Fully connected layer: input size × output size + output size (with bias)
• Simple MNIST example:
• Problem: Gradients become extremely large
• Effects: – Input: 784 features (28×28 image)

– Unstable training – Hidden layer: 128 neurons

– Large parameter updates – Output: 10 classes


– NaN values (numerical overflow) – Parameters: (784×128+128) + (128×10+10) = 100,746
• Solution: Gradient clipping Model Capacity Considerations
– If gradient norm exceeds threshold, scale it down
– Takes step in same direction but smaller magnitude • More layers/neurons: Increases model capacity
• Increased capacity: Can fit more complex functions but risks overfitting
Advanced Training Approaches • Regularization techniques: Needed for larger models to generalize well

• HyperTuning: Key Exam Concepts


– Uses hypermodel to generate task-specific parameters • Understand activation functions and their properties
– Eliminates backpropagation for model adaptation • Know how to count parameters in neural networks
– Especially useful for Large Language Models • Understand basics of forward and backward propagation
• Recognize causes and solutions for vanishing/exploding gradients
• Parameter-Efficient Fine-Tuning (PEFT): • Compare different neural network architectures for various data types
– Reduces number of parameters to update
– More memory-efficient than full fine-tuning CSCI 566: Convolutional Neural Networks - Lecture 4
CNN Introduction & History Convolutional Layer
What is a CNN? Convolution Operation

• Neural network architecture specifically designed for processing grid-like data (e.g., images) • Key idea: Apply same filter across all spatial locations
• Inspired by the visual cortex organization in the brain • Filter (kernel) slides over input, computing dot products
• Preserves spatial relationships between pixels through local connectivity • Each dot product creates one value in output feature map
• Uses weight sharing to reduce parameters and improve generalization • Multiple filters create multiple feature maps (channels)

Historical Development Advantages Over Fully Connected

• Parameter sharing: Same weights applied at all locations


• Hubel & Wiesel (1959): Discovered how visual cortex builds complex representations from simple
• Sparse connectivity: Each output depends only on small local region
features
• Translation invariance: Detect features regardless of position
• Neocognitron (1980s): Fukushima’s self-organizing neural network model with ”sandwich”
• Hierarchy of features: Lower layers detect edges, higher layers detect complex patterns
architecture
• LeNet (1989): LeCun’s CNN for handwritten digit recognition with backpropagation Convolutional Layer Parameters
• AlexNet (2012): First CNN to win ImageNet competition, marking the beginning of CNN dominance
• VGGNet (2014): Popular CNN showing benefits of deeper architectures with small filters • Number of filters (K): Number of feature maps produced
• GoogLeNet/Inception (2014): Introduced 1×1 convolutions and inception modules • Filter size (F ): Spatial dimensions of kernel (e.g., 3×3, 5×5)
• ResNet (2015): Enabled training of extremely deep networks using residual connections • Stride (S): Step size when sliding filter
• Padding (P ): Adding borders to input before convolution
CNN Applications
• Image classification
• Object detection
• Semantic segmentation
• Image captioning
• Style transfer
• Face recognition
• Medical image analysis
• Video analysis
• Game playing

Output Dimensions
• Input dimensions: W1 × H1 × C
• Output dimensions: W2 × H2 × K
• Where:
  – W2 = (W1 − F + 2P)/S + 1
  – H2 = (H1 − F + 2P)/S + 1
• Number of parameters: F² × C × K + K (including biases)

Common Settings
CNN Basic Layers
Main CNN Layer Types • K = powers of 2 (32, 64, 128, 512)
• F = 3, S = 1, P = 1 (maintains spatial dimensions)
• F = 5, S = 1, P = 2 (maintains spatial dimensions)
• Convolutional Layer: Extracts features using learnable filters • F = 5, S = 2, P = ”same” (halves spatial dimensions)
• Pooling Layer: Downsamples feature maps • F = 1, S = 1, P = 0 (1×1 convolution for channel mixing)
• Fully Connected Layer: Converts spatial features to class scores
• Activation Function: Adds non-linearity (typically ReLU)
• Batch Normalization: Normalizes activations for stable training
Pooling Layer
• Softmax Layer: Converts output to probability distribution Pooling Operation

Fully Connected Layer (FC) • Purpose: Reduce spatial dimensions, maintain important features
• Max Pooling: Takes maximum value in each region
• Operation: output = W · input + b • Average Pooling: Takes average value in each region
• Input: Flattened feature vector • Common setting: 2×2 filter with stride 2 (halves dimensions)
• Output: Fixed-size vector
• Parameters: W (weight matrix), b (bias vector) Pooling Advantages
• Limitations for Images:
• Reduces number of parameters in following layers
– Destroys spatial structure of input • Provides translation invariance
• Reduces memory usage and computation
– Inefficient for high-dimensional inputs
• Helps prevent overfitting
– Example: 200×200 image with 40K hidden units requires 2B parameters • No parameters to learn
Pooling Layer Parameters
• Spatial extent (F): Size of pooling window
• Stride (S): Step size when sliding window

Output Dimensions
• Input dimensions: W1 × H1 × C
• Output dimensions: W2 × H2 × C (same number of channels)
• Where:
  – W2 = (W1 − F)/S + 1
  – H2 = (H1 − F)/S + 1
• Number of parameters: 0

VGGNet (2014) design notes
• All pooling layers use 2×2 max pooling with stride 2
• Feature channels increase progressively: 64→128→256→512→512
• 138M parameters (most in FC layers)
• Why small 3×3 filters?
  – Stack of three 3×3 conv = one 7×7 conv receptive field
  – But with more non-linearities and fewer parameters (3×3×3 vs 7×7)
  – Better convergence due to more non-linearities

ResNet (2015)
• Enabled training very deep networks (152+ layers)
• Key innovation: Residual connections (skip connections)
• Residual block: H(x) = F(x) + x
• Helps with vanishing gradient problem
• Easier optimization (if deeper layers can't learn useful features, they can default to identity)
• Variants: ResNet-18, ResNet-34, ResNet-50 (bottleneck design), ResNet-101, ResNet-152
• Bottleneck design: 1×1 → 3×3 → 1×1 convolutions for efficiency
• 25.5M parameters (ResNet-50)
• Winner of ILSVRC 2015 with 3.57% top-5 error

Improvements on ResNet
• Identity Mappings in Deep Residual Networks (2016): Improved residual block design
• Wide ResNet (2016): Wider residual blocks instead of deeper networks
• ResNeXt (2016): Parallel pathways within residual block ("cardinality")
• SENet (2017): Squeeze-and-Excitation modules for adaptive feature recalibration
• DenseNet (2017): Dense connections where each layer connects to all previous layers

Batch Normalization

Batch Normalization Operation
• Purpose: Normalize activations for faster, more stable training
• Process:
  – Compute mean µB and variance σB² across mini-batch
  – Normalize: x̂ = (x − µB) / √(σB² + ϵ)
  – Scale and shift: y = γ x̂ + β
• Learnable parameters: γ (scale), β (shift)
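A minimal sketch of the batch-norm forward computation above (training-time statistics only; the running-average logic used at test time is omitted here). Names and sizes are illustrative.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Batch norm as described above: normalize per feature, then scale and shift."""
    mu = x.mean(axis=0)                     # mini-batch mean
    var = x.var(axis=0)                     # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize
    return gamma * x_hat + beta             # y = gamma * x_hat + beta

x = np.random.randn(64, 16) * 3 + 10        # batch of 64 samples, 16 features
gamma, beta = np.ones(16), np.zeros(16)
y = batchnorm_forward(x, gamma, beta)
print(y.mean(axis=0).round(3)[:4], y.std(axis=0).round(3)[:4])  # ~0 mean, ~1 std
```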

Batch Normalization Benefits Efficient Networks

• Makes training deep networks easier • MobileNet (2017): Depthwise separable convolutions for mobile applications
• Improves gradient flow • ShuffleNet (2018): Channel shuffling with grouped convolutions
• Allows higher learning rates • EfficientNet (2019): Compound scaling of width, depth, and resolution
• Reduces sensitivity to weight initialization • NASNet (2017): Neural Architecture Search for automated architecture design
• Acts as regularization
• Zero overhead at test time (can be fused with conv layer)
CNN Forward/Backward Propagation
Forward Propagation in CNNs
Batch Norm Placement
• Similar to regular neural networks, but with convolution operations
• Usually placed after convolutional or fully connected layer • For each convolutional layer:
• Often placed before activation function
• Common pattern: Conv → BatchNorm → ReLU – Perform convolution operation with filters
– Add bias
Popular CNN Architectures – Apply activation function
AlexNet (2012) • For each pooling layer:
– Apply pooling operation (max or average)
• First CNN to win ImageNet competition
• For each fully connected layer:
• 8 layers (5 conv, 3 FC)
• Used ReLU activation, dropout, data augmentation – Apply matrix multiplication and add bias
• Key features: Large filters (11×11), Local Response Normalization (LRN) – Apply activation function
• 60M parameters
Backward Propagation in CNNs
VGGNet (2014)
• Similar to regular backprop, but need to handle convolution and pooling
• Showed benefits of deeper networks with small filters • For fully connected layers: Standard backprop
• VGG16: 16 layers (13 conv, 3 FC) • For pooling layers:
• All conv layers use 3×3 filters with stride 1, pad 1 – Max pooling: Gradient flows back only to max element
– Average pooling: Gradient is distributed equally to all elements Neural Architecture Search (NAS)
• For convolutional layers:
• Automated design of neural network architectures
– Compute gradient w.r.t. each filter weight and bias
• Usually involves reinforcement learning or evolution
– Convolve the gradient of output with flipped filter to get gradient of input • Search for optimal network structure or building blocks
• Examples: NASNet, AmoebaNet, EfficientNet
CNN Applications
Image Classification Practical Tips
• Assign a single label to an entire image CNN Design Guidelines
• CNN outputs probability distribution over classes
• Examples: ImageNet classification, disease diagnosis • Use 3×3 conv filters with stride 1 and pad 1 as default
• Increase channel dimensions as spatial dimensions decrease
Object Detection • Use max pooling with 2×2 filters and stride 2 to downsample
• Add batch normalization after convolutions
• Locate and classify multiple objects in an image • Use ReLU activation functions
• Common approaches: • For modern architectures, consider residual connections
• For limited resources, consider efficient architectures like MobileNet
– R-CNN family: Generate proposals, classify with CNN
– Single-shot detectors: YOLO, SSD (no explicit proposal stage) Memory Efficiency
• Applications: Autonomous driving, surveillance, retail
• Most memory in early conv layers (large spatial dimensions)
Segmentation • Most parameters in final FC layers (or late conv layers)
• Total memory = activations + parameters + gradients
• Semantic segmentation: Classify each pixel into a category • Batch size significantly affects memory usage
• Instance segmentation: Distinguish between different instances of same class
• Panoptic segmentation: Combines semantic and instance segmentation Common Hyperparameters
• Applications: Medical imaging, autonomous driving, scene understanding
• Learning rate (often 0.01-0.001 with decay)
Other Applications
• Batch size (typically 32-256, depending on GPU memory)
• Weight decay for regularization
• Face recognition and analysis
• Data augmentation strategies
• Image generation and manipulation
• Number of filters and layers
• Video analysis
• 3D reconstruction
• Virtual try-on When to Use CNNs
• Deepfake creation and detection
• Grid-like data (images, spectrograms, etc.)
Modern CNN Developments • When spatial/local structure matters
• When translation invariance is desired
Contrastive Learning (CLIP) • When hierarchical feature extraction is beneficial

• CLIP (Contrastive Language-Image Pre-training): Key Exam Topics


– Learning from image-text pairs with contrastive objective
• CNN layer types and their functions
– Enables zero-shot classification using natural language descriptions
• Convolutional layer parameters and computation
– Joint embedding space for images and text • Output dimension calculations
• Parameter counting for different layer types
– More robust than supervised models to distribution shifts
• Advantages of convolutions over fully connected layers
• Key CNN architectures and their innovations
Vision Transformers
• Batch normalization process and benefits
• Apply transformer architecture (from NLP) to vision tasks
• Divide image into patches, treat as sequence of tokens
• Attention mechanism captures global dependencies CSCI 566: Recurrent Neural Networks & LSTM -
• Examples: ViT, Swin Transformer, DeiT Lecture 5
RNN Introduction & Motivation RNN Architectures
Limitations of Standard Neural Networks Many-to-Many

• Standard neural networks assume inputs are independent • Input: Sequence


• Cannot handle inputs of variable length • Output: Sequence (same length)
• Do not preserve order information between inputs • Used for: POS tagging, video classification on frame level
• Cannot easily model sequential dependencies • Example formula:
ht = fW (ht−1 , xt )
Sequential Data Applications
yt = g(ht )
• Speech recognition
• Natural language processing Many-to-One
• Time series prediction
• Video analysis • Input: Sequence
• Action recognition • Output: Single value/class
• Image captioning • Used for: Sentiment analysis, sequence classification
• Music generation • Only the final hidden state is used for prediction
• Machine translation • Example formula:
ht = fW (ht−1 , xt ) for t = 1 . . . T
Why RNNs?
y = g(hT )
• Process sequences of variable length
One-to-Many
• Share parameters across different positions in the sequence
• Maintain state/memory of previous inputs • Input: Single value
• Capture temporal dependencies in data • Output: Sequence
• Can map sequences to sequences of different lengths • Used for: Image captioning, music generation
• Example formula:
RNN Basics
h0 = fW (x)
RNN Structure
ht = fW (ht−1 , 0) or fW (ht−1 , yt−1 )
• Has loops allowing information to persist yt = g(ht )
• Maintains a hidden state that is updated at each time step
• Same function with same parameters applied at each step Sequence-to-Sequence (Encoder-Decoder)
• Input at time t: xt
• Hidden state at time t: ht • Input: Sequence
• Output at time t: yt • Output: Sequence (different length)
• Two-stage process:
Basic RNN Computation
– Encoder: Process input sequence into a context vector
• Initialize hidden state h0 (often set to zeros) – Decoder: Generate output sequence from context vector
• For each time step t: • Used for: Machine translation, text summarization
– ht = tanh(Whh ht−1 + Wxh xt + bh )
Character-level Language Model Example
– yt = Why ht + by
Training Process
• Whh : Weights for hidden-to-hidden connections
• Wxh : Weights for input-to-hidden connections • Input: Sequence of characters from vocabulary
• Why : Weights for hidden-to-output connections • Output: Prediction of next character at each step
• bh , by : Bias terms • Example: For ”hello”, predict ”e” from ”h”, ”l” from ”e”, etc.
• Loss: Sum of cross-entropy losses at each step
Weight Sharing • For vocabulary V , use one-hot encoding for input/output

• Same weights W used at every time step Sampling/Generation


• Reduces number of parameters to learn
• Allows processing of sequences of any length • Start with a seed character or random initialization
• Enables generalization to unseen sequence lengths • At each step:
– Forward pass through RNN to get probability distribution LSTM Components
– Sample next character from this distribution
• Cell state (ct ): Main memory pipeline
– Feed sampled character as input for next step • Hidden state (ht ): Output state at each step
• Continue until desired length or special token reached • Forget gate (ft ): Controls what to forget from cell state
• Input gate (it ): Controls what new information to store
Training RNNs • Gate gate/Candidate (gt ): Creates new candidate values
• Output gate (ot ): Controls what to output from cell state
Backpropagation Through Time (BPTT)
LSTM Update Equations
• Extension of backpropagation for sequences
• Unroll the RNN through time steps
• Forward pass through entire sequence
• Backward pass to calculate gradients ft = σ(Wf · [ht−1 , xt ] + bf )
• Update weights based on accumulated gradients it = σ(Wi · [ht−1 , xt ] + bi )
• Computational graph grows with sequence length
gt = tanh(Wg · [ht−1 , xt ] + bg )
Truncated BPTT ot = σ(Wo · [ht−1 , xt ] + bo )
ct = ft ⊙ ct−1 + it ⊙ gt
• Run forward and backward passes on chunks of the sequence
• Carry hidden states forward between chunks ht = ot ⊙ tanh(ct )
• Backpropagate only for a fixed number of steps
where ⊙ represents element-wise multiplication.
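A sketch of one LSTM step implementing the update equations above. For brevity it stacks the four gate weight matrices (Wf, Wi, Wg, Wo) into a single matrix and splits the result; that packing is my own simplification, not the lecture's notation.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W has shape (hidden+input, 4*hidden), gates computed jointly."""
    hidden = h_prev.shape[0]
    z = np.concatenate([h_prev, x_t]) @ W + b     # [h_{t-1}, x_t] times stacked weights
    f = sigmoid(z[0*hidden:1*hidden])             # forget gate
    i = sigmoid(z[1*hidden:2*hidden])             # input gate
    g = np.tanh(z[2*hidden:3*hidden])             # candidate values
    o = sigmoid(z[3*hidden:4*hidden])             # output gate
    c_t = f * c_prev + i * g                      # cell state update
    h_t = o * np.tanh(c_t)                        # hidden state
    return h_t, c_t

hidden, inp = 8, 4
W = np.random.randn(hidden + inp, 4 * hidden) * 0.1
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(np.random.randn(inp), h, c, W, b)
print(h.shape, c.shape)   # (8,) (8,)
```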
• Reduces memory requirements
• Makes training feasible for long sequences
• Trade-off: May lose long-term dependencies

Challenges with Vanilla RNNs

Vanishing Gradient Problem
• Gradients can become exponentially small over many time steps
• During backpropagation:
  ∂ht/∂ht−n = ∏(i = t−n+1 .. t) ∂hi/∂hi−1
• Each term involves multiplication by Whh^T and the derivative of tanh
• If largest singular value of Whh < 1, gradients vanish exponentially
• Result: RNN cannot learn long-term dependencies

Exploding Gradient Problem
• Gradients can become exponentially large over many time steps
• Occurs when largest singular value of Whh > 1
• Results in unstable training and model divergence
• Solution: Gradient clipping
  – If ∥∇∥ > threshold, scale gradient: ∇ ← (threshold · ∇) / ∥∇∥

Long Short-Term Memory (LSTM)

LSTM Architecture
• Designed to overcome vanishing gradient problem
• Introduces a cell state that acts as a conveyor belt of information
• Uses gates to control information flow
• Can maintain information over long sequences
• Better at capturing long-term dependencies

How LSTM Prevents Vanishing Gradients
• Cell state provides direct path for gradient flow
• Gradient can flow through cell state without repeated multiplication by weights
• Forget gate allows selective retention of information
• Error can be backpropagated over many time steps

LSTM Variants

Gated Recurrent Unit (GRU)
• Simplified variant of LSTM
• Combines forget and input gates into a single "update gate"
• No separate cell state; updates hidden state directly
• Fewer parameters than LSTM
• Often comparable performance with lower computational cost
• Update equations:
  zt = σ(Wz · [ht−1, xt])
  rt = σ(Wr · [ht−1, xt])
  h̃t = tanh(W · [rt ⊙ ht−1, xt])
  ht = (1 − zt) ⊙ ht−1 + zt ⊙ h̃t

Bidirectional RNN/LSTM
• Processes sequence in both forward and backward directions
• Captures context from both past and future states
• Combines two hidden states: a forward state h→t and a backward state h←t
• Output depends on both states: yt = g(h→t, h←t)
• Used in: speech recognition, machine translation, NLP tasks
• Limitation: Cannot be used for real-time applications
RNN Applications • LSTM:
Image Captioning – More complex architecture, more parameters
– Better at capturing long-term dependencies
• Uses CNN to encode image features
– More resistant to vanishing gradients
• RNN/LSTM generates text description
• Process: – Slower computation but better performance
– CNN extracts image features Common Issues & Solutions
– Features initialize RNN/LSTM hidden state
• Vanishing gradients:
– RNN/LSTM generates caption word by word
– Use LSTM/GRU instead of vanilla RNN
– Each word prediction conditioned on previous words and image
– Apply gradient clipping
Question Answering – Use appropriate activation functions
• Exploding gradients:
• Process text passage and question using RNNs
• Encode both text and question into vector representations – Apply gradient clipping
• Use attention mechanism to focus on relevant parts of text – Use proper weight initialization
• Generate or extract answer from the passage
• Variants: Machine comprehension, visual question answering • Poor convergence:
– Reduce learning rate
Machine Translation
– Apply batch normalization or layer normalization
• Sequence-to-sequence model with encoder-decoder architecture – Use adaptive optimizers (Adam, RMSprop)
• Encoder RNN processes source language sentence
• Decoder RNN generates target language translation Key Exam Topics
• Attention mechanism allows focus on relevant source words
• Current state-of-the-art uses Transformers, but RNN/LSTM still used in some applications • RNN architecture and computational graph
• LSTM components and information flow
Practical Considerations • Vanishing/exploding gradient problems and solutions
• Different RNN architectures (many-to-many, many-to-one, etc.)
Hyperparameters
• Backpropagation through time (BPTT)
• Hidden state size: Typically 128-512 • Applications of RNNs (NLP, image captioning, etc.)
• Number of layers: Typically 1-3 (stacked RNNs) • Differences between vanilla RNN and LSTM
• Sequence length/BPTT truncation: Task-dependent • Bidirectional RNNs and when to use them
• Learning rate: Often lower than feedforward networks (1e-3 to 1e-4)
• Dropout: Applied between layers, not within recurrence CSCI 566: Graph Neural Networks - Lecture 6
• Gradient clipping threshold: Typically 1.0-5.0
Introduction to Graph Neural Networks
Implementation Tips
Why Graphs?
• Initialize hidden state with zeros or learned parameters
• Use gradient clipping to prevent exploding gradients • Graphs are a general language for describing and analyzing entities with relations/interactions
• Apply layer normalization for more stable training • Many real-world data naturally form graphs: social networks, knowledge graphs, protein interactions,
• Consider bidirectional models when future context is available molecules, etc.
• Package sequences of similar lengths in same batch • Explicitly modeling relationships improves performance for prediction tasks
• Use LSTM or GRU instead of vanilla RNN for most tasks • Captures complex dependencies that are missed by standard neural networks
• Apply dropout to non-recurrent connections only
Examples of Graph Data
RNN vs LSTM Comparison
• Social networks (users, friendships)
• Vanilla RNN: • Citation networks (papers, citations)
• Knowledge graphs (entities, relations)
– Simpler architecture, fewer parameters • Molecules (atoms, bonds)
– Struggles with long-term dependencies • Protein-protein interaction networks
• Road networks (intersections, roads)
– Suffers from vanishing/exploding gradients • Scene graphs (objects, relationships)
– Faster computation • Computer networks (devices, connections)
Challenges in Analyzing Graphs Graph Representation Learning
• Arbitrary size and complex topological structure Goal of Graph Representation Learning
• No fixed node ordering or reference point
• Often dynamic and have multimodal features
• No spatial locality like grids (unlike images) • Map nodes/graphs to low-dimensional vector space
• Standard deep learning architectures (CNN, RNN) not directly applicable • Encode structural and feature information in embeddings
• Enable downstream ML tasks (classification, link prediction)
Graph Definitions & Representations • Formally: Learn function f : v → Rd mapping nodes to d-dimensional embeddings
• Features should preserve graph similarity in embedding space
Basic Graph Definitions
Encoder-Decoder Framework
• Graph: G = (V, E)
• Nodes/Vertices: V = {v1 , v2 , ..., vn }
• Edges/Links: E ⊆ V × V • Encoder: Maps nodes to vector embeddings ENC(v) = zv
• Adjacency matrix: A ∈ Rn×n where Aij = 1 if (vi , vj ) ∈ E, else 0 • Similarity function: Measures similarity between nodes in embedding space
• Node features: X ∈ Rn×d where d is feature dimension • Decoder: Maps similarity in embedding space to similarity in original graph
• Neighborhood of node v: N (v) = {u ∈ V |(v, u) ∈ E} • Goal: Maximize similarity for node pairs that are similar in the original graph

Types of Graphs
Shallow Encoding Approach
• Undirected graphs: Edges have no direction, Aij = Aji
• Directed graphs: Edges have direction, Aij may not equal Aji • Encoder is a simple embedding lookup: ENC(v) = zv = Z · v
• Weighted graphs: Edges have weights, Aij ∈ R • Z ∈ Rd×|V | is the embedding matrix
• Bipartite graphs: Nodes can be divided into two disjoint sets with edges only between sets • v is a one-hot encoding of node v
• Heterogeneous graphs: G = (V, E, R, T ) with node types T (vi ) and relation types r ∈ R • Directly optimize embedding of each node
• Examples: DeepWalk, node2vec
Graph Machine Learning Tasks • Limitation: Cannot generalize to unseen nodes
Node-level Tasks

• Node classification: Predict labels for nodes (e.g., protein function) Graph Neural Networks: Basic Concepts
• Node regression: Predict continuous values for nodes
• Node clustering: Group similar nodes GNN Design Principles
• Anomaly detection: Identify unusual nodes

Edge-level Tasks • Deep encoder rather than shallow encoder


• Use neural networks to learn representations
• Link prediction: Predict new/missing/future edges • Share parameters across different positions in the graph
• Edge classification: Predict edge types/properties • Preserve permutation invariance (node ordering shouldn’t matter)
• Recommender systems: Predict user-item interactions • Build representations based on local network neighborhoods
• Drug-drug interaction prediction
Permutation Invariance & Equivariance
Graph-level Tasks

• Graph classification: Assign labels to entire graphs (e.g., molecule properties) • Permutation invariance: For a function f , f (P X) = f (X) for any permutation P
• Graph regression: Predict continuous values for graphs • Permutation equivariance: For a function f , f (P X) = P f (X) for any permutation P
• Graph generation: Create new graphs with desired properties • Graph invariant: Graph-level predictions shouldn’t depend on node ordering
• Graph similarity: Measure how similar two graphs are • Graph equivariant: Node-level predictions should be consistently mapped when nodes are reordered
• Standard MLPs are neither invariant nor equivariant to permutations
Real-world Applications
Computation Graph
• Recommender systems (Pinterest, e-commerce)
• Traffic prediction (Google Maps ETA)
• Drug discovery (antibiotic discovery) • For a target node, recursively expand neighborhood to create a computation graph
• Knowledge graph completion • Layer k incorporates information from nodes k hops away from target
• Protein-protein interaction prediction • Each node’s representation depends on its neighborhood
• Fake account detection in social networks • Information propagates through the graph via message passing
Graph Neural Networks: Architecture

Message Passing Framework
• Initialize node embeddings: hv(0) = xv
• For each layer l and each node v:
  – Aggregate information from neighbors: av(l) = AGGREGATE(l)({hu(l−1) : u ∈ N(v)})
  – Update node embedding: hv(l) = UPDATE(l)(hv(l−1), av(l))
• After L layers, final node representation: zv = hv(L)

Graph Convolutional Network (GCN)
• Simplified model where aggregate and update are combined:
  hv(l) = σ(W(l) Σ_{u ∈ N(v)∪{v}} hu(l−1) / |N(v)∪{v}|)
• Mean aggregator function (averages neighbor embeddings)
• W(l) is a learnable weight matrix for layer l
• σ is a non-linear activation function (ReLU)
• Matrix form: H(l) = σ(D̃^(−1/2) Ã D̃^(−1/2) H(l−1) W(l))

GraphSAGE
• Separate aggregate and update steps:
  – av(l) = AGGREGATE(l)({hu(l−1) : u ∈ N(v)})
  – hv(l) = σ(W(l) · [hv(l−1) ∥ av(l)])
• Aggregate functions: Mean, Max, LSTM, Pooling
• Concatenation of self and neighbor features (∥ denotes concatenation)
• Final step: Normalization hv(l) ← hv(l) / ∥hv(l)∥2

Graph Isomorphism Network (GIN)
• Most expressive GNN model (as powerful as the Weisfeiler-Lehman graph isomorphism test)
• Update rule: hv(l) = MLP(l)((1 + ϵ(l)) · hv(l−1) + Σ_{u ∈ N(v)} hu(l−1))
• ϵ can be learned or fixed
• Uses sum aggregator instead of mean

Inductive vs. Transductive Learning
• Transductive: Test nodes seen during training (only labels unknown)
• Inductive: Generalize to completely unseen nodes/graphs
• GNNs are naturally inductive: Same parameters used for all nodes
• Can generate embeddings for new nodes on the fly

Advanced GNN Architectures

Graph Attention Networks (GAT)
• Assigns different weights to different neighbors using attention
• Attention coefficients: evu = LeakyReLU(a^T [W hv ∥ W hu])
• Normalized attention weights: αvu = softmax_u(evu) = exp(evu) / Σ_{k ∈ N(v)} exp(evk)
• Node update: hv(l) = σ(Σ_{u ∈ N(v)} αvu W(l) hu(l−1))
• Often uses multi-head attention for stability

Graph Pooling Methods
• Simple pooling: Mean, Max, Sum of node embeddings
• Hierarchical pooling: Cluster nodes and pool within clusters
• Examples: DiffPool, SAGPool, TopKPool
• Graph classification: zG = READOUT({zv | v ∈ G})
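A minimal sketch of one mean-aggregation message-passing layer in the spirit of the GCN update above (self-loops added, neighbors averaged, linear transform, ReLU). The toy graph, sizes, and function name are my own; the symmetric D̃^(−1/2) normalization of the matrix form is not used here.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One mean-aggregation message-passing layer: average N(v) ∪ {v}, transform, ReLU."""
    A_hat = A + np.eye(len(A))                # add self-loops: N(v) ∪ {v}
    deg = A_hat.sum(axis=1, keepdims=True)    # |N(v) ∪ {v}| per node
    H_agg = (A_hat @ H) / deg                 # average neighbor embeddings
    return np.maximum(0, H_agg @ W)           # apply weights and ReLU non-linearity

# Toy graph: 4 nodes in a path 0-1-2-3, 3-dim input features, 2-dim output
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.random.randn(4, 3)
W = np.random.randn(3, 2)
print(gcn_layer(A, H, W).shape)   # (4, 2)
```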
Comparison with Other Neural Networks
• CNN: Special case of GNN for grid-structured data with fixed neighborhood size and ordering
• CNN formulation: hv(l+1) = σ(Σ_{u ∈ N(v)} Wu(l) hu(l) + b(l) hv(l))
• CNN learns different weights for each spatial position in the filter
• GNN: Same weight for all neighbors, but different neighborhoods for each node
• Transformer: Can be viewed as a GNN operating on a fully-connected graph with attention

LLMs for Graphs

Text-Attributed Graphs (TAGs)

Training GNNs • Graphs where nodes have text attributes (papers, products, etc.)
• Conventional shallow text embeddings can miss crucial context
Supervised Learning • LLMs have strong semantic understanding of text

• For node classification: LLMs-as-Enhancers


– Given labeled nodes (v, yv ) in training set • LLMs enrich/encode node text; GNN makes final prediction
– Compute node embeddings zv = GNN(G, v) • Feature-level enhancement:
– Apply classifier ŷv = f (zv ) – Cascading: Sentence-BERT or e5 embeddings fed to GNN
P
– Loss: L = v∈Vtrain loss(yv , ŷv ) – Iterative: GNN & PLM exchange pseudo-labels
• For graph classification: • Text-level enhancement:
– Pool node embeddings to get graph embedding: zG = READOUT({zv |v ∈ G}) – TAPE: LLM writes pseudo labels + explanations
– Apply classifier: ŷG = f (zG ) – KEA: LLM adds knowledge entities to node text

Common Loss Functions
• Classification: Cross-entropy loss
• Regression: Mean squared error
• Link prediction: Binary cross-entropy, margin-based ranking loss

LLMs-as-Predictors
• LLM directly sees text (and optional neighbor info) and chooses label
• Zero/Few-shot: Node text in prompt → LLM decides label – Ensures all features contribute equally to learning
• Adding graph structure through neighbor text summaries • PCA Whitening: Transform data to remove correlation between features
• LLM as annotator: LLM labels some nodes → train a GNN
Practical Tips & Limitations

Common Issues & Solutions

• Over-smoothing: As layers increase, node embeddings become similar
  – Solution: Skip connections, JK networks, PairNorm
• Over-fitting: Especially with small datasets
  – Solution: Dropout, weight decay, data augmentation
• Scalability: Full-batch training requires the entire graph in memory
  – Solution: Mini-batch training, graph sampling, neighbor sampling

Hyperparameter Considerations

• Number of GNN layers: Usually 2-3 (due to over-smoothing)
• Hidden dimensions: Often 64-256
• Learning rate: Typically 0.01-0.001
• Dropout rate: 0.1-0.5
• Aggregation function: Mean vs. Sum vs. Max

Key Exam Topics

• Graph representation basics (nodes, edges, adjacency matrix)
• Types of graph machine learning tasks
• Message passing framework and basic GNN architectures (GCN, GraphSAGE)
• Permutation invariance/equivariance properties
• Inductive vs. transductive learning
• Comparison with other neural network architectures (CNN, Transformer)
• Aggregation mechanisms (mean, sum, max, attention)
• Training and evaluation of GNNs
CSCI 566: Training Neural Networks & AutoML - Lecture 7

Data Preprocessing

Why Preprocess Data?

• Activation functions (like ReLU) work optimally around zero
• Randomly initialized layers can produce biased outputs if input data is biased
• Features with small scale have negligible effect on backpropagation
• Helps models converge faster and perform better

Common Preprocessing Techniques

• Zero-centering: Subtract mean from each feature
  – x_centered = x − µ
  – Makes gradient updates more symmetric
• Normalization: Scale features to standard deviation of 1
  – x_normalized = (x − µ) / σ
  – Ensures all features contribute equally to learning
• PCA Whitening: Transform data to remove correlation between features
  – Rotate data along principal components
  – Scale each dimension to unit variance
  – Rarely used in practice for images

Preprocessing in Practice

• Images:
  – Subtract mean image (e.g., AlexNet)
  – Subtract per-channel mean (e.g., VGGNet)
  – Divide by per-channel standard deviation (e.g., ResNet)
• Text:
  – Tokenization, lowercasing, stemming
  – Transform to fixed-length vectors (word embeddings)
• Tabular data:
  – Standardize numerical features
  – Encode categorical features (one-hot, target encoding)
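A minimal NumPy sketch of the zero-centering and normalization steps above, using synthetic arrays as a stand-in; the key point is that the statistics come from the training split and are reused at test time.

    import numpy as np

    rng = np.random.default_rng(0)
    X_train = rng.normal(loc=5.0, scale=3.0, size=(1000, 20))  # stand-in training features
    X_test = rng.normal(loc=5.0, scale=3.0, size=(200, 20))

    mean = X_train.mean(axis=0)            # zero-centering statistic
    std = X_train.std(axis=0) + 1e-8       # avoid division by zero
    X_train_norm = (X_train - mean) / std  # x_normalized = (x - mean) / std
    X_test_norm = (X_test - mean) / std    # same training statistics at test time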
Batch Normalization

Internal Covariate Shift

• During training, the distribution of each layer's inputs changes as the parameters of the previous layers change
• This slows down training by requiring lower learning rates
• As networks become deeper, distribution shifts become more severe

Batch Normalization Operation

• Compute mean and variance for each feature in the mini-batch:
  µ_B = (1/m) Σ_{i=1}^{m} x_i
  σ_B² = (1/m) Σ_{i=1}^{m} (x_i − µ_B)²
• Normalize:
  x̂_i = (x_i − µ_B) / √(σ_B² + ε)
• Scale and shift (learnable parameters):
  y_i = γ x̂_i + β

Batch Normalization Benefits

• Improves gradient flow through the network
• Allows higher learning rates
• Reduces strong dependence on initialization
• Acts as a form of regularization
• Enables training of much deeper networks

Batch Normalization at Test Time

• During testing, batch statistics aren't used
• Instead, use running averages of mean and variance from training:
  µ_running = α · µ_running + (1 − α) · µ_B
  σ²_running = α · σ²_running + (1 − α) · σ_B²
• Typically added after convolutional/fully connected layers and before non-linearities
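The whole operation fits in a few lines. Here is a sketch of a batch-norm forward pass for fully connected activations of shape (batch, features); gamma and beta are the learnable scale and shift, and the running averages are what the layer falls back on at test time.

    import numpy as np

    def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                          momentum=0.9, eps=1e-5, training=True):
        if training:
            mu = x.mean(axis=0)                      # per-feature mini-batch mean
            var = x.var(axis=0)                      # per-feature mini-batch variance
            running_mean = momentum * running_mean + (1 - momentum) * mu
            running_var = momentum * running_var + (1 - momentum) * var
        else:
            mu, var = running_mean, running_var      # use running averages at test time
        x_hat = (x - mu) / np.sqrt(var + eps)        # normalize
        y = gamma * x_hat + beta                     # scale and shift
        return y, running_mean, running_var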
Activation Functions

Sigmoid

• σ(x) = 1 / (1 + e^(−x))
• Squashes numbers to range [0,1]
• Historically popular, but has problems:
  – Saturated neurons "kill" gradients
  – Outputs are not zero-centered
  – Computationally expensive (exponential)

Tanh

• tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
• Squashes numbers to range [-1,1]
• Zero-centered outputs
• Still suffers from vanishing gradients when saturated

ReLU (Rectified Linear Unit)

• f(x) = max(0, x)
• Benefits:
  – Does not saturate in positive region
  – Computationally efficient
  – Converges much faster than sigmoid/tanh
• Issues:
  – Not zero-centered
  – "Dying ReLU" problem - neurons can get stuck at 0

Leaky ReLU

• f(x) = max(αx, x), where α is small (e.g., 0.01)
• Does not saturate
• Prevents the dying ReLU problem
• Computationally efficient

Parametric ReLU (PReLU)

• Like Leaky ReLU, but α is a learnable parameter
• Learns the optimal negative slope

ELU (Exponential Linear Unit)

• f(x) = x if x > 0, α(e^x − 1) if x ≤ 0
• All benefits of ReLU
• Closer to zero mean outputs
• Negative saturation regime adds robustness to noise

Activation Function TLDR

• Use ReLU as default, but be careful with learning rates
• Try Leaky ReLU/ELU for potential performance gains
• Don't use sigmoid or tanh for hidden layers
• Sigmoid is still useful for binary classification output
• Softmax is used for multi-class classification output
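For reference, the activation functions above in a few lines of NumPy (tanh is already provided as np.tanh):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))                       # squashes to (0, 1)

    def relu(x):
        return np.maximum(0.0, x)                             # zeroes out negative inputs

    def leaky_relu(x, alpha=0.01):
        return np.maximum(alpha * x, x)                       # small slope for x < 0

    def elu(x, alpha=1.0):
        return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))  # smooth negative saturation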

Weight Initialization

Importance of Initialization

• Initialization affects convergence speed and final performance
• Poor initialization can lead to:
  – Vanishing gradients
  – Exploding gradients
  – Poor convergence or no learning

Small Random Numbers

• Initialize with small random values from a normal distribution
• E.g., W ∼ N(0, 0.01)
• Works for shallow networks but problematic for deep networks
• Deep networks tend to have vanishing activations with small random initialization

Xavier/Glorot Initialization

• For layers with tanh or sigmoid activations
• W ∼ N(0, 1/n_in) or W ∼ N(0, 2/(n_in + n_out))
• Where n_in is the number of inputs to the layer
• For conv layers, n_in = filter size² × input channels
• Maintains variance of activations and gradients across layers
• Derived from the principle that output variance should match input variance

Kaiming/He Initialization

• Specialized for ReLU activations
• W ∼ N(0, 2/n_in)
• Accounts for the fact that ReLU outputs are one-sided and have variance reduced by half
• Default choice for training deep networks with ReLU activations
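A small sketch of the two initialization rules for a fully connected layer of shape (n_in, n_out); only the variance of the Gaussian differs.

    import numpy as np

    def xavier_init(n_in, n_out):
        # variance 1/n_in (the symmetric variant uses 2/(n_in + n_out))
        return np.random.randn(n_in, n_out) * np.sqrt(1.0 / n_in)

    def kaiming_init(n_in, n_out):
        # variance 2/n_in compensates for ReLU zeroing out half of the activations
        return np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)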
Optimization Algorithms

Challenges in Optimization

• Poor conditioning: Loss changes quickly in some directions, slowly in others
• Local minima and saddle points: Zero gradient can trap the optimizer
• Noisy gradients: Mini-batch gradients introduce variance

Gradient Descent (GD)

• θ_{t+1} = θ_t − α ∇_θ J(θ_t)
• Uses all data to update model weights
• Computationally expensive for large datasets

Stochastic Gradient Descent (SGD)

• θ_{t+1} = θ_t − α ∇_θ J(θ_t; x^(i), y^(i))
• Uses a single example to compute the gradient
• In practice, mini-batch SGD is used (32/64/128 examples)
• Noisy updates but faster iteration
• Can escape shallow local minima

SGD with Momentum

• Accumulates a velocity vector in directions of persistent reduction:
  v_{t+1} = ρ v_t + ∇_θ J(θ_t)
  θ_{t+1} = θ_t − α v_{t+1}
• ρ is the momentum coefficient (typically 0.9 or 0.99)
• Reduces oscillations in steep directions
• Accelerates progress in shallow directions
• Helps escape local minima and saddle points

AdaGrad

• Adapts learning rates for each parameter based on historical gradients:
  g_{t,i} = ∇_{θ_i} J(θ_t)
  s_{t,i} = s_{t−1,i} + g_{t,i}²
  θ_{t+1,i} = θ_{t,i} − (α / √(s_{t,i} + ε)) g_{t,i}
• "Per-parameter learning rates"
• Dampens progress along steep directions
• Accelerates progress along flat directions
• Issue: Learning rates decay to zero over time

RMSProp

• "Leaky AdaGrad" - uses an exponential moving average:
  s_{t,i} = β s_{t−1,i} + (1 − β) g_{t,i}²
  θ_{t+1,i} = θ_{t,i} − (α / √(s_{t,i} + ε)) g_{t,i}
• Prevents learning rate decay to zero
• Works well for RNNs and non-stationary objectives

Adam (Adaptive Moment Estimation)

• Combines momentum and RMSProp:
  m_t = β_1 m_{t−1} + (1 − β_1) g_t  (momentum)
  v_t = β_2 v_{t−1} + (1 − β_2) g_t²  (RMSProp)
  m̂_t = m_t / (1 − β_1^t)  (bias correction)
  v̂_t = v_t / (1 − β_2^t)  (bias correction)
  θ_{t+1} = θ_t − α m̂_t / (√v̂_t + ε)
• Typical values: β_1 = 0.9, β_2 = 0.999, ε = 10^(−8)
• Learning rate α = 0.001 or 0.0005 is a good starting point
• Great default choice for most problems
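The Adam update above, written out as a single NumPy step; m and v carry the running first and second moments between calls, and t is the 1-based step count used for bias correction.

    import numpy as np

    def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        m = beta1 * m + (1 - beta1) * grad       # first moment (momentum)
        v = beta2 * v + (1 - beta2) * grad ** 2  # second moment (RMSProp-style)
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v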

Learning Rate Schedules

Fixed Learning Rate

• Use the same learning rate throughout training
• Simple but often suboptimal

Step Decay

• Reduce the learning rate at fixed intervals
• α = α_0 × γ^⌊t/s⌋
• where t is the epoch, s is the step size, and γ is the decay rate
• E.g., ResNet: multiply learning rate by 0.1 after epochs 30, 60, and 90

Cosine Decay

• Smooth decay following a cosine function:
  α_t = α_0 · (1 + cos(πt/T)) / 2
• where α_0 is the initial learning rate, t is the current epoch, T is the total number of epochs
• Widely used in state-of-the-art models

Linear Decay

• Linear decrease from the initial to the final learning rate:
  α_t = α_0 · (1 − t/T)
• Used in models like BERT

Learning Rate Warmup

• Start with a small learning rate and gradually increase it
• Prevents loss explosion at the beginning of training
• Typically increase linearly from 0 over the first few thousand iterations
• Rule of thumb: If increasing the batch size by N, scale the initial learning rate by N

Cyclical Learning Rates

• Periodically vary the learning rate between bounds
• Can help escape saddle points and local minima
• Triangular policy: Linear increase then decrease
• May accelerate convergence and improve generalization
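A sketch combining the two ideas that usually appear together in practice — linear warmup followed by cosine decay — as a per-step schedule; the constants are illustrative.

    import math

    def lr_at_step(step, total_steps, base_lr=1e-3, warmup_steps=1000):
        if step < warmup_steps:
            return base_lr * step / warmup_steps                     # linear warmup from 0
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to 0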
Regularization Techniques

Early Stopping

• Monitor validation performance and stop when it begins to decrease
• Keep track of the model snapshot with the best validation performance
• Simple and effective form of regularization

L2 Regularization (Weight Decay)

• Add a term to the loss: L_reg = L + (λ/2) Σ_i w_i²
• Penalizes large weights
• Encourages weights to be small but not sparse
• Commonly used with values like λ = 0.0001 or 0.00001

L1 Regularization

• Add a term to the loss: L_reg = L + λ Σ_i |w_i|
• Encourages sparse weights (many zeros)
• Useful for feature selection

Dropout

• Randomly set some neurons to zero during training
• Each forward pass uses a different random subset of neurons
• Forces the network to learn redundant representations
• Prevents co-adaptation of features
• Works as an implicit model ensemble
• Common dropout probability: 0.5 for hidden layers, 0.1-0.2 for the input layer

Dropout at Test Time

• Standard approach: Scale outputs by the dropout probability at test time
• For dropout rate p, multiply outputs by (1 − p)
• Alternatively, use "inverted dropout" during training: divide by (1 − p) during training, then use the weights unchanged at test time
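A minimal sketch of inverted dropout: the mask is scaled by 1/(1 − p) during training, so the test-time forward pass is just the identity.

    import numpy as np

    def dropout_forward(x, p=0.5, training=True):
        if not training or p == 0.0:
            return x                                        # no rescaling needed at test time
        mask = (np.random.rand(*x.shape) >= p) / (1.0 - p)  # keep ~ (1-p) of units, rescale
        return x * mask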
Data Augmentation

• Create additional training examples through transformations
• Common image augmentations:
  – Horizontal flips
  – Random crops and scales
  – Color jitter (brightness, contrast, saturation)
  – Rotations, translations, stretching, shearing
• Works as an implicit regularizer by increasing the training set size
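A typical training-time pipeline for the image augmentations listed above, assuming torchvision is available; the specific transforms and magnitudes are illustrative choices, not a prescribed recipe.

    from torchvision import transforms  # assumed available

    train_transform = transforms.Compose([
        transforms.RandomResizedCrop(224),   # random crops and scales
        transforms.RandomHorizontalFlip(),   # horizontal flips
        transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
        transforms.RandomRotation(15),       # small rotations
        transforms.ToTensor(),
    ])
    # Apply train_transform only in the training Dataset; keep the test pipeline deterministic.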

Other Regularization Methods

• DropConnect: Drop connections between neurons
• Stochastic Depth: Skip some layers randomly during training
• Cutout: Set random image regions to zero
• Mixup: Train on random blends of pairs of images and their labels
AutoML: Automated Machine Learning

What is AutoML?

• Automated selection of algorithms and hyperparameters
• Reduces the manual effort in machine learning pipeline design
• Three main components:
  – Hyperparameter tuning
  – Algorithm selection
  – Neural architecture search (NAS)

Hyperparameter Tuning Methods

• Grid Search:
  – Try all combinations from a predefined set of values
  – Suffers from curse of dimensionality
• Random Search:
  – Randomly sample hyperparameters from defined distributions
  – More efficient than grid search for high dimensions
• Bayesian Optimization:
  – Build surrogate model to predict performance of hyperparameters
  – Use acquisition function to balance exploration and exploitation
  – More efficient for expensive evaluations

Bayesian Optimization Process

• Start with a few (random) configurations
• Build surrogate model (typically Gaussian Process)
• Predict both expected performance and uncertainty
• Use acquisition function (e.g., Upper Confidence Bound, Expected Improvement)
• Select most promising configuration to evaluate next
• Update surrogate model with new results
• Repeat until convergence or budget is exhausted
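Random search is simple enough to write from scratch. A sketch under the assumption that train_and_eval(cfg) is a user-supplied function that trains a model with the given configuration and returns a validation score; the search space here is illustrative.

    import random

    def random_search(train_and_eval, n_trials=20):
        best_cfg, best_score = None, float("-inf")
        for _ in range(n_trials):
            cfg = {
                "lr": 10 ** random.uniform(-4, -2),           # log-uniform learning rate
                "dropout": random.uniform(0.1, 0.5),
                "hidden_dim": random.choice([64, 128, 256]),
            }
            score = train_and_eval(cfg)                       # e.g., validation accuracy
            if score > best_score:
                best_cfg, best_score = cfg, score
        return best_cfg, best_score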

Meta-Learning for AutoML

• Learn from experience across many datasets and tasks
• Transfer knowledge to new datasets/tasks
• Approaches:
  – Learning to recommend algorithms based on dataset characteristics
  – Warm-starting hyperparameter optimization
  – Learning initialization strategies for neural networks

Neural Architecture Search (NAS)

• Automatically discover optimal neural network architectures
• Components to search for:
  – Number of layers
  – Types of layers
  – Connectivity patterns
  – Layer-specific hyperparameters
• Methods:
  – Reinforcement learning-based search
  – Evolutionary algorithms
  – Gradient-based approaches

AutoML Tools

• Auto-sklearn: Automated algorithm selection and hyperparameter tuning
• H2O AutoML: Automated ensembling of multiple algorithms
• TPOT: Uses genetic programming for pipeline optimization
• Google Cloud AutoML: End-to-end automated machine learning platform
Transfer Learning

What is Transfer Learning?

• Leverage knowledge from one task to improve performance on another
• Particularly effective when the target task has limited data
• Pre-train a model on a large source dataset, then adapt it to the target task

Transfer Learning with CNNs

• Pre-train on a large dataset (e.g., ImageNet)
• Re-use the learned features for the new task
• Approaches based on dataset size and similarity:
  – Small dataset, similar domain: Linear classifier on top of the frozen network
  – Small dataset, different domain: Linear classifier on multiple layers
  – Large dataset, similar domain: Fine-tune a few layers
  – Large dataset, different domain: Fine-tune more layers

Fine-tuning Strategies

• Feature extraction: Freeze pre-trained weights, train only the new layers
• Fine-tuning: Re-train some or all pre-trained weights
• Use a lower learning rate when fine-tuning (e.g., 1/10th of the original)
• Start by freezing earlier layers, which capture more generic features
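A sketch of the feature-extraction strategy with torchvision (the weights argument follows recent torchvision versions; older releases use pretrained=True): freeze the ImageNet backbone, replace the head, and optimize only the new layer at a reduced learning rate.

    import torch.nn as nn
    import torch.optim as optim
    from torchvision import models  # assumed available

    model = models.resnet18(weights="IMAGENET1K_V1")        # ImageNet-pretrained backbone
    for param in model.parameters():
        param.requires_grad = False                         # freeze pre-trained weights
    model.fc = nn.Linear(model.fc.in_features, 10)          # new head for 10 target classes
    optimizer = optim.Adam(model.fc.parameters(), lr=1e-4)  # lower LR; only the new head trains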
Parameter Efficient Fine-Tuning (PEFT)

• Adapters: Add small trainable modules to a frozen pre-trained model
• LoRA (Low-Rank Adaptation): Update pretrained weights with low-rank matrices
• Prefix Tuning: Add trainable prefix tokens to inputs
• Significantly reduces the number of parameters to train
• Enables efficient fine-tuning of large models
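A rough sketch of the LoRA idea for a single linear layer: the pretrained weight stays frozen and only the low-rank factors A and B (plus a scale) are trained; the rank and alpha values are illustrative defaults, not the method's required settings.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, rank=8, alpha=16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False                   # pretrained weights stay frozen
            self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
            self.B = nn.Parameter(torch.zeros(rank, base.out_features))  # zero init: no change at start
            self.scale = alpha / rank

        def forward(self, x):
            # frozen output plus trainable low-rank update
            return self.base(x) + self.scale * (x @ self.A @ self.B)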
Multi-task Learning

• Train a single model to perform multiple related tasks
• Share parameters across tasks
• Benefits:
  – Improved performance by leveraging task relationships
  – Better generalization by learning shared representations
  – More efficient use of data
• Example: multilingual models sharing lower layers

Domain-Adversarial Training

• Addresses domain shift between source and target domains
• Learns features that are invariant to domain differences
• Architecture:
  – Feature extractor: Shared across domains
  – Label predictor: Maximize label classification accuracy
  – Domain classifier: Classify the domain (source vs. target)
  – The feature extractor is trained to maximize label accuracy but minimize domain classification accuracy

Practical Tips for Training

Hyperparameter Selection Strategy

• Step 1: Check the initial loss (should match the expected value)
• Step 2: Overfit a small sample to verify the model can learn
• Step 3: Find a learning rate that makes the loss go down
• Step 4: Run a coarse grid search for 1-5 epochs
• Step 5: Refine the grid and train longer (~10-20 epochs)
• Step 6: Analyze learning curves and model performance
• Step 7: Iterate and refine as needed

Diagnosing Learning Issues

• Train/val accuracy still increasing: Train longer
• Large train/val gap: Model is overfitting; increase regularization
• No gap but low accuracy: Model is underfitting; train longer or increase model capacity
• Loss plateaus early: Learning rate may be too low or optimization is stuck
• Loss explodes: Learning rate too high or poor initialization

Training Deep Networks

• Start with proven architectures (ResNet, Transformer)
• Use batch normalization (or layer normalization for Transformers)
• Use ReLU or variants (LeakyReLU, ELU) with Kaiming initialization
• Optimizer: Adam with learning rate 0.001 or 0.0005
• Learning rate schedule: Cosine decay with warmup
• Regularization: Weight decay + dropout + data augmentation
• Gradient clipping for RNNs and Transformers
• Track an exponential moving average of weights for better test performance

Key Exam Topics

• Data preprocessing techniques and their effects
• Batch normalization operation and benefits
• Different activation functions and their pros/cons
• Weight initialization methods (Xavier, Kaiming)
• Optimization algorithms (SGD, Momentum, Adam)
• Regularization techniques (L2, Dropout, Data Augmentation)
• Learning rate schedules and their impact
• Transfer learning strategies
• Hyperparameter tuning methods
• Common training issues and how to diagnose them
