CSCI 566 Midterm Study Guide

Core Machine Learning Concepts

Definition of ML
• Machine Learning: Algorithms that improve performance at a task with experience
• Three Components: Tasks, Experience, Performance

ML Tasks

Loss Functions
• Squared (L2) Loss: (f(xi; W) − yi)²
• Cross-entropy: −yi · log(softmax(f(xi; W)))
• Hinge Loss: max(0, 1 − yi · f(xi; W))
• L1 Loss: |f(xi; W) − yi|

Empirical Risk Minimization
• Empirical Minimizer: Minimizes the average loss on the training data
• Optimal Predictor: Minimizes the expected loss over all data
• Law of Large Numbers: Empirical mean → true mean as n → ∞

Optimization
• Random Search: Try random weights (inefficient)
• Analytic Solution: W = (XᵀX)⁻¹ · Xᵀ · y (for linear regression)
• Gradient Descent: Iterative numerical approach
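A minimal NumPy sketch contrasting the analytic least-squares solution with gradient descent on the squared loss (the data shapes, learning rate, and iteration count are illustrative assumptions, not course-specified values):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                 # toy design matrix (assumed shape)
    true_w = np.array([2.0, -1.0, 0.5])
    y = X @ true_w + 0.1 * rng.normal(size=100)

    # Analytic solution for linear regression: W = (X^T X)^(-1) X^T y
    w_analytic = np.linalg.inv(X.T @ X) @ X.T @ y

    # Gradient descent on the empirical risk (average squared loss)
    w, lr = np.zeros(3), 0.1                      # assumed learning rate
    for _ in range(500):
        grad = 2.0 / len(y) * X.T @ (X @ w - y)   # gradient of the mean squared error
        w -= lr * grad

    print(w_analytic, w)                          # both approach true_w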
Non-linear Models & Deep Learning

Why Non-linearity?
• Binary attributes create axis-aligned splits
• Continuous attributes allow arbitrary thresholds
• With continuous attributes, the maximum number of leaf nodes can be N

Crucial Exam Concepts
• Overfitting vs. Underfitting: Bias-variance tradeoff
• Parameter counting: Calculate parameters for different layers
• Vanishing/exploding gradients: Causes and solutions
• Activation function properties: Value ranges and derivatives
• Non-linearity importance: Enables modeling complex patterns
• Grading: May be curved based on score distribution

Graph Neural Networks (GNNs)

Key Properties
• PyGOD/PyOD: Open-source systems for outlier detection
• Three research directions:
  – AI Robustness/Trustworthiness
  – Structured/Generative AI
  – Scalable/Open-Source AI

CSCI 566: Classical ML Algorithms - Lecture 2
• Decision tree: Flowchart-like tree structure for classification/regression
• Components: Root node, internal nodes (tests on attributes), leaf nodes (predictions)
• Flow: Start at the root, apply tests at each node, follow branches until reaching a leaf

Overfitting and Pruning
• Overfitting: Tree too complex, fits noise in the training data
• Underfitting: Tree too simple, can't capture patterns
• Pruning: Removing branches to reduce complexity
  – Pre-pruning: Early stopping during tree construction
  – Post-pruning: Grow the full tree, then remove branches that don't improve validation error

Decision Boundaries
• Axis-aligned splits: Perpendicular to feature axes
• Complex boundaries: Deeper trees create more complex decision boundaries
• Depth trade-off: Deeper trees fit the training data better but risk overfitting

Decision Tree Advantages
• Easy to understand and interpret (white-box model)
• Handles both numerical and categorical data
• No normalization/scaling required
• Captures non-linear relationships
• Low computational cost for prediction
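A small scikit-learn sketch of the depth trade-off noted above (the synthetic dataset and the depth values are illustrative assumptions):

    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_moons(n_samples=500, noise=0.3, random_state=0)   # assumed toy data
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for depth in (1, 3, 10, None):       # None grows the full, unpruned tree
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
        print(depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    # Training accuracy keeps rising with depth; test accuracy typically peaks
    # at a moderate depth, illustrating how very deep trees overfit.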
Learning Algorithm
• Bad approach: Optimize on the training data (k=1 is always perfect)
• Bad approach: Optimize on the test data (no generalization estimate)
• Better approach: Use a validation set for hyperparameter selection
• Best approach: Cross-validation (especially for small datasets)

Cross-Validation
• Process: Split the data into n folds, train on n−1 folds, test on the remaining fold
• Repeat: Rotate the validation fold n times
• Result: More robust estimate of performance
• Common: 5-fold or 10-fold cross-validation

kNN Advantages
• Simple to understand and implement
• No training phase (lazy learner)
• Naturally handles multi-class problems
• Can learn complex decision boundaries
• Adaptable to new data (just add it to the training set)

kNN Disadvantages

Choosing K Value
• Elbow method: Plot WSS (Within-Cluster Sum of Squares) vs. k
• Silhouette score: Measures how similar points are to their own cluster vs. other clusters
• Domain knowledge: Based on application requirements

K-Means Initialization
• Random selection: Choose k random points as the initial centers
• K-means++:
  – Choose the first center randomly
  – Select subsequent centers with probability proportional to the squared distance from existing centers
• Multiple runs: Run the algorithm multiple times with different initializations

K-Means Limitations
• Assumes spherical, equally sized clusters
• Sensitive to initialization
• May converge to local optima
• Struggles with different cluster densities
• Cannot handle non-convex cluster shapes
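A minimal NumPy sketch of the k-means++ seeding described above (the function name and the toy usage are illustrative, not the course's reference implementation):

    import numpy as np

    def kmeans_pp_init(X, k, rng=np.random.default_rng(0)):
        """Pick the first center at random; pick later centers with probability
        proportional to the squared distance to the nearest chosen center."""
        centers = [X[rng.integers(len(X))]]
        for _ in range(k - 1):
            diffs = X[:, None, :] - np.array(centers)[None, :, :]
            d2 = np.min((diffs ** 2).sum(-1), axis=1)   # squared dist to nearest center
            centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
        return np.array(centers)

    X = np.random.rand(200, 2)
    print(kmeans_pp_init(X, k=3))   # 3 well-spread initial centers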
LLMs for Clustering
• Few-shot clustering: Using LLMs to enhance clustering with minimal supervision
• LLM applications in clustering:
  – Before clustering: Generate keyphrases to enrich the text representation
  – During clustering: Act as a pseudo-expert to provide similarity constraints
  – After clustering: Correct errors in low-confidence cases
• Benefits: Reduces the need for manual annotation, improves clustering quality

LLM Integration with Classical ML
• LLMs can enhance classical ML algorithms (kNN-LLM, Few-Shot Clustering)
• Tree-based structures can improve LLM reasoning (Tree-of-Thoughts)
• Hybrid approaches leverage the strengths of both paradigms

Choosing the Right Algorithm
• Decision Trees: When interpretability is key and features interact in complex ways
• Random Forests: When accuracy matters more than interpretability
• kNN: When instance-based reasoning is appropriate and the dataset is small/medium
• Clustering: When discovering hidden patterns in unlabeled data

Model Evaluation & Selection

Overfitting vs. Underfitting
• Goal: Choose a model with the proper complexity for your data
• Process: Select a model family, tune hyperparameters
• Metrics: Consider both performance metrics and business requirements

Evaluation Metrics for Binary Classification

Neural Network Layers
• Input Layer: Raw features
• Hidden Layers: Intermediate representations
• Output Layer: Final prediction
• Fully Connected/Dense Layer: Every neuron is connected to all neurons in the previous layer
• Naming Convention:
Parameter Counting
• Fully connected layer: input size × output size + output size (with bias)
• Simple MNIST example:
  – Input: 784 features (28×28 image)
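To make the formula concrete, a hedged PyTorch sketch that counts parameters for a toy MNIST MLP (the 128-unit hidden layer and 10 output classes are illustrative assumptions; the lecture's exact architecture is not reproduced here):

    import torch.nn as nn

    model = nn.Sequential(
        nn.Flatten(),
        nn.Linear(784, 128),   # 784*128 + 128 = 100,480 parameters
        nn.ReLU(),
        nn.Linear(128, 10),    # 128*10  + 10  =   1,290 parameters
    )
    print(sum(p.numel() for p in model.parameters()))   # 101,770 in total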
Exploding Gradient Problem
• Problem: Gradients become extremely large
• Effects:
Convolutional Neural Networks (CNNs)
• Neural network architecture specifically designed for processing grid-like data (e.g., images)
• Inspired by the organization of the visual cortex in the brain
• Preserves spatial relationships between pixels through local connectivity
• Uses weight sharing to reduce parameters and improve generalization

Convolution Operation
• Key idea: Apply the same filter across all spatial locations
• Filter (kernel) slides over the input, computing dot products
• Each dot product creates one value in the output feature map
• Multiple filters create multiple feature maps (channels)

CNN Basic Layers

Main CNN Layer Types
• Convolutional Layer: Extracts features using learnable filters
• Pooling Layer: Downsamples feature maps
• Fully Connected Layer: Converts spatial features to class scores
• Activation Function: Adds non-linearity (typically ReLU)
• Batch Normalization: Normalizes activations for stable training
• Softmax Layer: Converts output to a probability distribution

Common Settings
• K = powers of 2 (32, 64, 128, 512)
• F = 3, S = 1, P = 1 (maintains spatial dimensions)
• F = 5, S = 1, P = 2 (maintains spatial dimensions)
• F = 5, S = 2, P = "same" (halves spatial dimensions)
• F = 1, S = 1, P = 0 (1×1 convolution for channel mixing)
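A small PyTorch sketch assembling the layer types above with the common F = 3, S = 1, P = 1 setting (the 3×32×32 input, K = 32 filters, and 10 classes are illustrative assumptions):

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1),  # conv: keeps 32x32
        nn.BatchNorm2d(32),                                     # batch normalization
        nn.ReLU(),                                              # non-linearity
        nn.MaxPool2d(kernel_size=2, stride=2),                  # pool: 32x32 -> 16x16
        nn.Flatten(),
        nn.Linear(32 * 16 * 16, 10),                            # FC -> class scores
    )
    logits = model(torch.randn(4, 3, 32, 32))
    print(logits.softmax(dim=1).shape)   # softmax -> probabilities, shape (4, 10)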
Fully Connected Layer (FC)
• Operation: output = W · input + b
• Input: Flattened feature vector
• Output: Fixed-size vector
• Parameters: W (weight matrix), b (bias vector)
• Limitations for images:
  – Destroys the spatial structure of the input
  – Inefficient for high-dimensional inputs
  – Example: A 200×200 image with 40K hidden units requires 2B parameters

Pooling Layer

Pooling Operation
• Purpose: Reduce spatial dimensions while maintaining important features
• Max Pooling: Takes the maximum value in each region
• Average Pooling: Takes the average value in each region
• Common setting: 2×2 filter with stride 2 (halves dimensions)

Pooling Advantages
• Reduces the number of parameters in following layers
• Provides translation invariance
• Reduces memory usage and computation
• Helps prevent overfitting
• No parameters to learn
Pooling Layer Parameters
• Spatial extent (F): Size of the pooling window
• Stride (S): Step size when sliding the window

Output Dimensions
• Input dimensions: W1 × H1 × C
• Output dimensions: W2 × H2 × C (same number of channels)
• Where:
  – W2 = (W1 − F)/S + 1
  – H2 = (H1 − F)/S + 1
• Number of parameters: 0
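A quick helper applying the output-size formula above (the example sizes are illustrative; the padded form (W − F + 2P)/S + 1 is the standard generalization for convolutions with zero padding):

    def out_dim(w, f, s, p=0):
        """Output spatial size for window size f, stride s, padding p."""
        return (w - f + 2 * p) // s + 1

    print(out_dim(224, f=2, s=2))        # 112: 2x2 pooling, stride 2 halves the map
    print(out_dim(224, f=3, s=1, p=1))   # 224: 3x3 conv, stride 1, pad 1 keeps size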
Batch Normalization

Batch Normalization Operation
• Purpose: Normalize activations for faster, more stable training
• Process:
  – Compute the mean µB and variance σB² across the mini-batch
  – Normalize: x̂ = (x − µB) / √(σB² + ϵ)
  – Scale and shift: y = γ · x̂ + β
• Learnable parameters: γ (scale), β (shift)
• Makes training deep networks easier
• Improves gradient flow
• Allows higher learning rates
• Reduces sensitivity to weight initialization
• Acts as regularization
• Zero overhead at test time (can be fused with the conv layer)
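A NumPy sketch of the training-time batch-norm computation above (per-feature statistics over the batch dimension; the γ, β, and ε values are illustrative):

    import numpy as np

    def batchnorm_forward(x, gamma, beta, eps=1e-5):
        """x: (batch, features). Normalize each feature over the mini-batch,
        then scale and shift with the learnable gamma and beta."""
        mu = x.mean(axis=0)                     # mean over the mini-batch
        var = x.var(axis=0)                     # variance over the mini-batch
        x_hat = (x - mu) / np.sqrt(var + eps)   # normalize
        # (At test time, running averages of mu/var from training are used instead.)
        return gamma * x_hat + beta             # scale and shift

    x = np.random.randn(32, 4) * 5 + 3
    y = batchnorm_forward(x, np.ones(4), np.zeros(4))
    print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ~0 and ~1 per feature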
Batch Norm Placement
• Usually placed after a convolutional or fully connected layer
• Often placed before the activation function
• Common pattern: Conv → BatchNorm → ReLU

CNN Forward/Backward Propagation

Forward Propagation in CNNs
• Similar to regular neural networks, but with convolution operations
• For each convolutional layer:
  – Perform the convolution operation with the filters
  – Add the bias
  – Apply the activation function
• For each pooling layer:
  – Apply the pooling operation (max or average)
• For each fully connected layer:
  – Apply the matrix multiplication and add the bias
  – Apply the activation function
Backward Propagation in CNNs
• Similar to regular backprop, but need to handle convolution and pooling
• For fully connected layers: Standard backprop
• For pooling layers:
  – Max pooling: Gradient flows back only to the max element
  – Average pooling: Gradient is distributed equally to all elements
• For convolutional layers:
  – Compute the gradient w.r.t. each filter weight and bias
  – Convolve the gradient of the output with the flipped filter to get the gradient of the input

Popular CNN Architectures

AlexNet (2012)
• First CNN to win the ImageNet competition
• 8 layers (5 conv, 3 FC)
• Used ReLU activation, dropout, data augmentation
• Key features: Large filters (11×11), Local Response Normalization (LRN)
• 60M parameters

VGGNet (2014)
• Showed benefits of deeper networks with small filters
• VGG16: 16 layers (13 conv, 3 FC)
• All conv layers use 3×3 filters with stride 1, pad 1
• All pooling layers use 2×2 max pooling with stride 2
• Feature channels increase progressively: 64→128→256→512→512
• 138M parameters (most in the FC layers)
• Why small 3×3 filters?
  – A stack of three 3×3 convs has the receptive field of one 7×7 conv
  – But with more non-linearities and fewer parameters (3×3×3 vs 7×7)
  – Better convergence due to more non-linearities

ResNet (2015)
• Enabled training very deep networks (152+ layers)
• Key innovation: Residual connections (skip connections)
• Residual block: H(x) = F(x) + x
• Helps with the vanishing gradient problem
• Easier optimization (if deeper layers can't learn useful features, they can default to the identity)
• Variants: ResNet-18, ResNet-34, ResNet-50 (bottleneck design), ResNet-101, ResNet-152
• Bottleneck design: 1×1 → 3×3 → 1×1 convolutions for efficiency
• 25.5M parameters (ResNet-50)
• Winner of ILSVRC 2015 with 3.57% top-5 error

Improvements on ResNet
• Identity Mappings in Deep Residual Networks (2016): Improved residual block design
• Wide ResNet (2016): Wider residual blocks instead of deeper networks
• ResNeXt (2016): Parallel pathways within the residual block ("cardinality")
• SENet (2017): Squeeze-and-Excitation modules for adaptive feature recalibration
• DenseNet (2017): Dense connections where each layer connects to all previous layers
• MobileNet (2017): Depthwise separable convolutions for mobile applications
• ShuffleNet (2018): Channel shuffling with grouped convolutions
• EfficientNet (2019): Compound scaling of width, depth, and resolution
• NASNet (2017): Neural Architecture Search for automated architecture design

Neural Architecture Search (NAS)
• Automated design of neural network architectures
• Usually involves reinforcement learning or evolution
• Search for the optimal network structure or building blocks
• Examples: NASNet, AmoebaNet, EfficientNet
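A hedged PyTorch sketch of the basic residual block H(x) = F(x) + x described above (two 3×3 convolutions with batch norm; the channel count is illustrative, and the bottleneck/downsampling variants are omitted):

    import torch
    import torch.nn as nn

    class BasicResidualBlock(nn.Module):
        """H(x) = F(x) + x, with F = Conv-BN-ReLU-Conv-BN (shape-preserving)."""
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)

        def forward(self, x):
            f = torch.relu(self.bn1(self.conv1(x)))
            f = self.bn2(self.conv2(f))
            return torch.relu(f + x)   # skip connection gives gradients a direct path

    block = BasicResidualBlock(64)
    print(block(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 64, 56, 56])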
CNN Applications

Image Classification
• Assign a single label to an entire image
• CNN outputs a probability distribution over classes
• Examples: ImageNet classification, disease diagnosis

Object Detection
• Locate and classify multiple objects in an image
• Common approaches:
  – R-CNN family: Generate proposals, classify with a CNN
  – Single-shot detectors: YOLO, SSD (no explicit proposal stage)
• Applications: Autonomous driving, surveillance, retail

Segmentation
• Semantic segmentation: Classify each pixel into a category
• Instance segmentation: Distinguish between different instances of the same class
• Panoptic segmentation: Combines semantic and instance segmentation
• Applications: Medical imaging, autonomous driving, scene understanding

Other Applications
• Face recognition and analysis
• Image generation and manipulation
• Video analysis
• 3D reconstruction
• Virtual try-on
• Deepfake creation and detection

Practical Tips

CNN Design Guidelines
• Use 3×3 conv filters with stride 1 and pad 1 as the default
• Increase channel dimensions as spatial dimensions decrease
• Use max pooling with 2×2 filters and stride 2 to downsample
• Add batch normalization after convolutions
• Use ReLU activation functions
• For modern architectures, consider residual connections
• For limited resources, consider efficient architectures like MobileNet

Memory Efficiency
• Most memory is in the early conv layers (large spatial dimensions)
• Most parameters are in the final FC layers (or late conv layers)
• Total memory = activations + parameters + gradients
• Batch size significantly affects memory usage

Common Hyperparameters
• Learning rate (often 0.01-0.001 with decay)
• Batch size (typically 32-256, depending on GPU memory)
• Weight decay for regularization
• Data augmentation strategies
• Number of filters and layers

When to Use CNNs
• Grid-like data (images, spectrograms, etc.)
• When spatial/local structure matters
• When translation invariance is desired
• When hierarchical feature extraction is beneficial

Modern CNN Developments

Contrastive Learning (CLIP)
Challenges with Vanilla RNNs

Vanishing Gradient Problem
• Gradients can become exponentially small over many time steps
• During backpropagation: ∂ht/∂ht−n = ∏ i=t−n+1…t ∂hi/∂hi−1
• Each term involves multiplication by Whhᵀ and the derivative of tanh
• If the largest singular value of Whh is < 1, gradients vanish exponentially
• Result: The RNN cannot learn long-term dependencies

Exploding Gradient Problem
• Gradients can become exponentially large over many time steps
• Occurs when the largest singular value of Whh is > 1
• Results in unstable training and model divergence
• Solution: Gradient clipping

Why LSTMs Help
• The cell state provides a direct path for gradient flow
• Gradients can flow through the cell state without repeated multiplication by the weights
• The forget gate allows selective retention of information
• Error can be backpropagated over many time steps

LSTM Variants

Gated Recurrent Unit (GRU)
• Simplified variant of the LSTM
• Combines the forget and input gates into a single "update gate"
• No separate cell state; updates the hidden state directly
• Fewer parameters than the LSTM
• Often comparable performance with lower computational cost
• Update equations:
  – zt = σ(Wz · [ht−1, xt])
  – rt = σ(Wr · [ht−1, xt])
  – h̃t = tanh(W · [rt ⊙ ht−1, xt])
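The gradient-clipping fix for exploding gradients listed above, as a hedged PyTorch sketch (the RNN, the dummy loss, and the max_norm value are placeholders for illustration):

    import torch
    import torch.nn as nn

    model = nn.RNN(input_size=8, hidden_size=16, batch_first=True)   # placeholder RNN
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.randn(4, 20, 8)          # (batch, time, features), dummy data

    out, _ = model(x)
    loss = out.pow(2).mean()           # dummy loss just to produce gradients
    loss.backward()

    # Rescale gradients so their global norm is at most max_norm (assumed value)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    opt.zero_grad()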
Types of Graphs
• Undirected graphs: Edges have no direction, Aij = Aji
• Directed graphs: Edges have direction, Aij may not equal Aji
• Weighted graphs: Edges have weights, Aij ∈ R
• Bipartite graphs: Nodes can be divided into two disjoint sets with edges only between the sets
• Heterogeneous graphs: G = (V, E, R, T) with node types T(vi) and relation types r ∈ R

Graph Machine Learning Tasks

Node-level Tasks
• Node classification: Predict labels for nodes (e.g., protein function)
• Node regression: Predict continuous values for nodes
• Node clustering: Group similar nodes
• Anomaly detection: Identify unusual nodes

Graph-level Tasks
• Graph classification: Assign labels to entire graphs (e.g., molecule properties)
• Graph regression: Predict continuous values for graphs
• Graph generation: Create new graphs with desired properties
• Graph similarity: Measure how similar two graphs are

Real-world Applications
• Recommender systems (Pinterest, e-commerce)
• Traffic prediction (Google Maps ETA)
• Drug discovery (antibiotic discovery)
• Knowledge graph completion
• Protein-protein interaction prediction
• Fake account detection in social networks

Shallow Encoding Approach
• Encoder is a simple embedding lookup: ENC(v) = zv = Z · v
• Z ∈ R^(d×|V|) is the embedding matrix
• v is a one-hot encoding of node v
• Directly optimize the embedding of each node
• Examples: DeepWalk, node2vec
• Limitation: Cannot generalize to unseen nodes

Graph Neural Networks: Basic Concepts

GNN Design Principles
• Permutation invariance: For a function f, f(PX) = f(X) for any permutation P
• Permutation equivariance: For a function f, f(PX) = Pf(X) for any permutation P
• Graph invariant: Graph-level predictions shouldn't depend on node ordering
• Graph equivariant: Node-level predictions should be consistently mapped when nodes are reordered
• Standard MLPs are neither invariant nor equivariant to permutations

Computation Graph
• For a target node, recursively expand its neighborhood to create a computation graph
• Layer k incorporates information from nodes k hops away from the target
• Each node's representation depends on its neighborhood
• Information propagates through the graph via message passing
Graph Neural Networks: Architecture

Message Passing Framework
• Initialize node embeddings: hv(0) = xv
• For each layer l and each node v:
  – Aggregate information from neighbors: av(l) = AGGREGATE(l)({hu(l−1) : u ∈ N(v)})
  – Update the node embedding: hv(l) = UPDATE(l)(hv(l−1), av(l))
• After L layers, the final node representation is zv = hv(L)
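A minimal sketch of one message-passing layer with mean aggregation and a linear-plus-ReLU update (these particular AGGREGATE/UPDATE choices are illustrative assumptions; GCN, GraphSAGE, and GAT differ precisely in these two functions):

    import torch
    import torch.nn as nn

    class MeanMessagePassingLayer(nn.Module):
        """h_v(l) = ReLU(W_self h_v(l-1) + W_neigh * mean over N(v) of h_u(l-1))."""
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.w_self = nn.Linear(in_dim, out_dim)
            self.w_neigh = nn.Linear(in_dim, out_dim)

        def forward(self, h, adj):
            deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
            agg = adj @ h / deg                                     # AGGREGATE: neighbor mean
            return torch.relu(self.w_self(h) + self.w_neigh(agg))   # UPDATE

    h = torch.randn(5, 8)                      # h(0) = node features x_v
    adj = torch.tensor([[0,1,1,0,0],[1,0,0,1,0],[1,0,0,0,1],
                        [0,1,0,0,0],[0,0,1,0,0]], dtype=torch.float)
    print(MeanMessagePassingLayer(8, 16)(h, adj).shape)   # torch.Size([5, 16])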
Graph Convolutional Network (GCN)

Training GNNs

Supervised Learning
• Classification: Cross-entropy loss
• Regression: Mean squared error
• Link prediction: Binary cross-entropy, margin-based ranking loss

Inductive vs. Transductive Learning
• Transductive: Test nodes are seen during training (only their labels are unknown)
• Inductive: Generalize to completely unseen nodes/graphs
• GNNs are naturally inductive: The same parameters are used for all nodes
• Can generate embeddings for new nodes on the fly

Advanced GNN Architectures

Graph Attention Networks (GAT)

LLMs for Text-Attributed Graphs
• Graphs where nodes have text attributes (papers, products, etc.)
• Conventional shallow text embeddings can miss crucial context
• LLMs have strong semantic understanding of text
• LLM directly sees the text (and optional neighbor info) and chooses a label
• Zero/Few-shot: Node text in the prompt → LLM decides the label
• Adding graph structure through neighbor text summaries
• LLM as annotator: LLM labels some nodes → train a GNN
Practical Tips & Limitations

Common Issues & Solutions

Data Preprocessing
• Normalization: Ensures all features contribute equally to learning
• PCA Whitening: Transform data to remove correlation between features
  – Rotate data along principal components
  – Scale each dimension to unit variance
  – Rarely used in practice for images

Data Augmentation
• Random crops and scales
• Color jitter (brightness, contrast, saturation)
• Rotations, translations, stretching, shearing
• Works as an implicit regularizer by increasing the training set size
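A hedged torchvision sketch of the augmentations listed above (the specific transform parameters are illustrative choices, not the course's settings):

    from torchvision import transforms

    train_transform = transforms.Compose([
        transforms.RandomResizedCrop(224),                     # random crops and scales
        transforms.ColorJitter(brightness=0.4, contrast=0.4,
                               saturation=0.4),                # color jitter
        transforms.RandomAffine(degrees=15, translate=(0.1, 0.1),
                                scale=(0.9, 1.1), shear=10),   # rotate/translate/stretch/shear
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])
    # Pass train_transform to the training Dataset, e.g. ImageFolder(root, transform=train_transform).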
Meta-Learning
• Learning to recommend algorithms based on dataset characteristics
• Warm-starting hyperparameter optimization
• Learning initialization strategies for neural networks
Practical Training Recipe
• Start with proven architectures (ResNet, Transformer)
• Use batch normalization (or layer normalization for Transformers)
• Use ReLU or variants (LeakyReLU, ELU) with Kaiming initialization
• Optimizer: Adam with learning rate 0.001 or 0.0005
• Learning rate schedule: Cosine decay with warmup
• Regularization: Weight decay + dropout + data augmentation
• Gradient clipping for RNNs and Transformers
• Track an exponential moving average of the weights for better test performance

Transfer Learning
• Feature extraction: Freeze the pre-trained weights, train only the new layers
• Fine-tuning: Re-train some or all of the pre-trained weights
• Use a lower learning rate when fine-tuning (e.g., 1/10th of the original)
• Start by freezing earlier layers, which capture more generic features

Parameter Efficient Fine-Tuning (PEFT)
• Adapters: Add small trainable modules to a frozen pre-trained model
• LoRA (Low-Rank Adaptation): Update pretrained weights with low-rank matrices
• Prefix Tuning: Add trainable prefix tokens to the inputs
• Significantly reduces the number of parameters to train
• Enables efficient fine-tuning of large models
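A hedged sketch of the LoRA idea (a frozen pretrained weight plus a trainable low-rank update B·A; this is a simplified illustration, not the peft library's implementation, and the rank, scaling, and initialization choices are assumptions):

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Frozen pretrained linear layer plus a trainable low-rank update B @ A."""
        def __init__(self, linear: nn.Linear, r: int = 8):
            super().__init__()
            self.base = linear
            self.base.requires_grad_(False)   # freeze all pretrained parameters
            self.A = nn.Parameter(torch.randn(r, linear.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(linear.out_features, r))  # starts as a no-op

        def forward(self, x):
            return self.base(x) + x @ (self.B @ self.A).T

    layer = LoRALinear(nn.Linear(1024, 1024), r=8)
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    print(trainable)   # 16,384 trainable params vs. ~1.05M frozen in the dense layer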
Multi-task Learning
• Train a single model to perform multiple related tasks
• Share parameters across tasks
• Benefits:
  – Improved performance by leveraging task relationships
  – Better generalization by learning shared representations
  – More efficient use of data

Key Exam Topics
• Data preprocessing techniques and their effects
• Batch normalization operation and benefits
• Different activation functions and their pros/cons
• Weight initialization methods (Xavier, Kaiming)
• Optimization algorithms (SGD, Momentum, Adam)
• Regularization techniques (L2, Dropout, Data Augmentation)
• Learning rate schedules and their impact
• Transfer learning strategies
• Hyperparameter tuning methods
• Common training issues and how to diagnose them