Project Proposal
I. Introduction
Background
The term "grokking" was introduced in recent studies to describe neural networks’
emergent generalization after prolonged training, particularly on tasks such as modular
addition and sparse parity (Power et al., 2022; Merrill et al., 2023). In the initial phase,
models memorize: they reach high training accuracy but fail to generalize. After extended
further training, they transition to high test accuracy, indicating that a generalizing
solution to the task has been learned. This transition, though powerful, typically demands
extensive training time, raising computational and efficiency challenges (Nanda et al., 2023).
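To ground the setting, the following minimal PyTorch sketch shows the kind of setup in which grokking is typically studied: full-batch training of a small network on modular addition with AdamW and strong weight decay. The architecture, modulus, and hyperparameters are illustrative assumptions, not the exact configurations of the cited papers.

import torch
import torch.nn as nn

p = 97  # modulus for the (a + b) mod p task

# Enumerate every (a, b) pair and its label; keep only a fraction for training,
# since sparse training data is part of what makes grokking visible.
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = torch.randperm(len(pairs))
n_train = len(pairs) // 2
train_idx, test_idx = perm[:n_train], perm[n_train:]

class ModAddMLP(nn.Module):
    def __init__(self, p, d_embed=128, d_hidden=256):
        super().__init__()
        self.embed = nn.Embedding(p, d_embed)
        self.net = nn.Sequential(
            nn.Linear(2 * d_embed, d_hidden), nn.ReLU(), nn.Linear(d_hidden, p)
        )

    def forward(self, ab):
        e = self.embed(ab)                        # (batch, 2, d_embed)
        return self.net(e.flatten(start_dim=1))   # logits over the p residues

model = ModAddMLP(p)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

for step in range(50_000):  # generalization typically emerges long after perfect train accuracy
    loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Later sketches in this proposal reuse the names defined here (model, pairs, labels, train_idx, test_idx, loss_fn).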
Problem Statement
While grokking offers powerful generalization capabilities, its high computational cost and
lengthy training periods make it inefficient. This project addresses the challenge of
achieving grokking more efficiently by investigating structured learning dynamics and
targeted regularization strategies to encourage faster generalization.
Objectives
Research Question 1
What role do structured learning dynamics play in achieving efficient grokking?
II. Literature Review
2. Sparse Network Transitions: Merrill et al. (2023) characterized grokking as a phase transition
driven by the emergence of sparse subnetworks that come to dominate model predictions
post-grokking. This work highlights the role of network sparsification and suggests targeted
pruning as a potential method to improve efficiency (a minimal pruning sketch follows this list).
3. Role of Regularization and Weight Decay: Thilak et al. (2022) found that weight decay
facilitates grokking by encouraging models to prioritize generalizing components over
components that merely memorize the training data.
5. Learning Rate Schedules and Training Stability: Ganguli et al. (2022) examined dynamic
learning-rate schedules, such as cyclical and cosine annealing, and found that they can improve
model stability during transitions in emergent behavior, which could reduce the training time
required for grokking.
6. Competitive Subnetworks and Phase Transitions: Studies by Engel & Van den Broeck
(2001) on emergent behaviors in neural networks reveal that competing network
components influence generalization. Applying this to grokking could provide insights into
encouraging sparse, generalizable structures in fewer steps.
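As a concrete illustration of the pruning idea in item 2 above, the sketch below removes the smallest-magnitude weights globally and checks whether the surviving sparse subnetwork still solves the task. It reuses the hypothetical ModAddMLP setup from the earlier sketch and is only a rough proxy for a subnetwork analysis, not the procedure used by Merrill et al. (2023).

import copy
import torch
import torch.nn.utils.prune as prune

def sparse_subnetwork_accuracy(model, inputs, targets, sparsity=0.9):
    # Copy the trained model, prune the smallest-magnitude weights across all
    # linear layers, and measure accuracy of the remaining sparse subnetwork.
    pruned = copy.deepcopy(model)
    layers = [(m, "weight") for m in pruned.modules() if isinstance(m, torch.nn.Linear)]
    prune.global_unstructured(layers, pruning_method=prune.L1Unstructured, amount=sparsity)
    with torch.no_grad():
        preds = pruned(inputs).argmax(dim=-1)
    return (preds == targets).float().mean().item()

# e.g. sparse_subnetwork_accuracy(model, pairs[test_idx], labels[test_idx], sparsity=0.95)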
Research Gaps
The reviewed literature points to several strategies for encouraging generalization but lacks
a cohesive framework that combines these approaches to accelerate grokking. This project
will address this gap by synthesizing structured training methods, such as curriculum
learning and targeted sparsification, with regularization strategies to achieve efficient
grokking.
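As a rough illustration of how these ingredients might be combined, the sketch below pairs a simple curriculum schedule with weight decay and cosine learning-rate annealing, again reusing the names from the first sketch. The difficulty heuristic (operand size), the curriculum schedule, and the hyperparameters are illustrative assumptions for this proposal, not an established recipe.

import torch

def curriculum_indices(subset_pairs, step, total_steps):
    # Hypothetical curriculum: rank examples by a placeholder difficulty score
    # (operand size) and progressively unlock harder examples over training.
    difficulty = subset_pairs.sum(dim=1).float()
    order = torch.argsort(difficulty)
    frac = min(1.0, 0.25 + 0.75 * step / total_steps)
    return order[: max(1, int(frac * len(order)))]

total_steps = 50_000
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

for step in range(total_steps):
    idx = train_idx[curriculum_indices(pairs[train_idx], step, total_steps)]
    loss = loss_fn(model(pairs[idx]), labels[idx])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # anneal the learning rate along a cosine curve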
III. Methodology
Approach
This research will investigate the dynamics of grokking across various neural network
architectures, analyzing stages within the training process that correspond to
memorization, circuit formation, and cleanup. Using interpretability techniques, the study
will develop metrics to monitor progress within these phases.
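As a starting point, simple diagnostics of the kind sketched below (train/test accuracy and the total parameter norm, logged over training) can be used to segment a run into memorization, circuit-formation, and cleanup phases. These are assumed placeholder measures reusing the names from the first sketch, not the progress measures of Nanda et al. (2023).

import torch

@torch.no_grad()
def accuracy(model, inputs, targets):
    return (model(inputs).argmax(dim=-1) == targets).float().mean().item()

@torch.no_grad()
def weight_norm(model):
    # Total L2 norm of all parameters: a crude proxy for the cleanup phase,
    # in which weight decay shrinks components that only serve memorization.
    return sum(p.norm() ** 2 for p in model.parameters()).sqrt().item()

log = []
# Inside the training loop, e.g. every 100 steps:
# log.append({"step": step,
#             "train_acc": accuracy(model, pairs[train_idx], labels[train_idx]),
#             "test_acc": accuracy(model, pairs[test_idx], labels[test_idx]),
#             "weight_norm": weight_norm(model)})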
References
1. Nanda, N., et al. "Progress measures for grokking via mechanistic interpretability."
International Conference on Learning Representations (ICLR), 2023.
2. Merrill, W., et al. "A tale of two circuits: Grokking as competition of sparse and dense
subnetworks." ICLR Workshop on Understanding Foundation Models, 2023.
3. Thilak, V., et al. "The slingshot mechanism: An empirical study of adaptive optimizers
and the grokking phenomenon." NeurIPS 2022 Workshop.
4. Power, A., et al. "Grokking: Generalization beyond overfitting on small algorithmic
datasets." arXiv preprint arXiv:2201.02177, 2022.
5. Barak, B., et al. "Hidden progress in deep learning: SGD learns parities near the
computational limit." Advances in Neural Information Processing Systems (NeurIPS), 2022.
6. Engel, A., & Van den Broeck, C. Statistical Mechanics of Learning. Cambridge
University Press, 2001.
7. Liu, Z., et al. "Omnigrok: Grokking beyond algorithmic data." International Conference
on Learning Representations (ICLR), 2023.