
Understanding Scaling Laws for Large Language Models: A Comprehensive Survey

Ms. Sunita Sahu
Department of Computer Science & Engineering
G.D. Goenka University, Gurugram
[email protected]

Abstract—Scaling laws have emerged as a critical framework for understanding the relationship between model parameters, computational resources, and performance in natural language processing (NLP) applications. Scaling laws are rooted in the principles of machine learning, particularly the interplay between model size, training data, and compute: parameters define the complexity of a model, while compute reflects the energy and time required for training. Understanding these scaling laws is critical both to improving existing models and to predicting the behavior and development of next-generation AI systems. This paper covers the basics of scaling laws and the relationships between their variables, and reviews the work of various researchers, identifying their key findings, strengths, and limitations. These works collectively emphasize the importance of scaling laws in understanding and developing LLMs. The survey also shows that optimization techniques, fine-tuning strategies, and temporal aspects play crucial roles in the effective development and deployment of these models.

Keywords—Large Language Model, Scaling Law, Fine-tuning, NLP

I. INTRODUCTION

A language model is a computational system designed to process, understand, and generate natural language. It learns patterns, relationships, and structures within language by training on extensive textual datasets. The primary function of a language model is to predict the next word or sequence of words from the context extracted from its training data, enabling it to generate coherent and contextually relevant text.

Initially, language models were built using statistical approaches such as n-gram models. Today, sophisticated models are built on complex neural architectures, in particular Transformer-based models such as the Generative Pre-trained Transformer (GPT) [3] and Bidirectional Encoder Representations from Transformers (BERT) [2]. These models perform modern natural language processing tasks such as text generation, machine translation, summarization, and sentiment analysis very effectively and with high accuracy.

Different deep learning architectures are used for language modelling. RNNs (e.g., LSTMs and GRUs) handle sequential data; Transformers (e.g., BERT, GPT, and T5) [2,3,4] are very effective for context understanding and generation; and CNNs capture local text features. Seq2Seq models [15] with attention mechanisms are widely used for tasks like machine translation, while hybrid models combine RNNs, CNNs, or Transformers for enhanced performance. Additionally, unified models address multiple NLP tasks, and multimodal models integrate text with other data types such as images.

Deep learning has rapidly advanced language modeling, with state-of-the-art models now approaching human-level performance on many specific tasks. Notably, they can generate coherent multi-paragraph text from prompts, demonstrating impressive progress in natural language understanding and generation. Milestones include GPT-2, which demonstrated unprecedented language generation capabilities, and subsequent models such as GPT-3 and GPT-4, which scaled up parameters and data to unlock new functionalities. These advances have underscored the importance of scaling laws in guiding model development.

The remarkable accuracy of large language models can be attributed to three key factors: the massive number of parameters they contain, the extensive datasets they are trained on, and the immense computational power required for training. These models often involve millions or even billions of parameters. However, this raises an important question: does simply increasing the number of parameters or the size of the dataset always lead to better performance? The answer lies in the scaling laws for large language models, which describe how these factors interact to influence model performance.

Scaling laws are fundamental to the design and development of large language models (LLMs). They provide a mathematical framework for understanding how the key variables (model size, dataset size, and computational power) affect the performance of LLMs. By identifying predictable relationships between these factors, scaling laws enable researchers to optimize resource allocation and improve model performance efficiently.

The importance of scaling laws lies in their ability to guide the development of increasingly powerful models. As LLMs grow in size and complexity, scaling laws help address critical questions such as how much data is needed to train larger models effectively, or how much computational power is required to reach a desired performance level. They reveal that model performance improves predictably as parameters, data, and compute scale, while also highlighting the diminishing returns that occur when one factor is increased disproportionately.

Scaling laws are particularly significant in building state-of-the-art LLMs because they provide insight into the trade-offs between resource use and performance. For instance, they show that a balanced increase in model size and data yields better results than scaling one factor alone. Moreover, these laws have broader implications for developing models that are not only effective but also efficient, ensuring that advancements in natural language processing remain sustainable in terms of computational and environmental costs.



By leveraging scaling laws, researchers can design LLMs that are better equipped to tackle complex tasks like natural language understanding, text generation, and machine translation. Unfortunately, scaling analyses are rare in many benchmarking and post-training studies, largely because most researchers lack the computational resources needed to establish scaling laws from scratch; in addition, open models are often trained at only a few scales, which makes reliable scaling predictions difficult [12]. This paper explores the critical role of scaling laws in shaping the future of LLM development and their potential to drive innovation in artificial intelligence. The rest of this paper is organized as follows. Section II provides a comprehensive review of scaling laws in large language models (LLMs), covering the foundational principles, the roles of parameters, data size, and compute power, and key contributions from prior research. Section III discusses important insights, and Section IV addresses emerging challenges in scaling LLMs, such as environmental sustainability, compute efficiency, and the need for smarter scaling strategies. Section V covers future directions, and Section VI concludes the paper by summarizing key insights and emphasizing the broader significance of scaling laws in advancing large language models and artificial intelligence.

II. LITERATURE REVIEW

Various large language models have been developed over the last few years, and they keep getting bigger and smarter. The performance of large language models (LLMs) is influenced by three key factors: model size, dataset size, and the compute used during training. Scaling laws for LLMs have been studied extensively, most notably by Kaplan et al. [5], who identified a power-law relationship between performance and the three scaling factors: the number of model parameters (N), the dataset size (D), and the training compute (C). Their findings emphasize that model performance is determined most strongly by scale, with architectural choices such as model depth or width having minimal impact. Performance improves predictably as these factors scale, with diminishing returns when only one factor increases while the others remain constant. Scaling both N and D together yields the best performance: a model size increase of 8x requires only about a 5x increase in data to avoid a performance penalty. In addition, training curves for large models follow predictable power-law patterns, and larger models are more sample-efficient, reaching a given performance with fewer optimization steps than smaller models. However, for optimal compute efficiency, models should be trained to a point just before convergence. Finally, the ideal batch size is determined by the loss, and is typically around 1-2 million tokens for the largest models.
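For reference, the single-factor laws reported in [5] are usually quoted in the following power-law form, where N_c, D_c, and C_c are empirically fitted critical scales; the exponent values given here are approximate:

\[
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C_{\min}) = \left(\frac{C_c}{C_{\min}}\right)^{\alpha_C},
\]
with \(\alpha_N \approx 0.076\), \(\alpha_D \approx 0.095\), and \(\alpha_C \approx 0.05\).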
Generally, larger models are considered better, but this raises the question: is a bigger model always better? With respect to data efficiency, larger models need more data to reach their full potential, and beyond a certain point adding more parameters without increasing the data size leads to diminishing returns. This trade-off was investigated in "Training Compute-Optimal Large Language Models" by Jordan Hoffmann et al. [6], whose goal was to determine the optimal model size and number of training tokens for a transformer language model under a given compute budget, i.e., how to scale model size and training data to achieve compute-optimal training. The key result is that current large language models are significantly undertrained and that, for compute-optimal training, model size and the number of training tokens should be scaled equally. The authors test this hypothesis by training a model called Chinchilla, which uses the same compute budget as Gopher but with fewer parameters (70 billion) and more data (1.4 trillion training tokens), and demonstrate its superior performance. Chinchilla outperforms larger models such as Gopher, GPT-3, Jurassic-1, and Megatron-Turing NLG on various benchmarks, and achieves a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, a 7% improvement over Gopher. The findings suggest that smaller models trained on more data are more efficient and more performant.
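A minimal sketch of what compute-optimal allocation implies in practice is shown below. It combines two widely used approximations rather than the authors' fitted law: training compute C of roughly 6ND FLOPs, and the Chinchilla rule of thumb of roughly 20 training tokens per parameter (the `tokens_per_param` constant is an assumption, not an exact figure).

```python
def compute_optimal_allocation(compute_flops, tokens_per_param=20.0):
    """Rough compute-optimal split of a FLOP budget, in the spirit of [6].

    Assumes C ~= 6 * N * D and D ~= tokens_per_param * N, so
    N = sqrt(C / (6 * tokens_per_param)). Both relations are common
    approximations, not the paper's exact fitted scaling law.
    """
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# A Chinchilla-scale budget (~5.8e23 FLOPs) lands near 70B parameters
# and ~1.4T tokens, matching the balance described above.
params, tokens = compute_optimal_allocation(5.8e23)
print(f"~{params / 1e9:.0f}B parameters, ~{tokens / 1e12:.1f}T tokens")
```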
Expanding on Kaplan's work, which showed how model size, dataset size, and compute budget affect LLM performance, recent research by Yizhe Xiong et al. introduced the concept of a temporal scaling law [7]. This approach looks at how the test loss changes during the pre-training process instead of focusing only on the final test loss. By analyzing the test loss at a fine-grained level, such as individual token positions, the authors developed a more precise method for predicting how test loss evolves over training steps. The method has practical uses, such as selecting better hyperparameters efficiently and improving our understanding of how LLMs learn over time, both for in-distribution and out-of-distribution data. It also includes a two-stage pipeline for hyperparameter selection, which uses a small model to narrow down candidate hyperparameters and then selects the best one based on the predicted test loss of the target LLM. In summary, the temporal scaling law is useful in two main ways: it helps choose better hyperparameters by predicting LLM performance, and it offers insights into learning dynamics, supporting the common pre-training practice of treating all token positions equally.
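The practical payoff of a temporal scaling law is that loss measured at early checkpoints can be extrapolated to later training steps. The sketch below fits a simple hyperbolic curve to hypothetical early-checkpoint losses and extrapolates; it is only an illustration of the idea, not the token-position-aware parameterization used in [7], and the numbers are made up.

```python
import numpy as np
from scipy.optimize import curve_fit

def hyperbolic_loss(step, a, b, c):
    # Illustrative hyperbolic form L(t) = a / (t + b) + c.
    return a / (step + b) + c

# Hypothetical early-training checkpoints: (training step, test loss).
steps = np.array([1e3, 2e3, 5e3, 1e4, 2e4, 5e4])
losses = np.array([3.83, 3.30, 2.86, 2.69, 2.60, 2.54])

# Fit on early checkpoints, then predict the loss at a much later step.
params, _ = curve_fit(hyperbolic_loss, steps, losses, p0=[2e3, 5e2, 2.5], maxfev=10000)
print("predicted test loss at 200k steps:", round(hyperbolic_loss(2e5, *params), 3))
```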
In 2023, Niklas Muennighoff et al. proposed scaling laws for data-constrained language models [9]. This work addresses scaling language models when unique data is limited, exploring how to use computational resources more effectively. The study extends the Chinchilla scaling laws to account for the reduced value of repeated data. It finds that training on repeated data over multiple epochs is still useful, though the benefits decrease with each repetition: repeating the same data for up to 4 epochs yielded significant performance improvements, but the benefits diminished beyond 16 epochs. Adding code data is highlighted as a way to scale the available data further, by up to about 2x. While these approaches help unlock new possibilities with current data, the research also emphasizes the need to collect more data and to use existing data more efficiently to push the boundaries of scaling.
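The core quantitative idea can be sketched as an "effective data" curve in which each additional pass over the same tokens is worth less than the previous one. The decay constant below (about 15) approximates the fitted value reported in [9]; treat the function as a hedged illustration of the shape of the law rather than an exact reproduction.

```python
import math

def effective_tokens(unique_tokens, epochs, r_star=15.0):
    """Effective data after repeating `unique_tokens` for `epochs` passes,
    following the diminishing-returns form described in [9]. `r_star`
    (around 15) approximates the paper's fitted constant."""
    repeats = max(epochs - 1.0, 0.0)  # passes beyond the first epoch
    return unique_tokens + unique_tokens * r_star * (1.0 - math.exp(-repeats / r_star))

# With 100B unique tokens: 4 epochs retain most of their value,
# while returns clearly flatten by 16+ epochs.
for epochs in (1, 4, 16, 60):
    print(f"{epochs:>2} epochs -> ~{effective_tokens(100e9, epochs) / 1e9:.0f}B effective tokens")
```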
In 2023, Armen Aghajanyan et al. proposed scaling laws for generative mixed-modal language models [11]. This paper explores the scaling properties of mixed-modal generative language models, which handle diverse data modalities such as text, images, and speech. By conducting over 250 experiments with models ranging from 8 million to 30 billion parameters and datasets of 5-100 billion tokens, the authors developed new scaling laws that account for both synergy and competition between modalities. Key findings include guidelines for hyperparameter selection, emergent training dynamics that alternate between modalities, and the impact of mixed-modal competition on training stability. Testing these scaling laws with a 30B speech-text model demonstrated significant performance improvements over unimodal models. The study offers valuable insights into designing and training unified mixed-modal generative models.

In [13], the authors introduce a novel "Rectified Scaling Law" for fine-tuning large language models. The goal of the paper is to address the challenge of selecting the optimal pre-trained LLM to fine-tune on a specific downstream task when resource constraints prevent fine-tuning every available model. The authors formulate LLM selection as the problem of predicting full fine-tuning performance and connect it to scaling laws. They observed a "pre-power phase" in the fine-tuning scaling curve, where the slope gradually decreases before the typical "power phase"; this phase transition is explained by the pre-learned data size, which represents the equivalent number of downstream task samples the model has already learned during pre-training. Their study combines several methods. First, fine-tuning experiments: they fine-tuned 30 different LLMs on subsets of three datasets (WMT19 English-Chinese translation, Gigaword summarization, and FLAN instruction tuning) with sizes ranging from 200 to 1,638,400 samples. Second, they analysed the relationship between dataset size and model performance, identifying the pre-power and power phases in the scaling behavior. Third, a theoretical analysis explains the observed scaling behavior and the limitations of existing scaling laws. Fourth, the development of a new scaling law: they introduce the concept of pre-learned data size and incorporate it into the Rectified Scaling Law for fine-tuning. Fifth, an LLM selection algorithm: they develop a novel model selection algorithm called Accept then Stop (AtS), which efficiently identifies a near-optimal model for fine-tuning with significantly reduced resource consumption compared to existing methods.
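The selection idea rests on fitting each candidate's fine-tuning curve on small data subsets and extrapolating to the full downstream dataset. The sketch below fits a loss curve with a pre-learned data size term to hypothetical subset results and extrapolates; the functional form only follows the general shape described in [13], not the authors' exact parameterization, and the full Accept then Stop procedure adds acceptance and stopping criteria not shown here.

```python
import numpy as np
from scipy.optimize import curve_fit

def rectified_loss(d, B, Dl, beta, E):
    # Fine-tuning loss vs. downstream data size d, with a "pre-learned data
    # size" term Dl (illustrative form inspired by [13], not the exact fit).
    return B / (Dl + d) ** beta + E

# Hypothetical subset fine-tuning results for one candidate model.
subset_sizes = np.array([200.0, 800.0, 3200.0, 12800.0, 51200.0])
subset_losses = np.array([2.64, 2.47, 2.11, 1.66, 1.32])

popt, _ = curve_fit(
    rectified_loss, subset_sizes, subset_losses,
    p0=[40.0, 2000.0, 0.4, 0.8],
    bounds=([0.0, 0.0, 0.05, 0.0], [1e4, 1e7, 2.0, 10.0]),
)
# Ranking candidate models by this extrapolated loss is the essence of
# choosing which model to fine-tune under a resource budget.
print("predicted loss on the full dataset:", round(rectified_loss(1.6e6, *popt), 2))
```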
In [14], the authors propose a two-stage algorithm that provides a provable scaling law for the test-time computation of large language models (LLMs), improving their reliability and stability by boosting the probability of solving a problem as test-time computation increases. The algorithm first generates multiple candidate solutions and then selects the final output through pairwise comparisons in a knockout tournament. The failure probability of the algorithm on a specific problem decays to zero exponentially with respect to the total number of LLM calls, assuming the LLM can generate a correct solution with non-zero probability and can distinguish correct from incorrect solutions better than random guessing. Theoretical analysis shows that the failure probability decreases exponentially with the number of candidates and comparisons. Empirical results on the MMLU-Pro benchmark validate the algorithm's assumptions and demonstrate its effectiveness, particularly on reasoning-focused tasks. Table 1 summarizes the key papers on scaling laws for large language models.
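A minimal sketch of the two-stage procedure is given below: sample several candidate answers, then run a knockout tournament of pairwise comparisons. The `generate` and `better` callables stand in for an LLM sampling call and an LLM pairwise judgment; they are placeholders for illustration, not a specific API, and the toy demo simply scores random numbers.

```python
import random

def knockout_select(prompt, generate, better, n_candidates=8, n_comparisons=3):
    """Two-stage test-time scaling in the spirit of [14]: generate candidates,
    then keep the winner of repeated pairwise comparisons round by round."""
    candidates = [generate(prompt) for _ in range(n_candidates)]
    while len(candidates) > 1:
        winners = []
        for a, b in zip(candidates[0::2], candidates[1::2]):
            # Majority vote over several comparisons makes each match-up
            # more reliable than a single judgment.
            votes_for_a = sum(better(prompt, a, b) for _ in range(n_comparisons))
            winners.append(a if votes_for_a * 2 > n_comparisons else b)
        candidates = winners
    return candidates[0]

# Toy demo with stubbed "LLM" calls: candidates are random scores, higher is better.
winner = knockout_select(
    "toy prompt",
    generate=lambda p: random.random(),
    better=lambda p, a, b: a > b,
)
print("selected candidate score:", round(winner, 3))
```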
III. IMPORTANT INSIGHTS OF SCALING IN LLMS

 Foundational Insights:
Scaling laws show that as models grow in size, their performance improves logarithmically rather than linearly, and the gains diminish with excessive scaling unless counterbalanced by larger datasets and more computational power. Without these, simply making the model bigger will not help and may even lead to inefficiencies. Scaling therefore works well only when model size, data, and compute are balanced.

 Emergent Capabilities:
Large language models develop unexpected capabilities as they grow in size. These are often called "emergent abilities" because they appear without being explicitly programmed or trained into the model; examples include few-shot learning, logical reasoning, code generation, and question answering, which make the models versatile on new, unseen tasks. Few-shot learning is the ability to quickly understand and perform a task from only a few examples, even if that task was never seen during training. Logical reasoning allows an LLM to solve problems that require reasoning and pattern recognition, even though it was not specifically trained for logic. Some models, such as GPT-4, can generate or debug code even though this was not a primary focus during training. These abilities emerge because the larger the model, the more nuanced patterns and relationships it can capture from its training data; it is as if the model "unlocks" new skills simply by being scaled up.

 Data Insights:
Scaling studies reveal that high-quality, diverse datasets produce better results than simply increasing the amount of data. At larger scales, models often reach a point where adding more data provides little improvement, highlighting the need for carefully curated datasets. Larger models trained on datasets dominated by English or biased content tend to amplify those biases, which can affect their fairness and accuracy.
Table 1: Summary of reviewed scaling-law studies for large language models.

[5] (Jan 2020)
Research focus: First generalised scaling laws for neural language models.
Techniques/Methods: Training data, model parameters, and compute budget treated as the scaling variables.
Key findings: For a fixed compute budget, the best performance is achieved by balancing model size and dataset size; under-training a large model with insufficient data is inefficient.
Strengths: Foundational study, widely cited.
Limitations: Limited exploration of multilingual contexts.
Research gap: Lack of emphasis on efficiency improvements.

[6] (Mar 2022)
Research focus: Compute-optimal model scaling.
Techniques/Methods: Often referred to as the Chinchilla law; focuses on balancing data vs. compute.
Key findings: Introduced significant insights into compute-optimal training for large language models, challenging earlier practices based on Kaplan's scaling laws; for a fixed compute budget, smaller models trained on significantly more data achieve better performance than larger models trained on less data.
Strengths: Practical insights for efficient scaling.
Limitations: Few benchmarks outside natural language tasks.
Research gap: Exploration of scaling in low-resource settings.

[11] (Jan 2023)
Research focus: Scaling laws for generative mixed-modal language models.
Techniques/Methods: Mixed-modal scaling laws incorporating competition and synergy between modalities.
Key findings: Introduced scaling laws for mixed-modal language models (e.g., models trained on text and other modalities such as images or speech); identifies how performance scales with parameters, compute, and dataset size when multiple modalities are involved; adding complementary modalities (e.g., speech alongside text) can enhance model performance.
Strengths: Extensive experimentation with 250+ runs across seven modality couplings; introduces a practical recipe for multi-modal training.
Limitations: Limited exploration of exhaustive modality couplings; high computational costs.
Research gap: Impact of scaling on broader multimodal applications.

[9] (May 2023)
Research focus: Scaling language models in data-constrained regimes.
Techniques/Methods: Repeating data, compute budget allocation, data filtering, code augmentation.
Key findings: Demonstrated that data repetition (multi-epoch training) can mitigate the challenges of training language models in data-constrained regimes; repeating the same data for up to 4 epochs yielded significant performance improvements, but benefits diminished beyond 16 epochs; emphasized a compute-over-data trade-off, where smaller datasets paired with increased compute (via repetition) can still achieve competitive results.
Strengths: Comprehensive experiments (400 training runs); introduces a new scaling law.
Limitations: Limited evaluation on non-language tasks; ethical implications not discussed.
Research gap: Impact of repeated data on diverse tasks; potential overfitting risks in high-epoch scenarios.

[7] (Jun 2024)
Research focus: Temporal scaling laws in LLM pre-training.
Techniques/Methods: Dynamic hyperbolic law, fine-grained token-level analysis.
Key findings: Proposed a novel temporal scaling law that models how the test loss evolves throughout training, unlike prior scaling laws that consider only the final test loss; discovered that the test loss at different token positions follows a dynamic hyperbolic law during training, giving a fine-grained view of loss evolution at the token level; achieved precise predictions of test loss at intermediate training steps, with R² > 0.99 across various datasets.
Strengths: Introduced token-level granularity for precise modeling; validated hyperparameter selection methods.
Limitations: Focused only on GPT-based decoder models; applicability to other architectures unexplored.
Research gap: Limited exploration of scaling laws beyond the pre-training stage (e.g., fine-tuning, transfer learning).

[13] (May 2024)
Research focus: Selecting a large language model to fine-tune via the Rectified Scaling Law.
Techniques/Methods: LLM selection algorithm enabling efficient selection of near-optimal models with significantly reduced resource consumption.
Key findings: Analysed the relationship between dataset size and model performance; developed a new scaling law, the Rectified Scaling Law, for fine-tuning; introduced a novel model selection algorithm, Accept then Stop (AtS), for choosing the best model to fine-tune under resource constraints.
Strengths: Enables selecting a large language model to fine-tune under resource constraints.
Limitations: Applicability to other fine-tuning strategies, such as Reinforcement Learning from Human Feedback (RLHF), is unexplored.
Research gap: The AtS algorithm's performance can degrade when the data budget ratio is extremely small.

[14] (Nov 2024)
Research focus: Scaling law for the test-time compute of large language models.
Techniques/Methods: Two-stage algorithm: first generate multiple candidate solutions, then compare solutions in a knockout tournament to find the best one.
Key findings: Increasing test-time compute significantly improves the accuracy and reliability of LLM outputs, with failure probability decaying exponentially; the method requires only a black-box LLM, achieving high accuracy without external verifiers or additional models, and supports parallel and distributed computation for practical scalability; experiments on the MMLU-Pro benchmark confirm the theoretical scaling law, with better performance on reasoning-focused tasks than on knowledge-heavy ones.
Strengths: Scalable and efficient; minimalistic implementation; theoretical and empirical validation of pairwise comparisons.
Limitations: Relies on limiting assumptions; resource-intensive; possible bias in pairwise comparisons.
Research gap: The algorithm's assumptions may not apply to highly complex or open-ended tasks.

IV. CHALLENGES AND LIMITATIONS

 Challenges of Deploying on Edge Devices:
Deploying scaled large language models (LLMs) in constrained environments such as edge devices presents several challenges. Edge devices, such as smartphones, IoT devices, and embedded systems, have limited computational, memory, and power resources compared to centralized servers or cloud-based infrastructure.

 Cost-Performance Trade-offs:
Larger models may achieve better performance, but scaling is not without costs. Training larger models requires exponential increases in compute and energy, raising financial and environmental concerns. For instance, GPT-3 required thousands of petaflop-days of compute, highlighting the need for efficient scaling strategies.

 Scaling Saturation:
Beyond a certain size, performance improvements diminish, a phenomenon known as scaling saturation. This raises questions about the utility of further scaling without architectural innovations.

 Resource Constraints:
The financial and computational requirements of training and deploying large models create accessibility barriers for smaller organizations and researchers.

 Bias and Ethical Concerns:
Scaling does not inherently address biases present in training data. In fact, larger models may amplify biases, raising ethical concerns about fairness, accountability, and misuse.

 Environmental Impact of Training LLMs:
Training large language models consumes massive amounts of energy, often relying on power-intensive data centers. This leads to a significant carbon footprint, contributing to environmental issues such as increased greenhouse gas emissions. Developing energy-efficient methods is crucial to reducing this impact.

V. FUTURE DIRECTIONS

 Efficient Scaling:
Techniques like sparse modelling, knowledge distillation, and parameter-efficient fine-tuning hold promise for reducing resource demands without compromising performance.

 Energy Efficiency:
Develop more energy-efficient architectures and training methods to reduce the environmental impact of scaling large models.

 Smaller but Smarter Models:
Focus on creating smaller, optimized models that match or surpass the performance of larger ones, making them suitable for deployment on edge devices.

 Collaborative Training:
Explore distributed and collaborative training techniques, such as federated learning, to enable scaling without relying entirely on centralized infrastructure.

 Multimodal and Adaptive Scaling:
Scale models to handle multiple data types (text, images, audio) seamlessly, or to adapt to specific tasks dynamically without retraining.

 Ecosystem Changes:
Open research collaborations and shared datasets are critical to democratizing advancements in NLP. Community-driven efforts can help mitigate resource inequities.

VI. CONCLUSION

Scaled models have revolutionized industries by powering applications such as conversational AI, content creation, and predictive analytics. Their ability to learn from vast datasets allows them to adapt to new domains with minimal human intervention. In this paper we surveyed scaling laws, which offer invaluable insights into the development of large language models, revealing the interplay between parameters, compute, and data. While these laws have driven remarkable advancements, they also present challenges that must be addressed to ensure sustainable and ethical progress. Future research should focus on efficient scaling strategies, ecosystem collaboration, and extending scaling principles beyond traditional language models to multimodal and generalized AI systems.

REFERENCES

[1] Zhao, Wayne Xin, et al. "A survey of large language models." arXiv preprint arXiv:2303.18223 (2023).
[2] Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
[3] Achiam, Josh, et al. "GPT-4 technical report." arXiv preprint arXiv:2303.08774 (2023).
[4] Raffel, Colin, et al. "Exploring the limits of transfer learning with a unified text-to-text transformer." Journal of Machine Learning Research 21.140 (2020): 1-67.
[5] Kaplan, Jared, et al. "Scaling laws for neural language models." arXiv preprint arXiv:2001.08361 (2020).
[6] Hoffmann, Jordan, et al. "Training compute-optimal large language models." arXiv preprint arXiv:2203.15556 (2022).
[7] Xiong, Yizhe, et al. "Temporal scaling law for large language models." arXiv preprint arXiv:2404.17785 (2024).
[8] Clark, Aidan, et al. "Unified scaling laws for routed language models." International Conference on Machine Learning. PMLR, 2022.
[9] Muennighoff, Niklas, et al. "Scaling data-constrained language models." Advances in Neural Information Processing Systems 36 (2023): 50358-50376.
[10] Isik, Berivan, et al. "Scaling laws for downstream task performance of large language models." arXiv preprint arXiv:2402.04177 (2024).
[11] Aghajanyan, Armen, et al. "Scaling laws for generative mixed-modal language models." International Conference on Machine Learning. PMLR, 2023.
[12] Ruan, Yangjun, Chris J. Maddison, and Tatsunori Hashimoto. "Observational scaling laws and the predictability of language model performance." arXiv preprint arXiv:2405.10938 (2024).
[13] Lin, Haowei, et al. "Selecting a large language model to fine-tune via rectified scaling law." arXiv preprint arXiv:2402.02314 (2024).
[14] Chen, Yanxi, et al. "A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models." (2024).
[15] Keneshloo, Yaser, Tian Shi, Naren Ramakrishnan, and Chandan K. Reddy. "Deep reinforcement learning for sequence-to-sequence models." IEEE Transactions on Neural Networks and Learning Systems 31.7 (2019): 2469-2489.
