MLSys Class LLM Introduction

The document introduces language models including BERT, GPT, and T5 which use techniques like masked language modeling, causal language modeling, and text-to-text transfer. It discusses how transformer models use attention and self-attention. The document compares BERT and GPT and explains how pretraining, fine-tuning, prompting, and reinforcement learning from human feedback are used. It raises questions about the advantages and disadvantages of different training methods, the role of systems research in scaling language models, security considerations, and improving energy efficiency.


Introduction to

Language Models
Eve Fleisig & Kayo Yin
CS 294-162
August 28, 2023
Language Modeling

Image credit: jalammar.github.io/illustrated-word2vec/


Masked Language Modeling
BERT

Image credit: jalammar.github.io/illustrated-bert/
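Since the slides illustrate masked language modeling only pictorially, here is a minimal Python sketch of the corruption step, assuming a toy whitespace-tokenized sentence and a generic [MASK] token; BERT's additional 80/10/10 mask/random/keep rule is omitted for brevity.

```python
import random

# Toy illustration of the masked language modeling (MLM) objective used by BERT:
# randomly replace ~15% of input tokens with [MASK] and train the model to
# recover the original tokens from bidirectional context.

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Return (corrupted_tokens, targets): targets[i] is the original token
    at masked positions and None elsewhere (no loss is taken there)."""
    corrupted, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            corrupted.append(mask_token)
            targets.append(tok)      # the model must predict this token
        else:
            corrupted.append(tok)
            targets.append(None)     # unmasked position: no prediction target
    return corrupted, targets

sentence = "the cat sat on the mat".split()
corrupted, targets = mask_tokens(sentence)
print(corrupted)  # e.g. ['the', '[MASK]', 'sat', 'on', 'the', 'mat']
print(targets)    # e.g. [None, 'cat', None, None, None, None]
```

A bidirectional encoder then reads the corrupted sequence and is trained to recover the original tokens at the masked positions.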


Causal Language Modeling
GPT

Image credit: jalammar.github.io/illustrated-gpt2/
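For contrast, a minimal sketch of the causal (left-to-right) language modeling objective, assuming a toy vocabulary and a placeholder uniform next-token distribution standing in for a real model's output.

```python
import numpy as np

# Causal language modeling, as in GPT: at each position t, predict token x_t
# from the prefix x_<t only, and minimize the average negative log-likelihood.

vocab = ["the", "cat", "sat", "on", "mat", "<eos>"]

def next_token_probs(prefix):
    # Placeholder model: a uniform distribution over the vocabulary.
    return np.full(len(vocab), 1.0 / len(vocab))

def causal_lm_loss(tokens):
    """Average negative log-likelihood of each token given its prefix.
    The first token is skipped since it has no preceding context here."""
    nll = 0.0
    for t in range(1, len(tokens)):
        probs = next_token_probs(tokens[:t])
        nll -= np.log(probs[vocab.index(tokens[t])])
    return nll / (len(tokens) - 1)

print(causal_lm_loss(["the", "cat", "sat", "on", "the", "mat", "<eos>"]))
```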


BERT vs. GPT

● Bidirectional encoder models (BERT) do better than generative models at non-generation tasks, for comparable training data and model complexity.

● Generative models (GPT) have training efficiency and scalability advantages that may ultimately make them more accurate. They can also solve downstream tasks in a zero-shot setting.
Transformer

Image credit: jalammar.github.io/illustrated-transformer/


Transformer

Image credit: jalammar.github.io/illustrated-transformer/


Transformer

Image credit: jalammar.github.io/illustrated-transformer/


Attention
Self-Attention
Self-Attention

Image credit: jalammar.github.io/illustrated-gpt2/


Self-Attention

Image credit: jalammar.github.io/illustrated-gpt2/


Self-Attention
Self-Attention
Self-Attention
Self-Attention
Multi-headed Attention
Multi-headed Attention
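The self-attention and multi-headed attention slides can be condensed into a short NumPy sketch; the weight matrices here are random placeholders, and masking and output biases are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one head.
    X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_head)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # (seq_len, d_head)

def multi_head_attention(X, heads, Wo):
    """Run several heads in parallel, concatenate, then project with Wo."""
    outputs = [self_attention(X, Wq, Wk, Wv) for (Wq, Wk, Wv) in heads]
    return np.concatenate(outputs, axis=-1) @ Wo

# Tiny random example: 4 tokens, d_model=8, 2 heads of size 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]
Wo = rng.normal(size=(8, 8))
print(multi_head_attention(X, heads, Wo).shape)  # (4, 8)
```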
Transformer

Image credit: jalammar.github.io/illustrated-transformer/


Transformer Input
Transformer Encoder

Image credit: jalammar.github.io/illustrated-transformer/


Adding the Decoder

Image credit: jalammar.github.io/illustrated-transformer/
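A minimal NumPy sketch of one encoder layer, assuming post-layer-norm residual sublayers and a ReLU feed-forward network; dropout and positional encodings are omitted, and the decoder differences are only noted in comments.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(axis=-1, keepdims=True), x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(X, W, mask=None):
    Wq, Wk, Wv = W
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    if mask is not None:                       # a decoder masks future positions
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ V

def encoder_layer(X, attn_W, W1, b1, W2, b2):
    """One encoder layer: self-attention, then a position-wise feed-forward
    network, each wrapped in a residual connection and layer normalization."""
    X = layer_norm(X + attention(X, attn_W))            # self-attention sublayer
    ffn = np.maximum(0, X @ W1 + b1) @ W2 + b2          # FFN sublayer (ReLU)
    return layer_norm(X + ffn)

# A decoder layer (not shown) adds a causal mask to its self-attention and a
# cross-attention sublayer whose keys/values come from the encoder output.

rng = np.random.default_rng(0)
d, seq = 8, 4
X = rng.normal(size=(seq, d))
attn_W = tuple(rng.normal(size=(d, d)) for _ in range(3))
W1, b1 = rng.normal(size=(d, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, d)), np.zeros(d)
print(encoder_layer(X, attn_W, W1, b1, W2, b2).shape)  # (4, 8)
```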


BERT

Image credit: jalammar.github.io/illustrated-bert/


BERT
GPT
GPT
T5

Text-to-Text Transfer Transformer
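A small sketch of the text-to-text format: every task is cast as string-to-string with a short task prefix. The prefixes and examples below follow the style of the T5 paper's overview figure; the exact strings are illustrative.

```python
# T5 casts every task as text-to-text: inputs and outputs are both strings,
# and a task prefix tells the model which task to perform.

examples = [
    ("translate English to German: That is good.", "Das ist gut."),
    ("summarize: state authorities dispatched emergency crews tuesday ...",
     "emergency crews surveyed storm damage ..."),
    ("cola sentence: The course is jumping well.", "not acceptable"),
]

for source, target in examples:
    print(f"input : {source}")
    print(f"output: {target}\n")
```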


Pretraining & Fine-tuning
Pretraining & Fine-tuning
Pretraining & Fine-tuning

Unsupervised objective

Supervised objective
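A toy numeric sketch of the recipe, where the "model" is just a parameter vector and each objective is a quadratic stand-in; the only point it illustrates is that fine-tuning continues from the pretrained parameters rather than training from scratch.

```python
import numpy as np

def gradient_descent(params, grad_fn, steps=200, lr=0.1):
    """Plain gradient descent on a stand-in objective."""
    for _ in range(steps):
        params = params - lr * grad_fn(params)
    return params

pretrain_target = np.array([1.0, 2.0, 3.0])   # stand-in for the unsupervised objective
finetune_target = np.array([1.2, 1.8, 3.5])   # stand-in for the supervised objective

# Phase 1: pretraining on broad unlabeled data (unsupervised objective).
pretrained = gradient_descent(np.zeros(3), lambda p: p - pretrain_target)

# Phase 2: fine-tuning starts from the pretrained weights and adapts them
# with a (usually much smaller) labeled dataset (supervised objective).
finetuned = gradient_descent(pretrained, lambda p: p - finetune_target, steps=20)

print(pretrained.round(2), finetuned.round(2))
```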
Prefixes & Prompting
Few- & Zero-Shot Learning
Few- & Zero-Shot Learning
Few- & Zero-Shot Learning
Few- & Zero-Shot Learning

Generalization to new tasks without fine-tuning enabled by:

Scaling: data and compute
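A small sketch contrasting zero-shot and few-shot prompts, formatted in the style of the GPT-3 paper's translation examples; no model call is made here, the point is only how the task is specified in the prompt rather than through weight updates.

```python
# In-context learning: a zero-shot prompt gives only an instruction, while a
# few-shot prompt also includes a handful of worked examples before the query.

zero_shot = (
    "Translate English to French:\n"
    "cheese =>"
)

few_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "plush giraffe => girafe peluche\n"
    "cheese =>"
)

print(zero_shot, "\n---\n", few_shot, sep="")
```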
Scaling Data
C4 dataset (a cleaned subset of Common Crawl): introduced with T5; still in use
GPT-3 Training Data:
Scaling Data & Compute

Kaplan et al., 2020; Hoffmann et al., 2022
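A sketch of the parametric loss form used by Hoffmann et al. (2022), with placeholder coefficients rather than the paper's fitted values, swept under the common C ≈ 6ND FLOPs approximation to show how a fixed compute budget trades model size against training tokens.

```python
import numpy as np

# Parametric scaling-law form from Hoffmann et al. (2022) ("Chinchilla"):
#     L(N, D) = E + A / N**alpha + B / D**beta
# for N parameters and D training tokens. Coefficients below are placeholders.

E, A, B, alpha, beta = 1.7, 400.0, 400.0, 0.34, 0.28

def predicted_loss(N, D):
    return E + A / N**alpha + B / D**beta

# For a fixed compute budget C ~ 6*N*D FLOPs, sweep model sizes and compare
# the predicted loss at the token count each size can afford.
C = 1e21
for N in np.logspace(8, 11, 7):            # 100M to 100B parameters
    D = C / (6 * N)                        # tokens affordable at this size
    print(f"N={N:.1e}  D={D:.1e}  L={predicted_loss(N, D):.3f}")
```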
Reinforcement Learning from Human Feedback
Reinforcement Learning from Human Feedback
Reinforcement Learning from Human Feedback
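A toy sketch of the reward-modeling step in RLHF, assuming a placeholder linear reward over hand-picked features and a Bradley-Terry style preference loss; the later policy-optimization stage (e.g., PPO with a KL penalty toward the supervised model) is only described in comments.

```python
import numpy as np

# RLHF pipeline, roughly: (1) collect human preference pairs, (2) fit a reward
# model so preferred responses score higher, (3) optimize the language model
# against that reward (e.g., with PPO), usually with a KL penalty keeping it
# close to the pretrained/supervised starting point.

def reward(prompt, response, w):
    """Placeholder reward model: a linear score over toy features.
    A real reward model is a learned network over (prompt, response)."""
    features = np.array([len(response.split()), response.count("!")])
    return float(w @ features)

def preference_loss(prompt, chosen, rejected, w):
    """Bradley-Terry style loss: -log sigmoid(r(chosen) - r(rejected))."""
    diff = reward(prompt, chosen, w) - reward(prompt, rejected, w)
    return float(np.log1p(np.exp(-diff)))

w = np.array([0.1, -0.5])
print(preference_loss("Explain RLHF.",
                      "It trains a reward model from human preferences.",
                      "It is magic!!",
                      w))
```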
Discussion
● What are the advantages and disadvantages of different training or tuning methods
that have been tried (task-specific training, pretrain/fine-tune, prompting, RLHF)?
● What is the role of systems research in scaling up LLMs? How could advances in
systems research change scaling “laws”?
● What security considerations do we need to consider when deploying LLMs into the
real world?
● How can we improve the energy efficiency and carbon footprint of LLMs?
