
UET

Since 2004

ĐẠI HỌC CÔNG NGHỆ, ĐHQGHN


VNU-University of Engineering and Technology

Natural Language Processing - INT3406E 20

Large Language Model

Nguyen Van Vinh - UET


Outline

● Introduction to LM
● Large Language Models and applications

UET-FIT 2
Language Modeling (Mô hình ngôn ngữ)?

● What is the probability of “Tôi trình bày ChatGPT tại Trường ĐH Công
Nghệ” ?
● What is the probability of “Công Nghệ học Đại trình bày ChatGPT tại Tôi” ?
● What is the most likely next word after “Tôi trình bày ChatGPT tại Trường ĐH Công
nghệ, địa điểm …”, i.e., P(… | Tôi trình bày ChatGPT tại Trường ĐH Công nghệ, địa điểm)?
● A model that computes either of these for a word sequence W = w1, w2, …, wn:
P(W) or P(wn | w1, w2, …, wn-1)
is called a language model
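As an illustration of this definition, here is a minimal sketch of a bigram language model estimated from a single toy sentence (the corpus, counts, and add-one smoothing are purely illustrative and not part of the lecture):

```python
from collections import Counter

# Toy corpus: the example sentence from this slide, lowercased; purely illustrative.
corpus = "tôi trình bày chatgpt tại trường đh công nghệ".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_next(word, prev):
    """P(word | prev) estimated from bigram counts, with add-one smoothing."""
    vocab = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)

def p_sentence(words):
    """Chain rule under the bigram assumption: P(w1..wn) ~ product of P(wi | wi-1)."""
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= p_next(word, prev)
    return prob

# The grammatical sentence gets a higher probability than the scrambled one.
print(p_sentence("tôi trình bày chatgpt tại trường đh công nghệ".split()))
print(p_sentence("công nghệ học đại trình bày chatgpt tại tôi".split()))
```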

3
Large Language Model

4
Large Language Model (Hundreds of Billions of
Tokens)

5
6
Large Language Models - yottaFlops of Compute

Source: https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture11-prompting-rlhf.pdf 7
Why LLMs?

● Double Descent

8
Why LLMs?

● Scaling Law for Neural Language Models


○ Performance depends strongly on scale! We keep getting better performance as
we scale the model, data, and compute up!
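The scaling-law observation can be made concrete with the power-law form reported by Kaplan et al. (2020), roughly L(N) ≈ (N_c / N)^α for parameter count N. The sketch below uses constants of about the order given in that paper, but they should be treated as illustrative only:

```python
# Illustrative power-law scaling of loss with parameter count N (Kaplan et al., 2020 form).
# The constants are rough, illustrative values; see the paper for the actual fitted numbers.
def predicted_loss(n_params, n_c=8.8e13, alpha=0.076):
    return (n_c / n_params) ** alpha

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss ~ {predicted_loss(n):.2f}")
```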

9
Why LLMs?

● Generalization
○ We can now use one single model to solve many NLP tasks

10
Why LLMs? Emergence in few-shot prompting

● Emergent Abilities: some abilities of LMs are not present in smaller models
but are present in larger models
Emergent Capability - In-Context Learning
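Since the slides illustrate in-context learning with figures, here is a minimal textual example of a few-shot prompt; the reviews and labels are made up:

```python
# A minimal few-shot (in-context learning) prompt: the model is shown input->label
# examples inside the prompt and is expected to continue the pattern, with no weight updates.
prompt = """Review: The film was a delight. Sentiment: positive
Review: Two hours I will never get back. Sentiment: negative
Review: A moving and beautifully shot story. Sentiment:"""
# Sending `prompt` to a sufficiently large LM typically yields " positive".
print(prompt)
```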

12
Emergent Capability - In-Context Learning

13
What is pre-training / fine-tuning?

● “Pre-train” a model on a large dataset for task X, then “fine-tune” it on a
dataset for task Y
● Key idea: X is somewhat related to Y, so a model that can do X will have
some good neural representations for Y as well (transfer learning)
● ImageNet pre-training is huge in computer vision: learning generic visual
features for recognizing objects

Can we find some task X that can be useful for a wide range of
downstream tasks Y?
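A minimal sketch of the pre-train / fine-tune recipe using the Hugging Face transformers library, with BERT as the pre-trained model for task X and a tiny made-up sentiment dataset standing in for task Y (model name, hyperparameters, and data are illustrative, not from the lecture):

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Tiny made-up dataset standing in for downstream task Y.
data = Dataset.from_dict({
    "text": ["great movie", "terrible plot", "loved it", "boring"],
    "label": [1, 0, 1, 0],
})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Pre-trained encoder (task X: masked LM) plus a freshly initialized classification head.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=32)

data = data.map(tokenize, batched=True)

args = TrainingArguments(output_dir="out", num_train_epochs=1,
                         per_device_train_batch_size=2)
Trainer(model=model, args=args, train_dataset=data).train()
```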

14
Pretraining + Prompting Paradigm

15
Prompt Engineering (2020 → now)
● Prompts involve instructions and context passed to a language model to
achieve a desired task

Prompt engineering is the practice of developing and optimizing prompts to
efficiently use language models (LMs) for a variety of applications.
16
Prompt Engineering Techniques
● Many advanced prompting techniques have been designed to
improve performance on complex tasks (a brief sketch follows this list):
○ Few-shot prompts
○ Chain-of-thought (CoT) prompting
○ Self-Consistency
○ Knowledge Generation Prompting
○ ReAct
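A small sketch combining two of the techniques above, chain-of-thought prompting and self-consistency; the generate() function is a hypothetical stand-in for a real LLM API call, and the exemplar is made up:

```python
from collections import Counter

# Chain-of-thought prompt: the exemplar shows intermediate reasoning before the answer.
cot_prompt = """Q: A shop has 3 boxes with 4 pens each. How many pens?
A: There are 3 boxes of 4 pens, so 3 * 4 = 12. The answer is 12.
Q: Lan has 5 bags with 6 apples each. How many apples?
A:"""

def generate(prompt, temperature=0.7):
    """Hypothetical LLM call; replace with a real API client in practice."""
    return "5 bags of 6 apples is 5 * 6 = 30. The answer is 30."

# Self-consistency: sample several reasoning paths and majority-vote the final answer.
answers = [generate(cot_prompt).rsplit("The answer is", 1)[-1].strip(" .") for _ in range(5)]
print(Counter(answers).most_common(1)[0][0])   # -> "30"
```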

17
Temperature and Top-p Sampling in LLMs

● Temperature and Top-p sampling are two essential parameters that can be
tweaked to control the output of LLMs
● Temperature (0-2): This parameter determines the creativity and diversity of the text
generated by the LLM. A higher temperature value (e.g., 1.5) leads to more
diverse and creative text, while a lower value (e.g., 0.5) results in more focused and
deterministic text.
● Top-p Sampling (0-1): This parameter balances diversity against high-probability
words by sampling only from the smallest set of top tokens whose cumulative
probability mass is at least the threshold p (see the sketch below).
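A minimal NumPy sketch of how these two parameters interact when sampling the next token; the toy logits are made up, and in a real LLM they would come from the model's final layer:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_p=0.9, rng=np.random.default_rng()):
    """Sample one token id from logits with temperature scaling and nucleus (top-p) filtering."""
    scaled = logits / max(temperature, 1e-8)           # higher T flattens the distribution
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                    # most probable tokens first
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]  # smallest set with mass >= top_p
    kept = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept))

# Toy vocabulary of 5 tokens; purely illustrative logits.
print(sample_next_token(np.array([2.0, 1.5, 0.3, -1.0, -2.0]), temperature=0.7, top_p=0.9))
```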

18
Three major forms of pre-training (LLMs)

19
BERT: Bidirectional Encoder Representations from
Transformers

Source: (Devlin et al, 2019): BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding 20
Masked Language Modeling (MLM)

● Q: Why can’t we do language modeling with bidirectional models?

● Solution: Mask out k% of the input words, and then predict the masked words
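A simplified sketch of the masking step; it ignores BERT's 80/10/10 replacement rule and simply puts a [MASK] token at every selected position:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Randomly replace ~mask_rate of tokens with [MASK]; return masked input and targets."""
    random.seed(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            masked.append(mask_token)
            targets[i] = tok          # the model is trained to predict these positions
        else:
            masked.append(tok)
    return masked, targets

print(mask_tokens("the students opened their books and started to read".split()))
```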

21
Next Sentence Prediction (NSP)

22
BERT pre-training

23
RoBERTa
● BERT is still under-trained
● Removed the next sentence prediction pre-training — it adds more noise than
benefits!
● Trained longer with 10x data & bigger batch sizes
● Pre-trained on 1,024 V100 GPUs for one day in 2019

24
(Liu et al., 2019): RoBERTa: A Robustly Optimized BERT Pretraining Approach
Text-to-text models: the best of both worlds (Bard)?
● Encoder-only models (e.g., BERT) enjoy the benefits of bidirectionality but they can’t be
used to generate text
● Decoder-only models (e.g., GPT-3, Llama 2) can do generation but they are left-to-right
LMs
● Text-to-text models combine the best of both worlds!

T5 = Text-to-Text Transfer Transformer

(Raffel et al., 2020): Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer 25
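Following the T5 framing above, a minimal text-to-text sketch using the Hugging Face transformers implementation of T5, assuming the transformers and sentencepiece packages are available; the model size and example sentence are illustrative:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is cast as text in -> text out via a task prefix.
inputs = tok("translate English to German: The house is wonderful.", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
```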
How to use these pre-trained models?

26
From GPT to GPT-2 to GPT-3

27
Quiz

● Context size?
● The larger the context size, the more difficult it is?

28
GPT-3: language models are few-shot learners

● GPT-2 → GPT-3: 1.5B → 175B (# of parameters), ~14B → 300B (# of tokens)

29
GPT-3’s in-context learning

30
[2020] GPT-3 to [2022] ChatGPT

What’s new?
● Training on code

● Supervised instruction tuning

● RLHF = Reinforcement learning from human feedback

Source: Fu, 2022, “How does GPT Obtain its Ability? Tracing Emergent Abilities of Language
Models to their Sources" 31
How was ChatGPT developed?
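The figures on this and the previous slide describe the InstructGPT-style recipe. The skeleton below only outlines the order of the three stages; every function is a hypothetical placeholder, not real training code:

```python
# Highly simplified outline of the pipeline: SFT -> reward model -> RLHF.
# All functions are hypothetical placeholders for illustration only.

def supervised_finetune(base_model, demonstrations):
    """Step 1: fine-tune the pretrained LM on human-written (prompt, response) pairs."""
    return base_model  # placeholder

def train_reward_model(sft_model, ranked_comparisons):
    """Step 2: train a reward model on human rankings of candidate responses."""
    return lambda prompt, response: 0.0  # placeholder reward function

def rlhf(sft_model, reward_model, prompts):
    """Step 3: optimize the SFT model against the reward model (e.g., with PPO)."""
    return sft_model  # placeholder

policy = rlhf(supervised_finetune("base-lm", []), train_reward_model(None, []), [])
```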

32
Evaluation of LLMs

33
Newest LLMs

● Claude 2.1 (Anthropic)
○ 200K Context Window
○ 2x Decrease in Hallucination Rates
● GPT-4 Turbo (OpenAI)
○ 128K Context Window

Vietnamese
● PhoGPT (VinAI)
● FPT.AI
● VNG (Zalo)
● …

34
ChatGPT application for reading comprehension (ChatPdf)

● Fine-tune the ChatGPT model with training data in a specific domain
● Use LLM improvement techniques based on Retrieval-Augmented Generation
(RAG); a toy sketch of the retrieval step follows below
● Use effective prompting to achieve the expected output
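A toy sketch of the retrieval step in RAG; embed() is a hypothetical stand-in for a real embedding model, and the chunks, question, and prompt template are made up:

```python
import numpy as np

def embed(text):
    """Hypothetical embedding; a real system would call an embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

def retrieve(question, chunks, k=2):
    """Rank document chunks by cosine similarity to the question and keep the top k."""
    q = embed(question)
    scores = [(float(q @ embed(c)), c) for c in chunks]
    return [c for _, c in sorted(scores, reverse=True)[:k]]

chunks = ["Chapter 1 ...", "Chapter 2 ...", "Chapter 3 ..."]
question = "What does chapter 2 say about training data?"
context = "\n".join(retrieve(question, chunks))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# `prompt` would then be sent to the LLM (e.g., ChatGPT) for a grounded answer.
print(prompt)
```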

35
Large Language Model Risks

● LLMs make mistakes
○ (falsehoods, hallucinations)
● LLMs can be misused
○ (misinformation, spam)
● LLMs can cause harms
○ (toxicity, biases, stereotypes)
● LLMs can be attacked
○ (adversarial examples, poisoning, prompt injection)
● LLMs are costly to train and deploy

36
Summary

● Introduction to LLM
● Large Language models (types)

UET-FIT 37
UET
Since 2004

ĐẠI HỌC CÔNG NGHỆ, ĐHQGHN


VNU-University of Engineering and Technology

Thank you
Email me
[email protected]
