
LLMs for text classification and generation
INTRODUCTION TO LLMS IN PYTHON

Iván Palomares Carrascosa, PhD
Senior Data Science & AI Manager
Loading a pre-trained LLM

Pipelines (pipeline()):
- Simple, high-level interface
- Automatic model and tokenizer selection
- More abstraction = less control
- Limited task flexibility

Auto classes (e.g. the AutoModel class):
- Flexibility, control, and customization
- Complexity: manual set-up
- Support for very diverse language tasks
- Enable model fine-tuning
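As a quick illustration of the high-level interface, here is a minimal pipeline() sketch (the task name and example text are our own; with no checkpoint specified, transformers downloads a default model for the task):

from transformers import pipeline

# High-level interface: naming the task is enough; a default checkpoint is used
classifier = pipeline("text-classification")
print(classifier("I am an example sequence for text classification."))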

INTRODUCTION TO LLMS IN PYTHON


The AutoModel and AutoTokenizer classes

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

text = "I am an example sequence for text classification."

class SimpleClassifier(nn.Module):
    def __init__(self, input_size, num_classes):
        super(SimpleClassifier, self).__init__()
        self.fc = nn.Linear(input_size, num_classes)

    def forward(self, x):
        return self.fc(x)

from_pretrained(): loads the pre-trained model weights and tokenizer specified by model_name

model_name: a model checkpoint, i.e. a unique model version with a specific architecture, configuration, and weights

AutoModel does not provide a task-specific head

INTRODUCTION TO LLMS IN PYTHON


The AutoModel and AutoTokenizer classes

inputs = tokenizer(
    text, return_tensors="pt", padding=True,
    truncation=True, max_length=64)
outputs = model(**inputs)
pooled_output = outputs.pooler_output
print("Hidden states size: ", outputs.last_hidden_state.shape)
print("Pooled output size: ", pooled_output.shape)

classifier_head = SimpleClassifier(
    pooled_output.size(-1), num_classes=2)
logits = classifier_head(pooled_output)
probs = torch.softmax(logits, dim=1)
print("Predicted Class Probabilities:", probs)

Hidden states size:  torch.Size([1, 11, 768])
Pooled output size:  torch.Size([1, 768])
Predicted Class Probabilities:
tensor([[0.4334, 0.5666]], grad_fn=<SoftmaxBackward0>)

Tokenize the inputs and get the model's hidden states in outputs:
- pooler_output: a high-level, aggregated representation of the sequence
- last_hidden_state: the raw, unaggregated hidden states

Forward pass through the classification head to obtain class probabilities

INTRODUCTION TO LLMS IN PYTHON


Auto class for text classification

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name)

text = "The quality of the product was just okay."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
predicted_class = torch.argmax(logits, dim=1).item()
print(f"Predicted class index: {predicted_class + 1} star.")

Predicted class index: 3 star.

AutoModelForSequenceClassification class:
- Provides a pre-configured model with a classification head; no need to add a model head manually
- outputs have already passed through the head's linear layer
- Access the raw class logits and return the "most likely" class

INTRODUCTION TO LLMS IN PYTHON


Auto class for text generation

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "This is a simple example for text generation,"
inputs = tokenizer.encode(
    prompt, return_tensors="pt")
output = model.generate(inputs, max_length=26)

generated_text = tokenizer.decode(
    output[0], skip_special_tokens=True)
print("Generated Text:")
print(generated_text)

Generated Text:
This is a simple example for text generation, but it's also
a good way to get a feel for how the text is generated.

AutoModelForCausalLM class:
- Pre-configured model for causal (auto-regressive) language generation, e.g. "gpt2"
- Model head for next-word prediction
- generate() takes the prompt and generates tokens until the sequence reaches max_length (prompt tokens included)
- Raw outputs are decoded before printing the extended prompt with the generated text

INTRODUCTION TO LLMS IN PYTHON


Exploring a dataset for text classification

from datasets import load_dataset
from torch.utils.data import DataLoader

dataset = load_dataset("imdb")
train_data = dataset["train"]
dataloader = DataLoader(train_data, batch_size=2, shuffle=True)

for batch in dataloader:
    for i in range(len(batch["text"])):
        print(f"Example {i + 1}:")
        print("Text:", batch["text"][i])
        print("Label:", batch["label"][i])

Example 1:
Text: Much worse than the original. It was actually *painf(...)
Label: tensor(0)
Example 2:
Text: I have to agree with Cal-37 it's a great movie, spec(...)
Label: tensor(1)

load_dataset(): loads a dataset from the Hugging Face Hub
- imdb: review sentiment classification

DataLoader class: simplifies iteration, batch processing, and parallelism
- Iterating through review-sentiment examples

INTRODUCTION TO LLMS IN PYTHON


Exploring a dataset for text generation

from datasets import load_dataset

dataset = load_dataset("stanfordnlp/shp", "askculinary")
train_data = dataset["train"]

for i in range(5):
    example = train_data[i]
    print(f"Example {i + 1}:")
    print("Title:", example["post_id"])
    print("Paragraph:", example["history"])
    print()

Example 1:
Title: himc90
Paragraph: In an interview right before receiving the 2013
Nobel prize in physics, Peter Higgs stated that he (...)

Example 2 (...)

Using a dataset from the stanfordnlp catalogue:
- Suitable for text generation and generative QA
- Display some text information from the data instances

INTRODUCTION TO LLMS IN PYTHON


How text generation LLM training works

Input + target (labels) pairs:
- Input sequences: a segment of the text, e.g. "the cat is" from "the cat is sleeping on the mat"
- Target sequences: the tokens shifted one position to the left, e.g. "cat is sleeping" (see the sketch below)
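A minimal sketch of how such pairs can be built from a tokenized sequence (the tensor values are made-up token ids, purely illustrative):

import torch

# Toy token ids for "the cat is sleeping on the mat" (hypothetical values)
token_ids = torch.tensor([[464, 3797, 318, 11029, 319, 262, 2603]])

input_ids = token_ids[:, :-1]  # "the cat is sleeping on the"
labels = token_ids[:, 1:]      # "cat is sleeping on the mat": shifted one position left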

INTRODUCTION TO LLMS IN PYTHON


Let's practice!
INTRODUCTION TO LLMS IN PYTHON
LLMs for text
summarization and
translation
INTRODUCTION TO LLMS IN PYTHON

Iván Palomares Carrascosa, PhD


Senior Data Science & AI Manager


Inside text summarization

Goal: create a summarized version of a text, preserving important information
- Inputs: the original text
- Target (labels): the summarized text

Two approaches:
- Extractive summarization: select, extract, and combine parts of the original text
- Abstractive summarization: generate the summary word by word
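Abstractive summarization is what the T5 example coming up performs. For contrast, here is a minimal extractive sketch, not from the course: it scores sentences by word frequency, a toy stand-in for real extractive methods:

from collections import Counter

def extractive_summary(text, num_sentences=1):
    # Split into sentences and score each by the frequency of its words
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    word_freq = Counter(text.lower().split())
    scores = [sum(word_freq[w] for w in s.lower().split()) for s in sentences]
    # Keep the highest-scoring sentences, in their original order
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:num_sentences]
    return ". ".join(sentences[i] for i in sorted(top)) + "."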

INTRODUCTION TO LLMS IN PYTHON


Exploring a text summarization dataset

from datasets import load_dataset

dataset = load_dataset("ILSUM/ILSUM-1.0", "English")
print(f"Features: {dataset['train'].column_names}")

Features: ['id', 'Article', 'Heading', 'Summary']

Two main text attributes:
- Long text: the input sequence for the LLM ('Article' in the example)
- Summarized text: the target, or training label ('Summary' in the example)

example = dataset["train"][21]
example['Article']

This is how an Apple Watch saved a man's life after detecting
accident. It all started when Gabe Burdett was waiting for his
father Bob at their pre-designated location for some mountain
biking at the Riverside State Park when he received a text
alert from his dad's Apple Watch, saying it had detected a
"hard fall".Burdett, from city of Spokane in Washington State
later received another update from the Watch, saying his father
had reached Sacred Heart Medical Center."We drove straight
there but he was gone when we arrived. I get another (...)

example['Summary']

Dad flipped his bike at the bottom of Doomsday, hit his head
and was knocked out until sometime during the ambulance ride.
The watch had called 911 with his location and EMS had him
scooped up and to the hospital in under a 1/2hr.

INTRODUCTION TO LLMS IN PYTHON


Loading a pre-trained LLM for summarization

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

input_ids = tokenizer.encode(
    "summarize: " + example["Article"],
    return_tensors="pt", max_length=512, truncation=True
)

summary_ids = model.generate(input_ids, max_length=150)
summary = tokenizer.decode(
    summary_ids[0], skip_special_tokens=True)

print("Original Text:")
print(example["Article"])
print("\nGenerated Summary:")
print(summary)

Import and use AutoModelForSeq2SeqLM:
- Load t5-small: versatile for various tasks
- Add a task-specific prefix to the input text: "summarize: "
- .generate() passes the tokenized input through the model
- .decode() post-processes the output token ids back into text

INTRODUCTION TO LLMS IN PYTHON


Loading a pre-trained LLM for summarization

Running the previous code on example["Article"]:

Original Text:
This is how an Apple Watch saved a man's life after detecting
accident. It all started when Gabe Burdett was waiting for his
father Bob at their pre-designated location for some mountain
biking at the Riverside State Park when he received a text
alert from his dad's Apple Watch, saying it had detected a
"hard fall".Burdett, from city of Spokane in Washington State
later received another update from the Watch, saying his father
had reached Sacred Heart Medical Center."We drove straight
there but he was gone when we arrived. I get another (...)

Generated Summary:
a man was waiting for his father when he received a text alert
from his dad's apple watch. the watch notified 911 with the
location and within 30 minutes, emergency medical services took
the injured Bob to the hospital. the watch notified 911 with
the location and within 30 minutes, emergency medical services
took the injured Bob to the hospital.

(Due to space limitations, only the first 50% of the original input text is shown on the slide.)

INTRODUCTION TO LLMS IN PYTHON




Inside language translation

Goal: produce a translated version of a text, conveying the same meaning and context
- Inputs: text in the source language
- Target (labels): the target-language translation

How it works:
- Encode the source-language sequence
- Decode into the target-language sequence, using learned language patterns and associations

INTRODUCTION TO LLMS IN PYTHON


Exploring a language translation dataset

from datasets import load_dataset

dataset = load_dataset("techiaith/legislation-gov-uk_en-cy")
sample_data = dataset["train"]

input_example = sample_data.data['source'][0]
target_example = sample_data.data['target'][0]

print("Input (English):", input_example)
print("Target (Welsh):", target_example)

Input (English): 2 Regulations under section 1: supplementary
Target (Welsh): 2 Rheoliadau o dan adran 1: atodol

Load the English-Welsh bilingual dataset as a Dataset object and extract a training example:
- source: English sequences
- target: Welsh sequences

INTRODUCTION TO LLMS IN PYTHON


Loading a pre-trained LLM for translation

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Helsinki-NLP/opus-mt-en-cy"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

input_seq = "2 Regulations under section 1: supplementary"

input_ids = tokenizer.encode(input_seq, return_tensors="pt")
translated_ids = model.generate(input_ids)
translated_text = tokenizer.decode(
    translated_ids[0], skip_special_tokens=True)
print("Predicted (Welsh):", translated_text)

Predicted (Welsh):
2 Rheloiad o dan adran 1:aryary " means "i

Import and use AutoModelForSeq2SeqLM:
- Load a Helsinki-NLP model for English-Welsh translation
- Tokenize the English sequence (.encode()) and pass it to the model (.generate())
- Decode and print the Welsh translation

INTRODUCTION TO LLMS IN PYTHON


Let's practice!
INTRODUCTION TO LLMS IN PYTHON
LLMs for question answering
INTRODUCTION TO LLMS IN PYTHON

Iván Palomares Carrascosa, PhD
Senior Data Science & AI Manager
Types of question answering (QA) tasks

QA task type        Architecture
Extractive          Encoder-only
Open generative     Encoder-decoder
Closed generative   Decoder-only

- Extractive QA: the LLM extracts the answer to a question from a provided context
- Open generative QA: the LLM generates the answer based on a context
- Closed generative QA: the LLM fully generates the answer; no context is provided
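These three task types map naturally onto the Auto classes seen so far. A minimal sketch (this class-to-task mapping is a common convention, not prescribed by the slides):

from transformers import (
    AutoModelForQuestionAnswering,  # extractive QA (encoder-only, e.g. BERT-style)
    AutoModelForSeq2SeqLM,          # open generative QA (encoder-decoder, e.g. T5)
    AutoModelForCausalLM,           # closed generative QA (decoder-only, e.g. GPT-2)
)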

INTRODUCTION TO LLMS IN PYTHON


Exploring a QA dataset

from datasets import load_dataset

mlqa = load_dataset(
    "xtreme", name="MLQA.en.en")
print(mlqa)

DatasetDict({
    test: Dataset({
        features: ['id', 'title', 'context',
                   'question', 'answers'],
        num_rows: 11590
    })
    validation: Dataset({
        features: ['id', 'title', 'context',
                   'question', 'answers'],
        num_rows: 1148
    })
})

Load the English subset of the xtreme dataset for extractive QA:
- A DatasetDict object containing test and validation Dataset objects
- Relevant features: 'context', 'question', 'answers'

INTRODUCTION TO LLMS IN PYTHON


Exploring a QA dataset
Example instance in the dataset:

print("Question:" , mlqa["test"]["question"][53])
print("Answer:" , mlqa["test"]["answers"][53])
print("Context:" , mlqa["test"]["context"][53])

Question: what is a kimchi?

Answer: {'answer_start': [271], 'text': ['a fermented, usually spicy vegetable dish']}

Context: Korean cuisine is largely based on rice, noodles, tofu, vegetables, fish and meats. Traditional Korean
meals are noted for the number of side dishes, banchan, which accompany steam-cooked short-grain rice. Every
meal is accompanied by numerous banchan. Kimchi, a fermented, usually spicy vegetable dish is commonly served
at every meal and is one of the best known Korean dishes. Korean cuisine usually involves heavy seasoning with
sesame oil, doenjang, a type of fermented soybean paste, soy sauce, salt, garlic, ginger, and gochujang, a hot
pepper paste. Other well-known dishes are Bulgogi, grilled marinated beef, Gimbap, and Tteokbokki , a spicy
snack consisting of rice cake seasoned with gochujang or a spicy chili paste.

INTRODUCTION TO LLMS IN PYTHON




Extractive QA: framing the problem

Supervised learning: span classification
- Prediction result: an answer span given by [start position, end position]
- The answer span is obtained from the most likely raw outputs (logits), as in the toy example below
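To make the span idea concrete, here is a toy illustration with made-up logits over a four-token input (the real model code follows on the next slides):

import torch

start_logits = torch.tensor([0.1, 3.5, 0.2, 0.3])  # token 1: most likely start
end_logits = torch.tensor([0.1, 0.2, 0.3, 4.1])    # token 3: most likely end

start_idx = torch.argmax(start_logits)   # tensor(1)
end_idx = torch.argmax(end_logits) + 1   # tensor(4): exclusive end index
# answer span = input tokens [1:4]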

INTRODUCTION TO LLMS IN PYTHON


Extractive QA: tokenizing inputs

from transformers import AutoTokenizer

model_ckp = "deepset/minilm-uncased-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_ckp)

question = "How is the taste of wasabi?"
context = """Japanese cuisine captures the essence of \
a harmonious fusion between fresh ingredients and \
traditional culinary techniques, all heightened \
by the zesty taste of the aromatic green condiment \
known as wasabi."""

inputs = tokenizer(question, context,
                   return_tensors="pt")

Tokenization results:

Tensor            Description
input_ids         Integer token ids
attention_mask    Boolean attention flags
token_type_ids    0: question, 1: context

INTRODUCTION TO LLMS IN PYTHON


Extractive QA: loading and using the model

import torch
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained(model_ckp)

with torch.no_grad():
    outputs = model(**inputs)

start_idx = torch.argmax(outputs.start_logits)
end_idx = torch.argmax(outputs.end_logits) + 1

answer_span = inputs["input_ids"][0][start_idx:end_idx]
answer = tokenizer.decode(answer_span)

Custom model class: AutoModelForQuestionAnswering
- Inference on an example input: **inputs unpacks the tokenized inputs
- Raw output post-processing: start_logits and end_logits hold answer start/end likelihoods per input token
- start_idx, end_idx: positions of the input tokens delimiting the answer span

INTRODUCTION TO LLMS IN PYTHON


Managing long context sequences

long_exmp = tokenizer(example_qt, example_ct,
                      return_overflowing_tokens=True,
                      max_length=100, stride=25)

for idx, window in enumerate(long_exmp["input_ids"]):
    print("No. tokens in window", idx, ":", len(window))

No. tokens in window 0 : 100
No. tokens in window 1 : 100
[...]
No. tokens in window 8 : 50

for window in long_exmp["input_ids"]:
    print(tokenizer.decode(window), "\n")

[CLS] what is a kimchi? [SEP] Korean cuisine is l[...]

[CLS] what is a kimchi? [SEP] steam-cooked short-[...]

Sliding window parameters:
- max_length: the sliding window size
- stride: the number of overlapping tokens between consecutive windows

INTRODUCTION TO LLMS IN PYTHON


Let's practice!
INTRODUCTION TO LLMS IN PYTHON
LLM fine-tuning and transfer learning
INTRODUCTION TO LLMS IN PYTHON

Iván Palomares Carrascosa, PhD
Senior Data Science & AI Manager
Revisiting the LLM lifecycle

Two fine-tuning strategies:
- Full fine-tuning: the entire model's weights are updated; more computationally expensive
- Partial fine-tuning: lower (body) layers are kept fixed; only the task-specific layers (the head) are updated, as in the sketch below
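A minimal sketch of partial fine-tuning (the DistilBERT checkpoint and attribute names are our illustration, not from the slides):

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

# Freeze the model body: its weights are no longer updated by the optimizer
for param in model.distilbert.parameters():
    param.requires_grad = False
# Only the classification head (model.pre_classifier, model.classifier) remains trainable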

INTRODUCTION TO LLMS IN PYTHON


Demystifying transfer learning

Transfer learning: a model trained on one task is adapted to a different but related task
- With pre-trained LLMs: fine-tune on a smaller dataset for a specific task
- Zero-shot learning: perform tasks never "seen" during training
- One-shot and few-shot learning: adapt a model to a new task with only one or a few examples
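A minimal zero-shot sketch using the pipeline() interface (the example text and candidate labels are our own; with no checkpoint specified, a default NLI-based model is downloaded):

from transformers import pipeline

# Zero-shot: classify into labels the model was never explicitly trained on
classifier = pipeline("zero-shot-classification")
print(classifier("I loved this film!",
                 candidate_labels=["positive", "negative"]))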

INTRODUCTION TO LLMS IN PYTHON


Fine-tuning a pre-trained Hugging Face LLM

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from datasets import load_dataset

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2)

def tokenize_function(examples):
    return tokenizer(
        examples["text"], padding="max_length", truncation=True)

data = load_dataset("imdb")
tokenized_data = data.map(tokenize_function, batched=True)

Load a BERT-based model for text classification and its associated tokenizer, then tokenize the fine-tuning dataset:
- IMDB reviews dataset
- truncation=True truncates input sequences beyond the model's max_length
- batched=True processes examples in batches rather than individually

INTRODUCTION TO LLMS IN PYTHON


Fine-tuning a pre-trained Hugging Face LLM

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./smaller_bert_finetuned",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=500,
    logging_dir="./logs",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
)

trainer.train()

TrainingArguments class: customize training settings
- Output directory, batch size per GPU, epochs, etc.

Trainer class: manages the training and validation loop
- Specify the model, training arguments, and the training and validation sets
- trainer.train(): executes the training loop

INTRODUCTION TO LLMS IN PYTHON


Inference and saving a fine-tuned LLM

example_input = tokenizer(
    "I am absolutely amazed with this new and revolutionary AI device",
    return_tensors="pt")

output = model(**example_input)
predicted_label = torch.argmax(output.logits, dim=1).item()
print("Predicted Label:", predicted_label)

Predicted Label: 0

model.save_pretrained("./my_bert_finetuned")
tokenizer.save_pretrained("./my_bert_finetuned")

After fine-tuning, inference is performed as usual: tokenize the inputs, pass them to the LLM, then obtain and post-process the outputs.

The fine-tuned model and tokenizer can be saved using .save_pretrained().
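A minimal sketch of reloading the saved artifacts later (using the same directory as above):

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the fine-tuned model and tokenizer back from disk
model = AutoModelForSequenceClassification.from_pretrained("./my_bert_finetuned")
tokenizer = AutoTokenizer.from_pretrained("./my_bert_finetuned")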

INTRODUCTION TO LLMS IN PYTHON


Let's practice!
INTRODUCTION TO LLMS IN PYTHON
