Chapter 3
classification and generation
INTRODUCTION TO LLMS IN PYTHON
from transformers import AutoTokenizer, AutoModelForCausalLM

# Pre-configured model for causal (auto-regressive) language generation, e.g. "gpt2"
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Model head for next-word prediction
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "This is a simple example for text generation,"
inputs = tokenizer.encode(prompt, return_tensors="pt")
# generate() takes the prompt and appends tokens until the whole sequence
# (prompt included) reaches max_length tokens
output = model.generate(inputs, max_length=26)

# Raw outputs are decoded before printing: the prompt extended with the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("Generated Text:")
print(generated_text)
Generated Text:
This is a simple example for text generation, but it's also
a good way to get a feel for how the text is generated.
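The loop behind generate() can be pictured with a toy sketch. NEXT_TOKEN and greedy_generate below are hypothetical stand-ins for a real model and are not part of the transformers API; the sketch only illustrates that tokens are appended one at a time and that max_length bounds the whole sequence, prompt included.

```python
# Toy stand-in for a language model: maps the last token to the next one.
# NEXT_TOKEN and greedy_generate are hypothetical, for illustration only.
NEXT_TOKEN = {
    "the": "cat",
    "cat": "is",
    "is": "sleeping",
    "sleeping": "on",
    "on": "the",
}

def greedy_generate(prompt_tokens, max_length):
    """Append one token at a time until the sequence holds max_length tokens."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        next_token = NEXT_TOKEN.get(tokens[-1])
        if next_token is None:  # no continuation known: stop early
            break
        tokens.append(next_token)
    return tokens

generated = greedy_generate(["the", "cat"], max_length=5)
print(" ".join(generated))
```

With a two-token prompt and max_length=5, only three new tokens are produced.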
Example 1:
Title: himc90
Paragraph: In an interview right before receiving the 2013
Nobel prize in physics, Peter Higgs stated that he (...)
Example 2 (...)
Input sequences: a segment of the text, e.g. "the cat is" from "the cat is sleeping on the
mat"
Target sequences: tokens shifted one position to the left, e.g. "cat is sleeping"
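A minimal sketch of how such input/target pairs can be built. Splitting on whitespace is an assumption made for readability (real models work on subword tokens), and make_training_pair is a hypothetical helper name.

```python
def make_training_pair(tokens):
    """Return (inputs, targets) where targets are the inputs shifted by one."""
    inputs = tokens[:-1]   # every token except the last
    targets = tokens[1:]   # every token except the first
    return inputs, targets

segment = "the cat is sleeping".split()
inputs, targets = make_training_pair(segment)
for i in range(len(inputs)):
    # The model sees inputs[: i + 1] and is trained to predict targets[i]
    print(inputs[: i + 1], "->", targets[i])
```

At every position the target is simply the next token of the original text, so no manual labels are needed.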
1 Due to space limitations, only the first 50% of the original input text is shown in the slide
Predicted (Welsh):
2 Rheloiad o dan adran 1:aryary " means "i
Open Generative QA: The LLM generates the answer based on a context
Closed Generative QA: The LLM fully generates the answer, no context provided
print("Question:", mlqa["test"]["question"][53])
print("Answer:", mlqa["test"]["answers"][53])
print("Context:", mlqa["test"]["context"][53])
Answer: {'answer_start': [271], 'text': ['a fermented, usually spicy vegetable dish']}
Context: Korean cuisine is largely based on rice, noodles, tofu, vegetables, fish and meats. Traditional Korean
meals are noted for the number of side dishes, banchan, which accompany steam-cooked short-grain rice. Every
meal is accompanied by numerous banchan. Kimchi, a fermented, usually spicy vegetable dish is commonly served
at every meal and is one of the best known Korean dishes. Korean cuisine usually involves heavy seasoning with
sesame oil, doenjang, a type of fermented soybean paste, soy sauce, salt, garlic, ginger, and gochujang, a hot
pepper paste. Other well-known dishes are Bulgogi, grilled marinated beef, Gimbap, and Tteokbokki , a spicy
snack consisting of rice cake seasoned with gochujang or a spicy chili paste.
Prediction result: answer span given by: [start position, end position]
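One common way to obtain such a span, sketched under assumptions: extractive QA models typically produce one start score and one end score per token, and the answer is read off the best-scoring positions. The tokens and scores below are made up for illustration (not model output), and best_span is a hypothetical helper.

```python
def best_span(start_scores, end_scores):
    """Pick the highest-scoring start, then the best end at or after it."""
    start = max(range(len(start_scores)), key=lambda i: start_scores[i])
    end = max(range(start, len(end_scores)), key=lambda i: end_scores[i])
    return start, end

tokens = ["kimchi", ",", "a", "fermented", "vegetable", "dish", ",", "is", "served"]
# Hypothetical per-token scores, made up for illustration
start_scores = [0.1, 0.0, 2.5, 1.0, 0.2, 0.1, 0.0, 0.0, 0.0]
end_scores = [0.0, 0.0, 0.1, 0.3, 0.9, 3.1, 0.2, 0.0, 0.1]

start, end = best_span(start_scores, end_scores)
answer = " ".join(tokens[start : end + 1])
print([start, end], answer)
```

Restricting the end position to indices at or after the start is a design choice that rules out empty or reversed spans.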
from transformers import AutoTokenizer

model_ckp = "deepset/minilm-uncased-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_ckp)

question = "How is the taste of wasabi?"
context = """Japanese cuisine captures the essence of \
a harmonious fusion between fresh ingredients and \
traditional culinary techniques, all heightened \
by the zesty taste of the aromatic green condiment \
known as wasabi."""
# Question and context are encoded together as one sequence pair
inputs = tokenizer(question, context,
                   return_tensors="pt")

Tensors in inputs:

Tensor            Description
input_ids         Integer
attention_mask    Boolean
token_type_ids    0: Question, 1: Context
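A rough sketch of how a question/context pair is laid out, assuming the BERT-style convention of [CLS] and [SEP] markers. Word-level tokens are a simplification of real subword tokenization, and encode_pair is a hypothetical helper, not a transformers function.

```python
def encode_pair(question_tokens, context_tokens):
    """Lay out a question/context pair the way BERT-style tokenizers do."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the question and its [SEP]; segment 1 the rest
    token_type_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    # No padding here, so every position is a real token
    attention_mask = [1] * len(tokens)
    return tokens, token_type_ids, attention_mask

tokens, token_type_ids, attention_mask = encode_pair(
    ["how", "is", "wasabi", "?"], ["wasabi", "is", "zesty"])
for tok, seg in zip(tokens, token_type_ids):
    print(f"{tok:8} {seg}")
```

The segment ids are what let the model tell the question apart from the context within a single input sequence.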
Full fine-tuning: The entire model weights are updated; more computationally expensive
Partial fine-tuning: Lower (body) layers fixed; only task-specific layers (head) are updated
One-shot, few-shot learning: adapt a model to a new task with one or a few examples only
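Partial fine-tuning can be illustrated with a toy model in plain Python. Everything here is an assumption for illustration: the two-weight "model", the hand-written gradients, and the trainable flags. In PyTorch the same effect is usually achieved by setting requires_grad = False on the body's parameters.

```python
# Toy model: prediction = head * (body * x). Both weights are plain floats.
params = {"body": 0.5, "head": 1.0}
trainable = {"body": False, "head": True}   # body frozen, head updated

def loss_and_grads(params, x, target):
    """Squared error and its gradient with respect to each weight."""
    hidden = params["body"] * x
    pred = params["head"] * hidden
    error = pred - target
    grads = {
        "body": 2 * error * params["head"] * x,
        "head": 2 * error * hidden,
    }
    return error ** 2, grads

x, target, lr = 2.0, 3.0, 0.1
initial_loss, _ = loss_and_grads(params, x, target)
for _ in range(20):
    _, grads = loss_and_grads(params, x, target)
    for name in params:
        if trainable[name]:  # skip frozen parameters
            params[name] -= lr * grads[name]
final_loss, _ = loss_and_grads(params, x, target)
print(params, initial_loss, final_loss)
```

Flipping trainable["body"] to True turns the same loop into full fine-tuning, at the cost of updating every weight.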
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2)

# Tokenize the dataset used for fine-tuning (IMDB reviews dataset)
def tokenize_function(examples):
    # truncation=True truncates input sequences beyond the model's max_length
    return tokenizer(
        examples["text"], padding="max_length", truncation=True)
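What padding="max_length" and truncation=True do to a single sequence can be sketched without a tokenizer. The token ids are arbitrary, pad_or_truncate is a hypothetical helper, and pad_id=0 is an assumption; real tokenizers also keep the closing special token when truncating, which this sketch omits.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                 # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad           # padding on the right
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=5)
long_ids, long_mask = pad_or_truncate([11, 12, 13, 14, 15, 16, 17], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Both results have the same length, which is what allows reviews of different sizes to be stacked into one batch.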
from transformers import TrainingArguments

# Training settings: output directory, batch size per GPU, epochs, etc.
training_args = TrainingArguments(
    output_dir="./smaller_bert_finetuned",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    evaluation_strategy="steps",  # named eval_strategy in recent transformers releases
    eval_steps=500,
    save_steps=500,
    logging_dir="./logs",
)
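A quick back-of-the-envelope check of what these settings imply. The figures rest on assumptions not stated in the slide: the full 25,000-example IMDB training split, a single GPU, and no gradient accumulation.

```python
import math

num_train_examples = 25_000   # assumption: the full IMDB training split
batch_size = 8                # per_device_train_batch_size
num_devices = 1               # assumption: a single GPU
num_epochs = 3                # num_train_epochs
eval_steps = 500              # eval_steps (save_steps is the same here)

steps_per_epoch = math.ceil(num_train_examples / (batch_size * num_devices))
total_steps = steps_per_epoch * num_epochs
num_evaluations = total_steps // eval_steps

print(steps_per_epoch, total_steps, num_evaluations)
```

Under these assumptions, evaluation and checkpointing each run 18 times over the course of training.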
from transformers import Trainer

# Trainer class: manages the training and validation loop.
# Specify the model, training arguments, and training and validation sets.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)

# trainer.train(): execute the training loop
trainer.train()
Predicted Label: 0
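A predicted label like this is usually obtained by taking the highest-scoring class from the classifier's raw outputs (logits). The sketch below uses hypothetical logits rather than real model output; which sentiment label 0 stands for depends on the dataset's label mapping, which is not stated here.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]   # for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.8, -0.7]   # hypothetical scores for labels 0 and 1
probs = softmax(logits)
predicted_label = max(range(len(probs)), key=lambda i: probs[i])
print("Predicted Label:", predicted_label)
```

Softmax is only needed when probabilities are wanted; the argmax of the logits gives the same label.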