9 NLU Huggingface
A short introduction
CSC401/2511 – Natural Language Computing – Winter 2023
University of Toronto
Logistics
• Today’s lecture will only last 35 minutes
• 10am session: The last 15 minutes is a survey.
• 11am session: The first 15 minutes is a survey.
Assignment 3 update 1: cuda
• In the test() function in classifier.py, change args.use_cuda to args.use_gpu.
Assignment 3 update 2: package
• Currently, the package importlib-metadata is not present on the wolf server – I have asked instructional support to install it.
Recap: Sentiment Analysis
• Is this IMDB movie review a positive one?
This is not a movie for fans of the usual eerie Lynch stuff. Rather, it's for those who either appreciate a good story, or have grown tired of the run-of-the-mill stuff with overt sentimentalism […] The story unfolds flawlessly, and we are taken along a journey that, I believe, most of us will come to recognize at some time. A compassionate, existentialist journey where we make amends for our past when approaching our inevitable demise. Acting is without faults, cinematography likewise (occasionally quite brilliant!), and the dialogue leaves out just enough for the viewer to grasp the details of the story. A warm movie. Not excessively sentimental.
Recap: DNN-based NLU
Downstream task data → Pre-trained Language Model → Prediction results
Overview of the pipeline
• An overview of the pipeline that you will use for A3:
Text → Tokenizer → (input_ids, attention_masks) → Pre-trained BERT → Classifier → System outputs
The pre-trained BERT and the classifier together form BertForSequenceClassification.
Tokenizer
NLP systems need a tokenizer to encode texts into numbers, and to decode numbers back into texts.
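The encode/decode round trip can be sketched with a toy dictionary-based tokenizer. The tiny vocabulary here is made up for illustration; real tokenizers ship with vocabularies of tens of thousands of entries.

```python
# A minimal sketch of what a tokenizer does: map tokens to integer IDs
# (encode) and IDs back to tokens (decode), using a toy vocabulary.
vocab = {"this": 0, "is": 1, "an": 2, "example": 3, "sentence": 4, ".": 5}
id_to_token = {i: t for t, i in vocab.items()}

def encode(tokens):
    """Look up each token's integer ID in the vocabulary."""
    return [vocab[t] for t in tokens]

def decode(ids):
    """Map integer IDs back to their tokens."""
    return [id_to_token[i] for i in ids]

ids = encode(["this", "is", "an", "example", "sentence", "."])
print(ids)          # [0, 1, 2, 3, 4, 5]
print(decode(ids))  # ['this', 'is', 'an', 'example', 'sentence', '.']
```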
Tokenization: word splitting
• Method 1: .split(), then look up each word's index in a dictionary.
• Words sharing the same lemma are treated as different words.
E.g., "convert" vs. "converts"
• Punctuation is not handled well.
E.g., "The end of a sentence. The start of the other"
Tokenize: "This is an example sentence." → ["This", "is", "an", "example", "sentence."]
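The punctuation problem is easy to see by running .split() directly:

```python
# Plain .split() leaves punctuation glued to the preceding word, so
# "sentence." and "sentence" would get two different dictionary entries.
text = "This is an example sentence."
tokens = text.split()
print(tokens)  # ['This', 'is', 'an', 'example', 'sentence.']
```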
Tokenization: better word splitting
• Method 2: Separate the words from the punctuation, then do .split(), then look up each word's index in the dictionary.
• Still, "convert" and "converts" are treated as different words.
• The vocabulary size is unnecessarily large.
• In multilingual tasks, the vocabulary is even larger…
…although many English words share the same roots.
Some examples: geography, bibliography
Tokenize: "This is an example sentence." → ["This", "is", "an", "example", "sentence", "."]
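One way to separate punctuation from words (an illustrative recipe, not necessarily the one the slide had in mind) is a regular expression that matches either a run of word characters or a single punctuation mark:

```python
import re

# Match either a run of word characters (\w+) or one character that is
# neither a word character nor whitespace ([^\w\s]), i.e. punctuation.
text = "This is an example sentence."
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)  # ['This', 'is', 'an', 'example', 'sentence', '.']
```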
Tokenization: character encoding
• Method 3: Character / Byte-level encoding
• Example: CANINE (Lecture 7)
• The vocabulary size is significantly reduced…
• …but how long are your sequences going to be?
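The trade-off is easy to quantify: character-level encoding shrinks the vocabulary (a few hundred characters, or 256 bytes) but stretches every sequence, since each character becomes one token.

```python
# Compare sequence lengths under character-level vs. word-level splitting.
text = "This is an example sentence."
char_tokens = list(text)     # one token per character
word_tokens = text.split()   # one token per whitespace-separated word
print(len(char_tokens), len(word_tokens))  # 28 5
```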
Tokenization: subword
• Method 4: subwords.
• This is the approach adopted by popular LMs, including BERT and the GPT family.
• Which words to split, and how to split them, differ from model to model.
In CSC401/2511: don't worry about that.
• Each pretrained language model comes with its own tokenizer.
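To make the idea concrete, here is a toy greedy longest-match subword tokenizer in the spirit of BERT's WordPiece. The "##" continuation marker follows WordPiece conventions, but the vocabulary is made up for illustration and is much smaller than any real one.

```python
# Toy subword vocabulary: full words, plus "##"-prefixed continuations.
VOCAB = {"convert", "##s", "##ing", "geo", "bib", "##lio", "##graphy", "[UNK]"}

def subword_tokenize(word):
    """Greedily split one word into the longest matching vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # Pieces after the first are marked as continuations with "##".
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in VOCAB:
                pieces.append(piece)
                break
            end -= 1
        if end == start:  # no piece matched at all: unknown word
            return ["[UNK]"]
        start = end
    return pieces

print(subword_tokenize("converts"))      # ['convert', '##s']
print(subword_tokenize("geography"))     # ['geo', '##graphy']
print(subword_tokenize("bibliography"))  # ['bib', '##lio', '##graphy']
```

Note how "geography" and "bibliography" now share the "##graphy" piece, which is exactly how subword vocabularies exploit shared roots.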
Loading and using the tokenizer
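The slide's code screenshot did not survive the export; the standard Huggingface loading pattern looks roughly like the sketch below (the checkpoint name "bert-base-uncased" is one common choice here for illustration — A3 may specify a different one):

```python
from transformers import AutoTokenizer

# Download (or load from cache) the tokenizer that ships with the
# pretrained checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Calling the tokenizer returns everything the model needs:
encoding = tokenizer("This is an example sentence.")
print(encoding["input_ids"])       # token IDs, with [CLS] ... [SEP] added
print(encoding["attention_mask"])  # 1 for real tokens, 0 for padding
print(tokenizer.decode(encoding["input_ids"]))
```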
Two-step encoding process
• Calling tokenizer(sentence) is equivalent to:
• tokens = tokenizer.tokenize(sentence), and then:
• tokenizer.convert_tokens_to_ids(tokens)
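The two steps can be checked against the one-call form directly; a sketch (again assuming the illustrative "bert-base-uncased" checkpoint):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sentence = "This is an example sentence."

# Step 1: split the text into subword tokens.
tokens = tokenizer.tokenize(sentence)
# Step 2: look up each token's ID in the vocabulary.
ids = tokenizer.convert_tokens_to_ids(tokens)

# tokenizer(sentence) does both steps, and additionally wraps the
# sequence in the special [CLS] and [SEP] tokens.
full = tokenizer(sentence)["input_ids"]
print(tokens)
print(full[1:-1] == ids)  # True: same IDs between [CLS] and [SEP]
```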
Overview of the pipeline
• An overview of the pipeline
Text → Tokenizer → (input_ids, attention_masks) → Pre-trained BERT → Classifier (just a linear head) → System outputs
The pre-trained BERT and the classifier together form BertForSequenceClassification.
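The whole BERT-plus-linear-head bundle can be sketched in code. To keep the sketch runnable without downloading the pretrained checkpoint, it builds a tiny randomly initialized BERT; in A3 you would instead load pretrained weights with BertForSequenceClassification.from_pretrained(...).

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# Tiny, randomly initialized configuration (illustration only; real
# BERT-base uses hidden_size=768, 12 layers, 12 heads).
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=2,  # e.g., positive vs. negative sentiment
)
model = BertForSequenceClassification(config)

input_ids = torch.tensor([[2, 45, 67, 3]])       # one short dummy sequence
attention_mask = torch.ones_like(input_ids)      # all positions are real tokens
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
print(outputs.logits.shape)  # torch.Size([1, 2]): one logit per class
```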
BertModel
• BertModel is the encoder part of the Transformer (also ref: Lecture 7).
• BERT doesn't have the decoder part of the Transformer – that part is what GPT is built from.
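Since BertModel is just the encoder stack, it maps token IDs to one contextual vector per token, with no task-specific head on top. A sketch using the same tiny random configuration as above (so it runs without downloading pretrained weights):

```python
import torch
from transformers import BertConfig, BertModel

# Tiny, randomly initialized encoder (illustration only).
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)
model = BertModel(config)

input_ids = torch.tensor([[2, 45, 67, 3]])
outputs = model(input_ids=input_ids)
# One hidden vector per input token: (batch, seq_len, hidden_size).
print(outputs.last_hidden_state.shape)  # torch.Size([1, 4, 32])
```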
Lecture review questions
By the end of this lecture, you should be able to:
• Describe what tokenization is.
• Use Huggingface's tokenizer.
• Describe a BERT-based sequence classification system.
• Start working on Q3 and Q4 in Assignment 3.
• Friday's tutorial will also be helpful for Q3.
• Q2: Not yet – speech recognition is covered next week.