9 NLU Huggingface
A short introduction
CSC401/2511 – Natural Language Computing – Winter 2023
University of Toronto
Logistics
• Today’s lecture will only last 35 minutes
• 10am session: The last 15 minutes is a survey.
• 11am session: The first 15 minutes is a survey.
Assignment 3 update 1: cuda
• In the test() function in classifier.py, change args.use_cuda to args.use_gpu.
Assignment 3 update 2: package
• Currently, the package importlib-metadata is not present on the wolf server – I have asked instructional support to install it.
Recap: Sentiment Analysis
• Is this IMDB movie review a positive one?
This is not a movie for fans of the usual eerie Lynch stuff. Rather, it's for those who either appreciate a good story, or have grown tired of the run-of-the-mill stuff with overt sentimentalism […] The story unfolds flawlessly, and we are taken along a journey that, I believe, most of us will come to recognize at some time. A compassionate, existentialist journey where we make amends for our past when approaching our inevitable demise. Acting is without faults, cinematography likewise (occasionally quite brilliant!), and the dialogue leaves out just enough for the viewer to grasp the details of the story. A warm movie. Not excessively sentimental.
Recap: DNN-based NLU
Downstream task data → Pre-trained Language Model → Prediction results
Overview of the pipeline
• An overview of the pipeline that you will use for A3:
Text → Tokenizer → (input_ids, attention_masks) → Pre-trained BERT → Classifier → System outputs
The pre-trained BERT and the classifier together form BertForSequenceClassification.
Tokenizer
NLP systems need a tokenizer to encode texts into numbers, and to decode numbers back into texts.
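The encode/decode round trip can be sketched with a toy dictionary-based tokenizer. The tiny vocabulary here is made up for illustration; real tokenizers ship with vocabularies of tens of thousands of entries.

```python
# A minimal sketch of what a tokenizer does: map tokens to integer IDs
# (encode) and IDs back to tokens (decode), using a toy vocabulary.
vocab = {"this": 0, "is": 1, "an": 2, "example": 3, "sentence": 4, ".": 5}
id_to_token = {i: t for t, i in vocab.items()}

def encode(tokens):
    """Look up each token's integer ID in the vocabulary."""
    return [vocab[t] for t in tokens]

def decode(ids):
    """Map integer IDs back to their tokens."""
    return [id_to_token[i] for i in ids]

ids = encode(["this", "is", "an", "example", "sentence", "."])
print(ids)          # [0, 1, 2, 3, 4, 5]
print(decode(ids))  # ['this', 'is', 'an', 'example', 'sentence', '.']
```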
Tokenization: word splitting
• Method 1: .split(), then look up each word's index in a dictionary.
• Words sharing the same lemma are treated as different words.
E.g., "convert" vs. "converts"
• Punctuation is not handled well.
E.g., "The end of a sentence. The start of the other"
Tokenize: "This is an example sentence." → ["This", "is", "an", "example", "sentence."]
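The punctuation problem is easy to see by running .split() directly:

```python
# Plain .split() leaves punctuation glued to the preceding word, so
# "sentence." and "sentence" would get two different dictionary entries.
text = "This is an example sentence."
tokens = text.split()
print(tokens)  # ['This', 'is', 'an', 'example', 'sentence.']
```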
Tokenization: better word splitting
• Method 2: Separate the words from the punctuation, then do .split(), then look up each word's index in the dictionary.
• Still, "convert" and "converts" are treated as different words.
• The vocabulary size is unnecessarily large.
• In multilingual tasks, the vocabulary is even larger…
…although many English words share the same roots.
Some examples: geography, bibliography
Tokenize: "This is an example sentence." → ["This", "is", "an", "example", "sentence", "."]
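One way to separate punctuation from words (an illustrative recipe, not necessarily the one the slide had in mind) is a regular expression that matches either a run of word characters or a single punctuation mark:

```python
import re

# Match either a run of word characters (\w+) or one character that is
# neither a word character nor whitespace ([^\w\s]), i.e. punctuation.
text = "This is an example sentence."
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)  # ['This', 'is', 'an', 'example', 'sentence', '.']
```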
Tokenization: character encoding
• Method 3: Character / Byte-level encoding
• Example: CANINE (Lecture 7)
• The vocabulary size is significantly reduced…
• …but how long are your sequences going to be?
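The trade-off is easy to quantify: character-level encoding shrinks the vocabulary (a few hundred characters, or 256 bytes) but stretches every sequence, since each character becomes one token.

```python
# Compare sequence lengths under character-level vs. word-level splitting.
text = "This is an example sentence."
char_tokens = list(text)     # one token per character
word_tokens = text.split()   # one token per whitespace-separated word
print(len(char_tokens), len(word_tokens))  # 28 5
```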
Tokenization: subword
• Method 4: subwords.
• This is the approach adopted by popular LMs, including BERT and the GPT family.
• Which words to split, and how to split them, differ from model to model.
In CSC401/2511: don't worry about that.
• Each pretrained language model comes with its own tokenizer.
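To make the idea concrete, here is a toy greedy longest-match subword tokenizer in the spirit of BERT's WordPiece. The "##" continuation marker follows WordPiece conventions, but the vocabulary is made up for illustration and is much smaller than any real one.

```python
# Toy subword vocabulary: full words, plus "##"-prefixed continuations.
VOCAB = {"convert", "##s", "##ing", "geo", "bib", "##lio", "##graphy", "[UNK]"}

def subword_tokenize(word):
    """Greedily split one word into the longest matching vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # Pieces after the first are marked as continuations with "##".
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in VOCAB:
                pieces.append(piece)
                break
            end -= 1
        if end == start:  # no piece matched at all: unknown word
            return ["[UNK]"]
        start = end
    return pieces

print(subword_tokenize("converts"))      # ['convert', '##s']
print(subword_tokenize("geography"))     # ['geo', '##graphy']
print(subword_tokenize("bibliography"))  # ['bib', '##lio', '##graphy']
```

Note how "geography" and "bibliography" now share the "##graphy" piece, which is exactly how subword vocabularies exploit shared roots.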
Loading and using the tokenizer
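The slide's code screenshot did not survive the export; the standard Huggingface loading pattern looks roughly like the sketch below (the checkpoint name "bert-base-uncased" is one common choice here for illustration — A3 may specify a different one):

```python
from transformers import AutoTokenizer

# Download (or load from cache) the tokenizer that ships with the
# pretrained checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Calling the tokenizer returns everything the model needs:
encoding = tokenizer("This is an example sentence.")
print(encoding["input_ids"])       # token IDs, with [CLS] ... [SEP] added
print(encoding["attention_mask"])  # 1 for real tokens, 0 for padding
print(tokenizer.decode(encoding["input_ids"]))
```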
Two-step encoding process
• Calling tokenizer(sentence) is equivalent to:
• tokens = tokenizer.tokenize(sentence), and then:
• tokenizer.convert_tokens_to_ids(tokens)
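The two steps can be checked against the one-call form directly; a sketch (again assuming the illustrative "bert-base-uncased" checkpoint):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sentence = "This is an example sentence."

# Step 1: split the text into subword tokens.
tokens = tokenizer.tokenize(sentence)
# Step 2: look up each token's ID in the vocabulary.
ids = tokenizer.convert_tokens_to_ids(tokens)

# tokenizer(sentence) does both steps, and additionally wraps the
# sequence in the special [CLS] and [SEP] tokens.
full = tokenizer(sentence)["input_ids"]
print(tokens)
print(full[1:-1] == ids)  # True: same IDs between [CLS] and [SEP]
```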
Overview of the pipeline
• An overview of the pipeline
Text → Tokenizer → (input_ids, attention_masks) → Pre-trained BERT → Classifier (just a linear head) → System outputs
The pre-trained BERT and the classifier together form BertForSequenceClassification.
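The whole BERT-plus-linear-head bundle can be sketched in code. To keep the sketch runnable without downloading the pretrained checkpoint, it builds a tiny randomly initialized BERT; in A3 you would instead load pretrained weights with BertForSequenceClassification.from_pretrained(...).

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# Tiny, randomly initialized configuration (illustration only; real
# BERT-base uses hidden_size=768, 12 layers, 12 heads).
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=2,  # e.g., positive vs. negative sentiment
)
model = BertForSequenceClassification(config)

input_ids = torch.tensor([[2, 45, 67, 3]])       # one short dummy sequence
attention_mask = torch.ones_like(input_ids)      # all positions are real tokens
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
print(outputs.logits.shape)  # torch.Size([1, 2]): one logit per class
```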
BertModel
• BertModel is the encoder part of the Transformer (also ref: Lecture 7).
• BERT doesn't have the decoder part of the Transformer – that part is what GPT is built from.
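Since BertModel is just the encoder stack, it maps token IDs to one contextual vector per token, with no task-specific head on top. A sketch using the same tiny random configuration as above (so it runs without downloading pretrained weights):

```python
import torch
from transformers import BertConfig, BertModel

# Tiny, randomly initialized encoder (illustration only).
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)
model = BertModel(config)

input_ids = torch.tensor([[2, 45, 67, 3]])
outputs = model(input_ids=input_ids)
# One hidden vector per input token: (batch, seq_len, hidden_size).
print(outputs.last_hidden_state.shape)  # torch.Size([1, 4, 32])
```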
Lecture review questions
By the end of this lecture, you should be able to:
• Describe what tokenization is.
• Use Huggingface's tokenizer.
• Describe a BERT-based sequence classification system.
• Start working on Q3 and Q4 in Assignment 3.
• Friday's tutorial will also be helpful for Q3.
• Q2: Not yet – speech recognition is covered next week.