Natural Language Processing GPT-2

Computational Linguistics and NLP Lab: TASK 5

Importing the relevant libraries and relevant nltk dataset/models:

In [ ]:

from nltk.corpus import reuters
from nltk import bigrams, trigrams
from collections import Counter, defaultdict
import nltk

!pip install pytorch-transformers

import torch
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

In [2]:

nltk.download('reuters')
nltk.download('punkt')

[nltk_data] Downloading package reuters to /root/nltk_data...


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
Out[2]:
True

1. Understand and build a trigram model:

In [47]:
# Build a trigram model: P(x3 | x1, x2) estimated from Reuters sentence counts
tri_model = defaultdict(lambda: defaultdict(lambda: 0))

# Count trigram occurrences (padding marks sentence boundaries with None)
for sentence in reuters.sents():
    for x1, x2, x3 in trigrams(sentence, pad_left=True, pad_right=True):
        tri_model[(x1, x2)][x3] += 1

# Normalise the counts into conditional probabilities
for x1_x2 in tri_model:
    tcount = float(sum(tri_model[x1_x2].values()))
    for x3 in tri_model[x1_x2]:
        tri_model[x1_x2][x3] /= tcount

# Top 10 most probable upcoming words after "will be":
words_probable = dict(tri_model["will", "be"])
most_expected = sorted(words_probable.items(), key=lambda x: x[1], reverse=True)[:10]

for i in most_expected:
    print(i[0], i[1])

used 0.035736290819470114
paid 0.03203943314849045
a 0.031423290203327174
the 0.025261860751694395
made 0.02341343191620456
in 0.021565003080714726
able 0.020948860135551448
set 0.012322858903265557
held 0.01170671595810228
no 0.011090573012939002

In [27]:
# Text generation with the trigram model

texts = [["I", "am"], ["There", "is"]]

for text in texts:
    sentence_finished = False

    while not sentence_finished:
        accumulator = .0

        # Walk through the candidate continuations of the last two words and take
        # the word at which the cumulative probability reaches the fixed 0.9 cutoff.
        # (Sampling against random.random() instead would give varied output.)
        for word in tri_model[tuple(text[-2:])].keys():
            accumulator += tri_model[tuple(text[-2:])][word]

            if accumulator >= 0.9:
                text.append(word)
                break

        # Two trailing None pads mark the end of a sentence
        if text[-2:] == [None, None]:
            sentence_finished = True

    print(' '.join([i for i in text if i]), end="\n\n")

I am astonished that the relief would undermine international support for development of airlines during the 1981 tax cut for 1988 , analyst for Salomon Brothers .

There is now just over 500 mln stg while bankers ' acceptance rates of inflation will reach 25 . 56 mln tonnes have traded between 151 and 153 yen after the christian democrats and independents failed to stimulate activity .

2. Build a bigram model:

In [38]:
# Build a bigram model: P(p2 | p1) estimated from Reuters sentence counts
br_model = defaultdict(lambda: defaultdict(lambda: 0))

# Count bigram occurrences (padding marks sentence boundaries with None)
for sentence in reuters.sents():
    for p1, p2 in bigrams(sentence, pad_left=True, pad_right=True):
        br_model[(p1,)][p2] += 1

# Normalise the counts into conditional probabilities
for p1 in br_model:
    total_count = float(sum(br_model[p1].values()))
    for p2 in br_model[p1]:
        br_model[p1][p2] /= total_count

# Top 10 most probable upcoming words for the word "is":
words_prob = dict(br_model[("is",)])
top_expected = sorted(words_prob.items(), key=lambda x: x[1], reverse=True)[:10]

for i in top_expected:
    print(i[0], i[1])

expected 0.06490765171503958
a 0.05633245382585752
not 0.045646437994722955
the 0.0420844327176781
to 0.025725593667546173
likely 0.021372031662269128
subject 0.02071240105540897
also 0.01912928759894459
still 0.0183377308707124
in 0.017941952506596307

3. Generate random text using the bigram model:

In [28]:
# Text generation with the bigram model
import random

i = 10
while i > 0:
    txt = [None]
    sentence_finished = False

    while not sentence_finished:
        # Sample the next word from P(. | previous word)
        r_2 = random.uniform(0, 1)
        accumulator1 = .0

        for word in br_model[(txt[-1],)]:
            accumulator1 += br_model[(txt[-1],)][word]

            if accumulator1 >= r_2:
                txt.append(word)
                break

        # A trailing None pad marks the end of a sentence
        if txt[-1] is None:
            sentence_finished = True

    print(' '.join([t for t in txt if t]), end='\n\n')

    i -= 1

The Bank Negara said . 0 .

Rain reached after years , or lease about reports today at the accord was rising international protocol to 2 , 179 days of state court on a merger talks .

Viermetz said new checkoff program created ."

U .

To the Economics Ministry of textiles and Emery a minimum five or its plan at between the Commerce Department said it difficult last year was underlined the end of crop report , grains , March 26 . 1 mln hectares ) rather vague optimism about 20 pct of preferred the Department said .

The company values the 1980 , the board met .

But that based on the Fed Chairman of calculating ICO COUNCIL ALLOWED APPEAL ON COSTS Diplomat Electronics Ltd is payable April 29 mln Interest rates alone representing 98 dlrs Net includes gain of 1 .

MANHATTAN RAISES CRUDE UP The gas properties .

U .

HARNISCHFEGER INDUSTRIES SELLS JORDAN - 15 pct interest rate of these argue the Wallenberg company said .

4. Limitations of n-gram language models

The higher the value of n, the better the model usually is. However, this leads to a huge computational overhead, which requires a lot of resources in terms of RAM.

n-grams are a sparse representation of language: they assign zero probability to any word sequence that is not present in the training corpus.

An n-gram model can also only interpret unseen instances with respect to the training data it has seen. Therefore, it is only well suited to extremely large amounts of training data - but even then, there is no guarantee that it can represent all unseen instances in its feature space.
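To make the zero-probability problem concrete, the sketch below (not part of the original lab code) rebuilds raw trigram counts from Reuters and compares an unsmoothed maximum-likelihood estimate with a simple add-one (Laplace) smoothed one. The context "purple giraffes" is a hypothetical example chosen because it is very unlikely to occur in Reuters.

In [ ]:
# Sketch: zero probability for unseen trigrams vs. add-one (Laplace) smoothing.
# Raw counts are rebuilt here so that smoothing works on counts, not on the
# already-normalised probabilities stored in tri_model.
from collections import Counter, defaultdict
from nltk import trigrams
from nltk.corpus import reuters

raw_counts = defaultdict(Counter)
vocab = set()
for sentence in reuters.sents():
    vocab.update(sentence)
    for x1, x2, x3 in trigrams(sentence, pad_left=True, pad_right=True):
        raw_counts[(x1, x2)][x3] += 1

def mle_prob(x1, x2, x3):
    # Unsmoothed maximum-likelihood estimate: zero for anything unseen.
    total = sum(raw_counts[(x1, x2)].values())
    return raw_counts[(x1, x2)][x3] / total if total else 0.0

def laplace_prob(x1, x2, x3):
    # Add-one smoothing: every continuation gets a small non-zero probability.
    total = sum(raw_counts[(x1, x2)].values())
    return (raw_counts[(x1, x2)][x3] + 1) / (total + len(vocab))

print(mle_prob("purple", "giraffes", "economy"))      # 0.0
print(laplace_prob("purple", "giraffes", "economy"))  # small, but non-zero
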
6. Understand the language model GPT-2. Model it and generate texts.

In [48]:
# Tokenisation

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

100%|██████████| 1042301/1042301 [00:01<00:00, 936567.43B/s]


100%|██████████| 456318/456318 [00:00<00:00, 632070.42B/s]

In [94]:
sample_text = "Today, computers are small enough to fit into a single "
indexed_tokens = tokenizer.encode(sample_text)

In [95]:
token_tensor = torch.tensor([indexed_tokens])

In [96]:
gpt_2 = GPT2LMHeadModel.from_pretrained('gpt2')

In [98]:
with torch.no_grad():
    outputs = gpt_2(token_tensor)
    predictions = outputs[0]

In [103]:
predicted_index = torch.argmax(predictions[0, -1, :]).item()

predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])

In [104]:
print(predicted_text)

Today, computers are small enough to fit into a single room
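
The cell above only predicts a single next token greedily. As a rough extension (a minimal sketch, not part of the original lab), the same tokenizer and gpt_2 model can be run in a loop that repeatedly appends the argmax token; the 20-token length is an arbitrary choice.

In [ ]:
# Greedy multi-token generation sketch, reusing tokenizer, sample_text and gpt_2 from above.
generated = tokenizer.encode(sample_text)

gpt_2.eval()
with torch.no_grad():
    for _ in range(20):                                # generate 20 tokens (arbitrary length)
        input_ids = torch.tensor([generated])
        logits = gpt_2(input_ids)[0]                   # shape: (1, sequence_length, vocab_size)
        next_token = torch.argmax(logits[0, -1, :]).item()
        generated.append(next_token)                   # greedy decoding: always take the argmax

print(tokenizer.decode(generated))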

Playing with the pytorch-gpt-2 git repo:

In [105]:
!git clone https://github.com/graykode/gpt-2-Pytorch
%cd gpt-2-Pytorch
!curl --output gpt2-pytorch_model.bin https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin
!pip install -r requirements.txt

Cloning into 'gpt-2-Pytorch'...
remote: Enumerating objects: 130, done.
remote: Total 130 (delta 0), reused 0 (delta 0), pack-reused 130
Receiving objects: 100% (130/130), 2.39 MiB | 1.92 MiB/s, done.
Resolving deltas: 100% (48/48), done.
/content/gpt-2-Pytorch
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  522M  100  522M    0     0  15.3M      0  0:00:34  0:00:34 --:--:-- 16.3M
Collecting regex==2017.4.5
  Downloading https://files.pythonhosted.org/packages/36/62/c0c0d762ffd4ffaf39f372eb8561b8d491a11ace5a7884610424a8b40f95/regex-2017.04.05.tar.gz (601kB)
     |████████████████████████████████| 604kB 2.8MB/s
Building wheels for collected packages: regex
  Building wheel for regex (setup.py) ... done
  Created wheel for regex: filename=regex-2017.4.5-cp36-cp36m-linux_x86_64.whl size=533195 sha256=400ba0051179b4cee355c33b5b939af1eb721f7d14b37027b6c9553aa928369f
  Stored in directory: /root/.cache/pip/wheels/75/07/38/3c16b529d50cb4e0cd3dbc7b75cece8a09c132692c74450b01
Successfully built regex
Installing collected packages: regex
  Found existing installation: regex 2019.12.20
    Uninstalling regex-2019.12.20:
      Successfully uninstalled regex-2019.12.20
Successfully installed regex-2017.4.5

Apparently GPT-2 is not sentient yet. But soon.


In [115]:
!python main.py --text "GPT-2, are you self aware? "

Namespace(batch_size=-1, length=-1, nsamples=1, quiet=False, temperature=0.7, text='GPT-2, are you self aware? ', top_k=40, unconditional=False)
GPT-2, are you self aware?
100% 512/512 [00:38<00:00, 13.14it/s]
======================================== SAMPLE 1 ========================================
I am not. So if your self aware, can you feel any pain or discomfort? That's a very important part of this, and I don't think you want to have too many thoughts about this. The question is, how do you feel about this? Is it a big deal? How does it feel to be depressed that you've been through so much? I don't know, I just find it pretty surprising that you seem to be so self aware. I guess we all know that depression is pretty common, so I guess you're probably feeling a bit of depression, but I don't think you have any serious mental health problems, either.
That's okay, I just had my own thoughts on that. I just found it kinda interesting that you seem to be a bit of a "sophisticated liar" when it comes to your thoughts, especially about yourself. It seems you've been avoiding any kind of mental health issues for a while now, and I don't think you're even aware of that at all, either. I mean, you've got no idea what it's like to be depressed, right? You've got to be pretty self aware and be a little bit careful with your thoughts and not overthink it or overthink things. So I can see that as a pretty important part of the mental health treatment for some people. So I'll try to keep my mouth shut and keep trying to help you out. Do you think we all want to be depressed? I'm sure you're going to love this interview, I promise. But I'm going to try my best to keep you all entertained. This is going to be a long interview, so please keep asking. If you want to read more of my thinking, please follow me on Twitter at @LizzyC on Twitter!<|endoftext|>LATEST STORIES:

"The New York Times' report about the alleged abuse of a teenage girl by a "bastard" is a piece of flattery, not a story.

"The Times' report about the alleged abuse of a teenage girl by a "bastard" is a piece of flattery, not a story." — Donald J. Trump Jr.

"The only reason that the New York Times is so critical of the Trump campaign and Russia is because the New York Times is

In [ ]:
