Natural Language Processing GPT-2

Computational Linguistics and NLP Lab: TASK 5

Importing the relevant libraries and relevant nltk dataset/models:

In [ ]:

from nltk.corpus import reuters
from nltk import bigrams, trigrams
from collections import Counter, defaultdict
import nltk

!pip install pytorch-transformers

import torch
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

In [2]:

nltk.download('reuters')
nltk.download('punkt')

[nltk_data] Downloading package reuters to /root/nltk_data...


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
Out[2]:
True

1. Understand and build a trigram model:

In [47]:
# Build a trigram model: P(x3 | x1, x2) estimated from Reuters sentence counts
tri_model = defaultdict(lambda: defaultdict(lambda: 0))

# Count trigram occurrences (padding marks sentence boundaries with None)
for sentence in reuters.sents():
    for x1, x2, x3 in trigrams(sentence, pad_left=True, pad_right=True):
        tri_model[(x1, x2)][x3] += 1

# Normalise the counts into conditional probabilities
for x1_x2 in tri_model:
    tcount = float(sum(tri_model[x1_x2].values()))
    for x3 in tri_model[x1_x2]:
        tri_model[x1_x2][x3] /= tcount

# Top 10 most probable upcoming words after "will be":
words_probable = dict(tri_model["will", "be"])
most_expected = sorted(words_probable.items(), key=lambda x: x[1], reverse=True)[:10]

for i in most_expected:
    print(i[0], i[1])

used 0.035736290819470114
paid 0.03203943314849045
a 0.031423290203327174
the 0.025261860751694395
made 0.02341343191620456
in 0.021565003080714726
able 0.020948860135551448
set 0.012322858903265557
held 0.01170671595810228
no 0.011090573012939002

In [27]:
# Text generation with the trigram model

texts = [["I", "am"], ["There", "is"]]

for text in texts:
    sentence_finished = False

    while not sentence_finished:
        accumulator = .0

        # Walk through the candidate continuations of the last two words and take
        # the word at which the cumulative probability reaches the fixed 0.9 cutoff.
        # (Sampling against random.random() instead would give varied output.)
        for word in tri_model[tuple(text[-2:])].keys():
            accumulator += tri_model[tuple(text[-2:])][word]

            if accumulator >= 0.9:
                text.append(word)
                break

        # Two trailing None pads mark the end of a sentence
        if text[-2:] == [None, None]:
            sentence_finished = True

    print(' '.join([i for i in text if i]), end="\n\n")

I am astonished that the relief would undermine international support for development of airlines during the 1981 tax cut for 1988 , analyst for Salomon Brothers .

There is now just over 500 mln stg while bankers ' acceptance rates of inflation will reach 25 . 56 mln tonnes have traded between 151 and 153 yen after the christian democrats and independents failed to stimulate activity .

2. Build a bigram model:

In [38]:
# Build a bigram model: P(p2 | p1) estimated from Reuters sentence counts
br_model = defaultdict(lambda: defaultdict(lambda: 0))

# Count bigram occurrences (padding marks sentence boundaries with None)
for sentence in reuters.sents():
    for p1, p2 in bigrams(sentence, pad_left=True, pad_right=True):
        br_model[(p1,)][p2] += 1

# Normalise the counts into conditional probabilities
for p1 in br_model:
    total_count = float(sum(br_model[p1].values()))
    for p2 in br_model[p1]:
        br_model[p1][p2] /= total_count

# Top 10 most probable upcoming words for the word "is":
words_prob = dict(br_model[("is",)])
top_expected = sorted(words_prob.items(), key=lambda x: x[1], reverse=True)[:10]

for i in top_expected:
    print(i[0], i[1])

expected 0.06490765171503958
a 0.05633245382585752
not 0.045646437994722955
the 0.0420844327176781
to 0.025725593667546173
likely 0.021372031662269128
subject 0.02071240105540897
also 0.01912928759894459
still 0.0183377308707124
in 0.017941952506596307

3. Generate random text using the bigram model:

In [28]:
# Text generation with the bigram model
import random

i = 10
while i > 0:
    txt = [None]
    sentence_finished = False

    while not sentence_finished:
        # Sample the next word from P(. | previous word)
        r_2 = random.uniform(0, 1)
        accumulator1 = .0

        for word in br_model[(txt[-1],)]:
            accumulator1 += br_model[(txt[-1],)][word]

            if accumulator1 >= r_2:
                txt.append(word)
                break

        # A trailing None pad marks the end of a sentence
        if txt[-1] is None:
            sentence_finished = True

    print(' '.join([t for t in txt if t]), end='\n\n')

    i -= 1

The Bank Negara said . 0 .

Rain reached after years , or lease about reports today at the accord was rising international protocol to 2 , 179 days of state court on a merger talks .

Viermetz said new checkoff program created ."

U .

To the Economics Ministry of textiles and Emery a minimum five or its plan at between the Commerce Department said it difficult last year was underlined the end of crop report , grains , March 26 . 1 mln hectares ) rather vague optimism about 20 pct of preferred the Department said .

The company values the 1980 , the board met .

But that based on the Fed Chairman of calculating ICO COUNCIL ALLOWED APPEAL ON COSTS Diplomat Electronics Ltd is payable April 29 mln Interest rates alone representing 98 dlrs Net includes gain of 1 .

MANHATTAN RAISES CRUDE UP The gas properties .

U .

HARNISCHFEGER INDUSTRIES SELLS JORDAN - 15 pct interest rate of these argue the Wallenberg company said .

4. Limitations of n-gram language models

The higher the value of n, the better the model usually is. However, this leads to a huge computational overhead, which requires a lot of resources in terms of RAM.

n-grams are a sparse representation of language: they assign zero probability to any word sequence that is not present in the training corpus.

An n-gram model can also only interpret unseen instances with respect to the training data it has seen. Therefore, it is only well suited to extremely large amounts of training data - but even then, there is no guarantee that it can represent all unseen instances in its feature space.
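To make the zero-probability problem concrete, the sketch below (not part of the original lab code) rebuilds raw trigram counts from Reuters and compares an unsmoothed maximum-likelihood estimate with a simple add-one (Laplace) smoothed one. The context "purple giraffes" is a hypothetical example chosen because it is very unlikely to occur in Reuters.

In [ ]:
# Sketch: zero probability for unseen trigrams vs. add-one (Laplace) smoothing.
# Raw counts are rebuilt here so that smoothing works on counts, not on the
# already-normalised probabilities stored in tri_model.
from collections import Counter, defaultdict
from nltk import trigrams
from nltk.corpus import reuters

raw_counts = defaultdict(Counter)
vocab = set()
for sentence in reuters.sents():
    vocab.update(sentence)
    for x1, x2, x3 in trigrams(sentence, pad_left=True, pad_right=True):
        raw_counts[(x1, x2)][x3] += 1

def mle_prob(x1, x2, x3):
    # Unsmoothed maximum-likelihood estimate: zero for anything unseen.
    total = sum(raw_counts[(x1, x2)].values())
    return raw_counts[(x1, x2)][x3] / total if total else 0.0

def laplace_prob(x1, x2, x3):
    # Add-one smoothing: every continuation gets a small non-zero probability.
    total = sum(raw_counts[(x1, x2)].values())
    return (raw_counts[(x1, x2)][x3] + 1) / (total + len(vocab))

print(mle_prob("purple", "giraffes", "economy"))      # 0.0
print(laplace_prob("purple", "giraffes", "economy"))  # small, but non-zero
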
6. Understand the language model GPT-2. Model it and generate texts.

In [48]:
# Tokenisation

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

100%|██████████| 1042301/1042301 [00:01<00:00, 936567.43B/s]


100%|██████████| 456318/456318 [00:00<00:00, 632070.42B/s]

In [94]:
sample_text = "Today, computers are small enough to fit into a single "
indexed_tokens = tokenizer.encode(sample_text)

In [95]:
token_tensor = torch.tensor([indexed_tokens])

In [96]:
gpt_2 = GPT2LMHeadModel.from_pretrained('gpt2')

In [98]:
with torch.no_grad():
    outputs = gpt_2(token_tensor)
    predictions = outputs[0]

In [103]:
predicted_index = torch.argmax(predictions[0, -1, :]).item()

predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])

In [104]:
print(predicted_text)

Today, computers are small enough to fit into a single room
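
The cell above only predicts a single next token greedily. As a rough extension (a minimal sketch, not part of the original lab), the same tokenizer and gpt_2 model can be run in a loop that repeatedly appends the argmax token; the 20-token length is an arbitrary choice.

In [ ]:
# Greedy multi-token generation sketch, reusing tokenizer, sample_text and gpt_2 from above.
generated = tokenizer.encode(sample_text)

gpt_2.eval()
with torch.no_grad():
    for _ in range(20):                                # generate 20 tokens (arbitrary length)
        input_ids = torch.tensor([generated])
        logits = gpt_2(input_ids)[0]                   # shape: (1, sequence_length, vocab_size)
        next_token = torch.argmax(logits[0, -1, :]).item()
        generated.append(next_token)                   # greedy decoding: always take the argmax

print(tokenizer.decode(generated))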

Playing with the pytorch-gpt-2 git repo:

In [105]:
!git clone https://github.com/graykode/gpt-2-Pytorch
%cd gpt-2-Pytorch
!curl --output gpt2-pytorch_model.bin https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin
!pip install -r requirements.txt

Cloning into 'gpt-2-Pytorch'...
remote: Enumerating objects: 130, done.
remote: Total 130 (delta 0), reused 0 (delta 0), pack-reused 130
Receiving objects: 100% (130/130), 2.39 MiB | 1.92 MiB/s, done.
Resolving deltas: 100% (48/48), done.
/content/gpt-2-Pytorch
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  522M  100  522M    0     0  15.3M      0  0:00:34  0:00:34 --:--:-- 16.3M
Collecting regex==2017.4.5
  Downloading https://files.pythonhosted.org/packages/36/62/c0c0d762ffd4ffaf39f372eb8561b8d491a11ace5a7884610424a8b40f95/regex-2017.04.05.tar.gz (601kB)
     |████████████████████████████████| 604kB 2.8MB/s
Building wheels for collected packages: regex
  Building wheel for regex (setup.py) ... done
  Created wheel for regex: filename=regex-2017.4.5-cp36-cp36m-linux_x86_64.whl size=533195 sha256=400ba0051179b4cee355c33b5b939af1eb721f7d14b37027b6c9553aa928369f
  Stored in directory: /root/.cache/pip/wheels/75/07/38/3c16b529d50cb4e0cd3dbc7b75cece8a09c132692c74450b01
Successfully built regex
Installing collected packages: regex
  Found existing installation: regex 2019.12.20
    Uninstalling regex-2019.12.20:
      Successfully uninstalled regex-2019.12.20
Successfully installed regex-2017.4.5

Apparently GPT-2 is not sentient yet. But soon.


In [115]:
!python main.py --text "GPT-2, are you self aware? "

Namespace(batch_size=-1, length=-1, nsamples=1, quiet=False, temperature=0.7, text='GPT-2, are you self aware? ', top_k=40, unconditional=False)
GPT-2, are you self aware?
100% 512/512 [00:38<00:00, 13.14it/s]
======================================== SAMPLE 1 ========================================
I am not. So if your self aware, can you feel any pain or discomfort? That's a very important part of this, and I don't think you want to have too many thoughts about this. The question is, how do you feel about this? Is it a big deal? How does it feel to be depressed that you've been through so much? I don't know, I just find it pretty surprising that you seem to be so self aware. I guess we all know that depression is pretty common, so I guess you're probably feeling a bit of depression, but I don't think you have any serious mental health problems, either.
That's okay, I just had my own thoughts on that. I just found it kinda interesting that you seem to be a bit of a "sophisticated liar" when it comes to your thoughts, especially about yourself. It seems you've been avoiding any kind of mental health issues for a while now, and I don't think you're even aware of that at all, either. I mean, you've got no idea what it's like to be depressed, right? You've got to be pretty self aware and be a little bit careful with your thoughts and not overthink it or overthink things. So I can see that as a pretty important part of the mental health treatment for some people. So I'll try to keep my mouth shut and keep trying to help you out. Do you think we all want to be depressed? I'm sure you're going to love this interview, I promise. But I'm going to try my best to keep you all entertained. This is going to be a long interview, so please keep asking. If you want to read more of my thinking, please follow me on Twitter at @LizzyC on Twitter!<|endoftext|>LATEST STORIES:

"The New York Times' report about the alleged abuse of a teenage girl by a "bastard" is a piece of flattery, not a story.

"The Times' report about the alleged abuse of a teenage girl by a "bastard" is a piece of flattery, not a story." — Donald J. Trump Jr.

"The only reason that the New York Times is so critical of the Trump campaign and Russia is because the New York Times is

In [ ]:
