
Submitted by: Group 2

Ashutosh - A18PT2-33
Jigyasa - 19PT1-12
Navneet - 19PT1-17
Pankhuri - 19PT1-18
Deep - A18PT2-37
Artificial Intelligence - Assignment 3

OpenAI's GPT-3 - A leap forward in Deep Learning and NLP

● GPT-3 stands for Generative Pre-trained Transformer version 3, and it is a sequence transduction model, a technique that transforms an input sequence into an output sequence.
● It is an autoregressive language model that uses deep learning to produce human-like text. It is
the third-generation language prediction model in the GPT-n series created by OpenAI, a for-
profit San Francisco-based artificial intelligence research laboratory.
● By using sequence transduction, it can predict the likelihood of an output sequence given an input sequence. This can be used, for instance, to predict which word makes the most sense given a text sequence (see the sketch after this list).
● What is new about GPT-3?
Its size. GPT-3's full version has a capacity of 175 billion machine learning parameters. GPT-3, which was introduced in May 2020 and entered beta testing in July 2020, is part of a trend in natural language processing (NLP) toward systems built on pre-trained language representations.

● Prior to the release of GPT-3, the largest language model was Microsoft's Turing NLG, introduced in February 2020, with a capacity of 17 billion parameters, less than 10 percent of GPT-3's.
● The quality of the text generated by GPT-3 is so high that it is difficult to distinguish from
that written by a human, which has both benefits and risks.
● David Chalmers, an Australian philosopher, described GPT-3 as "one of the most
interesting and important AI systems ever produced."
● GPT-n models are based on the Transformer deep learning neural network architecture.
● GPT-3 is based on a specific neural network architecture type called the Transformer that, simply put, is more effective than other architectures like RNNs (Recurrent Neural Networks).
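As a hedged illustration of next-word likelihood prediction, the sketch below queries a small autoregressive language model for its next-token distribution. GPT-3's weights are not publicly downloadable, so GPT-2 via the Hugging Face transformers package stands in (the package, the "gpt2" model name, and the prompt are assumptions for illustration only); the principle is the same.

# Minimal sketch of next-word prediction with an autoregressive LM.
# GPT-2 (via the `transformers` package) stands in for GPT-3 here.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits              # shape: (1, seq_len, vocab_size)

next_token_probs = torch.softmax(logits[0, -1], dim=-1)   # distribution over the vocabulary
top = torch.topk(next_token_probs, k=5)
for prob, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode([idx.item()]):>10}  {prob.item():.3f}")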

Language Model?

“The diversity of tasks the model is able to perform in a zero-shot setting suggests that high-capacity models trained to maximize the likelihood of a sufficiently varied text corpus begin to learn how to perform a surprising amount of tasks without the need for explicit supervision.”

GPT-3 Architecture

GPT-3 is a neural-network-powered language model and sequence transduction model based on deep learning. It uses a Transformer-based architecture similar to GPT-2, including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that it uses alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer. As a transformer-based model in the same broad family as BERT, the architecture itself is not especially novel. However, its 175 billion parameters make it the largest language model trained to date, which lets it perform specific tasks without special tuning and with only a few training examples. GPT-3 is trained on Common Crawl, Wikipedia, Books1, and Books2. GPT-3 175B has a lower data-to-parameter ("compression") ratio, roughly 499/175 ≈ 2.85, than GPT-2 1.5B, whose ratio is roughly 10/1.5 ≈ 6.67.
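A quick arithmetic check of the ratios quoted above (the token and parameter counts are the approximate figures cited in this document):

# Data-to-parameter ("compression") ratios for GPT-3 vs GPT-2,
# using the approximate figures quoted in the text above.
gpt3_tokens, gpt3_params = 499e9, 175e9   # ~499B training tokens, 175B parameters
gpt2_tokens, gpt2_params = 10e9, 1.5e9    # ~10B tokens, 1.5B parameters

print(f"GPT-3 ratio: {gpt3_tokens / gpt3_params:.2f}")   # ~2.85
print(f"GPT-2 ratio: {gpt2_tokens / gpt2_params:.2f}")   # ~6.67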

Essentially, the architecture is the same as GPT-2 (including the modified initialization, pre-normalization, and reversible tokenization described therein), with the exception that the authors use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer.
The model became proportionally larger: more layers (up to 96), a higher number of units in each bottleneck layer (up to 12288), and a larger context window (2048 tokens, compared to 1024 in GPT-2 and 512 in the original GPT).
I/O - GPT-3 operates on input and output sequences of up to 2048 tokens. The input is a sequence of N words (a.k.a. tokens). The output is a guess for the word most likely to follow the input sequence, along with a probability for each candidate word.

Encoding - Since GPT cannot understand words directly, each token is represented as a vector of numbers over the GPT vocabulary (50257 tokens), giving a 2048 x 50257 matrix of ones and zeros. GPT-3 actually uses byte-level Byte Pair Encoding (BPE) tokenization.
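A minimal sketch of this encoding step, assuming NumPy and a handful of made-up token IDs in place of real BPE output; it builds the 2048 x 50257 matrix of ones and zeros described above.

import numpy as np

VOCAB_SIZE = 50257   # size of GPT-3's byte-level BPE vocabulary
CONTEXT    = 2048    # maximum input length in tokens

# Hypothetical token IDs that a BPE tokenizer might produce for a short prompt.
token_ids = [464, 3290, 318, 257, 922]

# One-hot encode into the 2048 x 50257 matrix of ones and zeros
# (positions beyond the prompt stay all-zero, i.e. padded).
one_hot = np.zeros((CONTEXT, VOCAB_SIZE), dtype=np.float32)
one_hot[np.arange(len(token_ids)), token_ids] = 1.0

print(one_hot.shape)             # (2048, 50257)
print(one_hot.sum(axis=1)[:8])   # filled rows sum to 1, padded rows to 0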

Embedding - A 50257-long vector is pretty big, and it is mostly filled with zeros; that is a lot of wasted space. To solve this, we learn an embedding function: a neural network that takes a 50257-length vector of ones and zeros and outputs an n-length vector of numbers. The idea is to store (or project) the information about the word's meaning in a lower-dimensional space. In an illustrative picture this might be a projection of an N-sized vector into a 2-d vector, but GPT-3 actually uses 12288 dimensions: the 2048 x 50257 sequence-encodings matrix is multiplied by the learned 50257 x 12288 embedding-weights matrix, yielding a 2048 x 12288 sequence-embeddings matrix.
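The projection is a single matrix multiplication. The sketch below uses toy dimensions (so it runs in a few megabytes of memory) and random weights in place of the learned embedding matrix; the real shapes are noted in the comments.

import numpy as np

# Toy dimensions stand in for GPT-3's real ones (50257 vocab, 2048 context,
# 12288-dim embeddings) so the sketch runs in a few MB of memory.
VOCAB_SIZE, CONTEXT, D_MODEL = 1000, 16, 64

rng = np.random.default_rng(0)
W_embed = rng.normal(scale=0.02, size=(VOCAB_SIZE, D_MODEL)).astype(np.float32)  # learned in the real model

# One-hot sequence encodings from the previous step (toy token IDs).
token_ids = [464, 329, 318, 257, 922]
one_hot = np.zeros((CONTEXT, VOCAB_SIZE), dtype=np.float32)
one_hot[np.arange(len(token_ids)), token_ids] = 1.0

# (context x vocab) @ (vocab x d_model) -> (context x d_model) sequence embeddings;
# with real sizes this is 2048 x 50257 times 50257 x 12288 -> 2048 x 12288.
embeddings = one_hot @ W_embed
print(embeddings.shape)          # (16, 64)

# Multiplying by a one-hot matrix is equivalent to a simple row lookup:
assert np.allclose(embeddings[:len(token_ids)], W_embed[token_ids])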
Positional Encoding: To encode the position of the current token in the sequence, the authors take the token's position (a scalar i, in [0-2047]) and pass it through 12288 sinusoidal functions, each with a different frequency. Finally, the resulting sequence-positional-encodings matrix, which has the same shape as the sequence-embeddings matrix, is simply added to it.
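A sketch of this step, assuming the standard sine/cosine formulation from the original Transformer paper; the result has the same shape as the sequence-embeddings matrix and is simply added to it.

import numpy as np

def sinusoidal_positional_encoding(context_len: int, d_model: int) -> np.ndarray:
    # Each position is passed through d_model sine/cosine functions of
    # different frequencies, as described in the paragraph above.
    positions = np.arange(context_len)[:, None]                  # (context, 1)
    dims = np.arange(d_model)[None, :]                           # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                             # (context, d_model)
    pe = np.zeros((context_len, d_model), dtype=np.float32)
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

# Toy shape; the real matrix would be 2048 x 12288.
pos_enc = sinusoidal_positional_encoding(16, 64)
# embeddings = embeddings + pos_enc   # same shape, so simple addition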

Attention: For each output in the sequence, predict which input tokens to focus on and how much. Here, imagine a sequence of 3 tokens, each represented with a 512-value embedding. The first two matrices ("queries" and "keys") are multiplied together (QKᵀ), which yields a 3x3 matrix. This matrix (normalized through softmax) represents the importance of each token to every other token.
Note: This (QKᵀ) is the only operation in GPT which operates across words in the sequence. It is the only operation where matrix rows interact.

The third matrix ("values") is multiplied with this importance matrix, resulting in, for each token, a mix of all other token values weighted by the importance of their respective tokens.
Multi-head attention is this attention mechanism applied many times in parallel (96 heads in GPT-3); GPT-3 also uses sparse attention in alternating layers.
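A minimal sketch of a single dense attention head, assuming NumPy and random projection weights, and omitting the causal mask GPT applies in practice; multi-head attention runs this computation several times in parallel with smaller per-head projections.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: QK^T scores -> softmax weights -> mix of values.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (seq, seq): importance of each token to each other
    weights = softmax(scores, axis=-1)    # rows sum to 1
    return weights @ V                    # each output is a weighted mix of the value vectors

rng = np.random.default_rng(0)
seq_len, d_k = 3, 512                     # 3 tokens with 512-value embeddings, as in the example above
x = rng.normal(size=(seq_len, d_k))

# Q, K, V are linear projections of the same input (weights learned in the real model).
W_q, W_k, W_v = (rng.normal(scale=0.02, size=(d_k, d_k)) for _ in range(3))
out = attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)                          # (3, 512)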

Feed-Forward: The feed-forward block is a good old multi-layer perceptron with one hidden layer: take the input, multiply with learned weights, add a learned bias, do it again, and get a result.
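A sketch of one feed-forward block, assuming NumPy and random weights; the GELU activation and the 4x-wider hidden layer are standard in GPT-style transformers, details the paragraph above leaves implicit.

import numpy as np

def gelu(x):
    # GPT-style models use the GELU activation between the two linear layers.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, W1, b1, W2, b2):
    # One hidden layer: multiply by learned weights, add bias, activate, repeat.
    hidden = gelu(x @ W1 + b1)     # (seq, 4*d_model): hidden layer is 4x wider
    return hidden @ W2 + b2        # back to (seq, d_model)

rng = np.random.default_rng(0)
d_model, seq_len = 64, 16          # toy sizes; GPT-3's d_model is 12288
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(scale=0.02, size=(d_model, 4 * d_model)), np.zeros(4 * d_model)
W2, b2 = rng.normal(scale=0.02, size=(4 * d_model, d_model)), np.zeros(d_model)
print(feed_forward(x, W1, b1, W2, b2).shape)   # (16, 64)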

Add & Norm: After both the multi-head attention and the feed-forward blocks, the input of the block is added to its output, and the result is normalized. This is common in deep learning models.
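A sketch of the residual add-and-normalize step, assuming NumPy. As noted earlier, GPT-2/3 actually apply the normalization before each block (pre-normalization), but the residual-plus-layer-norm idea is the same.

import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token vector to zero mean / unit variance, then rescale.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def add_and_norm(block_input, block_output, gamma, beta):
    # Residual connection followed by layer normalization.
    return layer_norm(block_input + block_output, gamma, beta)

rng = np.random.default_rng(0)
d_model = 64
x, y = rng.normal(size=(16, d_model)), rng.normal(size=(16, d_model))
gamma, beta = np.ones(d_model), np.zeros(d_model)
print(add_and_norm(x, y, gamma, beta).shape)   # (16, 64)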

Decoding: After passing through all 96 layers of GPT-3's attention/neural net machinery, the input has
been processed into a 2048 x 12288 matrix. This matrix is supposed to contain, for each of the 2048
output positions in the sequence, a 12288-vector of information about which word should appear. But
how do we extract this information?

As stated in the Embedding section, we learned a mapping which transforms a given (one-hot encoding
of a) word into a 12288-vector embedding. It turns out, we can just reverse this mapping to transform
our output 12288-vector embedding back into a 50257-word-encoding.

In addition, the GPT papers mention the parameter top-k, which limits the number of possible words to sample in the output to the k most likely predicted words. For example, with a top-k parameter of 1, we always pick the most likely word.
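A sketch combining the un-embedding step with top-k sampling, assuming NumPy, toy dimensions, and a random stand-in for the learned embedding matrix.

import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def decode_next_token(final_hidden, W_embed, k=40, rng=None):
    # Map the last position's vector back to vocabulary logits by reusing the
    # (transposed) embedding matrix, then sample from the k most likely tokens.
    logits = final_hidden @ W_embed.T        # (vocab_size,)
    top_ids = np.argsort(logits)[-k:]        # indices of the k most likely tokens
    probs = softmax(logits[top_ids])
    rng = rng or np.random.default_rng()
    return rng.choice(top_ids, p=probs)      # with k=1 this is greedy decoding

# Toy sizes stand in for GPT-3's 50257 x 12288 embedding matrix.
rng = np.random.default_rng(0)
W_embed = rng.normal(scale=0.02, size=(1000, 64))
final_hidden = rng.normal(size=64)           # last row of the 2048 x 12288 output
print(decode_next_token(final_hidden, W_embed, k=5, rng=rng))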

Combining the steps above in sequence (illustrative figures omitted): Steps 1 & 2, Steps 3 & 4, and Steps 5, 6 & 7.


Architecture Summary: Like the models invented before it, the Transformer is an encoder-decoder architecture. The encoder consists of a set of encoding layers that processes the input iteratively, one layer after another, and the decoder consists of a set of decoding layers that does the same to the output of the encoder.

The function of each encoder layer is to process its input to generate encodings containing information about which parts of the inputs are relevant to each other. It passes its set of encodings to the next encoder layer as inputs. Each decoder layer does the opposite, taking all the encodings and processing them, using their incorporated contextual information to generate an output sequence. To achieve this, each encoder and decoder layer makes use of an attention mechanism, which for each input weighs the relevance of every other input and draws information from them accordingly to produce the output. Each decoder layer also has an additional attention mechanism which draws information from the outputs of previous decoders, before the decoder layer draws information from the encodings. Both the encoder and decoder layers have a feed-forward neural network for additional processing of the outputs, and contain residual connections and layer normalization steps. (GPT-3 itself uses only the decoder-style stack of this architecture, but the attention, feed-forward, residual, and normalization components described above are the same.)

GPT-3 Use Cases:


1. Text summarization
2. Natural language to SQL
3. Natural language to LaTeX equations
4. Creative writing
5. Interface design and coding
6. Text to DevOps
7. Automatic mail answering
8. Dialog flows workbench for gaming and chatbots

GPT Key Features:

● GPT-3 is the largest language model trained today. GPT-3 is OpenAI's latest and greatest natural
language prediction model.
● The basic operating mode of GPT-3 is to generate text responses based on the input text, e.g. to answer a question or to write an essay based on a title.
● OpenAI now provides a developer API to interact with GPT-3 and build applications on top of it.
● GPT-3 is a few-shot learner. It requires priming with a few examples to work in a specific context (see the API sketch after this list).
● Once primed correctly, GPT-3 can perform math calculations and generate answers in programming languages, although it has not learned either explicitly.
● GPT-3 shows that language model performance scales as a power-law of model size, dataset
size, and the amount of computation.
● GPT-3 demonstrates that a language model trained on enough data can solve NLP tasks that it has never encountered. That is, GPT-3 treats the language model as a general solution for many downstream tasks without fine-tuning.
● The cost of AI is increasing exponentially. Training GPT-3 would cost over $4.6M using a Tesla
V100 cloud instance.
● The size of state-of-the-art (SOTA) language models is growing by at least a factor of 10 every year. This outpaces the growth of GPU memory. For NLP, the days of "embarrassingly parallel" training are coming to an end; model parallelization will become indispensable.
● GPT-3 is pre-trained on a large amount of natural language text from the Internet (45 TB of training text containing 499 billion tokens). It cost at least 4.6 million US dollars (some estimates run as high as $12 million) to train on GPUs. The resulting model has 175 billion parameters.
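As a hedged illustration of few-shot priming through the developer API, the sketch below uses the original (2020-era) openai Python package with an English-to-French prompt. The engine name, parameters, and expected output are assumptions for illustration, and later versions of the library changed these names.

# Few-shot "priming" through the developer API (2020-era `openai` package).
# Treat this as illustrative only; newer library versions use different calls.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# A few examples in the prompt prime the model for English->French translation;
# no weights are updated -- the examples simply condition the generation.
prompt = (
    "English: Good morning\nFrench: Bonjour\n"
    "English: Thank you\nFrench: Merci\n"
    "English: See you tomorrow\nFrench:"
)

response = openai.Completion.create(
    engine="davinci",        # GPT-3 base engine name at launch (assumed)
    prompt=prompt,
    max_tokens=16,
    temperature=0.0,
    stop="\n",
)
print(response.choices[0].text.strip())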

GPT-3: Reaping the Benefits and Mitigating the Risks

OpenAI’s GPT-3 is a massive step forward for the AI space, specifically for natural language generation,
and it is going to exponentially help multiple industries. Selecting the right application and combining it
with other components can really help leverage this model to a great extent.

GPT-3 is great; however, it also has its own setbacks and flaws. The algorithm can sometimes go completely off-topic, and it can get offensive too. These drawbacks may make people apprehensive about using this model in production, which leads them to ignore the benefits that come with it.

Consequently, there is also a need to balance the risks and benefits of using generative models. We can
mitigate the risks and reap the benefits of the algorithm in the following ways:

● Use it for use cases where the risk doesn’t have major repercussions: Instead of always using
it for an end-user application, it can be used for internal use cases which can boost productivity.
This can control the risk and will still reap the benefits.
● Build components to control the content: There is a way to control the content by architecting
different ML models on top of GPT-3, which will flag content that is not correct. This can help
prevent incorrect or inappropriate content from going out when it’s not supposed to. These
components on top can err on the side of caution and be more strict because their role is to
prevent bad content from going out, even at the risk of restricting some good content. This can lead to more false positives; however, it provides reassurance that the content going out is safe. This allows the generative model to be creative while still having an architecture to prevent things from going off track.
● Adapt it to domain-specific data: OpenAI does provide access to training APIs (on request) which allow adapting GPT-3 to a particular task or domain and making it more relevant for the task at hand.
● Political design and governance of AI systems is the key.
● Results are amazing, but at what cost?
● Finally, more data doesn't necessarily mean better data. We need quality data; in fact, we need unbiased and diverse data.

****************************************************************
