Artificial Intelligence - Assignment 3
Ashutosh - A18PT2-33
Jigyasa - 19PT1-12
Navneet - 19PT1-17
Pankhuri - 19PT1-18
Deep - A18PT2-37
● Prior to the release of GPT-3, the largest language model was Microsoft's Turing NLG,
introduced in February 2020, with 17 billion parameters, less than 10 percent of GPT-3's capacity.
● The quality of the text generated by GPT-3 is so high that it is difficult to distinguish from
that written by a human, which has both benefits and risks.
● David Chalmers, an Australian philosopher, described GPT-3 as "one of the most
interesting and important AI systems ever produced."
● GPT-n models are based on the Transformer deep learning neural network architecture.
● GPT-3 is based on a specific neural network architecture type called Transformer that,
simply put, is more effective than other architectures like RNNs (Recurrent Neural
Networks).
Language Model?
“The diversity of tasks the model is able to perform in a zero-shot setting suggests that high-capacity models trained to maximize the likelihood of a sufficiently varied text corpus begin to learn how to perform a surprising amount of tasks without the need for explicit supervision.”
GPT-3 Architecture
GPT-3 is a neural-network-powered language model and sequence transduction model based on deep
learning. It uses a Transformer-based architecture similar to GPT-2, including the modified
initialization, pre-normalization, and reversible tokenization described therein, with the exception
that it uses alternating dense and locally banded sparse attention patterns in the layers of the
transformer, similar to the Sparse Transformer. As a Transformer-based model in the same family as
BERT, the architecture itself is not especially novel. However, with 175 billion parameters it is the
largest language model trained so far, which lets it perform specific tasks without special fine-tuning
and with only a few training examples. GPT-3 is trained on Common Crawl, Wikipedia, Books1, and Books2.
GPT-3 175B has a lower data-compression ratio (499/175 ≈ 2.85) than GPT-2 1.5B (10/1.5 ≈ 6.66),
measured as billions of training tokens per billion parameters.
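The quoted compression ratio is simply the training-token count (in billions) divided by the parameter count (in billions); a quick check of the arithmetic:

```python
# Training tokens (billions) divided by parameters (billions), as quoted above.
gpt3_ratio = 499 / 175    # GPT-3 175B trained on ~499B tokens
gpt2_ratio = 10 / 1.5     # GPT-2 1.5B trained on ~10B tokens
print(round(gpt3_ratio, 2), round(gpt2_ratio, 2))  # 2.85 6.67
```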
Essentially, the architecture is the same as GPT-2, but the model became proportionally larger: more
layers (up to 96), more units in each bottleneck layer (up to 12288), and a larger context window
(2048 tokens, compared to 1024 in GPT-2 and 512 in GPT).
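For reference, the headline hyperparameters quoted above can be summarised as follows (the head count comes from the GPT-3 paper; treat this as an illustrative summary, not a complete configuration):

```python
# Headline hyperparameters of the largest GPT-3 model (illustrative summary only).
GPT3_175B = {
    "n_parameters": 175_000_000_000,
    "n_layers": 96,           # transformer layers
    "d_model": 12288,         # units in each bottleneck layer
    "n_heads": 96,            # attention heads per layer (128 dims per head)
    "context_window": 2048,   # tokens (vs 1024 in GPT-2, 512 in GPT)
    "vocab_size": 50257,      # byte-level BPE vocabulary
}

for name, value in GPT3_175B.items():
    print(f"{name:>15}: {value:,}")
```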
I/O - GPT-3 operates on input and output sequences. The input is a sequence of up to 2048 words (a.k.a.
tokens). The output is a guess for the word most likely to follow the input sequence: in fact, a
probability for every candidate word.
Encoding - As GPT can't understand words directly, each word is represented as a vector of numbers over
the GPT vocabulary (50257 tokens), i.e., a one-hot encoding; a full input is a 2048 x 50257 matrix of
ones and zeroes. GPT-3 actually uses byte-level Byte Pair Encoding (BPE) tokenization.
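A minimal sketch of this one-hot view, using toy sizes so it runs quickly (GPT-3's real sizes are a 2048-token context and a 50257-token vocabulary; the token IDs below are made up, not real BPE output):

```python
import numpy as np

# Toy sizes for illustration; GPT-3 uses SEQ_LEN = 2048 and VOCAB_SIZE = 50257.
SEQ_LEN, VOCAB_SIZE = 8, 1000

# Hypothetical token IDs that a byte-level BPE tokenizer might produce for a short prompt.
token_ids = [464, 329, 318, 257, 922]

# One-hot encoding: a SEQ_LEN x VOCAB_SIZE matrix of ones and zeroes.
one_hot = np.zeros((SEQ_LEN, VOCAB_SIZE), dtype=np.float32)
for position, token_id in enumerate(token_ids):
    one_hot[position, token_id] = 1.0

print(one_hot.shape)        # (8, 1000)
print(one_hot.sum(axis=1))  # each used position contains exactly one 1
```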
Embedding - 50257 is pretty big for a vector, and it's mostly filled with zeroes; that's a lot of wasted
space. To solve this, we learn an embedding function: a neural network that takes a 50257-length vector
of ones and zeroes and outputs an n-length vector of numbers. Here, we are trying to store (or project)
the information of the word's meaning into a smaller-dimensional space (for intuition, imagine projecting
an N-size vector down to a 2-d vector). In GPT-3 the embedding is not 2-dimensional but 12288-dimensional:
we multiply the 2048 x 50257 sequence-encodings matrix with the learned 50257 x 12288 embedding-weights
matrix and end up with a 2048 x 12288 sequence-embeddings matrix.
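A sketch of this projection step with toy sizes (GPT-3's real matrices are 2048 x 50257 and 50257 x 12288; the weights here are random placeholders, not learned values):

```python
import numpy as np

# Toy sizes; GPT-3 uses SEQ_LEN = 2048, VOCAB_SIZE = 50257, D_MODEL = 12288.
SEQ_LEN, VOCAB_SIZE, D_MODEL = 8, 1000, 16

embedding_weights = (np.random.randn(VOCAB_SIZE, D_MODEL) * 0.02).astype(np.float32)

one_hot = np.zeros((SEQ_LEN, VOCAB_SIZE), dtype=np.float32)
one_hot[0, 464] = 1.0  # made-up token ID at position 0

# (SEQ_LEN x VOCAB) @ (VOCAB x D_MODEL) -> (SEQ_LEN x D_MODEL) sequence embeddings.
sequence_embeddings = one_hot @ embedding_weights
print(sequence_embeddings.shape)

# A one-hot row times the weight matrix is simply a row lookup:
assert np.allclose(sequence_embeddings[0], embedding_weights[464])
```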
Positional Encoding: To encode the position of the current token in the sequence, the authors take the
token's position (a scalar i, in [0-2047]) and pass it through 12288 sinusoidal functions, each with a
different frequency. The resulting sequence-positional-encodings matrix has the same shape as the
sequence-embeddings matrix and is simply added to it.
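A minimal sketch of sinusoidal positional encodings with toy sizes; this follows the classic sinusoidal scheme from the original Transformer paper and is meant as an illustration of the idea described above rather than GPT-3's exact implementation:

```python
import numpy as np

# Toy sizes; GPT-3 uses SEQ_LEN = 2048 and D_MODEL = 12288.
SEQ_LEN, D_MODEL = 8, 16

def sinusoidal_positional_encodings(seq_len, d_model):
    """Each of the d_model channels is a sine/cosine of the position at a different frequency."""
    positions = np.arange(seq_len)[:, None]    # (seq_len, 1)
    channels = np.arange(d_model)[None, :]     # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (channels // 2)) / d_model)
    angles = positions * angle_rates
    encodings = np.zeros((seq_len, d_model), dtype=np.float32)
    encodings[:, 0::2] = np.sin(angles[:, 0::2])
    encodings[:, 1::2] = np.cos(angles[:, 1::2])
    return encodings

positional_encodings = sinusoidal_positional_encodings(SEQ_LEN, D_MODEL)
# Same shape as the sequence-embeddings matrix, so the two are simply added:
# model_input = sequence_embeddings + positional_encodings
print(positional_encodings.shape)
```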
Attention: For each output in the sequence, predict which input tokens to focus on and how much.
Here, imagine a sequence of 3 tokens, each represented with a 512-value embedding. The first two
matrices ("queries" and "keys") are multiplied together (QKᵀ), which yields a 3x3 matrix. This matrix
(normalized through a softmax) represents the importance of each token to each other token.
Note: This (QKᵀ) is the only operation in GPT which operates across words in the sequence; it is the only
operation where matrix rows interact.
The third matrix ("values") is multiplied with this importance matrix, resulting in, for each token, a mix of
all other token values weighted by the importance of their respective tokens.
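A minimal sketch of this attention step for the 3-token, 512-dimension example (random matrices stand in for the learned query/key/value projections, and the causal mask GPT uses to block attention to future tokens is omitted for brevity):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy example matching the text: 3 tokens, 512-value embeddings.
SEQ_LEN, D_K = 3, 512
rng = np.random.default_rng(0)
Q = rng.standard_normal((SEQ_LEN, D_K))   # queries
K = rng.standard_normal((SEQ_LEN, D_K))   # keys
V = rng.standard_normal((SEQ_LEN, D_K))   # values

# QK^T gives a 3x3 matrix of token-to-token importances
# (scaled by sqrt(d_k) as in the Transformer paper, then softmax-normalised).
importance = softmax(Q @ K.T / np.sqrt(D_K), axis=-1)

# Each output row is a mix of all value rows, weighted by importance.
attention_output = importance @ V
print(importance.shape, attention_output.shape)  # (3, 3) (3, 512)
```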
Multi-Head Attention is attention applied many times in parallel (96 heads per layer in GPT-3); GPT-3
also uses sparse attention in alternating layers.
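A sketch of splitting attention across several heads (toy sizes; GPT-3 uses 96 heads of 128 dimensions each, and its sparse-attention variant is not reproduced here):

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy sizes; GPT-3 uses 96 heads and d_model = 12288 (128 dims per head).
SEQ_LEN, N_HEADS, D_HEAD = 3, 4, 8
D_MODEL = N_HEADS * D_HEAD
rng = np.random.default_rng(1)
x = rng.standard_normal((SEQ_LEN, D_MODEL))

# Illustrative projection weights (learned in the real model).
W_q, W_k, W_v, W_o = (rng.standard_normal((D_MODEL, D_MODEL)) for _ in range(4))

def split_heads(m):
    # (seq, d_model) -> list of (seq, d_head) slices, one per head
    return [m[:, h * D_HEAD:(h + 1) * D_HEAD] for h in range(N_HEADS)]

q, k, v = x @ W_q, x @ W_k, x @ W_v
head_outputs = [attention(qh, kh, vh)
                for qh, kh, vh in zip(split_heads(q), split_heads(k), split_heads(v))]

# Concatenate the heads and project back to d_model.
multi_head_output = np.concatenate(head_outputs, axis=-1) @ W_o
print(multi_head_output.shape)  # (3, 32)
```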
Feed-Forward: The feed-forward block is a good old multi-layer perceptron with 1 hidden layer: take the
input, multiply with learned weights, add a learned bias, apply a non-linearity, do it again, and get a result.
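A sketch of the feed-forward block with toy sizes (in GPT-3 the hidden layer is 4 x d_model; GELU is the activation used in GPT models, and the weights here are random placeholders):

```python
import numpy as np

# Toy sizes; in GPT-3 d_model = 12288 and the hidden layer is 4 * d_model.
SEQ_LEN, D_MODEL, D_HIDDEN = 3, 16, 64
rng = np.random.default_rng(2)
x = rng.standard_normal((SEQ_LEN, D_MODEL))

W1, b1 = rng.standard_normal((D_MODEL, D_HIDDEN)), np.zeros(D_HIDDEN)
W2, b2 = rng.standard_normal((D_HIDDEN, D_MODEL)), np.zeros(D_MODEL)

def gelu(x):
    # Tanh approximation of the GELU activation used in GPT models.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

# "Multiply with learned weights, add learned bias, do it again":
feed_forward_output = gelu(x @ W1 + b1) @ W2 + b2
print(feed_forward_output.shape)  # (3, 16)
```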
Add & Norm: After both the Multi-Head Attention and the feed-forward blocks, the input of the block is
added to its output, and the result is normalized. This is common in deep learning models.
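A minimal sketch of the residual add followed by layer normalization (learned scale/shift parameters omitted; note that GPT-2/GPT-3 apply the normalization in a pre-norm arrangement, as mentioned earlier, but the Add & Norm idea is the same):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each position's vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(3)
block_input = rng.standard_normal((3, 16))
block_output = rng.standard_normal((3, 16))   # stand-in for attention / FFN output

# Residual connection followed by normalisation ("Add & Norm").
result = layer_norm(block_input + block_output)
print(result.mean(axis=-1).round(6), result.std(axis=-1).round(3))
```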
Decoding: After passing through all 96 layers of GPT-3's attention/neural net machinery, the input has
been processed into a 2048 x 12288 matrix. This matrix is supposed to contain, for each of the 2048
output positions in the sequence, a 12288-vector of information about which word should appear. But
how do we extract this information?
As stated in the Embedding section, we learned a mapping which transforms a given (one-hot encoding
of a) word into a 12288-vector embedding. It turns out, we can just reverse this mapping to transform
our output 12288-vector embedding back into a 50257-word-encoding.
In addition, the GPT papers mention the parameter top-k, which limits the number of possible words to
sample in the output to the k most likely predicted words. For example, with a top-k parameter of 1, we
always pick the most likely word.
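A sketch of the un-embedding and top-k step with toy sizes (random values stand in for the learned embedding weights and for the final hidden vector):

```python
import numpy as np

# Toy sizes; GPT-3 uses VOCAB_SIZE = 50257 and D_MODEL = 12288.
VOCAB_SIZE, D_MODEL = 1000, 16
rng = np.random.default_rng(4)

embedding_weights = rng.standard_normal((VOCAB_SIZE, D_MODEL))
final_hidden = rng.standard_normal(D_MODEL)   # vector for the last output position

# "Reverse" the embedding: project back onto the vocabulary to get one score per word.
logits = embedding_weights @ final_hidden      # (VOCAB_SIZE,)
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Top-k sampling: keep only the k most likely words, renormalise, and sample.
def sample_top_k(probs, k=40):
    top_ids = np.argsort(probs)[-k:]
    top_probs = probs[top_ids] / probs[top_ids].sum()
    return np.random.choice(top_ids, p=top_probs)

print(sample_top_k(probs, k=40))   # a random token ID from the 40 most likely
print(np.argmax(probs))            # with top-k = 1 we always pick the most likely word
```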
Step 1 & 2:
Step 3 & 4
● GPT-3 is the largest language model trained to date, and OpenAI's latest and greatest natural
language prediction model.
● The basic operating mode of GPT-3 is to generate text responses based on the input text, e.g., to
answer a question or to write an essay based on a title.
● OpenAI now provides a developer API to interact with GPT-3 and build applications on top of it (see
the sketch after this list).
● GPT-3 is a few-shot learner. It requires priming with a few examples to work in a specific
context.
● Once primed correctly, GPT-3 can perform math calculations and generate answers in
programming languages, although it has not learned either explicitly.
● GPT-3 shows that language model performance scales as a power-law of model size, dataset
size, and the amount of computation.
● GPT-3 demonstrates that a language model trained on enough data can solve NLP tasks that it
has never encountered. That is, GPT-3 positions the language model as a general solution for many
downstream tasks without fine-tuning.
● The cost of AI is increasing exponentially. Training GPT-3 would cost over $4.6M using a Tesla
V100 cloud instance.
● The size of state-of-the-art (SOTA) language models is growing by at least a factor of 10 every
year. This outpaces the growth of GPU memory. For NLP, the days of "embarrassingly parallel" training
are coming to an end; model parallelization will become indispensable.
● GPT-3 is pre-trained with a large amount of natural language text from the Internet (45TB of
training text with 499 billion words). It cost at least 4.6 million US dollars (some estimates run as
high as $12 million) to train on GPUs. The resulting model has 175 billion parameters.
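As an illustration of the developer API and few-shot priming mentioned in the list above, here is a minimal sketch using the legacy openai Python client (0.x); the engine name, prompt, and parameter values are assumptions for illustration, and the current API surface differs:

```python
# pip install openai  (legacy 0.x client; the current API surface differs)
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# Few-shot "priming": a handful of examples sets the context for the task.
prompt = (
    "English: Hello, how are you?\nFrench: Bonjour, comment allez-vous ?\n"
    "English: I love machine learning.\nFrench: J'adore l'apprentissage automatique.\n"
    "English: Where is the library?\nFrench:"
)

response = openai.Completion.create(
    engine="davinci",        # GPT-3 base model exposed by the API at launch
    prompt=prompt,
    max_tokens=32,
    temperature=0.3,
    stop="\n",
)
print(response["choices"][0]["text"].strip())
```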
GPT-3: Reaping the Benefits and Mitigating the Risks
OpenAI’s GPT-3 is a massive step forward for the AI space, specifically for natural language generation,
and it is going to exponentially help multiple industries. Selecting the right application and combining it
with other components can really help leverage this model to a great extent.
GPT-3 is great; however, it also has its own setbacks and flaws. The algorithm can sometimes go
completely off-topic, and it can get offensive too. These drawbacks may make people apprehensive about
using this model in production, which leads them to ignore the benefits that come with it.
Consequently, there is also a need to balance the risks and benefits of using generative models. We can
mitigate the risks and reap the benefits of the algorithm in the following ways:
● Use it for use cases where the risk doesn’t have major repercussions: Instead of always using
it for an end-user application, it can be used for internal use cases which can boost productivity.
This can control the risk and will still reap the benefits.
● Build components to control the content: There is a way to control the content by architecting
different ML models on top of GPT-3, which will flag content that is not correct. This can help
prevent incorrect or inappropriate content from going out when it’s not supposed to. These
components on top can err on the side of caution and be more strict because their role is to
prevent bad content from going out, even if it is at the risk of restricting some good content. This
can lead to more false positives; however, it provides reassurance that the content going out is safe.
This allows the generative model to be creative while still having an architecture in place to prevent
things from going off track (see the sketch after this list).
● Adapt it to domain-specific data: OpenAI does provide access to training APIs (on request), which
allow adapting GPT-3 to a particular task/domain and making it more relevant for the task at hand.
● Political design and governance of AI systems is key.
● The results are amazing, but at what cost?
● Finally, more data doesn't necessarily mean better data. We need quality data; in fact, unbiased
and diverse data.
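To illustrate the "components to control the content" idea above, here is a minimal sketch of wrapping a generative model with a separate safety check; the generate and is_safe callables are hypothetical placeholders, not part of GPT-3 or the OpenAI API:

```python
from typing import Callable

def moderated_generation(
    generate: Callable[[str], str],   # hypothetical wrapper around a GPT-3 call
    is_safe: Callable[[str], bool],   # hypothetical classifier flagging bad content
    prompt: str,
    max_attempts: int = 3,
    fallback: str = "[response withheld by content filter]",
) -> str:
    """Generate text, but only release it if a separate model deems it safe.

    The filter deliberately errs on the side of caution: a few good completions
    may be blocked (false positives) in exchange for keeping bad content from
    going out.
    """
    for _ in range(max_attempts):
        candidate = generate(prompt)
        if is_safe(candidate):
            return candidate
    return fallback

# Example usage with trivial stand-ins for the two models:
if __name__ == "__main__":
    fake_generate = lambda p: "A perfectly harmless completion."
    fake_is_safe = lambda text: "offensive" not in text.lower()
    print(moderated_generation(fake_generate, fake_is_safe, "Write a greeting."))
```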
****************************************************************