
GENERATIVE AI AND LARGE LANGUAGE MODELS

SCSB4014

UNIT - 3

LARGE LANGUAGE MODELS (LLMs)

UNIT 3: LARGE LANGUAGE MODELS (LLMs)

How transformers are suitable for NLP – Encoder-only LLMs: BERT, Decoder-only LLMs: GPT,
Llama, MoE-based Architecture, Significance of decoder-only LLMs over encoder-only LLMs -
Natural Language Processing (NLP) - Translation, summarization, Named Entity Recognition
(NER) - Applications - Content Creation, Chatbot System.
How do Transformers (NLP) work?

A transformer neural network takes an input sentence in the form of a sequence of vectors,
converts it into an intermediate representation called an encoding, and then decodes it back into
another sequence. An important part of the transformer is the attention mechanism, which allows
the network to focus on specific parts of the input when decoding the output. This results in a
more efficient and accurate translation of the input sentence.

Transformers

As we know, RNNs, GRUs and LSTMs are sequential models with some limitations: they cannot
work well with long sequences. LSTMs have a memory mechanism, but they are still unable to recall
the whole input. That is why transformers came into the picture: they can attend to the entire
input at once using the self-attention technique.

Transformer Models

A transformer model is a neural network that learns the context of sequential data and
generates new data out of it.

A transformer is a type of artificial intelligence model that learns to understand and generate
human-like text by analyzing patterns in large amounts of text data.

Transformers are the current state-of-the-art NLP architecture and are considered the evolution of
the encoder-decoder architecture. However, while the encoder-decoder architecture relies mainly
on Recurrent Neural Networks (RNNs) to extract sequential information, Transformers completely
lack this recurrence.
They are specifically designed to comprehend context and meaning by analyzing the
relationship between different elements, and they rely almost entirely on a mathematical
technique called attention to do so.

Fig 1

The Transformer Architecture

Overview

Originally devised for sequence transduction or neural machine translation, transformers excel
in converting input sequences into output sequences. The Transformer was the first transduction
model relying entirely on self-attention to compute representations of its input and output
without using sequence-aligned RNNs or convolution. A core characteristic of the Transformer
architecture is that it maintains the encoder-decoder model.

If we start by considering a Transformer for language translation as a simple black box, it would
take a sentence in one language, English for instance, as an input and output its translation in
another language, Spanish in this example.

Fig 2
If we dive a little bit, we observe that this black box is composed of two main parts:

 The encoder takes in our input and outputs a matrix representation of that input. For instance,
the English sentence “How are you?”

 The decoder takes in that encoded representation and iteratively generates an output. In our
example, the translated sentence “¿Cómo estás?”

Fig 3

However, both the encoder and the decoder are actually stacks with multiple layers (the same
number for each). All encoders present the same structure: the input enters the first encoder and
is then passed from one encoder to the next. All decoders also present the same structure and get
their input from the last encoder and from the previous decoder.

The original architecture consisted of 6 encoders and 6 decoders, but we can replicate as many
layers as we want. So let’s assume N layers of each.
Fig 4

Transformers in NLP

Transformers in Natural Language Processing (NLP) are a model architecture that revolutionized
the field of NLP, leading to remarkable improvements in language understanding tasks.
Transformers leverage attention mechanisms that are designed to capture dependencies in the
input data, irrespective of their distance from each other. A Transformer is a type of deep
learning architecture that uses an attention mechanism to process text sequences, unlike
traditional models based on recurrent neural networks. The way a Transformer works is based
on two key components: attention and transformer blocks. Attention allows the model to
assign different weights to words within a sequence and to focus on the most
relevant parts of the text. Transformer blocks are layers that apply nonlinear
transformations to the input representations and help the model learn language patterns and
structures.

History

The Transformers architecture was first introduced in a paper titled "Attention is All You Need"
by Vaswani et al., in 2017. It signifies a departure from traditional sequential processing
methods, replacing them with parallel processing for improved efficiency.
Functionality and Features

Transformers use a unique mechanism known as "attention" to weigh the importance of
different words in input data. Key features include:

 Self-attention: Measures dependencies between all words in a sentence, irrespective of
their distance.

 Encoder-decoder structure: Consists of an encoder that processes the input and a
decoder that produces the output.

 Positional encoding: Injects information about the position of words in the sequence.

Architecture

Transformers in NLP are composed of three main parts: an encoder, decoder, and a final linear
and softmax layer. The encoder and decoder are stacks of identical layers, each with two sub-
layers: a multi-head self-attention layer & a simple, position-wise fully connected feed-forward
network.

Benefits and Use Cases

Transformers have been used to achieve state-of-the-art results on a variety of NLP tasks, such
as translation, summarization, and sentiment analysis. They offer several advantages:

 Handle long-term dependencies better than RNNs and LSTMs.

 Parallelization leads to faster training.

 Improved accuracy on several NLP benchmarks.

Challenges and Limitations

Despite their advantages, Transformers face some challenges, such as difficulty handling very
long sequences (the cost of self-attention grows quadratically with sequence length) and high
resource consumption in terms of memory and computational power.
Integration with Data Lakehouse

Transformers can be integrated into a data lakehouse setup for text data analytics. They
empower data scientists to extract insights from massive unstructured text data stored in a
data lake, enabling state-of-the-art NLP analytics in a data lakehouse environment.

Security Aspects

The security of Transformers in NLP is mostly dependent on the data and systems they are used
with. While they don't inherently provide any security features, it's important to ensure data
privacy and protection when dealing with sensitive text data.

Performance

Transformers are considered high-performing models, delivering state-of-the-art results on
many NLP tasks. However, their resource-intensive nature can sometimes be a downside in
constrained environments.

Uses and applications of a Transformer for language modeling

Transformers have been used in a wide range of natural language processing applications.
Examples include machine translation, text generation, question answering, automatic
summarization, text classification, and sentiment analysis.

A prominent example of Transformers in NLP is GPT-4 (Generative Pre-trained Transformer 4),
developed by OpenAI, which currently reigns over the large language model scene according
to various human and automatic evaluations. GPT-4, and its predecessor GPT-3.5 (better known
as ChatGPT), took the world by storm with their ability to generate consistent and compelling
text in different contexts.

These models have been applied for tasks such as automatic text generation, where they excel
in the generation of articles, essays, and even programming code. They have also been used for
virtual assistants, chatbots, and personalized recommendation systems.
Encoder-only LLMs:

Large Language Models (LLMs) have revolutionized natural language processing (NLP) by
enabling advanced text understanding and generation capabilities.

1. Encoder-Only Models

What are Encoder-Only Models?

Encoder-only models focus on understanding and processing input text to extract meaningful
representations. They are particularly effective for tasks that require comprehension of text
rather than generation.

Example: BERT (Bidirectional Encoder Representations from Transformers)

BERT is a prime example of an encoder-only model. It uses a transformer encoder to create
deep bidirectional representations of text, meaning it looks at the entire context of a word by
considering both the words before and after it.

 Contextual Understanding: Excellent at understanding the context within text, making it
highly effective for tasks like text classification, named entity recognition, and question
answering.

 Pre-training Benefits: Pre-training on large datasets allows these models to capture a
wide range of language patterns.

 Limited Generation Capability: Not suitable for tasks that require text generation, such
as translation or summarization.

 Inference Complexity: Can be computationally intensive for large texts due to the
bidirectional context processing.

Use Cases:

 Sentiment Analysis: Determining the sentiment expressed in a piece of text.

 Named Entity Recognition: Identifying entities like names, dates, and locations within
text.
 Text Classification: Categorizing documents or sentences into predefined classes.

The Encoder WorkFlow

 The encoder is a fundamental component of the Transformer architecture. The primary
function of the encoder is to transform the input tokens into contextualized
representations. Unlike earlier models that processed tokens independently, the
Transformer encoder captures the context of each token with respect to the entire
sequence.
 Its structure is composed as follows:

Fig 5
STEP 1 - Input Embeddings

The encoder begins by converting input tokens - words or subwords - into vectors using
embedding layers. These embeddings capture the semantic meaning of the tokens and convert
them into numerical vectors. The embedding only happens in the bottom-most encoder.

All the encoders receive a list of vectors, each of size 512 (fixed size). In the bottom encoder,
these would be the word embeddings, but in the other encoders, they would be the output of
the encoder that is directly below them.

Fig 6

STEP 2 - Positional Encoding

Since Transformers do not have a recurrence mechanism like RNNs, they use positional
encodings added to the input embeddings to provide information about the position of each
token in the sequence. This allows them to understand the position of each word within the
sentence.

To do so, the researchers suggested employing a combination of various sine and cosine
functions to create positional vectors, enabling the use of this positional encoder for sentences
of any length.

In this approach, each dimension is represented by unique frequencies and offsets of the wave,
with the values ranging from -1 to 1, effectively representing each position.
Fig 7
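
To make this concrete, here is a minimal NumPy sketch of the sinusoidal positional encoding described above. The function name and shapes are illustrative assumptions; the base of 10000 and the sine/cosine split over even and odd dimensions follow the original paper.

import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings: even dimensions use sine, odd dimensions
    use cosine, each with a different frequency, so every position gets a unique
    pattern of values in [-1, 1]."""
    positions = np.arange(seq_len)[:, np.newaxis]          # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]               # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                        # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                   # sine on even indices
    pe[:, 1::2] = np.cos(angles[:, 1::2])                   # cosine on odd indices
    return pe

# The encoding is simply added to the token embeddings, e.g.:
# embedded = token_embeddings + positional_encoding(seq_len, 512)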

STEP 3 - Multi-Headed Self-Attention Mechanism

In the encoder, the multi-headed attention utilizes a specialized attention mechanism known as
self-attention. This approach enables the models to relate each word in the input with other
words. For instance, in a given example, the model might learn to connect the word “are” with
“you”.

This mechanism allows the encoder to focus on different parts of the input sequence as it
processes each token. It computes attention scores based on:

 A query is a vector that represents a specific word or token from the input sequence in the
attention mechanism.

 A key is also a vector in the attention mechanism, corresponding to each word or token in the
input sequence.

 Each value is associated with a key and is used to construct the output of the attention layer.
When a query and a key match well, which basically means that they have a high attention
score, the corresponding value is emphasized in the output.

This first Self-Attention module enables the model to capture contextual information from the
entire sequence. Instead of performing a single attention function, the queries, keys and values are
linearly projected h times. On each of these projected versions of the queries, keys and values the
attention mechanism is performed in parallel, yielding h sets of output values that are later
concatenated.

The detailed architecture goes as follows:

Fig 8

Matrix Multiplication (MatMul) - Dot Product of Query and Key

Once the query, key, and value vectors are passed through a linear layer, a dot product matrix
multiplication is performed between the queries and keys, resulting in the creation of a score
matrix.

The score matrix establishes the degree of emphasis each word should place on other words.
Therefore, each word is assigned a score in relation to other words within the same time step.
A higher score indicates greater focus.

This process effectively maps the queries to their corresponding keys.



Fig 9

Reducing the Magnitude of attention scores

The scores are then scaled down by dividing them by the square root of the dimension of the
query and key vectors. This step is implemented to ensure more stable gradients, as the
multiplication of values can lead to excessively large effects.

Fig 10: Reducing the attention scores.

Applying Softmax to the Adjusted Scores

Subsequently, a softmax function is applied to the adjusted scores to obtain the attention
weights. This results in probability values ranging from 0 to 1. The softmax function emphasizes
higher scores while diminishing lower scores, thereby enhancing the model's ability to
effectively determine which words should receive more attention.
Fig 11: Softmax-adjusted scores.

Combining Softmax Results with the Value Vector

In the next step of the attention mechanism, the weights derived from the softmax
function are multiplied by the value vectors, resulting in an output vector.

In this process, only the words with high softmax scores contribute strongly to the output. Finally,
this output vector is fed into a linear layer for further processing.

Fig 12
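
Putting the four operations above together (score matrix, scaling, softmax, weighting of the values), a minimal NumPy sketch of a single attention head might look like this; the function name and shapes are ours, not from the original paper.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one attention head.
    Q, K: (seq_len, d_k); V: (seq_len, d_v), produced by the linear
    projections of the input described above."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # MatMul + scaling
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                         # weighted sum of value vectors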

Remember that before all the process starts, we break our queries, keys and values h times.
This process, known as self-attention, happens separately in each of these smaller stages or
'heads'. Each head works its magic independently, conjuring up an output vector.

This ensemble passes through a final linear layer, much like a filter that fine-tunes their
collective performance. The beauty here lies in the diversity of learning across each head,
enriching the encoder model with a robust and multifaceted understanding.
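
Continuing the sketch above, the h heads can be simulated by slicing the projected queries, keys and values, running attention on each slice in parallel, then concatenating the head outputs and passing them through the final linear layer. All names and shapes here are illustrative assumptions.

import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """Multi-head self-attention sketch, reusing scaled_dot_product_attention
    from the previous snippet. x: (seq_len, d_model); each W: (d_model, d_model)."""
    d_model = x.shape[-1]
    d_head = d_model // num_heads
    Q, K, V = x @ W_q, x @ W_k, x @ W_v                    # linear projections
    heads = []
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)            # this head's subspace
        heads.append(scaled_dot_product_attention(Q[:, s], K[:, s], V[:, s]))
    concat = np.concatenate(heads, axis=-1)                # (seq_len, d_model)
    return concat @ W_o                                    # final linear "filter"
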
STEP 4 - Normalization and Residual Connections

Each sub-layer in an encoder layer is followed by a normalization step. Also, each sub-layer
output is added to its input (residual connection) to help mitigate the vanishing gradient
problem, allowing deeper models. This process will be repeated after the Feed-Forward Neural
Network too.

Fig 13
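
A minimal sketch of this "Add & Norm" pattern, omitting the learned scale and shift parameters that a full layer normalization would include:

def add_and_norm(x, sublayer_output, eps=1e-6):
    """Residual connection followed by layer normalization, applied after each
    sub-layer (x and sublayer_output are NumPy arrays of equal shape)."""
    y = x + sublayer_output                                # residual connection
    mean = y.mean(axis=-1, keepdims=True)
    std = y.std(axis=-1, keepdims=True)
    return (y - mean) / (std + eps)                        # layer normalization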

STEP 5 - Feed-Forward Neural Network

The journey of the normalized residual output continues as it navigates through a pointwise
feed-forward network, a crucial phase for additional refinement.

Picture this network as a duo of linear layers, with a ReLU activation nestled in between them,
acting as a bridge. Once processed, the output embarks on a familiar path: it loops back and
merges with the input of the pointwise feed-forward network.

This reunion is followed by another round of normalization, ensuring everything is well-
adjusted and in sync for the next steps.
Fig 14
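
The feed-forward sub-layer can be sketched in the same style; in the original paper the inner dimension is 2048 for d_model = 512, and the weights here are illustrative placeholders.

import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """Two linear layers with a ReLU in between, applied independently to
    every position in the sequence. x: (seq_len, d_model)."""
    hidden = np.maximum(0, x @ W1 + b1)                    # first linear layer + ReLU
    return hidden @ W2 + b2                                # second linear layer

# Inside the encoder layer it is wrapped by the same residual + normalization:
# x = add_and_norm(x, position_wise_ffn(x, W1, b1, W2, b2))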

STEP 6 - Output of the Encoder

The output of the final encoder layer is a set of vectors, each representing the input sequence
with a rich contextual understanding. This output is then used as the input for the decoder in a
Transformer model.

This careful encoding paves the way for the decoder, guiding it to pay attention to the right
words in the input when it's time to decode.

Think of it like building a tower, where you can stack up N encoder layers. Each layer in this
stack gets a chance to explore and learn different facets of attention, much like layers of
knowledge. This not only diversifies the understanding but could significantly amplify the
predictive capabilities of the transformer network.

Encoders: Transforming Inputs into Latent Representations

Encoders are the initial half of the generative process. They are responsible for transforming
raw input data, such as images or text, into a compact, latent representation. This latent space
representation captures the essence of the input data in a lower-dimensional form, highlighting
the crucial features that define it. This process is akin to how the human brain processes
information — abstracting away irrelevant details to focus on the essence of an object.
Encoders are particularly useful for dimensionality reduction, feature extraction, and anomaly
detection. In generative AI, these encoded representations serve as a bridge between the raw
data and the generative model, making it easier to manipulate and transform data for creative
purposes.

BERT:

As natural language processing (NLP) continues to advance, human-machine interaction has
become more prevalent, meaningful, and convincing than ever. In the following section, you can
take a closer look at how machines work to understand and generate human language. More
specifically, you'll learn what was so revolutionary about the emergence of the BERT model, as
well as its architecture, use cases, and training methods.

How Does BERT Work?

Let’s take a look at how BERT works, covering the technology behind the model, how it’s
trained, and how it processes data.

Core architecture and functionality

Recurrent and convolutional neural networks use sequential computation to generate
predictions. That is, they can predict which word will follow a sequence of given words once
trained on huge datasets. In that sense, they were considered unidirectional or context-free
algorithms.

By contrast, transformer-powered models like BERT are bidirectional because they predict words
based on both the previous words and the following words. This is achieved through the
self-attention mechanism, a layer incorporated in both the encoder and the decoder of the
original Transformer (BERT itself uses only the encoder). The goal of the attention layer is to
capture the contextual relationships existing between the different words in the input sentence.

Nowadays, there are many versions of pre-trained BERT, but in the original paper, Google
trained two versions of BERT, BERTbase and BERTlarge, with different neural architectures. In
essence, BERTbase was developed with 12 transformer layers, 12 attention heads, and 110
million parameters, while BERTlarge used 24 transformer layers, 16 attention heads, and 340
million parameters. As expected, BERTlarge outperformed its smaller sibling in accuracy tests.

To know in detail how the encoder-decoder architecture works in transformers, it is worth
reading an introduction to using Transformers and the Hugging Face library.

Fig 15

Pre-training and fine-tuning

Transformers are trained from scratch on a huge corpus of data, following a time-consuming
and expensive process.

To optimize the training process, Google developed new hardware, the so-called TPU (Tensor
Processing Unit), specifically designed for machine learning tasks.

To avoid unnecessary and costly iterations in the training process, Google researchers
used transfer learning techniques to separate the (pre)training phase from the fine-tuning
phase. This allows developers to choose pre-trained models, refine the input-output pair data
of the target task, and retrain the head of the pre-trained model by using domain-specific data.
This feature is what makes LLMs like BERT the foundation model of endless applications built on
top of them.

The role of Masked Language Modelling in BERT’s processing

The key element to achieving bidirectional learning in BERT (and every LLM based on
transformers) is the attention mechanism, combined with masked language modeling (MLM).
By masking a word in a sentence, this technique forces the model to analyze
the remaining words in both directions in the sentence to increase the chances of predicting
the masked word. MLM is based on techniques already tried in the field of computer vision, and
it's great for tasks that require a good contextual understanding of an entire sequence.

BERT was the first LLM to apply this technique. In particular, a random 15% of the tokenized
words were masked during training. The result shows that BERT could predict the hidden words
with high accuracy.
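
As an illustration, masked word prediction with a pre-trained BERT can be tried through the Hugging Face transformers library. This is a minimal sketch assuming the transformers package is installed and using the standard bert-base-uncased checkpoint:

from transformers import pipeline

# Load a pre-trained BERT and ask it to fill in the masked token.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT looks at the words on both sides of [MASK] when predicting it.
predictions = unmasker("The capital of France is [MASK].")
for p in predictions[:3]:
    print(p["token_str"], round(p["score"], 3))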

BERT’s Impact on NLP

Powered by transformers, BERT was able to achieve state-of-the-art results in multiple NLP
tasks. Here are some of the tests where BERT excels:

 Question answering. BERT has been one of the first transformer-powered chatbots, delivering
impressive results.

 Sentiment analysis. For example, BERT has been successful in predicting the positive or negative
sentiment of movie reviews.

 Text generation. A precursor of next-generation chatbots, BERT was already able to create long
texts with simple prompts.

 Summarizing text. Equally, BERT was able to read and summarize texts from complex domains,
including law and healthcare.
 Language translation. BERT has been trained on data written in multiple languages. That makes
it a multilingual model, which translates into great suitability for language translation.

 Autocomplete tasks. BERT can be used for autocomplete tasks, for example, in emails or
messaging services.

BERT model:

BERT stands for Bidirectional Encoder Representations from Transformers. We’ve already
discussed how bidirectional pre-training with MLMs enables BERT to function, so let’s cover the
remaining letters in the acronym to get a better understanding of its architecture.

Encoder Representations: Encoders are neural network components that translate input data
into representations that are easier for machine learning algorithms to process. Once an
encoder reads input text, it generates a hidden state vector. Hidden state vectors are like lists
of values and internal parameters that provide additional context. This packaged
representation of information is then passed on to the transformer.

Transformer: The transformer uses the information above to infer patterns or make
predictions. A transformer is a deep learning architecture that transforms an input into another
type of output. Nearly all NLP applications use transformers. If you've ever used ChatGPT,
you've seen transformer architecture in action. Typically, transformers consist of an encoder
and a decoder. However, BERT uses only the encoder part of the transformer.

The BERT language model

BERT is widely used in AI for language processing pre-training. For example, it can be used to
discern context for better results in search queries. BERT outperforms many other architectures
in a variety of token-level and sentence-level NLP tasks:

 Token-level task examples. Tokens refer to labels that are assigned to specific and
semantically meaningful groups of characters, like words. Examples of token-level tasks
include part of speech (POS) tagging and named entity recognition (NER).
 Sentence-level task examples. Processing each token or word and discerning the
context from surrounding words can be computationally exhausting for some NLP tasks.
Examples of sentence-level tasks include semantic search and sentiment analysis.

BERT model applications

From industry to industry, BERT is being fine-tuned for specific needs. Here are a few examples
of specialized pre-trained BERT models:

 bioBERT: Used for biomedical text mining, bioBERT is a pre-trained biomedical language
representation model.

 SciBERT: Similar to bioBERT, this model is pre-trained on a wide range of high-quality
scientific publications to perform downstream tasks in a variety of scientific domains.

 patentBERT: This BERT model version is used to perform patent classification.

 VideoBERT: VideoBERT is a visual-linguistic model used to leverage the abundance of
unlabeled data on platforms such as YouTube.

 FinBERT: General-purpose models struggle to conduct financial sentiment analysis due
to the field's specialized language. This BERT model is pre-trained on financial texts to
perform NLP tasks in the domain.

Decoder-only LLMs:

Decoder-only large language models (LLMs) are designed primarily for text generation tasks.
They utilize an autoregressive architecture, predicting the next token based on previously
generated tokens. This makes them particularly effective for generating coherent and
contextually relevant text. Most modern LLMs, like the GPT series, are built on this architecture.

The Decoder WorkFlow

The decoder's role centers on crafting text sequences. Mirroring the encoder, the decoder is
equipped with a similar set of sub-layers. It boasts two multi-headed attention layers, a
pointwise feed-forward layer, and incorporates both residual connections and layer
normalization after each sub-layer.

Fig 16

Global structure of the Decoder.

These components function in a way akin to the encoder's layers, yet with a twist: each multi-
headed attention layer in the decoder has its unique mission.

The final part of the decoder's process involves a linear layer, serving as a classifier, topped off
with a softmax function to calculate the probabilities of different words.
The Transformer decoder has a structure specifically designed to generate this output by
decoding the encoded information step by step.

It is important to notice that the decoder operates in an autoregressive manner, kickstarting its
process with a start token. It cleverly uses a list of previously generated outputs as its inputs, in
tandem with the outputs from the encoder that are rich with attention information from the
initial input.

This sequential dance of decoding continues until the decoder reaches a pivotal moment: the
generation of a token that signals the end of its output creation.

STEP 1 - Output Embeddings

At the decoder's starting line, the process mirrors that of the encoder. Here, the input first
passes through an embedding layer.

STEP 2 - Positional Encoding

Following the embedding, and again just like in the encoder, the input passes through the
positional encoding layer. This sequence is designed to produce positional embeddings.

These positional embeddings are then channeled into the first multi-head attention layer of the
decoder, where the attention scores specific to the decoder’s input are meticulously computed.

STEP 3 - Stack of Decoder Layers

The decoder consists of a stack of identical layers (6 in the original Transformer model). Each
layer has three main sub-components:

STEP 3.1 Masked Self-Attention Mechanism

This is similar to the self-attention mechanism in the encoder but with a crucial difference: it
prevents positions from attending to subsequent positions, which means that each word in the
sequence isn't influenced by future tokens.
For instance, when the attention scores for the word "are" are being computed, it's important
that "are" doesn't get a peek at "you", which is a subsequent word in the sequence.

Fig 17
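
A minimal sketch of how this masking can be applied to the score matrix before the softmax, continuing the NumPy style used earlier (function names are ours):

import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Boolean mask that is True wherever position i would attend to a
    future position j > i."""
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

def mask_future_scores(scores: np.ndarray) -> np.ndarray:
    """Set scores for future positions to -inf so the softmax assigns them
    zero attention weight."""
    mask = causal_mask(scores.shape[-1])
    return np.where(mask, -np.inf, scores)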

STEP 3.2 - Encoder-Decoder Multi-Head Attention or Cross-Attention

In the second multi-headed attention layer of the decoder, we see a unique interplay between
the encoder's and decoder's components. Here, the outputs from the encoder take on the roles
of the keys and values, while the outputs from the first (masked) multi-headed attention layer of
the decoder serve as the queries.

This setup effectively aligns the encoder's input with the decoder's, empowering the decoder to
identify and emphasize the most relevant parts of the encoder's input.

Following this, the output from this second layer of multi-headed attention is then refined
through a pointwise feedforward layer, enhancing the processing further.
Fig 18

In this sub-layer, the queries come from the previous decoder layer, and the keys and values
come from the output of the encoder. This allows every position in the decoder to attend over
all positions in the input sequence, effectively integrating information from the encoder with
the information in the decoder.

STEP 3.3 - Feed-Forward Neural Network

Similar to the encoder, each decoder layer includes a fully connected feed-forward network,
applied to each position separately and identically.

STEP 4 Linear Classifier and Softmax for Generating Output Probabilities

The journey of data through the transformer model culminates in its passage through a final
linear layer, which functions as a classifier.

The size of this classifier corresponds to the total number of classes involved (number of words
contained in the vocabulary). For instance, in a scenario with 1000 distinct classes representing
1000 different words, the classifier's output will be an array with 1000 elements.
This output is then introduced to a softmax layer, which transforms it into a range of probability
scores, each lying between 0 and 1. The highest of these probability scores is key: its
corresponding index directly points to the word that the model predicts as the next in the
sequence.

Fig 19
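
A sketch of this final step, going from the decoder's output vector for the last position to the predicted next token (greedy selection and the variable names are illustrative choices):

import numpy as np

def predict_next_token(decoder_output, W_vocab, b_vocab):
    """Linear classifier over the vocabulary followed by a softmax.
    decoder_output: (d_model,); W_vocab: (d_model, vocab_size); b_vocab: (vocab_size,)."""
    logits = decoder_output @ W_vocab + b_vocab
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()                            # probabilities between 0 and 1
    return int(np.argmax(probs)), probs                    # index of the most likely word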

Normalization and Residual Connections

Each sub-layer (masked self-attention, encoder-decoder attention, feed-forward network) is
followed by a normalization step, and each also includes a residual connection around it.

Output of the Decoder

The final layer's output is transformed into a predicted sequence, typically through a linear
layer followed by a softmax to generate probabilities over the vocabulary.

The decoder, in its operational flow, incorporates the freshly generated output into its growing
list of inputs, and then proceeds with the decoding process. This cycle repeats until the model
predicts a specific token, signaling completion.

The token predicted with the highest probability is assigned as the concluding class, often
represented by the end token.
Again remember that the decoder isn't limited to a single layer. It can be structured with N
layers, each one building upon the input received from the encoder and its preceding layers.
This layered architecture allows the model to diversify its focus and extract varying attention
patterns across its attention heads.

Such a multi-layered approach can significantly enhance the model’s ability to predict, as it
develops a more nuanced understanding of different attention combinations.

And the final architecture looks like this (from the original paper):

Fig 20

Core Architecture

 Transformer Blocks: Decoder-only LLMs consist of multiple stacked transformer blocks,
each containing:

 Masked Multi-Headed Self-Attention: This allows the model to focus on
previous tokens while generating the next token, ensuring that future tokens are
not considered.

 Feed-Forward Networks: Each token's representation is transformed through a
feed-forward neural network, enhancing the model's ability to learn complex
patterns.

 Residual Connections: These connections help mitigate issues like vanishing
gradients, allowing for more stable training.

 Layer Normalization: Applied to stabilize and accelerate training by normalizing
the inputs to each layer.

Input Processing

 Tokenization: The input text is tokenized into discrete tokens using algorithms like Byte-
Pair Encoding (BPE). Each token is then mapped to a vector in an embedding layer.

 Positional Embeddings: Since the model lacks recurrence or convolution, positional
information is injected into the model using either absolute or relative positional
embeddings. Techniques like Rotary Positional Embeddings (RoPE) and Attention with
Linear Biases (ALiBi) have been developed to improve the model's ability to generalize
to longer sequences.

Training and Inference

 Next Token Prediction: During training, the model learns to predict the next token in a
sequence using a cross-entropy loss function. This is achieved by passing the output
token vectors through a classification head that generates a probability distribution over
the vocabulary.

 Sampling Techniques: During inference, various sampling methods (e.g., greedy
selection, nucleus sampling) can be employed to generate text based on the predicted
probabilities, as sketched below.
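
A minimal sketch of nucleus (top-p) sampling over a probability distribution produced by the model; the 0.9 threshold and the helper name are illustrative assumptions.

import numpy as np

def nucleus_sample(probs: np.ndarray, p: float = 0.9) -> int:
    """Keep the smallest set of tokens whose cumulative probability exceeds p,
    renormalize within that set, and sample from it."""
    order = np.argsort(probs)[::-1]                        # tokens sorted by probability
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1       # size of the nucleus
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(np.random.choice(nucleus, p=nucleus_probs))

# Greedy selection, by contrast, simply takes int(np.argmax(probs)).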

Recent Innovations

 FlashAttention: This technique optimizes the self-attention mechanism for better
efficiency, allowing for faster training and the ability to handle longer context lengths.

 Multi-Query Attention: This approach shares key and value projections across attention
heads, improving inference speed while maintaining performance.

 Grouped Query Attention: A compromise between multi-headed and multi-query
attention, this method groups attention heads to share projections, balancing efficiency
and performance.

GPT:

Generative Pre-trained Transformers, commonly known as GPT, are a family of neural network
models that use the transformer architecture and are a key advancement in artificial intelligence
(AI), powering generative AI applications such as ChatGPT. GPT models give applications the
ability to create human-like text and content (images, music, and more), and answer questions
in a conversational manner. Organizations across industries are using GPT models
and generative AI for Q&A bots, text summarization, content generation, and search.

The GPT models, and in particular, the transformer architecture that they use, represent a
significant AI research breakthrough. The rise of GPT models is an inflection point in the
widespread adoption of ML because the technology can be used now to automate and improve
a wide set of tasks ranging from language translation and document summarization to writing
blog posts, building websites, designing visuals, making animations, writing code, researching
complex topics, and even composing poems. The value of these models lies in their speed and
the scale at which they can operate. For example, where you might need several hours to
research, write, and edit an article on nuclear physics, a GPT model can produce one in
seconds. GPT models have sparked the research in AI towards achieving artificial general
intelligence, which means machines can help organizations reach new levels of productivity and
reinvent their applications and customer experiences.

The use cases of GPT

The GPT models are general-purpose language models that can perform a broad range of tasks,
from creating original content to writing code, summarizing text, and extracting data from
documents.

Here are some ways you can use the GPT models:

Create social media content

Digital marketers, assisted by artificial intelligence (AI), can create content for their social media
campaigns. For example, marketers can prompt a GPT model to produce an explainer video
script. GPT-powered image processing software can create memes, videos, marketing copy, and
other content from text instructions.

Convert text to different styles

GPT models generate text in casual, humorous, professional, and other styles. The models allow
business professionals to rewrite a particular text in a different form. For example, lawyers can
use a GPT model to turn legal copies into simple explanatory notes.

Write and learn code

As language models, the GPT models can understand and write computer code in different
programming languages. The models can help learners by explaining computer programs to
them in everyday language. Also, experienced developers can use GPT tools to autosuggest
relevant code snippets.

Analyze data

The GPT model can help business analysts efficiently compile large volumes of data. The
language models search for the required data and calculate and display the results in a data
table or spreadsheet. Some applications can plot the results on a chart or create
comprehensive reports.

Produce learning materials

Educators can use GPT-based software to generate learning materials such as quizzes and
tutorials. Similarly, they can use GPT models to evaluate the answers.

Build interactive voice assistants

The GPT models allow you to build intelligent interactive voice assistants. While many chatbots
only respond to basic verbal prompts, the GPT models can
produce chatbots with conversational AI capabilities. In addition, these chatbots can converse
verbally like humans when paired with other AI technologies.

How does GPT work?

Though it’s accurate to describe the GPT models as artificial intelligence (AI), this is a broad
description. More specifically, the GPT models are neural network-based language prediction
models built on the Transformer architecture. They analyze natural language queries, known as
prompts, and predict the best possible response based on their understanding of language.

To do that, the GPT models rely on the knowledge they gain after they’re trained with hundreds
of billions of parameters on massive language datasets. They can take input context into
account and dynamically attend to different parts of the input, making them capable of
generating long responses, not just the next word in a sequence. For example, when asked to
generate a piece of Shakespeare-inspired content, a GPT model does so by remembering and
reconstructing new phrases and entire sentences with a similar literary style.
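
As a small illustration of this prompt-to-text prediction, an openly available GPT-style model can be run through the Hugging Face transformers library. This is a sketch assuming the library is installed; gpt2 is used here as a freely available stand-in for larger GPT models:

from transformers import pipeline

# Load an openly available decoder-only (GPT-style) model.
generator = pipeline("text-generation", model="gpt2")

# The model continues the prompt token by token, attending to the context so far.
result = generator("In a distant galaxy, scientists discovered",
                   max_new_tokens=40, do_sample=True, top_p=0.9)
print(result[0]["generated_text"])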

What are examples of some applications that use GPT?

Since its launch, the GPT models have brought artificial intelligence (AI) to numerous
applications in various industries. Here are some examples:

 GPT models can be used to analyze customer feedback and summarize it in easily
understandable text. First, you can collect customer sentiment data from sources like
surveys, reviews, and live chats, then you can ask a GPT model to summarize the data.

 GPT models can be used to enable virtual characters to converse naturally with human
players in virtual reality.
 GPT models can be used to provide a better search experience for help desk personnel.
They can query the product knowledge base with conversational language to retrieve
relevant product information.

Llama:

Llama refers to a series of large language models developed by Meta AI, designed for various
generative AI applications. These models, including the latest Llama 3.2, are open-source and
can be fine-tuned for specific tasks. They support a wide range of use cases, from natural
language processing to multimodal applications. Here are some key points about Llama in
generative AI:

Overview of Llama Models

 Development: Llama (Large Language Model Meta AI) was first released in February
2023, with the latest version, Llama 3.2, launched in September 2024.

 Model Sizes: Llama models are available in various sizes, ranging from 1 billion to 405
billion parameters, allowing for flexibility in deployment based on computational
resources.

 Open Source: The model weights for Llama are available under licenses that permit
some commercial use, making it accessible for developers and researchers.

Features of Llama 3.2

 Multimodal Capabilities: Llama 3.2 supports both text and image inputs, enabling
applications that require image reasoning and analysis.

 Instruction Tuning: The models are instruction-tuned, enhancing their performance in
dialogue and interactive applications.

 On-Device Use: Smaller models (1B and 3B) are optimized for on-device applications,
making them suitable for mobile and edge computing scenarios.
Applications and Use Cases

 Natural Language Processing: Llama models excel in tasks such as text generation,
summarization, and translation.

 Multimodal Applications: The ability to process images alongside text allows for
innovative applications in areas like visual question answering and content generation.

 Integration in Platforms: Llama has been integrated into platforms like Facebook and
WhatsApp, enhancing user interactions with AI-driven features.

Future Developments

 Ongoing Research: Meta AI continues to explore advancements in Llama models,
including plans for future versions (Llama 5, 6, and 7) and improvements in multilingual
and multimodal capabilities.

 Safety and Ethical Considerations: The release of Llama models has sparked discussions
about the potential misuse of open-weight models, prompting ongoing research into
safety measures and ethical guidelines.

Llama represents a significant advancement in generative AI, providing powerful tools for
developers and researchers while also raising important considerations regarding safety and
responsible use.

MoE-based Architecture:

The Mixture of Experts (MoE) architecture is a neural network design that improves efficiency
and performance by dynamically activating a subset of specialized networks, called experts, for
each input. A gating network determines which experts to activate, leading to sparse activation
and reduced computational cost. MoE architecture consists of two critical components: the
gating network and the experts.

At its heart, the MoE architecture functions like an efficient traffic system, directing each
vehicle – or in this case, data – to the best route based on real-time conditions and the desired
destination. Each task is routed to the most suitable expert, or sub-model, specialized in
handling that particular task. This dynamic routing ensures that the most capable resources are
employed for each task, enhancing the overall efficiency and effectiveness of the model. The
MoE architecture takes advantage of all three ways to improve a model's fidelity:

 By implementing multiple experts, MoE inherently increases the model's
parameter size by adding more parameters per expert.

 MoE changes the classic neural network architecture, incorporating a gating
network to determine which experts to employ for a designated task.

 Every AI model has some degree of fine-tuning; in an MoE, every expert is fine-
tuned to perform as intended, an added layer of tuning that traditional models could not
take advantage of.

MoE Gating Network

The gating network acts as the decision-maker or controller within the MoE model. It evaluates
incoming tasks and determines which expert is suited to handle them. This decision is typically
based on learned weights, which are adjusted over time through training, further improving its
ability to match tasks with experts. The gating network can employ various strategies, from
probabilistic methods where soft assignments are tasked to multiple experts, to deterministic
methods that route each task to a single expert.
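
A minimal sketch of how a gating network can route a single token vector to its top-k experts and combine their outputs; all names and the top-k choice are illustrative, and real MoE layers typically add a load-balancing term to the loss so experts are used evenly.

import numpy as np

def moe_layer(x, gate_W, experts, top_k=2):
    """Sparse MoE layer for one token. x: (d_model,); gate_W: (d_model, num_experts);
    experts: list of callables, each a small feed-forward 'expert' network."""
    gate_logits = x @ gate_W
    gate_probs = np.exp(gate_logits - gate_logits.max())
    gate_probs = gate_probs / gate_probs.sum()             # softmax over experts
    chosen = np.argsort(gate_probs)[::-1][:top_k]          # sparse activation: top-k experts
    weights = gate_probs[chosen] / gate_probs[chosen].sum()
    # The layer output is the weighted combination of the selected experts' outputs.
    return sum(w * experts[i](x) for w, i in zip(weights, chosen))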

MoE Experts

Each expert in the MoE model represents a smaller neural network, machine learning model, or
LLM optimized for a specific subset of the problem domain. For example, in Mistral, different
experts might specialize in understanding certain languages, dialects, or even types of queries.
The specialization ensures each expert is proficient in its niche, which, when combined with the
contributions of other experts, will lead to superior performance across a wide array of tasks.
MoE Loss Function

Although not considered a main component of the MoE architecture, the loss function plays a
pivotal role in the future performance of the model, as it’s designed to optimize both the
individual experts and the gating network.

It typically combines the losses computed for each expert which are weighted by the probability
or significance assigned to them by the gating network. This helps to fine-tune the experts for
their specific tasks while adjusting the gating network to improve routing accuracy.

Fig 21

The MoE Process Start to Finish

Here's a summarized explanation of how the routing process works from start to finish:

 Input Processing: Initial handling of incoming data. Mainly our Prompt in the case of LLMs.
 Feature Extraction: Transforming raw input for analysis.
 Gating Network Evaluation: Assessing expert suitability via probabilities or weights.
 Weighted Routing: Allocating input based on computed weights. Here, the process of
choosing the most suitable LLM is completed. In some cases, multiple LLMs are chosen to
answer a single input.
 Task Execution: Processing allocated input by each expert.
 Integration of Expert Outputs: Combining individual expert results for final output.
 Feedback and Adaptation: Using performance feedback to improve models.
 Iterative Optimization: Continuous refinement of routing and model parameters.

Popular Models that Utilize an MoE Architecture

 OpenAI's GPT-4 and GPT-4o: GPT-4 and GPT-4o power the premium version of ChatGPT.
These multi-modal models utilize MoE to be able to ingest different source mediums like
images, text, and voice. It is rumored, though not officially confirmed, that GPT-4 has 8 experts,
each with 220 billion parameters, bringing the entire model to over 1.7 trillion parameters.

 Mistral AI's Mixtral 8x7B: Mistral AI delivers very strong open-source AI models and has
described Mixtral as an sMoE, or sparse Mixture of Experts, model delivered in a
small package. Mixtral 8x7B has a total of 46.7 billion parameters but only uses 12.9 billion
parameters per token, thus processing inputs and outputs at that cost. Their MoE model
consistently outperforms Llama 2 (70B) and GPT-3.5 (175B) while costing less to run.

The Benefits of MoE and Why It's the Preferred Architecture

Ultimately, the main goal of MoE architecture is to present a paradigm shift in how complex
machine learning tasks are approached. It offers unique benefits and demonstrates its
superiority over traditional models in several ways.

 Enhanced Model Scalability


o Each expert is responsible for a part of a task, therefore scaling by adding experts won't
incur a proportional increase in computational demands.
o This modular approach can handle larger and more diverse datasets and facilitates
parallel processing, speeding up operations. For instance, adding an image recognition
model to a text-based model can integrate an additional LLM expert for interpreting
pictures while still being able to output text.
o Versatility allows the model to expand its capabilities across different types of data
inputs.
 Improved Efficiency and Flexibility
o MoE models are extremely efficient, selectively engaging only necessary experts for
specific inputs, unlike conventional architectures that use all their parameters
regardless.
o The architecture reduces the computational load per inference, allowing the model to
adapt to varying data types and specialized tasks.
 Specialization and Accuracy:
o Each expert in an MoE system can be finely tuned to specific aspects of the overall
problem, leading to greater expertise and accuracy in those areas.
o Specialization like this is helpful in fields like medical imaging or financial forecasting,
where precision is key.
o MoE can generate better results in narrow domains due to its nuanced
understanding, detailed knowledge, and the ability to outperform generalist models on
specialized tasks.

Significance of decoder-only LLMs over encoder-only LLMs:

Decoder-only language models (LLMs) are significant primarily for their efficiency in generating
text. They are designed to predict the next token in a sequence based solely on previous
tokens, making them particularly effective for tasks like text generation, summarization, and
question-answering. This architecture allows them to focus on the generative aspect of
language processing without the need for an encoder component, which can complicate the
model and increase computational costs. Here are some key points regarding their significance:
1. Simplified Architecture

 Decoder-only models, such as the GPT series, utilize a simpler architecture that focuses
solely on generating text.

 This simplicity often leads to faster training times and lower computational costs
compared to encoder-decoder models.

2. Autoregressive Nature

 These models are autoregressive, meaning they generate text one token at a time, using
previously generated tokens as context.

 This allows for coherent and contextually relevant text generation, making them
suitable for a wide range of applications.

3. Strong Zero-Shot Generalization

 Decoder-only models have demonstrated strong zero-shot generalization capabilities,
meaning they can perform well on tasks they were not explicitly trained for.

 This is particularly beneficial in real-world applications where labeled data may be
scarce.

4. Emergent Abilities

 As these models scale in size and complexity, they exhibit emergent abilities, allowing
them to perform complex reasoning and understand nuanced language tasks.

 This includes capabilities like summarization, translation, and even sentiment analysis,
despite being primarily designed for text generation.

5. Efficient Use of Data

 Decoder-only models can be trained on vast amounts of unlabeled data, leveraging self-
supervised learning techniques.
 This enables them to learn rich representations of language without the need for
extensive labeled datasets.

6. Flexibility in Task Performance

 While encoder-only models excel in tasks like classification and understanding, decoder-
only models can adapt to various generative tasks, making them versatile tools in
natural language processing.

Natural Language Processing (NLP):

NLP stands for Natural Language Processing, which lies at the intersection of Computer Science,
human language, and Artificial Intelligence. It is the technology used by machines to understand,
analyse, manipulate, and interpret human languages. It helps developers to organize
knowledge for performing tasks such as translation, automatic summarization, Named Entity
Recognition (NER), speech recognition, relationship extraction, and topic segmentation.

Fig 22

History of NLP:

(1940-1960) - Focused on Machine Translation (MT)

Natural Language Processing started in the 1940s.

1948 - In 1948, the first recognisable NLP application was introduced at Birkbeck
College, London.

1950s - In the 1950s, there was a conflicting view between linguistics and computer
science. Chomsky then developed his first book, Syntactic Structures, and claimed that language
is generative in nature.

In 1957, Chomsky also introduced the idea of Generative Grammar, which gives rule-based
descriptions of syntactic structures.

(1960-1980) - Flavored with Artificial Intelligence (AI)

In the year 1960 to 1980, the key developments were:

Augmented Transition Networks (ATN)

An Augmented Transition Network is a finite state machine that is capable of recognizing regular
languages.

Case Grammar

Case Grammar was developed by Linguist Charles J. Fillmore in the year 1968. Case Grammar
uses languages such as English to express the relationship between nouns and verbs by using
the preposition.

In Case Grammar, case roles can be defined to link certain kinds of verbs and objects.

For example: "Neha broke the mirror with the hammer". In this example, case grammar identifies
Neha as an agent, the mirror as a theme, and the hammer as an instrument.

In the year 1960 to 1980, key systems were:


SHRDLU

SHRDLU is a program written by Terry Winograd in 1968-70. It lets users communicate with
the computer and move objects. It can handle instructions such as "pick up the green ball"
and also answer questions like "What is inside the black box?" The main importance of
SHRDLU is that it showed that syntax, semantics, and reasoning about the world can be
combined to produce a system that understands a natural language.

LUNAR

LUNAR is a classic example of a Natural Language database interface system that used
ATNs and Woods' Procedural Semantics. It was capable of translating elaborate natural
language expressions into database queries and handled 78% of requests without errors.

1980 - Current

Till the year 1980, natural language processing systems were based on complex sets of hand-
written rules. After 1980, NLP introduced machine learning algorithms for language processing.

In the beginning of the 1990s, NLP started growing faster and achieved good
accuracy, especially for English grammar. In the 1990s, electronic text corpora were also
introduced, which provided a good resource for training and evaluating natural language
programs. Other factors include the availability of computers with fast CPUs and more memory.
The major factor behind the advancement of natural language processing was the Internet.

Now, modern NLP consists of various applications, like speech recognition, machine
translation, and machine text reading. When we combine all these applications, we allow
artificial intelligence to gain knowledge of the world. Consider the example of
Amazon Alexa: you can ask Alexa a question, and it will reply to you.
Advantages of NLP

 NLP helps users to ask questions about any subject and get a direct response within
seconds.
 NLP offers exact answers to questions; it does not offer unnecessary or
unwanted information.
 NLP helps computers to communicate with humans in their own languages.
 It is very time efficient.
 Most companies use NLP to improve the efficiency of documentation processes and the
accuracy of documentation, and to identify information in large databases.

Disadvantages of NLP

 A list of disadvantages of NLP is given below:

 NLP may not show context.
 NLP is unpredictable.
 NLP may require more keystrokes.
 NLP is unable to adapt to new domains, and it has limited functionality; that is why NLP is
built for a single, specific task only.

Components of NLP

There are the following two components of NLP -

1. Natural Language Understanding (NLU)

Natural Language Understanding (NLU) helps the machine to understand and analyse human
language by extracting the metadata from content such as concepts, entities, keywords,
emotion, relations, and semantic roles.
NLU is mainly used in business applications to understand the customer's problem in both spoken
and written language.

 NLU involves the following tasks -


 It is used to map the given input into useful representation.
 It is used to analyze different aspects of the language.

2. Natural Language Generation (NLG)

Natural Language Generation (NLG) acts as a translator that converts the computerized data
into natural language representation. It mainly involves Text planning, Sentence planning, and
Text Realization.

Natural Language Processing

To understand the role of generative AI in NLP, you first need to know what natural language
processing is. NLP is a branch of AI that empowers machines to comprehend, interpret, and
respond to human language. It involves the development of algorithms and models capable of
understanding the nuances of natural language, enabling seamless interaction between
humans and machines. NLP is not merely about recognizing words; it delves into the intricacies
of syntax, semantics, and pragmatics to grasp the full spectrum of human communication.

Discussing NLP in the Context of Generative AI

When we intertwine NLP with Generative AI, a new dimension unfolds. NLP algorithms
enhanced by Generative AI can not only understand language but also generate human-like
responses, opening doors to more nuanced and context-aware interactions.

The collaborative approach of Generative AI and NLP allows machines to transcend basic
language processing. It empowers systems to not only comprehend the explicit meaning of
words but also to grasp the underlying context, sentiment, and even subtle nuances that define
human communication. This contextual understanding is a game-changer, as it enables
machines to respond in a manner that goes beyond scripted replies.
Fig 23

The Role of Generative AI in Natural Language Processing (NLP)

* Language Generation:

Generative AI contributes significantly to NLP by enabling machines to create coherent and
contextually relevant language. This goes beyond simple text generation; it involves the
synthesis of language that aligns with the context and purpose of the communication. This
capability is particularly valuable in applications such as content creation, where generating
engaging and personalized language is paramount.
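
As a minimal, hedged illustration of language generation, the snippet below uses the Hugging Face transformers pipeline with the small GPT-2 checkpoint (chosen only because it is freely available) to continue a marketing-style prompt; any decoder-only model could be substituted.

```python
# Minimal text-generation sketch (assumes: pip install transformers torch).
from transformers import pipeline

# GPT-2 is used here only because it is small and freely available;
# any decoder-only LLM checkpoint could be swapped in.
generator = pipeline("text-generation", model="gpt2")

prompt = "Our new wireless headphones are designed for"
outputs = generator(prompt, max_new_tokens=30, num_return_sequences=1)

print(outputs[0]["generated_text"])
```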

* Contextual Understanding:

Through extensive training, Generative AI equips NLP models with the ability to grasp the
subtleties of context, leading to more accurate and context-aware language processing. This is
crucial in scenarios where the meaning of a statement can vary based on the surrounding
context. Whether it’s understanding humor, sarcasm, or cultural references, the amalgamation
of Generative AI and NLP enables machines to navigate the intricacies of human
communication.
* Conversational Agents:

Generative AI powers the development of sophisticated chatbots and virtual assistants,
elevating the conversational experience by providing more natural and human-like interactions.
These conversational agents are not confined to scripted responses; they can dynamically adapt
to user input, making interactions more engaging and effective. This has significant implications
for customer support, where an empathetic and context-aware response can enhance user
satisfaction.

* Data Augmentation:

By generating synthetic data, Generative AI aids NLP models in overcoming data scarcity issues,
enhancing their training and performance. In NLP, the quality and diversity of training data play
a crucial role. Generative AI can generate additional training samples, simulating a broader
range of linguistic scenarios. This, in turn, improves the robustness of NLP models, making them
more adept at handling real-world variations in language use.
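
One simple way to picture data augmentation is synonym replacement: each training sentence is copied with a few words swapped for WordNet synonyms, increasing linguistic variety. The hedged sketch below uses NLTK's WordNet interface; it is only one of many augmentation strategies (back-translation and LLM-based paraphrasing are common alternatives), and the sample sentence is made up for illustration.

```python
# Synonym-replacement data augmentation sketch (assumes: pip install nltk,
# plus nltk.download("wordnet") on first use).
import random
from nltk.corpus import wordnet

def augment(sentence: str, n_swaps: int = 2) -> str:
    """Return a copy of the sentence with up to n_swaps words replaced by synonyms."""
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
    random.shuffle(candidates)
    for i in candidates[:n_swaps]:
        synonyms = {
            lemma.name().replace("_", " ")
            for syn in wordnet.synsets(words[i])
            for lemma in syn.lemmas()
            if lemma.name().lower() != words[i].lower()
        }
        if synonyms:
            words[i] = random.choice(sorted(synonyms))
    return " ".join(words)

original = "The delivery was quick and the product quality is excellent"
print(augment(original))  # e.g. "The bringing was quick and the product quality is first-class"
```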

* Multilingual Capabilities:

Generative AI enables NLP models to comprehend and generate content in multiple
languages, breaking down linguistic barriers and fostering global communication. The ability to
seamlessly switch between languages and maintain linguistic accuracy is a testament to the
versatility that Generative AI brings to NLP. This is particularly valuable in a connected world
where cross-cultural communication is increasingly prevalent.

* Creative Content Creation:

The integration of Generative AI in NLP allows for the creation of creative and engaging
content, from marketing materials to personalized recommendations. This isn’t limited to
generating text; it extends to the synthesis of diverse media types, including images and videos.
In marketing, for example, AI in Digital Marketing can assist in crafting compelling ad copy or
generating visually appealing content tailored to specific audiences.
Impact of Generative AI in NLP on Business

The infusion of Generative AI into NLP has transformative implications for businesses. The
impact extends across various facets of operations, bringing about efficiency gains, improved
user experiences, and new possibilities.

From a customer-centric perspective, businesses witness a notable enhancement in customer
interactions. Advanced chatbots, powered by Generative AI and NLP, can understand user
queries in context, providing more accurate and personalized responses. This not only improves
customer satisfaction but also contributes to the overall brand image by showcasing a
commitment to cutting-edge technology.

Moreover, the automation of content creation becomes a reality with Generative AI and NLP.
Businesses can leverage these technologies to generate marketing materials, product
descriptions, and other content at scale. This not only saves time and resources but also
ensures consistency in messaging across different channels.

The impact isn't confined to customer-facing applications. Internally, Generative AI in NLP
streamlines processes by automating routine tasks that involve language processing. Whether
it’s drafting emails, summarizing documents, or extracting insights from large volumes of text
data, the collaboration between Generative AI and NLP amplifies the efficiency of knowledge
workers.

In the realm of data analytics, the ability of Generative AI to augment data sets contributes to
more robust and accurate models. This is particularly valuable in industries where data is scarce
or where generating real-world data for training purposes is challenging. NLP models trained on
diverse and representative data sets exhibit improved performance in understanding and
processing language, leading to better-informed decision-making.

From a strategic standpoint, businesses embracing Generative AI in NLP gain a competitive
edge. The ability to harness the power of language generation and understanding opens
avenues for innovation in product development, marketing strategies, and customer
engagement. It positions companies as pioneers in leveraging cutting-edge technology to stay
ahead in an increasingly competitive landscape.

Translation in NLP:

NLP—natural language processing—is an emerging AI field that trains computers to understand
human languages. NLP uses machine learning algorithms to gain knowledge and get smarter
every day. And it’s built into tools that many of us use daily, from spell checkers to smart
speakers.
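
As a hedged, minimal example of NLP-based translation, the snippet below loads a pretrained MarianMT checkpoint through the Hugging Face pipeline API; the specific English-to-French model name is an illustrative choice, and other language pairs work the same way.

```python
# Minimal machine-translation sketch (assumes: pip install transformers sentencepiece torch).
from transformers import pipeline

# Helsinki-NLP/opus-mt-en-fr is a small open English-to-French model,
# chosen here only for illustration.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

result = translator("How are you today?")
print(result[0]["translation_text"])  # e.g. "Comment allez-vous aujourd'hui ?"
```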

Uses of NLP

Natural language processing is a billion-dollar industry for a reason. Some researchers believe
that NLP has the potential to revolutionize many industries, from healthcare to sales and
marketing.

NLP is the backbone of many popular tools that a lot of us use every day. Here are just a few of
them.

 Language Translation: AI language translation tools use NLP to translate written text or spoken words from one source language into a different target language.

 Virtual Assistants: Virtual assistants like Siri, Alexa, and Cortana use NLP to “speak” in full sentences and sound like humans. They also use NLP for speech recognition.

 Search Engines: Search engines like Google, Bing, Yahoo, Baidu, and Yandex use NLP to autofill search queries for you so you get the most relevant results.

 Writing Assistants: Tools like Grammarly and Microsoft Editor use NLP to check spelling, grammar, and syntax in text. They can also analyze the voice, tone, and formality of your writing.

 Autocomplete and Autocorrect: When you text a friend on your smartphone or send a message on LinkedIn, NLP provides predictive suggestions for you with autocomplete and autocorrect.

 Chatbots: Improvements in NLP have made it easier for companies to engage with customers and solve problems with AI-powered chatbots.

 Automatic Content Moderation: At times, social media can be notoriously toxic, but NLP is effective at performing sentiment analysis. This helps platforms automatically moderate their sites to identify and remove problematic content like bullying or slurs.

 Speech-to-Text and Text-to-Speech Tools: Speech-to-text tools use NLP to transcribe spoken content accurately and quickly, while text-to-speech tools perform the reverse, converting written text into spoken audio.

How NLP Works:

Natural language processing trains machines to learn and understand human language and
speech. NLP and ML (machine learning) are part of AI (Artificial Intelligence). Both subfields
share techniques, algorithms, and knowledge.

Fig 24

1: Preprocessing Data

To begin, NLP pre-processes texts to extract pertinent information. Here are just a few of
the data science techniques that work behind the scenes of NLP preprocessing; a short code
sketch covering several of these steps follows the list.

 Tokenization: Tokenization analyzes sentences to break them down into smaller building blocks like words, numbers, or symbols.
 Lemmatization and stemming: Lemmatization and stemming help AI understand the variants of a specific word. For example, if you search Amazon for shirts, Amazon should query its product database for both the plural and singular forms of the word “shirt.”
 Part of speech (POS) tagging: POS tagging is a machine learning technique that helps AI understand parts of speech like nouns, pronouns, verbs, adverbs, etc.
 Sentiment Analysis: This is used to scrutinize text to determine its emotional tone. For example, some companies use sentiment analysis to filter online reviews into positive or negative categories.
 Named Entity Recognition (NER): NER locates and classifies named entities in a body of text. NER can help identify and categorize names, organizations, locations, events, and dates.
 Text Summarization: This NLP technique can concisely summarize a text. Summarizing a long text can be very time-consuming for a human, but NLP can do it in seconds.
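
The hedged sketch below runs several of the preprocessing steps listed above (tokenization, lemmatization, POS tagging, and NER) over one sentence with spaCy; the sample sentence and model name are illustrative assumptions, and stemming and summarization are omitted to keep the example short.

```python
# Preprocessing sketch covering tokenization, lemmatization, POS tagging and NER
# (assumes: pip install spacy && python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple opened a new store in Mumbai last Friday.")

print("Tokens:  ", [token.text for token in doc])
print("Lemmas:  ", [token.lemma_ for token in doc])
print("POS tags:", [(token.text, token.pos_) for token in doc])
print("Entities:", [(ent.text, ent.label_) for ent in doc.ents])
```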

2: Analyzing and Classifying the Data with Algorithms

After the data is preprocessed, the NLP algorithm analyzes it. There are many different types of
algorithms used in NLP, but the following types are the most common:

 Rule-Based Systems: This type of system uses carefully crafted sets of linguistic rules and statistics to analyze data and complete tasks.
 Machine Learning Systems: Machine learning algorithms perform tasks based on their training data, and as they process more data they adjust their methods. For example, neural network models learn and improve their knowledge constantly, striving to mimic the neural networks in the human brain (a minimal sketch follows this list).
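
As a hedged sketch of a machine learning system, the snippet below trains a tiny Naive Bayes text classifier with scikit-learn to label messages as spam or ham; the handful of training examples is made up purely for illustration, whereas real systems learn from large labelled datasets.

```python
# Minimal machine-learning text classification sketch (assumes: pip install scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set; real systems use thousands of labelled examples.
texts = [
    "Win a free prize now, click this link",
    "Limited offer, claim your reward today",
    "Can we reschedule tomorrow's meeting?",
    "Please find the project report attached",
]
labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["Claim your free reward now"]))   # likely ['spam']
print(model.predict(["Let's meet about the report"]))  # likely ['ham']
```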

Applications of NLP
There are the following applications of NLP -

1. Question Answering

 Question Answering focuses on building systems that automatically answer the questions asked by humans in a natural language.
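
A minimal, hedged sketch of extractive question answering is shown below using the Hugging Face pipeline; the default SQuAD-tuned model the pipeline downloads and the sample context are assumptions of this example.

```python
# Minimal extractive question-answering sketch (assumes: pip install transformers torch).
from transformers import pipeline

qa = pipeline("question-answering")  # downloads a default SQuAD-tuned model

context = ("The Transformer architecture was introduced in 2017 and relies "
           "entirely on self-attention instead of recurrence.")
result = qa(question="When was the Transformer architecture introduced?", context=context)

print(result["answer"])  # e.g. "2017"
```
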
2. Spam Detection

 Spam detection is used to detect unwanted e-mails getting to a user's inbox.

Fig 25

3. Sentiment Analysis

 Sentiment Analysis is also known as opinion mining. It is used on the web to analyse the
attitude, behaviour, and emotional state of the sender. This application is implemented
through a combination of NLP (Natural Language Processing) and statistics by assigning
values to the text (positive, negative, or neutral) and identifying the mood of the context
(happy, sad, angry, etc.).

Fig 26
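
A minimal, hedged sentiment analysis sketch using the Hugging Face pipeline follows; the two sample reviews and the default English sentiment model are illustrative assumptions.

```python
# Minimal sentiment-analysis sketch (assumes: pip install transformers torch).
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")  # downloads a default English sentiment model

reviews = [
    "The delivery was fast and the product works perfectly.",
    "Terrible experience, the package arrived damaged.",
]
for review, result in zip(reviews, sentiment(reviews)):
    print(review, "->", result["label"], round(result["score"], 3))
```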

4. Machine Translation

 Machine translation is used to translate text or speech from one natural language to
another natural language.
5. Spelling correction

 Microsoft Corporation provides word-processor software such as MS Word and PowerPoint that uses NLP for spelling correction.

6. Speech Recognition

 Speech recognition is used for converting spoken words into text. It is used in applications such as mobile devices, home automation, video retrieval, dictating to Microsoft Word, voice biometrics, voice user interfaces, and so on.

7. Chatbot

 Implementing the Chatbot is one of the important applications of NLP. It is used by many companies to provide chat services to customers.

Named entity recognition (NER)

NER stands for Named Entity Recognition. It is an NLP subtask that involves identifying and
classifying named entities in text into predefined categories. These categories can include, but
are not limited to, names of persons, organizations, locations, expressions of time, quantities,
medical codes, monetary values, and percentages. Essentially, NER is the process of taking a
string of text (i.e., a sentence, paragraph or entire document) and identifying and classifying
the entities that refer to each category.

When the term “NER” was coined at the Sixth Message Understanding Conference (MUC-6),
the goal was to streamline information extraction tasks, which involved processing large
amounts of unstructured text and identifying key information. Since then, NER has expanded
and evolved, owing much of its evolution to advancements in machine learning and deep
learning techniques.
Named Entity Recognition (NER) Methods
Lexicon Based Method
This method uses a dictionary with a list of words or terms. The process involves checking
whether any of these words are present in a given text. However, this approach isn't commonly
used because it requires constant updating and careful maintenance of the dictionary to stay
accurate and effective.
Rule Based Method
The Rule Based NER method uses a set of predefined rules that guide the extraction of
information. These rules are based on patterns and context. Pattern-based rules focus on the
structure and form of words, looking at their morphological patterns. On the other hand,
context-based rules consider the surrounding words or the context in which a word appears
within the text document. This combination of pattern-based and context-based rules
enhances the precision of information extraction in Named Entity Recognition (NER).
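
The toy sketch below illustrates both ideas at once: a tiny lexicon (gazetteer) of place names plus a single pattern-based rule for dates. The word list and the regular expression are illustrative assumptions, not a production rule set.

```python
# Toy lexicon- and rule-based NER sketch; the tiny gazetteer and the date
# pattern below are made up for illustration only.
import re

LOCATION_LEXICON = {"chennai", "delhi", "mumbai", "london"}
DATE_PATTERN = re.compile(r"\b\d{1,2} (January|February|March|April|May|June|July|"
                          r"August|September|October|November|December)\b")

def tag_entities(text: str) -> list:
    entities = []
    # Lexicon-based rule: any token found in the gazetteer is a LOCATION.
    for token in re.findall(r"[A-Za-z]+", text):
        if token.lower() in LOCATION_LEXICON:
            entities.append((token, "LOCATION"))
    # Pattern-based rule: "<day> <month name>" is treated as a DATE.
    for match in DATE_PATTERN.finditer(text):
        entities.append((match.group(), "DATE"))
    return entities

print(tag_entities("The conference moves from Chennai to London on 12 March."))
# [('Chennai', 'LOCATION'), ('London', 'LOCATION'), ('12 March', 'DATE')]
```
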
Machine Learning-Based Method:
Multi-Class Classification with Machine Learning Algorithms
 One way is to train the model for multi-class classification using different machine learning algorithms, but this requires a lot of labelling. In addition to labelling, the model also requires a deep understanding of context to deal with the ambiguity of sentences, which makes this a challenging task for a simple machine learning algorithm.
Deep Learning Based Method
 A deep learning NER system is much more accurate than the previous methods because it can compose representations of words. This is due to the fact that it uses word embeddings, which capture the semantic and syntactic relationships between words.
 It is also able to learn to analyze topic-specific as well as high-level words automatically.
 This makes deep learning NER applicable to multiple tasks. Deep learning can do most of the repetitive work itself, so researchers, for example, can use their time more efficiently.
NER techniques:

According to a 2019 survey, about 64 percent of companies rely on structured data from
internal resources, but fewer than 18 percent are leveraging unstructured data and social
media comments to inform business decisions.

The organizations that do utilize NER for unstructured data extraction rely on a range of
approaches, but most fall into three broad categories: rule-based approaches, machine learning
approaches and hybrid approaches.

 Rule-based approaches involve creating a set of rules for the grammar of a language.
The rules are then used to identify entities in the text based on their structural and
grammatical features. These methods can be time-consuming and may not generalize
well to unseen data.
 Machine learning approaches involve training an AI-driven machine learning model on a
labeled dataset using algorithms like conditional random fields and maximum entropy
(two types of complex statistical language models). Techniques can range from
traditional machine learning methods (e.g., decision trees and support vector machines)
to more complex deep learning approaches, like recurrent neural networks (RNNs) and
transformers. These methods generalize better to unseen data, but they require a large
amount of labeled training data and can be computationally expensive.
 Hybrid approaches combine rule-based and machine learning methods to leverage the
strengths of both. They can use a rule-based system to quickly identify easy-to-
recognize entities and a machine learning system to identify more complex entities.
NER methodologies

Since the inception of NER, there have been some significant methodological advancements,
especially those that rely on deep learning-based techniques. Newer iterations include:

 Recurrent neural networks (RNNs) and long short-term memory (LSTM). RNNs are a
type of neural network designed for sequence prediction problems. LSTMs, a special
kind of RNN, can learn to recognize patterns over time and maintain information in
“memory” over long sequences, making them particularly useful for understanding
context and identifying entities.
 Conditional random fields (CRFs). CRFs are often used in combination with LSTMs for
NER tasks. They can model the conditional probability of an entire sequence of labels,
rather than just individual labels, making them useful for tasks where the label of a word
depends on the labels of surrounding words.
 Transformers and BERT. Transformer networks, particularly the BERT (Bidirectional
Encoder Representations from Transformers) model, have had a significant impact on
NER. Using a self-attention mechanism that weighs the importance of different words,
BERT accounts for the full context of a word by looking at the words that come before
and after it.
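
As a hedged sketch of transformer-based NER, the snippet below runs the Hugging Face token-classification pipeline with a publicly available BERT checkpoint fine-tuned for NER; the specific model name and example sentence are illustrative assumptions.

```python
# Minimal BERT-based NER sketch (assumes: pip install transformers torch).
from transformers import pipeline

# dslim/bert-base-NER is a publicly available BERT model fine-tuned for NER,
# used here only as an illustration; any token-classification checkpoint works.
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

text = "Sundar Pichai announced a new Google office in Hyderabad."
for entity in ner(text):
    print(entity["word"], "->", entity["entity_group"], round(entity["score"], 3))
# e.g. Sundar Pichai -> PER, Google -> ORG, Hyderabad -> LOC
```
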
Chatbot System:

A chatbot is a computer program that simulates human conversation with an end user. Not all
chatbots are equipped with artificial intelligence (AI), but modern chatbots increasingly
use conversational AI techniques such as natural language processing (NLP) to understand user
questions and automate responses to them.

 Chatbots, also called chatterbots, are a form of artificial intelligence (AI) used in messaging apps.
 These tools add convenience for customers: they are automated programs that interact with customers as a human would and cost little to nothing to engage with.
 Key examples are chatbots used by businesses in Facebook Messenger, or as virtual
assistants, such as Amazon's Alexa.
 Chatbots tend to operate in one of two ways: either via machine learning or with set guidelines (a minimal rule-based sketch appears after this list).
 However, due to advancements in AI technology, chatbots using set guidelines are
becoming a historical footnote.
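
The hedged sketch below shows the "set guidelines" style of chatbot mentioned above: a tiny keyword-driven loop with canned responses. The rules and replies are made up for illustration; production chatbots typically replace such rules with an NLP or LLM back end.

```python
# Toy rule-based ("set guidelines") chatbot sketch; the keyword rules and
# replies are illustrative assumptions, not a real product's logic.
RULES = {
    "hello": "Hi there! How can I help you today?",
    "order": "You can track your order from the 'My Orders' page.",
    "refund": "Refunds are processed within 5 to 7 business days.",
    "bye": "Goodbye! Have a great day.",
}

def reply(message: str) -> str:
    text = message.lower()
    for keyword, answer in RULES.items():
        if keyword in text:
            return answer
    return "Sorry, I didn't understand that. Could you rephrase?"

if __name__ == "__main__":
    while True:
        user = input("You: ")
        print("Bot:", reply(user))
        if "bye" in user.lower():
            break
```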

Understanding Chatbots

The progressive advance of technology has seen an increase in businesses moving from
traditional to digital platforms to transact with consumers. Convenience through technology is
being delivered by businesses through the implementation of AI techniques on their digital
platforms. One AI technique that is growing in its application and use is the chatbot.
