Q1. What are the stages in the lifecycle of a natural language processing (NLP) project?
Data Collection: The process of gathering, measuring, and evaluating accurate data for a project
using established, approved procedures is referred to as data collection.
Data Cleaning: The practice of correcting or deleting incorrect, corrupted, improperly formatted,
duplicate, or incomplete data from a dataset is known as data cleaning.
Data Pre-Processing: The process of converting raw data into a comprehensible format is known as
data pre-processing.
Feature Engineering: Feature engineering is the process of extracting features (characteristics,
qualities, and attributes) from raw data using domain expertise.
Data Modeling: The practice of analyzing data objects and their relationships with other objects is
known as data modeling. It is used to investigate the data requirements of various business
processes.
Model Evaluation: Model evaluation is an important step in the creation of a model. It aids in the
selection of the best model to represent our data and the prediction of how well the chosen model
will perform in the future.
Model Deployment: The technical task of exposing an ML model to real-world use is known as
model deployment.
Monitoring and Updating: The activity of measuring and analyzing production model performance to
ensure acceptable quality as defined by the use case is known as machine learning monitoring. It
delivers alerts about performance issues and helps diagnose and resolve their root cause.
Language Modeling: Predicting the next word in a sequence based on the history of previous words,
often used in auto-complete features.
Topic Modeling: Identifying the topical structure and main themes within a collection of documents.
Conversational Agents: Voice-activated assistants like Alexa, Siri, Google Assistant, and Cortana.
Information Extraction: Identifying and extracting specific pieces of information from a text, such as
events from emails.
Text Classification: Categorizing text into predefined groups based on its content, used for tasks like
sentiment analysis and spam detection.
Syntactic analysis focuses on the structure of sentences, specifically the arrangement of words or
tokens and the grammatical rules that govern them.
Parsing: It helps in deciding the structure of a sentence or text in a document. It helps analyze the
words in the text based on the grammar of the language.
Word segmentation: Word segmentation divides the text into small, meaningful units.
Morphological segmentation: The purpose of morphological segmentation is to break words into
their constituent morphemes, the smallest meaningful units of a language.
Stemming: It is the process of removing the suffix from a word to obtain its root word.
Lemmatization: It reduces a word to its base or dictionary form (lemma), without altering the meaning of the word.
Semantic analysis focuses on the meaning of words, phrases, and sentences. It aims to understand
the actual meaning of the text, rather than just its grammatical structure.
Word embeddings are dense vector representations of words that capture semantic meanings and
relationships between words in a language. They are used to convert words into numerical vectors
that can be understood by machine learning models. Word embeddings like Word2Vec, GloVe, and
FastText have become popular and essential in natural language processing (NLP) and various
machine learning tasks.
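As a brief illustration, the following is a minimal sketch assuming the gensim library and its packaged "glove-wiki-gigaword-50" vectors (neither of which is prescribed above); it shows how pre-trained embeddings turn words into dense vectors and expose semantic similarity:

```python
# Minimal sketch: using pre-trained GloVe vectors through gensim's downloader API.
# The model name "glove-wiki-gigaword-50" is one of gensim's packaged datasets;
# treat the exact choice as an illustrative assumption.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")   # 50-dimensional GloVe vectors

# Words become dense numerical vectors that ML models can consume.
print(glove["language"].shape)               # (50,)

# Semantic similarity between two words.
print(glove.similarity("king", "queen"))

# Words most similar to a query word.
print(glove.most_similar("computer", topn=3))
```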
Applications:
Word embeddings are used in various NLP applications and tasks, including:
Text Classification
Sentiment Analysis
Named Entity Recognition
Machine Translation
Question Answering
Semantic Similarity
Information Retrieval
1. Tokenization:
Break the text into words, phrases, symbols, or other meaningful elements (tokens).
2. Part-of-Speech Tagging:
Assign a part-of-speech (noun, verb, adjective) to each token to understand its grammatical function
in the sentence.
3. Dependency Parsing:
Analyze the grammatical structure and relationships between tokens in a sentence to understand
how different words or phrases relate to each other.
7. Semantic Similarity:
Measure the similarity between words, phrases, or sentences based on their meaning.
8. Sentiment Analysis:
Determine the sentiment or emotion expressed in a text, whether it is positive, negative, or neutral.
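A minimal sketch of the first three steps (tokenization, POS tagging, dependency parsing), assuming the spaCy library and its small English model "en_core_web_sm" (neither is prescribed above):

```python
# Minimal sketch of tokenization, POS tagging, and dependency parsing with spaCy.
# Install with: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")

for token in doc:
    # token.text       -> 1. Tokenization
    # token.pos_       -> 2. Part-of-speech tag
    # token.dep_/head  -> 3. Dependency relation and its head word
    print(f"{token.text:<8} {token.pos_:<6} {token.dep_:<10} head={token.head.text}")
```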
Q7. List the components of Natural Language Processing.
Natural Language Processing (NLP) is a multidisciplinary field that combines linguistics, computer
science, and artificial intelligence to enable machines to understand, interpret, generate, and
respond to human language. The components of NLP can be categorized as follows:
1. Lexical Analysis:
Tokenization: Breaking down text into words, phrases, symbols, or other meaningful elements
(tokens).
Stemming and Lemmatization: Reducing words to their root form.
Part-of-Speech Tagging: Assigning a part-of-speech (noun, verb, adjective) to each token.
2. Syntactic Analysis:
Parsing: Analyzing the grammatical structure of sentences to determine how different words relate to
each other.
Dependency Parsing: Identifying the grammatical relationships between words in a sentence.
3. Semantic Analysis:
Word Sense Disambiguation: Resolving the correct meaning of words with multiple meanings based
on context.
Semantic Role Labeling: Identifying the semantic relationships between different parts of a sentence.
Named Entity Recognition (NER): Identifying and classifying named entities such as names of
persons, organizations, locations, etc.
4. Discourse Integration:
Analyzing how the meaning of a sentence depends on the sentences that precede it and on the
overall discourse context.
5. Pragmatic Analysis:
Speech Act Recognition: Identifying the type of speech act in a sentence (statement, question,
request).
Sentiment Analysis: Determining the sentiment or emotion expressed in a text (positive, negative,
neutral).
1. Syntactic Parsing:
Constituency Parsing: Breaks down a sentence into constituent phrases (noun phrases, verb
phrases) using parse trees.
Dependency Parsing: Identifies the grammatical relationships between words in a sentence using a
directed graph.
2. Semantic Parsing:
Semantic Role Labeling (SRL): Identifies the semantic relationships between different parts of a
sentence (agent, patient).
Logical Form Parsing: Translates sentences into a formal logical representation (e.g., first-order logic)
for reasoning and inference.
Text extraction in NLP:
Following are the common ways used for Text Extraction in NLP:
2. Lemmatization:
Definition: Lemmatization is the process of reducing words to their base or dictionary form (lemma)
using vocabulary and morphological analysis.
Example:
Original Word: better
Lemma: good
Purpose: It uses linguistic rules and morphological analysis to ensure that the root word belongs to
the language and is a valid word.
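A minimal sketch contrasting stemming and lemmatization, assuming NLTK (a library choice not stated above):

```python
# Minimal sketch contrasting stemming and lemmatization with NLTK.
# The WordNet data must be downloaded once with nltk.download("wordnet").
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))                   # 'run'    -- suffix stripping
print(stemmer.stem("better"))                    # 'better' -- stemming cannot map to 'good'
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'   -- dictionary-based lemma
```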
TF-IDF also called Term Frequency-Inverse Document Frequency helps us get the importance of a
particular word relative to other words in the corpus. It's a common scoring metric in information
retrieval (IR) and summarization.
TF-IDF converts text into weighted vectors, assigning higher weights to words that are frequent in a
document but rare across the corpus, which can be used in a variety of NLP applications.
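A minimal sketch of computing TF-IDF vectors, assuming scikit-learn (a library choice not stated above):

```python
# Minimal sketch: computing TF-IDF scores with scikit-learn.
# TF-IDF weights words that are frequent in one document but rare across the corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are friendly animals",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)   # sparse (n_docs x n_terms) matrix

print(vectorizer.get_feature_names_out())         # vocabulary terms
print(tfidf_matrix.toarray().round(2))            # per-document TF-IDF weights
```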
Q12. What are some metrics on which NLP models are evaluated?
The following are some metrics on which NLP models are evaluated:
Accuracy: When the output variable is categorical or discrete, accuracy is used. It is the percentage
of correct predictions made by the model compared to the total number of predictions made.
Precision: Indicates how precise or exact the model's predictions are, i.e., of all the examples the
model predicted as positive (the class we care about), how many are actually positive?
Recall: Precision and recall are complementary. Recall measures how effectively the model can find
the positive class, i.e., of all the actual positive examples, how many did the model correctly identify?
F1 score: This metric combines precision and recall into a single metric that also represents the
trade-off between precision and recall, i.e., completeness and exactness.
F1 = (2 × Precision × Recall) / (Precision + Recall).
AUC: As the prediction threshold is varied, the AUC captures the trade-off between the rate of correct
positive predictions (true positive rate) and the rate of incorrect positive predictions (false positive rate).
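A minimal sketch computing these metrics on toy labels and predictions, assuming scikit-learn:

```python
# Minimal sketch: evaluation metrics for a binary NLP classifier with scikit-learn.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                   # hard class predictions
y_prob = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1]   # predicted probabilities, used for AUC

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))        # 2 * P * R / (P + R)
print("AUC      :", roc_auc_score(y_true, y_prob))
```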
Answer: Dependency Parsing is a syntactic parsing technique that analyzes the grammatical
structure of a sentence by identifying the relationships between words in a sentence and
representing them as a directed graph.
The Bag-of-Words (BoW) Model is a simple representation technique in NLP that converts text into a
set of words, ignoring grammar and word order, and counts the frequency of each word.
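A minimal sketch of the Bag-of-Words representation, assuming scikit-learn's CountVectorizer:

```python
# Minimal sketch of the Bag-of-Words model: word order and grammar are discarded,
# only per-document word counts remain.
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love NLP", "NLP loves me", "I love machine learning"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())   # the vocabulary
print(bow.toarray())                        # word counts per document
```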
Q22. What is a Corpus?
Answer: A Corpus is a large and structured set of texts in a language that is used to train and test
NLP models.
Q23. What are the main challenges in handling ambiguity in language in NLP?
Answer: Ambiguity arises due to multiple meanings of words, syntactic structures, and context. NLP
models often struggle to interpret ambiguous language correctly.
Answer: Overfitting occurs when an NLP model performs well on the training data but poorly on
unseen data. It can be addressed by using techniques like regularization, dropout, and cross-
validation.
Answer: Handling multiple languages requires developing multilingual models, dealing with
language-specific nuances, and ensuring the availability of sufficient training data for each language.
Q27. What are Word Embeddings and how are they useful in NLP?
Word Embeddings are dense vector representations of words in a high-dimensional space that
capture semantic meanings and relationships between words, which are useful in various NLP tasks
like sentiment analysis, text classification, and machine translation.
Natural Language Processing (NLP) is a multidisciplinary field of artificial intelligence (AI) that focuses
on enabling machines to understand, interpret, generate, and respond to human language in a
valuable way. It involves the application of computational techniques to analyze and understand
human language, such as text analysis, sentiment analysis, and language translation.
Natural Language Understanding (NLU) is a subfield of NLP that focuses on enabling machines to
understand the meaning and context of human language. It goes beyond syntactic and semantic
analysis to understand the intentions, emotions, and nuances of human language, allowing machines
to interact with humans in a more natural and intuitive way.
When a sentence is parsed three words at a time, it is a trigram. Similarly, an n-gram refers to
parsing n words at a time.
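A minimal sketch of extracting bigrams, trigrams, and general n-grams, assuming NLTK:

```python
# Minimal sketch: n-grams with NLTK; nltk.ngrams works on any token sequence.
from nltk import ngrams

tokens = "I love natural language processing".split()

print(list(ngrams(tokens, 2)))   # bigrams
print(list(ngrams(tokens, 3)))   # trigrams
print(list(ngrams(tokens, 4)))   # 4-grams, i.e. n-grams with n = 4
```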
Pragmatic analysis is an important task in NLP for interpreting knowledge that lies outside a given
document. The aim of pragmatic analysis is to explore aspects of a document or text beyond its
literal language, which requires comprehensive knowledge of the real world. Pragmatic analysis
allows software applications to interpret real-world data critically and determine the actual meaning
of sentences and words.
Applications:
Text Classification: Use numerical vectors and machine learning algorithms like SVM or Naive Bayes.
Information Retrieval: Use TF-IDF to rank and retrieve relevant documents.
Sentiment Analysis: Use BoW or Word Embeddings to classify sentiments.
Benefits:
1. Dimensionality Reduction: Convert text into a compact numerical form, improving computational
efficiency.
2. Semantic Representation: Capture the semantic meanings and relationships between words,
enabling effective text analysis.
Semantic Understanding:
Challenge: Accurately capturing the meaning of words, phrases, and sentences, including word
sense disambiguation, metaphorical language, idioms, and other linguistic phenomena.
Ambiguity:
Challenge: Resolving the ambiguity in language where words and phrases can have multiple
meanings depending on context.
Contextual Understanding:
Challenge: Understanding and using context to interpret language, including comprehending
referential statements and resolving pronouns.
Language Diversity:
Challenge: Handling the variety of languages and dialects, each with its own linguistic traits, lexicon,
and grammar, and the lack of resources for low-resource languages.
Real-world Understanding:
Challenge: Integrating real-world knowledge and common sense into NLP systems to improve
understanding and reasoning capabilities.
Tokenization
Stop Word Removal
Text Normalization
Lowercasing
Lemmatization
Stemming
Date and Time Normalization
Removal of Special Characters and Punctuation
Removing HTML Tags or Markup
Spell Correction
Sentence Segmentation
Text normalization, or text standardization, is the process of transforming text data into a
standardized form to ensure consistency and simplify the representation of textual information.
Techniques:
Lowercasing:
Example: Converting all text to lowercase to treat words with the same characters as identical and
avoid duplication.
Lemmatization:
Example: Converting words to their base or dictionary form, e.g., “running” to “run” or “better” to
“good.”
Stemming:
Example: Reducing words to their root form by removing suffixes or prefixes, e.g., “playing” to “play”
or “cats” to “cat.”
Abbreviation Expansion:
Example: Expanding abbreviations or acronyms to their full forms, e.g., “NLP” to “Natural Language
Processing.”
Numerical Normalization:
Example: Converting numerical digits to their written form or normalizing numerical representations,
e.g., “100” to “one hundred” or normalizing dates.
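A minimal sketch combining several of these normalization techniques into a single function (regex-based; the abbreviation table is a hypothetical example):

```python
# Minimal sketch of a text-normalization pipeline: lowercasing, markup removal,
# special-character removal, and abbreviation expansion.
import re

ABBREVIATIONS = {"nlp": "natural language processing"}   # hypothetical expansion table

def normalize(text: str) -> str:
    text = text.lower()                                   # lowercasing
    text = re.sub(r"<[^>]+>", " ", text)                  # strip HTML tags / markup
    text = re.sub(r"[^a-z0-9\s]", " ", text)              # remove special characters
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]  # expand abbreviations
    return " ".join(tokens)

print(normalize("<p>NLP is FUN!!!</p>"))   # -> "natural language processing is fun"
```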
NLTK stands for Natural Language Toolkit. It is a suite of libraries and programs written in Python
for symbolic and statistical natural language processing. It offers tokenization, stemming,
lemmatization, POS tagging, named entity recognition, parsing, semantic reasoning, and
classification.
Cosine similarity is a metric used to measure the similarity between two vectors in a multi-
dimensional space by calculating the cosine of the angle between them.
Importance in NLP:
Text Representation:
Convert text documents into numerical vectors using techniques like bag-of-words, TF-IDF, or word
embeddings like Word2Vec or GloVe.
Vector Normalization:
Normalize the document vectors to unit length to ensure the length or magnitude of the vectors does
not affect the cosine similarity calculation.
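A minimal sketch of cosine similarity between TF-IDF document vectors, assuming scikit-learn:

```python
# Minimal sketch: cosine similarity between document vectors.
# cos(theta) = (A . B) / (||A|| * ||B||); values near 1 mean very similar documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat was sitting on a mat",
    "stock prices fell today",
]

vectors = TfidfVectorizer().fit_transform(docs)

print(cosine_similarity(vectors[0], vectors[1]))   # related sentences -> higher score
print(cosine_similarity(vectors[0], vectors[2]))   # unrelated sentences -> lower score
```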
Naive Bayes:
Usage: Text classification based on word or feature presence.
Decision Trees:
Usage: Sentiment analysis, information extraction.
Random Forests:
Usage: Text classification, named entity recognition, sentiment analysis.
Transformer (BERT):
Usage: Achieving state-of-the-art performance in various NLP tasks by capturing contextual
relationships in text using self-attention mechanisms.
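A minimal sketch of the first of these, a Naive Bayes text classifier, assuming scikit-learn and a tiny hypothetical sentiment dataset:

```python
# Minimal sketch: Naive Bayes text classification on a toy sentiment dataset.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts  = ["great movie", "awful film", "loved it", "terrible acting", "wonderful story"]
labels = ["pos", "neg", "pos", "neg", "pos"]

# Bag-of-Words features feed a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["what a wonderful movie", "awful and terrible"]))  # expected: ['pos' 'neg']
```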
GPT stands for “Generative Pre-trained Transformer”. It refers to a collection of large language
models created by OpenAI. It is trained on a massive dataset of text and code, which allows it to
generate text, generate code, translate languages, and write many types of creative content, as well
as answer questions in an informative manner.
The GPT series includes various models, the most well-known and commonly utilised of which are
the GPT-2 and GPT-3.
GPT models are built on the Transformer architecture, which allows them to efficiently capture long-
term dependencies and contextual information in text. These models are pre-trained on a large
corpus of text data from the internet, which enables them to learn the underlying patterns and
structures of language.
Word embeddings are dense, low-dimensional vector representations of words trained using big text
corpora through unsupervised or supervised methods to capture semantic and contextual
information about words in a language.
Goal:
Capture relationships and similarities between words by representing them as dense vectors in a
continuous vector space using the distributional hypothesis, which states that words with similar
meanings tend to occur in similar contexts.
Word2Vec
GloVe (Global Vectors for Word Representation)
FastText
Advantages over Traditional Text Vectorization:
Semantic Similarity:
Capture semantic similarity between words.
Syntactic Links:
Capture syntactic links between words. For example, “king” – “man” + “woman” may produce a
vector similar to “queen,” capturing gender analogy.
Reduced Dimensionality:
Reduce dimensionality of word representations, representing words as dense vectors instead of
high-dimensional sparse vectors.
Skip-gram:
Type: Neural Network-based
Approach: Predicting context words given a target word.
Implementation: Uses a target word to predict the surrounding context words.
Encoder:
Function: Transforms the input sequence (e.g., a sentence in the source language) into a fixed-length
vector known as the "context vector" or "thought vector."
Architecture: Utilizes recurrent neural networks (RNNs) like Long Short-Term Memory (LSTM) or
Gated Recurrent Units (GRU) to capture sequential information from the input.
Context Vector:
Purpose: Acts as a summary or representation of the input sequence by encoding its meaning and
important information into a fixed-size vector, regardless of the input length.
Decoder:
Function: Uses the encoder’s context vector to generate the output sequence (e.g., translation or
summarization) one token at a time.
Architecture: Another RNN-based network that is conditioned on the context vector, typically by
using it as the initial hidden state, and generates the output one token at a time.
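A minimal PyTorch sketch of such an LSTM encoder-decoder (PyTorch and all sizes are illustrative assumptions, not a production seq2seq implementation):

```python
# Minimal sketch of an LSTM encoder-decoder; shapes and sizes are illustrative.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):
        _, (h, c) = self.lstm(self.embed(src))
        return h, c                      # (h, c) acts as the fixed-size context vector

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt, context):
        output, _ = self.lstm(self.embed(tgt), context)  # context = encoder's (h, c)
        return self.out(output)          # one score per target-vocabulary token per step

# Toy usage: batch of 2 source sequences (length 5) and target sequences (length 4).
src = torch.randint(0, 100, (2, 5))
tgt = torch.randint(0, 100, (2, 4))
context = Encoder(100)(src)
logits = Decoder(100)(tgt, context)
print(logits.shape)                      # torch.Size([2, 4, 100])
```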
An attention mechanism is a neural network component, typically an additional attention layer within
an encoder-decoder network, that enables the model to focus on specific parts of the input while
performing a task. It achieves this by dynamically assigning weights to different elements in the
input, indicating their relative importance or relevance. This selective attention allows the model to
focus on relevant information, capture dependencies, and analyze relationships within the data.
The attention mechanism is particularly valuable in tasks involving sequential or structured data,
such as natural language processing or computer vision, where long-term dependencies and
contextual information are crucial for achieving high performance. By allowing the model to
selectively attend to important features or contexts, it improves the model’s ability to handle complex
relationships and dependencies in the data, leading to better overall performance in various tasks.
Transformer is one of the fundamental models in NLP based on the attention mechanism, which
allows it to capture long-range dependencies in sequences more effectively than traditional recurrent
neural networks (RNNs). It has given state-of-the-art results in various NLP tasks like word
embedding, machine translation, text summarization, question answering etc.
Parallelization: The self-attention mechanism allows the model to process words in parallel, which
makes it significantly faster to train compared to sequential models like RNNs.
Long-Range Dependencies: The attention mechanism enables the Transformer to effectively capture
long-range dependencies in sequences, which makes it suitable for tasks where long-term context is
essential.
Self-Attention Mechanism:
Encoder-Decoder Network:
Multi-head Attention:
Positional Encoding
Feed-Forward Neural Networks
Layer Normalization and Residual Connections
The self-attention mechanism is a powerful tool that allows the Transformer model to capture long-
range dependencies in sequences. It allows each word in the input sequence to attend to all other
words in the same sequence, and the model learns to assign weights to each word based on its
relevance to the others. This enables the model to capture both short-term and long-term
dependencies, which is critical for many NLP applications.
Q49. What are positional encodings in Transformers, and why are they necessary?
The Transformer model processes the input sequence in parallel, so it lacks the inherent
understanding of word order that sequential models such as recurrent neural networks (RNNs) and
LSTMs possess. It therefore requires a method to express positional information explicitly.
Positional encoding is added to the input embeddings to provide this positional information, such as
the relative or absolute position of each word in the sequence. These encodings can take several
forms, including fixed sine and cosine functions or learned embeddings. This enables the model to
take the order of the words in the sequence into account, which is critical for many NLP tasks.
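A minimal NumPy sketch of the fixed sine/cosine positional encodings described above (dimensions are illustrative assumptions):

```python
# Minimal sketch of sinusoidal positional encodings as in "Attention Is All You Need".
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                 # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                   # (seq_len, d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])              # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])              # odd dimensions use cosine
    return pe                                          # added to the input embeddings

print(positional_encoding(seq_len=10, d_model=16).shape)   # (10, 16)
```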
Definition:
Generative models aim to model the joint probability distribution P(X, Y) of the inputs X and the labels Y.
Working Principle:
They learn the underlying data distribution and generate new data samples.
Examples in NLP:
Naive Bayes, Hidden Markov Models (HMM), and Generative Adversarial Networks (GANs) for text
generation.
Usage:
Text generation, machine translation, and data augmentation.
Pros:
Can handle missing data.
Can generate new data points.
Cons:
May suffer from complex computations.
Less straightforward in decision-making.
Discriminative Models:
Definition:
Discriminative models aim to model the posterior probability
P(Y∣X) directly, focusing on learning the boundary between different classes.
Working Principle:
They learn the decision boundary between classes and make predictions based on the input data.
Examples in NLP:
Logistic Regression, Support Vector Machines (SVM), and Neural Networks.
Usage:
Text classification, sentiment analysis, and named entity recognition.
Pros:
Typically have better performance for classification tasks.
More straightforward in decision-making.
Cons:
Cannot generate new data points.
May suffer from overfitting with limited data.
Summary:
Generative Models:
Model the joint probability P(X, Y)
Learn underlying data distribution
Examples: Naive Bayes, HMM, GANs
Usage: Text generation, machine translation
Discriminative Models:
Model the posterior probability P(Y∣X)
Learn decision boundary between classes
Examples: Logistic Regression, SVM, Neural Networks
Usage: Text classification, sentiment analysis
Extraction-based summarization
Abstraction-based summarization
Word2Vec:
Developed by Google.
Uses shallow neural networks to learn word embeddings.
Two main architectures: Continuous Bag of Words (CBOW) and Skip-Gram.
Captures semantic relationships between words through vector arithmetic (king - man + woman ≈
queen).
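A minimal sketch of training Word2Vec in both architectures, assuming the gensim library and a toy corpus (real training needs a large corpus):

```python
# Minimal sketch: training Word2Vec with gensim in its two architectures.
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "man", "and", "a", "woman", "walk", "in", "the", "city"],
]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)       # CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)   # Skip-Gram

# Vector arithmetic of the "king - man + woman ~ queen" kind (only meaningful
# with embeddings trained on a large corpus).
result = skipgram.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)
```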
Architecture: BERT is based on the Transformer architecture, which is a type of neural network
architecture introduced in the paper "Attention Is All You Need" by Vaswani et al. The Transformer
uses self-attention mechanisms to weigh the significance of different words in a sentence when
encoding or decoding.
Pre-training: BERT is pre-trained on a large corpus of text using two unsupervised learning tasks:
Masked Language Model (MLM): Randomly masks some of the input tokens and then predicts them
based on the remaining unmasked tokens. This allows BERT to learn contextual representations of
words.
Next Sentence Prediction: Given two sentences A and B, BERT predicts whether sentence B is the
subsequent sentence to A or not. This helps BERT understand the relationship between sentences.
Fine-tuning: After pre-training, BERT can be fine-tuned on specific NLP tasks, such as text
classification, named entity recognition, and question answering, by adding a task-specific layer on
top of the pre-trained model and training it on labeled data.
Data Collection: BERT is pre-trained on a large corpus of text, such as Wikipedia, BookCorpus, and
other publicly available text sources.
Tokenization: The input text is tokenized into subwords using WordPiece tokenization, which breaks
down words into smaller units to handle out-of-vocabulary words and reduce the vocabulary size.
Masked Language Model (MLM): Randomly masks 15% of the input tokens and predicts them based
on the remaining unmasked tokens. The objective is to learn bidirectional representations of the
words.
Next Sentence Prediction: Given a pair of sentences, BERT predicts whether the second sentence
follows the first sentence or not, helping the model understand the relationship between sentences.
Fine-tuning:
Task-specific Layer: A task-specific layer (e.g., a softmax layer for classification) is added on top of
the pre-trained BERT model.
Training on Labeled Data: BERT is fine-tuned on a specific NLP task using labeled data. The
parameters of the pre-trained BERT model and the task-specific layer are jointly optimized to
minimize the task-specific loss.
Self-Attention Mechanism: For each word in the input sequence, BERT uses self-attention to weigh
the importance of all other words in the sequence. This enables BERT to capture the bidirectional
context of each word, allowing it to understand the meaning of a word based on its surrounding
words.
Masked Language Model (MLM): During pre-training, BERT randomly masks some of the input
tokens and predicts them based on the remaining unmasked tokens. This forces the model to
consider the bidirectional context to predict the masked words accurately.
Bidirectional Context: Unlike traditional language models like LSTMs and RNNs, which are
unidirectional and process the text sequentially, BERT captures the bidirectional context of each
word, resulting in better understanding of the meaning and relationships between words in a
sentence.
Pre-training and Fine-tuning: BERT is pre-trained on a large corpus of text and can be fine-tuned on
specific NLP tasks, making it highly versatile and adaptable to various NLP tasks with minimal task-
specific data.
State-of-the-art Performance: BERT has achieved state-of-the-art results on a wide range of NLP
tasks, including text classification, named entity recognition, question answering, and more,
surpassing traditional language models and other pre-trained models like Word2Vec and GloVe.
Multi-Head Self-Attention Mechanism: This component allows the encoder to weigh the significance
of different words in the input sequence when encoding each word, enabling the model to capture
the context and relationships between words effectively.
Feed-Forward Neural Network: After the self-attention mechanism, the output is passed through a
position-wise feed-forward neural network, which applies the same feed-forward neural network to
each position independently.
Q63. How does the transformer architecture differ from recurrent neural networks (RNNs) and
convolutional neural networks (CNNs)?
Transformer vs RNNs:
Parallel Processing: Transformers process all words in the sequence in parallel, whereas RNNs
process the sequence sequentially, making transformers more computationally efficient.
Bidirectional Context: Transformers use the self-attention mechanism to capture the bidirectional
context of each word, whereas RNNs are unidirectional and only consider the previous words in the
sequence.
Transformer vs CNNs:
Global Context: Transformers capture the global context of the input sequence through the self-
attention mechanism, whereas CNNs capture local features using convolutional filters.
Positional Information: Transformers preserve the positional information of the words in the input
sequence using positional encodings, whereas CNNs do not inherently preserve the positional
information.
Parameter Efficiency: Transformers are more parameter-efficient and require fewer parameters to
capture complex relationships in the input sequence compared to CNNs.
The self-attention mechanism in the transformer allows the model to weigh the significance of
different words in the input sequence when encoding each word, enabling the model to capture the
context and relationships between words effectively. The self-attention mechanism consists of three
main steps:
Compute Query, Key, and Value Vectors:
For each word in the input sequence, three vectors are computed:
Query (Q): Represents the current word.
Key (K): Represents each word in the sequence that the current word can attend to.
Value (V): Represents the information of each word that is aggregated using the attention weights.
Compute Attention Scores:
Take the dot product of the Query vector with every Key vector, scale it by the square root of the
Key dimension, and apply a softmax to obtain the attention weights.
Compute the Output:
Compute the weighted sum of the Value vectors using the attention weights, producing the
context-aware representation of the current word.
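A minimal NumPy sketch of these steps (single head, no learned projections; shapes are illustrative assumptions):

```python
# Minimal sketch of scaled dot-product self-attention.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # attention scores: (seq_len, seq_len)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # masked positions get ~zero weight
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V                         # weighted sum of the Value vectors

seq_len, d_model = 4, 8
X = np.random.randn(seq_len, d_model)          # token representations
out = scaled_dot_product_attention(X, X, X)    # self-attention: Q = K = V = X
print(out.shape)                               # (4, 8)
```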
Parallel Processing: Transformers process all words in the input sequence in parallel, making them
more computationally efficient compared to RNNs and CNNs.
Bidirectional Context: Transformers use the self-attention mechanism to capture the bidirectional
context of each word, enabling them to understand the meaning and relationships between words in
a sentence effectively.
Positional Information: Transformers preserve the positional information of the words in the input
sequence using positional encodings, allowing them to understand the sequence of words and their
relationships in the input text.
State-of-the-art Performance: Transformer-based models, such as BERT and GPT, have achieved
state-of-the-art results on a wide range of NLP tasks, surpassing traditional language models and
other pre-trained models like Word2Vec and GloVe.
Masked attention in transformers refers to the practice of preventing certain tokens from attending to
others during the self-attention calculation. This is typically done by masking the positions of the
tokens in the input sequence to ensure that the model does not peek into the future tokens during
training.
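A minimal sketch of such a causal (look-ahead) mask, which can be passed to the attention sketch shown earlier (illustrative NumPy code):

```python
# Minimal sketch of a causal mask: position i may attend only to positions <= i.
# Masked-out (False) positions receive a large negative score before the softmax.
import numpy as np

seq_len = 5
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # lower-triangular
print(causal_mask.astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```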
Masked attention is crucial in the pre-training of BERT because it enables the model to learn
bidirectional representations of the words. By randomly masking some of the input tokens and
predicting them based on the remaining unmasked tokens, BERT is forced to consider the
bidirectional context to predict the masked words accurately. This allows BERT to capture more
comprehensive and context-aware representations of the words, which is essential for achieving
high performance on various downstream NLP tasks.
Definition: Self-attention is a mechanism in transformers that allows the model to weigh the
significance of different words in the input sequence when encoding each word.
Calculation: For each word in the input sequence, self-attention computes the attention scores by
taking the dot product of the Query and Key vectors and scales it by the square root of the
dimension of the Key vector.
Output: The weighted sum of the Value vectors is computed using the attention scores as weights.
Masked Attention:
Definition: Masked attention is a specific type of self-attention in transformers where certain tokens
are masked or prevented from attending to others during the self-attention calculation.
Purpose: Masked attention is used in the pre-training of transformers like BERT to enable the model
to learn bidirectional representations of the words by predicting the masked words based on the
remaining unmasked tokens.
Positional Encoding: Adds positional information to the input embeddings to preserve the order of
the tokens.
Output Layer: Produces the final output of the transformer model, which can be used for various NLP
tasks such as classification, named entity recognition, and machine translation.
Transformers address the vanishing gradient problem by using the Layer Normalization and Residual
Connection techniques:
Layer Normalization: Normalizes the activations of the neurons in each layer, preventing the
gradients from becoming too small during backpropagation.
Residual Connections: Adds the input of each layer to its output, allowing the gradients to flow
through the network more effectively and preventing them from vanishing.
Transformer Encoder:
Input: Takes the input sequence and produces a sequence of feature vectors.
Output: The output of the encoder is a sequence of feature vectors, which contain information about
the input sequence.
Use: Used for tasks like text classification, named entity recognition, and sentence encoding.
Transformer Decoder:
Input: Takes the output of the encoder and the target sequence and produces the final output
sequence.
Output: The output of the decoder is the final output sequence, which can be used for tasks like
machine translation and text summarization.
Use: Used for sequence-to-sequence tasks where the input and output sequences have different
lengths.
For sequence-to-sequence tasks like machine translation and text summarization, transformers use
a transformer encoder to process the input sequence and produce a sequence of feature vectors,
and a transformer decoder to take the output of the encoder and the target sequence and produce
the final output sequence.
The transformer decoder is similar to the transformer encoder but also includes an additional
Encoder-Decoder Attention Mechanism, which allows the decoder to focus on the relevant parts of
the input sequence when generating the output sequence.
Q75. What are the steps involved in fine-tuning a BERT model for a specific NLP task?
The steps involved in fine-tuning a BERT model for a specific NLP task are:
Load Pre-trained BERT Model: Load the pre-trained BERT model and its tokenizer.
Task-Specific Layer: Add a task-specific layer (e.g., a classification head) on top of the pre-trained
BERT model.
Training on Labeled Data: Fine-tune the model on the task's labeled data, jointly optimizing the
pre-trained parameters and the task-specific layer to minimize the task-specific loss.
Evaluation: Evaluate the performance of the fine-tuned BERT model on the validation and test
datasets using appropriate evaluation metrics (e.g., accuracy, F1-score, BLEU score).
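A minimal sketch of these steps, assuming the Hugging Face transformers and datasets libraries; the dataset name "imdb" and all hyperparameters are illustrative choices, not prescribed above:

```python
# Minimal sketch: fine-tuning BERT for binary text classification.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 1. Load the pre-trained BERT model and its tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)

# 2. Prepare labeled data (small subsets keep the sketch fast).
dataset = load_dataset("imdb")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)
encoded = dataset.map(tokenize, batched=True)
train_ds = encoded["train"].shuffle(seed=42).select(range(2000))
eval_ds = encoded["test"].shuffle(seed=42).select(range(500))

# 3. Evaluation metrics (accuracy and F1, as mentioned above).
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, preds), "f1": f1_score(labels, preds)}

# 4. Fine-tune: the pre-trained weights and the classification head are optimized jointly.
args = TrainingArguments(output_dir="bert-finetuned", num_train_epochs=1,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  eval_dataset=eval_ds, compute_metrics=compute_metrics)
trainer.train()
print(trainer.evaluate())
```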