DL Unit-IV
Natural Language Processing (NLP) is a field of artificial intelligence (AI) that enables
machines to understand, interpret, generate, and interact with human language. When combined
with deep learning, NLP has significantly advanced, enabling systems to achieve state-of-the-art
results in tasks like translation, sentiment analysis, question answering, and text generation.
Key Concepts
1. Text Representation:
o Words and sentences need to be represented numerically for models to process.
o Common approaches:
▪ One-Hot Encoding: Binary vector representation.
▪ Word Embeddings: Dense vector representations capturing semantic
similarity (e.g., Word2Vec, GloVe); see the sketch after this list for a
comparison with one-hot vectors.
▪ Contextual Embeddings: Represent words based on their context in a
sentence (e.g., BERT, GPT).
2. Sequence Modeling:
o NLP tasks often involve understanding sequences of words.
o Sequence models process these inputs while preserving order and context.
3. Language Understanding and Generation:
o Tasks range from analyzing existing text to generating coherent and meaningful
new text.
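As a quick illustration of point 1 above, the following sketch contrasts a one-hot vector with a dense embedding for a toy four-word vocabulary (the vocabulary, dimensions, and random values are assumptions for illustration only):
import numpy as np

vocab = ['king', 'queen', 'man', 'woman']            # toy vocabulary
word_to_idx = {w: i for i, w in enumerate(vocab)}

# One-hot encoding: a sparse, |V|-dimensional binary vector per word
one_hot_king = np.zeros(len(vocab))
one_hot_king[word_to_idx['king']] = 1.0              # [1, 0, 0, 0]

# Dense embedding: a small (here random) matrix maps each word to a low-dimensional vector
embedding_dim = 3
embedding_matrix = np.random.default_rng(0).normal(size=(len(vocab), embedding_dim))
dense_king = embedding_matrix[word_to_idx['king']]   # a 3-dimensional dense vector
print(one_hot_king, dense_king)
In practice the embedding matrix is learned during training rather than sampled at random.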
Common NLP Tasks
1. Text Classification:
o Categorize text into predefined labels.
o Examples: Sentiment analysis, spam detection.
2. Sequence Labeling:
o Assign a label to each token in a sequence.
o Examples: Part-of-speech tagging, named entity recognition (NER).
3. Machine Translation:
o Translate text from one language to another.
o Example: English to French translation.
4. Question Answering:
o Answer questions based on input text or a knowledge base.
5. Text Generation:
o Generate coherent and contextually relevant text.
o Examples: Chatbots, story generation.
6. Summarization:
o Generate concise summaries from longer texts.
o Types: Extractive and abstractive summarization.
Components of a Deep Learning NLP Pipeline
1. Preprocessing:
o Tokenization: Splitting text into words or subwords.
o Lowercasing, removing stopwords, and stemming/lemmatization (traditional
methods).
o SentencePiece or Byte Pair Encoding (BPE) for subword tokenization in deep
learning.
2. Embedding Layers:
o Map words/tokens to vectors in a continuous space.
o Examples: Word2Vec, FastText, pre-trained embeddings (e.g., GloVe).
3. Model Architecture:
o Sequence models (RNNs, LSTMs) or attention-based models (Transformers).
o Pre-trained models like BERT, RoBERTa, GPT-3 fine-tuned for specific tasks.
4. Loss Functions:
o Cross-entropy loss for classification and language modeling.
o Perplexity (the exponential of the cross-entropy) is the standard evaluation metric for language models.
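The pieces above can be wired together in a few lines. The sketch below is a minimal, illustrative PyTorch text-classification pipeline, assuming a toy vocabulary, a small LSTM encoder, and a made-up sentiment label:
import torch
import torch.nn as nn
import torch.nn.functional as F

# 1. Preprocessing: whitespace tokenization over a toy vocabulary
vocab = {'<pad>': 0, 'the': 1, 'movie': 2, 'was': 3, 'great': 4, 'boring': 5}
ids = torch.tensor([[vocab[t] for t in 'the movie was great'.split()]])  # shape (1, 4)

# 2. Embedding layer: map token ids to dense vectors
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=16)

# 3. Model architecture: a small LSTM encoder plus a linear classifier
lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
classifier = nn.Linear(32, 2)                      # two classes: negative / positive

embedded = embedding(ids)                          # (1, 4, 16)
_, (h_n, _) = lstm(embedded)                       # h_n: (1, 1, 32)
logits = classifier(h_n[-1])                       # (1, 2)

# 4. Loss function: cross-entropy against a made-up label (1 = positive)
loss = F.cross_entropy(logits, torch.tensor([1]))
print(loss.item())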
Popular Tools and Libraries
1. TensorFlow and PyTorch: Popular frameworks for building deep learning models.
2. Hugging Face Transformers: Library for pre-trained models and NLP pipelines.
3. spaCy: Industrial-strength NLP library.
4. NLTK: A classical NLP library for preprocessing and analysis.
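As a sketch of how little code these libraries require, the Hugging Face pipeline below loads a default pre-trained sentiment model (downloaded on first use; the exact model chosen is up to the library):
from transformers import pipeline

# Hugging Face pipeline: downloads a default pre-trained sentiment model on first use
classifier = pipeline('sentiment-analysis')
print(classifier('Deep learning has transformed NLP.'))
# e.g. [{'label': 'POSITIVE', 'score': 0.99}]  (exact score varies)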
Challenges in NLP
1. Ambiguity:
o Words can have multiple meanings depending on context.
o Example: "bank" (financial institution vs riverbank).
2. Data Scarcity:
o Annotated datasets may not be available for all languages or domains.
3. Bias in Data:
o Pre-trained models can reflect societal biases present in training data.
4. Multilingual Understanding:
o Handling multiple languages and code-switching (mixing languages in a
sentence).
5. Efficiency:
o Training and inference for large models can be resource-intensive.
The Vector Space Model (VSM) is a mathematical framework for representing and analyzing
the semantics of words, phrases, sentences, or entire documents. In this model, linguistic units
are represented as vectors in a high-dimensional space, where geometric relationships (e.g.,
angles or distances) capture semantic relationships.
1. Term-Document Matrix:
• A sparse matrix where rows represent terms, and columns represent documents.
• Entries are term frequencies, term frequencies weighted by inverse document frequency (TF-
IDF), or similar metrics.
• Example: Representing a document by its word occurrence frequencies.
2. Latent Semantic Analysis (LSA):
• Reduces the dimensionality of the term-document matrix using Singular Value Decomposition (SVD); see the sketch after this list.
• Projects terms and documents into a lower-dimensional "latent" space where semantic
relationships are preserved.
• Addresses synonymy and polysemy (different words with similar meanings or the same word
with multiple meanings).
3. Word Embeddings:
• Dense, low-dimensional vectors learned from data (e.g., Word2Vec, GloVe), discussed in detail later in this unit.
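The first two components above (a TF-IDF term-document matrix and LSA) can be sketched with scikit-learn, assuming a toy three-document corpus (note that scikit-learn places documents in rows rather than columns):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    'the bank approved the loan',
    'the river bank was muddy',
    'the loan interest rate increased',
]

# TF-IDF weighted term-document representation (documents in rows here)
X = TfidfVectorizer().fit_transform(docs)        # sparse matrix, shape (3, |vocab|)

# LSA: project into a 2-dimensional latent space via truncated SVD
X_latent = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# Geometric (cosine) similarity between documents in the latent space
print(cosine_similarity(X_latent))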
Applications of VSM
1. Document Retrieval:
o Query-document matching using cosine similarity.
o Example: Search engines rank documents based on their vector similarity to a query.
2. Text Classification:
o Documents represented as vectors are used as inputs for classifiers.
o Example: Spam detection, topic classification.
3. Semantic Similarity:
o Computing similarity between words, sentences, or documents.
o Example: Plagiarism detection, sentence paraphrase identification.
4. Word Analogy Tasks:
o Solving analogies like "king - man + woman = queen" using vector arithmetic (see the sketch after this list).
5. Clustering and Topic Modeling:
o Grouping semantically similar items together.
o Example: News categorization.
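The analogy in item 4 can be reproduced with pre-trained GloVe vectors through gensim's downloader; this is only a sketch, and the first call downloads the 'glove-wiki-gigaword-50' vectors:
import gensim.downloader as api

# Pre-trained 50-dimensional GloVe vectors (downloaded on first use)
wv = api.load('glove-wiki-gigaword-50')

# king - man + woman ≈ queen, via vector arithmetic on the embeddings
print(wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# typically [('queen', ...)]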
Strengths of VSM
1. Simplicity:
o Easy to compute and interpret basic term-document matrices.
2. Semantic Generalization:
o Techniques like LSA or embeddings capture latent relationships beyond word surface
forms.
3. Efficiency:
o Pre-trained word embeddings reduce the need for building models from scratch.
Limitations of VSM
1. Sparsity:
o Classical term-document matrices are sparse, requiring dimensionality reduction for
effective use.
2. Lack of Context:
o Words have fixed meanings, ignoring context.
o Example: "bank" (financial institution vs riverbank).
3. Synonymy and Polysemy:
o Hard to resolve without advanced techniques.
4. Scalability:
o Computationally expensive for large datasets without dimensionality reduction.
Modern Advances
1. Dynamic Embeddings:
o Contextual embeddings (BERT, GPT) address fixed-meaning limitations by considering sentence context (see the sketch after this list).
o Examples:
▪ In "He went to the bank to withdraw money," "bank" is associated with financial
institutions.
▪ In "The boat docked near the bank of the river," "bank" is associated with a
riverbank.
2. Sentence and Document Embeddings:
o Representing entire sentences or documents as single vectors.
o Example: Sentence-BERT, Universal Sentence Encoder.
3. Graph-Based VSMs:
o Represent relationships as graphs (e.g., word graphs, knowledge graphs).
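A sketch of the "bank" example using Hugging Face Transformers; bert-base-uncased is an assumed (but common) model choice, and the printed similarity value will vary:
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

def bank_vector(sentence):
    # Return the contextual embedding of the token 'bank' in this sentence
    inputs = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0].tolist())
    return hidden[tokens.index('bank')]

v1 = bank_vector('He went to the bank to withdraw money.')
v2 = bank_vector('The boat docked near the bank of the river.')
print(torch.cosine_similarity(v1, v2, dim=0))  # noticeably below 1.0: the two senses differ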
Example: Training Word2Vec Embeddings
• Given a sentence, the skip-gram model predicts the surrounding words (context) for a given target word.
• Result: embeddings in which similar words are geometrically close.
from gensim.models import Word2Vec
# Sample sentences
sentences = [
['king', 'queen', 'man', 'woman'],
['cat', 'dog', 'animal', 'pet'],
['car', 'bus', 'vehicle', 'transport']
]
# Train skip-gram Word2Vec (sg=1; the gensim default sg=0 trains CBOW instead)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1, workers=4)
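Once trained, the embeddings can be queried as shown below; on such a tiny toy corpus the scores are not meaningful, but the calls illustrate the usual gensim usage:
# Nearest neighbours and pairwise similarity from the trained embeddings
print(model.wv.most_similar('king', topn=2))
print(model.wv.similarity('cat', 'dog'))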
Word vector representations, also known as word embeddings, are dense, continuous, and low-
dimensional vector representations of words. They are foundational to many Natural Language
Processing (NLP) tasks, enabling models to capture semantic and syntactic relationships between
words.
Why Word Embeddings?
1. Semantic Similarity:
o Words with similar meanings have similar vector representations.
o Example: "king" and "queen" are closer in vector space than "king" and "car."
2. Overcome Sparsity:
o Traditional methods like one-hot encoding create high-dimensional, sparse vectors
where most entries are zero.
o Word embeddings reduce dimensionality and improve efficiency.
3. Generalization:
o Embeddings capture word relationships beyond exact matches, enabling models to
generalize better across tasks.
Approaches to Learning Word Vectors
1. Count-Based Approaches: build vectors from global co-occurrence statistics (e.g., term-document matrices, LSA).
2. Prediction-Based Approaches: learn vectors by predicting words from their contexts (e.g., Word2Vec's skip-gram and CBOW).
Popular Word Embedding Models
1. Word2Vec
• Developed by Google.
• Two architectures:
o Skip-Gram: Predicts surrounding words given a target word.
o CBOW (Continuous Bag of Words): Predicts a target word given its context.
• Captures relationships like:
o "king - man + woman ≈ queen"
2. GloVe
• Developed by Stanford.
• Uses word co-occurrence matrices to capture global statistics.
• Embeddings are optimized so that the dot product of two word vectors approximates the logarithm of their co-occurrence probability: $\text{word}_i \cdot \text{word}_j \approx \log P(\text{co-occurrence of } i, j)$.
3. FastText
• Developed by Facebook.
• Represents words as a sum of subword (character n-gram) embeddings.
• Effective for rare words and morphologically rich languages (see the sketch after the sample corpus below).
from gensim.models import Word2Vec

# Sample corpus
sentences = [
    ['king', 'queen', 'man', 'woman'],
    ['dog', 'cat', 'animal', 'pet'],
    ['car', 'bus', 'vehicle', 'transport']
]
# Train Word2Vec embeddings on the toy corpus (sg=1 selects skip-gram; sg=0 would select CBOW)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1, workers=4)
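FastText's subword behaviour can be sketched on the same toy corpus; this is an assumed illustration, not part of the original example, and the out-of-vocabulary query only works because FastText composes vectors from character n-grams:
from gensim.models import FastText

# FastText builds word vectors from character n-grams of each word
ft_model = FastText(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Even a word never seen during training gets a vector from its subwords
print(ft_model.wv['catdog'][:5])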
Evaluating Word Embeddings
1. Intrinsic Evaluation:
o Measures the quality of embeddings on linguistic tasks like:
▪ Word similarity (e.g., cosine similarity between "car" and "bus").
▪ Word analogy tasks (e.g., "man:king :: woman:queen").
2. Extrinsic Evaluation:
o Assesses the embeddings' impact on downstream tasks such as sentiment analysis,
machine translation, or text classification.
Contextual Embeddings
Unlike static embeddings (e.g., Word2Vec, GloVe), contextual embeddings represent words
differently depending on their usage in a sentence. For example:
• In "He went to the bank to withdraw money," "bank" refers to a financial institution.
• In "The boat docked near the bank of the river," "bank" refers to a riverbank.
Applications of Word Embeddings
1. Text Classification:
o Input embeddings are fed into classifiers for tasks like spam detection or sentiment
analysis.
2. Machine Translation:
o Models like Seq2Seq with attention mechanisms use embeddings for translating text.
3. Named Entity Recognition (NER):
o Identify and classify entities (e.g., names, dates, locations).
4. Question Answering:
o Power models like BERT for answering questions based on context.
5. Search and Information Retrieval:
o Rank documents based on semantic similarity with queries.
The Skip-Gram Model
The skip-gram model is one of the two Word2Vec architectures (Mikolov et al., 2013); it learns word embeddings by predicting context words from a target word.
Key Idea
The primary objective of the skip-gram model is to maximize the likelihood of predicting the
context (neighboring words) for a given target word.
For a given target word $w_t$, the skip-gram model maximizes the probability of the surrounding words within a context window of size $C$:
$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-C \le j \le C,\; j \ne 0} \log P(w_{t+j} \mid w_t)$$
Here, $T$ is the number of words in the training corpus, $C$ is the context window size, and $P(w_{t+j} \mid w_t)$ is the probability of observing context word $w_{t+j}$ given target word $w_t$.
Architecture
1. Input Layer:
o One-hot vector representation of the target word (size = vocabulary size $V$).
2. Embedding Layer:
o Transforms the one-hot vector into a dense vector of size $d$ (the embedding size).
o This layer is a matrix $W$ of size $V \times d$, where each row is the embedding of one word.
3. Output Layer:
o Produces probabilities for all words in the vocabulary using a softmax function:
$$P(w_{t+j} \mid w_t) = \frac{\exp(v_{w_{t+j}} \cdot v_{w_t})}{\sum_{w' \in V} \exp(v_{w'} \cdot v_{w_t})}$$
o Here, $v_{w_t}$ and $v_{w_{t+j}}$ are the embeddings of the target word and the context word, respectively.
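A toy numerical sketch of this softmax with a 4-word vocabulary and 3-dimensional embeddings (all values are random and purely illustrative):
import numpy as np

V, d = 4, 3                                    # vocabulary size and embedding size
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, d))                 # input (target-word) embeddings
W_out = rng.normal(size=(V, d))                # output (context-word) embeddings

target = 2                                     # index of the target word w_t
scores = W_out @ W_in[target]                  # v_{w'} . v_{w_t} for every w' in V
probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the vocabulary
print(probs, probs.sum())                      # P(w_{t+j} | w_t); probabilities sum to 1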
Loss Function
The training objective is the negative log-likelihood of the observed context words (softmax cross-entropy). Because computing the full softmax over a large vocabulary is expensive, it is typically approximated with negative sampling or hierarchical softmax.
Training
from gensim.models import Word2Vec
# Sample corpus
sentences = [
['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog'],
['king', 'queen', 'man', 'woman'],
['car', 'bus', 'vehicle', 'transport']
]
# Train the skip-gram model (sg=1 selects skip-gram)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1, workers=4)
Applications
1. Semantic Similarity:
o Measure how similar two words are based on their vector representations.
2. Word Analogy Tasks:
o Solve analogies using vector arithmetic (e.g., "man:king :: woman:queen").
3. NLP Tasks:
o Input for downstream tasks like text classification, machine translation, and question answering.
4. Pre-training:
o Skip-gram embeddings can initialize models for other NLP tasks to improve
performance.
The Continuous Bag of Words (CBOW) model is one of the architectures introduced with
Word2Vec (Mikolov et al., 2013) for learning word embeddings. CBOW predicts a target word
based on its surrounding context, which is the opposite of the Skip-Gram model.
Key Idea
The CBOW model predicts the target word $w_t$ from the words in its context window, by maximizing the probability of the target word given its surrounding context words:
$$\sum_{t=1}^{T} \log P(w_t \mid w_{t-C}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+C})$$
Where $w_t$ is the target word, the $w_{t \pm j}$ are its context words, and $C$ is the context window size.
Architecture
1. Input Layer:
o Inputs are one-hot encoded representations of the context words.
2. Embedding Layer:
o Transforms the one-hot vectors into dense vector representations (word
embeddings) using a shared embedding matrix.
3. Averaging Layer:
o Takes the average of the embeddings of the context words to represent the entire
context.
4. Output Layer:
o Uses a softmax function to predict the probability distribution over the vocabulary for the target word:
$$P(w_t \mid \text{context}) = \frac{\exp(v_{w_t} \cdot h)}{\sum_{w' \in V} \exp(v_{w'} \cdot h)}$$
▪ $h$ is the averaged context vector.
▪ $v_{w_t}$ is the embedding of the target word.
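The CBOW forward pass can be sketched in the same toy setting: average the context embeddings to get h, then apply the softmax over the vocabulary (values are random and illustrative):
import numpy as np

V, d = 4, 3
rng = np.random.default_rng(1)
W_in = rng.normal(size=(V, d))                 # embeddings of the context words
W_out = rng.normal(size=(V, d))                # output embeddings of candidate targets

context = [0, 1, 3]                            # indices of the context words
h = W_in[context].mean(axis=0)                 # averaged context vector h

scores = W_out @ h                             # v_{w_t} . h for every candidate target
probs = np.exp(scores) / np.exp(scores).sum()  # P(w_t | context) over the vocabulary
print(probs.argmax(), probs)                   # most probable target and the full distribution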
Loss Function
As with skip-gram, training minimizes the negative log-likelihood of the target word (softmax cross-entropy), usually approximated with negative sampling or hierarchical softmax for large vocabularies.
Training Process
from gensim.models import Word2Vec
# Sample corpus
sentences = [
['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog'],
['king', 'queen', 'man', 'woman'],
['car', 'bus', 'vehicle', 'transport']
]
# Train CBOW model (sg=0 selects CBOW, the gensim default)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0, workers=4)
Here, sg=0 selects the CBOW architecture (sg=1 would train skip-gram instead).
Advantages of CBOW
• Faster to train than skip-gram, since the context words are averaged and only one target is predicted per window.
• Works well for frequent words.
Disadvantages of CBOW
• Averaging the context embeddings discards word order within the window.
• Represents rare words less well than skip-gram.
Applications of CBOW
• Learning general-purpose word embeddings used as input for downstream tasks such as text classification, tagging, and retrieval, just as with skip-gram.
Word similarity is a key concept in Natural Language Processing (NLP) that measures how
closely related two words are, either semantically or syntactically. Word embeddings, such as
those learned by models like Word2Vec, GloVe, or BERT, are commonly used to evaluate and
apply word similarity.
Evaluating and Applying Word Similarity
1. Intrinsic Evaluation
Evaluates word embeddings by comparing their performance on specific linguistic tasks, such as
word similarity or word analogy, without involving downstream applications.
Word Similarity Datasets
• Purpose: these datasets consist of word pairs with human-assigned similarity scores. The task is to compute the similarity of the corresponding word embeddings and compare it to the human judgments.
• Popular Datasets:
o WordSim-353: 353 word pairs with similarity scores.
o SimLex-999: Focuses on distinguishing between similarity and association.
o MEN: Evaluates general word similarity and relatedness.
o RG-65: 65 word pairs for evaluating synonymy.
o Rare Words (RW): Measures performance on infrequent or domain-specific words.
Metrics
• Cosine Similarity: measures the cosine of the angle between two word vectors:
$$\text{cosine similarity} = \frac{v_1 \cdot v_2}{\|v_1\|\,\|v_2\|}$$
where $v_1$ and $v_2$ are the word vectors.
• Spearman’s Rank Correlation: Compares the ranked similarity scores predicted by embeddings
with human-annotated scores.
Example in Python
from scipy.spatial.distance import cosine
from scipy.stats import spearmanr
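A sketch completing the example: compare embedding-based cosine similarities with human similarity scores using Spearman's rank correlation. The word pairs, random vectors, and human scores below are assumptions for illustration only:
import numpy as np
from scipy.spatial.distance import cosine
from scipy.stats import spearmanr

# Toy embeddings (in practice these would come from a trained model such as Word2Vec)
rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=50) for w in ['car', 'bus', 'king', 'banana']}

# Hypothetical human similarity judgments for three word pairs
pairs = [('car', 'bus'), ('king', 'car'), ('king', 'banana')]
human_scores = [0.9, 0.2, 0.1]

# Cosine similarity = 1 - cosine distance
model_scores = [1 - cosine(vectors[a], vectors[b]) for a, b in pairs]

# Spearman's rank correlation between model rankings and human rankings
corr, _ = spearmanr(model_scores, human_scores)
print(model_scores, corr)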
2. Extrinsic Evaluation
• Examples:
o Text classification (e.g., sentiment analysis).
o Machine translation.
o Named entity recognition (NER).
The quality of word embeddings is measured by their contribution to the performance of these
tasks.
3. Synonym Detection
• Find the words whose vectors are most similar to a given word (see the worked example at the end of this list).
4. Text Similarity
• Compare larger text units (e.g., phrases, sentences, or documents) by averaging or combining word vectors (a short averaging sketch follows the synonym example below).
o Example: measure the similarity between "I love programming" and "Coding is my passion."
5. Machine Translation
• Align semantically similar words across languages using bilingual or multilingual embeddings.
o Example: Map the French word "roi" (king) to its English counterpart "king."
6. Sentiment Analysis
7. Chatbots and Intent Matching
• Identify user intents and match them to predefined responses by measuring the similarity between input and response vectors.
8. Recommender Systems
• Suggest items based on semantic similarity between user preferences and available options.
o Example: Recommend books similar to "The Lord of the Rings" by comparing
descriptions.
9. Knowledge Graph Construction
Example: Synonym Detection with Word2Vec
from gensim.models import Word2Vec

# Sample corpus
sentences = [
    ['king', 'queen', 'man', 'woman'],
    ['car', 'vehicle', 'automobile', 'transport'],
    ['happy', 'joyful', 'cheerful', 'content']
]
# Train embeddings and query the nearest neighbours of 'happy'
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
print("Synonyms for 'happy':", model.wv.most_similar('happy', topn=3))
Output (scores are illustrative; actual values will vary with training):
Synonyms for 'happy': [('joyful', 0.95), ('cheerful', 0.93), ('content', 0.91)]
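Item 4 above (text similarity) is often approximated by averaging word vectors; the sketch below reuses the model trained in the synonym example (averaging ignores word order, so this is only a rough measure):
import numpy as np

def sentence_vector(words, w2v_model):
    # Average the vectors of the words that are in the model's vocabulary
    vecs = [w2v_model.wv[w] for w in words if w in w2v_model.wv]
    return np.mean(vecs, axis=0)

v1 = sentence_vector(['happy', 'king'], model)
v2 = sentence_vector(['joyful', 'queen'], model)
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))  # cosine similarity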
Challenges in Word Similarity
1. Polysemy:
o A word may have multiple meanings (e.g., "bank").
o Solution: Use contextual embeddings (e.g., BERT) to represent words dynamically based
on context.
2. Domain Dependence:
o Word similarities may vary across domains (e.g., "cell" in biology vs. technology).
o Solution: Train embeddings on domain-specific corpora.
3. Rare Words:
o Rare or out-of-vocabulary (OOV) words may lack meaningful embeddings.
o Solution: Use models like FastText, which consider subword information.