NLP_Notes
Tokenization:
Tokenization is the process of breaking down a text into smaller units called "tokens." These
tokens can be words, subwords, characters, or even sentences, depending on the level of
tokenization. Tokenization is a crucial step in natural language processing (NLP) as it prepares
the text for further processing, such as analysis, modelling, or understanding.
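As a rough sketch, word- and sentence-level tokenization can be done with NLTK (assuming the library and its "punkt" tokenizer data are installed):

```python
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download("punkt")  # one-time download of the tokenizer models

text = "Tokenization breaks text into tokens. It prepares text for further processing."

print(word_tokenize(text))  # word-level tokens; punctuation becomes separate tokens
print(sent_tokenize(text))  # sentence-level tokens
```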
Lemmatization:
Lemmatization is the process of reducing a word to its canonical or dictionary form (known
as the lemma) based on its intended meaning. It considers the context and grammatical
form of the word.
"running" → "run"
"better" → "good"
"bigger" → "big"
Pros: More accurate than stemming, preserves the original meaning of the word.
Cons: More computationally intensive and slower, requires access to a language
model or dictionary.
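As a rough sketch with NLTK's WordNetLemmatizer (assuming the "wordnet" corpus is downloaded); the part-of-speech argument ("v" for verb, "a" for adjective) tells the lemmatizer which dictionary form to look up:

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")  # one-time download of the WordNet dictionary

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("better", pos="a"))   # good
print(lemmatizer.lemmatize("bigger", pos="a"))   # big
```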
Stemming:
Definition: Stemming is the process of reducing a word to its base or root form,
often by removing suffixes. The result may not be a real word but a "stem" that
represents the root idea.
How It Works: Stemming algorithms, like the Porter Stemmer or Snowball Stemmer,
use heuristic rules to chop off common word endings (e.g., "-ing," "-ed," "-s").
Example:
"running" → "run"
"happiness" → "happi"
"bigger" → "bigg"
Stop Words:
Stop words are common words that carry little meaning on their own and are often filtered
out before further processing.
English Stop Words: "the," "is," "in," "and," "to," "of," "a," "for," "on," "with," "by,"
etc.
Other Languages: Different languages have their own sets of stop words, like "le,"
"la," "et" in French, or "der," "die," "und" in German.
Why remove stop words:
Text Simplification: Removing stop words can simplify the text and reduce noise,
making it easier to focus on the meaningful words.
Dimensionality Reduction: It helps reduce the number of tokens, making text data
more manageable, especially in tasks like text classification or clustering.
Improved Performance: In some NLP tasks, filtering out stop words can improve the
performance of algorithms by reducing the amount of unnecessary information.
When to keep stop words:
Sentiment Analysis: In some cases, stop words might carry sentiment (e.g., "not" in
"not good").
Phrase Detection: If the analysis requires understanding of specific phrases or idioms
where stop words are important.
Contextual Understanding: For models that need a deep understanding of context,
like in language translation or generative models.
Customizing stop-word lists:
Task-Specific: Some projects may require customizing the list of stop words, adding
or removing certain words based on the specific needs of the analysis.
Domain-Specific: In certain domains, words that are usually stop words might be
essential (e.g., "law" in legal texts).
Implementation:
Most NLP libraries, like NLTK, SpaCy, and others, provide predefined lists of stop words and
functions to remove them from text data.
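For example, a rough sketch of stop-word removal with NLTK's predefined English list (assuming the "stopwords" and "punkt" data are downloaded):

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords")
nltk.download("punkt")

stop_words = set(stopwords.words("english"))

text = "The cat sat on the mat with the dog"
tokens = word_tokenize(text.lower())
filtered = [t for t in tokens if t not in stop_words]

print(filtered)  # ['cat', 'sat', 'mat', 'dog']
```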
One-Hot Encoding:
It is a technique used in Natural Language Processing (NLP) to convert categorical data, such
as words or tokens, into a numerical format that can be used by machine learning models. In
the context of NLP, one-hot encoding transforms text data into a binary vector
representation.
How One-Hot Encoding Works:
1. Vocabulary Creation: First, a vocabulary of all unique words (or tokens) in the text corpus
is created. Each word in this vocabulary is assigned a unique index.
2. Binary Vector Representation: For each word in the text, a binary vector is created, where
the length of the vector is equal to the size of the vocabulary. In this vector, all elements are
set to 0 except for the element corresponding to the word's index, which is set to 1.
Example:
Suppose you have a sentence: `"I like cats"`.
Step 1: Build Vocabulary:
Vocabulary: ["I", "like", "cats"]
Indexes: {"I": 0, "like": 1, "cats": 2}
Step 2: Binary Vector Representation:
"I" → [1, 0, 0]
"like" → [0, 1, 0]
"cats" → [0, 0, 1]
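A small hand-rolled sketch of the same idea (the `one_hot_encode` helper is hypothetical, just for illustration):

```python
def one_hot_encode(tokens):
    # Map each unique token to an index, preserving first-seen order.
    vocab = {token: idx for idx, token in enumerate(dict.fromkeys(tokens))}
    vectors = {}
    for token, idx in vocab.items():
        vec = [0] * len(vocab)  # all zeros ...
        vec[idx] = 1            # ... except a 1 at the token's index
        vectors[token] = vec
    return vocab, vectors

vocab, vectors = one_hot_encode("I like cats".split())
print(vocab)    # {'I': 0, 'like': 1, 'cats': 2}
print(vectors)  # {'I': [1, 0, 0], 'like': [0, 1, 0], 'cats': [0, 0, 1]}
```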
Bag of Words:
Bag of Words represents each document as a vector of word counts over a fixed vocabulary.
Frequency Counts:
Each element in the vector represents the number of times a word appears in the
document. This way, the sentence `"The cat sat on the mat"` can be represented as a vector
of counts.
Example:
Suppose you have the following two sentences:
Sentence 1: `"The cat sat on the mat"`
Sentence 2: `"The dog sat on the log"`
1. Vocabulary (lowercased):
["the", "cat", "sat", "on", "mat", "dog", "log"]
2. Vector Representation:
Sentence 1: [2, 1, 1, 1, 1, 0, 0] (since "the" appears twice, and "dog" and "log" do not
appear)
Sentence 2: [2, 0, 1, 1, 0, 1, 1] (since "the" appears twice, and "cat" and "mat" do not
appear)
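The same counts can be reproduced with scikit-learn's CountVectorizer; a minimal sketch (it lowercases by default, and its vocabulary is sorted alphabetically, so the column order differs from the hand-built example above):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The cat sat on the mat",
    "The dog sat on the log",
]

vectorizer = CountVectorizer()        # lowercases and tokenizes by default
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
# ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
print(counts.toarray())
# [[1 0 0 1 1 1 2]
#  [0 1 1 0 1 1 2]]
```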
Key Points:
Order Doesn't Matter: In Bag of Words, the order of words in the document is not
considered, hence the name "bag" of words.
Frequency-Based: It only considers the presence (or absence) and frequency of words in a
document.
Simplicity: The model is easy to implement and understand, making it a popular starting
point for text analysis.
Limitations:
1. No Contextual Understanding: Bag of Words does not capture the context or meaning of
words. For example, "not happy" and "happy" would have similar vectors despite having
opposite meanings.
2. High Dimensionality: If the vocabulary is large, the resulting vectors can be high-
dimensional and sparse, leading to inefficiency in storage and computation.
3. Does Not Capture Semantic Relationships: Words with similar meanings (e.g., "cat" and
"feline") will have completely different vectors, even though they are semantically related.
Alternatives:
Due to its limitations, Bag of Words is often replaced or supplemented by more advanced
techniques such as:
TF-IDF (Term Frequency-Inverse Document Frequency): Adjusts word frequencies by their
importance across documents.
Word Embeddings (e.g., Word2Vec, GloVe): Captures semantic relationships between words
by representing them as dense vectors in a lower-dimensional space.
Conclusion:
Bag of Words is a fundamental technique in NLP for converting text into numerical features
for machine learning models. While it is simple and intuitive, it may not capture the full
complexity of language, making it suitable for basic tasks but often supplemented by more
sophisticated approaches in advanced applications.
TF-IDF (Term Frequency-Inverse Document Frequency):
1. Term Frequency (TF):
The number of times a term appears in a document, divided by the total number of terms in
that document.
Example: If the term "cat" appears 3 times in a document that has 100 words, the term
frequency for "cat" is 3/100 = 0.03.
2. Inverse Document Frequency (IDF):
A measure of how rare a term is across the corpus, typically computed as log(N / n), where N
is the total number of documents and n is the number of documents that contain the term.
3. TF-IDF Score:
The TF-IDF score is the product of TF and IDF, providing a measure of the importance of a
word in a document relative to the entire corpus.
Example Calculation:
Suppose you have the following 3 documents:
Document 1: `"The cat sat on the mat"`
Document 2: `"The dog sat on the log"`
Document 3: `"The cat chased the dog"`
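The hand calculation is left out of these notes, but as a rough sketch the scores for these three documents can be computed with scikit-learn's TfidfVectorizer (scikit-learn uses a smoothed IDF and L2-normalises each row, so the numbers differ slightly from the plain log(N / n) formula):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "The cat chased the dog",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the learned vocabulary (alphabetical)
print(tfidf.toarray().round(2))            # one row per document, one column per term
```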
Limitations:
1. Vocabulary Size: Like Bag of Words, TF-IDF can result in high-dimensional vectors if the
vocabulary is large.
2. Does Not Capture Semantics: TF-IDF does not consider the meaning of words or their
relationships. For example, synonyms are treated as different words.
3. Sparse Representation: TF-IDF vectors are typically sparse, meaning most elements are
zero, which can be inefficient for storage and computation.
Conclusion:
TF-IDF is a powerful and commonly used technique in NLP that enhances the Bag of Words
model by considering both the frequency of words in a document and their importance
across the corpus. It provides a better representation of the importance of terms in text data
and is widely used in text classification, search engines, and document similarity tasks.
Word2Vec:
Word2Vec is a popular technique in Natural Language Processing (NLP) used to create vector
representations of words. Developed by researchers at Google, it represents words in a
continuous vector space such that semantically similar words are mapped to nearby points
in that space.
Example of Word2Vec Usage:
Imagine we have a sentence: "The quick brown fox jumps over the lazy dog".
CBOW Model: The model tries to predict the word "fox" given the context words
"quick brown jumps over".
Skip-Gram Model: The model tries to predict the context words "quick", "brown",
"jumps", and "over" given the target word "fox".
[Figure: CBOW and Skip-gram model architectures]
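As a rough sketch, both variants can be trained with Gensim on a toy corpus of tokenized sentences (real embeddings need far more data; the corpus below is just an assumption for illustration):

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["the", "cat", "sat", "on", "the", "mat"],
]

# sg=0 trains CBOW, sg=1 trains Skip-gram (Gensim 4.x API).
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=0)

print(model.wv["fox"].shape)         # (50,) -- a dense vector for "fox"
print(model.wv.most_similar("dog"))  # nearest neighbours in this toy vector space
```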
CBOW:
To demonstrate the Continuous Bag of Words (CBOW) model with a window size of 5, let's
take the sentence:
Sentence: "The quick brown fox jumps over the lazy dog."
Let's take an example using the word "fox" as the target word.
Example 1:
Target word: "fox"
Context words: ["The", "quick", "brown", "jumps", "over"]
Here, the 5 context words are the words surrounding "fox" within the window.
Example 2:
Now, let's take another target word, "jumps":
Target word: "jumps"
All surrounding words: ["The", "quick", "brown", "fox", "over", "the", "lazy", "dog"]
With a window size of 5, however, only the 5 words around "jumps" are considered.
So, Context Words: ["quick", "brown", "fox", "over", "the"]
This means the model learns from the input words ["quick", "brown", "fox", "over", "the"] to
predict the target output "jumps".
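A small plain-Python sketch of how such (context, target) training pairs can be generated, using a symmetric window of 2 on each side for brevity (the `cbow_pairs` helper is hypothetical):

```python
def cbow_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        # Context = up to `window` words on each side of the target.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

tokens = "The quick brown fox jumps over the lazy dog".split()
for context, target in cbow_pairs(tokens)[:3]:
    print(context, "->", target)
# ['quick', 'brown'] -> The
# ['The', 'brown', 'fox'] -> quick
# ['The', 'quick', 'fox', 'jumps'] -> brown
```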
Key Points:
1. Window Size: Defines how many words around the target word are used as context.
In these examples, the 5 words surrounding the target word form the context.
2. Training: The CBOW model uses the context words to predict the target word during
training. Over time, it learns embeddings such that words appearing in similar
contexts get similar vectors.
Summary:
With a window size of 5, CBOW learns to predict the target word from the surrounding 5
words. This captures the broader context in which words are used, making the embeddings
more contextually aware.
Skip-gram:
The Skip-gram model is another variant of Word2Vec that works opposite to the Continuous
Bag of Words (CBOW) model. While CBOW predicts the target word given a set of context
words, Skip-gram predicts the context words given a target word.
Example Sentence:
"The quick brown fox jumps over the lazy dog."
During training, the Skip-gram model tries to maximize the probability of predicting the
context words given the target word.
For example, with a window of size 2 around the target word "fox", the training pairs are
("fox", "quick"), ("fox", "brown"), ("fox", "jumps"), and ("fox", "over").
So for the pair ("fox", "quick"), the model adjusts the word embeddings so that the word
"quick" is likely to appear in the context of "fox".
Each of these pairs is used to train the Skip-gram model, where the target word is used to
predict its context words.
Key Points:
1. Target Word Focus: Skip-gram focuses on predicting context words from a given
target word, rather than the other way around.
2. Training: The model uses (target, context) pairs to adjust word vectors so that words
frequently appearing together in similar contexts will have similar vectors.
3. Flexibility: Skip-gram can be more effective with smaller datasets or for capturing
rare words, as it places more emphasis on the target word's context.
Summary:
Skip-gram is a predictive model that tries to predict the context words surrounding a
target word.
Input: Target word.
Output: Context words.
Example: Given the target word "fox," the model predicts the surrounding context
words "quick," "brown," "jumps," and "over" within a window of size 2
This model is particularly useful in NLP for capturing word relationships in large text corpora
and generating meaningful word embeddings.
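A small plain-Python sketch (mirroring the CBOW sketch above, with a hypothetical `skipgram_pairs` helper) of the (target, context) pairs Skip-gram trains on, using a window of 2:

```python
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        # One (target, context) pair for every neighbour inside the window.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "The quick brown fox jumps over the lazy dog".split()
print([pair for pair in skipgram_pairs(tokens) if pair[0] == "fox"])
# [('fox', 'quick'), ('fox', 'brown'), ('fox', 'jumps'), ('fox', 'over')]
```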
Bigrams and Trigrams:
Bigrams and trigrams are sequences of two and three consecutive words or tokens,
respectively, in a text. They are useful in natural language processing for tasks like text
analysis, language modelling, and more.
E.g., the sentence "The quick brown fox jumps over the lazy dog." is tokenized into the words
"The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", from which bigrams and
trigrams can be formed.
For the sentence "I love learning about AI":
Bigrams:
1. I love
2. love learning
3. learning about
4. about AI
Trigrams:
1. I love learning
2. love learning about
3. learning about AI
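A minimal sketch using NLTK's ngrams helper on the same example sentence:

```python
from nltk import ngrams

tokens = "I love learning about AI".split()

bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))

print(bigrams)
# [('I', 'love'), ('love', 'learning'), ('learning', 'about'), ('about', 'AI')]
print(trigrams)
# [('I', 'love', 'learning'), ('love', 'learning', 'about'), ('learning', 'about', 'AI')]
```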