
Natural Language Processing

Tokenization:
Tokenization is the process of breaking a text down into smaller units called "tokens." These
tokens can be words, subwords, characters, or even sentences, depending on the level of
tokenization. Tokenization is a crucial step in natural language processing (NLP) because it
prepares the text for further processing, such as analysis, modelling, or understanding.
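
As a quick illustration, the following is a minimal sketch of word- and sentence-level
tokenization using NLTK (assuming the library and its "punkt" tokenizer data are installed):

```python
# Minimal tokenization sketch using NLTK (assumes `pip install nltk`
# and that the "punkt" tokenizer data has been downloaded).
import nltk
nltk.download("punkt", quiet=True)  # newer NLTK versions may also need "punkt_tab"

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Tokenization is a crucial step. It prepares text for further processing."

print(sent_tokenize(text))  # ['Tokenization is a crucial step.', 'It prepares text for further processing.']
print(word_tokenize(text))  # ['Tokenization', 'is', 'a', 'crucial', 'step', '.', 'It', ...]
```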

Lemmatization:

Lemmatization is the process of reducing a word to its canonical or dictionary form (known
as the lemma) based on its intended meaning. It considers the context and grammatical
form of the word.

 How It Works: Lemmatization requires a detailed understanding of the word's
morphology and often involves looking up words in a dictionary or using a language
model to determine the correct lemma.
 Example:
 "running" → "run"
 "better" → "good"
 "bigger" → "big"
 Pros: More accurate than stemming; preserves the original meaning of the word.
 Cons: More computationally intensive and slower; requires access to a language
model or dictionary.
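
A minimal sketch of lemmatization with NLTK's WordNetLemmatizer (assuming the "wordnet"
corpus has been downloaded; note that the part of speech must be supplied to get results
such as "better" → "good"):

```python
# Minimal lemmatization sketch using NLTK's WordNetLemmatizer
# (assumes the "wordnet" corpus is available).
import nltk
nltk.download("wordnet", quiet=True)

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'
print(lemmatizer.lemmatize("bigger", pos="a"))   # 'big'
```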

Stemming

 Definition: Stemming is the process of reducing a word to its base or root form,
often by removing suffixes. The result may not be a real word but a "stem" that
represents the root idea.
 How It Works: Stemming algorithms, like the Porter Stemmer or Snowball Stemmer,
use heuristic rules to chop off common word endings (e.g., "-ing," "-ed," "-s").
 Example:

 "running" → "run"
 "happiness" → "happi"
 "bigger" → "bigg"

 Pros: Fast and straightforward; reduces the dimensionality of the text data.
 Cons: Can be overly aggressive, leading to stems that are not real words or that lose
some meaning.
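
A minimal sketch of stemming with NLTK's PorterStemmer (exact outputs vary between
stemming algorithms, which is part of why stems are sometimes not real words):

```python
# Minimal stemming sketch using NLTK's PorterStemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "happiness", "studies"]:
    print(word, "->", stemmer.stem(word))
# running -> run
# happiness -> happi
# studies -> studi
```
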
Stop Words:
Stop words are common words in a language that are often filtered out during text
processing and analysis because they are considered to have little value in understanding
the main content or meaning of the text. These words include articles, prepositions,
conjunctions, and other frequently used terms that don't contribute much to the context or
information.
Examples of Stop Words:

 English Stop Words: "the," "is," "in," "and," "to," "of," "a," "for," "on," "with," "by,"
etc.
 Other Languages: Different languages have their own sets of stop words, like "le,"
"la," "et" in French, or "der," "die," "und" in German.

Why Remove Stop Words?

 Text Simplification: Removing stop words can simplify the text and reduce noise,
making it easier to focus on the meaningful words.
 Dimensionality Reduction: It helps reduce the number of tokens, making text data
more manageable, especially in tasks like text classification or clustering.
 Improved Performance: In some NLP tasks, filtering out stop words can improve the
performance of algorithms by reducing the amount of unnecessary information.

When to Keep Stop Words:

 Sentiment Analysis: In some cases, stop words might carry sentiment (e.g., "not" in
"not good").
 Phrase Detection: If the analysis requires understanding of specific phrases or idioms
where stop words are important.
 Contextual Understanding: For models that need a deep understanding of context,
like in language translation or generative models.

Custom Stop Word Lists:

 Task-Specific: Some projects may require customizing the list of stop words, adding
or removing certain words based on the specific needs of the analysis.
 Domain-Specific: In certain domains, words that are usually stop words might be
essential (e.g., "law" in legal texts).

Implementation:

Most NLP libraries, such as NLTK and spaCy, provide predefined lists of stop words and
functions to remove them from text data.
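
For example, a minimal sketch of stop-word removal using NLTK's built-in English list
(assuming the "stopwords" corpus has been downloaded):

```python
# Minimal stop-word removal sketch using NLTK's English stop-word list.
import nltk
nltk.download("stopwords", quiet=True)

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = "The cat sat on the mat with the dog".split()
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # ['cat', 'sat', 'mat', 'dog']
```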
One-Hot Encoding:
It is a technique used in Natural Language Processing (NLP) to convert categorical data, such
as words or tokens, into a numerical format that can be used by machine learning models. In
the context of NLP, one-hot encoding transforms text data into a binary vector
representation.
How One-Hot Encoding Works:
1. Vocabulary Creation: First, a vocabulary of all unique words (or tokens) in the text corpus
is created. Each word in this vocabulary is assigned a unique index.
2. Binary Vector Representation: For each word in the text, a binary vector is created, where
the length of the vector is equal to the size of the vocabulary. In this vector, all elements are
set to 0 except for the element corresponding to the word's index, which is set to 1.
Example:
Suppose you have a sentence: `"I like cats"`.
Step 1: Build Vocabulary:
Vocabulary: ["I", "like", "cats"]
Indexes: {"I": 0, "like": 1, "cats": 2}

Step 2: One-Hot Encoding:
"I" → [1, 0, 0]
"like" → [0, 1, 0]
"cats" → [0, 0, 1]
Advantages:
Simple to Implement: One-hot encoding is straightforward and easy to apply.
No Assumption of Order: Since it assigns a binary value to each word, it doesn’t impose any
assumption of order or importance among words.
Limitations:
1. High Dimensionality: For large vocabularies, one-hot encoding results in very high-
dimensional vectors, which can lead to computational inefficiency and memory constraints.
2. No Semantic Meaning: One-hot encoded vectors do not capture any semantic
relationship between words. For example, "cat" and "dog" may be similar in meaning, but
their one-hot encodings will be completely different.
Alternatives:
Due to its limitations, one-hot encoding is often replaced by more advanced techniques in
NLP, such as:
Word Embeddings (e.g., Word2Vec, GloVe): These methods convert words into dense
vectors that capture semantic relationships.
TF-IDF: A statistical method that reflects the importance of a word in a document relative to
a corpus.
Conclusion:
One-hot encoding is a fundamental technique in NLP that transforms words into a numerical
format suitable for machine learning models. However, due to its limitations in terms of
scalability and lack of semantic understanding, it is often used in simpler models or as a
starting point before moving to more sophisticated representations.

Bag of Words (BoW):


It is a simple and commonly used technique in Natural Language Processing (NLP) to
represent text data in a numerical format. The main idea behind the Bag of Words model is
to treat a text document as a collection of individual words (ignoring grammar and word
order), and then represent it as a vector based on the frequency of each word in the
document.
How Bag of Words Works:
1. Tokenization:
Split the text into individual words (tokens). For example, the sentence `"The cat sat on the
mat"` would be tokenized into the words ["The", "cat", "sat", "on", "the", "mat"].
2. Vocabulary Creation:
Create a vocabulary of all unique words in the text corpus. In our example, the vocabulary
would be ["The", "cat", "sat", "on", "the", "mat"].
3. Vector Representation:
For each document (or sentence), create a vector where each element corresponds to a
word in the vocabulary. The value of each element is the frequency (or count) of the word in
the document.
For example, for the sentence `"The cat sat on the mat"`, the vector might look like `[1, 1, 1,
1, 1, 1]`, assuming we treat "The" and "the" as different words. If we ignore case, it could be
[2, 1, 1, 1, 1].

4. Frequency Counts:
Each element in the vector represents the number of times a word appears in the
document. This way, the sentence `"The cat sat on the mat"` can be represented as a vector
of counts.
Example:
Suppose you have the following two sentences:
Sentence 1: `"The cat sat on the mat"`
Sentence 2: `"The dog sat on the log"`

1. Vocabulary (ignoring case):
["the", "cat", "sat", "on", "mat", "dog", "log"]

2. Vector Representation:
Sentence 1: [2, 1, 1, 1, 1, 0, 0] (since "the" appears twice, and "dog" and "log" do not
appear)
Sentence 2: [2, 0, 1, 1, 0, 1, 1] (since "the" appears twice, and "cat" and "mat" do not
appear)
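
A minimal sketch of the same idea using scikit-learn's CountVectorizer (it lowercases text by
default and sorts the vocabulary alphabetically, so the column order differs from the
hand-built example above):

```python
# Minimal Bag of Words sketch using scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat on the mat", "The dog sat on the log"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse document-term count matrix

print(vectorizer.get_feature_names_out())  # ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
print(X.toarray())
# [[1 0 0 1 1 1 2]
#  [0 1 1 0 1 1 2]]
```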

Key Points:
Order Doesn't Matter: In Bag of Words, the order of words in the document is not
considered, hence the name "bag" of words.
Frequency-Based: It only considers the presence (or absence) and frequency of words in a
document.
Simplicity: The model is easy to implement and understand, making it a popular starting
point for text analysis.

Limitations:
1. No Contextual Understanding: Bag of Words does not capture the context or meaning of
words. For example, "not happy" and "happy" would have similar vectors despite having
opposite meanings.
2. High Dimensionality: If the vocabulary is large, the resulting vectors can be high-
dimensional and sparse, leading to inefficiency in storage and computation.
3. Does Not Capture Semantic Relationships: Words with similar meanings (e.g., "cat" and
"feline") will have completely different vectors, even though they are semantically related.
Alternatives:
Due to its limitations, Bag of Words is often replaced or supplemented by more advanced
techniques such as:
TF-IDF (Term Frequency-Inverse Document Frequency): Adjusts word frequencies by their
importance across documents.
Word Embeddings (e.g., Word2Vec, GloVe): Captures semantic relationships between words
by representing them as dense vectors in a lower-dimensional space.
Conclusion:
Bag of Words is a fundamental technique in NLP for converting text into numerical features
for machine learning models. While it is simple and intuitive, it may not capture the full
complexity of language, making it suitable for basic tasks but often supplemented by more
sophisticated approaches in advanced applications.

TF-IDF (Term Frequency-Inverse Document Frequency)


It is a numerical statistic used in Natural Language Processing (NLP) and information retrieval
to evaluate how important a word is to a document in a collection or corpus. It is an
improvement over the simple Bag of Words model by not only considering the frequency of
words in a document but also their significance across the entire dataset.
Components of TF-IDF:
1. Term Frequency (TF):
This measures how frequently a word (term) occurs in a document. The basic idea is that if a
word appears frequently in a document, it should be given more importance.

Example: If the term "cat" appears 3 times in a document that has 100 words, the term
frequency for "cat" is 3/100 = 0.03

2. Inverse Document Frequency (IDF):


This measures how important a word is across all documents in the corpus. The idea is that
if a word appears in many documents, it’s likely less important for differentiating between
documents.
Example: If the term "the" appears in all 10 documents of a corpus, its IDF will be low.
However, if "cat" appears in only 2 documents, its IDF will be higher.

3. TF-IDF Score:
The TF-IDF score is the product of the two: TF-IDF(term, document) = TF(term, document) ×
IDF(term). It measures the importance of a word in a document relative to the entire corpus.

Example Calculation:
Suppose you have the following 3 documents:
Document 1: `"The cat sat on the mat"`
Document 2: `"The dog sat on the log"`
Document 3: `"The cat chased the dog"`

1. TF Calculation for "cat":
- In Document 1: TF("cat") = 1/6
- In Document 2: TF("cat") = 0/6
- In Document 3: TF("cat") = 1/5

2. IDF Calculation for "cat":
"cat" appears in 2 out of 3 documents.
IDF("cat") = log10(3/2) ≈ 0.176

3. TF-IDF for "cat" in Document 1:
TF-IDF("cat", Document 1) = 1/6 × 0.176 ≈ 0.029
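
A minimal sketch that reproduces this hand calculation in plain Python (using raw term
frequency and a base-10 IDF; library implementations such as scikit-learn's TfidfVectorizer
apply additional smoothing and normalization, so their numbers differ):

```python
# Minimal TF-IDF sketch mirroring the worked example above.
import math

docs = [
    "The cat sat on the mat".lower().split(),
    "The dog sat on the log".lower().split(),
    "The cat chased the dog".lower().split(),
]

def tf(term, doc):
    # raw term frequency: count of the term divided by document length
    return doc.count(term) / len(doc)

def idf(term, docs):
    # base-10 inverse document frequency
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log10(len(docs) / n_containing)

print(round(tf("cat", docs[0]), 3))                     # 0.167  (1/6)
print(round(idf("cat", docs), 3))                       # 0.176  (log10(3/2))
print(round(tf("cat", docs[0]) * idf("cat", docs), 3))  # 0.029
```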
Key Points:
Importance: TF-IDF gives a high score to words that are frequent in a specific document but
infrequent across the corpus, making it useful for identifying keywords.
Use Case: It’s often used in text mining, information retrieval, and as a feature extraction
technique for machine learning models in NLP.
Normalization: TF-IDF vectors are often normalized to prevent bias towards longer
documents.

Limitations:
1. Vocabulary Size: Like Bag of Words, TF-IDF can result in high-dimensional vectors if the
vocabulary is large.
2. Does Not Capture Semantics: TF-IDF does not consider the meaning of words or their
relationships. For example, synonyms are treated as different words.
3. Sparse Representation: TF-IDF vectors are typically sparse, meaning most elements are
zero, which can be inefficient for storage and computation.
Conclusion:
TF-IDF is a powerful and commonly used technique in NLP that enhances the Bag of Words
model by considering both the frequency of words in a document and their importance
across the corpus. It provides a better representation of the importance of terms in text data
and is widely used in text classification, search engines, and document similarity tasks.

Word2Vec:
Word2Vec is a popular technique in Natural Language Processing (NLP) used to create vector
representations of words. Developed by researchers at Google, it represents words in a
continuous vector space such that semantically similar words are mapped to nearby points
in that space.
Example of Word2Vec Usage:
Imagine we have a sentence: "The quick brown fox jumps over the lazy dog".
 CBOW Model: The model tries to predict the word "fox" given the context words
"quick brown jumps over".
 Skip-Gram Model: The model tries to predict the context words "quick" and "brown"
given the target word "fox".

 CBOW (Continuous Bag of Words)
 Skip-gram
CBOW:
To demonstrate the Continuous Bag of Words (CBOW) model with a window size of 5, let's
take the sentence:

Sentence: "The quick brown fox jumps over the lazy dog."

Step 1: Define the Window Size

 Window size: 5 (This means the model considers up to 5 words before and after the target
word as the context.)

Step 2: Define the Context and Target Words


In CBOW, the context is the words surrounding the target word. With a window size of 5,
you'll use up to 5 words on either side of the target word.

Let's take an example using the word "fox" as the target word.

Example 1:
 Target word: "fox"
 Context words: ["The", "quick", "brown", "jumps", "over", "the", "lazy", "dog"]

Here, the context is up to 5 words on either side of "fox." Because "fox" is near the start of
the sentence, only 3 words are available before it, while all 5 words after it are included.

Step 3: CBOW Training

In CBOW, the model takes the context words as input and tries to predict the target word.

Input: ["The", "quick", "brown", "jumps", "over", "the", "lazy", "dog"] Output: "fox"


The model aggregates (e.g., by averaging) the embeddings of the context words and uses
this to predict the target word, "fox."

Example 2:
Now, let's take another target word, "jumps":
 Target word: "jumps"
 Context words: ["The", "quick", "brown", "fox", "over", "the", "lazy", "dog"]
With a window size of 5, up to 5 words on each side of "jumps" are considered; here only 4
words exist on each side, so all 8 are used.
This means the model learns to take the input words ["The", "quick", "brown", "fox", "over",
"the", "lazy", "dog"] and predict the target output "jumps". A small sketch of this pair
construction follows below.
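
A minimal sketch of how these (context, target) pairs can be built with a window of 5; this
illustrates only the data preparation, not the neural network that learns the embeddings:

```python
# Minimal CBOW pair-construction sketch: up to `window` words on each side
# of the target form its context.
tokens = "The quick brown fox jumps over the lazy dog".split()
window = 5

pairs = []
for i, target in enumerate(tokens):
    context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
    pairs.append((context, target))

print(pairs[3])
# (['The', 'quick', 'brown', 'jumps', 'over', 'the', 'lazy', 'dog'], 'fox')
```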

Key Points:
1. Window Size: Defines how many words to include on either side of the target word
as context. Here, it's up to 5 words on either side, giving up to 10 context words
surrounding the target word.
2. Training: The CBOW model uses the context words to predict the target word during
training. Over time, it learns embeddings such that words appearing in similar
contexts get similar vectors.

Summary:
With a window size of 5, CBOW learns to predict the target word from up to 5 surrounding
words on each side. This captures the broader context in which words are used, making the
embeddings more contextually aware.

Skip-gram:
The Skip-gram model is another variant of Word2Vec that works opposite to the Continuous
Bag of Words (CBOW) model. While CBOW predicts the target word given a set of context
words, Skip-gram predicts the context words given a target word.

Example Sentence:
"The quick brown fox jumps over the lazy dog."

Step 1: Define the Window Size


 Window size: The window size determines the number of context words to consider
around the target word. Let’s use a window size of 2 for this example. This means the
model considers 2 words before and after the target word as the context.

Step 2: Generate (Target, Context) Pairs


In Skip-gram, the model takes a target word and predicts the context words within the
specified window size.
For example, consider the target word "fox":
 Target word: "fox"
 Context words (with a window size of 2): ["quick", "brown", "jumps", "over"]
The Skip-gram model will generate the following (target, context) pairs:
1. ("fox", "quick")
2. ("fox", "brown")
3. ("fox", "jumps")
4. ("fox", "over")

Step 3: Training the Skip-gram Model

During training, the Skip-gram model tries to maximize the probability of predicting the
context words given the target word.
So for the pair ("fox", "quick"), the model adjusts the word embeddings so that the word
"quick" is likely to appear in the context of "fox".

Example of Skip-gram with a Window Size of 2:

Let's generate the (target, context) pairs for all the words in the sentence (near the edges of
the sentence, fewer than 2 words are available on one side):
 Target: "The", Context: ["quick", "brown"]
 Target: "quick", Context: ["The", "brown", "fox"]
 Target: "brown", Context: ["The", "quick", "fox", "jumps"]
 Target: "fox", Context: ["quick", "brown", "jumps", "over"]
 Target: "jumps", Context: ["brown", "fox", "over", "the"]
 Target: "over", Context: ["fox", "jumps", "the", "lazy"]
 Target: "the", Context: ["jumps", "over", "lazy", "dog"]
 Target: "lazy", Context: ["over", "the", "dog"]
 Target: "dog", Context: ["the", "lazy"]

Each of these pairs is used to train the Skip-gram model, where the target word tries to
predict its context words.
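
A minimal sketch that generates these (target, context) pairs with a window of 2; again, this
shows only the data preparation, not the embedding training itself:

```python
# Minimal Skip-gram pair-construction sketch: each target word is paired with
# every word within `window` positions of it.
tokens = "The quick brown fox jumps over the lazy dog".split()
window = 2

pairs = []
for i, target in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pairs.append((target, tokens[j]))

print([p for p in pairs if p[0] == "fox"])
# [('fox', 'quick'), ('fox', 'brown'), ('fox', 'jumps'), ('fox', 'over')]
```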

Key Points:
1. Target Word Focus: Skip-gram focuses on predicting context words from a given
target word, rather than the other way around.
2. Training: The model uses (target, context) pairs to adjust word vectors so that words
frequently appearing together in similar contexts will have similar vectors.
3. Flexibility: Skip-gram can be more effective with smaller datasets or for capturing
rare words, as it places more emphasis on the target word's context.
Summary:
 Skip-gram is a predictive model that tries to predict the context words surrounding a
target word.
 Input: Target word.
 Output: Context words.
 Example: Given the target word "fox," the model predicts the surrounding context
words "quick," "brown," "jumps," and "over" within a window of size 2.
This model is particularly useful in NLP for capturing word relationships in large text corpora
and generating meaningful word embeddings.
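
In practice, both architectures are available in libraries such as gensim. A minimal sketch
(assuming gensim is installed; the tiny corpus here is only for illustration, so the resulting
vectors will not be meaningful):

```python
# Minimal Word2Vec training sketch with gensim (assumes `pip install gensim`).
from gensim.models import Word2Vec

sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "cat", "sat", "on", "the", "mat"],
]

# sg=0 selects the CBOW architecture, sg=1 selects Skip-gram.
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1)

print(model.wv["fox"].shape)         # (50,) dense vector for "fox"
print(model.wv.most_similar("fox"))  # nearest words by cosine similarity
```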

Uni-Gram: A unigram is a type of n-gram where n = 1. In natural language processing (NLP),
unigrams are simply individual words or tokens from a given text.

E.g.: "The quick brown fox jumps over the lazy dog."
Unigrams: "The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"

Bigrams and trigrams are sequences of two and three consecutive words or tokens,
respectively, in a text. They are useful in natural language processing for tasks like text
analysis, language modelling, and more.

Example sentence: "I love learning about AI"

Bigrams:
1. I love
2. love learning
3. learning about
4. about AI

Trigrams:
1. I love learning
2. love learning about
3. learning about AI
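
A minimal sketch of extracting unigrams, bigrams, and trigrams with NLTK's ngrams helper:

```python
# Minimal n-gram extraction sketch using NLTK.
from nltk import ngrams

tokens = "I love learning about AI".split()

print(list(ngrams(tokens, 1)))  # unigrams: [('I',), ('love',), ('learning',), ...]
print(list(ngrams(tokens, 2)))  # bigrams:  [('I', 'love'), ('love', 'learning'), ...]
print(list(ngrams(tokens, 3)))  # trigrams: [('I', 'love', 'learning'), ...]
```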
