
Natural Language Processing

Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) and computer
science that focuses on the interaction between computers and human (natural) languages. The
goal of NLP is to enable computers to understand, interpret, and generate human language in a
way that is both meaningful and useful.

Topics to be covered, with examples

Step 1

a) Tokenization:
Tokenization is the process of splitting raw text into smaller units called tokens. Tokens can be words, phrases, or even characters, depending on the specific application and the granularity required. Tokenization is crucial because it transforms raw text into a structured format that can be analyzed by NLP algorithms.

Types of Tokenization
1. Word Tokenization:
○ Splits text into individual words.
○ Example: "This is an example." → ["This", "is", "an", "example", "."]
2. Sentence Tokenization:
○ Splits text into individual sentences.
○ Example: "This is the first sentence. This is the second sentence." → ["This is the
first sentence.", "This is the second sentence."]
3. Character Tokenization:
○ Splits text into individual characters.
○ Example: "Hello" → ["H", "e", "l", "l", "o"]

Code:
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')

text = "This is an example."
tokens = word_tokenize(text)
print(tokens)
# Output: ['This', 'is', 'an', 'example', '.']

b) Stopwords:-

Stopwords are common words that are often removed during preprocessing because they occur frequently but carry minimal meaningful information that contributes to the analysis or understanding of the text. Examples of stopwords in English include words like "and," "the," "is," "in," and "at."

Examples of Common Stopwords in English

● Articles: "a", "an", "the"


● Conjunctions: "and", "or", "but"
● Prepositions: "in", "on", "at", "by"
● Pronouns: "he", "she", "it", "they"
● Auxiliary Verbs: "is", "am", "are", "was", "were"

Challenges with Stopwords

1. Context Dependence: Some words may be stopwords in one context but meaningful in
another. For example, "who" in "Who is coming?" vs. "The doctor who saved lives."
2. Language Variability: Different languages have different sets of stopwords, and these
need to be curated carefully for each language.
3. Custom Stopwords: Depending on the specific application, additional domain-specific
stopwords may need to be added to the default list.

Stopword removal can be performed with several libraries, including NLTK, spaCy, and scikit-learn.

Code:
This example uses only NLTK.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')
text = "This is an example sentence demonstrating stopwords."
stop_words = set(stopwords.words('english'))
words = word_tokenize(text)
filtered_sentence = [w for w in words if not w.lower() in stop_words]

print(filtered_sentence)
# Output: ['example', 'sentence', 'demonstrating', 'stopwords', '.']

c) Stemming (root word or base word):-

Stemming is the process of reducing words to their base or root form. The goal of stemming is to group together different forms of a word so they can be analyzed as a single item. This process involves cutting off suffixes and prefixes from words, often resulting in the removal of inflectional endings.

Examples of Stemming

● Connecting → "connect"
● Connected → "connect"
● Connection → "connect"

Advantages:

● Efficiency: Stemming is computationally inexpensive and quick to apply.


● Reduced Vocabulary Size: By reducing different forms of a word to a common root,
stemming helps decrease the size of the vocabulary.

Disadvantages:
● Over-Stemming: The algorithm may cut too much, leading to incorrect root forms (e.g.,
"university" → "univers").
● Ambiguity: Different words with different meanings might be reduced to the same root
(e.g., "better" → "bet").

For words such as "university" the stemmed root ("univers") loses its meaning, while for words such as "connected" the root ("connect") remains a meaningful word, as the sketch below illustrates.
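Code (a minimal sketch using NLTK's PorterStemmer, one of several stemmers NLTK provides, to reproduce the examples above; the printed output is the expected result):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words with a meaningful root, plus one over-stemmed case
words = ["connecting", "connected", "connection", "university"]
print([stemmer.stem(w) for w in words])
# Expected output: ['connect', 'connect', 'connect', 'univers']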

d) Lemmatization :-

Lemmatization is a text normalization technique in Natural Language Processing (NLP) that transforms words to their base or dictionary form, known as the lemma. Unlike stemming, which simply cuts off suffixes to reduce words to their root form, lemmatization considers the context and the part of speech of the word, ensuring that the base form is a valid word in the language.

Examples of Lemmatization

● Running → "run"
● Better → "good"
● Connected → "connect"
● Brought → "bring"
Advantages:

● Accuracy: Lemmatization provides more accurate results by considering the word's context and part of speech.
● Meaning Preservation: Ensures that the base form is a valid word in the language,
preserving the meaning of the original word.

Disadvantages:

● Computationally Intensive: Lemmatization is more computationally expensive and slower than stemming because it requires additional context and morphological analysis.
● Complexity: Requires more complex setup and resources, such as dictionaries and
part-of-speech taggers.
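Code (a minimal sketch using NLTK's WordNetLemmatizer; the part-of-speech tag is supplied explicitly so the examples above resolve correctly, and the WordNet data must be downloaded first):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('omw-1.4')  # needed on some NLTK versions

lemmatizer = WordNetLemmatizer()

# pos: 'v' = verb, 'a' = adjective (the default is noun)
print(lemmatizer.lemmatize("running", pos="v"))    # run
print(lemmatizer.lemmatize("better", pos="a"))     # good
print(lemmatizer.lemmatize("connected", pos="v"))  # connect
print(lemmatizer.lemmatize("brought", pos="v"))    # bring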

Use cases:-

Stemming and lemmatization are text normalization techniques in NLP. Stemming reduces
words to their root forms by cutting off suffixes, useful for search engines and text classification.
Lemmatization converts words to their dictionary base forms considering context, enhancing
accuracy in tasks like machine translation and question answering systems.
Session 2

Step 2

Basic terminology used in NLP

1) Corpus (paragraph):-

A corpus is like a big database of text that helps computers learn and understand
human language by providing lots of examples. It's used in various applications like
sentiment analysis, translation, and question answering.
2) Documents (sentence):-

Documents refer to the individual pieces of text that make up a corpus. Each document is a separate unit that can range from a single sentence to an entire book, depending on the application and scope of the corpus.

3) Vocabulary (unique words):-

A vocabulary is the set of all unique words or tokens that appear in a given corpus. It represents the collection of terms that an NLP model can recognize and process.

4) Words:-

Words are the fundamental units of language that carry meaning. They are the building blocks used to construct sentences, convey ideas, and communicate effectively.

e) One Hot Encoding:-

One-hot encoding converts categorical data into a numerical format that can be provided to machine learning algorithms to improve their performance. This encoding scheme is particularly useful for representing text data or any categorical features in a way that is suitable for numerical computation.

Example:-
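A minimal sketch using scikit-learn's OneHotEncoder on a small hypothetical three-word vocabulary (the sparse_output argument assumes scikit-learn 1.2 or newer; older versions use sparse=False):

from sklearn.preprocessing import OneHotEncoder

# Each category becomes one column; categories are sorted alphabetically
words = [["cat"], ["dog"], ["fish"], ["cat"]]

encoder = OneHotEncoder(sparse_output=False)
one_hot = encoder.fit_transform(words)

print(encoder.categories_)  # [array(['cat', 'dog', 'fish'], dtype=object)]
print(one_hot)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]]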
Advantages and disadvantages:-

Pros

● Simplicity: Easy to understand and implement.


● Non-Numeric Data: Allows algorithms to work with categorical data.

Cons

● High Dimensionality: For a large number of categories, the resulting vectors can
become very large and sparse.
● No Semantic Meaning: One-hot vectors do not capture any relationship between categories (e.g., "cat" and "dog" are not more similar to each other than to "fish").

f) Bag of words:-
The "bag of words" (BoW) is a fundamental technique used in natural language processing
(NLP) and information retrieval (IR) to represent text data. Here's a detailed explanation:

Definition
The bag of words model is a way of representing text data where:
● Each document (or piece of text) is represented as a bag (collection) of words.
● Word order and grammar are ignored, but multiplicity (the number of times each word
occurs) is retained.
Example: consider two documents, Document 1 = "The cat sat on the mat" and Document 2 = "The dog played in the garden".
Tokenization (lowercased):
● Document 1 tokens: ["the", "cat", "sat", "on", "the", "mat"]
● Document 2 tokens: ["the", "dog", "played", "in", "the", "garden"]
Vocabulary:
● Unique words: ["the", "cat", "sat", "on", "mat", "dog", "played", "in", "garden"]
Vectorization:
● Represent each document as a vector of word counts over the vocabulary:
○ Document 1: [2, 1, 1, 1, 1, 0, 0, 0, 0]
■ Explanation: "the" appears twice; "cat", "sat", "on", and "mat" each appear once. The other words (dog, played, in, garden) do not appear, hence 0.

Limitations
● Semantic Information: BoW ignores the meaning or context of words and focuses solely on their occurrence, so similarity measures such as cosine similarity over BoW vectors do not capture semantic similarity.
● Out-of-Vocabulary Words: Words not seen when the vocabulary was built cannot be represented.
● Sparse Representation: With large vocabularies or datasets, BoW can lead to
high-dimensional and sparse vectors.
● Order Sensitivity: It doesn't capture the sequence or proximity of words, which can be
crucial in certain tasks.

Code :-
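A minimal sketch using scikit-learn's CountVectorizer (an assumption; the section does not prescribe a library). CountVectorizer lowercases text and orders the vocabulary alphabetically, so the column order differs from the hand-worked example above:

from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "The cat sat on the mat",
    "The dog played in the garden",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())
# ['cat' 'dog' 'garden' 'in' 'mat' 'on' 'played' 'sat' 'the']
print(bow.toarray())
# [[1 0 0 0 1 1 0 1 2]
#  [0 1 1 1 0 0 1 0 2]]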

N-grams:-

N-grams are contiguous sequences of n items (or tokens) from a given sequence of text
or speech. In the context of natural language processing (NLP), these items are
typically words, but they can also be characters. N-grams are used to capture linguistic
patterns and relationships between words or characters within a text.
Example: Consider the sentence "The cat sat on the mat."

● Unigrams (1-grams): ["The", "cat", "sat", "on", "the", "mat"]


● Bigrams (2-grams): ["The cat", "cat sat", "sat on", "on the", "the mat"]
● Trigrams (3-grams): ["The cat sat", "cat sat on", "sat on the", "on the mat"]
● Four-grams (4-grams): ["The cat sat on", "cat sat on the", "sat on the mat"]

Code:

import nltk
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

nltk.download('punkt')

# Example text
text = "The cat sat on the mat."

# Tokenize the text into words
tokens = word_tokenize(text)

# Generate bigrams
bigrams = list(ngrams(tokens, 2))

# Print the generated bigrams
for bigram in bigrams:
    print(bigram)

Session 3

TF-IDF(Term frequency - inverse Document Frequency )

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a statistical measure used in information retrieval and text mining to evaluate the importance of a word in a document relative to a collection of documents (corpus). It combines two metrics:

1. Term Frequency (TF): Measures how frequently a term (word) appears in a document.
2. Inverse Document Frequency (IDF): Measures how important a term is by considering how frequently it appears across all documents in the corpus.


Intuition

Term Frequency (TF): Words that appear more frequently in a document are more important to that document.

Inverse Document Frequency (IDF): Words that appear in fewer documents are more distinctive and important.
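One common formulation of the two metrics (library implementations such as scikit-learn add smoothing terms, so their exact numbers differ slightly):

$$\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad \mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}, \qquad \text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)$$

where f_{t,d} is the count of term t in document d, N is the number of documents in the corpus, and df(t) is the number of documents containing t.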
Advantages and disadvantages

Advantages of TF-IDF

1. Simplicity: Easy to implement and understand.


2. Relevance Weighting: Highlights important words by reducing the impact of
common words.
3. Efficiency: Computationally efficient for small to medium-sized datasets.
4. Baseline Performance: Provides strong baseline performance for NLP tasks.
5. Independence: Does not require pre-trained models or external resources.

Disadvantages of TF-IDF

1. Semantic Limitations: Does not capture word meanings or context.


2. High Dimensionality: Produces high-dimensional vectors, especially for large
vocabularies.
3. Sparse Vectors: Results in sparse representations, which can be inefficient.
4. Phrase Ignorance: Fails to capture phrase-level information.
5. Document Length Bias: Can be biased towards longer documents.
6. Static Nature: Does not adapt to new data or evolving language usage.
Code:-

To keep only the most frequent terms as features, use the max_features hyperparameter, as shown in the sketch below.
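A minimal sketch using scikit-learn's TfidfVectorizer (an assumption; the section does not prescribe a library), with max_features limiting the vocabulary to the most frequent terms:

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The cat sat on the mat",
    "The dog played in the garden",
]

# max_features=5 keeps only the 5 most frequent terms across the corpus
vectorizer = TfidfVectorizer(max_features=5)
tfidf = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray())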

Summary

● BoW focuses solely on the raw frequency of words within documents without considering the significance of the words across the corpus.
● TF-IDF improves upon BoW by assigning weights to words based on their frequency in individual documents and their rarity across the corpus, making it more effective in identifying important terms and reducing the influence of common words.
Session 4

Word embedding

Word embedding is a technique that converts words into vectors. Word embeddings are pre-trained or learned dense vector representations of words. They capture semantic relationships between words in a continuous vector space, enabling words with similar meanings to have similar vector representations.

Key Characteristics
1. Pre-trained Models:
○ Examples include Word2Vec, GloVe, and FastText.
○ These models are trained on large corpora to produce fixed word vectors.
2. Fixed Vectors:
○ Each word in the vocabulary is represented by a fixed-size vector, typically
with dimensions ranging from 50 to 300.
3. Static Representations:
○ Once trained, the word vectors do not change. They are static
embeddings.
4. Usage:
○ Word embeddings can be used directly in various NLP tasks.
○ They are often used to initialize embedding layers in neural networks.

Example
Pre-trained GloVe embeddings:
● "king" might be represented as [0.1, 0.3, 0.5, ...]
● "queen" might be represented as [0.2, 0.4, 0.6, ...]
Word2Vec
A famous example of the relationships Word2Vec captures: king - man + woman ≈ queen.

Word2Vec is a popular technique used in natural language processing (NLP) to represent words
as continuous vector representations. It was developed by a team led by Tomas Mikolov at
Google in 2013. Unlike traditional methods like Bag of Words (BoW) and TF-IDF, Word2Vec
captures semantic relationships between words, allowing words with similar meanings to have
similar vector representations.
Key Concepts
1. Vector Representation:
○ Each word is represented as a vector in a high-dimensional space.
○ Words with similar meanings are located closer to each other in this vector
space.
2. Neural Network Models:
○ Word2Vec uses shallow neural network models to learn these vector
representations.
○ Two primary architectures are used: Continuous Bag of Words (CBOW) and
Skip-gram

Architectures
1) Continuous Bag of Words (CBOW):
a) Predicts a target word based on its surrounding context words.
b) Given a context (e.g., "The cat ___ on the mat"), CBOW tries to predict
the missing word "sat."
2) Skip-gram:
a) Predicts context words based on a target word.
b) Given a target word (e.g., "cat"), Skip-gram tries to predict the surrounding
context words ("The", "sat", "on", "the", "mat").

CBOW (continuous bag of words)


Continuous Bag of Words (CBOW) is one of the two primary architectures used in
Word2Vec for learning word embeddings. It focuses on predicting a target word based
on its surrounding context words.

Key Concepts
1. Context and Target Words:
○ Context words: The words surrounding a target word in a sentence.
○ Target word: The word to be predicted from the given context.
2. Objective:
○ The CBOW model aims to maximize the probability of predicting the target
word given its context words.
How CBOW Works

1. Input:
○ A context window of words surrounding a target word. For example, in the sentence "The cat sat on the mat," if "sat" is the target word and the context window size is 2 on either side, the context words are ["The", "cat", "on", "the"].
2. Model Architecture:
○ Input Layer: Takes the one-hot encoded vectors of the context words.
○ Hidden Layer: A single hidden layer that averages or sums the input
vectors to create a combined context vector.
○ Output Layer: Produces a probability distribution over the vocabulary,
indicating the likelihood of each word being the target word.
3. Training Process:
○ Forward Pass: The context words are averaged and passed through the
network to predict the target word.
○ Backpropagation: The model adjusts weights to minimize the error
between the predicted and actual target words.
Skip-gram

Skip-gram Model
The Skip-gram model is one of the fundamental architectures used in Word2Vec
for learning word embeddings. It predicts the context words surrounding a target
word, capturing the relationships between words based on their co-occurrence in
a corpus.

Key Concepts
1. Target and Context Words:
○ Target Word: The central word used to predict its surrounding words.
○ Context Words: The words that appear within a specified window
around the target word.
2. Objective:
○ The main goal is to predict context words given a target word,
maximizing the likelihood of the context words appearing given the
target word.

How Skip-gram Works


1. Input:
○ A target word from the text. For example, in the sentence "The quick
brown fox jumps over the lazy dog," if "fox" is the target word, the context
words could be ["quick", "brown", "jumps", "over"] with a context window
size of 2.
2. Model Architecture:
○ Input Layer: Takes the one-hot encoded vector of the target word.
○ Hidden Layer: Projects this one-hot encoded vector into a continuous
vector space (embedding space).
○ Output Layer: Predicts the probability distribution over the entire
vocabulary for each context word position.
3. Training Process:
○ Forward Pass: The target word is passed through the network to predict
the context words.
○ Backpropagation: The model adjusts weights to minimize the error in
predicting the context words.
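Both architectures can be trained with the gensim library (an assumption; the section does not prescribe a library). A minimal sketch assuming gensim 4.x, where the sg flag selects the architecture; a corpus this small only illustrates the API:

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "played", "in", "the", "garden"],
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
]

# sg=0 selects CBOW, sg=1 selects Skip-gram; window is the context size
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow_model.wv["cat"].shape)              # (50,)
print(skipgram_model.wv.most_similar("cat", topn=3))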
AvgWord2Vec

Average Word2Vec is a technique used to represent entire sentences or documents as single vectors by averaging the Word2Vec embeddings of the individual words. This approach leverages the pre-trained word vectors generated by Word2Vec models to create a fixed-size vector representation for larger text units.

Key Concepts

1. Word Embeddings:
○ Each word in the vocabulary is represented by a continuous vector (word
embedding) learned by the Word2Vec model.
2. Sentence/Document Representation:
○ To represent a sentence or document, the embeddings of the individual
words are averaged to create a single vector that captures the overall
meaning of the text.

How It Works

1. Pre-trained Word2Vec Model:
○ A Word2Vec model is trained on a large corpus of text to learn word embeddings.
2. Tokenize Sentence/Document:
○ The text is tokenized into individual words.
3. Retrieve Word Embeddings:
○ For each word in the sentence or document, retrieve its corresponding
embedding from the pre-trained Word2Vec model.
4. Average Embeddings:
○ Compute the average of these word embeddings to form a single vector
representing the entire sentence or document.
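A minimal sketch of the averaging step (the helper avg_word2vec below is illustrative, not a library function; gensim 4.x is assumed for the Word2Vec model):

import numpy as np
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "played", "in", "the", "garden"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)

def avg_word2vec(tokens, model):
    # Keep only tokens present in the model's vocabulary
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

doc_vector = avg_word2vec(["the", "cat", "sat", "on", "the", "mat"], model)
print(doc_vector.shape)  # (50,)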

Advantages

1. Simplicity: Easy to implement and compute.


2. Fixed-size Representation: Converts variable-length text into fixed-size vectors
suitable for machine learning models.
3. Semantic Capture: Leverages pre-trained word embeddings that capture
semantic relationships between words.

Disadvantages

1. Loss of Context: Averages word embeddings, which might lead to loss of specific contextual information.
2. Equal Weighting: Treats all words equally, without considering their importance
or position in the sentence.

Applications

1. Text Classification: Average Word2Vec vectors can be used as features for classification tasks.
2. Clustering: Useful for clustering similar sentences or documents.
3. Semantic Search: Helps in finding semantically similar documents or sentences.

For code reference:-


https://github.com/krishnaik06/NLP-Live/blob/main/Day%205-%20NLP%20Word2vec%20And%20AvgWord2vec.ipynb
Embedding layers
An embedding layer is a type of neural network layer used to convert categorical data
(typically words) into dense vector representations (embeddings). This is particularly
useful in natural language processing (NLP) tasks, where words or tokens need to be
transformed into numerical vectors that can be used as input to a machine learning
model.

Key Concepts
1. Dense Vectors:
○ Words or tokens are represented as dense vectors of fixed size.
○ These vectors capture semantic meanings and relationships between
words.
2. Learnable Parameters:
○ The embedding layer contains weights that are learned during the training
process.
○ The vectors are adjusted to optimize the performance of the downstream
task.

How Embedding Layers Work


1. Initialization:
○ Embedding layers are initialized with random weights or pre-trained
embeddings (e.g., Word2Vec, GloVe).
2. Input:
○ The input to an embedding layer is typically a sequence of integer indices,
where each index corresponds to a word or token in the vocabulary.
3. Lookup:
○ For each input index, the embedding layer performs a lookup to retrieve
the corresponding vector.
4. Output:
○ The output is a sequence of dense vectors representing the input words or
tokens.

Example
Consider a simple sentence "I love NLP" and a small vocabulary of size 5:
● Vocabulary: {"I": 0, "love": 1, "NLP": 2, "data": 3, "science": 4}
● Sentence: [0, 1, 2] ("I love NLP" mapped to indices)
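A minimal sketch of an embedding layer applied to this sentence, using PyTorch as one possible framework (an assumption; tf.keras.layers.Embedding works analogously). The embedding dimension of 8 is chosen arbitrarily:

import torch
import torch.nn as nn

# Vocabulary size 5, embedding dimension 8; the weights are learned during training
embedding = nn.Embedding(num_embeddings=5, embedding_dim=8)

# "I love NLP" mapped to indices with the vocabulary above
sentence = torch.tensor([0, 1, 2])

vectors = embedding(sentence)
print(vectors.shape)  # torch.Size([3, 8])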
