Natural Language Processing
Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) and computer
science that focuses on the interaction between computers and human (natural) languages. The
goal of NLP is to enable computers to understand, interpret, and generate human language in a
way that is both meaningful and useful.
Things to be covered
Step 1
a) Tokenization:
Tokenization is the process of breaking raw text into smaller units called tokens. Tokens can be
words, phrases, or even characters, depending on the specific application and the granularity
required. Tokenization is crucial because it transforms raw text into a structured format that can
be analyzed by NLP algorithms.
Types of Tokenization
1. Word Tokenization:
○ Splits text into individual words.
○ Example: "This is an example." → ["This", "is", "an", "example", "."]
2. Sentence Tokenization:
○ Splits text into individual sentences.
○ Example: "This is the first sentence. This is the second sentence." → ["This is the
first sentence.", "This is the second sentence."]
3. Character Tokenization:
○ Splits text into individual characters.
○ Example: "Hello" → ["H", "e", "l", "l", "o"]
Code:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
text = "This is an example."
tokens = word_tokenize(text)
print(tokens)
# Output: ['This', 'is', 'an', 'example', '.']
b) Stopwords:-
Stopwords are common words that are often removed during preprocessing because they occur
frequently but carry minimal meaningful information that contributes to the analysis or
understanding of the text. Examples of stopwords in English include words like "and," "the,"
"is," "in," and "at."
Points to keep in mind when handling stopwords:
1. Context Dependence: Some words may be stopwords in one context but meaningful in
another. For example, "who" in "Who is coming?" vs. "The doctor who saved lives."
2. Language Variability: Different languages have different sets of stopwords, and these
need to be curated carefully for each language.
3. Custom Stopwords: Depending on the specific application, additional domain-specific
stopwords may need to be added to the default list.
Code:
This example uses only NLTK.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')
text = "This is an example sentence demonstrating stopwords."
stop_words = set(stopwords.words('english'))
words = word_tokenize(text)
filtered_sentence = [w for w in words if not w.lower() in stop_words]
print(filtered_sentence)
# Output: ['example', 'sentence', 'demonstrating', 'stopwords', '.']
c) Stemming:-
Stemming reduces a word to its root form (stem) by cutting off suffixes.
Examples of Stemming
● Connecting → "connect"
● Connected → "connect"
● Connection → "connect"
Advantages:
Disadvantages:
● Over-Stemming: The algorithm may cut too much, leading to incorrect root forms (e.g.,
"university" → "univers").
● Ambiguity: Different words with different meanings might be reduced to the same root
(e.g., "better" → "bet").
In such cases, words with different meanings are reduced to the same root, and the original meaning is lost.
d) Lemmatization :-
Lemmatization reduces a word to its dictionary base form (lemma), taking its context and part of
speech into account.
Examples of Lemmatization
● Running → "run"
● Better → "good"
● Connected → "connect"
● Brought → "bring"
Advantages:
Disadvantages:
Use cases:-
Stemming and lemmatization are text normalization techniques in NLP. Stemming reduces
words to their root forms by cutting off suffixes, useful for search engines and text classification.
Lemmatization converts words to their dictionary base forms considering context, enhancing
accuracy in tasks like machine translation and question answering systems.
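Code:
A minimal sketch of both techniques using NLTK's PorterStemmer and WordNetLemmatizer (the word list is illustrative):
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')
words = ["connecting", "connected", "connection", "running"]
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
# Stemming: rule-based suffix stripping
print([stemmer.stem(w) for w in words])
# Output: ['connect', 'connect', 'connect', 'run']
# Lemmatization: dictionary base forms; the part-of-speech tag matters
print(lemmatizer.lemmatize("better", pos="a"))   # Output: good
print(lemmatizer.lemmatize("running", pos="v"))  # Output: run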
Session 2
Step 2
1) Corpus(paragraph):-
A corpus is like a big database of text that helps computers learn and understand
human language by providing lots of examples. It's used in various applications like
sentiment analysis, translation, and question answering.
2) Documents(sentence):-
A document is a single unit of text within a corpus, such as a sentence, a paragraph, or an
article.
3) Vocabulary(unique words):-
A vocabulary is a set of all unique words or tokens that appear in a given corpus. It
represents the collection of terms that an NLP model can recognize and process.
4) Words:-
words are the fundamental units of language that carry meaning. They are the
building blocks used to construct sentences, convey ideas, and communicate
effectively.
e) One-hot encoding:-
One-hot encoding is used to convert categorical data into a format that can be provided to
machine learning algorithms to improve their performance. Each category (or word) is
represented as a binary vector with a 1 in the position for that category and 0s everywhere else.
This encoding scheme is particularly useful for representing text data or any categorical features
in a way that is suitable for numerical computation.
example:-
For the vocabulary ["cat", "dog", "fish"]:
● "cat" → [1, 0, 0]
● "dog" → [0, 1, 0]
● "fish" → [0, 0, 1]
Advantages and disadvantages:-
Pros
Cons
● High Dimensionality: For a large number of categories, the resulting vectors can
become very large and sparse.
● No Semantic Meaning: One-hot vectors do not capture any relationship
between categories (e.g., "cat" and "dog" are not more similar to each other than
to "fish").
f) Bag of words:-
The "bag of words" (BoW) is a fundamental technique used in natural language processing
(NLP) and information retrieval (IR) to represent text data. Here's a detailed explanation:
Definition
The bag of words model is a way of representing text data where:
● Each document (or piece of text) is represented as a bag (collection) of words.
● Word order and grammar are ignored, but multiplicity (the number of times each word
occurs) is retained.
Tokenization:
● Document 1 tokens: ["The", "cat", "sat", "on", "the", "mat"]
● Document 2 tokens: ["The", "dog", "played", "in", "the", "garden"]
Vocabulary:
● Unique words (lowercased): ["the", "cat", "sat", "on", "mat", "dog", "played", "in", "garden"]
Vectorization:
● Represent each document as a vector of word counts over this vocabulary:
○ Document 1: [2, 1, 1, 1, 1, 0, 0, 0, 0]
■ Explanation: "the" appears twice; "cat", "sat", "on", and "mat" each appear
once. The other words (dog, played, in, garden) do not appear, hence 0.
○ Document 2: [2, 0, 0, 0, 0, 1, 1, 1, 1]
Limitations
● No Semantic Information: BoW ignores the meaning or context of words and focuses
solely on their occurrence, so similarity measures such as cosine similarity reflect only
word overlap.
● Out of Vocabulary: Words that were not in the vocabulary when it was built cannot be
represented.
● Sparse Representation: With large vocabularies or datasets, BoW can lead to
high-dimensional and sparse vectors.
● Order Sensitivity: It doesn't capture the sequence or proximity of words, which can be
crucial in certain tasks.
Code :-
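A minimal sketch using scikit-learn's CountVectorizer (an assumption; counting tokens by hand works just as well):
from sklearn.feature_extraction.text import CountVectorizer
documents = [
    "The cat sat on the mat",
    "The dog played in the garden",
]
# Build the vocabulary and count how often each word occurs in each document
# (CountVectorizer lowercases the text by default)
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)
print(vectorizer.get_feature_names_out())
print(bow_matrix.toarray())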
N-grams:-
N-grams are contiguous sequences of n items (or tokens) from a given sequence of text
or speech. In the context of natural language processing (NLP), these items are
typically words, but they can also be characters. N-grams are used to capture linguistic
patterns and relationships between words or characters within a text.
Example: Consider the sentence "The cat sat on the mat."
● Bigrams (n = 2): ("The", "cat"), ("cat", "sat"), ("sat", "on"), ("on", "the"), ("the", "mat")
● Trigrams (n = 3): ("The", "cat", "sat"), ("cat", "sat", "on"), ("sat", "on", "the"), ("on", "the", "mat")
Code:
import nltk
from nltk import word_tokenize
from nltk.util import ngrams
nltk.download('punkt')
# Example text
text = "The cat sat on the mat."
tokens = word_tokenize(text)
# Generate bigrams
bigrams = list(ngrams(tokens, 2))
print(bigrams)
# Output: [('The', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat'), ('mat', '.')]
Session 3
TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a statistical measure used in
information retrieval and text mining to evaluate the importance of a word in a document relative to a
collection of documents (the corpus).
1. Term Frequency (TF): Measures how frequently a term (word) appears in a document.
2. Inverse Document Frequency (IDF): Measures how important a term is by considering how frequently
it appears across the documents of the corpus.
The score for a term t in a document d is TF-IDF(t, d) = TF(t, d) × IDF(t), where IDF(t) = log(N / df(t)),
N is the total number of documents and df(t) is the number of documents that contain t.
Intuition:
● Term Frequency (TF): Words that appear more frequently in a document are more important.
● Inverse Document Frequency (IDF): Words that appear in fewer documents are more distinctive
and important.
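Code:
A minimal sketch using scikit-learn's TfidfVectorizer (an assumption; note that scikit-learn applies a smoothed variant of the IDF formula above):
from sklearn.feature_extraction.text import TfidfVectorizer
documents = [
    "The cat sat on the mat",
    "The dog played in the garden",
]
# Compute a TF-IDF weight for every word in every document
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))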
Advantages and disadvantages
Advantages of TF-IDF
Disadvantages of TF-IDF
Summary
● BoW focuses solely on the raw frequency of words within documents without
considering the significance of the words across the corpus.
● TF-IDF, in contrast, down-weights words that are common across many documents and
gives higher weight to words that are distinctive to a particular document.
Word embedding
Key Characteristics
1. Pre-trained Models:
○ Examples include Word2Vec, GloVe, and FastText.
○ These models are trained on large corpora to produce fixed word vectors.
2. Fixed Vectors:
○ Each word in the vocabulary is represented by a fixed-size vector, typically
with dimensions ranging from 50 to 300.
3. Static Representations:
○ Once trained, the word vectors do not change. They are static
embeddings.
4. Usage:
○ Word embeddings can be used directly in various NLP tasks.
○ They are often used to initialize embedding layers in neural networks.
Example
Pre-trained GloVe embeddings:
● "king" might be represented as [0.1, 0.3, 0.5, ...]
● "queen" might be represented as [0.2, 0.4, 0.6, ...]
Word2Vec
A famous example of the relationships Word2Vec captures: king - man + woman ≈ queen.
Word2Vec is a popular technique used in natural language processing (NLP) to represent words
as continuous vector representations. It was developed by a team led by Tomas Mikolov at
Google in 2013. Unlike traditional methods like Bag of Words (BoW) and TF-IDF, Word2Vec
captures semantic relationships between words, allowing words with similar meanings to have
similar vector representations.
Key Concepts
1. Vector Representation:
○ Each word is represented as a vector in a high-dimensional space.
○ Words with similar meanings are located closer to each other in this vector
space.
2. Neural Network Models:
○ Word2Vec uses shallow neural network models to learn these vector
representations.
○ Two primary architectures are used: Continuous Bag of Words (CBOW) and
Skip-gram
Architectures
1) Continuous Bag of Words (CBOW):
a) Predicts a target word based on its surrounding context words.
b) Given a context (e.g., "The cat ___ on the mat"), CBOW tries to predict
the missing word "sat."
2) Skip-gram:
a) Predicts context words based on a target word.
b) Given a target word (e.g., "cat"), Skip-gram tries to predict the surrounding
context words ("The", "sat", "on", "the", "mat").
Continuous Bag of Words (CBOW)
Key Concepts
1. Context and Target Words:
○ Context words: The words surrounding a target word in a sentence.
○ Target word: The word to be predicted from the given context.
2. Objective:
○ The CBOW model aims to maximize the probability of predicting the target
word given its context words.
How CBOW Works
1. Input:
○ A context window of words surrounding a target word. For example, in the
sentence "The cat sat on the mat," if "sat" is the target word, the context
words might be ["The", "cat", "on", "the", "mat"] with a context window size
of 2 on either side.
2. Model Architecture:
○ Input Layer: Takes the one-hot encoded vectors of the context words.
○ Hidden Layer: A single hidden layer that averages or sums the input
vectors to create a combined context vector.
○ Output Layer: Produces a probability distribution over the vocabulary,
indicating the likelihood of each word being the target word.
3. Training Process:
○ Forward Pass: The context words are averaged and passed through the
network to predict the target word.
○ Backpropagation: The model adjusts weights to minimize the error
between the predicted and actual target words.
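Code:
A minimal sketch of training Word2Vec with gensim on a toy corpus (the sentences are illustrative); setting sg=0 selects CBOW and sg=1 selects Skip-gram:
from gensim.models import Word2Vec
# Toy corpus: a list of tokenized sentences
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "played", "in", "the", "garden"],
    ["the", "cat", "played", "with", "the", "dog"],
]
# sg=0 -> CBOW, sg=1 -> Skip-gram
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
print(model.wv["cat"][:5])                  # first few dimensions of the "cat" vector
print(model.wv.most_similar("cat", topn=2)) # most similar words in the toy corpus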
Skip-gram Model
The Skip-gram model is one of the fundamental architectures used in Word2Vec
for learning word embeddings. It predicts the context words surrounding a target
word, capturing the relationships between words based on their co-occurrence in
a corpus.
Key Concepts
1. Target and Context Words:
○ Target Word: The central word used to predict its surrounding words.
○ Context Words: The words that appear within a specified window
around the target word.
2. Objective:
○ The main goal is to predict context words given a target word,
maximizing the likelihood of the context words appearing given the
target word.
Averaging Word Embeddings (Sentence/Document Representation)
Key Concepts
1. Word Embeddings:
○ Each word in the vocabulary is represented by a continuous vector (word
embedding) learned by the Word2Vec model.
2. Sentence/Document Representation:
○ To represent a sentence or document, the embeddings of the individual
words are averaged to create a single vector that captures the overall
meaning of the text.
How It Works
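Code:
A minimal sketch of averaging Word2Vec vectors to represent a sentence (the toy corpus and gensim model are illustrative):
import numpy as np
from gensim.models import Word2Vec
sentences = [["i", "love", "nlp"], ["nlp", "is", "fun"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)
def sentence_vector(tokens, model):
    # Average the vectors of the tokens that are in the model's vocabulary
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.wv.vector_size)
print(sentence_vector(["i", "love", "nlp"], model)[:5])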
Advantages
Disadvantages
Applications
Embedding Layer
Key Concepts
1. Dense Vectors:
○ Words or tokens are represented as dense vectors of fixed size.
○ These vectors capture semantic meanings and relationships between
words.
2. Learnable Parameters:
○ The embedding layer contains weights that are learned during the training
process.
○ The vectors are adjusted to optimize the performance of the downstream
task.
Example
Consider a simple sentence "I love NLP" and a small vocabulary of size 5:
● Vocabulary: {"I": 0, "love": 1, "NLP": 2, "data": 3, "science": 4}
● Sentence: [0, 1, 2] ("I love NLP" mapped to indices)
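Code:
A minimal sketch of an embedding layer using PyTorch's nn.Embedding (an assumption; Keras's Embedding layer behaves similarly). The vocabulary size matches the example above and the embedding dimension of 3 is arbitrary:
import torch
import torch.nn as nn
# Vocabulary of size 5, each index mapped to a 3-dimensional dense vector
embedding = nn.Embedding(num_embeddings=5, embedding_dim=3)
# "I love NLP" mapped to indices [0, 1, 2]
sentence = torch.tensor([0, 1, 2])
vectors = embedding(sentence)
print(vectors.shape)  # torch.Size([3, 3])
print(vectors)        # these weights are updated during training of the downstream task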