Module I NLP
Module 1
Introduction to NLP
Prepared by
Dr. Venkata Rami Reddy Ch
SCOPE
Syllabus
• Overview:
• Origins and challenges of NLP
• Need of NLP
• Preprocessing techniques:
Text Wrangling, Text cleansing, sentence splitter, tokenization, stemming, lemmatization, stop word removal, rare word removal, spell correction.
• Word Embeddings, Different Types:
One Hot Encoding, Bag of Words (BoW), TF-IDF
• Static word embeddings:
Word2vec, GloVe, FastText
Introduction
• NLP stands for Natural Language Processing, a field at the intersection of computer science, human language, and artificial intelligence.
• The goal is to enable machines to understand, interpret, generate, and respond to human
language in a way that is both meaningful and useful.
• NLP combines concepts from linguistics, computer science, and machine learning to bridge
the gap between human communication and machine understanding.
History of NLP
1950s: Beginnings of NLP
• 1950: Alan Turing publishes "Computing Machinery and Intelligence", introducing the Turing Test to evaluate a machine's ability to exhibit intelligent behavior equivalent to or indistinguishable from that of a human.
• 1954: The Georgetown-IBM experiment demonstrates machine translation, translating 60 Russian sentences into English. This is one of the earliest NLP experiments.
• 1957: Noam Chomsky introduces transformational grammar in "Syntactic Structures", laying the theoretical foundation for many linguistic models.
1960s: Rule-Based Systems
• 1964-1966: ELIZA, one of the first chatbots, is developed by Joseph Weizenbaum, demonstrating simple natural language understanding through pattern matching.
• 1966: The ALPAC Report criticizes machine translation efforts, leading to reduced funding for NLP in the U.S.
1970s: Emergence of Parsing and Semantics
• 1970: William A. Woods develops the Augmented Transition Network (ATN) for parsing natural language.
• 1972: SHRDLU, developed by Terry Winograd, showcases NLP capabilities in a virtual blocks world, integrating syntax, semantics, and reasoning.
1980s: Statistical Methods
• 1980: Introduction of the concept of probabilistic language models, moving beyond purely rule-based systems.
• 1983: The development of WordNet by George Miller begins, creating a semantic network for English.
1990s: Statistical and Machine Learning Approaches
• 1990s: Hidden Markov Models (HMMs) and n-gram models become widely used for tasks like speech recognition and part-of-speech tagging.
• 1996: IBM develops BLEU, a metric for evaluating machine translation quality.
• 1999: Latent Semantic Analysis (LSA) emerges for information retrieval and document analysis.
2000s: Rise of Probabilistic Models and Tools
• 2001: Conditional Random Fields (CRFs) are introduced for sequence labeling tasks.
• 2003: The Stanford Parser is released, providing tools for syntactic analysis.
2010s: Deep Learning Revolution
• 2013: Word2Vec, introduced by Google, revolutionizes word embeddings using neural networks.
• 2014: GloVe (Global Vectors for Word Representation) is introduced by Stanford; Seq2Seq models, a foundation for machine translation, gain popularity.
• 2017: The Transformer architecture is introduced in the paper "Attention is All You Need", laying the foundation for modern NLP models; ELMo (Embeddings from Language Models) demonstrates contextualized embeddings.
• 2018: OpenAI introduces GPT (Generative Pre-trained Transformer); BERT (Bidirectional Encoder Representations from Transformers) is introduced by Google, setting new state-of-the-art results for many NLP tasks.
2020s: Large Language Models and Multimodal Systems
• 2020: GPT-3, a 175-billion parameter model by OpenAI, demonstrates unprecedented language generation capabilities.
• 2021: Multimodal models like CLIP and DALL-E combine text and images.
• 2022: ChatGPT, based on GPT-3.5, provides a conversational AI experience.
• 2023: Models like GPT-4 improve multi-modality, reasoning, and understanding.
Need of NLP
• Bridging the Gap Between Humans and Machines
• NLP enables interaction between humans and machines by allowing machines to process, understand, and respond to human language.
• Examples: Virtual assistants like Siri and Alexa, customer service chatbots.
Need of NLP/Application of NLP
• Email platforms, such as Gmail, Outlook, etc., use NLP extensively to provide a range of
product features, such as spam classification, calendar event extraction, auto-complete, etc.
• Voice-based assistants, such as Apple Siri, Google Assistant, Microsoft Cortana, and
Amazon Alexa rely on a range of NLP techniques to interact with the user, understand
user commands, and respond accordingly.
• Modern search engines, such as Google and Bing, use NLP heavily for various subtasks,
such as query understanding, query expansion, question answering, information retrieval,
and grouping of the results, etc.
• Machine translation services, such as Google Translate, Bing Microsoft Translator, and
Amazon Translate, are used to solve a wide range of scenarios and business use cases.
• NLP forms the backbone of spelling- and grammar-correction tools, such as Grammarly
and spell check in Microsoft Word and Google Docs.
Need of NLP/Application of NLP
• Common applications: Spam Detection, Sentiment Analysis, Question Answering, Spelling Correction, Machine Translation, Chatbots.
NLP Pipeline
Main components of a generic NLP pipeline:
NLP Pipeline
Data acquisition:
• Data acquisition involves obtaining raw textual data from various sources to create a
dataset for NLP tasks.
• Various sources: documents, emails, social media posts, transcribed speech, application logs, public datasets, web scraping, image-to-text, PDF-to-text, and data augmentation.
Text Cleaning:
• Sometimes our acquired data is not very clean.
• It may contain HTML tags, spelling mistakes, or special characters.
• So we use some techniques to clean our text data.
NLP Pipeline
Text Preprocessing:
• Preprocessing prepares the text for further analysis by cleaning and structuring it.
Steps in Preprocessing:
Tokenization: Splitting text into smaller units like words or sentences.
• Example: "I love NLP!" → ["I", "love", "NLP", "!"]
Lowercasing: Converting all text to lowercase for consistency.
• Example: "Natural Language Processing" → "natural language processing"
Stop-word Removal: Eliminating common, non-informative words.
• Example: Removing "the," "is," "and."
Lemmatization/Stemming: Reducing words to their root or base forms.
• Lemmatization: "running" → "run"
• Stemming: "flies" → "fli"
Punctuation and Special Character Removal: Removing unnecessary symbols or noise.
Part-of-Speech (POS) Tagging: POS tagging involves assigning a part of speech tag to each
word in a text.
Example: "I love NLP." → [("I", Pronoun), ("love", Verb), ("NLP", Noun)]
NLP Pipeline
Feature Engineering/Feature Extraction:
• The goal of feature engineering is to represent/convert the text into a numeric vector that
can be understood by the ML algorithms.
In this step, we use multiple techniques to convert text to numerical vectors.
1. One Hot Encoder
2. Bag Of Word(BOW)
3. n-grams
4. Tf-Idf
5. Word2vec
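• As a small illustration (using scikit-learn's CountVectorizer, one possible choice), the sketch below converts a few texts into count-based numeric vectors, including bigram n-grams; the individual techniques are covered in detail in the following sections.
from sklearn.feature_extraction.text import CountVectorizer
texts = ["I love NLP", "NLP is amazing", "I enjoy learning NLP"]
# Unigram and bigram counts (n-grams); each row is the numeric vector for one text
vectorizer = CountVectorizer(ngram_range=(1, 2), lowercase=True)
X = vectorizer.fit_transform(texts)
print(vectorizer.get_feature_names_out())
print(X.toarray())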
Modelling/Model Building
• In the modeling step, we try to build a model based on the data.
• Here also, we can use multiple approaches to build the model, depending on the problem statement.
Approaches to building the model –
Deployment
• In the deployment step, we deploy our model on the cloud/server so that users can use it.
• Deployment has three stages: deployment, monitoring, and updating.
Challenges in NLP
Ambiguity
• Lexical Ambiguity: Words can have multiple meanings depending on context (e.g.,
"bank" could mean a financial institution or a riverbank).
• Syntactic Ambiguity: Sentences can have multiple valid grammatical interpretations
(e.g., "I saw the man with a telescope").
Misspellings
• Misspellings can be more difficult for a machine to detect.
• You'll need to employ a natural language processing (NLP) technology that can identify
and progress beyond typical misspellings of terms.
Multilingual and Cross-Language Challenges
• Developing systems that handle multiple languages or translate between them
accurately is hard due to varying syntax, grammar, and idiomatic expressions.
Challenges in NLP
Data Quality and Bias
• Training data may contain biases, inaccuracies, or imbalances that result in biased or
unfair NLP systems.
• Poor-quality datasets can lead to models misunderstanding or misrepresenting input.
Dynamic and Evolving Language
• Language constantly changes, with new words, slang, and phrases emerging, requiring
models to stay updated.
• Handling code-switching (switching between languages or dialects in a conversation)
remains challenging.
Domain Adaptation
• NLP models trained on general data may not perform well in specialized domains like
medicine, law, or engineering, requiring fine-tuning with domain-specific data.
Low-Resource Languages
• Many languages lack large, high-quality datasets, making it challenging to build robust
NLP systems for these languages.
Introduction to NLTK
• NLTK (Natural Language Toolkit) is a powerful and widely-used Python library for
processing and analyzing human language data (text).
• It provides tools and methods for text processing, such as tokenization,
stemming, lemmatization, parsing, classification, and more.
• To install
• pip install nltk
• A variety of tasks can be performed using NLTK, including:
Tokenization
Lower case conversion
Stop Words removal
Stemming
Lemmatization
Parse tree or Syntax Tree generation
POS Tagging
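• For example, POS tagging and a simple chunk-based parse tree can be produced as in the sketch below (the noun-phrase grammar is only an illustrative choice):
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
sentence = "The quick brown fox jumps over the lazy dog"
# POS tagging assigns a part-of-speech tag to each token
tags = nltk.pos_tag(word_tokenize(sentence))
# A simple noun-phrase chunk grammar; RegexpParser builds a parse (chunk) tree from the tags
grammar = "NP: {<DT>?<JJ>*<NN>}"
tree = nltk.RegexpParser(grammar).parse(tags)
print(tree)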
Preprocessing techniques
• Preprocessing in NLP refers to the steps taken to clean and transform raw text
data into a format suitable for further analysis.
• Since raw text often contains noise, inconsistencies, or irrelevant details,
preprocessing ensures better performance of NLP tasks.
• Preprocessing techniques in NLP involve a series of steps to clean, transform,
and prepare raw text for further analysis or modeling.
• These techniques ensure that text data is in a suitable format for machine
learning algorithms or statistical models.
Text Wrangling
• Text wrangling, also known as text preprocessing or data cleaning, is the process of
transforming raw, unstructured, and noisy text data into a clean and structured format
that can be used effectively in NLP tasks.
Why is Text Wrangling Important?
1.Raw Text is Noisy: Raw data often contains irrelevant information such as HTML tags,
emojis, misspellings, or special characters that can distort the results of NLP algorithms.
2.Standardization: It ensures that the text follows a consistent structure, making it easier to
process and analyze.
3.Improves Model Performance: Properly cleaned and preprocessed data can significantly
improve the accuracy and efficiency of machine learning models.
Text Wrangling/Text Cleaning Techniques
sentence splitter
• Split the text into sentences.
Word Tokenization
• Split a sentence into words.
stop word removal
• Removal of the most common words.
rare word removal
• Removal of less important words (low frequency).
Stemming
• Reduce words to their root forms.
Lemmatization
• Reduce words to their root forms while preserving the meaning.
spell correction
• Correct misspelled words.
sentence splitter
• Sentence Splitting (or Sentence Segmentation) in NLP is the task of dividing a
stream of text into individual sentences.
• In NLTK, you can use the built-in sent_tokenize() function to split text into
sentences.
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')
text = "Hello world! NLP is amazing. Let's tokenize this text, it's fun."
sentences = sent_tokenize(text)
print("Sentences:", sentences)
Word Tokenization
• Word tokenization splits a sentence into individual words (tokens) using word_tokenize().
import nltk
from nltk.tokenize import word_tokenize
text = "Hello world! NLP is amazing. Let's tokenize this text, it's fun."
tokens = word_tokenize(text)
print("Tokens:", tokens)
Tokens: ['Hello', 'world', '!', 'NLP', 'is', 'amazing', '.', 'Let', "'s", 'tokenize', 'this', 'text', ',', 'it', "'s", 'fun', '.']
Stop word removal
• Stop word removal is a preprocessing step in NLP, where common words (like "the,"
"is," "in," etc.) are removed from a text because they do not contribute much
meaningful information for many NLP tasks like text classification, sentiment analysis,
and topic modeling.
Why Remove Stop Words?
• Reduces Noise: Stop words are frequent and usually carry little or no meaningful
information, so removing them can help reduce the "noise" in the text.
• Improves Efficiency: Reducing the number of words in the dataset can speed up
downstream processes like training machine learning models or performing text
analysis.
• Focus on Important Words: It helps the model focus on words that carry more
meaning and are more likely to affect the outcome of the analysis.
Common Stop Words:
• English stop words: "the", "is", "at", "which", "in", "on", "of", "for", "and", "or", "a",
"an", etc.
Stop word removal
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')
text = "This is a sample text with some rare words like xylophone, and other common words."
tokens = word_tokenize(text)
# Remove tokens that appear in NLTK's English stop word list
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print("Original Text:")
print(text)
print("\nFiltered Text (after removing stop words):")
print(filtered_tokens)
Rare word removal
import nltk
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize
text = "This is a sample text with some rare words like xylophone, and other common words."
tokens = word_tokenize(text)
# Calculate word frequency distribution
fdist = FreqDist(tokens)
# Set a frequency threshold (e.g., remove words that appear less than 2 times)
threshold = 2
filtered_tokens = []
for word in tokens:
    if fdist[word] >= threshold:
        filtered_tokens.append(word)
print("Original Text:")
print(text)
print("\nFiltered Text (after removing rare words):")
print(filtered_tokens)

Output:
Original Text:
This is a sample text with some rare words like xylophone, and other common words.
Filtered Text (after removing rare words):
['words', 'words']
Rare word removal over a small corpus:
import nltk
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize
doc = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?"
]
# Tokenize all documents in the corpus
all_tokens = []
for line in doc:
    all_tokens.extend(word_tokenize(line))
# Keep only tokens that occur at least twice across the corpus
fdist = FreqDist(all_tokens)
filtered_tokens = [word for word in all_tokens if fdist[word] >= 2]
print(filtered_tokens)
Stemming
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer, RegexpStemmer
# Example words
words = ["running", "flies", "easily", "played"]
# Create the stemmers
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")
# Define a RegexpStemmer to remove common suffixes like 'ing', 'ly', 'ed', 's'
regstemmer = RegexpStemmer(r'(ing$|ly$|ed$|s$)')
# Apply stemming with the Porter stemmer
stemmed_words = []
for word in words:
    stemmed_words.append(porter.stem(word))
print("Original Words: ", words)
print("Stemmed Words: ", stemmed_words)
# The Lancaster, Snowball, and Regexp stemmers are applied in the same way:
stemmed_words = [lancaster.stem(word) for word in words]
stemmed_words = [snowball.stem(word) for word in words]
stemmed_words = [regstemmer.stem(word) for word in words]
Lemmatization
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
words = ["running", "flies", "easily", "played"]
lemmatizer = WordNetLemmatizer()
lemmatized_words = []
for word in words:
    lemmatized_words.append(lemmatizer.lemmatize(word))
print("Lemmatized Words: ", lemmatized_words)
Spell correction
• Spelling correction is an important phase of the text cleaning process, since misspelled words will lead to wrong predictions during machine learning.
• Edit Distance measures the minimum number of edits (insertions, deletions, substitutions, or transpositions) required to transform one word into another.
• Words with small edit distances to known words in the dictionary can be suggested as corrections.
import nltk
from nltk.metrics import edit_distance
from nltk.corpus import words
# Download the words dataset
nltk.download("words")
# Get the list of valid words
valid_words = set(words.words())
input_words = ["exampl", "runnig", "crickt"]
corrected_words = []
for word in input_words:
    # Brute-force search: pick the dictionary word with the smallest edit distance
    closest = min(valid_words, key=lambda w: edit_distance(word, w))
    corrected_words.append(closest)
print("Corrected Words:", corrected_words)
Combined example: removing special characters and numbers, stop words, and lemmatizing
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# Note: the original example text is not shown on the slide; this text is reconstructed to match the output below.
text = "Welcome to NLP!! This is an example text full of special characters @#$ and numbers 123."
# Remove special characters and numbers, keeping only letters and spaces
text = re.sub(r'[^a-zA-Z\s]', '', text)
words = word_tokenize(text)
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
lemmatized_words = []
for word in words:
    if word not in stop_words:
        lemmatized_words.append(lemmatizer.lemmatize(word))
lemmatized_words = ' '.join(lemmatized_words)
print(lemmatized_words)

Output:
Welcome NLP This example text full special character number
Text Representation in NLP
• Text representation is a foundational aspect of NLP that involves converting raw text into
numerical vectors that algorithms can process.
• Since machine learning models and algorithms work with numerical data, text must be
transformed into a mathematical representation.
One Hot Encoding
• Each word in the vocabulary is represented by a vector whose length equals the vocabulary size, with a 1 in that word's position and 0s elsewhere.
Vocabulary: {I, love, NLP, is, amazing, enjoy, learning}
I -> [1, 0, 0, 0, 0, 0, 0]
love -> [0, 1, 0, 0, 0, 0, 0]
NLP -> [0, 0, 1, 0, 0, 0, 0]
is -> [0, 0, 0, 1, 0, 0, 0]
amazing -> [0, 0, 0, 0, 1, 0, 0]
enjoy -> [0, 0, 0, 0, 0, 1, 0]
learning -> [0, 0, 0, 0, 0, 0, 1]
Document representations (one vector per word):
D1 ("I love NLP"): [[1, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0]]
D2 ("NLP is amazing"): [[0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0]]
D3 ("I enjoy learning NLP"): [[1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 1], [0, 0, 1, 0, 0, 0, 0]]
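• A small sketch of building such one-hot vectors in plain Python (the vocabulary order here is simply the order of first appearance):
docs = ["I love NLP", "NLP is amazing", "I enjoy learning NLP"]
# Build the vocabulary in order of first appearance
vocab = []
for doc in docs:
    for word in doc.split():
        if word not in vocab:
            vocab.append(word)
def one_hot(word):
    # Vector of vocabulary length with a single 1 at the word's index
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec
for doc in docs:
    print(doc, "->", [one_hot(w) for w in doc.split()])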
Bag of Words (BoW)
from sklearn.feature_extraction.text import CountVectorizer
documents = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good"
]
# Create the CountVectorizer and fit and transform the documents into the Bag of Words representation
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)
print("Vocabulary:", vectorizer.get_feature_names_out())
print(bow_matrix.toarray())
TF-IDF
• TF (term frequency): measures how frequently a term t occurs in a document d:
TF(t, d) = (number of times t appears in d) / (total number of terms in d)
• IDF (inverse document frequency): measures the importance of the term across a corpus.
• IDF of a term t is calculated as follows:
IDF(t) = log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing t.
• The TF-IDF score is a product of these two terms. Thus, TF-IDF score = TF * IDF.
Corpus:
• Inflation has increased unemployment
• The company has increased its sales
• Fear increased his pulse
Step 1: Data Pre-processing
After lowercasing and removing stop words, the sentences are transformed as below:
• inflation increased unemployment
• company increased sales
• fear increased pulse
TF-IDF Matrix:
[[0.159 0 0 0 0 0 0.159]
[0 0.159 0 0.159 0 0 0]
[0 0 0 0 0.159 0.159 0]]
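Where the 0.159 values come from (assuming raw TF = count/length and a base-10 logarithm, which matches the matrix above): "inflation" occurs once among the 3 terms of the first sentence and appears in 1 of the 3 documents, so TF-IDF = (1/3) * log10(3/1) ≈ 0.159, while "increased" appears in all 3 documents, so IDF = log10(3/3) = 0 and its column is all zeros. A quick check:
import math
tf = 1 / 3                 # 'inflation' occurs once among the 3 terms of the first sentence
idf = math.log10(3 / 1)    # 'inflation' appears in 1 of the 3 documents
print(round(tf * idf, 3))  # 0.159
print(math.log10(3 / 3))   # 0.0 -> 'increased' appears in every document, so its IDF is 0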
TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
documents = [
"The cat sat on the mat.",
"The dog sat on the mat.",
"The mat is warm."
]
# Create the vectorizer and compute the TF-IDF matrix
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
# Feature names
features = vectorizer.get_feature_names_out()
print("TF-IDF Matrix:")
print(tfidf_matrix.toarray())
print("\nFeatures (Terms):")
print(features)
Limitations of TF-IDF
High Dimensionality
• For large corpora with many unique words, the feature vectors generated by TF-IDF can be
very high-dimensional, leading to computational inefficiency and storage issues.
Ignores Context and Semantic Relationships
• TF-IDF treats words as independent entities and does not capture the relationships or
meanings of words.
Does Not Consider Word Order
• TF-IDF is a "bag of words" model, meaning it ignores the order of words in the document.
• As a result, sentences with completely different meanings but similar word distributions can
produce similar TF-IDF representations.
Word embeddings
• Word embeddings in NLP are a type of word representation where words or phrases are
mapped to numerical vectors in a continuous vector space.
• These vectors capture semantic and syntactic meanings of words such that similar words (in
meaning or context) are represented by similar vectors.
Static Embeddings:
1. Word2Vec (Google)
2. GloVe (Stanford)
3. FastText (Facebook)
Contextual Embeddings:
4. ELMo (Embeddings from Language Models)
5. BERT (Bidirectional Encoder Representations from Transformers)
6. GPT (Generative Pre-trained Transformer)
Word2Vec
• Word2Vec is a popular algorithm/model in NLP used to create vector representations of
words.
• These vectors are called word embeddings.
• It was introduced by researchers at Google in 2013
• Words that occur in similar contexts have similar vector representations.
• Word2Vec is a shallow, two-layer neural network that transforms words into dense vector
representations.
• Word2Vec uses a neural network-based approach to learn these embeddings from a large
corpus of text
Two main architectures used in the Word2Vec model to learn word embeddings:
CBOW (Continuous Bag of Words): Predicts the target/center word based on its context
(surrounding words).
Skip-gram: Predicts the context words (surrounding words) based on the target/center word.
Word2Vec with CBOW
• CBOW predicts a target word based on its context words.
sentences = [
['the', 'quick', 'brown', 'fox'],
['jumps', 'over', 'the', 'lazy', 'dog']
]
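• A minimal sketch of training a CBOW Word2Vec model on these sentences with the gensim library (not shown on the slide; sg=0 selects CBOW, and the other hyperparameter values are illustrative):
from gensim.models import Word2Vec
sentences = [
    ['the', 'quick', 'brown', 'fox'],
    ['jumps', 'over', 'the', 'lazy', 'dog']
]
# sg=0 -> CBOW architecture; vector_size, window and min_count are illustrative choices
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
# Dense vector learned for a word, and its most similar words in this tiny corpus
print(model.wv['fox'])
print(model.wv.most_similar('fox'))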