NLP Lecture
What is NLP?
Natural language processing (NLP) can be defined as the automatic (or semi-automatic) processing of
human language. The term ‘NLP’ is sometimes used rather more narrowly than that, often excluding
information retrieval and sometimes even excluding machine translation. NLP is sometimes contrasted
with ‘computational linguistics’, with NLP being thought of as more applied. Nowadays, alternative
terms are often preferred, like ‘Language Technology’ or ‘Language Engineering’. Language is often used
in contrast with speech (e.g., Speech and Language Technology). But I’m going to simply refer to NLP
and use the term broadly. NLP is essentially multidisciplinary: it is closely related to linguistics (although
the extent to which NLP overtly draws on linguistic theory varies considerably). It also has links to
research in cognitive science, psychology, philosophy and maths (especially logic). Within CS, it relates to
formal language theory, compiler techniques, theorem proving, machine learning and human-computer
interaction. Of course it is also related to AI, though nowadays it’s not generally thought of as part of AI.
Human language can be analysed at several levels:
1. Morphology: the structure of words. For instance, unusually can be thought of as composed of a prefix un-, a stem usual, and an affix -ly.
2. Syntax: the way words are used to form phrases. e.g., it is part of English syntax that a determiner such as the will come before a noun, and also that determiners are obligatory with certain singular nouns.
3. Semantics: compositional semantics is the construction of meaning based on syntax; lexical semantics is the meaning of individual words.
4. Pragmatics: meaning in context, although linguistics and NLP generally have very different perspectives here.
Advantages of NLP
o NLP helps users to ask questions about any subject and get a direct
response within seconds.
o NLP offers exact answers to questions, without unnecessary or unwanted
information.
o NLP helps computers to communicate with humans in their languages.
o It is very time efficient.
Disadvantages of NLP
Components of NLP
There are the following two components of NLP:
1. Natural Language Understanding (NLU): NLU is the process of reading and interpreting language.
2. Natural Language Generation (NLG): NLG is the process of writing or generating language.
Applications of NLP
1. Sentiment Analysis
Sentiment analysis identifies the opinions and emotions expressed in a piece of
text, for example in product reviews.
2. Speech Recognition
Speech recognition is used for converting spoken words into text. It is used
in applications such as mobile devices, home automation, video retrieval,
dictation in Microsoft Word, voice biometrics, voice user interfaces, and so on.
3. Information extraction
Information extraction converts a large body of text into more formal
representations, such as first-order logic structures, that are easier for
computer programs to manipulate.
4. Question Answering
Question answering focuses on building systems that automatically answer
questions asked by humans in a natural language.
Phases of NLP
There are the following five phases of NLP:
1. Lexical Analysis
The first phase of NLP is lexical analysis. This phase scans the source text
as a stream of characters and converts it into meaningful lexemes. It divides
the whole text into paragraphs, sentences, and words.
2. Syntactic Analysis (Parsing)
Syntactic analysis is used to check the grammar and word arrangement of a
sentence and to show the relationships among its words.
3. Semantic Analysis
Semantic analysis is concerned with meaning representation: the literal
meaning of words, phrases, and sentences.
4. Discourse Integration
Discourse integration depends upon the sentences that precede a given
sentence and also invokes the meaning of the sentences that follow it.
5. Pragmatic Analysis
Pragmatic analysis is the fifth and last phase of NLP. It helps you to
discover the intended effect of an utterance by applying a set of rules that
characterize cooperative dialogues.
Ambiguity
o Lexical Ambiguity
Lexical ambiguity exists in the presence of two or more possible meanings
for a single word.
Example: Manya is looking for a match.
In the above example, the word match means either that Manya is looking
for a partner or that Manya is looking for a (cricket or other) match.
o Syntactic Ambiguity
Syntactic Ambiguity exists in the presence of two or more possible meanings
within the sentence.
Example: I saw the girl with the binoculars.
In the above example, did I have the binoculars, or did the girl have the
binoculars?
o Referential Ambiguity
Referential ambiguity exists when you refer to something using a pronoun.
Example: Kiran went to Sunita. She said, "I am hungry."
In the above sentence, you do not know who is hungry, Kiran or Sunita.
NLP Libraries
Scikit-learn: It provides a wide range of algorithms for building machine
learning models in Python.
Natural Language Toolkit (NLTK): NLTK is a complete toolkit for all NLP
techniques.
Stemming
Stemming is the process of reducing a word to its stem by removing affixes.
Several stemming algorithms are in common use:
1. Porter’s Stemmer
It is one of the most popular stemming methods, proposed by Martin Porter in
1980. It is based on the idea that the suffixes in the English language are
made up of a combination of smaller and simpler suffixes. This stemmer is
known for its speed and simplicity. The main applications of the Porter
stemmer include data mining and information retrieval. However, it is limited
to English words. Also, a group of words may be mapped onto the same stem, and
the output stem is not necessarily a meaningful word. The algorithm is fairly
lengthy and is one of the oldest stemmers.
Example: EED -> EE means “if the word has at least one vowel and consonant
plus EED ending, change the ending to EE” as ‘agreed’ becomes ‘agree’.
Implementation of Porter Stemmer
Python3
from nltk.stem import PorterStemmer

porter_stemmer = PorterStemmer()
words = ['running', 'jumps', 'happily', 'running', 'happily']
stemmed_words = [porter_stemmer.stem(word) for word in words]

print('Original words:', words)
print('Stemmed words:', stemmed_words)
Output:
Original words: ['running', 'jumps', 'happily', 'running',
'happily']
Stemmed words: ['run', 'jump', 'happili', 'run', 'happili']
Advantage: It produces good output compared to other stemmers and has a
lower error rate.
Limitation: Morphological variants produced are not always real words.
2. Lovins Stemmer
It was proposed by Lovins in 1968. It removes the longest suffix from a word,
and then the word is recoded to convert the stem into a valid word.
Example: sitting -> sitt -> sit
Advantage: It is fast and handles irregular plurals like ‘teeth’ and ‘tooth’ etc.
Limitation: It is data-consuming and frequently fails to form valid words
from the stem.
3. Dawson Stemmer
It is an extension of the Lovins stemmer, in which suffixes are stored in
reversed order, indexed by their length and last letter.
4. Krovetz Stemmer
It was proposed in 1993 by Robert Krovetz. Following are the steps:
1) Convert the plural form of a word to its singular form.
2) Convert the past tense of a word to its present tense and remove the suffix
‘ing’.
Example: ‘children’ -> ‘child’
Advantage: It is light in nature and can be used as pre-stemmer for other
stemmers.
Limitation: It is inefficient for large documents.
5. Xerox Stemmer
Capable of processing extensive datasets and generating valid words, it has a
tendency to over-stem, primarily due to its reliance on lexicons, making it
language-dependent. This constraint implies that its effectiveness is limited to
specific languages.
Example:
‘children’ -> ‘child’
‘understood’ -> ‘understand’
‘whom’ -> ‘who’
‘best’ -> ‘good’
6. N-Gram Stemmer
The algorithm, aptly named n-grams (typically n=2 or 3), involves breaking
words into segments of length n and then applying statistical analysis to identify
patterns. An n-gram is a set of n consecutive characters extracted from a word
in which similar words will have a high proportion of n-grams in common.
Example: ‘INTRODUCTIONS’ for n=2 becomes : *I, IN, NT, TR, RO, OD, DU,
UC, CT, TI, IO, ON, NS, S*
Advantage: It is based on simple string comparisons and is
language-independent.
Limitation: It requires space to create and index the n-grams, and it is not
time efficient.
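To make this concrete, here is a minimal sketch in plain Python (the helper names are my own, not a library API) that extracts padded character bigrams exactly as in the example above and scores two words with the Dice coefficient:
Python3
def char_ngrams(word, n=2):
    # Pad with '*' word-boundary markers, as in the INTRODUCTIONS example
    padded = '*' + word + '*'
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def dice_similarity(w1, w2, n=2):
    # Similar words share a high proportion of their n-grams
    a, b = char_ngrams(w1, n), char_ngrams(w2, n)
    return 2 * len(a & b) / (len(a) + len(b))

print(dice_similarity('INTRODUCTION', 'INTRODUCTIONS'))  # about 0.89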
7. Snowball Stemmer
The Snowball Stemmer, compared to the Porter Stemmer, is multi-lingual as it
can handle non-English words. It supports various languages and is based on
the ‘Snowball’ programming language, known for efficient processing of small
strings.
The Snowball stemmer is more aggressive than the Porter stemmer and is also
referred to as the Porter2 stemmer. Because of the improvements it adds over
the Porter stemmer, the Snowball stemmer has greater computational speed.
Implementation of Snowball Stemmer
Python3
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer(language='english')
words = ['running', 'jumped', 'happily', 'quickly', 'foxes']
stemmed_words = [stemmer.stem(word) for word in words]

print('Original words:', words)
print('Stemmed words:', stemmed_words)
Output:
Original words: ['running', 'jumped', 'happily', 'quickly',
'foxes']
Stemmed words: ['run', 'jump', 'happili', 'quick', 'fox']
8. Lancaster Stemmer
The Lancaster stemmer is more aggressive and dynamic compared to the other
two stemmers. It is faster, but the algorithm can be confusing when dealing
with small words, and it is not as efficient as the Snowball stemmer. The
Lancaster stemmer saves its rules externally and uses an iterative algorithm.
Implementation of Lancaster Stemmer
Python3
from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()
words = ['running', 'jumped', 'happily', 'quickly', 'foxes']
stemmed_words = [stemmer.stem(word) for word in words]

print('Original words:', words)
print('Stemmed words:', stemmed_words)
Output:
Original words: ['running', 'jumped', 'happily', 'quickly',
'foxes']
Stemmed words: ['run', 'jump', 'happy', 'quick', 'fox']
9. Regexp Stemmer
The Regexp Stemmer, or Regular Expression Stemmer, is a stemming
algorithm that utilizes regular expressions to identify and remove suffixes from
words. It allows users to define custom rules for stemming by specifying
patterns to match and remove.
This method provides flexibility and control over the stemming process, making
it suitable for specific applications where custom rule-based stemming is
desired.
Implementation of Regexp Stemmer
Python3
from nltk.stem import RegexpStemmer

# Custom, illustrative rule: strip a trailing 'ing'
custom_rule = r'ing$'
regexp_stemmer = RegexpStemmer(custom_rule)

word = 'running'
stemmed_word = regexp_stemmer.stem(word)
print('Original Word:', word)
print('Stemmed Word:', stemmed_word)
Output:
Original Word: running
Stemmed Word: runn
Applications of Stemming
1. Stemming is used in information retrieval systems like search engines.
2. It is used to determine domain vocabularies in domain analysis.
3. It is used in indexing: stemming reduces the words in documents to common
stems, which helps display relevant search results and map documents to
common subjects.
4. Sentiment analysis, which examines reviews and comments made by
different users about anything, is frequently used for product analysis, such
as for online retail stores. Before the text is interpreted, stemming is
applied as a text-preparation step.
5. Document clustering (also known as text clustering) is a method of group
analysis applied to textual materials. Important uses include topic
extraction, automatic document organization, and fast information retrieval.
Disadvantages in Stemming
There are mainly two kinds of error in stemming:
Over-stemming: Over-stemming occurs when a stemmer removes too much of a
word, producing non-valid words or conflating words with different meanings
onto the same stem. This can result in a loss of meaning and readability.
For instance, "arguing" may be stemmed to "argu", which is not a valid word.
To address this, choosing an appropriate stemmer, testing on sample text, or
using lemmatization can mitigate over-stemming issues.
Under-stemming: Under-stemming arises when a stemmer removes too little of a
word, so that related words are not reduced to the same base form. This can
result in a loss of information and hinder text analysis. For instance,
Porter's stemmer reduces "arguing" to "argu" but leaves "argument" unchanged,
so the two related words no longer share a stem. To mitigate under-stemming,
selecting an appropriate stemmer, testing on sample text, or opting for
lemmatization can be beneficial.
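As a quick check of the "arguing"/"argument" behaviour described above, here is a minimal sketch with NLTK's Porter stemmer (expected outputs shown as comments):
Python3
from nltk.stem import PorterStemmer

porter = PorterStemmer()
print(porter.stem('arguing'))   # 'argu' (a non-word: over-stemming)
print(porter.stem('argument'))  # 'argument' (not reduced to the shared stem: under-stemming)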
Advantages of Stemming
Stemming in natural language processing offers advantages such as
text normalization, simplifying word variations to a common base form. It aids
information retrieval and text mining, and it reduces feature dimensionality in
machine learning. Stemming enhances computational efficiency, making it a
valuable step in text pre-processing for various NLP applications.
Stemming vs Lemmatization
Stemming: a process that removes the last few characters from a word, often
leading to incorrect meanings and spellings. For instance, stemming the word
'Caring' would return 'Car'. Stemming is used in the case of large datasets
where performance is an issue.
Lemmatization: considers the context and converts a word to its meaningful
base form, which is called the lemma. For instance, lemmatizing the word
'Caring' would return 'Care'. Lemmatization is computationally expensive
since it involves look-up tables.
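As a minimal side-by-side sketch with NLTK (the word 'studies' is my own illustration; the WordNet lemmatizer requires the wordnet corpus, e.g. via nltk.download('wordnet')):
Python3
from nltk.stem import PorterStemmer, WordNetLemmatizer

print(PorterStemmer().stem('studies'))           # 'studi' (fast, but not a real word)
print(WordNetLemmatizer().lemmatize('studies'))  # 'study' (a valid dictionary form)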
How Lemmatization Works
Tokenization: The first step is to break down a text into individual words or
tokens. This can be done using various methods, such as splitting the text
based on spaces.
POS Tagging: Parts-of-speech tagging involves assigning a grammatical
category (like noun, verb, adjective, etc.) to each token. Lemmatization often
relies on this information, as the base form of a word can depend on its
grammatical role in a sentence.
Lemmatization: Once each word has been tokenized and assigned a part-of-
speech tag, the lemmatization algorithm uses a lexicon or linguistic rules to
determine the lemma of each word. The lemma is the base form of the word,
which may not necessarily be the same as the word’s root. For example, the
lemma of “running” is “run,” and the lemma of “better” (in the context of an
adjective) is “good.”
Applying Rules: Lemmatization algorithms often rely on linguistic rules and
patterns. For irregular verbs or words with multiple possible lemmas, these
rules help in making the correct lemmatization decision.
Output: The result of lemmatization is a set of words in their base or
dictionary form, making it easier to analyze and understand the underlying
meaning of a text.
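A minimal sketch of these steps with NLTK's WordNet lemmatizer, with the POS tags supplied by hand (assumes the wordnet corpus has been downloaded):
Python3
from nltk.stem import WordNetLemmatizer

# nltk.download('wordnet') may be required on first use
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('running', pos='v'))  # 'run'
print(lemmatizer.lemmatize('better', pos='a'))   # 'good'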
Unit-2
What is the bag of words in NLP?
Bag-of-words (BoW) is a statistical language model used to analyze text and
documents based on word counts. The model does not account for word order
within a document. BoW can be implemented as a Python dictionary with each key
set to a word and each value set to the number of times that word appears in a
text.
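For instance, a minimal sketch using Python's collections.Counter (the sentence is just an illustration):
Python3
from collections import Counter

text = "the cat sat on the mat"
bow = Counter(text.lower().split())  # word -> count; word order is discarded
print(bow)  # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})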
WordNet
WordNet is a lexical database which is available online, and provides a large
repository of English lexical items. There is a multilingual WordNet for
European languages which is structured in the same way as the English
language WordNet.
For example, tree is a kind of plant, tree is a hyponym of plant, and plant is a
hypernym of tree. Analogously, trunk is a part of a tree, and we have trunk
as a meronym of tree, and tree is a holonym of trunk. For one word and one
type of POS, if there is more than one sense, WordNet organizes the senses
from the most frequently used to the least frequently used, based on frequency
counts from the SemCor corpus.
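These relations can be explored through NLTK's WordNet interface (a minimal sketch; the exact synsets returned depend on the WordNet version, and the wordnet corpus must be downloaded first):
Python3
from nltk.corpus import wordnet as wn

tree = wn.synset('tree.n.01')
print(tree.hypernyms())      # more general concepts, e.g. woody plant
print(tree.hyponyms()[:3])   # kinds of tree
print(tree.part_meronyms())  # parts of a tree, e.g. trunk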
Here are the steps for computing semantic similarity between two
sentences:
Tokenization
Each sentence is partitioned into a list of words, and we remove the stop
words. Stop words are frequently occurring, insignificant words that appear
in a database record, an article, or a web page.
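A minimal sketch of this step with NLTK (the sentence is illustrative; the punkt and stopwords resources may need to be downloaded first):
Python3
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# nltk.download('punkt') and nltk.download('stopwords') may be required on first use
sentence = "Each sentence is partitioned into a list of words."
stop_words = set(stopwords.words('english'))
tokens = [w for w in word_tokenize(sentence.lower())
          if w.isalpha() and w not in stop_words]
print(tokens)  # e.g. ['sentence', 'partitioned', 'list', 'words']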