NLP Lecture

Unit 1

What is NLP?
Natural language processing (NLP) can be defined as the automatic (or semi-automatic) processing of
human language. The term ‘NLP’ is sometimes used rather more narrowly than that, often excluding
information retrieval and sometimes even excluding machine translation. NLP is sometimes contrasted
with ‘computational linguistics’, with NLP being thought of as more applied. Nowadays, alternative
terms are often preferred, like ‘Language Technology’ or ‘Language Engineering’. Language is often used
in contrast with speech (e.g., Speech and Language Technology). But I’m going to simply refer to NLP
and use the term broadly. NLP is essentially multidisciplinary: it is closely related to linguistics (although
the extent to which NLP overtly draws on linguistic theory varies considerably). It also has links to
research in cognitive science, psychology, philosophy and maths (especially logic). Within CS, it relates to
formal language theory, compiler techniques, theorem proving, machine learning and human-computer
interaction. Of course it is also related to AI, though nowadays it’s not generally thought of as part of AI.

Some linguistic terminology


1. Morphology: the structure of words. For instance, unusually can be thought of as composed of a
prefix un-, a stem usual, and an affix -ly. composed is compose plus the inflectional affix -ed: a spelling
rule means we end up with composed rather than composeed.

2. Syntax: the way words are used to form phrases. e.g., it is part of English syntax that a determiner
such as the will come before a noun, and also that determiners are obligatory with certain singular
nouns.

3. Semantics. Compositional semantics is the construction of meaning (generally expressed as logic) based on syntax. This is contrasted with lexical semantics, i.e., the meaning of individual words.

4. Pragmatics: meaning in context, although linguistics and NLP generally have very different perspectives here.

Advantages of NLP
o NLP helps users to ask questions about any subject and get a direct
response within seconds.
o NLP offers exact answers to a question, without unnecessary or unwanted information.
o NLP helps computers to communicate with humans in their languages.
o It is very time efficient.

Disadvantages of NLP
A list of disadvantages of NLP is given below:

o NLP may fail to capture context.
o NLP can be unpredictable.
o NLP may require more keystrokes.
o NLP systems often do not adapt well to new domains; they have limited functionality and are typically built for a single, specific task.

Components of NLP
There are the following two components of NLP -

1. Natural Language Understanding (NLU)

Natural Language Understanding (NLU) helps the machine to understand and analyse human language by extracting metadata from content such as concepts, entities, keywords, emotion, relations, and semantic roles.

NLU is mainly used in business applications to understand the customer's problem in both spoken and written language.

NLU involves the following tasks -

o It is used to map the given input into a useful representation.
o It is used to analyze different aspects of the language.

2. Natural Language Generation (NLG)

Natural Language Generation (NLG) acts as a translator that converts computerized data into a natural language representation. It mainly involves text planning, sentence planning, and text realization.

Difference between NLU and NLG

NLU | NLG
NLU is the process of reading and interpreting language. | NLG is the process of writing or generating language.
It produces non-linguistic outputs from natural language inputs. | It produces natural language outputs from non-linguistic inputs.

Applications of NLP
1. Sentiment Analysis

Sentiment analysis is also known as opinion mining. It is used on the web to analyse the attitude, behaviour, and emotional state of the sender. This application is implemented through a combination of NLP (Natural Language Processing) and statistics: values (positive, negative, or neutral) are assigned to the text, and the mood of the context (happy, sad, angry, etc.) is identified.

2. Speech Recognition

Speech recognition is used for converting spoken words into text. It is used in applications such as mobile phones, home automation, video retrieval, dictating to Microsoft Word, voice biometrics, voice user interfaces, and so on.

3. Information extraction

Information extraction is one of the most important applications of NLP. It is used for extracting structured information from unstructured or semi-structured machine-readable documents.

4. Natural Language Understanding (NLU)

It converts a large set of text into more formal representations, such as first-order logic structures, that are easier for computer programs to manipulate.
5. Question Answering

Question Answering focuses on building systems that automatically answer questions asked by humans in a natural language.

Phases of NLP
There are the following five phases of NLP:

1. Lexical and Morphological Analysis

The first phase of NLP is lexical analysis. This phase scans the source text as a stream of characters and converts it into meaningful lexemes. It divides the whole text into paragraphs, sentences, and words.

2. Syntactic Analysis (Parsing)

Syntactic analysis is used to check grammar and word arrangement, and to show the relationships among the words.

Example: Agra goes to the Poonam

In the real world, "Agra goes to the Poonam" does not make any sense, so this sentence is rejected by the syntactic analyzer.

3. Semantic Analysis

Semantic analysis is concerned with meaning representation. It mainly focuses on the literal meaning of words, phrases, and sentences.

4. Discourse Integration

Discourse integration depends upon the sentences that precede it and also invokes the meaning of the sentences that follow it.

5. Pragmatic Analysis

Pragmatic analysis is the fifth and last phase of NLP. It helps you to discover the intended effect by applying a set of rules that characterize cooperative dialogues.

For example: "Open the door" is interpreted as a request instead of an order.

Why is NLP difficult?

NLP is difficult because ambiguity and uncertainty exist in language.

Ambiguity

There are the following three types of ambiguity -

o Lexical Ambiguity

Lexical ambiguity exists when a single word has two or more possible meanings.

Example:

Manya is looking for a match.

In the above example, the word match may mean either a partner or a game (a cricket match, for example), so Manya's intent is ambiguous.

o Syntactic Ambiguity
Syntactic ambiguity exists when a sentence can be parsed in two or more ways, giving different meanings.

Example:

I saw the girl with the binoculars.

In the above example, did I have the binoculars? Or did the girl have the
binoculars?

o Referential Ambiguity

Referential ambiguity exists when a pronoun can refer to more than one possible antecedent.

Example: Kiran went to Sunita. She said, "I am hungry."

In the above sentence, you do not know who is hungry, Kiran or Sunita.

NLP Libraries
Scikit-learn: It provides a wide range of algorithms for building machine
learning models in Python.

Natural language Toolkit (NLTK): NLTK is a complete toolkit for all NLP
techniques.

Pattern: It is a web mining module for NLP and machine learning.

TextBlob: It provides an easy interface for basic NLP tasks such as sentiment analysis, noun phrase extraction, or POS tagging.

Quepy: Quepy is used to transform natural language questions into queries in a database query language.

Stemming is a method in text processing that eliminates prefixes and suffixes from words, transforming them into their fundamental or root form. The main objective of stemming is to streamline and standardize words, enhancing the effectiveness of natural language processing tasks.
What is Stemming in NLP?
Simplifying words to their most basic form is called stemming, and it is made
easier by stemmers or stemming algorithms. For example, “chocolates”
becomes “chocolate” and “retrieval” becomes “retrieve.” This is crucial for natural language processing pipelines, which use tokenized words acquired in the first stage of dissecting a document into its constituent words.
Stemming in natural language processing reduces words to their base or root
form, aiding in text normalization for easier processing. This technique is crucial
in tasks like text classification, information retrieval, and text summarization.
While beneficial, stemming has drawbacks, including potential impacts on text
readability and occasional inaccuracies in determining the correct root form of a
word.

Why is Stemming important?

It is important to note that stemming is different from lemmatization. Lemmatization is the process of reducing a word to its base form, but unlike stemming it takes the context of the word into account and produces a valid word, whereas stemming may produce a non-word as the root form.
Some more examples of words that stem to the root word "like" include:
->"likes"
->"liked"
->"likely"
->"liking"

Types of Stemmer in NLTK

Python NLTK contains a variety of stemming algorithms. Let's examine them below.

1. Porter’s Stemmer
It is one of the most popular stemming methods, proposed in 1980. It is based on the idea that the suffixes in the English language are made up of a combination of smaller and simpler suffixes. This stemmer is known for its speed and simplicity. The main applications of the Porter Stemmer include data mining and information retrieval. However, it is limited to English words. Also, a group of words may be mapped onto the same stem, and the output stem is not necessarily a meaningful word. The algorithm is fairly lengthy and is one of the oldest stemmers.
Example: EED -> EE means “if the word has at least one vowel and consonant
plus EED ending, change the ending to EE” as ‘agreed’ becomes ‘agree’.
Implementation of Porter Stemmer

Python3

from nltk.stem import PorterStemmer

# Create a Porter Stemmer instance

porter_stemmer = PorterStemmer()

# Example words for stemming

words = ["running", "jumps", "happily", "running", "happily"]

# Apply stemming to each word

stemmed_words = [porter_stemmer.stem(word) for word in words]

# Print the results

print("Original words:", words)

print("Stemmed words:", stemmed_words)

Output:
Original words: ['running', 'jumps', 'happily', 'running',
'happily']
Stemmed words: ['run', 'jump', 'happili', 'run', 'happili']
o Advantage: It produces the best output compared to other stemmers and has a lower error rate.
o Limitation: The morphological variants it produces are not always real words.
2. Lovins Stemmer
Proposed by Lovins in 1968, it removes the longest suffix from a word; the stem is then recoded to convert it into a valid word.
Example: sitting -> sitt -> sit
o Advantage: It is fast and handles irregular plurals like ‘teeth’ and ‘tooth’.
o Limitation: It is time consuming and frequently fails to form valid words from the stem.

3. Dawson Stemmer
It is an extension of the Lovins stemmer in which suffixes are stored in reversed order, indexed by their length and last letter.

o Advantage: It is fast in execution and covers more suffixes.
o Limitation: It is very complex to implement.

4. Krovetz Stemmer
It was proposed in 1993 by Robert Krovetz. Following are the steps:
1) Convert the plural form of a word to its singular form.
2) Convert the past tense of a word to its present tense and remove the suffix ‘ing’.
Example: ‘children’ -> ‘child’
o Advantage: It is light in nature and can be used as a pre-stemmer for other stemmers.
o Limitation: It is inefficient for large documents.

5. Xerox Stemmer
Capable of processing extensive datasets and generating valid words, it has a
tendency to over-stem, primarily due to its reliance on lexicons, making it
language-dependent. This constraint implies that its effectiveness is limited to
specific languages.
Example:
‘children’ -> ‘child’
‘understood’ -> ‘understand’
‘whom’ -> ‘who’
‘best’ -> ‘good’

6. N-Gram Stemmer
The algorithm, aptly named n-grams (typically n=2 or 3), involves breaking
words into segments of length n and then applying statistical analysis to identify
patterns. An n-gram is a set of n consecutive characters extracted from a word
in which similar words will have a high proportion of n-grams in common.
Example: ‘INTRODUCTIONS’ for n=2 becomes : *I, IN, NT, TR, RO, OD, DU,
UC, CT, TI, IO, ON, NS, S*
o Advantage: It is based on simple string comparisons and is largely language independent.
o Limitation: It requires space to create and index the n-grams, and it is not time efficient.
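
As an illustration, character n-grams like those in the example above can be extracted with a few lines of Python. This is a small hand-rolled sketch; the function name and the '*' padding symbol (marking word boundaries) are illustrative choices, not part of any particular stemmer.

def char_ngrams(word, n=2, pad='*'):
    # Pad the word so that boundary n-grams such as '*I' and 'S*' are included.
    padded = pad + word + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("INTRODUCTIONS"))
# ['*I', 'IN', 'NT', 'TR', 'RO', 'OD', 'DU', 'UC', 'CT', 'TI', 'IO', 'ON', 'NS', 'S*']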

7. Snowball Stemmer
The Snowball Stemmer, compared to the Porter Stemmer, is multi-lingual as it
can handle non-English words. It supports various languages and is based on
the ‘Snowball’ programming language, known for efficient processing of small
strings.
The Snowball stemmer is more aggressive than the Porter stemmer and is also referred to as the Porter2 stemmer. Because of the improvements added compared to the Porter stemmer, the Snowball stemmer has greater computational speed.
Implementation of Snowball Stemmer

Python3

from nltk.stem import SnowballStemmer

# Choose a language for stemming, for example, English

stemmer = SnowballStemmer(language='english')

# Example words to stem

words_to_stem = ['running', 'jumped', 'happily', 'quickly', 'foxes']

# Apply Snowball Stemmer

stemmed_words = [stemmer.stem(word) for word in words_to_stem]


# Print the results

print("Original words:", words_to_stem)

print("Stemmed words:", stemmed_words)

Output:
Original words: ['running', 'jumped', 'happily', 'quickly',
'foxes']
Stemmed words: ['run', 'jump', 'happili', 'quick', 'fox']

8. Lancaster Stemmer
The Lancaster stemmer is more aggressive and dynamic compared to the other stemmers. It is faster, but the algorithm can be confusing when dealing with small words, and it is not as efficient as the Snowball stemmer. The Lancaster stemmer stores its rules externally and basically uses an iterative algorithm.
Implementation of Lancaster Stemmer

Python3

from nltk.stem import LancasterStemmer

# Create a Lancaster Stemmer instance

stemmer = LancasterStemmer()

# Example words to stem

words_to_stem = ['running', 'jumped', 'happily', 'quickly', 'foxes']


# Apply Lancaster Stemmer

stemmed_words = [stemmer.stem(word) for word in words_to_stem]

# Print the results

print("Original words:", words_to_stem)

print("Stemmed words:", stemmed_words)

Output:
Original words: ['running', 'jumped', 'happily', 'quickly',
'foxes']
Stemmed words: ['run', 'jump', 'happy', 'quick', 'fox']

9. Regexp Stemmer
The Regexp Stemmer, or Regular Expression Stemmer, is a stemming
algorithm that utilizes regular expressions to identify and remove suffixes from
words. It allows users to define custom rules for stemming by specifying
patterns to match and remove.
This method provides flexibility and control over the stemming process, making
it suitable for specific applications where custom rule-based stemming is
desired.
Implementation of Regexp Stemmer

Python3

from nltk.stem import RegexpStemmer

# Create a Regexp Stemmer with a custom rule

custom_rule = r'ing$'
regexp_stemmer = RegexpStemmer(custom_rule)

# Apply the stemmer to a word

word = 'running'

stemmed_word = regexp_stemmer.stem(word)

print(f'Original Word: {word}')

print(f'Stemmed Word: {stemmed_word}')

Output:
Original Word: running
Stemmed Word: runn

Applications of Stemming
1. Stemming is used in information retrieval systems like search engines.
2. It is used to determine domain vocabularies in domain analysis.
3. It is used when indexing documents, to display relevant search results and to map documents with different word forms to common subjects.
4. Sentiment analysis, which examines reviews and comments made by different users about anything, is frequently used for product analysis, such as for online retail stores. Stemming is used as a text-preparation step before the text is interpreted.
5. Document clustering (also known as text clustering) is a method of group analysis applied to textual materials. Important uses of it include subject extraction, automatic document structuring, and quick information retrieval.

Disadvantages in Stemming
There are mainly two errors in stemming –
o Over-stemming: Over-stemming occurs when a stemmer removes too much of a word, producing incorrect root forms or non-words. This can result in a loss of meaning and readability. For instance, "arguing" may be stemmed to "argu", losing meaning. To address this, choosing an appropriate stemmer, testing on sample text, or using lemmatization can mitigate over-stemming issues. Techniques like semantic role labeling and sentiment analysis can enhance context awareness in stemming.
o Under-stemming: Under-stemming occurs when a stemmer removes too little, failing to reduce related words to a common base form. This can result in a loss of information and hinder text analysis. For instance, "arguing" may be stemmed to "argu" while "argument" is left unchanged, so the two related words are never matched. To mitigate under-stemming, selecting an appropriate stemmer, testing on sample text, or opting for lemmatization can be beneficial. Techniques like semantic role labeling and sentiment analysis enhance context awareness in stemming.

Advantages of Stemming
Stemming in natural language processing offers advantages such as
text normalization, simplifying word variations to a common base form. It aids in
information retrieval, text mining, and reduces feature dimensionality in machine
learning. Stemming enhances computational efficiency, making it a valuable
step in text pre-processing for various NLP applications.

What is Lemmatization in NLP?


The purpose of lemmatization is the same as that of stemming, but it overcomes the drawbacks of stemming. For some words, stemming may not give a meaningful representation, such as “Histori”; here lemmatization comes into the picture, as it gives a meaningful word. Lemmatization takes more time than stemming because it finds a meaningful word or representation, whereas stemming just needs to produce a base word and therefore takes less time. Stemming has its application in sentiment analysis, while lemmatization has its application in chatbots and question answering.

Stemming vs Lemmatization

Stemming | Lemmatization
Stemming is a process that stems or removes the last few characters from a word, often leading to incorrect meanings and spelling. | Lemmatization considers the context and converts the word to its meaningful base form, which is called the lemma.
For instance, stemming the word ‘Caring‘ would return ‘Car‘. | For instance, lemmatizing the word ‘Caring‘ would return ‘Care‘.
Stemming is used in the case of large datasets where performance is an issue. | Lemmatization is computationally expensive since it involves look-up tables and morphological analysis.
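
The difference can be seen directly in code. Below is a small sketch comparing NLTK's PorterStemmer with WordNetLemmatizer; the word list is illustrative, and the exact stems can vary between stemmer implementations.

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # one-time download of the lemmatizer's lexicon

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "corpora"]:
    # The stemmer only chops suffixes; the lemmatizer looks the word up in WordNet.
    print(word, "-> stem:", stemmer.stem(word),
          "| lemma:", lemmatizer.lemmatize(word))

# The lemmatizer can also take a part-of-speech hint; as a verb, "caring" maps to "care".
print(lemmatizer.lemmatize("caring", pos="v"))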

What is word boundary detection in NLP?


Word Boundary Detection (WBD) is defined as identifying the start and the end of each word in a spoken utterance. Word boundary detection is used in many applications, such as keyword spotting, speech recognition systems, etc.

How does Lemmatization work?

Lemmatization is a linguistic process that involves reducing words to their base or
root form, known as the lemma. The goal is to normalize different inflected forms of
a word so that they can be analyzed or compared more easily. This is particularly
useful in natural language processing (NLP) and text analysis.

Here’s how lemmatization generally works:

o Tokenization: The first step is to break down a text into individual words or
tokens. This can be done using various methods, such as splitting the text
based on spaces.
o POS Tagging: Parts-of-speech tagging involves assigning a grammatical
category (like noun, verb, adjective, etc.) to each token. Lemmatization often
relies on this information, as the base form of a word can depend on its
grammatical role in a sentence.
o Lemmatization: Once each word has been tokenized and assigned a part-of-
speech tag, the lemmatization algorithm uses a lexicon or linguistic rules to
determine the lemma of each word. The lemma is the base form of the word,
which may not necessarily be the same as the word’s root. For example, the
lemma of “running” is “run,” and the lemma of “better” (in the context of an
adjective) is “good.”
o Applying Rules: Lemmatization algorithms often rely on linguistic rules and
patterns. For irregular verbs or words with multiple possible lemmas, these
rules help in making the correct lemmatization decision.
o Output: The result of lemmatization is a set of words in their base or
dictionary form, making it easier to analyze and understand the underlying
meaning of a text.

Lemmatization is distinct from stemming, another text normalization technique. While stemming involves chopping off prefixes or suffixes from words to obtain a common root, lemmatization aims for a valid base form through linguistic analysis.
Lemmatization tends to be more accurate but can be computationally more
expensive than stemming.
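
A minimal sketch of this pipeline with NLTK is given below. The sentence is invented, and the helper that maps Penn Treebank tags to WordNet POS classes is a common simplification, not part of a single standard API.

import nltk
from nltk import word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

for resource in ('punkt', 'averaged_perceptron_tagger', 'wordnet'):
    nltk.download(resource)  # one-time downloads

def to_wordnet_pos(treebank_tag):
    # Map Penn Treebank tags to the coarse WordNet POS classes.
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
tokens = word_tokenize("The children were running faster than the dogs")

# Tokenize -> POS tag -> lemmatize with the POS hint.
for token, tag in pos_tag(tokens):
    print(token, '->', lemmatizer.lemmatize(token, pos=to_wordnet_pos(tag)))
# e.g. children -> child, were -> be, running -> run, dogs -> dog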

Unit-2
What is the bag of words in NLP?
Bag-of-words (BoW) is a statistical language model used to analyze text and documents
based on word count. The model does not account for word order within a document.
BoW can be implemented as a Python dictionary with each key set to a word and each
value set to the number of times that word appears in a text.
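
As a minimal sketch of the dictionary-based implementation described above (the example sentence is illustrative):

from collections import Counter

text = "the cat sat on the mat and the dog sat"

# Tokenize by whitespace and lower-case; word order is deliberately ignored.
tokens = text.lower().split()

# Bag of words: each key is a word, each value the number of times it appears.
bag_of_words = dict(Counter(tokens))

print(bag_of_words)
# e.g. {'the': 3, 'cat': 1, 'sat': 2, 'on': 1, 'mat': 1, 'and': 1, 'dog': 1}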

POS tagging is the process of labeling words in a text with their corresponding parts of speech (e.g., noun, verb, adjective). This helps algorithms understand the grammatical structure and meaning of a text and is an important step in natural language processing (NLP).
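
A small illustrative example using NLTK's pos_tag (the sentence is invented; the one-time nltk.download calls fetch the tokenizer, tagger, and tagset resources):

import nltk
from nltk import word_tokenize, pos_tag

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

tokens = word_tokenize("The cat sat on the mat")

# Penn Treebank tags (the tagger's native tag set).
print(pos_tag(tokens))
# e.g. [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]

# The same tokens mapped onto the coarser 'universal' tag set.
print(pos_tag(tokens, tagset='universal'))
# e.g. [('The', 'DET'), ('cat', 'NOUN'), ('sat', 'VERB'), ...]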

WordNet
WordNet is a lexical database which is available online, and provides a large
repository of English lexical items. There is a multilingual WordNet for
European languages which is structured in the same way as the English
language WordNet.

WordNet was designed to establish the connections between four types of parts of speech (POS): noun, verb, adjective, and adverb. The smallest unit in WordNet is the synset, which represents a specific meaning of a word. It includes the word, its explanation, and its synonyms. The specific meaning of one word under one type of POS is called a sense. Each sense of a word is in a different synset. Synsets are equivalent to senses: structures containing sets of terms with synonymous meanings. Each synset has a gloss that defines the concept it represents. For example, the words night, nighttime, and dark constitute a single synset that has the following gloss: the time after sunset and before sunrise while it is dark outside. Synsets are connected to one another through explicit semantic relations. Some of these relations (hypernym and hyponym for nouns, hypernym and troponym for verbs) constitute is-a-kind-of hierarchies, while the holonym and meronym relations (for nouns) constitute is-a-part-of hierarchies.

For example, tree is a kind of plant, tree is a hyponym of plant, and plant is a
hypernym of tree. Analogously, trunk is a part of a tree, and we have trunk
as a meronym of tree, and tree is a holonym of trunk. For one word and one
type of POS, if there is more than one sense, WordNet organizes them in order from the most frequently used to the least frequently used (based on frequency counts from the SemCor corpus).
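
These look-ups can be tried directly with NLTK's WordNet interface. The sketch below is illustrative; the chosen word and the printed relations simply mirror the tree example that follows.

import nltk
from nltk.corpus import wordnet as wn

nltk.download('wordnet')  # one-time download

# All senses (synsets) of "tree"; each synset has a gloss (definition).
for synset in wn.synsets('tree'):
    print(synset.name(), '-', synset.definition())

tree = wn.synset('tree.n.01')        # the most frequent noun sense

# is-a-kind-of relation: hypernyms of this sense.
print(tree.hypernyms())              # e.g. [Synset('woody_plant.n.01')]

# is-a-part-of relation: part meronyms of this sense (trunk, limb, ...).
print(tree.part_meronyms())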

Semantic similarity between sentences


Given two sentences, the measurement determines how similar the meaning
of two sentences is. The higher the score, the more similar the meaning of
the two sentences.

Here are the steps for computing semantic similarity between two
sentences:

o First, each sentence is partitioned into a list of tokens.
o Part-of-speech disambiguation (or tagging).
o Stemming words.
o Find the most appropriate sense for every word in a sentence (Word Sense Disambiguation).
o Finally, compute the similarity of the sentences based on the similarity of the pairs of words (a rough sketch of this last step is given below).
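
As a rough sketch of that final step, word-to-word similarity can be approximated with WordNet path similarity and aggregated over the best-matching word pairs. This is only one simple scheme; the helper functions below are illustrative, not a standard API.

import nltk
from nltk.corpus import wordnet as wn

nltk.download('wordnet')  # one-time download

def word_similarity(w1, w2):
    # Best path similarity over all sense pairs of the two words (0.0 if none).
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(w1)
              for s2 in wn.synsets(w2)]
    return max(scores, default=0.0)

def sentence_similarity(words1, words2):
    # For each word in the first sentence take its best match in the second, then average;
    # a symmetric version would also average over the reverse direction.
    best = [max(word_similarity(w1, w2) for w2 in words2) for w1 in words1]
    return sum(best) / len(best)

print(sentence_similarity(["dog", "barks", "loudly"], ["puppy", "shouts"]))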

Tokenization
Each sentence is partitioned into a list of words, and we remove the stop
words. Stop words are frequently occurring, insignificant words that appear
in a database record, article, or a web page, etc.
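
A small illustration of tokenization and stop-word removal with NLTK (the sentence is invented):

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')

sentence = "The students are reading a book in the library"
tokens = word_tokenize(sentence.lower())

# Drop frequently occurring, low-content words using NLTK's English stop list.
stop_words = set(stopwords.words('english'))
content_words = [t for t in tokens if t not in stop_words]

print(content_words)   # e.g. ['students', 'reading', 'book', 'library']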

Tagging part of speech (+)


This task is to identify the correct part of speech (POS - like noun, verb,
pronoun, adverb ...) of each word in the sentence. The algorithm takes a
sentence as input and a specified tag set (a finite list of POS tags). The
output is a single best POS tag for each word. There are two types of
taggers: the first one attaches syntactic roles to each word (subject,
object, ..), and the second one attaches only functional roles (noun, verb, ...).
There is a lot of work that has been done on POS tagging. The tagger can be
classified as rule-based or stochastic. Rule-based taggers use hand-written rules to resolve tag ambiguity; an example of rule-based tagging is Brill's tagger (the Eric Brill algorithm). Stochastic taggers resolve tagging ambiguities by using a training corpus to compute the probability of a given word having a given tag in a given context, for example taggers based on Hidden Markov Models and maximum-likelihood estimation.

Stemming word (+)


We use the Porter stemming algorithm. Porter stemming is a process of
removing the common morphological and inflectional endings of words. It can
be thought of as a lexicon finite state transducer with the following steps:
Surface form -> split word into possible morphemes -> getting intermediate
form -> map stems to categories and affixes to meaning -> underlying form.
For example: foxes -> fox + s -> fox.
