
Natural Language Processing (NLP)
What is Natural Language Processing (NLP)?
• Computers and machines are great at working with tabular data or spreadsheets. However, human beings generally communicate in words and sentences, not in the form of tables.
• Much of the information that humans speak or write is unstructured, so it is not easy for computers to interpret it.
• In natural language processing (NLP), the goal is to make computers understand unstructured text and retrieve meaningful pieces of information from it.
• Natural Language Processing (NLP) is a subfield of Artificial Intelligence (AI) that deals with the interactions between computers and humans in natural language.
Applications of NLP
• Machine Translation
• Speech Recognition
• Sentiment Analysis
• Question Answering
• Text Summarization
• Chatbots
• Intelligent Systems
• Text Classification
• Character Recognition
• Spell Checking
• Spam Detection
• Autocomplete
• Named Entity Recognition
• Predictive Typing
Understanding NLP:
• We, as humans, perform natural language processing considerably well, but even then, we are not perfect. We often mistake one thing for another, and we often interpret the same sentences or words differently.
• Example 1: "I saw a man on a hill with a telescope."
• These are some interpretations of the sentence shown above:
• There is a man on the hill, and I watched him with my telescope.
• There is a man on the hill, and he has a telescope.
• I'm on a hill, and I saw a man using my telescope.
• I'm on a hill, and I saw a man who has a telescope.
• There is a man, and he is on a hill that has a telescope on it.
• Example 2: "Can you help me with the can?"
• In the sentence above, there are two occurrences of the word "can", but they have different meanings. The first "can" is a modal verb used to form the question; the second "can", at the end of the sentence, refers to a container that holds food or liquid.
NLP – Non-Deterministic
• Hence, from the examples above, we can see that language processing is not "deterministic" (the same sentence does not always have the same interpretation), and what seems suitable to one person might not be suitable to another.
• Therefore, Natural Language Processing (NLP) takes a non-deterministic approach. In other words, NLP can be used to create intelligent systems that understand how humans understand and interpret language in different situations.
NLP - Approaches
• Natural Language Processing is separated into two different approaches:
• Rule-based Natural Language Processing
• Statistical Natural Language Processing
Rule-based Natural Language Processing:
• It uses common sense reasoning for processing tasks.
• For instance, the freezing temperature can lead to death, or hot
coffee can burn people’s skin, along with other common sense
reasoning tasks.
• However, this process can be time-consuming, and it requires manual effort.
Statistical Natural Language Processing:
• It uses large amounts of data and tries to derive conclusions from it.
• Statistical NLP uses machine learning algorithms to train NLP models. After successful training on large amounts of data, the trained model can draw accurate conclusions from new data.
Rule-based NLP vs. Statistical NLP
• Rule-based NLP relies on hand-crafted linguistic rules; it is transparent and predictable, but building and maintaining the rules takes substantial manual effort.
• Statistical NLP learns patterns from large amounts of data; it scales well and adapts to new text, but it needs large training corpora and its decisions are harder to interpret.
Components of NLP
There are five components of NLP, described in turn below: Lexical Analysis, Syntactic Analysis (Parsing), Semantic Analysis, Discourse Integration, and Pragmatic Analysis.
Lexical Analysis:
• It involves identifying and analyzing the structure of words.
• The lexicon of a language is the collection of words and phrases in that language. Lexical analysis divides the whole chunk of text into paragraphs, sentences, and words.
• Individual words are analyzed into their components, and non-word tokens such as punctuation marks are separated from the words.
Syntactic Analysis (Parsing)
• Syntactic Analysis (Parsing) involves analyzing the words in a sentence for grammar and arranging them in a manner that shows the relationships among the words. A sentence such as "The school goes to boy" is rejected by an English syntactic analyzer.
Semantic Analysis
• Semantic Analysis assigns meanings to the structures created by the syntactic analyzer. This component maps linear sequences of words into structures that show how the words are associated with each other.
• Semantics focuses only on the literal meaning of words, phrases, and sentences; it abstracts only the dictionary meaning from the given context. The structures assigned by the syntactic analyzer always have meanings assigned to them.
• E.g., "colorless green idea" would be rejected by the semantic analyzer, because "colorless green" does not make sense.
• E.g., the semantic analyzer disregards sentences such as "hot ice-cream".
Discourse Integration:
• The meaning of any sentence depends upon the meaning of the sentence just before it; in addition, it can also shape the meaning of the immediately succeeding sentence.
• It means a sense of the context: the meaning of a single sentence depends on the sentences that precede it, and it also affects the meaning of the sentences that follow it.
• Example 1: "He works at Google." In this sentence, "he" must be resolved by a reference in the sentence before it.
• Example 2: the word "that" in the sentence "He wanted that" depends upon the prior discourse context.
Pragmatic Analysis
• Pragmatic Analysis deals with the overall communicative and social content and its effect on interpretation. It means abstracting or deriving the meaningful use of language in situations. In this analysis, the main focus is on what was said being reinterpreted as what was actually meant.
• It involves deriving those aspects of language which require real-world knowledge.
• Pragmatic analysis helps users discover this intended effect by applying a set of rules that characterize cooperative dialogues.
• E.g., "Close the window?" should be interpreted as a request rather than an order.
• E.g., if someone were to walk up to you and say, "Ali is inside. He told me to greet you," you would likely understand that Ali is the person who told the speaker to greet you.
Current challenges in NLP:
1. Breaking sentences into tokens.
2. Tagging parts of speech (POS).
3. Building an appropriate vocabulary.
4. Linking the components of a created vocabulary.
5. Understanding the context.
6. Extracting semantic meaning.
7. Named Entity Recognition (NER).
8. Transforming unstructured data into structured data.
9. Ambiguity in speech.
Popular NLP libraries
• NLTK (Natural Language Toolkit)
• spaCy
• Gensim
• Pattern
• TextBlob
NLTK (Natural Language Toolkit)
• The NLTK Python framework is generally used as an education and research tool. It is not usually used in production applications. However, it can be used to build exciting programs due to its ease of use.
• Some features are:
• Tokenization.
• Part-of-Speech tagging (POS).
• Named Entity Recognition (NER).
• Classification.
• Sentiment analysis.
• Chatbot packages.
NLTK (Natural Language Toolkit)

• Use-cases:
• Recommendation systems.
• Sentiment analysis.
• Building chatbots.
NLTK – word_tokenize

import nltk
nltk.download('punkt')

sentence = "At two o'clock on Thursday your mid exam will be held."
tokens = nltk.word_tokenize(sentence)
print(tokens)

Output:
['At', 'two', "o'clock", 'on', 'Thursday', 'your', 'mid', 'exam', 'will', 'be', 'held', '.']
NLTK – sent_tokenize

from nltk.tokenize import sent_tokenize, word_tokenize

data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
print(sent_tokenize(data))

Output: ['All work and no play makes jack dull boy.', 'All work and no play makes jack a dull boy.']
NLTK – stopwords:

from nltk.tokenize import sent_tokenize, word_tokenize
# nltk.download('stopwords')  # if you get the error "NLTK stop words not found"
from nltk.corpus import stopwords

data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
stopWords = set(stopwords.words('english'))
print(stopWords)

Output:
{'myself', 'am', 'most', 'will', "you've", 'should', 'out', 'in', 'needn', 'between', "needn't", 'weren', 'and', 'herself', 'when', 'o', 'because', .....}
NLTK – stopwords:

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
stopWords = set(stopwords.words('english'))
words = word_tokenize(data)
wordsFiltered = []
for w in words:
    if w not in stopWords:
        wordsFiltered.append(w)
print(wordsFiltered)

Output: ['All', 'work', 'play', 'makes', 'jack', 'dull', 'boy', '.', 'All', 'work', 'play', 'makes', 'jack', 'dull', 'boy', '.']
NLTK – stemming
• Stemming is the process of reducing morphological variants of a word to a common root/base form.
• Stemmers remove morphological affixes from words, leaving only the word stem.
• For example, the stem of the word "waiting" is "wait".
NLTK – stemming

from nltk.stem import PorterStemmer

words = ["game", "gaming", "gamed", "games"]
ps = PorterStemmer()

for word in words:
    print(ps.stem(word))

Output:
game
game
game
game
NLTK – Various Stemming Algorithms:
• Porter's Stemmer
• Snowball Stemmer
• Lovins Stemmer
• Dawson's Stemmer
• Krovetz Stemmer
• Xerox Stemmer
NLTK – stemming – SnowballStemmer

import nltk
from nltk.stem.snowball import SnowballStemmer

# print(SnowballStemmer.languages)
# the stemmer requires a language parameter
snow_stemmer = SnowballStemmer(language='english')

# list of tokenized words
words = ['cared', 'university', 'fairly', 'easily', 'singing',
         'sings', 'sung', 'singer', 'sportingly']

# stem of each word
stem_words = []
for w in words:
    x = snow_stemmer.stem(w)
    stem_words.append(x)

for e1, e2 in zip(words, stem_words):
    print(e1 + ' ----> ' + e2)

Output:
cared ----> care
university ----> univers
fairly ----> fair
easily ----> easili
singing ----> sing
sings ----> sing
sung ----> sung
singer ----> singer
sportingly ----> sport
NLTK – Lemmatization
• Lemmatization tries to achieve a similar base "stem" for a word. However, what makes it different is that it finds the dictionary word instead of truncating the original word. Stemming does not consider the context of the word; that is why it generates results faster, but it is less accurate than lemmatization.
• If accuracy is not the project's final goal, then stemming is an appropriate approach. If higher accuracy is crucial and the project is not on a tight deadline, then the best option is lemmatization (lemmatization has a lower processing speed compared to stemming).
NLTK – Lemmatization
Lemmatization takes into account Part-of-Speech (POS) values. Also, lemmatization may generate different outputs for different values of POS. We generally have four choices for POS: noun ("n"), verb ("v"), adjective ("a"), and adverb ("r").
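
To make the POS effect concrete, here is a minimal sketch using NLTK's WordNetLemmatizer (not shown on the original slide):

import nltk
# nltk.download('wordnet')  # if not downloaded before
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# the same word can lemmatize differently depending on the POS value
print(lemmatizer.lemmatize("stripes", pos='n'))  # stripe
print(lemmatizer.lemmatize("better", pos='a'))   # good
print(lemmatizer.lemmatize("running", pos='v'))  # run
print(lemmatizer.lemmatize("running", pos='n'))  # running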
Difference between Stemmer and Lemmatizer:
• A stemmer truncates words to stems without using a vocabulary or PoS information, so it is faster but may produce non-dictionary words (e.g., "univers").
• A lemmatizer returns the dictionary form (lemma) of a word and takes the PoS tag into account; with NLTK's WordNetLemmatizer the default value of the PoS tag is "Noun – n", and passing "Verb – v" can change the output.
Bag of Words:
• A bag-of-words model converts raw text into words and counts the frequency of each word in the text. In summary, a bag of words is a collection of words that represents a sentence along with the word counts, where the order of occurrence is not relevant.
Bag of Words:
1. Raw Text: This is the original text on which we want to perform analysis.
2. Clean Text: Since our raw text contains unnecessary data like punctuation marks and stopwords, we need to clean up our text. Clean text is the text after removing such words.
3. Tokenize: Tokenization represents the sentence as a group of tokens or words.
4. Building Vocab: The vocabulary contains the unique words used in the text after removing unnecessary data.
5. Generate Vectors: Each sentence is represented by the vocabulary words along with their frequencies in that sentence.
• Sentences: Creating a basic structure
1. Jim and Pam traveled by bus.
2. The train was late.
3. The flight was full. Traveling by flight is expensive.
Bag of Words:
[Figures: word frequencies for each sentence, the combined vocabulary, and the final bag-of-words model]
Python Implementation:
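
The slide's original code is not reproduced here; below is a minimal sketch using scikit-learn's CountVectorizer (one common way to build a bag-of-words model) applied to the three sentences above:

from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "Jim and Pam traveled by bus.",
    "The train was late.",
    "The flight was full. Traveling by flight is expensive.",
]

vectorizer = CountVectorizer(stop_words='english')  # clean common stopwords
bow = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # the vocabulary
print(bow.toarray())                       # word counts per sentence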
Applications & Limitations
• Applications:
1. Natural language processing.
2. Information retrieval from documents.
3. Classification of documents.
• Limitations:
1. Semantic meaning: It does not consider the semantic meaning of a word, and it ignores the context in which the word is used.
2. Vector size: For large documents, the vector size increases, which may result in higher computational time.
3. Preprocessing: We need to perform data cleansing on the text before using it.
Term Frequency — Inverse Document Frequency (TF-IDF)
• TF-IDF stands for Term Frequency — Inverse Document Frequency,
which is a scoring measure generally used in information retrieval (IR)
and summarization.
• The TF-IDF score shows how important or relevant a term is in a given
document.
Term Frequency — Inverse Document Frequency (TF-IDF)
• If a particular word appears multiple times in a document, then it
might have higher importance than the other words that appear
fewer times (TF). At the same time, if a particular word appears many
times in a document, but it is also present many times in some other
documents, then maybe that word is frequent, so we cannot assign
much importance to it (IDF).
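
As a concrete formulation (one common variant; libraries differ in smoothing and normalization details), the score of a term t in a document d, from a collection of N documents, can be written as:

tf-idf(t, d) = tf(t, d) × log(N / df(t))

where tf(t, d) is the number of times t occurs in d, and df(t) is the number of documents that contain t. For example, if "cute" appears twice in one description and occurs in 100 out of 10,000 descriptions overall, its score in that description is 2 × log(10,000 / 100) = 2 × log(100) ≈ 9.2 using the natural logarithm.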
Term Frequency — Inverse Document Frequency (TF-IDF)
• For instance, we have a database of thousands of dog descriptions,
and the user wants to search for “a cute dog” from our database. The
job of our search engine would be to display the closest response to
the user query.
How would a search engine do that?
• The search engine will possibly use TF-IDF to calculate the score for all of our descriptions, and the results with the highest scores will be displayed as a response to the user. This is the case when there is no exact match for the user's query.
• If there is an exact match for the user query, then that result will be displayed first.
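
A minimal sketch of this idea using scikit-learn's TfidfVectorizer and cosine similarity (the library choice and the toy descriptions are assumptions for illustration; the slides do not prescribe an implementation):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

descriptions = [
    "a playful and cute puppy that loves people",
    "a guard dog trained for protection",
    "a cute small dog that enjoys cuddles",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(descriptions)  # score every description

query_vector = vectorizer.transform(["a cute dog"])   # score the user query
scores = cosine_similarity(query_vector, doc_vectors)
print(scores)  # the description with the highest score is shown first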
Term Frequency — Inverse Document Frequency (TF-IDF)
• Worked example: https://www.analyticsvidhya.com/blog/2021/09/creating-a-movie-reviews-classifier-using-tf-idf-in-python/
NLTK – speech tagging example
The meanings of the POS tag codes are shown in the table below:
[Table: Penn Treebank POS tag codes, e.g. NN = noun (singular), VB = verb (base form), JJ = adjective, PRP = personal pronoun, DT = determiner, IN = preposition/subordinating conjunction, CC = coordinating conjunction]
NLTK – speech tagging example

import nltk
# nltk.download('averaged_perceptron_tagger')  # if not downloaded before

document = 'Whether you\'re new to programming or an experienced developer, it\'s easy to learn and use Python.'

sentences = nltk.sent_tokenize(document)
for sent in sentences:
    print(nltk.pos_tag(nltk.word_tokenize(sent)))

Output:
[('Whether', 'IN'), ('you', 'PRP'), ("'re", 'VBP'), ('new', 'JJ'), ('to', 'TO'), ('programming', 'VBG'), ('or', 'CC'), ('an', 'DT'), ('experienced', 'JJ'), ('developer', 'NN'), (',', ','), ('it', 'PRP'), ("'s", 'VBZ'), ('easy', 'JJ'), ('to', 'TO'), ('learn', 'VB'), ('and', 'CC'), ('use', 'VB'), ('Python', 'NNP'), ('.', '.')]
NLP – Gender Prediction Example
• Given a name, the classifier will predict whether it is male or female.
• To create our analysis program, we have several steps:
• Data preparation
• Feature extraction
• Training & Prediction
Gender Prediction – Data preparation
• The first step is to prepare the data.
• We use the names corpus included with nltk.

# nltk.download('names')  # if not downloaded before
from nltk.corpus import names

# Load data: pair each name with its gender label
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])
print(names)

Output:
[('Aamir', 'male'), ('Aaron', 'male'), ('Abbey', 'male'), ..., ('Mersey', 'female'), ('Meryl', 'female'), ('Meta', 'female'), ('Mia', 'female'), .... ]
Gender Prediction – Feature extraction
• Based on the dataset, we prepare our feature:
• The feature we will use is the last letter of a name.

def gender_features(word):
    return {'last_letter': word[-1]}

featuresets = [(gender_features(n), g) for (n, g) in names]
print(featuresets)

Output: [({'last_letter': 'r'}, 'male'), ({'last_letter': 'n'}, 'male'), ({'last_letter': 'y'}, 'male') ....]
Gender Prediction – Training and prediction
• Based on the feature sets, we train a Naive Bayes classifier and use it to predict the gender of an unseen name:

train_set = featuresets
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Predict
print(classifier.classify(gender_features('Frank')))

Output: male
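
The sketch above trains on all of the available names; in practice you would hold out a test set to measure accuracy. A minimal extension in the style of the standard NLTK book example (the split size of 500 is an arbitrary choice):

import random

random.shuffle(featuresets)
train_set, test_set = featuresets[500:], featuresets[:500]  # hold out 500 names

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))  # accuracy on the held-out names
classifier.show_most_informative_features(5)         # which last letters matter most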
Sentiment Analysis Example

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier

def word_feats(words):
    return dict([(word, True) for word in words])

positive_vocab = ['awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)']
negative_vocab = ['bad', 'terrible', 'useless', 'hate', ':(']
neutral_vocab = ['movie', 'the', 'sound', 'was', 'is', 'actors', 'did', 'know', 'words', 'not']

positive_features = [(word_feats(pos), 'pos') for pos in positive_vocab]
negative_features = [(word_feats(neg), 'neg') for neg in negative_vocab]
neutral_features = [(word_feats(neu), 'neu') for neu in neutral_vocab]

train_set = negative_features + positive_features + neutral_features

classifier = NaiveBayesClassifier.train(train_set)
Sentiment Analysis Example

# Predict
neg = 0
pos = 0
sentence = "Awesome movie, I liked it"
sentence = sentence.lower()
words = sentence.split(' ')
for word in words:
    classResult = classifier.classify(word_feats(word))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1

print('Positive: ' + str(float(pos)/len(words)))
print('Negative: ' + str(float(neg)/len(words)))

Output:
Positive: 0.6
Negative: 0.2
spaCy:
• spaCy is an open-source natural language processing Python library
designed to be fast and production-ready. spaCy focuses on providing
software for production usage.
• Features:
• Tokenization.
• Part Of Speech tagging (POS).
• Named Entity Recognition (NER).
• Classification.
• Sentiment analysis.
• Dependency parsing.
• Word vectors.
spaCy:
• Use-cases:
• Autocomplete and autocorrect.
• Analyzing reviews.
• Summarization.
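
A minimal sketch of spaCy's pipeline (this assumes the small English model en_core_web_sm is installed, e.g. via python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# POS tagging and dependency parsing come from the same pipeline
for token in doc:
    print(token.text, token.pos_, token.dep_)

# Named Entity Recognition
for ent in doc.ents:
    print(ent.text, ent.label_)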
Gensim
• Gensim is an NLP Python framework generally used in topic modeling
and similarity detection. It is not a general-purpose NLP library, but it
handles tasks assigned to it very well.
• Features:
• Latent semantic analysis.
• Non-negative matrix factorization.
• TF-IDF.
Gensim
• Use-cases:
• Converting documents to vectors.
• Finding text similarity.
• Text summarization.
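
A minimal sketch of the document-to-vector and similarity workflow in Gensim (the toy corpus is invented for illustration):

from gensim import corpora, models, similarities

docs = [["human", "computer", "interaction"],
        ["graph", "trees", "computer"],
        ["graph", "minors", "trees"]]

dictionary = corpora.Dictionary(docs)                 # map tokens to ids
corpus = [dictionary.doc2bow(doc) for doc in docs]    # bag-of-words vectors

tfidf = models.TfidfModel(corpus)                     # fit TF-IDF weights
index = similarities.MatrixSimilarity(tfidf[corpus])  # build a similarity index

query = dictionary.doc2bow(["computer", "trees"])
print(index[tfidf[query]])                            # similarity to each document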
Pattern
• Pattern is an NLP Python framework with straightforward syntax. It's a powerful tool for scientific and non-scientific tasks. It is highly valuable to students.
• Features:
• Tokenization.
• Part of Speech tagging.
• Named entity recognition.
• Parsing.
• Sentiment analysis.
Pattern
• Use-cases:
• Spelling correction.
• Search engine optimization.
• Sentiment analysis.
TextBlob
• TextBlob is a Python library designed for processing textual data.
• Features:
• Part-of-Speech tagging.
• Noun phrase extraction.
• Sentiment analysis.
• Classification.
• Language translation.
• Parsing.
• Wordnet integration.
TextBlob
• Use-cases:
• Sentiment Analysis.
• Spelling Correction.
• Translation and Language Detection.
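
A minimal sketch of TextBlob's API (the example sentence is invented; noun phrase extraction requires the TextBlob corpora, installable via python -m textblob.download_corpora):

from textblob import TextBlob

blob = TextBlob("NLTK is a great library, but the documentation could be better.")

print(blob.tags)          # part-of-speech tags
print(blob.noun_phrases)  # noun phrase extraction
print(blob.sentiment)     # Sentiment(polarity=..., subjectivity=...)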
Advantages of NLP
• Users can ask questions about any subject and get a direct response within seconds.
• An NLP system provides answers to questions in natural language.
• An NLP system offers exact answers to questions, with no unnecessary or unwanted information.
• The accuracy of the answers increases with the amount of relevant information provided in the question.
• NLP helps computers communicate with humans in their own language and scales other language-related tasks.
• It allows you to process more language-based data than a human being, without fatigue and in an unbiased and consistent way.
• It can structure highly unstructured data sources.
Disadvantages of NLP
• Complex query language: the system may not be able to provide the correct answer if the question is poorly worded or ambiguous.
• The system is often built for a single, specific task only; it is unable to adapt to new domains and problems because of its limited functions.
• An NLP system may lack a user interface with features that allow users to further interact with the system.
