Natural Language Processing (NLP)
What is Natural Language Processing (NLP)?
• Computers are great at working with tabular data or spreadsheets.
However, human beings generally communicate in words and sentences,
not in tables.
• Much of the information that humans speak or write is unstructured,
so it is hard for computers to interpret.
• In natural language processing (NLP), the goal is to make computers
understand the unstructured text and retrieve meaningful pieces of
information from it.
• Natural Language Processing (NLP) is a subfield of Artificial
Intelligence (AI) that deals with the interactions between computers
and human language.
Applications of NLP
• Machine Translation.
• Speech Recognition.
• Sentiment Analysis.
• Question Answering.
• Summarization of Text.
• Chatbot.
• Intelligent Systems.
• Text Classification.
• Character Recognition.
• Spell Checking.
• Spam Detection.
• Autocomplete.
• Named Entity Recognition.
• Predictive Typing.
Understanding NLP:
• We, as humans, perform natural language processing reasonably
well. But even then, we are not perfect. We often mistake one
thing for another, and we often interpret the same sentences or
words differently.
• Example 1: “I saw a man on a hill with a telescope.”
• These are some interpretations of the sentence shown above.
• There is a man on the hill, and I watched him with my telescope.
• There is a man on the hill, and he has a telescope.
• I’m on a hill, and I saw a man using my telescope.
• I’m on a hill, and I saw a man who has a telescope.
• There is a man on a hill, and I saw him something with my telescope.
• Example 2: “Can you help me with the can?”
• In the sentence above, the word “can” appears twice, with a
different meaning each time. The first “can” is a modal verb used to
form the question. The second “can”, at the end of the sentence, is a
noun for a container that holds food or liquid.
NLP – Non Deterministic
• Hence, from the examples above, we can see that language
processing is not “deterministic” (the same sentence does not always
have the same interpretation), and something suitable to one person
might not be suitable to another.
• Therefore, Natural Language Processing (NLP) takes a
non-deterministic approach. In other words, NLP can be used to build
intelligent systems that model how humans understand and interpret
language in different situations.
NLP - Approaches
• Natural Language Processing is separated into two different
approaches:
Syntactic Analysis (Parsing)
Semantic Analysis
• Semantic analysis assigns meanings to the structures created by the
syntactic analyzer. The syntactic analyzer transfers linear sequences of
words into structures that show how the words are associated with each other.
• Semantics focuses only on the literal meaning of words, phrases, and
sentences. It abstracts only the dictionary meaning, or the real meaning,
from the given context. The structures assigned by the syntactic analyzer
always have an assigned meaning.
• E.g., "colorless green idea" would be rejected by the semantic
analyzer, because "colorless" and "green" together do not make sense.
• E.g., the semantic analyzer disregards a phrase such as “hot ice-cream”.
Discourse Integration:
• The meaning of any sentence depends upon the meaning of the
sentence just before it. In addition, it can also shape the meaning
of the sentence that immediately follows it.
• Discourse integration means a sense of the context: the meaning of
any single sentence depends upon the sentences that precede it, and
the analysis also considers the meaning of the following sentence.
• Example 1: “He works at Google.” In this sentence, “he” must refer
to someone mentioned in the sentence before it.
• Example 2: the word "that" in the sentence "He wanted that"
depends upon the prior discourse context.
Pragmatic Analysis
• Pragmatic analysis deals with the overall communicative and social content
and its effect on interpretation. It means abstracting or deriving the
meaningful use of language in situations. In this analysis, the main focus
is always on reinterpreting what was said as what was actually meant.
• It involves deriving those aspects of language which require real-world
knowledge.
• Pragmatic analysis helps users to discover this intended effect by applying a
set of rules that characterize cooperative dialogues.
• E.g., "close the window?" should be interpreted as a request instead of an
order.
• E.g., if someone were to walk up to you and say, “Ali is inside. He told me to
greet you,” you will likely understand that Ali is the person who told the
speaker to greet you.
Current challenges in NLP:
1. Breaking sentences into tokens.
2. Tagging parts of speech (POS).
3. Building an appropriate vocabulary.
4. Linking the components of a created vocabulary.
5. Understanding the context.
6. Extracting semantic meaning.
7. Named Entity Recognition (NER).
8. Transforming unstructured data into structured data.
9. Ambiguity in speech.
Popular NLP libraries
• Use-cases:
• Recommendation systems.
• Sentiment analysis.
• Building chatbots.
NLTK- word_tokenize
import nltk
nltk.download('punkt')
sentence = "At two o'clock on Thursday your mid exam will be held."
tokens = nltk.word_tokenize(sentence)
print(tokens)
Output:
['At', 'two', "o'clock", 'on', 'Thursday', 'your', 'mid', 'exam', 'will', 'be', 'held', '.']
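To make the idea of a token concrete, here is a rough approximation written with a regular expression from the standard library. It is only a sketch and not how `nltk.word_tokenize` works internally; the pattern and the name `simple_word_tokenize` are assumptions made for illustration. On this particular sentence it yields the same tokens as the output above, but it will differ from NLTK on other inputs (NLTK, for instance, splits "you're" into "you" and "'re").

```python
import re

def simple_word_tokenize(text):
    """Split text into word tokens (keeping internal apostrophes) and punctuation marks."""
    # \w+(?:'\w+)*  -> a word, optionally followed by apostrophe parts (o'clock)
    # [^\w\s]       -> any single punctuation character
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

sentence = "At two o'clock on Thursday your mid exam will be held."
tokens = simple_word_tokenize(sentence)
print(tokens)
```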
NLTK- sent_tokenize
from nltk.tokenize import sent_tokenize, word_tokenize
data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
print(sent_tokenize(data))
output: ['All work and no play makes jack dull boy.', 'All work and no play makes jack a dull boy.']
NLTK - stopwords :
from nltk.tokenize import sent_tokenize, word_tokenize
# nltk.download('stopwords')  # If you get the error "NLTK stop words not found"
from nltk.corpus import stopwords
data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
stopWords = set(stopwords.words('english'))
print(stopWords)
Output:
{'myself', 'am', 'most', 'will', "you've", 'should', 'out', 'in', 'needn', 'between',
"needn't", 'weren', 'and', 'herself', 'when', 'o', 'because', .....}
NLTK - stopwords :
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
stopWords = set(stopwords.words('english'))
words = word_tokenize(data)
wordsFiltered = []
# Keep only the tokens that are not in the stop-word set
for w in words:
    if w not in stopWords:
        wordsFiltered.append(w)
print(wordsFiltered)
output: ['All', 'work', 'play', 'makes', 'jack', 'dull', 'boy', '.', 'All', 'work', 'play', 'makes', 'jack', 'dull', 'boy', '.']
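Note that 'All' survives in the output above: the comparison is case-sensitive, and the stop-word list is lowercase. A possible variation, sketched below with the standard library only, lowercases each token before the lookup. The small `STOP_WORDS` set and the function name are hypothetical stand-ins chosen for this sketch, not NLTK's full English list.

```python
# Hypothetical, hand-picked stop-word subset used only for this sketch;
# NLTK's English list is much longer.
STOP_WORDS = {"all", "and", "no", "a"}

def filter_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["All", "work", "and", "no", "play", "makes", "jack", "a", "dull", "boy", "."]
filtered = filter_stop_words(tokens)
print(filtered)
```

Whether to lowercase first depends on the task; case can carry information (for example, proper nouns), so this is a design choice rather than a fix.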
NLTK - stemming
• Stemming is the process of reducing the morphological variants of
a word to a common root/base form.
• Stemmers remove morphological affixes from words, leaving only the
word stem.
• For example, the stem of the word waiting is wait.
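The idea of removing affixes can be sketched with a toy suffix stripper. This is deliberately not the Porter algorithm (which applies many more rules and conditions); the suffix list, the minimum stem length, and the name `toy_stem` are all assumptions made for illustration.

```python
# Hypothetical suffix list for illustration only; the Porter algorithm
# applies many more rules and conditions.
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for word in ["waiting", "waited", "waits", "wait"]:
    print(word, "->", toy_stem(word))
```

A rule set this small mis-stems many words, which is one reason real stemmers are more elaborate.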
NLTK - stemming
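from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

words = ["game","gaming","gamed","games"]
ps = PorterStemmer()

# Assumed loop (cut off in the original slide): stem and print each word
for word in words:
    print(ps.stem(word))

output:
game
game
game
game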
Stemming vs. Lemmatizing
Difference between Stemmer and Lemmatizer:
[Figure: lemmatizer output with the PoS tag set to “Verb- v”, compared with the default value of the PoS tag, “Noun-n”]
Bag of Words:
• A bag of words model converts the raw text into words, and it also
counts the frequency for the words in the text. In summary, a bag of
words is a collection of words that represent a sentence along with
the word count where the order of occurrences is not relevant.
Bag of Words:
1. Raw Text: This is the original text on which we want to
perform analysis.
2. Clean Text: Our raw text contains some unnecessary data
like punctuation marks and stopwords, so we need to clean
it up. Clean text is the text after removing such data.
3. Tokenize: Tokenization represents the sentence as a group
of tokens or words.
4. Building Vocab: The vocabulary contains all the words used
in the text after removing unnecessary data.
5. Generate Vocab: The words are stored along with their
frequencies in the sentences.
• Sentences: Creating a basic structure
1. Jim and Pam traveled by bus.
2. The train was late.
3. The flight was full. Traveling by flight is expensive.
Bag of Words:
[Figure: words with frequencies for each sentence; combining all the words]
Bag of Words:
[Figure: final model]
Python Implementation:
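A minimal sketch of the steps above, using only the standard library and the three example sentences. The stop-word subset and the helper name `clean_and_tokenize` are assumptions made for this sketch; a library such as NLTK would normally supply the stop-word list.

```python
import re
from collections import Counter

documents = [
    "Jim and Pam traveled by bus.",
    "The train was late.",
    "The flight was full. Traveling by flight is expensive.",
]

# Hypothetical stop-word subset chosen for this sketch (an assumption).
STOP_WORDS = {"and", "by", "the", "was", "is"}

def clean_and_tokenize(text):
    """Lowercase, drop punctuation, split into words, remove stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

tokenized = [clean_and_tokenize(doc) for doc in documents]

# Vocabulary: every distinct word across all documents, in sorted order.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One count vector per document, aligned with the vocabulary.
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each vector has one entry per vocabulary word, so word order within a sentence is lost; only the counts remain.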
Applications & Limitations
• Applications:
1. Natural language processing.
2. Information retrieval from documents.
3. Classification of documents.
• Limitations:
1. Semantic meaning: It does not consider the semantic meaning of
a word. It ignores the context in which the word is used.
2. Vector size: For large documents, the vector size increases, which
may result in higher computational time.
3. Preprocessing: The text must be cleaned before the model can
be used.
Term Frequency — Inverse Document Frequency (TF-IDF)
• TF-IDF stands for Term Frequency — Inverse Document Frequency,
which is a scoring measure generally used in information retrieval (IR)
and summarization.
• The TF-IDF score shows how important or relevant a term is in a given
document.
Term Frequency — Inverse Document Frequency (TF-IDF)
• If a particular word appears multiple times in a document, then it
might have higher importance than the other words that appear
fewer times (TF). At the same time, if that word also appears many
times in many other documents, then it is probably just a common
word, so we cannot assign much importance to it (IDF).
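One common way to write this down is shown below. The slides do not say which variant is meant, so treat this formulation as an assumption; libraries differ in smoothing and normalization. Here f_{t,d} is the number of times term t occurs in document d, N is the number of documents, and n_t is the number of documents that contain t.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log \frac{N}{n_t}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

A term that occurs in every document has n_t = N, so its idf is log 1 = 0 and it contributes nothing to the score.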
Term Frequency — Inverse Document Frequency (TF-IDF)
• For instance, we have a database of thousands of dog descriptions,
and the user wants to search for “a cute dog” from our database. The
job of our search engine would be to display the closest response to
the user query.
How would a search engine do that?
• The search engine will possibly use TF-IDF to calculate a score for
each of our descriptions, and the results with the highest scores will
be displayed as a response to the user. This is the case when there
is no exact match for the user’s query.
• If there is an exact match for the user query, then that result will be
displayed first.
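The ranking idea can be sketched as follows, using only the standard library. The three descriptions are invented for this sketch, and the scoring uses one common TF-IDF variant (relative term frequency times log(N / document frequency)); a real search engine would likely use a tuned library implementation instead.

```python
import math
import re
from collections import Counter

# Hypothetical mini "database" of dog descriptions (invented for this sketch).
descriptions = [
    "a large guard dog that barks a lot",
    "a cute small dog that loves to play",
    "a quiet old dog that sleeps all day",
]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

docs = [tokenize(d) for d in descriptions]
n_docs = len(docs)

# Document frequency: in how many descriptions each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc):
    """Relative term frequency times log(N / document frequency)."""
    if term not in doc_freq:
        return 0.0
    tf = doc.count(term) / len(doc)
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf

def score(query, doc):
    """Sum the TF-IDF weights of the query terms in one document."""
    return sum(tf_idf(term, doc) for term in tokenize(query))

query = "a cute dog"
ranked = sorted(range(n_docs), key=lambda i: score(query, docs[i]), reverse=True)
best = descriptions[ranked[0]]
print(best)
```

Because "a" and "dog" occur in every description, their IDF is zero and only "cute" separates the candidates.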
Term Frequency — Inverse Document Frequency (TF-IDF)
• https://www.analyticsvidhya.com/blog/2021/09/creating-a-movie-reviews-classifier-using-tf-idf-in-python/
NLTK – speech tagging example
The meanings of the part-of-speech tag codes are shown in the table below:
NLTK – speech tagging example
import nltk
# nltk.download('averaged_perceptron_tagger')  # if not downloaded before
from nltk.tokenize import PunktSentenceTokenizer

document = 'Whether you\'re new to programming or an experienced developer, it\'s easy to learn and use Python.'

sentences = nltk.sent_tokenize(document)
for sent in sentences:
    print(nltk.pos_tag(nltk.word_tokenize(sent)))

output:
[('Whether', 'IN'), ('you', 'PRP'), ("'re", 'VBP'), ('new', 'JJ'), ('to', 'TO'), ('programming', 'VBG'), ('or', 'CC'), ('an', 'DT'), ('experienced', 'JJ'), ('developer', 'NN'), (',', ','), ('it', 'PRP'), ("'s", 'VBZ'), ('easy', 'JJ'), ('to', 'TO'), ('learn', 'VB'), ('and', 'CC'), ('use', 'VB'), ('Python', 'NNP'), ('.', '.')]
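The tag codes in the output above come from the Penn Treebank tag set. As a reading aid, the sketch below maps just the tags that appear in that output to short descriptions; it needs no NLTK, and the names `TAG_MEANINGS` and `describe` are hypothetical helpers introduced here.

```python
# Meanings of the Penn Treebank tags that appear in the output above.
TAG_MEANINGS = {
    "CC": "coordinating conjunction",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNP": "proper noun, singular",
    "PRP": "personal pronoun",
    "TO": "to",
    "VB": "verb, base form",
    "VBG": "verb, gerund or present participle",
    "VBP": "verb, non-3rd person singular present",
    "VBZ": "verb, 3rd person singular present",
}

def describe(tagged_tokens):
    """Pair each token with a readable description of its tag."""
    return [
        (word, TAG_MEANINGS.get(tag, "punctuation or other tag: " + tag))
        for word, tag in tagged_tokens
    ]

# A few (word, tag) pairs copied from the tagger output above.
sample = [("it", "PRP"), ("'s", "VBZ"), ("easy", "JJ"), (".", ".")]
described = describe(sample)
for word, meaning in described:
    print(word, "->", meaning)
```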
NLP – Gender Prediction Example
• Given a name, the classifier predicts whether it is male or
female.
• To create our analysis program, we have several steps:
• Data preparation
• Feature extraction
• Training & Prediction
Gender Prediction - Data preparation
• The first step is to prepare data.
• We use the names set included with nltk.
# nltk.download('names')
from nltk.corpus import names

# Load data and training
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])
print(names)

output:
[('Aamir', 'male'), ('Aaron', 'male'), ('Abbey', 'male'), ('Mersey', 'female'), ('Meryl', 'female'), ('Meta', 'female'), ('Mia', 'female'), .... ]
Gender Prediction - Feature extraction
• Based on the dataset, we prepare our feature:
• The feature we will use is the last letter of a name
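import nltk

def gender_features(word):
    return {'last_letter': word[-1]}

# Train (assumed step, not shown on the original slide): build one feature
# set per labeled name from the data preparation step, then fit Naive Bayes.
featuresets = [(gender_features(n), g) for (n, g) in names]
classifier = nltk.NaiveBayesClassifier.train(featuresets)

# Predict
print(classifier.classify(gender_features('Frank')))
Output : male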
Sentiment Analysis Example
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
def word_feats(words):
    return dict([(word, True) for word in words])
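The slide stops after the feature function. To show how such word-presence features could feed a classifier, here is a from-scratch Naive Bayes sketch using only the standard library. It is not NLTK's `NaiveBayesClassifier`, and the training words, the `train` and `classify` helpers, and the add-one smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def word_feats(words):
    # Same feature shape as on the slide: each word present maps to True.
    return {word: True for word in words}

# Hypothetical, hand-picked training examples (not from the slides).
TRAIN = [
    (word_feats(["awesome", "great", "nice"]), "pos"),
    (word_feats(["good", "fantastic", "love"]), "pos"),
    (word_feats(["bad", "terrible", "useless"]), "neg"),
    (word_feats(["hate", "awful", "boring"]), "neg"),
]

def train(examples):
    """Count labels and, per label, how often each feature name occurs."""
    label_counts = Counter()
    feat_counts = defaultdict(Counter)
    for feats, label in examples:
        label_counts[label] += 1
        for name in feats:
            feat_counts[label][name] += 1
    return label_counts, feat_counts

def classify(model, feats):
    """Return the label with the highest log-probability (add-one smoothing)."""
    label_counts, feat_counts = model
    total = sum(label_counts.values())
    vocab = {name for counts in feat_counts.values() for name in counts}
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)
        denom = sum(feat_counts[label].values()) + len(vocab)
        for name in feats:
            if name in vocab:  # words never seen in training are ignored
                score += math.log((feat_counts[label][name] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAIN)
prediction = classify(model, word_feats("a great and nice movie".split()))
print(prediction)
```

With so little training data the result only reflects the hand-picked vocabulary; a realistic setup would train on a labeled corpus of reviews.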
Disadvantages of NLP
• Complex query language: the system may not be able to provide the
correct answer if the question is poorly worded or ambiguous.
• The system is built for a single, specific task only; it is unable to
adapt to new domains and problems because of its limited functions.
• An NLP system may lack a user interface with features that allow
users to interact further with the system.