
1a. Differentiate between NLP and NLU.

NLP, or natural language processing, evolved from computational linguistics, which aims to model
natural human language data. NLP processes large amounts of human language data and focuses on
the use of machine learning and deep learning techniques. It is commonly used in computer science,
information systems, linguistics, communications, and philosophy.

NLP has many subfields, including computational linguistics, syntax analysis, speech recognition,
machine translation, and more.

Natural language processing works by taking unstructured text and converting it into a structured
format. It works by building an algorithm and training a model on large amounts of data so that it can
work out what the user means when they say something.

It works by identifying entities in text (named entity recognition) and by identifying word patterns.
The word patterns are identified using methods such as tokenization, stemming, and lemmatization.

NLP undertakes various tasks such as parsing, speech recognition, part-of-speech tagging, and
information extraction.

In the real world, NLP is used for text summarization, sentiment analysis, topic extraction, named
entity recognition, parts-of-speech tagging, relationship extraction, stemming, text mining, machine
translation, and automated question answering, as well as ontology population, language modelling,
and any other language-related task.

NLU is a subset of natural language processing that uses the semantic analysis of text to
understand the meaning of sentences. It's possible that the same text can have many meanings,
that different words can have the same meaning, or that the meaning can change depending on the
situation.

NLU algorithms process text from different sources using computational methods to reach some
understanding of an input text, which can be as simple as understanding what a single sentence says
or as complex as understanding a dialogue between two people. So, NLU uses computational methods
to understand text and produce a result.

NLU can be used in many different ways, including understanding dialogue between two people,
understanding how someone feels about a particular situation, and other similar scenarios.

There are three linguistic levels at which NLU operates:

 Syntax: This is the process of understanding how sentences are constructed and whether the
grammar is used correctly. For example, to understand whether a sentence makes sense, it must
be considered in context and its syntax analyzed.

 Semantics: This looks at the text for contextual cues to meaning, such as tone of voice or word
choice between two people. An NLU algorithm can use these cues to produce results across all the
possible contexts in which the same piece of spoken or written language might occur.

 Pragmatic analysis: This helps understand the context and what the text is trying to achieve.

A closely related task is word sense disambiguation: the process of determining the meaning of a
word in a sentence by assigning it a sense based on its context, as in the sketch below.
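A minimal illustration of word sense disambiguation using NLTK's built-in Lesk algorithm
(nltk.wsd.lesk); the example sentence is hypothetical:

import nltk
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

# nltk.download('wordnet')  # required once for the WordNet senses
# nltk.download('punkt')    # required once for word_tokenize

sentence = word_tokenize("I went to the bank to deposit my money")

# lesk() returns the WordNet synset whose definition best
# overlaps with the words of the surrounding context
print(lesk(sentence, 'bank'))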

The major difference between NLU and NLP is that NLP focuses on building algorithms to recognize
and process natural language, while NLU focuses on understanding the meaning of a sentence.
Another difference is that NLP breaks language down and processes it, while NLU provides language
comprehension.

Both NLU and NLP use supervised learning, which means that they train their models using labelled
data. However, they differ in how this training is done.

Another difference between NLU and NLP is that NLU is focused more on sentiment analysis.
Sentiment analysis involves extracting information from the text in order to determine the emotional
tone of a text.

Natural language processing and natural language understanding are not just about training on a
dataset. The computer uses NLP algorithms to detect patterns in large amounts of unstructured data.

NLU recognizes that understanding language is a complex task made up of many components, such as
emotions and facial expressions. Furthermore, NLU enables computer programmes to deduce
intent from language, even if the written or spoken language is flawed.

1b. Write regular expression for validation of email

A regular expression is a sequence of characters that defines a search pattern. Regular expressions
are used to match character combinations in strings. A very common real-world example is when
websites verify whether the email address you entered is valid or not.

Any email address is a combination of three parts: the username, the domain, and the TLD (top-level domain).

Username

We need to match all the letters, numbers, and dots before the @ sign. The following regexes
will get us what we want:

A single letter or number — [A-Za-z0-9]

Multiple letters and numbers — [A-Za-z0-9]*

Anything before an @ sign — ()@

Combining these 3, we get ([A-Za-z0-9\.]*)@

Domain

We need to match all the letters and numbers after the @ and before the first dot. The following
regexes will get us what we want:

Anything after the @ — @()

Anything before a dot — ()\.

Multiple letters and numbers — [A-Za-z0-9]*

Combining these 3, we get @([A-Za-z0-9]*)\.

TLD

We need to match all the letters after the dot that follows the domain name. The following
regexes will get us what we want:

Anything after the domain name — regex-for-domain-name() = @[A-Za-z0-9]*\.

Multiple letters (and dots, for multi-part TLDs such as co.in) — [A-Za-z\.]*

Combining these 2, we get @[A-Za-z0-9]*\.([A-Za-z\.]*)
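Putting the three parts together, a minimal validation sketch in Python. This is an illustrative
pattern rather than a fully RFC-compliant validator; + is used instead of * so that each part must
be non-empty, and the sample addresses are hypothetical:

import re

# username@domain.tld, anchored so the whole string must match
email_pattern = r'^([A-Za-z0-9\.]+)@([A-Za-z0-9]+)\.([A-Za-z\.]+)$'

for address in ['user.name@example.com', 'not-an-email@@example.com']:
    match = re.match(email_pattern, address)
    print(address, '->', 'valid' if match else 'invalid')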

1c. What is N-gram tagging?

N-grams of texts are extensively used in text mining and natural language processing tasks.

NLTK's NgramTagger has three subclasses:

 UnigramTagger

 BigramTagger

 TrigramTagger

The BigramTagger subclass uses the previous tag as part of its context, while the TrigramTagger
subclass uses the previous two tags as part of its context. An n-gram is a subsequence of n items.

Idea behind the NgramTagger subclasses:

By looking at the previous words and their part-of-speech tags, the part-of-speech tag for the current
word can be guessed. Each tagger maintains a context dictionary (implemented in the ContextTagger
parent class), and this dictionary is used to guess the tag for the current word based on the context.

The context is some number of previous tagged words in the case of NgramTagger subclasses.

Code #1 : Working of Bigram tagger

# Loading Libraries
from nltk.tag import DefaultTagger
from nltk.tag import BigramTagger
from nltk.corpus import treebank

# initializing training and testing sets
train_data = treebank.tagged_sents()[:3000]
test_data = treebank.tagged_sents()[3000:]

# Tagging: train a bigram tagger on the tagged sentences
tag1 = BigramTagger(train_data)

# Evaluation: accuracy on the held-out test set
tag1.evaluate(test_data)

Output:

0.11318799913662854

Code #2 : Working of Trigram tagger

# Loading Libraries
from nltk.tag import DefaultTagger
from nltk.tag import TrigramTagger
from nltk.corpus import treebank

# initializing training and testing sets
train_data = treebank.tagged_sents()[:3000]
test_data = treebank.tagged_sents()[3000:]

# Tagging: train a trigram tagger on the tagged sentences
tag1 = TrigramTagger(train_data)

# Evaluation: accuracy on the held-out test set
tag1.evaluate(test_data)

Output:

0.06876753723289446
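The accuracies above are low because a bigram or trigram tagger used alone can only tag a word
when it has seen that exact context during training. A common remedy, sketched below under the
assumption of the same treebank split, is to chain the taggers with the backoff parameter so that
unseen contexts fall back to a simpler tagger:

from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger, TrigramTagger
from nltk.corpus import treebank

train_data = treebank.tagged_sents()[:3000]
test_data = treebank.tagged_sents()[3000:]

# Chain of taggers: trigram -> bigram -> unigram -> default 'NN'
backoff = DefaultTagger('NN')
backoff = UnigramTagger(train_data, backoff=backoff)
backoff = BigramTagger(train_data, backoff=backoff)
tagger = TrigramTagger(train_data, backoff=backoff)

# accuracy should now be far higher than either n-gram tagger alone
print(tagger.evaluate(test_data))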

1d. Define precision & recall, and explain with an example.

There is a huge farm filled with apple and orange trees. The owner of the farm wants to build a
classifier that will rightly predict apples and oranges so that he can categorize and sell them. The
owner builds a classifier and sends it a random sample of 13 fruits to classify.

He made a chart as below to check how well the model had performed:

                  Predicted apple   Predicted orange
Actual apple            5                  3
Actual orange           2                  3

True positives: These are the apples that the model rightly predicted as apples.

False positives: These are the oranges that the model wrongly predicted as apples.

False negatives: These are the apples that the model wrongly predicted as oranges.

True negatives: These are the oranges that the model rightly predicted as oranges.

From the chart we can draw the below inferences:

• Model classified 2 oranges as apples

• Model classified 3 apples as oranges

• Model classified 5 apples rightly

• Model classified 3 oranges rightly

The chart above also gives us a different insight into the model’s predictions:

• Out of 8 values that it classified as apples, only 5 are real apples, 3 are oranges.

• Out of 5 values that it classified as oranges, only 2 are real oranges, 3 are apples.

Now, let’s dive into precision and recall, using the chart above.

Precision:

It is the fraction of the model’s positive predictions that are actually right. In simpler words, it is:

Number of apples predicted correctly by the model / Total number of fruits the model predicted as
apples

It does not consider the false negatives, i.e., the apples that the model missed.

The formula for precision:

# of true positives/ (# of true positives + # of false positives)

Precision for apple predictor: 5/(5+2) = 5/7 = 0.714

Recall:

It is the fraction of the actual positive values that the model predicted correctly. In simpler
words, it is:

Number of apples predicted correctly by the model / Total number of apples

The total number of apples is the number of apples sent to the system, i.e., 8.

Unlike precision, it does consider the apples that the model missed (the false negatives). The
formula for recall:

# of true positives/(# of false negatives + # of true positives)

For the above example, it is: 5/(5+3) = 5/8 = 0.625

So, we know that the model created by the owner of the farm has higher precision than recall!
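A minimal sketch that reproduces these numbers in Python, using the counts from the chart above:

# Counts from the apple/orange example
tp = 5  # apples rightly predicted as apples
fp = 2  # oranges wrongly predicted as apples
fn = 3  # apples wrongly predicted as oranges

precision = tp / (tp + fp)  # 5/7
recall = tp / (tp + fn)     # 5/8

print(f"Precision: {precision:.3f}")  # 0.714
print(f"Recall: {recall:.3f}")        # 0.625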

2a. Write a Python program to count vowels and consonants in a string.

# Python Program to Count Vowels and Consonants in a String
str1 = input("Please Enter Your Own String : ")

vowels = 0
consonants = 0

for i in str1:
    if (i == 'a' or i == 'e' or i == 'i' or i == 'o' or i == 'u'
            or i == 'A' or i == 'E' or i == 'I' or i == 'O' or i == 'U'):
        vowels = vowels + 1
    elif i.isalpha():
        # count only letters as consonants, skipping
        # spaces, digits, and punctuation
        consonants = consonants + 1

print("Total Number of Vowels in this String = ", vowels)
print("Total Number of Consonants in this String = ", consonants)

2b. Write in detail about regular expressions for detecting word patterns.

Many linguistic processing tasks involve pattern matching. For example, we can find words ending
with ed using endswith('ed'). Regular expressions give us a more powerful and flexible method for
describing the character patterns we are interested in.

To use regular expressions in Python, we need to import the re library using: import re. Let’s find
words ending with ed using the regular expression «ed$». We will use the re.search(p, s) function
to check whether the pattern p can be found somewhere inside the string s. We need to specify the
characters of interest, and use the dollar sign, which has a special behavior in the context of regular
expressions in that it matches the end of the word:
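These examples assume a wordlist of English words; a minimal setup, as in the NLTK book, uses the
lowercase entries of the NLTK Words Corpus:

>>> import re
>>> import nltk
>>> wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]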

>>> [w for w in wordlist if re.search('ed$', w)]

['abaissed', 'abandoned', 'abased', 'abashed', 'abatised', 'abed', 'aborted', ...]

The . wildcard symbol matches any single character. Suppose we have room in a crossword puzzle
for an eight-letter word, with j as its third letter and t as its sixth letter. In place of each blank cell we
use a period:

>>> [w for w in wordlist if re.search('^..j..t..$', w)]

['abjectly', 'adjuster', 'dejected', 'dejectly', 'injector', 'majestic', ...

Finally, the ? symbol specifies that the previous character is optional. Thus «^e-?mail$» will match
both email and e-mail. We could count the total number of occurrences of this word (in either spelling)
in a text using sum(1 for w in text if re.search('^e-?mail$', w)).
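A small illustration of the optional ?, assuming a hypothetical token list named text:

>>> text = ['email', 'e-mail', 'mail', 'Email']
>>> sum(1 for w in text if re.search('^e-?mail$', w))
2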

3a. Write the implementation of stemming & lemmatization.

Stemming

Stemming generates the base word from the inflected word by removing the affixes of the word. It has
a set of pre-defined rules that govern the dropping of these affixes. It must be noted that stemmers
might not always result in semantically meaningful base words. Stemmers are faster and
computationally less expensive than lemmatizers.

from nltk.stem import PorterStemmer

# create an object of class PorterStemmer
porter = PorterStemmer()

print(porter.stem("Communication"))

Output:

commun

The stemmer reduces the word ‘communication’ to the base word ‘commun’, which is meaningless in
itself.

Lemmatization

Lemmatization involves grouping together the inflected forms of the same word. This way, we can
arrive at the base form of any word, and that base form will be meaningful in nature. The base form
here is called the lemma.

Lemmatizers are slower and computationally more expensive than stemmers.

9|Page
Example:

'play', 'plays', 'played', and 'playing' have 'play' as the lemma.

In Python, lemmatization can likewise be implemented in NLTK as follows:

from nltk.stem import WordNetLemmatizer

# create an object of class WordNetLemmatizer
# (the WordNet corpus is required: nltk.download('wordnet'))
lemmatizer = WordNetLemmatizer()

# 'v' tells the lemmatizer to treat each word as a verb
print(lemmatizer.lemmatize("plays", 'v'))
print(lemmatizer.lemmatize("played", 'v'))
print(lemmatizer.lemmatize("play", 'v'))
print(lemmatizer.lemmatize("playing", 'v'))

Output:

play
play
play
play
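A minimal side-by-side sketch contrasting the stemmer and the lemmatizer on the same words; the
word list is illustrative:

from nltk.stem import PorterStemmer, WordNetLemmatizer

porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# the stemmer chops affixes by rule; the lemmatizer maps each
# word to a dictionary form (here, treating each word as a verb)
for word in ['plays', 'played', 'playing', 'communication']:
    print(word, '->', porter.stem(word), '|', lemmatizer.lemmatize(word, 'v'))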

3b. Give the implementation of POS tagging.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# the stop-word list requires: nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Dummy text
txt = "Sukanya, Rajib and Naba are my good friends. " \
      "Sukanya is getting married next year. " \
      "Marriage is a big step in one’s life. " \
      "It is both exciting and frightening. " \
      "But friendship is a sacred bond between people. " \
      "It is a special kind of love between us. " \
      "Many of you must have tried searching for a friend " \
      "but never found the right one."

# sent_tokenize is one of the instances of
# PunktSentenceTokenizer from the nltk.tokenize.punkt module
tokenized = sent_tokenize(txt)

for i in tokenized:
    # word_tokenize is used to find the words
    # and punctuation in a string
    wordsList = nltk.word_tokenize(i)

    # removing stop words from wordsList
    wordsList = [w for w in wordsList if w not in stop_words]

    # using a tagger, which is a part-of-speech
    # tagger or POS-tagger
    tagged = nltk.pos_tag(wordsList)
    print(tagged)

Output:

[('Sukanya', 'NNP'), ('Rajib', 'NNP'), ('Naba', 'NNP'), ('good', 'JJ'), ('friends', 'NNS')]

[('Sukanya', 'NNP'), ('getting', 'VBG'), ('married', 'VBN'), ('next', 'JJ'), ('year', 'NN')]

[('Marriage', 'NN'), ('big', 'JJ'), ('step', 'NN'), ('one', 'CD'), ('’', 'NN'), ('life', 'NN')]

[('It', 'PRP'), ('exciting', 'VBG'), ('frightening', 'VBG')]

[('But', 'CC'), ('friendship', 'NN'), ('sacred', 'VBD'), ('bond', 'NN'), ('people', 'NNS')]

[('It', 'PRP'), ('special', 'JJ'), ('kind', 'NN'), ('love', 'VB'), ('us', 'PRP')]

[('Many', 'JJ'), ('must', 'MD'), ('tried', 'VB'), ('searching', 'VBG'), ('friend', 'NN'),

('never', 'RB'), ('found', 'VBD'), ('right', 'RB'), ('one', 'CD')]

4a. Give the framework for supervised classification.

Classification is the task of choosing the correct class label for a given input. In basic classification
tasks, each input is considered in isolation from all other inputs, and the set of labels is defined in
advance. Some examples of classification tasks are:

• Deciding whether an email is spam or not.

• Deciding what the topic of a news article is, from a fixed list of topic areas such as “sports,”
“technology,” and “politics.”

• Deciding whether a given occurrence of the word bank is used to refer to a river bank, a financial
institution, the act of tilting to the side, or the act of depositing something in a financial institution.

A classifier is called supervised if it is built based on training corpora containing the correct label for
each input. The framework used by supervised classification has two phases. During training, a
feature extractor converts each input value into a feature set, and these feature sets, paired with their
correct labels, are fed into a machine learning algorithm to generate a model. During prediction, the
same feature extractor converts unseen inputs into feature sets, which the model then uses to
generate predicted labels.
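A minimal sketch of this framework using NLTK's names corpus and a naive Bayes classifier, along
the lines of the NLTK book's gender-classification example; the one-feature extractor is a
deliberately simple assumption:

import random
import nltk
from nltk.corpus import names

# Feature extractor: map each input (a name) to a feature set
def gender_features(word):
    return {'last_letter': word[-1]}

# Build labelled data from the names corpus
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)

# Training phase: feature sets + labels -> model
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Prediction phase: same feature extractor, then classify
print(classifier.classify(gender_features('Neo')))
print(nltk.classify.accuracy(classifier, test_set))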

4b. Explain the Naive Bayes algorithm with suitable numerical examples in the context of text
analysis.
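A minimal numerical sketch of Naive Bayes for text classification; the toy corpus, word counts, and
vocabulary size below are hypothetical, chosen only to keep the arithmetic simple:

# Hypothetical training corpus: 3 'sports' documents and 1 'politics'
# document, so P(sports) = 3/4 and P(politics) = 1/4.
#
# Suppose the word 'match' appears 2 times among 8 total sports tokens
# and 0 times among 6 total politics tokens. With Laplace (add-one)
# smoothing over a vocabulary of 10 words:
vocab_size = 10

p_sports = 3 / 4
p_politics = 1 / 4

# P(word | class) = (count(word, class) + 1) / (tokens(class) + vocab_size)
p_match_given_sports = (2 + 1) / (8 + vocab_size)    # 3/18 ≈ 0.167
p_match_given_politics = (0 + 1) / (6 + vocab_size)  # 1/16 ≈ 0.063

# Naive Bayes score for a one-word document 'match':
score_sports = p_sports * p_match_given_sports        # ≈ 0.125
score_politics = p_politics * p_match_given_politics  # ≈ 0.016

# The classifier picks the class with the higher score: 'sports'
print('sports' if score_sports > score_politics else 'politics')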
