NLP Lab Manual - Lab Work
Natural Language Processing (University of Mumbai)

CHHATRAPATI SHIVAJI MAHARAJ INSTITUTE OF TECHNOLOGY

CERTIFICATE

This is to certify that MISS. KASHMIRA GANESH DALVI


Rollno: 12 Semester: VII Branch: COMPUTER ENGINEERING has
conducted all practical work of the session for the Subject: NATURAL
LANGUAGE PROCESSING (NLP) as a part of the academic requirements of the
University of Mumbai and has completed all exercises satisfactorily
during the academic year 2022-2023.

DATE :

SIGNATURE OF STUDENT LECTURER IN-CHARGE

INTERNAL EXAMINER HEAD OF DEPARTMENT

EXTERNAL EXAMINER PRINCIPAL

Downloaded by Dr.Deepa Yogish ([email protected])


lOMoARcPSD|32992122

CHHATRAPATI SHIVAJI MAHARAJ INSTITUTE OF TECHNOLOGY

INDEX

Academic year: 2022-23 Semester: VII Branch: COMPUTER


Rollno: 33 Subject: NATURAL LANGUAGE PROCESSING (NLP)
SR NO. | TITLE OF EXPERIMENT | PG. NO. | DATE OF PERFORMANCE | DATE OF SUBMISSION | REMARK | SIGN

1 To implement Tokenization of text.
2 To implement Stop word removal.
3 To implement Stemming of text.
4 To implement Lemmatization.
5 To implement N-gram model.
6 To implement POS tagging.
7 To implement Chunking.
8 To implement Named Entity Recognition.
9 MINI-PROJECT

SR NO. | TITLE | PG. NO. | DATE OF PERFORMANCE | DATE OF SUBMISSION | REMARK | SIGN
1 ASSIGNMENT NO:1
2 ASSIGNMENT NO:2
3 ASSIGNMENT NO:3
4 ASSIGNMENT NO:4

SIGNATURE OF STUDENT SIGNATURE OF STAFF


EXPERIMENT NO.1
AIM: To implement Tokenization of text.

RESOURCES REQUIRED: Python 3, NLTK toolkit, Text editor, 4 GB RAM and above, i5 processor and
above.

THEORY:
TOKENIZATION:
Given a character sequence and a defined document unit, tokenization is the task of chopping it up into
pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.
Here is an example of tokenization:
Input: Friends, Romans, Countrymen, lend me your ears;
Output: [Friends] [Romans] [Countrymen] [lend] [me] [your] [ears]
These tokens are often loosely referred to as terms or words, but it is sometimes important to make a
type/token distinction. A token is an instance of a sequence of characters in some particular document
that are grouped together as a useful semantic unit for processing. A type is the class of all tokens
containing the same character sequence. A term is a (perhaps normalized) type that is included in the IR
system's dictionary. The set of index terms could be entirely distinct from the tokens, for instance, they
could be semantic identifiers in a taxonomy, but in practice in modern IR systems they are strongly related
to the tokens in the document. However, rather than being exactly the tokens that appear in the
document, they are usually derived from them by various normalization processes.
The major question of the tokenization phase is what are the correct tokens to use? In this example, it
looks fairly trivial: you chop on whitespace and throw away punctuation characters. This is a starting point,
but even for English there are a number of tricky cases. For example, what do you do about the various
uses of the apostrophe for possession and contractions? Mr. O'Neill thinks that the boys' stories about
Chile's capital aren't amusing.
For O'Neill, which of the following is the desired tokenization?
neill, oneill, o'neill, o' neill, or o neill?
And for aren't, is it:
aren't, arent, are n't, or aren t?
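There is no single right answer; as an illustration (not part of the lab code), a minimal sketch comparing two NLTK tokenizers on this sentence:

import nltk
from nltk.tokenize import WordPunctTokenizer
# nltk.download('punkt')  # run once if the tokenizer model is missing

sentence = "Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing."

# Treebank-style tokenization keeps "O'Neill" together and splits "aren't" into "are" + "n't"
print(nltk.word_tokenize(sentence))

# Punctuation-based tokenization splits on every apostrophe, e.g. "O", "'", "Neill"
print(WordPunctTokenizer().tokenize(sentence))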

CHALLENGES IN TOKENIZATION
Challenges in tokenization depend on the type of language. Languages such as English and French are
referred to as space-delimited as most of the words are separated from each other by white spaces.
Languages such as Chinese and Thai are referred to as unsegmented as words do not have clear
boundaries. Tokenising unsegmented language sentences requires additional lexical and morphological

information. Tokenization is also affected by writing system and the typographical structure of the words.
Structures of languages can be grouped into three categories:
Isolating: Words do not divide into smaller units. Example: Mandarin Chinese
Agglutinative: Words divide into smaller units. Example: Japanese, Tamil
Inflectional: Boundaries between morphemes are not clear and ambiguous in terms of grammatical
meaning. Example: Latin.
CODE:
import nltk
# nltk.download('punkt')  # run once if the tokenizer model is missing

corpus = ("This is an exciting time to be working in speech and language processing. "
          "Historically distinct fields (natural language processing, speech recognition, "
          "computational linguistics, computational psycholinguistics) have begun to merge.")

tokens = nltk.word_tokenize(corpus)
print("Original corpus :\n", corpus, "\n")
print("Tokenized words :\n", tokens)

OUTPUT:

CONCLUSION:
The process of segmenting running text into words and sentences is called tokenization. Tokenization is a
basic pre-processing step in every NLP task. There are two types of tokenization: sentence tokenization and
word tokenization. Tokenization has been performed on a simple text corpus.
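Since the conclusion distinguishes sentence tokenization from word tokenization, a minimal sketch of sentence tokenization with NLTK (assuming the punkt model has been downloaded) is given below:

import nltk
# nltk.download('punkt')  # run once if the sentence tokenizer model is missing

corpus = ("This is an exciting time to be working in speech and language processing. "
          "Historically distinct fields have begun to merge.")

# Split the running text into sentences rather than words
sentences = nltk.sent_tokenize(corpus)
print("Sentences :\n", sentences)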

EXPERIMENT NO.2
AIM: To implement Stop word removal.

RESOURCES REQUIRED: Python 3, NLTK toolkit, Text editor, 4 GB RAM and above, i5 processor and
above

THEORY:
STOP WORD REMOVAL:
Stop words are the most common words in any natural language. For the purpose of analyzing text data
and building NLP models, these stop words might not add much value to the meaning of the document.
Consider this text string – “There is a pen on the table”. Now, the words “is”, “a”, “on” and “the” add no
meaning to the statement while parsing it. Whereas words like "there", "pen", and "table" are the
keywords and tell us what the statement is all about.
A basic list of stop words is given below:
a, an, the, is, are, was, were, be, been, to, of, in, on, at, for, with, and, but, or, he, she, it, they, this, that
Removing stop words is not a hard and fast rule in NLP. It depends upon the task that we are working on.
For tasks like text classification, where the text is to be classified into different categories, stop words are
removed or excluded from the given text so that more focus can be given to those words which define the
meaning of the text. A few key benefits of removing stop words:
On removing stop words, the dataset size decreases and the time to train the model also decreases.
Removing stop words can potentially help improve performance, as fewer and only meaningful tokens are
left; thus, it could increase classification accuracy.
Even search engines like Google remove stop words for fast and relevant retrieval of data from the database.
We can remove stop words while performing the following tasks:
Text Classification
Spam Filtering
Language Classification
Genre Classification
Caption Generation
Auto-Tag Generation

CODE:
import nltk
from nltk.corpus import stopwords
# nltk.download('stopwords')  # run once if the stop word list is missing

corpus = ("This is an exciting time to be working in speech and language processing. "
          "Historically distinct fields (natural language processing, speech recognition, "
          "computational linguistics, computational psycholinguistics) have begun to merge.")

tokens = nltk.word_tokenize(corpus)
print("Original corpus :\n", corpus, "\n")
print("Tokenized words :\n", tokens)

stop_words = set(stopwords.words("english"))
rel_words = [rel for rel in tokens if rel not in stop_words]
print("\nTokens without stop words :\n", rel_words)

OUTPUT:

CONCLUSION:
Stop word removal is a pre-processing task in natural language processing. Stop word removal is necessary
to improve analysis of the corpora in use. Stop word removal helps to understand relationships between
the elements of the text and extract features. Stop word removal has been performed on a simple text
corpus.
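Because removal is task-dependent, the default English list can also be extended with domain-specific words before filtering; a minimal sketch (the added words here are purely illustrative):

import nltk
from nltk.corpus import stopwords
# nltk.download('stopwords')  # run once if the stop word list is missing

# Start from NLTK's English list and add task-specific words of our own choosing
stop_words = set(stopwords.words("english"))
stop_words.update({"exciting", "historically"})

tokens = ["This", "is", "an", "exciting", "time"]
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)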

EXPERIMENT NO.3
AIM: To implement Stemming of text

RESOURCES REQUIRED: Python 3, NLTK toolkit, Text editor, 4 GB RAM and above, i5 processor and
above.

THEORY:

STEMMING:
Stemming is the process of reducing a word to its word stem by stripping affixes (suffixes and prefixes), or to
the root form of the word, known as a lemma. Stemming is important in natural language understanding
(NLU) and natural language processing (NLP).
Stemming is a part of linguistic studies in morphology and artificial intelligence (AI) information retrieval
and extraction. Stemming and AI knowledge extract meaningful information from vast sources like big data
or the Internet since additional forms of a word related to a subject may need to be searched to get the
best results.
Stemming is also a part of queries and Internet search engines.
Recognizing, searching and retrieving more forms of words returns more results. When a form of a word is
recognized it can make it possible to return search results that otherwise might have been missed. That
additional information retrieved is why stemming is integral to search queries and information retrieval.
When a new word is found, it can present new research opportunities. Often, the best results can be
attained by using the basic morphological form of the word: the lemma. To find the lemma, stemming is
performed by an individual or an algorithm, which may be used by an AI system. Stemming uses a number
of approaches to reduce a word to its base form from whatever inflected form is encountered.
It can be simple to develop a stemming algorithm. Some simple algorithms will simply strip recognized
prefixes and suffixes. However, these simple algorithms are prone to error. For example, an error can
reduce words like laziness to lazi instead of lazy. Such algorithms may also have difficulty with terms whose
inflectional forms don't perfectly mirror the lemma such as with saw and see.
Examples of stemming algorithms include:
Lookups in tables of inflected forms of words. This approach requires all inflected forms be listed.
Suffix stripping. Algorithms recognize known suffixes on inflected words and remove them.
PORTER STEMMER:
A consonant in a word is a letter other than A, E, I, O or U, and other than Y preceded by a consonant. (The
fact that the term consonant is defined to some extent in terms of itself does not make it ambiguous.) So in
TOY the consonants are T and Y, and in SYZYGY they are S, Z and G. If a letter is not a consonant it is a vowel.
A consonant will be denoted by c, a vowel by v. A list ccc... of length greater than 0 will be denoted by C,
and a list vvv... of length greater than 0 will be denoted by V. Any word, or part of a word, therefore has
one of the four forms:
CVCV ... C
CVCV ... V
VCVC ... C
VCVC ... V
These may all be represented by the single form
[C]VCVC ... [V]
where the square brackets denote arbitrary presence of their contents. Using (VC)^m to denote VC repeated
m times, this may again be written as
[C](VC)^m[V]
m will be called the measure of any word or word part when represented in this form. The case m = 0
covers the null word. Here are some examples:
m=0: TR, EE, TREE, Y, BY.
m=1: TROUBLE, OATS, TREES, IVY.
m=2: TROUBLES, PRIVATE, OATEN, ORRERY.
The rules for removing a suffix will be given in the form
(condition) S1 -> S2


This means that if a word ends with the suffix S1, and the stem before S1 satisfies the given condition, S1 is
replaced by S2. The condition is usually given in terms of m, e.g.
(m > 1) EMENT ->
Here S1 is 'EMENT' and S2 is null. This would map REPLACEMENT to REPLAC, since REPLAC is a word
part for which m = 2.
The 'condition' part may also contain the following:
*S - the stem ends with S (and similarly for the other letters).
*v* - the stem contains a vowel.
*d - the stem ends with a double consonant (e.g. -TT, -SS).
*o - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP).
And the condition part may also contain expressions with and, or and not, so that:
(m>1 and (*S or *T)) : tests for a stem with m>1 ending in S or T, while
(*d and not (*L or *S or *Z)) : tests for a stem ending with a double consonant other than L, S or Z.
Elaborate conditions like this are required only rarely.
In a set of rules written beneath each other, only one is obeyed, and this will be the one with the longest
matching
S1 for the given word. For example, with
SSES -> SS
IES -> I
SS -> SS
S ->
(here the conditions are all null) CARESSES maps to CARESS since SSES is the longest match for S1. Equally
CARESS maps to CARESS (S1=SS) and CARES to CARE (S1=S).
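The rules above can be checked directly against NLTK's Porter stemmer; a minimal sketch (the word list is chosen to match the examples in this section):

from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()
# Words illustrating the SSES/IES/S rules, the (m > 1) EMENT -> rule, and the "laziness" pitfall
for word in ["caresses", "ponies", "caress", "cats", "replacement", "laziness", "troubles"]:
    print(word, "->", porter.stem(word))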

CODE:
import nltk
from nltk.stem.porter import PorterStemmer

corpus = ("This is an exciting time to be working in speech and language processing. "
          "Historically distinct fields (natural language processing, speech recognition, "
          "computational linguistics, computational psycholinguistics) have begun to merge.")

tokens = nltk.word_tokenize(corpus)
print("Original corpus :\n", corpus, "\n")
print("Tokenized words :\n", tokens)

porter = PorterStemmer()
stem_words = [porter.stem(stem) for stem in tokens]
print("\nStemmed words :\n", stem_words)


OUTPUT:

CONCLUSION:
Stemming is a text pre-processing task used in natural language processing. Stemming is the process of
reducing words to their root form or stem. Stemming is useful to simplify text analysis in large corpora. A
very common Stemming algorithm is the Porter Stemmer algorithm which has been implemented using
the nltk toolkit.

EXPERIMENT NO.4
AIM: To implement Lemmatization

RESOURCES REQUIRED: Python 3, NLTK toolkit, Text editor, 4 GB RAM and above, i5 processor and
above.

THEORY:
Lemmatization:
Lemmatization is the process of grouping together the different inflected forms of a word so they can be
analysed as a single item. Lemmatization is similar to stemming, but it brings context to the words, so it
links words with similar meaning to one word.
Text pre-processing includes both stemming and lemmatization. Many times people find these two terms
confusing, and some treat them as the same. Actually, lemmatization is preferred over stemming because
lemmatization does morphological analysis of the words.
Applications of lemmatization are:
Used in comprehensive retrieval systems like search engines.
Used in compact indexing
Examples of lemmatization:
rocks : rock
corpora : corpus
better : good
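A mapping like better : good only comes out when the lemmatizer is told the part of speech; a minimal sketch of this (assuming the WordNet data has been downloaded):

from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet')  # run once if the WordNet data is missing

lemma = WordNetLemmatizer()
print(lemma.lemmatize("rocks"))              # noun by default -> rock
print(lemma.lemmatize("corpora"))            # -> corpus
print(lemma.lemmatize("better", pos="a"))    # as an adjective -> good
print(lemma.lemmatize("studying", pos="v"))  # as a verb -> study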

CODE:
import nltk
from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet')  # run once if the WordNet data is missing

corpus = "studies studying cries cry"
tokens = nltk.word_tokenize(corpus)
print("Original corpus :\n", corpus, "\n")
print("Tokenized words :\n", tokens)

lemma = WordNetLemmatizer()
lem_words = [lemma.lemmatize(lem) for lem in tokens]
print("\nLemmatized words :\n", lem_words)

OUTPUT:


CONCLUSION:
Lemmatization is a basic text pre-processing operation in many natural language processing tasks. It is
similar to stemming, but unlike stemming it does not simply truncate affixes from the word; instead it
reduces the inflected form to its lemma (dictionary form). Therefore, lemmatization generally provides a
better result when compared to stemming.

EXPERIMENT NO.5
AIM: To implement N-gram model.

RESOURCES REQUIRED: Python 3, NLTK toolkit, Text editor, 4 GB RAM and above, i5 processor and
above.


THEORY:
N-gram Model:
Statistical language models, in essence, are models that assign probabilities to sequences of words. Here
we'll look at the simplest model that assigns probabilities to sentences and sequences of words: the N-gram.
You can think of an N-gram as a sequence of N words; by that notion, a 2-gram (or bigram) is a two-word
sequence of words like "please turn", "turn your", or "your homework", and a 3-gram (or trigram) is a
three-word sequence of words like "please turn your" or "turn your homework".
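The bigrams and trigrams of the example phrase can be produced directly with NLTK; a minimal sketch:

from nltk import ngrams

words = "please turn your homework".split()
# 2-grams (bigrams) and 3-grams (trigrams) over the word sequence
print(list(ngrams(words, 2)))  # [('please', 'turn'), ('turn', 'your'), ('your', 'homework')]
print(list(ngrams(words, 3)))  # [('please', 'turn', 'your'), ('turn', 'your', 'homework')]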
Let's start with the quantity P(w|h), the probability of a word w given some history h. For example,
P(The | its water is so transparent that)
Here,
w = The
h = its water is so transparent that
And, one way to estimate the above probability is through the relative frequency count approach, where
you would take a substantially large corpus, count the number of times you see its water is so transparent
that, and then count the number of times it is followed by the. In other words, you are answering the
question: out of the times you saw the history h, how many times did the word w follow it?
P(The | its water is so transparent that) = C(its water is so transparent that The) / C(its water is so transparent that)

Now, you can imagine it is not feasible to perform this over an entire corpus, especially if it is of a significant
size. This shortcoming, and ways to decompose the probability function using the chain rule, serve as the
base intuition of the N-gram model. Here, instead of computing the probability using the entire history, you
approximate it by just a few historical words.
The Bigram Model:
As the name suggests, the bigram model approximates the probability of a word given all the previous
words by using only the conditional probability of the one preceding word. In other words, you approximate
P(the | its water is so transparent that) with the probability P(the | that). And so, when you use a bigram
model to predict the conditional probability of the next word, you are making the following approximation:
P(wn | w1 ... wn-1) ≈ P(wn | wn-1)
This assumption that the probability of a word depends only on the previous word is also known as the
Markov assumption. Markov models are the class of probabilistic models that assume we can predict the
probability of some future unit without looking too far into the past. You can further generalize the bigram
model to the trigram model, which looks two words into the past, and this can in turn be generalized to the
N-gram model. Now that we understand the underlying basis for N-gram models, you'd ask: how can we
estimate the probability function? One of the most straightforward and intuitive ways to do so is Maximum
Likelihood Estimation (MLE). For example, to compute a particular bigram probability of a word y given a
previous word x, you determine the count of the bigram C(xy) and normalize it by the sum of all the
bigrams that share the same first word x (which is simply the count of x):
P(y | x) = C(xy) / C(x)
There are, of course, challenges, as with every modeling approach and estimation method. Let's look at the
key ones affecting the N-gram model, as well as the use of MLE.
Sensitivity to the training corpus
The N-gram model, like many statistical models, is significantly dependent on the training corpus. As a
result, the probabilities often encode particular facts about a given training corpus. Besides, the
performance of the N-gram model varies with the change in the value of N. Moreover, you may have a
language task in which you know all the words that can occur, and hence we know the vocabulary size V in
advance. The closed vocabulary assumption assumes there are no unknown words, which is unlikely in
practical scenarios.
Smoothing

A notable problem with the MLE approach is sparse data. Meaning, any N-gram that appeared a sufficient
number of times might have a reasonable estimate for its probability. But because any corpus is limited,
some perfectly acceptable English word sequences are bound to be missing from it. As a result, the N-gram
matrix for any training corpus is bound to have a substantial number of cases of putative "zero probability
N-grams".
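One standard remedy (not part of the lab code below) is add-one, or Laplace, smoothing, where every bigram count is incremented by one before normalizing; a minimal sketch of that idea on a toy sentence:

from collections import Counter

tokens = "the cat sat on the mat".split()
vocab = set(tokens)
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_prob_add_one(w1, w2):
    # Add-one (Laplace) smoothing: (C(w1 w2) + 1) / (C(w1) + V)
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + len(vocab))

print(bigram_prob_add_one("the", "cat"))  # seen bigram
print(bigram_prob_add_one("the", "sat"))  # unseen bigram, still non-zero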

CODE:
# Bigram (N = 2) model with Maximum Likelihood Estimation, as described in the theory above
import nltk
from nltk import ngrams, FreqDist
# nltk.download('punkt')  # run once if the tokenizer model is missing

corpus = ("This is an exciting time to be working in speech and language processing. "
          "Historically distinct fields (natural language processing, speech recognition, "
          "computational linguistics, computational psycholinguistics) have begun to merge.")

tokens = nltk.word_tokenize(corpus)
print("Original corpus :\n", corpus)
print("\nTokenized words :\n", tokens)

bigrams = list(ngrams(tokens, 2))   # the bigrams (N = 2) of the token sequence
print("\nBigrams :\n", bigrams)
# MLE bigram probabilities: P(w2 | w1) = C(w1 w2) / C(w1)
unigram_counts = FreqDist(tokens)
bigram_counts = FreqDist(bigrams)
for (w1, w2), count in bigram_counts.items():
    print("P(%s | %s) = %.3f" % (w2, w1, count / unigram_counts[w1]))

OUTPUT:

CONCLUSION:
The N-gram model is a statistical language model with various applications in natural language processing,
such as spelling correction. The bigram model specifically has been studied in detail and has been
implemented.

EXPERIMENT 6
AIM: To implement POS tagging.

RESOURCES REQUIRED: Python 3, NLTK toolkit, Text editor, 4 GB RAM and above, i5 processor
and above.

THEORY:


Part of Speech (hereby referred to as POS) Tags are useful for building parse trees, which are used in
building NERs (most named entities are Nouns) and extracting relations between words. POS Tagging is
also essential for building lemmatizers which are used to reduce a word to its root form.
POS tagging is the process of marking up a word in a corpus with a corresponding part of speech tag, based
on its context and definition. This task is not straightforward, as a particular word may have a different part
of speech based on the context in which the word is used.
For example: In the sentence “Give me your answer”, answer is a Noun, but in the sentence “Answer the
question”, answer is a verb.
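This context dependence can be checked directly with nltk.pos_tag; a minimal sketch (assuming the averaged_perceptron_tagger model has been downloaded):

import nltk
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')  # run once

# Per the example above, "answer" is a noun in the first sentence and a verb in the second;
# the tagger's output can be compared against this expectation.
print(nltk.pos_tag(nltk.word_tokenize("Give me your answer")))
print(nltk.pos_tag(nltk.word_tokenize("Answer the question")))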
To understand the meaning of any sentence or to extract relationships and build a knowledge graph, POS
Tagging
is a very important step.
The Different POS Tagging Techniques
There are different techniques for POS Tagging:
• Lexical Based Methods — Assigns the POS tag the most frequently occurring with a word in the training
corpus.
• Rule-Based Methods — Assigns POS tags based on rules. For example, we can have a rule that says,
words ending with “ed” or “ing” must be assigned to a verb. Rule-Based Techniques can be used along
with Lexical Based approaches to allow POS Tagging of words that are not present in the training corpus
but are there in the testing data.
• Probabilistic Methods — This method assigns the POS tags based on the probability of a particular tag
sequence occurring. Conditional Random Fields (CRFs) and Hidden Markov Models (HMMs) are
probabilistic approaches to assign a POS Tag.
• Deep Learning Methods — Recurrent Neural Networks can also be used for POS tagging.
Steps Involved:
Tokenize the text (word_tokenize).
Apply pos_tag to the output of the above step, that is, nltk.pos_tag(tokenized_text).

CODE:
from nltk import pos_tag
# nltk.download('averaged_perceptron_tagger')  # run once if the tagger model is missing

text = ("This is an exciting time to be working in speech and language processing. "
        "Historically distinct fields (natural language processing, speech recognition, "
        "computational linguistics, computational psycholinguistics) have begun to merge.")

tokens = text.split()
print("Tokenized words :\n", tokens)
tagged_words = pos_tag(tokens)
print("\nPOS tagged words :\n", tagged_words)

OUTPUT:


CONCLUSION:
Parts of speech tagging is the process of assigning a word in a corpus a word class. Parts of speech tagging
has numerous uses such as in Named Entity Recognition. Parts of Speech tagging has been carefully
studied and implemented on a text corpus.

EXPERIMENT NO 7
AIM: To implement Chunking.

RESOURCES REQUIRED: Python 3, NLTK toolkit, Text editor, 4 GB RAM and above, i5 processor and
above

THEORY:
Chunking:
Chunking is used to add more structure to the sentence following part of speech (POS) tagging. It is also
known as shallow parsing. The resulting groups of words are called "chunks." In shallow parsing, there is at
most one level between roots and leaves, while deep parsing comprises more than one level. Shallow
parsing is also called light parsing or chunking. The primary usage of chunking is to make groups of "noun
phrases." The parts of speech are combined with regular expressions.
Rules for Chunking:
There are no pre-defined rules, but you can combine them according to need and requirement.
For example, suppose you need to tag nouns, verbs (past tense), adjectives, and coordinating conjunctions
from the sentence.
You can use the rule as below
chunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}
Use Case of Chunking
Chunking is used for entity detection. An entity is that part of the sentence from which the machine gets the
value for an intention.
Example: "Temperature of New York." Here Temperature is the intention and New York is an entity.
In other words, chunking is used for selecting subsets of tokens. Please follow the code below to understand
how chunking is used to select the tokens. In this example, you will see the graph corresponding to a chunk
of a noun phrase. We will write the code and draw the graph for better understanding.
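A shorter, self-contained illustration of a noun-phrase chunk grammar (the sentence and the NP rule here are the classic NLTK book example, not the lab corpus):

import nltk

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# NP chunk: an optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunker = nltk.RegexpParser(grammar)
print(chunker.parse(sentence))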

CODE:
from nltk import pos_tag
from nltk import RegexpParser
# nltk.download('averaged_perceptron_tagger')  # run once if the tagger model is missing

corpus = ("This is an exciting time to be working in speech and language processing. "
          "Historically distinct fields (natural language processing, speech recognition, "
          "computational linguistics, computational psycholinguistics) have begun to merge.")

tokens = corpus.split()
print("Original corpus :\n", corpus)
print("\nSplit Text :\n", tokens)

tokens_tag = pos_tag(tokens)
print("\nPOS tagging :\n", tokens_tag)

# Chunk rule from the theory above: nouns, past-tense verbs, adjectives, and an optional conjunction
patterns = """mychunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}"""
chunker = RegexpParser(patterns)
print("\nAfter Regex :\n", chunker)

output = chunker.parse(tokens_tag)
print("\nChunked Text :\n", output)

OUTPUT:


CONCLUSION:
Chunking is the process of grouping tagged tokens into phrases such as noun phrases. It is applied after
parts of speech tagging. Chunking has been studied and implemented on a text corpus.

EXPERIMENT NO.8
AIM: To implement Named Entity Recognition


RESOURCES REQUIRED: Python 3, NLTK toolkit, Text editor, 4 GB RAM and above, i5 processor and
above

THEORY:
Named Entity Recognition:
Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction)
is a sub-task of information extraction that seeks to locate and classify named entities in text into pre-
defined
categories such as the names of persons, organizations, locations, expressions of times, quantities,
monetary
values, percentages, etc. NER systems have been created that use linguistic grammar-based techniques as
well as statistical models such as machine learning. Hand-crafted grammar-based systems typically obtain
better precision, but at the cost of lower recall and months of work by experienced computational
linguists. Statistical NER systems typically require a large amount of manually annotated training data.
Semi-supervised approaches have been suggested to avoid part of the annotation effort. Named Entity
Recognition has a wide range of applications in the field of Natural Language Processing and Information
Retrieval. A few such examples are listed below:
Automatically Summarizing Resumes
Optimizing Search Engine Algorithms
Powering Recommender Systems
Now that we have explained NLP, we can describe how Named Entity Recognition works. NER plays a major
role in the semantic part of NLP, which extracts the meaning of words, sentences and their relationships.
Basic NER processes structured and unstructured texts by identifying and locating entities. For example,
instead of identifying "Steve" and "Jobs" as different entities, NER understands that "Steve Jobs" is a single
entity. More developed NER processes can classify identified entities as well; in this case, NER not only
identifies but classifies "Steve Jobs" as a person.
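NER is also available in NLTK itself (the lab code below uses spaCy); a minimal NLTK sketch, assuming the maxent_ne_chunker and words resources have been downloaded:

import nltk
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
# nltk.download('maxent_ne_chunker'); nltk.download('words')

sentence = "Steve Jobs founded Apple in California."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)

# ne_chunk groups tagged tokens into named-entity subtrees (PERSON, ORGANIZATION, GPE, ...)
tree = nltk.ne_chunk(tagged)
print(tree)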

CODE:
# commands to run before the code (in a notebook cell)
! pip install spacy
! pip install nltk
! python -m spacy download en_core_web_sm

# imports, and load spaCy's English language package
import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')

# Load the text and process it
# (text copied from the Python wiki)
text = ("Python is an interpreted, high-level and general-purpose programming language. "
        "Python's design philosophy emphasizes code readability with "
        "its notable use of significant indentation. "
        "Its language constructs and object-oriented approach aim to "
        "help programmers write clear and "
        "logical code for small and large-scale projects.")
# text2 = # copy the paragraphs from https://www.python.org/doc/essays/

doc = nlp(text)
# doc2 = nlp(text2)

# sentence segmentation
sentences = list(doc.sents)
print(sentences)

# tokenization
for token in doc:
    print(token.text)

# print the named entities found by spaCy
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)

# now we use the displacy function on the doc (jupyter=True renders inline in a notebook)
displacy.render(doc, style='ent', jupyter=True)

OUTPUT:


CONCLUSION:
Named Entity Recognition is a technique in natural language processing used to extract real-world entity
names from text corpora. NER can be used to extract names of people, locations, organizations, etc. NER
has been carefully studied and implemented.
