
Name :

Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA

Experiment No. : 1

AIM : Study various applications of NLP and formulate the Problem Statement
for Mini Project based on chosen real world NLP applications.

PROBLEM STATEMENT : The mini-project focuses on grammatical error correction (GEC) systems. Grammar is the study of words and how they can be combined to construct sentences; it can also cover a word's pronunciation, meaning and linguistic background, in addition to a language's inflections and rules of word formation. Grammatical errors can cause numerous communication problems, including ones that negatively affect both personal and professional interactions. GEC systems work to fix such errors in text; Grammarly is one example of such a grammar checker. Correcting grammatical and typographical errors can raise the quality of writing in chats, blogs and emails.

Team Members :
1. Sanika S. Bhatye (Roll Number : 14)
2. Nachiket S. Gaikwad (Roll Number : 35)
3. Priyanka A. Gupta (Roll Number : 45)

Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA

Experiment No. : 2

AIM : Program on Preprocessing of Text (Tokenization, Filtration, Script Validation).

THEORY : Text preprocessing is traditionally an important step for Natural Language Processing (NLP) tasks. It transforms text into a more digestible form so that machine learning algorithms can perform better. The process of converting data into something a computer can understand is referred to as pre-processing. The following is a list of common text preprocessing steps:

• Remove HTML tags.
• Remove extra whitespaces.
• Convert accented characters to ASCII characters.
• Expand contractions.
• Remove special characters.
• Lowercase all texts.
• Convert number words to numeric form.
• Remove numbers.
• Remove stop words.
• Lemmatization.
• Stemming.
• Script validation, etc.

Tokenization : Given a character sequence and a defined document unit, tokenization is the
task of chopping it up into pieces, called tokens. Tokenization is the act of breaking up a
sequence of strings into pieces such as words, keywords, phrases, symbols and other
elements called tokens. Tokens can be individual words, phrases or even whole sentences. In
the process of tokenization, some characters like punctuation marks are discarded.

Filtration : Many of the words used in a sentence are insignificant and carry little meaning.
For example – English is a subject.
Here, 'English' and 'subject' are the most significant words, while 'is' and 'a' are almost useless: 'English subject' and 'subject English' convey the same meaning even after we remove the insignificant words ('is', 'a').
Using NLTK, we can remove the insignificant words by looking at their part-of-speech tags. For that, we have to decide which part-of-speech tags are significant, as in the sketch below.
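As an illustration of this kind of filtration, the following is a minimal sketch (our own example, assuming the punkt and averaged_perceptron_tagger NLTK resources have been downloaded) that keeps only the tags we treat as significant, here nouns and adjectives:

import nltk
from nltk.tokenize import word_tokenize

sentence = "English is a subject."
tokens = word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)            # list of (word, tag) pairs

# keep only the part-of-speech tags we decide are significant
significant_tags = ('NN', 'NNS', 'NNP', 'NNPS', 'JJ')
filtered = [word for word, tag in tagged if tag in significant_tags]
print(filtered)                          # ['English', 'subject']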

Steps: Tokenization
a. In order to get started, we need the NLTK module, as well as Python.
b. Download the latest version of Python if you are on Windows. If you are on Linux, you should be able to run apt-get install python3 (on macOS, use the official installer or Homebrew).
c. Next, we need NLTK 3. The easiest method of installing the NLTK module is with pip. For all users, that is done by opening up cmd.exe, bash, or whatever shell you use and typing: pip install nltk
d. Next, we need to install some of the components for NLTK.

Open python via whatever means you normally do, and type:

import nltk
nltk.download()

Unless you are operating headless, a GUI will pop up like this, only probably with red instead of green:

Choose to download "all" for all packages, and then click 'download.' This will give you all of
the tokenizers, chunkers, other algorithms, and all of the corpora. If space is an issue, you can
elect to selectively download everything manually. The NLTK module will take up about 7MB,
and the entire nltk_data directory will take up about 1.8GB, which includes your chunkers,
parsers, and the corpora. If you are operating headless, like on a VPS, you can install
everything by running Python and doing:

import nltk
nltk.download()

d (for download)
all (to download everything)

Now that you have all the things that you need, let's knock out some quick vocabulary:

Corpus - Body of text, singular. Corpora is the plural of this.


Example: A collection of medical journals.

Lexicon - Words and their meanings.
Example: English dictionary.
Consider, however, that various fields will have different lexicons. For example, to a financial investor, the first meaning of the word "Bull" is someone who is confident about the market, whereas in the common English lexicon, the first meaning of the word "Bull" is an animal. As such, there is a special lexicon for financial investors, doctors, children, mechanics, and so on.

Token - Each "entity" that is a part of whatever was split up based on rules.
For examples, each word is a token when a sentence is "tokenized" into
words. Each sentence can also be a token, if you tokenized the sentences
out of a paragraph.

These are the words you will most commonly hear upon entering the Natural
Language Processing (NLP) space. With that, let's show an example of how one
might actually tokenize something into tokens with the NLTK module.

from nltk.tokenize import sent_tokenize, word_tokenize

EXAMPLE_TEXT = "Hello Mr. Smith, how are you doing today? The
weather is great, and Python is awesome. The sky is pinkish-
blue. You shouldn't eat cardboard."

print(sent_tokenize(EXAMPLE_TEXT))

The above code will output the sentences, split up into a list of sentences, which
you can do things like iterate through with a for loop.

['Hello Mr. Smith, how are you doing today?', 'The weather is great, and Python is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard."]

So there, we have created tokens, which are sentences. Let's tokenize by word instead this time:

print(word_tokenize(EXAMPLE_TEXT))

Now our output is:

['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard', '.']
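The AIM also lists script validation, which the tutorial text above does not cover. A minimal sketch of one possible approach (our own assumption, using Python's built-in unicodedata module) is to check that every alphabetic character of a token belongs to the expected script, for example Latin or Devanagari:

import unicodedata

def is_in_script(token, script_name="LATIN"):
    # a token is valid if every alphabetic character's Unicode name mentions the script
    for ch in token:
        if ch.isalpha() and script_name not in unicodedata.name(ch, ""):
            return False
    return True

tokens = ['Hello', 'Mr.', 'Smith', 'नमस्ते']
print([t for t in tokens if is_in_script(t, "LATIN")])        # ['Hello', 'Mr.', 'Smith']
print([t for t in tokens if is_in_script(t, "DEVANAGARI")])   # ['नमस्ते']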

CONCLUSION : Thus, we have successfully performed an experiment on pre-processing of text.

OUTPUT :

Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA

Experiment No. : 3

AIM : Apply various other text preprocessing techniques for any given text :
StopWord Removal, Lemmatization / Stemming.

THEORY :
Stop Word Removal : One of the major forms of pre-processing is to filter out
useless data. In NLP, useless words (data) are referred to as stop words.

What are Stop words?

Stop Words: A stop word is a commonly used word (such as "the", "a", "an", "in") that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.

Stemming : Stemming is a kind of normalization for words. Normalization is a technique in which a set of words in a sentence is converted into a sequence to shorten its lookup. Words which have the same meaning but some variation according to the context or sentence are normalized.

In other words, there is one root word, but there are many variations of the same word. For example, the root word is "eat" and its variations are "eats", "eating", "eaten" and so on. In the same way, with the help of stemming, we can find the root word of any variation.

Lemmatization : Lemmatization is a text normalization technique used in Natural Language Processing (NLP). It has been studied for a very long time, and lemmatization algorithms have been developed since the 1960s. Essentially, lemmatization is a technique that switches any kind of word to its base root form. Lemmatization is responsible for grouping different inflected forms of a word into its root form, which has the same meaning.

Steps: Stop word removal.

We can do this easily by storing a list of words that we consider to be stop words. NLTK starts you off with a list of words that it considers to be stop words, which you can access via the NLTK corpus with:

from nltk.corpus import stopwords

Here is the list:

>>> set(stopwords.words('english'))

{'ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during',
'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its',
'yours', 'such',
'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him', 'each',
'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through',
'don', 'nor',
'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while', 'above',
'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before',
'them', 'same',
'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 'what',
'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself',
'has', 'just',
'where', 'too', 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 't', 'being', 'if',
'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here',
'than'}

Here is how we might incorporate using the stop_words set to remove the stopwords from
your text:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

# one-line version using a list comprehension
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# equivalent loop version
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)

Our output here:

['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']

Steps : Stemming

First, we're going to grab and define our stemmer:

from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()

Now, let's choose some words with a similar stem, like:

example_words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]

Next, we can easily stem by doing something like:

for w in example_words:
print(ps.stem(w))
Our output:

python
python
python
python
pythonli

Now let's try stemming a typical sentence, rather than some words:

new_text = "It is important to by very pythonly while you are


pythoning with python. All pythoners have pythoned poorly at
least once."
words = word_tokenize(new_text)
for w in words:
print(ps.stem(w))

Now our result is:

It
is
import
to
by
veri
pythonli
while
you
are
python
with
python
.
All
python
have
python
poorli
at
least
onc
.

Steps : Lemmatization

A very similar operation to stemming is called lemmatizing. The major difference between the two is that, as you saw earlier, stemming can often create non-existent words, whereas lemmas are actual words.

So, your root stem, meaning the word you end up with, is not necessarily something you can look up in a dictionary, but you can look up a lemma.

Sometimes you will wind up with a very similar word, but sometimes, you will
wind up with a completely different word. Let's see some examples.

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run",'v'))

OUTPUT :
Stopwords Removal -

Stemming -

Lemmatization -

CONCLUSION : Thus, we have successfully performed an experiment on various text processing techniques.

Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA

Experiment No. : 4

AIM : Program to demonstrate Morphological Analysis.

THEORY : Morphology is the study of the structure and formation of words. Its most important unit is the morpheme, which is defined as the "minimal unit of meaning".

In linguistics, morphology refers to the mental system involved in word formation, or to the branch of linguistics that deals with words, their internal structure, and how they are formed. Morphological analysis is essential for various automatic natural language processing applications.

Consider a word like "unhappiness". This has three parts: the prefix un, the root happy, and the suffix ness.
There are three morphemes, each carrying a certain amount of meaning. un
means "not", while ness means "being in a state or condition". Happy is a free
morpheme because it can appear on its own (as a "word" in its own right). Bound
morphemes have to be attached to a free morpheme, and so cannot be words
in their own right. Thus, you cannot have sentences in English such as "Jason
feels very un ness today".

Inflection:
Inflection is the process of changing the form of a word so that it expresses
information such as number, person, case, gender, tense, mood and aspect, but
the syntactic category of the word remains unchanged. As an example, the plural
form of the noun in English is usually formed from the singular form by adding
an s.

• car / cars
• table / tables
• dog / dogs

In each of these cases, the syntactic category of the word remains unchanged.

Derivation:
As was seen above, inflection does not change the syntactic category of a word.
Derivation does change the category. Linguists classify derivation in English
according to whether or not it induces a change of pronunciation. For instance,
adding the suffix ity changes the pronunciation of the root of active so the stress
is on the second syllable: activity. The addition of the suffix al to approve doesn't
change the pronunciation of the root: approval.
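The code and results for this experiment are shown in the screenshots below. As a minimal illustrative sketch (our own example, not necessarily the exact code in the screenshots, and assuming the WordNet corpus is available), inflection can be stripped with NLTK's WordNetLemmatizer, while a stemmer exposes the shared root behind a derivation:

from nltk.stem import WordNetLemmatizer, PorterStemmer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# Inflection: the syntactic category stays the same (plural noun -> singular noun)
print(lemmatizer.lemmatize("cars"))      # car
print(lemmatizer.lemmatize("tables"))    # table

# Derivation: 'activity' is derived from 'active'; stemming reveals the common root
print(stemmer.stem("active"), stemmer.stem("activity"))   # activ activ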

Code POS tagging :

Result :

Code TextSimilar() :

Result :

Code Stemming :

Result :

Code Stemming :

Result :

Code Lemmatization :

Result :

CONCLUSION : Hence, we have successfully implemented the program to demonstrate Morphological Analysis.

Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA

Experiment No. : 5

AIM : Program to implement N-gram model.

THEORY :

N – Grams : The general idea is that we can look at each pair (or triple, set of
four, etc.) of words that occur next to each other. In a sufficiently-large corpus,
we are likely to see "the red" and "red apple" several times, but less likely to see
"apple red" and "red the". This is useful to know if, for example, we are trying to
figure out what someone is more likely to say to help decide between possible
output for an automatic speech recognition system. These co-occurring words
are known as "n-grams", where "n" is a number saying how long a string of
words we considered. (Unigrams are single words, bigrams are two words,
trigrams are three words, 4-grams are four words, 5-grams are five words, etc.)
In particular, NLTK has an ngrams function that returns a generator of n-grams given a tokenized sentence.

An n-gram tagger is a generalization of a unigram tagger whose context is the
current word together with the part-of-speech tags of the n-1 preceding tokens.
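The notebook output is shown in the screenshots below. As a minimal sketch of our own (assuming the punkt tokenizer is installed), unigrams, bigrams and trigrams can be generated with NLTK's ngrams function like this:

from nltk.tokenize import word_tokenize
from nltk.util import ngrams

text = "the red apple fell from the tree"
tokens = word_tokenize(text)

unigrams = list(ngrams(tokens, 1))
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))

print(bigrams)
# [('the', 'red'), ('red', 'apple'), ('apple', 'fell'), ('fell', 'from'), ('from', 'the'), ('the', 'tree')]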

Generating Unigrams :

Result:

Generating Bigrams :

Result:

Generating Trigrams :

Result:

CONCLUSION : Hence, we have successfully implemented the program to demonstrate the N-gram model.

Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA

Experiment No. : 6

AIM : Program to implement POS tagging.

THEORY : Tagging is a kind of classification that may be defined as the automatic assignment of descriptors to tokens. Here, the descriptor is called a tag, which may represent part-of-speech information, semantic information and so on. PoS tagging may be defined as the process of assigning one of the parts of speech to a given word.

Rule-based POS Tagging : One of the oldest techniques of tagging is rule-based POS tagging. Rule-based taggers use a dictionary or lexicon to get the possible tags for each word. If a word has more than one possible tag, rule-based taggers use hand-written rules to identify the correct one. Disambiguation can also be performed in rule-based tagging by analyzing the linguistic features of a word along with its preceding and following words. For example, if the preceding word is an article, then the word in question must be a noun. As the name suggests, all such information in rule-based POS tagging is coded in the form of rules. These rules may be either:

• Context-pattern rules, or
• Regular expressions compiled into finite-state automata, intersected with a lexically ambiguous sentence representation.
We can also understand rule-based POS tagging through its two-stage architecture:
• First stage − uses a dictionary to assign each word a list of potential parts of speech.
• Second stage − uses large lists of hand-written disambiguation rules to narrow the list down to a single part of speech for each word.

Properties of Rule-Based POS Tagging : Rule-based POS taggers possess the following properties −
• These taggers are knowledge-driven taggers.
• The rules in rule-based POS tagging are built manually.
• The information is coded in the form of rules.
• There is a limited number of rules, approximately around 1000.
• Smoothing and language modeling are defined explicitly in rule-based taggers.

Stochastic POS Tagging : Another technique of tagging is stochastic POS tagging. A model that includes frequency or probability (statistics) can be called stochastic. Any number of different approaches to the problem of part-of-speech tagging can be referred to as stochastic tagging. The simplest stochastic taggers apply the following approaches to POS tagging:

Word Frequency Approach - In this approach, the stochastic tagger disambiguates words based on the probability that a word occurs with a particular tag. In other words, the tag encountered most frequently with the word in the training set is the one assigned to an ambiguous instance of that word. The main issue with this approach is that it may yield inadmissible sequences of tags.

Tag Sequence Probabilities - It is another approach of stochastic tagging, where
the tagger calculates the probability of a given sequence of tags occurring. It is
also called n-gram approach. It is called so because the best tag for a given word
is determined by the probability at which it occurs with the n previous tags.

Properties of Stochastic POS Tagging :

Stochastic POS taggers possess the following properties −
• This POS tagging is based on the probability of a tag occurring.
• It requires a training corpus.
• There is no probability for words that do not exist in the corpus.
• It uses a different testing corpus (other than the training corpus).
• It is the simplest form of POS tagging because it chooses the most frequent tag associated with a word in the training corpus.

Transformation-based Tagging : Transformation-based tagging is also called Brill tagging. It is an instance of transformation-based learning (TBL), a rule-based algorithm for automatically tagging POS in a given text. TBL allows us to have linguistic knowledge in a readable form, and it transforms one state into another by using transformation rules. It draws inspiration from both of the previously explained taggers − rule-based and stochastic. If we look at the similarity between a rule-based tagger and a transformation tagger, then, like a rule-based tagger, it is based on rules that specify which tags need to be assigned to which words. On the other hand, if we look at the similarity between a stochastic tagger and a transformation tagger, then, like a stochastic tagger, it is a machine learning technique in which rules are automatically induced from data.

Working of Transformation-Based Learning (TBL) : In order to understand the working and concept of transformation-based taggers, we need to understand the working of transformation-based learning. Consider the following steps to understand the working of TBL −
• Start with the solution − TBL usually starts with some solution to the problem and works in cycles.
• Most beneficial transformation chosen − in each cycle, TBL will choose the most beneficial transformation.
• Apply to the problem − the transformation chosen in the last step is applied to the problem.
• The algorithm stops when the transformation selected in step 2 no longer adds value or there are no more transformations to select. Such learning is best suited to classification tasks.

One of the more powerful aspects of the NLTK module is the Part of
Speech tagging that it can do for you. This means labeling words in a
sentence as nouns, adjectives, verbs, etc. Even more impressive, it also
labels by tense, and more.
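The code and result for this experiment are shown as screenshots below; a minimal sketch of POS tagging with NLTK (our own illustration, assuming the averaged_perceptron_tagger resource is installed) looks like this:

import nltk
from nltk.tokenize import word_tokenize

sentence = "NLTK makes part of speech tagging easy."
tokens = word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# e.g. [('NLTK', 'NNP'), ('makes', 'VBZ'), ('part', 'NN'), ('of', 'IN'),
#       ('speech', 'NN'), ('tagging', 'NN'), ('easy', 'JJ'), ('.', '.')]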

CODE :

RESULT :

CONCLUSION : Hence, we have successfully implemented the program to demonstrate PoS tagging.

Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA

Experiment No. : 7

AIM : Program to implement Chunking.

THEORY : Text chunking, also referred to as shallow parsing, is a task that follows part-of-speech tagging and adds more structure to the sentence. The result is a grouping of the words into "chunks". Chunk extraction, or partial parsing, is the process of extracting meaningful short phrases from a sentence that has been tagged with parts of speech. Chunks are made up of words, and the kinds of words are defined using the part-of-speech tags. A chunking activity involves breaking down a difficult text into more manageable pieces and having students rewrite these "chunks" in their own words. Now that we know the parts of speech, we can do what is called chunking, and group words into hopefully meaningful chunks. One of the main goals of chunking is to group words into what are known as "noun phrases". These are phrases of one or more words that contain a noun, maybe some descriptive words, maybe a verb, and maybe something like an adverb. The idea is to group nouns with the words that are in relation to them. In order to chunk, we combine the part-of-speech tags with regular expressions. From regular expressions, we are mainly going to utilize the following:

+ = match 1 or more
? = match 0 or 1 repetitions.
* = match 0 or MORE repetitions
. = Any character except a new line

The last thing to note is that the part-of-speech tags are denoted with "<" and ">", and we can also place regular expressions within the tags themselves, to account for things like "all nouns" (<N.*>).

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
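The remainder of the program, whose output appears in the screenshot below, follows the usual NLTK chunking pattern; a sketch of that missing portion (assuming it mirrors the standard tutorial) is:

tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            chunked.draw()
    except Exception as e:
        print(str(e))

process_content()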

The result of this is something like:

The main line here in question is:

chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""

This line, broken down:

<RB.?>* = "0 or more of any tense of adverb," followed by:

<VB.?>* = "0 or more of any tense of verb," followed by:

<NNP>+ = "One or more proper nouns," followed by

<NN>? = "zero or one singular noun."

Try playing around with combinations to group various instances until you feel comfortable with chunking. If you print the chunks out, you are going to see output like:

Cool, that helps us visually, but what if we want to access this data via our program? Well, what is happening here is that our "chunked" variable is an NLTK tree. Each "chunk" and "non-chunk" is a "subtree" of the tree. We can reference these by doing something like chunked.subtrees(). We can then iterate through these subtrees like so:

for subtree in chunked.subtrees():
    print(subtree)

Next, we might only be interested in getting just the chunks, ignoring the rest. We can use the filter parameter in the chunked.subtrees() call.

for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
    print(subtree)

Now, we're filtering to only show the subtrees with the label of
"Chunk." Keep in mind, this isn't "Chunk" as in the NLTK chunk
attribute... this is "Chunk" literally because that's the label we gave it
here:

chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""

Had we instead said something like chunkGram = r"""Pythons: {<RB.?>*<VB.?>*<NNP>+<NN>?}""", then we would filter by the label "Pythons". The result here should be something like:

RESULT :

CONCLUSION : Hence, we have successfully implemented the experiment on
Chunking.

Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA

Experiment No. : 8

AIM : Program to implement Named Entity Recognition.

THEORY : In any text document, there are particular terms that represent specific entities that are more informative and have a unique context. These entities are known as named entities, which more specifically refer to terms that represent real-world objects like people, places and organizations, and which are often denoted by proper names. A naive approach could be to find these by looking at the noun phrases in text documents. Named entity recognition (NER), also known as entity chunking/extraction, is a popular technique used in information extraction to identify and segment named entities and classify or categorize them under various predefined classes. One of the major forms of chunking in NLP is called "Named Entity Recognition". The idea is to have the machine immediately be able to pull out "entities" like people, places, things, locations, monetary figures, and more. This can be a bit of a challenge, but NLTK has this built in for us. There are two major options with NLTK's named entity recognition: either recognize all named entities, or recognize named entities as their respective type, like people, places, locations, etc.

Here's an example:

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged, binary=True)
            namedEnt.draw()
    except Exception as e:
        print(str(e))

process_content()

Here, with the option of binary = True, this means either something
is a named entity, or not.
The result is:

If you set binary = False, then the result is:

Immediately, you can see a few things. When Binary is False, it
picked up the same things, but wound up splitting up terms like
White House into "White" and "House" as if they were different,
whereas we could see in the binary = True option, the named entity
recognition was correct to say White House was part of the same
named entity. Depending on your goals, you may use the binary
option how you see fit. Here are the types of Named Entities that
you can get if you have binary as false:
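(The screenshot below lists them; for reference, the NE types commonly cited for NLTK are ORGANIZATION, PERSON, LOCATION, DATE, TIME, MONEY, PERCENT, FACILITY and GPE, i.e. geo-political entity.)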

RESULT :

Binary = true

Binary = false

CONCLUSION : Hence, we have successfully implemented Named Entity Recognition.

CONCLUSION : Thus, we have successfully implemented EDA.

Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA

Experiment No. :

AIM : Case study on applications of NLP.

THEORY :

Topic: NLP in Healthcare.


NLP illustrates the ways in which artificial intelligence systems gather and assess unstructured data from human language to extract patterns, derive meaning, and compose feedback. This is helping the healthcare industry make the best use of unstructured data. The technology enables providers to automate administrative work, invest more time in taking care of patients, and enrich the patient experience using real-time data.

Best Use Cases of NLP in Healthcare :

1. Clinical Documentation.
NLP-based clinical documentation helps free clinicians from the laborious manual data entry of EHRs and permits them to invest more time in the patient; this is how NLP can help doctors. Both speech-to-text dictation and formulated data entry have been a blessing. Vendors such as Nuance and M*Modal offer technology that combines speech recognition with formalised vocabularies to capture structured data at the point of care for future use. NLP technologies bring out relevant data from speech recognition equipment, which will considerably improve the analytical data used to run value-based care (VBC) and population health management (PHM) efforts, leading to better outcomes for clinicians. In the future, NLP tools will be applied to various public data sets and social media to determine Social Determinants of Health (SDOH) and the usefulness of wellness-based policies.

2. Speech Recognition.
NLP has matured its use case in speech recognition over the years by allowing clinicians to transcribe notes for useful EHR data entry. Front-end speech recognition lets physicians dictate notes at the point of care instead of typing them, while back-end technology works to detect and correct any errors in the transcription before passing it on for human proofing.
The market is almost saturated with speech recognition technologies, but a few start-ups are disrupting the space with deep learning algorithms in mining applications, uncovering more extensive possibilities.

3. Computer-Assisted Coding (CAC).

CAC captures data on procedures and treatments to identify every possible code and maximise claims. It is one of the most popular uses of NLP, but unfortunately its adoption rate is just 30%. It has improved the speed of coding but fell short on accuracy.

4. Data Mining Research.


The integration of data mining in healthcare systems allows organizations to reduce the levels
of subjectivity in decision-making and provide useful medical know-how. Once started, data
mining can become a cyclic technology for knowledge discovery, which can help any HCO
create a good business strategy to deliver better care to patients.

5. Automated Registry Reporting.

Another NLP use case is extracting the values needed for registry reporting. Many health IT systems are burdened by regulatory reporting when measures such as ejection fraction are not stored as discrete values. For automated reporting, health systems have to identify when an ejection fraction is documented as part of a note, and save each value in a form that can be utilized by the organization's analytics platform for automated registry reporting.

How can Healthcare Organizations leverage NLP?


Healthcare organizations can use NLP to transform the way they deliver care and manage
solutions. Organizations can use machine learning in healthcare to improve provider
workflows and patient outcomes.

Implementing Predictive Analytics in Healthcare :

Identification of high-risk patients, as well as improvement of the diagnosis process, can be achieved by deploying predictive analytics along with Natural Language Processing in healthcare.
It is vital for emergency departments to have complete data quickly at hand. For example, a delay in diagnosing Kawasaki disease leads to critical complications if the disease is missed or mistreated in any way. As reported in scientific results, an NLP-based algorithm identified at-risk patients of Kawasaki disease with a sensitivity of 93.6% and a specificity of 77.5% compared to the manual review of clinicians' notes.
A group of researchers from France worked on developing another NLP-based algorithm that would monitor, detect and prevent hospital-acquired infections (HAI) among patients. NLP helped in rendering unstructured data, which was then used to identify early signs and alert clinicians accordingly.
Similarly, another experiment was carried out to automate identification as well as risk prediction for heart failure patients who were already hospitalized. Natural Language Processing was implemented to analyse free-text reports from the last 24 hours and predict the patient's risk of hospital readmission and mortality over a period of 30 days. At the end of the successful experiment, the algorithm performed better than expected and the model's overall positive predictive value stood at 97.45%.
The benefits of deploying NLP can be applied to other areas of interest and a myriad of
algorithms can be deployed to pick out and predict specified conditions amongst patients.
Even though the healthcare industry at large still needs to refine its data capabilities prior to
deploying NLP tools, it still has a massive potential to significantly improve care delivery as
well as streamline workflows. Down the line, Natural Language Processing and other ML tools
will be the key to superior clinical decision support & patient health outcomes.

CONCLUSION : Thus, we have successfully curated a case study on the applications of NLP.

Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :

Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)

Submitted to : PROF. NAZIA SULTHANA

Experiment No. :

AIM : Mini-project based on a real-life application of Natural Language Processing.

THEORY :

Title: GRAMMATICAL ERROR CORRECTION (GEC).

Abstract: Grammatical Error Correction (GEC) systems aim to correct grammatical mistakes in text. Grammarly is an example of such a grammar correction product. Error correction can improve the quality of written text in emails, blogs and chats. The GEC task can be thought of as a sequence-to-sequence task in which a Transformer model is trained to take an ungrammatical sentence as input and return a grammatically correct sentence.
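For example (a made-up illustration, not a pair taken from the dataset), the model would be expected to map an input such as "He are moving here." to the corrected output "He is moving here.".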

Implementation:

1. Dataset:
For the training of our Grammar Corrector, we have used the C4_200M dataset recently released by Google. This dataset consists of roughly 200 million examples of synthetically generated grammatical corruptions along with the corresponding correct text.

One of the biggest challenges in GEC is getting a good variety of data that simulates
the errors typically made in written language. If the corruptions are random, then they
would not be representative of the distribution of errors encountered in real use
cases.

To generate the corruption, a tagged corruption model is first trained. This model is
trained on existing datasets by taking as input a clean text and generating a corrupted
text. This is represented in the figure below:

For the C4_200M dataset, the authors first determined the distribution of the relative types of errors encountered in written language. When generating the corruptions, the corruption model was conditioned on the type of error. As shown in the figure below, the corruption model was conditioned to generate a determiner-type error.

This allows the C4_200M dataset to have a diverse set of errors reflecting their relative
frequency in real-world applications. For the purpose of this project, we extracted
550K sentences from C4_200M. The C4_200M dataset is available on TF datasets. We
extracted the sentences we needed and saved them as a CSV.

2. Model Training:
T5 is a text-to-text model, meaning it can be trained to go from input text of one format to output text of another. The model can be used for many different objectives, such as summarization and text classification, and it can even be used to build a trivia bot that retrieves answers from memory without any provided context.

T5 is preferred for a lot of tasks for a few reasons :
1. Can be used for any text-to-text task.
2. Good accuracy on downstream tasks after fine-tuning.

Steps:

1. Tokenizing the data


We set the incorrect sentence as the input and the corrected text as the label. Both
the inputs and targets are tokenized using the T5 tokenizer. The max length is set to
64 since most of the inputs in C4_200M are sentences and the assumption is that this
model will also be used on sentences.

2. Training the model using the Seq2Seq trainer class

We use the Seq2Seq trainer class in Hugging Face to instantiate the model, and we enable logging to Weights & Biases (wandb). Using Weights & Biases with Hugging Face is very simple: all that needs to be done is to set report_to="wandb" in the training arguments.

3. Monitoring and evaluating the data


We have used the Rouge score as the metric for evaluating the model. As seen in the
plots below from W&B, the model gets to a rouge score of 72 after 1 epoch of training.

Code:

from datasets import load_dataset


from tqdm import tqdm
import argparse
import glob
import os
import json
import time
import logging
import random
import re
from itertools import chain
from string import punctuation

import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
import pandas as pd

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

from transformers import (


AdamW,
T5ForConditionalGeneration,
T5Tokenizer,
get_linear_schedule_with_warmup
)

import random
import numpy as np
import torch
import datasets

def set_seed(seed):
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

set_seed(42)
from transformers import (
T5ForConditionalGeneration, T5Tokenizer,
Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq
)

from torch.utils.data import Dataset, DataLoader


model_name = 't5-base'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

def calc_token_len(example):
return len(tokenizer(example).input_ids)
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.10, shuffle=True)
train_df.shape, test_df.shape

from torch.utils.data import Dataset, DataLoader


class GrammarDataset(Dataset):
    def __init__(self, dataset, tokenizer, print_text=False):
        self.dataset = dataset
        self.pad_to_max_length = False
        self.tokenizer = tokenizer
        self.print_text = print_text
        self.max_len = 64

    def __len__(self):
        return len(self.dataset)

    def tokenize_data(self, example):
        input_, target_ = example['input'], example['output']

        # tokenize inputs
        tokenized_inputs = tokenizer(input_, pad_to_max_length=self.pad_to_max_length,
                                     max_length=self.max_len,
                                     return_attention_mask=True)

        tokenized_targets = tokenizer(target_, pad_to_max_length=self.pad_to_max_length,
                                      max_length=self.max_len,
                                      return_attention_mask=True)

        inputs = {"input_ids": tokenized_inputs['input_ids'],
                  "attention_mask": tokenized_inputs['attention_mask'],
                  "labels": tokenized_targets['input_ids']
                  }

        return inputs

    def __getitem__(self, index):
        inputs = self.tokenize_data(self.dataset[index])

        if self.print_text:
            for k in inputs.keys():
                print(k, len(inputs[k]))

        return inputs

from datasets import load_metric

rouge_metric = load_metric("rouge")

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding='longest', return_tensors='pt')

# defining training related arguments
batch_size = 16
args = Seq2SeqTrainingArguments(
    output_dir="/content/drive/MyDrive/c4_200m/weights",
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    learning_rate=2e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    save_total_limit=2,
    predict_with_generate=True,
    fp16=True,
    gradient_accumulation_steps=6,
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,
    logging_dir="/logs",
    report_to="wandb")

import nltk
nltk.download('punkt')
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

    result = rouge_metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}

    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    return {k: round(v, 4) for k, v in result.items()}

# defining trainer using huggingface
trainer = Seq2SeqTrainer(model=model,
                         args=args,
                         train_dataset=GrammarDataset(train_dataset, tokenizer),
                         eval_dataset=GrammarDataset(test_dataset, tokenizer),
                         tokenizer=tokenizer,
                         data_collator=data_collator,
                         compute_metrics=compute_metrics)

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = 'deep-learning-analytics/GrammarCorrector'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name).to(torch_device)

def correct_grammar(input_text, num_return_sequences):
    batch = tokenizer([input_text], truncation=True, padding='max_length',
                      max_length=64, return_tensors="pt").to(torch_device)
    translated = model.generate(**batch, max_length=64, num_beams=4,
                                num_return_sequences=num_return_sequences, temperature=1.5)
    tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
    return tgt_text
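A hypothetical usage example of the function above (the exact corrections returned depend on the model and generation settings):

input_text = "He are moving here."
print(correct_grammar(input_text, num_return_sequences=2))
# e.g. ['He is moving here.', 'He is moving here now.']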

Output:

Applications:

1. Can be used for Grammar error correction specific applications like Grammarly.
2. Can be implemented in paraphrasing software and applications.
3. Can be included in document or content writing software like Microsoft Word, LibreOffice and Google Docs.

Results:

By fine-tuning the T5 Transformer for Grammar Error Correction and training it on the 550K sentences extracted from C4_200M, we achieved a Rouge score of 80%.

Conclusion:
In this project, we proposed a new strategy for the Grammar Error Correction system based
on Deep Learning, and the experimental results show that the proposed method is effective.
It makes full use of the advantages of Deep Learning.
