NLP Manual (1-12)
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA
Experiment No. : 1
AIM : Study various applications of NLP and formulate the Problem Statement
for Mini Project based on chosen real world NLP applications.
Team Members :
1. Sanika S. Bhatye (Roll Number : 14)
2. Nachiket S. Gaikwad (Roll Number : 35)
3. Priyanka A. Gupta (Roll Number : 45)
Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA
Experiment No. : 2
Tokenization : Given a character sequence and a defined document unit, tokenization is the
task of chopping it up into pieces called tokens. In other words, it is the act of breaking up a
sequence of strings into elements such as words, keywords, phrases and symbols. Tokens can
be individual words, phrases or even whole sentences. In the process of tokenization, some
characters such as punctuation marks are discarded.
Filtration : Many of the words used in a phrase are insignificant and hold little meaning.
For example – English is a subject.
Here, ‘English’ and ‘subject’ are the most significant words, while ‘is’ and ‘a’ are almost
useless: the phrase carries the same meaning even if we remove the insignificant words
(‘is’, ‘a’).
Using nltk, we can remove the insignificant words by looking at their part-of-speech tags.
For that, we have to decide which part-of-speech tags are significant.
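A minimal sketch of this idea (assuming NLTK is installed as described in the steps below; the choice of “significant” tags here is only an illustration):

import nltk

sentence = "English is a subject."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# keep only nouns and adjectives; determiners and auxiliary verbs are dropped
significant_tags = {"NN", "NNS", "NNP", "NNPS", "JJ"}
filtered = [word for word, tag in tagged if tag in significant_tags]
print(filtered)   # expected to keep 'English' and 'subject'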
Steps: Tokenization
a. In order to get started, we need the NLTK module, as well as Python.
b. Download the latest version of Python if you are on Windows. If you are on Mac or
Linux, you should be able to run apt-get install python3.
c. Next, we need NLTK 3. The easiest way to install the NLTK module is with pip. For all
users, that is done by opening up cmd.exe, bash, or whatever shell you use and
typing: pip install nltk
d. Next, we need to install some of the components for NLTK.
Open python via whatever means you normally do, and type:
import nltk
nltk.download()
Unless you are operating headless, a GUI will pop up like this, only probably with
red instead of green:
Choose to download "all" for all packages, and then click 'download.' This will give you all of
the tokenizers, chunkers, other algorithms, and all of the corpora. If space is an issue, you can
elect to selectively download everything manually. The NLTK module will take up about 7MB,
and the entire nltk_data directory will take up about 1.8GB, which includes your chunkers,
parsers, and the corpora. If you are operating headless, like on a VPS, you can install
everything by running Python and doing:
import nltk
nltk.download()
d (for download), then all (to download everything)
Now that you have all the things that you need, let's knock out some quick vocabulary:
Lexicon - Words and their meanings.
Example: English dictionary.
Consider, however, that various fields will have different lexicons. For example: To
a financial investor, the first meaning for the word "Bull" is someone who is confident
about the market, as compared to the common English lexicon, where the first
meaning for the word "Bull" is an animal. As such, there is a special lexicon for
financial investors, doctors, children, mechanics, and so on.
Token - Each "entity" that is a part of whatever was split up based on rules.
For example, each word is a token when a sentence is "tokenized" into
words. Each sentence can also be a token, if you tokenized the sentences
out of a paragraph.
These are the words you will most commonly hear upon entering the Natural
Language Processing (NLP) space. With that, let's show an example of how one
might actually tokenize something into tokens with the NLTK module.
from nltk.tokenize import sent_tokenize, word_tokenize

EXAMPLE_TEXT = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard."

print(sent_tokenize(EXAMPLE_TEXT))
The above code will output the sentences, split up into a list that you can, for
example, iterate through with a for loop.
So there, we have created tokens, which are sentences. Let's tokenize by word instead this time:
print(word_tokenize(EXAMPLE_TEXT))
Now our output is:
['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather',
'is', 'great', ',', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue',
'.', 'You', 'should', "n't", 'eat', 'cardboard', '.']
OUTPUT :
Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA
Experiment No. : 3
AIM : Apply various other text preprocessing techniques for any given text :
Stop Word Removal, Lemmatization / Stemming.
THEORY :
Stop Word Removal : One of the major forms of pre-processing is to filter out
useless data. In NLP, such useless words (data) are referred to as stop words.
Stemming : For many words there is one root word but many variations of it. For
example, the root word is "eat" and its variations are "eats", "eating", "eaten"
and so on. With the help of stemming, we can find the root word of any of these
variations.
Lemmatization : Lemmatization is a text normalization technique used in Natural Language
Processing (NLP). It has been studied for a very long time and lemmatization algorithms have
been made since the 1960s. Essentially, lemmatization is a technique that switches any
kind of word to its base root form. Lemmatization is responsible for grouping different
inflected forms of a word into its root form, which has the same meaning.
We can do this easily by storing a list of words that we consider to be stop words.
NLTK starts you off with a bunch of words that it considers to be stop words; you
can access the list via the NLTK corpus with:
>>> from nltk.corpus import stopwords
>>> set(stopwords.words('english'))
{'ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during',
'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours',
'such', 'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from',
'him', 'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his',
'through', 'don', 'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should',
'our', 'their', 'while', 'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when',
'at', 'any', 'before', 'them', 'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does',
'yourselves', 'then', 'that', 'because', 'what', 'over', 'why', 'so', 'can', 'did', 'not',
'now', 'under', 'he', 'you', 'herself', 'has', 'just', 'where', 'too', 'only', 'myself',
'which', 'those', 'i', 'after', 'few', 'whom', 't', 'being', 'if', 'theirs', 'my', 'against',
'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than'}
Here is how we might use the stop_words set to remove the stop words from our text:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# example sentence (any text will do)
example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)

filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)
Steps : Stemming
from nltk.stem import PorterStemmer

ps = PorterStemmer()
# example words sharing the root "python" (chosen to match the output below)
example_words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]
Next, we can easily stem by doing something like:
for w in example_words:
print(ps.stem(w))
Our output:
python
python
python
python
pythonli
Now let's try stemming a typical sentence, rather than some words:
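The code for this step is a sketch; the example sentence is reconstructed from the stemmed output shown below:

from nltk.tokenize import word_tokenize

new_text = ("It is important to by very pythonly while you are pythoning with python. "
            "All pythoners have pythoned poorly at least once.")

for w in word_tokenize(new_text):
    print(ps.stem(w))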
It
is
import
to
by
veri
pythonli
while
you
are
python
with
python
.
All
python
have
python
poorli
at
least
onc
.
Steps : Lemmatization
So, your root stem, meaning the word you end up with, is not something you can
just look up in a dictionary, but you can look up a lemma.
Sometimes you will wind up with a very similar word, but sometimes, you will
wind up with a completely different word. Let's see some examples.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run",'v'))
OUTPUT :
Stopwords Removal -
Stemming -
Lemmatization -
Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA
Experiment No. : 4
THEORY : Morphology is the study of the structure and formation of words. Its
most important unit is the morpheme, which is defined as the "minimal unit of
meaning".
In the word "unhappiness", there are three morphemes, each carrying a certain
amount of meaning: un means "not", while ness means "being in a state or
condition". Happy is a free morpheme because it can appear on its own (as a
"word" in its own right). Bound morphemes have to be attached to a free
morpheme, and so cannot be words in their own right. Thus, you cannot have
sentences in English such as "Jason feels very un ness today".
Inflection:
Inflection is the process of changing the form of a word so that it expresses
information such as number, person, case, gender, tense, mood and aspect, but
the syntactic category of the word remains unchanged. As an example, the plural
form of the noun in English is usually formed from the singular form by adding
an s.
• car / cars
• table / tables
• dog / dogs
In each of these cases, the syntactic category of the word remains unchanged.
Derivation:
As was seen above, inflection does not change the syntactic category of a word.
Derivation does change the category. Linguists classify derivation in English
according to whether or not it induces a change of pronunciation. For instance,
adding the suffix ity changes the pronunciation of the root of active so the stress
is on the second syllable: activity. The addition of the suffix al to approve doesn't
change the pronunciation of the root: approval.
Code POS tagging :
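A minimal sketch of POS tagging with NLTK (the example sentence reuses the one from the Filtration section of Experiment 2):

import nltk

# tag a short sentence with Penn Treebank part-of-speech tags
words = nltk.word_tokenize("English is a subject.")
print(nltk.pos_tag(words))
# prints a list of (word, tag) pairs such as ('subject', 'NN')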
Result :
Code TextSimilar() :
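A minimal sketch, assuming this heading refers to NLTK's Text.similar() method, which prints words that occur in contexts similar to a given word (the corpus and the query word are arbitrary choices):

import nltk
from nltk.corpus import gutenberg

text = nltk.Text(gutenberg.words('austen-emma.txt'))
# print words that appear in contexts similar to 'happy'
text.similar('happy')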
Result :
Code Stemming :
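A minimal sketch of stemming with the Porter stemmer, using example words tied to the inflection and derivation theory above:

from nltk.stem import PorterStemmer

ps = PorterStemmer()
# stemming strips inflectional endings and some derivational suffixes
for w in ["cars", "tables", "activity", "approval"]:
    print(w, "->", ps.stem(w))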
Result :
Code Stemming :
Result :
Code Lemmatization :
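A minimal sketch of lemmatization with WordNet, showing how inflected forms map back to their dictionary form:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars"))            # car
print(lemmatizer.lemmatize("tables"))          # table
print(lemmatizer.lemmatize("better", pos="a")) # good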
Result :
Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA
Experiment No. : 5
THEORY :
N – Grams : The general idea is that we can look at each pair (or triple, set of
four, etc.) of words that occur next to each other. In a sufficiently-large corpus,
we are likely to see "the red" and "red apple" several times, but less likely to see
"apple red" and "red the". This is useful to know if, for example, we are trying to
figure out what someone is more likely to say to help decide between possible
output for an automatic speech recognition system. These co-occurring words
are known as "n-grams", where "n" is a number saying how long a string of
words we considered. (Unigrams are single words, bigrams are two words,
trigrams are three words, 4-grams are four words, 5-grams are five words, etc.)
In particular, nltk has the n-grams function that returns a generator of n-grams
given a tokenized sentence.
An n-gram tagger is a generalization of a unigram tagger whose context is the
current word together with the part-of-speech tags of the n-1 preceding tokens.
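For example, a minimal sketch using nltk's ngrams utility on a hypothetical sentence:

from nltk import ngrams
from nltk.tokenize import word_tokenize

tokens = word_tokenize("the red apple fell from the tree")

# n = 1, 2, 3 give unigrams, bigrams and trigrams respectively
print(list(ngrams(tokens, 1)))
print(list(ngrams(tokens, 2)))
print(list(ngrams(tokens, 3)))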
Generating Unigrams :
Result:
Generating Bigrams :
Result:
Generating Trigrams :
Result:
Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA
Experiment No. : 6
Rule-based POS tagging uses hand-written rules to assign and disambiguate
part-of-speech tags. These rules may be either context-pattern rules, or regular
expressions compiled into finite-state automata, intersected with a lexically
ambiguous sentence representation.
We can also understand rule-based POS tagging through its two-stage architecture −
First stage − Uses a dictionary to assign each word a list of potential parts-of-speech.
Second stage − Uses large lists of hand-written disambiguation rules to narrow the
list down to a single part-of-speech for each word.
Properties of Rule-Based POS Tagging : Rule-based POS taggers possess the
following properties −
These taggers are knowledge-driven taggers.
The rules in rule-based POS tagging are built manually.
The information is coded in the form of rules.
The number of rules is limited, to approximately 1000.
Smoothing and language modeling are defined explicitly in rule-based taggers.
Tag Sequence Probabilities - This is another approach to stochastic tagging, where
the tagger calculates the probability of a given sequence of tags occurring. It is also
called the n-gram approach, because the best tag for a given word is determined by
the probability with which it occurs with the n previous tags.
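As a sketch of this n-gram approach (the training split and the test sentence are arbitrary choices):

import nltk
from nltk.corpus import treebank

# train a bigram tagger with unigram and default backoffs on tagged Treebank sentences
train_sents = treebank.tagged_sents()[:3000]
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)

print(t2.tag(nltk.word_tokenize("The quick brown fox jumps over the lazy dog")))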
Working of transformation-based learning (TBL) −
Start with a solution − TBL usually starts with some solution to the problem and
works in cycles.
Most beneficial transformation chosen − In each cycle, TBL will choose the most
beneficial transformation.
Apply to the problem − The transformation chosen in the last step will be applied
to the problem.
The algorithm stops when the transformation selected in step 2 no longer adds
value, or when there are no more transformations to be selected. This kind of
learning is best suited to classification tasks.
One of the more powerful aspects of the NLTK module is the Part of
Speech tagging that it can do for you. This means labeling words in a
sentence as nouns, adjectives, verbs, etc. Even more impressive, it also
labels by tense, and more.
CODE :
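A sketch of part-of-speech tagging with NLTK along the lines described above (the State of the Union corpus files and the [:5] slice are illustrative choices):

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

# train the Punkt sentence tokenizer on one speech and apply it to another
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content()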
RESULT :
Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA
Experiment No. : 7
+ = match 1 or more
? = match 0 or 1 repetitions
* = match 0 or more repetitions
. = any character except a new line
The last thing to note is that the part-of-speech tags are denoted with "<" and ">",
and we can also place regular expressions within the tags themselves, to account
for things like "all nouns" (<N.*>).
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
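The remaining steps are sketched below; the chunk grammar used here (adverbs and verbs followed by one or more proper nouns and an optional noun) is an assumption:

tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)

            # chunk grammar: zero or more adverbs and verbs,
            # one or more proper nouns, and an optional common noun
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            chunked.draw()
    except Exception as e:
        print(str(e))

process_content()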
The main line here in question is the chunk grammar (repeated from the sketch above):
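chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""   # as used in the sketch above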
Cool, that helps us visually, but what if we want to access this data via our program?
Well, what is happening here is that our "chunked" variable is an NLTK tree. Each
"chunk" and "non-chunk" is a "subtree" of the tree. We can reference these by doing
something like chunked.subtrees(). We can then iterate through these subtrees like
so:
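A minimal sketch of this iteration (the filter on the "Chunk" label matches the explanation below):

for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
    print(subtree)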
Now, we're filtering to only show the subtrees with the label of "Chunk". Keep in
mind, this isn't "Chunk" as in the NLTK chunk attribute... it is "Chunk" literally,
because that's the label we gave it in the chunk grammar above.
RESULT :
CONCLUSION : Hence, we have successfully implemented the experiment on
Chunking.
Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA
Experiment No. : 8
THEORY : In any text document, there are particular terms that represent
specific entities that are more informative and have a unique context. These
entities are known as named entities, which more specifically refer to terms that
represent real-world objects like people, places, organizations, and so on, which
are often denoted by proper names. A naive approach could be to find these by
looking at the noun phrases in text documents. Named entity recognition (NER),
also known as entity chunking/extraction, is a popular technique used in
information extraction to identify and segment the named entities and classify
or categorize them under various predefined classes. One of the major forms of
chunking in NLP is called "Named Entity Recognition". The idea is to have the machine
immediately be able to pull out "entities" like people, places, things, locations,
monetary figures, and more. This can be a bit of a challenge, but NLTK has this built
in for us. There are two major options with NLTK's named entity recognition: either
recognize all named entities, or recognize named entities as their respective type,
like people, places, locations, etc.
Here's an example:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged, binary=True)
            namedEnt.draw()
    except Exception as e:
        print(str(e))

process_content()
Here, with the option binary=True, something is classified simply as either a named
entity or not.
The result is:
Immediately, you can see a few things. When binary is False, it picked up the same
things, but wound up splitting terms like "White House" into "White" and "House"
as if they were separate, whereas with the binary=True option the named entity
recognition correctly treated "White House" as part of the same named entity.
Depending on your goals, you may use the binary option as you see fit. Here are
the types of named entities that you can get if you have binary set to False:
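For reference, the entity types listed in the NLTK documentation include: ORGANIZATION, PERSON, LOCATION, DATE, TIME, MONEY, PERCENT, FACILITY and GPE (geo-political entity).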
RESULT :
Binary = true
Binary = false
CONCLUSION : Thus, we have successfully implemented the experiment on Named Entity Recognition.
Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA
Experiment No. :
THEORY :
1. Clinical Documentation.
NLP-driven clinical documentation helps free clinicians from the laborious manual
side of EHR systems and permits them to invest more time in the patient; this is how
NLP can help doctors. Both speech-to-text dictation and formulated data entry have
been a blessing.
Nuance and M*Modal provide technology that combines speech recognition with
formalised vocabularies to capture structured data at the point of care and store it
for future use.
NLP technologies extract relevant data from speech recognition output, which will
considerably enrich the analytical data used to run value-based care (VBC) and
population health management (PHM) efforts, with better outcomes for clinicians.
In the coming years, NLP tools will also be applied to various public data sets and
social media to determine Social Determinants of Health (SDOH) and the usefulness
of wellness-based policies.
2. Speech Recognition.
NLP has matured its use case in speech recognition over the years by allowing clinicians to
transcribe notes for useful EHR data entry. Front-end speech recognition lets
physicians dictate notes directly at the point of care, while back-end technology
works to detect and correct any errors in the transcription before passing it on for
human proofing.
The market is almost saturated with speech recognition technologies, but a few start-ups are
disrupting the space with deep learning algorithms in mining applications, uncovering more
extensive possibilities.
Implementing Predictive Analytics in Healthcare :
CONCLUSION : Thus, we have successfully curated a case study on the applications of NLP.
Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Experiment No. :
THEORY :
Abstract: Grammatical Error Correction (GEC) systems aim to correct grammatical mistakes
in the text. Grammarly is an example of such a grammar correction product. Error correction
can improve the quality of written text in emails, blogs and chats. The GEC task can be
thought of as a sequence-to-sequence task in which a Transformer model is trained to
take an ungrammatical sentence as input and return a grammatically correct sentence.
Implementation:
1. Dataset:
For the training of our Grammar Corrector, we have used the C4_200M dataset
recently released by Google. This dataset consists of roughly 200 million examples of
synthetically generated grammatical corruptions along with the corresponding correct
text.
One of the biggest challenges in GEC is getting a good variety of data that simulates
the errors typically made in written language. If the corruptions are random, then they
would not be representative of the distribution of errors encountered in real use
cases.
To generate the corruption, a tagged corruption model is first trained. This model is
trained on existing datasets by taking as input a clean text and generating a corrupted
text. This is represented in the figure below:
For the C4_200M dataset, the authors first determined the distribution of the relative
types of errors encountered in written language. When generating the corruptions,
the corruption model was conditioned on the type of error; as shown in the figure
below, it could, for example, be conditioned to generate a determiner-type error.
This allows the C4_200M dataset to have a diverse set of errors reflecting their relative
frequency in real-world applications. For the purpose of this project, we extracted
550K sentences from C4_200M. The C4_200M dataset is available on TF datasets. We
extracted the sentences we needed and saved them as a CSV.
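A minimal sketch of loading the extracted sentences back into a DataFrame; the file name and the 'input' / 'output' column names here are assumptions about how the CSV was saved:

import pandas as pd

# load the extracted C4_200M sentence pairs (corrupted text -> corrected text)
df = pd.read_csv("c4_200m_550k.csv")
print(df.shape)
print(df.head())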
2. Model Training:
T5 is a text-to-text model, meaning it can be trained to go from input text in one
format to output text in another format. The model can be used for many different
objectives, such as summarization and text classification, and it can also be used to
build a trivia bot that retrieves answers from memory without any provided context.
T5 is preferred for a lot of tasks for a few reasons :
1. Can be used for any text-to-text task.
2. Good accuracy on downstream tasks after fine-tuning.
Steps:
Code:
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

import pandas as pd
import numpy as np
import random
import torch
from torch.utils.data import Dataset, DataLoader
import datasets

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

set_seed(42)

from transformers import (
    T5ForConditionalGeneration, T5Tokenizer,
    Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq
)

def calc_token_len(example):
    # number of sub-word tokens the tokenizer produces for a given example
    return len(tokenizer(example).input_ids)

from sklearn.model_selection import train_test_split

# df holds the extracted C4_200M sentence pairs (see the loading sketch above)
train_df, test_df = train_test_split(df, test_size=0.10, shuffle=True)
train_df.shape, test_df.shape
# tokenize inputs and targets (this step sits inside the GrammarDataset class used below)
tokenized_inputs = tokenizer(input_, pad_to_max_length=self.pad_to_max_length,
                             max_length=self.max_len,
                             return_attention_mask=True)
tokenized_targets = tokenizer(target_, pad_to_max_length=self.pad_to_max_length,
                              max_length=self.max_len,
                              return_attention_mask=True)

inputs = {"input_ids": tokenized_inputs['input_ids'],
          "attention_mask": tokenized_inputs['attention_mask'],
          "labels": tokenized_targets['input_ids']}

if self.print_text:
    for k in inputs.keys():
        print(k, len(inputs[k]))

return inputs
# defining training related arguments
batch_size = 16
args = Seq2SeqTrainingArguments(
    output_dir="/content/drive/MyDrive/c4_200m/weights",
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    learning_rate=2e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    save_total_limit=2,
    predict_with_generate=True,
    fp16=True,
    gradient_accumulation_steps=6,
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,
    logging_dir="/logs",
    report_to="wandb")
# evaluation metric (ROUGE, loaded from the datasets library)
rouge_metric = datasets.load_metric("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = rouge_metric.compute(predictions=decoded_preds,
                                  references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    return result
# build the trainer; `model` here is assumed to be the T5 model being fine-tuned,
# and `data_collator` a DataCollatorForSeq2Seq(tokenizer, model=model)
trainer = Seq2SeqTrainer(model=model,
                         args=args,
                         train_dataset=GrammarDataset(train_dataset, tokenizer),
                         eval_dataset=GrammarDataset(test_dataset, tokenizer),
                         tokenizer=tokenizer,
                         data_collator=data_collator,
                         compute_metrics=compute_metrics)
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = 'deep-learning-analytics/GrammarCorrector'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name).to(torch_device)

def correct_grammar(input_text, num_return_sequences):
    batch = tokenizer([input_text], truncation=True, padding='max_length',
                      max_length=64, return_tensors="pt").to(torch_device)
    translated = model.generate(**batch, max_length=64, num_beams=4,
                                num_return_sequences=num_return_sequences,
                                temperature=1.5)
    tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
    return tgt_text
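A quick usage check of the function above (the input sentence is just an illustration):

# correct a deliberately ungrammatical sentence, returning two candidates
print(correct_grammar("He are moving here.", num_return_sequences=2))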
Output:
Applications:
1. Can be used for Grammar error correction specific applications like Grammarly.
2. Can be implemented in paraphrasing software and applications.
3. Can be included in document or content-writing software like Microsoft Word,
LibreOffice and Google Docs.
Results:
By fine-tuning the T5 Transformer for Grammar Error Correction and training it on the
550K-sentence subset of the C4_200M dataset, we achieved a ROUGE score of 80%.
Conclusion:
In this project, we built a Grammar Error Correction system based on deep learning
by fine-tuning a Transformer model, and the experimental results show that the
approach is effective, making full use of the advantages of deep learning.