NLP Manual (1-12)
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA
Experiment No. : 1
AIM : Study various applications of NLP and formulate the Problem Statement
for Mini Project based on chosen real world NLP applications.
Team Members :
1. Sanika S. Bhatye (Roll Number : 14)
2. Nachiket S. Gaikwad (Roll Number : 35)
3. Priyanka A. Gupta (Roll Number : 45)
Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA
Experiment No. : 2
Tokenization : Given a character sequence and a defined document unit, tokenization is the
task of chopping it up into pieces called tokens. In other words, it is the act of breaking up a
sequence of strings into elements such as words, keywords, phrases and symbols. Tokens can
be individual words, phrases or even whole sentences. In the process of tokenization, some
characters such as punctuation marks are discarded.
Filtration : Many of the words used in a phrase are insignificant and hold little meaning.
For example – English is a subject.
Here, ‘English’ and ‘subject’ are the most significant words, while ‘is’ and ‘a’ are almost
useless: the phrase carries the same meaning even if we remove the insignificant words
(‘is’, ‘a’).
Using nltk, we can remove the insignificant words by looking at their part-of-speech tags.
For that, we have to decide which part-of-speech tags are significant.
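A minimal sketch of this idea (assuming NLTK is installed as described in the steps below; the choice of “significant” tags here is only an illustration):

import nltk

sentence = "English is a subject."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# keep only nouns and adjectives; determiners and auxiliary verbs are dropped
significant_tags = {"NN", "NNS", "NNP", "NNPS", "JJ"}
filtered = [word for word, tag in tagged if tag in significant_tags]
print(filtered)   # expected to keep 'English' and 'subject'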
Steps: Tokenization
a. In order to get started, we need the NLTK module, as well as Python.
b. Download the latest version of Python if you are on Windows. If you are on Mac or
Linux, you should be able to run apt-get install python3.
c. Next, we need NLTK 3. The easiest way to install the NLTK module is with pip. For all
users, that is done by opening up cmd.exe, bash, or whatever shell you use and
typing: pip install nltk
d. Next, we need to install some of the components for NLTK.
Open python via whatever means you normally do, and type:
import nltk
nltk.download()
Unless you are operating headless, a GUI will pop up like this, only probably with
red instead of green:
Choose to download "all" for all packages, and then click 'download.' This will give you all of
the tokenizers, chunkers, other algorithms, and all of the corpora. If space is an issue, you can
elect to selectively download everything manually. The NLTK module will take up about 7MB,
and the entire nltk_data directory will take up about 1.8GB, which includes your chunkers,
parsers, and the corpora. If you are operating headless, like on a VPS, you can install
everything by running Python and doing:
import nltk
nltk.download()
d (for download), then all (to download everything)
Now that you have all the things that you need, let's knock out some quick vocabulary:
Lexicon - Words and their meanings.
Example: English dictionary.
Consider, however, that various fields will have different lexicons. For example: To
a financial investor, the first meaning for the word "Bull" is someone who is confident
about the market, as compared to the common English lexicon, where the first
meaning for the word "Bull" is an animal. As such, there is a special lexicon for
financial investors, doctors, children, mechanics, and so on.
Token - Each "entity" that is a part of whatever was split up based on rules.
For example, each word is a token when a sentence is "tokenized" into
words. Each sentence can also be a token, if you tokenized the sentences
out of a paragraph.
These are the words you will most commonly hear upon entering the Natural
Language Processing (NLP) space. With that, let's show an example of how one
might actually tokenize something into tokens with the NLTK module.
from nltk.tokenize import sent_tokenize, word_tokenize

EXAMPLE_TEXT = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard."

print(sent_tokenize(EXAMPLE_TEXT))
The above code will output the sentences, split up into a list that you can, for
example, iterate through with a for loop.
So there, we have created tokens, which are sentences. Let's tokenize by word instead this time:
print(word_tokenize(EXAMPLE_TEXT))
Now our output is:
['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather',
'is', 'great', ',', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue',
'.', 'You', 'should', "n't", 'eat', 'cardboard', '.']
OUTPUT :
Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA
Experiment No. : 3
AIM : Apply various other text preprocessing techniques for any given text :
Stop Word Removal, Lemmatization / Stemming.
THEORY :
Stop Word Removal : One of the major forms of pre-processing is to filter out
useless data. In NLP, such useless words (data) are referred to as stop words.
Stemming : For many words there is one root word but many variations of it. For
example, the root word is "eat" and its variations are "eats", "eating", "eaten"
and so on. With the help of stemming, we can find the root word of any of these
variations.
Lemmatization : Lemmatization is a text normalization technique used in Natural Language
Processing (NLP). It has been studied for a very long time and lemmatization algorithms have
been made since the 1960s. Essentially, lemmatization is a technique that switches any
kind of word to its base root form. Lemmatization is responsible for grouping different
inflected forms of a word into its root form, which has the same meaning.
We can do this easily by storing a list of words that we consider to be stop words.
NLTK starts you off with a bunch of words that it considers to be stop words; you
can access the list via the NLTK corpus with:
>>> from nltk.corpus import stopwords
>>> set(stopwords.words('english'))
{'ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during',
'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours',
'such', 'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from',
'him', 'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his',
'through', 'don', 'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should',
'our', 'their', 'while', 'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when',
'at', 'any', 'before', 'them', 'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does',
'yourselves', 'then', 'that', 'because', 'what', 'over', 'why', 'so', 'can', 'did', 'not',
'now', 'under', 'he', 'you', 'herself', 'has', 'just', 'where', 'too', 'only', 'myself',
'which', 'those', 'i', 'after', 'few', 'whom', 't', 'being', 'if', 'theirs', 'my', 'against',
'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than'}
Here is how we might use the stop_words set to remove the stop words from our text:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# example sentence (any text will do)
example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)

filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)
Steps : Stemming
from nltk.stem import PorterStemmer

ps = PorterStemmer()
# example words sharing the root "python" (chosen to match the output below)
example_words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]
Next, we can easily stem by doing something like:
for w in example_words:
print(ps.stem(w))
Our output:
python
python
python
python
pythonli
Now let's try stemming a typical sentence, rather than some words:
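The code for this step is a sketch; the example sentence is reconstructed from the stemmed output shown below:

from nltk.tokenize import word_tokenize

new_text = ("It is important to by very pythonly while you are pythoning with python. "
            "All pythoners have pythoned poorly at least once.")

for w in word_tokenize(new_text):
    print(ps.stem(w))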
It
is
import
to
by
veri
pythonli
while
you
are
python
with
python
.
All
python
have
python
poorli
at
least
onc
.
Steps : Lemmatization
So, your root stem, meaning the word you end up with, is not something you can
just look up in a dictionary, but you can look up a lemma.
Sometimes you will wind up with a very similar word, but sometimes, you will
wind up with a completely different word. Let's see some examples.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run",'v'))
OUTPUT :
Stopwords Removal -
Stemming -
Lemmatization -
Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA
Experiment No. : 4
THEORY : Morphology is the study of the structure and formation of words. Its
most important unit is the morpheme, which is defined as the "minimal unit of
meaning".
In the word "unhappiness", there are three morphemes, each carrying a certain
amount of meaning: un means "not", while ness means "being in a state or
condition". Happy is a free morpheme because it can appear on its own (as a
"word" in its own right). Bound morphemes have to be attached to a free
morpheme, and so cannot be words in their own right. Thus, you cannot have
sentences in English such as "Jason feels very un ness today".
Inflection:
Inflection is the process of changing the form of a word so that it expresses
information such as number, person, case, gender, tense, mood and aspect, but
the syntactic category of the word remains unchanged. As an example, the plural
form of the noun in English is usually formed from the singular form by adding
an s.
• car / cars
• table / tables
• dog / dogs
In each of these cases, the syntactic category of the word remains unchanged.
Derivation:
As was seen above, inflection does not change the syntactic category of a word.
Derivation does change the category. Linguists classify derivation in English
according to whether or not it induces a change of pronunciation. For instance,
adding the suffix ity changes the pronunciation of the root of active so the stress
is on the second syllable: activity. The addition of the suffix al to approve doesn't
change the pronunciation of the root: approval.
Code POS tagging :
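A minimal sketch of POS tagging with NLTK (the example sentence reuses the one from the Filtration section of Experiment 2):

import nltk

# tag a short sentence with Penn Treebank part-of-speech tags
words = nltk.word_tokenize("English is a subject.")
print(nltk.pos_tag(words))
# prints a list of (word, tag) pairs such as ('subject', 'NN')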
Result :
Code TextSimilar() :
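A minimal sketch, assuming this heading refers to NLTK's Text.similar() method, which prints words that occur in contexts similar to a given word (the corpus and the query word are arbitrary choices):

import nltk
from nltk.corpus import gutenberg

text = nltk.Text(gutenberg.words('austen-emma.txt'))
# print words that appear in contexts similar to 'happy'
text.similar('happy')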
Result :
Code Stemming :
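A minimal sketch of stemming with the Porter stemmer, using example words tied to the inflection and derivation theory above:

from nltk.stem import PorterStemmer

ps = PorterStemmer()
# stemming strips inflectional endings and some derivational suffixes
for w in ["cars", "tables", "activity", "approval"]:
    print(w, "->", ps.stem(w))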
Result :
Code Stemming :
Result :
Code Lemmatization :
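A minimal sketch of lemmatization with WordNet, showing how inflected forms map back to their dictionary form:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars"))            # car
print(lemmatizer.lemmatize("tables"))          # table
print(lemmatizer.lemmatize("better", pos="a")) # good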
Result :
Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA
Experiment No. : 5
THEORY :
N – Grams : The general idea is that we can look at each pair (or triple, set of
four, etc.) of words that occur next to each other. In a sufficiently-large corpus,
we are likely to see "the red" and "red apple" several times, but less likely to see
"apple red" and "red the". This is useful to know if, for example, we are trying to
figure out what someone is more likely to say to help decide between possible
output for an automatic speech recognition system. These co-occurring words
are known as "n-grams", where "n" is a number saying how long a string of
words we considered. (Unigrams are single words, bigrams are two words,
trigrams are three words, 4-grams are four words, 5-grams are five words, etc.)
In particular, nltk has the n-grams function that returns a generator of n-grams
given a tokenized sentence.
An n-gram tagger is a generalization of a unigram tagger whose context is the
current word together with the part-of-speech tags of the n-1 preceding tokens.
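For example, a minimal sketch using nltk's ngrams utility on a hypothetical sentence:

from nltk import ngrams
from nltk.tokenize import word_tokenize

tokens = word_tokenize("the red apple fell from the tree")

# n = 1, 2, 3 give unigrams, bigrams and trigrams respectively
print(list(ngrams(tokens, 1)))
print(list(ngrams(tokens, 2)))
print(list(ngrams(tokens, 3)))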
Generating Unigrams :
Result:
Generating Bigrams :
Result:
Generating Trigrams :
Result:
Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA
Experiment No. : 6
Rule-based POS tagging uses hand-written rules to assign and disambiguate
part-of-speech tags. These rules may be either context-pattern rules, or regular
expressions compiled into finite-state automata, intersected with a lexically
ambiguous sentence representation.
We can also understand rule-based POS tagging through its two-stage architecture −
First stage − Uses a dictionary to assign each word a list of potential parts-of-speech.
Second stage − Uses large lists of hand-written disambiguation rules to narrow the
list down to a single part-of-speech for each word.
Properties of Rule-Based POS Tagging : Rule-based POS taggers possess the
following properties −
These taggers are knowledge-driven taggers.
The rules in rule-based POS tagging are built manually.
The information is coded in the form of rules.
The number of rules is limited, to approximately 1000.
Smoothing and language modeling are defined explicitly in rule-based taggers.
Tag Sequence Probabilities - This is another approach to stochastic tagging, where
the tagger calculates the probability of a given sequence of tags occurring. It is also
called the n-gram approach, because the best tag for a given word is determined by
the probability with which it occurs with the n previous tags.
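As a sketch of this n-gram approach (the training split and the test sentence are arbitrary choices):

import nltk
from nltk.corpus import treebank

# train a bigram tagger with unigram and default backoffs on tagged Treebank sentences
train_sents = treebank.tagged_sents()[:3000]
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)

print(t2.tag(nltk.word_tokenize("The quick brown fox jumps over the lazy dog")))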
Working of transformation-based learning (TBL) −
Start with a solution − TBL usually starts with some solution to the problem and
works in cycles.
Most beneficial transformation chosen − In each cycle, TBL will choose the most
beneficial transformation.
Apply to the problem − The transformation chosen in the last step will be applied
to the problem.
The algorithm stops when the transformation selected in step 2 no longer adds
value, or when there are no more transformations to be selected. This kind of
learning is best suited to classification tasks.
One of the more powerful aspects of the NLTK module is the Part of
Speech tagging that it can do for you. This means labeling words in a
sentence as nouns, adjectives, verbs, etc. Even more impressive, it also
labels by tense, and more.
CODE :
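A sketch of part-of-speech tagging with NLTK along the lines described above (the State of the Union corpus files and the [:5] slice are illustrative choices):

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

# train the Punkt sentence tokenizer on one speech and apply it to another
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content()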
RESULT :
Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA
Experiment No. : 7
+ = match 1 or more
? = match 0 or 1 repetitions
* = match 0 or more repetitions
. = any character except a new line
The last thing to note is that the part-of-speech tags are denoted with "<" and ">",
and we can also place regular expressions within the tags themselves, to account
for things like "all nouns" (<N.*>).
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
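The remaining steps are sketched below; the chunk grammar used here (adverbs and verbs followed by one or more proper nouns and an optional noun) is an assumption:

tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)

            # chunk grammar: zero or more adverbs and verbs,
            # one or more proper nouns, and an optional common noun
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            chunked.draw()
    except Exception as e:
        print(str(e))

process_content()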
The main line here in question is the chunk grammar (repeated from the sketch above):
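chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""   # as used in the sketch above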
Cool, that helps us visually, but what if we want to access this data via our program?
Well, what is happening here is that our "chunked" variable is an NLTK tree. Each
"chunk" and "non-chunk" is a "subtree" of the tree. We can reference these by doing
something like chunked.subtrees(). We can then iterate through these subtrees like
so:
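A minimal sketch of this iteration (the filter on the "Chunk" label matches the explanation below):

for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
    print(subtree)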
Now, we're filtering to only show the subtrees with the label of "Chunk". Keep in
mind, this isn't "Chunk" as in the NLTK chunk attribute... it is "Chunk" literally,
because that's the label we gave it in the chunk grammar above.
RESULT :
CONCLUSION : Hence, we have successfully implemented the experiment on
Chunking.
Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA
Experiment No. : 8
THEORY : In any text document, there are particular terms that represent
specific entities that are more informative and have a unique context. These
entities are known as named entities, which more specifically refer to terms that
represent real-world objects like people, places, organizations, and so on, which
are often denoted by proper names. A naive approach could be to find these by
looking at the noun phrases in text documents. Named entity recognition (NER),
also known as entity chunking/extraction, is a popular technique used in
information extraction to identify and segment the named entities and classify
or categorize them under various predefined classes. One of the major forms of
chunking in NLP is called "Named Entity Recognition". The idea is to have the machine
immediately be able to pull out "entities" like people, places, things, locations,
monetary figures, and more. This can be a bit of a challenge, but NLTK has this built
in for us. There are two major options with NLTK's named entity recognition: either
recognize all named entities, or recognize named entities as their respective type,
like people, places, locations, etc.
Here's an example:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged, binary=True)
            namedEnt.draw()
    except Exception as e:
        print(str(e))

process_content()
Here, with the option binary=True, something is classified simply as either a named
entity or not.
The result is:
Immediately, you can see a few things. When binary is False, it picked up the same
things, but wound up splitting terms like "White House" into "White" and "House"
as if they were separate, whereas with the binary=True option the named entity
recognition correctly treated "White House" as part of the same named entity.
Depending on your goals, you may use the binary option as you see fit. Here are
the types of named entities that you can get if you have binary set to False:
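For reference, the entity types listed in the NLTK documentation include: ORGANIZATION, PERSON, LOCATION, DATE, TIME, MONEY, PERCENT, FACILITY and GPE (geo-political entity).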
RESULT :
Binary = true
Binary = false
CONCLUSION : Thus, we have successfully implemented the experiment on Named Entity Recognition.
Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Subject : NATURAL LANGUAGE PROCESSING (CSDL7013)
Submitted to : PROF. NAZIA SULTHANA
Experiment No. :
THEORY :
1. Clinical Documentation.
NLP-driven clinical documentation helps free clinicians from the laborious manual
side of EHR systems and permits them to invest more time in the patient; this is how
NLP can help doctors. Both speech-to-text dictation and formulated data entry have
been a blessing.
Nuance and M*Modal provide technology that combines speech recognition with
formalised vocabularies to capture structured data at the point of care and store it
for future use.
NLP technologies extract relevant data from speech recognition output, which will
considerably enrich the analytical data used to run value-based care (VBC) and
population health management (PHM) efforts, with better outcomes for clinicians.
In the coming years, NLP tools will also be applied to various public data sets and
social media to determine Social Determinants of Health (SDOH) and the usefulness
of wellness-based policies.
2. Speech Recognition.
NLP has matured its use case in speech recognition over the years by allowing clinicians to
transcribe notes for useful EHR data entry. Front-end speech recognition lets
physicians dictate notes directly at the point of care, while back-end technology
works to detect and correct any errors in the transcription before passing it on for
human proofing.
The market is almost saturated with speech recognition technologies, but a few start-ups are
disrupting the space with deep learning algorithms in mining applications, uncovering more
extensive possibilities.
Implementing Predictive Analytics in Healthcare :
CONCLUSION : Thus, we have successfully curated a case study on the applications of NLP.
Name :
Roll No. :
Class : BE – A / Computer Engineering
UID :
Experiment No. :
THEORY :
Abstract: Grammatical Error Correction (GEC) systems aim to correct grammatical mistakes
in the text. Grammarly is an example of such a grammar correction product. Error correction
can improve the quality of written text in emails, blogs and chats. The GEC task can be
thought of as a sequence-to-sequence task in which a Transformer model is trained to
take an ungrammatical sentence as input and return a grammatically correct sentence.
Implementation:
1. Dataset:
For the training of our Grammar Corrector, we have used the C4_200M dataset
recently released by Google. This dataset consists of roughly 200 million examples of
synthetically generated grammatical corruptions along with the corresponding correct
text.
One of the biggest challenges in GEC is getting a good variety of data that simulates
the errors typically made in written language. If the corruptions are random, then they
would not be representative of the distribution of errors encountered in real use
cases.
To generate the corruption, a tagged corruption model is first trained. This model is
trained on existing datasets by taking as input a clean text and generating a corrupted
text. This is represented in the figure below:
For the C4_200M dataset, the authors first determined the distribution of the relative
types of errors encountered in written language. When generating the corruptions,
the corruption model was conditioned on the type of error; as shown in the figure
below, it could, for example, be conditioned to generate a determiner-type error.
This allows the C4_200M dataset to have a diverse set of errors reflecting their relative
frequency in real-world applications. For the purpose of this project, we extracted
550K sentences from C4_200M. The C4_200M dataset is available on TF datasets. We
extracted the sentences we needed and saved them as a CSV.
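A minimal sketch of loading the extracted sentences back into a DataFrame; the file name and the 'input' / 'output' column names here are assumptions about how the CSV was saved:

import pandas as pd

# load the extracted C4_200M sentence pairs (corrupted text -> corrected text)
df = pd.read_csv("c4_200m_550k.csv")
print(df.shape)
print(df.head())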
2. Model Training:
T5 is a text-to-text model, meaning it can be trained to go from input text in one
format to output text in another format. The model can be used for many different
objectives, such as summarization and text classification, and it can also be used to
build a trivia bot that retrieves answers from memory without any provided context.
T5 is preferred for a lot of tasks for a few reasons :
1. Can be used for any text-to-text task.
2. Good accuracy on downstream tasks after fine-tuning.
Steps:
Code:
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

import pandas as pd
import numpy as np
import random
import torch
from torch.utils.data import Dataset, DataLoader
import datasets

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

set_seed(42)

from transformers import (
    T5ForConditionalGeneration, T5Tokenizer,
    Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq
)

def calc_token_len(example):
    # number of sub-word tokens the tokenizer produces for a given example
    return len(tokenizer(example).input_ids)

from sklearn.model_selection import train_test_split

# df holds the extracted C4_200M sentence pairs (see the loading sketch above)
train_df, test_df = train_test_split(df, test_size=0.10, shuffle=True)
train_df.shape, test_df.shape
# tokenize inputs and targets (this step sits inside the GrammarDataset class used below)
tokenized_inputs = tokenizer(input_, pad_to_max_length=self.pad_to_max_length,
                             max_length=self.max_len,
                             return_attention_mask=True)
tokenized_targets = tokenizer(target_, pad_to_max_length=self.pad_to_max_length,
                              max_length=self.max_len,
                              return_attention_mask=True)

inputs = {"input_ids": tokenized_inputs['input_ids'],
          "attention_mask": tokenized_inputs['attention_mask'],
          "labels": tokenized_targets['input_ids']}

if self.print_text:
    for k in inputs.keys():
        print(k, len(inputs[k]))

return inputs
# defining training related arguments
batch_size = 16
args = Seq2SeqTrainingArguments(
    output_dir="/content/drive/MyDrive/c4_200m/weights",
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    learning_rate=2e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    save_total_limit=2,
    predict_with_generate=True,
    fp16=True,
    gradient_accumulation_steps=6,
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,
    logging_dir="/logs",
    report_to="wandb")
# evaluation metric (ROUGE, loaded from the datasets library)
rouge_metric = datasets.load_metric("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = rouge_metric.compute(predictions=decoded_preds,
                                  references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    return result
# build the trainer; `model` here is assumed to be the T5 model being fine-tuned,
# and `data_collator` a DataCollatorForSeq2Seq(tokenizer, model=model)
trainer = Seq2SeqTrainer(model=model,
                         args=args,
                         train_dataset=GrammarDataset(train_dataset, tokenizer),
                         eval_dataset=GrammarDataset(test_dataset, tokenizer),
                         tokenizer=tokenizer,
                         data_collator=data_collator,
                         compute_metrics=compute_metrics)
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = 'deep-learning-analytics/GrammarCorrector'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name).to(torch_device)

def correct_grammar(input_text, num_return_sequences):
    batch = tokenizer([input_text], truncation=True, padding='max_length',
                      max_length=64, return_tensors="pt").to(torch_device)
    translated = model.generate(**batch, max_length=64, num_beams=4,
                                num_return_sequences=num_return_sequences,
                                temperature=1.5)
    tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
    return tgt_text
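A quick usage check of the function above (the input sentence is just an illustration):

# correct a deliberately ungrammatical sentence, returning two candidates
print(correct_grammar("He are moving here.", num_return_sequences=2))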
Output:
Applications:
1. Can be used for Grammar error correction specific applications like Grammarly.
2. Can be implemented in paraphrasing software and applications.
3. Can be included in document or content-writing software like Microsoft Word,
LibreOffice and Google Docs.
Results:
By fine-tuning the T5 Transformer for Grammar Error Correction and training it on the
550K-sentence subset of the C4_200M dataset, we achieved a ROUGE score of 80%.
Conclusion:
In this project, we built a Grammar Error Correction system based on deep learning
by fine-tuning a Transformer model, and the experimental results show that the
approach is effective, making full use of the advantages of deep learning.