
NATURAL LANGUAGE PROCESSING

UNIT – 1 :
1. Introduction
1.1. Understanding NLP
1.2. Understanding Basic Applications
1.3. Advantages of combining NLP and Python
1.4. Environment setup for nltk
2. Practical understanding of corpus and database.

1. Introduction :
1. Understanding NLP :
Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) that aims to teach
computers human language, with all its complexity, so that they can process it the way humans do.
2. Understanding Basic Applications :
Applications of NLP :
i. Chatbots
ii. Language Translators
iii. Sentiment Analysis
iv. Auto Correct / Auto Complete
v. Social media marketing
vi. Voice Assistance
vii. Grammar Checkers
viii. E-mail classification & Filtering
ix. Machine Translation
x. Speech Recognition
xi. Text Extraction
xii. Predictive Text
xiii. Targeted Advertisement
3. Advantages of combining NLP and Python :
a. Developing prototypes for NLP-based expert systems using Python is very easy &
efficient.
b. A large variety of open-source NLP libraries are available for Python programmers.
c. Community support is very strong.
d. Easy to use & less complex for beginners.
e. Rapid development : testing & development are easy and less complex.
f. Optimization of NLP-based systems is less complex compared to other programming
languages.
g. Many of the newer frameworks such as Apache Spark, Apache Flink, TensorFlow and so
on provide APIs for Python.
4. Environment setup for nltk(Natural Language Tool Kit) :
1. In the Python shell, to check the Python version we can use -V (or) --version
>python -V
Python 3.11.2

>python --version
Python 3.11.2
2. In Jupyter Notebook / Spyder
from platform import python_version
print(python_version())
Output: 3.9.7

3. In python shell
import nltk
(or) install nltk (if the package is not available)
On other platforms:
pip install -U nltk
import nltk   # it will import the package
For Windows, install the free version of Python:
https://docs.python-guide.org/starting/install3/win

Environment setup for nltk :


a) Install nltk : run pip install --user -U nltk
b) Install numpy : run pip install --user -U numpy
c) Test Installation : run python then import nltk
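A minimal sketch for testing the installation and fetching the data packages used later in these notes (the resource names assume a standard NLTK distribution):

import nltk
print(nltk.__version__)          # confirms that NLTK imports and shows its version

# Download commonly used resources (tokenizer models, stop word lists, sample corpora).
nltk.download('punkt')           # sentence/word tokenizer models
nltk.download('stopwords')       # stop word lists
nltk.download('gutenberg')       # sample corpus used in later sections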

2. Practical Understanding of Corpus and Database :


Corpus – singular, Corpora – plural
Datum – singular, Data – plural
 Corpus is mainly related to linguistics (language).
 A corpus is a dataset containing data in written (or) spoken form, i.e., text/audio
data.
 What is a corpus ?
Package – nltk.corpus
We can take a dataset in the form of text (.txt) files as a corpus.
 Why do we need a corpus ?
To know the type of data it contains.
Types of Corpus :
There are 3 types of corpus.
1. Monolingual : contains one language.
2. Bilingual : contains two languages.
3. Multilingual : contains more than two languages.
Extracting Data from Corpus :
There are different tools (Corpus Linguistic Tools) for corpus to extract data
1. Text Analysis
2. Lexical Analysis
3. Text Mining
4. Semantic Tagger
5. Semantic Parser
6. Topic Models
7. Variant Detector
8. Pattern Matching
9. Phraseology
10. Keywords
11. Temporal Tagger
 Corpus Analysis can be defined as a methodology for pursuing in depth investigations of
linguistic concepts as grounded in the context of authentic & communicative situations.
Challenges regarding creating a Corpus for NLP Applications :
There are 4 challenges regarding creating a Corpus for NLP Applications.
1. Deciding the type of data we need in order to solve the problem statement.
2. Availability of Data.
3. Quality of Data.
4. Adequacy of the Data in terms of amount.
Understanding types of Data Attributes :
Attributes are data fields that represent the characteristics/features of a data object.
There are 2 types of Data Attributes.
1. Qualitative (Categorical)
2. Quantitative (Numeric)
1. Qualitative Data Attributes (or) Categorical Data Attributes :
These kinds of attributes are more descriptive.
e.g., our written notes.
These are of 3 types
1. Nominal Attributes
2. Binary Attributes
3. Ordinal Attributes
1. Nominal Attributes :
The values of a nominal attribute are names of things or some kind of symbols.
There is no order (rank, position) among the values of a nominal attribute.
e.g., Colour : Black, Blue, White...
Categorical data : Lecturer, Professor, Assistant Professor…

2. Binary Attributes :
Binary data has only 2 values (or) states, e.g., yes/no, affected/unaffected, true/false.
Symmetric : both values are equally important (e.g., Gender).
Asymmetric : both values are not equally important (e.g., Result).
e.g., Gender : Male, Female
Result : Pass, Fail
3. Ordinal Attributes :
Ordinal attributes contain values that have a meaningful sequence (or) ranking (order)
between them, but the magnitude between values is not actually known; the order of
values shows what is important but does not indicate how important it is.
e.g., Grade : A, B, C, D, E, F
Basic Pay Scale : 16, 17, 18
2. Quantitative Data Attributes (or) Numeric Data Attributes :
These are of 3 types
1. Numeric Attributes
2. Discrete Attributes
3. Continuous Attributes
1. Numeric Attributes :
A numeric attribute is quantitative because it is a measurable quantity, represented
in integer (or) real values. Numeric attributes are of 2 types.
i. Interval Scaled
ii. Ratio-Scaled
i. Interval Scaled :
Interval-scaled attributes have values whose differences are interpretable, but they
do not have a true reference point (or) zero point.
Data can be added and subtracted on an interval scale but cannot be multiplied (or)
divided.
e.g., Temperature in degrees Celsius (Centigrade).
If the temperature of one day is twice that of another day, we cannot
say that one day is twice as hot as the other.
ii. Ratio-Scaled :
It is a numeric attribute with a fixed zero point. If a measurement is ratio-scaled, we can
speak of a value as being a multiple (or) ratio of another value.
The values are ordered, and we can also compute the difference between values,
as well as the mean, median, mode, quartile range…
2. Discrete Attributes :
Discrete data have finite values; they can be numerical and can also be in categorical form.
These attributes have a finite (or) countably infinite set of values.
e.g., Profession : Teacher, Businessman, Peon
Zip Code : 501701, 120042

3. Continuous Attributes :
Continuous data have an infinite number of states. Continuous data are of float type.
There can be many values between 2 integers (say 1 & 2).
e.g., Height : 6, 5.9, 5.6…
Weight : 40, 60, 45…

Data Attributes
1. Qualitative (Categorical) : Nominal, Binary, Ordinal
2. Quantitative (Numeric) : Numeric, Discrete, Continuous


Resources for accessing free Corpora :
The nltk library provides some built-in corpora. To list all the corpus names, execute the
following commands.
import nltk.corpus
print(dir(nltk.corpus))
output : ['_LazyModule__lazymodule_globals', '_LazyModule__lazymodule_import', '_Laz
yModule__lazymodule_init', '_LazyModule__lazymodule_loaded', '_LazyModule__lazymodu
le_locals', '_LazyModule__lazymodule_name', '__class__', '__delattr__', '__dict__',
'__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattrib
ute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__',
'__module__', '__name__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__re
pr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__']
The major Corpus repositories are :
1. Oxford Text Archive
2. CORD (COrpus Research Database)
3. Linguistic Data Consortium
4. BYU Corpus Collection (English-corpus.org)
5. CQPweb (by login access)
Web-based corpus query systems are also rich resources.
Exploring Different File formats for corpora :
Corpus can be in many different formats. All these file formats are generally used to store
features, which we will feed into our machine learning algorithms.
There are 2 basic file formats for corpora.
1. .txt
2. .csv
1. .txt : This format is basically given to us as a raw dataset.
The Gutenberg corpus is one of the example corpora.
2. .csv : ( Comma separated value ) This file format is generally given while working with
machine learning algorithms.
Gutenberg Corpus :
import nltk
from nltk.corpus import gutenberg
nltk.corpus.gutenberg.fileids()
ex = nltk.corpus.gutenberg.words('whitman-leaves.txt')
print(len(ex))
print(type(ex))
print(ex)
Output: 154883
<class 'nltk.corpus.reader.util.StreamBackedCorpusView'>
['[', 'Leaves', 'of', 'Grass', 'by', 'Walt', 'Whitman', ...]
Preparing a Dataset for NLP Applications :
Preparing a dataset for NLP applications involves 3 steps.
1. Selecting data
2. Pre-processing Data
3. Transforming Data
1. Selecting data :
By using the lines of code below we can select (load) the data.
import pandas as pd
data = pd.read_csv('data.csv')
print(data)
2. Pre-processing data:
Pre-processing of data includes several steps such as
 Load data in pandas
data = pd.read_csv('data.csv')
 Drop columns that are not useful
cols = ['column1', 'column2']
data = data.drop(cols, axis=1)
 Drop rows with missing values
data = data.dropna()
 Create dummy variables
dummies = []
cols = ['age', 'address']
for col in cols:
    dummies.append(pd.get_dummies(data[col]))
 Take care of missing data
 Convert the data frame to numpy
 Divide the dataset into training data and testing data (a short sketch of these last three steps follows)
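A minimal sketch of the last three steps above, assuming a hypothetical data.csv with numeric feature columns and a target column named 'label' (both names are illustrative, not from the original notes):

import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('data.csv')                     # hypothetical file name

# Take care of missing data: fill numeric gaps with the column mean.
data = data.fillna(data.mean(numeric_only=True))

# Convert the data frame to numpy arrays (features X, target y).
X = data.drop('label', axis=1).to_numpy()          # 'label' is an assumed column name
y = data['label'].to_numpy()

# Divide the dataset into training data and testing data (80/20 split).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)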
3. Transforming data:
Data transformation is the process of converting raw-data into a format (or) structure that
would be made suitable for the model (or) algorithm.
Why transform data ?
The required transform depends on the algorithm being used; many algorithms compare the
relative relationships between data points, so features must be put on a comparable scale/format.
When to apply data transform ?
When applying Supervised algorithms, training data and testing data need to be
transformed in the same way.
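A small sketch of what "transformed in the same way" means in practice, using scikit-learn's StandardScaler on toy numeric data (the arrays below are made-up stand-ins for real training/testing features):

from sklearn.preprocessing import StandardScaler
import numpy as np

# Toy numeric feature matrices standing in for real training/testing data.
X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[2.5, 350.0]])

# Fit the transform on the training data only...
scaler = StandardScaler().fit(X_train)

# ...then apply the SAME fitted transform to both sets,
# so training and testing data are transformed in the same way.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled)
print(X_test_scaled)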
Web-Scraping :
Web-Scraping is an automatic method to obtain large amounts of data from websites. Most
of this data is unstructured data in HTML format, which is then converted into structured
data in a spreadsheet (or) a database so that it can be used in various applications.
e.g., getting data from the webpage
It is used in Research Development, Business(marketing), social media
Why is python good for web-Scraping ?
 Ease of use
 Large collection of libraries such as Numpy, Pandas…
 Dynamically typed
 Easily understandable syntax
 Small code, large task
 Community support is strong
Process of Scraping data from a website :
1. Find the URL you want to scrape
2. Inspect the page
3. Find the data you want to extract
4. Write the code
5. Run the code & extract the data
6. Store the extracted data in the required format
Libraries used for Web Scraping:
 Selenium ( A web testing library used to automate browser activities )
 Beautiful Soup ( Python package for parsing HTML & XML documents; it creates parse
trees to extract data easily, as in the sketch after this list )
 Scrapy
 Mechanical soup
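A hedged sketch of the six-step process above using requests and Beautiful Soup; https://example.com and the output file links.csv are placeholders, not URLs or files from the original notes:

import csv
import requests
from bs4 import BeautifulSoup

# Steps 1-2: fetch and inspect the page (placeholder URL).
response = requests.get('https://example.com')

# Steps 3-5: parse the HTML and extract the data we want
# (here, the page title and all hyperlink targets).
soup = BeautifulSoup(response.text, 'html.parser')
title = soup.title.string if soup.title else ''
links = [a.get('href') for a in soup.find_all('a')]

# Step 6: store the extracted data in the required format (a CSV file here).
with open('links.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'link'])
    for link in links:
        writer.writerow([title, link])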
UNIT – 2 :
1. Understanding of NLU(Natural Language Understanding)
1.1. Understanding the components of NLP
1.2. Natural Language Understanding
2. Defining CFG(Context Free Grammar)
3. Morphological Analysis
4. Syntactic Analysis
5. Discourse Integration
6. Pragmatic Analysis
Understanding the Components of NLP:
There are 2 components of NLP
a. Natural Language Understanding(NLU)
b. Natural Language Generation(NLG)
a. Natural Language Generation (NLG) :
Acts as a translator that converts machine representations into natural language.
It consists of 3 stages
a. Text Planning
b. Sentence Planning
c. Text Realization
b. Natural Language Understanding (NLU) :
It is the component that maps natural language input into a representation the machine can work with.
NLU is more difficult than NLG.
There are 3 kinds of ambiguity/difficulty that make NLU harder than NLG
a. Lexical Ambiguity
b. Syntactic Ambiguity
c. Referential Ambiguity
a. Lexical Ambiguity :
This means that one word holds several meanings.
e.g., The man is looking at the match.
b. Syntactic Ambiguity :
This refers to a sequence of words with more than one meaning.
e.g., The fish is ready to eat.
c. Referential Ambiguity :
This arises when a word (or) phrase could refer to two (or) more entities.
e.g., Tom met Jerry and Tim

Context-Free Grammar (CFG) :


It is a list of rules that define the set of well-formed sentences in a language. Each rule has a
left-hand side which identifies a syntactic category and a right-hand side which defines its
alternative component paths, reading from left to right.
It is defined as G = <V,T,P,S>
Where V/N : non-terminals (denoted by capital letters), i.e., {A, B, C, …}
T : terminals (denoted by small letters), i.e., {a, b, c, …, 0, 1, …}
P : set of productions
S : start symbol
e.g., L = { W c W^R | W ∈ (a|b)* }
Input: abcba
Production Rules :
S→aSa
S→bSb
S→c
S ⇒ aSa {S→aSa}
⇒ abSba {S→bSb}
⇒ abcba {S→c}
L = { a^n b^2n , n ≥ 1 }
Production Rules :
S → aSbb
S → abb
Input: aaabbbbbb
S ⇒ aSbb {S→aSbb}
⇒ aaSbbbb {S→aSbb}
⇒ aaabbbbbb {S→abb}
Context-Free Grammar in NLP :
In NLP, a Context-Free Grammar (CFG) gives software the capacity to analyse the structure of
natural language sentences.
Properties:
a. A CFG defines a set of possible derivations.
b. A string s ∈ Σ* is in the language defined by the CFG if there is at least one
derivation that yields s.
c. Each string in the language generated by the CFG may have more than one derivation.
Part of Speech (POS) :
POS tagging is the process of converting a sentence into a list of words and then into a list of
(word, tag) tuples. The tag in this case is a part-of-speech tag and signifies whether the given
word is a noun, adjective, verb, and so on.
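A minimal sketch of POS tagging with NLTK; the downloaded resource name and the tags shown in the comment are typical for recent NLTK versions but may vary:

import nltk
from nltk import word_tokenize, pos_tag

nltk.download('averaged_perceptron_tagger')    # tagger model used by pos_tag

sentence = "The child ate a cake"
tokens = word_tokenize(sentence)               # list of words
print(pos_tag(tokens))                         # list of (word, tag) tuples
# e.g. [('The', 'DT'), ('child', 'NN'), ('ate', 'VBD'), ('a', 'DT'), ('cake', 'NN')]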
List of Parts of Speech(POS) :
1. NP - Noun Phrase
2. VP - Verb Phrase
3. S - Sentence
4. det - determiner
5. n - Noun
6. tv - Transitive Verb
7. iv - Intransitive Verb
8. prep - preposition
9. pp - prepositional phrase
10. adj - adjective
e.g., V = { S, NP, VP, PP, Det, Noun, Verb, Aux, Prep }
T = { ‘a’, ‘ate’, ‘cake’, ‘child’, ‘fork’, ‘the’, ‘with’ }
P = { S → NP VP
NP → Det Noun | NP PP
PP → Prep NP
VP → Verb NP
Det → ‘a’ | ‘the’
Noun → ‘cake’ | ‘child’ | ‘fork’
Prep → ‘with’
Verb → ‘ate’ }
Input: The child ate a cake
Derivation:
S ⇒ NP VP
⇒ Det Noun VP
⇒ The Noun VP
⇒ The child VP
⇒ The child Verb NP
⇒ The child ate NP
⇒ The child ate a Noun
⇒ The child ate a cake
Parse tree:
S

NP VP

Det Noun Verb NP

The child ate Det Noun

a cake
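The same toy grammar can be written down in NLTK and parsed mechanically. A hedged sketch using nltk.CFG and nltk.ChartParser (one of several parsers NLTK provides); the lowercased input matches the terminals of the grammar above:

import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det Noun | NP PP
    PP -> Prep NP
    VP -> Verb NP
    Det -> 'a' | 'the'
    Noun -> 'cake' | 'child' | 'fork'
    Prep -> 'with'
    Verb -> 'ate'
""")

parser = nltk.ChartParser(grammar)
sentence = ['the', 'child', 'ate', 'a', 'cake']
for tree in parser.parse(sentence):
    print(tree)
    # (S (NP (Det the) (Noun child)) (VP (Verb ate) (NP (Det a) (Noun cake))))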

Morphological Analysis (or) Parsing :


Morphology (morphological analysis) is a sub-discipline of linguistics. It studies word
structure & the formation of words from smaller units (morphemes).
The goal of morphological analysis is to discover the morphemes of a given word.
Morphemes are the smallest meaning-bearing units in a language.
e.g., Bread, Milk (single morphemes)
Eggs = Egg + s ('Egg' is a meaningful morpheme)
There are 2 broad classifications of morphemes
1. Stems (root words)
2. Affixes
There are 4 types of affixes
i. prefix : morphemes which appear before the stem.
ii. suffix : morphemes which appear after the stem/at the end of the stem.
iii. infix : morphemes which appear inside/in the middle of the stem.
iv. circumfix : morphemes which attach to both ends of the stem.
There are 3 main processes of word formation
1. Inflection
2. Derivation
3. Compounding
1. Inflection :
A root word is combined with a grammatical morpheme to yield a word of the same class
as the original form.
2. Derivation :
Combines a word stem with a grammatical morpheme to yield a word belonging to a
different class.
3. Compounding :
It is the process of merging 2 (or) more words to form a new word.
 A morphological parser uses the following information sources
o Lexicon
o Morphotactics
o Orthographic rules
o Lexicon : A lexicon lists stems & affixes together with basic information about them.
o Morphotactics : Deals with the ordering of morphemes. It describes the way morphemes
are arranged/attach to each other.
o Orthographic rules : These are spelling rules that specify the changes that occur when 2
given morphemes combine.
E.g., y → ier spelling rule changes easy → easier but not easy → easyer.
Limitations :
 It puts heavy demand on memory.
 An exhaustive lexicon fails to show the relationship between different roots having
similar word forms.
 For morphologically complex languages like Turkish, the possible word forms may be
theoretically infinite.
Stemmers have been especially used in information retrieval. Stemming Algorithms have been
developed by
1. Lovins(1968)
2. Porter (1980)
Stemming algorithms work in 2 steps
1. Suffix removal
2. Recoding

1. Suffix removal :
This step removes predefined endings from words.
2. Recoding :
This step adds predefined endings to the output of the first step.
e.g., ational → ate changes rotational → rotate
It is difficult to use stemming with morphologically rich languages.
Even in English stemmers are not perfect.
Another problem with the Porter stemmer is that it handles only suffixes.
A more efficient two-level morphological model, first proposed by Koskenniemi (1983), can be
used for morphologically complex languages.
The two-level morphological model consists of
1. Lexical level
2. Surface level
Step 1: the surface form is split into its possible morphemes, giving an intermediate form.
Step 2: the intermediate form is mapped to the lexical form, i.e., the stem and its morphological features.

1. Lexical level : Represents the concatenation of the word's constituent morphemes.

2. Surface level : Represents the actual spelling of the word.
Syntactic Analysis :
1) Introduction
2) Context-free Grammar
3) Construction/Constituency
4) Parsing
Introduction :
 Introduced by Chomsky in the year 1957.
Context-Free Grammar :
 CFG is also known as phrase-structure grammar.
 A CFG can be used to generate a sentence/to assign a structure to a given sentence.
Constituency/Construction :
It tells about the rules for constructing the grammar.
It is of 2 types
1) Phrase Level Construction
2) Sentence Level Construction
1) Phrase Level Construction :
The fundamental idea of syntax analysis is that words group together to form
constituents, each of which acts as a single unit. These constituents are identified by their
ability to occur in similar contexts.
e.g., Heena reads a book
A test for whether a group of words forms a phrase is to see if it can be substituted
with some other group of words without changing the meaning. Hence the above sentence can be
constructed in different ways without changing its meaning, as
 She reads a book.
 Heena reads a comic book.
 That girl reads a book.
 A book was read by Heena.
There are 5 types of phrases
1) Noun Phrase
2) Verb Phrase
3) Adjective Phrase
4) Adverb Phrase
5) Prepositional Phrase
1) Noun Phrase :
A Noun Phrase is a phrase whose head is a Noun/Pronoun, optionally accompanied
by a set of modifiers. The modifiers of a Noun Phrase can be determiners/Adjective
phrases.
Phrase structure rules related to Noun
NP → Pronoun
NP → Det Noun
NP → Noun
NP → Adj Noun
NP → Det Adj Noun
E.g., They (Pronoun)
The Foggy Morning (Adj Noun)
Chilled Water (Adj Noun)
A Beautiful Lake in Kashmir (Det Adj Noun Prep Noun)
Cold Banana Shake (Adj Noun)
 A Noun phrase can act as a Subject, an Object or a Predicate.
e.g., The foggy damped weather disturbed the match (Subject)
I would like a nice cold Banana shake (Object)
Kula botanical garden is a beautiful location (Predicate)
2) Verb Phrase :
Analogous to the Noun phrase is the verb Phrase which is headed by a verb. A Verb
phrase organizes various elements of the sentence that depend syntactically on the verb.
e.g., Eswar slept (verb)
The Boy kicked the ball (NP VP NP)
The boy gave the girl a book (NP VP NP Det NP)
Kushi slept in the garden ( NP VP Prep NP )
3) Adjective Phrase :
The head of an Adjective phrase is an Adjective. It consists of an Adjective, which may
be preceded by an adverb and followed by a prepositional phrase (PP).
E.g., A Beautiful Lake in Kashmir (det Adj Noun Prep N)
Venkat is clever (NP Adj)
The Train is very late (NP PP Adj Prep)
My Sister is fond of Animals (NP Adj PP NP)
4) Adverb Phrase :
An Adverb phrase consists of an adverb, possibly preceded by a degree adverb.
e.g., Time passes quickly (NP VP Adj Adv)
He speaks loudly (NP VP Adv)
5) Prepositional Phrase :
Prepositional phrases are headed by prepositions. They consist of a preposition,
possibly followed by some other constituents usually a Noun phrase.
e.g., Fox in the forest (NP PP NP)
Suresh in the class (NP PP NP)
Pardhu sits on the bench (NP VP PP NP)
2) Sentence Level Construction :
A Sentence can have varying structure.
The 4 commonly known structures are
1) Declarative Structure
2) Imperative Structure
3) Yes-No Question Structure (Polar form)
4) Wh Question structure
1) Declarative Structure :
A Simple statement used to provide information about something or a fact.
e.g., The sky is blue
It is sunny
2) Imperative Structure :
It is a sentence that expresses a direct command, request, invitation or instruction.
e.g., Sit on the bench
Look at the door
3) Yes-No Question Structure :
An Interrogative construction (such as ‘Are you ready’) that expects an answer of
either ‘yes’ or ‘no’.
e.g., Is this the NLP class ?
Are you a student ?
4) Wh Question Structure :
A ‘wh’ Question is used for seeking content, information relating to persons, things,
facts, time, place, reason, manner etc.
e.g., What is your name ?
Who is she ?
Co-ordination :
Refers to conjoining phrases with conjunctions like and, or, but, also, in spite of, …
A co-ordinate noun phrase can consist of 2 other noun phrases separated by a conjunction.
e.g., I ate an apple and a banana.
Rishi is smart and Intelligent.
I’m reading the book also I’m watching the movie.
 Conjunction rules for NP, VP and S (Sentence) can be built as
NP → NP and NP
VP → VP and VP
S → S and S
Agreement : Most verbs use 2 different forms in the present tense: one form for the 3rd person
singular subject and the other for all other kinds of subjects. The 3rd person singular (3sg) form
ends with 's', whereas the non-3sg form does not.
 Agreement describes how the subject NP affects the form of the verb.
e.g., Does Shyam dance
Do they write
Parsing :
The task of using the rewrite rules of a grammar to either generate a particular
sequence of words or reconstruct its derivation is known as 'parsing'. Constructing a phrase
structure from a sentence is called parsing.
 A syntactic parser is thus responsible for recognizing a sentence and assigning a sentence’s
syntactic structure to it.
 There exist 2 types of parsing techniques:
1. Top – Down parsing
2. Bottom – Up parsing
1. Top – down parsing :
Top-down parsing starts its search from the root node ( S ) and works downwards
towards the leaves.
A successful parse corresponds to a tree which matches exactly with the words in
the input sentence.
e.g., Input : Paint the door
Grammar :
S → NP VP
S → VP
NP → Det Nominal
NP → Noun
NP → Det Noun PP
Nominal → Noun
VP → Verb NP
PP → Preposition NP
Det → this | that | the
Verb → paint
Noun → door | paint
Preposition → from | to
Level 1 starts with S; Level 2 tries the alternatives S → NP VP and S → VP; Level 3 expands
NP (Det Nominal, Noun, Det Noun PP) and VP (Verb NP), and the search continues downwards
until a tree is found whose leaves match the words of the input.
The Correct parse tree is
S

VP

Verb NP

Paint Det Nominal

the Noun

door
2. Bottom – Up parsing :
A Bottom-up parser starts with the words in the input sentence & attempts to
construct a parse tree in an upward direction towards the root.
At each step the parser looks for rules in the grammar whose right-hand side
matches some portion of the parse tree constructed so far, and reduces it
using the left-hand side of the production.
e.g., Input : Paint the door
The given string has 2 possibilities for parse trees, of which we have to choose
the correct parse tree based on the given grammar/production rules.
Level 1 : the words paint, the, door.
Level 2 : assign parts of speech: paint can be a Noun (or) a Verb, the is a Det, door is a
Noun, giving Noun Det Noun (or) Verb Det Noun.
Level 3 : the Noun readings are reduced to Nominal.
Level 4 : Det Nominal is reduced to NP; only the reading in which paint is a Verb can then
be reduced further (VP → Verb NP, S → VP).

The Correct parse tree is


S

VP

Verb NP

Paint Det Nominal

the Noun

door

Ambiguities in Syntactic Analysis


o Structural Ambiguity
o Attachment Ambiguity
o Co-ordination Ambiguity
o Local Ambiguity
Structural Ambiguity:
Which occurs when a grammar assigns more than one parse to a sentence.
Attachment Ambiguity :
If a constituent fits more than one position in a parse tree.
Co-ordination Ambiguity :
Occurs when it is not clear which phrases are being combined with a conjunction like
and, or etc.
Local Ambiguity :
Occurs when certain parts of a sentence are ambiguous.
To resolve ambiguities, we use 2 dynamic algorithms
1. Earley Parsing
2. CYK
1. Earley Parsing :
The Earley parser implements an efficient parallel top-down search using Dynamic
programming.
 This algorithm eliminates the repetitive parse of a constituent which arises from
backtracking and successfully reduces the exponential time problem to polynomial
time.
 The Earley Parser can handle recursive rules without getting into an infinite loop.
Algorithm:
Input: Sentence and the Grammar
Output: Chart
chart[0] ← { γ → • S , [0,0] }   (dummy start state)
n ← length(sentence)   // number of words in the sentence

for i = 0 to n do
for each state in chart[i] do
if( incomplete(state) and next category is not a part of speech ) then
predictor(state)
else if (incomplete(state) and next category is a part of speech )
scanner(state)
else
completer(state)
end-if
end-if
end-for
end-for
return
procedure predictor ( A → α • B β , [i,j] )
    for each rule ( B → γ ) in G do
        insert the state B → • γ , [j,j] into chart[j]
end
procedure scanner ( A → α • B β , [i,j] )
    if B is one of the parts of speech associated with word[j] then
        insert the state B → word[j] • , [j,j+1] into chart[j+1]
end
procedure completer ( B → γ • , [j,k] )
    for each A → α • B β , [i,j] in chart[j] do
        insert the state A → α B • β , [i,k] into chart[k]
end
Example:
Input: paint the door
Grammar:
S → NP VP
S → VP
NP → Det Nominal
NP → Noun
NP → Det Noun PP
Nominal → Noun
Nominal → Noun Nominal
VP → Verb NP
VP → Verb
PP → Preposition NP
Det → this|that|a|the
Verb → sleeps|sings|open|saw|paint
Preposition → from|with|to|on
Noun → door|table

The chart for this input (partial, showing the recoverable states) is:
chart[0] : Top → • S [0,0], S → • NP VP [0,0], S → • VP [0,0], VP → • Verb NP [0,0], VP → • Verb [0,0]
chart[1] : Verb → paint • [0,1], VP → Verb • [0,1]
chart[2] : Det → the • [1,2], NP → Det • Nominal [1,2], Nominal → • Noun [2,2], Nominal → • Noun Nominal [2,2]
chart[3] : Noun → door • [2,3], Nominal → Noun • [2,3]
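NLTK ships chart parsers that implement this kind of search. A hedged sketch assuming nltk.parse.EarleyChartParser is available in the installed NLTK version, using the example grammar above:

import nltk
from nltk.parse import EarleyChartParser   # Earley-style chart parser in NLTK

grammar = nltk.CFG.fromstring("""
    S -> NP VP | VP
    NP -> Det Nominal | Noun | Det Noun PP
    Nominal -> Noun | Noun Nominal
    VP -> Verb NP | Verb
    PP -> Preposition NP
    Det -> 'this' | 'that' | 'a' | 'the'
    Verb -> 'sleeps' | 'sings' | 'open' | 'saw' | 'paint'
    Preposition -> 'from' | 'with' | 'to' | 'on'
    Noun -> 'door' | 'table'
""")

parser = EarleyChartParser(grammar, trace=1)   # trace=1 prints edges as they are added to the chart
for tree in parser.parse(['paint', 'the', 'door']):
    print(tree)
    # (S (VP (Verb paint) (NP (Det the) (Nominal (Noun door)))))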
2. CYK(Cocke Younger Kasami) :
CYK is a bottom-up parser algorithm which is implemented in triangular format.
CYK stands for Cocke Younger Kasami.
To apply CYK, the grammar must be in CNF (Chomsky Normal Form).
Algorithm:
Input: Sentence and the Grammar
Output: Chart
Let w = w1w2...wi...wj...wn and wij = wi...wi+j-1
for i:=1 to n do
for all rules A →Wi do
chart[i,1] = {A}
for j = 2 to n do
for i = 1 to n-j+1 do
begin
chart[i,j] = Ø
for k = 1 to j-1 do
chart[i,j] := chart[i,j] ⋃ { A | A → BC is a production
and B ∈ chart[i,k] and C ∈ chart[i+k,j-k] }
end
if s ∈ chart[1,n] then accept else reject

Example :
Input : The girl wrote an essay
Grammar :
S → NP VP
VP → Verb NP
NP → Det Noun
Det → an | the
Verb → wrote
Noun → girl
Noun → essay
Chart (rows = span length j, columns = start position i; Ø marks empty cells):
length 1 : Det(The)  Noun(girl)  Verb(wrote)  Det(an)  Noun(essay)
length 2 : NP(The girl)  Ø  Ø  NP(an essay)
length 3 : Ø  Ø  VP(wrote an essay)
length 4 : Ø  Ø
length 5 : S(The girl wrote an essay)
Since S ∈ chart[1,5], the sentence is accepted.
Rules :
 A CFG is in CNF, if all the rules are of only 2 forms
A → BC (or) A → Wi
 Each entry in the table is based on previous entries. The basic CYK algorithm is
also a chart-based algorithm.
 A non-terminal A is stored in the [i,j] entry iff A ⇒* wi wi+1 … wi+j-1.
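A minimal CYK recognizer following the pseudocode above, written as a sketch (not an optimized implementation) with the CNF grammar from the example hard-coded as dictionaries:

# Unary rules A -> w_i and binary rules A -> B C from the example grammar.
unary_rules = {
    'the': {'Det'}, 'an': {'Det'},
    'wrote': {'Verb'},
    'girl': {'Noun'}, 'essay': {'Noun'},
}
binary_rules = {
    ('NP', 'VP'): {'S'},
    ('Verb', 'NP'): {'VP'},
    ('Det', 'Noun'): {'NP'},
}

def cyk(words):
    n = len(words)
    # chart[i][j] holds the non-terminals deriving the span of length j starting at word i
    chart = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, w in enumerate(words):
        chart[i][1] = set(unary_rules.get(w, set()))
    for j in range(2, n + 1):                  # span length
        for i in range(0, n - j + 1):          # span start
            for k in range(1, j):              # split point
                for B in chart[i][k]:
                    for C in chart[i + k][j - k]:
                        chart[i][j] |= binary_rules.get((B, C), set())
    return 'S' in chart[0][n]

print(cyk("the girl wrote an essay".split()))   # True: the sentence is accepted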
Discourse Integration :
Discourse Integration is closely related to pragmatics. It is considered as the larger context
for every smaller part of Natural Language structure.
 Discourse analysis deals with how the immediately preceding sentence can affect the
meaning & interpretation of the next sentence.
Hence context can be analyzed in a bigger context such as paragraph level, document
level.
(or)
The meaning of any sentence depends on the meaning of the sentence just before it. It also
brings about the meaning of immediately succeeding sentence.
e.g., Jai had an NLP text book. I want it.
Key aspects of context :
 Situational context :
What people know about what they can see around them.
 Background knowledge context :
What people know about each other & the world.
 Co-textual context :
What people know about what they have been saying.
Pragmatic Analysis :
It deals with the overall communicative & social context & its effect on interpretation. In
this analysis the main focus is on what was actually meant rather than on the literal words that were said.
Pragmatic analysis helps users discover this intended effect by applying a set of rules that
characterize cooperative dialogues.
e.g., Close the window (It should be interpreted as a request instead of an order)
I heart you
If you eat all of the food, it will make you bigger.
UNIT – 3 : PRE-PROCESSING
1. Handling corpus-raw
1.1. Handling raw-text
1.2. Sentence Tokenization
1.3. Lower-Case Conversion
1.4. Stemming
1.5. Lemmatization
1.6. Stop words removal
2. Handling corpus-raw sentences
2.1. Word Tokenizer
2.2. Word Lemmatization
3. Basic preprocessing
3.1. Basic level Regular Expression
3.2. Advanced Regular Expression
4. Practical and customized preprocessing
1) Handling the corpus-raw :

Getting the raw data (contained in a paragraph) → load the data (.txt file) into the system → run the sentence tokenizer.

Fig. Process of corpus-raw


Getting raw data :
There are 3 sources where we can get the raw text data
a) Raw text file.
b) Define raw text data inside a script in the format of a local variable.
c) Use any of the available corpus from nltk
There are 3 functions to handle the corpus raw text data
i) fileread()
ii) localtextvalue()
iii) readcorpus()
Processing Raw Text :
import nltk, re, pprint
from nltk import word_tokenize
# nltk.download('gutenberg')
from nltk.corpus import gutenberg as cg

def fileread():
    file_contents = open('data.txt', 'r').read()
    return file_contents

def localtextvalue():
    text = """one paragraph"""
    return text

def readcorpus():
    raw_content_cd = cg.raw('burgess-busterbrown.txt')
    return raw_content_cd

if __name__ == "__main__":
    print(" ")
    print("----output from raw text file----")
    print(" ")
    filecontentdetails = fileread()
    print(filecontentdetails)
    print(" ")
    print("----output from assigned variable----")
    print(" ")
    localvariabledata = localtextvalue()
    print(localvariabledata)
    print(" ")
    print("----output corpus data----")
    print(" ")
    fromcorpusdata = readcorpus()
    print(fromcorpusdata)
Output :

----output from raw text file----

This is a sample text file which is using for reading the data

----output from assigned variable----

one paragraph

----output corpus data----

[The Adventures of Buster Bear by Thornton W. Burgess 1920]

BUSTER BEAR GOES FISHING

Buster Bear yawned as he lay on his comfortable bed of leaves and


watched the first early morning sun

Lower-Case Conversion:
Converting all data to lower case helps in pre-processing and in later stages of an NLP
application, for example when we are doing parsing.
Code :
def wordlowercase():
    text = "I am student"
    print(text.lower())

wordlowercase()
Output:
i am student
Sentence Tokenization:
Sentence tokenization is the process of identifying sentence boundaries, i.e., the starting
& ending points of sentences.
The following open-source tools are available for performing sentence tokenization
1. OpenNLP
2. Stanford CoreNLP
3. GATE
4. NLTK
Code:
from nltk import sent_tokenize as st
from nltk.corpus import gutenberg as cg

def fileread():
    file_contents = open('data.txt', 'r').read()
    return file_contents

def localtextvalue():
    text = """one paragraph"""
    return text

def readcorpus():
    raw_content_cd = cg.raw('burgess-busterbrown.txt')
    return raw_content_cd

if __name__ == "__main__":
    print(" ")
    print("----output from raw text file----")
    print(" ")
    filecontentdetails = fileread()
    print(filecontentdetails)
    print(" ")
    print("----sentence tokenization of raw text----")
    print(" ")
    st_list_rawfile = st(filecontentdetails)
    print(st_list_rawfile)
    print(len(st_list_rawfile))
    print(" ")
    print("----output from assigned variable----")
    print(" ")
    localvariabledata = localtextvalue()
    print(localvariabledata)
    print(" ")
    print("----sentence tokenization of assigned variable----")
    print(" ")
    st_list_local = st(localvariabledata)
    print(st_list_local)
    print(len(st_list_local))
    print(" ")
    print("----output corpus data----")
    print(" ")
    fromcorpusdata = readcorpus()
    print(fromcorpusdata)
    print(" ")
    print("----sentence tokenization of corpus data----")
    print(" ")
    st_list_corpus = st(fromcorpusdata)
    print(st_list_corpus)
    print(len(st_list_corpus))
Output :

----output from raw text file----

This is a sample text file which is using for reading the data

----sentence tokenization of raw text----

['This is a sample text file which is using for reading the data']
1

----output from assigned variable----

one paragraph

----sentence tokenization of assigned variable----

['one paragraph']
1

----output corpus data----

[The Adventures of Buster Bear by Thornton W. Burg

----sentence tokenization of corpus data----

['[The Adventures of Buster Bear by Thornton W. Burg']

Challenges of Sentence Tokenization:


There are 4 challenges
1) If there is a lowercase letter after a dot (.), the tokenizer should normally not split after
the dot (e.g., abbreviations).
2) Sometimes a new sentence does start with a lowercase letter right after a dot; in that case
the text should be split after the dot.
e.g.: There is an apple.an apple is good for health
can be split into 2 sentences as
There is an apple
An apple is good for health
3) If there is an initial in the sentence, then the sentence should not be split after the initial.
4) Grammar-correction software customizes rules for the identification of sentences.
Stemming :
Removing (or) replacing suffixes to obtain the root form of a word.
Challenges of Stemming:
 It works best for the English language. Other languages have much lower accuracy
when compared with English.
 Not all words of other languages are recognized accurately.
Tools used for Stemming:
1. POS (Part Of Speech) tagger
2. NER (Named Entity Recognition)
Algorithms for Stemming:
1. Porter Stemmer
2. Lovins
Code:
from nltk.stem import PorterStemmer

text = open('rawtextcorpus.txt', 'r').read()[0:200]

def stemmer_porter():
    port = PorterStemmer()
    print("Stemmer")
    return " ".join([port.stem(i) for i in text.split()])

if __name__ == "__main__":
    print(stemmer_porter())

Output:
Stemmer
natur languag process (nlp) is a branch of artifici intellig that help comp
ut understand, interpret and manipul human language. nlp draw from mani dis
ciplines, includ comput s
Lemmatization :
Extracting a meaningful word (base form) from the given word as per the context.
Challenges of Lemmatization:
 It works best for the English language. Other languages have much lower accuracy
when compared with English.
 Not all words of other languages are recognized accurately.
Code:
from nltk.stem import WordNetLemmatizer

# assumed: the same raw text file/slice used in the stemming example above
text = open('rawtextcorpus.txt', 'r').read()[0:200]

def lemmatizer():
    word_lemma = WordNetLemmatizer()
    print("raw-text")
    print()
    print(text)
    print("Verb lemma")
    print(" ".join([word_lemma.lemmatize(i, pos="v") for i in text.split()]))
    print("Noun lemma")
    print(" ".join([word_lemma.lemmatize(i, pos="n") for i in text.split()]))
    print("Adjective lemma")
    print(" ".join([word_lemma.lemmatize(i, pos="a") for i in text.split()]))
    print("Satellite adjectives lemma")
    print(" ".join([word_lemma.lemmatize(i, pos="s") for i in text.split()]))
    print("Adverb lemma")
    print(" ".join([word_lemma.lemmatize(i, pos="r") for i in text.split()]))

if __name__ == "__main__":
    lemmatizer()
Output:
raw-text

Natural language processing (NLP) is a branch of artificial intelligence that helps


computers understand, interpret and manipulate human language. NLP draws from many
disciplines, including computer
Verb lemma
Natural language process (NLP) be a branch of artificial intelligence that help com
puters understand, interpret and manipulate human language. NLP draw from many disc
iplines, include computer
Noun lemma
Natural language processing (NLP) is a branch of artificial intelligence that help
computer understand, interpret and manipulate human language. NLP draw from many di
sciplines, including computer

Adjective lemma
Natural language processing (NLP) is a branch of artificial intelligence that helps
computers understand, interpret and manipulate human language. NLP draws from many
disciplines, including computer

Satellite adjectives lemma


Natural language processing (NLP) is a branch of artificial intelligence that helps
computers understand, interpret and manipulate human language. NLP draws from many
disciplines, including computer

Adverb lemma
Natural language processing (NLP) is a branch of artificial intelligence that helps
computers understand, interpret and manipulate human language. NLP draws from many
disciplines, including computer

Stop words removal :


Stop words are common words in a language that carry little meaning on their own and are
usually removed (ignored) during NLP pre-processing.
e.g., ‘a’, ‘an’, ‘the’, ‘of’….
Code:
from nltk.corpus import stopwords

def stopwordlist():
    stopwordlist = stopwords.words('english')
    print("")

def stopwordremove():
    stoplist = set(stopwords.words('english'))
    sentence = open('rawtextcorpus.txt', 'r').read()
    print("--------raw text---------")
    print(sentence)
    print("--------Stop word removal from raw text---------")
    print(" ".join([i for i in sentence.lower().split() if i not in stoplist]), sep="\n")

if __name__ == "__main__":
    stopwordlist()
    stopwordremove()
Output:
--------raw text---------
Natural language processing (NLP) is a branch of artificial intelligence that helps
computers understand, interpret and manipulate human language. NLP draws from many
disciplines, including computer science and computational linguistics, in its pursu
it to fill the gap between human communication and

--------Stop word removal from raw text---------


natural language processing (nlp) branch artificial intelligence helps computers un
derstand, interpret manipulate human language. nlp draws many disciplines, includin
g computer science computational linguistics, pursuit fill gap human communication

Challenges of Stop words removal:


 It works best for the English language. Other languages have much lower accuracy
when compared with English.
2) Handling corpus-raw sentences :
a) Word Tokenizer :
Splits the data into tokens (sentences to words)
e.g., We are CSE students ⇒ We | are |CSE | students
Challenges of Word Tokenizer:
 It works best for the English language. Other languages (Telugu, Hindi, Urdu,
Arabic, Hebrew) have much lower accuracy when compared with English.
 Contracted forms such as don’t/can’t/won’t… are difficult to tokenize.
Code:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

def wordtokenization():
    content = """She really wants to buy a car. She told me angrily. It is better
    for you. Man is walking. We are meeting tomorrow"""
    print(word_tokenize(content))

if __name__ == "__main__":
    wordtokenization()
Output:
['She', 'really', 'wants', 'to', 'buy', 'a', 'car', '.', 'She', 'told',
'me', 'angrily', '.', 'It', 'is', 'better', 'for', 'you', '.', 'Man', '
is', 'walking', '.', 'We', 'are', 'meeting', 'tomorrow']

b) Word Lemmatization :
It is the process of deleting (or) modifying the affixes of a word so that the base form
appropriate to the context is obtained.
e.g., They are going to picnic.
She really loves to buy cars.
We are running for taking a meal.
Challenges of Word Lemmatization:
 It works best for the English language. Other languages (Telugu, Hindi, Urdu,
Arabic, Hebrew) have much lower accuracy when compared with English.
Code:
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer

def wordtokenization():
    content = """She really wants to buy a car. She told me angrily. It is better for
    you. Man is walking. We are meeting tomorrow"""
    print(word_tokenize(content))

def wordlemmatization():
    wordlemma = WordNetLemmatizer()
    print(wordlemma.lemmatize('cars'))
    print(wordlemma.lemmatize('walking', pos='v'))
    print(wordlemma.lemmatize('meeting', pos='v'))
    print(wordlemma.lemmatize('meeting', pos='n'))
    print(wordlemma.lemmatize('better', pos='a'))

if __name__ == "__main__":
    print("---Word Tokenization---")
    wordtokenization()
    print("---Word Lemmatization---")
    wordlemmatization()
Output:
---Word Tokenization---
['She', 'really', 'wants', 'to', 'buy', 'cars', '.', 'She', 'told', 'me', '
angrily', '.', 'It', 'is', 'better', 'for', 'you', '.', 'Man', 'is', 'walki
ng', '.', 'We', 'are', 'meeting', 'tomorrow']
---Word Lemmatization---
car
walk
meet
meeting
good
3. Pre-processing :
Basic Level Regular Expression :
Regular Expression is a powerful tool when we want to do customized pre-processing
(or) when we have noisy data.
Basic flags : The basic flags are I, L, M, S, U and X
 re.I : ignores case. Long form: re.IGNORECASE
 re.L : locale-dependent matching. Long form: re.LOCALE
 re.M : useful if we want to find patterns across multiple lines. Long form: re.MULTILINE
 re.S : makes dot (.) match any character, including a newline. Long form: re.DOTALL
 re.U : makes the pattern work with Unicode characters. Long form: re.UNICODE
 re.X : allows writing regex in a more readable format. Long form: re.VERBOSE
 re.match() : This checks for a match only at the beginning of the string. If it finds the
pattern at the beginning of the input string, it returns the matched pattern.
Otherwise, it returns None.
 re.search() : This checks for a match anywhere in the string. It finds an occurrence of
the pattern anywhere in the given input string (or) data.
Code:
import re

def searchmatch():
    line = "I love animals"
    matchobj = re.match(r'animals', line, re.M | re.I)
    if matchobj:
        print("match:", matchobj.group())
    else:
        print("No Match!")
    searchobj = re.search(r'animals', line, re.M | re.I)
    if searchobj:
        print("search:", searchobj.group())
    else:
        print("Nothing found")

if __name__ == "__main__":
    searchmatch()
Output:
No Match!
search: animals

 For any digit, the regular expression (regex) is \d
 For any non-digit, the regex is \D
 To find a single occurrence of the characters a (or) b, the regex is [ab]
 To find any character except a & b, the regex is [^ab]
 To find the range of characters from a-z, the regex is [a-z]
 To find all the characters a-z as well as A-Z, the regex is [a-zA-Z]
 For any single character, the regex is . (dot)
 For any whitespace character, the regex is \s
 For any non-whitespace character, the regex is \S
 For any word character, the regex is \w
 For any non-word character, the regex is \W
 To match either a (or) b, the regex is (a|b)
 To find 0 (or) 1 occurrence of 'a', the regex is a?
{? matches 0 (or) 1 occurrence of the preceding element}
(some short examples follow below)
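A few hedged examples illustrating the patterns above with re.findall; the sample text is made up for illustration:

import re

text = "Order 24 apples and 3 bananas by 5 pm."

print(re.findall(r'\d', text))         # ['2', '4', '3', '5']  -- single digits
print(re.findall(r'\d+', text))        # ['24', '3', '5']      -- runs of one or more digits
print(re.findall(r'[ab]', text))       # single occurrences of the characters a or b
print(re.findall(r'[a-zA-Z]+', text))  # runs of letters: ['Order', 'apples', ...]
print(re.findall(r'\s', text))         # whitespace characters
print(re.findall(r'apples?', text))    # matches 'apple' or 'apples' (? = 0 or 1 of the preceding char)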
Advanced level Regular expressions :
The ‘lookahead’ and ‘lookbehind’ are used to find out the sub-string patterns from
your data.
These are of 4 types
1. Positive lookahead
2. Negative lookahead
3. Positive lookbehind
4. Negative lookbehind
1. Positive lookahead : (?= pattern)
Positive lookahead matches a sub-string only if it is followed by the defined pattern
(i.e., sub-string + pattern).
2. Negative lookahead : (?! pattern)
Negative lookahead matches a sub-string only if it is definitely not followed by the
defined pattern.
3. Positive lookbehind : (?<= pattern)
Positive lookbehind matches a sub-string only if it is preceded by the defined pattern
(i.e., pattern + sub-string).
4. Negative lookbehind : (?<! pattern)
Negative lookbehind matches a sub-string only if it is definitely not preceded by the
defined pattern.
Code:
import re

def advRegEx():
    text = "I play on playground. It is the best ground"

    poslookaheadobjpattern = re.findall(r'play(?=ground)', text, re.M | re.I)
    print("Positive lookahead : " + str(poslookaheadobjpattern))
    poslookaheadobj = re.search(r'play(?=ground)', text, re.M | re.I)
    print("positive lookahead character Index : " + str(poslookaheadobj.span()))

    poslookbehindobjpattern = re.findall(r'(?<=play)ground', text, re.M | re.I)
    print("Positive LookBehind : " + str(poslookbehindobjpattern))
    poslookbehindobj = re.search(r'(?<=play)ground', text, re.M | re.I)
    print("Positive LookBehind Character Index : " + str(poslookbehindobj.span()))

    neglookaheadobjpattern = re.findall(r'play(?!ground)', text, re.M | re.I)
    print("Negative lookahead : " + str(neglookaheadobjpattern))
    neglookaheadobj = re.search(r'play(?!ground)', text, re.M | re.I)
    print("Negative lookahead character Index : " + str(neglookaheadobj.span()))

    neglookbehindobjpattern = re.findall(r'(?<!play)ground', text, re.M | re.I)
    print("Negative LookBehind : " + str(neglookbehindobjpattern))
    neglookbehindobj = re.search(r'(?<!play)ground', text, re.M | re.I)
    print("Negative Lookbehind Index : " + str(neglookbehindobj.span()))

if __name__ == "__main__":
    print("----Advanced Regular expression----")
    advRegEx()
Output:
----Advanced Regular expression----
Positive lookahead : ['play']
positive lookahead character Index : (11, 15)
Positive LookBehind : ['ground']
Positive LookBehind Character Index : (15, 21)
Negative lookahead : ['play']
Negative lookahead character Index : (2, 6)
Negative LookBehind : ['ground']
Negative Lookbehind Index : (38, 44)

4. Practical & Customized Pre-processing :


Pre-processing : Processing of data(raw-data) before it is used.
Why pre-processing ?
To remove noisy data, handle missing values, remove duplicates and repeated text, and strip
special symbols, HTML tags, and so on.
Where do we use pre-processing ?
In any application that consumes a dataset, for purposes such as removing noisy data and
handling missing values, etc.
Tools used for pre-processing ?
Tools such as
 Word/sentence tokenizers
 Sentence/word lemmatizers
 Stop words removal
 Lowercase conversion
 Stemming
 Grammar checkers etc. are used.
Flow chart :
If the dataset contains HTML tags & repeated text:
Yes → pre-processing is required → remove HTML tags & repeated text
No → pre-processing is not required
Then apply, in order:
Sentence Tokenizer → Word Tokenizer → Word Lemmatization → Sentence Lemmatization →
Lower-case Conversion → Stop words removal → Stemming → …
Understanding Case Studies of pre-processing :
1. Grammarly Correction System (e.g., Customer Reviews)
2. Sentiment Analysis ( positive, negative, neutral )
3. Machine Translation (speech based, text based)
4. Spelling Correction
Operations used in Pre-processing :
1. Insertion
2. Deletion
3. Substitution
1. Insertion :
If we have any incorrect string, after inserting 1 (or) more characters we will get the
correct string (or) expected string.
e.g., ‘aple’ on insertion of ‘p’ becomes ‘apple’
‘puzle’ on insertion of ‘z’ becomes ‘puzzle’
2. Deletion :
If we have an incorrect string, which can be converted into a correct string after
deleting 1 (or) more characters of the string.
e.g., ‘carroot’ after deleting ‘o’ becomes ‘carrot’
‘bannana’ after deleting ‘n’ becomes ‘banana’
3. Substitution :
If we get correct string by substitutions/substituting 1 (or) more characters then it is
called a substitution.
e.g., ‘implemantation’ on substituting ‘a’ with ‘e’ it becomes ‘implementation’
‘corroption’ on substituting ‘o’ with ‘u’ it becomes ‘corruption’
Minimum Edit Distance Algorithm :
This algorithm converts one string 'x' into another string 'y'; we need to find the minimum
edit cost of converting string 'x' into string 'y'.
Algorithm :
Input : Two strings ('x' & 'y')
Output : The cheapest possible sequence of character edits for converting string 'x' into 'y',
whose cost equals the minimum edit distance between 'x' and 'y'.
i.e., cost of the cheapest edit sequence = minimum edit distance
Steps:
1. Set n to the length of P (the source string 'x').
Set m to the length of Q (the target string 'y').
2. If n=0, return m and exit.
If m=0, return n and exit.
3. Create a matrix of containing 0…m rows & 0…n columns.
4. Initialize the first row to 0…n.
Initialize the first column to 0...m.
5. Iterate each character of P ( i from 1 to n ).
Iterate each character of Q ( j from 1 to m ).
6. If P[i] equals Q[j] then cost is 0.
If P[i] doesn’t equals Q[j] then cost is 1.
Set the value at cell v[i,j] of the matrix equals to the minimum of all three values of the
following points:
7. The cell immediately previous plus 1 : v[i-1,j]+1.
8. The cell immediately to the left plus 1 : v[i,j-1]+1.
9. The cell diagonally previous & to the left plus the cost : v[i-1,j-1]+cost.
10. After the iteration in step 7 to step 9 has been completed, the distance is found in cell
v[n,m].
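A minimal Python sketch of the matrix construction described in the steps above (insert/delete cost 1, substitution cost 1); the spelling-correction code that follows uses a related, edits-based approach:

def min_edit_distance(p, q):
    """Minimum edit distance between strings p and q using the v matrix."""
    n, m = len(p), len(q)
    if n == 0:
        return m
    if m == 0:
        return n
    # v[i][j] is the distance between the first i characters of p and the first j of q.
    v = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        v[i][0] = i
    for j in range(m + 1):
        v[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if p[i - 1] == q[j - 1] else 1
            v[i][j] = min(v[i - 1][j] + 1,         # deletion
                          v[i][j - 1] + 1,         # insertion
                          v[i - 1][j - 1] + cost)  # substitution (or match)
    return v[n][m]

print(min_edit_distance('aple', 'apple'))                      # 1
print(min_edit_distance('implemantation', 'implementation'))   # 1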
Code :
import re
from collections import Counter

def words(text):
    return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('data.txt').read()))

def P(word, N=sum(WORDS.values())):  # probability of word
    return WORDS[word] / N

def correction(word):  # most probable spelling correction of the word
    return max(candidates(word), key=P)

def candidates(word):  # generate possible spelling corrections for word
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words):  # the subset of words that appear in the dictionary of WORDS
    return set(w for w in words if w in WORDS)

def edits1(word):  # all edits that are one edit away from word
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):  # all edits that are two edits away from word
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

if __name__ == "__main__":
    print(correction('aple'))
    print(correction('correcton'))
    print(correction('statament'))
    print(correction('tutpore'))

Output :
apple
correction
statement
tutor
UNIT – 4 : FEATURE ENGINEERING & NLP ALGORITHMS
1. Understanding of feature Engineering
1.1. Introduction
1.2. Purpose
1.3. Challenges
2. Basic features of NLP
i. Understanding the basics of parsing
ii. Understanding the concepts of parsing
iii. Developing a parser from the scratch
iv. Types of Grammars
3. Basic statistical feature of NLP, Advantages of feature Engineering
4. Challenges of feature Engineering
1. Understanding of feature Engineering :
Introduction :
Feature Engineering :
It is the process of generating (or) deriving features (attributes) from the given data.
These features are useful for developing NLP applications.
Features are required for applying machine learning algorithms.
Purpose: to apply machine learning algorithms (or) techniques to NLP applications, we must first
provide features for the data. Without good features it is also difficult to measure/calculate the
performance/accuracy of a machine learning algorithm.
Challenges of feature Engineering :
 After generating features, we need to decide which features should be selected before
applying machine learning techniques.
 Selecting good features is difficult & sometimes complex.
 During feature selection we need to eliminate some of the less important features, and
this elimination is also a critical part of feature engineering.
 Manual feature engineering is time consuming.
 Feature engineering requires domain expertise (or) at least basic knowledge about the
domain.
2. Basic features of NLP :
Understanding the basics of parsing :
Parsing :
The task of using the rewrite rules of a grammar to either generate a particular
sequence of words or reconstruct its derivation is known as 'parsing'. Constructing a phrase
structure from a sentence is called parsing.
 A syntactic parser is thus responsible for recognizing a sentence and assigning a sentence’s
syntactic structure to it.
 There exist 2 types of parsing techniques:
1. Top – Down parsing
2. Bottom – Up parsing
1. Top – Down parsing :
Top-down parsing starts its search from the root node ( S ) and works
downwards towards the leaves.
A successful parse corresponds to a tree which matches exactly with the words in
the input sentence.
2. Bottom – Up parsing :
A Bottom-up parser starts with the words in the input sentence & attempts to
construct a parse tree in an upward direction towards the root.
At each step the parser looks for rules in the grammar where the right-hand side
matches some of the portions in the parse tree constructed so far, and reduces it
using the left-hand side of the production.
We use 2 dynamic algorithms to implement parsing.
1. Earley Parsing
2. CYK
Types of Grammars :
There exist 2 types of grammars.
1. CFG (Context Free Grammar)
2. PCFG (Probability Context Free Grammar)
1. CFG (Context Free Grammar) :
Context Free Grammar G = ( T, C, N, S, L, R )
where T → terminals/lexical symbols
C → pre-terminals
N → non-terminals
S → start symbol (belongs to the non-terminals)
L → lexicon
R → grammar rules

Grammar :
S → NP VP
VP → V NP
VP → V NP PP
NP → NP NP
NP → NP PP
NP → N
PP → P NP
N → people
N → fish
N → tank
N → rods
V → people
V → fish
V → tank
P → with
e.g.,
1. People fish tank
Parse tree :
S

NP VP

N V NP

People fish N

Tank
2. People fish tank with rods
Parse tree :
S

NP VP

N V NP PP

People fish N P NP

Tank with rods

(or)

NP VP

N V NP

People fish NP PP

Tank P NP

With N

rods

2. PCFG (Probability Context Free Grammar) :


Probability Context Free Grammar G = ( T, N, S, R, P )
where T → Terminals/Lexical symbols
N → non-terminals
S → Start Symbol belongs to non-terminals
R → grammar
P → probability function, P : R → [0,1],
such that for every X ∈ N : Σ P(X → γ) = 1, where the sum is over all rules X → γ in R

Grammar :
S → NP VP (1.0)
VP → V NP (0.6)
VP → V NP PP (0.4)
NP → NP NP (0.1)
NP → NP PP (0.2)
NP → N (0.7)
PP → P NP (1.0)
N → people (0.5)
N → fish (0.2)
N → tank (0.2)
N → rods (0.1)
V → people (0.1)
V → fish (0.6)
V → tank (0.3)
P → with (1.0)
e.g., People fish tank with rods
Parse tree :
S (1.0)

NP (0.7) VP (0.4)

N (0.5) V (0.6) NP (0.7) PP (1.0)

People fish N (0.2) P (1.0) NP (0.7)

Tank with N (0.1)

Rods

P(t1) = 1 * 0.7 * 0.4 * 0.5 * 0.6 * 0.7 * 1 * 0.2 * 1 * 0.7 * 0.1


= 0.0008232
(or)

S (1.0)

NP (0.7) VP (0.6)

N (0.5) V (0.6) NP (0.2)

People fish NP (0.7) PP (1.0)

N (0.2) P (1.0) NP (0.7)

Tank with N (0.1)

Rods

P(t2) = 1 * 0.7 * 0.6 * 0.5 * 0.6 * 0.2 * 0.7 * 1 * 0.2 * 1 * 0.7 * 0.1
= 0.00024696
∴ P = P(t1) + P(t2)
= 0.0008232 + 0.00024696
= 0.00107016
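This PCFG can also be handed to NLTK directly; a hedged sketch using nltk.PCFG and the ViterbiParser, which returns the most probable parse of the sentence (note that each left-hand side's rule probabilities sum to 1, as required):

import nltk
from nltk.parse import ViterbiParser

pcfg = nltk.PCFG.fromstring("""
    S -> NP VP [1.0]
    VP -> V NP [0.6] | V NP PP [0.4]
    NP -> NP NP [0.1] | NP PP [0.2] | N [0.7]
    PP -> P NP [1.0]
    N -> 'people' [0.5] | 'fish' [0.2] | 'tank' [0.2] | 'rods' [0.1]
    V -> 'people' [0.1] | 'fish' [0.6] | 'tank' [0.3]
    P -> 'with' [1.0]
""")

parser = ViterbiParser(pcfg)
for tree in parser.parse(['people', 'fish', 'tank', 'with', 'rods']):
    print(tree)          # the most probable parse tree
    print(tree.prob())   # its probability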
3. Basic Statistical feature of NLP, Advantages of feature Engineering :
Advantages of feature Engineering :
 Better features carry a lot of meaning: even if we choose a less optimal machine learning
algorithm, we will get a good result. Good features provide flexibility in choosing an
algorithm; even if we choose a less complex model, we still get good accuracy.
 If we choose good features, then even simple machine learning algorithms do well.
 Better understanding leads to better accuracy. We should spend more time on feature
engineering to generate the right features for our dataset.
4. Challenges of feature Engineering :
 Converting text data into a numerical format in an effective way is quite challenging. For
this challenge, trial & error may help us.
 In the NLP domain, we can easily derive features that are categorical features (or)
basic NLP features.
 We have to convert these features into a numerical format.
There are a couple of techniques we can use, such as TF-IDF (Term Frequency-Inverse
Document Frequency), encoding, ranking, co-occurrence matrices, word embeddings,
word2vec, etc., to convert our text data into numerical data.
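A small hedged sketch of one such technique, TF-IDF, using scikit-learn's TfidfVectorizer on a few made-up documents (get_feature_names_out assumes a recent scikit-learn version):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "people fish tanks",
    "people fish with rods",
    "fish live in tanks",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)         # sparse matrix: one row per document

print(vectorizer.get_feature_names_out())  # the vocabulary (feature names)
print(X.toarray())                         # TF-IDF weights as a dense numeric array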
