NLP Notes
Finding the Structure of Words: Words and Their Components, Issues and
Challenges, Morphological Models
NLP
Natural Language Processing
NLP stands for Natural Language Processing, which is a field at the intersection
of Computer Science, Linguistics, and Artificial Intelligence.
It is the ability of a computer program to understand human language, referred
to as natural language.
It is a component of Artificial Intelligence.
It is a technology used by machines to understand, analyse, manipulate,
and interpret human languages.
Applications of NLP
Question Answering
Spam Detection
Sentiment Analysis
Machine Translation
Spelling correction
Speech Recognition
Chatbot
Information extraction
Components of NLP
o NLU (Natural Language Understanding)
o NLG (Natural Language Generation)
Lexical Ambiguity
Lexical Ambiguity exists when a single word within a sentence has two or more
possible meanings.
Example:
Syntactic Ambiguity
Syntactic Ambiguity exists in the presence of two or more possible meanings within
the sentence.
Example:
Referential Ambiguity
Referential Ambiguity exists when it is unclear which entity a pronoun refers to.
In such a sentence, you do not know who is hungry, Kiran or Sunita.
Phases of NLP
NLP Challenges
Elongated words
Shortcuts
Emojis
Mixed use of languages (code-mixing)
Ellipsis
LEXICAL ANALYSIS
It is the fundamental (first) stage of NLP.
It identifies and analyses the structure of words.
It is word-level processing.
It divides the whole text into paragraphs, sentences, and words.
It involves stemming and lemmatization.
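The stemming step mentioned above can be sketched with a toy suffix-stripping stemmer. The rule table below is an illustrative simplification for these notes, not the full Porter algorithm:

```python
# Toy suffix-stripping stemmer (illustrative rules, not Porter's algorithm).
# Each rule: (suffix to strip, replacement string).
RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def stem(word):
    """Apply the first matching suffix rule, keeping stems reasonably long."""
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

print([stem(w) for w in ["processing", "cats", "studies"]])
# → ['process', 'cat', 'studi']
```

Note that a stemmer may produce non-words such as "studi"; lemmatization instead maps word forms to dictionary lemmas ("studies" → "study"), which requires a lexicon.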
SYNTACTIC ANALYSIS
SEMANTIC ANALYSIS
DISCOURSE ANALYSIS
PRAGMATIC ANALYSIS
It studies how people communicate with each other and in which context they are
talking. It requires real-world knowledge.
Finding the Structure of Words
Words are the basic building blocks of a language. We have the following components of words:
Tokens
Lexemes
Morphemes
Typology
Tokens
Tokens are the smaller units created by dividing the text.
The process of identifying tokens in a given text is known as Tokenization.
Tokenization involves segmenting text into smaller units that are analysed individually.
The input is text and the output is tokens.
Types of Tokenization
Character Tokenization
Word Tokenization
Sentence Tokenization
Subword Tokenization
Number Tokenization
Character Tokenization
Input: "Today is Monday"
Output: ["T", "o", "d", "a", "y", "i", "s", "M", "o", "n", "d", "a", "y"]
Word Tokenization (whitespace-based, punctuation-based)
Sentence Tokenization
Input: "Tokenization is an important NLP task. It helps break down text into smaller units."
Output: ["Tokenization is an important NLP task.", "It helps break down text into smaller units."]
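The tokenization granularities above can be sketched with only the standard library. This is a rough sketch; real NLP toolkits handle many more cases (abbreviations, quotes, URLs):

```python
import re

text = "Tokenization is an important NLP task. It helps break down text into smaller units."

# character tokens (whitespace dropped, as in the example above)
chars = [c for c in "Today is Monday" if not c.isspace()]

# word tokens: runs of word characters, plus punctuation marks as separate tokens
words = re.findall(r"\w+|[^\w\s]", text)

# sentence tokens: split after ., !, or ? followed by whitespace
sentences = re.split(r"(?<=[.!?])\s+", text)

print(chars[:5])        # → ['T', 'o', 'd', 'a', 'y']
print(words[:4])        # → ['Tokenization', 'is', 'an', 'important']
print(len(sentences))   # → 2
```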
Number Tokenization
Morphological Process
Morphemes
LEXEMES
MORPHEMES
TYPOLOGY
Irregularity
Ambiguity
Productivity
Irregularity
If words or word forms follow regular patterns, that is regularity.
If words or word forms do not follow regular patterns, that is
irregularity (e.g., go → went).
Ambiguity
A word or word form can have more than one meaning
irrespective of context.
Word forms that look the same but whose meaning is not unique
complicate morphological processing.
Ambiguity can be
o Word sense ambiguity
Meaning depends on the context
o Parts of speech ambiguity
The same word form can serve as different parts of speech
o Structural ambiguity
Multiple valid syntactic structures
o Referential ambiguity
Unclear which person or noun a pronoun refers to
Productivity
Forming new words or word forms using productive rules, e.g.,
person names, location names, organization names.
Morphological Models
Morphological models are used to analyse the structure and formation of words
Dictionary Lookup
Finite state morphology
Unification based morphology
Functional morphology
Morphological Induction
Morphemes
Prefix
Infix
suffix
Dictionary Lookup
It includes retrieving the morphological information of a word form from a
precompiled dictionary (lexicon).
Example:
successful = success (stem) + ful (suffix)
unsuccessful = un (prefix) + success (stem) + ful (suffix)
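The prefix/stem/suffix segmentation above can be sketched as affix stripping against a lexicon. The tiny lexicon and affix lists below are illustrative assumptions for the sketch:

```python
# Minimal affix-stripping segmenter backed by a dictionary lookup
# (lexicon and affix lists are illustrative, not exhaustive).
LEXICON = {"success", "play", "hope"}
PREFIXES = ["un", "re"]
SUFFIXES = ["ful", "ing", "ed", "s"]

def segment(word):
    """Return (prefix, stem, suffix) if the stripped stem is in the lexicon."""
    for pre in [""] + PREFIXES:
        for suf in [""] + SUFFIXES:
            if word.startswith(pre) and word.endswith(suf):
                stem = word[len(pre):len(word) - len(suf)] if suf else word[len(pre):]
                if stem in LEXICON:
                    return pre, stem, suf
    return None

print(segment("unsuccessful"))  # → ('un', 'success', 'ful')
```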
STEM CHANGES
Irregular forms change letters inside the stem, e.g., mouse → mice
(m o u s e → m i c e), rather than adding a regular suffix as in dog → dogs.
Two-level (finite-state) morphology aligns a lexical tape with a surface tape,
using epsilon for symbols that are not realized on the surface:
Lexical tape: c a t +N +Pl
Surface tape: c a t s
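The lexical-to-surface mapping can be sketched in a few lines. The rule table and the irregular-form list below are illustrative assumptions, not a full two-level rule system:

```python
# Sketch of lexical-tape -> surface-tape realization.
# Regular rules delete +N and realize +Pl as "s"; irregular forms
# (stem changes) are stored whole, as a dictionary-lookup fallback.
RULES = {"+N": "", "+Pl": "s"}
IRREGULAR = {"mouse+N+Pl": "mice"}

def surface(lexical):
    """Realize a lexical-tape string as its surface form."""
    if lexical in IRREGULAR:
        return IRREGULAR[lexical]
    out = lexical
    for symbol, realization in RULES.items():
        out = out.replace(symbol, realization)
    return out

print(surface("cat+N+Pl"))    # → cats
print(surface("mouse+N+Pl"))  # → mice
```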
MORPHEME TYPES
Types of segmentation
Corpus
Documents/sentences
Word/tokens
Vocabulary
Applications:
1. POS tagging
2. Speech recognition
3. Named entity recognition
Complexity of approaches
Quality
Quantity
Computational complexity
Structural complexity
Space
Time
Training
Prediction
Confusion matrix
Precision
When it predicts yes, how often is it correct?
TP / Total Predicted Yes = TP / (TP + FP)
100/110 = 90.9%
Accuracy
How often classifier correct
(TN + TP) / (TN + FP + FN + TP)
(50 + 100) / (50 + 10 + 5 + 100)
150/165 = 90.9%
Misclassification Rate:
Overall, how often is it wrong?
(FN + FP) / (TN + FP + FN + TP)
(5 + 10) / (50 + 10 + 5 + 100)
15/165 = 9.1%
PRECISION = Total No. of Correct Positive Predictions / Total No. of
Positive Predictions = TP / (TP + FP)
RECALL = Total No. of Correct Positive Predictions / Total No. of Actual
Positive Instances = TP / (TP + FN)
ACCURACY
How often classifier correct
TN+TP/TN+FP+FN+TP
F1 SCORE = 2 * Precision * Recall / (Precision + Recall)
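The worked numbers above (TP=100, TN=50, FP=10, FN=5) can be checked directly:

```python
# Metrics from the confusion-matrix counts used in the notes.
TP, TN, FP, FN = 100, 50, 10, 5

precision = TP / (TP + FP)                   # 100/110
recall    = TP / (TP + FN)                   # 100/105
accuracy  = (TP + TN) / (TP + TN + FP + FN)  # 150/165
misclass  = (FP + FN) / (TP + TN + FP + FN)  # 15/165
f1        = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"accuracy={accuracy:.3f} misclassification={misclass:.3f} f1={f1:.3f}")
# precision ≈ 0.909, recall ≈ 0.952, accuracy ≈ 0.909, misclassification ≈ 0.091
```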
UNIT -II
Prerequisites: CFG (Context-Free Grammar)
Chart parser
RegEx parser
Shift reduce parser
Recursive parser
Parsing
CFG
G = (N, T, P, S)
A → α
α ∈ (N ∪ T)*
A, B ∈ N
S → NP VP
NP → {ART (ADJ) N, PRO, PN}
VP → V NP (PP) (ADV)
PP → PREP NP
NP → N
NP → D N
Brown, Switchboard (example corpora)
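The productions above can be tried out with a toy recursive-descent parser (one of the parser types listed). The grammar subset and lexicon below are illustrative assumptions, covering only S → NP VP, NP → D N | PRO | N, and VP → V NP | V:

```python
# Toy recursive-descent parser with backtracking for a small CFG
# (grammar subset and lexicon are illustrative, not from a treebank).
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["D", "N"], ["PRO"], ["N"]],
    "VP": [["V", "NP"], ["V"]],
}
LEXICON = {
    "the": "D", "a": "D",
    "dog": "N", "cat": "N",
    "she": "PRO",
    "saw": "V", "sleeps": "V",
}

def parse(symbol, tokens, pos):
    """Expand `symbol` at tokens[pos]; yield (tree, next_pos) per parse."""
    if symbol in GRAMMAR:                       # nonterminal: try each production
        for production in GRAMMAR[symbol]:
            for children, end in match_seq(production, tokens, pos):
                yield (symbol, children), end
    else:                                       # POS tag: match one word
        if pos < len(tokens) and LEXICON.get(tokens[pos]) == symbol:
            yield (symbol, tokens[pos]), pos + 1

def match_seq(symbols, tokens, pos):
    """Match a sequence of symbols left to right, with backtracking."""
    if not symbols:
        yield [], pos
        return
    for tree, mid in parse(symbols[0], tokens, pos):
        for rest, end in match_seq(symbols[1:], tokens, mid):
            yield [tree] + rest, end

def parse_sentence(sentence):
    tokens = sentence.lower().split()
    return [t for t, end in parse("S", tokens, 0) if end == len(tokens)]

trees = parse_sentence("the dog saw a cat")
print(trees[0])
# → ('S', [('NP', [('D', 'the'), ('N', 'dog')]),
#          ('VP', [('V', 'saw'), ('NP', [('D', 'a'), ('N', 'cat')])])])
```

A chart parser avoids this exponential backtracking by memoizing partial parses; the sketch here trades efficiency for brevity.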