NLP Notes
Finding the Structure of Words: Words and Their Components, Issues and
Challenges, Morphological Models
NLP
Natural Language Processing
NLP stands for Natural Language Processing, which is a field at the intersection
of Computer Science, Linguistics, and Artificial Intelligence.
It is the ability of a computer program to understand human language, referred
to as natural language.
It is a component of Artificial Intelligence.
It is a technology used by machines to understand, analyse, manipulate,
and interpret human languages.
Applications of NLP
Question Answering
Spam Detection
Sentiment Analysis
Machine Translation
Spelling correction
Speech Recognition
Chatbot
Information extraction
Components of NLP
o NLU (Natural Language Understanding)
o NLG (Natural Language Generation)
Lexical Ambiguity
Lexical Ambiguity exists when a single word within a sentence has two or more
possible meanings.
Example:
Syntactic Ambiguity
Syntactic Ambiguity exists in the presence of two or more possible meanings within
the sentence.
Example:
Referential Ambiguity
Referential Ambiguity exists when it is unclear which entity a pronoun refers to.
In such a sentence, you do not know who is hungry, Kiran or Sunita.
Phases of NLP
NLP Challenges
Elongated words
Shortcuts
Emojis
Mixed use of languages (code-mixing)
Ellipsis
LEXICAL ANALYSIS
It is the fundamental (first) stage of NLP.
It identifies and analyses the structure of words.
It is word-level processing.
It divides the whole text into paragraphs, sentences, and words.
It involves stemming and lemmatization.
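The stemming step mentioned above can be sketched with a toy suffix-stripping stemmer. The rule table below is an illustrative simplification for these notes, not the full Porter algorithm:

```python
# Toy suffix-stripping stemmer (illustrative rules, not Porter's algorithm).
# Each rule: (suffix to strip, replacement string).
RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def stem(word):
    """Apply the first matching suffix rule, keeping stems reasonably long."""
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

print([stem(w) for w in ["processing", "cats", "studies"]])
# → ['process', 'cat', 'studi']
```

Note that a stemmer may produce non-words such as "studi"; lemmatization instead maps word forms to dictionary lemmas ("studies" → "study"), which requires a lexicon.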
SYNTACTIC ANALYSIS
SEMANTIC ANALYSIS
DISCOURSE ANALYSIS
PRAGMATIC ANALYSIS
It studies how people communicate with each other and in which context they are
talking. It requires real-world knowledge.
Finding the Structure of Words
Words are the basic building blocks of a language. We have the following components of words:
Tokens
Lexemes
Morphemes
Typology
Tokens
Tokens are the smaller units created by dividing the text.
The process of identifying tokens in a given text is known as Tokenization.
Tokenization involves segmenting text into smaller units that are analysed individually.
The input is text and the output is tokens.
Types of Tokenization
Character Tokenization
Word Tokenization
Sentence Tokenization
Subword Tokenization
Number Tokenization
Character Tokenization
Input: "Today is Monday"
Output: ["T", "o", "d", "a", "y", "i", "s", "M", "o", "n", "d", "a", "y"]
Word Tokenization (whitespace-based, punctuation-based)
Sentence Tokenization
Input: "Tokenization is an important NLP task. It helps break down text into smaller units."
Output: ["Tokenization is an important NLP task.", "It helps break down text into smaller units."]
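The tokenization granularities above can be sketched with only the standard library. This is a rough sketch; real NLP toolkits handle many more cases (abbreviations, quotes, URLs):

```python
import re

text = "Tokenization is an important NLP task. It helps break down text into smaller units."

# character tokens (whitespace dropped, as in the example above)
chars = [c for c in "Today is Monday" if not c.isspace()]

# word tokens: runs of word characters, plus punctuation marks as separate tokens
words = re.findall(r"\w+|[^\w\s]", text)

# sentence tokens: split after ., !, or ? followed by whitespace
sentences = re.split(r"(?<=[.!?])\s+", text)

print(chars[:5])        # → ['T', 'o', 'd', 'a', 'y']
print(words[:4])        # → ['Tokenization', 'is', 'an', 'important']
print(len(sentences))   # → 2
```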
Number Tokenization
Morphological Process
Morphemes
LEXEMES
MORPHEMES
TYPOLOGY
Irregularity
Ambiguity
Productivity
Irregularity
If words or word forms follow regular patterns, that is regularity.
If words or word forms do not follow regular patterns, that is
irregularity (e.g., go → went).
Ambiguity
A word or word form can have more than one meaning
irrespective of context.
Word forms that look the same but whose meaning is not unique
complicate morphological processing.
Ambiguity can be
o Word sense ambiguity
Meaning depends on the context
o Parts of speech ambiguity
The same word form can serve as different parts of speech
o Structural ambiguity
Multiple valid syntactic structures
o Referential ambiguity
Unclear which person or noun a pronoun refers to
Productivity
Forming new words or word forms using productive rules, e.g.,
person names, location names, organization names.
Morphological Models
Morphological models are used to analyse the structure and formation of words
Dictionary Lookup
Finite state morphology
Unification based morphology
Functional morphology
Morphological Induction
Morphemes
Prefix
Infix
suffix
Dictionary Lookup
It includes retrieving the morphological information of a word form from a
precompiled dictionary (lexicon).
Example:
successful = success (stem) + ful (suffix)
unsuccessful = un (prefix) + success (stem) + ful (suffix)
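The prefix/stem/suffix segmentation above can be sketched as affix stripping against a lexicon. The tiny lexicon and affix lists below are illustrative assumptions for the sketch:

```python
# Minimal affix-stripping segmenter backed by a dictionary lookup
# (lexicon and affix lists are illustrative, not exhaustive).
LEXICON = {"success", "play", "hope"}
PREFIXES = ["un", "re"]
SUFFIXES = ["ful", "ing", "ed", "s"]

def segment(word):
    """Return (prefix, stem, suffix) if the stripped stem is in the lexicon."""
    for pre in [""] + PREFIXES:
        for suf in [""] + SUFFIXES:
            if word.startswith(pre) and word.endswith(suf):
                stem = word[len(pre):len(word) - len(suf)] if suf else word[len(pre):]
                if stem in LEXICON:
                    return pre, stem, suf
    return None

print(segment("unsuccessful"))  # → ('un', 'success', 'ful')
```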
STEM CHANGES
Irregular forms change letters inside the stem, e.g., mouse → mice
(m o u s e → m i c e), rather than adding a regular suffix as in dog → dogs.
Two-level (finite-state) morphology aligns a lexical tape with a surface tape,
using epsilon for symbols that are not realized on the surface:
Lexical tape: c a t +N +Pl
Surface tape: c a t s
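The lexical-to-surface mapping can be sketched in a few lines. The rule table and the irregular-form list below are illustrative assumptions, not a full two-level rule system:

```python
# Sketch of lexical-tape -> surface-tape realization.
# Regular rules delete +N and realize +Pl as "s"; irregular forms
# (stem changes) are stored whole, as a dictionary-lookup fallback.
RULES = {"+N": "", "+Pl": "s"}
IRREGULAR = {"mouse+N+Pl": "mice"}

def surface(lexical):
    """Realize a lexical-tape string as its surface form."""
    if lexical in IRREGULAR:
        return IRREGULAR[lexical]
    out = lexical
    for symbol, realization in RULES.items():
        out = out.replace(symbol, realization)
    return out

print(surface("cat+N+Pl"))    # → cats
print(surface("mouse+N+Pl"))  # → mice
```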
MORPHEME TYPES
Types of segmentation
Corpus
Documents/sentences
Word/tokens
Vocabulary
Applications:
1. POS tagging
2. Speech recognition
3. Named entity recognition
Complexity of approaches
Quality
Quantity
Computational complexity
Structural complexity
Space
Time
Training
Prediction
Confusion matrix
Precision
When it predicts yes, how often is it correct?
TP / Total Predicted Yes = TP / (TP + FP)
100/110 = 90.9%
Accuracy
How often classifier correct
(TN + TP) / (TN + FP + FN + TP)
(50 + 100) / (50 + 10 + 5 + 100)
150/165 = 90.9%
Misclassification Rate:
Overall, how often is it wrong?
(FN + FP) / (TN + FP + FN + TP)
(5 + 10) / (50 + 10 + 5 + 100)
15/165 = 9.1%
PRECISION = Total No. of Correct Positive Predictions / Total No. of
Positive Predictions = TP / (TP + FP)
RECALL = Total No. of Correct Positive Predictions / Total No. of Actual
Positive Instances = TP / (TP + FN)
ACCURACY
How often classifier correct
TN+TP/TN+FP+FN+TP
F1 SCORE = 2 * Precision * Recall / (Precision + Recall)
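The worked numbers above (TP=100, TN=50, FP=10, FN=5) can be checked directly:

```python
# Metrics from the confusion-matrix counts used in the notes.
TP, TN, FP, FN = 100, 50, 10, 5

precision = TP / (TP + FP)                   # 100/110
recall    = TP / (TP + FN)                   # 100/105
accuracy  = (TP + TN) / (TP + TN + FP + FN)  # 150/165
misclass  = (FP + FN) / (TP + TN + FP + FN)  # 15/165
f1        = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"accuracy={accuracy:.3f} misclassification={misclass:.3f} f1={f1:.3f}")
# precision ≈ 0.909, recall ≈ 0.952, accuracy ≈ 0.909, misclassification ≈ 0.091
```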
UNIT -II
Prerequisites: CFG (Context-Free Grammar)
Chart parser
RegEx parser
Shift reduce parser
Recursive parser
Parsing
CFG
G = (N, T, P, S)
A → α
α ∈ (N ∪ T)*
A, B ∈ N
S → NP VP
NP → {ART (ADJ) N, PRO, PN}
VP → V NP (PP) (ADV)
PP → PREP NP
NP → N
NP → D N
Brown, Switchboard (example corpora)
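The productions above can be tried out with a toy recursive-descent parser (one of the parser types listed). The grammar subset and lexicon below are illustrative assumptions, covering only S → NP VP, NP → D N | PRO | N, and VP → V NP | V:

```python
# Toy recursive-descent parser with backtracking for a small CFG
# (grammar subset and lexicon are illustrative, not from a treebank).
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["D", "N"], ["PRO"], ["N"]],
    "VP": [["V", "NP"], ["V"]],
}
LEXICON = {
    "the": "D", "a": "D",
    "dog": "N", "cat": "N",
    "she": "PRO",
    "saw": "V", "sleeps": "V",
}

def parse(symbol, tokens, pos):
    """Expand `symbol` at tokens[pos]; yield (tree, next_pos) per parse."""
    if symbol in GRAMMAR:                       # nonterminal: try each production
        for production in GRAMMAR[symbol]:
            for children, end in match_seq(production, tokens, pos):
                yield (symbol, children), end
    else:                                       # POS tag: match one word
        if pos < len(tokens) and LEXICON.get(tokens[pos]) == symbol:
            yield (symbol, tokens[pos]), pos + 1

def match_seq(symbols, tokens, pos):
    """Match a sequence of symbols left to right, with backtracking."""
    if not symbols:
        yield [], pos
        return
    for tree, mid in parse(symbols[0], tokens, pos):
        for rest, end in match_seq(symbols[1:], tokens, mid):
            yield [tree] + rest, end

def parse_sentence(sentence):
    tokens = sentence.lower().split()
    return [t for t, end in parse("S", tokens, 0) if end == len(tokens)]

trees = parse_sentence("the dog saw a cat")
print(trees[0])
# → ('S', [('NP', [('D', 'the'), ('N', 'dog')]),
#          ('VP', [('V', 'saw'), ('NP', [('D', 'a'), ('N', 'cat')])])])
```

A chart parser avoids this exponential backtracking by memoizing partial parses; the sketch here trades efficiency for brevity.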