Natural Language Processing (CSE4022) : by N. Ilakiyaselvan

This document provides an overview of natural language processing and computational challenges in other languages. It discusses spelling correction through candidate generation using edit distance and prior word probabilities. It also covers information retrieval concepts like precision, recall and the basic assumptions of IR systems. Finally, it describes question answering as one of the oldest NLP tasks and different paradigms for question answering systems like IR-based, knowledge-based and hybrid approaches.

Natural Language Processing (CSE4022)
By N. Ilakiyaselvan

Computational Challenges in Other Languages
Spelling Correction
Non-word spelling error example
acress
Candidate generation:
• Words with similar spelling
– Small edit distance to error
• Words with similar pronunciation
– Small edit distance of pronunciation to error

41
Damerau-Levenshtein edit distance
• Minimal edit distance between two strings,
where edits are:
– Insertion
– Deletion
– Substitution
– Transposition of two adjacent letters
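
The four edit operations listed above can be sketched with dynamic programming. This is a minimal illustration, not taken from the slides: it implements the restricted variant (often called optimal string alignment), which assumes no substring is edited more than once. The function name damerau_levenshtein is a label chosen here.

```python
def damerau_levenshtein(source: str, target: str) -> int:
    """Restricted Damerau-Levenshtein distance (optimal string alignment).

    Counts insertions, deletions, substitutions, and transpositions of
    two adjacent letters, each at cost 1.
    """
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = cost of turning source[:i] into target[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every character of source[:i]
    for j in range(cols):
        dist[0][j] = j  # insert every character of target[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Transposition: the last two letters appear swapped.
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[rows - 1][cols - 1]


for candidate in ["actress", "cress", "caress", "access"]:
    print(candidate, damerau_levenshtein("acress", candidate))
```

Each of the four candidates is one edit away from acress, so the loop prints a distance of 1 for every line.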

Words within 1 of acress
Error    Candidate Correction   Correct Letter   Error Letter   Type
acress   actress                t                -              deletion
acress   cress                  -                a              insertion
acress   caress                 ca               ac             transposition
acress   access                 c                r              substitution

Candidate generation
• 80% of errors are within edit distance 1
• Almost all errors within edit distance 2

• Also allow insertion of space or hyphen

– thisidea → this idea
– inlaw → in-law
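
Candidate generation at edit distance 1 can be sketched by enumerating every single edit of the misspelling and keeping only the strings found in a vocabulary. This is an illustrative sketch: the names edits1 and candidates, and the toy vocabulary, are assumptions made here and do not come from the slides.

```python
import string


def edits1(word: str) -> set:
    """All strings one edit away from word, plus space/hyphen insertions."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    substitutes = [left + ch + right[1:]
                   for left, right in splits if right for ch in letters]
    inserts = [left + ch + right for left, right in splits for ch in letters]
    # A space or hyphen is only inserted between two non-empty halves.
    breaks = [left + sep + right
              for left, right in splits if left and right
              for sep in (" ", "-")]
    return set(deletes + transposes + substitutes + inserts + breaks) - {word}


def candidates(error: str, vocabulary: set) -> set:
    """Keep edits that are known words, or whose space-separated parts are."""
    kept = set()
    for cand in edits1(error):
        if cand in vocabulary:
            kept.add(cand)
        elif " " in cand and all(part in vocabulary for part in cand.split(" ")):
            kept.add(cand)
    return kept


# Toy vocabulary, assumed for illustration only.
vocab = {"actress", "cress", "caress", "access", "across", "acres",
         "this", "idea", "in-law"}
print(sorted(candidates("acress", vocab)))
print(candidates("thisidea", vocab))
print(candidates("inlaw", vocab))
```

Extending this to edit distance 2 would amount to applying edits1 to each result of edits1, at the cost of a much larger candidate set.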

Unigram Prior probability
Counts from 404,253,213 words in the Corpus of Contemporary American English (COCA)

word      Frequency of word   P(word)
actress   9,321               .0000230573
cress     220                 .0000005442
caress    686                 .0000016969
access    37,038              .0000916207
across    120,844             .0002989314
acres     12,874              .0000318463
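
The P(word) column is simply each count divided by the corpus size. A short sketch of that arithmetic, using the counts from the table above (the function name unigram_prior is a label chosen here):

```python
COCA_TOTAL = 404_253_213  # corpus size quoted on the slide

counts = {
    "actress": 9_321,
    "cress": 220,
    "caress": 686,
    "access": 37_038,
    "across": 120_844,
    "acres": 12_874,
}


def unigram_prior(word: str) -> float:
    """P(word) = count(word) / total words; unseen words get probability 0."""
    return counts.get(word, 0) / COCA_TOTAL


for word in counts:
    print(f"{word:8s} {unigram_prior(word):.10f}")

# The prior alone would pick the most frequent candidate.
best = max(counts, key=unigram_prior)
print("highest prior:", best)
```

On its own the prior favors across, the most frequent candidate, regardless of which edit produced the error.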
Issues in spelling
• If very confident in correction
– Autocorrect
• Less confident
– Give the best correction
• Even less confident
– Give a correction list
• Unconfident
– Just flag as an error

Information Retrieval
• Information Retrieval (IR) is finding material
(usually documents) of an unstructured nature
(usually text) that satisfies an information need
from within large collections (usually stored on
computers).

– These days we frequently think first of web search, but there are many other cases:
• E-mail search
• Searching your laptop
• Corporate knowledge bases
• Legal information retrieval

Basic assumptions of Information Retrieval

• Collection: A set of documents
– Assume it is a static collection for the moment
• Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task

How good are the retrieved docs?

• Precision: Fraction of retrieved docs that are relevant to the user’s information need
• Recall: Fraction of relevant docs in collection that are retrieved
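
Both measures can be computed from two sets of document identifiers. The sketch below is illustrative: the function name and the document IDs are made up for the example.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Return (precision, recall) for one query.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    An empty denominator is treated as 0.0 here, which is a convention
    chosen for this sketch.
    """
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result set: 4 docs retrieved, 3 docs actually relevant.
retrieved_docs = {"d1", "d2", "d3", "d4"}
relevant_docs = {"d2", "d4", "d5"}
p, r = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four retrieved documents are relevant (precision 0.50) and two of the three relevant documents were found (recall about 0.67).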

Question Answering
One of the oldest NLP tasks (punched card systems in 1961)
Simmons, Klein, McConlogue. 1964. Indexing and Dependency Logic for Answering English Questions. American Documentation 15(3), 196-204

Apple’s Siri

Wolfram Alpha

Types of Questions in Modern Systems
• Factoid questions
– Who wrote “The Universal Declaration of Human Rights”?
– How many calories are there in two slices of apple pie?
– What is the average age of the onset of autism?
– Where is Apple Computer based?
• Complex (narrative) questions:
– In children with an acute febrile illness, what is the efficacy of acetaminophen in reducing fever?
– What do scholars think about Jefferson’s position on dealing with pirates?
Commercial systems: mainly factoid questions

Question                                                 Answer
Where is the Louvre Museum located?                      In Paris, France
What’s the abbreviation for limited partnership?         L.P.
What currency is used in China?                          The yuan
What kind of nuts are used in marzipan?                  almonds
What instrument does Max Roach play?                     drums
What is the telephone number for Stanford University?    650-723-2300
Paradigms for QA
• IR-based approaches
– TREC; IBM Watson; Google
• Knowledge-based and Hybrid approaches
– IBM Watson; Apple Siri; Wolfram Alpha; True Knowledge Evi

IR-based Factoid QA
• QUESTION PROCESSING
– Detect question type, answer type, focus, relations
– Formulate queries to send to a search engine
• PASSAGE RETRIEVAL
– Retrieve ranked documents
– Break into suitable passages and rerank
• ANSWER PROCESSING
– Extract candidate answers
– Rank candidates
• using evidence from the text and external sources
[Figure: IR-based Factoid QA architecture. The question enters Question Processing (Query Formulation, Answer Type Detection); the query drives Document Retrieval over the indexed documents; the relevant docs go to Passage Retrieval; the passages go to Answer Processing, which outputs the answer.]

Knowledge-based approaches (Siri)

• Build a semantic representation of the query
– Times, dates, locations, entities, numeric quantities
• Map from this semantics to query structured data or resources
– Geospatial databases
– Ontologies (Wikipedia infoboxes, DBpedia, WordNet, Yago)
– Restaurant review sources and reservation services
– Scientific databases
Hybrid approaches (IBM Watson)
• Build a shallow semantic representation of the query
• Generate answer candidates using IR methods
– Augmented with ontologies and semi-structured data
• Score each candidate using richer knowledge sources
– Geospatial databases
– Temporal reasoning
– Taxonomical classification
Question Processing
Things to extract from the question
• Answer Type Detection
– Decide the named entity type (person, place) of the answer
• Query Formulation
– Choose query keywords for the IR system
• Question Type classification
– Is this a definition question, a math question, a list question?
• Focus Detection
– Find the question words that are replaced by the answer
• Relation Extraction
– Find relations between entities in the question
Answer Type Detection: Named Entities
• Who founded Virgin Airlines?
– PERSON
• What Canadian city has the largest population?
– CITY
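
The simplest way to picture answer type detection is as a handful of hand-written patterns over the question text. The sketch below is only an illustration of that idea: the rule list and the function name answer_type are hypothetical, and practical systems more commonly train a classifier over a richer taxonomy.

```python
import re

# Hypothetical hand-written rules: (pattern over the question, answer type).
RULES = [
    (re.compile(r"^who\b", re.IGNORECASE), "PERSON"),
    (re.compile(r"^(what|which)\b.*\bcity\b", re.IGNORECASE), "CITY"),
]


def answer_type(question: str) -> str:
    """Return the first matching answer type, or UNKNOWN if no rule fires."""
    for pattern, label in RULES:
        if pattern.search(question):
            return label
    return "UNKNOWN"


print(answer_type("Who founded Virgin Airlines?"))
print(answer_type("What Canadian city has the largest population?"))
```

The two slide examples map to PERSON and CITY; any question outside the rule list falls through to UNKNOWN, which is where a learned classifier would take over.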
Answer Type Taxonomy
Xin Li, Dan Roth. 2002. Learning Question Classifiers. COLING'02

• 6 coarse classes
– ABBREVIATION, ENTITY, DESCRIPTION, HUMAN, LOCATION, NUMERIC
• 50 finer classes
– LOCATION: city, country, mountain…
– HUMAN: group, individual, title, description
– ENTITY: animal, body, color, currency…

Answer Types

More Answer Types

Text Normalization
• Every NLP task needs to do text normalization:
1. Segmenting/tokenizing words in running text
2. Normalizing word formats
3. Segmenting sentences in running text
Text Normalization
How many words?
• I do uh main- mainly business data processing
– Fragments, filled pauses
• Seuss’s cat in the hat is different from other cats!
– Lemma: same stem, part of speech, rough word sense
• cat and cats = same lemma
– Wordform: the full inflected surface form
• cat and cats = different wordforms
Issues in Tokenization
• Finland’s capital → Finland Finlands Finland’s ?
• what’re, I’m, isn’t → What are, I am, is not
• Hewlett-Packard → Hewlett Packard ?
• state-of-the-art → state of the art ?
• Lowercase → lower-case lowercase lower case ?
• San Francisco → one token or two?
• m.p.h., PhD. → ??
Word Tokenization in Chinese
• Also called Word Segmentation
• Chinese words are composed of characters
– Characters are generally 1 syllable and 1 morpheme.
– Average word is 2.4 characters long.
• Standard baseline segmentation algorithm:
– Maximum Matching (also called Greedy)
Maximum Matching Word Segmentation Algorithm
• Given a wordlist of Chinese, and a string.
1) Start a pointer at the beginning of the string
2) Find the longest word in dictionary that matches the string starting at pointer
3) Move the pointer over the word in string
4) Go to 2
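
The four steps above can be sketched as a short greedy loop. This is a minimal illustration: the function name max_match and both wordlists are toy assumptions, and the single-character fallback for unmatched text is a choice made here, since the steps do not say what to do when no dictionary word matches.

```python
def max_match(text: str, wordlist: set) -> list:
    """Greedy longest-match segmentation of text against wordlist."""
    tokens = []
    pointer = 0                                   # step 1
    longest = max((len(w) for w in wordlist), default=1)
    while pointer < len(text):
        # Step 2: try the longest possible span first, then shrink it.
        for end in range(min(len(text), pointer + longest), pointer, -1):
            if text[pointer:end] in wordlist:
                break
        else:
            end = pointer + 1  # no match: emit one character (assumption)
        tokens.append(text[pointer:end])
        pointer = end                             # step 3, then loop (step 4)
    return tokens


# Toy wordlists, assumed for illustration only.
english = {"the", "cat", "in", "hat", "theta", "table",
           "bled", "own", "down", "there"}
chinese = {"莎拉波娃", "现在", "居住", "在", "美国", "东南部", "的", "佛罗里达"}

print(max_match("thecatinthehat", english))
print(max_match("thetabledownthere", english))
print(max_match("莎拉波娃现在居住在美国东南部的佛罗里达", chinese))
```

With this wordlist the second English string is split as theta, bled, own, there, because the greedy step commits to theta before table can be considered.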
Max-match segmentation

• Thecatinthehat → the cat in the hat
• Thetabledownthere → the table down there (intended), but max-match gives: theta bled own there
• Doesn’t generally work in English!
• But works astonishingly well in Chinese
– 莎拉波娃现在居住在美国东南部的佛罗里达。
– 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
• Modern probabilistic segmentation algorithms even better
Normalization
• Need to “normalize” terms
– Information Retrieval: indexed text & query terms must have same form.
• We want to match U.S.A. and USA
• We implicitly define equivalence classes of terms
– e.g., deleting periods in a term
• Alternative: asymmetric expansion:
– Enter: window    Search: window, windows
– Enter: windows   Search: Windows, windows, window
– Enter: Windows   Search: Windows
• Potentially more powerful, but less efficient
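
The two strategies can be contrasted in a few lines. This is a sketch under stated assumptions: normalize_term and expand_query are names chosen here, and the expansion table is copied from the three Enter/Search rows above rather than derived from any rule.

```python
def normalize_term(term: str) -> str:
    """Equivalence classing: delete periods so U.S.A. and USA coincide."""
    return term.replace(".", "")


# Asymmetric expansion: what to search for depends on what was entered.
EXPANSIONS = {
    "window": ["window", "windows"],
    "windows": ["Windows", "windows", "window"],
    "Windows": ["Windows"],
}


def expand_query(term: str) -> list:
    """Return the search terms for an entered term (itself if unlisted)."""
    return EXPANSIONS.get(term, [term])


print(normalize_term("U.S.A.") == normalize_term("USA"))
print(expand_query("windows"))
print(expand_query("Windows"))
```

Equivalence classing applies the same function to indexed text and queries, so it costs one normalization per term. Expansion issues several lookups per query term, which is one way to read the "less efficient" remark.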

Case folding
• Applications like IR: reduce all letters to lower case
– Since users tend to use lower case
– Possible exception: upper case in mid-sentence?
• e.g., General Motors
• Fed vs. fed
• SAIL vs. sail
• For sentiment analysis, MT, Information extraction
– Case is helpful (US versus us is important)
Lemmatization

• Reduce inflections or variant forms to base form
– am, are, is → be
– car, cars, car's, cars' → car
• the boy's cars are different colors → the boy car be different color
• Lemmatization: have to find correct dictionary headword form
• Machine translation
– Spanish quiero (‘I want’), quieres (‘you want’) same lemma as querer ‘want’
Morphology
• Morphemes:
– The small meaningful units that make up words
– Stems: The core meaning-bearing units
– Affixes: Bits and pieces that adhere to stems
• Often with grammatical functions
Stemming
• Reduce terms to their stems in information retrieval
• Stemming is crude chopping of affixes
– language dependent
– e.g., automate(s), automatic, automation all reduced to automat.

Original: for example compressed and compression are both accepted as equivalent to compress.
Stemmed: for exampl compress and compress ar both accept as equival to compress
Porter’s algorithm
The most common English stemmer
Step 1a
sses → ss       caresses → caress
ies → i         ponies → poni
ss → ss         caress → caress
s → ø           cats → cat

Step 1b
(*v*)ing → ø    walking → walk
                sing → sing
(*v*)ed → ø     plastered → plaster
…

Step 2 (for long stems)
ational → ate   relational → relate
izer → ize      digitizer → digitize
ator → ate      operator → operate
…

Step 3 (for longer stems)
al → ø          revival → reviv
able → ø        adjustable → adjust
ate → ø         activate → activ
…
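
A few of the rules above are simple enough to sketch directly. The code below covers only Step 1a and the (*v*)ing rule from Step 1b, and it is not the full Porter stemmer: it assumes a vowel is one of a, e, i, o, u and skips the cleanup that the real algorithm applies after stripping a suffix. The function names are labels chosen here.

```python
import re


def step_1a(word: str) -> str:
    """Step 1a: the first matching suffix rule wins, longest suffix first."""
    if word.endswith("sses"):
        return word[:-2]   # sses -> ss
    if word.endswith("ies"):
        return word[:-2]   # ies -> i
    if word.endswith("ss"):
        return word        # ss -> ss (unchanged)
    if word.endswith("s"):
        return word[:-1]   # s -> (nothing)
    return word


VOWEL = re.compile(r"[aeiou]")  # simplification: y is never a vowel here


def strip_ing(word: str) -> str:
    """(*v*)ing rule: drop -ing only if the remaining stem has a vowel."""
    if word.endswith("ing"):
        stem = word[:-3]
        if VOWEL.search(stem):
            return stem
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", step_1a(w))
for w in ["walking", "sing", "king"]:
    print(w, "->", strip_ing(w))
```

The vowel condition is what keeps sing and king intact: after removing -ing, the leftovers s and k contain no vowel, so the rule does not fire.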
Viewing morphology in a corpus
Why only strip -ing if there is a vowel?
(*v*)ing → ø    walking → walk
                sing → sing

tr -sc 'A-Za-z' '\n' < shakes.txt | grep 'ing$' | sort | uniq -c | sort -nr

1312 King
 548 being
 541 nothing
 388 king
 375 bring
 358 thing
 307 ring
 152 something
 145 coming
 130 morning

tr -sc 'A-Za-z' '\n' < shakes.txt | grep '[aeiou].*ing$' | sort | uniq -c | sort -nr

 548 being
 541 nothing
 152 something
 145 coming
 130 morning
 122 having
 120 living
 117 loving
 116 Being
 102 going

Dealing with complex morphology is sometimes necessary
• Some languages require complex morpheme segmentation
– Turkish
– Uygarlastiramadiklarimizdanmissinizcasina
– ‘(behaving) as if you are among those whom we could not civilize’
– Uygar ‘civilized’ + las ‘become’
+ tir ‘cause’ + ama ‘not able’
+ dik ‘past’ + lar ‘plural’
+ imiz ‘p1pl’ + dan ‘abl’
+ mis ‘past’ + siniz ‘2pl’ + casina ‘as if’
