Natural Language Processing (CSE4022) : by N. Ilakiyaselvan
By
N. Ilakiyaselvan
Computational Challenges in Other Languages
Spelling Correction
Non-word spelling error example
acress
Candidate generation:
• Words with similar spelling
– Small edit distance to error
• Words with similar pronunciation
– Small edit distance of pronunciation to error
Damerau-Levenshtein edit distance
• Minimal edit distance between two strings,
where edits are:
– Insertion
– Deletion
– Substitution
– Transposition of two adjacent letters
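The four edit operations above can be sketched as a dynamic program. This is a minimal sketch of the restricted variant (often called optimal string alignment), which assumes no substring is edited more than once; the function name `osa_distance` is an illustrative choice, not something from the slides.

```python
def osa_distance(source: str, target: str) -> int:
    """Edit distance counting insertion, deletion, substitution,
    and transposition of two adjacent letters (restricted variant)."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every character of the source prefix
    for j in range(cols):
        dist[0][j] = j  # insert every character of the target prefix
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition: the last two letters are swapped.
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[rows - 1][cols - 1]


for candidate in ["actress", "cress", "caress", "access"]:
    print(candidate, osa_distance("acress", candidate))
```

Each of the four candidates printed here is at distance 1 from acress. The unrestricted Damerau-Levenshtein distance can be smaller than this variant on some string pairs.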
Words within 1 of acress
Error    Candidate Correction   Correct Letter   Error Letter   Type
acress   actress                t                -              deletion
acress   cress                  -                a              insertion
acress   caress                 ca               ac             transposition
acress   access                 c                r              substitution
Candidate generation
• 80% of errors are within edit distance 1
• Almost all errors within edit distance 2
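Because most errors fall within edit distance 1, one common way to generate candidates is to enumerate every string one edit away from the typo and keep those found in a wordlist. The sketch below assumes a lowercase English alphabet; the names `edits1` and `candidates` and the tiny vocabulary are hypothetical, chosen only for illustration.

```python
import string
from typing import List, Set


def edits1(word: str) -> Set[str]:
    """All strings one insertion, deletion, substitution,
    or adjacent transposition away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    substitutes = [left + c + right[1:]
                   for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + substitutes + inserts) - {word}


def candidates(error: str, vocabulary: Set[str]) -> List[str]:
    """Real words (per the given vocabulary) within one edit of the error."""
    return sorted(edits1(error) & vocabulary)


# Hypothetical toy vocabulary; a real system would load a full wordlist.
VOCAB = {"actress", "cress", "caress", "access", "stress", "the"}
print(candidates("acress", VOCAB))
# → ['access', 'actress', 'caress', 'cress']
```

Note that the edit is applied to the typo here, so a typist's deletion (acress for actress) is recovered by an insertion. Applying `edits1` twice would cover edit distance 2 at a much higher cost.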
Unigram Prior probability
Counts from 404,253,213 words in the Corpus of Contemporary American English (COCA)
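A unigram prior is usually estimated as a word's count divided by the total number of tokens. The sketch below assumes that unsmoothed estimate. Only the corpus size comes from the text above; the per-word counts are made-up placeholders, not COCA figures.

```python
COCA_TOKENS = 404_253_213  # corpus size stated above


def unigram_prior(count: int, total: int = COCA_TOKENS) -> float:
    """Maximum-likelihood unigram prior P(w) = C(w) / N (no smoothing)."""
    if count < 0 or total <= 0:
        raise ValueError("count must be >= 0 and total must be > 0")
    return count / total


# Hypothetical counts for illustration only; they are NOT real corpus counts.
toy_counts = {"candidate_a": 10_000, "candidate_b": 500}
ranked = sorted(toy_counts, key=lambda w: unigram_prior(toy_counts[w]), reverse=True)
print(ranked)
```

A word that never occurs in the corpus gets prior 0 under this estimate, which is one reason practical systems usually add smoothing.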
Information Retrieval
• Information Retrieval (IR) is finding material
(usually documents) of an unstructured nature
(usually text) that satisfies an information need
from within large collections (usually stored on
computers).
Question Answering
One of the oldest NLP tasks (punched card systems in 1961)
Simmons, Klein, McConlogue. 1964. Indexing and
Dependency Logic for Answering English Questions.
American Documentation 15:30, 196-204
Apple’s Siri
Wolfram Alpha
Types of Questions in Modern Systems
• Factoid questions
– Who wrote “The Universal Declaration of Human
Rights”?
– How many calories are there in two slices of apple
pie?
– What is the average age of the onset of autism?
– Where is Apple Computer based?
• Complex (narrative) questions:
– In children with an acute febrile illness, what is
the efficacy of acetaminophen in reducing
fever?
– What do scholars think about Jefferson’s position
on dealing with pirates?
Commercial systems: mainly factoid questions
Question                                                  Answer
Where is the Louvre Museum located?                       In Paris, France
What’s the abbreviation for limited partnership?          L.P.
What currency is used in China?                           The yuan
What kind of nuts are used in marzipan?                   almonds
What instrument does Max Roach play?                      drums
What is the telephone number for Stanford University?     650-723-2300
Paradigms for QA
• IR-based approaches
– TREC; IBM Watson; Google
• Knowledge-based and Hybrid approaches
– IBM Watson; Apple Siri; Wolfram Alpha; True Knowledge Evi
IR-based Factoid QA
• QUESTION PROCESSING
– Detect question type, answer type, focus, relations
– Formulate queries to send to a search engine
• PASSAGE RETRIEVAL
– Retrieve ranked documents
– Break into suitable passages and rerank
• ANSWER PROCESSING
– Extract candidate answers
– Rank candidates
• using evidence from the text and external sources
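The first two stages above can be illustrated with a toy sketch: drop function words from the question to form a query, then rank passages by how many query terms they contain. The stopword list, the function names, and the two example passages are assumptions made for illustration; real systems use a search engine and far richer ranking features.

```python
import re
from typing import List, Set

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"what", "where", "who", "is", "are", "the", "a", "an",
             "of", "in", "for", "does", "used"}


def formulate_query(question: str) -> Set[str]:
    """Question processing: keep only content words as query terms."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    return {t for t in tokens if t not in STOPWORDS}


def rank_passages(question: str, passages: List[str]) -> List[str]:
    """Passage retrieval: order passages by query-term overlap,
    dropping passages that share no term with the query."""
    query = formulate_query(question)
    scored = []
    for passage in passages:
        terms = set(re.findall(r"[a-z0-9]+", passage.lower()))
        scored.append((len(query & terms), passage))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for score, passage in scored if score > 0]


passages = [
    "Almonds are the nuts used in marzipan.",
    "The Louvre Museum is located in Paris, France.",
]
print(rank_passages("Where is the Louvre Museum located?", passages))
```

Answer processing would then extract a span of the expected answer type from the top passages; that step is not sketched here.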
[Figure: IR-based factoid QA architecture. Question → Question Processing (Query Formulation, Answer Type Detection) → Document Retrieval over indexed documents → relevant docs → Passage Retrieval → passages → Answer Processing → Answer]
Knowledge-based approaches (Siri)
Answer Type Taxonomy
• 6 coarse classes
– ABBREVIATION, ENTITY, DESCRIPTION, HUMAN, LOCATION,
NUMERIC
• 50 finer classes
– LOCATION: city, country, mountain…
– HUMAN: group, individual, title, description
– ENTITY: animal, body, color, currency…
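To make the coarse classes concrete, here is a toy rule-based detector that maps a question to one of them. The patterns are hand-written assumptions that cover only a few question shapes; practical systems typically learn this mapping from labeled questions.

```python
import re

# Hypothetical hand-written rules, checked in order; first match wins.
RULES = [
    (r"\babbreviation\b", "ABBREVIATION"),
    (r"^(who|whom|whose)\b", "HUMAN"),
    (r"^where\b", "LOCATION"),
    (r"^(how many|how much|when)\b", "NUMERIC"),
    (r"^what (animal|color|currency)\b", "ENTITY"),
]


def detect_answer_type(question: str) -> str:
    """Return a coarse answer type, or 'UNKNOWN' if no rule fires."""
    q = question.strip().lower()
    for pattern, label in RULES:
        if re.search(pattern, q):
            return label
    return "UNKNOWN"


for q in ["Where is the Louvre Museum located?",
          "What currency is used in China?",
          "What instrument does Max Roach play?"]:
    print(q, "->", detect_answer_type(q))
```

The last question falls through to UNKNOWN, which shows how quickly a fixed rule list runs out of coverage.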
Answer Types
More Answer Types
Text Normalization
• Every NLP task needs to do text
normalization:
1. Segmenting/tokenizing words in running text
2. Normalizing word formats
3. Segmenting sentences in running text
How many words?
• I do uh main- mainly business data processing
– Fragments, filled pauses
• Seuss’s cat in the hat is different from other cats!
– Lemma: same stem, part of speech, rough word sense
• cat and cats = same lemma
– Wordform: the full inflected surface form
• cat and cats = different wordforms
Issues in Tokenization
• Finland’s capital → Finland Finlands Finland’s ?
• what’re, I’m, isn’t → What are, I am, is not
• Hewlett-Packard → Hewlett Packard ?
• state-of-the-art → state of the art ?
• Lowercase → lower-case lowercase lower case ?
• San Francisco → one token or two?
• m.p.h., PhD. → ??
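Each question above is a design decision the tokenizer has to make. The sketch below picks one set of answers: it lowercases, expands the three contractions listed above, and keeps internal hyphens, apostrophes, and periods inside a token. These choices and the `CONTRACTIONS` table are assumptions for illustration, not a recommended standard.

```python
import re
from typing import List

# Only the contractions named above; a real table would be much larger.
CONTRACTIONS = {"what're": "what are", "i'm": "i am", "isn't": "is not"}


def tokenize(text: str) -> List[str]:
    """Lowercase, expand known contractions, then split into tokens.
    Hyphens, apostrophes, and periods survive only between letters/digits."""
    text = text.lower().replace("\u2019", "'")  # normalize curly apostrophes
    for short, full in CONTRACTIONS.items():
        text = re.sub(rf"\b{re.escape(short)}\b", full, text)
    return re.findall(r"[a-z0-9]+(?:[-'.][a-z0-9]+)*", text)


print(tokenize("What're you doing? I'm sure it isn't state-of-the-art."))
print(tokenize("Finland\u2019s capital"))
print(tokenize("Hewlett-Packard is in San Francisco"))
```

Under these choices Finland’s and Hewlett-Packard stay single tokens, San Francisco becomes two, and the trailing period of m.p.h. is dropped; a different application might reasonably decide each case the other way.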
Word Tokenization in Chinese
• Also called Word Segmentation
• Chinese words are composed of characters
– Characters are generally 1 syllable and 1
morpheme.
– Average word is 2.4 characters long.
• Standard baseline segmentation algorithm:
– Maximum Matching (also called Greedy)
Maximum Matching Word Segmentation Algorithm
• Given a Chinese wordlist and a string:
1) Start a pointer at the beginning of the string
2) Find the longest word in dictionary that
matches the string starting at pointer
3) Move the pointer over the word in string
4) Go to 2
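The four steps above can be sketched as follows. One detail the steps leave open is what to do when no dictionary word matches at the pointer; this sketch assumes the usual fallback of emitting a single character. The function name and both wordlists are illustrative.

```python
from typing import List, Set


def max_match(text: str, wordlist: Set[str]) -> List[str]:
    """Greedy left-to-right segmentation: at each position take the
    longest dictionary word, falling back to a single character."""
    longest = max([len(word) for word in wordlist] + [1])
    tokens = []
    pointer = 0                                    # step 1
    while pointer < len(text):
        remaining = len(text) - pointer
        for length in range(min(longest, remaining), 0, -1):
            piece = text[pointer:pointer + length]
            if length == 1 or piece in wordlist:   # step 2
                tokens.append(piece)
                pointer += length                  # step 3
                break                              # step 4: repeat from step 2
    return tokens


# Hypothetical toy wordlist for illustration.
chinese_words = {"莎拉波娃", "现在", "居住", "在", "美国", "东南部", "的", "佛罗里达"}
print(max_match("莎拉波娃现在居住在美国东南部的佛罗里达", chinese_words))
# → ['莎拉波娃', '现在', '居住', '在', '美国', '东南部', '的', '佛罗里达']
```

Greedy matching depends heavily on the language: with the English wordlist {the, table, down, there, theta, bled, own}, the string thetabledownthere comes out as theta bled own there.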
Max-match segmentation
Viewing morphology in a corpus
Why only strip -ing if there is a vowel?
(*v*)ing → ø    walking → walk
                sing → sing
tr -sc 'A-Za-z' '\n' < shakes.txt | grep 'ing$' | sort | uniq -c | sort -nr
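The same inspection can be sketched in Python, together with the vowel condition on the stem. This is a rough sketch: it reads from a string rather than shakes.txt so that it runs on its own, and the function names are illustrative. Treating only a, e, i, o, u as vowels is a simplifying assumption.

```python
import re
from collections import Counter


def ing_counts(text: str) -> Counter:
    """Count words ending in 'ing', splitting on non-letters
    (mirrors tr | grep | sort | uniq -c)."""
    words = re.findall(r"[A-Za-z]+", text)
    return Counter(word for word in words if word.endswith("ing"))


def strip_ing(word: str) -> str:
    """Apply (*v*)ing -> null: drop 'ing' only if the stem has a vowel."""
    stem = word[:-3]
    if word.endswith("ing") and re.search(r"[aeiou]", stem):
        return stem
    return word


sample = "sing a song, walking and talking; walking"
print(ing_counts(sample).most_common())  # most frequent first, like sort -nr
print(strip_ing("walking"), strip_ing("sing"))
# → walk sing
```

The vowel test keeps sing and king intact, but it still over-strips words such as nothing, which is why inspecting the corpus output matters.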
Dealing with complex morphology is sometimes necessary
• Some languages require complex morpheme segmentation
– Turkish
– Uygarlastiramadiklarimizdanmissinizcasina
– '(behaving) as if you are among those whom we could not civilize'
– Uygar 'civilized' + las 'become'
+ tir 'cause' + ama 'not able'
+ dik 'past' + lar 'plural'
+ imiz 'p1pl' + dan 'abl'
+ mis 'past' + siniz '2pl' + casina 'as if'