
Natural Language Processing

Some screenshots are taken from the NLP course by Jurafsky


— used only for educational purposes

NLP © Sakthi Balan M


Developments

• IBM Watson wins the Jeopardy! Challenge (2011)

• IBM Deep Blue beats Garry Kasparov (1997)

• DeepMind’s AlphaZero (an AI system that learned chess in 24 hours) beats the best chess engine, Stockfish (2017)

• AlphaGo defeated two of the best Go players (2015–16)


Problems

• Information Extraction

• Machine Translation

• Conversational Agent / Dialogue System

• Question Answering Systems


We need to look into…

• Phonetics

• Morphology

• Syntax

• Semantics

• Pragmatics

• Discourse
Progress

• Almost done:

  • Spam versus ham detection — 99% accuracy

  • PoS tagging — 97% accuracy

  • NER — 97% accuracy

• Good progress:

  • Sentiment analysis

  • Word sense disambiguation

  • Parsing

  • Machine translation

  • Information extraction

• Hard problems:

  • Question answering systems

  • Paraphrase

  • Summarisation

  • Dialogue
Crash Blossoms and Garden Path Sentences

• “Dutch military plane carrying bodies from Malaysian Airlines Flight 17 crash lands in Eindhoven” (July 23, 2014)

• “I went to bank”

• “Fed raises interest rates”

• “The old man the boat”



Issues
• ambiguity

• non-standard text (example: tweets)

• segmentation problems

• idioms

• neologisms

• world knowledge

• tricky entity names (e.g., biomedical names)


State of the Art & History
• Foundational Insights: 1940s and 1950s

• Automaton

• Probabilistic or information-theoretic models

• McCulloch-Pitts neuron (1943)

• Chomsky (1956) — finite state machines and CFGs

• Backus (1959) & Naur et al. (1960) — ALGOL

• Probabilistic algorithms for speech and language processing
State of the Art & History
• Two Camps: 1957 to 1970

• Symbolic — Chomsky’s related works

• AI — McCarthy, Minsky, Shannon and others

• Stochastic and statistical — Bayesian models (Mosteller and Wallace, 1964)

• Logic and general problem solving — Newell and Simon

• Brown Corpus — a one-million-word corpus drawn from newspapers, novels, non-fiction, academic prose, etc.
State of the Art & History

• Four Paradigms: 1970 to 1983

• Stochastic — speech recognition, HMMs

• Logic-based — functional grammar

• Natural language understanding — SHRDLU and the LUNAR QA system

• Discourse modeling — BDI (Belief-Desire-Intention)


State of the Art & History
• Empiricism and Finite State Machines Revisited

• FSM:

• Finite state phonology and morphology by Kaplan and Kay (1981)

• Finite state models of syntax by Church (1980)

• Empiricism:

• IBM Watson Research Center’s work on probabilistic models of speech recognition: parsing, PoS tagging, addressing ambiguities and semantics
State of the Art & History

• All branches come together: 1994 to 1999

• Algorithms for parsing, PoS tagging, reference resolution and discourse processing through probabilistic models

• Commercial exploitation of speech and language processing



State of the Art & History
• The Rise of ML (2000 to 2008)

• Linguistic Data Consortium (LDC) — large amounts of spoken and written material became available

• All of it with syntactic, semantic and pragmatic annotations

• Parsing and semantic analysis became a set of supervised learning problems

• Learning models brought statistical and probabilistic models closer together

• High-performance computing enabled ML in NLP

• Finally, the works of Brown et al. (1990) and Och and Ney (2003) [machine translation] and Blei et al. (2003) [topic modeling] showed that we can even work with unannotated text data
Regular Expression

— Weizenbaum (1966): ELIZA — “A computer program for the study of natural language communication between man and machine”
Regular Expression

• First developed by Kleene (1956)

• A Regular Expression (RE) is a formula in a special language that specifies simple classes of strings

• Alternatively, an RE is an algebraic notation for characterising a set of strings

• For any RE we can build an equivalent finite state automaton (FSA)

• RE search requires a pattern that we want to search for and a corpus of texts to search through
Disjunction

• [Ww] matches either W or w

• [A-Z] matches any one uppercase letter from A to Z

• [a-z] matches any one lowercase letter from a to z

• [A-Za-z] matches any one letter, uppercase or lowercase

• [0-9] matches any one digit from 0 to 9

• [ !] — what will this match? (a space or an exclamation mark)
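These character classes can be tried directly with Python's re module; a small illustrative sketch (the example strings are made up):

```python
import re

text = "Welcome! The world is wide."

# [Ww] matches either 'W' or 'w'
assert re.findall(r"[Ww]", text) == ["W", "w", "w"]

# [0-9] matches any one digit
assert re.findall(r"[0-9]", "Flight 17 to Gate 3") == ["1", "7", "3"]

# [ !] matches either a space or an exclamation mark
assert re.findall(r"[ !]", "Hi there!") == [" ", "!"]
```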
Negation in Disjunction

• [^Tt] matches any character other than T or t

• [^A-Z] matches any character except an uppercase A to Z

• [^A-Za-z] matches any character other than a letter

• Ram|Sita matches either Ram or Sita


Special Characters
• ? matches zero or one occurrence of the previous character or expression

• * matches zero or more occurrences of the previous character or expression

• + matches one or more occurrences of the previous character or expression

• {n} matches exactly n occurrences of the previous character or expression

• {n,m} matches n to m occurrences of the previous character or expression

• {n,} matches at least n occurrences of the previous character or expression
Anchors

• ^ anchors the match to the start of a line

• $ anchors the match to the end of a line
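A quick sketch of the quantifiers and anchors in Python's re syntax (example patterns are illustrative):

```python
import re

# ? : zero or one occurrence of the previous expression
assert re.fullmatch(r"colou?r", "color") is not None
assert re.fullmatch(r"colou?r", "colour") is not None

# * : zero or more; + : one or more
assert re.fullmatch(r"ab*c", "ac") is not None   # 'b' may be absent
assert re.fullmatch(r"ab+c", "ac") is None       # '+' needs at least one 'b'

# {n,m} : between n and m occurrences
assert re.fullmatch(r"a{2,3}", "aaa") is not None
assert re.fullmatch(r"a{2,3}", "aaaa") is None

# ^ and $ anchor the match to the start and end of a line
assert re.search(r"^The", "The old man the boat") is not None
assert re.search(r"boat$", "The old man the boat") is not None
```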



Basic Text Processing

• “The cat in the hat”

• “The other one there, the blithe one”


• Search for:

• [Tt]he — false positive: matches the ‘the’ inside ‘there’ and ‘blithe’

• [Tt]he[^A-Za-z] — still a false positive for ‘blithe ’, since only the following character is checked

• [^A-Za-z][Tt]he[^A-Za-z] — false negative: misses ‘The’ at the start of a line
Basic Text Processing
• Tokenisation

• How many tokens are there?

• San Francisco, New Delhi — one token or two?

• Speech — uh…, main… mainly

• Cat, Cats, cat, cats, I’m, They’re, India’s capital, Ph.D. and so on

• How many types (unique tokens)?


Basic Text Processing

• N = number of tokens

• V = vocabulary = set of types; |V| = number of types

• Phone conversations: N = 2.4 million, |V| = 20,000

• Shakespeare: N = 834,000, |V| = 31,000

• Google N-grams: N = 1 trillion, |V| = 13 million

• |V| > O(N^(1/2)) (from Church and Gale, 1990)
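Counting tokens (N) and types (|V|) can be sketched in a few lines; the whitespace tokenisation below is naive and purely illustrative:

```python
# N = number of tokens, |V| = number of types, after case folding
text = "The cat in the hat the cat"
tokens = text.lower().split()   # crude whitespace tokenisation + case folding
types = set(tokens)             # the vocabulary V

assert len(tokens) == 7         # N
assert len(types) == 4          # |V|: {'the', 'cat', 'in', 'hat'}
```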
Word Segmentation — Maximum
Matching algorithm

• “The cat in the hat”

• apply the maximum matching algorithm to the above

• “The table down there”

• apply the maximum matching algorithm to the same!!


Word Segmentation — Maximum
Matching algorithm

• Maximum matching doesn’t work well in English

• It works well for Chinese, where the average word length is just 2.3 characters

• In general, it works well when words are short
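A minimal sketch of greedy maximum matching; the dictionaries here are toy word lists chosen to reproduce the two examples above, not real lexicons:

```python
def max_match(text, dictionary):
    """Greedily take the longest dictionary word that prefixes the remaining text."""
    words = []
    i = 0
    while i < len(text):
        # try the longest remaining prefix first
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
        else:
            # no dictionary word matches: emit a single character
            words.append(text[i])
            i += 1
    return words

# Works on this input...
assert max_match("thecatinthehat", {"the", "cat", "in", "hat"}) == \
    ["the", "cat", "in", "the", "hat"]
# ...but greed misleads it here: "theta" is taken before "table" can start
assert max_match("thetabledownthere", {"the", "theta", "table", "down", "there"}) == \
    ["theta", "b", "l", "e", "down", "there"]
```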


Normalising Tokens

• U.S.A. & USA

• asymmetric expansions — Window, Window(s)

• case folding — reduce all letters to lower case

• exceptions — General Motors, Congress

• for sentiment analysis, upper vs. lower case can be important


Lemmatization

• Reduce the variant forms to base forms

• am, are, is —> be

• cat, cats, Cat, Cats —> cat

• Morphemes are the smallest meaningful units that make up words

• Morphemes can be words or affixes (prefixes or suffixes). Examples: -ed turns a verb into the past tense; un- is a prefix that means “not”
Lemmatization

• Stems: the core meaning-bearing unit

• Affixes: attached to the stem — a prefix or suffix — according to a grammatical rule

• Stemming: a crude chopping-off of affixes

• automate, automation, automates, automatic, automated are all reduced to automat


Stemming Process

• Porter’s algorithm — the classic stemmer for English
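Porter's actual algorithm applies cascaded rewrite rules; the toy suffix-stripper below only illustrates the "crude chopping" idea on the example words above and is not Porter's rule set:

```python
# Toy suffix-stripper (NOT Porter's real rules): chop one common suffix.
SUFFIXES = ["ions", "ion", "ing", "ed", "es", "ic", "e", "s"]

def crude_stem(word):
    for suffix in SUFFIXES:
        # keep a reasonably long stem so we don't over-chop short words
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

words = ["automate", "automation", "automates", "automatic", "automated"]
assert all(crude_stem(w) == "automat" for w in words)
```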


Minimum Edit Distance

• Word / string similarity

• Spell correction — is graet closest to great, grate, target, rage or raget?

• Computational biology — aligning two nucleotide / amino-acid sequences

• Machine translation, information extraction, speech recognition


Minimum Edit Distance

• Minimum edit distance between two strings

• Operations: insertion, deletion, substitution

• Minimise the number of operations

I N T E * N T I O N
* E X E C U T I O N

• With each operation costing 1: 5 operations (1 deletion, 1 insertion, 3 substitutions)

• With substitution costing 2 (Levenshtein distance): 8 operations


Computational Biology



Other Applications

• Evaluating Similarity of Sentences

• Spokesman said the senior advisor was killed

• Spokesman confirmed that the senior advisor was dead

• Named entity extraction and entity coreference — IBM and IBM Ltd



Minimum Edit Distance

• Search for a sequence of edits (a path) from the start string to the final string

• Given: the word to be transformed

• Operations: insertion, deletion and substitution

• Output: the word we are trying to reach

• Path cost: minimise the number of edits / operations


Algorithm

• The search space is HUGE if we enumerate exhaustively!

• Minimising the number of edits for two strings depends on minimising the number of edits for their substrings!

• The problem is recursive in nature, but the subproblems overlap!

• Dynamic programming is the appropriate technique here


Algorithm

• Two strings X and Y: X of length n and Y of length m

• X is to be transformed into Y through I, D and S operations

• D(i,j) is the edit distance between X[1..i] and Y[1..j]

• D(n,m) is the edit distance from X to Y


Algorithm

• Compute D(n,m) from the values D(i,j), where i < n and j < m

• Combine the values D(i,j) to get D(n,m)


Algorithm

D(0,j) = j (j insertions)

D(i,0) = i (i deletions)

For all i,j:

D(i,j) = min{ D(i-1,j) + 1, D(i,j-1) + 1, D(i-1,j-1) + (2 if X(i) ≠ Y(j), else 0) }

D(n,m) is the output
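The recurrence translates directly into a short dynamic program; a sketch with substitution cost 2 on a mismatch and 0 on a match, as in the Levenshtein variant used here:

```python
def min_edit_distance(x, y):
    """Levenshtein distance with substitution cost 2 (insert/delete cost 1)."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                               # i deletions
    for j in range(m + 1):
        D[0][j] = j                               # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 2
            D[i][j] = min(D[i - 1][j] + 1,        # delete
                          D[i][j - 1] + 1,        # insert
                          D[i - 1][j - 1] + sub)  # substitute / match
    return D[n][m]

assert min_edit_distance("intention", "execution") == 8
```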




Backtrace in Minimum edit distance

• Ultimately, to find the optimal alignment we need to do a backtrace

• Every time we enter a new cell in the table, we note down the cell we came from (the one giving the minimum)

• After reaching the end of the table, we backtrace by recalling the previous cell at each step; this yields the optimal alignment
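One way to sketch the backtrace: fill the table as before, then walk back from (n, m), preferring the diagonal whenever it is an optimal predecessor. This recovers one optimal alignment (there may be several):

```python
def align(x, y):
    """Minimum-edit alignment via backtrace (sub cost 2, insert/delete cost 1)."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 2
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + sub)
    # walk back from (n, m), recalling which neighbour each cell came from
    i, j, top, bottom = n, m, [], []
    while i > 0 or j > 0:
        sub = 0 if i and j and x[i - 1] == y[j - 1] else 2
        if i and j and D[i][j] == D[i - 1][j - 1] + sub:
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif i and D[i][j] == D[i - 1][j] + 1:
            top.append(x[i - 1]); bottom.append("*"); i -= 1   # deletion
        else:
            top.append("*"); bottom.append(y[j - 1]); j -= 1   # insertion
    return "".join(reversed(top)), "".join(reversed(bottom))

a, b = align("intention", "execution")
assert a.replace("*", "") == "intention"
assert b.replace("*", "") == "execution"
assert len(a) == len(b)
```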


Backtrace in Minimum edit distance

I N T E * N T I O N
* E X E C U T I O N


Backtrace in Minimum edit distance

• Time complexity: O(mn)

• Space complexity: O(mn)

• Backtrace: O(m+n)

