
Advanced Data Engineering & Analytics: NLP Overview & Basic Text Processing
13 March 2024
Adam Jatowt
What is language? How do we communicate?
Language
• All humans have language, and no other animal communication is similar
• Language can be manipulated to say infinite things while the structure of the brain is
finite
• We can talk about things that don’t exist, that can’t exist, things that are totally
abstract, and we can express subtle differences between similar things
Structure dictates how we can use language
We implicitly know complex rules about structure

What can we pull out to make a question?

Leon is a doctor → What is Leon?
My cat likes tuna → What does my cat like?
Leon is a doctor and an activist → ❌ What is Leon a doctor and?
Not the rules we learned in school..

A community of speakers (e.g., Standard American English speakers) shares a rough consensus of their implicit rules.
A grammar: an attempt to describe all these rules
All the utterances we can generate from these rules are grammatical.
If we cannot produce an utterance using these rules, it’s ungrammatical

Example:
Subject, Verb, and Object appear in SVO order
Subject pronouns (I/she/he/they) have to be subjects, Object pronouns
(me/her/him/them) have to be objects

❌ “Me love she”


Language is Compositional
A set of rules that define grammaticality + a lexicon of words that relate to the world we want to talk about
→ Anything we want to say!


Linguistic Structure in NLP

Linguistic structure in humans


There is a system for producing language, that can be described by discrete rules

Do NLP systems work like that?

They definitely used to..


How might humans structure this string of words?
Many linguists might tell us something like this for estimating the sentiment of text:
Now, language models seem to catch on to a lot of these things
Act of Communication
• The goal in the production and comprehension of natural language is
communication
• Communication stages for a speaker:
• Intention: Decide when and what information should be transmitted (a.k.a.
content selection). May require planning and reasoning about agents’ goals
and beliefs
• Generation: Translate the information to be communicated (in internal
logical representation, or “language of thought”) into string of words in
desired natural language (a.k.a. surface realization)
• Synthesis: Output the string in the desired modality: text or speech
Act of Communication (cont.)
• Communication stages for a listener:
• Perception: Map input modality to a string of words, e.g. optical
character recognition (OCR) or speech recognition
• Analysis: Determine the information content of the string
• Syntactic Interpretation (parsing): Find the correct parse tree showing the phrase
structure of the string
• Semantic Interpretation: Extract the (literal) meaning of the string (logical form)
• Pragmatic Interpretation: Consider effect of the overall context on altering the
literal meaning of a sentence
• Incorporation: Decide whether or not to believe the content of the string and whether or not to act upon it (e.g., add it to the KB)
Syntax, Semantics, Pragmatics
• Syntax concerns the proper ordering of words and grammatical structure of
text.
• The dog bit the boy.
• The boy bit the dog.
• * Bit boy dog the the.
• Colorless green ideas sleep furiously.
• Semantics concerns the (literal) meaning of words, phrases, and sentences.
• “plant” as a photosynthetic organism
• “plant” as a manufacturing facility
• “plant” as the act of putting a seed into ground
• Pragmatics concerns the overall communicative and social context and its
effect on interpretation.
• The ham sandwich wants another beer.
• John thinks vanilla.

A syntactically correct but not semantically correct example:


“Cows flow supremely.”
What does it mean to understand spoken
language?
Example of Modular Comprehension System
for Spoken Communication

sound waves → [Acoustic/Phonetic] → words → [Syntax] → parse trees → [Semantics] → literal meaning → [Pragmatics] → meaning (contextualized)
NLP
• Natural Language Processing
• Large field: processing natural language text involves many different syntactic, semantic, and pragmatic tasks, in addition to other problems
Example Syntactic Tasks
Word Segmentation
• Breaking a string of characters into a sequence of words
• In some written languages (e.g. Chinese, Japanese) words are not
separated by spaces
• Even in English, characters other than white-space can be used to
separate words [e.g. , ; . - : ( ) ]
• Examples from English URLs:
• jumptheshark.com → jump the shark .com
• myspace.com/pluckerswingbar
  → myspace .com pluckers wing bar
  → myspace .com plucker swing bar
Morphological Analysis
• Morphology is the field of linguistics that studies the internal structure of words.
• A morpheme is the smallest linguistic unit that has some meaning (Wikipedia)
• e.g. “carry”, “pre”, “ed”, “ly”, “s”
• Morphological analysis is the task of segmenting a word into its morphemes:
• carried → carry + ed (past tense)
• independently → in + (depend + ent) + ly
• Googlers → (Google + er) + s (plural)
• unlockable → un + (lock + able) ?
             → (un + lock) + able ?
Part Of Speech (POS) Tagging
• Annotate each word in a sentence with a part-of-speech.

I/Pro ate/V the/Det spaghetti/N with/Prep meatballs/N.
John/PN saw/V the/Det saw/N and/Con decided/V to/Part take/V it/Pro to/Prep the/Det table/N.
• Useful for subsequent syntactic parsing and word sense
disambiguation.
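(Not from the original slides: a minimal tagging sketch using NLTK, assuming its tokenizer and tagger resources have been downloaded; NLTK uses Penn Treebank tags, which are more fine-grained than the coarse labels above.)

import nltk

# Assumes nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")
# have been run once.
tokens = nltk.word_tokenize("I ate the spaghetti with meatballs.")
print(nltk.pos_tag(tokens))
# Typical output (Penn Treebank tags):
# [('I', 'PRP'), ('ate', 'VBD'), ('the', 'DT'), ('spaghetti', 'NN'),
#  ('with', 'IN'), ('meatballs', 'NNS'), ('.', '.')]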
Phrase Chunking
• Find all non-recursive noun phrases (NPs) and verb phrases (VPs) in a
sentence.

• [NP I] [VP ate] [NP the spaghetti] [PP with] [NP meatballs].
• [NP He] [VP reckons] [NP the current account deficit] [VP will narrow]
[PP to] [NP only # 1.8 billion] [PP in] [NP September]
Phrase Chunking Example

Brenda Salenave Santana, Ricardo Campos, Evelin Amorim, Alípio Jorge, Purificação Silvano, Sérgio Nunes: A survey on narrative extraction from textual data. Artif. Intell. Rev. 56(8): 8393-8435 (2023)
Syntactic Parsing
• Produce the correct syntactic parse tree for a sentence.
Example Semantic Tasks
Word Sense Disambiguation (WSD)
• Words in natural language usually have a fair number of different
possible meanings.
• Ellen has a strong interest in computational linguistics.
• Ellen pays a large amount of interest on her credit card.
• For many tasks (e.g., question answering, translation), the proper sense
of each ambiguous word in a sentence must be determined.
Semantic Role Labeling (SRL)
• For each clause, determine the semantic role played by each noun phrase that is an
argument to the verb.

agent patient source destination instrument


• John drove Mary from Austin to Dallas in his Toyota Prius.
• The hammer broke the window.

• Also referred to as “case role analysis,” “thematic analysis,” and “shallow semantic parsing”
Textual Entailment (aka. Natural Language
Inference or NLI)
• Determine whether one natural language sentence entails
(implies) another under an ordinary interpretation
Textual Entailment Problems in PASCAL Challenge
Example pairs (TEXT → HYPOTHESIS → ENTAILMENT):

• TEXT: Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc last year.
  HYPOTHESIS: Yahoo bought Overture. → TRUE
• TEXT: Microsoft's rival Sun Microsystems Inc. bought Star Office last month and plans to boost its development as a Web-based device running over the Net on personal computers and Internet appliances.
  HYPOTHESIS: Microsoft bought Star Office. → FALSE
• TEXT: The National Institute for Psychobiology in Israel was established in May 1971 as the Israel Center for Psychobiology by Prof. Joel.
  HYPOTHESIS: Israel was established in May 1971. → FALSE
• TEXT: Since its formation in 1948, Israel fought many wars with neighboring Arab countries.
  HYPOTHESIS: Israel was established in 1948. → TRUE
NLI example
Information Extraction (IE)
• Identify phrases in language that refer to specific types of entities and
relations in text.
• Named entity recognition is the task of identifying names of people,
places, organizations, etc.
(entity types highlighted below: people, organizations, places)
• Michael Dell is the CEO of Dell Computer Corporation and lives in Austin, Texas.
• Relation extraction identifies specific relations between entities.
• Michael Dell is the CEO of Dell Computer Corporation and lives in Austin, Texas.
Temporal Information Tagging (Extraction)

Brenda Salenave Santana, Ricardo Campos, Evelin Amorim, Alípio Jorge, Purificação Silvano, Sérgio Nunes: A survey on narrative extraction from textual data. Artif. Intell. Rev. 56(8): 8393-8435 (2023)

Text Readability Assessment

Florian Pickelmann, Michael Färber, Adam Jatowt: Ablesbarkeitsmesser: A System for Assessing the Readability of German Text. ECIR (3) 2023: 288-293
Question Answering

What did Barack Obama teach?


Question Answering
• Directly answer natural language questions based on information
presented in a corpus of textual documents (e.g., the web).
• When was Barack Obama born? (factoid)
• August 4, 1961
• Who was president when Barack Obama was born?
• John F. Kennedy
• How many presidents have there been since Barack Obama was born?
• 9
Text Summarization (Abstractive)
• Produce a short summary of a longer document or article.
• Article: With a split decision in the final two primaries and a flurry of superdelegate
endorsements, Sen. Barack Obama sealed the Democratic presidential nomination last
night after a grueling and history-making campaign against Sen. Hillary Rodham Clinton
that will make him the first African American to head a major-party ticket. Before a
chanting and cheering audience in St. Paul, Minn., the first-term senator from Illinois
savored what once seemed an unlikely outcome to the Democratic race with a nod to the
marathon that was ending and to what will be another hard-fought battle, against Sen.
John McCain, the presumptive Republican nominee….
• Summary: Senator Barack Obama was declared the presumptive Democratic presidential nominee.
Text Summarization (Extractive)
Lindsay Lohan pleaded not guilty Wednesday to felony grand theft of a
$2,500 necklace, a case that could return the troubled starlet to jail rather
than the big screen. Saying it appeared that Lohan had violated her
probation in a 2007 drunken driving case, the judge set bail at $40,000 and
warned that if Lohan was accused of breaking the law while free he would
have her held without bail. The Mean Girls star is due back in court on Feb.
23, an important hearing in which Lohan could opt to end the case early.
Sentiment/Opinion Analysis
Machine Translation (MT)
• Translate a sentence from one natural language to another.
• Hasta la vista, bebé → Until we see each other again, baby.
• 我喜欢汉堡 → I like burgers.
Commonsense Reasoning
• the basic level of practical knowledge and reasoning
• concerning common situations and events
• that are commonly shared among most people

Y. Choi et al., ACL 2020 Commonsense Tutorial, https://maartensap.com/acl2020-commonsense/


Temporal Commonsense Reasoning Examples

Y. Choi et al., ACL 2020 Commonsense Tutorial, https://maartensap.com/acl2020-commonsense/


G. Wenzel, A. Jatowt: “An Overview of Temporal Commonsense Reasoning and Acquisition”, 2023, arxiv
Commonsense reasoning (knowledge bases)
Image2text, Text2Image
generation

https://openai.com/blog/dall-e/
Fake News Detection, Rumour & Bias Analysis
• Example: http://www.fakenewschallenge.org/
• “The goal of the Fake News Challenge is to explore how artificial intelligence technologies,
particularly machine learning and natural language processing, might be leveraged to combat the
fake news problem. We believe that these AI technologies hold promise for significantly
automating parts of the procedure human fact checkers use today to determine if a story is real or
a hoax.”
Example Pragmatic Tasks
Anaphora Resolution/Co-Reference

• Determine which phrases in a document refer to the same underlying entity.
• John put the carrot on the plate and ate it.

• Bush started the war in Iraq. But the president needed


the consent of Congress.
• Some cases require difficult reasoning.
• Today was Jack's birthday. Penny and Janet went to the store.
They were going to get presents. Janet decided to get a kite.
"Don't do that," said Penny. "Jack has a kite. He will make you
take it back."
Ellipsis Resolution
• Frequently words and phrases are omitted from sentences when
they can be inferred from context.

"Wise men talk because they have something to say;


fools, because they have to say something.“ (Plato)

"Wise men talk because they have something to say;


fools talk because they have to say something.“ (Plato)
Other Tasks
https://values.args.me/
Chatbots (Spoken Dialogue Systems)
Computational Social Science
• e.g., finding politics-
focused communities
in blogs

• e.g., detecting the


triggers of censorship
in blogs/ social media

• e.g., inferring power differentials in language use

Link structure in political blogs (Adamic and Glance 2005)
Computational Journalism

https://www.nytimes.com/2019/02/05/business/media/artificial-intelligence-journalism-robots.html
Computational Humanities, e.g.:
Text-driven forecasting
Discovery? Historical Book Example
• E.g., book in language we cannot understand

Voynich manuscript
Why was/is NLP hard?
• Language is a complex social process
• Human language is highly ambiguous:
• I ate pizza with friends vs.
• I ate pizza with olives vs.
• I ate pizza with a fork
• It is also ever-changing and evolving (e.g., Hashtags in Twitter)
• …
Why was/is NLP hard?
• Ambiguity at many levels:
• Word senses: bank (finance or river?)
• Part of speech: chair (noun or verb?)
• Syntactic structure: I saw a man with a telescope
• Quantifier scope: Every child loves some movie
• Multiple: I saw her duck
Ambiguity is Ubiquitous
• Speech Recognition
• “recognize speech” vs. “wreck a nice beach”
• “youth in Asia” vs. “euthanasia”
• Syntactic Analysis
• “I ate spaghetti with chopsticks” vs. “I ate spaghetti with
meatballs.”
• Semantic Analysis
• “The dog is in the pen.” vs. “The ink is in the pen.”
• “I put the plant in the window” vs. “Ford put the plant in Mexico”
• Pragmatic Analysis
• From “The Pink Panther Strikes Again”:
Clouseau: Does your dog bite?
Hotel Clerk: No.
Clouseau: [bowing down to pet the dog] Nice doggie.
[Dog barks and bites Clouseau in the hand]
Clouseau: I thought you said your dog did not bite!
Hotel Clerk: That is not my dog.
Humor and Ambiguity
• Many jokes rely on the ambiguity of language:
• Groucho Marx: One morning I shot an elephant in my pajamas. How he
got into my pajamas, I’ll never know.
• Policeman to little boy: “We are looking for a thief with a bicycle.” Little
boy: “Wouldn’t you be better using your eyes.”
• Agent criticized my apartment, so I knocked him flat.
• Why is the teacher wearing sun-glasses? Because the class is so bright.
Why is Language Ambiguous?
• Having a unique linguistic expression for every possible conceptualization that could
be conveyed would make language overly complex and linguistic expressions
unnecessarily long
• Allowing resolvable ambiguity permits shorter linguistic expressions, i.e., data
compression
• Language relies on people’s ability to use their knowledge and inference abilities to
properly resolve ambiguities
Natural Languages vs. Computer Languages
• Ambiguity is the primary difference between natural and computer
languages
• Formal programming languages are designed to be unambiguous, i.e., they can
be defined by a grammar that produces a unique parse for each sentence in
the language
Ambiguity Resolution is Required for Translation
• Syntactic and semantic ambiguities must be properly resolved for
correct translation:
• “John plays the guitar.” → “John toca la guitarra.”
• “John plays soccer.” → “John juega el fútbol.”
• Anecdotal examples of early MT systems giving the following results
when translating from English to Russian and then back to English:
• “The spirit is willing but the flesh is weak.” 
“The liquor is good but the meat is spoiled.”
• “Out of sight, out of mind.” 
“Invisible idiot.”
Ambiguity is explosive..
• Ambiguities compound to generate enormous numbers of possible interpretations.
• In English, a sentence ending in n prepositional phrases has over 2^n syntactic interpretations.
• “I saw the man with the telescope”: 2 parses
• “I saw the man on the hill with the telescope.”: 5 parses
• “I saw the man on the hill in Texas with the telescope”: 14 parses
• “I saw the man on the hill in Texas with the telescope at noon.”: 42 parses
• “I saw the man on the hill in Texas with the telescope at noon on Monday”: 132 parses
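(A side note, not on the slide: the parse counts 2, 5, 14, 42, 132 are consecutive Catalan numbers, which count the possible bracketings; a minimal sketch to reproduce them:)

from math import comb

def catalan(n: int) -> int:
    # n-th Catalan number: C(2n, n) / (n + 1)
    return comb(2 * n, n) // (n + 1)

# Parse counts for the sentences above (1 to 5 trailing PPs)
print([catalan(n) for n in range(2, 7)])  # [2, 5, 14, 42, 132]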
Importance of probability
• Unlikely interpretations of words can combine to generate
spurious ambiguity:
• “Time flies like an arrow” has 4 parses, including those meanings:
• Insects of a variety called “time flies” are fond of a particular arrow
• A command to record insects’ speed in the manner that an arrow would
• “The a are of I” is a valid English noun phrase
• “a” is an adjective for the letter A
• “are” is a noun for an area of land (as in hectare)
• “I” is a noun for the letter I
• Statistical methods allow computing most likely
interpretation by combining probabilistic evidence from a
variety of uncertain knowledge sources
Meaning can’t always be composed from individual words

Language is full of idioms

• And not just canned wisdom like “don’t count your chickens before they hatch”

We’re constantly using constructions that we couldn’t get from just a syntactic + semantic parse

• “I wouldn’t put it past him”, “They’re getting to me these days”, “That won’t go down well with the
boss”…

And even mixed constructions that can compositionally take arguments!

• “He won’t X, let alone Y”, “She slept the afternoon away”, “The bigger they are, the more expensive they
are”, “That travesty of a theory”
Many languages, domains and tasks..
Japanese example

syntactic parsing

word alignment
Language diversity: evidentiality
“In about a quarter of the world’s languages, every statement must specify the type
of source on which it is based”

Examples in Tariana
Language is dynamic
• It is also ever-changing and evolving (e.g., Hashtags in Twitter) or
newly coined terms (e.g., “to google”)
• Existing words changed meaning as well, e.g.:
• “nice” used to mean silly/foolish/simple
• “silly” meant things worthy or blessed
• “meat” denoted food in general
Brief history of NLP field
https://medium.com/nlplanet/a-brief-timeline-of-nlp-bc45b640f07d
Historical perspective
• 1950’s: Early days
• Foundational work: automata, information theory, etc.
• First speech systems
• Machine translation (MT) hugely funded by military
• Toy models: MT using basically word-substitution
• Optimism!
• Rationalism: approaches to design hand-crafted rules to incorporate knowledge and reasoning mechanisms
into intelligent NLP systems (e.g., ELIZA for simulating a Rogerian psychotherapist, MARGIE for structuring
real-world information into concept ontologies)
• 1960’s and 1970’s: NLP Winter
• Bar-Hillel (FAHQT: fully automatic high-quality translation) and ALPAC reports “kill” MT
• Work shifts to deeper models, syntax... but toy domains / grammars

The ALPAC report “Language and Machines”, released to the public in November 1966, recommended expenditures in two distinct areas: (1) computational linguistics, and (2) improvement of translation. It also suggested by inference that the pursuit of FAHQT is not a realistic goal in the immediate future, as reported in the Finite String:
“The committee sees, however, little justification at present for massive support of machine translation per se, finding it, overall, slower, less accurate and more costly than that provided by the human translator. The committee also finds that ... without recourse to human translation or editing ... there has been no machine translation of general scientific text, and none is in immediate prospect.”
Historical perspective
• 1980’s and 1990’s: The Empirical Revolution
• Expectations get reset
• Empiricism: characterized by the exploitation of data corpora and of (shallow) machine
learning and statistical models (e.g., Naive Bayes, HMMs, IBM translation models).
• Corpus-based methods become central
• Deep analysis often traded for robust and simple approximations
• Evaluate everything
• Initial annotated corpora developed for training and testing systems for POS tagging,
parsing, WSD, information extraction, MT, etc.
• First statistical machine translation systems developed at IBM for Canadian Hansards
corpus (Brown et al., 1990)
• First robust statistical parsers developed (Magerman, 1995; Collins, 1996; Charniak,
1997)
Historical perspective
• 2000+: Richer Statistical Methods
• Models increasingly merge linguistically sophisticated representations with statistical methods
• Begin to get both breadth and depth
• Increased use of a variety of ML methods, SVMs, logistic regression (i.e. max-ent), CRF’s, etc.
• Continued development of corpora and competitions on shared data.
• TREC Q/A
• SENSEVAL/SEMEVAL
• CONLL Shared Tasks (NER, SRL…)
• Increased emphasis on unsupervised, semi-supervised, and active learning as alternatives to purely
supervised learning.
• Shifting focus to semantic tasks such as WSD, SRL, and semantic parsing.
• Grounded Language: Connecting language to perception and action.
• Image and video description
• Visual question answering (VQA)
• Human-Robot Interaction (HRI) in NL

• 2011+: Deep Learning
• feature engineering (considered a bottleneck) is replaced with representation learning and/or deep neural networks (e.g., https://www.deepl.com/translator)
• A very influential paper in this revolution: [Collobert et al., 2011]

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of machine learning research, 12(Aug):2493–2537.
Brief historical perspective
• 2017+: Pretrained Language Models
• Transformers, massive datasets, and high compute
• Instruction tuning and reinforcement learning from human feedback
• GPT model family, Llama, etc.
• An influential paper in this revolution: [Vaswani et al., 2017]

Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017) {>100k citations}
Where are we now?

https://thelowdown.momentum.asia/the-emergence-of-large-language-models-llms/
And many new developments recently..
Some related fields
• Cognitive Science
• Figuring out how the human brain works
• Includes the bits that do language
• Humans: the only working NLP prototype..
• Speech Processing
• Mapping audio signals to text
• Traditionally separate from NLP, recently converging
• Two components: acoustic models and language models
• Language models in the domain of statistical or NN-based
NLP
• Computational Linguistics (CL)
Difference of NLP & CL
• Most conferences and journals that host natural language processing research
bear the name “computational linguistics” (e.g., ACL, NAACL, COLING)
• NLP and CL may be thought of as essentially synonymous
• While there is substantial overlap, there is an important focus difference
• CL is essentially linguistics supported by computational methods (similar to computational
biology, computational astronomy)
• In linguistics, language is the object of study
• NLP focuses on solving well-defined tasks involving human language (e.g., translation, query
answering, holding conversations, information extraction, machine reading)
• Fundamental linguistic insights may be crucial for accomplishing these tasks, but success is ultimately measured by whether and how well the job gets done according to the evaluation metrics used

Eisenstein, J. (2018). Natural language processing. Technical report, Georgia Tech.


Other related fields
• Artificial Intelligence & Machine Learning
• Formal Language (Automata) Theory
• Linguistics
• Psycholinguistics
• Philosophy of Language
• …
Basic Text Processing
Regular Expressions
Regular expressions
• A formal language for specifying text strings
• How can one search for any of these?
• elephant
• elephants
• Elephant
• Elephants
Regular expressions
• Practical language for specifying text strings, used in every computer
language, word processor and text processing tools (e.g., grep or Emacs)
• Especially useful for searching, given particular patterns and a corpus (a
document or document collection)
• Corpus: computer-readable text or speech
• Simplest case: sequence of simple characters:
• Expression /able/ matches any string containing the substring “able”
• It can also be an individual character /!/
Regular Expressions: Disjunctions
• Square brackets [] specify disjunction
Pattern → Matches
[wW]oodchuck → Woodchuck, woodchuck
[1234567890] → any digit

• Ranges [A-Z]
Pattern → Matches → Example
[A-Z] → an upper case letter → “Drenched Blossoms”
[a-z] → a lower case letter → “my beans were tasty”
[0-9] → a single digit → “Chapter 1: Down the Rabbit Hole”
[b-f] → any of: b, c, d, e, f → “Drenched Blossoms”
Regular Expressions: Negation in Disjunction
• Negations [^Ss]
• Caret ^ means negation only when it appears first in []

Pattern → Matches → Example
[^A-Z] → not an upper case letter → “Oyfn pripetchik”
[^Ss] → neither ‘S’ nor ‘s’ → “I have no exquisite reason”
[^.] → not a dot → “Dr. Lee is there.”
[^e^] → neither e nor ^ → “everything is great”
a^b → the pattern “a caret b” → “Look up a^b now”
Regular Expressions: More Disjunction
• The pipe | for disjunction is needed to specify disjunctions of
strings

Pattern → Matches
groundhog|woodchuck → groundhog, woodchuck
yours|mine → yours, mine
a|b|c = [abc]
[gG]roundhog|[Ww]oodchuck → Groundhog, groundhog, Woodchuck, woodchuck
Regular Expressions: ? * + .
Pattern → Meaning → Matches
colou?r → optional previous char (0 or 1) → color, colour
oo*h! → 0 or more of previous char → oh! ooh! oooh! ooooh!
o+h! → 1 or more of previous char → oh! ooh! oooh! ooooh!
baa+ → 1 or more of previous char → baa, baaa, baaaa, baaaaa
beg.n → . matches any character → begin, begun, beg3n
/aardvark.*aardvark/ → aardvark appears twice in a line → “aardvark is a complex word. aardvark”

(* and + are called Kleene * and Kleene +, after Stephen C. Kleene)
Regular Expressions: More on Disjunction,
Scoping
• Sometimes we need to group characters in parentheses to make them act as a single unit or to scope |:
• /gupp(y|ies)/ for disjunction of only suffixes “y” and “ies”

• Suppose we want to match a repeated string: Column 1 Column 2 Column 3. The expression /Column [0-9]+ */ will not work, but /(Column [0-9]+ *)*/ will do
Regular Expressions: Anchors ^ $
• Special characters to anchor regular expressions to particular places
in a string

Pattern → Matches
^[A-Z] → “Palo Alto”
^[^A-Za-z] → “1 “Hello””
\.$ → “The end.”
.$ → “The end?”, “The end!”
\bthe\b → “the car” (but not “other”)
Example
• Find all instances of the word “the” in a text
the
→ misses capitalized examples
[tT]he
→ incorrectly returns other or theology
[^a-zA-Z][tT]he[^a-zA-Z]
→ [^a-zA-Z] implies that there must be some single (although non-alphabetic) character on each side, so “The” at the very beginning of a line is still missed
(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)
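A quick sanity check of the final pattern (a sketch using Python's re module; the test strings here are just illustrative):

import re

# Final pattern from the slide: "the"/"The" not embedded in a longer
# alphabetic string, also allowed at the start/end of the text.
pattern = re.compile(r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)")

for text in ["The car", "keep the change", "other", "theology"]:
    print(f"{text!r}: {'match' if pattern.search(text) else 'no match'}")
# "The car" and "keep the change" match; "other" and "theology" do not.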
Errors
• The refinement process in the previous slide was based on
fixing two kinds of errors
• Matching strings that should not be matched (e.g., there, then, other)
• False positives (Type I)
• Not matching things that we should have matched (e.g., The)
• False negatives (Type II)
Errors cont.
• In NLP we always deal with these kinds of errors
• Reducing the error rate for an application often involves two
antagonistic efforts:
• Increasing precision (minimizing false positives)
• Increasing coverage or recall (minimizing false negatives)
Substitutions
• Substitution in Python and UNIX commands:

• s/regexp1/pattern/
• e.g.:
• s/colour/color/
Capture Groups
• Say we want to put angles around all numbers:
the 35 boxes → the <35> boxes
• Use parentheses () to "capture" a pattern into a numbered register
(1, 2, 3…)
• Use \1 to refer to the contents of the register
s/([0-9]+)/<\1>/
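In Python the same substitutions can be written with re.sub (a sketch; the s/…/…/ notation above is the sed/Perl form):

import re

# s/colour/color/
print(re.sub("colour", "color", "my favourite colour"))  # my favourite color

# s/([0-9]+)/<\1>/ -- capture the number, reuse it as \1 in the replacement
print(re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes"))       # the <35> boxes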
Capture groups: multiple registers
• /the (.*)er they (.*), the \1er we \2/

• Matches
the faster they ran, the faster we ran
• But not
the faster they ran, the faster we ate
But suppose we don't want to capture some
elements?
• Parentheses have a double function: grouping terms, and
capturing
• Non-capturing groups:
• add a ?: after parenthesis
• /(?:some|a few) (people|cats) like some \1/
• matches
• some cats like some cats
• but not
• some cats like some a few
Simple Application: ELIZA
• Early NLP system that imitated a Rogerian psychotherapist
(Weizenbaum, 1966)

• Uses pattern matching to match, e.g.,:


• “I need X”
and translates them into, e.g.
• “What would it mean to you if you got X?”

Weizenbaum, J. (1966). ELIZA – A computer program for the study of natural language communication between man and machine. CACM 9(1), 36–45
Simple Application: ELIZA
Men are all alike.
IN WHAT WAY
They're always bugging us about something or other.
CAN YOU THINK OF A SPECIFIC EXAMPLE
Well, my boyfriend made me come here.
YOUR BOYFRIEND MADE YOU COME HERE
He says I'm depressed much of the time.
I AM SORRY TO HEAR YOU ARE DEPRESSED
How ELIZA works?
• s/.* I’M (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
• s/.* I AM (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
• s/.* all .*/IN WHAT WAY?/
• s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE?/
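A toy re-implementation of these rules (a minimal Python sketch, not Weizenbaum's original program; the input is uppercased first so the patterns match, and the PLEASE GO ON fallback is an assumption, not from the slide):

import re

# Ordered (pattern, response) rules; \1 copies the captured word
RULES = [
    (r".* I'M (DEPRESSED|SAD) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* I AM (DEPRESSED|SAD) .*", r"WHY DO YOU THINK YOU ARE \1"),
    (r".* ALL .*", "IN WHAT WAY?"),
    (r".* ALWAYS .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE?"),
]

def eliza(utterance: str) -> str:
    text = utterance.upper()
    for pattern, response in RULES:
        if re.match(pattern, text):
            return re.sub(pattern, response, text)
    return "PLEASE GO ON"  # fallback rule (assumed)

print(eliza("He says I'm depressed much of the time."))
# I AM SORRY TO HEAR YOU ARE DEPRESSED
print(eliza("They're always bugging us about something or other."))
# CAN YOU THINK OF A SPECIFIC EXAMPLE?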
Example of Eliza in action

Weizenbaum, J. (1966). ELIZA – A computer program for the study of natural language communication between man and machine. CACM 9(1), 36–45
History does not repeat itself (but it rhymes)
• Recent article in Guardian
• Compares ChatGPT and
Weizenbaum's ELIZA
• https://www.theguardian.com/technology/20
23/jul/25/joseph-weizenbaum-inventor-eliza-
chatbot-turned-against-artificial-intelligence-ai
History of Conversational Systems

https://ecai-tutorial-ijcai23.github.io/assets/docs/IJCAI23-Tutorial-Final.pdf
Summary
• Regular expressions play a surprisingly large role
• Sophisticated sequences of regular expressions are often the first model for any text processing task
• For many hard tasks, we use machine learning classifiers and now
increasingly more LLMs
• But regular expressions can be used for preprocessing or as features in
the classifiers
• Can be very useful in capturing generalizations
Basic Text Processing
Word tokenization
Text Normalization
• Nearly every NLP task needs to do text normalization:
1. Segmenting/tokenizing words in running text
2. Normalizing word formats
3. Segmenting sentences in running text
How many words?
• “The University of Innsbruck is located in the capital of Tyrol.”
• 11
• “I do uh main- mainly business data processing”
• Fragments, filled pauses (fillers) – can be considered as words in
some cases (e.g., for speech recognition systems, speaker
identification)
• “Seuss’s cat in the hat is different from other cats!”
• Lemma: canonical, dictionary (or citation) form of a word
• cat and cats have the same lemma
• Wordform: the full inflected surface form
• cat and cats are different wordforms
How many tokens and types?
“they lay back on the San Francisco grass and look at the stars and their”

• Type: an element of the vocabulary


• Token: an instance of that type in running text
(A standard word count tells the number of tokens in text)
• How many tokens and types in the sentence at the top?
• 15 tokens (or 14)
• 13 types (or 12)
Type-Token Ratio of Text

• TTR(d) = |V(d)| / N(d), the number of types divided by the number of tokens in document d
• An index of lexical diversity (different from syntactic complexity), often
used to measure text complexity or vocabulary richness
• Can be used for instance for analysis of freshman compositions, studies
of childhood acquisition of language, etc.
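A sketch of the windowed variant used in the assignment at the end of this lecture (plain TTR depends heavily on text length, so it is usually averaged over fixed-size windows; the 1,000-token window is the assignment's choice):

def windowed_ttr(tokens: list[str], window: int = 1000) -> float:
    """Average type-token ratio over non-overlapping windows of `window` tokens."""
    if len(tokens) < window:  # text shorter than one window: plain TTR
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(0, len(tokens) - window + 1, window)
    ]
    return sum(ratios) / len(ratios)

# Usage: windowed_ttr(open("book.txt", encoding="utf-8").read().lower().split())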

(Example) Scatter plot of age and lexical diversity (D), with quadratic regression line, for typically developing Cantonese-speaking children (N = 70) [Klee 2004]

Klee, Thomas, et al. "Utterance length and lexical diversity in Cantonese-speaking children with and
without specific language impairment." Journal of Speech, Language, and Hearing Research (2004)
TTR and Text Length
• The longer the text, the less likely it is that novel vocabulary will
be introduced.
• Longer texts might lean more towards the tokens side of the ratio: more words (tokens) are added, but fewer and fewer of them represent unique words (types).
• Tokens increase linearly, while types do not
How large is the vocabulary of English (or any other language)?
N = number of tokens
V = vocabulary = set of types; |V| is the size of the vocabulary
Church and Gale (1990): |V| > O(N½)

Corpora → Tokens = N → Types = |V|
Switchboard phone conversations → 2.4 million → 20 thousand
Shakespeare → 884,000 → 31 thousand
Google N-grams → 1 trillion → 13 million (that appear > 40 times)

Corpus (plur. Corpora)
• Corpus: a computer-readable collection of texts
• E.g., Brown Corpus, a million-word collection of samples from 500 written English texts of different genres (news, fiction, academic, etc.)
• In NLP we often process various corpora
• Usually the larger, the better as then they are likely more representative for
various linguistic phenomena
• Corpora vary by language/dialect, genre, author demographics, etc.
• over 6k-7k recognized languages in the world
Vocabulary Growth: Heaps Law
• As document collection grows, so does the size of its vocabulary
(total number of different words)
• Fewer new words are found when collection is already large
• Observed relationship (Heaps’ Law):
|V| = k·N^β
where |V| = vocabulary size (the number of unique words),
N = total number of words in the document collection,
k, β = parameters that vary for each document collection
(typical values are 10 ≤ k ≤ 100 and β ≈ 0.5)


AP89 Example: the Associated Press collection of news stories from 1989

Total documents: 84,678
Total word occurrences: 39,749,179
Vocabulary size: 198,763
Words occurring > 1,000 times: 4,169
Words occurring once: 70,064
Heaps’ Law Predictions
• Predictions for TREC collections are accurate for large numbers of
words
• e.g., first 10,879,522 words of the AP89 collection scanned
• prediction is 100,151 unique words
• actual number is 100,024
• Predictions for small numbers of words (i.e. N < 1000) are much
worse
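A sketch reproducing the prediction above; the fit parameters are an assumption here (k ≈ 62.95 and β ≈ 0.455 are the values reported for AP89 in Croft et al.'s "Search Engines: Information Retrieval in Practice"):

def heaps(n: float, k: float = 62.95, beta: float = 0.455) -> float:
    """Heaps' law prediction |V| = k * N^beta (assumed AP89 fit parameters)."""
    return k * n ** beta

print(round(heaps(10_879_522)))
# ~100,150 -- close to the slide's predicted 100,151 (actual: 100,024)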
GOV2 (Web) Example
Web Example
• Heaps’ Law works with very large document collections
• new words occurring even after already seeing 30 million words!
• New words come from a variety of sources
• spelling errors, invented words (e.g. product, company names), code,
other languages, email addresses, etc.
• Search engines must deal with these large and growing
vocabularies
Tokenization
• What is a word?
• A word is any sequence of alphabetical characters between whitespaces
that is not a punctuation mark..
• Later we’ll ask more questions about words, e.g.:
• How can we identify different word classes (parts of speech)?
• What is the meaning of words?
• How can we represent that?
Simple Tokenization in UNIX
• Naïve tokenization algorithm
• Given a text file, output the word tokens and their frequencies

tr -sc 'A-Za-z' '\n' < shakes.txt    ← change all non-alpha characters to newlines
| sort                               ← sort in alphabetical order
| uniq -c                            ← merge and count each type
1945 A
72 AARON
19 ABBESS
5 ABBOT
... ...

Taken from Church, Kenneth Ward. "Unix™ for poets." Notes of a course from the European Summer School on
Language and Speech Communication, Corpus Based Methods (1994).
The first step: tokenizing
tr -sc 'A-Za-z' '\n' < shakes.txt | head

THE
SONNETS
by
William
Shakespeare
From
fairest
creatures
We
...
The second step: sorting
tr -sc 'A-Za-z' '\n' < shakes.txt | sort | head

A
A
A
A
A
A
A
A
A
...
Counting
• Merging upper and lower case
tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c
• Sorting by the counts
tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r

23243 the
22225 i
18618 and
16339 to
15687 of
12780 a
12163 you
10839 my
10005 in
 8954 d      ← Why this one? I’d: I had or I would or I should..
Word Frequency
• What we have actually obtained in the previous slide is a frequency
distribution of words in Shakespeare texts
• Word frequency: the number of occurrences of a word type in a text (or
in a collection of texts)

• You may have heard statements such as “adults know about 30,000
words”, “you need to know at least 5,000 words to be fluent”
• Such statements do not refer to inflected word forms (take/takes/taking/taken/took) but to lemmas or dictionary forms (take), and assume that if you know a lemma, you know all its inflected forms too
Zipf's Law
• How many words occur once, twice, 100 times, 1000
times?

• Zipf's law:
• rank (r) of a word multiplied by its frequency (f) is approximately constant (k)
• assuming words are ranked in order of decreasing frequency
• r × f ≈ k
• or
• r × Pr ≈ c
• Pr is the probability of word occurrence, and c ≈ 0.1 for English
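A sketch that checks r × f ≈ k on any tokenized text and draws the Zipf curve (assuming matplotlib is available; on log-log axes the curve is roughly a straight line with slope -1):

from collections import Counter
import matplotlib.pyplot as plt

def zipf(tokens: list[str]) -> None:
    freqs = sorted(Counter(tokens).values(), reverse=True)
    # r * f should stay roughly constant across ranks
    for r in (1, 10, 100, 1000):
        if r <= len(freqs):
            print(f"rank {r}: f = {freqs[r - 1]}, r*f = {r * freqs[r - 1]}")
    # Zipf curve: rank vs. frequency on log-log axes
    plt.loglog(list(range(1, len(freqs) + 1)), freqs)
    plt.xlabel("rank")
    plt.ylabel("frequency")
    plt.show()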
News Collection (AP89) Statistics

Word → Freq. → r → Pr(%) → r·Pr
assistant → 5,095 → 1,021 → 0.013 → 0.13
sewers → 100 → 17,110 → 2.56 × 10^-4 → 0.04
toothbrush → 10 → 51,555 → 2.56 × 10^-5 → 0.01
hazmat → 1 → 166,945 → 2.56 × 10^-6 → 0.04
Top 50 Words from AP89

The two most common words (“the”, “of”) make up about 10% of all word occurrences in text documents
Zipf’s Law: Probability of Word vs. its Rank

• A small number of events (e.g. words) occur with high frequency (mostly closed-class words like the, be, to, of,
and, a, in, that,...)
• A large number of events occur with very low frequency (all open class words)
Zipf’s Law for AP89

Some “problems” at very high and low frequencies


Implications of Zipf’s Law
• Good News:
• Stop words (commonly occurring words such as “the”, “a”) will account for a large
fraction of text so eliminating them greatly reduces size of vocabulary of any text
• We have seen these words often enough that we know (almost) everything about them.
These words will help us get at the structure (and possibly meaning) of text
• Bad News:
• For most words, gathering sufficient data for meaningful statistical analysis (e.g. for
correlation analysis for query expansion) is difficult since they are extremely rare.
• We know something about these words, but haven’t seen them often enough to know
everything about them. They may occur with a meaning or a part of speech we haven’t
seen before
• Any text may contain a number of words that are unknown to us. We have never seen
them before, but we still need to get at the structure (and meaning) of these texts
More on Tokenization
• The simple tokenization approach we have seen so far is fine for
getting rough statistics but we could do better
• Let’s see some real-world issues here
Issues in Tokenization
• Finland’s capital → Finland Finlands Finland’s ?
• what’re, I’m, isn’t → What are, I am, is not ?
• Hewlett-Packard → Hewlett Packard ?
• state-of-the-art → state of the art ?
• Lowercase → lower-case lowercase lower case ?
• m.p.h., Ph.D. →?
• $45.55 →?
• 01/02/19 →?
• #ACL →?
Words aren’t just
defined by blanks..
Issues in Tokenization
• Hong Kong → “Hong”, “Kong” or “Hong Kong” ?
• New York-based → ?
• rock ‘n’ roll → ?

(The above examples require multiword expression dictionary)


(Tokenization is actually also tied with named entity recognition that we
will study later)
Informal spelling, emoticons, hashtags..
Tokenization: language issues
• French (clitic contractions marked by apostrophes)
• L'ensemble → one token or two?
• L ? L’ ? Le ?
• Want l’ensemble to match with un ensemble

• German noun compounds are not segmented


• Lebensversicherungsgesellschaftsangestellter
• ‘life insurance company employee’
• German text processing and information retrieval usually needs
compound splitter
Tokenization: language issues
• Chinese and Japanese languages have no spaces between words:
• 莎拉波娃现在居住在美国东南部的佛罗里达。
• 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
• Sharapova now lives in US southeastern Florida
• Further complicated in Japanese, with multiple alphabets mixed
in a sentence
• Also dates/amounts in multiple formats
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)

Katakana Hiragana Kanji Romaji


End-user could express query entirely in hiragana!
Word Tokenization in Chinese
• Also called Word Segmentation
• Chinese words are composed of characters
• Characters are generally 1 syllable
• Each character generally represents a single unit of meaning (called a
morpheme) and is pronounceable as a single syllable
• Average word is 2.4 characters long
• Standard baseline segmentation algorithm:
• Maximum Matching (aka. greedy longest-match-first decoding (MaxMatch))
Maximum Matching
Word Segmentation Algorithm
Given a wordlist of Chinese, and a string:
1) Start a pointer at the beginning of the string
2) Find the longest word in dictionary that matches the string starting at
pointer
3) Move the pointer over the word in string
4) Go to 2
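A direct Python transcription of the algorithm (a sketch; the toy wordlist below is hypothetical and only serves to reproduce the failure case on the next slide):

def max_match(text: str, wordlist: set[str]) -> list[str]:
    """Greedy longest-match-first segmentation (MaxMatch)."""
    words = []
    pointer = 0
    while pointer < len(text):
        # Find the longest dictionary word starting at `pointer`
        for end in range(len(text), pointer, -1):
            if text[pointer:end] in wordlist:
                words.append(text[pointer:end])
                pointer = end
                break
        else:
            # No dictionary word found: emit a single character and move on
            words.append(text[pointer])
            pointer += 1
    return words

# Toy English lexicon (hypothetical) reproducing the failure below:
lexicon = {"the", "theta", "table", "bled", "down", "own", "there"}
print(max_match("thetabledownthere", lexicon))
# ['theta', 'bled', 'own', 'there'] -- the greedy match picks "theta" first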
Max-match Segmentation Illustration
• Thecatinthehat → the cat in the hat
• Thetabledownthere → the table down there (intended)
                    → theta bled own there (MaxMatch output)
• Doesn’t generally work in English..

• But works astonishingly well in Chinese


• 莎拉波娃现在居住在美国东南部的佛罗里达。
• 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
• Modern probabilistic or NN-based segmentation algorithms work even better
Character as basic input in Chinese?
• In fact, for most Chinese NLP tasks it turns out to work better to
take characters rather than words as input, since characters are at
a reasonable semantic level for most applications
• However, for Japanese and Thai the character is too small a unit,
and so algorithms for word segmentation are required
Phrases
• In IR many search queries are 2-3 word long phrases
• Phrases are:
• More precise than single words
• e.g., documents containing “black sea” vs. two words “black” and “sea”
• Less ambiguous
• e.g., “big apple” vs. “apple”
• Can be useful for ranking
• e.g., Given query “fishing supplies”, how do we score documents with exact phrase many
times, exact phrase just once, individual words in same sentence, same paragraph, whole
document, variations on words?
Example Noun Phrases
Phrases Detection
• Text processing issue – how are phrases recognized?
• Possible approaches:
• Identify syntactic phrases using a part-of-speech (POS) tagger
• Use word n-grams
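A sketch of the word n-gram approach (frequent n-grams become phrase candidates; the sample sentence is made up):

from collections import Counter

def word_ngrams(tokens: list[str], n: int = 2) -> Counter:
    """Count word n-grams; frequent ones are candidate phrases."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "the black sea borders the black sea coast".split()
print(word_ngrams(tokens).most_common(2))
# [(('the', 'black'), 2), (('black', 'sea'), 2)]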
Summary: Issues in Tokenization
• Can't just blindly remove punctuation:
• m.p.h., Ph.D., AT&T, cap’n.
• prices ($45.55) and dates (01/02/06); URLs; (http://www.stanford.edu),
hashtags (#nlproc), email addresses ([email protected]).
• Clitics: a part of a word that can't stand on its own
• we're → we are, French j'ai, l'honneur
• Can "Multiword Expressions” (MWE) be words?
• New York, rock ’n’ roll
Tokenization Standards
• Any actual NLP system will assume a particular tokenization standard
• Because so much NLP is based on systems that are trained on particular corpora (text
datasets) that everybody uses, these corpora often define a de facto standard
• Penn Treebank 3 standard (separates out clitics so “doesn’t” becomes “does” plus “n’t”,
keeps hyphenated words together, and separates out all punctuation):
• Input:
• "The San Francisco-based restaurant," they said, "doesn’t charge $10".
• Output:
• " The San Francisco-based restaurant , " they said , " does n’t charge $ 10 " .

Good practice: be aware of, and better write down, any normalization
(tokenization, lowercasing, spell-checking, ...) steps that your system does
Assignment
Next week’s Assignment
• Pick up two (possibly quite different) books from Project Gutenberg
(https://www.gutenberg.org/)
1. Show the 100 most common words for both books (aligned side by side for easy comparison) after their tokenization
2. Plot and compare Zipf curves for both of them
3. Compute type-to-token ratio for the two books (avg TTR over non-overlapping windows
of 1k tokens). Consider the impact of text length in comparison.
4. Explore the relation between word frequency and word length in both books based on their 1,000 most frequent words.
5. Discuss any observations
6. Upload the report in pdf to OLAT by March 20th, 08:30
Paper 1
Paper 2
Paper 3
Thank you!
