Lecture 2
Example:
Subject, Verb, and Object appear in SVO order
Subject pronouns (I/she/he/they) have to be subjects, Object pronouns
(me/her/him/them) have to be objects
Levels of analysis:
• Acoustic/Phonetic: sound waves → words
• Syntax: words → parse trees
• Semantics: parse trees → literal meaning
• Pragmatics: literal meaning → meaning (contextualized)
NLP
• Natural Language Processing
• Large field: processing natural language text involves a variety of
syntactic, semantic, and pragmatic tasks, in addition to other problems
Example Syntactic Tasks
Word Segmentation
• Breaking a string of characters into a sequence of words
• In some written languages (e.g. Chinese, Japanese) words are not
separated by spaces
• Even in English, characters other than white-space can be used to
separate words [e.g. , ; . - : ( ) ]
• Examples from English URLs:
• jumptheshark.com → jump the shark .com
• myspace.com/pluckerswingbar →
myspace .com pluckers wing bar
myspace .com plucker swing bar
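Where spaces are missing, a classic baseline is greedy maximum matching against a lexicon. Below is a minimal sketch; the tiny wordlist is hypothetical, and real systems use large lexicons (and smarter search, since greedy matching can fail):

# Greedy maximum matching: repeatedly take the longest lexicon entry
# that prefixes the remaining text. The wordlist here is made up.
WORDLIST = {"jump", "the", "shark", "my", "space", "myspace",
            "pluckers", "plucker", "swing", "wing", "bar"}

def max_match(text: str, wordlist: set[str]) -> list[str]:
    tokens = []
    while text:
        for end in range(len(text), 0, -1):         # longest prefix first
            if text[:end] in wordlist or end == 1:  # fall back to 1 char
                tokens.append(text[:end])
                text = text[end:]
                break
    return tokens

print(max_match("jumptheshark", WORDLIST))     # ['jump', 'the', 'shark']
print(max_match("pluckerswingbar", WORDLIST))  # ['pluckers', 'wing', 'bar']

Note it returns only one of the two readings above; resolving such ambiguity needs more than greedy matching.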
Morphological Analysis
• Morphology is the field of linguistics that studies the internal structure of words.
• A morpheme is the smallest linguistic unit that has some meaning (Wikipedia)
• e.g. “carry”, “pre”, “ed”, “ly”, “s”
• Morphological analysis is the task of segmenting a word into its morphemes:
• carried → carry + ed (past tense)
• independently → in + (depend + ent) + ly
• Googlers → (Google + er) + s (plural)
• unlockable → un + (lock + able) ?
(un + lock) + able ?
Part Of Speech (POS) Tagging
• Annotate each word in a sentence with a part-of-speech tag, e.g. I/PRP ate/VBD the/DT spaghetti/NN
Phrase Chunking
• Identify the non-recursive phrases (base NPs, VPs, PPs) in a sentence:
• [NP I] [VP ate] [NP the spaghetti] [PP with] [NP meatballs].
• [NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September]
Phrase Chunking Example
Brenda Salenave Santana, Ricardo Campos, Evelin Amorim, Alípio Jorge, Purificação Silvano, Sérgio Nunes: A survey on narrative extraction from textual data. Artif. Intell. Rev. 56(8): 8393-8435 (2023)
Syntactic Parsing
• Produce the correct syntactic parse tree for a sentence.
Example Semantic Tasks
Word Sense Disambiguation (WSD)
• Words in natural language usually have a fair number of different
possible meanings.
• Ellen has a strong interest in computational linguistics.
• Ellen pays a large amount of interest on her credit card.
• For many tasks (e.g., question answering, translation), the proper sense
of each ambiguous word in a sentence must be determined.
Semantic Role Labeling (SRL)
• For each clause, determine the semantic role played by each noun phrase that is an
argument to the verb.
• Also referred to as “case role analysis,” “thematic analysis,” or “shallow semantic parsing”
Textual Entailment (aka Natural Language Inference, NLI)
• Determine whether one natural language sentence entails
(implies) another under an ordinary interpretation
Textual Entailment Problems in PASCAL Challenge
TEXT: Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc last year.
HYPOTHESIS: Yahoo bought Overture. → ENTAILMENT: TRUE
TEXT: Since its formation in 1948, Israel fought many wars with neighboring Arab countries.
HYPOTHESIS: Israel was established in 1948. → ENTAILMENT: TRUE
NLI example
Information Extraction (IE)
• Identify phrases in language that refer to specific types of entities and
relations in text.
• Named entity recognition is the task of identifying names of people,
places, organizations, etc.
• Michael Dell (person) is the CEO of Dell Computer Corporation (organization) and lives in Austin, Texas (places).
• Relation extraction identifies specific relations between entities.
• Michael Dell is the CEO of Dell Computer Corporation and lives in Austin, Texas.
Temporal Information Tagging (Extraction)
Brenda Salenave Santana, Ricardo Campos, Evelin Amorim, Alípio Jorge, Purificação Silvano, Sérgio Nunes: A survey on narrative extraction from textual data. Artif. Intell. Rev. 56(8): 8393-8435 (2023)
Florian Pickelmann, Michael Färber, Adam Jatowt: Ablesbarkeitsmesser: A System for Assessing the Readability of German Text. ECIR (3) 2023: 288-293
Question Answering
https://maartensap.com/acl2020-commonsense
https://openai.com/blog/dall-e/
Fake News Detection, Rumour & Bias Analysis
• Example: http://www.fakenewschallenge.org/
• “The goal of the Fake News Challenge is to explore how artificial intelligence technologies,
particularly machine learning and natural language processing, might be leveraged to combat the
fake news problem. We believe that these AI technologies hold promise for significantly
automating parts of the procedure human fact checkers use today to determine if a story is real or
a hoax.”
Example Pragmatic Tasks
Anaphora Resolution/Co-Reference
• e.g., inferring power differentials in language use
(Figure: link structure in political blogs; Adamic and Glance 2005)
Computational Journalism
https://www.nytimes.com/2019/02/05/business/media/artificial-intelligence-journalism-robots.html
Computational Humanities, e.g.:
Text-driven forecasting
Discovery? Historical Book Example
• E.g., book in language we cannot understand
Voynich manuscript
Why was/is NLP hard?
• Language is a complex social process
• Human language is highly ambiguous:
• I ate pizza with friends vs.
• I ate pizza with olives vs.
• I ate pizza with a fork
• It is also ever-changing and evolving (e.g., Hashtags in Twitter)
• …
Why was/is NLP hard?
• Ambiguity at many levels:
• Word senses: bank (finance or river?)
• Part of speech: chair (noun or verb?)
• Syntactic structure: I saw a man with a telescope
• Quantifier scope: Every child loves some movie
• Multiple: I saw her duck
Ambiguity is Ubiquitous
• Speech Recognition
• “recognize speech” vs. “wreck a nice beach”
• “youth in Asia” vs. “euthanasia”
• Syntactic Analysis
• “I ate spaghetti with chopsticks” vs. “I ate spaghetti with
meatballs.”
• Semantic Analysis
• “The dog is in the pen.” vs. “The ink is in the pen.”
• “I put the plant in the window” vs. “Ford put the plant in Mexico”
• Pragmatic Analysis
• From “The Pink Panther Strikes Again”:
Clouseau: Does your dog bite?
Hotel Clerk: No.
Clouseau: [bowing down to pet the dog] Nice doggie.
[Dog barks and bites Clouseau in the hand]
Clouseau: I thought you said your dog did not bite!
Hotel Clerk: That is not my dog.
Humor and Ambiguity
• Many jokes rely on the ambiguity of language:
• Groucho Marx: One morning I shot an elephant in my pajamas. How he
got into my pajamas, I’ll never know.
• Policeman to little boy: “We are looking for a thief with a bicycle.” Little
boy: “Wouldn’t you be better using your eyes.”
• Agent criticized my apartment, so I knocked him flat.
• Why is the teacher wearing sun-glasses? Because the class is so bright.
Why is Language Ambiguous?
• Having a unique linguistic expression for every possible conceptualization that could
be conveyed would make language overly complex and linguistic expressions
unnecessarily long
• Allowing resolvable ambiguity permits shorter linguistic expressions, i.e., data
compression
• Language relies on people’s ability to use their knowledge and inference abilities to
properly resolve ambiguities
Natural Languages vs. Computer Languages
• Ambiguity is the primary difference between natural and computer
languages
• Formal programming languages are designed to be unambiguous, i.e., they can
be defined by a grammar that produces a unique parse for each sentence in
the language
Ambiguity Resolution is Required for Translation
• Syntactic and semantic ambiguities must be properly resolved for
correct translation:
• “John plays the guitar.” → “John toca la guitarra.”
• “John plays soccer.” → “John juega el fútbol.”
• Anecdotal examples of early MT systems giving the following results
when translating from English to Russian and then back to English:
• “The spirit is willing but the flesh is weak.”
“The liquor is good but the meat is spoiled.”
• “Out of sight, out of mind.”
“Invisible idiot.”
Ambiguity is Explosive
• Ambiguities compound to generate enormous numbers of possible interpretations.
• In English, a sentence ending in n prepositional phrases has over 2^n syntactic interpretations:
• “I saw the man with the telescope”: 2 parses
• “I saw the man on the hill with the telescope.”: 5 parses
• “I saw the man on the hill in Texas with the telescope”: 14 parses
• “I saw the man on the hill in Texas with the telescope at noon.”: 42 parses
• “I saw the man on the hill in Texas with the telescope at noon on Monday” 132 parses
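As an aside, the parse counts above (2, 5, 14, 42, 132) are the Catalan numbers, which count binary bracketings of attachment ambiguities. A quick sanity check in Python:

from math import comb

def catalan(n: int) -> int:
    """n-th Catalan number: C(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)

# A sentence ending in n PPs has catalan(n + 1) attachment parses:
for n_pps in range(1, 6):
    print(f"{n_pps} PP(s): {catalan(n_pps + 1)} parses")
# 1 PP(s): 2 ... 5 PP(s): 132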
Importance of probability
• Unlikely interpretations of words can combine to generate
spurious ambiguity:
• “Time flies like an arrow” has 4 parses, including these readings:
• Insects of a variety called “time flies” are fond of a particular arrow
• A command to record insects’ speed in the manner that an arrow would
• “The a are of I” is a valid English noun phrase
• “a” is an adjective for the letter A
• “are” is a noun for an area of land (as in hectare)
• “I” is a noun for the letter I
• Statistical methods allow computing most likely
interpretation by combining probabilistic evidence from a
variety of uncertain knowledge sources
Meaning can’t always be composed from individual words
• And not just canned wisdom like “don’t count your chickens before they hatch”
• We’re constantly using constructions whose meaning we couldn’t get from just a syntactic + semantic parse:
• “I wouldn’t put it past him”, “They’re getting to me these days”, “That won’t go down well with the boss”…
• “He won’t X, let alone Y”, “She slept the afternoon away”, “The bigger they are, the more expensive they
are”, “That travesty of a theory”
Many languages, domains and tasks..
(Figure: Japanese example, syntactic parsing and word alignment)
Language diversity: evidentiality
“In about a quarter of the world’s languages, every statement must specify the type
of source on which it is based”
Examples in Tariana
Language is dynamic
• It is also ever-changing and evolving (e.g., Hashtags in Twitter) or
newly coined terms (e.g., “to google”)
• Existing words have changed meaning over time as well, e.g.:
• “nice” used to mean silly/foolish/simple
• “silly” used to mean worthy or blessed
• “meat” denoted food in general
Brief history of NLP field
https://medium.com/nlplanet/a-brief-timeline-of-nlp-bc45b640f07d
Historical perspective
• 1950’s: Early days
• Foundational work: automata, information theory, etc.
• First speech systems
• Machine translation (MT) hugely funded by military
• Toy models: MT using basically word-substitution
• Optimism!
• Rationalism: approaches to design hand-crafted rules to incorporate knowledge and reasoning mechanisms
into intelligent NLP systems (e.g., ELIZA for simulating a Rogerian psychotherapist, MARGIE for structuring
real-world information into concept ontologies)
• 1960’s and 1970’s: NLP Winter
• The Bar-Hillel (FAHQT: fully automatic high-quality translation) and ALPAC reports “kill” MT
• Work shifts to deeper models, syntax... but toy domains / grammars
The ALPAC report “Language and Machines”, released to the public in November 1966, recommended expenditures in two distinct areas: (1) computational linguistics, and (2) improvement of translation. It also suggested by inference that the pursuit of FAHQT is not a realistic goal in the immediate future, as reported in the Finite String:
“The committee sees, however, little justification at present for massive support of machine translation per se, finding it, overall, slower, less accurate and more costly than that provided by the human translator. The committee also finds that … without recourse to human translation or editing … there has been no machine translation of general scientific text, and none is in immediate prospect.”
Historical perspective
• 1980’s and 1990’s: The Empirical Revolution
• Expectations get reset
• Empiricism: characterized by the exploitation of data corpora and of (shallow) machine
learning and statistical models (e.g., Naive Bayes, HMMs, IBM translation models).
• Corpus-based methods become central
• Deep analysis often traded for robust and simple approximations
• Evaluate everything
• Initial annotated corpora developed for training and testing systems for POS tagging,
parsing, WSD, information extraction, MT, etc.
• First statistical machine translation systems developed at IBM for Canadian Hansards
corpus (Brown et al., 1990)
• First robust statistical parsers developed (Magerman, 1995; Collins, 1996; Charniak,
1997)
Historical perspective
• 2000+: Richer Statistical Methods
• Models increasingly merge linguistically sophisticated representations with statistical methods
• Begin to get both breadth and depth
• Increased use of a variety of ML methods, SVMs, logistic regression (i.e. max-ent), CRF’s, etc.
• Continued development of corpora and competitions on shared data.
• TREC Q/A
• SENSEVAL/SEMEVAL
• CONLL Shared Tasks (NER, SRL…)
• Increased emphasis on unsupervised, semi-supervised, and active learning as alternatives to purely
supervised learning.
• Shifting focus to semantic tasks such as WSD, SRL, and semantic parsing.
• Grounded Language: Connecting language to perception and action.
• Image and video description
• Visual question answering (VQA)
• Human-Robot Interaction (HRI) in NL
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of machine learning research, 12(Aug):2493–2537.
Brief historical perspective
• 2017+: Pretrained Language Models
• Transformers, massive datasets, and high compute
• Instruction tuning and reinforcement learning from human feedback
• GPT model family, Llama, etc.
• An influential paper in this revolution: [Vaswani et al., 2017]
Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017) {>100k citations}
Where are we now?
https://thelowdown.momentum.asia/the-emergence-of-large-language-models-llms/
And many new developments recently..
Some related fields
• Cognitive Science
• Figuring out how the human brain works
• Includes the bits that do language
• Humans: the only working NLP prototype..
• Speech Processing
• Mapping audio signals to text
• Traditionally separate from NLP, recently converging
• Two components: acoustic models and language models
• Language models in the domain of statistical or NN-based
NLP
• Computational Linguistics (CL)
Difference of NLP & CL
• Most conferences and journals that host natural language processing research
bear the name “computational linguistics” (e.g., ACL, NAACL, COLING)
• NLP and CL may be thought of as essentially synonymous
• While there is substantial overlap, there is an important focus difference
• CL is essentially linguistics supported by computational methods (similar to computational
biology, computational astronomy)
• In linguistics, language is the object of study
• NLP focuses on solving well-defined tasks involving human language (e.g., translation, query
answering, holding conversations, information extraction, machine reading)
• Fundamental linguistic insights may be crucial for accomplishing these tasks, but success is ultimately measured by whether and how well the job gets done according to the evaluation metrics used
Regular Expressions: Ranges
Pattern   Meaning                     Matches
[A-Z]     an upper case letter        Drenched Blossoms
[a-z]     a lower case letter         my beans were tasty
[0-9]     a single digit              Chapter 1: Down the Rabbit Hole
[b-f]     any of: b, c, d, e, f       Drenched Blossoms
Regular Expressions: Negation in Disjunction
• Negation: [^Ss] matches any character except S or s
• Caret means negation only when it appears first in []
Regular Expressions: Disjunction with the Pipe |
Pattern                      Matches
groundhog|woodchuck          groundhog, woodchuck
yours|mine                   yours, mine
a|b|c = [abc]
[gG]roundhog|[Ww]oodchuck    Groundhog, groundhog, Woodchuck, woodchuck
Regular Expressions: ? * + .
Pattern   Meaning                         Matches
colou?r   optional previous char          color, colour
oo*h!     0 or more of previous char      oh!, ooh!, oooh!
o+h!      1 or more of previous char      oh!, ooh!, oooh!
beg.n     . matches any single char       begin, begun, began
Regular Expressions: Anchors ^ $ \b
Pattern       Matches
^[A-Z]        Palo Alto
^[^A-Za-z]    1 “Hello”
\.$           The end.
.$            The end? The end!
\bthe\b       the car (but not other)
Example
• Find all instances of the word “the” in a text
the
Misses capitalized examples (The)
[tT]he
Incorrectly also matches other, theology
[^a-zA-Z][tT]he[^a-zA-Z]
[^a-zA-Z] requires some single (non-alphabetic) character on each side, so this misses “The” at the very start or end of a line
(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)
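The same refinement with Python’s re module; the sample sentence is made up for illustration, and [tT]he is wrapped in an extra group so findall can report it:

import re

text = "The other day the theologian parked the car."  # made-up example

print(re.findall(r"the", text))     # misses "The"; matches inside "other", "theologian"
print(re.findall(r"[tT]he", text))  # catches "The" but still matches inside words
pattern = r"(^|[^a-zA-Z])([tT]he)([^a-zA-Z]|$)"
print([m[1] for m in re.findall(pattern, text)])  # ['The', 'the', 'the']
# \b[tT]he\b would do the same job here more compactly.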
Errors
• The refinement process in the previous slide was based on
fixing two kinds of errors
• Matching strings that should not be matched (e.g., there, then, other)
• False positives (Type I)
• Not matching things that we should have matched (e.g., The)
• False negatives (Type II)
Errors cont.
• In NLP we always deal with these kinds of errors
• Reducing the error rate for an application often involves two
antagonistic efforts:
• Increasing precision (minimizing false positives)
• Increasing coverage or recall (minimizing false negatives)
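A minimal sketch of the two quantities; the counts below are made-up numbers, just to show the trade-off:

def precision(tp: int, fp: int) -> float:
    """Of everything we matched, what fraction was right?"""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Of everything we should have matched, what fraction did we get?"""
    return tp / (tp + fn)

# Hypothetical counts for some pattern on some corpus:
tp, fp, fn = 90, 15, 5
print(f"precision = {precision(tp, fp):.2f}")   # 0.86
print(f"recall    = {recall(tp, fn):.2f}")      # 0.95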
Substitutions
• Substitution in Python and UNIX commands:
• s/pattern/replacement/
• e.g.:
• s/colour/color/
Capture Groups
• Say we want to put angles around all numbers:
the 35 boxes → the <35> boxes
• Use parentheses () to "capture" a pattern into a numbered register
(1, 2, 3…)
• Use \1 to refer to the contents of the register
s/([0-9]+)/<\1>/
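In Python, re.sub plays the role of s/pattern/replacement/; a quick check of both examples:

import re

print(re.sub(r"colour", "color", "my colour"))        # my color
print(re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes"))   # the <35> boxes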
Capture groups: multiple registers
• /the (.*)er they (.*), the \1er we \2/
• Matches
the faster they ran, the faster we ran
• But not
the faster they ran, the faster we ate
But suppose we don't want to capture some
elements?
• Parentheses have a double function: grouping terms, and
capturing
• Non-capturing groups:
• add a ?: after parenthesis
• /(?:some|a few) (people|cats) like some \1/
• matches
• some cats like some cats
• but not
• some cats like some a few
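The same behavior in Python; re.search returns None when the backreference cannot be satisfied:

import re

pattern = r"(?:some|a few) (people|cats) like some \1"
print(bool(re.search(pattern, "some cats like some cats")))   # True
print(bool(re.search(pattern, "some cats like some a few")))  # False
# \1 is (people|cats): the (?:...) group creates no register.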
Simple Application: ELIZA
• Early NLP system that imitated a Rogerian psychotherapist
(Weizenbaum, 1966)
Weizenbaum, J. (1966). ELIZA – A computer program for the study of natural language communication between man and machine. CACM 9(1), 36–45
Simple Application: ELIZA
Men are all alike.
IN WHAT WAY
They're always bugging us about something or other.
CAN YOU THINK OF A SPECIFIC EXAMPLE
Well, my boyfriend made me come here.
YOUR BOYFRIEND MADE YOU COME HERE
He says I'm depressed much of the time.
I AM SORRY TO HEAR YOU ARE DEPRESSED
How does ELIZA work?
• s/.* I’M (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
• s/.* I AM (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
• s/.* all .*/IN WHAT WAY?/
• s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE?/
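A minimal Python sketch of this rule list; it is not Weizenbaum’s actual implementation, just the four substitutions above plus an assumed fallback line:

import re

RULES = [
    (r".* I'M (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* I AM (depressed|sad) .*", r"WHY DO YOU THINK YOU ARE \1"),
    (r".* all .*", "IN WHAT WAY?"),
    (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE?"),
]

def eliza(utterance: str) -> str:
    """Answer with the first rule whose pattern matches the input."""
    for pattern, response in RULES:
        if re.match(pattern, utterance, flags=re.IGNORECASE):
            return re.sub(pattern, response, utterance,
                          flags=re.IGNORECASE).upper()
    return "PLEASE GO ON"   # fallback, not from the slide

print(eliza("He says I'm depressed much of the time."))
# I AM SORRY TO HEAR YOU ARE DEPRESSED
print(eliza("They're always bugging us about something or other."))
# CAN YOU THINK OF A SPECIFIC EXAMPLE?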
Example of Eliza in action
Weizenbaum, J. (1966). ELIZA – A computer program for the study of natural language communication between man and machine. CACM 9(1), 36–45
History does not repeat itself (but it rhymes)
• Recent article in Guardian
• Compares ChatGPT and
Weizenbaum's ELIZA
• https://www.theguardian.com/technology/2023/jul/25/joseph-weizenbaum-inventor-eliza-chatbot-turned-against-artificial-intelligence-ai
History of Conversational Systems
https://ecai-tutorial-ijcai23.github.io/assets/docs/IJCAI23-Tutorial-Final.pdf
Summary
• Regular expressions play a surprisingly large role
• Sophisticated sequences of regular expressions are often the first model for any text processing task
• For many hard tasks, we use machine learning classifiers and now
increasingly more LLMs
• But regular expressions can be used for preprocessing or as features in
the classifiers
• Can be very useful in capturing generalizations
Basic Text Processing
Word tokenization
Text Normalization
• Nearly every NLP task needs to do text normalization:
1. Segmenting/tokenizing words in running text
2. Normalizing word formats
3. Segmenting sentences in running text
How many words?
• “The University of Innsbruck is located in the capital of Tyrol.”
• 11
• “I do uh main- mainly business data processing”
• Fragments, filled pauses (fillers) – can be considered as words in
some cases (e.g., for speech recognition systems, speaker
identification)
• “Seuss’s cat in the hat is different from other cats!”
• Lemma: canonical, dictionary (or citation) form of a word
• cat and cats have the same lemma
• Wordform: the full inflected surface form
• cat and cats are different wordforms
How many tokens and types?
“they lay back on the San Francisco grass and look at the stars and their”
• 15 tokens (14 if “San Francisco” counts as one); 13 types (or fewer, if e.g. “they”/“their” are counted as one lemma)
• Type-Token Ratio (TTR) = V(d) / N(d), where V(d) is the number of types and N(d) the number of tokens in document d
• An index of lexical diversity (different from syntactic complexity), often
used to measure text complexity or vocabulary richness
• Can be used for instance for analysis of freshman compositions, studies
of childhood acquisition of language, etc.
Klee, Thomas, et al. "Utterance length and lexical diversity in Cantonese-speaking children with and
without specific language impairment." Journal of Speech, Language, and Hearing Research (2004)
TTR and Text Length
• The longer the text, the less likely it is that novel vocabulary will
be introduced.
• Longer texts might lean more towards the tokens side of the equation:
more words (tokens) are added but less and less represent unique words
(types).
• Tokens increase linearly, while types do not
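This is why TTR is usually averaged over fixed-size windows. A sketch of raw vs. windowed TTR; “book.txt” is a placeholder filename and the regex tokenizer is deliberately crude:

import re

def ttr(tokens: list[str]) -> float:
    """Type-token ratio: number of types / number of tokens."""
    return len(set(tokens)) / len(tokens)

def windowed_ttr(tokens: list[str], window: int = 1000) -> float:
    """Average TTR over non-overlapping windows, so texts of different
    lengths can be compared (raw TTR drops as texts get longer)."""
    if len(tokens) < window:            # short text: fall back to one window
        return ttr(tokens)
    chunks = [tokens[i:i + window]
              for i in range(0, len(tokens) - window + 1, window)]
    return sum(ttr(c) for c in chunks) / len(chunks)

text = open("book.txt").read().lower()   # placeholder filename
tokens = re.findall(r"[a-z]+", text)     # crude tokenizer, for illustration
print(f"raw TTR: {ttr(tokens):.3f}  windowed TTR: {windowed_ttr(tokens):.3f}")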
How large is the vocabulary of English (or any other language)?
• N = number of tokens
• V = vocabulary = set of types; |V| is the size of the vocabulary
• Church and Gale (1990): |V| > O(√N)
Total documents                      84,678
Total word occurrences           39,749,179
Vocabulary size                     198,763
Words occurring > 1,000 times         4,169
Words occurring once                 70,064
Heaps’ Law Predictions
• Predictions for TREC collections are accurate for large numbers of
words
• e.g., first 10,879,522 words of the AP89 collection scanned
• prediction is 100,151 unique words
• actual number is 100,024
• Predictions for small numbers of words (i.e. N < 1000) are much
worse
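A back-of-the-envelope check of that AP89 prediction with Heaps’ law |V| = k · N^β; the constants k ≈ 62.95 and β ≈ 0.455 are an AP89 fit quoted in the IR literature and should be treated as corpus-specific assumptions:

def heaps(n: int, k: float = 62.95, beta: float = 0.455) -> float:
    """Heaps' law |V| = k * n**beta. The defaults are a reported AP89
    fit (corpus-specific; treat them as assumptions, not universals)."""
    return k * n ** beta

print(round(heaps(10_879_522)))  # ~100,000 unique words vs. 100,024 observed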
GOV2 (Web) Example
Web Example
• Heaps’ Law works with very large document collections
• new words occurring even after already seeing 30 million words!
• New words come from a variety of sources
• spelling errors, invented words (e.g. product, company names), code,
other languages, email addresses, etc.
• Search engines must deal with these large and growing
vocabularies
Tokenization
• What is a word?
• A word is any sequence of alphabetical characters between whitespaces
that is not a punctuation mark..
• Later we’ll ask more questions about words, e.g.:
• How can we identify different word classes (parts of speech)?
• What is the meaning of words?
• How can we represent that?
Simple Tokenization in UNIX
• Naïve tokenization algorithm
• Given a text file, output the word tokens and their frequencies:
tr -sc 'A-Za-z' '\n' < shakes.txt
Turn non-alphabetic characters into newlines (one token per line)
| sort
Sort in alphabetical order
| uniq -c
Merge and count each type
1945 A
72 AARON
19 ABBESS
5 ABBOT
... ...
Taken from Church, Kenneth Ward. "Unix™ for poets." Notes of a course from the European Summer School on
Language and Speech Communication, Corpus Based Methods (1994).
The first step: tokenizing
tr -sc 'A-Za-z' '\n' < shakes.txt | head
THE
SONNETS
by
William
Shakespeare
From
fairest
creatures
We
...
The second step: sorting
tr -sc 'A-Za-z' '\n' < shakes.txt | sort | head
A
A
A
A
A
A
A
A
A
...
Counting
• Merging upper and lower case
tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c
• Sorting by the counts
tr 'A-Z' 'a-z' < shakes.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r
23243 the
22225 i
18618 and
16339 to
15687 of
12780 a
12163 you
10839 my
10005 in
... ...
8954 d
Why this one? d comes from splitting I’d: I had or I would or I should..
Word Frequency
• What we have actually obtained in the previous slide is a frequency
distribution of words in Shakespeare texts
• Word frequency: the number of occurrences of a word type in a text (or
in a collection of texts)
• You may have heard statements such as “adults know about 30,000
words”, “you need to know at least 5,000 words to be fluent”
• Such statements do not refer to inflected wordforms (take/takes/taking/taken/took) but to lemmas or dictionary forms (take), and assume that if you know a lemma, you know all its inflected forms too
Zipf's Law
• How many words occur once, twice, 100 times, 1000 times?
• Zipf's law:
• rank (r) of a word multiplied by its frequency (f) is approximately constant (k)
• assuming words are ranked in order of decreasing frequency
• r · f ≈ k
• or
• r · Pr ≈ c
• Pr is the probability of word occurrence, and c ≈ 0.1 for English
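A quick empirical check of r · Pr ≈ c on a text; shakes.txt is the same input as in the UNIX pipeline above, and the whitespace split is deliberately crude:

from collections import Counter

# Check that rank * Pr stays roughly constant (~0.1 for English).
tokens = open("shakes.txt").read().lower().split()   # crude tokenization
counts = Counter(tokens)
n = len(tokens)
for rank, (word, freq) in enumerate(counts.most_common(1000), start=1):
    if rank in (1, 10, 100, 1000):
        print(f"rank {rank:>4}  {word:<10} f={freq:<6} r*Pr={rank * freq / n:.3f}")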
News Collection (AP89) Statistics
• A small number of events (e.g. words) occur with high frequency (mostly closed-class words like the, be, to, of,
and, a, in, that,...)
• A large number of events occur with very low frequency (all open class words)
Zipf’s Law for AP89
Good practice: be aware of, and better write down, any normalization
(tokenization, lowercasing, spell-checking, ...) steps that your system does
Assignment
Next week’s Assignment
• Pick up two (possibly quite different) books from Project Gutenberg
(https://www.gutenberg.org/)
1. Show the 100 most common words for both books (aligned side by side for easy comparison) after tokenization
2. Plot and compare Zipf curves for both of them
3. Compute type-to-token ratio for the two books (avg TTR over non-overlapping windows
of 1k tokens). Consider the impact of text length in comparison.
4. Explore the relation between word frequency and word length in both books based on their 1,000 most frequent words.
5. Discuss any observations
6. Upload the report in pdf to OLAT by March 20th, 08:30
Paper 1
Paper 2
Paper 3
Thank you!