Lecture_1_Introduction
Lecture 1
Introduction
1
Course Basics
2
Course Basics
• Textbook and Other Required Material:
• Recommended - Textbook: Daniel Jurafsky, and James H. Martin,
"Speech and Language Processing", Prentice Hall, 2000 - 2018
online version. January 12, 2025 release!
• https://web.stanford.edu/~jurafsky/slp3/
• Recommended - Textbook: Christopher D. Manning, and Hinrich
Schütze, "Foundations of Statistical Natural Language
Processing", The MIT Press, 1999.
• Prerequisite(s): (Math 241 or Math 225 or Math 220)
and (Math 255 or Math 230 or Math 250)
• Probability, linear algebra and (of course) programming skills
3
Course Basics
• Grading Policy (tentative):
4
Late Submission Policy
• No late submissions for Term Project-related
activities
• For individual assignments:
• We can discuss the deadline in advance
• However, once set NO POSTPONEMENTS as the deadline
approaches
• Late submissions will be penalized as follows:
• 25% penalty for the 1st day late
• 10% additional penalties for each late day afterwards
• For example, if you submit 3 days late, you can get at most 55/100
pts.
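The penalty arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `max_score`, the reading of the percentages as points on a 100-point scale, and the floor at zero are assumptions made for illustration, not part of the stated policy.

```python
def max_score(days_late: int) -> int:
    """Highest achievable score (out of 100) for a submission `days_late` days late.

    Assumes 25 points off for the first late day and 10 more for each
    additional day. The floor at 0 is an assumption for illustration.
    """
    if days_late <= 0:
        return 100
    penalty = 25 + 10 * (days_late - 1)
    return max(0, 100 - penalty)


for days in range(5):
    print(days, max_score(days))
```

For 3 days late the penalty is 25 + 10 + 10 = 45, which leaves the 55/100 quoted above.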
5
Course Basics
All project reports should be completed and submitted. Grades, except the final
exam, should justify a letter grade of D or better. (Note that meeting this criterion
does not guarantee that you will eventually pass the course. Your letter grade still
depends on your overall performance including the final.)
6
Course Basics
Plagiarism Policy:
Your submissions for all assignments are expected to show your own work, i.e.,
they should show your own software design (except proper usage of software
libraries when necessary and allowed) and results, and should be written in your
own words.
7
What about the course?
Fundamentals
+
Some interesting applications to other fields
8
Course Objectives and Expectations
9
Course Objectives and Expectations
• Mathematical models.
• Come up with interesting ideas that you like to work on for the term project.
10
Natural Language
11
Natural Language
Processing (NLP)
• What is NLP?
• Wikipedia says:
12
Linguistics
Edward Sapir
an American linguist who is among
the pioneers of linguistics
13
Computational
Linguistics
• Is an interdisciplinary branch of linguistics
• Rule-based, statistical and computational
approaches to the problems of linguistics
• Understanding language by means of
computational methods
• Related to artificial intelligence but it is older
than AI
• AI was born at a workshop at Dartmouth
College in 1956.
• Immediately after WW2, computational efforts
started on automatic translation, from Russian
to English.
• Knowledge from linguistics, computer science,
and statistics/machine learning is used to
provide computational models for languages
• Closely related to, and overlapping with, NLP.
• Maybe a science vs. engineering question?
14
Natural Language Processing (NLP)
• The field of NLP is primarily concerned with getting computers to perform useful
and interesting tasks with natural languages.
15
Natural Language Processing
16
NLP – Highly Interdisciplinary Field
Knowledge and techniques from several disciplines
• Linguistics: How do words form phrases and sentences? What is meaning? What
are the possible meanings for a sentence?
• Computational Linguistics: How is the structure of sentences identified? How
can language knowledge be modeled?
• Computer Science: Algorithms for automation, parsers, machine learning.
• Engineering: Probabilistic techniques, machine learning.
• Psychology: What linguistic constructions are easy or difficult for people to learn
to use? Does the psychological state of the speaker and hearer affect the language?
• Philosophy: What is meaning, and how do words and sentences acquire it?
Why do we communicate?
17
Applications of NLP
18
Information Extraction/Retrieval
(IE/IR)
• Automatically
extracting structured
information from
unstructured or semi-
structured documents.
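A minimal sketch of what "structured information from unstructured text" can look like, using regular expressions from Python's standard `re` module. The example sentence, the two patterns, and the `extract` helper are made up for illustration. Real IE systems rely on far richer models than two regexes.

```python
import re

# Made-up sentence used only for illustration.
text = "Acme Corp. hired Jane Doe on 2021-03-15 and paid $1,200 for training."

# ISO-style dates (YYYY-MM-DD) and dollar amounts with optional thousands separators.
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
MONEY = re.compile(r"\$\d[\d,]*(?:\.\d+)?")


def extract(document: str) -> dict:
    """Return a structured record with the dates and amounts found in `document`."""
    return {
        "dates": DATE.findall(document),
        "amounts": MONEY.findall(document),
    }


record = extract(text)
print(record)
```

The free-form sentence goes in, and a dictionary with named fields comes out, which is the essence of the extraction step.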
by J. R. Firth
20
Summarization
21
Question Answering
• Could be pure NLP or vision-based
22
Language Generation
• Machines that can speak
• HAL-9000 in “2001: A Space Odyssey”
23
Text Classification / Sentiment Analysis
24
Text to Knowledge
25
Image Captioning
• Requires both computer vision and natural language processing
26
Dialogue Systems and Chatbots
• Dialogue systems may include:
• Text, speech, gestures etc.
• Online Assistants: restaurant booking, customer
support, healthcare, auto phone answering, etc.
• ELIZA: Rule-based NLP chatbot (MIT 1966)
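To give a feel for what "rule-based" means here, the following is a tiny pattern-matching responder in the spirit of ELIZA. The three rules, the default reply, and the `respond` function are invented for this sketch and are not taken from the original ELIZA script.

```python
import re

# Hand-written (pattern, response template) rules, for illustration only.
RULES = [
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"\bi feel (.+)", re.IGNORECASE), "How long have you felt {0}?"),
    (re.compile(r"\bmy (\w+)", re.IGNORECASE), "Tell me more about your {0}."),
]
DEFAULT = "Please go on."


def respond(utterance: str) -> str:
    """Return the response of the first rule whose pattern matches."""
    cleaned = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(cleaned)
        if match:
            return template.format(match.group(1))
    return DEFAULT


print(respond("I am tired of homework."))
print(respond("It rained today."))
```

There is no learning involved: every behaviour comes from a rule someone wrote by hand, which is also why such systems break down quickly on unexpected input.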
27
Siri
28
Text-to-Speech / Speech-to-Text
29
Speech Processing
• Something different…
• It requires:
NLP
+
Signal Processing
+
Acoustics
+
Speech Synthesis / Recognition
30
Other Applications
• Text Proofreading
• Plagiarism detection
31
Of Course - ChatGPT
32
And now... DeepSeek
• Open source model
• Still transformer-based
• Efficient
• Multi-token prediction
• Etc.
33
Captain Obvious
OK, cool! But, all these are obvious stuff!
What about more interesting ones?
34
The Unabomber Manifesto
• Was sent to mainstream newspapers in 1995: Industrial Society and Its Future
• The sender threatened several bombings unless the manifesto was published
• Authorities did not take the risk, and the manifesto was published
• Based on this evidence, they obtained a search warrant for the house from the court
• Police found bombing-related materials in the house
Chimeric: (of a mythical animal) formed from parts of various animals (e.g., Centaur).
35
Forensic Linguistics / Language & Law
/ AI & Law
• What can you tell about somebody by
the way they speak and write? What
subtle clues do you give to people you
speak and write with?
36
Schizophrenia
• Severe mental disorder
• ~ 0.5 % of adult world
population
• Yet to be understood…
• Losing touch with reality
• Incoherent/disorganized
speech and text
• Automatic detection of
incoherent speech or text for
diagnosing schizophrenia
D. Iter, J. Yoon, and D. Jurafsky, “Automatic detection of incoherent speech for diagnosing schizophrenia,” in Proc. of the Fifth Workshop on
Computational Linguistics and Clinical Psychology: From Keyboard to Clinic, 2018, pp. 136–146.
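One simple way to put a number on "incoherence" is to measure how much each sentence has in common with the next one. The sketch below does this with plain word-count vectors and cosine similarity. It is a simplified stand-in chosen for illustration and should not be read as the method of the cited paper. The helper names and example sentences are made up.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Lower-cased word counts for one sentence."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def coherence(sentences: list[str]) -> float:
    """Mean similarity of each sentence to the one that follows it."""
    vectors = [bag_of_words(s) for s in sentences]
    pairs = list(zip(vectors, vectors[1:]))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Made-up examples: the second pair shares no words at all.
on_topic = ["The cat sat on the mat.", "The cat then slept on the mat."]
off_topic = ["The cat sat on the mat.", "Stock prices fell sharply yesterday."]
print(coherence(on_topic), coherence(off_topic))
```

A low score flags abrupt topic jumps between consecutive sentences. Word overlap is a crude proxy, so a practical system would swap in learned sentence representations.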
37
Sample Text of a Speech From a
Schizophrenic
“The lion will have to change from dogs into cats until I
can meet my father and mother and we dispart some
rats. I live on the front of Whitton’s head. You have to
work hard if you don’t want to get into bed… It’s all over
for a squab true tray and there ain’t no squabs, there
ain’t no men, there ain’t no music, there ain’t nothing
besides my mother and my father who stand along upon
the Island of Capri where is no ice. Well it’s my suitcase
sir.”
Bland, R. C. (1982). Predicting the Outcome in Schizophrenia. The Canadian Journal of Psychiatry, 27(1), 52–62.
38
Gender/Race Bias
• Unfortunately, there is bias in society
• It propagates into the texts produced by people
• NLP models unintentionally carry these biases into applications
• Applications lead to biased decisions!
NOT ACCEPTABLE
• Analyze biases
• Techniques for debiasing
“Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings” by Bolukbasi et al., NIPS 2016.
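The idea of analyzing and removing bias in word embeddings can be sketched with toy vectors: find a "gender direction", measure how strongly a word projects onto it, then subtract that component. The 3-dimensional vectors below are hand-made for illustration (real embeddings are learned from text and have hundreds of dimensions), and the helper names are hypothetical. The projection-removal step is in the spirit of the "neutralize" idea from the debiasing literature, not a reproduction of any paper's full procedure.

```python
# Hand-made 3-d "embeddings" for illustration only.
vectors = {
    "he": [1.0, 0.0, 0.2],
    "she": [-1.0, 0.0, 0.2],
    "programmer": [0.4, 0.9, 0.1],
    "homemaker": [-0.5, 0.8, 0.1],
}


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    """Scale a vector to length 1."""
    norm = dot(v, v) ** 0.5
    return [a / norm for a in v]


# Gender direction: normalised difference between "he" and "she".
g = unit([a - b for a, b in zip(vectors["he"], vectors["she"])])


def gender_component(word):
    """Signed projection of a word vector onto the gender direction."""
    return dot(vectors[word], g)


def neutralize(word):
    """Remove the component of the word vector along the gender direction."""
    c = gender_component(word)
    return [a - c * b for a, b in zip(vectors[word], g)]


for w in ("programmer", "homemaker"):
    print(w, gender_component(w), neutralize(w))
```

In this toy setup "programmer" leans toward "he" and "homemaker" toward "she". After neutralizing, both have zero projection on the gender direction while keeping their other components.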
39
Semantic Shift / Change
• Understanding how words change their meanings over time
• Key to understanding linguistic and cultural evolution
• Historical text data is needed.
Daft: silly
Flaunting: show off to get admiration
Frolicsome: lively
Witty: humorous, funny
LAWS!
• The Law of Conformity: Rates of semantic change scale with
a negative power of word frequency. (Inverse power law).
“Frequently used words change at slower rates.”
• The Law of Innovation: Polysemous words (one word with two
or more related meanings) have significantly higher rates of
semantic change.
(Ex: bank: (1) institution, (2) the building of this institution.
Bank (river bank) is a homonym, not a polyseme; we’ll see it later.)
pupil (student), pupil (eye related). What about it?
“Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change“ by Hamilton et al., ACL 2016.
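The Law of Conformity is an inverse power law, so the rate of change can be written as rate = scale * frequency ** (-beta). The short sketch below shows what that implies for a rare versus a common word. The exponent `beta = 0.5`, the `scale`, and the two frequencies are placeholder values picked for illustration and are not estimates from the paper.

```python
def relative_change_rate(frequency: float, beta: float = 0.5, scale: float = 1.0) -> float:
    """Rate of semantic change under an inverse power law in word frequency.

    `beta` and `scale` are placeholder values, not estimates from the paper.
    """
    if frequency <= 0:
        raise ValueError("frequency must be positive")
    return scale * frequency ** (-beta)


rare, common = 1e-6, 1e-3  # hypothetical relative frequencies
ratio = relative_change_rate(rare) / relative_change_rate(common)
print(f"The rare word changes about {ratio:.1f}x faster than the common one.")
```

Whatever the exact exponent, a negative power means the curve always falls as frequency rises, which is the "frequently used words change at slower rates" statement.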
40
Social Science - Politics
• Government/Political media manipulations – Russian Case
• “Framing and Agenda-setting in Russian News: a Computational Analysis of Intricate
Political Strategies”, Field et al. 2018.
• Izvestia – daily Soviet government-controlled newspaper
• 1917 to date (now semi-official newspaper)
• Political Science – propaganda techniques
• Agenda-setting (selecting what topics to cover)
• Framing (deciding how topics are covered)
• Detecting manipulation and fake news
• Field et al. analyzed 13 years of Izvestia (about 100K articles & news)
• Detected:
“Strategy of distraction: articles mention the U.S. more frequently in the month directly
following an economic downturn in Russia.”
41
Sarcasm Detection
• Sarcasm: the use of remarks that clearly mean the opposite of
what they say, made in order to hurt someone's feelings or to
criticize something in a humorous way. Irony.
• It is a subproblem of sentiment analysis.
• Sarcasm is itself a sentiment
• Instead of getting the sentiment, try to detect sarcasm
• Text classification
• Sample: “I love being rejected!” -> change of polarity
• Important for recovering the real information being conveyed
• It’s not easy to detect, even for us
• Machines?
• Natural Language Generation with sarcasm???
42
Natural Language Understanding (NLU)
• NLU and NLP are often confused.
• Indeed, NLU is a component of NLP.
• NLU is a subset of the understanding/comprehension
part of NLP.
• NLP is considered AI-complete or AI-hard.
• AI-complete: most difficult problems that require the
solution of the central artificial intelligence problem.
• Cannot be solved by a specific algorithm
• Being as intelligent as humans?
• Not possible with current technology
43
Levels of Language Analysis
• Phonetics – study of the description and classification of speech sounds, particularly how
sounds are produced, transmitted and received. “Sounds of human speech”
• Phonetic symbols: tea = /ti:/
• Phonology – the speech sounds used in a particular language. Concerns how words are
related to the sounds. Accents… “Sound of human languages”
• Each letter -> one phoneme, so Turkish is spoken (almost) as it is written
• Morphology – concerns how words are constructed from more basic meaning units called
morphemes. A morpheme is the primitive unit of meaning in a language.
• Syntax – study of how we can put together words to form correct sentences and
determines what structural role each word plays in the sentence and what phrases are
subparts of other phrases.
• Grammar is something more general. It includes general rules for a language including syntax and
morphology.
• Semantics – study of words and their meanings in a language. The study of context-
independent meaning. Literal meaning. In abstraction from particular situations, speakers
or listeners.
44
Levels of Language Analysis
• Pragmatics – study of words and their meanings in a
context. Inferred and intended meanings.
• “They are former pupils of the school.”
• “She has a dilated pupil.” The center of the iris of the
eye. In dim light, pupils dilate (enlarge).
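The two "pupil" sentences above show that context decides which meaning is intended. A very rough way to let a program make that choice is to count how many context words overlap with a small set of cue words per sense. The cue sets and the `disambiguate` function below are hand-written for this sketch and do not come from a real lexicon.

```python
import re

# Hand-written cue words per sense; illustrative only.
SENSES = {
    "student": {"school", "teacher", "class", "former", "lesson"},
    "eye": {"eye", "iris", "dilated", "dilate", "light"},
}


def disambiguate(sentence: str) -> str:
    """Pick the sense whose cue words overlap most with the sentence.

    Ties (including zero overlap) fall back to the first sense listed.
    """
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return max(SENSES, key=lambda sense: len(words & SENSES[sense]))


print(disambiguate("They are former pupils of the school."))
print(disambiguate("She has a dilated pupil."))
```

This is essentially a stripped-down dictionary-overlap heuristic. It works on these two sentences because each contains an obvious cue word, and it fails as soon as the context is subtle.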
45
NLP is hard
• Natural languages are very rich in forms, structures,
vocabularies…
• There are too many ambiguities. We’ll come to this.
• There are different languages in the world
• Even “extinct” ones => no living users
• Even “dead” ones => still used, but no one’s mother tongue
• We have words + rules + exceptions. It is not like physics…
• It is changing…
• High frequency: new words are invented like computer mouse, or even
higher: Brexit
• Low frequency: Old English became English; Old Ottoman became
Turkish, etc.
46
Ambiguities
47
Ambiguities
48
Statistical Natural Language
Processing
• Lots of definitions…
• In early days, NLP methods relied on hand-coded rules -> rule-based
• They are not flexible enough to cope with people’s complex and ambiguous usage of
languages
• Statistical inference instead tries to learn these rules automatically from corpora.
• Corpus – Corpora: One corpus, two corpora. Large collections of text.
• Corpus is indeed data.
• All quantitative approaches to automated language processing
• Statistical modelling
• Information theory
• Linear algebra
• Machine learning
• Neural networks
• Etc.
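As a small illustration of "learning from a corpus" instead of writing rules by hand, the sketch below counts word pairs in a tiny corpus and turns the counts into conditional probabilities. The three sentences are made up, and `bigram_prob` is a hypothetical helper name. A real corpus would contain millions of words and the estimates would need smoothing.

```python
from collections import Counter

# Tiny made-up corpus for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))


def bigram_prob(prev: str, word: str) -> float:
    """Estimate P(word | prev) as count(prev, word) / count(prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]


print(bigram_prob("the", "cat"))
print(bigram_prob("sat", "on"))
```

Nothing here encodes grammar explicitly. The regularity that "on" follows "sat" is picked up purely from counts, which is the core idea behind the statistical approaches listed above.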
49
Statistical Natural Language
Processing
50
DATA?
• Internet!
• Wikipedia
• Several free corpora on the net
• Both academic datasets and others
• The Common Crawl Project
• ~ 630 billion words
• https://commoncrawl.org/
• Project Gutenberg (A library of free e-books)
• https://www.gutenberg.org/
• The Pile, an 825 GB English text corpus
51
Some Abbreviations
52
Think outside the box
• So, start
• Thinking,
• Reading,
• Searching,
for your term project proposal!
• Try to be different…
• Please do not come with a proposal for a “Document Classification
Problem” or an “Implementation of a fancy, state-of-the-art, popular
deep machine learning architecture”…
• Please treat course projects as a chance to find inspiration for your
future passions and endeavors…
53
Term Project
• Project Themes
54
Resources
• ACL Anthology https://aclanthology.org/
• IEEE Transactions on Audio, Speech, and Language Processing
https://ieeexplore.ieee.org/xpl/aboutJournal.jsp?punumber=10376
• IEEE Transactions on Emerging Topics in Computational Intelligence
https://ieeexplore.ieee.org/xpl/aboutJournal.jsp?punumber=7433297
• IEEE Transactions on Computational Social Systems
https://ieeexplore.ieee.org/xpl/aboutJournal.jsp?punumber=6570650
• IEEE Transactions on Artificial Intelligence
https://ieeexplore.ieee.org/xpl/aboutJournal.jsp?punumber=9078688
• Natural Language Processing
https://www.cambridge.org/core/journals/natural-language-engineering
55
Resources
• https://huggingface.co/
• http://nlpprogress.com/
• https://web.stanford.edu/~jurafsky/slp3/
56