Lecture_1_Introduction

The document outlines the course EEE 486 / EEE 586 on Statistical Foundations of Natural Language Processing, taught by Aykut Koç at Bilkent University. It covers course basics, grading policies, late submission rules, and the interdisciplinary nature of natural language processing (NLP), emphasizing its applications and objectives. The course aims to equip students with fundamental principles and practical skills in NLP, preparing them for advanced studies and projects.

Uploaded by

meneserdem06
Copyright
© All Rights Reserved

EEE 486 / EEE 586

Statistical Foundations of Natural Language Processing

Lecture 1

Introduction

Course Basics

• Instructor: Aykut Koç
• e-mail: [email protected]
• Office: EE-305 / UMRAM SC-110
• Office hours: please feel free to arrange by email
• Credits: Bilkent 3, ECTS 5
• TAs: Emirhan Koc ([email protected]), Enes Koşar ([email protected])
• Contact Hours: 3 hours of lecture per week

Course Basics
• Textbook and Other Required Material:
• Recommended - Textbook: Daniel Jurafsky and James H. Martin,
"Speech and Language Processing", Prentice Hall, 2000 - 2018.
Online version: January 12, 2025 release!
• https://web.stanford.edu/~jurafsky/slp3/
• Recommended - Textbook: Christopher D. Manning and Hinrich
Schütze, "Foundations of Statistical Natural Language
Processing", The MIT Press, 1999.
• Prerequisite(s): (Math 241 or Math 225 or Math 220)
and (Math 255 or Math 230 or Math 250)
• Probability, linear algebra and (of course) programming skills

Course Basics
• Grading Policy (tentative):

• Assignments: 3x10% = 30% (Individual)
• The problems will be defined

• Term Project: 35% (Teams of 1-3)
• No late submission!
• Open-ended, research-oriented, self-proposed project
• Survey + Proposal + Final report (in a conference paper format) + Presentation
• Themes!

• Final Exam: 35%

Late Submission Policy
• No late submissions for Term Project-related activities
• For individual assignments:
• We can discuss the deadline in advance
• However, once the deadline is set, there will be NO POSTPONEMENTS as it approaches
• Late submissions will be penalized as follows:
• 25% penalty for the 1st day late
• An additional 10% penalty for each late day afterwards
• For example, if you submit 3 days late, you can get at most 55/100 pts.
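
The penalty rule for individual assignments can be checked with a few lines of Python. This is only a sketch of the arithmetic stated on the slide; the function name `max_score` and the choice to floor the result at zero are assumptions, not part of the course policy.

```python
def max_score(days_late: int) -> int:
    """Highest attainable score (out of 100) when submitting `days_late` days late."""
    if days_late <= 0:
        return 100
    # 25% for the first late day, plus 10% for each additional late day.
    penalty = 25 + 10 * (days_late - 1)
    # Assumption: the score cannot drop below zero.
    return max(0, 100 - penalty)


for days in range(5):
    print(days, max_score(days))  # 3 days late gives 55, matching the slide's example
```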

Course Basics

Minimum Requirements to Qualify for the Final Exam:

All project reports should be completed and submitted. Grades, except the final
exam, should justify a letter grade of D or better. (Note that meeting this criterion
does not guarantee that you will eventually pass the course. Your letter grade still
depends on your overall performance including the final.)

Course Basics

Plagiarism Policy:

Your submissions for all assignments are expected to show your own work, i.e.,
they should show your own software design (except proper usage of software
libraries when necessary and allowed) and results, and should be written in your
own words.

All submissions will be automatically submitted to Turnitin. If any plagiarism is
detected by Turnitin, that assignment will receive a grade of zero. A disciplinary
investigation might also be initiated.

What about the course?

• An introduction to natural language processing
• Exposure to the basics of the field

Fundamentals
+
Some interesting applications to other fields

Course Objectives and Expectations

• To learn the fundamental principles, basic knowledge and techniques of natural
language processing.
• To apply statistics and probability knowledge to natural language processing.
• To demonstrate both theoretical knowledge and computational skills for successfully
implementing a complete NLP project.
• To prepare for advanced level studies in natural language processing and/or
computational linguistics.
• To broaden students’ vision regarding out-of-the-box applications in other areas
like social science, law, etc.

Course Objectives and Expectations

• Be curious and read!

• Programming skills for implementing projects.

• Mathematical models.

• Come up with interesting ideas that you would like to work on for the term project.

Natural Language

• Languages used by human beings

• CS people take “language” as “programming language”

• In text or spoken form

Natural Language
Processing (NLP)
• What is NLP?
• Wikipedia says:

Natural language processing (NLP) is a subfield of linguistics, computer science,
information engineering, and artificial intelligence concerned with the interactions
between computers and human (natural) languages.

In particular: how to program computers to process and analyze large amounts of
natural language data.

• NLP is a way for computers to analyze, understand, and derive meaning from human
language in a smart and useful way.

Linguistics

• The scientific study of languages.

• “Language is a purely human and non-instinctive method of
communicating ideas, emotions and desires by means of a system
of voluntarily produced symbols.”

Edward Sapir, an American linguist who is among
the pioneers of linguistics

• There are approaches that count animal languages as well.

Computational
Linguistics
• An interdisciplinary branch of linguistics
• Rule-based, statistical and computational
approaches to the problems of linguistics
• Understanding language by means of
computational methods
• Related to artificial intelligence, but older
than AI
• AI was born at a workshop at Dartmouth
College in 1956.
• Immediately after WW2, computational efforts
started on automatic translation, from Russian
to English.
• Knowledge from linguistics, computer science,
and statistics/machine learning is used to
provide computational models for languages
• Closely related to, and overlapping with, NLP.
• Perhaps a science vs. engineering question?

Natural Language Processing (NLP)

• The process of computer analysis of input provided in a natural language, and
conversion of this input into a useful form of representation.

• The field of NLP is primarily concerned with getting computers to perform useful
and interesting tasks with natural languages.

• The field of NLP is secondarily concerned with helping us come to a better
understanding of natural languages. (Computational Linguistics)

Natural Language Processing

• “You shall know a word by the company it keeps.”
J. R. Firth (1957)

• OK, but how are we going to know the company?

• Note: choose your company in the term project wisely!
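
One simple way to "know the company" a word keeps is to count which words appear near it. The sketch below does this with a symmetric context window. The toy corpus, the window size and the function name `cooccurrence_counts` are illustrative assumptions, not material from the course.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count, for every word, the words that occur within `window` positions of it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:  # skip the word itself
                    counts[word][tokens[j]] += 1
    return counts


# Hypothetical toy corpus, made up for illustration only.
toy_corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the dog",
]
company = cooccurrence_counts(toy_corpus)
print(company["cat"].most_common(3))
```

Words with similar count profiles tend to be used in similar contexts, which is the intuition behind many statistical representations of word meaning.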

NLP – Highly Interdisciplinary Field
Knowledge and techniques from several disciplines
• Linguistics: How do words form phrases and sentences? What is meaning? What
are the possible meanings of a sentence?
• Computational Linguistics: How is the structure of sentences identified? How
can language knowledge be modeled?
• Computer Science: Algorithms for automation, parsers, machine learning.
• Engineering: Probabilistic techniques, machine learning.
• Psychology: What linguistic constructions are easy or difficult for people to learn
to use? Do the psychological states of the speaker and hearer affect the language?
• Philosophy: What is meaning, and how do words and sentences acquire it?
Why do we communicate?

Applications of NLP

Information Extraction/Retrieval
(IE/IR)
• Information Extraction: automatically extracting structured information from
unstructured or semi-structured documents.
• Information Retrieval: obtaining information resources relevant to a query from a
collection of information resources. Searches can be based on metadata or on
full-text indexing.

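
Full-text indexing, mentioned above for Information Retrieval, is often built on an inverted index that maps each term to the documents containing it. The following is a minimal sketch with boolean AND search. The three documents and the function names are made up for illustration, and real systems add tokenization, ranking and much more.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical mini collection.
docs = {
    1: "statistical natural language processing",
    2: "signal processing and acoustics",
    3: "natural language generation",
}
index = build_inverted_index(docs)
print(search(index, "natural language"))
```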
Machine/Automatic Translation
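“You shall know a word by the company it keeps.”
J. R. Firth

“Tuttuğu şirket tarafından bir kelime bileceksin.”
(Turkish translation of the quote. Literally: “You will know a word by the firm it holds.” Here “company” is rendered as “şirket”, i.e., a business firm.)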

Summarization

Question Answering
• Could be pure NLP or vision-based

Language Generation
• Machines that can speak
• HAL-9000 in “2001: A Space Odyssey”

• Natural Language Generation (NLG)

Text Classification / Sentiment Analysis

Text to Knowledge

Image Captioning
• Requires both computer vision and natural language processing

Dialogue Systems and Chatbots
• Dialogue systems may include:
• Text, speech, gestures etc.
• Online Assistants: restaurant booking, customer
support, healthcare, auto phone answering, etc.
• ELIZA: Rule-based NLP chatbot (MIT 1966)
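
To give a feel for what "rule-based" means for a chatbot, here is a tiny pattern-and-template responder in the general style of ELIZA. The three rules and the default reply are invented for this sketch; they are not taken from ELIZA's original script.

```python
import re

# Hypothetical rules: (pattern, reply template). The first matching rule wins.
RULES = [
    (re.compile(r"\bi need (.+)", re.IGNORECASE), "Why do you need {0}?"),
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\bbecause\b", re.IGNORECASE), "Is that the real reason?"),
]
DEFAULT_REPLY = "Please tell me more."


def reply(utterance: str) -> str:
    """Answer by filling the template of the first rule whose pattern matches."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(*match.groups())
    return DEFAULT_REPLY


print(reply("I am tired."))
print(reply("Hello"))
```

No learning is involved: every behaviour must be written by hand, which is the limitation that statistical approaches try to overcome.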

Siri

• Apple Inc.'s Virtual Assistant

Text-to-Speech / Speech-to-Text

• Both directions are possible
• Both require NLP, among other techniques
• Of course, text-to-speech and speech-to-text systems have their own applications as
well

Speech Processing
• Something different…
• It requires: NLP + Signal Processing + Acoustics + Speech Synthesis / Recognition

• In this course, we deal only with written language (text)


• Speech-to-text systems can be used

Other Applications

• Text Proofreading

• Spelling & Grammar

• Plagiarism detection

• Fake review detection

Of Course - ChatGPT

• Built on the pre-trained GPT-3 language model (in fact, GPT-3.5)

• Generative Pre-trained Transformer 3 (GPT-3)

• Fine-tuned via supervised learning and reinforcement learning

• Underlying architecture: Transformer

And now... DeepSeek
• Open source model

• Still transformer-based

• Efficient

• Multi-head latent attention (MLA)

• Multi-token prediction

• Etc.

Captain Obvious
OK, cool! But, all these are obvious stuff!
What about more interesting ones?

The Unabomber Manifesto
• Was sent to mainstream newspapers in 1995: Industrial Society and Its Future
• The sender threatened further bombings unless the manifesto was published
• Authorities did not want to take the risk, and the manifesto was published

• Someone recognized an “odd style” in the manifesto, such as persistent
references to African-Americans as “negros” and frequent usage of very rare
words like “chimeric”, and recognized similar odd features in the letters and
other texts written by a former math professor
• FBI linguists worked on the manifesto and other texts by the professor to derive evidence

• Based on this evidence, they obtained a search warrant for the house from the court
• Police found bombing-related materials in the house

Chimeric: (of a mythical animal) formed from parts of various animals (e.g., a centaur).

Forensic Linguistics / Language & Law
/ AI & Law
• What can you tell about somebody by
the way they speak and write? What
subtle clues do you give to people you
speak and write with?

Schizophrenia
• Severe mental disorder
• ~ 0.5 % of the world's adult population
• Yet to be understood…
• Losing touch with reality
• Incoherent/disorganized speech and text
• Automatic detection of incoherent speech or text for diagnosing schizophrenia

D. Iter, J. Yoon, and D. Jurafsky, “Automatic detection of incoherent speech for diagnosing schizophrenia,” in Proc. of the Fifth Workshop on
Computational Linguistics and Clinical Psychology: From Keyboard to Clinic, 2018, pp. 136–146.
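
As a rough illustration of what "detecting incoherence" could mean computationally, the sketch below scores a text by the average word overlap (cosine similarity of bag-of-words vectors) between consecutive sentences. This is a deliberately crude stand-in and not the method of the cited paper; the example texts and function names are assumptions made for this sketch.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Lowercase word counts for one sentence."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def coherence(text):
    """Mean cosine similarity between each sentence and the one that follows it."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    vectors = [bag_of_words(s) for s in sentences]
    pairs = list(zip(vectors, vectors[1:]))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Hypothetical example texts.
on_topic = "The cat sat on the mat. The cat slept on the mat."
off_topic = "The cat sat on the mat. Stock prices fell sharply today."
print(coherence(on_topic), coherence(off_topic))
```

A low score only says that neighbouring sentences share few words; it is in no way a diagnostic tool.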

Sample Text of a Speech From a
Schizophrenic

“The lion will have to change from dogs into cats until I
can meet my father and mother and we dispart some
rats. I live on the front of Whitton’s head. You have to
work hard if you don’t want to get into bed… It’s all over
for a squab true tray and there ain’t no squabs, there
ain’t no men, there ain’t no music, there ain’t nothing
besides my mother and my father who stand along upon
the Island of Capri where is no ice. Well it’s my suitcase
sir.”
Bland, R. C. (1982). Predicting the Outcome in Schizophrenia. The Canadian Journal of Psychiatry, 27(1), 52–62.

Gender/Race Bias
• Unfortunately, there is bias in society
• It propagates into the texts produced by people
• NLP models unintentionally carry these biases into applications
• Applications lead to biased decisions!
NOT ACCEPTABLE
• Analyze biases
• Techniques for debiasing

“Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings” by Bolukbasi et al., NIPS 2016.
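
A common way to analyze such bias in word vectors is to measure how strongly a word points along a "gender direction", and one simple debiasing idea is to remove that component. The sketch below uses made-up 2-D vectors purely for illustration; real embeddings have hundreds of dimensions, and the projection-removal step here is a simplified stand-in rather than the full procedure of the cited paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def remove_component(w, direction):
    """Subtract from w its projection onto `direction`."""
    scale = dot(w, direction) / dot(direction, direction)
    return [wi - scale * di for wi, di in zip(w, direction)]


# Hypothetical toy vectors (not taken from any trained model).
toy_vectors = {
    "man": [1.0, 0.0],
    "woman": [-1.0, 0.0],
    "programmer": [0.6, 0.8],
    "homemaker": [-0.6, 0.8],
}
gender_direction = [m - w for m, w in zip(toy_vectors["man"], toy_vectors["woman"])]

for word in ("programmer", "homemaker"):
    before = cosine(toy_vectors[word], gender_direction)
    after = cosine(remove_component(toy_vectors[word], gender_direction), gender_direction)
    print(word, round(before, 2), round(after, 2))
```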

Semantic Shift / Change
• Understanding how words change their meanings over time
• Key to understanding linguistic and cultural evolution
• Historical text data is needed.

Daft: silly
Flaunting: show off to get admiration
Frolicsome: lively
Witty: humorous, funny

LAWS!
• The Law of Conformity: Rates of semantic change scale with
a negative power of word frequency (an inverse power law).
“Frequently used words change at slower rates.”
• The Law of Innovation: Polysemous words (one word with two or more
related meanings) have significantly higher rates of
semantic change.
(Ex: bank: (1) the institution, (2) the building of this institution.
Bank (river bank) is a homonym, not polysemy; we’ll see it later.)
pupil (student), pupil (eye related). What about it?

“Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change” by Hamilton et al., ACL 2016.

Social Science - Politics
• Government/Political media manipulations – Russian Case
• “Framing and Agenda-setting in Russian News: a Computational Analysis of Intricate
Political Strategies”, Field et al. 2018.
• Izvestia – daily Soviet government-controlled newspaper
• 1917 to date (now semi-official newspaper)
• Political Science – propaganda techniques
• Agenda-setting (selecting what topics to cover)
• Framing (deciding how topics are covered)
• Detecting manipulation and fake news

• Field et al. analyzed 13 years of Izvestia (about 100K articles & news)
• Detected:

“Strategy of distraction: articles mention the U.S. more frequently in the month directly
following an economic downturn in Russia.”

Sarcasm Detection
• Sarcasm: the use of remarks that clearly mean the opposite of
what they say, made in order to hurt someone's feelings or to
criticize something in a humorous way. Irony.
• It is in fact a subproblem of sentiment analysis.
• Sarcasm is itself a sentiment
• Instead of getting the sentiment, try to detect sarcasm
• Text classification
• Sample: “I love being rejected!” -> change of polarity
• Important for recovering the real information being conveyed
• It’s not easy to detect, even for us
• Machines?
• Natural Language Generation with sarcasm???
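
The "change of polarity" in the sample sentence can be made concrete with a tiny word-list check: a positive word and a negative word occur in the same sentence. The word lists and function names below are made up for this sketch, and a polarity clash is only a weak hint, not a working sarcasm detector.

```python
# Hypothetical mini sentiment lexicon.
POSITIVE = {"love", "great", "wonderful"}
NEGATIVE = {"rejected", "hate", "terrible"}


def lexicon_polarity(sentence):
    """Return (number of positive words, number of negative words)."""
    tokens = [t.strip(".,!?\"'").lower() for t in sentence.split()]
    positives = sum(t in POSITIVE for t in tokens)
    negatives = sum(t in NEGATIVE for t in tokens)
    return positives, negatives


def has_polarity_clash(sentence):
    """True when both positive and negative words appear in the sentence."""
    positives, negatives = lexicon_polarity(sentence)
    return positives > 0 and negatives > 0


print(has_polarity_clash("I love being rejected!"))
print(has_polarity_clash("I love this course."))
```

A plain sentiment counter would see one positive and one negative word and could not decide; treating the clash itself as a feature is one reason sarcasm detection is usually framed as its own text classification task.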

Natural Language Understanding (NLU)
• NLU and NLP are often confused.
• Indeed, NLU is a component of NLP.
• NLU is a subset of the understanding/comprehension
part of NLP.
• NLP is considered AI-complete or AI-hard.
• AI-complete: most difficult problems that require the
solution of the central artificial intelligence problem.
• Cannot be solved by a specific algorithm
• Being as intelligent as humans?
• Not possible with current technology

Levels of Language Analysis
• Phonetics – study of the description and classification of speech sounds, particularly how
sounds are produced, transmitted and received. “Sounds of human speech”
• Phonetic symbols: tea = /ti:/
• Phonology – the speech sounds used in a particular language. Concerns how words are
related to the sounds. Accents… “Sound of human languages”
• In Turkish, each letter -> one phoneme. So, Turkish is spoken as it is written (almost)
• Morphology – concerns how words are constructed from more basic meaning units called
morphemes. A morpheme is the primitive unit of meaning in a language.

• Syntax – study of how we can put together words to form correct sentences and
determines what structural role each word plays in the sentence and what phrases are
subparts of other phrases.
• Grammar is something more general. It includes general rules for a language including syntax and
morphology.

• Semantics – study of words and their meanings in a language. The study of context-
independent meaning. Literal meaning. In abstraction from particular situations, speakers
or listeners.

Levels of Language Analysis
• Pragmatics – study of words and their meanings in a
context. Inferred and intended meanings.
• “They are former pupils of the school.”
• “She has a dilated pupil.” The center of the iris of the
eye. In dim light, pupils dilate (enlarge).

• Discourse – concerns how the immediately preceding
sentences affect the interpretation of the next sentence.
Beyond sentences…
• For example: pronouns. “He was very successful.” Who is
that “he”?

• World Knowledge – general knowledge about the world;
users of a language must know about each other’s world and beliefs.

NLP is hard
• Natural languages are very rich in forms, structures,
vocabularies…
• There are too many ambiguities. We’ll come to this.
• There are different languages in the world
• Even “extinct” ones => no living user
• Even “dead” ones => still used, but no one’s mother tongue
• We have words + rules + exceptions. It is not like physics…
• It is changing…
• High frequency: new words are invented, like computer mouse, or even
higher: Brexit
• Low frequency: Old English became English; Old Ottoman became
Turkish, etc.

Ambiguities

• Present at all levels of the language

• I shot an elephant wearing a hat.
• Ambiguity 1: does “shot” mean took a photo of or fired a bullet at?
• Word level (lexical) ambiguity
• Ambiguity 2: Who is wearing the hat? Me or the elephant?
• Syntax level
• The same sentence could be interpreted in 2 x 2 = 4 possible ways…
• Speech and body language also affect communication
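
The 2 x 2 count comes from combining each lexical reading with each syntactic reading. A short enumeration makes this explicit; the reading labels are paraphrases written for this sketch.

```python
from itertools import product

# Two independent sources of ambiguity in "I shot an elephant wearing a hat."
lexical_readings = ["shot = photographed", "shot = fired a gun at"]
syntactic_readings = ["I was wearing the hat", "the elephant was wearing the hat"]

# Every interpretation picks one reading from each source.
interpretations = list(product(lexical_readings, syntactic_readings))
for sense, attachment in interpretations:
    print(f"{sense}; {attachment}")
print(len(interpretations))
```

With more ambiguous words and attachment sites per sentence, the number of combinations grows multiplicatively, which is one reason hand-written rules struggle.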

Ambiguities

• Several different “signals” can mean almost the same thing
• Paraphrasing
• The same “signal” can mean different things, as we have seen
• This can happen at different levels of a language:
• Semantic ambiguity – different meanings of words
• Syntactic ambiguity – different ways to parse the sentence (Me or the
elephant?)
• Partial information – to whom do pronouns refer? (He is successful.)
• Contextual information – the context of a word may affect its meaning
• Mouse in a computer magazine or a children’s book on animals.

Statistical Natural Language
Processing
• Lots of definitions…
• In the early days, NLP methods relied on hand-coded rules -> rule-based
• These are not flexible enough to cope with people’s complex and ambiguous usage of
languages
• Then, statistical inference tries to learn these rules automatically from corpora.
• Corpus – Corpora: One corpus, two corpora. Large collections of text.
• A corpus is, in effect, data.
• All quantitative approaches to automated language processing
• Statistical modelling
• Information theory
• Linear algebra
• Machine learning
• Neural networks
• Etc.

Statistical Natural Language
Processing

• SNLP tries to perform statistical inference for the field of NLP.
• Statistical inference consists of taking some data (a corpus)
generated in accordance with some unknown probability
distribution and making inferences about that distribution.
• Language models: probability models that distinguish
more vs. less likely word sequences.
• Statistical Natural Language Processing (SNLP) uses
methods of supervised, semi-supervised and unsupervised
learning to address tasks that involve written or spoken
language.
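
As a concrete instance of a language model that separates more likely from less likely word sequences, here is a minimal bigram model with add-one smoothing estimated from a toy corpus. The corpus, the smoothing choice and the function names are assumptions made for this sketch; the sentences being scored are expected to use words seen in training.

```python
import math
from collections import Counter


def train_bigram_model(sentences):
    """Collect context counts, bigram counts and the vocabulary from a corpus."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for sentence in sentences:
        words = sentence.lower().split()
        tokens = ["<s>"] + words + ["</s>"]
        vocab.update(words)
        contexts.update(tokens[:-1])           # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, vocab


def log_prob(sentence, model):
    """Log-probability of a sentence under the add-one smoothed bigram model."""
    contexts, bigrams, vocab = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab)))
    return total


# Hypothetical toy corpus.
toy_corpus = ["the cat sat", "the dog sat", "the cat slept"]
model = train_bigram_model(toy_corpus)
print(log_prob("the cat sat", model))
print(log_prob("sat cat the", model))
```

The word order seen in the corpus receives a higher log-probability than the scrambled one, even though both strings contain exactly the same words.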

DATA?

• Internet!
• Wikipedia
• Several free corpora on the net
• Both academic datasets and others
• The Common Crawl Project
• ~ 630 billion words
• https://commoncrawl.org/
• Project Gutenberg (A library of free e-books)
• https://www.gutenberg.org/
• The Pile, an 825 GB English text corpus

Some Abbreviations

• NLP – Natural Language Processing
• CL – Computational Linguistics
• SP – Speech Processing
• HLT – Human Language Technology
• NLE – Natural Language Engineering
• NLU – Natural Language Understanding
• NLG – Natural Language Generation
• SNLP – Statistical Natural Language Processing

Think out of the box
• So, start
• Thinking,
• Reading,
• Searching,
for your term project proposal!
• Try to be different…
• Please do not come with a proposal for “Document Classification
Problem” or “Implementation of a fancy, state-of-the-art, popular
deep machine learning architecture”…
• Please treat course projects as a chance to find inspiration for your
future passions and endeavors…

Term Project

• Project Themes

• In final reports, a statement should be included in which each group member's
contributions to the project are explained. If one or more group members fail to
contribute to the project fairly, this should be stated. Based on these
statements as well as his assessments during presentations, the
instructor reserves the right to assign different individual term project
grades within a group.

Resources
• ACL Anthology https://aclanthology.org/
• IEEE Transactions on Audio, Speech, and Language Processing
https://ieeexplore.ieee.org/xpl/aboutJournal.jsp?punumber=10376
• IEEE Transactions on Emerging Topics in Computational Intelligence
https://ieeexplore.ieee.org/xpl/aboutJournal.jsp?punumber=7433297
• IEEE Transactions on Computational Social Systems
https://ieeexplore.ieee.org/xpl/aboutJournal.jsp?punumber=6570650
• IEEE Transactions on Artificial Intelligence
https://ieeexplore.ieee.org/xpl/aboutJournal.jsp?punumber=9078688
• Natural Language Processing
https://www.cambridge.org/core/journals/natural-language-engineering

Resources

• https://huggingface.co/

• http://nlpprogress.com/

• https://web.stanford.edu/~jurafsky/slp3/
