Lecture 1 Course Overview

The document outlines a course on Natural Language Processing (NLP) led by Danish Pruthi, covering topics such as text classification, machine translation, and question answering. It emphasizes the challenges of understanding human language and the importance of language comprehension in building intelligent systems. The course includes practical assignments, evaluations, and a focus on computational models, with no formal prerequisites but a recommendation for familiarity with Python and basic probability.

Uploaded by

Himanshu Ranjan

DS 207: Introduction to Natural Language Processing


Danish Pruthi
What is Natural Language Processing
The science and engineering of building computational models to comprehend language
Text Classification
Input: "Lots of epic shows feel a little underpopulated towards the end but there's really no excuse for something as mythic, huge and mesmerizing to end as disappointingly as this."
Label: Negative

Machine Translation
English: "India recorded their first Test victory, in their 24th match, against England at Madras in 1952. Later in the same year, they won their first Test series, which was against Pakistan."
Hindi: "भारत ने 1952 में मद्रास में इंग्लैंड के खिलाफ अपने 24वें मैच में अपनी पहली टेस्ट जीत दर्ज की। बाद में उसी वर्ष, उन्होंने अपनी पहली टेस्ट श्रृंखला जीती, जो पाकिस्तान के खिलाफ थी।"

Question answering
Question: "When did India win their first test match?"
Answer: 1952
You use NLP every day … (maybe without even noticing)

You probably have used ChatGPT
• Use cases abound:
• Summarizing (or simplifying) content,
• Writing content (emails, documents, etc.)
• Creative content (e.g., advertisements, titles, names, etc.)
• Question answering
• Problem solving (to some degree)
• … and many more

Understanding language is critical
• Language is a means for people to communicate …
• The majority of the available data is in textual form

• Language understanding is a core requirement of intelligence


• The same holds for building "intelligent machines"

Why is it hard to handle human languages?
• The same word can have different meanings in different contexts (and cultures!)

• Understanding language often requires some common sense

• Olive oil is made from olives,
• palm oil is made from palm fruit,
• peanut oil is made from peanuts
• This does not mean that baby oil is made from babies

• Languages are highly compositional: you can create infinitely many novel sentences

Why is it hard to handle human languages?
• Word frequencies follow a power law (Zipf's law)

• Ambiguity:
• Semantic: "The trophy did not fit in the suitcase because it was too small"
• Syntactic: "A computer that understands you like your mother"

• Meanings of words change, new words are introduced, and old ones fall out of use

• master & mistress, buddy & sissy, bachelor & spinster, doctor & doctress
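Zipf's law says that the frequency of the r-th most common word is roughly proportional to 1/r, so rank × frequency should stay in the same ballpark across ranks. The sketch below only shows how one might tabulate this; the function name `rank_frequency` and the toy text are illustrative assumptions, and a sentence this short is far too small to actually exhibit the power law.

```python
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count, rank * count) rows, most frequent word first."""
    # Whitespace tokenization is a simplifying assumption for this sketch.
    counts = Counter(text.lower().split())
    rows = []
    for rank, (word, count) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, count, rank * count))
    return rows


# Toy text (hypothetical); a real check needs a large corpus.
text = "the cat and the dog and the bird saw the cat"
for rank, word, count, product in rank_frequency(text):
    print(f"{rank:>2}  {word:<5} count={count}  rank*count={product}")
```

On a large corpus, plotting log(count) against log(rank) should give an approximately straight line, which is the usual visual check for a power law.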

Why computationally study languages
• Our (spoken/written) language is a window into our life

• What do our (function) words say about us?


• State of mind, i.e., well-being
• Economy
• Propensity to lead
• Many other aspects

Course content
• Tasks: classification, sequence to sequence, tagging, language modeling
• Architectures: RNNs, LSTMs, GRUs, Transformers
• Models: n-gram models, encoder, decoder (e.g., GPTs), encoder-decoder models
• Algorithms for learning: largely gradient descent, MLE of probabilistic models
• Algorithms for decoding: greedy, top-k and top-p sampling, Viterbi decoding
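To make a few of the items above concrete, here is a minimal sketch, not the course's own code, that ties together an n-gram model, maximum likelihood estimation, and three decoding strategies. It assumes a bigram model trained on a four-sentence toy corpus; all function names (`train_bigram_mle`, `greedy`, `top_k`, `top_p`, `sample`, `generate_greedy`) are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter, defaultdict


def train_bigram_mle(sentences):
    """Estimate P(next | prev) by relative frequency, the MLE for a bigram model."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {w: c / sum(nxt_counts.values()) for w, c in nxt_counts.items()}
        for prev, nxt_counts in counts.items()
    }


def greedy(dist):
    """Greedy decoding: pick the single most probable next token."""
    return max(dist, key=dist.get)


def top_k(dist, k):
    """Top-k: keep the k most probable tokens and renormalize."""
    kept = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {w: p / total for w, p in kept}


def top_p(dist, p):
    """Top-p (nucleus): keep the smallest set of most probable tokens whose
    cumulative mass reaches p, then renormalize."""
    kept, mass = [], 0.0
    for w, prob in sorted(dist.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((w, prob))
        mass += prob
        if mass >= p:
            break
    return {w: prob / mass for w, prob in kept}


def sample(dist, rng):
    """Draw one token from a (possibly truncated) distribution."""
    words = list(dist)
    return rng.choices(words, weights=[dist[w] for w in words], k=1)[0]


def generate_greedy(model, max_len=10):
    """Generate a sentence by repeatedly taking the most probable next token."""
    token, out = "<s>", []
    for _ in range(max_len):
        token = greedy(model[token])
        if token == "</s>":
            break
        out.append(token)
    return " ".join(out)


corpus = ["the cat sat", "the cat ran", "the dog sat", "the bird sang"]  # toy data
model = train_bigram_mle(corpus)
rng = random.Random(0)  # fixed seed so sampling is repeatable

print(model["the"])                          # MLE estimates of P(next | "the")
print(greedy(model["the"]))                  # most probable continuation
print(sample(top_k(model["the"], 2), rng))   # sample among the 2 best tokens
print(sample(top_p(model["the"], 0.7), rng)) # sample from the 0.7 nucleus
print(generate_greedy(model))
```

Greedy decoding is deterministic, while top-k and top-p trade some probability mass for diversity by sampling from a truncated distribution. Viterbi decoding, which finds the best label sequence in tagging models, is not covered by this sketch.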

Course content: what this course is not
• Given the nature of the subject (and what currently works in practice):
• There will be no mathematical derivations, proofs, bounds or guarantees

• Our understanding of current NLP systems is quite limited (theoretically),


• But we know a fair bit about what works in practice (empirically)

Course logistics: pre-requisites
• No formal pre-requisites

• We expect you to be familiar with


• Basic probability (e.g., joint distributions, Bayes' rule, expectations)
• Linear algebra (matrix manipulation)
• Python programming: assignments require you to write a fair bit of code!
• Good to have: familiarity with PyTorch and deep learning background
• I'll try to introduce topics you might not be aware of …
• But it takes time for concepts to settle in

Course logistics: important links
• Course website: https://danishpruthi.com/teaching/ds-207-jan-2025/

• Anonymous (continuous) feedback: http://tinyurl.com/feedback-for-danish


• Open to criticism, but please be civil and polite
• Treat others how you want to be treated

• Teams link for discussions: TBA

Course evaluation: assignments
• Four assignments to be solved individually
• Text classification & representation learning; language modeling; translation; TBD

• A major component of the grade: 4 × 15% = 60%


• About 10-15 days to solve (start early!)
• 4 late days, no extensions (please don't even ask)
• Interactive Python notebooks to be run on Colab
• All assignments are due at 16:59, always on weekdays
• Learning about NLP is important but not at the cost of your mental health

About use of AI models for assignments (e.g., GPT-4o)
• Each assignment will clearly state what is and is not allowed …
• You will be asked to declare and specify the use of AI models
• We might conduct an in-class quiz to check whether students did their own homework

• If you lift content from language models, your code might be similar to others
• Plagiarism cases: there were several last time! (Let's keep it clean this time)

Course evaluation: quizzes
• Two exams
• Mid term (15%)
• Final (25%)

Course logistics: TAs
• Tarun Gupta
• Yash Patel
• Shivashish Naithani
• Karan Raj (primarily available in March/April)

Questions?

Thank you

Next class: text classification
