Lecture 2 - MS CLASS: Words and Text Classification
Natural Language Processing
Dr. Sajid Mahmood
Prerequisites
• No course prerequisites, but I will assume:
– some programming experience (no specific
language required)
– familiarity with basics of calculus, linear algebra,
and probability
– will be helpful to have taken a machine learning
course, but not strictly required
Grading
• 3 assignments (10%)
• midterm exam (20%)
• course project (20%):
– project proposal (5%)
– final report (15%)
• class participation, including quizzes (15%)
• Final (35%)
Assignments
• mixture of formal exercises, implementation,
experimentation, analysis
• first assignment will be posted this week so
that you can have a look at it, due 2 weeks
from Monday
Project
• Replicate [part of] a published NLP paper, or
define your own project
• The project must be done in a group of two
• Each group member will receive the same grade
• More details to come
Collaboration Policy
• You are welcome to discuss assignments with
others in the course, but solutions and code
must be written individually
Optional Textbooks (1/2)
• Jurafsky & Martin. Speech and Language Processing, 2nd Ed. & 3rd Ed.
• Many chapters of 3rd edition are online
• Copies of 2nd edition available in library
Optional Textbooks (2/2)
• Goldberg. Neural Network Methods for Natural Language Processing.
• Earlier draft (from 2015) available online
What is natural language processing?
an experimental computer science research area
that includes problems and solutions pertaining to
the understanding of human language
Text Classification
Sentiment Analysis
Machine Translation
Question Answering
Dialog Systems
figure credit: Phani Marupaka
Summarization
Part-of-Speech Tagging
Some/determiner  questioned/verb (past)  if/prep.  Tim/proper noun  Cook/proper noun  's/poss.  first/adj.  product/noun
would/modal  be/verb  a/det.  breakaway/adjective  hit/noun  for/prep.  Apple/proper noun  ./punc.
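To try this kind of tagging yourself, here is a minimal sketch using NLTK (an assumption: NLTK is not a course requirement, it just ships a ready-made tagger). Note that NLTK uses Penn Treebank tag names (DT, VBD, NNP, ...) rather than the human-readable labels above.

```python
# Minimal sketch: POS-tagging the slide's sentence with NLTK
# (assumes: pip install nltk, plus the tagger model downloaded below;
# the resource name may differ across NLTK versions, e.g.
# "averaged_perceptron_tagger_eng" in newer releases).
import nltk

nltk.download("averaged_perceptron_tagger")  # one-time model download

# the sentence is already tokenized, so a plain split is enough here
tokens = "Some questioned if Tim Cook 's first product would be a breakaway hit for Apple .".split()
print(nltk.pos_tag(tokens))
# e.g. [('Some', 'DT'), ('questioned', 'VBD'), ('if', 'IN'), ('Tim', 'NNP'), ...]
```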
Word Prediction
Other language technologies
(not typically considered core NLP):
• speech processing
• information retrieval / web search
• knowledge representation / reasoning
Why is NLP hard?
• ambiguity and variability of linguistic expression:
– variability: many forms can mean the same thing
– ambiguity: one form can mean many things
Example: Hyperlinks in Wikipedia
• the anchor text "bar" links to many different Wikipedia articles:
– bar (law), bar (establishment), bar association, bar (unit), medal bar, bar (music), …
Ambiguity vs. Variability
Word Sense Ambiguity
credit: A. Zwicky
Meaning Ambiguity
Words
• what is a word?
• tokenization
• morphology
• lexical semantics
What is a word?
Tokenization
• tokenization: convert a character stream into
words by adding spaces
• for certain languages, highly nontrivial
• e.g., Chinese word segmentation is a widely-studied NLP task
Tokenization
• for other languages (e.g., English), tokenization is easier but still not always obvious
• the data for your homework has been
tokenized:
– punctuation has been split off from
words
– contractions have been split
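As a rough illustration of the kind of preprocessing applied to the homework data, here is a minimal regex-based tokenizer sketch; real tokenizers handle far more cases (abbreviations, URLs, numbers, ...).

```python
import re

def tokenize(text):
    """Very rough English tokenizer: split punctuation and contractions off words."""
    text = re.sub(r'([.,!?;:()"])', r" \1 ", text)                 # split punctuation
    text = re.sub(r"(n't|'s|'re|'ve|'ll|'d|'m)\b", r" \1", text)   # split contractions
    return text.split()

print(tokenize("Don't they know it's Tim Cook's first product?"))
# ['Do', "n't", 'they', 'know', 'it', "'s", 'Tim', 'Cook', "'s", 'first', 'product', '?']
```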
Intricacies of Tokenization
• Chinese and Japanese: no spaces between words:
– 莎拉波娃现在居住在美国东南部的佛罗里达。
– 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
– word-by-word gloss: Sharapova / now / lives / in / US / southeastern / 's / Florida
• Further complicated in Japanese, with multiple alphabets intermingled
– Dates/amounts in multiple formats
J&M/SLP3
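For Chinese, off-the-shelf segmenters exist; below is a sketch using the third-party jieba package (an assumption, not something the slides prescribe; any segmenter would do).

```python
# Sketch: Chinese word segmentation with the third-party jieba package
# (pip install jieba). Segmentation quality varies; this is only a demo.
import jieba

sent = "莎拉波娃现在居住在美国东南部的佛罗里达。"
print(" ".join(jieba.lcut(sent)))
# expected: roughly 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达 。
# (rare proper names like 莎拉波娃 may or may not come out as one token)
```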
Removing Spaces?
• tokenization is usually about adding spaces
• but might we also want to remove spaces?
• what are some English examples?
– names?
  • New York → NewYork
– non-compositional compounds?
  • hot dog → hotdog
– other artifacts of our spacing conventions?
  • New York-Long Island Railway?
Types and Tokens
• once text has been tokenized, let’s count the words
• types: entries in the vocabulary
• tokens: instances of types in a corpus
• example sentence: If they want to go , they should go .
– how many types? 8
– how many tokens? 10
• type/token ratio: useful statistic of a corpus (here, 0.8)
• as we add data, what happens to the type/token ratio?
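Counting types and tokens for the example sentence takes only a few lines; a sketch:

```python
# Types vs. tokens on the example sentence (already tokenized).
from collections import Counter

tokens = "If they want to go , they should go .".split()
types = Counter(tokens)

print(len(tokens))               # 10 tokens
print(len(types))                # 8 types: If, they, want, to, go, ',', should, '.'
print(len(types) / len(tokens))  # type/token ratio: 0.8
```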
“really” on Twitter
224571 really 50 reallllllly 15 reallllyy
1189 rly 48 reeeeeally 15 reallllllllly
1119 realy 41 reeally 15 reaallly
731 rlly 38 really2 14 reeeeeeally
590 reallly 37 reaaaaally 14 reallllyyyy
234 realllly 35 reallyyyyy 13 reeeaaally
216 reallyy 31 reely 12 rreally
156 relly 30 realllyyy 12 reaaaaaally
146 reallllly 27 realllyy 11 reeeeallly
132 rily 27 reaaly 11 reeeallly
104 reallyyy 26 realllyyyy 11 realllllyyy
89 reeeally 25 realllllllly 11 reaallyy
89 realllllly 22 reaaallly 10 reallyreallyreally
84 reaaally 21 really- 10 reaaaly
82 reaally 19 reeaally 9 reeeeeeeally
72 reeeeally 18 reallllyyy 9 reallys
65 reaaaally 16 reaaaallly 9 really-really
57 reallyyyy 15 realyy 9 r)eally
53 rilly 15 reallyreally 8 reeeaally
“really” on Twitter
8 reallyyyyyyy 6 realllllllllly 4 realllllllyyyy
8 reallyyyyyy 6 reaaaaaallly 4 reaalllyyy
8 realky 5 rrrreally 4 reaalllly
7 relaly 5 rrly 4 reaaalllyy
7 reeeeeeeeeally 5 rellly 4 reaaalllly
7 reeeealy 5 reeeeeeeeally 4 reaaaaly
7 reeeeaaally 5 reeeeaally 3 reeeeealllly
7 reallllllyyy 5 reeeeaaallly 3 reeeealllly
7 realllllllllllly 5 reeallyyy 3 reeeeaaaaally
7 reaaaaaaally 5 reallllllllllly 3 reeeaallly
7 raelly 5 reallllllllllllly 3 reeeaaallllyyy
7 r3ally 5 reaalllyy 3 reealy
6 r-really 5 reaaaalllly 3 reeallly
6 reeeaaalllyyy 5 reaaaaallly 3 reeaaly
6 reeeaaallly 4 rllly 3 reeaalllyyy
6 reeeaaaally 4 reeeeeeeeeeally 3 reeaalllly
6 realyl 4 reeealy 3 reeaaallly
6 r-e-a-l-l-y 4 reeaaaally 3 reallyyyyyyyyy
6 realllyyyyy 4 realllllyyyy 3 reallyl
“really” on Twitter
3 really) 2 rlyyyy 2 reeaallyy
3 r]eally 2 rlyyy 2 reeaalllyy
3 realluy 2 reqally 2 reeaallly
3 reallllyyyyy 2 rellyy 2 reeaaally
3 reallllllyyyyyyy 2 rellys 2 reaqlly
3 reallllllyyyy 2 reeely 2 realyyy
3 reallllllyy 2 reeeeeealy 2 reallyyyyyyyyyyyy
3 realllllllllllllllly 2 reeeeeallly 2 reallyyyyyyyy
3 realiy 2 reeeeeaally 2 really*
3 reaallyyyy 2 reeeeeaaally 2 really/
3 reaallllly 2 reeeeeaaallllly 2 realllyyyyyy
3 reaaallyy 2 reeeeallyyy 2 reallllyyyyyy
3 reaaaallyy 2 reeeeallllyyy 2 realllllyyyyyy
3 reaaaallllly 2 reeeeaaallllyyyy 2 realllllyy
3 reaaaaaly 2 reeeeaaalllly 2 reallllllyyyyy
3 reaaaaaaaally 2 reeeeaaaally 2 realllllllyyyyy
3 r34lly 2 reeeeaaaalllyyy 2 realllllllyy
2 rrreally 2 reeeallyy 2 reallllllllllllllly
2 rreeaallyy 2 reeallyy 2 reallllllllllllllllly
1 rrrrrrrrrrrrrrrreeeeeeeeeeeaaaaaaalllllllyyyyyy
1 rrrrrrrrrreally
1 rrrrrrreeeeeeaaaalllllyyyyyyy
1 rrrrrrealy
1 rrrrrreally
…
1 re-he-he-heeeeally
1 re-he-he-he-ealy
1 reheheally
1 reelllyy
1 reellly
1 ree-hee-heally
…
1 reeeeeeeeeaally
1 reeeeeeeeeaaally
1 reeeeeeeeeaaaaaalllyyy
1 reeeeeeeeeaaaaaaallllllllyyyyyyyy
1 reeeeeeeeeaaaaaaallllllllyyyyyyyy
1 reeeeeeeeeaaaaaaaaalllllllllyyyyyyyy
1 reeeeeeeeaaaaaaaalllllyyyyyy
1 reallyreallyreallyreallyreallyreallyreallyreallyreallyreally
reallyreallyreallyreallyreallyreallyreally
1 reallyreallyreallyreallyreallyr33lly
1 really/really/really
1 really(really
…
1 reallllllllyyyy
1 realllllllllyyyyyy
1 realllllllllyyyyy
1 realllllllllyyyy
1 realllllllllyyy
1 reallllllllllyyyyy
1 reallllllllllllyyyyyy
1 reallllllllllllllllllly
1 reallllllllllllllllllllly
1 reallllllllllllllllllllllyyyyy
1 reallllllllllllllllllllllllllly
1 realllllllllllllllllllllllllllly
1 reallllllllllllllllllllllllllllllllly
1 reallllllllllllllllllllllllllllllllllllllllllllly
1 reallllllllllllllllllllllllllllllllllllllllllllllllllllllly
1 reallllllllllllllllllllllllllllllllllllllllllllllllllllllllllll
lllllllly
How many words are there?
• how many English words exist?
• when we increase the size of our corpus, what
happens to the number of types?
– a bit surprising: vocabulary continues to grow in
any actual dataset
– you’ll just never see all the words
– in 1 million tweets, 15M tokens, 600k types
– in 56 million tweets, 847M tokens, 11M types
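A sketch of how you could watch this effect on your own data; the toy corpus below is invented, so unlike real text its vocabulary saturates almost immediately:

```python
# Sketch: watch the vocabulary (type count) grow as the corpus grows.
# 'corpus' is a made-up stand-in; on real data (e.g. tweets) the type
# count keeps climbing and the type/token ratio keeps falling.
corpus = ("the cat sat on the mat " * 3 + "a dog ran by the red mat").split()

seen = set()
for i, tok in enumerate(corpus, start=1):
    seen.add(tok)
    if i % 6 == 0:
        print(f"{i} tokens -> {len(seen)} types (ratio {len(seen)/i:.2f})")
```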
Classification of Language Data
Document --> label (very partial list):
• language classification
• topic classification
• author classification
• sentiment classification
• interestingness
• relevance
– are each of these binary / multi-class / multi-label?
– classification vs. ranking: in which of these may ranking be better?
[ranking: assign a score to each label or to each item]
Sentence --> label
• mostly the same as in "document --> label", but at a more granular level
• Others?
Representing text as Features
• Indicator features over events in the data: counts of words, characters, ngrams, lemmas, stems, ...
• Pre-processing: tokenization
• Example: "the special onion soup was not very bad."
  → (0,0,1,0,0,0,1,0,0,0,1,0,1,0,0,...,0,1,1,0,0,1,0)
[figure: each vector position is an indicator for one vocabulary entry, e.g. soup, special, dog, salad, a, lamp, the, good, not, bad, onion, was, ...]
• lemmatizing: create, created, creating, creator, creativity → create, create, create, creator, creativity
• stemming: create, created, creating, creator, creativity → creat, creat, creat, creat, creat
Note: the structure of words is called "morphology". It will be covered in more depth next week.
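These transformations can be reproduced with NLTK's stemmer and lemmatizer (again an assumption; any morphology tool would do):

```python
# Sketch: reproducing the lemma/stem columns with NLTK
# (assumes: pip install nltk; the lemmatizer also needs the WordNet data).
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")  # one-time download for the lemmatizer

words = ["create", "created", "creating", "creator", "creativity"]

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in words])
# crude suffix chopping: the verb forms collapse to 'creat';
# exact stems for 'creator'/'creativity' depend on the stemmer's rules

lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(w, pos="v") for w in words])
# with pos="v", the verb forms map to 'create'; nouns like
# 'creator' and 'creativity' are left unchanged
```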
ML over text
Machine learning
• "Learn from data"
• Supervised: labeled examples
• We train a function f : x ∈ X → y ∈ Y
• Usually, the data-point x is represented as "features": f : φ(x) ∈ ℝ^m → y ∈ Y
Classification: the ML view
(assuming you already know ML from a different course)
• We train a function f : x ∈ X → y ∈ Y
• Usually, the data-point x is represented as "features": f : φ(x) ∈ ℝ^m → y ∈ Y
Feature Function
• how do we represent an object? f( object ) = ?
• perform measurements and obtain features.
• Binary: y ∈ {−1, 1}
• Multi-class: y ∈ {1, 2, ..., k}
• Multi-label: y ∈ 2^{1, 2, ..., k}
• Regression*: y ∈ ℝ  (*not really a "classification" problem)
• (Structured)
[class: provide examples of each]
Types of classifiers
• Generative vs Discriminative
  – generative: model P(x, y)
  – discriminative: model P(y|x), score(x, y), or f(x) = y
• Probabilistic vs Non-probabilistic
  – probabilistic: P(x, y) or P(y|x)
  – non-probabilistic: score(x, y) or f(x) = y
• Linear vs Non-linear
Popular Classifiers
• kNN (k nearest neighbors)
• Decision trees
  – decision forests
  – gradient-boosted trees
• Logistic regression
• SVM
• "Neural networks"
• ...
In Python: scikit-learn (sklearn) is a popular and good package.
Concepts you should know
• Training set, development set, test set.
• Loss function.
• Overfitting. Regularization.
• Evaluation metric.
Evaluation Metrics
• Accuracy: |ŷ = y| / (|ŷ = y| + |ŷ ≠ y|)
  – is this a good metric? when? when not?
• Majority-class baseline
• True Positive, True Negative, False Positive, False Negative
[figure: positive points (x) and negative points (o) scattered around a decision boundary; the ŷ = 1 region holds the predicted positives, the ŷ = −1 region the predicted negatives]
• "accuracy on positive class"? "accuracy on negative class"?
• "precision"? "recall"?