Session 1


Natural Language Analysis

Introduction to NLA
Noura Al Moubayed and Donald Sturgeon
Module Tutors
Donald Sturgeon
[email protected]
Research interests: digital humanities, digital libraries, and
applications of NLP to literature and history

Noura Al Moubayed
[email protected]
Research interests: machine learning, natural language
processing, and optimisation for healthcare, social signal
processing, cyber-security, and Brain-Computer Interfaces
Why Study Natural Language Analysis?
Natural Language Analysis

Identify the structure and meaning of words, sentences, texts and conversations.
NLP is all around us
Natural Language Analysis

Sentiment analysis
  "Best roast chicken in San Francisco!"
  "The waiter ignored us for 20 minutes."

Question answering (QA)
  Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?

Spam detection
  "Let's go to Agra!" ✓
  "Buy V1AGRA …" ✗

Coreference resolution
  "Carter told Mubarak he shouldn't run again."

Word sense disambiguation
  "I need new batteries for my mouse."

Paraphrase
  "XYZ acquired ABC yesterday" ≈ "ABC has been taken over by XYZ"

Part-of-speech (POS) tagging
  ADJ ADJ NOUN VERB ADV
  Colorless green ideas sleep furiously.

Parsing
  "I can see Alcatraz from the window!"

Summarization
  "The Dow Jones is up", "The S&P500 jumped", "Housing prices rose" → "Economy is good"

Named entity recognition (NER)
  PERSON            ORG          LOC
  Einstein met with UN officials in Princeton

Machine translation (MT)
  第13届上海国际电影节开幕… → The 13th Shanghai International Film Festival…

Dialog
  "Where is Citizen Kane playing in SF?" → "Castro Theatre at 7:30. Do you want a ticket?"

Information extraction (IE)
  "You're invited to our dinner party, Friday May 27 at 8:30" → add "Party, May 27" to calendar

(Dan Jurafsky, NLP)


Natural Language Analysis
Language is complicated, complex, and ambiguous
“I made her duck.”

Humans understand language, machines understand numbers


We need to transform language into numbers…
… in a way that machines can learn from
Information Extraction
Unstructured text → Structured data


Textual Data is ambiguous


Data is rapidly growing

• Bible (King James version): ~700K words
• Penn Treebank: ~1M words of annotated text
• Newswire collections: 500M+ words
• Wikipedia (English): 2.9 billion words
• Web: thousands of billions of words!

We have a lot of data! How can we use it?


Language evolves
LOL      Laugh out loud
G2G      Got to go
BFN      Bye for now
B4N      Bye for now
IDK      I don't know
FWIW     For what it's worth
LUWAMH   Love you with all my heart

(CS6501 - Natural Language Processing)
Language evolves


Natural Language Analysis
• Rule-based systems
  Hand-crafted rules → prediction

• Classical machine learning
  Hand-crafted features → ML model → prediction

• Deep learning models
  Automatic feature extraction → DL model → prediction
Natural Language Analysis - ML vs DL


Language modelling
Being able to model how language works requires much more than simple rules!

Sometimes grammatical "rules" are enough to tell us which is the correct answer:
• Kick the ball ______ the opponent's goal.
  1. in   2. into   3. with   4. to
• Apples grow on ______.
  1. time   2. average   3. trees   4. rocks

But not always! There's nothing ungrammatical about saying that apples grow on rocks…

We can generate almost limitless "questions" like these from existing text (a small sketch follows below):
• Blank out a word, and treat the word that was there as the correct answer
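A minimal sketch of how such fill-in-the-blank "questions" could be generated automatically from raw text. The function name make_cloze and the random sampling strategy are illustrative assumptions, not part of the module materials:

import random

def make_cloze(sentence):
    """Blank out one word; the removed word becomes the 'correct answer'."""
    tokens = sentence.split()
    i = random.randrange(len(tokens))                      # position to blank out
    question = " ".join(tokens[:i] + ["______"] + tokens[i + 1:])
    return question, tokens[i]

print(make_cloze("Apples grow on trees ."))
# e.g. ('Apples grow on ______ .', 'trees') -- the blanked position is random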
Deep Learning models are powerful, but are they ethical?

Biased input data → Learning → Biased model → Biased predictions

Natural Language Analysis - Outline
• Introduction to the module and outline.
• Introduction to NLP and its real-life applications.
• Text pre-processing.
• Language modelling and feature extraction.
• Extracting Information from Text.
• Neural Word Embeddings.
• Text classification and processing using CNN/LSTM/RNN.
• Attention and Sequence to Sequence Models.
• Transformers.
• Multi-task Learning.
• Ethics and Fairness.
Natural Language Analysis - Workshops
• Setting up the machines with the required libraries.
• Data preparation: text cleaning using NLTK, scikit-learn, etc.
• Develop Probabilistic Topic Modelling using LDA.
• Prepare Movie Review Data for Sentiment Analysis and develop a Neural Bag-of-Words Model for Sentiment Analysis.
• Train and Load Word Embeddings.
• Develop an Embedding and train a CNN Model for Sentiment Analysis.
• Develop a Neural Language Model for Text Generation.
• Text classification using RNN and LSTM.
• Working on the assignment.
Natural Language Analysis - Workshops
• Labs start from next week!
• Please choose your lab group today on Ultra
• Either:
  • Mondays 2-5pm, or
  • Thursdays 2-5pm
Natural Language Analysis - Main Libraries
nltk
Gensim
SpaCy
scikit-learn
PyTorch
TorchText
NumPy
SciPy
Questions?


Natural Language Analysis

Noura Al Moubayed and Donald Sturgeon


Text Processing and Basic Language Models
• What is text?
• Text tokenisation
• Token Normalisation
• Stopwords
• Stemming
• Lemmatisation
• From Words to Features
• Bag of words
• Term Frequency (TF), Inverse Document Frequency (IDF), and TF-IDF
• N-grams
What is text?

example = "This is an example"

print(example[0:3])   # "Thi"

print(example[3:9])   # "s is a"

• Programming languages usually treat text as sequences of characters


• Natural language processing: we care about meaning
• In natural languages, we often think of words as the smallest meaningful units
What is text?
Text is a sequence of characters, words, phrases, sentences, paragraphs…

he picked up the cake,


and the rake, and the gown,
and the milk, and the strings,
and the books, and the dish,
and the fan, and the cup,
and the ship, and the fish.
and he put them away.
then he said, 'that is that.'
and then he was gone
with a tip of his hat.
Why tokenize?
[Figure: a search query and its most relevant, 2nd most relevant, and 3rd most relevant results]


Why tokenize?


These are not the cats you are looking for!

• Even for simple NLP tasks, matching strings does not generalize well

• Instead, match words:
  "cat" = "cat"
  "cat" ≠ "scatter"
From strings to tokens

example = "This is an example it is"
tokens = example.split()

tokens                # ['This', 'is', 'an', 'example', 'it', 'is']
print(tokens[0])      # 'This'
print(tokens[1])      # 'is'

Why does this matter?

print(tokens[0] == tokens[1])   # False
print(tokens[1] == tokens[5])   # True
Text tokenisation
Issues in Tokenization

Finland's capital → Finland  Finlands  Finland's ?

what're, I'm, isn't → what are, I am, is not
Hewlett-Packard → Hewlett  Packard  HP ?
state-of-the-art → state of the art ?
Lowercase → lower-case  lowercase  lower case ?
San Francisco → one token or two?
m.p.h., PhD. → ??

print('Finland' == "Finland's")      # False
print('Lowercase' == 'lowercase')    # False
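In practice, tokenisers handle cases like these with hand-tuned, language-specific rules. A minimal sketch using NLTK's word_tokenize (the commented output is what recent NLTK versions typically produce; it may vary slightly between versions):

import nltk
nltk.download("punkt")                    # one-off download of the tokenizer models
from nltk.tokenize import word_tokenize

print(word_tokenize("Finland's capital isn't state-of-the-art."))
# typically: ['Finland', "'s", 'capital', 'is', "n't", 'state-of-the-art', '.']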
Text tokenisation
Uppercase vs lowercase

DURHAM = Durham = durham


I broke my [Microsoft] Windows ≠ [glass] windows
USA ≈ US ≠ us
Punctuation

USA = U.S.A.
Manufacturer serial no. ≠ yes or no.
Text tokenisation
Issues in language

French
– L'ensemble → one token or two?
  • L ? L' ? Le ?
  • Want l'ensemble to match with un ensemble

German noun compounds are not segmented:

– Lebensversicherungsgesellschaftsangestellter
– 'life insurance company employee'
– German information retrieval needs a compound splitter
Text tokenisation
Issues in language

Chinese and Japanese have no spaces between words:


– 莎拉波娃现在居住在美国东南部的佛罗里达。
– 莎拉波娃 / 现在 / 居住 / 在 / 美国 / 东南部 / 的 / 佛罗里达
– Sharapova now lives in US southeastern Florida

フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)

Katakana Hiragana Kanji Romaji


End-user can express the same query in different ways!
E.g. 情報不足 = じょうほうふそく
Text tokenisation
Word Tokenization in Chinese

Chinese words are composed of characters:
– Characters are generally 1 syllable and 1 morpheme.
– The average word is 2.4 characters long.

Standard baseline algorithm: Maximum Matching

Given a wordlist of Chinese and a string:
1. Start a pointer at the beginning of the string.
2. Find the longest word in the dictionary that matches the string starting at the pointer.
3. Move the pointer over the matched word in the string, and repeat from step 2 until the end of the string.
Maximum matching
• Compare the start of the input string with a list of all possible words*
• Choose the longest matching word, and output this as a token
• Move on to the next part of the string

• E.g. input = "thisismyexample" (a Python sketch of this procedure follows below)


1. tokens = [“this”] , remaining = “ismyexample”
2. tokens = [“this”, “is”] , remaining = “myexample”
3. tokens = [“this”, “is”, “my”] , remaining = “example”
4. tokens = [“this”, “is”, “my”, “example”]

* Do we have easy access to such a list?
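A minimal Python sketch of the maximum-matching procedure described above; the toy wordlist is illustrative (a real system needs a full dictionary):

def max_match(text, wordlist):
    """Greedy left-to-right maximum matching: always take the longest dictionary word."""
    tokens, pointer = [], 0
    while pointer < len(text):
        # try the longest candidate first; fall back to a single character
        for end in range(len(text), pointer, -1):
            if text[pointer:end] in wordlist or end == pointer + 1:
                tokens.append(text[pointer:end])
                pointer = end
                break
    return tokens

wordlist = {"this", "is", "my", "example", "the", "theta", "table", "bled", "own", "down", "there"}
print(max_match("thisismyexample", wordlist))     # ['this', 'is', 'my', 'example']
print(max_match("thetabledownthere", wordlist))   # ['theta', 'bled', 'own', 'there'] -- greedy goes wrong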


Maximum matching
Will this simple algorithm always give us the right answer?

Thetabledownthere
The table down there

Theta bled own there


Doesn’t work great in English:
– the longest word is not necessarily the most likely
Works reasonably well in Chinese
莎拉波娃现在居住在美国东南部的佛罗里达。
⇒ 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
Maximum matching
莎拉波娃现在居住在美国东南部的佛罗里达
莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
Sharapova now lives in US southeastern Florida

莎拉波娃 现在 居住 在美 国 东南部 的 佛罗里达


Sharapova now lives in-the-US country southeastern Florida

在美 华人
In-the-US Chinese person

Larger dictionary needed => but this also gives more false positive matches!
Bag of words models

● Model a document as an unordered collection of tokens
● Surprisingly good features for document classification, topic modelling, etc.
Stopwords and word clouds
Word cloud: arbitrarily arranged tokens, in font size proportional to (e.g.) their frequency in a document

● J.K. Rowling's Harry Potter (1997) or H.G. Wells' War of the Worlds (1897)?

● Stopwords: words intentionally excluded because we believe they don't tell us much about what's going on.

● E.g. grammatical particles: "a", "the", "of"
Token Normalisation
Stopwords

Stopword removal allows:
• Reducing irrelevance: restricts the analysis to meaningful words and reduces the noise stopwords can introduce to the meaning.
• Reducing feature dimension: significantly reduces the number of tokens extracted from documents.

Example text:
he picked up the cake,
and the rake, and the gown,
and the milk, and the strings,
and the books, and the dish,
and the fan, and the cup,
and the ship, and the fish.
and he put them away.
then he said, 'that is that.'
and then he was gone
with a tip of his hat.
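A minimal sketch of stopword removal using NLTK's built-in English stopword list (applied to one line of the example text):

import nltk
nltk.download("stopwords")                # one-off download of the stopword lists
from nltk.corpus import stopwords

stop = set(stopwords.words("english"))
tokens = "and the milk and the strings and the books and the dish".split()
print([t for t in tokens if t.lower() not in stop])
# ['milk', 'strings', 'books', 'dish']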
Word-Clouds for Document Classification

Left: speech of Fidel Castro to the UN, 1960
Right: The ecclesiastical architecture of Scotland, David MacGibbon & Thomas Ross, 1896

These are two U.S. presidential inauguration addresses. Which is Obama's, and which is Trump's?
Token Normalisation
Lemmatisation

Lemmatisation: how to find the correct dictionary headword form (the lemma). Reduce variant forms to the base form:

– am, are, is → be
– run, ran, running, runs → run
– car, cars, car's, cars' → car
– the boy's cars are different colours → the boy car be different colour
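A minimal sketch using NLTK's WordNet lemmatiser; note that it needs a part-of-speech hint to map verbs such as "are" to "be" (spaCy's token.lemma_ is a common alternative):

import nltk
nltk.download("wordnet")                       # one-off download of WordNet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars"))            # 'car'  (default part of speech is noun)
print(lemmatizer.lemmatize("running", "v"))    # 'run'
print(lemmatizer.lemmatize("are", "v"))        # 'be'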
Token Normalisation
Stemming

Reduce terms to their "stems", the core meaning-bearing units
• Not the same as the "lemma"!
• Stemming is language dependent
• Try http://9ol.es/porter_js_demo.html, an online stemming tool

Stemming algorithms are typically rule-based.
One approach: remove a suffix if the resulting word is in the dictionary.

For example, "compressed" and "compression" are both accepted as equivalent to "compress":
  "for example compressed and compression are both accepted as equivalent to compress"
→ "for exampl compress and compress ar both accept as equival to compress"
Token Normalisation
Stemming – Porter Stemmer

sses → ss        caresses → caress
ies → i          ponies → poni
s → ø            cats → cat
(*v*)ing → ø     walking → walk   (but sing → sing)
(*v*)ed → ø      plastered → plaster
(*v*)y → i       pony → poni
ational → ate    relational → relate
izer → ize       digitizer → digitize
ator → ate       operator → operate
al → ø           revival → reviv
able → ø         adjustable → adjust
ate → ø          activate → activ
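A minimal sketch running NLTK's implementation of the Porter stemmer over the earlier example sentence (exact output forms can differ slightly between Porter implementations):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
text = "for example compressed and compression are both accepted as equivalent to compress"
print(" ".join(stemmer.stem(w) for w in text.split()))
# close to the slide's version: "for exampl compress and compress ar both accept as equival to compress"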
Documents as vectors

Goal: compare contents of (potentially) large volumes of text efficiently
• E.g. newspaper articles
• Generally vary in length
• Could have similar contents despite very different lengths
• Intuition: words in common suggest (potentially) meaningful similarities
• "Bag of words": forget about word order, just model word counts

Simplest approach: use vocabulary V to create Term Frequency (TF) vectors
• Instead of one list per document, make one (same length) vector per doc
Term Frequency (TF) document vectors
S1 = "the power of example: anthropological explorations in persuasion, evocation."
S2 = "the power of example: anthropological explorations in persuasion, evocation, and imitation"
S3 = "an artificial example about an ant, a rat, and a powerful person"

Index  Term              D1  D2  D3
0      the                1   1   0
1      power              1   1   0
2      of                 1   1   0
3      example            1   1   1
4      :                  1   1   0
5      anthropological    1   1   0
6      explorations       1   1   0
7      in                 1   1   0
8      persuasion         1   1   0
9      ,                  1   2   2
10     evocation          1   1   0
11     .                  1   1   0
12     and                0   1   1
13     imitation          0   1   0
14     an                 0   0   2
15     artificial         0   0   1
16     about              0   0   1
17     ant                0   0   1
18     a                  0   0   2
19     rat                0   0   1
20     powerful           0   0   1
21     person             0   0   1

(D1, D2, D3 are the TF vectors for S1, S2, S3 respectively.)
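A minimal sketch of building these TF vectors in plain Python. The documents are given pre-tokenised (punctuation separated by spaces) so the counts match the table above; in practice a tokenizer would produce the token lists:

from collections import Counter

docs = {
    "S1": "the power of example : anthropological explorations in persuasion , evocation .",
    "S2": "the power of example : anthropological explorations in persuasion , evocation , and imitation",
    "S3": "an artificial example about an ant , a rat , and a powerful person",
}
tokens = {name: text.split() for name, text in docs.items()}

# Shared vocabulary V: every token type seen in any document, in order of first appearance
vocab = []
for toks in tokens.values():
    for t in toks:
        if t not in vocab:
            vocab.append(t)

# One TF vector per document, all of length |V| (22 here, matching the table)
tf = {name: [Counter(toks)[term] for term in vocab] for name, toks in tokens.items()}
print(len(vocab), tf["S3"][vocab.index("a")])   # 22 2

# Length-normalised TF (divide by document length), as used on the next slides:
tf_norm = {name: [c / len(tokens[name]) for c in vec] for name, vec in tf.items()}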
Documents as vectors (Obama's speech: 2,897 tokens; Trump's speech: 1,731 tokens; Fidel's speech: 21,198 tokens)

Problem 1: frequent words too generic (remove stopwords)
Problem 2: can't compare between documents of different lengths
Normalization by length (Obama's speech: 2,897 tokens; Trump's speech: 1,731 tokens; Fidel's speech: 21,198 tokens)

Normalize (i.e. divide) each TF value by the length of the document.
From Words to Features
Term Frequency


From Words to Features
Term Frequency - Binary


From Words to Features
Term Frequency – Raw count (Term Frequency)


From Words to Features
Term Frequency – log normalisation

A document with 10 occurrences of a term (a specific word) is more relevant than a document with 1 occurrence of that term, but arguably not 10 times more relevant.

An alternative is the log-frequency weight of term t in document d (see below).
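The weight formula itself did not survive extraction; the standard log-frequency weight used for this purpose (a hedged reconstruction, not taken from the slide image) is:

w_{t,d} =
\begin{cases}
1 + \log_{10}\big(\mathrm{tf}_{t,d}\big) & \text{if } \mathrm{tf}_{t,d} > 0,\\
0 & \text{otherwise.}
\end{cases}

With this weighting, 1 occurrence gives weight 1, 10 occurrences give 2, and 1000 give 4: more occurrences still count for more, but not proportionally more.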
From Words to Features
Term Frequency – Query Matching

Queries with more than one term:

Score for a document-query pair: sum over the terms t that appear in both q and d (see below).

The score is 0 if none of the query terms is present in the document.


Sec. 6.2.1
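The scoring formula is likewise not recoverable from the extracted text; assuming the log-frequency weight above, the usual form is:

\mathrm{score}(q, d) \;=\; \sum_{t \in q \cap d} w_{t,d} \;=\; \sum_{t \in q \cap d} \big(1 + \log_{10}\mathrm{tf}_{t,d}\big)

If q and d share no terms, the sum is empty and the score is 0, as stated above.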

From Words to Features


Document Frequency

Rare terms are more informative than frequent terms


– Recall stopwords
Consider a term in the query that is rare in the collection

A document containing rare term is very likely to be relevant to the
query of that rare term
→ We want a high weight for rare terms.
Inverse Document Frequency
● Rare words ('cryogenic', 'aardvark', 'logarithm', 'chiaroscuro') carry more information than common ones ('big', 'said', 'the', 'as', 'a')
  ○ Recall stopwords: essentially "don't count the very common words"
● So: give more weight to rare words:
  ○ N = |D| = total number of documents in collection D
  ○ Inverse Document Frequency (IDF) is defined as shown below
● "TF-IDF": use the Inverse Document Frequency to normalize the Term Frequency

[Chart: Inverse Document Frequency (IDF), roughly 0 to 1.4 on the y-axis, plotted against the number of documents a term occurs in (1 to 20 on the x-axis). Rare terms such as 'xenomorph', 'anthropological', and 'persuasion' receive high IDF; 'power' and 'example' sit in the middle; 'a', 'the', 'of' and punctuation marks receive the lowest IDF.]
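The IDF definition referred to above did not survive extraction; the standard definition (a reconstruction consistent with the chart, with df_t the number of documents in D that contain term t) and the resulting TF-IDF weight are:

\mathrm{idf}_t = \log_{10}\!\left(\frac{N}{\mathrm{df}_t}\right),
\qquad
\text{tf-idf}_{t,d} = w_{t,d} \times \mathrm{idf}_t

A term that occurs in every document gets idf = 0; a term that occurs in only one document gets the largest weight, log10(N).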
Cosine similarity
How similar are two documents, e.g. S1 and S2?
• Compare their vectors: cosine similarity for vectors A and B (see the formula and sketch below)

[Diagram: document vectors D1, D2, D3 with the angle θ between two of them]

⇒ Cosine similarity is a real number in [0, 1], and:
• 0 when A and B are orthogonal
  • Intuitively (for TF vectors): S1 and S2 have no terms in common at all
• 1 when A and B are scalar multiples of one another
  • Intuitively (for TF): all terms in S1 and S2 occur in identical proportions
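The similarity formula and the vector diagram did not survive extraction; the standard cosine similarity, together with a minimal NumPy sketch (the short vectors below are illustrative, not the full 22-term vectors from the TF table), is:

\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}

import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

d1 = np.array([1, 1, 1, 0])
d2 = np.array([2, 2, 2, 0])    # same terms as d1, in identical proportions
d3 = np.array([0, 0, 0, 5])    # no terms in common with d1

print(cosine_similarity(d1, d2))   # ~1.0 (scalar multiples of one another)
print(cosine_similarity(d1, d3))   # 0.0 (orthogonal vectors)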
Questions?
