Session 1


Natural Language Analysis

Introduction to NLA
Noura Al Moubayed and Donald Sturgeon
Module Tutors
Donald Sturgeon
[email protected]
Research interests: digital humanities, digital libraries, and
applications of NLP to literature and history

Noura Al Moubayed
[email protected]
Research interests: machine learning, natural language
processing, and optimisation for healthcare, social signal
processing, cyber-security, and Brain-Computer Interfaces
Why Study Natural Language Analysis?
Natural Language Analysis

Identify the structure and meaning of words, sentences, texts and conversations.
NLP is all around us
Natural Language Analysis

Sentiment analysis
  "Best roast chicken in San Francisco!"
  "The waiter ignored us for 20 minutes."

Question answering (QA)
  Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?

Spam detection
  "Let's go to Agra!" ✓
  "Buy V1AGRA …" ✗

Coreference resolution
  "Carter told Mubarak he shouldn't run again."

Word sense disambiguation
  "I need new batteries for my mouse."

Paraphrase
  "XYZ acquired ABC yesterday" ≈ "ABC has been taken over by XYZ"

Part-of-speech (POS) tagging
  ADJ ADJ NOUN VERB ADV
  Colorless green ideas sleep furiously.

Parsing
  "I can see Alcatraz from the window!"

Summarization
  "The Dow Jones is up", "The S&P500 jumped", "Housing prices rose" → "Economy is good"

Named entity recognition (NER)
  PERSON            ORG          LOC
  Einstein met with UN officials in Princeton

Machine translation (MT)
  第13届上海国际电影节开幕… → The 13th Shanghai International Film Festival…

Dialog
  "Where is Citizen Kane playing in SF?" → "Castro Theatre at 7:30. Do you want a ticket?"

Information extraction (IE)
  "You're invited to our dinner party, Friday May 27 at 8:30" → add "Party, May 27" to calendar

(Dan Jurafsky, NLP)


Natural Language Analysis
Language is complicated, complex, and ambiguous
“I made her duck.”

Humans understand language, machines understand numbers


We need to transform language into numbers…
… in a way that machines can learn from
Information Extraction
Unstructured text → Structured data


Textual Data is ambiguous


Data is rapidly growing

• Bible (King James version): ~700K words
• Penn Treebank: ~1M words of annotated text
• Newswire collections: 500M+ words
• Wikipedia (English): 2.9 billion words
• Web: thousands of billions of words!

We have a lot of data! How can we use it?


Language evolves
LOL      Laugh out loud
G2G      Got to go
BFN      Bye for now
B4N      Bye for now
IDK      I don't know
FWIW     For what it's worth
LUWAMH   Love you with all my heart

(CS6501 - Natural Language Processing)
Language evolves


Natural Language Analysis
• Rule-based systems
  Hand-crafted rules → prediction

• Classical machine learning
  Hand-crafted features → ML model → prediction

• Deep learning models
  Automatic feature extraction → DL model → prediction
Natural Language Analysis - ML vs DL


Language modelling
Being able to model how language works requires much more than simple rules!

Sometimes grammatical "rules" are enough to tell us which is the correct answer:
• Kick the ball ______ the opponent's goal.
  1. in   2. into   3. with   4. to
• Apples grow on ______.
  1. time   2. average   3. trees   4. rocks

But not always! There's nothing ungrammatical about saying that apples grow on rocks…

We can generate almost limitless "questions" like these from existing text (a small sketch follows below):
• Blank out a word, and treat the word that was there as the correct answer
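A minimal sketch of how such fill-in-the-blank "questions" could be generated automatically from raw text. The function name make_cloze and the random sampling strategy are illustrative assumptions, not part of the module materials:

import random

def make_cloze(sentence):
    """Blank out one word; the removed word becomes the 'correct answer'."""
    tokens = sentence.split()
    i = random.randrange(len(tokens))                      # position to blank out
    question = " ".join(tokens[:i] + ["______"] + tokens[i + 1:])
    return question, tokens[i]

print(make_cloze("Apples grow on trees ."))
# e.g. ('Apples grow on ______ .', 'trees') -- the blanked position is random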
Deep Learning models are powerful, but are they ethical?

Biased input data → Learning → Biased model → Biased predictions

Natural Language Analysis - Outline
• Introduction to the module and outline.
• Introduction to NLP and its real-life applications.
• Text pre-processing.
• Language modelling and feature extraction.
• Extracting Information from Text.
• Neural Word Embeddings.
• Text classification and processing using CNN/LSTM/RNN.
• Attention and Sequence to Sequence Models.
• Transformers.
• Multi-task Learning.
• Ethics and Fairness.
Natural Language Analysis - Workshops
• Setting up the machines with the required libraries.
• Data preparation: text cleaning using NLTK, scikit-learn, etc.
• Develop Probabilistic Topic Modelling using LDA.
• Prepare Movie Review Data for Sentiment Analysis and develop a Neural Bag-of-Words Model for Sentiment Analysis.
• Train and Load Word Embeddings.
• Develop an Embedding and train a CNN Model for Sentiment Analysis.
• Develop a Neural Language Model for Text Generation.
• Text classification using RNN and LSTM.
• Working on the assignment.
Natural Language Analysis - Workshops
• Labs start from next week!
• Please choose your lab group today on Ultra
• Either:
  • Mondays 2-5pm, or
  • Thursdays 2-5pm
Natural Language Analysis - Main Libraries
nltk
Gensim
SpaCy
scikit-learn
PyTorch
TorchText
NumPy
SciPy
Questions?


Natural Language Analysis

Noura Al Moubayed and Donald Sturgeon


Text Processing and Basic Language Models
• What is text?
• Text tokenisation
• Token Normalisation
• Stopwords
• Stemming
• Lemmatisation
• From Words to Features
• Bag of words
• Term Frequency (TF), Inverse Document Frequency (IDF), and TF-IDF
• N-grams
What is text?

example = "This is an example"

print(example[0:3])   # "Thi"

print(example[3:9])   # "s is a"

• Programming languages usually treat text as sequences of characters


• Natural language processing: we care about meaning
• In natural languages, we often think of words as the smallest meaningful units
What is text?
Text is a sequence of characters, words, phrases, sentences, paragraphs…

he picked up the cake,


and the rake, and the gown,
and the milk, and the strings,
and the books, and the dish,
and the fan, and the cup,
and the ship, and the fish.
and he put them away.
then he said, 'that is that.'
and then he was gone
with a tip of his hat.
Why tokenize?
[Figure: a search query and its most relevant, 2nd most relevant, and 3rd most relevant results]


Why tokenize?


These are not the cats you are looking for!

• Even for simple NLP tasks, matching strings does not generalize well

• Instead, match words:
  "cat" = "cat"
  "cat" ≠ "scatter"
From strings to tokens

example = "This is an example it is"
tokens = example.split()

tokens                # ['This', 'is', 'an', 'example', 'it', 'is']
print(tokens[0])      # 'This'
print(tokens[1])      # 'is'

Why does this matter?

print(tokens[0] == tokens[1])   # False
print(tokens[1] == tokens[5])   # True
Text tokenisation
Issues in Tokenization

Finland's capital → Finland  Finlands  Finland's ?

what're, I'm, isn't → what are, I am, is not
Hewlett-Packard → Hewlett  Packard  HP ?
state-of-the-art → state of the art ?
Lowercase → lower-case  lowercase  lower case ?
San Francisco → one token or two?
m.p.h., PhD. → ??

print('Finland' == "Finland's")      # False
print('Lowercase' == 'lowercase')    # False
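In practice, tokenisers handle cases like these with hand-tuned, language-specific rules. A minimal sketch using NLTK's word_tokenize (the commented output is what recent NLTK versions typically produce; it may vary slightly between versions):

import nltk
nltk.download("punkt")                    # one-off download of the tokenizer models
from nltk.tokenize import word_tokenize

print(word_tokenize("Finland's capital isn't state-of-the-art."))
# typically: ['Finland', "'s", 'capital', 'is', "n't", 'state-of-the-art', '.']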
Text tokenisation
Uppercase vs lowercase

DURHAM = Durham = durham


I broke my [Microsoft] Windows ≠ [glass] windows
USA ≈ US ≠ us
Punctuation

USA = U.S.A.
Manufacturer serial no. ≠ yes or no.
Text tokenisation
Issues in language

French
– L'ensemble → one token or two?
  • L ? L' ? Le ?
  • Want l'ensemble to match with un ensemble

German noun compounds are not segmented:

– Lebensversicherungsgesellschaftsangestellter
– 'life insurance company employee'
– German information retrieval needs a compound splitter
Text tokenisation
Issues in language

Chinese and Japanese have no spaces between words:


– 莎拉波娃现在居住在美国东南部的佛罗里达。
– 莎拉波娃 / 现在 / 居住 / 在 / 美国 / 东南部 / 的 / 佛罗里达
– Sharapova now lives in US southeastern Florida

フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)

Katakana Hiragana Kanji Romaji


End-user can express the same query in different ways!
E.g. 情報不足 = じょうほうふそく
Text tokenisation
Word Tokenization in Chinese

Chinese words are composed of characters:
– Characters are generally 1 syllable and 1 morpheme.
– The average word is 2.4 characters long.

Standard baseline algorithm: Maximum Matching

Given a wordlist of Chinese and a string:
1. Start a pointer at the beginning of the string.
2. Find the longest word in the dictionary that matches the string starting at the pointer.
3. Move the pointer over the matched word in the string, and repeat from step 2 until the end of the string.
Maximum matching
• Compare the start of the input string with a list of all possible words*
• Choose the longest matching word, and output this as a token
• Move on to the next part of the string

• E.g. input = "thisismyexample" (a Python sketch of this procedure follows below)


1. tokens = [“this”] , remaining = “ismyexample”
2. tokens = [“this”, “is”] , remaining = “myexample”
3. tokens = [“this”, “is”, “my”] , remaining = “example”
4. tokens = [“this”, “is”, “my”, “example”]

* Do we have easy access to such a list?
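A minimal Python sketch of the maximum-matching procedure described above; the toy wordlist is illustrative (a real system needs a full dictionary):

def max_match(text, wordlist):
    """Greedy left-to-right maximum matching: always take the longest dictionary word."""
    tokens, pointer = [], 0
    while pointer < len(text):
        # try the longest candidate first; fall back to a single character
        for end in range(len(text), pointer, -1):
            if text[pointer:end] in wordlist or end == pointer + 1:
                tokens.append(text[pointer:end])
                pointer = end
                break
    return tokens

wordlist = {"this", "is", "my", "example", "the", "theta", "table", "bled", "own", "down", "there"}
print(max_match("thisismyexample", wordlist))     # ['this', 'is', 'my', 'example']
print(max_match("thetabledownthere", wordlist))   # ['theta', 'bled', 'own', 'there'] -- greedy goes wrong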


Maximum matching
Will this simple algorithm always give us the right answer?

Thetabledownthere
The table down there

Theta bled own there


Doesn’t work great in English:
– the longest word is not necessarily the most likely
Works reasonably well in Chinese
莎拉波娃现在居住在美国东南部的佛罗里达。
⇒ 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
Maximum matching
莎拉波娃现在居住在美国东南部的佛罗里达
莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
Sharapova now lives in US southeastern Florida

莎拉波娃 现在 居住 在美 国 东南部 的 佛罗里达


Sharapova now lives in-the-US country southeastern Florida

在美 华人
In-the-US Chinese person

Larger dictionary needed => but this also gives more false positive matches!
Bag of words models

● Model a document as an unordered collection of tokens
● Surprisingly good features for document classification, topic modelling, etc.
Stopwords and word clouds
Word cloud: arbitrarily arranged tokens, in font size proportional to (e.g.) their frequency in a document

● J.K. Rowling's Harry Potter (1997) or H.G. Wells' War of the Worlds (1897)?

● Stopwords: words intentionally excluded because we believe they don't tell us much about what's going on.

● E.g. grammatical particles: "a", "the", "of"
Token Normalisation
Stopwords

Stopword removal allows:
• Reducing irrelevance: restricts the analysis to meaningful words and reduces the noise stopwords can introduce to the meaning.
• Reducing feature dimension: significantly reduces the number of tokens extracted from documents.

Example text:
he picked up the cake,
and the rake, and the gown,
and the milk, and the strings,
and the books, and the dish,
and the fan, and the cup,
and the ship, and the fish.
and he put them away.
then he said, 'that is that.'
and then he was gone
with a tip of his hat.
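A minimal sketch of stopword removal using NLTK's built-in English stopword list (applied to one line of the example text):

import nltk
nltk.download("stopwords")                # one-off download of the stopword lists
from nltk.corpus import stopwords

stop = set(stopwords.words("english"))
tokens = "and the milk and the strings and the books and the dish".split()
print([t for t in tokens if t.lower() not in stop])
# ['milk', 'strings', 'books', 'dish']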
Word-Clouds for Document Classification

Left: speech of Fidel Castro to the UN, 1960
Right: The ecclesiastical architecture of Scotland, David MacGibbon & Thomas Ross, 1896

These are two U.S. presidential inauguration addresses. Which is Obama's, and which is Trump's?
Token Normalisation
Lemmatisation

Lemmatisation: how to find the correct dictionary headword form (the lemma). Reduce variant forms to the base form:

– am, are, is → be
– run, ran, running, runs → run
– car, cars, car's, cars' → car
– the boy's cars are different colours → the boy car be different colour
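A minimal sketch using NLTK's WordNet lemmatiser; note that it needs a part-of-speech hint to map verbs such as "are" to "be" (spaCy's token.lemma_ is a common alternative):

import nltk
nltk.download("wordnet")                       # one-off download of WordNet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars"))            # 'car'  (default part of speech is noun)
print(lemmatizer.lemmatize("running", "v"))    # 'run'
print(lemmatizer.lemmatize("are", "v"))        # 'be'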
Token Normalisation
Stemming

Reduce terms to their "stems", the core meaning-bearing units
• Not the same as the "lemma"!
• Stemming is language dependent
• Try http://9ol.es/porter_js_demo.html, an online stemming tool

Stemming algorithms are typically rule-based.
One approach: remove a suffix if the resulting word is in the dictionary.

For example, "compressed" and "compression" are both accepted as equivalent to "compress":
  "for example compressed and compression are both accepted as equivalent to compress"
→ "for exampl compress and compress ar both accept as equival to compress"
Token Normalisation
Stemming – Porter Stemmer

sses → ss        caresses → caress
ies → i          ponies → poni
s → ø            cats → cat
(*v*)ing → ø     walking → walk   (but sing → sing)
(*v*)ed → ø      plastered → plaster
(*v*)y → i       pony → poni
ational → ate    relational → relate
izer → ize       digitizer → digitize
ator → ate       operator → operate
al → ø           revival → reviv
able → ø         adjustable → adjust
ate → ø          activate → activ
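A minimal sketch running NLTK's implementation of the Porter stemmer over the earlier example sentence (exact output forms can differ slightly between Porter implementations):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
text = "for example compressed and compression are both accepted as equivalent to compress"
print(" ".join(stemmer.stem(w) for w in text.split()))
# close to the slide's version: "for exampl compress and compress ar both accept as equival to compress"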
Documents as vectors

Goal: compare contents of (potentially) large volumes of text efficiently
• E.g. newspaper articles
• Generally vary in length
• Could have similar contents despite very different lengths
• Intuition: words in common suggest (potentially) meaningful similarities
• "Bag of words": forget about word order, just model word counts

Simplest approach: use vocabulary V to create Term Frequency (TF) vectors
• Instead of one list per document, make one (same length) vector per doc
Term Frequency (TF) document vectors
S1 = "the power of example: anthropological explorations in persuasion, evocation."
S2 = "the power of example: anthropological explorations in persuasion, evocation, and imitation"
S3 = "an artificial example about an ant, a rat, and a powerful person"

Index  Term              D1  D2  D3
0      the                1   1   0
1      power              1   1   0
2      of                 1   1   0
3      example            1   1   1
4      :                  1   1   0
5      anthropological    1   1   0
6      explorations       1   1   0
7      in                 1   1   0
8      persuasion         1   1   0
9      ,                  1   2   2
10     evocation          1   1   0
11     .                  1   1   0
12     and                0   1   1
13     imitation          0   1   0
14     an                 0   0   2
15     artificial         0   0   1
16     about              0   0   1
17     ant                0   0   1
18     a                  0   0   2
19     rat                0   0   1
20     powerful           0   0   1
21     person             0   0   1

(D1, D2, D3 are the TF vectors for S1, S2, S3 respectively.)
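A minimal sketch of building these TF vectors in plain Python. The documents are given pre-tokenised (punctuation separated by spaces) so the counts match the table above; in practice a tokenizer would produce the token lists:

from collections import Counter

docs = {
    "S1": "the power of example : anthropological explorations in persuasion , evocation .",
    "S2": "the power of example : anthropological explorations in persuasion , evocation , and imitation",
    "S3": "an artificial example about an ant , a rat , and a powerful person",
}
tokens = {name: text.split() for name, text in docs.items()}

# Shared vocabulary V: every token type seen in any document, in order of first appearance
vocab = []
for toks in tokens.values():
    for t in toks:
        if t not in vocab:
            vocab.append(t)

# One TF vector per document, all of length |V| (22 here, matching the table)
tf = {name: [Counter(toks)[term] for term in vocab] for name, toks in tokens.items()}
print(len(vocab), tf["S3"][vocab.index("a")])   # 22 2

# Length-normalised TF (divide by document length), as used on the next slides:
tf_norm = {name: [c / len(tokens[name]) for c in vec] for name, vec in tf.items()}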
Documents as vectors (Obama's speech: 2,897 tokens; Trump's speech: 1,731 tokens; Fidel's speech: 21,198 tokens)

Problem 1: frequent words too generic (remove stopwords)
Problem 2: can't compare between documents of different lengths
Normalization by length (Obama's speech: 2,897 tokens; Trump's speech: 1,731 tokens; Fidel's speech: 21,198 tokens)

Normalize (i.e. divide) each TF value by the length of the document.
From Words to Features
Term Frequency


From Words to Features
Term Frequency - Binary


From Words to Features
Term Frequency – Raw count (Term Frequency)


From Words to Features
Term Frequency – log normalisation

A document with 10 occurrences of a term (a specific word) is more relevant than a document with 1 occurrence of that term, but arguably not 10 times more relevant.

An alternative is the log-frequency weight of term t in document d (see below).
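The weight formula itself did not survive extraction; the standard log-frequency weight used for this purpose (a hedged reconstruction, not taken from the slide image) is:

w_{t,d} =
\begin{cases}
1 + \log_{10}\big(\mathrm{tf}_{t,d}\big) & \text{if } \mathrm{tf}_{t,d} > 0,\\
0 & \text{otherwise.}
\end{cases}

With this weighting, 1 occurrence gives weight 1, 10 occurrences give 2, and 1000 give 4: more occurrences still count for more, but not proportionally more.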
From Words to Features
Term Frequency – Query Matching

Queries with more than one term:

Score for a document-query pair: sum over the terms t that appear in both q and d (see below).

The score is 0 if none of the query terms is present in the document.


Sec. 6.2.1
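The scoring formula is likewise not recoverable from the extracted text; assuming the log-frequency weight above, the usual form is:

\mathrm{score}(q, d) \;=\; \sum_{t \in q \cap d} w_{t,d} \;=\; \sum_{t \in q \cap d} \big(1 + \log_{10}\mathrm{tf}_{t,d}\big)

If q and d share no terms, the sum is empty and the score is 0, as stated above.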

From Words to Features


Document Frequency

Rare terms are more informative than frequent terms


– Recall stopwords
Consider a term in the query that is rare in the collection

A document containing rare term is very likely to be relevant to the
query of that rare term
→ We want a high weight for rare terms.
Inverse Document Frequency
● Rare words ('cryogenic', 'aardvark', 'logarithm', 'chiaroscuro') carry more information than common ones ('big', 'said', 'the', 'as', 'a')
  ○ Recall stopwords: essentially "don't count the very common words"
● So: give more weight to rare words:
  ○ N = |D| = total number of documents in collection D
  ○ Inverse Document Frequency (IDF) is defined as shown below
● "TF-IDF": use the Inverse Document Frequency to normalize the Term Frequency

[Chart: Inverse Document Frequency (IDF), roughly 0 to 1.4 on the y-axis, plotted against the number of documents a term occurs in (1 to 20 on the x-axis). Rare terms such as 'xenomorph', 'anthropological', and 'persuasion' receive high IDF; 'power' and 'example' sit in the middle; 'a', 'the', 'of' and punctuation marks receive the lowest IDF.]
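The IDF definition referred to above did not survive extraction; the standard definition (a reconstruction consistent with the chart, with df_t the number of documents in D that contain term t) and the resulting TF-IDF weight are:

\mathrm{idf}_t = \log_{10}\!\left(\frac{N}{\mathrm{df}_t}\right),
\qquad
\text{tf-idf}_{t,d} = w_{t,d} \times \mathrm{idf}_t

A term that occurs in every document gets idf = 0; a term that occurs in only one document gets the largest weight, log10(N).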
Cosine similarity
How similar are two documents, e.g. S1 and S2?
• Compare their vectors: cosine similarity for vectors A and B (see the formula and sketch below)

[Diagram: document vectors D1, D2, D3 with the angle θ between two of them]

⇒ Cosine similarity is a real number in [0, 1], and:
• 0 when A and B are orthogonal
  • Intuitively (for TF vectors): S1 and S2 have no terms in common at all
• 1 when A and B are scalar multiples of one another
  • Intuitively (for TF): all terms in S1 and S2 occur in identical proportions
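The similarity formula and the vector diagram did not survive extraction; the standard cosine similarity, together with a minimal NumPy sketch (the short vectors below are illustrative, not the full 22-term vectors from the TF table), is:

\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}

import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

d1 = np.array([1, 1, 1, 0])
d2 = np.array([2, 2, 2, 0])    # same terms as d1, in identical proportions
d3 = np.array([0, 0, 0, 5])    # no terms in common with d1

print(cosine_similarity(d1, d2))   # ~1.0 (scalar multiples of one another)
print(cosine_similarity(d1, d3))   # 0.0 (orthogonal vectors)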
Questions?
