Session 1
Introduction to NLA
Noura Al Moubayed and Donald Sturgeon
Module Tutors
Donald Sturgeon
[email protected]
Research interests: digital humanities, digital libraries, and
applications of NLP to literature and history
Noura Al Moubayed
[email protected]
Research interests: machine learning, natural language
processing, and optimisation for healthcare, social signal
processing, cyber-security, and Brain-Computer Interfaces
Why Study Natural Language Analysis?
Textual data is ambiguous.
Data is rapidly growing.
Natural Language Analysis
• Rule-based systems: hand-crafted rules → statistical models → prediction
• Classical Machine Learning: hand-crafted features → ML models → prediction
Language modelling
Being able to model how language works requires much more than simple rules!
Sometimes grammatical "rules" are enough to tell us which is the correct answer:
• Kick the ball ______ the opponent's goal.
  1. in  2. into  3. with  4. to
• Apples grow on ______.
  1. time  2. average  3. trees  4. rocks
But not always! There's nothing ungrammatical about saying that apples grow on rocks…
We can generate almost limitless "questions" like these from existing text!
• Blank out a word, and treat the word that was there as the correct answer (see the sketch below)
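As a rough sketch of this idea (the function below is illustrative, not the module's actual code):

import random

def make_cloze(sentence):
    # Blank out one random word; the removed word is the correct answer
    words = sentence.split()
    i = random.randrange(len(words))
    answer = words[i]
    question = " ".join(words[:i] + ["______"] + words[i + 1:])
    return question, answer

q, a = make_cloze("Apples grow on trees")
print(q, "->", a)  # e.g. "Apples grow on ______ -> trees"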
Deep Learning models are powerful, but are they ethical?
Biased input data → Learning → Biased model → Biased predictions
Natural Language Analysis - Outline
• Introduction to the module and outline.
• Introduction to NLP and its real-life applications.
• Text pre-processing.
• Language modelling and feature extraction.
• Extracting information from text.
• Neural word embeddings.
• Text classification and processing using CNN/LSTM/RNN.
• Attention and sequence-to-sequence models.
• Transformers.
• Multi-task learning.
• Ethics and fairness.
Natural Language Analysis - Workshops
• Setting up the machines with the required libraries.
• Data preparation: text cleaning using NLTK, scikit-learn, etc.
• Develop probabilistic topic modelling using LDA.
• Prepare movie review data for sentiment analysis, and develop a neural bag-of-words model for sentiment analysis.
• Train and load word embeddings.
• Develop an embedding and train a CNN model for sentiment analysis.
• Develop a neural language model for text generation.
• Text classification using RNN and LSTM.
• Working on the assignment.
Natural Language Analysis - Workshops
• Labs start from next week!
• Please choose your lab group today on Ultra
• Either:
• Mondays 2-5pm, or
• Thursdays 2-5pm
Natural Language Analysis - Main Libraries
nltk
Gensim
SpaCy
scikit-learn
PyTorch
TorchText
NumPy
SciPy
Questions?
Natural Language Analysis

example = "This is an example"  # assumed definition; any string beginning "This is a…" gives the output below
print(example[0:3])  # "Thi"
print(example[3:9])  # "s is a"
These are not the cats you are looking for!
• Even for simple NLP tasks, matching strings does not generalize well
tokens = example.split()
print(tokens[1])               # 'is'
print(tokens[0] == tokens[1])  # False
Text tokenisation
Issues in tokenisation
Text tokenisation
Uppercase vs lowercase
USA = U.S.A.
Manufacturer serial no. ≠ yes or no.
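A minimal sketch of naive case folding (the example text is our own): lowercasing lets USA match usa, though it would also conflate an abbreviation like no. with the word no.

text = "Compare USA with usa"
tokens = [t.lower() for t in text.split()]  # fold everything to lower case
print(tokens)  # ['compare', 'usa', 'with', 'usa']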
Text tokenisation
Issues in language
French:
– L'ensemble → one token or two?
  • L? L'? Le?
  • Want l'ensemble to match with un ensemble
Japanese: no spaces between words, and multiple scripts mixed together:
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
To a non-reader, unsegmented text looks like: Thetabledownthere → The table down there
Chinese also requires word segmentation:
在美 华人 → "in-the-US Chinese person"
Larger dictionary needed ⇒ but this also gives more false-positive matches!
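The labs use NLTK (see the library list above); as a brief sketch, its default English tokeniser handles punctuation and contractions better than a plain split(), though language-specific issues like French clitics still need dedicated handling:

import nltk
nltk.download("punkt")  # tokeniser models, needed once
from nltk.tokenize import word_tokenize

text = "Don't match the U.S.A. with usa."
print(text.split())
# ["Don't", 'match', 'the', 'U.S.A.', 'with', 'usa.']
print(word_tokenize(text))
# ['Do', "n't", 'match', 'the', 'U.S.A.', 'with', 'usa', '.']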
Bag of words models
● Model a document as an unordered collection of tokens
● Surprisingly good features for document classification, topic modelling, etc.
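A minimal bag-of-words sketch with scikit-learn (the toy documents are our own): word order is discarded and each document becomes a vector of token counts.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # each row: unordered token counts for one document
print(vectorizer.get_feature_names_out())
# ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
print(X.toarray())
# [[1 0 0 1 1 1 2]
#  [0 1 1 0 1 1 2]]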
Stopwords and word clouds
Word cloud: arbitrarily arranged tokens, in font size proportional to (e.g.) their frequency in a document
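A word cloud like those on the following slides can be generated with, for example, the third-party wordcloud package (the input file name here is hypothetical):

from wordcloud import WordCloud  # pip install wordcloud

text = open("speech.txt").read()    # hypothetical input document
cloud = WordCloud().generate(text)  # font size tracks token frequency
cloud.to_file("cloud.png")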
Token Normalisation
Stopwords
[Word clouds. Left: speech of Fidel Castro to the UN, 1960. Right: The Ecclesiastical Architecture of Scotland, David MacGibbon & Thomas Ross, 1896.]
[Word clouds: two U.S. presidential inauguration addresses. Which is Obama's, and which is Trump's?]
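Without stopword removal, word clouds of very different documents are dominated by the same function words. A minimal sketch of filtering with NLTK's English stopword list (the token list is our own):

import nltk
nltk.download("stopwords")  # stopword lists, needed once
from nltk.corpus import stopwords

stop = set(stopwords.words("english"))
tokens = ["these", "are", "two", "inauguration", "addresses"]
print([t for t in tokens if t not in stop])
# ['two', 'inauguration', 'addresses'] with NLTK's default English list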
Token Normalisation
Lemmatisation
– am, are, is → be
– run, ran, running, runs → run
– car, cars, car's, cars' → car
– the boy's cars are different colours → the boy car be different colour
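A brief sketch using NLTK's WordNet lemmatiser (note that the part-of-speech tag must be supplied for verbs; the default is noun):

import nltk
nltk.download("wordnet")  # lemma database, needed once
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars"))              # 'car' (default POS is noun)
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'
print(lemmatizer.lemmatize("are", pos="v"))      # 'be'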
Token Normalisation
Stemming
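Stemming chops affixes heuristically rather than looking lemmas up, so stems need not be real words. A minimal sketch with NLTK's Porter stemmer:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "runs", "ponies", "caresses"]:
    print(word, "->", stemmer.stem(word))
# running -> run, runs -> run, ponies -> poni, caresses -> caress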
Problem 1: frequent words are too generic (remove stopwords)
Problem 2: can't compare between documents of different lengths
Normalisation by length (Obama's speech: 2,897 tokens; Trump's speech: 1,731 tokens; Fidel's speech: 21,198 tokens)
Normalise (i.e. divide) each TF value by the length of the document
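A minimal sketch of length normalisation (the function name is our own): dividing each raw count by the document's token count makes, say, a 21,198-token speech comparable with a 1,731-token one.

from collections import Counter

def normalised_tf(tokens):
    # Term frequency divided by document length
    counts = Counter(tokens)
    n = len(tokens)
    return {term: count / n for term, count in counts.items()}

print(normalised_tf("the cat sat on the mat".split()))
# {'the': 0.333..., 'cat': 0.166..., 'sat': 0.166..., 'on': 0.166..., 'mat': 0.166...}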
From Words to Features
Term Frequency
Variants of term frequency for term t in document d:
• Binary: 1 if t occurs in d, else 0
• Raw count: tf(t,d), the number of times t occurs in d
• Log normalisation: an alternative is the log-frequency weight of term t in document d:
  w(t,d) = 1 + log10(tf(t,d)) if tf(t,d) > 0, else 0
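The three weighting schemes side by side, as a small sketch (the helper names are our own):

import math

def tf_binary(count):
    return 1 if count > 0 else 0

def tf_raw(count):
    return count

def tf_log(count):
    # log-frequency weight: 1 + log10(tf) when tf > 0, else 0
    return 1 + math.log10(count) if count > 0 else 0

for count in [0, 1, 10, 1000]:
    print(count, tf_binary(count), tf_raw(count), tf_log(count))
# 0: 0 0 0;  1: 1 1 1.0;  10: 1 10 2.0;  1000: 1 1000 4.0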
From Words to Features
Term Frequency – Query Matching
[Chart: number of documents a term occurs in (x-axis, 1-20); content terms such as "power" and "example" occur in few documents, while "a", "the", "of" and punctuation marks occur in many.]
Cosine similarity
How similar are two documents, e.g. D1 and D2?
• Compare their vectors.
Cosine similarity for vectors A and B:
cos(θ) = (A · B) / (|A| |B|) = Σᵢ AᵢBᵢ / (√Σᵢ Aᵢ² · √Σᵢ Bᵢ²)
[Diagram: document vectors D1, D2, D3 separated by angle θ]
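A minimal NumPy sketch of the formula above (the toy count vectors are our own):

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

d1 = np.array([1, 0, 1, 2])  # term counts for document 1
d2 = np.array([0, 1, 1, 2])  # term counts for document 2
print(cosine_similarity(d1, d2))  # 5/6 ≈ 0.833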