05 Deep Learning and Neural Nets
Neural Networks
Course: Artificial Intelligence
Fundamentals
Machine Learning Tasks

                   Supervised             Unsupervised
Discrete Data      Classification         Clustering
                   (predict a label)      (group similar items)
Continuous Data    Regression             Dimensionality Reduction
                   (predict a quantity)   (reduce n. of variables)
Section Agenda
• a.k.a. brains
[Image: brain network, from https://en.wikipedia.org/wiki/File:Brain_network.png]
Artificial Neural Networks
• Speech recognition
• Machine translation
• Playing (video)games
• … more
Simple Prediction Machine

Question -> Think -> Answer
Input -> Compute -> Output

Guess: miles = km * 0.5     100 km -> 50 miles    Wrong answer (correct: 62.137 miles)
Guess: miles = km * 0.6     100 km -> 60 miles    closer (correct: 62.137 miles)
Guess: miles = km * 0.7     100 km -> 70 miles    too far (correct: 62.137 miles)
Guess: miles = km * 0.61    100 km -> 61 miles    close enough (correct: 62.137 miles)
What happened?
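One way to read it: the parameter was nudged in proportion to the error, again and again. A minimal sketch of that loop, with a hand-picked learning rate (all values here are illustrative):

# Adjust-and-retry: nudge the multiplier c in proportion to the error.
KM = 100.0
TRUTH = 62.137            # the correct answer for 100 km
c = 0.5                   # initial guess
learning_rate = 0.00005   # small step, chosen so we do not overshoot

for step in range(20):
    prediction = KM * c
    error = TRUTH - prediction        # positive -> prediction too low
    c += learning_rate * error * KM   # nudge c towards the target
    print(f"step {step}: c = {c:.5f}, prediction = {KM * c:.3f} miles")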
Example: recognise birds

[Plot: eagles and sparrows scattered by height (x-axis) and weight (y-axis).
A separating line starts from a random init and is adjusted step by step
until it splits the two groups; a new point landing on the sparrows' side
is classified accordingly: it's a sparrow.]
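A minimal sketch of this adjust-on-error idea, in the spirit of the perceptron (the bird measurements below are invented, and assumed scaled to [0, 1]):

# Perceptron-style update: move the separating line after each mistake.
birds = [
    # (height, weight, label)  label: +1 = eagle, -1 = sparrow (toy data)
    (0.80, 0.90, +1),
    (0.75, 0.85, +1),
    (0.15, 0.10, -1),
    (0.12, 0.08, -1),
]

w = [0.0, 0.0]  # orientation of the line (here a zero init)
b = 0.0         # offset of the line

for epoch in range(10):
    for height, weight, label in birds:
        score = w[0] * height + w[1] * weight + b
        if label * score <= 0:        # point is on the wrong side
            w[0] += label * height    # nudge the line towards the point
            w[1] += label * weight
            b += label

# A new bird: which side of the line does it fall on?
h, wt = 0.14, 0.09
print("eagle" if w[0] * h + w[1] * wt + b > 0 else "it's a sparrow")

With separable data, each mistake moves the line a little and the loop eventually stops misclassifying.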
Neural Network Concepts
Network Basics
http://ml-cheatsheet.readthedocs.io/en/latest/nn_concepts.html
Neurone Example
http://ml-cheatsheet.readthedocs.io/en/latest/nn_concepts.html
Layers of the Network
http://ml-cheatsheet.readthedocs.io/en/latest/nn_concepts.html
Activation Function
http://ml-cheatsheet.readthedocs.io/en/latest/nn_concepts.html
ReLU Function
Properties of Activation Function (nice to have)
• Non-linear
• Continuously differentiable
• Monotonic
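For concreteness, two common activation functions (a quick sketch, using NumPy for convenience):

import numpy as np

def relu(x):
    # Rectified Linear Unit: non-linear and monotonic, but its derivative
    # jumps at 0, so it is not continuously differentiable there.
    return np.maximum(0.0, x)

def sigmoid(x):
    # Smooth, monotonic squashing function: ticks all three boxes above.
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))     # [0.  0.  0.  0.5 2. ]
print(sigmoid(x))  # values squashed into (0, 1)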
Objective Function
• Loss function (minimise the error)
http://karpathy.github.io/
NN in the Wild: Word Embeddings
Word Embeddings = Word Vectors = Distributed Representations
Why should you care?

Data representation is crucial
Applications

• Classification
• Recommender Systems
• Search Engines
• Machine Translation
One-hot Encoding

Rome   = [1, 0, 0, 0, 0, 0, …, 0]
Paris  = [0, 1, 0, 0, 0, 0, …, 0]
Italy  = [0, 0, 1, 0, 0, 0, …, 0]
France = [0, 0, 0, 1, 0, 0, …, 0]

Each word is a vector with V components
V = vocabulary size (huge)
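A quick sketch of one-hot encoding over a toy vocabulary, which also shows its weakness: every pair of distinct words is equally (dis)similar:

import numpy as np

vocabulary = ['Rome', 'Paris', 'Italy', 'France']  # toy vocabulary, V = 4
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    # A vector of V components: all zeros except a 1 at the word's index.
    v = np.zeros(len(vocabulary))
    v[index[word]] = 1.0
    return v

rome, paris = one_hot('Rome'), one_hot('Paris')
print(rome)                 # [1. 0. 0. 0.]
print(np.dot(rome, paris))  # 0.0: any two distinct words are orthogonal,
                            # so one-hot vectors carry no notion of similarity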
Bag-of-words

[Plot: Rome, Paris, Italy, France as scattered points with no meaningful
geometry between them.]
Word Embeddings

[Plot: the same words in an embedding space, where the "is-capital-of"
direction is consistent: starting from Paris and combining it with Italy
lands near Rome.]
Intermezzo (Gradient Descent)

[Plot: a curve F(x), the objective function to minimise; find the optimal x.
Start from a random init, compute the derivative, update x downhill, then
derivative and update again and again, until convergence.]

• Optimisation algorithm
• Purpose: find the min (or max) for F
• Batch-oriented (use all data points)
• Stochastic GD: update after each sample
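A minimal sketch of the loop above on a toy objective (the function and the learning rate are illustrative choices):

# Gradient descent on a toy objective F(x) = (x - 3)^2, minimum at x = 3.
def dF(x):
    return 2 * (x - 3)   # derivative of F

x = -5.0                 # random init
learning_rate = 0.1

for step in range(100):
    x = x - learning_rate * dF(x)   # update: move against the derivative
    if abs(dF(x)) < 1e-6:           # until convergence
        break

print(x)  # ~3.0, the optimal x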
Objective Function

Maximise, for each input word, the probability of the words observed around it:

P(i | pizza)
P(enjoyed | pizza)
…
P(restaurant | pizza)
Example

I enjoyed eating some pizza at the restaurant

For words that co-occur, bump the conditional probability of one given the other:

bump P( i | pizza )
bump P( at | pizza )
bump P( i | at )
bump P( enjoyed | at )
P( eating | pizza )
   output word | input word

P( eating | pizza )
= P( vec(eating) | vec(pizza) )
= P( vout | vin )

P(vout | vin) = exp(cosine(vout, vin)) / Σ_{k ∈ V} exp(cosine(vk, vin))
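A sketch of that softmax over cosine similarities (toy vectors; in practice word2vec scores with dot products and uses tricks like negative sampling to avoid summing over the whole vocabulary):

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy embedding vectors, one per vocabulary word (values are made up).
vectors = {
    'eating':     np.array([0.9, 0.1, 0.3]),
    'pizza':      np.array([0.8, 0.2, 0.4]),
    'restaurant': np.array([0.5, 0.5, 0.1]),
    'enjoyed':    np.array([0.2, 0.9, 0.2]),
}

def p_out_given_in(word_out, word_in):
    # P(vout | vin) = exp(cos(vout, vin)) / sum over k in V of exp(cos(vk, vin))
    v_in = vectors[word_in]
    scores = {w: np.exp(cosine(v, v_in)) for w, v in vectors.items()}
    return scores[word_out] / sum(scores.values())

print(p_out_given_in('eating', 'pizza'))  # probabilities over V sum to 1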
Vector Calculation Recap

Learn vec(word)
by gradient descent
on the softmax probability
Plot Twist

Paragraph Vector
a.k.a. doc2vec
i.e. P(vout | vin, label)
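A brief sketch with gensim's Doc2Vec (gensim is introduced just below; parameter names follow gensim 4, and the toy documents and tags are invented):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each document gets a tag (label) that is learned alongside the word
# vectors, i.e. P(vout | vin, label).
docs = [
    TaggedDocument(words=['i', 'enjoyed', 'eating', 'some', 'pizza'], tags=['doc_0']),
    TaggedDocument(words=['the', 'pizza', 'at', 'the', 'restaurant'], tags=['doc_1']),
]

model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)
print(model.dv['doc_0'])  # the learned paragraph vector for doc_0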
word2vec in practice

pip install gensim
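A minimal training sketch with gensim's Word2Vec (parameter names follow gensim 4; the toy corpus is invented and far too small to learn anything useful):

from gensim.models import Word2Vec

# A corpus is a list of tokenised sentences (toy data for illustration).
sentences = [
    ['i', 'enjoyed', 'eating', 'some', 'pizza', 'at', 'the', 'restaurant'],
    ['the', 'chef', 'cooked', 'some', 'great', 'pizza'],
]

model = Word2Vec(
    sentences,
    vector_size=100,  # dimensionality of vec(word)
    window=5,         # context window size
    min_count=1,      # keep every word, even rare ones
)

print(model.wv['pizza'])               # the learned vector for 'pizza'
print(model.wv.most_similar('pizza'))  # nearest words by cosine similarity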
Case Study 1: Skills and CVs

model.most_similar('chef')

[('cook', 0.94),
 ('bartender', 0.91),
 ('waitress', 0.89),
 ('restaurant', 0.76),
 ...]
Case Study 1: Skills and CVs

model.most_similar('chef',
                   negative=['food'])

[('puppet', 0.93),
 ('devops', 0.92),
 ('ansible', 0.79),
 ('salt', 0.77),
 ...]

(take the food out of "chef" and you get Chef the configuration-management
tool, in the company of Puppet, Ansible and Salt)
Case Study 1: Skills and CVs

Useful for:
• Data exploration
• Query expansion/suggestion
• Recommendations
Case Study 2: Evil AI

from gensim.models.keyedvectors import KeyedVectors

fname = 'GoogleNews-vectors.bin'
model = KeyedVectors.load_word2vec_format(
    fname,
    binary=True
)

Pre-trained on Google News
Distributed by Google
Case Study 2: Evil AI

model.most_similar(
    positive=['king', 'woman'],
    negative=['man']
)

[('queen', 0.7118),
 ('monarch', 0.6189),
 ('princess', 0.5902),
 ('crown_prince', 0.5499),
 ('prince', 0.5377),
 …]
Case Study 2: Evil AI

model.most_similar(
    positive=['Paris', 'Italy'],
    negative=['France']
)

[('Milan', 0.7222),
 ('Rome', 0.7028),
 ('Palermo_Sicily', 0.5967),
 ('Italian', 0.5911),
 ('Tuscany', 0.5632),
 …]
Case Study 2: Evil AI

model.most_similar(
    positive=['professor', 'woman'],
    negative=['man']
)

[('associate_professor', 0.7771),
 ('assistant_professor', 0.7558),
 ('professor_emeritus', 0.7066),
 ('lecturer', 0.6982),
 ('sociology_professor', 0.6539),
 …]
Case Study 2: Evil AI

model.most_similar(
    positive=['professor', 'man'],
    negative=['woman']
)

[('professor_emeritus', 0.7433),
 ('emeritus_professor', 0.7109),
 ('associate_professor', 0.6817),
 ('Professor', 0.6495),
 ('assistant_professor', 0.6484),
 …]
Case Study 2: Evil AI

model.most_similar(
    positive=['computer_programmer', 'woman'],
    negative=['man']
)

[('homemaker', 0.5627),
 ('housewife', 0.5105),
 ('graphic_designer', 0.5051),
 ('schoolteacher', 0.4979),
 ('businesswoman', 0.4934),
 …]
Case Study 2: Evil AI

• Culture is biased
• Language is biased
• Algorithms are not?
• “Garbage in, garbage out”
Final Remarks on Word Embeddings

But we’ve been doing this for X years
• Simple definition:
a network with many hidden layers
• Previously:
— not enough computational power
— not enough data
— not enough understanding
• Recently:
— all the above available
— big improvements in many different tasks
Deep Learning Tools
• scikit-learn implements some NN algorithms
(see Perceptron demo, and the sketch below)
• … Keras!
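A quick sketch of scikit-learn's Perceptron on toy data (the points are made up; this is not the course's demo code):

from sklearn.linear_model import Perceptron

# Toy 2-D points (height, weight) with two classes, as in the bird example.
X = [[0.80, 0.90], [0.75, 0.85], [0.15, 0.10], [0.12, 0.08]]
y = [1, 1, 0, 0]

clf = Perceptron()
clf.fit(X, y)                        # learns a separating line
print(clf.predict([[0.14, 0.09]]))  # -> [0], the sparrow side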
A Few Words on Keras
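As a teaser, a minimal Keras model (a sketch using the TensorFlow-era Keras API; layer sizes and input shape are arbitrary, not the course's demo):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# A small feed-forward network: two hidden layers with ReLU, sigmoid output.
model = Sequential([
    Dense(64, activation='relu', input_shape=(20,)),  # hidden layer
    Dense(64, activation='relu'),                     # another hidden layer
    Dense(1, activation='sigmoid'),                   # binary classification
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.summary()  # inspect the layers and parameter counts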