
Deep Learning and

Neural Networks
Course: Artificial Intelligence
Fundamentals

Instructor: Marco Bonzanini


Machine Learning Tasks

Discrete data:
  Supervised: Classification (predict a label)
  Unsupervised: Clustering (group similar items)

Continuous data:
  Supervised: Regression (predict a quantity)
  Unsupervised: Dimensionality Reduction (reduce the number of variables)
Section Agenda

• Introduction to Artificial Neural Networks (ANN)

• Neural Network Concepts

• Neural Networks in the Wild


Introduction to
ANN
Biological Neural Nets

• Series of interconnected neurones (nerve cells)

• Electrical or chemical signals are passed through synapses

• a.k.a. brains

https://en.wikipedia.org/wiki/File:Brain_network.png
Artificial Neural Networks

• Vaguely inspired by biological neural networks

• A big family of ML approaches, used to tackle a variety of tasks

• Original idea: solving problems “like the human brain”; nowadays the focus is on specific tasks
NN Applications
• Computer vision

• Speech recognition

• Machine translation

• Social network analysis

• Playing (video)games

• … more
Simple Prediction Machine

Question Answer
Think

A gentle introduction to neural networks. Tariq Rashid. PyData London 2016


Simple Prediction Machine

Input Output
Compute

A gentle introduction to neural networks. Tariq Rashid. PyData London 2016


Example: Learn Km to Miles

100 km 50 miles
miles = km * 0.5

We’re trying a linear model

62.137 miles
(correct)

Error: 12.137 (not great)

A gentle introduction to neural networks. Tariq Rashid. PyData London 2016


Example: Learn Km to Miles
Update

100 km 60 miles
miles = km * 0.6

62.137 miles
(correct)

Error: 2.137 (better)

A gentle introduction to neural networks. Tariq Rashid. PyData London 2016


Example: Learn Km to Miles
Try again

100 km 70 miles
miles = km * 0.7

62.137 miles
(correct)

Error: -7.863 (worse!)

A gentle introduction to neural networks. Tariq Rashid. PyData London 2016


Example: Learn Km to Miles
Try again

100 km 61 miles
miles = km * 0.61

62.137 miles
(correct)

Error: 1.137 (best so far)

A gentle introduction to neural networks. Tariq Rashid. PyData London 2016


Example: Learn Km to Miles

What happened?

• A model with adjustable parameters

• Know the answer but not the parameters? Use the error to adjust the parameters (see the sketch below)

A gentle introduction to neural networks. Tariq Rashid. PyData London 2016
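A minimal Python sketch of that loop (not from the original slides; the learning rate value is an arbitrary choice): nudge the parameter in proportion to the error and repeat.

# Learn miles = km * w from one known example,
# nudging w in proportion to the error (illustrative only).
km, true_miles = 100.0, 62.137   # known answer
w = 0.5                          # initial guess for the parameter
learning_rate = 0.00005          # arbitrary small step size

for step in range(10):
    predicted = km * w
    error = true_miles - predicted      # positive: guess is too low
    w += learning_rate * error * km     # adjust w using the error
    print(f"step {step}: w={w:.4f}, error={error:.3f}")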


Example: recognise birds
Height

Eagles

Sparrows

Weight
Example: recognise birds
Height

Separating line
(random init)

Weight
Example: recognise birds
Height

Adjust the line


(too much!)

Weight
Example: recognise birds
Height

Adjust the line


(good)

Weight
Example: recognise birds
Height

New (unseen) bird

Weight
Example: recognise birds
Height

It’s a sparrow

Weight
Neural Network
Concepts
Network Basics

• Neurone: a mathematical function; also, a node in the network

• Synapse: the input-output connection between nodes; also, an edge in the network

• Nodes and edges are typically associated with weights, to adjust the learning
Network Example

http://ml-cheatsheet.readthedocs.io/en/latest/nn_concepts.html
Neurone Example

http://ml-cheatsheet.readthedocs.io/en/latest/nn_concepts.html
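A minimal sketch of what a single neurone computes (the general idea, not the exact numbers from the linked cheatsheet): a weighted sum of the inputs plus a bias, passed through an activation function.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neurone(inputs, weights, bias):
    # Weighted sum of the inputs, then the activation function.
    z = sum(i * w for i, w in zip(inputs, weights)) + bias
    return sigmoid(z)

# Example with made-up numbers.
print(neurone(inputs=[0.5, 0.9], weights=[0.3, -0.2], bias=0.1))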
Layers of the Network

http://ml-cheatsheet.readthedocs.io/en/latest/nn_concepts.html
Activation Function

• a.k.a. Transfer function

• Defines the output of a node given its inputs

• It’s what gives the Neural Net its power

• Popular examples: ReLU, Sigmoid, Softmax, …


Sigmoid Function

http://ml-cheatsheet.readthedocs.io/en/latest/nn_concepts.html
ReLU Function
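For reference, plain-Python versions of two of the activation functions named above (standard textbook definitions; sigmoid is sketched in the neurone example earlier):

import math

def relu(x):
    # Zero for negative inputs, identity otherwise.
    return max(0.0, x)

def softmax(values):
    # Turns a list of scores into probabilities that sum to 1.
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

print([relu(x) for x in (-2.0, 0.0, 2.0)])   # [0.0, 0.0, 2.0]
print(softmax([1.0, 2.0, 3.0]))              # ~[0.09, 0.24, 0.67]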
Properties of an Activation Function (nice to have)

• Non-linear

• Continuously differentiable

• Fixed output range

• Monotonic
Objective Function

• Loss function (minimise the error)

• Gain function (maximise some quantity)

• Tells us “how good” the model is at making predictions for a given set of parameters

• (the purpose of “learning” is to find the optimal parameters)
Gradient Descent

• Optimisation algorithm, used to minimise some loss function

• Iteratively move towards the steepest descent

• Descent (loss) vs ascent (gain)

• Learning rate: the size of the steps
  e.g. a high learning rate covers more ground but could overshoot; a low one is precise but time consuming
Back Propagation

• Purpose: to adjust each weight in the network in proportion to how much it contributes to the overall error

• a.k.a. backward propagation of errors

• Starting with the final loss value (i.e. the error), it works backwards to compute the contribution that each parameter made to the loss value (see the sketch below)
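A minimal numeric sketch of that idea for a single weight (illustrative only; assumes a one-neurone model with squared error): the chain rule gives the weight's contribution to the loss, and the weight is moved in the opposite direction.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# One neurone: prediction = sigmoid(w * x), loss = (prediction - target)^2
x, target = 1.5, 1.0
w = 0.2
learning_rate = 0.5

for step in range(5):
    pred = sigmoid(w * x)
    loss = (pred - target) ** 2

    # Chain rule: dloss/dw = dloss/dpred * dpred/dz * dz/dw
    dloss_dpred = 2 * (pred - target)
    dpred_dz = pred * (1 - pred)        # derivative of the sigmoid
    dz_dw = x
    grad = dloss_dpred * dpred_dz * dz_dw

    w -= learning_rate * grad           # move against the gradient
    print(f"step {step}: loss={loss:.4f}, w={w:.4f}")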
Types of Neural Network

• Feed-Forward NN: outputs only move forward

• Convolutional NN: a type of FFNN

• Recurrent NN: outputs of a layer can come back as input of the same layer (variable size input)
Example of RNNs

http://karpathy.github.io/
NN in the Wild:
Word Embeddings
Word Embeddings
=
Word Vectors
=
Distributed
Representations
Why should you care?

Data representation is crucial
Applications

Classification

Recommender Systems

Search Engines

Machine Translation
One-hot Encoding

V = vocabulary size (huge)

Rome   = [1, 0, 0, 0, 0, 0, …, 0]
Paris  = [0, 1, 0, 0, 0, 0, …, 0]
Italy  = [0, 0, 1, 0, 0, 0, …, 0]
France = [0, 0, 0, 1, 0, 0, …, 0]
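A minimal sketch in plain Python (the four-word vocabulary is a tiny stand-in; a real V would be huge, which is exactly the problem):

vocabulary = ['Rome', 'Paris', 'Italy', 'France']   # tiny stand-in vocabulary
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    # A vector of length V with a single 1 at the word's position.
    vec = [0] * len(vocabulary)
    vec[index[word]] = 1
    return vec

print(one_hot('Rome'))    # [1, 0, 0, 0]
print(one_hot('France'))  # [0, 0, 0, 1]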
Bag-of-words

Each document is a vector of word counts (columns: Rome, Paris, …, word V)

doc_1 = [32, 14, 1,  0, …,  6]
doc_2 = [ 2, 12, 0, 28, …, 12]
…
doc_N = [13,  0, 6,  2, …,  0]
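A hedged sketch of the same idea with scikit-learn's CountVectorizer (the two documents are made up; get_feature_names_out assumes scikit-learn ≥ 1.0): each row is a document, each column a vocabulary word, each cell a count.

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "rome is the capital of italy",      # made-up doc_1
    "paris is the capital of france",    # made-up doc_2
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)      # sparse document-term matrix

print(vectorizer.get_feature_names_out())    # the vocabulary (columns)
print(counts.toarray())                      # word counts per document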
Word Embeddings

n. dimensions << vocabulary size

Rome   = [0.91, 0.83, 0.17, …, 0.41]
Paris  = [0.92, 0.82, 0.17, …, 0.98]
Italy  = [0.32, 0.77, 0.67, …, 0.42]
France = [0.33, 0.78, 0.66, …, 0.97]

Word Embeddings

[plot: Rome, Paris, Italy and France in the embedding space, with the “is-capital-of” relation appearing as a consistent vector offset]
Word Embeddings

Paris + Italy - France ≈ Rome
From Language
to Vectors?
Distributional Hypothesis
“You shall know a word
by the company it keeps.”
–J.R. Firth 1957
“Words that occur in similar context
tend to have similar meaning.”
–Z. Harris 1954
Context ≈ Meaning

I enjoyed eating some pizza at the restaurant

Word: pizza
The company it keeps: I enjoyed eating some … at the restaurant

Compare:

I enjoyed eating some pizza at the restaurant
I enjoyed eating some Welsh cake at the restaurant

Same context: pizza ≈ Welsh cake?
A Bit of Theory
with word2vec
word2vec Architecture

Mikolov et al. (2013) Efficient Estimation of Word Representations in Vector Space


Vector Calculation

Goal: learn vec(word)
1. Choose objective function
2. Init: random vectors
3. Run stochastic gradient descent
Intermezzo (Gradient Descent)

[plot: an objective function F(x) to minimise, over x]

Find the optimal “x”: random init, then repeatedly take the derivative and update x, again and again, until convergence
Intermezzo (Gradient Descent)

• Optimisation algorithm
• Purpose: find the min (or max) for F
• Batch-oriented (use all data points)
• Stochastic GD: update after each sample
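A minimal sketch of that loop in Python for a 1-D objective F(x) = (x - 3)^2 (an arbitrary convex example, not from the slides): random init, then repeated derivative-and-update steps until the updates become negligible.

import random

def F(x):
    return (x - 3) ** 2          # objective to minimise (minimum at x = 3)

def dF(x):
    return 2 * (x - 3)           # its derivative

x = random.uniform(-10, 10)      # random init
learning_rate = 0.1

for step in range(100):
    update = learning_rate * dF(x)
    x -= update                  # move against the gradient
    if abs(update) < 1e-6:       # until convergence
        break

print(f"converged near x = {x:.4f} after {step + 1} steps")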
Objective Function

I enjoyed eating some pizza at the restaurant

Maximise the likelihood of the context given the focus word:

P(i | pizza)
P(enjoyed | pizza)
…
P(restaurant | pizza)
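For reference, the quantity being maximised can be written out; this is a sketch of the standard skip-gram objective from Mikolov et al. (2013), with window size c over a corpus of T words (the slides state it in words rather than symbols):

J = (1/T) Σ_t Σ_{ -c ≤ j ≤ c, j ≠ 0 } log P( w_{t+j} | w_t )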
Example

I enjoyed eating some pizza at the restaurant

Iterate over the context words:

bump P( i | pizza )
bump P( enjoyed | pizza )
bump P( eating | pizza )
bump P( some | pizza )
bump P( at | pizza )
bump P( the | pizza )
bump P( restaurant | pizza )

Move to the next focus word and repeat:

bump P( i | at )
bump P( enjoyed | at )
… you get the picture
P( eating | pizza ) ??

eating = output word, pizza = input word

P( eating | pizza )
= P( vec(eating) | vec(pizza) )
= P( vout | vin ) ???

cosine( vout, vin ) is in [-1, 1]
softmax(cosine( vout, vin )) is in [0, 1]
softmax(cosine( vout, vin )):

P( vout | vin ) = exp(cosine(vout, vin)) / Σ_{k ∈ V} exp(cosine(vk, vin))
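A small numpy sketch of that formula as written on the slide (illustrative only; production word2vec code uses dot products and tricks such as negative sampling rather than this literal computation):

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def p_out_given_in(v_out, v_in, all_vectors):
    # Softmax over the cosine similarities with every vector in the vocabulary.
    similarities = np.array([cosine(v_k, v_in) for v_k in all_vectors])
    return np.exp(cosine(v_out, v_in)) / np.exp(similarities).sum()

# Toy vectors, for illustration only.
rng = np.random.default_rng(42)
vocab_vectors = [rng.normal(size=5) for _ in range(10)]
print(p_out_given_in(vocab_vectors[2], vocab_vectors[0], vocab_vectors))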
Vector Calculation Recap

Learn vec(word)
by gradient descent
on the softmax probability
Plot Twist
Paragraph Vector
a.k.a.
doc2vec
i.e.
P(vout | vin, label)
word2vec in
practice
pip install gensim
Case Study 1: Skills and CVs

from gensim.models import Word2Vec

fname = 'candidates.jsonl'
corpus = CorpusReader(fname)   # custom corpus reader
model = Word2Vec(corpus)
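CorpusReader is the author's own helper and is not shown; below is a hedged sketch of what such a reader might look like, assuming (hypothetically) that each line of candidates.jsonl is a JSON object with a 'text' field. gensim's Word2Vec only needs a restartable iterable that yields lists of tokens.

import json

class CorpusReader:
    """Hypothetical reader: yields one list of tokens per candidate.

    Assumes each line of the .jsonl file is a JSON object with a 'text'
    field; the real field names in candidates.jsonl are not shown here.
    """
    def __init__(self, fname):
        self.fname = fname

    def __iter__(self):
        with open(self.fname) as f:
            for line in f:
                doc = json.loads(line)
                yield doc['text'].lower().split()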
Case Study 1: Skills and CVs

model.most_similar('chef')

[('cook', 0.94),
('bartender', 0.91),
('waitress', 0.89),
('restaurant', 0.76),
...]
Case Study 1: Skills and CVs

model.most_similar('chef',
negative=['food'])

[('puppet', 0.93),
('devops', 0.92),
('ansible', 0.79),
('salt', 0.77),
...]
Case Study 1: Skills and CVs

Useful for:
Data exploration
Query expansion/suggestion
Recommendations
Case Study 2: Evil AI

from gensim.models.keyedvectors import KeyedVectors

# pre-trained on Google News, distributed by Google
fname = 'GoogleNews-vectors.bin'

model = KeyedVectors.load_word2vec_format(
    fname,
    binary=True
)
Case Study 2: Evil AI

model.most_similar(
    positive=['king', 'woman'],
    negative=['man']
)

[('queen', 0.7118),
('monarch', 0.6189),
('princess', 0.5902),
('crown_prince', 0.5499),
('prince', 0.5377),
…]
Case Study 2: Evil AI

model.most_similar(
    positive=['Paris', 'Italy'],
    negative=['France']
)

[('Milan', 0.7222),
('Rome', 0.7028),
('Palermo_Sicily', 0.5967),
('Italian', 0.5911),
('Tuscany', 0.5632),
…]
Case Study 2: Evil AI

model.most_similar(
    positive=['professor', 'woman'],
    negative=['man']
)

[('associate_professor', 0.7771),
('assistant_professor', 0.7558),
('professor_emeritus', 0.7066),
('lecturer', 0.6982),
('sociology_professor', 0.6539),
…]
Case Study 2: Evil AI

model.most_similar(
    positive=['professor', 'man'],
    negative=['woman']
)

[('professor_emeritus', 0.7433),
('emeritus_professor', 0.7109),
('associate_professor', 0.6817),
('Professor', 0.6495),
('assistant_professor', 0.6484),
…]
Case Study 2: Evil AI

model.most_similar(
    positive=['computer_programmer', 'woman'],
    negative=['man']
)

[('homemaker', 0.5627),
('housewife', 0.5105),
('graphic_designer', 0.5051),
('schoolteacher', 0.4979),
('businesswoman', 0.4934),
…]
Case Study 2: Evil AI

• Culture is biased
• Language is biased
• Algorithms are not?
• “Garbage in, garbage out”
Final Remarks on
Word Embeddings
But we’ve been doing this for X years

• Approaches based on co-occurrences are not new
• Think SVD / LSA / LDA
• … but they are usually outperformed by word2vec
• … and don’t scale as well as word2vec
Efficiency

• There is no co-occurrence matrix (vectors are learned directly)
• Softmax has complexity O(V); Hierarchical Softmax only O(log(V))
Garbage in, garbage out
• Pre-trained vectors are useful
• … until they’re not
• The business domain is important
• The pre-processing steps are important
• > 100K words? Maybe train your own model
• > 1M words? Yep, train your own model
Word Embeddings Summary

• Word Embeddings are magic!


• Big victory of unsupervised learning
• Gensim makes your life easy
word2vec Credits & Readings
Credits
• Lev Konstantinovskiy (@gensim_py)
• Chris E. Moody (@chrisemoody) see videos on lda2vec
Readings
• Deep Learning for NLP (R. Socher) http://cs224d.stanford.edu/
• “word2vec parameter learning explained” by Xin Rong
More readings
• “GloVe: global vectors for word representation” by Pennington et al.
• “Dependency based word embeddings” and “Neural word embeddings
as implicit matrix factorization” by O. Levy and Y. Goldberg
Deep Learning
Deep Networks

• Simple definition: a network with many hidden layers

• How many? Well…


What is the buzz about?

• Previously:
— not enough computational power
— not enough data
— not enough understanding

• Recently:
— all the above available
— big improvements in many different tasks
Deep Learning Tools

• scikit-learn implements some NN algorithms (see Perceptron demo)

• It doesn’t support much of deep learning

• More recently: specialised frameworks, e.g. TensorFlow, Theano, PyTorch

• … Keras!
A Few Words on Keras

• High-level, user friendly

• Initial focus on quick experimentation

• Now production ready, supports GPU etc.

• General purpose (design your network), extensible, modular, …
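A hedged sketch of the Keras API in action (a made-up binary classifier over 20 input features, just to show the "design your network" style; not from the original slides):

from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical binary classifier over 20 input features.
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(20,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.summary()
# model.fit(X_train, y_train, epochs=10)  # with your own data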
Questions?
