
Deep Learning and

Neural Networks
Course: Artificial Intelligence
Fundamentals

Instructor: Marco Bonzanini


Machine Learning Tasks

Discrete data:
  Supervised: Classification (predict a label)
  Unsupervised: Clustering (group similar items)

Continuous data:
  Supervised: Regression (predict a quantity)
  Unsupervised: Dimensionality Reduction (reduce the number of variables)
Section Agenda

• Introduction to Artificial Neural Networks (ANN)

• Neural Network Concepts

• Neural Networks in the Wild


Introduction to
ANN
Biological Neural Nets

• Series of interconnected neurones (nerve cells)

• Electrical or chemical signals are passed through synapses

• a.k.a. brains

https://en.wikipedia.org/wiki/File:Brain_network.png
Artificial Neural Networks

• Vaguely inspired by biological neural networks

• A big family of ML approaches, used to tackle a variety of tasks

• Original idea: solving problems “like the human brain”; nowadays the focus is on specific tasks
NN Applications
• Computer vision

• Speech recognition

• Machine translation

• Social network analysis

• Playing (video)games

• … more
Simple Prediction Machine

Question Answer
Think

A gentle introduction to neural networks. Tariq Rashid. PyData London 2016


Simple Prediction Machine

Input Output
Compute

A gentle introduction to neural networks. Tariq Rashid. PyData London 2016


Example: Learn Km to Miles

100 km 50 miles
miles = km * 0.5

We’re trying a linear model

62.137 miles
(correct)

Error: 12.137 (not great)

A gentle introduction to neural networks. Tariq Rashid. PyData London 2016


Example: Learn Km to Miles
Update

100 km 60 miles
miles = km * 0.6

62.137 miles
(correct)

Error: 2.137 (better)

A gentle introduction to neural networks. Tariq Rashid. PyData London 2016


Example: Learn Km to Miles
Try again

100 km 70 miles
miles = km * 0.7

62.137 miles
(correct)

Error: -7.863 (worse!)

A gentle introduction to neural networks. Tariq Rashid. PyData London 2016


Example: Learn Km to Miles
Try again

100 km 61 miles
miles = km * 0.61

62.137 miles
(correct)

Error: 1.137 (best so far)

A gentle introduction to neural networks. Tariq Rashid. PyData London 2016


Example: Learn Km to Miles

What happened?

• A model with adjustable parameters

• Know the answer but not the parameters? Use the error to adjust the parameters (see the sketch below)

A gentle introduction to neural networks. Tariq Rashid. PyData London 2016
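A minimal Python sketch of that loop (not from the original slides; the learning rate value is an arbitrary choice): nudge the parameter in proportion to the error and repeat.

# Learn miles = km * w from one known example,
# nudging w in proportion to the error (illustrative only).
km, true_miles = 100.0, 62.137   # known answer
w = 0.5                          # initial guess for the parameter
learning_rate = 0.00005          # arbitrary small step size

for step in range(10):
    predicted = km * w
    error = true_miles - predicted      # positive: guess is too low
    w += learning_rate * error * km     # adjust w using the error
    print(f"step {step}: w={w:.4f}, error={error:.3f}")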


Example: recognise birds
Height

Eagles

Sparrows

Weight
Example: recognise birds
Height

Separating line
(random init)

Weight
Example: recognise birds
Height

Adjust the line


(too much!)

Weight
Example: recognise birds
Height

Adjust the line


(good)

Weight
Example: recognise birds
Height

New (unseen) bird

Weight
Example: recognise birds
Height

It’s a sparrow

Weight
Neural Network
Concepts
Network Basics

• Neurone: a mathematical function; also, a node in the network

• Synapse: the input-output connection between nodes; also, an edge in the network

• Nodes and edges are typically associated with weights, to adjust the learning
Network Example

http://ml-cheatsheet.readthedocs.io/en/latest/nn_concepts.html
Neurone Example

http://ml-cheatsheet.readthedocs.io/en/latest/nn_concepts.html
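A minimal sketch of what a single neurone computes (the general idea, not the exact numbers from the linked cheatsheet): a weighted sum of the inputs plus a bias, passed through an activation function.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neurone(inputs, weights, bias):
    # Weighted sum of the inputs, then the activation function.
    z = sum(i * w for i, w in zip(inputs, weights)) + bias
    return sigmoid(z)

# Example with made-up numbers.
print(neurone(inputs=[0.5, 0.9], weights=[0.3, -0.2], bias=0.1))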
Layers of the Network

http://ml-cheatsheet.readthedocs.io/en/latest/nn_concepts.html
Activation Function

• a.k.a. Transfer function

• Defines the output of a node given its inputs

• It’s what gives the Neural Net its power

• Popular examples: ReLU, Sigmoid, Softmax, …


Sigmoid Function

http://ml-cheatsheet.readthedocs.io/en/latest/nn_concepts.html
ReLU Function
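For reference, plain-Python versions of two of the activation functions named above (standard textbook definitions; sigmoid is sketched in the neurone example earlier):

import math

def relu(x):
    # Zero for negative inputs, identity otherwise.
    return max(0.0, x)

def softmax(values):
    # Turns a list of scores into probabilities that sum to 1.
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

print([relu(x) for x in (-2.0, 0.0, 2.0)])   # [0.0, 0.0, 2.0]
print(softmax([1.0, 2.0, 3.0]))              # ~[0.09, 0.24, 0.67]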
Properties of an Activation Function (nice to have)

• Non-linear

• Continuously differentiable

• Fixed output range

• Monotonic
Objective Function

• Loss function (minimise the error)

• Gain function (maximise some quantity)

• Tells us “how good” the model is at making predictions for a given set of parameters

• (the purpose of “learning” is to find the optimal parameters)
Gradient Descent

• Optimisation algorithm, used to minimise some loss function

• Iteratively move towards the steepest descent

• Descent (loss) vs ascent (gain)

• Learning rate: the size of the steps
  e.g. a high learning rate covers more ground but could overshoot; a low one is precise but time consuming
Back Propagation

• Purpose: to adjust each weight in the network in proportion to how much it contributes to the overall error

• a.k.a. backward propagation of errors

• Starting with the final loss value (i.e. the error), it works backwards to compute the contribution that each parameter made to the loss value (see the sketch below)
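A minimal numeric sketch of that idea for a single weight (illustrative only; assumes a one-neurone model with squared error): the chain rule gives the weight's contribution to the loss, and the weight is moved in the opposite direction.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# One neurone: prediction = sigmoid(w * x), loss = (prediction - target)^2
x, target = 1.5, 1.0
w = 0.2
learning_rate = 0.5

for step in range(5):
    pred = sigmoid(w * x)
    loss = (pred - target) ** 2

    # Chain rule: dloss/dw = dloss/dpred * dpred/dz * dz/dw
    dloss_dpred = 2 * (pred - target)
    dpred_dz = pred * (1 - pred)        # derivative of the sigmoid
    dz_dw = x
    grad = dloss_dpred * dpred_dz * dz_dw

    w -= learning_rate * grad           # move against the gradient
    print(f"step {step}: loss={loss:.4f}, w={w:.4f}")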
Types of Neural Network

• Feed-Forward NN: outputs only move forward

• Convolutional NN: a type of FFNN

• Recurrent NN: outputs of a layer can come back as input of the same layer (variable size input)
Example of RNNs

http://karpathy.github.io/
NN in the Wild:
Word Embeddings
Word Embeddings
=
Word Vectors
=
Distributed
Representations
Why should you care?

Data representation is crucial
Applications

Classification

Recommender Systems

Search Engines

Machine Translation
One-hot Encoding

V = vocabulary size (huge)

Rome   = [1, 0, 0, 0, 0, 0, …, 0]
Paris  = [0, 1, 0, 0, 0, 0, …, 0]
Italy  = [0, 0, 1, 0, 0, 0, …, 0]
France = [0, 0, 0, 1, 0, 0, …, 0]
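A minimal sketch in plain Python (the four-word vocabulary is a tiny stand-in; a real V would be huge, which is exactly the problem):

vocabulary = ['Rome', 'Paris', 'Italy', 'France']   # tiny stand-in vocabulary
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    # A vector of length V with a single 1 at the word's position.
    vec = [0] * len(vocabulary)
    vec[index[word]] = 1
    return vec

print(one_hot('Rome'))    # [1, 0, 0, 0]
print(one_hot('France'))  # [0, 0, 0, 1]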
Bag-of-words

Each document is a vector of word counts (columns: Rome, Paris, …, word V)

doc_1 = [32, 14, 1,  0, …,  6]
doc_2 = [ 2, 12, 0, 28, …, 12]
…
doc_N = [13,  0, 6,  2, …,  0]
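A hedged sketch of the same idea with scikit-learn's CountVectorizer (the two documents are made up; get_feature_names_out assumes scikit-learn ≥ 1.0): each row is a document, each column a vocabulary word, each cell a count.

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "rome is the capital of italy",      # made-up doc_1
    "paris is the capital of france",    # made-up doc_2
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)      # sparse document-term matrix

print(vectorizer.get_feature_names_out())    # the vocabulary (columns)
print(counts.toarray())                      # word counts per document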
Word Embeddings

n. dimensions << vocabulary size

Rome   = [0.91, 0.83, 0.17, …, 0.41]
Paris  = [0.92, 0.82, 0.17, …, 0.98]
Italy  = [0.32, 0.77, 0.67, …, 0.42]
France = [0.33, 0.78, 0.66, …, 0.97]

Word Embeddings

[plot: Rome, Paris, Italy and France in the embedding space, with the “is-capital-of” relation appearing as a consistent vector offset]
Word Embeddings

Paris + Italy - France ≈ Rome
From Language
to Vectors?
Distributional Hypothesis
“You shall know a word
by the company it keeps.”
–J.R. Firth 1957
“Words that occur in similar context
tend to have similar meaning.”
–Z. Harris 1954
Context ≈ Meaning

I enjoyed eating some pizza at the restaurant

Word: pizza
The company it keeps: I enjoyed eating some … at the restaurant

Compare:

I enjoyed eating some pizza at the restaurant
I enjoyed eating some Welsh cake at the restaurant

Same context: pizza ≈ Welsh cake?
A Bit of Theory
with word2vec
word2vec Architecture

Mikolov et al. (2013) Efficient Estimation of Word Representations in Vector Space


Vector Calculation

Goal: learn vec(word)
1. Choose objective function
2. Init: random vectors
3. Run stochastic gradient descent
Intermezzo (Gradient Descent)

[plot: an objective function F(x) to minimise, over x]

Find the optimal “x”: random init, then repeatedly take the derivative and update x, again and again, until convergence
Intermezzo (Gradient Descent)

• Optimisation algorithm
• Purpose: find the min (or max) for F
• Batch-oriented (use all data points)
• Stochastic GD: update after each sample
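A minimal sketch of that loop in Python for a 1-D objective F(x) = (x - 3)^2 (an arbitrary convex example, not from the slides): random init, then repeated derivative-and-update steps until the updates become negligible.

import random

def F(x):
    return (x - 3) ** 2          # objective to minimise (minimum at x = 3)

def dF(x):
    return 2 * (x - 3)           # its derivative

x = random.uniform(-10, 10)      # random init
learning_rate = 0.1

for step in range(100):
    update = learning_rate * dF(x)
    x -= update                  # move against the gradient
    if abs(update) < 1e-6:       # until convergence
        break

print(f"converged near x = {x:.4f} after {step + 1} steps")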
Objective Function

I enjoyed eating some pizza at the restaurant

Maximise the likelihood of the context given the focus word:

P(i | pizza)
P(enjoyed | pizza)
…
P(restaurant | pizza)
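For reference, the quantity being maximised can be written out; this is a sketch of the standard skip-gram objective from Mikolov et al. (2013), with window size c over a corpus of T words (the slides state it in words rather than symbols):

J = (1/T) Σ_t Σ_{ -c ≤ j ≤ c, j ≠ 0 } log P( w_{t+j} | w_t )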
Example

I enjoyed eating some pizza at the restaurant

Iterate over the context words:

bump P( i | pizza )
bump P( enjoyed | pizza )
bump P( eating | pizza )
bump P( some | pizza )
bump P( at | pizza )
bump P( the | pizza )
bump P( restaurant | pizza )

Move to the next focus word and repeat:

bump P( i | at )
bump P( enjoyed | at )
… you get the picture
P( eating | pizza ) ??

eating = output word, pizza = input word

P( eating | pizza )
= P( vec(eating) | vec(pizza) )
= P( vout | vin ) ???

cosine( vout, vin ) is in [-1, 1]
softmax(cosine( vout, vin )) is in [0, 1]
softmax(cosine( vout, vin )):

P( vout | vin ) = exp(cosine(vout, vin)) / Σ_{k ∈ V} exp(cosine(vk, vin))
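A small numpy sketch of that formula as written on the slide (illustrative only; production word2vec code uses dot products and tricks such as negative sampling rather than this literal computation):

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def p_out_given_in(v_out, v_in, all_vectors):
    # Softmax over the cosine similarities with every vector in the vocabulary.
    similarities = np.array([cosine(v_k, v_in) for v_k in all_vectors])
    return np.exp(cosine(v_out, v_in)) / np.exp(similarities).sum()

# Toy vectors, for illustration only.
rng = np.random.default_rng(42)
vocab_vectors = [rng.normal(size=5) for _ in range(10)]
print(p_out_given_in(vocab_vectors[2], vocab_vectors[0], vocab_vectors))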
Vector Calculation Recap

Learn vec(word)
by gradient descent
on the softmax probability
Plot Twist
Paragraph Vector
a.k.a.
doc2vec
i.e.
P(vout | vin, label)
word2vec in
practice
pip install gensim
Case Study 1: Skills and CVs

from gensim.models import Word2Vec

fname = 'candidates.jsonl'
corpus = CorpusReader(fname)   # custom corpus reader
model = Word2Vec(corpus)
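CorpusReader is the author's own helper and is not shown; below is a hedged sketch of what such a reader might look like, assuming (hypothetically) that each line of candidates.jsonl is a JSON object with a 'text' field. gensim's Word2Vec only needs a restartable iterable that yields lists of tokens.

import json

class CorpusReader:
    """Hypothetical reader: yields one list of tokens per candidate.

    Assumes each line of the .jsonl file is a JSON object with a 'text'
    field; the real field names in candidates.jsonl are not shown here.
    """
    def __init__(self, fname):
        self.fname = fname

    def __iter__(self):
        with open(self.fname) as f:
            for line in f:
                doc = json.loads(line)
                yield doc['text'].lower().split()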
Case Study 1: Skills and CVs

model.most_similar('chef')

[('cook', 0.94),
('bartender', 0.91),
('waitress', 0.89),
('restaurant', 0.76),
...]
Case Study 1: Skills and CVs

model.most_similar('chef',
negative=['food'])

[('puppet', 0.93),
('devops', 0.92),
('ansible', 0.79),
('salt', 0.77),
...]
Case Study 1: Skills and CVs

Useful for:
Data exploration
Query expansion/suggestion
Recommendations
Case Study 2: Evil AI

from gensim.models.keyedvectors import KeyedVectors

# pre-trained on Google News, distributed by Google
fname = 'GoogleNews-vectors.bin'

model = KeyedVectors.load_word2vec_format(
    fname,
    binary=True
)
Case Study 2: Evil AI

model.most_similar(
    positive=['king', 'woman'],
    negative=['man']
)

[('queen', 0.7118),
('monarch', 0.6189),
('princess', 0.5902),
('crown_prince', 0.5499),
('prince', 0.5377),
…]
Case Study 2: Evil AI

model.most_similar(
    positive=['Paris', 'Italy'],
    negative=['France']
)

[('Milan', 0.7222),
('Rome', 0.7028),
('Palermo_Sicily', 0.5967),
('Italian', 0.5911),
('Tuscany', 0.5632),
…]
Case Study 2: Evil AI

model.most_similar(
    positive=['professor', 'woman'],
    negative=['man']
)

[('associate_professor', 0.7771),
('assistant_professor', 0.7558),
('professor_emeritus', 0.7066),
('lecturer', 0.6982),
('sociology_professor', 0.6539),
…]
Case Study 2: Evil AI

model.most_similar(
    positive=['professor', 'man'],
    negative=['woman']
)

[('professor_emeritus', 0.7433),
('emeritus_professor', 0.7109),
('associate_professor', 0.6817),
('Professor', 0.6495),
('assistant_professor', 0.6484),
…]
Case Study 2: Evil AI

model.most_similar(
    positive=['computer_programmer', 'woman'],
    negative=['man']
)

[('homemaker', 0.5627),
('housewife', 0.5105),
('graphic_designer', 0.5051),
('schoolteacher', 0.4979),
('businesswoman', 0.4934),
…]
Case Study 2: Evil AI

• Culture is biased
• Language is biased
• Algorithms are not?
• “Garbage in, garbage out”
Final Remarks on
Word Embeddings
But we’ve been doing this for X years

• Approaches based on co-occurrences are not new
• Think SVD / LSA / LDA
• … but they are usually outperformed by word2vec
• … and don’t scale as well as word2vec
Efficiency

• There is no co-occurrence matrix (vectors are learned directly)
• Softmax has complexity O(V); Hierarchical Softmax only O(log(V))
Garbage in, garbage out
• Pre-trained vectors are useful
• … until they’re not
• The business domain is important
• The pre-processing steps are important
• > 100K words? Maybe train your own model
• > 1M words? Yep, train your own model
Word Embeddings Summary

• Word Embeddings are magic!


• Big victory of unsupervised learning
• Gensim makes your life easy
word2vec Credits & Readings
Credits
• Lev Konstantinovskiy (@gensim_py)
• Chris E. Moody (@chrisemoody) see videos on lda2vec
Readings
• Deep Learning for NLP (R. Socher) http://cs224d.stanford.edu/
• “word2vec parameter learning explained” by Xin Rong
More readings
• “GloVe: global vectors for word representation” by Pennington et al.
• “Dependency based word embeddings” and “Neural word embeddings
as implicit matrix factorization” by O. Levy and Y. Goldberg
Deep Learning
Deep Networks

• Simple definition: a network with many hidden layers

• How many? Well…


What is the buzz about?

• Previously:
— not enough computational power
— not enough data
— not enough understanding

• Recently:
— all the above available
— big improvements in many different tasks
Deep Learning Tools

• scikit-learn implements some NN algorithms (see Perceptron demo)

• It doesn’t support much of deep learning

• More recently: specialised frameworks, e.g. TensorFlow, Theano, PyTorch

• … Keras!
A Few Words on Keras

• High-level, user friendly

• Initial focus on quick experimentation

• Now production ready, supports GPU etc.

• General purpose (design your network), extensible, modular, …
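A hedged sketch of the Keras API in action (a made-up binary classifier over 20 input features, just to show the "design your network" style; not from the original slides):

from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical binary classifier over 20 input features.
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(20,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.summary()
# model.fit(X_train, y_train, epochs=10)  # with your own data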
Questions?
