Introduction To Learning: Frederic Precioso 24/01/2019


Introduction to Learning

Winter School ROBOTICA PRINCIPIA

Frederic Precioso
24/01/2019

1
Disclaimer

If any content in this presentation is yours but is not correctly


referenced or if it should be removed, please just let me know
and I will correct it.

2
Overview
• Context & Vocabulary
– What does Artificial Intelligence represent?
– Machine Learning vs Data Mining?
– Machine Learning vs Data Science?
– Machine Learning vs Statistics?
• Unsupervised classification
• Explicit supervised classification
• Implicit supervised classification
• Deep Learning
• Reinforcement Learning
3
CONTEXT & VOCABULARY

4
WHAT DOES ARTIFICIAL INTELLIGENCE REPRESENT?

5
What is Artificial intelligence?

• The term Artificial Intelligence, as a research field, was coined at the conference held on
the campus of Dartmouth College in the summer of 1956, even though the idea had been
around since antiquity.
• For instance, in the first manifesto of Artificial Intelligence, "Intelligent Machinery" (1948),
Alan Turing distinguished two different approaches to AI, which may be termed
"top-down" or knowledge-driven AI and "bottom-up" or data-driven AI

6
(sources: Wikipedia & http://www.alanturing.net & Stanford Encyclopedia of Philosophy)
What is Artificial intelligence?
• The two different approaches to AI can be detailed:
– "top-down" or knowledge-driven AI
• cognition = a high-level phenomenon, independent of the low-level details of the implementation
mechanism; first neuron (1943), first neural network machine (1950), neocognitron (1975)
• Evolutionary Algorithms (1954, 1957, 1960), Reasoning (1959, 1970), Expert Systems (1970),
Logic, Intelligent Agent Systems (1990)…
– "bottom-up" or data-driven AI
• the opposite approach: start from data to build, incrementally and mathematically, mechanisms
taking decisions
• Machine learning algorithms, Decision Trees (1983), Backpropagation (1984-1986), Random
Forest (1995), Support Vector Machine (1995), Boosting (1995), Deep Learning (1998/2006)…

7
(sources: Wikipedia & http://www.alanturing.net & Stanford Encyclopedia of Philosophy)
What is Artificial intelligence?

• AI was originally defined, by Marvin Lee Minsky, as "the construction of computer
programs doing tasks that are, for the moment, accomplished more satisfyingly by
human beings because they require high-level mental processes such as: learning,
perceptual organization of memory and critical reasoning".

• There is thus the "artificial" side, with the use of computers or sophisticated
electronic processes, and the "intelligence" side, associated with the goal of imitating
(human) behavior.

(sources: Wikipedia & http://www.alanturing.net & Stanford Encyclopedia of Philosophy)


8
What is Artificial intelligence?

• The concept of strong artificial intelligence refers to a machine capable
not only of producing intelligent behavior, but also of experiencing a real
sense of itself, "real feelings" (whatever may be put behind these words), and "an
understanding of its own arguments".

• The notion of weak artificial intelligence is a pragmatic, engineering approach:
aiming to build more autonomous systems (to reduce the cost of their
supervision), algorithms capable of solving problems of a certain class, etc. But this
time the machine merely simulates intelligence; it seems to act as if it were smart.

(sources: Wikipedia & http://www.alanturing.net & Stanford Encyclopedia of Philosophy)


9
Why is Artificial Intelligence so difficult to grasp?

• Frequently, when a technique reaches mainstream use, it is no longer considered
artificial intelligence; this phenomenon is described as the AI effect: "AI is whatever
hasn't been done yet." (Larry Tesler's Theorem) → e.g. path finding (GPS), checkers
programs, chess programs, AlphaGo…

⇒ "AI" is continuously evolving and so very difficult to grasp.

10
Machine Learning

x → f(x, α) → ŷ ?

Face Detection

Betting on sports Scores, ranking…

Speech Recognition
11
Machine Learning

x → f(x, α) → ŷ ?

Support Vector Machines Random Forest Artificial Neural Networks 12


MACHINE LEARNING VS DATA MINING?

13
Data Mining Workflow

[Workflow diagram: Data / Data warehouse → Selection → Preprocessing → Transformation → Data Mining → Patterns / Model → Validation]
14
Data Mining Workflow

[Workflow diagram as above; annotation: "Mainly manual"]
15
Data Mining Workflow

• Filling missing values
• Dealing with outliers
• Sensor failures
• Data entry errors
• Duplicates
• …

[Workflow diagram as above]
16
Data Mining Workflow
• Aggregation (sum, average)
• Discretization
• Discrete attribute coding
• Text to numerical attribute
• Scale uniformisation or standardisation
• New variable construction
• …

[Workflow diagram as above]
17
Data Mining Workflow
• Regression
• (Supervised) Classification
• Clustering (Unsupervised Classification)
• Feature Selection
• Association analysis
• Novelty/Drift
• …

[Workflow diagram as above]
18
Data Mining Workflow
• Evaluation on a Validation Set
• Evaluation Measures
• Visualization
• ...

[Workflow diagram as above]
19
Data Mining Workflow
• Visualization
• Reporting
• Knowledge
• ...

[Workflow diagram as above]
20
Data Mining Workflow
Problems:
• Regression
• (Supervised) Classification
• Density Estimation / Clustering (Unsupervised Classification)
• Feature Selection
• Association analysis
• Anomaly/Novelty/Drift
• …

Possible Solutions:
• Machine Learning: Support Vector Machine, Artificial Neural Network, Boosting, Decision Tree, Random Forest, …
• Statistical Learning: Gaussian Models (GMM), Naïve Bayes, Gaussian processes, …
• Other techniques: Galois Lattice, …
21
MACHINE LEARNING VS DATA SCIENCE?

22
Data Science Stack
(layers from the USER side at the top down to the HARDWARE side at the bottom)

Visualization / Reporting / Knowledge
• Dashboard (Kibana / Datameer)
• Maps (InstantAtlas, Leaflet, CartoDB…)
• Charts (GoogleCharts, Charts.js…)
• D3.js / Tableau / Flame

Analysis / Statistics / Artificial Intelligence
• Machine Learning (Scikit Learn, Mahout, Spark)
• Search / retrieval (ElasticSearch, Solr)

Storage / Access / Exploitation
• File System (HDFS, GGFS, Cassandra…)
• Access (Hadoop / Spark / Both, Sqoop…)
• Databases / Indexing (SQL / NoSQL / Both…, MongoDB, HBase, Infinispan)
• Exploit (LogStash, Flume…)

Infrastructures
• Grid Computing / HPC
• Cloud / Virtualization
23
MACHINE LEARNING VS STATISTICS?

24
What breed is that Dogmatix (Idéfix) ?

The illustrations of the slides in this section come from the blog “Bayesian Vitalstatistix:
What Breed of Dog was Dogmatix?” 25
Does any real dog get this height and
weight?

• Let us consider x, vectors
independently generated in R^d
(here R^2), following a fixed but
unknown probability distribution P(x).

26
What should be the breed of these dogs?

• An Oracle assigns a value y to
each vector x following a
probability distribution P(y|x),
also fixed but unknown.

27
An oracle provides me with examples?

• Let S be a training set


S = {(x1, y1), (x2, y2),…, (xm, ym)},
with m training samples i.i.d. which
follow the joint probability
P(x, y) = P(x)P(y|x).

28
Statistical solution: Models, Hypotheses…

29
Statistical solution P(height, weight|breed)…

30
Statistical solution P(height, weight|breed)…

31
Statistical solution P(height, weight|breed)…

32
Statistical solution: Bayes,
P(breed|height, weight)…

33
Machine Learning

• we have a learning machine which can
provide a family of functions {f(x; α)},
where α is a set of parameters.

x → f(x, α) → ŷ ?
34
The problem in Machine Learning
• The problem of learning consists in finding the model
(among the {f(x; α)}) which provides the best
approximation ŷ of the true label y given by the Oracle.

x → f(x, α) → ŷ ?

• "best" is defined in terms of minimizing a specific (error)
cost related to your problem/objectives: Q((x, y), α) ∈ [a, b].
• Examples of cost/loss functions Q: Hinge Loss, Quadratic
Loss, Cross-Entropy Loss, Logistic Loss…

35
Loss in Machine Learning

• How to define the loss L (or the cost Q)?


You should choose the right loss function based on your problem and your data
(here y is the true/expected answer, f(x) the answer predicted by the network).
Classification
• Cross-entropy loss: L(x) = −( y ln(f(x)) + (1 − y) ln(1 − f(x)) )
• Hinge loss (max-margin loss, a convex surrogate of the 0-1 loss): L(x) = max(0, 1 − y·f(x))
• …
Regression
• Mean Squared Error (or Quadratic Loss): L(x) = (f(x) − y)²
• Mean Absolute Loss: L(x) = |f(x) − y|
• …
If the loss is minimized but accuracy is low, you should check the loss function.
Maybe it is not the appropriate one for your task.
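
As an illustration (not from the original slides), here is a minimal NumPy sketch of these losses for a single prediction; y is the true label and f the predicted probability or score:

import numpy as np

def cross_entropy(y, f, eps=1e-12):
    # y in {0, 1}, f = predicted probability of class 1
    f = np.clip(f, eps, 1.0 - eps)
    return -(y * np.log(f) + (1.0 - y) * np.log(1.0 - f))

def hinge(y, f):
    # y in {-1, +1}, f = raw (signed) score of the classifier
    return np.maximum(0.0, 1.0 - y * f)

def squared_error(y, f):
    return (f - y) ** 2

def absolute_error(y, f):
    return np.abs(f - y)

print(cross_entropy(1, 0.9), hinge(1, 0.2), squared_error(3.0, 2.5))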
36
The problem in Machine Learning
For clarity's sake, let us denote z = (x, y).

37
Machine Learning fundamental Hypothesis

For clarity's sake, let us denote z = (x, y).

S = {z_i}_{i=1,…,m} is built through an i.i.d. sampling according to P(z).

Machine Learning vs Statistics:

Train through Cross-Validation.

Machine Learning vs Statistics:

The training set & the test set have to be distributed according to the
same law (i.e. P(z)).
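
The slides only name cross-validation; as a hedged illustration, a 5-fold cross-validation in scikit-learn could look like the sketch below (the iris dataset and the RBF SVM are placeholders standing in for the i.i.d. sample S and the learning machine):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                        # stands in for S = {z_i}
scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=5)  # 5-fold cross-validation
print(scores.mean(), scores.std())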
38
Vapnik learning theory (1995)

Training Error Generalization Error

39
Vapnik learning theory (1995)

40
Machine Learning vs Statistics

41
UNSUPERVISED CLASSIFICATION

42
Unsupervised classification
• The system or the operator has only samples, but no labels
• The number of classes and their nature have not been
predetermined
⇒ unsupervised learning or clustering.
• No expert is required.
• The algorithm must discover by itself the more or less
hidden/underlying data structure.

43
Clustering
Clustering: Partition a dataset into groups based on the similarity between the instances.

Clustering

44
Clustering Algorithms (Partition)

• Centroid-based (1957,1967)
(E.g. K-means, PAM, CLARA)

• Hierarchical:
• Bottom up approach.
• Top down approach.
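
For illustration (not part of the original slides), a centroid-based clustering with K-means in scikit-learn, on synthetic data, might look like:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])  # 3 synthetic blobs

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # one centroid per cluster
print(km.labels_[:10])       # cluster index assigned to each sample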

45
Clustering Algorithms (Density)

• Density-based
(E.g. DBSCAN, DENCLUE)

• Distribution-based
(E.g. EM, extension of
K-means)

46
Clustering Algorithms (Graph)

• Graph-based
(E.g. Chameleon)

• Grid-based
(E.g. STING, CLIQUE)

47
How many Clusters?

K?

48
Which Algorithm?

K-means PAM Gaussian model

DBSCAN Average linkage DIANA


49
Algorithms’ Parameters

Avg. linkage: K=2 Avg. linkage: K=3 Avg. linkage: K=4

Avg. linkage: K=5 Avg. linkage: K=6 Avg. linkage: K=7


50
Which metric/distance?

51
Clustering validation measures
• Internal validation: Separation? Compactness?
  E.g. Dunn, DB, and Silhouette indexes.
• Problems:
  – Different performance WRT existence of noise, variable densities, and non well-separated clusters.
  – Overrate the algorithm that uses the same clustering model.

• External validation: == Class labels?
  E.g. Rand, Jaccard, Purity, MI, VI indexes.
• Problem: Some class labels, at least, have to exist.

[Bar charts: validation index values for different numbers of clusters K]
52
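
As a sketch of the two families of validation measures (with a synthetic dataset assumed, since the slides show only bar charts), the silhouette index (internal) and the adjusted Rand index (external, which needs class labels) can be computed as follows:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, adjusted_rand_score

X, y_true = make_blobs(n_samples=300, centers=4, random_state=0)
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k,
          round(silhouette_score(X, labels), 3),          # internal index: no labels needed
          round(adjusted_rand_score(y_true, labels), 3))  # external index: needs class labels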
Consensus clustering

Consensus
Clustering

Ensemble of 3 base clusterings Consensus solution

53
Overview

• Unsupervised classification
• Explicit supervised classification
– Decision Trees
– Random Forest
• Implicit supervised classification
• Deep Learning
• Reinforcement Learning

54
EXPLICIT SUPERVISED CLASSIFICATION

55
DECISION TREES

56
Decision tree to decide playing
tennis or not

Objective
2 classes: yes & no
Prediction if a game will
be played or not
Temperature will be
easily converted into
numerical

I.H. Witten and E. Frank, “Data Mining”, Morgan Kaufmann Pub., 2000.
57
A simple example

58
Example - explanations

• On the nodes
– Distribution of the variable to predict
• The first node is segmented with the variable outlook (sunny, overcast, rainy):
creation of 3 sub-groups
– The first group contains 5 observations, 2 yes and 3 no
• The tree can be translated into a set of decision rules without losing any
information
– Example: if outlook = sunny and humidity = high then play = no
59
Basic algorithm
• A= BestAttribute(Examples) // Best attribute means more
//homogenous results
• Assign A to the root
• For each value of A, create a new sub-node of the root
• Classify all the examples in the sub-nodes
• If all examples of a sub-node are homogeneous, assign their class
to the node, if not repeat this process from this node

• Question: How to measure homogeneity?


Entropy, Gini, Information Gain…
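
A minimal sketch of these homogeneity measures (assuming the classic 14-example play-tennis data mentioned earlier; the helper names are illustrative, not from the slides):

import numpy as np

def entropy(labels):
    # Shannon entropy of a vector of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute_values):
    # entropy of the parent node minus the weighted entropy of its children
    total = len(labels)
    children = 0.0
    for v in np.unique(attribute_values):
        mask = attribute_values == v
        children += mask.sum() / total * entropy(labels[mask])
    return entropy(labels) - children

# play-tennis data: 9 "yes" / 5 "no", candidate split on outlook
play    = np.array(["yes"] * 9 + ["no"] * 5)
outlook = np.array(["sunny"] * 2 + ["overcast"] * 4 + ["rainy"] * 3 +
                   ["sunny"] * 3 + ["rainy"] * 2)
print(information_gain(play, outlook))   # about 0.25 bits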
60
Decision tree to decide playing
tennis or not
Class:YES
Class: NO Class: YES

61
Final decision tree

62
Decision tree example
[Diagram: a decision tree on two features, Income and Debt, with tests Income > t1, Debt > t2, Income > t3, and the corresponding axis-parallel partition of the (Income, Debt) plane]
Note: tree boundaries are piecewise linear and axis-parallel

63
Advantages of decision trees
• Simple and easily interpretable rules (unlike implicit decision methods)
• No need to recode heterogeneous data
• Processing with missing values
• No model and no presupposition to meet (iterative method)
• Fast processing time

64
Drawbacks
• The nodes of level n + 1 are highly dependent on those of level
n (the modification of a single variable near to the top of the
tree can entirely change the tree)
• We always choose the best local attributes, the best global
information gain is not at all guaranteed
• Learning requires a sufficient number of individuals
• Inefficient when there are many classes

• No convergence…
65
Drawbacks
• We always choose the best local attributes; the best global information
gain is not at all guaranteed
Gain (A2) = 0.25 Gain (A5) = 0.18

Gain (A4) = 0.11 Gain (A1) = 0.07 Gain (A7) = 0.30 Gain (A3) = 0.20

Solution chosen by the algorithm Solution that should be chosen 66


Decision trees do not converge?
“Plant” a forest

67
Standard Random Forests

Bagging

Random Feature Selection
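
For illustration (dataset and hyper-parameters are placeholders, not from the slides), the two ingredients above map directly to scikit-learn's random forest:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(
    n_estimators=200,      # trees grown on bootstrap samples (bagging)
    max_features="sqrt",   # random feature selection at each split
    oob_score=True,        # out-of-bag estimate of the generalization error
    random_state=0,
).fit(X, y)
print(rf.oob_score_)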

68
Generalization error of Random Forests

• The generalization error of an RF can be bounded by:

R(RF) ≤ ρ (1 − s²) / s²

where
– ρ is the mean correlation between two decision trees
– s is the quality of prediction (strength) of the set of decision trees

69
Success story: Kinect

https://www.youtube.com/watch?v=lntbRsi8lU8

70
Success story: Kinect

71
Success story: Kinect

72
Overview
• Unsupervised classification
• Explicit supervised classification
• Implicit supervised classification
– Multi-Layer Perceptron
– Support Vector Machine
• Deep Learning
• Reinforcement Learning

73
IMPLICIT SUPERVISED CLASSIFICATION

74
Thomas Cover’s Theorem (1965)
“The Blessing of dimensionality”
Cover’s theorem states: A complex pattern-classification problem cast in
a high-dimensional space nonlinearly is more likely to be linearly
separable than in a low-dimensional space.
(repeated sequence of Bernoulli trials)

75
The curse of dimensionality [Bellman, 1956]

76
MULTI-LAYER PERCEPTRON

77
First, biological neurons
• Before we study artificial neurons, let's look at a biological neuron

78
Figure from K.Gurney, An Introduction to Neural Networks
First, biological neurons

Postsynaptic potential function with weight dependency, as a function of
time (ms) and weight value, excitatory in the case of the red and blue lines, and
inhibitory in the case of the green line.

79
Then, artificial neurons

Pitts & McCulloch (1943), binary inputs & activation function f is a thresholding

Rosenblatt (1956), real inputs & activation function f is a thresholding


80
(Diagram: Isaac Changhau)

Artificial neuron vs biology

[Diagram: inputs x0 = 1, x1, x2, x3, …, xn with weights w0, w1, w2, w3, …, wn feeding the sum Σ_{i=0}^{n} wi xi and producing the output y]

Spike-based description (biological neuron): gradient descent: KO
Rate-based description (steady regime): y = s( Σ wi xi ), gradient descent: OK
81


From perceptron to network

@tachyeonz: A friendly introduction to neural networks and deep learning.

82
Single Perceptron Unit
• Perceptron only learns linear function [Minsky and Papert, 1969]

• A non-linear function needs layer(s) of neurons → Neural Network


• Neural Network = input layer + hidden layer(s) + output layer

83
Multi-Layer Perceptron
• Training a neural network [Rumelhart et al. / Yann Le Cun et al. 1985]
• Unknown parameters: weights on the synapses
• Minimizing a cost function: some metric between the predicted output
and the given output

• Step function: non-continuous functions are replaced by continuous
non-linear ones
84
Multi-Layer Perceptron
• Minimizing a cost function: some metric between the predicted output
and the given output

• Equation for a network of 3 neurons (i.e. 3 perceptrons):

y = s( w13 s(w11 x1 + w21 x2 + w01) + w23 s(w12 x1 + w22 x2 + w02) + w03 )
85
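
As a hedged illustration (weights are random, s is assumed to be the sigmoid), the equation above can be transcribed directly into NumPy:

import numpy as np

def s(t):                      # the non-linearity "s" (here a sigmoid)
    return 1.0 / (1.0 + np.exp(-t))

def forward(x1, x2, w):
    h1 = s(w["w11"] * x1 + w["w21"] * x2 + w["w01"])    # hidden neuron 1
    h2 = s(w["w12"] * x1 + w["w22"] * x2 + w["w02"])    # hidden neuron 2
    return s(w["w13"] * h1 + w["w23"] * h2 + w["w03"])  # output neuron

names = ["w11", "w21", "w01", "w12", "w22", "w02", "w13", "w23", "w03"]
w = dict(zip(names, np.random.randn(len(names))))
print(forward(0.5, -1.0, w))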
Autonomous Land Vehicle In a Neural
Network (ALVINN)
• ALVINN is an automatic steering system for a car based on input from a
camera mounted on the vehicle.
– Successfully demonstrated in a cross-country trip.

86
ALVINN (1989)

• The ALVINN neural network is:


– 960 inputs (a 30x32 array
derived from the pixels of an
image),
– 4 hidden units and
– 30 output units (each
representing a steering
command).

87
Multi-Layer Perceptron
Theorem [Cybenko, 1989]
• A neural network with one single hidden layer is a universal
approximator: it can represent any continuous function on compact
subsets of Rn
• 2 layers is enough ... theoretically:
“…networks with one internal layer and an arbitrary continuous sigmoidal
function can approximate continuous functions with arbitrary precision
providing that no constraints are placed on the number of nodes or the size
of the weights"
• But no efficient learning rule is known, and the size of the hidden layer is
exponential in the complexity of the problem (which is unknown beforehand)
to get an error ε; the layer must be infinite for an error of 0.
88
SUPPORT VECTOR MACHINE
Partly based on “A Gentle Introduction to Support Vector Machines in Biomedicine”, A. Statnikov, D.
Hardin, I. Guyon, C. F. Aliferis, AMIA 2010.

89
Thomas Cover’s Theorem (1965)
“The Blessing of dimensionality”
Cover’s theorem states: A complex pattern-classification problem cast in
a high-dimensional space nonlinearly is more likely to be linearly
separable than in a low-dimensional space.
(repeated sequence of Bernoulli trials)

90
The curse of dimensionality [Bellman, 1956]

91
SVM vs ANN
"SVMs have been developed in the reverse order to the development
of neural networks (NNs). SVMs evolved from the sound theory to
the implementation and experiments, while the NNs followed more
heuristic path, from applications and extensive experimentation to
the theory.“

“Support Vector Machines: Theory and Applications” by Lipo


Wang, in Studies in Fuzziness and Soft Computing, Springer, 2005.

92
The Support Vector Machine (SVM)
• The support vector machine (SVM) is a binary classification
algorithm.
• Extensions of the basic SVM algorithm can be applied to solve
problems of regression, feature selection, novelty/outlier
detection, and clustering.
• SVMs are important because of (a) theoretical reasons:
- Robust to very large number of variables and small samples
- Can learn both simple and highly complex classification models
- Employ sophisticated mathematical principles to avoid overfitting
and (b) superior empirical results. 93
Linearly separable data: "hard-margin" linear SVM

Given training data: x1, x2, …, xN ∈ R^n with labels y1, y2, …, yN ∈ {−1, +1}
• We want to find a classifier (hyperplane) separating the negative objects from the positive ones.
• An infinite number of such hyperplanes exist.
• SVMs find the hyperplane that maximizes the gap between the data points on the boundaries (the so-called "support vectors").
• If the points on the boundaries are not informative (e.g., due to noise), SVMs may not do well.

[Scatter plot: negative objects (y = −1) and positive objects (y = +1) separated by the maximum-margin hyperplane]
94
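
As an illustration (toy data; a very large C is used to approximate the hard margin), a linear SVM in scikit-learn exposes the support vectors and the separating hyperplane:

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)  # linearly separable toy data
clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C ~ "hard-margin" behaviour
print(clf.support_vectors_)                  # the points defining the margin
print(clf.coef_, clf.intercept_)             # the hyperplane w·x + b = 0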
Kernel Trick

https://www.youtube.com/watch?v=-Z4aojJ-pdg

95
Popular kernels
A kernel is a dot product in some feature space:
K(x_i, x_j) = Φ(x_i) · Φ(x_j)
Examples:
• Linear kernel: K(x_i, x_j) = x_i · x_j
• Gaussian kernel: K(x_i, x_j) = exp(−γ ||x_i − x_j||²)
• Exponential kernel: K(x_i, x_j) = exp(−γ ||x_i − x_j||)
• Polynomial kernel: K(x_i, x_j) = (p + x_i · x_j)^q
• Hybrid kernel: K(x_i, x_j) = (p + x_i · x_j)^q exp(−γ ||x_i − x_j||²)
• Sigmoidal kernel: K(x_i, x_j) = tanh(k x_i · x_j − δ)
96
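
As a sketch (not from the slides), the Gaussian kernel above can be computed explicitly; such a precomputed kernel matrix can then be passed to an SVM that accepts one (e.g. scikit-learn's SVC(kernel="precomputed")):

import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    # K[i, j] = exp(-gamma * ||x_i - y_j||^2)
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Y**2, axis=1)[None, :]
                - 2.0 * X @ Y.T)
    return np.exp(-gamma * sq_dists)

X = np.random.randn(5, 3)
K = gaussian_kernel(X, X, gamma=0.5)
print(K.shape, np.allclose(K, K.T))   # a valid kernel matrix is symmetric (and PSD)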
How to build a kernel function?

k(x, y) = k1(x, y) + k2(x, y)
k(x, y) = α · k1(x, y)
k(x, y) = k1(x, y) · k2(x, y)
k(x, y) = f(x) · f(y), with f(·) a function from the input space to R
k(x, y) = k3(Φ(x), Φ(y))
k(x, y) = x B yᵀ, with B an N × N symmetric, positive semi-definite matrix
Complex kernels on Video Tubes

98
Complex kernels on Video Tubes

99
Classification

100
Classification

101
SVMs are ANNs

102
Overview
• Unsupervised classification
• Explicit supervised classification
• Implicit supervised classification
• Deep Learning
– Convolutional Neural Networks (CNN)
– Generative Adversarial Networks (GAN)
– Stacked Denoising AutoEncoder (SDAE)
– Recurrent Neural Networks (RNN)
• Reinforcement Learning
103
DEEP LEARNING

104
Deep representation origins
• Theorem (Cybenko, 1989): a neural network with one single hidden layer
is a universal "approximator"; it can represent any continuous function
on compact subsets of R^n ⇒ 2 layers are enough… but the hidden layer size
may be exponential

[Diagram: a 2-layer network whose hidden layer is exponentially wide]

105
Deep representation origins
• Theorem (Håstad, 1986; Bengio et al., 2007): functions representable
compactly with k layers may require exponential size with k−1 layers

106
Deep representation origins
• Theorem (Håstad, 1986; Bengio et al., 2007): functions representable
compactly with k layers may require exponential size with k−1 layers

[Diagram: the same function with 2 layers vs an exponentially larger shallow network]

107
Enabling factors
• Why do it now ? Before 2006, training deep networks was
unsuccessful because of practical aspects
– faster CPU's
– parallel CPU architectures
– advent of GPU computing

• Results…
– 2009, sound, interspeech + ~24%
– 2011, text, + ~15% without linguistic at all
– 2012, images, ImageNet + ~20%
108
Structure the network?
• Can we put any structure reducing the space of exploration and
providing useful properties (invariance, robustness…)?

y = s( w13 s(w11 x1 + w21 x2 + w01) + w23 s(w12 x1 + w22 x2 + w02) + w03 )

109
CONVOLUTIONAL NEURAL NETWORKS
(AKA CNN, CONVNET)

110
Convolutional neural network

111
Deep representation by CNN

112
Convolution in nature

113
Convolution

114
Convolution in nature
1. Hubel and Wiesel worked on the visual cortex of cats (1962)
2. Convolution

3. Pooling

115
Convolution = Perceptron

y = sign(w · x)

[Diagram: a convolution filter seen as a perceptron computing the weighted sum Σ_{i=0}^{n} wi xi of a local image patch, with bias input x0 = 1; example filter weights (−4, 4, 0, 0) and patch values (0, 0, 0, 2) are shown]
116
Convolution = Perceptron

y = sign(w · x)

[Diagram: a generic perceptron with bias input x0 = 1, inputs x1 … xn, weights w0 … wn, computing Σ_{i=0}^{n} wi xi]

117
If convolution = perceptron
1. Convolution

2. Pooling

118
Deep representation by CNN

119
Deep representation by CNN

120
Deep representation by CNN

121
Deep representation by CNN

122
Transfer Learning!!

123
Endoscopic Vision Challenge 2017
Surgical Workflow Analysis in the SensorOR

10th of September
Quebec, Canada

124
Clinical context: Laparoscopic Surgery
Surgical Workflow Analysis

125
Task
Phase segmentation of laparoscopic surgeries

Video

Surgical
Devices

126
Dataset
30 colorectal laparoscopies
• Complex type of operation
• Duration: 1.6h – 4.9h (avg 3.2h)
• 3 different sub-types
  • 10x Proctocolectomy
  • 10x Rectal resection
  • 10x Sigmoid resection

Sensor data recorded in integrated OR (Karl Storz OR1)
• Laparoscopic image stream
• Surgical devices

127
Annotation
Annotated by surgical experts, 13 different phases

128
Method
[Architecture: a Spatial CNN (ResNet-34: 7x7 conv, 3x3 conv blocks, FC) takes 224x224x3 frames and outputs a spatial feature vector FV (512); a Temporal CNN takes 224x224x20 stacked inputs and outputs temporal features (512); the spatial (512) and temporal (512) features (1024 combined) feed an LSTM (512) in the final network]

Number of target classes: Rectal resection: 11, Sigmoid resection: 10, Proctocolectomy: 12
Spatial network accuracy: Rectal resection: 62.91%, Sigmoid resection: 63.01%, Proctocolectomy: 63.26%
Temporal network accuracy: Rectal resection: 49.88%, Sigmoid resection: 48.56%, Proctocolectomy: 46.96%

Final network accuracy:
Rectal resection (8): 80.7%, Sigmoid resection (7): 73.5%, Proctocolectomy (1): 71.3%
Rectal resection (6): 79.9%, Sigmoid resection (1): 54.7%, Proctocolectomy (4): 73.9%

[Plot: results for rectal resection video #8; ground truth in green, prediction in red; phase index 0–12 over time]
129
And the winner is...

Rank | Data used      | Average Jaccard | Median Jaccard | Accuracy | Team
1    | Video          | 40%             | 38%            | 61%      | Team UCA
2    | Video + Device | 38%             | 38%            | 60%      | Team NCT
3    | Video          | 25%             | 25%            | 57%      | Team TUM
4    | Device         | 16%             | 16%            | 36%      | Team TUM
5    | Video          | 8%              | 7%             | 21%      | Team FFL

130
Why Deep Learning?
Before Deep Learning
Hand-crafted/engineered features
– Image recognition: Pixel → edge → texton → patterns → part → object
– Text: Character → word → word group → clause → sentence → story
– Speech: Sample → spectral band → sound → … → phone → phoneme → word
Since Deep Learning
o Hierarchy of representations with increasing level of abstraction
o Each stage is a kind of trainable feature transform…
  as long as you have enough data to train the hierarchy

[Diagram: Trainable feature transform → Trainable feature transform → Trainable feature transform → Trainable feature transform → Decision]

Le Cun - Ranzato
How Deep Learning?
Start from raw data OR from a first-level representation?

[Diagram: possible inputs (gray-level pixels, color pixels, text / text embedding, waves, images) feeding a chain of trainable feature transforms that ends in a Decision]

Le Cun - Ranzato
AMAZING BUT…

133
Amazing but… be careful of the bias in the initial data

134
Amazing but…be careful of the
adversaries (as any other ML algorithms)

135
Amazing but…be careful of the
adversaries (as any other ML algorithms)

136
Amazing but…be careful of the
adversaries (as any other ML algorithms)

From Thomas Tanay


137
Amazing but…be careful of the
adversaries (as any other ML algorithms)

138
Amazing but…be careful of the
adversaries (as any other ML algorithms)

139
Amazing but…be careful of the
adversaries

https://nicholas.carlini.com/code/audio_adversarial_examples/

140
GENERATIVE ADVERSARIAL NETWORKS

141
How to solve it?
Generative Adversarial Networks

142
It finally did not solve adversarial, but…

Operations between latent representations (manifold)

143
It finally did not solve adversarial, but…

144
(DENOISING) STACKED AUTOENCODER

145
Autoencoder: unsupervised!
Learning a compact representation of the data (no classification)
First we train an AutoEncoder layer 1.

146
Autoencoder: unsupervised!
Second we train an AutoEncoder layer 2.

147
Autoencoder -> Supervised

Then we train an output layer of non-linearities based on softmax.

148
Autoencoder -> Supervised
Finally, we fine-tune the whole network in a supervised way.

149
Denoising stacked Autoencoder:
unsupervised
Result = a new latent representation

150
Denoising stacked Autoencoder: example
Stage 1.
Data-mining stage &
Feature extraction:
Driving Electronic Health
Model Deep Patient, Records to build a binary
Published in Nature, phenotype representation.
2016
Stage 2.
Unsupervised stage:
Mapping the Binary Patient
Representation to get a new space called
Deep Patient (or Latent Representation)
using Stacked Denoising Autoencoders.

Stage 3.
Supervised stage:
Labeling Medical Target and
training the Latent Representation
by Machine Learning algorithms
for classification and prediction of
patient's disease. 151
Supervised Image Segmentation Task

Credits Matthieu Cord 152


Partly from COLAH’s Blog http://colah.github.io/posts/2015-08-Understanding-LSTMs/

RECURRENT NEURAL NETWORK

153
Recurrent Neural Networks have loops.

154
An unrolled recurrent neural network.

In the last few years, there have been incredible success applying RNNs to a
variety of problems: speech recognition, language modeling, translation,
image captioning…

155
The Problem of Long-Term Dependencies

If we are trying to predict the last word in “the clouds are in the (sky),” we don’t need any
further context – it’s pretty obvious the next word is going to be sky.

Consider trying to predict the last word in the text “I grew up in France… I speak
fluent (French).” Recent information suggests that the next word is probably the name of a
language, but if we want to narrow down which language, we need the context of France,
from further back. 156
The repeating module in a standard
RNN contains a single layer.

Unfortunately, as that gap grows, RNNs become unable to learn to connect the information
(cf. vanishing gradients)
The problem was explored in depth by Hochreiter (1991) and Bengio, et al. (1994), who found
some pretty fundamental reasons why it might be difficult.
Thankfully, LSTMs/GRUs do not have this problem!
157
The repeating module in an LSTM/GRU
contains four interacting layers.

158
Sequence modeling with RNNs

159
One to Many - Image captioning

[Xu et al. 2015] 160


Many to Many Parallel - char-rnn

161
Many to Many - Machine translation
speech2text
• Input : Audio mp3 (natural language)
• Output : "How much would a woodchuck chuck"

[Chan et al. 2015] [Olah and Carter 2016]


162
Overview

• Context & Vocabulary


• Unsupervised classification
• Explicit supervised classification
• Implicit supervised classification
• Deep Learning
• Reinforcement Learning

163
Based on “Introduction to Deep Learning”, by Professor Qiang Yang,
from The Hong Kong University of Science and Technology

REINFORCEMENT LEARNING

164
Reinforcement Learning

• What’s Reinforcement Learning?


Environment

{Observation, Reward} {Actions}

Agent

• Agent interacts with an environment and learns by maximizing a scalar reward


• No labels or any other supervision
• Previously suffering from hand-crafted states or representations

165
Policies and Value Functions
• Policy π is a behavior function selecting actions given states
(it defines the probability of each possible action given the state s)

a = argmax over all possible actions of π(s)

• Value function Q^π(s, a) is the expected total reward r from state s and action a
under policy π

"How good is action a in state s?"

166
Approaches To Reinforcement Learning
• Policy-based RL
– Search directly for the optimal policy π*
– The policy achieving maximum future reward
• Value-based RL
– Estimate the optimal value function Q*(s, a)
– Maximum value achievable under the best policy π*
• Model-based RL
– Build a transition model of the environment
– Plan (e.g. by look-ahead) using the model

167
Bellman Equation
• The value function can be unrolled recursively:

Q^π(s, a) = E[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … | s, a ]
          = E_{s'}[ r_t + γ max_{a'} Q^π(s', a') | s, a ]

• The optimal value function Q*(s, a) can be unrolled recursively:

Q*(s, a) = E_{s'}[ r_t + γ max_{a'} Q*(s', a') | s, a ]

V^π(s) = max_a Q^π(s, a)

• Value iteration algorithms solve the Bellman equation:

Q_{i+1}(s, a) = E_{s'}[ r_t + γ max_{a'} Q_i(s', a') | s, a ]

This last equation corresponds to how we should update Q ideally, i.e. knowing the state distribution, BUT
we do not know the state distribution P(s' | s, a) and the complexity is not even polynomial in the
number of states. 168
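
The slides stay at the level of equations; as a hedged illustration of the sampled (rather than exact) Bellman update, here is tabular Q-learning on an invented 5-state chain (move right to reach a terminal reward):

import numpy as np

n_states, n_actions, gamma, alpha, eps = 5, 2, 0.9, 0.1, 0.1
Q = np.zeros((n_states, n_actions))
rng = np.random.RandomState(0)

for episode in range(500):
    s = 0
    while s != n_states - 1:
        a = rng.randint(n_actions) if rng.rand() < eps else int(np.argmax(Q[s]))
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0
        # sampled Bellman backup: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.round(Q, 2))   # action 1 (move right) dominates in every state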
Deep Reinforcement Learning
• Human

• So what’s DEEP RL?


Environment

{Raw Observation, Reward} {Actions}

169
Deep Reinforcement Learning
• Represent the value function by a deep Q-network with weights w:

Q(s, a, w) ≈ Q^π(s, a)

• Define the objective function by the mean-squared error in Q-values:

L(w_i) = E[ ( r_t + γ max_{a'} Q(s', a', w_{i−1}) − Q(s, a, w_i) )² ]

where the term r_t + γ max_{a'} Q(s', a', w_{i−1}) is the target.

• Leading to the following Q-learning gradient:

∂L(w_i)/∂w = E[ ( r_t + γ max_{a'} Q(s', a', w_{i−1}) − Q(s, a, w_i) ) ∂Q(s, a, w_i)/∂w ]
170
DQN in Atari
• End-to-end learning of values Q(s, a) from pixels
• Input state s is stack of raw pixels from last 4 frames
• Output is Q(s, a) for 18 joystick/button positions
• Reward is the change in the score for that step

Mnih, Volodymyr, et al. 2015. 171


DQN in Atari : Human Level Control

Mnih, Volodymyr, et al. 2015.

172
AlphaGO: Monte Carlo Tree Search
• MCTS: model look-ahead to reduce the search space by predicting the
opponent's moves

V^π(s) = max_a Q^π(s, a)

Silver, David, et al. 2016.


173
AlphaGO: Learning Pipeline
• Combine SL and RL to learn the search direction in MCTS

Silver, David, et al. 2016.

• SL policy Network
– Prior search probability or potential
• Rollout:
– combine with MCTS for quick simulation on leaf node
• Value Network:
– Build the Global feeling on the leaf node situation
174
Learning to Prune: SL Policy Network
• 13-layer CNN
• Input: the board position s
• Output: p_σ(a|s), where a is the next move

175
Learning to Prune: RL Policy Network

Self play

• 1 Million samples are used to train.


• RL-Policy network vs SL-Policy network:
– RL-Policy alone wins 80% of games against SL-Policy.
– Combined with MCTS, the SL-Policy network is better
• Used to derive the Value Network as the ground truth
– Making enough data for training
176
Learning to Prune: Value Network
• Regression: Similar architecture

• SL Network: Sampling to generate a unique game.


• RL Network: Simulate to get the game’s final result.

• Train: 50 million mini-batches of 32 positions


(30 million unique games)

177
AlphaGO: Evaluation

The version solely using the policy network does not perform any search
Silver, David, et al. 2016. 178
QUESTIONS?

179
