
WBL Deep Learning:: Week 1

Beate Sick, Oliver Dürr

Week 1: Introduction and technicalities

Zürich, 9/7/2020

1
Literature

• Probabilistic Deep Learning (Manning, in production)
  – Our probabilistic take
  – https://www.manning.com/books/probabilistic-deep-learning?a_aid=probabilistic_deep_learning&a_bid=78e55885

• Deep Learning Book (DL-Book), http://www.deeplearningbook.org/
  A very comprehensive book that goes far beyond the scope of this course.

• Courses
  – Convolutional Neural Networks for Visual Recognition, http://cs231n.stanford.edu
  – Martin Görner (very practical):
    https://cloud.google.com/blog/products/gcp/learn-tensorflow-and-deep-learning-without-a-phd
2
Introduction to Deep Learning
-- what’s the hype about?

3
AI, Machine Learning, Deep Learning

Slide credit: https://www.datasciencecentral.com/profiles/blogs/artificial-intelligence-vs-machine-learning-vs-deep-learning


4
Machine Perception
Kaggle dog vs cat competition
• Computers have long been quite bad at perceptual tasks that are easy for humans:
  – Images
  – Text
  – Sound

• A Kaggle contest in 2012

What happened to solve the problem?


5
Deep Learning Success Story: ImageNet 2012, 2013, 2014, 2015
• 1000 classes, 1 million samples
• Human: ~5% misclassification
• 2012: first CNN entry (A. Krizhevsky)
• 2013: only one non-CNN approach left
• 2015: it gets tougher ("and it went zoom"):
  – 4.95% Microsoft (Feb 6, surpassing human performance of 5.1%)
  – 4.8% Google (Feb 11), further improved to 3.6% (Dec)?
  – 4.58% Baidu (May 11, banned due to too many submissions)
  – GoogLeNet 6.7%
  – 3.57% Microsoft (ResNet, winner 2015)

Figure: https://medium.com/global-silicon-valley/machine-learning-yesterday-today-tomorrow-3d3023c7b519
6
Deep Learning successes

• With DL it took approx. 3 years to solve object detection and other computer
vision tasks.

• Further examples

Images from cs231n 7


What is new in the deep learning approach?

Traditional ML:
Extract handcrafted features and use these features to train / fit a model
(e.g. SVM, RF), then use the fitted model to perform classification/prediction.

Deep learning (end-to-end approach):
Deep neural networks start with raw data and learn during training/fitting to extract
appropriate hierarchical features and to use them for classification/prediction.

Low-level features → Mid-level features → High-level features

NVIDIA course 8
Focus in these lectures:
Probabilistic Viewpoint

9
Probabilistic vs deterministic models

(Figure: deterministic vs. probabilistic model outputs for "classification" and "regression".)

Probabilistic models output a conditional probability distribution (CPD) p(y|x).
10
Guiding Theme of the course

• We treat DL models as probabilistic models, as a continuation of GLMs (logistic
  regression, ...) for the CPD p(y|x).
• The models are fitted to training data with maximum likelihood (or Bayes).

Special networks for x:        Special heads for y:
• Vector → FCNN                • Classes
• Image  → CNN                 • Regression
• Text   → CNN/RNN
11
Topics

• Day 1
– Introduction to DL
– Fully connected neural networks (fcNN)
– Introduction to TensorFlow and Keras

• Day 2
– Convolutional Neural Networks (CNN) for image data
– Classification and regression with fcNNs and CNNs

• Day 3
– Probabilistic DL
– Extending the GLM with DL for scalar features and image data

• Day 4
– Extending deep GLMs by deep transformation models
– Deep interpretable ordinal regression models
12
Fully Connected Neural Networks
FCNN

13
The Single Cell: Biological Motivation

Neural networks are loosely inspired by how the brain works

14
An artificial neuron

(Figure: a single neuron with inputs x1, x2, bias input 1, weights w1, w2, bias b, pre-activation z, and output y.)

Different non-linear transformations (activation functions) are used to get from z to the output y, e.g. the sigmoid:

y = sigmoid(z) = 1 / (1 + exp(-z))

The sigmoid maps z to a number between 0 and 1, which can be interpreted as a probability.

Question: What is this model called in statistics?


Toy Task

• Task: tell fake from real banknotes (see exercise 01_nb_ch02_01)

• Banknotes are described by two features (x1, x2) extracted from an image

(Figure: the banknotes in the (x1, x2) feature plane.)
16
Exercise: Part 1

(Figure: the single-neuron network with weights w1, w2 and bias b.)

Model: The above network models the probability p1 that a given banknote is fake.

TASK
The weights (determined later by a training procedure) are given by
w1 = 0.3, w2 = 0.1, and b = 1.0.
The probability can be calculated from z using the function sigmoid(z).

1. What is the probability (rough estimate) that a banknote characterized by
   x1 = 1 and x2 = 2.2 is fake?
17
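A minimal R sketch of this computation (the one-line sigmoid() helper is ours; the weights and inputs come from the task above):

sigmoid <- function(z) 1 / (1 + exp(-z))

w1 <- 0.3; w2 <- 0.1; b <- 1.0
x1 <- 1.0; x2 <- 2.2

z  <- w1 * x1 + w2 * x2 + b   # 0.3 + 0.22 + 1.0 = 1.52
p1 <- sigmoid(z)              # approx. 0.82
p1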
GPUs love Vectors

p1 = sigmoid( [x1 x2] · (w1; w2) + b )

DL: better to have column vectors (the inputs form a row vector, the weights (w1; w2) a column vector).

18
Result*

(Figure: learned decision boundary in the (x1, x2) feature plane.)

General rule: networks without a hidden layer have a linear decision boundary.

*Details of training later 19


20
Introducing hidden layer

Notation: W_{from,to}, biases b.

We stack single neurons in layers.
We use the outputs of the neurons in one layer as inputs to the neurons of the next layer.

(Figure: network with inputs x1, x2, hidden neurons h1, h2, h3 and output p1; recall the single neuron p1 = sigmoid( [x1 x2] · (w1; w2) + b ).)

input    hidden    output

21
Introducing hidden layer

For a single hidden neuron (notation W_{from,to}, bias b^1_j):

h_1 = sigmoid( [x1 x2] · (W^1_{1,1}; W^1_{2,1}) + b^1_1 )

General:

h_j = sigmoid( sum_i x_i W^1_{i,j} + b^1_j )

Matrix notation (later we drop the vector arrows); note the column vectors:

h = sigmoid( x · W^1 + b^1 )

Complete network:

p1 = sigmoid( h · W^2 + b^2 )

Code:
h  = sigmoid(x %*% W1 + b1)
p1 = sigmoid(h %*% W2 + b2)

input    hidden    output
22
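A runnable R sketch of this two-layer forward pass; the weight values and the choice of three hidden neurons are made up for illustration, only the shapes matter:

sigmoid <- function(z) 1 / (1 + exp(-z))

x  <- matrix(c(1.0, 2.2), nrow = 1)      # input row vector, shape 1 x 2
W1 <- matrix(rnorm(2 * 3), nrow = 2)     # 2 inputs -> 3 hidden neurons
b1 <- rep(0, 3)
W2 <- matrix(rnorm(3 * 1), nrow = 3)     # 3 hidden neurons -> 1 output
b2 <- 0

h  <- sigmoid(x %*% W1 + b1)             # hidden activations, shape 1 x 3
p1 <- sigmoid(h %*% W2 + b2)             # output probability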
The benefit of hidden layers

(Figure: a network with a hidden layer of neurons h1, h2, h3 with biases, and its decision boundary.)

23
Increasing number of neurons in the hidden layer

A network with one hidden layer is a universal function approximator!

http://cs231n.github.io/neural-networks-1/ 24
DL uses many hidden layers

• Empirical observation: having more than one layer improves generalization (no overfitting).

• It is not completely understood why. Some intuition:
  – Multiple layers allow hierarchical features
  – With the same number of weights they yield more flexible networks
  – Observed in brains
25
DL vs Machine Learning Meme

https://www.reddit.com/r/ProgrammerHumor/comments/8c1i45/stack_more_layers/ 26
Experiment yourself, play at home

http://playground.tensorflow.org

Lets you explore the effect of hidden layers.

27
Structure of the network

In code:
## Solution: 2 hidden layers
hidden_1 = sigmoid(X %*% W1 + b1)
hidden_2 = sigmoid(hidden_1 %*% W2 + b2)
res      = sigmoid(hidden_2 %*% W3 + b3)

In math (f = sigmoid) and with b1 = b2 = b3 = 0:

p = f( f( f( x W1 ) W2 ) W3 )

Looks a bit like onions, matryoshka (Russian dolls) or Lego bricks.

28
Using Networks for Classification

29
So far: Logistic Regression / Binary Classification

p1 = sigmoid( [x1 x2] · (w1; w2) + b )

• The network outputs the probability for one class (logistic regression, or
  logistic regression with hidden layers).
  – In the probabilistic framework: the parameter of a Bernoulli, Y|x ~ Bern(p1(x))

• What to do with more than one class?
30
Classification: Softmax Activation

p0, p1, ..., p9 are the probabilities for the classes 0 to 9.

Incoming to the last layer: z_i, i = 0, ..., 9

p_i = exp(z_i) / sum_j exp(z_j)

• The exponential makes each outcome positive.
• Dividing by the sum ensures that the p_i sum up to one.

This activation is called softmax.

The network outputs the probabilities for the classes.

In the probabilistic framework: parameter vector p of a multinomial, Y|x ~ Mult(p(x)).

31
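A minimal softmax sketch in R; the scores z are made up, and subtracting max(z) is a common numerical-stability trick not shown on the slide:

softmax <- function(z) {
  ez <- exp(z - max(z))   # stabilizes the exponentials
  ez / sum(ez)            # positive values that sum to one
}

z <- c(2.0, 1.0, 0.1)
p <- softmax(z)
sum(p)                    # 1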
Training NN

32
Training

Input x^(i) (image)   True class y^(i)   Suggested class (shown is the most likely class)
(tiger image)         Tiger              Seal      👎
(tiger image)         Tiger              Tiger     👍
(seahorse image)      Seahorse           Seahorse  👍
...

Neural network with many weights W.

Training principle: the weights are tuned so that a loss function gets minimized:
loss = loss( y^(i), x^(i), W )

Typically ~1 million training examples.
33
Loss for classification (‘categorical cross-entropy’)

p0, p1, ..., p9 are the probabilities for the classes 0 to 9.

Definition (negative log-likelihood NLL / categorical cross-entropy):
The loss l_i of a single training example x^(i) with true label y^(i) is

l_i = -log p_model( y^(i) | x^(i) )

Notation: if the true label is 2, then p_model( y^(i) | x^(i) ) = p_2.

• Perfect, i.e. the model predicts the class of training example y^(i) with probability 1 ⇒ l_i = 0
• Worst, i.e. it predicts class y^(i) with probability 0 ⇒ l_i = ∞

For more examples, just average: loss = (1/N) * sum_i l_i
34
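A sketch of this loss for one example in R; the class probabilities below are made-up values, not network output:

p      <- c(0.02, 0.03, 0.31, 0.10, 0.08, 0.25, 0.01, 0.03, 0.09, 0.08)  # p_0 ... p_9
y_true <- 2                       # true class label (classes 0 to 9, as on the slide)
l_i    <- -log(p[y_true + 1])     # R indexing starts at 1
l_i                               # -log(0.31), approx. 1.17
# for several examples, average the individual losses: mean(c(l_1, l_2, ...))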
Training / Gradient Descent

35
Optimization in DL

The parameters of the network are the weights.

• DL has many parameters
  – The loss is optimized by simple gradient descent

• Algorithm: Stochastic Gradient Descent (SGD)*
  – Take a random batch of training examples (y^(i), x^(i)), i = 1, ..., B
  – Calculate the loss of that batch, loss( y^(i), x^(i), W )
  – Tune the weights so that the loss gets minimized a bit (gradient descent)
  – Repeat

Modern networks have billions (10^9) of weights.
Record 2020: 175 · 10^9 (GPT-3)

*aka minibatch gradient descent. 36
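A hand-rolled sketch of this loop for the single-neuron (logistic regression) model; the data is simulated, batch size, learning rate and number of steps are made up, and the gradient is the standard logistic-regression gradient rather than general backpropagation:

sigmoid <- function(z) 1 / (1 + exp(-z))

set.seed(1)
n <- 1000
X <- matrix(rnorm(2 * n), ncol = 2)                    # two features per example
y <- as.numeric(X[, 1] + 0.5 * X[, 2] + rnorm(n) > 0)  # simulated 0/1 labels

w <- c(0, 0); b <- 0
eps <- 0.1                                             # learning rate
for (step in 1:2000) {
  idx <- sample(n, 32)                                 # random mini-batch
  Xb  <- X[idx, , drop = FALSE]; yb <- y[idx]
  p   <- sigmoid(Xb %*% w + b)                         # forward pass
  grad_w <- t(Xb) %*% (p - yb) / length(yb)            # gradient of the mean NLL
  grad_b <- mean(p - yb)
  w <- w - eps * grad_w                                # tune the weights a bit
  b <- b - eps * grad_b
}
round(c(w, b), 2)                                      # learned weights and bias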


Idea of gradient descent

• Shown: the loss function for a single parameter a.

(Figure: loss curve over the parameter a, with a clear minimum.)

• Imagine you are a blinded wanderer and only know the loss and the slope at your position. How do you reach the minimum?
  – Take a large step if the slope is steep (you are far away from the minimum)
• The slope of the loss function is given by the gradient (this is a local quantity)
• Iterative update of the parameter:
  – a_{t+1} = a_t - eps * grad_a(loss)
37
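A tiny R sketch of this update rule for a made-up one-parameter loss with its minimum at a = 1:

loss      <- function(a) (a - 1)^2
grad_loss <- function(a) 2 * (a - 1)   # analytic gradient

a   <- 3.0      # starting point of the blinded wanderer
eps <- 0.1      # learning rate
for (t in 1:50) {
  a <- a - eps * grad_loss(a)          # a_{t+1} = a_t - eps * grad_a(loss)
}
a                                      # close to 1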
Proper learning rate (Important parameter for DL)

See: https://developers.google.com/machine-learning/crash-course/fitter/graph

Chapter 3: Probabilistic Deep Learning Book 38


In two dimensions

The gradient is perpendicular to the contour lines.

w_i(t) = w_i(t-1) - eps(t) * dL(w)/dw_i, evaluated at w_i = w_i(t-1)

(Figure: contour lines of the loss in the (w1, w2) plane with gradient-descent steps.)
39
Summary: Simple Network, no hidden layer

input x → score or logit z → softmax p = S(z) → 1-hot labels y

• Flatten the input image (k = 28) to a vector with k^2 elements.
• Scores / logits: x_(1,k^2) · W_(k^2,10) + b_(1,10) = z_(1,10)
• Softmax: p_k = exp(z_k) / sum_j exp(z_j)
• Loss of a single example (1-hot labels y): l_i( y^(i), x^(i), W ) = - sum_k y_k · log(p_k)
• Loss of a mini-batch: L( y^(i), x^(i), W ) = mean(l_i)
• Take a step in the direction of the descending gradient (the gradient is oriented orthogonal to the contour lines):
  w_i(t) = w_i(t-1) - eps(t) * dL(w)/dw_i, evaluated at w_i = w_i(t-1)

40
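A compact R sketch of this pipeline for one mini-batch; the images, weights and labels are random placeholders, only the shapes and the formulas follow the summary above:

softmax <- function(z) { ez <- exp(z - max(z)); ez / sum(ez) }

k <- 28
X <- matrix(runif(64 * k^2), nrow = 64)             # 64 flattened 28x28 "images"
W <- matrix(rnorm(k^2 * 10, sd = 0.01), ncol = 10)  # weights, k^2 x 10
b <- rep(0, 10)

Z <- sweep(X %*% W, 2, b, "+")                      # scores z, 64 x 10
P <- t(apply(Z, 1, softmax))                        # class probabilities, 64 x 10
y <- sample(0:9, 64, replace = TRUE)                # true labels
l <- -log(P[cbind(1:64, y + 1)])                    # per-example NLL
L <- mean(l)                                        # loss of the mini-batch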
The miracle of gradient descent in DL

The loss surface in DL is not convex, but SGD magically also works for non-convex problems.

Modern deep learning: no distinction between the network (model) and the training (SGD).

Chapter 3: Probabilistic Deep Learning Book 41


Backpropagation

• We need to calculate the derivative of the loss function loss( {X_i, Y_i}, W ) w.r.t. all weights W.
• The efficient way: backpropagation (chain rule)
  – Forward pass: propagate a training example through the network
    • Gives the output for the current configuration of the network
  – Backward pass: propagate the error back through the network
    • With the chain rule, all gradients can be calculated in a single flow from the "end"

(Figure: forward pass and backward pass through the network.)

For more see e.g. chapter 3 in Probabilistic Deep Learning 42


Typical Training Curve / ReLU

Motivation (figure from AlexNet, Krizhevsky et al. 2012): green = sigmoid, red = ReLU; ReLU gives faster convergence.

Epochs: "each training example is used once".

43
Deep Learning Frameworks

44
Recap: The first network

• The input: e.g. the intensity values of the pixels of an image

• (Almost) no pre-processing

• Output: the probability that the image belongs to a certain class

• Information is processed layer by layer from building blocks

• Arrows are weights (these need to be learned during training)

• Training requires the gradients of the loss w.r.t. the weights


45
Deep Learning Frameworks (common)

• Computation needs to be done on GPUs or specialized hardware (compute performance)

• The data structures are multidimensional arrays (tensors) which are manipulated

• Automatic calculation of the gradients
  – Static: computational graph (see chapter 3 in Probabilistic Deep Learning)
  – Dynamic: reverse-mode auto diff

• In this course: TensorFlow with Keras

46
Typical Tensors in Deep Learning

• The input can be understood as a vector.

• A mini-batch of size 64 of input vectors can be understood as a tensor of order 2:
  (index in batch, x_j)

• The weights going from e.g. layer L1 to layer L2 can be written as a matrix (often called W).

• A mini-batch of 64 images with 256 x 256 pixels and 3 colour channels can be understood as a tensor of order 4.
47
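A sketch of these shapes as R arrays; 784 is the input size used later in the course, the hidden-layer size 128 is a made-up example:

x_batch   <- array(0, dim = c(64, 784))          # mini-batch of flattened input vectors (order 2)
W         <- array(0, dim = c(784, 128))         # weights from layer L1 to layer L2 (a matrix)
img_batch <- array(0, dim = c(64, 256, 256, 3))  # mini-batch of colour images (order 4)
dim(img_batch)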
Introduction to Keras

48
Keras Workflow

Define the network (layerwise)

Add loss and optimization method

Fit network to training data

Evaluate network on test data

Use in production

49
A first run through

50
Define the network

(Code figure.) Annotations:
• Number of neurons in the first hidden layer
• Dimension of the input, here vector size 784
• Alternative version w/o pipe
• The input shape needs to be defined only at the beginning. Alternative: input_dim=784

51
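A minimal sketch of such a definition in R keras; the number of hidden neurons (100) and the activations are assumptions, only the input size 784 comes from the slide:

library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 100, activation = "sigmoid", input_shape = c(784)) %>%
  layer_dense(units = 10,  activation = "softmax")
# the slide also mentions an alternative version without the pipe and input_dim = 784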
Compile the network

52
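A sketch of the compile step for the model above; the categorical cross-entropy loss matches the NLL discussed earlier, the choice of optimizer and metric is an assumption:

model %>% compile(
  loss      = "categorical_crossentropy",  # NLL for classification
  optimizer = "sgd",                       # stochastic gradient descent
  metrics   = c("accuracy")
)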
Fit the network

20% of the data is not used for fitting the weights.

53
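A sketch of the fit call; x_train / y_train are placeholders for the training data, epochs and batch_size are assumptions, and validation_split = 0.2 corresponds to the 20% hold-out mentioned above:

history <- model %>% fit(
  x_train, y_train,
  epochs           = 10,
  batch_size       = 128,
  validation_split = 0.2   # 20% of the data is held out from weight fitting
)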
Evaluate the network

54
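A sketch of the evaluate step; x_test / y_test are placeholders for the test data of the notebook:

model %>% evaluate(x_test, y_test)   # loss and accuracy on the test set
# predictions for new data: model %>% predict(x_new)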
More layers

• Dropout
– keras.layers.Dropout
• Convolutional (see lecture on CNN)
– keras.layers.Conv2D
– keras.layers.Conv1D
• Pooling (see lecture on CNN)
– keras.layers.MaxPooling2D
• Recurrent (not in course)
– keras.layers.SimpleRNNCell
– keras.layers.GRU
– keras.layers.LSTM

55
How to use TF and Keras in the course

• Use Google Colab
  – Free resource with preinstalled (kind of) Jupyter notebooks
  – Usually for Python
  – To start an R notebook (not possible via the GUI):
    • https://colab.research.google.com/notebook#create=true&language=r

• Use RStudio
  – Installation is a bit tedious, especially for tfprobability
  – You might be lucky though

• Exercises for today
  Play around with the code, answer the questions, and ask questions if you have any.
  – Check installation / Colab:
    https://colab.research.google.com/drive/13scWAt7B3y2KxYOdyWR1XHoSTC-H1DxN?usp=sharing
  – Banknote example:
    https://colab.research.google.com/drive/1_kWrocpNxlzYYySIi__55ucwtuvgAflv?usp=sharing
  – MNIST with a simple FCNN:
    https://colab.research.google.com/drive/1GTfFpUlMJoIiU08KU268ktCM6TGbfDR2?usp=sharing
56