Machine Learning - Exercise 4: Companion Slides

This document provides an overview of backpropagation and training neural networks. It recaps the basics of neural networks and backpropagation, including calculating the derivative of the error with respect to the parameters. It then discusses implementing backpropagation with computational graphs and mini-batching. Examples are provided for calculating the derivatives in a linear/fully connected module both with and without batching. Tips are given for debugging gradients and expected results when training networks on MNIST data.


Machine Learning - Exercise 4

Companion Slides

Ali Athar, Sabarinath Mahadevan

December 6, 2018
Exercise Goal

Lecture: backpropagation for a fixed network → Recap: general backpropagation with computational graphs

This exercise is about
▸ Understanding backpropagation, deriving formulas, optimizing them
▸ Implementing a simple neural network framework yourself
▸ Digit recognition
Recap: Neural Networks
[Figure: network pipeline — inputs pass through linear modules (Θᵢ = (W, b)) and activation functions f_act (tanh, σ, ReLU); the network's output is fed together with the labels into f_loss, which gives the error rate (E)]


Parameters
▸ Training data (inputs) X = {xᵢ}, i = 1…N, with xᵢ ∊ 𝕀, N the batch size

▸ Training labels T = {tᵢ}, i = 1…N, with tᵢ ∊ 𝕆

▸ Network is a parametrized, (sub-)differentiable function F(X, Θ) : 𝕀 × ℙ → 𝕆

▸ e.g., 𝕆 = ℝ^Dim (regression), 𝕆 = [0, 1]^Dim (probabilistic classification)

▸ Loss (criterion) L(T, F(X, Θ)) : 𝕆 × 𝕆 → ℝ, put on top of the output to measure performance
▸ Find optimal parameters: Θ* = argmin_Θ L(T, F(X, Θ))
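
For instance (the slides do not fix a particular loss), a squared-error criterion for regression with 𝕆 = ℝ^Dim would be L(T, F(X, Θ)) = (1/N)·Σᵢ ‖tᵢ − F(xᵢ, Θ)‖², and Θ* is whatever parameter setting makes this sum smallest.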
Recap: Backpropagation
[Figure: same pipeline as above — inputs → network (linear modules and activation functions f_act) → network's output → f_loss with labels → error rate (E)]

▸ Optimize towards a lower error rate, i.e., lower E

▸ Take the derivative of E with respect to each module's parameters and follow the gradient
▸ Example: Gradient Descent: Θ ← Θ − λ·DΘ(E(x)), where λ is the learning rate and DΘ(E(x)) is the derivative of E w.r.t. the module's parameters Θ at point x
▸ We write DΘ(E(x)) = DΘ(E) for brevity
▸ How to calculate DΘ(E)?
▸ Go through the modules in reverse order
▸ Each module gets Dout(E), calculates DΘ(E), and passes Din(E) to the next module (a code sketch of this interface follows below)
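
A minimal sketch of that module contract in Python — the method names fprop/bprop/update mirror the pseudocode on the following slides, while the class itself and the choice of Python are assumptions, not part of the official exercise framework:

class Module:
    """Interface sketch: run data forwards, run gradients backwards, update parameters."""

    def fprop(self, x):
        # compute the module's output for input x (and cache whatever bprop will need)
        raise NotImplementedError

    def bprop(self, dE):
        # dE is Dout(E); store D_Theta(E) internally and return Din(E) for the previous module
        raise NotImplementedError

    def update(self, rate):
        # gradient-descent step on the module's own parameters, if it has any
        pass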
Example: Linear/Fully Connected Module

Module: x → y = Wᵀ·x + b, parameters Θ = (W, b)

Given: derivative with respect to the output, Dy(E)

Calculate:
▸ Derivatives with respect to the parameters Θ

Element-wise: ∂E/∂Wᵢⱼ = xᵢ · ∂E/∂yⱼ and ∂E/∂bⱼ = ∂E/∂yⱼ

(Without batching)
Example: Linear/Fully Connected Module

Module: x → y = Wᵀ·x + b, parameters Θ = (W, b)

Given: derivative with respect to the output, Dy(E)

Calculate:
▸ Derivative with respect to the input x

Element-wise: ∂E/∂xᵢ = Σⱼ Wᵢⱼ · ∂E/∂yⱼ

(Without batching)
Example: Linear/Fully Connected Module
Putting it together (module: x → y = Wᵀ·x + b, parameters Θ = (W, b)):

fprop(x):                    // run training data through (forwards)
    cache.x = x
    return Wᵀ*x + b

bprop(dE):                   // run gradients through (backwards)
    dW = cache.x * dE
    db = dE
    return dE * Wᵀ

update(rate):                // update the parameters (gradient descent)
    W = W - rate*dW
    b = b - rate*dbᵀ

(Without batching; x is a column vector, dE a row vector)
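
A possible numpy translation of this pseudocode (the shapes, the row-vector gradient convention, and the initialization are assumptions made to match y = Wᵀ·x + b; a sketch, not the reference solution):

import numpy as np

class Linear:
    """Fully connected module y = W^T x + b, without batching."""

    def __init__(self, n_in, n_out, sigma=0.01):
        self.W = sigma * np.random.randn(n_in, n_out)   # W in R^{n_in x n_out}
        self.b = np.zeros((n_out, 1))                   # b as a column vector

    def fprop(self, x):
        self.x = x                                      # cache.x = x;  x: (n_in, 1)
        return self.W.T @ x + self.b                    # y = W^T x + b:  (n_out, 1)

    def bprop(self, dE):
        self.dW = self.x @ dE                           # outer product x * dE:  (n_in, n_out)
        self.db = dE                                    # dE/dy as a row vector:  (1, n_out)
        return dE @ self.W.T                            # dE/dx as a row vector:  (1, n_in)

    def update(self, rate):
        self.W -= rate * self.dW                        # gradient-descent step
        self.b -= rate * self.db.T

Caching x in fprop is what allows bprop to form the outer product x·dE without re-running the forward pass.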
Mini-Batching
▸ Batch learning
▸ All training samples processed at once, parameters updated once at the end
▸ Stable, well understood, many acceleration techniques, but slow
▸ Stochastic learning
▸ Each training sample separately, parameters updated at each step
▸ Noisy (though may lead to better results), fast
▸ Mini-batching
▸ Middle ground, batches of data processed, bundled updates
▸ Combine advantages, reduce drawbacks
▸ Example (see the numpy snippet below)
▸ Linear module f with input dimension Nin and output dimension Nout, batch size n
▸ Stack the n inputs row-wise into a mini-batch matrix X ∊ ℝ^(n×Nin); then Y = X·W + 1ₙ·bᵀ ∊ ℝ^(n×Nout), i.e., b is broadcast (repeated) across the n rows
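
As a quick illustration of that broadcast (all names here are placeholders; the row-wise layout of X is the assumption from above):

import numpy as np

n, n_in, n_out = 4, 3, 2                  # batch size, input dim, output dim
X = np.random.randn(n, n_in)              # mini-batch matrix: one sample per row
W = np.random.randn(n_in, n_out)
b = np.random.randn(n_out)

Y = X @ W + b                             # numpy repeats b across the n rows
Y_rows = np.stack([W.T @ X[k] + b for k in range(n)])   # per-sample y_k = W^T x_k + b
assert np.allclose(Y, Y_rows)             # both give the same (n, n_out) output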
Batching Update Rule
▸ (Mini-)Batch learning
▸ Multiple samples processed at once
▸ Calculate gradient for each sample, but don’t update the parameters
▸ After processing the batch, update using a sum of all gradients
▸ Learning rate has to be adapted, e.g., divide E by batch size

▸ Example: Gradient Descent
Θ ← Θ − λ·Σₖ DΘ(E)(xₖ), where DΘ(E)(xₖ) is the derivative of E w.r.t. the parameters Θ at point xₖ

▸ To make things easier, we write DΘ(E) = Σₖ DΘ(E)(xₖ) for the gradient summed over the batch

Example: Linear/Fully Connected Module - Batching

Module: xₖ → yₖ = Wᵀ·xₖ + b for the k-th element in the batch, parameters Θ = (W, b)

Given: derivatives with respect to the outputs (plural!), assumed to be given row-wise as a matrix dY ∊ ℝ^(n×Nout) whose k-th row is Dyₖ(E)

Calculate:
▸ Derivatives with respect to the parameters Θ

DW(E) = Xᵀ·dY (the per-sample outer products summed over the batch), Db(E) = Σₖ dYₖ (column-wise sum of dY)
Example: Linear/Fully Connected Module - Batching

Module: xₖ → yₖ = Wᵀ·xₖ + b for the k-th element in the batch, parameters Θ = (W, b)

Given: derivatives with respect to the outputs (plural!), assumed to be given row-wise as a matrix dY ∊ ℝ^(n×Nout) whose k-th row is Dyₖ(E)

Calculate:
▸ Derivatives with respect to the inputs xₖ

DX(E) = dY·Wᵀ, whose k-th row is the derivative w.r.t. xₖ (a numerical check follows below)
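
A sketch that checks these batched formulas against the per-sample ones (all variable names are placeholders; dY stands for the row-wise output derivatives described above):

import numpy as np

n, n_in, n_out = 5, 4, 3
X = np.random.randn(n, n_in)              # inputs, row-wise
W = np.random.randn(n_in, n_out)
dY = np.random.randn(n, n_out)            # dE/dy_k, row-wise

dW = X.T @ dY                             # sum over k of the outer products x_k * dE/dy_k
db = dY.sum(axis=0)                       # sum over k of dE/dy_k
dX = dY @ W.T                             # row k holds dE/dx_k

dW_per_sample = sum(np.outer(X[k], dY[k]) for k in range(n))
assert np.allclose(dW, dW_per_sample)     # batched and per-sample gradients agree
assert np.allclose(dX[0], W @ dY[0])      # first row matches the unbatched formula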
Example: Training a Network

network = [module1, module2, …, moduleN], loss = floss

for X, T in batched(inputs, labels) do
    z = X
    for module in network do
        z = module.fprop(z)                // forward pass
    end for
    E = loss.fprop(z, T)
    dz = loss.bprop(1/batchSize)           // normalization for the batch size
    for module in reversed(network) do
        dz = module.bprop(dz)              // backward pass
    end for
    for module in network do
        module.update(rate)                // parameter update
    end for
end for
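
The same loop in Python, assuming modules with the fprop/bprop/update interface sketched earlier and a hypothetical batched() helper that yields (X, T) mini-batches (none of these names are prescribed by the exercise):

def train_epoch(network, loss, inputs, labels, rate, batch_size):
    """One pass over the training data: forward, backward, then update every module."""
    for X, T in batched(inputs, labels, batch_size):    # hypothetical mini-batch iterator
        z = X
        for module in network:                          # forward pass
            z = module.fprop(z)
        E = loss.fprop(z, T)                            # scalar error, useful for monitoring
        dz = loss.bprop(1.0 / batch_size)               # normalize for the batch size
        for module in reversed(network):                # backward pass
            dz = module.bprop(dz)
        for module in network:                          # gradient-descent step
            module.update(rate)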
Debugging Tip: Gradient Checking
Check the Jacobian J from bprop against numerical differentiation

▸ Numerical approach: column-wise, e.g., the first column is J·e₁ ≈ (f(x + ε·e₁) − f(x − ε·e₁)) / (2ε)

▸ Backprop: row-wise, e.g., calling bprop with Dout(E) = e₁ᵀ returns the first row of J

▸ Advice
▸ Use (small) random x
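
A minimal check along these lines (ε, the tolerance, and the column-vector module interface are assumptions; the central difference is one common choice):

import numpy as np

def jacobian_from_bprop(module, x, n_in, n_out):
    """Row i of J is what bprop returns for the unit row vector e_i^T."""
    rows = []
    for i in range(n_out):
        module.fprop(x)                          # refresh the cache for input x
        e = np.zeros((1, n_out)); e[0, i] = 1.0
        rows.append(module.bprop(e))             # shape (1, n_in)
    return np.vstack(rows)                       # shape (n_out, n_in)

def jacobian_numerical(module, x, n_in, n_out, eps=1e-6):
    """Column j of J is the central difference along the j-th input direction."""
    J = np.zeros((n_out, n_in))
    for j in range(n_in):
        e = np.zeros((n_in, 1)); e[j] = eps
        J[:, j] = ((module.fprop(x + e) - module.fprop(x - e)) / (2 * eps)).ravel()
    return J

def grad_check(module, n_in, n_out, tol=1e-5):
    x = 0.1 * np.random.randn(n_in, 1)           # small random input, as advised
    diff = jacobian_from_bprop(module, x, n_in, n_out) - jacobian_numerical(module, x, n_in, n_out)
    return np.max(np.abs(diff)) < tol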
Expected Results/Tips for MNIST
▸ [Linear(28x28, 10), Softmax]
▸ should give around 750 errors
▸ [Linear(28x28, 200), tanh, Linear(200, 10), Softmax]
▸ should give around 250 errors
▸ Typical learning rates
▸ λ ∊ [0.01, 0.1]
▸ Typical batch sizes
▸ NB ∊ [100, 1000]
▸ Weight initialization
▸ W ∊ ℝ^(M×N)
▸ W ~ N(0, σ²), i.e., sampled from a normal distribution around 0 with standard deviation σ
▸ b = 0
▸ Pre-process the data
▸ Divide values by 255 (= max pixel value)
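
A short sketch of this setup (σ is left as a parameter because the slides do not spell out its value; the function names are placeholders):

import numpy as np

def preprocess(images):
    """Scale raw MNIST pixel values from [0, 255] down to [0, 1]."""
    return images.astype(np.float64) / 255.0

def init_linear(n_in, n_out, sigma):
    """W ~ N(0, sigma^2) and b = 0; the value of sigma is not specified in the slides."""
    W = sigma * np.random.randn(n_in, n_out)
    b = np.zeros(n_out)
    return W, b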
