CNN with TensorFlow

The document discusses stochastic gradient descent (SGD) optimization and building a convolutional neural network with TensorFlow. It explains how SGD samples mini-batches from a dataset and uses them to calculate noisy gradients, which can regularize models and make training faster than traditional gradient descent. It also covers techniques such as data shuffling, early stopping and dropout, and walks through the TensorFlow API calls and parameter counts for an example CNN.


Stochastic Gradient Descent Optimization

Stochastic Gradient Descent (SGD)
Stochastically sample “mini-batches” Bj from the dataset D.
A mini-batch Bj can contain as few as a single sample.

● Much faster than Gradient Descent
● Results are often better
● Also suitable for datasets that change over time
● Variance of the gradients increases as the batch size decreases
SGD is often better
Gradient Descent (GD) vs
Stochastic Gradient Descent (SGD)
Gradient Descent ==> complete gradients
● Complete gradients fit optimally the (arbitrary) data we have, not the distribution that generates them
● They treat all training samples as the “absolute representative” of the input distribution
● Suitable for traditional optimization problems: “find the optimal route”
● Assumption: test data will be no different from the training data
For ML we cannot make this assumption ==> test data are always different
Gradient Descent (GD) vs
Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent
● SGD is preferred over Gradient Descent
● (A bit) noisy gradients act as regularization
● Stochastic gradients ==> the sampled training data give roughly representative gradients
● The model does not overfit to the particular training samples
● Training is orders of magnitude faster; on real datasets Gradient Descent is not even realistic
● How many samples per mini-batch? A hyper-parameter, set by trial & error, usually between 32 and 256 samples (see the sketch below)
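To make the mini-batch loop concrete, here is a minimal NumPy sketch of SGD; the gradient function grad, the learning rate, batch size and epoch count are illustrative assumptions rather than values taken from these slides.

import numpy as np

def sgd(X, y, grad, w, lr=0.01, batch_size=128, epochs=10):
    # X: (N, d) data, y: (N,) targets, grad(w, Xb, yb): mini-batch gradient (assumed given)
    N = X.shape[0]
    for _ in range(epochs):
        order = np.random.permutation(N)          # reshuffle every epoch
        for start in range(0, N, batch_size):
            batch = order[start:start + batch_size]
            g = grad(w, X[batch], y[batch])       # noisy gradient from one mini-batch
            w = w - lr * g                        # SGD update step
    return w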


Gradient Descent (GD) vs
Stochastic Gradient Descent (SGD)
SGD for dynamically changing datasets
Assume 1 million new images are uploaded per week on Instagram and we want to build a “cool picture” classifier.
● Should “cool pictures” from the previous year have just as much influence?
● No, the learning machine should track these changes
With GD these changes go undetected, as the results are averaged over the many more “past” samples: the past “over-dominates”.
A properly implemented SGD can track changes much better and give better models.
Data Preprocessing
Shuffling
● Applicable only with SGD
● Choose samples with maximum information content
● Mini-batches should contain examples from different classes
● Prefer samples likely to generate larger errors
– Otherwise gradients will be small ==> slower learning
– Check the errors from previous rounds and prefer “hard examples”
– Don’t overdo it though; beware of outliers
● In practice, split your dataset into mini-batches
● Make each mini-batch as class-divergent and rich as possible (see the sketch below)
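A hypothetical NumPy sketch of the “class-divergent mini-batch” idea: shuffle the indices within each class and interleave the classes round-robin before cutting the stream into mini-batches. The function and variable names are illustrative, not from the slides.

import numpy as np

def class_balanced_batches(X, y, batch_size):
    # Shuffle indices within each class, then interleave the classes
    per_class = [np.random.permutation(np.where(y == c)[0]) for c in np.unique(y)]
    longest = max(len(ix) for ix in per_class)
    order = np.asarray([ix[i] for i in range(longest) for ix in per_class if i < len(ix)])
    # Each mini-batch now mixes samples from different classes
    for start in range(0, len(order), batch_size):
        sel = order[start:start + batch_size]
        yield X[sel], y[sel]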
Data Preprocessing
Early Stopping
Dropout
Activation Functions
Sigmoid-Like Activation Function
Rectified Linear Unit (ReLU)

● Very popular in computer vision and speech recognition
● Much faster computation of activations and gradients
● No vanishing or exploding gradient problems; only comparison, addition and multiplication are needed
● People claim biological plausibility
● No saturation
ReLU convergence rate (w.r.t. Tanh)
Other ReLUs
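A small TensorFlow 1.x sketch of ReLU and two common “other ReLUs”; the input values are only an illustration.

import tensorflow as tf

x = tf.constant([-2.0, -0.5, 0.0, 1.5])

relu = tf.nn.relu(x)                 # max(0, x): only a comparison, no saturation for x > 0
relu6 = tf.nn.relu6(x)               # min(max(x, 0), 6): clipped variant
leaky_relu = tf.maximum(0.1 * x, x)  # leaky ReLU with slope 0.1 for negative inputs

with tf.Session() as sess:
    print(sess.run([relu, relu6, leaky_relu]))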
Learning Rate
Learning rate schedules
Learning rate in practice
Better Optimization
Momentum
Adagrad (Adaptive Gradient)
RMSprop
(Root Mean Square)
Adam
(Adaptive Moment Estimation)
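For reference, the TF 1.x optimizers corresponding to these slides can be constructed as below; loss is assumed to be an already-defined scalar tensor and the learning rates are only placeholder values.

import tensorflow as tf

# loss = ...  (a scalar tensor defined elsewhere)
sgd_step      = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(loss)
momentum_step = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9).minimize(loss)
adagrad_step  = tf.train.AdagradOptimizer(learning_rate=0.01).minimize(loss)
rmsprop_step  = tf.train.RMSPropOptimizer(learning_rate=0.001).minimize(loss)
adam_step     = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(loss)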
Convolutional Neural Network (CNN)
With TensorFlow
Summation of a Matrix and a Vector
[X] is 7*3, [W] is 3*4, and b is 1*4.
[X] * [W] is 7*4; b is repeated 7 times (broadcast over the rows), so [X] * [W] + b is also 7*4.
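The same broadcast written as TF 1.x code (shapes chosen to match the figure; the constant values are arbitrary):

import tensorflow as tf

X = tf.ones([7, 3])     # 7*3
W = tf.ones([3, 4])     # 3*4
b = tf.zeros([1, 4])    # 1*4

XW = tf.matmul(X, W)    # 7*4
out = XW + b            # b is broadcast (repeated over the 7 rows) -> 7*4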
tf.zeros
tf.zeros(shape, dtype=tf.float32, name=None)
Args:
● shape: Either a list of integers, or a 1-D Tensor of type int32.
● dtype: The type of an element in the resulting Tensor.
● name: A name for the operation (optional).
● tf.zeros([3, 4], tf.int32) ==> [ [0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0] ]
● tf.zeros(4) ==> [0, 0, 0, 0]
tf.nn.conv2d
tf.nn.conv2d (input, filter, strides, padding,
use_cudnn_on_gpu=None, data_format=None, name=None)
Computes a 2-D convolution given 4-D input and filter tensors.
Given an input tensor of shape [batch, in_height, in_width,
in_channels] and a filter / kernel tensor of shape [filter_height,
filter_width, in_channels, out_channels], this op performs the
following:
● Flattens the filter to a 2-D matrix with shape [filter_height *
filter_width * in_channels, output_channels].
● Extracts image patches from the input tensor to form a
virtual tensor of shape [batch, out_height, out_width,
filter_height * filter_width * in_channels].
tf.nn.conv2d
tf.nn.conv2d (input, filter, strides, padding, use_cudnn_on_gpu=None,
data_format=None, name=None)
Args:
● input: A Tensor. Must be one of the following types: half, float32, float64.
● filter: A Tensor. Must have the same type as input.
● strides: A list of ints. 1-D of length 4. The stride of the sliding window for each
dimension of input. Must be in the same order as the dimension specified with
format.
● padding: A string from: "SAME", "VALID". The type of padding algorithm to use.
● use_cudnn_on_gpu: An optional bool. Defaults to True.
● data_format: An optional string from: "NHWC", "NCHW". Defaults to "NHWC".
Specify the data format of the input and output data. With the default format
"NHWC", the data is stored in the order of: [batch, in_height, in_width,
in_channels]. Alternatively, the format could be "NCHW", the data storage order
of: [batch, in_channels, in_height, in_width].
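A minimal usage sketch with shapes matching the example network later in these slides; the placeholder and variable names are illustrative.

import tensorflow as tf

images = tf.placeholder(tf.float32, [None, 28, 28, 3])   # [batch, in_height, in_width, in_channels]
kernel = tf.Variable(tf.truncated_normal([5, 5, 3, 32],  # [filter_height, filter_width, in_channels, out_channels]
                                         stddev=0.1))

conv = tf.nn.conv2d(images, kernel, strides=[1, 1, 1, 1], padding='SAME')
# With 'SAME' padding and stride 1 the spatial size is kept: conv is [batch, 28, 28, 32]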
tf.nn.max_pool
tf.nn.max_pool(value, ksize, strides, padding,
data_format='NHWC', name=None)
Args:
● value: A 4-D Tensor with shape [batch, height, width, channels]
and type tf.float32.
● ksize: A list of ints that has length >= 4. The size of the window for
each dimension of the input tensor.
● strides: A list of ints that has length >= 4. The stride of the sliding
window for each dimension of the input tensor.
● padding: A string, either 'VALID' or 'SAME'. The padding algorithm.
● data_format: A string. 'NHWC' and 'NCHW' are supported.
● name: Optional name for the operation.
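Continuing the hypothetical conv2d sketch above, a 2*2 max-pool with stride 2 halves the spatial size:

pool = tf.nn.max_pool(conv, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
# pool is [batch, 14, 14, 32]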

tf.reshape
tf.reshape(tensor, shape, name=None)
Reshapes a tensor.
Given tensor, this operation returns a tensor that has the
same values as tensor with shape shape.
If one component of shape is the special value -1, the size
of that dimension is computed so that the total size remains
constant. In particular, a shape of [-1] flattens into 1-D.
At most one component of shape can be -1.
If x has 117600 elements, then
tf.reshape(x, [-1, 28, 28, 3]) will produce 50
images, each of size 28*28 with 3 channels.
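As a quick check of the example above (117600 = 50 * 28 * 28 * 3):

import tensorflow as tf

x = tf.zeros([117600])
x_images = tf.reshape(x, [-1, 28, 28, 3])   # shape becomes [50, 28, 28, 3]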
tf.nn.dropout
tf.nn.dropout(x, keep_prob, noise_shape=None, seed=None, name=None)
Args:
● x: A tensor.
● keep_prob: A scalar Tensor with the same type as x. The probability that
each element is kept.
● noise_shape: A 1-D Tensor of type int32, representing the shape for
randomly generated keep/drop flags.
● seed: A Python integer. Used to create random seeds.
● name: A name for this operation (optional).
With probability keep_prob, outputs the input element scaled up by 1 /
keep_prob, otherwise outputs 0. The scaling is so that the expected sum is
unchanged.
By default, each element is kept or dropped independently.
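A tiny sketch of the scaling behaviour (keep_prob = 0.5 is only an example value):

import tensorflow as tf

x = tf.ones([1, 10])
keep_prob = tf.placeholder(tf.float32)
x_drop = tf.nn.dropout(x, keep_prob)   # kept entries become 1/keep_prob, dropped entries become 0

with tf.Session() as sess:
    print(sess.run(x_drop, feed_dict={keep_prob: 0.5}))   # a mix of 0.0 and 2.0 values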
Example network, step by step:
● Input image: 28*28 pixels, 3 channels (28*28*3)
● Convolve with 32 filters, each 5*5*3
● Output: 32 feature maps, each 28*28 (using padding)
● ReLU activation function
● Pooling (2*2)
● Output after pooling: 32 feature maps, each 14*14
● Convolve the 32 maps with 64 filters, each 5*5*32
● Output: 64 feature maps, each 14*14 (using padding)
● ReLU activation function
● Pooling (2*2)
● Output after pooling: 64 feature maps, each 7*7
● Flatten to one feature vector of size 7*7*64
● Fully connected (FC) hidden layer: 1024 neurons
● Fully connected (FC) output layer: 10 neurons
● Softmax activation function, one-hot output
● Cross-entropy loss function against the one-hot training labels
Parameter Calculations
● Input: 28*28*3 image
● Conv layer 1: 32 filters, each 5*5*3
– Filter parameters: (5*5*3) * 32 + 32 {bias}
– Feature map size: (28*28) * 32
● ReLU activation, then pooling (2*2)
– Feature map size: (14*14) * 32
● Conv layer 2: 64 filters, each 5*5*32
– Filter parameters: (5*5*32) * 64 + 64 {bias}
– Feature map size: (14*14) * 64
● ReLU activation, then pooling (2*2)
– Feature map size: (7*7) * 64
● Flattened feature vector size: 7*7*64 = 3136
● Hidden layer (1024 neurons) weight parameters: [7*7*64] * 1024 + 1024 {bias}
● Output layer (10 neurons) weight parameters: 1024 * 10 + 10 {bias}
● Softmax activation function, one-hot output, cross-entropy loss

Total Number of Parameters
Conv_1 = (5*5*3) * 32 + 32 = 2,432
Conv_2 = (5*5*32) * 64 + 64 = 51,264
FC_Hidden = [7*7*64] * 1024 + 1024 = 3,212,288
FC_Output = 1024 * 10 + 10 = 10,250
Total = 3,276,234
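A quick plain-Python check of these counts:

conv_1    = 5 * 5 * 3 * 32 + 32           # 2,432
conv_2    = 5 * 5 * 32 * 64 + 64          # 51,264
fc_hidden = 7 * 7 * 64 * 1024 + 1024      # 3,212,288
fc_output = 1024 * 10 + 10                # 10,250
total = conv_1 + conv_2 + fc_hidden + fc_output   # 3,276,234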
Implementation Guide
For the Convolutional part, define:
● Convolutional Network Architecture
● Number of Conv. Layers
● Number of filters in each layer
● Filter Parameters
● Bias Parameters
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

# Convolution Layer 1 parameters (filters + bias): 32 filters, each 5*5*3
W_conv1 = weight_variable([5, 5, 3, 32])
b_conv1 = bias_variable([32])

# Convolution Layer 2 parameters (filters + bias): 64 filters, each 5*5*32
W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])
For the Convolutional part, apply:
● 2-D Convolution (on each Layer)
● ReLU Activation function
● Max-Pool
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

# Layer 1: convolution, ReLU activation, 2x2 max-pooling
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

# Layer 2: convolution, ReLU activation, 2x2 max-pooling
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)
Define Fully Connected Layers:
● Number of neurons in the fully connected layer
● Initialize weights and biases for the hidden and output layers
● Dropout layer (if required)
# No dropout layer in this version
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

# Hidden layer: 7*7*64 inputs, 1024 neurons
W_FC1 = weight_variable([7 * 7 * 64, 1024])
b_FC1 = bias_variable([1024])

# Output layer: 1024 inputs, 10 neurons
W_FC2 = weight_variable([1024, 10])
b_FC2 = bias_variable([10])

Total Number of Parameters
Conv_1 = (5*5*3) * 32 + 32 = 2,432
Conv_2 = (5*5*32) * 64 + 64 = 51,264
FC_Hidden = [7*7*64] * 1024 + 1024 = 3,212,288
FC_Output = 1024 * 10 + 10 = 10,250
Total = 3,276,234
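The training code that follows uses h_fc1, which the slides never define explicitly. A minimal sketch, assuming h_pool2 is the [batch, 7, 7, 64] output of the second pooling layer:

h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])           # flatten to the 3136-value feature vector
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_FC1) + b_FC1)     # 1024-neuron hidden layer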
# Version with a dropout layer after the hidden layer
# (weight_variable and bias_variable as defined above)
W_FC1 = weight_variable([7 * 7 * 64, 1024])
b_FC1 = bias_variable([1024])

# Dropout layer
keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

W_FC2 = weight_variable([1024, 10])
b_FC2 = bias_variable([10])
Train the Network
● Feed forward: calculate the estimated output
● Apply “Softmax” on the output
● Loss calculation (cross-entropy)
● Optimization (minimize the loss and update the weights)
# y_conv: network output (logits); y_: one-hot training labels
y_conv = tf.matmul(h_fc1_drop, W_FC2) + b_FC2

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv))

train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
Compute accuracy from the network output y_conv and the one-hot labels y_:

correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))

accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
Model Evaluation
(Training Data / Test Data)
The evaluation path through the graph:

keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

y_conv = tf.matmul(h_fc1_drop, W_FC2) + b_FC2

correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))

accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

keep_prob should be 1.0 for test data and < 1.0 for training data.
Model Training
(Use Training Data)
The training path through the graph:

keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

y_conv = tf.matmul(h_fc1_drop, W_FC2) + b_FC2

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv))

train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
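A minimal training/evaluation loop sketch, not shown in the slides; the placeholders x and y_ and the batches() helper are hypothetical names for the input pipeline.

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step, (batch_x, batch_y) in enumerate(batches()):
        # Training step: dropout active (keep_prob < 1.0)
        sess.run(train_step, feed_dict={x: batch_x, y_: batch_y, keep_prob: 0.5})
        if step % 100 == 0:
            # Evaluation: dropout disabled (keep_prob = 1.0)
            acc = sess.run(accuracy, feed_dict={x: batch_x, y_: batch_y, keep_prob: 1.0})
            print('step %d, accuracy %.3f' % (step, acc))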
Useful Operations in TensorFlow

Some Useful Arithmetic Operations
Operation      Description
tf.add         addition
tf.sub         subtraction
tf.mul         multiplication
tf.div         division
tf.mod         modulo
tf.abs         returns the absolute value
tf.neg         returns the negative value
tf.sign        returns the sign
tf.inv         returns the inverse
tf.square      calculates the square
tf.round       returns the nearest integer
tf.sqrt        calculates the square root
tf.pow         calculates the power
tf.exp         calculates the exponential
tf.log         calculates the logarithm
tf.maximum     returns the maximum
tf.minimum     returns the minimum
tf.cos         calculates the cosine
tf.sin         calculates the sine
Some Useful Matrix Operations
Operation               Description
tf.diag                 returns a diagonal tensor with the given diagonal values
tf.transpose            returns the transpose of the argument
tf.matmul               returns the matrix product of the two tensors passed as arguments
tf.matrix_determinant   returns the determinant of the square matrix specified as an argument
tf.matrix_inverse       returns the inverse of the square matrix specified as an argument
