
Improving Performance – Hacks & Tricks

Data Preparation
Data pre-processing techniques generally refer to the addition, deletion, or transformation of training set data.

Data Cleaning: Identifying and correcting mistakes or errors in the data.
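As an illustration only (not from the slides), a minimal data-cleaning sketch in Python, assuming pandas is available and a hypothetical file data.csv with hypothetical columns "label" and "age":

import pandas as pd

df = pd.read_csv("data.csv")                       # hypothetical input file
df = df.drop_duplicates()                          # remove duplicate rows
df = df.dropna(subset=["label"])                   # drop rows with a missing label
df["age"] = df["age"].fillna(df["age"].median())   # impute a numeric column
df = df[df["age"] >= 0]                            # remove obviously invalid values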


Data Transformation
Data Transforms: Changing the scale or distribution of variables.
Normalized to 0 to 1.
Rescaled to -1 to 1.
Standardized (rescaled to zero mean and unit variance).
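A minimal sketch of these three transforms, assuming scikit-learn is available and X is a numeric NumPy array:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[10.0], [20.0], [30.0], [40.0]])                 # toy data

X_01  = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)    # normalized to 0..1
X_11  = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)   # rescaled to -1..1
X_std = StandardScaler().fit_transform(X)                      # zero mean, unit variance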
Big Data

Data augmentation refers to techniques that increase the amount of training data by adding slightly modified copies of existing data, or newly created synthetic data derived from existing data.
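A minimal sketch of image-style augmentation using only NumPy (illustrative; real pipelines typically use library transforms, and the 'images' list is assumed to exist):

import numpy as np

def augment(image, rng):
    """Return a slightly modified copy of a (H, W) image array."""
    out = image.copy()
    if rng.random() < 0.5:
        out = np.fliplr(out)                         # random horizontal flip
    out = out + rng.normal(0.0, 0.01, out.shape)     # small Gaussian noise
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
augmented = [augment(img, rng) for img in images]    # 'images' assumed to exist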
Data Split

70% train, 15% val, 15% test.


80% train, 10% val, 10% test.
60% train, 20% val, 20% test.
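A minimal sketch of a 70/15/15 split, assuming scikit-learn is available and arrays X, y already exist:

from sklearn.model_selection import train_test_split

# First carve out 70% for training, then split the remaining 30% in half.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)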
Loss Function

The function we want to minimize or maximize is called the objective function or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function.
Performance Metrics

Accuracy in classification problems is the number of correct predictions made by the model divided by the total number of predictions made.
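A minimal sketch of computing accuracy from predictions, using NumPy with toy labels:

import numpy as np

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

accuracy = np.mean(y_pred == y_true)   # correct predictions / all predictions
print(accuracy)                        # 0.8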
Babysitting
Loss Optimization Function
Learning Rate

(Figure: Loss vs. Learning Rate)

Ideal Curves
Early Stopping

(Figure: validation loss curve used for early stopping)
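A minimal sketch of early stopping on the validation loss, assuming TensorFlow/Keras is available, an already compiled model, and the split arrays from earlier:

from tensorflow.keras.callbacks import EarlyStopping

stop = EarlyStopping(monitor="val_loss",          # watch the validation curve
                     patience=5,                  # allow 5 epochs without improvement
                     restore_best_weights=True)   # roll back to the best epoch

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=200,
          callbacks=[stop])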
How many Neurons?

(Diagram: inputs X, weights W, outputs Y)

Finding the weights amounts to solving a system of simultaneous equations.


How many Neurons?

Size of Training Data

There should be x independent examples for each parameter in the model, where x could be in the tens (e.g. 10).
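A minimal sketch of this rule of thumb (x = 10 examples per parameter) for a small dense network with hypothetical layer widths:

# Parameters of a dense network: weights + biases per layer.
layers = [4, 8, 3]   # hypothetical widths: 4 inputs, 8 hidden, 3 outputs
params = sum(n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:]))
print(params)        # (4*8 + 8) + (8*3 + 3) = 67 parameters
print(10 * params)   # rule of thumb: roughly 670 independent training examples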
How many Neurons?

Problem Complexity

How many Layers?
Big Network
ImageNet Challenge
Over-fitting
Under-fitting
Good-fit
Big Network with Dropout
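A minimal sketch of a big network with dropout, assuming TensorFlow/Keras is available; the widths and dropout rates are illustrative:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(512, activation="relu", input_shape=(100,)),
    Dropout(0.5),                      # randomly drop 50% of activations while training
    Dense(512, activation="relu"),
    Dropout(0.5),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])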
Regularization
W1 = 0, W2 = 0, W3 = 0, …

(Regularization penalizes large weights, pushing unneeded weights toward zero.)
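A minimal sketch of weight regularization, assuming TensorFlow/Keras; L1 tends to push unneeded weights to exactly zero, while L2 shrinks them toward zero:

from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dense

layer_l1 = Dense(64, activation="relu", kernel_regularizer=regularizers.l1(0.01))
layer_l2 = Dense(64, activation="relu", kernel_regularizer=regularizers.l2(0.01))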
Which Activation Function?
Recall: Back Propagation

Forward pass: x → f1 → y1 → f2 → y2 → … → J(w), with weights w1 and w2.

$$\frac{\partial J(w)}{\partial w_1} = \frac{\partial J(w)}{\partial y_2}\cdot\frac{\partial y_2}{\partial y_1}\cdot\frac{\partial y_1}{\partial w_1}$$
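A minimal NumPy sketch of this chain rule for a two-step chain x → y1 → y2 → J, using sigmoid activations and a squared-error objective as illustrative choices:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, w1, w2, target = 0.5, 0.3, -0.7, 1.0

y1 = sigmoid(w1 * x)            # f1
y2 = sigmoid(w2 * y1)           # f2
J  = 0.5 * (y2 - target) ** 2   # simple squared-error objective

dJ_dy2  = y2 - target
dy2_dy1 = y2 * (1 - y2) * w2
dy1_dw1 = y1 * (1 - y1) * x

dJ_dw1 = dJ_dy2 * dy2_dy1 * dy1_dw1   # dJ/dw1 = dJ/dy2 * dy2/dy1 * dy1/dw1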
Which Activation Function?

Data Scaling: -1 to +1 (matches the output range of tanh)

Data Scaling: 0 to 1 (matches the output range of sigmoid)
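A minimal NumPy sketch of why the data/target scaling should match the output activation's range (sigmoid outputs values in (0, 1), tanh in (-1, +1)):

import numpy as np

z = np.linspace(-5, 5, 11)

sig  = 1.0 / (1.0 + np.exp(-z))   # values lie in (0, 1)  -> scale targets to 0..1
tanh = np.tanh(z)                 # values lie in (-1, 1) -> scale targets to -1..+1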
Weight Initialization?

In general practice, biases are initialized to 0 and weights are initialized with small numbers drawn randomly from a Gaussian or uniform distribution, e.g. in the range [0, 1], [-1, 1], or [-0.3, 0.3].
Weight Initialization?

Sigmoid / Tanh

In the Xavier technique, weights are initialized with small numbers drawn randomly from a uniform probability distribution in the range -(1/sqrt(n)) to 1/sqrt(n), where n is the number of inputs to the neuron.
Weight Initialization?

ReLU

In the He Normal technique, weights are initialized with small numbers drawn randomly from a Gaussian probability distribution with a mean of 0.0 and a standard deviation of sqrt(2/n), where n is the number of inputs to the neuron.
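A minimal NumPy sketch of the Xavier (uniform) and He (normal) schemes described above; n is the number of inputs feeding each neuron, and the layer sizes are illustrative:

import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 64                                           # n inputs feeding m neurons

limit = 1.0 / np.sqrt(n)
w_xavier = rng.uniform(-limit, limit, size=(n, m))       # for sigmoid / tanh

w_he = rng.normal(0.0, np.sqrt(2.0 / n), size=(n, m))    # for ReLU

b = np.zeros(m)                                          # biases start at 0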
Weight Initialization?

Each time a neural network is initialized with a different set of weights, it starts from a different point and may converge to a different final set of weights with different performance characteristics.
Which Loss Function?

Pred = 0.8, Actual = 1


Which Loss Function?
$$J(w) = -\frac{1}{n}\sum_{i}\Big[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \Big]$$

Pred = 0.8, Actual = 1

Binary Cross Entropy will calculate a score that summarizes the average difference between the actual and predicted probability distributions for predicting class 1. The score is minimized, and a perfect cross-entropy value is 0.
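A minimal NumPy sketch of this formula for the single example above (Pred = 0.8, Actual = 1):

import numpy as np

y_true = np.array([1.0])
y_pred = np.array([0.8])

bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
print(bce)   # ~0.223; a perfect prediction would give 0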
Which Loss Function?

Multi-class Cross Entropy will calculate a score that summarizes the average difference between the actual and predicted probability distributions over all classes in the problem. The score is minimized, and a perfect cross-entropy value is 0.
Which Loss Optimization Function?
Pred = 0.8, Actual = 1 → J(W)
Batch

Pred = 0.5, Actual = 1 → J(W)
…
Use the average of J(W) over all training examples.
Mini-Batch

Pred = 0.6, Actual = 1 → J(W)
Pred = 0.7, Actual = 1 → J(W)
…
Use the average of J(W) over a small subset (mini-batch) of training examples.
Larger Batch Size

Pred = 0.6, Actual = 1 → J(W)
…
Use the average of J(W) over a larger subset of training examples.
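A minimal NumPy sketch of mini-batch averaging: each update uses the average loss over a small batch rather than a single example or the full training set (the predictions here are a stand-in for a real model):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

batch_size = 32
idx = rng.permutation(len(X))

for start in range(0, len(X), batch_size):
    batch = idx[start:start + batch_size]
    X_b, y_b = X[batch], y[batch]
    preds = np.full(len(batch), 0.5)        # stand-in model predictions
    loss_b = np.mean((preds - y_b) ** 2)    # average J(W) over this mini-batch
    # a gradient step based on loss_b would go here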
Grid Search
Which hyperparameters can be grid searched? Enumerate grids of standard hyperparameter values to find good configurations, then repeat the process with finer and finer grids (a minimal sketch follows the list below).

1. Activation Functions
2. Network Topology
3. Batches and Epochs
4. Dropout
5. Optimization and Loss
6. Early Stopping
…
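A minimal sketch of enumerating a coarse hyperparameter grid, assuming scikit-learn's ParameterGrid and a hypothetical train_and_evaluate helper that returns a validation score:

from sklearn.model_selection import ParameterGrid

grid = ParameterGrid({
    "activation": ["relu", "tanh"],
    "batch_size": [32, 128],
    "dropout":    [0.0, 0.5],
    "epochs":     [20, 50],
})

best_score, best_params = -1.0, None
for params in grid:
    score = train_and_evaluate(**params)   # hypothetical helper (build, train, validate)
    if score > best_score:
        best_score, best_params = score, params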
Transfer Learning
Transfer learning generally refers to a process where a model trained on one
problem is used in some way on a second related problem.
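A minimal sketch of transfer learning, assuming TensorFlow/Keras: reuse a network pre-trained on ImageNet, freeze its weights, and train a new head for the second, related problem (the 10-class target is hypothetical):

from tensorflow.keras import Sequential
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Flatten

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                      # keep the pre-trained weights fixed

model = Sequential([
    base,
    Flatten(),
    Dense(256, activation="relu"),
    Dense(10, activation="softmax"),        # hypothetical 10-class target problem
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])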
