Deep Learning

GD: batch size is a parameter

Normalizing inputs: faster convergence => features move at a similar scale

Mini-batch GD:
- Batch size > 1
Stochastic GD:
- Batch size = 1, one training example at a time
- Extremely noisy
- No convergence (oscillates around the minimum)

Exponentially weighted moving averages
Moving average: V_t = beta*V_{t-1} + (1-beta)*theta_t
Averages over roughly the last 1/(1-beta) data points
Time-series example: to average over quarters, take (1/4)*the sum of the first year's quarters, then (1/2)*the sum of the next two, and so on
Average of the last 1/(1-beta) days plus some weight for the current day => number of days averaged grows with beta
Expanding the recursion gives a weighted sum of past points: (1-beta)*beta^n*theta_{t-n}
What happens if we raise beta? The curve gets smoother and lags behind the data; lowering beta keeps it closer to the points

RMSProp parameter update:
w = w - alpha*(dw / sqrt(S_dw))
b = b - alpha*(db / sqrt(S_db))
What happens if we raise beta?

Adam:
V_dw = beta_1*V_dw + (1-beta_1)*dw
S_dw = beta_2*S_dw + (1-beta_2)*dw^2
V_dw^corrected = V_dw / (1-beta_1^t)
S_dw^corrected = S_dw / (1-beta_2^t)
w = w - alpha*V_dw^corrected / (sqrt(S_dw^corrected) + epsilon)
Combines RMSProp and gradient descent with momentum

Hyperparameter choice:
Use default values for the hyperparameters, but alpha, the learning rate, needs to be tuned

Learning rate decay:
Reduce the learning rate over time => oscillate over a tighter region
alpha = (1 / (1 + decayRate*epoch))*alpha_0

Local optima:
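The Adam formulas and the 1/(1+decayRate*epoch) decay above can be run end to end. This is a minimal NumPy sketch; the toy objective f(w) = w^2 and all hyperparameter values are my own illustrative choices, not from the notes:

```python
import numpy as np

def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter array."""
    v = beta1 * v + (1 - beta1) * dw        # momentum term (EWMA of gradients)
    s = beta2 * s + (1 - beta2) * dw ** 2   # RMSProp term (EWMA of squared gradients)
    v_hat = v / (1 - beta1 ** t)            # bias correction
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

# Minimize f(w) = w^2, whose gradient is dw = 2w
w = np.array([5.0])
v = np.zeros_like(w)
s = np.zeros_like(w)
alpha0 = 0.1
for epoch in range(200):
    alpha = alpha0 / (1 + 0.01 * epoch)     # 1/(1+decayRate*epoch) decay
    w, v, s = adam_step(w, 2 * w, v, s, t=epoch + 1, alpha=alpha)
# w ends up close to the minimum at 0
```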

Bias correction:
The moving average starts off too low (V_0 = 0) => divide by (1 - beta^t) to correct the bias
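The low start and its correction are easy to see on a constant signal; a minimal sketch (signal and length are my own example):

```python
import numpy as np

beta = 0.9
data = np.ones(10)   # constant signal of 1.0
v = 0.0
for t, theta in enumerate(data, start=1):
    v = beta * v + (1 - beta) * theta
    v_corrected = v / (1 - beta ** t)   # bias correction
# Uncorrected v is still well below 1.0 after 10 steps (it started at 0);
# the corrected estimate recovers the true average of 1.0 exactly here.
```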

Gradient descent with momentum:

Like a moving average of the derivatives, instead of a time series:
V_dw = beta*V_dw + (1-beta)*dw
V_db = beta*V_db + (1-beta)*db
w = w - alpha*V_dw
b = b - alpha*V_db

What happens if we raise beta? It acts a bit like a learning-rate control:
High beta => more horizontal movement along the shorter axis; smoother => depends more on the general trend, less sensitive to noise
Low beta => more vertical movement; greater, noisier descent => depends more on the current gradient

Learning rate decay schedules: exponential decay, step decay
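The trend-vs-noise trade-off can be checked directly by smoothing the same noisy signal with a high and a low beta; a minimal sketch (signal, seed, and beta values are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
noisy = 1.0 + rng.normal(0, 1, 1000)   # noisy readings around a flat trend of 1.0

def ewma(xs, beta):
    """Exponentially weighted moving average of a sequence."""
    v, out = 0.0, []
    for x in xs:
        v = beta * v + (1 - beta) * x
        out.append(v)
    return np.array(out)

smooth_hi = ewma(noisy, beta=0.98)   # follows the general trend, suppresses noise
smooth_lo = ewma(noisy, beta=0.5)    # follows the current values, much noisier
```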

RMSProp:
S_dw = beta*S_dw + (1-beta)*dw^2
S_db = beta*S_db + (1-beta)*db^2

Computational resources determine:
- what to tune: learning rate, mini-batch size
- whether to try Panda (babysit one model) or Caviar (train many models in parallel)

SEQUENCE MODELS

"Re-tuning hyperparameters should only be done if new hardware or computational power is acquired" => false (intuitions go stale; re-test periodically)

Batch normalization:
beta and gamma are learned parameters, updated during training just like the weights

Deep learning programming frameworks don't require cloud-based machines to run
A framework allows you to write fewer lines of code


Tasks that could be addressed by a many-to-one RNN model architecture:
What's a many-to-one RNN model architecture? It reads a whole sequence and outputs a single result (e.g., sentiment classification)

If searching among a large number of hyperparameters, you SHOULD NOT try values in a grid; pick random values instead, so that each hyperparameter gets many distinct values tried => use random search
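A minimal sketch of random search for the learning rate, sampled on a log scale (the range, sample count, and seed are my own assumptions): a 5x5 grid over two hyperparameters tries only 5 distinct values of each, while 25 random samples try 25 distinct values of each.

```python
import numpy as np

rng = np.random.default_rng(1)
# Sample alpha uniformly on a log scale between 1e-4 and 1
r = rng.uniform(-4, 0, size=25)
alphas = 10.0 ** r
# Each of the 25 trials uses a distinct learning rate
```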

At test time, don't use the most recent mini-batch's mean and sigma for normalization.
After training a neural network with batch norm, at test time, to evaluate the network on a new example, perform the needed normalizations using a mean and sigma estimated with an exponentially weighted average across the mini-batches seen during training.
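A minimal sketch of that test-time procedure for a single feature (the data distribution, decay rate, and batch size are my own illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
beta_avg = 0.9
running_mu, running_var = 0.0, 1.0

# During training: keep EWMAs of each mini-batch's mean and variance
for _ in range(500):
    batch = rng.normal(loc=3.0, scale=2.0, size=64)
    running_mu = beta_avg * running_mu + (1 - beta_avg) * batch.mean()
    running_var = beta_avg * running_var + (1 - beta_avg) * batch.var()

# At test time: normalize a new example with the running estimates,
# NOT with statistics computed from the test example itself
x_new = 4.0
x_norm = (x_new - running_mu) / np.sqrt(running_var + 1e-8)
```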
If you are training an RNN model and find that your weights and activations are all taking on the value NaN ("Not a Number") => exploding gradients

Gu has the same dimension as the number of hidden units

Choose the r-th training sample first, then the s-th word

If we want c<t> to be highly dependent on c<t-1>, we want Gu to be very low (the update gate keeps the old memory)
Gr => about remembering previous states: how relevant c<t-1> is for computing the candidate
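The memory-keeping effect of a low Gu can be seen in the GRU memory update c<t> = Gu*c̃<t> + (1-Gu)*c<t-1>; a minimal sketch ignoring the relevance gate, with my own example values:

```python
def gru_memory(c_prev, c_tilde, gamma_u):
    """Simplified GRU memory update: c<t> = Gu*c_tilde + (1-Gu)*c<t-1>."""
    return gamma_u * c_tilde + (1 - gamma_u) * c_prev

c_prev, c_tilde = 1.0, -5.0
kept = gru_memory(c_prev, c_tilde, gamma_u=0.01)      # low Gu => keeps the old state
replaced = gru_memory(c_prev, c_tilde, gamma_u=0.99)  # high Gu => takes the candidate
# kept stays close to c_prev (1.0); replaced is close to c_tilde (-5.0)
```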

Dimensionality in word embedding:

Question 10
The sparsity of connections and weight sharing are mechanisms that allow us to use fewer parameters in a convolutional layer, making it possible to train a network with smaller training sets. True/False? => True

Number of weights per filter: f * f * n_c_prev
Total number of weights for all filters: f * f * n_c_prev * n_filters
Bias parameters: one per filter
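Those counts can be checked with a tiny helper; the worked numbers (3x3 filters over an RGB input, 10 filters) are my own example:

```python
def conv_layer_params(f, n_c_prev, n_filters):
    """Parameter counts for one conv layer: per-filter weights, total weights, biases."""
    weights_per_filter = f * f * n_c_prev
    total_weights = weights_per_filter * n_filters
    biases = n_filters                      # one bias per filter
    return weights_per_filter, total_weights, biases

# 3x3 filters over 3 input channels, 10 filters:
# 27 weights per filter, 270 total weights, 10 biases => 280 parameters
counts = conv_layer_params(3, 3, 10)
```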

LSTM

Gu => update gate
Gf => forget gate
Go => output gate
Gu has dimension = # hidden units in the LSTM
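The gate dimension can be verified by building one gate explicitly; a minimal sketch where the sizes (4 hidden units, 3 input features) and random weights are my own assumptions:

```python
import numpy as np

n_a, n_x = 4, 3   # hidden units, input features
rng = np.random.default_rng(0)
Wu = rng.normal(size=(n_a, n_a + n_x))   # gate weights act on [a<t-1>, x<t>]
bu = np.zeros(n_a)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

a_prev = np.zeros(n_a)
x_t = rng.normal(size=n_x)
gamma_u = sigmoid(Wu @ np.concatenate([a_prev, x_t]) + bu)
# gamma_u has one value in (0, 1) per hidden unit => shape (n_a,)
```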
