(DL) Ch04-Regularization
Regularization
2. Constrained optimization
3. Dataset Augmentation
5. Dropout
• Main regularization approaches: limiting the capacity of the model by adding a parameter
norm penalty Ω(θ) to the objective function J
J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)    (1)
• Different choices of the parameter norm Ω can result in different solutions being preferred
• In NNs, Ω is chosen to penalize only the weights of the affine transformation at each layer
(leaving the biases unregularized)
• The most common parameter norm penalty: L2 (weight decay), also known as ridge
regression or Tikhonov regularization
J̃(w; X, y) = J(w; X, y) + αΩ(w) = J(w; X, y) + (α/2) wᵀw    (2)
with the corresponding parameter gradient
∇w J̃(w; X, y) = αw + ∇w J(w; X, y)    (3)
• Gradient step (illustrated in the sketch below)
w ← w − ε(αw + ∇w J(w; X, y)) = (1 − εα)w − ε∇w J(w; X, y)    (4)
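The following is a minimal NumPy sketch of the gradient step above, applied to a least-squares linear model; the loss, the penalty strength alpha, and the learning rate eps are illustrative assumptions, not from the slides.

```python
import numpy as np

def grad_step_l2(w, X, y, alpha=0.1, eps=0.01):
    """One step of w <- w - eps * (alpha * w + grad_w J(w; X, y))."""
    grad_J = X.T @ (X @ w - y) / len(y)    # gradient of the unregularized loss
    return w - eps * (alpha * w + grad_J)  # weight decay shrinks w toward 0

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.normal(size=100)
w = np.zeros(5)
for _ in range(500):
    w = grad_step_l2(w, X, y)
print(w)  # estimates are shrunk slightly toward zero relative to plain least squares
```

Note that the update can be read as first multiplying the weights by the constant factor (1 − εα) and then taking the usual gradient step, which is where the name weight decay comes from.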
• L1 regularization
Ω(θ) = ‖w‖₁ = Σᵢ |wᵢ|    (5)
J̃(w; X, y) = J(w; X, y) + α‖w‖₁    (6)
∇w J̃(w; X, y) = α sign(w) + ∇w J(w; X, y)    (7)
• The regularization contribution to the gradient no longer scales linearly with each wᵢ; instead it is a constant factor with a sign equal to sign(wᵢ)
• L1 regularization results in a solution that is more sparse compared to L2: some
parameters have an optimal value of zero
• The sparsity property of L1 regularization has been used extensively as a feature-selection
mechanism (see the sketch below)
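A minimal sketch, in the same toy setting as the L2 example above, of the subgradient step implied by eq. (7); the data and the value of alpha are illustrative assumptions. Plain subgradient steps only drive irrelevant weights near zero; exact zeros would require a proximal (soft-thresholding) update, which is omitted here.

```python
import numpy as np

def grad_step_l1(w, X, y, alpha=0.1, eps=0.01):
    """One subgradient step: w <- w - eps * (alpha * sign(w) + grad_w J(w; X, y))."""
    grad_J = X.T @ (X @ w - y) / len(y)    # gradient of the unregularized loss
    return w - eps * (alpha * np.sign(w) + grad_J)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, 0.0, 0.0, -1.5, 0.0]) + 0.1 * rng.normal(size=200)
w = rng.normal(size=5)
for _ in range(2000):
    w = grad_step_l1(w, X, y)
print(np.round(w, 3))  # coefficients of the irrelevant features are driven near zero
```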
Constrained optimization
• Find the maximal or minimal value of f (x) for values of x in some set S
• Feasible points: points x that lie within the set S
• Find a solution that is small in some sense
• Common approach: impose a norm constraint, such as ‖x‖ ≤ 1 (see the projection sketch below)
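One way to make this concrete is projected gradient descent: take an unconstrained gradient step, then project back onto the feasible set S = {x : ‖x‖ ≤ 1}. The quadratic objective, step size, and iteration count below are illustrative assumptions, not from the slides.

```python
import numpy as np

def project_l2_ball(x, radius=1.0):
    """Return the closest point to x inside the ball {x : ||x|| <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def f_grad(x, target):
    return 2.0 * (x - target)   # gradient of f(x) = ||x - target||^2

target = np.array([2.0, 2.0])   # unconstrained minimizer lies outside the unit ball
x = np.zeros(2)
for _ in range(100):
    x = project_l2_ball(x - 0.1 * f_grad(x, target))  # step, then project
print(x, np.linalg.norm(x))     # solution sits on the boundary ||x|| = 1
```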
• Injecting noise
• NNs prove not to be very robust to noise
• Unsupervised learning: denoising autoencoder
• Noise in hidden units: dataset augmentation at multiple levels of abstraction (sketched below)
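A minimal sketch of noise injection in a two-layer network: Gaussian noise is added to the input (as in a denoising autoencoder) and to the hidden activations, acting like dataset augmentation at multiple levels of abstraction. The helper name, layer sizes, and noise scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_with_noise(x, W1, W2, train=True, noise_std=0.1):
    """Forward pass that corrupts the input and hidden units during training only."""
    if train:
        x = x + noise_std * rng.normal(size=x.shape)   # input noise
    h = np.maximum(0.0, x @ W1)                        # hidden layer (ReLU)
    if train:
        h = h + noise_std * rng.normal(size=h.shape)   # hidden-unit noise
    return h @ W2

x = rng.normal(size=(4, 8))
W1 = rng.normal(size=(8, 16)) * 0.1
W2 = rng.normal(size=(16, 2)) * 0.1
print(forward_with_noise(x, W1, W2, train=True).shape)  # (4, 2)
```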
• Bagging
• The models are independent
• Each model is trained to convergence on its respective training set (see the bagging sketch below)
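A minimal sketch of bagging with a least-squares base model (chosen only for illustration): k models are fit independently, each on its own bootstrap resample of the training set, and their predictions are averaged.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -1.0, 2.0]) + 0.5 * rng.normal(size=100)

def fit_least_squares(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

k = 10
models = []
for _ in range(k):
    idx = rng.integers(0, len(y), size=len(y))   # bootstrap resample
    models.append(fit_least_squares(X[idx], y[idx]))

x_new = rng.normal(size=3)
pred = np.mean([x_new @ w for w in models])      # average the ensemble's predictions
print(pred)
```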
• Dropout
• Models share parameters
• Most models are not explicitly trained at all
• It is infeasible to sample all possible subnetworks within the lifetime of the universe
• Parameter sharing causes the remaining sub-networks to arrive at good settings of the parameters
• Dropout can represent an exponential number of models with a tractable amount of
memory (see the training sketch below)
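A minimal sketch of this parameter sharing: every forward pass samples a fresh binary mask µ over the hidden units, so each training step updates one sub-network, while all sub-networks read and write the same W1 and W2. The layer sizes and keep probability are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, W1, W2, keep_prob=0.5):
    """One forward pass through the sub-network selected by a freshly sampled mask mu."""
    h = np.maximum(0.0, x @ W1)              # hidden layer (ReLU)
    mu = rng.random(h.shape) < keep_prob     # binary mask sampled per step
    return (h * mu) @ W2, mu                 # sub-network defined by mu, weights shared

x = rng.normal(size=(4, 8))
W1 = rng.normal(size=(8, 16)) * 0.1
W2 = rng.normal(size=(16, 2)) * 0.1
out, mu = dropout_forward(x, W1, W2)
print(out.shape, mu.mean())   # roughly half the hidden units are kept on this step
```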
• Dropout:
• Each sub-model, defined by a mask vector µ, defines a probability distribution p(y | x, µ)
• The arithmetic mean over all masks is given by
Σµ p(µ) p(y | x, µ)    (13)
where p(µ) is the probability distribution that was used to sample µ at training time (approximated by sampling in the sketch below)
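The sum in eq. (13) runs over exponentially many masks, so it cannot be computed exactly; the sketch below approximates it by averaging p(y | x, µ) over a small number of masks sampled from p(µ) (sometimes called Monte Carlo dropout). The toy network and keep probability are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def p_y_given_x_mu(x, W1, W2, keep_prob=0.5):
    """p(y | x, mu) for one sampled mask mu ~ p(mu)."""
    h = np.maximum(0.0, x @ W1)
    mu = rng.random(h.shape) < keep_prob
    return softmax((h * mu) @ W2)

x = rng.normal(size=(1, 8))
W1 = rng.normal(size=(8, 16)) * 0.1
W2 = rng.normal(size=(16, 3)) * 0.1
samples = [p_y_given_x_mu(x, W1, W2) for _ in range(100)]
print(np.mean(samples, axis=0))   # Monte Carlo estimate of sum_mu p(mu) p(y | x, mu)
```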