L10 Regularization Slides

Lecture 10 of STAT 453 focuses on regularization methods to reduce overfitting in deep learning models. Key techniques discussed include early stopping, L1/L2 regularization, and dropout, along with the importance of improving generalization performance through data collection and augmentation. The lecture also emphasizes the significance of adjusting model capacity and using norm penalties to enhance model robustness.

STAT 453: Introduction to Deep Learning and Generative Models

Sebastian Raschka
http://stat.wisc.edu/~sraschka/teaching

Lecture 10

Regularization Methods for Neural Networks
with Applications in Python
Goal: Reduce Overfitting

Usually achieved by reducing model capacity and/or reducing the variance of the predictions (as explained in the last lecture).

Regularization

In the context of deep learning, regularization can be understood as the process of adding information / changing the objective function to prevent overfitting.

Regularization / Regularizing Effects

Goal: reduce overfitting

Usually achieved by reducing model capacity and/or reducing the variance of the predictions (as explained in the last lecture).

Common Regularization Techniques for DNNs:

• Early stopping
• L1/L2 regularization (norm penalties)
• Dropout

Lecture Overview

1. Improving generalization performance

2. Avoiding overfitting with (1) more data and (2) data augmentation

3. Reducing network capacity & early stopping

4. Adding norm penalties to the loss: L1 & L2 regularization

5. Dropout

An Overview of Techniques for ...

1. Improving generalization performance
2. Avoiding overfitting with (1) more data and (2) data augmentation
3. Reducing network capacity & early stopping
4. Adding norm penalties to the loss: L1 & L2 regularization
5. Dropout

Improving generalization (overview of the mind map on this slide):

• Dataset: collecting more data, data augmentation, label smoothing; leveraging unlabeled data (semi-supervised, self-supervised); leveraging related data (meta-learning, transfer learning)
• Architecture setup: weight initialization strategies, activation functions, residual layers, knowledge distillation
• Normalization: input standardization, BatchNorm and variants, weight standardization, gradient centralization
• Training loop: adaptive learning rates, auxiliary losses, gradient clipping
• Regularization: L2 (/L1) regularization, early stopping, dropout


First step to improve performance:
Focusing on the dataset itself

1. Improving generalization performance
2. Avoiding overfitting with (1) more data and (2) data augmentation
3. Reducing network capacity & early stopping
4. Adding norm penalties to the loss: L1 & L2 regularization
5. Dropout

Often, the Best Way to Reduce Overfitting is Collecting More Data

(Figures from the cited material: Figure 3, an illustration of bias and variance; Figure 4, learning curves of softmax classifiers fit to MNIST subsets of increasing size, with the test set size kept constant. When the training set is small, the algorithm is more likely to pick up noise in the training set.)

Data Augmentation in PyTorch via TorchVision

(Figure: original images vs. randomly augmented images.)

https://github.com/rasbt/stat453-deep-learning-ss21/blob/master/L10/code/data-augmentation.ipynb




Use (0.5, 0.5, 0.5) for RGB images (i.e., one normalization value per color channel in transforms.Normalize).

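A minimal sketch of a TorchVision augmentation pipeline of the kind referenced above, assuming MNIST; the specific transforms and parameter values are illustrative and not the linked notebook's exact code:

# Sketch of a TorchVision augmentation pipeline; transform choices and
# parameters are illustrative.
import torch
from torchvision import datasets, transforms

train_transform = transforms.Compose([
    transforms.RandomCrop(28, padding=4),     # random shifts
    transforms.RandomRotation(15),            # random rotations
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),     # one value for grayscale;
                                              # use (0.5, 0.5, 0.5) for RGB
])

train_dataset = datasets.MNIST(root='data', train=True, download=True,
                               transform=train_transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=128,
                                           shuffle=True)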

Other Ways for Dealing with Overfitting
if Collecting More Data is not Feasible

=> Reducing the Network's Capacity by Other Means

1. Improving generalization performance
2. Avoiding overfitting with (1) more data and (2) data augmentation
3. Reducing network capacity & early stopping
4. Adding norm penalties to the loss: L1 & L2 regularization
5. Dropout

Early Stopping

Step 1: Split your dataset into 3 parts (always recommended)

• use the test set only once at the end (for an unbiased estimate of the generalization performance)
• use the validation accuracy for tuning (always recommended)

Dataset => Training dataset | Validation dataset | Test dataset


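A minimal sketch of such a 3-way split in PyTorch, assuming MNIST and illustrative split fractions (not the lecture's code):

# Sketch of a train/validation/test split; dataset choice and split
# fractions are illustrative.
import torch
from torch.utils.data import random_split, DataLoader
from torchvision import datasets, transforms

dataset = datasets.MNIST(root='data', train=True, download=True,
                         transform=transforms.ToTensor())

n_total = len(dataset)
n_train = int(0.8 * n_total)
n_valid = n_total - n_train            # held-out validation set

train_set, valid_set = random_split(
    dataset, [n_train, n_valid],
    generator=torch.Generator().manual_seed(123))   # reproducible split

# MNIST already ships with a separate test set; use it only once at the end
test_set = datasets.MNIST(root='data', train=False, download=True,
                          transform=transforms.ToTensor())

train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
valid_loader = DataLoader(valid_set, batch_size=128, shuffle=False)
test_loader = DataLoader(test_set, batch_size=128, shuffle=False)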
Early Stopping

Step 2: Early stopping (not very common anymore)

• reduce overfitting by observing the training/validation accuracy gap during training and then stopping at the "right" point

(Plot: training-set and validation-set accuracy vs. epochs; the good early stopping point is where the validation accuracy peaks, before the gap to the training accuracy widens.)
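A minimal sketch of patience-based early stopping on the validation accuracy; model, optimizer, the data loaders, and compute_accuracy are assumed helpers, and the patience value is an illustrative choice rather than the lecture's:

# Sketch of patience-based early stopping; model, optimizer, train_loader,
# valid_loader, and compute_accuracy() are assumed to exist already.
import torch
import torch.nn.functional as F

NUM_EPOCHS = 50
best_valid_acc, patience, epochs_without_improvement = 0.0, 5, 0

for epoch in range(NUM_EPOCHS):
    model.train()
    for features, targets in train_loader:
        logits = model(features)
        cost = F.cross_entropy(logits, targets)
        optimizer.zero_grad()
        cost.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        valid_acc = compute_accuracy(model, valid_loader)

    if valid_acc > best_valid_acc:
        best_valid_acc = valid_acc
        epochs_without_improvement = 0
        torch.save(model.state_dict(), 'best_model.pt')   # keep best weights
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f'Stopping early after epoch {epoch+1}')
            break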
Other Ways for Dealing with Overfitting
if Collecting More Data is not Feasible

Adding a Penalty Against Complexity

1. Improving generalization performance
2. Avoiding overfitting with (1) more data and (2) data augmentation
3. Reducing network capacity & early stopping
4. Adding norm penalties to the loss: L1 & L2 regularization
5. Dropout

L1/L2 Regularization

As I am sure you already know these from various statistics classes, we will keep it short:

• L1 regularization => LASSO regression

• L2 regularization => Ridge regression (Tikhonov regularization)

Basically, a "weight shrinkage" or a "penalty against complexity".


L2 Regularization for Linear Models
(e.g., Logistic Regression)

\text{Cost}_{w,b} = \frac{1}{n} \sum_{i=1}^{n} L\left(y^{[i]}, \hat{y}^{[i]}\right)

\text{L2-Regularized-Cost}_{w,b} = \frac{1}{n} \sum_{i=1}^{n} L\left(y^{[i]}, \hat{y}^{[i]}\right) + \frac{\lambda}{n} \sum_{j} w_j^2

where \sum_j w_j^2 = \lVert w \rVert_2^2 and \lambda is a hyperparameter.


Geometric Interpretation of L2 Regularization

(Figure, from Sebastian Raschka, Vahid Mirjalili, Python Machine Learning, 3rd Edition: in the weight space spanned by w_i and w_j, the 1st component minimizes the cost function and the 2nd component minimizes the penalty term; the L2-regularized solution is a compromise between penalty and cost.)


Effect of Norm Penalties on the Decision Boundary

Assume a nonlinear model.

L2 Regularization for Multilayer Neural Networks

\text{L2-Regularized-Cost}_{w,b} = \frac{1}{n} \sum_{i=1}^{n} L\left(y^{[i]}, \hat{y}^{[i]}\right) + \frac{\lambda}{n} \sum_{l=1}^{L} \lVert w^{(l)} \rVert_F^2

where the second sum runs over the L layers, and \lVert w^{(l)} \rVert_F^2 is the squared Frobenius norm:

\lVert w^{(l)} \rVert_F^2 = \sum_i \sum_j \left( w_{i,j}^{(l)} \right)^2


L2 Regularization for Neural Nets

Regular gradient descent update:

w_{i,j} := w_{i,j} - \eta \, \frac{\partial L}{\partial w_{i,j}}

Gradient descent update with L2 regularization:

w_{i,j} := w_{i,j} - \eta \left( \frac{\partial L}{\partial w_{i,j}} + \frac{2\lambda}{n} w_{i,j} \right)

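A minimal sketch of this update rule written out by hand for a single weight tensor; the tensor names, learning rate, and λ value are illustrative placeholders:

# Sketch of the L2-regularized gradient descent update for one weight tensor;
# W, grad, lr, lmbda, and n are illustrative placeholders.
import torch

n, lr, lmbda = 1000, 0.1, 0.01
W = torch.randn(5, 3)
grad = torch.randn(5, 3)    # stands in for dL/dW from backpropagation

with torch.no_grad():
    # w := w - eta * (dL/dw + (2*lambda/n) * w)
    W -= lr * (grad + (2 * lmbda / n) * W)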

L2 Regularization for Neural Nets in PyTorch

# regularize loss
L2 = 0.
for name, p in model.named_parameters():
    if 'weight' in name:
        L2 = L2 + (p**2).sum()

cost = cost + 2./targets.size(0) * LAMBDA * L2

optimizer.zero_grad()
cost.backward()


L2 Regularization for Logistic Regression in PyTorch

Automatically, via the optimizer's weight_decay parameter:

#########################################################
## Apply L2 regularization
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1,
                            weight_decay=LAMBDA)
#-------------------------------------------------------

for epoch in range(num_epochs):

    #### Compute outputs ####
    out = model(X_train_tensor)

    #### Compute gradients ####
    cost = F.binary_cross_entropy(out, y_train_tensor)
    optimizer.zero_grad()
    cost.backward()

    #### Update weights ####
    optimizer.step()

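Side note (not on the slides, added for context): for plain SGD, setting weight_decay=λ adds λ·w to each parameter's gradient before the update, which corresponds to an L2 penalty on the weights; because of differing constant factors, the λ passed here is not numerically interchangeable with the LAMBDA used in the manual loop on the previous slide.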

Dropout

1. Improving generalization performance
2. Avoiding overfitting with (1) more data and (2) data augmentation
3. Reducing network capacity & early stopping
4. Adding norm penalties to the loss: L1 & L2 regularization
5. Dropout
   5.1 The Main Concept Behind Dropout
   5.2 Dropout: Co-Adaptation Interpretation
   5.3 Dropout: Ensemble Method Interpretation
   5.4 Dropout in PyTorch

Dropout

Original research articles:


Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2012).
Improving neural networks by preventing co-adaptation of feature detectors. arXiv
preprint arXiv:1207.0580.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014).
Dropout: a simple way to prevent neural networks from overfitting. The Journal of
Machine Learning Research, 15(1), 1929-1958.



Dropout in a Nutshell: Dropping Nodes

(Diagram: a multilayer network with inputs x_1, x_2, hidden units a_1^{(1)}, a_2^{(1)}, a_3^{(1)} and a_1^{(2)}, a_2^{(2)}, and output o; during training, randomly selected hidden units are dropped.)

Originally, drop probability 0.5 (but 0.2-0.8 also common now)


Dropout in a Nutshell: Dropping Nodes

How do we drop the nodes practically/efficiently?

Bernoulli Sampling (during training):

• p := drop probability
• v := random sample from uniform distribution in range [0, 1]
• \forall i \in v: v_i := 0 if v_i < p, else 1
• a := a \odot v (p × 100% of the activations a will be zeroed)

Dropout in a Nutshell: Dropping Nodes

How do we drop the nodes practically/efficiently?

Bernoulli Sampling (during training):

• p := drop probability
• v := random sample from uniform distribution in range [0, 1]
• \forall i \in v: v_i := 0 if v_i < p, else 1
• a := a \odot v (p × 100% of the activations a will be zeroed)

Then, after training, when making predictions (during "inference"), scale the activations via a := a \cdot (1 - p)

Q for you: Why is this required?
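A minimal sketch of this "classic" dropout scheme (Bernoulli mask during training, scale by 1 − p at inference); the function name and example tensor are illustrative, not from the slides:

# Sketch of classic (non-inverted) dropout as described above.
import torch

def classic_dropout(a, p=0.5, training=True):
    if training:
        v = (torch.rand_like(a) >= p).float()   # v_i = 0 with probability p
        return a * v                             # ~p*100% of activations zeroed
    else:
        return a * (1.0 - p)                     # match expected activation scale

a = torch.ones(6)
print(classic_dropout(a, p=0.5, training=True))    # some entries zeroed
print(classic_dropout(a, p=0.5, training=False))   # all entries scaled to 0.5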
Dropout

1. Improving generalization performance
2. Avoiding overfitting with (1) more data and (2) data augmentation
3. Reducing network capacity & early stopping
4. Adding norm penalties to the loss: L1 & L2 regularization
5. Dropout
   5.1 The Main Concept Behind Dropout
   5.2 Dropout: Co-Adaptation Interpretation
   5.3 Dropout: Ensemble Method Interpretation
   5.4 Dropout in PyTorch

Dropout: Co-Adaptation Interpretation

Why does Dropout work well?

• The network will learn not to rely on particular connections too heavily

• Thus, it will consider more connections (because it cannot rely on individual ones)

• The weight values will be more spread out (which may lead to smaller weights, as with the L2 norm)

• Side note: You can certainly use different dropout probabilities in different layers (assigning them proportionally to the number of units in a layer is not a bad idea, for example)

Dropout

1. Improving generalization performance
2. Avoiding overfitting with (1) more data and (2) data augmentation
3. Reducing network capacity & early stopping
4. Adding norm penalties to the loss: L1 & L2 regularization
5. Dropout
   5.1 The Main Concept Behind Dropout
   5.2 Dropout: Co-Adaptation Interpretation
   5.3 Dropout: Ensemble Method Interpretation
   5.4 Dropout in PyTorch

Dropout: Ensemble Method Interpretation

• In dropout, we have a "different model" for each minibatch

• Via the minibatch iterations, we essentially sample over M = 2^h models, where h is the number of hidden units

• The restriction is that we have weight sharing over these models, which can be seen as a form of regularization

• During "inference" we can then average over all these models (but this is very expensive)

Dropout: Ensemble Method Interpretation

• During "inference" we can then average over all these models (but this is very expensive)

This is basically just averaging log likelihoods (this is for one particular class):

p_{\text{Ensemble}} = \left[ \prod_{j=1}^{M} p^{\{j\}} \right]^{1/M} = \exp\left[ \frac{1}{M} \sum_{j=1}^{M} \log\left( p^{\{j\}} \right) \right]

(you may know this as the "geometric mean" from other classes)

For multiple classes, we need to normalize so that the probabilities sum to 1:

p_{\text{Ensemble},j} = \frac{p_{\text{Ensemble},j}}{\sum_{j=1}^{k} p_{\text{Ensemble},j}}

Dropout: Ensemble Method Interpretation

• During "inference" we can then average over all these models (but this is very expensive)

• However, using the last model after training and scaling the predictions by a factor of 1 − p approximates the geometric mean and is much cheaper (actually, it's exactly the geometric mean if we have a linear model)


Dropout

1. Improving generalization performance
2. Avoiding overfitting with (1) more data and (2) data augmentation
3. Reducing network capacity & early stopping
4. Adding norm penalties to the loss: L1 & L2 regularization
5. Dropout
   5.1 The Main Concept Behind Dropout
   5.2 Dropout: Co-Adaptation Interpretation
   5.3 Dropout: Ensemble Method Interpretation
   5.4 Dropout in PyTorch

Inverted Dropout

• Most frameworks implement inverted dropout

• Here, the activation values are scaled by the factor 1/(1 − p) during training, instead of scaling the activations during "inference"

• I believe Google started this trend (because it's computationally cheaper in the long run if you use your model a lot after training)

• PyTorch's Dropout implementation is also inverted dropout

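A minimal sketch of inverted dropout, mirroring the classic-dropout sketch earlier; again the function name and example tensor are illustrative:

# Sketch of inverted dropout: scale by 1/(1-p) during training so that no
# scaling is needed at inference time.
import torch

def inverted_dropout(a, p=0.5, training=True):
    if training:
        v = (torch.rand_like(a) >= p).float()   # Bernoulli mask
        return a * v / (1.0 - p)                # rescale surviving activations
    else:
        return a                                # identity at inference

a = torch.ones(6)
print(inverted_dropout(a, p=0.5, training=True))   # zeros and 2.0 entries
print(inverted_dropout(a, p=0.5, training=False))  # unchanged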

Dropout in PyTorch


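The code screenshot on this slide is not reproduced here; the following is a minimal sketch of how torch.nn.Dropout is typically placed in a multilayer perceptron (layer sizes and drop probability are illustrative assumptions, not necessarily the lecture's):

# Sketch of a multilayer perceptron with dropout layers.
import torch

class MLPWithDropout(torch.nn.Module):
    def __init__(self, num_features=784, num_classes=10, drop_proba=0.5):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(num_features, 128),
            torch.nn.ReLU(),
            torch.nn.Dropout(drop_proba),   # active only in model.train() mode
            torch.nn.Linear(128, 64),
            torch.nn.ReLU(),
            torch.nn.Dropout(drop_proba),
            torch.nn.Linear(64, num_classes)
        )

    def forward(self, x):
        return self.net(x)   # returns logits

model = MLPWithDropout()

Because nn.Dropout behaves differently in training and evaluation mode, the model.train() / model.eval() calls emphasized on the next slide are essential.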

Dropout in PyTorch

Here, it is very important that you use model.train() and model.eval()!

for epoch in range(NUM_EPOCHS):
    model.train()
    for batch_idx, (features, targets) in enumerate(train_loader):

        features = features.view(-1, 28*28).to(DEVICE)
        targets = targets.to(DEVICE)

        ### FORWARD AND BACK PROP
        logits = model(features)
        cost = F.cross_entropy(logits, targets)
        optimizer.zero_grad()
        cost.backward()
        minibatch_cost.append(cost)

        ### UPDATE MODEL PARAMETERS
        optimizer.step()

    model.eval()
    with torch.no_grad():
        cost = compute_loss(model, train_loader)
        epoch_cost.append(cost)
        print('Epoch: %03d/%03d Train Cost: %.4f' % (
              epoch+1, NUM_EPOCHS, cost))
        print('Time elapsed: %.2f min' % ((time.time() - start_time)/60))


(Plots: training loss curves without dropout and with 50% dropout.)

https://github.com/rasbt/stat453-deep-learning-ss21/blob/master/L10/code/dropout.ipynb


Dropout: More Practical Tips

• Don't use Dropout if your model does not overfit

• However, in that case it is then recommended to increase the capacity until the model does overfit, and then use dropout so that you can use the larger-capacity model without overfitting

DropConnect:
Randomly Dropping Weights

(Diagram: the same multilayer network as before, but now randomly selected connections (weights) between units are dropped rather than the units themselves.)


DropConnect

• Generalization of Dropout
• More "possibilities"
• Less popular & doesn't work so well in practice

Original research article:


Wan, L., Zeiler, M., Zhang, S., Le Cun, Y., & Fergus, R. (2013, February). Regularization
of neural networks using DropConnect. In International conference on machine learning
(pp. 1058-1066).



Recommended Reading Assignment

• Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929-1958.
  http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf