Training Deep Neural Networks

Deep Learning Accessories

Ravindra Parmar · Towards Data Science · Sep 11, 2018 · 7 min read

Figure: a deep neural network with an input layer, three hidden layers and an output layer.

Deep neural networks are capable of solving problems that once seemed impossible for humans. To achieve a high level of accuracy, however, a huge amount of data, and consequently computing power, is needed to train these networks. Despite the computational complexity involved, we can follow certain guidelines to reduce training time and improve model accuracy. In this article we will look through a few of these techniques.

Data Pre-processing

The importance of data pre-processing can hardly be overstated: your neural network is only as good as the input data used to train it. If important data inputs are missing, the network may not be able to achieve the desired level of accuracy. On the other hand, if the data is not processed beforehand, it can affect both the accuracy and the performance of the network down the line.

Mean subtraction (zero centering)

This is the process of subtracting the mean from every data point to make the data zero-centered. Consider a case where the inputs to a neuron (unit) are all positive or all negative. In that case the gradients calculated during back-propagation all carry the same sign, so the weight updates are restricted to particular directions and optimization proceeds in inefficient zig-zag steps. Zero-centering the data avoids this.

Figure: original data vs. zero-centered data (mean subtraction).

Data Normalization

Normalization refers to rescaling the data so that it has the same scale across all dimensions. A common way to do that is to divide the data in each dimension by that dimension's standard deviation. However, this only makes sense if you have reason to believe that different input features have different scales but equal importance to the learning algorithm.

Figure: original data vs. data normalized across both dimensions.

Parameter Initialization

Deep neural networks are no strangers to millions or billions of parameters. The way these parameters are initialized can determine how fast the learning algorithm converges and how accurate it ends up being. The straightforward way is to initialize them all to zero. However, if we initialize the weights of a layer to all zeros, the gradients calculated will be the same for every unit in that layer, and hence the weight updates will be identical for all units. Consequently, that layer is as good as a single logistic regression unit.

Consider a 10-layer deep neural network, each layer consisting of 500 units and using the tanh activation function. (Just a note on tanh activation before proceeding further.)

Figure: the tanh activation function.

There are a few important points to remember about this activation as we move along:

- It is zero-centered.
- It saturates for large positive or negative inputs, squashing its output to the range -1 to +1, where its derivative is close to zero.

To start with, we initialize all weights from a standard Gaussian with zero mean and a standard deviation of 1e-2:

W = 0.01 * np.random.randn(fan_in, fan_out)

Unfortunately, this works well only for small networks. To see what issues it creates for deeper networks, plots are generated that depict the mean, standard deviation and activation distribution at each layer as we go deeper into the network.
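The statistics behind such plots are straightforward to reproduce. Below is a minimal NumPy sketch (not the author's original code; the variable names such as hidden_size and acts and the batch of 1000 random inputs are illustrative choices) that builds the 10-layer, 500-unit tanh network described above, initializes every weight matrix with 0.01 * randn, runs one forward pass and prints the mean and standard deviation of the activations at each layer.

import numpy as np

np.random.seed(0)

num_layers = 10      # depth used in the text
hidden_size = 500    # units per hidden layer
X = np.random.randn(1000, hidden_size)   # a batch of unit-Gaussian inputs

acts = {}
h = X
for layer in range(num_layers):
    fan_in, fan_out = h.shape[1], hidden_size
    W = 0.01 * np.random.randn(fan_in, fan_out)   # small-Gaussian initialization
    h = np.tanh(h.dot(W))                         # linear transform followed by tanh
    acts[layer] = h

# With such small weights the activations shrink layer by layer towards zero.
for layer, a in acts.items():
    print(f"layer {layer + 1:2d}: mean {a.mean():+.6f}   std {a.std():.6f}")

Swapping the 0.01 factor for 1.0, or for the Xavier scaling discussed below, reproduces the other regimes examined in the rest of this section.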
Figure: mean, standard deviation and activation distribution across layers (hidden layers 1 to 10) for the small-Gaussian initialization.

The standard deviation of the activations shrinks gradually as we go deeper into the network, until it collapses to zero. This, as well, is expected, since we are multiplying the inputs by very small weights at each layer. As a result, the gradients calculated would also be very small, and the updates to the weights would be negligible. Well, not so good!

Next, let's try initializing the weights with very large numbers. To do so, let's sample the weights from a standard Gaussian with zero mean and a standard deviation of 1.0 (instead of 0.01):

W = 1.0 * np.random.randn(fan_in, fan_out)

Below are the plots showing the mean, standard deviation and activation distribution for all layers.

Figure: mean, standard deviation and activation distribution across layers for the large-Gaussian initialization.

Nearly every unit saturates, because the large weights push the inputs deep into the tanh non-linearity (which squashes its output to the range -1 to +1). Consequently, the gradients calculated would also be very close to zero, as tanh saturates in these regimes (its derivative is zero). Once again, the updates to the weights would be almost negligible.

In practice, Xavier initialization is used for initializing the weights of all layers. The motivation behind Xavier initialization is to set the weights in such a way that they do not end up in the saturated regimes of the tanh activation, i.e. initialize with values that are neither too small nor too large. To achieve that, we scale by the number of inputs while randomly sampling from a standard Gaussian:

W = 1.0 * np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

However, this works well under the assumption that tanh is used for activation; it breaks down for other activation functions, e.g. ReLU. Proper initialization is still an active area of research.

Batch normalization

This is an idea closely related to what we have discussed so far. Remember, we normalized the input data before feeding it into the network. The underlying concern, commonly described as covariate shift, explains why, even after learning the mapping from some input to an output, we need to re-train the learning algorithm if the data distribution of that input changes. The issue is not confined to the input layer, however, as the data distribution can vary in the deeper layers as well: the activations at each layer can follow a different distribution. Hence, to increase the stability of deep neural networks, we normalize the data fed to each layer by subtracting the mean and dividing by the standard deviation. There's an article that explains this in depth.
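To make the idea concrete, here is a minimal NumPy sketch of the training-time forward pass of batch normalization (an illustrative simplification, not the author's code and not a substitute for framework layers such as tf.keras.layers.BatchNormalization or torch.nn.BatchNorm1d; the function name batch_norm_forward and the toy shapes are assumptions). Each feature is standardized over the current mini-batch and then re-scaled and re-shifted by two learned parameters, gamma and beta.

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x     : (batch_size, num_features) pre-activations of one layer
    # gamma : (num_features,) learned scale
    # beta  : (num_features,) learned shift
    mu = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                     # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * x_hat + beta             # let the network re-scale if that helps

# Usage: insert between the linear transform and the non-linearity.
x = np.random.randn(128, 500) * 3.0 + 2.0           # badly scaled pre-activations
gamma, beta = np.ones(500), np.zeros(500)           # typical initial values
out = np.tanh(batch_norm_forward(x, gamma, beta))   # normalized, then squashed

At test time, framework implementations replace the batch statistics with running averages accumulated during training; the sketch above covers only the training-time computation.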
Regularization

One of the most common problems in training deep neural networks is over-fitting. You will recognize over-fitting when your network performs exceptionally well on the training data but poorly on the test data. This happens because the learning algorithm tries to fit every data point in the input, even points that merely represent randomly sampled noise, as demonstrated in the figure below.

Figure: under-fitting, over-fitting and a balanced fit (source).

Regularization helps avoid over-fitting by penalizing the weights of the network. To explain it further, consider a loss function defined for a classification task over a neural network as below:

J(w) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \lVert w \rVert_2^2

where
- J(w): overall objective function to minimize,
- m: number of training samples,
- y^{(i)}: actual label for the i-th training sample,
- \hat{y}^{(i)}: predicted label for the i-th training sample,
- L: cross-entropy loss,
- w: weights of the neural network,
- \lambda: regularization parameter.

Notice how the regularization parameter (lambda) is used to control the effect of the weights on the final objective function. If lambda takes a very large value, the weights of the network must stay close to zero to minimize the objective. But as we let the weights collapse to zero, we nullify the effect of many units in each layer, and the network is no better than a single linear classifier with a few logistic regression units. Unexpectedly, this throws us into the regime known as under-fitting, which is not much better than over-fitting. Clearly, we have to choose the value of lambda carefully, so that in the end our model falls into the balanced category (the third plot in the figure).

Dropout Regularization

In addition to what we have discussed, there is one more powerful technique for reducing over-fitting in deep neural networks, known as dropout regularization. The key idea is to randomly drop units from the network during training, so as to ignore those units during forward propagation or backward propagation. In a sense, this prevents the network from adapting to some specific set of features.

Figure: (a) a standard neural net; (b) the same net after applying dropout (source).

At each iteration we randomly drop some units from the network. Consequently, we force each unit not to rely on (not to give high weights to) any specific set of other units, since any of them may be dropped at the next iteration.
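As a rough illustration (a minimal "inverted dropout" forward pass in NumPy, not the author's implementation; dropout_forward and keep_prob are assumed names), units are zeroed out at random during training and the survivors are re-scaled so that the expected activation is unchanged, while at test time the layer is left untouched:

import numpy as np

def dropout_forward(a, keep_prob=0.8, train=True):
    # a: activations of a hidden layer, shape (batch_size, num_units)
    if not train:
        return a                                   # no dropout at test time
    mask = (np.random.rand(*a.shape) < keep_prob)  # keep each unit with prob keep_prob
    return a * mask / keep_prob                    # re-scale survivors (inverted dropout)

# Usage: apply to the activations of each hidden layer during training.
np.random.seed(1)
h = np.tanh(np.random.randn(4, 6))        # toy hidden-layer activations
print(dropout_forward(h, keep_prob=0.8))  # roughly 20% of the entries forced to zero

Because the re-scaling is folded into training, nothing special has to be done at prediction time, which is why this inverted variant is the one most frameworks implement.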

Much of the content can be attributed to Stanford University's CS231n: Convolutional Neural Networks for Visual Recognition (cs231n.stanford.edu).

Please let me know through your comments any modifications or improvements needed in the article.