unit2

The document provides an overview of Convolutional Neural Networks (CNNs), detailing their structure, operation, and advantages over traditional Artificial Neural Networks (ANNs) for image processing. It explains key concepts such as convolution, pooling, activation functions, and the significance of different layers within a CNN. Additionally, it addresses hyperparameters, padding techniques, and the challenges associated with convolution operations, along with their solutions.

UNIT 2: CONVOLUTIONAL NEURAL NETWORKS
Convolution Operation -- Sparse Interactions -- Parameter Sharing -- Equivariance -- Pooling -- Convolution Variants: Strided -- Tiled -- Transposed and Dilated Convolutions; CNN Learning: Nonlinearity Functions -- Loss Functions -- Regularization -- Optimizers -- Gradient Computation

PART-A

1. What do you mean by Convolutional Neural Network?
A Convolutional Neural Network (CNN, or ConvNet) is a type of neural network that can be used to enable machines to visualize things. CNNs are used to perform analysis on images and visuals. These classes of neural networks can take a multi-channel image as input and work on it easily with minimal preprocessing required.

2. Why do we prefer Convolutional Neural Networks (CNN) over Artificial Neural Networks (ANN) for image data as input?
* A feedforward neural network can learn a single feature representation of the image, but in the case of complex images an ANN will fail to give better predictions because it cannot learn the pixel dependencies present in the images. A CNN can learn multiple layers of feature representations of an image by applying filters, or transformations.
* In a CNN, the number of parameters the network has to learn is significantly lower than in a multilayer neural network, since the number of units in the network decreases, therefore reducing the chance of overfitting.

3. List the different layers in CNN.
1. Input Layer
2. Convolutional Layer
3. ReLU Layer
4. Pooling Layer
5. Fully Connected Layer
6. Softmax / Logistic Layer
7. Output Layer

4. Explain the significance of the ReLU activation function in a Convolutional Neural Network.
ReLU Layer - After each convolution operation, the ReLU operation is used. ReLU is a non-linear activation function. This operation is applied to each pixel and replaces all the negative pixel values in the feature map with zero. Therefore this layer helps in the detection of features, introducing non-linearity and converting negative pixels to zero, which also allows detecting the variations of features.

5. Why do we use a Pooling Layer in a CNN?
A CNN uses pooling layers to reduce the size of the input image so that it speeds up the computation of the network.
Pooling or spatial pooling layers (also called subsampling or downsampling):
* Applied after the convolution and ReLU operations.
* Reduce the dimensionality of each feature map by retaining the most important information.
* Without pooling, the number of hidden layers required to learn the complex relations present in the image would be large.
* As a result of pooling, even if the picture were a little tilted, the largest number in a certain region of the feature map would still be recorded and hence the feature would be preserved.

6. What is the size of the feature map for a given input image size, filter size, stride, and padding amount?
Stride tells us the number of pixels we jump when we are convolving the filter. If our input image has a size of n x n, the filter size is f x f, p is the padding amount and s is the stride, then the dimension of the feature map is given by:
Dimension = floor(((n - f + 2p) / s) + 1) x floor(((n - f + 2p) / s) + 1)

7. An input image has been converted into a matrix of size 12 x 12 along with a filter of size 3 x 3 with a stride of 1. Determine the size of the convolved matrix.
To calculate the size of the convolved matrix, we use the generalized equation:
C = ((n - f + 2p) / s) + 1
where C is the size of the convolved matrix, n is the size of the input matrix, f is the size of the filter matrix, p is the padding amount and s is the stride.
Here n = 12, f = 3, p = 0, s = 1, so C = ((12 - 3 + 0) / 1) + 1 = 10.
Therefore the size of the convolved matrix is 10 x 10.
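A minimal Python sketch of the feature-map size formula from Q6, reproducing the worked example in Q7. The function name conv_output_size is our own, not from the text.

```python
import math

def conv_output_size(n, f, p=0, s=1):
    """Feature-map side length for an n x n input, f x f filter,
    padding p and stride s: floor((n - f + 2p)/s) + 1."""
    return math.floor((n - f + 2 * p) / s) + 1

# Example from Q7: 12 x 12 input, 3 x 3 filter, no padding, stride 1
print(conv_output_size(12, 3, p=0, s=1))  # -> 10, i.e. a 10 x 10 feature map
```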
8. Explain the terms "Valid Padding" and "Same Padding" in CNN.
Valid Padding: This type is used when there is no requirement for padding. The output matrix after convolution will have the dimension (n - f + 1) x (n - f + 1).
Same Padding: Here, we add padding elements all around the input matrix. After this type of padding, the dimensions of the output matrix are the same as those of the input matrix. If we apply a filter of dimension f x f to an (n + 2p) x (n + 2p) padded input matrix, we get an output matrix of dimension (n + 2p - f + 1) x (n + 2p - f + 1). Since after Same padding the output must have the same dimension as the original input (n x n), we have:
(n + 2p - f + 1) x (n + 2p - f + 1) equivalent to n x n
n + 2p - f + 1 = n, which gives p = (f - 1) / 2.

9. What are the different types of Pooling? Explain their characteristics.
Spatial pooling can be of different types: max pooling, average pooling, and sum pooling.
* Max pooling: Once we obtain the feature map of the input, we apply a filter of a determined shape across the feature map to get the maximum value from that portion of the feature map. It is also known as subsampling because, from the entire portion of the feature map covered by the filter or kernel, we sample one single maximum value.
* Average pooling: Computes the average value of the portion of the feature map covered by the kernel or filter, and takes the floor value of the result.
* Sum pooling: Computes the sum of all elements in that window.
Characteristics:
* Max pooling returns the maximum value of the portion covered by the kernel and suppresses noise, while average pooling only returns the average measure of that portion.
* The most widely used pooling technique is max pooling, since it captures the features of maximum importance.

10. Does the size of the feature map always reduce upon applying the filters? Explain why or why not.
No. The convolution operation shrinks the matrix of pixels (the input image) only if the size of the filter is greater than 1, i.e. f > 1. When we apply a filter of 1 x 1, there is no reduction in the size of the image and hence there is no loss of information.

11. What is Stride? What is the effect of a high Stride on the feature map?
Stride refers to the number of pixels by which we slide the filter matrix over the input matrix. For instance, if Stride = 1, we move the filter one pixel at a time; if Stride = 2, we move the filter two pixels at a time. Larger strides produce a smaller feature map.

12. Explain the role of the flattening layer in CNN.
After a series of convolution and pooling operations on the feature representation of the image, we flatten the output of the final pooling layer into a single long continuous linear array, or vector. The process of converting all the resultant 2D arrays into a vector is called flattening. The flattened output is fed as input to the fully connected neural network, which has a varying number of hidden layers, to learn the non-linear complexities present in the feature representation.

13. List down the hyperparameters of the Pooling Layer.
The hyperparameters for a pooling layer are:
* Filter size
* Stride
* Max or average pooling
If the input of the pooling layer is nh x nw x nc, then the output will be:
Dimension = floor((nh - f) / s + 1) x floor((nw - f) / s + 1) x nc
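A minimal NumPy sketch of 2 x 2 max pooling and of the output-shape formula from Q13. The function name and the example feature map are our own illustrative choices, not from the text.

```python
import numpy as np

def max_pool2d(x, f=2, s=2):
    """Naive max pooling over an (nh, nw) feature map with an f x f window
    and stride s; output shape is floor((nh-f)/s)+1 x floor((nw-f)/s)+1."""
    nh, nw = x.shape
    oh, ow = (nh - f) // s + 1, (nw - f) // s + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i * s:i * s + f, j * s:j * s + f].max()
    return out

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 0],
                 [1, 8, 3, 4]])
print(max_pool2d(fmap))  # [[6. 4.]
                         #  [8. 9.]]
```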
14. What is the role of the Fully Connected (FC) Layer in CNN?
The aim of the fully connected layer is to use the high-level features of the input image, produced by the convolutional and pooling layers, to classify the input image into various classes based on the training dataset. Fully connected means that every neuron in the previous layer is connected to each and every neuron in the next layer. The sum of the output probabilities from the fully connected layer is 1, which is achieved by using a softmax activation function in the output layer. The softmax function takes a vector of arbitrary real-valued scores and transforms it into a vector of values between 0 and 1 that sum to 1.
Working: It works like an ANN, assigning random weights to each synapse; the input is weight-adjusted and put into an activation function. The output of this is then compared to the true values, and the error generated is back-propagated, i.e. the weights are re-calculated and the whole process is repeated. This is done until the error or cost function is minimized.

15. Briefly explain the two major steps of CNN, i.e. Feature Learning and Classification.
Feature Learning deals with the algorithm learning about the dataset. Components like convolution, ReLU, and pooling work for that, with numerous iterations between them. Once the features are known, classification happens using the flattening and fully connected components.

16. What are the problems associated with the convolution operation and how can one resolve them?
Convolving an input of dimension 6 x 6 with a filter of dimension 3 x 3 results in an output of dimension 4 x 4. We can generalize this: if the input is n x n and the filter size is f x f, then the output size will be (n - f + 1) x (n - f + 1):
Input: n x n
Filter size: f x f
Output: (n - f + 1) x (n - f + 1)
There are primarily two disadvantages:
* When we apply a convolution operation, the size of the image shrinks every time.
* Pixels present in the corners and edges of the image are used only a few times during convolution compared to the central pixels. Hence, we do not focus much on the corners, which can lead to information loss.
To overcome these problems, we can pad the images with an additional border, i.e. we add one pixel all around the edges. This means that the input will be of dimension 8 x 8 instead of a 6 x 6 matrix. Applying convolution with a 3 x 3 filter on it will result in a 6 x 6 matrix, which is the same as the original shape of the image. This is where padding comes into the picture.

17. Define Padding.
Padding: The convolution operation reduces the size of the image, i.e. the spatial dimension decreases, thereby leading to information loss. As we keep applying convolutional layers, the size of the volume or feature map decreases faster. Zero padding allows us to control the size of the feature map.
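A short NumPy sketch contrasting "valid" and "same" padding for the 6 x 6 input / 3 x 3 filter example used in Q16. Variable names are our own illustrative choices.

```python
import numpy as np

n, f = 6, 3                      # 6 x 6 input, 3 x 3 filter (example from Q16)
img = np.arange(n * n, dtype=float).reshape(n, n)

# "Valid" padding: no padding, output shrinks to (n - f + 1) x (n - f + 1)
valid_size = n - f + 1           # 4

# "Same" padding: pad p = (f - 1)/2 so the output keeps the input size
p = (f - 1) // 2                 # 1 for a 3 x 3 filter
padded = np.pad(img, p)          # shape becomes (n + 2p) x (n + 2p) = 8 x 8
same_size = (n + 2 * p) - f + 1  # 6, same as the original input

print(valid_size, padded.shape, same_size)  # 4 (8, 8) 6
```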
PART-B

1. Explain in detail about Convolutional Neural Networks.
Convolutional Neural Networks (CNNs or ConvNets) have had huge successes since 2012, with numerous applications in Computer Vision, Natural Language Processing, and Digital Signal Processing (Acoustics). One can say they are a specialized kind of feed-forward network with a grid-like topology. They exploit patterns that are visible in a neighborhood of features (locality): for images, locality over space; for sounds, locality over time.

What is a CNN?
In deep learning, a convolutional neural network (CNN/ConvNet) is a class of deep neural networks, most commonly applied to analyze visual imagery. When we think of a neural network we think of matrix multiplications, but that is not the case with a ConvNet. It uses a special technique called convolution. In mathematics, convolution is an operation on two functions that produces a third function expressing how the shape of one is modified by the other.
But we do not really need to go into the mathematics to understand what a CNN is or how it works. The bottom line is that the role of the ConvNet is to reduce the images into a form that is easier to process, without losing features that are critical for getting a good prediction.

How does it work?
Before we go to the working of CNNs, let us cover the basics, such as what an image is and how it is represented. An RGB image is a matrix of pixel values having three planes (colour channels), whereas a grayscale image is the same but has a single plane. For simplicity, let us stick with grayscale images as we try to understand how CNNs work.
We take a filter/kernel (e.g. a 3 x 3 matrix) and apply it to the input image to get the convolved feature. This convolved feature is passed on to the next layer. In the case of an RGB colour image, the kernel has one plane per channel; the per-channel results are summed together with a bias term.
Convolutional neural networks are composed of multiple layers of artificial neurons. Artificial neurons, a rough imitation of their biological counterparts, are mathematical functions that calculate the weighted sum of multiple inputs and output an activation value. When you input an image into a ConvNet, each layer generates several activations that are passed on to the next layer. The first layer usually extracts basic features such as horizontal or diagonal edges. This output is passed on to the next layer, which detects more complex features such as corners or combinations of edges. As we move deeper into the network, it can identify even more complex features such as objects, faces, etc.

Convolution operation
CNNs use the convolution operation in at least one of their layers in place of matrix multiplication. A few words about the definition of convolution. Convolution is a mathematical way of combining two functions (signals) to form a third function (signal) expressing how the shape of one is modified by the other. The analogy of function to signal is used for those who are more familiar with it, as convolution is the single most important technique in Digital Signal Processing. It is defined as the integral of the product of the two functions f and g, denoted f * g, after one is reversed and shifted:

(f * g)(t) = integral over tau of f(tau) g(t - tau) d(tau)

and in the discrete case:

(f * g)(t) = sum over tau of f(tau) g(t - tau)

In machine learning applications, the first argument f to the convolution is often referred to as the input and the second argument g as the kernel (or filter, or window), and the output is often referred to as the feature map.

Visual explanation of the convolution operation:
1. Express each function in terms of a dummy variable tau.
2. Flip one of the functions: g(tau) -> g(-tau).
3. Add a time offset t, which allows g(t - tau) to slide along the tau-axis.
4. Start t at minus infinity and slide it all the way to plus infinity.
5. Wherever the two functions intersect, find the integral of their product. In other words, compute a sliding weighted sum of the function f(tau), where the weighting function is g(-tau).
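A minimal NumPy sketch of the sliding weighted sum described above, as deep-learning layers typically compute it (cross-correlation, i.e. the kernel is not flipped), with stride 1 and no padding. The function name and the example edge filter are our own illustrative choices.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a kernel over a 2-D image and take the weighted sum at each
    position; produces the 'convolved feature' / feature map."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

img = np.random.rand(6, 6)        # grayscale image (single plane)
edge = np.array([[1., 0., -1.],
                 [1., 0., -1.],
                 [1., 0., -1.]])  # a simple vertical-edge filter
print(conv2d(img, edge).shape)    # (4, 4) feature map
```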
In other words, convolution is a mathematical operation where each value in the output is expressed as the sum of values in the input multiplied by a set of weighting coefficients.

2. Write short notes on (i) Sparse Interactions (ii) Parameter Sharing (iii) Equivariance.
Convolution leverages some important ideas that can help improve a machine learning system:
1. Sparse interactions (sparse connectivity or sparse weights)
2. Parameter (weight) sharing
3. Equivariant representations
4. Working with inputs of variable size

(i) Sparse Interactions
Convolutional neural networks are more efficient than simple neural networks, in applications where they apply, because they significantly reduce the number of parameters, which reduces the required memory of the network and improves its statistical efficiency. They exploit feature locality. How? They try to find patterns in the input data and stack them to make abstract concepts with their convolution layers. A convolution layer defines a window (filter or kernel) through which it examines a subset of the data, and subsequently scans the data looking through this window. We can parameterize the window to look for specific features (e.g. edges within an image). The output it produces focuses solely on the regions of the data that exhibit the feature it was searching for. This is what we call sparse connectivity, sparse interactions, or sparse weights: it limits the activated connections of each layer. In the example below, a 5 x 5 input with a 2 x 2 filter produces a reduced 4 x 4 output. The first element of the feature map is calculated by convolving the top-left 2 x 2 area of the input with the filter, i.e. 1 x 0 + 2 x 1 + 2 x 1 + 1 x 2 = 6 is the first element of the feature map.
In practice, we do not explicitly define the filters that our convolutional layer will use; we instead parameterize the filters and let the network learn the best filters to use during training. We do, however, define how many filters we use at each layer, a hyperparameter which is called the depth of the output volume. Another hyperparameter is the stride, which defines how much we slide the filter over the data. For example, if the stride is 1, we move the window by 1 pixel at a time over the image; when we use larger stride values of 2 or 3, we jump 2 or 3 pixels at a time, which reduces the output size significantly. The last hyperparameter is the size of the zero padding, since it is sometimes convenient to pad the input volume with zeros around the border.
So now we can compute the spatial size of the output volume as a function of the input volume size (W), the receptive field size of the Conv Layer neurons (F), the stride with which they are applied (S), and the amount of zero padding used (P) on the border. The formula for calculating how many neurons "fit" is:
(W - F + 2P) / S + 1
In our previous example, for the 5 x 5 input (W = 5) and the 2 x 2 filter (F = 2) with stride 1 (S = 1) and pad 0 (P = 0), we would get a 4 x 4 x (number of filters) output.

(ii) Parameter (weight) sharing
Parameter sharing is used in convolutional layers to reduce the number of parameters in the network.
For example, in the first convolutional layer let us say we have an output of 15 x 15 x 4, where 15 is the size of the output map and 4 is the number of filters used in this layer. For each output node in that layer we have the same filter, thus dramatically reducing the storage requirements of the model to the size of the filter. The same filter weights (e.g. 1, 0, -1) are used across that layer. Parameter sharing allows models to capture local connectivity while simultaneously computing the same features at different spatial locations. We will see the use of this property soon. Here we make a short detour to section 5 for discussing locally connected layers and tiled convolution.

Locally connected layer / unshared convolution: The connectivity graph of the convolution operation and the locally connected layer is the same. The only difference is that parameter sharing is not performed, i.e. each output unit performs a linear operation on its neighbourhood but the parameters are not shared across output units. This allows models to capture local connectivity while allowing different features to be computed at different spatial locations. However, this requires many more parameters than the convolution operation.

Tiled convolution is a middle step between a locally connected layer and traditional convolution. It uses a set of kernels that are cycled through. This reduces the number of parameters in the model while allowing some of the freedom provided by unshared convolution.

(Figure: comparison of connectivity and parameters of locally connected (top), tiled (middle) and standard convolution (bottom).)

The parameter complexity and computation complexity can be obtained as below. Note that:
m = number of input units
n = number of output units
k = kernel size
t = number of kernels in the set (for tiled convolution)

Type                Computations   Parameters
Fully connected     O(m x n)       O(m x n)
Locally connected   O(k x n)       O(k x n)
Tiled               O(k x n)       O(k x t)
Traditional         O(k x n)       O(k)

You can see now that the quantity of roughly 451 thousand parameters corresponds to the locally connected convolution operation. If we use a set of t = 200 kernels, the number of parameters for tiled convolution is 1.8 thousand. For a traditional convolution operation, this number is 9 parameters.

(iii) Equivariance
A function f is said to be equivariant to a function g if f(g(x)) = g(f(x)), i.e. if the input changes, the output changes in the same way. Parameter sharing in a convolutional network provides equivariance to translation. What this means is that a translation of the image results in a corresponding translation in the output map (except maybe for boundary pixels). The reason for this is intuitive: the same feature is being computed at all input points.
Equivariant representations mean that the outputs vary in the same way as the inputs. Equivariance to translation means that a translation of the input features results in an equivalent translation of the outputs. It makes the CNN understand rotation or proportion changes. Equivariance allows the network to generalize edge, texture, and shape detection to different locations.

(Figure: a translated image produces a correspondingly translated representation of active neurons.)

Inputs of variable size
A typical layer of a convolutional network consists of three stages. In the first stage, the layer performs several convolutions in parallel to produce a set of linear activations. In the second stage, each linear activation is run through a nonlinear activation function, such as the rectified linear activation function. This stage is sometimes called the detector stage.
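To tie the parameter-complexity table above to the concrete counts quoted (about 451 thousand, 1.8 thousand and 9 parameters), here is a small Python sketch. The specific sizes assumed, a 3 x 3 kernel (k = 9), a 224 x 224 output map (n = 50176), m = n input units and t = 200 tiled kernels, are our own illustrative assumptions chosen to be consistent with those counts; they are not stated in the text.

```python
# Parameter counts for one layer, following the table above.
k, n, t = 9, 224 * 224, 200   # illustrative assumptions, not from the text
m = n

params = {
    "fully connected":   m * n,   # every input unit connects to every output unit
    "locally connected": k * n,   # own k weights per output unit, no sharing
    "tiled (t=200)":     k * t,   # t kernel stacks cycled through space
    "traditional conv":  k,       # one kernel shared everywhere
}
for name, count in params.items():
    print(f"{name:18s} ~{count:,} parameters")
# locally connected -> 451,584 (~451 thousand), tiled -> 1,800, traditional -> 9
```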
3. Illustrate the concept of Pooling with a suitable example.
In the third stage, we use a pooling function to modify the output of the layer further. A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs, such as the max (most commonly used), the average, a weighted average, or the L2 norm.

(Figure: max pooling with a 2 x 2 filter and stride (2, 2).)

Pooling helps to make the representation approximately invariant to small translations of the input. It is essential for handling inputs of variable size (for example, images of different sizes): for classification of images to work, the input to the classification layer must have a fixed size, and this is accomplished by using the pooling layer. The pooling layer is responsible for reducing the spatial size of the convolved feature. This decreases the computational power required to process the data by reducing the dimensions. There are two main types of pooling: average pooling and max pooling.
In max pooling, we find the maximum value of a pixel from the portion of the image covered by the kernel. Max pooling also performs as a noise suppressant: it discards the noisy activations altogether and performs de-noising along with dimensionality reduction. On the other hand, average pooling returns the average of all the values from the portion of the image covered by the kernel; it simply performs dimensionality reduction as a noise-suppressing mechanism. Hence, we can say that max pooling performs a lot better than average pooling.

(Figure: max pooling vs average pooling on the same feature map.)

4. Briefly explain the concept of Convolution variants.
Neural net convolution is different:
* Convolution in the context of neural networks does not refer exactly to the standard convolution operation in mathematics; the functions used differ slightly.
Convolution operation in neural networks:
1. It refers to an operation that consists of many applications of convolution in parallel. This is because convolution with a single kernel can only extract one kind of feature, albeit at many locations, and usually we want to extract many kinds of features at many locations.
2. The input is usually not a grid of real values. Rather, it is a grid of vector-valued observations, e.g. a color image has R, G, B values at each pixel. The input to the next layer is the output of the first layer, which has many different convolutions at each position.
When working with images, input and output are 3-D tensors. Image software uses four indices: one index for the channel, two indices for the spatial coordinates of each channel, and a fourth index for different samples in a batch. We omit the batch axis for simplicity of discussion.

Multichannel convolution
Because we are dealing with multichannel convolution, the linear operations are not usually commutative, even if kernel flipping is used. These multi-channel operations are only commutative if each operation has the same number of output channels as input channels.

Definition of a 4-D kernel tensor
Assume we have a 4-D kernel tensor K with element K(i, j, k, l) giving the connection strength between a unit in channel i of the output and a unit in channel j of the input, with an offset of k rows and l columns between the output unit and the input unit. Assume our input consists of observed data V with element V(i, j, k) giving the value of the input unit within channel i at row j and column k. Assume our output consists of Z with the same format as V. If Z is produced by convolving K across V without flipping K, then:

Z(i, j, k) = sum over l, m, n of V(l, j + m - 1, k + n - 1) K(i, l, m, n)
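A minimal NumPy sketch of the multichannel, no-flip convolution defined by the equation above, using 0-based indexing instead of the 1-based indexing of the formula, with stride 1 and no padding. The function name and the example shapes are our own illustrative choices.

```python
import numpy as np

def multichannel_conv(V, K):
    """Z[i, j, k] = sum over l, m, n of V[l, j+m, k+n] * K[i, l, m, n]
    (0-based indices, stride 1, no padding, kernel not flipped)."""
    C_in, H, W = V.shape            # input: channels x height x width
    C_out, _, kh, kw = K.shape      # kernel: out-ch x in-ch x kh x kw
    Z = np.zeros((C_out, H - kh + 1, W - kw + 1))
    for i in range(C_out):
        for j in range(H - kh + 1):
            for k in range(W - kw + 1):
                Z[i, j, k] = np.sum(V[:, j:j + kh, k:k + kw] * K[i])
    return Z

V = np.random.rand(3, 8, 8)         # e.g. an RGB image: 3 channels, 8 x 8
K = np.random.rand(4, 3, 3, 3)      # 4 output channels, 3 x 3 kernels
print(multichannel_conv(V, K).shape)  # (4, 6, 6)
```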
Convolution with a stride
We may want to skip over some positions of the kernel in order to reduce the computational cost, at the cost of not extracting our features as finely. We can think of this as downsampling the output of the full convolution function. If we want to sample only every s pixels in each direction of the output, then we can define a downsampled convolution function c with stride s. We refer to s as the stride; we may also want to define a different stride for each direction. Here, for example, we use a stride of 2.
Convolution with a stride of length two can be implemented in a single operation, or as convolution with a unit stride followed by downsampling. The two-step approach is computationally wasteful, because it computes many values that are then discarded.

Effect of zero padding on network size
Consider a convolutional net with a kernel of width 6 at every layer and no pooling, so only convolution shrinks the representation. If we do not use any implicit zero padding, the representation shrinks by five pixels at each layer. Starting from an input of 16 pixels, we are only able to have three convolutional layers, and the last layer does not ever move the kernel.

Tiled convolution
* A compromise between a convolutional layer and a locally connected layer.
* Rather than learning a separate set of weights at every spatial location, we learn a set of kernels that we rotate through as we move through space.
* This means that immediately neighbouring locations will have different filters, like in a locally connected layer, but the memory requirements for storing the parameters will increase only by a factor of the size of this set of kernels rather than the size of the entire output feature map.
A locally connected layer has no sharing at all: each connection has its own weight. Tiled convolution has a set of different kernels (e.g. with t = 2). Traditional convolution is equivalent to tiled convolution with t = 1: there is only one kernel and it is applied everywhere.

Defining tiled convolution algebraically
Let K be a 6-D tensor, where two of the dimensions correspond to different locations in the output map. Rather than having a separate index for each location in the output map, output locations cycle through a set of t different choices of kernel stack in each direction. If t is equal to the output width, this is the same as a locally connected layer. Here % is the modulo operation, with t % t = 0, (t + 1) % t = 1, etc.

Transposed convolution
A transposed convolutional layer is an upsampling layer that generates an output feature map larger than the input feature map. It is similar to a deconvolutional layer. A deconvolutional layer reverses a standard convolutional layer: if the output of the standard convolution layer is deconvolved with the deconvolutional layer, the output will be the same as the original values. A transposed convolution does not recover the same values, but it can reverse to the same dimensions. Transposed convolutional layers are used in a variety of tasks, including image generation, image super-resolution, and image segmentation. Instead of sliding the kernel over the input and performing element-wise multiplication and summation, a transposed convolutional layer slides the input over the kernel and performs element-wise multiplication and summation. This results in an output that is larger than the input, and the size of the output can be controlled by the stride and padding parameters of the layer.
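As a concrete illustration of the upsampling behaviour described above, here is a minimal NumPy sketch of one common way to realise a transposed convolution: each input value scatters a scaled copy of the kernel into a larger output, and overlapping contributions are summed. The function name and the 2 x 2 input / 3 x 3 kernel sizes are our own illustrative choices.

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=2):
    """Each input value scatters a scaled copy of the kernel into the output;
    overlaps are summed. Output side = (n - 1) * stride + k, i.e. larger
    than the input (upsampling)."""
    n, _ = x.shape
    k, _ = kernel.shape
    out_size = (n - 1) * stride + k
    out = np.zeros((out_size, out_size))
    for i in range(n):
        for j in range(n):
            out[i * stride:i * stride + k,
                j * stride:j * stride + k] += x[i, j] * kernel
    return out

x = np.array([[1., 2.],
              [3., 4.]])
w = np.ones((3, 3))
print(transposed_conv2d(x, w, stride=2).shape)  # (5, 5): 2 x 2 upsampled to 5 x 5
```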
Dilated convolution
Dilated convolution is a technique that expands the kernel by inserting holes between its consecutive elements. In simpler terms, it is the same as convolution but involves pixel skipping, so as to cover a larger area of the input. Dilated convolution, also known as atrous convolution, is a type of convolution operation used in convolutional neural networks (CNNs) that enables the network to have a larger receptive field without increasing the number of parameters. In a dilated convolution operation, the filter is "dilated" by inserting gaps between the filter values. The dilation rate determines the size of the gaps, and it is a hyperparameter that can be adjusted. When the dilation rate is 1, the dilated convolution reduces to a regular convolution. An additional parameter l (the dilation factor) tells how much the input is expanded: based on the value of this parameter, l - 1 pixels are skipped in the kernel. In essence, normal convolution is just a 1-dilated convolution.

(Figure: normal convolution vs dilated convolution.)

Advantages of dilated convolutions:
1. Increased receptive field without increasing parameters
2. Can capture features at multiple scales
3. Reduced spatial resolution loss compared to regular convolutions with larger filters
Disadvantages of dilated convolutions:
1. Reduced spatial resolution in the output feature map compared to the input feature map
2. Increased computational cost compared to regular convolutions with the same filter size and stride

5. Explain the types of non-linearity functions in neural networks.
Non-linearity means that the neural network can successfully approximate functions that do not follow linearity, or successfully predict the class of a function that is divided by a decision boundary which is not linear. The non-linear functions are the most used activation functions. They make it easy for a neural network model to adapt to a variety of data and to differentiate between the outcomes. These functions are mainly divided on the basis of their range or curves:

(i) Sigmoid activation function
Sigmoid takes a real value as input and outputs another value between 0 and 1. The sigmoid activation function translates input in the range (-infinity, +infinity) to the range (0, 1).

(ii) Tanh activation function
The tanh function is just another possible function that can be used as a non-linear activation function between layers of a neural network. It shares a few things in common with the sigmoid activation function. Unlike a sigmoid function that maps input values between 0 and 1, tanh maps values between -1 and 1. Similar to the sigmoid function, one of the interesting properties of the tanh function is that its derivative can be expressed in terms of the function itself.

(iii) ReLU activation function
The formula is deceptively simple: max(0, z). Despite its name, Rectified Linear Unit, it is not linear and provides the same benefits as sigmoid but with better performance.

(iv) Leaky ReLU
Leaky ReLU is a variant of ReLU. Instead of being 0 when z < 0, a leaky ReLU allows a small, non-zero, constant gradient a (normally a = 0.01). However, the consistency of the benefit across tasks is presently unclear. Leaky ReLUs attempt to fix the "dying ReLU" problem.

(v) Parametric ReLU
PReLU gives the neurons the ability to choose what slope is best in the negative region. They can become ReLU or leaky ReLU with certain values of a.
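The activation functions above can be written in a few lines of NumPy. This is a generic sketch, not code from the text; the function names and the sample input vector are our own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # squashes input to (0, 1)

def tanh(z):
    return np.tanh(z)                       # squashes input to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)               # max(0, z)

def leaky_relu(z, a=0.01):
    return np.where(z > 0, z, a * z)        # small slope a for z < 0

z = np.array([-2.0, -0.5, 0.0, 1.5])
for fn in (sigmoid, tanh, relu, leaky_relu):
    print(fn.__name__, fn(z))
```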
6. Explain the various loss functions in convolutional neural networks.
In most cases, error function and loss function mean the same thing, but with a tiny difference. An error function measures/calculates how far our model deviates from the correct prediction. A loss function operates on the error to quantify how bad it is to get an error of a particular size/direction, which is affected by the negative consequences that result from an incorrect prediction. A loss function can be either discrete or continuous.

Loss functions:
1. Mean Squared Error loss
2. Cross-Entropy loss
3. Mean Absolute Percentage Error

Mean Squared Error loss function
The mean squared error (MSE) loss function is the sum of squared differences between the entries in the prediction vector y and the entries in the ground truth vector y_hat:

L(y, y_hat) = (1/N) * sum over i of (y_i - y_hat_i)^2

where y_i are the entries in the prediction vector and y_hat_i are the entries in the ground truth label. You divide the sum of squared differences by N, which corresponds to the length of the vectors. If the output y of your neural network is a vector with multiple entries, then N is the number of vector entries, with y_i being one particular entry in the output vector.
The mean squared error loss function is the perfect loss function if you are dealing with a regression problem, that is, if you want your neural network to predict a continuous scalar value. Examples of regression problems are predicting the number of products needed in a supply chain, future real estate prices under certain market conditions, or a stock value.

Cross-Entropy loss function
Regression is only one of two areas where feedforward networks enjoy great popularity; the other area is classification. In classification tasks, we deal with predictions of probabilities, which means the output of a neural network must be in a range between zero and one. A loss function that can measure the error between a predicted probability and the label which represents the actual class is called the cross-entropy loss function.
One important thing we need to discuss before continuing with the cross-entropy is what the ground truth vector looks like in the case of a classification problem: it is a one-hot-encoded vector. The label vector y_hat is one-hot encoded, which means the values in this vector can only take discrete values of either zero or one. The entries in this vector represent different classes; the values of these entries are zero, except for a single entry which is one. This entry tells us the class into which we want to classify the input feature vector x. The prediction y, however, can take continuous values between zero and one.
Given the prediction vector y and the ground truth vector y_hat, you can compute the cross-entropy loss between those two vectors as follows:

L(y, y_hat) = - sum over i of y_hat_i * log(y_i)

First, we sum up the products between the entries of the label vector y_hat and the logarithms of the entries of the prediction vector y. Then we negate the sum to get a positive value of the loss function.
One interesting thing to consider is the plot of the cross-entropy loss function: the value of the loss function (y-axis) versus the predicted probability y_i, where y_i takes values between zero and one. We can see clearly that the cross-entropy loss function grows exponentially for lower values of the predicted probability y_i. For y_i approaching 0, the function becomes infinite, while for y_i = 1 the neural network makes an accurate probability prediction and the loss value goes to zero.
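A small NumPy sketch of the MSE and cross-entropy formulas above. The function names and the example prediction/label vectors are our own illustrative values.

```python
import numpy as np

def mse_loss(y_pred, y_true):
    """Mean squared error: (1/N) * sum_i (y_i - y_hat_i)^2."""
    return np.mean((y_pred - y_true) ** 2)

def cross_entropy_loss(y_pred, y_true_onehot, eps=1e-12):
    """Cross-entropy: -sum_i y_hat_i * log(y_i), with a one-hot label."""
    return -np.sum(y_true_onehot * np.log(y_pred + eps))

# Regression example
print(mse_loss(np.array([2.5, 0.0, 2.0]), np.array([3.0, -0.5, 2.0])))  # ~0.167

# Classification example: the true class is index 1
y_true = np.array([0.0, 1.0, 0.0])
print(cross_entropy_loss(np.array([0.1, 0.8, 0.1]), y_true))  # ~0.223
print(cross_entropy_loss(np.array([0.7, 0.2, 0.1]), y_true))  # ~1.609 (worse prediction)
```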
Mean Absolute Percentage Error
Finally, we come to the Mean Absolute Percentage Error (MAPE) loss function. This loss function does not get much attention in deep learning; for the most part, we use it to measure the performance of a neural network during demand forecasting tasks. The mean absolute percentage error, also known as mean absolute percentage deviation (MAPD), usually expresses accuracy as a percentage. We define it with the following equation:

MAPE = (100% / N) * sum over i of |y_i - y_hat_i| / |y_hat_i|

In this equation, y_i is the predicted value and y_hat_i is the label. We divide the difference between y_i and y_hat_i by the actual value y_hat_i; finally, multiplying by 100 percent gives us the percentage error. Applying this equation gives a more meaningful understanding of the model's performance: in the first case, the deviation from the ground truth label would be only one percent, while in the second case the deviation would be 66 percent.

7. Write a brief note on optimizers and gradient computation in CNN.

Optimizers
Optimizer algorithms are optimization methods that help improve a deep learning model's performance. These optimization algorithms, or optimizers, widely affect the accuracy and training speed of the deep learning model. An optimizer is a function or an algorithm that adjusts the attributes of the neural network, such as weights and learning rates. Thus, it helps in reducing the overall loss and improving accuracy.

Important deep learning terms
Before proceeding, there are a few terms that you should be familiar with.
Epoch - the number of times the algorithm runs on the whole training dataset.
Sample - a single row of a dataset.
Batch - the number of samples to be taken for updating the model parameters.
Learning rate - a parameter that provides the model a scale of how much the model weights should be updated.
Cost function / loss function - used to calculate the cost, which is the difference between the predicted value and the actual value.
Weights / bias - the learnable parameters in a model that control the signal between two neurons.

Gradient Descent deep learning optimizer
Gradient descent can be considered the popular kid among the class of optimizers. This optimization algorithm uses calculus to modify the values consistently and to achieve the local minimum. Before moving ahead, you might have the question of what a gradient is. In simple terms, consider you are holding a ball resting at the top of a bowl. When you release the ball, it goes along the steepest direction and eventually settles at the bottom of the bowl. A gradient provides the ball with the steepest direction to reach the local minimum, which is the bottom of the bowl. The update rule is:

x = x - alpha * f'(x)

Here alpha is the step size that represents how far to move against each gradient with each iteration.
Gradient descent works as follows:
1. It starts with some coefficients, sees their cost, and searches for a cost value lower than the current one.
2. It moves towards the lower weight and updates the value of the coefficients.
3. The process repeats until the local minimum is reached. A local minimum is a point beyond which it cannot proceed.

(Figure: starting from an initial weight, gradient descent moves downhill towards the global minimum.)

Gradient descent works best for most purposes.
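A minimal sketch of the update rule x = x - alpha * f'(x) on a simple convex function. The function f(x) = (x - 3)^2 and the hyperparameter values are our own illustrative choices, not from the text.

```python
# Gradient descent on f(x) = (x - 3)^2, whose gradient is f'(x) = 2 * (x - 3).
alpha = 0.1          # step size (learning rate)
x = 10.0             # arbitrary starting point

for epoch in range(100):
    grad = 2 * (x - 3)
    x = x - alpha * grad      # move against the gradient

print(round(x, 4))   # ~3.0, the bottom of the "bowl"
```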
However, it has some downsides too. It is expensive to calculate the gradients if the size of the data is huge, and gradient descent works well for convex functions but does not know how far to travel along the gradient for non-convex functions.

Stochastic Gradient Descent with Momentum deep learning optimizer
Stochastic gradient descent takes a much noisier path than the batch gradient descent algorithm. Due to this, it requires a more significant number of iterations to reach the optimal minimum, and hence the computation time is very slow. To overcome this, we use stochastic gradient descent with a momentum algorithm. What the momentum does is help in faster convergence of the loss function. Stochastic gradient descent oscillates between either direction of the gradient and updates the weights accordingly; however, adding a fraction of the previous update to the current update will make the process a bit faster. One thing that should be remembered while using this algorithm is that the learning rate should be decreased with a high momentum term.
In the corresponding convergence graphs, the left part shows the path of the stochastic gradient descent algorithm while the right side shows SGD with momentum. Comparing the path chosen by both algorithms shows that using momentum helps reach convergence in less time. You might be thinking of using a large momentum and learning rate to make the process even faster, but remember that while increasing the momentum, the possibility of passing the optimal minimum also increases. This might result in poor accuracy and even more oscillations.

Mini-Batch Gradient Descent deep learning optimizer
In this variant of gradient descent, instead of taking all the training data, only a subset of the dataset is used for calculating the loss function. Since we are using a batch of data instead of the whole dataset, fewer iterations are needed. That is why the mini-batch gradient descent algorithm is faster than both the stochastic gradient descent and batch gradient descent algorithms. This algorithm is more efficient and robust than the earlier variants of gradient descent. As the algorithm uses batching, all the training data need not be loaded into memory, thus making the process more efficient to implement. Moreover, the cost function in mini-batch gradient descent is noisier than in the batch gradient descent algorithm but smoother than in the stochastic gradient descent algorithm. Because of this, mini-batch gradient descent is ideal and provides a good balance between speed and accuracy.
Despite all that, the mini-batch gradient descent algorithm has some downsides too. It needs a hyperparameter, the mini-batch size, which needs to be tuned to achieve the required accuracy (although a batch size of 32 is considered appropriate for almost every case). Also, in some cases it results in poor final accuracy. Due to this, the need arises to look for other alternatives.

Adagrad (Adaptive Gradient Descent) deep learning optimizer
The adaptive gradient descent algorithm is slightly different from other gradient descent algorithms. This is because it uses different learning rates for each iteration. The change in learning rate depends upon the difference in the parameters during training: the more the parameters get changed, the smaller the learning rate changes. This modification is highly beneficial because real-world datasets contain sparse as well as dense features.
So it is unfair to have the same value of learning rate for all the features. The Adagrad algorithm uses the formula below to update the weights. Here alpha(t) denotes the different learning rate at each iteration, eta is a constant, and epsilon is a small positive value to avoid division by zero:

w_t = w_(t-1) - alpha(t) * (dL / dw_(t-1)),   where   alpha(t) = eta / sqrt(sum of squared past gradients + epsilon)

The benefit of using Adagrad is that it abolishes the need to modify the learning rate manually. It is more reliable than gradient descent algorithms and their variants, and it reaches convergence at a higher speed.
One downside of the Adagrad optimizer is that it decreases the learning rate aggressively and monotonically. There might come a point when the learning rate becomes extremely small, because the squared gradients in the denominator keep accumulating and thus the denominator keeps on increasing. Due to the small learning rate, the model eventually becomes unable to acquire more knowledge, and hence the accuracy of the model is compromised.
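To summarize the two update rules discussed above, here is a minimal NumPy sketch of one SGD-with-momentum step and one Adagrad step. The function names, hyperparameter values, and example vectors are our own illustrative assumptions, not code from the text.

```python
import numpy as np

def sgd_momentum_step(w, g, v, lr=0.01, beta=0.9):
    """Keep a running 'velocity' v; adding a fraction of the previous
    update to the current one speeds convergence and damps oscillation."""
    v = beta * v + lr * g
    return w - v, v

def adagrad_step(w, g, G, lr=0.01, eps=1e-8):
    """Per-parameter learning rates: divide by the root of the accumulated
    squared gradients G, which only grows over time."""
    G = G + g ** 2
    return w - lr * g / (np.sqrt(G) + eps), G

w = np.array([0.5, -1.0])          # illustrative weights
g = np.array([0.2, -0.4])          # illustrative gradient of the loss
w1, v = sgd_momentum_step(w, g, v=np.zeros_like(w))
w2, G = adagrad_step(w, g, G=np.zeros_like(w))
print(w1)   # [ 0.498 -0.996]
print(w2)   # [ 0.49  -0.99 ]
```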
