UNIT-2
DEEP NEURAL NETWORKS
Deep neural networks
• Deep neural networks, or deep learning networks, have several hidden layers with millions of artificial neurons linked together.
• A number, called a weight, represents the connection between one node and another. The weight is a positive number if one node excites another, or negative if one node suppresses the other.
• A deep neural network (DNN) is an artificial neural network (ANN) with multiple layers between the input and output layers.
• There are different types of neural networks, but they always consist of the same components: neurons, synapses, weights, biases, and functions.
• While deep learning is certainly not new, it is experiencing explosive growth because of the intersection of deeply layered neural networks and the use of GPUs to accelerate their execution. Big data has also fed this growth.
• Because deep learning relies on training neural networks with example data and rewarding them based on their success, the more data available, the better these deep learning structures can be built.
• The architectures and algorithms used in deep learning are wide and varied. This section explores six deep learning architectures spanning the past 20 years.
• Notably, long short-term memory (LSTM) networks and convolutional neural networks (CNNs) are two of the oldest approaches on this list but also two of the most used in various applications.
• Deep learning architectures can be divided into supervised and unsupervised learning. This unit introduces several popular architectures:
1. convolutional neural networks (CNNs)
2. recurrent neural networks (RNNs)
3. long short-term memory (LSTM) / gated recurrent unit (GRU) networks
4. self-organizing maps (SOMs)
5. autoencoders (AEs)
6. restricted Boltzmann machines (RBMs)
• It also gives an overview of deep belief networks (DBNs) and deep stacking networks (DSNs).
• The artificial neural network (ANN) is the underlying architecture behind deep learning, and several variations of these algorithms have been invented based on it.

Supervised deep learning
• Supervised learning refers to the problem space wherein the target to be predicted is clearly labelled within the data that is used for training.
• Deep learning uses supervised learning in situations such as image classification or object detection, where the network is used to predict a label or a number (the input and the output are both known).
• Because the labels of the images are known, the network is used to reduce the error rate, so it is "supervised".
• We introduce at a high level two of the most popular supervised deep learning architectures, convolutional neural networks and recurrent neural networks, as well as some of their variants.

Convolutional neural networks
• A convolutional neural network, or CNN, is a deep learning neural network designed for processing structured arrays of data such as images.
• Convolutional neural networks are widely used in computer vision and have become the state of the art for many visual applications such as image classification; they have also found success in natural language processing for text classification.
• Convolutional neural networks are very good at picking up on patterns in the input image, such as lines, gradients, circles, or even eyes and faces. It is this property that makes convolutional neural networks so powerful for computer vision.
• Unlike earlier computer vision algorithms, convolutional neural networks can operate directly on a raw image and do not need any preprocessing.
• A convolutional neural network is a feed-forward neural network, often with up to 20 or 30 layers. Its power comes from a special kind of layer called the convolutional layer.
• A CNN is a multilayer neural network that was biologically inspired by the animal visual cortex. The architecture is particularly useful in image-processing applications.
• The first CNN was created by Yann LeCun; at the time, the architecture focused on handwritten character recognition, such as postal code interpretation.
• As a deep network, early layers recognize features (such as edges), and later layers recombine these features into higher-level attributes of the input.
• The LeNet CNN architecture is made up of several layers that implement feature extraction and then classification.
• The image is divided into receptive fields that feed into a convolutional layer, which extracts features from the input image.
• The next step is pooling, which reduces the dimensionality of the extracted features (through down-sampling) while retaining the most important information (typically through max pooling).
• Another convolution and pooling step is then performed that feeds into a fully connected multilayer perceptron.
• The final output layer of this network is a set of nodes that identify features of the image (in this case, a node per identified number). The network is trained using back-propagation.
• The use of deep layers of processing, convolutions, pooling, and a fully connected classification layer opened the door to various new applications of deep learning neural networks. In addition to image processing, the CNN has been successfully applied to video recognition and various tasks within natural language processing.
• A CNN is made up of multiple layers, including convolutional layers, pooling layers, and fully connected layers.
• The convolutional layers are the key component of a CNN: filters are applied to the input image to extract features such as edges, textures, and shapes.
• The output of the convolutional layers is passed through pooling layers, which down-sample the feature maps, reducing the spatial dimensions while retaining the most important information.
• The output of the pooling layers is then passed through one or more fully connected layers, which are used to make a prediction or classify the image.
• CNNs are trained using a large dataset of labeled images, where the network learns to recognize patterns and features associated with specific objects or classes.
• Once trained, a CNN can be used to classify new images, or to extract features for use in other applications such as object detection or image segmentation.
• CNNs have achieved state-of-the-art performance on a wide range of image recognition tasks, including object classification, object detection, and image segmentation.
• They are widely used in computer vision, image processing, and related fields, and have been applied to a wide range of applications, including self-driving cars, medical imaging, and security systems.

Convolutional Neural Network Design
• A convolutional neural network is a multi-layered feed-forward neural network, built by stacking many hidden layers on top of each other in a particular order.
• It is this sequential design that permits the CNN to learn hierarchical features.
• In a CNN, convolutional layers are typically followed by activation layers, and some of them are followed by pooling layers.
• The preprocessing required by a ConvNet is much lower than for other classification algorithms; its connectivity pattern is analogous to that of neurons in the human brain and was motivated by the organization of the visual cortex. (A minimal code sketch of such a network follows.)
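To make the conv → pool → conv → pool → fully connected ordering concrete, here is a minimal LeNet-style sketch in PyTorch. The framework choice, the layer sizes, and the ten-class output are illustrative assumptions, not part of the original notes.

import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    # A LeNet-style stack: convolution -> pooling -> convolution -> pooling -> fully connected.
    def __init__(self, num_classes=10):  # 10 classes is an assumption (e.g. digits)
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # filters extract features from receptive fields
            nn.ReLU(),
            nn.MaxPool2d(2),                  # down-sampling keeps the strongest responses
            nn.Conv2d(6, 16, kernel_size=5),  # later layer recombines edge-like features
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 4 * 4, 120),       # fully connected multilayer perceptron
            nn.ReLU(),
            nn.Linear(120, num_classes),      # one output node per class
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallCNN()
out = model(torch.randn(1, 1, 28, 28))  # one 28x28 grayscale image
print(out.shape)                        # torch.Size([1, 10])

Training such a network with back-propagation would proceed exactly as the notes describe: compare the output nodes against the image labels and adjust the filter weights to reduce the error.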
• Different types of CNN models:
1. LeNet
2. AlexNet
3. ResNet
4. GoogLeNet
5. MobileNet
6. VGG

Applications of CNN
• Decoding facial recognition
• Understanding climate
• Collecting historic and environmental elements
• Image recognition
• Video analysis
• Natural language processing

Recurrent neural networks
• The RNN is one of the foundational network architectures from which other deep learning architectures are built.
• The primary difference between a typical multilayer network and a recurrent network is that, rather than completely feed-forward connections, a recurrent network might have connections that feed back into prior layers (or into the same layer).
• This feedback allows RNNs to maintain memory of past inputs and model problems in time. RNNs consist of a rich set of architectures (we will look at one popular topology, the LSTM, next).
• The key differentiator is feedback within the network, which could come from a hidden layer, the output layer, or some combination thereof.
• RNNs can be unfolded in time and trained with standard back-propagation, or by using a variant called back-propagation through time (BPTT).
• A recurrent neural network (RNN) is a type of neural network where the output from the previous step is fed as input to the current step.
• In traditional neural networks, all the inputs and outputs are independent of each other. However, in cases where the next word of a sentence must be predicted, the previous words are required, and hence there is a need to remember them.

How does an RNN differ from a feed-forward neural network?
• Artificial neural networks that do not have looping nodes are called feed-forward neural networks. Because all information is passed only forward, this kind of network is also referred to as a multi-layer neural network.
• Information moves unidirectionally from the input layer, through any hidden layers, to the output layer in a feed-forward neural network. These networks are appropriate for tasks such as image classification, where input and output are independent. Nevertheless, their inability to retain previous inputs renders them less useful for sequential data analysis.
• The RNN solves this issue with the help of a hidden state, its main and most important feature, which remembers some information about a sequence.
• The hidden state is also referred to as the memory state, since it remembers the previous input to the network.
• An RNN uses the same parameters for each input, as it performs the same task on all the inputs or hidden layers to produce the output. This parameter sharing reduces the complexity of the parameters, unlike other neural networks. (A minimal sketch of this recurrence follows.)
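As a rough illustration of the points above: the hidden (memory) state is carried from step to step while the same weights are reused at every step. The input size, hidden size, and random data below are arbitrary choices for illustration.

import torch
import torch.nn as nn

# Minimal sketch: the hidden state carries information from one time step to the next.
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(1, 5, 8)        # a batch of one sequence with 5 time steps
h0 = torch.zeros(1, 1, 16)      # initial hidden (memory) state
output, hn = rnn(x, h0)         # the same parameters are reused at every step
print(output.shape, hn.shape)   # torch.Size([1, 5, 16]) torch.Size([1, 1, 16])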
Types of RNN
There are four types of RNNs, based on the number of inputs and outputs in the network:
1. One to one
2. One to many
3. Many to one
4. Many to many

One to one
This type of RNN behaves the same as any simple neural network; it is also known as a vanilla neural network. There is only one input and one output.

One to many
In this type of RNN, there is one input and many outputs associated with it. One of the most common examples is image captioning, where, given an image, we predict a sentence containing multiple words.

Many to one
In this type of network, many inputs are fed to the network at several states, generating only one output. This type of network is used for problems like sentiment analysis, where we give multiple words as input and predict only the sentiment of the sentence as output.

Many to many
In this type of network, there are multiple inputs and multiple outputs corresponding to a problem. One example is language translation, where we provide multiple words from one language as input and predict multiple words in a second language as output.

Advantages
1. An RNN remembers each piece of information through time. This ability to remember previous inputs is what makes RNNs useful for time series prediction; extending this memory over long spans is the motivation for the long short-term memory (LSTM) variant.
2. Recurrent neural networks can even be combined with convolutional layers to extend the effective pixel neighborhood.

Disadvantages
1. Gradient vanishing and exploding problems.
2. Training an RNN is a very difficult task.
3. It cannot process very long sequences when using tanh or ReLU as the activation function.

Applications of recurrent neural networks
1. Language modelling and generating text
2. Speech recognition
3. Machine translation
4. Image recognition, face detection
5. Time series forecasting

LSTM networks
• The LSTM was created in 1997 by Hochreiter and Schmidhuber, but it has grown in popularity in recent years as an RNN architecture for various applications. You'll find LSTMs in products that you use every day, such as smartphones. IBM applied LSTMs in IBM Watson® for milestone-setting conversational speech recognition.
• LSTM (long short-term memory) is a recurrent neural network (RNN) architecture widely used in deep learning. It excels at capturing long-term dependencies, making it ideal for sequence prediction tasks.
• The LSTM departed from typical neuron-based neural network architectures and instead introduced the concept of a memory cell. The memory cell can retain its value for a short or long time as a function of its inputs, which allows the cell to remember what is important and not just its last computed value.
• The LSTM memory cell contains three gates that control how information flows into or out of the cell (see the sketch below):
• The input gate controls when new information can flow into the memory.
• The forget gate controls when an existing piece of information is forgotten, allowing the cell to remember new data.
• Finally, the output gate controls when the information contained in the cell is used in the output from the cell.
• The cell also contains weights, which control each gate. The training algorithm, commonly BPTT, optimizes these weights based on the resulting network output error.
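A minimal PyTorch sketch of the memory cell in use: nn.LSTM learns the input, forget, and output gates internally, and it tracks both a hidden state h and a memory cell state c. The sizes and random sequence are illustrative assumptions.

import torch
import torch.nn as nn

# Minimal sketch: an LSTM keeps a hidden state h and a separate memory cell c;
# the input, forget, and output gates are learned parameters inside nn.LSTM.
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(1, 20, 8)                 # one sequence of 20 time steps
output, (h_n, c_n) = lstm(x)              # c_n is the final memory cell state
print(output.shape, h_n.shape, c_n.shape)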
• Recent applications of CNNs and LSTMs have produced image and video captioning systems, in which an image or video is captioned in natural language. The CNN implements the image or video processing, and the LSTM is trained to convert the CNN output into natural language.
• Example applications: image and video captioning systems.

Advantages of LSTM
• Long-term dependencies can be captured by LSTM networks. They have a memory cell that is capable of long-term information storage.
• Traditional RNNs suffer from vanishing and exploding gradients when models are trained over long sequences. By using a gating mechanism that selectively recalls or forgets information, LSTM networks deal with this problem.
• An LSTM enables the model to capture and remember important context, even when there is a significant time gap between relevant events in the sequence. LSTMs are therefore used where understanding context is important, e.g. machine translation.

Disadvantages of LSTM
• Compared to simpler architectures like feed-forward neural networks, LSTM networks are computationally more expensive. This can limit their scalability for large-scale datasets or constrained environments.
• Training LSTM networks can be more time-consuming than training simpler models due to their computational complexity; LSTMs often require more data and longer training times to achieve high performance.
• Because a sequence is processed word by word, it is hard to parallelize the work of processing sentences.

Applications of LSTM
Language modeling: LSTMs have been used for natural language processing tasks such as language modeling, machine translation, and text summarization. They can be trained to generate coherent and grammatically correct sentences by learning the dependencies between words in a sentence.
Speech recognition: LSTMs have been used for speech recognition tasks such as transcribing speech to text and recognizing spoken commands. They can be trained to recognize patterns in speech and match them to the corresponding text.
Time series forecasting: LSTMs have been used for time series forecasting tasks such as predicting stock prices, weather, and energy consumption. They can learn patterns in time series data and use them to make predictions about future events.
Anomaly detection: LSTMs have been used for anomaly detection tasks such as detecting fraud and network intrusion. They can be trained to identify patterns in data that deviate from the norm and flag them as potential anomalies.
Recommender systems: LSTMs have been used for recommendation tasks such as recommending movies, music, and books. They can learn patterns in user behavior and use them to make personalized recommendations.
Video analysis: LSTMs have been used for video analysis tasks such as object detection, activity recognition, and action classification. They can be used in combination with other neural network architectures, such as convolutional neural networks (CNNs), to analyze video data and extract useful information.

GRU networks
• The gated recurrent unit (GRU) is a type of recurrent neural network (RNN) introduced by Cho et al. Like the LSTM, the GRU can process sequential data such as text, speech, and time-series data.
• Introduced in 2014 as a simplification of the LSTM, the GRU has two gates, dispensing with the output gate present in the LSTM model.
• These gates are an update gate and a reset gate.
• The update gate indicates how much of the previous cell contents to maintain.
• The reset gate defines how to incorporate the new input with the previous cell contents.
• A GRU can model a standard RNN simply by setting the reset gate to 1 and the update gate to 0.
• The reset gate determines how much of the previous hidden state should be forgotten, while the update gate determines how much of the new input should be used to update the hidden state. The output of the GRU is calculated based on the updated hidden state.
• The GRU is simpler than the LSTM, can be trained more quickly, and can be more efficient in its execution. However, the LSTM can be more expressive and, with more data, can lead to better results.
• Example applications: natural language text compression, handwriting recognition, speech recognition, gesture recognition, image captioning.
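For comparison with the LSTM sketch above, here is the equivalent GRU call. Because a GRU has only update and reset gates and no separate memory cell, it returns a single hidden state instead of the LSTM's (h, c) pair. Sizes are again illustrative assumptions.

import torch
import torch.nn as nn

# Minimal sketch: a GRU has update and reset gates but no separate cell state.
gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(1, 20, 8)
output, h_n = gru(x)            # one hidden state, no memory cell c
print(output.shape, h_n.shape)  # torch.Size([1, 20, 16]) torch.Size([1, 1, 16])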
Unsupervised deep learning
• Unsupervised learning refers to the problem space wherein there is no target label within the data that is used for training.
• Unsupervised learning is a type of machine learning that learns from data without human supervision. Unlike supervised learning, unsupervised models are given unlabeled data, with no pre-existing labels or categories, and are allowed to discover patterns, relationships, and insights without any explicit guidance or instruction.
• Here the task of the machine is to group unsorted information according to similarities, patterns, and differences, without any prior training on the data.
• This section discusses three unsupervised deep learning architectures: self-organizing maps, autoencoders, and restricted Boltzmann machines. We also discuss how deep belief networks and deep stacking networks are built on top of these underlying unsupervised architectures.

Self-organizing maps
• The self-organizing map (SOM) was invented by Dr. Teuvo Kohonen in 1982 and is popularly known as the Kohonen map.
• The SOM is an unsupervised neural network that creates clusters of the input data set by reducing its dimensionality. SOMs vary from the traditional artificial neural network in quite a few ways.
• The SOM is used for clustering and mapping (dimensionality reduction), projecting multidimensional data onto a lower-dimensional space, which reduces complex problems to a form that is easy to interpret.
• A SOM has two layers: an input layer and an output layer.
• The first significant variation is that weights serve as a characteristic of the node. After the inputs are normalized, a random input is first chosen. Random weights close to zero are initialized for each feature of the input record; these weights now represent the input node, and several combinations of these random weights represent variations of it.
• The Euclidean distance between each output node and the input node is calculated. The node with the least distance is declared the most accurate representation of the input and is marked as the best matching unit (BMU).
• With these BMUs as center points, other units are similarly calculated and assigned to the cluster they are closest to.
• The weights of the nodes within a radius around the BMU are updated based on their proximity to it, and the radius is shrunk over time.
• In a SOM, no activation function is applied, and because there are no target labels to compare against, there is no concept of calculating an error or of back-propagation.
• Example applications: dimensionality reduction, clustering high-dimensional inputs to a 2-dimensional output, and cluster visualization.

Algorithm (training)
Step 1: Initialize the weights w_ij with random values and initialize the learning rate α.
Step 2: Calculate the squared Euclidean distance for each output unit j:
D(j) = Σ (w_ij − x_i)², where i = 1 to n and j = 1 to m
Step 3: Find the index J for which D(J) is minimal; this is the winning unit.
Step 4: For each unit j within a specific neighborhood of J, and for all i, calculate the new weights:
w_ij(new) = w_ij(old) + α [x_i − w_ij(old)]
Step 5: Update the learning rate: α(t+1) = 0.5 α(t)
Step 6: Test the stopping condition.
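A rough NumPy sketch of Steps 1 to 5 above, under simplifying assumptions: a tiny map, a neighborhood radius of zero (only the BMU is updated), and toy random data in place of a real training set.

import numpy as np

rng = np.random.default_rng(0)
n_features, n_units = 4, 9                    # assumed sizes: 9 map units, 4-D inputs
weights = rng.random((n_units, n_features))   # Step 1: random initial weights
alpha = 0.5                                   # Step 1: initial learning rate

def som_step(x, weights, alpha):
    d = ((weights - x) ** 2).sum(axis=1)      # Step 2: squared Euclidean distances
    j = int(np.argmin(d))                     # Step 3: winning unit (BMU)
    weights[j] += alpha * (x - weights[j])    # Step 4: move the winner toward the input
    return j

for epoch in range(10):
    for x in rng.random((20, n_features)):    # toy unlabeled data
        som_step(x, weights, alpha)
    alpha *= 0.5                              # Step 5: decay the learning rate

Note there is no error signal and no back-propagation here, matching the description above: the weights simply drift toward the inputs they win.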
Autoencoders
• Though the history of when autoencoders were invented is hazy, the first known usage is attributed to LeCun in 1987. This variant of an ANN is composed of three layers: input, hidden, and output.
• First, the input layer is encoded into the hidden layer using an appropriate encoding function. The number of nodes in the hidden layer is much smaller than the number of nodes in the input layer.
• This hidden layer contains the compressed representation of the original input. The output layer aims to reconstruct the input layer by using a decoder function.
• During the training phase, the difference between the input and the output is calculated using an error function, and the weights are adjusted to minimize the error.
• Unlike traditional unsupervised learning techniques, where there is no data to compare the outputs against, autoencoders learn continuously using back-propagation. For this reason, autoencoders are classified as self-supervised algorithms.
• The input layer takes the raw input data.
• The hidden layers progressively reduce the dimensionality of the input, capturing important features and patterns. These layers compose the encoder.
• The bottleneck layer (latent space) is the final hidden layer, where the dimensionality is significantly reduced. This layer represents the compressed encoding of the input data.
• Example applications: dimensionality reduction, data interpolation, and data compression/decompression.

Types of autoencoders
1. Denoising autoencoder: works on a partially corrupted input and trains to recover the original, undistorted input. As mentioned above, this is an effective way to constrain the network from simply copying the input, forcing it to learn the underlying structure and important features of the data.
2. Sparse autoencoder: typically contains more hidden units than inputs, but only a few are allowed to be active at once. This property is called the sparsity of the network. Sparsity can be controlled by manually zeroing the required hidden units, tuning the activation functions, or adding a sparsity term to the cost function.
3. Variational autoencoder: makes strong assumptions about the distribution of the latent variables and uses the Stochastic Gradient Variational Bayes estimator in the training process.
4. Convolutional autoencoder: uses convolutional neural networks (CNNs) as its building blocks. The encoder consists of multiple layers that take an image or a grid as input and pass it through convolution layers, forming a compressed representation of the input. The decoder is the mirror image of the encoder: it deconvolves the compressed representation and tries to reconstruct the original image.
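A minimal PyTorch sketch of a plain (fully connected) autoencoder. The 784-unit input (a flattened 28x28 image) and the 16-unit bottleneck are illustrative assumptions; a convolutional autoencoder would swap the linear layers for convolution and deconvolution layers.

import torch
import torch.nn as nn

# Minimal sketch: the encoder compresses to a small bottleneck, the decoder reconstructs.
class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU(),
                                     nn.Linear(64, 16))           # bottleneck (latent space)
        self.decoder = nn.Sequential(nn.Linear(16, 64), nn.ReLU(),
                                     nn.Linear(64, 784))          # mirror of the encoder

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
x = torch.rand(8, 784)                      # toy batch of flattened 28x28 inputs
loss = nn.functional.mse_loss(model(x), x)  # error between input and reconstruction
loss.backward()                             # weights adjusted to minimize the error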
Restricted Boltzmann machines
• Though RBMs became popular much later, they were originally invented by Paul Smolensky in 1986 and known as a Harmonium.
• An RBM is a two-layered neural network; the layers are the visible (input) layer and the hidden layer. In RBMs, every node in the hidden layer is connected to every node in the visible layer.
• In a traditional Boltzmann machine, nodes within the input and hidden layers are also connected to each other. Because of the computational complexity this creates, nodes within a layer are not connected in a restricted Boltzmann machine.
• During the training phase, RBMs estimate the probability distribution of the training set using a stochastic approach. When training begins, each neuron is activated at random.
• The model also contains respective hidden and visible biases. The hidden bias is used in the forward pass to build the activation, while the visible bias helps in reconstructing the input.
• Because in an RBM the reconstructed input is always different from the original input, RBMs are known as generative models.
• Also, because of the built-in randomness, the same input can produce different outputs. This is the most significant difference from an autoencoder, which is a deterministic model.
• Example applications: dimensionality reduction and collaborative filtering.
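A rough sketch of one contrastive-divergence (CD-1) update for a tiny binary RBM, written directly in PyTorch tensor operations. The layer sizes and learning rate are illustrative assumptions; the stochastic bernoulli sampling is the built-in randomness that makes the model generative rather than deterministic.

import torch

# Minimal sketch: one CD-1 step for a binary RBM with assumed sizes.
n_visible, n_hidden = 6, 3
W = torch.randn(n_visible, n_hidden) * 0.1
b_visible = torch.zeros(n_visible)        # visible bias: helps reconstruct the input
b_hidden = torch.zeros(n_hidden)          # hidden bias: helps build the activation
lr = 0.1

v0 = torch.randint(0, 2, (1, n_visible)).float()   # one binary training vector

# Forward pass: stochastically activate the hidden units.
p_h0 = torch.sigmoid(v0 @ W + b_hidden)
h0 = torch.bernoulli(p_h0)                          # random activation

# Reconstruction: the generated input differs from the original.
p_v1 = torch.sigmoid(h0 @ W.t() + b_visible)
v1 = torch.bernoulli(p_v1)
p_h1 = torch.sigmoid(v1 @ W + b_hidden)

# CD-1 update: positive phase minus negative (reconstruction) phase.
W += lr * (v0.t() @ p_h0 - v1.t() @ p_h1)
b_visible += lr * (v0 - v1).squeeze(0)
b_hidden += lr * (p_h0 - p_h1).squeeze(0)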
Multimodal fusion architectures
Multimodal fusion is the integration of heterogeneous data from different modalities to take advantage of their complementarity and provide better prediction performance. Each modality contains information that is useful and complementary to the other modalities. Multimodal fusion architectures are designed to combine information from multiple modalities at various levels of the network. Common approaches to multimodal fusion in deep neural networks include:
1. Early fusion
• In early fusion, features from different modalities are combined at the input layer of the neural network.
• For example, in a task involving both images and text, early fusion might concatenate image features and text embeddings before passing them through the network.
2. Late fusion
• Late fusion involves processing each modality independently through separate neural network branches and combining their outputs at a later stage, typically at the final layers (see the sketch below).
• This approach allows the network to learn modality-specific representations before making a joint decision.
3. Intermediate fusion
• Intermediate fusion combines features from different modalities at an intermediate layer of the network.
• This allows the model to capture both early- and late-fusion characteristics, leveraging modality-specific information while also facilitating joint learning.
4. Attention mechanisms
Attention mechanisms, such as self-attention or cross-modal attention, can be used to dynamically weigh the importance of different modalities or parts of modalities. These mechanisms enable the network to focus on the information relevant to a given task.
5. Multimodal Transformers
Transformer architectures, initially developed for natural language processing, have been adapted for multimodal tasks. Multimodal Transformers can process sequences of data from different modalities, enabling effective fusion for tasks like image captioning or video understanding.
6. Graph neural networks (GNNs)
GNNs can be applied when the relationships between modalities can be represented as a graph structure. Nodes in the graph may correspond to different modalities, and edges may represent relationships or interactions between them.
7. Memory networks
Memory-augmented neural networks can be used to store and retrieve information from different modalities dynamically during processing. This allows the network to maintain context and relevant information across modalities.
8. Hybrid architectures
Hybrid architectures combine elements from various fusion strategies to create a custom solution tailored to the specific requirements of the task. For instance, a model might use early fusion for one set of modalities and late fusion for another.

• In a multimodal setting, it is very common to transfer models trained on the individual modalities and merge them at a single point.
• Merging can happen at the deepest layers, known as late fusion, which has been relatively successful on a number of multimodal tasks.
• A key question is how to combine multimodal features so as to better exploit the information embedded at different layers of deep learning models for classification.
• In vision, for example, lower layers are known to serve as edge detectors with different orientations and extents, while later layers capture more complex information such as semantic concepts: faces, trees, animals, and so on.
• For example, learning to classify furry animals might require analysis of lower-level visual features that can be used to build up the concept of fur, whereas classes like chirping birds or growling might require analysis of more complex audio-visual attributes.
• Indeed, features from different layers of different modalities can give different insights into the input data. The problem of multimodal classification can therefore be posed directly as a combinatorial search over where to fuse.
• Recent approaches to deep multimodal fusion can be categorized along two main paths: (1) architectures and (2) constraints.
• The first path focuses on building the best possible fusion architecture, e.g. by finding at which depths the unimodal layers should be fused.
• Late fusion is often defined by the combination of the final scores of each unimodal branch.
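A minimal sketch contrasting early fusion and late fusion for two modalities. The 512-dimensional image feature, the 128-dimensional text embedding, and the two-class output are illustrative assumptions.

import torch
import torch.nn as nn

# Toy features for two modalities (batch of 4 examples).
img = torch.randn(4, 512)   # assumed image features
txt = torch.randn(4, 128)   # assumed text embeddings

# Early fusion: concatenate the modalities at the input, then one joint network.
early = nn.Sequential(nn.Linear(512 + 128, 64), nn.ReLU(), nn.Linear(64, 2))
early_logits = early(torch.cat([img, txt], dim=1))

# Late fusion: separate unimodal branches, then combine the final scores.
img_branch = nn.Sequential(nn.Linear(512, 64), nn.ReLU(), nn.Linear(64, 2))
txt_branch = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
late_logits = img_branch(img) + txt_branch(txt)   # sum of per-branch scores

print(early_logits.shape, late_logits.shape)      # both torch.Size([4, 2])

Intermediate fusion would sit between these two extremes, concatenating the branch activations at some hidden layer instead of at the input or the output.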
Deep multiple instance learning
• Multiple instance learning (MIL) is a variation of supervised learning in which a single class label is assigned to a bag of instances. Here, the MIL problem is stated as learning the Bernoulli distribution of the bag label, where the bag label probability is fully parameterized by neural networks.
• Multiple instance learning can be used to learn the properties of the sub-images that characterize a target scene. From there, these frameworks have been applied to a wide spectrum of applications, ranging from image concept learning and text categorization to stock market prediction.
• It is a form of weakly supervised learning.
• Training instances are arranged in sets, called bags.
• A label is provided for entire bags rather than for the individual instances contained in them. Thus, in MIL, we aim to learn a concept given labels for bags of instances.
• MIL is a variation of supervised learning that is especially suitable for pathology applications. The technique involves assigning a single class label to a collection of inputs, referred to in this context as a bag of instances.
• While it is assumed that a label exists for each instance within a bag, there is no access to those labels, and they remain unknown during training. A bag is typically labeled negative if all instances in the bag are negative, or positive if there is at least one positive instance (known as the standard MIL assumption).
• A simple example: suppose we only know whether a keychain contains the key that can open a given door. By comparing which keychains open the door, we can infer that a particular key, say the green one, is the one that opens it.
• There are various assumptions upon which we can base a MIL model, but here we use the standard MIL assumption: a bag may be labeled negative if all the instances in the bag are negative, or positive if there is at least one positive instance. This formulation naturally fits various problems in computer vision and document classification. For example, we might have access to medical images for which only overall patient diagnoses are available, instead of costly local annotations provided by an expert.

Definition of the standard MIL assumption
• Training instances are arranged in sets, generally called bags.
• A label is given to bags but not to individual instances.
• Negative bags do not contain positive instances.
• Positive bags may contain both negative and positive instances.
• Positive bags contain at least one positive instance.

Relaxed MIL assumptions
• In many applications, the standard MIL assumption is too restrictive. MIL can alternatively be formulated as:
• A bag is positive when it contains a sufficient number of positive instances.
• A bag is positive when it contains a certain combination of positive instances.
• Positive and negative bags differ by their instance distributions.

Example of relaxed MIL assumptions
• Both sand and water segments are positive instances for beach pictures.
• However, a picture of a beach must contain both sand and water segments; otherwise, it can be a picture of a desert or of the sea.

MIL approaches
There are two main MIL approaches (a sketch of the second follows below):
1. Instance-based: the function f classifies each instance individually, and MIL pooling combines the instance labels to assign the bag to a class (g is the identity function). However, since the individual labels are not known, the instance-level classifier might not be trained sufficiently, thereby introducing error into the final prediction.
2. Embedding-based: instead of classifying the instances individually, the function f maps instances to a low-dimensional embedding. MIL pooling is then used to obtain a bag representation that is independent of the number of instances in the bag. g then classifies these bag representations to produce θ(X). A downside of this method is that it lacks interpretability.
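A minimal sketch of the embedding-based approach: f embeds each instance, MIL pooling collapses the bag into a fixed-size representation, and g maps that representation to the Bernoulli probability of the bag label. Max pooling is used here because it matches the standard MIL assumption (one positive instance suffices to make the bag positive); all sizes are illustrative assumptions.

import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(32, 16), nn.ReLU())   # f: instance -> embedding
g = nn.Linear(16, 1)                               # g: bag embedding -> bag score

bag = torch.randn(7, 32)                  # one bag of 7 instances; instance labels unknown
embeddings = f(bag)                       # shape (7, 16)
bag_repr = embeddings.max(dim=0).values   # MIL pooling: max matches the standard assumption
bag_prob = torch.sigmoid(g(bag_repr))     # Bernoulli parameter for the bag label
print(bag_prob.shape)                     # torch.Size([1])

Replacing the max with a mean or a learned attention-weighted average gives other common MIL pooling operators, which better fit the relaxed assumptions described above.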