Technical DL U4-6

The document outlines the syllabus and key concepts related to Recurrent Neural Networks (RNNs), including their architecture, challenges with long-term dependencies, and various types such as Bidirectional RNNs and Long Short-Term Memory networks. It discusses the importance of parameter sharing and unfolding computational graphs for processing sequential data. The document also highlights practical methodologies for optimizing RNN performance and includes examples of RNN applications in language modeling and other domains.

Uploaded by

asdhdahgad
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF or read online on Scribd
0% found this document useful (0 votes)
20 views98 pages

Technical DL U4-6

The document outlines the syllabus and key concepts related to Recurrent Neural Networks (RNNs), including their architecture, challenges with long-term dependencies, and various types such as Bidirectional RNNs and Long Short-Term Memory networks. It discusses the importance of parameter sharing and unfolding computational graphs for processing sequential data. The document also highlights practical methodologies for optimizing RNN performance and includes examples of RNN applications in language modeling and other domains.

Uploaded by

asdhdahgad
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF or read online on Scribd
You are on page 1/ 98
Unit IV : Recurrent Neural Networks

Syllabus
Recurrent and Recursive Nets : Unfolding Computational Graphs, Recurrent Neural Networks, Bidirectional RNNs, Encoder-Decoder Sequence-to-Sequence Architectures, Deep Recurrent Networks, Recursive Neural Networks, The Challenge of Long-Term Dependencies, Echo State Networks, Leaky Units and Other Strategies for Multiple Time Scales, The Long Short-Term Memory and Other Gated RNNs, Optimization for Long-Term Dependencies, Explicit Memory. Practical Methodology : Performance Metrics, Default Baseline Models, Determining Whether to Gather More Data, Selecting Hyperparameters.

Contents
4.1 Basics of Recurrent Neural Networks
4.2 Recurrent Neural Networks
4.3 Bidirectional RNNs
4.4 Encoder-Decoder Architectures
4.5 Deep Recurrent Networks
4.6 Recursive Neural Networks
4.7 The Challenge of Long-Term Dependencies
4.8 Echo State Networks
4.9 Leaky Units and Other Strategies for Multiple Time Scales
4.10 Long Short-Term Memory Networks (LSTM)
4.11 Other Gated RNNs
4.12 Optimization for Long-Term Dependencies
4.13 Explicit Memory
4.14 Practical Methodology

4.1 Basics of Recurrent Neural Networks
• A class of neural networks for processing sequential data is known as recurrent neural networks, or RNNs (Rumelhart et al., 1986).
• Recurrent neural networks are neural networks that are specialized for processing a sequence of values x^(1), ..., x^(τ), just as convolutional networks are neural networks that are specialized for processing a grid of values X, such as an image.
• Recurrent networks can scale to far longer sequences than would be practical for networks without sequence-based specialization, much as convolutional networks can easily scale to images with large width and height, and some convolutional networks can handle images of variable size.
• Sequences of different lengths can generally be processed by recurrent networks.
• To move from multilayer networks to recurrent networks, we need to use one of the early ideas from machine learning and statistical models of the 1980s : sharing parameters across different parts of a model.
• Parameter sharing makes it possible to extend and apply the model to examples of different forms (different lengths, in this case) and to generalize across them.
• If we had separate parameters for each value of the time index, we could not share statistical strength across different sequence lengths and across different positions in time, nor generalize to sequence lengths not encountered during training.
• Such sharing is particularly important when a given piece of information can occur at several positions within the sequence.
• Take the sentences "I travelled to Nepal in 2009" and "In 2009, I went to Nepal," for instance. If we ask a machine learning model to read each sentence and extract the year in which the narrator travelled to Nepal, we would like the year 2009 to be recognized as the relevant piece of information, whether it appears as the sixth word or the second word of the sentence.
• Suppose we trained a feedforward network that processes sentences of a fixed length. A conventional fully connected feedforward network would have separate parameters for each input feature, so it would need to learn every rule of the language separately for each position in the sentence. A recurrent neural network, in contrast, uses the same weights across several time steps.
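To make the parameter-sharing idea concrete, the following minimal sketch (the sizes, variable names and random data are assumptions chosen only for illustration, not from the text) contrasts a fixed-length feedforward layer, which needs a separate weight matrix for every position, with a recurrent update that reuses the same parameters at every time step.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim = 6, 4, 8     # illustrative sizes
x = rng.normal(size=(seq_len, input_dim))    # one toy input sequence

# Feedforward view: a separate weight matrix for every position.
# The parameter count grows with seq_len, and what is learned at
# position 5 tells the model nothing about position 1.
W_per_position = [rng.normal(size=(hidden_dim, input_dim)) for _ in range(seq_len)]
ff_states = [np.tanh(W_per_position[t] @ x[t]) for t in range(seq_len)]

# Recurrent view: the same U, W and b are applied at every time step,
# so knowledge such as "a year may appear here" transfers across positions.
U = rng.normal(size=(hidden_dim, input_dim))   # input-to-hidden
W = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for t in range(seq_len):
    h = np.tanh(b + W @ h + U @ x[t])          # same parameters at every t
```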
• Convolution across a 1-D temporal sequence is a related idea; this convolutional approach is the basis of time-delay neural networks (Lang and Hinton, 1988; Waibel et al., 1989; Lang et al., 1990). Although shallow, the convolution approach shares parameters across time, whereas recurrence shares parameters through an extremely deep computational graph.
• RNNs are said to operate on a sequence that contains vectors x^(t) with the time step index t ranging from 1 to τ. In practice, recurrent networks usually operate on minibatches of such sequences, each with a different sequence length τ; to keep the notation simple, the minibatch indices are omitted. Furthermore, the time step index need not correspond to the actual passage of time in the real world. Sometimes it refers only to the position in the sequence. RNNs may also be applied in two dimensions across spatial data such as photographs, and even when applied to time-related data, the network may have connections that go backwards in time, provided that the entire sequence has been observed before it is given to the network.

Unfolding Computational Graphs :
• A computational graph is a way to formalize the structure of a set of computations, such as those involved in mapping inputs and parameters to outputs and loss. This section explains the idea of unfolding a recursive or recurrent computation into a computational graph with a repetitive structure, typically corresponding to a chain of events. Unfolding this graph results in the sharing of parameters across a deep network structure.
• Take the classical form of a dynamical system, for instance :
  s^(t) = f(s^(t-1); θ)   ...(4.1.1)
  where s^(t) is called the state of the system.
• Equation 4.1.1 is recurrent because the definition of s at time t refers back to the same definition at time t - 1.
• For a finite number of time steps τ, the graph can be unfolded by applying the definition τ - 1 times. For example, if we unfold equation 4.1.1 for τ = 3 time steps, we obtain
  s^(3) = f(s^(2); θ)   ...(4.1.2)
        = f(f(s^(1); θ); θ)   ...(4.1.3)
• By repeatedly applying the definition in this way to unfold the equation, we obtain an expression that does not involve recurrence. Such an expression can now be represented by a traditional directed acyclic computational graph. Fig. 4.1.1 shows the unfolded computational graph of equations 4.1.1 and 4.1.3.
• Fig. 4.1.1 : The classical dynamical system described by equation 4.1.1, illustrated as an unfolded computational graph. Each node represents the state at some time t, and the function f maps the state at time t to the state at time t + 1. The same parameters (the same value of θ used to parameterize f) are applied at every time step.
• As another example, let us consider a dynamical system driven by an external signal x^(t),
  s^(t) = f(s^(t-1), x^(t); θ)   ...(4.1.4)
  where we see that the state now contains information about the whole past sequence.
• Recurrent neural networks can be built in many different ways. Much as almost any function can be regarded as a feedforward neural network, essentially any function involving recurrence can be regarded as a recurrent neural network.
• Many recurrent neural networks use equation 4.1.5 or a related equation to define the values of their hidden units. To show that the state is the hidden units of the network, we now rewrite equation 4.1.4 using the variable h to represent the state :
  h^(t) = f(h^(t-1), x^(t); θ)   ...(4.1.5)
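A compact illustration of the state update in equation 4.1.5, and of the fact that unfolding it for a fixed number of steps yields an ordinary recurrence-free expression (analogous to equations 4.1.2 and 4.1.3): the particular transition function, sizes and data below are assumptions chosen only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 5
theta_U = rng.normal(size=(hidden_dim, input_dim))   # parameters theta of f
theta_W = rng.normal(size=(hidden_dim, hidden_dim))
x1, x2, x3 = [rng.normal(size=input_dim) for _ in range(3)]
h0 = np.zeros(hidden_dim)

def f(h_prev, x):
    """Equation 4.1.5: h^(t) = f(h^(t-1), x^(t); theta)."""
    return np.tanh(theta_W @ h_prev + theta_U @ x)

# Recurrent definition: apply the same f once per time step.
h1 = f(h0, x1)
h2 = f(h1, x2)
h3 = f(h2, x3)

# Unfolded for three steps: a recurrence-free, directed acyclic expression.
h3_unfolded = f(f(f(h0, x1), x2), x3)

assert np.allclose(h3, h3_unfolded)
```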
• Typical RNNs will add extra architectural features, such as output layers that read information out of the state h to make predictions, as shown in Fig. 4.1.2.
• When it is trained to map an arbitrary-length sequence to a fixed-length vector h^(t), the recurrent network typically learns to use h^(t) as a kind of lossy summary of the task-relevant aspects of the sequence of inputs up to t. This summary is inherently lossy. Depending on the training criterion, it may retain some parts of the past sequence with greater precision than others. For instance, if the RNN is used in statistical language modelling, which typically predicts the next word given previous words, it may not be necessary to store all of the information in the input sequence up to time t, but only enough to predict the rest of the sentence. The most demanding situation is when we require h^(t) to be rich enough to allow one to approximately recover the input sequence, as in autoencoder frameworks.
• Fig. 4.1.2 : An output-less recurrent network. This recurrent network simply processes information from the input x by incorporating it into the state h that is passed forward through time. (Left) Circuit diagram; the black square indicates a delay of one time step. (Right) The same network drawn as an unfolded computational graph, where each node is now associated with one particular time instance.
• There are two possible ways to draw equation 4.1.5. One way to represent an RNN is with a diagram containing one node for every component that might exist in a physical implementation of the model, such as a biological neural network. In this view, the network defines a circuit that operates in real time, made up of physical components whose present state can influence their future state, as shown on the left of Fig. 4.1.2.
• Throughout the circuit diagrams in this chapter, a black square indicates an interaction that takes place with a delay of one time step, from the state at time t to the state at time t + 1. The RNN may also be drawn as an unfolded computational graph, in which each component is represented by many different variables, one variable per time step, representing the state of the component at that point in time. Each variable for each time step is drawn as a distinct node of the computational graph, as shown on the right of Fig. 4.1.2. The operation that maps the circuit, shown on the left side of the figure, into a computational graph with repeated pieces, shown on the right side, is what we call unfolding. The size of the unfolded graph now depends on the sequence length.
• We can represent the unfolded recurrence after t steps with a function g^(t) :
  h^(t) = g^(t)(x^(t), x^(t-1), ..., x^(2), x^(1))   ...(4.1.6)
        = f(h^(t-1), x^(t); θ)   ...(4.1.7)
• The function g^(t) takes the whole past sequence (x^(t), x^(t-1), ..., x^(2), x^(1)) as input and produces the current state, but the unfolded recurrent structure allows us to factorize g^(t) into repeated application of a single function f.
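The factorization in equations 4.1.6 and 4.1.7 can be sketched directly: a function g that consumes the whole past sequence at once gives the same state as repeatedly applying the single shared transition function f, step by step. The concrete f, dimensions and data below are assumptions made only for illustration.

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(1)
input_dim, hidden_dim, T = 3, 5, 7
xs = [rng.normal(size=input_dim) for _ in range(T)]   # x^(1), ..., x^(T)

U = rng.normal(size=(hidden_dim, input_dim))
W = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)
h0 = np.zeros(hidden_dim)

def f(h_prev, x):
    """Single shared transition: h^(t) = f(h^(t-1), x^(t); theta)  (eq. 4.1.7)."""
    return np.tanh(b + W @ h_prev + U @ x)

def g(history):
    """g^(t): a function of the entire past sequence (x^(1), ..., x^(t))  (eq. 4.1.6)."""
    return reduce(f, history, h0)

# The unfolded recurrence factorizes g^(t) into repeated applications of f.
h = h0
for x in xs:
    h = f(h, x)

assert np.allclose(h, g(xs))
```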
• The unfolding process thus introduces two major advantages :
  o Regardless of the sequence length, the learned model always has the same input size, because it is specified in terms of transitions from one state to another rather than in terms of a variable-length history of states.
  o At every time step, we can use the same transition function f with the same parameters.
• Instead of needing to learn a separate model g^(t) for every possible time step, these two factors make it possible to learn a single model f that operates on all time steps and all sequence lengths. Learning a single, shared model enables generalization to sequence lengths that did not appear in the training set, and allows the model to be estimated with far fewer training examples than would otherwise be required.
• Both the recurrent graph and the unrolled graph have their uses. The recurrent graph is succinct and clear. The unfolded graph gives an explicit description of which computations to perform. By explicitly showing the path along which information travels, the unfolded graph also helps to illustrate the idea of information flow forward in time (computing outputs and losses) and backward in time (computing gradients).

4.2 Recurrent Neural Networks
• With the concepts of parameter sharing and graph unrolling from the previous section, we can design a wide variety of recurrent neural networks.
• Fig. 4.2.1 : The computational graph used to compute the training loss of a recurrent network that maps an input sequence of x values to a corresponding sequence of output o values. A loss L measures how far each o is from the corresponding training target y. When using softmax outputs, we assume o is the unnormalized log probabilities. The loss L internally computes ŷ = softmax(o) and compares this to the target y. The RNN has input-to-hidden connections, hidden-to-hidden recurrent connections, and hidden-to-output connections, parameterized by the weight matrices U, W and V, respectively. Forward propagation in this model is defined by equation 4.2.1. (Left) The RNN and its loss drawn with recurrent connections. (Right) The same drawn as a time-unfolded computational graph, where each node is now associated with one particular time instance.
• Some important design patterns for recurrent neural networks include the following :
  o Recurrent networks that produce an output at each time step and have recurrent connections between hidden units, as seen in Fig. 4.2.1.
  o Recurrent networks that produce an output at each time step and have recurrent connections only from the output at one time step to the hidden units at the following time step, as shown in Fig. 4.2.2.
  o Recurrent networks with recurrent connections between hidden units that read an entire sequence and then produce a single output.
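These design patterns differ only in where the recurrence enters the state update and when an output is emitted. A minimal sketch of the three patterns follows; the shapes, activation functions and variable names are assumptions for illustration, not prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(2)
T, n_in, n_h, n_out = 4, 3, 6, 2
xs = rng.normal(size=(T, n_in))
U = rng.normal(size=(n_h, n_in))
V = rng.normal(size=(n_out, n_h))
W_hh = rng.normal(size=(n_h, n_h))      # hidden-to-hidden recurrence (Fig. 4.2.1)
W_oh = rng.normal(size=(n_h, n_out))    # output-to-hidden recurrence (Fig. 4.2.2)
b, c = np.zeros(n_h), np.zeros(n_out)

# Pattern 1: hidden-to-hidden recurrence, an output at every time step.
h = np.zeros(n_h)
outputs_1 = []
for x in xs:
    h = np.tanh(b + W_hh @ h + U @ x)
    outputs_1.append(c + V @ h)

# Pattern 2: recurrence only from the previous output to the hidden units.
o = np.zeros(n_out)
outputs_2 = []
for x in xs:
    h = np.tanh(b + W_oh @ o + U @ x)
    o = c + V @ h
    outputs_2.append(o)

# Pattern 3: hidden-to-hidden recurrence, but a single output after the whole
# sequence has been read (e.g. for sequence classification).
h = np.zeros(n_h)
for x in xs:
    h = np.tanh(b + W_hh @ h + U @ x)
single_output = c + V @ h
```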
• We frequently refer to the RNN of Fig. 4.2.1 as a representative example and use it throughout most of the text.
• The recurrent neural network of Fig. 4.2.1 and equation 4.2.1 is universal in the sense that any function that can be computed by a Turing machine can also be computed by such a recurrent network of finite size.
• The output can be read from the RNN after a number of time steps that is asymptotically linear in both the number of time steps used by the Turing machine and the length of the input (Siegelmann and Sontag, 1991; Siegelmann, 1995; Siegelmann and Sontag, 1995; Hyotyniemi, 1996).
• These results are about the exact implementation of the function, not approximations, because the functions that can be computed by a Turing machine are discrete.
• When used as a Turing machine, the RNN takes a binary sequence as input and its outputs must be discretized to produce a binary output. It is possible to compute all functions in this setting using a single specific RNN of finite size (Siegelmann and Sontag (1995) use 886 units). The "input" of the Turing machine is a description of the function to be computed, so the same network that simulates this Turing machine is sufficient for all problems. The theoretical RNN used for the proof can simulate an unbounded stack by representing its activations and weights with rational numbers of unbounded precision.
• We now develop the forward propagation equations for the RNN shown in Fig. 4.2.1. The figure does not specify the choice of activation function for the hidden units; here we assume the hyperbolic tangent activation function. Also, the figure does not specify the exact form of the output and loss functions. Since the RNN is being used to predict words or characters, we will assume that the output is discrete. A natural way to represent discrete variables is to regard the output o as giving the unnormalized log probabilities of each possible value of the discrete variable. We can then apply the softmax operation to obtain a vector ŷ of normalized probabilities over the output.
• Forward propagation begins with a specification of the initial state h^(0). Then, for each time step from t = 1 to t = τ, we apply the following update equations :
  a^(t) = b + W h^(t-1) + U x^(t)   ...(4.2.1)
  h^(t) = tanh(a^(t))   ...(4.2.2)
  o^(t) = c + V h^(t)   ...(4.2.3)
  ŷ^(t) = softmax(o^(t))   ...(4.2.4)
  where the parameters are the bias vectors b and c together with the weight matrices U, V and W for input-to-hidden, hidden-to-output, and hidden-to-hidden connections, respectively.
• This is an example of a recurrent network that maps an input sequence to an output sequence of the same length. The total loss for a given sequence of x values paired with a sequence of y values is then just the sum of the losses over all the time steps.
• For example, if L^(t) is the negative log-likelihood of y^(t) given x^(1), ..., x^(t), then
  L({x^(1), ..., x^(τ)}, {y^(1), ..., y^(τ)})   ...(4.2.5)
  = Σ_t L^(t)   ...(4.2.6)
  = - Σ_t log p_model(y^(t) | x^(1), ..., x^(t))   ...(4.2.7)
  where p_model(y^(t) | x^(1), ..., x^(t)) is given by reading the entry for y^(t) from the model's output vector ŷ^(t).
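Equations 4.2.1 through 4.2.7 translate almost line for line into code. The sketch below (the dimensions, random parameters and toy target sequence are assumptions for illustration) runs the forward pass of the Fig. 4.2.1 network and accumulates the negative log-likelihood loss over all time steps.

```python
import numpy as np

rng = np.random.default_rng(3)
T, n_in, n_h, n_out = 5, 4, 8, 3          # sequence length and layer sizes
xs = rng.normal(size=(T, n_in))           # input sequence x^(1), ..., x^(T)
ys = rng.integers(0, n_out, size=T)       # target classes y^(1), ..., y^(T)

U = rng.normal(size=(n_h, n_in)) * 0.1    # input-to-hidden
W = rng.normal(size=(n_h, n_h)) * 0.1     # hidden-to-hidden
V = rng.normal(size=(n_out, n_h)) * 0.1   # hidden-to-output
b, c = np.zeros(n_h), np.zeros(n_out)

def softmax(z):
    z = z - z.max()                        # numerical stability
    e = np.exp(z)
    return e / e.sum()

h = np.zeros(n_h)                          # starting state h^(0)
total_loss = 0.0
for t in range(T):
    a = b + W @ h + U @ xs[t]              # eq. 4.2.1
    h = np.tanh(a)                         # eq. 4.2.2
    o = c + V @ h                          # eq. 4.2.3: unnormalized log probabilities
    y_hat = softmax(o)                     # eq. 4.2.4
    total_loss += -np.log(y_hat[ys[t]])    # eqs. 4.2.5-4.2.7: sum of per-step NLL
```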
• Computing the gradient of this loss function with respect to the parameters is expensive. It requires two passes over our representation of the unrolled graph in Fig. 4.2.1 : first, a forward propagation pass from left to right, and then a backward propagation pass from right to left. The runtime is O(τ) and cannot be reduced by parallelization, because the forward propagation graph is inherently sequential; each time step can only be computed after the preceding one.
• The memory cost is also O(τ), since the states computed in the forward pass must be stored until they are reused in the backward pass. The back-propagation algorithm applied to the unrolled graph with O(τ) cost is called back-propagation through time, or BPTT. The network with recurrence between hidden units is therefore very powerful but also expensive to train. Is there an alternative ?

Teacher Forcing and Networks with Output Recurrence :
• Fig. 4.2.2 : An RNN whose only recurrence is the feedback connection from the output to the hidden layer. At each time step t, the input is x^(t), the hidden layer activations are h^(t), the outputs are o^(t), the targets are y^(t) and the loss is L^(t). (Left) Circuit diagram. (Right) Unfolded computational graph. Such an RNN is less powerful than those in the family depicted in Fig. 4.2.1 (it can express a smaller set of functions). The RNN in Fig. 4.2.1 is free to choose what information it wishes to pass from the past to the future as part of its hidden representation h; the RNN in Fig. 4.2.2 can only relate the past to the present indirectly, via the predictions it has made. Unless o is very high-dimensional and rich, it will usually lack important information from the past. As a result, the RNN in Fig. 4.2.2 is less powerful, but it may be easier to train, because each time step can be trained in isolation from the others, allowing greater parallelization during training, as described in the following section.
• Because it lacks hidden-to-hidden recurrent connections, the network with recurrent connections only from the output at one time step to the hidden units at the following time step (illustrated in Fig. 4.2.2) is strictly less powerful. It cannot simulate a general-purpose Turing machine, for instance. Since the network lacks hidden-to-hidden recurrence, the output units must capture all of the information about the past that the network will use to make predictions about the future. If the user does not know how to