Chapter 8
Optimization for Training Deep Models

Deep learning algorithms involve optimization in many contexts. For example, performing inference in models such as PCA involves solving an optimization problem. We often use analytical optimization to write proofs or design algorithms. Of all the many optimization problems involved in deep learning, the most difficult is neural network training. It is quite common to invest days to months of time on hundreds of machines to solve even a single instance of the neural network training problem. Because this problem is so important and so expensive, a specialized set of optimization techniques has been developed for solving it. This chapter presents these optimization techniques for neural network training.

If you are unfamiliar with the basic principles of gradient-based optimization, we suggest reviewing chapter 4. That chapter includes a brief overview of numerical optimization in general.

This chapter focuses on one particular case of optimization: finding the parameters θ of a neural network that significantly reduce a cost function J(θ), which typically includes a performance measure evaluated on the entire training set as well as additional regularization terms.

We begin with a description of how optimization used as a training algorithm for a machine learning task differs from pure optimization. Next, we present several of the concrete challenges that make optimization of neural networks difficult. We then define several practical algorithms, including both optimization algorithms themselves and strategies for initializing the parameters. More advanced algorithms adapt their learning rates during training or leverage information contained in the second derivatives of the cost function. Finally, we conclude with a review of several optimization strategies that are formed by combining simple optimization algorithms into higher-level procedures.

8.1 How Learning Differs from Pure Optimization

Optimization algorithms used for training of deep models differ from traditional optimization algorithms in several ways. Machine learning usually acts indirectly. In most machine learning scenarios, we care about some performance measure P, which is defined with respect to the test set and may also be intractable. We therefore optimize P only indirectly. We reduce a different cost function J(θ) in the hope that doing so will improve P. This is in contrast to pure optimization, where minimizing J is a goal in and of itself. Optimization algorithms for training deep models also typically include some specialization on the specific structure of machine learning objective functions.

Typically, the cost function can be written as an average over the training set, such as

J(\theta) = \mathbb{E}_{(x, y) \sim \hat{p}_{\text{data}}} L(f(x; \theta), y),    (8.1)

where L is the per-example loss function, f(x; θ) is the predicted output when the input is x, and p̂_data is the empirical distribution. In the supervised learning case, y is the target output. Throughout this chapter, we develop the unregularized supervised case, where the arguments to L are f(x; θ) and y. It is trivial to extend this development, for example, to include θ or x as arguments, or to exclude y as an argument, to develop various forms of regularization or unsupervised learning.

Equation 8.1 defines an objective function with respect to the training set.
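As a concrete, minimal sketch of equation 8.1, the snippet below averages a per-example loss over a training set. The linear model, squared-error loss, and dataset sizes are illustrative assumptions, not anything prescribed by the chapter.

```python
import numpy as np

# A minimal sketch of equation 8.1: the training objective as an average of
# per-example losses. The linear model and squared-error loss are arbitrary
# illustrative choices.

def f(x, theta):
    """Predicted output for input x under parameters theta (a linear model here)."""
    return x @ theta

def per_example_loss(y_pred, y):
    """Per-example loss L(f(x; theta), y); squared error for illustration."""
    return 0.5 * (y_pred - y) ** 2

def J(theta, X, Y):
    """Empirical objective: the expectation of L under the empirical distribution,
    i.e., the average loss over the m training examples."""
    return np.mean(per_example_loss(f(X, theta), Y))

# Toy data: m = 100 examples with 5 input features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_theta = rng.normal(size=5)
Y = X @ true_theta + 0.1 * rng.normal(size=100)

theta = np.zeros(5)
print("J(theta) on the training set:", J(theta, X, Y))
```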
We would usually prefer to minimize the corresponding objective function where the expectation is taken across the data-generating distribution p_data rather than just over the finite training set:

J^*(\theta) = \mathbb{E}_{(x, y) \sim p_{\text{data}}} L(f(x; \theta), y).    (8.2)

8.1.1 Empirical Risk Minimization

The goal of a machine learning algorithm is to reduce the expected generalization error given by equation 8.2. This quantity is known as the risk. We emphasize here that the expectation is taken over the true underlying distribution p_data. If we knew the true distribution p_data(x, y), risk minimization would be an optimization task solvable by an optimization algorithm. When we do not know p_data(x, y) but only have a training set of samples, however, we have a machine learning problem.

The simplest way to convert a machine learning problem back into an optimization problem is to minimize the expected loss on the training set. This means replacing the true distribution p(x, y) with the empirical distribution p̂(x, y) defined by the training set. We now minimize the empirical risk

\mathbb{E}_{(x, y) \sim \hat{p}_{\text{data}}} [L(f(x; \theta), y)] = \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)}; \theta), y^{(i)}),    (8.3)

where m is the number of training examples.

The training process based on minimizing this average training error is known as empirical risk minimization. In this setting, machine learning is still very similar to straightforward optimization. Rather than optimizing the risk directly, we optimize the empirical risk and hope that the risk decreases significantly as well. A variety of theoretical results establish conditions under which the true risk can be expected to decrease by various amounts.

Nonetheless, empirical risk minimization is prone to overfitting. Models with high capacity can simply memorize the training set. In many cases, empirical risk minimization is not really feasible. The most effective modern optimization algorithms are based on gradient descent, but many useful loss functions, such as 0-1 loss, have no useful derivatives (the derivative is either zero or undefined everywhere). These two problems mean that, in the context of deep learning, we rarely use empirical risk minimization. Instead, we must use a slightly different approach, in which the quantity that we actually optimize is even more different from the quantity that we truly want to optimize.

8.1.2 Surrogate Loss Functions and Early Stopping

Sometimes, the loss function we actually care about (say, classification error) is not one that can be optimized efficiently. For example, exactly minimizing expected 0-1 loss is typically intractable (exponential in the input dimension), even for a linear classifier (Marcotte and Savard, 1992). In such situations, one typically optimizes a surrogate loss function instead, which acts as a proxy but has advantages. For example, the negative log-likelihood of the correct class is typically used as a surrogate for the 0-1 loss. The negative log-likelihood allows the model to estimate the conditional probability of the classes, given the input, and if the model can do that well, then it can pick the classes that yield the least classification error in expectation.
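The toy comparison below is a hedged illustration of why the surrogate is useful for gradient-based training: for a binary logistic model, the 0-1 loss is flat almost everywhere and so provides no gradient, while the negative log-likelihood still gives a training signal. The model, inputs, and parameter values are made up for illustration.

```python
import numpy as np

# 0-1 loss vs. the negative log-likelihood surrogate for a binary logistic model.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def zero_one_loss(w, x, y):
    """1 if the predicted class disagrees with the label y in {0, 1}, else 0."""
    return float((sigmoid(x @ w) >= 0.5) != bool(y))

def nll_loss(w, x, y):
    """Negative log-likelihood of the correct class under the logistic model."""
    p = sigmoid(x @ w)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def nll_grad(w, x, y):
    """Gradient of the NLL surrogate; this is what gradient descent can follow."""
    return (sigmoid(x @ w) - y) * x

x, y = np.array([1.0, 2.0]), 1
for w in [np.array([0.1, 0.1]), np.array([1.0, 1.0])]:
    print("0-1 loss:", zero_one_loss(w, x, y),
          "| NLL:", round(nll_loss(w, x, y), 4),
          "| NLL gradient:", np.round(nll_grad(w, x, y), 4))

# The 0-1 loss is already 0 for both parameter settings, so it offers no
# improvement signal, while the NLL keeps shrinking as the classifier grows
# more confident.
```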
In some cases, a surrogate loss function actually results in being able to learn more. For example, the test set 0-1 loss often continues to decrease for a long time after the training set 0-1 loss has reached zero, when training using the log-likelihood surrogate. This is because even when the expected 0-1 loss is zero, one can improve the robustness of the classifier by further pushing the classes apart from each other, obtaining a more confident and reliable classifier, thus extracting more information from the training data than would have been possible by simply minimizing the average 0-1 loss on the training set.

A very important difference between optimization in general and optimization as we use it for training algorithms is that training algorithms do not usually halt at a local minimum. Instead, a machine learning algorithm usually minimizes a surrogate loss function but halts when a convergence criterion based on early stopping (section 7.8) is satisfied. Typically the early stopping criterion is based on the true underlying loss function, such as 0-1 loss measured on a validation set, and is designed to cause the algorithm to halt whenever overfitting begins to occur. Training often halts while the surrogate loss function still has large derivatives, which is very different from the pure optimization setting, where an optimization algorithm is considered to have converged when the gradient becomes very small.

8.1.3 Batch and Minibatch Algorithms

One aspect of machine learning algorithms that separates them from general optimization algorithms is that the objective function usually decomposes as a sum over the training examples. Optimization algorithms for machine learning typically compute each update to the parameters based on an expected value of the cost function estimated using only a subset of the terms of the full cost function.

For example, maximum likelihood estimation problems, when viewed in log space, decompose into a sum over each example:

\theta_{\text{ML}} = \arg\max_{\theta} \sum_{i=1}^{m} \log p_{\text{model}}(x^{(i)}, y^{(i)}; \theta).    (8.4)

Maximizing this sum is equivalent to maximizing the expectation over the empirical distribution defined by the training set:

J(\theta) = \mathbb{E}_{(x, y) \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(x, y; \theta).    (8.5)

Most of the properties of the objective function J used by most of our optimization algorithms are also expectations over the training set. For example, the most commonly used property is the gradient

\nabla_{\theta} J(\theta) = \mathbb{E}_{(x, y) \sim \hat{p}_{\text{data}}} \nabla_{\theta} \log p_{\text{model}}(x, y; \theta).    (8.6)

Computing this expectation exactly is very expensive because it requires evaluating the model on every example in the entire dataset. In practice, we can compute these expectations by randomly sampling a small number of examples from the dataset, then taking the average over only those examples.

Recall that the standard error of the mean (equation 5.46) estimated from n samples is given by σ/√n, where σ is the true standard deviation of the value of the samples. The denominator of √n shows that there are less than linear returns to using more examples to estimate the gradient. Compare two hypothetical estimates of the gradient, one based on 100 examples and another based on 10,000 examples. The latter requires 100 times more computation than the former but reduces the standard error of the mean only by a factor of 10. Most optimization algorithms converge much faster (in terms of total computation, not in terms of number of updates) if they are allowed to rapidly compute approximate estimates of the gradient rather than slowly computing the exact gradient.
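A quick numerical check of the σ/√n argument, under an arbitrary stand-in distribution for the per-example gradient values, is sketched below: estimating a mean from 100 versus 10,000 samples costs 100 times more computation but only shrinks the standard error by a factor of 10.

```python
import numpy as np

# Empirically verify the sigma / sqrt(n) scaling of the standard error of the mean.
rng = np.random.default_rng(0)
sigma = 2.0
n_trials = 2000

for n in (100, 10_000):
    # Each trial draws n samples and records the sample mean.
    means = rng.normal(loc=0.0, scale=sigma, size=(n_trials, n)).mean(axis=1)
    print(f"n = {n:6d}: empirical std of the mean = {means.std():.4f}, "
          f"sigma/sqrt(n) = {sigma / np.sqrt(n):.4f}")
```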
Another consideration motivating statistical estimation of the gradient from a small number of samples is redundancy in the training set. In the worst case, all m samples in the training set could be identical copies of each other. A sampling-based estimate of the gradient could compute the correct gradient with a single sample, using m times less computation than the naive approach. In practice, we are unlikely to encounter this worst-case situation, but we may find large numbers of examples that all make very similar contributions to the gradient.

Optimization algorithms that use the entire training set are called batch or deterministic gradient methods, because they process all the training examples simultaneously in a large batch. This terminology can be somewhat confusing because the word "batch" is also often used to describe the minibatch used by minibatch stochastic gradient descent. Typically the term "batch gradient descent" implies the use of the full training set, while the use of the term "batch" to describe a group of examples does not. For example, it is common to use the term "batch size" to describe the size of a minibatch.

Optimization algorithms that use only a single example at a time are sometimes called stochastic and sometimes online methods. The term "online" is usually reserved for when the examples are drawn from a stream of continually created examples rather than from a fixed-size training set over which several passes are made.

Most algorithms used for deep learning fall somewhere in between, using more than one but fewer than all the training examples. These were traditionally called minibatch or minibatch stochastic methods, and it is now common to call them simply stochastic methods. The canonical example of a stochastic method is stochastic gradient descent, presented in detail in section 8.3.1.

Minibatch sizes are generally driven by the following factors:

* Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.

* Multicore architectures are usually underutilized by extremely small batches. This motivates using some absolute minimum batch size, below which there is no reduction in the time to process a minibatch.

* If all examples in the batch are to be processed in parallel (as is typically the case), then the amount of memory scales with the batch size. For many hardware setups this is the limiting factor in batch size.

* Some kinds of hardware achieve better runtime with specific sizes of arrays. Especially when using GPUs, it is common for power of 2 batch sizes to offer better runtime. Typical power of 2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.

* Small batches can offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process. Generalization error is often best for a batch size of 1. Training with such a small batch size might require a small learning rate to maintain stability because of the high variance in the estimate of the gradient. The total runtime can be very high as a result of the need to make more steps, both because of the reduced learning rate and because it takes more steps to observe the entire training set.
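The following is a minimal sketch of minibatch stochastic gradient descent on the synthetic linear-model objective used earlier; the batch size, learning rate, and analytic gradient are illustrative assumptions rather than recommended settings.

```python
import numpy as np

# Minibatch SGD on a synthetic least-squares problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_theta = rng.normal(size=5)
Y = X @ true_theta + 0.1 * rng.normal(size=1000)

theta = np.zeros(5)
learning_rate, batch_size = 0.1, 64

for step in range(500):
    # Sample a minibatch uniformly at random from the training set.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    xb, yb = X[idx], Y[idx]
    # Gradient of the average squared error on this minibatch only.
    grad = xb.T @ (xb @ theta - yb) / batch_size
    theta -= learning_rate * grad

print("parameter error after training:", np.linalg.norm(theta - true_theta))
```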
Different kinds of algorithms use different kinds of information from the minibatch in various ways. Some algorithms are more sensitive to sampling error than others, either because they use information that is difficult to estimate accurately with few samples, or because they use information in ways that amplify sampling errors more. Methods that compute updates based only on the gradient g are usually relatively robust and can handle smaller batch sizes, like 100. Second-order methods, which also use the Hessian matrix H and compute updates such as H⁻¹g, typically require much larger batch sizes, like 10,000. These large batch sizes are required to minimize fluctuations in the estimates of H⁻¹g. Suppose that H is estimated perfectly but has a poor condition number. Multiplication by H or its inverse amplifies pre-existing errors, in this case, estimation errors in g. Very small changes in the estimate of g can thus cause large changes in the update H⁻¹g, even if H is estimated perfectly. Of course, H is estimated only approximately, so the update H⁻¹g will contain even more error than we would predict from applying a poorly conditioned operation to the estimate of g.

It is also crucial that the minibatches be selected randomly. Computing an unbiased estimate of the expected gradient from a set of samples requires that those samples be independent. We also wish for two subsequent gradient estimates to be independent from each other, so two subsequent minibatches of examples should also be independent from each other. Many datasets are most naturally arranged in a way where successive examples are highly correlated. For example, we might have a dataset of medical data with a long list of blood sample test results. This list might be arranged so that first we have five blood samples taken at different times from the first patient, then we have three blood samples taken from the second patient, then the blood samples from the third patient, and so on. If we were to draw examples in order from this list, then each of our minibatches would be extremely biased, because it would represent primarily one patient out of the many patients in the dataset. In cases such as these, where the order of the dataset holds some significance, it is necessary to shuffle the examples before selecting minibatches.

For very large datasets, for example, datasets containing billions of examples in a data center, it can be impractical to sample examples truly uniformly at random every time we want to construct a minibatch. Fortunately, in practice it is usually sufficient to shuffle the order of the dataset once and then store it in shuffled fashion. This will impose a fixed set of possible minibatches of consecutive examples that all models trained thereafter will use, and each individual model will be forced to reuse this ordering every time it passes through the training data. This deviation from true random selection does not seem to have a significant detrimental effect. Failing to ever shuffle the examples in any way can seriously reduce the effectiveness of the algorithm.
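A sketch of this shuffle-once strategy follows: permute the dataset a single time, store it in shuffled order, and then repeatedly sweep over it in consecutive minibatches. The function name, sizes, and data are illustrative assumptions.

```python
import numpy as np

def minibatches(X, Y, batch_size, rng):
    """Yield consecutive minibatches from a dataset that is shuffled exactly once."""
    perm = rng.permutation(len(X))
    X, Y = X[perm], Y[perm]          # shuffle once; minibatch boundaries are now fixed
    while True:                      # one iteration of this loop per epoch
        for start in range(0, len(X), batch_size):
            yield X[start:start + batch_size], Y[start:start + batch_size]

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
Y = rng.normal(size=10)
batches = minibatches(X, Y, batch_size=4, rng=rng)
for _ in range(3):
    xb, yb = next(batches)
    print("minibatch shape:", xb.shape)
```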
Many optimization problems in machine learning decompose over examples well enough that we can compute entire separate updates over different examples in parallel. In other words, we can compute the update that minimizes J(X) for one minibatch of examples X at the same time that we compute the update for several other minibatches. Such asynchronous parallel distributed approaches are discussed further in section 12.1.3.

An interesting motivation for minibatch stochastic gradient descent is that it follows the gradient of the true generalization error (equation 8.2) as long as no examples are repeated. Most implementations of minibatch stochastic gradient descent shuffle the dataset once and then pass through it multiple times. On the first pass, each minibatch is used to compute an unbiased estimate of the true generalization error. On the second pass, the estimate becomes biased because it is formed by resampling values that have already been used, rather than obtaining new fair samples from the data-generating distribution.

The fact that stochastic gradient descent minimizes generalization error is easiest to see in online learning, where examples or minibatches are drawn from a stream of data. In other words, instead of receiving a fixed-size training set, the learner is similar to a living being who sees a new example at each instant, with every example (x, y) coming from the data-generating distribution p_data(x, y). In this scenario, examples are never repeated; every experience is a fair sample from p_data.

The equivalence is easiest to derive when both x and y are discrete. In this case, the generalization error (equation 8.2) can be written as a sum

J^*(\theta) = \sum_{x} \sum_{y} p_{\text{data}}(x, y) L(f(x; \theta), y),    (8.7)

with the exact gradient

g = \nabla_{\theta} J^*(\theta) = \sum_{x} \sum_{y} p_{\text{data}}(x, y) \nabla_{\theta} L(f(x; \theta), y).    (8.8)

We have already seen the same fact demonstrated for the log-likelihood in equation 8.5 and equation 8.6; we observe now that this holds for other functions L besides the likelihood. A similar result can be derived when x and y are continuous, under mild assumptions regarding p_data and L.

Hence, we can obtain an unbiased estimator of the exact gradient of the generalization error by sampling a minibatch of examples {x^{(1)}, ..., x^{(m)}} with corresponding targets y^{(i)} from the data-generating distribution p_data, then computing the gradient of the loss with respect to the parameters for that minibatch:

\hat{g} = \frac{1}{m} \nabla_{\theta} \sum_{i} L(f(x^{(i)}; \theta), y^{(i)}).    (8.9)

Updating θ in the direction of −ĝ performs SGD on the generalization error.

Of course, this interpretation applies only when examples are not reused. Nonetheless, it is usually best to make several passes through the training set, unless the training set is extremely large. When multiple such epochs are used, only the first epoch follows the unbiased gradient of the generalization error, but of course, the additional epochs usually provide enough benefit due to decreased training error to offset the harm they cause by increasing the gap between training error and test error.

With some datasets growing rapidly in size, faster than computing power, it is becoming more common for machine learning applications to use each training example only once or even to make an incomplete pass through the training set. When using an extremely large training set, overfitting is not an issue, so underfitting and computational efficiency become the predominant concerns. See also Bottou and Bousquet (2008) for a discussion of the effect of computational bottlenecks on generalization error, as the number of training examples grows.
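The single-pass regime can be sketched as follows: each minibatch is drawn fresh from the data-generating distribution (here a synthetic generator), so every gradient estimate is an unbiased estimate of the generalization-error gradient in the sense of equation 8.9. The generator, model, and hyperparameters are made up for illustration.

```python
import numpy as np

# Online SGD: examples come from a stream and are never reused.
rng = np.random.default_rng(0)
true_theta = rng.normal(size=5)

def sample_batch(m):
    """Draw m fresh (x, y) pairs from the data-generating distribution."""
    x = rng.normal(size=(m, 5))
    y = x @ true_theta + 0.1 * rng.normal(size=m)
    return x, y

theta = np.zeros(5)
learning_rate, m = 0.05, 32
for step in range(2000):
    xb, yb = sample_batch(m)              # examples are never repeated
    g_hat = xb.T @ (xb @ theta - yb) / m  # minibatch gradient estimate
    theta -= learning_rate * g_hat        # step along -g_hat

print("distance to the generating parameters:", np.linalg.norm(theta - true_theta))
```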
8.2 Challenges in Neural Network Optimization

Optimization in general is an extremely difficult task. Traditionally, machine learning has avoided the difficulty of general optimization by carefully designing the objective function and constraints to ensure that the optimization problem is convex. When training neural networks, we must confront the general nonconvex case. Even convex optimization is not without its complications. In this section, we summarize several of the most prominent challenges involved in optimization for training deep models.

8.2.1 Ill-Conditioning

Some challenges arise even when optimizing convex functions. Of these, the most prominent is ill-conditioning of the Hessian matrix H. This is a very general problem in most numerical optimization, convex or otherwise, and is described in more detail in section 4.3.1.

The ill-conditioning problem is generally believed to be present in neural network training problems. Ill-conditioning can manifest by causing SGD to get "stuck" in the sense that even very small steps increase the cost function. Recall from equation 4.9 that a second-order Taylor series expansion of the cost function predicts that a gradient descent step of −εg will add

\frac{1}{2} \epsilon^2 g^\top H g - \epsilon g^\top g    (8.10)

to the cost. Ill-conditioning of the gradient becomes a problem when (1/2)ε² g^T H g exceeds ε g^T g. To determine whether ill-conditioning is detrimental to a neural network training task, one can monitor the squared gradient norm g^T g and the g^T H g term. In many cases, the gradient norm does not shrink significantly throughout learning, but the g^T H g term grows by more than an order of magnitude. The result is that learning becomes very slow despite the presence of a strong gradient, because the learning rate must be shrunk to compensate for even stronger curvature. Figure 8.1 shows an example of the gradient increasing significantly during the successful training of a neural network.

Figure 8.1: Gradient descent often does not arrive at a critical point of any kind. In this example, the gradient norm increases throughout training of a convolutional network used for object detection. (Left) A scatterplot showing how the norms of individual gradient evaluations are distributed over time. To improve legibility, only one gradient norm is plotted per epoch. The running average of all gradient norms is plotted as a solid curve. The gradient norm clearly increases over time, rather than decreasing as we would expect if the training process converged to a critical point. (Right) Despite the increasing gradient, the training process is reasonably successful. The validation set classification error decreases to a low level.

Though ill-conditioning is present in other settings besides neural network training, some of the techniques used to combat it in other contexts are less applicable to neural networks. For example, Newton's method is an excellent tool for minimizing convex functions with poorly conditioned Hessian matrices, but as we argue in subsequent sections, Newton's method requires significant modification before it can be applied to neural networks.
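To make the monitoring idea concrete, here is a minimal sketch on a toy ill-conditioned quadratic, where g and H are known exactly; the matrix, step size, and number of steps are illustrative assumptions, not values from the text.

```python
import numpy as np

# Monitor g^T g and g^T H g while running gradient descent on
# f(theta) = 0.5 * theta^T H theta with an ill-conditioned H.
H = np.diag([1.0, 100.0])          # condition number 100
theta = np.array([5.0, 5.0])
epsilon = 0.009                    # must stay below roughly 2 g^T g / g^T H g

for step in range(5):
    g = H @ theta                  # exact gradient of the quadratic
    gg = g @ g                     # squared gradient norm g^T g
    gHg = g @ H @ g                # curvature term g^T H g
    # Predicted change in cost from equation 8.10 for a step of -epsilon * g;
    # the large gHg term is what forces epsilon to be tiny.
    predicted = 0.5 * epsilon**2 * gHg - epsilon * gg
    print(f"step {step}: g^T g = {gg:12.2f}, g^T H g = {gHg:14.2f}, "
          f"predicted cost change = {predicted:10.3f}")
    theta = theta - epsilon * g
```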
8.2.2 Local Minima

One of the most prominent features of a convex optimization problem is that it can be reduced to the problem of finding a local minimum. Any local minimum is guaranteed to be a global minimum. Some convex functions have a flat region at the bottom rather than a single global minimum point, but any point within such a flat region is an acceptable solution. When optimizing a convex function, we know that we have reached a good solution if we find a critical point of any kind.

With nonconvex functions, such as neural nets, it is possible to have many local minima. Indeed, nearly any deep model is essentially guaranteed to have an extremely large number of local minima. As we will see, however, this is not necessarily a major problem.

Neural networks and any models with multiple equivalently parametrized latent variables all have multiple local minima because of the model identifiability problem. A model is said to be identifiable if a sufficiently large training set can rule out all but one setting of the model's parameters. Models with latent variables are often not identifiable because we can obtain equivalent models by exchanging latent variables with each other. For example, we could take a neural network and modify layer 1 by swapping the incoming weight vector for unit i with the incoming weight vector for unit j, then do the same for the outgoing weight vectors. If we have m layers with n units each, then there are n!^m ways of arranging the hidden units. This kind of nonidentifiability is known as weight space symmetry.

In addition to weight space symmetry, many kinds of neural networks have additional causes of nonidentifiability. For example, in any rectified linear or maxout network, we can scale all the incoming weights and biases of a unit by α if we also scale all its outgoing weights by 1/α. This means that, if the cost function does not include terms such as weight decay that depend directly on the weights rather than on the models' outputs, every local minimum of a rectified linear or maxout network lies on an (m × n)-dimensional hyperbola of equivalent local minima.
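A small numerical check of this scaling nonidentifiability is sketched below: in a one-hidden-layer ReLU network, multiplying one unit's incoming weights and bias by a positive α and its outgoing weight by 1/α leaves the network's output unchanged. The network sizes and the value of α are arbitrary illustrative choices.

```python
import numpy as np

def relu_net(x, W1, b1, w2):
    """One hidden ReLU layer followed by a linear readout."""
    return np.maximum(0.0, W1 @ x + b1) @ w2

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # incoming weights for 4 hidden units
b1 = rng.normal(size=4)
w2 = rng.normal(size=4)        # outgoing weights
x = rng.normal(size=3)

alpha, unit = 3.7, 2
W1_s, b1_s, w2_s = W1.copy(), b1.copy(), w2.copy()
W1_s[unit] *= alpha            # scale incoming weights and bias by alpha > 0
b1_s[unit] *= alpha
w2_s[unit] /= alpha            # scale the outgoing weight by 1/alpha

# The two parameter settings define the same function, hence the same cost.
print(relu_net(x, W1, b1, w2), relu_net(x, W1_s, b1_s, w2_s))
```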
These model identifiability issues mean that a neural network cost function can have an extremely large or even uncountably infinite amount of local minima. However, all these local minima arising from nonidentifiability are equivalent to each other in cost function value. As a result, these local minima are not a problematic form of nonconvexity.

Local minima can be problematic if they have high cost in comparison to the global minimum. One can construct small neural networks, even without hidden units, that have local minima with higher cost than the global minimum (Sontag and Sussman, 1989; Brady et al., 1989; Gori and Tesi, 1992). If local minima with high cost are common, this could pose a serious problem for gradient-based optimization algorithms.

Whether networks of practical interest have many local minima of high cost and whether optimization algorithms encounter them remain open questions. For many years, most practitioners believed that local minima were a common problem plaguing neural network optimization. Today, that does not appear to be the case. The problem remains an active area of research, but experts now suspect that, for sufficiently large neural networks, most local minima have a low cost function value, and that it is not important to find a true global minimum rather than to find a point in parameter space that has low but not minimal cost (Saxe et al., 2013; Dauphin et al., 2014; Goodfellow et al., 2015; Choromanska et al., 2014).

Many practitioners attribute nearly all difficulty with neural network optimization to local minima. We encourage practitioners to carefully test for specific problems. A test that can rule out local minima as the problem is plotting the norm of the gradient over time. If the norm of the gradient does not shrink to insignificant size, the problem is neither local minima nor any other kind of critical point. In high-dimensional spaces, positively establishing that local minima are the problem can be very difficult. Many structures other than local minima also have small gradients.

8.2.3 Plateaus, Saddle Points and Other Flat Regions

For many high-dimensional nonconvex functions, local minima (and maxima) are in fact rare compared to another kind of point with zero gradient: a saddle point. Some points around a saddle point have greater cost than the saddle point, while others have a lower cost. At a saddle point, the Hessian matrix has both positive and negative eigenvalues. Points lying along eigenvectors associated with positive eigenvalues have greater cost than the saddle point, while points lying along negative eigenvalues have lower value. We can think of a saddle point as being a local minimum along one cross-section of the cost function and a local maximum along another cross-section. See figure 4.5 for an illustration.

Many classes of random functions exhibit the following behavior: in low-dimensional spaces, local minima are common. In higher-dimensional spaces, local minima are rare, and saddle points are more common. For a function f : R^n → R of this type, the expected ratio of the number of saddle points to local minima grows exponentially with n. To understand the intuition behind this behavior, observe that the Hessian matrix at a local minimum has only positive eigenvalues. The Hessian matrix at a saddle point has a mixture of positive and negative eigenvalues. Imagine that the sign of each eigenvalue is generated by flipping a coin. In a single dimension, it is easy to obtain a local minimum by tossing a coin and getting heads once. In n-dimensional space, it is exponentially unlikely that all n coin tosses will be heads. See Dauphin et al. (2014) for a review of the relevant theoretical work.
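The toy experiment below is a heuristic illustration of this coin-flip intuition, not the formal random-matrix argument reviewed by Dauphin et al. (2014): drawing random symmetric matrices as stand-in Hessians, the fraction whose eigenvalues are all positive (the analogue of a local minimum) collapses rapidly as the dimension grows, at a rate comparable to or faster than 2^{-n}, since the eigenvalues of a random symmetric matrix are not independent.

```python
import numpy as np

# How often does a random symmetric "Hessian" have all-positive eigenvalues?
rng = np.random.default_rng(0)
n_trials = 5000

for n in (1, 2, 3, 5, 8):
    A = rng.normal(size=(n_trials, n, n))
    H = (A + A.transpose(0, 2, 1)) / 2               # random symmetric matrices
    all_positive = (np.linalg.eigvalsh(H) > 0).all(axis=1)
    print(f"n = {n}: fraction with all eigenvalues positive = "
          f"{all_positive.mean():.4f}   (2^-n = {2.0**-n:.4f})")
```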
An amazing property of many random functions is that the eigenvalues of the Hessian become more likely to be positive as we reach regions of lower cost. In our coin tossing analogy, this means we are more likely to have our coin come up heads n times if we are at a critical point with low cost. It also means that local minima are much more likely to have low cost than high cost. Critical points with high cost are far more likely to be saddle points. Critical points with extremely high cost are more likely to be local maxima.

This happens for many classes of random functions. Does it happen for neural networks? Baldi and Hornik (1989) showed theoretically that shallow autoencoders (feedforward networks trained to copy their input to their output, described in chapter 14) with no nonlinearities have global minima and saddle points but no local minima with higher cost than the global minimum. They observed without proof that these results extend to deeper networks without nonlinearities. The output of such networks is a linear function of their input, but they are useful to study as a model of nonlinear neural networks because their loss function is a nonconvex function of their parameters. Such networks are essentially just multiple matrices composed together. Saxe et al. (2013) provided exact solutions to the complete learning dynamics in such networks and showed that learning in these models captures many of the qualitative features observed in the training of deep models with nonlinear activation functions. Dauphin et al. (2014) showed experimentally that real neural networks also have loss functions that contain very many high-cost saddle points. Choromanska et al. (2014) provided additional theoretical arguments, showing that another class of high-dimensional random functions related to neural networks does so as well.

What are the implications of the proliferation of saddle points for training algorithms? For first-order optimization, algorithms that use only gradient information, the situation is unclear. The gradient can often become very small near a saddle point. On the other hand, gradient descent empirically seems able to escape saddle points in many cases. Goodfellow et al. (2015) provided visualizations of several learning trajectories of state-of-the-art neural networks, with an example given in figure 8.2. These visualizations show a flattening of the cost function near a prominent saddle point, where the weights are all zero, but they also show the gradient descent trajectory rapidly escaping this region.

Figure 8.2: A visualization of the cost function of a neural network. These visualizations appear similar for feedforward neural networks, convolutional networks, and recurrent networks applied to real object recognition and natural language processing tasks. Surprisingly, these visualizations usually do not show many conspicuous obstacles. Prior to the success of stochastic gradient descent for training very large models beginning in roughly 2012, neural net cost function surfaces were generally believed to have much more nonconvex structure than is revealed by these projections. The primary obstacle revealed by this projection is a saddle point of high cost near where the parameters are initialized, but, as indicated by the blue path, the SGD training trajectory escapes this saddle point readily. Most of training time is spent traversing the relatively flat valley of the cost function, perhaps because of high noise in the gradient, poor conditioning of the Hessian matrix in this region, or simply the need to circumnavigate the tall "mountain" visible in the figure via an indirect arcing path. Image adapted with permission from Goodfellow et al. (2015).

Goodfellow et al. (2015) also argue that continuous-time gradient descent may be shown analytically to be repelled from, rather than attracted to, a nearby saddle point, but the situation may be different for more realistic uses of gradient descent.

For Newton's method, saddle points clearly constitute a problem. Gradient descent is designed to move "downhill" and is not explicitly designed to seek a critical point. Newton's method, however, is designed to solve for a point where the gradient is zero. Without appropriate modification, it can jump to a saddle point. The proliferation of saddle points in high-dimensional spaces presumably explains why second-order methods have not succeeded in replacing gradient descent for neural network training.
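A minimal demonstration of this point, on a toy function chosen for illustration rather than taken from the text, compares one unmodified Newton step with a few gradient descent steps on f(x, y) = x² − y², whose only critical point (0, 0) is a saddle.

```python
import numpy as np

# Newton's method jumps to the saddle of f(x, y) = x^2 - y^2;
# gradient descent escapes it along the y axis.

def grad(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

H = np.array([[2.0, 0.0],
              [0.0, -2.0]])          # constant Hessian of f

p0 = np.array([1.0, 0.1])

# One Newton step: p - H^{-1} g lands exactly on the saddle point (0, 0).
newton = p0 - np.linalg.solve(H, grad(p0))

# A few gradient descent steps: x shrinks toward 0, |y| grows, escaping the saddle.
gd = p0.copy()
for _ in range(10):
    gd = gd - 0.1 * grad(gd)

print("Newton step lands at:", newton)
print("gradient descent reaches:", gd)
```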
Dauphin et al. (2014) introduced a saddle-free Newton method for second-order optimization and showed that it improves significantly over the traditional version. Second-order methods remain difficult to scale to large neural networks, but this saddle-free approach holds promise if it can be scaled.

There are other kinds of points with zero gradient besides minima and saddle points. Maxima are much like saddle points from the perspective of optimization: many algorithms are not attracted to them, but unmodified Newton's method is. Maxima of many classes of random functions become exponentially rare in high-dimensional space, just as minima do.

There may also be wide, flat regions of constant value. In these locations, the gradient and the Hessian are all zero. Such degenerate locations pose major problems for all numerical optimization algorithms. In a convex problem, a wide, flat region must consist entirely of global minima, but in a general optimization problem, such a region could correspond to a high value of the objective function.

8.2.4 Cliffs and Exploding Gradients

Neural networks with many layers often have extremely steep regions resembling cliffs, as illustrated in figure 8.3. These result from the multiplication of several large weights together. On the face of an extremely steep cliff structure, the gradient update step can move the parameters extremely far, usually jumping off the cliff structure altogether.

Figure 8.3: The objective function for highly nonlinear deep neural networks or for recurrent neural networks often contains sharp nonlinearities in parameter space resulting from the multiplication of several parameters. These nonlinearities give rise to very high derivatives in some places. When the parameters get close to such a cliff region, a gradient descent update can catapult the parameters very far, possibly losing most of the optimization work that has been done. Figure adapted with permission from Pascanu et al. (2013).

The cliff can be dangerous whether we approach it from above or from below, but fortunately its most serious consequences can be avoided using the gradient clipping heuristic described in section 10.11.1. The basic idea is to recall that the gradient specifies not the optimal step size, but only the optimal direction within an infinitesimal region. When the traditional gradient descent algorithm proposes making a very large step, the gradient clipping heuristic intervenes to reduce the step size, making it less likely to go outside the region where the gradient indicates the direction of approximately steepest descent. Cliff structures are most common in the cost functions for recurrent neural networks, because such models involve a multiplication of many factors, with one factor for each time step. Long temporal sequences thus incur an extreme amount of multiplication.
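One common form of the clipping heuristic is clipping by norm, sketched below: if the gradient's norm exceeds a threshold, rescale the gradient so that its direction is kept but its length is capped. The threshold value here is an arbitrary illustrative choice, and this is only one of several clipping variants.

```python
import numpy as np

def clip_by_norm(g, threshold):
    """Rescale g so its norm does not exceed threshold, keeping its direction."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)
    return g

g = np.array([300.0, -400.0])          # a "cliff" gradient with norm 500
print(clip_by_norm(g, threshold=5.0))  # -> [ 3. -4.], norm 5
```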
8.2.5 Long-Term Dependencies

Another difficulty that neural network optimization algorithms must overcome arises when the computational graph becomes extremely deep. Feedforward networks with many layers have such deep computational graphs. So do recurrent networks, described in chapter 10, which construct very deep computational graphs by repeatedly applying the same operation at each time step of a long temporal sequence. Repeated application of the same parameters gives rise to especially pronounced difficulties.

For example, suppose that a computational graph contains a path that consists of repeatedly multiplying by a matrix W. After t steps, this is equivalent to multiplying by W^t. Suppose that W has an eigendecomposition W = V diag(λ) V^{-1}. In this simple case, it is straightforward to see that

W^t = (V \, \text{diag}(\lambda) \, V^{-1})^t = V \, \text{diag}(\lambda)^t \, V^{-1}.    (8.11)

Any eigenvalues λ_i that are not near an absolute value of 1 will either explode if they are greater than 1 in magnitude or vanish if they are less than 1 in magnitude. The vanishing and exploding gradient problem refers to the fact that gradients through such a graph are also scaled according to diag(λ)^t. Vanishing gradients make it difficult to know which direction the parameters should move to improve the cost function, while exploding gradients can make learning unstable. The cliff structures described earlier that motivate gradient clipping are an example of the exploding gradient phenomenon.

The repeated multiplication by W at each time step described here is very similar to the power method algorithm used to find the largest eigenvalue of a matrix W and the corresponding eigenvector. From this point of view it is not surprising that x^T W^t will eventually discard all components of x that are orthogonal to the principal eigenvector of W.

Recurrent networks use the same matrix W at each time step, but feedforward networks do not, so even very deep feedforward networks can largely avoid the vanishing and exploding gradient problem (Sussillo, 2014). We defer further discussion of the challenges of training recurrent networks until section 10.7, after recurrent networks have been described in more detail.
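As a small numerical check of equation 8.11, the sketch below builds a diagonalizable matrix with one eigenvalue above 1 in magnitude and one below, and tracks how a vector's coordinates in the eigenbasis behave under repeated multiplication by W; the sizes and eigenvalues are illustrative assumptions.

```python
import numpy as np

# Components along |lambda| > 1 explode and components along |lambda| < 1 vanish.
rng = np.random.default_rng(0)
V = rng.normal(size=(2, 2))                  # eigenvector basis (invertible with prob. 1)
lam = np.array([1.1, 0.9])                   # one eigenvalue above 1, one below
W = V @ np.diag(lam) @ np.linalg.inv(V)

x = rng.normal(size=2)
for t in (1, 10, 50, 100):
    xt = np.linalg.matrix_power(W, t) @ x
    # Coordinates of x_t in the eigenbasis: they scale exactly as lam**t.
    coords = np.linalg.solve(V, xt)
    print(f"t = {t:3d}: eigen-coordinates = {np.round(coords, 4)}")
```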
8.2.6 Inexact Gradients

Most optimization algorithms are designed with the assumption that we have access to the exact gradient or Hessian matrix. In practice, we usually have only a noisy or even biased estimate of these quantities. Nearly every deep learning algorithm relies on sampling-based estimates, at least insofar as using a minibatch of training examples to compute the gradient.

In other cases, the objective function we want to minimize is actually intractable. When the objective function is intractable, typically its gradient is intractable as well. In such cases we can only approximate the gradient. These issues mostly arise with the more advanced models we cover in part III. For example, contrastive divergence gives a technique for approximating the gradient of the intractable log-likelihood of a Boltzmann machine.

Various neural network optimization algorithms are designed to account for imperfections in the gradient estimate. One can also avoid the problem by choosing a surrogate loss function that is easier to approximate than the true loss.

8.2.7 Poor Correspondence between Local and Global Structure

Many of the problems we have discussed so far correspond to properties of the loss function at a single point: it can be difficult to make a single step if J(θ) is poorly conditioned at the current point θ, or if θ lies on a cliff, or if θ is a saddle point hiding the opportunity to make progress downhill from the gradient. It is possible to overcome all these problems at a single point and still perform poorly if the direction that results in the most improvement locally does not point toward distant regions of much lower cost.

Goodfellow et al. (2015) argue that much of the runtime of training is due to the length of the trajectory needed to arrive at the solution. Figure 8.2 shows that the learning trajectory spends most of its time tracing out a wide arc around a mountain-shaped structure.

Figure 8.4: Optimization based on local downhill moves can fail if the local surface does not point toward the global solution. Here we provide an example of how this can occur, even if there are no saddle points or local minima. This example cost function contains only asymptotes toward low values, not minima. The main cause of difficulty in this case is being initialized on the wrong side of the "mountain" and not being able to traverse it. In higher-dimensional space, learning algorithms can often circumnavigate such mountains, but the trajectory associated with doing so may be long and result in excessive training time, as illustrated in figure 8.2.

Much of the research into the difficulties of optimization has focused on whether training arrives at a global minimum, a local minimum, or a saddle point, but in practice, neural networks do not arrive at a critical point of any kind. Figure 8.1 shows that neural networks often do not arrive at a region of small gradient. Indeed, such critical points do not even necessarily exist. For example, the loss function −log p(y | x; θ) can lack a global minimum point and instead asymptotically approach some value as the model becomes more confident. For a classifier with discrete y and p(y | x) provided by a softmax, the negative log-likelihood can become arbitrarily close to zero if the model is able to correctly classify every example in the training set, but it is impossible to actually reach the value of zero. Likewise, a model of real values p(y | x) = N(y; f(x; θ), β^{-1}) can have negative log-likelihood that asymptotes to negative infinity: if f(x; θ) is able to correctly predict the value of all training set y targets, the learning algorithm will increase β without bound. See figure 8.4 for an example of a failure of local optimization to find a good cost function value even in the absence of any local minima or saddle points.

Future research will need to develop further understanding of the factors that influence the length of the learning trajectory and better characterize the outcome of the process.
