ML algorithms
The gradient can be obtained by differentiating $E$ from Equation (4.2) with respect to each weight $w_i$:

$$
\begin{aligned}
\frac{\partial E}{\partial w_i}
&= \frac{\partial}{\partial w_i}\,\frac{1}{2}\sum_{d\in D}(t_d - o_d)^2
 = \frac{1}{2}\sum_{d\in D}\frac{\partial}{\partial w_i}(t_d - o_d)^2 \\
&= \frac{1}{2}\sum_{d\in D}2(t_d - o_d)\,\frac{\partial}{\partial w_i}(t_d - o_d)
 = \sum_{d\in D}(t_d - o_d)\,\frac{\partial}{\partial w_i}\bigl(t_d - \vec{w}\cdot\vec{x}_d\bigr)
\end{aligned}
$$

$$
\frac{\partial E}{\partial w_i} = \sum_{d\in D}(t_d - o_d)(-x_{id}) \qquad (4.6)
$$

where $x_{id}$ denotes the single input component $x_i$ for training example $d$. We now have an equation that gives $\partial E/\partial w_i$ in terms of the linear unit inputs $x_{id}$, outputs $o_d$, and target values $t_d$ associated with the training examples. Substituting Equation (4.6) into Equation (4.5) yields the weight update rule for gradient descent:

$$
\Delta w_i = \eta \sum_{d\in D}(t_d - o_d)\,x_{id} \qquad (4.7)
$$

To summarize, the gradient descent algorithm for training linear units is as follows: pick an initial random weight vector; apply the linear unit to all training examples, then compute $\Delta w_i$ for each weight according to Equation (4.7); update each weight $w_i$ by adding $\Delta w_i$; and repeat this process. This algorithm is given in Table 4.1. Because the error surface contains only a single global minimum, the algorithm will converge toward it given a sufficiently small learning rate $\eta$; if $\eta$ is too large, the search risks overstepping the minimum, and for this reason one common modification to the algorithm is to gradually reduce the value of $\eta$ as the number of gradient descent steps grows.

GRADIENT-DESCENT(training_examples, $\eta$)

Each training example is a pair of the form $\langle \vec{x}, t\rangle$, where $\vec{x}$ is the vector of input values and $t$ is the target output value. $\eta$ is the learning rate (e.g., .05).

- Initialize each $w_i$ to some small random value.
- Until the termination condition is met, Do
  - Initialize each $\Delta w_i$ to zero.
  - For each $\langle \vec{x}, t\rangle$ in training_examples, Do
    - Input the instance $\vec{x}$ to the unit and compute the output $o$.
    - For each linear unit weight $w_i$: $\Delta w_i \leftarrow \Delta w_i + \eta(t - o)\,x_i$  (T4.1)
  - For each linear unit weight $w_i$: $w_i \leftarrow w_i + \Delta w_i$  (T4.2)

TABLE 4.1. GRADIENT-DESCENT algorithm for training a linear unit. To implement the stochastic approximation to gradient descent, Equation (T4.2) is deleted and Equation (T4.1) replaced by $w_i \leftarrow w_i + \eta(t - o)\,x_i$.

4.3.3 STOCHASTIC APPROXIMATION TO GRADIENT DESCENT

Gradient descent is an important general paradigm for learning. It is a strategy for searching through a large or infinite hypothesis space that can be applied whenever (1) the hypothesis space contains continuously parameterized hypotheses (e.g., the weights in a linear unit), and (2) the error can be differentiated with respect to these hypothesis parameters. The key practical difficulties in applying gradient descent are that (1) converging to a local minimum can sometimes be quite slow (it can require many thousands of gradient descent steps), and (2) if there are multiple local minima in the error surface, there is no guarantee that the procedure will find the global minimum.

One common variation on gradient descent intended to alleviate these difficulties is called incremental gradient descent, or alternatively stochastic gradient descent. Whereas the gradient descent training rule presented in Equation (4.7) computes weight updates after summing over all the training examples in $D$, the idea behind stochastic gradient descent is to approximate this gradient descent search by updating weights incrementally, following the calculation of the error for each individual example. The modified training rule is like that of Equation (4.7), except that as we iterate through each training example we update the weight according to

$$
\Delta w_i = \eta(t - o)\,x_i \qquad (4.10)
$$

where $t$, $o$, and $x_i$ are the target value, unit output, and $i$th input for the training example in question. To modify the gradient descent algorithm of Table 4.1 to implement this stochastic approximation, Equation (T4.2) is simply deleted and Equation (T4.1) replaced by $w_i \leftarrow w_i + \eta(t - o)\,x_i$. One way to view this stochastic gradient descent is to consider a distinct error function $E_d(\vec{w})$ defined for each individual training example $d$ as follows:

$$
E_d(\vec{w}) = \frac{1}{2}(t_d - o_d)^2 \qquad (4.11)
$$

where $t_d$ and $o_d$ are the target value and the unit output value for training example $d$. Stochastic gradient descent iterates over the training examples $d$ in $D$, at each iteration altering the weights according to the gradient with respect to $E_d(\vec{w})$. The sequence of these weight updates, when iterated over all training examples, provides a reasonable approximation to descending the gradient with respect to the original error function $E(\vec{w})$.
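To make the two update schemes concrete, here is a minimal NumPy sketch of Table 4.1 and its stochastic variant. This is not the book's code: the function name, array shapes, fixed epoch count, and random seed are our own illustrative choices.

```python
import numpy as np

def gradient_descent(X, t, eta=0.05, n_epochs=100, stochastic=False):
    """Train a linear unit o = w . x.

    Batch mode accumulates the updates of Equation (4.7) over all of D
    before applying them; stochastic mode applies Equation (4.10) after
    each example. X is an (m, n) array of inputs, t an (m,) array of
    target values."""
    rng = np.random.default_rng(0)
    w = rng.uniform(-0.05, 0.05, X.shape[1])   # small random initial weights
    for _ in range(n_epochs):
        if stochastic:
            for x_d, t_d in zip(X, t):         # update after each example d
                o_d = w @ x_d
                w += eta * (t_d - o_d) * x_d   # w_i <- w_i + eta (t - o) x_i
        else:
            o = X @ w                          # outputs for every example in D
            w += eta * (t - o) @ X             # summed Delta w_i of Equation (4.7)
    return w
```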
By making the value of $\eta$ (the gradient descent step size) sufficiently small, stochastic gradient descent can be made to approximate true gradient descent arbitrarily closely. The key differences between standard gradient descent and stochastic gradient descent are:

- In standard gradient descent, the error is summed over all examples before updating weights, whereas in stochastic gradient descent weights are updated upon examining each training example.
- Summing over multiple examples in standard gradient descent requires more computation per weight update step. On the other hand, because it uses the true gradient, standard gradient descent is often used with a larger step size per weight update than stochastic gradient descent.
- In cases where there are multiple local minima with respect to $E(\vec{w})$, stochastic gradient descent can sometimes avoid falling into these local minima, because it uses the various $\nabla E_d(\vec{w})$ rather than $\nabla E(\vec{w})$ to guide its search.

Both stochastic and standard gradient descent methods are commonly used in practice.

The training rule in Equation (4.10) is known as the delta rule, or sometimes the LMS (least-mean-square) rule, Adaline rule, or Widrow-Hoff rule (after its inventors). In Chapter 1 we referred to it as the LMS weight-update rule when describing its use for learning an evaluation function for game playing. Notice that the delta rule in Equation (4.10) is similar to the perceptron training rule in Equation (4.4.2). In fact, the two expressions appear to be identical. The rules are different, however, because in the delta rule $o$ refers to the linear unit output $o(\vec{x}) = \vec{w}\cdot\vec{x}$, whereas for the perceptron rule $o$ refers to the thresholded output $o(\vec{x}) = \mathrm{sgn}(\vec{w}\cdot\vec{x})$.

Although we have presented the delta rule as a method for learning weights for unthresholded linear units, it can easily be used to train thresholded perceptron units as well. Suppose that $o = \vec{w}\cdot\vec{x}$ is the unthresholded linear unit output as above, and $o' = \mathrm{sgn}(\vec{w}\cdot\vec{x})$ is the result of thresholding $o$ as in the perceptron. Now if we wish to train a perceptron to fit training examples with target values $\pm 1$ for $o'$, we can use these same target values and examples to train $o$ instead, using the delta rule. Clearly, if the unthresholded output $o$ can be trained to fit these values perfectly, then the thresholded output $o'$ will fit them as well (because $\mathrm{sgn}(1) = 1$ and $\mathrm{sgn}(-1) = -1$). Even when the target values cannot be fit exactly, the thresholded $o'$ value will correctly fit the $\pm 1$ target whenever the linear unit output $o$ has the correct sign. Notice, however, that while this procedure will learn weights that minimize the error in the linear unit output $o$, these weights will not necessarily minimize the number of training examples misclassified by the thresholded output $o'$. We have thus considered two similar algorithms for iteratively learning perceptron weights; the key difference between them is that the perceptron training rule updates weights based on the error in the thresholded output, whereas the delta rule updates weights based on the error in the unthresholded linear combination of inputs.

The same stochastic gradient descent idea carries over from single linear units to multilayer networks of sigmoid units, trained by the BACKPROPAGATION algorithm of Table 4.2.

BACKPROPAGATION(training_examples, $\eta$, $n_{in}$, $n_{out}$, $n_{hidden}$)

Each training example is a pair of the form $\langle \vec{x}, \vec{t}\rangle$, where $\vec{x}$ is the vector of network input values and $\vec{t}$ is the vector of target network output values. $\eta$ is the learning rate (e.g., .05), $n_{in}$ is the number of network inputs, $n_{hidden}$ the number of units in the hidden layer, and $n_{out}$ the number of output units. The input from unit $i$ into unit $j$ is denoted $x_{ji}$, and the weight from unit $i$ to unit $j$ is denoted $w_{ji}$.

- Create a feed-forward network with $n_{in}$ inputs, $n_{hidden}$ hidden units, and $n_{out}$ output units.
- Initialize all network weights to small random numbers (e.g., between $-.05$ and $.05$).
- Until the termination condition is met, Do
  - For each $\langle \vec{x}, \vec{t}\rangle$ in training_examples, Do

    Propagate the input forward through the network:
    1. Input the instance $\vec{x}$ to the network and compute the output $o_u$ of every unit $u$ in the network.

    Propagate the errors backward through the network:
    2. For each network output unit $k$, calculate its error term $\delta_k$:
       $\delta_k \leftarrow o_k(1 - o_k)(t_k - o_k)$  (T4.3)
    3. For each hidden unit $h$, calculate its error term $\delta_h$:
       $\delta_h \leftarrow o_h(1 - o_h)\sum_{k \in outputs} w_{kh}\,\delta_k$  (T4.4)
    4. Update each network weight $w_{ji}$:
       $w_{ji} \leftarrow w_{ji} + \Delta w_{ji}$, where $\Delta w_{ji} = \eta\,\delta_j\,x_{ji}$  (T4.5)

TABLE 4.2. The stochastic gradient descent version of the BACKPROPAGATION algorithm for feedforward networks containing two layers of sigmoid units.
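Table 4.2 translates almost line for line into NumPy. The sketch below is one possible rendering of a single stochastic-gradient step for a two-layer sigmoid network; the function name and matrix shapes are our own conventions, and bias weights are omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, V, W, eta=0.05):
    """One stochastic-gradient step of Table 4.2 for a two-layer sigmoid
    network. V holds the hidden-layer weights, shape (n_hidden, n_in);
    W holds the output-layer weights, shape (n_out, n_hidden)."""
    # 1. Propagate the input forward through the network
    a = sigmoid(V @ x)                      # hidden-unit outputs o_h
    y = sigmoid(W @ a)                      # output-unit outputs o_k
    # 2. Error term for each output unit k (Equation T4.3)
    delta_k = y * (1 - y) * (t - y)
    # 3. Error term for each hidden unit h (Equation T4.4)
    delta_h = a * (1 - a) * (W.T @ delta_k)
    # 4. Update every weight: w_ji <- w_ji + eta * delta_j * x_ji (T4.5)
    W += eta * np.outer(delta_k, a)
    V += eta * np.outer(delta_h, x)
    return y
```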
Machine Learning: An Algorithmic Perspective presents the same two-phase procedure as an explicit recipe:

The Multi-layer Perceptron Algorithm

- Initialisation
  - initialise all weights to small (positive and negative) random values
- Training
  - repeat:
    - for each input vector:

      Forwards phase:
      - compute the activation of each neuron $j$ in the hidden layer(s) using:
        $h_\zeta = \sum_\iota x_\iota v_{\iota\zeta}$  (4.4)
        $a_\zeta = g(h_\zeta) = \dfrac{1}{1 + \exp(-\beta h_\zeta)}$  (4.5)
      - work through the network until you get to the output layer neurons, which have activations (although see also Section 4.2.3):
        $h_\kappa = \sum_\zeta a_\zeta w_{\zeta\kappa}$  (4.6)
        $y_\kappa = g(h_\kappa) = \dfrac{1}{1 + \exp(-\beta h_\kappa)}$  (4.7)

      Backwards phase:
      - compute the error at the output using:
        $\delta_o(\kappa) = (y_\kappa - t_\kappa)\,y_\kappa(1 - y_\kappa)$  (4.8)
      - compute the error in the hidden layer(s) using:
        $\delta_h(\zeta) = a_\zeta(1 - a_\zeta)\sum_\kappa w_{\zeta\kappa}\,\delta_o(\kappa)$  (4.9)
      - update the output layer weights using:
        $w_{\zeta\kappa} \leftarrow w_{\zeta\kappa} - \eta\,\delta_o(\kappa)\,a_\zeta^{\mathrm{hidden}}$  (4.10)
      - update the hidden layer weights using:
        $v_{\iota\zeta} \leftarrow v_{\iota\zeta} - \eta\,\delta_h(\zeta)\,x_\iota$  (4.11)
    - (if using sequential updating) randomise the order of the input vectors so that you don't train in exactly the same order each iteration
  - until learning stops (see Section 4.3.3)
- Recall
  - use the Forwards phase in the training section above

k-NEAREST NEIGHBOR LEARNING

The most basic instance-based method is the k-NEAREST NEIGHBOR algorithm. It assumes all instances correspond to points in the $n$-dimensional space $\mathbb{R}^n$, with the nearest neighbors of an instance defined in terms of the standard Euclidean distance. More precisely, let an arbitrary instance $x$ be described by the feature vector $\langle a_1(x), a_2(x), \ldots, a_n(x)\rangle$, where $a_r(x)$ denotes the value of the $r$th attribute of instance $x$. Then the distance between two instances $x_i$ and $x_j$ is defined to be $d(x_i, x_j)$, where

$$
d(x_i, x_j) \equiv \sqrt{\sum_{r=1}^{n}\bigl(a_r(x_i) - a_r(x_j)\bigr)^2}
$$

In nearest-neighbor learning the target function may be either discrete-valued or real-valued. Let us first consider learning discrete-valued target functions of the form $f : \mathbb{R}^n \rightarrow V$, where $V$ is the finite set $\{v_1, \ldots, v_s\}$. The k-NEAREST NEIGHBOR algorithm for approximating a discrete-valued target function is given in Table 8.1. As shown there, the value $\hat{f}(x_q)$ returned by this algorithm as its estimate of $f(x_q)$ is just the most common value of $f$ among the $k$ training examples nearest to $x_q$. If we choose $k = 1$, the 1-NEAREST NEIGHBOR algorithm assigns to $\hat{f}(x_q)$ the value $f(x_i)$, where $x_i$ is the training instance nearest to $x_q$. For larger values of $k$, the algorithm assigns the most common value among the $k$ nearest training examples.

Training algorithm:
- For each training example $\langle x, f(x)\rangle$, add the example to the list training_examples.

Classification algorithm:
- Given a query instance $x_q$ to be classified,
  - let $x_1 \ldots x_k$ denote the $k$ instances from training_examples that are nearest to $x_q$;
  - return
    $$\hat{f}(x_q) \leftarrow \underset{v \in V}{\operatorname{argmax}} \sum_{i=1}^{k} \delta(v, f(x_i))$$
    where $\delta(a, b) = 1$ if $a = b$ and $\delta(a, b) = 0$ otherwise.

TABLE 8.1. The k-NEAREST NEIGHBOR algorithm for approximating a discrete-valued function $f : \mathbb{R}^n \rightarrow V$.

Figure 8.1 illustrates the operation of the k-NEAREST NEIGHBOR algorithm for the case where the instances are points in a two-dimensional space and the target function is boolean-valued. The positive and negative training examples are shown by "+" and "-" respectively, and a query point $x_q$ is shown as well. Note that the 1-NEAREST NEIGHBOR algorithm classifies $x_q$ as a positive example in this figure, whereas the 5-NEAREST NEIGHBOR algorithm classifies it as a negative example.

What is the nature of the hypothesis space $H$ implicitly considered by the k-NEAREST NEIGHBOR algorithm? Note that the algorithm never forms an explicit general hypothesis $\hat{f}$ regarding the target function $f$; it simply computes the classification of each new query instance as needed. Nevertheless, we can still ask what classifications would be assigned if we held the training examples constant and queried the algorithm with every possible instance.
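A compact sketch of Table 8.1 in NumPy (the function name and array conventions are our own, and Counter breaks ties between equally common values arbitrarily):

```python
import numpy as np
from collections import Counter

def knn_classify(x_q, X_train, f_train, k=5):
    """Table 8.1: return the most common value of f among the k training
    examples nearest to the query x_q under Euclidean distance."""
    # d(x_i, x_q) = sqrt(sum_r (a_r(x_i) - a_r(x_q))^2)
    dists = np.sqrt(((X_train - x_q) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]        # indices of the k nearest x_i
    # argmax_v sum_i delta(v, f(x_i)) is just a majority vote
    votes = Counter(f_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```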
Turning to unsupervised learning: the k-means algorithm places $k$ cluster centres in the input space, computes the distance between each datapoint and all of the cluster centres, assigns each point to its nearest centre, then computes the mean of the points in each cluster as that centre's new position, iterating the algorithm until the cluster centres stop moving.

The k-Means Algorithm

- Initialisation
  - choose a value for $k$
  - choose $k$ random positions in the input space
  - assign the cluster centres $\mu_j$ to those positions
- Learning
  - repeat
    - for each datapoint $x_i$:
      - compute the distance to each cluster centre
      - assign the datapoint to the nearest cluster centre, with distance
        $d_i = \min_j d(x_i, \mu_j)$  (14.1)
    - for each cluster centre:
      - move the position of the centre to the mean of the points in that cluster ($N_j$ is the number of points in cluster $j$):
        $\mu_j = \dfrac{1}{N_j}\sum_{i=1}^{N_j} x_i$  (14.2)
  - until the cluster centres stop moving
- Usage
  - for each test point:
    - compute the distance to each cluster centre
    - assign the datapoint to the nearest cluster centre, with distance
      $d_i = \min_j d(x_i, \mu_j)$  (14.3)

Our implementation follows these steps almost exactly, and we can take advantage of the np.argmin() function, which returns the index of the minimum value, to find the closest cluster. The code that computes the distances, finds the nearest cluster centre, and updates the centres can be written as follows.
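The listing itself is not part of this excerpt, so what follows is a plausible reconstruction of that step rather than the book's exact code; the variable names and the empty-cluster guard are our own choices, with np.argmin finding the nearest centre as described.

```python
import numpy as np

def kmeans_step(data, centres):
    """One Learning iteration: assign each datapoint to its nearest centre
    (Equation 14.1), then move each centre to the mean of its points
    (Equation 14.2). data has shape (N, d); centres has shape (k, d)."""
    # squared Euclidean distance from every datapoint to every centre: (N, k)
    dists = ((data[:, np.newaxis, :] - centres[np.newaxis, :, :]) ** 2).sum(axis=2)
    closest = dists.argmin(axis=1)          # index of the nearest centre
    for j in range(centres.shape[0]):
        members = data[closest == j]
        if len(members) > 0:                # leave an empty cluster in place
            centres[j] = members.mean(axis=0)
    return centres, closest
```

Calling kmeans_step repeatedly until the closest assignments stop changing between calls implements the repeat loop in the algorithm above.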
