

Multilayer Perceptron Tutorial

Leonardo Noriega, School of Computing, Staffordshire University, Beaconside, Staffordshire ST18 0DG, email: [email protected], November 17, 2005

1 Introduction to Neural Networks

Artificial Neural Networks are a programming paradigm that seeks to emulate the microstructure of the brain, and are used extensively in artificial intelligence problems ranging from simple pattern-recognition tasks to advanced symbolic manipulation. The Multilayer Perceptron is an example of an artificial neural network that is used extensively for the solution of a number of different problems, including pattern recognition and interpolation. It is a development of the Perceptron neural network model, which was originally developed in the late 1950s but was found to have serious limitations. For the next two tutorials, you are to implement and test a multilayer perceptron, using the programming language of your choice. The network should consist of activation units (artificial neurones) and weights. The purpose of this exercise is to help in understanding some of the concepts discussed in the lectures. In writing this code, review the lectures, and try to relate the practice to the theory.

2 History and Theoretical Background

2.1 Biological Basis of Neural Networks

Artificial Neural Networks attempt to model the functioning of the human brain. The human brain consists of billions of individual cells called neurones. It is believed by many (the issue is contentious) that all knowledge and experience is encoded by the connections that exist between neurones. Given that the human brain consists of such a large number of neurones (so many that it is impossible to count them with any certainty), the quantity and nature of the connections between neurones is, at present levels of understanding, almost impossible to assess. The question of whether information is actually encoded at neural connections (and not at the quantum level, for example, as argued by some authors; see Roger Penrose, The Emperor's New Mind) is beyond the scope of this course. The assumption that one can encode knowledge neurally has led to some interesting and challenging algorithms for the solution of AI problems, including the Perceptron and the Multilayer Perceptron (MLP).

© Dr L. Noriega

2.2 Understanding the Neurone

Intelligence is arguably encoded at the connections between neurones (the synapses), but before examining what happens at these connections, we need to understand how the neurone functions. Modern computers use a single, highly complex processing unit (e.g. Intel Pentium) which performs a large number of different functions. All of the processing on a conventional computer is handled by this single unit, which processes commands at great speed. The human brain is different in that it has billions of simple processing units (neurones). Each of these units is slow when compared to, say, a Pentium 4, but only ever performs one simple task. A neurone activates (fires) or remains inactive. One may observe in this a kind of binary logic, where activation may be denoted by a 1, and inactivation by a 0. Neurones can therefore be modelled as simple switches; the only problem that remains is understanding what determines whether a neurone fires. Neurones can be modelled as simple input-output devices, linked together in a network. Input is received from neurones found lower down a processing chain, and the output is transmitted to neurones higher up the chain. When a neurone fires, it passes information up the processing chain. This innate simplicity makes neurones fairly straightforward entities to model; it is in modelling the connections that the greatest challenges occur.

2.3 Understanding the Connections (Synapses)

When real neurones fire, they transmit chemicals (neurotransmitters) to the next group of neurones up the processing chain alluded to in the previous subsection. These neurotransmitters form the input to the next neurone, and constitute the messages neurones send to each other. These messages can assume one of three different forms.

Excitation - Excitatory neurotransmitters increase the likelihood that the next neurone in the chain will fire.
Inhibition - Inhibitory neurotransmitters decrease the likelihood that the next neurone will fire.
Potentiation - Adjusting the sensitivity of the next neurones in the chain to excitation or inhibition (this is the learning mechanism).

If we can model neurones as simple switches, we can model connections between neurones as matrices of numbers (called weights), such that positive weights indicate excitation and negative weights indicate inhibition. How learning is modelled depends on the paradigm used.

2.4 Modelling Learning

Using artificial neural networks it is impossible to model the full complexity of the brain of anything other than the most basic living creatures, and generally ANNs will consist of at most a few hundred (or a few thousand) neurones, with very limited connections between them. Nonetheless, quite small neural networks have been used to solve quite difficult computational problems. Generally, Artificial Neural Networks are basic input and output devices, with the neurones organised into layers. Simple Perceptrons consist of a layer of input neurones, coupled with a layer of output neurones, and a single layer of weights between them, as shown in Figure 1. The learning process consists of finding the correct values for the weights between the input and output layer.

Figure 1: Simple Perceptron Architecture

The schematic representation given in Figure 1 is often how neural nets are depicted in the literature, although mathematically it is useful to think of the input and output layers as vectors of values ($I$ and $O$ respectively), and the weights as a matrix. We define the weight matrix $W_{io}$ as an $i \times o$ matrix, where $i$ is the number of input nodes and $o$ is the number of output nodes. The network output is calculated as follows.

$$O = f(I W_{io}) \tag{1}$$

Generally, data is presented at the input layer, and the network then processes the input by multiplying it by the weight layer. The result of this multiplication is processed by the output layer nodes, using a function that determines whether or not the output node fires. The process of finding the correct values for the weights is called the learning rule, and the process involves initialising the weight matrix to a set of random numbers between -1 and +1. Then, as the network learns, these values are changed until it has been decided that the network has solved the problem. Finding the correct values for the weights is effected using a learning paradigm called supervised learning. Supervised learning is sometimes referred to as training. Data is used to train the network; this constitutes input data for which the correct output is known. Starting with random weights, an input pattern is presented to the network, and it makes an initial guess as to what the correct output should be. During the training phase, the difference between the guess made by the network and the correct value for the output is assessed, and the weights are changed in order to minimise the error. The error minimisation technique is based on traditional gradient descent techniques. While this may sound frighteningly mathematical, the actual functions used in neural networks to make the corrections to the weights are chosen because of their simplicity, and the implementation of the algorithm is invariably uncomplicated.
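The two mechanical steps described above, random initialisation of the weights and multiplication of the input vector by the weight matrix, can be sketched in C as follows. This is a minimal illustration, not the tutorial's own code: the function names `init_weights` and `net_input`, the flat row-by-row storage of the matrix, and the explicit `seed` argument are all assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Fill an n_in x n_out weight matrix (stored row by row) with
   pseudo-random values between -1 and +1. The seed argument is an
   assumption, added so that runs can be repeated. */
void init_weights(double *w, int n_in, int n_out, unsigned int seed)
{
    assert(w != NULL && n_in > 0 && n_out > 0);
    srand(seed);
    for (int k = 0; k < n_in * n_out; k++)
        w[k] = 2.0 * rand() / (double)RAND_MAX - 1.0;
}

/* The product I W from Equation 1, before the output function f is
   applied. w[i * n_out + o] is the weight from input i to output o. */
void net_input(const double *in, const double *w, int n_in, int n_out,
               double *net)
{
    assert(in != NULL && w != NULL && net != NULL);
    for (int o = 0; o < n_out; o++) {
        net[o] = 0.0;
        for (int i = 0; i < n_in; i++)
            net[o] += in[i] * w[i * n_out + o];
    }
}
```

Each entry of `net` is then passed through the output function f to decide whether that output node fires.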


2.5 The Activation Function

The basic model of a neurone used in Perceptrons and MLPs is the McCulloch-Pitts model, which dates from 1943. This modelled a neurone as a simple threshold function.

$$f(x) = \begin{cases} 1 & x > 0 \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

This activation function was used in the Perceptron neural network model, and as can be seen it is a relatively straightforward activation function to implement.
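As an indication of how little code Equation 2 needs, a possible C version is given below. The helper `fire_layer`, which applies the threshold to a whole layer of net inputs, is a hypothetical convenience and not part of the tutorial; `double` is used throughout as an assumption about the preferred numeric type.

```c
#include <assert.h>
#include <stddef.h>

/* Equation 2: output 1 when the net input is strictly positive, else 0. */
double threshold(double x)
{
    return x > 0.0 ? 1.0 : 0.0;
}

/* Apply the threshold to each of the n net inputs of a layer. */
void fire_layer(const double *net, size_t n, double *out)
{
    assert(net != NULL && out != NULL);
    for (size_t k = 0; k < n; k++)
        out[k] = threshold(net[k]);
}
```

Note that a net input of exactly zero does not fire, because the comparison in Equation 2 is strict.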

2.6 The Learning Rule

The perceptron learning rule is comparatively straightforward. Starting with a matrix of random weights, we present a training pattern to the network, and calculate the network output. We determine an error function $E$

$$E(O) = (T - O) \tag{3}$$

where in this case $T$ is the target output vector for a training input. In order to determine how the weights should change, this function has to be minimised. What this means is finding the point at which the function reaches its minimum value. The assumption we make about the error function is that if we were to plot all of its potential values on a graph, it would be shaped like a bowl, with sides sloping down to a minimum value at the bottom. In order to find the minimum values of a function, differentiation is used. Differentiation gives the rate at which a function changes, and is often defined as the slope of the tangent to a curve at a particular point¹. If our function is perfectly bowl shaped, then there will be only one point at which the tangent has a slope of zero (i.e. is perfectly flat), and that is its minimum point (see Figure 2).

¹ A full discussion of differential calculus and optimisation theory is beyond the scope of this study, but students are referred to suitable mathematics textbooks for deeper explanations if required.

Figure 2: Function Minimisation using Differentiation

In neural network programming the intention is to assess the effect of the weights on the overall error function. We can take Equation 3 and combine it with Equation 1 to obtain the following.

$$E(O) = (T - O) = T - f(I W_{io}) \tag{4}$$

We then differentiate the error function with respect to the weight matrix. The discussion on Multilayer Perceptrons will look at the issues of function minimisation in greater detail. Function minimisation in the Simple Perceptron Algorithm is very straightforward. We consider the error at each individual output node, and add that error to the weights feeding into that node. The perceptron learning algorithm works as follows.

1. Initialise the weights to random values on the interval [-1,1].
2. Present an input pattern to the network.
3. Calculate the network output.
4. For each node n in the output layer...
(a) calculate the error $E_n = T_n - O_n$
(b) add $E_n$ to all of the weights that connect to node n (add $E_n$ to column n of the weight matrix).
5. Repeat the process from 2. for the next pattern in the training set.

This is the essence of the perceptron algorithm. It can be shown that this technique minimises the error function. In its current form it will work, but the time taken to converge to a solution (i.e. the time taken to find the minimum value) may be unpredictable, because adding the error to the weight matrix is something of a blunt instrument and results in the weights gaining high values if several iterations are required to obtain a solution. This is akin to taking large steps around the bowl in order to find the minimum value; if smaller steps are taken we are more likely to find the bottom. In order to control the convergence rate, and reduce the size of the steps being taken, a parameter $\eta$ called the learning rate is used. This parameter is set to a value that is less than unity, and means that the weights are updated in smaller steps (using a fraction of the error). The weight update rule becomes the following.

$$W_{io}(t + 1) = W_{io}(t) + \eta E_n \tag{5}$$

This means that the weight value at iteration $t + 1$ of the algorithm is equivalent to a fraction of the error $E_n$ added to the weight value at iteration $t$.

2.7 Example of the Perceptron Learning Algorithm

In order to illustrate the processes involved in perceptron learning, a simple perceptron will be used to emulate a logic gate, specifically an AND gate. We can model this as a simple input and output device, having two input nodes and a single output node. Table 1 gives the inputs and outputs expected of the network.

I0    I1    O
0     0     0
0     1     0
1     0     0
1     1     1

Table 1: AND gate

The data in Table 1 forms the training set, and suggests the network topology outlined in Figure 3, with two inputs and a single output to determine the response.

Figure 3: Network Topology for Logic Gate

This gives an input vector defined as follows,

$$I = [I_0 \; I_1] \tag{6}$$

and a weight matrix defined thus,

$$W_{io} = \begin{bmatrix} W_{00} \\ W_{10} \end{bmatrix} \tag{7}$$

The network output then becomes

$$O = \begin{cases} 1 & \text{if } (W_{00} I_0) + (W_{10} I_1) > 0 \\ 0 & \text{otherwise} \end{cases} \tag{8}$$

Implementing the perceptron learning algorithm then becomes a matter of substituting the input values in Table 1 into the vector $I$.
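The AND gate example can be tried out with a sketch along the following lines. Two things here are assumptions rather than part of the text. First, a third, constant input fixed at 1 (a bias) is added, because with only the two weights of Equation 7 and a zero threshold the rows (0,1) and (1,0) force both weights to be non-positive, so their sum can never exceed zero for the row (1,1). Second, each update is scaled by the input value, as in the common textbook form of the rule. The names `and_output` and `train_and` are hypothetical.

```c
#include <assert.h>

#define N_IN 3   /* two gate inputs plus a constant bias input (assumed) */

/* Threshold output in the style of Equation 8, extended with the bias. */
int and_output(const double w[N_IN], int i0, int i1)
{
    double net = w[0] * i0 + w[1] * i1 + w[2] * 1.0;
    return net > 0.0 ? 1 : 0;
}

/* Train on Table 1 until a whole epoch produces no errors.
   Returns the number of epochs used, or -1 if max_epochs is reached. */
int train_and(double w[N_IN], double lrate, int max_epochs)
{
    static const int table[4][3] = {   /* I0, I1, target O */
        {0, 0, 0}, {0, 1, 0}, {1, 0, 0}, {1, 1, 1}
    };
    assert(lrate > 0.0 && max_epochs > 0);
    for (int epoch = 1; epoch <= max_epochs; epoch++) {
        int mistakes = 0;
        for (int p = 0; p < 4; p++) {
            int err = table[p][2] - and_output(w, table[p][0], table[p][1]);
            if (err != 0) {
                mistakes++;
                w[0] += lrate * err * table[p][0];
                w[1] += lrate * err * table[p][1];
                w[2] += lrate * err;   /* the bias input is always 1 */
            }
        }
        if (mistakes == 0)
            return epoch;
    }
    return -1;
}
```

Starting from all-zero weights with a learning rate of 0.25, training settles on weights that reproduce every row of Table 1; random starting weights, as the algorithm prescribes, work in the same way but change the number of epochs needed.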

Exercise 1: Implementation of a Single Layer Perceptron

Implement a single layer perceptron using a programming language with which you are most familiar (C, C++ or Java). Some design guidelines will be given.

C guidelines
It is useful to define the network as a structure such as the following

    typedef struct {
        // Architectural constraints
        int input, hidden, output;   // no. of input, hidden, output units
        double **iweights;           // weight matrix
        double *netin, *netout;      // input and output vectors
        double lrate;                // learning rate
    } PERCEPTRON;

Write C functions to populate the structure with data as appropriate. Functions will be needed to implement the activation function and the presentation of the training data.

    // Generates a perceptron
    void makePerceptron(PERCEPTRON *net, int i, int h, int o);

    // Initialises the weight matrix with random numbers
    void initialise(PERCEPTRON *net);

    // Logical step function
    float threshold(float input);

    // Presents a single pattern to the network
    void present(PERCEPTRON *net, double *pattern);

    // Shows the state of a perceptron
    void showState(PERCEPTRON *net);

    // Calculates the network error, and propagates the error backwards
    void BackProp(PERCEPTRON *net, double *target);
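As a starting point, one way `makePerceptron` might allocate the structure is sketched below. The allocation strategy, the helper `xcalloc`, the placeholder learning rate of 0.1, and the companion `freePerceptron` are all assumptions made for illustration; none of them is prescribed by the guidelines.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int input, hidden, output;   /* number of input, hidden, output units */
    double **iweights;           /* weight matrix, iweights[i][o] */
    double *netin, *netout;      /* input and output vectors */
    double lrate;                /* learning rate */
} PERCEPTRON;

/* Hypothetical helper: zeroed allocation that stops the program on failure. */
static void *xcalloc(size_t n, size_t size)
{
    void *p = calloc(n, size);
    if (p == NULL) {
        fprintf(stderr, "out of memory\n");
        exit(EXIT_FAILURE);
    }
    return p;
}

/* Allocates the vectors and an input x output weight matrix, all zeroed. */
void makePerceptron(PERCEPTRON *net, int i, int h, int o)
{
    assert(net != NULL && i > 0 && h >= 0 && o > 0);
    net->input = i;
    net->hidden = h;    /* not used by a single layer perceptron */
    net->output = o;
    net->lrate = 0.1;   /* placeholder value, an assumption */
    net->netin = xcalloc((size_t)i, sizeof *net->netin);
    net->netout = xcalloc((size_t)o, sizeof *net->netout);
    net->iweights = xcalloc((size_t)i, sizeof *net->iweights);
    for (int r = 0; r < i; r++)
        net->iweights[r] = xcalloc((size_t)o, sizeof **net->iweights);
}

/* Hypothetical companion, not in the guidelines: releases the memory. */
void freePerceptron(PERCEPTRON *net)
{
    assert(net != NULL);
    for (int r = 0; r < net->input; r++)
        free(net->iweights[r]);
    free(net->iweights);
    free(net->netin);
    free(net->netout);
}
```

Allocating one row per input node keeps the `iweights[i][o]` indexing that the `double **` member suggests; a single contiguous block would also work and is a matter of preference.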

C++ and Java Guidelines


Define a class in which to store the network. Use the structure given in the C guidelines as the basis for the class, and use the functions discussed in those guidelines as methods within the class. Can you think of anything that is missing?

Exercise 2: Logic Gates


Test your perceptron using simple logic gates. Try using simple AND and OR gates to see if your perceptron code works. Try training your network with the data, and then saving the state of the network so that it can be reloaded ready for use without requiring any training.


3 Multilayer Perceptrons

The principal weakness of the perceptron was that it could only solve problems that were linearly separable. To illustrate this point we will discuss the problem of modelling simple logic gates using this architecture. Consider modelling a simple AND gate.

Figure 4: AND gate

Table 1 and Figure 4 illustrate the relationship between input and output required to model a simple AND gate. Figure 4 shows the spatial disposition of the input data. It can be seen that it is possible to draw a straight line between the co-ordinates of the input values that require an output of 1 and those that require an output of 0. This problem is thus linearly separable. The simple Perceptron, based on units with a threshold activation function, could only solve problems that were linearly separable. Many of the more challenging problems in AI are not linearly separable, however, and thus the Perceptron was discovered to have a crucial weakness. Returning to the problem of modelling logic gates, the exclusive-or (XOR) problem is in fact not linearly separable. Consider Figure 5, which shows the arrangement of the patterns; a curve is required to separate them. A possible solution would be to use a bilinear solution, as shown in Figure 6. To obtain a bilinear solution we could add another layer of weights to the simple perceptron model, but that brings the problem of assessing what happens in the middle layer. For a simple task such as the XOR problem, we could fairly easily work out what the expected outputs for the middle layer of units should be, but finding a solution that would be completely automated would be incredibly difficult. The essence of supervised neural network training is to map input to a corresponding output, and adding an additional layer of weights makes this impossible when using the threshold function given in Equation 2.


Figure 5: Nonlinearity of XOR gate

Figure 6: Nonlinearity of XOR gate
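The difference between the AND and XOR cases can be made concrete with a small experiment: try many weight settings for a single threshold unit and count how many reproduce the whole truth table. The sketch below is illustrative only. The grid of candidate weights (from -2 to +2 in steps of 0.25) and the inclusion of a bias weight are assumptions, and a grid search is a demonstration rather than a proof, although for XOR the result agrees with the fact that no straight line separates the patterns.

```c
#include <assert.h>

/* Counts the weight settings (w0, w1, bias) on a coarse grid for which a
   single threshold unit reproduces all four rows of a two-input truth
   table. targets[] is ordered (0,0), (0,1), (1,0), (1,1). */
int count_solutions(const int targets[4])
{
    static const int in[4][2] = { {0, 0}, {0, 1}, {1, 0}, {1, 1} };
    int found = 0;
    assert(targets != 0);
    /* Each weight runs from -2 to +2 in steps of 0.25 (17 values). */
    for (int a = -8; a <= 8; a++)
        for (int b = -8; b <= 8; b++)
            for (int c = -8; c <= 8; c++) {
                double w0 = a * 0.25, w1 = b * 0.25, bias = c * 0.25;
                int ok = 1;
                for (int p = 0; p < 4 && ok; p++) {
                    double net = w0 * in[p][0] + w1 * in[p][1] + bias;
                    ok = ((net > 0.0) == (targets[p] == 1));
                }
                found += ok;
            }
    return found;
}
```

Calling `count_solutions` with the AND targets {0, 0, 0, 1} finds working weight settings, while the XOR targets {0, 1, 1, 0} yield none.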

A better solution to the problem of learning weights is to use standard optimisation techniques. In this case we identify an error function which is expressed in terms of the neural network output. The goal of the network then becomes to find the values for the weights such that the error function is at its minimum value. Gradient descent techniques can then be used to determine the impact of the weights on the value of the error function. We need to have an error function that is differentiable, which means it should be continuous. The threshold function is not continuous, and so is unsuitable. A function that works in a similar way to the threshold function, but that is differentiable, is the Logistic Sigmoid Function, given by the following equation.
$$f_{lsf}(x) = \frac{1}{1 + e^{-\beta x}} \tag{9}$$

Figure 7: Logistic Sigmoid Function Profile (curves for bias = 1 to 9, x from -1.5 to 1.5, output from 0 to 1)

Where $\beta$ refers to a bias term which determines the steepness of the slope of the function. This function when viewed in profile behaves in a very similar way to the threshold function, with x values above zero tending to unity, and values below zero tending to zero. This function is continuous, and it can be shown that its derivative is as follows (the proof will be provided in an appendix).

$$f'_{lsf}(x) = \beta f_{lsf}(x)(1 - f_{lsf}(x)) \tag{10}$$

For $\beta = 1$ this reduces to $f_{lsf}(x)(1 - f_{lsf}(x))$.

Because the function is differentiable, it is possible to develop a means of adjusting the weights in a perceptron over as many layers as may be necessary.


4 The MLP Learning Algorithm


The basic MLP learning algorithm is outlined below. This is what you should attempt to implement.

1. Initialise the network, with all weights set to random numbers between -1 and +1.
2. Present the first training pattern, and obtain the output.
3. Compare the network output with the target output.
4. Propagate the error backwards.
(a) Correct the output layer of weights using the following formula.

$$w_{ho} = w_{ho} + (\eta \delta_o o_h) \tag{11}$$

where $w_{ho}$ is the weight connecting hidden unit h with output unit o, $\eta$ is the learning rate, and $o_h$ is the output at hidden unit h. $\delta_o$ is given by the following.

$$\delta_o = o_o (1 - o_o)(t_o - o_o) \tag{12}$$

where $o_o$ is the output at node o of the output layer, and $t_o$ is the target output for that node.
(b) Correct the input weights using the following formula.

$$w_{ih} = w_{ih} + (\eta \delta_h o_i) \tag{13}$$

where $w_{ih}$ is the weight connecting node i of the input layer with node h of the hidden layer, $o_i$ is the input at node i of the input layer, and $\eta$ is the learning rate. $\delta_h$ is calculated as follows.

$$\delta_h = o_h (1 - o_h) \sum_{o} (\delta_o w_{ho}) \tag{14}$$

5. Calculate the error from the difference between the target and the output vector. For example, the following sum of squared differences could be used.

$$E = \sum_{o=1}^{p} (t_o - o_o)^2 \tag{15}$$

where p is the number of units in the output layer.
6. Repeat from 2 for each pattern in the training set to complete one epoch.
7. Shuffle the training set randomly. This is important so as to prevent the network being influenced by the order of the data.
8. Repeat from step 2 for a set number of epochs, or until the error ceases to change.

Exercise 3
Take the code written for the Perceptron in Exercise 1, and adapt it to create a Multilayer perceptron. Add an additional matrix for the output layer of weights, and use the logistic sigmoid function as the activation function within the units. Adapt and add functions in order to manage the increased complexity of the different architecture. Test your network by attempting to solve the XOR problem.


Exercise 4
Attempt to include a momentum term and a weight decay term into the basic MLP architecture developed as part of Exercise 3.
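The tutorial does not define either term, so the following is only one common way of writing them, offered as a starting point. The momentum term adds a fraction of the previous weight change to the current one, and the weight decay term pulls each weight slightly towards zero on every update. The function name, the parameter names `momentum` and `decay`, and the exact form of the update are assumptions.

```c
#include <assert.h>

/* One weight update with momentum and weight decay (assumed forms):
     delta(t) = lrate * grad_step + momentum * delta(t-1) - decay * w
     w        = w + delta(t)
   grad_step is the basic correction for this weight, for example the
   error term of the unit multiplied by the output feeding the weight.
   prev_delta holds delta(t-1) on entry and delta(t) on return. */
double update_weight(double w, double grad_step, double *prev_delta,
                     double lrate, double momentum, double decay)
{
    assert(prev_delta != 0);
    double delta = lrate * grad_step + momentum * (*prev_delta) - decay * w;
    *prev_delta = delta;
    return w + delta;
}
```

One `prev_delta` value has to be stored per weight, so in practice each weight matrix gains a companion matrix of the same shape, initialised to zero.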

5 Design Hints

5.1 Activation Units

The activation units you will be using will be the logistic sigmoid function, defined as follows.

$$f(\text{input}) = \frac{1}{1 + e^{-\text{input}}} \tag{16}$$

This function has the derivative

$$f'(\text{input}) = f(\text{input})(1 - f(\text{input})) \tag{17}$$

Both of these functions should be easy to implement in any programming language. Advanced features of the logistic sigmoid function include the use of a term $\beta$ in order to determine the steepness of the linear part of the sigmoid function. The term $\beta$ is introduced into Equation 16 as follows.

$$f(\text{input}) = \frac{1}{1 + e^{-\beta \, \text{input}}} \tag{18}$$

Adjusting the value of $\beta$ to values of less than unity makes the slope shallower, with the effect that the output will be less clear (more numbers around the middle range of the graph, rather than clear indications of firing or not firing). Shallower slopes are useful in interpolation problems.

Testing and Evaluation

Try generating some of your own datasets (using an Excel spreadsheet, say), some of which are linearly separable, and some of which are not. Test any neural network you make by submitting these data sets to your program and seeing if they will work.

