6 Genetic Algorithms
Genetic Algorithms (GAs) are computer simulations that evolve a population of chromosomes, with the goal of producing at least some very fit individuals. Fitness is specified by a fitness function that rates each individual in the population. Setting up a GA simulation is fairly easy: we need to represent (or encode) the state of a system in a chromosome that is usually implemented as a set of bits. GA is basically a search operation: searching for a good solution to a problem where the solution is a very fit chromosome. The programming technique of using GAs is useful for AI systems that must adapt to changing conditions, because re-programming can be as simple as defining a new fitness function and re-running the simulation. An advantage of GAs is that the search process will not often get stuck in a local optimum, because the genetic crossover process produces radically different chromosomes in new generations, while occasional mutations (flipping a random bit in a chromosome) cause small changes. Another aspect of GAs is support for the evolutionary concept of survival of the fittest: by using the fitness function we preferentially breed chromosomes with higher fitness values.

It is interesting to compare how GAs are trained with how we train neural networks (Chapter 7). In both cases we need to manually supervise the training process: for GAs we need to supply a fitness function, and for the two neural network models used in Chapter 7 we need to supply training data with desired sample outputs for sample inputs.
6.1 Theory
GAs are typically used to search very large and possibly very high dimensional search spaces. If we want to find a solution as a single point in an N-dimensional space where a fitness function has a near maximum value, then we have N parameters to encode in each chromosome. In this chapter we will be solving a simple problem that is one-dimensional, so we only need to encode a single number (a floating point number for this example) in each chromosome. Using a GA toolkit, like the one developed in Section 6.2, requires two problem-specific customizations:

- Characterize the search space by a set of parameters that can be encoded in a chromosome (more on this later). GAs work with the coding of a parameter set, not the parameters themselves (Genetic Algorithms in Search, Optimization, and Machine Learning, David Goldberg, 1989).
- Provide a numeric fitness function that allows us to rate the fitness of each chromosome in a population. We will use these fitness values to determine which chromosomes in the population are most likely to survive and reproduce using genetic crossover and mutation operations.

The GA toolkit developed in this chapter treats each gene as a single bit; while you can consider a gene to be an arbitrary data structure, the approach of using single-bit genes and specifying the number of genes (or bits) in a chromosome is very flexible. A population is a set of chromosomes. A generation is defined as one reproductive cycle of replacing some elements of the chromosome population with new chromosomes produced by a genetic crossover operation, followed by optionally mutating a few chromosomes in the population.

We will describe a simple example problem in this section, write a general purpose library in Section 6.2, and finish the chapter in Section 6.3 by solving the problem posed in this section. For a sample problem, suppose that we want to find the maximum value of the function F with one independent variable x in Equation 6.1, as seen in Figure 6.1:

    F(x) = sin(x) * sin(0.4 * x) * sin(3 * x)    (6.1)

Figure 6.1: The test function evaluated over the interval [0.0, 10.0]. The maximum value of 0.56 occurs at x=3.8
The problem that we want to solve is finding a value of x that yields a value of F(x) as near as possible to its maximum. To be clear: we encode a floating point number as a chromosome made up of a specific number of bits, so any chromosome with randomly set bits will represent some random number in the interval [0, 10]. The fitness function is simply the function in Equation 6.1.
Figure 6.2: Crossover operation

Figure 6.2 shows an example of a crossover operation. A random chromosome bit index is chosen, the two chromosomes are cut at this index, and the cut parts are swapped. The two original chromosomes in generation n are shown on the left of the figure; after the crossover operation they produce two new chromosomes in generation n+1, shown on the right of the figure. In addition to using crossover operations to create new chromosomes from existing chromosomes, we will also use genetic mutation: randomly flipping bits in chromosomes. A fitness function that rates the fitness value of each chromosome allows us to decide which chromosomes to discard and which to use for the next generation: we will use the most fit chromosomes in the population for producing the next generation using crossover and mutation. We will implement a general purpose Java GA library in the next section and then solve the example problem posed in this section in Section 6.3.
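As a concrete example (with hypothetical bit values): for 10-bit chromosomes and a randomly chosen cut index of 4, the parents 1011|010110 and 0100|111001 would produce the offspring 1011|111001 and 0100|010110. Each child keeps one parent's bits before the cut point and the other parent's bits after it.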
6.2 Java Library for Genetic Algorithms
The class Genetic has two constructors:

  public Genetic(int num_genes_per_chromosome,
                 int num_chromosomes)
  public Genetic(int num_genes_per_chromosome,
                 int num_chromosomes,
                 float crossover_fraction,
                 float mutation_fraction)

The method sort is used to sort the population of chromosomes in most fit first order. The methods getGene and setGene are used to fetch and change the value of any gene (bit) in any chromosome. These methods are protected, but you will probably not need to override them in derived classes.

  protected void sort()
  protected boolean getGene(int chromosome, int gene)
  protected void setGene(int chromosome, int gene, int value)
  protected void setGene(int chromosome, int gene, boolean value)

The methods evolve, doCrossovers, doMutations, and doRemoveDuplicates are utilities for running GA simulations. These methods are also protected, and you will probably not need to override them in derived classes.

  protected void evolve()
  protected void doCrossovers()
  protected void doMutations()
  protected void doRemoveDuplicates()
When you subclass class Genetic you must implement the following abstract method calcFitness that will determine the evolution of chromosomes during the GA simulation:

  // Implement the following method in sub-classes:
  abstract public void calcFitness();
  }
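To make the API concrete, here is a minimal sketch of a Genetic subclass. The class name is a placeholder and the fitness calculation (counting the on bits) is a stand-in for a real, problem-specific fitness function; the field names follow the usage shown in Section 6.3:

  // Hypothetical skeleton of a Genetic subclass:
  class MyGenetic extends Genetic {
    MyGenetic(int num_genes, int num_chromosomes,
              float crossover_fraction, float mutation_fraction) {
      super(num_genes, num_chromosomes,
            crossover_fraction, mutation_fraction);
    }
    public void calcFitness() {
      for (int i = 0; i < numChromosomes; i++) {
        // Placeholder fitness: just count the on bits.
        // A real application decodes the bits and rates the result:
        float fitness = 0.0f;
        for (int g = 0; g < numGenesPerChromosome; g++) {
          if (getGene(i, g)) fitness += 1.0f;
        }
        chromosomes.get(i).setFitness(fitness);
      }
    }
  }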
The class Chromosome represents a bit set with a specified number of bits and a floating point fitness value.
  class Chromosome {
    private Chromosome()
    public Chromosome(int num_genes)
    public boolean getBit(int index)
    public void setBit(int index, boolean value)
    public float getFitness()
    public void setFitness(float value)
    public boolean equals(Chromosome c)
  }
The class ChromosomeComparator implements the Comparator interface and is application specific: it is used to sort a population in best first order:
  class ChromosomeComparator
         implements Comparator<Chromosome> {
    public int compare(Chromosome o1, Chromosome o2)
  }

The class ChromosomeComparator is used with the Java Collections class static sort method (a minimal possible compare body is sketched below, after the list of library behaviors). The class Genetic is an abstract class: you must subclass it and implement the method calcFitness that uses an application-specific fitness function (that you must supply) to set a fitness value for each chromosome. This GA library provides the following behavior:

- Generates an initial random population with a specified number of bits (or genes) per chromosome and a specified number of chromosomes in the population
- Ability to evaluate each chromosome based on a numeric fitness function
- Ability to create new chromosomes from the most fit chromosomes in the population using the genetic crossover and mutation operations

The two class constructors for Genetic set up a new GA experiment by setting the number of genes (or bits) per chromosome and the number of chromosomes in the population.
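The body of compare only needs to order chromosomes by descending fitness. A minimal sketch (an assumption, not the library's verbatim code):

  public int compare(Chromosome o1, Chromosome o2) {
    // higher fitness sorts earlier (best first order):
    return Float.compare(o2.getFitness(), o1.getFitness());
  }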
The Genetic class constructors also build an array of integers, rouletteWheel, which is used to weight the most fit chromosomes in the population when choosing the parents of crossover and mutation operations. When a chromosome is being chosen, a random integer is selected to be used as an index into the rouletteWheel array; the values in the array are all integer indices into the chromosome array. More fit chromosomes are heavily weighted in favor of being chosen as parents for the crossover operations (one possible weighting scheme is sketched after the code below).

The algorithm for the crossover operation is fairly simple; here is the implementation:

  public void doCrossovers() {
    int num = (int)(numChromosomes * crossoverFraction);
    for (int i = num - 1; i >= 0; i--) {
      // Don't overwrite the "best" chromosome
      // from current generation:
      int c1 = 1 + (int) ((rouletteWheelSize - 1) *
                          Math.random() * 0.9999f);
      int c2 = 1 + (int) ((rouletteWheelSize - 1) *
                          Math.random() * 0.9999f);
      c1 = rouletteWheel[c1];
      c2 = rouletteWheel[c2];
      if (c1 != c2) {
        int locus = 1 + (int)((numGenesPerChromosome - 2) *
                              Math.random());
        for (int g = 0; g < numGenesPerChromosome; g++) {
          if (g < locus) {
            setGene(i, g, getGene(c1, g));
          } else {
            setGene(i, g, getGene(c2, g));
          }
        }
      }
    }
  }

The method doMutations is similar to doCrossovers: we randomly choose chromosomes from the population, and for these selected chromosomes we randomly flip the value of one gene (a gene is a bit in our implementation):

  public void doMutations() {
    int num = (int)(numChromosomes * mutationFraction);
    for (int i = 0; i < num; i++) {
      // Don't overwrite the "best" chromosome
      // from current generation:
      int c = 1 + (int) ((numChromosomes - 1) *
                         Math.random() * 0.99);
      int g = (int) (numGenesPerChromosome *
                     Math.random() * 0.99);
      setGene(c, g, !getGene(c, g));
    }
  }
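For reference, here is one way the rouletteWheel array described above could be populated. This is a hedged sketch of a rank-based weighting, not the library's verbatim code; it only illustrates the idea that a chromosome's share of wheel slots determines how often it is picked as a parent. With numBest = 10 the wheel has 10 + 9 + ... + 1 = 55 slots, which matches the "count of slots in roulette wheel=55" line in the sample output of Section 6.3, but the library's actual scheme may differ:

  // Sketch: give the numBest fittest chromosomes (sorted best first)
  // a number of slots proportional to their rank:
  int numBest = 10;                        // assumed parameter
  int size = numBest * (numBest + 1) / 2;  // 55 slots for numBest = 10
  int[] rouletteWheel = new int[size];
  int slot = 0;
  for (int i = 0; i < numBest; i++) {
    for (int j = 0; j < numBest - i; j++) {
      rouletteWheel[slot++] = i;  // chromosome i gets numBest - i slots
    }
  }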
We developed a general purpose library in this section for simulating populations of chromosomes that can evolve to a more fit population, given a fitness function that ranks individual chromosomes in order of fitness. In Section 6.3 we will develop an example GA application by defining the size of a population and using the fitness function defined by Equation 6.1.
6.3 Finding the Maximum Value of a Function
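The derived class TestGenetic needs a method geneToFloat that decodes a chromosome's bits into a floating point number; only the end of that method is shown below, so here is a sketch of the complete method, reconstructed from the surrounding description (an assumption, not the book's verbatim listing):

  // Sum 2^j for every "on" bit j in the indexed chromosome, then
  // scale the resulting integer in [0, 1023] to roughly [0, 10]:
  float geneToFloat(int chromosomeIndex) {
    int base = 1;
    float x = 0;
    for (int j = 0; j < numGenesPerChromosome; j++) {
      if (getGene(chromosomeIndex, j)) {
        x += base;
      }
      base *= 2;
    }
    x /= 102.4f;
    return x;
  }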
After summing up all of the on bits times their base-2 values, we need to normalize what is an integer in the range [0, 1023] to a floating point number in the approximate range [0, 10]:

    x /= 102.4f;
    return x;
  }

Note that we do not need the reverse method! We use the GA library from Section 6.2 to create a population of 10-bit chromosomes; in order to evaluate the fitness of each chromosome in a population, we only have to convert the 10-bit representation to a floating point number for evaluation using the following fitness function (Equation 6.1):

  private float fitness(float x) {
    return (float)(Math.sin(x) *
                   Math.sin(0.4f * x) *
                   Math.sin(3.0f * x));
  }

Table 6.1 shows some sample random chromosomes and the floating point numbers that they encode. The first column shows the gene indices where the bit is on, the second column shows the chromosome as an integer number represented in binary notation, and the third column shows the floating point number that the chromosome encodes. The center column in Table 6.1 shows the bits in order where index 0 is the left-most bit and index 9 is the right-most bit; this is the reverse of the normal order for encoding integers, but the GA does not care: it works with any encoding we use. Once again, GAs work with encodings.

  On bits in chromosome    As binary     Number encoded
  2, 5, 7, 8, 9            0010010111    9.1015625
  0, 1, 3, 5, 6            1101011000    1.0449219
  0, 3, 5, 6, 7, 8         1001011110    4.7753906

Table 6.1: Random chromosomes and the floating point numbers that they encode
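As a check on the encoding, take the first row of Table 6.1: the on bits are at indices 2, 5, 7, 8, and 9, so the chromosome encodes (2^2 + 2^5 + 2^7 + 2^8 + 2^9) / 102.4 = 932 / 102.4 = 9.1015625.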
Using the methods geneToFloat and fitness we now implement the abstract method calcFitness from our GA library class Genetic, so that the derived class TestGenetic
is not abstract. This method has the responsibility for calculating and setting the fitness value for every chromosome stored in an instance of class Genetic:

  public void calcFitness() {
    for (int i = 0; i < numChromosomes; i++) {
      float x = geneToFloat(i);
      chromosomes.get(i).setFitness(fitness(x));
    }
  }

While it was useful to make this example clearer with a separate geneToFloat method, it would have also been reasonable to simply place the formula in the method fitness in the implementation of the abstract (in the base class) method calcFitness. In any case, we are done with coding this example: you can compile the two example Java files Genetic.java and TestGenetic.java, and run the TestGenetic class to verify that the example program quickly finds a near maximum value for this function. You can try setting different numbers of chromosomes in the population, and try setting non-default values like a crossover rate of 0.85 and a mutation rate of 0.3. We will look at a run with a small number of chromosomes in the population created with:

  genetic_experiment = new MyGenetic(10, 20, 0.85f, 0.3f);
  int NUM_CYCLES = 500;
  for (int i = 0; i < NUM_CYCLES; i++) {
    genetic_experiment.evolve();
    if ((i % (NUM_CYCLES / 5)) == 0 || i == (NUM_CYCLES - 1)) {
      System.out.println("Generation " + i);
      genetic_experiment.print();
    }
  }

In this experiment 85% of the chromosomes will be sliced and diced with a crossover operation and 30% will have one of their genes changed. We specified 10 bits per chromosome and a population size of 20 chromosomes. In this example, I have run 500 evolutionary cycles. After you determine a fitness function to use, you will probably need to experiment with the size of the population and the crossover and mutation rates. Since the simulation uses random numbers (and is thus nondeterministic), you can get different results by simply rerunning the simulation. Here is example program output (with much of the output removed for brevity):

  count of slots in roulette wheel=55
  Generation 0
  Fitness for chromosome 0 is 0.505, occurs at x=7.960
  Fitness for chromosome 1 is 0.461, occurs at x=3.945
  Fitness for chromosome 2 is 0.374, occurs at x=7.211
  Fitness for chromosome 3 is 0.304, occurs at x=3.929
  Fitness for chromosome 4 is 0.231, occurs at x=5.375
  ...
  Fitness for chromosome 18 is -0.282, occurs at x=1.265
  Fitness for chromosome 19 is -0.495, occurs at x=5.281
  Average fitness=0.090 and best fitness for this generation:0.505
  ...
  Generation 499
  Fitness for chromosome 0 is 0.561, occurs at x=3.812
  Fitness for chromosome 1 is 0.559, occurs at x=3.703
  ...

This example is simple, but it is intended to show you how to encode parameters for a problem where you want to search for values that maximize a fitness function that you specify. Using the library developed in this chapter you should be able to set up and run a GA simulation for your own applications.
7 Neural Networks
Neural networks can be used to efficiently solve many problems that are intractable or difficult to solve using other AI programming techniques. I spent almost two years on a DARPA neural network tools advisory panel, wrote the first version of the ANSim neural network product, and have used neural networks for a wide range of application problems (radar interpretation, bomb detection, and as controllers in computer games). Mastering the use of simulated neural networks will allow you to solve many types of problems that are very difficult to solve using other methods.

Although most of this book is intended to provide practical advice (with some theoretical background) on using AI programming techniques, I cannot imagine being interested in practical AI programming without also wanting to think about the philosophy and mechanics of how the human mind works. I hope that my readers share this interest.

In this book, we have examined techniques for focused problem solving, concentrating on performing one task at a time. However, the physical structure and dynamics of the human brain are inherently parallel and distributed [Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Rumelhart, McClelland, et al. 1986]. We are experts at doing many things at once. For example, I can simultaneously walk, talk with my wife, keep our puppy out of cactus, and enjoy the scenery behind our house in Sedona, Arizona. AI software systems struggle to perform even narrowly defined tasks well, so how is it that we are able to simultaneously perform several complex tasks? There is no clear or certain answer to this question at this time, but certainly the distributed neural architecture of our brains is a requirement for our abilities. Unfortunately, artificial neural network simulations do not currently address multi-tasking (other techniques that do address this issue are multi-agent systems with some form of mediation between agents).

Also interesting is the distinction between instinctual behavior and learned behavior. Our knowledge of GAs from Chapter 6 provides a clue to how the brains of especially lower order animals can be hardwired to provide efficient instinctual behavior under the pressures of evolutionary forces (i.e., likely survival of more fit individuals). This works by using genetic algorithms to design specific neural wiring. I have used genetic algorithms to evolve recurrent neural networks for control applications. This work only had partial success, but it did convince me that biological genetic pressure is probably adequate to pre-wire some forms of behavior in natural (biological) neural networks.
Figure 7.1: Physical structure of a neuron

While we will study supervised learning techniques in this chapter, it is possible to evolve both the structure and attributes of neural networks using other types of neural network models, like Adaptive Resonance Theory (ART), which autonomously learn to classify learning examples without intervention.

We will start this chapter by discussing human neuron cells and the features of real neurons that we will model. Unfortunately, we do not yet understand all of the biochemical processes that occur in neurons, but there are fairly accurate models available (web search "neuron biochemical"). Neurons are surrounded by thin hair-like structures called dendrites, which serve to accept activation from other neurons. Neurons sum up activation from their dendrites, and each neuron has a threshold value; if the activation summed over all incoming dendrites exceeds this threshold, then the neuron fires, spreading its activation to other neurons. Dendrites are very localized around a neuron. Output from a neuron is carried by an axon, which is thicker than dendrites and potentially much longer than dendrites in order to affect remote neurons. Figure 7.1 shows the physical structure of a neuron; in general, the neuron's axon would be much longer than is seen in Figure 7.1. The axon terminal buttons transfer activation to the dendrites of neurons that are close to the individual button. An individual neuron is connected to up to ten thousand other neurons in this way. The activation absorbed through dendrites is summed together, but the firing of a neuron only occurs when a threshold is passed.
7.2 Java Classes for Hopfield Neural Networks

Hopfield neural networks are very different than back propagation networks (covered later in Section 7.4) because the training data only contains input examples, unlike back propagation networks that are trained to associate desired output patterns with input patterns. Internally, the operation of Hopfield neural networks is also very different than that of back propagation networks. We use Hopfield neural networks to introduce the subject of neural nets because they are very easy to simulate with a program, and they can also be very useful in practical applications.

The inputs to Hopfield networks can be of any dimensionality. Hopfield networks are often shown with a two-dimensional input field and are demonstrated recognizing characters, pictures of faces, etc. However, we will lose no generality by implementing a Hopfield neural network toolkit with one-dimensional inputs, because a two-dimensional image can be represented by an equivalent one-dimensional array.

How do Hopfield networks work? A simple analogy will help. The trained connection weights in a neural network represent a high dimensional space. This space is folded and convoluted, with local minima representing areas around training input patterns. For a moment, visualize this very high dimensional space as just being the three-dimensional space inside a room. The floor of this room is a convoluted and curved surface. If you pick up a basketball and bounce it around the room, it will settle at a low point in this curved and convoluted floor. Now, consider that the space of input values is a two-dimensional grid a foot above the floor. For any new input, that is equivalent to a point defined in horizontal coordinates; if we drop our basketball from a position above an input grid point, the basketball will tend to roll downhill into a local gravitational minimum. The shape of the curved and convoluted floor is a calculated function of a set of training input vectors. After the floor has been trained with a set of input vectors, the operation of dropping the basketball from an input grid point is equivalent to mapping a new input into the training example that is closest to this new input, using a neural network.

A common technique in training and using neural networks is to add noise to training data and weights. In the basketball analogy, this is equivalent to shaking the room so that the basketball finds a good minimum to settle into, and not a non-optimal local minimum. We use this technique later when implementing back propagation networks. The weights of back propagation networks are also best visualized as defining a very high dimensional space with a manifold that is very convoluted near areas of local minima. These local minima are centered near the coordinates defined by each input vector.
…algorithms for storing and recall of patterns at the same time. In a Hopfield neural network simulation, every neuron is connected to every other neuron. Consider a pair of neurons indexed by i and j. There is a weight W[i,j] between these neurons that corresponds in the code to the array element weight[i,j]. We can define the energy between the associations of these two neurons as:

    energy[i, j] = weight[i, j] * activation[i] * activation[j]

In the Hopfield neural network simulator, we store activations (i.e., the input values) as floating point numbers that get clamped in value to -1 (for off) or +1 (for on). In the energy equation, we consider an activation that is not clamped to a value of one to be zero. This energy is like gravitational potential energy in a basketball court analogy: think of a basketball court with an overlaid 2D grid, where different grid cells on the floor are at different heights (representing energy levels); as you throw a basketball onto the court, the ball naturally bounces around and finally stops in a location near to the place you threw the ball, in a low grid cell in the floor; that is, it settles in a locally low energy level. Hopfield networks function in much the same way: when shown a pattern, the network attempts to settle in a local minimum energy point as defined by a previously seen training example. When training a network with a new input, we are looking for a low energy point near the new input vector. The total energy is the sum of the above equation over all pairs (i, j).

The class constructor allocates storage for input values, temporary storage, and a two-dimensional array to store weights:

  public Hopfield(int numInputs) {
    this.numInputs = numInputs;
    weights = new float[numInputs][numInputs];
    inputCells = new float[numInputs];
    tempStorage = new float[numInputs];
  }

Remember that this model is general purpose: multi-dimensional inputs can be converted to an equivalent one-dimensional array. The method addTrainingData is used to store an input data array for later training. All input values get clamped to an off or on value by the utility method adjustInput. The utility method truncate truncates floating point values to an integer value. The utility method deltaEnergy has one argument: an index into the input vector. The class variable tempStorage is set during training to be the sum of a row of trained weights. So, the method deltaEnergy returns a measure of the energy difference between the input vector in the current input cells and the training input examples:
7.2 Java Classes for Hopeld Neural Networks float temp = 0.0f; for (int j=0; j<numInputs; j++) { temp += weights[index][j] * inputCells[j]; } return 2.0f * temp - tempStorage[index];
The method train is used to set the two-dimensional weights array and the one-dimensional tempStorage array, in which each element is the sum of the corresponding row in the two-dimensional weights array:

  public void train() {
    for (int j = 1; j < numInputs; j++) {
      for (int i = 0; i < j; i++) {
        for (int n = 0; n < trainingData.size(); n++) {
          float [] data = (float [])trainingData.elementAt(n);
          float temp1 =
              adjustInput(data[i]) * adjustInput(data[j]);
          float temp = truncate(temp1 + weights[j][i]);
          weights[i][j] = weights[j][i] = temp;
        }
      }
    }
    for (int i = 0; i < numInputs; i++) {
      tempStorage[i] = 0.0f;
      for (int j = 0; j < i; j++) {
        tempStorage[i] += weights[i][j];
      }
    }
  }

Once the arrays weights and tempStorage are defined, it is simple to recall an original input pattern from a similar test pattern:

  public float [] recall(float [] pattern, int numIterations) {
    for (int i = 0; i < numInputs; i++) {
      inputCells[i] = pattern[i];
    }
    for (int ii = 0; ii < numIterations; ii++) {
      for (int i = 0; i < numInputs; i++) {
        if (deltaEnergy(i) > 0.0f) {
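The remainder of recall (cut off above) applies the standard Hopfield update rule implied by deltaEnergy: set a cell on when its energy delta is positive, and off otherwise. A sketch of how the method likely concludes (an assumption consistent with the code above, not the book's verbatim listing):

          inputCells[i] = 1.0f;   // positive energy delta: cell on
        } else {
          inputCells[i] = -1.0f;  // otherwise: cell off
        }
      }
    }
    return inputCells;
  }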
7.3 Testing the Hopfield Neural Network Class

…pattern. The version of the method helper included in the ZIP file for this book is slightly different in that two bits are randomly flipped (we will later look at sample output with both one and two bits randomly flipped).

  private static void helper(Hopfield test, String s,
                             float [] test_data) {
    float [] dd = new float[10];
    for (int i = 0; i < 10; i++) {
      dd[i] = test_data[i];
    }
    int index = (int)(9.0f * (float)Math.random());
    if (dd[index] < 0.0f) dd[index] = 1.0f;
    else dd[index] = -1.0f;
    float [] rr = test.recall(dd, 5);
    System.out.print(s + "\nOriginal data: ");
    for (int i = 0; i < 10; i++)
      System.out.print(pp(test_data[i]) + " ");
    System.out.print("\nRandomized data: ");
    for (int i = 0; i < 10; i++)
      System.out.print(pp(dd[i]) + " ");
    System.out.print("\nRecognized pattern: ");
    for (int i = 0; i < 10; i++)
      System.out.print(pp(rr[i]) + " ");
    System.out.println();
  }

The following listing shows how to run the program and lists the example output:

  java Test_Hopfield
  pattern 0
  Original data:      1 1 1 0 0 0 0 0 0 0
  Randomized data:    1 1 1 0 0 0 1 0 0 0
  Recognized pattern: 1 1 1 0 0 0 0 0 0 0
  pattern 1
  Original data:      0 0 0 1 1 1 0 0 0 0
  Randomized data:    1 0 0 1 1 1 0 0 0 0
  Recognized pattern: 0 0 0 1 1 1 0 0 0 0
  pattern 2
  Original data:      0 0 0 0 0 0 0 1 1 1
  Randomized data:    0 0 0 1 0 0 0 1 1 1
  Recognized pattern: 0 0 0 0 0 0 0 1 1 1
In this listing we see that the three sample training patterns in Test_Hopfield.java are re-created after scrambling the data by changing one randomly chosen value to
its opposite value. When you run the test program several times you will see occasional errors when one random bit is flipped, and you will see errors occur more often with two bits flipped. Here is an example with two bits flipped per test: the first pattern is incorrectly reconstructed, while the second and third patterns are reconstructed correctly:

  pattern 0
  Original data:      1 1 1 0 0 0 0 0 0 0
  Randomized data:    0 1 1 0 1 0 0 0 0 0
  Recognized pattern: 1 1 1 1 1 1 1 0 0 0
  pattern 1
  Original data:      0 0 0 1 1 1 0 0 0 0
  Randomized data:    0 0 0 1 1 1 1 0 1 0
  Recognized pattern: 0 0 0 1 1 1 0 0 0 0
  pattern 2
  Original data:      0 0 0 0 0 0 0 1 1 1
  Randomized data:    0 0 0 0 0 0 1 1 0 1
  Recognized pattern: 0 0 0 0 0 0 0 1 1 1
7.4 Back Propagation Neural Networks
Figure 7.2: Two views of the same two-layer neural network; the view on the right shows the connection weights between the input and output layers as a two-dimensional array.
I am using the network in Figure 7.2 just to demonstrate layer-to-layer connections through a weights array. To calculate the activation of the first output neuron O1, we evaluate the sum of the products of the input neurons times the appropriate weight values; this sum is input to a Sigmoid activation function (see Figure 7.3) and the result is the new activation value for O1. Here are the formulas for the simple network in Figure 7.2:

    O1 = Sigmoid(I1 * W[1,1] + I2 * W[2,1])
    O2 = Sigmoid(I1 * W[1,2] + I2 * W[2,2])

Figure 7.3 shows a plot of the Sigmoid function and the derivative of the Sigmoid function (SigmoidP). We will use the derivative of the Sigmoid function when training a neural network (with at least one hidden neuron layer) with classified data examples; a standard implementation of both functions is sketched at the end of this passage.

A neural network like the one seen in Figure 7.2 is trained by using a set of training data. For back propagation networks, training data consists of matched sets of input values with matching desired output values. We want to train a network to not only produce the same outputs for training data inputs as appear in the training data, but also to generalize its pattern matching ability based on the training data, so it can match test patterns that are similar to training input patterns. A key here is to balance the size of the network against how much information it must hold. A common mistake when using back-prop networks is to use too large a network: a network that contains too many neurons and connections will simply memorize the training examples, including any noise in the training data.
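The Sigmoid function and its derivative referred to above are standard; a typical implementation looks like the following sketch (the book's library code may differ slightly in constants and signatures):

  // Logistic sigmoid, and its derivative expressed through the
  // sigmoid value s: d/dx sigmoid(x) = s * (1 - s).
  static float sigmoid(float x) {
    return 1.0f / (1.0f + (float) Math.exp(-x));
  }
  static float sigmoidP(float x) {
    float s = sigmoid(x);
    return s * (1.0f - s);
  }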
Figure 7.3: Sigmoid and derivative of the Sigmoid (SigmoidP) functions. This plot was produced by the file src-neural-networks/Graph.java.

However, if we use a smaller number of neurons with a very large number of training data examples, then we force the network to generalize, ignoring noise in the training data and learning to recognize important traits in input data while ignoring statistical noise.

How do we train a back propagation neural network given that we have a good training data set? The algorithm is quite easy; we will now walk through the simple case of a two-layer network like the one in Figure 7.2, and later in Section 7.5 we will review the algorithm in more detail when we have either one or two hidden neuron layers between the input and output layers. In order to train the network in Figure 7.2, we repeat the following learning cycle several times (expressed as code in the sketch that follows this discussion):

1. Zero out temporary arrays for holding the error at each neuron. The error, starting at the output layer, is the difference between the output value for a specific output layer neuron and the calculated value obtained by setting the input layer neurons' activation values to the input values in the current training example and letting activation spread through the network.

2. Update the weight W[i,j] (where i is the index of an input neuron and j is the index of an output neuron) using the formula W[i,j] += learning_rate * output_error[j] * I[i], where learning_rate is a tunable parameter, output_error[j] was calculated in step 1, and I[i] is the activation of the input neuron at index i.

This process is continued for either a maximum number of learning cycles or until the calculated output errors get very small. We will see later that the algorithm is similar but slightly more complicated when we have hidden neuron layers; the difference is that we will back propagate output errors to the hidden layers in order to estimate errors for hidden neurons. We will cover more on this later.
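Here is the two-step cycle above expressed as code for the two-layer case only. This is an illustrative sketch with hypothetical array names (it assumes the sigmoid helper sketched earlier), not the library code of Section 7.5:

  // One learning cycle for a two-layer network: inputs in I[],
  // desired outputs in desired[], weights in W[][]:
  static void trainOneCycle(float[] I, float[] desired,
                            float[][] W, float learningRate) {
    int numInputs = I.length;
    int numOutputs = desired.length;
    for (int j = 0; j < numOutputs; j++) {
      // Step 1: forward pass and output error for output neuron j:
      float sum = 0.0f;
      for (int i = 0; i < numInputs; i++) {
        sum += I[i] * W[i][j];
      }
      float outputError = desired[j] - sigmoid(sum);
      // Step 2: update every weight feeding output neuron j, using
      // W[i][j] += learning_rate * output_error[j] * I[i]:
      for (int i = 0; i < numInputs; i++) {
        W[i][j] += learningRate * outputError * I[i];
      }
    }
  }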
Figure 7.4: Capabilities of zero, one, and two hidden neuron layer neural networks. The grayed areas depict one of two possible output values based on two input neuron activation values. Note that this is a two-dimensional case for visualization purposes; if a network had ten input neurons instead of two, then these plots would have to be ten-dimensional instead of two-dimensional.

This type of neural network is too simple to solve very many interesting problems, and in practical applications we almost always use either one or two additional hidden neuron layers. Figure 7.4 shows the types of problems that can be solved by zero hidden layer, one hidden layer, and two hidden layer networks.
Figure 7.5: Example backpropagation neural network with one hidden layer.
Figure 7.6: Example backpropagation neural network with two hidden layers.
7.5 A Java Class Library for Back Propagation

…propagation neural networks and Hopfield neural networks, which we saw at the beginning of this chapter. The relevant files for the back propagation examples are:

- Neural_1H.java: contains a class for simulating a neural network with one hidden neuron layer
- Test_1H.java: a text-based test program for the class Neural_1H
- GUITest_1H.java: a GUI-based test program for the class Neural_1H
- Neural_2H.java: contains a class for simulating a neural network with two hidden neuron layers
- Neural_2H_momentum.java: contains a class for simulating a neural network with two hidden neuron layers that implements momentum learning (implemented in Section 7.6)
- Test_2H.java: a text-based test program for the class Neural_2H
- GUITest_2H.java: a GUI-based test program for the class Neural_2H
- GUITest_2H_momentum.java: a GUI-based test program for the class Neural_2H_momentum that uses momentum learning (implemented in Section 7.6)
- Plot1DPanel: a Java JFC graphics panel for plotting the values of a one-dimensional array of floating point values
- Plot2DPanel: a Java JFC graphics panel for plotting the values of a two-dimensional array of floating point values

The GUI files are for demonstration purposes only, and we will not discuss the code for these classes; if you are interested in the demo graphics code and do not know JFC Java programming, there are a few good JFC tutorials at the web site java.sun.com.

It is common to implement back-prop libraries that handle zero, one, or two hidden layers in the same code base. At the risk of having to repeat similar code in two different classes, I decided to make the Neural_1H and Neural_2H classes distinct. I think that this makes the code a little easier for you to understand. As a practical point, you will almost always start solving a neural network problem using only one hidden layer, and only progress to trying two hidden layers if you cannot train a one hidden layer network to solve the problem at hand with sufficiently small error when tested with data that is different than the original training data. One hidden layer networks require less storage space and run faster in simulation than two hidden layer networks. In this section we will only look at the implementation of the class Neural_2H
(class Neural_1H is simpler, and when you understand how Neural_2H works, the simpler class is easy to understand also). This class implements the Serializable interface and contains a utility method save to write a trained network to a disk file:

  class Neural_2H implements Serializable {

There is a static factory method that reads a saved network file from disk and builds an instance of Neural_2H, and there is a class constructor that builds a new untrained network in memory, given the number of neurons in each layer:

  public static Neural_2H Factory(String serialized_file_name)
  public Neural_2H(int num_in, int num_hidden1,
                   int num_hidden2, int num_output)

An instance of Neural_2H contains training data as transient data that is not saved by the method save:

  transient protected ArrayList inputTraining = new ArrayList();
  transient protected ArrayList outputTraining = new ArrayList();

I want the training examples to be native float arrays, so I used generic ArrayList containers. You will usually need to experiment with training parameters in order to solve difficult problems. The learning rate not only controls how large the weight corrections we make each learning cycle are, but this parameter also affects whether we can break out of local minima. Other parameters that affect learning are the ranges of the initial random weight values that are hardwired in the method randomizeWeights() and the small random values added to weights during the training cycles; these values are set in slightlyRandomizeWeights(). I usually only need to adjust the learning rate when training back-prop networks:

  public float TRAINING_RATE = 0.5f;

I often decrease the learning rate during training; that is, I start with a large learning rate and gradually reduce it during training. The calculation of output neuron values, given a set of inputs and the current weight values, is simple. I placed the code for calculating a forward pass through the network in a separate method forwardPass() because it is also used later in the method train:
  public float[] recall(float[] in) {
    for (int i = 0; i < numInputs; i++)
      inputs[i] = in[i];
    forwardPass();
    float[] ret = new float[numOutputs];
    for (int i = 0; i < numOutputs; i++)
      ret[i] = outputs[i];
    return ret;
  }

  public void forwardPass() {
    for (int h = 0; h < numHidden1; h++) {
      hidden1[h] = 0.0f;
    }
    for (int h = 0; h < numHidden2; h++) {
      hidden2[h] = 0.0f;
    }
    for (int i = 0; i < numInputs; i++) {
      for (int h = 0; h < numHidden1; h++) {
        hidden1[h] += inputs[i] * W1[i][h];
      }
    }
    for (int i = 0; i < numHidden1; i++) {
      for (int h = 0; h < numHidden2; h++) {
        hidden2[h] += hidden1[i] * W2[i][h];
      }
    }
    for (int o = 0; o < numOutputs; o++)
      outputs[o] = 0.0f;
    for (int h = 0; h < numHidden2; h++) {
      for (int o = 0; o < numOutputs; o++) {
        outputs[o] += sigmoid(hidden2[h]) * W3[h][o];
      }
    }
  }

While the code for recall and forwardPass is almost trivial, the training code in the method train is more complex and we will go through it in some detail. Before we get to the code, I want to mention that there are two primary techniques for training back-prop networks. The technique that I use is to update the weight arrays after each individual training example. The other technique is to sum all output errors over the entire training set (or part of the training set) and then calculate the weight updates.
In the following discussion, I am going to weave my comments on the code into the listing. The private member variable current_example is used to cycle through the training examples: one training example is processed each time that the train method is called:

  private int current_example = 0;

  public float train(ArrayList ins, ArrayList v_outs) {
Before starting a training cycle for one example, we zero out the arrays used to hold the output layer errors and the errors that are back propagated to the hidden layers. We also need to copy the training example input values and output values:
  int i, h, o;
  float error = 0.0f;
  int num_cases = ins.size();
  //for (int example=0; example<num_cases; example++) {
  // zero out error arrays:
  for (h = 0; h < numHidden1; h++)
    hidden1_errors[h] = 0.0f;
  for (h = 0; h < numHidden2; h++)
    hidden2_errors[h] = 0.0f;
  for (o = 0; o < numOutputs; o++)
    output_errors[o] = 0.0f;
  // copy the input values:
  for (i = 0; i < numInputs; i++) {
    inputs[i] = ((float[]) ins.get(current_example))[i];
  }
  // copy the output values:
  float[] outs = (float[]) v_outs.get(current_example);
We need to propagate the training example input values through the hidden layers to the output layer. We use the current values of the weights:

  forwardPass();

After propagating the input values to the output layer, we need to calculate the output error for each output neuron. This error is the difference between the desired output and the calculated output; this difference is multiplied by the value of the calculated
output neuron value that is first modified by the Sigmoid function that we saw in Figure 7.3. The Sigmoid function is used to clamp the calculated output value to a reasonable range.

  for (o = 0; o < numOutputs; o++) {
    output_errors[o] = (outs[o] - outputs[o]) *
                       sigmoidP(outputs[o]);
  }

The errors for the neuron activation values in the second hidden layer (the hidden layer connected to the output layer) are estimated by summing, for each hidden neuron, its contribution to the errors of the output layer neurons. The thing to notice is that if the connection weight value between hidden neuron h and output neuron o is large, then hidden neuron h contributes more to the error of output neuron o than other neurons with smaller connecting weight values:

  for (h = 0; h < numHidden2; h++) {
    hidden2_errors[h] = 0.0f;
    for (o = 0; o < numOutputs; o++) {
      hidden2_errors[h] += output_errors[o] * W3[h][o];
    }
  }

We estimate the errors in activation energy for the first hidden layer neurons by using the estimated errors for the second hidden layer that we calculated in the last code snippet:

  for (h = 0; h < numHidden1; h++) {
    hidden1_errors[h] = 0.0f;
    for (o = 0; o < numHidden2; o++) {
      hidden1_errors[h] += hidden2_errors[o] * W2[h][o];
    }
  }
After we have estimates for the activation energy errors for both hidden layers, we then want to scale the error estimates using the derivative of the sigmoid function's value at each hidden neuron's activation energy:
  for (h = 0; h < numHidden2; h++) {
    hidden2_errors[h] =
        hidden2_errors[h] * sigmoidP(hidden2[h]);
  }
  for (h = 0; h < numHidden1; h++) {
    hidden1_errors[h] =
        hidden1_errors[h] * sigmoidP(hidden1[h]);
  }

Now that we have estimates for the hidden layer neuron errors, we update the weights connecting to the output layer and each hidden layer by adding the product of the current learning rate, the estimated error of each weight's target neuron, and the value of the weight's source neuron:

  // update the hidden2 to output weights:
  for (o = 0; o < numOutputs; o++) {
    for (h = 0; h < numHidden2; h++) {
      W3[h][o] += TRAINING_RATE * output_errors[o] * hidden2[h];
      W3[h][o] = clampWeight(W3[h][o]);
    }
  }
  // update the hidden1 to hidden2 weights:
  for (o = 0; o < numHidden2; o++) {
    for (h = 0; h < numHidden1; h++) {
      W2[h][o] += TRAINING_RATE * hidden2_errors[o] * hidden1[h];
      W2[h][o] = clampWeight(W2[h][o]);
    }
  }
  // update the input to hidden1 weights:
  for (h = 0; h < numHidden1; h++) {
    for (i = 0; i < numInputs; i++) {
      W1[i][h] += TRAINING_RATE * hidden1_errors[h] * inputs[i];
      W1[i][h] = clampWeight(W1[i][h]);
    }
  }
  for (o = 0; o < numOutputs; o++) {
    error += Math.abs(outs[o] - outputs[o]);
  }

The last step in this code snippet was to calculate an average error over all output neurons for this training example. This is important so that we can track the training
status in real time. For very long running back-prop training experiments I like to be able to see this error graphed in real time to help decide when to stop a training run. This also allows me to experiment with the learning rate's initial value and see how fast it decays. The last thing that the method train needs to do is to update the training example counter so that the next example is used the next time that train is called:

  current_example++;
  if (current_example >= num_cases)
    current_example = 0;
  return error;
  }

You can look at the implementation of the Swing GUI test class GUITest_2H to see how I decrease the training rate during training. I also monitor the summed error rate over all output neurons and occasionally randomize the weights if the network is not converging to a solution to the current problem.
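The kind of learning rate decay just described can be as simple as the following sketch. The decay factor, floor value, and loop variables are illustrative assumptions, not the GUITest_2H code; only TRAINING_RATE and the train method come from the Neural_2H API above:

  // Start with a large learning rate and decay it each cycle,
  // keeping a floor so that learning never stops entirely
  // (network, ins, outs, and numCycles are hypothetical names):
  float rate = 0.5f;
  for (int cycle = 0; cycle < numCycles; cycle++) {
    network.TRAINING_RATE = rate;
    float error = network.train(ins, outs);
    rate = Math.max(0.05f, rate * 0.99f);
  }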
7.6 Adding Momentum to Speed Up Back-Prop Training

The class constructor now takes another parameter, alpha, that determines how strong the momentum correction is when we modify weight values:

  // momentum scaling term that is applied
  // to last delta weight:
  private float alpha = 0f;

While this alpha term is used three times in the training code, it suffices to just look at one of these uses in detail. When we allocated the three weight arrays W1, W2, and W3, we also now allocate three additional arrays of corresponding size: W1_last_delta, W2_last_delta, and W3_last_delta. These three new arrays are used to store the weight changes for use in the next training cycle. Here is the original code to update W3 from the last section:

  W3[h][o] += TRAINING_RATE * output_errors[o] * hidden2[h];

The following code snippet shows the additions required to use momentum:

  W3[h][o] += TRAINING_RATE * output_errors[o] * hidden2[h] +
              // apply the momentum term:
              alpha * W3_last_delta[h][o];
  W3_last_delta[h][o] = TRAINING_RATE *
                        output_errors[o] *
                        hidden2[h];

I mentioned in the last section that there are two techniques for training back-prop networks: updating the weights after processing each training example, or waiting to update the weights until all training examples are processed. I always use the first method when I don't use momentum. In many cases it is best to use the second method when using momentum.