

Proceedings of International Joint Conference on Neural Networks, Orlando, Florida, USA, August 12-17, 2007

A Multi-layer ADaptive FUnction Neural Network (MADFUNN) for Letter Image Recognition
Miao Kang, Dominic Palmer-Brown
Abstract: The letter image recognition dataset [3] from the UCI repository [4] provides a complex pattern recognition problem: classifying distorted raster images of English alphabetic characters. ADFUNN, the ANN deployed for this problem, is based on a linear piecewise neuron activation function that is modified by a novel gradient descent supervised learning algorithm. Linearly inseparable problems can be solved by ADFUNN [5, 6], whereas the traditional Single-Layer Perceptron (SLP) is incapable of solving them without a hidden layer. Multi-layer ADFUNNs (MADFUNNs) [9] are used for the UCI distorted character recognition task. We construct a system with two parts, letter feature grouping and letter classification, to cope with the wide diversity among the different fonts and attributes. Testing on 4,000 randomly selected test data, with all occurrences of the 16,000 training patterns removed, yields 87.6% (pure) generalisation. Allowing naturally occurring instances of training data to remain within the test data yields 93.77% (natural) generalisation.

M. Kang is with the School of Computing and Technology, University of East London, London, UK (e-mail: M.Kang@uel.ac.uk). D. Palmer-Brown is with the School of Computing and Technology, University of East London, London, UK (phone: +44 (0) 20 8223 2170; e-mail: [email protected]).

1-4244-1380-X/07/$25.00 ©2007 IEEE

I. MOTIVATIONS

The artificial neuron is the basic unit of an artificial neural network. It derives from a joint biological-computational perspective. Summing weighted inputs is biologically plausible, and adapting a weight is a reasonable model for synaptic modification. In contrast, the common assumption of a fixed output activation function is made for computational rather than biological reasons: a fixed analytical function facilitates mathematical analysis to a greater degree than an empirical one. Nonetheless, there are computational benefits to modifiable activation functions, and they may be biologically plausible as well. Recent neuroscience suggests that neuromodulators play a role in learning by modifying the neuron's activation function [1, 2]. From a computational point of view, it would be surprising if real neurons were essentially fixed entities with no adaptive aspect except at their synapses, since such a restriction means that non-linear responses require many neurons. Multi-Layer Perceptrons (MLPs) with an appropriate number of nodes are very effective, but if the activation function is not optimal, neither is the number of hidden nodes, which in turn depends on the function. Training, which is typically slow on linearly inseparable data, always requires hidden nodes. In some cases it may help to adapt a slope-related parameter of the activation function, but not if the analytic shape of the function is unsuited to the problem, in which case many hidden nodes may still be required.

In contrast, an adaptive function approach can learn linearly inseparable problems fast, sometimes without hidden nodes [5, 6, 7, 8]. The popular linearly inseparable Iris problem [10] was solved by a 3 x 4 ADFUNN network [5] without any hidden neurons, and generalisation reached 100% with 80% of the patterns used for training, within 1000 epochs. Natural language phrase recognition on a set of phrases from the Lancaster Parsed Corpus (LPC) [11] was also solved by a 735 x 41 ADFUNN network with no hidden nodes; generalisation rises to 100% with 200 training patterns (out of a total of 254) within 400 epochs [6]. Multi-layer ADFUNNs (MADFUNNs) are used for the complex pattern recognition task considered here, in which the piecewise linear activation functions are adapted by a novel gradient descent supervised learning algorithm. A general learning rule for MADFUNN is defined.

The letter image recognition task was acquired from the UCI Repository [4]; its original donor was D. J. Slate, and its authors used it as an application domain for Holland-style genetic classifier systems [3]. The data that defines this problem consists of 20,000 unique letter images composed of the letters A to Z from 20 different fonts. Each item was distorted both horizontally and vertically but remained recognisable to humans. We construct a system with two parts, letter feature grouping and letter recognition. In this supervised group and letter learning system, one MADFUNN is trained for feature extraction, and the featured data goes into its corresponding group (thirteen featured groups are assumed in this experiment), in which further MADFUNNs are used to classify the letter category.

II. A MULTI-LAYER ADAPTIVE FUNCTION NEURAL NETWORK

A. The Adaptive Function Neural Network

We provide a means of solving linearly inseparable problems using a simple adaptive function neural network (ADFUNN), based on linear piecewise function neurons, as shown in figure 1. For each neuron we calculate the weighted input sum aw and find the two neighbouring f-points that bound it. These two proximal f-points are adapted separately, on a proximal-proportional basis. The proximal-proportional values are P1 = (Xa+1 - x) / (Xa+1 - Xa) and P2 = (x - Xa) / (Xa+1 - Xa), where x = aw and Xa, Xa+1 are the positions of the two bounding f-points. Thus, the change to each f-point is in proportion to its proximity to x. We obtain the output error and adapt the two proximal f-points separately, using a function-modifying version of the delta rule, as outlined in section II.B, to calculate the change to the function.
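To make the mechanism concrete, the following is a minimal Python sketch of a single ADFUNN-style neuron with a learnable piecewise linear activation function, using initialisation values quoted later in the paper (f-points initialised to 0.5, small random weights, 1001 equally spaced points over the output-neuron range [-10, 10]). The class and method names are illustrative rather than the authors' implementation, and weight adaptation is deferred to the general learning rule of section II.B.

import numpy as np

class PiecewiseLinearNeuron:
    """A single ADFUNN-style neuron: the weighted input sum aw is passed through
    a learnable piecewise linear function stored as equally spaced f-points."""

    def __init__(self, n_inputs, x_min=-10.0, x_max=10.0, n_points=1001, f_init=0.5):
        self.w = np.random.uniform(-0.1, 0.1, n_inputs)   # small random initial weights
        self.xs = np.linspace(x_min, x_max, n_points)     # f-point positions
        self.fs = np.full(n_points, f_init)               # f-point values, initialised to 0.5

    def _bounding_points(self, x):
        # indices of the two neighbouring f-points that bound x
        b = int(np.clip(np.searchsorted(self.xs, x), 1, len(self.xs) - 1))
        return b - 1, b

    def forward(self, inputs):
        x = float(np.dot(self.w, inputs))                 # aw: the weighted input sum
        a, b = self._bounding_points(x)
        p1 = (self.xs[b] - x) / (self.xs[b] - self.xs[a]) # proximal-proportional values
        p2 = (x - self.xs[a]) / (self.xs[b] - self.xs[a])
        y = p1 * self.fs[a] + p2 * self.fs[b]             # linear interpolation between f-points
        return y, (a, b, p1, p2)

    def adapt_function(self, error, cache, fl=0.05):
        # move each bounding f-point in proportion to its proximity to x
        a, b, p1, p2 = cache
        self.fs[a] += fl * error * p1
        self.fs[b] += fl * error * p2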

III. LETTER IMAGE RECOGNITION USING MADFUNN

A. Letter Image Recognition Dataset

This complex task is that of letter recognition as presented by D. J. Slate [4]. The 20,000 character images, consisting of on average 770 examples per letter, are based on 20 different fonts; each letter within these 20 fonts was randomly distorted to produce a file of 20,000 unique stimuli. Sixteen numerical attributes (statistical moments and edge counts) were defined to capture specific characteristics of the letter images. Each of these attributes is scaled to fit into a range of integer values from 0 through 15, so each letter image is transformed into a list of 16 such integer values.

Fig. 2. Examples of the character images generated by warping parameters
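The dataset is distributed as one comma-separated record per line: the class letter followed by its 16 integer attributes, as in the sample pattern quoted in section III.B. Below is a minimal loading sketch; the file name letter-recognition.data and the optional rescaling to [0, 1] are assumptions, not part of the paper.

import numpy as np

def load_letter_data(path="letter-recognition.data", rescale=True):
    """Read the UCI letter-recognition file: each line is 'LETTER,a1,...,a16'
    with the 16 attributes already scaled to integers in the range 0-15."""
    labels, rows = [], []
    with open(path) as fh:
        for line in fh:
            parts = line.strip().split(",")
            if len(parts) != 17:
                continue                      # skip blank or malformed lines
            labels.append(parts[0])
            rows.append([int(v) for v in parts[1:]])
    X = np.asarray(rows, dtype=float)
    if rescale:
        X /= 15.0                             # map the 0-15 attribute range onto [0, 1]
    return X, np.asarray(labels)

# e.g. a 16,000 / 4,000 train-test split (the paper selects its test patterns randomly):
# X, y = load_letter_data()
# X_train, y_train, X_test, y_test = X[:16000], y[:16000], X[16000:], y[16000:]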

Fig. 1. Adapting the linear piecewise neuronal activation function in ADFUNN

B. The General Learning Rule for MADFUNN

In this general learning rule, instead of normalising the weights we use a weights limiter which constrains all weights to the range [-1, 1]. Weights and activation functions are adapted in parallel, using the following algorithm:

A = input node activation, AH = hidden node activation;
E = output node error, EH = hidden node error;
FSLOPE_Y = slope of the output neuron's function, FSLOPE_H = slope of the hidden neuron's function;
P_Y1, P_Y2 = the two proximal-proportional values for the two neighbouring f-points that bound aw at an output neuron;
P_H1, P_H2 = the two proximal-proportional values for the two neighbouring f-points that bound aw at a hidden neuron;
WL, FL = learning rates for weights and functions.

Step 1: calculate the output error, E.
Step 2: calculate the hidden error, EH:
EH = Σ (Y = 1 to n) E_Y * W_HY, where n is the number of output neurons.

Step 3: adapt the weights to each output neuron:
ΔW_HY = WL * FSLOPE_Y * AH * E;  W_HY' = W_HY + ΔW_HY
Step 4: adapt the function for each output neuron:
ΔF_Y = FL * E;  F_Y1' = F_Y1 + ΔF_Y * P_Y1,  F_Y2' = F_Y2 + ΔF_Y * P_Y2
Step 5: adapt the function for each hidden neuron:
ΔF_H = FL * EH;  F_H1' = F_H1 + ΔF_H * P_H1,  F_H2' = F_H2 + ΔF_H * P_H2
Step 6: adapt the weights to each hidden neuron:
ΔW_IH = WL * FSLOPE_H * A * EH;  W_IH' = W_IH + ΔW_IH
Step 7: update: F_Y = F_Y', F_H = F_H', W_HY = W_HY', W_IH = W_IH'
Step 8: randomly select a pattern to train.
Step 9: repeat steps 1 to 8 until the output error tends to a steady state.
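The steps above can be expressed compactly for a whole layer at once. The following sketch performs one training step (Steps 1 to 7) for a single pattern, assuming one shared grid of f-point positions per layer and one row of f-point values per neuron; the array names follow the rule's notation, but the vectorised form is an assumption of this sketch, not the authors' code.

import numpy as np

def piecewise_forward(x, grid, fpoints):
    """Evaluate each neuron's piecewise linear function at its weighted sum x.
    x: (n,) sums, grid: (n_points,) shared positions, fpoints: (n, n_points) values."""
    b = np.clip(np.searchsorted(grid, x), 1, len(grid) - 1)
    a = b - 1
    p1 = (grid[b] - x) / (grid[b] - grid[a])      # proximal-proportional values
    p2 = 1.0 - p1
    idx = np.arange(len(x))
    out = p1 * fpoints[idx, a] + p2 * fpoints[idx, b]
    slope = (fpoints[idx, b] - fpoints[idx, a]) / (grid[b] - grid[a])  # FSLOPE
    return out, slope, (idx, a, b, p1, p2)

def madfunn_step(A, target, W_ih, W_hy, F_h, F_y, grid_h, grid_y, WL=1e-5, FL=0.05):
    """One supervised step of the general learning rule for one input pattern A."""
    AH, slope_h, (ih, ah, bh, ph1, ph2) = piecewise_forward(A @ W_ih, grid_h, F_h)
    Y, slope_y, (iy, ay, by, py1, py2) = piecewise_forward(AH @ W_hy, grid_y, F_y)

    E = target - Y                               # Step 1: output error
    EH = W_hy @ E                                # Step 2: hidden error, sum of E * W_HY

    W_hy += WL * np.outer(AH, slope_y * E)       # Step 3: output weights
    F_y[iy, ay] += FL * E * py1                  # Step 4: output functions
    F_y[iy, by] += FL * E * py2
    F_h[ih, ah] += FL * EH * ph1                 # Step 5: hidden functions
    F_h[ih, bh] += FL * EH * ph2
    W_ih += WL * np.outer(A, slope_h * EH)       # Step 6: hidden weights
    np.clip(W_ih, -1.0, 1.0, out=W_ih)           # weights limiter to [-1, 1]
    np.clip(W_hy, -1.0, 1.0, out=W_hy)
    return E                                     # Step 7 is implicit: arrays updated in place

Steps 8 and 9 then correspond to drawing random training patterns and repeating this call until the output error settles; the f-point grids would be built from the aw ranges quoted later in the paper (e.g. np.linspace(-10, 10, 1001) for output neurons and np.linspace(-50, 50, 1001) for hidden neurons).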

B. Letter Image Recognition using MADFUNN

Networks with letter feature grouping and letter classification tasks are developed. Letter feature grouping is performed by a supervised group learning method using MADFUNN_1 (as described in figure 3). We assign (initialise) the letters to thirteen groups according to their normal sequence in the English alphabet:
Group1: A B
Group2: C D
Group3: E F
Group4: G H
Group5: I J
Group6: K L
Group7: M N
Group8: O P
Group9: Q R
Group10: S T
Group11: U V
Group12: W X
Group13: Y Z
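This initial assignment can be written down directly as a mapping from letters to group numbers; an illustrative sketch of the consecutive-pair assignment listed above:

import string

# Initial grouping: consecutive alphabet pairs, A,B -> group 1, C,D -> group 2, ..., Y,Z -> group 13.
initial_group = {letter: i // 2 + 1 for i, letter in enumerate(string.ascii_uppercase)}

assert initial_group["A"] == 1 and initial_group["T"] == 10 and initial_group["Z"] == 13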

Fig. 3. Networks applied to the letter image recognition task

MADFUNN_1 is a neural network with 16 inputs, 100 hidden nodes and 13 outputs. It is trained to learn the thirteen assigned groups above. Each pattern is passed to MADFUNN_1, and the network is adapted according to the general learning rule. If the pattern is correctly classified, it is passed to its corresponding group (e.g. letter A goes to group1, letter M goes to group7, etc.). In groups 1 to 13, further MADFUNNs are adapted using the same general learning rule. MADFUNNs 2-14 each have 16 inputs, 100 hidden nodes and 4 outputs in stage 1 (6 outputs in stage 2, as explained later); they are used to classify the letter category.

In the beginning, the groups are assigned without human judgment. When the error rate reaches a steady state, the results are analysed using a confusion matrix, which contains information about actual and predicted classifications. Each row of the matrix represents the instances of a predicted group, while each column represents the instances of an actual letter. One benefit of this confusion matrix is that it is easy to see whether the system is confusing letters during grouping. The regrouping method has two stages; in the first stage we apply the following rules.

Rules of Regrouping:
NL(X): the number (N) of patterns of letter L in the group X where L is currently assigned (e.g. NB(1) = 316 below).
NL(Y): the number (N) of patterns of letter L in any other group Y (Y ≠ X) to which L might preferably be grouped (e.g. NB(2) = 5, NB(3) = 11, ..., NB(13) = 16 below).
N(Z): the number (N) of letter types which have been assigned to group Z; e.g. if A and B were assigned to group1 then N(1) = 2.
Universal set U = {group1, group2, ..., group12, group13}. The relative complement of a group G in U is denoted by G^C. ∀: for any.
NL: number of letters = 26. NG: number of groups = 13.
e.g. for letter B: NB(1) = 316, NB(2) = 5, NB(3) = 11, NB(4) = 8, NB(5) = 229, NB(6) = 0, NB(7) = 0, NB(8) = 12, NB(9) = 1, NB(10) = 19, NB(11) = 12, NB(12) = 15, NB(13) = 16; B(2^C) = B({1,3,4,5,6,7,8,9,10,11,12,13}).

Rules (R1):
R1.1: if NL(Y) > 1/2 NL(X) — the number of L-type patterns in some other group Y is more than half the number of L-type patterns in the group X where L is currently assigned;
R1.2: and NL(Y) > NL(Z) for all Z ∈ (X ∪ Y)^C — group Y contains the second largest number of L-type patterns;
R1.3: and N(Y) < 2 NL/NG — there are fewer than 4 letter types already in group Y;
then letter L is regrouped to group Y.

Taking the letter B as an example: (NB(5) = 229) > (1/2 × NB(1) = 158), which satisfies R1.1; NB(5) is the second largest number of B-type patterns, which satisfies R1.2; and (N(5) = 2) < (2 NL/NG = 2 × 26/13 = 4), which satisfies R1.3. Thus we regroup letter B to group5.

For each epoch in this stage, we randomly select an input training pattern from the whole 20,000 patterns, e.g. (2,8,3,5,1,8,13,0,6,6,10,8,0,8,0,8) is a letter T pattern, so it should go to group10. If it is correctly classified in PART 1, the pattern is passed on to group10. The learned system generates the confusion matrix in figure 4. In this round (as shown in fig. 4), for example, the system predicted that 295 H patterns were in group4 as assigned, while of the 734 H patterns 161 were in group7, 50 were in group6 and the rest were in other groups. According to this matrix, we apply the regrouping rules defined in R1.

For letter H: group7 has more H-type patterns (161) than half of those in group4 (295), where H is currently assigned, which satisfies R1.1. Group7 holds the second largest number of H-type patterns, and if we regroup H to group7 there are still only 3 letters in it, within the maximum letter allowance of 4. Thus both R1.2 and R1.3 are satisfied, and we regroup letter H to group7.

For letter S: group3 has more S-type patterns (167) than half of those in group10 (323), where S is currently assigned, which satisfies R1.1. Group3 holds the second largest number of S-type patterns and has only 2 letters, which satisfies R1.2 and R1.3. We therefore regroup letter S from group10 to group3.

For letter X: group13 has more X-type patterns (143) than half of those in group12 (254), where X was assigned, which satisfies R1.1. Group13 holds the second largest number of X-type patterns and has 2 letters now, which satisfies R1.2 and R1.3. We can regroup letter X from group12 to group13.
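Rule R1 can be sketched as a small function acting on one letter's row of the confusion matrix; the dictionary-based representation and the variable names are assumptions of this sketch, not the authors' implementation.

def regroup_letter(counts, current_group, letters_in_group, NL=26, NG=13):
    """Apply regrouping rule R1 to one letter.
    counts[g]           : number of this letter's patterns classified into group g
    current_group       : group X where the letter is currently assigned
    letters_in_group[g] : number of letter types already assigned to group g"""
    x = current_group
    # Choosing the best other group Y enforces R1.2 (second largest number of patterns).
    others = {g: n for g, n in counts.items() if g != x}
    y = max(others, key=others.get)
    r11 = others[y] > 0.5 * counts[x]            # R1.1: more than half of group X's count
    r13 = letters_in_group[y] < 2 * NL // NG     # R1.3: fewer than 4 letter types already in Y
    return y if (r11 and r13) else x

# Worked example for letter H, using the first-round figures quoted above
# (smaller counts omitted): the rule moves H from group 4 to group 7.
counts_H = {4: 295, 7: 161, 6: 50}
assert regroup_letter(counts_H, current_group=4, letters_in_group={4: 2, 6: 2, 7: 2}) == 7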

Fig. 4. Confusion matrix generated in the first round

Thus we regroup the letters as follows:
Group1: A B
Group2: C D
Group3: E F S
Group4: G
Group5: I J
Group6: K L
Group7: M N H
Group8: O P
Group9: Q R
Group10: T
Group11: U V
Group12: W
Group13: Y Z X

MADFUNN_1 is trained to learn the above groups and, at the same time, the other 13 MADFUNNs are adapted to classify letters. After the learning of this round, we reassign the groups according to the rules as follows:
Group1: A
Group2: C
Group3: E F S B
Group4: P R
Group5: I J
Group6: K L
Group7: M N H
Group8: O D
Group9: Q G
Group10: T
Group11: U V
Group12: W
Group13: Y Z X

Following another round of learning, we obtain the confusion matrix in figure 5.

Fig. 5. Confusion matrix generated in the third round

No more regrouping is required according to the regrouping rules; essentially, this means that the groupings are stable. Thus we start regrouping stage 2. After classification of the groups and letters, we expect that letters have mostly been classified into their corresponding groups, but we observe that there are still many misclassified patterns for each letter. So, having completed stage 1, we perform one-shot multi-grouping, which permits letters to reside in more than one group where necessary.

Multi-grouping Rule (R2):
R2.1: if NL(Y) > 10% NL(X) — some other group Y holds more than 10% as many patterns of this letter as group X, where the letter is currently assigned;
R2.2: and N(Y) < 3 NL/NG — the number of letter types in group Y is less than 6;
then this letter is multi-grouped into both group Y and group X.

The maximum allowance of letters in each group is increased from 4 to 6 in this stage, to avoid missing any featured letter when multi-grouping. The number of outputs of the networks in PART 2 is therefore increased from 4 to 6, as mentioned at the beginning of this section. According to R2, we regroup the letters as follows:
Group1: A
Group2: C1
Group3: B E1 F1 P1 R1 S1
Group4: F2 P2 R2
Group5: I J

Group6: C2 K L R3
Group7: H1 M N O1 R4 U1
Group8: D H2 O2
Group9: G O3 Q
Group10: T
Group11: U2 V Y1
Group12: W
Group13: E2 F3 S2 X Y2 Z

C. Learned functions for the MADFUNNs in this system

As described in fig. 3, there are 14 MADFUNNs in this system: one performs the grouping and the others perform letter classification. The only difference between MADFUNN_1 and the other MADFUNNs is the number of outputs; the former has 13 and the latter have 4 in stage 1 and 6 in stage 2. They are all adapted according to the general learning rule. For each MADFUNN, the weights are initialised to small random numbers, in [-0.1, 0.1] for the output neurons and [-0.4, 0.4] for the hidden neurons. F-points are initialised to 0.5. Each f-point is simply the value of the activation function for a given input sum; the f-points are equally spaced, and the function value between points lies on the straight line joining them. A slope limiter is also applied to weight adaptation to ensure stability, and function precision is limited in terms of the number of f-points.

Clearly, every dataset has its own range, so in general we need two learning constants for each MADFUNN, WL and FL. WL depends on the input data range and the f-point interval (0.1 in this case), whereas FL depends solely on the required output range. However, MADFUNN_1 has a more difficult (larger and more complex) problem domain than the other MADFUNNs, so a different FL is required for MADFUNN_1. The learning rate WL is 0.00001; FL is 0.005 for MADFUNN_1 and 0.05 for the other MADFUNNs. The weighted input sums aw of the training patterns have a known range of [-10, 10] for the output neurons and [-50, 50] for the hidden neurons, with a precision of 0.02 and 0.1 respectively, so 1001 f-points are sufficient to encode all training patterns for the output and hidden neurons. The training takes about 500 epochs before the overall error rate reaches a steady state. We choose some representative classes' learned functions as an illustration.
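Before turning to the figures, note that the number of f-points follows directly from the aw range and the precision just quoted; a small check, assuming an equally spaced grid that includes both end points:

def n_fpoints(aw_min, aw_max, precision):
    # number of equally spaced f-points covering [aw_min, aw_max] at the given spacing
    return int(round((aw_max - aw_min) / precision)) + 1

assert n_fpoints(-10, 10, 0.02) == 1001   # output neurons
assert n_fpoints(-50, 50, 0.1) == 1001    # hidden neurons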

Fig. 6. Group 3 learned function in MADFUNN_1

Fig. 7. Group 7 learned function in MADFUNN_1

In fig. 6 (as well as in figs. 7, 8 and 9), the raised and lowered portions of the curve mark the learned region, within which adaptation has occurred. As shown in fig. 6, a sum of weighted inputs (aw) in the range [-0.54, -0.34] for the group3 function in MADFUNN_1 activates a group3 response. Similarly, in fig. 7 there are two learned parts: aw in the ranges [-0.5, -0.38] and [0, 0.1] for the group7 function in MADFUNN_1 also activates a group7 output. It is clear that a wide range of functional shapes are formed.

Fig. 8. Letter I learned function in MADFUNN_6

Fig. 9. Letter M learned function in MADFUNN_8

IV. RESULT

The overall generalisation is the accuracy of correctly classified letters coming from correctly classified groups; in other words, it equals the generalisation of PART 1 (group classification) times the generalisation of PART 2 (letter classification). We test the performance of our system on both independent test data (pure test data) and non-independent test data (natural test data). Natural test data is test data in which patterns that have already occurred in training are retained, whereas for the pure test data such patterns are removed.
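The construction of the two test sets and the composition of the overall figure can be sketched as follows; matching patterns by their 16 attribute values, and the function names, are assumptions of this sketch.

import numpy as np

def pure_test_mask(X_train, X_test):
    """True for test patterns whose 16 attribute values never occur in the training set;
    selecting these rows gives the pure test set, while the full set is the natural one."""
    seen = {tuple(row) for row in X_train.astype(int).tolist()}
    return np.array([tuple(row) not in seen for row in X_test.astype(int).tolist()])

def overall_generalisation(part1_accuracy, part2_accuracy):
    # overall generalisation = group classification accuracy x letter classification accuracy
    return part1_accuracy * part2_accuracy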


With the 4,000 pure test data, 87.60% generalisation is achieved, compared to only 79.3% generalisation using a simple back-propagation MLP [14]. With the 4,000 natural test data, 93.77% accuracy is obtained.

V. RELATED WORK

Using a Holland-style adaptive classifier and a training set of 16,000 examples, the authors of this dataset reported a classifier accuracy [4] of a little over 80% (the three best results were 80.8%, 81.6% and 82.7%) on independent test data. Philip and Joseph [12] introduced a computationally less intensive Difference Boosting (DB) Bayesian classifier algorithm. Using the first 16,000 patterns from the dataset as the training set resulted in 85.8% accuracy on the independent test set of the remaining 4,000 patterns. They obtain 94.1% accuracy on the entire 20,000 data; however, this is only achieved by repeated reordering of the training and test sets (that is to say, the training and test sets are selected), and a second guess is allowed on the test set if the first prediction fails. This is not the case with the MADFUNN tests. Partridge and Yates [13] used a selection procedure known as the pick heuristic with three different measures: CFD (coincident-failure diversity), DFD (distinct-failure diversity) and OD (overall diversity, the geometric mean of CFD and DFD), giving three multi-version systems, pickOD, pickCFD and pickDFD. By exploiting distinct-failure diversity in these three ways on the 4,000 test patterns, they achieved 91.6% generalisation using pickCFD, 91.07% using pickDFD and 91.55% using pickOD. However, the implementation of these three multi-version systems is complex, with each containing nine networks; for instance, the pickDFD system was composed of 4 RBF networks and 5 MLPs. With AdaBoost on the C4.5 algorithm [14], 96.9% correct classification can be achieved on the 4,000 test data, but over 100 machines were required to generate the tree structure [14]. With a 4-layer fully connected MLP of 16-70-50-26 topology [15], 98% correct classification with AdaBoost can be achieved, but 20 machines were required to implement the system. All the MADFUNN tests have been carried out on a single PC.

VI. CONCLUSION AND FUTURE WORK

In this paper we developed a learning method in two parts, feature grouping and classification. Letter feature grouping is achieved by a supervised group learning method using MADFUNN_1; classification of the letter category is achieved by thirteen specialised MADFUNNs. The system produced 87.60% generalisation on the 4,000 pure test data and 93.77% generalisation on the 4,000 natural test data. MADFUNN exhibited higher generalisation ability than most of the other methods that have been applied to this complex pattern recognition task, with a less complex system implementation.

In related work, we are exploring unsupervised and reinforcement learning regimes that combine complementary adaptation rules within a single network (modal learning): namely, the snap-drift algorithm [16]. It is proposed that snap-drift be followed by a supervised phase acting on the activation functions alone, to perform classification.

REFERENCES
[1] G. Scheler, "Regulation of neuromodulator efficacy: Implications for whole-neuron and synaptic plasticity," Progress in Neurobiology, Vol. 72, No. 6, 2004.
[2] G. Scheler, "Memorisation in a neural network with adjustable transfer function and conditional gating," Quantitative Biology, Vol. 1, 2004.
[3] P. W. Frey and D. J. Slate, "Letter Recognition Using Holland-style Adaptive Classifiers," Machine Learning, 6, pp. 161-182, 1991.
[4] D. J. Slate, Letter Image Recognition Data, http://www.ics.uci.edu/~mlearn/databases/letter-recognition/
[5] D. Palmer-Brown and M. Kang, "ADFUNN: An adaptive function neural network," 7th International Conference on Adaptive and Natural Computing Algorithms (ICANNGA05), Coimbra, Portugal, 2005.
[6] M. Kang and D. Palmer-Brown, "An Adaptive Function Neural Network (ADFUNN) for Phrase Recognition," International Joint Conference on Neural Networks (IJCNN05), Montreal, Canada, 2005.
[7] M. Kang and D. Palmer-Brown, "An Adaptive Function Neural Network (ADFUNN) Classifier," 2nd International Conference on Neural Networks & Brain (ICNN&B05), Beijing, China, 2005.
[8] M. Kang and D. Palmer-Brown, "An Adaptive Function Neural Network (ADFUNN) for Function Recognition," 2005 International Conference on Computational Intelligence and Security (CIS05), Xi'an, China, 2005.
[9] M. Kang and D. Palmer-Brown, "A Multi-layer ADaptive FUnction Neural Network (MADFUNN) for Analytical Function Recognition," International Joint Conference on Neural Networks (IJCNN2006), Vancouver, Canada, 2006.
[10] R. A. Fisher, "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics, 7, pp. 178-188, 1936.
[11] R. Garside, G. Leech and T. Varadi, Manual of Information to Accompany the Lancaster Parsed Corpus, Department of English, University of Oslo, 1987.
[12] N. S. Philip and K. B. Joseph, "Distorted English Alphabet Identification: An Application of Difference Boosting Algorithm," CoRR, 2000.
[13] D. Partridge and W. B. Yates, "Data-defined problems and multi-version neural-net systems," Journal of Intelligent Systems, 7, nos. 1-2, pp. 19-32, 1997.
[14] R. E. Schapire, Y. Freund, P. Bartlett and W. S. Lee, "Boosting the margin: A new explanation for the effectiveness of voting methods," Proceedings of the 14th International Conference on Machine Learning, 1997.
[15] H. Schwenk and Y. Bengio, "Adaptive Boosting of Neural Networks for Character Recognition," Technical Report #1072, Departement d'Informatique et Recherche Operationnelle, Universite de Montreal, Montreal, Qc H3C-3J7, Canada, 1997.
[16] S. W. Lee, D. Palmer-Brown and C. M. Roadknight, "Performance-guided Neural Network for Rapidly Self-Organising Active Network Management," Neurocomputing, Vol. 61, pp. 5-20, 2004.
