


Discriminant Pattern Recognition Using
Transformation Invariant Neurons
Diego Sona Alessandro Sperduti Antonina Starita

Dipartimento di Informatica, Università di Pisa


Corso Italia, 40, 56125, Pisa, Italy
e-mail: {sona,perso,starita}@di.unipi.it

Abstract
To overcome the problem of invariant pattern recognition, Simard et al. proposed a successful nearest-neighbor approach based on tangent distance, attaining state-of-the-art accuracy. Since this approach requires a large amount of computation and memory, Hastie et al. proposed an algorithm (HSS) based on Singular Value Decomposition (SVD) for the generation of non-discriminant tangent models.
In this paper we propose a different approach, based on a gradient descent constructive algorithm called TD-Neuron, that develops discriminant models. We also present comparative results of our constructive algorithm versus the HSS and LVQ algorithms. Specifically, we tested the HSS algorithm using both the original version based on the two-sided tangent distance and a new version based on the one-sided tangent distance. Empirical results over the NIST-3 database show that the TD-Neuron is superior to both the SVD- and LVQ-based algorithms, since it reaches a better trade-off between error and rejection.

1 Introduction
In several pattern recognition systems the principal and most desired feature is robustness against transformations of the patterns. Simard et al. (Simard, LeCun, and Denker, 1993) partially solved this problem by proposing the tangent distance as a classification function invariant to small transformations. They used the concept in a nearest neighbor algorithm, achieving state-of-the-art accuracy on isolated handwritten character recognition. However, this approach has a quite high computational complexity, due to the large number of Euclidean and tangent distances that need to be computed.
Different researchers have shown how such complexity can be reduced at the cost
of increased space complexity. Simard (Simard, 1994) proposed a filtering method
based on multi-resolution and on a hierarchy of distances, while Sperduti and Stork
(Sperduti and Stork, 1995) devised a graph based method for rapid and accurate
search through prototypes.
Different approaches to the problem, aiming at reducing the classification time and space requirements while trying to preserve the same accuracy, have been studied by several authors. Specifically, Hastie et al. (Hastie, Simard, and Säckinger,
1995) developed rich models for representing large subsets of the prototypes through
a Singular Value Decomposition (SVD) based algorithm, while Schwenk & Milgram
(Schwenk and Milgram, 1995b) proposed a modular classification system (Diabolo)
based on several auto-associative multi-layer perceptrons, which use tangent distance as the reconstruction error measure. A different but related approach has been pursued by Hinton et al. (Hinton, Dayan, and Revow, 1997), who propose two different methods for modeling the manifolds of the data. Both methods are based on locally linear low-dimensional approximations to the underlying data manifolds.
All the above models are non-discriminant1. Although non-discriminant models have some advantages over discriminant models, as discussed in (Hinton, Dayan, and Revow, 1997), the amount of computation during recognition is usually higher for non-discriminant models, especially if a good trade-off between error and rejection is required. On the other hand, discriminant models take more time to be trained. In several applications, however, it is more convenient to spend extra time on training, which is usually performed only once or a few times, so as to have a faster recognition process, which is repeated millions of times. In these cases, discriminant models should be preferred.
In this paper, we discuss a constructive algorithm for the generation of discriminant2 tangent models. The proposed algorithm, which is an improved version of the algorithm previously presented in (Sona, Sperduti, and Starita, 1997), is based on the definition of the TD-Neuron (TD stands for Tangent Distance), where the net input is computed by using the one-sided tangent distance instead of the standard dot product. Using this definition we have devised a constructive algorithm, which we compare here with the HSS and LVQ algorithms. In particular, we report results obtained for the HSS algorithm using both the original version based on the two-sided tangent distance and a new version based on the one-sided tangent distance. For the sake of comparison, we also present the results of the LVQ2.1 algorithm, which turned out to be the best among the LVQ algorithms.
The one-sided version of the HSS algorithm was derived in order to have a fair
comparison against the TD-Neuron, which exploits the one-sided tangent distance.
Empirical results over the NIST-3 database of handwritten digits show that the
TD-Neuron is superior to both HSS algorithms and LVQ algorithms since it reaches
a better trade-off between error and rejection. More surprisingly, our results show
that the one-sided version of the HSS algorithm is superior to the two-sided version,
which performs poorly when introducing a rejection class. An additional advantage
of the proposed algorithm is the constructive approach.
The paper is organized as follows. In Sections 2 and 3 we give an overview of tangent distance and tangent distance models, respectively. In Section 3 we also define a novel version of the HSS algorithm, based on the one-sided tangent distance. A new formulation for discriminant tangent distance models is proposed in Section 4, while the proposed TD-Neuron model is presented in Section 5, which also includes details on the training algorithm. Comparative empirical results on a handwritten digit recognition task, comparing our algorithm with the HSS, LVQ, and nearest neighbor (Euclidean distance) algorithms, are presented in Section 6. Finally, a discussion of the results and conclusions are reported in Section 7.
1 Schwenk & Milgram proposed a discriminant version of Diabolo (Schwenk and Milgram, 1995a) as well.
2 In the sense that the model for each class is generated by also taking into account negative examples, i.e., patterns belonging to the other classes.

2 Tangent Distance Overview
Let us consider a pattern recognition problem where invariance with respect to a set of n different transformations is required. Given an image X_i, the function X_i(θ) is a manifold of at most n dimensions, representing the set of patterns that can be obtained by transforming the original image through the chosen transformations, where θ is the amount of transformation and X_i = X_i(0). Ideally, one would use the transformation-invariant distance

$$D_I(X_i, X_j) = \min_{\alpha,\theta} \| X_i(\alpha) - X_j(\theta) \|.$$

However, the formalization of the manifold equation and, in particular, the computation of the distance between two manifolds is very hard. For this reason, Simard et al. (Simard, LeCun, and Denker, 1993) proposed an approach based on the local linear approximation of the manifold,

$$\tilde{X}_i(\theta) = X_i + \sum_{j=1}^{n} T_j^{X_i} \theta_j,$$

where the T_j^{X_i} are n different tangent vectors at the point X_i(0), which can easily be computed by finite differences. The distance between two manifolds is then approximated by the so-called tangent distance (Simard, LeCun, and Denker, 1993):

$$D_T(X_i, X_j) = \min_{\alpha,\theta} \| \tilde{X}_i(\alpha) - \tilde{X}_j(\theta) \|. \qquad (1)$$

Of course, the approximation is accurate only for local transformations; however, in character recognition problems global invariance may not be desired, since it can cause confusion between patterns such as "n" and "u".
The tangent distance defined by equation (1) is called the two-sided tangent distance, since it is computed between two subspaces. There is also a less computationally expensive version, called the one-sided tangent distance (Schwenk and Milgram, 1995a), where the distance is computed between a subspace and a pattern:

$$D_T^{\text{1-sided}}(X_i, X_j) = \min_{\alpha} \| \tilde{X}_i(\alpha) - X_j \|. \qquad (2)$$
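As an illustration of how the tangent vectors T_j^{X_i} can be obtained by finite differences (this sketch is not from the original paper; the chosen transformations, step sizes, and helper names are illustrative assumptions), each elementary transformation is applied with a small amplitude and the difference from the original image is taken:

```python
import numpy as np
from scipy.ndimage import shift, rotate  # small geometric transformations

def tangent_vectors(image, eps=1.0, angle=1.0):
    """Approximate tangent vectors of a 2-D image by finite differences.

    Returns one flattened vector per elementary transformation:
    translation along x and y, and a small rotation.  `eps` (pixels) and
    `angle` (degrees) are illustrative step sizes, not values from the paper.
    """
    t_x = (shift(image, (0, eps), order=1) - image).ravel() / eps
    t_y = (shift(image, (eps, 0), order=1) - image).ravel() / eps
    t_rot = (rotate(image, angle, reshape=False, order=1) - image).ravel() / angle
    return np.stack([t_x, t_y, t_rot])

# toy usage: a random array stands in for a 16x16 grey-level digit
rng = np.random.default_rng(0)
T = tangent_vectors(rng.random((16, 16)))
print(T.shape)  # (3, 256): one tangent vector per transformation
```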

3 Tangent Distance Models


The main drawback of tangent distance is its high computational cost compared with the Euclidean distance. For this reason several authors have tried to devise compact models, based on tangent distance, able to summarize the relevant information conveyed by a set of patterns.
Specifically, to address this problem, Hastie et al. (Hastie, Simard, and Säckinger,
1995) proposed an algorithm for the generation of rich models representing large
subsets of patterns.
Given a set of patterns {X_1, ..., X_{N_C}} of class C, Hastie et al. (Hastie, Simard, and Säckinger, 1995) proposed the tangent subspace model

$$M(\theta) = W + \sum_{i=1}^{n} T_i \theta_i,$$

where W is the centroid and the set {T_i} constitutes the associated invariant subspace of dimension n.
According to this definition, for each class C the model M_C can be computed as

$$M_C = \arg\min_{M} \sum_{p=1}^{N_C} \min_{\theta_p, \alpha_p} \| M(\theta_p) - X_p(\alpha_p) \|^2, \qquad (3)$$

minimizing the error function over W and the T_i.
The above definition constitutes a difficult optimization problem, which however
can be solved for a fixed value of n (i.e., the subspace dimension) by an iterative
algorithm based on Singular Value Decomposition, proposed by Hastie et al. (Hastie,
Simard, and Säckinger, 1995).
Note that, if the problem is formulated using the one-sided tangent distance, then equation (3) becomes

$$M_C = \arg\min_{M} \sum_{p=1}^{N_C} \min_{\theta_p} \| M(\theta_p) - X_p \|^2, \qquad (4)$$

which can easily be solved by principal component analysis, also known as the Karhunen-Loève expansion. In fact, equation (4) can be minimized by choosing W as the average over all available samples X_p, and the T_i as the most representative eigenvectors (principal components) of the covariance matrix

$$\Sigma = \frac{1}{N_C} \sum_{p=1}^{N_C} (X_p - W)(X_p - W)^T.$$
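As a concrete reading of equation (4), the one-sided model of a class can be built directly from the class samples: the centroid is the sample mean and the tangent vectors are the leading eigenvectors of Σ. The following NumPy sketch (illustrative only; the function name and the choice of n are not from the paper) shows this construction:

```python
import numpy as np

def one_sided_hss_model(X, n):
    """Build a one-sided tangent model from the patterns of a single class.

    X : array of shape (N_C, d), one flattened pattern per row
    n : dimension of the tangent subspace
    Returns the centroid W (d,) and the tangent vectors T (n, d),
    i.e. the top-n principal components of the class.
    """
    W = X.mean(axis=0)                      # centroid = class average
    D = X - W                               # differences from the centroid
    cov = D.T @ D / X.shape[0]              # covariance matrix Sigma
    eigval, eigvec = np.linalg.eigh(cov)    # eigenvalues in ascending order
    T = eigvec[:, ::-1][:, :n].T            # n most representative eigenvectors
    return W, T

# toy usage with random data standing in for flattened 16x16 digits
rng = np.random.default_rng(0)
W, T = one_sided_hss_model(rng.random((100, 256)), n=5)
print(W.shape, T.shape)  # (256,) (5, 256)
```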

In the following, we will refer to the two versions of the algorithm as HSS, and when necessary we will specify which one is used (one-sided or two-sided).
It must be observed that, by construction, the HSS algorithms return non-discriminant models. In fact, they use only the evidence provided by positive examples of the target class.
Moreover, the two-sided HSS algorithm can be used only if a priori knowledge on invariant transformations is present. If this knowledge is not present, the introduction of invariance with respect to an arbitrary transformation can be risky, since this can remove information relevant for the classification task. In this situation it is preferable to use the one-sided version, which does not commit to any specific transformation.

4 A General Formulation
As discussed in the introduction, there are good reasons for using discriminant models. Although Schwenk & Milgram suggested how to modify the learning rule of Diabolo to obtain discriminant models, they never proposed a formalization of discriminant models using tangent distance. In this section we present a general formulation which allows the user to develop discriminant or non-discriminant tangent models.
To be able to devise discriminant models, equation (3) must be modified so as to take into account that all available data must be used during the generation process. The basic idea is to define a model for class C which minimizes the tangent distances from patterns belonging to C, and maximizes the tangent distances from
patterns not in C (i.e., in the complement class $\bar{C}$). Mathematically this can be expressed, for each class C, by

$$M_C = \arg\min_{M} \left[ \sum_{p=1}^{N_C} D_T(M, X_p^{C}) \; - \; \lambda \sum_{p=1}^{N_{\bar{C}}} D_T(M, X_p^{\bar{C}}) \right], \qquad (5)$$

where M is the generic model {W, T_1, ..., T_n}, N_C is the number of patterns X_p^C belonging to class C, and N_{\bar{C}} is the number of patterns X_p^{\bar{C}} not belonging to C.
Note that the second sum is multiplied by a constant λ, which determines how discriminant the model should be. If λ = 0, equation (5) reduces to equation (3) (or to (4) when considering the one-sided tangent distance). On the other hand, if λ is large, the resulting model may not be a good descriptive model for class C. In any case, no bounded solution to equation (5) may exist if the term associated with λ is not bounded.

5 TD-Neuron
The TD-Neuron (short for Tangent Distance Neuron) is so called since it can be considered a neural computational unit which computes, as its net input, the square of the one-sided tangent distance of the input vector X_k from a prototype model defined by a set of internal parameters (weights).
Specifically, it is characterized by a set of n + 1 vectors of the same dimension as the input vectors. One vector (W) is used as the reference vector (centroid), while the remaining vectors {T_1, ..., T_n} are used as tangent vectors. Moreover, the set of tangent vectors constitutes an orthonormal basis.
This set of parameters is organized so as to form a tangent model. Formally, the net input of a TD-Neuron for a pattern k is

$$\text{net}_k = \min_{\theta} \| M(\theta) - X_k \|^2 + \beta, \qquad (6)$$
where β is the offset. A good model should return a small net input for patterns belonging to the learned class.
Since the tangent vectors constitute an orthonormal basis, equation (6) can be computed exactly and easily by using the projections of the input vector onto the model subspace (see Figure 1):

$$\text{net}_k = \| \underbrace{X_k - W}_{d_k} \|^2 - \sum_{i=1}^{n} \left[ (X_k - W)^t T_i \right]^2 + \beta = d_k^t d_k - \sum_{i=1}^{n} \big[ \underbrace{d_k^t T_i}_{\gamma_{ik}} \big]^2 + \beta, \qquad (7)$$

where, for notational convenience, d_k denotes the difference between the input pattern X_k and the centroid W, and the projection of d_k onto the i-th tangent vector is denoted by γ_ik.
Note that the right-hand side of equation (7) mainly involves dot products, just as in a standard neuron.

Figure 1: Geometric interpretation of equation (7). W and the T_i span the invariance manifold, d is the Euclidean distance between the pattern X and the centroid W, and net = (D_T^{1-sided})^2 is the squared one-sided tangent distance.

The output of the TD-Neuron is then computed by transforming the net input through a nonlinear monotone function f. In our experiments we have used the symmetric sigmoidal function

$$o_k = \frac{2}{1 + e^{\text{net}_k}} - 1. \qquad (8)$$

We have used a monotonically decreasing function so that the output corresponding to patterns belonging to the target class will be close to 1.
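To make equations (6)-(8) concrete, the following sketch (illustrative only; the variable and function names are not from the paper) computes the net input by projecting the difference vector onto the orthonormal tangent vectors and then applies the decreasing sigmoid:

```python
import numpy as np

def td_neuron_output(x, W, T, beta):
    """Forward pass of a TD-Neuron, equations (6)-(8).

    x : input pattern (d,);  W : centroid (d,)
    T : orthonormal tangent vectors (n, d);  beta : offset
    """
    d = x - W                                   # difference from the centroid
    gamma = T @ d                               # projections gamma_i = d^t T_i
    net = d @ d - np.sum(gamma ** 2) + beta     # squared one-sided tangent distance + offset
    o = 2.0 / (1.0 + np.exp(net)) - 1.0         # decreasing sigmoid, equation (8)
    return net, o

# toy usage: 3 orthonormal tangent vectors obtained from a QR decomposition
rng = np.random.default_rng(1)
W = rng.random(256)
Q, _ = np.linalg.qr(rng.standard_normal((256, 3)))
print(td_neuron_output(W + 0.1 * rng.standard_normal(256), W, Q.T, beta=0.0))
```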

5.1 Training the TD-Neuron


A discriminant model based on the TD-Neuron can be obtained by adapting equation (5).
Given a training set {(X_1, t_1), ..., (X_N, t_N)}, where

$$t_i = \begin{cases} 1 & \text{if } X_i \in C \\ -1 & \text{if } X_i \in \bar{C} \end{cases}$$

is the i-th desired output for the TD-Neuron, and N = N_C + N_{\bar{C}} is the total number of patterns in the training set, an error function can be defined as

$$E = \frac{1}{2} \sum_{k=1}^{N} (t_k - o_k)^2, \qquad (9)$$

where o_k is the output of the TD-Neuron for the k-th input pattern.
Equation (9) can be put into a form similar to equation (5) by making the target values t_i explicit, by splitting the sum so as to group together patterns belonging to C and patterns belonging to $\bar{C}$, and by weighting the negative examples with a constant λ ≥ 0:

$$E = \frac{1}{2} \left[ \sum_{k=1}^{N_C} (1 - o_k)^2 + \lambda \sum_{j=1}^{N_{\bar{C}}} (1 + o_j)^2 \right]. \qquad (10)$$

In our experiments we have chosen λ = N_C / N_{\bar{C}}, thus balancing the strength of patterns belonging to C and those belonging to $\bar{C}$.
Using equations (7)-(8), it is trivial to compute the changes for the centroid, the tangent vectors, and the offset by using a gradient descent approach over equation (10):

$$\Delta W = -\eta \frac{\partial E}{\partial W} = -2\eta \sum_{k=1}^{N} \left[ (t_k - o_k)\, f'_k \left( d_k - \sum_{i=1}^{n} \gamma_{ik} T_i \right) \right] \qquad (11)$$

$$\Delta T_i = -\eta \frac{\partial E}{\partial T_i} = -2\eta \sum_{k=1}^{N} \left[ (t_k - o_k)\, f'_k\, \gamma_{ik}\, d_k \right] \qquad (12)$$

$$\Delta \beta = -\eta_\beta \frac{\partial E}{\partial \beta} = \eta_\beta \sum_{k=1}^{N} \left[ (t_k - o_k)\, f'_k \right] \qquad (13)$$

where η and η_β are learning parameters, and f'_k = ∂o_k/∂net_k.
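A minimal sketch of one batch update following equations (11)-(13) is given below (illustrative only; the learning rates are arbitrary placeholders). For the sigmoid (8), the derivative is f'_k = -(1 - o_k^2)/2, which is used directly in the code:

```python
import numpy as np

def td_neuron_batch_update(X, t, W, T, beta, eta=0.01, eta_beta=0.01):
    """One gradient-descent step over equations (11)-(13).

    X : patterns (N, d);  t : targets in {+1, -1} (N,)
    W : centroid (d,);  T : orthonormal tangent vectors (n, d);  beta : offset
    Returns updated copies of W, T and beta.
    """
    D = X - W                                    # d_k for every pattern, (N, d)
    gamma = D @ T.T                              # gamma_ik, (N, n)
    net = np.sum(D * D, axis=1) - np.sum(gamma ** 2, axis=1) + beta
    o = 2.0 / (1.0 + np.exp(net)) - 1.0          # equation (8)
    fprime = -(1.0 - o ** 2) / 2.0               # derivative of equation (8)
    err = (t - o) * fprime                       # common factor (t_k - o_k) f'_k

    dW = -2.0 * eta * (err[:, None] * (D - gamma @ T)).sum(axis=0)   # eq. (11)
    dT = -2.0 * eta * (err[:, None] * gamma).T @ D                   # eq. (12)
    dbeta = eta_beta * err.sum()                                     # eq. (13)
    return W + dW, T + dT, beta + dbeta
```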
Before training the TD-Neuron by gradient descent, however, the tangent subspace dimension must be decided. To solve this problem we developed a constructive algorithm which adds tangent vectors one by one, according to the computational needs. This idea is also justified by the observation that using equations (11)-(13) leads to the sequential convergence of the tangent vectors according to their relative importance.
This means that, to a first approximation, all the tangent vectors remain random vectors while the centroid converges first. Then one of the tangent vectors converges to the most relevant transformation (while the remaining tangent vectors are still immature), and so on until all the tangent vectors converge, one by one, to less and less relevant transformations.
This behavior suggests starting the training using only the centroid (i.e., without tangent vectors) and then adding tangent vectors as needed. Under this learning scheme, since there are no tangents when the centroid is computed, equation (11) becomes

$$\Delta W = -\eta \frac{\partial E}{\partial W} = -2\eta \sum_{k=1}^{N} \left[ (t_k - o_k)\, f'_k\, d_k \right]. \qquad (14)$$

The constructive algorithm is composed of two phases (see Table 1). First there is the centroid computation, based on the iterative use of equations (13) and (14). Then the centroid is frozen, and one by one all tangent vectors T_i are trained using equations (12) and (13). At each iteration of the learning phase the tangent vector T_i must be orthonormalized with respect to the already computed tangent vectors3. If, after a fixed number of iterations (we used 300 iterations), the total error variation is less than a fixed threshold (0.01%) and no change in the classification performance over the training set occurs, a new tangent vector is added.
3 The computational complexity of the orthonormalization is linear in the number of already computed tangent vectors.

CONSTRUCTIVE ALGORITHM

  Initialize the centroid W
  Update β and W by equations (13) and (14) until they converge
  Freeze W
  REPEAT
      Initialize a new tangent vector T_i
      Update T_i and β with equations (12) and (13), and
          orthonormalize T_i with respect to {T_1, ..., T_{i-1}},
          until it converges
      Freeze T_i
  UNTIL the new T_i gives little accuracy change

Table 1: The constructive algorithm for the TD-Neuron.

The tangent vectors are iteratively added until changes in the classification accuracy become irrelevant.
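Purely as an illustration of the control flow of Table 1, the skeleton below combines the update step sketched above (assumed available as td_neuron_batch_update) with a Gram-Schmidt orthonormalization; the iteration counts, the convergence tests, and the random initialization of the new tangent vectors are placeholders, not the exact criteria of the paper (300 iterations, 0.01% error variation, unchanged training accuracy):

```python
import numpy as np

def orthonormalize(v, basis):
    """Gram-Schmidt: make v orthonormal to the rows of `basis`."""
    for b in basis:
        v = v - (v @ b) * b
    return v / np.linalg.norm(v)

def train_td_neuron(X, t, max_tangents=15, iters=300, eta=0.01):
    """Constructive training loop of Table 1 (sketch, simplified stopping rules)."""
    d = X.shape[1]
    W = X[t > 0].mean(axis=0)              # centroid initialized with the positive-class mean
    T = np.empty((0, d))
    beta = 0.0
    for _ in range(iters):                 # phase 1: centroid and offset only (eqs. 13-14)
        W, T, beta = td_neuron_batch_update(X, t, W, T, beta, eta=eta)
    for i in range(max_tangents):          # phase 2: add and train tangent vectors one by one
        v = np.random.default_rng(i).standard_normal(d)   # random init (Table 2 gives a better one)
        T = np.vstack([T, orthonormalize(v, T)])
        for _ in range(iters):
            _, T_new, beta = td_neuron_batch_update(X, t, W, T, beta, eta=eta)
            T[-1] = orthonormalize(T_new[-1], T[:-1])      # keep W and the older T_i frozen
        # a full implementation would stop when the classification accuracy no longer improves
    return W, T, beta
```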
The initialization of the internal vectors of the TD-Neuron can be done in many different ways. On the basis of empirical evidence, we have concluded that the learning phase of the centroid can be considerably shortened by initializing the centroid with the mean value of the patterns belonging to the positive class. We have also devised a "good" initialization algorithm for the tangent vectors (see Table 2), which tries to minimize the drop in the net input for all the patterns due to the increase in the tangent subspace dimension (see Figure 2). This is obtained by introducing a new tangent vector which mainly spans the residual subspace between the patterns in the positive class and the current model. In this way, patterns in the negative class are only mildly affected by the newly introduced tangent vector.
We have also devised a "better" initialization algorithm, based on the principal components of the difference vectors for all classes. However, the observed training speed-up does not justify the additional computational overhead of the SVD computation needed at each tangent vector insertion.
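A sketch of this tangent-vector initialization (see Table 2) in the same NumPy style as the earlier fragments is given below; it is one illustrative reading, in which the difference from the current model M(θ) is taken as the residual after projection onto the current subspace, and the other classes' mean differences are removed one at a time:

```python
import numpy as np

def init_tangent_vector(X, labels, pos_class, W, T):
    """Initialize a new tangent vector following Table 2 (illustrative reading).

    X : all training patterns (N, d);  labels : class label of each pattern
    pos_class : the class C modelled by this TD-Neuron
    W, T : current centroid and orthonormal tangent vectors of the model
    """
    def mean_residual(P):
        D = P - W
        return (D - (D @ T.T) @ T).mean(axis=0)   # mean difference from the current model

    d_pos = mean_residual(X[labels == pos_class])
    for c in np.unique(labels):                   # remove the directions of the other classes
        if c == pos_class:
            continue
        d_c = mean_residual(X[labels == c])
        d_c = d_c / np.linalg.norm(d_c)
        d_pos = d_pos - (d_pos @ d_c) * d_c
    return d_pos / np.linalg.norm(d_pos)
```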
In our experiments, we have also used a simple form of regularization over the parameters (weight decay with penalty equal to 0.997, i.e., all parameters are multiplied by the penalty before adding the gradient), obtaining better convergence.

6 Results
In order to obtain comparable results, we had to use the same number of parameters for all algorithms. In particular, for tangent based algorithms the number of vectors is given by the number of tangent vectors plus one (the centroid).
Figure 2: Total error variation during the learning phase for pattern '0' (total error, log scale, versus iterations; the insertions of the 1st and 2nd tangent vectors are marked). At each new tangent insertion there is a reduction of the distance between patterns and the model (also for patterns belonging to class $\bar{C}$). This affects the output of the neuron for all patterns, increasing the total output squared error.

TANGENT VECTOR INITIALIZATION

  • for each class c ∈ (C ∪ $\bar{C}$) compute the mean value of the differences between patterns and the model: $d_c = \frac{1}{N_c} \sum_{p=1}^{N_c} (X_p - M(\theta))$;
  • orthonormalize the vector d_C of the class C with respect to the mean difference vectors of all the other classes (those in $\bar{C}$), and return it as the new initial tangent vector.

Table 2: Initialization procedure for the tangent vectors.

For this reason, the LVQ algorithms are compared with the tangent distance based algorithms using a number of reference vectors equal to the number of tangent vectors plus one. Furthermore, with the two-sided HSS algorithm we also have to consider the tangent vectors corresponding to the input patterns; specifically, we used 6 transformations (tangent vectors) for each input pattern: clockwise and counterclockwise rotation, and translation in the four cardinal directions4.
We have tested our constructive algorithm against the two versions of the HSS algorithm and against the LVQ algorithms in the LVQ_PAK package (Kohonen, Hynninen, Kangas, Laaksonen, and Torkkola, 1996) (optimized-learning-rate LVQ1, original LVQ1, LVQ2.1, LVQ3), using 10704 binary digits taken from the NIST-3 dataset.
4 We preferred to approximate the exact tangents by finite differences, since in this way the local shape of the manifold is interpolated in a better way.

Figure 3: The results obtained on the test set by the models generated by both versions of HSS, LVQ2.1, the TD-Neuron, and 1-NN with Euclidean distance (classification performance versus number of tangent vectors; the inset enlarges the range from 5 to 15 tangents).

The binary 128×128 digits were transformed into 64-grey-level 16×16 images by a simple local counting procedure5. The only preprocessing transformation performed was the elimination of empty borders.
5 The original image is partitioned into 16×16 windows and the number of pixels with value equal to 1 is used as the grey value for the corresponding pixel in the new image.
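A sketch of the local counting procedure is given below, reading footnote 5 as a 16×16 grid of 8×8 windows over the 128×128 binary image, so that each grey value is the number of foreground pixels in a window (0-64); this interpretation and the function name are our own assumptions:

```python
import numpy as np

def local_counting(binary_image):
    """Downsample a 128x128 binary image to a 16x16 grey-level image.

    Each output pixel is the count of 1-pixels in the corresponding
    8x8 window, giving grey values in the range 0..64.
    """
    assert binary_image.shape == (128, 128)
    return binary_image.reshape(16, 8, 16, 8).sum(axis=(1, 3))

# toy usage
rng = np.random.default_rng(0)
img = (rng.random((128, 128)) > 0.8).astype(int)
print(local_counting(img).shape)  # (16, 16)
```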
The training set consisted of 5000 randomly chosen digits, while the remaining
digits were used in the test set. For each tangent distance algorithm, a single
tangent model for each class of digit was computed. With the LVQ algorithms, a
set of reference vectors was used for each class; the number of reference vectors was chosen so as to have as many parameters as in the tangent distance algorithms. In particular, we tested all the algorithms based on tangent distance using a different number of vectors in each experiment, starting from 1 vector per class (centroid without tangent vectors) up to 16 vectors per class (centroid plus 15 tangent vectors). The number of reference vectors for the LVQ algorithms was chosen accordingly.
Concerning the LVQ algorithms, here we report only the results obtained using LVQ2.1 with 1-NN based on Euclidean distance as the classification rule, since this algorithm reached the best performance over an extended set of experiments involving the LVQ algorithms with different settings of the learning parameters.
The classification of the test digits was performed using the label of the closest model for HSS, the 1-NN rule for the LVQ algorithms, and the highest output for the TD-Neuron algorithm. For the sake of comparison, we also performed a classification using the nearest neighbor rule (1-NN) with the Euclidean distance as the classification metric; in this case each pattern was classified with the label of the nearest vector in the training set.
In Figure 3 we report the results obtained on the test set for different numbers of tangent vectors for all models. In particular, the best classification result (96.84%) was given by 1-NN with Euclidean distance, followed by the two-sided HSS
with 9 tangent vectors6 (96.6%). With the same number of parameters (15 tangent vectors), the TD-Neuron and the one-sided HSS gave performance rates of 96.51% and 96.42%, respectively. Finally, LVQ2.1 obtained a performance rate of 96.48% using 15 vectors per class.
From these results it can be noted that the two-sided HSS algorithm overfits the data after the 9th tangent vector, while this is not true for the remaining algorithms. Nevertheless, all the models reach a similar performance with the same number of parameters, which is slightly below the performance attained by the 1-NN classifier using Euclidean distance. However, both the tangent models and the LVQ algorithms have the advantage of being less demanding than 1-NN, both in space and in response time.
It is interesting that the recognition performance of the TD-Neuron system is basically monotone in the number of parameters (tangent vectors), while all the other methods exhibit overfitting and high variance in the results.
Although from Figure 3 it seems that the tangent models and LVQ2.1 are equivalent, when a rejection criterion is introduced the model generated by the TD-Neuron outperforms the other algorithms (see Figure 4). We used the same rejection criterion for all algorithms: a pattern is rejected when the difference between the first and the second best outputs belonging to different classes is less than a fixed threshold7. Furthermore, introducing the rejection criterion also leads to the surprising result that the one-sided version of HSS performs better than the two-sided version.
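As a sketch of this rejection rule (illustrative only; `outputs` holds one score per class model, with higher meaning a better match, as for the TD-Neuron):

```python
import numpy as np

def classify_with_rejection(outputs, threshold):
    """Return the predicted class index, or None if the pattern is rejected.

    A pattern is rejected when the gap between the best and the
    second-best class scores is below `threshold`.
    """
    order = np.argsort(outputs)[::-1]       # class indices sorted by decreasing score
    if outputs[order[0]] - outputs[order[1]] < threshold:
        return None                         # rejected
    return int(order[0])

print(classify_with_rejection(np.array([0.90, 0.20, 0.85]), threshold=0.1))  # None (rejected)
```

For the distance-based methods (HSS, LVQ, 1-NN), where lower values mean a better match, the same rule can be applied to the negated distances.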
In order to assess whether the better performance exhibited by the TD-Neuron
was due to the specific rejection criterion or to the discriminant capability of the
model, we performed some experiments using a different rejection criterion: reject
when the best output value is smaller (TD-Neuron) or greater (HSS, LVQ and
1-NN) than a threshold. In Figure 5 we have reported the curves obtained for
the best models, i.e., TD-Neuron and 1-sided HSS, reporting, for comparison, the
corresponding curves of Figure 4.
These results demonstrate that the improvement shown by the TD-Neuron is not tied to a specific rejection criterion. Moreover, when the sigmoidal function is removed from the TD-Neuron during the test phase, the rejection curves do not change significantly for either rejection criterion. Thus we can conclude that the improvement of the TD-Neuron is mainly due to the discriminant training procedure.
In Figure 6, four examples of TD-Neuron models are reported: in the leftmost
column the centroids of patterns ‘0’, ‘1’, ‘2’ and ‘3’ are shown. The remaining
columns contain the first (and most important) four tangent vectors for each model.

7 Discussion and Conclusion


We introduced the tangent distance neuron (TD-Neuron), which implements the
one-sided version of the tangent distance, and gave a constructive learning algorithm
for building a tangent subspace with discriminant capabilities.
As stated in the introduction, there are many advantages in using the proposed
computational model versus the HSS model and LVQ algorithms. Specifically, we
believe that the proposed approach is particularly useful in those applications where
it is very important to have a classification system which is both discriminant and
fast in recognition.
6 There are also 6 tangent vectors corresponding to the input pattern.
7 Obviously, the threshold is different for each algorithm.

Figure 4: Error-rejection curves for all algorithms: 1-sided HSS with 15 tangent vectors, 2-sided HSS with 9 and 15 tangent vectors, TD-Neuron with 9 and 15 tangent vectors, LVQ2.1 with 15 reference vectors, and 1-NN with Euclidean distance (% rejection versus % error). The boxed diagram shows a detail of the curves, demonstrating that the TD-Neuron with 15 tangent vectors has the best trade-off between error and rejection.

In this paper we also compared the TD-Neuron constructive algorithm with two different versions of the HSS algorithm, the LVQ2.1 algorithm, and the 1-NN classification criterion. The results obtained over the NIST-3 database of handwritten digits show that the TD-Neuron is superior to both the HSS algorithms based on Singular Value Decomposition and the LVQ algorithms, since it reaches a better trade-off between error and rejection. Moreover, we have assessed that the better trade-off is mainly due to the discriminant capabilities of our model.
Concerning the proposed neural formulation, we believe that the nonlinear sigmoidal transformation is useful: removing it and training with the inverted targets o_k^{-1}(t_k) would drastically reduce the size of the space of solutions in the weight space. In fact, very high or very low values of the net input would not minimize the error function with the inverted targets, while they are fully acceptable in the proposed model. Moreover, the nonlinearity increases the stability of learning because of the saturated output.
It must be pointed out that, during model generation, for a fixed number of tangent vectors the HSS algorithm is faster than ours, because it needs only a fraction of the training examples (only one class). However, our algorithm is remarkably more efficient than the HSS algorithms when a family of tangent models with an increasing number of tangent vectors must be generated. The LVQ algorithms are also faster than the TD-Neuron, but they have poor rejection performance as a drawback.
Figure 5: Detail of the error-rejection curves of 1-sided HSS and the TD-Neuron for two different rejection criteria: threshold on the output absolute value (abs) and threshold on the difference between the best and the second best output models (diff). The curves for the TD-Neuron model with the nonlinear output removed are indicated by the additional label (linear).

An additional advantage of the TD-Neuron model is that, because the training algorithm is based on a gradient descent technique, several TD-Neurons can be arranged to form a hidden layer in a feed-forward network with standard output neurons, which can be trained by a trivial extension of back-propagation. This may
neurons, which can be trained by a trivial extension of back-propagation. This may
lead to a remarkable increase in the transformation invariant features of the system.
Furthermore, it should be possible to easily extract information from the network
regarding the most important features used during classification (see Figure 6).

References
Hastie, T., Simard, P. Y., and Säckinger, E., 1995. Learning Prototype Models for Tangent Distance. In Advances in Neural Information Processing Systems, eds. G. Tesauro, D. S. Touretzky, and T. K. Leen, vol. 7, pp. 999–1006. Cambridge MA: MIT Press.
Hinton, G. E., Dayan, P., and Revow, M., 1997. Modeling the Manifold of Images
of Handwritten Digits. IEEE Transactions on Neural Networks 8, no. 1:65–74.
Kohonen, T., Hynninen, J., Kangas, J., Laaksonen, J., and Torkkola, K., 1996. LVQ_PAK: The Learning Vector Quantization Program Package. Technical Report A30, Helsinki University of Technology, Laboratory of Computer and Information Science, Rakentajanaukio 2 C, SF-02150 Espoo, Finland. http://www.cis.hut.fi/nnrc/nnrc-programs.html.


Schwenk, H. and Milgram, M., 1995a. Learning Discriminant Tangent Models for
Handwritten Character Recognition. In International Conference on Artificial
Neural Networks, pp. 985–988. Springer-Verlag.
Schwenk, H. and Milgram, M., 1995b. Transformation Invariant Autoassociation with Application to Handwritten Character Recognition. In Advances in Neural Information Processing Systems, eds. G. Tesauro, D. S. Touretzky, and T. K. Leen, vol. 7, pp. 991–998. Cambridge MA: MIT Press.
Simard, P. Y., 1994. Efficient Computation of Complex Distance Metrics Using Hierarchical Filtering. In Advances in Neural Information Processing Systems, eds. J. D. Cowan, G. Tesauro, and J. Alspector, vol. 6, pp. 168–175. San Francisco CA: Morgan Kaufmann.
Simard, P. Y., LeCun, Y., and Denker, J., 1993. Efficient Pattern Recognition Using
a New Transformation Distance. In Advances in Neural Information Processing
Systems, eds. S. J. Hanson, J. D. Cowan, and C. L. Giles, vol. 5, pp. 50–58. San
Mateo CA: Morgan Kaufmann.
Sona, D., Sperduti, A., and Starita, A., 1997. A Constructive Learning Algorithm for Discriminant Tangent Models. In Advances in Neural Information Processing Systems, eds. M. C. Mozer, M. I. Jordan, and T. Petsche, vol. 9, pp. 786–792. Cambridge MA: MIT Press.
Sperduti, A. and Stork, D. G., 1995. A rapid graph-based method for arbitrary transformation-invariant pattern classification. In Advances in Neural Information Processing Systems, eds. G. Tesauro, D. S. Touretzky, and T. K. Leen, vol. 7, pp. 665–672. Cambridge MA: MIT Press.


