A Conic Section Classifier and Its Application To Image Datasets
... features that is sufficient for the purpose of classification.

In this paper, we present a novel concept class that expands the power of the first approach noted above. The concept class, presented in Section 2, is rich and subsumes linear discriminants, and yet is specified with merely twice the number of parameters of a linear discriminant. Each member class in the dataset is represented by a prototype conic section in the feature space, and new data points are classified based on a distance measure to each such representative conic section. In Section 3, we present a tractable algorithm for learning the appropriate conic sections (i.e., their directrices, foci, and eccentricities) for the classes given a labeled dataset. In Section 4, we demonstrate the efficacy of the technique by comparing it to several well known classifiers on multiple artificial as well as public domain datasets.

2. The Concept Class using Conic Sections

A conic section in R^2 is defined as the locus of points whose distance from a given point (the focus) and that from a given line (the directrix) form a constant ratio (the eccentricity). The different kinds of conic sections, ellipse, parabola and hyperbola, are obtained by fixing the value of the eccentricity to < 1, = 1, and > 1, respectively. The concept can be generalized to R^M by making the directrix a hyperplane of codimension 1. Together, the focus and the directrix hyperplane generate an eccentricity function that attributes to each point X ∈ R^M a scalar valued eccentricity defined as:

    ε(X) = √((F − X)^T (F − X)) / (b + D^T X)    (1)

where F ∈ R^M is the focus and (b + D^T X) is the orthogonal distance of X from the directrix represented as {b, D}, where b ∈ R is the offset of the directrix from the origin and D ∈ R^M, D^T D = 1, is the unit normal vector to the directrix. Setting ε(X) = ê yields an axially symmetric conic section in R^M.

We are now in a position to formally define the concept class. To each class k we assign a distinct conic section parameterized by the descriptor set (focus, directrix and eccentricity), Ck = {Fk, {bk, Dk}, êk}. For any given point X, each class attributes an eccentricity εk(X), as defined in Eqn.1, in terms of the descriptor set Ck. The conic sections for a set of K classes induce a mapping ε* : R^M → R^K from the feature space to the eccentricity space (ecc-Space), ε*(X) = (ε1(X), . . . , εK(X)). The point X is assigned to the class whose eccentricity descriptor êk is closest in magnitude to the attributed eccentricity, i.e.,

    class(X) = argmin_k |εk(X) − êk|    (2)

    |ε1(X) − ê1| = |ε2(X) − ê2|,  for K = 2    (3)

With an eye towards simplicity, we restrict the rest of the presentation to the binary classification case. The discriminant boundary (Eqn.3) for this case is the locus of points equidistant, in eccentricity, to the two representative conic sections. This discriminant corresponds to a rich, non-linear surface in R^M.
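For concreteness, the following is a minimal numerical sketch of the eccentricity function of Eqn.1 and the decision rule of Eqn.2; it is not the authors' implementation, and the variable names and the small example descriptors are our own.

    import numpy as np

    def eccentricity(X, F, b, D):
        """Eqn. 1: attributed eccentricity ||F - X|| / (b + D^T X).
        D is assumed to be a unit normal to the directrix {b, D}, so that
        (b + D^T X) is the orthogonal distance of X from the directrix."""
        return np.linalg.norm(F - X) / (b + D @ X)

    def classify(X, descriptors):
        """Eqn. 2: assign X to the class whose class-eccentricity e_hat is
        nearest to the eccentricity attributed to X by that class's descriptor.
        `descriptors` is a list of (F, b, D, e_hat) tuples, one per class."""
        deviations = [abs(eccentricity(X, F, b, D) - e_hat)
                      for (F, b, D, e_hat) in descriptors]
        return int(np.argmin(deviations)) + 1        # class labels 1..K

    # Tiny illustrative example in R^2 with two hypothetical descriptors.
    C1 = (np.array([0.0, 1.0]), 1.0, np.array([0.0, 1.0]), 0.8)
    C2 = (np.array([0.0, -1.0]), 1.0, np.array([0.0, -1.0]), 0.8)
    print(classify(np.array([0.3, 0.5]), [C1, C2]))   # prints 1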
Figure 1. Discriminant boundaries in R^2 for four configurations, (a)-(d), of a pair of conic sections. (See Sec. 2.)

The concept class just described has several notable features. As shown in Fig.1, different configurations of two conic sections (shown in R^2) generate different discriminant boundaries, ranging from simple to complex. Fig.1(a) corresponds to a configuration where the directrices for the two classes are identical, the foci for the two classes lie symmetrically on the two sides of the directrix, and the class eccentricities are equal. If the foci are moved such that the line joining them is bisected by the directrix, the boundary remains linear (Fig.1(b)). When the angle between the normals to the directrices, D1, D2, is non-zero, the boundary becomes non-linear (Fig.1(c)). Further changes in the descriptors produce rich non-linear boundaries (Fig.1(d)).

Regardless of the dimensionality of the feature space, the discriminant is linear when the directrices of the two classes are parallel, the foci are equidistant from the directrices, and the class eccentricities are equal and lie in a particular range. The concept class therefore subsumes linear discriminants. Finally, the number of parameters necessary to specify the conic sections for each class is 2(M + 1), which is far less than the M^2 parameters necessary to specify a generic quadratic surface. We point out in passing that there is no known kernel for the support vector machine which matches this concept class, and therefore the concept class is novel.

3. Learning Algorithm - The Two-Class Case

In this section, we present a novel incremental algorithm (Algorithm 1) for learning the conic section descriptors Ck = {Fk, {bk, Dk}, êk}, k = 1, 2, that minimize the empirical error (Eqn.4). We assume a set of N labeled samples P = {(X1, y1), . . . , (XN, yN)}, where Xi ∈ R^M and the label yi ∈ {1, 2}, and that the data is sparse in a very high dimensional input space, i.e., N ≪ M.

    Data: Labeled Samples P
    Result: Conic Section Descriptors C1, C2
    1: Initialize {F1, b1, D1}, {F2, b2, D2} [Sec. 3.6]
    2: Compute ε1(Xi), ε2(Xi) ∀Xi ∈ P
    3: Find class-eccentricities ê1, ê2 [Sec. 3.1]
    4: Compute the desired ε̃1i, ε̃2i [Sec. 3.2]
    5: Update foci & directrices alternately [Sec. 3.3, 3.5]
    6: Go to (2) until convergence of descriptors.
    Algorithm 1: Learning the descriptors C1, C2
Following initialization of the descriptors (Section 3.6), the learning process comprises two stages. In the first stage, C1 and C2 are held fixed, and each Xi is mapped into ecc-Space by computing its attributed eccentricities ε1(Xi), ε2(Xi). The pair of class eccentricities (ê1, ê2) that minimizes the empirical risk Lerr is then computed.

    Lerr = (1/N) Σ_i I(yi ≠ class(Xi))    (4)

where I is the indicator function. For each misclassified sample, one can find a desired pair of attributed eccentricities (ε̃1i, ε̃2i) that would correctly classify that sample.

In the second stage, the foci {F1, F2} and the directrices {{b1, D1}, {b2, D2}} are updated alternately so as to achieve the desired attributed eccentricities for those misclassified samples, without affecting the attributed eccentricities for those samples that are already correctly classified. The process is repeated until the descriptors converge or there can be no further improvement in classification.

3.1. Finding Class-Eccentricities ê1, ê2

Note that the dimensionality of ecc-Space is the number of classes (2 in our case). For any given choice of class eccentricities, the discriminant boundary (Eqn.3) in ecc-Space is a pair of orthogonal lines with slopes +1 and −1, respectively, as illustrated in Fig.2(a). The lines intersect at (ê1, ê2), referred to hereafter as the cross-hair. The lines divide ecc-Space into four quadrants, with opposite pairs belonging to the same class. It should be noted that this discriminant corresponds to a non-linear decision boundary in the feature space R^M.

We now present an O(N^2) algorithm to find the optimal cross-hair. The method begins by rotating ecc-Space around the origin by 45° so that any choice of the discriminants will now be parallel to the new axes. Each axis is divided into (N + 1) intervals by projecting the points in ecc-Space onto that axis. Consequently, ecc-Space is partitioned into (N + 1)^2 2D intervals. We now make a crucial observation: within the confines of a given 2D interval, any choice of a cross-hair classifies the set of samples identically. We can therefore enumerate just the (N + 1)^2 intervals and choose the one that gives the smallest classification error. The cross-hair is set at the center of this 2D interval. In cases where there are multiple 2D intervals that give the smallest classification error, the larger one is chosen.
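The following sketch illustrates this search; it is our own simplification, not the authors' code. It enumerates one candidate cross-hair per 2D interval of the rotated coordinates but naively re-counts the error for every candidate (O(N^3)) and ignores the tie-breaking rule; the paper's incremental book-keeping is what brings the search down to O(N^2).

    import numpy as np

    def find_crosshair(E, y):
        """Brute-force version of the Sec. 3.1 cross-hair search.
        E is (N, 2) with the attributed eccentricities (eps_1(X_i), eps_2(X_i));
        y holds the labels in {1, 2}.  Returns (e1_hat, e2_hat, error)."""
        R = np.array([[1.0, 1.0], [-1.0, 1.0]]) / np.sqrt(2.0)   # 45-degree rotation
        UV = E @ R.T                                             # rotated points

        def centres(vals):
            # One candidate coordinate per interval of the sorted projections.
            s = np.unique(vals)
            mids = (s[:-1] + s[1:]) / 2.0
            return np.concatenate(([s[0] - 1.0], mids, [s[-1] + 1.0]))

        best = None
        for u0 in centres(UV[:, 0]):
            for v0 in centres(UV[:, 1]):
                # Eqn. 2 with this cross-hair: class 1 iff |e1-e1_hat| < |e2-e2_hat|,
                # which in the rotated coordinates reduces to (u-u0)*(v-v0) > 0.
                pred = np.where((UV[:, 0] - u0) * (UV[:, 1] - v0) > 0, 1, 2)
                err = np.mean(pred != y)
                if best is None or err < best[0]:
                    best = (err, u0, v0)
        # Map the best cross-hair back to ecc-Space (R is orthonormal, so R^-1 = R^T).
        e_hat = np.array([best[1], best[2]]) @ R
        return e_hat[0], e_hat[1], best[0]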
3.2. Learning Misclassified Points

Given the attributed eccentricities (ε1i, ε2i) of a misclassified point, we can compute its desired location (ε̃1i, ε̃2i) in ecc-Space (see Fig.2(a)) by moving it into the nearest quadrant associated with its class label. This movement can be achieved by updating a focus or directrix in Eqn.1. In order to keep the learning process simple, we update only one descriptor of a particular class at each iteration. Hence, we move the misclassified points in ecc-Space by changing ε̃1i or ε̃2i for the class of the descriptor being updated.

The learning task now reduces to alternately updating the foci and directrices of C1 and C2, so that the misclassified points are mapped into the desired quadrants in ecc-Space, while the correctly classified points remain fixed. Note that with such an update, our learning rate is non-decreasing. We also introduce a margin along the discriminant boundary and require the misclassified points to be shifted beyond this margin into the correct quadrant. In most of our experiments the margin was set to 5% of the range of eccentricity values in ecc-Space.
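One plausible realization of this rule, under our reading of the quadrant geometry implied by Eqn.2 (class 1 occupies the quadrants where |ε1 − ê1| < |ε2 − ê2|), is sketched below. The choice of coordinate corresponds to the class of the descriptor being updated; the exact margin handling is an assumption, not the authors' specification.

    import numpy as np

    def desired_eccentricity(e1, e2, e1_hat, e2_hat, label, k, margin):
        """Move a misclassified point into the nearest quadrant of its own
        class by changing only coordinate k (1 or 2) of its ecc-Space
        location, placing it `margin` beyond the discriminant (Sec. 3.2,
        our sketch).  Returns the desired value of that coordinate."""
        u, v = e1 - e1_hat, e2 - e2_hat
        if k == 1:
            # Class 1 needs |e1 - e1_hat| < |e2 - e2_hat|; class 2 the opposite.
            dev = abs(v) - margin if label == 1 else abs(v) + margin
            return e1_hat + np.sign(u if u != 0 else 1.0) * max(dev, 0.0)
        else:
            dev = abs(u) + margin if label == 1 else abs(u) - margin
            return e2_hat + np.sign(v if v != 0 else 1.0) * max(dev, 0.0)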
3.3. Updating The Focus

Our objective here is to achieve the desired attributed eccentricities ε̃ki for all the samples by changing the focus Fk. For each correctly classified sample, the desired eccentricity ε̃ki is simply its previous value εki. From Eqn.1 we can conclude that the εki's for k = 1 depend only on the class descriptor C1, and likewise for k = 2. Since we update only one focus at a time, we shall hereafter deal with the case k = 1. The update problem may be posed formally as follows: find a focus F1 that satisfies the following N quadratic constraints, where ‖·‖2 denotes the Euclidean L2 norm.

    ‖F1 − Xi‖2 = r1i,        ∀Xi ∈ Pc
    ‖F1 − Xi‖2 ≤ or ≥ r1i,   ∀Xi ∈ Pmc    (5)

    where r1i = ε̃1i (b1 + D1^T Xi)    (6)

In effect, each point Xi desires F1 to be at a distance r1i from itself, derived from Eqn.6. Pc and Pmc are the sets of correctly classified and misclassified points, respectively. The inequalities above imply that the desired location ε̃1i can lie in an interval along an axis in ecc-Space (see Fig.2). In order to closely control the learning process, we learn one misclassified point at a time, while holding all the others fixed. This leaves us with only one inequality constraint.

We refer to the set of all feasible solutions to the above quadratic constraints as the Null Space of F1. Further, we have to pick an optimal F1 in this Null Space that maximizes the generalization capacity of the classifier. Although the general Quadratic Programming Problem is known to be NP-hard, the above constraints have a nice geometric structure that can be exploited to construct the Null Space in O(N^2 M) time. Note that, by assumption, the number of constraints N ≪ M. The Null Space of F1 with respect to each equality constraint in Eqn.5 is a hyper-sphere in R^M. Hence, the Null Space for all the constraints combined is simply the intersection of all the corresponding hyper-spheres in R^M with centers {X1, . . . , XN} and radii {r11, . . . , r1N}.
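As a concrete reading of Eqns.5-6 (a sketch with our own names, not the authors' code), the target radii and a feasibility check for a candidate focus can be written as:

    import numpy as np

    def focus_constraint_radii(X, b1, D1, eps1_desired):
        """Eqn. 6: r_1i = eps~_1i * (b1 + D1^T X_i) for every sample.
        X is (N, M); eps1_desired holds the desired class-1 eccentricities.
        Each sample constrains the focus F1 to the hyper-sphere of radius
        r_1i centred at X_i (with an inequality instead of an equality for
        the single misclassified sample being processed)."""
        return eps1_desired * (X @ D1 + b1)

    def satisfies_equality_constraints(F1, X, radii, tol=1e-8):
        """Check a candidate F1 against the equality constraints of Eqn. 5.
        This only tests feasibility; the paper constructs the whole feasible
        set (the Null Space) via hyper-sphere intersections."""
        return np.all(np.abs(np.linalg.norm(X - F1, axis=1) - radii) <= tol)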
Let XN be the single point being updated. [...] the intersection of N hyper-spheres problem is converted into the intersection of (N − 1) hyper-spheres and a hyper-plane H{1,2} problem.
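The omitted derivation (the paper's Section 3.4) is not reproduced here, but the conversion stated above is consistent with a standard identity; the following worked equation is our own illustration. Subtracting the equality constraints of Eqn.5 for two points, say X1 and X2, eliminates the quadratic term in F1:

    ‖F1 − X1‖^2 = r11^2,  ‖F1 − X2‖^2 = r12^2
    ⟹  2 (X2 − X1)^T F1 = (r11^2 − r12^2) + (‖X2‖^2 − ‖X1‖^2)

The result is a hyper-plane constraint on F1 (denoted here H{1,2}); intersecting it with either of the two hyper-spheres recovers the original pair of constraints.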
3.5. Updating The Directrix

We translate the origin to the first classified point, X1, so that b1 = v11. In addition, just as in Section 3.3, we learn a single misclassified point, say XN, in each iteration. With a known b1, we translate and scale all remaining points such that the linear constraints become:

    D1^T X̂i = v̂1i,        2 ≤ i < N
    D1^T X̂i ≤ or ≥ v̂1i,   i = N    (11)

    where X̂i = (Xi − X1) / ‖Xi − X1‖2    (12)
          v̂1i = (v1i − v11) / ‖Xi − X1‖2    (13)

Now the null space of D1 for each constraint in Eqn.11, considered separately, is a hyper-plane Hi in R^M represented as {−v̂1i, X̂i}. The null space corresponding to the quadratic constraint on D1 is a unit hyper-sphere, S1 ⊂ R^M, centered at the new origin. Hence, the final Null Space for D1 is the intersection of all the Hi's and S1.
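A small sketch of Eqns.11-13 follows (our own construction). The desired directrix values v1i are taken here to be ‖F1 − Xi‖2 / ε̃1i, which is what Eqn.1 implies for a desired eccentricity ε̃1i; that definition belongs to a part of the paper not reproduced above, so treat it as an assumption.

    import numpy as np

    def directrix_constraints(X, F1, eps1_desired):
        """Build the normalized linear constraints of Eqns. 11-13 (sketch).
        v_1i is assumed to be ||F1 - X_i|| / eps~_1i, the desired value of
        (b1 + D1^T X_i) implied by Eqn. 1.  Returns b1 (= v_11, after moving
        the origin to X_1) and the pairs (X_hat_i, v_hat_1i) whose
        hyper-planes H_i = {D : D^T X_hat_i = v_hat_1i}, together with the
        unit-sphere constraint ||D1|| = 1, make up the Null Space of D1."""
        v = np.linalg.norm(X - F1, axis=1) / eps1_desired   # desired b1 + D1^T X_i
        b1 = v[0]
        diff = X[1:] - X[0]
        norms = np.linalg.norm(diff, axis=1)
        X_hat = diff / norms[:, None]                       # Eqn. 12
        v_hat = (v[1:] - v[0]) / norms                      # Eqn. 13
        return b1, X_hat, v_hat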
We now make two critical observations. The intersection of a hyper-plane with a hyper-sphere is a lower-dimensional hyper-sphere. The same is true of the intersection of two hyper-spheres. We can therefore convert this hyperplane-hypersphere intersection problem into a hypersphere-hypersphere intersection problem. In effect, we can replace each hyper-plane Hi with a suitable hyper-sphere Si such that Hi ∩ S1 = Si ∩ S1. Owing to the geometry of the problem, we can compute Si from Hi and S1. The Null Space for all the constraints combined is now the intersection of all the hyper-spheres S1, S2, . . . , SN. The problem, now reduced to a hyper-spheres intersection problem, is solved as in Section 3.4.

3.6. Initialization

Given a set of labeled samples, we found that there are several ways of initializing the conic section descriptors that led to a solution. Random initializations converged to different conic descriptors each time, leading to inconsistent performance. We observed that, owing to Eqn.1, the Null Spaces are small or vanishing if the foci or directrices are very close to the samples. We found the following initialization to be consistently effective in our experiments. The foci were first placed at the sample class means and then pushed apart until they were outside the sample clouds. The normals to the directrices were initialized as the line joining the foci. The directrix planes were then positioned at the center of this line or on either side of the data.
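A sketch of this initialization is given below (our own reading): the amount by which the foci are pushed outside the sample clouds is not specified in the text, so the multiple of the class scatter used here, and the choice of a single shared directrix through the midpoint, are assumptions.

    import numpy as np

    def initialize_descriptors(X, y, push=2.0):
        """Sec. 3.6 initialization heuristic (sketch).  Places the foci at the
        class means pushed apart along the line joining them, sets the
        directrix normal to that line, and positions the directrix plane at
        the midpoint of the segment joining the two foci.
        Returns (F1, F2, b, D) for labels y in {1, 2}."""
        mu1, mu2 = X[y == 1].mean(axis=0), X[y == 2].mean(axis=0)
        D = (mu2 - mu1) / np.linalg.norm(mu2 - mu1)   # normal along the foci line
        # Push each focus past its own class's sample cloud (assumed rule).
        s1 = np.linalg.norm(X[y == 1] - mu1, axis=1).max()
        s2 = np.linalg.norm(X[y == 2] - mu2, axis=1).max()
        F1 = mu1 - push * s1 * D
        F2 = mu2 + push * s2 * D
        mid = 0.5 * (F1 + F2)
        b = -D @ mid    # so that b + D^T X = 0 on the plane through the midpoint
        return F1, F2, b, D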
3.7. Discussion

One of the core characteristics of our algorithm is that, after each update, any point that is correctly classified by the earlier descriptors is not subsequently misclassified. This is due to two reasons. First, we begin with an initialization that gives a valid set of assignments for the class attributed eccentricities. This implies that the Null Space for the correctly classified points is non-empty. Second, the search for updates within the Null Space always guarantees a feasible solution for the constraints related to the correctly classified points.

A key contribution of our technique is the tracking of the set of all feasible solutions as a compact geometric object. From this Null Space we pick a solution biased towards a linear discriminant so as to improve generalization. The size of the margin in ecc-Space also gives a modicum of control over generalization. The order in which samples are processed does not affect the final Null Space. The convergence of our learning algorithm depends on the data and the initialization; however, we found that it converged to a local minimum typically within 50 iterations of the focus and directrix updates.

4. Experiments

We evaluated the classifier on two synthetic datasets and four real datasets. The results were compared against several state-of-the-art linear and non-linear classifiers. The classification accuracies based on leave-one-out cross-validation are presented in Table 1.

Support Vector Machines (SVM) [2] and Kernel Fisher Discriminants (KFD) [7] broadly represented the non-linear category. Both employ the kernel trick of replacing inner products with Mercer kernels. Among the linear classifiers, we chose the Linear Fisher Discriminant (LFD) [4] and the linear SVM. We used the OSU SVM toolbox for MATLAB based on libSVM [13]. We considered Polynomial (PLY) and Radial Basis Function (RBF) kernels.

The best parameters were explored empirically. Polynomial kernels gave the best results with degree 1 or 2 and a scale approximately equal to the sample variance. The RBF kernel performed best when the radius was the sample variance or the mean distance between all sample pairs.

4.1. Results

Synthetic dataset-1 was randomly generated from two well separated Gaussian clusters in R^40. The results in Table 1 validate our classifier's effectiveness on simple, linearly separable data. Synthetic dataset-2 was generated by sampling from two intersecting paraboloids (related to the two classes) in R^3 and placing them in R^64. This instance shows that our classifier is well suited to data lying on paraboloids; it clearly out-performed the other classifiers.

Epilepsy data [10] consists of displacement vector fields between the left and right hippocampi for 31 epilepsy patients. The displacement vectors are computed at 762 discrete mesh points on each of the hippocampal surfaces, in 3D. This vector field, representing the non-rigid registration, captures the asymmetry between the left and right hippocampi. Hence, it can be used to categorize different classes of epilepsy based on the localization of the focus of epilepsy to either the left (LATL) or right temporal lobe (RATL).
Dataset            Size (N x M)    CSC     LFD     KFD PLY   KFD RBF   SVM PLY   SVM RBF
Synthetic Data1    20 x 40         100     100     100       100       100       100
Synthetic Data2    32 x 64         93.75   87.5    75        75        81.25     87.5
Epilepsy           31 x 2286       77.42   67.74   67.74     61.29     67.74     74.19
Colon Tumor        62 x 2000       87.1    85.48   75.81     82.26     82.26     85.48
UMIST FaceDB       575 x 10304     97.74   98.72   99.93     99.91     99.3      99.06
Texture Pair1      95 x 601        100     100     100       100       100       100
Texture Pair2      95 x 601        92.63   98.94   100       100       90.52     82.10

Table 1. Classification accuracies (%) for the Conic Section Classifier (CSC), the Linear and Kernel Fisher Discriminants, and the SVM. (See Sec. 4.)
The LATL vs. RATL classification is a hard problem. As seen in Table 1, our classifier out-performed all the others, with a significant margin over all but SVM-RBF. In fact, our result is better than that reported in [10]. The best RBF kernel parameters for the SVM and KFD methods were 600 and 1000, respectively; the best degree for the polynomial kernel was 1 for both of them.

The Colon Tumor data [1] comprises 2000 gene-expression levels for 22 normal and 40 tumor colon tissues. The normals to the directrix descriptors were initialized with the LFD direction in this case. Our classifier yielded 87% accuracy, outperforming the other classifiers. Interestingly, most of the other classifiers could not out-perform LFD, implying that they were learning the noise as well. Furey et al. [5] were able to correctly classify two more samples with a linear SVM, but only after adding a diagonal factor of two to the kernel matrix.

The Sheffield (formerly UMIST) Face Database [6] has 564 pre-cropped face images of 20 individuals with varying pose. Each image has 92 x 112 pixels with 256 gray-levels. Since we only have a binary classifier at present, the average classification performance over all possible pairs of subjects is reported. This turned out to be an easier problem: the conic section classifier achieved a comparable accuracy of about 98%, while the others were near 100%.

The CURET database [3] is a collection of 61 texture classes imaged under 205 illumination and viewing conditions. Varma et al. [9] built a dictionary of 601 textons and computed texton frequencies for a given sample image. The texton frequency histograms obtained from [8] can be used as the sample feature vectors for classification. About 47 images were chosen from each class, without any preferential order, so as to demonstrate the efficacy of our classifier for high-dimensional sparse data. We report the results for an easy pair and a relatively tougher pair of textures: Sand paper vs. Rough paper (Pair1) and Sand paper vs. Polyester (Pair2), respectively. As seen in Table 1, Pair1 indeed turned out to be the easier case. KFD out-performed the others for the second pair, and our classifier fared comparably.

5. Summary and Conclusions

In this paper, we have introduced a novel concept class based on conic section descriptors, provided a tractable supervised learning algorithm, and tested the resultant classifier against several state-of-the-art classifiers on many public domain datasets. Our classifier was able to classify the tougher datasets better than the others in most cases, as validated in Table 1. The classifier in its present form uses axially symmetric conic sections. In future work, we intend to extend this technique to multi-class classification and to conic sections that are not necessarily axially symmetric.

References

[1] U. Alon, et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. PNAS, 96:6745–6750, 1999.
[2] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[3] K. J. Dana, et al. Reflectance and texture of real-world surfaces. ACM Transactions on Graphics, 18(1):1–34, 1999.
[4] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience, 2001.
[5] T. Furey, et al. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(10):906–914, 2000.
[6] D. Graham and N. Allinson. Characterizing virtual eigensignatures for general purpose face recognition. NATO ASI Series F, Comp. & Sys. Sci., 163:446–456, 1998.
[7] S. Mika, et al. Fisher discriminant analysis with kernels. Neural Networks for Signal Processing, IX:41–48, 1999.
[8] E. Spellman, B. C. Vemuri, and M. Rao. Using the KL-center for efficient and accurate retrieval of distributions arising from texture images. In CVPR, pages 111–116, 2005.
[9] M. Varma and A. Zisserman. Texture classification: Are filter banks necessary? In CVPR, volume 2, pages 691–698, June 2003.
[10] N. Vohra, et al. Kernel Fisher for shape based classification in epilepsy. In MICCAI, pages 436–443, 2002.
[11] W. Zhao, et al. Face recognition: A literature survey. ACM Comput. Surv., 35(4):399–458, 2003.
[12] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1999.
[13] C. Chang and C. Lin. LIBSVM: a Library for Support Vector Machines (Version 2.31).