
A Rank-based Approach of Cosine Similarity with Applications in Automatic Classification

Liviu P. Dinu and Radu-Tudor Ionescu


Department of Computer Science
University of Bucharest
14 Academiei, Bucharest, Romania
E-mails: [email protected], [email protected]

Abstract—This paper introduces a new rank-based approach of the cosine similarity that can be applied in the competitive area of automatic classification. We describe the method and then present some of its mathematical and computational properties. Tests are performed on a dataset consisting of handwritten digits extracted from a collection of Dutch utility maps. The experimental results are compared with other reported results based on different combining methods which used the same dataset. The obtained results show that the rank-based approach of the cosine similarity can successfully be used for automatic classification or similar tasks.

Keywords—rank distance; cosine similarity; median string; classification; digits classification; handwritten digits classification; voting; aggregation scheme; combination scheme; combination rule.

I. INTRODUCTION

Decision-making processes are a common and frequent task for most of us in our daily life. Many decision-making problems can be solved with the help of computers. A very common task of this kind is object classification. Usually, the solutions for object classification or similar decision problems are approximate, and researchers continuously investigate new methods that improve accuracy and performance. Most of the proposed solutions are based on statistics or machine learning. For many state-of-the-art machine learning techniques (such as SVM [1] and kernel methods [2]) the decision is taken based on some kind of similarity or dissimilarity between objects. Currently, a very large number of distance measures exists in the literature and there are efforts to group them in specific areas of interest [3].

Despite the great number of distances in the literature, new distances are periodically explored. Mostly, these tools are based on some quantitative properties or features of the objects involved. The last decade has revealed, in some practical cases, the advantage of ordinal distances in similarity and classification methods [4, 5].

In some cases, regarding features as ordinal variables performs better than regarding them as frequencies or other measures. Looking at features as ordinal variables means that in the calculation of a distance/similarity function, the ranks of the features (made up according to their quantitative properties) will be used rather than the actual values of these quantitative properties. Using the ranking in the computation of the distance/similarity measure instead of the actual values of the frequencies may seem like a loss of information. From another point of view, the process of ranking makes the distance/similarity measure more robust, acting as a filter and eliminating the noise contained in the values of the quantitative properties. For example, the fact that a particular feature has rank 2 (is the second most frequent feature) in one object and rank 4 (is the fourth most important feature) in another object can be more relevant than the fact that the same feature appears 34% of the time in the first object and only 29% of the time in the second one.

Researchers periodically study and develop new methods for automatic object classification. There are many algorithms that are able to solve classification tasks with great accuracy. Starting with a similarity measure based on rankings, the idea of combining state-of-the-art classifiers using this similarity measure comes naturally. The problem of combining classifiers has been intensively studied in recent years [6] and various classifier schemes [7, 8, 9] have been devised and tested in different domains; document classification, document image analysis, biometric recognition (personal identification based on various physical attributes such as iris, face, or fingerprint) and speech recognition are a few of them. A typical combination scheme consists of a set of individual classifiers and a combiner which aggregates the results of the individual classifiers to make the final decision. In many situations, the results of individual classifiers are rankings (ordered lists of objects). Each ranking can be considered as being produced by applying an ordering criterion to a given set of objects.

The concept of ordering several objects, and consequently obtaining a ranking, is encountered in many situations: an electoral process (where the ordering criterion between the participants is given by the number of votes they gained); the results of a football tournament (where the criterion is the number of points obtained by each team at the end of the tournament), etc. However, in the general case there is no simple method to decide which ordering criterion to use, and, as a consequence, it becomes difficult to define and build the ranking. To support this statement, we mention situations like selecting documents based on multiple criteria, building search engines for the Web [10], or finding the author of a given text. Examples of multi-criteria selection arise when trying to select a product from a database of products, such as travel plans or restaurants (users might rank restaurants based on several different criteria like cuisine, driving distance, ambiance, star-rating, etc.). Other situations in which we combine rankings are those in which we take decisions based on subjective or sensorial criteria (for example, perceptions). Especially when working with perceptions, but not only, we face the situation of operating with rankings of objects where the essential information is not given by the numerical value of some parameter of each object. Instead, this information is given by the position occupied by the object in the ranking, as in movie or music charts (according to a natural hierarchical order, in which on the first place we find the most important element, on the second place the next one, and on the last position the least important element). In order to make a decision in all these situations, we have to combine two or more rankings which have been ordered by using different criteria. We deal with the so-called rank aggregation problem.

Let us describe the paper organization. Section II gives a brief discussion about rankings and associated metrics. A rank-based approach of the well-known cosine similarity is proposed in Section III. This section also introduces a ranking combination scheme based on the cosine rank similarity. The performance of our method on the dataset consisting of handwritten digits is tested in Section IV. Here we compare our experimental results with other reported results on the same dataset that were obtained with different combining methods. Finally, in Section V we draw our conclusions and outline some extensions of the proposed similarity.

II. ORDINAL MEASURES

A. Rankings

A ranking is an ordered list of objects. Every ranking can be considered as being produced by applying an ordering criterion to a given set of objects.

Let U be a finite set of objects, called the universe of objects. We assume, without loss of generality, that U = {1, 2, . . . , |U|} (where by |U| we denote the cardinality of U). A ranking over U is an ordered list τ = (x1 > x2 > . . . > xd), where {x1, . . . , xd} ⊆ U and > is a strict ordering relation on {x1, . . . , xd}, what we have called in Section I an ordering criterion. It is important to point out that xi ≠ xj if i ≠ j. For a given object i ∈ U present in τ, τ(i) represents the position (or rank) of i in τ.

A ranking thus defines a partial function on U where, for each object i ∈ U present in τ, τ(i) represents the position of the object i in the ranking τ. Observe that the objects with high rank in τ have the lowest positions.

If the ranking τ contains all the elements of U, then it is called a full list (ranking). It is obvious that the full lists represent all the total orderings of U (the same as the permutations of U). However, there are situations (see [10] for example) when some objects cannot be ranked by a given criterion: the list τ contains only a subset of elements from the universe U. Then, τ is called a partial list (ranking). We denote the set of elements in the list τ with the same symbol as the list.

B. Metrics on Rankings

Usually, the distance measures between rankings are defined for the case of full lists. Some of the most used measures are (see [11]) the Spearman footrule distance and the Kendall tau distance.

A problem arises when one tries to apply the distances above to partial lists: in most cases the newly defined functions do not preserve the property of being a metric (as shown in [10]). In [12] a distance is introduced which preserves this property, namely the rank distance. It is the distance that we will use in developing our method. In the following, we briefly present the rank distance.

A few preliminary notations are explained first. Let σ = (x1 > x2 > . . . > xn) be a partial ranking over U; we say that n is the length of σ. For an element x ∈ U ∩ σ, one defines the order of the object x in the ranking σ as ord(σ, x) = |n + 1 − σ(x)|. In other words, one assigns different weights to each element of the ranking in decreasing order from top to bottom. More precisely, one assigns the highest order n to the first element of the ranking, then the order n − 1 to the second element, and so on, until the final element is assigned the lowest order, 1. If x ∈ U \ σ, we have ord(σ, x) = 0.

Definition 1: Given two partial rankings σ and τ over the same universe U, we define the rank distance between them as:

    ∆(σ, τ) = Σ_{x ∈ σ∪τ} |ord(σ, x) − ord(τ, x)|.

Example 1: Let σ = (1 > 2 > 3 > 4) and τ = (5 > 1 > 2) be rankings over the universe U = {1, 2, 3, 4, 5}. According to Definition 1, we have:

    ∆(σ, τ) = Σ_{x ∈ {1,...,5}} |ord(σ, x) − ord(τ, x)|
            = |4 − 2| + |3 − 1| + |2 − 0| + |1 − 0| + |0 − 3| = 10.

In [12] Dinu proves that ∆ is a distance function. The rank distance is an extension of the Spearman footrule distance [13].

Note that the rank distance can be extended, in the intuitive way, to compute the distance from one ranking to a multiset of rankings, by computing the distance between the given ranking and each ranking from the multiset and then adding these values:

Definition 2: Let T = {L1, L2, . . . , Ln} be a multiset of n rankings and let L be a ranking. Then, the rank distance from L to T is defined by:

    ∆(L, T) = Σ_{i=1}^{n} ∆(L, Li).
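For illustration only (this sketch is not part of the original paper), the rank distance of Definitions 1 and 2 can be computed with a few lines of Python; rankings are represented as lists ordered from the most to the least important element, and the function names are our own:

    def ord_weight(ranking, x):
        # Order of object x in a partial ranking of length n: n for the first
        # element, n - 1 for the second, ..., 1 for the last, and 0 if x is absent.
        n = len(ranking)
        return n - ranking.index(x) if x in ranking else 0

    def rank_distance(sigma, tau):
        # Definition 1: sum of absolute differences of orders over sigma union tau.
        universe = set(sigma) | set(tau)
        return sum(abs(ord_weight(sigma, x) - ord_weight(tau, x)) for x in universe)

    def rank_distance_to_multiset(L, T):
        # Definition 2: total rank distance from ranking L to a multiset T of rankings.
        return sum(rank_distance(L, Li) for Li in T)

    # Example 1: sigma = (1 > 2 > 3 > 4) and tau = (5 > 1 > 2) give distance 10.
    print(rank_distance([1, 2, 3, 4], [5, 1, 2]))  # 10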
The motivation for the usage of the objects' order in a ranking, instead of the rank itself, comes from at least two directions. First, one considers that the distance between two rankings should be greater if they differ more at the top (on the highly ranked objects), since in many applications the low-ranked objects are neglected. Consequently, the objects with high ranks should have a greater weight. Second, the length of the rankings is also important: if a ranking is longer, we consider that the criterion that produced it performed a more profound analysis of the objects and, hence, is more reliable than another criterion that produced a shorter ranking. For example, although two rankings of different lengths may have the same object on the first position, this object has different orders (in the sense of the above definition) in the two rankings, and this difference should be reflected in the total distance.

Regarding time complexity, the time needed to compute the rank distance between two partial rankings over a universe U is linear in the cardinality of U. Thus, the rank distance is easy to implement and has an extremely good computational behavior. Another advantage of the rank distance is that it imposes minimal hardware demands: it runs in optimal conditions on modest computers, reducing the costs and increasing the number of possible users. For example, the time needed to compare a DNA string of 45,000 nucleotides with 150 other DNA strings (of similar length), using a laptop with 224 MB RAM and a 1.4 GHz processor, is no more than six seconds. Two algorithms that compute the rank distance in linear time are given in [14].

III. COSINE RANK SIMILARITY

A. Cosine Similarity

Cosine similarity is a measure of similarity between two vectors of n dimensions obtained by finding the cosine of the angle between them. Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is expressed using a dot product and magnitudes as:

    similarity = cos(θ) = (A · B) / (‖A‖ ‖B‖).

For text matching, the attribute vectors A and B are usually the term frequency vectors of two text documents. The cosine similarity can be seen as a method of normalizing document length during comparison.

The resulting similarity ranges from −1, meaning exactly opposite, to 1, meaning exactly the same; 0 indicates orthogonality, and in-between values indicate intermediate similarity or dissimilarity between the vectors A and B.

In the case of information retrieval, the cosine similarity of two documents ranges from 0 to 1, since the term frequencies (TF-IDF weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90°; when the angle is 90°, the two term frequency vectors are orthogonal, i.e., the documents share no terms.
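As a quick illustration (ours, with made-up term frequency vectors), the standard cosine similarity can be computed as follows:

    import math

    def cosine_similarity(a, b):
        # Standard cosine similarity between two equal-length numeric vectors.
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)

    # Non-negative term frequency vectors, so the result lies in [0, 1].
    print(cosine_similarity([3, 0, 2, 1], [1, 1, 0, 2]))  # about 0.55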
B. Rank-based Cosine Similarity

Definition 3: Given two full rankings f = (f1, f2, . . . , fn) and g = (g1, g2, . . . , gn) over the same universe U of cardinality |U| = n, we define the CosRank(f, g) similarity as follows:

    CosRank(f, g) = ⟨f, g⟩ / (‖f‖ ‖g‖)
                  = ⟨f, g⟩ / (1² + 2² + . . . + n²)
                  = Σ_{x ∈ U} ord(x | f) · ord(x | g) / (1² + 2² + . . . + n²).

In other words, we are interested in the positions of x in f and g, respectively, and we use a cosine-like measure. Note that CosRank(f, g) = 1 if and only if f and g are identical rankings. As f and g become more different, CosRank(f, g) gets closer to 0.

In what follows, we will use the 1 − CosRank measure. It is not hard to show that 1 − CosRank is a distance function. Let us consider the following standard definition of a distance function.

Definition 4: A metric on a set X is a function (called the distance function or simply distance) d : X × X → R. For all f, g, h ∈ X, this function is required to satisfy the following conditions:
1. d(f, g) ≥ 0 (non-negativity, or separation axiom);
2. d(f, g) = 0 if and only if f = g (coincidence axiom);
3. d(f, g) = d(g, f) (symmetry);
4. d(f, g) ≤ d(f, h) + d(h, g) (triangle inequality).

Note that conditions 1 and 2 together produce positive definiteness. Also, observe that the first condition is implied by the others.

Since CosRank(f, g) = 1 if and only if f and g are identical rankings, it follows that 1 − CosRank(f, g) = 0 if and only if f = g. Thus, we have the coincidence axiom. From Definition 3, observe that CosRank is symmetric (i.e., CosRank(f, g) = CosRank(g, f)), since:

    Σ_{x ∈ U} ord(x | f) · ord(x | g) = Σ_{x ∈ U} ord(x | g) · ord(x | f).

The triangle inequality is ensured by the fact that f, g and h are all full rankings.
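For illustration, here is a minimal Python sketch (ours, not from the paper) of the CosRank similarity of Definition 3, assuming both rankings are full lists over the same universe, given from the most to the least important element:

    def cos_rank(f, g):
        # CosRank of Definition 3 for two full rankings over the same universe.
        n = len(f)
        # Order of an element: n for the first position, n - 1 for the second, ..., 1 for the last.
        ord_f = {x: n - i for i, x in enumerate(f)}
        ord_g = {x: n - i for i, x in enumerate(g)}
        numerator = sum(ord_f[x] * ord_g[x] for x in f)
        denominator = sum(i * i for i in range(1, n + 1))  # 1^2 + 2^2 + ... + n^2
        return numerator / denominator

    f = [1, 2, 3, 4]
    print(cos_rank(f, f))             # 1.0 for identical rankings
    print(cos_rank(f, [4, 3, 2, 1]))  # about 0.67 for the reversed ranking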
C. CosRank Combining Scheme

In [15] we introduced a rank-based combining scheme starting from the aggregation problem. We applied this scheme with good results in digit recognition [15] and in text categorization [4]. Starting from the ideas developed in [15], we introduce a similar combining scheme based on an aggregation problem, only that here we use the CosRank similarity instead of the rank distance.

In a selection process, rankings are issued for a common decision problem, therefore a ranking that "combines" all the original (base) rankings is required. One commonsense solution is finding a ranking that is as close as possible to all the particular rankings. Apart from the many paradoxes of different aggregation methods, this problem is NP-hard for most non-trivial distances [16]. On the other hand, a solution for this problem can be found in polynomial time for the rank distance [17].

Formally, given a multiset T = {τ1, τ2, ..., τk}, we aggregate the rankings by using the 1 − CosRank distance. This means that we are looking for those rankings σ that have a minimal 1 − CosRank distance to all the rankings in the multiset. In other words, we have to minimize the sum:

    1 − CosRank(σ, T) = Σ_{τ ∈ T} (1 − CosRank(σ, τ)).

The CosRank classification scheme has two important steps. The first one is to obtain all the rankings which minimize the sum above, in a similar fashion to the solution of the rank aggregation problem [17]. The second step is to apply voting on all the obtained rankings.
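To illustrate the objective being minimized (not the efficient procedure of [17]), here is a brute-force Python sketch of ours that searches all permutations of a small universe for the rankings with minimal total 1 − CosRank distance to a multiset; the CosRank helper is repeated so the sketch is self-contained:

    from itertools import permutations

    def cos_rank(f, g):
        # CosRank of Definition 3 for full rankings given from most to least important.
        n = len(f)
        ord_f = {x: n - i for i, x in enumerate(f)}
        ord_g = {x: n - i for i, x in enumerate(g)}
        return sum(ord_f[x] * ord_g[x] for x in f) / sum(i * i for i in range(1, n + 1))

    def aggregate_brute_force(T):
        # Return all full rankings sigma minimizing the sum of 1 - CosRank(sigma, tau) over tau in T.
        universe = sorted(T[0])
        cost = lambda sigma: sum(1 - cos_rank(list(sigma), tau) for tau in T)
        best = min(cost(sigma) for sigma in permutations(universe))
        return [list(sigma) for sigma in permutations(universe) if abs(cost(sigma) - best) < 1e-12]

    # Three base rankings over {1, 2, 3, 4}; voting would then be applied to the result.
    T = [[1, 2, 3, 4], [2, 1, 3, 4], [1, 3, 2, 4]]
    print(aggregate_brute_force(T))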
IV. APPLICATION

In this section we make a comparative study regarding the behavior of six combining schemes on the same input data set. The input dataset consists of handwritten digits extracted from a collection of Dutch utility maps. We compare the results of the CosRank combining scheme with the rank distance combining (RDC) scheme from [15].

In [8] an experiment regarding the error rate (in percentage) of different classifiers and classifier combination schemes on a digit classification problem is reported. We use the same data to test the performance of our combining scheme.

A brief description of the experiment is available in [15]. For a more comprehensive description see [8] and [18]. The experiments are done on a data set which consists of six different feature sets for the same set of objects. The six feature sets are:

• Fourier: 76 Fourier coefficients of the character shapes.
• Profiles: 216 profile correlations.
• KL-coef: 64 Karhunen-Loève coefficients.
• Pixel: 240 pixel averages in 2 × 3 windows.
• Zernike: 47 Zernike moments.
• Morph: 6 morphological features.

The 12 classifiers (c1, . . . , c12) used in the experiment and the 6 combining rules are listed in [15]. The results of the CosRank combining scheme on the dataset consisting of handwritten digits are summarized below.

In Table I we present the results for the 12 individual classifiers (c1, c2, ..., c12). Note that the combining rules are applied on the six feature sets for a single classification rule. The results obtained with CosRank are better than those obtained with the RDC scheme for the top classifiers. This is a very good result if we consider that the RDC scheme was one of the best schemes tested in [15].

Table I
CosRank vs. RDC: results for the 12 individual classifiers. The success rate of each classifier (listed in rows) is given for each combining scheme (listed in columns).

    Classifier   CosRank   RDC
    c1           97.3%     96.5%
    c2           97.9%     97.1%
    c3           94.8%     93.9%
    c4           97.8%     97.3%
    c5           97.8%     97.2%
    c6           98.4%     97.5%
    c7           96.6%     97.2%
    c8           65.8%     85.0%
    c9           51.8%     74.5%
    c10          49.0%     81.9%
    c11          77.1%     95.0%
    c12          94.3%     95.8%

Table II shows the results of the CosRank and RDC combining rules on each of the six feature sets. It seems that CosRank has a slightly lower accuracy than RDC, but it is still above most of the combining rules tested in [15].

Table II
CosRank vs. RDC: results for the six feature sets. The success rate for each feature set (listed in rows) is given for each combining scheme (listed in columns).

    Feature set   CosRank   RDC
    f1            82.6%     83.6%
    f2            95.5%     96.6%
    f3            96.0%     96.7%
    f4            94.5%     96.6%
    f5            80.9%     83.4%
    f6            68.0%     70.7%

Finally, we applied all 12 classifiers to all six feature sets and obtained, for each test object, a multiset of 72 rankings. In [15] these rankings are combined by using the RDC scheme. The reported success rate of RDC is 98.2%. Using the CosRank combining scheme instead of RDC, we obtain a success rate of 97.9%. However, if we aggregate the best classifiers with the feature sets f1, f2, ..., f6, the success rate of the CosRank classification scheme is 98.7%. This is the best success rate over all methods.
V. CONCLUSIONS

In this paper we introduced the CosRank similarity, a rank-based approach of the cosine similarity. It is an ordinal measure that can be computed in linear time. We tested the similarity on a dataset consisting of handwritten digits extracted from a collection of Dutch utility maps. Our results are compared with other reported results based on a different combining method, namely RDC [15]. We conclude that the rank-based approach of the cosine similarity is the best over all methods that report results on the same dataset. In future work we intend to propose a weighted variant of the CosRank combining method.

ACKNOWLEDGMENT

The contribution of the authors to this paper is equal. Radu Ionescu thanks his Ph.D. supervisor Denis Enachescu from the University of Bucharest. The research of Liviu P. Dinu was supported by the CNCS-PCE Idei grant 311/2011. The authors also thank the anonymous reviewers for their helpful comments.

REFERENCES

[1] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[2] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[3] E. Deza and M.-M. Deza, Dictionary of Distances. The Netherlands: Elsevier, 1998.
[4] L. P. Dinu and A. Rusu, "Rank distance aggregation as a fixed classifier combining rule for text categorization," in Proceedings of CICLing 2010, pp. 638–647, 2010.
[5] L. P. Dinu and M. Popescu, "Ordinal measures in authorship identification," in Stein, Stamatatos, Koppel, Agirre (eds.), PAN '09, pp. 62–66, 2009.
[6] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, "On combining classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226–239, 1998.
[7] T. K. Ho, J. Hull, and S. Srihari, "Decision combination in multiple classifier systems," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 1, pp. 66–75, 1994.
[8] A. K. Jain, R. P. W. Duin, and J. Mao, "Statistical pattern recognition: A review," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4–37, 2000.
[9] C.-L. Liu, "Classifier combination based on confidence transformation," Pattern Recognition, vol. 38, pp. 11–28, 2005.
[10] C. Dwork, R. Kumar, M. Naor, and D. Sivakumar, "Rank aggregation methods for the web," in Proceedings of the 10th International Conference on WWW, pp. 613–622, 2001. [Online]. Available: http://doi.acm.org/10.1145/371920.372165
[11] P. Diaconis, "Group representations in probability and statistics," IMS Lecture Notes Monograph Series, vol. 11, 1988.
[12] L. P. Dinu, "On the classification and aggregation of hierarchies with different constitutive elements," Fundamenta Informaticae, vol. 55, no. 1, pp. 39–50, 2003.
[13] P. Diaconis and R. L. Graham, "Spearman's footrule as a measure of disarray," Journal of the Royal Statistical Society, Series B (Methodological), vol. 39, no. 2, pp. 262–268, 1977.
[14] L. P. Dinu and A. Sgarro, "A low-complexity distance for DNA strings," Fundamenta Informaticae, vol. 73, no. 3, pp. 361–372, 2006.
[15] L. P. Dinu and M. Popescu, "A multi-criteria decision method based on rank distance," Fundamenta Informaticae, vol. 86, no. 1–2.
[16] C. de la Higuera and F. Casacuberta, "Topology of strings: Median string is NP-complete," Theoretical Computer Science, vol. 230, pp. 39–48, 2000.
[17] L. P. Dinu and F. Manea, "An efficient approach for the rank aggregation problem," Theoretical Computer Science, vol. 359, no. 1–3, pp. 455–461, 2006.
[18] R. P. W. Duin and D. M. J. Tax, "Experiments with classifier combining rules," in Proceedings of MCS '00, pp. 16–29, 2000.
