Numerical Coding of Nominal Data: January 2015
2 authors, including:
Zenon Gniazdowski
Warsaw School of Computer Science
All content following this page was uploaded by Zenon Gniazdowski on 28 August 2016.
Abstract
In this paper, a novel approach for coding nominal data is proposed. For the given
nominal data, a rank in the form of a complex number is assigned. The proposed
method does not lose any information about the attribute and brings other properties
previously unknown. The approach based on these new properties can be
used for classification. The analyzed example shows that classification using
coded nominal data, or both numerical and coded nominal data, is
more effective than classification that uses only numerical data.
1 Introduction
Different types of data are used in data analysis. Generally, they can be numerical data or
nominal data. Numerical data are linearly ordered: any two elements are either equal, or
one of them precedes the other. Nominal data cannot be naturally ordered. On a set of
nominal data, at most the identity equivalence relation can be defined, which means that
two elements can only be equal or different.
Specific methods of analysis have been developed for both types of data. Particular difficulties
arise when continuous and nominal data are analyzed simultaneously. Usually, continuous data
are discretized and then treated as nominal data. In this way, the opportunity to order
the data is lost. Alternatively, the procedure can be reversed: nominal data
are coded with the use of numbers [1]. Unfortunately, numerically coded nominal data cannot
be naturally ordered.
In this paper, a novel approach for coding nominal data with the use of complex numbers
is presented [2]. Each nominal value is assigned a rank in the form of a complex number.
The proposed approach can be employed for classification and clustering.
Zenon Gniazdowski, Michał Grabowski
the $j$-th subset ($j = 0, 1, \ldots, k-1$) can be coded with the use of $k$ successive roots of unity:
$$R_j = R \cdot \sqrt[k]{1} = R \cdot e^{i\varphi} = R \cdot (\cos\varphi + i\sin\varphi) \tag{2}$$
In the above expression $i = \sqrt{-1}$, $\varphi = 2\pi j/k$ ($j = 0, 1, \ldots, k-1$) and $R$ is the rank
calculated by formula (1). The value of $\varphi$ is the phase assigned to the successive ($j$-th)
nominal value. In the presented concept, $R$ is the modulus of the complex rank and depends on the
cardinality of the subset that contains the given nominal value. This approach gives the same modulus
$R$ to equinumerous subsets of identical nominal elements, and distinguishes the ranks of
different nominal values via different phases.
Table 4 shows an example of the ranking for the case when the values a, b and c each have
cardinality three, and the value d has cardinality six. The phases assigned to the nominal
values a, b and c are equal to 0, 2π/3 and 4π/3, respectively. Hence, the rank
assigned to a is real, and the ranks assigned to b and c are complex.
A real rank is also assigned to the nominal value d, because it is the only value with
its cardinality (k = 1), so its phase is 0.
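The phase assignment of formula (2) can be illustrated with a short Python sketch that mirrors the cardinalities described for Table 4. This is only a sketch under stated assumptions: the function name `complex_codes`, the `modulus` argument and the ordering of values within an equinumerous group (alphabetical here) are hypothetical, and the cardinality itself is used as a stand-in for the rank R, which in the paper is computed by formula (1).

```python
import cmath
import math
from collections import Counter, defaultdict

def complex_codes(values, modulus=float):
    """Map each distinct nominal value to R * exp(i * 2*pi*j / k).

    `modulus` is a hypothetical stand-in for formula (1): it turns the
    cardinality of a value into the modulus R of its complex rank.
    """
    counts = Counter(values)
    # Group distinct values by cardinality; one group = equinumerous values.
    groups = defaultdict(list)
    for value, n in counts.items():
        groups[n].append(value)
    codes = {}
    for n, members in groups.items():
        k = len(members)   # number of equinumerous values
        r = modulus(n)     # assumed modulus, shared by the whole group
        # Assumption: values inside a group are numbered in sorted order.
        for j, value in enumerate(sorted(members)):
            codes[value] = r * cmath.exp(1j * 2 * math.pi * j / k)
    return codes

# a, b and c occur three times each, d occurs six times.
codes = complex_codes(list("aaabbbcccdddddd"))
for value in sorted(codes):
    print(value, codes[value])
```

With these inputs, a and d receive phase 0 and therefore real codes, while b and c receive the phases 2π/3 and 4π/3.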
class will contain identical elements. The cardinality of each class is the only information
about the attribute that is important from the point of view of our analysis. Coding with the use
of complex numbers is unambiguous, i.e. after coding, different elements are still distinguishable.
In addition, it is also possible to define the corresponding equivalence relation, which divides
the set into equivalence classes, each with the same cardinality as before coding.
Coding does not lose any information about the attribute. The coded data receive additional
properties that enrich them. Before coding, the cardinality of a given value was an
external feature. Now, through the modulus, the cardinality is an inherent property of the coded
value of the attribute. The modulus presents information about the statistical strength of a given
subset of elements. The phase contains the information about the number of equinumerous
classes. Additionally, coding with the use of complex numbers brings other properties previously
unknown. Above all, all arithmetic operations can be performed on complex numbers.
Objects in data space can be viewed as vectors in a complex space. In this space, a scalar
product, a norm and a metric can be defined [4]. The scalar product of two complex vectors x and y
is defined as follows:
$$(x, y) = \sum_{i=1}^{n} x_i \overline{y}_i \tag{3}$$
This way the norm, which is generated by the above scalar product, can also be defined:
$$\|x\| = \sqrt{(x, x)} \tag{4}$$
The proposed methodology of coding can be employed for the analysis of nominal data. In
particular, because of the metric defined above, it can be used in a natural way for
clustering and classification.
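The scalar product (3) and the norm (4) translate directly into code. The following Python sketch is illustrative only: the function names are hypothetical, and the metric is assumed to be the one induced by the norm, d(x, y) = ||x − y||, which is the standard construction but is not written out explicitly in the text.

```python
import math

def scalar_product(x, y):
    """Formula (3): sum of x_i times the complex conjugate of y_i."""
    return sum(a * b.conjugate() for a, b in zip(x, y))

def norm(x):
    """Formula (4): square root of (x, x); (x, x) is real and non-negative."""
    return math.sqrt(scalar_product(x, x).real)

def distance(x, y):
    """Assumed norm-induced metric: d(x, y) = ||x - y||."""
    return norm([a - b for a, b in zip(x, y)])

# Two objects described by one numerical and one complex-coded attribute.
x = [1.0, 3 + 0j]
y = [1.0, 3 * (-0.5 + 0.5j * math.sqrt(3))]  # same modulus, phase 2*pi/3
print(norm(x), norm(y), distance(x, y))
```

In this example the two objects have equal norms, because their coded values share a modulus, yet the distance between them is non-zero because the phases differ.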
usefulness of the proposed complex coding, four classifications were made based on different
conditional attributes:
The k-means algorithm was used for classification. For this purpose, the data were
standardized, and the Euclidean norm was used to measure distances. The required number of
starting points for the k-means algorithm was chosen randomly. All these tests were repeated
twenty times. In none of these twenty cases was the sequence of randomly selected points
repeated.
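The experimental procedure (random starting points, norm-based distances, twenty repetitions) can be sketched in Python as below. This is a minimal illustration, not the implementation used in the experiments: the toy data, the function names and the use of one random seed per run are assumptions, standardization is omitted for brevity, and the comparison with the true classes is not shown.

```python
import random

def sq_dist(x, y):
    """Squared distance induced by the complex scalar product."""
    return sum(abs(a - b) ** 2 for a, b in zip(x, y))

def k_means(points, k, rng, max_iter=100):
    """Plain k-means on tuples of complex numbers with random starting points."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers

# Toy data (hypothetical): one complex-coded attribute per object.
points = [(0j,), (1 + 0j,), (10 + 10j,), (11 + 10j,)]

# Twenty runs, each with its own seed for the random starting points.
results = [k_means(points, 2, random.Random(seed))[0] for seed in range(20)]
print(results[0])
```

Each entry of `results` holds the cluster labels from one run; cluster numbering may differ between runs, so runs should be compared by which objects share a label rather than by the label values.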
After completion of the experiments, the results obtained for different conditional attributes
were compared. Table 7 shows the comparison of classification results for the different types of
data used. It can be seen that classification using coded nominal data, alone or together
with numerical data, is more effective than classification that uses ad hoc coding
or only numerical data. Based on the obtained results, it must be concluded that the information
contained in the coded nominal data is important for classification.
6 Conclusions
In this paper, a novel approach for coding nominal data was proposed. For the given nominal
data, a rank in the form of a complex number can be assigned. The modulus of this rank presents
information about the statistical strength of a given subset of elements. The phase contains the
information about the number of equinumerous values of the attribute.
The proposed methodology is unambiguous. After coding, different values of the attribute are still
distinguishable. The method does not lose any information about the attribute. Additionally,
the coded data receive previously unknown properties that enrich them. Above all, all arithmetic
operations can be performed on complex numbers. In complex space, a scalar product, a norm
and a metric can be defined. This means that the coded data may be used for clustering and
classification.
References
[1] M. Grabowski and M. Korpusik. Metrics and similarities in modeling dependencies
between continuous and nominal data. Zeszyty Naukowe WWSI, 7(10):25–37, 2013.
[2] Z. Gniazdowski. Numerical coding of nominal data. Seminar, Warsaw School of Computer
Science, May 15, 2014.
[3] F. Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83,
1945.
[4] S. G. Krejn. Analiza funkcjonalna. PWN, Warszawa, 1967.
[5] L. Rutkowski. Metody i techniki sztucznej inteligencji. Wydawnictwo Naukowe PWN,
Warszawa, 2012.
[6] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer,
New York, 2001.
[7] R. Bellman. Adaptive Control Processes. A Guided Tour. Princeton University Press,
Princeton, 1961.
[8] J. Koronacki and J. Ćwik. Statystyczne systemy uczące się. Akademicka Oficyna
Wydawnicza EXIT, 2008.