Column Vectorizing Algorithms For Support Vector Machines: Chen Zhiyuan, Dino Isa and Peter Blanchfield

This document discusses algorithms for vectorizing columns in support vector machines. It presents two algorithms: 1) a discrete vectorization algorithm that derives a vector value from the original column value and builds a vector table based on the value using functions, and 2) a continuous vectorization algorithm. The vectorization process converts textual data from tables into numerical vector form to allow classification using a support vector machine in a hybrid data mining and case-based reasoning system.

Copyright
© Attribution Non-Commercial (BY-NC)

(IJCNS) International Journal of Computer and Network Security, Vol. 2, No. 2, February 2010

Column Vectorizing Algorithms for Support Vector Machines

Chen ZhiYuan (1), Dino Isa (2) and Peter Blanchfield (3)

(1) University of Nottingham, School of Computer Science,
Jalan Broga 43500 Semenyih Selangor Malaysia
[email protected]

(2) School of Electronic Engineering, University of Nottingham,
Jalan Broga 43500 Semenyih Selangor Malaysia
[email protected]

(3) School of Computer Science, University of Nottingham,
Nottingham, NG8 1BB, UK
[email protected]

Abstract: In this paper we present a vectorization method for support vector machines in a hybrid Data Mining and Case-Based Reasoning system. The system incorporates a vector model that helps convert textual information into numerical vectors, so that real-world information is better suited to the data mining engine. The approach is implemented by two algorithms: the discrete vectorization algorithm and the continuous vectorization algorithm. The basic idea of the vectorization algorithm is to derive an X value from the original column value and, where the vector value is unavailable, to build a vector table based on the X value by using appropriate functions. Subsequently, the vector model is classified using a support vector machine and retrieved from the case-based reasoning cycle using a self-organizing map.

Keywords: Vectorization, Support Vector Machine, Data Mining, Artificial Intelligence, Case-Based Reasoning.

1. Introduction

The problem faced by traditional database technology developers today is a lack of intelligence support, while artificial intelligence techniques [1] are limited in their capacity to supply and maintain large amounts of factual data. This paper provides a method to solve this problem. From a database point of view, there was an urgent need to address the problems caused by the limited intelligent capabilities of database systems, in particular relational database systems. Such limitations implied the impossibility of developing, in a pure database context, certain facilities for reasoning, problem solving, and question answering. From an artificial intelligence point of view, it was necessary to transcend the era of operating on numerical signals in order to achieve a real information management system able to deal with large amounts of textual data. Our approach was explicitly designed to support efficient vectorization techniques by providing multiple number resources with minimum inter-dependencies and irregular constraints, yet under strict artificial intelligence considerations. It features a table in a relational database through two types of vectorizing functions, supporting the construction of the support vector machine.

The rest of this paper is organized as follows: Section 2 presents objectives and related techniques. Section 3 describes in detail the architecture of the hybrid system. Section 4 provides the procedure of vectorization. Section 5 explains the conducted experiments. The conclusion is discussed in Section 6.

2. Objectives and Foundation

Our research group works on the design of flexible and adaptable user-oriented hybrid systems which aim to combine database technology and artificial intelligence techniques. The preprocessing procedure is related to the data vectorization step of a classification process, going from low-level data mining processes [2] to high-level artificial intelligence techniques. Many domain-specific systems, such as user modeling systems [3] or artificial intelligence hybrid systems, have been described in the literature [4] [5] [6]. Even when the applied strategies are designed to be as generic as possible, the illustrations given for these systems are limited to text documents and do not develop any vectorizing algorithm to quantify the raw textual input data set into a numeric data set.

Actually, to the best of our knowledge, no such complete and generic vectorization process exists, because of the necessity to have excellent know-how in the implementation of a hybrid intelligent system. Many existing systems have been developed on the basis of using artificial intelligence techniques to provide semantic support to a database system, or database techniques to aid an artificial intelligence system in dealing with large amounts of information. The key factors they are concerned with reside in the exploitation of the equivalence between database fields and the knowledge representation system of artificial intelligence.

In our hybrid system, the vector is the unique representation of data, considering system consistency. On the other hand, for both the data mining process and the case-based reasoning cycle
[7], vectorization and consistency are crucial. The role of vectorization is to convert text tables, which are stored in the SQL server, into numerical vector form. Traditional vectorization methods concentrate on converting image objects into raster vectors or raw line fragments, whereas we focus on table column features and describe how they can be vectorized automatically using two kinds of vectorization functions.

In order to describe the foundation of the vectorization, the framework of our hybrid system is briefly described in the following section.

3. Hybrid System Architecture Overview

The concepts of this project are as follows:
• To develop a hybrid data mining and case-based reasoning user modeling system.
• To combine data mining technology and artificial intelligence pattern classifiers as a means to construct a Knowledge Base and to link this to the case-based reasoning cycle in order to provide domain-specific, user-relevant information to the user in a timely manner.
• To use the self-organizing map [8] in the CBR cycle in order to retrieve the most relevant information for the user from the knowledge base.

Based on these concepts, the architecture illustrated in Figure 1 has been designed. The hybrid system contains five main components:
• Individual models, comparable to a blackboard containing the user information from the real world.
• A domain database, which integrates the preselected domain information [9].
• A data mining engine, which classifies both user class and domain information vectors.
• A knowledge base, containing the representation of classified user information combined with the domain knowledge of interest.
• A problem-solving life cycle called the case-based reasoning cycle, which assists in retrieving, reusing, revising and retaining the knowledge base.

[Figure 1: block diagram in three parts labeled Data Mining, User Model and CBR. Labeled elements: User ID, User interface, Query, Individual Model, Domain Database, Vectorization, Data mining engine (SVM), User Model, Knowledge Base, Human expert, SOM, Retrieved Case, Proposed Solution, Confirmed Solution, and the CBR steps RETRIEVE, REUSE, REVISE, RETAIN.]

Figure 1. The architecture of the system

4. Vectorization

As can be seen from the hybrid system architecture, the support vector machine is applied in order to classify individual models and domain information into the user model. Individual models are user information in table format, stored in the SQL server. Domain information in the database also consists of tables, which store the preselected user-preferred knowledge. The support vector machine [10] [11] is one of the AI techniques and serves as the classifier in the system. The main idea of a support vector machine is to construct a hyperplane as the decision surface in such a way that the margin of separation between positive and negative features is maximized. The vectorization step is the data preprocessing for the support vector machine, which provides the numeric feature vector.

4.1 Feature Type

For the vectorization task to be as accurate as possible, we predefined two types of table columns, which we call feature types: discrete columns (features) and continuous columns (features).

A discrete feature contains discrete values, in that the data represents a finite, counted number of categories. The values in a discrete attribute column do not imply ordered data, even if the values are numeric; their distinguishing characteristic is that the values are clearly separated. A telephone area code is a good example of discrete data that is numeric.

A continuous feature contains values that represent a continuous set of numeric and measurement data, and it is possible for the data to contain an infinite number of fractional values. An income column is an example of a continuous column.

Being numeric is not the deciding factor for the feature type, but if the value is a word then it must be a discrete feature.

4.2 Vectorization algorithm

From the technology point of view, vectorization is an approach to modeling relationships between the data set and the vectorizing variable. We provide a more flexible approach by allowing some of the features (columns) to be independent and some of the features to be interdependent. We construct two parallel algorithms to avoid time-consuming processing and to save a large amount of effort.

The schema of the algorithm, which derives the numeric vector by implementing different functions, is specified in Figure 2. The schema is not exhaustive and can evolve with new data, according to user need.

Furthermore, once the type of the column has been determined, adding a new record is quite straightforward. These functions are also well suited to dealing with incomplete data. Instances with missing attributes can be handled by summing or integrating the values of other attributes.

We represent each column as a data point in a Z-dimensional space, where Z is the total number of attributes (columns). The algorithm computes the vectorizing value (or representation value) between each feature, denoted by the abscissa axis, and the vector, denoted by the y-axis, and every feature value determines its own vectorizing value. Once the vectorizing value list is obtained, the vector model will be classified based on the implementation of
support vector machine so that the core of the hybrid system, the knowledge base, will be constructed completely.

Figure 2. The schema of the vectorization algorithm

The detailed vectorization algorithms are described in Table 1 and Table 2 for discrete columns and continuous columns respectively.

Table 1. The discrete column vectorization algorithm

1: Let V be the representation of vectors, D be the whole set of the vector model and d be the set of discrete columns.
2: FOR each data point Z DO
3: Select $Z_d$, the discrete features of all data points,
4: Compute $V_d = (V_{dx}, V_{dy})$, the corresponding value between Z and every vector, $(V_{dx}, V_{dy}) \in D$.
5: $V_{dx} = n_d$, $V_{dy} = \frac{1}{n} \times n_d$; $V_{dy} \in [0, 1]$.
6: END FOR

Table 2. The continuous column vectorization algorithm

1: Let V be the representation of vectors, D be the set of the vector model and c be the set of continuous columns.
2: FOR each data point Z DO
3: Select $Z_c$, the continuous features of all data points,
4: Compute $V'_c = (V'_{cx}, V'_{cy})$, the corresponding value between Z and every vector, $(V'_{cx}, V'_{cy}) \in D$.
5: $V'_{cx} = \frac{V_{cx} - \mathrm{Avg}V_{cx}}{\mathrm{Max}V_{cx}}$, $V'_{cy} = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$, $V'_{cy} \in [-1, +1]$.
6: END FOR

The key computation of these two algorithms is the vectorization value formula given in step 5 of both tables.

Formula 1:

$$V_{dx} = n_d, \qquad V_{dy} = \frac{1}{n} \times n_d$$

Formula 2:

$$V'_{cx} = \frac{V_{cx} - \mathrm{Avg}V_{cx}}{\mathrm{Max}V_{cx}}, \qquad V'_{cy} = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \qquad V'_{cy} \in [-1, +1],$$

where $x$ is the new X value $V'_{cx}$.

In Formula 1, n is the weight parameter associated with the discrete column, which is the total number of value types in that column. $V_{dy}$ is the unit value (1/n) multiplied by the sequence number of the current value type ($n_d$). This is a regression-like expression [12]. Regression is used to make predictions for numerical targets. By far the most widely used approach for numerical prediction is regression, a statistical methodology that was developed by Sir Francis Galton [13]. Generally speaking, regression analysis methods include linear regression and nonlinear regression. Linear regression is widely used, owing largely to its simplicity. By applying transformations to the variables, we can convert the nonlinear model (text table column information) into a linear one according to the requirement of the support vector machine.

In order to obtain negative X values while keeping the same distances among the original X values, in Formula 2 we subtract the average value from every x value and then take the proportion relative to the maximum original X value. This gives the new X value, and the hyperbolic tangent function [14] then maps these new values into the (-1, +1) scale.

In order to explain these algorithms clearly, we show the experiment procedure in the following section.

5. Experiments

The vectorization algorithm was tested on the census-income data set extracted from the 1994 and 1995 current population surveys conducted by the U.S. Census Bureau. The data contains 41 demographic and employment related variables. In order to explain clearly how to apply our approach, we choose 8 discrete columns and 8 continuous columns, which can be found in Table 3, to explain the implementation in detail. In Table 4 we list the n value of the discrete columns. For example, because there are 9 kinds of worker class, the n value of the worker class column is equal to 9.

Part of the experiment results obtained with the proposed algorithms, containing 27 records, is shown in Figure 3. The algorithm was given 8 discrete features and 8 continuous features as input and asked to give the vectorized values as output. The discrete attributes were decomposed into n equidistant steps, which yielded the corresponding vector values scaled to the range (0, 1). For the continuous attributes, the raw attribute value was first transferred onto the whole x-axis, so that the new x value contain the negative value
and by using the hyperbolic tangent function the vector value was calculated. The hyperbolic tangent function ensures that the vector value is projected into the (-1, 1) scale, which is required by the support vector machine.

All the experiment results were produced on a PC with an Intel(R) Core(TM) Duo CPU T2250 @ 1.73GHz 4.6 2.3, 2GB RAM DDR2 667 MHz, running WinXP. The program was compiled with NetBeans 6.0.
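
As an illustration of Formula 1 and Formula 2, the two column functions can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the program used in the experiments: the function names `vectorize_discrete` and `vectorize_continuous` are hypothetical, the sequence number $n_d$ is assumed to be the 1-based position of a value in a caller-supplied list of value types, the column maximum is assumed to be positive, and the sample values are made up.

```python
import math


def vectorize_discrete(values, value_types):
    """Formula 1: V_dx = n_d, V_dy = (1 / n) * n_d.

    Assumption: n_d is the 1-based position of the value in value_types,
    and n is the number of value types in the column.
    """
    n = len(value_types)
    position = {value: i + 1 for i, value in enumerate(value_types)}
    return [(position[v], position[v] / n) for v in values]


def vectorize_continuous(values):
    """Formula 2: x' = (x - avg) / max, V_cy' = tanh(x').

    Assumption: the column maximum is positive, so the division is safe
    and the ordering of the values is preserved.
    """
    avg = sum(values) / len(values)
    max_value = max(values)
    result = []
    for x in values:
        new_x = (x - avg) / max_value  # centred and scaled X value
        result.append((new_x, math.tanh(new_x)))  # tanh maps into (-1, 1)
    return result


# Made-up sample data, for illustration only.
levels = ["low", "medium", "high", "very high"]
print(vectorize_discrete(["medium", "low", "high"], levels))
print(vectorize_continuous([25.0, 40.0, 55.0]))
```

Passing the list of value types explicitly keeps n fixed for the whole column (for example n = 9 for class of worker), even when a batch of records does not contain every value type.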

Figure 3. Part of the experiment results

The recommended vector value range for the support vector machine is (0, 1) or (-1, 1) [15]. One reason for this is to prevent vector values in large numeric ranges from dominating those in smaller numeric ranges. Another reason is to avoid numerical difficulties during the calculation: kernel values usually depend on the inner products of feature vectors, so for kernels such as the linear kernel and the polynomial kernel, large vector values may cause numerical problems [16].

Another reason why we proposed two kinds of algorithm to vectorize discrete columns and continuous columns is to preserve the character of the column for the sake of later analysis.

Table 3. Parameters for experiments

Discrete columns                 Continuous columns
class of worker*                 age*
education*                       wage per hour
marital stat                     capital gains
sex                              capital losses
reason for unemployment          dividends from stocks
family members under 18          person for employer
live in this house 1 year ago    weeks worked in year
veterans benefits                instance weight*

Table 4. The discrete column n value

Discrete columns                 n Value
class of worker                  9
education                        17
marital stat                     7
sex                              2
reason for unemployment          6
family members under 18          5
live in this house 1 year ago    3
veterans benefits                3

6. Conclusions

The proposed hybrid Data Mining and Case-Based Reasoning user modeling system is a multi-purpose platform and is characterized by three major processes. The vectorization processing unit takes the raw data set (the SQL table) as input and outputs the numeric vector. Such an approach avoids the data inconsistency usually met in document classification chains when artificial intelligence tools are implemented.

In this paper we built a vectorization model by applying two algorithms: the discrete vectorization algorithm and the continuous vectorization algorithm. The advantage of the discrete algorithm is that each record in the whole table is assigned a vector value by an easily expressed calculation. For the continuous columns we chose a relatively complicated formula, the hyperbolic tangent function, to obtain the vector value.

In designing the algorithm, the key consideration was to provide a simple numerical transformation. Therefore, the formulas in the algorithm are quite basic, but they also provide a reasonable balance between a satisfactory result and reasonable processing time. Secondly, due to the modular structure of the algorithm, it can be adapted easily for application. The results of the algorithm in the experiments are clean, and the vector points generated by our algorithm have a standard coverage of (0, 1) and (-1, 1), which is useful in fulfilling the classification task by means of the support vector machine for the hybrid system.

References

[1] S. J. Russell, P. Norvig, Artificial Intelligence: A Modern Approach, Prentice-Hall International Inc, 1995.
[2] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, "Knowledge discovery and data mining: Towards a unifying framework", Proceedings of the International Conference on Knowledge Discovery and Data Mining, 1996, pp. 82-22.
[3] J. Vassileva, "A practical architecture for user modeling in a hypermedia-based information system", Proceedings of Fourth International Conference on User Modeling, Hyannis, MA, August 1994, pp. 15-19.
[4] I.V. Chepegin, L. Aroyo, P. De Bra, "Ontology-driven User Modeling for Modular User Adaptive Systems", LWA, 2004, pp. 17-19.
[5] I. Watson, Applying Case-Based Reasoning: Techniques for Enterprise Systems, Morgan Kaufmann Publishers, Inc., San Francisco, CA, 1997.
[6] K. Sycara, "CADET: A case-based synthesis tool for engineering design", International Journal for Expert System, 4(2), 1992, pp. 157-188.
[7] A. Aamodt, E. Plaza, "Case-based reasoning: Foundational issues, methodological variations, and system approaches", AI Communications, 7(1), 1994, pp. 39-59.
[8] Kohonen, "Self-organizing map using contiguity-constrained clustering", Pattern Recognition Letters, 1995, pp. 399-408.
[9] B. Hjorland, H. Albrechtsen, "Toward A New Horizon in Information Science: Domain Analysis", Journal of the American Society for Information Science, 46(6), 1995, pp. 400-425.
[10] E. Osuna, "Support Vector Machines: Training and Applications", Ph.D. thesis, Operations Research Center, MIT, 1998.
[11] V.N. Vapnik, Statistical Learning Theory, New York: Wiley.
[12] D.V. Lindley, "Regression and correlation analysis", New Palgrave: A Dictionary of Economics, v. 4, 1987, pp. 120-23.
[13] F. Galton, "Typical laws of heredity", Nature 15, 1877, pp. 492-495, 512-514, 532-533.
[14] M.A. Abdou, A.A. Soliman, "Modified extended tanh-function method and its application on nonlinear physical equations", Physics Letters A, Volume 353, Issue 6, 15 May 2006, pp. 487-492.
[15] E. Osuna, R. Freund, F. Girosi, "Improved training algorithm for support vector machine", IEEE Neural Networks in Signal Processing 97, 1997.
[16] C. Cortes, V. Vapnik, "Support-vector network", Machine Learning, 1995, pp. 273-297.

Authors Profile

Chen ZhiYuan received the B.A. in Economics from the University of HeiLongJiang (China) in 2001. During 2006-2010, she was at the University of Nottingham, Malaysia Campus, doing PhD research on imitating human experts (especially in the manufacturing and medical fields) to perceive the environment and to make decisions which maximize the chance of success. From 2007 to 2009, she was at the Supercapacitor Research Laboratory (SRL), which is supported by the Ministry of Science, Technology and Innovation of Malaysia, to study knowledge management systems for the manufacturing environment.

Dino Isa is a Professor in the Department of Electrical Electronics Engineering, University of Nottingham Malaysian Campus. He obtained a BSEE (Hons) from the University of Tennessee, USA in 1986 and a PhD from the University of Nottingham, University Park Nottingham, UK in 1991. The main aim of his research is to formulate strategies which lead to the successful implementation of "Intelligent Systems" in various domains.

Peter Blanchfield is a senior tutor in the School of Computer Science, University of Nottingham. From September 2005 to July 2009 he was Director of the IT Institute in the School, before which he was the Director of the Computer Science and IT Division at the Malaysia Campus of the University of Nottingham. In that role he was involved in setting up the activities of the School there along with the activities of what has become the Engineering Faculty on that campus.
