Column Vectorizing Algorithms For Support Vector Machines: Chen Zhiyuan, Dino Isa and Peter Blanchfield
6. Conclusions
The proposed hybrid data mining and case-based reasoning
user modeling system is a multipurpose platform
characterized by three major processes. The vectorization
processing unit works on the raw data set held in the SQL
table and outputs a numeric vector. This approach avoids
the data inconsistency usually met in the document
classification chain when artificial intelligence tools are
implemented.
In this paper we built a vectorization model by applying two
algorithms: the discrete vectorization algorithm and the
continuous vectorization algorithm. The advantage of the
discrete algorithm is that each record in the whole table is
assigned a vector value by an easily computed expression.
For the continuous columns we chose a relatively more
complicated formula, the hyperbolic tangent function, to
obtain the vector value.
In designing the algorithm, the key consideration was to keep
the numerical transformation simple. The formulas in the
algorithm are therefore quite basic, yet they provide a
reasonable balance between a satisfactory result and a
reasonable processing time. Secondly, due to its modular
structure, the algorithm can be adapted easily for
application. The results of the algorithm in the experiments
are labeled clean, and the vector points generated by our
algorithm have a standard coverage of (0, 1) and (-1, 1),
which is useful in fulfilling the classification task by means
of support vector machines for the hybrid system.

Figure 3. Part of the experiment results

The recommended vector value range for support vector
machines is (0, 1) or (-1, 1) [15]. One reason is to prevent
vector values in large numeric ranges from dominating those
in smaller numeric ranges. Another reason is to avoid
numerical difficulties during the calculation: because kernel
values usually depend on the inner products of feature
vectors, as in the linear kernel and the polynomial kernel,
large vector values may cause numerical problems [16].
Another reason why we proposed two kinds of algorithm, one to
vectorize discrete columns and one to vectorize continuous
columns, is to preserve the character of each column for the
sake of later analysis.
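The scale argument above can be illustrated with a small sketch. It compares a linear and a polynomial kernel on raw feature values and on values squashed into (-1, 1). Only the use of the hyperbolic tangent comes from the text. The sample records, the per-column `center` and `spread` statistics inside the hypothetical helper `tanh_scale`, and the kernel parameters are all assumptions made for illustration; they are not the paper's formula or data.

```python
import math


def tanh_scale(x, center, spread):
    """Squash a raw value into (-1, 1) with the hyperbolic tangent.

    Centering and dividing by `spread` is an assumption of this sketch;
    the paper's exact formula is not reproduced here.
    """
    return math.tanh((x - center) / spread)


def linear_kernel(u, v):
    """Inner product of two feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(u, v, degree=3, coef0=1.0):
    """Polynomial kernel (u.v + coef0) ** degree; parameters are illustrative."""
    return (linear_kernel(u, v) + coef0) ** degree


# Two hypothetical records: (age, capital gains). The values are made up.
raw_a = (35.0, 15000.0)
raw_b = (52.0, 200.0)

# Hypothetical (center, spread) statistics for each column.
stats = [(40.0, 15.0), (5000.0, 8000.0)]

scaled_a = tuple(tanh_scale(x, c, s) for x, (c, s) in zip(raw_a, stats))
scaled_b = tuple(tanh_scale(x, c, s) for x, (c, s) in zip(raw_b, stats))

# Share of the raw inner product contributed by the large-range column.
raw_dot = linear_kernel(raw_a, raw_b)
gains_share = (raw_a[1] * raw_b[1]) / raw_dot

print("raw linear kernel:", raw_dot)
print("capital-gains share of raw kernel:", round(gains_share, 4))
print("raw polynomial kernel:", poly_kernel(raw_a, raw_b))
print("scaled linear kernel:", linear_kernel(scaled_a, scaled_b))
print("scaled polynomial kernel:", poly_kernel(scaled_a, scaled_b))
```

On the raw values the capital-gains column accounts for almost the entire inner product, and the cubic kernel grows to a very large magnitude. After squashing, every component lies in (-1, 1), so the linear kernel of two such two-component vectors stays below 2 in absolute value and the polynomial kernel stays correspondingly small.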
Table 3. Parameters for experiments

Discrete columns                 Continuous columns
class of worker*                 age*
education*                       wage per hour
marital stat                     capital gains
sex                              capital losses
reason for unemployment          dividends from stocks
family members under 18          person for employer
live in this house 1 year ago    weeks worked in year
veterans benefits                instance weight*

Table 4. The discrete column n value

Discrete columns                 n value
class of worker                  9
education                        17
marital stat                     7
sex                              2
reason for unemployment          6
family members under 18          5
live in this house 1 year ago    3
veterans benefits                3

References
[1] S. J. Russell, P. Norvig, Artificial Intelligence: A
Modern Approach, Prentice-Hall International Inc, 1995.
[2] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, "Knowledge
discovery and data mining: Towards a unifying framework",
Proceedings of the International Conference on Knowledge
Discovery and Data Mining, 1996, pp. 82-88.
[3] J. Vassileva, "A practical architecture for user
modeling in a hypermedia-based information system",
Proceedings of Fourth International Conference on User
Modeling, Hyannis, MA, August 1994, pp. 15-19.
[4] I. V. Chepegin, L. Aroyo, P. De Bra, "Ontology-driven
User Modeling for Modular User Adaptive Systems", LWA,
2004, pp. 17-19.
[5] I. Watson, Applying Case-Based Reasoning: Techniques
for Enterprise Systems, Morgan Kaufmann Publishers, Inc.,
San Francisco, CA, 1997.
[6] K. Sycara, "CADET: A case-based synthesis tool for
engineering design", International Journal for Expert
Systems, 4(2), 1992, pp. 157-188.
56 (IJCNS) International Journal of Computer and Network Security,
Vol. 2, No. 2, February 2010
Authors Profile