1 Introduction
However, if an attribute Ai is categorical, the domain of Ai = {a1, a2, …, ax}, where |Ai|
= x, such that attribute Ai has x possible values. Each record R ∈ D draws its value
for each attribute from the domain of the corresponding attribute.
Data-mining has long been performed on computers, servers and 'cloud computing', but
the growing capabilities of smartphones and 'Internet of Things' (IoT) microcontroller
units (MCUs) have seen increasing attempts to implement data-mining algorithms on these
constrained devices [6-9]. Improving algorithm efficiency would further their
implementation on these devices, potentially offering new applications.
At the same time, moves by major cloud computing service providers such as Amazon and
Google in late 2017 changed service fees from a 'per hour' to a 'per second' basis [10].
This is likely to renew corporate focus on algorithm efficiency. For example, if an
algorithm can build a model on cloud computing with almost identical accuracy but in
less time, the time saved may translate into direct cost savings.
To this end, this paper introduces SPAARC, a method for accelerating classification
model build speed consisting of two components: Node Attribute Sampling (NAS) and
Split-Point Sampling (SPS). Experiments show SPAARC cuts model build times by as much
as 70% with minimal loss in classification accuracy. This improvement could deliver
cost savings for cloud-based data-mining or potentially boost the implementation of
locally-executed machine learning on performance-constrained devices.
While the concepts of NAS and SPS are not new, our novel contribution is the
combination of these components into a single effective algorithm. Moreover, our
NAS component incorporates a novel feature that aims to balance the disparate needs
of classification accuracy and processing speed. This paper continues in Section 2
with a summary of previous research into algorithm speed optimisation. Section 3
details our proposed method, while Section 4 reports on its implementation and test-
ing within the CART classification algorithm. Section 5 provides further analysis of
the results and discussion before Section 6 concludes this paper.
2 Related Work
Decision tree induction speed optimisation can be traced back to Fayyad and Irani
[11]. While the purpose of their research was primarily to provide supporting evi-
dence for using entropy as a heuristic in decision tree induction, a noted ‘side benefit’
of their work was the improvement in algorithm speed.
A decision tree is a classification method using a tree-like structure to split a da-
taset’s instances into subsets based on their attribute values. As shown in Fig. 1, a
decision tree consists of nodes (denoted by ovals), each representing an attribute be-
ing tested; branches (edges), the possible outcomes for values of that attribute; and
leaf nodes or ‘leaves’ (rectangles) holding class values that classify each instance [3].
At each node, all attributes are tested in turn to determine which one provides the
most distinct partitioning of the dataset's records, i.e. the greatest 'information
gain'. The test is referred to as the 'attribute selection measure'. There are many
attribute selection measures, including Gain Ratio, as used by C4.5 [5], and Gini
Index, implemented in CART [4].
Fig. 1. A simple decision tree showing nodes (ovals), branches (edges) and leaves (rectangles).
Attribute selection measures typically handle attributes with categorical values
differently from those with numerical values. Each distinct value of a categorical
attribute becomes a possible outcome or 'split point'; a numerical attribute, however,
is more commonly split into two edges only. In this case, a numerical value t of
attribute Ai is selected that best splits the instances of dataset D into distinct
subsets, such that one edge covers instances with values of Ai ≤ t and the other covers
values of Ai > t.
To select the value t, the j distinct values of numeric attribute Ai are sorted in
increasing order. The mid-point between each pair of adjacent values is tested for
information gain, and the mid-point showing the maximum information gain is retained
for that attribute as the value t. All m attributes are tested for maximum information
gain and the attribute Ak is selected such that G(Ak) = max{G(Ai) : 1 ≤ i ≤ m}, where
G(Ai) denotes the information gain of attribute Ai. If Ak is numerical, its recorded
value t is used as the 'split point', as noted above.
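For concreteness, the following is a minimal Java sketch (ours, not the paper's or Weka's implementation) of this exhaustive midpoint search over one numeric attribute; the gainOfThreshold callback is a placeholder for whichever attribute selection measure the algorithm applies to the candidate split Ai ≤ t versus Ai > t.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

public class MidpointSplit {

    /**
     * Returns the midpoint t between adjacent distinct values of a numeric
     * attribute that maximises the supplied gain function. The gain function
     * stands in for the algorithm's attribute selection measure
     * (e.g. a Gini-based gain or Gain Ratio).
     */
    static double bestSplitPoint(double[] values, DoubleUnaryOperator gainOfThreshold) {
        double[] distinct = Arrays.stream(values).sorted().distinct().toArray();
        double bestT = Double.NaN;
        double bestGain = Double.NEGATIVE_INFINITY;
        // j distinct values imply j-1 candidate split-points.
        for (int i = 0; i < distinct.length - 1; i++) {
            double t = (distinct[i] + distinct[i + 1]) / 2.0; // midpoint of adjacent pair
            double gain = gainOfThreshold.applyAsDouble(t);   // evaluate Ai <= t vs Ai > t
            if (gain > bestGain) {
                bestGain = gain;
                bestT = t;
            }
        }
        return bestT;
    }
}
```

In a CART-style learner this exhaustive scan is repeated at every node for every numeric attribute, which is precisely the cost that split-point sampling techniques target.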
But as Fayyad and Irani indicate in [11], testing every adjacent pair of j distinct
values of a numeric attribute implies j-1 possible split-points. However, they also
found that if the sorted sequence of j values resulted in their corresponding class val-
ues grouping together neatly into k groups, only the split-points between those k
groups need to be tested. This would require just k-1 tests and since k << j, the com-
putational workload would be significantly reduced. Nevertheless, if the sequence of
class values is completely mixed, each adjacent pair of distinct numeric attribute
values would need to be tested and the computational workload returns to j-1
calculations. In practice, however, Fayyad and Irani deemed this expanded testing
unnecessary: only k-1 test points would be required, occurring at steps equal to |Ci|,
where 1 ≤ i ≤ k and k is the number of possible class values.
Yet, while the authors empirically tested this concept, processing speed results were
considered only as far as the split-point discretisation process – the effect on overall
algorithm performance was a secondary consideration [11].
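The boundary-point idea can be sketched as follows. This is a simplified illustration under our own assumptions (records pre-sorted by the attribute, and the corner case of mixed class labels at equal attribute values ignored), not Fayyad and Irani's procedure verbatim: only midpoints at which the class label changes between adjacent distinct values are kept as candidate split-points.

```java
import java.util.ArrayList;
import java.util.List;

public class BoundaryPoints {

    /**
     * Given attribute values sorted in increasing order and their class labels,
     * returns candidate split-points only where both the value and the class
     * label change between adjacent records -- the 'boundary points' between
     * neatly grouped class values.
     */
    static List<Double> boundaryCandidates(double[] sortedValues, int[] classLabels) {
        List<Double> candidates = new ArrayList<>();
        for (int i = 0; i < sortedValues.length - 1; i++) {
            boolean valueChanges = sortedValues[i] != sortedValues[i + 1];
            boolean classChanges = classLabels[i] != classLabels[i + 1];
            if (valueChanges && classChanges) {
                candidates.add((sortedValues[i] + sortedValues[i + 1]) / 2.0);
            }
        }
        return candidates;
    }
}
```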
This method of reducing the number of possible split-points was further considered in
[12] to improve support for very large datasets, resulting in the development of the
CLOUDS algorithm. The technique implemented is described as 'Split-Point Sampling with
Estimation' (SSE). It samples the j-1 possible split-points by dividing them into
intervals using an undefined 'similar frequency' method. It then estimates the optimum
split-point within each interval via a hill-climbing algorithm; after all intervals are
tested, the leading split-point is chosen for that attribute. While improving
processing speed was not the main purpose of CLOUDS, further tests were carried out
comparing SSE with a more basic split-point sampling technique, using 100 and 200
split-points respectively. While the magnitude of improvement varied, the results
indicate that testing fewer split-points reduced the overall time required for the
algorithm to build its model of the test dataset.
This idea of sampling split-points has also been applied to the attributes them-
selves. Rather than test every attribute at each node, ‘feature subset selection’ reduces
the number of attributes. Decision tree induction assumes that each non-class attribute
has a relationship with the class attribute. Thus, each non-class attribute is tested at
each node. However, irrelevant attributes can reduce a tree's classification ability
due to the unnecessary information or 'noise' they contain [13]. Removing irrelevant
attributes can not only improve overall classification accuracy but also reduce the
computational workload, since fewer attributes are tested.
Attribute subset selection has evolved with numerous techniques grouped into
three broad categories – wrapper, filter and embedded. Wrapper methods use the clas-
sifier output itself to determine the most suitable subset of attributes. However, while
accurate, wrapper methods can be computationally costly for datasets with high di-
mensionality and are considered a ‘brute force’ method of subset selection [14]. Filter
methods, by contrast, select the attribute subset prior to tree induction through a pro-
cess of ‘feature ranking’. Yet, filtering can remove attributes that alone may not pro-
vide information, but in combination with other attributes hold knowledge that would
otherwise be lost. Alternatively, embedded methods incorporate attribute selection
within the decision tree algorithm itself, making them more efficient than wrapper
methods. The CART algorithm has been identified as having an embedded mechanism for
attribute selection [14]; however, it still requires all attributes to be tested at
every node. The question we aim to answer in this paper is how to reduce the number of
attributes that must be tested in a way that delivers a meaningful improvement in
processing speed whilst minimising any negative effect on classification accuracy.
3 Proposed Method
Our proposed method for accelerating tree induction involves combining components
of split-point sampling and attribute subset selection into a single novel implementa-
tion we have named Split-Point And Attribute Reduced Classifier or SPAARC.
Moreover, this method can be applied to any classification algorithm that implements
its numeric attribute split-point analysis and node attribute selection recursively.
The two specific components of SPAARC will now be detailed individually, starting
with Node Attribute Sampling (NAS) covered in Section 3.1 and Split-Point Sam-
pling (SPS) in Section 3.2. This will be followed by empirical evaluation in Section 4.
Fig. 2. Node attribute sampling samples the full attribute space on each treeDepth modulus
level (in this case, treeDepth modulus = 2)
Induction begins at the root node, N1, which sits on Tree Depth Level (TDL) 1. Here,
the full non-class attribute space A = {A1…A6} is always used to find the most ap-
propriate attribute for the root node. As TDL 1 is also designated as a modulus level,
the information gain scores from each non-class attribute tested for N1 are recorded
and the average information gain is calculated. The indices of these non-class attrib-
utes are stored separately in decreasing order of their information gain scores. The
number of attributes with above-average information gain is counted and those attrib-
utes are selected as the new attribute subset, which for N1 is {A2, A4}. Moving to
node N2, the 'treeDepth' value is now set to 2. Thus, node N2 is not on a modulus
TDL and is tested only with the previously-stored attribute subset, i.e. {A2, A4}, so that
the selected node attribute can only come from this subset. Tree induction continues
recursively following the left-hand branch, reaching node N3. As N3 sits on a modu-
lus TDL, the attribute selected for this node now comes from the full attribute space.
This process continues down until leaf nodes N6 and N7 are set. At this point, the
current subset of {A5, A6}, which was created at node N5, is retained as the tree in-
duction returns recursively up the tree to node N8. Again, as N8 is also on a modulus
TDL, the full attribute space is used and the subset {A1, A6} is created. Following
this recursive path, the next node for processing is node N9. As N9 is not on a modu-
lus TDL, its attribute is selected from the current stored subset {A1, A6}. Induction
continues to N10, which being on a modulus TDL, selects from the full attribute
space and creates a new subset {A2, A3}. This subset services nodes N11 and N12.
However, since the next node in the sequence, N13, also sits on a non-modulus TDL,
it, too, selects its attribute from the current {A2, A3} subset. The final two nodes,
N14 and N15, both sit on a modulus TDL. Thus, both choose an attribute from the full
attribute space.
The pseudo-code for NAS appears in the NodeAttributeSample function in Fig. 3.
It takes as parameters the dataset D and attributes A, along with the tree depth
modulus factor M, and returns the split attribute As. To start, if the tree depth level modulus M
is one (1), the function looks for the best split-point from every attribute, retaining the
information gain (infoGain) factor from each. Following this, the attributes are sorted
by information gain in decreasing order and stored in the array ‘sortedAtts’. The aver-
age information gain is calculated and stored in ‘avgGain’. Next, each attribute Ai in
the sortedAtts array is tested and if its information gain is greater than the average
gain, that attribute is added to the attribute subset list called ‘attSubset’.
If the tree depth level modulus M is not equal to one, then each attribute of the
subset (rather than every attribute) is tested to find its split-point. In either case,
regardless of tree level, the attribute with the maximum information gain is returned
as the split attribute As, along with the information gain values in the array 'infoGains'.
This NodeAttributeSample function also calls the ‘SplitPointSample’ function, fea-
turing the split-point sampling (SPS) component we shall now discuss in Section 3.2.
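To make the control flow concrete, the sketch below is our reading of the NAS behaviour described above rather than the authors' Weka code. The per-attribute split-point search is abstracted behind a hypothetical gainOf callback, the attribute subset is stored between calls so it can be reused on non-modulus levels, and the condition treeDepth % M == 1 for a modulus level is our interpretation (it reproduces the Fig. 2 example with M = 2).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntToDoubleFunction;

public class NodeAttributeSampling {

    private List<Integer> attSubset = new ArrayList<>(); // subset reused on non-modulus levels

    /**
     * Returns the index of the attribute to split on at the current node.
     * On modulus tree-depth levels every attribute is scored and the subset of
     * above-average attributes is rebuilt; on other levels only the stored
     * subset is scored. 'gainOf' abstracts the split-point search per attribute.
     */
    int selectSplitAttribute(int numAttributes, int treeDepth, int modulusM,
                             IntToDoubleFunction gainOf) {
        List<Integer> candidates;
        if (treeDepth % modulusM == 1 || attSubset.isEmpty()) {
            // Modulus level: score the full attribute space and rebuild the subset.
            double[] gains = new double[numAttributes];
            double sum = 0.0;
            candidates = new ArrayList<>();
            for (int a = 0; a < numAttributes; a++) {
                gains[a] = gainOf.applyAsDouble(a);
                sum += gains[a];
                candidates.add(a);
            }
            double avgGain = sum / numAttributes;
            attSubset = new ArrayList<>();
            for (int a = 0; a < numAttributes; a++) {
                if (gains[a] > avgGain) {
                    attSubset.add(a);   // keep only the above-average attributes
                }
            }
        } else {
            candidates = attSubset;     // non-modulus level: reuse the stored subset
        }
        // In either case, split on the candidate with the highest information gain
        // (a real implementation would cache the gains computed above).
        int best = candidates.get(0);
        double bestGain = Double.NEGATIVE_INFINITY;
        for (int a : candidates) {
            double g = gainOf.applyAsDouble(a);
            if (g > bestGain) { bestGain = g; best = a; }
        }
        return best;
    }
}
```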
The CART algorithm measures the impurity of a set of records using the Gini index:

Gini(D) = 1 − Σ pi²   (1)

where pi is the probability that a record R in dataset D belongs to the class value Ci,
and the sum runs over all class values.
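Equation (1) can be computed directly from the per-class record counts at a node, as in the short sketch below (ours, not the SimpleCART source).

```java
public class GiniIndex {

    /**
     * Gini impurity of a node: 1 - sum over classes of p_i^2,
     * where p_i is the fraction of records at the node with class value C_i.
     */
    static double gini(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        if (total == 0) return 0.0;
        double sumSquared = 0.0;
        for (int c : classCounts) {
            double p = (double) c / total;
            sumSquared += p * p;
        }
        return 1.0 - sumSquared;
    }
}
```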
However, to obtain the split-point with maximum information gain, each adjacent pair
of distinct attribute values must be tested, such that for j distinct values, there are
generally j-1 adjacent pairs to test. Our SPS component does not alter the information
gain measure itself; rather, it dynamically reduces the number of possible split-points
tested. This is done by dividing the range of distinct attribute values into
equal-width intervals and using the adjacent pair of values at the edges of each
interval as potential split-points. Thus, if there are k intervals, only k-1 test
points are required. Through experimentation, a value of k = 10 has been shown to
provide good results. While sampling ten interval points may not select the split-point
of maximum information gain, it can be shown that the optimal split-point is never more
than half the interval step away from a tested point.
Fig. 4. Maximum distance the actual split-point i can be from a tested split-point is m/2k.
Proof: Let p and q be two consecutive interval points, such that q – p = m/k. Let i be
the actual split point somewhere in the range sequence between points p and q.
From this, there are three possibilities – i) that (i – p) < m/2k, ii) that (i - p) = m/2k,
and iii) that (i – p) > m/2k. For possibilities i) and ii), the theorem is already proved,
since neither is greater than m/2k.
Now for possibility iii), let (i – p) > m/2k. Since, from Fig. 4, (i – p) + (q – i) = m/k,
it must be that (i – p) = m/k – (q – i). Substituting, if (i – p) > m/2k then
m/k – (q – i) > m/2k, from which m/k – m/2k > (q – i) and hence (q – i) < m/2k. Thus, in
this case the actual split-point i lies within m/2k of the tested point q, so in every
case i is no further than m/2k from a tested split-point, which completes the proof.
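The same bound can be stated more compactly; the following LaTeX fragment is simply our restatement of the argument above.

```latex
% p and q are consecutive tested interval points, so q - p = m/k.
% For any actual split-point i with p <= i <= q, the nearer of the two
% tested points is at most half the interval width away:
\[
\min(i - p,\; q - i) \;\le\; \tfrac{1}{2}\bigl[(i - p) + (q - i)\bigr] \;=\; \tfrac{m}{2k}.
\]
```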
Choosing the optimum level of k is dependent upon the number of distinct values j,
for as k approaches j, the difference in the number of calculations required, and thus
the speed gain as a result, will be minimal. However, if k is too low, the distance
between the optimal and selected split-points will likely be greater, potentially
negatively affecting the choice of attribute at each node and hence overall accuracy.
The SplitPointSample pseudo-code in Fig. 3 details this function. It takes as pa-
rameters the subset of records for the current node, Dsorted, and the current attribute, Ai.
It returns an array of information gains ‘infoGain’, plus the candidate split-point
called 'splitPoint'. The function first considers attribute Ai: if it is categorical,
it is passed to the tree algorithm's existing function for finding categorical
split-points. As the SPS component handles numerical attributes only, we do not
consider this case further here.
For numerical attributes, four values are initially calculated from the sorted range
of record values for attribute Ai: hopStart is the Ai value of the first record;
valueRange is the range of numeric values, m; hopStep is the range divided by the
number of intervals k, set to 10; and hopPoint is the first interval boundary, set to
hopStart plus hopStep. At this point, each record Rj in the current data subset Dsorted
is considered for the value of its attribute Ai. Unless this value is greater than the
current interval value hopPoint, it is skipped. If this Ai value is greater, we
calculate the information gain.
We compare this information gain value, mj, against the current maximum infor-
mation gain value maxInfoGain – if value mj is greater, it becomes the new maximum
and the current best ‘splitPoint’ is set between the Ai values of current record Rj and
the previous record Rj-1. We then advance the interval boundary hopPoint by hopStep to
the next interval and continue until the last record is reached. On completion, the
maximum information gain value for Ai is stored and its corresponding split-point is
returned.
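The sampling loop just described can be sketched as follows; this is our interpretation rather than the paper's code, with the information-gain evaluation at a candidate boundary abstracted behind a hypothetical gainAtRecord callback and k passed in as a parameter (the paper uses k = 10).

```java
import java.util.function.IntToDoubleFunction;

public class SplitPointSampling {

    /**
     * Scans a numeric attribute's sorted values but only evaluates information
     * gain when a value crosses the next of k equal-width interval boundaries,
     * so roughly k candidate split-points are scored instead of j-1.
     * 'gainAtRecord' abstracts the gain of splitting between records r-1 and r.
     */
    static double sampledSplitPoint(double[] sortedValues, int k,
                                    IntToDoubleFunction gainAtRecord) {
        double hopStart = sortedValues[0];
        double valueRange = sortedValues[sortedValues.length - 1] - hopStart; // m
        double hopStep = valueRange / k;
        double hopPoint = hopStart + hopStep;      // first interval boundary

        double maxInfoGain = Double.NEGATIVE_INFINITY;
        double splitPoint = hopStart;

        for (int r = 1; r < sortedValues.length; r++) {
            if (sortedValues[r] <= hopPoint) {
                continue;                          // still inside the current interval
            }
            double gain = gainAtRecord.applyAsDouble(r);
            if (gain > maxInfoGain) {
                maxInfoGain = gain;
                // split between this record's value and the previous one
                splitPoint = (sortedValues[r] + sortedValues[r - 1]) / 2.0;
            }
            hopPoint += hopStep;                   // advance to the next interval
        }
        return splitPoint;
    }
}
```

Relative to the exhaustive midpoint scan shown in Section 2, the number of gain evaluations drops from roughly j-1 to roughly k.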
4 Experiments
To test the validity of SPAARC, experiments were carried out using 14 freely-available
datasets from the University of California, Irvine (UCI) machine-learning repository
[15]. Dataset details are shown in Table 1. These datasets were
tested on the Weka version 3.8.2 data-mining core using the SimpleCART algorithm
from the Weka Package Manager. Tests were conducted comparing the SimpleCART
algorithm in its original form against the same algorithm augmented with our
SPAARC method. Further tests involving the SPS and NAS components individually
were carried out to identify the role each plays in the overall SPAARC result.
Classification accuracy was evaluated using ten-fold cross-validation, while tree in-
duction or ‘model build’ times were programmatically recorded to millisecond preci-
sion. All tests were carried out on a 3.1GHz Intel Core i5 2300 PC with Windows 8.1
operating system, 14GB of RAM and 240GB Intel 535 Series solid-state drive (SSD).
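For reproducibility, a comparison of this kind can be scripted against the Weka API roughly as follows. This is a minimal sketch under our assumptions: the SimpleCART package is installed (class weka.classifiers.trees.SimpleCart), the dataset path is illustrative, and the random seed is arbitrary.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.SimpleCart;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BuildTimeBenchmark {
    public static void main(String[] args) throws Exception {
        // Illustrative path; any UCI dataset in ARFF format will do.
        Instances data = DataSource.read("segment.arff");
        data.setClassIndex(data.numAttributes() - 1);

        SimpleCart cart = new SimpleCart();

        // Model build time, measured around buildClassifier() only.
        long start = System.nanoTime();
        cart.buildClassifier(data);
        long buildMillis = (System.nanoTime() - start) / 1_000_000;

        // Ten-fold cross-validation for classification accuracy.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new SimpleCart(), data, 10, new Random(1));

        System.out.printf("build time: %d ms, accuracy: %.2f%%%n",
                buildMillis, eval.pctCorrect());
    }
}
```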
Classification accuracy and model build times of our SPAARC algorithm compared
with SimpleCART appear in Table 2 (leading results in all tests are shown in bold).
Table 2. Classification accuracy and build times of SimpleCART and SPAARC algorithms.
Significantly, SPAARC produced faster model build times than SimpleCART in all
14 datasets tested. The most successful of the results achieved by SPAARC occurred
with the FOU dataset, with build time dropping by 70%. At the same time, classifica-
tion accuracy for this dataset rose by one percentage point (76.5% vs 75.5%).
Importantly, SPAARC achieved these speeds whilst still managing to record higher
classification accuracy scores in seven of the 14 datasets compared with Simple-
CART. Moreover, individual accuracy gains outweigh losses to the extent that
SPAARC produced a marginally-higher overall classification accuracy average than
SimpleCART (87.52 vs 87.39). The least successful SPAARC result occurred with
the WIN dataset, where classification accuracy fell by 1.18 percentage points, while
build speed still improved by a modest 16%. However, across all 14 datasets, SPAARC
reduced the total build time by more than 48% (58.499 seconds vs 112.55 seconds). To
understand in more detail how each of the algorithm's two main components contributes
to the overall result, SPAARC was modified to implement each component separately and
compared in each case with the SimpleCART algorithm. The results in Table 3 show that
SPS improved model build times in ten of the 14 datasets tested.
Notably, the overall build time recorded with SPS alone is greater than that recorded
by SPAARC, indicating that the SPAARC results are not due solely to SPS.
Table 3. SimpleCART classification accuracy and build times compared with SPS alone.
Table 4. SimpleCART classification accuracy and build times compared with NAS alone.
Similarly, Table 4 shows the results of implementing the NAS component alone in
comparison with SimpleCART. Here, model build times were improved by NAS in
12 of the 14 datasets tested. While NAS achieved speed gains in more datasets than SPS,
the overall reduction in build time from NAS was smaller than that from SPS (87.65
seconds vs SPS's 72.77 seconds). This indicates that NAS, with its treeDepth modulus
setting of 5, contributes less to overall speed than SPS but is more likely to show a
speed gain on a given dataset than SPS alone. The results also saw NAS improve
classification accuracy in seven of the 14 datasets, with no change in overall average
accuracy. More broadly, the results of Tables 3 and 4 show that the performance of
SPAARC is due to both components combined.
Overall, SPAARC demonstrates the potential of combining the SPS and NAS techniques,
delivering significant reductions in build time while classification accuracy is
largely maintained.
Although every decision tree algorithm has its own structure, tree induction in general
is a two-step process. First, the tree is grown until all branches terminate in leaves
via a stopping criterion. Second, depending on the algorithm, the tree is reduced or
'pruned' to lessen the effects of noise in the training dataset and improve its
generalisation ability on new, unseen instances [16]. The SPAARC algorithm's two
components operate only on the growth phase, leaving potential for further time savings
through accelerating the pruning phase. This will be an area for future research.
Although attempts have been made to implement classification algorithms within
low-power MCUs, as noted in Section 1, the lack of RAM in many of these devices is
likely as much of an impediment to their use as their limited processing capability.
This limits the size of datasets that are 'mine-able', whether in terms of instances or
attributes. Thus, reducing
mining algorithm RAM requirements is another area of further research relevant to
local IoT data-mining. Nevertheless, the speed gains achieved by SPAARC with min-
imal loss of classification accuracy could contribute toward greater implementation of
data-mining in IoT applications. This is another area we are keen to research.
6 Conclusion
7 Acknowledgements
References
[1] Islam, M.Z., M. Furner, and M.J. Siers, WaterDM: A Knowledge Discovery and Decision
Support Tool for Efficient Dam Management. 2016.
[2] Dangare, C.S. and S.S. Apte, Improved study of heart disease prediction system using data
mining classification techniques. International Journal of Computer Applications,
2012. 47(10): p. 44-48.
[3] Han, J., J. Pei, and M. Kamber, Data mining: concepts and techniques. 2011: Elsevier.
[4] Breiman, L., J. Friedman, C.J. Stone, and R.A. Olshen, Classification and regression trees.
1984: CRC press.
[5] Quinlan, J.R., C4.5: programs for machine learning. 2014: Elsevier.
[6] Nath, S. ACE: exploiting correlation for energy-efficient and continuous context sensing. in
Proceedings of the 10th international conference on Mobile systems, applications,
and services. 2012. ACM.
[7] Srinivasan, V., S. Moghaddam, A. Mukherji, K.K. Rachuri, C. Xu, and E.M. Tapia.
Mobileminer: Mining your frequent patterns on your phone. in Proceedings of the
2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing.
2014. ACM.
[8] Hinwood, A., P. Preston, G. Suaning, and N. Lovell, Bank note recognition for the vision
impaired. Australasian Physics & Engineering Sciences in Medicine, 2006. 29(2): p.
229.
[9] Maurer, U., A. Smailagic, D.P. Siewiorek, and M. Deisher. Activity recognition and
monitoring using multiple sensors on different body positions. in Wearable and
Implantable Body Sensor Networks, 2006. BSN 2006. International Workshop on.
2006. IEEE.
[10] Darrow, B. Amazon Just Made a Huge Change to its Cloud Pricing.
http://fortune.com/2017/09/18/amazon-cloud-pricing-second/. 2017.
[11] Fayyad, U.M. and K.B. Irani, On the handling of continuous-valued attributes in decision
tree generation. Machine learning, 1992. 8(1): p. 87-102.
[12] Ranka, S. and V. Singh. CLOUDS: A decision tree classifier for large datasets. in
Proceedings of the 4th Knowledge Discovery and Data Mining Conference. 1998.
[13] Chandrashekar, G. and F. Sahin, A survey on feature selection methods. Computers &
Electrical Engineering, 2014. 40(1): p. 16-28.
[14] Guyon, I. and A. Elisseeff, An introduction to variable and feature selection. Journal of
machine learning research, 2003. 3(Mar): p. 1157-1182.
[15] Dua, D. and Karra Taniskidou, E. UCI Machine Learning Repository.
http://archive.ics.uci.edu/ml. 2018.
[16] Buntine, W. and T. Niblett, A further comparison of splitting rules for decision-tree
induction. Machine Learning, 1992. 8(1): p. 75-85.