SPAARC: A Fast Decision Tree Algorithm: 16th Australasian Conference, AusDM 2018,
Bathurst, NSW, Australia, November 28–30, 2018, Revised Selected Papers

Chapter · January 2019

DOI: 10.1007/978-981-13-6661-1_4


All content following this page was uploaded by Darren Yates on 19 March 2019.

SPAARC: A Fast Decision Tree Algorithm

Darren Yates¹, Md Zahidul Islam¹ and Junbin Gao²

¹ Charles Sturt University, Panorama Ave, Bathurst NSW 2795, Australia
{dyates, zislam}@csu.edu.au
² University of Sydney Business School, The University of Sydney, NSW, 2006, Australia
[email protected]

Abstract. Decision trees are a popular method of data-mining and knowledge discovery,
capable of extracting hidden information from datasets consisting of both nominal and
numerical attributes. However, their need to test the suitability of every attribute at
every tree node, in addition to testing every possible split-point for every numerical
attribute, can be computationally expensive, particularly for datasets with high
dimensionality. This paper proposes a method for speeding up the decision tree induction
process called SPAARC, consisting of two components to address these issues: sampling of
the numeric attribute tree-node split-points and dynamically adjusting the node attribute
selection space. Further, these methods can be applied to almost any decision tree
algorithm. To confirm its validity, SPAARC has been tested and compared against an
implementation of the CART algorithm using 14 freely-available datasets from the UCI data
repository. Results from this testing indicate the two components of SPAARC combined have
minimal effect on decision tree classification accuracy yet reduce model build times by
as much as 70%.

Keywords: Decision Tree, Processing Speed, Classification Accuracy, Node Attribute
Sampling.

1 Introduction

Classification is a long-studied method for data-mining and knowledge discovery,
featuring in applications as varied as water dam management [1] and heart disease
prediction [2]. It extracts information from a dataset of records as a set of rules or
‘model’ summarising the relationships between the feature values. Features are also
known as ‘attributes’. Moreover, the model learned can then be applied to a new pre-
viously-unseen record to predict the category or ‘class’ it belongs to. Decision trees
are a popular classification technique due to their flowchart-like visualisation that is
easy to follow and understand [3]. Popular examples include CART [4] and C4.5 [5].
A decision tree aims to discover relationships, within a dataset D containing n records
or ‘instances’ such that D = {R1, R2, …, Rn}, between an m-dimensional vector of
non-class attributes A = {A1, A2, …, Am} and a class attribute, C, consisting of p class
values, with C = {c1, c2, …, cp}. If an attribute Ai is numerical, its domain is the
interval Ai = [Li, Ui], where Li is the lower limit and Ui the upper limit of attribute Ai.
However, if an attribute Ai is categorical, its domain is Ai = {a1, a2, …, ax} where
|Ai| = x, such that the attribute Ai has x possible values. Each record RL ∈ D draws its
value for each attribute from the domain of the corresponding attribute.

The final authenticated publication is available online at
https://doi.org/10.1007/978-981-13-6661-1_4

Data-mining has long been performed on computers, servers and ‘cloud computing’, but
the growing capabilities of smartphones and ‘Internet of Things’ (IoT) microcontroller
units (MCUs) have seen increasing attempts to implement data-mining algorithms on these
constrained devices [6-9]. Improving algorithm efficiency would further their
implementation on these devices, potentially offering new applications.
At the same time, moves by major cloud computing service providers such as Am-
azon and Google in late-2017 saw changes to service fees from a ‘per hour’ to a ‘per
second’ basis [10]. This will likely renew corporate focus on algorithm efficiency. For
example, if an algorithm can build a model on cloud computing with almost identical
accuracy but in less time, the time saved may translate into direct cost savings.
To this end, this paper introduces a method for accelerating classification model
build speed called SPAARC, consisting of two components – Node Attribute Sam-
pling (NAS) and Split-Point Sampling (SPS). Experiments show SPAARC cuts model
build times by as much as 70% with minimal loss in classification accuracy. This
improvement could deliver cost savings for cloud-based data-mining or potentially
boost implementation of locally-executed machine learning on performance-
constrained devices.
While the concepts of NAS and SPS are not new, our novel contribution is the
combination of these components into a single effective algorithm. Moreover, our
NAS component incorporates a novel feature that aims to balance the disparate needs
of classification accuracy and processing speed. This paper continues in Section 2
with a summary of previous research into algorithm speed optimisation. Section 3
details our proposed method, while Section 4 reports on its implementation and test-
ing within the CART classification algorithm. Section 5 provides further analysis of
the results and discussion before Section 6 concludes this paper.

2 Related Work

Decision tree induction speed optimisation can be traced back to Fayyad and Irani
[11]. While the purpose of their research was primarily to provide supporting evi-
dence for using entropy as a heuristic in decision tree induction, a noted ‘side benefit’
of their work was the improvement in algorithm speed.
A decision tree is a classification method using a tree-like structure to split a da-
taset’s instances into subsets based on their attribute values. As shown in Fig. 1, a
decision tree consists of nodes (denoted by ovals), each representing an attribute be-
ing tested; branches (edges), the possible outcomes for values of that attribute; and
leaf nodes or ‘leaves’ (rectangles) holding class values that classify each instance [3].
At each node, all attributes are tested in turn to determine which one provides the
most distinct partitioning of dataset records or ‘information gain’. The test is referred
to as the ‘attribute selection measure’. There are many attribute selection measures,
including Gain Ratio, as used by C4.5 [5], and Gini Index, implemented in CART [4].

Fig. 1. A simple decision tree showing nodes (ovals), branches (edges) and leaves (rectangles).

Attribute selection measures typically handle attributes with categorical values differ-
ently to those with numerical values. Each distinct value of a categorical attribute
becomes a possible outcome or ‘split point’, however, a numerical attribute will more
commonly be split by two edges only. In this case, a numerical value t of attribute Ai
is selected that best splits the instances of dataset D into distinct subsets such that one
edge splits instances with values of Ai ≤ t and the other edge splits on values of Ai > t.
To select the value t, the j distinct values of numeric attribute Ai are sorted in
increasing order. The mid-point between each pair of adjacent values j and j+1 is tested
for maximum information gain. The mid-point of the adjacent pair showing the maximum
information gain is then held for that attribute as value t. All m attributes are
tested for maximum information gain and attribute Ak is selected where
$G(A_k) = \max_{1 \le i \le m} G(A_i)$, G(Ak) is the information gain of Ak and
1 ≤ i ≤ m. If Ak is numerical, its recorded value t is used as the ‘split point’, as
noted above.
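
The exhaustive search described above can be sketched as follows. This is a minimal illustration, not the CART or Weka implementation: the function names are hypothetical, and the "information gain" is assumed here to be the reduction in Gini impurity.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_gain(labels, left, right):
    """Impurity reduction from splitting `labels` into `left` and `right`."""
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted


def best_split_exhaustive(values, labels):
    """Test the mid-point of every adjacent pair of distinct values.

    Returns (t, gain, tests); `tests` is j - 1 for j distinct values.
    Assumption: gain is measured as Gini impurity reduction.
    """
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best_t, best_gain, tests = None, -1.0, 0
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2.0
        left = [c for v, c in pairs if v <= t]   # edge Ai <= t
        right = [c for v, c in pairs if v > t]   # edge Ai > t
        gain = gini_gain(labels, left, right)
        tests += 1
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain, tests
```

For six distinct values the function evaluates five candidate mid-points, which is the j-1 workload that the methods discussed in this paper try to reduce.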
But as Fayyad and Irani indicate in [11], testing every adjacent pair of j distinct
values of a numeric attribute implies j-1 possible split-points. However, they also
found that if the sorted sequence of j values resulted in their corresponding class val-
ues grouping together neatly into k groups, only the split-points between those k
groups need to be tested. This would require just k-1 tests and since k << j, the com-
putational workload would be significantly reduced. Nevertheless, if the sequence of
class values is all jumbled, each adjacent pair of distinct numeric attribute values
would need to be tested and the computational workload is back to j-1 calculations.
However, in practice, Fayyad and Irani deemed this expanded testing would not be
necessary and that only k-1 testing points would be required, with those tests occur-
ring at steps equal to |Ci|, where 1 ≤ i ≤ k and k is the number of possible class values.
Yet, while the authors empirically tested this concept, processing speed results were
considered only as far as the split-point discretisation process – the effect on overall
algorithm performance was a secondary consideration [11].
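
The boundary-point idea can be illustrated with a short sketch. It is an assumption-laden reading of [11], not the authors' code: the function name is hypothetical, and a mid-point is kept as a candidate whenever the two adjacent distinct values do not share one single class.

```python
from itertools import groupby


def boundary_split_points(values, labels):
    """Return mid-points between adjacent distinct values where the class changes.

    A mid-point is skipped only when both neighbouring values carry the same
    single class, so cleanly grouped classes need k - 1 tests for k groups.
    """
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    # One entry per distinct value: (value, set of classes seen at that value).
    groups = [(v, {c for _, c in grp})
              for v, grp in groupby(pairs, key=lambda p: p[0])]
    points = []
    for (v1, c1), (v2, c2) in zip(groups, groups[1:]):
        if not (len(c1) == 1 and c1 == c2):
            points.append((v1 + v2) / 2.0)
    return points
```

With classes grouped as a, a, a, b, b, b only one candidate remains, whereas a fully alternating sequence keeps all j-1 candidates, matching the two cases described above.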
This method of reducing the number of possible split-points was further considered
in [12] to improve support for very large datasets, resulting in the development of the
CLOUDS algorithm. The technique implemented is described as ‘Split-Point Sampling with
Estimation’ (SSE). It samples the j-1 possible split-points by dividing them into
intervals using an undefined ‘similar frequency’ method. It then estimates the optimum
split-point within each interval via a hill-climbing algorithm; after all intervals are
tested, the leading split-point is chosen for that attribute. While improving processing
speed was not the main purpose of CLOUDS, further tests were carried out comparing SSE
with a more basic split-point sampling technique. The tests used 100 and 200
split-points, respectively. While the magnitude of improvement varied, results indicate
that testing fewer split-points reduced the overall time required for the algorithm to
build its model of the test dataset.
This idea of sampling split-points has also been applied to the attributes them-
selves. Rather than test every attribute at each node, ‘feature subset selection’ reduces
the number of attributes. Decision tree induction assumes that each non-class attribute
has a relationship with the class attribute. Thus, each non-class attribute is tested at
each node point. However, using irrelevant attributes can reduce a tree’s classification
ability due to the unnecessary information or ‘noise’ these attributes can contain [13].
Removing the irrelevant attributes not only can improve overall classification accura-
cy, but also reduce the computational workload, since fewer attributes are tested.
Attribute subset selection has evolved with numerous techniques grouped into
three broad categories – wrapper, filter and embedded. Wrapper methods use the clas-
sifier output itself to determine the most suitable subset of attributes. However, while
accurate, wrapper methods can be computationally costly for datasets with high di-
mensionality and are considered a ‘brute force’ method of subset selection [14]. Filter
methods, by contrast, select the attribute subset prior to tree induction through a pro-
cess of ‘feature ranking’. Yet, filtering can remove attributes that alone may not pro-
vide information, but in combination with other attributes hold knowledge that would
otherwise be lost. Alternatively, embedded methods incorporate attribute selection
within the decision tree algorithm itself, making them more efficient than wrapper
methods. The CART algorithm has been identified as having an embedded mecha-
nism for attribute selection [14], however, it still requires testing of all attributes at
every node. The question we aim to answer in this paper is how to reduce the number
of attributes required for testing in a way that provides meaningful improvement to
processing speed whilst minimising any negative effect on classification accuracy.

3 Proposed Method

Our proposed method for accelerating tree induction involves combining components
of split-point sampling and attribute subset selection into a single novel implementa-
tion we have named Split-Point And Attribute Reduced Classifier or SPAARC.
Moreover, this method can be applied to any classification algorithm that implements
its numeric attribute split-point analysis and node attribute selection recursively.
The two specific components of SPAARC will now be detailed individually, starting
with Node Attribute Sampling (NAS) covered in Section 3.1 and Split-Point Sam-
pling (SPS) in Section 3.2. This will be followed by empirical evaluation in Section 4.

3.1 Node Attribute Sampling (NAS)


The NAS component in our proposed method avoids testing every non-class attribute
at every tree node regardless of the strength of that attribute’s relationship with the
class attribute. It further avoids the limitation of preselecting a subset of attributes
before tree induction begins. Instead, our method dynamically selects the attribute
space by switching between the full and subset attribute lists based on the tree depth
level of the current node being tested. A simple example of this is shown in Fig. 2.

Fig. 2. Node attribute sampling samples the full attribute space on each treeDepth modulus
level (in this case, treeDepth modulus = 2)

Fig. 3. Algorithms for NAS (NodeAttributeSample) and SPS (SplitPointSample)


Induction begins at the root node, N1, which sits on Tree Depth Level (TDL) 1. Here,
the full non-class attribute space A = {A1…A6} is always used to find the most ap-
propriate attribute for the root node. As TDL 1 is also designated as a modulus level,
the information gain scores from each non-class attribute tested for N1 are recorded
and the average information gain is calculated. The indices of these non-class attrib-
utes are stored separately in decreasing order of their information gain scores. The
number of attributes with above-average information gain is counted and those attrib-
utes are selected as the new attribute subset, which for N1 is {A2, A4}. Moving to
node N2, the ‘treeDepth’ value is now set to 2. Thus, node N2 is not on a modulus
TDL and tested only with the previously-stored attribute subset, i.e. {A2, A4}, so that
the selected node attribute can only come from this subset. Tree induction continues
recursively following the left-hand branch, reaching node N3. As N3 sits on a modu-
lus TDL, the attribute selected for this node now comes from the full attribute space.
This process continues down until leaf nodes N6 and N7 are set. At this point, the
current subset of {A5, A6}, which was created at node N5, is retained as the tree in-
duction returns recursively up the tree to node N8. Again, as N8 is also on a modulus
TDL, the full attribute space is used and the subset {A1, A6} is created. Following
this recursive path, the next node for processing is node N9. As N9 is not on a modu-
lus TDL, its attribute is selected from the current stored subset {A1, A6}. Induction
continues to N10, which being on a modulus TDL, selects from the full attribute
space and creates a new subset {A2, A3}. This subset services nodes N11 and N12.
However, since the next node in the sequence, N13, also sits on a non-modulus TDL,
it, too, selects its attribute from the current {A2, A3} subset. The final two nodes,
N14 and N15, both sit on a modulus TDL. Thus, both choose an attribute from the full
attribute space.
The pseudo-code for NAS appears in the NodeAttributeSample function in Fig. 3.
It takes as parameters the dataset D and attributes A, along with the tree depth modulus
factor M, and returns the split-attribute As. To start, if the current tree depth level
modulo M equals one (1), i.e. the node sits on a modulus TDL, the function looks for the
best split-point from every attribute, retaining the information gain (infoGain) factor
from each. Following this, the attributes are sorted by information gain in decreasing
order and stored in the array ‘sortedAtts’. The average information gain is calculated
and stored in ‘avgGain’. Next, each attribute Ai in the sortedAtts array is tested and
if its information gain is greater than the average gain, that attribute is added to the
attribute subset list called ‘attSubset’.
If the current tree depth level modulo M is not equal to one, then each attribute of the
subset (rather than every attribute) is tested to find its split-point. In any case,
regardless of tree level, the attribute with the maximum information gain is returned as
the attribute to split, As, along with the information gain values as the array
‘infoGains’. This NodeAttributeSample function also calls the ‘SplitPointSample’
function, featuring the split-point sampling (SPS) component we shall now discuss in
Section 3.2.
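
As a rough illustration of the NAS logic described in this section, the sketch below selects a split attribute for one node. It is not the Fig. 3 pseudo-code: the names `is_modulus_level`, `node_attribute_sample` and the `gain_fn` callback are hypothetical, a node is assumed to be on a modulus level when its depth modulo M equals one, and falling back to the full attribute space when the stored subset is empty is an added assumption.

```python
def is_modulus_level(depth, modulus):
    """Assumed rule: depth 1, 1 + M, 1 + 2M, ... use the full attribute space."""
    return modulus <= 1 or depth % modulus == 1


def node_attribute_sample(gain_fn, attributes, att_subset, depth, modulus):
    """Pick the split attribute for one node.

    gain_fn(attr) is a hypothetical callback returning the best information
    gain obtainable by splitting on `attr` at this node.
    Returns (split_att, att_subset, info_gains).
    """
    # Assumption: an empty stored subset also triggers a full search.
    use_full = is_modulus_level(depth, modulus) or not att_subset
    candidates = list(attributes) if use_full else list(att_subset)
    if not candidates:
        raise ValueError("no attributes to test")
    info_gains = {a: gain_fn(a) for a in candidates}
    if use_full:
        # Sort by decreasing gain and keep only above-average attributes.
        sorted_atts = sorted(candidates, key=info_gains.get, reverse=True)
        avg_gain = sum(info_gains.values()) / len(info_gains)
        att_subset = [a for a in sorted_atts if info_gains[a] > avg_gain]
    split_att = max(candidates, key=info_gains.get)
    return split_att, list(att_subset), info_gains
```

With six attributes whose gains at the root are 0.10, 0.50, 0.05, 0.40, 0.02 and 0.01 (illustrative numbers only), the average is 0.18, so the stored subset becomes {A2, A4} and the next non-modulus level tests just those two attributes.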

3.2 Split-Point Sampling (SPS)


As we have seen previously in Section 2, decision tree algorithms test for the optimum
split-point of a numerical attribute at a node using some measure of information gain.
One example measure is the Gini index:

$$\mathrm{Gini}(R) = 1 - \sum_{i} p_i^{2} \qquad (1)$$

where pi is the probability that the record R in dataset D belongs to the class value Ci
and the sum runs over all class values.
However, to obtain the split-point with maximum information gain, each adjacent pair
of distinct attribute values must be tested, such that for j distinct values, there are
generally j-1 adjacent pairs tested. Our SPS component does not interfere with the
information gain measure itself; however, it dynamically reduces the number of possible
split-points tested. This is done by dividing the range of distinct attribute values into
equal-width intervals and using the adjacent pair of values at the edges of each interval
as potential split points. Thus, if there are k intervals, only k-1 test points are
required. Through experimentation, a value of k = 10 has been shown to provide good
results. While the sampling of ten interval points may not result in the selection of the
split-point of maximum information gain, it can be shown that the split-point of maximum
information gain is never more than half the interval step away from an interval test
point.

Fig. 4. Maximum distance the actual split-point i can be from a tested split-point is m/2k.

Theorem 1: If m = range of numeric attribute values and k = the number of intervals,
the actual split-point of maximum information gain cannot be more than m/2k, that is
m/(2k), away from an interval test point.

Proof: Let p and q be two consecutive interval points, such that q – p = m/k. Let i be
the actual split point somewhere in the range sequence between points p and q.
From this, there are three possibilities: i) that (i – p) < m/2k, ii) that (i – p) = m/2k,
and iii) that (i – p) > m/2k. For possibilities i) and ii), the theorem is already proved,
since neither is greater than m/2k.
Now for possibility iii), let (i – p) > m/2k. Since from Fig. 4, (i – p) + (q – i) = m/k,
it must be that (i – p) = m/k – (q – i). Substituting for (i – p), it must also follow that if
(i – p) > m/2k, then m/k – (q – i) > m/2k. It then follows that m/k – m/2k > (q – i) and
(q – i) < m/2k, so i lies within m/2k of the interval point q.
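
A worked instance of Theorem 1 may help; the numbers below are illustrative assumptions and are not taken from the experiments.

```latex
% Illustrative values: range m = 50, k = 10 intervals.
\[
m = 50,\quad k = 10 \;\Rightarrow\; \frac{m}{k} = 5,\qquad \frac{m}{2k} = 2.5 .
\]
% Consecutive interval points p = 20 and q = 25; suppose the optimal split is i = 23.4.
\[
i - p = 3.4 > 2.5 \;\Rightarrow\; q - i = \frac{m}{k} - (i - p) = 5 - 3.4 = 1.6 < 2.5 .
\]
```

Here case iii) applies: the optimal point is more than m/2k from p, so it is necessarily less than m/2k from q.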

Choosing the optimum level of k is dependent upon the number of distinct values j,
for as k approaches j, the difference in the number of calculations required, and thus
the speed gain as a result, will be minimal. However, if k is too low, the distance
between the optimal and selected split points will likely be greater, potentially
negatively affecting the choice of attribute at each node and hence, overall accuracy.
The SplitPointSample pseudo-code in Fig. 3 details this function. It takes as
parameters the subset of records for the current node, Dsorted, and the current attribute,
Ai. It returns an array of information gains ‘infoGain’, plus the candidate split-point
called ‘splitPoint’. The function first considers attribute Ai: if it is categorical, it
is passed on to the tree algorithm’s existing function for finding categorical
split-points. As the SPS component handles numerical attributes only, we skip this task.
For numerical attributes, four values are initially calculated from the sorted range
of record values for attribute Ai: hopStart is the Ai value of the first record,
valueRange is the range of numeric values m, hopStep is the range distance divided by the
number of intervals, k, set to 10, and hopPoint is the first interval point, set to
hopStart plus hopStep. At this point, each record Rj in the current data subset Dsorted is
considered for the value of its attribute Ai. Unless this value is greater than the
current interval value hopPoint, it is skipped. If this Ai value is greater, we calculate
the information gain. We compare this information gain value, mj, against the current
maximum information gain value maxInfoGain: if value mj is greater, it becomes the new
maximum and the current best ‘splitPoint’ is set between the Ai values of current record
Rj and the previous record Rj-1. We then advance the interval point hopPoint by hopStep
to the next interval and continue until the last record is reached. On completion, the
maximum information gain value for Ai is stored and its corresponding split-point is
returned.
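
The steps above can be sketched in Python as follows. This is a hedged reading of the description, not the Fig. 3 pseudo-code: the snake_case names mirror hopStart, valueRange, hopStep and hopPoint, the gain is assumed to be Gini impurity reduction, and two details are added assumptions: an integer interval counter guards against floating-point drift, and empty intervals are skipped in one go.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the classes present in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_point_sample(values, labels, k=10):
    """Sample at most k - 1 candidate split-points at equal-width hops.

    Returns (split_point, max_gain, tests).
    """
    records = sorted(zip(values, labels), key=lambda r: r[0])
    n = len(records)
    if n < 2 or records[0][0] == records[-1][0]:
        return None, 0.0, 0              # nothing to split on
    hop_start = records[0][0]
    value_range = records[-1][0] - hop_start
    hop_step = value_range / k
    hop = 1                              # index of the current interval point
    hop_point = hop_start + hop * hop_step
    parent = gini([c for _, c in records])
    max_gain, split_point, tests = -1.0, None, 0
    for j in range(1, n):
        value = records[j][0]
        if hop >= k or value <= hop_point:
            continue                     # still inside the current interval
        left = [c for _, c in records[:j]]
        right = [c for _, c in records[j:]]
        gain = (parent - (len(left) / n) * gini(left)
                - (len(right) / n) * gini(right))
        tests += 1
        if gain > max_gain:
            max_gain = gain
            # Split between the current record and the previous one.
            split_point = (records[j - 1][0] + value) / 2.0
        # Assumption: advance past every interval point below this value.
        while hop < k and value > hop_point:
            hop += 1
            hop_point = hop_start + hop * hop_step
    return split_point, max_gain, tests
```

On 100 evenly spaced values with a class change after the 37th record, this sketch performs 9 gain calculations instead of 99 and returns 39.5, three units from the exhaustive optimum of 36.5 and within half the interval step of 9.9.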

4 Experiments

To test the validity of SPAARC, experiments were carried out using 14 open-source
freely-available datasets from the University of California, Irvine (UCI) machine-
learning repository [15]. Dataset details are shown in Table 1. These datasets were
tested on the Weka version 3.8.2 data-mining core using the SimpleCART algorithm
from the Weka Package Manager. Tests were conducted comparing the SimpleCART
algorithm in its original form against the same algorithm augmented with our
SPAARC method. Further tests involving the SPS and NAS components individually
were carried out to identify the role each plays in the overall SPAARC result.

Table 1. Details of the 14 numeric datasets used in experiments

Dataset                Instances  Attributes    Dataset                  Instances  Attributes
mfeat-fourier (FOU)         2000          77    Spambase (SPA)                4601          58
mfeat-zernike (ZER)         2000          48    EEG eye state (EEG)          14980          15
Page-blocks (PAG)           5473          11    Crowd map (CRO)              10545          29
Pen-digits (PEN)           10992          17    Wine quality (WIN)            4898          12
Segment (SEG)               2310          20    Shuttle (SHU)                43500           9
Waveform (WAV)              5000          41    Sensorless drive (SEN)       58509          49
Optical digits (OPT)        5620          65    Skin segment (SKS)          245057           4
Classification accuracy was evaluated using ten-fold cross-validation, while tree in-
duction or ‘model build’ times were programmatically recorded to millisecond preci-
sion. All tests were carried out on a 3.1GHz Intel Core i5 2300 PC with Windows 8.1
operating system, 14GB of RAM and 240GB Intel 535 Series solid-state drive (SSD).
Classification accuracy and model build times of our SPAARC algorithm compared
with SimpleCART appear in Table 2 (leading results in all tests are shown in bold).
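
For readers unfamiliar with the evaluation protocol, the sketch below shows one way to partition records into ten folds and to time a model build. It is a generic illustration, not Weka's procedure: the function names are hypothetical and the folds are plain contiguous blocks, with no shuffling or stratification assumed.

```python
import time


def k_fold_indices(n, k=10):
    """Yield (train_idx, test_idx) pairs for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size


def timed(build_fn, *args):
    """Return (result, elapsed_seconds) for a single call to build_fn."""
    t0 = time.perf_counter()
    result = build_fn(*args)
    return result, time.perf_counter() - t0
```

Each record appears in exactly one test fold, and `timed` can wrap any model-building callable to record its build time.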

Table 2. Classification accuracy and build times of SimpleCART and SPAARC algorithms.

                            Class’n Accuracy (%)    Model Build Time (secs)
Dataset                   SimpleCART      SPAARC  SimpleCART      SPAARC
mfeat-fourier (FOU)            75.50       76.50       2.629       0.788
mfeat-zernike (ZER)            67.25       70.15       1.808       0.570
Page-blocks (PAG)              96.78       96.69       0.530       0.521
Pen-digits (PEN)               96.32       95.93       1.889       1.551
Segment (SEG)                  96.15       95.89       0.310       0.201
Waveform (WAV)                 76.68       77.36       1.839       0.998
Optical digits (OPT)           90.53       89.88       2.193       1.814
Spambase (SPA)                 92.44       91.81       1.910       1.318
EEG eye state (EEG)            84.11       84.30       2.652       2.436
Crowdsource map (CRO)          90.09       90.19       4.266       1.712
Wine quality (WIN)             59.31       58.13       0.990       0.833
Shuttle (SHU)                  99.94       99.95       4.278       3.249
Sensorless drive (SEN)         98.46       98.64      68.054      26.870
Skin segment (SKS)             99.92       99.91      19.203      15.639
Average (Total)                87.39       87.52    (112.55)    (58.499)

Significantly, SPAARC produced faster model build times than SimpleCART in all
14 datasets tested. The most successful of the results achieved by SPAARC occurred
with the FOU dataset, with build time dropping by 70%. At the same time, classifica-
tion accuracy for this dataset rose by one percentage point (76.5% vs 75.5%).
Importantly, SPAARC achieved these speeds whilst still managing to record higher
classification accuracy scores in seven of the 14 datasets compared with Simple-
CART. Moreover, individual accuracy gains outweigh losses to the extent that
SPAARC produced a marginally-higher overall classification accuracy average than
SimpleCART (87.52 vs 87.39). The least successful SPAARC result occurred with
the WIN dataset, where classification accuracy fell by 1.18 percentage points, while
build speed still improved by a modest 16%. However, across all 14 datasets,
SPAARC reduced the total build time by more than 48% (58.499 secs vs 112.55 secs). To
understand in more detail how each of the algorithm’s two main components contrib-
ute to the overall result, SPAARC was modified to implement each component sepa-
rately and compared in each case with the SimpleCART algorithm. The results in
Table 3 show that SPS improved model build times in ten of the 14 datasets tested.
Worthy of note is the fact that the overall build time achieved by SPS is greater than
that recorded by SPAARC, indicating the SPAARC results are not due solely to SPS.
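
The percentage reductions quoted above follow directly from the build times in Table 2. The short sketch below reproduces the arithmetic; the helper name `reduction_pct` is hypothetical.

```python
def reduction_pct(baseline_secs, new_secs):
    """Percentage reduction in build time relative to the baseline."""
    return 100.0 * (baseline_secs - new_secs) / baseline_secs


# Build times (secs) taken from Table 2: SimpleCART vs SPAARC.
fou = reduction_pct(2.629, 0.788)       # mfeat-fourier, roughly 70%
win = reduction_pct(0.990, 0.833)       # Wine quality, roughly 16%
total = reduction_pct(112.55, 58.499)   # all 14 datasets, just over 48%
```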
Table 3. SimpleCART classification accuracy and build times compared with SPS alone.

                            Class’n Accuracy (%)    Model Build Time (secs)
Dataset                   SimpleCART         SPS  SimpleCART         SPS
mfeat-fourier (FOU)            75.50       75.95       2.629       0.973
mfeat-zernike (ZER)            67.25       69.10       1.808       0.707
Page-blocks (PAG)              96.78       96.62       0.530       0.585
Pen-digits (PEN)               96.32       96.23       1.889       1.818
Segment (SEG)                  96.15       95.89       0.310       0.203
Waveform (WAV)                 76.68       77.60       1.839       1.315
Optical digits (OPT)           90.53       89.88       2.193       2.206
Spambase (SPA)                 92.44       91.70       1.910       2.380
EEG eye state (EEG)            84.11       84.55       2.652       2.734
Crowdsource map (CRO)          90.09       89.73       4.266       2.214
Wine quality (WIN)             59.31       58.64       0.990       0.902
Shuttle (SHU)                  99.94       99.94       4.278       4.230
Sensorless drive (SEN)         98.46       98.41      68.054      36.171
Skin segment (SKS)             99.92       99.93      19.203      16.336
Average (Total)                87.39       87.44    (112.55)     (72.77)

Table 4. SimpleCART classification accuracy and build times compared with NAS alone.

                            Class’n Accuracy (%)    Model Build Time (secs)
Dataset                   SimpleCART         NAS  SimpleCART         NAS
mfeat-fourier (FOU)            75.50       75.00       2.629       2.018
mfeat-zernike (ZER)            67.25       69.15       1.808       1.332
Page-blocks (PAG)              96.78       96.62       0.530       0.490
Pen-digits (PEN)               96.32       95.92       1.889       1.697
Segment (SEG)                  96.15       96.15       0.310       0.308
Waveform (WAV)                 76.68       76.86       1.839       1.415
Optical digits (OPT)           90.53       90.53       2.193       2.305
Spambase (SPA)                 92.44       92.50       1.910       1.706
EEG eye state (EEG)            84.11       83.86       2.652       2.380
Crowdsource map (CRO)          90.09       90.48       4.266       3.173
Wine quality (WIN)             59.31       57.92       0.990       0.915
Shuttle (SHU)                  99.94       99.92       4.278       4.637
Sensorless drive (SEN)         98.46       98.59      68.054      47.707
Skin segment (SKS)             99.92       99.90      19.203      17.568
Average (Total)                87.39       87.39    (112.55)    (87.649)

SPS matched or improved upon the SimpleCART classification accuracy results in six of
14 datasets, although its marginally better overall average classification accuracy was
largely due to one dataset gain (ZER). More importantly, SPS improved the cumulative
model build time by just under 40 seconds or 35.3%. Thus, SPS can accelerate tree
induction yet have minimal effect on classification accuracy.

Similarly, Table 4 shows the results of implementing the NAS component alone in
comparison with SimpleCART. Here, model build times were improved by NAS in
12 of 14 datasets tested. While NAS achieved speed gains in more datasets than SPS,
the reduction in total build time resulting from NAS was smaller than that from SPS
(87.65 seconds vs SPS’ 72.77 seconds, against 112.55 seconds for SimpleCART). This
indicates NAS with its treeDepth modulus setting of 5 contributes less to overall speed
than SPS, but is more likely to show a speed gain on a dataset than SPS alone. The
results also saw NAS match or improve classification accuracy in seven of 14 datasets,
with no change in overall average accuracy. More broadly, the results of Tables 3 and 4
show the performance of SPAARC is due to both components combined.
Overall, SPAARC shows the potential of combining SPS and NAS techniques with
significant falls in build time, while classification accuracy is largely maintained.

5 Discussion and further research

Although every decision tree algorithm has its own structure, tree induction in general
is a two-step process. First, the tree is grown until all branches are finished as leaves
via a stopping criterion. Then, second, depending on the algorithm, the tree is reduced
or ‘pruned’ to reduce the effects of noise in the training dataset and improve its gener-
alisation ability on new unseen instances [16]. The SPAARC algorithm’s two compo-
nents only operate on the growth phase, leaving potential for further time savings
through accelerating the pruning phase. This will be an area for future research.
Although research shows attempts to implement classification algorithms within
low-power MCUs have been made, as shown in Section 1, the lack of RAM in many
of these devices is likely as much of an impediment to their implementation as their
limited processing capability. This would have the effect of limiting the size of da-
tasets that are ‘mine-able’, whether in terms of instances or attributes. Thus, reducing
mining algorithm RAM requirements is another area of further research relevant to
local IoT data-mining. Nevertheless, the speed gains achieved by SPAARC with min-
imal loss of classification accuracy could contribute toward greater implementation of
data-mining in IoT applications. This is another area we are keen to research.

6 Conclusion

In this paper, we have proposed a novel implementation of sampling methods we
have called SPAARC to reduce the computational workload of decision tree induction.
The first of these methods involves dynamically selecting attributes with above
average information gains, then using the current tree depth level to switch between
all attributes and the selected attribute subspace for node testing. The second method
samples possible attribute value split-points by hopping across the distinct value space
at equal widths just prior to the tree algorithm’s information gain calculations. The
combination of these two components improved upon SimpleCART’s classification
accuracy in half of the datasets tested, while reducing the model build time in all da-
taset tests by as much as 70%. These methods only apply to the tree growth phase and
leave the reduction or pruning phase open for further research that we plan to pursue.

7 Acknowledgements

This research is supported by an Australian Government Research Training Program
(RTP) scholarship.

References

[1] Islam, M.Z., M. Furner, and M.J. Siers, WaterDM: A Knowledge Discovery and Decision
Support Tool for Efficient Dam Management. 2016.
[2] Dangare, C.S. and S.S. Apte, Improved study of heart disease prediction system using data
mining classification techniques. International Journal of Computer Applications,
2012. 47(10): p. 44-48.
[3] Han, J., J. Pei, and M. Kamber, Data mining: concepts and techniques. 2011: Elsevier.
[4] Breiman, L., J. Friedman, C.J. Stone, and R.A. Olshen, Classification and regression trees.
1984: CRC press.
[5] Quinlan, J.R., C4.5: programs for machine learning. 2014: Elsevier.
[6] Nath, S. ACE: exploiting correlation for energy-efficient and continuous context sensing. in
Proceedings of the 10th international conference on Mobile systems, applications,
and services. 2012. ACM.
[7] Srinivasan, V., S. Moghaddam, A. Mukherji, K.K. Rachuri, C. Xu, and E.M. Tapia.
Mobileminer: Mining your frequent patterns on your phone. in Proceedings of the
2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing.
2014. ACM.
[8] Hinwood, A., P. Preston, G. Suaning, and N. Lovell, Bank note recognition for the vision
impaired. Australasian Physics & Engineering Sciences in Medicine, 2006. 29(2): p.
229.
[9] Maurer, U., A. Smailagic, D.P. Siewiorek, and M. Deisher. Activity recognition and
monitoring using multiple sensors on different body positions. in Wearable and
Implantable Body Sensor Networks, 2006. BSN 2006. International Workshop on.
2006. IEEE.
[10] Darrow, B. Amazon Just Made a Huge Change to its Cloud Pricing.
http://fortune.com/2017/09/18/amazon-cloud-pricing-second/. 2017.
[11] Fayyad, U.M. and K.B. Irani, On the handling of continuous-valued attributes in decision
tree generation. Machine learning, 1992. 8(1): p. 87-102.
[12] Ranka, S. and V. Singh. CLOUDS: A decision tree classifier for large datasets. in
Proceedings of the 4th Knowledge Discovery and Data Mining Conference. 1998.
[13] Chandrashekar, G. and F. Sahin, A survey on feature selection methods. Computers &
Electrical Engineering, 2014. 40(1): p. 16-28.
[14] Guyon, I. and A. Elisseeff, An introduction to variable and feature selection. Journal of
machine learning research, 2003. 3(Mar): p. 1157-1182.
[15] Dua, D. and E. Karra Taniskidou, UCI Machine Learning Repository.
http://archive.ics.uci.edu/ml. 2018.
[16] Buntine, W. and T. Niblett, A further comparison of splitting rules for decision-tree
induction. Machine Learning, 1992. 8(1): p. 75-85.
