Decision Tree Classification of Spatial Data Streams Using Peano Count Trees
For each band (assuming 8-bit data values), we get eight basic P-trees, one for each bit position. For band Bi we label the basic P-trees Pi,1, Pi,2, …, Pi,8; thus Pi,j is a lossless representation of the jth bits of the values from the ith band. However, the Pi,j provide more information and are structured to facilitate data mining processes. Some of the useful features of P-trees are described later in this paper and in our earlier work [11].

We point out that the classifier is precisely the classifier built by the ID3 decision tree induction algorithm [4]. The point of the work is to reduce the time it takes to build and rebuild the classifier as new data continue to arrive, which is very important for performing classification on data streams.
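To make the bit-level decomposition concrete, the following sketch (our own illustration, not code from the paper) splits one 8-bit band into its eight bSQ bit planes; the basic P-tree Pi,j would then be built as the quadrant count tree over plane j. The array and its values are hypothetical.

import numpy as np

def bsq_bit_planes(band, bits=8):
    """Split a band of 'bits'-bit values into bSQ bit planes (MSB first).

    Plane j holds the j-th bit of every pixel; the basic P-tree P_{i,j}
    would be built as the quadrant count tree of plane j.
    """
    return [(band >> (bits - 1 - j)) & 1 for j in range(bits)]

# Hypothetical 2x2 patch of 8-bit reflectance values.
band = np.array([[200,  10],
                 [ 37, 255]], dtype=np.uint8)
for j, plane in enumerate(bsq_bit_planes(band), start=1):
    print(f"bit plane {j}:\n{plane}")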
3.1 Data Smoothing and Attribute Relevance
In the overall classification effort, as in most data mining approaches, there is a data preparation stage in which the data are prepared for classification. Data preparation can involve data cleaning (noise reduction by applying smoothing techniques and missing-value management techniques). The P-tree data structure facilitates a proximity-based data smoothing method, which can reduce the data classification time considerably. The smoothing method is called bottom-up purity shifting. By replacing counts of 3 with 4 and counts of 1 with 0 at level 1 (and making the resulting changes on up the tree), the data are smoothed and the P-tree is compressed. A more drastic smoothing can be effected: the user can determine which set of counts to replace with pure-1 and which set of counts to replace with pure-0. The most important thing to note is that this smoothing can be done almost instantaneously once the P-trees are constructed. With this method it is feasible to smooth data from the data stream before mining.
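As a rough sketch of bottom-up purity shifting (our own illustration; which counts to shift is a user choice, as noted above), the level-1 quadrant counts of a P-tree are nudged to the nearest pure value and the root count is recomputed.

def purity_shift(level1_counts, quadrant_size=4):
    """Smooth level-1 counts: quadrant_size-1 -> quadrant_size (pure-1), 1 -> 0 (pure-0)."""
    smoothed = []
    for c in level1_counts:
        if c == quadrant_size - 1:
            c = quadrant_size          # almost pure-1: shift up
        elif c == 1:
            c = 0                      # almost pure-0: shift down
        smoothed.append(c)
    return smoothed, sum(smoothed)     # smoothed counts and the new root count

print(purity_shift([4, 3, 1, 0]))      # -> ([4, 4, 0, 0], 8)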
Another important pre-classification step is relevance analysis (selecting only a subset of the feature attributes, so as to improve algorithm efficiency). This step can involve the removal of irrelevant or redundant attributes. We can build a cube, called the Peano Cube (P-cube), in which each dimension is a band and each band has several values depending on the bit precision. For example, for an image with three bands using 1-bit precision, the cell (0,0,1) gives the count of P1' AND P2' AND P3. We can determine relevance by rolling up the P-cube to the class label attribute and each other potential decision attribute in turn. If any of these roll-ups produce counts that are uniformly distributed, then that attribute is not going to be effective in classifying the class label attribute. The roll-up can be computed from the basic P-trees without necessitating the actual creation of the P-cube. This can be done by ANDing the P-trees of the class label attribute with the P-trees of the potential decision attribute; only an estimate of uniformity in the root counts is needed. Better estimates can be obtained by ANDing down to a fixed depth of the P-trees. For instance, ANDing to depth 1 provides rough distribution information, ANDing to depth 2 provides better distribution information, and so forth. Again, the point is that P-trees facilitate simple real-time relevance analysis, which makes it feasible for data streams.
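The relevance check can be pictured with a simplified stand-in for P-trees in which each value P-tree is just the set of pixel positions holding that value, so RC(X AND Y) is the size of a set intersection. The sketch below (our own, with hypothetical data and names) rolls up class-label counts against one candidate attribute and flags a roughly uniform distribution as not useful.

from itertools import product

def root_count(ptree_a, ptree_b):
    """RC of the AND of two value P-trees, modeled here as position sets."""
    return len(ptree_a & ptree_b)

def looks_uniform(counts, tolerance=0.25):
    """Crude uniformity estimate on the roll-up counts."""
    nonzero = [c for c in counts if c]
    if not nonzero:
        return True
    avg = sum(nonzero) / len(nonzero)
    return all(abs(c - avg) <= tolerance * avg for c in nonzero)

# Hypothetical value P-trees (pixel-position sets) for the class label and one attribute.
class_ptrees = {"00": {0, 1, 2, 3}, "01": {4, 5, 6, 7}}
attr_ptrees  = {"0":  {0, 2, 4, 6}, "1":  {1, 3, 5, 7}}

rollup = [root_count(c, a) for c, a in product(class_ptrees.values(), attr_ptrees.values())]
print(rollup, "-> irrelevant" if looks_uniform(rollup) else "-> potentially useful")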
3.2 Classification by Decision Tree Induction Using P-trees
A decision tree is a flowchart-like structure in which each node denotes a test on an attribute, each branch represents an outcome of the test, and the leaf nodes represent classes or class distributions. Unknown samples can be classified by testing attributes against the tree; the path traced from root to leaf holds the class prediction for that sample. The basic algorithm for inducing a decision tree from the learning or training sample set is as follows [2, 7]:

• Initially the decision tree is a single node representing the entire training set.

• If all samples are in the same class, this node becomes a leaf and is labeled with that class label.

• Otherwise, an entropy-based measure, "information gain", is used as a heuristic for selecting the attribute that best separates the samples into individual classes (the "decision" attribute).

• A branch is created for each value of the test attribute and the samples are partitioned accordingly.

• The algorithm advances recursively to form the decision tree for the sub-sample set at each partition. Once an attribute has been used, it is not considered in descendant nodes.

• The algorithm stops when all samples for a given node belong to the same class or when there are no remaining attributes (or some other stopping condition is met).

The attribute selected at each decision tree level is the one with the highest information gain. The information gain of an attribute is computed as follows. Assume B[0] is the class attribute and the others are non-class attributes. We store the decision path for each node. For example, in the decision tree below (Figure 2), the decision path for node N09 is "Band 2, value 0011; Band 3, value 1000". We use RC to denote the root count of a P-tree. Given node N's decision path B[1], V[1], B[2], V[2], …, B[t], V[t], let the P-tree

P = PB[1],V[1] ∧ PB[2],V[2] ∧ … ∧ PB[t],V[t].

[Figure 2 shows an example decision tree: the root N01 tests B2, with branches 0010, 0011, 0111, 1010 and 1011 leading to nodes N02–N06; N02, N04, N05 and N06 test B1 and N03 tests B3, leading to nodes N07–N12; N08 and N09 test B1 again, leading to leaves N13 and N14.]

Figure 2. A Decision Tree Example

We can calculate node N's information I(P) through

I(P) = − ∑ (i = 1..n) pi ∗ log2(pi),  where pi = RC(P ∧ PB[0],V0[i]) / RC(P).

Here V0[1], …, V0[n] are the possible B[0] values if we classify by B[0] at node N. If N is the root node, then P is the full P-tree (its root count is the total number of transactions).

Now, if we want to evaluate the information gain of attribute A at node N, we can use the formula Gain(A) = I(P) − E(A), where the entropy after partitioning on A is

E(A) = ∑ (i = 1..n) I(P ∧ PA,VA[i]) ∗ RC(P ∧ PA,VA[i]) / RC(P).

Here VA[1], …, VA[n] are the possible A values if we classify by attribute A at node N.
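To make the formulas concrete, here is a small sketch (ours, not the authors' implementation) that models every value P-tree as the set of tuple positions holding that value, so RC is a set size and ∧ is a set intersection; it computes I(P), E(A) and Gain(A) exactly as defined above. The toy data are hypothetical.

from math import log2

def RC(ptree):
    return len(ptree)

def AND(*ptrees):
    out = ptrees[0]
    for p in ptrees[1:]:
        out = out & p
    return out

def info(P, class_ptrees):
    """I(P) = -sum p_i log2 p_i with p_i = RC(P ^ P_B0,v) / RC(P)."""
    total = RC(P)
    acc = 0.0
    for Pv in class_ptrees.values():
        p = RC(AND(P, Pv)) / total
        if p:
            acc -= p * log2(p)
    return acc

def gain(P, attr_ptrees, class_ptrees):
    """Gain(A) = I(P) - E(A), with E(A) = sum I(P ^ P_A,v) * RC(P ^ P_A,v) / RC(P)."""
    E = 0.0
    for Pv in attr_ptrees.values():
        Pav = AND(P, Pv)
        if RC(Pav):
            E += info(Pav, class_ptrees) * RC(Pav) / RC(P)
    return info(P, class_ptrees) - E

# Hypothetical toy data: 8 tuples, 2 classes, one binary attribute.
class_ptrees = {"c0": {0, 1, 2, 3}, "c1": {4, 5, 6, 7}}
attr_ptrees  = {"a0": {0, 1, 2, 4}, "a1": {3, 5, 6, 7}}
P = set(range(8))                      # at the root node, P is the full P-tree
print(round(gain(P, attr_ptrees, class_ptrees), 3))   # -> 0.189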
3.3 Example
In this example the data is a remotely sensed image (e.g., a satellite image or aerial photo) of an agricultural field, together with the soil moisture levels for the field measured at the same time. We use the whole data set for mining so as to get as good an accuracy as we can; the data are divided into learning and test data sets. The goal is to classify the data using soil moisture as the class label attribute and then to use the resulting classifier to predict soil moisture levels at future times (e.g., to determine capacity to buffer flooding or to schedule crop planting).

Branches are created for each value of the selected attribute and subsets are partitioned accordingly. The following training set contains 4 bands of 4-bit data values (shown in binary). B1 stands for soil moisture; B2, B3, and B4 stand for channels 3, 4, and 5 of AVHRR, respectively.
FIELD    CLASS    REMOTELY SENSED
COORDS   LABEL    REFLECTANCES
X,Y      B1       B2      B3      B4
0,0      0011     0111    1000    1011
0,1      0011     0011    1000    1111
0,2      0111     0011    0100    1011
0,3      0111     0010    0101    1011
1,0      0011     0111    1000    1011
1,1      0011     0011    1000    1011
1,2      0111     0011    0100    1011
1,3      0111     0010    0101    1011
2,0      0010     1011    1000    1111
2,1      0010     1011    1000    1111
2,2      1010     1010    0100    1011
2,3      1111     1010    0100    1011
3,0      0010     1011    1000    1111
3,1      1010     1011    1000    1111
3,2      1111     1010    0100    1011
3,3      1111     1010    0100    1011

Figure 3. Learning Dataset
This learning dataset (Figure 3) is converted to bSQ format. We display the bSQ bit-band values in their spatial positions, rather than displaying them in 1-column files. The Band-1 bit-bands are:

B11     B12     B13     B14
0000    0011    1111    1111
0000    0011    1111    1111
0011    0001    1111    0001
0111    0011    1111    0011

Thus, the Band-1 basic P-trees are as follows (tree pointers are omitted; each tree is given by its root count, its level-1 quadrant counts, and any non-pure level-2 counts):

P1,1:  5   | 0 0 1 4 | 0001
P1,2:  7   | 0 4 0 3 | 0111
P1,3:  16  (pure-1 tree)
P1,4:  11  | 4 4 0 3 | 0111
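As a sketch of how such a basic P-tree could be materialized (a simplified model under our own assumptions, not the authors' implementation), the following recursively counts 1-bits per quadrant of a square bit-band in Peano (Z) order; applied to bit-band B11 above it reproduces the root count 5 with level-1 counts 0 0 1 4.

def ptree(bitband):
    """Recursively count 1-bits per quadrant (Peano/Z order: NW, NE, SW, SE).

    Returns (count, children); children is [] for a pure quadrant or a
    single pixel, so pure-0 and pure-1 quadrants are not expanded.
    """
    n = len(bitband)
    count = sum(sum(row) for row in bitband)
    if n == 1 or count == 0 or count == n * n:
        return (count, [])
    h = n // 2
    quads = [
        [row[:h] for row in bitband[:h]],   # NW
        [row[h:] for row in bitband[:h]],   # NE
        [row[:h] for row in bitband[h:]],   # SW
        [row[h:] for row in bitband[h:]],   # SE
    ]
    return (count, [ptree(q) for q in quads])

B11 = [[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 1, 1],
       [0, 1, 1, 1]]
root, kids = ptree(B11)
print(root, [k[0] for k in kids])          # -> 5 [0, 0, 1, 4]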
We can use the AND and COMPLEMENT operations to calculate all the value P-trees of Band 1, as below (e.g., P1,0011 = P1,1′ AND P1,2′ AND P1,3 AND P1,4). The value P-trees P1,0000, P1,0001, P1,0100, P1,0101, P1,0110, P1,1000, P1,1001, P1,1011, P1,1100, P1,1101, and P1,1110 all have root count 0. The non-empty value P-trees are:

P1,0010:  3  | 0 0 3 0 | 1110
P1,0011:  4  | 4 0 0 0
P1,0111:  4  | 0 4 0 0
P1,1010:  2  | 0 0 1 1 | 0001, 1000
P1,1111:  3  | 0 0 0 3 | 0111

Then we generate the basic P-trees and value P-trees similarly for B2, B3, and B4.
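Using the same position-set stand-in for P-trees as before (an illustration under our own assumptions), a value P-tree is obtained by ANDing basic P-trees or their complements, e.g. P1,0011 = P1,1′ AND P1,2′ AND P1,3 AND P1,4; with the Band-1 bit-bands above this reproduces the non-zero root counts just listed (3, 4, 4, 2, and 3).

from itertools import product

# Band-1 bit-bands from the example, flattened in row-major order (MSB plane first).
B1_bits = ["0000000000110111",   # B11
           "0011001100010011",   # B12
           "1111111111111111",   # B13
           "1111111100010011"]   # B14

ALL = set(range(16))
basic = [{i for i, b in enumerate(plane) if b == "1"} for plane in B1_bits]

def value_ptree(value):
    """AND basic P-trees (or their complements) to get the value P-tree."""
    result = ALL
    for bit, plane in zip(value, basic):
        result = result & (plane if bit == "1" else ALL - plane)
    return result

for v in ("".join(bits) for bits in product("01", repeat=4)):
    rc = len(value_ptree(v))
    if rc:
        print(f"P1,{v}: root count {rc}")
# -> 0010: 3, 0011: 4, 0111: 4, 1010: 2, 1111: 3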
Start with A = B2. Because the node we are currently dealing with is the root node, P is the full P-tree, so the pi are 3/16, 4/16, 4/16, 2/16, and 3/16. Thus we can calculate

I(P) = −(3/16∗log2(3/16) + 4/16∗log2(4/16) + 4/16∗log2(4/16) + 2/16∗log2(2/16) + 3/16∗log2(3/16)) = 2.281.

To calculate E(B2), the P ∧ PA,VA[i] are just the value P-trees of B2, and each I(P ∧ PA,VA[i]) can be calculated by ANDing the B2 value P-trees with the B1 value P-trees. We get E(B2) = 0.656 and Gain(B2) = 1.625.

Likewise, the gains of B3 and B4 are computed: Gain(B3) = 1.084 and Gain(B4) = 0.568. Thus, B2 is selected as the first-level decision attribute.
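These figures can be checked directly from the class counts and the B2 partition counts read off the root counts of the ANDed P-trees. The short sketch below is our own verification, not part of the paper; the counts are taken from the learning dataset above.

from math import log2

def info(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

# Class (B1) root counts at the root node.
i_root = info([3, 4, 4, 2, 3])

# For each B2 value: root counts of the B1 classes inside that partition,
# i.e. RC(P_B2,v ∧ P_B1,c) read straight off the ANDed P-trees.
partitions = {
    "0010": [2],        # all B1 = 0111
    "0011": [2, 2],     # B1 = 0011 / 0111
    "0111": [2],        # all B1 = 0011
    "1010": [1, 3],     # B1 = 1010 / 1111
    "1011": [3, 1],     # B1 = 0010 / 1010
}
n = 16
e_b2 = sum(info(c) * sum(c) / n for c in partitions.values())
print(round(i_root, 3), round(e_b2, 3), round(i_root - e_b2, 3))
# -> 2.281 0.656 1.625, matching the values in the example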
Branches are created for each value of B2 and the samples are partitioned accordingly:

B2 = 0010 → Sample_Set_1
B2 = 0011 → Sample_Set_2
B2 = 0111 → Sample_Set_3
B2 = 1010 → Sample_Set_4
B2 = 1011 → Sample_Set_5

When the algorithm advances recursively to each sub-sample set, it is unnecessary to rescan the learning set to form these sub-sample sets, since the P-trees for those samples have already been computed.

The algorithm terminates with the decision tree:

B2 = 0010 → B1 = 0111
B2 = 0011 → B3 = 0100 → B1 = 0111
            B3 = 1000 → B1 = 0011
B2 = 0111 → B1 = 0011
B2 = 1010 → B1 = 1111
B2 = 1011 → B1 = 0010
4. PERFORMANCE ANALYSIS
Prediction accuracy is usually used as the basis of comparison for different classification methods. However, for data mining on streams, speed is a significant issue. In this paper we use the ID3 algorithm with the P-tree data structure to improve speed, so the important performance issue here is computation speed relative to ID3.

In our method, we only build and store the basic P-trees. All the AND operations are performed on the fly, and only the corresponding root counts are needed.

Our experimental results show that larger data sizes lead to a more significant speed improvement from using P-trees (Figure 4). There are several reasons. First, consider the cost of calculating information gain each time. In ID3, one scan over the entire sample set is needed to test whether all the samples are in the same class, whereas with P-trees we only need to calculate the root counts of the ANDs of the relevant P-trees, and these AND operations can be performed very fast. Figure 5 gives the experimental results comparing the cost of scanning the entire dataset (for different sizes) with the cost of all the P-tree ANDings.

Second, using P-trees, the creation of sub-sample sets is not necessary. If A is a candidate for the current decision attribute with kA P-trees, we only need to AND the P-tree defining the current sample set with each of those kA P-trees. If the P-tree of the current sample set is P2,0100 ∧ P3,0001 and the current attribute is B1 (with, say, 2-bit values), then P2,0100 ∧ P3,0001 ∧ P1,00, P2,0100 ∧ P3,0001 ∧ P1,01, P2,0100 ∧ P3,0001 ∧ P1,10, and P2,0100 ∧ P3,0001 ∧ P1,11 identify the partition of the current sample set. In our algorithm, only P-tree ANDings are required.
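As in the earlier sketches, this partitioning step can be pictured with the position-set stand-in for P-trees (our own simplification, not the authors' code): each branch of the current node is obtained by ANDing the node's P-tree with one value P-tree of the branching attribute, so no pass over the raw tuples is needed. The node and value sets below are hypothetical.

def partition(node_ptree, attr_value_ptrees):
    """Split the current sample set by ANDing with each value P-tree.

    node_ptree:        set of tuple positions in the current node
    attr_value_ptrees: {value: set of positions with that attribute value}
    Returns only the non-empty branches; no scan of the raw data is done.
    """
    return {v: node_ptree & p
            for v, p in attr_value_ptrees.items()
            if node_ptree & p}

# Hypothetical node (e.g. the result of P2,0100 ∧ P3,0001) and an attribute with 2-bit values.
node = {3, 5, 9, 12}
b1_value_ptrees = {"00": {1, 3, 5}, "01": {9}, "10": {12, 14}, "11": {0, 2}}
print(partition(node, b1_value_ptrees))   # -> {'00': {3, 5}, '01': {9}, '10': {12}}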
[Figure 4 is a line chart of total classification cost (vertical axis, 0–700) versus size of data (0–70 MB), comparing ID3 with the P-tree method.]

Figure 4. Classification cost with respect to the dataset size
[Figure 5 is a line chart of cost in ms (base-2 log scale) versus dataset size (up to 80,000 KB), comparing the scan cost in ID3 with the ANDing cost using P-trees.]

Figure 5. Cost Comparison between scan and ANDing
5. CONCLUSION
In this paper, we propose a new approach to decision tree induction that is especially useful for the classification of spatial data streams. We use the Peano Count tree (P-tree) structure to represent the information needed for classification in an efficient and ready-to-use form. The rich and efficient P-tree storage structure and fast P-tree algebra facilitate the development of a fast decision tree induction classifier. The P-tree based decision tree induction classifier is shown to improve classifier development time significantly, which makes classification of open-ended streaming datasets feasible in near real time.

6. ACKNOWLEDGMENTS
We would like to express our thanks to Amalendu Roy of Motorola, William Jockheck of Northern State University, and Stephen Krebsbach of Dakota State University for their help and suggestions.

7. REFERENCES
[1] J. R. Quinlan and R. L. Rivest, "Inferring decision trees using the minimum description length principle", Information and Computation, 80, 227-248, 1989.
[2] J. R. Quinlan, "C4.5: Programs for Machine Learning", Morgan Kaufmann, 1993.
[3] R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, and A. Swami, "An interval classifier for database mining applications", VLDB 1992.
[4] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, "Classification and Regression Trees", Wadsworth, Belmont, 1984.
[5] J. Shafer, R. Agrawal, and M. Mehta, "SPRINT: A scalable parallel classifier for data mining", VLDB 1996.
[6] S. M. Weiss and C. A. Kulikowski, "Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems", Morgan Kaufmann, 1991.
[7] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann, 2001.
[8] D. Michie, D. J. Spiegelhalter, and C. C. Taylor, "Machine Learning, Neural and Statistical Classification", Ellis Horwood, 1994.
[9] P. Domingos and G. Hulten, "Mining high-speed data streams", Proceedings of ACM SIGKDD 2000.
[10] P. Domingos and G. Hulten, "Catching Up with the Data: Research Issues in Mining Data Streams", DMKD 2001.
[11] W. Perrizo, Q. Ding, Q. Ding, and A. Roy, "Deriving High Confidence Rules from Spatial Data using Peano Count Trees", Springer-Verlag, LNCS 2118, July 2001.
[12] W. Perrizo, "Peano Count Tree Technology", Technical Report NDSU-CSOR-TR-01-1, 2001.