
Decision Tree Classification of Spatial Data Streams

Using Peano Count Trees 1, 2


Qiang Ding, Qin Ding, William Perrizo
Computer Science Department, North Dakota State University
Fargo, ND 58105, USA
{qiang.ding, qin.ding, william.perrizo}@ndsu.nodak.edu

_____________________________________________
1 Patents are pending on the bSQ and P-tree technology.
2 This work is partially supported by GSA Grant ACT#: K96130308, NSF Grant OSR-9553368 and DARPA Grant DAAH04-96-1-0329.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SAC 2002, Madrid, Spain
© 2002 ACM 1-58113-445-2/02/03…$5.00.

ABSTRACT
Many organizations have large quantities of spatial data collected in various application areas, including remote sensing, geographical information systems (GIS), astronomy, computer cartography, environmental assessment and planning, etc. These data collections are growing rapidly and can therefore be considered as spatial data streams. For data stream classification, time is a major issue. However, these spatial data sets are too large to be classified effectively in a reasonable amount of time using existing methods. In this paper, we develop a new method for decision tree classification on spatial data streams using a data structure called the Peano Count Tree (P-tree). The Peano Count Tree is a spatial data organization that provides a lossless compressed representation of a spatial data set and facilitates efficient classification and other data mining techniques. Using the P-tree structure, fast calculation of measurements, such as information gain, can be achieved. We compare P-tree based decision tree induction classification and a classical decision tree induction method with respect to the speed at which the classifier can be built (and rebuilt when substantial amounts of new data arrive). Experimental results show that the P-tree method is significantly faster than existing classification methods, making it the preferred method for mining on spatial data streams.

Keywords
Data mining, Classification, Decision Tree Induction, Spatial Data, Data Streams

1. INTRODUCTION
In many areas, large quantities of data are generated and collected every day, such as supermarket transactions and phone call records. These data arrive too fast to be analyzed or mined in time. Such kinds of data are called "data streams" [9, 10]. Classifying open-ended data streams brings challenges and opportunities, since traditional techniques often cannot complete the work as quickly as the data is arriving in the stream [9, 10]. Spatial data collected from sensor platforms in space, from airplanes or from other platforms are typically updated periodically. For example, AVHRR (Advanced Very High Resolution Radiometer) data is updated every hour or so (8 times each day during daylight hours). Such data sets can be very large (multiple gigabytes) and are often archived in deep storage before valuable information can be obtained from them. An objective of spatial data stream mining is to mine such data in near real time, prior to deep storage archiving.

Classification is one of the important areas of data mining [6, 7, 8]. In a classification task, a training set (also called a learning set) is identified for the construction of a classifier. Each record in the learning set has several attributes, one of which, the goal or class label attribute, indicates the class to which each record belongs. The classifier, once built and tested, is used to predict the class label of new records that do not yet have a class label attribute value.

A test set is used to test the accuracy of the classifier. The classifier, once certified, is used to predict the class label of future unclassified data. Different models have been proposed for classification, such as decision trees, neural networks, Bayesian belief networks, fuzzy sets, and genetic models. Among these models, decision trees are widely used for classification. We focus on decision tree induction in this paper. ID3 (and its variants such as C4.5) [1, 2] and CART [4] are among the best known classifiers that use decision trees. Other decision tree classifiers include Interval Classifier [3] and SPRINT [3, 5], which concentrate on making it possible to mine databases that do not fit in main memory by requiring only sequential scans of the data. Classification has been applied in many fields, such as retail target marketing, customer retention, fraud detection and medical diagnosis [8]. Spatial data is a promising area for classification.

In this paper, we propose a decision tree based model to perform classification on spatial data streams. We use the Peano Count Tree (P-tree) structure [11] to build the classifier.

P-trees [11] represent spatial data bit-by-bit in a recursive quadrant-by-quadrant arrangement. With the information in P-trees, we can rapidly build the decision tree. Each new component in a spatial data stream is converted to P-trees and then added to the training set as soon as possible. Typically, a window of data components from the stream is used to build (or rebuild) the classifier. There are many ways to define the window, depending on the data and application. In this paper, we focus on a fast classifier-building algorithm.

The rest of the paper is organized as follows. In Section 2, we briefly review the spatial data formats and the P-tree structure. In Section 3, we detail our decision tree induction classifier using P-trees. We also walk through an example to illustrate our approach. Performance analysis is given in Section 4. Finally, there is a conclusion in Section 5.
2. PEANO COUNT TREE STRUCTURE
A spatial image can be viewed as a 2-dimensional array of pixels. Associated with each pixel are various descriptive attributes, called "bands". For example, visible reflectance bands (Blue, Green and Red), infrared reflectance bands (e.g., NIR, MIR1, MIR2 and TIR) and possibly some bands of data gathered from ground sensors (e.g., yield quantity, yield quality, and soil attributes such as moisture and nitrate levels). All the values have been scaled to values between 0 and 255 for simplicity. The pixel coordinates in raster order constitute the key attribute. One can view such data as a table in relational form, where each pixel is a tuple and each band is an attribute.

There are several formats used for spatial data, such as Band Sequential (BSQ), Band Interleaved by Line (BIL) and Band Interleaved by Pixel (BIP). In our previous work [11], we proposed a new format called bit Sequential Organization (bSQ). Since each intensity value ranges from 0 to 255 and can therefore be represented as a byte, we split each bit position of each band into a separate file, called a bSQ file. Each bSQ file can be reorganized into a quadrant-based tree (P-tree). The example in Figure 1 shows a bSQ file and its P-tree.

    bSQ file                P-tree
    11 11 11 00             level 3 (root count):   55
    11 11 10 00             level 2:                16   8   15   16
    11 11 11 00             level 1 (non-pure):        3 0 4 1     4 4 3 4
    11 11 11 10             level 0 (non-pure):        1110  0010     1101
    11 11 11 11
    11 11 11 11
    11 11 11 11
    01 11 11 11

Figure 1. An 8-by-8 bSQ file and its P-tree

In this example, 55 is the count of 1's in the entire image (called the root count); the numbers at the next level, 16, 8, 15 and 16, are the 1-bit counts for the four major quadrants. Since the first and last quadrants are made up entirely of 1-bits (called pure-1 quadrants), we do not need sub-trees for them. Similarly, quadrants made up entirely of 0-bits are called pure-0 quadrants. This pattern is continued recursively. Recursive raster ordering is called Peano or Z-ordering in the literature. The process terminates at the leaf level (level-0), where each quadrant is a 1-row-1-column quadrant. If we were to expand all sub-trees, including those of the pure quadrants, then the leaf sequence would be just the Peano space-filling curve for the original raster image.

For each band (assuming 8-bit data values), we get 8 basic P-trees, one for each bit position. For band Bi, we label the basic P-trees Pi,1, Pi,2, …, Pi,8; thus, Pi,j is a lossless representation of the jth bits of the values from the ith band. Moreover, Pi,j provides more information than the raw bit file and is structured to facilitate data mining processes. Some of the useful features of P-trees can be found later in this paper or in our earlier work [11].

The basic P-trees defined above can be combined using simple logical operations (AND, OR and COMPLEMENT) to produce P-trees for the original values (at any level of precision: 1-bit precision, 2-bit precision, etc.). We let Pb,v denote the Peano Count Tree for band b and value v, where v can be expressed in 1-bit, 2-bit, …, or 8-bit precision. For example, Pb,110 can be constructed from the basic P-trees as

Pb,110 = Pb,1 AND Pb,2 AND Pb,3'

where ' indicates the bit-complement (which is simply the count complement in each quadrant). This is called a value P-tree. The AND operation is simply the pixel-wise AND of the bits.

The data in the relational format can also be represented as P-trees. For any combination of values (v1, v2, …, vn), where vi is from band-i, the quadrant-wise count of occurrences of this tuple of values is given by

P(v1, v2, …, vn) = P1,v1 AND P2,v2 AND … AND Pn,vn

This is called a tuple P-tree.

Finally, we note that the basic P-trees can be generated quickly, and this is only a one-time cost. The logical operations are also very fast [12]. So this structure can be viewed as a "data mining ready" and lossless format for storing spatial data.
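To make the structure concrete, the following is a minimal sketch of a P-tree in Python (ours, not the authors' implementation; the class and function names are illustrative). It builds a basic P-tree from one 2^k-by-2^k bSQ bit plane and implements the AND and COMPLEMENT operations used to form value and tuple P-trees; quadrants are kept in the upper-left, upper-right, lower-left, lower-right order of Figure 1.

```python
class PTree:
    """Quadrant count tree: 'count' is the number of 1-bits in the quadrant;
    'kids' holds the four sub-quadrants (None when the quadrant is pure-0 or pure-1)."""
    def __init__(self, count, size, kids=None):
        self.count = count    # 1-bit count of this quadrant (root count at the top)
        self.size = size      # number of pixels in this quadrant
        self.kids = kids      # [upper-left, upper-right, lower-left, lower-right] or None

def build_ptree(bits):
    """bits: square 0/1 matrix whose side is a power of two (one bSQ bit plane)."""
    n = len(bits)
    count = sum(map(sum, bits))
    if count == 0 or count == n * n:          # pure quadrant: no sub-trees needed
        return PTree(count, n * n)
    h = n // 2
    quads = [[row[:h] for row in bits[:h]], [row[h:] for row in bits[:h]],
             [row[:h] for row in bits[h:]], [row[h:] for row in bits[h:]]]
    return PTree(count, n * n, [build_ptree(q) for q in quads])

def ptree_and(p, q):
    """Structural AND: equivalent to the pixel-wise AND of the underlying bit planes."""
    if p.count == 0 or q.count == 0:
        return PTree(0, p.size)
    if p.count == p.size:                     # p is pure-1, so the result is q
        return q
    if q.count == q.size:                     # q is pure-1, so the result is p
        return p
    kids = [ptree_and(a, b) for a, b in zip(p.kids, q.kids)]
    return PTree(sum(k.count for k in kids), p.size, kids)

def ptree_not(p):
    """COMPLEMENT: simply the count complement in each quadrant."""
    kids = None if p.kids is None else [ptree_not(k) for k in p.kids]
    return PTree(p.size - p.count, p.size, kids)

# Reproduce Figure 1: root count 55 and quadrant counts 16, 8, 15, 16.
rows = ["11111100", "11111000", "11111100", "11111110",
        "11111111", "11111111", "11111111", "01111111"]
p = build_ptree([[int(c) for c in r] for r in rows])
print(p.count, [k.count for k in p.kids])     # 55 [16, 8, 15, 16]
```

Under these assumptions, a value P-tree such as Pb,110 is just ptree_and(ptree_and(pb1, pb2), ptree_not(pb3)) for the hypothetical basic P-trees pb1, pb2, pb3 of band b, and a tuple P-tree is the AND of value P-trees across bands; the classifier described below only ever needs the root counts of such ANDs.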

3. THE CLASSIFIER
Classification is a data mining technique that typically involves three phases: a learning phase, a testing phase and an application phase. A learning model or classifier is built during the learning phase. It may be in the form of classification rules, a decision tree, or a mathematical formula. Since the class label of each training sample is provided, this approach is known as supervised learning. In unsupervised learning (clustering), the class labels are not known in advance.

In the testing phase, test data are used to assess the accuracy of the classifier. If the classifier passes the test phase, it is used for the classification of new, unclassified data tuples. This is the application phase. The classifier predicts the class label for these new data samples.

In this paper, we consider the classification of spatial data in which the resulting classifier is a decision tree (decision tree induction). Our contributions include:
• A set of classification-ready data structures called Peano Count trees, which are compact, rich in information and facilitate classification;
• A data structure for organizing the inputs to decision tree induction, the Peano count cube;
• A fast decision tree induction algorithm, which employs these structures.
We point out that the classifier produced is precisely the classifier built by the ID3 decision tree induction algorithm [4]. The point of the work is to reduce the time it takes to build and rebuild the classifier as new data continue to arrive. This is very important for performing classification on data streams.
3.1 Data Smoothing and Attribute Relevance
In the overall classification effort, as in most data mining approaches, there is a data preparation stage in which the data are prepared for classification. Data preparation can involve data cleaning (noise reduction by applying smoothing techniques and missing-value management techniques). The P-tree data structure facilitates a proximity-based data smoothing method, which can reduce the data classification time considerably. The smoothing method is called bottom-up purity shifting. By replacing counts of 3 with 4 and counts of 1 with 0 at level-1 (and making the resultant changes on up the tree), the data is smoothed and the P-tree is compressed. A more drastic smoothing can be effected: the user can determine which set of counts to replace with pure-1 and which set of counts to replace with pure-0. The most important thing to note is that this smoothing can be done almost instantaneously once the P-trees are constructed. With this method it is feasible to actually smooth data from the data stream before mining.
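As an illustration only (our sketch, not the authors' code; the function name is made up, and only the stated 3→4 and 1→0 replacement rule is taken from the paper), purity shifting at level-1 can be written directly over the level-1 quadrant counts:

```python
def purity_shift_level1(counts):
    """counts: 1-bit counts of the 2x2 level-1 quadrants (each in 0..4).
    Counts of 3 are shifted up to 4 (pure-1) and counts of 1 down to 0 (pure-0);
    counts higher up the tree are then recomputed from the shifted values."""
    return [4 if c == 3 else 0 if c == 1 else c for c in counts]

# The two non-pure level-2 quadrants of Figure 1:
print(purity_shift_level1([3, 0, 4, 1]))   # [4, 0, 4, 0]: the quadrant count stays 8
print(purity_shift_level1([4, 4, 3, 4]))   # [4, 4, 4, 4]: the quadrant becomes pure-1,
                                           # so its sub-tree can be dropped entirely
```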
Another important pre-classification step is relevance analysis (selecting only a subset of the feature attributes, so as to improve algorithm efficiency). This step can involve removal of irrelevant or redundant attributes. We can build a cube, called the Peano Cube (P-cube), in which each dimension is a band and each band has several values depending on the bit precision. For example, for an image with three bands using 1-bit precision, the cell (0,0,1) gives the count of P1' AND P2' AND P3. We can determine relevance by rolling up the P-cube to the class label attribute and each other potential decision attribute in turn. If any of these roll-ups produce counts that are uniformly distributed, then that attribute is not going to be effective in classifying the class label attribute. The roll-up can be computed from the basic P-trees without necessitating the actual creation of the P-cube. This can be done by ANDing the P-trees of the class label attribute with the P-trees of the potential decision attribute. An estimate of uniformity in the root counts is all that is needed. Better estimates can be obtained by ANDing down to a fixed depth of the P-trees. For instance, ANDing to depth=1 provides rough distribution information, ANDing to depth=2 provides better distribution information, and so forth. Again, the point is that P-trees facilitate simple real-time relevance analysis, which makes it feasible for data streams.
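The roll-up itself is easy to sketch (again our code, not the authors'; for brevity a 1-bit value P-tree is represented by a pixel bit mask, so AND is the & operator and a root count is a popcount, and the names are illustrative):

```python
def root_count(mask: int) -> int:
    """Root count of a value P-tree represented as a pixel bit mask."""
    return bin(mask).count("1")

def rollup(class_ptrees, attr_ptrees):
    """class_ptrees / attr_ptrees map each value of the class-label band and of one
    candidate decision band to its value P-tree (here a bit mask). The returned
    table of RC(Pclass,c AND Pattr,a) is the roll-up of the P-cube onto these two
    bands; if its counts are nearly uniform across the candidate's values for every
    class value, that band is unlikely to be a useful decision attribute."""
    return {(c, a): root_count(cm & am)
            for c, cm in class_ptrees.items()
            for a, am in attr_ptrees.items()}
```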
3.2 Classification by Decision Tree Induction Using P-trees
A decision tree is a flowchart-like structure in which each node denotes a test on an attribute. Each branch represents an outcome of the test, and the leaf nodes represent classes or class distributions. Unknown samples can be classified by testing attributes against the tree. The path traced from root to leaf holds the class prediction for that sample. The basic algorithm for inducing a decision tree from the learning or training sample set is as follows [2, 7]:
• Initially the decision tree is a single node representing the entire training set.
• If all samples are in the same class, this node becomes a leaf and is labeled with that class label.
• Otherwise, an entropy-based measure, "information gain", is used as a heuristic for selecting the attribute which best separates the samples into individual classes (the "decision" attribute).
• A branch is created for each value of the test attribute and samples are partitioned accordingly.
• The algorithm advances recursively to form the decision tree for the sub-sample set at each partition. Once an attribute has been used, it is not considered in descendent nodes.
• The algorithm stops when all samples for a given node belong to the same class or when there are no remaining attributes (or some other stopping condition holds).
The attribute selected at each decision tree level is the one with the highest information gain. The information gain of an attribute is computed as follows.

Assume B[0] is the class attribute and the others are non-class attributes. We store the decision path for each node. For example, in the decision tree below (Figure 2), the decision path for node N09 is "Band2, value 0011; Band3, value 1000". We use RC to denote the root count of a P-tree. Given node N's decision path B[1], V[1], B[2], V[2], …, B[t], V[t], let the P-tree P = PB[1],V[1] ∧ PB[2],V[2] ∧ … ∧ PB[t],V[t].

[Figure 2. A Decision Tree Example. The root N01 tests B2; branches 0010, 0011, 0111, 1010 and 1011 lead to nodes N02–N06. N03 tests B3, with branch 0100 to N08 and branch 1000 to N09. The remaining branches carry class-attribute (B1) values: 0111 to N07 under N02, 0011 to N10 under N04, 1111 to N11 under N05, 0010 to N12 under N06, and, under N08 and N09, 0111 to N13 and 0011 to N14.]

We can calculate node N's information I(P) through

I(P) = − Σ(i=1..n) pi · log2(pi),   where   pi = RC(P ∧ PB[0],V0[i]) / RC(P).

Here V0[1], …, V0[n] are the possible B[0] values if classified by B[0] at node N. If N is the root node, then P is the full P-tree (whose root count is the total number of transactions).

Now if we want to evaluate the information gain of attribute A at node N, we can use the formula Gain(A) = I(P) − E(A), where the entropy

E(A) = Σ(i=1..n) I(P ∧ PA,VA[i]) · RC(P ∧ PA,VA[i]) / RC(P).

Here VA[1], …, VA[n] are the possible A values if classified by attribute A at node N.
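The following sketch (ours, with illustrative names) computes I(P), E(A) and Gain(A) exactly as defined above, using only root counts of ANDed P-trees. For brevity each value P-tree is again represented by a pixel bit mask, so ∧ becomes & and RC is a popcount; with full P-trees the same root counts would come from the structural AND described in Section 2.

```python
from math import log2

def rc(mask: int) -> int:
    """Root count RC of a value P-tree represented as a bit mask."""
    return bin(mask).count("1")

def info(P: int, class_ptrees: dict) -> float:
    """I(P) = -sum_i pi*log2(pi), with pi = RC(P AND PB[0],V0[i]) / RC(P)."""
    total = rc(P)
    ps = [rc(P & pv) / total for pv in class_ptrees.values()]
    return -sum(p * log2(p) for p in ps if p > 0)

def entropy(P: int, attr_ptrees: dict, class_ptrees: dict) -> float:
    """E(A) = sum_i I(P AND PA,VA[i]) * RC(P AND PA,VA[i]) / RC(P)."""
    total = rc(P)
    return sum(info(P & pa, class_ptrees) * rc(P & pa) / total
               for pa in attr_ptrees.values() if rc(P & pa) > 0)

def gain(P: int, attr_ptrees: dict, class_ptrees: dict) -> float:
    """Gain(A) = I(P) - E(A); the attribute with the largest gain is selected."""
    return info(P, class_ptrees) - entropy(P, attr_ptrees, class_ptrees)
```

Note that no scan of the samples is ever needed: selecting the attribute with the largest gain at each node makes exactly the ID3 choice, but every quantity is obtained from root counts of P-tree ANDs.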
3.3 Example
In this example the data is a remotely sensed image (e.g., a satellite image or aerial photo) of an agricultural field, together with the soil moisture levels for the field measured at the same time. We use the whole data set for mining so as to get as good accuracy as we can. These data are divided into learning and test data sets. The goal is to classify the data using soil moisture as the class label attribute and then to use the resulting classifier to predict the soil moisture levels for future times (e.g., to determine capacity to buffer flooding or to schedule crop planting).

Branches are created for each value of the selected attribute and subsets are partitioned accordingly. The following training set contains 4 bands of 4-bit data values (shown in binary). B1 stands for soil moisture; B2, B3, and B4 stand for channels 3, 4, and 5 of AVHRR, respectively.

FIELD     CLASS     REMOTELY SENSED
COORDS    LABEL     REFLECTANCES
X,Y       B1        B2        B3        B4
0,0       0011      0111      1000      1011
0,1       0011      0011      1000      1111
0,2       0111      0011      0100      1011
0,3       0111      0010      0101      1011
1,0       0011      0111      1000      1011
1,1       0011      0011      1000      1011
1,2       0111      0011      0100      1011
1,3       0111      0010      0101      1011
2,0       0010      1011      1000      1111
2,1       0010      1011      1000      1111
2,2       1010      1010      0100      1011
2,3       1111      1010      0100      1011
3,0       0010      1011      1000      1111
3,1       1010      1011      1000      1111
3,2       1111      1010      0100      1011
3,3       1111      1010      0100      1011

Figure 3. Learning Dataset

This learning dataset (Figure 3) is converted to bSQ format. We display the bSQ bit-band values in their spatial positions, rather than displaying them in 1-column files. The Band-1 bit-bands are:

B11      B12      B13      B14
0000     0011     1111     1111
0000     0011     1111     1111
0011     0001     1111     0001
0111     0011     1111     0011

Thus, the Band-1 basic P-trees are as follows (tree pointers are omitted; each P-tree is listed as its root count, its level-1 quadrant counts, and the leaves of any non-pure level-1 quadrant):

P1,1:  5;    0 0 1 4;   0001
P1,2:  7;    0 4 0 3;   0111
P1,3:  16
P1,4:  11;   4 4 0 3;   0111

We can use the AND and COMPLEMENT operations to calculate all the value P-trees of Band-1 as below (e.g., P1,0011 = P1,1' AND P1,2' AND P1,3 AND P1,4):

P1,0000 = 0     P1,0100 = 0     P1,1000 = 0     P1,1100 = 0
P1,0001 = 0     P1,0101 = 0     P1,1001 = 0     P1,1101 = 0
P1,0110 = 0     P1,1011 = 0     P1,1110 = 0
P1,0010:  3;   0 0 3 0;   1110
P1,0011:  4;   4 0 0 0
P1,0111:  4;   0 4 0 0
P1,1010:  2;   0 0 1 1;   0001 1000
P1,1111:  3;   0 0 0 3;   0111

Then we generate the basic P-trees and value P-trees for B2, B3 and B4 in the same way.

Start with A = B2. Because the node currently being processed is the root node, P is the full P-tree, so the pi values are 3/16, 4/16, 4/16, 2/16 and 3/16, and we can calculate

I(P) = −(3/16·log2(3/16) + 4/16·log2(4/16) + 4/16·log2(4/16) + 2/16·log2(2/16) + 3/16·log2(3/16)) = 2.281

To calculate E(B2), the P ∧ PA,VA[i] are simply the value P-trees of B2, and each I(P ∧ PA,VA[i]) can be calculated by ANDing the B2 value P-trees with the B1 value P-trees. Finally we get E(B2) = 0.656 and Gain(B2) = 1.625.

Likewise, the gains of B3 and B4 are computed: Gain(B3) = 1.084 and Gain(B4) = 0.568. Thus, B2 is selected as the first-level decision attribute.
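As a quick check of the I(P) arithmetic above (our snippet, not part of the paper), the root counts of the Band-1 value P-trees, 3, 4, 4, 2 and 3 out of 16 pixels, indeed give 2.281:

```python
from math import log2

root_counts = [3, 4, 4, 2, 3]      # RC(P1,v) for v = 0010, 0011, 0111, 1010, 1111
total = sum(root_counts)           # 16 pixels in the training set
info = -sum(c / total * log2(c / total) for c in root_counts)
print(round(info, 3))              # 2.281
```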
Branches are created for each value of B2 and samples are partitioned accordingly:

B2=0010 → Sample_Set_1
B2=0011 → Sample_Set_2
B2=0111 → Sample_Set_3
B2=1010 → Sample_Set_4
B2=1011 → Sample_Set_5

As the algorithm advances recursively to each sub-sample set, it is unnecessary to rescan the learning set to form these sub-sample sets, since the P-trees for those samples have already been computed.

The algorithm terminates with the decision tree:

B2=0010 → B1=0111
B2=0011 → B3=0100 → B1=0111
          B3=1000 → B1=0011
B2=0111 → B1=0011
B2=1010 → B1=1111
B2=1011 → B1=0010

4. PERFORMANCE ANALYSIS
Prediction accuracy is usually used as the basis of comparison for different classification methods. However, for data mining on streams, speed is a significant issue. In this paper, we use the ID3 algorithm with the P-tree data structure to improve the speed, so the important performance issue here is computation speed relative to ID3.

In our method, we only build and store basic P-trees. All the AND operations are performed on the fly, and only the corresponding root counts are needed.

Our experimental results show that larger data sizes lead to more significant speed improvement (Figure 4) when P-trees are used. There are several reasons. First, consider the cost of calculating information gain each time. In ID3, to test whether all the samples are in the same class, one scan over the entire sample set is needed, whereas with P-trees we only need to calculate the root counts of the AND of the relevant P-trees. These AND operations can be performed very fast. Figure 5 gives the experimental results,
comparing the cost of scanning the entire dataset (for different sizes) with the cost of all the P-tree ANDings.

Second, using P-trees, the creation of sub-sample sets is not necessary. If A is a candidate for the current decision attribute with kA value P-trees, we only need to AND the P-tree defining the current sub-sample set with each of those kA value P-trees. If the P-tree of the current sample set is P2,0100 ∧ P3,0001, and the current attribute is B1 (with, say, 2-bit values), then P2,0100 ∧ P3,0001 ∧ P1,00, P2,0100 ∧ P3,0001 ∧ P1,01, P2,0100 ∧ P3,0001 ∧ P1,10 and P2,0100 ∧ P3,0001 ∧ P1,11 identify the partition of the current sample set. In our algorithm, only P-tree ANDings are required.
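To illustrate this last point (our sketch, with the same bit-mask shorthand for value P-trees used earlier and illustrative names), the sizes of all branch partitions at a node come straight from root counts of ANDs, without ever materializing the sub-sample sets:

```python
def rc(mask: int) -> int:
    return bin(mask).count("1")

def partition_counts(node_ptree: int, attr_value_ptrees: dict) -> dict:
    """node_ptree: P-tree (bit mask) of the current sample set, e.g. P2,0100 AND P3,0001.
    attr_value_ptrees: value -> value P-tree (bit mask) of the candidate attribute.
    Returns the size of each branch's sub-sample set without creating the set itself."""
    return {v: rc(node_ptree & pv) for v, pv in attr_value_ptrees.items()}
```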
[Figure 4. Classification cost with respect to the dataset size. The chart plots total cost against size of data (M), from 0 to 70, for ID3 and for the P-tree method.]

[Figure 5. Cost comparison between scan and ANDing. The chart plots cost in ms (base-2 log) against data set size (KB), from 0 to 80,000, for the scan cost in ID3 and the ANDing cost using P-trees.]

5. CONCLUSION
In this paper, we propose a new approach to decision tree induction that is especially useful for the classification of spatial data streams. We use the Peano Count Tree (P-tree) structure to represent the information needed for classification in an efficient and ready-to-use form. The rich and efficient P-tree storage structure and fast P-tree algebra facilitate the development of a fast decision tree induction classifier. The P-tree based decision tree induction classifier is shown to improve classifier development time significantly. This makes classification of open-ended streaming datasets feasible in near real time.

6. ACKNOWLEDGMENTS
We would like to express our thanks to Amalendu Roy of Motorola, William Jockheck of Northern State University and Stephen Krebsbach of Dakota State University for their help and suggestions.

7. REFERENCES
[1] J. R. Quinlan and R. L. Rivest, "Inferring decision trees using the minimum description length principle", Information and Computation, 80, 227-248, 1989.
[2] J. R. Quinlan, "C4.5: Programs for Machine Learning", Morgan Kaufmann, 1993.
[3] R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, and A. Swami, "An interval classifier for database mining applications", VLDB 1992.
[4] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, "Classification and Regression Trees", Wadsworth, Belmont, 1984.
[5] J. Shafer, R. Agrawal, and M. Mehta, "SPRINT: A scalable parallel classifier for data mining", VLDB 1996.
[6] S. M. Weiss and C. A. Kulikowski, "Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems", Morgan Kaufmann, 1991.
[7] Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann, 2001.
[8] D. Michie, D. J. Spiegelhalter, and C. C. Taylor, "Machine Learning, Neural and Statistical Classification", Ellis Horwood, 1994.
[9] P. Domingos and G. Hulten, "Mining high-speed data streams", Proceedings of ACM SIGKDD 2000.
[10] P. Domingos and G. Hulten, "Catching Up with the Data: Research Issues in Mining Data Streams", DMKD 2001.
[11] William Perrizo, Qin Ding, Qiang Ding, and Amlendu Roy, "Deriving High Confidence Rules from Spatial Data using Peano Count Trees", Springer-Verlag, LNCS 2118, July 2001.
[12] William Perrizo, "Peano Count Tree Technology", Technical Report NDSU-CSOR-TR-01-1, 2001.
