SLIQ: A Fast Scalable Classifier for Data Mining

Manish Mehta, Rakesh Agrawal and Jorma Rissanen

IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120
Abstract. Classification is an important problem in the emerging field of data mining. Although classification has been studied extensively in the past, most of the classification algorithms are designed only for memory-resident data, thus limiting their suitability for data mining large data sets. This paper discusses issues in building a scalable classifier and presents the design of SLIQ, a new classifier. SLIQ is a decision tree classifier that can handle both numeric and categorical attributes. It uses a novel pre-sorting technique in the tree-growth phase. This sorting procedure is integrated with a breadth-first tree growing strategy to enable classification of disk-resident datasets. SLIQ also uses a new tree-pruning algorithm that is inexpensive, and results in compact and accurate trees. The combination of these techniques enables SLIQ to scale for large data sets and classify data sets irrespective of the number of classes, attributes, and examples (records), thus making it an attractive tool for data mining.
1 Introduction

Computerized data management has resulted in the accumulation of huge amounts of data in several organizations. There is a growing perception that analyses of these large data bases can turn this passive data into useful actionable information. The recent emergence of Data Mining, or Knowledge Discovery in Databases, is a testimony to this trend. Data mining involves the development of tools that can extract patterns from large data bases.

Classification is an important data mining problem [1] and can be described as follows. The input data, also called the training set, consists of multiple examples, each having multiple attributes or features. Additionally, each example is tagged with a special class label. The objective of classification is to analyze the input data and to develop an accurate description or model for each class using the features present in the data. The class descriptions are used to classify future test data for which the class labels are unknown. They can also be used to develop a better understanding of each class in the data. Applications of classification include credit approval, target marketing, medical diagnosis, treatment effectiveness, store location, etc.

Classification has been studied extensively (see [13] for an excellent overview of various techniques). However, the existing classification algorithms have the
problem that they do not scale. Most of the current algorithms have the restriction that the training data should fit in memory. This is perhaps a result of the type of applications to which classification has hitherto been applied. In many applications, there were simply not many training examples available. As a matter of fact, the largest dataset in the Irvine Machine Learning repository is only 700KB, with 20,000 examples. Even in [5], where a classifier was built with database considerations in mind, the size of the training set was overlooked. Instead, the focus was on building a classifier that can use database indices to improve the retrieval efficiency while classifying test data. In data mining applications, very large training sets with several million examples are common. Our primary motivation in this work is to design a classifier that scales well and can handle training data of this magnitude. The ability to classify larger training data can also improve the classification accuracy [2][3].

Given our goal of classifying large data sets, we focus mainly on decision tree classifiers [4][10]. Decision tree classifiers are relatively fast compared to other classification methods. Methods like neural networks can have extremely long training times even for small datasets. A decision tree can be converted into simple and easy to understand classification rules [10]. They can also be converted into SQL queries for accessing databases [5]. Finally, tree classifiers obtain similar and sometimes better accuracy when compared with other classification methods [7]. Figure 1 gives an example of a decision tree classifier for a toy dataset of six examples.
Fig. 1. Example of a decision tree

The idea of modifying tree classifiers to enable them to classify large datasets has been explored previously. Previous proposals include sampling of data at each decision tree node [2], and discretization of numeric attributes [2]. These methods decrease classification time significantly but also reduce the classification accuracy. Chan and Stolfo [3] have studied the method of partitioning the input data and then building a classifier for each partition. The outputs of the multiple classifiers are then combined to obtain the final classification. Their results show that classification using multiple classifiers never achieves the accuracy of a single classifier that can classify all of the data.

The decision-tree classifier we present, called SLIQ, uses novel techniques that improve learning time for the classifier without loss in accuracy. At the same time, these techniques allow classification to be performed on large disk-resident training data. Consequently, given training data that can be handled by other
decision tree classifiers, SLIQ exhibits the same accuracy characteristics, but executes faster and produces small trees. However, SLIQ imposes no restrictions on the amount of training data or the number of attributes in the examples. Therefore, SLIQ can potentially obtain higher accuracies by classifying larger training datasets which cannot be handled by other classifiers. The rest of the paper is organized as follows. Section 2 describes a generic decision tree classifier and Section 3 discusses scalability issues. Sections 4 and 5 present the design and a detailed performance analysis of SLIQ, respectively. Finally, Section 6 contains our conclusions.
2 Decision-Tree Classification

Most decision-tree classifiers (e.g. CART [4], C4.5 [10]) perform classification in two phases: Tree Building and Tree Pruning.
Tree Building An initial decision tree is grown in this phase by repeatedly partitioning the training data. The training set is split into two or more partitions using an attribute. This process is repeated recursively until all the examples in each partition belong to one class. Figure 2 gives an overview of the process.
MakeTree(Training Data T)
    Partition(T);

Partition(Data S)
    if (all points in S are in the same class) then return;
    evaluate splits for each attribute A;
    use the best split found to partition S into S1 and S2;
    Partition(S1);
    Partition(S2);

Fig. 2. Tree-Building Algorithm
Tree Pruning The tree built in the first phase completely classifies the training data set. This implies that branches are created in the tree even for spurious noise data and statistical fluctuations. These branches can lead to errors when classifying test data. Tree pruning is aimed at removing these branches from the decision tree by selecting the subtree with the least estimated error rate.
3 Scalability Issues

3.1 Tree Building

There are two main operations during tree building: i) evaluation of splits for each attribute and the selection of the best split, and ii) creation of partitions using the best split. Having determined the overall best split, partitions can be created by a simple application of the splitting criterion to the data.
The complexity lies in determining the best split for each attribute. The choice of the splitting criterion depends on whether the domain of the attribute is numeric or categorical (attributes with a finite discrete set of possible values). But let us first specify how alternative splits for an attribute are compared.

3.1.1 Splitting Index

A splitting index is used to evaluate the goodness of the alternative splits for an attribute. Several splitting indices have been proposed in the past [13]. We use the gini index, originally proposed in [4]. If a data set T contains examples from n classes, gini(T) is defined as

    gini(T) = 1 - Σ_j p_j^2

where p_j is the relative frequency of class j in T.
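As a concrete illustration of how the gini index scores a split, the following short Python sketch (our own illustration; the function names gini and gini_split are not from the paper) computes gini(T) from a list of class labels and the size-weighted gini of a binary partition, which is the quantity minimized when alternative splits are compared.

from collections import Counter

def gini(labels):
    # gini(T) = 1 - sum_j p_j^2, where p_j is the relative frequency of class j in T
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(left_labels, right_labels):
    # Standard size-weighted gini of a binary split: (n1/n)*gini(T1) + (n2/n)*gini(T2)
    n1, n2 = len(left_labels), len(right_labels)
    n = n1 + n2
    return (n1 / n) * gini(left_labels) + (n2 / n) * gini(right_labels)

print(gini(["B", "B", "B"]))                     # 0.0: a pure partition
print(gini(["B", "G", "B", "G"]))                # 0.5: an even two-class mix
print(gini_split(["B", "B"], ["G", "G", "B"]))   # about 0.27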
3.1.2 Splits for Numeric Attributes

A binary split of the form A ≤ v, where v is a real number, is used for numeric attributes. The first step in evaluating splits for numeric attributes is to sort the training examples based on the values of the attribute being considered for splitting. Let v1, v2, ..., vn be the sorted values of a numeric attribute A. Since any value between vi and vi+1 will divide the set into the same two subsets, we need to examine only n - 1 possible splits. Typically, the midpoint of each interval from vi to vi+1 is chosen as the split point. The cost of evaluating splits for a numeric attribute is dominated by the cost of sorting the values. Therefore, an important scalability issue is the reduction of sorting costs for numeric attributes.

3.1.3 Splits for Categorical Attributes

If S(A) is the set of possible values of a categorical attribute A, then the split test is of the form A ∈ S', where S' ⊂ S(A). Since the number of possible subsets for an attribute with n possible values is 2^n, the search for the best subset can be expensive. Therefore, a fast algorithm for subset selection for a categorical attribute is essential.
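Returning to numeric attributes (Section 3.1.2), the following sketch (again our own illustration, not code from the paper) sorts the values once and then scores each of the n - 1 candidate tests A ≤ v with the gini index, using the midpoint between consecutive distinct values as the threshold.

from collections import Counter

def _gini(labels):
    n = len(labels)
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_numeric_split(values, labels):
    # Returns (threshold, weighted_gini) of the best binary test "A <= threshold".
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(len(pairs) - 1):
        v, v_next = pairs[i][0], pairs[i + 1][0]
        if v == v_next:
            continue  # identical values cannot be separated by a threshold
        threshold = (v + v_next) / 2.0
        left = [c for _, c in pairs[: i + 1]]
        right = [c for _, c in pairs[i + 1:]]
        score = (len(left) * _gini(left) + len(right) * _gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

# Four examples: salaries 15 and 40 are class G, salaries 65 and 75 are class B.
print(best_numeric_split([15, 40, 65, 75], ["G", "G", "B", "B"]))   # (52.5, 0.0)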
3.2 Tree Pruning

The tree pruning phase examines the initial tree grown using the training data and chooses the subtree with the least estimated error rate. There are two main approaches to estimating the error rate: one uses the original training dataset and the other uses an independent dataset for error estimation. Cross-validation [4] belongs to the first category. Multiple samples are taken from the training data and a tree is grown for each sample. These multiple trees are then used to estimate the error rates of the subtrees of the original tree. Although this approach selects compact trees with high accuracies, it is inapplicable for large data sets, where building even one decision tree is expensive. Alternative approaches [10] that use only a single decision tree often lead to large decision trees. The second class of methods divides the training data into two parts, where one part is used to build the tree and the other to prune it. The data used
for pruning should be selected such that it captures the true data distribution, which brings up a potential problem with this method: how large should the test sample be and how should it be selected? Moreover, using portions of the data only for pruning reduces the number of training examples available for the tree-growing phase, which can lead to reduced accuracy. The challenge for a scalable classifier in the pruning phase is to use an algorithm that is fast and leads to compact and accurate decision trees.

4 SLIQ Classifier
We first give a brief overview of SLIQ and then give details about the techniques used in SLIQ to address the scalability issues identified in the previous section.

4.1 Overview
SLIQ is a decision tree classifier that can handle both numeric and categorical attributes. SLIQ uses a pre-sorting technique in the tree-growth phase to reduce the cost of evaluating numeric attributes. This sorting procedure is integrated with a breadth-first tree growing strategy to enable SLIQ to classify disk-resident datasets. In addition, SLIQ uses a fast subsetting algorithm for determining splits for categorical attributes. SLIQ also uses a new tree-pruning algorithm based on the Minimum Description Length principle [11]. This algorithm is inexpensive, and results in compact and accurate trees. The combination of these techniques enables SLIQ to scale for large data sets and classify data sets with a large number of classes, attributes, and examples.

4.2 Pre-Sorting and Breadth-First Growth
For numeric attributes, sorting time is the dominant factor when finding the best split at a decision tree node [2]. Therefore, the first technique used in SLIQ is a scheme that eliminates the need to sort the data at each node of the decision tree. Instead, the training data are sorted just once for each numeric attribute, at the beginning of the tree-growth phase.

To achieve this pre-sorting, we use the following data structures. We create a separate list for each attribute of the training data. Additionally, a separate list, called the class list, is created for the class labels attached to the examples. An entry in an attribute list has two fields: one contains an attribute value, the other an index into the class list. An entry of the class list also has two fields: one contains a class label, the other a reference to a leaf node of the decision tree. The ith entry of the class list corresponds to the ith example in the training data. Each leaf node of the decision tree represents a partition of the training data, the partition being defined by the conjunction of the predicates on the path from the node to the root. Thus, the class list can at any time identify the partition to which an example belongs. We assume that there is enough memory to keep the class list memory-resident. Attribute lists are written to disk if necessary.
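The following sketch (our own illustration; the function build_lists and the dictionary-based records are not from the paper) materializes these structures for a toy dataset: one attribute list per numeric attribute, sorted once and holding (value, class-list index) pairs, and a class list holding the class label and current leaf reference for each example.

def build_lists(examples, class_labels, numeric_attrs):
    # examples: list of dicts {attribute: value}; class_labels: parallel list of labels.
    # Initially every example belongs to the root of the (yet to be grown) tree.
    class_list = [{"label": label, "leaf": "root"} for label in class_labels]
    attribute_lists = {}
    for attr in numeric_attrs:
        entries = [(ex[attr], i) for i, ex in enumerate(examples)]
        entries.sort()  # pre-sorting: done once per numeric attribute, not per node
        attribute_lists[attr] = entries
    return attribute_lists, class_list

examples = [{"age": 30, "salary": 65}, {"age": 23, "salary": 15},
            {"age": 40, "salary": 75}, {"age": 55, "salary": 40}]
labels = ["G", "B", "G", "B"]
attr_lists, class_list = build_lists(examples, labels, ["age", "salary"])
print(attr_lists["salary"])   # [(15, 1), (40, 3), (65, 0), (75, 2)]
print(class_list[1])          # {'label': 'B', 'leaf': 'root'}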
Initially, the leaf reference fields of all the entries of the class list are set to point to the root of the decision tree. Then a pass is made over the training data, distributing the values of the attributes for each example across the attribute lists. The attribute lists for the numeric features are then sorted independently. Figure 3 illustrates the state of the data structures before and after pre-sorting.

Fig. 3. Example of Pre-Sorting (age list, salary list, and class list before and after pre-sorting)

EvaluateSplits()
    for each attribute A do
        traverse attribute list of A
        for each value v in the attribute list do
            find the corresponding entry in the class list, and
                hence the corresponding class and the leaf node (say l)
            update the class histogram in the leaf l
            if A is a numeric attribute then
                compute splitting index for test (A ≤ v) for leaf l
        if A is a categorical attribute then
            for each leaf of the tree do
                find subset of A with best split

Fig. 4. Evaluating Splits
4.2.1 Processing Node Splits

Unlike the strategy used in the earlier decision-tree classifiers, we grow trees breadth-first. Consequently, splits for all the leaves of the current tree are simultaneously evaluated in one pass over the data. Figure 4 gives a schematic of the evaluation process.

To compute the gini splitting index (see Section 3.1) for an attribute at a node, we need the frequency distribution of class values in the data partition corresponding to the node. The distribution is accumulated in a class histogram attached to each leaf node. For a numeric attribute, the histogram is a list of pairs of the form <class, frequency>. For a categorical attribute, this histogram is a list of triples of the form <attribute value, class, frequency>.

Attribute lists are processed one at a time (recall that the attribute lists can be written to disk if necessary).
Fig. 5. Evaluating Splits: Example
For each value v in the attribute list of an attribute A, we find the corresponding entry in the class list, which yields the corresponding class and the leaf node. We now update the histogram attached to this leaf node. If A is a numeric attribute, we compute at the same time the splitting index for the test A ≤ v for this leaf. If A is a categorical attribute, we wait until the attribute list has been completely scanned and then find the subset of A with the best split. Thus, in one traversal of an attribute list, the best split using this attribute is known for all the leaf nodes. Similarly, with one traversal of all of the attribute lists, the best overall split for all of the leaf nodes is known. The best split test is saved with each of the leaf nodes.

Figure 5 illustrates the evaluation of splits on the salary attribute for the second level of the decision tree. The example assumes that the data has been initially split on the age attribute using the split age ≤ 35. The class histograms reflect the distribution of the points at each leaf node as a result of the split. The L values represent the distributions for examples that satisfy the test and the R values represent examples that do not satisfy it. We show how the class histograms are updated as each split is evaluated. The first value in the salary list belongs to node N2, so the first split evaluated is (salary ≤ 15) for N2. After this split, the corresponding example (salary 15, class list index 2), which satisfies the predicate, belongs to the left branch and the rest belong to the right branch. The class histogram of node N2 is updated to reflect this fact. Next, the split (salary ≤ 40) is evaluated for node N3. After this split, the corresponding example (salary 40, class list index 4) belongs to the left branch and the class histogram of node N3 is updated to reflect this fact.
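The histogram-driven evaluation can be sketched in a few lines of Python (our own illustration, reusing the list layout of the earlier sketches; gini_from_counts and evaluate_numeric_attribute are invented names). A single scan of a sorted attribute list moves each example from the R histogram of its leaf to the L histogram and scores the test A ≤ v at that point.

from collections import Counter

def gini_from_counts(counts):
    n = sum(counts.values())
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in counts.values())

def evaluate_numeric_attribute(attr_list, class_list):
    # attr_list: (value, class_list_index) pairs sorted by value.
    # class_list[i]: {"label": class_label, "leaf": leaf_id}.
    below, above = {}, {}          # per-leaf L and R class histograms
    for _, idx in attr_list:
        entry = class_list[idx]
        above.setdefault(entry["leaf"], Counter())[entry["label"]] += 1
        below.setdefault(entry["leaf"], Counter())
    best = {leaf: (None, float("inf")) for leaf in above}
    for value, idx in attr_list:
        entry = class_list[idx]
        leaf = entry["leaf"]
        below[leaf][entry["label"]] += 1   # move this example from R to L
        above[leaf][entry["label"]] -= 1
        n_l, n_r = sum(below[leaf].values()), sum(above[leaf].values())
        if n_r == 0:
            continue                        # the split would leave one side empty
        score = (n_l * gini_from_counts(below[leaf]) +
                 n_r * gini_from_counts(above[leaf])) / (n_l + n_r)
        if score < best[leaf][1]:
            best[leaf] = (value, score)     # best test so far is (A <= value)
    return best

# A toy configuration loosely following the example above:
# N2 holds salaries 65 (G) and 15 (B); N3 holds 75 (G) and 40 (B).
class_list = [{"label": "G", "leaf": "N2"}, {"label": "B", "leaf": "N2"},
              {"label": "G", "leaf": "N3"}, {"label": "B", "leaf": "N3"}]
salary_list = [(15, 1), (40, 3), (65, 0), (75, 2)]
print(evaluate_numeric_attribute(salary_list, class_list))
# {'N2': (15, 0.0), 'N3': (40, 0.0)}: salary <= 15 and salary <= 40 are chosen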
4.2.2 Updating the Class List

The next step is to create child nodes for each of the leaf nodes and update the class list. Figure 6 gives the update process.

UpdateLabels()
    for each attribute A used in a split do
        traverse attribute list of A
        for each value v in the attribute list do
            find the corresponding entry in the class list (say e)
            find the new class c to which v belongs by applying
                the splitting test at the node referenced from e
            update the class label for e to c
            update the node referenced in e to the child corresponding to the class c

Fig. 6. Updating Class List
As an illustration, Figure 7 shows the class list being updated after the nodes N2 and N3 have been split on the salary attribute. The salary attribute list is being traversed and the class list entry (entry 4) corresponding to the salary value of 40 is being updated. First, the leaf reference in entry 4 of the class list is used to find the node to which the example used to belong (N3 in this case). Then, the split selected at N3 is applied to find the new child to which the example belongs (N6 in this case). The leaf reference field of entry 4 in the class list is updated to reflect the new value.
Fig. 7. Updating the Class List: Example (age list, salary list, and class list after splitting nodes N2 and N3 on salary; new leaf nodes N5, N6, and N7)
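A corresponding sketch of the class-list update (our own illustration, continuing the structures used above; the node identifiers and the splits argument layout are invented): the attribute list of each attribute used in a split is scanned once, and every affected entry's leaf reference is replaced by the child that its value falls into.

def update_class_list(attr_list, class_list, splits):
    # attr_list: (value, class_list_index) pairs for the attribute used in the splits.
    # splits: {leaf_id: (threshold, left_child_id, right_child_id)} for the split leaves.
    for value, idx in attr_list:
        entry = class_list[idx]
        if entry["leaf"] not in splits:
            continue   # this example's leaf was not split on this attribute
        threshold, left_child, right_child = splits[entry["leaf"]]
        # Apply the split test chosen at the old leaf to pick the new child node.
        entry["leaf"] = left_child if value <= threshold else right_child

class_list = [{"label": "G", "leaf": "N2"}, {"label": "B", "leaf": "N2"},
              {"label": "G", "leaf": "N3"}, {"label": "B", "leaf": "N3"}]
salary_list = [(15, 1), (40, 3), (65, 0), (75, 2)]
# Suppose N2 was split on salary <= 15 and N3 on salary <= 40.
update_class_list(salary_list, class_list,
                  {"N2": (15, "N4", "N5"), "N3": (40, "N6", "N7")})
print([entry["leaf"] for entry in class_list])   # ['N5', 'N4', 'N7', 'N6']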
Note that pre-sorting does not change the final decision tree or the classification accuracy. This is because the set of splits evaluated with and without pre-sorting is identical. Pre-sorting simply eliminates the task of re-sorting data at each node and removes the restriction that the training set be memory-resident.
4.3 Subsetting for Categorical Attributes
The splits for a categorical attribute A are of the form A ∈ S', where S' ⊂ S and S is the set of possible values of attribute A. The evaluation of all the subsets of S can be prohibitively expensive, especially if the cardinality of S is large. SLIQ uses a hybrid approach to overcome this issue. If the cardinality of S is less than a threshold, MAXSETSIZE, then all of the subsets of S are evaluated (we use a default MAXSETSIZE of 10, since 2^10 subsets can be evaluated fairly quickly). Otherwise, a greedy algorithm (initially proposed for IND [8]) is used to obtain the desired subset. The greedy algorithm starts with an empty subset S' and adds that one element of S to S' which gives the best split. The process is repeated until there is no improvement in the splits. This hybrid approach finds the optimal subset if S is small and also performs well for larger subsets.
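The hybrid search can be sketched as follows (our own illustration; best_categorical_split, split_score and the max_set_size parameter are invented names, and the scoring is the standard size-weighted gini): exhaustive enumeration of subsets when the attribute has at most MAXSETSIZE distinct values, and otherwise a greedy loop that keeps adding the single value that most improves the split.

from collections import Counter
from itertools import combinations

def _gini(counts):
    n = sum(counts.values())
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_score(values, labels, subset):
    # Weighted gini of the test "A in subset" versus "A not in subset".
    left = Counter(c for v, c in zip(values, labels) if v in subset)
    right = Counter(c for v, c in zip(values, labels) if v not in subset)
    n_l, n_r = sum(left.values()), sum(right.values())
    if n_l == 0 or n_r == 0:
        return float("inf")
    return (n_l * _gini(left) + n_r * _gini(right)) / (n_l + n_r)

def best_categorical_split(values, labels, max_set_size=10):
    domain = sorted(set(values))
    if len(domain) <= max_set_size:
        # Exhaustive search over all non-empty proper subsets of the domain.
        subsets = [set(s) for r in range(1, len(domain))
                   for s in combinations(domain, r)]
        best = min(subsets, key=lambda s: split_score(values, labels, s))
        return best, split_score(values, labels, best)
    # Greedy search: repeatedly add the one value that gives the best improvement.
    subset, best_score = set(), float("inf")
    improved = True
    while improved:
        improved = False
        for v in domain:
            if v in subset:
                continue
            score = split_score(values, labels, subset | {v})
            if score < best_score:
                best_score, best_value, improved = score, v, True
        if improved:
            subset.add(best_value)
    return subset, best_score

print(best_categorical_split(["car", "bus", "car", "bike", "bus"],
                             ["G", "B", "G", "G", "B"]))   # ({'bus'}, 0.0)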
4.4 Tree Pruning
The pruning strategy used in SLIQ is based on the principle of Minimum Description Length (MDL) [11]. We first review briefly the MDL principle and then show its application to decision-tree pruning. The MDL principle states that the best model for encoding data is the one that minimizes the sum of the cost of describing the data in terms of the model and the cost of describing the model. If M is a model that encodes the data D, the total cost of the encoding, cost(M, D), is defined as:

    cost(M, D) = cost(D | M) + cost(M)
where cost(D | M) is the cost, in number of bits, of encoding the data given a model M, and cost(M) is the cost of encoding the model M. In the context of decision tree classifiers, the models are the set of trees obtained by pruning the initial decision tree T, and the data is the training set S. The objective of MDL pruning is to find the subtree of T that best describes the training set S. Earlier applications of the MDL principle to tree pruning [9][12] showed that the resultant trees were over-pruned, causing a decrease in the classification accuracy. In [6], an alternative application of MDL was presented that yielded small trees without sacrificing accuracy. However, the pruning algorithm in [6] was limited; it either pruned all or none of the children of a node in the decision tree. We present a new algorithm that is able to prune a subset of the children at each node and thus subsumes the previous algorithm. There are two components of the pruning algorithm: the encoding scheme that determines the cost of encoding the data and the model, and the algorithm used to compare various subtrees of T.
4.4.1 Data Encoding

The cost of encoding a training set S by a decision tree T is defined as the sum of all classification errors. A classification of an example is an error if the classification produced by T is not the same as the original class label of the example. This count of misclassification errors is collected during the tree-building phase, so the data encoding step is inexpensive.

4.4.2 Model Encoding

The encoding scheme for the model has to provide for the cost of describing the tree and the costs of describing the tests used at each internal node.

- Encoding the Tree: Given a decision tree, a node can be an internal node with one or two children, or a leaf node. The number of bits required to encode the tree depends on the permissible tree structures. We explore three possible ways of encoding the tree:
  1. Code1: A node is allowed either zero or two children. Since there are only two possibilities, it takes only one bit to encode each node.
  2. Code2: Each node can have no children, a left child only, a right child only, or both children. Therefore, 2 bits are needed to encode the four possible values of each node.
  3. Code3: Only internal nodes are examined, so each node can have a left child only, a right child only, or both children. This requires log(3) bits.
- Encoding the Splits: The cost of encoding the splits depends on the type of attribute tested for the split:
  1. Numeric Attributes: If the split is of the form A ≤ v, where A is a numeric attribute and v is a real-valued number, the cost of encoding this test is simply the overhead of encoding v, say P. Although the value of P should optimally be determined independently for each such test in the decision tree, we assume a constant value of 1 throughout the tree. The value of 1 was determined empirically.
  2. Categorical Attributes: For tests of the form A ∈ S', where A is a categorical attribute and S' is a subset of the possible values of A, the cost is calculated in two steps. First, we count the number of such tests used in the tree, n_A, for each categorical attribute A. Then the cost of a test is calculated as ln n_A.

From now on, L_test denotes the cost of encoding any test at an internal node.

4.4.3 Pruning Algorithms

The MDL pruning algorithm evaluates the code length at each decision tree node to determine whether to convert the node into a leaf, prune the left or the right child, or leave the node intact. For each of these options, the code length C(n) for a node n is calculated as follows:
    C_leaf(t)  = L(t) + Errors_t,                       if t is pruned to a leaf (option 1)
    C_both(t)  = L(t) + L_test + C(t1) + C(t2),         if both children of t are kept (option 2)
    C_left(t)  = L(t) + L_test + C(t1) + C'(t2),        if only the left child t1 is kept (option 3)
    C_right(t) = L(t) + L_test + C'(t1) + C(t2),        if only the right child t2 is kept (option 4)
Except for C'(ti), all the other quantities are self-explanatory. In the case of partial pruning, when either t1 or t2 is pruned, the examples that fall into the pruned branch are encoded using the statistics at the parent node; C'(ti) represents the cost of encoding the children's examples using the parent's statistics. We consider three pruning strategies:
1. Full: This strategy, first presented in [6], considers only options (1) and (2). If C_leaf(t) is smaller than C_both(t) for a node t, then both children are pruned and the node is converted into a leaf. This approach codes the decision tree using only one bit per node (method Code1).
2. Partial: The partial pruning strategy chooses amongst all four options. Each node is converted into the option with the shortest code length. This approach uses the second method for coding trees, Code2, which requires 2 bits for each node.
3. Hybrid: The hybrid method prunes the tree in two phases. It first uses the Full method to get a smaller tree and then considers only options (2), (3), and (4) to further prune the tree.
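As a small, self-contained sketch of the Full strategy (our own illustration; the Node class, the constant node cost of one bit, and L_TEST = 1 are simplifying assumptions, and the real algorithm works on the error counts gathered during tree building), each node is kept only if encoding the split plus both children is cheaper than encoding the node as a leaf.

L_TEST = 1.0   # assumed constant cost of encoding a split test (cf. Section 4.4.2)

class Node:
    def __init__(self, errors, left=None, right=None):
        self.errors = errors          # misclassification errors if this node is a leaf
        self.left, self.right = left, right

def prune_full(node, node_cost_bits=1.0):
    # Bottom-up MDL pruning, Full strategy: convert a node into a leaf whenever
    # C_leaf(t) = L(t) + Errors_t is no larger than
    # C_both(t) = L(t) + L_test + C(t1) + C(t2).  Returns the kept code length.
    if node.left is None and node.right is None:
        return node_cost_bits + node.errors
    c_both = (node_cost_bits + L_TEST +
              prune_full(node.left) + prune_full(node.right))
    c_leaf = node_cost_bits + node.errors
    if c_leaf <= c_both:
        node.left = node.right = None   # prune both children
        return c_leaf
    return c_both

# A subtree whose children barely reduce the error count gets pruned:
tree = Node(errors=3, left=Node(errors=2), right=Node(errors=1))
prune_full(tree)
print(tree.left is None)   # True, since 1 + 3 <= 1 + 1 + (1 + 2) + (1 + 1)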
5 Performance Results
This section presents a detailed performance evaluation of SLIQ. We first discuss the metrics used in the evaluation and then describe the experimental methodology. This is followed by a comparison of SLIQ with other tree classification methods and the results of experiments showing SLIQ's scalability.

5.1 Metrics
The primary metric for evaluating classifier performance is classification accuracy - the percentage of test samples that are correctly classified. We also present the classification time and the size of the decision tree as secondary metrics. The ideal goal for a classifier is to produce compact, accurate trees in a short time.

5.2 Experimental Setup
The performance evaluation of SLIQ was divided into two parts. The first part compares SLIQ with the classifiers provided with the IND classifier package [8]. The IND package implements two of the most popular decision tree classifiers: CART [4] and C4 (a predecessor of C4.5 [10]). These implementations are henceforth referred to as IND-Cart and IND-C4. Since the IND classifiers handle only datasets that fit in memory, the comparison used datasets from the STATLOG classification benchmark [7]. Table 1 summarizes the important parameters of this benchmark. The second part of the performance evaluation examines SLIQ's performance on disk-resident data. In the absence of a benchmark with large classification datasets, we used the evaluation methodology and synthetic databases proposed in [1]. Each tuple in these databases has nine attributes. Ten classification functions were used in [1] to produce data distributions of varying complexities.
In this paper, we use the functions which were the hardest to characterize and led to the highest classification errors - functions 5 and 10. All experiments were performed on an IBM RS/6000 250 workstation with a buffer pool of 64 MB and executing the AIX 3.2.5 OS.

Dataset      Domain                     #Attributes  #Classes  #Examples
Australian   Credit Analysis                 14           2        690
Diabetes     Disease Diagnosis                8           2        768
DNA          DNA Sequencing                 180           3       3186
Letter       Handwriting Recognition         16          26      20000
Satimage     Landusage Images                36           6       6435
Segment      Image Segmentation              19           7       2310
Shuttle      Space Shuttle Radiation          9           7      57000
Vehicle      Vehicle Identification          18           4        846

Table 1. STATLOG Benchmark Datasets

5.3 MDL Pruning
Section 4.4 presented the partial and hybrid MDL-based pruning algorithms that can remove a subset of the children at any decision tree node. The first experiment compares the performance of these algorithms to the full pruning algorithm. Table 2 shows the classification accuracy of the different algorithms, while Table 3 shows the sizes of the final decision trees. The execution times of the three algorithms are nearly the same and have therefore not been shown.
Table 2. Classification Accuracy (full, partial, and hybrid MDL pruning)

Table 3. Decision Tree Size (full, partial, and hybrid MDL pruning)

The tables show that, compared to full pruning, partial pruning leads to much smaller trees but at the cost of lower classification accuracy. This implies that partial MDL pruning is over-aggressive. Hybrid pruning, on the other hand, achieves the same accuracy as full pruning and leads to decision trees that are, on average, 22% smaller. Hybrid pruning is therefore the preferred approach, and is used for the rest of the experiments in this paper.

5.4 Small Datasets
The next experiment compares the performance of SLIQ with IND-Cart and IND-C4. Table 4 shows the classification accuracy of each of the algorithms on the STATLOG benchmark. The results show that all three classifiers achieve similar accuracy; the largest difference is only 5.3% (Diabetes). However, Table 5 shows that there is a significant difference in the sizes of the decision trees produced by the classifiers. IND-C4 produces the largest decision trees for all the datasets. The trees produced by IND-Cart are 2 (Segment) to 16.4 (Australian) times smaller. SLIQ also produces trees that are comparable in size to those of IND-Cart and 2.1 (Shuttle) to 8.5 (Diabetes) times smaller than those of IND-C4.
Dataset      IND-Cart  IND-C4  SLIQ
Australian       85.3    84.4  84.9
Diabetes         74.6    70.1  75.4
DNA              92.2    92.5  92.1
Letter           84.7    86.8  84.6
Satimage         85.3    85.2  86.3
Segment          94.9    95.9  94.6
Shuttle          99.9    99.9  99.9
Vehicle          68.8    71.1  70.3

Table 4. Classification Accuracy

Table 5. Pruned-Tree Size (IND-Cart, IND-C4, and SLIQ)
Table 6. Execution Times (IND-Cart, IND-C4, and SLIQ)
The final criterion for comparing the algorithms is their execution times. Table 6 shows that IND-Cart, which uses cross-validation for pruning, has the largest execution times. The other two algorithms grow a single decision tree, and are therefore nearly an order of magnitude faster in comparison. SLIQ is faster than IND-C4, except for the Australian, Diabetes, and DNA data. The Australian and Diabetes data sets are very small and, therefore, the full potential of pre-sorting and breadth-first growth cannot be exploited by SLIQ. The DNA data consists only of categorical attributes, and hence there are no sorting costs for SLIQ to reduce. In summary, this set of experiments has shown that IND-Cart achieves good accuracy and small trees. However, the algorithm is nearly an order of magnitude
slower than the other algorithms. IND-C4 is also accurate and has fast execution times, but leads to large decision trees. SLIQ, on the other hand, does not
suffer from any of these drawbacks. It produces accurate decision trees that are significantly smaller than the trees produced using IND-C4. At the same time, SLIQ executes nearly an order of magnitude faster than IND-Cart.
Fig. 8. SLIQ Scalability: #Examples

Fig. 9. SLIQ Scalability: #Attributes
5.5 Scalability

The last set of experiments showed that SLIQ achieves good performance on memory-resident data. This section examines the scalability of SLIQ along two dimensions: the number of training examples and the number of attributes in the data. Synthetic databases (Section 5.2) were used for these experiments.
Scalability on the number of training examples: Figure 8 shows the performance of SLIQ as the number of training examples is increased from 100,000 to 10 million. This corresponds to an increase in total database size from 4MB to 400MB. The results show that SLIQ achieves near-linear execution times on disk-resident data. This is because the total classification time is dominated by I/O costs. Recall that SLIQ makes at most two complete passes over the data for each level of the decision tree. Since I/O costs are directly proportional to the size of the data, the total classification time also becomes a near-linear function of data size. The two functions show different slopes because the size of the tree, and hence the number of passes made over the data, is function-dependent. Total linearity is not achieved for two reasons. First, the pre-sorting time is non-linear in the size of the data. Second, classifying larger data sets sometimes leads to larger decision trees, which require extra passes over the data.

Scalability on the number of attributes: The next experiment studies the performance of SLIQ as the number of attributes increases. Since the original synthetic databases have only 9 attributes, extra attributes were created by adding randomly generated values to each example. Note that the addition of these attributes does not substantially change the final decision tree produced, because the extra attributes are not used by SLIQ anyway. The additional attributes simply increase the classification time because of the need to examine additional attributes at each level of the decision tree. The number of training examples was fixed at 100,000 for this experiment. The number of attributes was increased
from 9 to 400, which represents an increase in the database size from 4MB to 160MB. Figure 9 shows the performance for functions 5 and 10. There is a discontinuity at 100 attributes, when the database size is just over 40 MB and the attribute lists (80 MB) do not fit in memory. Recall that the buffer pool size was fixed at 64 MB for all the experiments. For disk-resident data, the classification time increases linearly, again due to the domination of I/O costs.
6 Conclusions
Classification is an important problem in data mining. Although classification has been studied extensively in the past, the various techniques proposed for classification do not scale well for large data sets. We presented a new decision-tree classifier, called SLIQ, which is designed specifically for scalability. SLIQ uses the novel techniques of pre-sorting, breadth-first growth, and MDL-based pruning. An empirical performance evaluation shows that, compared to other classifiers, SLIQ achieves comparable or better classification accuracy but produces small decision trees and has small execution times. We also demonstrated that SLIQ achieves good scalability and performs well for datasets with a large number of examples and attributes.
References
1. R. Agrawal, T. Imielinski, and A. Swami. Database mining: A performance perspective. IEEE Trans. on Knowledge and Data Engineering, 5(6), Dec. 1993.
2. J. Catlett. Megainduction: Machine Learning on Very Large Databases. PhD thesis, University of Sydney, 1991.
3. P. K. Chan and S. J. Stolfo. Meta-learning for multistrategy and parallel learning. In Proc. Second Intl. Workshop on Multistrategy Learning, pages 150-165, 1993.
4. L. Breiman et al. Classification and Regression Trees. Wadsworth, Belmont, 1984.
5. R. Agrawal et al. An interval classifier for database mining applications. In Proc. of the VLDB Conf., Vancouver, British Columbia, Canada, August 1992.
6. M. Mehta, J. Rissanen, and R. Agrawal. MDL-based decision tree pruning. In Int'l Conf. on Knowledge Discovery in Databases and Data Mining (KDD-95), Montreal, Canada, Aug. 1995.
7. D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994.
8. NASA Ames Res. Ctr. Intro. to IND Version 2.1, GA23-2475-02 edition, 1992.
9. J. R. Quinlan and R. L. Rivest. Inferring decision trees using the minimum description length principle. Information and Computation, 1989.
10. J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufman, 1993.
11. J. Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific Publ. Co., 1989.
12. C. Wallace and J. Patrick. Coding decision trees. Machine Learning, 11:7-22, 1993.
13. S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufman, 1991.