Data Preprocessing
Contents
3 Data Preprocessing
3.1 Why preprocess the data?
3.2 Data cleaning
    3.2.1 Missing values
    3.2.2 Noisy data
    3.2.3 Inconsistent data
3.3 Data integration and transformation
    3.3.1 Data integration
    3.3.2 Data transformation
3.4 Data reduction
    3.4.1 Data cube aggregation
    3.4.2 Dimensionality reduction
    3.4.3 Data compression
    3.4.4 Numerosity reduction
3.5 Discretization and concept hierarchy generation
    3.5.1 Discretization and concept hierarchy generation for numeric data
    3.5.2 Concept hierarchy generation for categorical data
3.6 Summary
Chapter 3
Data Preprocessing
Today's real-world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size, often several gigabytes or more. How can the data be preprocessed in order to help improve the quality of the data and, consequently, of the mining results? How can the data be preprocessed so as to improve the efficiency and ease of the mining process?

There are a number of data preprocessing techniques. Data cleaning can be applied to remove noise and correct inconsistencies in the data. Data integration merges data from multiple sources into a coherent data store, such as a data warehouse or a data cube. Data transformations, such as normalization, may be applied. For example, normalization may improve the accuracy and efficiency of mining algorithms involving distance measurements. Data reduction can reduce the data size by aggregating, eliminating redundant features, or clustering, for instance. These data preprocessing techniques, when applied prior to mining, can substantially improve the overall data mining results.

In this chapter, you will learn methods for data preprocessing. These methods are organized into the following categories: data cleaning, data integration and transformation, and data reduction. The use of concept hierarchies for data discretization, an alternative form of data reduction, is also discussed. Concept hierarchies can be further used to promote mining at multiple levels of abstraction. You will study how concept hierarchies can be generated automatically from the given data.
may be faulty. There may have been human or computer errors occurring at data entry. Errors in data transmission can also occur. There may be technology limitations, such as limited buffer size for coordinating synchronized data transfer and consumption. Incorrect data may also result from inconsistencies in naming conventions or data codes used. Duplicate tuples also require data cleaning.

Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. Dirty data can cause confusion for the mining procedure. Although most mining routines have some procedures for dealing with incomplete or noisy data, they are not always robust. Instead, they may concentrate on avoiding overfitting the data to the function being modeled. Therefore, a useful preprocessing step is to run your data through some data cleaning routines. Section 3.2 discusses methods for "cleaning up" your data.

Getting back to your task at AllElectronics, suppose that you would like to include data from multiple sources in your analysis. This would involve integrating multiple databases, data cubes, or files, i.e., data integration. Yet some attributes representing a given concept may have different names in different databases, causing inconsistencies and redundancies. For example, the attribute for customer identification may be referred to as customer_id in one data store, and cust_id in another. Naming inconsistencies may also occur for attribute values. For example, the same first name could be registered as "Bill" in one database, "William" in another, and "B." in a third. Furthermore, you suspect that some attributes may be derived or inferred from others (e.g., annual revenue). Having a large amount of redundant data may slow down or confuse the knowledge discovery process. Clearly, in addition to data cleaning, steps must be taken to help avoid redundancies during data integration. Typically, data cleaning and data integration are performed as a preprocessing step when preparing the data for a data warehouse. Additional data cleaning may be performed to detect and remove redundancies that may have resulted from data integration.

Getting back to your data, you have decided, say, that you would like to use a distance-based mining algorithm for your analysis, such as neural networks, nearest-neighbor classifiers, or clustering. Such methods provide better results if the data to be analyzed have been normalized, that is, scaled to a specific range such as [0.0, 1.0]. Your customer data, for example, contain the attributes age and annual salary. The annual salary attribute can take many more values than age. Therefore, if the attributes are left un-normalized, distance measurements taken on annual salary will generally outweigh distance measurements taken on age. Furthermore, it would be useful for your analysis to obtain aggregate information, such as the sales per customer region, something which is not part of any precomputed data cube in your data warehouse. You soon realize that data transformation operations, such as normalization and aggregation, are additional data preprocessing procedures that would contribute towards the success of the mining process. Data integration and data transformation are discussed in Section 3.3.

"Hmmm," you wonder, as you consider your data even further. "The data set I have selected for analysis is
huge; it is sure to slow down, or even wear down, the mining process. Is there any way I can 'reduce' the size of my data set without jeopardizing the data mining results?" Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results. There are a number of strategies for data reduction. These include data aggregation (e.g., building a data cube), dimension reduction (e.g., removing irrelevant attributes through correlation analysis), data compression (e.g., using encoding schemes such as minimum length encoding or wavelets), and numerosity reduction (e.g., "replacing" the data by alternative, smaller representations such as clusters or parametric models). Data can also be "reduced" by generalization, where low-level concepts, such as city for customer location, are replaced with higher-level concepts, such as region or province_or_state. A concept hierarchy is used to organize the concepts into varying levels of abstraction. Data reduction is the topic of Section 3.4. Since concept hierarchies are so useful in mining at multiple levels of abstraction, we devote a separate section to the automatic generation of this important data structure. Section 3.5 discusses concept hierarchy generation, a form of data reduction by data discretization.

Figure 3.1 summarizes the data preprocessing steps described here. Note that the above categorization is not mutually exclusive. For example, the removal of redundant data may be seen as a form of data cleaning, as well as of data reduction.

In summary, real-world data tend to be dirty, incomplete, and inconsistent. Data preprocessing techniques can improve the quality of the data, thereby helping to improve the accuracy and efficiency of the subsequent mining process. Data preprocessing is therefore an important step in the knowledge discovery process, since quality decisions must be based on quality data. Detecting data anomalies, rectifying them early, and reducing the data to be analyzed can lead to huge payoffs for decision making.
Figure 3.1: Forms of data preprocessing (data cleaning, data integration, data transformation, and data reduction).
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equi-depth) bins:
    Bin 1: 4, 8, 15
    Bin 2: 21, 21, 24
    Bin 3: 25, 28, 34
Smoothing by bin means:
    Bin 1: 9, 9, 9
    Bin 2: 22, 22, 22
    Bin 3: 29, 29, 29
Smoothing by bin boundaries:
    Bin 1: 4, 4, 15
    Bin 2: 21, 21, 24
    Bin 3: 25, 25, 34

Figure 3.2: Binning methods for data smoothing.

customers in the same credit risk category as that of the given tuple.

6. Use the most probable value to fill in the missing value: This may be determined with inference-based tools using a Bayesian formalism or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income. Decision trees are described in detail in Chapter 7.

Methods 3 to 6 bias the data. The filled-in value may not be correct. Method 6, however, is a popular strategy. In comparison to the other methods, it uses the most information from the present data to predict missing values.
"What is noise?" Noise is a random error or variance in a measured variable. Given a numeric attribute such as, say, price, how can we "smooth out" the data to remove the noise? Let's look at the following data smoothing techniques.
1. Binning methods: Binning methods smooth a sorted data value by consulting its "neighborhood", that is, the values around it. The sorted values are distributed into a number of "buckets", or bins. Because binning methods consult the neighborhood of values, they perform local smoothing. Figure 3.2 illustrates some binning techniques, and a short code sketch of equi-depth binning appears at the end of this subsection. In this example, the data for price are first sorted and partitioned into equi-depth bins of depth 3. In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by the value 9. Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value. In general, the larger the width, the greater the effect of the smoothing. Alternatively, bins may be equi-width, where the interval range of values in each bin is constant. Binning is also used as a discretization technique and is further discussed in Section 3.5, and in Chapter 6 on association rule mining.

2. Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or "clusters". Intuitively, values which fall outside of the set of clusters may be considered outliers (Figure 3.3). Chapter 8 is dedicated to the topic of clustering.
Figure 3.3: Outliers may be detected by clustering analysis.

3. Combined computer and human inspection: Outliers may be identified through a combination of computer and human inspection. In one application, for example, an information-theoretic measure was used to help identify outlier patterns in a handwritten character database for classification. The measure's value reflected the "surprise" content of the predicted character label with respect to the known label. Outlier patterns may be informative (e.g., identifying useful data exceptions, such as different versions of the characters "0" or "7") or "garbage" (e.g., mislabeled characters). Patterns whose surprise content is above a threshold are output to a list. A human can then sort through the patterns in the list to identify the actual garbage ones. This is much faster than having to manually search through the entire database. The garbage patterns can then be removed from the training database.

4. Regression: Data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the "best" line to fit two variables, so that one variable can be used to predict the other. Multiple linear regression is an extension of linear regression, where more than two variables are involved and the data are fit to a multidimensional surface. Using regression to find a mathematical equation to fit the data helps smooth out the noise. Regression is further described in Section 3.4.4, as well as in Chapter 7.

Many methods for data smoothing are also methods of data reduction involving discretization. For example, the binning techniques described above reduce the number of distinct values per attribute. This acts as a form of data reduction for logic-based data mining methods, such as decision tree induction, which repeatedly make value comparisons on sorted data. Concept hierarchies are a form of data discretization which can also be used for data smoothing. A concept hierarchy for price, for example, may map real price values into "inexpensive", "moderately priced", and "expensive", thereby reducing the number of data values to be handled by the mining process. Data discretization is discussed in Section 3.5. Some methods of classification, such as neural networks, have built-in data smoothing mechanisms. Classification is the topic of Chapter 7.
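The following is a minimal Python sketch of equi-depth binning with smoothing by bin means and by bin boundaries, applied to the price data of Figure 3.2. The function names are illustrative rather than taken from any particular library.

# Equi-depth binning with two smoothing strategies (see Figure 3.2).
def equi_depth_bins(sorted_values, depth):
    """Split an already-sorted list into consecutive bins of `depth` values."""
    return [sorted_values[i:i + depth] for i in range(0, len(sorted_values), depth)]

def smooth_by_means(bins):
    # Replace every value in a bin by the (rounded) bin mean.
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace every value by the closer of the bin's min and max.
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]            # bins hold sorted values
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equi_depth_bins(prices, depth=3)
print(bins)                         # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))        # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_boundaries(bins))   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]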
If the resulting value of Equation (3.1) is greater than 1, then A and B are positively correlated. The higher the value, the more each attribute implies the other. Hence, a high value may indicate that A (or B) may be removed as a redundancy. If the resulting value is equal to 1, then A and B are independent and there is no correlation between them. If the resulting value is less than 1, then A and B are negatively correlated. This means that each attribute discourages the other. Equation (3.1) may detect a correlation between the customer_id and cust_number attributes described above. Correlation analysis is further described in Chapter 6 (Section 6.5.2) on mining correlation rules.

In addition to detecting redundancies between attributes, "duplication" should also be detected at the tuple level (e.g., where there are two or more identical tuples for a given unique data entry case).

A third important issue in data integration is the detection and resolution of data value conflicts. For example, for the same real-world entity, attribute values from different sources may differ. This may be due to differences in representation, scaling, or encoding. For instance, a weight attribute may be stored in metric units in one system and British imperial units in another. The price of different hotels may involve not only different currencies but also different services (such as free breakfast) and taxes. Such semantic heterogeneity of data poses great challenges in data integration. Careful integration of the data from multiple sources can help reduce and avoid redundancies and inconsistencies in the resulting data set. This can help improve the accuracy and speed of the subsequent mining process.
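Equation (3.1) itself appears earlier in the chapter and is not reproduced in this excerpt; the sketch below assumes it is the measure P(A and B) / (P(A) P(B)) computed over n tuples, which behaves as described above (greater than 1 for positive correlation, 1 for independence, less than 1 for negative correlation). The counts used are hypothetical.

def correlation(count_a, count_b, count_ab, n):
    # P(A and B) / (P(A) * P(B)) estimated from co-occurrence counts.
    p_a, p_b, p_ab = count_a / n, count_b / n, count_ab / n
    return p_ab / (p_a * p_b)

# Hypothetical counts: A occurs in 600 of 1,000 tuples, B in 750, both in 400.
value = correlation(count_a=600, count_b=750, count_ab=400, n=1000)
print(round(value, 2))   # 0.89, i.e. less than 1: A and B would be negatively correlated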
4. Generalization of the data, where low-level or "primitive" (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized to higher-level concepts, like city or county. Similarly, values for numeric attributes, like age, may be mapped to higher-level concepts, like young, middle-aged, and senior.

In this section, we discuss normalization. Smoothing is a form of data cleaning, and was discussed in Section 3.2.2. Aggregation and generalization also serve as forms of data reduction, and are discussed in Sections 3.4 and 3.5, respectively.

An attribute is normalized by scaling its values so that they fall within a small specified range, such as 0 to 1.0. Normalization is particularly useful for classification algorithms involving neural networks, or distance measurements such as nearest-neighbor classification and clustering. If using the neural network backpropagation algorithm for classification mining (Chapter 7), normalizing the input values for each attribute measured in the training samples will help speed up the learning phase. For distance-based methods, normalization helps prevent attributes with initially large ranges (e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary attributes). There are many methods for data normalization. We study three: min-max normalization, z-score normalization, and normalization by decimal scaling.

Min-max normalization performs a linear transformation on the original data. Suppose that min_A and max_A are the minimum and maximum values of an attribute A. Min-max normalization maps a value v of A to v' in the range [new_min_A, new_max_A] by computing

    v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A        (3.2)
Min-max normalization preserves the relationships among the original data values. It will encounter an "out of bounds" error if a future input case for normalization falls outside of the original data range for A.
Example 3.1 Suppose that the maximum and minimum values for the attribute income are $98,000 and $12,000, respectively. We would like to map income to the range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is transformed to ((73,600 - 12,000) / (98,000 - 12,000)) * (1.0 - 0) + 0 = 0.716.

In z-score normalization (or zero-mean normalization), the values for an attribute A are normalized based on the mean and standard deviation of A. A value v of A is normalized to v' by computing

    v' = (v - mean_A) / stand_dev_A        (3.3)

where mean_A and stand_dev_A are the mean and standard deviation, respectively, of attribute A. This method of normalization is useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers which dominate the min-max normalization.

Example 3.2 Suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively. With z-score normalization, a value of $73,600 for income is transformed to (73,600 - 54,000) / 16,000 = 1.225.

Normalization by decimal scaling normalizes by moving the decimal point of values of attribute A. The number of decimal points moved depends on the maximum absolute value of A. A value v of A is normalized to v' by computing

    v' = v / 10^j        (3.4)

where j is the smallest integer such that max(|v'|) < 1.

Example 3.3 Suppose that the recorded values of A range from -986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide each value by 1,000 (i.e., j = 3) so that -986 normalizes to -0.986.
Note that normalization can change the original data quite a bit, especially the latter two of the methods shown above. It is also necessary to save the normalization parameters such as the mean and standard deviation if using z-score normalization so that future data can be normalized in a uniform manner.
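The three normalization methods can be summarized in a short sketch; the helper names are illustrative, and the sample values reproduce Examples 3.1 to 3.3.

import math

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # Equation (3.2): linear rescaling to [new_min, new_max].
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    # Equation (3.3): center on the mean, scale by the standard deviation.
    return (v - mean_a) / std_a

def decimal_scaling(values):
    # Equation (3.4): divide by the smallest power of ten that brings
    # every absolute value below 1.
    largest = max(abs(v) for v in values)
    j = math.floor(math.log10(largest)) + 1
    return [v / 10 ** j for v in values]

print(round(min_max(73600, 12000, 98000), 3))   # 0.716  (Example 3.1)
print(round(z_score(73600, 54000, 16000), 3))   # 1.225  (Example 3.2)
print(decimal_scaling([-986, 917]))             # [-0.986, 0.917]  (Example 3.3)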
Figure 3.4: Sales data for a given branch of AllElectronics for the years 1997 to 1999. In the data on the left, the sales are shown per quarter. In the data on the right, the data are aggregated to provide the annual sales.
Figure 3.5: A data cube for sales at AllElectronics. (The cube has dimensions item type (home entertainment, computer, phone, security), year (1997, 1998, 1999), and branch (A, B, C, D).)

The base cuboid should correspond to an individual entity of interest, such as sales or customer. In other words, the lowest level should be "usable", or useful for the analysis. Since data cubes provide fast access to precomputed, summarized data, they should be used when possible to reply to queries regarding aggregated information. When replying to such OLAP queries or data mining requests, the smallest available cuboid relevant to the given task should be used. This issue is also addressed in Chapter 2.
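As a small illustration of data cube aggregation, the sketch below rolls quarterly sales up to annual sales by aggregating away the quarter dimension, in the spirit of Figure 3.4; the sales figures are hypothetical.

from collections import defaultdict

quarterly_sales = [        # (year, quarter, sales in $) -- hypothetical figures
    (1997, "Q1", 224000), (1997, "Q2", 408000),
    (1997, "Q3", 350000), (1997, "Q4", 586000),
    (1998, "Q1", 300000), (1998, "Q2", 410000),
    (1998, "Q3", 370000), (1998, "Q4", 620000),
]

annual_sales = defaultdict(int)
for year, _quarter, sales in quarterly_sales:
    annual_sales[year] += sales      # aggregate away the quarter dimension

print(dict(annual_sales))            # {1997: 1568000, 1998: 1700000}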
Forward selection:
    Initial attribute set: {A1, A2, A3, A4, A5, A6}
    Initial reduced set:   {}
    -> {A1}
    --> {A1, A4}
    ---> Reduced attribute set: {A1, A4, A6}

Backward elimination:
    Initial attribute set: {A1, A2, A3, A4, A5, A6}
    -> {A1, A3, A4, A5, A6}
    --> {A1, A4, A5, A6}
    ---> Reduced attribute set: {A1, A4, A6}
1. Step-wise forward selection: The procedure starts with an empty set of attributes. The best of the original attributes is determined and added to the set. At each subsequent iteration or step, the best of the remaining original attributes is added to the set.

2. Step-wise backward elimination: The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.

3. Combination of forward selection and backward elimination: The step-wise forward selection and backward elimination methods can be combined so that, at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.

The stopping criteria for methods 1 to 3 may vary. The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process. A code sketch of the forward selection strategy appears after this list.

4. Decision tree induction: Decision tree algorithms, such as ID3 and C4.5, were originally intended for classification. Decision tree induction constructs a flowchart-like structure where each internal (non-leaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction. At each node, the algorithm chooses the "best" attribute to partition the data into individual classes.

When decision tree induction is used for attribute subset selection, a tree is constructed from the given data. All attributes that do not appear in the tree are assumed to be irrelevant. The set of attributes appearing in the tree form the reduced subset of attributes. This method of attribute selection is visited again in greater detail in Chapter 5 on concept description.
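The following sketch illustrates step-wise forward selection (method 1 above). The scoring function is a stand-in for whatever attribute-relevance measure the mining task uses (e.g., information gain); it and the example weights are assumptions, not the book's algorithm verbatim.

def forward_selection(attributes, score, max_attrs, threshold=0.0):
    """Greedily add the best-scoring attribute until no candidate improves
    the score by more than `threshold`, or `max_attrs` is reached."""
    selected = []
    while len(selected) < max_attrs:
        candidates = [a for a in attributes if a not in selected]
        if not candidates:
            break
        best = max(candidates, key=lambda a: score(selected + [a]))
        gain = score(selected + [best]) - score(selected)
        if gain <= threshold:
            break
        selected.append(best)
    return selected

# Hypothetical scorer: pretend only A1, A4, and A6 carry signal.
useful = {"A1": 0.5, "A4": 0.3, "A6": 0.2}
score = lambda attrs: sum(useful.get(a, 0.0) for a in attrs)
print(forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], score, max_attrs=6))
# ['A1', 'A4', 'A6']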
Wavelet transforms

The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector D, transforms it to a numerically different vector, D', of wavelet coefficients. The two vectors are of the same length.

"Hmmm," you wonder. "How can this technique be useful for data reduction if the wavelet transformed data are of the same length as the original data?" The usefulness lies in the fact that the wavelet transformed data can be truncated. A compressed approximation of the data can be retained by storing only a small fraction of the strongest of the wavelet coefficients. For example, all wavelet coefficients larger than some user-specified threshold can be retained. The remaining coefficients are set to 0. The resulting data representation is therefore very sparse, so that operations that can take advantage of data sparsity are computationally very fast if performed in wavelet space.

The DWT is closely related to the discrete Fourier transform (DFT), a signal processing technique involving sines and cosines. In general, however, the DWT achieves better lossy compression. That is, if the same number of coefficients are retained for a DWT and a DFT of a given data vector, the DWT version will provide a more accurate approximation of the original data. Unlike the DFT, wavelets are quite localized in space, contributing to the conservation of local detail. There is only one DFT, yet there are several DWTs. The general algorithm for a discrete wavelet transform is as follows.

1. The length, L, of the input data vector must be an integer power of two. This condition can be met by padding the data vector with zeros, as necessary.

2. Each transform involves applying two functions. The first applies some data smoothing, such as a sum or weighted average. The second performs a weighted difference.

3. The two functions are applied to pairs of the input data, resulting in two sets of data of length L/2. In general, these respectively represent a smoothed version of the input data and the high-frequency content of it.

4. The two functions are recursively applied to the sets of data obtained in the previous loop, until the resulting data sets are of the desired length.

5. A selection of values from the data sets obtained in the above iterations are designated the wavelet coefficients of the transformed data.

Equivalently, a matrix multiplication can be applied to the input data in order to obtain the wavelet coefficients. For example, given an input vector of length 4 represented as the column vector (x0, x1, x2, x3), the 4-point Haar transform is obtained by the matrix product

    [  1/2    1/2    1/2    1/2 ] [ x0 ]
    [  1/2    1/2   -1/2   -1/2 ] [ x1 ]        (3.5)
    [ 1/√2  -1/√2     0      0  ] [ x2 ]
    [   0      0    1/√2  -1/√2 ] [ x3 ]

The matrix on the left is orthonormal, meaning that the columns are unit vectors multiplied by a constant and are mutually orthogonal, so that the matrix inverse is just its transpose. Although we do not have room to discuss it here, this property allows the reconstruction of the data from the smooth and smooth-difference data sets. Other popular wavelet transforms include the Daubechies-4 and the Daubechies-6 transforms.

Wavelet transforms can be applied to multidimensional data, such as a data cube. This is done by first applying the transform to the first dimension, then to the second, and so on. The computational complexity involved is linear with respect to the number of cells in the cube. Wavelet transforms give good results on sparse or skewed data, and on data with ordered attributes.
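A rough sketch of the recursive transform outlined in steps 1 to 5 is given below. It uses pairwise averages as the smoothing function and pairwise half-differences as the detail function; normalization conventions vary between texts, so this variant differs from the orthonormal 1/√2 scaling of Equation (3.5) by constant factors per level.

def haar_dwt(data):
    """Return (overall_smooth, details), where `details` lists the coefficients
    produced at each level, coarsest last. len(data) must be a power of two."""
    details = []
    while len(data) > 1:
        smooth = [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]
        detail = [(data[i] - data[i + 1]) / 2 for i in range(0, len(data), 2)]
        details.append(detail)
        data = smooth          # recurse on the smoothed half
    return data[0], details

overall, details = haar_dwt([2.0, 2.0, 0.0, 2.0])
print(overall, details)        # 1.5 [[0.0, -1.0], [0.5]]

Truncating the small detail coefficients (setting them to 0) and keeping only the largest ones yields the sparse, compressed representation described above.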
In parametric methods of numerosity reduction, a model is used to estimate the data, so that typically only the model parameters need be stored, instead of the actual data. (Outliers may also be stored.) Log-linear models, which estimate discrete multidimensional probability distributions, are an example. Non-parametric methods for storing reduced representations of the data include histograms, clustering, and sampling.
Figure 3.7: A histogram for price using singleton buckets; each bucket represents one price-value/frequency pair.

Let's have a look at each of the numerosity reduction techniques mentioned above.
Histograms
Histograms use binning to approximate data distributions and are a popular form of data reduction. A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or buckets. The buckets are displayed on a horizontal axis, while the height (and area) of a bucket typically reflects the average frequency of the values represented by the bucket. If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets. Often, buckets instead represent continuous ranges for the given attribute.
Example 3.4 The following data are a list of prices of commonly sold items at AllElectronics rounded to the
nearest dollar. The numbers have been sorted. 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
Figure 3.8: A histogram for price where values are aggregated so that each bucket has a uniform width of $10.

Figure 3.7 shows a histogram for the data using singleton buckets. To further reduce the data, it is common to have each bucket denote a continuous range of values for the given attribute. In Figure 3.8, each bucket represents a different $10 range for price.

"How are the buckets determined and the attribute values partitioned?" There are several partitioning rules, including the following.

1. Equi-width: In an equi-width histogram, the width of each bucket range is constant (such as the width of $10 for the buckets in Figure 3.8).

2. Equi-depth (or equi-height): In an equi-depth histogram, the buckets are created so that, roughly, the frequency of each bucket is constant (that is, each bucket contains roughly the same number of contiguous data samples).

3. V-Optimal: If we consider all of the possible histograms for a given number of buckets, the V-Optimal histogram is the one with the least variance. Histogram variance is a weighted sum of the variances of the values that each bucket represents, where bucket weight is equal to the number of values in the bucket.

4. MaxDiff: In a MaxDiff histogram, we consider the difference between each pair of adjacent values. A bucket boundary is established between each pair of adjacent values having one of the β - 1 largest differences, where β is the user-specified number of buckets.

V-Optimal and MaxDiff histograms tend to be the most accurate and practical. Histograms are highly effective at approximating both sparse and dense data, as well as highly skewed and uniform data. The histograms described above for single attributes can be extended for multiple attributes. Multidimensional histograms can capture dependencies between attributes. Such histograms have been found effective in approximating data with up to five attributes. More studies are needed regarding the effectiveness of multidimensional histograms for very high dimensions. Singleton buckets are useful for storing outliers with high frequency. Histograms are further described in Chapter 5 (Section 5.6) on mining descriptive statistical measures in large databases.
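As an illustration, the sketch below builds equi-width bucket counts and equi-depth buckets for the price list of Example 3.4. The boundary conventions are simplifications of the rules described above.

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15,
          15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20,
          20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

def equi_width_counts(values, width):
    # Count how many values fall into each fixed-width range (1-10, 11-20, ...).
    counts = {}
    for v in values:
        lo = ((v - 1) // width) * width + 1
        counts[(lo, lo + width - 1)] = counts.get((lo, lo + width - 1), 0) + 1
    return counts

def equi_depth_buckets(sorted_values, n_buckets):
    # Split sorted values into buckets holding (roughly) the same number of samples;
    # a boundary value may repeat across adjacent buckets.
    depth = len(sorted_values) // n_buckets
    return [sorted_values[i:i + depth] for i in range(0, depth * n_buckets, depth)]

print(equi_width_counts(prices, width=10))            # {(1, 10): 13, (11, 20): 25, (21, 30): 14}
print([(b[0], b[-1]) for b in equi_depth_buckets(prices, 4)])
# [(1, 10), (12, 18), (18, 21), (21, 30)]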
Clustering

Clustering techniques consider data tuples as objects. They partition the objects into groups, or clusters, so that objects within a cluster are "similar" to one another and "dissimilar" to objects in other clusters. Similarity is commonly defined in terms of how "close" the objects are in space, based on a distance function. The "quality" of a cluster may be represented by its diameter, the maximum distance between any two objects in the cluster. Centroid distance is an alternative measure of cluster quality, and is defined as the average distance of each cluster object from the cluster centroid (denoting the "average object", or average point in space for the cluster).
Figure 3.9: A 2-D plot of customer data with respect to customer locations in a city, showing three data clusters. Each cluster centroid is marked with a "+".
Figure 3.10: The root of a B+-tree for a given set of data, with pointers to the keys 986, 3396, 5411, 8392, and 9544.
Figure 3.9 shows a 2-D plot of customer data with respect to customer locations in a city, where the centroid of each cluster is shown with a "+". Three data clusters are visible.

In data reduction, the cluster representations of the data are used to replace the actual data. The effectiveness of this technique depends on the nature of the data. It is much more effective for data that can be organized into distinct clusters than for smeared data.

In database systems, multidimensional index trees are primarily used for providing fast data access. They can also be used for hierarchical data reduction, providing a multiresolution clustering of the data. This can be used to provide approximate answers to queries. An index tree recursively partitions the multidimensional space for a given set of data objects, with the root node representing the entire space. Such trees are typically balanced, consisting of internal and leaf nodes. Each parent node contains keys and pointers to child nodes that, collectively, represent the space represented by the parent node. Each leaf node contains pointers to the data tuples they represent (or to the actual tuples). An index tree can therefore store aggregate and detail data at varying levels of resolution or abstraction. It provides a hierarchy of clusterings of the data set, where each cluster has a label that holds for the data contained in the cluster. If we consider each child of a parent node as a bucket, then an index tree can be considered as a hierarchical histogram.

For example, consider the root of a B+-tree as shown in Figure 3.10, with pointers to the data keys 986, 3396, 5411, 8392, and 9544. Suppose that the tree contains 10,000 tuples with keys ranging from 1 to 9,999. The data in the tree can be approximated by an equi-depth histogram of 6 buckets for the key ranges 1 to 985, 986 to 3395, 3396 to 5410, 5411 to 8391, 8392 to 9543, and 9544 to 9999. Each bucket contains roughly 10,000/6 items. Similarly, each bucket is subdivided into smaller buckets, allowing for aggregate data at a finer-detailed level. The use of multidimensional index trees as a form of data reduction relies on an ordering of the attribute values in each dimension. Multidimensional index trees include R-trees, quad-trees, and their variations. They are well suited for handling both sparse and skewed data.

There are many measures for defining clusters and cluster quality. Clustering methods are further described in Chapter 8.
Sampling

Sampling can be used as a data reduction technique since it allows a large data set to be represented by a much
smaller random sample or subset of the data. Suppose that a large data set, D, contains N tuples. Let's have a look at some possible samples for D.
(The panels of Figure 3.11 show, for a set of tuples T1, T2, ..., TN: a simple random sample without replacement (SRSWOR, n = 4); a simple random sample with replacement (SRSWR, n = 4); a cluster sample drawn from page-sized groups of tuples; and a stratified sample according to age, with young, middle-aged, and senior strata.)
Figure 3.11: Sampling can be used for data reduction.

1. Simple random sample without replacement (SRSWOR) of size n: This is created by drawing n of the N tuples from D (n < N), where the probability of drawing any tuple in D is 1/N, i.e., all tuples are equally likely.

2. Simple random sample with replacement (SRSWR) of size n: This is similar to SRSWOR, except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn again.
3. Cluster sample: If the tuples in D are grouped into M mutually disjoint "clusters", then an SRS of m clusters can be obtained, where m < M. For example, tuples in a database are usually retrieved a page at a time, so that each page can be considered a cluster. A reduced data representation can be obtained by applying, say, SRSWOR to the pages, resulting in a cluster sample of the tuples.

4. Stratified sample: If D is divided into mutually disjoint parts called "strata", a stratified sample of D is generated by obtaining an SRS at each stratum. This helps to ensure a representative sample, especially when the data are skewed. For example, a stratified sample may be obtained from customer data, where a stratum is created for each customer age group. In this way, the age group having the smallest number of customers will be sure to be represented.

These samples are illustrated in Figure 3.11. They represent the most commonly used forms of sampling for data reduction.

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample, n, as opposed to N, the data set size. Hence, sampling complexity is potentially sub-linear to the size of the data. Other data reduction techniques can require at least one complete pass through D. For a fixed sample size, sampling complexity increases only linearly as the number of data dimensions, d, increases, whereas techniques using histograms, for example, increase exponentially in d.

When applied to data reduction, sampling is most commonly used to estimate the answer to an aggregate query. It is possible (using the central limit theorem) to determine a sufficient sample size for estimating a given function within a specified degree of error. This sample size, n, may be extremely small in comparison to N. Sampling is a natural choice for the progressive refinement of a reduced data set. Such a set can be further refined by simply increasing the sample size.
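A sketch of the four sampling schemes using Python's standard random module is shown below; the data set, page size, and strata are hypothetical.

import random

data = list(range(1, 101))                        # 100 tuples, identified 1..100
random.seed(42)

srswor = random.sample(data, k=4)                 # SRSWOR: without replacement
srswr = [random.choice(data) for _ in range(4)]   # SRSWR: with replacement

# Cluster sample: treat each "page" of 10 tuples as a cluster, sample 2 pages.
pages = [data[i:i + 10] for i in range(0, 100, 10)]
cluster_sample = [t for page in random.sample(pages, k=2) for t in page]

# Stratified sample: an SRS within each (hypothetical) age stratum.
strata = {"young": data[:30], "middle_aged": data[30:80], "senior": data[80:]}
stratified = {name: random.sample(group, k=max(1, len(group) // 10))
              for name, group in strata.items()}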
Figure 3.12: A concept hierarchy for price: the range ($0 - $1,000] is partitioned into ($0 - $200], ($200 - $400], ($400 - $600], ($600 - $800], and ($800 - $1,000], each of which is further partitioned into $100-wide intervals such as ($0 - $100], ($100 - $200], ..., ($700 - $800].
Figure 3.13: Histogram showing the distribution of values for the attribute price.
Concept hierarchies for numeric attributes can be constructed automatically based on data distribution analysis, using techniques such as binning, histogram analysis, clustering analysis, entropy-based discretization, and data segmentation by "natural partitioning".
1. Binning. Section 3.2.2 discussed binning methods for data smoothing. These methods are also forms of discretization. For example, attribute values can be discretized by replacing each bin value by the bin mean or median, as in smoothing by bin means or smoothing by bin medians, respectively. These techniques can be applied recursively to the resulting partitions in order to generate concept hierarchies.

2. Histogram analysis. Histograms, as discussed in Section 3.4.4, can also be used for discretization. Figure 3.13 presents a histogram showing the data distribution of the attribute price for a given data set. For example, the most frequent price range is roughly $300-$325. Partitioning rules can be used to define the ranges of values. For instance, in an equi-width histogram, the values are partitioned into equal-sized partitions or ranges (e.g., ($0 - $100], ($100 - $200], ..., ($900 - $1,000]). With an equi-depth histogram, the values are partitioned so that, ideally, each partition contains the same number of data samples. The histogram analysis algorithm can be applied recursively to each partition in order to automatically generate a multilevel concept hierarchy, with the procedure terminating once a pre-specified number of concept levels has been reached. A minimum interval size can also be used per level to control the recursive procedure. This specifies the minimum width of a partition, or the minimum number of values for each partition at each level. A concept hierarchy for price, generated from the data of Figure 3.13, is shown in Figure 3.12.

3. Clustering analysis. A clustering algorithm can be applied to partition data into clusters or groups. Each cluster forms a node of a concept hierarchy, where all nodes are at the same conceptual level. Each cluster may be further decomposed
into several subclusters, forming a lower level of the hierarchy. Clusters may also be grouped together in order to form a higher conceptual level of the hierarchy. Clustering methods for data mining are studied in Chapter 8.

4. Entropy-based discretization. An information-based measure called "entropy" can be used to recursively partition the values of a numeric attribute A, resulting in a hierarchical discretization. Such a discretization forms a numerical concept hierarchy for the attribute. Given a set of data tuples, S, the basic method for entropy-based discretization of A is as follows.

Each value of A can be considered a potential interval boundary, or threshold T. For example, a value v of A can partition the samples in S into two subsets satisfying the conditions A < v and A ≥ v, respectively, thereby creating a binary discretization.

Given S, the threshold value selected is the one that minimizes the expected entropy after partitioning (equivalently, maximizes the information gain Ent(S) - I(S, T)), where the expected entropy is

    I(S, T) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)        (3.7)
Here, S1 and S2 correspond to the samples in S satisfying the conditions A < T and A ≥ T, respectively. The entropy function Ent for a given set is calculated based on the class distribution of the samples in the set. For example, given m classes, the entropy of S1 is

    Ent(S1) = - Σ_{i=1}^{m} p_i log_2(p_i)        (3.8)

where p_i is the probability of class i in S1, determined by dividing the number of samples of class i in S1 by the total number of samples in S1. The value of Ent(S2) can be computed similarly.

The process of determining a threshold value is recursively applied to each partition obtained, until some stopping criterion is met, such as the information gain falling below a small threshold δ:

    Ent(S) - I(S, T) < δ        (3.9)

Experiments show that entropy-based discretization can reduce data size and may improve classification accuracy. The information gain and entropy measures described here are also used for decision tree induction. These measures are revisited in greater detail in Chapter 5 (Section 5.4 on analytical characterization) and Chapter 7 (Section 7.3 on decision tree induction).

5. Segmentation by natural partitioning. Although binning, histogram analysis, clustering, and entropy-based discretization are useful in the generation of numerical hierarchies, many users would like to see numerical ranges partitioned into relatively uniform, easy-to-read intervals that appear intuitive or "natural". For example, annual salaries broken into ranges like ($50,000, $60,000] are often more desirable than ranges like ($51,263.98, $60,872.34] obtained by some sophisticated clustering analysis.

The 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals. In general, the rule partitions a given range of data into either 3, 4, or 5 relatively equi-width intervals, recursively and level by level, based on the value range at the most significant digit. The rule is as follows.

(a) If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, then partition the range into 3 intervals (3 equi-width intervals for 3, 6, and 9, and three intervals in the grouping of 2-3-2 for 7);

(b) if it covers 2, 4, or 8 distinct values at the most significant digit, then partition the range into 4 equi-width intervals; and
(c) if it covers 1, 5, or 10 distinct values at the most significant digit, then partition the range into 5 equi-width intervals.

The rule can be recursively applied to each interval, creating a concept hierarchy for the given numeric attribute. Since there could be some dramatically large positive or negative values in a data set, the top-level segmentation, based merely on the minimum and maximum values, may produce distorted results. For example, the assets of a few people could be several orders of magnitude higher than those of others in the same data set. Segmentation based on the maximal asset values may lead to a highly biased hierarchy. Thus the top-level segmentation can be performed based on the range of data values representing the majority of the data (e.g., from the 5th percentile to the 95th percentile). The extremely high or low values beyond this range will form distinct intervals, which can be handled separately, but in a similar manner. The following example illustrates the use of the 3-4-5 rule for the automatic construction of a numeric hierarchy.

Figure 3.14: Automatic generation of a concept hierarchy for profit based on the 3-4-5 rule.
Example 3.5 Suppose that profits at different branches of AllElectronics for the year 1997 cover a wide range, from -$351,976.00 to $4,700,896.50. A user wishes to have a concept hierarchy for profit automatically
generated. For improved readability, we use the notation (l, r] to represent an interval that excludes l but includes r. For example, (-$1,000,000, $0] denotes the range from -$1,000,000 (exclusive) to $0 (inclusive). Suppose that the data within the 5th and 95th percentiles lie between -$159,876 and $1,838,761. The results of applying the 3-4-5 rule are shown in Figure 3.14.

Step 1: Based on the above information, the minimum and maximum values are MIN = -$351,976.00 and MAX = $4,700,896.50. The low (5th percentile) and high (95th percentile) values to be considered for the top or first level of segmentation are LOW = -$159,876 and HIGH = $1,838,761.

Step 2: Given LOW and HIGH, the most significant digit is at the million-dollar digit position (i.e., msd = 1,000,000). Rounding LOW down to the million-dollar digit, we get LOW' = -$1,000,000; rounding HIGH up to the million-dollar digit, we get HIGH' = +$2,000,000.

Step 3: Since this interval ranges over 3 distinct values at the most significant digit, i.e., (2,000,000 - (-1,000,000)) / 1,000,000 = 3, the segment is partitioned into 3 equi-width subsegments according to the 3-4-5 rule: (-$1,000,000, $0], ($0, $1,000,000], and ($1,000,000, $2,000,000]. This represents the top tier of the hierarchy.

Step 4: We now examine the MIN and MAX values to see how they "fit" into the first-level partitions. Since the first interval, (-$1,000,000, $0], covers the MIN value (i.e., LOW' < MIN), we can adjust the left boundary of this interval to make the interval smaller. The most significant digit of MIN is at the hundred-thousand-dollar digit position. Rounding MIN down to this position, we get MIN' = -$400,000. Therefore, the first interval is redefined as (-$400,000, $0]. Since the last interval, ($1,000,000, $2,000,000], does not cover the MAX value (i.e., MAX > HIGH'), we need to create a new interval to cover it. Rounding MAX up at its most significant digit position, the new interval is ($2,000,000, $5,000,000]. Hence, the topmost level of the hierarchy contains four partitions: (-$400,000, $0], ($0, $1,000,000], ($1,000,000, $2,000,000], and ($2,000,000, $5,000,000].

Step 5: Recursively, each interval can be further partitioned according to the 3-4-5 rule to form the next lower level of the hierarchy:

The first interval, (-$400,000, $0], is partitioned into 4 sub-intervals: (-$400,000, -$300,000], (-$300,000, -$200,000], (-$200,000, -$100,000], and (-$100,000, $0].

The second interval, ($0, $1,000,000], is partitioned into 5 sub-intervals: ($0, $200,000], ($200,000, $400,000], ($400,000, $600,000], ($600,000, $800,000], and ($800,000, $1,000,000].

The third interval, ($1,000,000, $2,000,000], is partitioned into 5 sub-intervals: ($1,000,000, $1,200,000], ($1,200,000, $1,400,000], ($1,400,000, $1,600,000], ($1,600,000, $1,800,000], and ($1,800,000, $2,000,000].

The last interval, ($2,000,000, $5,000,000], is partitioned into 3 sub-intervals: ($2,000,000, $3,000,000], ($3,000,000, $4,000,000], and ($4,000,000, $5,000,000].

Similarly, the 3-4-5 rule can be carried on iteratively at deeper levels, as necessary.
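A simplified sketch of the top-level step of the 3-4-5 rule (Steps 2 and 3 of Example 3.5) is given below. It omits the 2-3-2 grouping for 7 distinct values and the boundary adjustments of Step 4, and it infers the most significant digit from the larger of |LOW| and |HIGH|.

import math

def top_level_345(low, high):
    # Round LOW down and HIGH up at the most significant digit, then split the
    # rounded range into 3, 4, or 5 equi-width intervals.
    msd = 10 ** int(math.floor(math.log10(max(abs(low), abs(high)))))
    low_r = math.floor(low / msd) * msd
    high_r = math.ceil(high / msd) * msd
    distinct = round((high_r - low_r) / msd)   # distinct values at the msd
    if distinct in (3, 6, 9):
        parts = 3
    elif distinct in (2, 4, 8):
        parts = 4
    elif distinct in (1, 5, 10):
        parts = 5
    else:                                      # 7 -> 2-3-2 grouping, not shown
        parts = 3
    width = (high_r - low_r) / parts
    return [(low_r + i * width, low_r + (i + 1) * width) for i in range(parts)]

print(top_level_345(-159876, 1838761))
# [(-1000000.0, 0.0), (0.0, 1000000.0), (1000000.0, 2000000.0)]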
(The generated hierarchy, from top to bottom: country, province_or_state (e.g., 65 distinct values), city, street.)
Figure 3.15: Automatic generation of a schema concept hierarchy based on the number of distinct attribute values.

2. Specification of a portion of a hierarchy by explicit data grouping. This is essentially the manual definition of a portion of a concept hierarchy. In a large database, it is unrealistic to define an entire concept hierarchy by explicit value enumeration. However, it is realistic to specify explicit groupings for a small portion of intermediate-level data. For example, after specifying that province and country form a hierarchy at the schema level, one may like to add some intermediate levels manually, such as explicitly defining "{Alberta, Saskatchewan, Manitoba} ⊂ prairies_Canada" and "{British Columbia, prairies_Canada} ⊂ Western_Canada".

3. Specification of a set of attributes, but not of their partial ordering. A user may simply group a set of attributes as a preferred dimension or hierarchy, but may omit stating their partial order explicitly. This may require the system to automatically generate the attribute ordering so as to construct a meaningful concept hierarchy.

Without knowledge of data semantics, it is difficult to provide an ideal hierarchical ordering for an arbitrary set of attributes. However, an important observation is that, since higher-level concepts generally cover several subordinate lower-level concepts, an attribute defining a high concept level will usually contain a smaller number of distinct values than an attribute defining a lower concept level. Based on this observation, a concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy. The fewer distinct values an attribute has, the higher it is in the generated concept hierarchy. This heuristic rule works well in many cases. Some local-level swapping or adjustments may be performed by users or experts, when necessary, after examination of the generated hierarchy. Let's examine an example of this method.
Example 3.6 Suppose a user selects a set of attributes, street, country, province_or_state, and city, for a dimension location from the database AllElectronics, but does not specify the hierarchical ordering among the attributes. The concept hierarchy for location can be generated automatically as follows. First, sort the attributes in ascending order based on the number of distinct values in each attribute. This results in the following (where the number of distinct values per attribute is shown in parentheses): country (15), province_or_state (65), city (3,567), and street (674,339). Second, generate the hierarchy from the top down according to the sorted order, with the first attribute at the top level and the last attribute at the bottom level. The resulting hierarchy is shown in Figure 3.15. Finally, the user examines the generated hierarchy and, when necessary, modifies it to reflect the desired semantic relationships among the attributes. In this example, it is obvious that there is no need to modify the generated hierarchy.
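The ordering heuristic of Example 3.6 amounts to a one-line sort; the sketch below uses the distinct-value counts quoted in the example.

# Order attributes by number of distinct values: fewest distinct values on top.
distinct_counts = {"street": 674_339, "country": 15,
                   "province_or_state": 65, "city": 3_567}

hierarchy = sorted(distinct_counts, key=distinct_counts.get)  # top level first
print(" < ".join(hierarchy))
# country < province_or_state < city < street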
Note that this heuristic rule cannot be pushed to the extreme, since there are obvious cases which do not follow it. For example, a time dimension in a database may contain 20 distinct years, 12 distinct months, and 7 distinct days of the week. However, this does not suggest that the time hierarchy should be "year < month < days_of_the_week", with days_of_the_week at the top of the hierarchy.

4. Specification of only a partial set of attributes.
Sometimes a user can be sloppy when defining a hierarchy, or may have only a vague idea about what should be included in a hierarchy. Consequently, the user may have included only a small subset of the relevant attributes in the hierarchy specification. For example, instead of including all the hierarchically relevant attributes for location, one may specify only street and city. To handle such partially specified hierarchies, it is important to embed data semantics in the database schema so that attributes with tight semantic connections can be pinned together. In this way, the specification of one attribute may trigger a whole group of semantically tightly linked attributes to be "dragged in" to form a complete hierarchy. Users, however, should have the option to override this feature, as necessary.
Example 3.7 Suppose that a database system has pinned together the five attributes number, street, city, province_or_state, and country, because they are closely linked semantically regarding the notion of location. If a user were to specify only the attribute city for a hierarchy defining location, the system may automatically drag in all of the above five semantically related attributes to form a hierarchy. The user may choose to drop any of these attributes, such as number and street, from the hierarchy, keeping city as the lowest conceptual level in the hierarchy.
3.6 Summary
Data preparation is an important issue for both data warehousing and data mining, as real-world data tend to be incomplete, noisy, and inconsistent. Data preparation includes data cleaning, data integration, data transformation, and data reduction.

Data cleaning routines can be used to fill in missing values, smooth noisy data, identify outliers, and correct data inconsistencies.

Data integration combines data from multiple sources to form a coherent data store. Metadata, correlation analysis, data conflict detection, and the resolution of semantic heterogeneity contribute towards smooth data integration.

Data transformation routines convert the data into forms appropriate for mining. For example, attribute data may be normalized so as to fall within a small range, such as 0 to 1.0.

Data reduction techniques such as data cube aggregation, dimension reduction, data compression, numerosity reduction, and discretization can be used to obtain a reduced representation of the data, while minimizing the loss of information content.

Concept hierarchies organize the values of attributes or dimensions into gradual levels of abstraction. They are a form of discretization that is particularly useful in multilevel mining.

Automatic generation of concept hierarchies for categorical data may be based on the number of distinct values of the attributes defining the hierarchy. For numeric data, techniques such as data segmentation by partition rules, histogram analysis, and clustering analysis can be used.

Although several methods of data preparation have been developed, data preparation remains an active area of research.
Exercises
1. Data quality can be assessed in terms of accuracy, completeness, and consistency. Propose two other dimensions of data quality.

2. In real-world data, tuples with missing values for some attributes are a common occurrence. Describe various methods for handling this problem.

3. Suppose that the data for analysis include the attribute age. The age values for the data tuples are, in increasing order: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
Bibliographic Notes
Data preprocessing is discussed in a number of textbooks, including Pyle [28], Kennedy et al. [21], and Weiss and Indurkhya [37]. More specific references to individual preprocessing techniques are given below.

For discussion regarding data quality, see Ballou and Tayi [3], Redman [31], Wand and Wang [35], and Wang, Storey, and Firth [36]. The handling of missing attribute values is discussed in Quinlan [29], Breiman et al. [5], and Friedman [11]. A method for the detection of outlier or "garbage" patterns in a handwritten character database is given in Guyon, Matic, and Vapnik [14]. Binning and data normalization are treated in several texts, including [28, 21, 37].

A good survey of data reduction techniques can be found in Barbará et al. [4]. For algorithms on data cubes and their precomputation, see [33, 16, 1, 38, 32]. Greedy methods for attribute subset selection (or feature subset selection) are described in several texts, such as Neter et al. [24] and John [18]. A combination forward selection and backward elimination method was proposed in Siedlecki and Sklansky [34]. For a description of wavelets for data compression, see Press et al. [27]. Daubechies transforms are described in Daubechies [6]. The book by Press et al. [27] also contains an introduction to singular value decomposition for principal components analysis. An introduction to regression and log-linear models can be found in several textbooks, such as [17, 9, 20, 8, 24]. For log-linear models (known as multiplicative models in the computer science literature), see Pearl [25]. For a
general introduction to histograms, see [7, 4]. For extensions of single-attribute histograms to multiple attributes, see Muralikrishna and DeWitt [23], and Poosala and Ioannidis [26]. Several references to clustering algorithms are given in Chapter 8 of this book, which is devoted to that topic. A survey of multidimensional indexing structures is given in Gaede and Gunther [12]. The use of multidimensional index trees for data aggregation is discussed in Aoki [2]. Index trees include R-trees (Guttman [13]), quad-trees (Finkel and Bentley [10]), and their variations. For discussion on sampling and data mining, see John and Langley [19], and Kivinen and Mannila [22]. Entropy and information gain are described in Quinlan [30]. Concept hierarchies, and their automatic generation from categorical data, are described in Han and Fu [15].
Bibliography
[1] S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. In Proc. 1996 Int. Conf. Very Large Data Bases, pages 506-521, Bombay, India, Sept. 1996.

[2] P. M. Aoki. Generalizing "search" in generalized search trees. In Proc. 1998 Int. Conf. Data Engineering (ICDE'98), April 1998.

[3] D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Communications of the ACM, 42:73-78, 1999.

[4] D. Barbará et al. The New Jersey data reduction report. Bulletin of the Technical Committee on Data Engineering, 20:3-45, December 1997.

[5] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.

[6] I. Daubechies. Ten Lectures on Wavelets. Capital City Press, Montpelier, Vermont, 1992.

[7] J. Devore and R. Peck. Statistics: The Exploration and Analysis of Data. New York: Duxbury Press, 1997.

[8] J. L. Devore. Probability and Statistics for Engineering and the Sciences, 4th ed. Duxbury Press, 1995.

[9] A. J. Dobson. An Introduction to Generalized Linear Models. Chapman and Hall, 1990.

[10] R. A. Finkel and J. L. Bentley. Quad-trees: A data structure for retrieval on composite keys. ACTA Informatica, 4:1-9, 1974.

[11] J. H. Friedman. A recursive partitioning decision rule for nonparametric classifiers. IEEE Trans. on Comp., 26:404-408, 1977.

[12] V. Gaede and O. Gunther. Multidimensional access methods. ACM Comput. Surv., 30:170-231, 1998.

[13] A. Guttman. R-tree: A dynamic index structure for spatial searching. In Proc. 1984 ACM-SIGMOD Int. Conf. Management of Data, June 1984.

[14] I. Guyon, N. Matic, and V. Vapnik. Discovering informative patterns and data cleaning. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 181-203. AAAI/MIT Press, 1996.

[15] J. Han and Y. Fu. Dynamic generation and refinement of concept hierarchies for knowledge discovery in databases. In Proc. AAAI'94 Workshop Knowledge Discovery in Databases (KDD'94), pages 157-168, Seattle, WA, July 1994.

[16] V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. In Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data, pages 205-216, Montreal, Canada, June 1996.

[17] M. James. Classification Algorithms. John Wiley, 1985.

[18] G. H. John. Enhancements to the Data Mining Process. Ph.D. Thesis, Computer Science Dept., Stanford University, 1997.

[19] G. H. John and P. Langley. Static versus dynamic sampling for data mining. In Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD'96), pages 367-370, Portland, OR, Aug. 1996.

[20] R. A. Johnson and D. W. Wickern. Applied Multivariate Statistical Analysis, 3rd ed. Prentice Hall, 1992.

[21] R. L. Kennedy, Y. Lee, B. Van Roy, C. D. Reed, and R. P. Lippman. Solving Data Mining Problems Through Pattern Recognition. Upper Saddle River, NJ: Prentice Hall, 1998.

[22] J. Kivinen and H. Mannila. The power of sampling in knowledge discovery. In Proc. 13th ACM Symp. Principles of Database Systems, pages 77-85, Minneapolis, MN, May 1994.

[23] M. Muralikrishna and D. J. DeWitt. Equi-depth histograms for estimating selectivity factors for multidimensional queries. In Proc. 1988 ACM-SIGMOD Int. Conf. Management of Data, pages 28-36, Chicago, IL, June 1988.

[24] J. Neter, M. H. Kutner, C. J. Nachtsheim, and L. Wasserman. Applied Linear Statistical Models, 4th ed. Irwin: Chicago, 1996.

[25] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Palo Alto, CA: Morgan Kaufmann, 1988.

[26] V. Poosala and Y. Ioannidis. Selectivity estimation without the attribute value independence assumption. In Proc. 23rd Int. Conf. on Very Large Data Bases, pages 486-495, Athens, Greece, Aug. 1997.

[27] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge, MA, 1996.

[28] D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.

[29] J. R. Quinlan. Unknown attribute values in induction. In Proc. 6th Int. Workshop on Machine Learning, pages 164-168, Ithaca, NY, June 1989.

[30] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[31] T. Redman. Data Quality: Management and Technology. Bantam Books, New York, 1992.

[32] K. Ross and D. Srivastava. Fast computation of sparse datacubes. In Proc. 1997 Int. Conf. Very Large Data Bases, pages 116-125, Athens, Greece, Aug. 1997.

[33] S. Sarawagi and M. Stonebraker. Efficient organization of large multidimensional arrays. In Proc. 1994 Int. Conf. Data Engineering, pages 328-336, Feb. 1994.

[34] W. Siedlecki and J. Sklansky. On automatic feature selection. Int. J. of Pattern Recognition and Artificial Intelligence, 2:197-220, 1988.

[35] Y. Wand and R. Wang. Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39:86-95, 1996.

[36] R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans. Knowledge and Data Engineering, 7:623-640, 1995.

[37] S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1998.

[38] Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. In Proc. 1997 ACM-SIGMOD Int. Conf. Management of Data, pages 159-170, Tucson, Arizona, May 1997.