DWDM Module II
DATA MINING
OUTLINE
Introduction
• What is Data Mining
• Definition, Knowledge Discovery in Databases (KDD)
• Kinds of databases
• Data mining functionalities
• Classification of data mining systems
• Data mining task primitives
OUTLINE
Data Preprocessing:
• Data cleaning
• Data integration and transformation
• Data reduction
• Data discretization and concept hierarchy generation
INTRODUCTION
Data mining has attracted a great deal of attention in the information industry in recent
years because of the wide availability of huge amounts of data and the imminent need for
turning such data into useful information and knowledge.
The information and knowledge gained can be used for applications ranging from
business management, production control, and market analysis, to engineering design and
science exploration.
INTRODUCTION
Data mining refers to extracting or "mining" knowledge from large amounts of data. There
are many other terms related to data mining, such as knowledge mining, knowledge
extraction, data/pattern analysis, data archaeology, and data dredging.
Many people treat data mining as a synonym for another popularly used term, "Knowledge
Discovery in Databases," or KDD.
Data mining should be applicable to any kind of data repository, as well as to transient
data, such as data streams.
The scope of our examination of data repositories will include relational databases, data
warehouses, transactional databases, advanced database systems, flat files, data streams,
and the World Wide Web.
Relational Databases
A relational database is a collection of tables, each of which is assigned a unique name.
Each table consists of a set of attributes (columns or fields) and usually stores a large set
of tuples (records or rows). Each tuple in a relational table represents an object identified
by a unique key and described by a set of attribute values.
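A minimal sketch of these ideas in Python, using the standard sqlite3 module; the table, columns, and values below are hypothetical, not AllElectronics' actual schema:

import sqlite3

# An in-memory relational database: one table with a unique key and attribute columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, income REAL)")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                 [(1, "Sandy Smith", 56000.0), (2, "Lee Jones", 42000.0)])
# Each tuple (row) is identified by its key and described by its attribute values.
for row in conn.execute("SELECT cust_id, name, income FROM customer"):
    print(row)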
Data Warehouses
A data warehouse is a repository of information collected from multiple sources, stored
under a unified schema, and usually residing at a single site. With such a repository, a
company like AllElectronics can easily analyze data drawn from all of its sources. Data
warehouses are constructed via a process of data cleaning, data integration, data
transformation, data loading, and periodic data refreshing.
KINDS OF DATA BASES
A data cube for AllElectronics. A data cube for the summarized sales data of AllElectronics
has three dimensions: address (with city values Chicago, New York, Toronto, Vancouver),
time (with quarter values Q1, Q2, Q3, Q4), and item (with item type values home
entertainment, computer, phone, security).
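A minimal sketch of a cube-like summary built with pandas (assumed to be available); the sales amounts are made-up illustrative values:

import pandas as pd

sales = pd.DataFrame({
    "address": ["Chicago", "Chicago", "New York", "Vancouver"],
    "time":    ["Q1", "Q2", "Q1", "Q1"],
    "item":    ["computer", "phone", "computer", "security"],
    "amount":  [862, 746, 968, 400],   # sales in thousands (illustrative)
})
# Aggregate the measure (amount) over the three dimensions, as a data cube would.
cube = sales.pivot_table(values="amount", index="address",
                         columns=["time", "item"], aggfunc="sum", fill_value=0)
print(cube)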
KINDS OF DATA BASES
Transactional Databases
A transactional database consists of a file where each record represents a transaction. A
transaction typically includes a unique transaction identity number (trans ID) and a list of
the items making up the transaction.
As an analyst of the AllElectronics database, you may ask, “Show me all the items
purchased by Sandy Smith” or “How many transactions include item number I3?”
Answering such queries may require a scan of the entire transactional database.
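A minimal sketch of answering the second query by scanning an in-memory transactional data set; the transaction IDs and item numbers are made up:

# Each record: a unique transaction ID and the list of items in that transaction.
transactions = {
    "T100": ["I1", "I3", "I8", "I16"],
    "T200": ["I2", "I8"],
    "T300": ["I3", "I5"],
}
# "How many transactions include item number I3?" requires a scan of every record.
count_i3 = sum(1 for items in transactions.values() if "I3" in items)
print(count_i3)  # 2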
KINDS OF DATA BASES
Object-Relational Databases
Object-relational databases are constructed based on an object-relational data model.
This model extends the relational model by providing a rich data type for handling
complex objects and object orientation.
Each object has associated with it the following:
• A set of variables that describe the objects. These correspond to attributes in the entity-
relationship and relational models.
• A set of messages that the object can use to communicate with other objects, or with
the rest of the database system.
• A set of methods, where each method holds the code to implement a message. Upon
receiving a message, the method returns a value in response.
KINDS OF DATA BASES
A temporal database typically stores relational data that include time-related attributes.
These attributes may involve several timestamps, each having different semantics.
A spatial database that stores spatial objects that change with time is called a
spatiotemporal database,
Text Databases and Multimedia Databases
Text databases are databases that contain word descriptions for objects. These word
descriptions are usually not simple keywords but rather long sentences or paragraphs,
such as product specifications, error or bug reports, warning messages, summary reports,
notes, or other documents.
Multimedia databases store image, audio, and video data.
KINDS OF DATA BASES
Data Streams
Many applications involve the generation and analysis of a new kind of data, called stream
data, where data flow in and out of an observation platform (or window) dynamically.
Such data streams have the following unique features: huge or possibly infinite volume,
dynamically changing, flowing in and out in a fixed order, allowing only one or a small
number of scans, and demanding fast (often real-time) response time.
Typical examples of data streams include various kinds of scientific and engineering data
and time-series data.
The World Wide Web
The World Wide Web and its associated distributed information services, such as Yahoo!,
Google, America Online, and AltaVista, provide rich, worldwide, on-line information
services, where data objects are linked together to facilitate interactive access.
Data Mining Functionalities
Data mining functionalities are used to specify the kind of patterns to be found in data
mining tasks. In general, data mining tasks can be classified into two categories:
1.Descriptive
2.Predictive.
Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make
predictions.
Data Mining Functionalities
For example,
in the AllElectronics store, classes of items for sale include computers and printers, and
concepts of customers include bigSpenders and budgetSpenders.
It can be useful to describe individual classes and concepts in summarized, concise, and
yet precise terms. Such descriptions of a class or a concept are called class/concept
descriptions.
These descriptions can be derived via
1. Data characterization
2. Data discrimination
Data Mining Functionalities
Data characterization
For example, to study the characteristics of software products whose sales increased by
10% in the last year, the data related to such products can be collected by executing an
SQL query
The output of data characterization can be presented in various forms. Examples include
pie charts, bar charts, curves, multidimensional data cubes, and multidimensional tables,
including crosstabs.
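A minimal sketch with pandas of collecting and summarizing such a target set; the DataFrame, column names, and figures are hypothetical stand-ins for the result of the SQL query:

import pandas as pd

products = pd.DataFrame({
    "item":         ["editor", "compiler", "game", "antivirus"],
    "type":         ["software"] * 4,
    "sales_growth": [0.15, 0.08, 0.22, 0.12],   # fractional growth over the last year
    "price":        [99.0, 249.0, 39.0, 49.0],
})
# Target class: software products whose sales increased by at least 10% last year.
target = products[products["sales_growth"] >= 0.10]
# A simple characterization: summary statistics of the target class attributes.
print(target[["sales_growth", "price"]].describe())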
Data Mining Functionalities
Data discrimination
is a comparison of the general features of target class data objects with the general
features of objects from one or a set of contrasting classes.
The target and contrasting classes can be specified by the user, and the corresponding data
objects retrieved through database queries.
For example, the user may like to compare the general features of software products
whose sales increased by 10% in the last year with those whose sales decreased by at least
30% during the same period. The methods used for data discrimination are similar to
those used for data characterization.
Here also we use the same pie charts, bar charts, curves, multidimensional data cubes, and
multidimensional tables, including crosstabs for output.
Data Mining Functionalities
Mining frequent patterns: Frequent patterns are patterns that occur frequently in data;
they include frequent itemsets, frequent subsequences (sequential patterns), and frequent
substructures.
A frequent itemset typically refers to a set of items that frequently appear together
in a transactional data set, such as milk and bread.
A frequently occurring subsequence, such as the pattern that customers tend to purchase
first a PC, followed by a digital camera, and then a memory card, is a (frequent)
sequential pattern.
A substructure can refer to different structural forms, such as graphs, trees, or lattices,
which may be combined with itemsets or subsequences. If a substructure occurs
frequently, it is called a (frequent) structured pattern.
Data Mining Functionalities
Eg:
Association analysis. Suppose, as a marketing manager of AllElectronics, you would like
to determine which items are frequently purchased together within the same transactions.
Association rules are discarded as uninteresting if they do not satisfy both a minimum
support threshold and a minimum confidence threshold.
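A minimal sketch that computes support and confidence for one candidate rule, computer => memory_card, over a handful of hypothetical transactions:

transactions = [
    {"computer", "memory_card"},
    {"computer", "memory_card", "printer"},
    {"computer"},
    {"printer", "phone"},
]
n = len(transactions)
both = sum(1 for t in transactions if {"computer", "memory_card"} <= t)
antecedent = sum(1 for t in transactions if "computer" in t)

support = both / n               # P(computer AND memory_card) = 0.5
confidence = both / antecedent   # P(memory_card | computer) = 2/3
min_support, min_confidence = 0.4, 0.6
# The rule is kept only if it satisfies both thresholds.
print(support >= min_support and confidence >= min_confidence)   # True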
Data Mining Functionalities
Classification is the process of finding a model (or function) that describes and
distinguishes data classes or concepts, for the purpose of being able to use the model to
predict the class of objects whose class label is unknown. The derived model is based on
the analysis of a set of training data (i.e., data objects whose class label is known).
“How is the derived model presented?” The derived model may be represented in various
forms, such as classification (IF-THEN) rules, decision trees, mathematical formulae,
or neural networks.
Data Mining Functionalities
A decision tree is a flow-chart-like tree structure, where each node denotes a test on an
attribute value, each branch represents an outcome of the test, and tree leaves represent
classes or class distributions. Decision trees can easily be converted to classification
rules.
There are many other methods for constructing classification models, such as naïve
Bayesian classification, support vector machines, and k-nearest neighbor classification.
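A minimal sketch of deriving and presenting such a model with scikit-learn (assumed to be installed); the training tuples, features, and class labels are made up:

from sklearn.tree import DecisionTreeClassifier, export_text

# Training data: [price, brand_code] with known class labels.
X = [[1200, 0], [950, 1], [80, 2], [60, 2], [40, 1], [1100, 0]]
y = ["good", "good", "mild", "mild", "no", "good"]

model = DecisionTreeClassifier(max_depth=2).fit(X, y)
# The derived decision tree can be presented as IF-THEN style rules.
print(export_text(model, feature_names=["price", "brand_code"]))
# Predict the class of an object whose class label is unknown.
print(model.predict([[70, 2]]))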
Data Mining Functionalities
Eg:
Classification and prediction. Suppose, as sales manager of AllElectronics, you would
like to classify a large set of items in the store, based on three kinds of responses to a
sales campaign: good response, mild response, and no response. You would like to derive
a model for each of these three classes based on the descriptive features of the items, such
as price, brand, place made, type, and category.
Data Mining Functionalities
Cluster Analysis
“What is cluster analysis?” Unlike classification and prediction, which analyze class-
labeled data objects, clustering analyzes data objects without consulting a known class
label.
Data Mining Functionalities
The objects are clustered or grouped based on the principle of maximizing the intraclass
similarity and minimizing the interclass similarity. That is, clusters of objects are formed
so that objects within a cluster have high similarity in comparison to one another, but are
very dissimilar to objects in other clusters.
Eg:
Cluster analysis can be performed on AllElectronics customer data in order to identify
homogeneous subpopulations of customers. These clusters may represent individual
target groups for marketing.
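A minimal sketch using scikit-learn's KMeans (assumed to be installed) over hypothetical two-attribute customer records (age, annual spend):

from sklearn.cluster import KMeans

customers = [[25, 300], [27, 320], [45, 2500], [47, 2700], [33, 900]]
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
# Objects in the same cluster are similar to one another; each cluster is a candidate target group.
print(km.labels_)           # cluster assignment for each customer
print(km.cluster_centers_)  # a summary representative of each group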
Data Mining Functionalities
Outlier Analysis
A database may contain data objects that do not comply with the general behavior or
model of the data. These data objects are outliers. Most data mining methods discard
outliers as noise or exceptions. However, in some applications such as fraud detection, the
rare events can be more interesting than the more regularly occurring ones. The analysis
of outlier data is referred to as outlier mining.
Outliers may be detected using statistical tests that assume a distribution or probability
model for the data, or using distance measures where objects that are a substantial
distance from any other cluster are considered outliers.
Data Mining Functionalities
Eg:
Outlier analysis. Outlier analysis may uncover fraudulent usage of credit cards by
detecting purchases of extremely large amounts for a given account number in
comparison to regular charges incurred by the same account. Outlier values may also be
detected with respect to the location and type of purchase, or the purchase frequency.
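A minimal sketch of the statistical approach, flagging charges that lie far from an account's usual behavior; the amounts are made up and the 2-standard-deviation cut-off is an arbitrary illustrative threshold:

import statistics

charges = [42.0, 18.5, 55.0, 61.0, 37.0, 29.0, 73.5, 48.0, 22.0, 4900.0]
mean = statistics.mean(charges)
stdev = statistics.stdev(charges)
# Flag charges more than 2 standard deviations from the mean as potential outliers.
outliers = [c for c in charges if abs(c - mean) > 2 * stdev]
print(outliers)   # [4900.0]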
Data Mining Functionalities
Evolution Analysis
Data evolution analysis describes and models regularities or trends for objects whose
behavior changes over time. Although this may include characterization, discrimination,
association and correlation analysis, classification, prediction, or clustering of time-related
data, distinct features of such an analysis include time-series data analysis, sequence or
periodicity pattern matching, and similarity-based data analysis.
Eg:
Evolution analysis. Suppose that you have the major stock market (time-series) data of
the last several years available from the New York Stock Exchange and you would like to
invest in shares of high-tech industrial companies. A data mining study of stock exchange
data may identify stock evolution regularities for overall stocks and for the stocks of
particular companies. Such regularities may help predict future trends in stock market
prices, contributing to your decision making regarding stock investments.
Data Mining Functionalities
Depending on the kinds of data to be mined or on the given data mining application, the
data mining system may also integrate techniques from spatial data analysis, information
retrieval, pattern recognition, image analysis, signal processing, computer graphics, Web
technology, economics, business, bioinformatics, or psychology.
Data Mining Functionalities
A data mining query is defined in terms of data mining task primitives. These primitives
allow the user to interactively communicate with the data mining system during discovery
in order to direct the mining process, or examine the findings from different angles or
depths.
Data Mining Task Primitives
The set of task-relevant data to be mined: This specifies the portions of the database
or the set of data in which the user is interested. This includes the database attributes
or data warehouse dimensions of interest (referred to as the relevant attributes or
dimensions).
The kind of knowledge to be mined: This specifies the data mining functions to be
performed, such as characterization, discrimination, association or correlation analysis,
classification, prediction, clustering, outlier analysis, or evolution analysis.
Pattern interestingness measure: This primitive allows users to specify functions that
are used to separate uninteresting patterns from knowledge and may be used to guide the
mining process, as well as to evaluate the discovered patterns. This allows the user to
confine the number of uninteresting patterns returned by the process, as a data mining
process may generate a large number of patterns. Interestingness measures can be
specified for such pattern characteristics as simplicity, certainty, utility and novelty.
The expected representation for visualizing the discovered patterns: This refers to the
form in which discovered patterns are to be displayed, which may include rules, tables,
charts, graphs, decision trees, and cubes.
DATA PREPROCESSING
Data preprocessing describes processing performed on raw data to prepare it for another
processing procedure. Commonly used as a preliminary data mining practice, data
preprocessing transforms the data into a format that will be more easily and effectively
processed for the purpose of the user.
DATA PREPROCESSING
Noisy data (incorrect values) may come from:
• Faulty data collection instruments
• Human or computer error at data entry
• Errors in data transmission
Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out
noise while identifying outliers, and correct inconsistencies in the data.
Missing Values:
Imagine that you need to analyze AllElectronics sales and customer data. You note that
many tuples have no recorded value for several attributes, such as customer income. How
can you go about filling in the missing values for this attribute? Let’s look at the
following
methods:
1. Ignore the tuple: This is usually done when the class label is missing (assuming the
mining task involves classification). This method is not very effective, unless the tuple
contains several attributes with missing values. It is especially poor when the percentage
of missing values per attribute varies considerably.
DATA PREPROCESSING- Data cleaning
2. Fill in the missing value manually: In general, this approach is time-consuming and
may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values
by the same constant, such as a label like "Unknown" or -∞. If missing values are
replaced by, say, "Unknown," then the mining program may mistakenly think that they
form an interesting concept, since they all share the common value "Unknown."
Hence, although this method is simple, it is not foolproof.
4. Use the attribute mean to fill in the missing value: For example, suppose that the
average income of AllElectronics customers is $56,000. Use this value to replace the
missing value for income.
DATA PREPROCESSING- Data cleaning
5. Use the attribute mean for all samples belonging to the same class as the given tuple:
For example, if classifying customers according to credit risk, replace the missing value
with the average income value for customers in the same credit risk category as that of the
given tuple.
6. Use the most probable value to fill in the missing value: This may be determined with
regression, inference-based tools using a Bayesian formalism, or decision tree induction.
For example, using the other customer attributes in your data set, you may construct a
decision tree to predict the missing values for income.
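A minimal sketch with pandas of methods 4 and 5 above (overall attribute mean versus class-conditional mean); the customer data are hypothetical:

import pandas as pd

customers = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income":      [56000.0, None, 31000.0, None, 60000.0],
})
# Method 4: fill missing income with the overall attribute mean.
overall = customers["income"].fillna(customers["income"].mean())
# Method 5: fill missing income with the mean income of the tuple's own credit-risk class.
by_class = customers.groupby("credit_risk")["income"].transform(
    lambda s: s.fillna(s.mean()))
print(overall.tolist())
print(by_class.tolist())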
DATA PREPROCESSING- Data cleaning
Noisy Data:
“What is noise?” Noise is a random error or variance in a measured variable. Given a
numerical attribute such as, say, price, how can we “smooth” out the data to remove the
noise?
Let’s look at the following data smoothing techniques:
1. Binning: Binning methods smooth a sorted data value by consulting its "neighborhood,"
that is, the values around it. The sorted values are distributed into a number of bins, and
each value is then replaced by a summary of its bin, such as the bin mean (a short sketch
follows this list of techniques).
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means: each value in a bin is replaced by the bin mean, giving 9, 22, and
29 for the three bins above.
2. Regression: Data can be smoothed by fitting the data to a function, such as with
regression. Linear regression involves finding the “best” line to fit two attributes (or
variables), so that one attribute can be used to predict the other. Multiple linear regression
is an extension of linear regression, where more than two attributes are involved and the
data are fit to a multidimensional surface.
DATA PREPROCESSING- Data cleaning
3. Clustering: Outliers may be detected by clustering, where similar values are organized
into groups, or “clusters.” Intuitively, values that fall outside of the set of clusters may be
considered outliers
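The sketch promised under the binning technique: equal-frequency binning of the price list with smoothing by bin means, in plain Python:

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # already sorted
bin_size = 3
smoothed = []
for start in range(0, len(prices), bin_size):
    bin_values = prices[start:start + bin_size]
    mean = sum(bin_values) / len(bin_values)
    # Smoothing by bin means: every value in the bin is replaced by the bin mean.
    smoothed.extend([round(mean)] * len(bin_values))
print(smoothed)   # [9, 9, 9, 22, 22, 22, 29, 29, 29]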
DATA PREPROCESSING- Data cleaning
The data should also be examined regarding unique rules, consecutive rules, and null
rules.
A unique rule says that each value of the given attribute must be different from all other
values for that attribute.
A consecutive rule says that there can be no missing values between the lowest and
highest values for the attribute, and that all values must also be unique (e.g., as in check
numbers).
A null rule specifies the use of blanks, question marks, special characters, or other strings
that may indicate the null condition (e.g., where a value for a given attribute is not
available), and how such values should be handled
DATA PREPROCESSING- Data Integration and Transformation
Data Integration
It is likely that your data analysis task will involve data integration, which combines data
from multiple sources into a coherent data store, as in data warehousing. These sources
may include multiple databases, data cubes, or flat files.
DATA PREPROCESSING- Data Integration and Transformation
For example, how can the data analyst or the computer be sure that customer id in one
database and cust number in another refer to the same attribute?
metadata for each attribute include the name, meaning, data type, and range of values
permitted for the attribute, and null rules for handling blank, zero, or null values. Such
metadata can be used to help avoid errors in schema integration.
The metadata may also be used to help transform the data (e.g., where data codes for pay
type in one database may be “H” and “S”, and 1 and 2 in another).
DATA PREPROCESSING- Data Integration and Transformation
Some redundancies can be detected by correlation analysis. Given two attributes, such
analysis can measure how strongly one attribute implies the other, based on the available
data. For numerical attributes, we can evaluate the correlation between two attributes, A
and B, by computing the correlation coefficient
DATA PREPROCESSING- Data Integration and Transformation
The correlation coefficient of A and B (also known as Pearson's product-moment
coefficient) is

r_{A,B} = \frac{\sum_{i=1}^{N} (a_i - \bar{A})(b_i - \bar{B})}{N \sigma_A \sigma_B}

where N is the number of tuples, \bar{A} and \bar{B} are the mean values of A and B, and
\sigma_A and \sigma_B are their standard deviations.
If the result is greater than 0, then A and B are positively correlated: the values of A
increase as the values of B increase. The higher the value, the stronger the correlation,
and a high value may indicate that one of the attributes is redundant and can be removed.
If the result equals 0, then A and B are independent and there is no correlation between
them.
If the result is less than 0, then A and B are negatively correlated: the values of one
attribute increase as the values of the other decrease, meaning each attribute discourages
the other.
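A minimal sketch computing this coefficient with numpy (assumed to be available); the attribute values are made up:

import numpy as np

a = np.array([12.0, 15.0, 20.0, 25.0, 30.0])   # attribute A
b = np.array([22.0, 27.0, 41.0, 48.0, 62.0])   # attribute B
n = len(a)
r = ((a - a.mean()) * (b - b.mean())).sum() / (n * a.std() * b.std())
print(r)                         # close to +1: strongly positively correlated
print(np.corrcoef(a, b)[0, 1])   # numpy's built-in Pearson correlation agrees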
DATA PREPROCESSING- Data Integration and Transformation
For categorical (discrete) data, a correlation relationship between two attributes, A
and B, can be discovered by a χ2 (chi-square) test.
Suppose A has c distinct values, namely a1, a2, ..., ac, and B has r distinct values, namely
b1, b2, ..., br. The data tuples described by A and B can be shown as a contingency
table, with the c values of A making up the columns and the r values of B making up the
rows.
Let (Ai, Bj) denote the joint event that attribute A takes on value ai and attribute B takes
on value bj, that is, (A = ai, B = bj). Each possible (Ai, Bj) joint event has its own cell
(or slot) in the table. The χ2 value (also known as the Pearson χ2 statistic) is computed
as

\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}
DATA PREPROCESSING- Data Integration and Transformation
where o_{ij} is the observed frequency (i.e., actual count) of the joint event (Ai, Bj) and
e_{ij} is the expected frequency of (Ai, Bj), which can be computed as

e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{N}

where N is the number of data tuples, count(A = ai) is the number of tuples having value
ai for A, and count(B = bj) is the number of tuples having value bj for B.
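A minimal sketch computing χ2 for a hypothetical 2x2 contingency table with numpy; the observed counts are illustrative:

import numpy as np

# Rows are the r values of B, columns are the c values of A (observed counts o_ij).
observed = np.array([[250,  200],
                     [ 50, 1000]])
N = observed.sum()
# Expected frequency e_ij = count(A = a_i) * count(B = b_j) / N
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / N
chi2 = ((observed - expected) ** 2 / expected).sum()
print(round(chi2, 2))   # a large value suggests A and B are correlated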
DATA PREPROCESSING- Data Integration and Transformation
DATA TRANSFORMATION
In data transformation, the data are transformed or consolidated into forms appropriate
for mining.
Data transformation can involve the following:
• Smoothing, which works to remove noise from the data. Such techniques include
binning, regression, and clustering.
• Aggregation, where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and
annual total amounts. This step is typically used in constructing a data cube for
analysis of the data at multiple granularities.
• Generalization of the data, where low-level or “primitive” (raw) data are replaced by
higher-level concepts through the use of concept hierarchies. For example,
categorical attributes, like street, can be generalized to higher-level concepts,
DATA PREPROCESSING- Data Integration and Transformation
like city or country. Similarly, values for numerical attributes, like age, may be
mapped to higher-level concepts, like youth, middle-aged, and senior.
• Normalization, where the attribute data are scaled so as to fall within a small
specified range, such as -1.0 to 1.0, or 0.0 to 1.0.
Min-max normalization performs a linear transformation on the original data, mapping a
value v of attribute A to v' in a new range [new_min_A, new_max_A]:

v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A

Suppose that the minimum and maximum values for the attribute income are $12,000 and
$98,000, respectively, and we would like to map income to the range [0.0, 1.0]. By min-max
normalization, a value of $73,600 for income is transformed to
(73,600 - 12,000) / (98,000 - 12,000) x (1.0 - 0) + 0 = 0.716.
DATA PREPROCESSING- Data Integration and Transformation
In z-score normalization (or zero-mean normalization), the values for an attribute, A, are
normalized based on the mean and standard deviation of A. A value, v, of A is normalized
to v' by computing

v' = \frac{v - \bar{A}}{\sigma_A}

where \bar{A} and \sigma_A are the mean and standard deviation of A.
z-score normalization: Suppose that the mean and standard deviation of the values for the
attribute income are $54,000 and $16,000, respectively. With z-score normalization, a
value of $73,600 for income is transformed to (73,600 - 54,000) / 16,000 = 1.225.
DATA PREPROCESSING- Data Integration and Transformation
Normalization by decimal scaling normalizes by moving the decimal point of values of
attribute A. The number of decimal places moved depends on the maximum absolute value
of A. A value, v, of A is normalized to v' by computing

v' = \frac{v}{10^j}

where j is the smallest integer such that max(|v'|) < 1.
Decimal scaling: Suppose that the recorded values of A range from -986 to 917. The
maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide
each value by 1,000 (i.e., j = 3) so that -986 normalizes to -0.986 and 917 normalizes to
0.917.
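A minimal sketch of the three normalization methods just described, reproducing the worked figures above in plain Python:

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # Linear mapping of v from [min_a, max_a] onto [new_min, new_max].
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    # Distance of v from the mean, in units of standard deviation.
    return (v - mean_a) / std_a

def decimal_scaling(v, j):
    # Divide by 10^j, where j makes the largest absolute value fall below 1.
    return v / (10 ** j)

print(round(min_max(73600, 12000, 98000), 3))             # 0.716
print(round(z_score(73600, 54000, 16000), 3))             # 1.225
print(decimal_scaling(-986, 3), decimal_scaling(917, 3))  # -0.986 0.917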
DATA PREPROCESSING- Data Reduction
Data Reduction
Imagine that you have selected data from the AllElectronics data warehouse for
analysis. The data set will likely be huge! Complex data analysis and mining on huge
amounts of data can take a long time, making such analysis impractical or infeasible.
Mining on a reduced set of attributes has an additional benefit. It reduces the number
of attributes appearing in the discovered patterns, helping to make the patterns easier
to understand.
“How can we find a ‘good’ subset of the original attributes?” For n attributes, there are
2^n possible subsets.
DATA PREPROCESSING- Data Reduction
Basic heuristic methods of attribute subset selection include the following techniques (a
short sketch of stepwise forward selection follows this list):
1. Stepwise forward selection: The procedure starts with an empty set of attributes as
the reduced set. The best of the original attributes is determined and added to the
reduced set. At each subsequent iteration or step, the best of the remaining original
attributes is added to the set.
2. Stepwise backward elimination: The procedure starts with the full set of attributes.
At each step, it removes the worst attribute remaining in the set.
DATA PREPROCESSING- Data Reduction
3. Combination of forward selection and backward elimination: The stepwise forward
selection and backward elimination methods can be combined so that, at each step,
the procedure selects the best attribute and removes the worst from among the
remaining attributes.
4. Decision tree induction: Decision tree induction constructs a flowchart-like structure
in which each internal (nonleaf) node denotes a test on an attribute, each branch
corresponds to an outcome of the test, and each external (leaf) node denotes a class
prediction. At each node, the algorithm chooses the "best" attribute to partition the data
into individual classes.
When decision tree induction is used for attribute subset selection, a tree is
constructed from the given data. All attributes that do not appear in the tree are
assumed to be irrelevant. The set of attributes appearing in the tree form the reduced
subset of attributes.
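The sketch of stepwise forward selection promised above, in plain Python; the per-attribute weights stand in for a real evaluation measure (for example, a statistical significance test or a classifier's accuracy), so they are purely hypothetical:

def forward_selection(attributes, score, k):
    # Greedily grow the reduced set by adding the best remaining attribute at each step.
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

weights = {"age": 0.9, "income": 0.7, "street": 0.1, "phone": 0.05}
score = lambda subset: sum(weights[a] for a in subset)
print(forward_selection(weights, score, k=2))   # ['age', 'income']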
DATA PREPROCESSING- Data Reduction
Dimensionality Reduction
In dimensionality reduction, data encoding or transformations are applied so as to
obtain a reduced or “compressed” representation of the original data.
Lossless: If the original data can be reconstructed from the compressed data without any
loss of information, the data reduction is called lossless.
Lossy: If, instead, we can reconstruct only an approximation of the original data, then the
data reduction is called lossy.
One popular dimensionality reduction technique is the discrete wavelet transform (DWT), a
linear signal processing technique that transforms a data vector into a numerically
different vector of wavelet coefficients of the same length.
“How can this technique be useful for data reduction if the wavelet-transformed data are
of the same length as the original data?” The usefulness lies in the fact that the
wavelet-transformed data can be truncated.
DATA PREPROCESSING- Data Reduction
A compressed approximation of the data can be retained by storing only a small
fraction of the strongest of the wavelet coefficients.
For example, all wavelet coefficients larger than some user-specified threshold can be
retained. All other coefficients are set to 0.
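A minimal sketch of lossy reduction by truncating small wavelet coefficients, using a hand-rolled one-level Haar transform with numpy (so no wavelet library is assumed); the data values and the truncation threshold are illustrative:

import numpy as np

data = np.array([2.0, 2.5, 0.0, 2.0, 3.0, 3.4, 4.0, 4.0])
pairs = data.reshape(-1, 2)
# One-level orthonormal Haar transform: pairwise averages and differences.
approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)
detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)
# Lossy step: keep only the strongest detail coefficients, set the rest to 0.
detail[np.abs(detail) < 1.0] = 0.0
# Reconstruct an approximation of the original data from the truncated coefficients.
recon = np.empty_like(data)
recon[0::2] = (approx + detail) / np.sqrt(2)
recon[1::2] = (approx - detail) / np.sqrt(2)
print(recon)   # close to the original values, stored with fewer nonzero coefficients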
For parametric methods, a model is used to estimate the data, so that typically only
the data parameters need to be stored, instead of the actual data.
Log-linear models, which estimate discrete multidimensional probability
distributions, are an example.
A histogram for an attribute may use singleton buckets (one bucket per distinct value). To
further reduce the data, it is common to have each bucket denote a continuous range of
values for the given attribute.
DATA PREPROCESSING- Data Reduction
“How are the buckets determined and the attribute values partitioned?” There are
several partitioning rules, including the following (a short sketch of the first two rules
appears after this list):
Equal-width: In an equal-width histogram, the width of each bucket range is uniform
Equal-frequency (or equidepth): In an equal-frequency histogram, the buckets are
created so that, roughly, the frequency of each bucket is constant (that is, each bucket
contains roughly the same number of contiguous data samples).
V-Optimal: If we consider all of the possible histograms for a given number of
buckets, the V-Optimal histogram is the one with the least variance. Histogram
variance is a weighted sum of the original values that each bucket represents, where
bucket weight is equal to the number of values in the bucket.
MaxDiff: In a MaxDiff histogram, we consider the difference between each pair of
adjacent values. A bucket boundary is established between each pair for pairs having
the b-1 largest differences, where b is the user-specified number of buckets.
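A minimal sketch of equal-width and equal-frequency bucketing with numpy, reusing the sorted price list from the smoothing example:

import numpy as np

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
# Equal-width: every bucket covers a range of the same width.
counts, edges = np.histogram(prices, bins=3)
print(edges)    # [ 4. 14. 24. 34.]  (uniform bucket width of 10)
print(counts)   # [2 3 4]            (number of values per bucket)
# Equal-frequency (equidepth): every bucket holds roughly the same number of values.
buckets = np.array_split(np.sort(prices), 3)
print([(int(b.min()), int(b.max()), len(b)) for b in buckets])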
DATA PREPROCESSING- Data Reduction
Clustering
Clustering techniques consider data tuples as objects. They partition the objects into
groups or clusters, so that objects within a cluster are “similar” to one another and
“dissimilar” to objects in other clusters. Similarity is commonly defined in terms of
how “close” the objects are in space, based on a distance function.
The “quality” of a cluster may be represented by its diameter, the maximum distance
between any two objects in the cluster. Centroid distance is an alternative measure of
cluster quality and is defined as the average distance of each cluster object from the
cluster centroid.
In data reduction, the cluster representations of the data are used to replace the actual
data.
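A minimal sketch computing the two cluster-quality measures just mentioned (diameter and centroid distance) for one hypothetical cluster of two-dimensional points, with numpy:

import numpy as np

cluster = np.array([[1.0, 2.0], [2.0, 2.5], [1.5, 1.0], [2.5, 2.0]])
# Diameter: maximum distance between any two objects in the cluster.
diffs = cluster[:, None, :] - cluster[None, :, :]
diameter = np.sqrt((diffs ** 2).sum(axis=-1)).max()
# Centroid distance: average distance of each object from the cluster centroid.
centroid = cluster.mean(axis=0)
centroid_distance = np.sqrt(((cluster - centroid) ** 2).sum(axis=1)).mean()
print(round(diameter, 3), round(centroid_distance, 3))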
DATA PREPROCESSING- Data Reduction
Sampling
Sampling can be used as a data reduction technique because it allows a large data set
to be represented by a much smaller random sample (or subset) of the data. Suppose
that a large data set, D, contains N tuples. Let’s look at the most common ways that
we could sample D for data reduction,
DATA PREPROCESSING- Data Reduction
Simple random sample without replacement (SRSWOR) of size s: This is created
by drawing s of the N tuples from D (s < N), where the probability of drawing any
tuple in D is 1/N, that is, all tuples are equally likely to be sampled.
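A minimal sketch of SRSWOR with numpy's random generator; the 100 integers stand in for the N tuples of D:

import numpy as np

rng = np.random.default_rng(seed=0)
D = np.arange(1, 101)   # stand-in for N = 100 data tuples
s = 10
# SRSWOR: draw s of the N tuples, each equally likely, with no tuple drawn twice.
sample = rng.choice(D, size=s, replace=False)
print(sample)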
Histogram Analysis
Histograms partition the values for an attribute, A, into disjoint ranges called buckets.
DATA PREPROCESSING- Data Reduction
Entropy-Based Discretization
Entropy is one of the most commonly used discretization measures. Entropy-based
discretization is a supervised, top-down splitting technique. It explores class
distribution information in its calculation and determination of split-points.
To discretize a numerical attribute, A, the method selects the value of A that has the
minimum entropy as a split-point, and recursively partitions the resulting intervals to
arrive at a hierarchical discretization. Such discretization forms a concept hierarchy
for A.
Let D consist of data tuples defined by a set of attributes and a class-label attribute.
The class-label attribute provides the class information per tuple. The basic method
for
entropy-based discretization of an attribute A within the set is as follows:
DATA PREPROCESSING- Data Reduction
1. Each value of A can be considered as a potential interval boundary or split-point
(denoted split point) to partition the range of A. That is, a split-point for A can
partition the tuples in D into two subsets satisfying the conditions A <= split point and
A > split point, respectively, thereby creating a binary discretization.
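A minimal sketch of selecting the split-point of a numerical attribute A that minimizes the expected (class-weighted) entropy, as the method above describes; the attribute values and class labels are hypothetical:

import math

def entropy(labels):
    # Shannon entropy of a list of class labels.
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def best_split(values, labels):
    # Return the value of A giving the minimum expected entropy after a binary split.
    best = None
    for split in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= split]
        right = [l for v, l in zip(values, labels) if v > split]
        if not left or not right:
            continue
        exp_ent = (len(left) * entropy(left) + len(right) * entropy(right)) / len(values)
        if best is None or exp_ent < best[1]:
            best = (split, exp_ent)
    return best

A = [1, 2, 3, 8, 9, 10]
cls = ["low", "low", "low", "high", "high", "high"]
print(best_split(A, cls))   # (3, 0.0): splitting at A <= 3 separates the classes perfectly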