Unit-1 Notes

Data mining involves extracting knowledge from large datasets through a process known as Knowledge Discovery in Databases (KDD), which includes steps like data cleaning, integration, selection, transformation, mining, evaluation, and presentation. Various types of data can be mined, including database data, data warehouses, and transactional data, and mining functionalities can be categorized into descriptive and predictive tasks. Major issues in data mining include methodology, performance, and the need for effective data preprocessing techniques to handle incomplete, noisy, and inconsistent data.


Unit-1

Data mining refers to extracting or “mining” knowledge from large amounts of data.
KDD PROCESS (Knowledge Discovery Process):

KDD Process steps:


1) Data cleaning: the process of removing noise and inconsistent data.
2) Data integration: multiple data sources may be combined.
3) Data selection: data relevant to the analysis task are retrieved from the database.
4) Data transformation: data are transformed or consolidated into forms appropriate for mining, for example by performing summary or aggregation operations.
5) Data mining: an essential process where intelligent methods are applied in order to extract data patterns.
6) Pattern evaluation: identifies the truly interesting patterns representing knowledge, based on some interestingness measures.
7) Knowledge presentation: visualization and knowledge representation techniques are used to present the mined knowledge to the user.
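
The KDD steps above can be illustrated with a minimal sketch in Python using pandas; the tables and the column names (cust_id, region, amount) are invented for illustration and are not tied to any particular system.

import pandas as pd

# Raw data from two sources, with one missing (noisy) value
sales = pd.DataFrame({"cust_id": [1, 2, 2, 3, 3],
                      "amount":  [120.0, None, 80.0, 200.0, 50.0]})
customers = pd.DataFrame({"cust_id": [1, 2, 3],
                          "region":  ["east", "west", "east"]})

cleaned = sales.dropna(subset=["amount"])              # 1) data cleaning
data = cleaned.merge(customers, on="cust_id")          # 2) data integration
selected = data[["region", "amount"]]                  # 3) data selection
summary = selected.groupby("region")["amount"].sum()   # 4) data transformation

# 5)-7) Data mining, pattern evaluation, and knowledge presentation would
# follow, e.g. mining frequent patterns from `data` and presenting only the
# interesting ones to the user.
print(summary)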
Types of Data That Can Be Mined:
The most basic forms of data for mining applications are database data, data warehouse data, and transactional data.
1)Database Data :
• A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data.
• A relational database is a collection of tables, each of which is assigned a unique name.
Each table consists of a set of attributes (columns or fields) and usually stores a large
set of tuples (records or rows).
• Example: customer (cust ID, name, address, age, occupation, annual income, credit information, category, . . .)
2)Data Warehouses:
• A data warehouse is a repository of information collected from multiple sources, stored
under a unified schema.
• Data warehouses are constructed via a process of data cleaning, data integration, data
transformation, data loading, and periodic data refreshing.


• A data warehouse is usually modeled by a multidimensional data structure, called a data
cube, in which each dimension corresponds to an attribute or a set of attributes in the
schema, and each cell stores the value of some aggregate measure such as count or
sum(sales amount).

3) Transactional Data:
• In general, each record in a transactional database captures a transaction, such as a
customer’s purchase, a flight booking, or a user’s clicks on a web page.
• A transaction typically includes a unique transaction identity number (trans ID) and a
list of the items making up the transaction, such as the items purchased in the
transaction.


4) Other Kinds of Data:
• Time-related or sequence data: e.g., historical records, stock exchange data, and time-
series and biological sequence data
• Data streams :e.g., video surveillance and sensor data, which are continuously
transmitted
• Spatial data :e.g., maps
• Engineering design data :e.g., the design of buildings, system components, or
integrated circuits
• Hypertext and multimedia data :including text, image, video, and audio data
• Graph and Networked data :e.g., social and information networks
• Web :a huge, widely distributed information repository made available by the Internet

Data Mining Functionalities:


• Data mining functionalities are used to specify the kinds of patterns to be found in data
mining tasks.
• In general, such tasks can be classified into two categories: descriptive and predictive.
• Descriptive mining tasks characterize properties of the data in a target data set.
• Predictive mining tasks perform induction on the current data in order to make
predictions.
• There are a number of data mining functionalities. They are
1) Data Characterization and Data Discrimination:
• Data characterization is a summarization of the general characteristics or features of
a target class of data.
• Example: A customer relationship manager at AllElectronics may order the following
data mining task:
• Summarize the characteristics of customers who spend more than $5000 a year at
AllElectronics.
• The result is a general profile of these customers, such as that they are 40 to 50 years
old, employed, and have excellent credit ratings.
• Data discrimination is a comparison of the target class with one or a set of comparative classes (often called the contrasting classes).
Example:
• A customer relationship manager at AllElectronics may want to compare two groups of
customers
• those who shop for computer products regularly (e.g., more than twice a month) and
those who rarely shop for such products (e.g., less than three times a year).
• The resulting description provides a general comparative profile of these customers,
such as that 80% of the customers who frequently purchase computer products are
between 20 and 40 years old and have a university education,
• whereas 60% of the customers who infrequently buy such products are either seniors
or youths, and have no university degree.

2) Mining Frequent Patterns, Associations, and Correlations:


• Frequent patterns, as the name suggests, are patterns that occur frequently in data.
• There are many kinds of frequent patterns, including frequent itemsets, frequent
subsequences (also known as sequential patterns), and frequent substructures.
• A frequent itemset typically refers to a set of items that often appear together in a
transactional data set—for example, milk and bread, which are frequently bought
together in grocery stores by many customers.
• A frequent subsequence, such as the pattern that customers tend to purchase first a laptop, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern.
• A substructure can refer to different structural forms (e.g., graphs, trees, or lattices)
that may be combined with itemsets or subsequences.
• Mining frequent patterns leads to the discovery of interesting associations and
correlations within data.
• An example of such a rule, mined from the AllElectronics transactional database, is: buys(X, “computer”) ⇒ buys(X, “software”) [support = 40%, confidence = 50%]
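
As a rough illustration (not the actual AllElectronics data), the support and confidence of such a rule can be computed over a toy transaction list; the transactions below are made up, so the resulting percentages differ from those in the rule above.

# Toy illustration of support and confidence for the rule
# buys(X, "computer") => buys(X, "software")
transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"computer", "software", "printer"},
    {"printer"},
    {"computer", "software"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"computer", "software"} <= t)
antecedent = sum(1 for t in transactions if "computer" in t)

support = both / n              # fraction of all transactions containing both items
confidence = both / antecedent  # fraction of computer buyers who also buy software
print(f"support = {support:.0%}, confidence = {confidence:.0%}")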

3) Classification and Regression:


• Classification is the process of finding a model (or function) that describes and
distinguishes data classes or concepts.
• The model is derived based on the analysis of a set of training data (i.e., data objects for which the class labels are known).
• The model is used to predict the class label of objects for which the class label is
unknown.
• The derived model may be represented in various forms, such as classification rules
(i.e., IF-THEN rules), decision trees, mathematical formulae, or neural networks.
• Decision trees can easily be converted to classification rules.


• Whereas classification predicts categorical (discrete, unordered) labels, regression
models continuous-valued functions.
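
A minimal sketch of the contrast between the two tasks, assuming scikit-learn is available; the tiny training set (age and income with a credit label and a spending amount) is invented.

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

X = [[25, 30000], [40, 60000], [35, 52000], [50, 80000]]   # age, annual income
y_class = ["fair", "excellent", "excellent", "excellent"]   # categorical class labels
y_value = [1200.0, 5400.0, 4800.0, 7000.0]                  # continuous target values

clf = DecisionTreeClassifier().fit(X, y_class)   # classification: predicts discrete labels
reg = LinearRegression().fit(X, y_value)         # regression: models a continuous function

print(clf.predict([[30, 45000]]))   # predicted class label
print(reg.predict([[30, 45000]]))   # predicted numeric value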
4) Cluster Analysis :
• Clustering can be used to generate class labels for a group of data.
• The objects are clustered or grouped based on the principle of maximizing the
intraclass similarity and minimizing the interclass similarity.
• That is, clusters of objects are formed so that objects within a cluster have high
similarity in comparison to one another, but are rather dissimilar to objects in other
clusters. Each cluster so formed can be viewed as a class of objects, from which rules
can be derived.

5) Outlier Analysis
• A data set may contain objects that do not comply with the general behavior or model
of the data.
• These data objects are outliers. Many data mining methods discard outliers as noise or
exceptions.
• The analysis of outlier data is referred to as outlier analysis or anomaly mining.

Interestingness Patterns:

• A data mining system has the potential to generate thousands or even millions of
patterns, or rules.
• A pattern is interesting if it is
1) easily understood by humans,
2) valid on new or test data with some degree of certainty,
3) potentially useful, and
4) novel.
• A pattern is also interesting if it validates a hypothesis that the user sought to confirm.
An interesting pattern represents knowledge.
• Example (association rule): buys(X, “computer”) ⇒ buys(X, “software”) [support = 40%, confidence = 50%]

Classification of Data Mining Systems:

Data mining is an interdisciplinary field, the confluence of a set of disciplines, including database systems, statistics, machine learning, visualization, and information science.
Data mining systems can be categorized according to various criteria, as follows:
i) Classification according to the kinds of databases mined:
• Database systems can be classified according to their data models; accordingly, we may have a relational, transactional, object-relational, or data warehouse mining system.
• Each of these may require its own data mining technique.
ii) Classification according to the kinds of knowledge mined:
• Data mining systems can be categorized according to the kinds of knowledge they mine,
that is, based on data mining functionalities, such as characterization, discrimination,
association and correlation analysis, classification, prediction, clustering, outlier
analysis, and evolution analysis.
iii) Classification according to the kinds of techniques utilized:
• Data mining systems can be categorized according to the underlying data mining
techniques employed.
• These techniques can be described according to the degree of user interaction involved
(e.g., autonomous systems, interactive exploratory systems, query-driven systems)
IV) Classification according to the applications adapted:
• Data mining systems can also be categorized according to the applications they adapt.
• For example, data mining systems may be tailored specifically for finance,
telecommunications, DNA, stock markets, e-mail, and so on.

Data Mining Task Primitives:


• A data mining task can be specified in the form of a data mining query, which is input
to the data mining system.
• A data mining query is defined in terms of data mining task primitives.
• The data mining primitives specify the following:
i) The set of task-relevant data to be mined:
• This specifies the portions of the database or the set of data in which the user is
interested.
• This includes the database attributes or data warehouse dimensions of interest

ii)The kind of knowledge to be mined:


• This specifies the data mining functions to be performed, such as characterization,
discrimination, association or correlation analysis, classification, prediction, clustering,
outlier analysis, or evolution analysis.
iii) The background knowledge to be used in the discovery process:
• This knowledge about the domain to be mined is useful for guiding the knowledge
discovery process and for evaluating the patterns found.
• Concept hierarchies are a popular form of background knowledge
iv) The interestingness measures and thresholds for pattern evaluation:
• They may be used to guide the mining process or, after discovery to evaluate the
discovered patterns.
• Different kinds of knowledge may have different interestingness measures
v) The expected representation for visualizing the discovered patterns:
• This refers to the form in which discovered patterns are to be displayed, which may
include rules, tables, charts, graphs, decision trees, and cubes

Integration of Data mining system with a Data warehouse:


When a data mining (DM) system is integrated with a database (DB) or data warehouse (DW) system, possible integration schemes include no coupling, loose coupling, semitight coupling, and tight coupling.
1)No coupling:
• No coupling means that a DM system will not utilize any function of a DB or DW
system.
• It may fetch data from a particular source (such as a file system), process data using
some data mining algorithms, and then store the mining results in another file.
2) Loose coupling:
• Loose coupling means that a DM system will use some facilities of a DB or DW system.
• Fetching data from a data repository managed by these systems, performing data
mining, and then storing the mining results either in a file or in a designated place in a
database or data warehouse.
3)Semitight coupling:
• Semitight coupling means that besides linking a DM system to a DB/DW system,
efficient implementations of a few essential data mining primitives can be provided in
the DB/DW system.
• These primitives can include sorting, indexing, aggregation, histogram analysis,
multiway join, and precomputation of some essential statistical measures, such as sum,
count, max, min, standard deviation, and so on.
4)Tight coupling:
• Tight coupling means that a DM system is smoothly integrated into the DB/DW system.
• The data mining subsystem is treated as one functional component of an information
system.
• Data mining queries and functions are optimized based on mining query analysis, data
structures, indexing schemes, and query processing methods of a DB or DW system.

Major Issues in Data Mining:


1.Mining methodology and user interaction issues:
i) Mining different kinds of knowledge in databases:
• Because different users can be interested in different kinds of knowledge, data mining should cover a wide spectrum of data analysis and knowledge discovery tasks, including data characterization, discrimination, association and correlation analysis, classification, prediction, clustering, outlier analysis, and evolution analysis.
ii) Interactive mining of knowledge at multiple levels of abstraction:

• Because it is difficult to know exactly what can be discovered within a database, the
data mining process should be interactive.
iii) Incorporation of background knowledge:
• Background knowledge, or information regarding the domain under study, may be used
to guide the discovery process and allow discovered patterns to be expressed in concise
terms and at different levels of abstraction.
iv) Data mining query languages and ad hoc data mining:
• Relational query languages (such as SQL) allow users to pose ad hoc queries for data retrieval. Similarly, high-level data mining query languages need to be developed to allow users to describe ad hoc data mining tasks.
v) Presentation and visualization of data mining results:
• Discovered knowledge should be expressed in high-level languages, visual
representations, or other expressive forms so that the knowledge can be easily
understood and directly usable by humans.
vi) Handling noisy or incomplete data:
• The data stored in a database may reflect noise, exceptional cases, or incomplete data
objects.
vii) Pattern evaluation:
• The interestingness problem: A data mining system can uncover thousands of patterns.
• Many of the patterns discovered may be uninteresting to the given user

2. Performance issues:
i) Efficiency and scalability of data mining algorithms:
• To effectively extract information from a huge amount of data in databases, data mining
algorithms must be efficient and scalable.
ii) Parallel, distributed, and incremental mining algorithms:
• The huge size of many databases, the wide distribution of data, and the computational
complexity of some data mining methods are factors motivating the development of
parallel and distributed data mining algorithms.
3.Issues relating to the diversity of database types:
i) Handling of relational and complex types of data:
• Because relational databases and data warehouses are widely used, the development of
efficient and effective data mining systems for such data is important.
ii) Mining information from heterogeneous databases and global information
systems:
• Local- and wide-area computer networks (such as the Internet) connect many sources
of data, forming huge, distributed, and heterogeneous databases.

Data Preprocessing:
• Data preprocessing is a data mining technique that is used to transform raw data into a useful and efficient format.

Need for Data Preprocessing:


Data in the real world is
-incomplete: lacking attribute values or certain attributes of interest, or containing only
aggregate data.
- noisy: containing errors or outlier values that deviate from the expected
-inconsistent: lack of compatibility or similarity between two or more facts
Data Preprocessing Techniques:
• Data Cleaning:
-Data cleaning is a process to fill in missing values, smooth noisy data while identifying outliers, and correct inconsistencies in the data.
• Data Integration:
-The merging of data from multiple data stores.
• Data Transformation:
-The data are transformed or consolidated into forms appropriate for mining.
• Data Reduction:
-Data Reduction techniques can be applied to obtain a reduced representation in
volume but produces the same or similar analytical results.

Data Cleaning:
a) Missing values:
• Ignore the tuple
• Filling in the missing values manually
• use a global constant to fill in the missing value
• use the attribute mean to fill in the missing value
• use the attribute mean for all samples belonging to the same class as the given tuple
• Use the most probable value to fill in the missing value
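
A small sketch of several of these strategies using pandas; the DataFrame and its income/category attributes are hypothetical.

import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "B", "B"],
    "income":   [30000.0, None, 52000.0, None],
})

dropped = df.dropna(subset=["income"])                  # ignore the tuple
global_fill = df["income"].fillna(0)                    # fill with a global constant
mean_fill = df["income"].fillna(df["income"].mean())    # fill with the attribute mean

# Fill with the attribute mean of samples belonging to the same class (category)
class_fill = df.groupby("category")["income"].transform(lambda s: s.fillna(s.mean()))
print(class_fill)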
b) Noisy data:
• Noise is a random error or variance in a measured variable. In order to remove the noise, data smoothing techniques are used.
• Data Smoothing Techniques:
i) Binning: Binning methods smooth a sorted data value by consulting its neighborhood, that is, the values around it.
ii) Regression: Data smoothing can also be done by regression, a technique that conforms data values to a function. Linear regression involves finding the “best” line to fit two attributes (or variables) so that one attribute can be used to predict the other.
iii) Outlier analysis: Outliers may be detected by clustering, for example, where similar values are organized into groups, or “clusters.” Intuitively, values that fall outside of the set of clusters may be considered outliers.
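
A sketch of smoothing by bin means over equal-frequency bins, using the sorted price list 4, 8, 15, 21, 21, 24, 25, 28, 34 as sample data; each value is replaced by the mean of its bin.

import numpy as np

prices = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34]))
bins = np.array_split(prices, 3)            # three equal-frequency bins of three values each

# Smoothing by bin means: replace every value by the mean of its bin
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)   # [9. 9. 9. 22. 22. 22. 29. 29. 29.]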

Data Integration:
• Data integration issues:
i) Entity Identification Problem:
• How can equivalent real-world entities from multiple data sources be matched up? This is referred to as the entity identification problem.
• For example, how can the data analyst or the computer be sure that customer id in one database and cust number in another refer to the same attribute?
• Examples of metadata for each attribute include the name, meaning, data type, and range of values permitted for the attribute, and null rules for handling blank, zero, or null values.
• Such metadata can be used to help avoid errors in schema integration.
ii) Redundancy and Correlation Analysis:
• An attribute may be redundant if it can be “derived” from another attribute or set of attributes.
• Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.
• Some redundancies can be detected by correlation analysis. Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data.
• For nominal data, we use the χ² (chi-square) test. For numeric attributes, we can use the correlation coefficient and covariance (see the sketch at the end of this section).
iii)Tuple Duplication:
• In addition to detecting redundancies between attributes, duplication should also be detected at the tuple level.
• The use of denormalized tables is another source of data redundancy.
• Inconsistencies often arise between various duplicates, due to inaccurate data entry or updating some but not all data occurrences.

iv) Data Value Conflict Detection and Resolution:


• Data integration also involves the detection and resolution of data value conflicts.
• For example, for the same real-world entity, attribute values from different sources may differ.
• For a hotel chain, the price of rooms in different cities may involve not only different currencies but also different services (e.g., free breakfast) and taxes.
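
The redundancy checks mentioned in (ii) can be sketched as follows, assuming SciPy and NumPy are available; the contingency table and the numeric attribute values are invented.

import numpy as np
from scipy.stats import chi2_contingency

# Nominal attributes: observed counts for two attributes (a 2 x 2 contingency table)
observed = np.array([[250, 200],
                     [50, 1000]])
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.1f}, p = {p_value:.3g}")   # a small p suggests the attributes are correlated

# Numeric attributes: Pearson correlation coefficient
a = np.array([6.0, 5.0, 4.0, 3.0, 2.0])
b = np.array([20.0, 10.0, 14.0, 5.0, 5.0])
print(np.corrcoef(a, b)[0, 1])   # values near +1 or -1 suggest one attribute may be redundant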

Data Transformation:
In this preprocessing step, the data are transformed or consolidated so that the resulting
mining process may be more efficient, and the patterns found may be easier to
understand.
• Data Transformation Strategies:
i) Smoothing: which works to remove noise from the data. Techniques include
binning, regression, and clustering.
ii) Attribute construction : where new attributes are constructed and added from
the given set of attributes to help the mining process.
iii) Aggregation: where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual
total amounts.
iv) Normalization : where the attribute data are scaled so as to fall within a smaller
range, such as −1.0 to 1.0, or 0.0 to 1.0.
There are many methods for data normalization. Some are:
a) Min-max normalization:
Suppose that min_A and max_A are the minimum and maximum values of an attribute, A.
Min-max normalization maps a value, v_i, of A to v'_i in the range [new_min_A, new_max_A] by computing
v'_i = ((v_i − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
b) z-score normalization (or zero-mean normalization):
The values of attribute A are normalized based on the mean, Ā, and standard deviation, σ_A, of A:
v'_i = (v_i − Ā) / σ_A
c) Normalization by decimal scaling: normalizes by moving the decimal point of values of attribute A. The number of decimal places moved depends on the maximum absolute value of A. A value, v_i, of A is normalized to v'_i by computing
v'_i = v_i / 10^j
where j is the smallest integer such that max(|v'_i|) < 1.
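
A sketch of the three normalization methods applied to one attribute with NumPy; the sample values are made up.

import numpy as np

v = np.array([-986.0, -350.0, 200.0, 400.0, 917.0])

# a) Min-max normalization to the new range [0.0, 1.0]
new_min, new_max = 0.0, 1.0
minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# b) z-score (zero-mean) normalization
zscore = (v - v.mean()) / v.std()

# c) Decimal scaling: divide by 10^j for the smallest j with max(|v'|) < 1
j = 0
while np.abs(v / (10 ** j)).max() >= 1:
    j += 1
decimal = v / (10 ** j)

print(minmax, zscore, decimal, sep="\n")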

Data Reduction:
Data reduction techniques can be applied to obtain a reduced representation of the data that is much smaller in volume, yet produces the same (or almost the same) analytical results as the original data.
Data reduction techniques include:
1. Data cube aggregation: where aggregation operations are applied to the data in the construction of a data cube.
2. Attribute subset selection: where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.

Attribute subset selection includes the following techniques:

1.Stepwise forward selection: The procedure starts with an empty set of attributes as
the reduced set. The best of the original attributes is determined and added to the
reduced set. At each subsequent iteration or step, the best of the remaining original
attributes is added to the set.

2. Stepwise backward elimination: The procedure starts with the full set of attributes.
At each step, it removes the worst attribute remaining in the set.
3.Combination of forward selection and backward elimination: The stepwise
forward selection and backward elimination methods can be combined so that, at each
step, the procedure selects the best attribute and removes the worst from among the
remaining attributes.
4.Decision tree induction: Decision tree algorithms, such as ID3, C4.5, and CART,
were originally intended for classification. Decision tree induction constructs a flowchart-like structure where each internal (nonleaf) node denotes a test on an attribute,
each branch corresponds to an outcome of the test, and each external (leaf) node denotes
a class prediction. At each node, the algorithm chooses the “best” attribute to partition
the data into individual classes
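
A greedy sketch of stepwise forward selection (technique 1 above), assuming scikit-learn is available; the Iris data set and the decision-tree scorer are placeholders for any data set and any attribute-quality measure.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
remaining = list(range(X.shape[1]))
selected = []
best_score = 0.0

while remaining:
    # Score every candidate attribute added to the current reduced set
    scores = {f: cross_val_score(DecisionTreeClassifier(random_state=0),
                                 X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    best = max(scores, key=scores.get)
    if scores[best] <= best_score:      # stop when no remaining attribute helps
        break
    best_score = scores[best]
    selected.append(best)               # add the best remaining attribute
    remaining.remove(best)

print("selected attribute indices:", selected)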
Dimensionality Reduction
• In dimensionality reduction, data encoding or transformations are applied so as to
obtain a reduced or “compressed” representation of the original data.
• If the original data can be reconstructed from the compressed data without any loss of information, the data reduction is called lossless; if only an approximation of the original data can be reconstructed, it is called lossy.
• There are two popular and effective methods of lossy dimensionality reduction: wavelet transforms and principal components analysis.

i)Wavelet Transforms
• The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector, X′, of wavelet coefficients.
• The two vectors are of the same length. When applying this technique to data reduction,
we consider each tuple as an n-dimensional data vector, that is, X = (x1,x2,...,xn),
depicting n measurements made on the tuple from n database attributes.
• A compressed approximation of the data can be retained by storing only a small fraction
of the strongest of the wavelet coefficients.
• The technique also works to remove noise without smoothing out the main features of
the data, making it effective for data cleaning as well
• There are several families of DWTs. Popular wavelet transforms include the Haar-2,
Daubechies-4, and Daubechies-6 transforms.
• The method is as follows:
1.The length, L, of the input data vector must be an integer power of 2. This condition
can be met by padding the data vector with zeros as necessary (L ≥ n).
2.Each transform involves applying two functions. The first applies some data
smoothing, such as a sum or weighted average. The second performs a weighted
difference, which acts to bring out the detailed features of the data.
3. The two functions are applied to pairs of data points in X, that is, to all pairs of
measurements (x2i ,x2i+1). This results in two sets of data of length L/2. In general,
these represent a smoothed or low-frequency version of the input data and the high
frequency content of it, respectively.
4. The two functions are recursively applied to the sets of data obtained in the previous
loop, until the resulting data sets obtained are of length 2.

5. Selected values from the data sets obtained in the above iterations are designated the
wavelet coefficients of the transformed data
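
A sketch of the recursive Haar transform described by these steps, assuming NumPy; pairwise weighted averages give the smoothed (low-frequency) part and pairwise weighted differences give the detail coefficients. This version recurses down to length 1 rather than stopping at length 2.

import numpy as np

def haar_dwt(x):
    x = np.asarray(x, dtype=float)
    L = 1 << int(np.ceil(np.log2(len(x))))      # next power of 2
    x = np.pad(x, (0, L - len(x)))              # step 1: pad with zeros
    coeffs = []
    while len(x) > 1:
        smooth = (x[0::2] + x[1::2]) / np.sqrt(2)   # steps 2-3: smoothing of pairs
        detail = (x[0::2] - x[1::2]) / np.sqrt(2)   # weighted differences (details)
        coeffs.append(detail)
        x = smooth                                   # step 4: recurse on the smoothed half
    coeffs.append(x)                                 # overall average
    return np.concatenate(coeffs[::-1])              # step 5: the wavelet coefficients

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))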

ii) Principal Components Analysis


• Principal components analysis, or PCA , searches for k n-dimensional orthogonal
vectors that can best be used to represent the data, where k ≤ n.
• The original data are thus projected onto a much smaller space, resulting in
dimensionality reduction.
The basic procedure for PCA is as follows:
• The input data are normalized, so that each attribute falls within the same range. This
step helps ensure that attributes with large domains will not dominate attributes with
smaller domains
• PCA computes k orthonormal vectors that provide a basis for the normalized input data.
These are unit vectors that each point in a direction perpendicular to the others. These
vectors are referred to as the principal components. The input data are a linear
combination of the principal components.
• The principal components are sorted in order of decreasing “significance” or strength.
The principal components essentially serve as a new set of axes for the data, providing
important information about variance. That is, the sorted axes are such that the first axis
shows the most variance among the data, the second axis shows the next highest
variance, and so on.
• Because the components are sorted according to decreasing order of “significance,” the
size of the data can be reduced by eliminating the weaker components, that is, those
with low variance. Using the strongest principal components, it should be possible to
reconstruct a good approximation of the original data.
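
A minimal PCA sketch following this procedure with NumPy; the data matrix is random placeholder data.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # 100 tuples, 5 attributes

# 1) Normalize so that no attribute dominates
Xn = (X - X.mean(axis=0)) / X.std(axis=0)

# 2) Principal components = eigenvectors of the covariance matrix,
#    sorted by decreasing variance (eigenvalue)
cov = np.cov(Xn, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order]

# 3) Keep only the k strongest components and project the data onto them
k = 2
reduced = Xn @ components[:, :k]
print(reduced.shape)                          # (100, 2)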
Numerosity Reduction
The numerosity reduction techniques are
i)Regression and Log-Linear Models
• Regression and log-linear models can be used to approximate the given data.
• In linear regression, the data are modeled to fit a straight line, with the equation y = wx + b.
• In the context of data mining, x and y are numerical database attributes. The coefficients, w and b (called regression coefficients), specify the slope of the line and the y-intercept, respectively.
• Log-linear models approximate discrete multidimensional probability distributions.
• Given a set of tuples in n dimensions (e.g., described by n attributes), we can consider
each tuple as a point in an n-dimensional space.
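
A sketch of approximating a numeric attribute pair by just the two regression coefficients, using NumPy's least-squares fit; the x and y values are made up.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

w, b = np.polyfit(x, y, deg=1)     # slope w and y-intercept b of the best-fit line
print(f"y ~ {w:.2f}x + {b:.2f}")   # the data can now be approximated by (w, b) alone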
ii)Histograms
• A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets,
or buckets.
• There are several partitioning rules, including the following:
• Equal-width: In an equal-width histogram, the width of each bucket range is uniform (e.g., each bucket covers a width of $10).
• Equal-frequency (or equidepth): In an equal-frequency histogram, the buckets are
created so that, roughly, the frequency of each bucket is constant.
• V-Optimal: If we consider all of the possible histograms for a given number of buckets,
the V-Optimal histogram is the one with the least variance. Histogram variance is a
weighted sum of the original values that each bucket represents, where bucket weight
is equal to the number of values in the bucket.
• MaxDiff: In a MaxDiff histogram, we consider the difference between each pair of
adjacent values. A bucket boundary is established between each pair for pairs having
the β−1 largest differences, where β is the user-specified number of buckets.
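
A sketch of equal-width versus equal-frequency bucket boundaries with NumPy; the price list is made up.

import numpy as np

prices = np.array([1, 1, 5, 5, 5, 8, 8, 10, 10, 12, 14, 14, 15, 15, 15,
                   18, 18, 18, 20, 20, 21, 21, 25, 25, 28, 28, 30, 30])

# Equal-width: every bucket range has the same width
counts, edges = np.histogram(prices, bins=3)
print("equal-width edges:", edges, "counts:", counts)

# Equal-frequency (equidepth): every bucket holds roughly the same number of values
quantile_edges = np.quantile(prices, [0.0, 1/3, 2/3, 1.0])
print("equal-frequency edges:", quantile_edges)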
iii)Clustering
• Clustering techniques consider data tuples as objects.
• They partition the objects into groups or clusters, so that objects within a cluster are
“similar” to one another and “dissimilar” to objects in other clusters.
• Similarity is commonly defined in terms of how “close” the objects are in space, based
on a distance function.
• The “quality” of a cluster may be represented by its diameter, the maximum distance
between any two objects in the cluster.
• Centroid distance is an alternative measure of cluster quality and is defined as the
average distance of each cluster object from the cluster centroid

iv)Sampling
• Sampling can be used as a data reduction technique because it allows a large data set to
be represented by a much smaller random sample of the data.
• Suppose that a large data set, D, contains N tuples. Let’s look at the most common ways
that we could sample D for data reduction
• Simple random sample without replacement (SRSWOR) of size s: This is created
by drawing s of the N tuples from D (s < N), where the probability of drawing any tuple
in D is 1/N, that is, all tuples are equally likely to be sampled.
• Simple random sample with replacement (SRSWR) of size s: This is similar to
SRSWOR, except that each time a tuple is drawn from D, it is recorded and then
replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn
again

• Cluster sample: If the tuples in D are grouped into M mutually disjoint “clusters,” then
an SRS of s clusters can be obtained, where s < M. For example, tuples in a database
are usually retrieved a page at a time, so that each page can be considered a cluster. A
reduced data representation can be obtained by applying, say, SRSWOR to the pages,
resulting in a cluster sample of the tuples

• Stratified sample: If D is divided into mutually disjoint parts called strata, a stratified
sample of D is generated by obtaining an SRS at each stratum.
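
A sketch of the four sampling schemes on a toy tuple list D; the cluster ("page") and stratum assignments are invented.

import random
random.seed(0)

D = list(range(100))                                  # N = 100 tuples
s = 10

srswor = random.sample(D, s)                          # SRSWOR: without replacement
srswr = [random.choice(D) for _ in range(s)]          # SRSWR: with replacement

# Cluster sample: group tuples into "pages" of 20, then take an SRS of 2 clusters
clusters = [D[i:i + 20] for i in range(0, 100, 20)]
cluster_sample = [t for c in random.sample(clusters, 2) for t in c]

# Stratified sample: split D into strata (here by parity) and take an SRS from each
strata = {"even": [t for t in D if t % 2 == 0],
          "odd":  [t for t in D if t % 2 == 1]}
stratified = [t for part in strata.values() for t in random.sample(part, 5)]

print(len(srswor), len(srswr), len(cluster_sample), len(stratified))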
Data Discretization and Concept Hierarchy Generation
• Data discretization: Data discretization techniques can be used to reduce the number
of values for a given continuous attribute by dividing the range of the attribute into
intervals.
• Interval labels can then be used to replace actual data values.
• If the discretization process uses class information, then we say it is supervised discretization; otherwise, it is unsupervised.
• If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range, and then repeats this recursively on the resulting intervals, it is called top-down discretization or splitting; otherwise, it is bottom-up discretization or merging.
• Concept hierarchy :A concept hierarchy for a given numerical attribute defines a
discretization of the attribute.
• Concept hierarchies can be used to reduce the data by collecting and replacing low-
level concepts (such as numerical values for the attribute age) with higher-level
concepts (such as youth, middle-aged, or senior).
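
A sketch of discretizing a numeric age attribute into interval labels and then into higher-level concepts with pandas; the bin edges and concept labels are illustrative.

import pandas as pd

ages = pd.Series([13, 22, 25, 33, 45, 52, 61, 70])

# Unsupervised discretization into three equal-width intervals (interval labels)
intervals = pd.cut(ages, bins=3)

# Concept hierarchy: replace low-level values with higher-level concepts
concepts = pd.cut(ages, bins=[0, 20, 39, 59, 120],
                  labels=["youth", "young adult", "middle-aged", "senior"])
print(pd.DataFrame({"age": ages, "interval": intervals, "concept": concepts}))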
Discretization and Concept Hierarchy Generation for Numerical Data
The following are methods for Data Discretization and Concept Hierarchy Generation:
• i) Binning: Binning is a top-down splitting technique based on a specified number of bins.
• ii) Histogram Analysis: Histogram analysis is an unsupervised discretization technique because it does not use class information.
• iii) Cluster Analysis: Cluster analysis is a popular data discretization method.
• iv) Discretization by Intuitive Partitioning:
• Numerical ranges partitioned into relatively uniform, easy-to-read intervals that appear
intuitive or “natural.”
• For example, annual salaries broken into ranges like ($50,000, $60,000] are often more
desirable than ranges like ($51,263.98, $60,872.34]
• The 3-4-5 rule can be used to segment numerical data into relatively uniform, natural-seeming intervals.
• In general, the rule partitions a given range of data into 3, 4, or 5 relatively equal-width
intervals, recursively and level by level, based on the value range at the most significant
digit.
• The rule is as follows:
• If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, then
partition the range into 3 intervals
• If it covers 2, 4, or 8 distinct values at the most significant digit, then partition the range
into 4 equal-width intervals.
• If it covers 1, 5, or 10 distinct values at the most significant digit, then partition the
range into 5 equal-width intervals.
• Concept Hierarchy Generation for Nominal Data (Categorical Data):
• Categorical data are discrete data.
• Categorical attributes have a finite (but possibly large) number of distinct values, with
no ordering among the values.
• Examples include geographic location, job category, and item type. There are several
methods for the generation of concept hierarchies for categorical data.
i) Specification of a partial ordering of attributes explicitly at the schema level by
users or experts:
• Concept hierarchies for nominal attributes or dimensions typically involve a group of
attributes.
• A user or expert can easily define a concept hierarchy by specifying a partial or total
ordering of the attributes at the schema level.
• For example, suppose that a relational database contains the following group of
attributes: street, city, province or state, and country
• A hierarchy can be defined by specifying the total ordering among these attributes at
the schema level such as street < city < province or state < country.
ii) Specification of a portion of a hierarchy by explicit data grouping:
• This is essentially the manual definition of a portion of a concept hierarchy.
• In a large database, it is unrealistic to define an entire concept hierarchy by explicit
value enumeration.
• On the contrary, we can easily specify explicit groupings for a small portion of
intermediate-level data.
• For example, after specifying that province and country form a hierarchy at the schema level, a user could define some intermediate levels manually, such as “{Alberta, Saskatchewan, Manitoba} ⊂ prairies Canada” and “{British Columbia, prairies Canada} ⊂ Western Canada.”
iii) Specification of a set of attributes, but not of their partial ordering:
• A user may specify a set of attributes forming a concept hierarchy, but omit to explicitly
state their partial ordering.
• The system can then try to automatically generate the attribute ordering so as to
construct a meaningful concept hierarchy.
iv)Specification of only a partial set of attributes:
• The user may have included only a small subset of the relevant attributes in the
hierarchy specification.
• For example, instead of including all of the hierarchically relevant attributes for
location, the user may have specified only street and city
