DM Concepts

Chapter 5 of 'Data Mining: Concepts and Techniques' focuses on concept description, including characterization and comparison of data. It distinguishes between descriptive and predictive data mining, discusses data generalization, and introduces analytical characterization and attribute relevance analysis. The chapter also covers methods for mining class comparisons and presents examples of data aggregation and visualization techniques.

Data Mining: Concepts and Techniques
— Slides for Textbook —
— Chapter 5 —

©Jiawei Han and Micheline Kamber

Intelligent Database Systems Research Lab
School of Computing Science
Simon Fraser University, Canada
http://www.cs.sfu.ca
April 7, 2025 Data Mining: Concepts and Techniq 1
Chapter 5: Concept Description: Characterization and Comparison
- What is concept description?
- Data generalization and summarization-based characterization
- Analytical characterization: Analysis of attribute relevance
- Mining class comparisons: Discriminating between different classes
- Mining descriptive statistical measures in large databases
- Discussion
- Summary

What is Concept Description?
- Descriptive vs. predictive data mining
  - Descriptive mining: describes concepts or task-relevant data sets in concise, summarized, informative, and discriminative forms
  - Predictive mining: based on data and analysis, constructs models for the database and predicts the trends and properties of unknown data
- Concept description:
  - Characterization: provides a concise and succinct summarization of the given collection of data
  - Comparison: provides descriptions comparing two or more collections of data

Concept Description vs. OLAP
- Concept description:
  - can handle complex data types of the attributes and their aggregations
  - a more automated process
- OLAP:
  - restricted to a small number of dimension and measure types
  - user-controlled process

Data Generalization and Summarization-based Characterization
- Data generalization
  - A process that abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones.
  - [Figure: conceptual levels 1 to 5]
- Approaches:
  - Data cube approach (OLAP approach)
  - Attribute-oriented induction approach

Characterization: Data Cube Approach (without using AO-Induction)
- Perform computations and store results in data cubes
- Strengths
  - An efficient implementation of data generalization
  - Computation of various kinds of measures, e.g., count(), sum(), average(), max()
  - Generalization and specialization can be performed on a data cube by roll-up and drill-down
- Limitations
  - Handles only dimensions of simple nonnumeric data and measures of simple aggregated numeric values
  - Lacks intelligent analysis: it cannot tell which dimensions should be used or what level the generalization should reach

Attribute-Oriented Induction
- Proposed in 1989 (KDD '89 workshop)
- Not confined to categorical data or to particular measures
- How is it done?
  - Collect the task-relevant data (the initial relation) using a relational database query
  - Perform generalization by attribute removal or attribute generalization
  - Apply aggregation by merging identical generalized tuples and accumulating their respective counts
  - Present the results interactively to users

Basic Principles of Attribute-Oriented Induction
- Data focusing: select the task-relevant data, including dimensions; the result is the initial relation.
- Attribute removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A's higher-level concepts are expressed in terms of other attributes.
- Attribute generalization: if there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A.
- Attribute-threshold control: limits the number of distinct values per attribute; typically 2-8, user-specified or default.
- Generalized relation threshold control: controls the size of the final relation/rule.

Basic Algorithm for Attribute-Oriented Induction
- InitialRel: process the query for task-relevant data to derive the initial relation.
- PreGen: based on the number of distinct values in each attribute, determine a generalization plan for each attribute: remove it, or how high to generalize it.
- PrimeGen: based on the PreGen plan, generalize to the chosen level to derive a "prime generalized relation", accumulating the counts.
- Presentation: user interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, or visualization presentations.
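The generalize-then-merge core of the algorithm can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the concept hierarchy `MAJOR_TO_FIELD`, the GPA bands in `generalize_gpa`, and the three sample rows are hypothetical stand-ins, and the generalization plan is hard-coded rather than derived from thresholds.

```python
from collections import Counter

# Hypothetical concept hierarchy (an assumption for illustration only).
MAJOR_TO_FIELD = {"CS": "Science", "Physics": "Science", "EE": "Engineering"}


def generalize_gpa(gpa):
    """Map a numeric GPA to a coarse band; the cut-offs are made up."""
    if gpa >= 3.75:
        return "Excellent"
    if gpa >= 3.5:
        return "Very_good"
    return "Good"


def attribute_oriented_induction(rows):
    """Return the generalized relation as {generalized_tuple: count}."""
    counts = Counter()
    for row in rows:
        generalized = (
            row["gender"],                 # few distinct values: retained
            MAJOR_TO_FIELD[row["major"]],  # attribute generalization
            generalize_gpa(row["gpa"]),    # attribute generalization
        )                                  # "name" is left out: attribute removal
        counts[generalized] += 1           # merge identical generalized tuples
    return counts


initial_relation = [
    {"name": "Jim Woodman", "gender": "M", "major": "CS", "gpa": 3.67},
    {"name": "Scott Lachance", "gender": "M", "major": "CS", "gpa": 3.70},
    {"name": "Laura Lee", "gender": "F", "major": "Physics", "gpa": 3.83},
]
prime_relation = attribute_oriented_induction(initial_relation)
for generalized_tuple, count in sorted(prime_relation.items()):
    print(generalized_tuple, count)
```

A fuller version would choose between removal and generalization per attribute by comparing the number of distinct values against the attribute threshold.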

Example
- DMQL: Describe general characteristics of graduate students in the Big-University database

    use Big_University_DB
    mine characteristics as "Science_Students"
    in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
    from student
    where status in "graduate"

- Corresponding SQL statement:

    Select name, gender, major, birth_place, birth_date, residence, phone#, gpa
    from student
    where status in {"Msc", "MBA", "PhD"}

Class Characterization: An Example

Initial relation:

Name            Gender  Major    Birth-Place             Birth_date  Residence                 Phone #   GPA
Jim Woodman     M       CS       Vancouver, BC, Canada   8-12-76     3511 Main St., Richmond   687-4598  3.67
Scott Lachance  M       CS       Montreal, Que, Canada   28-7-75     345 1st Ave., Richmond    253-9106  3.70
Laura Lee       F       Physics  Seattle, WA, USA        25-8-70     125 Austin Ave., Burnaby  420-5232  3.83
…               …       …        …                       …           …                         …         …

Generalization plan per attribute:

Name     Gender    Major          Birth-Place  Birth_date  Residence  Phone #  GPA
Removed  Retained  Sci, Eng, Bus  Country      Age range   City       Removed  Excl, VG, ..

Prime generalized relation:

Gender  Major    Birth_region  Age_range  Residence  GPA        Count
M       Science  Canada        20-25      Richmond   Very-good  16
F       Science  Foreign       25-30      Burnaby    Excellent  22
…       …        …             …          …          …          …

Crosstab of Gender by Birth_Region:

Gender  Canada  Foreign  Total
M       16      14       30
F       10      22       32
Total   26      36       62

Presentation of Generalized Results
- Generalized relation:
  - Relations where some or all attributes are generalized, with counts or other aggregation values accumulated.
- Cross tabulation:
  - Mapping results into cross tabulation form (similar to contingency tables).
- Visualization techniques:
  - Pie charts, bar charts, curves, cubes, and other visual forms.
- Quantitative characteristic rules:
  - Mapping generalized results into characteristic rules with quantitative information associated with them, e.g.,
    $\forall x,\ grad(x) \wedge male(x) \Rightarrow birth\_region(x) = \text{"Canada"}\ [t: 53\%] \vee birth\_region(x) = \text{"foreign"}\ [t: 47\%]$

Presentation: Generalized Relation
[Figure not recovered]

Presentation: Crosstab
[Figure not recovered]

Implementation by Cube Technology
- Construct a data cube on-the-fly for the given data mining query
  - Facilitates efficient drill-down analysis
  - May increase the response time
  - A balanced solution: precomputation of a "subprime" relation
- Use a predefined and precomputed data cube
  - Construct a data cube beforehand
  - Facilitates not only attribute-oriented induction, but also attribute relevance analysis, dicing, slicing, roll-up and drill-down
  - Incurs the cost of cube computation and nontrivial storage overhead

Characterization vs. OLAP
- Similarity:
  - Presentation of data summarization at multiple levels of abstraction
  - Interactive drilling, pivoting, slicing and dicing
- Differences:
  - Automated allocation of the desired level
  - Dimension relevance analysis and ranking when there are many relevant dimensions
  - Sophisticated typing on dimensions and measures
  - Analytical characterization: data dispersion analysis

Attribute Relevance Analysis
- Why?
  - Which dimensions should be included?
  - How high a level of generalization?
  - Automatic vs. interactive
  - Reduce the number of attributes; patterns become easier to understand
- What?
  - A statistical method for preprocessing data
    - filter out irrelevant or weakly relevant attributes
    - retain or rank the relevant attributes
  - Relevance is related to dimensions and levels
  - Analytical characterization, analytical comparison

Attribute Relevance Analysis (cont'd)
- How?
  - Data collection
  - Analytical generalization
    - Use information gain analysis (e.g., entropy or other measures) to identify highly relevant dimensions and levels.
  - Relevance analysis
    - Sort and select the most relevant dimensions and levels.
  - Attribute-oriented induction for class description
    - On the selected dimensions/levels
  - OLAP operations (e.g., drilling, slicing) on relevance rules

Relevance Measures
- A quantitative relevance measure determines the classifying power of an attribute within a set of data.
- Methods
  - information gain (ID3)
  - gain ratio (C4.5)
  - Gini index
  - $\chi^2$ contingency table statistics
  - uncertainty coefficient

Information-Theoretic Approach
- Decision tree
  - each internal node tests an attribute
  - each branch corresponds to an attribute value
  - each leaf node assigns a classification
- ID3 algorithm
  - builds a decision tree from training objects with known class labels in order to classify testing objects
  - ranks attributes with the information gain measure
  - aims for minimal height: the least number of tests to classify an object

Top-Down Induction of Decision Tree

Attributes = {Outlook, Temperature, Humidity, Wind}
PlayTennis = {yes, no}

Outlook
  sunny: Humidity
    high: no
    normal: yes
  overcast: yes
  rain: Wind
    strong: no
    weak: yes
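As a small illustration of how such a tree is used once it has been induced, the PlayTennis tree above can be encoded as nested tuples and dictionaries and walked from the root to a leaf. The encoding and the `classify` helper are assumptions made for this sketch; they are not part of ID3 itself.

```python
# Internal nodes are (attribute, {value: subtree}); leaves are class labels.
TREE = ("Outlook", {
    "sunny": ("Humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rain": ("Wind", {"strong": "no", "weak": "yes"}),
})


def classify(tree, example):
    """Follow one branch per tested attribute until a leaf label is reached."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree


print(classify(TREE, {"Outlook": "sunny", "Humidity": "normal", "Wind": "weak"}))
print(classify(TREE, {"Outlook": "rain", "Humidity": "high", "Wind": "strong"}))
```

Note that Temperature is never tested: an attribute that does not help separate the classes does not have to appear in the tree.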

Entropy and Information Gain
- S contains $s_i$ tuples of class $C_i$ for $i = 1, \ldots, m$; $s$ is the total number of tuples in S
- Expected information required to classify an arbitrary tuple:
  $I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}$
- Entropy of attribute A with values $\{a_1, a_2, \ldots, a_v\}$, where $s_{ij}$ is the number of tuples of class $C_i$ with $A = a_j$:
  $E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} I(s_{1j}, \ldots, s_{mj})$
- Information gained by branching on attribute A:
  $Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A)$
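The three formulas translate almost directly into code. The sketch below is a minimal version; the function names are chosen here for illustration, and the sample counts (9 "yes" and 5 "no" tuples, split three ways) are illustrative numbers not taken from the slides.

```python
from math import log2


def info(class_counts):
    """I(s_1, ..., s_m): expected information to classify a tuple.

    Written as sum p * log2(1/p), which equals -sum p * log2(p);
    classes with zero tuples contribute nothing.
    """
    total = sum(class_counts)
    return sum((c / total) * log2(total / c) for c in class_counts if c > 0)


def expected_info(partitions):
    """E(A): weighted information over the subsets induced by attribute A.

    Each partition is the list of per-class counts for one value of A.
    """
    total = sum(sum(p) for p in partitions)
    return sum((sum(p) / total) * info(p) for p in partitions)


def gain(class_counts, partitions):
    """Gain(A) = I(s_1, ..., s_m) - E(A)."""
    return info(class_counts) - expected_info(partitions)


# Illustrative counts: 14 tuples, split by a three-valued attribute.
print(info([9, 5]))
print(gain([9, 5], [[2, 3], [4, 0], [3, 2]]))
```

A partition that is pure, such as `[4, 0]`, has zero information, which is why attributes that produce pure subsets score a high gain.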

Example: Analytical Characterization
- Task
  - Mine general characteristics describing graduate students using analytical characterization
- Given
  - attributes name, gender, major, birth_place, birth_date, phone#, and gpa
  - Gen(a_i) = concept hierarchies on a_i
  - U_i = attribute analytical thresholds for a_i
  - T_i = attribute generalization thresholds for a_i
  - R = attribute relevance threshold

Example: Analytical Characterization (cont'd)
- 1. Data collection
  - target class: graduate student
  - contrasting class: undergraduate student
- 2. Analytical generalization using U_i
  - attribute removal
    - remove name and phone#
  - attribute generalization
    - generalize major, birth_place, birth_date and gpa
    - accumulate counts
  - candidate relation: gender, major, birth_country, age_range and gpa

Example: Analytical Characterization (2)

gender  major        birth_country  age_range  gpa        count
M       Science      Canada         20-25      Very_good  16
F       Science      Foreign        25-30      Excellent  22
M       Engineering  Foreign        25-30      Excellent  18
F       Science      Foreign        25-30      Excellent  25
M       Science      Canada         20-25      Excellent  21
F       Engineering  Canada         20-25      Excellent  18

Candidate relation for target class: Graduate students (total count = 120)

gender  major        birth_country  age_range  gpa        count
M       Science      Foreign        <20        Very_good  18
F       Business     Canada         <20        Fair       20
M       Business     Canada         <20        Fair       22
F       Science      Canada         20-25      Fair       24
M       Engineering  Foreign        20-25      Very_good  22
F       Engineering  Canada         <20        Excellent  24

Candidate relation for contrasting class: Undergraduate students (total count = 130)

Example: Analytical Characterization (3)
- 3. Relevance analysis
  - Calculate the expected information required to classify an arbitrary tuple:
    $I(s_1, s_2) = I(120, 130) = -\frac{120}{250}\log_2\frac{120}{250} - \frac{130}{250}\log_2\frac{130}{250} = 0.9988$
  - Calculate the entropy of each attribute, e.g., major, where $s_{1j}$ is the number of graduate students and $s_{2j}$ the number of undergraduate students with the j-th major:
    For major = "Science":     s_11 = 84, s_21 = 42, I(s_11, s_21) = 0.9183
    For major = "Engineering": s_12 = 36, s_22 = 46, I(s_12, s_22) = 0.9892
    For major = "Business":    s_13 = 0,  s_23 = 42, I(s_13, s_23) = 0

Example: Analytical Characterization (4)
- Calculate the expected information required to classify a given sample if S is partitioned according to the attribute:
  $E(major) = \frac{126}{250} I(s_{11}, s_{21}) + \frac{82}{250} I(s_{12}, s_{22}) + \frac{42}{250} I(s_{13}, s_{23}) = 0.7873$
- Calculate the information gain for each attribute:
  $Gain(major) = I(s_1, s_2) - E(major) = 0.2115$
- Information gain for all attributes:
  Gain(gender)        = 0.0003
  Gain(birth_country) = 0.0407
  Gain(major)         = 0.2115
  Gain(gpa)           = 0.4490
  Gain(age_range)     = 0.5971
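The relevance analysis for major can be replayed from the two candidate relations. The sketch below is one possible way to do it: the tuple layout and the helper names `info` and `gain` are choices made here, and the row data is copied from the candidate relations in this example. For major the result should agree with the slide value of 0.2115 to about three decimal places; small differences in the last digit come from rounding the intermediate values.

```python
from collections import defaultdict
from math import log2

COLUMNS = ("gender", "major", "birth_country", "age_range", "gpa")

# Rows of the candidate relations; the last field is the count.
GRADUATE = [
    ("M", "Science", "Canada", "20-25", "Very_good", 16),
    ("F", "Science", "Foreign", "25-30", "Excellent", 22),
    ("M", "Engineering", "Foreign", "25-30", "Excellent", 18),
    ("F", "Science", "Foreign", "25-30", "Excellent", 25),
    ("M", "Science", "Canada", "20-25", "Excellent", 21),
    ("F", "Engineering", "Canada", "20-25", "Excellent", 18),
]
UNDERGRADUATE = [
    ("M", "Science", "Foreign", "<20", "Very_good", 18),
    ("F", "Business", "Canada", "<20", "Fair", 20),
    ("M", "Business", "Canada", "<20", "Fair", 22),
    ("F", "Science", "Canada", "20-25", "Fair", 24),
    ("M", "Engineering", "Foreign", "20-25", "Very_good", 22),
    ("F", "Engineering", "Canada", "<20", "Excellent", 24),
]


def info(class_counts):
    """I(s_1, ..., s_m), skipping empty classes."""
    total = sum(class_counts)
    return sum((c / total) * log2(total / c) for c in class_counts if c > 0)


def gain(attribute):
    """Gain(attribute) for graduate vs. undergraduate students."""
    column = COLUMNS.index(attribute)
    per_value = defaultdict(lambda: [0, 0])  # value -> [grad, undergrad]
    for class_index, relation in enumerate((GRADUATE, UNDERGRADUATE)):
        for row in relation:
            per_value[row[column]][class_index] += row[-1]
    class_totals = [sum(row[-1] for row in GRADUATE),
                    sum(row[-1] for row in UNDERGRADUATE)]
    s = sum(class_totals)
    expected = sum(sum(c) / s * info(c) for c in per_value.values())
    return info(class_totals) - expected


print("Gain(major) =", gain("major"))
```

Attributes whose gain falls below the relevance threshold R would then be dropped before attribute-oriented induction is applied.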

Example: Analytical Characterization (5)
- 4. Initial working relation (W0) derivation
  - R = 0.1
  - remove irrelevant/weakly relevant attributes from the candidate relation => drop gender, birth_country
  - remove the contrasting class candidate relation

major        age_range  gpa        count
Science      20-25      Very_good  16
Science      25-30      Excellent  47
Science      20-25      Excellent  21
Engineering  20-25      Excellent  18
Engineering  25-30      Excellent  18

Initial target class working relation W0: Graduate students

- 5. Perform attribute-oriented induction on W0 using T_i

Mining Class Comparisons
- Comparison: comparing two or more classes.
- Method:
  - Partition the set of relevant data into the target class and the contrasting class(es)
  - Generalize both classes to the same high-level concepts
  - Compare tuples with the same high-level descriptions
  - Present for every tuple its description and two measures:
    - support: distribution within a single class
    - comparison: distribution between classes
  - Highlight the tuples with strong discriminant features
- Relevance analysis:
  - Find attributes (features) which best distinguish different classes.

Example: Analytical Comparison
- Task
  - Compare graduate and undergraduate students using a discriminant rule.
- DMQL query

    use Big_University_DB
    mine comparison as "grad_vs_undergrad_students"
    in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
    for "graduate_students"
    where status in "graduate"
    versus "undergraduate_students"
    where status in "undergraduate"
    analyze count%
    from student

Example: Analytical Comparison (2)
- Given
  - attributes name, gender, major, birth_place, birth_date, residence, phone# and gpa
  - Gen(a_i) = concept hierarchies on attributes a_i
  - U_i = attribute analytical thresholds for attributes a_i
  - T_i = attribute generalization thresholds for attributes a_i
  - R = attribute relevance threshold

Example: Analytical Comparison (3)
- 1. Data collection
  - target and contrasting classes
- 2. Attribute relevance analysis
  - remove attributes name, gender, major, phone#
- 3. Synchronous generalization
  - controlled by user-specified dimension thresholds
  - prime target and contrasting class(es) relations/cuboids

Example: Analytical Comparison (4)

Birth_country  Age_range  Gpa        Count%
Canada         20-25      Good       5.53%
Canada         25-30      Good       2.32%
Canada         Over_30    Very_good  5.86%
…              …          …          …
Other          Over_30    Excellent  4.68%

Prime generalized relation for the target class: Graduate students

Birth_country  Age_range  Gpa        Count%
Canada         15-20      Fair       5.53%
Canada         15-20      Good       4.53%
…              …          …          …
Canada         25-30      Good       5.02%
…              …          …          …
Other          Over_30    Excellent  0.68%

Prime generalized relation for the contrasting class: Undergraduate students

Example: Analytical Comparison (5)
- 4. Drill down, roll up and other OLAP operations on target and contrasting classes to adjust the levels of abstraction of the resulting description
- 5. Presentation
  - as generalized relations, crosstabs, bar charts, pie charts, or rules
  - contrasting measures to reflect the comparison between target and contrasting classes
    - e.g., count%

Quantitative Discriminant Rules
- C_j = target class
- q_a = a generalized tuple that covers some tuples of the target class
  - but can also cover some tuples of the contrasting class(es)
- d-weight
  - range: [0, 1]
  $d\text{-}weight = \frac{count(q_a \in C_j)}{\sum_{i=1}^{m} count(q_a \in C_i)}$
- Quantitative discriminant rule form:
  $\forall X,\ target\_class(X) \Leftarrow condition(X)\ [d: d\_weight]$

Example: Quantitative Discriminant Rule

Status         Birth_country  Age_range  Gpa   Count
Graduate       Canada         25-30      Good  90
Undergraduate  Canada         25-30      Good  210

Count distribution between graduate and undergraduate students for a generalized tuple

- Quantitative discriminant rule
  $\forall X,\ graduate\_student(X) \Leftarrow birth\_country(X) = \text{"Canada"} \wedge age\_range(X) = \text{"25-30"} \wedge gpa(X) = \text{"good"}\ [d: 30\%]$
  - where 90/(90+210) = 30%

Class Description
- Quantitative characteristic rule (necessary condition)
  $\forall X,\ target\_class(X) \Rightarrow condition(X)\ [t: t\_weight]$
- Quantitative discriminant rule (sufficient condition)
  $\forall X,\ target\_class(X) \Leftarrow condition(X)\ [d: d\_weight]$
- Quantitative description rule (necessary and sufficient condition)
  $\forall X,\ target\_class(X) \Leftrightarrow condition_1(X)\ [t: w_1, d: w'_1] \vee \cdots \vee condition_n(X)\ [t: w_n, d: w'_n]$

Example: Quantitative Description Rule

                TV                     Computer               Both_items
Location/item   Count  t-wt    d-wt    Count  t-wt    d-wt    Count  t-wt  d-wt
Europe          80     25%     40%     240    75%     30%     320    100%  32%
N_Am            120    17.65%  60%     560    82.35%  70%     680    100%  68%
Both_regions    200    20%     100%    800    80%     100%    1000   100%  100%

Crosstab showing associated t-weight, d-weight values and total number (in thousands) of TVs and computers sold at AllElectronics in 1998

- Quantitative description rule for target class Europe
  $\forall X,\ Europe(X) \Leftrightarrow (item(X) = \text{"TV"})\ [t: 25\%, d: 40\%] \vee (item(X) = \text{"computer"})\ [t: 75\%, d: 30\%]$
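The t-weights and d-weights in the crosstab follow from the raw counts alone: the t-weight normalizes a cell within its own class (row), and the d-weight normalizes it across classes (column). A minimal sketch, with the dictionary layout and function names chosen here for illustration:

```python
# Units sold (in thousands), taken from the crosstab above.
SALES = {
    "Europe": {"TV": 80, "computer": 240},
    "N_Am": {"TV": 120, "computer": 560},
}


def t_weight(region, item):
    """Share of the region's own sales that this item accounts for."""
    return SALES[region][item] / sum(SALES[region].values())


def d_weight(region, item):
    """Share of this item's sales across all regions that falls in the region."""
    return SALES[region][item] / sum(row[item] for row in SALES.values())


for region in SALES:
    for item in SALES[region]:
        print(region, item,
              f"t: {t_weight(region, item):.2%}",
              f"d: {d_weight(region, item):.2%}")
```

For Europe and TV this gives 80/320 for the t-weight and 80/200 for the d-weight, matching the 25% and 40% in the rule.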

Mining Descriptive Statistical Measures in Large Databases
- Motivation
  - To better understand the data: central tendency, variation and spread
- Data dispersion characteristics
  - median, max, min, quantiles, outliers, variance, etc.
- Numerical dimensions correspond to sorted intervals
  - Data dispersion: analyzed with multiple granularities of precision
  - Boxplot or quantile analysis on sorted intervals
- Dispersion analysis on computed measures
  - Folding measures into numerical dimensions
  - Boxplot or quantile analysis on the transformed cube

Measuring the Central Tendency
- Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
  - Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
- Median: a holistic measure
  - Middle value if there is an odd number of values, or the average of the middle two values otherwise
  - For grouped data, estimated by interpolation: $median = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{median}}\right) c$, where $L_1$ is the lower boundary of the median class, $(\sum f)_l$ the sum of the frequencies of the classes below it, $f_{median}$ its frequency, and $c$ its width
- Mode
  - Value that occurs most frequently in the data
  - Unimodal, bimodal, trimodal
  - Empirical formula: $mean - mode = 3 \times (mean - median)$
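These measures are short to compute. The sketch below relies on Python's standard `statistics` module for the mean, median and mode, and adds two small helpers for the weighted mean and the interpolated median of grouped data; the helper names and all sample numbers are made up for illustration.

```python
from statistics import mean, median, multimode


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def grouped_median(lower_bound, n, freq_below, freq_median, width):
    """Interpolated median: L1 + ((n/2 - (sum f)_l) / f_median) * c."""
    return lower_bound + ((n / 2 - freq_below) / freq_median) * width


values = [1, 2, 2, 3, 3, 7]
print(mean(values))       # arithmetic mean
print(median(values))     # average of the two middle values (even count)
print(multimode(values))  # every most-frequent value; two here, so bimodal
print(weighted_mean([1, 2, 3], [1, 1, 2]))
# 100 values, median class [20, 30) holding 25 values, 40 values below it.
print(grouped_median(lower_bound=20, n=100, freq_below=40,
                     freq_median=25, width=10))
```

The mean and weighted mean can be maintained from running sums, whereas the exact median needs the whole sorted data set, which is what makes it a holistic measure and motivates the interpolation estimate.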

Measuring the Dispersion of Data
- Quartiles, outliers and boxplots
  - Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  - Inter-quartile range: IQR = Q3 - Q1
  - Five-number summary: min, Q1, M, Q3, max
  - Boxplot: the ends of the box are the quartiles, the median is marked, whiskers are added, and outliers are plotted individually
  - Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
- Variance and standard deviation
  - Variance $s^2$ (algebraic, scalable computation):
    $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
  - Standard deviation $s$ is the square root of the variance $s^2$

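A five-number summary and the 1.5 x IQR outlier rule can be sketched as follows. Several quartile conventions exist; the `quantile` helper below uses linear interpolation between closest ranks, which is an assumption of this sketch rather than something the slides specify. The sample data is made up.

```python
def quantile(sorted_values, q):
    """q-quantile by linear interpolation between the closest ranks."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (
        sorted_values[upper] - sorted_values[lower])


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    data = sorted(values)
    return (data[0], quantile(data, 0.25), quantile(data, 0.5),
            quantile(data, 0.75), data[-1])


def outliers(values):
    """Values more than 1.5 x IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    return [x for x in values if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]


data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(five_number_summary(data))
print(outliers(data))
```

With this data the single extreme value is flagged while the quartiles barely move, which is the robustness that makes the boxplot useful.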
Boxplot Analysis
- Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum
- Boxplot
  - Data is represented with a box
  - The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
  - The median is marked by a line within the box
  - Whiskers: two lines outside the box extend to Minimum and Maximum

[Figure: a boxplot]

[Figure: Visualization of Data Dispersion: Boxplot Analysis]

Mining Descriptive Statistical Measures in Large Databases
- Variance
  $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
- Standard deviation: the square root of the variance
  - Measures spread about the mean
  - It is zero if and only if all the values are equal
  - Both the standard deviation and the variance are algebraic
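Because the variance is algebraic, it can be derived from three distributive aggregates (count, sum, and sum of squares) that are computed per partition and then added together. The sketch below illustrates this with the second form of the formula; the function names and the two sample chunks are made up.

```python
def partial_aggregates(chunk):
    """Distributive aggregates of one partition: (n, sum x, sum x^2)."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))


def merge(a, b):
    """Combine the aggregates of two partitions by component-wise addition."""
    return tuple(x + y for x, y in zip(a, b))


def variance(aggregates):
    """s^2 = (sum x^2 - (sum x)^2 / n) / (n - 1)."""
    n, total, total_sq = aggregates
    return (total_sq - total * total / n) / (n - 1)


# Two partitions of the same data set, e.g. stored on different nodes.
chunk_a = [2, 4, 4]
chunk_b = [4, 5, 5, 7, 9]
combined = merge(partial_aggregates(chunk_a), partial_aggregates(chunk_b))
print(variance(combined))
print(variance(combined) ** 0.5)  # standard deviation
```

One caveat with this one-pass form: when the mean is large relative to the spread, subtracting two nearly equal sums can lose floating-point precision, so numerically careful implementations often use an updating scheme instead.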

Histogram Analysis
- Graph displays of basic statistical class descriptions
  - Frequency histograms
    - A univariate graphical method
    - Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data

Quantile Plot
- Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
- Plots quantile information
  - For data x_i sorted in increasing order, f_i indicates that approximately 100 f_i% of the data are below or equal to the value x_i

Quantile-Quantile (Q-Q) Plot
- Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
- Allows the user to view whether there is a shift in going from one distribution to another

Scatter Plot
- Provides a first look at bivariate data to see clusters of points, outliers, etc.
- Each pair of values is treated as a pair of coordinates and plotted as a point in the plane

Loess Curve
- Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence
- A loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression

Graphic Displays of Basic Statistical Descriptions
- Histogram: (shown before)
- Boxplot: (covered before)
- Quantile plot: each value x_i is paired with f_i, indicating that approximately 100 f_i% of the data are <= x_i
- Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
- Scatter plot: each pair of values is a pair of coordinates and plotted as a point in the plane
- Loess (local regression) curve: adds a smooth curve to a scatter plot to provide better perception of the pattern of dependence

Discussion
- Difference in philosophies and basic assumptions
  - Positive and negative samples in learning-from-example: positive samples are used for generalization, negative ones for specialization
  - Positive samples only in data mining: hence generalization-based; drilling down backtracks the generalization to a previous state
- Difference in methods of generalization
  - Machine learning generalizes on a tuple-by-tuple basis
  - Data mining generalizes on an attribute-by-attribute basis

Comparison of Entire vs. Factored Version Space
[Figure not recovered]

Incremental and Parallel Mining of Concept Description
- Incremental mining: revision based on newly added data ΔDB
  - Generalize ΔDB to the same level of abstraction as the generalized relation R to derive ΔR
  - Union R ∪ ΔR, i.e., merge counts and other statistical information to produce a new relation R'
- A similar philosophy can be applied to data sampling, parallel and/or distributed mining, etc.

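When the only accumulated statistic is a count, the incremental step reduces to generalizing the new rows and adding their counts to the existing relation. A minimal sketch, assuming a hypothetical `generalize` function and made-up rows:

```python
from collections import Counter


def generalize(row):
    """Hypothetical generalization: keep the field, bucket the age."""
    age_range = "20-25" if row["age"] <= 25 else "25-30"
    return (row["field"], age_range)


def incremental_update(relation, delta_rows):
    """Return R' = R merged with the generalized new data; R is not modified."""
    delta_relation = Counter(generalize(row) for row in delta_rows)  # delta R
    return relation + delta_relation                                 # merge counts


R = Counter({("Science", "20-25"): 16, ("Science", "25-30"): 47})
delta_db = [
    {"field": "Science", "age": 24},
    {"field": "Engineering", "age": 27},
]
R_new = incremental_update(R, delta_db)
for generalized_tuple, count in sorted(R_new.items()):
    print(generalized_tuple, count)
```

The same merge works for parallel or distributed mining: each node generalizes its own partition, and the partial relations are summed afterwards.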
Summary
- Concept description: characterization and discrimination
- OLAP-based vs. attribute-oriented induction
- Efficient implementation of AOI
- Analytical characterization and comparison
- Mining descriptive statistical measures in large databases
- Discussion
  - Incremental and parallel mining of description
  - Descriptive mining of complex types of data


http://www.cs.sfu.ca/~han/dmbook

Thank you !!!

8 Clustering
No ratings yet
8 Clustering
89 pages
Concepts and Techniques: - Chapter 7
No ratings yet
Concepts and Techniques: - Chapter 7
123 pages
Data Mining Mid 2
No ratings yet
Data Mining Mid 2
20 pages
DM Unit-1
No ratings yet
DM Unit-1
14 pages
Data Mining Primitives, Languages and System Architecture
No ratings yet
Data Mining Primitives, Languages and System Architecture
64 pages
DWM Sem V Module 2 - Introduction To Data Mining, Data Exploration and Data Pre-Processing
No ratings yet
DWM Sem V Module 2 - Introduction To Data Mining, Data Exploration and Data Pre-Processing
55 pages
Concepts and Techniques: - Chapter 7
No ratings yet
Concepts and Techniques: - Chapter 7
127 pages
Down 2
No ratings yet
Down 2
61 pages
DW&M Unit - 1-Imp Vii Sem
No ratings yet
DW&M Unit - 1-Imp Vii Sem
9 pages
Cluster Analisys
No ratings yet
Cluster Analisys
100 pages
Module 4
No ratings yet
Module 4
54 pages
Data Mining:: Concepts and Techniques
No ratings yet
Data Mining:: Concepts and Techniques
21 pages
An 15 DM Caracterizacion
No ratings yet
An 15 DM Caracterizacion
38 pages