DM Concepts

Chapter 5 of 'Data Mining: Concepts and Techniques' focuses on concept description, including characterization and comparison of data. It distinguishes between descriptive and predictive data mining, discusses data generalization, and introduces analytical characterization and attribute relevance analysis. The chapter also covers methods for mining class comparisons and presents examples of data aggregation and visualization techniques.

Uploaded by

Ch Samson

Data Mining:

Concepts and
Techniques
— Slides for Textbook —
— Chapter 5 —

©Jiawei Han and Micheline Kamber


Intelligent Database Systems Research Lab
School of Computing Science
Simon Fraser University, Canada
http://www.cs.sfu.ca
April 7, 2025 Data Mining: Concepts and Techniq 1
Chapter 5: Concept Description: Characterization and Comparison
• What is concept description?
• Data generalization and summarization-based characterization
• Analytical characterization: Analysis of attribute relevance
• Mining class comparisons: Discriminating between different classes
• Mining descriptive statistical measures in large databases
• Discussion
• Summary

What is Concept Description?

• Descriptive vs. predictive data mining
  • Descriptive mining: describes concepts or task-relevant data sets in concise, summarized, informative, discriminative forms
  • Predictive mining: based on data and analysis, constructs models for the database and predicts the trend and properties of unknown data
• Concept description:
  • Characterization: provides a concise and succinct summarization of the given collection of data
  • Comparison: provides descriptions comparing two or more collections of data

Concept Description vs. OLAP

• Concept description:
  • can handle complex data types of the attributes and their aggregations
  • a more automated process
• OLAP:
  • restricted to a small number of dimension and measure types
  • user-controlled process


Data Generalization and Summarization-based Characterization

• Data generalization
  • A process that abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones.
  • [Figure: conceptual levels numbered 1 through 5]
• Approaches:
  • Data cube approach (OLAP approach)
  • Attribute-oriented induction approach


Characterization: Data Cube Approach (without using AO-Induction)

• Perform computations and store results in data cubes
• Strengths
  • An efficient implementation of data generalization
  • Computation of various kinds of measures, e.g., count( ), sum( ), average( ), max( )
  • Generalization and specialization can be performed on a data cube by roll-up and drill-down
• Limitations
  • Handles only dimensions of simple nonnumeric data and measures of simple aggregated numeric values
  • Lacks intelligent analysis: it cannot tell which dimensions should be used or what level the generalization should reach


Attribute-Oriented Induction

• Proposed in 1989 (KDD '89 workshop)
• Not confined to categorical data or to particular measures
• How is it done?
  • Collect the task-relevant data (initial relation) using a relational database query
  • Perform generalization by attribute removal or attribute generalization
  • Apply aggregation by merging identical, generalized tuples and accumulating their respective counts
  • Interactive presentation with users


Basic Principles of Attribute-Oriented Induction
• Data focusing: select the task-relevant data, including dimensions; the result is the initial relation.
• Attribute removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A's higher-level concepts are expressed in terms of other attributes.
• Attribute generalization: if there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A.
• Attribute-threshold control: typically 2-8 distinct values per attribute, user-specified or default.
• Generalized relation threshold control: controls the final relation/rule size.


Basic Algorithm for Attribute-Oriented Induction
• InitialRel: query processing of task-relevant data, deriving the initial relation.
• PreGen: based on the analysis of the number of distinct values in each attribute, determine the generalization plan for each attribute: remove it, or how high to generalize it?
• PrimeGen: based on the PreGen plan, perform generalization to the right level to derive a “prime generalized relation”, accumulating the counts.
• Presentation: user interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, visualization presentations.


Example
• DMQL: Describe general characteristics of graduate students in the Big-University database

    use Big_University_DB
    mine characteristics as “Science_Students”
    in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
    from student
    where status in “graduate”

• Corresponding SQL statement:

    select name, gender, major, birth_place, birth_date, residence, phone#, gpa
    from student
    where status in ('Msc', 'MBA', 'PhD')

Class Characterization: An Example

Initial relation:

Name            Gender    Major          Birth-Place            Birth_date  Residence                 Phone #   GPA
Jim Woodman     M         CS             Vancouver, BC, Canada  8-12-76     3511 Main St., Richmond   687-4598  3.67
Scott Lachance  M         CS             Montreal, Que, Canada  28-7-75     345 1st Ave., Richmond    253-9106  3.70
Laura Lee       F         Physics        Seattle, WA, USA       25-8-70     125 Austin Ave., Burnaby  420-5232  3.83
…               …         …              …                      …           …                         …         …

Generalization applied to each attribute:

Removed         Retained  Sci, Eng, Bus  Country                Age range   City                      Removed   Excl, VG, ..

Prime generalized relation:

Gender  Major    Birth_region  Age_range  Residence  GPA        Count
M       Science  Canada        20-25      Richmond   Very-good  16
F       Science  Foreign       25-30      Burnaby    Excellent  22
…       …        …             …          …          …          …

Crosstab of Gender by Birth_Region:

Gender  Canada  Foreign  Total
M       16      14       30
F       10      22       32
Total   26      36       62

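The attribute-removal, attribute-generalization and count-accumulation steps can be sketched in a few lines of Python. This is a minimal illustration rather than the textbook's implementation: the function `attribute_oriented_induction`, the `plan` dictionary and the two tiny concept hierarchies are hypothetical names introduced here, and only three attributes of the three sample rows shown in the initial relation are used.

```python
from collections import Counter

# Hypothetical concept hierarchy for major (assumed for illustration only).
MAJOR_LEVEL = {"CS": "Science", "Physics": "Science"}

def generalize_birth_place(place):
    """Climb the hierarchy: city/province/country -> 'Canada' or 'Foreign'."""
    return "Canada" if place.endswith("Canada") else "Foreign"

def attribute_oriented_induction(rows, plan):
    """Apply a per-attribute plan and merge identical generalized tuples.

    plan maps each attribute to None (remove it) or to a function that
    replaces a value by its higher-level concept.
    """
    counts = Counter()
    for row in rows:
        generalized = tuple(
            (attr, fn(row[attr])) for attr, fn in plan.items() if fn is not None
        )
        counts[generalized] += 1  # accumulate the count of merged tuples
    return counts

initial_relation = [
    {"name": "Jim Woodman", "gender": "M", "major": "CS",
     "birth_place": "Vancouver, BC, Canada"},
    {"name": "Scott Lachance", "gender": "M", "major": "CS",
     "birth_place": "Montreal, Que, Canada"},
    {"name": "Laura Lee", "gender": "F", "major": "Physics",
     "birth_place": "Seattle, WA, USA"},
]

plan = {
    "name": None,                           # many distinct values, no hierarchy: remove
    "gender": lambda value: value,          # few distinct values: retain
    "major": MAJOR_LEVEL.get,               # generalize to the Sci/Eng/Bus level
    "birth_place": generalize_birth_place,  # generalize to the country level
}

prime_relation = attribute_oriented_induction(initial_relation, plan)
for generalized_tuple, count in prime_relation.items():
    print(dict(generalized_tuple), "count =", count)
```

The two Canadian-born CS students collapse into one generalized tuple with count 2, which is the merging behaviour the prime generalized relation relies on.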
Presentation of Generalized Results
• Generalized relation:
  • Relations where some or all attributes are generalized, with counts or other aggregation values accumulated.
• Cross tabulation:
  • Mapping results into cross tabulation form (similar to contingency tables).
• Visualization techniques:
  • Pie charts, bar charts, curves, cubes, and other visual forms.
• Quantitative characteristic rules:
  • Mapping generalized results into characteristic rules with quantitative information associated with them, e.g.,

$$\mathit{grad}(x) \wedge \mathit{male}(x) \Rightarrow \mathit{birth\_region}(x) = \text{``Canada''}\ [t: 53\%] \ \vee\ \mathit{birth\_region}(x) = \text{``foreign''}\ [t: 47\%]$$

Presentation—Generalized
Relation



Presentation—Crosstab



Implementation by Cube Technology

• Construct a data cube on-the-fly for the given data mining query
  • Facilitates efficient drill-down analysis
  • May increase the response time
  • A balanced solution: precomputation of a “subprime” relation
• Use a predefined & precomputed data cube
  • Construct a data cube beforehand
  • Facilitates not only attribute-oriented induction, but also attribute relevance analysis, dicing, slicing, roll-up and drill-down
  • Cost of cube computation and the nontrivial storage overhead

Characterization vs. OLAP

• Similarity:
  • Presentation of data summarization at multiple levels of abstraction.
  • Interactive drilling, pivoting, slicing and dicing.
• Differences:
  • Automated desired level allocation.
  • Dimension relevance analysis and ranking when there are many relevant dimensions.
  • Sophisticated typing on dimensions and measures.
  • Analytical characterization: data dispersion analysis.


Attribute Relevance Analysis

• Why?
  • Which dimensions should be included?
  • How high a level of generalization?
  • Automatic vs. interactive
  • Reduce the number of attributes; patterns become easier to understand
• What?
  • A statistical method for preprocessing data
    • filter out irrelevant or weakly relevant attributes
    • retain or rank the relevant attributes
  • Relevance related to dimensions and levels
  • Analytical characterization, analytical comparison


Attribute Relevance Analysis (cont’d)

• How?
  • Data collection
  • Analytical generalization
    • Use information gain analysis (e.g., entropy or other measures) to identify highly relevant dimensions and levels.
  • Relevance analysis
    • Sort and select the most relevant dimensions and levels.
  • Attribute-oriented induction for class description
    • On selected dimension/level
  • OLAP operations (e.g., drilling, slicing) on relevance rules


Relevance Measures

• Quantitative relevance measure: determines the classifying power of an attribute within a set of data.
• Methods
  • information gain (ID3)
  • gain ratio (C4.5)
  • Gini index
  • χ² contingency table statistics
  • uncertainty coefficient


Information-Theoretic Approach

• Decision tree
  • each internal node tests an attribute
  • each branch corresponds to an attribute value
  • each leaf node assigns a classification
• ID3 algorithm
  • builds a decision tree from training objects with known class labels in order to classify testing objects
  • ranks attributes with the information gain measure
  • aims for minimal height: the least number of tests to classify an object

Top-Down Induction of Decision Tree

Attributes = {Outlook, Temperature, Humidity, Wind}
PlayTennis = {yes, no}

Outlook
  sunny: Humidity
    high: no
    normal: yes
  overcast: yes
  rain: Wind
    strong: no
    weak: yes


Entropy and Information Gain

• S contains s_i tuples of class C_i for i = 1, …, m
• Information required to classify an arbitrary tuple:

$$I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}$$

• Entropy of attribute A with values {a_1, a_2, …, a_v}:

$$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s}\, I(s_{1j}, \ldots, s_{mj})$$

• Information gained by branching on attribute A:

$$\mathrm{Gain}(A) = I(s_1, s_2, \ldots, s_m) - E(A)$$


Example: Analytical Characterization

• Task
  • Mine general characteristics describing graduate students using analytical characterization
• Given
  • attributes name, gender, major, birth_place, birth_date, phone#, and gpa
  • Gen(a_i) = concept hierarchies on a_i
  • U_i = attribute analytical thresholds for a_i
  • T_i = attribute generalization thresholds for a_i
  • R = attribute relevance threshold


Example: Analytical Characterization (cont’d)
• 1. Data collection
  • target class: graduate student
  • contrasting class: undergraduate student
• 2. Analytical generalization using U_i
  • attribute removal
    • remove name and phone#
  • attribute generalization
    • generalize major, birth_place, birth_date and gpa
    • accumulate counts
  • candidate relation: gender, major, birth_country, age_range and gpa


Example: Analytical Characterization (2)

gender  major        birth_country  age_range  gpa        count
M       Science      Canada         20-25      Very_good  16
F       Science      Foreign        25-30      Excellent  22
M       Engineering  Foreign        25-30      Excellent  18
F       Science      Foreign        25-30      Excellent  25
M       Science      Canada         20-25      Excellent  21
F       Engineering  Canada         20-25      Excellent  18

Candidate relation for target class: Graduate students (Σ = 120)

gender  major        birth_country  age_range  gpa        count
M       Science      Foreign        <20        Very_good  18
F       Business     Canada         <20        Fair       20
M       Business     Canada         <20        Fair       22
F       Science      Canada         20-25      Fair       24
M       Engineering  Foreign        20-25      Very_good  22
F       Engineering  Canada         <20        Excellent  24

Candidate relation for contrasting class: Undergraduate students (Σ = 130)

Example: Analytical Characterization (3)
• 3. Relevance analysis
  • Calculate the expected information required to classify an arbitrary tuple:

$$I(s_1, s_2) = I(120, 130) = -\frac{120}{250}\log_2\frac{120}{250} - \frac{130}{250}\log_2\frac{130}{250} = 0.9988$$

  • Calculate the entropy of each attribute, e.g., major. Here s_1j is the number of graduate students and s_2j the number of undergraduate students with the j-th value of major (so s_11 = number of grad students in “Science”, s_21 = number of undergrad students in “Science”):

    For major = “Science”:      s_11 = 84   s_21 = 42   I(s_11, s_21) = 0.9183
    For major = “Engineering”:  s_12 = 36   s_22 = 46   I(s_12, s_22) = 0.9892
    For major = “Business”:     s_13 = 0    s_23 = 42   I(s_13, s_23) = 0


Example: Analytical Characterization (4)

• Calculate the expected information required to classify a given sample if S is partitioned according to the attribute:

$$E(\mathit{major}) = \frac{126}{250} I(s_{11}, s_{21}) + \frac{82}{250} I(s_{12}, s_{22}) + \frac{42}{250} I(s_{13}, s_{23}) = 0.7873$$

• Calculate the information gain for each attribute:

$$\mathrm{Gain}(\mathit{major}) = I(s_1, s_2) - E(\mathit{major}) = 0.2115$$

• Information gain for all attributes:

    Gain(gender)        = 0.0003
    Gain(birth_country) = 0.0407
    Gain(major)         = 0.2115
    Gain(gpa)           = 0.4490
    Gain(age_range)     = 0.5971
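The relevance analysis above can be reproduced from the two candidate relations. The sketch below is one possible way to organize the computation; the names `info`, `gain`, `ATTRS`, `GRAD` and `UNDERGRAD` are introduced here for illustration, while the rows and counts are copied from the candidate relations for the graduate and undergraduate classes.

```python
from collections import defaultdict
from math import log2

def info(counts):
    """Expected information I(s1, ..., sm) for a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

ATTRS = ["gender", "major", "birth_country", "age_range", "gpa"]

# (gender, major, birth_country, age_range, gpa, count)
GRAD = [
    ("M", "Science", "Canada", "20-25", "Very_good", 16),
    ("F", "Science", "Foreign", "25-30", "Excellent", 22),
    ("M", "Engineering", "Foreign", "25-30", "Excellent", 18),
    ("F", "Science", "Foreign", "25-30", "Excellent", 25),
    ("M", "Science", "Canada", "20-25", "Excellent", 21),
    ("F", "Engineering", "Canada", "20-25", "Excellent", 18),
]
UNDERGRAD = [
    ("M", "Science", "Foreign", "<20", "Very_good", 18),
    ("F", "Business", "Canada", "<20", "Fair", 20),
    ("M", "Business", "Canada", "<20", "Fair", 22),
    ("F", "Science", "Canada", "20-25", "Fair", 24),
    ("M", "Engineering", "Foreign", "20-25", "Very_good", 22),
    ("F", "Engineering", "Canada", "<20", "Excellent", 24),
]

def gain(attr):
    """Gain(A) = I(s1, s2) - E(A) for one attribute of the candidate relations."""
    idx = ATTRS.index(attr)
    per_value = defaultdict(lambda: [0, 0])  # value -> [grad count, undergrad count]
    for cls, rows in enumerate((GRAD, UNDERGRAD)):
        for row in rows:
            per_value[row[idx]][cls] += row[-1]
    total = sum(sum(pair) for pair in per_value.values())
    class_totals = [sum(pair[k] for pair in per_value.values()) for k in (0, 1)]
    entropy = sum(sum(pair) / total * info(pair) for pair in per_value.values())
    return info(class_totals) - entropy

for attr in sorted(ATTRS, key=gain):
    print(f"Gain({attr}) = {gain(attr):.4f}")
```

Sorting by gain gives the same ranking as the list above, so a relevance threshold can then be applied by keeping only the attributes whose gain exceeds it.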


Example: Analytical Characterization (5)

• 4. Initial working relation (W_0) derivation
  • R = 0.1
  • remove irrelevant/weakly relevant attributes from the candidate relation => drop gender, birth_country
  • remove the contrasting class candidate relation

major        age_range  gpa        count
Science      20-25      Very_good  16
Science      25-30      Excellent  47
Science      20-25      Excellent  21
Engineering  20-25      Excellent  18
Engineering  25-30      Excellent  18

Initial target class working relation W_0: Graduate students

• 5. Perform attribute-oriented induction on W_0 using T_i


Mining Class Comparisons

• Comparison: comparing two or more classes.
• Method:
  • Partition the set of relevant data into the target class and the contrasting class(es)
  • Generalize both classes to the same high-level concepts
  • Compare tuples with the same high-level descriptions
  • Present for every tuple its description and two measures:
    • support: distribution within a single class
    • comparison: distribution between classes
  • Highlight the tuples with strong discriminant features
• Relevance analysis:
  • Find attributes (features) which best distinguish different classes.

Example: Analytical Comparison
• Task
  • Compare graduate and undergraduate students using a discriminant rule.
• DMQL query

    use Big_University_DB
    mine comparison as “grad_vs_undergrad_students”
    in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
    for “graduate_students”
    where status in “graduate”
    versus “undergraduate_students”
    where status in “undergraduate”
    analyze count%
    from student


Example: Analytical Comparison (2)

• Given
  • attributes name, gender, major, birth_place, birth_date, residence, phone# and gpa
  • Gen(a_i) = concept hierarchies on attributes a_i
  • U_i = attribute analytical thresholds for attributes a_i
  • T_i = attribute generalization thresholds for attributes a_i
  • R = attribute relevance threshold

Example: Analytical Comparison (3)

• 1. Data collection
  • target and contrasting classes
• 2. Attribute relevance analysis
  • remove attributes name, gender, major, phone#
• 3. Synchronous generalization
  • controlled by user-specified dimension thresholds
  • prime target and contrasting class(es) relations/cuboids


Example: Analytical Comparison (4)

Birth_country  Age_range  Gpa        Count%
Canada         20-25      Good       5.53%
Canada         25-30      Good       2.32%
Canada         Over_30    Very_good  5.86%
…              …          …          …
Other          Over_30    Excellent  4.68%

Prime generalized relation for the target class: Graduate students

Birth_country  Age_range  Gpa        Count%
Canada         15-20      Fair       5.53%
Canada         15-20      Good       4.53%
…              …          …          …
Canada         25-30      Good       5.02%
…              …          …          …
Other          Over_30    Excellent  0.68%

Prime generalized relation for the contrasting class: Undergraduate students


Example: Analytical Comparison (5)

• 4. Drill down, roll up and other OLAP operations on target and contrasting classes to adjust the levels of abstraction of the resulting description
• 5. Presentation
  • as generalized relations, crosstabs, bar charts, pie charts, or rules
  • contrasting measures to reflect the comparison between target and contrasting classes
    • e.g., count%


Quantitative Discriminant Rules

• C_j = target class
• q_a = a generalized tuple that covers some tuples of the target class
  • but can also cover some tuples of the contrasting class(es)
• d-weight
  • range: [0, 1]

$$d\text{-}\mathit{weight} = \frac{\mathrm{count}(q_a \in C_j)}{\sum_{i=1}^{m} \mathrm{count}(q_a \in C_i)}$$

• Quantitative discriminant rule form:

$$\forall X,\ \mathit{target\_class}(X) \Leftarrow \mathit{condition}(X)\ [d: d\_\mathit{weight}]$$

Example: Quantitative Discriminant Rule

Status         Birth_country  Age_range  Gpa   Count
Graduate       Canada         25-30      Good  90
Undergraduate  Canada         25-30      Good  210

Count distribution between graduate and undergraduate students for a generalized tuple

• Quantitative discriminant rule

$$\forall X,\ \mathit{graduate\_student}(X) \Leftarrow \mathit{birth\_country}(X) = \text{``Canada''} \wedge \mathit{age\_range}(X) = \text{``25-30''} \wedge \mathit{gpa}(X) = \text{``good''}\ [d: 30\%]$$

  • where 90/(90+210) = 30%


Class Description
• Quantitative characteristic rule (necessary condition):

$$\forall X,\ \mathit{target\_class}(X) \Rightarrow \mathit{condition}(X)\ [t: t\_\mathit{weight}]$$

• Quantitative discriminant rule (sufficient condition):

$$\forall X,\ \mathit{target\_class}(X) \Leftarrow \mathit{condition}(X)\ [d: d\_\mathit{weight}]$$

• Quantitative description rule (necessary and sufficient condition):

$$\forall X,\ \mathit{target\_class}(X) \Leftrightarrow \mathit{condition}_1(X)\ [t: w_1, d: w'_1] \vee \cdots \vee \mathit{condition}_n(X)\ [t: w_n, d: w'_n]$$


Example: Quantitative Description Rule

               TV                    Computer              Both_items
Location/item  Count  t-wt    d-wt   Count  t-wt    d-wt   Count  t-wt  d-wt
Europe         80     25%     40%    240    75%     30%    320    100%  32%
N_Am           120    17.65%  60%    560    82.35%  70%    680    100%  68%
Both_regions   200    20%     100%   800    80%     100%   1000   100%  100%

Crosstab showing associated t-weight, d-weight values and total number (in thousands) of TVs and computers sold at AllElectronics in 1998

• Quantitative description rule for target class Europe

$$\forall X,\ \mathit{Europe}(X) \Leftrightarrow (\mathit{item}(X) = \text{``TV''})\ [t: 25\%, d: 40\%] \vee (\mathit{item}(X) = \text{``computer''})\ [t: 75\%, d: 30\%]$$
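The t-weights and d-weights in the crosstab follow directly from the counts: a t-weight divides a cell by its row (class) total, and a d-weight divides it by its column total across classes. A small sketch, assuming a nested dictionary `sales` and helper names `t_weight` and `d_weight` that are introduced here only for illustration:

```python
# Counts (in thousands) copied from the AllElectronics crosstab.
sales = {
    "Europe": {"TV": 80, "computer": 240},
    "N_Am": {"TV": 120, "computer": 560},
}

def t_weight(region, item):
    """Share of the class (region) covered by this item: cell / row total."""
    return sales[region][item] / sum(sales[region].values())

def d_weight(region, item):
    """Share of this item that falls in the class: cell / column total."""
    return sales[region][item] / sum(row[item] for row in sales.values())

for region in sales:
    for item in sales[region]:
        print(f"{region:7s} {item:9s} "
              f"t = {t_weight(region, item):.2%}  d = {d_weight(region, item):.2%}")
```

For the target class Europe this yields the two conditions of the description rule: TV with t = 25%, d = 40% and computer with t = 75%, d = 30%.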


Mining Data Dispersion Characteristics

• Motivation
  • To better understand the data: central tendency, variation and spread
• Data dispersion characteristics
  • median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
  • Data dispersion: analyzed with multiple granularities of precision
  • Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
  • Folding measures into numerical dimensions
  • Boxplot or quantile analysis on the transformed cube

Measuring the Central Tendency

• Mean:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

  • Weighted arithmetic mean:

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

• Median: a holistic measure
  • Middle value if odd number of values, or average of the middle two values otherwise
  • Estimated by interpolation:

$$\mathit{median} = L_1 + \left(\frac{n/2 - \left(\sum f\right)_l}{f_{\mathit{median}}}\right) c$$

• Mode
  • Value that occurs most frequently in the data
  • Unimodal, bimodal, trimodal
  • Empirical formula:

$$\mathit{mean} - \mathit{mode} = 3 \times (\mathit{mean} - \mathit{median})$$
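These measures map onto Python's standard `statistics` module, with the weighted mean and the interpolated median written out by hand. The data values and the helper names `weighted_mean` and `grouped_median` below are made up for illustration; the parameters of `grouped_median` are named after one common reading of the interpolation formula (lower boundary of the median interval, interval width, total count, cumulative frequency below the interval, frequency of the interval).

```python
from statistics import mean, median, multimode

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def grouped_median(lower, width, n, cum_freq_below, freq_median):
    """Interpolated median: L1 + ((n/2 - (sum f)_l) / f_median) * c."""
    return lower + (n / 2 - cum_freq_below) / freq_median * width

data = [1, 2, 2, 3, 4, 6]  # hypothetical sample

print("mean   =", mean(data))
print("median =", median(data))      # average of the two middle values
print("modes  =", multimode(data))   # one entry means the data is unimodal
print("weighted mean =", weighted_mean([1, 2, 3], [1, 1, 2]))
print("grouped median =", grouped_median(lower=20, width=10, n=100,
                                         cum_freq_below=40, freq_median=25))
```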


Measuring the Dispersion of Data

• Quartiles, outliers and boxplots
  • Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  • Inter-quartile range: IQR = Q3 – Q1
  • Five-number summary: min, Q1, M, Q3, max
  • Boxplot: the ends of the box are the quartiles, the median is marked, whiskers extend from the box, and outliers are plotted individually
  • Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
• Variance and standard deviation
  • Variance s² (algebraic, scalable computation):

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$$

  • Standard deviation s is the square root of the variance s²

Boxplot Analysis

• Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum
• Boxplot
  • Data is represented with a box
  • The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
  • The median is marked by a line within the box
  • Whiskers: two lines outside the box extend to Minimum and Maximum

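The five-number summary and the 1.5 × IQR outlier rule can be sketched with `statistics.quantiles` from the standard library. Quartile conventions differ between tools; this sketch simply uses that function's default method, and the sample values and the helper names `five_number_summary` and `iqr_outliers` are hypothetical.

```python
from statistics import quantiles

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the library's default quartiles."""
    q1, med, q3 = quantiles(data, n=4)
    return min(data), q1, med, q3, max(data)

def iqr_outliers(data):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

data = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 40]  # hypothetical sample

print("five-number summary:", five_number_summary(data))
print("outliers:", iqr_outliers(data))
```

In a boxplot that plots outliers individually, the whiskers would stop at the most extreme values inside the two fences rather than at the minimum and maximum.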
A Boxplot
[Figure: a boxplot]


Visualization of Data
Dispersion: Boxplot Analysis



Mining Descriptive Statistical Measures in Large Databases

• Variance

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$$

• Standard deviation: the square root of the variance
  • Measures spread about the mean
  • It is zero if and only if all the values are equal
  • Both the standard deviation and the variance are algebraic
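Because the variance is algebraic, it can be computed from three distributive summaries (n, Σx, Σx²) that are collected per partition and then added together. The sketch below illustrates this with two hypothetical partitions; `summarize`, `merge` and `variance_from_summary` are names introduced here. Note that the sum-of-squares form can lose precision through cancellation on data with a large mean and a small spread.

```python
from statistics import variance

def summarize(chunk):
    """Distributive summary of one partition: (n, sum of x, sum of x squared)."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))

def merge(a, b):
    """Summaries of two partitions combine by component-wise addition."""
    return tuple(p + q for p, q in zip(a, b))

def variance_from_summary(summary):
    """s^2 = (sum x^2 - (sum x)^2 / n) / (n - 1)."""
    n, sum_x, sum_x2 = summary
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

part1 = [2, 4, 4, 4]  # hypothetical partition 1
part2 = [5, 5, 7, 9]  # hypothetical partition 2

s2 = variance_from_summary(merge(summarize(part1), summarize(part2)))
print("variance from merged summaries:", s2)
print("variance from the raw data:    ", variance(part1 + part2))
```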


Histogram Analysis

• Graph displays of basic statistical class descriptions
  • Frequency histograms
    • A univariate graphical method
    • Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data


Quantile Plot
• Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
• Plots quantile information
  • For data x_i sorted in increasing order, f_i indicates that approximately 100 f_i% of the data are below or equal to the value x_i
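The points of a quantile plot can be generated as follows. The slide does not say how f_i is computed; this sketch assumes the common convention f_i = (i − 0.5)/n for the i-th smallest of n values, and the function name `quantile_plot_points` is introduced here for illustration.

```python
def quantile_plot_points(data):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n (assumed convention)."""
    xs = sorted(data)
    n = len(xs)
    return [((i - 0.5) / n, x) for i, x in enumerate(xs, start=1)]

for f, x in quantile_plot_points([40, 10, 30, 20]):
    print(f"f = {f:.3f}  x = {x}")
```

Plotting x against f for every point gives the quantile plot; a Q-Q plot pairs the x values of two such lists at matching f values.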


Quantile-Quantile (Q-Q) Plot

• Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
• Allows the user to view whether there is a shift in going from one distribution to another


Scatter Plot

• Provides a first look at bivariate data to see clusters of points, outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as a point in the plane


Loess Curve
• Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence
• A loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression


Graphic Displays of Basic Statistical Descriptions
• Histogram: (shown before)
• Boxplot: (covered before)
• Quantile plot: each value x_i is paired with f_i, indicating that approximately 100 f_i% of the data are ≤ x_i
• Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
• Scatter plot: each pair of values is treated as a pair of coordinates and plotted as a point in the plane
• Loess (local regression) curve: adds a smooth curve to a scatter plot to provide better perception of the pattern of dependence


AO Induction vs. Learning-from-example Paradigm

• Difference in philosophies and basic assumptions
  • Positive and negative samples in learning-from-example: positive samples are used for generalization, negative ones for specialization
  • Positive samples only in data mining: hence generalization-based; to drill down, backtrack the generalization to a previous state
• Difference in methods of generalization
  • Machine learning generalizes on a tuple-by-tuple basis
  • Data mining generalizes on an attribute-by-attribute basis

Comparison of Entire vs.
Factored Version Space



Incremental and Parallel Mining of Concept Description

• Incremental mining: revision based on newly added data ΔDB
  • Generalize ΔDB to the same level of abstraction as the generalized relation R to derive ΔR
  • Take the union R ∪ ΔR, i.e., merge counts and other statistical information to produce a new relation R’
• A similar philosophy can be applied to data sampling, parallel and/or distributed mining, etc.

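The incremental step amounts to generalizing only the newly added tuples and adding their counts to the existing generalized relation. A minimal sketch, assuming the relation is held as a `Counter` keyed by generalized tuples; `incremental_update`, `generalize` and the sample rows are hypothetical and introduced only for illustration.

```python
from collections import Counter

def generalize(row):
    """Hypothetical generalization to the level already used in R."""
    major, birth_place = row
    major_level = {"CS": "Science", "Physics": "Science"}[major]
    region = "Canada" if birth_place.endswith("Canada") else "Foreign"
    return (major_level, region)

def incremental_update(relation, new_rows):
    """Generalize the newly added rows and merge their counts into the relation."""
    delta = Counter(generalize(row) for row in new_rows)
    return relation + delta  # Counter addition sums counts of identical tuples

R = Counter({("Science", "Canada"): 16, ("Science", "Foreign"): 22})
new_rows = [("CS", "Vancouver, BC, Canada"), ("Physics", "Seattle, WA, USA")]

R_new = incremental_update(R, new_rows)
print(R_new)
```

The same merge works for partitions processed in parallel: each worker produces its own counter and the results are added together.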
Summary

• Concept description: characterization and discrimination
• OLAP-based vs. attribute-oriented induction
• Efficient implementation of AOI
• Analytical characterization and comparison
• Mining descriptive statistical measures in large databases
• Discussion
  • Incremental and parallel mining of description
  • Descriptive mining of complex types of data



http://www.cs.sfu.ca/~han/dmbook

Thank you !!!


