Data Mining: Concepts and Techniques
— Slides for Textbook —
— Chapter 5 —
Concept description:
can handle complex data types of the
OLAP:
restricted to a small number of
Data generalization
A process which abstracts a large set of task-
attribute generalization.
Apply aggregation by merging identical, generalized
use Big_University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major,
birth_place, birth_date, residence, phone#, gpa
from student
where status in “graduate”
Corresponding SQL statement:
Gender    Birth_Region
          Canada    Foreign    Total
M         16        14         30
F         10        22         32
Total     26        36         62
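The totals in the crosstab above follow directly from the four cell counts. A minimal sketch of that aggregation is given below; the function name `margins` and the dictionary layout are assumptions made for illustration, not part of the original slides.

```python
# Cell counts taken from the crosstab: (gender, birth_region) -> count.
cells = {
    ("M", "Canada"): 16,
    ("M", "Foreign"): 14,
    ("F", "Canada"): 10,
    ("F", "Foreign"): 22,
}


def margins(cell_counts):
    """Return (row totals, column totals, grand total) for a 2-D crosstab."""
    row_totals, col_totals = {}, {}
    for (row, col), n in cell_counts.items():
        row_totals[row] = row_totals.get(row, 0) + n
        col_totals[col] = col_totals.get(col, 0) + n
    return row_totals, col_totals, sum(cell_counts.values())


row_totals, col_totals, grand_total = margins(cells)
print(row_totals, col_totals, grand_total)
```

The same routine works for any pair of generalized attributes, since it only relies on the accumulated counts.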
“subprime” relation
Use a predefined & precomputed data cube
Construct a data cube beforehand
storage overhead
April 7, 2025 Data Mining: Concepts and Techniq 16
Chapter 5: Concept Description: Characterization and Comparison
What is concept description?
Data generalization and summarization-based characterization
Analytical characterization: Analysis of attribute relevance
Mining class comparisons: Discriminating between different classes
Mining descriptive statistical measures in large databases
Discussion
Summary
Characterization vs. OLAP
Similarity:
Presentation of data summarization at multiple levels of abstraction.
Interactive drilling, pivoting, slicing and dicing.
Differences:
Automated desired level allocation.
Dimension relevance analysis and ranking when there are many relevant dimensions.
Sophisticated typing on dimensions and measures.
Analytical characterization: data dispersion analysis.
Why?
Which dimensions should be included?
How high a level of generalization?
What?
statistical method for preprocessing data
filter out irrelevant or weakly relevant attributes
retain or rank the relevant attributes
relevance related to dimensions and levels
analytical characterization, analytical comparison
How?
Data Collection
Analytical Generalization
Use information gain analysis (e.g., entropy or other
measures) to identify highly relevant dimensions and
levels.
Relevance Analysis
Sort and select the most relevant dimensions and levels.
Attribute-oriented Induction for class description
On selected dimension/level
OLAP operations (e.g. drilling, slicing) on
relevance rules
gini index
uncertainty coefficient
Decision tree
each internal node tests an attribute
ID3 algorithm
build decision tree based on training objects
minimal height
the least number of tests to classify an object
See example
Top-Down Induction of Decision Tree
[Figure: decision tree. Root node: Outlook. Subtree Humidity (high: no, normal: yes); leaf: yes; subtree Wind (strong: no, weak: yes).]
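The top-down induction described above can be sketched as a short recursive procedure: at each node, pick the attribute with the highest information gain, split on its values, and recurse until a node is pure. This is a minimal sketch rather than a full ID3 implementation; the four training rows and the function names are hypothetical, and only the attribute names `humidity` and `wind` echo the tree above.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Reduction in entropy obtained by splitting on attr."""
    total = len(rows)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def id3(rows, labels, attrs):
    """Build a tree as nested dicts; leaves are class labels."""
    if len(set(labels)) == 1:          # pure node
        return labels[0]
    if not attrs:                      # no tests left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    rest = [a for a in attrs if a != best]
    node = {"attr": best, "branches": {}}
    for value in sorted(set(row[best] for row in rows)):
        pairs = [(r, lab) for r, lab in zip(rows, labels) if r[best] == value]
        node["branches"][value] = id3(
            [r for r, _ in pairs], [lab for _, lab in pairs], rest
        )
    return node


def classify(tree, row):
    """Follow attribute tests from the root down to a leaf label."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attr"]]]
    return tree


# Hypothetical training objects (not from the slides).
rows = [
    {"humidity": "high", "wind": "weak"},
    {"humidity": "high", "wind": "strong"},
    {"humidity": "normal", "wind": "weak"},
    {"humidity": "normal", "wind": "strong"},
]
labels = ["no", "no", "yes", "yes"]

tree = id3(rows, labels, ["humidity", "wind"])
prediction = classify(tree, {"humidity": "normal", "wind": "strong"})
print(tree, prediction)
```

On this toy data `humidity` separates the classes completely while `wind` carries no information, so the sketch needs a single test per object.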
Task
Mine general characteristics describing
Given
attributes name, gender, major, birth_place,
remove name and phone#
attribute generalization
generalize major, birth_place, birth_date and gpa
accumulate counts
candidate relation: gender, major,
birth_country, age_range and gpa
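The steps listed above (drop attributes such as name and phone#, generalize the remaining ones, then merge identical generalized tuples while accumulating counts) can be sketched as follows. The concept hierarchies, the sample students, the grade thresholds, and the use of a birth year instead of a full birth date are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical concept hierarchies (not from the slides).
MAJOR_TO_FIELD = {"CS": "Science", "Physics": "Science", "Civil": "Engineering"}
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Montreal": "Canada", "Seattle": "Foreign"}


def age_range(birth_year, current_year=2025):
    """Map a birth year to a 5-year age bucket such as '20-25'."""
    low = ((current_year - birth_year) // 5) * 5
    return f"{low}-{low + 5}"


def gpa_grade(gpa):
    """Map a numeric GPA to a coarse grade (thresholds are assumed)."""
    if gpa >= 3.5:
        return "excellent"
    if gpa >= 3.0:
        return "very_good"
    return "good"


def generalize(student):
    """Drop name and phone, and climb each remaining attribute one level."""
    return (
        student["gender"],
        MAJOR_TO_FIELD[student["major"]],
        CITY_TO_COUNTRY[student["birth_place"]],
        age_range(student["birth_year"]),
        gpa_grade(student["gpa"]),
    )


def attribute_oriented_induction(students):
    """Merge identical generalized tuples and accumulate their counts."""
    return Counter(generalize(s) for s in students)


students = [
    {"name": "A", "gender": "M", "major": "CS", "birth_place": "Vancouver",
     "birth_year": 2003, "phone": "000-0001", "gpa": 3.6},
    {"name": "B", "gender": "M", "major": "Physics", "birth_place": "Montreal",
     "birth_year": 2002, "phone": "000-0002", "gpa": 3.8},
    {"name": "C", "gender": "F", "major": "Civil", "birth_place": "Seattle",
     "birth_year": 1999, "phone": "000-0003", "gpa": 3.2},
]

prime_relation = attribute_oriented_induction(students)
for generalized_tuple, count in prime_relation.items():
    print(generalized_tuple, count)
```

Students A and B differ in every raw attribute except gender, yet collapse into one generalized tuple with count 2, which is the effect the induction relies on.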
arbitrary tuple
$$I(s_1, s_2) = I(120, 130) = -\frac{120}{250}\log_2\frac{120}{250} - \frac{130}{250}\log_2\frac{130}{250} = 0.9988$$
Number of grad students in “Science”
Number of undergrad students in “Science”
Calculate information gain for each attribute:
$$Gain(major) = I(s_1, s_2) - E(major) = 0.2115$$
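The calculation above can be checked numerically. The sketch below computes the expected information for the class counts 120 and 130, then subtracts the entropy of a partition by major. The per-major (grad, undergrad) counts are an assumption and are not given in this text; they were chosen to sum to 120 and 130 and produce a gain close to the 0.2115 quoted above. The function names are likewise illustrative.

```python
import math


def expected_info(counts):
    """I(s_1, ..., s_m): entropy in bits of a class-count distribution."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def attribute_entropy(partitions):
    """E(A): weighted average of the entropy within each attribute value."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * expected_info(p) for p in partitions)


def gain(class_counts, partitions):
    """Gain(A) = I(s_1, ..., s_m) - E(A)."""
    return expected_info(class_counts) - attribute_entropy(partitions)


class_counts = (120, 130)  # graduate, undergraduate (from the text)

# Assumed (grad, undergrad) counts per major value; illustrative only.
major_partitions = [(84, 42), (36, 46), (0, 42)]

i_s = expected_info(class_counts)
gain_major = gain(class_counts, major_partitions)
print(round(i_s, 4), round(gain_major, 4))
```

Repeating the `gain` call for each candidate attribute and sorting the results gives the relevance ranking used to keep or drop dimensions.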
use Big_University_DB
mine comparison as “grad_vs_undergrad_students”
in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
for “graduate_students”
where status in “graduate”
versus “undergraduate_students”
where status in “undergraduate”
analyze count%
from student
Given
attributes name, gender, major, birth_place,
1. Data collection
target and contrasting classes
3. Synchronous generalization
controlled by user-specified dimension thresholds
prime target and contrasting class(es)
relations/cuboids
5. Presentation
as generalized relations, crosstabs, bar
Cj = target class
qa = a generalized tuple that covers some tuples of the target class, but can also cover some tuples of a contrasting class
$$d\text{-}weight = \frac{count(q_a \in C_j)}{\sum_{i=1}^{m} count(q_a \in C_i)}$$
range: [0, 1]
Count distribution between graduate and undergraduate students for a generalized tuple
$$\forall X,\ graduate\_student(X) \Leftarrow birth\_country(X) = \text{"Canada"} \wedge age\_range(X) = \text{"25-30"} \wedge gpa(X) = \text{"good"} \quad [d: 30\%]$$
where 90/(90+210) = 30%
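The d-weight of a generalized tuple is its count in the target class divided by its total count over all classes. A minimal sketch follows; it assumes the tuple covers 90 graduate students and 210 undergraduate students, the latter being the count that makes the graduate share equal to the stated 30%. The function name is illustrative.

```python
def d_weights(counts_by_class):
    """Return the d-weight of one generalized tuple for every class."""
    total = sum(counts_by_class.values())
    if total == 0:
        raise ValueError("the generalized tuple covers no tuples")
    return {cls: n / total for cls, n in counts_by_class.items()}


# Counts of the same generalized tuple in each class.
# 210 is inferred from the 30% figure, not read from a table in this text.
counts = {"graduate": 90, "undergraduate": 210}

weights = d_weights(counts)
print(weights)
```

A d-weight near 1 means the tuple mostly describes the target class; a value near 0 means it mostly describes the contrasting classes.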
Both_regions 200 20% 100% 800 80% 100% 1000 100% 100%
Crosstab showing associated t-weight, d-weight values and total number (in thousands) of TVs and computers sold at AllElectronics in 1998
Motivation
To better understand the data: central tendency,
variation and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
Data dispersion: analyzed with multiple granularities
of precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
Mean: $$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Weighted arithmetic mean: $$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
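Both measures of central tendency translate directly into code. The sketch below is illustrative; the sample values and weights are hypothetical.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their number."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


data = [2.0, 4.0, 6.0]       # hypothetical observations
weights = [1.0, 1.0, 2.0]    # hypothetical weights

plain = mean(data)
weighted = weighted_mean(data, weights)
print(plain, weighted)
```

With equal weights the weighted mean reduces to the plain mean; here the last value counts twice, so the result shifts toward it.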
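$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$$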
the box
Whiskers: two lines outside the box
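Variance
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$$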
algebraic
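The two forms of the sample variance can be compared in a short sketch. The second form needs only the count, the sum, and the sum of squares, so it can be computed from per-partition totals without revisiting the data. The function names and the sample values are hypothetical.

```python
def variance_deviations(values):
    """Sample variance from squared deviations around the mean."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / (n - 1)


def variance_from_sums(n, total, total_sq):
    """Sample variance from the count, the sum, and the sum of squares."""
    return (total_sq - total * total / n) / (n - 1)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical observations

v_dev = variance_deviations(data)
v_sum = variance_from_sums(len(data), sum(data), sum(x * x for x in data))
print(v_dev, v_sum)
```

The sum-based form is convenient for aggregation, but it subtracts two large numbers and can lose precision when the mean is large relative to the spread; the deviation form is numerically safer.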
A univariate graphical method
Consists of a set of rectangles that reflect the counts
or frequencies of the classes present in the given data
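A histogram of this kind reduces to counting how often each class occurs. The sketch below counts frequencies and renders one text bar per class in place of a rectangle; the sample values and the function names are hypothetical.

```python
from collections import Counter


def histogram(values):
    """Count how often each class (category) occurs."""
    return Counter(values)


def render(counts):
    """Return one text bar per class, sorted by class label."""
    return [f"{label:<10} {'#' * n} ({n})" for label, n in sorted(counts.items())]


# Hypothetical generalized gpa values.
grades = ["good", "excellent", "good", "very_good", "good", "excellent"]

counts = histogram(grades)
for line in render(counts):
    print(line)
```

For a numeric attribute, the values would first be mapped to intervals (bins) and the same counting applied to the interval labels.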
tuple basis
Data mining generalizes on an attribute-by-attribute basis
Comparison of Entire vs. Factored Version Space