
Mining Class Comparisons

The document discusses comparing two or more classes by partitioning data into target and contrasting classes, generalizing the classes, and comparing tuples to highlight discriminant features between classes. An example is provided that analyzes graduate and undergraduate students using attributes like birthplace, age, GPA to find distinguishing attributes between the classes. The process involves data collection, attribute analysis, generalization of relations, and presentation of results as charts or rules to show comparisons between target and contrasting classes.

Uploaded by

murali_20c
Copyright
© Attribution Non-Commercial (BY-NC)

Mining Class Comparisons

• Comparison: comparing two or more classes.

• Method:
  – Partition the set of relevant data into the target class and the contrasting class(es)
  – Generalize both classes to the same high-level concepts
  – Compare tuples with the same high-level descriptions
  – Present for every tuple its description and two measures:
      ◦ support: distribution within a single class
      ◦ comparison: distribution between classes
  – Highlight the tuples with strong discriminant features
• Relevance Analysis:
  – Find attributes (features) which best distinguish different classes.
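
The two measures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `class_measures`, the dictionary layout, and the counts are hypothetical. Support is read here as a tuple's share of its own class, and comparison as the class's share of that tuple across all classes.

```python
from collections import defaultdict

def class_measures(counts):
    """counts maps (class_name, generalized_tuple) -> count.

    Returns {(class_name, tuple): (support, comparison)} where
    support    = count / total count of that class, and
    comparison = count / total count of that tuple over all classes.
    """
    class_totals = defaultdict(int)
    tuple_totals = defaultdict(int)
    for (cls, tup), n in counts.items():
        class_totals[cls] += n
        tuple_totals[tup] += n
    return {
        (cls, tup): (n / class_totals[cls], n / tuple_totals[tup])
        for (cls, tup), n in counts.items()
    }

# Hypothetical counts for two generalized tuples in two classes.
counts = {
    ("target", ("Canada", "20-25")): 30,
    ("target", ("Other", "Over_30")): 70,
    ("contrast", ("Canada", "20-25")): 90,
    ("contrast", ("Other", "Over_30")): 10,
}
measures = class_measures(counts)
support, comparison = measures[("target", ("Canada", "20-25"))]
print(f"support={support:.2f}, comparison={comparison:.2f}")
```

A tuple with high support in the target class and a comparison value far from an even split is a candidate for a strong discriminant feature.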
Example: Analytical comparison
• Task
  – Compare graduate and undergraduate students using a discriminant rule.
  – DMQL query:

    use Big_University_DB
    mine comparison as "grad_vs_undergrad_students"
    in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
    for "graduate_students"
    where status in "graduate"
    versus "undergraduate_students"
    where status in "undergraduate"
    analyze count%
    from student

1. Data collection
  – target and contrasting classes

2. Attribute relevance analysis
  – remove attributes name, gender, major, phone#

3. Synchronous generalization
  – controlled by user-specified dimension thresholds
  – prime target and contrasting class(es) relations/cuboids

4. Drill down, roll up and other OLAP operations on target and contrasting classes to adjust the levels of abstraction of the resulting description

5. Presentation
  – as generalized relations, crosstabs, bar charts, pie charts, or rules
  – contrasting measures to reflect comparison between target and contrasting classes
      ◦ e.g. count%
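
Steps 1 to 3 and the count% measure can be sketched in plain Python. Everything specific here is an assumption made for illustration: the sample records, the reference date `AS_OF`, the half-open age bins, the `Under_15` label, and the rule that maps a birth place to `Canada` or `Other`. The attributes residence and gpa are left out for brevity.

```python
from collections import Counter
from datetime import date

AS_OF = date(2000, 1, 1)  # assumed reference date for computing ages

def age_range(birth_date):
    """Generalize a birth date to an age range (assumed half-open bins)."""
    age = AS_OF.year - birth_date.year - (
        (AS_OF.month, AS_OF.day) < (birth_date.month, birth_date.day)
    )
    if age >= 30:
        return "Over_30"
    for low in (15, 20, 25):
        if low <= age < low + 5:
            return f"{low}-{low + 5}"
    return "Under_15"  # hypothetical label for ages outside the bins

def generalize(record):
    """Keep only relevant attributes and climb the assumed concept hierarchies."""
    country = record["birth_place"].split(",")[-1].strip()
    if country != "Canada":
        country = "Other"
    return (country, age_range(record["birth_date"]))

def count_percent(records):
    """Return count% of each generalized tuple within one class."""
    tally = Counter(generalize(r) for r in records)
    total = sum(tally.values())
    return {tup: 100 * n / total for tup, n in tally.items()}

# Hypothetical student records; field names follow the DMQL query.
students = [
    {"name": "S1", "status": "graduate",
     "birth_place": "Vancouver, Canada", "birth_date": date(1976, 3, 2)},
    {"name": "S2", "status": "graduate",
     "birth_place": "Seattle, USA", "birth_date": date(1965, 7, 9)},
    {"name": "S3", "status": "undergraduate",
     "birth_place": "Toronto, Canada", "birth_date": date(1982, 1, 15)},
    {"name": "S4", "status": "undergraduate",
     "birth_place": "Montreal, Canada", "birth_date": date(1981, 11, 30)},
]

# Step 1: partition into target and contrasting classes.
target = [s for s in students if s["status"] == "graduate"]
contrast = [s for s in students if s["status"] == "undergraduate"]

# Steps 2-3: generalize both classes with the same hierarchies, then count.
target_pct = count_percent(target)
contrast_pct = count_percent(contrast)
print(target_pct)
print(contrast_pct)
```

Both classes go through the same `generalize` function, which is what makes the generalization synchronous: tuples from the two classes end up at the same abstraction level and can be compared directly.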
Example: Analytical comparison (4)

Prime generalized relation for the target class: Graduate students

  Birth_country   Age_range   …
  Canada          20-25       …
  Canada          25-30       …
  Canada          Over_30     …
  …               …           …
  Other           Over_30     …

Prime generalized relation for the contrasting class: Undergraduate students

  Birth_country   Age_range   …
  Canada          15-20       …
  Canada          15-20       …
  …               …           …
  Canada          25-30       …
  …               …           …
  Other           Over_30     …

Discriminant comparison for one generalized tuple:

  Status          Birth_country   Age_range   Gpa    Count
  Graduate        Canada          25-30       Good   90
  Undergraduate   Canada          25-30       Good   210
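
As a worked illustration of the comparison measure, take the counts 90 (Graduate) and 210 (Undergraduate) given for the tuple (Canada, 25-30, Good). The symbol d is shorthand introduced here for the share of that tuple falling in each class; it is not notation from the original.

```latex
d_{\text{Graduate}} = \frac{90}{90 + 210} = 30\%, \qquad
d_{\text{Undergraduate}} = \frac{210}{90 + 210} = 70\%
```

Under this reading, a student born in Canada, aged 25-30, with a good GPA is more likely to be an undergraduate (70%) than a graduate (30%).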
Measuring the Central Tendency

Mean
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

Weighted arithmetic mean
\[ \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \]
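
A minimal Python sketch of the mean and the weighted arithmetic mean follows. The function names and the sample values are illustrative assumptions.

```python
def mean(xs):
    """Arithmetic mean: (1/n) * sum of x_i."""
    return sum(xs) / len(xs)

def weighted_mean(xs, ws):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(xs) != len(ws):
        raise ValueError("xs and ws must have the same length")
    return sum(w * x for w, x in zip(ws, xs)) / sum(ws)

values = [2.0, 4.0, 9.0]
weights = [1.0, 1.0, 2.0]
print(mean(values))                    # 15 / 3 = 5.0
print(weighted_mean(values, weights))  # (2 + 4 + 18) / 4 = 6.0
```

With all weights equal, the weighted mean reduces to the ordinary mean.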


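Median: A holistic measure
Middle value if odd number of values, or average of the middle two values otherwise
For grouped data, estimated by interpolation:
\[ \text{median} = L_1 + \left( \frac{n/2 - \left(\sum f\right)_l}{f_{\text{median}}} \right) c \]
where L_1 is the lower boundary of the median class, n is the number of values, (∑ f)_l is the sum of the frequencies of all classes below the median class, f_median is the frequency of the median class, and c is the width of the median class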
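
The interpolation estimate for grouped data can be sketched as below. The function name `grouped_median`, the `(lower, upper, frequency)` layout, and the sample classes are assumptions for illustration. `statistics.median` from the Python standard library covers the ungrouped case.

```python
import statistics

def grouped_median(classes):
    """Estimate the median of grouped data by linear interpolation.

    classes: list of (lower, upper, frequency), sorted by lower boundary.
    Implements L1 + ((n/2 - sum of lower frequencies) / f_median) * c.
    """
    n = sum(freq for _, _, freq in classes)
    below = 0  # sum of frequencies of the classes below the current one
    for lower, upper, freq in classes:
        if freq > 0 and below + freq >= n / 2:
            width = upper - lower
            return lower + (n / 2 - below) / freq * width
        below += freq
    raise ValueError("no data")

# Hypothetical grouped data: three classes of width 10.
classes = [(0, 10, 5), (10, 20, 10), (20, 30, 5)]
print(grouped_median(classes))            # 10 + ((10 - 5) / 10) * 10 = 15.0

# Ungrouped data: average of the two middle values for an even count.
print(statistics.median([1, 3, 5, 7]))    # 4.0
```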
Mode
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
Empirical formula (for unimodal, moderately skewed data):

mean − mode ≈ 3 × (mean − median)
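
A short Python sketch of finding the mode(s) and of using the empirical relation to approximate a mode from the mean and median. The helper names `modes` and `empirical_mode` and the sample data are illustrative assumptions, and the empirical relation is only a rough approximation.

```python
from collections import Counter

def modes(xs):
    """Return every value that occurs most frequently, in sorted order."""
    tally = Counter(xs)
    top = max(tally.values())
    return sorted(value for value, n in tally.items() if n == top)

def empirical_mode(mean, median):
    """Rearranged empirical relation: mode is roughly mean - 3 * (mean - median)."""
    return mean - 3 * (mean - median)

data = [1, 2, 2, 3, 3, 4]
found = modes(data)
print(found, "bimodal" if len(found) == 2 else f"{len(found)} mode(s)")
print(empirical_mode(5.0, 4.0))  # 5.0 - 3 * 1.0 = 2.0
```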


Measuring the Dispersion of Data

Quartiles, outliers and boxplots

Quartiles: Q1 (25th percentile), Q3 (75th percentile)
Inter-quartile range: IQR = Q3 − Q1
Five-number summary: min, Q1, M, Q3, max
Boxplot: the ends of the box are the quartiles, the median is marked, whiskers extend from the box, and outliers are plotted individually
Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
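
The five-number summary and the 1.5 × IQR outlier rule can be sketched with the standard library. Quartile conventions differ between tools; the choice of `statistics.quantiles` with `method="inclusive"`, the function names, and the sample data are assumptions made for this illustration.

```python
import statistics

def five_number_summary(xs):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    q1, med, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return min(xs), q1, med, q3, max(xs)

def iqr_outliers(xs, k=1.5):
    """Values more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(xs)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in xs if x < low or x > high]

data = [1, 2, 3, 4, 5, 6, 7, 8, 50]
summary = five_number_summary(data)
outliers = iqr_outliers(data)
print(summary)   # (1, 3.0, 5.0, 7.0, 50)
print(outliers)  # [50]
```

Here IQR = 7 − 3 = 4, so the fences sit at 3 − 6 = −3 and 7 + 6 = 13, and only 50 falls outside them.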
Variance and standard deviation
Variance s^2 (algebraic, scalable computation):
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right] \]
Standard deviation s is the square root of the variance s^2
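
Both forms of the sample variance can be sketched as follows. The second form needs only the running values n, Σx and Σx², which is what makes it computable in a single scan. The function names and the data are illustrative assumptions. On data with a large mean and small spread the one-pass form can lose precision to cancellation.

```python
import math

def variance_two_pass(xs):
    """s^2 = (1 / (n - 1)) * sum((x_i - mean)^2)."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

def variance_one_pass(xs):
    """s^2 = (1 / (n - 1)) * (sum(x_i^2) - (sum(x_i))^2 / n), in one scan."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in xs:
        n += 1
        total += x
        total_sq += x * x
    return (total_sq - total * total / n) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]
v1 = variance_two_pass(data)
v2 = variance_one_pass(data)
print(v1, v2)         # both equal 32 / 7
print(math.sqrt(v1))  # standard deviation s
```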
