Data Mining - Functionalities, Classification and Task Primitives
Presented by K.Aravind (10mx03), M.Boobalan (10mx05), V.Boopathiraj (10mx06), S.Kadhiresan (10mx18), L.RoshanAli (10mx41), A.Selvaraj (10mx46)
Discrimination
• Data entries can be associated with classes or concepts.
• It can be useful to describe individual classes and concepts in summarized, concise, and precise terms. Such descriptions of a class or a concept are called class/concept descriptions.
• Data characterization is a summarization of the general characteristics or features of a target class of data.
• Data discrimination is a comparison of the general features of the target class data objects against the general features of objects from one or multiple contrasting classes.
Association Rules
Association rules are discarded as uninteresting if they do not satisfy both a minimum support threshold and a minimum confidence threshold.
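The threshold test above can be sketched in a few lines of Python. This is a minimal illustration, not a full rule miner: the transactions, item names, and threshold values are hypothetical, and `support`, `confidence`, and `is_interesting` are helper names chosen for this example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / base


def is_interesting(transactions, antecedent, consequent,
                   min_support, min_confidence):
    """A rule is kept only if it meets BOTH thresholds."""
    s = support(transactions, set(antecedent) | set(consequent))
    c = confidence(transactions, antecedent, consequent)
    return s >= min_support and c >= min_confidence


# Hypothetical toy data: each transaction is a set of purchased items.
transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"computer", "software", "printer"},
    {"printer"},
    {"computer"},
]

# Rule: computer -> software
rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
kept = is_interesting(transactions, {"computer"}, {"software"},
                      min_support=0.3, min_confidence=0.45)
```

Raising `min_confidence` to 0.6 in this toy data would cause the same rule to be discarded, since only half of the transactions containing a computer also contain software.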
• Classification: finding a model that describes and distinguishes data classes or concepts for future prediction
• Class label is known
• E.g., classify countries based on climate, or classify cars based on gas mileage
• Presentation: decision tree, classification rule, neural network
• Prediction: predict some unknown or missing numerical values
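As a rough illustration of classification with known class labels, the sketch below learns a single classification rule (a one-level decision tree) for the gas-mileage example. The training values, the labels "low" and "high", and the function names are assumptions made for this example; real classifiers use many attributes and more careful split criteria.

```python
def learn_threshold(samples, low_label, high_label):
    """Pick the mileage threshold that misclassifies the fewest samples.

    `samples` is a list of (value, label) pairs and must contain at
    least two distinct values.
    """
    values = sorted({v for v, _ in samples})
    # Candidate thresholds are midpoints between neighbouring values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(t):
        return sum((high_label if v >= t else low_label) != y
                   for v, y in samples)

    return min(candidates, key=errors)


def classify(value, threshold, low_label, high_label):
    """Apply the learned rule: value >= threshold -> high_label."""
    return high_label if value >= threshold else low_label


# Hypothetical labelled training data: (miles per gallon, class label).
training = [(18, "low"), (21, "low"), (24, "low"),
            (31, "high"), (35, "high"), (40, "high")]

threshold = learn_threshold(training, "low", "high")
label_for_new_car = classify(29, threshold, "low", "high")
```

The learned rule reads as "if mileage is at least the threshold, the car is in the high-mileage class", which is the classification-rule style of presentation mentioned above.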
Cluster Analysis
• Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
• Clustering is based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity
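The clustering principle can be checked numerically. The sketch below uses Euclidean distance as an assumed inverse of similarity (small distance means high similarity) and compares two hypothetical groupings of the same house locations. The coordinates and function names are illustrative only.

```python
from itertools import combinations, product
from math import dist
from statistics import mean


def mean_intra_distance(clusters):
    """Average distance between points that share a cluster."""
    return mean(dist(p, q)
                for cluster in clusters
                for p, q in combinations(cluster, 2))


def mean_inter_distance(clusters):
    """Average distance between points in different clusters."""
    return mean(dist(p, q)
                for a, b in combinations(clusters, 2)
                for p, q in product(a, b))


# Hypothetical house locations as (x, y) coordinates.
good = [[(1, 1), (1, 2), (2, 1)], [(8, 8), (9, 8), (8, 9)]]
bad = [[(1, 1), (8, 8), (2, 1)], [(1, 2), (9, 8), (8, 9)]]

good_intra = mean_intra_distance(good)
good_inter = mean_inter_distance(good)
bad_intra = mean_intra_distance(bad)
bad_inter = mean_inter_distance(bad)
```

The first grouping has a smaller within-cluster distance and a larger between-cluster distance than the second, so it is the better clustering under this principle.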
Outlier Analysis
• Outlier: a data object that does not comply with the general behavior of the data
• It can be considered as noise or an exception, but it is quite useful in fraud detection and rare events analysis
• The analysis of outlier data is referred to as outlier analysis or anomaly mining.
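One simple way to flag objects that do not follow the general behavior of the data is a z-score test. The slides do not name a specific method, so the technique, the threshold of 2.0, and the transaction amounts below are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def find_outliers(values, z_threshold=2.0):
    """Return values whose z-score magnitude exceeds `z_threshold`."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > z_threshold]


# Hypothetical card transaction amounts; 500 is the suspicious one.
amounts = [20, 22, 19, 21, 23, 20, 22, 500]
suspicious = find_outliers(amounts)
```

In a fraud-detection setting the flagged values would be candidates for review rather than noise to discard. A z-score is sensitive to the outliers themselves, so more robust measures are often preferred on real data.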
Cont
• A data mining system/query may generate thousands of patterns; not all of them are interesting.
• Suggested approach: human-centered, query-based, focused mining
• Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
• Objective vs. subjective interestingness measures:
  • Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  • Subjective: based on the user's beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.
clustering, trend, deviation and outlier analysis, etc.
• Multiple/integrated functions and mining at multiple levels
Cont
• Techniques utilized
  • Database-oriented, data warehouse (OLAP), machine learning,
performed on data
• specification of data to be mined
• set of data in which the user is interested
• kinds of knowledge to be mined
• background knowledge useful in guiding the discovery process
• specification of how knowledge should be visualized
• Background knowledge
  • concept hierarchies
• Interestingness measures
  • separate patterns from knowledge
of results:
• P(X: customer, W) AND Q(X, Y) -> buys(X, Z)
• age(X, 30..39) AND income(X, 40K..49K) -> buys(X, VCR) [2.2%, 60%]
• The task might specify classifying an input file of customers as "likely to buy" or "not likely to buy"
• [2.2%, 60%] indicates that 60% confidence is to be used and that such cases should represent 2.2% of all transactions (the support)
• Types of hierarchies
  • schema hierarchy
  • set-grouping hierarchy
  • operation-derived hierarchy
  • rule-based hierarchy
Concept Hierarchies
• Schema
  • total or partial order among attributes, usually in a warehouse
or range values
• Operation defined
  • automatically derived, clustering, extraction, etc.
• Rule-based
  • hierarchy may be well defined by a set of rules
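Two of these hierarchy types can be sketched as small lookup structures. The attribute order, the age ranges, and the group names below are hypothetical examples, not taken from the slides: a schema hierarchy is modelled as an ordering among attributes, and a set-grouping hierarchy as named ranges of values.

```python
# Schema hierarchy (assumed example): attributes ordered from most
# specific to most general.
SCHEMA_ORDER = ["street", "city", "state", "country"]


def roll_up(record, level):
    """Generalize a record by keeping attributes at `level` or above."""
    keep = SCHEMA_ORDER[SCHEMA_ORDER.index(level):]
    return {name: record[name] for name in keep}


# Set-grouping hierarchy (assumed example): value ranges with names.
AGE_GROUPS = [
    (0, 19, "youth"),
    (20, 39, "young adult"),
    (40, 59, "middle-aged"),
    (60, 150, "senior"),
]


def age_group(age):
    """Map a raw age to the name of the range that contains it."""
    for low, high, name in AGE_GROUPS:
        if low <= age <= high:
            return name
    raise ValueError(f"age {age} is outside every defined range")


address = {"street": "1 Main St", "city": "Vancouver",
           "state": "BC", "country": "Canada"}
by_state = roll_up(address, "state")
group = age_group(35)
```

Mining at a higher level of either hierarchy (state instead of street, age group instead of exact age) trades detail for patterns that cover more records.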
Interestingness Measures
• Simplicity: e.g., (association) rule length, (decision) tree size
• Certainty: e.g., confidence, P(A|B) = n(A and B) / n(B), classification reliability or accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.
• Utility: potential usefulness, e.g., support (association), noise threshold (description)
• Novelty: not previously known, surprising (used to remove redundant rules, e.g., Canada vs. Vancouver rule implication support ratio)
representation
• E.g., rules, tables, crosstabs, pie/bar charts, etc.
• Concept hierarchy is also important
• Discovered knowledge might be more understandable when
Thank you!!!