Unit #2 - Data Warehouse and Data Mining
Prof. Dr. M. S. Memon
[email protected]
03337037187
May 20, 2023
[Figure: knowledge discovery process. Recoverable labels: Data Cleaning, Data Integration, Warehouse, Data Selection, Task-relevant Data, Data Mining]
[Figure: increasing potential to support business decisions. Recoverable labels: Statistical Summary, Querying, and Reporting; Data Exploration; Decision Making; End User]
[Figure: disciplines contributing to Data Mining. Recoverable labels: Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, Algorithm, Other Disciplines]
• General functionality
• Mining methodology
– Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
– Performance: efficiency, effectiveness, and scalability
– Pattern evaluation: the interestingness problem
– Incorporation of background knowledge
– Handling noise and incomplete data
– Parallel, distributed and incremental mining methods
– Integration of the discovered knowledge with existing knowledge: knowledge fusion
• User interaction
– Data mining query languages and ad-hoc mining
– Expression and visualization of data mining results
– Interactive mining of knowledge at multiple levels of abstraction
• Applications and social impacts
– Domain-specific data mining & invisible data mining
– Protection of data security, integrity, and privacy
• Data mining may generate thousands of patterns, not all of which are interesting
– Suggested approach: Human-centered, query-based, focused mining
• Interestingness measures
– A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
• Objective vs. subjective interestingness measures
– Objective: based on statistics and structures of patterns, e.g., support,
confidence, etc.
– Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty,
actionability, etc.
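The two objective measures named above, support and confidence, can be sketched in a few lines. This is a minimal illustration, not a method taken from the slides: the transactions and the helper names (`count`, `support`, `confidence`) are hypothetical, and confidence for a rule B ⇒ A is taken as the fraction of transactions containing B that also contain A.

```python
from typing import FrozenSet, List

# Hypothetical toy transactions, assumed for illustration only.
transactions: List[FrozenSet[str]] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]


def count(itemset: FrozenSet[str], data: List[FrozenSet[str]]) -> int:
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in data if itemset <= t)


def support(itemset: FrozenSet[str], data: List[FrozenSet[str]]) -> float:
    """Fraction of all transactions that contain `itemset`."""
    return count(itemset, data) / len(data)


def confidence(antecedent: FrozenSet[str],
               consequent: FrozenSet[str],
               data: List[FrozenSet[str]]) -> float:
    """Confidence of the rule antecedent => consequent.

    Computed as #(antecedent and consequent) / #(antecedent).
    Returns 0.0 when the antecedent never occurs.
    """
    n_antecedent = count(antecedent, data)
    if n_antecedent == 0:
        return 0.0
    return count(antecedent | consequent, data) / n_antecedent


bread = frozenset({"bread"})
milk = frozenset({"milk"})
rule_support = support(bread | milk, transactions)        # 2 of 4 transactions
rule_confidence = confidence(bread, milk, transactions)   # 2 of 3 bread transactions
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

Subjective measures such as unexpectedness or actionability depend on the user's beliefs and goals, so they are not captured by counts like these.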
M. S. Memon, CSE Dept., QUEST Nawabshah
Find All and Only Interesting Patterns?
• Simplicity
e.g., (association) rule length, (decision) tree size
• Certainty
e.g., confidence, P(A|B) = #(A and B)/ #(B), classification reliability or
accuracy, certainty factor, rule strength, rule quality, discriminating
weight, etc.
• Utility
potential usefulness, e.g., support (association), noise threshold
(description)
• Novelty
not previously known, surprising (used to remove redundant rules, e.g.,
Illinois vs. Champaign rule implication support ratio)
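As a rough sketch of how three of these measures might be applied as thresholds, the following filters candidate association rules by simplicity (rule length), utility (support), and certainty (confidence). The `Rule` class, the `filter_rules` function, the threshold values, and the candidate rules are all assumptions made for illustration; novelty is left out because it requires comparing rules against each other or against prior knowledge.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Rule:
    """A candidate association rule with precomputed measures (hypothetical)."""
    antecedent: Tuple[str, ...]
    consequent: Tuple[str, ...]
    support: float
    confidence: float

    @property
    def length(self) -> int:
        # Simplicity measure: total number of items in the rule.
        return len(self.antecedent) + len(self.consequent)


def filter_rules(rules: List[Rule],
                 max_length: int = 3,
                 min_support: float = 0.1,
                 min_confidence: float = 0.6) -> List[Rule]:
    """Keep rules that pass all three thresholds."""
    return [
        r for r in rules
        if r.length <= max_length            # simplicity
        and r.support >= min_support         # utility
        and r.confidence >= min_confidence   # certainty
    ]


# Hypothetical candidates with made-up measure values.
candidates = [
    Rule(("bread",), ("milk",), support=0.50, confidence=0.67),
    Rule(("bread", "butter", "jam"), ("milk",), support=0.30, confidence=0.90),  # too long
    Rule(("tea",), ("sugar",), support=0.05, confidence=0.95),                   # support too low
    Rule(("milk",), ("butter",), support=0.25, confidence=0.33),                 # confidence too low
]

kept = filter_rules(candidates)
for r in kept:
    print(r.antecedent, "=>", r.consequent)
```

In practice the thresholds would be supplied by the user as part of the mining request rather than fixed as defaults.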
Primitive 5: Presentation of Discovered Patterns
• Motivation
– A DMQL can provide the ability to support ad-hoc and interactive
data mining
– By providing a standardized language like SQL
• The hope is to achieve an effect similar to the one SQL has had on relational databases
• Foundation for system development and evolution
• Facilitate information exchange, technology transfer,
commercialization and wide acceptance
• Design
– DMQL is designed with the primitives described earlier
[Figure: data mining system architecture. Recoverable labels, top to bottom: Pattern Evaluation; Data Mining Engine; Knowledge-Base; Database or Data Warehouse Server]