Introduction To Data Mining 1604
DATA MINING
Prepared by:
Anita Parmar
DATA MINING:
CONCEPTS AND TECHNIQUES
— CHAPTER 2 —
CHAPTER 2. INTRODUCTION
KNOWLEDGE DISCOVERY FROM DATA (KDD) PROCESS
Data mining: the core of the knowledge discovery process
[Figure: KDD process. Stages shown: Databases, Data Cleaning, Data Integration, Task-relevant Data, Data Mining, Pattern Evaluation]
KDD PROCESS: SEVERAL KEY STEPS
1. Data cleaning: noise and inconsistent data are removed (may take 60% of the effort!).
2. Data integration: multiple data sources may be combined.
3. Data selection: data relevant to the analysis task are retrieved from the database.
4. Data transformation: data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
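The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not a reference implementation: the two record lists, the field names (`customer`, `item`, `amount`), and the helper functions `clean`, `integrate`, `select`, `transform`, and `prepare` are all hypothetical and chosen just to make each step visible.

```python
from collections import defaultdict

# Hypothetical records from two sources; all names and values are made up.
store_a = [
    {"customer": "c1", "item": "milk", "amount": 2.5},
    {"customer": "c2", "item": "bread", "amount": None},  # missing value
    {"customer": "c1", "item": "bread", "amount": 1.5},
]
store_b = [
    {"customer": "c3", "item": "milk", "amount": 3.0},
    {"customer": "c3", "item": "pen", "amount": 1.0},
]


def clean(records):
    """Step 1, data cleaning: drop records whose amount is missing."""
    return [r for r in records if r["amount"] is not None]


def integrate(*sources):
    """Step 2, data integration: combine several sources into one list."""
    return [r for source in sources for r in source]


def select(records, items):
    """Step 3, data selection: keep only records relevant to the task."""
    return [r for r in records if r["item"] in items]


def transform(records):
    """Step 4, data transformation: aggregate the amount per customer."""
    totals = defaultdict(float)
    for r in records:
        totals[r["customer"]] += r["amount"]
    return dict(totals)


def prepare(sources, items):
    """Run the four steps in the order listed above."""
    cleaned = [clean(s) for s in sources]
    return transform(select(integrate(*cleaned), items))


totals = prepare([store_a, store_b], items={"milk", "bread"})
print(totals)
```

Real cleaning usually involves more than dropping incomplete records (for example, filling in or correcting values); dropping is used here only to keep the sketch short.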
CONTINUE…
DATA MINING AND BUSINESS INTELLIGENCE
[Figure: layers with increasing potential to support business decisions. Labels shown: Data Exploration (Statistical Summary, Querying, and Reporting); Decision Making (End User)]
CONTINUE…
Database, data warehouse, WWW, or other information repository:
A set of databases, data warehouses, spreadsheets, or other kinds of information repositories.
Data cleaning and data integration techniques may be performed on the data.
Knowledge base:
Domain knowledge that is used to guide the search or to evaluate the interestingness of resulting patterns. For example:
Concept hierarchies, used to organize attributes or attribute values into different levels of abstraction.
User beliefs, which can be used to assess a pattern’s interestingness based on its unexpectedness, may also be included.
Additional interestingness constraints or thresholds, and metadata.
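As an illustration of a concept hierarchy, the sketch below maps attribute values to higher levels of abstraction. It is a minimal sketch under stated assumptions: the `location` values, the `PARENT` mapping, and the `generalize` helper are hypothetical examples, not part of any particular data mining system.

```python
# Hypothetical concept hierarchy for a "location" attribute.
# Each value maps to its parent at the next level of abstraction.
PARENT = {
    "Vancouver": "British Columbia",
    "Victoria": "British Columbia",
    "Chicago": "Illinois",
    "British Columbia": "Canada",
    "Illinois": "USA",
}


def generalize(value, levels=1):
    """Climb the hierarchy by the given number of levels.

    Values with no parent (the top of the hierarchy) are returned unchanged.
    """
    for _ in range(levels):
        value = PARENT.get(value, value)
    return value


cities = ["Vancouver", "Victoria", "Chicago"]
by_province = [generalize(c) for c in cities]
by_country = [generalize(c, levels=2) for c in cities]
print(by_province)
print(by_country)
```

Generalizing values this way lets the same data be mined at the city, province or state, or country level.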
CONTINUE…
Data mining engine:
Essential to the data mining system
Consists of a set of functional modules.
User interface:
Communicates between users and the data mining system.
Allows the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results.
Allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.
DATA MINING: ON WHAT KINDS OF DATA?
DATA MINING FUNCTIONALITIES : WHAT KINDS OF PATTERNS CAN BE MINED
ARE ALL THE “DISCOVERED” PATTERNS INTERESTING?
Data mining may generate thousands of patterns; not all of them are interesting.
Interestingness measures
A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm.
Objective vs. subjective interestingness measures
Objective: based on statistics and structures of patterns, e.g., support and confidence.
Subjective: based on the user’s beliefs about the data.
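The two objective measures named above, support and confidence, can be computed directly from transaction data. The sketch below uses the standard definitions for a rule A => B: support is the fraction of transactions containing both A and B, and confidence is support(A and B) divided by support(A). The transactions, the threshold values, and the function names are hypothetical, chosen only for illustration.

```python
# Hypothetical market-basket transactions; the items are made up.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of (antecedent and consequent) divided by support of antecedent."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the rule never applies, so treat confidence as zero
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base


def is_interesting(antecedent, consequent, transactions,
                   min_support=0.4, min_confidence=0.6):
    """Objective test: the rule must meet both (assumed) thresholds."""
    both = set(antecedent) | set(consequent)
    return (support(both, transactions) >= min_support
            and confidence(antecedent, consequent, transactions) >= min_confidence)


s = support({"milk", "bread"}, transactions)
c = confidence({"milk"}, {"bread"}, transactions)
print(s, c, is_interesting({"milk"}, {"bread"}, transactions))
```

A subjective measure would need something this code cannot supply: a model of what the user already believes, against which a rule can be judged unexpected.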
FIND ALL AND ONLY INTERESTING PATTERNS?
CLASSIFICATION OF DATA MINING SYSTEM
[Figure: Data Mining shown together with related disciplines: Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, Algorithm, Other Disciplines]
Kinds of Databases to be mined
Relational, data warehouse, transactional, stream, object-oriented/relational,
spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
Kinds of Knowledge to be mined
Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
Multiple/integrated functions and mining at multiple levels
Kinds of Techniques utilized
Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market
analysis, text mining, Web mining, etc.
PRIMITIVES THAT DEFINE A DATA MINING TASK
Task-relevant data
Database or data warehouse name
Database tables or data warehouse cubes
Condition for data selection
Relevant attributes or dimensions
Data grouping criteria
Type of knowledge to be mined
Characterization, discrimination, association, classification, prediction,
clustering, outlier analysis, other data mining tasks
Background knowledge
Pattern interestingness measurements
Visualization/presentation of discovered patterns
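One way to picture these primitives is as the fields of a single task description. The sketch below is only an illustration: the `MiningTask` class, its field names, the `validate` helper, and the example values (`sales_db`, `transactions`, and so on) are hypothetical and do not correspond to any real data mining query language.

```python
from dataclasses import dataclass, field

# Kinds of knowledge listed in the primitives above.
KNOWLEDGE_TYPES = {
    "characterization", "discrimination", "association", "classification",
    "prediction", "clustering", "outlier analysis",
}


@dataclass
class MiningTask:
    # Task-relevant data
    database: str
    tables: list
    condition: str
    attributes: list
    group_by: list
    # Type of knowledge to be mined
    knowledge_type: str
    # Background knowledge, e.g. concept hierarchies
    background_knowledge: dict = field(default_factory=dict)
    # Pattern interestingness measurements, e.g. thresholds
    interestingness: dict = field(default_factory=dict)
    # Visualization/presentation of discovered patterns
    presentation: str = "table"


def validate(task):
    """Return a list of problems found in the task description."""
    problems = []
    if not task.tables:
        problems.append("no tables or cubes specified")
    if not task.attributes:
        problems.append("no relevant attributes specified")
    if task.knowledge_type not in KNOWLEDGE_TYPES:
        problems.append("unknown knowledge type: " + task.knowledge_type)
    return problems


task = MiningTask(
    database="sales_db",
    tables=["transactions"],
    condition="amount > 100",
    attributes=["item", "amount"],
    group_by=["customer"],
    knowledge_type="association",
    interestingness={"min_support": 0.05, "min_confidence": 0.7},
    presentation="rules",
)
problems = validate(task)
print(problems)
```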
PRIMITIVE 3: BACKGROUND KNOWLEDGE
COUPLING DATA MINING WITH DB/DW SYSTEMS
Mining methodology
Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
Performance: efficiency, effectiveness, and scalability
Pattern evaluation: the interestingness problem
Incorporation of background knowledge
Handling noise and incomplete data
Parallel, distributed and incremental mining methods
Integration of the discovered knowledge with existing one: knowledge fusion
User interaction
Data mining query languages and ad-hoc mining
Expression and visualization of data mining results
Interactive mining of knowledge at multiple levels of abstraction
SUMMARY