Unit#6 - Data Mining For Data Sciences
1st Term Final Year
Data Sciences
and Analytics
(DSA)
Prof. Dr. M. S. Memon
Course In charge
[email protected]
Unit# 6: Data Mining for Data Sciences
Decision support progress to Data Mining
M. S. Memon
CSE Dept. QUEST Nawabshah
Why Data Mining?
• Alternative names
• Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
• Watch out: Is everything “data mining”? Not these:
• Simple search and query processing
• (Deductive) expert systems
Knowledge Discovery (KDD) Process
[Figure: KDD process diagram. Recoverable labels: Databases, Data Integration, Data Cleaning, Task-relevant Data]
Data Mining and Business Intelligence
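[Figure: business intelligence pyramid. Recoverable labels, with potential to support business decisions increasing toward the top: Decision Making (End User); Data Exploration (Statistical Summary, Querying, and Reporting)]
[Figure: disciplines surrounding Data Mining. Recoverable labels: Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, Algorithm, Other Disciplines]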
Data Mining: A KDD Process
Steps of KDD Process
• Learning the application domain:
– relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may take 60% of effort!)
• Data reduction and transformation:
– Find useful features, dimensionality/variable reduction, invariant
representation
• Choosing functions of data mining
– summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
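The steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the records, the field names, the age bands, and the `min_count` threshold are all hypothetical, and the "mining" step is a simple frequency count, not a particular algorithm from the course.

```python
from collections import Counter, defaultdict

# Hypothetical raw records (made-up data for illustration only).
RAW = [
    {"id": 1, "age": 34, "item": "bread", "store": "A"},
    {"id": 2, "age": None, "item": "milk", "store": "A"},   # missing age
    {"id": 3, "age": 51, "item": "bread", "store": "B"},
    {"id": 3, "age": 51, "item": "bread", "store": "B"},    # duplicate record
    {"id": 4, "age": 27, "item": "milk", "store": "B"},
    {"id": 5, "age": 45, "item": "bread", "store": "A"},
    {"id": 6, "age": 30, "item": "milk", "store": "A"},
]

def select(records, fields=("id", "age", "item")):
    """Create the target data set: keep only task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Data cleaning: drop records with a missing age and repeated ids."""
    seen, out = set(), []
    for r in records:
        if r["age"] is None or r["id"] in seen:
            continue
        seen.add(r["id"])
        out.append(r)
    return out

def transform(records):
    """Data reduction/transformation: replace exact age with an age band."""
    return [("under_40" if r["age"] < 40 else "40_plus", r["item"]) for r in records]

def mine(pairs):
    """Data mining: find the most frequent item in each age band."""
    counts = defaultdict(Counter)
    for band, item in pairs:
        counts[band][item] += 1
    return {band: c.most_common(1)[0] for band, c in counts.items()}

def evaluate(patterns, min_count=2):
    """Pattern evaluation: keep only patterns seen at least min_count times."""
    return {band: item for band, (item, n) in patterns.items() if n >= min_count}

def kdd(records):
    return evaluate(mine(transform(clean(select(records)))))

result = kdd(RAW)
print(result)
```

Each function maps to one bullet, so a step can be swapped (for example, a different cleaning rule) without touching the others.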
Data Mining Applications
• Database analysis and decision support
– Market analysis and management
• target marketing, customer relationship management,
market basket analysis, cross-selling, market
segmentation
– Risk analysis and management
• Forecasting, customer retention, improved underwriting, quality
control, competitive analysis
– Fraud detection and management
• Other Applications
– Text mining (newsgroups, email, documents)
– Stream data mining
– Web mining
– DNA data analysis
Data Mining Techniques
• Data mining covers a broad range of
techniques including:
– Classification
– Clustering
– Sequential Pattern mining
– Association rule mining
– Many more …
• Each technique is realized by specific
algorithms
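As one concrete instance of a technique and an algorithm that realizes it, the sketch below does classification with a one-nearest-neighbour rule. The choice of algorithm, the training points, and the labels are assumptions made for illustration; the slides do not prescribe them.

```python
import math

# Hypothetical labelled examples: (monthly_spend, visits_per_month) -> segment
TRAIN = [
    ((20.0, 2.0), "occasional"),
    ((25.0, 3.0), "occasional"),
    ((80.0, 9.0), "frequent"),
    ((75.0, 12.0), "frequent"),
]

def classify_1nn(point, train=TRAIN):
    """Return the label of the training example closest to `point`
    (Euclidean distance)."""
    _features, label = min(train, key=lambda ex: math.dist(ex[0], point))
    return label

prediction = classify_1nn((70.0, 10.0))
print(prediction)
```

Classification needs labelled examples, which is what separates it from clustering, where no class labels are given.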
Why Not Traditional Data Analysis?
• Knowledge to be mined
• Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
• Multiple/integrated functions and mining at multiple levels
• Techniques utilized
• Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, etc.
• Applications adapted
• Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
market analysis, text mining, Web mining, etc.
Data Mining: Classification Schemes
• General functionality
• Descriptive data mining
• Predictive data mining
• Different views lead to different classifications
• Data view: Kinds of data to be mined
• Knowledge view: Kinds of knowledge to be discovered
• Method view: Kinds of techniques utilized
• Application view: Kinds of applications adapted
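The descriptive/predictive split can be illustrated on one small series. The sales figures below are made up, and a least-squares line is only one possible predictive model: descriptive mining summarizes the data already held, while predictive mining estimates a value not yet observed.

```python
from statistics import mean

# Hypothetical monthly sales (made-up numbers for illustration).
months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

def describe(values):
    """Descriptive: summarize the data already observed."""
    return {"mean": mean(values), "min": min(values), "max": max(values)}

def fit_line(xs, ys):
    """Predictive: ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

summary = describe(sales)
slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept  # estimate for the unseen month 6
print(summary, forecast)
```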
Data Mining: On What Kinds of Data?
Data Mining Functionalities (2)
• Cluster analysis
• Class label is unknown: Group data to form new classes, e.g., cluster
houses to find distribution patterns
• Maximizing intra-class similarity & minimizing interclass similarity
• Outlier analysis
• Outlier: Data object that does not comply with the general behavior of the
data
• Noise or exception? Useful in fraud detection, rare events analysis
• Trend and evolution analysis
• Trend and deviation: e.g., regression analysis
• Sequential pattern mining: e.g., digital camera → large SD memory
• Periodicity analysis
• Similarity-based analysis
• Other pattern-directed or statistical analyses
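Cluster analysis, the first functionality listed above, can be sketched with a one-dimensional k-means (Lloyd's algorithm). The house prices and the starting centroids are hypothetical, and k-means is just one clustering method chosen for illustration.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Lloyd's algorithm on 1-D data with caller-supplied starting centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical house prices (in thousands); no class labels are given.
PRICES = [100, 110, 120, 300, 310, 320]
centroids, clusters = kmeans_1d(PRICES, centroids=[100, 320])
print(centroids, clusters)
```

Values within a cluster lie close to their own centroid (high intra-class similarity) and far from the other centroid (low interclass similarity).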
Top-10 Most Popular DM Algorithms: Topic Areas
1. Classification
2. Statistical Learning
3. Link Mining
4. Clustering
5. Association and Aggregation
6. Bagging and Boosting
7. Sequential Patterns
8. Integrated Mining
9. Rough Sets
10. Graph Mining
Major Issues in Data Mining
• Mining methodology
• Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
• Performance: efficiency, effectiveness, and scalability
• Pattern evaluation: the interestingness problem
• Incorporation of background knowledge
• Handling noise and incomplete data
• Parallel, distributed and incremental mining methods
• Integration of the discovered knowledge with existing knowledge: knowledge fusion
• User interaction
• Data mining query languages and ad-hoc mining
• Expression and visualization of data mining results
• Interactive mining of knowledge at multiple levels of abstraction
• Applications and social impacts
• Domain-specific data mining & invisible data mining
• Protection of data security, integrity, and privacy
Why Data Mining?—Potential Applications
Ex. 3: Fraud Detection & Mining Unusual Patterns
5 Primitives that Define a Data Mining Task
Primitive 2: Type of knowledge
Primitive 3: Background Knowledge
Primitive 4: Pattern Interestingness Measure
• Simplicity
e.g., (association) rule length, (decision) tree size
• Certainty
e.g., confidence, P(A|B) = #(A and B)/ #(B), classification reliability or
accuracy, certainty factor, rule strength, rule quality, discriminating
weight, etc.
• Utility
potential usefulness, e.g., support (association), noise threshold
(description)
• Novelty
not previously known, surprising (used to remove redundant rules, e.g.,
Illinois vs. Champaign rule implication support ratio)
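Three of these measures can be computed directly on a small transaction set. The transactions below are made up; `confidence` follows the formula given above, P(A|B) = #(A and B) / #(B), and rule length stands in for simplicity.

```python
# Hypothetical market-basket transactions (made-up data for illustration).
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def count(itemset, transactions=TRANSACTIONS):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions=TRANSACTIONS):
    """Utility: fraction of all transactions containing `itemset`."""
    return count(itemset, transactions) / len(transactions)

def confidence(a, b, transactions=TRANSACTIONS):
    """Certainty: P(A|B) = #(A and B) / #(B), for the rule B => A."""
    return count(a | b, transactions) / count(b, transactions)

def rule_length(a, b):
    """Simplicity: number of distinct items appearing in the rule."""
    return len(a | b)

rule_support = support({"bread", "milk"})
rule_confidence = confidence({"milk"}, {"bread"})
print(rule_support, rule_confidence, rule_length({"milk"}, {"bread"}))
```

Here bread and milk occur together in 2 of 4 transactions (support 0.5), and 2 of the 3 transactions containing bread also contain milk (confidence 2/3).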
Primitive 5: Presentation of Discovered Patterns
Coupling Data Mining with DB
[Figure: data mining system architecture. Recoverable labels: Pattern Evaluation, Data Mining Engine, Knowledge-Base, Database or Data Warehouse Server]