Data Mining 1
Chapter 1. Introduction
► Need for Data Mining
► Types of Applications
► Summary
Need for Data Mining
► Explosive Growth of Data: from terabytes to petabytes
What Is Data Mining?
Architecture of Data Mining System
Knowledge Discovery (KDD) Process
[Figure: KDD process, bottom to top: Databases → Data Cleaning and Data Integration → Task-relevant Data → Data Mining → Pattern Evaluation]
KDD Process
► Step 1: Data Cleaning: Removing noise and inconsistent data
► Step 2: Data Integration: Combining multiple data sources
► Step 3: Data Selection: Retrieving task-relevant data from the database
► Step 4: Data Transformation: Consolidating and transforming data into forms appropriate for mining (e.g., summarization and aggregation)
► Step 5: Data Mining: Applying intelligent methods to extract data patterns
► Step 6: Pattern Evaluation: Identifying interesting patterns using interestingness measures
► Step 7: Knowledge Presentation: Presenting knowledge to the users
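The seven KDD steps can be sketched as a chain of small functions. This is only a toy illustration: the record fields, the two hypothetical sources (`store_a`, `store_b`), and the "above-mean spend" pattern are assumptions made for the example, not part of the KDD definition.

```python
from statistics import mean

# Hypothetical raw records from two sources; field names are illustrative only.
store_a = [
    {"customer": "c1", "amount": 120.0},
    {"customer": "c2", "amount": None},   # missing value (noise)
    {"customer": "c3", "amount": 40.0},
]
store_b = [
    {"customer": "c1", "amount": 80.0},
    {"customer": "c3", "amount": 35.0},
    {"customer": "c4", "amount": -5.0},   # inconsistent value
]

def clean(records):
    """Step 1: drop records with missing or negative amounts."""
    return [r for r in records if r["amount"] is not None and r["amount"] >= 0]

def integrate(*sources):
    """Step 2: combine multiple data sources into one collection."""
    return [r for source in sources for r in source]

def select(records, customers):
    """Step 3: keep only the task-relevant records."""
    return [r for r in records if r["customer"] in customers]

def transform(records):
    """Step 4: aggregate amounts per customer."""
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals

def mine(totals):
    """Step 5: a deliberately trivial pattern, customers spending above the mean."""
    threshold = mean(totals.values())
    return {c: t for c, t in totals.items() if t > threshold}

def evaluate(patterns, min_total):
    """Step 6: keep only patterns that pass an interestingness threshold."""
    return {c: t for c, t in patterns.items() if t >= min_total}

def present(patterns):
    """Step 7: render the surviving patterns for the user."""
    return [f"{c}: total spend {t:.2f}" for c, t in sorted(patterns.items())]

data = integrate(clean(store_a), clean(store_b))
totals = transform(select(data, {"c1", "c3", "c4"}))
report = present(evaluate(mine(totals), min_total=100.0))
print(report)
```

In a real system each step is far more involved; the point here is only the ordering and the hand-off of data between steps.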
Data Mining in Business Intelligence
[Figure: pyramid showing increasing potential to support business decisions toward the top. Recoverable layers: Data Exploration (Statistical Summary, Querying, and Reporting) lower down; Decision Making (End User) at the top]
Data Mining: Sources of Data
► Database Data
► Relational database - Trends and data patterns
► Data warehouse
► Transactional database (each transaction identified by a transaction ID, TID)
Data Mining: Sources of Data
► Advanced data sets and advanced applications
► Data streams and sensor data
► Time-series data, temporal data, sequence data (incl. bio-sequences)
► Structured data, graphs, social networks and multi-linked data
► Object-relational databases
► Heterogeneous databases and legacy databases
► Spatial data and spatiotemporal data
► Multimedia database
► Text databases
► Web databases
Data Mining Function: (1) Class/Concept Description
► Class Description
► Items can be classes: descriptions of items such as computers and printers
► Concept Description
► Virtual concepts such as big spenders, budget spenders, low-income group, middle class, etc.
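A concept description can be illustrated by assigning each customer to a concept and then summarizing the members of that concept. The sketch below is a minimal example under stated assumptions: the customer fields and the spending cutoff of 1000 are hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical customer records; fields and values are illustrative only.
customers = [
    {"name": "Ana", "annual_spend": 2400, "age": 41},
    {"name": "Ben", "annual_spend": 300, "age": 23},
    {"name": "Cy", "annual_spend": 1800, "age": 37},
    {"name": "Di", "annual_spend": 450, "age": 29},
]

def concept_of(customer, cutoff=1000):
    """Map a customer to a concept; the cutoff is an assumed threshold."""
    return "big spender" if customer["annual_spend"] >= cutoff else "budget spender"

def describe(customers):
    """Summarize each concept by size and by the mean of its members' attributes."""
    groups = {}
    for c in customers:
        groups.setdefault(concept_of(c), []).append(c)
    return {
        concept: {
            "count": len(members),
            "mean_spend": mean(m["annual_spend"] for m in members),
            "mean_age": mean(m["age"] for m in members),
        }
        for concept, members in groups.items()
    }

summary = describe(customers)
for concept, stats in summary.items():
    print(concept, stats)
```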
Data Mining Function: (3) Classification
Data Mining Function: (4) Cluster Analysis
Data Mining Function: (5) Outlier Analysis
► Outlier analysis
► Outlier: A data object that does not comply with the general behavior of the data
► Anomaly mining
► Methods: Probability models, or as a by-product of clustering or regression analysis, …
► Applications: Fraud detection, rare events analysis
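As one simple instance of the probability-model approach, values can be flagged when they lie far from the mean in units of standard deviation (a z-score test). The sketch assumes roughly normally distributed data; the threshold of 2.0 and the sample amounts are assumptions for illustration, and very small samples limit how large a z-score can get.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []           # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts with one suspicious entry.
amounts = [20, 22, 19, 21, 23, 20, 500]
flagged = zscore_outliers(amounts)
print(flagged)
```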
Time and Ordering: Sequential Pattern, Trend and Evolution Analysis
► Periodicity analysis
► Motifs and biological sequence analysis
► Approximate and consecutive motifs
► Similarity-based analysis
Structure and Network Analysis
► Graph mining
► Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments)
► Information network analysis
► Social networks: actors (objects, nodes) and relationships (edges)
► e.g., author networks in CS, terrorist networks
► Multiple heterogeneous networks
► A person could be in multiple information networks: friends, family, classmates, …
► Links carry a lot of semantic information: Link mining
► Web mining
► Web is a big information network: from PageRank to Google
► Analysis of Web information networks
► Web community discovery, opinion mining, usage mining, …
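PageRank, mentioned above, scores a page by the scores of the pages linking to it. The following is a minimal power-iteration sketch, not a production implementation: the damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages without outgoing links, and the four-page `web` graph are all assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly over its outgoing links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page without outgoing links: spread its rank over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical link graph.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Page C ends up with the highest score because three of the four pages link to it, while D, which nothing links to, keeps only the baseline share.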
Evaluation of Knowledge
► Is all mined knowledge interesting?
► One can mine a tremendous amount of “patterns” and knowledge
► Some may fit only a certain dimension space (time, location, …)
► Some may not be representative, may be transient, …
► Evaluation of mined knowledge → directly mine only interesting knowledge?
► Descriptive vs. predictive
► Coverage
► Typicality vs. novelty
► Accuracy
► Timeliness
► …
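Two of the criteria listed above, coverage and accuracy, can be made concrete for a rule of the form "if condition then conclusion". In this sketch, coverage is taken to be the fraction of records the condition applies to, and accuracy the fraction of those records for which the conclusion also holds. These definitions, the sample records, and the rule itself are assumptions chosen for illustration.

```python
def rule_coverage_and_accuracy(records, condition, conclusion):
    """Return (coverage, accuracy) of the rule 'if condition then conclusion'."""
    covered = [r for r in records if condition(r)]
    coverage = len(covered) / len(records) if records else 0.0
    correct = sum(1 for r in covered if conclusion(r))
    accuracy = correct / len(covered) if covered else 0.0
    return coverage, accuracy

# Hypothetical customer records.
records = [
    {"age": 25, "buys_laptop": True},
    {"age": 31, "buys_laptop": True},
    {"age": 47, "buys_laptop": False},
    {"age": 22, "buys_laptop": False},
    {"age": 52, "buys_laptop": False},
]

# Rule under evaluation: "if age < 35 then buys_laptop".
coverage, accuracy = rule_coverage_and_accuracy(
    records,
    condition=lambda r: r["age"] < 35,
    conclusion=lambda r: r["buys_laptop"],
)
print(f"coverage={coverage:.2f}, accuracy={accuracy:.2f}")
```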
Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the confluence of multiple disciplines. Recoverable labels: Machine Learning, Pattern Recognition, Statistics]
Why Confluence of Multiple Disciplines?
► Tremendous amount of data
► Algorithms must be highly scalable to handle data volumes such as terabytes
► High-dimensionality of data
► A microarray may have tens of thousands of dimensions
► High complexity of data
► Data streams and sensor data
► Time-series data, temporal data, sequence data
► Structured data, graphs, social networks and multi-linked data
► Heterogeneous databases and legacy databases
► Spatial, spatiotemporal, multimedia, text and Web data
► Software programs, scientific simulations
► New and sophisticated applications
Applications of Data Mining
► Web page analysis: from web page classification and clustering to PageRank & HITS algorithms
► Collaborative analysis & recommender systems
► Basket data analysis to targeted marketing
► Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
► Data mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue)
► From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining
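Basket data analysis typically starts by counting which items are bought together. The sketch below counts item pairs across transactions and keeps those whose support (the fraction of transactions containing the pair) meets a minimum. The transaction IDs, items, and the 0.75 threshold are hypothetical, and brute-force pair counting is used for clarity rather than a scalable algorithm.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Return {(item_a, item_b): support} for pairs meeting min_support."""
    counts = Counter()
    for items in transactions.values():
        # Sort and de-duplicate so each pair has one canonical form per basket.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

# Hypothetical transactional data keyed by transaction ID (TID).
baskets = {
    "T100": ["bread", "milk"],
    "T200": ["bread", "diapers", "beer"],
    "T300": ["milk", "diapers", "beer"],
    "T400": ["bread", "milk", "diapers", "beer"],
}
pairs = frequent_pairs(baskets, min_support=0.75)
print(pairs)
```

A pair that clears the threshold is a candidate for targeted marketing, for example promoting one item to buyers of the other.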
Major Issues in Data Mining (1)
► Mining Methodology
► Mining various and new kinds of knowledge
► Mining knowledge in multi-dimensional space
► Data mining: An interdisciplinary effort
► Boosting the power of discovery in a networked environment
► Handling noise, uncertainty, and incompleteness of data
► Pattern evaluation and pattern- or constraint-guided mining
► User Interaction
► Interactive mining
► Incorporation of background knowledge
► Presentation and visualization of data mining results
Major Issues in Data Mining (2)
A Brief History of Data Mining Society
Summary
► T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003.
► U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
► U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001.
► J. Han and M. Kamber. Data Mining: Concepts and Techniques. 3rd ed., Morgan Kaufmann, 2011.
► D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001.
► T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed., Springer-Verlag, 2009.
► P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Wiley, 2005.
► I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. 2nd ed., Morgan Kaufmann, 2005.