Data Mining: Concepts and Techniques (3rd ed.)
Chapter 1
Chapter 1. Introduction
■ Why Data Mining?
■ What Is Data Mining?
■ A Multi-Dimensional View of Data Mining
■ What Kind of Data Can Be Mined?
■ What Kinds of Patterns Can Be Mined?
■ What Technologies Are Used?
■ What Kinds of Applications Are Targeted?
■ Major Issues in Data Mining
■ A Brief History of Data Mining and Data Mining Society
■ Summary
Why Data Mining?
[Figure: knowledge discovery process; surviving labels: Data Cleaning, Data Integration, Databases]
Example: A Web Mining Framework
Data Mining in Business Intelligence
[Figure: layered pyramid; the potential to support business decisions increases toward the top. Layers that survive, top to bottom:]
■ Decision Making (End User)
■ Data Presentation: Visualization Techniques (Business Analyst)
■ Data Mining: Information Discovery (Data Analyst)
■ Data Exploration: Statistical Summary, Querying, and Reporting
KDD Process: A Typical View from ML and Statistics
Multi-Dimensional View of Data Mining
■ Data to be mined
  ■ Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks
■ Knowledge to be mined (or: Data mining functions)
  ■ Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  ■ Descriptive vs. predictive data mining
  ■ Multiple/integrated functions and mining at multiple levels
■ Techniques utilized
  ■ Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.
■ Applications adapted
  ■ Retail, telecommunication, banking, fraud analysis, bio-data
Data Mining: On What Kinds of Data?
■ Database-oriented data sets and applications
  ■ Relational database, data warehouse, transactional database
■ Advanced data sets and advanced applications
  ■ Data streams and sensor data
  ■ Time-series data, temporal data, sequence data (incl. bio-sequences)
  ■ Structured data, graphs, social networks and multi-linked data
  ■ Object-relational databases
  ■ Heterogeneous databases and legacy databases
  ■ Spatial data and spatiotemporal data
  ■ Multimedia databases
  ■ Text databases
  ■ The World-Wide Web
Data Mining Function: (1) Generalization
■ Information integration and data warehouse construction
  ■ Data cleaning, transformation, integration, and multidimensional data model
■ Data cube technology
  ■ Scalable methods for computing (i.e., materializing) multidimensional aggregates
  ■ OLAP (online analytical processing)
■ Multidimensional concept description: Characterization and discrimination
  ■ Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet region
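
To make "materializing multidimensional aggregates" concrete, here is a minimal Python sketch. It is a naive full-cube computation, not one of the scalable methods the slide refers to. The fact table, the dimension names (region, item, year), and the function name `materialize_cube` are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: (region, item, year, amount). All values are made up.
SALES = [
    ("dry", "umbrella", 2010, 5),
    ("dry", "sunscreen", 2010, 40),
    ("wet", "umbrella", 2010, 60),
    ("wet", "umbrella", 2011, 70),
    ("wet", "sunscreen", 2011, 10),
]
DIMENSIONS = ("region", "item", "year")


def materialize_cube(rows, dimensions):
    """Naively compute SUM(amount) for every subset of dimensions (every cuboid)."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(range(len(dimensions)), k):
            cuboid = defaultdict(int)
            for row in rows:
                # Group by the chosen dimensions; the measure is the last field.
                key = tuple(row[i] for i in dims)
                cuboid[key] += row[-1]
            cube[tuple(dimensions[i] for i in dims)] = dict(cuboid)
    return cube


cube = materialize_cube(SALES, DIMENSIONS)
print(cube[()])                   # apex cuboid: the grand total
print(cube[("region",)])          # roll-up to region, contrasting dry vs. wet
print(cube[("region", "item")])   # a finer-grained cuboid
```

With three dimensions the sketch produces 2^3 = 8 cuboids. The exponential growth in the number of cuboids is why scalable cube computation is a research topic in its own right.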
Data Mining Function: (2) Association and Correlation Analysis
■ Frequent patterns (or frequent itemsets)
  ■ What items are frequently purchased together in your Walmart?
■ Association, correlation vs. causality
  ■ A typical association rule
    ■ Diaper → Beer [0.5%, 75%] (support, confidence)
  ■ Are strongly associated items also strongly correlated?
■ How to mine such patterns and rules efficiently in large datasets?
■ How to use such patterns for classification, …
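
The support and confidence attached to a rule such as Diaper → Beer can be computed directly from transaction data. The sketch below is illustrative only: the baskets are made up, so the support differs from the 0.5% on the slide, and `rule_metrics` is a hypothetical helper name. It also reports lift, one common correlation measure that is not named on the slide, as a way to probe whether an associated pair is also correlated.

```python
# Hypothetical market baskets; the items and counts are made up for illustration.
TRANSACTIONS = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "beer", "bread"},
    {"diaper", "milk"},
    {"beer", "bread"},
    {"milk", "bread"},
    {"milk"},
    {"bread", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    s_both = support(antecedent | consequent, transactions)
    s_antecedent = support(antecedent, transactions)
    s_consequent = support(consequent, transactions)
    if s_antecedent == 0 or s_consequent == 0:
        raise ValueError("rule is undefined when either side never occurs")
    confidence = s_both / s_antecedent   # estimate of P(consequent | antecedent)
    lift = confidence / s_consequent     # > 1 suggests positive correlation
    return s_both, confidence, lift


s, c, lift = rule_metrics({"diaper"}, {"beer"}, TRANSACTIONS)
print(f"support={s:.3f} confidence={c:.2f} lift={lift:.2f}")
# prints: support=0.375 confidence=0.75 lift=1.50
```

A lift near 1 would mean the two items co-occur about as often as expected under independence, even when the confidence looks high. Neither measure says anything about causality.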
Data Mining Function: (3) Classification
■ Classification and label prediction
  ■ Construct models (functions) based on some training examples
  ■ Describe and distinguish classes or concepts for future prediction
    ■ E.g., classify countries based on (climate), or classify cars based on (gas mileage)
  ■ Predict some unknown class labels
■ Typical methods
  ■ Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …
■ Typical applications:
  ■ Credit card fraud detection, direct marketing, classifying …
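
As a small illustration of constructing a model from training examples, the sketch below learns a one-level decision tree (a decision stump) that classifies cars by gas mileage. It is a toy under stated assumptions: the mileage figures and the labels "low" and "high" are invented, the helper names are hypothetical, and the learner assumes one numeric attribute, exactly two class labels, and at least two distinct attribute values.

```python
def train_stump(examples):
    """Learn a threshold on one numeric attribute that minimizes training errors.

    examples: list of (value, label) pairs with exactly two distinct labels.
    Returns (threshold, label_at_or_below, label_above).
    """
    values = sorted({v for v, _ in examples})
    labels = sorted({lbl for _, lbl in examples})
    if len(labels) != 2 or len(values) < 2:
        raise ValueError("need two labels and at least two distinct values")
    best = None
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        # Try both ways of assigning the two labels to the two sides.
        for below, above in (labels, labels[::-1]):
            errors = sum(
                1
                for v, lbl in examples
                if (below if v <= threshold else above) != lbl
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, below, above)
    return best[1:]


def predict(model, value):
    """Apply a learned stump to an unseen value."""
    threshold, below, above = model
    return below if value <= threshold else above


# Hypothetical training set: (miles per gallon, mileage class).
CARS = [(12, "low"), (15, "low"), (18, "low"), (30, "high"), (35, "high"), (40, "high")]
model = train_stump(CARS)
print(model)               # (24.0, 'low', 'high')
print(predict(model, 20))  # low
print(predict(model, 33))  # high
```

The same two steps, fitting on labeled examples and then predicting unknown labels, carry over to the richer methods listed on the slide.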
Data Mining Function: (4) Cluster Analysis
■ Unsupervised learning (i.e., class label is unknown)
■ Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
■ Principle: Maximizing intra-class similarity & minimizing interclass similarity
■ Many methods and applications
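
One of the many clustering methods is k-means, which the slide does not name and which is used here only as a familiar example. The sketch below groups hypothetical house coordinates into two clusters. The starting centroids are supplied by hand so the run is deterministic; real implementations usually choose them randomly or with a seeding heuristic.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Basic k-means: assign each point to its nearest centroid, then re-average."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: math.dist(p, centroids[i])
            )
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean; an empty cluster keeps its old centroid.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical house locations (x, y); the values are made up.
HOUSES = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(HOUSES, [(1, 1), (8, 8)])
print(clusters[0])  # [(1, 1), (1, 2), (2, 1)]
print(clusters[1])  # [(8, 8), (9, 8), (8, 9)]
```

Points within each resulting group are close to one another and far from the other group, which is the intra-class versus interclass principle stated above.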
Data Mining Function: (5) Outlier Analysis
■ Outlier analysis
  ■ Outlier: A data object that does not comply with the general behavior of the data
  ■ Noise or exception? One person’s garbage could be another person’s treasure
  ■ Methods: by-product of clustering or regression analysis, …
  ■ Useful in fraud detection, rare events analysis
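
Finding outliers as a by-product of regression can be sketched as follows: fit a least-squares line, then flag points whose residual is unusually large. The data, the function name `regression_outliers`, and the cutoff of two standard deviations are assumptions made for illustration; the slide does not prescribe a specific rule.

```python
import statistics


def regression_outliers(xs, ys, k=2.0):
    """Return indices of points whose residual from a least-squares line
    exceeds k population standard deviations of all residuals."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = statistics.pstdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > k * spread]


# Hypothetical measurements that follow y = 2x, except the sixth point.
XS = list(range(1, 11))
YS = [2, 4, 6, 8, 10, 40, 14, 16, 18, 20]
print(regression_outliers(XS, YS))  # [5]
```

Whether the flagged point is noise to discard or an exception worth investigating depends on the application, as the slide notes.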
Time and Ordering: Sequential Pattern, Trend and Evolution Analysis
■ Sequence, trend and evolution analysis
  ■ Trend, time-series, and deviation analysis: e.g., regression and value prediction
  ■ Sequential pattern mining
    ■ e.g., first buy digital camera, then buy large SD memory cards
  ■ Periodicity analysis
  ■ Motifs and biological sequence analysis
    ■ Approximate and consecutive motifs
  ■ Similarity-based analysis
■ Mining data streams
  ■ Ordered, time-varying, potentially infinite, data streams
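
The camera-then-memory-card example can be checked against purchase histories by counting how many sequences contain the pattern in order, though not necessarily consecutively. This sketch covers only the support-counting step, not a full sequential pattern mining algorithm; the customer histories and helper names are hypothetical.

```python
def contains_in_order(sequence, pattern):
    """True if the pattern's items appear in the sequence in the same order,
    allowing other items in between."""
    remaining = iter(sequence)
    # "item in remaining" consumes the iterator up to the first match,
    # so each later item must be found after the previous one.
    return all(item in remaining for item in pattern)


def sequence_support(sequences, pattern):
    """Fraction of sequences that contain the pattern in order."""
    return sum(contains_in_order(s, pattern) for s in sequences) / len(sequences)


# Hypothetical purchase histories, one list per customer, in time order.
HISTORIES = [
    ["digital camera", "SD memory card", "tripod"],
    ["SD memory card", "digital camera"],
    ["digital camera", "tripod", "SD memory card"],
    ["laptop", "mouse"],
]
PATTERN = ["digital camera", "SD memory card"]
print(sequence_support(HISTORIES, PATTERN))  # 0.5
```

The second customer bought both items but in the opposite order, so that history does not support the pattern. Ordering is what separates sequential patterns from plain frequent itemsets.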
Structure and Network Analysis
■ Graph mining
  ■ Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments)
■ Information network analysis
  ■ Social networks: actors (objects, nodes) and relationships (edges)
    ■ e.g., author networks in CS, terrorist networks
  ■ Multiple heterogeneous networks
    ■ A person could be in multiple information networks: friends, family, classmates, …
  ■ Links carry a lot of semantic information: Link mining
■ Web mining
  ■ Web is a big information network: from PageRank to Google
  ■ Analysis of Web information networks
    ■ Web community discovery, opinion mining, usage …
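
To show how links can carry ranking information, here is a simplified PageRank computed by power iteration on a tiny hypothetical link graph. The damping factor of 0.85 and the even redistribution of rank from pages without outgoing links are common textbook choices assumed here; this is a sketch, not a description of any production search engine.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a baseline share from the random-jump term.
        new_rank = {p: (1 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no out-links spreads its rank over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[p] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: A links to B and C, B to C, C to A, D to C.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(LINKS)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Page C ends up with the highest score because three pages link to it, while D, which nothing links to, keeps only the baseline share.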
Evaluation of Knowledge
■ Is all mined knowledge interesting?
  ■ One can mine a tremendous number of “patterns” and knowledge
  ■ Some may fit only a certain dimension space (time, location, …)
  ■ Some may not be representative, may be transient, …
■ Evaluation of mined knowledge → directly mine only interesting knowledge?
  ■ Descriptive vs. predictive
  ■ Coverage
  ■ Typicality vs. novelty
  ■ Accuracy
  ■ Timeliness
Exercise
[Figure: contributing disciplines; surviving labels: Database Technology, Algorithm, High-Performance Computing]
Why Confluence of Multiple Disciplines?
■ Tremendous amount of data
  ■ Algorithms must be highly scalable to handle data volumes such as terabytes
■ High-dimensionality of data
  ■ Micro-array may have tens of thousands of dimensions
■ High complexity of data
  ■ Data streams and sensor data
  ■ Time-series data, temporal data, sequence data
  ■ Structured data, graphs, social networks and multi-linked data
  ■ Heterogeneous databases and legacy databases
  ■ Spatial, spatiotemporal, multimedia, text and Web data
  ■ Software programs, scientific simulations
■ New and sophisticated applications
Applications of Data Mining
■ Web page analysis: from web page classification and clustering to the PageRank & HITS algorithms
■ Collaborative analysis & recommender systems
■ Basket data analysis to targeted marketing
■ Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
■ Data mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue)
■ From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining
Major Issues in Data Mining (1)
■ Mining Methodology
  ■ Mining various and new kinds of knowledge
  ■ Mining knowledge in multi-dimensional space
  ■ Data mining: An interdisciplinary effort
  ■ Boosting the power of discovery in a networked environment
  ■ Handling noise, uncertainty, and incompleteness of data
  ■ Pattern evaluation and pattern- or constraint-guided mining
■ User Interaction
  ■ Interactive mining
  ■ Incorporation of background knowledge
  ■ Presentation and visualization of data mining results
Major Issues in Data Mining (2)
A Brief History of Data Mining Society
■ 1989 IJCAI Workshop on Knowledge Discovery in Databases
  ■ Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
■ 1991-1994 Workshops on Knowledge Discovery in Databases
  ■ Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
■ 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
  ■ Journal of Data Mining and Knowledge Discovery (1997)
■ ACM SIGKDD conferences since 1998 and SIGKDD Explorations
■ More conferences on data mining
  ■ PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
■ ACM Transactions on KDD starting in 2007
Conferences and Journals on Data Mining