DM 1
Chapter 1. Introduction
• Why Data Mining?
Knowledge Discovery Process
[Figure: the knowledge discovery pipeline. Recoverable labels: Databases, Data Cleaning, Data Integration, Data Warehouse, Task-relevant Data; Raw Data, Selection & Cleaning, Target Data, Transformation, Transformed Data, Data Mining, Patterns and Rules, Interpretation & Evaluation, Integration, Understanding, Knowledge.]
Data Mining in Business Intelligence
[Figure: layered diagram whose axis reads "Increasing potential to support business decisions". Recoverable labels: Decision Making (End User); Data Exploration (Statistical Summary, Querying, and Reporting).]
Data Collection (1960s)
– Business question: "What was my total revenue in the last five years?"
– Enabling technologies: computers, tapes, disks
– Product providers: IBM, CDC
– Characteristics: retrospective, static data delivery
Data Warehousing & Decision Support (1990s)
– Business question: "What were unit sales in New England last March? Drill down to Boston."
– Enabling technologies: on-line analytic processing (OLAP), multidimensional databases, data warehouses
– Product providers: SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR
– Characteristics: retrospective, dynamic data delivery at multiple levels
• OLAP (On-line Analytical Processing)
– Provides a very good view of what is happening, but cannot predict what will happen in the future or explain why it is happening
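To make the OLAP idea concrete, here is a minimal Python sketch of a roll-up followed by a drill-down. The sales records, the dimension names, and the helper `unit_sales` are all hypothetical; only the question ("unit sales in New England last March, drill down to Boston") comes from the slide.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, city, month, units). Values are made up.
SALES = [
    ("New England", "Boston", "March", 120),
    ("New England", "Boston", "April", 90),
    ("New England", "Hartford", "March", 45),
    ("Midwest", "Chicago", "March", 200),
]

def unit_sales(facts, group_by, **filters):
    """Sum units grouped by one dimension, keeping rows that match all filters."""
    dims = ("region", "city", "month")
    totals = defaultdict(int)
    for row in facts:
        record = dict(zip(dims, row[:3]))
        if all(record[key] == value for key, value in filters.items()):
            totals[record[group_by]] += row[3]
    return dict(totals)

# Roll-up: unit sales per region in March.
by_region = unit_sales(SALES, "region", month="March")
# Drill-down: the same measure for New England, broken out by city.
by_city = unit_sales(SALES, "city", region="New England", month="March")
print(by_region)
print(by_city)
```

Both queries only summarize what is already stored. Neither one predicts future sales or explains the numbers, which is the limitation of OLAP noted above.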
Data Mining vs. Database
• A database (DB) user knows what they are looking for.
• A data mining (DM) user may or may not know what they are looking for.
• A DB query answer is 100% accurate, provided the data are correct.
• DM aims to produce an answer that is as accurate as possible.
• DB data are retrieved as stored.
• DM data need to be cleaned (to some extent) before results are produced.
• DB results are a subset of the data.
• DM results are an analysis of the data.
• The meaningfulness of the results is not a concern of the database, whereas it is the main issue in data mining.
Data Mining vs. KDD
• Knowledge Discovery in Databases (KDD) is the process
of finding useful information and patterns in the data.
• Data Mining is the use of algorithms to find the useful
information in the KDD process.
• The KDD process consists of:
» Data cleaning & integration (data pre-processing)
» Creating a common data repository for all sources, such as a data warehouse
» Data mining
» Visualization of the generated results
Data mining is not
• Brute-force crunching of bulk data
• “Blind” application of algorithms
• Going to find relationships where none exist
• Presenting data in different ways
• A database-intensive task
• A difficult-to-understand technology requiring an advanced degree in computer science
[Figure: data mining techniques arranged under Description and Prediction. Recoverable labels: SQL Query Tools, Regressions, Classification, Visualization, Decision Trees, Clustering, Neural Networks, Association, Sequential Analysis.]
Data Mining Function: (1) Generalization
[Figure: an Initial Relation generalized into a Generalized Relation.]
Data Mining Function: (2) Association and
Correlation Analysis
• Frequent patterns (or frequent itemsets)
– What items are frequently purchased together in
your Walmart?
• Association, correlation vs. causality
– A typical association rule
– Are strongly associated items also strongly
correlated?
• How to mine such patterns and rules efficiently in
large datasets?
Association rule
• Association (correlation and causality)
– age(X, “20..29”) ^ income(X, “20..29K”) → buys(X, “PC”) [support = 2%, confidence = 60%]
• Association rule mining
– Finding frequent patterns, associations, correlations
among sets of items or objects in transaction databases,
relational databases, and other information repositories
– Frequent pattern: pattern (set of items, sequence, etc.)
that occurs frequently in a database
• Motivation: finding regularities in data
– What products were often purchased together?
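The support and confidence figures attached to a rule can be computed directly from transaction data. The sketch below is a minimal illustration, assuming the usual definitions (support = fraction of transactions containing all items of the rule, confidence = support of the whole rule divided by support of its left-hand side). The baskets and the function names are made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(A and C) / support(A)."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base else 0.0

# Hypothetical market baskets; the items are invented for this example.
BASKETS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

# Rule bread -> milk: 3 of 5 baskets hold both, 4 of 5 hold bread.
print(support(BASKETS, {"bread", "milk"}))
print(confidence(BASKETS, {"bread"}, {"milk"}))
```

Scanning every transaction for every candidate itemset does not scale, which is why the question of mining such rules efficiently in large datasets matters.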
Example: Association rule
Classifier (Model):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Classification (2): Prediction Using the Model
[Figure: the classifier is applied to Testing Data and to Unseen Data.]
Unseen data: (Jeff, Professor, 4). Tenured?
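The prediction step can be sketched in a few lines of Python by applying the rule "IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’" to the unseen tuple. Three things here are assumptions rather than something the slides state: the tuple is read as (name, rank, years), the rank comparison ignores case, and any tuple the rule does not cover is labeled "no". The function name `predict_tenured` is hypothetical.

```python
def predict_tenured(rank, years):
    """Apply the rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    # Assumption: rank comparison ignores case; the slide does not say.
    if rank.lower() == "professor" or years > 6:
        return "yes"
    # Assumption: anything the rule does not cover is labeled 'no'.
    return "no"

# Unseen tuple from the slide, read here as (name, rank, years).
name, rank, years = ("Jeff", "Professor", 4)
print(name, predict_tenured(rank, years))
```

Under these assumptions Jeff is predicted as tenured, because the rank condition holds even though years is below the threshold.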
Classification Techniques
• Decision Tree Induction
• Bayesian Classification
• Neural Networks
• Genetic Algorithms
• Fuzzy Set and Logic
Data Mining Function: (4) Cluster Analysis
• Outlier analysis
– Outlier: A data object that does not comply
with the general behavior of the data
– Noise or exception? ― One person’s garbage
could be another person’s treasure
– Methods: by-product of clustering or
regression analysis, …
– Useful in fraud detection, rare events analysis
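As a rough illustration of flagging objects that do not comply with the general behavior of the data, the sketch below uses a plain z-score threshold. This is a simpler technique than the clustering or regression by-products listed above, swapped in to keep the example short. The amounts, the threshold of 2.0, and the function name are assumptions.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return values lying more than threshold standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts; 950 is the planted anomaly.
amounts = [20, 22, 19, 21, 23, 20, 18, 950]
print(zscore_outliers(amounts))
```

Whether the flagged value is noise to discard or an exception worth investigating (for example, possible fraud) depends on the application.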
Time and Ordering: Sequential Pattern, Trend and
Evolution Analysis
• Sequence, trend and evolution analysis
– Trend, time-series, and deviation analysis: e.g.,
regression and value prediction
– Sequential pattern mining
• e.g., first buy digital camera, then buy large SD memory
cards
– Periodicity analysis
– Biological sequence analysis
• Mining data streams
– Ordered, time-varying, potentially infinite data
streams
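The sequential pattern example (camera first, memory card later) can be checked with a small support count over ordered purchase histories. This is only a sketch of counting one given pattern, not a full sequential pattern mining algorithm. The histories and the function name `sequence_support` are hypothetical.

```python
def sequence_support(sequences, pattern):
    """Fraction of sequences containing pattern as an ordered subsequence."""
    def contains(seq, pat):
        it = iter(seq)
        # Each 'in' test consumes the iterator, so items must appear in order.
        return all(item in it for item in pat)
    return sum(contains(s, pattern) for s in sequences) / len(sequences)

# Hypothetical purchase histories, one ordered list per customer.
HISTORIES = [
    ["digital camera", "tripod", "SD card"],
    ["SD card", "digital camera"],
    ["digital camera", "SD card"],
    ["laptop", "mouse"],
]
print(sequence_support(HISTORIES, ["digital camera", "SD card"]))
```

The second history contains both items but in the wrong order, so it does not count toward the pattern. That ordering constraint is what separates sequential patterns from plain frequent itemsets.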
Regression
• Regression is similar to classification
– First, construct a model
– Second, use the model to predict an unknown value
• Methods
– Linear and multiple regression
– Non-linear regression
• Regression is different from classification
– Classification predicts categorical class labels
– Regression models continuous-valued functions
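The two steps (construct a model, then predict an unknown value) can be sketched with ordinary least squares for a single predictor. The data points and the function name `fit_line` are made up for illustration. Multiple and non-linear regression follow the same two-step shape with a different model.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: years of experience vs. salary in thousands.
years = [1, 2, 3, 4, 5]
salary = [32, 34, 36, 38, 40]

slope, intercept = fit_line(years, salary)  # step 1: construct the model
predicted = slope * 6 + intercept           # step 2: predict an unknown value
print(slope, intercept, predicted)
```

The output is a continuous value rather than a class label, which is the difference from classification noted above.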
[Figure: data mining as a confluence of multiple disciplines. Recoverable labels: Machine Learning, Pattern Recognition, Statistics, Database Technology, Algorithm, High-Performance Computing.]
Why Confluence of Multiple Disciplines?
• Tremendous amount of data
– Algorithms must be highly scalable to handle
volumes such as terabytes of data
• High dimensionality of data
– A microarray may have tens of thousands of
dimensions
• High complexity of data
– Data streams and sensor data
– Time-series data, temporal data, sequence data
– Structure data, graphs, social networks and
multi-linked data
– Heterogeneous databases and legacy databases
– Spatial, spatiotemporal, multimedia, text and Web
data
– Software programs, scientific simulations
• New and sophisticated applications
Major Issues in Data Mining (1)
• Mining Methodology
– Mining various and new kinds of knowledge
– Mining knowledge in multi-dimensional
space
– Data mining: An interdisciplinary effort
– Boosting the power of discovery in a
networked environment
– Handling noise, uncertainty, and
incompleteness of data
– Pattern evaluation and pattern- or
constraint-guided mining
• User Interaction
– Interactive mining
– Incorporation of background knowledge
– Presentation and visualization of data
mining results
Major Issues in Data Mining (2)