Mehrdad Jalali: jalali@mshdiau.ac.ir, jalali.mshdiau.ac.ir
Data Mining
Introduction
Motivation: Why data mining?
What is data mining?
Data mining: On what kind of data?
Data mining functionality
Are all the patterns interesting?
Classification of data mining systems
Integration of a data mining system with a DB and DW system
Major issues in data mining
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web databases
2000s:
Stream data management and mining
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, business intelligence, etc.
Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
Risk analysis and management
Target marketing
Find clusters of model customers who share the same characteristics: interest, income level, spending habits, etc.
Cross-market analysis
Find associations/correlations between product sales, and predict based on such associations
Customer profiling
What types of customers buy what products (clustering or classification)
Customer requirement analysis
Identify the best products for different customers
Predict what factors will attract new customers
Professional patients, ring of doctors, and ring of references
Unnecessary or correlated screening tests
Telecommunications: phone-call fraud
Phone call model: destination of the call, duration, time of day or week
Analyze patterns that deviate from an expected norm
Retail industry
[Figure: the knowledge discovery process, from Databases through Data Cleaning and Data Integration up to Pattern Evaluation]
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of the effort!)
Data reduction and transformation
Find useful features, dimensionality/variable reduction, invariant representation
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc.
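The steps listed above can be sketched end to end on a toy data set. This is a minimal illustration only: the records, the field names, the income-band "mining" step, and the `min_gap` threshold are all hypothetical choices made for the example, not part of the slides.

```python
from statistics import mean

# Hypothetical raw customer records (one missing value, one duplicate).
RAW = [
    {"id": 1, "age": 34, "income": 52000, "spend": 410.0},
    {"id": 2, "age": None, "income": 61000, "spend": 530.0},
    {"id": 3, "age": 45, "income": 58000, "spend": 495.0},
    {"id": 3, "age": 45, "income": 58000, "spend": 495.0},
    {"id": 4, "age": 29, "income": 31000, "spend": 220.0},
]

def select(records, fields):
    """Data selection: keep only the fields relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Data cleaning: drop exact duplicates, fill missing values with the column mean."""
    unique = []
    for r in records:
        if r not in unique:
            unique.append(r)
    filled = []
    for r in unique:
        row = dict(r)
        for f, v in row.items():
            if v is None:
                row[f] = mean(x[f] for x in unique if x[f] is not None)
        filled.append(row)
    return filled

def transform(records, fields):
    """Data transformation: min-max scale the given fields to [0, 1]."""
    out = [dict(r) for r in records]
    for f in fields:
        lo = min(r[f] for r in records)
        hi = max(r[f] for r in records)
        for r in out:
            r[f] = (r[f] - lo) / (hi - lo) if hi > lo else 0.0
    return out

def mine(records):
    """Toy mining step: average scaled spend per income band."""
    bands = {"low": [], "high": []}
    for r in records:
        bands["high" if r["income"] >= 0.5 else "low"].append(r["spend"])
    return {band: mean(v) for band, v in bands.items() if v}

def evaluate(patterns, min_gap=0.2):
    """Pattern evaluation: keep the pattern only if the gap between bands is large."""
    if "high" not in patterns or "low" not in patterns:
        return None
    gap = patterns["high"] - patterns["low"]
    return gap if gap >= min_gap else None

selected = select(RAW, ["age", "income", "spend"])
cleaned = clean(selected)
scaled = transform(cleaned, ["income", "spend"])
patterns = mine(scaled)
gap = evaluate(patterns)
```

Each function corresponds to one stage, so a stage can be swapped out (for example, a different mining algorithm) without touching the others.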
Decision Making (End User)
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems (DBA)
Database Technology
Statistics
Machine Learning
Pattern Recognition
Data Mining
Visualization
Algorithm
Other Disciplines
High-dimensionality of data
Micro-array data may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Descriptive data mining
Describes data in a concise, summarizing manner and presents interesting general properties of the data
Predictive data mining
Analyzes data in order to construct one or a set of models and attempts to predict the behavior of new data sets.
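The difference between the two modes can be made concrete with a small sketch. The data below are hypothetical, and ordinary least-squares line fitting is used only as one simple example of a predictive model; the slides do not prescribe a particular method.

```python
from statistics import mean, median

# Hypothetical observations: advertising spend (xs) and resulting sales (ys).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

def describe(values):
    """Descriptive mining: summarize general properties of existing data."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }

def fit_line(xs, ys):
    """Predictive mining: fit y = slope * x + intercept by least squares."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

summary = describe(ys)               # describes the data we already have
slope, intercept = fit_line(xs, ys)  # builds a model from the data
prediction = slope * 6.0 + intercept # applies the model to an unseen x
```

The summary says nothing about unseen inputs, while the fitted line can be applied to a new value such as `x = 6.0`.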
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia databases
Text databases
The World-Wide Web
Construct models (functions) that describe and distinguish classes or concepts for future prediction
E.g., classify countries based on climate, or classify cars based on gas mileage
Predict some unknown or missing numerical values
Cluster analysis
Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
Maximize intra-class similarity and minimize inter-class similarity
Outlier analysis
Outlier: a data object that does not comply with the general behavior of the data
Noise or exception? Useful in fraud detection and rare-event analysis
Other pattern-directed or statistical analyses
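Cluster analysis and outlier analysis can both be illustrated on one-dimensional data. The sketch below assumes a basic k-means loop for clustering and a z-score rule for outliers; both are stand-in techniques chosen for brevity, and the house prices, call durations, initial centers, and threshold are hypothetical.

```python
from statistics import mean, pstdev

def kmeans_1d(values, centers, iterations=10):
    """Group values around the nearest center, then move each center to its group mean."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

def zscore_outliers(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical house prices (in thousands): no class labels are given.
prices = [100, 110, 120, 300, 310, 320]
centers, clusters = kmeans_1d(prices, centers=[100, 320])

# Hypothetical phone-call durations (in minutes) with one unusual call.
durations = [3, 4, 5, 4, 3, 5, 4, 120, 4, 5]
outliers = zscore_outliers(durations)
```

Whether a flagged value is noise or a meaningful exception (for example, a fraudulent call) is a judgment the code cannot make; it only points at candidates.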
Interestingness measures
A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
First generate all the patterns and then filter out the uninteresting ones
Generate only the interesting patterns: mining query optimization
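The first strategy, generate everything and then filter, can be sketched with simple association rules between single items. Support and confidence are used here as assumed interestingness measures, and the transactions and thresholds are hypothetical; none of these specifics come from the slides.

```python
from itertools import permutations

# Hypothetical market-basket transactions.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def generate_rules(transactions):
    """Generate every rule A -> B over single items, with support and confidence."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    rules = []
    for a, b in permutations(items, 2):
        both = sum(1 for t in transactions if a in t and b in t)
        has_a = sum(1 for t in transactions if a in t)
        if both == 0:
            continue
        # support = P(A and B), confidence = P(B | A)
        rules.append((a, b, both / n, both / has_a))
    return rules

def filter_interesting(rules, min_support=0.4, min_confidence=0.7):
    """Keep only rules that meet both thresholds."""
    return [r for r in rules
            if r[2] >= min_support and r[3] >= min_confidence]

all_rules = generate_rules(TRANSACTIONS)
interesting = filter_interesting(all_rules)
```

The second strategy would push the thresholds into the generation step so that uninteresting candidates are never materialized, which matters once the number of items grows.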
Database
User interaction
VLDB
(IEEE) ICDE
WWW, SIGIR
ICML, CVPR, NIPS
Journals
Data Mining and Knowledge Discovery (DAMI or DMKD), Springer
IEEE Trans. on Knowledge and Data Eng. (TKDE)
KDD Explorations (ACM)
ACM Trans. on KDD