Chapter 2
Why Data Mining?—Potential Applications
Model vs. Pattern
Model structure: y = ax + c
Model: y = 2x + 3.5
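A minimal sketch of the distinction, assuming a handful of made-up data points: the structure y = ax + c is fixed in advance, and fitting it to data produces one specific model. The helper name `fit_line` and the sample data are illustrative assumptions, not part of the slides.

```python
def fit_line(xs, ys):
    """Ordinary least squares for the fixed structure y = a*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx              # slope
    c = mean_y - a * mean_x    # intercept
    return a, c

# Made-up observations that lie exactly on y = 2x + 3.5.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 3.5 for x in xs]

a, c = fit_line(xs, ys)
print(f"model: y = {a:.1f}x + {c:.1f}")
```

The structure stays the same for any data set; only the fitted values of a and c change.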
Supervised vs. Unsupervised
[Figure: eight example cells, H1-H4 healthy and C1-C4 cancerous]
H1  light  1  1  healthy
H2  dark   1  1  healthy
H3  light  1  2  healthy
H4  light  2  1  healthy
C1  dark   1  2  cancerous
C2  dark   2  1  cancerous
C3  light  2  2  cancerous
C4  dark   2  2  cancerous
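The eight example cells can illustrate the difference in a rough sketch. The two numeric columns are unnamed in the slide, so `attr1` and `attr2` are placeholder names, and the `score` feature is a hypothetical choice made only for this illustration. The supervised part uses the healthy/cancerous labels to choose a threshold; the unsupervised part groups cells without looking at the labels.

```python
from collections import defaultdict

# Rows from the example cells: (id, shade, attr1, attr2, label).
cells = [
    ("H1", "light", 1, 1, "healthy"),
    ("H2", "dark", 1, 1, "healthy"),
    ("H3", "light", 1, 2, "healthy"),
    ("H4", "light", 2, 1, "healthy"),
    ("C1", "dark", 1, 2, "cancerous"),
    ("C2", "dark", 2, 1, "cancerous"),
    ("C3", "light", 2, 2, "cancerous"),
    ("C4", "dark", 2, 2, "cancerous"),
]

def score(shade, attr1, attr2):
    """Hypothetical feature: how many attributes take their 'high' value."""
    return (shade == "dark") + (attr1 == 2) + (attr2 == 2)

def errors(threshold):
    """Number of cells misclassified by the rule 'score >= threshold => cancerous'."""
    return sum(
        ("cancerous" if score(s, a, b) >= threshold else "healthy") != label
        for _, s, a, b, label in cells
    )

# Supervised: the labels guide the choice of threshold.
best_threshold = min(range(0, 5), key=errors)

# Unsupervised: labels are ignored; cells are grouped by score alone.
groups = defaultdict(list)
for cell_id, s, a, b, _ in cells:
    groups[score(s, a, b)].append(cell_id)

print(best_threshold, errors(best_threshold))
print(dict(groups))
```

In the supervised case there is a target to get right, so errors can be counted. In the unsupervised case the groups simply exist, and whether they are meaningful has to be judged separately.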
From business problems to data mining tasks
Classification
Classification attempts to predict, for each individual in a
population, which of a (small) set of classes that
individual belongs to.
Regression
Similarity matching
Clustering
Clustering attempts to group individuals in a
population together by their similarity, but without
regard to any specific purpose
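As one possible illustration of grouping by similarity, the sketch below runs a plain k-means on one-dimensional data. The slides do not prescribe a clustering algorithm; k-means, the function name `kmeans_1d`, and the spend figures are all assumptions chosen for brevity.

```python
def kmeans_1d(points, centers, iterations=10):
    """Plain k-means on 1-D data; returns (centers, assignments)."""
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        assignments = [
            min(range(len(centers)), key=lambda k: abs(p - centers[k]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == k]
            new_centers.append(sum(members) / len(members) if members else centers[k])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assignments

spend = [10, 12, 11, 50, 52, 51]  # made-up monthly spend per customer
centers, labels = kmeans_1d(spend, centers=[min(spend), max(spend)])
print(centers, labels)
```

No target variable is involved: the two groups emerge only from how close the values are to each other.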
Co-occurrence grouping
Transactional databases
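Co-occurrence grouping over a transactional database can be sketched by counting how often pairs of items appear in the same transaction. The baskets and the `min_count` threshold below are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Each transaction is the set of items bought together (made-up data).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
]

pair_counts = Counter()
for basket in transactions:
    # Sorting gives each pair a canonical order, e.g. ("bread", "milk").
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

min_count = 2  # assumed threshold for "frequently bought together"
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_count}
print(frequent_pairs)
```

Counting every pair is fine for a toy example; real transactional databases usually need pruning strategies to keep the number of candidate itemsets manageable.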
Classification and regression are generally solved
with supervised techniques.
Data mining and its use
Ex. 1: Market Analysis and Management
■ Where does the data come from? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
■ Target marketing
  ■ Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
  ■ Determine customer purchasing patterns over time
■ Cross-market analysis: find associations/correlations between product sales, and predict based on such associations
■ Customer profiling: what types of customers buy what products (clustering or classification)
■ Customer requirement analysis
  ■ Identify the best products for different groups of customers
  ■ Predict what factors will attract new customers
■ Provision of summary information
  ■ Multidimensional summary reports
  ■ Statistical summary information (data central tendency and variation)
Ex. 2: Corporate Analysis & Risk Management
Ex. 3: Fraud Detection & Mining Unusual Patterns
■ Approaches: clustering and model construction for frauds, outlier analysis
■ Applications: health care, retail, credit card service, telecommunications
  ■ Auto insurance: ring of collisions
  ■ Money laundering: suspicious monetary transactions
  ■ Medical insurance
    ■ Professional patients, ring of doctors, and ring of references
    ■ Unnecessary or correlated screening tests
  ■ Telecommunications: phone-call fraud
    ■ Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
  ■ Retail industry
    ■ Analysts estimate that 38% of retail shrink is due to dishonest employees
  ■ Anti-terrorism
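The idea of flagging patterns that deviate from an expected norm can be sketched with a simple z-score test on call durations. The data, the function name `flag_outliers`, and the threshold of 2 standard deviations are assumptions for illustration; the slides do not specify a particular outlier test.

```python
from statistics import mean, stdev

durations = [3, 4, 5, 4, 3, 5, 4, 60]  # made-up call durations in minutes

def flag_outliers(values, z_threshold=2.0):
    """Return the values lying more than z_threshold sample std devs from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) / sigma > z_threshold]

outliers = flag_outliers(durations)
print(outliers)
```

A single extreme value inflates both the mean and the standard deviation, so a robust variant (for example, one based on the median) may be preferable on real data.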
KDD Process: Several Key Steps
■ Learning the application domain
  ■ Relevant prior knowledge and goals of application
■ Creating a target data set: data selection
■ Data cleaning and preprocessing (may take 60% of effort!)
■ Data reduction and transformation
  ■ Find useful features, dimensionality/variable reduction, invariant representation
■ Choosing functions of data mining
  ■ Summarization, classification, regression, association, clustering
■ Choosing the mining algorithm(s)
■ Data mining: search for patterns of interest
■ Pattern evaluation and knowledge presentation
  ■ Visualization, transformation, removing redundant patterns, etc.
■ Use of discovered knowledge
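The steps above can be sketched as a toy pipeline. Every function name, field, and record here is hypothetical, and the "mining" step is only a stand-in summary, not a real mining algorithm.

```python
raw = [
    {"customer": "a", "age": 25, "spend": 120.0},
    {"customer": "b", "age": None, "spend": 80.0},  # missing value
    {"customer": "c", "age": 41, "spend": 300.0},
    {"customer": "d", "age": 33, "spend": 210.0},
]

def select(records, fields):
    """Create the target data set: keep only task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records, field):
    """Transformation: min-max scale one numeric field to [0, 1]."""
    lo = min(r[field] for r in records)
    hi = max(r[field] for r in records)
    return [{**r, field: (r[field] - lo) / (hi - lo)} for r in records]

def mine(records):
    """Stand-in for mining: compare mean spend of older vs. younger customers."""
    older = [r["spend"] for r in records if r["age"] >= 0.5]
    younger = [r["spend"] for r in records if r["age"] < 0.5]
    return {
        "older_mean_spend": sum(older) / len(older),
        "younger_mean_spend": sum(younger) / len(younger),
    }

target = select(raw, ["age", "spend"])
prepared = transform(clean(target), "age")
pattern = mine(prepared)
print(pattern)
```

Even in this toy version, most of the code deals with selection, cleaning, and transformation rather than with the mining step itself.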
Individual Assignment No. 2, 27/09/2024
Find All and Only Interesting Patterns?
Why Data Mining Query Language?
Primitives that Define a Data Mining Task
■ Task-relevant data
  ■ Database or data warehouse name
  ■ Database tables or data warehouse cubes
  ■ Condition for data selection
  ■ Relevant attributes or dimensions
  ■ Data grouping criteria
■ Type of knowledge to be mined
  ■ Characterization, discrimination, association, classification, prediction, clustering, outlier analysis, other data mining tasks
■ Background knowledge
■ Pattern interestingness measurements
■ Visualization/presentation of discovered patterns
Primitive 3: Background Knowledge
Primitive 4: Pattern Interestingness Measurements
■ Simplicity: e.g., (association) rule length, (decision) tree size
■ Certainty: e.g., confidence, P(A|B) = #(A and B) / #(B), classification reliability or accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.
■ Utility: potential usefulness, e.g., support (association), noise threshold (description)
■ Novelty: not previously known, surprising (used to remove redundant rules, e.g., Illinois vs. Champaign rule implication support ratio)
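Two of these measures, support and confidence, can be computed directly from transaction counts. The sketch below uses made-up baskets. In the slide's notation P(A|B) = #(A and B) / #(B), B corresponds to `antecedent` and A to `consequent`; the function names are illustrative.

```python
# Made-up transactions; each is the set of items in one basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent) = #(both) / #(antecedent)."""
    both = sum((antecedent | consequent) <= t for t in transactions)
    ante = sum(antecedent <= t for t in transactions)
    return both / ante

bread_milk_support = support({"bread", "milk"})
bread_to_milk_conf = confidence({"bread"}, {"milk"})
print(bread_milk_support, bread_to_milk_conf)
```

Here bread and milk appear together in 2 of 4 baskets (support 0.5), and 2 of the 3 baskets with bread also contain milk (confidence 2/3).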
Primitive 5: Presentation of Discovered Patterns
DMQL: A Data Mining Query Language
■ Motivation
  ■ A DMQL can provide the ability to support ad-hoc and interactive data mining
  ■ By providing a standardized language like SQL
    ■ The hope is to achieve an effect similar to the one SQL has had on relational databases
    ■ Foundation for system development and evolution
    ■ Facilitate information exchange, technology transfer, commercialization, and wide acceptance
■ Design
  ■ DMQL is designed with the primitives described earlier
An Example Query in DMQL
Other Data Mining Languages & Standardization Efforts
■ Association rule language specifications
  ■ MSQL (Imielinski & Virmani '99)
  ■ MineRule (Meo, Psaila, and Ceri '96)
  ■ Query flocks based on Datalog syntax (Tsur et al. '98)
■ OLE DB for DM (Microsoft, 2000) and, more recently, DMX (Microsoft SQL Server 2005)
  ■ Based on OLE, OLE DB, OLE DB for OLAP, C#
  ■ Integrating DBMS, data warehouse, and data mining
■ PMML (Predictive Model Markup Language) by DMG (www.dmg.org)
■ Providing a platform and process structure for effective data mining
■ Emphasizing the deployment of data mining technology to solve business problems
Integration of Data Mining and Data Warehousing
■ Coupling of data mining systems, DBMS, and data warehouse systems
  ■ No coupling, loose coupling, semi-tight coupling, tight coupling
■ On-line analytical mining of data
  ■ Integration of mining and OLAP technologies
■ Interactive mining of multi-level knowledge
  ■ Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
■ Integration of multiple mining functions
  ■ Characterized classification, first clustering and then association
Coupling Data Mining with DB/DW Systems
■ No coupling: flat file processing, not recommended
■ Loose coupling
  ■ Fetching data from DB/DW
■ Semi-tight coupling: enhanced DM performance
  ■ Provide efficient implementations of a few data mining primitives in a DB/DW system, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, precomputation of some statistical functions
■ Tight coupling: a uniform information processing environment
  ■ DM is smoothly integrated into a DB/DW system; mining queries are optimized based on mining query analysis, indexing, query processing methods, etc.
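The difference between loose and semi-tight coupling can be sketched with Python's built-in `sqlite3` module, used here as a stand-in for a DB/DW system. The `sales` table and its contents are made up. In the loose variant the database only hands over rows; in the second variant the aggregation primitive runs inside the database, which is the spirit of semi-tight coupling.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("north", 30.0), ("south", 5.0), ("south", 15.0)],
)

# Loose coupling: fetch raw rows, then do all the analysis outside the database.
totals = defaultdict(float)
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    totals[region] += amount
loose = dict(totals)

# Semi-tight coupling (in spirit): the database computes the aggregation primitive.
pushed_down = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)

conn.close()
print(loose, pushed_down)
```

Both variants give the same totals, but the second moves less data out of the database and can benefit from its indexing and query optimization.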
Architecture: Typical Data Mining System
[Figure: layered system architecture. Components shown: Pattern Evaluation; Data Mining Engine; Knowledge-Base; Database or Data Warehouse Server]