Data Miningppt378
Chapter 26
Chapter 1. Introduction
Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
We are drowning in data, but starving for knowledge! Solution: Data warehousing and data mining
1960s:
1970s:
1980s:
1990s-2000s:
Data mining is the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
Alternative names:
Data mining: a misnomer?
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
What is not data mining?
(Deductive) query processing
Expert systems or small ML/statistical programs
Target marketing, customer relationship management, market basket analysis, cross selling, market segmentation
Forecasting, customer retention, improved underwriting, quality control, competitive analysis
Fraud detection and management
Other Applications
Text mining (news group, email, documents)
Stream data mining
Web mining
DNA data analysis
Data for analysis comes from credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
Target marketing
Find clusters of "model" customers who share the same characteristics: interest, income level, spending habits, etc.
Conversion of a single to a joint bank account: marriage, etc.
Cross-market analysis
Associations/correlations between product sales
Prediction based on the association information
Customer profiling
Data mining can tell you what types of customers buy what products (clustering or classification)
Identifying the best products for different customers
Use prediction to find what factors will attract new customers
Various multidimensional summary reports
Statistical summary information (data central tendency and variation)
Cash flow analysis and prediction
Contingent claim analysis to evaluate assets
Cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
Resource planning:
Summarize and compare the resources and spending
Competition:
Monitor competitors and market directions
Group customers into classes and apply a class-based pricing procedure
Set pricing strategy in a highly competitive market
Applications
Widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
Approach
Use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
Examples
Auto insurance: detect a group of people who stage accidents to collect on insurance
Money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
Medical insurance: detect professional patients and rings of doctors and rings of references
Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving Australian $1m/yr).
Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
Retail
Analysts estimate that 38% of retail shrink is due to dishonest employees.
Other Applications
Sports
IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat
Astronomy
JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.
[Figure: data mining process diagram; surviving labels: Databases, Data Cleaning, Data Integration]
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of effort!)
Data reduction and transformation:
Find useful features, dimensionality/variable reduction, invariant representation
Choosing the function of data mining:
Summarization, classification, regression, association, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories:
Object-oriented and object-relational databases
Spatial and temporal data
Time-series data and stream data
Text databases and multimedia databases
Heterogeneous and legacy databases
WWW
Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database
What products were often purchased together? Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Find all the rules X ⇒ Y with minimum confidence and support:
support, s: probability that a transaction contains X ∪ Y
confidence, c: conditional probability that a transaction having X also contains Y
For rule A ⇒ C:
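The two measures can be checked by hand on a small example. The sketch below is illustrative only: it reuses the four transactions of the TDB database from the Apriori example in this chapter, and the function names `support` and `confidence` are assumptions, not names from the text.

```python
# Transactions 10, 20, 30, 40 of the TDB example database.
transactions = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(X u Y) / support(X)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


print(support({"A", "C"}, transactions))       # 0.5
print(confidence({"A"}, {"C"}, transactions))  # 1.0
print(confidence({"C"}, {"A"}, transactions))  # 2/3, about 0.667
```

On this data the rule A ⇒ C has support 50% and confidence 100%, while the reverse rule C ⇒ A has the same support but lower confidence, which is why both thresholds are needed.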
Apriori pruning principle: if any itemset is infrequent, none of its supersets should be generated or tested!
If {beer, diaper, nuts} is frequent, so is {beer, diaper}: every transaction having {beer, diaper, nuts} also contains {beer, diaper}
Method: generate length (k+1) candidate itemsets from length k frequent itemsets, and test the candidates against the DB
Performance studies show its efficiency and scalability
Database TDB (minimum support count = 2)

Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan, C1:
Itemset   sup
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L1:
Itemset   sup
{A}       2
{B}       3
{C}       3
{E}       3

2nd scan, C2:
Itemset   sup
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

L2:
Itemset   sup
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

3rd scan, C3:
Itemset
{B, C, E}

L3:
Itemset     sup
{B, C, E}   2
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
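The level-wise loop can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation: the function name `apriori` and the use of an absolute support count are assumptions, and candidates are formed by a plain pairwise union rather than an optimized join.

```python
from itertools import combinations


def apriori(transactions, min_support_count):
    """Return {frequent itemset (frozenset): support count}."""
    transactions = [frozenset(t) for t in transactions]

    # L1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: n for s, n in counts.items() if n >= min_support_count}

    frequent = dict(current)
    k = 1
    while current:
        # Candidates of size k+1: unions of two frequent k-itemsets.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        # Apriori pruning: every k-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k))}
        # One database scan to count the surviving candidates.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= min_support_count}
        frequent.update(current)
        k += 1
    return frequent


tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(tdb, min_support_count=2)
print(result[frozenset("BCE")])  # 2
```

Run on the TDB example with a minimum support count of 2, it yields nine frequent itemsets: the four items of L1, the four pairs of L2, and {B, C, E}.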
Example of Candidate-generation
L3 = {abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4 = {abcd}
Step 2: pruning
forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck
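The join and prune steps can be traced on the L3 example. The sketch below is one possible reading, assuming itemsets are kept as sorted tuples and that two itemsets are joined when they agree on all but their last item; `generate_candidates` is a hypothetical helper name.

```python
from itertools import combinations


def generate_candidates(frequent_k):
    """Return (joined, pruned) candidate lists of size k+1 as sorted tuples."""
    prev = sorted(tuple(sorted(s)) for s in frequent_k)
    prev_set = set(prev)
    k = len(prev[0])

    # Step 1: self-join itemsets that share their first k-1 items.
    joined = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            if p[:-1] == q[:-1]:
                joined.append(p + (q[-1],))

    # Step 2: drop any candidate that has an infrequent k-subset.
    pruned = [c for c in joined
              if all(s in prev_set for s in combinations(c, k))]
    return joined, pruned


joined, pruned = generate_candidates(["abc", "abd", "acd", "ace", "bcd"])
print(joined)  # [('a', 'b', 'c', 'd'), ('a', 'c', 'd', 'e')]
print(pruned)  # [('a', 'b', 'c', 'd')]
```

The join produces abcd and acde; pruning then discards acde because its subset ade is not in L3.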
Finding models (functions) that describe and distinguish classes or concepts for future prediction
E.g., classify countries based on climate, or classify cars based on gas mileage
Presentation: decision tree, classification rule, neural network
Prediction: predict some unknown or missing numerical values
RANK             YEARS   TENURED
Assistant Prof   3       no
Assistant Prof   7       yes
Professor        2       yes
Associate Prof   7       yes
Assistant Prof   6       no
Associate Prof   3       no

Classifier (Model)

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data
(Jeff, Professor, 4)
Tenured?
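To make the train, test, and predict cycle concrete, here is a small sketch. It treats the six unnamed rows as training data and the four named rows as test data, which is an assumption about the slide's intent. The rule inside `predict_tenured` is a hand-written, hypothetical model chosen because it fits all six training rows; it is not stated in the text.

```python
# Rows copied from the two tables in the text.
training = [
    ("Assistant Prof", 3, "no"),
    ("Assistant Prof", 7, "yes"),
    ("Professor", 2, "yes"),
    ("Associate Prof", 7, "yes"),
    ("Assistant Prof", 6, "no"),
    ("Associate Prof", 3, "no"),
]
testing = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]


def predict_tenured(rank, years):
    """Hypothetical rule (an assumption): full professors, or more than 6 years."""
    return "yes" if rank == "Professor" or years > 6 else "no"


def accuracy(rows):
    """Share of (rank, years, label) rows the rule classifies correctly."""
    hits = sum(predict_tenured(rank, years) == label
               for rank, years, label in rows)
    return hits / len(rows)


train_accuracy = accuracy(training)
test_accuracy = accuracy([row[1:] for row in testing])  # drop the name column
jeff = predict_tenured("Professor", 4)

print(train_accuracy, test_accuracy, jeff)  # 1.0 0.75 yes
```

The rule is perfect on the rows it was fitted to but misclassifies Merlisa in the second table, which illustrates why accuracy is measured on data the model has not seen before it is applied to a new case such as Jeff.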
Decision Trees
Training set

age       income   student   credit_rating
<=30      high     no        fair
<=30      high     no        excellent
31...40   high     no        fair
>40       medium   no        fair
>40       low      yes       fair
>40       low      yes       excellent
31...40   low      yes       excellent
<=30      medium   no        fair
<=30      low      yes       fair
>40       medium   yes       fair
<=30      medium   yes       excellent
31...40   medium   no        excellent
31...40   high     yes       fair
>40       medium   no        excellent

[Figure: decision tree; surviving leaf labels: no, yes, no, yes]
Cluster analysis
Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity
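One common way to act on this principle is k-means, which the text does not name; it is used here only as an illustration. The sketch clusters one-dimensional values, and the house prices, the function name `kmeans_1d`, and the deterministic choice of starting centroids are all assumptions made for the example.

```python
def kmeans_1d(values, k, iterations=20):
    """Cluster numbers into k groups (k >= 2); return (centroids, clusters)."""
    data = sorted(values)
    # Deterministic start: k evenly spaced points of the sorted data.
    centroids = [data[i * (len(data) - 1) // (k - 1)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


# Hypothetical house prices (in thousands); not data from the text.
prices = [100, 110, 120, 300, 310, 320]
centroids, clusters = kmeans_1d(prices, k=2)
print(centroids)  # [110.0, 310.0]
print(clusters)   # [[100, 110, 120], [300, 310, 320]]
```

Values inside each group are close to their centroid (high intra-class similarity), while the two centroids are far apart (low inter-class similarity).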
Outlier analysis
Outlier: a data object that does not comply with the general behavior of the data
It can be considered as noise or an exception, but it is quite useful in fraud detection and rare events analysis
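A simple way to flag such objects is a z-score test, shown here as one possible approach rather than the method of the text. The transaction amounts, the threshold of 2 standard deviations, and the function name `zscore_outliers` are assumptions chosen for the example.

```python
import statistics


def zscore_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` std devs from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mean) / spread > threshold]


# Hypothetical card transaction amounts; not data from the text.
amounts = [20, 22, 19, 21, 23, 20, 500]
outliers = zscore_outliers(amounts)
print(outliers)  # [500]
```

In a fraud setting the flagged amount is the interesting record, not noise to be discarded. On very small samples a single extreme value inflates the standard deviation, so the threshold has to be chosen with the sample size in mind.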