
Data Mining

Chapter 26

Chapter 1. Introduction

Motivation: Why data mining?

What is data mining?

Data Mining: On what kind of data?

Data mining functionality

Are all the patterns interesting?

Major issues in data mining

Motivation: Necessity is the Mother of Invention

Data explosion problem

Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories

We are drowning in data, but starving for knowledge!

Solution: Data warehousing and data mining

Data warehousing and on-line analytical processing

Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

Evolution of Database Technology


Data collection, database creation, IMS and network DBMS


Relational data model, relational DBMS implementation

RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)

Data mining and data warehousing, multimedia databases, and Web databases



What Is Data Mining?

Data mining (knowledge discovery in databases):

Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

Alternative names:

Data mining: a misnomer?

Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

What is not data mining?

(Deductive) query processing

Expert systems or small ML/statistical programs

Why Data Mining? Potential Applications

Database analysis and decision support

Market analysis and management

target marketing, customer relation management, market basket analysis, cross selling, market segmentation

Risk analysis and management

Forecasting, customer retention, improved underwriting, quality control, competitive analysis

Fraud detection and management

Other Applications

Text mining (news group, email, documents)

Stream data mining

Web mining

DNA data analysis

Market Analysis and Management (1)

Where are the data sources for analysis?

Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies

Target marketing

Find clusters of model customers who share the same characteristics: interest, income level, spending habits, etc.

Determine customer purchasing patterns over time

Conversion of single to a joint bank account: marriage, etc.

Cross-market analysis

Associations/co-relations between product sales

Prediction based on the association information

Market Analysis and Management (2)

Customer profiling

data mining can tell you what types of customers buy what products (clustering or classification)

Identifying customer requirements

identifying the best products for different customers

use prediction to find what factors will attract new customers

Provides summary information

various multidimensional summary reports

statistical summary information (data central tendency and variation)

Corporate Analysis and Risk Management

Finance planning and asset evaluation

cash flow analysis and prediction

contingent claim analysis to evaluate assets

cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)

Resource planning:

summarize and compare the resources and spending

Competition:

monitor competitors and market directions

group customers into classes and a class-based pricing procedure

set pricing strategy in a highly competitive market

Fraud Detection and Management (1)


widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.


use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
auto insurance: detect a group of people who stage accidents to collect on insurance

money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)

medical insurance: detect professional patients and ring of doctors and ring of references


Fraud Detection and Management (2)

Detecting inappropriate medical treatment

Australian Health Insurance Commission identifies that in many cases blanket screening tests were requested (save Australian $1m/yr).

Detecting telephone fraud

Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.

British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.

Analysts estimate that 38% of retail shrink is due to dishonest employees.


Other Applications


IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat

JPL and the Palomar Observatory discovered 22 quasars with the help of data mining


Internet Web Surf-Aid

IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.

Data Mining: A KDD Process

Data mining: the core of the knowledge discovery process.

[Figure: the KDD process. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

Steps of a KDD Process

Learning the application domain:

relevant prior knowledge and goals of application

Creating a target data set: data selection

Data cleaning and preprocessing: (may take 60% of effort!)

Data reduction and transformation:

Find useful features, dimensionality/variable reduction, invariant representation.

Choosing functions of data mining

summarization, classification, regression, association, clustering.

Choosing the mining algorithm(s)

Data mining: search for patterns of interest

Pattern evaluation and knowledge presentation

visualization, transformation, removing redundant patterns, etc.

Use of discovered knowledge

Data Mining: On What Kind of Data?

Relational databases

Data warehouses

Transactional databases

Advanced DB and information repositories

Object-oriented and object-relational databases

Spatial and temporal data

Time-series data and stream data

Text databases and multimedia databases

Heterogeneous and legacy databases

WWW

Data Mining Functionalities


Association Rule Mining

Association rule mining:

Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.

Frequent pattern: pattern (set of items, sequence, etc.) that occurs frequently in a database

Motivation: finding regularities in data

What products were often purchased together? Beer and diapers?!

What are the subsequent purchases after buying a PC?

What kinds of DNA are sensitive to this new drug?

Can we automatically classify web documents?

Association Rule Mining (cont.)

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

[Figure: Venn diagram of "Customer buys beer" and "Customer buys diapers", overlapping in "Customer buys both"]

Itemset X = {x1, …, xk}

Find all the rules X ⇒ Y with min confidence and support

support, s, probability that a transaction contains X ∪ Y

confidence, c, conditional probability that a transaction having X also contains Y.

Let min_support = 50%, min_conf = 50%:

A ⇒ C (50%, 66.7%)

C ⇒ A (50%, 100%)


Mining Association Rules: an Example

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

Min. support 50%, min. confidence 50%

Frequent pattern   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

For rule A ⇒ C:

support = support({A} ∪ {C}) = 50%

confidence = support({A} ∪ {C}) / support({A}) = 66.7%
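
The numbers in this example can be checked with a few lines of Python. This is a minimal sketch rather than a mining algorithm: the function names `support` and `confidence` are illustrative choices, and frequent itemsets are found by brute-force enumeration, which is only workable because the table has six distinct items.

```python
from itertools import combinations

# Transactions from the example table (transaction-id -> items bought).
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(lhs, rhs):
    """Conditional probability that a transaction containing `lhs` also contains `rhs`."""
    return support(set(lhs) | set(rhs)) / support(lhs)

min_support = 0.5

# Brute force over all non-empty item combinations (fine for six items only).
all_items = sorted(set().union(*transactions.values()))
frequent = {
    frozenset(c): support(c)
    for k in range(1, len(all_items) + 1)
    for c in combinations(all_items, k)
    if support(c) >= min_support
}

for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), f"{s:.0%}")

print(f"A => C: support {support('AC'):.0%}, confidence {confidence('A', 'C'):.1%}")
print(f"C => A: support {support('AC'):.0%}, confidence {confidence('C', 'A'):.1%}")
```

The enumeration yields exactly the four frequent patterns listed in the table, and the two rules come out at (50%, 66.7%) and (50%, 100%).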


Apriori: A Candidate Generation-and-test Approach

Any subset of a frequent itemset must be frequent

if {beer, diaper, nuts} is frequent, so is {beer, diaper}

every transaction having {beer, diaper, nuts} also contains {beer, diaper}

Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested!

Method: generate length (k+1) candidate itemsets from length k frequent itemsets, and test the candidates against DB

The performance studies show its efficiency and scalability


The Apriori Algorithm: An Example

Database TDB (min. support count = 2)

Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan, C1:

Itemset   sup
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L1:

Itemset   sup
{A}       2
{B}       3
{C}       3
{E}       3

C2:

Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}

2nd scan, C2 with counts:

Itemset   sup
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

L2:

Itemset   sup
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

C3:

Itemset
{B, C, E}

3rd scan, L3:

Itemset     sup
{B, C, E}   2

The Apriori Algorithm

Pseudo-code:

    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;
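
The pseudo-code translates fairly directly into Python. The sketch below is one possible rendering, not a tuned implementation: the function name `apriori` and the parameter `min_count` (an absolute support count) are assumptions made for illustration, and candidates are generated by taking unions of frequent k-itemsets instead of the ordered self-join. It is run on the four-transaction TDB from the Apriori example with a support count of 2.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    # L1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}

    frequent = {}
    k = 1
    while current:  # loop while Lk is non-empty
        frequent.update(current)
        # C(k+1): unions of two frequent k-itemsets that have exactly k+1 items ...
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # ... pruned so that every k-subset of a candidate is itself frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k))
        }
        # One pass over the database counts all surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= min_count}
        k += 1
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(tdb, min_count=2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

The result contains the itemsets of L1, L2 and L3 from the example, ending with {B, C, E} at support 2.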

Important Details of Apriori

How to generate candidates?

Step 1: self-joining Lk

Step 2: pruning

Example of Candidate-generation

L3 = {abc, abd, acd, ace, bcd}

Self-joining: L3*L3

abcd from abc and abd

acde from acd and ace

Pruning:

acde is removed because ade is not in L3

C4 = {abcd}


How to Generate Candidates?

Suppose the items in Lk-1 are listed in an order

Step 1: self-joining Lk-1

    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: pruning

    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
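
The two steps can be sketched in Python as follows. The function name `apriori_gen` and the choice to represent each itemset as a tuple of items in sorted order are assumptions made for this illustration. The input is the L3 from the candidate-generation example.

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Build Ck from Lk-1; each itemset is a tuple of items in sorted order."""
    prev = sorted(tuple(sorted(s)) for s in prev_frequent)
    if not prev:
        return []
    prev_set = set(prev)
    k = len(prev[0]) + 1

    # Step 1: self-join. p and q agree on their first k-2 items,
    # and the last item of p precedes the last item of q.
    candidates = []
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.append(p + (q[-1],))

    # Step 2: prune. Drop any candidate with a (k-1)-subset that is not in Lk-1.
    return [
        c for c in candidates
        if all(s in prev_set for s in combinations(c, k - 1))
    ]

l3 = ["abc", "abd", "acd", "ace", "bcd"]
c4 = apriori_gen(l3)
print(["".join(c) for c in c4])
```

The self-join produces abcd and acde; the prune step removes acde because ade is not in L3, leaving C4 = {abcd}.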


Classification and Prediction

Finding models (functions) that describe and distinguish classes or concepts for future prediction

E.g., classify countries based on climate, or classify cars based on gas mileage

Presentation: decision-tree, classification rule, neural network

Prediction: Predict some unknown or missing numerical values

Classification Process: Model Construction

[Figure: Training Data → Classification Algorithms → Classifier (Model)]

Training Data

NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Classifier (Model)

IF rank = professor OR years > 6 THEN tenured = yes


Classification Process: Use the Model in Prediction

[Figure: Testing Data and Unseen Data are fed to the Classifier]

Testing Data

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data

(Jeff, Professor, 4)
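
To make the two phases concrete, the rule from the model-construction slide can be written as a small function and applied to both tables. This is only an illustrative sketch: the function names `predict_tenured` and `accuracy` are made up here, and the rule is hard-coded rather than learned by a classification algorithm.

```python
def predict_tenured(rank, years):
    """Rule from the slide: IF rank = professor OR years > 6 THEN tenured = yes."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# (name, rank, years, tenured) rows copied from the two tables.
training = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]
testing = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

def accuracy(rows):
    """Share of rows whose recorded label matches the rule's prediction."""
    correct = sum(1 for _, rank, years, label in rows
                  if predict_tenured(rank, years) == label)
    return correct / len(rows)

print("training accuracy:", accuracy(training))
print("testing accuracy:", accuracy(testing))
print("Jeff (Professor, 4) tenured?", predict_tenured("Professor", 4))
```

The rule fits all six training rows but misclassifies Merlisa in the testing data (7 years, not tenured), which is why accuracy is measured on data the model has not seen before it is applied to a new case such as Jeff.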



Decision Trees
Training set

age      income   student   credit_rating
<=30     high     no        fair
<=30     high     no        excellent
31..40   high     no        fair
>40      medium   no        fair
>40      low      yes       fair
>40      low      yes       excellent
31..40   low      yes       excellent
<=30     medium   no        fair
<=30     low      yes       fair
>40      medium   yes       fair
<=30     medium   yes       excellent
31..40   medium   no        excellent
31..40   high     yes       fair
>40      medium   no        excellent

Output: A Decision Tree for buys_computer

age?
    <=30: student? (branches: no, yes)
    30..40: yes
    >40: credit rating? (branches: excellent, fair)





Cluster and outlier analysis

Cluster analysis

Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns

Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity

Outlier analysis

Outlier: a data object that does not comply with the general behavior of the data

It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis
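
As a rough illustration of outlier analysis, the sketch below flags values that sit far from the bulk of a one-dimensional data set. The slides do not name a method, so the technique used here, a modified z-score based on the median absolute deviation with the conventional 0.6745 scaling and 3.5 cutoff, is a choice made for this example. The function name `mad_outliers` and the transaction amounts are hypothetical.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds `threshold`."""
    med = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread around the median, so nothing can be scored
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical card transaction amounts; one of them is far from the rest.
amounts = [42, 38, 45, 40, 41, 39, 44, 980, 43]
print(mad_outliers(amounts))
```

A median-based score is used instead of the mean and standard deviation because a single extreme value inflates both of those, which can hide the very point being looked for.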


Clusters and Outliers

