Lesson: Data Mining
Database vs. Data Mining
Query
– Database: well defined; expressed in SQL
– Data mining: poorly defined; no precise query language
Data
– Database: operational data
– Data mining: not operational data
Output
– Database: precise; a subset of the database
– Data mining: fuzzy; not a subset of the database
Query Examples
Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than $10,000 in the last month.
– Find all customers who have purchased milk.
Data Mining
– Find all credit applicants who are poor credit risks. (classification)
– Identify customers with similar buying habits. (clustering)
– Find all items which are frequently purchased with milk. (association rules)
Data Mining: Classification Schemes
Databases to be mined
– Relational, transactional, object-oriented, object-relational, active,
spatial, time-series, text, multi-media, heterogeneous, legacy, WWW,
etc.
Knowledge to be mined
– Characterization, discrimination, association, classification, clustering,
trend, deviation and outlier analysis, etc.
– Multiple/integrated functions and mining at multiple levels
Techniques utilized
– Database-oriented, data warehouse (OLAP), machine learning,
statistics, visualization, neural network, etc.
Applications adapted
– Retail, telecommunication, banking, fraud analysis, DNA mining, stock market
analysis, Web mining, Weblog analysis, etc.
Data Mining Tasks
Prediction Tasks
– Use some variables to predict unknown or future values of other
variables
Description Tasks
– Find human-interpretable patterns that describe the data.
An Example
(from Pattern Classification by Duda & Hart & Stork –
Second Edition, 2001)
A fish-packing plant wants to automate the
process of sorting incoming fish according to
species
Domain knowledge:
◦ A sea bass is generally longer than a salmon
Related feature: (or attribute)
◦ Length
Training the classifier:
◦ Some examples are provided to the classifier in this
form: <fish_length, fish_name>
◦ These examples are called training examples
◦ From the training examples, the classifier learns how to
distinguish salmon from sea bass based on fish_length
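The training step above can be sketched in code. This is a minimal illustration, not from the lecture: it learns a single length threshold from `<fish_length, fish_name>` training examples by trying each observed length as a candidate threshold and keeping the one with the fewest training errors (the function names and the sample lengths are illustrative).

```python
def train_threshold(examples):
    """examples: list of (length, species) pairs, species 'salmon' or 'bass'.
    Returns the threshold t minimizing training errors for the rule
    'predict bass if length > t, else salmon'."""
    lengths = sorted(l for l, _ in examples)
    best_t, best_err = None, len(examples) + 1
    for t in lengths:  # candidate thresholds: the observed lengths
        err = sum(1 for l, s in examples
                  if ('bass' if l > t else 'salmon') != s)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def classify(length, t):
    return 'bass' if length > t else 'salmon'

# illustrative training examples: <fish_length, fish_name>
training = [(10, 'salmon'), (12, 'salmon'), (18, 'bass'), (20, 'bass')]
t = train_threshold(training)  # picks a length separating the two species
```

In practice length alone separates the species imperfectly, which is exactly the point the next slides make.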
An Example (continued): Classification

The overall classification process goes like this:
– Training: training data → preprocessing and feature extraction → learn the classification model.
– Testing: test/unlabeled data → preprocessing and feature extraction → test against the model → prediction/evaluation.
An Example (continued)
Why error?
Insufficient training data
Too few features
Too many/irrelevant features
Overfitting / specialization
Classification
An Example (continued)
New Feature:
– Average lightness of the fish scales
Terms
Accuracy:
% of test data correctly classified
In our first example, accuracy was 3 out of 4 = 75%
In our second example, accuracy was 4 out of 4 = 100%
False positive:
Negative class incorrectly classified as positive
Usually, the larger class is the negative class
Suppose
salmon is negative class
sea bass is positive class
Terms
– False positive: a salmon incorrectly classified as sea bass
– False negative: a sea bass incorrectly classified as salmon
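These counts are easy to compute directly. The sketch below (illustrative, not from the lecture) tallies true/false positives and negatives for the fish example, with salmon as the negative class and sea bass as the positive class; the sample labels reproduce the "3 out of 4" accuracy of the first example.

```python
def confusion_counts(y_true, y_pred, positive='bass'):
    """Count true positives, true negatives, false positives, false negatives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

y_true = ['salmon', 'salmon', 'bass', 'bass']
y_pred = ['salmon', 'bass',   'bass', 'bass']  # one salmon misclassified as bass

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)  # 3 out of 4 = 75%
```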
Terms
Cross validation (3-fold)
– Partition the data into 3 folds of roughly equal size.
– Train on 2 folds and test on the remaining fold.
– Rotate so each fold serves once as the test set, and average the results.
(Figure: a training set of labeled records is fed to a learning algorithm, which induces a classifier model.)
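The rotation described above can be sketched generically. This is a minimal illustration (the round-robin split and the helper names are my own, not the lecture's): it splits the data into 3 folds, trains on two, tests on the third, and averages the three accuracies.

```python
def three_fold_cv(data, train_fn, accuracy_fn):
    """Generic 3-fold cross validation: returns the mean test accuracy."""
    k = 3
    folds = [data[i::k] for i in range(k)]  # simple round-robin split
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j in range(k) if j != i for x in folds[j]]
        model = train_fn(train)
        scores.append(accuracy_fn(model, test))
    return sum(scores) / k

# illustrative use: a trivial majority-class "model" over labels
def majority(train):
    return max(set(train), key=train.count)

def accuracy(model, test):
    return sum(x == model for x in test) / len(test)

labels = [0, 0, 1, 0, 0, 1, 0, 0, 0]
avg_acc = three_fold_cv(labels, majority, accuracy)
```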
Classification: Application 1
Direct Marketing
– Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product.
– Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class attribute.
• Collect various demographic, lifestyle, and company-interaction
related information about all such customers.
– Type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a classifier model.
Classification: Application 2
Fraud Detection
– Goal: Predict fraudulent cases in credit card transactions.
– Approach:
• Use credit card transactions and information on the account
holder as attributes.
– When does the customer buy, what does he buy, how often does he
pay on time, etc.
• Label past transactions as fraud or fair transactions. This forms the
class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card
transactions on an account.
Classification: Application 3
Customer Attrition/Churn:
– Goal: To predict whether a customer is likely to be lost to a
competitor.
– Approach:
• Use detailed records of transactions with each of the past and
present customers to find attributes.
– How often the customer calls, where he calls, what time of day he
calls most, his financial status, marital status, etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.
Classification: Application 4
– Data size:
• 72 million stars, 20 million galaxies
• Object catalog: 9 GB
• Image database: 150 GB
CLUSTERING
Clustering Definition
– Given a set of data points, group them so that points in one group are similar to one another and dissimilar to points in other groups.
Clustering: Application 1
Market Segmentation:
– Goal: subdivide a market into distinct subsets of customers
where any subset may conceivably be selected as a market
target to be reached with a distinct marketing mix.
– Approach:
• Collect different attributes of customers based on their
geographical and lifestyle related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying patterns of
customers in same cluster vs. those from different clusters.
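One common way to find such clusters is k-means (the lecture does not prescribe a particular algorithm, so this is just one hedged choice). The sketch below uses illustrative customer features, e.g. (age, annual spend), and a deterministic initialization for simplicity.

```python
def kmeans(points, k, iters=20):
    """Plain k-means on tuples of numbers. Returns (centroids, clusters)."""
    centroids = points[:k]  # simple deterministic initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        # assignment step: each point goes to its nearest centroid
        for p in points:
            nearest = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        # update step: move each centroid to the mean of its cluster
        centroids = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl
                     else centroids[i] for i, cl in enumerate(clusters)]
    return centroids, clusters

# illustrative (age, annual spend) pairs: two obvious segments
points = [(25, 200), (27, 220), (26, 210), (60, 900), (62, 950), (61, 930)]
centroids, clusters = kmeans(points, k=2)
```

The two resulting centroids summarize the segments, which marketing could then target with distinct mixes.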
Clustering: Application 2
Document Clustering:
– Goal: To find groups of documents that are similar to each
other based on the important terms appearing in them.
– Approach: To identify frequently occurring terms in each
document. Form a similarity measure based on the
frequencies of different terms. Use it to cluster.
– Gain: Information Retrieval can utilize the clusters to
relate a new document or search term to clustered
documents.
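The similarity measure described above can be sketched as cosine similarity over term frequencies (one standard choice; the lecture leaves the measure open, and the sample documents are illustrative).

```python
from collections import Counter
import math

def cosine_sim(doc_a, doc_b):
    """Cosine similarity between two documents' term-frequency vectors."""
    fa, fb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(fa[t] * fb[t] for t in fa)
    na = math.sqrt(sum(v * v for v in fa.values()))
    nb = math.sqrt(sum(v * v for v in fb.values()))
    return dot / (na * nb)

a = "data mining finds patterns in data"
b = "mining data for patterns"
c = "the stock market fell today"

sim_ab = cosine_sim(a, b)  # high: many shared terms
sim_ac = cosine_sim(a, c)  # zero: no shared terms
```

A clustering algorithm then groups documents whose pairwise similarity is high.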
ASSOCIATION RULE MINING
Association Rule Discovery: Definition

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
Chi-Squared Statistic
– O: observed value
– E: expected value based on the hypothesis
– χ² = Σ (O − E)² / E
Example:
– O = {50, 93, 67, 78, 87}
– E = 75
– χ² = 15.55, and therefore significant
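The example value can be reproduced directly from the formula:

```python
# chi2 = sum over observations of (O - E)^2 / E, for the example above.
O = [50, 93, 67, 78, 87]
E = 75
chi2 = sum((o - E) ** 2 / E for o in O)  # = 1166/75 ≈ 15.55
```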
Decision Trees
Advantages:
– Easy to understand.
– Easy to generate rules.
Disadvantages:
– May suffer from overfitting.
– Classifies by rectangular partitioning.
– Does not easily handle nonnumeric data.
– Can be quite large – pruning is necessary.
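The "rectangular partitioning" point is easy to see in code. A decision tree is a nest of single-attribute threshold tests, so each leaf covers an axis-aligned rectangle of the feature space; the attributes and thresholds below are purely illustrative.

```python
def classify(income, age):
    """A tiny hand-written decision tree for a credit decision.
    Each if/else splits on one axis, so each leaf is a rectangle
    in (income, age) space."""
    if income <= 50_000:
        return 'reject'                 # rectangle: income <= 50k
    else:
        if age <= 25:
            return 'reject'             # rectangle: income > 50k, age <= 25
        else:
            return 'accept'             # rectangle: income > 50k, age > 25

# Rules are read off directly from root-to-leaf paths, e.g.:
#   IF income > 50000 AND age > 25 THEN accept
```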
Neural Networks
NN Advantages
– Can continue learning even after the training set has been applied.
– Easy parallelization.
– Solves many problems.
NN Disadvantages
– Difficult to understand.
– May suffer from overfitting.
– Structure of the graph must be determined a priori.
– Input values must be numeric.
– Verification is difficult.
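Two of these points, numeric inputs and incremental learning, show up already in the simplest neural unit, a single perceptron. The sketch below (illustrative, not from the lecture) updates its weights one example at a time, so training can continue whenever new examples arrive; note that both inputs and weights must be numbers.

```python
def train_perceptron(samples, epochs=20, lr=0.1):
    """Train a 2-input perceptron with the classic error-correction rule."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            out = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = target - out          # 0 when correct; ±1 when wrong
            w[0] += lr * err * x1       # incremental weight update
            w[1] += lr * err * x2
            b += lr * err
    return w, b

# learn logical AND, a linearly separable problem
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(samples)
```

The learned weights are just numbers, which illustrates the "difficult to understand" point: nothing in `w` and `b` reads like a rule.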