Lecture 3 Data Mining
- Prediction Methods
- Description Methods
- Classification [Predictive]
- Clustering [Descriptive]
- Regression [Predictive]
- Find a model for class attribute as a function of the values of other attributes.
[Figure: a training set is used to learn a classifier model, which is then applied to a test set.]
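The learn-then-apply workflow can be sketched as follows. This is only an illustration under stated assumptions: it uses a 1-nearest-neighbor classifier, which the lecture does not name, and the records, attribute values, and function names are hypothetical.

```python
import math

def nearest_neighbor_predict(training_set, point):
    """Return the class label of the training record closest to `point`."""
    closest = min(training_set, key=lambda rec: math.dist(rec[0], point))
    return closest[1]

def accuracy(training_set, test_set):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(
        1 for attrs, label in test_set
        if nearest_neighbor_predict(training_set, attrs) == label
    )
    return correct / len(test_set)

# Hypothetical records: (attributes, class attribute)
training_set = [
    ((25, 30.0), "no"),
    ((30, 45.0), "no"),
    ((45, 90.0), "yes"),
    ((50, 120.0), "yes"),
]
test_set = [((28, 35.0), "no"), ((48, 100.0), "yes")]

# The test set is held out from learning and used only to judge the model.
test_accuracy = accuracy(training_set, test_set)
```

Keeping the test set separate from the training set is what lets the accuracy figure say something about records the model has not seen.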
Classification : Application 1
- Direct Marketing
- Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone
product.
- Approach:
Classification : Application 2
- Fraud Detection
- Approach:
- Use credit card transactions and the information on its account-holder as attributes.
- When does a customer buy, what does he buy, how often does he pay on time, etc.
- Label past transactions as fraud or fair transactions. This forms the class attribute.
- Learn a model for the class of the transactions.
- Use this model to detect fraud by observing credit card transactions on an account.
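The label, learn, and detect steps above might look like the sketch below. It assumes a deliberately simple model, a single threshold on the transaction amount, which is not taken from the lecture; the transactions and function names are hypothetical.

```python
def learn_threshold(transactions):
    """Pick the amount threshold that misclassifies the fewest labeled transactions."""
    best_threshold, best_errors = None, len(transactions) + 1
    for candidate in sorted({amount for amount, _ in transactions}):
        # Rule under test: amounts at or above the candidate are called fraud.
        errors = sum(
            1 for amount, label in transactions
            if (amount >= candidate) != (label == "fraud")
        )
        if errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold

def classify(threshold, amount):
    """Apply the learned rule to a new transaction amount."""
    return "fraud" if amount >= threshold else "fair"

# Hypothetical past transactions: (amount, class attribute)
labeled = [
    (20.0, "fair"),
    (35.0, "fair"),
    (50.0, "fair"),
    (900.0, "fraud"),
    (1200.0, "fraud"),
]

threshold = learn_threshold(labeled)
verdict = classify(threshold, 1000.0)
```

A real system would use many attributes of the transaction and the account holder; the single-attribute rule only shows the shape of the workflow.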
Classification : Application 3
- Customer Attrition/Churn:
- Approach:
- Use detailed records of transactions with each past and present customer to find
attributes.
- How often the customer calls, where he calls, what time of day he calls most,
his financial status, marital status, etc.
- Label the customers as loyal or disloyal.
- Find a model for loyalty.
Clustering : Definition
- Given a set of data points, each having a set of attributes, and a similarity
measure among them, find clusters such that
- Similarity Measures:
- Example:
Clustering : Application 1
- Market Segmentation:
- Goal: subdivide a market into distinct subsets of customers where any subset may conceivably
be selected as a market target to be reached with a distinct marketing mix.
- Approach:
- Collect different attributes of customers based on their geographical and lifestyle related
information.
- Find clusters of similar customers.
- Measure the clustering quality by observing buying patterns of customers in same cluster
vs. those from different clusters.
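Finding clusters of similar customers can be sketched with k-means, assuming Euclidean distance as the similarity measure. The lecture does not prescribe this algorithm; the customer attributes (age, monthly spend), the seeding choice, and the function name are hypothetical.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means with Euclidean distance; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members (keep the old one if empty).
        centroids = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customers: (age, monthly spend)
customers = [
    (22, 20.0), (25, 25.0), (24, 22.0),
    (60, 200.0), (58, 210.0), (62, 190.0),
]

centroids, clusters = kmeans(customers, k=2)
```

Clustering quality would then be judged as the slide describes, by comparing buying patterns within a cluster against those across clusters.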
Clustering : Application 2
- Document Clustering:
- Goal: To find groups of documents that are similar to each other based on the important terms
appearing in them.
- Approach: To identify frequently occurring terms in each document. Form a similarity measure
based on the frequencies of different terms. Use it to cluster.
- Gain: Information Retrieval can utilize the clusters to relate a new document or search term to
clustered documents.
Illustrating Document Clustering
- Similarity Measure: How many words are common in these documents (after some word
filtering).
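The common-word similarity measure can be sketched as below. The stop-word list used for filtering and the three short documents are hypothetical, and whitespace splitting stands in for real tokenization.

```python
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}  # hypothetical filter list

def important_terms(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}

def common_word_count(doc_a, doc_b):
    """Similarity = number of filtered words the two documents share."""
    return len(important_terms(doc_a) & important_terms(doc_b))

doc1 = "the stock market fell in early trading"
doc2 = "trading in the bond market is slow"
doc3 = "the team won the final match"

sim_12 = common_word_count(doc1, doc2)  # shares "market" and "trading"
sim_13 = common_word_count(doc1, doc3)  # shares nothing after filtering
```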
Classification vs Clustering
Classification
- Task: Based on the training set, the algorithm finds the category that
new data points belong to.
- Correct results/labels are given during training.
- Resultant models are generalized ones, usually fast and accurate.
Clustering
- Task: Using statistical concepts, we split the dataset into sub-datasets such that
each sub-dataset has "similar" data.
- Correct results/labels are NOT given in the input data.
- Usually computationally expensive.
- Grouping of input data w.r.t. its statistical properties.
Association Rule Discovery : Definition
- Given a set of records, each of which contains some number of items from a given collection:
– Produce dependency rules which predict the occurrence of an item based on occurrences of
other items.
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
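Rules like these are usually scored by support and confidence, two standard measures that this section does not define. The sketch below computes both; the transaction records are hypothetical, since the slide's own data is not reproduced here.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(transactions, itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(transactions, itemset) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / count(transactions, antecedent)

# Hypothetical records
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

milk_coke_support = support(transactions, {"Milk", "Coke"})
milk_coke_confidence = confidence(transactions, {"Milk"}, {"Coke"})
diaper_milk_beer_confidence = confidence(transactions, {"Diaper", "Milk"}, {"Beer"})
```

On this made-up data, {Milk} --> {Coke} holds in 3 of the 4 transactions that contain Milk.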
Association Rule Discovery : Application 1
- Inventory Management:
- Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its
consumer products and keep the service vehicles equipped with the right parts to reduce the
number of visits to consumer households.
- Approach: Process the data on tools and parts required in previous repairs at different
consumer locations and discover the co-occurrence patterns.
Sequential Pattern Discovery : Definition
● Given a set of objects, each associated with its own timeline of events, find rules that
predict strong sequential dependencies among different events.
(A B) (C) (D E)
● Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by
timing constraints.
Sequential Pattern Discovery : Example
- Computer Bookstore:
- (Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies,Tcl_Tk)
- Athletic Apparel Store:
- (Shoes) (Racket, Racketball) --> (Sports_Jacket)
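Checking whether one customer's timeline contains a pattern such as (Intro_To_Visual_C) (C++_Primer) can be sketched as below. This sketch ignores the timing constraints mentioned earlier and only checks order; the customer timelines and the function name are hypothetical.

```python
def contains_pattern(timeline, pattern):
    """True if each itemset in `pattern` is a subset of some event in `timeline`, in order."""
    position = 0
    for itemset in pattern:
        # Skip events until one contains every item of the current itemset.
        while position < len(timeline) and not itemset <= timeline[position]:
            position += 1
        if position == len(timeline):
            return False
        position += 1  # the next itemset must match a strictly later event
    return True

antecedent = [{"Intro_To_Visual_C"}, {"C++_Primer"}]
full_rule = antecedent + [{"Perl_for_dummies", "Tcl_Tk"}]

# Hypothetical purchase timelines, one set of items per visit
customer_a = [{"Intro_To_Visual_C"}, {"C++_Primer", "Notebook"}, {"Perl_for_dummies", "Tcl_Tk"}]
customer_b = [{"C++_Primer"}, {"Intro_To_Visual_C"}]

a_follows_rule = contains_pattern(customer_a, full_rule)
b_matches_antecedent = contains_pattern(customer_b, antecedent)
```

Customer B bought the same two books but in the opposite order, so the antecedent does not match.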
Regression
- Predict the value of a given continuous-valued variable based on the values of other variables,
assuming a linear or nonlinear model of dependency.
- Examples:
- Predicting sales amounts of new product based on advertising expenditure.
- Time series prediction of stock market indices.
- Income prediction on the basis of qualifications and other characteristics of individuals.
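For the linear case, the first example (sales from advertising expenditure) can be sketched with an ordinary least-squares line fit. The figures below are made up for illustration and the function name is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising expenditure and resulting sales
advertising = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(advertising, sales)
predicted_sales = slope * 5.0 + intercept  # prediction for a new expenditure level
```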
Deviation / Anomaly Detection
- Applications:
- Network Intrusion Detection
Typical network traffic at the university level may reach over 100 million connections per day.
Challenges of Data Mining
- Scalability
- Dimensionality
- Data Quality
- Privacy Preservation
- Streaming Data
Open Source and Commercial Data Mining Tools
- Python
- R
- Weka
- KNIME
- RapidMiner
- MATLAB (commercial)
- Tableau (commercial)
Contribution of Data Mining
- Lower expenditures
– Automated systems instead of manual ones
– Selection of customers to mail new promotions of the company
- Increased sales
– Shelf management to increase the sale of certain items
– What types of products can be sold together?
– How does one retain profitable customers?
Data Mining Real World Success Stories
- Bank of America identified savings of $4.8 million in 2 years by using a credit risk
management system, i.e., examination of only borderline applicants.
- BBC’s data mining based program scheduler determines when to show
programs as well as the best planner, but at much lower cost.
- Bell Atlantic developed a telephone technician dispatch system, which must decide
what type of technician to dispatch to resolve a reported complaint.
- Bell Atlantic saves more than 10 million dollars per year by using a data mining
rule-based system because it makes fewer erroneous decisions.
- Safeway (UK)’s data mining system found that the top-spending 25% of customers
often purchase a particular cheese product ranked below 200 in sales.
- Normally, without the data-mining results, the product would have been
discontinued, which would have disappointed the best customers.