Datamining ch1
What is Data Mining?
Many Definitions
Non-trivial extraction of implicit, previously
unknown and potentially useful information from
data
Exploration & analysis, by automatic or
semi-automatic means, of
large quantities of data
in order to discover
meaningful patterns
Why Mine Data? Commercial Viewpoint
Lots of data is being collected
and warehoused
Web data, e-commerce
purchases at department/grocery stores
Bank/Credit Card
transactions
Computers have become cheaper and more
powerful
Competitive Pressure is Strong
Provide better, customized services for a
competitive advantage (e.g. in Customer
Relationship Management)
Why Mine Data? Scientific Viewpoint
Data collected and stored at enormous speeds
(GB/hour)
remote sensors on a satellite
telescopes scanning the skies
microarrays generating gene
expression data
scientific simulations generating terabytes of data
Traditional techniques infeasible
for raw data
Data mining may help scientists
in classifying and segmenting data
in Hypothesis Formation
Mining Large Data Sets - Motivation
There is often information "hidden" in the data
that is
not readily evident
Human analysts may take weeks to discover
useful information
[Figure: "The Data Gap". Total new disk (TB) since 1995, for the years 1995-1999 (axis range 0 to 4,000,000). Label in figure: "Database systems".]
Data Mining Tasks
Prediction Methods
Use some variables to predict unknown or future
values of other variables.
Description Methods
Find human-interpretable patterns that describe
the data.
Data Mining Tasks...
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
Classification: Definition
Given a collection of records (training set)
Each record contains a set of attributes; one of
the attributes is the class.
Find a model for the class attribute as a function of
the values of the other attributes.
Goal: previously unseen records should be
assigned a class as accurately as possible.
A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with training set used to
build the model and test set used to validate it.
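The train/test procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the toy records, the 70/30 split ratio, the fixed seed, and the majority-class "model" are hypothetical choices made to keep the example short, not part of the original material.

```python
import random
from collections import Counter

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (training set, test set)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_majority(train):
    """Baseline 'model': always predict the most common class in the training set."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda attributes: majority

def accuracy(model, test):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(1 for attributes, label in test if model(attributes) == label)
    return correct / len(test)

# Hypothetical toy data: (attributes, class) pairs.
records = [({"student": "yes"}, "buy")] * 7 + [({"student": "no"}, "no_buy")] * 3

train, test = train_test_split(records)
model = fit_majority(train)          # built from the training set only
print(f"train={len(train)} test={len(test)} accuracy={accuracy(model, test):.2f}")
```

The point of the split is that accuracy is measured on records the model never saw while it was being built.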
Classification: Example
Training set (class attribute: buys_computer)
age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
Test set (class not given)
age     income   student   credit_rating
31…40   medium   no        excellent
31…40   high     yes       fair
>40     medium   no        excellent
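As a rough illustration of learning a model from the training records in the table above and applying it to the test records, the sketch below uses a one-attribute rule learner (often called OneR). This technique is an assumption chosen for brevity; it is not the decision tree model of the slides. The age range is written as "31..40" in the code, and ties between attributes are broken in favour of the earlier column.

```python
from collections import Counter, defaultdict

ATTRIBUTES = ["age", "income", "student", "credit_rating"]

# Training records from the table; the last field is the class buys_computer.
TRAIN = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
]

# Test records from the table; their class is not given.
TEST = [
    ("31..40", "medium", "no", "excellent"),
    ("31..40", "high", "yes", "fair"),
    (">40", "medium", "no", "excellent"),
]

def one_rule(train):
    """Pick the single attribute whose value -> majority-class rule makes the fewest training errors."""
    best = None
    for i in range(len(ATTRIBUTES)):
        counts = defaultdict(Counter)
        for row in train:
            counts[row[i]][row[-1]] += 1
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(1 for row in train if rule[row[i]] != row[-1])
        if best is None or errors < best[2]:  # strict '<': ties keep the earlier attribute
            best = (i, rule, errors)
    return best

index, rule, errors = one_rule(TRAIN)
# Fall back to the overall majority class for attribute values never seen in training.
default = Counter(row[-1] for row in TRAIN).most_common(1)[0][0]
predictions = [rule.get(row[index], default) for row in TEST]
print(ATTRIBUTES[index], rule, errors, predictions)
```

On this small table several attributes tie on training error, which is one reason a multi-level model such as a decision tree is usually preferred.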
Classification: Example…
Model
[Figure: decision tree. Root node "age?" with branches <=30, 31…40 and >40; leaves labeled no or yes.]
Classification Application: Direct Marketing
Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a particular product.
Approach:
Use the data for a similar product introduced before.
We know which customers decided to buy and which
decided otherwise. This {buy, don’t buy} decision
forms the class attribute.
Collect various demographic, lifestyle, and company-
interaction related information about all such
customers.
(e.g., type of business, where they stay, how much
they earn, etc.)
Use this information as input attributes to learn a
classifier model.
Classification Application: Fraud Detection
Goal: Predict fraudulent cases in certain (e.g.,
credit card) transactions.
Approach:
Use credit card transactions and the
information on the account holder as
attributes.
(e.g., when does a customer buy, what does
the customer buy, how often are payments made on time, etc.)
Label past transactions as fraud or fair
transactions. This forms the class attribute.
Learn a model for the class of the
transactions.
Use this model to detect fraud by observing
credit card transactions on an account.
Classification Application: Customer Attrition
Goal: To predict whether a customer is likely to
be lost to a competitor.
Approach:
Use detailed records of transactions with each
past and present customer to derive
attributes.
(e.g., how often the customer calls, where he
calls, what time-of-the day he calls most, his
financial status, marital status, etc.)
Label the customers as loyal or disloyal.
Find a model for loyalty.
Association Rule Discovery: Definition
Given a set of records, each of which contains some
number of items from a given collection,
produce dependency rules which will predict the
occurrence of an item based on occurrences of
other items.
TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
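To make the two rules above concrete, the sketch below evaluates them on the five transactions with the standard support and confidence measures. These measures are not defined in the text above, so treat them (and the function names) as assumptions added for illustration: support is the fraction of transactions containing all items of the rule, and confidence is the fraction of transactions containing the antecedent that also contain the consequent.

```python
# The five transactions from the table (TID 1 to 5).
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(antecedent, consequent, transactions):
    """Fraction of all transactions containing antecedent and consequent together."""
    return count(antecedent | consequent, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    return count(antecedent | consequent, transactions) / count(antecedent, transactions)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    s = support(lhs, rhs, TRANSACTIONS)
    c = confidence(lhs, rhs, TRANSACTIONS)
    print(f"{sorted(lhs)} --> {sorted(rhs)}: support={s:.2f} confidence={c:.2f}")
# {Milk} --> {Coke}:          support=0.60 confidence=0.75
# {Diaper, Milk} --> {Beer}:  support=0.40 confidence=0.67
```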
Association Rule Discovery: Application 1
Marketing and Sales Promotion:
Let the rule discovered be
{Bread, … } --> {Potato Chips}
Potato Chips as consequent => Can be used to
determine what should be done to boost its
sales.
Bread in the antecedent => Can be used to see
which products would be affected if the store
discontinues selling Bread.
Bread in antecedent and Potato chips in
consequent => Can be used to see what
products should be sold with Bread to promote
sale of Potato chips!
Association Rule Discovery: Application 2
Supermarket shelf management.
Goal: To identify items that are bought together
by sufficiently many customers.
Approach: Process the point-of-sale data
collected with barcode scanners to find
dependencies among items.
A classic rule --
If a customer buys diaper and milk, then he
is very likely to buy beer.
Clustering: Definition
Given a set of data points, each having a set
of attributes, and a similarity measure among
them, find clusters such that
Data points in one cluster are more similar to one
another.
Data points in separate clusters are less similar
to one another.
Similarity Measures:
Euclidean Distance if attributes are continuous.
Other Problem-specific Measures.
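The definition above does not prescribe an algorithm. As one possible illustration, the sketch below groups continuous-valued points with Euclidean distance using a basic k-means loop. The choice of k-means, the sample points, the starting centroids, and the iteration count are all assumptions made for this example.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, centroids, iterations=10):
    """Alternate between assigning each point to its nearest centroid and recomputing centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return clusters, centroids

# Hypothetical 2-D data with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters, centroids = k_means(points, centroids=[points[0], points[3]])
print(clusters)
print(centroids)
```

Points within each resulting cluster are closer to each other than to points in the other cluster, which is the property the definition asks for.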
Clustering Application: Market Segmentation
Goal: subdivide a market into distinct subsets
of customers where any subset may
conceivably be selected as a market target to
be reached with a distinct marketing mix.
Approach:
Collect different attributes of customers
based on their geographical and lifestyle
related information.
Find clusters of similar customers.
Measure the clustering quality by observing
buying patterns of customers in same cluster
vs. those from different clusters.
Clustering Application: Document Clustering
Goal: To find groups of documents that are
similar to each other based on the important
terms appearing in them.
Approach:
Identify frequently occurring terms in
each document.
Form a similarity measure based on the
frequencies of different terms. Use it to
cluster.
Gain: Information Retrieval can utilize the
clusters to relate a new document or search
term to clustered documents.
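A frequency-based similarity measure of the kind described above could look like the following sketch. Cosine similarity over raw term counts is an assumption here (the text only asks for a measure based on term frequencies), and the three sample documents are hypothetical.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each whitespace-separated, lower-cased term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical documents.
docs = {
    "d1": "data mining finds patterns in data",
    "d2": "mining data for hidden patterns",
    "d3": "the stock market fell today",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}

sim_12 = cosine_similarity(tf["d1"], tf["d2"])
sim_13 = cosine_similarity(tf["d1"], tf["d3"])
print(f"sim(d1, d2)={sim_12:.3f}  sim(d1, d3)={sim_13:.3f}")
```

Pairwise similarities like these can then be passed to any clustering method; documents d1 and d2 share terms and score higher than d1 and d3, which share none.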
Regression
Predict the value of a given continuous-valued
variable based on the values of other variables,
assuming a linear or nonlinear model of
dependency.
Extensively studied in statistics and neural networks.
Examples:
Predicting sales amounts of new product based
on advertising expenditure.
Predicting wind velocities as a function of
temperature, humidity, air pressure, etc.
Time series prediction of stock market indices.
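For the linear case, the first example above (sales as a function of advertising expenditure) can be sketched with an ordinary least-squares line fit. The data values below are invented purely for illustration, and a single predictor is assumed.

```python
def fit_line(xs, ys):
    """Ordinary least squares for the model y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising expenditure and resulting sales (arbitrary units).
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(advertising, sales)
predicted = slope * 6.0 + intercept   # predicted sales for an unseen expenditure of 6.0
print(f"slope={slope:.2f} intercept={intercept:.2f} predicted={predicted:.2f}")
```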
Deviation/Anomaly Detection
Detect significant deviations from normal behavior
Applications:
Credit Card Fraud Detection
Network intrusion detection: typical network traffic at university level may reach over 100 million connections per day
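One simple way to flag "significant deviations from normal behavior" is a z-score test: values that lie more than a chosen number of standard deviations from the mean are reported. This is a minimal sketch, not the method of the slides; the threshold of 2.0 and the transaction amounts are hypothetical.

```python
import statistics

def find_deviations(values, threshold=2.0):
    """Return the values whose z-score (distance from the mean in standard deviations) exceeds threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []                      # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical credit card transaction amounts for one account.
amounts = [25, 30, 27, 22, 31, 28, 26, 950, 29, 24]
print(find_deviations(amounts))
```

Because a single extreme value inflates both the mean and the standard deviation, more robust statistics (for example the median) are often preferred in practice.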
What is (not) Data Mining?
What is not Data Mining?
Look up phone number in phone directory
Dividing the customers of a company according to their gender
Computing the total sales of a company
Query a Web search engine for particular information
What is Data Mining?
Predicting the future stock price of a company using historical records
Group together similar documents returned by search engine according to their context
Monitoring the heart rate of a patient for abnormalities
Challenges of Data Mining
Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
Streaming Data