Datamining ch1

Uploaded by

tofikmohammed471

Chapter 1: Introduction

Credits to: Tan et al

What is Data Mining?
 Many Definitions
 Non-trivial extraction of implicit, previously
unknown and potentially useful information from
data
 Exploration & analysis, by automatic or semi-automatic means, of
large quantities of data in order to discover meaningful patterns

Why Mine Data? Commercial Viewpoint
 Lots of data is being collected
and warehoused
 Web data, e-commerce
 Purchases at department/grocery stores
 Bank/Credit Card
transactions
 Computers have become cheaper and more
powerful
 Competitive Pressure is Strong
 Provide better, customized services for a
competitive advantage (e.g. in Customer
Relationship Management)
Why Mine Data? Scientific Viewpoint
 Data collected and stored at enormous speeds
(GB/hour)
 remote sensors on a satellite
 telescopes scanning the skies
 microarrays generating gene
expression data
 scientific simulations generating terabytes of data
 Traditional techniques infeasible
for raw data
 Data mining may help scientists
 in classifying and segmenting data
 in Hypothesis Formation

Mining Large Data Sets - Motivation
 There is often information "hidden" in the data
that is not readily evident
 Human analysts may take weeks to discover
useful information
 Much of the data is never analyzed at all

[Figure: "The Data Gap". Total new disk (TB) since 1995 compared with the number of analysts since 1995, plotted for 1995 to 1999 on a vertical axis from 0 to 4,000,000.]


Origins of Data Mining
 Draws ideas from machine learning (pattern
recognition), statistics, and database systems
 Traditional techniques may be unsuitable due to
 Enormity of data
 High dimensionality of data
 Heterogeneous, distributed nature of data

[Figure: Data Mining shown at the intersection of Statistics, Machine Learning/Pattern Recognition, and Database systems.]

Data Mining Tasks
 Prediction Methods
 Use some variables to predict unknown or future
values of other variables.

 Description Methods
 Find human-interpretable patterns that describe
the data.

Data Mining Tasks...
 Classification [Predictive]
 Clustering [Descriptive]
 Association Rule Discovery [Descriptive]
 Regression [Predictive]
 Deviation Detection [Predictive]

Classification: Definition
 Given a collection of records (training set)
 Each record contains a set of attributes; one of
the attributes is the class.
 Find a model for the class attribute as a function of
the values of the other attributes.
 Goal: previously unseen records should be
assigned a class as accurately as possible.
 A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with training set used to
build the model and test set used to validate it.
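
The split-then-validate procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the 70/30 split, the toy records, and the deliberately trivial majority-class model are hypothetical and do not come from the slides.

```python
import random
from collections import Counter

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (training, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_majority_class(training):
    """Trivial stand-in model: always predict the most common training class."""
    majority = Counter(label for _, label in training).most_common(1)[0][0]
    return lambda attributes: majority

def accuracy(model, test):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(1 for attributes, label in test if model(attributes) == label)
    return correct / len(test)

# Hypothetical labeled records: (attributes, class)
records = [({"income": "high"}, "no")] * 4 + [({"income": "low"}, "yes")] * 6

training, test = train_test_split(records)
model = fit_majority_class(training)   # built from the training set only
print(f"accuracy on {len(test)} held-out records: {accuracy(model, test):.2f}")
```

The key point is that the test records play no part in building the model, so the measured accuracy estimates how the model behaves on previously unseen records.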

Classification: Example
Training set:

age     income   student  credit_rating  buys_computer (class)
<=30    high     no       fair           no
<=30    high     no       excellent      no
31…40   high     no       fair           yes
>40     medium   no       fair           yes
>40     low      yes      fair           yes
>40     low      yes      excellent      no
31…40   low      yes      excellent      yes
<=30    medium   no       fair           no
<=30    low      yes      fair           yes
>40     medium   yes      fair           yes
<=30    medium   yes      excellent      yes

Test set (class not given):

age     income   student  credit_rating
31…40   medium   no       excellent
31…40   high     yes      fair
>40     medium   no       excellent

Classification: Example…
Model (decision tree)

age?
  <=30: student?
    no: buys_computer = no
    yes: buys_computer = yes
  31…40: buys_computer = yes
  >40: credit rating?
    excellent: buys_computer = no
    fair: buys_computer = yes
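
The decision tree on this slide can be read as nested if/else rules. The sketch below writes it as a Python function and applies it to the training and test records from the example slide. The function name `classify` and the ASCII spelling "31..40" for the middle age band are choices made here for illustration.

```python
def classify(age, student, credit_rating):
    """Apply the slide's decision tree; returns 'yes' or 'no' for buys_computer."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31..40":
        return "yes"
    # Remaining branch: age > 40, decided by credit rating
    return "no" if credit_rating == "excellent" else "yes"

# Training set from the slide: (age, income, student, credit_rating, buys_computer)
training = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
]

# Test set from the slide: the class is unknown and must be predicted
test = [
    ("31..40", "medium", "no", "excellent"),
    ("31..40", "high", "yes", "fair"),
    (">40", "medium", "no", "excellent"),
]

correct = sum(
    1 for age, _income, student, credit, label in training
    if classify(age, student, credit) == label
)
print(f"training accuracy: {correct}/{len(training)}")  # training accuracy: 11/11

predictions = [classify(age, student, credit) for age, _income, student, credit in test]
print(predictions)  # ['yes', 'yes', 'no']
```

Note that the tree never consults `income`: the other three attributes are enough to reproduce every class label in the training set.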
Classification Application: Direct Marketing
 Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a particular product.
 Approach:
 Use the data for a similar product introduced before.
 We know which customers decided to buy and which
decided otherwise. This {buy, don’t buy} decision
forms the class attribute.
 Collect various demographic, lifestyle, and company-
interaction related information about all such
customers.
(e.g., type of business, where they stay, how much
they earn, etc.)
 Use this information as input attributes to learn a
classifier model.

Classification Application: Fraud Detection
 Goal: Predict fraudulent cases in certain (e.g.,
credit card) transactions.
 Approach:
 Use credit card transactions and the
information on the account holder as
attributes.
(e.g., when a customer buys, what they
buy, how often they pay on time, etc.)
 Label past transactions as fraud or fair
transactions. This forms the class attribute.
 Learn a model for the class of the
transactions.
 Use this model to detect fraud by observing
credit card transactions on an account.
Classification Application: Customer Attrition
 Goal: To predict whether a customer is likely to
be lost to a competitor.
 Approach:
 Use detailed records of transactions with each
of the past and present customers to find
attributes.
(e.g., how often the customer calls, where they
call, what time of day they call most, their
financial status, marital status, etc.)
 Label the customers as loyal or disloyal.
 Find a model for loyalty.

Association Rule Discovery: Definition
 Given a set of records, each of which contains some
number of items from a given collection,
 Produce dependency rules which will predict
occurrence of an item based on occurrences of
other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
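
The slide does not say how a rule is scored. As a hedged illustration, the Python sketch below evaluates the two discovered rules on the five transactions using support and confidence, which are common measures for association rules but are an assumption here rather than something the slide defines. The function names are hypothetical.

```python
# Transactions from the slide, keyed by TID
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def support_count(itemset):
    """Number of transactions that contain every item in the itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)

def support(antecedent, consequent):
    """Fraction of all transactions containing antecedent and consequent together."""
    return support_count(antecedent | consequent) / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, fraction that also contain the consequent."""
    return support_count(antecedent | consequent) / support_count(antecedent)

rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for antecedent, consequent in rules:
    print(
        sorted(antecedent), "-->", sorted(consequent),
        f"support={support(antecedent, consequent):.2f}",
        f"confidence={confidence(antecedent, consequent):.2f}",
    )
```

On this data, {Milk} --> {Coke} has support 0.60 and confidence 0.75, and {Diaper, Milk} --> {Beer} has support 0.40 and confidence 0.67.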
Association Rule Discovery: Application 1
 Marketing and Sales Promotion:
 Let the rule discovered be
{Bread, … } --> {Potato Chips}
 Potato Chips as consequent => Can be used to
determine what should be done to boost its
sales.
 Bread in the antecedent => Can be used to see
which products would be affected if the store
discontinues selling Bread.
 Bread in antecedent and Potato chips in
consequent => Can be used to see what
products should be sold with Bread to promote
sale of Potato chips!

Association Rule Discovery: Application 2
 Supermarket shelf management.
 Goal: To identify items that are bought together
by sufficiently many customers.
 Approach: Process the point-of-sale data
collected with barcode scanners to find
dependencies among items.
 A classic rule --
 If a customer buys diaper and milk, then he
is very likely to buy beer.

Clustering: Definition
 Given a set of data points, each having a set
of attributes, and a similarity measure among
them, find clusters such that
 Data points in one cluster are more similar to one
another.
 Data points in separate clusters are less similar
to one another.
 Similarity Measures:
 Euclidean Distance if attributes are continuous.
 Other Problem-specific Measures.
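
To make the definition concrete, the following Python sketch uses Euclidean distance on continuous attributes and checks that points within a cluster are, on average, closer to each other than to points in the other cluster. The 2-D points and the two-cluster grouping are hypothetical data invented for this illustration; it does not show how clusters are found, only how a given grouping can be checked against the definition.

```python
import math
from itertools import combinations, product

def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Hypothetical data points, already grouped into two clusters
cluster_a = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0)]
cluster_b = [(8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

# Average distance between pairs of points inside the same cluster
intra = mean(
    euclidean(p, q)
    for cluster in (cluster_a, cluster_b)
    for p, q in combinations(cluster, 2)
)

# Average distance between pairs of points from different clusters
inter = mean(euclidean(p, q) for p, q in product(cluster_a, cluster_b))

print(f"mean intra-cluster distance: {intra:.2f}")
print(f"mean inter-cluster distance: {inter:.2f}")
```

A smaller distance means higher similarity, so a good clustering shows a small intra-cluster average and a large inter-cluster average.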

Clustering Application: Market Segmentation
 Goal: subdivide a market into distinct subsets
of customers where any subset may
conceivably be selected as a market target to
be reached with a distinct marketing mix.
 Approach:
 Collect different attributes of customers
based on their geographical and lifestyle
related information.
 Find clusters of similar customers.
 Measure the clustering quality by observing
buying patterns of customers in same cluster
vs. those from different clusters.

Clustering Application: Document Clustering
 Goal: To find groups of documents that are
similar to each other based on the important
terms appearing in them.
 Approach:
 Identify frequently occurring terms in
each document.
 Form a similarity measure based on the
frequencies of different terms. Use it to
cluster.
 Gain: Information Retrieval can utilize the
clusters to relate a new document or search
term to clustered documents.
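
One possible way to form a similarity measure from term frequencies is cosine similarity between term-count vectors. The slides do not name a specific measure, so cosine similarity, the whitespace tokenization, and the three sample documents below are all assumptions made for this Python sketch.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lower-cased, whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0.0 to 1.0)."""
    shared_terms = tf_a.keys() & tf_b.keys()
    dot = sum(tf_a[term] * tf_b[term] for term in shared_terms)
    norm_a = math.sqrt(sum(count * count for count in tf_a.values()))
    norm_b = math.sqrt(sum(count * count for count in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical documents
doc1 = term_frequencies("data mining finds patterns in data")
doc2 = term_frequencies("mining large data sets for patterns")
doc3 = term_frequencies("the stock market fell today")

print(f"doc1 vs doc2: {cosine_similarity(doc1, doc2):.2f}")
print(f"doc1 vs doc3: {cosine_similarity(doc1, doc3):.2f}")
```

Documents that share many frequent terms score close to 1, while documents with no terms in common score 0; a clustering method can then group documents using these pairwise scores.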
Regression
 Predict a value of a given continuous valued
variable based on the values of other variables,
assuming a linear or nonlinear model of
dependency.
 Extensively studied in statistics and neural network fields.
 Examples:
 Predicting sales amounts of a new product based
on advertising expenditure.
 Predicting wind velocities as a function of
temperature, humidity, air pressure, etc.
 Time series prediction of stock market indices.
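
As a minimal sketch of the linear case, the Python below fits a straight line by ordinary least squares and uses it to predict sales from advertising expenditure. The numbers are hypothetical and noise-free so the result is easy to check by hand; the function names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(x, slope, intercept):
    """Predicted value of the continuous target for input x."""
    return slope * x + intercept

# Hypothetical data: advertising expenditure and resulting sales (arbitrary units)
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(advertising, sales)
print(f"sales = {slope:.1f} * advertising + {intercept:.1f}")
print(f"predicted sales at advertising = 6: {predict(6.0, slope, intercept):.1f}")
```

Here the fitted line is sales = 2.0 * advertising + 1.0, giving a prediction of 13.0 at an expenditure of 6. Real data would be noisy, and a nonlinear model may fit better.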

Deviation/Anomaly Detection
 Detect significant deviations from normal behavior
 Applications:
 Credit Card Fraud Detection

 Network Intrusion Detection

Typical network traffic at University level may reach over 100 million connections per day
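
One simple way to flag a significant deviation is to estimate normal behavior from a baseline of past observations and mark new values that lie many standard deviations from the baseline mean. The slides do not prescribe a method, so this z-score rule, the threshold of 3, and the transaction amounts in the Python sketch below are assumptions made for illustration.

```python
import statistics

def find_deviations(baseline, new_values, threshold=3.0):
    """Return the new values lying more than `threshold` standard deviations
    from the mean of the baseline (assumed to represent normal behavior)."""
    mean = statistics.mean(baseline)
    spread = statistics.stdev(baseline)  # sample standard deviation
    return [v for v in new_values if abs(v - mean) / spread > threshold]

# Hypothetical transaction amounts for one credit card account
baseline = [20, 25, 22, 19, 24, 21, 23, 22]   # past, normal behavior
new_values = [24, 500, 18]                    # incoming transactions

flagged = find_deviations(baseline, new_values)
print(flagged)  # [500]
```

The baseline here has mean 22 and sample standard deviation 2, so 24 and 18 are within the threshold while 500 is flagged. Estimating normal behavior separately from the values being checked keeps a large outlier from inflating the spread it is judged against.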

What is (not) Data Mining?
What is not Data Mining?
 Looking up a phone number in a phone directory
 Dividing the customers of a company according to their gender
 Computing the total sales of a company
 Querying a Web search engine for particular information

What is Data Mining?
 Predicting the future stock price of a company using historical records
 Grouping together similar documents returned by a search engine
according to their context
 Monitoring the heart rate of a patient for abnormalities
Challenges of Data Mining
 Scalability
 Dimensionality
 Complex and Heterogeneous Data
 Data Quality
 Data Ownership and Distribution
 Privacy Preservation
 Streaming Data
