Introduction To Data Mining
Afzaal Hussain
Email: [email protected]
Course Content
Introduction to data mining
Classification
Database Systems
Algorithms and data structures
Programming
What is Data Mining?
Information Harvesting
Knowledge Mining
Data Mining
Knowledge Discovery in Databases
Data Dredging
Suggested approach:
Human-centered, query-based, focused mining
How to measure interestingness?
Objective:
based on statistics and structures of patterns, e.g.
support, confidence, etc.
Subjective:
based on user’s beliefs in the data, e.g. unexpectedness,
novelty, etc.
Data Mining
Find all credit applicants who are poor credit risks. (classification)
Identify customers with similar buying habits. (Clustering)
Find all items which are frequently purchased with milk. (association rules)
Database Processing vs. Data Mining
Database Processing
Query
– Well defined
– SQL
Output
– Precise
– Subset of database
Data Mining
Query
– Poorly defined
– No precise query language
Output
– Fuzzy
– Not a subset of database
What is Data Mining?
What is not DM?
Look up phone number in phone directory
What is DM?
Certain names are more prevalent in certain US locations (O’Brien, O’Rurke, O’Reilly… in Boston area)
Data Mining:
Use of algorithms to extract the information and patterns derived by the KDD process.
Knowledge Discovery in Databases: Process
Data mining: the core of the knowledge discovery process.
Steps shown in the figure:
Data --> Selection --> Target Data --> Preprocessing --> Preprocessed Data --> Data Mining --> Patterns --> Interpretation/Evaluation --> Knowledge
Goal:
unseen records should be assigned a class as accurately as possible.
Classification Example
Figure: Training Set --> Learn Classifier --> Model; Test Set
Rows recoverable from the figure's table:
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
Classification
Typical methods
Decision trees,
naïve Bayesian classification,
support vector machines,
neural networks,
rule-based or pattern-based classification,
logistic regression, …
Typical applications:
Credit card fraud detection,
direct marketing,
classifying stars, diseases, web-pages, …
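As a rough illustration of the simplest kind of decision tree, the sketch below learns a one-level tree (a "decision stump") from labelled records and then assigns a class to unseen records. The ages, the class labels, and the function names are hypothetical and chosen only for this example; real decision tree learners use many attributes and more refined split criteria.

```python
from collections import Counter

# Hypothetical training set: (customer age, class label).
TRAIN = [(22, "buy"), (25, "buy"), (31, "buy"),
         (38, "dont_buy"), (45, "dont_buy"), (52, "dont_buy")]


def majority(labels):
    """Return the most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def learn_stump(records):
    """Pick the threshold on the single attribute that minimises training errors.

    Assumes the records contain at least two distinct attribute values.
    """
    values = sorted({x for x, _ in records})
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2  # candidate split between two neighbouring values
        left = [y for x, y in records if x <= threshold]
        right = [y for x, y in records if x > threshold]
        left_label, right_label = majority(left), majority(right)
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return threshold, left_label, right_label


def predict(model, x):
    """Assign a class to an unseen record using the learned stump."""
    threshold, left_label, right_label = model
    return left_label if x <= threshold else right_label


model = learn_stump(TRAIN)
print(model)
print(predict(model, 30), predict(model, 50))
```

The learned model is just a threshold and one label per side; how well it handles unseen records is what the classification goal asks us to measure.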
Classification Application
Direct Marketing
Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
Approach:
Use the data for a similar product introduced before.
We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
Collect various demographic, lifestyle, and company-interaction related information about all such customers.
Use this information as input attributes to learn a classifier model.
Clustering
Clustering groups similar data together into clusters based
on attribute values. (unsupervised classification)
Similarity Measures:
Euclidean Distance if attributes are continuous.
Other Problem-specific Measures.
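For continuous attributes, the Euclidean distance between two records is the square root of the summed squared differences of their attribute values. A minimal sketch follows; the function name and the sample points are assumptions made for illustration.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two records with continuous attributes."""
    if len(p) != len(q):
        raise ValueError("records must have the same number of attributes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Two hypothetical records in 3-D attribute space.
print(euclidean((0.0, 0.0, 0.0), (1.0, 2.0, 2.0)))  # 3.0
```

A smaller distance means the two records are more similar, so they are more likely to end up in the same cluster.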
Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.
Intracluster distances are minimized.
Intercluster distances are maximized.
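One common way to obtain clusters with small intracluster distances is k-means, which is not named on the slide and is used here only as an example. The sketch below is a bare-bones version for points in 3-D space; the sample points, the function name, and the naive choice of the first k points as initial centroids are assumptions made for illustration.

```python
import math

# Hypothetical points in 3-D space forming two visually separate groups.
POINTS = [(1.0, 1.0, 1.0), (9.0, 9.0, 9.0), (1.5, 0.5, 1.0),
          (8.5, 9.5, 9.0), (1.0, 1.5, 0.5), (9.5, 8.5, 9.0)]


def kmeans(points, k, iterations=100):
    """Bare-bones k-means using Euclidean distance (math.dist, Python 3.8+)."""
    # Assumption: use the first k points as initial centroids (not a robust choice).
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing
        assignment = new_assignment
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


assignment, centroids = kmeans(POINTS, k=2)
print(assignment)
print(centroids)
```

Each pass reassigns points to the nearest centroid and then recomputes the centroids, which reduces the total distance of points to their own cluster centre.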
Clustering Application: Market Segmentation:
Goal: subdivide a market into distinct subsets of customers, where any subset may be selected as a market target to be reached with a distinct marketing mix.
Approach:
Collect different attributes of customers based on their geographical and lifestyle related information.
Find clusters of similar customers.
Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.
Clustering: Application 2
Document Clustering:
Goal:
To find groups of documents that are similar to each other based
on the important terms appearing in them.
Approach:
To identify frequently occurring terms in each document. Form a
similarity measure based on the frequencies of different terms.
Use it to cluster.
Gain:
Information Retrieval can utilize the clusters to relate a
new document or search term to clustered documents.
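The approach above can be sketched by counting term frequencies per document and comparing documents with a similarity measure built on those counts. Cosine similarity is used here as one possible choice; the slide does not say which measure is meant. The sample documents and function names are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each whitespace-separated, lower-cased term occurs."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents, for illustration only.
DOCS = [
    "data mining finds patterns in data",
    "mining patterns from large data",
    "the election results were announced",
]
TFS = [term_frequencies(d) for d in DOCS]

print(cosine_similarity(TFS[0], TFS[1]))  # shares several terms
print(cosine_similarity(TFS[0], TFS[2]))  # shares no terms
```

A clustering algorithm can then group documents whose pairwise similarity is high. A new document or search term can be compared against the clusters in the same way.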
Association Rule Discovery
Frequent patterns (or frequent itemsets)
What items are frequently purchased together in your Walmart?
Produce dependency rules which will predict occurrence of an item based on occurrences of other items in data.
TID  Items
1    Bread, Coke, Milk
2    Cereal, Bread
3    Cereal, Coke, Diaper, Milk
4    Cereal, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Cereal}
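Support and confidence, the objective interestingness measures mentioned earlier, can be computed directly for the two rules on the five transactions in the table. The sketch below uses the usual definitions: support is the fraction of transactions containing all items of the rule, and confidence is the fraction of transactions containing the antecedent that also contain the consequent. The function names are assumptions.

```python
# The five transactions from the table above.
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Cereal", "Bread"},
    {"Cereal", "Coke", "Diaper", "Milk"},
    {"Cereal", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)


def support(antecedent, consequent, transactions):
    """Fraction of all transactions containing antecedent and consequent together."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions with the antecedent, the fraction that also have the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)


for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Cereal"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          "support =", support(lhs, rhs, TRANSACTIONS),
          "confidence =", round(confidence(lhs, rhs, TRANSACTIONS), 2))
```

For {Milk} --> {Coke}, three of five transactions contain both items (support 0.6) and three of the four Milk transactions contain Coke (confidence 0.75). For {Diaper, Milk} --> {Cereal}, support is 0.4 and confidence is 2/3.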
Association Rule Discovery: Application
Marketing and Sales Promotion:
Let the rule discovered be
{Bagels, … } --> {Potato Chips}
Potato Chips as consequent =>
Can be used to determine what should be done to boost its sales.
Bagels in the antecedent =>
Can be used to see which products would be affected if the
Network Intrusion Detection
Challenges of Data Mining
Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
Streaming Data