Data Mining CSE-443: Ayesha Aziz Prova Lecturer, Dept. of CSE CWU
CSE-443
Course Outline
Recommended Book
Ayesha Aziz Prova,
Lecturer, CSE, CWU
PRESENTATION
BOOK
Data Mining: Concepts and Techniques
J. Han and M. Kamber
Introduction to Data Mining
Tan, Steinbach, Kumar
After years of data mining research, there is still no single answer to the question of what data mining is.
A tentative definition:
Data mining is the use of efficient techniques for the analysis of very large collections of data and the extraction of useful and possibly unexpected patterns in data.
DATA MINING
DATA RICH BUT INFORMATION POOR
Terrorbytes
WHAT IS DATA MINING?
KNOWLEDGE DISCOVERY
EXAMPLE OF DISCOVERED PATTERNS
Association rules:
"80% of customers who buy cheese and milk also buy bread, and 5% of customers buy all of them together"
Cheese, Milk --> Bread [support = 5%, confidence = 80%]
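The two numbers attached to the rule can be computed directly from a list of baskets. The sketch below is only an illustration: the 80 baskets are made up so that the rule comes out at 5% support and 80% confidence, and the helper names `support` and `confidence` are not from the slides.

```python
# Hypothetical market-basket data, constructed so the rule matches the slide.
transactions = (
    [{"cheese", "milk", "bread"}] * 4   # baskets with all three items
    + [{"cheese", "milk"}] * 1          # cheese and milk, but no bread
    + [{"bread"}] * 75                  # baskets without cheese and milk
)

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

sup = support({"cheese", "milk", "bread"}, transactions)
conf = confidence({"cheese", "milk"}, {"bread"}, transactions)
print(f"support = {sup:.0%}, confidence = {conf:.0%}")
```

Support is measured over all baskets, while confidence is measured only over the baskets that contain cheese and milk.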
ORIGINS OF DATA MINING
[Figure: data mining shown alongside contributing fields such as database systems]
WHY DO WE NEED DATA MINING?
Data is power!
Today, the collected data is one of the biggest assets of an online
company
Query logs of Google
The friendships and updates of Facebook
Tweets and follows of Twitter
Amazon transactions
We need a way to harness the collective intelligence
THE DATA IS ALSO VERY COMPLEX
EXAMPLE: TRANSACTION DATA
Point cards allow companies to collect information about specific users
EXAMPLE: DOCUMENT DATA
EXAMPLE: NETWORK DATA
EXAMPLE: GENOMIC SEQUENCES
http://www.1000genomes.org/page.php
Full sequence of 1000 individuals
3*10^9 nucleotides per person, so 3*10^12 nucleotides in total
In fact there is much more data: the medical history of each person, gene expression data
EXAMPLE: ENVIRONMENTAL DATA
Climate data (just an example)
http://www.ncdc.gov/oa/climate/ghcn-monthly/index.php
SO, WHAT IS DATA?
Collection of data objects and their attributes
An attribute is a property or ...

Attributes (columns):
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
NUMERIC RECORD DATA
If data objects have the same fixed set of numeric attributes, then the
data objects can be thought of as points in a multi-dimensional
space, where each dimension represents a distinct attribute
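As a minimal sketch of this view, each record below is a tuple of numbers, and the distance between two records is the ordinary Euclidean distance between the corresponding points. The attribute names and values are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical records with the same three numeric attributes:
# (age in years, income in thousands, number of purchases).
records = {
    "a": (25.0, 40.0, 3.0),
    "b": (27.0, 42.0, 4.0),
    "c": (60.0, 90.0, 1.0),
}

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

d_ab = euclidean(records["a"], records["b"])
d_ac = euclidean(records["a"], records["c"])
print(d_ab, d_ac)  # a is much closer to b than to c
```

Once objects are points, notions such as "similar objects" can be expressed as "nearby points", which is what several of the later tasks rely on.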
CATEGORICAL DATA
Tid  Refund  Marital Status  Taxable Income  Cheat
DOCUMENT DATA
Each document becomes a 'term' vector:
each term is a component (attribute) of the vector,
the value of each component is the number of times the corresponding term occurs in the document.
Bag-of-words representation: no ordering

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
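A bag-of-words vector can be built by counting words against a fixed vocabulary. The following is a small sketch under simplifying assumptions: the two documents and the vocabulary are invented, and tokenization is just lowercasing and splitting on whitespace.

```python
from collections import Counter

# Hypothetical vocabulary and documents, chosen only for illustration.
vocabulary = ["team", "coach", "game", "score", "win"]
documents = [
    "the team won the game and the coach praised the team",
    "final score of the game",
]

def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs; word order is ignored."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

vectors = [term_vector(doc, vocabulary) for doc in documents]
for vec in vectors:
    print(vec)
```

Note that "won" is not counted as "win" here; a real pipeline would usually add stemming or lemmatization before counting.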
TRANSACTION DATA
GRAPH DATA
[Figure: example graph with numeric labels]
TYPES OF DATA
Numeric data: Each object is a point in a multidimensional space
Categorical data: Each object is a vector of categorical values
Set data: Each object is a set of values (with or without counts)
Sets can also be represented as binary vectors, or vectors of counts
Ordered sequences: Each object is an ordered sequence of values.
Graph data
WHAT CAN YOU DO WITH THE DATA?
Suppose that you are the owner of a supermarket and you have collected billions of market-basket records.
What information would you extract from them, and how would you use it?
WHAT CAN YOU DO WITH THE DATA?
Suppose you are a biologist who has microarray expression data: thousands of genes, and their expression
values over thousands of different settings (e.g. tissues). What information would you like to get out of
your data?
DATA MINING TASKS...
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
CLASSIFICATION: DEFINITION
CLASSIFICATION EXAMPLE
[Figure: two record tables with columns Tid, Refund (categorical), Marital Status (categorical), Taxable Income (continuous), and Cheat (class), linked to a classifier]
CLASSIFICATION: APPLICATION 1
CLASSIFICATION: APPLICATION 2
Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
Approach:
Use credit card transactions and the information on its account-holder as
attributes.
When does a customer buy, what do they buy, how often do they pay on
time, etc.
Label past transactions as fraud or fair transactions. This forms the class
attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit card transactions on an
account.
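The approach above (label past transactions, learn a model, apply it to new transactions) can be sketched with a very simple model. The slides do not name a specific classifier, so this example substitutes a 1-nearest-neighbour rule; the two attributes and all values are hypothetical.

```python
import math

# Hypothetical labeled history: (amount in dollars, hour of day) -> class label.
# Real systems would use many more attributes; these values are made up.
labeled = [
    ((12.0, 14.0), "fair"),
    ((30.0, 10.0), "fair"),
    ((25.0, 19.0), "fair"),
    ((950.0, 3.0), "fraud"),
    ((1200.0, 2.0), "fraud"),
]

def classify(transaction, labeled):
    """1-nearest-neighbour: return the label of the closest past transaction."""
    nearest = min(labeled, key=lambda pair: math.dist(transaction, pair[0]))
    return nearest[1]

print(classify((1000.0, 4.0), labeled))
print(classify((20.0, 12.0), labeled))
```

Because the attributes are not rescaled, the amount dominates the distance in this sketch; in practice attributes are usually normalized before distances are compared.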
CLASSIFYING GALAXIES
Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
CLUSTERING DEFINITION
Intracluster distances are minimized
Intercluster distances are maximized
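To make the two quantities concrete, the sketch below computes an average intracluster distance and an average intercluster distance for two small, hand-assigned clusters. The points and the choice of "average pairwise distance" as the summary are assumptions for illustration; other definitions (for example, distance to a centroid) are also common.

```python
import math
from itertools import combinations, product

# Hypothetical 2-D points already assigned to two clusters.
cluster_a = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
cluster_b = [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]

def mean_intracluster(points):
    """Average distance between pairs of points in the same cluster."""
    pairs = list(combinations(points, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def mean_intercluster(first, second):
    """Average distance between points taken from two different clusters."""
    pairs = list(product(first, second))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

intra = (mean_intracluster(cluster_a) + mean_intracluster(cluster_b)) / 2
inter = mean_intercluster(cluster_a, cluster_b)
print(f"intra = {intra:.2f}, inter = {inter:.2f}")
```

A good clustering of these points keeps `intra` small relative to `inter`.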
CLUSTERING: APPLICATION 1
Market Segmentation:
Goal: subdivide a market into distinct subsets of customers where any subset
may conceivably be selected as a market target to be reached with a distinct
marketing mix.
Approach:
Collect different attributes of customers based on their geographical and
lifestyle related information.
Find clusters of similar customers.
Measure the clustering quality by comparing the buying patterns of customers
in the same cluster with those of customers in different clusters.
CLUSTERING: APPLICATION 2
Bioinformatics applications:
Goal: Group genes and tissues together such that genes are co-expressed on the same tissues
CLUSTERING: APPLICATION 3
Document Clustering:
Goal: To find groups of documents that are similar to each other based on
the important terms appearing in them.
Approach:
To identify frequently occurring terms in each document.
Form a similarity measure based on the frequencies of different terms.
Use it to cluster.
Gain:
Information Retrieval can utilize the clusters to relate a new document or search
term to clustered documents.
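One common way to turn term frequencies into a similarity measure is cosine similarity. The slides do not say which measure is intended, so treat this as one possible choice; the three documents are hypothetical and tokenization is a plain whitespace split.

```python
import math
from collections import Counter

# Hypothetical documents; real collections need proper tokenization and stop-word handling.
docs = {
    "d1": "stocks fell as markets closed lower",
    "d2": "markets rallied and stocks closed higher",
    "d3": "the team won the final game",
}

def term_frequencies(text):
    """Map each term to the number of times it occurs in the text."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

tf = {name: term_frequencies(text) for name, text in docs.items()}
sim_12 = cosine_similarity(tf["d1"], tf["d2"])
sim_13 = cosine_similarity(tf["d1"], tf["d3"])
print(sim_12, sim_13)
```

Documents d1 and d2 share three terms and score about 0.5, while d1 and d3 share none and score 0, so a clustering method using this measure would tend to group d1 with d2.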
ILLUSTRATING DOCUMENT CLUSTERING
National 273 36
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Itemsets Discovered:
{Milk, Coke}
{Diaper, Milk}

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
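The itemsets and rules above can be checked by brute force on the five transactions. This sketch simply counts every pair of items; it is not an efficient mining algorithm, and the minimum count of 3 is an assumption, since the slide does not state a threshold.

```python
from itertools import combinations

# The five transactions listed above.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset):
    """Number of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Assumed threshold: an itemset is "frequent" if it occurs in at least 3 transactions.
min_count = 3
items = sorted(set().union(*transactions))
frequent_pairs = [
    set(pair) for pair in combinations(items, 2) if count(set(pair)) >= min_count
]

def confidence(antecedent, consequent):
    """Fraction of transactions with the antecedent that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)

print(frequent_pairs)
print(confidence({"Milk"}, {"Coke"}))
print(confidence({"Diaper", "Milk"}, {"Beer"}))
```

With this threshold, the only frequent pairs are {Coke, Milk} and {Diaper, Milk}, which matches the discovered itemsets listed above.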
FREQUENT ITEMSETS: APPLICATIONS
ASSOCIATION RULE DISCOVERY: APPLICATION
SEQUENTIAL PATTERN MINING
REGRESSION
DEVIATION/ANOMALY DETECTION
CHALLENGES OF DATA MINING
Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
Streaming Data
THANKS
Any questions?