Data Mining CSE-443: Ayesha Aziz Prova Lecturer, Dept. of CSE CWU

This document provides an overview of the Data Mining course taught by Ayesha Aziz Prova at CWU. It outlines the course contents, assessment breakdown, and recommended books, and introduces what data mining is. Data mining involves extracting useful patterns from large amounts of data and can help organizations address the "data rich but information poor" problem.

Uploaded by

Dipty Sarker

DATA MINING

CSE-443

Ayesha Aziz Prova


Lecturer,
Dept. of CSE
CWU
CONTENTS

 Course Outline
 Recommended Book

THEORY EXAM

Number of class tests: 2
Number of presentations: 1 (equal to a class test)
Class test: 10
Assignment: 10
Midterm: 30
Final Exam: 40
Attendance: 05
Class Performance: 05

PRESENTATION

 Members in a presentation: 2 (maximum)
 Number of presentations: 1
(* The presentation will be counted as a mandatory class test)

BOOK
 Data Mining: Concepts and Techniques
 J. Han and M. Kamber
 Introduction to Data Mining
 Tan, Steinbach, Kumar

WHAT IS DATA MINING?

 After years of data mining there is still no unique answer to this question.

 A tentative definition:
Data mining is the use of efficient techniques for
the analysis of very large collections of data and the
extraction of useful and possibly unexpected
patterns in data.
DATA MINING

 Data explosion problem:


 Automated data collection tools and mature database technology.
 Leading to tremendous amounts of data stored in databases, data
warehouses and other information repositories.
 We are drowning in data, but starving for knowledge!

DATA RICH BUT INFORMATION POOR

Databases are too big

Data mining can help discover knowledge

Terrorbytes
WHAT IS DATA MINING?

 Data mining is also called knowledge discovery in databases (KDD)
 Data mining is
 the extraction of useful patterns from data sources, e.g., databases, texts, the web, and images.
 Patterns must be:
 valid, novel, potentially useful, understandable

KNOWLEDGE DISCOVERY

EXAMPLE OF DISCOVERED PATTERNS

 Association rules:
“80% of customers who buy cheese and milk also buy bread, and 5% of customers buy all of them together”
Cheese, Milk → Bread [sup = 5%, confid = 80%]

ORIGINS OF DATA MINING

 Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
 Traditional techniques may be unsuitable due to
 Enormity of data
 High dimensionality of data
 Heterogeneous, distributed nature of data

[Figure: data mining at the intersection of statistics/AI, machine learning/pattern recognition, and database systems]
WHY DO WE NEED DATA MINING?

 Really, really huge amounts of raw data!!
 In the digital age, terabytes of data are generated every second
 Mobile devices, digital photographs, web documents.
 Facebook updates, Tweets, Blogs, User-generated content
 Transactions, sensor data, surveillance data
 Queries, clicks, browsing
 Cheap storage has made it possible to maintain this data
 Need to analyze the raw data to extract knowledge

WHY DO WE NEED DATA MINING?
 Data is power!
 Today, the collected data is one of the biggest assets of an online
company
 Query logs of Google
 The friendship and updates of Facebook
 Tweets and follows of Twitter
 Amazon transactions
 We need a way to harness the collective intelligence

THE DATA IS ALSO VERY COMPLEX

 Multiple types of data: tables, images, graphs, etc.
 Interconnected data of different types:
 From a mobile phone we can collect the location of the user, friendship information, check-ins to venues, opinions through Twitter, images through cameras, and queries to search engines
EXAMPLE: TRANSACTION DATA

 Billions of real-life customers:


 Credit card companies: billions of transactions per day.

 Loyalty (point) cards allow companies to collect information about specific users
EXAMPLE: DOCUMENT DATA

 Web as a document repository: an estimated 50 billion web pages
 Wikipedia: 4 million articles (and counting)
 Online news portals: a steady stream of hundreds of new articles every day
EXAMPLE: NETWORK DATA

 Web: 50 billion pages linked via hyperlinks


 Facebook: 500 million users
 Twitter: 300 million users
 Instant messenger: ~1 billion users
 Blogs: 250 million blogs worldwide; presidential candidates run blogs
EXAMPLE: GENOMIC SEQUENCES

 http://www.1000genomes.org/page.php
 Full sequence of 1000 individuals
 3×10^9 nucleotides per person → 3×10^12 nucleotides
 Lots more data in fact: medical history of the persons, gene expression data
EXAMPLE: ENVIRONMENTAL DATA
 Climate data (just an example)
http://www.ncdc.gov/oa/climate/ghcn-monthly/index.php
 “a database of temperature, precipitation and pressure records managed by the National Climatic Data Center, Arizona State University and the Carbon Dioxide Information Analysis Center”
 “6000 temperature stations, 7500 precipitation stations, 2000 pressure stations”
 Spatiotemporal data
SO, WHAT IS DATA?

 Collection of data objects and their attributes
 An attribute is a property or characteristic of an object
 Examples: eye color of a person, temperature, etc.
 Attribute is also known as variable, field, characteristic, or feature
 A collection of attributes describes an object
 Object is also known as record, point, case, sample, entity, or instance

Example (rows are objects, columns are attributes):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Size: Number of objects
Dimensionality: Number of attributes
Sparsity: Number of populated object-attribute pairs
TYPES OF ATTRIBUTES
 There are different types of attributes
 Categorical
 Examples: eye color, zip codes, words, rankings (e.g., good, fair, bad), height in {tall, medium, short}
 Numeric
 Examples: dates, temperature, time, length, value, count.
 Discrete (counts) vs Continuous (temperature)
 Special case: Binary attributes (yes/no, exists/not exists)

NUMERIC RECORD DATA

 If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute

 Such a data set can be represented by an n-by-d data matrix, where there are n rows, one for each object, and d columns, one for each attribute
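
The n-by-d data matrix described above can be sketched with plain Python lists. The objects and attribute names below (`height_cm`, `weight_kg`, `age_years`) are hypothetical and chosen only to illustrate the layout.

```python
# Hypothetical objects that share the same fixed set of numeric attributes.
attributes = ["height_cm", "weight_kg", "age_years"]
objects = [
    {"height_cm": 170.0, "weight_kg": 65.0, "age_years": 30.0},
    {"height_cm": 182.0, "weight_kg": 80.0, "age_years": 41.0},
    {"height_cm": 158.0, "weight_kg": 52.0, "age_years": 25.0},
    {"height_cm": 175.0, "weight_kg": 72.0, "age_years": 36.0},
]

# n-by-d data matrix: one row per object, one column per attribute.
data_matrix = [[obj[name] for name in attributes] for obj in objects]

n = len(data_matrix)     # number of objects (size)
d = len(data_matrix[0])  # number of attributes (dimensionality)
print(f"n = {n}, d = {d}")

# Transposing the matrix gives one list per attribute (column view).
columns = [list(col) for col in zip(*data_matrix)]
print(dict(zip(attributes, columns)))
```

Each row of `data_matrix` is a point in a d-dimensional space, which is what makes distance-based methods applicable to this kind of data.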

CATEGORICAL DATA

 Data that consists of a collection of records, each of which consists of a fixed set of categorical attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          High            No
2    No      Married         Medium          No
3    No      Single          Low             No
4    Yes     Married         High            No
5    No      Divorced        Medium          Yes
6    No      Married         Low             No
7    Yes     Divorced        High            No
8    No      Single          Medium          Yes
9    No      Married         Medium          No
10   No      Single          Medium          Yes

DOCUMENT DATA
 Each document becomes a `term' vector,
 each term is a component (attribute) of the vector,
 the value of each component is the number of times the corresponding term occurs in the document.
 Bag-of-words representation: no ordering

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
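
A term vector of this kind can be built by counting words. The sketch below assumes whitespace tokenization and a small hypothetical vocabulary; the function name `term_vector` and the sample documents are illustrative only.

```python
from collections import Counter


def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in `text` (word order is ignored)."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and documents, for illustration only.
vocabulary = ["team", "coach", "play", "ball", "score", "game"]
doc_a = "team play game team score game game"
doc_b = "game game team play team game score"  # same words, different order

vec_a = term_vector(doc_a, vocabulary)
vec_b = term_vector(doc_b, vocabulary)
print(vec_a)
print(vec_a == vec_b)
```

Because only counts are kept, two documents with the same words in a different order map to the same vector, which is exactly the "no ordering" property of the bag-of-words representation.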

TRANSACTION DATA

 Each record (transaction) is a set of items.


TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
 A set of items can also be represented as a binary vector, where
each attribute is an item.
 A document can also be represented as a set of words (no counts)

Sparsity: average number of products bought by a customer
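
The binary-vector view of the transaction table above can be sketched as follows. The transactions are the five from the slide; the alphabetical ordering of the item columns is an assumption made for illustration.

```python
# The five transactions from the slide, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# One attribute (column) per distinct item, in a fixed (alphabetical) order.
items = sorted(set().union(*transactions.values()))

# 1 if the item is in the transaction, 0 otherwise.
binary_vectors = {
    tid: [1 if item in basket else 0 for item in items]
    for tid, basket in transactions.items()
}

print(items)
for tid, vector in binary_vectors.items():
    print(tid, vector)

# Average number of items per transaction (the slide's notion of sparsity).
avg_items = sum(len(b) for b in transactions.values()) / len(transactions)
print(avg_items)
```

With a realistic product catalog the vectors would be very long and almost entirely zero, which is why set-based representations are usually preferred for storage.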


ORDERED DATA

 Genomic sequence data

 Data is a long ordered string

GRAPH DATA

 Examples: Web graph and HTML Links

TYPES OF DATA
 Numeric data: Each object is a point in a multidimensional space
 Categorical data: Each object is a vector of categorical values
 Set data: Each object is a set of values (with or without counts)
 Sets can also be represented as binary vectors, or vectors of counts
 Ordered sequences: Each object is an ordered sequence of values.
 Graph data

WHAT CAN YOU DO WITH THE DATA?

 Suppose that you are the owner of a supermarket and you have collected billions of market basket records. What information would you extract from them and how would you use it?

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Possible uses: product placement, catalog creation, recommendations

 What if this was an online store?

WHAT CAN YOU DO WITH THE DATA?
 Suppose you are a biologist who has microarray expression data: thousands of genes, and their expression values over thousands of different settings (e.g. tissues). What information would you like to get out of your data?

Groups of genes and tissues

WHY MINE DATA? COMMERCIAL VIEWPOINT
 Lots of data is being collected and warehoused
 Web data, e-commerce
 Purchases at department/grocery stores
 Bank/credit card transactions
 Computers have become cheaper and more powerful
 Competitive pressure is strong
 Provide better, customized services for an edge (e.g. in Customer Relationship Management)
WHY MINE DATA? SCIENTIFIC VIEWPOINT

 Data collected and stored at enormous speeds (GB/hour)
 remote sensors on a satellite
 telescopes scanning the skies
 microarrays generating gene expression data
 scientific simulations generating terabytes of data
 Traditional techniques infeasible for raw data
 Data mining may help scientists
 in classifying and segmenting data
 in hypothesis formation
WHAT IS DATA MINING AGAIN?
 “Data mining is the analysis of (often large) observational data sets to find unsuspected
relationships and to summarize the data in novel ways that are both understandable and
useful to the data analyst” (Hand, Mannila, Smyth)

 “Data mining is the discovery of models for data” (Rajaraman, Ullman)


 We can have the following types of models
 Models that explain the data (e.g., a single function)
 Models that predict the future data instances.
 Models that summarize the data
 Models that extract the most prominent features of the data.

DATA MINING TASKS...
 Classification [Predictive]
 Clustering [Descriptive]
 Association Rule Discovery [Descriptive]
 Sequential Pattern Discovery [Descriptive]
 Regression [Predictive]
 Deviation Detection [Predictive]

CLASSIFICATION: DEFINITION

 Given a collection of records (training set)
 Each record contains a set of attributes, one of the attributes is the class.
 Find a model for class attribute as a function of the values of other
attributes.
 Goal: previously unseen records should be assigned a class as
accurately as possible.
 A test set is used to determine the accuracy of the model. Usually, the given
data set is divided into training and test sets, with training set used to build
the model and test set used to validate it.
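
The train/test workflow described above can be sketched with the ten records from the lecture's example table. The choice of a 1-nearest-neighbor model, the mixed distance function, and the 7/3 split are all assumptions made for illustration; the lecture does not prescribe a particular classifier.

```python
# Records from the lecture's example table: (refund, marital status, income in K, cheat).
records = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Assumed split: first seven records for training, last three for testing.
training_set, test_set = records[:7], records[7:]


def distance(a, b):
    """Assumed mixed distance: categorical mismatches plus a scaled income gap."""
    mismatches = (a[0] != b[0]) + (a[1] != b[1])
    return mismatches + abs(a[2] - b[2]) / 100.0


def classify(record, training):
    """1-nearest-neighbor: copy the class of the closest training record."""
    nearest = min(training, key=lambda t: distance(record, t))
    return nearest[3]


def accuracy(training, test):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(1 for r in test if classify(r, training) == r[3])
    return correct / len(test)


print(f"test accuracy = {accuracy(training_set, test_set):.2f}")
```

With only three held-out records the accuracy figure is a very rough estimate; the point of the sketch is the separation between the data used to build the model and the data used to validate it.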

CLASSIFICATION EXAMPLE
Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class)

Training set:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test set:

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Figure: the training set is used to learn a classifier (the model), which is then applied to the test set]
CLASSIFICATION: APPLICATION 1

CLASSIFICATION: APPLICATION 2

 Fraud Detection
 Goal: Predict fraudulent cases in credit card transactions.
 Approach:
 Use credit card transactions and the information on its account-holder as
attributes.
 When a customer buys, what they buy, how often they pay on time, etc.
 Label past transactions as fraud or fair transactions. This forms the class
attribute.
 Learn a model for the class of the transactions.
 Use this model to detect fraud by observing credit card transactions on an
account.

CLASSIFYING GALAXIES

Class:
• Stages of formation (early, intermediate, late)

Attributes:
• Image features,
• Characteristics of light waves received, etc.

Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
CLUSTERING DEFINITION

 Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
 Data points in one cluster are more similar to one another.
 Data points in separate clusters are less similar to one another.
 Similarity Measures:
 Euclidean Distance if attributes are continuous.
 Other Problem-specific Measures.
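
The clustering goal above (small distances within a cluster, large distances between clusters) can be checked numerically with Euclidean distance. The sketch below assumes two hypothetical, already-formed clusters of 2-D points; it evaluates a grouping rather than implementing a clustering algorithm.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Hypothetical 2-D points already grouped into two clusters.
cluster_a = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
cluster_b = [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]

# Average distance between points in the same cluster.
intra = mean(
    euclidean(p, q)
    for cluster in (cluster_a, cluster_b)
    for p, q in combinations(cluster, 2)
)
# Average distance between points in different clusters.
inter = mean(euclidean(p, q) for p in cluster_a for q in cluster_b)

print(f"intra = {intra:.2f}, inter = {inter:.2f}")
```

A good grouping shows a much smaller average intracluster distance than intercluster distance, as it does for these two well-separated groups.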

CLUSTERING DEFINITION

Intracluster distances are minimized
Intercluster distances are maximized

CLUSTERING: APPLICATION 1

 Market Segmentation:
 Goal: subdivide a market into distinct subsets of customers where any subset
may conceivably be selected as a market target to be reached with a distinct
marketing mix.
 Approach:
 Collect different attributes of customers based on their geographical and
lifestyle related information.
 Find clusters of similar customers.
 Measure the clustering quality by observing buying patterns of customers
in same cluster vs. those from different clusters.

CLUSTERING: APPLICATION 2

 Bioinformatics applications:
 Goal: Group genes and tissues together such that genes are co-expressed on the same tissues

CLUSTERING: APPLICATION 3

 Document Clustering:
 Goal: To find groups of documents that are similar to each other based on
the important terms appearing in them.
 Approach:
 To identify frequently occurring terms in each document.
 Form a similarity measure based on the frequencies of different terms.
 Use it to cluster.
 Gain:
 Information Retrieval can utilize the clusters to relate a new document or search
term to clustered documents.
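
A similarity measure based on term frequencies, as described in the approach above, can be sketched as follows. Cosine similarity is used here as one common choice; the lecture does not name a specific measure, and the sample documents are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Map each term in `text` to the number of times it occurs."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical documents, for illustration only.
docs = {
    "sports_1": "team wins game coach praises team",
    "sports_2": "coach says team lost game",
    "finance_1": "market falls as bank shares drop",
}
tfs = {name: term_frequencies(text) for name, text in docs.items()}

print(cosine_similarity(tfs["sports_1"], tfs["sports_2"]))
print(cosine_similarity(tfs["sports_1"], tfs["finance_1"]))
```

Documents that share frequent terms score close to 1 and documents with no terms in common score 0, so a clustering algorithm can group the two sports articles together.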

ILLUSTRATING DOCUMENT CLUSTERING

 Clustering Points: 3204 Articles of Los Angeles Times.
 Similarity Measure: How many words are common in these documents (after some word filtering).

Category       Total Articles  Correctly Placed
Financial      555             364
Foreign        341             260
National       273             36
Metro          943             746
Sports         738             573
Entertainment  354             278

FREQUENT ITEMSETS AND ASSOCIATION RULES
 Given a set of records each of which contains some number of items from a given collection;
 Identify sets of items (itemsets) occurring frequently together
 Produce dependency rules which will predict occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Itemsets Discovered:
{Milk, Coke}
{Diaper, Milk}

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
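
The itemsets and rules on this slide can be reproduced from the five transactions with a brute-force count. This is a minimal sketch, not an efficient algorithm such as Apriori; the minimum count of 3 is an assumed threshold, and the function names are illustrative.

```python
from itertools import combinations

# The five transactions from the slide.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def frequent_itemsets(transactions, size, min_count):
    """Brute force: count every `size`-item combination and keep the frequent ones."""
    items = sorted(set().union(*transactions))
    result = {}
    for candidate in combinations(items, size):
        count = sum(1 for t in transactions if set(candidate) <= t)
        if count >= min_count:
            result[candidate] = count
    return result


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing `antecedent` that also contain `consequent`."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    return sum(1 for t in with_antecedent if consequent <= t) / len(with_antecedent)


# min_count=3 is an assumed threshold, chosen for illustration.
pairs = frequent_itemsets(transactions, size=2, min_count=3)
print(pairs)
print(confidence(transactions, {"Milk"}, {"Coke"}))
print(confidence(transactions, {"Diaper", "Milk"}, {"Beer"}))
```

Enumerating all combinations is only feasible for tiny item collections; practical miners prune candidates whose subsets are already known to be infrequent.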
FREQUENT ITEMSETS: APPLICATIONS

 Text mining: finding associated phrases in text


 There are lots of documents that contain the phrases “association rules”,
“data mining” and “efficient algorithm”
 Recommendations:
 Users who buy this item often buy that item as well
 Users who watched James Bond movies, also watched Jason Bourne
movies.
 Recommendations make use of item and user similarity

ASSOCIATION RULE DISCOVERY: APPLICATION

 Supermarket shelf management.


 Goal: To identify items that are bought together by sufficiently many
customers.
 Approach: Process the point-of-sale data collected with barcode
scanners to find dependencies among items.
 A classic rule --
 If a customer buys diaper and milk, then he is very likely to buy beer.
 So, don’t be surprised if you find six-packs stacked next to diapers!

SEQUENTIAL PATTERN MINING

 Sequential pattern mining:
A sequential rule: A → B, says that event A will be immediately followed by event B with a certain confidence.
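
The confidence of a sequential rule can be estimated by counting how often one event is immediately followed by another. The sketch below assumes hypothetical event logs and one possible definition of confidence (occurrences of A with no successor are ignored); the function name is illustrative.

```python
def sequential_confidence(sequences, a, b):
    """Among occurrences of `a` that have a successor, the fraction immediately followed by `b`."""
    followed = 0
    total = 0
    for seq in sequences:
        # Pair every event with the event that comes right after it.
        for current, nxt in zip(seq, seq[1:]):
            if current == a:
                total += 1
                followed += nxt == b
    return followed / total if total else 0.0


# Hypothetical event logs, for illustration only.
logs = [
    ["login", "search", "buy"],
    ["login", "search", "logout"],
    ["search", "buy", "logout"],
    ["login", "browse", "search", "buy"],
]
print(sequential_confidence(logs, "search", "buy"))
```

Unlike an association rule, the order of events matters here: reversing the arguments gives the confidence of a different rule.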

REGRESSION

 Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
 Extensively studied in the statistics and neural network fields.
 Examples:
 Predicting sales amounts of new product based on advertising
expenditure.
 Predicting wind velocities as a function of temperature, humidity, air
pressure, etc.
 Time series prediction of stock market indices.
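
The first example above (sales as a function of advertising expenditure) can be sketched with a one-variable ordinary least squares fit. The data points are hypothetical and deliberately lie on a straight line; the linear model is one of the dependency models the slide mentions.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising spend vs. sales (arbitrary units).
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]  # constructed as sales = 2 * advertising + 1

slope, intercept = fit_line(advertising, sales)
predicted = slope * 6.0 + intercept  # predict sales for an unseen spend of 6.0
print(slope, intercept, predicted)
```

Real data would not fit the line exactly, and the quality of the prediction would then be judged by the residual error on held-out observations.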

DEVIATION/ANOMALY DETECTION

 Detect significant deviations from normal behavior
 Discovering the most significant changes in data
 Applications:
 Credit Card Fraud Detection
 Network Intrusion Detection

Typical network traffic at university level may reach over 100 million connections per day
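
One simple way to flag deviations from normal behavior is a z-score test. This is an assumed technique chosen for illustration, not a method named in the lecture; the threshold of 2.0 and the daily connection counts are hypothetical.

```python
import statistics


def find_anomalies(values, threshold=2.0):
    """Return values whose z-score magnitude exceeds `threshold` (an assumed cutoff)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / std > threshold]


# Hypothetical daily connection counts (in thousands), for illustration only.
daily_connections = [100, 102, 98, 101, 99, 103, 97, 250]
print(find_anomalies(daily_connections))
```

The z-score assumes roughly bell-shaped data, and a single extreme value inflates the standard deviation, so real intrusion or fraud detection typically relies on more robust models.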
CHALLENGES OF DATA MINING

 Scalability
 Dimensionality
 Complex and Heterogeneous Data
 Data Quality
 Data Ownership and Distribution
 Privacy Preservation
 Streaming Data

THANKS

Any questions?
