LECTURE 1: INTRODUCTION TO DATA MINING
Dr. Dhaval Patel
CSE, IIT-Roorkee
What is data mining?
Data mining is also called knowledge discovery and data mining (KDD)
Data mining is
🞑 extraction of useful patterns from data sources, e.g., databases, texts, web, images.
Patterns must be:
🞑 valid, novel, potentially useful, understandable
Knowledge Discovery in Data: Process
Data → [Data Mining] → Patterns → [Interpretation/Evaluation] → Knowledge
Knowledge Discovery in Data: Challenges
Data: Volume, Velocity, Variety
Volume
- Big Data
- Small Data
Velocity
- Data Stream
- Static
Variety
- Transaction
- Temporal
- Spatial
- …
Outline (Part 1)
Introduction to Data
🞑 Transactional Data
🞑 Temporal Data
🞑 Spatial & Spatial-Temporal Data
Data Preprocessing
🞑 Missing Values
🞑 Summarization
INTRODUCTION TO DATA
Data Come from Everywhere
Grocery Markets, E-Commerce, Stock Exchange, Hospital, Weather Station, Social Media
But they have different forms
What is Data?
Collection of records (objects) and their attributes
An attribute is a characteristic of an object
Columns are attributes; rows are objects.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
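As a minimal sketch, the table above can be held in Python as a list of records, one dict per object with one key per attribute. The variable names are illustrative, and storing income as an integer in thousands is an assumption made here for convenience.

```python
# Each record (object) is a dict; each key is an attribute.
# Values are copied from the table; "Taxable Income" is in thousands (assumed encoding).
rows = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]
attributes = ["Tid", "Refund", "Marital Status", "Taxable Income", "Cheat"]
records = [dict(zip(attributes, row)) for row in rows]

# Simple lookups over the records: which objects have Cheat == "Yes"?
cheaters = [r["Tid"] for r in records if r["Cheat"] == "Yes"]
print(attributes)
print(cheaters)
```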
Types of Data
Record Data
🞑 Transactional Data
Temporal Data
🞑 Time Series Data
🞑 Sequence Data
Spatial & Spatial-Temporal Data
🞑 Spatial Data
🞑 Spatial-Temporal Data
Graph Data
🞑 Transactional Data
Unstructured Data
🞑 Twitter Status Message
🞑 Review, news article
Semi-Structured Data
🞑 Paper Publications Data
🞑 XML format
Record Data
• Transaction Data

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Market-Basket Dataset
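One possible way to hold the market-basket dataset in code is a mapping from TID to a set of items. The sketch below assumes that representation and counts how many transactions contain each item; the names are illustrative.

```python
from collections import Counter

# Market-basket data from the slide: TID -> set of items.
baskets = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Count the number of transactions in which each item appears.
item_counts = Counter(item for items in baskets.values() for item in items)
for item, count in sorted(item_counts.items()):
    print(item, count)
```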
Data Matrix
If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute
Data Matrix Example for Documents
Each document becomes a `term' vector,
🞑 each term is a component (attribute) of the vector,
🞑 the value of each component is the number of times the corresponding term occurs in the document.
[Figure: document-term matrix; column headers are the terms timeout, season, coach, game, score, team, ball, lost, play, win]
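A minimal sketch of the term-vector idea: each document is turned into a vector of term counts over a fixed vocabulary, and stacking the vectors gives an m by n data matrix (m documents, n terms). The vocabulary uses the terms from the slide; the two documents are hypothetical and exist only for illustration.

```python
from collections import Counter

# Vocabulary taken from the slide's column headers.
vocabulary = ["timeout", "season", "coach", "game", "score",
              "team", "ball", "lost", "play", "win"]

# Hypothetical documents, invented for this example.
documents = [
    "team play ball team win game",
    "coach timeout timeout season lost game",
]

def term_vector(text, vocabulary):
    """Count how many times each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

# m-by-n data matrix: one row per document, one column per term.
matrix = [term_vector(doc, vocabulary) for doc in documents]
for row in matrix:
    print(row)
```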
Distance Matrix
[Figure: scatter plot of the points p1 to p4 in the x-y plane]

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

Distance Matrix
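The entries of the distance matrix are consistent with pairwise Euclidean distances between the four points. A short sketch that recomputes them, assuming Euclidean distance and Python 3.8+ for `math.dist`:

```python
import math

# Point coordinates from the slide.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}
names = list(points)

# distance[i][j] is the Euclidean distance between point i and point j.
distance = [[math.dist(points[a], points[b]) for b in names] for a in names]

for name, row in zip(names, distance):
    print(name, "  ".join(f"{d:.3f}" for d in row))
```

The matrix is symmetric with zeros on the diagonal, so only the upper (or lower) triangle carries information.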
Temporal Data
Sequence Data
(Patient Data obtained from Zhang’s KDD 06 Paper)
Temporal Data
Time Series Data
(Yahoo Finance Website)
Biological Sequence Data
Interval Data
EL = { (A, 1, 5), (C, 3, 12), (B, 4, 9), (D, 9, 15) }
[Figure: intervals A, B, C, D drawn along a time axis with ticks at 1, 3, 4, 5, 9, 12, 15]
( ( (A overlaps C) contains B ) overlaps D )
(Interval Patient Data obtained from Amit’s M.Tech. Thesis Work)
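The nested expression can be checked mechanically. The sketch below assumes that a composite pattern such as (A overlaps C) is treated as the interval spanning its parts, which is an interpretation made here, not something the slide states. The functions `relation` and `span` are hypothetical helpers covering only the relations used in this example.

```python
def relation(a, b):
    """Relate interval a=(start, end) to interval b for a few Allen-style relations."""
    (a_start, a_end), (b_start, b_end) = a, b
    if a_start < b_start and b_end < a_end:
        return "contains"
    if a_start < b_start < a_end < b_end:
        return "overlaps"
    if a_end < b_start:
        return "before"
    return "other"

def span(a, b):
    """Smallest interval covering both a and b (assumed meaning of a composite)."""
    return (min(a[0], b[0]), max(a[1], b[1]))

# Event list EL from the slide: label -> (start, end).
events = {"A": (1, 5), "C": (3, 12), "B": (4, 9), "D": (9, 15)}

ac = span(events["A"], events["C"])
acb = span(ac, events["B"])
steps = [
    relation(events["A"], events["C"]),
    relation(ac, events["B"]),
    relation(acb, events["D"]),
]
print(steps)
```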
Spatial & Spatial-Temporal Data
• Spatial Data: Spatial Distribution of Objects of Various Types (Prof. Shashi Shekhar)
• Spatial Data: Average Monthly Temperature of land and ocean
• Spatial Data: Dengue Disease Dataset (Singapore)
• Trajectory Data: Set of Hurricanes (http://csc.noaa.gov/hurricanes)
• Trajectory Data: of 87 users obtained using RFID (VAST 2008 Challenge – RFID Dataset)
User Movement Data
Trajectory
🞑 Movement trail of a user
🞑 Sampling Points: <latitude, longitude, time>
[Figure: trajectory P1 on weekends, passing Home, Stadium, Movie Complex, Swimming Pool]
Thanks to Shreyash and Sahoishnu (M.Tech. Students)
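A trajectory as described above can be sketched as a time-ordered list of sampling points. The class name, the coordinates, and the time unit (minutes since midnight) are all hypothetical choices made for this illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SamplingPoint:
    latitude: float
    longitude: float
    time: int  # hypothetical unit: minutes since midnight

# Hypothetical sampling points for one user, deliberately out of order.
raw_points = [
    SamplingPoint(29.8650, 77.8950, 600),
    SamplingPoint(29.8543, 77.8880, 540),
    SamplingPoint(29.8700, 77.9000, 660),
]

# A trajectory is the movement trail ordered by time.
trajectory = sorted(raw_points, key=lambda p: p.time)
duration = trajectory[-1].time - trajectory[0].time
print(len(trajectory), "points over", duration, "minutes")
```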
Graph Data
Semi-structured Data
Unstructured Data
Data can help us solve specific problems.
How should these pictures be placed into 3 groups?
How should these pictures be placed into groups? How many groups should there be?
Which genes are associated with a disease? How can expression values be used to predict survival?
What items should Amazon display for me?
Is it likely that this stock was traded based on illegal insider information?
Where are the faces in this picture?
Is this spam?
Will I like 300?
What techniques do people apply to data?
They apply data mining algorithms and discover useful knowledge
So, what are some of the well-known data mining tasks?
🞑 Clustering,
🞑 Classification,
🞑 Frequent Patterns,
🞑 Association Rules,
🞑 ….
What do people do with time series data?
🞑 Clustering
🞑 Classification
🞑 Motif Discovery
🞑 Rule Discovery (s = 0.5, c = 0.3)
🞑 Query by Content
🞑 Visualization
🞑 Novelty Detection
🞑 Motif Association
What do people do with trajectory data?
🞑 Clustering
🞑 Frequent Travel Patterns
🞑 Motif Discovery
🞑 Prediction
🞑 Visualization
🞑 Classification
In Summary
Types of Data
🞑 Transactional Data
🞑 Sequence Data
🞑 Interval Data
🞑 Time Series Data
🞑 Spatial Data
🞑 Spatio-Temporal Data
🞑 Data Set with Multiple Kinds of Data
🞑 ….
Algorithms
Data Mining Methods
🞑 Frequent Pattern Discovery
🞑 Clustering
🞑 Outlier Detection
🞑 Classification
🞑 Statistical Analysis
🞑 …
Activity 1
Find the top 3 recent research activities around the world that are analyzing data. You need to write a short summary of each research activity. The first three lines must follow this format:
🞑 Line 1: The problem they are trying to solve, along with the dataset they are using
🞑 Line 2: How they are solving the problem
🞑 Line 3: Justify why you rate this work as one of the top 3 activities
🞑 Remaining lines… you can decide yourself ….
BigN’Smart Research group at IIT-Roorkee is analyzing the “YelpReview” dataset for learning location-to-activity tagging. They are applying … . I feel this is interesting research because …
Activity 2: Why Data Mining?
🞑 Google
🞑 Facebook
🞑 Netflix
🞑 eHarmony
🞑 FICO
🞑 FlightCaster
🞑 IBM’s Watson
Read about their story
Related Field
[Figure: Data Mining and Knowledge Discovery at the center, surrounded by Machine Learning, Visualization, Statistics, and Databases]
Related Field
Statistics:
🞑 more theory-based
🞑 more focused on testing hypotheses
Machine learning
🞑 more heuristic
🞑 focused on improving performance of a learning agent
🞑 also looks at real-time learning and robotics – areas not part of data mining
Data Mining and Knowledge Discovery
🞑 integrates theory and heuristics
🞑 focus on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results
Distinctions are fuzzy
Classification
Learn a method for predicting the instance class from pre-labeled (classified) instances
Many approaches: Statistics, Decision Trees, Neural Networks, ...
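To make "predicting the instance class from pre-labeled instances" concrete, here is a minimal sketch using a 1-nearest-neighbor rule. Nearest neighbor is not one of the approaches listed above; it is used only because it fits in a few lines. The training points and labels are hypothetical.

```python
import math

def predict_1nn(train, query):
    """Return the label of the training instance closest to the query point."""
    _, nearest_label = min(train, key=lambda pair: math.dist(pair[0], query))
    return nearest_label

# Hypothetical pre-labeled instances: (feature vector, class label).
train = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((5.0, 5.0), "B"),
    ((6.0, 5.5), "B"),
]

print(predict_1nn(train, (1.2, 1.4)))
print(predict_1nn(train, (5.5, 5.0)))
```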
Clustering
Find “natural” grouping of instances given unlabeled data
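As one illustration of grouping unlabeled instances, the sketch below runs a basic k-means loop. The slide does not name an algorithm, so k-means is an assumption here, as are the toy points, the choice of two clusters, and the fixed initial centroids (chosen to keep the run deterministic).

```python
import math

def kmeans(points, centroids, iterations=10):
    """Basic k-means: assign each point to its nearest centroid, then re-average."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid is the coordinate-wise mean; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return clusters, centroids

# Hypothetical unlabeled 2-D instances.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (5.0, 5.0), (6.0, 5.5), (5.5, 6.0)]
clusters, centroids = kmeans(points, centroids=[points[0], points[3]])
for cluster in clusters:
    print(cluster)
```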
Association Rules & Frequent Itemsets
Transactions

TID  Produce
1    MILK, BREAD, EGGS
2    BREAD, SUGAR
3    BREAD, CEREAL
4    MILK, BREAD, SUGAR
5    MILK, CEREAL
6    BREAD, CEREAL
7    MILK, CEREAL
8    MILK, BREAD, CEREAL, EGGS
9    MILK, BREAD, CEREAL

Frequent Itemsets:
Milk, Bread (4)
Bread, Cereal (4)
Milk, Bread, Cereal (2)
…

Rules:
Milk => Bread (66%)
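The itemset counts and the rule percentage can be recomputed directly from the nine transactions. In this sketch the number in parentheses is read as a support count (number of transactions containing the itemset) and the rule percentage as confidence, count(Milk, Bread) / count(Milk); the helper names are illustrative.

```python
# The nine transactions from the slide.
transactions = [
    {"MILK", "BREAD", "EGGS"},
    {"BREAD", "SUGAR"},
    {"BREAD", "CEREAL"},
    {"MILK", "BREAD", "SUGAR"},
    {"MILK", "CEREAL"},
    {"BREAD", "CEREAL"},
    {"MILK", "CEREAL"},
    {"MILK", "BREAD", "CEREAL", "EGGS"},
    {"MILK", "BREAD", "CEREAL"},
]

def support_count(itemset):
    """Number of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(antecedent, consequent):
    """Fraction of transactions with the antecedent that also contain the consequent."""
    return support_count(antecedent | consequent) / support_count(antecedent)

for itemset in ({"MILK", "BREAD"}, {"BREAD", "CEREAL"}, {"MILK", "BREAD", "CEREAL"}):
    print(sorted(itemset), support_count(itemset))
print("MILK => BREAD", f"{confidence({'MILK'}, {'BREAD'}):.1%}")
```

Milk appears in 6 transactions and Milk with Bread in 4, which gives the 4/6 behind the slide's 66% figure.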
Visualization & Data Mining
🞑 Visualizing the data to facilitate human discovery
🞑 Presenting the discovered results in a visually "nice" way
Summarization
🞑 Describe features of the selected group
🞑 Use natural language and graphics
🞑 Usually in combination with deviation detection or other methods
Average length of stay in this study area rose 45.7 percent, from 4.3 days to 6.2 days, because ...
Data Mining Models and Tasks
Obtained from Prof. Srini’s Lecture notes