
Lecture 1 Introduction To Data Mining

Data mining, also known as knowledge discovery, involves extracting useful patterns from various data sources. The lecture outlines the process of knowledge discovery, types of data, and challenges faced in data mining, including big data and data variety. It also discusses various data mining tasks such as clustering, classification, and association rules.

Uploaded by

Shumaila

LECTURE 1: INTRODUCTION TO DATA MINING
Dr. Dhaval Patel
CSE, IIT-Roorkee

What is data mining?
• Data mining is also called knowledge discovery in data (KDD).
• Data mining is the extraction of useful patterns from data sources, e.g., databases, texts, the web, and images.
• Patterns must be valid, novel, potentially useful, and understandable.

Knowledge Discovery in Data: Process
[Figure: Data -> Data Mining -> Patterns -> Interpretation/Evaluation -> Knowledge]

Knowledge Discovery in Data: Challenges
• Volume
  - Big Data
  - Small Data
• Variety
  - Transaction
  - Temporal
  - Spatial
  - …
• Velocity
  - Data Stream
  - Static

Outline (Part 1)
• Introduction to Data
  - Transactional Data
  - Temporal Data
  - Spatial & Spatial-Temporal Data
• Data Preprocessing
  - Missing Values
  - Summarization

INTRODUCTION TO DATA

Data Come from Everywhere
• Grocery Markets
• E-Commerce
• Stock Exchange
• Hospital
• Weather Station
• Social Media
But they come in different forms.

What is Data?
• Collection of records (objects) and their attributes
• An attribute is a characteristic of an object

Each column is an attribute; each row is an object:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Types of Data
• Record Data
  - Transactional Data
• Temporal Data
  - Time Series Data
  - Sequence Data
• Spatial & Spatial-Temporal Data
  - Spatial Data
  - Spatial-Temporal Data
• Graph Data
  - Transactional Data
• Unstructured Data
  - Twitter Status Message
  - Review, news article
• Semi-Structured Data
  - Paper Publications Data
  - XML format

Record Data
• Transaction Data (Market-Basket Dataset)

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Data Matrix
• If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute.
• Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute.

Data Matrix Example for Documents
• Each document becomes a 'term' vector:
  - each term is a component (attribute) of the vector,
  - the value of each component is the number of times the corresponding term occurs in the document.
[Table: document-term matrix with term columns team, coach, play, ball, score, game, win, lost, timeout, season; cell values not recoverable]

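The term-vector idea above can be sketched in a few lines of Python. This is only an illustration: the vocabulary reuses the ten term columns from the slide, but the two documents and the helper name `term_vector` are made up for the example, and tokenization is a plain whitespace split.

```python
from collections import Counter

def term_vector(text, vocabulary):
    """Return the term counts of `text`, one entry per vocabulary term."""
    counts = Counter(text.lower().split())  # naive whitespace tokenization
    return [counts[term] for term in vocabulary]

# Vocabulary: the ten term columns shown on the slide.
vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]

# Hypothetical documents, invented for illustration only.
documents = [
    "the team lost the game after a late timeout",
    "coach says the team will play to win this season",
]

# One row per document, one column per term: an m by n data matrix.
matrix = [term_vector(doc, vocabulary) for doc in documents]
for row in matrix:
    print(row)
```

Each printed row is one document's term vector, so the list of rows is a small document-term data matrix.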
Distance Matrix
[Figure: scatter plot of the points p1 to p4 on x-y axes]

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance Matrix
     p1     p2     p3     p4
p1   0      2.828  3.162  5.099
p2   2.828  0      1.414  3.162
p3   3.162  1.414  0      2
p4   5.099  3.162  2      0

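The distance values on this slide are consistent with plain Euclidean distance between the four points. A minimal sketch, assuming Euclidean distance is the intended measure (the function name `distance_matrix` is only illustrative):

```python
import math

# The four points from the slide, as (x, y) coordinates.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def distance_matrix(points):
    """Return a nested dict with the Euclidean distance between every pair."""
    names = list(points)
    return {a: {b: math.dist(points[a], points[b]) for b in names}
            for a in names}

dist = distance_matrix(points)
for a in points:
    # Print one row of the matrix, rounded to three decimals.
    print(a, " ".join(f"{dist[a][b]:.3f}" for b in points))
```

The matrix is symmetric with zeros on the diagonal, so only the upper or lower triangle needs to be stored in practice.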
Temporal Data
• Sequence Data
(Patient data obtained from Zhang's KDD 06 paper)

Temporal Data
• Time Series Data
(Yahoo Finance website)

Biological Sequence Data

Interval Data
EL = { (A, 1, 5), (C, 3, 12), (B, 4, 9), (D, 9, 15) }
[Figure: intervals A, B, C, and D drawn on a time axis with ticks at 1, 3, 4, 5, 9, 12, 15]
(((A overlaps C) contains B) overlaps D)
(Interval patient data obtained from Amit's M.Tech. thesis work)

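One way to read the nested expression on this slide is to sort the events by start time and fold them left to right, naming the relation between the span covered so far and the next interval. The sketch below does that under stated assumptions: it recognizes only three relations (before, contains, overlaps) rather than a complete set of interval relations, and the helper names `relation`, `merge`, and `describe` are hypothetical.

```python
def relation(x, y):
    """Name the relation between intervals x = (start, end) and y = (start, end).

    Simplified: only 'before', 'contains' and 'overlaps' are recognized.
    """
    (xs, xe), (ys, ye) = x, y
    if xe < ys:
        return "before"
    if xs <= ys and ye <= xe:
        return "contains"
    if xs < ys <= xe < ye:
        return "overlaps"
    return "other"

def merge(x, y):
    """Return the smallest interval covering both x and y."""
    return (min(x[0], y[0]), max(x[1], y[1]))

def describe(event_list):
    """Build a nested relation expression from (label, start, end) events."""
    events = sorted(event_list, key=lambda e: e[1])  # order by start time
    label, start, end = events[0]
    expr, span = label, (start, end)
    for label, start, end in events[1:]:
        expr = f"({expr} {relation(span, (start, end))} {label})"
        span = merge(span, (start, end))
    return expr

# The event list EL from the slide.
event_list = [("A", 1, 5), ("C", 3, 12), ("B", 4, 9), ("D", 9, 15)]
print(describe(event_list))
```

For the slide's event list this reproduces the expression shown there; other inputs may produce the placeholder relation "other" because the relation set is deliberately incomplete.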
Spatial & Spatial-Temporal Data
• Spatial Data
(Spatial Distribution of Objects of Various Types: Prof. Shashi Shekhar)

Spatial & Spatial-Temporal Data
• Spatial Data
(Average monthly temperature of land and ocean)

Spatial & Spatial-Temporal Data
• Spatial Data
(Dengue Disease Dataset, Singapore)

Spatial & Spatial-Temporal Data
• Trajectory Data: set of hurricanes
(http://csc.noaa.gov/hurricanes)

Spatial & Spatial-Temporal Data
• Trajectory Data: trajectories of 87 users obtained using RFID
(VAST 2008 Challenge, RFID Dataset)

User Movement Data
• Trajectory
  - Movement trail of a user
  - Sampling points: <latitude, longitude, time>
[Figure: movement trail of user P1 on weekends, passing Home, Stadium, Movie Complex, and Swimming Pool]
(Thanks to Shreyash and Sahoishnu, M.Tech. students)

Graph Data

Semi-structured Data

Unstructured Data

Data can help us solve specific problems.
• How should these pictures be placed into 3 groups?
• How should these pictures be placed into groups? How many groups should there be?
• Which genes are associated with a disease? How can expression values be used to predict survival?
• What items should Amazon display for me?
• Is it likely that this stock was traded based on illegal insider information?
• Where are the faces in this picture?
• Is this spam?
• Will I like 300?

What techniques do people apply to data?
• They apply data mining algorithms and discover useful knowledge.
• So, what are some of the well-known data mining tasks?
  - Clustering,
  - Classification,
  - Frequent Patterns,
  - Association Rules,
  - ….

What do people do with time series data?
• Clustering
• Classification
• Motif Discovery
• Rule Discovery (figure annotation: s = 0.5, c = 0.3)
• Query by Content
• Visualization
• Novelty Detection
• Motif Association

What do people do with trajectory data?
• Clustering
• Frequent Travel Patterns
• Motif Discovery
• Prediction
• Visualization
• Classification

In Summary
[Diagram: Types of Data, Algorithms, Data Mining Methods]

Types of Data
• Transactional Data
• Sequence Data
• Interval Data
• Time Series Data
• Spatial Data
• Spatio-Temporal Data
• Data Set with Multiple Kinds of Data
• ….

Data Mining Methods
• Frequent Pattern Discovery
• Clustering
• Outlier Detection
• Classification
• Statistical Analysis
• …

Activity 1
• Find the top 3 recent research activities around the world that analyze data. Write a short summary of each research activity. The first three lines must follow this format:
  - Line 1: The problem they are trying to solve, along with the dataset they are using
  - Line 2: How they are solving the problem
  - Line 3: Why you rate this work as one of the top 3 activities
  - Remaining lines: your own thoughts
Example: The BigN'Smart research group at IIT-Roorkee is analyzing the "YelpReview" dataset for learning location-to-activity tagging. They are applying … . I feel this is interesting research because …

Activity 2: Why Data Mining???
Read about their story:
• Google
• Facebook
• Netflix
• eHarmony
• FICO
• FlightCaster
• IBM's Watson

Related Field
[Figure: Data Mining and Knowledge Discovery at the center, surrounded by Machine Learning, Visualization, Statistics, and Databases]

Related Field
• Statistics:
  - more theory-based
  - more focused on testing hypotheses
• Machine learning:
  - more heuristic
  - focused on improving performance of a learning agent
  - also looks at real-time learning and robotics, areas not part of data mining
• Data Mining and Knowledge Discovery:
  - integrates theory and heuristics
  - focuses on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results
• Distinctions are fuzzy

Classification
Learn a method for predicting the instance class from pre-labeled (classified) instances.

Many approaches: Statistics, Decision Trees, Neural Networks, ...

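As a rough illustration of learning from pre-labeled instances, the sketch below applies a 1-nearest-neighbor rule to the ten labeled records from the "What is Data?" slide. Nearest neighbor is not one of the approaches named above; it is chosen here only because it fits in a few lines. The distance function (count of mismatched categorical attributes plus the income gap divided by 100) and the two query records are assumptions made up for the example.

```python
# Labeled records from the "What is Data?" slide:
# (refund, marital status, taxable income in thousands, cheat)
training = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

def distance(a, b):
    """Assumed mixed distance: categorical mismatches plus scaled income gap."""
    mismatches = (a[0] != b[0]) + (a[1] != b[1])
    income_gap = abs(a[2] - b[2]) / 100  # scaling factor is an assumption
    return mismatches + income_gap

def predict(record):
    """Return the class label of the closest labeled record (1-NN)."""
    nearest = min(training, key=lambda row: distance(record, row))
    return nearest[3]

# Hypothetical unlabeled records.
print(predict(("No", "Single", 88)))
print(predict(("Yes", "Married", 110)))
```

The "method" learned here is simply the stored training set plus the distance rule; decision trees or neural networks would instead fit an explicit model to the same labeled data.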
Clustering
Find "natural" grouping of instances given unlabeled data.

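To make the idea of grouping unlabeled instances concrete, here is a small k-means sketch run on the four points from the Distance Matrix slide. The lecture does not name a clustering algorithm at this point, so k-means, the choice k = 2, and the initialization from the first k points are all assumptions for illustration.

```python
import math

def kmeans(points, k, iterations=10):
    """Very small k-means: returns (labels, centroids).

    Assumption: centroids start at the first k points.
    """
    centroids = list(points[:k])
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members)
                                     for v in zip(*members))
    return labels, centroids

# p1..p4 from the Distance Matrix slide; no class labels are used.
points = [(0, 2), (2, 0), (3, 1), (5, 1)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

With this initialization p1 ends up alone in one group and p2, p3, p4 share the other; a different k or starting point can give a different grouping, which is why "natural" is in quotes on the slide.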
Association Rules & Frequent Itemsets

Transactions:
TID  Produce
1    MILK, BREAD, EGGS
2    BREAD, SUGAR
3    BREAD, CEREAL
4    MILK, BREAD, SUGAR
5    MILK, CEREAL
6    BREAD, CEREAL
7    MILK, CEREAL
8    MILK, BREAD, CEREAL, EGGS
9    MILK, BREAD, CEREAL

Frequent Itemsets:
Milk, Bread (4)
Bread, Cereal (4)
Milk, Bread, Cereal (2)
…

Rules:
Milk => Bread (66%)

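The counts and the rule percentage on this slide can be checked directly from the nine transactions. A minimal sketch, assuming the number after an itemset is its support count and the percentage after a rule is its confidence, i.e. support(antecedent and consequent together) divided by support(antecedent); the function names are illustrative.

```python
# The nine transactions from the slide, one set of items per TID.
transactions = [
    {"MILK", "BREAD", "EGGS"},
    {"BREAD", "SUGAR"},
    {"BREAD", "CEREAL"},
    {"MILK", "BREAD", "SUGAR"},
    {"MILK", "CEREAL"},
    {"BREAD", "CEREAL"},
    {"MILK", "CEREAL"},
    {"MILK", "BREAD", "CEREAL", "EGGS"},
    {"MILK", "BREAD", "CEREAL"},
]

def support_count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(antecedent, consequent):
    """Fraction of transactions with the antecedent that also hold the consequent."""
    return support_count(antecedent | consequent) / support_count(antecedent)

print(support_count({"MILK", "BREAD"}))
print(support_count({"MILK", "BREAD", "CEREAL"}))
print(f"{confidence({'MILK'}, {'BREAD'}):.1%}")
```

Milk appears in six transactions and four of those also contain bread, so the rule Milk => Bread holds in 4 of 6 cases, which is the roughly 66% shown on the slide.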
Visualization & Data Mining
• Visualizing the data to facilitate human discovery
• Presenting the discovered results in a visually "nice" way

Summarization
• Describe features of the selected group
• Use natural language and graphics
• Usually in combination with deviation detection or other methods

Average length of stay in this study area rose 45.7 percent, from 4.3 days to 6.2 days, because ...

Data Mining Models and Tasks
(Obtained from Prof. Srini's lecture notes)
