Lecture 1 Introduction To Data Mining

1. Detecting cancer subtypes using gene expression data. Researchers analyzed gene expression data to identify subtypes of breast cancer and predict patient survival. 2. Predicting traffic congestion using smart card data. Researchers used smart card data from public transportation to predict traffic congestion in major cities and recommend alternative routes. 3. Analyzing social media posts during disasters. Researchers looked at tweets and posts during hurricanes and wildfires to understand emergency needs, locate stranded people, and coordinate response efforts.

Uploaded by

sureshkm
Copyright
© All Rights Reserved
LECTURE 1: INTRODUCTION TO DATA MINING
Dr. Dhaval Patel
CSE, IIT-Roorkee
What is data mining?
• Data mining is also called knowledge discovery in databases (KDD)
• Data mining is
  - the extraction of useful patterns from data sources, e.g., databases, texts, the web, and images
• Patterns must be:
  - valid, novel, potentially useful, and understandable
Knowledge Discovery in Data: Process

[Figure: KDD process diagram with stages labeled Data, Data Mining, Patterns, Interpretation/Evaluation, and Knowledge]
Knowledge Discovery in Data: Challenges

[Figure: three challenge dimensions of data]
• Volume
  - Big Data
  - Small Data
• Velocity
  - Data Stream
  - Static
• Variety
  - Transaction
  - Temporal
  - Spatial
Outline (Part 1)
• Introduction to Data
  - Transactional Data
  - Temporal Data
  - Spatial & Spatial-Temporal Data
• Data Preprocessing
  - Missing Values
  - Summarization
INTRODUCTION TO DATA
Data Come from Everywhere

[Figure: example data sources: Grocery Markets, E-Commerce, Stock Exchange, Hospital, Weather Station, Social Media]

But they come in different forms
What is Data?
• Collection of records and their attributes
• An attribute is a characteristic of an object
• A collection of attributes describes an object

Columns are attributes; rows are objects.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Types of Data
• Record Data
  - Transactional Data
• Temporal Data
  - Time Series Data
  - Sequence Data
• Spatial & Spatial-Temporal Data
  - Spatial Data
  - Spatial-Temporal Data
• Graph Data
  - Transactional Data
• Unstructured Data
  - Twitter Status Message
  - Review, news article
• Semi-Structured Data
  - Paper Publications Data
  - XML format
Record Data

• Transaction Data

TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Market-Basket Dataset
Data Matrix
• If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
• Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute
Data Matrix Example for Documents
• Each document becomes a "term" vector,
  - each term is a component (attribute) of the vector,
  - the value of each component is the number of times the corresponding term occurs in the document.

[Figure: document-term matrix with one column per term: team, coach, play, ball, score, game, win, lost, timeout, season]
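The term-vector idea can be sketched in a few lines of Python. This is only an illustration: the two documents and the `term_vector` helper are made up for the example, and the vocabulary simply reuses the term names from the example matrix.

```python
from collections import Counter

# Vocabulary: one column (attribute) per term.
vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]

# Hypothetical documents, invented for illustration only.
documents = [
    "the team lost the game but the coach praised the team",
    "a timeout late in the season",
]

def term_vector(text, vocabulary):
    """Return the number of occurrences of each vocabulary term in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

# m-by-n data matrix: one row per document, one column per term.
matrix = [term_vector(doc, vocabulary) for doc in documents]

for row in matrix:
    print(row)
```

Splitting on whitespace is the simplest possible tokenizer; a real pipeline would also handle punctuation and word forms.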
Distance Matrix

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

[Figure: scatter plot of the points p1 to p4]

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

Distance Matrix
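The entries of the distance matrix are consistent with the Euclidean distance between each pair of points. A minimal sketch that recomputes them from the (x, y) coordinates, assuming Euclidean distance is the intended measure and Python 3.8+ for `math.dist`:

```python
import math

# Point coordinates from the slide, as (x, y).
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}
names = list(points)

# distance[i][j] is the Euclidean distance between point i and point j.
distance = [[math.dist(points[a], points[b]) for b in names] for a in names]

for name, row in zip(names, distance):
    print(name, [round(d, 3) for d in row])
```

The matrix is symmetric with zeros on the diagonal, so only the upper or lower triangle needs to be stored in practice.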
Temporal Data
• Sequence Data

(Patient data obtained from Zhang’s KDD 06 paper)


Temporal Data
• Time Series Data

(Yahoo Finance website)


Biological Sequence Data
Interval Data

EL = { (A, 1, 5), (C, 3, 12), (B, 4, 9), (D, 9, 15) }

[Figure: intervals A, B, C, and D drawn on a time axis with ticks at 1, 3, 4, 5, 9, 12, 15]

( ( (A overlaps C) contains B ) overlaps D )

(Interval patient data obtained from Amit’s M.Tech. thesis work)
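One way to arrive at the nested expression above is to fold the event list from left to right, naming the relation between the composite built so far and the next interval. The sketch below is an assumption about how such an expression could be derived, not the method of the cited thesis: the `Interval` type, the `relation` and `combine` helpers, and the choice to treat a composite as the span from its earliest start to its latest end are all hypothetical.

```python
from typing import NamedTuple

class Interval(NamedTuple):
    label: str
    start: int
    end: int

def relation(a, b):
    """Coarse temporal relation of interval a to interval b."""
    if a.start <= b.start and b.end <= a.end:
        return "contains"
    if a.start < b.start < a.end < b.end:
        return "overlaps"
    if a.end <= b.start:
        return "before"
    return "other"

def combine(a, b):
    """Merge a and b into one composite interval labelled with their relation."""
    label = f"({a.label} {relation(a, b)} {b.label})"
    return Interval(label, min(a.start, b.start), max(a.end, b.end))

# Event list from the slide, as (label, start, end), ordered by start time.
events = [Interval("A", 1, 5), Interval("C", 3, 12),
          Interval("B", 4, 9), Interval("D", 9, 15)]

composite = events[0]
for event in events[1:]:
    composite = combine(composite, event)

print(composite.label)
```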


Spatial & Spatial-Temporal Data

• Spatial Data

(Spatial Distribution of Objects of Various Types : Prof. Shashi Shekhar)


Spatial & Spatial-Temporal Data
• Spatial Data

Average Monthly Temperature of land and ocean


Spatial & Spatial-Temporal Data
• Spatial Data

Dengue Disease Dataset (Singapore)


Spatial & Spatial-Temporal Data
• Trajectory Data: Set of Hurricanes

http://csc.noaa.gov/hurricanes
Spatial & Spatial-Temporal Data
• Trajectory Data: (of 87 users obtained using RFID)

VAST 2008 Challenge – RFID Dataset


User Movement Data
• Trajectory
  - Movement trail of a user
  - Sampling Points: <latitude, longitude, time>

[Figure: map of user trajectory P1 on weekends, with locations labeled Home, Stadium, Movie Complex, and Swimming Pool]

Thanks to Shreyash and Sahoishnu (M.Tech. Students)


Graph Data
Semi-structured Data
Unstructured Data
Data can help us solve specific problems.
How should these pictures be placed into 3 groups?
How should these pictures be placed into groups? How many groups should there be?
Which genes are associated with a disease? How can expression values be used to predict survival?
What items should Amazon display for me?
Is it likely that this stock was traded based on illegal insider information?
Where are the faces in this picture?
Is this spam?
Will I like 300?
What techniques do people apply to data?
• They apply data mining algorithms and discover useful knowledge
• So, what are some of the well-known data mining tasks?
  - Clustering,
  - Classification,
  - Frequent Patterns,
  - Association Rules,
  - ….
What do people do with time series data?
• Clustering
• Classification
• Motif Discovery
• Rule Discovery (figure example: s = 0.5, c = 0.3)
• Query by Content
• Visualization
• Novelty Detection
• Motif Association

What do people do with trajectory data?
• Clustering
• Frequent Travel Patterns
• Motif Discovery
• Prediction
• Visualization
• Classification
In Summary

Types of Data
• Transactional Data
• Sequence Data
• Interval Data
• Time Series Data
• Spatial Data
• Spatio-Temporal Data
• Data Set with Multiple Kinds of Data
• ….

Data Mining Methods (Algorithms)
• Frequent Pattern Discovery
• Classification
• Clustering
• Outlier Detection
• Statistical Analysis
• …
Activity 1
• Find the top 3 recent research activities around the world that are analyzing data. You need to write a short summary for each research activity. The first three lines must follow this format:
  - Line 1: The problem they are trying to solve, along with the dataset they are using
  - Line 2: How they are solving the problem
  - Line 3: Justify why you rate this work as a top 3 activity
  - Remaining lines… you can decide for yourself ….

BigN’Smart Research group at IIT-Roorkee is analyzing the “YelpReview” dataset for learning location-to-activity tagging. They are applying … . I feel this is interesting research because …
Activity 2: Why Data Mining?
Read about their stories:
• Google
• Facebook
• Netflix
• eHarmony
• FICO
• FlightCaster
• IBM’s Watson
Related Fields

[Figure: Data Mining and Knowledge Discovery at the center, surrounded by the related fields Machine Learning, Visualization, Statistics, and Databases]
Related Fields
• Statistics:
  - more theory-based
  - more focused on testing hypotheses
• Machine learning
  - more heuristic
  - focused on improving performance of a learning agent
  - also looks at real-time learning and robotics, areas not part of data mining
• Data Mining and Knowledge Discovery
  - integrates theory and heuristics
  - focuses on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results
• Distinctions are fuzzy

Classification
Learn a method for predicting the instance class from pre-labeled (classified) instances

Many approaches: Statistics, Decision Trees, Neural Networks, ...
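To make the idea concrete, the sketch below applies a small decision tree to the ten labeled records from the "What is Data?" table and checks how many it classifies correctly. The tree is written by hand for illustration (the split order and the 80K income threshold are assumptions); a real classifier would learn such a model from the labeled instances rather than have it supplied.

```python
# Labeled instances: (refund, marital_status, taxable_income_in_K, cheat).
records = [
    ("Yes", "Single",   125, "No"),
    ("No",  "Married",  100, "No"),
    ("No",  "Single",    70, "No"),
    ("Yes", "Married",  120, "No"),
    ("No",  "Divorced",  95, "Yes"),
    ("No",  "Married",   60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No",  "Single",    85, "Yes"),
    ("No",  "Married",   75, "No"),
    ("No",  "Single",    90, "Yes"),
]

def predict_cheat(refund, marital_status, income_k):
    """Hand-written decision tree (an assumption, not learned from data)."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    return "Yes" if income_k >= 80 else "No"

# Compare predictions with the known class labels.
correct = sum(predict_cheat(r, m, i) == label for r, m, i, label in records)
print(f"{correct}/{len(records)} labeled instances classified correctly")
```

Agreement on the labeled instances says nothing about unseen data; evaluating on held-out instances is the usual next step.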
Clustering

Find “natural” grouping of instances given unlabeled data
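The slide does not name a clustering algorithm. As one possible illustration, here is a minimal k-means sketch applied to the four points from the distance matrix example; the choice of k-means, k = 2, and the deterministic initialization from the first k points are all assumptions made for the example.

```python
import math

# Unlabeled points p1..p4, reused from the distance matrix example.
points = [(0, 2), (2, 0), (3, 1), (5, 1)]

def kmeans(points, k, iterations=10):
    """Plain k-means with a fixed iteration count."""
    # Deterministic initialization: the first k points (an assumption).
    centroids = list(points[:k])
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

With this initialization p1 ends up alone in one cluster and p2, p3, p4 in the other; a different initialization or value of k can give a different grouping.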
Association Rules & Frequent Itemsets

Transactions:
TID  Produce
1    MILK, BREAD, EGGS
2    BREAD, SUGAR
3    BREAD, CEREAL
4    MILK, BREAD, SUGAR
5    MILK, CEREAL
6    BREAD, CEREAL
7    MILK, CEREAL
8    MILK, BREAD, CEREAL, EGGS
9    MILK, BREAD, CEREAL

Frequent Itemsets:
Milk, Bread (4)
Bread, Cereal (4)
Milk, Bread, Cereal (2)
…

Rules:
Milk => Bread (66%)
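The support counts and the rule confidence can be recomputed directly from the nine listed transactions. The sketch below is a brute-force count, not an efficient frequent-itemset algorithm; the helper names `support_count` and `confidence` are chosen for illustration.

```python
# The nine transactions listed above, one set of items per TID.
transactions = [
    {"MILK", "BREAD", "EGGS"},
    {"BREAD", "SUGAR"},
    {"BREAD", "CEREAL"},
    {"MILK", "BREAD", "SUGAR"},
    {"MILK", "CEREAL"},
    {"BREAD", "CEREAL"},
    {"MILK", "CEREAL"},
    {"MILK", "BREAD", "CEREAL", "EGGS"},
    {"MILK", "BREAD", "CEREAL"},
]

def support_count(itemset):
    """Number of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(lhs, rhs):
    """Confidence of the rule lhs => rhs: support(lhs and rhs) / support(lhs)."""
    return support_count(lhs | rhs) / support_count(lhs)

for itemset in [{"MILK", "BREAD"}, {"BREAD", "CEREAL"}, {"MILK", "BREAD", "CEREAL"}]:
    print(sorted(itemset), support_count(itemset))

print("confidence(MILK => BREAD) =", confidence({"MILK"}, {"BREAD"}))
```

The confidence of Milk => Bread comes out as 4/6, which matches the 66% shown for the rule.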
Visualization & Data Mining
• Visualizing the data to facilitate human discovery
• Presenting the discovered results in a visually "nice" way
Summarization
• Describe features of the selected group
• Use natural language and graphics
• Usually in combination with deviation detection or other methods

Average length of stay in this study area rose 45.7 percent, from 4.3 days to 6.2 days, because ...
Data Mining Models and Tasks

Obtained from Prof. Srini’s Lecture notes