LECTURE 1: INTRODUCTION TO DATA MINING
Dr. Dhaval Patel
CSE, IIT-Roorkee
What is data mining?
• Data mining is also called knowledge discovery in databases (KDD)
• Data mining is
  - the extraction of useful patterns from data sources, e.g., databases, texts, the web, images.
• Patterns must be:
  - valid, novel, potentially useful, understandable
Knowledge Discovery in Data: Process
[Figure: Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge]
Knowledge Discovery in Data: Challenges
• Volume
  - Big Data
  - Small Data
• Velocity
  - Data Stream
  - Static
• Variety
  - Transaction
  - Temporal
  - Spatial
  - …
Outline (Part 1)
• Introduction to Data
  - Transactional Data
  - Temporal Data
  - Spatial & Spatial-Temporal Data
• Data Preprocessing
  - Missing Values
  - Summarization
INTRODUCTION TO DATA
Data Come from Everywhere
[Figure: Grocery Markets, E-Commerce, Stock Exchange, Hospital, Weather Station, Social Media]
But they take different forms
What is Data?
• Collection of records and their attributes
• An attribute is a characteristic of an object
• A collection of attributes describes an object

Each row is an object; each column is an attribute.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Types of Data
• Record Data
  - Transactional Data
• Temporal Data
  - Time Series Data
  - Sequence Data
• Spatial & Spatial-Temporal Data
  - Spatial Data
  - Spatial-Temporal Data
• Graph Data
  - Transactional Data
• Unstructured Data
  - Twitter Status Message
  - Review, news article
• Semi-Structured Data
  - Paper Publications Data
  - XML format
Record Data

• Transaction Data

TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Market-Basket Dataset
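
A minimal sketch of how the market-basket table above could be held in memory, assuming each transaction is stored as a set of item names keyed by its TID. The names `transactions` and `item_counts` are illustrative and do not come from the lecture.

```python
from collections import Counter

# Market-basket table from the slide: TID -> set of items bought together.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def item_counts(baskets):
    """Count in how many transactions each item appears."""
    counts = Counter()
    for items in baskets.values():
        counts.update(items)
    return counts

counts = item_counts(transactions)
for item, n in sorted(counts.items()):
    print(f"{item}: {n} of {len(transactions)} transactions")
```

Sets are used because a transaction records which items were bought together, not the order in which they were scanned.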
Data Matrix
• If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
• Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute
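
As a rough illustration of the m by n layout, the sketch below stores a small data set as a list of rows. The objects, the attribute names and the helper `column` are hypothetical and only serve to show rows as objects and columns as attributes.

```python
# Hypothetical data set: m = 3 objects, each described by n = 2 numeric attributes.
attribute_names = ["height_cm", "weight_kg"]  # illustrative names, not from the lecture
data_matrix = [
    [170.0, 65.0],  # object 1
    [158.0, 52.0],  # object 2
    [181.0, 80.0],  # object 3
]

m = len(data_matrix)     # number of rows, one per object
n = len(data_matrix[0])  # number of columns, one per attribute

def column(matrix, j):
    """Return attribute j for every object, i.e. one column of the matrix."""
    return [row[j] for row in matrix]

print(f"{m} objects x {n} attributes")
print(attribute_names[1], column(data_matrix, 1))
```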
Data Matrix Example for Documents
• Each document becomes a 'term' vector,
  - each term is a component (attribute) of the vector,
  - the value of each component is the number of times the corresponding term occurs in the document.
[Table: document-term matrix with one column per term: team, coach, play, ball, score, game, win, lost, timeout, season; cell values not recoverable]
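
The term-vector idea can be sketched as follows, assuming whitespace tokenization and a fixed vocabulary. The vocabulary reuses the ten terms from the slide; the two documents and the function name `term_vector` are made up for illustration.

```python
from collections import Counter

# Vocabulary taken from the slide's column headers.
vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]

# Hypothetical documents, invented for this example.
documents = [
    "team play ball team score game",
    "coach timeout game lost season game",
]

def term_vector(text, vocab):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]

# One row per document, one column per term.
matrix = [term_vector(doc, vocabulary) for doc in documents]
for row in matrix:
    print(row)
```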
Distance Matrix
[Figure: scatter plot of the points p1 to p4]

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

     p1     p2     p3     p4
p1   0      2.828  3.162  5.099
p2   2.828  0      1.414  3.162
p3   3.162  1.414  0      2
p4   5.099  3.162  2      0

Distance Matrix
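
The values in the distance matrix are consistent with the Euclidean distance between the four points. The sketch below recomputes them from the coordinates on the slide; the function and variable names are illustrative.

```python
import math

# Coordinates from the slide.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(a, b):
    """Straight-line distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

names = list(points)
# distance[i][j] holds the distance between the i-th and j-th point.
distance = [[euclidean(points[r], points[c]) for c in names] for r in names]

print("    " + "  ".join(f"{name:>5}" for name in names))
for name, row in zip(names, distance):
    print(f"{name:<4}" + "  ".join(f"{d:5.3f}" for d in row))
```

The matrix is symmetric with zeros on the diagonal, so only the upper or lower triangle needs to be stored in practice.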
Temporal Data
• Sequence Data
(Patient Data obtained from Zhang’s KDD 06 Paper)

Temporal Data
• Time Series Data
Yahoo Finance Website

Biological Sequence Data
Interval Data

EL = { (A, 1, 5), (C, 3, 12), (B, 4, 9), (D, 9, 15) }

[Figure: timeline showing intervals A, C, B and D over the time points 1, 3, 4, 5, 9, 12, 15]

( ( (A overlaps C) contains B) overlaps D)

(Interval Patient Data obtained from Amit’s M.Tech. Thesis Work)
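
One way to read the nested expression is to fold the event list from left to right, comparing the span covered so far with the next interval. The sketch below does this under two assumptions that are not stated in the lecture: only three relations are distinguished, and a composite of intervals is represented by the smallest interval covering them. The helper names `relation` and `span` are hypothetical.

```python
def relation(a, b):
    """Name a simplified Allen-style relation of interval a to interval b.

    Intervals are (start, end) pairs. Only a few relations are
    distinguished; anything else is reported as "other".
    """
    a_start, a_end = a
    b_start, b_end = b
    if a_end < b_start:
        return "before"
    if a_start <= b_start and b_end <= a_end:
        return "contains"
    if a_start < b_start < a_end < b_end:
        return "overlaps"
    return "other"

def span(a, b):
    """Smallest interval covering both a and b (assumed composite representation)."""
    return (min(a[0], b[0]), max(a[1], b[1]))

# Event list EL from the slide: (label, start, end).
EL = [("A", 1, 5), ("C", 3, 12), ("B", 4, 9), ("D", 9, 15)]

label, start, end = EL[0]
expression, composite = label, (start, end)
for label, start, end in EL[1:]:
    rel = relation(composite, (start, end))
    expression = f"({expression} {rel} {label})"
    composite = span(composite, (start, end))

print(expression)
```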

Spatial & Spatial-Temporal Data
• Spatial Data
(Spatial Distribution of Objects of Various Types: Prof. Shashi Shekhar)

Spatial & Spatial-Temporal Data
• Spatial Data
Average Monthly Temperature of land and ocean

Spatial & Spatial-Temporal Data
• Spatial Data
Dengue Disease Dataset (Singapore)

Spatial & Spatial-Temporal Data
• Trajectory Data: Set of Hurricanes
http://csc.noaa.gov/hurricanes

Spatial & Spatial-Temporal Data
• Trajectory Data (of 87 users obtained using RFID)
VAST 2008 Challenge – RFID Dataset

User Movement Data
• Trajectory
  - Movement trail of a user
  - Sampling Points: <latitude, longitude, time>

[Figure: trajectory P1 on weekends, linking Home, Stadium, Movie Complex and Swimming Pool]

Thanks to Shreyash and Sahoishnu (M.Tech. Students)
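
A trajectory of <latitude, longitude, time> sampling points could be represented as in the sketch below. The coordinates and timestamps are invented, and the haversine formula is swapped in here as one common way to measure distance between two latitude/longitude pairs; the lecture does not say how distances were computed.

```python
import math
from typing import NamedTuple

class SamplingPoint(NamedTuple):
    latitude: float   # degrees
    longitude: float  # degrees
    time: float       # seconds since the start of the trail

def haversine_km(p, q):
    """Great-circle distance between two sampling points in kilometres."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(p.latitude), math.radians(q.latitude)
    dphi = phi2 - phi1
    dlam = math.radians(q.longitude - p.longitude)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))

def path_length_km(trajectory):
    """Sum of the distances between consecutive sampling points."""
    return sum(haversine_km(a, b) for a, b in zip(trajectory, trajectory[1:]))

# Hypothetical movement trail: all values are made up for illustration.
trajectory = [
    SamplingPoint(29.8650, 77.8960, 0),
    SamplingPoint(29.8700, 77.9000, 300),
    SamplingPoint(29.8750, 77.9050, 600),
]

length = path_length_km(trajectory)
duration_h = (trajectory[-1].time - trajectory[0].time) / 3600
print(f"length: {length:.2f} km, average speed: {length / duration_h:.1f} km/h")
```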

Graph Data
Semi-structured Data
Unstructured Data
Data can help us solve specific problems.
How should these pictures be placed into 3 groups?
How should these pictures be placed into groups? How many groups should there be?
Which genes are associated with a disease? How can expression values be used to predict survival?
What items should Amazon display for me?
Is it likely that this stock was traded based on illegal insider information?
Where are the faces in this picture?
Is this spam?
Will I like 300?
What techniques do people apply to data?
• They apply data mining algorithms and discover useful knowledge
• So, what are some of the well-known data mining tasks?
  - Clustering,
  - Classification,
  - Frequent Patterns,
  - Association Rules,
  - ….
What do people do with time series data?
[Figure: panels for Clustering, Classification, Motif Discovery, Rule Discovery (s = 0.5, c = 0.3), Query by Content, Visualization, Novelty Detection, Motif Association]

What do people do with trajectory data?
[Figure: panels for Clustering, Frequent Travel Patterns, Motif Discovery, Prediction, Visualization, Classification]
In Summary

Types of Data
• Transactional Data
• Sequence Data
• Interval Data
• Time Series Data
• Spatial Data
• Spatio-Temporal Data
• Data Set with Multiple Kinds of Data
• ….

Data Mining Methods (Algorithms)
• Frequent Pattern Discovery
• Classification
• Clustering
• Outlier Detection
• Statistical Analysis
• …
Activity 1
• Find the top 3 recent research activities around the world that are analyzing data. You need to write a short summary of each research activity. The first three lines must follow this format:
  - Line 1: The problem they are trying to solve, along with the dataset they are using
  - Line 2: How they are solving the problem
  - Line 3: Justify why you rate this work as a top 3 activity
  - Remaining lines… you can decide yourself ….

The BigN’Smart research group at IIT-Roorkee is analyzing the “YelpReview” dataset to learn location-to-activity tagging. They are applying … . I feel this is interesting research because …
Activity 2: Why Data Mining?
Read about their stories:
• Google
• Facebook
• Netflix
• eHarmony
• FICO
• FlightCaster
• IBM’s Watson
Related Fields
[Figure: Data Mining and Knowledge Discovery at the center, surrounded by Machine Learning, Visualization, Statistics and Databases]
Related Fields
• Statistics:
  - more theory-based
  - more focused on testing hypotheses
• Machine learning
  - more heuristic
  - focused on improving performance of a learning agent
  - also looks at real-time learning and robotics – areas not part of data mining
• Data Mining and Knowledge Discovery
  - integrates theory and heuristics
  - focuses on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results
• Distinctions are fuzzy

Classification
Learn a method for predicting the instance class from pre-labeled (classified) instances

Many approaches: Statistics, Decision Trees, Neural Networks, ...
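
To make "predicting the instance class from pre-labeled instances" concrete, here is a minimal sketch using a 1-nearest-neighbor rule. This technique is chosen only for brevity and is not one of the approaches named on the slide; the training points, labels and the function name `predict` are hypothetical.

```python
import math

# Hypothetical pre-labeled instances: (feature vector, class label).
training = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((2.0, 1.0), "low"),
    ((8.0, 8.0), "high"),
    ((9.0, 9.5), "high"),
    ((8.5, 7.5), "high"),
]

def predict(instance, labeled):
    """Return the label of the closest training instance (1-nearest neighbor)."""
    _, label = min(labeled, key=lambda pair: math.dist(pair[0], instance))
    return label

# Classify two new, unlabeled instances.
print(predict((2.0, 2.0), training))
print(predict((7.0, 9.0), training))
```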
Clustering

Find a “natural” grouping of instances given unlabeled data
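
As one possible illustration of grouping unlabeled data, the sketch below runs a plain k-means loop. The lecture does not prescribe an algorithm at this point, so k-means, the choice of the first k points as initial centroids and the sample points are all assumptions made for this example.

```python
import math

def k_means(points, k, iterations=10):
    """Plain k-means; the first k points serve as the initial centroids."""
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return labels, centroids

# Hypothetical unlabeled instances forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Note that no class labels are supplied: the grouping comes only from the distances between instances, which is what separates clustering from classification.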
Association Rules & Frequent Itemsets

Transactions
TID  Produce
1    MILK, BREAD, EGGS
2    BREAD, SUGAR
3    BREAD, CEREAL
4    MILK, BREAD, SUGAR
5    MILK, CEREAL
6    BREAD, CEREAL
7    MILK, CEREAL
8    MILK, BREAD, CEREAL, EGGS
9    MILK, BREAD, CEREAL

Frequent Itemsets:
Milk, Bread (4)
Bread, Cereal (4)
Milk, Bread, Cereal (2)
…

Rules:
Milk => Bread (66%)
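
The sketch below recomputes support counts and rule confidence directly from the nine transactions listed on the slide. It assumes the usual definitions (support count = number of transactions containing the itemset, confidence = support of the whole rule divided by support of its left-hand side); the function names are illustrative.

```python
from itertools import combinations

# The nine transactions from the slide, in TID order.
transactions = [
    {"MILK", "BREAD", "EGGS"},
    {"BREAD", "SUGAR"},
    {"BREAD", "CEREAL"},
    {"MILK", "BREAD", "SUGAR"},
    {"MILK", "CEREAL"},
    {"BREAD", "CEREAL"},
    {"MILK", "CEREAL"},
    {"MILK", "BREAD", "CEREAL", "EGGS"},
    {"MILK", "BREAD", "CEREAL"},
]

def support_count(itemset, baskets):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket)

def confidence(antecedent, consequent, baskets):
    """Share of transactions with the antecedent that also contain the consequent."""
    both = support_count(antecedent | consequent, baskets)
    return both / support_count(antecedent, baskets)

# Support count of every pair of items, keyed by alphabetically sorted pair.
items = sorted(set().union(*transactions))
pair_counts = {pair: support_count(set(pair), transactions)
               for pair in combinations(items, 2)}

for pair, count in sorted(pair_counts.items(), key=lambda kv: -kv[1]):
    print(pair, count)
print(f"MILK => BREAD confidence: {confidence({'MILK'}, {'BREAD'}, transactions):.1%}")
```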
Visualization & Data Mining
• Visualizing the data to facilitate human discovery
• Presenting the discovered results in a visually "nice" way
Summarization
• Describe features of the selected group
• Use natural language and graphics
• Usually in combination with deviation detection or other methods

Average length of stay in this study area rose 45.7 percent, from 4.3 days to 6.2 days, because ...
Data Mining Models and Tasks

Obtained from Prof. Srini’s Lecture notes
