Chapter 1

Data mining is the process of analyzing large datasets to extract useful patterns and knowledge. It encompasses various tasks such as prediction, classification, clustering, and association rule discovery, which can be applied in commercial and scientific contexts. The document outlines the KDD process, the importance of data mining in addressing societal challenges, and examples of its applications in different fields.

Uploaded by

ali


Data Mining: Introduction

Lecture Notes for Chapter 1

Introduction to Data Mining, 2nd Edition

by
Tan, Steinbach, Karpatne, Kumar

01/17/2018 Introduction to Data Mining, 2nd Edition 1

What Is Data Mining?

Data mining (knowledge discovery from data)

Data mining is the use of efficient techniques for the analysis of very large collections of data and the extraction of useful and possibly unexpected patterns in data (hidden knowledge).

The KDD Process

[Figure: the KDD process, from bottom to top: Databases; Data Cleaning and Data Integration; Data Warehouse; Selection; Task-relevant Data; Data Mining; Pattern Evaluation]

Large-scale Data is Everywhere!
● There has been enormous data growth in both commercial and scientific databases due to advances in data generation and collection technologies
● New mantra
– Gather whatever data you can whenever and wherever possible.
● Expectations
– Gathered data will have value either for the purpose collected or for a purpose not envisioned.

[Figure panels: E-Commerce; Cyber Security; Social Networking: Twitter; Traffic Patterns; Sensor Networks; Computational Simulations]

Why Data Mining? Commercial Viewpoint

● Lots of data is being collected and warehoused
– Web data
  • Yahoo has petabytes of web data
  • Facebook has billions of active users
– Purchases at department/grocery stores, e-commerce
  • Amazon handles millions of visits/day
– Bank/credit card transactions
● Computers have become cheaper and more powerful
● Competitive pressure is strong
– Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Data Mining? Scientific Viewpoint

● Data collected and stored at enormous speeds
– Remote sensors on a satellite
  • NASA EOSDIS archives petabytes of earth science data per year
– Telescopes scanning the skies
  • Sky survey data
– High-throughput biological data
– Scientific simulations
  • Terabytes of data generated in a few hours
● Data mining helps scientists
– In automated analysis of massive datasets
– In hypothesis formation

[Figure panels: fMRI Data from Brain; Sky Survey Data; Gene Expression Data; Surface Temperature of Earth]

Great opportunities to improve productivity in all walks of life


Great Opportunities to Solve Society’s Major Problems

[Figure panels: Improving health care and reducing costs; Predicting the impact of climate change; Reducing hunger and poverty by increasing agriculture production; Finding alternative/green energy sources]

Data Mining Tasks

● Prediction Methods
– Use some variables to predict unknown or future values of other variables.

● Description Methods
– Find human-interpretable patterns that describe the data.

From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996


Data Mining Tasks …

Data

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes

Predictive Modeling: Classification
● Find a model for class attribute as a function of the values of other attributes

Tid  Employed  Level of Education  # years at present address  Credit Worthy (class)
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

Model for predicting credit worthiness (decision tree):

Employed
– No: No
– Yes: Education
  • Graduate: Number of years (> 3 yr: Yes; < 3 yr: No)
  • { High school, Undergrad }: Number of years (> 7 yrs: Yes; < 7 yrs: No)

Classification Example

Training Set

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

Test Set

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Undergrad           7                           ?
2    No        Graduate            3                           ?
3    Yes       High School         2                           ?
…    …         …                   …                           …

[Figure: the Training Set feeds Learn Classifier, which produces a Model; the Model is applied to the Test Set]
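As a rough illustration, the decision tree from the credit-worthiness slide can be written as nested conditionals and applied to the training and test rows shown above. This is only a sketch: the function name `credit_worthy` is made up here, only the rows visible on the slides are used, and because the slide labels the branches "> 3 yr / < 3 yr" and "> 7 yrs / < 7 yrs" without saying where a value exactly on the threshold goes, the code assumes such values fall into the "No" branch.

```python
# Decision tree from the credit-worthiness slide as nested conditionals.
# Assumption: values exactly on a threshold (3 or 7 years) go to the "No"
# branch; the slide does not specify this case.

def credit_worthy(employed: str, education: str, years: int) -> str:
    if employed == "No":
        return "No"
    if education == "Graduate":
        return "Yes" if years > 3 else "No"
    # Remaining education levels: "High School" or "Undergrad"
    return "Yes" if years > 7 else "No"

# Rows visible on the slides (the "…" rows are not known and are left out).
training_set = [
    ("Yes", "Graduate", 5, "Yes"),
    ("Yes", "High School", 2, "No"),
    ("No", "Undergrad", 1, "No"),
    ("Yes", "High School", 10, "Yes"),
]
test_set = [
    ("Yes", "Undergrad", 7),
    ("No", "Graduate", 3),
    ("Yes", "High School", 2),
]

# Fraction of training rows the tree labels correctly.
correct = sum(credit_worthy(e, ed, y) == label for e, ed, y, label in training_set)
training_accuracy = correct / len(training_set)

# Predicted class for each unlabeled test row.
predictions = [credit_worthy(e, ed, y) for e, ed, y in test_set]
print(training_accuracy, predictions)
```

Test row 1 sits exactly on the 7-year threshold, so its prediction depends entirely on the boundary assumption stated above.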

Examples of Classification Task

● Classifying credit card transactions as legitimate or fraudulent
● Classifying land covers (water bodies, urban areas, forests, etc.) using satellite data
● Categorizing news stories as finance, weather, entertainment, sports, etc.
● Identifying intruders in cyberspace
● Predicting tumor cells as benign or malignant
● Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil

Classification: Application 1

● Fraud Detection
– Goal: Predict fraudulent cases in credit card transactions.
– Approach:
  • Use credit card transactions and the information on the account holder as attributes.
    – When does a customer buy, what does he buy, how often does he pay on time, etc.
  • Label past transactions as fraud or fair transactions. This forms the class attribute.
  • Learn a model for the class of the transactions.
  • Use this model to detect fraud by observing credit card transactions on an account.

Classification: Application 2

● Churn prediction for telephone customers
– Goal: To predict whether a customer is likely to be lost to a competitor.
– Approach:
  • Use detailed records of transactions with each of the past and present customers to find attributes.
    – How often the customer calls, where he calls, what time of the day he calls most, his financial status, marital status, etc.
  • Label the customers as loyal or disloyal.
  • Find a model for loyalty.

From [Berry & Linoff] Data Mining Techniques, 1997

Clustering

● Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]
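The idea of small intra-cluster and large inter-cluster distances can be made concrete with a short sketch. The points and labels below are hypothetical, and Euclidean distance is an assumption made for illustration; the slide does not fix a particular distance measure.

```python
from itertools import combinations
from math import dist

# Hypothetical 2-D points with a cluster label for each (not from the slides).
points = [(1.0, 1.0), (1.5, 1.2), (1.2, 0.8), (8.0, 8.0), (8.5, 7.8), (7.9, 8.3)]
labels = [0, 0, 0, 1, 1, 1]

def mean_pairwise_distances(points, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance).

    Every pair of points is visited once; the pair counts as intra-cluster
    when both points carry the same label, otherwise as inter-cluster.
    Assumes at least one pair of each kind exists.
    """
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        (intra if lp == lq else inter).append(dist(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)

intra, inter = mean_pairwise_distances(points, labels)
print(f"mean intra-cluster distance: {intra:.2f}")
print(f"mean inter-cluster distance: {inter:.2f}")
```

A grouping in the spirit of the slide is one where the first number is small relative to the second.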

Applications of Cluster Analysis
● Understanding
– Customer profiling for targeted marketing
– Group related documents for browsing
– Group genes and proteins that have similar functionality
– Group stocks with similar price fluctuations
● Summarization
– Reduce the size of large data sets

[Figure courtesy: Michael Eisen]

Clusters for Raw SST and Raw NPP

Use of K-means to partition Sea Surface Temperature (SST) and Net Primary Production (NPP) into clusters that reflect the Northern and Southern Hemispheres.

[Figure: cluster map with axes longitude (-180 to 180) and latitude (-90 to 90); legend: Land Cluster 1, Land Cluster 2, Ice or No NPP, Sea Cluster 1, Sea Cluster 2]

Clustering: Application 1

● Market Segmentation:
– Goal: Subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
– Approach:
  • Collect different attributes of customers based on their geographical and lifestyle-related information.
  • Find clusters of similar customers.
  • Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.
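One way to carry out the "find clusters of similar customers" step is K-means, the algorithm named on the SST/NPP slide. The sketch below is a minimal pure-Python version of the basic algorithm under loud assumptions: the customer attributes are invented, the initial centroids are simply the first k points (to keep the run deterministic), and Euclidean distance is used.

```python
from math import dist

def kmeans(points, k, iterations=20):
    """Basic K-means (Lloyd's algorithm) on tuples of floats.

    Assumption: initial centroids are the first k points, which keeps the
    sketch deterministic; practical implementations pick them more carefully.
    """
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(xs) / len(members) for xs in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep as is
        if new_centroids == centroids:
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return assignment, centroids

# Hypothetical customers: (store visits per month, average basket size).
customers = [(2.0, 3.0), (12.0, 9.0), (3.0, 2.5), (11.0, 10.0), (2.5, 3.5), (13.0, 9.5)]
assignment, centroids = kmeans(customers, k=2)
print(assignment)
print(centroids)
```

In practice the attributes would usually be rescaled first so that no single attribute dominates the distance; that step is omitted here.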

A Behavior Based Segmentation

Clustering: Application 2

● Document Clustering:
– Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
– Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.

[Figure: Enron email dataset]
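The first step of the approach, identifying frequently occurring terms in each document, might look like the following sketch. The two documents, the tiny stopword list, and the function name `term_frequencies` are all hypothetical; nothing here is taken from the Enron dataset.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stopword list.
STOPWORDS = frozenset({"the", "a", "of", "to", "and", "is", "in"})

def term_frequencies(text):
    """Lowercase the text, keep alphabetic tokens, drop stopwords, count terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

# Hypothetical documents used only for illustration.
docs = {
    "d1": "The team won the game and the coach praised the team",
    "d2": "The meeting is moved to Monday and the budget meeting is cancelled",
}

for name, text in docs.items():
    # The most frequent terms are candidates for the document's "important terms".
    print(name, term_frequencies(text).most_common(2))
```

The resulting counts are the raw material for a similarity measure between documents, which is then used for clustering.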

Association Rule Discovery: Definition

● Given a set of records each of which contains some number of items from a given collection
– Produce dependency rules which will predict occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
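To see how strongly the five transactions back each discovered rule, one can compute support and confidence. These two measures are the standard ones for association rules but are not defined on this slide, so treat them as an addition; the function names are illustrative.

```python
# The five transactions from the slide, as sets of items.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(antecedent, consequent, transactions):
    """Fraction of all transactions containing antecedent and consequent together."""
    return count(antecedent | consequent, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, fraction also containing the consequent."""
    return count(antecedent | consequent, transactions) / count(antecedent, transactions)

rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    s = support(lhs, rhs, transactions)
    c = confidence(lhs, rhs, transactions)
    print(f"{sorted(lhs)} --> {sorted(rhs)}: support={s:.2f}, confidence={c:.2f}")
```

For {Milk} --> {Coke}, Milk appears in four transactions and three of those also contain Coke, so the confidence is 3/4 and the support is 3/5.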

Association Analysis: Applications

● Market-basket analysis
– Rules are used for sales promotion, shelf management, and inventory management

● Telecommunication alarm diagnosis
– Rules are used to find combinations of alarms that occur together frequently in the same time period

● Medical Informatics
– Rules are used to find combinations of patient symptoms and test results associated with certain diseases

The KDD Process

[Figure: the KDD process, from bottom to top: Databases; Data Cleaning and Data Integration; Data Warehouse; Selection; Task-relevant Data; Data Mining; Pattern Evaluation]

DATA

What is Data?

● Collection of data objects and their attributes
● An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, dimension, or feature
● A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance

Columns are Attributes; rows are Objects:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Types of data sets
● Record
– Data Matrix
– Document Data
– Transaction Data
● Graph
– World Wide Web
– Molecular Structures
● Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Data Matrix

● If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
● Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
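As a small sketch of the "points in a multi-dimensional space" view, the two rows above can be stored as an m by n matrix and compared geometrically. Euclidean distance is used here only as one possible illustration; the slide does not prescribe a distance measure.

```python
from math import dist

# Column names and values as given in the slide's example.
attributes = ["Projection of x Load", "Projection of y Load", "Distance", "Load", "Thickness"]
data_matrix = [
    [10.23, 5.27, 15.22, 2.7, 1.2],  # object 1
    [12.65, 6.25, 16.22, 2.2, 1.1],  # object 2
]

m = len(data_matrix)     # number of objects (rows)
n = len(data_matrix[0])  # number of attributes (columns)

# Each row is a point in n-dimensional space, so two objects can be compared
# by the distance between their points.
distance = dist(data_matrix[0], data_matrix[1])
print(f"{m} objects x {n} attributes, distance = {distance:.3f}")
```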
Document Data

● Each document becomes a ‘term’ vector
– Each term is a component (attribute) of the vector
– The value of each component is the number of times the corresponding term occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
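Once documents are term vectors, they can be compared numerically. The sketch below applies cosine similarity to the three vectors from the table; cosine similarity is a common choice for term vectors, but it is an assumption here, since this slide does not name a measure.

```python
from math import sqrt

# Term-count vectors copied from the table above (one entry per term).
doc_vectors = {
    "Document 1": [3, 0, 5, 0, 2, 6, 0, 2, 0, 2],
    "Document 2": [0, 7, 0, 2, 1, 0, 0, 3, 0, 0],
    "Document 3": [0, 1, 0, 0, 1, 2, 2, 0, 3, 0],
}

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

names = list(doc_vectors)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        sim = cosine_similarity(doc_vectors[first], doc_vectors[second])
        print(f"{first} vs {second}: {sim:.3f}")
```

Under this measure, Documents 1 and 3 come out as the most similar pair, mainly because of the terms they share in the fifth and sixth columns.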
Transaction Data

● A special type of record data, where
– Each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
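Because transaction data is a special type of record data, the same table can also be viewed as a matrix with one 0/1 column per item. The conversion below is a minimal sketch; the variable names are illustrative.

```python
# Transactions from the table above, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# One column per distinct item, in alphabetical order.
items = sorted(set().union(*transactions.values()))

# One row per transaction: 1 if the item was purchased, 0 otherwise.
binary_matrix = [
    [int(item in basket) for item in items] for basket in transactions.values()
]

print("TID", items)
for tid, row in zip(transactions, binary_matrix):
    print(tid, row)
```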
Graph Data

● Examples: Generic graph, a molecule, and webpages

[Figure panels: generic graph with weighted edges; Benzene Molecule: C6H6; webpages]

Ordered Data

● Sequences of transactions

[Figure: a sequence of transactions, with labels "Items/Events" and "An element of the sequence"]

Ordered Data

● Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data

● Spatio-Temporal Data

[Figure: Average Monthly Temperature of land and ocean]

Data Quality

● Poor data quality negatively affects many data processing efforts

“The most important point is that poor data quality is an unfolding disaster.
– Poor data quality costs the typical company at least ten percent (10%) of revenue; twenty percent (20%) is probably a better estimate.”
Thomas C. Redman, DM Review, August 2004

● Data mining example: a classification model for detecting people who are loan risks is built using poor data
– Some credit-worthy candidates are denied loans
– More loans are given to individuals that default

Data Quality …

● What kinds of data quality problems?

● How can we detect problems with the data?
● What can we do about these problems?

● Examples of data quality problems:

– Noise and outliers
– Missing values
– Duplicate data
– Wrong data
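A few of these problems can be detected with simple checks. The sketch below uses made-up records and made-up rules (for example, that income must be non-negative), so it shows the kind of test one might write and not a general-purpose validator.

```python
# Hypothetical records with deliberately planted problems.
records = [
    {"tid": 1, "refund": "Yes", "income": 125_000},
    {"tid": 2, "refund": "No", "income": None},    # missing value
    {"tid": 3, "refund": "No", "income": 70_000},
    {"tid": 3, "refund": "No", "income": 70_000},  # duplicate of the previous row
    {"tid": 4, "refund": "Yes", "income": -5},     # wrong data: negative income
]

def rows_with_missing_values(records):
    """Indices of records in which at least one field is None."""
    return [i for i, r in enumerate(records) if any(v is None for v in r.values())]

def duplicate_rows(records):
    """Indices of records that exactly repeat an earlier record."""
    seen, dups = set(), []
    for i, r in enumerate(records):
        key = tuple(sorted(r.items()))  # hashable snapshot of the record
        if key in seen:
            dups.append(i)
        seen.add(key)
    return dups

def rows_with_invalid_income(records):
    """Indices of records whose income is present but negative (assumed rule)."""
    return [i for i, r in enumerate(records)
            if r["income"] is not None and r["income"] < 0]

print("missing:  ", rows_with_missing_values(records))
print("duplicate:", duplicate_rows(records))
print("invalid:  ", rows_with_invalid_income(records))
```

What to do with the flagged rows (drop, correct, or impute them) depends on the application and is not decided by these checks.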
