Chapter 1

Data mining is the process of analyzing large datasets to extract useful patterns and knowledge. It encompasses various tasks such as prediction, classification, clustering, and association rule discovery, which can be applied in commercial and scientific contexts. The document outlines the KDD process, the importance of data mining in addressing societal challenges, and examples of its applications in different fields.

Data Mining: Introduction

Lecture Notes for Chapter 1

Introduction to Data Mining, 2nd Edition


by
Tan, Steinbach, Karpatne, Kumar

01/17/2018 Introduction to Data Mining, 2nd Edition 1


What Is Data Mining?

Data mining (knowledge discovery from data)

Data mining is the use of efficient techniques for the analysis of very large collections of data and the extraction of useful and possibly unexpected patterns in data (hidden knowledge).

The KDD Process

Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation

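The stages of the KDD process can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the two toy sources, the attribute names, and the "mining" step (a per-item total) are all hypothetical stand-ins, not part of the lecture.

```python
# Hypothetical toy sources; names and values are illustrative only.
store_a = [{"id": 1, "item": "Milk", "qty": 2}, {"id": 2, "item": None, "qty": 1}]
store_b = [{"id": 3, "item": "Bread", "qty": 1}, {"id": 4, "item": "Milk", "qty": 3}]

def clean(records):
    """Data cleaning: drop records that contain a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several cleaned sources into one collection."""
    return [r for source in sources for r in source]

def select(records, attributes):
    """Selection: keep only the task-relevant attributes."""
    return [{a: r[a] for a in attributes} for r in records]

def mine(records):
    """Data mining (stand-in): total quantity sold per item."""
    totals = {}
    for r in records:
        totals[r["item"]] = totals.get(r["item"], 0) + r["qty"]
    return totals

def evaluate(patterns, min_qty):
    """Pattern evaluation: keep only patterns that reach a threshold."""
    return {item: qty for item, qty in patterns.items() if qty >= min_qty}

warehouse = integrate(clean(store_a), clean(store_b))
relevant = select(warehouse, ["item", "qty"])
patterns = evaluate(mine(relevant), min_qty=2)
print(patterns)
```

Each function corresponds to one box in the process, so a stage can be swapped out without touching the others.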
Large-scale Data is Everywhere!

● There has been enormous data growth in both commercial and scientific databases due to advances in data generation and collection technologies
● New mantra
– Gather whatever data you can whenever and wherever possible.
● Expectations
– Gathered data will have value either for the purpose collected or for a purpose not envisioned.

Examples pictured: E-Commerce, Cyber Security, Social Networking (Twitter), Traffic Patterns, Sensor Networks, Computational Simulations


Why Data Mining? Commercial Viewpoint

● Lots of data is being collected and warehoused
– Web data
  • Yahoo has petabytes of web data
  • Facebook has billions of active users
– Purchases at department/grocery stores, e-commerce
  • Amazon handles millions of visits/day
– Bank/credit card transactions
● Computers have become cheaper and more powerful
● Competitive pressure is strong
– Provide better, customized services for an edge (e.g. in Customer Relationship Management)


Why Data Mining? Scientific Viewpoint

● Data collected and stored at enormous speeds
– Remote sensors on a satellite
  • NASA EOSDIS archives petabytes of earth science data per year
– Telescopes scanning the skies
  • Sky survey data
– High-throughput biological data
– Scientific simulations
  • Terabytes of data generated in a few hours
● Data mining helps scientists
– In automated analysis of massive datasets
– In hypothesis formation

Examples pictured: fMRI Data from Brain, Sky Survey Data, Gene Expression Data, Surface Temperature of Earth

Great opportunities to improve productivity in all walks of life

Great Opportunities to Solve Society’s Major Problems

● Improving health care and reducing costs
● Predicting the impact of climate change
● Reducing hunger and poverty by increasing agriculture production
● Finding alternative/green energy sources

Data Mining Tasks

● Prediction Methods
– Use some variables to predict unknown or
future values of other variables.

● Description Methods
– Find human-interpretable patterns that
describe the data.

From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996


Data Mining Tasks …

Data

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes


Predictive Modeling: Classification

● Find a model for class attribute as a function of the values of other attributes

Training data (Credit Worthy is the class attribute):

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

Model for predicting credit worthiness:

Employed
  No  → No
  Yes → Education
    Graduate → Number of years: > 3 yr → Yes, < 3 yr → No
    { High school, Undergrad } → Number of years: > 7 yrs → Yes, < 7 yrs → No
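The decision tree pictured on this slide (Employed, then Education, then number of years at the present address) can be written as a small function. This is a sketch of the tree as read from the slide, not a learned model. The function name is hypothetical, and the handling of a value exactly on a threshold is an assumption, since the slide only labels the branches "> 3 yr / < 3 yr" and "> 7 yrs / < 7 yrs".

```python
def credit_worthy(employed, education, years):
    """Follow the tree: Employed -> Education -> Number of years."""
    if employed != "Yes":
        return "No"
    # Assumption: a value exactly on the threshold is treated as "No",
    # because the slide does not say which branch it belongs to.
    threshold = 3 if education == "Graduate" else 7
    return "Yes" if years > threshold else "No"

# The four labelled records from the slide's training table.
training_set = [
    ("Yes", "Graduate", 5, "Yes"),
    ("Yes", "High School", 2, "No"),
    ("No", "Undergrad", 1, "No"),
    ("Yes", "High School", 10, "Yes"),
]
predictions = [credit_worthy(e, edu, y) for e, edu, y, _ in training_set]
print(predictions)
```

On these four records the tree reproduces every Credit Worthy label, which is what "a model for the class attribute as a function of the other attributes" means in practice.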


Classification Example

Training Set:

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

Test Set:

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Undergrad           7                           ?
2    No        Graduate            3                           ?
3    Yes       High School         2                           ?
…    …         …                   …                           …

Training Set → Learn Classifier → Model; the Model is then applied to the Test Set.
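The flow on this slide (learn a classifier from the training set, then apply the resulting model to the test set) can be sketched with a deliberately tiny learner. The learner below is a one-attribute threshold rule (a "decision stump") on the number of years at the present address. It is a stand-in chosen for brevity, not the classifier used in the lecture, and the function names are hypothetical.

```python
def learn_stump(years, labels):
    """Pick the threshold on one numeric attribute with the fewest training errors."""
    values = sorted(set(years))
    # Candidate thresholds are midpoints between consecutive observed values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(t):
        return sum((y > t) != (label == "Yes") for y, label in zip(years, labels))

    return min(candidates, key=errors)

def apply_model(threshold, years):
    """Predict 'Yes' for values above the learned threshold."""
    return ["Yes" if y > threshold else "No" for y in years]

# Years at present address and labels from the slide's training set.
train_years = [5, 2, 1, 10]
train_labels = ["Yes", "No", "No", "Yes"]
# Years at present address from the slide's test set (labels unknown).
test_years = [7, 3, 2]

threshold = learn_stump(train_years, train_labels)
test_predictions = apply_model(threshold, test_years)
print(threshold, test_predictions)
```

The test records never influence the threshold; they are only scored by the model learned from the training set.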


Examples of Classification Task

● Classifying credit card transactions as legitimate or fraudulent
● Classifying land covers (water bodies, urban areas, forests, etc.) using satellite data
● Categorizing news stories as finance, weather, entertainment, sports, etc.
● Identifying intruders in the cyberspace
● Predicting tumor cells as benign or malignant
● Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil


Classification: Application 1

● Fraud Detection
– Goal: Predict fraudulent cases in credit card transactions.
– Approach:
  • Use credit card transactions and the information on its account-holder as attributes.
    – When does a customer buy, what does he buy, how often he pays on time, etc.
  • Label past transactions as fraud or fair transactions. This forms the class attribute.
  • Learn a model for the class of the transactions.
  • Use this model to detect fraud by observing credit card transactions on an account.

Classification: Application 2

● Churn prediction for telephone customers
– Goal: To predict whether a customer is likely to be lost to a competitor.
– Approach:
  • Use detailed record of transactions with each of the past and present customers, to find attributes.
    – How often the customer calls, where he calls, what time-of-the day he calls most, his financial status, marital status, etc.
  • Label the customers as loyal or disloyal.
  • Find a model for loyalty.

From [Berry & Linoff] Data Mining Techniques, 1997


Clustering

● Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
– Intra-cluster distances are minimized
– Inter-cluster distances are maximized
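The two quantities named on this slide can be computed directly for a given grouping. The sketch below uses hypothetical 2-D points and Euclidean distance (both are assumptions), and averages pairwise distances within clusters and across clusters.

```python
from itertools import combinations
from math import dist

# Hypothetical 2-D points already assigned to two clusters.
clusters = {
    "A": [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0)],
    "B": [(8.0, 8.0), (9.0, 9.5), (8.5, 9.0)],
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Intra-cluster: average distance between pairs inside the same cluster.
intra = mean(
    dist(p, q)
    for points in clusters.values()
    for p, q in combinations(points, 2)
)

# Inter-cluster: average distance between pairs from different clusters.
inter = mean(dist(p, q) for p in clusters["A"] for q in clusters["B"])

print(f"intra={intra:.2f} inter={inter:.2f}")
```

A good clustering of these points is one where the intra-cluster average is small relative to the inter-cluster average.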


Applications of Cluster Analysis

● Understanding
– Custom profiling for targeted marketing
– Group related documents for browsing
– Group genes and proteins that have similar functionality
– Group stocks with similar price fluctuations
● Summarization
– Reduce the size of large data sets

[Figure, courtesy of Michael Eisen]

[Figure: Clusters for Raw SST and Raw NPP. Axes: longitude (-180 to 180), latitude (-90 to 90). Legend: Land Cluster 1, Land Cluster 2, Sea Cluster 1, Sea Cluster 2, Ice or No NPP.]
Use of K-means to partition Sea Surface Temperature (SST) and Net Primary Production (NPP) into clusters that reflect the Northern and Southern Hemispheres.

Clustering: Application 1

● Market Segmentation:
– Goal: Subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
– Approach:
  • Collect different attributes of customers based on their geographical and lifestyle related information.
  • Find clusters of similar customers.
  • Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.
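The "find clusters of similar customers" step could be carried out with k-means, used here as one possible method since this slide does not prescribe an algorithm. The sketch below is a plain k-means with a fixed number of iterations. The customer attributes (age, monthly spend) and the choice of the first k points as initial centroids are assumptions made for illustration.

```python
from math import dist

def k_means(points, k, iterations=10):
    """Plain k-means; the first k points serve as initial centroids (an assumption)."""
    centroids = list(points[:k])
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group.
        for i, group in enumerate(groups):
            if group:
                centroids[i] = tuple(sum(c) / len(group) for c in zip(*group))
    return centroids, groups

# Hypothetical customers described by (age, monthly spend).
customers = [(25, 40), (52, 210), (27, 35), (23, 50), (49, 190), (55, 220)]
centroids, segments = k_means(customers, k=2)
print(segments)
```

In practice the attributes would be scaled first, because an attribute with a large numeric range (here, spend) otherwise dominates the distance.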


A Behavior Based Segmentation

Clustering: Application 2

● Document Clustering:
– Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
– Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.

Enron email dataset
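A similarity measure based on term frequencies can be sketched as follows. Cosine similarity is used here as one common choice; the slide does not name a specific measure, and the three documents are invented for illustration.

```python
from collections import Counter
from math import sqrt

def term_vector(text):
    """Count how often each lowercase word occurs in the text."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# Hypothetical documents, invented for illustration.
docs = [
    "team wins game after late score",
    "coach praises team after game",
    "market prices fall as energy costs rise",
]
vectors = [term_vector(d) for d in docs]
sim_sports = cosine_similarity(vectors[0], vectors[1])
sim_mixed = cosine_similarity(vectors[0], vectors[2])
print(sim_sports, sim_mixed)
```

Documents that share many frequent terms score close to 1 and documents with no terms in common score 0, so the measure can feed any distance-based clustering method.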


Association Rule Discovery: Definition

● Given a set of records each of which contains some number of items from a given collection
– Produce dependency rules which will predict occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
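The two discovered rules can be checked against the five transactions by counting. The sketch below uses support and confidence, the standard measures for association rules; they are not defined on this slide, so treat the function names and definitions as assumptions brought in for illustration.

```python
# The five transactions from the slide, one set of items per TID.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions with the antecedent, the fraction that also has the consequent."""
    return support(antecedent | consequent) / support(antecedent)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          f"support={support(lhs | rhs):.2f}",
          f"confidence={confidence(lhs, rhs):.2f}")
```

For {Milk} --> {Coke}, Milk occurs in four transactions and Coke accompanies it in three of them, so the rule holds in 3 of 5 transactions overall and in 3 of the 4 that contain Milk.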


Association Analysis: Applications

● Market-basket analysis
– Rules are used for sales promotion, shelf management, and inventory management
● Telecommunication alarm diagnosis
– Rules are used to find combinations of alarms that occur together frequently in the same time period
● Medical Informatics
– Rules are used to find combinations of patient symptoms and test results associated with certain diseases

The KDD Process

Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation

DATA


What is Data?

● Collection of data objects and their attributes
● An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, dimension, or feature
● A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance

In the table, columns are attributes and rows are objects:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Types of data sets
● Record
– Data Matrix
– Document Data
– Transaction Data
● Graph
– World Wide Web
– Molecular Structures
● Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Data Matrix

● If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
● Such data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1

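The m by n representation can be written down directly with the two objects from the slide's table. The sketch below stores the matrix as a list of rows and, as an illustration of the "points in a multi-dimensional space" view, computes the Euclidean distance between the two objects. The choice of Euclidean distance is an assumption; the slide does not name a metric.

```python
from math import dist

# Column names follow the slide's table; each row is one data object.
attributes = ["Projection of x Load", "Projection of y Load",
              "Distance", "Load", "Thickness"]
data = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]

m, n = len(data), len(data[0])   # m objects (rows), n attributes (columns)
print(f"{m} by {n} matrix")

# Because every object is a point in n-dimensional space,
# a geometric distance between two objects is well defined.
print(f"Euclidean distance: {dist(data[0], data[1]):.3f}")
```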
Document Data

● Each document becomes a ‘term’ vector
– Each term is a component (attribute) of the vector
– The value of each component is the number of times the corresponding term occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0

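Turning a document into a term vector is a matter of counting. The sketch below uses the ten terms from the slide as a fixed vocabulary; the ordering of the vocabulary, the whitespace tokenization, and the sample sentence are assumptions made for illustration.

```python
from collections import Counter

# The ten terms shown on the slide, used here as a fixed vocabulary.
vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]

def to_term_vector(text):
    """Return one count per vocabulary term, in vocabulary order."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical document text, invented for illustration.
document = "the team lost the game but the coach said the team will play again"
vector = to_term_vector(document)
print(vector)
```

Words outside the vocabulary (such as "the") are simply ignored, and terms that never occur get a count of 0, which is why document vectors are typically sparse.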
Transaction Data

● A special type of record data, where
– Each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitute a transaction, while the individual products that were purchased are the items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Graph Data

● Examples: Generic graph, a molecule, and webpages

[Figures: a generic graph with weighted edges; Benzene Molecule: C6H6]

Ordered Data

● Sequences of transactions
[Figure labels: Items/Events; an element of the sequence]
Ordered Data

● Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
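Because the order of symbols matters in sequence data, a typical first look counts both single bases and short overlapping substrings. The sketch below does this for the first line of the sequence shown above; the choice of 2-letter substrings is an assumption made for illustration.

```python
from collections import Counter

sequence = "GGTTCCGCCTTCAGCCCCGCGCC"  # first line of the sequence on the slide

# Frequency of each base, ignoring order.
base_counts = Counter(sequence)

# Overlapping 2-letter substrings keep neighbouring positions together,
# which is the information an unordered count throws away.
dimer_counts = Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))

print(base_counts)
print(dimer_counts.most_common(3))
```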
Ordered Data

● Spatio-Temporal Data

[Figure: Average Monthly Temperature of land and ocean]

Data Quality

● Poor data quality negatively affects many data processing efforts

“The most important point is that poor data quality is an unfolding disaster. Poor data quality costs the typical company at least ten percent (10%) of revenue; twenty percent (20%) is probably a better estimate.”
Thomas C. Redman, DM Review, August 2004

● Data mining example: a classification model for detecting people who are loan risks is built using poor data
– Some credit-worthy candidates are denied loans
– More loans are given to individuals that default

Data Quality …

● What kinds of data quality problems?
● How can we detect problems with the data?
● What can we do about these problems?

● Examples of data quality problems:
– Noise and outliers
– Missing values
– Duplicate data
– Wrong data
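Several of these problems can be detected with simple checks. The sketch below looks for missing values, exact duplicates, and one crude kind of outlier. The records, the field names, and the "more than 10 times the median" rule are all assumptions chosen for illustration, not methods given in the lecture.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"tid": 1, "income": 125},
    {"tid": 2, "income": None},
    {"tid": 3, "income": 70},
    {"tid": 3, "income": 70},      # exact duplicate of the previous record
    {"tid": 4, "income": 9000},    # suspiciously large value
    {"tid": 5, "income": 95},
]

# Missing values: any record with at least one empty field.
missing = [r for r in records if any(v is None for v in r.values())]

# Duplicate data: a record whose full contents were already seen.
seen, duplicates = set(), []
for r in records:
    key = tuple(sorted(r.items()))
    if key in seen:
        duplicates.append(r)
    seen.add(key)

# Crude outlier rule (an assumption): more than 10 times the median income.
incomes = sorted(r["income"] for r in records if r["income"] is not None)
median = incomes[len(incomes) // 2]
outliers = [r for r in records
            if r["income"] is not None and r["income"] > 10 * median]

print(len(missing), len(duplicates), len(outliers))
```

Detection is only the first of the three questions above; whether to drop, correct, or keep a flagged record depends on the analysis that follows.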
