Data Attributes Needed for Data Mining
Types of Data
Objective
Recognize attributes of data needed for data mining
Types of Data
- Data that consists of a collection of records, each of which consists of a fixed set of attributes

Tid  Refund         Marital Status  Taxable Income  Cheat
     (categorical)  (categorical)   (continuous)    (class)
1    Yes            Single          125K            No
2    No             Married         100K            No
3    No             Single          70K             No
4    Yes            Married         120K            No
5    No             Divorced        95K             Yes
6    No             Married         60K             No
7    Yes            Divorced        220K            No
8    No             Single          85K             Yes
9    No             Married         75K             No
10   No             Single          90K             Yes
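A minimal Python sketch of the record layout above: each record is a dictionary with the same fixed set of attributes. The names `FIELDS`, `ROWS`, `count_by`, and `mean_of` are illustrative and not part of the slides; Taxable Income is stored as a number of thousands.

```python
# Record data: every record has the same fixed set of attributes.
# Values are copied from the table above; taxable_income is in thousands.
FIELDS = ("tid", "refund", "marital_status", "taxable_income", "cheat")

ROWS = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

# One dictionary per record, keyed by attribute name.
records = [dict(zip(FIELDS, row)) for row in ROWS]


def count_by(records, attribute):
    """Count how many records take each value of a categorical attribute."""
    counts = {}
    for record in records:
        value = record[attribute]
        counts[value] = counts.get(value, 0) + 1
    return counts


def mean_of(records, attribute):
    """Average a continuous attribute over all records."""
    return sum(record[attribute] for record in records) / len(records)
```

Categorical attributes such as `cheat` are naturally summarized by counts, while a continuous attribute such as `taxable_income` supports arithmetic like a mean.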
Data Matrix
- If data objects have the same fixed set of numeric attributes, the data objects can be thought of as points in a multi-dimensional space
  - each dimension represents a distinct attribute
- The data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute
Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
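As a rough illustration of the m by n view, the two rows above can be held as a list of rows in Python. The column names and the `column` helper are assumptions made for this sketch. Euclidean distance is used here only to show what treating objects as points makes possible; the slides do not prescribe a distance measure.

```python
import math

# Rows are data objects, columns are attributes (an m-by-n matrix).
# Values are the two rows from the table above.
COLUMNS = ["projection_x_load", "projection_y_load", "distance", "load", "thickness"]
matrix = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]

m = len(matrix)     # number of objects (rows)
n = len(matrix[0])  # number of attributes (columns)


def column(matrix, j):
    """Return attribute j for every object, i.e. one column of the matrix."""
    return [row[j] for row in matrix]


# Each row is a point in n-dimensional space, so the distance
# between two objects is well defined.
d = math.dist(matrix[0], matrix[1])
```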
Document Data
            Team  Coach  Play  Ball  Score  Game  Win  Lost  Timeout  Season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
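The table above treats each document as a vector of term counts. A minimal sketch of how such a vector might be built follows; the `term_vector` function and the sample sentence are made up for illustration, and tokenization is a plain whitespace split with no punctuation handling or stemming.

```python
from collections import Counter

# Vocabulary taken from the column headings of the table above.
VOCABULARY = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]


def term_vector(text, vocabulary=VOCABULARY):
    """Return the count of each vocabulary term in text, in vocabulary order."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur.
    return [counts[term] for term in vocabulary]


# Hypothetical document text, not taken from the slides.
doc = "team play play game win team"
vector = term_vector(doc)
```

Each position in `vector` corresponds to one term, so every document maps to a row with the same number of columns.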
Transaction Data
- A special type of record data, where
  - each record (transaction) involves a set of items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diapers, Milk
4    Beer, Bread, Diapers, Milk
5    Coke, Diapers, Milk
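Because each transaction is a set of items, one natural representation is a mapping from TID to a Python set. The sketch below uses the five transactions above; the `support_count` helper is an illustrative query added here, not something defined on the slides.

```python
# Transaction data: each record (transaction) is a set of items.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diapers", "Milk"},
    4: {"Beer", "Bread", "Diapers", "Milk"},
    5: {"Coke", "Diapers", "Milk"},
}


def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    # `<=` on sets is the subset test.
    return sum(1 for items in transactions.values() if itemset <= items)
```

For example, `support_count({"Beer", "Diapers"}, transactions)` counts the transactions that contain both items. Unlike the fixed-attribute records earlier, transactions may differ in length.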
Graph Data
- Generic graph and HTML links
Chemical Data
- Benzene molecule: C6H6
Ordered Data
- Sequences of transactions
- Spatio-temporal data
  - Figure: average monthly temperature of land and ocean (January)
Data Quality
- What kinds of data quality problems?
- How can we detect problems with the data?
- What can we do about these problems?
- Examples of data quality problems:
  - noise and outliers
  - missing values
  - duplicate data
Noise
- Noise refers to the modification of original values
Outliers
- Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set
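The slides define outliers but do not name a detection method. As one possible sketch, the function below flags values that lie more than 1.5 interquartile ranges outside the quartiles, a common rule of thumb chosen here as an assumption. The function name `iqr_outliers` and the factor `k` are illustrative, and the sample values are the Taxable Income figures (in thousands) from the record data table earlier in the deck.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values more than k interquartile ranges outside the quartiles.

    This is a rule-of-thumb heuristic, not a method given in the slides.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Taxable Income in thousands, from the record data table.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
flagged = iqr_outliers(incomes)
```

A flagged value is only a candidate: whether it is an error to clean or a legitimate object of interest depends on the application.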
Missing Values
- Reasons for missing values
  - Information is not collected
  - Attributes may not be applicable to all cases
- Handling missing values
  - Eliminate data objects
  - Estimate missing values
  - Ignore the missing value during analysis
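The three handling options above can be sketched as follows, assuming a missing value is stored as `None`. The function names and the sample records are hypothetical. Estimating with the mean of the known values is just one simple choice of estimator, and both mean-based helpers assume at least one value is present.

```python
def eliminate(records, attribute):
    """Option 1: drop data objects whose attribute is missing."""
    return [r for r in records if r[attribute] is not None]


def estimate_with_mean(records, attribute):
    """Option 2: fill missing values with the mean of the known values."""
    known = [r[attribute] for r in records if r[attribute] is not None]
    mean = sum(known) / len(known)
    # Build new dictionaries so the input records are left unchanged.
    return [
        {**r, attribute: mean if r[attribute] is None else r[attribute]}
        for r in records
    ]


def mean_ignoring_missing(records, attribute):
    """Option 3: keep every object but skip missing values in the analysis."""
    known = [r[attribute] for r in records if r[attribute] is not None]
    return sum(known) / len(known)


# Hypothetical sample data with one missing income.
sample = [
    {"id": 1, "income": 100},
    {"id": 2, "income": None},
    {"id": 3, "income": 80},
]
```

Elimination shrinks the data set, estimation keeps its size but introduces made-up values, and ignoring leaves the data as is and pushes the handling into each analysis step.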
Duplicate Data
- Data set may include data objects that are duplicates, or almost duplicates, of one another
  - Major issue when merging data from heterogeneous sources
- Examples:
  - Same person with multiple email addresses
- Data cleaning
  - Process of dealing with duplicate data issues
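To make the email example concrete, the sketch below groups person records by a normalized name and reports groups with more than one entry as candidate duplicates. Matching on name alone is a deliberately naive assumption, since different people can share a name. The functions and sample records are hypothetical.

```python
from collections import defaultdict


def normalize(name):
    """Lowercase a name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def candidate_duplicates(people):
    """Group records by normalized name; keep groups with more than one record."""
    groups = defaultdict(list)
    for person in people:
        groups[normalize(person["name"])].append(person)
    return {name: group for name, group in groups.items() if len(group) > 1}


# Hypothetical records: the same person appears with two email addresses.
people = [
    {"name": "Ann Lee", "email": "ann@example.com"},
    {"name": "ann  lee", "email": "a.lee@example.org"},
    {"name": "Bob Ray", "email": "bob@example.com"},
]
duplicates = candidate_duplicates(people)
```

The result only proposes candidates; deciding whether to merge them is the data cleaning step.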