Data Mining Lecture2-2
Lecture 2
Outline
● Types of Data
● Data Quality
What is Data?
● An attribute is a property or characteristic of an object
– Attribute is also known as variable, field, characteristic,
dimension, or feature
● A collection of attributes describes an object
– Object is also known as record, point, case, sample,
entity, or instance
Attribute Values
● Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a
collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes
● Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and
represented using a finite number of digits.
– Continuous attributes are typically represented as floating-
point variables.
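A minimal Python sketch of how the two attribute kinds are usually stored. The sentence, the temperature reading, and all variable names are made up for illustration.

```python
from collections import Counter

# Discrete attribute: word counts come from a countable set (0, 1, 2, ...)
# and are naturally stored as integers.
words = "data mining finds patterns in data".split()
word_counts = Counter(words)

# Binary attribute: a special case of a discrete attribute with two values.
contains_mining = int("mining" in word_counts)

# Continuous attribute: a temperature reading stored as a floating-point number.
measured = 36.6449

# In practice only a finite number of digits is recorded.
recorded = round(measured, 1)

print(word_counts["data"], contains_mining, recorded)
```

The integer count and the rounded float mirror the two notes above: discrete values map to integer variables, and continuous values are kept only to a finite precision.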
Asymmetric Attributes
● Only presence (a non-zero attribute value) is regarded as
important
– Words present in documents
– Items present in customer transactions
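One way to see why only presence matters is to compare a dense 0/1 vector with a representation that stores just the words that occur. The sketch below assumes two tiny made-up documents. The names `documents`, `dense_vector`, and `present_words` are hypothetical.

```python
# Two toy documents (illustrative data only).
documents = {
    "doc1": "data mining finds patterns",
    "doc2": "patterns in customer transactions",
}

# The vocabulary is every distinct word across all documents.
vocabulary = sorted({w for text in documents.values() for w in text.split()})

def dense_vector(text):
    """Binary vector over the whole vocabulary: 1 if the word is present."""
    present = set(text.split())
    return [1 if w in present else 0 for w in vocabulary]

# Asymmetric view: keep only the non-zero entries, i.e. the words present.
present_words = {name: set(text.split()) for name, text in documents.items()}

# Documents are compared on shared presences, not on shared absences.
shared = present_words["doc1"] & present_words["doc2"]

print(dense_vector(documents["doc1"]))
print(shared)
```

With a realistic vocabulary of many thousands of words, almost every dense entry would be 0, which is why the set of present words is the more useful view.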
● Incomplete
– Asymmetric binary
– Cyclical
– Multivariate
– Partially ordered
– Partial membership
– Relationships between the data
– Sparsity: only presence counts
– Resolution: patterns depend on the scale
– Size: type of analysis may depend on size of data
Types of data sets
● Record
– Data Matrix
– Document Data
– Transaction Data
● Graph
– World Wide Web
– Molecular Structures
● Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
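The three families of data sets listed above can be sketched with plain Python structures. All values and names here (`data_matrix`, `transactions`, `web_graph`, `temperature_series`) are invented for illustration and are not taken from the lecture.

```python
# Record data, data matrix: each row is an object, each column a numeric attribute.
data_matrix = [
    [10.2, 5.1],
    [12.7, 6.3],
]

# Record data, transaction data: each record is a set of items.
transactions = [
    {"bread", "milk"},
    {"beer", "bread", "diapers"},
]

# Graph data: pages are nodes, hyperlinks are directed edges (adjacency lists).
web_graph = {
    "page_a": ["page_b", "page_c"],
    "page_b": ["page_c"],
    "page_c": [],
}

# Ordered data, temporal: the order of the (time, value) pairs carries meaning.
temperature_series = [("2024-01", 3.5), ("2024-02", 4.1), ("2024-03", 7.8)]

num_links = sum(len(targets) for targets in web_graph.values())
items_seen = set().union(*transactions)

print(num_links, sorted(items_seen))
```

The choice of structure follows the data: rows and columns for a data matrix, sets for transactions, adjacency lists for graphs, and an ordered list when position or time matters.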
Record Data
● Sequences of transactions
– Each element of the sequence is a set of items/events
Ordered Data
● Spatio-Temporal Data
– Example: average monthly temperature of land and ocean
Data Quality
● Causes?
Missing Values
Duplicate Data
● Examples:
– Same person with multiple email addresses
● Data cleaning
– Process of dealing with duplicate data issues
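The duplicate-data example (one person appearing under several email addresses) can be sketched as follows. This is a simplified illustration that assumes a normalized name is enough to identify a person, which real data rarely guarantees. The records and the helper names `normalize` and `merge_duplicates` are hypothetical.

```python
from collections import defaultdict

# Toy records: the first two describe the same person (illustrative data only).
records = [
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "ada  lovelace ", "email": "a.lovelace@example.org"},
    {"name": "Alan Turing", "email": "alan@example.com"},
]

def normalize(name):
    """Lowercase the name and collapse whitespace so variants compare equal."""
    return " ".join(name.lower().split())

def merge_duplicates(rows):
    """Group email addresses under one entry per normalized name."""
    emails_by_person = defaultdict(set)
    for row in rows:
        emails_by_person[normalize(row["name"])].add(row["email"])
    return dict(emails_by_person)

merged = merge_duplicates(records)
print(merged)
```

Matching on name alone can wrongly merge two different people who share a name, so a practical cleaning step would usually compare several attributes before declaring records duplicates.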