Lecture2_IntroData


Lecture 2

Data
Summary – last week
• Last week:
– Course Motivation
– Data Mining basics

• This week:
– Data

Acknowledgment: Thanks to Tan, Steinbach, Karpatne, and Kumar for the slides.


Agenda
– Attributes and Objects
– Types of Data
– Data Quality
– Similarity and Distance
– Data Preprocessing



What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
• Examples: eye color of a person, temperature, etc.
• An attribute is also known as a variable, field, characteristic, dimension, or feature
• A collection of attributes describes an object
• An object is also known as a record, point, case, sample, entity, or instance

In the table, each column is an attribute and each row is an object:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes



Attribute Values
• Attribute values are numbers or symbols assigned to
an attribute for a particular object

• Distinction between attributes and attribute values
– The same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of values
• Example: Attribute values for ID and age are integers
– But the properties of an attribute can differ from the properties of the values used to represent it
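
A minimal Python sketch of the two points above. The height, ID, and age values are made up for illustration and do not come from the slides.

```python
from statistics import mean

FEET_PER_METER = 3.28084  # standard conversion factor

# Same attribute (height), two different value mappings.
height_m = 1.80
height_ft = height_m * FEET_PER_METER

# Different attributes (ID and age), same kind of values: integers.
# These sample values are hypothetical.
ids = [101, 102, 103]
ages = [23, 35, 41]

average_age = mean(ages)  # meaningful: ages have order and magnitude
average_id = mean(ids)    # computable, but meaningless: IDs only label objects
```

Both averages can be computed because both attributes are stored as integers, but only the average age says anything about the objects. That is the sense in which the properties of an attribute can differ from the properties of its values.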



Discrete and Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a
collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and
represented using a finite number of digits.
– Continuous attributes are typically represented as floating-
point variables.
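
The point about finite digits can be checked directly. The sketch below assumes Python's built-in float (a binary floating-point number) as the representation of a continuous attribute; the variable names and values are illustrative only.

```python
from decimal import Decimal

# Discrete attribute: a count comes from a countable set and is stored as an int.
word_count = 42

# Continuous attribute: stored as a float, which holds only finitely many digits.
temperature = 0.1

# Decimal(float) exposes the exact value that the float actually stores.
stored = Decimal(temperature)
is_exact = stored == Decimal("0.1")  # False: 0.1 has no exact binary representation

# One consequence: float arithmetic can differ from exact real arithmetic.
sum_matches = (0.1 + 0.2 == 0.3)  # False
```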



Important Characteristics of Data
– Dimensionality (number of attributes)
• High dimensional data brings a number of
challenges
– Sparsity
• Only presence counts (in sparse data, usually only the non-zero entries matter)
– Resolution
• Patterns depend on the scale
– Size
• Type of analysis may depend on size of data
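
As an illustration of dimensionality and sparsity, the sketch below stores only the non-zero entries of a mostly-zero vector. The vector values and the dict-based layout are assumptions chosen for illustration, not something the slides prescribe.

```python
# A mostly-zero vector (hypothetical values).
dense = [0, 7, 0, 2, 1, 0, 0, 3, 0, 0]

# Sparse form: keep only the positions where something is present.
sparse = {i: v for i, v in enumerate(dense) if v != 0}

dimensionality = len(dense)    # number of attributes
stored_entries = len(sparse)   # non-zero entries only
sparsity = 1 - stored_entries / dimensionality  # fraction of zero entries
```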



Types of data sets
• Record
– Data Matrix
– Document Data
– Transaction Data
• Graph
– World Wide Web
– Molecular Structures
• Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data



Record Data
• Data that consists of a collection of records, each of
which consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
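
One possible way to hold this table in Python is a list of dicts that all share the same keys. The field names below are assumed identifiers, and taxable income is written in thousands (125K becomes 125).

```python
FIELDS = ("tid", "refund", "marital_status", "taxable_income", "cheat")
ROWS = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

# Each record is one object; every record has the same fixed set of attributes.
records = [dict(zip(FIELDS, row)) for row in ROWS]
same_schema = all(tuple(r.keys()) == FIELDS for r in records)

# Select objects by the value of one attribute.
cheaters = [r["tid"] for r in records if r["cheat"] == "Yes"]
```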



Data Matrix
• If data objects have the same fixed set of numeric
attributes, then the data objects can be thought of as
points in a multi-dimensional space, where each
dimension represents a distinct attribute
• Such a data set can be represented by an m by n
matrix, where there are m rows, one for each object,
and n columns, one for each attribute

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
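
A minimal sketch of the m by n view, using the two objects shown and plain nested lists. The column identifiers are shortened names assumed for this example.

```python
# One row per object, one column per attribute.
columns = ["projection_x_load", "projection_y_load", "distance", "load", "thickness"]
matrix = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]

m = len(matrix)       # number of objects (rows)
n = len(matrix[0])    # number of attributes (columns)

# Reading one attribute means reading one column.
thickness = [row[columns.index("thickness")] for row in matrix]
```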



Document Data
• Each document becomes a ‘term’ vector
– Each term is a component (attribute) of the vector
– The value of each component is the number of times the
corresponding term occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
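
A term vector can be built by counting words. The sketch below assumes simple lowercase, whitespace-based tokenization, and the sample sentence is hypothetical; real text usually needs more careful tokenization.

```python
from collections import Counter


def term_vector(text, vocabulary):
    """Return the count of each vocabulary term in the text, in vocabulary order."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


vocabulary = ["team", "coach", "play", "ball"]
document = "The coach told the team to play and the team won"  # hypothetical text
vector = term_vector(document, vocabulary)
```

Terms that do not occur in the document get a count of 0, which is why document vectors tend to be sparse.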



Transaction Data
• A special type of data, where
– Each transaction involves a set of items.
– For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitutes a
transaction, while the individual products that were purchased
are the items.
– Can represent transaction data as record data

TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
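
One common way to represent the transactions above as record data is a binary attribute per item. The sketch assumes that encoding; it is not the only possible one.

```python
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Every distinct item becomes one attribute (column), in alphabetical order.
items = sorted(set().union(*transactions.values()))

# Each transaction becomes a record of 0/1 values: 1 if the item was bought.
records = {
    tid: [int(item in basket) for item in items]
    for tid, basket in transactions.items()
}
```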


Graph Data
• Examples: Generic graph, a molecule, and webpages

[Figure: example graphs, including the benzene molecule, C6H6]
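
As a sketch of a molecule stored as a graph, benzene can be written as an adjacency structure with atoms as nodes and bonds as edges. The atom labels C1..C6 and H1..H6 are naming assumptions, and bond types are ignored.

```python
# Six carbons form a ring; each carbon is also bonded to one hydrogen.
edges = []
for i in range(1, 7):
    next_carbon = i % 6 + 1
    edges.append((f"C{i}", f"C{next_carbon}"))  # ring bond
    edges.append((f"C{i}", f"H{i}"))            # carbon-hydrogen bond

# Undirected graph: record each edge in both directions.
adjacency = {}
for a, b in edges:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)

# Degree of a node = number of bonds the atom takes part in.
degree = {node: len(neighbors) for node, neighbors in adjacency.items()}
```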



Ordered Data
• Sequences of transactions

[Figure: a sequence of transactions; each element of the sequence is a set of items/events]


Ordered Data
• Spatio-Temporal Data

[Figure: average monthly temperature of land and ocean]



Data Quality
• What kinds of data quality problems occur?
• How can we detect problems with the data?
• What can we do about these problems?

• Examples of data quality problems:
– Noise and outliers
– Wrong data
– Fake data
– Missing values
– Duplicate data
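
Two of the problems listed above, missing values and duplicate data, are simple to detect. The sketch below uses hypothetical records in which None marks a missing value; detecting noise, wrong data, or fake data generally needs domain knowledge and is not attempted here.

```python
# Hypothetical records; None marks a missing value.
rows = [
    {"tid": 1, "income": 125, "status": "Single"},
    {"tid": 2, "income": None, "status": "Married"},
    {"tid": 3, "income": 70, "status": None},
    {"tid": 1, "income": 125, "status": "Single"},  # exact copy of the first row
]

# Missing values: (row index, attribute name) for every None.
missing = [
    (i, key)
    for i, row in enumerate(rows)
    for key, value in row.items()
    if value is None
]

# Duplicates: rows whose full content was already seen earlier.
seen = set()
duplicates = []
for i, row in enumerate(rows):
    fingerprint = tuple(sorted(row.items(), key=lambda kv: kv[0]))
    if fingerprint in seen:
        duplicates.append(i)
    else:
        seen.add(fingerprint)
```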
