CS F415 Data Mining Data Preprocessing
Objects
• A collection of attributes describes an object
– An object is also called an entity or instance
– An attribute is also called a feature
• Sample records:
6   No    Married    60K    No
7   Yes   Divorced   220K   No
8   No    Single     85K    Yes
• Example: Document-term matrix
• A vocabulary is formed, composed of all unique terms in the documents

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1      3      0     5     0      2     6    0     2        0       2
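A minimal sketch of how such a matrix could be built, assuming lowercase text and naive whitespace tokenization. The two sample documents and the function name `build_document_term_matrix` are hypothetical and serve only as illustration.

```python
from collections import Counter


def build_document_term_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    # Naive tokenization on whitespace; real pipelines usually clean more.
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is the sorted set of all unique terms in the documents.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing terms count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Hypothetical documents, not taken from the slides.
docs = ["team play team score", "coach play game"]
vocab, dtm = build_document_term_matrix(docs)
```

Each row of `dtm` is one document's term-count vector, with columns in the order given by `vocab`.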
• Extension of record
data, where each record
has a time associated
with it
– Also called temporal data
– Find patterns such as “candy sales
peak before Halloween.”
• Correction
– Possible sometimes
– Requires additional or redundant information
(73,600 - 54,000) / 16,000 = 1.225
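The arithmetic above has the shape of a z-score normalization, (value - mean) / standard deviation. A small sketch under that assumption: reading 54,000 as the mean and 16,000 as the standard deviation is inferred from the expression, and the function name `z_score` is hypothetical.

```python
def z_score(value, mean, std_dev):
    """Return how many standard deviations `value` lies from `mean`."""
    if std_dev == 0:
        raise ValueError("standard deviation must be non-zero")
    return (value - mean) / std_dev


# Assumed roles: 54,000 is the mean, 16,000 the standard deviation.
normalized = z_score(73_600, 54_000, 16_000)
print(round(normalized, 3))  # 1.225
```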
Dissimilarity matrix

test-1   1     2     3     4
1        0     1     1     0
2        1     0     1     1
3        1     1     0     1
4        0     1     1     0

test-2   1     2     3     4
1        0     1     0.5   0
2        1     0     0.5   1
3        0.5   0.5   0     0.5
4        0     1     0.5   0
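One way these two matrices could arise, sketched under assumptions: test-1 is treated as a nominal attribute (0 on a match, 1 on a mismatch) and test-2 as an ordinal attribute with three levels whose ranks are scaled onto [0, 1]. The attribute values and function names below are hypothetical, chosen only so that the output agrees with the matrices above.

```python
def nominal_dissimilarity(values):
    """d(i, j) = 0 if the two values match, 1 otherwise."""
    return [[0 if a == b else 1 for b in values] for a in values]


def ordinal_dissimilarity(ranks, num_levels):
    """Scale ranks 1..num_levels onto [0, 1], then take absolute differences."""
    if num_levels < 2:
        raise ValueError("need at least two ordinal levels")
    scaled = [(r - 1) / (num_levels - 1) for r in ranks]
    return [[abs(a - b) for b in scaled] for a in scaled]


# Hypothetical values for objects 1..4, consistent with the matrices above.
test_1 = ["A", "B", "C", "A"]  # nominal codes (assumed)
test_2 = [3, 1, 2, 3]          # ordinal ranks, 1 = lowest (assumed)

nominal = nominal_dissimilarity(test_1)
ordinal = ordinal_dissimilarity(test_2, num_levels=3)
```

Objects 1 and 4 share the same value in both lists, which is why their dissimilarity is 0 in both matrices.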
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
• Solution
• d1 = 1 0 1 0 1
• d2 = 1 1 1 0 1
• d1 · d2 = 3
• ||d1||^2 = 1 + 1 + 1 = 3
• ||d2||^2 = 1 + 1 + 1 + 1 = 4
• T(d1, d2) = 3 / (3 + 4 - 3)
• T(d1, d2) = 3/4 = 0.75
• Solution
• d1 = 1 1 1 1 1
• d2 = 1 1 1 1 1
• d1 · d2 = 5
• ||d1||^2 = 1 + 1 + 1 + 1 + 1 = 5
• ||d2||^2 = 1 + 1 + 1 + 1 + 1 = 5
• T(d1, d2) = 5 / (5 + 5 - 5)
• T(d1, d2) = 5/5 = 1
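Both solutions divide the dot product by the sum of the squared norms minus the dot product. A short sketch that recomputes them, assuming that is the intended coefficient; the function name `tanimoto` is hypothetical.

```python
def tanimoto(d1, d2):
    """T(d1, d2) = d1.d2 / (||d1||^2 + ||d2||^2 - d1.d2)."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1_sq = sum(a * a for a in d1)
    norm2_sq = sum(b * b for b in d2)
    denominator = norm1_sq + norm2_sq - dot
    if denominator == 0:
        # Only happens when both vectors are all zeros.
        raise ValueError("coefficient is undefined for two zero vectors")
    return dot / denominator


first = tanimoto([1, 0, 1, 0, 1], [1, 1, 1, 0, 1])   # 3 / (3 + 4 - 3)
second = tanimoto([1, 1, 1, 1, 1], [1, 1, 1, 1, 1])  # 5 / (5 + 5 - 5)
print(first, second)  # 0.75 1.0
```

Identical vectors give a coefficient of 1, matching the second solution.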