Prof. Heitor S Lopes
Prof. Thiago H Silva
Data Mining & Knowledge Discovery
1c - Data - Important Aspects
Data -> Knowledge
Appropriate languages help.
In this course, examples are given in:
Dataframe
Supports variables of various types and simplifies manipulation
Each column can have a different type
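For example, a minimal sketch with pandas (the column names and values below are illustrative, not from the slides):

import pandas as pd

# Each column of a DataFrame can hold a different type
df = pd.DataFrame({
    "name": ["Ana", "Bruno", "Carla"],   # strings
    "age": [23, 35, 41],                 # integers
    "height_m": [1.62, 1.80, 1.75],      # floats
    "smoker": [False, True, False],      # booleans
})

print(df.dtypes)   # one dtype per column
print(df.head())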
Dataset
Collection of data objects and their attributes
An attribute is a property of an object
Examples: eye color, temperature, etc.
Attribute is also known as variable, characteristic, feature
Attribute types
Categorical
● Nominal: they are just different names. Ex: ID number, eye color, zip code
● Ordinal: sufficient information to order. Ex: grades, height {high, medium, low}
Numeric
● Interval: the intervals between each value are equally divided (differences are significant). Ex: dates, temperature in Celsius
● Ratio: data has a natural zero point; allows comparisons of the type "x is twice as much as y". Ex: monetary amounts, weight
Attribute types
User ID in an e-mail system
– Nominal, Ordinal, or Interval?
Attribute types
Discrete attribute
● Has only a finite or countably infinite set of values
● Ex: zip codes or the set of words in a collection
● Typically represented as integer variables
Continuous attribute
● Has real numbers as attribute values
● Ex: temperature, height or weight.
● Typically represented as floating point variables
Is age continuous or discrete?
Typical and complex datasets
Matrix data
Structured Text: DNA/Protein Sequences
Complex datasets
Transactions
A special type of record, where:
● Each record (transaction) involves a set of items
● Ex: supermarket
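A minimal sketch of how such transaction data might be represented in Python (the items are illustrative):

from collections import Counter

# Each transaction is the set of items bought together
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
]

# How often each item appears across transactions
item_counts = Counter(item for t in transactions for item in t)
print(item_counts.most_common(3))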
Complex datasets
Unstructured text; Spatio-temporal data
Complex datasets
Time series; Graph
Proximity notion
Measure of similarity
● Numerical measure of how similar two data objects are
● It is larger when they are more similar
● Usually in the range [0,1]
Measure of dissimilarity
● Numerical measure of how different two objects are
● Minimal dissimilarity is usually 0
For convenience, proximity refers to similarity or dissimilarity
Euclidean distance
dist(x, y) = sqrt( Σ_{k=1..n} (x_k − y_k)² )
where n is the number of dimensions (attributes) and x_k and y_k are the k-th attributes of data objects x and y.
Standardization is necessary if the scales differ
Euclidean distance – example: points, their distance matrix, and a comparison of distance matrices (figures)
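A minimal sketch of this computation with NumPy/SciPy (the points are illustrative, not the ones from the slide figure):

import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[0.0, 2.0],
                   [2.0, 0.0],
                   [3.0, 1.0]])

# Euclidean distance between the first two points
d01 = np.sqrt(np.sum((points[0] - points[1]) ** 2))
print(d01)

# Full symmetric distance matrix over all pairs of points
print(squareform(pdist(points, metric="euclidean")))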
Similarity between binary vectors
Common situation: objects p and q have only binary attributes
Compute the similarity like this:
f01 = # of attributes where p is 0 and q is 1
f10 = # of attributes where p is 1 and q is 0
f00 = # of attributes where p is 0 and q is 0
f11 = # of attributes where p is 1 and q is 1
Simple Matching (SMC) and Jaccard Coefficient (J)
SMC = number of matches “11” and “00” / number of attributes
= (f11 + f00) / (f01 + f10 + f11 + f00)
J = number of matches “11” / number of non-zero attributes
= (f11) / (f01 + f10 + f11)
SMC vs Jaccard
x = 1 0 0 0 0 0 0 0 0 0
y = 0 0 0 0 0 0 1 0 0 1
f01 = 2
f10 = 1
f00 = 7
f11 = 0
SMC = (f11 + f00) / (f01 + f10 + f11 + f00)
= (0+7) / (2+1+0+7) = 0.7
J = (f11) / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
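A minimal Python sketch that reproduces these counts and both coefficients:

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

f01 = sum(1 for a, b in zip(x, y) if (a, b) == (0, 1))
f10 = sum(1 for a, b in zip(x, y) if (a, b) == (1, 0))
f00 = sum(1 for a, b in zip(x, y) if (a, b) == (0, 0))
f11 = sum(1 for a, b in zip(x, y) if (a, b) == (1, 1))

smc = (f11 + f00) / (f01 + f10 + f11 + f00)   # 0.7
jaccard = f11 / (f01 + f10 + f11)             # 0.0
print(smc, jaccard)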
Cosine similarity
Does not take 0-0 matches into account, as in Jaccard, and works for non-binary vectors.
If d1 and d2 are numeric vectors, then
cos(d1, d2) = <d1, d2> / (||d1|| ||d2||),
where <d1, d2> indicates the dot product of the vectors d1 and d2, and ||d|| is the magnitude of the vector d.
E.g.:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
<d1, d2> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 0.3150
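A minimal NumPy sketch that reproduces this calculation:

import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cos = np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))   # 0.315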
Linear correlation measure
corr(x, y) = 1 means a perfect positive correlation between the two variables.
corr(x, y) = -1 means a perfect negative correlation between the two variables, i.e., when one increases, the other always decreases.
corr(x, y) = 0 means that the two variables do not depend linearly on each other. However, there may be a non-linear dependence.
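A minimal NumPy sketch (the vectors are illustrative):

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = 2 * x + 3                        # perfectly linear in x

print(np.corrcoef(x, y)[0, 1])       # 1.0  (perfect positive correlation)
print(np.corrcoef(x, -y)[0, 1])      # -1.0 (perfect negative correlation)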
What about categorical data?
A = (-3, -2, -1, 0, 1, 2, 3)
B = (a, a, b, a, a, b, b)
What about categorical data?
A = (-3, -2, -1, 0, 1, 2, 3)
B = (a, a, b, a, a, b, b)
Codes: a = 0, b = 1
B = (0, 0, 1, 0, 0, 1, 1)
Binarization
Maps a continuous or categorical attribute to one or more binary variables
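For example, a minimal sketch of one-hot encoding with pandas (the column and category names are illustrative):

import pandas as pd

df = pd.DataFrame({"eye_color": ["brown", "blue", "green", "blue"]})

# Each category becomes one binary (0/1) column
binary = pd.get_dummies(df, columns=["eye_color"])
print(binary)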
Normalization (z-score)
Also known as standardization
z = (x − μ) / σ
where μ is the mean (average) and σ is the standard deviation.
Standardizes features so that they are centered around 0 with a standard deviation of 1.
This is a common requirement for many machine learning algorithms.
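A minimal sketch, using the data from the "Normalization - Example" slide below, both by hand and with scikit-learn's StandardScaler:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[ 3.9, 5.0, 3000.0],
              [ 5.0, 5.5, 3500.0],
              [10.0, 6.0, 3500.0]])

# By hand: z = (x - mean) / std, column by column
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# With scikit-learn (same result)
z_sklearn = StandardScaler().fit_transform(X)
print(np.allclose(z_manual, z_sklearn))   # True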
Normalization (MIN-MAX)
Typically: x' = (x − min) / (max − min)
In this approach, data is scaled to a fixed range, usually from 0 to 1.
Normalization - Example
[[ 3.9 5. 3000. ]
[ 5. 5.5 3500. ]
[ 10. 6. 3500. ]]
Distances between non-normalized objects
[[  0.      500.0014 500.0382]
 [500.0014    0.       5.0249]
 [500.0382    5.0249   0.    ]]
Distances between normalized objects
[[0.         1.13248317 1.73205081]
 [1.13248317 0.         0.96013   ]
 [1.73205081 0.96013    0.        ]]
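These distance matrices can be reproduced with the following sketch (assuming the example scales each column to [0, 1] with min-max normalization and uses Euclidean distances):

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from scipy.spatial.distance import pdist, squareform

X = np.array([[ 3.9, 5.0, 3000.0],
              [ 5.0, 5.5, 3500.0],
              [10.0, 6.0, 3500.0]])

print(squareform(pdist(X)))               # distances between the raw objects

X_norm = MinMaxScaler().fit_transform(X)  # each column scaled to [0, 1]
print(squareform(pdist(X_norm)))          # distances between the normalized objects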
References
Tan, P. N., Steinbach, M., & Kumar, V. (2016). Introduction to data mining.
Pearson Education India.
Thanks to Professors Josh Starmer, Yi Zhang, and Vincent Spruyt for some of the images used.
Official documentation of the scikit-learn library: scikit-learn.org