Data Mining Notes C2
Chapter 2: Data
Wu Ziqing
21st October 2020
1 Data Exploration
1.1 Types of Attribute
An attribute is a property or characteristic of an object that may vary, either
from one object to another or from one time to another.
There are mainly two types of data, Categorical/Qualitative data and Numeric/Quantitative data, which can be further divided into four subcategories. Table 1 shows the description, allowable operations, permissible transformations (those which do not change the meaning of the attribute), and examples of each type of data.
There are other properties of an attribute:
1. Number of values the attribute can take:
• Discrete attributes have a finite or countably infinite set of values. Such attributes can be categorical, such as zip codes or ID numbers, or numeric, such as counts.
• Continuous attributes have real-number values, such as temperature, height and weight. Continuous attributes are usually Interval or Ratio values.
2. Asymmetric Attributes: For asymmetric attributes, only presence (a non-zero attribute value) is regarded as important. This type of attribute is particularly important for tasks like association analysis, where the items that were not purchased are not important.
It is always important to understand each attribute and its type, as this helps us know which operations and transformations on the data are legitimate.
Table 1: Types of attributes, with the allowable operations, permissible transformations and examples for each type.

Nominal (Categorical)
• Description: the values of a nominal attribute are just different names used to distinguish one object from another.
• Operations: mode, entropy, contingency correlation, χ² test
• Transformation: any one-to-one mapping
• Examples: zip codes, employee ID, eye color, gender

Ordinal (Categorical)
• Description: the values of an ordinal attribute provide enough information to order objects.
• Operations: median, percentiles, rank correlation, run tests, sign tests
• Transformation: an order-preserving change of values
• Examples: hardness of minerals, {good, better, best}, grades, street numbers

Interval (Numeric)
• Description: for interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists.
• Operations: mean, standard deviation, Pearson's correlation, t and F tests
• Transformation: new value = a · old value + b
• Examples: calendar dates, temperature in Celsius and Fahrenheit
2. Sparsity: Sparsity refers to the proportion of zero values in a data set. Sparsity can be an advantage or a problem, depending on the technique used.
3. Resolution: Resolution refers to the scale or unit used to measure the values in a data set. For example, when recording the surface of the Earth, the resolution can be a few meters or thousands of kilometers. The properties and patterns in the data may depend on the level of resolution.
2 Data Cleaning
Raw data often have quality issues. The detection and correction of missing data, corrupted data and outliers is called Data Cleaning.
5. Inconsistent values: Sometimes a value breaches a certain rule; for example, a zip code that is not valid for the object's city. It is sometimes possible to correct the data with additional or redundant information.
6. Duplicate Data: Duplicated data may contain the same information or refer to the same object. Before removing or combining the duplicates, we need to make sure the two objects are exactly the same, not just similar or containing additional information.
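As a small illustration of removing exact duplicates, a minimal sketch assuming pandas and hypothetical customer records (the near-duplicate row with a different zip code is deliberately left for manual inspection):

```python
import pandas as pd

# Hypothetical customer records: the first two rows are exact duplicates,
# the third differs only in the zip code (a possible inconsistency).
df = pd.DataFrame({
    "name": ["A. Smith", "A. Smith", "A. Smith"],
    "city": ["Boston",   "Boston",   "Boston"],
    "zip":  ["02115",    "02115",    "02116"],
})

# Only rows that are identical in every attribute are dropped; the third
# row is kept because it is merely similar, not the same.
deduplicated = df.drop_duplicates()
print(deduplicated)
```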
3 Feature Engineering
3.1 Aggregation
Data of the same entity (user/company), time period (day/month), or geographic location (city/country) can be aggregated together.
Aggregation can reveal new patterns and make statistical properties more stable.
Quantitative attributes, such as price, are typically aggregated by taking a sum or an average. A qualitative attribute, such as item, can either be omitted or summarized as the set of all the items that were sold at that location.
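A minimal sketch of this kind of aggregation, assuming pandas and hypothetical transaction records with city, item and price columns:

```python
import pandas as pd

# Hypothetical transaction records: one row per item sold.
df = pd.DataFrame({
    "city":  ["Minneapolis", "Minneapolis", "Chicago", "Chicago"],
    "item":  ["milk", "bread", "milk", "soda"],
    "price": [3.50, 2.00, 3.20, 1.50],
})

# Quantitative attribute (price): aggregate with a sum and an average.
# Qualitative attribute (item): summarize as the set of items sold per city.
summary = df.groupby("city").agg(
    total_price=("price", "sum"),
    avg_price=("price", "mean"),
    items=("item", lambda s: set(s)),
)
print(summary)
```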
3.2 Sampling
Sampling is a commonly used approach for selecting a subset of the data objects to be analyzed. The sampled data is smaller and therefore more efficient to process.
Sampling should produce samples that are representative of the original data set. Some common sampling strategies are:
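As an illustration of one common strategy, a minimal sketch of simple random sampling without replacement (assuming pandas and a hypothetical data set df):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Hypothetical data set with 10,000 objects and one attribute.
df = pd.DataFrame({"x": rng.normal(loc=5.0, scale=2.0, size=10_000)})

# Simple random sampling without replacement: every object has the same
# probability of being selected, and each object is selected at most once.
sample = df.sample(n=500, replace=False, random_state=0)

# A representative sample should preserve basic statistics of the data.
print(df["x"].mean(), sample["x"].mean())
```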
Principal Components Analysis (PCA) is a linear algebra technique for continuous attributes that projects high-dimensional data onto a lower-dimensional space.
TODO: add notes for PCA
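A minimal sketch of PCA using scikit-learn; the data set, the choice of two components and the injected correlation are assumptions made purely for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(seed=0)

# Hypothetical data: 200 objects with 10 continuous attributes,
# where attribute 1 is made strongly correlated with attribute 0.
X = rng.normal(size=(200, 10))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]

# Project the data onto the 2 directions of maximum variance.
pca = PCA(n_components=2)
X_low = pca.fit_transform(X)

print(X_low.shape)                    # (200, 2)
print(pca.explained_variance_ratio_)  # fraction of variance kept per component
```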
1. The Search Strategy controls how candidate feature subsets are generated. For example, features can be removed one at a time starting from the full feature set, or added one at a time starting from an empty set.
2. The Evaluation process decides whether the current subset is a good one. For Filter approaches, the evaluation is based on a measure independent of the target data mining algorithm, while for Wrapper approaches the evaluation uses the target algorithm itself (as in the sketch after this list).
3. After the subset is selected, a Validation can be done against the results obtained with the original full feature set, or with other feature selection algorithms.
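A minimal sketch of a wrapper-style greedy forward search, assuming scikit-learn; the decision tree, the iris data set and 5-fold cross-validation are illustrative assumptions, not part of the notes:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

selected, remaining = [], list(range(X.shape[1]))
best_score = -np.inf

# Search strategy: greedy forward selection, starting from an empty set.
# Evaluation: the cross-validated score of the target algorithm (wrapper).
while remaining:
    scores = {
        f: cross_val_score(DecisionTreeClassifier(random_state=0),
                           X[:, selected + [f]], y, cv=5).mean()
        for f in remaining
    }
    f_best, s_best = max(scores.items(), key=lambda kv: kv[1])
    if s_best <= best_score:        # stop when no candidate improves the score
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = s_best

print(selected, round(best_score, 3))
```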
income group could be age 0-16 where everyone is of ’low income’.
In other words, we need to make each interval as pure as possible, i.e., minimize the entropy.
Definition of entropy
Entropy measures the purity of an interval. If the interval has data from only one class, the entropy will be 0. If the proportion of each class is the same, the entropy reaches its maximum value log2(k) (which is 1 when there are two classes), as we have minimal knowledge about which class this interval is likely to represent.
Assume a data set has k classes and an attribute is discretized into n intervals. The entropy e_i of the i-th interval is calculated by:

e_i = -\sum_{j=1}^{k} p_{ij} \log_2 p_{ij}    (1)

where p_ij = m_ij / m_i is the probability of class j in interval i (m_i is the amount of data in interval i and m_ij is the amount of class j data in interval i).

The total entropy of the discretization is the weighted sum:

e = \sum_{i=1}^{n} w_i \cdot e_i    (2)

where w_i = m_i / m is the fraction of the data falling in interval i and m is the total amount of data.
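A minimal sketch of this computation, with a hypothetical discretization into three intervals over two classes:

```python
import numpy as np

def interval_entropy(class_counts):
    """Entropy e_i of one interval, given the counts m_ij of each class j."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]                                    # treat 0 * log2(0) as 0
    return max(0.0, float(-(p * np.log2(p)).sum())) # clamp -0.0 to 0.0

def total_entropy(intervals):
    """Weighted entropy of a discretization, with weights w_i = m_i / m."""
    sizes = np.array([sum(c) for c in intervals], dtype=float)
    w = sizes / sizes.sum()
    return float(sum(wi * interval_entropy(c) for wi, c in zip(w, intervals)))

# Hypothetical intervals: [count of class 1, count of class 2] in each.
intervals = [[10, 0], [4, 4], [1, 9]]
print([round(interval_entropy(c), 3) for c in intervals])  # [0.0, 1.0, 0.469]
print(round(total_entropy(intervals), 3))
```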
• We could find the unique values and encode each value with a unique n-bit binary number. The total number of bits needed for m unique values is n = \lceil \log_2(m) \rceil.
• We could instead use one bit per unique value. The representation then has exactly one '1' bit, identifying the value, and all remaining bits are 0. This One-hot Encoding is suitable for algorithms which require asymmetric binary values and avoids introducing spurious correlation between bits. It will, however, require m bits for m unique values.
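A minimal sketch of both encodings for a hypothetical attribute with five unique values:

```python
import math

values = ["red", "green", "blue", "brown", "hazel"]
m = len(values)

# Compact encoding: each unique value gets an n-bit code, n = ceil(log2(m)).
n_bits = math.ceil(math.log2(m))
binary_codes = {v: format(i, f"0{n_bits}b") for i, v in enumerate(values)}

# One-hot encoding: m asymmetric binary bits, exactly one '1' per value.
one_hot_codes = {v: "".join("1" if j == i else "0" for j in range(m))
                 for i, v in enumerate(values)}

print(binary_codes)    # 3 bits are enough for 5 unique values
print(one_hot_codes)   # 5 bits per value, a single '1' in each code
```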
3.7 Variable Transformation
Sometimes we need to transform all values of an attribute, provided the transformation does not change the nature of the data. The transformations include:
1. Simple functions
Such functions include operations like x^k, |x|, log(x), \sqrt{x}, e^x, sin(x) and 1/x. The logarithm, square root and inverse are often used to make the data closer to a Gaussian distribution. The logarithm can also emphasise small values over large values: the difference between log(10^0) and log(10^3) is larger than the difference between log(10^8) and log(10^9).
2. Normalization and Standardization
The goal of standardization or normalization is to make an entire set of values have a particular property, for example, following a Gaussian distribution or lying in the range [0, 1]. This prevents attributes with larger values from having an overwhelming influence on the algorithm.
To normalize the data, we could transform it with the equation:

x' = \frac{x - \bar{x}}{s_x}    (3)

where \bar{x} is the mean and s_x is the standard deviation. If there are influential outliers, we could also use the median instead of the mean, and replace s_x with the absolute standard deviation

\sigma_A = \sum_{i=1}^{m} |x_i - \mu|

where m is the number of data entries and \mu can be either the median or the mean.
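A minimal sketch of a log transform and of the standardization in equation (3), on a hypothetical skewed attribute (the lognormal data and its parameters are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical positive, right-skewed attribute (e.g. income-like values).
x = rng.lognormal(mean=10.0, sigma=1.0, size=1_000)

# Log transform: compresses large values and emphasises differences
# among the small ones.
x_log = np.log(x)

# Standardization (equation 3): subtract the mean, divide by the std. dev.
x_std = (x - x.mean()) / x.std()

# Robust variant: median and absolute standard deviation (as defined above,
# the sum of absolute deviations from mu).
mu = np.median(x)
sigma_a = np.abs(x - mu).sum()
x_robust = (x - mu) / sigma_a

print(x_std.mean(), x_std.std())   # approximately 0 and 1
```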
4 Measurement of Proximity
Proximity refers to the similarity and dissimilarity between data objects; it measures how alike or different the data objects are. The following subsections discuss some of the proximity measures.
It should be noted that a dissimilarity score can be transformed into a similarity score. For example, a dissimilarity (0, 1, 10, 100) could be transformed by:
1. max(d) − d (or 1 − d when d lies in [0, 1]): (100, 99, 90, 0)
2. −d: (0, -1, -10, -100)
d(x, y) = \left( \sum_{i=1}^{k} |x_i - y_i|^r \right)^{1/r}    (5)
2. Mahalanobis distance: It is a generalization of the Euclidean distance that is able to handle correlated attributes, attributes with different ranges, and data whose distribution is roughly Gaussian:

mahalanobis(x, y) = \sqrt{(x - y)^{T} \Sigma^{-1} (x - y)}

where \Sigma^{-1} is the inverse of the covariance matrix of the data, and \Sigma[i][j] is the covariance between the i-th and j-th attributes.
3. Bregman Divergence: Bregman divergence is a measure of the distortion/loss/difference between an actual value y and an approximated value x. It is commonly used in algorithms like K-means clustering.
Given a strictly convex function \phi, the Bregman divergence can be calculated by:

D(x, y) = \phi(x) - \phi(y) - \langle \nabla\phi(y),\, x - y \rangle    (7)

where \nabla\phi(y) is the gradient of \phi evaluated at y, and \langle \nabla\phi(y), x - y \rangle is the inner product of \nabla\phi(y) and (x − y). For example, if x and y are 1-dimensional vectors and \phi(t) = t^2, then D(x, y) = (x − y)^2.
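A minimal sketch of these distance measures, assuming NumPy and SciPy; the data points, the inverse covariance matrix and the choice \phi(x) = ||x||^2 (whose Bregman divergence is the squared Euclidean distance) are illustrative assumptions:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis, minkowski

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.0])

# Minkowski distance (equation 5): r = 1 is Manhattan, r = 2 is Euclidean.
print(minkowski(x, y, p=1), minkowski(x, y, p=2))

# Mahalanobis distance: needs the inverse covariance matrix of the data.
data = np.random.default_rng(0).normal(size=(100, 3))
VI = np.linalg.inv(np.cov(data, rowvar=False))
print(mahalanobis(x, y, VI))

# Bregman divergence (equation 7) with phi(x) = ||x||^2, so grad phi(y) = 2y;
# in this special case D(x, y) reduces to the squared Euclidean distance.
def bregman_sq_norm(x, y):
    return float(x @ x - y @ y - (2 * y) @ (x - y))

print(bregman_sq_norm(x, y), float((x - y) @ (x - y)))   # the two values agree
```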
where x · y is the inner product of x and y, and ||x|| is the length of the vector, \sqrt{x \cdot x}.
This measure is the same as calculating the cosine of the angle between the two vectors. If cos(x, y) = 0, the two vectors are perpendicular; if cos(x, y) = 1, the two vectors point in the same direction. However, the magnitude is not taken into account. (Euclidean distance is a better choice if the difference in magnitude is important.)
4. Pearson's correlation: This score measures the linear relationship between two attributes.
corr(x, y) = \frac{covariance(x, y)}{s.d.(x) \cdot s.d.(y)} = \frac{s_{xy}}{s_x s_y}    (11)

where

s_{xy} = \frac{1}{n-1} \sum_{k=1}^{n} (x_k - \bar{x})(y_k - \bar{y})

s_x = \sqrt{\frac{1}{n-1} \sum_{k=1}^{n} (x_k - \bar{x})^2}    (12)

s_y = \sqrt{\frac{1}{n-1} \sum_{k=1}^{n} (y_k - \bar{y})^2}
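A minimal sketch of the cosine similarity and of equation (11), assuming NumPy; the two sparse vectors are made up for illustration:

```python
import numpy as np

x = np.array([3.0, 2.0, 0.0, 5.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0])
y = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 2.0])

# Cosine similarity: inner product divided by the product of vector lengths.
cos_sim = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Pearson's correlation (equation 11): covariance over the product of the
# standard deviations, both computed with the n - 1 denominator.
s_xy = np.cov(x, y, ddof=1)[0, 1]
corr = s_xy / (x.std(ddof=1) * y.std(ddof=1))

print(round(float(cos_sim), 3))                          # approximately 0.315
print(round(float(corr), 3), np.corrcoef(x, y)[0, 1])    # the two values agree
```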
Algorithm 1: Similarities for objects with heterogeneous attributes
1. For the k-th attribute, define an indicator variable \delta_k as:

\delta_k = \begin{cases} 0, & \text{if the attribute is asymmetric and both values are 0} \\ 1, & \text{otherwise} \end{cases}

2. For the k-th attribute, compute its similarity s_k(x, y) in the range [0, 1].
3. Compute the overall similarity of x and y:

similarity(x, y) = \frac{\sum_{k=1}^{n} \delta_k s_k(x, y)}{\sum_{k=1}^{n} \delta_k}    (13)
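A minimal sketch of Algorithm 1; the per-attribute similarity functions, the attribute types and the example objects are illustrative assumptions (numeric attributes are assumed to be already scaled to [0, 1]):

```python
def heterogeneous_similarity(x, y, attr_types):
    """Overall similarity of two objects (equation 13). attr_types[k] is
    'nominal', 'asymmetric_binary' or 'numeric'."""
    num, den = 0.0, 0.0
    for xk, yk, t in zip(x, y, attr_types):
        # Indicator delta_k: skip asymmetric attributes where both values are 0.
        if t == "asymmetric_binary" and xk == 0 and yk == 0:
            continue
        # Per-attribute similarity s_k(x, y) in [0, 1].
        if t == "numeric":
            s_k = 1.0 - abs(xk - yk)        # assumes values already in [0, 1]
        else:                               # nominal or asymmetric binary
            s_k = 1.0 if xk == yk else 0.0
        num += s_k
        den += 1.0
    return num / den if den > 0 else 0.0

# Hypothetical objects: (gender, owns_product_A, owns_product_B, scaled_age).
x = ("F", 1, 0, 0.30)
y = ("F", 0, 0, 0.45)
types = ("nominal", "asymmetric_binary", "asymmetric_binary", "numeric")
print(heterogeneous_similarity(x, y, types))   # attribute 3 is skipped (both 0)
```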