
Notes on Introduction to Data Mining:

Chapter 2: Data
Wu Ziqing
21st October 2020

1 Data Exploration
1.1 Types of Attribute
An attribute is a property or characteristic of an object that may vary, either
from one object to another or from one time to another.
There are two main types of data, Categorical/Qualitative data and Numeric/Quantitative data, which can be further divided into four subcategories. Table 1 shows the description, allowable operations, permissible transformations (those that do not change the meaning of the data), and examples of each type.
There are other important properties of an attribute:
1. Number of values it can take:
   • Discrete attributes have a finite or countably infinite set of values. Such attributes can be categorical, such as zip codes or ID numbers, or numeric, such as counts.
   • Continuous attributes have real-number values, such as temperature, height and weight. Continuous attributes are usually interval or ratio attributes.
2. Asymmetric attributes: For asymmetric attributes, only presence (a non-zero attribute value) is regarded as important. This type of attribute is particularly important for tasks like association analysis, where items that were not purchased carry little information.
It is always important to understand each attribute and its type, as this tells us which operations and transformations on the data are legitimate.

1.2 Characteristics of a Data Set


A data set can often be viewed as a collection of data objects. There are some general characteristics of a data set that matter to data mining techniques:

Nominal (Categorical)
  Description: The values of a nominal attribute are just different names used to distinguish one object from another.
  Operations: mode, entropy, contingency correlation, χ² test
  Transformation: any one-to-one mapping
  Examples: zip codes, employee ID, eye color, gender

Ordinal (Categorical)
  Description: The values of an ordinal attribute provide enough information to order objects.
  Operations: median, percentiles, rank correlation, run tests, sign tests
  Transformation: any order-preserving change of values
  Examples: hardness of minerals, {good, better, best}, grades, street numbers

Interval (Numeric)
  Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists.
  Operations: mean, standard deviation, Pearson's correlation, t and F tests
  Transformation: new value = a · old value + b
  Examples: calendar dates, temperature in Celsius or Fahrenheit

Ratio (Numeric)
  Description: For ratio attributes, both differences and ratios are meaningful. For example, a 40-year-old man is twice as old as a 20-year-old man.
  Operations: geometric mean, harmonic mean, percent variation
  Transformation: new value = a · old value
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current

Table 1: Summary of data types

1. Dimensionality: The dimensionality of a data set is the number of attributes that the objects in the data set possess. High-dimensional data can cause difficulties known as the Curse of Dimensionality.

2. Sparsity: Sparsity refers to the proportion of zero values in a data set. Sparsity might be an advantage or a problem, depending on the technique used.
3. Resolution: Resolution refers to the unit used to measure the values in a data set. For example, when recording the surface of the Earth, the resolution can be a few meters or thousands of kilometers. The properties and patterns in the data might depend on the level of resolution.

2 Data Cleaning
Raw data often has quality issues. The detection and correction of missing data, corrupted data and outliers is called Data Cleaning. Common data quality problems include:

1. Collection error: It refers to errors such as omitting data objects or attribute values, or inappropriately including a data object.
2. Measurement error: It refers to any problem resulting from the measurement process.
Measurement error leads to Noise, the random component of a measurement error, and Artifacts, the deterministic distortions of the data.
Measurement error can be gauged by Accuracy, the closeness of measurements to the true value of the quantity being measured. Accuracy depends on Precision, the closeness of repeated measurements (of the same quantity) to one another, and Bias, a systematic variation of measurements from the quantity being measured.
3. Outliers: Outliers are data objects or attribute values that are unusual with respect to the typical objects or values.
Unlike noise, outliers may sometimes be of interest; the anomalies may need to be analysed instead of being removed.
4. Missing values: Several strategies can be used to handle missing data (see the sketch at the end of this section):
   • Eliminate: Eliminate the data objects or attributes containing missing values.
   • Estimate: For continuous attributes, estimate the missing value using the average/median of the whole data set or of close neighbours. For discrete attributes, estimate using the most commonly occurring value.
   • Ignore: Some techniques can simply ignore the missing data. For example, when performing clustering, the distance between two objects can be calculated using only the attributes present in both objects.

5. Inconsistent error: Sometimes a value breaches a certain rule. For example, a zip code may not be valid for the object's city. It is sometimes possible to correct the data with additional or redundant information.
6. Duplicate Data: Duplicated data may contain the same information or refer to the same object. Before removing or combining duplicates, we need to make sure the two objects are exactly the same, not just similar or containing additional information.
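A minimal sketch of the Eliminate and Estimate strategies for missing values using pandas; the data frame and its column names are hypothetical.

import pandas as pd

# Hypothetical data set containing missing values
df = pd.DataFrame({
    "height": [1.62, None, 1.75, 1.80],
    "weight": [55.0, 62.0, None, 81.0],
    "eye_color": ["brown", "blue", None, "brown"],
})

# Eliminate: drop every data object (row) that contains a missing value
eliminated = df.dropna()

# Estimate: fill continuous attributes with the median and the
# discrete attribute with the most frequently occurring value
estimated = df.copy()
for col in ["height", "weight"]:
    estimated[col] = estimated[col].fillna(estimated[col].median())
estimated["eye_color"] = estimated["eye_color"].fillna(estimated["eye_color"].mode()[0])

print(eliminated)
print(estimated)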

3 Feature Engineering
3.1 Aggregation
Data of the same entity (user/company), time period (day/month), or geographic location (city/country) can be aggregated together.
Aggregation can reveal new patterns and make statistical properties more stable.
Quantitative attributes, such as price, are typically aggregated by taking a sum or an average. A qualitative attribute, such as item, can either be omitted or summarized as the set of all the items that were sold at that location.
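A small sketch of this kind of aggregation with pandas; the transaction records and column names below are invented for illustration.

import pandas as pd

# Hypothetical transaction records
sales = pd.DataFrame({
    "city":  ["Omaha", "Omaha", "Paris", "Paris"],
    "date":  ["2020-10-01", "2020-10-01", "2020-10-01", "2020-10-02"],
    "item":  ["book", "pen", "book", "lamp"],
    "price": [12.0, 2.5, 14.0, 30.0],
})

# Aggregate by location and day: sum the quantitative attribute and
# summarize the qualitative attribute as the set of items sold
aggregated = sales.groupby(["city", "date"]).agg(
    total_price=("price", "sum"),
    items_sold=("item", lambda s: set(s)),
)
print(aggregated)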

3.2 Sampling
Sampling is a commonly used approach for selecting a subset of the data objects to be analyzed. The sampled data is smaller and thus more efficient to process.
Sampling should produce samples that are representative of the original data set. Some common sampling strategies are:

1. Simple random sampling: There is an equal probability of selecting any particular item. Sampling can be done with or without replacement.
2. Stratified sampling: For data with different groups of objects, stratified sampling can be used to adequately represent rare classes. Equal numbers of objects are drawn from each group even though the groups are of different sizes.

A proper sample size should be determined to prevent too much information loss. Progressive sampling gradually increases the sample size until the predictive model's performance levels off, which helps determine the smallest adequate sample size. The sketch below illustrates the two sampling strategies.
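A minimal sketch of simple random and stratified sampling using pandas; the data set with a rare class is invented, and stratified sampling is approximated with pandas' GroupBy.sample.

import pandas as pd

# Hypothetical data set with a rare class
df = pd.DataFrame({
    "value": range(1000),
    "label": ["common"] * 950 + ["rare"] * 50,
})

# Simple random sampling without replacement: every object is equally likely
simple = df.sample(n=100, replace=False, random_state=0)

# Stratified sampling: draw an equal number of objects from each group,
# so the rare class is adequately represented
stratified = df.groupby("label", group_keys=False).sample(n=50, random_state=0)

print(simple["label"].value_counts())
print(stratified["label"].value_counts())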

3.3 Dimensionality Reduction


Dimensionality reduction can eliminate irrelevant features, reduce noise, and mitigate the curse of dimensionality.

Principal Component Analysis (PCA) is a linear algebra technique for continuous attributes that projects high-dimensional data onto a lower-dimensional space spanned by the directions of largest variance.
TODO: add notes for PCA
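Until those notes are added, here is a minimal sketch of the PCA projection using scikit-learn; the random data and the choice of two components are arbitrary assumptions.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical 5-dimensional continuous data with 100 objects
X = rng.normal(size=(100, 5))

# Project the data onto the 2 directions of largest variance
pca = PCA(n_components=2)
X_low = pca.fit_transform(X)

print(X_low.shape)                    # (100, 2)
print(pca.explained_variance_ratio_)  # fraction of variance kept per component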

3.4 Feature Subset Selection


Selecting a subset of the full list of features can also reduce dimensionality. The goal is to remove Redundant features, which duplicate much or all of the information contained in one or more other attributes, and Irrelevant features, which contain almost no useful information for the data mining task at hand.
The ideal approach is to try all possible subsets and select the best one, but this is impractical in terms of efficiency. Several strategies can be used instead:
1. Embedded approaches: Algorithms like decision tree classifiers automatically select feature subsets as part of their training.
2. Filter approaches: Select the subset before the data mining algorithm is run, using a measure independent of that algorithm. For example, select sets of attributes whose pairwise correlation is as low as possible.
3. Wrapper approaches: Use the data mining algorithm as a black box to evaluate candidate subsets, typically without enumerating all possible subsets.
For the Filter and Wrapper approaches, the selection process can be generalized as in Figure 1:

Figure 1: Flowchart of the feature subset selection process

Each step in the process is explained below:

1. The Search Strategy controls the generation of candidate feature subsets. For example, it may remove features starting from the full feature set, or add features starting from an empty set.
2. The Evaluation process decides whether the current subset is a good one. For Filter approaches, the evaluation is based on a measure independent of the target algorithm, while for Wrapper approaches, the evaluation uses the target algorithm itself.
3. After the subset is selected, a Validation step can compare its results against those obtained with the original full feature set, or with subsets chosen by other selection algorithms.

Sometimes the selection is not done by adding or removing features, but by assigning a Feature Weight to each feature to indicate its importance. A small sketch of a filter approach follows.
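A minimal sketch of a filter approach: greedily keep attributes whose pairwise correlation with the already-kept attributes stays below a threshold. The data, the threshold, and the greedy order are illustrative assumptions rather than a standard algorithm.

import numpy as np
import pandas as pd

def low_correlation_filter(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """Keep columns whose absolute correlation with every
    previously kept column is below the threshold."""
    corr = df.corr().abs()
    kept = []
    for col in df.columns:
        if all(corr.loc[col, k] < threshold for k in kept):
            kept.append(col)
    return kept

# Hypothetical data: x2 is almost a copy of x1 (redundant), x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 + 0.01 * rng.normal(size=200),
    "x3": rng.normal(size=200),
})
print(low_correlation_filter(df))  # expected: ['x1', 'x3']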

3.5 Feature Creation


Sometimes we need to create new features based on existing features. Some
general methods are shown below:

1. Feature Extraction: Higher-level features can be extracted from the raw data to reveal hidden information. For example, basic geometric shapes can be extracted from photographs to help identify the objects in them. Feature extraction is highly dependent on domain knowledge.
2. Mapping data to a New Space: Projecting the data into a new space can also reveal important information. For example, applying a Fourier Transform to time series data can detect periodic changes, trends and noise.
3. Feature Construction: Constructing new features from the original features can make the information more suitable for the algorithm. For example, the original data may have 'volume' and 'mass' attributes; constructing a 'density' attribute may help classify the type of material better.

3.6 Discretization and Binarization


1. Discretization: Discretization refers to transforming a numeric attribute into a categorical attribute. It is needed for algorithms that require categorical attributes. There are two types of discretization:

   • Unsupervised discretization: When the data contains no class information, unsupervised discretization can be used. Numeric data can be binned by 1) Equal Width (for example, age 0-20, 20-40, etc.), or 2) Equal Frequency (for example, each bin contains 20% of the data).
   • Supervised discretization: When the data contains class information, we can perform supervised discretization by taking the class labels into account.
   To achieve better performance for the algorithm, we want to split the attribute so that each interval contains data from only one class. For example, a good interval for classifying income group could be age 0-16, where everyone has 'low income'. In other words, we want to make each interval as pure as possible, i.e., to minimize the entropy.
   Definition of entropy
   Entropy measures the purity of an interval. If the interval contains data from only one class, the entropy is 0. If every class appears in the same proportion, the entropy is maximal (1 in the two-class case), since we then have minimal knowledge about which class the interval represents.
   Assume the data set has k classes and the attribute is discretized into n intervals. The entropy e_i of the ith interval is:

   e_i = -\sum_{j=1}^{k} p_{ij} \log_2 p_{ij}    (1)

   where p_ij = m_ij / m_i is the probability (proportion) of class j in interval i, m_i is the number of data objects in interval i, and m_ij is the number of class-j objects in interval i.
   The total entropy of the discretization is:

   e = \sum_{i=1}^{n} w_i e_i    (2)

   where w_i = m_i / m is the proportion of the data in interval i (m is the total number of objects).


   How to discretize an attribute
   A simple way to discretize an attribute is to split the data into two intervals at a time. We first find the split point that minimizes the weighted entropy of the two resulting intervals. We then repeat the split within the interval that has the highest entropy, until the required number of intervals is reached. (A small code sketch of this procedure appears at the end of this subsection.)

2. Binarization: Binarization refers to transforming continuous or discrete attributes into one or more binary attributes. Some algorithms require such binary encodings. Two approaches are commonly used:

   • Find the unique values and encode each value with a unique n-bit binary number. The total number of bits needed for m unique values is n = ⌈log2(m)⌉.
   • Use one bit per unique value: exactly one bit (the one corresponding to the value present) is set to 1 and the rest are 0. This One-Hot Encoding is suitable for algorithms that require asymmetric binary attributes and avoids introducing spurious correlation between bits. It does, however, require m bits for m unique values.
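A minimal sketch of the entropy-based splitting procedure described above, written in plain numpy; the ages, the two income classes, and the single binary split are illustrative assumptions.

import numpy as np

def entropy(labels: np.ndarray) -> float:
    """Entropy of a set of class labels (0 for a pure interval)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_split(values: np.ndarray, labels: np.ndarray) -> float:
    """Find the split point that minimizes the weighted entropy
    of the two resulting intervals (equation (2) with n = 2)."""
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    best_point, best_e = None, np.inf
    for i in range(1, len(values)):
        left, right = labels[:i], labels[i:]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if e < best_e:
            best_e, best_point = e, (values[i - 1] + values[i]) / 2
    return best_point

# Hypothetical ages and income classes: everyone under ~16 is 'low' income
ages = np.array([3, 8, 12, 15, 18, 25, 40, 55, 60, 70])
income = np.array(["low", "low", "low", "low", "high",
                   "high", "high", "high", "high", "high"])
print(best_split(ages, income))  # 16.5, a pure split between the two classes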

3.7 Variable Transformation
Sometimes we need to transform all values of an attribute, provided the transformation does not change the nature of the data. Common transformations include:

1. Simple functions
Such functions include operations like x^k, |x|, log(x), √x, e^x, sin(x) and 1/x. The logarithm, square root and inverse are often used to make the data more Gaussian-like. The logarithm also compresses large values relative to small ones: the difference between log(10) and log(10^3) is larger than the difference between log(10^8) and log(10^9).
2. Normalization and Standardization
The goal of standardization or normalization is to make an entire set of values have a particular property, for example, following a Gaussian distribution or falling in the range [0, 1]. This prevents attributes with larger values from having an overwhelming influence on the algorithm.
To standardize the data, we can transform it with the equation:

x' = (x − x̄)/s_x    (3)

where x̄ is the mean and s_x is the standard deviation. If there are influential outliers, we can use the median instead of the mean, and replace s_x with the absolute standard deviation

σ_A = \sum_{i=1}^{m} |x_i − µ|

where m is the number of data objects and µ can be either the median or the mean. (A small sketch follows.)
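A minimal sketch of equation (3) and its robust variant using numpy; the sample values, including the deliberate outlier, are arbitrary.

import numpy as np

x = np.array([2.0, 4.0, 4.0, 6.0, 100.0])  # 100 is an influential outlier

# Standard z-score: x' = (x - mean) / standard deviation
z = (x - x.mean()) / x.std()

# Robust variant: use the median and the absolute deviation from the median
mu = np.median(x)
sigma_a = np.abs(x - mu).sum()
z_robust = (x - mu) / sigma_a

print(z)
print(z_robust)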

4 Measurement of Proximity
Proximity refers to the similarity or dissimilarity between data objects; it measures how alike or different the data objects are. The following subsections discuss some common proximity measures.

4.1 Similarity and Dissimilarity between Simple Attributes
Table 2 shows some common proximity measures for a single attribute.

Nominal
  Dissimilarity: d = 0 if x = y; d = 1 if x ≠ y
  Similarity:    s = 1 if x = y; s = 0 if x ≠ y

Ordinal
  Dissimilarity: d = |x − y| / (n − 1)  (values mapped to the integers 0 to n − 1)
  Similarity:    s = 1 − d

Interval or Ratio
  Dissimilarity: d = |x − y|
  Similarity:    s = −d,  s = 1/(1 + d),  s = e^(−d),  s = 1 − (d − min_d)/(max_d − min_d)

Table 2: Similarity and dissimilarity measures for simple attributes

It should be noted that a dissimilarity score can be transformed into a similarity score. For the dissimilarities (0, 1, 10, 100), possible transformations are:
1. 1 − d (for dissimilarities in [0, 1]) or max_d − d: (100, 99, 90, 0)
2. −d: (0, −1, −10, −100)
3. e^(−d): (1.00, 0.37, 0.00, 0.00)
4. 1/(1 + d): (1, 0.5, 0.09, 0.01)
5. 1 − (d − min_d)/(max_d − min_d): (1.00, 0.99, 0.90, 0.00)
Transformations 3-5 also produce scores in the range [0, 1].

4.2 Dissimilarity for data objects


A dissimilarity measure, such as the Euclidean distance, has some important properties:

• Positivity: d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 only if x = y.
• Symmetry: d(x, y) = d(y, x) for all x and y.
• Triangle Inequality: d(x, z) ≤ d(x, y) + d(y, z) for all x, y and z.

Measures that satisfy all three properties are known as Metrics.
Some common dissimilarity measures are shown below (a code sketch follows this list):
1. Euclidean distance: This value represents the distance between two vectors:

   d(x, y) = \sqrt{\sum_{i=1}^{k} (x_i − y_i)^2}    (4)

   where k is the number of attributes. It can be generalized to the Minkowski distance:

   d(x, y) = \left( \sum_{i=1}^{k} |x_i − y_i|^r \right)^{1/r}    (5)

   where r is a parameter:

   • r = 1: City block / Manhattan / taxicab distance (L1 norm)
   • r = 2: Euclidean distance (L2 norm)
   • r = ∞: Supremum distance (Lmax or L∞ norm)

2. Mahalanobis distance: This is a generalization of the Euclidean distance that can handle correlated attributes and attributes with different ranges; it is most appropriate when the data distribution is approximately Gaussian:

   mahalanobis(x, y) = (x − y) Σ^{−1} (x − y)^T    (6)

   where Σ^{−1} is the inverse of the covariance matrix of the data, and Σ[i][j] is the covariance between the ith and jth attributes.
3. Bregman Divergence: The Bregman Divergence measures the distortion/loss/difference between an actual value y and an approximated value x. It is commonly used in algorithms like K-means clustering. Given a strictly convex function φ, the Bregman Divergence is:

   D(x, y) = φ(x) − φ(y) − ⟨∇φ(y), (x − y)⟩    (7)

   where ∇φ(y) is the gradient of φ evaluated at y, and ⟨∇φ(y), (x − y)⟩ is the inner product of ∇φ(y) and (x − y). For example, if φ(x) = ||x||², then D(x, y) = ||x − y||²; for 1-dimensional x and y this reduces to (x − y)².
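A small sketch of the Minkowski and Mahalanobis distances in plain numpy; the example vectors and the correlated sample data are arbitrary assumptions.

import numpy as np

def minkowski(x: np.ndarray, y: np.ndarray, r: float) -> float:
    """Minkowski distance (equation (5)); r = 1 is Manhattan, r = 2 is Euclidean."""
    return float((np.abs(x - y) ** r).sum() ** (1.0 / r))

def mahalanobis(x: np.ndarray, y: np.ndarray, data: np.ndarray) -> float:
    """Mahalanobis distance as in equation (6), using the covariance
    matrix of `data` (rows are objects, columns are attributes)."""
    cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
    diff = x - y
    return float(diff @ cov_inv @ diff)

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])
print(minkowski(x, y, 1))     # 7.0 (L1 norm)
print(minkowski(x, y, 2))     # 5.0 (L2 norm)
print(np.max(np.abs(x - y)))  # 4.0 (supremum distance, the r -> infinity limit)

# Correlated 2-dimensional sample data for the Mahalanobis example
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2)) @ np.array([[1.0, 0.8], [0.0, 0.6]])
print(mahalanobis(x, y, data))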

4.3 Similarity for data objects


Similarity measures usually have two properties:
• Positivity: s(x, y) ≥ 0 for all x and y, and s(x, x) = 1.
• Symmetry: s(x, y) = s(y, x) for all x and y.
There are many measures for different types of similarity (a code sketch follows this list):
1. Simple Matching Coefficient: This measure can be used for calculating binary similarity:

   SMC = (number of matching attribute values) / (number of attributes) = Count(x = y) / len(x)    (8)

2. Jaccard Coefficient: This measure is also for binary similarity. It is useful for sparse data where the attributes are asymmetric, i.e., a 0 does not provide meaningful information. It prevents two data objects from appearing very similar merely because the majority of their attributes are both 0:

   J = (number of matching presences) / (number of attributes that are not both 0) = Count(x = y = 1) / Count(x + y ≠ 0)    (9)
3. Cosine Similarity: This measure works similarly to the Jaccard Coefficient in that it also ignores 0-0 matches, but it can handle non-binary values as well:

   cos(x, y) = (x · y) / (||x|| · ||y||) = (x / ||x||) · (y / ||y||)    (10)

   where x · y is the inner product of x and y, and ||x|| = √(x · x) is the length of the vector.
   This measure computes the cosine of the angle between the two vectors. If cos(x, y) = 0, the two vectors are perpendicular; if cos(x, y) = 1, the two vectors point in the same direction. The magnitude is not taken into account. (Euclidean distance is a better choice if the magnitude difference is important.)
4. Pearson's correlation: This score measures the linear relationship between two attributes:

   corr(x, y) = covariance(x, y) / (s.d.(x) · s.d.(y)) = s_xy / (s_x s_y)    (11)

   where

   s_xy = \frac{1}{n−1} \sum_{k=1}^{n} (x_k − x̄)(y_k − ȳ)

   s_x = \sqrt{\frac{1}{n−1} \sum_{k=1}^{n} (x_k − x̄)^2}    (12)

   s_y = \sqrt{\frac{1}{n−1} \sum_{k=1}^{n} (y_k − ȳ)^2}

   Pearson's correlation has a range of [−1, 1]. A correlation of 1 (−1) means that x and y have a perfect positive (negative) linear relationship. If the correlation is 0, x and y have no linear relationship; however, a non-linear relationship may still exist.
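A minimal sketch of the similarity measures above using numpy; the binary and continuous example vectors are invented.

import numpy as np

def smc(x: np.ndarray, y: np.ndarray) -> float:
    """Simple Matching Coefficient (equation (8)) for binary vectors."""
    return float((x == y).sum() / len(x))

def jaccard(x: np.ndarray, y: np.ndarray) -> float:
    """Jaccard coefficient (equation (9)): ignore attributes where both are 0."""
    both_present = np.logical_and(x == 1, y == 1).sum()
    not_both_zero = np.logical_or(x != 0, y != 0).sum()
    return float(both_present / not_both_zero)

def cosine(x: np.ndarray, y: np.ndarray) -> float:
    """Cosine similarity (equation (10)) for possibly non-binary vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

x = np.array([1, 0, 0, 0, 0, 0, 1, 0, 0, 1])
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])
print(smc(x, y))      # 0.9: the many 0-0 matches inflate the score
print(jaccard(x, y))  # 0.667: only non-zero attributes are considered
print(cosine(x, y))   # 0.816

# Pearson's correlation (equation (11)) between two continuous attributes
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.1, 3.9, 6.2, 7.8])
print(np.corrcoef(a, b)[0, 1])  # close to 1: strong positive linear relationship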

4.4 Proximity for data with Heterogeneous Attributes


If the attributes of two data objects are of different types, we need to combine the similarities computed for the individual attributes.
However, if an attribute is asymmetric, its similarity should not be taken into account when both objects have the value 0. Otherwise, as with the Simple Matching Coefficient, a large number of 0-0 matches on asymmetric attributes would make all objects look similar when they are not.
A general algorithm for handling objects with heterogeneous attributes is shown in Algorithm 1.

Algorithm 1: Similarity for objects with heterogeneous attributes
1. For the kth attribute, define an indicator variable δ_k:

   δ_k = 0 if the attribute is asymmetric and both values are 0; δ_k = 1 otherwise.

2. For the kth attribute, compute a similarity s_k(x, y) in the range [0, 1].
3. Compute the overall similarity of x and y:

   similarity(x, y) = \frac{\sum_{k=1}^{n} δ_k s_k(x, y)}{\sum_{k=1}^{n} δ_k}    (13)
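A minimal sketch of Algorithm 1 in Python; the per-attribute similarity functions and the example objects are illustrative assumptions.

def heterogeneous_similarity(x, y, attr_sims, asymmetric):
    """Combine per-attribute similarities as in Algorithm 1.

    x, y       : lists of attribute values for the two objects
    attr_sims  : one similarity function per attribute, each returning a value in [0, 1]
    asymmetric : booleans marking which attributes are asymmetric
    """
    num, den = 0.0, 0.0
    for xi, yi, sim, asym in zip(x, y, attr_sims, asymmetric):
        # delta_k = 0 when the attribute is asymmetric and both values are 0
        delta = 0 if (asym and xi == 0 and yi == 0) else 1
        num += delta * sim(xi, yi)
        den += delta
    return num / den if den > 0 else 0.0

# Hypothetical objects with a nominal, a ratio, and an asymmetric binary attribute
nominal_sim = lambda a, b: 1.0 if a == b else 0.0
ratio_sim   = lambda a, b: 1.0 / (1.0 + abs(a - b))
binary_sim  = lambda a, b: 1.0 if a == b else 0.0

x = ["red", 3.0, 0]
y = ["red", 5.0, 0]
print(heterogeneous_similarity(x, y, [nominal_sim, ratio_sim, binary_sim],
                               asymmetric=[False, False, True]))
# The asymmetric 0-0 match is ignored: (1.0 + 1/3) / 2 = 0.667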
