CS822 Data Mining – Week 3
Instructor: Dr. Muhammad Tahir
Data Preprocessing
Data Quality: Why Preprocess the
Data?
• Preprocessing improves data quality, including:
• Accuracy: correctness of the data. Incorrect attribute values may be caused by faulty instruments, human error, or intentional distortion.
• Completeness: all required information is available. Incomplete data can occur for values that are not always available, such as customer information in sales transaction data.
• Consistency: data from all sources agree. For example, two different users may give very different assessments or live in different time zones.
• Timeliness: data is up to date. For example, some store branches may be delayed in syncing their sales data.
• Believability: reflects how much the data are trusted by users.
• Interpretability: reflects how easily the data are understood.
Major Tasks in Data Preprocessing
• Data cleaning routines work to “clean” the data by
filling in missing values, smoothing noisy data,
identifying or removing outliers, and resolving
inconsistencies.
• Data integration is the process of integrating data from
multiple sources (databases, data cubes, or files).
• Data reduction obtains a reduced representation of
the data set that is much smaller in volume but
produces the same (or almost the same) mining results.
• Data transformation converts the data into
appropriate forms for better mining results.
Data Preprocessing Overview
Data Cleaning
• Data in the Real World Is Dirty: lots of potentially incorrect data, e.g.,
due to faulty instruments, human or computer error, or transmission errors
• incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
• noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
• inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
Incomplete Data (Missing Values)
• Data is not always available
• E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of
entry
• history or changes of the data were not registered
• Missing data may need to be inferred
How to Handle Missing Values?
• Ignore the sample: not effective when the percentage of ignored
samples is too high.
• Fill in the missing value manually: tedious + infeasible?
• Fill in it automatically with
• a global constant : e.g., “unknown”, a new class?!
• the attribute mean for all data
• the attribute mean for all samples belonging to the same
class
• the most probable value: based on some statistical
models.
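A minimal Python sketch of the automatic fill-in strategies above, assuming pandas is available; the customer table and its segment/income columns are made up for illustration.

```python
import pandas as pd

# Hypothetical customer data with missing income values
df = pd.DataFrame({
    "segment": ["A", "A", "B", "B", "B"],
    "income":  [50000, None, 42000, None, 46000],
})

# Fill with a global constant
filled_const = df["income"].fillna(-1)

# Fill with the attribute mean over all samples
filled_mean = df["income"].fillna(df["income"].mean())

# Fill with the attribute mean of samples in the same class (segment)
filled_class_mean = df.groupby("segment")["income"].transform(
    lambda s: s.fillna(s.mean())
)

print(filled_mean.tolist())        # missing values replaced by 46000.0
print(filled_class_mean.tolist())  # missing A-row gets 50000.0, missing B-row gets 44000.0
```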
Noisy Data
• Noise: random error in data
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention
• Other data problems which require data cleaning
• duplicate records
• incomplete data
• inconsistent data
How to Handle Noisy Data?
• Binning
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Regression
• smooth by fitting the data into regression functions
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values and check by human (e.g., deal with
possible outliers)
Data Cleaning as a Process
• Data discrepancy detection
• Use metadata (e.g., domain, range, dependency, distribution)
• Check field overloading i.e. storing multiple types of information
within a single data field (column) in a dataset
• Check
• uniqueness rule, i.e., each value of a given attribute must be different
from all other values of that attribute
• consecutive rule, i.e., there can be no missing values between the lowest
and highest values of the attribute, and all values must be unique
• null rule, i.e., specifies whether a particular field is allowed to have null
(empty) values or not
Data Cleaning as a Process
• Use Commercial Tools for
• Data scrubbing (also called data cleansing) is the process of:
• Identifying and correcting inaccurate, incomplete, or irrelevant data.
• Removing duplicate records.
• Fixing structural errors (like typos, incorrect formatting, or wrongly
placed data).
• Handling missing values by filling them with appropriate substitutes or
removing incomplete records.
• Ensuring data consistency across datasets.
Data Cleaning as a Process
• Data migration and integration
• Data migration tools are software solutions used to move data from one
system, storage, or format to another. These tools play a crucial role in data
management, particularly when organizations upgrade systems, consolidate
databases, or move to cloud platforms.
• ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
• Integration of the two processes
• refers to the combination of data preprocessing (data cleaning, transformation,
integration, etc.) with the actual data mining process (pattern discovery,
modeling, analysis).
Data Integration
Data Integration
• Combining data from multiple sources into a single,
unified view (a coherent data store).
• In data mining, data often comes from different
databases, files, or systems.
• To analyze it effectively, you need all data combined in
a single format/location.
• Example:
Combining customer data from a CRM system with sales
data from an ERP system into a data warehouse.
Schema Integration
• The process of merging schemas (data structures) from
different sources into a consistent schema.
• Challenge:
• Different systems may label or structure the same data differently.
• Example:
• One system has A.cust-id, while another uses B.cust-#.
• Both refer to the same concept — customer ID — so they must be
mapped together.
• Key Step:
• Integrating metadata (information about data) to match up
fields/attributes correctly across sources.
Entity Identification Problem
• The challenge of identifying that records from
different sources actually refer to the same real-
world entity (person, company, product, etc.).
• Example:
• One database lists "Bill Clinton".
• Another lists "William Clinton".
• Both refer to the same person, but they need to be
recognized as such.
• Why it’s important:
• If not resolved, duplicate records can skew analysis or cause
incorrect results.
Detecting and Resolving Data Value
Conflicts
• Even when the same entity is correctly identified, different sources
may report different attribute values for the same data point.
• Example:
• A person’s weight in one system is recorded as 70 kg, while another system
says 154 lbs.
• These are equivalent, but they are in different scales (metric vs imperial).
• Other conflicts can happen with formats (DD/MM/YYYY vs MM/DD/YYYY) or
naming styles ("John Smith" vs "Smith, John").
• Key Task:
• Detect these differences and resolve them into a single, accurate value.
Handling Redundancy in Data
Integration
Handling Redundancy in Data
Integration
• Redundant Data in Data Integration
• When combining data from multiple databases or sources,
the same information may appear in multiple places, often
in different formats or levels of detail.
• This creates redundancies, which can lead to data
duplication, inconsistencies, and wasted storage space.
• Problems
• Object Identification
• Derivable Data
Handling Redundancy in Data
Integration
• Object Identification
• The same object (entity) or attribute (field) may have different
names across different databases.
• Example: In Database A, a customer’s ID is stored as cust_id.
• In Database B, the same field is called customer_number.
• These need to be identified as the same attribute to correctly
integrate the data.
• Without matching equivalent fields across sources, data
integration will fail, and redundant data will accumulate.
Handling Redundancy in Data
Integration
• Derivable Data
• Sometimes, an attribute in one table can be derived
from attributes in another table.
• Example:
• A table might store quarterly revenue for a company.
• Another table stores annual revenue, which is simply the
sum of the quarterly revenues.
• Key Point:
• These derived attributes can introduce redundancy
because they can be computed instead of being stored
directly.
Handling Redundancy in Data
Integration
• Detecting Redundant Attributes
• Redundant attributes (like annual revenue and sum of
quarterly revenue) can sometimes be detected using:
• Correlation Analysis:
• Measures the statistical relationship between two
attributes.
• High correlation suggests the attributes may be redundant
or related in some way.
• Covariance Analysis:
• Measures how two attributes vary together.
• Helps to detect attributes that provide similar
information.
Handling Redundancy in Data
Integration
• Careful integration of the data from multiple sources
may help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality
Correlation Analysis (Nominal Data)
• Correlation analysis is used on attributes (features) to determine their
dependence on each other. For nominal data, this is done with the chi-square test:

    χ² = Σ (Observed − Expected)² / Expected

• The larger the χ² value, the more likely the variables are related
• The cells that contribute the most to the χ² value are those whose actual
count is very different from the expected count
• Correlation does not imply causality
• # of hospitals and # of car-thefts in a city are correlated
• Both are causally linked to a third variable: population
Chi-Square Calculation: An Example
                               Play chess   Not play chess   Sum (row)
    Like science fiction         250 (90)       200 (360)        450
    Not like science fiction      50 (210)     1000 (840)       1050
    Sum (col.)                   300            1200            1500

• χ² (chi-square) calculation (numbers in parentheses are expected counts,
calculated based on the data distribution in the two categories):

    χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 ≈ 507.93
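A small Python sketch (not part of the slides) that reproduces this χ² calculation from the observed contingency table:

```python
# Observed counts: rows = {likes sci-fi, does not}, cols = {plays chess, does not}
observed = [[250, 200],
            [50, 1000]]

row_sums = [sum(row) for row in observed]        # [450, 1050]
col_sums = [sum(col) for col in zip(*observed)]  # [300, 1200]
total = sum(row_sums)                            # 1500

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_sums[i] * col_sums[j] / total    # expected count for cell (i, j)
        chi2 += (o - e) ** 2 / e

print(round(chi2, 1))   # 507.9
```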
Covariance (Numeric Data)
• Covariance is similar to correlation
• Measures how two variables change together:

    Cov(A, B) = E[(A − E(A))(B − E(B))] = E(A·B) − E(A)·E(B)

• Correlation coefficient:
• Standardized version of covariance: Corr(A, B) = Cov(A, B) / (σ_A · σ_B)
• Scales the relationship to always fall between −1 and +1
• Positive covariance: if Cov(A, B) > 0, then A and B both tend to be larger
than their expected values.
• Negative covariance: if Cov(A, B) < 0, then when A is larger than its expected
value, B is likely to be smaller than its expected value.
• Independence: if A and B are independent, then Cov(A, B) = 0, but the converse is not true:
• Some pairs of random variables may have a covariance of 0 but are
not independent. Only under some additional assumptions (e.g., the
data follow multivariate normal distributions) does a covariance of 0
imply independence.
Covariance versus Correlation
• Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends,
will their prices rise or fall together?
• E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
• E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
• Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
• Thus, A and B rise together since Cov(A, B) > 0.
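The same calculation can be checked with a few lines of Python (a sketch, using the population covariance formula Cov(A,B) = E(A·B) − E(A)·E(B) as on the slide):

```python
# Stock prices for one week (from the slide)
A = [2, 3, 5, 4, 6]
B = [5, 8, 10, 11, 14]
n = len(A)

mean_A = sum(A) / n                                   # 4.0
mean_B = sum(B) / n                                   # 9.6
cov_AB = sum(a * b for a, b in zip(A, B)) / n - mean_A * mean_B

print(round(cov_AB, 2))   # 4.0 -> positive, so A and B tend to rise together
```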
Data Reduction
Data Reduction Strategies
• Data reduction: Obtain a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the same) analytical results
• Why data reduction? — A database/data warehouse may store terabytes of data.
Complex data analysis may take a very long time to run on the complete dataset.
• Data reduction strategies
• Dimensionality reduction, e.g., remove unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
• Numerosity reduction (some simply call it: Data Reduction)
• Regression and Log-Linear Models
• Histograms, clustering, sampling
• Data cube aggregation
• Data compression (lossy/lossless)
Data Reduction 1: Dimensionality
Reduction
• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which is critical to clustering and
outlier analysis, becomes less meaningful
• The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data mining
• Allow easier visualization
• Dimensionality reduction techniques
• Wavelet transforms
• Principal Component Analysis
• Supervised and nonlinear techniques (e.g., feature selection)
Attribute Subset Selection
• Another way to reduce dimensionality of data
• Redundant attributes
• Duplicate much or all of the information contained in one or
more other attributes
• E.g., purchase price of a product and the amount of sales
tax paid
• Irrelevant attributes
• Contain no information that is useful for the data mining
task at hand
• E.g., students' ID is often irrelevant to the task of predicting
students' GPA
Heuristic Search in Attribute
Selection
• There are 2^d possible attribute combinations of d attributes
• Typical heuristic attribute selection methods:
• Best single attribute under the attribute independence assumption:
choose by significance tests
• Best step-wise feature selection:
• The best single attribute is picked first
• Then the next best attribute conditioned on the first is added, and so on
• Step-wise attribute elimination:
• Repeatedly eliminate the worst attribute
• Best combined attribute selection and elimination
• Optimal branch and bound:
• Use attribute elimination and backtracking
Attribute Creation (Feature
Generation)
• Create new attributes (features) that can capture the important
information in a data set more effectively than the original ones
• Three general methodologies
• Attribute extraction
• Domain-specific
• Mapping data to new space (see: data reduction)
• E.g., Fourier transformation, wavelet transformation,
manifold approaches (not covered)
• Attribute construction
• Combining features (see: discriminative frequent patterns in
Chapter 7)
• Data discretization
Data Reduction 2: Numerosity
Reduction
• Reduce data volume by choosing alternative, smaller
forms of data representation
• Parametric methods
• Non-parametric methods
Parametric methods
Parametric methods
• Parametric methods for data reduction involve
summarizing a large dataset using a fixed number of
parameters.
• These methods assume that the data follows a known
statistical distribution and reduce the dataset while
retaining important patterns.
Key Parametric Methods for Data
Reduction
1. Regression Models (Linear Regression)
• Concept: Instead of storing an entire dataset, a mathematical
function (model) is used to approximate the relationship between
variables.
• Example:
• Suppose we have 1000 data points showing how house prices depend
on their size.
• A linear regression model can approximate this relationship using the
formula: Price=50000+200×Size
• Instead of storing all 1000 points, we store only the equation and
predict prices based on size.
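A hedged Python sketch of this idea, using numpy.polyfit on synthetic house-price data (the data values and the noise level are made up); only the two fitted coefficients would need to be stored:

```python
import numpy as np

# Hypothetical raw data: 1000 (size, price) points roughly following
# price = 50000 + 200 * size, plus random noise
rng = np.random.default_rng(0)
size = rng.uniform(50, 300, 1000)
price = 50000 + 200 * size + rng.normal(0, 5000, 1000)

# Fit a line; after this, only two parameters need to be stored
slope, intercept = np.polyfit(size, price, 1)
print(intercept, slope)                      # roughly 50000 and 200

# Reconstruct (approximate) prices from the model instead of the raw data
predicted = intercept + slope * np.array([100, 200])
print(predicted)
```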
Key Parametric Methods for Data
Reduction
2. Principal Component Analysis (PCA)
• Concept: PCA reduces the number of dimensions while
retaining the most important information.
• Example:
• A dataset with 10 attributes (e.g., height, weight, age,
income, education, etc.) can be reduced to 2 or 3 principal
components that explain most of the variance.
• This reduces storage and speeds up computation while
preserving essential trends.
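A possible sketch using scikit-learn's PCA, assuming that library is available; the 10-attribute dataset is synthetic and only illustrates keeping the components that explain most of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset: 500 samples with 10 correlated attributes
rng = np.random.default_rng(42)
base = rng.normal(size=(500, 3))                       # 3 underlying factors
X = base @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(500, 10))

# Keep only enough principal components to explain ~95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)          # (500, 10) -> (500, k), k <= 10
print(pca.explained_variance_ratio_.round(3))  # variance captured per component
```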
Key Parametric Methods for Data
Reduction
3. Logarithmic Data Reduction
• Concept: Transforming data using logarithmic functions
to compress large values.
• Example:
• Instead of storing a dataset of 1 million records with large
numbers (e.g., 1000000, 500000, etc.), we can store their
logarithmic values: log(1000000)=6, log(500000)=5.7
• This helps in compressing data while keeping relative
differences intact.
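A tiny Python illustration of the log transform (the values are chosen for illustration, not from the slides):

```python
import math

values = [1_000_000, 500_000, 10_000, 250]

# Store log10 of each value instead of the raw value
log_values = [round(math.log10(v), 2) for v in values]
print(log_values)          # [6.0, 5.7, 4.0, 2.4]

# Relative ordering (and approximate magnitudes) can be recovered with 10**x
print([round(10 ** x) for x in log_values])
```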
Key Parametric Methods for Data
Reduction
4. Data Approximation (Curve Fitting)
• Concept: Instead of storing all data points, an equation
approximates the trend.
• Example:
• A dataset of temperature variations over a year (365
days) can be approximated using a sine wave function
rather than storing every single value.
Non-Parametric methods
Non-Parametric methods
• Non-parametric methods for data reduction do not
assume any predefined statistical distribution or
mathematical model.
• Instead, they reduce the dataset while preserving key
patterns and relationships.
• These methods are more flexible than parametric
methods because they adapt to the structure of the
data.
Key Non-Parametric Methods for
Data Reduction
1. Sampling
• Concept: Instead of analyzing the entire dataset, a
smaller, representative subset is used.
• Example:
• Suppose we have 1 million customer records. Instead of
using all data points, we randomly select 10,000 customers
that reflect the overall trends.
• This reduces computational cost while maintaining accuracy.
Types of Sampling
• Simple random sampling
• There is an equal probability of selecting any particular item
• Sampling without replacement
• Once an object is selected, it is removed from the population
• Sampling with replacement
• A selected object is not removed from the population
• Stratified sampling:
• Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of the data)
• Used in conjunction with skewed data
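The sampling variants above can be sketched in a few lines of Python; the population of 100 customer IDs and the two strata are made up for illustration:

```python
import random

random.seed(1)
population = list(range(1, 101))          # 100 hypothetical customer IDs

# Simple random sampling without replacement (SRSWOR)
srswor = random.sample(population, 10)

# Simple random sampling with replacement (SRSWR): duplicates are possible
srswr = [random.choice(population) for _ in range(10)]

# Stratified sampling: draw proportionally from each stratum
strata = {"low": list(range(1, 71)), "high": list(range(71, 101))}   # 70% / 30%
stratified = []
for name, members in strata.items():
    k = round(len(members) / len(population) * 10)                   # 7 and 3
    stratified += random.sample(members, k)

print(srswor, srswr, stratified, sep="\n")
```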
Sampling: With or without
Replacement
[Figure: SRSWOR (simple random sample without replacement) and SRSWR (simple random sample with replacement) drawn from the raw data]
Sampling: Cluster or Stratified
Sampling
[Figure: raw data on the left, cluster/stratified sample on the right]
Key Non-Parametric Methods for
Data Reduction
2. Clustering-Based Data Reduction
• Concept: Groups similar data points together and
stores only the cluster representatives.
• Example:
• Suppose we have 100,000 customer records with different
shopping behaviours.
• Using k-Means clustering, we can group them into 5
customer types and store only these representative
profiles, rather than all individual records.
Key Non-Parametric Methods for
Data Reduction
3. Dimensionality Reduction using Feature
Selection
• Concept: Removes irrelevant or redundant attributes
while keeping important ones.
• Example:
• A dataset with 100 attributes (e.g., height, weight, income,
education, ZIP code, etc.) might contain irrelevant
attributes.
• If ZIP code does not impact customer spending, we remove it
to make analysis more efficient.
Key Non-Parametric Methods for
Data Reduction
4. Histograms
• Concept: Group values into bins and store only the bin ranges and counts (the
distribution) instead of the raw values.
[Figure: histogram of income values with bins from 10,000 to 100,000]
• Example:
• A dataset has 1 million income values ranging from $10,000 to $200,000.
• Instead of storing all values, we group them into bins:
• Low Income: $10,000 – $40,000
• Middle Income: $40,001 – $100,000
• High Income: $100,001 – $200,000
• This keeps the distribution intact while reducing the number of stored values.
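A small Python sketch of this histogram-style binning (the seven income values and the to_bin helper are illustrative, not from the slides):

```python
# Hypothetical income values (in practice, up to 1 million records)
incomes = [12_000, 35_000, 41_000, 58_000, 99_000, 120_000, 180_000]

bins = [
    ("Low Income",     10_000,  40_000),
    ("Middle Income",  40_001, 100_000),
    ("High Income",   100_001, 200_000),
]

# Store only the bin label (and, if needed, a count per bin) instead of raw values
def to_bin(value):
    for label, low, high in bins:
        if low <= value <= high:
            return label
    return "Out of range"

labels = [to_bin(v) for v in incomes]
print(labels)
# ['Low Income', 'Low Income', 'Middle Income', 'Middle Income',
#  'Middle Income', 'High Income', 'High Income']
```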
Key Non-Parametric Methods for
Data Reduction
5. Discretization
• Concept: Converts continuous numerical values into
categorical values.
• Example:
• Instead of storing exact student scores (e.g., 89.4, 92.1,
76.8), we group them into letter grades:
• A (80–100)
• B (60–79)
• C (40–59)
• This makes analysis easier and reduces storage requirements.
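A minimal Python sketch of this score-to-grade discretization; the to_grade helper and the "F" fallback below the listed ranges are assumptions for illustration:

```python
def to_grade(score):
    """Discretize a continuous score into a letter grade (ranges from the slide)."""
    if score >= 80:
        return "A"      # 80-100
    if score >= 60:
        return "B"      # 60-79
    if score >= 40:
        return "C"      # 40-59
    return "F"          # below the listed ranges (assumption, not on the slide)

scores = [89.4, 92.1, 76.8, 41.0]
print([to_grade(s) for s in scores])    # ['A', 'A', 'B', 'C']
```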
Data Transformation
Data Transformation
• A function that maps the entire set of values of a given attribute
to a new set of replacement values such that each old value can
be identified with one of the new values
• Methods
• Smoothing: Remove noise from data
• Attribute/feature construction
• New attributes constructed from the given ones
• Aggregation: Summarization, data cube construction
• Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
• Discretization: Concept hierarchy climbing
Normalization
• Min-max normalization: maps a value v of attribute A to the range [new_min_A, new_max_A]:

    v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
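A short Python sketch of min-max normalization using the formula above; the sample data values are made up:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Map each value v to the range [new_min, new_max] using min-max normalization."""
    old_min, old_max = min(values), max(values)
    return [
        (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for v in values
    ]

data = [200, 300, 400, 600, 1000]          # hypothetical attribute values
print(min_max_normalize(data))             # [0.0, 0.125, 0.25, 0.5, 1.0]
```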
Normalization by Decimal
Scaling
• Definition: Normalization by decimal scaling is a method used in data
preprocessing to rescale numerical values so that they fall within a certain
range, typically between −1 and 1. This is done by dividing each data value
by a power of 10, determined by the maximum absolute value in the dataset:

    v' = v / 10^j

where:
• v' is the normalized value,
• v is the original value,
• j is the smallest integer such that the maximum absolute value of v' is less than 1.
Normalization by Decimal
Scaling
Example: Consider a dataset with values: [150, 300, 1200, 5000]
1. Find the maximum absolute value: 5000.
2. Determine j: the smallest j such that 5000 / 10^j is less than 1.
   j = 4 (since 5000 / 10^4 = 0.5)
3. Normalize each value:
   1. 150 / 10^4 = 0.015
   2. 300 / 10^4 = 0.03
   3. 1200 / 10^4 = 0.12
   4. 5000 / 10^4 = 0.5
• The transformed dataset becomes [0.015, 0.03, 0.12, 0.5], ensuring values
remain within the range [−1, 1].
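A possible Python sketch of decimal scaling that reproduces this example; the decimal_scale helper is illustrative:

```python
import math

def decimal_scale(values):
    """Normalize by decimal scaling: divide by 10^j so the max |value| is < 1."""
    max_abs = max(abs(v) for v in values)
    j = math.ceil(math.log10(max_abs)) if max_abs >= 1 else 0
    # If max_abs is an exact power of 10 (e.g., 1000), bump j so 1000 / 10^j < 1
    if max_abs / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]

j, scaled = decimal_scale([150, 300, 1200, 5000])
print(j, scaled)        # 4 [0.015, 0.03, 0.12, 0.5]
```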
Discretization
Discretization
• Discretization is a data preprocessing technique used in
data mining and machine learning that transforms
continuous numerical data into discrete categories or
bins.
• This process is particularly useful when working with
algorithms that require categorical input or when
reducing the complexity of large datasets.
Discretization
Why is Discretization Important?
1.Improves Interpretability: Converting continuous values into
discrete intervals makes the data easier to understand and
analyze.
2.Enhances Model Performance: Some machine learning
models, such as decision trees and rule-based classifiers, perform
better with categorical data.
3.Reduces Noise: Grouping values into bins helps smooth out
minor variations and reduces the impact of small fluctuations in
data.
4.Facilitates Pattern Recognition: Many patterns become more
evident when similar values are grouped together.
Types of Discretization Techniques
There are two main categories of discretization:
1.Supervised Discretization: Uses class labels to
determine optimal binning.
2.Unsupervised Discretization: Does not use class
labels and instead follows predefined rules.
Types of Discretization Techniques
1. Equal-Width Binning (Unsupervised)
• The range of values is divided into intervals of equal
size.
• Example: Suppose we have student scores ranging from
0 to 100. If we use equal-width binning with 4 bins, the
intervals might be:
• 0–25 (Low)
• 26–50 (Medium)
• 51–75 (High)
• 76–100 (Very High)
• Limitation: Uneven data distribution can lead to
unbalanced bins.
Types of Discretization Techniques
2. Equal-Frequency Binning (Unsupervised)
• Each bin contains approximately the same number of
data points.
• Example: If we have 100 data points and want 5 bins,
each bin will contain around 20 values.
• Advantage: Ensures each category has an equal
number of instances.
• Limitation: Bin ranges may be uneven, making it
harder to interpret.
Types of Discretization Techniques
3. Entropy-Based Binning (Supervised)
• Uses information gain to determine optimal binning.
• Commonly used in decision trees.
• Example: If we are categorizing patients based on their
cholesterol levels and their likelihood of heart disease,
entropy-based binning will create intervals that best
separate the risk groups.
Types of Discretization Techniques
4. Clustering-Based Binning (Unsupervised or
Supervised)
• Uses clustering algorithms like k-Means to group similar
values.
• Example: If we have customer purchase amounts, k-
Means can cluster them into “low spenders,” “moderate
spenders,” and “high spenders.”
Simple Discretization: Binning
• Equal-width (distance) partitioning
• Divides the range into N intervals of equal size: uniform grid
• if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B –A)/N.
• The most straightforward, but outliers may dominate
presentation
• Skewed data is not handled well
• Equal-depth (frequency) partitioning
• Divides the range into N intervals, each containing
approximately the same number of samples
• Good data scaling
• Managing categorical attributes can be tricky
Smoothing by Bin Means –
Numerical Example
• Smoothing by bin means is a data preprocessing
technique in which the values in a bin are replaced with
the mean (average) value of that bin. This technique
reduces noise and smooths the data while maintaining
the overall trend.
Steps in Smoothing by Bin Means:
1.Sort the data in ascending order.
2.Divide the data into bins (equal-width or equal-
frequency).
3.Compute the mean of each bin.
4. Replace each value in the bin with the mean of that bin.
Smoothing by Bin Means –
Numerical Example
Example
Given Data:
5, 18, 20, 25, 27, 30, 35, 40, 50, 60
Step 1: Sort the Data
Already sorted: 5, 18, 20, 25, 27, 30, 35, 40, 50, 60
Step 2: Divide into Bins
Let’s divide the data into 3 bins:
• Bin 1: (5, 18, 20)
• Bin 2: (25, 27, 30, 35)
• Bin 3: (40, 50, 60)
Step 3: Compute the Mean for Each Bin
• Bin 1 Mean: (5 + 18 + 20) / 3 ≈ 14.33
• Bin 2 Mean: (25 + 27 + 30 + 35) / 4 = 29.25
• Bin 3 Mean: (40 + 50 + 60) / 3 = 50
Step 4: Replace Each Value with the Bin Mean
Smoothing by Bin Means –
Numerical Example
The smoothed data (each value replaced by its bin mean):
• Bin 1: 14.33, 14.33, 14.33
• Bin 2: 29.25, 29.25, 29.25, 29.25
• Bin 3: 50, 50, 50
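A short Python sketch that reproduces this smoothing-by-bin-means example end to end (the bin boundaries are hard-coded to match Step 2):

```python
data = [5, 18, 20, 25, 27, 30, 35, 40, 50, 60]   # already sorted

# Bins as used in the example above
bins = [data[0:3], data[3:7], data[7:10]]        # (5,18,20) (25,27,30,35) (40,50,60)

smoothed = []
for b in bins:
    mean = round(sum(b) / len(b), 2)             # bin mean
    smoothed += [mean] * len(b)                  # replace every value by the bin mean

print(smoothed)
# [14.33, 14.33, 14.33, 29.25, 29.25, 29.25, 29.25, 50.0, 50.0, 50.0]
```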
Measuring Data Similarity and
Dissimilarity
• Similarity
• Numerical measure of how alike two data objects are
• Value is higher when objects are more alike
• Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity
Data Matrix and Dissimilarity Matrix
• Data matrix
• n data points with p dimensions
• Two modes

    [ x11 ... x1f ... x1p ]
    [ ...  ... ...  ... ... ]
    [ xi1 ... xif ... xip ]
    [ ...  ... ...  ... ... ]
    [ xn1 ... xnf ... xnp ]

• Dissimilarity matrix
• n data points, but registers only the distance
• A triangular matrix
• Single mode

    [ 0                            ]
    [ d(2,1)   0                   ]
    [ d(3,1)   d(3,2)   0          ]
    [   :        :      :          ]
    [ d(n,1)   d(n,2)   ...    0   ]
Proximity Measure for Nominal
Attributes
• Can take 2 or more states, e.g., red, yellow, blue, green
(generalization of a binary attribute)
• Method 1: Simple matching
• m: # of matches, p: total # of variables

    d(i, j) = (p − m) / p
Proximity Measure for Binary
Attributes
• A contingency table for binary data (objects i and j):

                         Object j
                         1        0        sum
    Object i   1         q        r        q + r
               0         s        t        s + t
             sum       q + s    r + t        p

• Distance measure for symmetric binary variables:

    d(i, j) = (r + s) / (q + r + s + t)

• Distance measure for asymmetric binary variables:

    d(i, j) = (r + s) / (q + r + s)

• Jaccard coefficient (similarity measure for asymmetric binary variables), also called “coherence”:

    sim_Jaccard(i, j) = q / (q + r + s)
Dissimilarity between Binary
Variables
• Example

    Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
    Jack   M        Y       N       P        N        N        N
    Mary   F        Y       N       P        N        P        N
    Jim    M        Y       P       N        N        N        N

• Gender is a symmetric attribute; the remaining attributes are asymmetric binary.
Let the values Y and P be set to 1, and the value N be set to 0. Then, using the
asymmetric distance:

    d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
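A small Python sketch of the asymmetric binary dissimilarity applied to this table (Y/P mapped to 1, N to 0; the extra d(Jack, Jim) value is just a further illustration):

```python
# Asymmetric binary dissimilarity d(i, j) = (r + s) / (q + r + s),
# using the Fever, Cough, Test-1..Test-4 attributes from the table
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

def asym_binary_dissim(a, b):
    q = sum(x == 1 and y == 1 for x, y in zip(a, b))   # 1-1 matches
    r = sum(x == 1 and y == 0 for x, y in zip(a, b))   # 1-0 mismatches
    s = sum(x == 0 and y == 1 for x, y in zip(a, b))   # 0-1 mismatches
    return (r + s) / (q + r + s)

print(round(asym_binary_dissim(jack, mary), 2))   # 0.33
print(round(asym_binary_dissim(jack, jim), 2))    # 0.67
```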
Standardizing Numeric Data
• Z-score:

    z = (X − μ) / σ
• X: raw score to be standardized, μ: mean of the population, σ:
standard deviation
• the distance between the raw score and the population mean in
units of the standard deviation
• negative when the raw score is below the mean, “+” when above
• An alternative way: calculate the mean absolute deviation

    s_f = (1/n) (|x_1f − m_f| + |x_2f − m_f| + ... + |x_nf − m_f|)

  where m_f = (1/n) (x_1f + x_2f + ... + x_nf)

• standardized measure (z-score):

    z_if = (x_if − m_f) / s_f
• Using mean absolute deviation is more robust than using standard
deviation
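A brief Python sketch comparing the two standardization options (standard deviation vs. mean absolute deviation); the five data values are made up:

```python
data = [35, 40, 45, 50, 80]               # hypothetical attribute values

n = len(data)
mean = sum(data) / n
std = (sum((x - mean) ** 2 for x in data) / n) ** 0.5
z_scores = [round((x - mean) / std, 2) for x in data]

# Alternative: use the mean absolute deviation instead of the standard deviation
mad = sum(abs(x - mean) for x in data) / n
z_mad = [round((x - mean) / mad, 2) for x in data]

print(z_scores)
print(z_mad)
```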
Example: Data Matrix and
Dissimilarity Matrix
Data Matrix

    point   attribute1   attribute2
    x1      1            2
    x2      3            5
    x3      2            0
    x4      4            5

[Figure: the four points plotted in the two-attribute space]

Dissimilarity Matrix (with Euclidean Distance)

           x1      x2      x3      x4
    x1     0
    x2     3.61    0
    x3     2.24    5.10    0
    x4     4.24    1       5.39    0
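A few lines of Python (a sketch, not from the slides) that reproduce this Euclidean dissimilarity matrix:

```python
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

names = list(points)
for i, a in enumerate(names):
    row = [round(euclidean(points[a], points[b]), 2) for b in names[:i + 1]]
    print(a, row)
# x1 [0.0]
# x2 [3.61, 0.0]
# x3 [2.24, 5.1, 0.0]
# x4 [4.24, 1.0, 5.39, 0.0]
```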
Distance on Numeric Data:
Minkowski Distance
• Minkowski distance: a popular distance measure

    d(i, j) = ( |x_i1 − x_j1|^h + |x_i2 − x_j2|^h + ... + |x_ip − x_jp|^h )^(1/h)

  where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data
  objects, and h is the order (the distance so defined is also called the L-h norm)
• Properties
• d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
• d(i, j) = d(j, i) (symmetry)
• d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
• A distance that satisfies these properties is a metric
Special Cases of Minkowski Distance
• h = 1: Manhattan (city block, L1 norm) distance
• E.g., the Hamming distance: the number of bits that are
different between two binary vectors

    d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp|

• h = 2: Euclidean (L2 norm) distance
• h → ∞: supremum (L_max, or L∞ norm) distance: the maximum difference
between any attribute of the two objects

For the four example points above:

Euclidean (L2)

           x1      x2      x3      x4
    x1     0
    x2     3.61    0
    x3     2.24    5.1     0
    x4     4.24    1       5.39    0

Supremum (L∞)

           x1      x2      x3      x4
    x1     0
    x2     3       0
    x3     2       5       0
    x4     3       1       5       0
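The Manhattan (L1) and supremum (L∞) matrices can be reproduced with the same four points; a small Python sketch:

```python
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

def manhattan(p, q):                       # L1 norm (h = 1)
    return sum(abs(a - b) for a, b in zip(p, q))

def supremum(p, q):                        # L-infinity norm (h -> infinity)
    return max(abs(a - b) for a, b in zip(p, q))

for name, dist in [("Manhattan", manhattan), ("Supremum", supremum)]:
    print(name)
    names = list(points)
    for i, a in enumerate(names):
        print(" ", a, [dist(points[a], points[b]) for b in names[:i + 1]])
```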
Ordinal Variables
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled
• replace x_if by its rank r_if ∈ {1, ..., M_f}
• map the range of each variable onto [0, 1] by replacing the i-th
object in the f-th variable by

    z_if = (r_if − 1) / (M_f − 1)
Cosine Similarity – Example
• cos(d1, d2) = (d1 · d2) / (||d1|| × ||d2||), where · denotes the dot product and ||d|| is the length of vector d
• d1 and d2 are two term-frequency vectors:

    d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
    d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

    d1 · d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
    ||d1|| = (5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²)^0.5 = 42^0.5 ≈ 6.48
    ||d2|| = (3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²)^0.5 = 17^0.5 ≈ 4.12

    cos(d1, d2) = 25 / (6.48 × 4.12) ≈ 0.94
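A short Python sketch that reproduces this cosine-similarity computation:

```python
d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

dot = sum(a * b for a, b in zip(d1, d2))           # 25
norm1 = sum(a * a for a in d1) ** 0.5              # sqrt(42) ~ 6.48
norm2 = sum(b * b for b in d2) ** 0.5              # sqrt(17) ~ 4.12
print(round(dot / (norm1 * norm2), 2))             # 0.94
```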
You are welcome