CS F415 Data Mining Data Preprocessing

The document discusses different types of data and attributes in datasets used for data mining. It describes key characteristics like dimensionality and sparsity that impact analysis. Different types of attributes are defined including nominal, ordinal, interval, ratio, discrete, continuous and asymmetric attributes. Properties of attribute values and important characteristics of datasets like resolution are also covered. Finally, common types of datasets are defined including record data, ordered data, graph data and their structure.


Data Mining

Topic: Data, Datasets, Quality, Pre-processing, Similarity & Dissimilarity

Dr. J Angel Arul Jothi
Department of Computer Science
BITS Pilani, Dubai Campus

Text Book Chapter 2


Data
• Dataset: Collection of data objects and their attributes
• An attribute is a property or characteristic of an object that varies from
person to person or from time to time
  – Attribute is also known as variable, field, dimension, or feature
• A collection of attributes describe an object
  – Object is also known as record, point, case, sample, entity, or instance

(Rows are objects; columns are attributes)

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

10/29/2022 Data Mining 3


BITS Pilani, Dubai Campus
Type of an Attribute
• Properties of an attribute need not be the same as the
properties of the values used to measure it
– Example 1: Employee ID and Employee age
represented as integers
• ID has no limit but age has a maximum and minimum value
• It is meaningful to compute an average employee age, but not an average ID
– Example 2: Mapping length of line segments to
numbers in two different ways
• An attribute can be measured in a way that does not capture
all the properties of the attribute

Type of an Attribute
• The way you measure an attribute sometimes may not capture all the
properties of an attribute
• Figure: five line segments A–E measured in two ways — left-hand mapping:
5, 7, 8, 10, 15; right-hand mapping: 1, 2, 3, 4, 5
• Measurements on the right-hand side of the figure capture both the
ordering and additivity properties of the length attribute, whereas
measurements on the left capture only the order property

Types of Attributes
• There are four different types of attributes
– Nominal
• Names/Symbols/labels
• Order is not emphasized
• Examples: ID numbers, eye color, zip codes, gender
– Ordinal – exhibits order
• Ordering of values is logical
• Usually used to capture attitudes and perceptions
• Differences between values are not meaningful
• Examples: ratings (e.g., taste of potato chips on a scale from
1-5)

Types of Attributes
– Interval
• Numeric where order and difference between values is logical
• Examples: calendar dates
– Ratio
• Values that have order, difference and ratio
• Examples: age, mass, length, counts (years of experience)

– Nominal and ordinal attributes are referred to as categorical/qualitative
attributes
– Interval and ratio attributes are referred to as quantitative or numeric
attributes

Properties of Attribute Values
• Properties of numbers used to describe attributes
– Distinctness: =, ≠
– Order: <, >
– Addition: +, −
– Multiplication: *, /
• Nominal attribute: distinctness
• Ordinal attribute: distinctness & order
• Interval attribute: distinctness, order & addition
• Ratio attribute: distinctness, order, addition and
multiplication

Types of Attributes

(Figure: summary table of the four attribute types and their properties)
Discrete and Continuous
Attributes (number of values)
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, ID numbers
– Often represented as integer variables
– Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight
– Continuous attributes are typically represented as floating-point
variables

Asymmetric Attributes

• For asymmetric attributes, only presence—a non-zero


attribute value—is regarded as important
• Binary attributes where only non-zero values are
important are called asymmetric binary attributes

Important Characteristics of
Data Sets
• Dimensionality
– Number of attributes in a dataset
– Curse of Dimensionality: as the number of dimensions increases, time and
space requirements also increase, making analysis difficult
– Dimensionality reduction techniques can be used
• Sparsity
– Many attributes have zero values
– Analyze only attributes with non-zero values (Asymmetric
attributes)
– Saves computation time and storage

Important Characteristics of
Data Sets
• Resolution
– Data can be obtained at different levels of resolution
– Properties of the data differ at different resolutions
– Patterns depend on the level of resolution

Types of data sets
• Record
  – Data Matrix
  – Sparse Data Matrix
  – Transaction/Market Data
• Graph
  – Data with relationships among objects (e.g., World Wide Web)
  – Dataset with objects as graphs (e.g., molecular structures)
• Ordered
  – Sequential Data
  – Sequence Data
  – Time series Data
  – Spatial Data

Record Data
• Data that consists of a collection of records, each of
which consists of a fixed set of attributes without any
relationship between the objects
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Types of record data: Data Matrix

• If data objects have the same fixed set of numeric


attributes
• The data objects can be thought of as points in a multi-
dimensional space
– where each dimension represents a distinct attribute
Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1

• Such data set can be represented by an m by n matrix,


where there are m rows, one for each object, and n
columns, one for each attribute
Types of record data: Sparse Data Matrix

• Example: Document-term matrix
• A vocabulary is formed composed of all unique terms in the documents
• Each document becomes a 'term' vector
• Each term is a component (attribute) of the vector
• The value of each component in the vector is the number of times the
corresponding term occurs in the document

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3     0     5     0     2      6     0     2      0       2
Document 2     0     7     0     2     1      0     0     3      0       0
Document 3     0     1     0     0     1      2     2     0      3       0
Transaction/Market Data
• A special type of record data
– Each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of
products purchased by a customer during one
shopping trip constitute a transaction, while the
individual products that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk

Graph Data
• Data with relationships among objects
  – Data objects are mapped to nodes
  – Relationships among objects are captured by the links and link properties
• Example: generic graph and HTML links
  – Web pages on the World Wide Web can be represented as a directed graph
    • A node corresponds to a web page
    • An edge corresponds to a hyperlink
  – Hyperlink information can be used by search engines to fetch relevant pages
Data with objects that are graphs
• Objects contain subobjects that have relationships
• Example: structure of chemical compounds
  – Benzene molecule: C6H6
  – Nodes are atoms (carbon, hydrogen)
  – Links between nodes are chemical bonds
• Useful to determine which substructures occur frequently in a set of
compounds, and to ascertain whether the presence of any of these
substructures is associated with the presence or absence of certain
chemical properties
Ordered Data
• The attributes have relationships that involve order in
time or space

Ordered Data: Sequential data

• Extension of record
data, where each record
has a time associated
with it
– Also called temporal data
– Find patterns such as “candy sales
peak before Halloween.”

• Time can also be


associated with each
attribute
– Find patterns such as “people who
buy DVD players tend to buy DVDs in
the period immediately following the
purchase”

Ordered Data: Sequence data
• Data set that is a sequence of individual entities
• Genomic sequence data
  – predicting similarities in the structure and function of genes from
similarities in nucleotide sequences
• Example fragment:
  GGTTCCGCCTTCAGCCCCGCGCC
  CGCAGGGCCCGCCCCGCGCCGTC
  GAGAAGGGCCCGCCTGGCGGGCG
  GGGGGAGGCGGGGCCGCCCGAGC
  CCAACCGAGTCCGACCAGGTGCC
  CCCTCTGCTCGGCCTAGACCTGA
  GCTCATTAGGCGGCAGCGGACAG
  GCCAAGTAGAACACGCGAAGCGC
  TGGGCTGCCTGCTGCGACCAGGG

Ordered Data: Time series data

• Each record is a time series, i.e., a series of


measurements taken over time
• Example: data set containing objects that are time series
of the daily prices of various stocks

Ordered Data: Spatial data
• Some objects have spatial attributes, such as positions
or areas, as well as other types of attributes
• Example of spatial data is weather data (precipitation,
temperature, pressure) that is collected for a variety of
geographical locations

Data Quality
• Data may not be of good quality
– Human error, limitations of measuring devices and
flaws in the data collection process
– Factors affecting data quality
• Noise and artifacts
• Outliers
• Missing values
• Inconsistent values
• Duplicate data

Noise
• Noise is a random component of measurement error
• Refers to unwanted signal that modifies (distorts) the
original values/signals
– Example: distortion of a person’s voice when talking on a poor phone

(Figure: two sine waves, and the same two sine waves with noise added)

Artifacts
• Deterministic distortion of data
• Example: A streak in the same place in a set of
photographs

Outliers
• Outliers are
– Data objects with
characteristics that are
considerably different than
most of the other data
objects in the data set
– Values of an attribute that
are different from the usual
values of the attribute
– Outliers can be legitimate
data objects or values
– Outliers are of interest

Missing Values
• Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all objects
(e.g., annual income is not applicable to children)
• Handling missing values
– Eliminate Data Objects or attributes
– Estimate Missing Values – interpolation (time series data),
average (continuous attribute), most common value (categorical
attribute)
– Ignore the Missing Value During Analysis

Handling missing values
• Eliminate Data Objects with missing values
– If many objects have missing values – reliable
analysis is not possible
– Good choice if the dataset has few objects with
missing values
• Eliminate attributes with missing values
– Should be done with caution – the attribute might be
critical for analysis

Handling missing values
• Estimate Missing Values
– For time series data where values vary in a smooth manner, use
interpolation
– Dataset with many similar data points
• Missing attribute value is continuous – replaced
with the average of the nearest neighbors
• Missing attribute value is categorical – replaced
with the most frequently occurring value

Handling missing values
• Ignore Missing Values
– Example: Clustering a set of objects
• Similarity is computed with attributes that do not
have missing values
• The similarity will be approximate

Inconsistent values
• Example: Address details
– Zip code and city are inconsistent
– Person with age 50 weighs 5kg
– Negative height values
– Unavailable product code

• Correction
– Possible sometimes
– Requires additional or redundant information

Duplicate Data
• Data set may include data objects that are duplicates, or
almost duplicates of one another
– Major issue when merging data from heterogeneous sources
• Examples:
– Same person with multiple email addresses
• Data deduplication or cleaning
– Process of dealing with duplicate data
• Correction requires additional or redundant information

Data Preprocessing
• Data preprocessing steps are to be applied to the data to
make it more suitable for data mining
• Improves data mining analysis with respect to time, cost
and quality

Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation

Aggregation
• Combining two or more objects into a single object
• Purpose
– Data reduction
• Advantages
– Reduce the number of attributes or objects
• Small datasets – less memory/processing time
– Change of scale
• High level view of data instead of low level
– More “stable” data
• Aggregated data tends to have less variability
• Disadvantage
– Loss of details

Sampling
• Sampling is the main technique employed for selecting a
subset of data
• Sampling is used in data mining because processing the
entire data is too expensive or time consuming
• The key principle for effective sampling is the following:
– Using a sample will work almost as well as using the entire data
set, if the sample is representative

– A sample is representative if it has approximately the same


property (of interest) as the original set of data

Types of Sampling
• Simple Random Sampling
– There is an equal probability of selecting any particular item
• Sampling without replacement
– As each item is selected, it is removed from the population
• Sampling with replacement
– Objects are not removed from the population as they are
selected for the sample
– In sampling with replacement, the same object can be picked up
more than once
– Simpler to analyse since the probability of selecting any object
remains constant during the sampling process

Types of Sampling
• Stratified sampling
– Used when the dataset has different types of objects with
different number of objects in each type
– Random sampling can fail to represent objects that are less
frequent
– Example: Classifier model might not be able to learn the less
frequent objects

– Split the data into several partitions/groups based on the type of


the objects
– Then draw equal number of random samples from each partition
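The partition-then-equal-draw idea above can be sketched in Python; the helper name `stratified_sample` and the toy data are illustrative, not from the slides:

```python
import random

def stratified_sample(objects, labels, per_group, seed=0):
    """Split objects into groups by label, then draw an equal number of
    random samples (without replacement) from each group."""
    rng = random.Random(seed)
    groups = {}
    for obj, lab in zip(objects, labels):
        groups.setdefault(lab, []).append(obj)
    sample = []
    for lab in sorted(groups):
        members = groups[lab]
        # never ask for more samples than the stratum contains
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample

# 90 objects of the frequent class 'a', 10 of the rare class 'b';
# simple random sampling could easily under-represent 'b'
objs = list(range(100))
labs = ['a'] * 90 + ['b'] * 10
s = stratified_sample(objs, labs, per_group=5)  # 5 from each class
```

Each class contributes the same number of objects regardless of how rare it is in the original data.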

Types of Sampling
• Progressive/Adaptive Sampling
– Used when proper sample size can be difficult to determine
– Start with a small sample, and then increase the sample size
until a sample of sufficient size has been obtained
– Evaluate the sample to judge if it is large enough

Dimensionality Reduction
• Reducing the number of features in a dataset
• Purpose
• Avoids curse of dimensionality
• Reduce amount of time and memory required by data mining
algorithms
• Allow data to be more easily visualized
• May help to eliminate irrelevant features or reduce noise

Curse of Dimensionality
• When the number of dimensions (attributes) increases, analysis becomes
significantly harder
• Data becomes increasingly sparse in the space that it
occupies
• Classification
– Model not able to learn
• Clustering and outlier detection
  – Definitions of distance between points and of density, which are critical
for these tasks, become less meaningful

Dimensionality Reduction
• Types:
– Creating new attributes that are a combination of the old
attributes
• Dimensionality reduction
– Selecting attributes that are a subset of the original attributes
• Feature subset selection or feature selection
• Dimensionality Reduction Techniques
– Principal Component Analysis (PCA)
– Singular Value Decomposition (SVD)

Dimensionality Reduction:
PCA
• Use techniques from linear algebra
• Maps the data from a high-dimensional space into a
lower-dimensional space
• Continuous data
• Finds new attributes (principal components) that
– are linear combinations of the original attributes
– are orthogonal (perpendicular) to each other
– capture the maximum amount of variation in the data
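A minimal sketch of these ideas, assuming NumPy is available: center the data, eigendecompose the covariance matrix, and project onto the top directions of variance. The function name `pca` and the random data are illustrative only:

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X onto the top principal components."""
    Xc = X - X.mean(axis=0)                 # center each attribute
    cov = np.cov(Xc, rowvar=False)          # covariance between attributes
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # largest variance first
    components = eigvecs[:, order[:n_components]]  # orthogonal directions
    return Xc @ components                  # linear combinations of attributes

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z = pca(X, 2)   # 100 objects reduced from 5 attributes to 2
```

The new attributes are linear combinations of the originals and are orthogonal to each other, as described above.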

Feature Subset Selection
 Use/Select only a subset of features
 Eliminates redundant and irrelevant features
 Redundant features
– Duplicate
– Much or all of the information is contained in one or more other
attributes
– Example: purchase price of a product and the amount of sales tax paid
 Irrelevant features
– Contain no information that is useful for the data mining task at hand
– Example: Students' ID is often irrelevant to the task of predicting
students' GPA

Feature Subset Selection
• Techniques:
– Brute-force approach
  • Try all possible feature subsets as input to the data mining
algorithm
  • For n attributes there are 2^n possible subsets
  • An evaluation function scores each subset
– Embedded approaches
• Feature selection occurs naturally as part of the data mining
algorithm
– Decision tree classifiers

Feature Subset Selection
– Filter approaches
• Features are selected before data mining algorithm is run
• Filter approaches are independent of the data mining
algorithms
• Example: Select attributes whose pairwise correlation is as
low as possible
– Wrapper approaches
• Use the data mining algorithm as a black box to find best
subset of attributes

An Architecture for Feature
Subset Selection
• A common architecture covers both filter and wrapper approaches
• The feature selection process
– a search strategy that controls the generation of a new
subset of features
– a measure for evaluating a subset
– a stopping criterion
– a validation procedure
• Difference between filter and wrapper methods (in how they evaluate a
subset of features)
– For a wrapper method, subset evaluation uses the target
data mining algorithm, while for a filter approach, the
evaluation technique is distinct from the target data
mining algorithm
Feature subset selection process

(Figure: flowchart of the feature subset selection process)
Feature Weighting
 Used to keep or eliminate features
 Important features are assigned a higher weight
 Less important features are given a lower weight
 Weights are sometimes assigned based on domain
knowledge about the relative importance of features

Feature Creation
• Create new attributes that can capture the important
information in a data set much more efficiently than the
original attributes
• The number of new attributes can be smaller than the
number of original attributes
• Three general methodologies
– Feature Extraction
– Mapping Data to New Space
• Time series data to frequency domain
– Feature Construction
• Constructing new features from already existing ones
• Density = mass/volume

Feature Extraction
• Example: histopathology image
• Image data is processed to provide higher-level features
  – Colour
  – Shape (area, diameter, …)
  – Texture
Mapping Data to New Space

(Figure: example of mapping data to a new space, e.g., a time series
transformed to the frequency domain)
Discretization and Binarization

• Some data mining algorithms require that the data be in


the form of
– Categorical attributes (Classification algorithms)
– Binary attributes (Association mining)
• Discretization
– Continuous attribute into a categorical attribute
• Binarization
– Continuous and discrete attributes into one or more binary attributes

Binarization
• Binarizing categorical attribute (Method 1)
– For each of the m categorical values, uniquely assign each
original value to an integer in the interval [0, m − 1]
– Convert each of these m integers to a binary number
• n = ceil(log2(m)) binary digits/attributes are required to
represent these integers
– Represent the binary numbers as n binary attributes
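The three steps above can be sketched in Python; the helper name `binarize_integer_coding` and the sample values are illustrative:

```python
import math

def binarize_integer_coding(values):
    """Method 1: map the m categorical values to integers 0..m-1, then
    represent each integer with n = ceil(log2(m)) binary attributes."""
    cats = sorted(set(values))
    m = len(cats)
    n = max(1, math.ceil(math.log2(m)))     # binary digits needed
    to_int = {c: i for i, c in enumerate(cats)}
    rows = []
    for v in values:
        code = to_int[v]
        # extract bits, most-significant first
        rows.append([(code >> (n - 1 - b)) & 1 for b in range(n)])
    return rows, n

# m = 5 categorical values need n = ceil(log2(5)) = 3 binary attributes
rows, n = binarize_integer_coding(["awful", "poor", "OK", "good", "great"])
```

Note that this coding introduces ordering relationships among the binary attributes, which motivates Method 2 below.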

Binarization
• Example: (table mapping each categorical value to an integer and then to
binary attributes)
Binarization
• Binarizing categorical attribute (Method 2)
– Avoid unnecessary relationships by assigning one binary
attribute for each categorical value
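Method 2 is the familiar one-hot encoding; a minimal sketch (function name and sample values illustrative):

```python
def one_hot(values):
    """Method 2: one binary (asymmetric) attribute per categorical value,
    so no spurious order or relationships are introduced."""
    cats = sorted(set(values))
    index = {c: i for i, c in enumerate(cats)}
    rows = []
    for v in values:
        row = [0] * len(cats)   # one attribute per categorical value
        row[index[v]] = 1       # exactly one attribute is "present"
        rows.append(row)
    return cats, rows

cats, rows = one_hot(["red", "green", "blue", "green"])
# 3 distinct values -> 3 binary attributes; each row has exactly one 1
```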

Discretization
• Transformation of a continuous attribute to a categorical
attribute
• Two sub tasks
– Step1: Deciding how many categories to have
– Step2: Determining how to map the values of the continuous attribute to
these categories
• Step1
– The values of the continuous attribute are sorted
– Specify n−1 split points and divide the data into n intervals
• Step2
– All the values in one interval are mapped to the same categorical value

Discretization by equal width
binning
• Divides the range of the attribute into a user-specified
number of intervals (bins) each having the same width
• Bin width w = (max – min) / (no. of bins)
• Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
• No. of bins = 3
• Bin width w = (34 – 4) / 3 = 10
• Bin 1 (4–14): 4, 8
• Bin 2 (15–24): 15, 21, 21, 24
• Bin 3 (25–34): 25, 28, 34
• Note: Data should be in ascending order before performing
equal width binning
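A small Python sketch of equal-width binning on the price data above (the function name is illustrative; boundary values are assigned to the lower bin, matching the worked example):

```python
import math

def equal_width_bins(data, n_bins):
    """Equal-width binning: width w = (max - min) / n_bins.
    Bin i covers (lo + i*w, lo + (i+1)*w]; the minimum goes in bin 0."""
    lo, hi = min(data), max(data)
    w = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(data):                      # sort first, as noted above
        i = max(0, math.ceil((v - lo) / w) - 1)
        bins[i].append(v)
    return bins

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
equal_width_bins(prices, 3)
# → [[4, 8], [15, 21, 21, 24], [25, 28, 34]]
```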

Discretization by equal
frequency binning
• Put the same number of objects (frequency) into each
interval
• Sort data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28,
34
• Partition into (equal-frequency) bins of frequency 3
• Bin 1: 4, 8, 15
• Bin 2: 21, 21, 24
• Bin 3: 25, 28, 34

• Note: Data should be in ascending order before


performing equal frequency binning
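Equal-frequency binning is simply sorting and slicing; a one-function sketch on the same price data (function name illustrative):

```python
def equal_frequency_bins(data, freq):
    """Equal-frequency binning: sort, then put `freq` consecutive
    values into each bin."""
    s = sorted(data)
    return [s[i:i + freq] for i in range(0, len(s), freq)]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
equal_frequency_bins(prices, 3)
# → [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
```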

Attribute Transformation
• Transformation applied to all values of an
attribute/variable
– Simple functions
• Apply simple mathematical function to each value
• If x is a variable then xk, ex, |x|, 1/x, √x, log(x)
– Standardization or Normalization
• Attribute values are scaled to fall within a smaller range,
such as -1.0 to 1.0, or 0.0 to 1.0.
• Prevents a variable with large values from dominating the results of the
calculation
– Neural networks, nearest-neighbor classification

Min-max normalization
• Let minA and maxA are the minimum and maximum
values of an attribute A
• Min-max normalization maps a value, vi, of A to vi' in the
range [new_minA, new_maxA] by computing

  vi' = ((vi – minA) / (maxA – minA)) × (new_maxA – new_minA) + new_minA

Min-max normalization
• Example:
– Suppose that the minimum and maximum values for
the attribute income are $12,000 and $98,000,
respectively. Given income = $73,600, map it to the
range [0.0,1.0] by min-max normalization
– Solution:

((73,600 – 12,000) / (98,000 – 12,000)) × (1.0 – 0) + 0 = 0.716
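The same calculation as a small Python function (function name illustrative):

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

min_max_normalize(73600, 12000, 98000)  # ≈ 0.716
```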

z-score normalization
• z-score normalization maps a value, vi, of an attribute A to vi' using the
mean (Ā) and standard deviation (σA) of A:

  vi' = (vi – Ā) / σA

z-score normalization
• Example:
– Suppose that the mean and standard deviation of the
values for the attribute income are $54,000 and
$16,000, respectively. Find the z-score normalized
value for a original value of $73,600 for income
– Solution:

(73,600 – 54,000) / 16,000 = 1.225
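The same calculation in Python (function name illustrative):

```python
def z_score_normalize(v, mean_a, std_a):
    """z-score normalization: (v - mean) / standard deviation."""
    return (v - mean_a) / std_a

z_score_normalize(73600, 54000, 16000)  # → 1.225
```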

Similarity and Dissimilarity
• Similarity & Dissimilarity
– Used by a number of data mining techniques
• clustering, nearest neighbor classification, and anomaly
detection
– In many cases, the initial data set is not needed once
these similarities or dissimilarities have been
computed
• Proximity refers to similarity or dissimilarity
• Dense data (time series or n-dimensional points)
– Correlation and Euclidean distance
• Sparse data (document)
– Jaccard and Cosine similarity measures
Similarity and Dissimilarity
• Similarity
– Numerical measure of how alike two data objects are
– Is higher when objects are more alike
• Dissimilarity
– Numerical measure of how different two data objects are
– Lower when objects are more alike

Transformations
• Convert a similarity to a dissimilarity or vice versa
• Transform a proximity measure to fall within a particular
range, such as [0,1]
– Proximity - Fixed range to [0, 1]
s’ = (s-s_min)/(s_max-s_min)
– Proximity – (0 – infinity) to [0,1]
d’ = d/(1+d)
• Transformation may change the meaning of the proximity
measure
– Example: correlation (a similarity measure) takes values in
the range [-1, 1]. Mapping these values to the interval
[0, 1] by taking the absolute value loses information about
the sign
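The two transformations given above can be written directly in Python (function names illustrative):

```python
def similarity_to_unit_range(s, s_min, s_max):
    """Rescale a similarity with a fixed range to [0, 1]:
    s' = (s - s_min) / (s_max - s_min)."""
    return (s - s_min) / (s_max - s_min)

def dissimilarity_to_unit_range(d):
    """Map a dissimilarity in [0, infinity) to [0, 1): d' = d / (1 + d)."""
    return d / (1.0 + d)

similarity_to_unit_range(-0.5, -1.0, 1.0)  # correlation -0.5 → 0.25
dissimilarity_to_unit_range(3.0)           # → 0.75
```

Note how the first call illustrates the caveat above: the sign of the correlation is folded into the position within [0, 1] rather than discarded.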
Similarity/Dissimilarity for
Simple Attributes
• p and q are the attribute values for two data objects

Attribute type   Dissimilarity                          Similarity
Nominal          d = 0 if p = q, d = 1 if p ≠ q         s = 1 if p = q, s = 0 if p ≠ q
Ordinal          d = |p – q| / (n – 1)                  s = 1 – d
                 (values mapped to integers 0 to n–1)
Interval/Ratio   d = |p – q|                            s = –d, s = 1/(1 + d), etc.

Dissimilarity matrix
• Stores a collection of dissimilarities for all pairs of n objects
• Represented by an n-by-n table/matrix where n is the number of objects
• d(i, j) is the dissimilarity or “difference” between objects i and j
• d(i, j) is a non-negative number
• d(i, j) = 0 means the objects are highly similar
• Note that d(i, i) = 0; that is, the distance between an object and itself is 0
• The matrix is symmetric, i.e., d(i, j) = d(j, i)

Example
• Suppose that we have the sample data given below.
Compute the dissimilarity matrix using the attribute, test-1

Dissimilarity matrix

test-1 1 2 3 4
1 0 1 1 0
2 1 0 1 1
3 1 1 0 1
4 0 1 1 0

Example
• Suppose that we have the sample data given below.
Compute the dissimilarity matrix using the attribute test-2
Dissimilarity matrix

test-2 1 2 3 4
1 0 1 0.5 0
2 1 0 0.5 1
3 0.5 0.5 0 0.5
4 0 1 0.5 0

Excellent (2), good (1), fair(0)

Dissimilarities between
Data Objects
• The Euclidean distance, d, between two points, x and y, in one-,
two-, three-, or higher dimensional space, is given by the following
formula

  d(x, y) = ((x1 – y1)^2 + (x2 – y2)^2 + … + (xn – yn)^2)^(1/2)

  where n is the number of dimensions (attributes) and xk and yk
are, respectively, the kth attributes (components) of data objects x
and y
• Standardization is necessary, if scales differ
• The Euclidean distance is always greater than or equal to zero

Dissimilarities between
Data Objects
Squared Euclidean Distance
• The squared Euclidean distance, d, between two points, a and
b, in one-, two-, three-, or higher dimensional space, is given
by the following formula

  d(a, b) = (a1 – b1)^2 + (a2 – b2)^2 + … + (ak – bk)^2

  where k is the number of dimensions (attributes) and aj and
bj are, respectively, the jth attributes (components) of data
objects a and b

Minkowski Distance
• Minkowski distance is a generalization of Euclidean
distance

  d(x, y) = (|x1 – y1|^r + |x2 – y2|^r + … + |xn – yn|^r)^(1/r)

• where r is a parameter, n is the number of dimensions
(attributes) and xk and yk are, respectively, the kth
attributes (components) of data objects x and y

Minkowski Distance:
Examples
• r = 1. City block (Manhattan, taxicab, L1 norm) distance
  – d(x,y) = |x1 – y1| + |x2 – y2| + … + |xn – yn|
  – The City block distance is always greater than or equal to zero
• r = 2. Euclidean distance or L2 norm
  – d(x,y) = (|x1 – y1|^2 + |x2 – y2|^2 + … + |xn – yn|^2)^(1/2)
• r → ∞. “supremum” (Lmax norm, L∞ norm) distance
  – This is the maximum difference between any
component of the vectors
  • d(x,y) = max {|x1 – y1|, |x2 – y2|, … , |xn – yn|}
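All three cases can be computed with two small Python functions (names and example points illustrative):

```python
def minkowski(x, y, r):
    """Minkowski distance: r=1 gives city block, r=2 gives Euclidean."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

def supremum(x, y):
    """L-infinity norm: the maximum difference over any component."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (0, 0), (3, 4)
minkowski(x, y, 1)  # → 7.0 (city block)
minkowski(x, y, 2)  # → 5.0 (Euclidean)
supremum(x, y)      # → 4
```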

Common Properties of a
Distance
• Distance is often used to refer to dissimilarities
1. Non-negativity/Positivity:
  • d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q
2. Symmetry:
  • d(p, q) = d(q, p) for all p and q
3. Triangle inequality:
  • d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r
• A distance that satisfies these properties is a metric

Similarities between Data
Objects
• Similarities, also have some well known properties.

1. s(p, q) = 1 (or maximum similarity) only if p = q.

2. s(p, q) = s(q, p) for all p and q. (Symmetry)

where s(p, q) is the similarity between points (data


objects), p and q

Similarity Between Binary
Vectors
• Similarity between binary objects is called a similarity
coefficient
• Let objects, p and q, have n binary attributes
• Define the following quantities
M01 = the number of attributes where p is 0 and q is 1
M10 = the number of attributes where p is 1 and q is 0
M00 = the number of attributes where p is 0 and q is 0
M11 = the number of attributes where p is 1 and q is 1
• Simple Matching and Jaccard Coefficients
SMC = number of matches / number of attributes
= (M11 + M00) / (M01 + M10 + M11 + M00)
J = number of 11 matches / number of not-both-zero
attributes values
= (M11) / (M01 + M10 + M11)
SMC and Jaccard: Example

• Compute the SMC and Jaccard coefficient for two data objects p and q:
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1

SMC and Jaccard: Solution
p= 1000000000
q= 0000001001

M01 = 2 (the number of attributes where p is 0 and q is 1)


M10 = 1 (the number of attributes where p is 1 and q is 0)
M00 = 7 (the number of attributes where p is 0 and q is 0)
M11 = 0 (the number of attributes where p is 1 and q is 1)
SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) /
(2+1+0+7) = 0.7

J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
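The worked example above can be reproduced with a short Python sketch (the helper name is my own, not from the slides):

```python
def smc_and_jaccard(p, q):
    """Simple Matching Coefficient and Jaccard coefficient for binary vectors."""
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    mismatches = sum(1 for a, b in zip(p, q) if a != b)  # M01 + M10
    smc = (m11 + m00) / (m11 + m00 + mismatches)
    jaccard = m11 / (m11 + mismatches) if (m11 + mismatches) else 0.0
    return smc, jaccard

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_and_jaccard(p, q))  # (0.7, 0.0)
```

The gap between the two values shows why Jaccard suits asymmetric binary attributes: the seven 0–0 matches inflate the SMC but carry no information about shared presences.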


Cosine Similarity
• If x and y are two document vectors, then

cos(x, y) = (x • y) / (||x|| ||y||)

where • indicates the vector dot product and ||x|| is the length (L2 norm) of vector x
• The cosine similarity ranges from +1 to −1
– +1 is the highest similarity
– completely opposite vectors have similarity −1

Cosine Similarity
• If the cosine similarity is 1
– The angle between x and y is 0◦
– x and y are the same except for magnitude
• If the cosine similarity is 0
– The angle between x and y is 90◦
• If the cosine similarity is -1
– The angle between x and y is 180◦

Cosine Similarity: Example

• Compute the cosine similarity between the two document vectors d1 and d2:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2

Cosine Similarity: Solution

d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2

d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150
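The same computation can be sketched in Python (an illustrative helper, not from the slides):

```python
def cosine_similarity(x, y):
    """Cosine of the angle between two vectors: (x . y) / (||x|| ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sum(a * a for a in x) ** 0.5
    norm_y = sum(b * b for b in y) ** 0.5
    return dot / (norm_x * norm_y)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine_similarity(d1, d2), 4))  # 0.315
```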

Extended Jaccard Coefficient
(Tanimoto)
• Used for finding the similarity between documents (data)
– Reduces to the Jaccard coefficient for binary attributes

T(d1, d2) = (d1 • d2) / (||d1||^2 + ||d2||^2 − d1 • d2)

– Compute the Extended Jaccard similarity between d1 and d2:
• d1 = 1 0 1 0 1 and d2 = 1 1 1 0 1
• d1 = 1 1 1 1 1 and d2 = 1 1 1 1 1

Extended Jaccard Coefficient (Tanimoto)

• Solution
• d1 = 1 0 1 0 1
• d2 = 1 1 1 0 1
• d1 • d2 = 3
• ||d1||^2 = 1 + 1 + 1 = 3
• ||d2||^2 = 1 + 1 + 1 + 1 = 4
• T(d1, d2) = 3 / (3 + 4 − 3) = 3/4 = 0.75

Extended Jaccard Coefficient (Tanimoto)

• Solution
• d1 = 1 1 1 1 1
• d2 = 1 1 1 1 1
• d1 • d2 = 5
• ||d1||^2 = 1 + 1 + 1 + 1 + 1 = 5
• ||d2||^2 = 1 + 1 + 1 + 1 + 1 = 5
• T(d1, d2) = 5 / (5 + 5 − 5) = 1
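Both worked cases can be checked with a short Python sketch (an illustrative helper, not from the slides):

```python
def tanimoto(x, y):
    """Extended Jaccard (Tanimoto): (x . y) / (||x||^2 + ||y||^2 - x . y)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x_sq = sum(a * a for a in x)
    norm_y_sq = sum(b * b for b in y)
    return dot / (norm_x_sq + norm_y_sq - dot)

print(tanimoto([1, 0, 1, 0, 1], [1, 1, 1, 0, 1]))  # 0.75
print(tanimoto([1, 1, 1, 1, 1], [1, 1, 1, 1, 1]))  # 1.0
```

For 0/1 vectors, the dot product counts M11 and the squared norms count the 1s in each vector, which is why the expression reduces to the Jaccard coefficient M11 / (M01 + M10 + M11).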

Correlation
• The correlation coefficient, also known as Pearson’s coefficient (r):
– Applies to numeric attributes
– Measures the linear relationship between objects
– Takes values between −1 and +1
• If correlation coefficient (A, B) > 0:
– Attributes A and B are positively correlated, meaning that the values of A increase as the values of B increase
– The higher the value, the stronger the correlation; hence, a high value may indicate that A (or B) could be removed as redundant

Correlation
• If Correlation coefficient (A,B) = 0
– Attributes A and B are independent and there is no correlation
between them
• If Correlation coefficient (A,B) < 0
– Attributes A and B are negatively correlated, where the values of
one attribute increase as the values of the other attribute
decrease
– Each attribute discourages the other

Correlation

r = covariance(x, y) / (s_x * s_y)

where covariance(x, y) = Σ (x_i − x*)(y_i − y*) / (n − 1), x* and y* are the means, and s_x and s_y are the sample standard deviations of x and y
Correlation: Example
• The time x in years that an employee has spent at a company and the employee's hourly pay, y, for 5 employees are listed in the table below. Calculate and interpret the correlation coefficient r.

x: 5, 3, 4, 10, 15
y: 25, 20, 21, 35, 38

Correlation: Solution

x     y     x-x*    y-y*    (x-x*)(y-y*)   (x-x*)^2   (y-y*)^2
5     25    -2.4    -2.8     6.72           5.76       7.84
3     20    -4.4    -7.8    34.32          19.36      60.84
4     21    -3.4    -6.8    23.12          11.56      46.24
10    35     2.6     7.2    18.72           6.76      51.84
15    38     7.6    10.2    77.52          57.76     104.04
Sum:  37    139             160.4          101.2     270.8

Average (x*) = 37/5 = 7.4
Average (y*) = 139/5 = 27.8
s_x = (101.2/4)^(1/2) = 5.03
s_y = (270.8/4)^(1/2) = 8.23
Covariance(x, y) = 160.4/4 = 40.1
Correlation(x, y) = 40.1 / (5.03 * 8.23) = 0.97
Correlation: Solution
• There is a strong positive correlation between the number of years an employee has worked and the employee's hourly pay, since r is very close to 1.
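The whole worked solution can be reproduced with a short Python sketch (an illustrative helper, not from the slides; it uses the sample formulas with n − 1 in both the covariance and the standard deviations):

```python
def pearson_r(x, y):
    """Sample Pearson correlation coefficient: cov(x, y) / (s_x * s_y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    s_x = (sum((a - mean_x) ** 2 for a in x) / (n - 1)) ** 0.5
    s_y = (sum((b - mean_y) ** 2 for b in y) / (n - 1)) ** 0.5
    return cov / (s_x * s_y)

x = [5, 3, 4, 10, 15]   # years at the company
y = [25, 20, 21, 35, 38]  # hourly pay
print(round(pearson_r(x, y), 2))  # 0.97
```

The n − 1 factors cancel between numerator and denominator, so using n throughout would give the same r; the split into covariance and standard deviations simply mirrors the hand computation above.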

Thank you

