CS F415 Data Mining Data Preprocessing

The document discusses different types of data and attributes in datasets used for data mining. It describes key characteristics like dimensionality and sparsity that impact analysis. Different types of attributes are defined including nominal, ordinal, interval, ratio, discrete, continuous and asymmetric attributes. Properties of attribute values and important characteristics of datasets like resolution are also covered. Finally, common types of datasets are defined including record data, ordered data, graph data and their structure.


Data Mining

Topic: Data, Datasets, Quality, Pre-processing, Similarity & Dissimilarity

Dr. J Angel Arul Jothi
Department of Computer Science
BITS Pilani, Dubai Campus

Text Book Chapter 2


Data
• Dataset: Collection of data objects and their attributes
• An attribute is a property or characteristic of an object that varies from
person to person or from time to time
  – Attribute is also known as variable, field, dimension, or feature
• A collection of attributes describe an object
  – Object is also known as record, point, case, sample, entity, or instance

(Rows are objects; columns are attributes)

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

10/29/2022 Data Mining 3


BITS Pilani, Dubai Campus
Type of an Attribute
• Properties of an attribute need not be the same as the
properties of the values used to measure it
– Example 1: Employee ID and Employee age
represented as integers
• ID has no limit but age has a maximum and minimum value
• It is meaningful to compute an average employee age, but not an average ID
– Example 2: Mapping length of line segments to
numbers in two different ways
• An attribute can be measured in a way that does not capture
all the properties of the attribute

Type of an Attribute
• The way you measure an attribute sometimes may not capture all the
properties of an attribute
• Figure: five line segments A–E measured in two ways — left-hand mapping:
5, 7, 8, 10, 15; right-hand mapping: 1, 2, 3, 4, 5
• Measurements on the right-hand side of the figure capture both the
ordering and additivity properties of the length attribute, whereas
measurements on the left capture only the order property

Types of Attributes
• There are four different types of attributes
– Nominal
• Names/Symbols/labels
• Order is not emphasized
• Examples: ID numbers, eye color, zip codes, gender
– Ordinal – exhibits order
• Ordering of values is logical
• Usually used to capture attitudes and perceptions
• Differences between values are not meaningful
• Examples: ratings (e.g., taste of potato chips on a scale from
1-5)

Types of Attributes
– Interval
• Numeric where order and difference between values is logical
• Examples: calendar dates
– Ratio
• Values that have order, difference and ratio
• Examples: age, mass, length, counts (years of experience)

– Nominal and ordinal attributes are referred to as categorical/qualitative
attributes
– Interval and ratio attributes are referred to as quantitative or numeric
attributes

Properties of Attribute Values
• Properties of numbers used to describe attributes
– Distinctness: =, ≠
– Order: <, >
– Addition: +, −
– Multiplication: *, /
• Nominal attribute: distinctness
• Ordinal attribute: distinctness & order
• Interval attribute: distinctness, order & addition
• Ratio attribute: distinctness, order, addition and
multiplication

Types of Attributes

(Figure: summary table of the four attribute types and their properties)
Discrete and Continuous
Attributes (number of values)
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, ID numbers
– Often represented as integer variables
– Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight
– Continuous attributes are typically represented as floating-point
variables

Asymmetric Attributes

• For asymmetric attributes, only presence—a non-zero


attribute value—is regarded as important
• Binary attributes where only non-zero values are
important are called asymmetric binary attributes

Important Characteristics of
Data Sets
• Dimensionality
– Number of attributes in a dataset
– Curse of Dimensionality: as the number of dimensions increases, time and
space requirements also increase, making analysis difficult
– Dimensionality reduction techniques can be used
• Sparsity
– Many attributes have zero values
– Analyze only attributes with non-zero values (Asymmetric
attributes)
– Saves computation time and storage

Important Characteristics of
Data Sets
• Resolution
– Data can be obtained at different levels of resolution
– Properties of the data differ at different resolutions
– Patterns depend on the level of resolution

Types of data sets
• Record
  – Data Matrix
  – Sparse Data Matrix
  – Transaction/Market Data
• Graph
  – Data with relationships among objects (e.g., World Wide Web)
  – Dataset with objects as graphs (e.g., molecular structures)
• Ordered
  – Sequential Data
  – Sequence Data
  – Time series Data
  – Spatial Data

Record Data
• Data that consists of a collection of records, each of
which consists of a fixed set of attributes without any
relationship between the objects
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Types of record data: Data Matrix

• If data objects have the same fixed set of numeric


attributes
• The data objects can be thought of as points in a multi-
dimensional space
– where each dimension represents a distinct attribute
Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1

• Such data set can be represented by an m by n matrix,


where there are m rows, one for each object, and n
columns, one for each attribute
Types of record data: Sparse Data Matrix

• Example: Document-term matrix
• A vocabulary is formed composed of all unique terms in the documents
• Each document becomes a 'term' vector
• Each term is a component (attribute) of the vector
• The value of each component in the vector is the number of times the
corresponding term occurs in the document

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3     0     5     0     2      6     0     2      0       2
Document 2     0     7     0     2     1      0     0     3      0       0
Document 3     0     1     0     0     1      2     2     0      3       0
Transaction/Market Data
• A special type of record data
– Each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of
products purchased by a customer during one
shopping trip constitute a transaction, while the
individual products that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk

Graph Data
• Data with relationships among objects
  – Data objects are mapped to nodes
  – Relationships among objects are captured by the links and link properties
• Example: generic graph and HTML links
  – Web pages on the World Wide Web can be represented as a directed graph
    • A node corresponds to a web page
    • An edge corresponds to a hyperlink
  – Hyperlink information can be used by search engines to fetch relevant pages
Data with objects that are graphs
• Objects contain subobjects that have relationships
• Example: structure of chemical compounds
  – Benzene molecule: C6H6
  – Nodes are atoms (carbon, hydrogen)
  – Links between nodes are chemical bonds
• Useful to determine which substructures occur frequently in a set of
compounds, and to ascertain whether the presence of any of these
substructures is associated with the presence or absence of certain
chemical properties
Ordered Data
• The attributes have relationships that involve order in
time or space

Ordered Data: Sequential data

• Extension of record
data, where each record
has a time associated
with it
– Also called temporal data
– Find patterns such as “candy sales
peak before Halloween.”

• Time can also be


associated with each
attribute
– Find patterns such as “people who
buy DVD players tend to buy DVDs in
the period immediately following the
purchase”

Ordered Data: Sequence data
• Data set that is a sequence of individual entities
• Genomic sequence data
  – predicting similarities in the structure and function of genes from
similarities in nucleotide sequences
• Example fragment:
  GGTTCCGCCTTCAGCCCCGCGCC
  CGCAGGGCCCGCCCCGCGCCGTC
  GAGAAGGGCCCGCCTGGCGGGCG
  GGGGGAGGCGGGGCCGCCCGAGC
  CCAACCGAGTCCGACCAGGTGCC
  CCCTCTGCTCGGCCTAGACCTGA
  GCTCATTAGGCGGCAGCGGACAG
  GCCAAGTAGAACACGCGAAGCGC
  TGGGCTGCCTGCTGCGACCAGGG

Ordered Data: Time series data

• Each record is a time series, i.e., a series of


measurements taken over time
• Example: data set containing objects that are time series
of the daily prices of various stocks

Ordered Data: Spatial data
• Some objects have spatial attributes, such as positions
or areas, as well as other types of attributes
• Example of spatial data is weather data (precipitation,
temperature, pressure) that is collected for a variety of
geographical locations

Data Quality
• Data may not be of good quality
– Human error, limitations of measuring devices and
flaws in the data collection process
– Factors affecting data quality
• Noise and artifacts
• Outliers
• Missing values
• Inconsistent values
• Duplicate data

Noise
• Noise is a random component of measurement error
• Refers to unwanted signal that modifies (distorts) the
original values/signals
– Example: distortion of a person’s voice when talking on a poor phone

(Figure: two sine waves, and the same two sine waves with noise added)

Artifacts
• Deterministic distortion of data
• Example: A streak in the same place in a set of
photographs

Outliers
• Outliers are
– Data objects with
characteristics that are
considerably different than
most of the other data
objects in the data set
– Values of an attribute that
are different from the usual
values of the attribute
– Outliers can be legitimate
data objects or values
– Outliers are of interest

Missing Values
• Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all objects
(e.g., annual income is not applicable to children)
• Handling missing values
– Eliminate Data Objects or attributes
– Estimate Missing Values – interpolation (time series data),
average (continuous attribute), most common value (categorical
attribute)
– Ignore the Missing Value During Analysis

Handling missing values
• Eliminate Data Objects with missing values
– If many objects have missing values – reliable
analysis is not possible
– Good choice if the dataset has few objects with
missing values
• Eliminate attributes with missing values
– Should be done with caution – the attribute might be
critical for analysis

Handling missing values
• Estimate Missing Values
– For time series data where values vary in a smooth manner, use
interpolation
– Dataset with many similar data points
• Missing attribute value is continuous – replaced
with the average of the nearest neighbors
• Missing attribute value is categorical – replaced
with the most frequently occurring value

Handling missing values
• Ignore Missing Values
– Example: Clustering a set of objects
• Similarity is computed with attributes that do not
have missing values
• The similarity will be approximate

Inconsistent values
• Example: Address details
– Zip code and city are inconsistent
– Person with age 50 weighs 5kg
– Negative height values
– Unavailable product code

• Correction
– Possible sometimes
– Requires additional or redundant information

Duplicate Data
• Data set may include data objects that are duplicates, or
almost duplicates of one another
– Major issue when merging data from heterogeneous sources
• Examples:
– Same person with multiple email addresses
• Data deduplication or cleaning
– Process of dealing with duplicate data
• Correction requires additional or redundant information

Data Preprocessing
• Data preprocessing steps are to be applied to the data to
make it more suitable for data mining
• Improves data mining analysis with respect to time, cost
and quality

Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation

Aggregation
• Combining two or more objects into a single object
• Purpose
– Data reduction
• Advantages
– Reduce the number of attributes or objects
• Small datasets – less memory/processing time
– Change of scale
• High level view of data instead of low level
– More “stable” data
• Aggregated data tends to have less variability
• Disadvantage
– Loss of details

Sampling
• Sampling is the main technique employed for selecting a
subset of data
• Sampling is used in data mining because processing the
entire data is too expensive or time consuming
• The key principle for effective sampling is the following:
– Using a sample will work almost as well as using the entire data
set, if the sample is representative

– A sample is representative if it has approximately the same


property (of interest) as the original set of data

Types of Sampling
• Simple Random Sampling
– There is an equal probability of selecting any particular item
• Sampling without replacement
– As each item is selected, it is removed from the population
• Sampling with replacement
– Objects are not removed from the population as they are
selected for the sample
– In sampling with replacement, the same object can be picked up
more than once
– Simpler to analyse since the probability of selecting any object
remains constant during the sampling process

Types of Sampling
• Stratified sampling
– Used when the dataset has different types of objects with
different number of objects in each type
– Random sampling can fail to represent objects that are less
frequent
– Example: Classifier model might not be able to learn the less
frequent objects

– Split the data into several partitions/groups based on the type of


the objects
– Then draw equal number of random samples from each partition
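The partition-then-equal-draw idea above can be sketched in Python; the helper name `stratified_sample` and the toy data are illustrative, not from the slides:

```python
import random

def stratified_sample(objects, labels, per_group, seed=0):
    """Split objects into groups by label, then draw an equal number of
    random samples (without replacement) from each group."""
    rng = random.Random(seed)
    groups = {}
    for obj, lab in zip(objects, labels):
        groups.setdefault(lab, []).append(obj)
    sample = []
    for lab in sorted(groups):
        members = groups[lab]
        # never ask for more samples than the stratum contains
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample

# 90 objects of the frequent class 'a', 10 of the rare class 'b';
# simple random sampling could easily under-represent 'b'
objs = list(range(100))
labs = ['a'] * 90 + ['b'] * 10
s = stratified_sample(objs, labs, per_group=5)  # 5 from each class
```

Each class contributes the same number of objects regardless of how rare it is in the original data.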

Types of Sampling
• Progressive/Adaptive Sampling
– Used when proper sample size can be difficult to determine
– Start with a small sample, and then increase the sample size
until a sample of sufficient size has been obtained
– Evaluate the sample to judge if it is large enough

Dimensionality Reduction
• Reducing the number of features in a dataset
• Purpose
• Avoids curse of dimensionality
• Reduce amount of time and memory required by data mining
algorithms
• Allow data to be more easily visualized
• May help to eliminate irrelevant features or reduce noise

Curse of Dimensionality
• When the number of dimensions (attributes) increases, analysis becomes
significantly harder
• Data becomes increasingly sparse in the space that it
occupies
• Classification
– Model not able to learn
• Clustering and outlier detection
  – Definitions of distance between points and of density, which are critical
for these tasks, become less meaningful

Dimensionality Reduction
• Types:
– Creating new attributes that are a combination of the old
attributes
• Dimensionality reduction
– Selecting attributes that are a subset of the original attributes
• Feature subset selection or feature selection
• Dimensionality Reduction Techniques
– Principal Component Analysis (PCA)
– Singular Value Decomposition (SVD)

Dimensionality Reduction:
PCA
• Use techniques from linear algebra
• Maps the data from a high-dimensional space into a
lower-dimensional space
• Continuous data
• Finds new attributes (principal components) that
– are linear combinations of the original attributes
– are orthogonal (perpendicular) to each other
– capture the maximum amount of variation in the data
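A minimal sketch of these ideas, assuming NumPy is available: center the data, eigendecompose the covariance matrix, and project onto the top directions of variance. The function name `pca` and the random data are illustrative only:

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X onto the top principal components."""
    Xc = X - X.mean(axis=0)                 # center each attribute
    cov = np.cov(Xc, rowvar=False)          # covariance between attributes
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # largest variance first
    components = eigvecs[:, order[:n_components]]  # orthogonal directions
    return Xc @ components                  # linear combinations of attributes

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z = pca(X, 2)   # 100 objects reduced from 5 attributes to 2
```

The new attributes are linear combinations of the originals and are orthogonal to each other, as described above.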

Feature Subset Selection
 Use/Select only a subset of features
 Eliminates redundant and irrelevant features
 Redundant features
– Duplicate
– Much or all of the information is contained in one or more other
attributes
– Example: purchase price of a product and the amount of sales tax paid
 Irrelevant features
– Contain no information that is useful for the data mining task at hand
– Example: Students' ID is often irrelevant to the task of predicting
students' GPA

Feature Subset Selection
• Techniques:
– Brute-force approach
  • Try all possible feature subsets as input to the data mining
algorithm
  • For n attributes there are 2^n possible subsets
  • An evaluation function scores each subset
– Embedded approaches
• Feature selection occurs naturally as part of the data mining
algorithm
– Decision tree classifiers

Feature Subset Selection
– Filter approaches
• Features are selected before data mining algorithm is run
• Filter approaches are independent of the data mining
algorithms
• Example: Select attributes whose pairwise correlation is as
low as possible
– Wrapper approaches
• Use the data mining algorithm as a black box to find best
subset of attributes

An Architecture for Feature
Subset Selection
• A common architecture covers both filter and wrapper approaches
• The feature selection process
– a search strategy that controls the generation of a new
subset of features
– a measure for evaluating a subset
– a stopping criterion
– a validation procedure
• Difference between filter and wrapper methods (in how they evaluate a
subset of features)
– For a wrapper method, subset evaluation uses the target
data mining algorithm, while for a filter approach, the
evaluation technique is distinct from the target data
mining algorithm
Feature subset selection process

(Figure: flowchart of the feature subset selection process)
Feature Weighting
 Used to keep or eliminate features
 Important features are assigned a higher weight
 Less important features are given a lower weight
 Weights are sometimes assigned based on domain
knowledge about the relative importance of features

Feature Creation
• Create new attributes that can capture the important
information in a data set much more efficiently than the
original attributes
• The number of new attributes can be smaller than the
number of original attributes
• Three general methodologies
– Feature Extraction
– Mapping Data to New Space
• Time series data to frequency domain
– Feature Construction
• Constructing new features from already existing ones
• Density = mass/volume

Feature Extraction
• Example: histopathology image
• Image data is processed to provide higher-level features
  – Colour
  – Shape (area, diameter, …)
  – Texture
Mapping Data to New Space

(Figure: example of mapping data to a new space, e.g., a time series
transformed to the frequency domain)
Discretization and Binarization

• Some data mining algorithms require that the data be in


the form of
– Categorical attributes (Classification algorithms)
– Binary attributes (Association mining)
• Discretization
– Continuous attribute into a categorical attribute
• Binarization
– Continuous and discrete attributes into one or more binary attributes

Binarization
• Binarizing categorical attribute (Method 1)
– For each of the m categorical values, uniquely assign each
original value to an integer in the interval [0, m − 1]
– Convert each of these m integers to a binary number
• n = ceil(log2(m)) binary digits/attributes are required to
represent these integers
– Represent the binary numbers as n binary attributes
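The three steps above can be sketched in Python; the helper name `binarize_integer_coding` and the sample values are illustrative:

```python
import math

def binarize_integer_coding(values):
    """Method 1: map the m categorical values to integers 0..m-1, then
    represent each integer with n = ceil(log2(m)) binary attributes."""
    cats = sorted(set(values))
    m = len(cats)
    n = max(1, math.ceil(math.log2(m)))     # binary digits needed
    to_int = {c: i for i, c in enumerate(cats)}
    rows = []
    for v in values:
        code = to_int[v]
        # extract bits, most-significant first
        rows.append([(code >> (n - 1 - b)) & 1 for b in range(n)])
    return rows, n

# m = 5 categorical values need n = ceil(log2(5)) = 3 binary attributes
rows, n = binarize_integer_coding(["awful", "poor", "OK", "good", "great"])
```

Note that this coding introduces ordering relationships among the binary attributes, which motivates Method 2 below.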

Binarization
• Example: (table mapping each categorical value to an integer and then to
binary attributes)
Binarization
• Binarizing categorical attribute (Method 2)
– Avoid unnecessary relationships by assigning one binary
attribute for each categorical value
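Method 2 is the familiar one-hot encoding; a minimal sketch (function name and sample values illustrative):

```python
def one_hot(values):
    """Method 2: one binary (asymmetric) attribute per categorical value,
    so no spurious order or relationships are introduced."""
    cats = sorted(set(values))
    index = {c: i for i, c in enumerate(cats)}
    rows = []
    for v in values:
        row = [0] * len(cats)   # one attribute per categorical value
        row[index[v]] = 1       # exactly one attribute is "present"
        rows.append(row)
    return cats, rows

cats, rows = one_hot(["red", "green", "blue", "green"])
# 3 distinct values -> 3 binary attributes; each row has exactly one 1
```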

Discretization
• Transformation of a continuous attribute to a categorical
attribute
• Two sub tasks
– Step1: Deciding how many categories to have
– Step2: Determining how to map the values of the continuous attribute to
these categories
• Step1
– The values of the continuous attribute are sorted
– Specify n−1 split points and divide the data into n intervals
• Step2
– All the values in one interval are mapped to the same categorical value

Discretization by equal width
binning
• Divides the range of the attribute into a user-specified
number of intervals (bins) each having the same width
• Bin width w = (max – min) / (no. of bins)
• Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
• No. of bins = 3
• Bin width w = (34 – 4) / 3 = 10
• Bin 1 (4–14): 4, 8
• Bin 2 (15–24): 15, 21, 21, 24
• Bin 3 (25–34): 25, 28, 34
• Note: Data should be in ascending order before performing
equal width binning
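A small Python sketch of equal-width binning on the price data above (the function name is illustrative; boundary values are assigned to the lower bin, matching the worked example):

```python
import math

def equal_width_bins(data, n_bins):
    """Equal-width binning: width w = (max - min) / n_bins.
    Bin i covers (lo + i*w, lo + (i+1)*w]; the minimum goes in bin 0."""
    lo, hi = min(data), max(data)
    w = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(data):                      # sort first, as noted above
        i = max(0, math.ceil((v - lo) / w) - 1)
        bins[i].append(v)
    return bins

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
equal_width_bins(prices, 3)
# → [[4, 8], [15, 21, 21, 24], [25, 28, 34]]
```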

Discretization by equal
frequency binning
• Put the same number of objects (frequency) into each
interval
• Sort data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28,
34
• Partition into (equal-frequency) bins of frequency 3
• Bin 1: 4, 8, 15
• Bin 2: 21, 21, 24
• Bin 3: 25, 28, 34

• Note: Data should be in ascending order before


performing equal frequency binning
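Equal-frequency binning is simply sorting and slicing; a one-function sketch on the same price data (function name illustrative):

```python
def equal_frequency_bins(data, freq):
    """Equal-frequency binning: sort, then put `freq` consecutive
    values into each bin."""
    s = sorted(data)
    return [s[i:i + freq] for i in range(0, len(s), freq)]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
equal_frequency_bins(prices, 3)
# → [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
```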

Attribute Transformation
• Transformation applied to all values of an
attribute/variable
– Simple functions
• Apply simple mathematical function to each value
• If x is a variable then xk, ex, |x|, 1/x, √x, log(x)
– Standardization or Normalization
• Attribute values are scaled to fall within a smaller range,
such as -1.0 to 1.0, or 0.0 to 1.0.
• Prevents a variable with large values from dominating the results of the
calculation
– Neural networks, nearest-neighbor classification

Min-max normalization
• Let minA and maxA are the minimum and maximum
values of an attribute A
• Min-max normalization maps a value, vi, of A to vi' in the
range [new_minA, new_maxA] by computing

  vi' = ((vi – minA) / (maxA – minA)) × (new_maxA – new_minA) + new_minA

Min-max normalization
• Example:
– Suppose that the minimum and maximum values for
the attribute income are $12,000 and $98,000,
respectively. Given income = $73,600, map it to the
range [0.0,1.0] by min-max normalization
– Solution:

((73,600 – 12,000) / (98,000 – 12,000)) × (1.0 – 0) + 0 = 0.716
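The same calculation as a small Python function (function name illustrative):

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

min_max_normalize(73600, 12000, 98000)  # ≈ 0.716
```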

z-score normalization
• z-score normalization maps a value, vi, of an attribute A to vi' using the
mean (Ā) and standard deviation (σA) of A:

  vi' = (vi – Ā) / σA

z-score normalization
• Example:
– Suppose that the mean and standard deviation of the
values for the attribute income are $54,000 and
$16,000, respectively. Find the z-score normalized
value for a original value of $73,600 for income
– Solution:

(73,600 – 54,000) / 16,000 = 1.225
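The same calculation in Python (function name illustrative):

```python
def z_score_normalize(v, mean_a, std_a):
    """z-score normalization: (v - mean) / standard deviation."""
    return (v - mean_a) / std_a

z_score_normalize(73600, 54000, 16000)  # → 1.225
```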

Similarity and Dissimilarity
• Similarity & Dissimilarity
– Used by a number of data mining techniques
• clustering, nearest neighbor classification, and anomaly
detection
– In many cases, the initial data set is not needed once
these similarities or dissimilarities have been
computed
• Proximity refers to similarity or dissimilarity
• Dense data (time series or n-dimensional points)
– Correlation and Euclidean distance
• Sparse data (document)
– Jaccard and Cosine similarity measures
Similarity and Dissimilarity
• Similarity
– Numerical measure of how alike two data objects are
– Is higher when objects are more alike
• Dissimilarity
– Numerical measure of how different two data objects are
– Lower when objects are more alike

Transformations
• Convert a similarity to a dissimilarity or vice versa
• Transform a proximity measure to fall within a particular
range, such as [0,1]
– Proximity - Fixed range to [0, 1]
s’ = (s-s_min)/(s_max-s_min)
– Proximity – (0 – infinity) to [0,1]
d’ = d/(1+d)
• Transformation may change the meaning of the proximity
measure
– Example: correlation (a similarity measure) takes values in
the range [-1, 1]. Mapping these values to the interval
[0, 1] by taking the absolute value loses information about
the sign
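The two transformations given above can be written directly in Python (function names illustrative):

```python
def similarity_to_unit_range(s, s_min, s_max):
    """Rescale a similarity with a fixed range to [0, 1]:
    s' = (s - s_min) / (s_max - s_min)."""
    return (s - s_min) / (s_max - s_min)

def dissimilarity_to_unit_range(d):
    """Map a dissimilarity in [0, infinity) to [0, 1): d' = d / (1 + d)."""
    return d / (1.0 + d)

similarity_to_unit_range(-0.5, -1.0, 1.0)  # correlation -0.5 → 0.25
dissimilarity_to_unit_range(3.0)           # → 0.75
```

Note how the first call illustrates the caveat above: the sign of the correlation is folded into the position within [0, 1] rather than discarded.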
Similarity/Dissimilarity for
Simple Attributes
• p and q are the attribute values for two data objects

Attribute type   Dissimilarity                          Similarity
Nominal          d = 0 if p = q, d = 1 if p ≠ q         s = 1 if p = q, s = 0 if p ≠ q
Ordinal          d = |p – q| / (n – 1)                  s = 1 – d
                 (values mapped to integers 0 to n–1)
Interval/Ratio   d = |p – q|                            s = –d, s = 1/(1 + d), etc.

Dissimilarity matrix
• Stores a collection of dissimilarities for all pairs of n objects
• Represented by an n-by-n table/matrix where n is the number of objects
• d(i, j) is the dissimilarity or “difference” between objects i and j
• d(i, j) is a non-negative number
• d(i, j) = 0 means the objects are highly similar
• Note that d(i, i) = 0; that is, the distance between an object and itself is 0
• The matrix is symmetric, i.e., d(i, j) = d(j, i)

Example
• Suppose that we have the sample data given below.
Compute the dissimilarity matrix using the attribute, test-1

Dissimilarity matrix

test-1 1 2 3 4
1 0 1 1 0
2 1 0 1 1
3 1 1 0 1
4 0 1 1 0

Example
• Suppose that we have the sample data given below.
Compute the dissimilarity matrix using the attribute test-2
Dissimilarity matrix

test-2 1 2 3 4
1 0 1 0.5 0
2 1 0 0.5 1
3 0.5 0.5 0 0.5
4 0 1 0.5 0

Excellent (2), good (1), fair(0)

Dissimilarities between
Data Objects
• The Euclidean distance, d, between two points, x and y, in one-,
two-, three-, or higher dimensional space, is given by the following
formula

  d(x, y) = ((x1 – y1)^2 + (x2 – y2)^2 + … + (xn – yn)^2)^(1/2)

  where n is the number of dimensions (attributes) and xk and yk
are, respectively, the kth attributes (components) of data objects x
and y
• Standardization is necessary, if scales differ
• The Euclidean distance is always greater than or equal to zero

Dissimilarities between
Data Objects
Squared Euclidean Distance
• The squared Euclidean distance, d, between two points, a and
b, in one-, two-, three-, or higher dimensional space, is given
by the following formula

  d(a, b) = (a1 – b1)^2 + (a2 – b2)^2 + … + (ak – bk)^2

  where k is the number of dimensions (attributes) and aj and
bj are, respectively, the jth attributes (components) of data
objects a and b

Minkowski Distance
• Minkowski distance is a generalization of Euclidean
distance

  d(x, y) = (|x1 – y1|^r + |x2 – y2|^r + … + |xn – yn|^r)^(1/r)

• where r is a parameter, n is the number of dimensions
(attributes) and xk and yk are, respectively, the kth
attributes (components) of data objects x and y

Minkowski Distance:
Examples
• r = 1. City block (Manhattan, taxicab, L1 norm) distance
  – d(x,y) = |x1 – y1| + |x2 – y2| + … + |xn – yn|
  – The City block distance is always greater than or equal to zero
• r = 2. Euclidean distance or L2 norm
  – d(x,y) = (|x1 – y1|^2 + |x2 – y2|^2 + … + |xn – yn|^2)^(1/2)
• r → ∞. “supremum” (Lmax norm, L∞ norm) distance
  – This is the maximum difference between any
component of the vectors
  • d(x,y) = max {|x1 – y1|, |x2 – y2|, … , |xn – yn|}
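All three cases can be computed with two small Python functions (names and example points illustrative):

```python
def minkowski(x, y, r):
    """Minkowski distance: r=1 gives city block, r=2 gives Euclidean."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

def supremum(x, y):
    """L-infinity norm: the maximum difference over any component."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (0, 0), (3, 4)
minkowski(x, y, 1)  # → 7.0 (city block)
minkowski(x, y, 2)  # → 5.0 (Euclidean)
supremum(x, y)      # → 4
```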

Common Properties of a
Distance
• Distance is often used to refer to dissimilarities
1. Non-negativity/Positivity:
  • d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q
2. Symmetry:
  • d(p, q) = d(q, p) for all p and q
3. Triangle inequality:
  • d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r
• A distance that satisfies these properties is a metric

Similarities between Data
Objects
• Similarities, also have some well known properties.

1. s(p, q) = 1 (or maximum similarity) only if p = q.

2. s(p, q) = s(q, p) for all p and q. (Symmetry)

where s(p, q) is the similarity between points (data


objects), p and q

Similarity Between Binary
Vectors
• Similarity between binary objects is called a similarity
coefficient
• Let objects, p and q, have n binary attributes
• Define the following quantities
M01 = the number of attributes where p is 0 and q is 1
M10 = the number of attributes where p is 1 and q is 0
M00 = the number of attributes where p is 0 and q is 0
M11 = the number of attributes where p is 1 and q is 1
• Simple Matching and Jaccard Coefficients
SMC = number of matches / number of attributes
= (M11 + M00) / (M01 + M10 + M11 + M00)
J = number of 11 matches / number of not-both-zero
attributes values
= (M11) / (M01 + M10 + M11)
SMC and Jaccard: Example

• Compute the SMC and Jaccard coefficient for two data objects p and q:
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1

SMC and Jaccard: Solution
p= 1000000000
q= 0000001001

M01 = 2 (the number of attributes where p is 0 and q is 1)


M10 = 1 (the number of attributes where p is 1 and q is 0)
M00 = 7 (the number of attributes where p is 0 and q is 0)
M11 = 0 (the number of attributes where p is 1 and q is 1)
SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) /
(2+1+0+7) = 0.7

J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
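The worked example above can be reproduced with a short Python sketch (the helper name is my own, not from the slides):

```python
def smc_and_jaccard(p, q):
    """Simple Matching Coefficient and Jaccard coefficient for binary vectors."""
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    mismatches = sum(1 for a, b in zip(p, q) if a != b)  # M01 + M10
    smc = (m11 + m00) / (m11 + m00 + mismatches)
    jaccard = m11 / (m11 + mismatches) if (m11 + mismatches) else 0.0
    return smc, jaccard

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_and_jaccard(p, q))  # (0.7, 0.0)
```

The gap between the two values shows why Jaccard suits asymmetric binary attributes: the seven 0–0 matches inflate the SMC but carry no information about shared presences.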


Cosine Similarity
• If x and y are two document vectors, then

cos(x, y) = (x • y) / (||x|| ||y||)

where • indicates the vector dot product and ||x|| is the length (L2 norm) of vector x
• The cosine similarity ranges from +1 to −1
– +1 is the highest similarity
– completely opposite vectors have similarity −1

Cosine Similarity
• If the cosine similarity is 1
– The angle between x and y is 0◦
– x and y are the same except for magnitude
• If the cosine similarity is 0
– The angle between x and y is 90◦
• If the cosine similarity is -1
– The angle between x and y is 180◦

Cosine Similarity: Example

• Compute the cosine similarity between the two document vectors d1 and d2:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2

Cosine Similarity: Solution

d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2

d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150
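The same computation can be sketched in Python (an illustrative helper, not from the slides):

```python
def cosine_similarity(x, y):
    """Cosine of the angle between two vectors: (x . y) / (||x|| ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sum(a * a for a in x) ** 0.5
    norm_y = sum(b * b for b in y) ** 0.5
    return dot / (norm_x * norm_y)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine_similarity(d1, d2), 4))  # 0.315
```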

Extended Jaccard Coefficient
(Tanimoto)
• Used for finding the similarity between documents (data)
– Reduces to the Jaccard coefficient for binary attributes

T(d1, d2) = (d1 • d2) / (||d1||^2 + ||d2||^2 − d1 • d2)

– Compute the Extended Jaccard similarity between d1 and d2:
• d1 = 1 0 1 0 1 and d2 = 1 1 1 0 1
• d1 = 1 1 1 1 1 and d2 = 1 1 1 1 1

Extended Jaccard Coefficient (Tanimoto)

• Solution
• d1 = 1 0 1 0 1
• d2 = 1 1 1 0 1
• d1 • d2 = 3
• ||d1||^2 = 1 + 1 + 1 = 3
• ||d2||^2 = 1 + 1 + 1 + 1 = 4
• T(d1, d2) = 3 / (3 + 4 − 3) = 3/4 = 0.75

Extended Jaccard Coefficient (Tanimoto)

• Solution
• d1 = 1 1 1 1 1
• d2 = 1 1 1 1 1
• d1 • d2 = 5
• ||d1||^2 = 1 + 1 + 1 + 1 + 1 = 5
• ||d2||^2 = 1 + 1 + 1 + 1 + 1 = 5
• T(d1, d2) = 5 / (5 + 5 − 5) = 1
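Both worked cases can be checked with a short Python sketch (an illustrative helper, not from the slides):

```python
def tanimoto(x, y):
    """Extended Jaccard (Tanimoto): (x . y) / (||x||^2 + ||y||^2 - x . y)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x_sq = sum(a * a for a in x)
    norm_y_sq = sum(b * b for b in y)
    return dot / (norm_x_sq + norm_y_sq - dot)

print(tanimoto([1, 0, 1, 0, 1], [1, 1, 1, 0, 1]))  # 0.75
print(tanimoto([1, 1, 1, 1, 1], [1, 1, 1, 1, 1]))  # 1.0
```

For 0/1 vectors, the dot product counts M11 and the squared norms count the 1s in each vector, which is why the expression reduces to the Jaccard coefficient M11 / (M01 + M10 + M11).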

Correlation
• The correlation coefficient, also known as Pearson’s coefficient (r):
– Applies to numeric attributes
– Measures the linear relationship between objects
– Takes values between −1 and +1
• If correlation coefficient (A, B) > 0:
– Attributes A and B are positively correlated, meaning that the values of A increase as the values of B increase
– The higher the value, the stronger the correlation; hence, a high value may indicate that A (or B) could be removed as redundant

Correlation
• If Correlation coefficient (A,B) = 0
– Attributes A and B are independent and there is no correlation
between them
• If Correlation coefficient (A,B) < 0
– Attributes A and B are negatively correlated, where the values of
one attribute increase as the values of the other attribute
decrease
– Each attribute discourages the other

Correlation

r = covariance(x, y) / (s_x * s_y)

where covariance(x, y) = Σ (x_i − x*)(y_i − y*) / (n − 1), x* and y* are the means, and s_x and s_y are the sample standard deviations of x and y
Correlation: Example
• The time x in years that an employee has spent at a company and the employee's hourly pay, y, for 5 employees are listed in the table below. Calculate and interpret the correlation coefficient r.

x: 5, 3, 4, 10, 15
y: 25, 20, 21, 35, 38

Correlation: Solution

x     y     x-x*    y-y*    (x-x*)(y-y*)   (x-x*)^2   (y-y*)^2
5     25    -2.4    -2.8     6.72           5.76       7.84
3     20    -4.4    -7.8    34.32          19.36      60.84
4     21    -3.4    -6.8    23.12          11.56      46.24
10    35     2.6     7.2    18.72           6.76      51.84
15    38     7.6    10.2    77.52          57.76     104.04
Sum:  37    139             160.4          101.2     270.8

Average (x*) = 37/5 = 7.4
Average (y*) = 139/5 = 27.8
s_x = (101.2/4)^(1/2) = 5.03
s_y = (270.8/4)^(1/2) = 8.23
Covariance(x, y) = 160.4/4 = 40.1
Correlation(x, y) = 40.1 / (5.03 * 8.23) = 0.97
Correlation: Solution
• There is a strong positive correlation between the number of years an employee has worked and the employee's hourly pay, since r is very close to 1.
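The whole worked solution can be reproduced with a short Python sketch (an illustrative helper, not from the slides; it uses the sample formulas with n − 1 in both the covariance and the standard deviations):

```python
def pearson_r(x, y):
    """Sample Pearson correlation coefficient: cov(x, y) / (s_x * s_y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    s_x = (sum((a - mean_x) ** 2 for a in x) / (n - 1)) ** 0.5
    s_y = (sum((b - mean_y) ** 2 for b in y) / (n - 1)) ** 0.5
    return cov / (s_x * s_y)

x = [5, 3, 4, 10, 15]   # years at the company
y = [25, 20, 21, 35, 38]  # hourly pay
print(round(pearson_r(x, y), 2))  # 0.97
```

The n − 1 factors cancel between numerator and denominator, so using n throughout would give the same r; the split into covariance and standard deviations simply mirrors the hand computation above.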

Thank you

