Know Your Data

The document provides an overview of data attributes, including their types (nominal, binary, ordinal, numeric) and characteristics, as well as the importance of understanding data for preprocessing in data mining. It discusses the properties of attribute values, the distinction between discrete and continuous attributes, and basic statistical descriptions of data, including measures of central tendency and dispersion. Additionally, it highlights methods for visualizing data and identifying outliers.


Know Your Data

(Data, and about data)


Getting to Know Your Data
• Real-world data are typically noisy, enormous in volume (often several
gigabytes or more), and may originate from a hodgepodge of
heterogeneous sources.
• Knowledge about your data is useful for data preprocessing, which is
the first major task of the data mining process.
▪ What are the types of attributes or fields that make up your data?
▪ What kind of values does each attribute have?
▪ Which attributes are discrete, and which are continuous-valued?
▪ What does the data look like?
▪ How are the values distributed?
▪ Can we measure the similarity of some data objects with respect to
others?
What Is an Attribute?
• An attribute is a data field, representing a characteristic or feature of a
data object.
• The nouns attribute, dimension, feature, and variable are often used
interchangeably in the literature.
• The term dimension is commonly used in data warehousing.
• Machine learning literature tends to use the term feature.
• Statisticians prefer the term variable.
• Data mining and database professionals commonly use the term
attribute.
• Database rows -> data objects; columns -> attributes.
What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
  ▪ Examples: eye color of a person, temperature, etc.
  ▪ Attribute is also known as variable, field, characteristic, dimension, or feature
• A collection of attributes describe an object
  ▪ Object is also known as record, point, case, sample, entity, or instance

Example (rows are objects, columns are attributes):

Tid | Refund | Marital Status | Taxable Income | Cheat
 1  | Yes    | Single         | 125K           | No
 2  | No     | Married        | 100K           | No
 3  | No     | Single         | 70K            | No
 4  | Yes    | Married        | 120K           | No
 5  | No     | Divorced       | 95K            | Yes
 6  | No     | Married        | 60K            | No
 7  | Yes    | Divorced       | 220K           | No
 8  | No     | Single         | 85K            | Yes
 9  | No     | Married        | 75K            | No
 10 | No     | Single         | 90K            | Yes
Attribute Types
The type of an attribute is determined by the set of possible values—
nominal, binary, ordinal, or numeric—the attribute can have.
1) Nominal Attributes
2) Binary Attributes
3) Ordinal Attributes
4) Numeric Attributes
i. Interval-Scaled Attributes
ii. Ratio-Scaled Attributes
Nominal Attributes
• Nominal means relating to names.
• The values of a nominal attribute are symbols or names of things.
• Each value represents some kind of category, code, or state, and so
nominal attributes are also referred to as categorical.
• The values do not have any meaningful order.
• In computer science, the values are also known as enumerations.
• For example, hair_color, marital_status, and occupation are nominal
attributes.
• The values for hair_color can be black, brown, blond, red, gray, and
white.
• The attribute marital_status can take on the values single, married,
divorced, and widowed.
• For occupation, the values can be teacher, dentist, programmer, farmer,
etc.
Binary Attributes
• A binary attribute is a nominal attribute with only two categories or
states: 0 or 1, where 0 means that the attribute is absent, and 1 means
that it is present.
• Binary attributes are referred to as Boolean if the two states correspond
to true and false.
• The attribute medical test is binary, where a value of 1 means the result
of the test for the patient is positive, while 0 means the result is
negative.
• A binary attribute is symmetric if both of its states are equally valuable
and carry the same weight. e.g., gender.
• A binary attribute is asymmetric if the outcomes of the states are not
equally important, such as the positive and negative outcomes of a
medical test for HIV.
Ordinal Attributes
• An ordinal attribute is an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude between
successive values is not known.
• For example: the size of drinks available at a fast-food restaurant. This
ordinal attribute has three possible values: small, medium, and large.
• Examples of ordinal attributes also include grade (e.g., A+, A, A−, B+).
• Other example: Professional ranks can be enumerated in a sequential
order: for example, assistant, associate, and full for professors.
• Customer satisfaction had the following ordinal categories:
• 0: very dissatisfied, 1: somewhat dissatisfied,
• 2: neutral, 3: satisfied, and 4: very satisfied.
Numeric Attributes
• A numeric attribute is quantitative; that is, it is a measurable quantity,
represented in integer or real values. Numeric attributes can be
interval-scaled or ratio-scaled.
• Interval-Scaled Attributes:
▪Interval-scaled attributes are measured on a scale of equal-size units.
The values of interval-scaled attributes have order and can be
positive, 0, or negative. For example, temperatures measured in Celsius can be -10°C, 0°C, or 10°C
▪In addition to providing a ranking of values, such attributes allow us
to compare and quantify the difference between values.
▪we can compute their mean value, in addition to the median and
mode measures of central tendency.
Numeric Attributes
• A numeric attribute is quantitative; that is, it is a measurable quantity,
represented in integer or real values. Numeric attributes can be
interval-scaled or ratio-scaled.
• Ratio-Scaled Attributes:
▪A ratio-scaled attribute is a numeric attribute with an inherent zero-point.
That is, if a measurement is ratio-scaled, we can speak of a
value as being a multiple (or ratio) of another value.
▪In addition, the values are ordered, and we can also compute the
difference between values, as well as the mean, median, and mode.
▪Unlike temperatures in Celsius and Fahrenheit, the Kelvin (K)
temperature scale has what is considered a true zero-point: 0 K marks
the absence of thermal energy.
▪Other examples include attributes to measure weight, height, latitude
and longitude coordinates, and monetary quantities (e.g., you are 100
times richer with $100 than with $1).
Properties of Attribute Values
The type of an attribute depends on which of the following
properties/operations it possesses:

✓Nominal attribute: distinctness


✓Ordinal attribute: distinctness & order
✓Interval attribute: distinctness, order & meaningful differences
✓Ratio attribute: all 4 properties/operations
Discrete versus Continuous Attributes
• A discrete attribute has a finite or countably infinite set of values,
which may or may not be represented as integers.
• The attributes hair color, smoker, medical test, and drink size each have
a finite number of values, and so are discrete.
• Discrete attributes may have numeric values, such as 0 and 1 for binary
attributes or, the values 0 to 110 for the attribute age.
• Note: Binary attributes are a special case of discrete attributes
Discrete versus Continuous Attributes
• If an attribute is not discrete, it is continuous. In practice, real values are
represented using a finite number of digits.
• Continuous attributes are typically represented as floating-point
variables.
• In the literature, the terms numeric attribute and continuous attribute are
often used interchangeably.
• Example: a temperature like 25.8°C or a speed like 60.5 km/h.
Basic Statistical Descriptions of Data
1. Measuring the central tendency: which measure the location of the
middle or center of a data distribution.
• mean, median, mode, and midrange.
2. Measuring the dispersion of data: how are the data spread out?
• The most common data dispersion measures are the range, quartiles, and
interquartile range;
• The five-number summary and boxplots; and the variance and standard
deviation of the data.
3. Graphic displays of basic statistical descriptions: visually inspect the
data
• Data presentation software includes bar charts, pie charts, and line
graphs.
• Other popular displays of data summaries and distributions include
quantile plots, quantile–quantile plots, histograms, and scatter plots.
1. Measuring the Central Tendency
• Various ways to measure the central tendency of data.
• Mean, Median, and Mode
• The most common and effective numeric measure of the “center” of a
set of data is the (arithmetic) mean.
• Let x1,x2,…,xN be a set of N values or observations, such as for some
numeric attribute X, like salary.
• The mean of this set of values is

  x̄ = (x1 + x2 + ⋯ + xN) / N

• We have the following values for salary (in thousands of dollars):


30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.

• mean salary is $58,000.
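As a quick check, the slide's mean can be reproduced in a few lines of Python (a minimal sketch, not part of the original slides):

```python
# Mean of the salary values (in thousands of dollars) from the slide.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = sum(salaries) / len(salaries)
print(mean)  # 58.0, i.e., a mean salary of $58,000
```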


1. Measuring the Central Tendency
Weighted arithmetic mean or the weighted average:
• Sometimes, each value xi in a set may be associated with a weight wi for
i = 1,…,N.
• The weights reflect the significance, importance, or occurrence
frequency attached to their respective values.
• The weighted average is then  x̄ = Σᵢ wᵢ xᵢ / Σᵢ wᵢ
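A sketch of the weighted arithmetic mean; the values and weights below are hypothetical, chosen only to illustrate the computation:

```python
# Weighted arithmetic mean: each value x_i carries a weight w_i.
# Hypothetical values and weights (e.g., occurrence frequencies).
values  = [30, 50, 70]
weights = [2, 5, 3]

weighted_mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)
print(weighted_mean)  # (60 + 250 + 210) / 10 = 52.0
```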
1. Measuring the Central Tendency
Limitations of mean:
• A major problem with the mean is its sensitivity to extreme (e.g.,
outlier) values.
• Even a small (or a large) number of extreme values can corrupt the
mean.
• For example, the mean salary at a company may be substantially
pushed up by that of a few highly paid managers.
1. Measuring the Central Tendency
Trimmed mean:
• To offset the effect caused by a small (or a large) number of extreme
values, we can instead use the trimmed mean,
• Trimmed mean obtained after chopping off values at the high and low
extremes.
• For example, we can sort the values observed for salary and remove the
top and bottom 2% before computing the mean.
• We should avoid trimming too large a portion (such as 20%) at both
ends, as this can result in the loss of valuable information.
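The trimmed mean can be sketched as below. Rather than a percentage, this version takes the number of values to chop per end directly (an implementation choice, not from the slides); with the slide's 12 salaries, trimming one value per end removes the extremes 30 and 110:

```python
def trimmed_mean(values, k):
    """Mean after chopping off the k smallest and k largest values."""
    data = sorted(values)
    kept = data[k:len(data) - k] if k > 0 else data
    return sum(kept) / len(kept)

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(trimmed_mean(salaries, k=1))  # mean of the middle 10 values = 55.6
```

Note how the trimmed mean (55.6) sits below the ordinary mean (58.0), since the single high outlier 110 no longer pulls it up.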
1. Measuring the Central Tendency
Median:
• The data must be sorted in an order.
• Given an odd number of values, the median is the middlemost value.
• If there is an even number of observations, the median is
not unique.
• It can be any value within the two middlemost values. By convention,
we assign the average of the two middlemost values.
1. Measuring the Central Tendency
Median:
• The median is expensive to compute when we have a large number of
observations.
• Assume that data are grouped in intervals according to their xi data
values and that the frequency of each interval is known.
• For example, employees may be grouped according to their annual
salary in intervals such as $10–20,000, $20–30,000, and so on.
• Let the interval that contains the median frequency be the median interval.
1. Measuring the Central Tendency
Median:
• We can approximate the median of the entire data set by interpolation
using the formula

  median ≈ L1 + ((N/2 − Σ freql) / freqmedian) × width

• where L1 is the lower boundary of the median interval
• N is the number of values in the entire data set
• Σ freql is the sum of the frequencies of all of the intervals that are
lower than the median interval
• freqmedian is the frequency of the median interval
• width is the width of the median interval
1. Measuring the Central Tendency
• Worked example of the median interpolation, with the values read from
the slide's grouped-frequency figure:
  L1 = 59.5, N = 50 (so N/2 = 25), Σ freql = 14, freqmedian = 12,
  width = 69.5 − 59.5 = 10
• median ≈ 59.5 + ((25 − 14) / 12) × 10 ≈ 68.67
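The interpolation for grouped data can be sketched directly from its parameters; plugging in the slide's values reproduces the approximate median:

```python
def grouped_median(L1, N, cum_freq_below, freq_median, width):
    """Approximate the median of grouped data by interpolation
    within the median interval."""
    return L1 + ((N / 2 - cum_freq_below) / freq_median) * width

# Values from the slide: L1 = 59.5, N = 50, sum of lower freqs = 14,
# freq_median = 12, width = 10.
m = grouped_median(59.5, 50, 14, 12, 10)
print(round(m, 2))  # 68.67
```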
1. Measuring the Central Tendency
Mode:
• The value that occurs most frequently in the set.
• Mode can be determined for qualitative and quantitative attributes.
• More than one mode: It is possible for the greatest frequency to
correspond to several different values.
▪ Unimodal: one mode
▪ Bimodal: two modes
▪ Trimodal: three modes
▪ Multimodal: two or more modes
NOTE: if each data value occurs only once in a set, then there is no
mode.
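A small sketch of mode-finding that also covers the multimodal and no-mode cases noted above (the sample lists are made up for illustration):

```python
from collections import Counter

def modes(values):
    """Return all values occurring with the greatest frequency.
    If every value occurs exactly once, there is no mode."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # each value occurs once -> no mode
    return sorted(v for v, c in counts.items() if c == top)

print(modes([52, 52, 56, 60, 70, 70]))  # [52, 70] -> bimodal
print(modes([1, 2, 3]))                 # [] -> no mode
```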
1. Measuring the Central Tendency
Midrange:
• Midrange can also be used to assess the central tendency of a numeric
data set.
• It is the average of the largest and smallest values in the set.
• In a unimodal frequency curve with perfect symmetric data distribution,
the mean, median, and mode are all at the same center value
1. Measuring the Central Tendency
• Data in most real applications are not symmetric.
• Positively skewed: the mode occurs at a value that is smaller than the
median.
• Negatively skewed: the mode occurs at a value greater than the
median.
2. Measuring the Dispersion of Data
• Dispersion or spread of numeric data
• The measures include Range, Quantiles, Quartiles, Percentiles,
Interquartile Range and Variance, and Standard Deviation.
• The five-number summary, which can be displayed as a boxplot, is
useful in identifying outliers.
2. Measuring the Dispersion of Data
• Dispersion or spread of numeric data
• Let x1,x2,…,xN be a set of observations for some numeric attribute, X.
• Range: the difference between the largest (max()) and smallest (min())
values.
• Quantiles: the points taken at regular intervals of a data distribution,
dividing it into essentially equal-size consecutive sets.
• Split the data distribution into equal-size consecutive sets.
• Quartiles: The 4-quantiles are the three data points that split the data
distribution into four equal parts;
• Each part represents one-fourth of the data distribution.
• Percentiles: The 100-quantiles are referred to as percentiles
• They divide the data distribution into 100 equal-sized consecutive sets.
NOTE: Median, Quartiles, and Percentiles are the most widely used forms
of quantiles.
2. Measuring the Dispersion of Data
• Quartile: three values that split the sorted data set into four equal
parts.
• The first quartile, denoted by Q1, is the 25th percentile —it cuts off the
lowest 25% of the data.
• The third quartile, denoted by Q3, is the 75th percentile—it cuts off the
lowest 75% (or highest 25%) of the data.
• The Q2 is the 50th percentile. As the median, it gives the center of the
data distribution.
2. Measuring the Dispersion of Data
• Interquartile range: measure of spread (or range) covered by the
middle half of the data.
IQR = Q3 − Q1
• For example- sorted data: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
• With 12 values, Q1 is the median of the lower half of the sorted data
and Q3 is the median of the upper half.
• Q1 = (47 + 50)/2 = 48.5 and Q3 = (63 + 70)/2 = 66.5.
• So, the interquartile range is IQR = 66.5 − 48.5 = 18
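The quartiles and IQR for the slide's data can be sketched using the median-of-halves convention (one of several quartile conventions; this one matches the numbers quoted on the slide for an even-sized data set):

```python
from statistics import median

def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves
    (convention suited to even-sized data sets)."""
    data = sorted(values)
    half = len(data) // 2
    return median(data[:half]), median(data[-half:])

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
q1, q3 = quartiles(data)
print(q1, q3, q3 - q1)  # 48.5 66.5 18.0
```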
2. Measuring the Dispersion of Data
• Five-Number Summary, Boxplots, and Outliers:
• No single numeric measure of spread (e.g., IQR) is very useful for
describing skewed distributions.
• In the symmetric distribution, the median (and other measures of
central tendency) splits the data into equal-size halves. This does not
occur for skewed distributions.
• Outlier detection: A common rule of thumb for identifying suspected
outliers is to single out values falling at least 1.5 × IQR above the third
quartile or below the first quartile
• Five-number summary: Q1, the median, and Q3 together contain no
information about the endpoints (e.g., tails) of the data, a fuller
summary of the shape of a distribution can be obtained by providing
the lowest and highest data values as well.
2. Measuring the Dispersion of Data
• Five-Number Summary, Boxplots, and Outliers:
• It consists of the median (Q2), the quartiles Q1 and Q3, and the
smallest and largest individual observations, written in the order of
Minimum, Q1, Median, Q3, Maximum.
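The five-number summary and the 1.5 × IQR outlier rule of thumb can be sketched together; applied to the salary data used earlier, the rule flags the value 110 as a suspected outlier:

```python
from statistics import median

def five_number_summary(values):
    """Minimum, Q1, Median, Q3, Maximum (quartiles as medians of halves)."""
    data = sorted(values)
    half = len(data) // 2
    q1, q3 = median(data[:half]), median(data[-half:])
    return data[0], q1, median(data), q3, data[-1]

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
mn, q1, med, q3, mx = five_number_summary(data)
iqr = q3 - q1

# 1.5 x IQR rule of thumb: flag values beyond the fences.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low or x > high]
print((mn, q1, med, q3, mx), outliers)  # (30, 48.5, 54.0, 66.5, 110) [110]
```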
2. Measuring the Dispersion of Data
• Five-Number Summary, Boxplots, and Outliers:
• Boxplots are a popular way of visualizing a distribution.
• A boxplot incorporates the five-number summary as follows:
• Ends of the box are at the quartiles (Q1 & Q3) so that the box length is
IQR.
• The median is marked by a line within the box.
• Two lines (called whiskers) outside the box extend to the smallest
(Minimum) and largest (Maximum) observations.
2. Measuring the Dispersion of Data
• Five-Number Summary, Boxplots, and Outliers:
• The whiskers terminate at the most extreme
observations occurring within 1.5 × IQR of the
quartiles.
• Figure shows boxplots for unit price data for
items sold at four branches.
• For branch 1, the Median price of items sold is
$80, Q1 is $60, and Q3 is $100.
• Two outlying observations for this branch were
plotted individually, as their values of 175 and
202 are more than 1.5 times the IQR (40).
2. Measuring the Dispersion of Data
• Variance and Standard Deviation:
• Variance and standard deviation indicate how spread out a data distribution is.
• A low standard deviation means that the data observations tend to be
very close to the mean.
• A high standard deviation indicates that the data are spread out over a
large range of values.
• The variance of N observations, x1,x2,…,xN, for a numeric attribute X is

  σ² = (1/N) Σᵢ (xᵢ − x̄)²

• where x̄ is the mean value of the observations
• The standard deviation (σ) is the square root of the variance, σ².
2. Measuring the Dispersion of Data
• Variance and Standard Deviation:
• The basic properties of the standard deviation are as follows:
✓ σ measures spread about the mean and should be considered only
when the mean is chosen as the measure of center.
✓ σ = 0 only when there is no spread, that is, when all observations
have the same value. Otherwise, σ > 0.
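The population variance (dividing by N, as in the formula above) and standard deviation can be sketched and checked on the salary data; `statistics.pstdev` would give the same σ:

```python
from math import sqrt

def variance(values):
    """Population variance: mean squared deviation from the mean
    (divide by N, matching the slide's formula)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
var = variance(data)
sigma = sqrt(var)
print(round(var, 2), round(sigma, 2))  # 379.17 19.47
```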
Important Characteristics of Data
• Dimensionality (number of attributes)
✓ High dimensional data brings a number of challenges
• Sparsity
➢ Only presence counts
• Resolution
o Patterns depend on the scale
• Size
▪ Type of analysis may depend on size of data
Types of data sets
• Record
➢Data Matrix
➢Document Data
➢Transaction Data
• Graph
✓World Wide Web
✓Molecular Structures
• Ordered
❑Spatial Data
❑Temporal Data
❑Sequential Data
❑Genetic Sequence Data
Record Data
• Data that consists of a collection of records, each of which
consists of a fixed set of attributes
Tid | Refund | Marital Status | Taxable Income | Cheat
 1  | Yes    | Single         | 125K           | No
 2  | No     | Married        | 100K           | No
 3  | No     | Single         | 70K            | No
 4  | Yes    | Married        | 120K           | No
 5  | No     | Divorced       | 95K            | Yes
 6  | No     | Married        | 60K            | No
 7  | Yes    | Divorced       | 220K           | No
 8  | No     | Single         | 85K            | Yes
 9  | No     | Married        | 75K            | No
 10 | No     | Single         | 90K            | Yes
Data Matrix
• If data objects have the same fixed set of numeric attributes, then
the data objects can be thought of as points in a multi-dimensional
space, where each dimension represents a distinct attribute.
• Such a data set can be represented by an m by n matrix, where there
are m rows, one for each object, and n columns, one for each
attribute

- One Dimension/Attribute: a student represented as a point on a number
  line of Math scores.
- Two Dimensions/Attributes: a student represented as a point on a 2-D
  plot of (Math, Science) scores.
- Multi Dimensions/Attributes: a student represented as a point in an
  n-dimensional space of n attributes.

Example of an m-by-n data matrix (m = 2 objects, n = 5 attributes):

Projection of x Load | Projection of y Load | Distance | Load | Thickness
10.23                | 5.27                 | 15.22    | 2.7  | 1.2
12.65                | 6.25                 | 16.22    | 2.2  | 1.1
Document Data
Each document becomes a ‘term’ vector
• Each term is a component (attribute) of the vector
• The value of each component is the number of times the
corresponding term occurs in the document.

Example term vectors (counts of each term per document):

Document   | team | coach | play | ball | score | game | win | lost | timeout | season
Document 1 |  3   |  0    |  5   |  0   |  2    |  6   |  0  |  2   |  0      |  2
Document 2 |  0   |  7    |  0   |  2   |  1    |  0   |  0  |  3   |  0      |  0
Document 3 |  0   |  1    |  0   |  0   |  1    |  2   |  2  |  0   |  3      |  0
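A minimal sketch of how such term vectors can be built from raw text; the tiny vocabulary and document string below are hypothetical, chosen only to show the counting step:

```python
from collections import Counter

def term_vector(document, vocabulary):
    """One component per vocabulary term: the number of times
    the term occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts[t] for t in vocabulary]

# Hypothetical vocabulary and document for illustration.
vocab = ["team", "coach", "play", "ball", "score"]
doc = "team play play score team ball"
print(term_vector(doc, vocab))  # [2, 0, 2, 1, 1]
```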
Transaction Data
A special type of data, where
• Each transaction involves a set of items.
• For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitute a
transaction, while the individual products that were purchased
are the items.
• Can represent transaction data as record data.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Graph Data
• Examples: Generic graph, a molecule, and webpages

[Figure: a generic graph with numeric edge labels, the benzene molecule C6H6, and linked webpages]


Ordered Data
• Sequences of transactions

[Figure: a sequence of transactions over time; each element of the sequence contains a set of items/events]
Ordered Data
• Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Data Quality
• Poor data quality negatively affects many data processing efforts
• What kinds of data quality problems? Examples of data quality
problems:
✓Noise and outliers
✓Wrong data
✓Fake data
✓Missing values
✓Duplicate data
Noise
• For objects, noise is an extraneous object
• For attributes, noise refers to modification of original values
• Examples: distortion of a person’s voice when talking on a poor
phone and “snow” on television screen
• The figures below show two sine waves of the same magnitude
and different frequencies, the waves combined, and the two
sine waves with random noise
• The magnitude and shape of the original signal is distorted
Outliers
• Outliers are data objects with characteristics that are considerably
different than most of the other data objects in the data set
• Case 1: Outliers are
noise that interferes
with data analysis

• Case 2: Outliers are


the goal of our analysis
• Credit card fraud
• Intrusion detection

• Causes?
Missing Values
• Reasons for missing values
• Information is not collected
(e.g., people decline to give their age and weight)
• Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

• Handling missing values


• Eliminate data objects or variables
• Estimate missing values
• Example: time series of temperature
• Example: census results
• Ignore the missing value during analysis
Duplicate Data
• Data set may include data objects that are duplicates, or almost
duplicates of one another
• Major issue when merging data from heterogeneous sources

• Examples:
• Same person with multiple email addresses

• Data cleaning
• Process of dealing with duplicate data issues

• When should duplicate data not be removed?


Measuring Data Similarity and Dissimilarity
In data mining applications, such as clustering, outlier analysis, and
nearest-neighbor classification, we need ways to assess how alike or
unalike objects are in comparison to one another.
• Similarity
▪ Numerical measure of how alike two data objects are
▪ Value is higher when objects are more alike
▪ Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
▪ Numerical measure of how different two data objects are
▪ Lower when objects are more alike
▪ Minimum dissimilarity is often 0
▪ Upper limit varies
• Proximity refers to a similarity or dissimilarity
Measuring Data Similarity and Dissimilarity
In data mining applications, such as clustering, outlier analysis, and
nearest-neighbor classification, we need ways to assess how alike or
unalike objects are in comparison to one another.
• A cluster is a collection of data objects such that the objects within a
cluster are similar to one another and dissimilar to the objects in other
clusters.
• Outlier analysis also employs clustering-based techniques to identify
potential outliers as objects that are highly dissimilar to others.
• Knowledge of object similarities can also be used in nearest-neighbor
classification schemes where a given object (e.g., a patient) is assigned
a class label (relating to, say, a diagnosis) based on its similarity toward
other objects in the model.
Measuring Data Similarity and Dissimilarity
• Two data structures that are commonly used in clustering, outlier
analysis, and nearest-neighbor classification applications:
• data matrix (used to store the data objects) and
• dissimilarity matrix (used to store dissimilarity values for pairs of
objects).
• How object dissimilarity can be computed for objects:
▪ Proximity Measures for nominal attributes
▪ Proximity Measures for binary attributes
▪ Proximity Measures for numeric attributes
▪ Proximity Measures for ordinal attributes
Measuring Data Similarity and Dissimilarity
Data Matrix versus Dissimilarity Matrix:
• Objects described by multiple attributes.
• Suppose that we have n objects (e.g., persons, items, or courses)
described by p attributes (also called measurements or features, such
as age, height, weight, or gender).
• The objects are x1=(x11,x12,…,x1p), x2 =(x21,x22,…,x2p), and so on, where xij
is the value for object xi of the jth attribute.
• Main memory-based clustering and nearest-neighbor classification
algorithms typically operate on Data matrix or Dissimilarity matrix.
Measuring Data Similarity and Dissimilarity
Data Matrix (or object-by-attribute structure):
• This structure stores the n data objects in the form of a relational table,
or n-by-p matrix (n objects × p attributes).

• Each row corresponds to an object. As part of our notation, we may use


f to index through the p attributes.
Measuring Data Similarity and Dissimilarity
• Dissimilarity matrix (or object-by-object structure): This structure stores
a collection of proximities that are available for all pairs of n objects. It
is often represented by an n-by-n table:
• where d(i, j) is the measured dissimilarity or “difference” between
objects i and j.
• In general, d(i, j) is a non-negative number that is close to 0 when
objects i and j are highly similar or “near” each other, and becomes
larger the more they differ.
• Note that d(i, i) = 0; that is, the difference between an object and itself is 0.
• Furthermore, d(i, j) = d(j, i).
• Measures of similarity can often be expressed as a function of
measures of dissimilarity. For example, for nominal data,
sim(i, j) = 1 − d(i, j)
• where sim(i, j) is the similarity between objects i and j.
Proximity Measures for Nominal Attributes
• A nominal attribute can take on two or more states.
• For example, map color is a nominal attribute that may have, five states:
red, yellow, green, pink, and blue.
• Let the number of states of a nominal attribute be M. The states can be
denoted by letters, symbols, or a set of integers, such as 1, 2,…, M.
Notice that such integers are used just for data handling and do not
represent any specific ordering.
• The dissimilarity between two objects i and j can be computed based
on the ratio of mismatches:

  d(i, j) = (p − m) / p

• where m is the number of matches (i.e., the number of attributes for
which i and j are in the same state).
• p is the total number of attributes describing the objects.
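The mismatch-ratio dissimilarity can be sketched directly; the two objects below, each with three hypothetical nominal attributes, are made up for illustration:

```python
def nominal_dissim(obj_i, obj_j):
    """d(i, j) = (p - m) / p for objects described by nominal attributes."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)  # number of matches
    return (p - m) / p

# Hypothetical objects with three nominal attributes each.
i = ("red", "single", "teacher")
j = ("red", "married", "farmer")
print(nominal_dissim(i, j))  # one match out of three -> (3 - 1) / 3
```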
Proximity Measures for Nominal Attributes
• The similarity can be computed as:

  sim(i, j) = 1 − d(i, j) = m / p

• where m is the number of matches (i.e., the number of attributes for
which i and j are in the same state).
• p is the total number of attributes describing the objects.
Proximity Measures for Binary Attributes
• For symmetric binary attributes, each state is equally valuable.
• Symmetric binary attributes are thought of as having the same weight;
we have the 2 × 2 contingency table:

              object j
                1   0
  object i  1   q   r
            0   s   t

• where q is the number of attributes equal to 1 for both objects, r the
number equal to 1 for i but 0 for j, s the number equal to 0 for i but 1
for j, and t the number equal to 0 for both.
• Dissimilarity that is based on symmetric binary attributes is called
symmetric binary dissimilarity.
Proximity Measures for Binary Attributes
• If objects i and j are described by symmetric binary attributes, then the
dissimilarity between i and j is:

  d(i, j) = (r + s) / (q + r + s + t)

• The dissimilarity based on asymmetric binary attributes is called
asymmetric binary dissimilarity, where the number of negative
matches, t, is considered unimportant and is thus ignored in the
following computation:

  d(i, j) = (r + s) / (q + r + s)
Proximity Measures for Binary Attributes
• Complementarily, we can measure the difference between two binary
attributes based on the notion of similarity instead of dissimilarity.
• For example, the asymmetric binary similarity between the objects i
and j can be computed as:

  sim(i, j) = q / (q + r + s) = 1 − d(i, j)

• The coefficient sim(i, j) is called the Jaccard coefficient.
Proximity Measures for Binary Attributes
• Example setup: let the values Y (yes) and P (positive) be set to 1, and
the value N (no or negative) be set to 0; the asymmetric binary
dissimilarity is then computed over these 0/1 values, ignoring the
negative matches t.
Dissimilarity of Numeric Data: Minkowski Distance
• In some cases, the data are normalized before applying distance
calculations. This involves transforming the data to fall within a smaller
or common range, such as [−1, 1] or [0.0, 1.0].
• Consider a height attribute, for example, which could be measured in
either meters or inches.
• Normalizing the data attempts to give all attributes an equal weight.
• The most popular distance measure is Euclidean distance (i.e., straight
line or “as the crow flies”). Let i = (xi1, xi2,…, xip) and j = (xj1, xj2,…, xjp) be
two objects described by p numeric attributes. The Euclidean distance
between objects i and j is defined as:

  d(i, j) = √((xi1 − xj1)² + (xi2 − xj2)² + ⋯ + (xip − xjp)²)
Dissimilarity of Numeric Data: Minkowski Distance
• Another well-known measure is the Manhattan (or city block) distance,
named so because it is the distance in blocks between any two points in
a city (such as 2 blocks down and 3 blocks over for a total of 5 blocks). It
is defined as:

  d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ⋯ + |xip − xjp|

• Both the Euclidean and the Manhattan distance satisfy the following
mathematical properties:
✓ Non-negativity: d(i, j) ≥ 0: Distance is a non-negative number.
✓ Identity of indiscernibles: d(i, i) = 0: The distance of an object to
itself is 0.
✓ Symmetry: d(i, j) = d(j, i): Distance is a symmetric function.
✓ Triangle inequality: d(i, j) ≤ d(i, k) + d(k, j): Going directly from
object i to object j in space is no more than making a detour over
any other object k.
Dissimilarity of Numeric Data: Minkowski Distance
• A measure that satisfies these conditions is known as metric. Please
note that the non-negativity property is implied by the other three
properties.
• Minkowski distance is a generalization of the Euclidean and Manhattan
distances. It is defined as:
  d(i, j) = (|xi1 − xj1|^h + |xi2 − xj2|^h + ··· + |xip − xjp|^h)^(1/h)

• where h is a real number such that h ≥ 1. (Such a distance is also
called the Lp norm in some literature, with p corresponding to our h.)
• Minkowski distance represents the Manhattan distance when h = 1
(i.e., L1 norm) and Euclidean distance when h = 2 (i.e., L2 norm).
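A direct transcription of the Minkowski formula makes the two special cases visible (the function name and sample points are illustrative):

```python
def minkowski(x, y, h):
    """Lh (Minkowski) distance; h=1 gives Manhattan, h=2 gives Euclidean."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

x1, x2 = (1, 2), (3, 5)
print(minkowski(x1, x2, 1))  # Manhattan: |1-3| + |2-5| = 5.0
print(minkowski(x1, x2, 2))  # Euclidean: sqrt(4 + 9) ≈ 3.606
```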
Dissimilarity of Numeric Data: Minkowski Distance
• If each attribute is assigned a weight according to its perceived
importance, the weighted Euclidean distance can be computed as:
  d(i, j) = √[w1 (xi1 − xj1)² + w2 (xi2 − xj2)² + ··· + wp (xip − xjp)²]
• Weighting can also be applied to other distance measures.
• Euclidean, Manhattan and Minkowski distance measures are
commonly used for computing the dissimilarity of objects described by
numeric attributes.
Dissimilarity of Numeric Data: Minkowski Distance
• The supremum distance (also referred to as Lmax, the L∞ norm, or the
Chebyshev distance) is a generalization of the Minkowski distance for
h → ∞.
• To compute it, we find the attribute f that gives the maximum
difference in values between the two objects.
• This difference is the supremum distance, defined more formally as:
  d(i, j) = lim h→∞ (Σf |xif − xjf|^h)^(1/h) = maxf |xif − xjf|
• The L∞ norm is also known as the uniform norm.
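The supremum distance reduces to a single max over the attributes, and the Minkowski distance approaches it as h grows; a quick sketch (illustrative names and points):

```python
def supremum(x, y):
    """Chebyshev / L-infinity distance: the largest per-attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, h):
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

x1, x2 = (1, 2), (3, 5)
print(supremum(x1, x2))       # max(|1-3|, |2-5|) = 3
print(minkowski(x1, x2, 50))  # already very close to 3 for large h
```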


Dissimilarity of Numeric Data: Minkowski Distance
• Example (figure): Euclidean and Manhattan distance between two objects.
Proximity Measures for Ordinal Attributes
• The treatment of ordinal attributes is quite similar to that of numeric
attributes when computing dissimilarity between objects. Suppose that
f is an attribute from a set of ordinal attributes describing n objects. The
dissimilarity computation with respect to f involves the following steps:
1) The value of f for the ith object is xif , and f has Mf ordered states,
representing the ranking 1,…,Mf. Replace each xif by its
corresponding rank, rif ∈ {1,…, Mf}.
2) Since each ordinal attribute can have a different number of states, it
is often necessary to map the range of each attribute onto [0.0, 1.0]
so that each attribute has equal weight. We perform such data
normalization by replacing the rank rif of the ith object in the fth
attribute by

  zif = (rif − 1) / (Mf − 1)
Proximity Measures for Ordinal Attributes
3) Dissimilarity can then be computed using any of the distance
measures described in previous slides for numeric attributes, using zif
to represent the f value for the ith object.

• There are three states for test-2: fair, good, and excellent, that is, Mf = 3.
For step 1, if we replace each value for test-2 by its rank, the four objects
are assigned the ranks 3, 1, 2, and 3, respectively.
• Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5,
and rank 3 to 1.0.
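The three steps above can be sketched directly in Python, reusing the test-2 example (fair < good < excellent, so Mf = 3); the helper names are illustrative:

```python
def ordinal_to_z(rank, M):
    """Step 2: map a rank in {1, ..., M} onto [0.0, 1.0]."""
    return (rank - 1) / (M - 1)

# Step 1: replace each ordinal value by its rank (fair < good < excellent)
rank_of = {"fair": 1, "good": 2, "excellent": 3}
values = ["excellent", "fair", "good", "excellent"]
z = [ordinal_to_z(rank_of[v], 3) for v in values]
print(z)  # → [1.0, 0.0, 0.5, 1.0]

# Step 3: any numeric distance on the z values, e.g. between objects 1 and 2
print(abs(z[0] - z[1]))  # → 1.0
```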
Proximity Measures for Ordinal Attributes
• Similarity values for ordinal attributes can be interpreted from
dissimilarity as sim(i,j) = 1 − d(i,j).
Cosine Similarity
• A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as a keyword) or
phrase in the document.
• Thus, each document is an object represented by what is called a term-
frequency vector.
• Term-frequency vectors are typically very long and sparse (i.e., they
have many 0 values).
• Applications using such structures include information retrieval, text
document clustering, biological taxonomy, and gene feature mapping.
• The traditional distance measures that we have studied do not work
well for such sparse numeric data.
Cosine Similarity
• For example, two term-frequency vectors may have many 0 values in
common, meaning that the corresponding documents do not share
many words, but this does not make them similar.
• We need a measure that will focus on the words that the two
documents do have in common, and the occurrence frequency of such
words.
• In other words, we need a measure for numeric data that ignores zero-
matches.
• Cosine similarity is a measure of similarity that can be used to compare
documents or, say, give a ranking of documents with respect to a given
vector of query words.
Cosine Similarity
• Let x and y be two vectors for comparison. Using the cosine measure as
a similarity function, we have

  sim(x, y) = (x · y) / (||x|| ||y||)

• where ||x|| is the Euclidean norm of vector x = (x1, x2, …, xp), defined
as √(x1² + x2² + ··· + xp²). Similarly, ||y|| is the Euclidean norm of
vector y. The measure computes the cosine of the angle between
vectors x and y.
• A cosine value of 0 means that the two vectors are at 90 degrees to
each other (orthogonal) and have no match.
• The closer the cosine value to 1, the smaller the angle and the greater
the match between vectors.
• Because the cosine similarity measure does not obey all of the
properties of a metric, it is referred to as a nonmetric measure.
Cosine Similarity
• Suppose that x and y are the first two term-frequency vectors. How
similar are x and y?

• x = (5,0,3,0,2,0,0,2,0,0) and y = (3,0,2,0,1,1,0,1,0,1)
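Working the example through in Python (a straightforward sketch of the cosine formula above, not a library call):

```python
import math

def cosine_similarity(x, y):
    """cos(theta) between x and y: (x . y) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

x = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
y = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(x, y), 2))  # → 0.94
```

Here x · y = 25, ||x|| = √42 ≈ 6.48, and ||y|| = √17 ≈ 4.12, so sim(x, y) ≈ 0.94: the two documents are quite similar.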


Summary
• Data sets are made up of data objects. A data object represents an
entity. Data objects are described by attributes. Attributes can be
nominal, binary, ordinal, or numeric.
• The values of a nominal (or categorical) attribute are symbols or names
of things, where each value represents some kind of category, code, or
state.
• Binary attributes are nominal attributes with only two possible states
(such as 1 and 0 or true and false). If the two states are equally
important, the attribute is symmetric; otherwise it is asymmetric.
• An ordinal attribute is an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude between
successive values is not known.
Summary
• A numeric attribute is quantitative (i.e., it is a measurable quantity)
represented in integer or real values. Numeric attribute types can be
interval-scaled or ratio-scaled. The values of an interval-scaled attribute
are measured in fixed and equal units. Ratio-scaled attributes are
numeric attributes with an inherent zero-point. Measurements are
ratio-scaled in that we can speak of values as being an order of
magnitude larger than the unit of measurement.
• Measures of object similarity and dissimilarity are used in data mining
applications such as clustering, outlier analysis, and nearest-neighbor
classification. Such measures of proximity can be computed for each
attribute type, or for combinations of such attributes. Examples include
the Jaccard coefficient for asymmetric binary attributes and the
Euclidean, Manhattan, and Minkowski distances for numeric attributes.
For applications involving sparse numeric data vectors, such as
term-frequency vectors, the cosine measure is often used.
References
• Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and
Techniques, 3rd Edition, Morgan Kaufmann.