Know Your Data
Objects
• An attribute (also known as a variable, field, characteristic, dimension, or feature) is a property of an object; a collection of attributes describes an object.
• One Dimension/Attribute: a student represented as a point on a number line of Math scores.
• Two Dimensions/Attributes: a student represented as a point on a 2-D plot of (Math, Science) scores.
• Multi Dimensions/Attributes: a student represented as a point in an n-dimensional space of n attributes.
• Example data matrix of objects with five numeric attributes:
  Projection of x load   Projection of y load   Distance   Load   Thickness
  10.23                  5.27                   15.22      2.7    1.2
  12.65                  6.25                   16.22      2.2    1.1
Document Data
Each document becomes a ‘term’ vector
• Each term is a component (attribute) of the vector
• The value of each component is the number of times the
corresponding term occurs in the document.
Term:        timeout  season  coach  game  score  play  team  win  ball  lost
Document 1      3       0       5      0     2      6     0     2    0     2
Document 2      0       7       0      2     1      0     0     3    0     0
Document 3      0       1       0      0     1      2     2     0    3     0
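As a rough illustration of the idea above, here is a minimal Python sketch that turns raw documents into term-frequency vectors over a fixed vocabulary (the documents and vocabulary are made up for illustration):

```python
from collections import Counter

# Hypothetical vocabulary and documents, only for illustration.
vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]

documents = [
    "the team played the game and the coach called a timeout",
    "the season ended after the team lost the final game",
]

def term_frequency_vector(text, vocab):
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]

for i, doc in enumerate(documents, start=1):
    print(f"Document {i}:", term_frequency_vector(doc, vocabulary))
```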
Transaction Data
A special type of data, where
• Each transaction involves a set of items.
• For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
• Transaction data can be represented as record data, as in the table and sketch below.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
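One way to represent transaction data as record data is a binary (asymmetric) matrix with one row per transaction and one column per item. A minimal Python sketch, using the items from the table above:

```python
# Transactions from the table above (TID -> set of items).
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Collect all distinct items to form the columns of the record representation.
items = sorted(set().union(*transactions.values()))

# Each transaction becomes a binary record: 1 if the item was purchased, else 0.
for tid, basket in transactions.items():
    row = [1 if item in basket else 0 for item in items]
    print(tid, dict(zip(items, row)))
```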
Graph Data
• Examples: Generic graph, a molecule, and webpages
Ordered Data
• Sequences of transactions: an element of the sequence is a set of items/events
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Data Quality
• Poor data quality negatively affects many data processing efforts
• What kinds of data quality problems can occur? Examples include:
✓Noise and outliers
✓Wrong data
✓Fake data
✓Missing values
✓Duplicate data
Noise
• For objects, noise is an extraneous object
• For attributes, noise refers to modification of original values
• Examples: distortion of a person’s voice when talking on a poor phone
connection, and “snow” on a television screen
• For example, consider two sine waves of the same magnitude and different
frequencies, the waves combined, and the two sine waves with random noise
• The magnitude and shape of the original signal are distorted by the noise
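A small NumPy sketch along the lines of the sine-wave example above: two sine waves of the same magnitude but different frequencies, their sum, and a noisy version (the frequencies and noise level are arbitrary choices for illustration):

```python
import numpy as np

t = np.linspace(0.0, 1.0, 500)           # time axis
wave1 = np.sin(2 * np.pi * 5 * t)        # 5 Hz sine wave
wave2 = np.sin(2 * np.pi * 12 * t)       # 12 Hz sine wave, same magnitude
combined = wave1 + wave2                 # the two waves combined

rng = np.random.default_rng(0)
noisy = combined + rng.normal(scale=0.5, size=t.shape)  # add random noise

# The noise distorts both the magnitude and the shape of the original signal.
print("clean range:", combined.min(), combined.max())
print("noisy range:", noisy.min(), noisy.max())
```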
Outliers
• Outliers are data objects with characteristics that are considerably
different from those of most other data objects in the data set
• Case 1: Outliers are
noise that interferes
with data analysis
• Causes?
Missing Values
• Reasons for missing values
• Information is not collected
(e.g., people decline to give their age and weight)
• Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
Duplicate Data
• Examples: same person with multiple email addresses
• Data cleaning: the process of dealing with duplicate data issues
Proximity Measures for Nominal Attributes
• The dissimilarity between two objects i and j can be computed based
on the ratio of mismatches:
  d(i, j) = (p − m) / p
  where m is the number of attributes on which i and j match, and p is the total number of attributes describing the objects.
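A minimal sketch of this mismatch-ratio computation for nominal attributes (the two example objects are made up for illustration):

```python
def nominal_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p, where m = number of matching attributes
    and p = total number of attributes."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

# Hypothetical objects described by three nominal attributes.
print(nominal_dissimilarity(["red", "circle", "small"],
                            ["red", "square", "small"]))  # one mismatch out of three
```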
Proximity Measures for Binary Attributes
• Binary attribute values can be coded numerically; for example, let the values Y (yes) and P (positive) be set to 1, and the value N (no or negative) be set to 0.
• For two binary objects i and j, let q be the number of attributes that equal 1 for both, r the number that equal 1 for i but 0 for j, s the number that equal 0 for i but 1 for j, and t the number that equal 0 for both.
• The dissimilarity based on asymmetric binary attributes is called
asymmetric binary dissimilarity, where the number of negative
matches, t, is considered unimportant and is thus ignored:
  d(i, j) = (r + s) / (q + r + s)
Proximity Measures for Binary Attributes
• If objects i and j are described by symmetric binary attributes, then the
dissimilarity between i and j is:
  d(i, j) = (r + s) / (q + r + s + t)
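A minimal sketch of both binary dissimilarities, using the q, r, s, t counts defined above (the two example objects are made up):

```python
def binary_counts(x, y):
    """q: both 1; r: x=1 and y=0; s: x=0 and y=1; t: both 0."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return q, r, s, t

def symmetric_binary_dissimilarity(x, y):
    q, r, s, t = binary_counts(x, y)
    return (r + s) / (q + r + s + t)

def asymmetric_binary_dissimilarity(x, y):
    q, r, s, _t = binary_counts(x, y)   # negative matches t are ignored
    return (r + s) / (q + r + s)

# Hypothetical binary objects (1 = yes/positive, 0 = no/negative).
i = [1, 0, 1, 0, 0, 0]
j = [1, 1, 0, 0, 0, 0]
print(symmetric_binary_dissimilarity(i, j))    # 2/6
print(asymmetric_binary_dissimilarity(i, j))   # 2/3
```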
Dissimilarity of Numeric Data: Minkowski Distance
• In some cases, the data are normalized before applying distance
calculations. This involves transforming the data to fall within a smaller
or common range, such as [−1, 1] or [0.0, 1.0].
• Consider a height attribute, for example, which could be measured in
either meters or inches.
• Normalizing the data attempts to give all attributes an equal weight.
• The most popular distance measure is Euclidean distance (i.e., straight
line or “as the crow flies”). Let i = (xi1, xi2,…, xip) and j = (xj1, xj2,…, xjp) be
two objects described by p numeric attributes. The Euclidean distance
between objects i and j is defined as:
  d(i, j) = sqrt((xi1 − xj1)^2 + (xi2 − xj2)^2 + … + (xip − xjp)^2)
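A minimal sketch of min-max normalization to [0.0, 1.0] followed by the Euclidean distance just defined (the two example objects and the attribute ranges are assumptions for illustration):

```python
import math

def min_max_normalize(value, lo, hi):
    """Rescale a value into [0.0, 1.0] given the attribute's min and max."""
    return (value - lo) / (hi - lo)

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical objects: (height in meters, weight in kg).
i, j = (1.80, 90.0), (1.60, 60.0)
ranges = [(1.4, 2.1), (40.0, 120.0)]   # assumed min/max per attribute

# Without normalization, the weight attribute dominates the distance.
print(euclidean_distance(i, j))

# After normalization, both attributes get roughly equal weight.
i_norm = [min_max_normalize(v, lo, hi) for v, (lo, hi) in zip(i, ranges)]
j_norm = [min_max_normalize(v, lo, hi) for v, (lo, hi) in zip(j, ranges)]
print(euclidean_distance(i_norm, j_norm))
```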
Dissimilarity of Numeric Data: Minkowski Distance
• Another well-known measure is the Manhattan (or city block) distance,
named so because it is the distance in blocks between any two points in
a city (such as 2 blocks down and 3 blocks over for a total of 5 blocks). It
is defined as:
  d(i, j) = |xi1 − xj1| + |xi2 − xj2| + … + |xip − xjp|
• Both the Euclidean and the Manhattan distance satisfy the following
mathematical properties:
✓ Non-negativity: d(i, j) ≥ 0: Distance is a non-negative number.
✓ Identity of indiscernibles: d(i, i) = 0: The distance of an object to
itself is 0.
✓ Symmetry: d(i, j) = d(j, i): Distance is a symmetric function.
✓ Triangle inequality: d(i, j) ≤ d(i, k) + d(k, j): Going directly from
object i to object j in space is no more than making a detour over
any other object k.
Dissimilarity of Numeric Data: Minkowski Distance
• A measure that satisfies these conditions is known as a metric. Note
that the non-negativity property is implied by the other three
properties.
• Minkowski distance is a generalization of the Euclidean and Manhattan
distances. For a real number h ≥ 1, it is defined as:
  d(i, j) = (|xi1 − xj1|^h + |xi2 − xj2|^h + … + |xip − xjp|^h)^(1/h)
• It corresponds to the Manhattan distance when h = 1 and to the Euclidean distance when h = 2.
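A minimal sketch of the Minkowski distance of order h, which reduces to the Manhattan distance for h = 1 and the Euclidean distance for h = 2 (the example points are made up):

```python
def minkowski_distance(x, y, h):
    """Minkowski distance of order h >= 1 between two numeric objects."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

i, j = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(minkowski_distance(i, j, 1))   # Manhattan: 3 + 4 + 0 = 7
print(minkowski_distance(i, j, 2))   # Euclidean: sqrt(9 + 16 + 0) = 5
```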
Proximity Measures for Ordinal Attributes
1) Replace each value x_if of an ordinal attribute f by its rank r_if ∈ {1, …, Mf},
where Mf is the number of ordered states of f.
2) Map each rank onto [0.0, 1.0] by normalizing: z_if = (r_if − 1) / (Mf − 1).
3) Dissimilarity can then be computed using any of the distance
measures described in previous slides for numeric attributes, using z_if
to represent the value of attribute f for the ith object.
• Example: suppose an ordinal attribute test-2 has three states, fair, good, and
excellent; that is, Mf = 3. For step 1, if we replace each value for test-2 by its
rank, the four objects are assigned the ranks 3, 1, 2, and 3, respectively.
• Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5,
and rank 3 to 1.0.
Proximity Measures for Ordinal Attributes
• Similarity values for ordinal attributes can be interpreted from
dissimilarity as sim(i,j) = 1 − d(i,j).
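A minimal sketch of the three steps above for an ordinal attribute such as test-2 (the state order fair < good < excellent and the ranks 3, 1, 2, 3 are taken from the example):

```python
# Ordered states of the ordinal attribute (ranks 1, 2, 3 -> Mf = 3).
states = ["fair", "good", "excellent"]
rank = {state: r for r, state in enumerate(states, start=1)}

def normalize_rank(value):
    """Steps 1-2: replace the value by its rank, then map it onto [0.0, 1.0]."""
    r, Mf = rank[value], len(states)
    return (r - 1) / (Mf - 1)

# The four objects from the example (ranks 3, 1, 2, 3 -> z = 1.0, 0.0, 0.5, 1.0).
values = ["excellent", "fair", "good", "excellent"]
z = [normalize_rank(v) for v in values]
print(z)

# Step 3: treat z as numeric, e.g. Manhattan distance between objects 1 and 2.
print(abs(z[0] - z[1]))            # dissimilarity d(1, 2)
print(1 - abs(z[0] - z[1]))        # similarity sim(1, 2) = 1 - d(1, 2)
```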
Cosine Similarity
• A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as a keyword) or
phrase in the document.
• Thus, each document is an object represented by what is called a term-
frequency vector.
• Term-frequency vectors are typically very long and sparse (i.e., they
have many 0 values).
• Applications using such structures include information retrieval, text
document clustering, biological taxonomy, and gene feature mapping.
• The traditional distance measures that we have studied do not work
well for such sparse numeric data.
Cosine Similarity
• For example, two term-frequency vectors may have many 0 values in
common, meaning that the corresponding documents do not share
many words, but this does not make them similar.
• We need a measure that will focus on the words that the two
documents do have in common, and the occurrence frequency of such
words.
• In other words, we need a measure for numeric data that ignores zero-
matches.
• Cosine similarity is a measure of similarity that can be used to compare
documents or, say, give a ranking of documents with respect to a given
vector of query words.
Cosine Similarity
• Let x and y be two vectors for comparison. Using the cosine measure as
a similarity function, we have:
  sim(x, y) = (x · y) / (||x|| ||y||)
  where x · y is the dot product of the two vectors and ||x|| is the Euclidean norm of vector x.
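A minimal sketch of the cosine measure, applied here to the term-frequency vectors of Document 1 and Document 2 from the earlier table:

```python
import math

def cosine_similarity(x, y):
    """sim(x, y) = (x . y) / (||x|| * ||y||)"""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Term-frequency vectors for Document 1 and Document 2 from the earlier table.
d1 = [3, 0, 5, 0, 2, 6, 0, 2, 0, 2]
d2 = [0, 7, 0, 2, 1, 0, 0, 3, 0, 0]
print(cosine_similarity(d1, d2))   # low value: the documents share few terms
```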