Introduction to Data

Outline

Attributes and Objects
Types of Data
Data Quality
Similarity and Distance

01/27/2021  Introduction to Data Mining, 2nd Edition
Tan, Steinbach, Karpatne, Kumar
What is Data?

A collection of data objects and their attributes.

An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, dimension, or feature

A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Attribute Values

Attribute values are numbers or symbols assigned to an attribute for a particular object.

Distinction between attributes and attribute values:
– The same attribute can be mapped to different attribute values
   Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of values
   Example: attribute values for ID and age are integers
– But the properties of an attribute can be different from the properties of the values used to represent the attribute
Attribute Types

Nominal: categories, states, or “names of things”
– Hair_color = {auburn, black, blond, brown, grey, red, white}
– marital status, occupation, ID numbers, zip codes

Binary: nominal attribute with only 2 states (0 and 1)
– Symmetric binary: both outcomes equally important
   e.g., gender
– Asymmetric binary: outcomes not equally important
   e.g., medical test (positive vs. negative)
   Convention: assign 1 to the more important outcome (e.g., HIV positive)

Ordinal
– Values have a meaningful order (ranking), but the magnitude between successive values is not known
– Size = {small, medium, large}, grades, army rankings
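A minimal sketch of the nominal/ordinal distinction: nominal values support only equality tests, while ordinal values also support ranking (the rank encoding below is illustrative, not part of the slides).

```python
# Ordinal attribute: order is meaningful, so we can encode a rank.
size_rank = {"small": 0, "medium": 1, "large": 2}

def ordinal_less(a, b):
    """Compare two ordinal size values by their rank."""
    return size_rank[a] < size_rank[b]

# Nominal attribute (hair color): only equality is meaningful.
print("blond" == "red")               # False
print(ordinal_less("small", "large")) # True
# The magnitude of the gap is still unknown: rank(large) - rank(small) = 2
# says nothing about "how much" larger a large item is.
```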
Numeric Attribute Types

Interval
– Measured on a scale of equal-sized units
– Values have order
   E.g., temperature in C˚ or F˚, calendar dates
– No true zero-point

Ratio
– Inherent zero-point
– We can speak of values as being an order of magnitude larger than the unit of measurement (10 K˚ is twice as high as 5 K˚)
– E.g., length, counts, monetary quantities

https://www.graphpad.com/support/faq/what-is-the-difference-between-ordinal-interval-and-ratio-variables-why-should-i-care/
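A quick numeric sketch of why ratios are meaningless on an interval scale: a ratio of Celsius values is not preserved under a change of interval scale (Celsius to Fahrenheit), whereas Kelvin has a true zero.

```python
def c_to_f(c):
    """Convert Celsius to Fahrenheit (an interval-scale change of units)."""
    return c * 9 / 5 + 32

# 10 C is NOT "twice as hot" as 5 C: the ratio changes under rescaling.
ratio_c = 10 / 5                  # 2.0
ratio_f = c_to_f(10) / c_to_f(5)  # 50 / 41, about 1.22

# Kelvin has a true zero-point, so 10 K really is twice 5 K,
# and that ratio survives rescaling by any positive constant.
ratio_k = 10.0 / 5.0
print(ratio_c, ratio_f, ratio_k)
```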
Discrete and Continuous Attributes

Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of documents
– Often represented as integer variables
– Note: binary attributes are a special case of discrete attributes

Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight
– Practically, real values can only be measured and represented using a finite number of digits
– Continuous attributes are typically represented as floating-point variables
Basic Statistical Descriptions of Data

Motivation
– To better understand the data: central tendency, variation, and spread

Data dispersion characteristics
– median, max, min, quantiles, outliers, variance, etc.

Numerical dimensions correspond to sorted intervals
– Data dispersion: analyzed with multiple granularities of precision
– Boxplot or quantile analysis on sorted intervals

Dispersion analysis on computed measures
– Folding measures into numerical dimensions
– Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency

Mean (algebraic measure) (sample vs. population):
   sample:      x̄ = (1/n) Σᵢ xᵢ
   population:  μ = (1/N) Σᵢ xᵢ
   Note: n is the sample size and N is the population size.
– Weighted arithmetic mean:  x̄ = (Σᵢ wᵢ xᵢ) / (Σᵢ wᵢ)
– Trimmed mean: chopping extreme values before averaging

Median:
– Middle value if odd number of values, or average of the middle two values otherwise
– Estimated by interpolation (for grouped data):

   median = L₁ + ( (n/2 − (Σ freq)ₗ) / freq_median ) × width

  where L₁ is the lower boundary of the median interval, (Σ freq)ₗ is the sum of frequencies of the intervals below it, freq_median is the frequency of the median interval, and width is the interval width.

Mode
– Value that occurs most frequently in the data
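The three measures above can be computed directly with Python's standard library; a small sketch on made-up data:

```python
import statistics

data = [1, 2, 2, 3, 4, 7, 9]

print(statistics.mean(data))    # 4.0  (sum / n)
print(statistics.median(data))  # 3    (middle value; odd n here)
print(statistics.mode(data))    # 2    (most frequent value)

# A trimmed mean chops extreme values before averaging:
trimmed = sorted(data)[1:-1]    # drop the min and the max
print(statistics.mean(trimmed)) # 3.6
```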
Symmetric vs. Skewed Data

Median, mean, and mode of symmetric, positively skewed, and negatively skewed data.
Measuring the Dispersion of Data

Quartiles, outliers, and boxplots
– Quartiles: Q1 (25th percentile), Q3 (75th percentile)
– Inter-quartile range: IQR = Q3 − Q1
– Five-number summary: min, Q1, median, Q3, max
– Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
– Outlier: usually, a value more than 1.5 × IQR below Q1 or above Q3

Variance and standard deviation (sample: s, population: σ)
– Variance (algebraic, scalable computation):

   sample:      s² = (1/(n−1)) Σᵢ (xᵢ − x̄)²  =  (1/(n−1)) [ Σᵢ xᵢ² − (1/n)(Σᵢ xᵢ)² ]
   population:  σ² = (1/N) Σᵢ (xᵢ − μ)²  =  (1/N) Σᵢ xᵢ² − μ²

– Standard deviation s (or σ) is the square root of the variance s² (or σ²)
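A sketch of the five-number summary and the 1.5 × IQR outlier rule on invented data (using the standard library's `statistics.quantiles`, which defaults to the exclusive quantile method):

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 40]

q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles
iqr = q3 - q1

# Usual outlier rule: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lo or x > hi]

print(min(data), q1, q2, q3, max(data))  # five-number summary
print(iqr, outliers)
```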
Boxplot Analysis

Five-number summary of a distribution
– Minimum, Q1, Median, Q3, Maximum

Boxplot
– Data is represented with a box
– The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
– The median is marked by a line within the box
– Outliers: points beyond a specified outlier threshold, plotted individually
Graphic Displays of Basic Statistical Descriptions

Boxplot: graphic display of the five-number summary
Histogram: x-axis shows values, y-axis represents frequencies
Quantile plot: each value xᵢ is paired with fᵢ, indicating that approximately 100·fᵢ% of the data are ≤ xᵢ
Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane
Histogram Analysis

Histogram: graph display of tabulated frequencies, shown as bars
– Shows what proportion of cases fall into each of several categories
– Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts; a crucial distinction when the categories are not of uniform width
– The categories are usually specified as non-overlapping intervals of some variable; the categories (bars) must be adjacent
Histogram vs. Bar Graph

Histograms Often Tell More than Boxplots

The two histograms shown on the left may have the same boxplot representation
– The same values for: min, Q1, median, Q3, max
But they have rather different data distributions
Quantile Plot

Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
Plots quantile information
– For data xᵢ sorted in increasing order, fᵢ indicates that approximately 100·fᵢ% of the data are below or equal to the value xᵢ
Quantile-Quantile (Q-Q) Plot

Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
View: is there a shift in going from one distribution to another?
Example shows the unit price of items sold at Branch 1 vs. Branch 2 for each quantile. Unit prices of items sold at Branch 1 tend to be lower than those at Branch 2.
Scatter Plot

Provides a first look at bivariate data to see clusters of points, outliers, etc.
Each pair of values is treated as a pair of coordinates and plotted as points in the plane
Positively and Negatively Correlated Data

The left half of the figure is positively correlated
The right half is negatively correlated
Uncorrelated Data

Important Characteristics of Data

– Dimensionality (number of attributes)
   High-dimensional data brings a number of challenges
– Sparsity
   Only presence counts
– Resolution
   Patterns depend on the scale
– Size
   The type of analysis may depend on the size of the data
Types of data sets

Record
– Data Matrix
– Document Data
– Transaction Data
Graph
– World Wide Web
– Molecular Structures
Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Record Data

Data that consists of a collection of records, each of which consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
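As a sketch, the record data set above can be held as a list of records with a fixed set of attributes (plain dicts here; in practice a pandas DataFrame would be typical):

```python
records = [
    {"Tid": 1,  "Refund": "Yes", "Marital Status": "Single",   "Taxable Income": 125, "Cheat": "No"},
    {"Tid": 2,  "Refund": "No",  "Marital Status": "Married",  "Taxable Income": 100, "Cheat": "No"},
    {"Tid": 3,  "Refund": "No",  "Marital Status": "Single",   "Taxable Income": 70,  "Cheat": "No"},
    {"Tid": 4,  "Refund": "Yes", "Marital Status": "Married",  "Taxable Income": 120, "Cheat": "No"},
    {"Tid": 5,  "Refund": "No",  "Marital Status": "Divorced", "Taxable Income": 95,  "Cheat": "Yes"},
    {"Tid": 6,  "Refund": "No",  "Marital Status": "Married",  "Taxable Income": 60,  "Cheat": "No"},
    {"Tid": 7,  "Refund": "Yes", "Marital Status": "Divorced", "Taxable Income": 220, "Cheat": "No"},
    {"Tid": 8,  "Refund": "No",  "Marital Status": "Single",   "Taxable Income": 85,  "Cheat": "Yes"},
    {"Tid": 9,  "Refund": "No",  "Marital Status": "Married",  "Taxable Income": 75,  "Cheat": "No"},
    {"Tid": 10, "Refund": "No",  "Marital Status": "Single",   "Taxable Income": 90,  "Cheat": "Yes"},
]

# Every record has the same fixed set of attributes:
assert all(r.keys() == records[0].keys() for r in records)

cheaters = [r["Tid"] for r in records if r["Cheat"] == "Yes"]
print(cheaters)  # [5, 8, 10]
```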
Data Matrix

If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute.

Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute.

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
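The two-object, five-attribute data set above as an m × n matrix; plain nested lists here for a self-contained sketch, though in practice a NumPy array would be typical:

```python
X = [
    [10.23, 5.27, 15.22, 2.7, 1.2],   # object 1 (one row per object)
    [12.65, 6.25, 16.22, 2.2, 1.1],   # object 2
]
m, n = len(X), len(X[0])
print(m, n)                    # 2 5: m objects, n attributes

col0 = [row[0] for row in X]   # one attribute (column) across all objects
print(col0)                    # [10.23, 12.65]
```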
Document Data

Each document becomes a ‘term’ vector
– Each term is a component (attribute) of the vector
– The value of each component is the number of times the corresponding term occurs in the document

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0      5     0     2      6     0    2     0        2
Document 2    0     7      0     2     1      0     0    3     0        0
Document 3    0     1      0     0     1      2     2    0     3        0
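A minimal sketch of turning raw documents into term vectors (word counts over a shared vocabulary); the two mini-documents are invented for illustration:

```python
from collections import Counter

docs = [
    "team play play score team win",
    "coach game season win",
]
# Shared vocabulary: every distinct term becomes a vector component.
vocab = sorted(set(w for d in docs for w in d.split()))

def term_vector(doc):
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(doc.split())
    return [counts[t] for t in vocab]

vectors = [term_vector(d) for d in docs]
print(vocab)
print(vectors)
```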
Transaction Data

A special type of record data, where
– Each transaction involves a set of items
– For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
– Can represent transaction data as record data

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
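The "represent transaction data as record data" point can be sketched by converting each item set into an asymmetric binary record (1 = item present in the transaction):

```python
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}
# One binary attribute per distinct item, in a fixed order.
items = sorted(set().union(*transactions.values()))

binary = {tid: [int(i in basket) for i in items]
          for tid, basket in transactions.items()}

print(items)      # ['Beer', 'Bread', 'Coke', 'Diaper', 'Milk']
print(binary[3])  # [1, 0, 1, 1, 1]
```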
Graph Data

Examples: generic graph, a molecule, and webpages

Benzene molecule: C6H6
Ordered Data

Sequences of transactions: each element of the sequence is a set of items/events
Ordered Data

Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG

Ordered Data

Spatio-Temporal Data

Average Monthly
Temperature of
land and ocean

Similarity and Dissimilarity Measures

Similarity measure
– Numerical measure of how alike two data objects are
– Higher when objects are more alike
– Often falls in the range [0, 1]
Dissimilarity measure
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
Proximity refers to either a similarity or a dissimilarity
Similarity/Dissimilarity for Simple Attributes

The following table shows the similarity and dissimilarity between two objects, x and y, with respect to a single, simple attribute.

Attribute Type     Dissimilarity                       Similarity
Nominal            d = 0 if x = y, d = 1 if x ≠ y      s = 1 if x = y, s = 0 if x ≠ y
Ordinal            d = |x − y| / (n − 1)               s = 1 − d
                   (values mapped to integers 0 to n−1, where n is the number of values)
Interval or Ratio  d = |x − y|                         s = −d, s = 1/(1 + d), or
                                                       s = 1 − (d − min_d)/(max_d − min_d)
Euclidean Distance

   d(x, y) = sqrt( Σₖ₌₁ⁿ (xₖ − yₖ)² )

where n is the number of dimensions (attributes) and xₖ and yₖ are, respectively, the kth attributes (components) of data objects x and y.

Standardization is necessary if scales differ.
Euclidean Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

       p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

Distance Matrix
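The distance matrix above can be reproduced directly from the definition; a short sketch:

```python
import math

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(x, y):
    """d(x, y) = sqrt(sum over k of (x_k - y_k)^2)."""
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

# Reproduce the distance matrix, rounded to 3 decimals:
for a in points:
    row = [round(euclidean(points[a], points[b]), 3) for b in points]
    print(a, row)
```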
Minkowski Distance

Minkowski Distance is a generalization of Euclidean Distance:

   d(x, y) = ( Σₖ₌₁ⁿ |xₖ − yₖ|ʳ )^(1/r)

where r is a parameter, n is the number of dimensions (attributes), and xₖ and yₖ are, respectively, the kth attributes (components) of data objects x and y.
Minkowski Distance: Examples

r = 1. City block (Manhattan, taxicab, L1 norm) distance
– A common example of this for binary vectors is the Hamming distance, which is just the number of bits that are different between two binary vectors

r = 2. Euclidean distance

r → ∞. “Supremum” (Lmax norm, L∞ norm) distance
– This is the maximum difference between any component of the vectors

Do not confuse r with n; all of these distances are defined for any number of dimensions.
Minkowski Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1     p1  p2  p3  p4
p1     0   4   4   6
p2     4   0   2   4
p3     4   2   0   2
p4     6   4   2   0

L2     p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

L∞     p1  p2  p3  p4
p1     0   2   3   5
p2     2   0   1   3
p3     3   1   0   2
p4     5   3   2   0

Distance Matrix
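A sketch of the general Minkowski definition, checked against the p1/p2 entries of the three matrices above:

```python
def minkowski(x, y, r):
    """d(x, y) = (sum over k of |x_k - y_k|^r)^(1/r); r = inf gives the supremum."""
    if r == float("inf"):  # supremum (L_max) distance
        return max(abs(xk - yk) for xk, yk in zip(x, y))
    return sum(abs(xk - yk) ** r for xk, yk in zip(x, y)) ** (1 / r)

p1, p2 = (0, 2), (2, 0)
print(minkowski(p1, p2, 1))             # 4.0   (city block, L1)
print(round(minkowski(p1, p2, 2), 3))   # 2.828 (Euclidean, L2)
print(minkowski(p1, p2, float("inf")))  # 2     (supremum, L_inf)
```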
Common Properties of a Distance

Distances, such as the Euclidean distance, have some well-known properties.

1. d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 if and only if x = y. (Positivity)
2. d(x, y) = d(y, x) for all x and y. (Symmetry)
3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z. (Triangle Inequality)

where d(x, y) is the distance (dissimilarity) between points (data objects) x and y.

A distance that satisfies these properties is a metric.
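The three properties can be spot-checked numerically; a sketch using the four points from the earlier distance-matrix example:

```python
import itertools
import math

points = [(0, 2), (2, 0), (3, 1), (5, 1)]

def d(x, y):
    return math.dist(x, y)  # Euclidean distance (Python 3.8+)

# 1. Positivity: d >= 0, and d = 0 exactly when the points coincide.
assert all(d(x, y) > 0 for x, y in itertools.combinations(points, 2))
assert all(d(x, x) == 0 for x in points)
# 2. Symmetry.
assert all(d(x, y) == d(y, x) for x, y in itertools.combinations(points, 2))
# 3. Triangle inequality over all triples (tiny tolerance for floating point).
assert all(d(x, z) <= d(x, y) + d(y, z) + 1e-12
           for x, y, z in itertools.product(points, repeat=3))
print("Euclidean distance behaves as a metric on these points")
```

This only verifies the properties on a finite sample, of course; the Euclidean distance satisfies them in general.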
Common Properties of a Similarity

Similarities also have some well-known properties.

1. s(x, y) = 1 (or maximum similarity) only if x = y. (does not always hold, e.g., cosine)
2. s(x, y) = s(y, x) for all x and y. (Symmetry)

where s(x, y) is the similarity between points (data objects) x and y.
Similarity Between Binary Vectors

A common situation is that objects, x and y, have only binary attributes.

Compute similarities using the following quantities:
f01 = the number of attributes where x was 0 and y was 1
f10 = the number of attributes where x was 1 and y was 0
f00 = the number of attributes where x was 0 and y was 0
f11 = the number of attributes where x was 1 and y was 1

Simple Matching Coefficient (SMC)
– counts both presences and absences equally; normally used for symmetric binary attributes

SMC = number of matches / number of attributes
    = (f11 + f00) / (f01 + f10 + f11 + f00)
Similarity Between Binary Vectors

Jaccard Coefficient
– counts only presences; frequently used for asymmetric binary attributes

J = number of 11 matches / number of non-zero attributes
  = f11 / (f01 + f10 + f11)
SMC versus Jaccard: Example

x = 1 0 0 0 0 0 0 0 0 0
y = 0 0 0 0 0 0 1 0 0 1

f01 = 2 (the number of attributes where x was 0 and y was 1)
f10 = 1 (the number of attributes where x was 1 and y was 0)
f00 = 7 (the number of attributes where x was 0 and y was 0)
f11 = 0 (the number of attributes where x was 1 and y was 1)

SMC = (f11 + f00) / (f01 + f10 + f11 + f00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7

J = f11 / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
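Both coefficients are easy to compute from the definitions; a sketch reproducing the example above:

```python
def smc(x, y):
    """Simple matching coefficient: matches (f11 + f00) over all attributes."""
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)

def jaccard(x, y):
    """Jaccard: f11 over the non-zero attributes (f01 + f10 + f11)."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    nonzero = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return f11 / nonzero if nonzero else 0.0

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc(x, y))      # 0.7
print(jaccard(x, y))  # 0.0
```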
Cosine Similarity

If d1 and d2 are two document vectors, then
   cos(d1, d2) = <d1, d2> / (||d1|| ||d2||),
where <d1, d2> indicates the inner (dot) product of the vectors d1 and d2, and ||d|| is the length of vector d.

Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
<d1, d2> = 3·1 + 2·0 + 0·0 + 5·0 + 0·0 + 0·0 + 0·0 + 2·1 + 0·0 + 0·2 = 5
||d1|| = (3² + 2² + 0² + 5² + 0² + 0² + 0² + 2² + 0² + 0²)^0.5 = 42^0.5 = 6.481
||d2|| = (1² + 0² + 0² + 0² + 0² + 0² + 0² + 1² + 0² + 2²)^0.5 = 6^0.5 = 2.449
cos(d1, d2) = 0.315
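The worked example above, as a short sketch:

```python
import math

def cosine(d1, d2):
    """cos(d1, d2) = <d1, d2> / (||d1|| * ||d2||)."""
    dot = sum(a * b for a, b in zip(d1, d2))
    n1 = math.sqrt(sum(a * a for a in d1))
    n2 = math.sqrt(sum(b * b for b in d2))
    return dot / (n1 * n2)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine(d1, d2), 3))  # 0.315
```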
Correlation

Correlation measures the linear relationship between objects. For two data objects x and y with n attributes:

   corr(x, y) = covariance(x, y) / (std(x) · std(y)) = [ Σₖ (xₖ − x̄)(yₖ − ȳ) / (n − 1) ] / (sₓ s_y)
Drawback of Correlation (Non-linear Data)

x = (−3, −2, −1, 0, 1, 2, 3)
y = (9, 4, 1, 0, 1, 4, 9)

yᵢ = xᵢ²

mean(x) = 0, mean(y) = 4
std(x) = 2.16, std(y) = 3.74

corr = [(−3)(5) + (−2)(0) + (−1)(−3) + (0)(−4) + (1)(−3) + (2)(0) + (3)(5)] / (6 × 2.16 × 3.74) = 0

So y is completely determined by x, yet their correlation is 0: correlation captures only linear relationships.
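The zero numerator in the example above can be checked directly; a sketch:

```python
# Pearson correlation is 0 for this perfectly (but non-linearly)
# related pair: y is fully determined by x, yet corr(x, y) = 0.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]   # y_i = x_i^2

mean_x = sum(x) / len(x)    # 0
mean_y = sum(y) / len(y)    # 4

# Numerator of the correlation (the covariance sum):
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
print(cov)  # 0, hence corr = 0 regardless of the standard deviations
```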
Correlation vs. Cosine vs. Euclidean Distance

Choice of the right proximity measure depends on the domain.

What is the correct choice of proximity measure for the following situations?
– Comparing documents using the frequencies of words
   Documents are considered similar if the word frequencies are similar
– Comparing the temperature in Celsius of two locations
   Two locations are considered similar if the temperatures are similar in magnitude
– Comparing two time series of temperature measured in Celsius
   Two time series are considered similar if their “shape” is similar, i.e., they vary in the same way over time, achieving minimums and maximums at similar times, etc.
