X Chapter 02 Data

Uploaded by

Adharsh Rajeev Dfc


Getting to Know Your Data


 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Measuring Data Similarity and Dissimilarity

MODULE 5

Types of data in cluster analysis

Types of Data Sets
 Record
   Relational records
   Data matrix, e.g., numerical matrix, crosstabs
   Document data: text documents represented as term-frequency vectors
   Transaction data
 Graph and network
   World Wide Web
   Social or information networks
   Molecular structures
 Ordered
   Video data: sequence of images
   Temporal data: time series
   Sequential data: transaction sequences
   Genetic sequence data
 Spatial, image and multimedia
   Spatial data: maps
   Image data
   Video data

Document data example (term-frequency vectors):

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1      3      0     5     0      2     6    0     2        0       2
Document 2      0      7     0     2      1     0    0     3        0       0
Document 3      0      1     0     0      1     2    2     0        3       0

Transaction data example:

TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk

Data Objects

 Data sets are made up of data objects.
 A data object represents an entity.
 Examples:
   sales database: customers, store items, sales
   medical database: patients, treatments
   university database: students, professors, courses
 Also called samples, examples, instances, data points, objects, or tuples.
 Data objects are described by attributes.
 Database rows -> data objects; columns -> attributes.

Attributes
 Attribute (also called dimension in data warehousing, feature in machine learning, or variable in statistics): a data field representing a characteristic or feature of a data object.
   E.g., customer_ID, name, address
 Types:
   Nominal
   Binary
   Ordinal
   Numeric: quantitative
     Interval-scaled
     Ratio-scaled

Attribute Types
 Nominal: categories, states, or “names of things”
   Hair_color = {auburn, black, blond, brown, grey, red, white}
   Marital status, occupation, ID numbers, zip codes
   Values have no meaningful order and are not quantitative
   Mathematical operations on the values of nominal attributes are not meaningful
 Binary
   Nominal attribute with only 2 states (0 and 1)
   Symmetric binary: both outcomes equally important
     e.g., gender
   Asymmetric binary: outcomes not equally important
     e.g., medical test (positive vs. negative)
     Convention: assign 1 to the most important outcome (e.g., HIV positive)

Attribute Types
 Ordinal
   Values have a meaningful order (ranking), but the magnitude between successive values is not known
   Size = {small, medium, large}, grades, army rankings
   Central tendency can be represented by the mode and the median
   The mean cannot be defined

Numeric Attribute Types
 Numeric
   Quantitative: a measurable quantity, represented by integer or real values
   Can be interval-scaled or ratio-scaled
 Interval
   Continuous measurements on a roughly linear scale
   Measured on a scale of equal-sized units
   Allows us to compare and quantify the difference between values
   Values have order
   E.g., temperature in °C or °F, calendar dates
   No true zero point
     Neither 0 °C nor 0 °F indicates “no temperature”
   Ratios are not valid

Numeric Attribute Types
 Ratio
   Inherent zero point
   Values are ordered, and one value can be expressed as a multiple (ratio) of another; we can compute the difference, mean, mode, and median
   We can speak of values as being an order of magnitude larger than the unit of measurement (10 K is twice as high as 5 K, because 0 K = −273.15 °C is a true zero point)
   E.g., temperature in kelvin, length, counts, monetary quantities

Discrete vs. Continuous Attributes
 Discrete Attribute
   Has only a finite or countably infinite set of values
   E.g., zip codes, profession, or the set of words in a collection of documents
   Sometimes represented as integer variables
   Note: Binary attributes are a special case of discrete attributes
 Continuous Attribute
   Has real numbers as attribute values
   E.g., temperature, height, or weight
   Practically, real values can only be measured and represented using a finite number of digits
   Continuous attributes are typically represented as floating-point variables

Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Data Visualization

 Measuring Data Similarity and Dissimilarity

 Summary

Similarity and Dissimilarity
 Similarity
 Numerical measure of how alike two data objects are
 Value is higher when objects are more alike
 Often falls in the range [0,1]
 Dissimilarity (e.g., distance)
 Numerical measure of how different two data objects
are
 Lower when objects are more alike
 Minimum dissimilarity is often 0
 Proximity refers to a similarity or dissimilarity

Data Matrix and Dissimilarity Matrix
 Data structures
1. Data matrix
   Stores the n data objects with p attributes in the form of a relational table
   It is an n-by-p matrix (n objects × p attributes)

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

Data Matrix and Dissimilarity Matrix
 Data structures
2. Dissimilarity matrix
   Stores the collection of proximities that are available for all pairs of the n objects
   d(i, j): the difference (dissimilarity) between objects i and j
   d(i, j) = 0 when objects i and j are the same

$$
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots &        &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$

Proximity Measure for Nominal Attributes

 Let the number of states of a nominal attribute be M.
 Simple matching
   m: # of matches (number of attributes for which i and j are in the same state)
   p: total # of attributes

$$
d(i, j) = \frac{p - m}{p}
$$

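Dissimilarity: Nominal Data

Obj. ID   test-1   test-2
101       A        X
102       B        X
103       A        Y
104       C        Z

Calculate the dissimilarity matrix (p = 2 attributes, simple matching):

$$
\begin{bmatrix}
0      &        &        &   \\
d(2,1) & 0      &        &   \\
d(3,1) & d(3,2) & 0      &   \\
d(4,1) & d(4,2) & d(4,3) & 0
\end{bmatrix}
=
\begin{bmatrix}
0   &   &   &   \\
1/2 & 0 &   &   \\
1/2 & 1 & 0 &   \\
1   & 1 & 1 & 0
\end{bmatrix}
$$
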
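The simple-matching calculation above can be reproduced with a short Python sketch. The function and variable names are illustrative choices, not part of the slides; the data are the four objects from the table.

```python
from itertools import combinations


def simple_matching_dissimilarity(a, b):
    """Return d(i, j) = (p - m) / p for two equal-length tuples of nominal values."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    p = len(a)                                # total number of attributes
    m = sum(x == y for x, y in zip(a, b))     # number of matching attributes
    return (p - m) / p


# Objects from the slide: (test-1, test-2)
objects = {
    101: ("A", "X"),
    102: ("B", "X"),
    103: ("A", "Y"),
    104: ("C", "Z"),
}

# Lower-triangle entries of the dissimilarity matrix, keyed by (i, j) with i < j
dissimilarity = {
    (i, j): simple_matching_dissimilarity(objects[i], objects[j])
    for i, j in combinations(objects, 2)
}

for (i, j), d in dissimilarity.items():
    print(f"d({j}, {i}) = {d}")
```

Each pair is compared once because the matrix is symmetric with zeros on the diagonal.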
Proximity Measure for Binary Attributes
 A contingency table for binary data

                      Object j
                   1       0       sum
Object i    1      q       r       q + r
            0      s       t       s + t
          sum    q + s   r + t       p

where q: # of attributes equal to 1 for both objects
      t: # of attributes equal to 0 for both objects
      s: # of attributes equal to 0 for object i and 1 for object j
      r: # of attributes equal to 1 for object i and 0 for object j

 Symmetric binary dissimilarity

$$
d(i, j) = \frac{r + s}{q + r + s + t}
$$

 Asymmetric binary dissimilarity

$$
d(i, j) = \frac{r + s}{q + r + s}
$$

 Asymmetric binary similarity between objects i and j, also known as the Jaccard coefficient (similarity measure for asymmetric binary variables)

$$
sim(i, j) = \frac{q}{q + r + s} = 1 - d(i, j)
$$

Dissimilarity between Binary Variables
 Example

Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N

 Gender is a symmetric attribute
 The remaining attributes are asymmetric binary
 Let the values Y and P be 1, and the value N be 0

$$
d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33
$$

$$
d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67
$$

$$
d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75
$$

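As a rough check of the three values above, the following Python sketch counts q, r, s and t for a pair of objects and applies the binary measures. The helper names and the string encoding of each patient row are assumptions made for illustration; only the six asymmetric attributes (Fever through Test-4) are used, as in the slide.

```python
# Y and P are coded as 1, N as 0 (as stated in the slide)
ENCODING = {"Y": 1, "P": 1, "N": 0}

# Asymmetric attributes only: Fever, Cough, Test-1, Test-2, Test-3, Test-4
patients = {
    "Jack": "YNPNNN",
    "Mary": "YNPNPN",
    "Jim": "YPNNNN",   # row copied as printed in the slide table
}


def encode(row):
    """Convert a string of Y/P/N codes into a list of 0/1 values."""
    return [ENCODING[v] for v in row]


def contingency_counts(a, b):
    """Return (q, r, s, t) for two binary vectors a (object i) and b (object j)."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return q, r, s, t


def symmetric_dissimilarity(a, b):
    q, r, s, t = contingency_counts(a, b)
    return (r + s) / (q + r + s + t)


def asymmetric_dissimilarity(a, b):
    # Negative matches (t) are ignored; undefined when q + r + s == 0
    q, r, s, _ = contingency_counts(a, b)
    return (r + s) / (q + r + s)


def jaccard_similarity(a, b):
    q, r, s, _ = contingency_counts(a, b)
    return q / (q + r + s)


jack, mary, jim = (encode(patients[name]) for name in ("Jack", "Mary", "Jim"))

print(round(asymmetric_dissimilarity(jack, mary), 2))
print(round(asymmetric_dissimilarity(jack, jim), 2))
print(round(asymmetric_dissimilarity(jim, mary), 2))
```

The smallest value belongs to the Jack and Mary pair, so under this measure they are the most alike of the three pairs.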
Dissimilarity on Numeric Data: Minkowski Distance
 Minkowski distance: a popular distance measure

$$
d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}
$$

where i = (x_{i1}, x_{i2}, …, x_{ip}) and j = (x_{j1}, x_{j2}, …, x_{jp}) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
 Properties
   d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
   d(i, j) = d(j, i) (Symmetry)
   d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
 A distance that satisfies these properties is a metric

Special Cases of Minkowski Distance
 h = 1: Manhattan (city block, L1 norm) distance
   E.g., the Hamming distance: the number of bits that are different between two binary vectors

$$
d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|
$$

 h = 2: Euclidean (L2 norm) distance

$$
d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}
$$

 h → ∞: “supremum” (Lmax norm, L∞ norm) distance
   This is the maximum difference between any component (attribute) of the vectors

$$
d(i, j) = \max_{f} |x_{if} - x_{jf}|
$$

Example: Minkowski Distance

point   attribute 1   attribute 2
x1      1             2
x2      3             5
x3      2             0
x4      4             5

[Figure: the four points x1 to x4 plotted in the plane]

Dissimilarity Matrices

Manhattan (L1)
L1     x1     x2     x3     x4
x1     0
x2     5      0
x3     3      6      0
x4     6      1      7      0

Euclidean (L2)
L2     x1     x2     x3     x4
x1     0
x2     3.61   0
x3     2.24   5.1    0
x4     4.24   1      5.39   0

Supremum (L∞)
L∞     x1     x2     x3     x4
x1     0
x2     3      0
x3     2      5      0
x4     3      1      5      0

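The three matrices above can be recomputed with a minimal Python sketch of the Minkowski distance. The function name `minkowski` and the use of `math.inf` to select the supremum case are choices made here for illustration.

```python
import math


def minkowski(a, b, h):
    """Minkowski distance of order h between two equal-length numeric tuples.

    h = 1 gives Manhattan, h = 2 gives Euclidean, and h = math.inf gives
    the supremum distance (largest per-attribute difference).
    """
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if h == math.inf:
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)


# Points from the slide: (attribute 1, attribute 2)
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

for label, h in (("Manhattan", 1), ("Euclidean", 2), ("Supremum", math.inf)):
    print(label)
    names = list(points)
    for row, i in enumerate(names):
        # Print only the lower triangle, including the zero diagonal
        cells = [f"{minkowski(points[i], points[j], h):.2f}" for j in names[: row + 1]]
        print(f"  {i}: " + "  ".join(cells))
```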
Ordinal Variables

 An ordinal variable can be discrete or continuous
 Order is important, e.g., rank
 Can be treated like interval-scaled
   Replace x_{if} by its rank r_{if} ∈ {1, ..., M_f}
   Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

$$
z_{if} = \frac{r_{if} - 1}{M_f - 1}
$$

   Compute the dissimilarity using methods for interval-scaled variables, e.g., on values standardized as (x_{if} − mean) / sd

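The rank-based mapping can be sketched in Python as follows. The attribute values reuse the Size = {small, medium, large} example from the attribute-type slides; the function name and the sample list are hypothetical.

```python
def normalize_ranks(values, ordered_states):
    """Map ordinal values onto [0, 1] using z = (r - 1) / (M - 1).

    ordered_states lists the M states from lowest to highest rank;
    M must be at least 2.
    """
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    m = len(ordered_states)
    return [(rank[v] - 1) / (m - 1) for v in values]


# Hypothetical objects described by a single ordinal attribute
sizes = ["small", "large", "medium", "small"]
z = normalize_ranks(sizes, ["small", "medium", "large"])
print(z)

# Once normalized, the values can be compared like interval-scaled data,
# here with the absolute difference between the first two objects
print(abs(z[0] - z[1]))
```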
Attributes of Mixed Type
 A database may contain all attribute types
   Nominal, symmetric binary, asymmetric binary, numeric, ordinal
 One may use a weighted formula to combine their effects

$$
d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}
$$

   δ_{ij}^{(f)} = 0 if either (1) x_{if} or x_{jf} is missing, or (2) x_{if} = x_{jf} = 0 and attribute f is asymmetric binary; otherwise δ_{ij}^{(f)} = 1
   f is binary or nominal: d_{ij}^{(f)} = 0 if x_{if} = x_{jf}, or d_{ij}^{(f)} = 1 otherwise
   f is numeric: use the normalized distance
   f is ordinal
     Compute ranks r_{if} and z_{if} = (r_{if} − 1) / (M_f − 1)
     Treat z_{if} as interval-scaled

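A minimal Python sketch of the weighted combination is shown below. It assumes that the numeric distance is normalized by the attribute's range (max minus min), which the slide does not spell out, and that ordinal values have already been mapped to z in [0, 1]. The attribute kinds, the `ranges` dictionary, and the sample objects are hypothetical.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Combine per-attribute dissimilarities, skipping attributes with delta = 0.

    kinds[f] is one of "nominal", "binary", "asymmetric_binary",
    "numeric", "ordinal". ranges[f] is the value range used to normalize
    numeric attributes (assumption: max - min over the data set).
    Missing values are represented by None.
    """
    numerator = 0.0
    denominator = 0
    for f, kind in enumerate(kinds):
        a, b = x[f], y[f]
        if a is None or b is None:
            continue  # delta = 0: a value is missing
        if kind == "asymmetric_binary" and a == 0 and b == 0:
            continue  # delta = 0: a 0-0 match is not informative
        if kind in ("nominal", "binary", "asymmetric_binary"):
            d = 0.0 if a == b else 1.0
        elif kind == "numeric":
            d = abs(a - b) / ranges[f]
        elif kind == "ordinal":
            d = abs(a - b)  # values are already z-scores in [0, 1]
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
        numerator += d
        denominator += 1
    if denominator == 0:
        raise ValueError("no comparable attributes")
    return numerator / denominator


kinds = ["nominal", "asymmetric_binary", "numeric", "ordinal"]
ranges = {2: 40.0}  # hypothetical range of the numeric attribute

obj_a = ("A", 1, 45.0, 0.5)
obj_b = ("B", 0, 25.0, 1.0)
obj_c = ("A", 0, None, 0.5)
obj_d = ("A", 0, 45.0, 0.5)

print(mixed_dissimilarity(obj_a, obj_b, kinds, ranges))
print(mixed_dissimilarity(obj_c, obj_d, kinds, ranges))
```

In the second call only the nominal and ordinal attributes contribute, because the asymmetric binary values are both 0 and the numeric value is missing for one object.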
Cosine Similarity
 A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as keywords) or phrase in the document.
 Other vector objects: gene features in micro-arrays, …
 Applications: information retrieval, biological taxonomy, gene feature mapping, ...
 Cosine measure: if d1 and d2 are two vectors (e.g., term-frequency vectors), then

$$
\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}
$$

where · indicates the vector dot product and ||d|| is the length of vector d

Example: Cosine Similarity
 cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and ||d|| is the length of vector d

 Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = 25 / (6.481 × 4.12) = 0.94

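The worked example can be checked with a few lines of Python. This is a plain-Python sketch using only the standard library; the function name is an illustrative choice.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Term-frequency vectors from the example
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

print(round(cosine_similarity(d1, d2), 2))
```

The function is undefined for an all-zero vector, since its length is 0.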
Basic Statistical Descriptions of Data
 Motivation
   To better understand the data: central tendency, variation and spread
 Central tendency
   Location of the middle or center of a data distribution
 Dispersion
   How the data are spread out
   Range, quartiles, interquartile range, five-number summary

Measuring the Central Tendency
 Mean (algebraic measure) (sample vs. population):

$$
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}
$$

   Note: n is sample size and N is population size.
   Weighted arithmetic mean:

$$
\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}
$$

   Trimmed mean: chopping extreme values
 Median:
   Middle value if odd number of values, or average of the middle two values otherwise
   Estimated by interpolation (for grouped data):

$$
\text{median} = L_1 + \left( \frac{n/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}} \right) \times \text{width}
$$

   L_1: lower boundary of the median interval
   n: the number of values in the entire data set
   (∑ freq)_l: sum of the frequencies of all intervals lower than the median interval
   freq_median: frequency of the median interval
   width: width of the median interval

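The weighted mean and the interpolated median for grouped data can be sketched in Python as follows. The interval boundaries and frequencies are made-up illustrative numbers, not data from the slides, and the function names are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def grouped_median(boundaries, freqs):
    """Approximate the median of grouped data by linear interpolation.

    boundaries has one more entry than freqs: interval k spans
    boundaries[k] to boundaries[k + 1] and contains freqs[k] values.
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("frequency table is empty")
    cumulative = 0  # sum of frequencies of intervals below the current one
    for k, freq in enumerate(freqs):
        if cumulative + freq >= n / 2:
            lower = boundaries[k]                       # L1
            width = boundaries[k + 1] - boundaries[k]   # width of median interval
            return lower + (n / 2 - cumulative) / freq * width
        cumulative += freq
    raise ValueError("median interval not found")


# Hypothetical grouped data: four intervals of width 10
boundaries = [0, 10, 20, 30, 40]
freqs = [5, 15, 20, 10]

print(grouped_median(boundaries, freqs))
print(weighted_mean([1, 2, 3], [1, 1, 2]))
```

With these numbers n = 50, the median interval is 20 to 30, and 20 values lie below it, so the estimate is 20 + (25 − 20) / 20 × 10.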
Measuring the Central Tendency
 Mode
   Value that occurs most frequently in the data
   Unimodal, bimodal, trimodal
   Empirical formula:

$$
\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})
$$

Symmetric vs. Skewed Data
 Median, mean and mode of symmetric, positively skewed, and negatively skewed data

[Figure: three distribution curves: symmetric, positively skewed, negatively skewed]

June 18, 2024 Data Mining: Concepts and Techniques 35

Measuring the Dispersion of Data
 Variance and standard deviation (sample: s, population: σ)
   Variance: (algebraic, scalable computation)

$$
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2
$$

   Standard deviation s (or σ) is the square root of variance s² (or σ²)

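The two forms of the population variance can be compared numerically with a short Python sketch. The data values are arbitrary illustrative numbers; `statistics.pvariance`, `statistics.variance` and `statistics.pstdev` are standard-library functions, with `variance` giving the sample version that divides by n − 1.

```python
import math
import statistics

data = [1, 2, 3, 4, 5]  # arbitrary illustrative values
n = len(data)
mu = sum(data) / n

# Definition: mean squared deviation from the mean
var_definition = sum((x - mu) ** 2 for x in data) / n

# Shortcut: mean of squares minus square of the mean (one pass over the data)
var_shortcut = sum(x * x for x in data) / n - mu ** 2

print(var_definition, var_shortcut)
print(statistics.pvariance(data))   # population variance
print(statistics.variance(data))    # sample variance (divides by n - 1)
print(statistics.pstdev(data))      # population standard deviation
```

The shortcut form is what makes the computation scalable, since the sums of x and x² can be accumulated incrementally; in floating point it can lose precision when the mean is large relative to the spread.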
Measuring the Dispersion of Data
 Quartiles, outliers and boxplots
   Quartiles: points that split the sorted data into equal-size, consecutive sets; Q1 (25th percentile), Q3 (75th percentile)
   Inter-quartile range: IQR = Q3 – Q1
   Five-number summary: a fuller summary of the shape of a distribution: min, Q1, median, Q3, max
   Boxplot: ends of the box are the quartiles; median is marked; add whiskers (lines outside the box), and plot outliers individually
   Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1

Boxplot Analysis

 Five-number summary of a distribution

 Minimum, Q1, Median, Q3, Maximum
 Boxplot
 Data is represented with a box
 The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
 The median is marked by a line within the box
 Whiskers: two lines outside the box extended
to Minimum and Maximum
 Outliers: points beyond a specified outlier
threshold, plotted individually
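
The five-number summary and the 1.5 × IQR outlier rule can be sketched in Python with the standard library. The data values are illustrative, and the quartile convention (`method="inclusive"` in `statistics.quantiles`) is an assumption; other conventions give slightly different Q1 and Q3.

```python
import statistics


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the inclusive quartile method."""
    ordered = sorted(data)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


def iqr_outliers(data, k=1.5):
    """Return values more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]


# Illustrative values, already sorted for readability
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(five_number_summary(values))
print(iqr_outliers(values))
```

In a boxplot of these values, 110 would be plotted as an individual point and the upper whisker would stop at the largest non-outlying value.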

Visualization of Data Dispersion: 3-D Boxplots

Properties of Normal Distribution Curve

 The normal (distribution) curve
   From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
   From μ–2σ to μ+2σ: contains about 95% of the measurements
   From μ–3σ to μ+3σ: contains about 99.7% of the measurements

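The three percentages follow from the normal cumulative distribution: the probability of falling within k standard deviations of the mean is erf(k / √2). A small Python check, using only `math.erf` from the standard library (the function name is an illustrative choice):

```python
import math


def fraction_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} sigma: {fraction_within(k):.4f}")
```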
Graphic Displays of Basic Statistical Descriptions

 Boxplot: graphic display of the five-number summary
 Histogram: the x-axis shows values, the y-axis represents frequencies
 Quantile plot: each value x_i is paired with f_i, indicating that approximately 100 f_i % of the data are ≤ x_i
 Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
 Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane

Histogram Analysis
 Histogram: graph display of tabulated frequencies, shown as bars
 It shows what proportion of cases fall into each of several categories
 Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts, a crucial distinction when the categories are not of uniform width
 The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent

[Figure: histogram; x-axis values from 10000 to 90000, y-axis frequencies from 0 to 40]

Histograms Often Tell More than Boxplots

 The two histograms shown on the left may have the same boxplot representation
   The same values for: min, Q1, median, Q3, max
 But they have rather different data distributions

Quantile Plot
 Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
 Plots quantile information
   For data x_i sorted in increasing order, f_i indicates that approximately 100 f_i % of the data are below or equal to the value x_i

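
A quantile plot needs one (f_i, x_i) pair per observation. The sketch below assumes the common convention f_i = (i − 0.5) / n, which the slide does not state; the function name and sample data are hypothetical.

```python
def quantile_plot_points(data):
    """Return (f_i, x_i) pairs for a quantile plot, with f_i = (i - 0.5) / n.

    The formula for f_i is an assumed convention; x_i are the data sorted
    in increasing order and i runs from 1 to n.
    """
    ordered = sorted(data)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


unit_prices = [40, 10, 30, 20]  # hypothetical values
pairs = quantile_plot_points(unit_prices)
for f, x in pairs:
    print(f"{100 * f:.1f}% of the data are approximately <= {x}")
```

Plotting x against f for every pair gives the quantile plot; plotting the quantiles of two such data sets against each other gives a q-q plot.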

Quantile-Quantile (Q-Q) Plot
 Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
 View: is there a shift in going from one distribution to another?
 Example shows unit price of items sold at Branch 1 vs. Branch 2 for each quantile. Unit prices of items sold at Branch 1 tend to be lower than those at Branch 2.

Scatter plot
 Provides a first look at bivariate data to see clusters of
points, outliers, etc
 Each pair of values is treated as a pair of coordinates and
plotted as points in the plane

Positively and Negatively Correlated Data

 The left half fragment is positively correlated
 The right half is negatively correlated

Uncorrelated Data

