Mod 4 Types of Data in Cluster Analysis

The document provides an overview of data types, including structured and unstructured data, and their attributes. It discusses statistical descriptions of data, methods for measuring similarity and dissimilarity, and various distance measures such as Minkowski distance. Additionally, it covers the importance of data visualization and the characteristics of data objects.

Uploaded by

pobocow192

Getting to Know Your Data
◼ Data Objects and Attribute Types
◼ Basic Statistical Descriptions of Data
◼ Data Visualization
◼ Measuring Data Similarity and Dissimilarity
◼ Summary

Types of Data Sets
◼ Record
  ◼ Relational records
  ◼ Data matrix, e.g., numerical matrix, crosstabs
  ◼ Document data: text documents: term-frequency vector

              team  coach  play  ball  score  game  win  lost  timeout  season
  Document 1     3      0     5     0      2     6    0     2        0       2
  Document 2     0      7     0     2      1     0    0     3        0       0
  Document 3     0      1     0     0      1     2    2     0        3       0

  ◼ Transaction data

  TID  Items
  1    Bread, Coke, Milk
  2    Beer, Bread
  3    Beer, Coke, Diaper, Milk
  4    Beer, Bread, Diaper, Milk
  5    Coke, Diaper, Milk

◼ Graph and network
  ◼ World Wide Web
  ◼ Social or information networks
  ◼ Molecular Structures
◼ Ordered
  ◼ Video data: sequence of images
  ◼ Temporal data: time-series
  ◼ Sequential Data: transaction sequences
  ◼ Genetic sequence data
◼ Spatial, image and multimedia:
  ◼ Spatial data: maps
  ◼ Image data
  ◼ Video data

Important Characteristics of Structured Data
◼ Dimensionality
  ◼ Curse of dimensionality
◼ Sparsity
  ◼ Only presence counts
◼ Resolution
  ◼ Patterns depend on the scale
◼ Distribution
  ◼ Centrality and dispersion

Data Objects
◼ Data sets are made up of data objects.
◼ A data object represents an entity.
◼ Examples:
  ◼ sales database: customers, store items, sales
  ◼ medical database: patients, treatments
  ◼ university database: students, professors, courses
◼ Also called samples, examples, instances, data points, objects, tuples.
◼ Data objects are described by attributes.

Attributes
◼ Attribute (or dimension, feature, variable): a data field representing a characteristic or feature of a data object.
  ◼ E.g., customer_ID, name, address
◼ Types:
  ◼ Nominal
  ◼ Binary
  ◼ Ordinal
  ◼ Numeric: quantitative
    ◼ Interval-scaled
    ◼ Ratio-scaled

Attribute Types
◼ Nominal: categories, states, or “names of things”
  ◼ Hair_color = {auburn, black, blond, brown, grey, red, white}
  ◼ marital status, occupation, ID numbers, zip codes
◼ Binary
  ◼ Nominal attribute with only 2 states (0 and 1)
  ◼ Symmetric binary: both outcomes equally important
    ◼ e.g., gender
  ◼ Asymmetric binary: outcomes not equally important
    ◼ e.g., medical test (positive vs. negative)
    ◼ Convention: assign 1 to the most important outcome (e.g., HIV positive)
◼ Ordinal
  ◼ Values have a meaningful order (ranking), but the magnitude between successive values is not known.
  ◼ Size = {small, medium, large}, grades, army rankings

Numeric Attribute Types
◼ Quantity (integer or real-valued)
◼ Interval
  ◼ Measured on a scale of equal-sized units
  ◼ Values have order
  ◼ E.g., temperature in °C or °F, calendar dates
  ◼ No true zero-point
◼ Ratio
  ◼ Inherent zero-point
  ◼ We can speak of a value as being a multiple (or ratio) of another value (10 K is twice as high as 5 K).

Discrete vs. Continuous Attributes
◼ Discrete Attribute
  ◼ Has only a finite or countably infinite set of values
  ◼ E.g., zip codes, profession, or the set of words in a collection of documents
  ◼ Sometimes represented as integer variables
  ◼ Note: Binary attributes are a special case of discrete attributes
◼ Continuous Attribute
  ◼ Has real numbers as attribute values
  ◼ E.g., temperature, height, or weight
  ◼ Practically, real values can only be measured and represented using a finite number of digits

Basic Statistical Descriptions of Data
◼ Motivation
  ◼ To better understand the data: central tendency, variation and spread
◼ Data dispersion characteristics
  ◼ median, max, min, quantiles, outliers, variance, etc.
◼ Numerical dimensions correspond to sorted intervals
  ◼ Data dispersion: analyzed with multiple granularities of precision
  ◼ Boxplot or quantile analysis on sorted intervals

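Measuring the Central Tendency
◼ Mean (algebraic measure) (sample vs. population):
  $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
  Note: n is sample size and N is population size.
  ◼ Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
  ◼ Trimmed mean: chopping extreme values
◼ Median:
  ◼ Middle value if odd number of values, or average of the middle two values otherwise
  ◼ Estimated by interpolation (for grouped data):
    $median = L_1 + \left( \frac{n/2 - (\sum freq)_l}{freq_{median}} \right) \times width$
    where $L_1$ is the lower boundary of the median interval, $(\sum freq)_l$ is the sum of the frequencies of all intervals below the median interval, $freq_{median}$ is the frequency of the median interval, and $width$ is the width of the median interval
◼ Mode
  ◼ Value that occurs most frequently in the data
  ◼ Unimodal, bimodal, trimodal
  ◼ Empirical formula: mean − mode = 3 × (mean − median)
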
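The central-tendency measures above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the `salaries` sample and the helper names `weighted_mean` and `trimmed_mean` are assumptions made for the example and do not come from the slides.

```python
import statistics


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def trimmed_mean(values, k):
    """Mean after chopping the k smallest and the k largest values."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical sample (e.g., salaries in thousands); illustrative only.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(salaries)        # arithmetic mean
median = statistics.median(salaries)    # average of the two middle values (even n)
modes = statistics.multimode(salaries)  # every most-frequent value (bimodal here)
trimmed = trimmed_mean(salaries, 1)     # drops the extremes 30 and 110
```

The extreme value 110 pulls the mean above the median, and the trimmed mean falls between them, which is why trimming is used when a few extreme values would otherwise dominate.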
Symmetric vs. Skewed Data
◼ Median, mean and mode of symmetric, positively and negatively skewed data
[Figure: three distribution curves labeled symmetric, positively skewed, and negatively skewed]

March 7, 2023 Data Mining: Concepts and Techniques

Measuring the Dispersion of Data
◼ Quartiles, outliers and boxplots
  ◼ Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  ◼ Inter-quartile range: IQR = Q3 − Q1
  ◼ Five number summary: min, Q1, median, Q3, max
  ◼ Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
  ◼ Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
◼ Variance and standard deviation (sample: s, population: σ)
  ◼ Variance: (algebraic, scalable computation)
    $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$
    $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$
  ◼ Standard deviation s (or σ) is the square root of variance s² (or σ²)

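As a rough sketch of the dispersion measures, the following Python computes a five-number summary and flags values lying more than 1.5 × IQR beyond the quartiles. The `data` list and the function names are assumptions made for illustration. Quartile conventions differ between textbooks and tools; this sketch uses the "inclusive" method of the standard library's `statistics.quantiles`, so other software may report slightly different Q1 and Q3.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def iqr_outliers(values, k=1.5):
    """Values more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sample with one extreme value; illustrative only.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

summary = five_number_summary(data)
outliers = iqr_outliers(data)
sample_variance = statistics.variance(data)  # divides by n - 1
sample_stdev = statistics.stdev(data)        # square root of the sample variance
```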
Similarity and Dissimilarity
◼ Similarity
  ◼ Numerical measure of how alike two data objects are
  ◼ Value is higher when objects are more alike
  ◼ Often falls in the range [0,1]
◼ Dissimilarity (e.g., distance)
  ◼ Numerical measure of how different two data objects are
  ◼ Lower when objects are more alike
  ◼ Minimum dissimilarity is often 0
  ◼ Upper limit varies
◼ Proximity refers to a similarity or dissimilarity

Data Matrix and Dissimilarity Matrix
◼ Data matrix
  ◼ n data points with p dimensions
  ◼ Two modes

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

◼ Dissimilarity matrix
  ◼ n data points, but registers only the distance
  ◼ A triangular matrix

$$
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$

Proximity Measure for Nominal Attributes
◼ Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary attribute)
◼ Method 1: Simple matching
  ◼ m: # of matches, p: total # of variables
    $d(i, j) = \frac{p - m}{p}$
◼ Method 2: Use a large number of binary attributes
  ◼ creating a new binary attribute for each of the M nominal states

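Simple matching (Method 1) is short enough to sketch directly. The function name and the example objects below are hypothetical, chosen only to illustrate d(i, j) = (p − m) / p.

```python
def simple_matching_dissimilarity(a, b):
    """d(i, j) = (p - m) / p for two objects described by nominal attributes."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    p = len(a)                                  # total number of attributes
    m = sum(1 for x, y in zip(a, b) if x == y)  # number of matching attributes
    return (p - m) / p


# Hypothetical objects with three nominal attributes (color, size, shape).
obj_i = ("red", "small", "round")
obj_j = ("red", "large", "round")
d_ij = simple_matching_dissimilarity(obj_i, obj_j)  # two of three attributes match
```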
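Proximity Measure for Binary Attributes
◼ A contingency table for binary data

                   Object j
                   1        0        sum
  Object i   1     q        r        q + r
             0     s        t        s + t
           sum     q + s    r + t    p

◼ Distance measure for symmetric binary variables:
  $d(i, j) = \frac{r + s}{q + r + s + t}$
◼ Distance measure for asymmetric binary variables:
  $d(i, j) = \frac{r + s}{q + r + s}$
◼ Jaccard coefficient (similarity measure for asymmetric binary variables):
  $sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$
◼ Note: Jaccard coefficient is the same as “coherence”:
  $coherence(i, j) = \frac{sup(i, j)}{sup(i) + sup(j) - sup(i, j)} = \frac{q}{(q + r) + (q + s) - q}$
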
Dissimilarity between Binary Variables
◼ Example

  Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack  M       Y      N      P       N       N       N
  Mary  F       Y      N      P       N       P       N
  Jim   M       Y      P      N       N       N       N

◼ Gender is a symmetric attribute
◼ The remaining attributes are asymmetric binary
◼ Let the values Y and P be 1, and the value N be 0

  $d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$
  $d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$
  $d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$

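The three distances in this example can be reproduced with a short Python sketch. It assumes the asymmetric binary dissimilarity (r + s) / (q + r + s), where q counts attributes that are 1 for both objects and r and s count attributes that are 1 for only one of them; this is the form the arithmetic above follows. The helper names `encode` and `asymmetric_binary_dissimilarity` are made up for the illustration, and Gender is left out because it is symmetric.

```python
def encode(row):
    """Map Y and P to 1, and N to 0."""
    return [1 if value in ("Y", "P") else 0 for value in row]


def asymmetric_binary_dissimilarity(a, b):
    """(r + s) / (q + r + s); 0/0 matches (t) are ignored."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    if q + r + s == 0:
        raise ValueError("undefined when neither object has any positive attribute")
    return (r + s) / (q + r + s)


# Fever, Cough, Test-1, Test-2, Test-3, Test-4 from the table above.
patients = {
    "jack": encode(["Y", "N", "P", "N", "N", "N"]),
    "mary": encode(["Y", "N", "P", "N", "P", "N"]),
    "jim": encode(["Y", "P", "N", "N", "N", "N"]),
}

d_jack_mary = asymmetric_binary_dissimilarity(patients["jack"], patients["mary"])
d_jack_jim = asymmetric_binary_dissimilarity(patients["jack"], patients["jim"])
d_jim_mary = asymmetric_binary_dissimilarity(patients["jim"], patients["mary"])
```

Jack and Mary come out as the most similar pair, and Jim and Mary as the least similar.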
Standardizing Numeric Data
◼ Z-score: $z = \frac{x - \mu}{\sigma}$
  ◼ x: raw score to be standardized, μ: mean of the population, σ: standard deviation
  ◼ the distance between the raw score and the population mean in units of the standard deviation
  ◼ negative when the raw score is below the mean, positive when above
◼ An alternative way: Calculate the mean absolute deviation
  $s_f = \frac{1}{n} \left( |x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f| \right)$
  where $m_f = \frac{1}{n} \left( x_{1f} + x_{2f} + \cdots + x_{nf} \right)$
  ◼ standardized measure (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$
◼ Using mean absolute deviation is more robust than using standard deviation

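The mean-absolute-deviation variant of the z-score can be sketched as follows. The function name and the sample column are assumptions for illustration; the code standardizes a single attribute f across n objects.

```python
def standardize_mad(values):
    """z_if = (x_if - m_f) / s_f, with s_f the mean absolute deviation."""
    n = len(values)
    m_f = sum(values) / n                         # mean of attribute f
    s_f = sum(abs(x - m_f) for x in values) / n   # mean absolute deviation
    if s_f == 0:
        raise ValueError("all values are identical; cannot standardize")
    return [(x - m_f) / s_f for x in values]


# Hypothetical attribute column; illustrative only.
column = [2, 4, 4, 6]
z_scores = standardize_mad(column)  # mean 4, mean absolute deviation 1
```

Because deviations are not squared, a single extreme value inflates s_f less than it would inflate the standard deviation, so outliers remain easier to spot after standardization.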
Example: Data Matrix and Dissimilarity Matrix

Data Matrix
point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

Dissimilarity Matrix (with Euclidean Distance)
      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0

Distance on Numeric Data: Minkowski Distance
◼ Minkowski distance: A popular distance measure

  $d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$

  where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
◼ Properties
  ◼ d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
  ◼ d(i, j) = d(j, i) (Symmetry)
  ◼ d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)

Special Cases of Minkowski Distance
◼ h = 1: Manhattan (city block, L1 norm) distance
  ◼ E.g., the Hamming distance: the number of bits that are different between two binary vectors
  $d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
◼ h = 2: (L2 norm) Euclidean distance
  $d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
◼ h → ∞: “supremum” (Lmax norm, L∞ norm) distance
  ◼ This is the maximum difference between any component (attribute) of the vectors

Example: Minkowski Distance

point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5

Dissimilarity Matrices

Manhattan (L1)
L1    x1    x2    x3    x4
x1    0
x2    5     0
x3    3     6     0
x4    6     1     7     0

Euclidean (L2)
L2    x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0

Supremum (L∞)
L∞    x1    x2    x3    x4
x1    0
x2    3     0
x3    2     5     0
x4    3     1     5     0

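The three matrices in this example can be recomputed with a small Python sketch. The function names are assumptions for illustration; passing `math.inf` as the order is used here as a convention for the supremum distance.

```python
import math


def minkowski(a, b, h):
    """Minkowski distance of order h; h = math.inf gives the supremum distance."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(h):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1.0 / h)


def dissimilarity_matrix(points, h):
    """Lower-triangular matrix of pairwise distances, rounded to two decimals."""
    names = list(points)
    return [
        [round(minkowski(points[row], points[col], h), 2) for col in names[: i + 1]]
        for i, row in enumerate(names)
    ]


# The four points from the example above.
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

manhattan = dissimilarity_matrix(points, 1)
euclidean = dissimilarity_matrix(points, 2)
supremum = dissimilarity_matrix(points, math.inf)
```

Only the lower triangle is stored because d(i, j) = d(j, i) and d(i, i) = 0.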
Ordinal Variables
◼ An ordinal variable can be discrete or continuous
◼ Order is important, e.g., rank
◼ Can be treated like interval-scaled
  ◼ replace $x_{if}$ by their rank $r_{if} \in \{1, \ldots, M_f\}$
  ◼ map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
    $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
  ◼ compute the dissimilarity using methods for interval-scaled variables

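A minimal sketch of the rank mapping, assuming the ordered list of levels is known in advance. The function name and the `sizes` scale are hypothetical.

```python
def ordinal_to_unit_interval(value, ordered_levels):
    """Map an ordinal value to z = (r - 1) / (M - 1), with ranks r in 1..M."""
    m_f = len(ordered_levels)
    if m_f < 2:
        raise ValueError("need at least two levels")
    rank = ordered_levels.index(value) + 1  # r_if, starting at 1
    return (rank - 1) / (m_f - 1)


# Hypothetical ordinal scale, listed from lowest to highest.
sizes = ["small", "medium", "large"]

z_small = ordinal_to_unit_interval("small", sizes)
z_medium = ordinal_to_unit_interval("medium", sizes)
z_large = ordinal_to_unit_interval("large", sizes)

# Once mapped, any interval-scaled distance applies, e.g., absolute difference.
d_small_large = abs(z_small - z_large)
```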
Attributes of Mixed Type
◼ A database may contain all attribute types
  ◼ Nominal, symmetric binary, asymmetric binary, numeric, ordinal
◼ One may use a weighted formula to combine their effects
  $d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
  ◼ f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
  ◼ f is numeric: use the normalized distance
  ◼ f is ordinal
    ◼ Compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
    ◼ Treat $z_{if}$ as interval-scaled

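The weighted formula can be sketched in Python under a few stated assumptions, none of which are spelled out on the slide: the indicator δ is taken to be 0 when either value is missing (`None`) and 1 otherwise; the normalized numeric distance is taken to be |x_if − x_jf| divided by the attribute's range (max − min); and the `specs` structure and function name are hypothetical.

```python
def mixed_dissimilarity(a, b, specs):
    """Combine per-attribute dissimilarities for objects of mixed type.

    specs holds one (kind, param) pair per attribute:
      ("nominal", None)             -> 0 if equal, 1 otherwise (also fits binary)
      ("numeric", (min, max))       -> |x - y| / (max - min)   (assumed normalization)
      ("ordinal", [levels in order]) -> |z_x - z_y| with z = (r - 1) / (M - 1)
    """
    numerator = 0.0
    denominator = 0
    for x, y, (kind, param) in zip(a, b, specs):
        if x is None or y is None:
            continue  # delta = 0: attribute skipped (assumed missing-value rule)
        if kind == "nominal":
            d = 0.0 if x == y else 1.0
        elif kind == "numeric":
            low, high = param
            d = abs(x - y) / (high - low)
        elif kind == "ordinal":
            m_f = len(param)
            d = abs(param.index(x) - param.index(y)) / (m_f - 1)
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
        numerator += d
        denominator += 1  # delta = 1
    if denominator == 0:
        raise ValueError("no comparable attributes")
    return numerator / denominator


# Hypothetical schema: color (nominal), price (numeric, range 0..40), size (ordinal).
specs = [
    ("nominal", None),
    ("numeric", (0.0, 40.0)),
    ("ordinal", ["small", "medium", "large"]),
]

d_full = mixed_dissimilarity(("red", 10.0, "small"), ("blue", 20.0, "large"), specs)
d_missing = mixed_dissimilarity(("red", None, "small"), ("red", 20.0, "medium"), specs)
```

In the second call the price attribute is skipped, so the result averages only the color and size contributions.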
Cosine Similarity
◼ A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as keywords) or phrase in the document.
◼ Other vector objects: gene features in micro-arrays, …
◼ Applications: information retrieval, biological taxonomy, gene feature mapping, ...
◼ Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
  cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
  where • indicates vector dot product, ||d||: the length of vector d

Example: Cosine Similarity
◼ cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
  where • indicates vector dot product, ||d||: the length of vector d
◼ Ex: Find the similarity between documents 1 and 2.

  d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
  d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

  d1 • d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
  ||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
  ||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.12
  cos(d1, d2) = 25 / (6.481 × 4.12) ≈ 0.94

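The same calculation as a short Python sketch, using the two term-frequency vectors from the example. The function name is an assumption for illustration.

```python
import math


def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)."""
    dot = sum(x * y for x, y in zip(d1, d2))
    norm1 = math.sqrt(sum(x * x for x in d1))
    norm2 = math.sqrt(sum(y * y for y in d2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

dot_product = sum(x * y for x, y in zip(d1, d2))
similarity = cosine_similarity(d1, d2)
```

Because the measure depends only on the angle between the vectors, scaling a document's term counts (for example, a longer document with the same word proportions) leaves the similarity unchanged.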
Summary
◼ Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled
◼ Many types of data sets, e.g., numerical, text, graph, Web, image.
◼ Gain insight into the data by:
  ◼ Basic statistical data description: central tendency, dispersion, graphical displays
  ◼ Data visualization: map data onto graphical primitives
  ◼ Measure data similarity
◼ Above steps are the beginning of data preprocessing.
◼ Many methods have been developed, but this is still an active area of research.