Chapter 6: High-Dimensional Data

The document discusses concepts related to high-dimensional data, including: 1) High-dimensional space does not behave like the familiar geometry of two or three dimensions. The data space can be represented as a hyper-rectangle or a hypercube. 2) Centered data can also be enclosed in a hyperball, whose surface is a hypersphere. The volume of a hypercube with edge length l is l^d, and the volume of a hypersphere of radius r is K_d r^d. 3) As dimensionality increases, the volume of the unit hypersphere first grows and then approaches zero, and a hypersphere inscribed in a hypercube occupies a vanishing fraction of the hypercube's volume, so most of the volume lies in the corners.


Data Mining and Machine Learning:

Fundamental Concepts and Algorithms


dataminingbook.info

Mohammed J. Zaki (1), Wagner Meira Jr. (2)

(1) Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
(2) Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 6: High-dimensional Data

Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 6: High-dimensional Data 1 / 21
High-dimensional Space
Let D be an n × d data matrix. In data mining, the data is typically very high dimensional.
Understanding the nature of high-dimensional space, or hyperspace, is very important,
especially because it does not behave like the more familiar geometry in two or three
dimensions.
Hyper-rectangle: The data space is a d-dimensional hyper-rectangle
$$R_d = \prod_{j=1}^{d} \Big[ \min(X_j), \max(X_j) \Big]$$
where $\min(X_j)$ and $\max(X_j)$ specify the range of $X_j$.


Hypercube: Assume the data is centered, and let m denote the largest absolute attribute value
$$m = \max_{j=1}^{d} \max_{i=1}^{n} \left\{ |x_{ij}| \right\}$$

The data hyperspace can be represented as a hypercube, centered at 0, with all sides of
length l = 2m, given as
$$H_d(l) = \left\{ \mathbf{x} = (x_1, x_2, \ldots, x_d)^T \;\middle|\; \forall i,\; x_i \in [-l/2, l/2] \right\}$$
The unit hypercube has all sides of length l = 1, and is denoted as $H_d(1)$.
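
The hyper-rectangle and hypercube definitions above can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function names (`center`, `hyper_rectangle`, `hypercube_side`, `in_hypercube`) and the three sample points are assumptions chosen for the example.

```python
def center(points):
    """Subtract the per-attribute mean so that the data has mean 0."""
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    return [[p[j] - means[j] for j in range(d)] for p in points]


def hyper_rectangle(points):
    """Return the [min(X_j), max(X_j)] range of every attribute j."""
    d = len(points[0])
    return [(min(p[j] for p in points), max(p[j] for p in points))
            for j in range(d)]


def hypercube_side(points):
    """Return l = 2m, where m is the largest absolute attribute value."""
    m = max(abs(v) for p in points for v in p)
    return 2.0 * m


def in_hypercube(x, l):
    """Check whether x lies in H_d(l), the hypercube centered at 0."""
    return all(-l / 2 <= v <= l / 2 for v in x)


# Hypothetical toy data: three points in two dimensions.
raw = [[1.0, 2.0], [3.0, 6.0], [5.0, 4.0]]
centered = center(raw)
side = hypercube_side(centered)
```

Centering comes first because the hypercube is defined around the origin; the hyper-rectangle does not need it.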
Hypersphere
Assume that the data has been centered, so that µ = 0. Let r denote the largest
magnitude among all points:
$$r = \max_{i} \left\{ \|\mathbf{x}_i\| \right\}$$

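The data hyperspace can be represented as a d-dimensional hyperball centered at
0 with radius r, defined as
$$B_d(r) = \left\{ \mathbf{x} \;\middle|\; \|\mathbf{x}\| \le r \right\} \quad \text{or} \quad B_d(r) = \left\{ \mathbf{x} = (x_1, x_2, \ldots, x_d) \;\middle|\; \sum_{j=1}^{d} x_j^2 \le r^2 \right\}$$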

The surface of the hyperball is called a hypersphere, and it consists of all the
points exactly at distance r from the center of the hyperball
$$S_d(r) = \left\{ \mathbf{x} \;\middle|\; \|\mathbf{x}\| = r \right\} \quad \text{or} \quad S_d(r) = \left\{ \mathbf{x} = (x_1, x_2, \ldots, x_d) \;\middle|\; \sum_{j=1}^{d} (x_j)^2 = r^2 \right\}$$
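
As a small sketch of these definitions, the radius r and the membership tests for the hyperball and the hypersphere might be written as follows. The helper names and the sample points are hypothetical, and the tolerance `tol` is an assumption added because exact floating-point equality is rarely reliable.

```python
import math


def norm(x):
    """Euclidean norm ||x||."""
    return math.sqrt(sum(v * v for v in x))


def hyperball_radius(points):
    """r = largest magnitude among all (already centered) points."""
    return max(norm(p) for p in points)


def in_hyperball(x, r):
    """True if x lies in B_d(r), i.e. ||x|| <= r."""
    return norm(x) <= r


def on_hypersphere(x, r, tol=1e-9):
    """True if x lies on S_d(r), i.e. ||x|| = r up to a small tolerance."""
    return abs(norm(x) - r) <= tol


# Hypothetical centered toy data in two dimensions.
pts = [[-2.0, -2.0], [0.0, 2.0], [2.0, 0.0]]
radius = hyperball_radius(pts)
```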

Iris Data Hyperspace: Hypercube and Hypersphere
l = 4.12 and r = 2.19

[Figure: Iris data plotted on X1: sepal length and X2: sepal width, both axes spanning −2 to 2, with the radius r marked.]
High-dimensional Volumes
Hypercube: The volume of a hypercube with edge length l is given as
$$\mathrm{vol}(H_d(l)) = l^d$$
Hypersphere: The volume of a hyperball and its corresponding hypersphere is identical.


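The volume of a hypersphere is given as
In 1D: $\mathrm{vol}(S_1(r)) = 2r$
In 2D: $\mathrm{vol}(S_2(r)) = \pi r^2$
In 3D: $\mathrm{vol}(S_3(r)) = \frac{4}{3}\pi r^3$
In d dimensions:
$$\mathrm{vol}(S_d(r)) = K_d\, r^d = \left( \frac{\pi^{d/2}}{\Gamma\left(\frac{d}{2} + 1\right)} \right) r^d$$
where
$$\Gamma\left(\frac{d}{2} + 1\right) = \begin{cases} \left(\frac{d}{2}\right)! & \text{if } d \text{ is even} \\ \sqrt{\pi} \left( \dfrac{d!!}{2^{(d+1)/2}} \right) & \text{if } d \text{ is odd} \end{cases}$$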
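
The general formula can be checked against the low-dimensional cases with a short Python sketch. It relies on the standard library's `math.gamma`; the function names are illustrative assumptions, and `gamma_half_plus_one` simply transcribes the even/odd closed form above so that it can be compared with `math.gamma`.

```python
import math


def double_factorial(n):
    """n!! = n * (n - 2) * (n - 4) * ... down to 1 or 2."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def gamma_half_plus_one(d):
    """Gamma(d/2 + 1) from the even/odd closed form."""
    if d % 2 == 0:
        return float(math.factorial(d // 2))
    return math.sqrt(math.pi) * double_factorial(d) / 2 ** ((d + 1) // 2)


def hypersphere_volume(d, r=1.0):
    """vol(S_d(r)) = K_d * r**d with K_d = pi**(d/2) / Gamma(d/2 + 1)."""
    k_d = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    return k_d * r ** d


# Compare the general formula with the 1D, 2D and 3D special cases.
r = 2.0
low_dim = [hypersphere_volume(d, r) for d in (1, 2, 3)]
expected = [2 * r, math.pi * r ** 2, 4 / 3 * math.pi * r ** 3]
```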

Volume of Unit Hypersphere

With increasing dimensionality the hypersphere volume first increases (peaking at d = 5),
then starts to decrease, and ultimately vanishes. In particular, for the unit hypersphere
with r = 1,
$$\lim_{d \to \infty} \mathrm{vol}(S_d(1)) = \lim_{d \to \infty} \frac{\pi^{d/2}}{\Gamma\left(\frac{d}{2} + 1\right)} \to 0$$

[Figure: vol(S_d(1)) plotted against the dimensionality d, for d = 0 to 50.]
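
The rise and fall of the unit hypersphere volume can be reproduced numerically. The following is a rough sketch using only the standard library; `unit_hypersphere_volume` is a name chosen for this example, and the range d = 1 to 50 mirrors the horizontal axis of the plot.

```python
import math


def unit_hypersphere_volume(d):
    """vol(S_d(1)) = pi**(d/2) / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


# Tabulate the volume over the same range of d as the plot.
volumes = {d: unit_hypersphere_volume(d) for d in range(1, 51)}

# Dimension at which the unit hypersphere volume is largest.
peak_d = max(volumes, key=volumes.get)

for d in (1, 2, 5, 10, 20, 50):
    print(f"d = {d:2d}  vol = {volumes[d]:.6g}")
```

Because the gamma function in the denominator eventually grows faster than the power of π in the numerator, the printed values shrink toward zero once d passes the peak.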
Hypersphere Inscribed within Hypercube

Consider the space enclosed within the largest hypersphere that can be
accommodated within a hypercube (which represents the dataspace).
The ratio of the volume of the hypersphere of radius r to the hypercube with side
length l = 2r is given as

In 2 dimensions:
$$\frac{\mathrm{vol}(S_2(r))}{\mathrm{vol}(H_2(2r))} = \frac{\pi r^2}{4 r^2} = \frac{\pi}{4} = 78.5\%$$
In 3 dimensions:
$$\frac{\mathrm{vol}(S_3(r))}{\mathrm{vol}(H_3(2r))} = \frac{\frac{4}{3}\pi r^3}{8 r^3} = \frac{\pi}{6} = 52.4\%$$
In d dimensions:
$$\lim_{d \to \infty} \frac{\mathrm{vol}(S_d(r))}{\mathrm{vol}(H_d(2r))} = \lim_{d \to \infty} \frac{\pi^{d/2}}{2^d\, \Gamma\left(\frac{d}{2} + 1\right)} \to 0$$

As the dimensionality increases, most of the volume of the hypercube is in the
“corners,” whereas the center is essentially empty.
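
To see how quickly the inscribed hypersphere loses its share of the hypercube, the ratio can be evaluated for a few dimensions. This is an illustrative sketch; `inscribed_ratio` is a hypothetical helper, and the radius cancels out so it does not appear as a parameter.

```python
import math


def inscribed_ratio(d):
    """vol(S_d(r)) / vol(H_d(2r)) = pi**(d/2) / (2**d * Gamma(d/2 + 1))."""
    return math.pi ** (d / 2) / (2 ** d * math.gamma(d / 2 + 1))


for d in (2, 3, 5, 10, 20):
    print(f"d = {d:2d}  ratio = {inscribed_ratio(d):.3e}")
```

The first two printed values correspond to the 78.5% and 52.4% figures above.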

Hypersphere Inscribed inside a Hypercube

[Figure: a hypersphere of radius r inscribed in a hypercube that spans −r to r along each axis.]

Conceptual View of High-dimensional Space
Two, three, four, and higher dimensions

Nearly all of the volume of the hyperspace is in the corners, with the center being
essentially empty.

High-dimensional space looks like a rolled-up porcupine!

[Figure: conceptual views of high-dimensional space in (a) 2D, (b) 3D, (c) 4D, and (d) d dimensions.]

Volume of a Thin Shell

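The volume of a thin hypershell of width ε is given as
$$\begin{aligned} \mathrm{vol}(S_d(r, \epsilon)) &= \mathrm{vol}(S_d(r)) - \mathrm{vol}(S_d(r - \epsilon)) \\ &= K_d\, r^d - K_d\, (r - \epsilon)^d \end{aligned}$$

[Figure: a hypershell of width ε between the inner radius r − ε and the outer radius r.]

The ratio of the volume of the thin shell to the volume of the outer sphere:
$$\frac{\mathrm{vol}(S_d(r, \epsilon))}{\mathrm{vol}(S_d(r))} = \frac{K_d\, r^d - K_d\, (r - \epsilon)^d}{K_d\, r^d} = 1 - \left( 1 - \frac{\epsilon}{r} \right)^d$$

As d increases, we have
$$\lim_{d \to \infty} \frac{\mathrm{vol}(S_d(r, \epsilon))}{\mathrm{vol}(S_d(r))} = \lim_{d \to \infty} 1 - \left( 1 - \frac{\epsilon}{r} \right)^d \to 1$$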
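
A quick numerical sketch shows how the shell absorbs the volume as d grows. The function name `shell_fraction` and the choice of ε = 0.01 with r = 1 are assumptions made for the illustration.

```python
def shell_fraction(d, r=1.0, eps=0.01):
    """Fraction of vol(S_d(r)) that lies in the outer shell of width eps."""
    return 1.0 - (1.0 - eps / r) ** d


for d in (2, 10, 100, 1000):
    print(f"d = {d:4d}  shell fraction = {shell_fraction(d):.5f}")
```

Note that the constant K_d cancels in the ratio, so the sketch never needs the gamma function.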

Diagonals in Hyperspace

Consider a d-dimensional hypercube, with origin $\mathbf{0}_d = (0_1, 0_2, \ldots, 0_d)$, and
bounded in each dimension in the range [−1, 1]. Each “corner” of the hyperspace
is a d-dimensional vector of the form $(\pm 1_1, \pm 1_2, \ldots, \pm 1_d)^T$.
Let $\mathbf{e}_i = (0_1, \ldots, 1_i, \ldots, 0_d)^T$ denote the d-dimensional canonical unit vector in
dimension i, and let $\mathbf{1}$ denote the d-dimensional diagonal vector $(1_1, 1_2, \ldots, 1_d)^T$.
Consider the angle $\theta_d$ between the diagonal vector $\mathbf{1}$ and the first axis $\mathbf{e}_1$, in d
dimensions:
$$\cos\theta_d = \frac{\mathbf{e}_1^T \mathbf{1}}{\|\mathbf{e}_1\|\, \|\mathbf{1}\|} = \frac{\mathbf{e}_1^T \mathbf{1}}{\sqrt{\mathbf{e}_1^T \mathbf{e}_1}\, \sqrt{\mathbf{1}^T \mathbf{1}}} = \frac{1}{\sqrt{1}\, \sqrt{d}} = \frac{1}{\sqrt{d}}$$

Diagonals in Hyperspace

As d increases, we have
$$\lim_{d \to \infty} \cos\theta_d = \lim_{d \to \infty} \frac{1}{\sqrt{d}} \to 0$$
which implies that
$$\lim_{d \to \infty} \theta_d \to \frac{\pi}{2} = 90^\circ$$
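
The angle between the diagonal and the first axis can also be computed directly from the dot product, which makes the drift toward 90° easy to observe. This is a minimal sketch; `diagonal_axis_angle` is a hypothetical name, and the vectors are built explicitly rather than using the closed form so that the two can be compared.

```python
import math


def diagonal_axis_angle(d):
    """Angle in degrees between the diagonal vector 1 and the axis e_1."""
    ones = [1.0] * d
    e1 = [1.0] + [0.0] * (d - 1)
    dot = sum(a * b for a, b in zip(e1, ones))
    norm_e1 = math.sqrt(sum(a * a for a in e1))
    norm_ones = math.sqrt(sum(b * b for b in ones))
    return math.degrees(math.acos(dot / (norm_e1 * norm_ones)))


for d in (2, 3, 10, 100, 10000):
    print(f"d = {d:5d}  theta = {diagonal_axis_angle(d):.3f} degrees")
```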

Angle between Diagonal Vector 1 and e 1

[Figure: the angle θ between the diagonal vector 1 and the axis e_1, shown (a) in 2D and (b) in 3D, with each axis spanning −1 to 1.]
In high dimensions all of the diagonal vectors are nearly perpendicular (or orthogonal) to all the
coordinate axes! Each of the $2^{d-1}$ new axes connecting pairs of the $2^d$ corners is
essentially orthogonal to all of the d principal coordinate axes! Thus, in effect,
high-dimensional space has an exponential number of orthogonal “axes.”

Density of the Multivariate Normal
Consider the standard multivariate normal distribution with µ = 0, and Σ = I
$$f(\mathbf{x}) = \frac{1}{(\sqrt{2\pi})^d} \exp\left\{ -\frac{\mathbf{x}^T \mathbf{x}}{2} \right\}$$

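The peak of the density is at the mean. Consider the set of points x with density at least
α fraction of the density at the mean
$$\begin{aligned} \frac{f(\mathbf{x})}{f(\mathbf{0})} &\ge \alpha \\ \exp\left\{ -\frac{\mathbf{x}^T \mathbf{x}}{2} \right\} &\ge \alpha \\ \mathbf{x}^T \mathbf{x} &\le -2 \ln(\alpha) \\ \sum_{i=1}^{d} (x_i)^2 &\le -2 \ln(\alpha) \end{aligned}$$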

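The sum of squares of d IID standard normal random variables follows a chi-squared distribution
with d degrees of freedom, $\chi^2_d$. Thus,
$$P\left( \frac{f(\mathbf{x})}{f(\mathbf{0})} \ge \alpha \right) = F_{\chi^2_d}(-2 \ln(\alpha))$$
where $F_{\chi^2_d}$ is the cumulative distribution function of $\chi^2_d$.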
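
This probability can be evaluated without any third-party library. The sketch below is one possible approach, not the method used on the slides: it assumes the standard power series for the regularized lower incomplete gamma function, and the names `regularized_lower_gamma`, `chi2_cdf`, and `prob_density_at_least` are made up for the example.

```python
import math


def regularized_lower_gamma(a, x, tol=1e-15, max_terms=10_000):
    """P(a, x) = gamma(a, x) / Gamma(a), summed with the power series
    gamma(a, x) = exp(-x) * x**a * sum_k x**k / (a * (a+1) * ... * (a+k))."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for k in range(1, max_terms):
        term *= x / (a + k)
        total += term
        if term < tol * total:
            break
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))


def chi2_cdf(x, d):
    """CDF of the chi-squared distribution with d degrees of freedom."""
    return regularized_lower_gamma(d / 2.0, x / 2.0)


def prob_density_at_least(alpha, d):
    """P(f(x) / f(0) >= alpha) for the standard d-variate normal."""
    return chi2_cdf(-2.0 * math.log(alpha), d)


for d in (1, 2, 3, 10):
    print(f"d = {d:2d}  P = {prob_density_at_least(0.5, d):.5f}")
```

The series converges quickly for the small arguments used here; for large arguments a continued-fraction expansion is usually preferred.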

Density Contour for α Fraction of the Density at the Mean:
One Dimension

Let α = 0.5, then −2 ln(0.5) = 1.386 and $F_{\chi^2_1}(1.386) = 0.76$. Thus, 24% of the
density is in the tail regions.

[Figure: the one-dimensional standard normal density over −4 to 4, with the α = 0.5 level marked.]

Density Contour for α Fraction of the Density at the Mean:
Two Dimensions
Let α = 0.5, then −2 ln(0.5) = 1.386 and $F_{\chi^2_2}(1.386) = 0.50$. Thus, 50% of the
density is in the tail regions.

[Figure: the two-dimensional standard normal density f(x) over X1 and X2, each from −4 to 4, with the α = 0.5 contour marked.]

Chi-Squared Distribution: P(f (x)/f (0) ≥ α)

This probability decreases rapidly with dimensionality. With α = 0.5, it is 0.5 in 2D. In 3D it
is 0.29, i.e., 71% of the density is in the tails. By d = 10, it decreases to 0.00075 (0.075%),
that is, 99.925% of the points lie in the extreme or tail regions.
[Figure: two panels of the chi-squared density f(x) for x from 0 to 15, labeled F = 0.5 and F = 0.29.]

Hypersphere Volume: Polar Coordinates in 2D
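The point $\mathbf{x} = (x_1, x_2)$ in polar coordinates:
$$x_1 = r \cos\theta_1 = r c_1 \qquad x_2 = r \sin\theta_1 = r s_1$$
where $r = \|\mathbf{x}\|$, and $\cos\theta_1 = c_1$ and $\sin\theta_1 = s_1$.

[Figure: the point (x1, x2) at distance r from the origin and at angle θ1 from the X1 axis.]

The Jacobian matrix for this transformation is given as
$$J(\theta_1) = \begin{pmatrix} \frac{\partial x_1}{\partial r} & \frac{\partial x_1}{\partial \theta_1} \\ \frac{\partial x_2}{\partial r} & \frac{\partial x_2}{\partial \theta_1} \end{pmatrix} = \begin{pmatrix} c_1 & -r s_1 \\ s_1 & r c_1 \end{pmatrix}$$
Hypersphere volume is obtained by integration over r and θ1 (with r > 0, and
0 ≤ θ1 ≤ 2π):
$$\begin{aligned} \mathrm{vol}(S_2(r)) &= \int_{r} \int_{\theta_1} \left| \det(J(\theta_1)) \right| \, dr \, d\theta_1 \\ &= \int_0^r \int_0^{2\pi} r \, dr \, d\theta_1 = \int_0^r r \, dr \int_0^{2\pi} d\theta_1 \\ &= \frac{r^2}{2} \Big|_0^r \cdot \theta_1 \Big|_0^{2\pi} = \pi r^2 \end{aligned}$$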
Hypersphere Volume: Polar Coordinates in 3D
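The point $\mathbf{x} = (x_1, x_2, x_3)$ in polar coordinates:
$$\begin{aligned} x_1 &= r \cos\theta_1 \cos\theta_2 = r c_1 c_2 \\ x_2 &= r \cos\theta_1 \sin\theta_2 = r c_1 s_2 \\ x_3 &= r \sin\theta_1 = r s_1 \end{aligned}$$

[Figure: the point (x1, x2, x3) at distance r from the origin, with angles θ1 and θ2, on the axes X1, X2, and X3.]

The Jacobian matrix is given as
$$J(\theta_1, \theta_2) = \begin{pmatrix} c_1 c_2 & -r s_1 c_2 & -r c_1 s_2 \\ c_1 s_2 & -r s_1 s_2 & r c_1 c_2 \\ s_1 & r c_1 & 0 \end{pmatrix}$$
The volume of the hypersphere for d = 3 is obtained via a triple integral with r > 0,
−π/2 ≤ θ1 ≤ π/2, and 0 ≤ θ2 ≤ 2π
$$\mathrm{vol}(S_3(r)) = \int_{r} \int_{\theta_1} \int_{\theta_2} \left| \det(J(\theta_1, \theta_2)) \right| \, dr \, d\theta_1 \, d\theta_2 = \frac{4}{3} \pi r^3$$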
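
The intermediate steps of the 3D case are not shown on the slide. One way to fill them in, sketched here under the assumption that the determinant is expanded along the third row of the Jacobian and that its absolute value is used as the volume element, is:

```latex
\begin{aligned}
\det(J(\theta_1, \theta_2))
  &= s_1 \left[ (-r s_1 c_2)(r c_1 c_2) - (-r c_1 s_2)(-r s_1 s_2) \right]
   - r c_1 \left[ (c_1 c_2)(r c_1 c_2) - (-r c_1 s_2)(c_1 s_2) \right] \\
  &= -r^2 s_1^2 c_1 - r^2 c_1^3 = -r^2 c_1 \\[4pt]
\mathrm{vol}(S_3(r))
  &= \int_0^r r^2 \, dr \int_{-\pi/2}^{\pi/2} \cos\theta_1 \, d\theta_1 \int_0^{2\pi} d\theta_2
   = \frac{r^3}{3} \cdot 2 \cdot 2\pi = \frac{4}{3} \pi r^3
\end{aligned}
```

Since cos θ1 is non-negative on [−π/2, π/2], the absolute value of the determinant is simply r² c1.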

Hypersphere Volume in d Dimensions
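The determinant of the d-dimensional Jacobian matrix is
$$\det(J(\theta_1, \theta_2, \ldots, \theta_{d-1})) = (-1)^d \, r^{d-1} c_1^{d-2} c_2^{d-3} \cdots c_{d-2}$$
The volume of the hypersphere is given by the d-dimensional integral with r > 0,
−π/2 ≤ θi ≤ π/2 for all i = 1, . . . , d − 2, and 0 ≤ θd−1 ≤ 2π:
$$\begin{aligned} \mathrm{vol}(S_d(r)) &= \int_{r} \int_{\theta_1} \int_{\theta_2} \cdots \int_{\theta_{d-1}} \left| \det(J(\theta_1, \theta_2, \ldots, \theta_{d-1})) \right| \, dr \, d\theta_1 \, d\theta_2 \cdots d\theta_{d-1} \\ &= \int_0^r r^{d-1} \, dr \int_{-\pi/2}^{\pi/2} c_1^{d-2} \, d\theta_1 \cdots \int_{-\pi/2}^{\pi/2} c_{d-2} \, d\theta_{d-2} \int_0^{2\pi} d\theta_{d-1} \\ &= \frac{r^d}{d} \cdot \frac{\Gamma\left(\frac{d-1}{2}\right) \Gamma\left(\frac{1}{2}\right)}{\Gamma\left(\frac{d}{2}\right)} \cdot \frac{\Gamma\left(\frac{d-2}{2}\right) \Gamma\left(\frac{1}{2}\right)}{\Gamma\left(\frac{d-1}{2}\right)} \cdots \frac{\Gamma(1)\, \Gamma\left(\frac{1}{2}\right)}{\Gamma\left(\frac{3}{2}\right)} \cdot 2\pi \\ &= \frac{\pi \, \Gamma\left(\frac{1}{2}\right)^{d-2} r^d}{\frac{d}{2} \, \Gamma\left(\frac{d}{2}\right)} \\ &= \left( \frac{\pi^{d/2}}{\Gamma\left(\frac{d}{2} + 1\right)} \right) r^d \end{aligned}$$
The last two steps use the telescoping of the gamma ratios, $\Gamma\left(\frac{1}{2}\right) = \sqrt{\pi}$, and $\frac{d}{2}\, \Gamma\left(\frac{d}{2}\right) = \Gamma\left(\frac{d}{2} + 1\right)$.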
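
The factorization into one radial integral, d − 2 cosine-power integrals, and a final 2π can be checked numerically. The sketch below is only a sanity check under stated assumptions: it uses a plain midpoint rule (the step count of 20,000 is arbitrary), it applies to d ≥ 2, and all function names are hypothetical.

```python
import math


def cos_power_integral(k, steps=20_000):
    """Midpoint-rule estimate of the integral of cos(t)**k over [-pi/2, pi/2]."""
    h = math.pi / steps
    return h * sum(math.cos(-math.pi / 2 + (i + 0.5) * h) ** k
                   for i in range(steps))


def cos_power_closed_form(k):
    """Gamma((k+1)/2) * Gamma(1/2) / Gamma(k/2 + 1)."""
    return math.gamma((k + 1) / 2) * math.gamma(0.5) / math.gamma(k / 2 + 1)


def volume_by_integration(d, r=1.0):
    """Multiply the radial, angular, and final 2*pi factors (d >= 2)."""
    radial = r ** d / d
    angular = 1.0
    for k in range(d - 2, 0, -1):  # exponents d-2, d-3, ..., 1
        angular *= cos_power_integral(k)
    return radial * angular * 2.0 * math.pi


def volume_closed_form(d, r=1.0):
    """pi**(d/2) / Gamma(d/2 + 1) * r**d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d


for d in range(2, 9):
    print(d, volume_by_integration(d), volume_closed_form(d))
```

For d = 2 the loop over cosine powers is empty, which reproduces the polar-coordinate result πr² from the 2D case.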

