Chapter 6: High-Dimensional Data
1
Department of Computer Science
Rensselaer Polytechnic Institute, Troy, NY, USA
2
Department of Computer Science
Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 6: High-dimensional Data 1 / 21
High-dimensional Space
Let D be an n × d data matrix. In data mining, the data is typically very high dimensional.
Understanding the nature of high-dimensional space, or hyperspace, is very important,
especially because it does not behave like the more familiar geometry in two or three
dimensions.
Hyper-rectangle: The data space is a d-dimensional hyper-rectangle
$$R_d = \prod_{j=1}^{d} \Big[ \min(X_j), \max(X_j) \Big]$$
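The data hyperspace can also be represented as a hypercube, centered at 0, with all sides of
length l = 2m, where m is the largest absolute value of any entry in the centered data
matrix D. The hypercube is given as
$$H_d(l) = \Big\{ x = (x_1, x_2, \ldots, x_d)^T \;\Big|\; \forall i,\ x_i \in [-l/2,\, l/2] \Big\}$$
The unit hypercube has all sides of length l = 1, and is denoted as $H_d(1)$.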
Hypersphere
Assume that the data has been centered, so that µ = 0. Let r denote the largest
magnitude among all points:
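$$r = \max_i \big\{ \|x_i\| \big\}$$
The data hyperspace can then be represented as a d-dimensional hyperball of radius r
centered at 0. The surface of the hyperball is called a hypersphere, and it consists of
all the points exactly at distance r from the center of the hyperball:
$$S_d(r) = \big\{ x \;\big|\; \|x\| = r \big\}$$
or
$$S_d(r) = \Big\{ x = (x_1, x_2, \ldots, x_d) \;\Big|\; \sum_{j=1}^{d} (x_j)^2 = r^2 \Big\}$$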
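The two enclosing shapes can be computed directly from a centered data matrix. The following is a minimal sketch, taking m as the largest absolute coordinate value in the centered data; the function names and the toy points are hypothetical and are not the Iris values.

```python
import math

def center(points):
    """Subtract the per-dimension mean so that the data has mean 0."""
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    return [[p[j] - means[j] for j in range(d)] for p in points]

def hypercube_side(centered):
    """Side length l = 2m, with m the largest absolute coordinate value (assumed meaning of m)."""
    m = max(abs(v) for p in centered for v in p)
    return 2.0 * m

def hypersphere_radius(centered):
    """Radius r = largest Euclidean norm among the centered points."""
    return max(math.sqrt(sum(v * v for v in p)) for p in centered)

# Hypothetical toy data, for illustration only
toy = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]]
Z = center(toy)
l = hypercube_side(Z)
r = hypersphere_radius(Z)
print(f"l = {l:.3f}, r = {r:.3f}")
```

Every centered point lies inside both the hypercube of side l and the hyperball of radius r, and r is never smaller than l/2.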
Iris Data Hyperspace: Hypercube and Hypersphere
l = 4.12 and r = 2.19
[Figure: scatter plot of the centered Iris data, X1: sepal length versus X2: sepal width, with the enclosing hypercube and the hypersphere of radius r.]
High-dimensional Volumes
Hypercube: The volume of a hypercube with edge length l is given as
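$$\mathrm{vol}(H_d(l)) = l^d$$
Hypersphere: In d dimensions, the volume of a hypersphere of radius r is given as
$$\mathrm{vol}(S_d(r)) = K_d\, r^d = \left( \frac{\pi^{d/2}}{\Gamma\left(\frac{d}{2} + 1\right)} \right) r^d$$
where
$$\Gamma\left(\frac{d}{2} + 1\right) = \begin{cases} \left(\frac{d}{2}\right)! & \text{if } d \text{ is even} \\[4pt] \sqrt{\pi}\, \dfrac{d!!}{2^{(d+1)/2}} & \text{if } d \text{ is odd} \end{cases}$$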
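The even/odd case formula for the Gamma term can be checked against a general-purpose Gamma function. This is a small sketch under the assumption that Python's standard `math.gamma` serves as the reference; the helper names are illustrative only.

```python
import math

def double_factorial(n):
    """n!! = n * (n - 2) * (n - 4) * ... down to 1 or 2; 1 for n <= 1."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result

def gamma_half_plus_one(d):
    """Gamma(d/2 + 1) computed from the even/odd case formula."""
    if d % 2 == 0:
        return float(math.factorial(d // 2))
    return math.sqrt(math.pi) * double_factorial(d) / 2 ** ((d + 1) // 2)

def hypersphere_volume(d, r=1.0):
    """vol(S_d(r)) = K_d * r^d with K_d = pi^(d/2) / Gamma(d/2 + 1)."""
    k_d = math.pi ** (d / 2) / gamma_half_plus_one(d)
    return k_d * r ** d

for d in (1, 2, 3, 4):
    print(d, gamma_half_plus_one(d), math.gamma(d / 2 + 1))
```

For d = 2 and d = 3 the volume function reduces to the familiar πr² and (4/3)πr³.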
Volume of Unit Hypersphere
With increasing dimensionality the hypersphere volume first increases up to a point, and
then starts to decrease, and ultimately vanishes. In particular, for the unit hypersphere
with r = 1,
$$\lim_{d \to \infty} \mathrm{vol}(S_d(1)) = \lim_{d \to \infty} \frac{\pi^{d/2}}{\Gamma\left(\frac{d}{2} + 1\right)} = 0$$
[Figure: vol(S_d(1)) plotted against d for d = 0 to 50; the volume rises to a peak of about 5 and then decays to 0.]
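The rise and fall of the unit hypersphere volume can be tabulated directly. The sketch below assumes the closed form π^(d/2) / Γ(d/2 + 1) and uses `math.gamma` from the standard library; the variable names are illustrative.

```python
import math

def unit_hypersphere_volume(d):
    """Volume of the unit hypersphere: pi^(d/2) / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

# Tabulate the volume over the same range of d as the plot
volumes = {d: unit_hypersphere_volume(d) for d in range(0, 51)}
peak_d = max(volumes, key=volumes.get)

for d in (1, 2, 3, 5, 10, 20, 50):
    print(f"d = {d:2d}  vol = {volumes[d]:.6g}")
print("volume is largest at d =", peak_d)
```

Over integer dimensions the maximum is attained at d = 5, after which the Gamma term in the denominator dominates and the volume shrinks toward zero.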
Hypersphere Inscribed within Hypercube
Consider the space enclosed within the largest hypersphere that can be
accommodated within a hypercube (which represents the dataspace).
The ratio of the volume of the hypersphere of radius r to the hypercube with side
length l = 2r is given as
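In 2 dimensions:
$$\frac{\mathrm{vol}(S_2(r))}{\mathrm{vol}(H_2(2r))} = \frac{\pi r^2}{4 r^2} = \frac{\pi}{4} = 78.5\%$$
In 3 dimensions:
$$\frac{\mathrm{vol}(S_3(r))}{\mathrm{vol}(H_3(2r))} = \frac{\frac{4}{3} \pi r^3}{8 r^3} = \frac{\pi}{6} = 52.4\%$$
In d dimensions:
$$\lim_{d \to \infty} \frac{\mathrm{vol}(S_d(r))}{\mathrm{vol}(H_d(2r))} = \lim_{d \to \infty} \frac{\pi^{d/2}}{2^d\, \Gamma\left(\frac{d}{2} + 1\right)} = 0$$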
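The ratio of the inscribed hypersphere to its enclosing hypercube does not depend on r, so it can be evaluated as a function of d alone. A minimal sketch, with an illustrative function name:

```python
import math

def inscribed_ratio(d):
    """vol(S_d(r)) / vol(H_d(2r)) = pi^(d/2) / (2^d * Gamma(d/2 + 1)); r cancels out."""
    return math.pi ** (d / 2) / (2 ** d * math.gamma(d / 2 + 1))

for d in (2, 3, 5, 10, 20):
    print(f"d = {d:2d}  ratio = {inscribed_ratio(d):.6g}")
```

The first two values reproduce π/4 and π/6, and by d = 10 the hypersphere already occupies well under 1% of the hypercube.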
Hypersphere Inscribed inside a Hypercube
[Figure: a hypersphere of radius r inscribed in a hypercube that spans [−r, r] along each axis.]
Conceptual View of High-dimensional Space
[Figure: conceptual views of the space in two, three, four, and higher dimensions.]
In high dimensions, nearly all of the volume of the hyperspace is in the corners, with the
center being essentially empty.
Volume of a Thin Shell
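Let $S_d(r, \epsilon)$ denote the thin shell of width ε between an outer hypersphere of radius r
and an inner hypersphere of radius r − ε. Its volume is the difference between the two
hypersphere volumes:
$$\mathrm{vol}(S_d(r, \epsilon)) = \mathrm{vol}(S_d(r)) - \mathrm{vol}(S_d(r - \epsilon)) = K_d\, r^d - K_d\, (r - \epsilon)^d$$
As d increases, we have
$$\lim_{d \to \infty} \frac{\mathrm{vol}(S_d(r, \epsilon))}{\mathrm{vol}(S_d(r))} = \lim_{d \to \infty} 1 - \left(1 - \frac{\epsilon}{r}\right)^d = 1$$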
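To see how quickly the volume concentrates near the surface, the shell fraction 1 − (1 − ε/r)^d can be evaluated for a fixed thin shell. A small sketch; the choice r = 1 and ε = 0.01 is an arbitrary illustration, not a value from the source.

```python
def shell_fraction(d, r, eps):
    """Fraction of the hypersphere volume lying in the outer shell of width eps."""
    return 1.0 - (1.0 - eps / r) ** d

# Illustrative values: a shell that is 1% of the radius
for d in (2, 3, 10, 100, 500, 1000):
    print(f"d = {d:4d}  fraction in shell = {shell_fraction(d, 1.0, 0.01):.4f}")
```

In two dimensions the shell holds about 2% of the volume, while in several hundred dimensions it holds nearly all of it.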
Diagonals in Hyperspace
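Let $\theta_d$ denote the angle between the diagonal (ones) vector $\mathbf{1} = (1, 1, \ldots, 1)^T$ and the
first coordinate axis $e_1 = (1, 0, \ldots, 0)^T$ in d dimensions, so that
$$\cos \theta_d = \frac{e_1^T \mathbf{1}}{\|e_1\|\, \|\mathbf{1}\|} = \frac{1}{\sqrt{d}}$$
As d increases, we have
$$\lim_{d \to \infty} \cos \theta_d = \lim_{d \to \infty} \frac{1}{\sqrt{d}} = 0$$
which implies that
$$\lim_{d \to \infty} \theta_d = \frac{\pi}{2} = 90^\circ$$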
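The angle between the diagonal and a coordinate axis can be computed from the cosine formula for two vectors. The following sketch builds the ones vector and the first axis vector explicitly; the function name is illustrative.

```python
import math

def diagonal_angle_degrees(d):
    """Angle (in degrees) between the ones vector and the first axis e_1 in d dimensions."""
    ones = [1.0] * d
    e1 = [1.0] + [0.0] * (d - 1)
    dot = sum(a * b for a, b in zip(ones, e1))
    norm_ones = math.sqrt(sum(a * a for a in ones))
    norm_e1 = math.sqrt(sum(a * a for a in e1))
    cos_theta = dot / (norm_ones * norm_e1)  # equals 1 / sqrt(d)
    return math.degrees(math.acos(cos_theta))

for d in (2, 3, 10, 100, 10000):
    print(f"d = {d:5d}  theta = {diagonal_angle_degrees(d):.3f} degrees")
```

The angle starts at 45 degrees in 2D and creeps toward, but never reaches, 90 degrees as d grows.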
Angle between Diagonal Vector 1 and e1
[Figure: the angle θ between the diagonal vector 1 and the axis e1, (a) in 2D and (b) in 3D.]
In high dimensions, all of the diagonal vectors are essentially perpendicular (orthogonal) to
all of the coordinate axes. Each of the 2^(d−1) new axes connecting pairs of the 2^d corners
is essentially orthogonal to all of the d principal coordinate axes. Thus, in effect,
high-dimensional space has an exponential number of orthogonal “axes.”
Density of the Multivariate Normal
Consider the standard multivariate normal distribution with µ = 0, and Σ = I
$$f(x) = \frac{1}{(\sqrt{2\pi})^d} \exp\left\{ -\frac{x^T x}{2} \right\}$$
The peak of the density is at the mean. Consider the set of points x with density at least
α fraction of the density at the mean
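$$\frac{f(x)}{f(0)} \ge \alpha$$
$$\exp\left\{ -\frac{x^T x}{2} \right\} \ge \alpha$$
$$x^T x \le -2 \ln(\alpha)$$
$$\sum_{i=1}^{d} (x_i)^2 \le -2 \ln(\alpha)$$
The sum of squares of d IID standard normal random variables follows a chi-squared
distribution with d degrees of freedom, $\chi^2_d$. Thus,
$$P\left( \frac{f(x)}{f(0)} \ge \alpha \right) = F_{\chi^2_d}(-2 \ln(\alpha))$$
where $F_{\chi^2_d}$ is the cumulative distribution function of $\chi^2_d$.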
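The probability above can be evaluated without a statistics library. The sketch below assumes the standard identity that the chi-squared CDF with d degrees of freedom at t equals the regularized lower incomplete gamma function P(d/2, t/2), and sums its power series; the function names are illustrative and this is not presented as the source's own method.

```python
import math

def chi2_cdf(t, d):
    """Chi-squared CDF with d degrees of freedom at t, via the series for P(d/2, t/2)."""
    if t <= 0:
        return 0.0
    a, x = d / 2.0, t / 2.0
    # First series term: x^a * e^(-x) / Gamma(a + 1), computed in log space
    term = math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))
    total = term
    k = 0
    while term > 1e-16 * total:
        k += 1
        term *= x / (a + k)  # ratio of consecutive series terms
        total += term
    return total

def prob_density_at_least(alpha, d):
    """P(f(x)/f(0) >= alpha) for the standard d-dimensional normal."""
    return chi2_cdf(-2.0 * math.log(alpha), d)

for d in (1, 2, 3, 10):
    print(f"d = {d:2d}  P = {prob_density_at_least(0.5, d):.5f}")
```

For d = 2 the CDF has the closed form 1 − exp(−t/2), so the result for α = 0.5 is exactly one half, which gives a quick sanity check on the series.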
Density Contour for α Fraction of the Density at the Mean:
One Dimension
Let α = 0.5; then −2 ln(0.5) = 1.386 and $F_{\chi^2_1}(1.386) = 0.76$. Thus, 24% of the
density is in the tail regions.
[Figure: the standard normal density in one dimension over [−4, 4], with the α = 0.5 level marked.]
Density Contour for α Fraction of the Density at the Mean:
Two Dimensions
Let α = 0.5; then −2 ln(0.5) = 1.386 and $F_{\chi^2_2}(1.386) = 0.50$. Thus, 50% of the
density is in the tail regions.
[Figure: the standard normal density f(x) in two dimensions over X1 and X2, with the α = 0.5 contour marked.]
Chi-Squared Distribution: P(f (x)/f (0) ≥ α)
For α = 0.5, this probability decreases rapidly with dimensionality. In 2D it is 0.5. In 3D
it is 0.29, i.e., 71% of the density is in the tails. By d = 10 it decreases to 0.075%;
that is, 99.925% of the points lie in the extreme or tail regions.
[Figure: two panels of the chi-squared density f(x) for x from 0 to 15, annotated with F = 0.5 and F = 0.29, the 2D and 3D probabilities.]
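The same probabilities can be estimated empirically by sampling. This is a rough Monte Carlo sketch, assuming points drawn from the standard multivariate normal with independent coordinates; the sample size and seed are arbitrary choices, and the estimates only approximate the chi-squared values.

```python
import math
import random

def fraction_near_mean(d, alpha=0.5, n=20000, seed=0):
    """Estimate P(f(x)/f(0) >= alpha) by sampling n standard normal points in d dimensions."""
    rng = random.Random(seed)
    threshold = -2.0 * math.log(alpha)  # condition is x^T x <= -2 ln(alpha)
    hits = 0
    for _ in range(n):
        sq_norm = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))
        if sq_norm <= threshold:
            hits += 1
    return hits / n

for d in (1, 2, 3, 10):
    print(f"d = {d:2d}  estimated fraction = {fraction_near_mean(d):.4f}")
```

With 20,000 samples the estimates for small d land within a couple of percentage points of the values quoted above, and for d = 10 only a handful of sampled points fall inside the high-density region.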
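Hypersphere Volume: Polar Coordinates in 3D
[Figure: polar coordinates of a point (x1, x2, x3), showing the radius r and the angles θ1 and θ2.]
A point (x1, x2, x3) at distance r from the origin can be written as x1 = r cos θ1 cos θ2,
x2 = r cos θ1 sin θ2, and x3 = r sin θ1. Writing c_i = cos θ_i and s_i = sin θ_i, the
Jacobian matrix is given as
$$J(\theta_1, \theta_2) = \begin{pmatrix} c_1 c_2 & -r s_1 c_2 & -r c_1 s_2 \\ c_1 s_2 & -r s_1 s_2 & r c_1 c_2 \\ s_1 & r c_1 & 0 \end{pmatrix}$$
with determinant $\det(J(\theta_1, \theta_2)) = -r^2 c_1$.
The volume of the hypersphere for d = 3 is obtained via a triple integral of the absolute
value of the determinant, with r > 0, −π/2 ≤ θ1 ≤ π/2, and 0 ≤ θ2 ≤ 2π:
$$\mathrm{vol}(S_3(r)) = \int_r \int_{\theta_1} \int_{\theta_2} \left| \det(J(\theta_1, \theta_2)) \right| \, dr \, d\theta_1 \, d\theta_2 = \frac{4}{3} \pi r^3$$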
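Both the determinant and the triple integral can be checked numerically. The sketch below assumes the 3 × 3 Jacobian with entries in terms of cos θ_i and sin θ_i, expands its determinant by cofactors, and approximates the integral of its absolute value with a simple midpoint rule; the grid size n = 40 is an arbitrary choice and the function names are illustrative.

```python
import math

def jacobian_det(r, t1, t2):
    """Determinant of the 3D polar-coordinate Jacobian at (r, theta1, theta2)."""
    c1, s1 = math.cos(t1), math.sin(t1)
    c2, s2 = math.cos(t2), math.sin(t2)
    (a, b, c), (d, e, f), (g, h, i) = (
        (c1 * c2, -r * s1 * c2, -r * c1 * s2),
        (c1 * s2, -r * s1 * s2, r * c1 * c2),
        (s1, r * c1, 0.0),
    )
    # Cofactor expansion along the first row
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def sphere_volume_numeric(radius, n=40):
    """Midpoint-rule approximation of the triple integral of |det J|."""
    dr, dt1, dt2 = radius / n, math.pi / n, 2.0 * math.pi / n
    total = 0.0
    for i in range(n):
        r = (i + 0.5) * dr
        for j in range(n):
            t1 = -math.pi / 2 + (j + 0.5) * dt1
            for k in range(n):
                t2 = (k + 0.5) * dt2
                total += abs(jacobian_det(r, t1, t2))
    return total * dr * dt1 * dt2

print(jacobian_det(2.0, 0.3, 1.1), -(2.0 ** 2) * math.cos(0.3))
print(sphere_volume_numeric(1.0), 4.0 / 3.0 * math.pi)
```

The determinant agrees with −r² cos θ1 to floating-point precision, and the numerical integral approaches (4/3)πr³ as the grid is refined.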
Hypersphere Volume in d Dimensions
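Writing c_i = cos θ_i, the absolute value of the determinant of the d-dimensional Jacobian
matrix is
$$\left| \det(J(\theta_1, \theta_2, \ldots, \theta_{d-1})) \right| = r^{d-1}\, c_1^{d-2}\, c_2^{d-3} \cdots c_{d-2}$$
The volume of the hypersphere is given by the d-dimensional integral with r > 0,
−π/2 ≤ θ_i ≤ π/2 for all i = 1, ..., d − 2, and 0 ≤ θ_{d−1} ≤ 2π:
$$\mathrm{vol}(S_d(r)) = \int_r \int_{\theta_1} \int_{\theta_2} \cdots \int_{\theta_{d-1}} \left| \det(J(\theta_1, \theta_2, \ldots, \theta_{d-1})) \right| \, dr \, d\theta_1 \, d\theta_2 \cdots d\theta_{d-1}$$
$$= \int_0^r r^{d-1} \, dr \int_{-\pi/2}^{\pi/2} c_1^{d-2} \, d\theta_1 \cdots \int_{-\pi/2}^{\pi/2} c_{d-2} \, d\theta_{d-2} \int_0^{2\pi} d\theta_{d-1}$$
$$= \frac{r^d}{d} \cdot \frac{\Gamma\left(\frac{d-1}{2}\right) \Gamma\left(\frac{1}{2}\right)}{\Gamma\left(\frac{d}{2}\right)} \cdot \frac{\Gamma\left(\frac{d-2}{2}\right) \Gamma\left(\frac{1}{2}\right)}{\Gamma\left(\frac{d-1}{2}\right)} \cdots \frac{\Gamma(1)\, \Gamma\left(\frac{1}{2}\right)}{\Gamma\left(\frac{3}{2}\right)} \cdot 2\pi$$
$$= \frac{\pi\, \Gamma\left(\frac{1}{2}\right)^{d-2} r^d}{\frac{d}{2}\, \Gamma\left(\frac{d}{2}\right)}$$
$$= \left( \frac{\pi^{d/2}}{\Gamma\left(\frac{d}{2} + 1\right)} \right) r^d$$
The product telescopes, leaving d − 2 factors of Γ(1/2); the last step uses Γ(1/2) = √π
and (d/2) Γ(d/2) = Γ(d/2 + 1).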
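The telescoping product of Gamma ratios can be checked against the closed form. This sketch assumes the standard identity that the integral of cos^k θ over [−π/2, π/2] equals Γ((k+1)/2) Γ(1/2) / Γ(k/2 + 1), and multiplies the d − 2 angular factors explicitly; the function names are illustrative.

```python
import math

def cos_power_integral(k):
    """Integral of cos(theta)^k over [-pi/2, pi/2], expressed through Gamma functions."""
    return math.gamma((k + 1) / 2) * math.gamma(0.5) / math.gamma(k / 2 + 1)

def volume_by_product(d, r=1.0):
    """(r^d / d) * product of the d - 2 angular integrals * 2 pi, for d >= 2."""
    angular = 1.0
    for k in range(1, d - 1):  # exponents 1, 2, ..., d - 2
        angular *= cos_power_integral(k)
    return (r ** d / d) * angular * 2.0 * math.pi

def volume_closed_form(d, r=1.0):
    """pi^(d/2) / Gamma(d/2 + 1) * r^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d

for d in (2, 3, 4, 10):
    print(d, volume_by_product(d), volume_closed_form(d))
```

The two computations agree to floating-point precision for every d tried, which is a useful guard against miscounting the Γ(1/2) factors in the telescoping step.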
Data Mining and Machine Learning:
Fundamental Concepts and Algorithms
dataminingbook.info