CHAPTER 1 Data Mining and Analysis

Data mining is the process of discovering insightful, interesting, and novel patterns, as
well as descriptive, understandable, and predictive models from large-scale data. We
begin this chapter by looking at basic properties of data modeled as a data matrix. We
emphasize the geometric and algebraic views, as well as the probabilistic interpreta-
tion of data. We then discuss the main data mining tasks, which span exploratory data
analysis, frequent pattern mining, clustering, and classification, laying out the roadmap
for the book.

1.1 DATA MATRIX

Data can often be represented or abstracted as an n × d data matrix, with n rows and
d columns, where rows correspond to entities in the dataset, and columns represent
attributes or properties of interest. Each row in the data matrix records the observed
attribute values for a given entity. The n × d data matrix is given as

$$D = \begin{array}{c|cccc}
 & X_1 & X_2 & \cdots & X_d \\
\hline
x_1 & x_{11} & x_{12} & \cdots & x_{1d} \\
x_2 & x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_n & x_{n1} & x_{n2} & \cdots & x_{nd}
\end{array}$$

where xi denotes the ith row, which is a d-tuple given as

$$x_i = (x_{i1}, x_{i2}, \ldots, x_{id})$$

and Xj denotes the jth column, which is an n-tuple given as

$$X_j = (x_{1j}, x_{2j}, \ldots, x_{nj})$$

Depending on the application domain, rows may also be referred to as entities,


instances, examples, records, transactions, objects, points, feature-vectors, tuples, and so
on. Likewise, columns may also be called attributes, properties, features, dimensions,
variables, fields, and so on. The number of instances n is referred to as the size of the
data, whereas the number of attributes d is called the dimensionality of the data. The
analysis of a single attribute is referred to as univariate analysis, whereas the simultaneous
analysis of two attributes is called bivariate analysis and the simultaneous analysis
of more than two attributes is called multivariate analysis.

Table 1.1. Extract from the Iris dataset

         Sepal    Sepal    Petal    Petal
         length   width    length   width    Class
         X1       X2       X3       X4       X5

  x1     5.9      3.0      4.2      1.5      Iris-versicolor
  x2     6.9      3.1      4.9      1.5      Iris-versicolor
  x3     6.6      2.9      4.6      1.3      Iris-versicolor
  x4     4.6      3.2      1.4      0.2      Iris-setosa
  x5     6.0      2.2      4.0      1.0      Iris-versicolor
  x6     4.7      3.2      1.3      0.2      Iris-setosa
  x7     6.5      3.0      5.8      2.2      Iris-virginica
  x8     5.8      2.7      5.1      1.9      Iris-virginica
  ...    ...      ...      ...      ...      ...
  x149   7.7      3.8      6.7      2.2      Iris-virginica
  x150   5.1      3.4      1.5      0.2      Iris-setosa

Example 1.1. Table 1.1 shows an extract of the Iris dataset; the complete data forms
a 150 × 5 data matrix. Each entity is an Iris flower, and the attributes include sepal
length, sepal width, petal length, and petal width in centimeters, and the type
or class of the Iris flower. The first row is given as the 5-tuple

x1 = (5.9, 3.0, 4.2, 1.5, Iris-versicolor)
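To make the data-matrix view concrete, here is a minimal Python/NumPy sketch (an addition, not part of the original text) that stores the first few rows of Table 1.1 as an n × d numeric matrix, keeping the categorical class label out of the numeric block:

```python
import numpy as np

# Numeric attributes of the first four rows of Table 1.1; the class labels are
# kept in a separate list since they are categorical, not numeric.
X = np.array([
    [5.9, 3.0, 4.2, 1.5],   # x1, Iris-versicolor
    [6.9, 3.1, 4.9, 1.5],   # x2, Iris-versicolor
    [6.6, 2.9, 4.6, 1.3],   # x3, Iris-versicolor
    [4.6, 3.2, 1.4, 0.2],   # x4, Iris-setosa
])
labels = ["Iris-versicolor", "Iris-versicolor", "Iris-versicolor", "Iris-setosa"]

n, d = X.shape        # size n and dimensionality d
x1 = X[0]             # row x1: a d-dimensional point
X2 = X[:, 1]          # column X2 (sepal width): an n-dimensional vector
print(n, d, x1, X2)
```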

Not all datasets are in the form of a data matrix. For instance, more complex
datasets can be in the form of sequences (e.g., DNA and protein sequences), text,
time-series, images, audio, video, and so on, which may need special techniques for
analysis. However, in many cases even if the raw data is not a data matrix it can
usually be transformed into that form via feature extraction. For example, given a
database of images, we can create a data matrix in which rows represent images
and columns correspond to image features such as color, texture, and so on. Some-
times, certain attributes may have special semantics associated with them requiring
special treatment. For instance, temporal or spatial attributes are often treated dif-
ferently. It is also worth noting that traditional data analysis assumes that each entity
or instance is independent. However, given the interconnected nature of the world
we live in, this assumption may not always hold. Instances may be connected to
other instances via various kinds of relationships, giving rise to a data graph, where
a node represents an entity and an edge represents the relationship between two
entities.

1.2 ATTRIBUTES

Attributes may be classified into two main types depending on their domain, that is,
depending on the types of values they take on.

Numeric Attributes
A numeric attribute is one that has a real-valued or integer-valued domain. For
example, Age with domain(Age) = N, where N denotes the set of natural numbers
(non-negative integers), is numeric, and so is petal length in Table 1.1, with
domain(petal length) = R+ (the set of all positive real numbers). Numeric attributes
that take on a finite or countably infinite set of values are called discrete, whereas those
that can take on any real value are called continuous. As a special case of discrete,
if an attribute has as its domain the set {0, 1}, it is called a binary attribute. Numeric
attributes can be classified further into two types:

• Interval-scaled: For these kinds of attributes only differences (addition or subtraction)


make sense. For example, attribute temperature measured in °C or °F is interval-
scaled. If it is 20°C on one day and 10°C on the following day, it is meaningful to
talk about a temperature drop of 10°C, but it is not meaningful to say that it is twice
as cold as the previous day.
• Ratio-scaled: Here one can compute both differences as well as ratios between values.
For example, for attribute Age, we can say that someone who is 20 years old is twice as
old as someone who is 10 years old.

Categorical Attributes
A categorical attribute is one that has a set-valued domain composed of a set of
symbols. For example, Sex and Education could be categorical attributes with their
domains given as

domain(Sex) = {M, F}

domain(Education) = {HighSchool, BS, MS, PhD}

Categorical attributes may be of two types:

• Nominal: The attribute values in the domain are unordered, and thus only equality
comparisons are meaningful. That is, we can check only whether the value of the
attribute for two given instances is the same or not. For example, Sex is a nomi-
nal attribute. Also class in Table 1.1 is a nominal attribute with domain(class) =
{iris-setosa, iris-versicolor, iris-virginica}.
• Ordinal: The attribute values are ordered, and thus both equality comparisons (is one
value equal to another?) and inequality comparisons (is one value less than or greater
than another?) are allowed, though it may not be possible to quantify the difference
between values. For example, Education is an ordinal attribute because its domain
values are ordered by increasing educational qualification.
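As a small illustrative sketch (not from the text), the nominal/ordinal distinction can be mirrored in code: nominal values support only equality tests, while ordinal values can be mapped to ordered integer codes. The encoding below is hypothetical:

```python
# Hypothetical ordinal encoding for Education: the integer codes preserve the
# ordering HighSchool < BS < MS < PhD, so inequality comparisons are meaningful.
education_order = {"HighSchool": 0, "BS": 1, "MS": 2, "PhD": 3}

def less_qualified(a: str, b: str) -> bool:
    return education_order[a] < education_order[b]

# Sex is nominal: only equality comparisons make sense.
print(less_qualified("BS", "PhD"))   # True
print("M" == "F")                    # False; an ordering "M" < "F" has no meaning
```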

1.3 DATA: ALGEBRAIC AND GEOMETRIC VIEW

If the d attributes or dimensions in the data matrix D are all numeric, then each row
can be considered as a d-dimensional point:

xi = (x i1 , x i2 , . . . , x id ) ∈ Rd

or equivalently, each row may be considered as a d-dimensional column vector (all


vectors are assumed to be column vectors by default):

$$x_i = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{id} \end{pmatrix} = \begin{pmatrix} x_{i1} & x_{i2} & \cdots & x_{id} \end{pmatrix}^T \in \mathbb{R}^d$$

where T is the matrix transpose operator.


The d-dimensional Cartesian coordinate space is specified via the d unit vectors,
called the standard basis vectors, along each of the axes. The j th standard basis vec-
tor e j is the d-dimensional unit vector whose j th component is 1 and the rest of the
components are 0

$$e_j = (0, \ldots, 1_j, \ldots, 0)^T$$

Any other vector in Rd can be written as linear combination of the standard basis
vectors. For example, each of the points xi can be written as the linear combination

$$x_i = x_{i1} e_1 + x_{i2} e_2 + \cdots + x_{id} e_d = \sum_{j=1}^{d} x_{ij} e_j$$

where the scalar value x i j is the coordinate value along the j th axis or attribute.

Example 1.2. Consider the Iris data in Table 1.1. If we project the entire data
onto the first two attributes, then each row can be considered as a point or
a vector in 2-dimensional space. For example, the projection of the 5-tuple
x1 = (5.9, 3.0, 4.2, 1.5, Iris-versicolor) on the first two attributes is shown in
Figure 1.1a. Figure 1.2 shows the scatterplot of all the n = 150 points in the 2-
dimensional space spanned by the first two attributes. Likewise, Figure 1.1b shows
x1 as a point and vector in 3-dimensional space, by projecting the data onto the first
three attributes. The point (5.9, 3.0, 4.2) can be seen as specifying the coefficients in
the linear combination of the standard basis vectors in R3:

$$x_1 = 5.9\,e_1 + 3.0\,e_2 + 4.2\,e_3 = 5.9 \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} + 3.0 \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} + 4.2 \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} = \begin{pmatrix} 5.9 \\ 3.0 \\ 4.2 \end{pmatrix}$$
Figure 1.1. Row x1 as a point and vector in (a) R2 and (b) R3.

Figure 1.2. Scatterplot: sepal length (X1) versus sepal width (X2). The solid circle shows the mean point.

Each numeric column or attribute can also be treated as a vector in an


n-dimensional space Rn:

$$X_j = \begin{pmatrix} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \end{pmatrix}$$
If all attributes are numeric, then the data matrix D is in fact an n × d matrix, also
written as D ∈ Rn×d , given as

$$D = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{pmatrix} = \begin{pmatrix} \text{---}\; x_1^T\; \text{---} \\ \text{---}\; x_2^T\; \text{---} \\ \vdots \\ \text{---}\; x_n^T\; \text{---} \end{pmatrix} = \begin{pmatrix} | & | & & | \\ X_1 & X_2 & \cdots & X_d \\ | & | & & | \end{pmatrix}$$

As we can see, we can consider the entire dataset as an n × d matrix, or equivalently as


a set of n row vectors xiT ∈ Rd or as a set of d column vectors X j ∈ Rn .

1.3.1 Distance and Angle

Treating data instances and attributes as vectors, and the entire dataset as a matrix,
enables one to apply both geometric and algebraic methods to aid in the data mining
and analysis tasks.
Let a, b ∈ Rm be two m-dimensional vectors given as

$$a = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \end{pmatrix} \qquad b = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix}$$

Dot Product
The dot product between a and b is defined as the scalar value

$$a^T b = \begin{pmatrix} a_1 & a_2 & \cdots & a_m \end{pmatrix} \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix} = a_1 b_1 + a_2 b_2 + \cdots + a_m b_m = \sum_{i=1}^{m} a_i b_i$$

Length
The Euclidean norm or length of a vector a ∈ Rm is defined as

$$\|a\| = \sqrt{a^T a} = \sqrt{a_1^2 + a_2^2 + \cdots + a_m^2} = \sqrt{\sum_{i=1}^{m} a_i^2}$$

The unit vector in the direction of a is given as

$$u = \frac{a}{\|a\|} = \left(\frac{1}{\|a\|}\right) a$$

By definition u has length ‖u‖ = 1, and it is also called a normalized vector, which can
be used in lieu of a in some analysis tasks.
The Euclidean norm is a special case of a general class of norms, known as
Lp-norm, defined as

$$\|a\|_p = \left(|a_1|^p + |a_2|^p + \cdots + |a_m|^p\right)^{1/p} = \left(\sum_{i=1}^{m} |a_i|^p\right)^{1/p}$$

for any p ≠ 0. Thus, the Euclidean norm corresponds to the case when p = 2.

Distance
From the Euclidean norm we can define the Euclidean distance between a and b, as
follows
$$\delta(a, b) = \|a - b\| = \sqrt{(a - b)^T (a - b)} = \sqrt{\sum_{i=1}^{m} (a_i - b_i)^2} \qquad (1.1)$$

Thus, the length of a vector is simply its distance from the zero vector 0, all of whose
elements are 0, that is, ‖a‖ = ‖a − 0‖ = δ(a, 0).
From the general Lp-norm we can define the corresponding Lp-distance function,
given as follows

$$\delta_p(a, b) = \|a - b\|_p \qquad (1.2)$$

Angle
The cosine of the smallest angle between vectors a and b, also called the cosine
similarity, is given as

$$\cos\theta = \frac{a^T b}{\|a\|\,\|b\|} = \left(\frac{a}{\|a\|}\right)^T \left(\frac{b}{\|b\|}\right) \qquad (1.3)$$

Thus, the cosine of the angle between a and b is given as the dot product of the unit
vectors a/‖a‖ and b/‖b‖.
The Cauchy–Schwarz inequality states that for any vectors a and b in Rm

$$|a^T b| \le \|a\| \cdot \|b\|$$



Figure 1.3. Distance and angle. Unit vectors are shown in gray.

It follows immediately from the Cauchy–Schwarz inequality that

−1 ≤ cos θ ≤ 1

Because the smallest angle θ ∈ [0°, 180°] and because cos θ ∈ [−1, 1], the cosine similar-
ity value ranges from +1, corresponding to an angle of 0°, to −1, corresponding to an
angle of 180° (or π radians).

Orthogonality
Two vectors a and b are said to be orthogonal if and only if aTb = 0, which in turn
implies that cos θ = 0, that is, the angle between them is 90° or π/2 radians. In this case,
we say that they have no similarity.

Example 1.3 (Distance and Angle). Figure 1.3 shows the two vectors

$$a = \begin{pmatrix} 5 \\ 3 \end{pmatrix} \quad \text{and} \quad b = \begin{pmatrix} 1 \\ 4 \end{pmatrix}$$

Using Eq. (1.1), the Euclidean distance between them is given as

$$\delta(a, b) = \sqrt{(5 - 1)^2 + (3 - 4)^2} = \sqrt{16 + 1} = \sqrt{17} = 4.12$$

The distance can also be computed as the magnitude of the vector

$$a - b = \begin{pmatrix} 5 \\ 3 \end{pmatrix} - \begin{pmatrix} 1 \\ 4 \end{pmatrix} = \begin{pmatrix} 4 \\ -1 \end{pmatrix}$$

because ‖a − b‖ = √(4² + (−1)²) = √17 = 4.12.

The unit vector in the direction of a is given as

$$u_a = \frac{a}{\|a\|} = \frac{1}{\sqrt{5^2 + 3^2}} \begin{pmatrix} 5 \\ 3 \end{pmatrix} = \frac{1}{\sqrt{34}} \begin{pmatrix} 5 \\ 3 \end{pmatrix} = \begin{pmatrix} 0.86 \\ 0.51 \end{pmatrix}$$

The unit vector in the direction of b can be computed similarly:

$$u_b = \begin{pmatrix} 0.24 \\ 0.97 \end{pmatrix}$$

These unit vectors are also shown in gray in Figure 1.3.

By Eq. (1.3) the cosine of the angle between a and b is given as

$$\cos\theta = \frac{\begin{pmatrix} 5 & 3 \end{pmatrix} \begin{pmatrix} 1 \\ 4 \end{pmatrix}}{\sqrt{5^2 + 3^2}\,\sqrt{1^2 + 4^2}} = \frac{17}{\sqrt{34 \times 17}} = \frac{1}{\sqrt{2}}$$

We can get the angle by computing the inverse of the cosine:

$$\theta = \cos^{-1}\left(1/\sqrt{2}\right) = 45°$$

Let us consider the Lp-norm for a with p = 3; we get

$$\|a\|_3 = \left(5^3 + 3^3\right)^{1/3} = (152)^{1/3} = 5.34$$

The distance between a and b using Eq. (1.2) for the Lp-norm with p = 3 is given as

$$\|a - b\|_3 = \left\|(4, -1)^T\right\|_3 = \left(|4|^3 + |-1|^3\right)^{1/3} = (65)^{1/3} = 4.02$$
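The quantities in Example 1.3 can be reproduced with a short NumPy sketch (an addition, not from the text), using the dot product, norm, and Lp-distance definitions above:

```python
import numpy as np

a = np.array([5.0, 3.0])
b = np.array([1.0, 4.0])

dist = np.linalg.norm(a - b)                                   # Euclidean distance, Eq. (1.1)
u_a = a / np.linalg.norm(a)                                    # unit vector along a
cos_theta = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))  # cosine similarity, Eq. (1.3)
theta = np.degrees(np.arccos(cos_theta))                       # angle in degrees
d3 = np.sum(np.abs(a - b) ** 3) ** (1 / 3)                     # L3 distance, Eq. (1.2)

print(dist, u_a, cos_theta, theta, d3)   # ~4.12, [0.86 0.51], ~0.707, 45.0, ~4.02
```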

1.3.2 Mean and Total Variance

Mean
The mean of the data matrix D is the vector obtained as the average of all the row-
vectors:
$$\text{mean}(D) = \mu = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Total Variance
The total variance of the data matrix D is the average squared distance of each point
from the mean:
$$var(D) = \frac{1}{n} \sum_{i=1}^{n} \delta(x_i, \mu)^2 = \frac{1}{n} \sum_{i=1}^{n} \|x_i - \mu\|^2 \qquad (1.4)$$

Simplifying Eq. (1.4) we obtain

$$\begin{aligned}
var(D) &= \frac{1}{n} \sum_{i=1}^{n} \left( \|x_i\|^2 - 2\, x_i^T \mu + \|\mu\|^2 \right) \\
&= \frac{1}{n} \left( \sum_{i=1}^{n} \|x_i\|^2 - 2n\, \mu^T \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right) + n\, \|\mu\|^2 \right) \\
&= \frac{1}{n} \left( \sum_{i=1}^{n} \|x_i\|^2 - 2n\, \mu^T \mu + n\, \|\mu\|^2 \right) \\
&= \frac{1}{n} \left( \sum_{i=1}^{n} \|x_i\|^2 \right) - \|\mu\|^2
\end{aligned}$$

The total variance is thus the difference between the average of the squared magnitude
of the data points and the squared magnitude of the mean (average of the points).

Centered Data Matrix


Often we need to center the data matrix by making the mean coincide with the origin
of the data space. The centered data matrix is obtained by subtracting the mean from
all the points:
$$Z = D - 1 \cdot \mu^T = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{pmatrix} - \begin{pmatrix} \mu^T \\ \mu^T \\ \vdots \\ \mu^T \end{pmatrix} = \begin{pmatrix} x_1^T - \mu^T \\ x_2^T - \mu^T \\ \vdots \\ x_n^T - \mu^T \end{pmatrix} = \begin{pmatrix} z_1^T \\ z_2^T \\ \vdots \\ z_n^T \end{pmatrix} \qquad (1.5)$$

where zi = xi − µ represents the centered point corresponding to xi , and 1 ∈ Rn is the


n-dimensional vector all of whose elements have value 1. The mean of the centered
data matrix Z is 0 ∈ Rd , because we have subtracted the mean µ from all the points xi .
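A brief NumPy sketch (added here for illustration, using a small made-up data matrix) shows the mean, the total variance in both of the equivalent forms above, and the centered data matrix:

```python
import numpy as np

D = np.array([[5.9, 3.0],
              [6.9, 3.1],
              [6.6, 2.9],
              [4.6, 3.2]])            # a small illustrative data matrix

mu = D.mean(axis=0)                                    # mean(D)
var_direct = np.mean(np.sum((D - mu) ** 2, axis=1))    # Eq. (1.4)
var_simplified = np.mean(np.sum(D ** 2, axis=1)) - np.sum(mu ** 2)
Z = D - mu                                             # centered data matrix, Eq. (1.5)

print(mu, var_direct, np.isclose(var_direct, var_simplified), Z.mean(axis=0))
```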

1.3.3 Orthogonal Projection

Often in data mining we need to project a point or vector onto another vector, for
example, to obtain a new point after a change of the basis vectors. Let a, b ∈ Rm be two
m-dimensional vectors. An orthogonal decomposition of the vector b in the direction

X2

b
4

b⊥ a
3 r=

1 bk
p=

0 X1
0 1 2 3 4 5
Figure 1.4. Orthogonal projection.
1.3 Data: Algebraic and Geometric View 11

of another vector a, illustrated in Figure 1.4, is given as

b = bk + b⊥ = p + r (1.6)

where p = b∥ is parallel to a, and r = b⊥ is perpendicular or orthogonal to a. The vector
p is called the orthogonal projection or simply projection of b on the vector a. Note
that the point p ∈ Rm is the point closest to b on the line passing through a. Thus, the
magnitude of the vector r = b − p gives the perpendicular distance between b and a,
which is often interpreted as the residual or error vector between the points b and p.
We can derive an expression for p by noting that p = ca for some scalar c, as p is
parallel to a. Thus, r = b − p = b − ca. Because p and r are orthogonal, we have

$$p^T r = (ca)^T (b - ca) = c\, a^T b - c^2\, a^T a = 0$$

which implies that

$$c = \frac{a^T b}{a^T a}$$

Therefore, the projection of b on a is given as

$$p = b_{\parallel} = ca = \left(\frac{a^T b}{a^T a}\right) a \qquad (1.7)$$

Example 1.4. Restricting the Iris dataset to the first two dimensions, sepal length
and sepal width, the mean point is given as

$$\text{mean}(D) = \begin{pmatrix} 5.843 \\ 3.054 \end{pmatrix}$$

which is shown as the black circle in Figure 1.2. The corresponding centered data
is shown in Figure 1.5, and the total variance is var(D) = 0.868 (centering does not
change this value).
Figure 1.5 shows the projection of each point onto the line ℓ, which is the line that
maximizes the separation between the class iris-setosa (squares) and the other
two classes (circles and triangles). The line ℓ is given as the set of all points (x1, x2)T
satisfying the constraint

$$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = c \begin{pmatrix} -2.15 \\ 2.75 \end{pmatrix}$$

for all scalars c ∈ R.

Figure 1.5. Projecting the centered data onto the line ℓ.

1.3.4 Linear Independence and Dimensionality

Given the data matrix

$$D = \begin{pmatrix} x_1 & x_2 & \cdots & x_n \end{pmatrix}^T = \begin{pmatrix} X_1 & X_2 & \cdots & X_d \end{pmatrix}$$

we are often interested in the linear combinations of the rows (points) or the
columns (attributes). For instance, different linear combinations of the original d
attributes yield new derived attributes, which play a key role in feature extraction and
dimensionality reduction.
Given any set of vectors v1 , v2 , . . . , vk in an m-dimensional vector space Rm , their
linear combination is given as

c1 v 1 + c2 v 2 + · · · + ck v k

where ci ∈ R are scalar values. The set of all possible linear combinations of the k
vectors is called the span, denoted as span(v1 , . . . , vk ), which is itself a vector space
being a subspace of Rm . If span(v1, . . . , vk ) = Rm , then we say that v1 , . . . , vk is a spanning
set for Rm .

Row and Column Space


There are several interesting vector spaces associated with the data matrix D, two of
which are the column space and row space of D. The column space of D, denoted
col(D), is the set of all linear combinations of the d column vectors or attributes X j ∈
Rn , that is,
col(D) = span(X1, X2, . . . , Xd)

By definition col(D) is a subspace of Rn. The row space of D, denoted row(D), is the
set of all linear combinations of the n row vectors or points xi ∈ Rd, that is,

row(D) = span(x1, x2, . . . , xn)

By definition row(D) is a subspace of Rd. Note also that the row space of D is the
column space of DT:

row(D) = col(DT)

Linear Independence
We say that the vectors v1 , . . . , vk are linearly dependent if at least one vector can be
written as a linear combination of the others. Alternatively, the k vectors are linearly

dependent if there are scalars c1 , c2 , . . . , ck , at least one of which is not zero, such that

c1 v 1 + c2 v 2 + · · · + ck v k = 0

On the other hand, v1 , · · · , vk are linearly independent if and only if

c1 v1 + c2 v2 + · · · + ck vk = 0 implies c1 = c2 = · · · = ck = 0

Simply put, a set of vectors is linearly independent if none of them can be written as a
linear combination of the other vectors in the set.

Dimension and Rank


Let S be a subspace of Rm . A basis for S is a set of vectors in S, say v1 , . . . , vk , that are
linearly independent and they span S, that is, span(v1 , . . . , vk ) = S. In fact, a basis is a
minimal spanning set. If the vectors in the basis are pairwise orthogonal, they are said
to form an orthogonal basis for S. If, in addition, they are also normalized to be unit
vectors, then they make up an orthonormal basis for S. For instance, the standard basis
for Rm is an orthonormal basis consisting of the vectors

$$e_1 = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} \quad e_2 = \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix} \quad \cdots \quad e_m = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix}$$

Any two bases for S must have the same number of vectors, and the number of vectors
in a basis for S is called the dimension of S, denoted as dim(S). Because S is a subspace
of Rm, we must have dim(S) ≤ m.
It is a remarkable fact that, for any matrix, the dimension of its row and column
space is the same, and this dimension is also called the rank of the matrix. For the data
matrix D ∈ Rn×d, we have rank(D) ≤ min(n, d), which follows from the fact that the
column space can have dimension at most d, and the row space can have dimension at
most n. Thus, even though the data points are ostensibly in a d-dimensional attribute
space (the extrinsic dimensionality), if rank(D) < d, then the data points reside in a
lower dimensional subspace of Rd, and in this case rank(D) gives an indication about
the intrinsic dimensionality of the data. In fact, with dimensionality reduction methods
it is often possible to approximate D ∈ Rn×d with a derived data matrix D′ ∈ Rn×k,
which has much lower dimensionality, that is, k ≪ d. In this case k may reflect the
“true” intrinsic dimensionality of the data.

Example 1.5. The line ℓ in Figure 1.5 is given as ℓ = span((−2.15, 2.75)T), with
dim(ℓ) = 1. After normalization, we obtain the orthonormal basis for ℓ as the unit
vector

$$\frac{1}{\sqrt{12.19}} \begin{pmatrix} -2.15 \\ 2.75 \end{pmatrix} = \begin{pmatrix} -0.615 \\ 0.788 \end{pmatrix}$$
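A short sketch (illustrative, not from the text) normalizes the spanning vector of ℓ as in Example 1.5 and uses the matrix rank to detect when points lie in a lower-dimensional subspace; the 3 × 3 matrix below is a made-up example with one linearly dependent row:

```python
import numpy as np

v = np.array([-2.15, 2.75])            # spanning vector of the line l (Example 1.5)
u = v / np.linalg.norm(v)              # orthonormal basis for l, about (-0.615, 0.788)

D = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],         # a multiple of the first row
              [0.0, 1.0, 1.0]])
print(u, np.linalg.matrix_rank(D))     # rank 2 < d = 3: intrinsic dimensionality is 2
```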

Table 1.2. Iris dataset: sepal length (in centimeters).

5.9 6.9 6.6 4.6 6.0 4.7 6.5 5.8 6.7 6.7 5.1 5.1 5.7 6.1 4.9
5.0 5.0 5.7 5.0 7.2 5.9 6.5 5.7 5.5 4.9 5.0 5.5 4.6 7.2 6.8
5.4 5.0 5.7 5.8 5.1 5.6 5.8 5.1 6.3 6.3 5.6 6.1 6.8 7.3 5.6
4.8 7.1 5.7 5.3 5.7 5.7 5.6 4.4 6.3 5.4 6.3 6.9 7.7 6.1 5.6
6.1 6.4 5.0 5.1 5.6 5.4 5.8 4.9 4.6 5.2 7.9 7.7 6.1 5.5 4.6
4.7 4.4 6.2 4.8 6.0 6.2 5.0 6.4 6.3 6.7 5.0 5.9 6.7 5.4 6.3
4.8 4.4 6.4 6.2 6.0 7.4 4.9 7.0 5.5 6.3 6.8 6.1 6.5 6.7 6.7
4.8 4.9 6.9 4.5 4.3 5.2 5.0 6.4 5.2 5.8 5.5 7.6 6.3 6.4 6.3
5.8 5.0 6.7 6.0 5.1 4.8 5.7 5.1 6.6 6.4 5.2 6.4 7.7 5.8 4.9
5.4 5.1 6.0 6.5 5.5 7.2 6.9 6.2 6.5 6.0 5.4 5.5 6.7 7.7 5.1

1.4 DATA: PROBABILISTIC VIEW

The probabilistic view of the data assumes that each numeric attribute X is a random
variable, defined as a function that assigns a real number to each outcome of an exper-
iment (i.e., some process of observation or measurement). Formally, X is a function
X : O → R, where O, the domain of X, is the set of all possible outcomes of the experi-
ment, also called the sample space, and R, the range of X, is the set of real numbers. If
the outcomes are numeric, and represent the observed values of the random variable,
then X : O → O is simply the identity function: X (v) = v for all v ∈ O. The distinc-
tion between the outcomes and the value of the random variable is important, as we
may want to treat the observed values differently depending on the context, as seen in
Example 1.6.
A random variable X is called a discrete random variable if it takes on only a finite
or countably infinite number of values in its range, whereas X is called a continuous
random variable if it can take on any value in its range.

Example 1.6. Consider the sepal length attribute (X 1 ) for the Iris dataset in
Table 1.1. All n = 150 values of this attribute are shown in Table 1.2, which lie in
the range [4.3, 7.9], with centimeters as the unit of measurement. Let us assume that
these constitute the set of all possible outcomes O.
By default, we can consider the attribute X 1 to be a continuous random variable,
given as the identity function X 1 (v) = v, because the outcomes (sepal length values)
are all numeric.
On the other hand, if we want to distinguish between Iris flowers with short and
long sepal lengths, with long being, say, a length of 7 cm or more, we can define a
discrete random variable A as follows:

$$A(v) = \begin{cases} 0 & \text{if } v < 7 \\ 1 & \text{if } v \ge 7 \end{cases}$$

In this case the domain of A is [4.3, 7.9], and its range is {0, 1}. Thus, A assumes
nonzero probability only at the discrete values 0 and 1.

Probability Mass Function


If X is discrete, the probability mass function of X is defined as
f (x) = P(X = x) for all x ∈ R
In other words, the function f gives the probability P(X = x) that the random variable
X has the exact value x. The name “probability mass function” intuitively conveys the
fact that the probability is concentrated or massed at only discrete values in the range
of X, and is zero for all other values. f must also obey the basic rules of probability.
That is, f must be non-negative:

$$f(x) \ge 0$$

and the sum of all probabilities should add to 1:

$$\sum_{x} f(x) = 1$$

Example 1.7 (Bernoulli and Binomial Distribution). In Example 1.6, A was defined
as discrete random variable representing long sepal length. From the sepal length
data in Table 1.2 we find that only 13 Irises have sepal length of at least 7 cm. We can
thus estimate the probability mass function of A as follows:

$$f(1) = P(A = 1) = \frac{13}{150} = 0.087 = p$$

and

$$f(0) = P(A = 0) = \frac{137}{150} = 0.913 = 1 - p$$
In this case we say that A has a Bernoulli distribution with parameter p ∈ [0, 1], which
denotes the probability of a success, that is, the probability of picking an Iris with a
long sepal length at random from the set of all points. On the other hand, 1 − p is the
probability of a failure, that is, of not picking an Iris with long sepal length.
Let us consider another discrete random variable B, denoting the number of
Irises with long sepal length in m independent Bernoulli trials with probability of
success p. In this case, B takes on the discrete values [0, m], and its probability mass
function is given by the Binomial distribution

$$f(k) = P(B = k) = \binom{m}{k} p^k (1 - p)^{m-k}$$

The formula can be understood as follows. There are $\binom{m}{k}$ ways of picking k long sepal
length Irises out of the m trials. For each selection of k long sepal length Irises, the
total probability of the k successes is p^k, and the total probability of the m − k failures
is (1 − p)^(m−k). For example, because p = 0.087 from above, the probability of observing
exactly k = 2 Irises with long sepal length in m = 10 trials is given as

$$f(2) = P(B = 2) = \binom{10}{2} (0.087)^2 (0.913)^8 = 0.164$$

Figure 1.6 shows the full probability mass function for different values of k for m = 10.
Because p is quite small, the probability of k successes in so few trials falls off
rapidly as k increases, becoming practically zero for values of k ≥ 6.
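The estimates in Example 1.7 can be checked with a few lines of Python (added here; math.comb supplies the binomial coefficient):

```python
import math

n_long, n_total = 13, 150
p = n_long / n_total                       # Bernoulli parameter, about 0.087

def binom_pmf(k: int, m: int, p: float) -> float:
    # P(B = k) = C(m, k) * p^k * (1 - p)^(m - k)
    return math.comb(m, k) * p**k * (1 - p)**(m - k)

print(round(p, 3), round(binom_pmf(2, 10, p), 3))   # about 0.087 and 0.164
```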

Figure 1.6. Binomial distribution: probability mass function (m = 10, p = 0.087).

Probability Density Function


If X is continuous, its range is the entire set of real numbers R. The probability of any
specific value x is only one out of the infinitely many possible values in the range of X,
which means that P(X = x) = 0 for all x ∈ R. However, this does not mean that the value
x is impossible, because in that case we would conclude that all values are impossible!
What it means is that the probability mass is spread so thinly over the range of values
that it can be measured only over intervals [a, b] ⊂ R, rather than at specific points.
Thus, instead of the probability mass function, we define the probability density func-
tion, which specifies the probability that the variable X takes on values in any interval
[a, b] ⊂ R:

$$P\big(X \in [a, b]\big) = \int_a^b f(x)\, dx$$

As before, the density function f must satisfy the basic laws of probability:

$$f(x) \ge 0, \quad \text{for all } x \in \mathbb{R}$$

and

$$\int_{-\infty}^{\infty} f(x)\, dx = 1$$

We can get an intuitive understanding of the density function f by considering


the probability density over a small interval of width 2ε > 0, centered at x, namely

[x − ε, x + ε]:

$$P\big(X \in [x - \varepsilon, x + \varepsilon]\big) = \int_{x-\varepsilon}^{x+\varepsilon} f(x)\, dx \simeq 2\varepsilon \cdot f(x)$$

$$f(x) \simeq \frac{P\big(X \in [x - \varepsilon, x + \varepsilon]\big)}{2\varepsilon} \qquad (1.8)$$

f (x) thus gives the probability density at x, given as the ratio of the probability mass
to the width of the interval, that is, the probability mass per unit distance. Thus, it is
important to note that P(X = x) ≠ f(x).
Even though the probability density function f (x) does not specify the probability
P(X = x), it can be used to obtain the relative probability of one value x 1 over another
x2 because for a given ε > 0, by Eq. (1.8), we have

$$\frac{P(X \in [x_1 - \varepsilon, x_1 + \varepsilon])}{P(X \in [x_2 - \varepsilon, x_2 + \varepsilon])} \simeq \frac{2\varepsilon \cdot f(x_1)}{2\varepsilon \cdot f(x_2)} = \frac{f(x_1)}{f(x_2)} \qquad (1.9)$$
Thus, if f (x 1 ) is larger than f (x 2 ), then values of X close to x 1 are more probable than
values close to x 2 , and vice versa.

Example 1.8 (Normal Distribution). Consider again the sepal length values from
the Iris dataset, as shown in Table 1.2. Let us assume that these values follow a
Gaussian or normal density function, given as

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ \frac{-(x - \mu)^2}{2\sigma^2} \right\}$$
There are two parameters of the normal density distribution, namely, µ, which rep-
resents the mean value, and σ 2 , which represents the variance of the values (these
parameters are discussed in Chapter 2). Figure 1.7 shows the characteristic “bell”
shape plot of the normal distribution. The parameters, µ = 5.84 and σ 2 = 0.681, were
estimated directly from the data for sepal length in Table 1.2.
Whereas f(x = µ) = f(5.84) = (1/√(2π · 0.681)) exp{0} = 0.483, we emphasize that the
probability of observing X = µ is zero, that is, P(X = µ) = 0. Thus, P(X = x) is not
given by f(x); rather, P(X = x) is given as the area under the curve for an infinitesi-
mally small interval [x − ε, x + ε] centered at x, with ε > 0. Figure 1.7 illustrates this
with the shaded region centered at µ = 5.84. From Eq. (1.8), we have

$$P(X = \mu) \simeq 2\varepsilon \cdot f(\mu) = 2\varepsilon \cdot 0.483 = 0.967\varepsilon$$

As ε → 0, we get P(X = µ) → 0. However, based on Eq. (1.9) we can claim that the
probability of observing values close to the mean value µ = 5.84 is 2.69 times the
probability of observing values close to x = 7, as

$$\frac{f(5.84)}{f(7)} = \frac{0.483}{0.18} = 2.69$$

Figure 1.7. Normal distribution: probability density function (µ = 5.84, σ² = 0.681).

Cumulative Distribution Function


For any random variable X, whether discrete or continuous, we can define the cumula-
tive distribution function (CDF) F : R → [0, 1], which gives the probability of observing
a value at most some given value x:

F(x) = P(X ≤ x) for all − ∞ < x < ∞

When X is discrete, F is given as

$$F(x) = P(X \le x) = \sum_{u \le x} f(u)$$

and when X is continuous, F is given as

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(u)\, du$$

Example 1.9 (Cumulative Distribution Function). Figure 1.8 shows the cumulative
distribution function for the binomial distribution in Figure 1.6. It has the character-
istic step shape (right continuous, non-decreasing), as expected for a discrete random
variable. F(x) has the same value F(k) for all x ∈ [k, k + 1) with 0 ≤ k < m, where m
is the number of trials and k is the number of successes. The closed (filled) and open
circles demarcate the corresponding closed and open interval [k, k + 1). For instance,
F(x) = 0.404 = F(0) for all x ∈ [0, 1).
Figure 1.9 shows the cumulative distribution function for the normal density
function shown in Figure 1.7. As expected, for a continuous random variable, the
CDF is also continuous, and non-decreasing. Because the normal distribution is
symmetric about the mean, we have F(µ) = P(X ≤ µ) = 0.5.
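Both CDF values mentioned in Example 1.9 can be computed directly (an added sketch; the normal CDF is evaluated via the error function math.erf rather than a closed form):

```python
import math

# Binomial CDF at k = 0 for m = 10, p = 0.087: F(0) = P(B <= 0) = (1 - p)^10
m, p = 10, 0.087
F0 = sum(math.comb(m, k) * p**k * (1 - p)**(m - k) for k in range(1))
print(round(F0, 3))                       # about 0.404

# Normal CDF at the mean: F(mu) = 0.5 by symmetry
mu, sigma = 5.84, math.sqrt(0.681)
F_mu = 0.5 * (1 + math.erf((mu - mu) / (sigma * math.sqrt(2))))
print(F_mu)                               # 0.5
```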

Figure 1.8. Cumulative distribution function for the binomial distribution.

Figure 1.9. Cumulative distribution function for the normal distribution; (µ, F(µ)) = (5.84, 0.5).

1.4.1 Bivariate Random Variables

Instead of considering each attribute as a random variable, we can also perform pair-
wise analysis by considering a pair of attributes, X 1 and X 2 , as a bivariate random
variable:

$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$$

X : O → R2 is a function that assigns to each outcome in the sample space a pair of
real numbers, that is, a 2-dimensional vector (x1, x2)T ∈ R2. As in the univariate case,
if the outcomes are numeric, then the default is to assume X to be the identity
function.

Joint Probability Mass Function


If X1 and X2 are both discrete random variables then X has a joint probability mass
function given as follows:

$$f(x) = f(x_1, x_2) = P(X_1 = x_1, X_2 = x_2) = P(X = x)$$

f must satisfy the following two conditions:

$$f(x) = f(x_1, x_2) \ge 0 \quad \text{for all } -\infty < x_1, x_2 < \infty$$

$$\sum_{x} f(x) = \sum_{x_1} \sum_{x_2} f(x_1, x_2) = 1$$

Joint Probability Density Function


If X1 and X2 are both continuous random variables then X has a joint probability
density function f given as follows:

$$P(X \in W) = \iint_{x \in W} f(x)\, dx = \iint_{(x_1, x_2)^T \in W} f(x_1, x_2)\, dx_1\, dx_2$$

where W ⊂ R2 is some subset of the 2-dimensional space of reals. f must also satisfy
the following two conditions:

$$f(x) = f(x_1, x_2) \ge 0 \quad \text{for all } -\infty < x_1, x_2 < \infty$$

$$\int_{\mathbb{R}^2} f(x)\, dx = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x_1, x_2)\, dx_1\, dx_2 = 1$$

As in the univariate case, the probability mass P(x) = P((x1, x2)T) = 0 for any
particular point x. However, we can use f to compute the probability density at x.
Consider the square region W = ([x1 − ε, x1 + ε], [x2 − ε, x2 + ε]), that is, a window of
width 2ε centered at x = (x1, x2)T. The probability density at x can be approximated as

$$P(X \in W) = P\big(X \in [x_1 - \varepsilon, x_1 + \varepsilon],\, [x_2 - \varepsilon, x_2 + \varepsilon]\big) = \int_{x_1 - \varepsilon}^{x_1 + \varepsilon} \int_{x_2 - \varepsilon}^{x_2 + \varepsilon} f(x_1, x_2)\, dx_1\, dx_2 \simeq 2\varepsilon \cdot 2\varepsilon \cdot f(x_1, x_2)$$

which implies that

$$f(x_1, x_2) \simeq \frac{P(X \in W)}{(2\varepsilon)^2}$$

The relative probability of one value (a1, a2) versus another (b1, b2) can therefore be
computed via the probability density function:

$$\frac{P\big(X \in ([a_1 - \varepsilon, a_1 + \varepsilon], [a_2 - \varepsilon, a_2 + \varepsilon])\big)}{P\big(X \in ([b_1 - \varepsilon, b_1 + \varepsilon], [b_2 - \varepsilon, b_2 + \varepsilon])\big)} \simeq \frac{(2\varepsilon)^2 \cdot f(a_1, a_2)}{(2\varepsilon)^2 \cdot f(b_1, b_2)} = \frac{f(a_1, a_2)}{f(b_1, b_2)}$$

Example 1.10 (Bivariate Distributions). Consider the sepal length and sepal
width attributes in the Iris dataset, plotted in Figure 1.2. Let A denote the Bernoulli
random variable corresponding to long sepal length (at least 7 cm), as defined in
Example 1.7.
Define another Bernoulli random variable B corresponding to long sepal width,
say, at least 3.5 cm. Let X = (A, B)T be the discrete bivariate random variable; then the
joint probability mass function of X can be estimated from the data as follows:

$$f(0, 0) = P(A = 0, B = 0) = \frac{116}{150} = 0.773$$

$$f(0, 1) = P(A = 0, B = 1) = \frac{21}{150} = 0.140$$

$$f(1, 0) = P(A = 1, B = 0) = \frac{10}{150} = 0.067$$

$$f(1, 1) = P(A = 1, B = 1) = \frac{3}{150} = 0.020$$
Figure 1.10 shows a plot of this probability mass function.
Treating attributes X1 and X2 in the Iris dataset (see Table 1.1) as continuous
random variables, we can define a continuous bivariate random variable X = (X1, X2)T.
Assuming that X follows a bivariate normal distribution, its joint probability density
function is given as

$$f(x \mid \mu, \Sigma) = \frac{1}{2\pi \sqrt{|\Sigma|}} \exp\left\{ -\frac{(x - \mu)^T \Sigma^{-1} (x - \mu)}{2} \right\}$$

Here µ and Σ are the parameters of the bivariate normal distribution, representing
the 2-dimensional mean vector and covariance matrix, which are discussed in detail
f (x)
b
0.773

0.14
b

0.067
b

0
1 0.02 1
b

X2
X1

Figure 1.10. Joint probability mass function: X1 (long sepal length), X2 (long sepal width).

Figure 1.11. Bivariate normal density: µ = (5.843, 3.054)T (solid circle).

in Chapter 2. Further, |Σ| denotes the determinant of Σ. The plot of the bivariate
normal density is given in Figure 1.11, with mean

$$\mu = (5.843, 3.054)^T$$

and covariance matrix

$$\Sigma = \begin{pmatrix} 0.681 & -0.039 \\ -0.039 & 0.187 \end{pmatrix}$$

It is important to emphasize that the function f(x) specifies only the probability
density at x, and f(x) ≠ P(X = x). As before, we have P(X = x) = 0.
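The joint pmf table and the bivariate normal density of Example 1.10 can be reproduced with the following NumPy sketch (added for illustration, using the counts and parameters quoted in the example):

```python
import numpy as np

# Joint pmf of (A, B) as a 2x2 table: rows index A = 0, 1 and columns index B = 0, 1.
joint_pmf = np.array([[116, 21],
                      [10,   3]]) / 150.0
assert np.isclose(joint_pmf.sum(), 1.0)

mu = np.array([5.843, 3.054])
Sigma = np.array([[0.681, -0.039],
                  [-0.039, 0.187]])

def bivariate_normal_pdf(x: np.ndarray) -> float:
    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff
    return float(np.exp(-quad / 2) / (2 * np.pi * np.sqrt(np.linalg.det(Sigma))))

print(joint_pmf[1, 1], round(bivariate_normal_pdf(mu), 3))   # f(1,1) = 0.02; peak density at mu
```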

Joint Cumulative Distribution Function


The joint cumulative distribution function for two random variables X 1 and X 2 is
defined as the function F, such that for all values x 1 , x 2 ∈ (−∞, ∞),
F(x) = F(x 1 , x 2 ) = P(X 1 ≤ x 1 and X 2 ≤ x 2 ) = P(X ≤ x)

Statistical Independence
Two random variables X 1 and X 2 are said to be (statistically) independent if, for every
W1 ⊂ R and W2 ⊂ R, we have
P(X 1 ∈ W1 and X 2 ∈ W2 ) = P(X 1 ∈ W1 ) · P(X 2 ∈ W2 )
Furthermore, if X 1 and X 2 are independent, then the following two conditions are also
satisfied:
F(x) = F(x 1 , x 2 ) = F1 (x 1 ) · F2 (x 2 )
f (x) = f (x 1 , x 2 ) = f 1 (x 1 ) · f 2 (x 2 )

where Fi is the cumulative distribution function, and f i is the probability mass or


density function for random variable X i .

1.4.2 Multivariate Random Variable

A d-dimensional multivariate random variable X = (X 1 , X 2 , . . . , X d )T , also called a


vector random variable, is defined as a function that assigns a vector of real numbers to
each outcome in the sample space, that is, X : O → Rd . The range of X can be denoted
as a vector x = (x 1 , x 2 , . . . , x d )T . In case all X j are numeric, then X is by default assumed
to be the identity function. In other words, if all attributes are numeric, we can treat
each outcome in the sample space (i.e., each point in the data matrix) as a vector ran-
dom variable. On the other hand, if the attributes are not all numeric, then X maps the
outcomes to numeric vectors in its range.
If all X j are discrete, then X is jointly discrete and its joint probability mass
function f is given as

f (x) = P(X = x)
f (x 1 , x 2 , . . . , x d ) = P(X 1 = x 1 , X 2 = x 2 , . . . , X d = x d )

If all X j are continuous, then X is jointly continuous and its joint probability density
function is given as
$$P(X \in W) = \int \cdots \int_{x \in W} f(x)\, dx$$

$$P\big((X_1, X_2, \ldots, X_d)^T \in W\big) = \int \cdots \int_{(x_1, x_2, \ldots, x_d)^T \in W} f(x_1, x_2, \ldots, x_d)\, dx_1\, dx_2 \cdots dx_d$$

for any d-dimensional region W ⊆ Rd .


The laws of probability must be obeyed as usual, that is, f (x) ≥ 0 and sum of f
over all x in the range of X must be 1. The joint cumulative distribution function of
X = (X 1 , . . . , X d )T is given as

F(x) = P(X ≤ x)
F(x 1 , x 2 , . . . , x d ) = P(X 1 ≤ x 1 , X 2 ≤ x 2 , . . . , X d ≤ x d )

for every point x ∈ Rd .


We say that X 1 , X 2 , . . . , X d are independent random variables if and only if, for
every region Wi ⊂ R, we have

P(X 1 ∈ W1 and X 2 ∈ W2 · · · and X d ∈ Wd )


= P(X 1 ∈ W1 ) · P(X 2 ∈ W2 ) · · · · · P(X d ∈ Wd ) (1.10)

If X 1 , X 2 , . . . , X d are independent then the following conditions are also satisfied

F(x) = F(x 1 , . . . , x d ) = F1 (x 1 ) · F2 (x 2 ) · . . . · Fd (x d )
f (x) = f (x 1 , . . . , x d ) = f 1 (x 1 ) · f 2 (x 2 ) · . . . · f d (x d ) (1.11)

where Fi is the cumulative distribution function, and f i is the probability mass or


density function for random variable X i .

1.4.3 Random Sample and Statistics

The probability mass or density function of a random variable X may follow some
known form, or as is often the case in data analysis, it may be unknown. When the
probability function is not known, it may still be convenient to assume that the values
follow some known distribution, based on the characteristics of the data. However,
even in this case, the parameters of the distribution may still be unknown. Thus, in
general, either the parameters, or the entire distribution, may have to be estimated
from the data.
In statistics, the word population is used to refer to the set or universe of all entities
under study. Usually we are interested in certain characteristics or parameters of the
entire population (e.g., the mean age of all computer science students in the United
States). However, looking at the entire population may not be feasible or may be
too expensive. Instead, we try to make inferences about the population parameters by
drawing a random sample from the population, and by computing appropriate statis-
tics from the sample that give estimates of the corresponding population parameters of
interest.

Univariate Sample
Given a random variable X, a random sample of size n from X is defined as a set of n
independent and identically distributed (IID) random variables S1 , S2 , . . . , Sn , that is, all
of the Si ’s are statistically independent of each other, and follow the same probability
mass or density function as X.
If we treat attribute X as a random variable, then each of the observed values of
X, namely, x i (1 ≤ i ≤ n), are themselves treated as identity random variables, and the
observed data is assumed to be a random sample drawn from X. That is, all x i are
considered to be mutually independent and identically distributed as X. By Eq. (1.11) their
joint probability function is given as

$$f(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_X(x_i)$$

where f X is the probability mass or density function for X.

Multivariate Sample
For multivariate parameter estimation, the n data points xi (with 1 ≤ i ≤ n) constitute
a d-dimensional multivariate random sample drawn from the vector random vari-
able X = (X 1 , X 2 , . . . , X d ). That is, xi are assumed to be independent and identically
distributed, and thus their joint distribution is given as

$$f(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f_X(x_i) \qquad (1.12)$$

where f X is the probability mass or density function for X.



Estimating the parameters of a multivariate joint probability distribution is usu-


ally difficult and computationally intensive. One common simplifying assumption that
is typically made is that the d attributes X 1 , X 2 , . . . , X d are statistically independent.
However, we do not assume that they are identically distributed, because that is
almost never justified. Under the attribute independence assumption, Eq. (1.12) can be
rewritten as

$$f(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f(x_i) = \prod_{i=1}^{n} \prod_{j=1}^{d} f(x_{ij})$$

Statistic
We can estimate a parameter of the population by defining an appropriate sample
statistic, which is defined as a function of the sample. More precisely, let {S_i}, i = 1, . . . , m, denote
the random sample of size m drawn from a (multivariate) random variable X. A statis-
tic θ̂ is a function θ̂ : (S1, S2, . . . , Sm) → R. The statistic is an estimate of the corresponding
population parameter θ . As such, the statistic θ̂ is itself a random variable. If we use
the value of a statistic to estimate a population parameter, this value is called a point
estimate of the parameter, and the statistic is called an estimator of the parameter. In
Chapter 2 we will study different estimators for population parameters that reflect the
location (or centrality) and dispersion of values.

Example 1.11 (Sample Mean). Consider attribute sepal length (X 1 ) in the Iris
dataset, whose values are shown in Table 1.2. Assume that the mean value of X 1
is not known. Let us assume that the observed values {x i }ni=1 constitute a random
sample drawn from X 1 .
The sample mean is a statistic, defined as the average

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Plugging in values from Table 1.2, we obtain

$$\hat{\mu} = \frac{1}{150}(5.9 + 6.9 + \cdots + 7.7 + 5.1) = \frac{876.5}{150} = 5.84$$
The value µ̂ = 5.84 is a point estimate for the unknown population parameter µ, the
(true) mean value of variable X 1 .
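The sample-mean statistic is a one-liner; the sketch below (added, showing only the first few values of Table 1.2) illustrates the computation:

```python
import numpy as np

# With all 150 values of Table 1.2 this evaluates to 876.5 / 150 = 5.84.
sepal_length = np.array([5.9, 6.9, 6.6, 4.6, 6.0])   # ... remaining values omitted
mu_hat = sepal_length.mean()                          # (1/n) * sum of x_i
print(round(mu_hat, 2))
```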

1.5 DATA MINING

Data mining comprises the core algorithms that enable one to gain fundamental
insights and knowledge from massive data. It is an interdisciplinary field merging
concepts from allied areas such as database systems, statistics, machine learning, and
pattern recognition. In fact, data mining is part of a larger knowledge discovery pro-
cess, which includes pre-processing tasks such as data extraction, data cleaning, data
fusion, data reduction and feature construction, as well as post-processing steps such
