
Data Mining and Machine Learning:

Fundamental Concepts and Algorithms


dataminingbook.info

Mohammed J. Zaki (Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA)
Wagner Meira Jr. (Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil)

Chapter 3: Categorical Attributes

Univariate Analysis: Bernoulli Variable
Consider a single categorical attribute, X, with domain dom(X) = {a_1, a_2, ..., a_m} comprising m symbolic values. The data D is an n × 1 symbolic data matrix, given as

D = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}

where each point x_i ∈ dom(X).

Bernoulli Variable: special case when m = 2

X(v) = \begin{cases} 1 & \text{if } v = a_1 \\ 0 & \text{if } v = a_2 \end{cases}

i.e., dom(X) = {0, 1}.


Bernoulli Variable: Mean and Variance

The probability mass function (PMF) of X is given as

P(X = x) = f(x) = p^x (1 - p)^{1-x}

The expected value of X is given as

\mu = E[X] = 1 \cdot p + 0 \cdot (1 - p) = p

and the variance of X is given as

\sigma^2 = \mathrm{var}(X) = p(1 - p)

Assume that each symbolic point has been mapped to its binary value, so that {x_1, x_2, ..., x_n} is a random sample drawn from X. The sample mean is given as

\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{n_1}{n} = \hat{p}

where n_1 is the number of points with x_i = 1 in the random sample (equal to the number of occurrences of the symbol a_1). The sample variance is given as

\hat{\sigma}^2 = \hat{p}(1 - \hat{p})
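A minimal Python sketch of these estimates (the symbolic sample and the mapping a1 → 1 are made up for illustration):

import numpy as np

# Map a two-valued symbolic sample to Bernoulli values and estimate p̂ and σ̂².
sample = np.array(["a1", "a2", "a2", "a1", "a2"])   # hypothetical data
x = (sample == "a1").astype(int)                     # 1 if v = a1, 0 if v = a2
p_hat = x.mean()                                     # sample mean = n1/n = p̂
var_hat = p_hat * (1 - p_hat)                        # sample variance = p̂(1 − p̂)
print(p_hat, var_hat)                                # 0.4 0.24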

Binomial Distribution: Number of Occurrences
Given the Bernoulli variable X, let {x_1, x_2, ..., x_n} be a random sample of size n. Let N be the random variable denoting the number of occurrences of the symbol a_1 (value X = 1). N has a binomial distribution, given as

f(N = n_1 \mid n, p) = \binom{n}{n_1} p^{n_1} (1 - p)^{n - n_1}

N is the sum of the n independent Bernoulli random variables x_i IID with X, that is, N = \sum_{i=1}^{n} x_i. The mean or expected number of occurrences of a_1 is

\mu_N = E[N] = E\left[\sum_{i=1}^{n} x_i\right] = \sum_{i=1}^{n} E[x_i] = \sum_{i=1}^{n} p = np

The variance of N is

\sigma_N^2 = \mathrm{var}(N) = \sum_{i=1}^{n} \mathrm{var}(x_i) = \sum_{i=1}^{n} p(1 - p) = np(1 - p)
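These quantities can be checked numerically; a short sketch with scipy (the values n = 5, p = 0.4, n1 = 3 are illustrative only):

from scipy.stats import binom

n, p = 5, 0.4
print(binom.pmf(3, n, p))         # f(N = 3 | n, p)
print(n * p, n * p * (1 - p))     # mean np and variance np(1 − p)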

Multivariate Bernoulli Variable
For the general case when dom(X ) = {a1 , a2 , . . . , am }, we model X as an
m-dimensional or multivariate Bernoulli random variable X = (A1 , A2 , . . . , Am )T ,
where each Ai is a Bernoulli variable with parameter pi denoting the probability of
observing symbol ai .
However, X can assume only one of the symbolic values at any one time. Thus,

X(v) = e_i \quad \text{if } v = a_i

where e_i is the i-th standard basis vector in m dimensions. The range of X consists of m distinct vector values {e_1, e_2, ..., e_m}.

The PMF of X is

P(X = e_i) = f(e_i) = p_i = \prod_{j=1}^{m} p_j^{e_{ij}}

with \sum_{i=1}^{m} p_i = 1.
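A minimal sketch of this encoding in Python (the four-symbol domain below is made up; any ordering of dom(X) works):

import numpy as np

domain = ["a1", "a2", "a3", "a4"]            # dom(X), m = 4
def encode(v):
    e = np.zeros(len(domain), dtype=int)
    e[domain.index(v)] = 1                   # e_i: i-th standard basis vector
    return e

print(encode("a2"))                          # [0 1 0 0] = e_2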

Multivariate Bernoulli: Mean

The mean or expected value of X can be obtained as

\mu = E[X] = \sum_{i=1}^{m} e_i f(e_i) = \sum_{i=1}^{m} e_i p_i = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} p_1 + \cdots + \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix} p_m = \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_m \end{pmatrix} = p

The sample mean is

\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i = \sum_{i=1}^{m} \frac{n_i}{n}\, e_i = \begin{pmatrix} n_1/n \\ n_2/n \\ \vdots \\ n_m/n \end{pmatrix} = \begin{pmatrix} \hat{p}_1 \\ \hat{p}_2 \\ \vdots \\ \hat{p}_m \end{pmatrix} = \hat{p}

where n_i is the number of occurrences of the vector value e_i in the sample, i.e., the number of occurrences of the symbol a_i. Furthermore, \sum_{i=1}^{m} n_i = n.
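In code, the sample mean is just the column-wise average of the one-hot encoded data matrix; a small made-up example:

import numpy as np

# n = 5 points over m = 4 symbols, already mapped to standard basis vectors.
X = np.array([[0, 1, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0]])
p_hat = X.mean(axis=0)            # \hat{p} = (n_1/n, ..., n_m/n)
print(p_hat)                      # [0.  0.6 0.4 0. ]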

Multivariate Bernoulli Variable: sepal length

We discretize sepal length into four bins:

Bins        Domain            Counts
[4.3, 5.2]  Very Short (a_1)  n_1 = 45
(5.2, 6.1]  Short (a_2)       n_2 = 50
(6.1, 7.0]  Long (a_3)        n_3 = 43
(7.0, 7.9]  Very Long (a_4)   n_4 = 12

We model sepal length as a multivariate Bernoulli variable X:

X(v) = \begin{cases} e_1 = (1,0,0,0)^T & \text{if } v = a_1 \\ e_2 = (0,1,0,0)^T & \text{if } v = a_2 \\ e_3 = (0,0,1,0)^T & \text{if } v = a_3 \\ e_4 = (0,0,0,1)^T & \text{if } v = a_4 \end{cases}

For example, the symbolic point x_1 = Short = a_2 is represented as the vector (0, 1, 0, 0)^T = e_2.

The total sample size is n = 150; the estimates p̂_i are:

p̂_1 = 45/150 = 0.300
p̂_2 = 50/150 = 0.333
p̂_3 = 43/150 = 0.287
p̂_4 = 12/150 = 0.080

[Figure: probability mass function f(x), with mass 0.3 at e_1 (Very Short), 0.333 at e_2 (Short), 0.287 at e_3 (Long), and 0.08 at e_4 (Very Long).]

Multivariate Bernoulli Variable: Covariance Matrix
We have X = (A1 , A2 , . . . , Am )T , where Ai is the Bernoulli variable corresponding to
symbol a_i. The variance for each Bernoulli variable A_i is

\sigma_i^2 = \mathrm{var}(A_i) = p_i(1 - p_i)

The covariance between A_i and A_j is

\sigma_{ij} = E[A_i A_j] - E[A_i] \cdot E[A_j] = 0 - p_i p_j = -p_i p_j

a negative relationship, since A_i and A_j cannot both be 1 at the same time.

The covariance matrix for X is given as

\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2 \end{pmatrix} = \begin{pmatrix} p_1(1-p_1) & -p_1 p_2 & \cdots & -p_1 p_m \\ -p_1 p_2 & p_2(1-p_2) & \cdots & -p_2 p_m \\ \vdots & \vdots & \ddots & \vdots \\ -p_1 p_m & -p_2 p_m & \cdots & p_m(1-p_m) \end{pmatrix}

More compactly, \Sigma = \mathrm{diag}(p) - p\,p^T, where \mu = p = (p_1, \ldots, p_m)^T.
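A one-line check of the compact form with numpy, using the sepal length estimates p̂ from the earlier example:

import numpy as np

p = np.array([0.3, 0.333, 0.287, 0.08])      # \hat{p} for sepal length
Sigma = np.diag(p) - np.outer(p, p)          # diag(p) − p pᵀ
print(Sigma.round(3))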

Categorical, Mapped Binary and Centered Dataset
Modeling X as a multivariate Bernoulli variable is equivalent to treating the mapped points X(x_i) as a new n × m binary data matrix:

     X      |  A1  A2  |   Z1    Z2
x1   Short  |   0   1  |  -0.4   0.4
x2   Short  |   0   1  |  -0.4   0.4
x3   Long   |   1   0  |   0.6  -0.6
x4   Short  |   0   1  |  -0.4   0.4
x5   Long   |   1   0  |   0.6  -0.6

X is the multivariate Bernoulli variable

X(v) = \begin{cases} e_1 = (1, 0)^T & \text{if } v = \text{Long } (a_1) \\ e_2 = (0, 1)^T & \text{if } v = \text{Short } (a_2) \end{cases}

The sample mean and covariance matrix are

\hat{\mu} = \hat{p} = (2/5, 3/5)^T = (0.4, 0.6)^T

\hat{\Sigma} = \mathrm{diag}(\hat{p}) - \hat{p}\hat{p}^T = \begin{pmatrix} 0.24 & -0.24 \\ -0.24 & 0.24 \end{pmatrix}

From the centered data Z = (Z_1, Z_2)^T we obtain the same matrix:

\hat{\Sigma} = \frac{1}{5} Z^T Z = \begin{pmatrix} 0.24 & -0.24 \\ -0.24 & 0.24 \end{pmatrix}
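A quick numerical check of both forms of the sample covariance matrix for this small dataset:

import numpy as np

X = np.array([[0, 1], [0, 1], [1, 0], [0, 1], [1, 0]])   # binary matrix (A1, A2)
n = X.shape[0]
p_hat = X.mean(axis=0)                                    # (0.4, 0.6)
Z = X - p_hat                                             # centered data
print(np.diag(p_hat) - np.outer(p_hat, p_hat))            # diag(p̂) − p̂p̂ᵀ
print(Z.T @ Z / n)                                        # (1/n) Zᵀ Z, same matrix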
Multinomial Distribution: Number of Occurrences
Let {x_1, x_2, ..., x_n} be a random sample from X. Let N_i be the random variable denoting the number of occurrences of symbol a_i in the sample, and let N = (N_1, N_2, ..., N_m)^T. N has a multinomial distribution, given as

f\big(N = (n_1, n_2, \ldots, n_m) \mid p\big) = \binom{n}{n_1\, n_2\, \ldots\, n_m} \prod_{i=1}^{m} p_i^{n_i}

The mean and covariance matrix of N are:

\mu_N = E[N] = n\,E[X] = n \cdot \mu = n \cdot p = (np_1, \ldots, np_m)^T

\Sigma_N = n \cdot (\mathrm{diag}(p) - p\,p^T) = \begin{pmatrix} np_1(1-p_1) & -np_1p_2 & \cdots & -np_1p_m \\ -np_1p_2 & np_2(1-p_2) & \cdots & -np_2p_m \\ \vdots & \vdots & \ddots & \vdots \\ -np_1p_m & -np_2p_m & \cdots & np_m(1-p_m) \end{pmatrix}

The sample mean and covariance matrix for N are

\hat{\mu}_N = n\hat{p} \qquad \hat{\Sigma}_N = n\big(\mathrm{diag}(\hat{p}) - \hat{p}\hat{p}^T\big)
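A short scipy sketch for the sepal length counts (n = 150, using p̂ as the multinomial parameter purely for illustration):

import numpy as np
from scipy.stats import multinomial

n = 150
p = np.array([45, 50, 43, 12]) / n               # p̂ for sepal length
print(multinomial.pmf([45, 50, 43, 12], n, p))   # probability of the observed counts
print(n * p)                                      # μ_N = n·p
print(n * (np.diag(p) - np.outer(p, p)))          # Σ_N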
Bivariate Analysis
Assume the data comprises two categorical attributes, X1 and X2, with

dom(X1) = {a_{11}, a_{12}, ..., a_{1 m_1}}
dom(X2) = {a_{21}, a_{22}, ..., a_{2 m_2}}

We model X1 and X2 as multivariate Bernoulli variables X_1 and X_2 with dimensions m_1 and m_2, respectively. Their joint distribution is modeled as the (m_1 + m_2)-dimensional vector variable X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}, with

X\big((v_1, v_2)^T\big) = \begin{pmatrix} X_1(v_1) \\ X_2(v_2) \end{pmatrix} = \begin{pmatrix} e_{1i} \\ e_{2j} \end{pmatrix}

provided that v_1 = a_{1i} and v_2 = a_{2j}.

The joint PMF for X is given as the m_1 × m_2 matrix

P_{12} = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1m_2} \\ p_{21} & p_{22} & \cdots & p_{2m_2} \\ \vdots & \vdots & \ddots & \vdots \\ p_{m_1 1} & p_{m_1 2} & \cdots & p_{m_1 m_2} \end{pmatrix}

Bivariate Empirical PMF: sepal length and sepal width

X1: sepal length
Bins        Domain            Counts
[4.3, 5.2]  Very Short (a_1)  n_1 = 45
(5.2, 6.1]  Short (a_2)       n_2 = 50
(6.1, 7.0]  Long (a_3)        n_3 = 43
(7.0, 7.9]  Very Long (a_4)   n_4 = 12

X2: sepal width
Bins        Domain        Counts
[2.0, 2.8]  Short (a_1)   47
(2.8, 3.6]  Medium (a_2)  88
(3.6, 4.4]  Long (a_3)    15

Observed counts (n_ij):
                                X2
                     Short (e_21)  Medium (e_22)  Long (e_23)
X1  Very Short (e_11)      7            33              5
    Short (e_12)          24            18              8
    Long (e_13)           13            30              0
    Very Long (e_14)       3             7              2

Bivariate Empirical PMF: sepal length and sepal width

Joint probabilities: p̂_ij = n_ij / n.

[Figure: empirical joint PMF f(x) over the value pairs (e_1i, e_2j); for example, p̂_11 = 7/150 ≈ 0.047, p̂_12 = 33/150 = 0.22, and p̂_33 = 0/150 = 0.]
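The joint probabilities can be computed directly from the contingency counts; a brief numpy sketch:

import numpy as np

N12 = np.array([[ 7, 33, 5],
                [24, 18, 8],
                [13, 30, 0],
                [ 3,  7, 2]])      # observed counts n_ij
print((N12 / N12.sum()).round(3))  # empirical joint PMF p̂_ij = n_ij / n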

Attribute Dependence: Contingency Analysis

The contingency table for X_1 and X_2 is the m_1 × m_2 matrix of observed counts n_ij:

N_{12} = n \cdot \hat{P}_{12} = \begin{pmatrix} n_{11} & n_{12} & \cdots & n_{1m_2} \\ n_{21} & n_{22} & \cdots & n_{2m_2} \\ \vdots & \vdots & \ddots & \vdots \\ n_{m_1 1} & n_{m_1 2} & \cdots & n_{m_1 m_2} \end{pmatrix}

where \hat{P}_{12} is the empirical joint PMF for X_1 and X_2. The contingency table is augmented with the row and column marginal counts, as follows:

N_1 = n \cdot \hat{p}_1 = \big(n_1^1, \ldots, n_{m_1}^1\big)^T \qquad N_2 = n \cdot \hat{p}_2 = \big(n_1^2, \ldots, n_{m_2}^2\big)^T

N_1 and N_2 have multinomial distributions with parameters p_1 = (p_1^1, \ldots, p_{m_1}^1) and p_2 = (p_1^2, \ldots, p_{m_2}^2), respectively. N_{12} also has a multinomial distribution with parameters P_{12} = \{p_{ij}\}, for 1 ≤ i ≤ m_1 and 1 ≤ j ≤ m_2.

Contingency Table: sepal length vs. sepal width

                          Sepal width (X2)
                     Short     Medium    Long
Sepal length (X1)    (a_21)    (a_22)    (a_23)    Row counts
Very Short (a_11)       7         33        5      n_1^1 = 45
Short (a_12)           24         18        8      n_2^1 = 50
Long (a_13)            13         30        0      n_3^1 = 43
Very Long (a_14)        3          7        2      n_4^1 = 12
Column counts      n_1^2 = 47  n_2^2 = 88  n_3^2 = 15   n = 150

Chi-Squared Test for Independence
Assume X_1 and X_2 are independent. Then their joint PMF is

\hat{p}_{ij} = \hat{p}_i^1 \cdot \hat{p}_j^2

The expected frequency for each pair of values is

e_{ij} = n \cdot \hat{p}_{ij} = n \cdot \hat{p}_i^1 \cdot \hat{p}_j^2 = n \cdot \frac{n_i^1}{n} \cdot \frac{n_j^2}{n} = \frac{n_i^1 n_j^2}{n}

The \chi^2 statistic quantifies the difference between observed and expected counts:

\chi^2 = \sum_{i=1}^{m_1} \sum_{j=1}^{m_2} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}

The sampling distribution of the \chi^2 statistic follows the chi-squared density function

f(x \mid q) = \frac{1}{2^{q/2}\,\Gamma(q/2)}\, x^{\frac{q}{2}-1} e^{-\frac{x}{2}}

where q is the number of degrees of freedom:

q = |dom(X_1)| \cdot |dom(X_2)| - \big(|dom(X_1)| + |dom(X_2)|\big) + 1 = m_1 m_2 - m_1 - m_2 + 1 = (m_1 - 1)(m_2 - 1)
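In practice the whole test is one call in scipy; a sketch on the sepal length vs. sepal width contingency table shown earlier:

import numpy as np
from scipy.stats import chi2_contingency

N12 = np.array([[ 7, 33, 5],
                [24, 18, 8],
                [13, 30, 0],
                [ 3,  7, 2]])
chi2, p_value, dof, expected = chi2_contingency(N12)
print(chi2, dof, p_value)      # ≈ 21.8, 6, 0.0013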
Chi-Squared Test: sepal length and sepal width
Expected counts (e_ij):
                                X2
                     Short (a_21)  Medium (a_22)  Long (a_23)
X1  Very Short (a_11)    14.10         26.40          4.50
    Short (a_12)         15.67         29.33          5.00
    Long (a_13)          13.47         25.23          4.30
    Very Long (a_14)      3.76          7.04          1.20

Observed counts (n_ij):
                                X2
                     Short (a_21)  Medium (a_22)  Long (a_23)
X1  Very Short (a_11)      7            33              5
    Short (a_12)          24            18              8
    Long (a_13)           13            30              0
    Very Long (a_14)       3             7              2

The chi-squared statistic value is \chi^2 = 21.8. The number of degrees of freedom is

q = (m_1 - 1) \cdot (m_2 - 1) = 3 \cdot 2 = 6

Chi-Squared Distribution (q = 6).

The p-value of a statistic θ is defined as the probability of obtaining a value at least as extreme as the observed value. The null hypothesis, that X1 and X2 are independent, is rejected if p-value(z) ≤ α, say α = 0.01. We have p-value(21.8) = 0.0013. Thus, we reject the null hypothesis and conclude that X1 and X2 are dependent.

[Figure: chi-squared density f(x | q = 6); the H0 rejection region for α = 0.01 starts at the critical value x = 16.8, and the observed statistic 21.8 lies inside it.]
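The p-value and critical value can be recovered from the chi-squared distribution; a brief scipy check:

from scipy.stats import chi2

q = 6
print(chi2.sf(21.8, q))       # p-value ≈ 0.0013
print(chi2.ppf(0.99, q))      # critical value ≈ 16.8 for α = 0.01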
Multiway Contingency Analysis
Given X = (X_1, X_2, \ldots, X_d)^T, the chi-squared statistic is given as

\chi^2 = \sum_{\mathbf{i}} \frac{(n_{\mathbf{i}} - e_{\mathbf{i}})^2}{e_{\mathbf{i}}} = \sum_{i_1=1}^{m_1} \sum_{i_2=1}^{m_2} \cdots \sum_{i_d=1}^{m_d} \frac{(n_{i_1,i_2,\ldots,i_d} - e_{i_1,i_2,\ldots,i_d})^2}{e_{i_1,i_2,\ldots,i_d}}

Under the null hypothesis that the attributes are independent, the expected number of occurrences of the symbol tuple (a_{1i_1}, a_{2i_2}, \ldots, a_{di_d}) is given as

e_{\mathbf{i}} = n \cdot \hat{p}_{\mathbf{i}} = n \cdot \prod_{j=1}^{d} \hat{p}_{i_j}^{\,j} = \frac{n_{i_1}^1 n_{i_2}^2 \cdots n_{i_d}^d}{n^{d-1}}

The total number of degrees of freedom for the chi-squared distribution is given as

q = \prod_{i=1}^{d} |dom(X_i)| - \sum_{i=1}^{d} |dom(X_i)| + (d - 1) = \left(\prod_{i=1}^{d} m_i\right) - \left(\sum_{i=1}^{d} m_i\right) + d - 1
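The expected counts under full independence are just a scaled outer product of the marginal counts; a sketch for the d = 3 Iris example that follows:

import numpy as np

n = 150
n1 = np.array([45, 50, 43, 12])    # sepal length marginals
n2 = np.array([47, 88, 15])        # sepal width marginals
n3 = np.array([50, 50, 50])        # Iris type marginals
e = np.einsum('i,j,k->ijk', n1, n2, n3) / n**2   # e_i = n1_i · n2_j · n3_k / n^{d-1}
print(e[0, 0, 0])                  # ≈ 4.70 expected (a11, a21, a31) tuples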

3-Way Contingency Table
X1 : sepal length, X2 : sepal width and X3 : Iris type

[Figure: 3-way contingency table, a 4 × 3 × 3 cube of observed counts over X1 (sepal length, marginal counts 45, 50, 43, 12), X2 (sepal width, marginal counts 47, 88, 15), and X3 (Iris type, marginal counts 50, 50, 50), with n = 150.]
3-Way Contingency Analysis

Expected counts (the same value applies for each Iris type a_31, a_32, a_33, since each type contains 50 points):

                   X2
            a_21    a_22    a_23
     a_11   4.70    8.80    1.50
X1   a_12   5.22    9.78    1.67
     a_13   4.49    8.41    1.43
     a_14   1.25    2.35    0.40

The value of the χ2 statistic is χ2 = 231.06, and the number of degrees of freedom is
q = 4 · 3 · 3 − (4 + 3 + 3) + 2 = 36 − 10 + 2 = 28.
For a significance level of α = 0.01, the critical value of the chi-square distribution is
z = 48.28.
The observed value of χ2 = 231.06 is much greater than z, and it is thus extremely
unlikely to happen under the null hypothesis. We conclude that the three attributes are
not 3-way independent, but rather there is some dependence between them.

Distance and Angle
With the modeling of categorical attributes as multivariate Bernoulli variables, it is
possible to compute the distance or the angle between any two points x i and x j :
   
x_i = \begin{pmatrix} e_{1 i_1} \\ \vdots \\ e_{d i_d} \end{pmatrix} \qquad x_j = \begin{pmatrix} e_{1 j_1} \\ \vdots \\ e_{d j_d} \end{pmatrix}

The different measures of distance and similarity rely on the number of matching and mismatching values (or symbols) across the d attributes X_k. The number of matching values s is given as

s = x_i^T x_j = \sum_{k=1}^{d} (e_{k i_k})^T e_{k j_k}

The number of mismatches is simply d - s. Also useful is the norm of each point:

\|x_i\|^2 = x_i^T x_i = d

Distance and Angle

The Euclidean distance between x_i and x_j is given as

\delta(x_i, x_j) = \|x_i - x_j\| = \sqrt{x_i^T x_i - 2 x_i^T x_j + x_j^T x_j} = \sqrt{2(d - s)}

The Hamming distance is given as

\delta_H(x_i, x_j) = d - s

Cosine Similarity: The cosine of the angle is given as

\cos \theta = \frac{x_i^T x_j}{\|x_i\| \cdot \|x_j\|} = \frac{s}{d}

The Jaccard Coefficient is given as

J(x_i, x_j) = \frac{s}{2(d - s) + s} = \frac{s}{2d - s}
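A small sketch computing all four measures for two points over d = 2 attributes (the encoded values are made up):

import numpy as np

xi = np.concatenate([[1, 0, 0, 0], [0, 1, 0]])   # (Very Short, Medium)
xj = np.concatenate([[0, 1, 0, 0], [0, 1, 0]])   # (Short, Medium)
d = 2
s = int(xi @ xj)                   # number of matching symbols
print(np.sqrt(2 * (d - s)))        # Euclidean distance
print(d - s)                       # Hamming distance
print(s / d)                       # cosine similarity
print(s / (2 * d - s))             # Jaccard coefficient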

Discretization
Discretization, also called binning, converts numeric attributes into categorical ones.
Equal-Width Intervals: Partition the range of X into k equal-width intervals. The interval width is simply the range of X divided by k:

w = \frac{x_{max} - x_{min}}{k}

Thus, the i-th interval boundary is given as

v_i = x_{min} + i\,w, \quad \text{for } i = 1, \ldots, k - 1

Equal-Frequency Intervals: We divide the range of X into intervals that contain an (approximately) equal number of points. The intervals are computed from the empirical quantile or inverse cumulative distribution function

\hat{F}^{-1}(q) = \min\{x \mid P(X \le x) \ge q\}

We require that each interval contain 1/k of the probability mass; therefore, the interval boundaries are given as

v_i = \hat{F}^{-1}(i/k), \quad \text{for } i = 1, \ldots, k - 1
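Both schemes are a few lines with numpy; a sketch on made-up values in the sepal length range:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(4.3, 7.9, size=150)              # stand-in for the attribute values
k = 4

w = (x.max() - x.min()) / k                       # equal-width: v_i = x_min + i·w
width_edges = x.min() + w * np.arange(1, k)

freq_edges = np.quantile(x, np.arange(1, k) / k)  # equal-frequency: v_i = F̂^{-1}(i/k)
print(width_edges, freq_edges)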

Equal-Frequency Discretization: sepal length (4 bins)

Quartile values (range of sepal length: [4.3, 7.9]):

\hat{F}^{-1}(0.25) = 5.1
\hat{F}^{-1}(0.50) = 5.8
\hat{F}^{-1}(0.75) = 6.4

[Figure: empirical inverse CDF \hat{F}^{-1}(q) of sepal length for q ∈ [0, 1], ranging from 4.3 to 7.9.]

Bin         Width  Count
[4.3, 5.1]  0.8    n_1 = 41
(5.1, 5.8]  0.7    n_2 = 39
(5.8, 6.4]  0.6    n_3 = 35
(6.4, 7.9]  1.5    n_4 = 35

