Chapter 3: Categorical Attributes

Zaki & Meira Jr.
1. Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
2. Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Univariate Analysis: Bernoulli Variable
Consider a single categorical attribute, X, with domain $dom(X) = \{a_1, a_2, \dots, a_m\}$ comprising m symbolic values. The data D is an n × 1 symbolic data matrix given as
$$D = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$$
When m = 2, X can be modeled as a Bernoulli variable that takes the value 1 when the symbol is $a_1$ and 0 when it is $a_2$, with parameter $p = P(X = 1)$.
Binomial Distribution: Number of Occurrences
Given the Bernoulli variable X, let $\{x_1, x_2, \dots, x_n\}$ be a random sample of size n. Let N be the random variable denoting the number of occurrences of the symbol $a_1$ (value X = 1). N has a binomial distribution, given as
$$f(N = n_1 \mid n, p) = \binom{n}{n_1} p^{n_1} (1-p)^{n - n_1}$$
N is the sum of the n IID Bernoulli random variables $x_i$, each distributed as X, that is, $N = \sum_{i=1}^{n} x_i$. The mean or expected number of occurrences of $a_1$ is
$$\mu_N = E[N] = E\left[\sum_{i=1}^{n} x_i\right] = \sum_{i=1}^{n} E[x_i] = \sum_{i=1}^{n} p = np$$
The variance of N is
$$\sigma_N^2 = \mathrm{var}(N) = \sum_{i=1}^{n} \mathrm{var}(x_i) = \sum_{i=1}^{n} p(1-p) = np(1-p)$$
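To make these formulas concrete, here is a minimal numpy sketch (the values n = 10 and p = 0.6 are assumed purely for illustration) that evaluates the binomial PMF over its support and checks that the resulting mean and variance equal np and np(1 − p):

```python
import numpy as np
from math import comb

n, p = 10, 0.6  # assumed sample size and probability of observing a1

# Binomial PMF: f(N = n1 | n, p) = C(n, n1) * p^n1 * (1 - p)^(n - n1)
support = np.arange(n + 1)
pmf = np.array([comb(n, n1) * p**n1 * (1 - p)**(n - n1) for n1 in support])

mu = (support * pmf).sum()               # expected count of a1
var = ((support - mu) ** 2 * pmf).sum()  # variance of the count

print(mu, var)  # 6.0 and 2.4 (up to floating point), i.e., np and np(1 - p)
```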
Multivariate Bernoulli Variable
For the general case when $dom(X) = \{a_1, a_2, \dots, a_m\}$, we model X as an m-dimensional or multivariate Bernoulli random variable $X = (A_1, A_2, \dots, A_m)^T$, where each $A_i$ is a Bernoulli variable with parameter $p_i$ denoting the probability of observing symbol $a_i$. However, X can assume only one of the symbolic values at any one time. Thus,
$$X(v) = e_i \quad \text{if } v = a_i$$
where $e_i$ is the ith standard basis vector in $\mathbb{R}^m$. The PMF of X is
$$P(X = e_i) = f(e_i) = p_i = \prod_{j=1}^{m} p_j^{e_{ij}}$$
with $\sum_{i=1}^{m} p_i = 1$.
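As a quick sketch of sampling from this distribution (the parameter vector p below is assumed for illustration), note that a single multinomial trial in numpy returns exactly one basis vector $e_i$:

```python
import numpy as np

p = np.array([0.3, 0.5, 0.2])  # assumed PMF over m = 3 symbols; sums to 1
rng = np.random.default_rng(0)

# A single multinomial trial (n = 1) yields exactly one basis vector e_i.
x = rng.multinomial(1, p)
print(x)  # e.g., [0 1 0], i.e., symbol a2 was observed

# Sanity check: the sample mean over many draws approaches p.
sample = rng.multinomial(1, p, size=10_000)
print(sample.mean(axis=0))  # roughly [0.3, 0.5, 0.2]
```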
Multivariate Bernoulli: Mean
The mean or expected value of X is
$$\mu = E[X] = \sum_{i=1}^{m} e_i f(e_i) = \sum_{i=1}^{m} e_i p_i = (p_1, p_2, \dots, p_m)^T = p$$
Multivariate Bernoulli Variable: sepal length
[Figure: empirical PMF of the sepal length attribute after discretization into categorical values.]
Multivariate Bernoulli Variable: Covariance Matrix
We have $X = (A_1, A_2, \dots, A_m)^T$, where $A_i$ is the Bernoulli variable corresponding to symbol $a_i$. The variance for each Bernoulli variable $A_i$ is
$$\sigma_i^2 = \mathrm{var}(A_i) = p_i(1 - p_i)$$
and, since $A_i$ and $A_j$ cannot both be 1 at the same time, the covariance between them (for i ≠ j) is
$$\sigma_{ij} = E[A_i A_j] - E[A_i]\,E[A_j] = 0 - p_i p_j = -p_i p_j$$
In matrix form, the covariance matrix of X is
$$\Sigma = \mathrm{diag}(p) - p\,p^T$$
Categorical, Mapped Binary and Centered Dataset
Modeling X as a multivariate Bernoulli variable is equivalent to treating each mapped point $X(x_i)$ as a row of a new n × m binary data matrix, which can then be centered:

    X          A1   A2       Z1     Z2
    x1  Short   0    1      -0.4    0.4
    x2  Short   0    1      -0.4    0.4
    x3  Long    1    0       0.6   -0.6
    x4  Short   0    1      -0.4    0.4
    x5  Long    1    0       0.6   -0.6
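A small numpy sketch of this mapping (variable names are mine; the data is the five-point Short/Long example above) that builds the binary matrix, estimates p̂ from the column means, centers the data, and evaluates the covariance formula $\Sigma = \mathrm{diag}(p) - p\,p^T$:

```python
import numpy as np

# The five-point example above: a1 = Long, a2 = Short.
symbols = ["Short", "Short", "Long", "Short", "Long"]
domain = ["Long", "Short"]

# One-hot encode: row i is the basis vector for symbols[i].
A = np.array([[1.0 if s == a else 0.0 for a in domain] for s in symbols])

p_hat = A.mean(axis=0)  # estimated parameter vector: [0.4, 0.6]
Z = A - p_hat           # centered data matrix (columns Z1, Z2 above)

# Covariance of a multivariate Bernoulli: Sigma = diag(p) - p p^T
Sigma = np.diag(p_hat) - np.outer(p_hat, p_hat)

print(p_hat)  # [0.4 0.6]
print(Z)      # matches the centered table above
print(Sigma)  # [[ 0.24 -0.24] [-0.24  0.24]]
```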
Bivariate Empirical PMF: sepal length and sepal width
[Figure: 3D plot of the empirical joint PMF $\hat{f}(x)$ over the symbol pairs $(e_{1i}, e_{2j})$ of X1 (sepal length) and X2 (sepal width).]
Attribute Dependence: Contingency Analysis
The contingency table for X1 and X2 is the $m_1 \times m_2$ matrix of observed counts $n_{ij}$:
$$N_{12} = n \cdot \hat{P}_{12} = \begin{pmatrix} n_{11} & n_{12} & \cdots & n_{1m_2} \\ n_{21} & n_{22} & \cdots & n_{2m_2} \\ \vdots & \vdots & \ddots & \vdots \\ n_{m_1 1} & n_{m_1 2} & \cdots & n_{m_1 m_2} \end{pmatrix}$$
$N_1$ and $N_2$ have multinomial distributions with parameters $p^1 = (p_1^1, \dots, p_{m_1}^1)$ and $p^2 = (p_1^2, \dots, p_{m_2}^2)$, respectively. $N_{12}$ also has a multinomial distribution with parameters $P_{12} = \{p_{ij}\}$, for $1 \le i \le m_1$ and $1 \le j \le m_2$.
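As a sketch of how such a table is tabulated in practice (the two symbol arrays below are hypothetical), pandas.crosstab builds $N_{12}$ directly from two categorical columns:

```python
import pandas as pd

# Hypothetical discretized attributes for n = 8 points.
X1 = pd.Series(["Short", "Long", "Short", "Long", "Short", "Short", "Long", "Short"], name="X1")
X2 = pd.Series(["Medium", "Medium", "Short", "Long", "Medium", "Short", "Long", "Medium"], name="X2")

# Observed counts n_ij: rows range over dom(X1), columns over dom(X2).
N12 = pd.crosstab(X1, X2)
print(N12)

# Empirical joint PMF: P_hat_12 = N12 / n.
print(N12 / len(X1))
```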
Chi-Squared Test for Independence
Assume X1 and X2 are independent. Then their joint PMF factorizes:
$$\hat{p}_{ij} = \hat{p}_i^1 \cdot \hat{p}_j^2$$
The expected frequency for each pair of values is
$$e_{ij} = n \cdot \hat{p}_{ij} = n \cdot \hat{p}_i^1 \cdot \hat{p}_j^2 = n \cdot \frac{n_i^1}{n} \cdot \frac{n_j^2}{n} = \frac{n_i^1 n_j^2}{n}$$
The χ² statistic quantifies the difference between observed and expected counts:
$$\chi^2 = \sum_{i=1}^{m_1} \sum_{j=1}^{m_2} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}$$
The sampling distribution of the χ² statistic follows the chi-squared density function:
$$f(x \mid q) = \frac{1}{2^{q/2}\,\Gamma(q/2)}\, x^{\frac{q}{2} - 1} e^{-\frac{x}{2}}$$
where q is the number of degrees of freedom:
$$q = |dom(X_1)| \cdot |dom(X_2)| - (|dom(X_1)| + |dom(X_2)|) + 1 = m_1 m_2 - m_1 - m_2 + 1 = (m_1 - 1)(m_2 - 1)$$
Chi-Squared Test: sepal length and sepal width
Expected counts:

    X1 \ X2             Short (a21)   Medium (a22)   Long (a23)
    Very Short (a11)       14.1          26.4           4.5
    Short (a12)            15.67         29.33          5.0
    Long (a13)             13.47         25.23          4.3
    Very Long (a14)         3.76          7.04          1.2

Observed counts:

    X1 \ X2             Short (a21)   Medium (a22)   Long (a23)
    Very Short (a11)        7             33             5
    Short (a12)            24             18             8
    Long (a13)             13             30             0
    Very Long (a14)         3              7             2

The chi-squared statistic value is χ² = 21.8, and the number of degrees of freedom is
q = (m1 − 1) · (m2 − 1) = 3 · 2 = 6
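These numbers can be reproduced with scipy (a sketch; scipy.stats.chi2_contingency returns the statistic, p-value, degrees of freedom, and expected counts, applying no continuity correction for tables larger than 2 × 2):

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Observed counts from the table above (rows: X1, columns: X2).
observed = np.array([
    [ 7, 33, 5],   # Very Short
    [24, 18, 8],   # Short
    [13, 30, 0],   # Long
    [ 3,  7, 2],   # Very Long
])

stat, pvalue, dof, expected = chi2_contingency(observed)
print(round(stat, 1), dof)   # 21.8 6
print(expected.round(2))     # matches the expected-count table above

# Critical value at significance level alpha = 0.01:
print(chi2.ppf(0.99, dof))   # about 16.81; since 21.8 > 16.81, reject H0
```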
Chi-Squared Distribution (q = 6)
[Figure: chi-squared density f(x | 6); the H0 rejection region for significance level α = 0.01 begins at the critical value 16.8, and the observed statistic χ² = 21.8 falls inside it.]
Multiway Contingency Analysis
Given $X = (X_1, X_2, \dots, X_d)^T$, the chi-squared statistic is given as
$$\chi^2 = \sum_{\mathbf{i}} \frac{(n_{\mathbf{i}} - e_{\mathbf{i}})^2}{e_{\mathbf{i}}} = \sum_{i_1=1}^{m_1} \sum_{i_2=1}^{m_2} \cdots \sum_{i_d=1}^{m_d} \frac{(n_{i_1,i_2,\dots,i_d} - e_{i_1,i_2,\dots,i_d})^2}{e_{i_1,i_2,\dots,i_d}}$$
Under the null hypothesis, that the attributes are independent, the expected number of occurrences of the symbol tuple $(a_{1i_1}, a_{2i_2}, \dots, a_{di_d})$ is given as
$$e_{\mathbf{i}} = n \cdot \hat{p}_{\mathbf{i}} = n \cdot \prod_{j=1}^{d} \hat{p}^{\,j}_{i_j} = \frac{n^1_{i_1} n^2_{i_2} \cdots n^d_{i_d}}{n^{d-1}}$$
The total number of degrees of freedom for the chi-squared distribution is given as
$$q = \prod_{i=1}^{d} |dom(X_i)| - \sum_{i=1}^{d} |dom(X_i)| + (d - 1) = \prod_{i=1}^{d} m_i - \sum_{i=1}^{d} m_i + d - 1$$
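A minimal numpy sketch of the multiway computation (the 3-way count tensor below is randomly generated, purely for illustration): expected counts are the outer product of the one-dimensional marginals divided by $n^{d-1}$, and the statistic and degrees of freedom follow the formulas above:

```python
import numpy as np

# Hypothetical 3-way contingency tensor of shape (m1, m2, m3) = (4, 3, 3).
rng = np.random.default_rng(1)
N = rng.integers(1, 20, size=(4, 3, 3)).astype(float)
n = N.sum()
d = 3

# One-dimensional marginal counts for each attribute.
n1 = N.sum(axis=(1, 2))
n2 = N.sum(axis=(0, 2))
n3 = N.sum(axis=(0, 1))

# Expected counts under d-way independence: e = (n1 x n2 x n3) / n^(d-1).
E = np.einsum("i,j,k->ijk", n1, n2, n3) / n ** (d - 1)

chi2_stat = ((N - E) ** 2 / E).sum()
q = N.size - sum(N.shape) + (d - 1)  # 36 - 10 + 2 = 28 for a 4x3x3 table
print(chi2_stat, q)
```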
3-Way Contingency Table
[Figure: 3-way contingency table, a 4 × 3 × 3 cube of observed counts $n_{ijk}$ for X1: sepal length, X2: sepal width, and X3: Iris type, with marginal counts shown along each dimension.]
3-Way Contingency Analysis
The value of the χ2 statistic is χ2 = 231.06, and the number of degrees of freedom is
q = 4 · 3 · 3 − (4 + 3 + 3) + 2 = 36 − 10 + 2 = 28.
For a significance level of α = 0.01, the critical value of the chi-square distribution is
z = 48.28.
The observed value χ² = 231.06 is much greater than z = 48.28, which is extremely unlikely under the null hypothesis. We conclude that the three attributes are not 3-way independent; rather, there is some dependence between them.
Distance and Angle
With the modeling of categorical attributes as multivariate Bernoulli variables, it is
possible to compute the distance or the angle between any two points x i and x j :
$$x_i = \begin{pmatrix} e_{1i_1} \\ \vdots \\ e_{di_d} \end{pmatrix} \qquad x_j = \begin{pmatrix} e_{1j_1} \\ \vdots \\ e_{dj_d} \end{pmatrix}$$
The different measures of distance and similarity rely on the number of matching and mismatching values (or symbols) across the d attributes $X_k$. The number of matching values s is given as
$$s = x_i^T x_j = \sum_{k=1}^{d} (e_{ki_k})^T e_{kj_k}$$
The number of mismatches is simply d − s. Also useful is the norm of each point:
$$\|x_i\|^2 = x_i^T x_i = d$$
The Hamming distance between $x_i$ and $x_j$ is
$$\delta_H(x_i, x_j) = d - s$$
and the cosine of the angle between them is
$$\cos\theta = \frac{x_i^T x_j}{\|x_i\| \cdot \|x_j\|} = \frac{s}{d}$$
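A short sketch (a hypothetical two-record example over d = 2 attributes, with made-up domains) computing s, the Hamming distance, and the cosine similarity from the concatenated one-hot encodings:

```python
import numpy as np

# Two points over d = 2 categorical attributes, one-hot encoded and concatenated.
# Hypothetical domains: X1 in {Long, Short}, X2 in {Low, Medium, High}.
x_i = np.array([1, 0, 0, 1, 0])  # (Long, Medium)
x_j = np.array([0, 1, 0, 1, 0])  # (Short, Medium)
d = 2

s = x_i @ x_j                 # matching symbols: 1 (both Medium)
hamming = d - s               # mismatching symbols: 1
cos_theta = s / d             # since ||x_i|| = ||x_j|| = sqrt(d)
print(s, hamming, cos_theta)  # 1 1 0.5
```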
Discretization
Discretization, also called binning, converts numeric attributes into categorical ones.
Equal-Width Intervals: Partition the range of X into k equal-width intervals. The interval width is simply the range of X divided by k:
$$w = \frac{x_{\max} - x_{\min}}{k}$$
Thus, the ith interval boundary is given as
$$v_i = x_{\min} + i\,w, \quad \text{for } i = 1, \dots, k-1$$
Equal-Frequency Intervals: We require that each interval contain 1/k of the probability mass; therefore, the interval boundaries are given as
$$v_i = \hat{F}^{-1}(i/k) \quad \text{for } i = 1, \dots, k-1$$
where $\hat{F}$ is the empirical CDF of X.
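A numpy sketch of both schemes (the sample x below is hypothetical; np.quantile plays the role of the empirical inverse CDF $\hat{F}^{-1}$):

```python
import numpy as np

x = np.array([4.3, 4.9, 5.0, 5.1, 5.4, 5.8, 6.1, 6.4, 7.0, 7.9])  # hypothetical sample
k = 4

# Equal-width boundaries: v_i = x_min + i*w for i = 1..k-1.
w = (x.max() - x.min()) / k
equal_width = x.min() + w * np.arange(1, k)

# Equal-frequency boundaries: v_i = F^{-1}(i/k) for i = 1..k-1.
equal_freq = np.quantile(x, np.arange(1, k) / k)

print(equal_width)                  # [5.2 6.1 7.0]
print(equal_freq)                   # the sample quartiles
print(np.digitize(x, equal_width))  # bin index assigned to each point
```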
Equal-Frequency Discretization: sepal length (4 bins)
[Figure: empirical inverse CDF $\hat{F}^{-1}(q)$ of sepal length over its range [4.3, 7.9].]

Quartile values:
$\hat{F}^{-1}(0.25) = 5.1$
$\hat{F}^{-1}(0.5) = 5.8$
$\hat{F}^{-1}(0.75) = 6.4$

    Bin           Width   Count
    [4.3, 5.1]     0.8    n1 = 41
    (5.1, 5.8]     0.7    n2 = 39
    (5.8, 6.4]     0.6    n3 = 35
    (6.4, 7.9]     1.5    n4 = 35
Data Mining and Machine Learning:
Fundamental Concepts and Algorithms
dataminingbook.info