
INFO-F-422: Statistical foundations of machine learning

Feature selection

Gianluca Bontempi
Machine Learning Group
Computer Science Department
mlg.ulb.ac.be

The learning procedure

[Diagram of the learning procedure: PHENOMENON, PROBLEM FORMULATION, EXPERIMENTAL DESIGN, RAW DATA, PREPROCESSING, DATA, MODEL GENERATION, PARAMETRIC IDENTIFICATION, MODEL VALIDATION, MODEL SELECTION, MODEL]

Feature selection problem

◮ ML algorithms degrade in accuracy when faced with many inputs (aka features) that are not necessary.
◮ Tasks with thousands of features: bioinformatics classification, where n (the number of genes for which expression is measured) may range from 6000 to 60000.
◮ The original learning techniques were not designed to cope with many irrelevant features.
◮ Feature selection: selecting some subset of the features to use as inputs.
◮ Using all features may negatively affect generalization, because of irrelevant/redundant features.
◮ Feature selection can be cast as a model selection problem.

Benefits and drawbacks of feature selection

There are many potential benefits of feature selection:
◮ facilitating data visualization and data understanding,
◮ reducing the measurement and storage requirements,
◮ reducing training and utilization times of the final model,
◮ defying the curse of dimensionality to improve prediction performance.

Drawbacks are:
◮ an additional layer of complexity: the search in the hypothesis space is augmented by another dimension, finding the optimal subset of relevant features,
◮ additional time for learning.
Neighbourhood and dimension n

◮ n dimensional space
◮ query point xq ∈ R^n
◮ unit volume around xq
◮ hypercube of edge length d < 1 which contains a portion V of the unit volume:

    d^n = V  ⇒  d = V^(1/n)

    V = {x ∈ R^n : |x_j| < d/2, j = 1, . . . , n}

  where d stands for a measure of locality.
◮ For instance, if V = 0.5 we have d = 0.7, 0.87, 0.98 for n = 2, 5, 50.

As n increases, the volume of the cubic neighbourhood decreases if the edge length d is fixed.

[Figure: unit hypercubes with edge d = 1/2: V = 1/2 for n = 1, V = 1/4 for n = 2, V = 1/8 for n = 3]
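A minimal R sketch (illustrative, not from the original slides) of the relation d = V^(1/n): the edge needed to capture half of the unit volume quickly approaches 1 as n grows, while a fixed edge captures a vanishing volume.

## edge length d of a hypercube capturing a fraction V of the unit volume: d = V^(1/n)
V <- 0.5
n <- c(1, 2, 5, 10, 50)
d <- V^(1/n)
print(round(d, 2))   ## 0.50 0.71 0.87 0.93 0.99

## conversely, with a fixed edge d = 1/2 the captured volume d^n collapses to zero
print((1/2)^n)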

Neighborhood volume vs. edge size

◮ given the neighbourhood volume V: the edge length increases for increasing n.
◮ given the edge d: the neighbourhood volume decreases for increasing n.

Locality and dimension n

◮ Let N points be uniformly distributed in a unit volume.
◮ The number of neighbours K in a volume V < 1 amounts to K = NV and is proportional to V.
◮ By increasing n (for a fixed edge d), V and K decrease and the observations become sparser.
◮ As n increases the amount of local data goes to zero: all data sets are sparse.
K vs. dimension n (fixed N and d)

If N and d are fixed, the number K of neighbours in V decreases by increasing n.
As n increases the amount of local data goes to zero: all data sets are sparse.

Number of observations vs. dimension n (fixed d and K)

If we want to preserve the same degree of locality (fixed edge d = 0.1) for increasing n, we need a much larger number N of observations.

Curse of dimensionality: other issues

◮ The error of the best model (e.g. the conditional variance in regression or the Bayes error in classification) decreases with n, but the mean integrated squared error of the models increases faster than linearly in n: in other terms, the best model is more and more accurate but it is more and more difficult to find it.
◮ In high dimensions, all datasets show multicollinearity.
◮ In high dimensions, the number of possible models to consider increases superexponentially in n.

Methods of feature selection

1. Filter methods: preprocessing methods that assess features ignoring the role of the learning algorithm. Examples: ranking, compression techniques (like PCA or clustering).
2. Wrapper methods: assess subsets of variables according to their usefulness to a given predictor; the search uses the learning algorithm as part of the evaluation function. Examples: stepwise methods in linear regression.
3. Embedded methods: selection is part of the learning procedure. Examples: classification trees, random forests, methods based on regularization (e.g. lasso) and representation learning (e.g. autoencoders).
Pros-cons analysis

◮ Filter methods:
  ◮ Pros: easily scale to very high-dimensional datasets, computationally simple and fast, and independent of the classification algorithm. FS is performed only once, after which different classifiers can be evaluated.
  ◮ Cons: ignore the interaction with the classifier; often univariate or low-variate, i.e. each feature is considered separately, ignoring feature dependencies.
◮ Wrapper methods:
  ◮ Pros: take into account the learning algorithm and feature dependencies.
  ◮ Cons: higher risk of overfitting than filters and very computationally intensive.
◮ Embedded methods:
  ◮ Pros: less computationally intensive than wrapper methods.
  ◮ Cons: specific to a learning machine.

Principal component analysis (PCA)

◮ One of the most popular methods for linear dimensionality reduction.
◮ It projects the data from the original space into a lower dimensional space in an unsupervised manner.
◮ New axes (called principal components (PC)) are created as linear combinations of the original ones.
◮ PC1 is the axis along which there is the greatest variation of the projected observations.
◮ PC1 is the vector a = [a_1, . . . , a_n] ∈ R^n such that

    z = a_1 x·1 + · · · + a_n x·n = a^T x

  has the largest variance.
◮ The optimal a is the eigenvector of Var[x] corresponding to the largest eigenvalue λ_1.
◮ PC2 is the axis orthogonal to PC1 that has the greatest variation of the projected observations; and so forth.
PCA example (n = 2)

[Figure: two-dimensional PCA example]

PCA: the algorithm

Input: matrix X of size [N, n]:
1. X is normalized into X̃ such that each column X̃[, i], i = 1, . . . , n, has mean 0 and variance 1.
2. Singular Value Decomposition (SVD) of X̃:

    X̃ = U D V^T

   where U ([N, N]) is orthogonal, D ([N, n]) is rectangular diagonal with diagonal elements d_1 ≥ d_2 ≥ · · · ≥ d_n, and V ([n, n]) is orthogonal.
3. X̃ is transformed into Z = X̃ V = U D, where each column of Z ([N, n]) is a linear combination of the original features and its importance is diminishing.
4. The first h < n columns of Z (aka eigen-features) are chosen to represent the dataset.
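A minimal base-R sketch of the steps above on simulated data (illustrative, not the course script); prcomp is used only as a cross-check and the sizes N, n are arbitrary.

## illustrative simulated data
set.seed(0)
N <- 100; n <- 5
X <- matrix(rnorm(N * n), N, n)

Xtilde <- scale(X)                 ## step 1: zero mean, unit variance columns
S <- svd(Xtilde)                   ## step 2: Xtilde = U D V^T
Z <- Xtilde %*% S$v                ## step 3: principal components Z = Xtilde V = U D

h <- 2
Zh <- Z[, 1:h]                     ## step 4: keep the first h eigen-features

## cross-check against prcomp (columns may differ only by sign)
P <- prcomp(X, center = TRUE, scale. = TRUE)
print(max(abs(abs(Z) - abs(P$x))))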

How many PCs?

1. Fix a threshold on the proportion of variance to be explained by the PCs, e.g. choose h PCs such that

    (λ_1 + · · · + λ_h) / (Σ_{j=1}^n λ_j) ≥ α

   where λ_j = d_j^2 / (N − 1) is the jth largest eigenvalue and is equal to the variance of the jth component.
2. Scree plot: decreasing plot of λ_j as a function of j. Choose the value of h corresponding to a knee in the curve.
3. Selection by cross-validation.

R code

EPCA=array(NA,c(N,n))  ## squared leave-one-out errors for each number of PCs
## LEAVE-ONE-OUT loop
for (i in 1:N){
  ## normalization of the input training set
  Xtr=scale(X[-i,])
  Ytr=Y[-i]
  ## normalization of the input test point
  Xts=(X[i,]-attr(Xtr,"scaled:center"))/attr(Xtr,"scaled:scale")
  S=svd(Xtr)
  ## PC loop
  for (h in 1:n){
    V=S$v
    Vh=array(V[,1:h],c(n,h))
    Zh=Xtr%*%Vh
    Zi=Xts%*%Vh
    YhatPCAi=pred("knn",Zh,Ytr,Zi,class=FALSE)  ## pred() is the course's prediction wrapper
    EPCA[i,h]=(Y[i]-YhatPCAi)^2
  }
}
hbest=which.min(apply(EPCA,2,mean))

See also script FeatureSel/fs_pca.R
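A self-contained sketch of the variance-proportion criterion and the scree plot above (illustrative simulated data; the threshold alpha = 0.9 is an arbitrary choice).

## illustrative simulated data
set.seed(0)
N <- 100; n <- 5
X <- matrix(rnorm(N * n), N, n)
S <- svd(scale(X))

lambda <- S$d^2 / (N - 1)            ## lambda_j = d_j^2/(N-1): variance of the jth PC
prop <- cumsum(lambda) / sum(lambda) ## proportion of variance explained by the first h PCs

alpha <- 0.9                         ## illustrative threshold
h <- which(prop >= alpha)[1]         ## smallest h explaining at least alpha
cat("selected h =", h, "\n")

plot(lambda, type = "b", xlab = "j", ylab = "lambda_j", main = "Scree plot")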
Clustering (unsupervised learning)

◮ Returns groups of features or observations with similar patterns (e.g. patterns of gene expression in microarray data).
◮ Requires a distance function between variables and between clusters.
◮ Nearest neighbour clustering: the number of clusters is fixed in advance, then each variable is assigned to a cluster. Examples: Self Organizing Maps (SOM) and K-means.
◮ Agglomerative clustering: bottom-up methods where clusters are built by successively merging variables and clusters. An example is hierarchical clustering.

Hierarchical clustering

◮ It begins by considering all the observations as separate clusters and starts by putting together the two samples that are nearest to each other.
◮ In subsequent stages clusters themselves can also be merged.
◮ A measure of dissimilarity between sets of observations is required.
◮ It is based on an appropriate metric (a measure of distance between pairs of observations), and a linkage criterion which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets.
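A small base-R illustration (simulated data; the Euclidean metric and average linkage are arbitrary choices, not prescriptions of the slides): hierarchical clustering of the columns (features) of a data matrix, followed by a cut of the dendrogram.

## illustrative simulated data: two groups of strongly correlated features
set.seed(1)
N <- 50
z1 <- rnorm(N); z2 <- rnorm(N)
X <- cbind(z1 + 0.1*rnorm(N), z1 + 0.1*rnorm(N),
           z2 + 0.1*rnorm(N), z2 + 0.1*rnorm(N))
colnames(X) <- paste0("x", 1:4)

D  <- dist(t(scale(X)))               ## distance between features (columns)
hc <- hclust(D, method = "average")   ## agglomerative clustering with average linkage
plot(hc)                              ## dendrogram
print(cutree(hc, k = 2))              ## assign each feature to one of 2 clusters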

Dendrogram

The output of hierarchical clustering is a dendrogram, a tree diagram used to illustrate the arrangement of the clusters.

Ranking

◮ Assesses the importance (or relevance) of each variable with respect to the output by using a univariate measure. Supervised techniques of complexity O(n).
◮ Measures of relevance:
  ◮ Pearson correlation (the greater, the more relevant), which assumes linearity;
  ◮ in case of binary classification, the significance p-value of a hypothesis test (the lower, the more relevant), which aims at detecting the features that split the dataset well. Parametric (t-test) and nonparametric (Wilcoxon) tests have been proposed in the literature;
  ◮ mutual information (the greater, the more relevant).
◮ After the univariate assessment the method ranks the variables in decreasing order of relevance.
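An illustrative ranking in base R on simulated binary-classification data (data and variable names are made up): features are scored by absolute Pearson correlation and by t-test p-value, then sorted by relevance.

## illustrative simulated data: only x1 and x2 are relevant
set.seed(2)
N <- 100; n <- 20
X <- matrix(rnorm(N * n), N, n)
Y <- factor(ifelse(X[, 1] + 0.5 * X[, 2] + rnorm(N) > 0, 1, 0))

## absolute Pearson correlation with the output: the greater, the more relevant
score.cor <- abs(cor(X, as.numeric(Y)))
rank.cor  <- order(score.cor, decreasing = TRUE)

## t-test p-value for each feature: the lower, the more relevant
pval   <- apply(X, 2, function(x) t.test(x ~ Y)$p.value)
rank.t <- order(pval)

print(rank.cor[1:5])
print(rank.t[1:5])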
Ranking methods (II)

◮ (+): fast (complexity O(n)), intuitive output and easy to understand.
◮ (−): they disregard redundancies and higher order interactions between variables (e.g. genes).
◮ (−): the best k individual features do not necessarily constitute the best k-variate vector.

Wrapping search

◮ Search in the space W = {0, 1}^n, where w ∈ W is such that

    w[j] = 0 if input j does NOT belong to the set of features
    w[j] = 1 if input j belongs to the set of features

◮ We look for the optimal w* ∈ {0, 1}^n such that

    w* = arg min_{w ∈ W} MISE_w

  where MISE_w is the generalization error of the model based on w.
◮ The number of vectors in W is equal to 2^n.
◮ For moderately large n, exhaustive search is no longer possible.

Wrapping search strategies

Greedy methods evaluate Σ_{i=0}^{n−1} (n − i) = n(n+1)/2 (i.e. O(n^2)) subsets by either adding or deleting one variable at a time. A minimal forward-selection sketch is given after this section.
1. Forward selection: starts with no variables and incorporates features one at a time. The first input is the one with the lowest generalization error; the second is the one that, together with the first, has the lowest error, and so on. R script FeatureSel/fs_wrap.R.
2. Backward selection: works in the opposite direction of the forward approach by progressively removing features from the full set. At each step, the input to be removed is the one whose absence causes the lowest increase (or highest decrease) of the generalisation error.
3. Stepwise selection: combines the previous two techniques by testing, for each set of variables, first the removal of features belonging to the set, then the addition of variables not in the set.

Wrapper limitation

◮ Wrapper techniques search for the best subset of features by performing a large number of comparisons and selections.
◮ Despite the use of validation procedures, it is highly possible that high correlations or low prediction errors are found only due to chance.
◮ A bad practice consists in using the same set of observations to select the feature set and to assess the generalization accuracy of the classifier (see external cross-validation).
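The forward-selection sketch announced above (illustrative; it is not the course's fs_wrap.R script): the leave-one-out error of a linear model is used as the evaluation function on simulated data.

## illustrative simulated data: Y depends on x1 and x3 only
set.seed(3)
N <- 60; n <- 8
X <- matrix(rnorm(N * n), N, n)
colnames(X) <- paste0("x", 1:n)
Y <- X[, 1] - 2 * X[, 3] + rnorm(N, sd = 0.5)

## leave-one-out MSE of a linear model using the feature subset 'sel'
loo.mse <- function(sel) {
  e <- sapply(1:N, function(i) {
    D <- data.frame(Y = Y[-i], X[-i, sel, drop = FALSE])
    m <- lm(Y ~ ., data = D)
    Y[i] - predict(m, newdata = data.frame(X[i, sel, drop = FALSE]))
  })
  mean(e^2)
}

selected <- integer(0)
for (step in 1:3) {                     ## greedily add (at most) 3 features
  cand <- setdiff(1:n, selected)
  errs <- sapply(cand, function(j) loo.mse(c(selected, j)))
  selected <- c(selected, cand[which.min(errs)])
  cat("step", step, ": selected =", selected,
      " LOO MSE =", round(min(errs), 3), "\n")
}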
External cross-validation

Internal cross-validation: an assessment process where the entire available dataset is used to select the feature set.
This is a wrong procedure since it can induce a selection bias (i.e. a too optimistic estimation of the generalization error, due to testing on samples already considered in the feature selection process).
External cross-validation consists of:
◮ Leave out a single test fold of the data set.
◮ Use the remaining training folds for selecting both the variables and the model.
◮ Evaluate the generalization error (of both model and feature set) on the test fold.

Assessment by permutation

◮ If no additional data are available, an alternative consists in estimating how often random data would generate correlation or classification accuracy of the magnitude obtained in the original analysis.
◮ Permutation testing: repeat the procedure several times with data where the dependency between the input and the output has been artificially removed (for example by permuting the ordering of the inputs). This provides a distribution of the accuracy in the case of random data, where no information is present in the data.
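A sketch of the permutation idea (illustrative; the selection statistic and the number of permutations are arbitrary choices): the best univariate correlation obtained on the real data is compared with its distribution under permuted outputs.

## illustrative simulated data: only the first feature is informative
set.seed(4)
N <- 50; n <- 200
X <- matrix(rnorm(N * n), N, n)
Y <- X[, 1] + rnorm(N)

stat <- function(y) max(abs(cor(X, y)))     ## selection statistic: best univariate correlation

obs  <- stat(Y)
perm <- replicate(500, stat(sample(Y)))     ## same statistic with the input/output link destroyed
pval <- mean(perm >= obs)
cat("observed =", round(obs, 2), " permutation p-value =", pval, "\n")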

Combining instead of selecting

◮ Instead of choosing one particular FS method, different FS methods can be combined using ensemble approaches.
◮ Since there is no single optimal feature selection technique, and since more than one subset of features may discriminate the data equally well, model combination approaches have been adapted to improve the robustness and stability of the final, discriminative methods.
◮ Ensemble techniques include averaging over multiple feature subsets.
◮ The use of ensemble approaches requires additional computational resources.

Shrinkage methods

◮ Improve an estimator by adding constraints to reduce its variance.
◮ Ridge regression is a shrinkage method applied to least squares regression:

    β̂_r = arg min_b { Σ_{i=1}^N (y_i − x_i^T b)^2 + λ Σ_{j=1}^p b_j^2 }
        = arg min_b { (Y − Xb)^T (Y − Xb) + λ b^T b }

  where λ > 0 is a complexity parameter that controls the amount of shrinkage: the larger the value of λ, the stronger the constraint.
Ridge regression

◮ An equivalent way to write the ridge problem is

    β̂_r = arg min_b Σ_{i=1}^N (y_i − x_i^T b)^2,   subject to Σ_{j=1}^p b_j^2 ≤ L

  where there is a one-to-one correspondence between the parameter λ and L.
◮ It can be shown that the ridge regression solution is

    β̂_r = (X^T X + λ I_p)^{−1} X^T Y

  where I_p is the [p, p] identity matrix.

R code

## set of lambda values
LAM=seq(1,100,by=5)
E<-array(NA,c(N,length(LAM)))
## LEAVE-ONE-OUT loop
for (i in 1:N){
  Xtr=cbind(numeric(N-1)+1,X[-i,])  ## add the intercept column
  Ytr=Y[-i]
  Xts=c(1,X[i,])
  cnt=1
  for (lam in LAM){
    betahat=solve(t(Xtr)%*%Xtr+lam*diag(n+1))%*%t(Xtr)%*%Ytr  ## ridge solution
    Yhati=Xts%*%betahat
    E[i,cnt]=(Y[i]-Yhati)^2
    cnt=cnt+1
  }
}
lambest=LAM[which.min(apply(E,2,mean))]  ## lambda with the lowest leave-one-out error
XX=cbind(numeric(N)+1,X)
betahat=solve(t(XX)%*%XX+lambest*diag(n+1))%*%t(XX)%*%Y
## indices of the 4 largest coefficients in absolute value (shifted by 1, so the intercept is 0)
print(sort(abs(betahat),decreasing=TRUE,index.return=TRUE)$ix[1:4]-1)

See also script FeatureSel/fs_ridge.R

Lasso

◮ The lasso estimate of the linear parameters is returned by

    β̂_l = arg min_b Σ_{i=1}^N (y_i − x_i^T b)^2,   subject to Σ_{j=1}^p |b_j| ≤ L

◮ The 1-norm penalty of the lasso approach makes the solution nonlinear and requires a quadratic programming algorithm.
◮ Note that if L > Σ_{j=1}^p |β̂_j|, where β̂ is the least-squares estimate, the lasso returns the common least-squares solution.
◮ The lasso returns sparser solutions (more coefficients forced to zero).
◮ How to choose the penalty factor L? In practice we have recourse to cross-validation strategies (see the sketch below).

Ridge regression vs. lasso

[Figure: geometric comparison of the ridge (circular) and lasso (diamond-shaped) constraint regions; from www.quora.com]
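A hedged sketch of lasso feature selection with the glmnet package (assumed to be installed; it is not part of the course scripts): the penalty is chosen by cross-validation and the non-zero coefficients identify the selected features.

library(glmnet)                       ## assumption: the glmnet package is installed

## illustrative simulated data: only x1 and x2 are relevant
set.seed(5)
N <- 80; n <- 30
X <- matrix(rnorm(N * n), N, n)
Y <- X[, 1] - 2 * X[, 2] + rnorm(N)

cv <- cv.glmnet(X, Y, alpha = 1)      ## alpha = 1: lasso; penalty chosen by cross-validation
b  <- as.numeric(coef(cv, s = "lambda.min"))   ## coefficients at the selected penalty (intercept first)
print(which(b[-1] != 0))              ## indices of the selected (non-zero) features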
Dual linear formulation

Consider a linear regression problem with n >> N. The least-squares solution is the [p, 1] parameter vector

    β̂ = (X^T X)^{−1} X^T y

and the prediction for a new [p, 1] vector x is returned by

    ŷ = β̂^T x = ⟨β̂, x⟩

The dual formulation is

    β̂ = (X^T X)^{−1} X^T y = X^T X (X^T X)^{−2} X^T y = X^T α̂ = Σ_{i=1}^N α̂_i x_i

where α̂ = X (X^T X)^{−2} X^T y is a [N, 1] vector and x_i is the [p, 1] vector which represents the ith observation.

Dual ridge regression formulation

Since

    (X^T X + λ I_p)^{−1} X^T = X^T (X X^T + λ I_N)^{−1}

where I_N is the identity matrix of size N, the dual formulation is

    β̂ = X^T (X X^T + λ I_N)^{−1} y = X^T α̂ = Σ_{i=1}^N α̂_i x_i

where

    α̂ = (K + λ I_N)^{−1} Y

is the [N, 1] vector of dual parameters and K = X X^T is the kernel or Gram [N, N] matrix.
For a query point x_q the prediction is

    ŷ = x_q^T X^T α̂ = x_q^T X^T (K + λ I_N)^{−1} Y

where x_q^T X^T is the [1, N] vector containing all dot products between the query point and the training samples.
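A numerical check of the identity above (illustrative simulated data with many more features than observations): the primal and dual ridge predictions coincide, but the dual only requires solving an N x N system.

## illustrative simulated data with n >> N
set.seed(6)
N <- 20; n <- 200
X  <- matrix(rnorm(N * n), N, n)
Y  <- X[, 1] + rnorm(N, sd = 0.1)
xq <- rnorm(n)                          ## query point
lambda <- 1

## primal: (X^T X + lambda I_n)^{-1} X^T Y  -- an n x n system
beta.primal <- solve(t(X) %*% X + lambda * diag(n), t(X) %*% Y)
yhat.primal <- sum(xq * beta.primal)

## dual: alpha = (K + lambda I_N)^{-1} Y with K = X X^T  -- only an N x N system
K     <- X %*% t(X)
alpha <- solve(K + lambda * diag(N), Y)
yhat.dual <- as.numeric(t(xq) %*% t(X) %*% alpha)

cat(yhat.primal, yhat.dual, "\n")       ## identical up to numerical error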

Dot product

◮ The dot product (or scalar product) is an operation which takes two real vectors and returns a real-valued scalar quantity.
◮ The dot product of two [n, 1] vectors x_i = [x_i1, x_i2, . . . , x_in]^T and x_k = [x_k1, x_k2, . . . , x_kn]^T is defined as:

    ⟨x_i, x_k⟩ = Σ_{j=1}^n x_ij x_kj = x_i^T x_k

◮ This can be considered as a similarity score between vectors.

Kernel methods

◮ Many learning algorithms, such as the perceptron, the support vector machine (SVM) and PCA, process data only in a linear manner through inner products.
◮ They address the problem of dimensionality n >> N by solving a dual problem in a space of dimension N.
◮ Kernel methods generalize the notion of inner product by adopting a user-specified kernel function, i.e., a similarity function over pairs of data points in raw representation.
◮ Kernel functions enable them to operate in a high-dimensional, implicit feature space
  1. without computing the coordinates of the data in the high dimensional space,
  2. by directly computing the inner products between all transformed pairs of observations.
Dual nonlinear regression

Suppose we apply a transformation Φ : x ∈ R^n → Φ(x) ∈ R^M to the inputs, projecting the original n dimensional space into an M dimensional space.
The prediction for an input x ∈ R^n would be given by

    ŷ(x) = k (K + λ I_N)^{−1} Y

where

    K_{i,j} = ⟨Φ(x_i), Φ(x_j)⟩,   k_i = ⟨Φ(x_i), Φ(x)⟩

Kernel trick

These inner products can be computed efficiently without explicitly computing the mapping Φ(·).

Kernel function

◮ A kernel function is a function κ that for all x, z ∈ X satisfies

    κ(x, z) = ⟨Φ(x), Φ(z)⟩

  where Φ is a mapping from X to a feature space F of dimension M.
◮ For instance, if n = 2, x = [x_1, x_2], z = [z_1, z_2],

    κ(x, z) = ⟨x, z⟩^2 = (x_1 z_1 + x_2 z_2)^2 = ⟨Φ(x), Φ(z)⟩

  where

    Φ : x = [x_1, x_2] → Φ(x) = [x_1^2, x_2^2, √2 x_1 x_2] ∈ F

◮ Kernels de-couple the specification of the algorithm from the specification of the feature space.
◮ Kernels give a way to compute dot products in some feature space without even knowing what this space is and what the function Φ is.

Kernel function (II)

◮ κ(x, z) = (1 + x^T z)^2: if n = 2 it corresponds to a transformation to an M = 6 dimensional space

    Φ(x_1, x_2) = (1, x_1^2, x_2^2, √2 x_1, √2 x_2, √2 x_1 x_2)

◮ A Gaussian kernel κ(x, z) = exp(−γ ||x − z||^2) corresponds to a transformation to an infinite-dimensional space.
◮ Theoretically, a Gram matrix must be positive semi-definite (PSD). Empirically, for machine learning heuristics, choices of a function κ that do not satisfy the PSD condition may still perform reasonably if κ at least approximates the intuitive idea of similarity.
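A quick numerical check (illustrative) that the polynomial kernel (1 + x^T z)^2 above equals the inner product of the explicit 6-dimensional maps Φ(x) and Φ(z) for n = 2.

phi <- function(x) c(1, x[1]^2, x[2]^2, sqrt(2)*x[1], sqrt(2)*x[2], sqrt(2)*x[1]*x[2])

## two arbitrary points in the original 2-dimensional space
set.seed(7)
x <- rnorm(2); z <- rnorm(2)

k.explicit <- sum(phi(x) * phi(z))       ## <Phi(x), Phi(z)> computed in the 6-dimensional space
k.implicit <- (1 + sum(x * z))^2         ## kernel evaluated in the original 2-dimensional space

cat(k.explicit, k.implicit, "\n")        ## the two values coincide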
Kernel methods (II)

◮ They aim to take advantage of high dimensional representations without actually having to work in the high dimensional space.
◮ Sometimes it is possible to compute dot products in high dimensional feature spaces without having to explicitly carry out the mapping into these spaces.
◮ Kernel trick: given any algorithm that can be expressed solely in terms of dot products, the kernel trick allows us to construct different nonlinear versions of it.
◮ However, the accuracy of a kernel method very strongly depends on the choice of the kernel.

Notions of entropy (continuous case)

Consider a continuous r.v. y. The (differential) entropy of y is defined by

    H(y) = − ∫ log(p(y)) p(y) dy = E_y[− log(p(y))]

knowing that 0 log 0 = 0.
◮ Entropy is a functional of the distribution of y.
◮ Entropy is a measure of the predictability of a r.v. y. The higher the entropy, the less reliable are our predictions about y.
◮ For a scalar normal r.v. y ∼ N(0, σ^2)

    H(y) = (1/2) (1 + ln 2πσ^2) = (1/2) ln(2πeσ^2)
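A small Monte Carlo check (illustrative) of the closed-form entropy of a normal r.v.: the sample average of −log p(y) approaches (1/2) ln(2πeσ^2).

## illustrative Monte Carlo check
set.seed(8)
sigma <- 2
y <- rnorm(1e6, mean = 0, sd = sigma)

H.mc     <- mean(-dnorm(y, mean = 0, sd = sigma, log = TRUE))  ## estimate of E[-log p(y)]
H.theory <- 0.5 * log(2 * pi * exp(1) * sigma^2)

cat(H.mc, H.theory, "\n")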

Conditional entropy

Consider two continuous r.v.s x and y and their joint density p(x, y).
The conditional entropy is defined as

    H(y|x) = − ∫∫ log(p(y|x)) p(x, y) dx dy = E_{x,y}[− log(p(y|x))]
           = E_{x,y}[log(1/p(y|x))] = E_x[H(y|x)]

This quantity quantifies the remaining uncertainty of y once x is known.

Some properties

◮ for continuous r.v.s the differential entropy may be negative (logarithmic scale)
◮ in general H(y|x) ≠ H(x|y)
◮ conditioning reduces entropy:

    H(y|x) ≤ H(y)

  with equality if x and y are independent, i.e. x ⊥⊥ y
◮ H(y) − H(y|x) = H(x) − H(x|y)
Mutual information of two vars (II)

◮ Given two random variables x and y, their mutual information is defined in terms of their marginal probability density functions p_x(x), p_y(y) and the joint p_(x,y)(x, y):

    I(x; y) = ∫∫ log( p(x, y) / (p(x) p(y)) ) p(x, y) dx dy = H(y) − H(y|x) = H(x) − H(x|y)

  with the convention that 0 log(0/0) = 0.

Mutual information in the normal case

Let (x, y) be a normally distributed random vector with correlation coefficient ρ.
The mutual information between x and y is given by

    I(x; y) = − (1/2) log(1 − ρ^2)

Equivalently the normal correlation coefficient can be written as

    ρ = sqrt(1 − exp(−2 I(x; y)))

Note that ρ = 0 when I(x; y) = 0 or, equivalently, x ⊥⊥ y.
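An illustrative check of I(x; y) = −(1/2) log(1 − ρ^2) on simulated bivariate normal data, plugging the sample correlation into the closed form.

## illustrative simulated bivariate normal data with correlation rho
set.seed(9)
N <- 1e5; rho <- 0.8
x <- rnorm(N)
y <- rho * x + sqrt(1 - rho^2) * rnorm(N)

I.hat    <- -0.5 * log(1 - cor(x, y)^2)        ## MI estimated from the sample correlation
I.theory <- -0.5 * log(1 - rho^2)
cat(I.hat, I.theory, "\n")

## inverse relation: recover rho from the mutual information
cat(sqrt(1 - exp(-2 * I.theory)), "\n")        ## equals rho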

Mutual information of two vars

◮ Note that if x and y are independent, then H(y|x) = H(y).
◮ Mutual information

    I(x; y) = H(y) − H(y|x) = H(x) − H(x|y)

  is one of the most widely used measures of the dependency of variables.
◮ It is a measure of the amount of information that one random variable contains about another random variable.
◮ It can also be considered as the distance from independence between the two variables. Indeed if x and y are independent, I(x; y) = 0.
◮ This quantity is always non negative and zero if and only if the two variables are stochastically independent.

Conditional mutual information

◮ Consider three r.v.s x, y and z. The conditional mutual information is defined by

    I(y; x|z) = H(y|z) − H(y|x, z)

◮ The conditional mutual information is null iff x and y are conditionally independent given z.
◮ For a larger number n of variables X = {x_1, . . . , x_n} a chain rule holds:

    I(X; y) = I(X_{−i}; y|x_i) + I(x_i; y) = I(x_i; y|X_{−i}) + I(X_{−i}; y),   i = 1, . . . , n

  This means that for n = 2

    I({x_1, x_2}; y) = I(x_2; y|x_1) + I(x_1; y) = I(x_1; y|x_2) + I(x_2; y)
Strong and weak relevance

Let us consider a set X of n input variables and a target y.
◮ A variable x_i is strongly relevant to the target y if

    I(X_{−i}; y) < I(X; y)

  i.e. if it carries information about y that no other variable carries.
◮ A variable is weakly relevant to the target y if it is not strongly relevant and

    ∃ S ⊆ X_{−i} : I(S; y) < I({x_i, S}; y)

  i.e. there exists a certain context S in which it carries information about the target.

Strongly and weakly relevant variables

◮ Strong relevance indicates that the feature is always necessary for an optimal subset.
◮ Weak relevance suggests that the feature is not always necessary but may become necessary under certain conditions.
◮ Irrelevance indicates that the feature is not necessary at all.

Example: Consider a learning problem where n = 4, x_3 = −x_2 and

    y = 1 + w if x_1 + x_2 > 0,   y = 0 otherwise

Which variables are strongly relevant, weakly relevant and irrelevant?

Markov blanket

Let us consider a set X of n r.v.s, a target variable y and a subset M_y ⊂ X. The subset M_y is said to be a Markov blanket of y, with y ∉ M_y, iff

    I(y; X_{−(M_y)} | M_y) = 0

Theorem (Total conditioning)

    x ∈ M_y ⇔ I(x; y | X_{−(x,y)}) > 0

Definition (Relevance)

The relevance of x_2 to y given x_1 is the conditional mutual information

    I({x_1, x_2}; y) − I(x_1; y) = I(x_2; y|x_1)

We can define x_1 as the context and consider the relevance of a variable x_2 as context-dependent.
For an empty context, the relevance boils down to the mutual information I(x_2; y) of x_2 to y.
Interaction

Note that since

    I({x_1, x_2}; y) = I(x_2; y|x_1) + I(x_1; y) = I(x_1; y|x_2) + I(x_2; y)

we have

    I(x_2; y|x_1) = I(x_2; y) − I(x_1; y) + I(x_1; y|x_2)

It follows that

    I({x_1, x_2}; y) = I(x_1; y) + I(x_2; y) − [I(x_1; y) − I(x_1; y|x_2)]

where the bracketed term is the interaction.
Negative interaction: complementary variables, i.e. the variables together have more information than the sum of the univariate informations (XOR).
Positive interaction: redundant variables.

Negative interaction (XOR)

[Figure: XOR configuration of the two classes in the (x_1, x_2) plane]

    I(x_1; y) = 0,   I(x_2; y) = 0   but   I(x_1; y|x_2) > 0
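A discrete simulation of the XOR case (illustrative; empirical entropies in nats are computed from contingency tables in base R): each input alone carries no information about y, but conditioned on the other it does.

## empirical entropy (in nats) of one or more discrete variables
H <- function(...) {
  p <- prop.table(table(...))
  -sum(p[p > 0] * log(p[p > 0]))
}

## illustrative simulated XOR data
set.seed(10)
N  <- 1e5
x1 <- sample(0:1, N, replace = TRUE)
x2 <- sample(0:1, N, replace = TRUE)
y  <- xor(x1, x2) * 1

I.x1 <- H(x1) + H(y) - H(x1, y)            ## I(x1; y), approximately 0
I.x2 <- H(x2) + H(y) - H(x2, y)            ## I(x2; y), approximately 0
I.x1.given.x2 <- H(x1, x2) + H(y, x2) - H(x1, y, x2) - H(x2)   ## I(x1; y | x2), approximately log 2
cat(I.x1, I.x2, I.x1.given.x2, "\n")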

Positive interaction

[Figure: positive interaction example in the (x_1, x_2) plane]

    I(x_1; y) > 0,   I(x_2; y) > 0   but   I(x_1; y|x_2) = 0

Feature selection and mutual information

◮ Given an output target y and a set of input variables X = {x_1, . . . , x_n}, selecting the optimal subset of d variables boils down to

    X* = arg max_{X_S ⊂ X, |X_S| = d} I(X_S; y)

◮ The maximization can be done in a forward manner.
◮ Let X = {x_i}, i = 1, . . . , n be the whole set of variables and X_S the current set of selected variables. The task of adding a variable x* ∈ X − X_S can be addressed by solving

    x* = arg max_{x_k ∈ X − X_S} I({X_S, x_k}; y)

  This requires a multivariate estimation of the mutual information.
◮ Filter approaches rely on some low variate approximation.
The mRMR approach

The mRMR (minimum-Redundancy Maximum-Relevance) feature selection strategy approximates

    arg max_{x_k ∈ X − X_S} I({X_S, x_k}; y)

with

    x*_mRMR = arg max_{x_k ∈ X − X_S} [ I(x_k; y) − (1/m) Σ_{x_i ∈ X_S} I(x_i; x_k) ]

where m is the size of X_S.
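A minimal mRMR forward-selection sketch (illustrative; the mutual informations are approximated with the Gaussian formula −(1/2) log(1 − ρ^2), which is an assumption of this sketch rather than the course's estimator of choice).

## illustrative simulated data: x2 is redundant with x1, y depends on x1 and x3
set.seed(11)
N <- 200; n <- 10
X <- matrix(rnorm(N * n), N, n)
X[, 2] <- X[, 1] + 0.1 * rnorm(N)
Y <- X[, 1] + X[, 3] + rnorm(N, sd = 0.5)

mi <- function(a, b) -0.5 * log(1 - cor(a, b)^2)   ## Gaussian approximation of I(a; b) (assumption)

selected <- integer(0)
for (step in 1:3) {
  cand  <- setdiff(1:n, selected)
  score <- sapply(cand, function(k) {
    relevance  <- mi(X[, k], Y)
    redundancy <- if (length(selected) == 0) 0 else
                  mean(sapply(selected, function(i) mi(X[, i], X[, k])))
    relevance - redundancy                         ## mRMR criterion
  })
  selected <- c(selected, cand[which.max(score)])
  cat("step", step, ": selected =", selected, "\n")
}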
