
INFO-F-422: Statistical foundations of machine learning

Feature selection

Gianluca Bontempi
Machine Learning Group
Computer Science Department
mlg.ulb.ac.be

The learning procedure

[Diagram of the learning procedure: PHENOMENON, PROBLEM FORMULATION, EXPERIMENTAL DESIGN, RAW DATA, PREPROCESSING, DATA, MODEL GENERATION, PARAMETRIC IDENTIFICATION, MODEL VALIDATION, MODEL SELECTION, MODEL]

Feature selection problem

◮ ML algorithms degrade in accuracy when faced with many inputs (aka features) that are not necessary.
◮ Tasks with thousands of features: bioinformatics classification, where n (the number of genes for which expression is measured) may range from 6000 to 60000.
◮ The original learning techniques were not designed to cope with many irrelevant features.
◮ Feature selection: selecting some subset of the features to use as inputs.
◮ Using all features may negatively affect generalization, because of irrelevant/redundant features.
◮ Feature selection can be cast as a model selection problem.

Benefits and drawbacks of feature selection

There are many potential benefits of feature selection:
◮ facilitating data visualization and data understanding,
◮ reducing the measurement and storage requirements,
◮ reducing training and utilization times of the final model,
◮ defying the curse of dimensionality to improve prediction performance.

Drawbacks are:
◮ an additional layer of complexity: the search in the hypothesis space is augmented by another dimension, finding the optimal subset of relevant features,
◮ additional time for learning.
Neighbourhood and dimension n

◮ n dimensional space
◮ query point xq ∈ R^n
◮ unit volume around xq
◮ hypercube of edge length d < 1 which contains a portion V of the unit volume:

    d^n = V  ⇒  d = V^(1/n)

    V = {x ∈ R^n : |x_j| < d/2, j = 1, . . . , n}

  where d stands for a measure of locality.
◮ For instance, if V = 0.5 we have d = 0.7, 0.87, 0.98 for n = 2, 5, 50.

As n increases, the volume of the cubic neighbourhood decreases if the edge length d is fixed.

[Figure: unit hypercubes with edge d = 1/2: V = 1/2 for n = 1, V = 1/4 for n = 2, V = 1/8 for n = 3]
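A minimal R sketch (illustrative, not from the original slides) of the relation d = V^(1/n): the edge needed to capture half of the unit volume quickly approaches 1 as n grows, while a fixed edge captures a vanishing volume.

## edge length d of a hypercube capturing a fraction V of the unit volume: d = V^(1/n)
V <- 0.5
n <- c(1, 2, 5, 10, 50)
d <- V^(1/n)
print(round(d, 2))   ## 0.50 0.71 0.87 0.93 0.99

## conversely, with a fixed edge d = 1/2 the captured volume d^n collapses to zero
print((1/2)^n)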

Neighborhood volume vs. edge size

◮ given the neighbourhood volume V: the edge length increases for increasing n.
◮ given the edge d: the neighbourhood volume decreases for increasing n.

Locality and dimension n

◮ Let N points be uniformly distributed in a unit volume.
◮ The number of neighbours K in a volume V < 1 amounts to K = NV and is proportional to V.
◮ By increasing n (for a fixed edge d), V and K decrease and the observations become sparser.
◮ As n increases the amount of local data goes to zero: all data sets are sparse.
K vs. dimension n (fixed N and d)

If N and d are fixed, the number K of neighbours in V decreases by increasing n.
As n increases the amount of local data goes to zero: all data sets are sparse.

Number of observations vs. dimension n (fixed d and K)

If we want to preserve the same degree of locality (fixed edge d = 0.1) for increasing n, we need a much larger number N of observations.

Curse of dimensionality: other issues

◮ The error of the best model (e.g. the conditional variance in regression or the Bayes error in classification) decreases with n, but the mean integrated squared error of the models increases faster than linearly in n: in other terms, the best model is more and more accurate but it is more and more difficult to find it.
◮ In high dimensions, all datasets show multicollinearity.
◮ In high dimensions, the number of possible models to consider increases superexponentially in n.

Methods of feature selection

1. Filter methods: preprocessing methods that assess features ignoring the role of the learning algorithm. Examples: ranking, compression techniques (like PCA or clustering).
2. Wrapper methods: assess subsets of variables according to their usefulness to a given predictor; the search uses the learning algorithm as part of the evaluation function. Examples: stepwise methods in linear regression.
3. Embedded methods: selection is part of the learning procedure. Examples: classification trees, random forests, methods based on regularization (e.g. lasso) and representation learning (e.g. autoencoders).
Pros-cons analysis

◮ Filter methods:
  ◮ Pros: easily scale to very high-dimensional datasets, computationally simple and fast, and independent of the classification algorithm. FS is performed only once, after which different classifiers can be evaluated.
  ◮ Cons: ignore the interaction with the classifier; often univariate or low-variate, i.e. each feature is considered separately, ignoring feature dependencies.
◮ Wrapper methods:
  ◮ Pros: take into account the learning algorithm and feature dependencies.
  ◮ Cons: higher risk of overfitting than filters and very computationally intensive.
◮ Embedded methods:
  ◮ Pros: less computationally intensive than wrapper methods.
  ◮ Cons: specific to a learning machine.

Principal component analysis (PCA)

◮ One of the most popular methods for linear dimensionality reduction.
◮ It projects the data from the original space into a lower dimensional space in an unsupervised manner.
◮ New axes (called principal components (PC)) are created as linear combinations of the original ones.
◮ PC1 is the axis along which there is the greatest variation of the projected observations.
◮ PC1 is the vector a = [a_1, . . . , a_n] ∈ R^n such that

    z = a_1 x·1 + · · · + a_n x·n = a^T x

  has the largest variance.
◮ The optimal a is the eigenvector of Var[x] corresponding to the largest eigenvalue λ_1.
◮ PC2 is the axis orthogonal to PC1 that has the greatest variation of the projected observations; and so forth.
PCA example (n = 2)

[Figure: two-dimensional PCA example]

PCA: the algorithm

Input: matrix X of size [N, n]:
1. X is normalized into X̃ such that each column X̃[, i], i = 1, . . . , n, has mean 0 and variance 1.
2. Singular Value Decomposition (SVD) of X̃:

    X̃ = U D V^T

   where U ([N, N]) is orthogonal, D ([N, n]) is rectangular diagonal with diagonal elements d_1 ≥ d_2 ≥ · · · ≥ d_n, and V ([n, n]) is orthogonal.
3. X̃ is transformed into Z = X̃ V = U D, where each column of Z ([N, n]) is a linear combination of the original features and its importance is diminishing.
4. The first h < n columns of Z (aka eigen-features) are chosen to represent the dataset.
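A minimal base-R sketch of the steps above on simulated data (illustrative, not the course script); prcomp is used only as a cross-check and the sizes N, n are arbitrary.

## illustrative simulated data
set.seed(0)
N <- 100; n <- 5
X <- matrix(rnorm(N * n), N, n)

Xtilde <- scale(X)                 ## step 1: zero mean, unit variance columns
S <- svd(Xtilde)                   ## step 2: Xtilde = U D V^T
Z <- Xtilde %*% S$v                ## step 3: principal components Z = Xtilde V = U D

h <- 2
Zh <- Z[, 1:h]                     ## step 4: keep the first h eigen-features

## cross-check against prcomp (columns may differ only by sign)
P <- prcomp(X, center = TRUE, scale. = TRUE)
print(max(abs(abs(Z) - abs(P$x))))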

How many PCs?

1. Fix a threshold on the proportion of variance to be explained by the PCs, e.g. choose h PCs such that

    (λ_1 + · · · + λ_h) / (Σ_{j=1}^n λ_j) ≥ α

   where λ_j = d_j^2 / (N − 1) is the jth largest eigenvalue and is equal to the variance of the jth component.
2. Scree plot: decreasing plot of λ_j as a function of j. Choose the value of h corresponding to a knee in the curve.
3. Selection by cross-validation.

R code

EPCA=array(NA,c(N,n))  ## squared leave-one-out errors for each number of PCs
## LEAVE-ONE-OUT loop
for (i in 1:N){
  ## normalization of the input training set
  Xtr=scale(X[-i,])
  Ytr=Y[-i]
  ## normalization of the input test point
  Xts=(X[i,]-attr(Xtr,"scaled:center"))/attr(Xtr,"scaled:scale")
  S=svd(Xtr)
  ## PC loop
  for (h in 1:n){
    V=S$v
    Vh=array(V[,1:h],c(n,h))
    Zh=Xtr%*%Vh
    Zi=Xts%*%Vh
    YhatPCAi=pred("knn",Zh,Ytr,Zi,class=FALSE)  ## pred() is the course's prediction wrapper
    EPCA[i,h]=(Y[i]-YhatPCAi)^2
  }
}
hbest=which.min(apply(EPCA,2,mean))

See also script FeatureSel/fs_pca.R
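A self-contained sketch of the variance-proportion criterion and the scree plot above (illustrative simulated data; the threshold alpha = 0.9 is an arbitrary choice).

## illustrative simulated data
set.seed(0)
N <- 100; n <- 5
X <- matrix(rnorm(N * n), N, n)
S <- svd(scale(X))

lambda <- S$d^2 / (N - 1)            ## lambda_j = d_j^2/(N-1): variance of the jth PC
prop <- cumsum(lambda) / sum(lambda) ## proportion of variance explained by the first h PCs

alpha <- 0.9                         ## illustrative threshold
h <- which(prop >= alpha)[1]         ## smallest h explaining at least alpha
cat("selected h =", h, "\n")

plot(lambda, type = "b", xlab = "j", ylab = "lambda_j", main = "Scree plot")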
Clustering (unsupervised learning)

◮ Returns groups of features or observations with similar patterns (e.g. patterns of gene expression in microarray data).
◮ Requires a distance function between variables and between clusters.
◮ Nearest neighbour clustering: the number of clusters is fixed in advance, then each variable is assigned to a cluster. Examples: Self Organizing Maps (SOM) and K-means.
◮ Agglomerative clustering: bottom-up methods where clusters are built by successively merging variables and clusters. An example is hierarchical clustering.

Hierarchical clustering

◮ It begins by considering all the observations as separate clusters and starts by putting together the two samples that are nearest to each other.
◮ In subsequent stages clusters themselves can also be merged.
◮ A measure of dissimilarity between sets of observations is required.
◮ It is based on an appropriate metric (a measure of distance between pairs of observations), and a linkage criterion which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets.
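A small base-R illustration (simulated data; the Euclidean metric and average linkage are arbitrary choices, not prescriptions of the slides): hierarchical clustering of the columns (features) of a data matrix, followed by a cut of the dendrogram.

## illustrative simulated data: two groups of strongly correlated features
set.seed(1)
N <- 50
z1 <- rnorm(N); z2 <- rnorm(N)
X <- cbind(z1 + 0.1*rnorm(N), z1 + 0.1*rnorm(N),
           z2 + 0.1*rnorm(N), z2 + 0.1*rnorm(N))
colnames(X) <- paste0("x", 1:4)

D  <- dist(t(scale(X)))               ## distance between features (columns)
hc <- hclust(D, method = "average")   ## agglomerative clustering with average linkage
plot(hc)                              ## dendrogram
print(cutree(hc, k = 2))              ## assign each feature to one of 2 clusters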

Dendrogram

The output of hierarchical clustering is a dendrogram, a tree diagram used to illustrate the arrangement of the clusters.

Ranking

◮ Assesses the importance (or relevance) of each variable with respect to the output by using a univariate measure. Supervised techniques of complexity O(n).
◮ Measures of relevance:
  ◮ Pearson correlation (the greater, the more relevant), which assumes linearity;
  ◮ in case of binary classification, the significance p-value of a hypothesis test (the lower, the more relevant), which aims at detecting the features that split the dataset well. Parametric (t-test) and nonparametric (Wilcoxon) tests have been proposed in the literature;
  ◮ mutual information (the greater, the more relevant).
◮ After the univariate assessment the method ranks the variables in decreasing order of relevance.
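An illustrative ranking in base R on simulated binary-classification data (data and variable names are made up): features are scored by absolute Pearson correlation and by t-test p-value, then sorted by relevance.

## illustrative simulated data: only x1 and x2 are relevant
set.seed(2)
N <- 100; n <- 20
X <- matrix(rnorm(N * n), N, n)
Y <- factor(ifelse(X[, 1] + 0.5 * X[, 2] + rnorm(N) > 0, 1, 0))

## absolute Pearson correlation with the output: the greater, the more relevant
score.cor <- abs(cor(X, as.numeric(Y)))
rank.cor  <- order(score.cor, decreasing = TRUE)

## t-test p-value for each feature: the lower, the more relevant
pval   <- apply(X, 2, function(x) t.test(x ~ Y)$p.value)
rank.t <- order(pval)

print(rank.cor[1:5])
print(rank.t[1:5])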
Ranking methods (II)

◮ (+): fast (complexity O(n)), intuitive output and easy to understand.
◮ (−): they disregard redundancies and higher order interactions between variables (e.g. genes).
◮ (−): the best k individual features do not necessarily constitute the best k-variate vector.

Wrapping search

◮ Search in the space W = {0, 1}^n, where w ∈ W is such that

    w[j] = 0 if input j does NOT belong to the set of features
    w[j] = 1 if input j belongs to the set of features

◮ We look for the optimal w* ∈ {0, 1}^n such that

    w* = arg min_{w ∈ W} MISE_w

  where MISE_w is the generalization error of the model based on w.
◮ The number of vectors in W is equal to 2^n.
◮ For moderately large n, exhaustive search is no longer possible.

Wrapping search strategies

Greedy methods evaluate Σ_{i=0}^{n−1} (n − i) = n(n+1)/2 (i.e. O(n^2)) subsets by either adding or deleting one variable at a time. A minimal forward-selection sketch is given after this section.
1. Forward selection: starts with no variables and incorporates features one at a time. The first input is the one with the lowest generalization error; the second is the one that, together with the first, has the lowest error, and so on. R script FeatureSel/fs_wrap.R.
2. Backward selection: works in the opposite direction of the forward approach by progressively removing features from the full set. At each step, the input to be removed is the one whose absence causes the lowest increase (or highest decrease) of the generalisation error.
3. Stepwise selection: combines the previous two techniques by testing, for each set of variables, first the removal of features belonging to the set, then the addition of variables not in the set.

Wrapper limitation

◮ Wrapper techniques search for the best subset of features by performing a large number of comparisons and selections.
◮ Despite the use of validation procedures, it is highly possible that high correlations or low prediction errors are found only due to chance.
◮ A bad practice consists in using the same set of observations to select the feature set and to assess the generalization accuracy of the classifier (see external cross-validation).
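The forward-selection sketch announced above (illustrative; it is not the course's fs_wrap.R script): the leave-one-out error of a linear model is used as the evaluation function on simulated data.

## illustrative simulated data: Y depends on x1 and x3 only
set.seed(3)
N <- 60; n <- 8
X <- matrix(rnorm(N * n), N, n)
colnames(X) <- paste0("x", 1:n)
Y <- X[, 1] - 2 * X[, 3] + rnorm(N, sd = 0.5)

## leave-one-out MSE of a linear model using the feature subset 'sel'
loo.mse <- function(sel) {
  e <- sapply(1:N, function(i) {
    D <- data.frame(Y = Y[-i], X[-i, sel, drop = FALSE])
    m <- lm(Y ~ ., data = D)
    Y[i] - predict(m, newdata = data.frame(X[i, sel, drop = FALSE]))
  })
  mean(e^2)
}

selected <- integer(0)
for (step in 1:3) {                     ## greedily add (at most) 3 features
  cand <- setdiff(1:n, selected)
  errs <- sapply(cand, function(j) loo.mse(c(selected, j)))
  selected <- c(selected, cand[which.min(errs)])
  cat("step", step, ": selected =", selected,
      " LOO MSE =", round(min(errs), 3), "\n")
}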
External cross-validation

Internal cross-validation: an assessment process where the entire available dataset is used to select the feature set.
This is a wrong procedure since it can induce a selection bias (i.e. a too optimistic estimation of the generalization error, due to testing on samples already considered in the feature selection process).
External cross-validation consists of:
◮ Leave out a single test fold of the data set.
◮ Use the remaining training folds for selecting both the variables and the model.
◮ Evaluate the generalization error (of both model and feature set) on the test fold.

Assessment by permutation

◮ If no additional data are available, an alternative consists in estimating how often random data would generate correlation or classification accuracy of the magnitude obtained in the original analysis.
◮ Permutation testing: repeat the procedure several times with data where the dependency between the input and the output has been artificially removed (for example by permuting the ordering of the inputs). This provides a distribution of the accuracy in the case of random data, where no information is present in the data.
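A sketch of the permutation idea (illustrative; the selection statistic and the number of permutations are arbitrary choices): the best univariate correlation obtained on the real data is compared with its distribution under permuted outputs.

## illustrative simulated data: only the first feature is informative
set.seed(4)
N <- 50; n <- 200
X <- matrix(rnorm(N * n), N, n)
Y <- X[, 1] + rnorm(N)

stat <- function(y) max(abs(cor(X, y)))     ## selection statistic: best univariate correlation

obs  <- stat(Y)
perm <- replicate(500, stat(sample(Y)))     ## same statistic with the input/output link destroyed
pval <- mean(perm >= obs)
cat("observed =", round(obs, 2), " permutation p-value =", pval, "\n")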

Combining instead of selecting

◮ Instead of choosing one particular FS method, different FS methods can be combined using ensemble approaches.
◮ Since there is no single optimal feature selection technique, and since more than one subset of features may discriminate the data equally well, model combination approaches have been adapted to improve the robustness and stability of the final, discriminative methods.
◮ Ensemble techniques include averaging over multiple feature subsets.
◮ The use of ensemble approaches requires additional computational resources.

Shrinkage methods

◮ Improve an estimator by adding constraints to reduce its variance.
◮ Ridge regression is a shrinkage method applied to least squares regression:

    β̂_r = arg min_b { Σ_{i=1}^N (y_i − x_i^T b)^2 + λ Σ_{j=1}^p b_j^2 }
        = arg min_b { (Y − Xb)^T (Y − Xb) + λ b^T b }

  where λ > 0 is a complexity parameter that controls the amount of shrinkage: the larger the value of λ, the stronger the constraint.
Ridge regression

◮ An equivalent way to write the ridge problem is

    β̂_r = arg min_b Σ_{i=1}^N (y_i − x_i^T b)^2,   subject to Σ_{j=1}^p b_j^2 ≤ L

  where there is a one-to-one correspondence between the parameter λ and L.
◮ It can be shown that the ridge regression solution is

    β̂_r = (X^T X + λ I_p)^{−1} X^T Y

  where I_p is the [p, p] identity matrix.

R code

## set of lambda values
LAM=seq(1,100,by=5)
E<-array(NA,c(N,length(LAM)))
## LEAVE-ONE-OUT loop
for (i in 1:N){
  Xtr=cbind(numeric(N-1)+1,X[-i,])  ## add the intercept column
  Ytr=Y[-i]
  Xts=c(1,X[i,])
  cnt=1
  for (lam in LAM){
    betahat=solve(t(Xtr)%*%Xtr+lam*diag(n+1))%*%t(Xtr)%*%Ytr  ## ridge solution
    Yhati=Xts%*%betahat
    E[i,cnt]=(Y[i]-Yhati)^2
    cnt=cnt+1
  }
}
lambest=LAM[which.min(apply(E,2,mean))]  ## lambda with the lowest leave-one-out error
XX=cbind(numeric(N)+1,X)
betahat=solve(t(XX)%*%XX+lambest*diag(n+1))%*%t(XX)%*%Y
## indices of the 4 largest coefficients in absolute value (shifted by 1, so the intercept is 0)
print(sort(abs(betahat),decreasing=TRUE,index.return=TRUE)$ix[1:4]-1)

See also script FeatureSel/fs_ridge.R

Lasso

◮ The lasso estimate of the linear parameters is returned by

    β̂_l = arg min_b Σ_{i=1}^N (y_i − x_i^T b)^2,   subject to Σ_{j=1}^p |b_j| ≤ L

◮ The 1-norm penalty of the lasso approach makes the solution nonlinear and requires a quadratic programming algorithm.
◮ Note that if L > Σ_{j=1}^p |β̂_j|, where β̂ is the least-squares estimate, the lasso returns the common least-squares solution.
◮ The lasso returns sparser solutions (more coefficients forced to zero).
◮ How to choose the penalty factor L? In practice we have recourse to cross-validation strategies (see the sketch below).

Ridge regression vs. lasso

[Figure: geometric comparison of the ridge (circular) and lasso (diamond-shaped) constraint regions; from www.quora.com]
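A hedged sketch of lasso feature selection with the glmnet package (assumed to be installed; it is not part of the course scripts): the penalty is chosen by cross-validation and the non-zero coefficients identify the selected features.

library(glmnet)                       ## assumption: the glmnet package is installed

## illustrative simulated data: only x1 and x2 are relevant
set.seed(5)
N <- 80; n <- 30
X <- matrix(rnorm(N * n), N, n)
Y <- X[, 1] - 2 * X[, 2] + rnorm(N)

cv <- cv.glmnet(X, Y, alpha = 1)      ## alpha = 1: lasso; penalty chosen by cross-validation
b  <- as.numeric(coef(cv, s = "lambda.min"))   ## coefficients at the selected penalty (intercept first)
print(which(b[-1] != 0))              ## indices of the selected (non-zero) features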
Dual linear formulation

Consider a linear regression problem with n >> N. The least-squares solution is the [p, 1] parameter vector

    β̂ = (X^T X)^{−1} X^T y

and the prediction for a new [p, 1] vector x is returned by

    ŷ = β̂^T x = ⟨β̂, x⟩

The dual formulation is

    β̂ = (X^T X)^{−1} X^T y = X^T X (X^T X)^{−2} X^T y = X^T α̂ = Σ_{i=1}^N α̂_i x_i

where α̂ = X (X^T X)^{−2} X^T y is a [N, 1] vector and x_i is the [p, 1] vector which represents the ith observation.

Dual ridge regression formulation

Since

    (X^T X + λ I_p)^{−1} X^T = X^T (X X^T + λ I_N)^{−1}

where I_N is the identity matrix of size N, the dual formulation is

    β̂ = X^T (X X^T + λ I_N)^{−1} y = X^T α̂ = Σ_{i=1}^N α̂_i x_i

where

    α̂ = (K + λ I_N)^{−1} Y

is the [N, 1] vector of dual parameters and K = X X^T is the kernel or Gram [N, N] matrix.
For a query point x_q the prediction is

    ŷ = x_q^T X^T α̂ = x_q^T X^T (K + λ I_N)^{−1} Y

where x_q^T X^T is the [1, N] vector containing all dot products between the query point and the training samples.
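A numerical check of the identity above (illustrative simulated data with many more features than observations): the primal and dual ridge predictions coincide, but the dual only requires solving an N x N system.

## illustrative simulated data with n >> N
set.seed(6)
N <- 20; n <- 200
X  <- matrix(rnorm(N * n), N, n)
Y  <- X[, 1] + rnorm(N, sd = 0.1)
xq <- rnorm(n)                          ## query point
lambda <- 1

## primal: (X^T X + lambda I_n)^{-1} X^T Y  -- an n x n system
beta.primal <- solve(t(X) %*% X + lambda * diag(n), t(X) %*% Y)
yhat.primal <- sum(xq * beta.primal)

## dual: alpha = (K + lambda I_N)^{-1} Y with K = X X^T  -- only an N x N system
K     <- X %*% t(X)
alpha <- solve(K + lambda * diag(N), Y)
yhat.dual <- as.numeric(t(xq) %*% t(X) %*% alpha)

cat(yhat.primal, yhat.dual, "\n")       ## identical up to numerical error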

Dot product

◮ The dot product (or scalar product) is an operation which takes two real vectors and returns a real-valued scalar quantity.
◮ The dot product of two [n, 1] vectors x_i = [x_i1, x_i2, . . . , x_in]^T and x_k = [x_k1, x_k2, . . . , x_kn]^T is defined as:

    ⟨x_i, x_k⟩ = Σ_{j=1}^n x_ij x_kj = x_i^T x_k

◮ This can be considered as a similarity score between vectors.

Kernel methods

◮ Many learning algorithms, such as the perceptron, the support vector machine (SVM) and PCA, process data only in a linear manner through inner products.
◮ They address the problem of dimensionality n >> N by solving a dual problem in a space of dimension N.
◮ Kernel methods generalize the notion of inner product by adopting a user-specified kernel function, i.e., a similarity function over pairs of data points in raw representation.
◮ Kernel functions enable them to operate in a high-dimensional, implicit feature space
  1. without computing the coordinates of the data in the high dimensional space,
  2. by directly computing the inner products between all transformed pairs of observations.
Dual nonlinear regression

Suppose we apply a transformation Φ : x ∈ R^n → Φ(x) ∈ R^M to the inputs, projecting the original n dimensional space into an M dimensional space.
The prediction for an input x ∈ R^n would be given by

    ŷ(x) = k (K + λ I_N)^{−1} Y

where

    K_{i,j} = ⟨Φ(x_i), Φ(x_j)⟩,   k_i = ⟨Φ(x_i), Φ(x)⟩

Kernel trick

These inner products can be computed efficiently without explicitly computing the mapping Φ(·).

Kernel function

◮ A kernel function is a function κ that for all x, z ∈ X satisfies

    κ(x, z) = ⟨Φ(x), Φ(z)⟩

  where Φ is a mapping from X to a feature space F of dimension M.
◮ For instance, if n = 2, x = [x_1, x_2], z = [z_1, z_2],

    κ(x, z) = ⟨x, z⟩^2 = (x_1 z_1 + x_2 z_2)^2 = ⟨Φ(x), Φ(z)⟩

  where

    Φ : x = [x_1, x_2] → Φ(x) = [x_1^2, x_2^2, √2 x_1 x_2] ∈ F

◮ Kernels de-couple the specification of the algorithm from the specification of the feature space.
◮ Kernels give a way to compute dot products in some feature space without even knowing what this space is and what the function Φ is.

Kernel function (II)

◮ κ(x, z) = (1 + x^T z)^2: if n = 2 it corresponds to a transformation to an M = 6 dimensional space

    Φ(x_1, x_2) = (1, x_1^2, x_2^2, √2 x_1, √2 x_2, √2 x_1 x_2)

◮ A Gaussian kernel κ(x, z) = exp(−γ ||x − z||^2) corresponds to a transformation to an infinite-dimensional space.
◮ Theoretically, a Gram matrix must be positive semi-definite (PSD). Empirically, for machine learning heuristics, choices of a function κ that do not satisfy the PSD condition may still perform reasonably if κ at least approximates the intuitive idea of similarity.
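A quick numerical check (illustrative) that the polynomial kernel (1 + x^T z)^2 above equals the inner product of the explicit 6-dimensional maps Φ(x) and Φ(z) for n = 2.

phi <- function(x) c(1, x[1]^2, x[2]^2, sqrt(2)*x[1], sqrt(2)*x[2], sqrt(2)*x[1]*x[2])

## two arbitrary points in the original 2-dimensional space
set.seed(7)
x <- rnorm(2); z <- rnorm(2)

k.explicit <- sum(phi(x) * phi(z))       ## <Phi(x), Phi(z)> computed in the 6-dimensional space
k.implicit <- (1 + sum(x * z))^2         ## kernel evaluated in the original 2-dimensional space

cat(k.explicit, k.implicit, "\n")        ## the two values coincide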
Kernel methods (II)

◮ They aim to take advantage of high dimensional representations without actually having to work in the high dimensional space.
◮ Sometimes it is possible to compute dot products in high dimensional feature spaces without having to explicitly carry out the mapping into these spaces.
◮ Kernel trick: given any algorithm that can be expressed solely in terms of dot products, the kernel trick allows us to construct different nonlinear versions of it.
◮ However, the accuracy of a kernel method very strongly depends on the choice of the kernel.

Notions of entropy (continuous case)

Consider a continuous r.v. y. The (differential) entropy of y is defined by

    H(y) = − ∫ log(p(y)) p(y) dy = E_y[− log(p(y))]

knowing that 0 log 0 = 0.
◮ Entropy is a functional of the distribution of y.
◮ Entropy is a measure of the predictability of a r.v. y. The higher the entropy, the less reliable are our predictions about y.
◮ For a scalar normal r.v. y ∼ N(0, σ^2)

    H(y) = (1/2) (1 + ln 2πσ^2) = (1/2) ln(2πeσ^2)
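A small Monte Carlo check (illustrative) of the closed-form entropy of a normal r.v.: the sample average of −log p(y) approaches (1/2) ln(2πeσ^2).

## illustrative Monte Carlo check
set.seed(8)
sigma <- 2
y <- rnorm(1e6, mean = 0, sd = sigma)

H.mc     <- mean(-dnorm(y, mean = 0, sd = sigma, log = TRUE))  ## estimate of E[-log p(y)]
H.theory <- 0.5 * log(2 * pi * exp(1) * sigma^2)

cat(H.mc, H.theory, "\n")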

Conditional entropy

Consider two continuous r.v.s x and y and their joint density p(x, y).
The conditional entropy is defined as

    H(y|x) = − ∫∫ log(p(y|x)) p(x, y) dx dy = E_{x,y}[− log(p(y|x))]
           = E_{x,y}[log(1/p(y|x))] = E_x[H(y|x)]

This quantity quantifies the remaining uncertainty of y once x is known.

Some properties

◮ for continuous r.v.s the differential entropy may be negative (logarithmic scale)
◮ in general H(y|x) ≠ H(x|y)
◮ conditioning reduces entropy:

    H(y|x) ≤ H(y)

  with equality if x and y are independent, i.e. x ⊥⊥ y
◮ H(y) − H(y|x) = H(x) − H(x|y)
Mutual information of two vars (II)

◮ Given two random variables x and y, their mutual information is defined in terms of their marginal probability density functions p_x(x), p_y(y) and the joint p_(x,y)(x, y):

    I(x; y) = ∫∫ log( p(x, y) / (p(x) p(y)) ) p(x, y) dx dy = H(y) − H(y|x) = H(x) − H(x|y)

  with the convention that 0 log(0/0) = 0.

Mutual information in the normal case

Let (x, y) be a normally distributed random vector with correlation coefficient ρ.
The mutual information between x and y is given by

    I(x; y) = − (1/2) log(1 − ρ^2)

Equivalently the normal correlation coefficient can be written as

    ρ = sqrt(1 − exp(−2 I(x; y)))

Note that ρ = 0 when I(x; y) = 0 or, equivalently, x ⊥⊥ y.
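An illustrative check of I(x; y) = −(1/2) log(1 − ρ^2) on simulated bivariate normal data, plugging the sample correlation into the closed form.

## illustrative simulated bivariate normal data with correlation rho
set.seed(9)
N <- 1e5; rho <- 0.8
x <- rnorm(N)
y <- rho * x + sqrt(1 - rho^2) * rnorm(N)

I.hat    <- -0.5 * log(1 - cor(x, y)^2)        ## MI estimated from the sample correlation
I.theory <- -0.5 * log(1 - rho^2)
cat(I.hat, I.theory, "\n")

## inverse relation: recover rho from the mutual information
cat(sqrt(1 - exp(-2 * I.theory)), "\n")        ## equals rho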

Mutual information of two vars

◮ Note that if x and y are independent, then H(y|x) = H(y).
◮ Mutual information

    I(x; y) = H(y) − H(y|x) = H(x) − H(x|y)

  is one of the most widely used measures of the dependency of variables.
◮ It is a measure of the amount of information that one random variable contains about another random variable.
◮ It can also be considered as the distance from independence between the two variables. Indeed if x and y are independent, I(x; y) = 0.
◮ This quantity is always non negative and zero if and only if the two variables are stochastically independent.

Conditional mutual information

◮ Consider three r.v.s x, y and z. The conditional mutual information is defined by

    I(y; x|z) = H(y|z) − H(y|x, z)

◮ The conditional mutual information is null iff x and y are conditionally independent given z.
◮ For a larger number n of variables X = {x_1, . . . , x_n} a chain rule holds:

    I(X; y) = I(X_{−i}; y|x_i) + I(x_i; y) = I(x_i; y|X_{−i}) + I(X_{−i}; y),   i = 1, . . . , n

  This means that for n = 2

    I({x_1, x_2}; y) = I(x_2; y|x_1) + I(x_1; y) = I(x_1; y|x_2) + I(x_2; y)
Strong and weak relevance

Let us consider a set X of n input variables and a target y.
◮ A variable x_i is strongly relevant to the target y if

    I(X_{−i}; y) < I(X; y)

  i.e. if it carries information about y that no other variable carries.
◮ A variable is weakly relevant to the target y if it is not strongly relevant and

    ∃ S ⊆ X_{−i} : I(S; y) < I({x_i, S}; y)

  i.e. there exists a certain context S in which it carries information about the target.

Strongly and weakly relevant variables

◮ Strong relevance indicates that the feature is always necessary for an optimal subset.
◮ Weak relevance suggests that the feature is not always necessary but may become necessary under certain conditions.
◮ Irrelevance indicates that the feature is not necessary at all.

Example: Consider a learning problem where n = 4, x_3 = −x_2 and

    y = 1 + w if x_1 + x_2 > 0,   y = 0 otherwise

Which variables are strongly relevant, weakly relevant and irrelevant?

Markov blanket

Let us consider a set X of n r.v.s, a target variable y and a subset M_y ⊂ X. The subset M_y is said to be a Markov blanket of y, with y ∉ M_y, iff

    I(y; X_{−(M_y)} | M_y) = 0

Theorem (Total conditioning)

    x ∈ M_y ⇔ I(x; y | X_{−(x,y)}) > 0

Definition (Relevance)

The relevance of x_2 to y given x_1 is the conditional mutual information

    I({x_1, x_2}; y) − I(x_1; y) = I(x_2; y|x_1)

We can define x_1 as the context and consider the relevance of a variable x_2 as context-dependent.
For an empty context, the relevance boils down to the mutual information I(x_2; y) of x_2 to y.
Interaction

Note that since

    I({x_1, x_2}; y) = I(x_2; y|x_1) + I(x_1; y) = I(x_1; y|x_2) + I(x_2; y)

we have

    I(x_2; y|x_1) = I(x_2; y) − I(x_1; y) + I(x_1; y|x_2)

It follows that

    I({x_1, x_2}; y) = I(x_1; y) + I(x_2; y) − [I(x_1; y) − I(x_1; y|x_2)]

where the bracketed term is the interaction.
Negative interaction: complementary variables, i.e. the variables together have more information than the sum of the univariate informations (XOR).
Positive interaction: redundant variables.

Negative interaction (XOR)

[Figure: XOR configuration of the two classes in the (x_1, x_2) plane]

    I(x_1; y) = 0,   I(x_2; y) = 0   but   I(x_1; y|x_2) > 0
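A discrete simulation of the XOR case (illustrative; empirical entropies in nats are computed from contingency tables in base R): each input alone carries no information about y, but conditioned on the other it does.

## empirical entropy (in nats) of one or more discrete variables
H <- function(...) {
  p <- prop.table(table(...))
  -sum(p[p > 0] * log(p[p > 0]))
}

## illustrative simulated XOR data
set.seed(10)
N  <- 1e5
x1 <- sample(0:1, N, replace = TRUE)
x2 <- sample(0:1, N, replace = TRUE)
y  <- xor(x1, x2) * 1

I.x1 <- H(x1) + H(y) - H(x1, y)            ## I(x1; y), approximately 0
I.x2 <- H(x2) + H(y) - H(x2, y)            ## I(x2; y), approximately 0
I.x1.given.x2 <- H(x1, x2) + H(y, x2) - H(x1, y, x2) - H(x2)   ## I(x1; y | x2), approximately log 2
cat(I.x1, I.x2, I.x1.given.x2, "\n")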

Positive interaction

[Figure: positive interaction example in the (x_1, x_2) plane]

    I(x_1; y) > 0,   I(x_2; y) > 0   but   I(x_1; y|x_2) = 0

Feature selection and mutual information

◮ Given an output target y and a set of input variables X = {x_1, . . . , x_n}, selecting the optimal subset of d variables boils down to

    X* = arg max_{X_S ⊂ X, |X_S| = d} I(X_S; y)

◮ The maximization can be done in a forward manner.
◮ Let X = {x_i}, i = 1, . . . , n be the whole set of variables and X_S the current set of selected variables. The task of adding a variable x* ∈ X − X_S can be addressed by solving

    x* = arg max_{x_k ∈ X − X_S} I({X_S, x_k}; y)

  This requires a multivariate estimation of the mutual information.
◮ Filter approaches rely on some low variate approximation.
The mRMR approach

The mRMR (minimum-Redundancy Maximum-Relevance) feature selection strategy approximates

    arg max_{x_k ∈ X − X_S} I({X_S, x_k}; y)

with

    x*_mRMR = arg max_{x_k ∈ X − X_S} [ I(x_k; y) − (1/m) Σ_{x_i ∈ X_S} I(x_i; x_k) ]

where m is the size of X_S.
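A minimal mRMR forward-selection sketch (illustrative; the mutual informations are approximated with the Gaussian formula −(1/2) log(1 − ρ^2), which is an assumption of this sketch rather than the course's estimator of choice).

## illustrative simulated data: x2 is redundant with x1, y depends on x1 and x3
set.seed(11)
N <- 200; n <- 10
X <- matrix(rnorm(N * n), N, n)
X[, 2] <- X[, 1] + 0.1 * rnorm(N)
Y <- X[, 1] + X[, 3] + rnorm(N, sd = 0.5)

mi <- function(a, b) -0.5 * log(1 - cor(a, b)^2)   ## Gaussian approximation of I(a; b) (assumption)

selected <- integer(0)
for (step in 1:3) {
  cand  <- setdiff(1:n, selected)
  score <- sapply(cand, function(k) {
    relevance  <- mi(X[, k], Y)
    redundancy <- if (length(selected) == 0) 0 else
                  mean(sapply(selected, function(i) mi(X[, i], X[, k])))
    relevance - redundancy                         ## mRMR criterion
  })
  selected <- c(selected, cand[which.max(score)])
  cat("step", step, ": selected =", selected, "\n")
}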
