Sta 5

[Diagram of the modelling workflow: phenomenon, data, model selection, model validation, model]
◮ ML algorithms degrade in accuracy when faced with many inputs (aka features) that are not necessary.
◮ Tasks with thousands of features: bioinformatics classification, where n (the number of genes for which expression is measured) may range from 6000 to 60000.
◮ The original learning techniques were not designed to cope with lots of irrelevant features.
◮ Feature selection: selecting some subset of the features to use as inputs.
◮ Using all features may negatively affect generalization, because of irrelevant/redundant features.
◮ Feature selection as a model selection problem.

There are many potential benefits of feature selection:
◮ facilitating data visualization and data understanding,
◮ reducing the measurement and storage requirements,
◮ reducing training and utilization times of the final model,
◮ defying the curse of dimensionality to improve prediction performance.

Drawbacks are
◮ an additional layer of complexity: the search in the hypothesis space is augmented by another dimension, finding the optimal subset of relevant features;
◮ additional time for learning.
Neighbourhood and dimension n

[Figure: unit hypercubes for n = 1, 2, 3 with edge length d = 1/2 and enclosed volume V = 1/2, 1/4, 1/8]

◮ n-dimensional space
◮ query point xq ∈ R^n
◮ unit volume around xq
◮ hypercube of edge length d < 1 which contains a portion V of the unit volume:

    d^n = V ⇒ d = V^{1/n}

◮ V = {x ∈ R^n : |x_j| < d/2, j = 1, ..., n}, where d stands for a measure of locality.
◮ For instance, if V = 0.5 we have d = 0.7, 0.87, 0.98 for n = 2, 5, 50.

As n increases, the volume of the cubic neighbourhood decreases if the edge length d is fixed; conversely, the edge length needed to enclose a fixed portion V of the unit volume approaches 1.
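A minimal R sketch (not part of the course scripts) of the relation d = V^{1/n}: even a neighbourhood containing only half of the unit volume becomes almost as wide as the whole range when n grows.

## Edge length d of the hypercube that captures a fraction V of the unit volume.
## Illustrative sketch, not taken from the original slides.
V <- 0.5
n <- c(2, 5, 50)
d <- V^(1/n)
d   # roughly 0.71, 0.87, 0.99: locality is lost in high dimension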
Pros-cons analysis
PCA example (n = 2)

PCA: the algorithm

    X̃ = U D V^T
How to choose the number h of principal components:

1. Explained variance: choose h on the basis of the cumulative variance ∑_{j=1}^{h} λ_j, where λ_j = d_j^2/(N−1) is the jth largest eigenvalue and is equal to the variance of the jth component.

2. Scree plot: decreasing plot of λ_j as a function of j. Choose the value of h corresponding to a knee in the curve.

3. Selection by cross validation:

S=svd(Xtr)                        # SVD of the training inputs
## PC loop (runs inside a loop over the test observations i, not shown here)
for (h in 1:n){
  V=S$v
  Vh=array(V[,1:h],c(n,h))        # loadings of the first h principal components
  Zh=Xtr%*%Vh                     # training inputs projected on the first h PCs
  Zi=Xts%*%Vh                     # test inputs projected on the first h PCs
  YhatPCAi=pred("knn",Zh,Ytr,Zi,class=FALSE)
  EPCA[i,h]=(Y[i]-YhatPCAi)^2     # squared error for test point i with h PCs
}
hbest=which.min(apply(EPCA,2,mean))   # h with the lowest average error

See also script FeatureSel/fs_pca.R
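Criteria 1 and 2 are not implemented in the snippet above; a minimal sketch on toy data (not part of FeatureSel/fs_pca.R) could look as follows, using the same relation λ_j = d_j^2/(N−1).

## Illustrative sketch: explained-variance ratio and scree plot on toy centred data
set.seed(0)
N <- 100; n <- 10
Xtr <- scale(matrix(rnorm(N*n), N, n), center=TRUE, scale=FALSE)
lambda <- svd(Xtr)$d^2/(N-1)          # eigenvalues = variances of the principal components
cumsum(lambda)/sum(lambda)            # criterion 1: cumulative proportion of variance
plot(lambda, type="b", xlab="j", ylab="lambda_j")   # criterion 2: scree plot, look for a knee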
Clustering (unsupervised learning)

◮ Returns groups of features or observations with similar patterns (e.g. patterns of gene expression in microarray data).
◮ Requires a distance function between variables and between clusters.
◮ Nearest neighbor clustering: the number of clusters is fixed in advance, then each variable is assigned to a cluster. Examples: Self Organizing Maps (SOM) and K-means.
◮ Agglomerative clustering: bottom-up methods where clusters start as empty and variables are successively added. An example is hierarchical clustering.

Hierarchical clustering

◮ It begins by considering all the observations as separate clusters and starts by putting together the two samples that are nearest to each other.
◮ In subsequent stages, clusters can also be merged.
◮ A measure of dissimilarity between sets of observations is required.
◮ It is based on an appropriate metric (a measure of distance between pairs of observations) and a linkage criterion, which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets.
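A minimal sketch (not from the slides) of agglomerative hierarchical clustering in base R; the Euclidean metric and complete linkage are assumed choices, and plotting the result produces the dendrogram discussed next.

## Hierarchical clustering sketch: Euclidean metric, complete linkage (assumed choices)
set.seed(0)
X  <- matrix(rnorm(20 * 5), 20, 5)    # 20 observations, 5 features (toy data)
D  <- dist(X, method = "euclidean")   # pairwise distances between observations
hc <- hclust(D, method = "complete")  # linkage criterion for merging clusters
plot(hc)                              # dendrogram of the successive merges
cutree(hc, k = 3)                     # cut the tree into 3 clusters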
Dendrogram

The output of hierarchical clustering is a dendrogram, a tree diagram used to illustrate the arrangement of the clusters.

Ranking

◮ Assesses the importance (or relevance) of each variable with respect to the output by using a univariate measure. Supervised techniques of complexity O(n).
◮ Measures of relevance:
  ◮ Pearson correlation (the greater, the more relevant), which assumes linearity;
  ◮ in case of binary classification, the significance p-value of a hypothesis test (the lower, the more relevant), which aims at detecting the features that split the dataset well. Parametric (t-test) and nonparametric (Wilcoxon) tests have been proposed in the literature;
  ◮ mutual information (the greater, the more relevant).
◮ After the univariate assessment, the method ranks the variables in decreasing order of relevance.
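A minimal sketch (not from the slides) of univariate ranking for a binary classification task, using the absolute Pearson correlation and the t-test p-value as relevance measures:

## Univariate feature ranking on toy data (illustrative only)
set.seed(0)
N <- 60; n <- 20
X <- matrix(rnorm(N * n), N, n)
Y <- factor(rep(c(0, 1), each = N/2))

relev.cor  <- abs(cor(X, as.numeric(Y)))                      # the greater, the more relevant
relev.pval <- apply(X, 2, function(x) t.test(x ~ Y)$p.value)  # the lower, the more relevant

order(relev.cor, decreasing = TRUE)   # ranking by correlation
order(relev.pval)                     # ranking by p-value (most significant first)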
Ranking methods (II)

Wrapping search

External cross-validation

Assessment by permutation
Ridge regression

R code
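As an illustrative placeholder (not the original slide code), a minimal ridge regression in base R computing the ridge estimate β̂ = (X^T X + λI)^{−1} X^T Y on centred data:

## Minimal ridge regression sketch (illustrative; not the original slide code)
ridge <- function(X, Y, lambda) {
  ## assumes X and Y are centred, so no intercept term is needed
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% Y)
}

set.seed(0)
X <- scale(matrix(rnorm(40 * 5), 40, 5), scale = FALSE)
Y <- scale(X %*% rnorm(5) + rnorm(40), scale = FALSE)
beta.ridge <- ridge(X, Y, lambda = 1)   # lambda = 0 gives ordinary least squares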
Dual nonlinear regression

Kernel trick

where

    K_{i,j} = ⟨Φ(x_i), Φ(x_j)⟩,   k_i = ⟨Φ(x_i), Φ(x)⟩

These inner products can be computed efficiently without explicitly computing the mapping Φ(·).
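A minimal sketch (not from the slides) of the kernel trick with a Gaussian kernel, which is an assumed choice here: the Gram matrix K and the vector k are computed directly from the inputs, without ever forming Φ(x).

## Gaussian (RBF) kernel: k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
## Illustrative sketch; the kernel choice is an assumption, not taken from the slides.
gauss.kernel <- function(A, B, sigma = 1) {
  D2 <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)  # squared distances
  exp(-D2 / (2 * sigma^2))
}

X  <- matrix(rnorm(10 * 3), 10, 3)   # training inputs x_1, ..., x_N
xq <- matrix(rnorm(3), 1, 3)         # query point x

K <- gauss.kernel(X, X)              # K[i, j] = <Phi(x_i), Phi(x_j)>
k <- gauss.kernel(X, xq)             # k[i]    = <Phi(x_i), Phi(x)>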
Consider two continuous r.v.s x and y and their joint density p(x, y). The conditional entropy is defined as

    H(y|x) = −∫∫ log(p(y|x)) p(x, y) dx dy = E_{x,y}[−log p(y|x)] = E_{x,y}[log(1/p(y|x))] = E_x[H(y|x)]

This quantity quantifies the remaining uncertainty of y once x is known.

◮ for continuous r.v.s the differential entropy may be negative (logarithmic scale)
◮ in general H(y|x) ≠ H(x|y)
◮ conditioning reduces entropy:

    H(y|x) ≤ H(y)

  with equality if x and y are independent, i.e. x ⊥⊥ y
◮ H(y) − H(y|x) = H(x) − H(x|y)
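Although the slides discuss continuous r.v.s, a small discrete example (not from the slides) makes "conditioning reduces entropy" concrete, using the chain rule H(y|x) = H(x, y) − H(x):

## Toy discrete illustration of H(y|x) <= H(y), entropies in bits
p.xy <- matrix(c(0.4, 0.1,
                 0.1, 0.4), 2, 2, byrow = TRUE)   # joint p(x, y); rows: x, columns: y
H <- function(p) -sum(p[p > 0] * log2(p[p > 0]))

H.y   <- H(colSums(p.xy))             # H(y) = 1 bit
H.ycx <- H(p.xy) - H(rowSums(p.xy))   # H(y|x) = H(x, y) - H(x)
c(H.y, H.ycx, H.y - H.ycx)            # H(y|x) < H(y); the difference is I(x; y)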
Mutual information of two vars (II)

Mutual information in the normal case
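For jointly normal x and y with correlation coefficient ρ, the mutual information has the well-known closed form I(x; y) = −(1/2) log(1 − ρ²); a quick sketch (not from the slides):

## Mutual information of a bivariate normal pair as a function of rho (in nats)
rho <- c(0, 0.5, 0.9, 0.99)
-0.5 * log(1 - rho^2)   # zero for independent variables, unbounded as |rho| -> 1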
Strong and weak relevance

Strongly and weakly relevant variables
Markov blanket

Let us consider a set X of n r.v.s, a target variable y and a subset My ⊂ X. The subset My is said to be a Markov blanket of y, y ∉ My, iff

    I(y; X−(My) | My) = 0

Theorem (Total conditioning)

    x ∈ My ⇔ I(x; y | X−(x,y)) > 0

Definition (Relevance)

The relevance of x2 to y given x1 is the conditional mutual information

    I({x1, x2}; y) − I(x1; y) = I(x2; y|x1)

We can define x1 as the context and consider the relevance of a variable x2 as context-dependent.

For an empty context, the relevance boils down to the mutual information I(x2; y) of x2 to y.
Interaction

Note that since

    I({x1, x2}; y) = I(x2; y|x1) + I(x1; y) = I(x1; y|x2) + I(x2; y)

we have …

Negative interaction (XOR)

[Figure: XOR configuration in the (x1, x2) plane]

It follows that …, with

    x*_MRMR = arg max_{xk ∈ X − XS} [ I(xk; y) − (1/m) ∑_{xi ∈ XS} I(xi; xk) ]
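A small numerical sketch (not from the slides) of the XOR case: individually x1 and x2 carry no information about y, yet together they determine it, so the context-dependent relevance I(x2; y|x1) is maximal even though I(x2; y) = 0.

## XOR interaction: x1, x2 independent uniform bits, y = XOR(x1, x2); entropies in bits
H <- function(p) -sum(p[p > 0] * log2(p[p > 0]))

H.y    <- H(c(0.5, 0.5))                 # H(y) = 1 bit
H.x1y  <- H(rep(0.25, 4))                # x1 and y are independent, so H(x1, y) = 2 bits
I.x1y  <- H(c(0.5, 0.5)) + H.y - H.x1y   # I(x1; y) = 0 (and by symmetry I(x2; y) = 0)
I.x12y <- H.y - 0                        # y is a function of (x1, x2), so H(y|x1, x2) = 0
I.x12y - I.x1y                           # relevance I(x2; y|x1) = 1 bit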