Biological Data Science: Lecture 6
Data matrix X (N subjects × M features) and outcome vector y:

Subjects   feature1   feature2   ...   featureM   |  y (result)
P1         3.1        1.3        ...   0.9        |  1
P2         3.7        1.0        ...   1.3        |  2
P3         2.9        2.6        ...   0.6        |  1
...
PN         1.7        2.0        ...   0.7        |  3
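A minimal sketch of how such a design matrix might be held in code, assuming Python/NumPy; the values simply mirror the toy table above and the names X, y, N, M are illustrative:

```python
import numpy as np

# Toy design matrix X: N subjects (rows) x M features (columns),
# mirroring the table above; y holds the outcome for each subject.
X = np.array([
    [3.1, 1.3, 0.9],   # P1
    [3.7, 1.0, 1.3],   # P2
    [2.9, 2.6, 0.6],   # P3
    [1.7, 2.0, 0.7],   # PN
])
y = np.array([1, 2, 1, 3])

N, M = X.shape   # number of subjects, number of features
print(N, M)      # 4 3
```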
▪ Many features (large M) → curse of dimensionality
▪ This obstructs interpretability and is detrimental to the learning process
▪ Reduce the initial feature space of M features into m features (where m < M)
▪ Feature selection
▪ Feature transformation
▪ Principle of parsimony
▪ Information content
▪ Statistical associations
▪ Computational constraints
Feature transformation (e.g. PCA)

Original data matrix X (N subjects × M features):

Subjects   feature1   feature2   ...   featureM
P1         3.1        1.3        ...   0.9
P2         3.7        1.0        ...   1.3
P3         2.9        2.6        ...   0.6
...
PN         1.7        2.0        ...   0.7

X is transformed into a new matrix X' (N subjects × m features, where m < M), whose columns (PCA feat1, PCA feat2, ..., PCA feat m) are each a weighted linear combination of the original features, e.g. PCA feat1 = w1·feature1 + w2·feature2 + …
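As an illustration, a minimal sketch of such a feature transformation using scikit-learn's PCA; the data are random and the choice m = 2 is arbitrary:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # N = 100 subjects, M = 10 features

pca = PCA(n_components=2)             # keep m = 2 transformed features (m < M)
X_prime = pca.fit_transform(X)        # X': each column is a linear combination
                                      # of the original features
print(X_prime.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```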
▪ Unobserved latent variables = factors
▪ Similar in principle to PCA, but with subtle differences
▪ FA takes into account random errors in the measurements
▪ Different flavours of FA: Exploratory FA (EFA), Confirmatory FA (CFA)
▪ Many statisticians remain skeptical about FA because it has no unique solution (rotations of the factor space)
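A minimal exploratory FA sketch, assuming scikit-learn's FactorAnalysis (which, unlike PCA, estimates a per-feature noise variance); the data and the choice of 3 factors are illustrative:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))        # toy data: 200 subjects, 8 features

fa = FactorAnalysis(n_components=3)  # 3 unobserved latent factors
scores = fa.fit_transform(X)         # factor scores for each subject
print(fa.components_.shape)          # (3, 8): factor loadings
print(fa.noise_variance_.shape)      # (8,): per-feature measurement error variance
```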
Day 6 part 2
Discard features that do not contribute towards predicting the outcome
▪ Interpretable
▪ Retain domain expertise
[Diagram: candidate features F1, F2, F3 and their associations with the outcome y]
▪ Which features would you choose? In which order?
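A common starting point is to rank features by the strength of their statistical association (relevance) with the outcome y. A minimal sketch, assuming scikit-learn's mutual information estimator (a correlation-based ranking would work analogously):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Toy dataset with a few informative features among many
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=3, random_state=0)

relevance = mutual_info_classif(X, y, random_state=0)
order = np.argsort(relevance)[::-1]   # most relevant feature first
print(order)
print(np.round(relevance[order], 3))
```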
▪ Minimum redundancy amongst features in the subset
[Diagram: pairwise associations amongst candidate features F1, F2, F3, F4, illustrating redundancy]
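Redundancy can be quantified, for example, through pairwise associations amongst the candidate features. A minimal sketch using absolute Pearson correlations (an illustrative choice; mutual information could be used instead):

```python
import numpy as np

rng = np.random.default_rng(0)
F1 = rng.normal(size=500)
F2 = F1 + 0.1 * rng.normal(size=500)  # nearly a copy of F1 -> redundant pair
F3 = rng.normal(size=500)
F4 = rng.normal(size=500)
X = np.column_stack([F1, F2, F3, F4])

# Absolute pairwise correlations amongst the candidate features;
# off-diagonal entries close to 1 flag redundant pairs (here F1 vs F2).
R = np.abs(np.corrcoef(X, rowvar=False))
print(np.round(R, 2))
```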
Bring it together using a Lagrangian formulation:
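A typical way to combine relevance and redundancy in a single objective (an mRMR-style criterion; the exact formulation in the lecture may differ) is, for a candidate feature subset S with Lagrange multiplier λ:

```latex
J(S) \;=\; \underbrace{\frac{1}{|S|}\sum_{f_i \in S} I(f_i;\, y)}_{\text{relevance}}
\;-\; \lambda \,
\underbrace{\frac{1}{|S|^{2}}\sum_{f_i \in S}\sum_{f_j \in S} I(f_i;\, f_j)}_{\text{redundancy}}
```

where I(·;·) denotes a measure of statistical association, e.g. mutual information.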
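A minimal greedy forward-selection sketch implementing such a relevance-minus-redundancy trade-off (an illustrative heuristic using absolute correlations as the association measure, not the specific algorithm from the lecture):

```python
import numpy as np
from sklearn.datasets import make_classification

def greedy_relevance_redundancy(X, y, m, lam=1.0):
    """Pick m features maximising relevance to y minus mean redundancy
    with the already-selected features (absolute correlations as proxy)."""
    n_feat = X.shape[1]
    relevance = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_feat)])
    selected = [int(np.argmax(relevance))]           # start with the most relevant feature
    while len(selected) < m:
        best_j, best_score = None, -np.inf
        for j in range(n_feat):
            if j in selected:
                continue
            redundancy = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1])
                                  for s in selected])
            score = relevance[j] - lam * redundancy  # Lagrangian-style trade-off
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

X, y = make_classification(n_samples=300, n_features=12, n_informative=4, random_state=0)
print(greedy_relevance_redundancy(X, y, m=4))
```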
▪ Selecting the ‘true’ feature subset (i.e. discarding features which are known to be noise)
  o Possible only for artificial datasets, where the truly informative features are known by construction
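A minimal sketch of this idea on an artificial dataset, assuming scikit-learn's make_classification with shuffle=False so that the first n_informative columns are, by construction, the informative ones:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Artificial dataset: with shuffle=False the first 3 columns are the truly
# informative features and the remaining columns are noise.
X, y = make_classification(n_samples=500, n_features=15, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
true_subset = {0, 1, 2}

scores = mutual_info_classif(X, y, random_state=0)
selected = set(np.argsort(scores)[::-1][:3])       # top-3 ranked features
print("selected:", selected, "overlap with truth:", selected & true_subset)
```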
[Figure: misclassification rate with a random forest (y-axis, 0 to 0.7) versus number of selected features (x-axis, 1–30 and 1–50 in the two panels) for the feature selection methods GSO, RELIEF, LLBFS, and RRCT]