Biological Data Science Lecture 6


Dr Athanasios Tsanas (‘Thanasis’)

Associate Prof. in Data Science


Usher Institute, Medical School
University of Edinburgh
Day 1 • Introduction and overview; reminder of basic concepts
Day 2 • Data collection and sampling

Day 3 • Data mining: signal/image processing and information extraction

Day 4 • Data visualization: density estimation, statistical descriptors

Day 5 • Exploratory analysis: hypothesis testing and quantifying relationships

Day 6 • Feature selection and feature transformation

Day 7 • Statistical machine learning and model validation

Day 8 • Statistical machine learning and model validation

Day 9 • Practical examples: bringing things together

Day 10 • Revision and exam preparation


Feature generation from raw data → Feature selection or transformation → Statistical mapping

Feature matrix X (N subjects × M features or characteristics) and outcome y:

Subjects   feature1   feature2   ...   feature M   result (y)
P1         3.1        1.3        ...   0.9         1
P2         3.7        1.0        ...   1.3         2
P3         2.9        2.6        ...   0.6         1
…          …          …          ...   …           …
PN         1.7        2.0        ...   0.7         3


▪ Depending on the problem, “features” can be demographics, genes, …

▪ y = f(X), where f is the mechanism, X is the feature set and y is the outcome
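A minimal sketch in Python/pandas (not from the lecture; column and index names are placeholders), showing how the feature matrix X and outcome y above could be laid out so that we can later learn the mapping y = f(X):

```python
import pandas as pd

# Illustrative feature matrix X: rows = subjects (N), columns = features (M).
# Values are taken from the example table above; in practice the features
# could be demographics, genes, signal descriptors, etc.
X = pd.DataFrame(
    {"feature1": [3.1, 3.7, 2.9, 1.7],
     "feature2": [1.3, 1.0, 2.6, 2.0],
     "featureM": [0.9, 1.3, 0.6, 0.7]},
    index=["P1", "P2", "P3", "PN"],
)

# Outcome y: one value per subject; the aim is to learn y = f(X)
y = pd.Series([1, 2, 1, 3], index=X.index, name="result")

print(X.shape)  # (N, M) for this toy example: (4, 3)
```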


Data visualization (density estimation, scatter plots) → Exploratory analysis: hypothesis testing and statistical associations → Feature selection or transformation (e.g. PCA) → Statistical mapping (regression/classification)
▪ Many features 𝑀 → curse of dimensionality
▪ Obstructs interpretability and is detrimental to the learning process
▪ Reduce the initial feature space 𝑀 into 𝑚 (where 𝑚 < 𝑀)

▪ Feature selection
▪ Feature transformation
▪ Principle of parsimony
▪ Information content
▪ Statistical associations
▪ Computational constraints

▪ We want to determine the most parsimonious feature subset with maximum joint information content
▪ Construct a lower-dimensional space in which the new data points retain the distances between the data points in the original feature space
▪ Different algorithms arise depending on how we define the distance
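The slide does not commit to a particular algorithm; as one hedged illustration in Python (scikit-learn), classical multidimensional scaling (MDS) builds a lower-dimensional embedding that tries to preserve the pairwise distances of the original points. The data here are random placeholders:

```python
import numpy as np
from sklearn.manifold import MDS

# Placeholder high-dimensional data: N = 100 samples, M = 20 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))

# MDS seeks m-dimensional coordinates whose pairwise Euclidean distances
# approximate those computed in the original M-dimensional space
mds = MDS(n_components=2, random_state=0)
X_low = mds.fit_transform(X)

print(X_low.shape)  # (100, 2): same points, lower-dimensional representation
# Choosing a different distance/criterion gives a different algorithm
# (e.g. PCA, Isomap), as noted above.
```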
▪ Results are not easily interpretable
▪ Does not save resources in data collection or data processing
▪ Reliable transformation in high-dimensional spaces is problematic
▪ Project the data X into a different feature space X’
▪ Principal components are linearly uncorrelated
▪ Each successive component captures the maximum remaining variance
▪ Best linear approximation of the data
▪ Principal components are associated with the original features via the loadings (weights), e.g.:
  PCA1 = 0.1*x1 + 0.3*x2 + 0.8*x3 + …
▪ See how well each principal component explains the remaining variance in the data
▪ Potentially interpretable results in many applications
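A minimal PCA sketch in Python (scikit-learn) on placeholder data, illustrating the points above: `components_` holds the loadings (weights) relating each principal component to the original features, and `explained_variance_ratio_` shows how much variance each successive component captures:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: N = 200 samples, M = 5 features (placeholder data)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))

# Standardizing first is common so that no feature dominates purely by scale
X_std = StandardScaler().fit_transform(X)

pca = PCA()                           # keep all M components
X_prime = pca.fit_transform(X_std)    # projected data X'

# Loadings: row k gives the weights of PCA_k on the original features,
# i.e. PCA1 = w1*x1 + w2*x2 + ... as on the slide
print(pca.components_[0])

# Fraction of variance explained by each successive component
print(pca.explained_variance_ratio_)
print(np.cumsum(pca.explained_variance_ratio_))  # cumulative "% total variance explained"
```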
▪ Questionnaire with 6 items
▪ Think of it as 6 “features”
▪ Each feature: 7 possible values (Likert scores)
▪ 1000+ samples
▪ Is there some structure in the questionnaire? (i.e. latent variables)
                P1      P2      P3      P4      P5      P6
Anxious         0.55    0.08   -0.47   -0.27    0.60    0.18
Elated         -0.11    0.76   -0.11   -0.53   -0.33    0.01
Sad             0.52    0.04   -0.43    0.39   -0.57   -0.25
Angry           0.42    0.11    0.46    0.11   -0.21    0.74
Irritable       0.47    0.12    0.60   -0.15    0.14   -0.60
Energetic      -0.13    0.62    0.02    0.67    0.38   -0.03
% Total variance explained:  55   77   85   91   97   100
Tentative interpretation:  P1 “Negative feelings”, P2 “Positive feelings”, P3 “Irritability”

Original feature matrix X (N subjects × M features or characteristics):

Subjects   feature1   feature2   ...   feature M
P1         3.1        1.3        ...   0.9
P2         3.7        1.0        ...   1.3
P3         2.9        2.6        ...   0.6
…
PN         1.7        2.0        ...   0.7

Transformed matrix X’ (N subjects × m PCA features), where each new value is a weighted combination of the original features:

Subjects   PCA feat1             PCA feat2   ...   PCA feat m
P1         3.1*P1 + 1.3*P2 + …
P2
P3
…
PN
▪ Unobserved latent variables = factors
▪ Similar in principle to PCA, but with subtle differences
▪ FA takes into account random errors in the measurements
▪ Different flavours of FA: Exploratory FA (EFA), Confirmatory FA (CFA)
▪ Many statisticians remain sceptical about FA because it has no unique solution (the factor space can be rotated)
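A minimal sketch of exploratory factor analysis in Python, assuming scikit-learn's FactorAnalysis and hypothetical questionnaire-style data (the 1000 × 6 Likert matrix below is randomly generated, so no real structure should emerge); unlike PCA, the model includes an item-specific noise term for random measurement error:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Hypothetical questionnaire data: 1000 respondents x 6 Likert items (1-7)
rng = np.random.default_rng(1)
X = rng.integers(1, 8, size=(1000, 6)).astype(float)

# Exploratory factor analysis with 3 latent factors; each observed item is
# modelled as a combination of the factors plus item-specific random error
fa = FactorAnalysis(n_components=3, random_state=1)
scores = fa.fit_transform(X)       # factor scores per respondent

print(fa.components_)              # factor loadings (factors x items)
print(fa.noise_variance_)          # estimated measurement-error variance per item
```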
Day 6 part 2

Discard features that do not contribute towards predicting the outcome
▪ Interpretable
▪ Retains domain expertise
▪ Often the only useful approach in practice (e.g. in microarray data)
▪ Saves resources in data collection and data processing
▪ Maximum relevance: association between the features (F) and the response (y)

[Diagram: overlap of features F1, F2, F3 with the response y]

▪ Which features would you choose? In which order?
▪ Minimum redundancy amongst the features in the selected subset

[Diagram: overlap amongst features F1, F2, F3, F4]

▪ Which features would you choose? In which order?
▪ Conditional relevance (feature interaction)
▪ Features are jointly highly predictive of the outcome
▪ Compromise between relevance and redundancy
▪ Does not account for interactions or non-pairwise redundancy
▪ Generally works very well

▪ mRMR ≝ max_{j ∈ Q−S} [ I(f_j, y) − (1/|S|) Σ_{s ∈ S} I(f_j, f_s) ]

❑ |S| is the cardinality of the selected subset S
❑ Q contains the indices of all possible features
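A hedged sketch of the greedy mRMR step in Python, using scikit-learn's mutual_info_regression to estimate the mutual information terms I(f_j, y) and I(f_j, f_s); this is an illustration of the criterion above on placeholder data, not a reference implementation:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mrmr_select(X, y, k, random_state=0):
    """Greedy mRMR: at each step pick the feature maximizing relevance
    I(f_j, y) minus mean redundancy (1/|S|) * sum_{s in S} I(f_j, f_s)."""
    relevance = mutual_info_regression(X, y, random_state=random_state)
    selected, remaining = [], list(range(X.shape[1]))

    while len(selected) < k and remaining:
        scores = []
        for j in remaining:
            if selected:
                redundancy = np.mean(
                    [mutual_info_regression(X[:, [s]], X[:, j],
                                            random_state=random_state)[0]
                     for s in selected])
            else:
                redundancy = 0.0
            scores.append(relevance[j] - redundancy)
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage: features 0 and 3 truly drive the (continuous) outcome
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.1, size=200)
print(mrmr_select(X, y, k=3))
```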
▪ Start with classical ordinary least squares regression
▪ Add an L1 penalty: sparsity promoting
▪ Some coefficients become exactly zero!
▪ Bring it together using the Lagrangian formulation (the standard LASSO objective):

  β̂ = argmin_β { ‖y − Xβ‖²₂ + λ‖β‖₁ }
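As a hedged illustration in Python, scikit-learn's Lasso fits this penalized objective; its `alpha` parameter plays the role of λ (the value below is arbitrary), and features whose coefficients are shrunk exactly to zero are effectively discarded:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Toy data: only features 0 and 3 truly drive the outcome
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=200)

# Standardize so the L1 penalty treats all features comparably
X_std = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=0.1)   # alpha ~ lambda; tune e.g. with LassoCV in practice
lasso.fit(X_std, y)

selected = np.flatnonzero(lasso.coef_)   # features with non-zero coefficients
print(lasso.coef_)
print("Selected features:", selected)
```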
▪ Selecting the ‘true’ feature subset (i.e. discarding features which are known to be noise)
  o possible only for artificial datasets

▪ Maximize the out-of-sample prediction performance (see the sketch below)
  o a proxy for assessing feature selection algorithms
  o adds an additional ‘layer’: the learner
  o beware of feature exportability (different learners may give different results)
  o BUT… in practice this is really what is of most interest!
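A minimal sketch of this out-of-sample assessment in Python; the selector (univariate mutual information) and learner (random forest, matching the misclassification figures that follow) are assumed choices for illustration. The key point is that features are selected on the training fold only and performance is measured on the held-out fold:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy classification data (placeholder for e.g. a microarray dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))
y = (X[:, 0] + X[:, 7] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Select features on the training data ONLY, keeping the test set out of sample
selector = SelectKBest(mutual_info_classif, k=10).fit(X_tr, y_tr)

# The learner adds an extra 'layer'; another learner may prefer other features
clf = RandomForestClassifier(random_state=0).fit(selector.transform(X_tr), y_tr)
pred = clf.predict(selector.transform(X_te))
print("Misclassification rate:", 1 - accuracy_score(y_te, pred))
```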
[Figure: two panels (“Ovarian cancer” and “SRBCT”) plotting random forest misclassification rate (y-axis, 0 to 1) against the number of selected features (x-axis, up to 30 and 50 respectively), for the feature selection algorithms LASSO, mRMR, mRMR Spearman, GSO, RELIEF, LLBFS and RRCT]
▪ Out-of-sample classification error using the selected feature subsets (lower is better)
▪ Which FS algorithm leads to the best results?
▪ No free lunch theorem (no universally best algorithm)
▪ Trade-offs
o algorithmic: relevance, redundancy, complementarity
o computational: wrappers are costly but often give better results
o comprehensive search of the feature space, e.g. genetic algorithms
(very costly)

▪ Reducing the number of features may improve prediction performance and always improves interpretability
▪ I. Guyon and A. Elisseeff: An introduction to variable and feature selection, JMLR, 2003
  http://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf

▪ OPTIONAL: L. van der Maaten: Dimensionality reduction
  https://lvdmaaten.github.io/publications/papers/TR_Dimensionality_Reduction_Review_2009.pdf
