Biological Data Science: Lecture 4

Dr Athanasios Tsanas (‘Thanasis’)

Associate Prof. in Data Science


Usher Institute, Medical School
University of Edinburgh
Day 1 • Introduction and overview; reminder of basic concepts
Day 2 • Data collection and sampling

Day 3 • Data mining: signal/image processing and information extraction

Day 4 • Data visualization: density estimation, statistical descriptors

Day 5 • Exploratory analysis: hypothesis testing and quantifying relationships

Day 6 • Feature selection and feature transformation

Day 7 • Statistical machine learning and model validation

Day 8 • Statistical machine learning and model validation

Day 9 • Practical examples: bringing things together

Day 10 • Revision and exam preparation


© A. Tsanas, 2020
Example data sources: ECG, EEG, activity, location

The data matrix X has N rows (subjects) and M columns (features or characteristics):

Subjects  feature1  feature2  ...  featureM
P1        3.1       1.3       ...  0.9
P2        3.7       1.0       ...  1.3
P3        2.9       2.6       ...  0.6
...
PN        1.7       2.0       ...  0.7


Feature generation from raw data → Feature selection or transformation → Statistical mapping

The feature matrix X is paired with an outcome vector y:

Subjects  feature1  feature2  ...  featureM  result
P1        3.1       1.3       ...  0.9       1
P2        3.7       1.0       ...  1.3       2
P3        2.9       2.6       ...  0.6       1
...
PN        1.7       2.0       ...  0.7       3

 Depending on the problem, “features” can be demographics, genes, …

 y = f(X), where f is the mechanism, X is the feature set, and y is the outcome


Data visualization (density estimation, scatter plots) → Exploratory analysis: hypothesis testing and statistical associations → Feature selection or transformation (e.g. PCA) → Statistical mapping (regression/classification)
 We will focus primarily on studying properties of a single variable

 You can think of this as focusing on a single feature, i.e. one column in X

 We will subsequently also study the visual exploration in 2D plots with two variables
Discrete variable: finite set of possible values
• Use histograms

Continuous variable: can typically take any value in a range
• Use probability density functions (e.g. kernel density estimation)
 20 throws of a die: 3,4,4,4,1,3,4,5,1,6,6,4,5,5,3,6,5,4,4,1

[Figure: histogram of scores for the 20 die throws; frequency against face value 1–6]
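The histogram counts for the die throws can be reproduced directly, e.g. with numpy (a minimal sketch; variable names are illustrative):

```python
import numpy as np

# The 20 recorded die throws from the slide
throws = np.array([3, 4, 4, 4, 1, 3, 4, 5, 1, 6, 6, 4, 5, 5, 3, 6, 5, 4, 4, 1])

# Count how often each face (1-6) appears: these counts are the histogram bars
faces = np.arange(1, 7)
counts = np.array([(throws == f).sum() for f in faces])
print(dict(zip(faces.tolist(), counts.tolist())))
```

Note that face 2 never appears, so its bar has height zero: a histogram shows the full finite set of possible values, observed or not.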
 Discretize possible values, use “bins”

[Figure: histogram of 1000 stock returns; frequency against return, with bins of width 0.5 spanning −3 to 3]
 Probability Density Function (PDF)

[Figure: PDF p(x) of X ~ N(500, 10²), i.e. mean = 500, variance = 100, standard deviation = 10; the PDF is computed using kernel density estimation and peaks near x = 500]
 Mean (average): μ = (1/N) Σ_{i=1}^{N} x_i

 Median: rank the values, and find the middle value

 Standard deviation: σ = sqrt( (1/N) Σ_{i=1}^{N} (x_i − μ)² )

 Variance: var(X) = σ²

 Interquartile range (iqr): 75th percentile − 25th percentile
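These descriptors map directly onto numpy calls (a minimal sketch; the sample data are illustrative):

```python
import numpy as np

# Illustrative sample of N observations
x = np.array([1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 9.0])

mean = x.mean()                       # mu = (1/N) * sum of x_i
median = np.median(x)                 # middle value of the ranked data
std = x.std()                         # population standard deviation (1/N inside the sqrt)
var = x.var()                         # sigma squared
iqr = np.percentile(x, 75) - np.percentile(x, 25)  # 75th minus 25th percentile

print(mean, median, std, var, iqr)
```

Note that the median (3.0) sits well below the mean (≈3.71) here because of the large value 9.0: the median and iqr are robust to such extreme observations, the mean and standard deviation are not.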

 E[X] = μ_X = ∫_{−∞}^{∞} x · p(x) dx

 Var(X) = σ_X² = E[(X − E[X])²]

 Moment_X^(m) = E[(X − E[X])^m]

 The expectation operator E[·] is computed from the possible values in X multiplied by their probabilities
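For a discrete variable the integral becomes a sum over the possible values weighted by their probabilities; for a fair six-sided die (a minimal sketch):

```python
import numpy as np

# Fair six-sided die: possible values and their probabilities
values = np.arange(1, 7)
probs = np.full(6, 1 / 6)

e_x = np.sum(values * probs)                   # E[X]: values weighted by probabilities
var_x = np.sum((values - e_x) ** 2 * probs)    # Var(X) = E[(X - E[X])^2]
moment3 = np.sum((values - e_x) ** 3 * probs)  # third central moment (0 by symmetry)

print(e_x, var_x, moment3)
```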
 Same information as the PDF, presented differently!

[Figure: cumulative probability distribution for stock returns; probability P(return < X), from 0.0 to 1.0, against return X from −1.5 to 2]
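An empirical cumulative distribution can be read off directly from the data: for each threshold it reports the fraction of observations at or below it (a minimal sketch; data are illustrative):

```python
import numpy as np

# Illustrative sample of returns
returns = np.array([-1.2, -0.3, 0.1, 0.4, 0.4, 0.9, 1.5, 2.0])

def ecdf(data, x):
    """Empirical CDF: fraction of observations less than or equal to x."""
    return np.sum(data <= x) / len(data)

print(ecdf(returns, 0.0))  # fraction of returns at or below 0
print(ecdf(returns, 2.0))  # all observations are <= 2.0, so this is 1.0
```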
 Add noise to each observation (impose a kernel, typically a Gaussian kernel)

 p̂(x₀) = (1/N) Σ_{i=1}^{N} (1/√(2πσ²)) · exp( −(x_i − x₀)² / (2σ²) )

 σ is the kernel bandwidth

 N is the number of samples

 x₀ refers to the point where we estimate the density
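The formula above can be implemented in a few lines (a minimal sketch; the data and bandwidth are illustrative):

```python
import numpy as np

def gaussian_kde(x0, data, bandwidth):
    """Kernel density estimate at x0: average of Gaussian kernels centred on each sample."""
    n = len(data)
    kernels = np.exp(-((data - x0) ** 2) / (2 * bandwidth ** 2))
    return kernels.sum() / (n * np.sqrt(2 * np.pi * bandwidth ** 2))

# Illustrative sample; the estimated density is highest near the bulk of the data
data = np.array([0.9, 1.0, 1.1, 3.0])
print(gaussian_kde(1.0, data, bandwidth=0.5))
print(gaussian_kde(3.0, data, bandwidth=0.5))
```

Because each kernel integrates to 1 and the sum is divided by N, the estimate p̂ is itself a valid density integrating to 1.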
[Figure: computing a histogram and applying kernel density estimation. Image source: Wikipedia]


• Many different approaches to computing the bandwidth (beyond this course)

• Increasing the kernel bandwidth σ leads to a smoother distribution
 You will notice that I have placed a lot of emphasis on densities

 These are important in their own right for visualization, but more importantly…

 Subsequent machine learning tools often depend heavily on the density estimates
 Boxplot (“box and whiskers”)

 Easy to understand

 Portrays outliers

[Figure: a boxplot, annotated with the median, the IQR (the box), and an outlier beyond the whiskers]
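The quantities a boxplot displays can be computed directly. A common convention (assumed here, not stated on the slide) flags points further than 1.5 × IQR beyond the box as outliers:

```python
import numpy as np

# Illustrative sample containing one extreme value
x = np.array([2.0, 3.0, 3.5, 4.0, 4.5, 5.0, 12.0])

median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

# Conventional outlier rule: outside [q1 - 1.5*iqr, q3 + 1.5*iqr]
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < lower) | (x > upper)]

print(median, iqr, outliers)
```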
 Two-dimensional plot to visualize how one variable is related to another

 Often complemented with the ‘best linear fit’ to assess whether there is a positive or negative relationship
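The sign of the slope of the best linear fit indicates the direction of the relationship; a least-squares fit can be computed with numpy (a minimal sketch with illustrative data):

```python
import numpy as np

# Illustrative paired observations: a positive relationship plus a little noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.1, 2.9, 4.2, 5.0])

# Least-squares straight line, y ~ slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)  # a positive slope suggests a positive relationship
```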
 H.J. Seltman: Experimental Design and Analysis (chapter 3, pp. 19-46)
http://www.stat.cmu.edu/~hseltman/309/Book/Book.pdf
