Biological Data Science Lecture 4
            X (features)                            y
Subjects    feature1   feature2   ...   feature M   result
P1          3.1        1.3        ...   0.9         1
P2          3.7        1.0        ...   1.3         2
P3          2.9        2.6        ...   0.6         1
…           …          …          ...   …           …
PN          1.7        2.0        ...   0.7         3
(N subjects in rows, M features in columns)
© A. Tsanas, 2020
We will focus primarily on studying properties
of a single variable
Discrete variable: finite set of possible values
• Use histograms
20 throws of a die:
3,4,4,4,1,3,4,5,1,6,6,4,5,5,3,6,5,4,4,1
[Figure: Histogram of scores for 20 dice throws; x-axis: score (1 to 6), y-axis: frequency (0 to 7)]
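The histogram for the dice data can be reproduced by counting how often each face appears. The following is a minimal sketch using only the Python standard library; the variable names (`throws`, `frequencies`) are illustrative and not part of the lecture material.

```python
from collections import Counter

# The 20 dice throws listed above.
throws = [3, 4, 4, 4, 1, 3, 4, 5, 1, 6, 6, 4, 5, 5, 3, 6, 5, 4, 4, 1]

# Count how often each face occurs; faces never thrown get frequency 0.
counts = Counter(throws)
frequencies = {face: counts.get(face, 0) for face in range(1, 7)}

# Print a simple text histogram, one row per face.
for face, freq in frequencies.items():
    print(f"{face}: {'#' * freq} ({freq})")
```

Face 4 is the most frequent score (7 of the 20 throws) and face 2 never occurs, which is why a discrete histogram needs one bar per possible value rather than one bar per observed value.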
Continuous variable: discretize the possible values into “bins”
[Figure: Histogram of 1000 stock returns; x-axis: return (bins from -3 to 3), y-axis: frequency (0 to 160)]
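Binning a continuous variable amounts to assigning each value to an equal-width interval and counting the values per interval. The sketch below assumes bins of width 0.5 between -3 and 3, and it uses synthetic Gaussian numbers as a stand-in because the lecture's stock-return data is not available here. The function name `bin_counts` is hypothetical.

```python
import math
import random

def bin_counts(values, low, high, width):
    """Count how many values fall into equal-width bins covering [low, high)."""
    n_bins = round((high - low) / width)
    counts = [0] * n_bins
    for v in values:
        if low <= v < high:
            # Index of the bin containing v; min() guards against rounding at the top edge.
            index = min(math.floor((v - low) / width), n_bins - 1)
            counts[index] += 1
    return counts

# Synthetic stand-in data (NOT the lecture's stock returns): 1000 Gaussian draws.
random.seed(0)
synthetic_returns = [random.gauss(0.0, 1.0) for _ in range(1000)]

counts = bin_counts(synthetic_returns, -3.0, 3.0, 0.5)
for i, c in enumerate(counts):
    left = -3.0 + 0.5 * i
    print(f"[{left:+.1f}, {left + 0.5:+.1f}): {c}")
```

The choice of bin width is a judgment call: narrower bins show more detail but make the counts noisier, while wider bins hide structure.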
Probability Density Function (PDF)
$X \sim \mathcal{N}(500, 10^2)$: mean = 500, variance = 100, standard deviation = 10
[Figure: computed PDF of X; x-axis: possible values x (0 to 1000), y-axis: probability density p(x)]
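To make "compute the PDF" concrete, the sketch below evaluates the standard Gaussian density formula for the parameters on the slide (mean 500, standard deviation 10). The slide itself does not spell the formula out, so the function `normal_pdf` is an illustrative assumption rather than the lecture's own code.

```python
import math

def normal_pdf(x, mean, std):
    """Gaussian probability density with the given mean and standard deviation."""
    variance = std ** 2
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

mean, std = 500.0, 10.0

# Density at the mean (the peak of the curve).
peak = normal_pdf(mean, mean, std)

# A density integrates to 1: approximate the integral with a Riemann sum
# over mean +/- 10 standard deviations, using a step of 0.1.
step = 0.1
total = sum(normal_pdf(400.0 + k * step, mean, std) * step for k in range(2001))

print(f"p(500) = {peak:.6f}, area under the curve = {total:.6f}")
```

Note that p(x) is a density, not a probability: probabilities are obtained as areas under the curve over an interval.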
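Mean (average): $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Standard deviation: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}$
Variance: $\mathrm{var}(X) = \sigma^2$
$\mathrm{Var}(X) = \sigma_X^2 = E\left[(X - E[X])^2\right]$
$\mathrm{Moment}_X^{(m)} = E\left[(X - E[X])^m\right]$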
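The summary statistics above translate directly into code. The following sketch applies them to the 20 dice throws from earlier in the lecture, using the 1/N normalisation shown in the formulas; the function names are illustrative.

```python
import math

def mean(xs):
    """Arithmetic mean: (1/N) * sum of x_i."""
    return sum(xs) / len(xs)

def central_moment(xs, m):
    """m-th central moment: average of (x_i - mean)^m, with 1/N normalisation."""
    mu = mean(xs)
    return sum((x - mu) ** m for x in xs) / len(xs)

def variance(xs):
    """Variance is the second central moment."""
    return central_moment(xs, 2)

def std_dev(xs):
    """Standard deviation is the square root of the variance."""
    return math.sqrt(variance(xs))

throws = [3, 4, 4, 4, 1, 3, 4, 5, 1, 6, 6, 4, 5, 5, 3, 6, 5, 4, 4, 1]

print(f"mean     = {mean(throws):.2f}")
print(f"variance = {variance(throws):.2f}")
print(f"std dev  = {std_dev(throws):.2f}")
```

For these throws the mean is 3.9 and the variance is 2.29. The first central moment is always zero by construction, and the second is the variance.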
[Figure: Cumulative distribution P(return < X); x-axis: return (-1.5 to 2), y-axis: probability (0 to 1)]
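The quantity P(return < X) can be estimated from data as the fraction of observations below X. The sketch below uses a small made-up sample of returns (not the lecture's data); the function name `ecdf` is hypothetical.

```python
def ecdf(values, x):
    """Fraction of observations strictly below x: an estimate of P(value < x)."""
    return sum(1 for v in values if v < x) / len(values)

# Made-up returns for illustration only (NOT the lecture's data).
sample_returns = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.6]

for x in (-1.5, 0.0, 1.0, 2.0):
    print(f"P(return < {x:+.1f}) ~ {ecdf(sample_returns, x):.3f}")
```

The estimate rises from 0 to 1 as X increases and never decreases, which is the defining shape of a cumulative distribution.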
Add noise to each observation (impose a kernel, typically a Gaussian kernel):
$\hat{p}(x_0) = \frac{1}{N\sqrt{2\pi\sigma^2}} \sum_{i=1}^{N} \exp\left(-\frac{(x_i - x_0)^2}{2\sigma^2}\right)$
• Increasing the kernel bandwidth σ leads to a smoother distribution
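The kernel density estimate above can be sketched in a few lines. The example assumes a Gaussian kernel with bandwidth σ and uses two made-up observations to show the smoothing effect; the function name `gaussian_kde` is illustrative and is not a library call.

```python
import math

def gaussian_kde(x0, samples, bandwidth):
    """Kernel density estimate at x0 using a Gaussian kernel of width `bandwidth`."""
    n = len(samples)
    norm = n * math.sqrt(2 * math.pi * bandwidth ** 2)
    kernel_sum = sum(
        math.exp(-((xi - x0) ** 2) / (2 * bandwidth ** 2)) for xi in samples
    )
    return kernel_sum / norm

# Two made-up observations, far apart (illustration only).
samples = [0.0, 10.0]

for bandwidth in (1.0, 5.0):
    at_observation = gaussian_kde(0.0, samples, bandwidth)
    at_midpoint = gaussian_kde(5.0, samples, bandwidth)
    print(f"sigma = {bandwidth}: p(0) = {at_observation:.4f}, p(5) = {at_midpoint:.4f}")
```

With the small bandwidth the estimate has two separate bumps, one at each observation, and almost no density in between. With the larger bandwidth the bumps merge into a single smooth mode centred between the observations.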
You will notice that I have placed a lot of
emphasis on densities
Boxplot
• Easy to understand
• Portrays outliers
[Figure: boxplot with the interquartile range (IQR) and an outlier marked]
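The quantities a boxplot displays can be computed directly. The sketch below uses made-up data and assumes the common convention of flagging points more than 1.5 × IQR beyond the quartiles as outliers; the slide does not state which rule it uses, and quartile definitions also vary between tools.

```python
import statistics

# Made-up data for illustration only, with one obviously extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

# Quartiles via linear interpolation between data points ("inclusive" method).
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# ASSUMPTION: 1.5 * IQR fences, a widely used convention for boxplot outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
print(f"fences = [{lower_fence}, {upper_fence}], outliers = {outliers}")
```

The box spans Q1 to Q3, so its height is the IQR, and any point outside the fences is drawn individually.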
Two-dimensional plot to visualize how one
variable is related to another
H.J. Seltman: Experimental design and analysis
(chapter 3: pp. 19-46)
http://www.stat.cmu.edu/~hseltman/309/Book/Book.pdf