Day 1: Descriptive and Summary Statistics
BIO5312 FALL2017
STEPHANIE J. SPIELMAN, PHD
Logistics
All course materials will be hosted here: http://sjspielman.org/bio5312_fall2017
Submit assignments via Canvas: https://templeu.instructure.com
Please bring your laptop to class!!!
We use statistics to make inferences about phenomena from samples and to quantify
uncertainty in data
Inaccurate
Bias
Pop quiz: Is it random?
A researcher selects the first 58 student volunteers that sign up for a study
A computer program numbers all residents in a community, and then uses a random-number
generator to select 26 residents
A researcher vigorously shakes a box containing equally sized balls and takes the first 3 that fall
out of the box.
A researcher selects all study participants whose first name starts with an A, B, K, M, or O.
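The random-number-generator scenario can be sketched in Python. This is only an illustration: the community size of 500 and the seed are assumptions, not values from the course.

```python
import random

# Hypothetical community: 500 residents numbered 1..500 (size is an assumption).
residents = list(range(1, 501))

# A seeded generator keeps this illustration reproducible.
rng = random.Random(5312)

# Draw 26 distinct residents; every resident is equally likely to be chosen.
sample = rng.sample(residents, k=26)
print(sorted(sample))
```

Sampling without replacement (`sample`) matches the slide: no resident can be selected twice.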
Descriptive and Summary Statistics
Tools to concisely describe data, numerically and visually
Discrete
◦ Values are in indivisible units, i.e. whole or counting numbers
◦ Includes count data (number of cups of coffee per day, number of amino acids in a protein…)
Categorical data
Nominal
◦ Hair color, eye color, sex genotypes (XX, XY, XXY, XYY, XO).
Binary
◦ Yes/No
◦ True/False
Bonus: names of sex genotypes?
Measures of Location
Continuous
Mean
◦ $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$
Median
◦ For odd $n$, the $\frac{n+1}{2}$th observation in sorted order
◦ For even $n$, the average of the $\frac{n}{2}$th and $\left(\frac{n}{2}+1\right)$th observations in sorted order
Discrete
Mode
◦ The most frequently appearing observation in the distribution (commonly used for discrete data)
◦ 1, 2, 2, 2, 3, 4, 4, 5, 6 → 2
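A minimal Python sketch of the three measures of location, using the standard `statistics` module and the values from the mode example. The even-length list is an extra illustrative dataset.

```python
import statistics

data = [1, 2, 2, 2, 3, 4, 4, 5, 6]   # values from the mode example

mean = statistics.mean(data)      # sum of the values divided by n
median = statistics.median(data)  # n = 9 is odd, so this is the 5th sorted value
mode = statistics.mode(data)      # most frequently appearing value

print(f"mean = {mean:.3f}, median = {median}, mode = {mode}")

# For even n, the median averages the two middle sorted values.
even_median = statistics.median([1, 2, 3, 7, 9, 500])  # (3 + 7) / 2
print(even_median)
```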
Measures of location in distributions
http://i.imgur.com/YSEYhha.jpg
Measures of spread
Range
Standard deviation and variance
Interquartile range
Range
Difference between largest and smallest value in a distribution
◦ 1, 2, 3, 7, 9 → 8
◦ 1, 2, 3, 7, 9, 500 → 499
The range is very sensitive to extreme observations: a single extreme value (500 above) inflates it from 8 to 499.
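The two range examples can be checked with a short Python sketch. The helper name `value_range` is made up for illustration.

```python
def value_range(values):
    """Return the difference between the largest and smallest value."""
    return max(values) - min(values)

without_extreme = value_range([1, 2, 3, 7, 9])
with_extreme = value_range([1, 2, 3, 7, 9, 500])

print(without_extreme)  # 8
print(with_extreme)     # 499
```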
Standard deviation and variance
Generally discussed in the context of mean
Deviation describes how far each data point $Y_i$ lies from the mean $\bar{Y}$:
◦ $Y_1 - \bar{Y},\ Y_2 - \bar{Y},\ Y_3 - \bar{Y},\ \ldots,\ Y_n - \bar{Y}$
Variance
◦ $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2$
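A hedged sketch of how the deviations lead to the sample variance and standard deviation. The data are borrowed from the Range slide as an assumed example, and the results are cross-checked against Python's `statistics` module.

```python
import math
import statistics

data = [1, 2, 3, 7, 9]   # assumed example data (borrowed from the Range slide)
n = len(data)
y_bar = sum(data) / n

# Deviation of each observation from the mean.
deviations = [y - y_bar for y in data]

# Raw deviations sum to (approximately) zero, so they are squared before averaging.
s2 = sum(d ** 2 for d in deviations) / (n - 1)   # sample variance
s = math.sqrt(s2)                                # sample standard deviation

# Cross-check against the standard library.
assert math.isclose(s2, statistics.variance(data))
assert math.isclose(s, statistics.stdev(data))

print(f"variance = {s2:.2f}, standard deviation = {s:.2f}")
```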
Interquartile range
Generally discussed in the context of median
Quartiles divide the data into four equal parts (“quar”!)
Interquartile range (IQR) is the difference between the third and first quartile
◦ How much of the data does the IQR encompass?
Interquartile range
1.25 1.64 1.91 2.31 2.37 2.38 2.84 2.87 2.93 2.94 2.98 3.00 3.09 3.22 3.41 3.55
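A sketch of the IQR for the 16 values above. Quartile conventions differ between textbooks and software. This example assumes the "median of each half" convention, which may not be the one used in class.

```python
import statistics

data = [1.25, 1.64, 1.91, 2.31, 2.37, 2.38, 2.84, 2.87,
        2.93, 2.94, 2.98, 3.00, 3.09, 3.22, 3.41, 3.55]

values = sorted(data)
half = len(values) // 2

# Assumed convention: Q1 and Q3 are the medians of the lower and upper halves.
q1 = statistics.median(values[:half])
q3 = statistics.median(values[-half:])
iqr = q3 - q1

# Share of observations lying between Q1 and Q3.
inside = [v for v in values if q1 <= v <= q3]
fraction = len(inside) / len(values)

print(f"Q1 = {q1:.3f}, Q3 = {q3:.3f}, IQR = {iqr:.3f}")  # Q1 = 2.340, Q3 = 3.045, IQR = 0.705
print(f"{fraction:.0%} of the data lie within the IQR")   # 50%
```

Under this convention the IQR spans the middle half of the data.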
Coefficient of variation (COV)
◦ $\mathrm{COV} = \frac{s}{\bar{Y}} \times 100\%$
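A small sketch of the COV formula (the coefficient of variation, $s / \bar{Y} \times 100\%$). The height measurements are hypothetical. Because the COV is a ratio, it does not depend on the units of measurement.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

# Hypothetical heights, recorded in two different units.
heights_cm = [150.0, 160.0, 170.0, 180.0]
heights_m = [h / 100 for h in heights_cm]

cov_cm = coefficient_of_variation(heights_cm)
cov_m = coefficient_of_variation(heights_m)

# Both values agree (about 7.8%) even though the units differ.
print(f"{cov_cm:.1f}% vs {cov_m:.1f}%")
```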
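Sample statistics vs. population parameters
Mean
◦ Sample: $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$
◦ Population: $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$
Standard deviation
◦ Sample: $s = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}{n-1}}$
◦ Population: $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}$
Variance
◦ Sample: $s^2$
◦ Population: $\sigma^2$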
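The practical difference between the two columns is the divisor: $n - 1$ for a sample, $n$ for a population. A minimal sketch with Python's `statistics` module, using assumed example data:

```python
import statistics

data = [1, 2, 3, 7, 9]   # assumed example data

# Sample versions divide the sum of squared deviations by n - 1.
sample_var = statistics.variance(data)
sample_sd = statistics.stdev(data)

# Population versions divide by n.
pop_var = statistics.pvariance(data)
pop_sd = statistics.pstdev(data)

print(f"sample:     s^2 = {sample_var:.2f}, s = {sample_sd:.2f}")
print(f"population: sigma^2 = {pop_var:.2f}, sigma = {pop_sd:.2f}")
```

For the same numbers, the sample estimates are always slightly larger than the population versions.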
Visualizing data
Different types of plots are used to represent different types of data
Continuous data
Histogram
Density plot
Boxplot
Violin plot
Discrete data
Bar plot
[Figure: plot of Count (y-axis) against Value (x-axis)]
Using histograms to describe distributions
[Figure: histogram (y-axis: count) alongside density plots (y-axis: density); x-axis: Value]
Boxplot
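Graphical representation of a five-number summary
“Whiskers” extend to the most extreme data points within 1.5 × IQR below Q1 or above Q3
Points beyond the whiskers are drawn individually as outliers
[Figure: boxplot with labeled whiskers, Q3, median, Q1, IQR, and outliers; y-axis: Value]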
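A sketch of the numbers behind a boxplot: the five-number summary, the 1.5 × IQR fences, and the resulting whiskers and outliers. The data are hypothetical. Quartiles use the "median of each half" convention, which is an assumption, since plotting software may compute them differently.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), with quartiles as medians of each half."""
    values = sorted(values)
    half = len(values) // 2
    return (values[0],
            statistics.median(values[:half]),
            statistics.median(values),
            statistics.median(values[-half:]),
            values[-1])

# Hypothetical data with one unusually low value.
data = [-4.1, -1.2, -0.8, -0.5, -0.1, 0.0, 0.2, 0.4, 0.9, 1.3, 1.8]

minimum, q1, median, q3, maximum = five_number_summary(data)
iqr = q3 - q1

# Fences mark the furthest a whisker may reach.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences.
in_range = [v for v in data if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = min(in_range), max(in_range)
outliers = [v for v in data if v < lower_fence or v > upper_fence]

print("five-number summary:", minimum, q1, median, q3, maximum)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```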
Boxplots: The plot thickens*
[Figure: boxplots (y-axis: Value) and histograms (x-axis: Value, y-axis: Count) of a bimodal and a unimodal distribution]
*Pun intended.
What can we say about this distribution based on its boxplot?
[Figure: boxplot; y-axis: Value]
Symmetry? Asymmetric
Skewness? Right-skewed
Modality? Unclear
Violin plot: Density meets boxplot
[Figure: violin plot, density plot, and boxplot of samples from N(5, 4), N(2, 1), and N(4, 0.09); axes: x, value, density]
Barplot
[Figure: bar plot of flowers in a garden; x-axis: flower color (orange, pink, red, white), y-axis: Count]
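A bar plot shows one bar per category, with height equal to that category's count. A minimal text-only sketch of the tally, using made-up flower observations (not the data behind the figure):

```python
from collections import Counter

# Hypothetical observations of flower color (not the data shown in the figure).
flowers = ["red", "white", "pink", "red", "orange", "red", "white", "pink"]

# Tally how many times each category appears.
counts = Counter(flowers)

# One text "bar" per category; the bar length is the count.
for color in sorted(counts):
    print(f"{color:<7}{'#' * counts[color]} ({counts[color]})")
```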
Cautionary tale in barplots
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002128
Scatterplot
[Figure: two scatterplots; x-axis: Variable 1 (explanatory/independent variable), y-axis: Variable 2 (response/dependent variable)]
Time series data
[Figure: time series plots; axes: Year (1990 to 2003) and Value]