Day1 Descriptive and Summary

Uploaded by

abery.au
Copyright
© © All Rights Reserved
Descriptive and Summary Statistics
BIO5312 FALL2017
STEPHANIE J. SPIELMAN, PHD
Logistics
All course materials will be hosted here: http://sjspielman.org/bio5312_fall2017
Submit assignments via Canvas: https://templeu.instructure.com
Please bring your laptop to class!!!

Office SERC 643


◦ Weekly office hours Friday 1-3 ground floor of SERC ← vote?
Course goals
The primary goal is to analyze, interpret, and visualize data in the biological sciences
Achieved via statistical analysis and data science techniques in R

This is not a course in statistical theory.


Course topics
Descriptive and Summary Statistics
Data visualization
Fundamentals in probability, distributions
Statistical inference: hypothesis testing and confidence intervals
Linear modeling
Multiple testing
Binary classification
Clustering methods
Special topics in current biological data analysis
But first, what are we doing here?
Statistics is the study of the collection, analysis, interpretation, presentation, and organization of
data.

We use statistics to make inferences about phenomena using samples and quantify
uncertainty of data

Biostatistics is (surprisingly!) a branch of applied statistics geared towards medical and biological problems
Populations and samples
Populations are the entire collection of individuals/units/etc. a researcher is interested in
◦ Generally we can never know the true composition of a population
◦ Populations are described with parameters

Samples are subsets of individuals/units from populations


◦ We use hypothesis testing to (try to) draw population-level conclusions from samples
◦ Samples are described with estimates

Parameters and estimates use different notations, as we will see


What makes a good sample?
In an ideal world, a sample is unbiased and features low sampling error
◦ Bias is a systematic discrepancy between estimate and parameter

Samples should be randomly chosen
◦ Each population unit should have an equal and independent chance of being chosen for a given sample

[Figure: grid crossing sampling error (precise vs. imprecise) with bias (accurate vs. inaccurate); the ideal sample has low bias and low sampling error]
Pop quiz: Is it random?
A researcher selects the first 58 student volunteers that sign up for a study

A computer program numbers all residents in a community, and then uses a random-number
generator to select 26 residents

A researcher vigorously shakes a box containing equally sized balls and takes the first 3 that fall
out of the box.

A researcher selects all study participants whose first name starts with an A, B, K, M, or O.
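
The second scenario (number every resident, then let a random-number generator pick 26) can be sketched in a few lines. This is only an illustration in Python (the course itself uses R); the community size of 500, the seed, and the function name are assumptions made up for the example.

```python
import random


def draw_simple_random_sample(population_ids, sample_size, seed=None):
    """Draw a sample without replacement; every unit is equally likely to be chosen."""
    rng = random.Random(seed)  # a seeded generator makes the draw reproducible
    return rng.sample(population_ids, sample_size)


# Hypothetical community: residents numbered 1..500 (size is an assumption).
residents = list(range(1, 501))
chosen = draw_simple_random_sample(residents, 26, seed=5312)
print(sorted(chosen))
```

Contrast this with the first scenario: volunteers select themselves, so no step in the procedure gives every student an equal chance of being included.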
Descriptive and Summary Statistics
Tools to concisely describe data, numerically and visually

Generally the first step in data exploration and statistical analysis


◦ Identify missing values, outliers, etc.
◦ Check assumptions required to fit models or perform statistical tests
◦ Identify trends that merit further study
Types of data
How you analyze and visualize data depends on the type of data you have

Quantitative data
◦ Continuous
◦ Discrete (includes count data)

Categorical data
◦ Nominal
◦ Ordinal
◦ Binary*
Quantitative data
Continuous
◦ Any real-number value within some range

Discrete
◦ Values are in indivisible units, i.e. whole or counting numbers
◦ Includes count data (number of cups of coffee per day, number of amino acids in a protein…)
Categorical data
Nominal
◦ Hair color, eye color, sex genotypes (XX, XY, XXY, XYY, XO).

Ordinal – categories with a natural ordering


◦ Bad, fair, good, excellent
◦ A, B, C, D

Binary
◦ Yes/No
◦ True/False
Bonus: names of sex genotypes?
Measures of Location
Continuous

Mean
◦ $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$

Median
◦ For odd n, the $\frac{n+1}{2}$th observation
◦ For even n, the average of the $\frac{n}{2}$th and $\left(\frac{n}{2} + 1\right)$th observations

Discrete

Mode
◦ The most frequently appearing observation in the distribution (commonly used for discrete data)
◦ 1, 2, 2, 2, 3, 4, 4, 5, 6 → 2
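
As a quick check of the three measures of location, the sketch below computes them for the nine values used in the mode example. It is written in Python purely for illustration (the course itself uses R) and relies only on the standard-library `statistics` module.

```python
import statistics

# The nine observations from the mode example.
observations = [1, 2, 2, 2, 3, 4, 4, 5, 6]

mean_value = statistics.mean(observations)      # sum of the values divided by n
median_value = statistics.median(observations)  # n = 9 is odd, so the 5th sorted value
mode_value = statistics.mode(observations)      # the most frequent value

print(mean_value, median_value, mode_value)
```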
Measures of location in distributions

http://i.imgur.com/YSEYhha.jpg
Measures of spread
Range
Standard deviation and variance
Interquartile range
Range
Difference between largest and smallest value in a distribution
◦ 1, 2, 3, 7, 9 → 8
◦ 1, 2, 3, 7, 9, 500 → 499

Range is very sensitive to extreme observations: a single outlier can dominate it.
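
The two range examples can be reproduced directly. This is a minimal Python sketch for illustration (the course itself uses R); the function name `value_range` is an assumption.

```python
def value_range(values):
    """Range: the difference between the largest and the smallest value."""
    return max(values) - min(values)


without_outlier = [1, 2, 3, 7, 9]
with_outlier = [1, 2, 3, 7, 9, 500]

# One extreme observation changes the range from 8 to 499.
print(value_range(without_outlier), value_range(with_outlier))
```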
Standard deviation and variance
Generally discussed in the context of the mean

Deviation describes how far each data point lies from the mean $\bar{Y}$:
◦ $Y_1 - \bar{Y},\ Y_2 - \bar{Y},\ Y_3 - \bar{Y},\ \ldots,\ Y_n - \bar{Y}$

Standard deviation of a sample
◦ $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2}$

Variance
◦ $s^2$
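
To make the sample standard deviation formula concrete, the sketch below computes it step by step: deviations from the mean, squared, summed, divided by n − 1, then square-rooted. It is a Python illustration (the course itself uses R); the five data values are reused from the range example, and the function name is an assumption.

```python
import math
import statistics


def sample_standard_deviation(values):
    """Sample standard deviation computed with the n - 1 denominator."""
    n = len(values)
    mean_value = sum(values) / n
    squared_deviations = [(y - mean_value) ** 2 for y in values]
    return math.sqrt(sum(squared_deviations) / (n - 1))


values = [1, 2, 3, 7, 9]
s = sample_standard_deviation(values)
variance = s ** 2  # the variance is the square of the standard deviation

# The hand-rolled result should agree with the standard library.
print(s, statistics.stdev(values), variance)
```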
Interquartile range
Generally discussed in the context of median
Quartiles divide the data into four equal parts (“quar”!)
Interquartile range (IQR) is the difference between the third and first quartile
◦ How much of the data does the IQR encompass?

[Figure: the sorted data below with the first quartile, median, and third quartile marked; the interquartile range spans the first to the third quartile]

1.25 1.64 1.91 2.31 2.37 2.38 2.84 2.87 2.93 2.94 2.98 3.00 3.09 3.22 3.41 3.55

Five number summary: min, Q1, median, Q3, max
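
A five-number summary for the 16 values above can be sketched as follows. Quartile conventions differ between textbooks and software, so this Python illustration (the course itself uses R) assumes one simple convention: Q1 and Q3 are the medians of the lower and upper halves of the sorted data. The function name is also an assumption.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), taking quartiles as medians of each half."""
    ordered = sorted(values)
    half = len(ordered) // 2  # for odd n the middle value is left out of both halves
    if half == 0:
        raise ValueError("need at least two observations")
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[-half:])
    return ordered[0], q1, statistics.median(ordered), q3, ordered[-1]


data = [1.25, 1.64, 1.91, 2.31, 2.37, 2.38, 2.84, 2.87,
        2.93, 2.94, 2.98, 3.00, 3.09, 3.22, 3.41, 3.55]

minimum, q1, median, q3, maximum = five_number_summary(data)
iqr = q3 - q1  # interquartile range: third quartile minus first quartile
print(minimum, q1, median, q3, maximum, iqr)
```

Other conventions, such as interpolating between neighbouring observations, give slightly different quartiles for the same data.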


Mean or median?
The median is much more robust to outliers than the mean.

[Figure: two example distributions, each with its mean marked]

Which would you choose for a symmetric distribution and why?
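
A small numerical comparison shows the robustness claim in action. The sketch is in Python for illustration (the course itself uses R) and reuses the values from the range example, where 500 plays the role of the outlier.

```python
import statistics

without_outlier = [1, 2, 3, 7, 9]
with_outlier = [1, 2, 3, 7, 9, 500]

mean_before = statistics.mean(without_outlier)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(without_outlier)
median_after = statistics.median(with_outlier)

# The mean is dragged far towards the outlier; the median barely moves.
print(mean_before, mean_after)
print(median_before, median_after)
```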


Measures of variability
Coefficient of variation is the standard deviation of a sample expressed as a percentage of the
sample mean (aka normalized)

◦ $COV = \frac{s}{\bar{Y}} \times 100\%$

◦ Useful measure for comparing variability between two differently-scaled datasets
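
The sketch below illustrates why the coefficient of variation suits differently scaled datasets: expressing the same measurements in different units changes the standard deviation but not the coefficient of variation. It is a Python illustration (the course itself uses R); the measurements, their units, and the function name are made up for the example.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the sample mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


# Hypothetical lengths recorded in centimetres, then converted to millimetres.
lengths_cm = [1, 2, 3, 7, 9]
lengths_mm = [10 * length for length in lengths_cm]

cov_cm = coefficient_of_variation(lengths_cm)
cov_mm = coefficient_of_variation(lengths_mm)

# The standard deviations differ by a factor of 10, the coefficients do not.
print(statistics.stdev(lengths_cm), statistics.stdev(lengths_mm))
print(cov_cm, cov_mm)
```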


Sample vs population notation
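Measurement | Sample estimate | Population parameter
Mean | $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$ | $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$
Standard deviation | $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2}$ | $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}$
Variance | $s^2$ | $\sigma^2$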
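
The practical difference between the two columns is the denominator: n − 1 for a sample estimate, n for a population parameter. Python's standard library exposes both, as the sketch below shows. This is an illustration only (the course itself uses R), and the five values are made up for the example.

```python
import statistics

values = [1, 2, 3, 7, 9]

# Treating the values as a sample: denominator n - 1.
sample_variance = statistics.variance(values)
sample_sd = statistics.stdev(values)

# Treating the same values as the entire population: denominator n.
population_variance = statistics.pvariance(values)
population_sd = statistics.pstdev(values)

print(sample_variance, population_variance)
print(sample_sd, population_sd)
```

The sample versions are always at least as large as the population versions for the same numbers, and the gap shrinks as n grows.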
Visualizing data
Different types of plots are used to represent different types of data

Continuous data
Histogram
Density plot
Boxplot
Violin plot

Discrete data
Bar plot

Comparing two continuous variables


Scatterplot

Trend over time


Line plot
Histogram
[Figure: histogram; x-axis: Value, y-axis: Count]
Using histograms to describe distributions

[Figure: four histogram panels: Uniform, Bell-shaped, Asymmetric (skewed), Bimodal]


Density plots smoothen histograms
[Figure: a density plot shown alone and overlaid on a histogram of the same data; x-axis: Value, y-axes: Density and Count]
Boxplot
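Graphical representation of a five-number summary

“Whiskers” extend to the most extreme data points within 1.5 × IQR of the box (below Q1 and above Q3); points beyond the whiskers are drawn as outliers

[Figure: annotated boxplot marking Q1, the median, Q3, the IQR, the whiskers, and outliers; y-axis: Value]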
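
The 1.5 × IQR whisker rule can be sketched numerically. This Python illustration (the course itself uses R) reuses the 16 values from the interquartile range example. It assumes quartiles are taken as the medians of the lower and upper halves of the sorted data, and the function name is an assumption. Plotting software may use a slightly different quartile convention.

```python
import statistics


def boxplot_whiskers(values, multiplier=1.5):
    """Return (lower whisker, upper whisker, outliers) under the 1.5 x IQR rule."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[-half:])
    iqr = q3 - q1
    lower_fence = q1 - multiplier * iqr
    upper_fence = q3 + multiplier * iqr
    # Whiskers end at the most extreme observations still inside the fences.
    inside = [v for v in ordered if lower_fence <= v <= upper_fence]
    outliers = [v for v in ordered if v < lower_fence or v > upper_fence]
    return inside[0], inside[-1], outliers


data = [1.25, 1.64, 1.91, 2.31, 2.37, 2.38, 2.84, 2.87,
        2.93, 2.94, 2.98, 3.00, 3.09, 3.22, 3.41, 3.55]

lower_whisker, upper_whisker, outliers = boxplot_whiskers(data)
print(lower_whisker, upper_whisker, outliers)
```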
Boxplots: The plot thickens*
[Figure: boxplots (x-axis: Distributions, y-axis: Value) and histograms (x-axis: Value, y-axis: Count) of a bimodal and a unimodal distribution]
*Pun intended.
What can we say about this distribution based on its boxplot?
Symmetry? Asymmetric
Skewness? Right-skewed
Modality? Unclear

[Figure: boxplot; y-axis: Value]
Violin plot: Density meets boxplot
[Figure: violin plot, density plot, and boxplot of the same three distributions: N(5, 4), N(2, 1), N(4, 0.09)]
Barplot
[Figure: bar plot of flower color counts (orange, pink, red, white); x-axis: Flowers in garden, y-axis: Count]
Cautionary tale in barplots

http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002128
Scatterplot
[Figure: two scatterplots of Variable 2 against Variable 1; x-axis: explanatory/independent variable, y-axis: response/dependent variable]
Time series data

[Figure: line plot of Value against Year, shown alongside a second plot of the same data with Year on the vertical axis and Value on the horizontal axis]
BREAK
