Descriptive and Summary Statistics
BIO5312 FALL2017
STEPHANIE J. SPIELMAN, PHD
Logistics
All course materials will be hosted here: http://sjspielman.org/bio5312_fall2017
Submit assignments via Canvas: https://templeu.instructure.com
Please bring your laptop to class!!!

Office: SERC 643
◦ Weekly office hours: Friday 1-3, ground floor of SERC ← vote?
Course goals
The primary goal is to analyze, interpret, and visualize data in the biological sciences
Achieved via statistical analysis and data science techniques in R

This is not a course in statistical theory.

Course topics
Descriptive and Summary Statistics
Data visualization
Fundamentals in probability, distributions
Statistical inference: hypothesis testing and confidence intervals
Linear modeling
Multiple testing
Binary classification
Clustering methods
Special topics in current biological data analysis
But first, what are we doing here?
Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data.

We use statistics to make inferences about phenomena from samples and to quantify the uncertainty in data

Biostatistics is (surprisingly!) a branch of applied statistics geared towards medical and biological problems
Populations and samples
A population is the entire collection of individuals/units/etc. a researcher is interested in
◦ Generally we can never know the true composition of a population
◦ Populations are described with parameters

Samples are subsets of individuals/units from populations

◦ We use hypothesis testing to (try to) draw population-level conclusions from samples
◦ Samples are described with estimates

Parameters and estimates use different notations, as we will see

What makes a good sample?
In an ideal world, a sample is unbiased and features low sampling error
◦ Bias is a systematic discrepancy between estimate and parameter

Samples should be randomly chosen
◦ Each population unit should have an equal and independent chance of being chosen for a given sample

[Figure: 2×2 grid crossing sampling error (precise vs. imprecise) with bias (accurate vs. inaccurate); the goal is low bias and low sampling error]
Pop quiz: Is it random?
A researcher selects the first 58 student volunteers that sign up for a study

A computer program numbers all residents in a community, and then uses a random-number
generator to select 26 residents

A researcher vigorously shakes a box containing equally sized balls and takes the first 3 that fall
out of the box.

A researcher selects all study participants whose first name starts with an A, B, K, M, or O.
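The random-number-generator scenario above can be sketched in a few lines. This is a minimal Python illustration, not course code: the community size of 500, the seed, and the helper name `simple_random_sample` are all assumptions made for the example.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k distinct units; every unit has the same chance of selection."""
    rng = random.Random(seed)  # a seed only makes the draw reproducible
    return rng.sample(population, k)

# Hypothetical community: residents numbered 1 to 500.
residents = list(range(1, 501))

# Select 26 residents, as in the quiz scenario.
chosen = simple_random_sample(residents, 26, seed=5312)
print(sorted(chosen))
```

Selection here depends only on the random draw, not on when someone volunteered or what their name is, which is what separates it from the first and last scenarios.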
Descriptive and Summary Statistics
Tools to concisely describe data, numerically and visually

Generally the first step in data exploration and statistical analysis

◦ Identify missing values, outliers, etc.
◦ Check assumptions required to fit models or perform statistical tests
◦ Identify trends that merit further study
Types of data
How you analyze and visualize data depends on the type of data you have

Quantitative data
◦ Continuous
◦ Discrete (includes count data)

Categorical data
◦ Nominal
◦ Ordinal
◦ Binary*
Quantitative data
Continuous
◦ Any real-number value within some range

Discrete
◦ Values are in indivisible units, i.e. whole or counting numbers
◦ Includes count data (number of cups of coffee per day, number of amino acids in a protein…)
Categorical data
Nominal
◦ Hair color, eye color, sex genotypes (XX, XY, XXY, XYY, XO).

Ordinal – categories with a natural ordering

◦ Bad, fair, good, excellent
◦ A, B, C, D

Binary
◦ Yes/No
◦ True/False
Bonus: names of sex genotypes?
Measures of Location
Continuous

Mean
◦ $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$

Median (observations sorted from smallest to largest)
◦ For odd n, the $\frac{n+1}{2}$th observation
◦ For even n, the average of the $\frac{n}{2}$th and $\left(\frac{n}{2}+1\right)$th observations

Discrete

Mode
◦ The most frequently appearing observation in the distribution (commonly used for discrete data)
◦ 1, 2, 2, 2, 3, 4, 4, 5, 6 → 2
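As a quick check of the three definitions above, here is a small Python sketch that computes each measure on the mode example from this slide. The mean and median follow the formulas directly; `statistics.mode` comes from the Python standard library.

```python
import statistics

values = [1, 2, 2, 2, 3, 4, 4, 5, 6]  # the mode example from the slide
n = len(values)

# Mean: sum of the observations divided by n.
mean = sum(values) / n

# Median: middle observation of the sorted data (odd n), or the
# average of the two middle observations (even n).
ordered = sorted(values)
if n % 2 == 1:
    median = ordered[(n + 1) // 2 - 1]  # -1 because Python counts from 0
else:
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

# Mode: the most frequently appearing observation.
mode = statistics.mode(values)

print(f"mean = {mean:.3f}, median = {median}, mode = {mode}")
```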
Measures of location in distributions

http://i.imgur.com/YSEYhha.jpg
Measures of spread
Range
Standard deviation and variance
Interquartile range
Range
Difference between largest and smallest value in a distribution
◦ 1, 2, 3, 7, 9 → 8
◦ 1, 2, 3, 7, 9, 500 → 499

Range is very sensitive to extreme observations: a single extreme value (here, 500) moves it from 8 to 499.
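The two range examples can be reproduced with a one-line helper. This is a Python sketch; the function name `value_range` is made up for the illustration.

```python
def value_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)

small = [1, 2, 3, 7, 9]
with_extreme = small + [500]  # one extreme observation added

print(value_range(small))
print(value_range(with_extreme))
```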
Standard deviation and variance
Generally discussed in the context of mean

Deviation describes how far each data point lies from the mean $\bar{Y}$:
◦ $Y_1 - \bar{Y},\; Y_2 - \bar{Y},\; Y_3 - \bar{Y},\; \ldots,\; Y_n - \bar{Y}$

Standard deviation of a sample
◦ $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2}$

Variance
◦ $s^2$
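To make the formula concrete, the Python sketch below builds the sample standard deviation step by step (deviations, squared deviations, division by n − 1, square root) and compares it with the standard library's `statistics.stdev`. The five data values are an arbitrary example, not course data.

```python
import math
import statistics

sample = [1, 2, 3, 7, 9]  # arbitrary example values
n = len(sample)

y_bar = sum(sample) / n                    # sample mean
deviations = [y - y_bar for y in sample]   # Y_i minus the mean

# Sample variance: sum of squared deviations divided by n - 1.
variance = sum(d ** 2 for d in deviations) / (n - 1)

# Sample standard deviation: square root of the variance.
sd = math.sqrt(variance)

print(f"mean = {y_bar}, s^2 = {variance:.2f}, s = {sd:.3f}")
print("matches statistics.stdev:", math.isclose(sd, statistics.stdev(sample)))
```

The raw deviations always sum to zero, which is why they are squared before averaging.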
Interquartile range
Generally discussed in the context of median
Quartiles divide the data into four equal parts (“quar”!)
Interquartile range (IQR) is the difference between the third and first quartile
◦ How much of the data does the IQR encompass?

[Figure: sorted example data with the first quartile, median, and third quartile marked]

1.25 1.64 1.91 2.31 2.37 2.38 2.84 2.87 2.93 2.94 2.98 3.00 3.09 3.22 3.41 3.55

Five number summary: min, Q1, median, Q3, max
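The five-number summary and IQR for the 16 observations on this slide can be computed as follows. This is a Python sketch using `statistics.quantiles`; note the assumption that its default "exclusive" method is acceptable. Quartile conventions differ between textbooks and software, so Q1 and Q3 may not match other tools exactly.

```python
import statistics

# The 16 sorted observations shown on the slide.
data = [1.25, 1.64, 1.91, 2.31, 2.37, 2.38, 2.84, 2.87,
        2.93, 2.94, 2.98, 3.00, 3.09, 3.22, 3.41, 3.55]

# With n=4, quantiles() returns the three quartile cut points.
# Assumption: default "exclusive" method; other conventions give
# slightly different Q1 and Q3.
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
five_number_summary = (min(data), q1, median, q3, max(data))

# Share of the observations that fall between Q1 and Q3.
inside = [y for y in data if q1 <= y <= q3]
share_inside = len(inside) / len(data)

print(five_number_summary)
print(f"IQR = {iqr:.3f}; share of data inside the IQR = {share_inside:.0%}")
```

Counting the observations between Q1 and Q3 also answers the question posed above about how much of the data the IQR encompasses.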

Mean or median?
The median is much more robust to outliers than the mean.

Which would you choose for a symmetric distribution and why?
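The robustness claim can be checked numerically. The Python sketch below reuses the small dataset from the range slide and adds the single extreme value 500; the variable names are chosen for the illustration.

```python
import statistics

clean = [1, 2, 3, 7, 9]
with_outlier = clean + [500]  # one extreme observation added

mean_clean = statistics.mean(clean)
median_clean = statistics.median(clean)
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)

print(f"without outlier: mean = {mean_clean}, median = {median_clean}")
print(f"with outlier:    mean = {mean_outlier}, median = {median_outlier}")
```

One extreme value drags the mean far from the bulk of the data, while the median moves only slightly.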

Measures of variability
Coefficient of variation is the standard deviation of a sample expressed as a percentage of the sample mean (aka normalized)

◦ $COV = \frac{s}{\bar{Y}} \times 100\%$

◦ Useful measure for comparing variability between two differently-scaled datasets
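A short Python sketch shows why the coefficient of variation suits differently-scaled data: rescaling the measurements changes the standard deviation but leaves the coefficient unchanged. The measurements and the helper name `coefficient_of_variation` are hypothetical.

```python
import statistics

def coefficient_of_variation(sample):
    """Sample standard deviation as a percentage of the sample mean."""
    return statistics.stdev(sample) / statistics.mean(sample) * 100

# Hypothetical lengths, recorded in millimetres and again in centimetres.
lengths_mm = [12.0, 15.0, 11.0, 14.0, 13.0]
lengths_cm = [x / 10 for x in lengths_mm]

sd_mm = statistics.stdev(lengths_mm)
sd_cm = statistics.stdev(lengths_cm)
cv_mm = coefficient_of_variation(lengths_mm)
cv_cm = coefficient_of_variation(lengths_cm)

print(f"s:   {sd_mm:.3f} mm vs {sd_cm:.3f} cm")  # differs by a factor of 10
print(f"COV: {cv_mm:.2f}% vs {cv_cm:.2f}%")      # same on both scales
```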

Sample vs population notation
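Measurement | Sample estimate | Population parameter
Mean | $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$ | $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$
Standard deviation | $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2}$ | $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}$
Variance | $s^2$ | $\sigma^2$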
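The difference between the two columns (dividing by n − 1 for a sample versus n for a population) shows up directly in software. In the Python standard library, `statistics.variance` and `statistics.stdev` use the sample formulas, while `statistics.pvariance` and `statistics.pstdev` use the population formulas. The eight values below are an arbitrary example, treated first as a sample and then as a complete population.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # arbitrary example values

# Treated as a sample: divide by n - 1.
sample_var = statistics.variance(values)
sample_sd = statistics.stdev(values)

# Treated as the whole population: divide by n.
pop_var = statistics.pvariance(values)
pop_sd = statistics.pstdev(values)

print(f"sample:     s^2 = {sample_var:.3f}, s = {sample_sd:.3f}")
print(f"population: sigma^2 = {pop_var:.3f}, sigma = {pop_sd:.3f}")
```

On the same numbers the sample versions are always a little larger, because the sum of squared deviations is divided by the smaller number n − 1.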
Visualizing data
Different types of plots are used to represent different types of data

Continuous data
◦ Histogram
◦ Density plot
◦ Boxplot
◦ Violin plot

Discrete data
◦ Bar plot

Comparing two continuous variables
◦ Scatterplot

Trend over time
◦ Line plot
Histogram
[Figure: histogram; x-axis: Value, y-axis: Count]
Using histograms to describe distributions

[Figure: four example histograms: uniform, bell-shaped, asymmetric (skewed), and bimodal]

Density plots smoothen histograms
[Figure: a histogram (y-axis: count) shown alongside density plots (y-axis: density) of the same data; x-axis: Value]
Boxplot
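Graphical representation of a five-number summary
◦ The box spans Q1 to Q3 (the IQR), with a line at the median
◦ “Whiskers” extend to the most extreme data points within 1.5 × IQR of the box (below Q1 and above Q3)
◦ Points beyond the whiskers are drawn individually as outliers

[Figure: annotated boxplot labeling the whiskers, Q3, median, IQR, Q1, and outliers; y-axis: Value]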
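The 1.5 × IQR whisker rule can be sketched in Python as below. This assumes the common convention in which whiskers stop at the most extreme data points within 1.5 × IQR of the box and anything beyond is flagged as an outlier; quartiles come from `statistics.quantiles` with its default method, and plotting software may compute them slightly differently. The helper name `boxplot_stats` and the data are made up for the example.

```python
import statistics

def boxplot_stats(values):
    """Quartiles, whisker ends, and outliers under the 1.5 * IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme observations inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }

# Hypothetical data with one extreme value.
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(stats)
```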
Boxplots: The plot thickens*
[Figure: boxplots (y-axis: Value) of a bimodal and a unimodal distribution, shown with the corresponding histograms (x-axis: Value, y-axis: Count)]
*Pun intended.
What can we say about this distribution based on its boxplot?
◦ Symmetry? Asymmetric
◦ Skewness? Right-skewed
◦ Modality? Unclear

[Figure: boxplot; y-axis: Value, ranging from 0.0 to 0.6]
Violin plot: Density meets boxplot
[Figure: violin plots, density plots, and boxplots of samples from N(5, 4), N(2, 1), and N(4, 0.09); axes: value and density]
Barplot
[Figure: bar plot of flowers in a garden; x-axis: flower color (orange, pink, red, white), y-axis: Count]
Cautionary tale in barplots

http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002128
Scatterplot
[Figure: two scatterplots; x-axis: Variable 1 (explanatory/independent variable), y-axis: Variable 2 (response/dependent variable)]
Time series data

[Figure: the same time series (1990 to 2003) drawn two ways: Value on the y-axis against Year on the x-axis, and Year on the y-axis against Value on the x-axis]
BREAK
