Lecture 3

Data display

Amine Hadji

Leiden University

February 22, 2022


Outline

• Data

• Graphical summary

• Numerical summary

• How to display data badly


Quick reminder

• Observation: the value of a single measurement.

• Variable: a characteristic that differs from one individual to another.

• Sample size: total number of observations.

• Sample data: measurements taken from a subset of a population.

• Parameter: summary measure for an entire population.

• Statistic: summary measure computed from sample data.


Types of variables
Types:
• Categorical variable: variable that cannot be tied to a meaningful number
  • nominal variable when the categories cannot be ordered
  • ordinal variable when the categories can be ordered
• Quantitative variable: variable that can be tied to a meaningful number
  • discrete variable when the number is counted
  • continuous variable when the number is measured
Summarizing categorical variables
One variable:
Summarizing categorical variables

Two variables:
Pie chart vs Bar chart
Pie charts are the worst
Visualizing categorical variables
Only time pie charts work
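
One standard numerical summary of a categorical variable is a frequency table of counts and proportions; for two categorical variables, a two-way table of joint counts plays the same role. The sketch below is a minimal illustration under stated assumptions: the `records` data and the helper names `frequency_table` and `two_way_table` are made up for this example.

```python
from collections import Counter

# Hypothetical data: (eye colour, handedness) for six individuals.
records = [
    ("brown", "right"), ("blue", "right"), ("brown", "left"),
    ("green", "right"), ("brown", "right"), ("blue", "left"),
]


def frequency_table(values):
    """Map each category to (count, proportion of all observations)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (count, count / total) for cat, count in counts.items()}


def two_way_table(pairs):
    """Count how often each combination of two categories occurs."""
    return Counter(pairs)


eye_colours = [eye for eye, _ in records]
one_way = frequency_table(eye_colours)   # one variable
two_way = two_way_table(records)         # two variables

for category, (count, share) in one_way.items():
    print(f"{category}: {count} ({share:.0%})")
```

The counts in `one_way` are what a bar chart would draw as bar heights, and the proportions are what a pie chart would draw as slices.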
Summary - Quantitative variable

The most interesting summary characteristics:


• Location: mean - median - mode

• Spread: range - interquartile range - standard deviation

• Shape: symmetry - skewness - modes

• +1: Outliers
Five summary statistics
Boxplot
Boxplot in real life
Histogram - Stem-and-leaf - Dotplots
Bin size in histograms
When to use what?
If you are interested in
• Location: all of them (esp. boxplot)

• Shape: histogram - stem-and-leaf

• Individual values: stem-and-leaf - dotplot

If you want to
• compare groups: boxplot - histogram (two copies or side by side)

• identify outliers: all of them (esp. boxplot)

• study small datasets: stem-and-leaf - dotplot - boxplot

• study large datasets: histogram - boxplot


Mean & Median

• Mean - numerical average of the data:


\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

• Median - middle value of the ordered data, splitting the data set into two halves of equal size

Influence of extreme values: the mean is more sensitive to them than the median:

1,2,3,4,5 vs 1,2,3,4,10 (mean 3 vs 4; median 3 in both)
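
The comparison above can be checked directly. This is a minimal sketch using Python's standard `statistics` module; the helper name `location_summary` is an assumption made for the example, and the two data sets are the ones from the slide.

```python
from statistics import mean, median

# The two data sets from the slide: they differ only in the largest value.
original = [1, 2, 3, 4, 5]
with_extreme = [1, 2, 3, 4, 10]


def location_summary(data):
    """Return (mean, median) of a list of numbers."""
    return mean(data), median(data)


for data in (original, with_extreme):
    m, med = location_summary(data)
    print(f"{data}: mean = {m}, median = {med}")
```

Replacing 5 by 10 moves the mean but leaves the median where it was, because the median depends only on the middle of the ordered data.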
Shape on (mean - median)
Mean - Median - Mode
Quartiles - Quantiles

• Lower quartile Q1: median of the lower half of the ordered data values.

• Upper quartile Q3: median of the upper half of the ordered data values.

• k-th percentile: value with k% of the data values at or below it and (100 − k)% at or above it.

• Interquartile range (IQR): Q3 − Q1.
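
The median-of-halves definition of the quartiles can be sketched as follows. One detail is an assumption made here: when the number of values is odd, the middle value is left out of both halves. Other conventions exist and can give slightly different quartiles. The function name `quartiles` and the sample data are illustrative only.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves definition.

    Assumption: for an odd number of values, the middle value is
    excluded from both the lower and the upper half.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = ordered[:half]
    # Skip the middle element when n is odd.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)


data = [7, 1, 3, 9, 5, 11, 13, 15]  # hypothetical measurements
q1, med, q3 = quartiles(data)
iqr = q3 - q1  # interquartile range: Q3 - Q1
print(f"Q1 = {q1}, median = {med}, Q3 = {q3}, IQR = {iqr}")
```

Together with the minimum and maximum, these three values give the five numbers that a boxplot draws.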
Outliers

Possible sources:
• legitimate value; represents natural (high) variability

• mistake made during measurement or typing

• individual in a different group than the rest of the sample


Standard deviation
Formula for the sample variance and standard deviation:
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]

\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]
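
The two formulas translate almost line by line into code. The sketch below writes them out by hand so the n − 1 divisor is visible, then compares the result with `statistics.variance` and `statistics.stdev` from the standard library, which use the same divisor. The function names `sample_variance` and `sample_sd` are chosen for this example.

```python
import math
from statistics import stdev, variance


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_sd(values):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(values))


data = [1, 2, 3, 4, 5]
print("s^2 =", sample_variance(data), "s =", sample_sd(data))

# The standard library agrees with the hand-written versions.
print(math.isclose(sample_variance(data), variance(data)))
print(math.isclose(sample_sd(data), stdev(data)))
```

For 1,2,3,4,10 the same function returns a sample variance of 12.5 instead of 2.5: like the mean, the standard deviation reacts strongly to a single extreme value.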
Scatter plot
Scatter plots help visualize the relationship between two quantitative variables.
Scatter plots - Nonlinearity
Scatter plots - Outliers
Scatter plots - Two categories
Hiding/Showing small data

Showing a very small amount of data:

Hiding the data:


Change of Scale
Incorrect labeling
Hiding the main effect
Jiggle Baseline
