1. Fundamentals of 1D Visualization
Data Visualization
SUMANTA PATTANAIK
Sumanta Pattanaik: Visiting Professor
Email: [email protected]
pilani.ac.in
Instructors
Tathagat Ray: Professor, Hyderabad campus
Vinayak Naik: Professor, Goa Campus
Sundaresan Raman: Professor, Pilani Campus
8/21/2023 2
Course Particulars
What will be covered?
Principles and techniques for interactive data visualization that are useful for analyzing and presenting information associated with the data.
Algorithmic aspects of developing interactive visualization.
Students will gain practical experience creating visualizations using Python-based tools for data analysis, with Plotly's Python graphing library and Matplotlib for data visualization.
Course textbook: Fundamentals of Data Visualization by Claus O. Wilke
Online link: https://clauswilke.com/dataviz/
Prerequisite
Programming knowledge: expertise in an object-based language is essential.
Computer Programming and Data Structures, OO Programming.
NOTE: This class has a strong programming component. All assignments and the project will be done using Python, Plotly and Matplotlib.
General Policies
BITS Policies
Students are expected to familiarize themselves with and follow the BITS Rules of Conduct.
Talk to your local faculty instructors.
Mid Term [20%]
Comprehensive [30%]
Final Project [10%]
Assignment Policies
Assignments must be turned in by the deadline, usually set to 11:55 pm on the due date.
Programming assignments must be carried out using Jupyter notebook.
Assignment submission is through Google Classroom (submit the assignment_xx.ipynb file).
Programming in Python; visualization using Plotly and Matplotlib.
Compute platform: any computer/OS running Python and Jupyter.
Students will work on the assignments independently.
Quizzes: 20 points. (open book)
will be conducted weekly, online, during class
hours.
Questions will be from the material covered the
previous week.
Midterm: 20 points. (closed book)
Check your campus calendar for the date/time.
Topics
Lectures
Overview
Getting started with Python on Jupyter
Getting started with Plotly, Matplotlib/seaborn
Assignments will be mostly on getting the class familiarized with
Python, NumPy and Pandas
Plotly and Matplotlib for visualization
Get Started
Visualization?
picture.
Any technique that helps in creating the
mental picture will be called data visualization.
Table for Visualization
10 primary components.
Source: “Tables”, Chapter 11, Better Data Visualization.
Brain and
Visual
[https://blog.hubspot.com/agency/science-brains-crave-infographics]
Verified Evidence:
Value of
Data
Visualization
Sampled data sets from “Graphs in Statistical Analysis” by F. J. Anscombe, American Statistician, 27, 17–21 (1973).
Anscombe’s Quartet
Visualization
It is much easier to discover and confirm the
presence (or even absence) of patterns,
relationships, and physical characteristics (such
as outliers) through visualization.
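A minimal sketch of why the quartet matters: the four data sets below (values as published in Anscombe's paper) have nearly identical means and variances, so summary statistics alone cannot tell them apart, while a plot of each set looks very different. The helper name summarize is an illustrative choice.

```python
from statistics import mean, variance

# Anscombe's quartet: sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(xs, ys):
    """Return (mean x, mean y, sample variance x, sample variance y)."""
    return mean(xs), mean(ys), variance(xs), variance(ys)

summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, stats in summaries.items():
    # The four rows print almost the same numbers.
    print(name, [round(value, 2) for value in stats])
```

Plotting each (x, y) pair as a scatter plot is what reveals the differences that these numbers hide.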
Data Visualization
Addresses a variety of needs:
to evaluate data.
to communicate to peers.
to convince the board/reviewers.
to present to clients.
to report to regulatory committee.
History of Visualization
Dates back to the 2nd century: Egyptians used maps for earthly and heavenly positions.
line chart
bar chart
pie chart
Value of Data Visualization
Record information
Blueprints, photographs, seismographs, …
Analyze data to support reasoning
Develop and assess hypotheses
Discover errors in data
Expand memory
Napoleon's Disastrous Russian Campaign of 1812
Minard's interest lay with the painful efforts and sacrifices of the soldiers.
The graphic is notable for its representation in two dimensions of six types of data:
1- the size of Napoleon's troops;
2- distance;
3- temperature;
4- the latitude and longitude;
5- direction of travel; and
6- location relative to specific dates.
See: https://www.britannica.com/event/French-invasion-of-Russia
prove his hypothesis that contaminated water, not air,
was the source of cholera.
Src:
• https://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak
• https://www.arcgis.com/home/item.html?id=d7deb67f810d46dfacb80ff80ac224e9
Analyze data to support reasoning
John Snow’s famous study of the 1854 cholera outbreak in Broad Street, London.
https://www.arcgis.com/apps/PublicInformation/index.html?appid=d7deb67f810d46dfacb80ff80ac224e9
Snow used his map to convince local authorities to remove the handle of the Broad Street pump, which prevented many deaths.
The removal of the Broad Street pump handle has become the stuff of legend. At the Centers for Disease Control (CDC) in
Atlanta, when scientists look for simple answers to questions about epidemics, they sometimes ask each other,
“Where is the handle to this pump?”
The Crimean War was fought between Russia and an ultimately victorious alliance of the
Ottoman Empire, France, the United Kingdom and Piedmont-Sardinia.
Communicate information to others
The circle is divided into twelve equal "slices", one for each month of the year.
Months with more deaths are shown with longer wedges, so that the area of each wedge represents the number of deaths in that month from disease (blue), wounds (red), or other causes (black).
In the second year of the war (shown in the left image), deaths from disease were greatly reduced, showing the effect of the improved hygiene in the camps and hospitals starting in March 1855.
Once you see Nightingale's graph, the terrible picture is clear. The Russians were a minor enemy.
The real enemies were cholera, typhus, and dysentery.
The chart indeed resulted in the modernization of the British army hospital system.
Interactive visualization allows better
exploration of the data.
For a visualization to be considered interactive
it must satisfy two criteria:
Visualization
information being represented, and
Response time: changes made through input
must be incorporated into the visualization in a
timely manner. In general, interactive
visualization is considered a soft real-time task.
Interactive Visualization: An Example
Disappearing Shorelines
Source: https://archive.nytimes.com/www.nytimes.com/interactive/2012/11/24/opinion/sunday/what-could-disappear.html
CS F441: Data
Course Goal: Exploratory
Learn how to examine data and relationships among variables through visualization and statistical tools, with a goal towards:
Building insight into the data and the process that generated the data.
Finding out what may be interesting.
Old Faithful
eruptions: 3.6 1.8 3.33 2.28 4.53 ...
waiting : 79 54 74 62 85 55 88 85 51 85 ...
Data
operations: equal, not equal
for example
a record of students' course choices
constitutes nominal data.
Male (M), Female (F)
Hair color: Brown, Black, Blonde, Gray, Other
Quantitative data
Data that can be quantified.
Data that answer questions such as “How many?”, “How often?”, “How much?”.
In general, 2 categories:
Continuous: can take any numeric value
Discrete: ex: Count
Ordinal Data
Categorical, statistical data type where the variables have natural, ordered categories.
There is an order in the values (operations: equal, not equal; less/more),
e.g. first place, second place, third place; size S, M, L.
Note: the distance between the categories is not known (e.g. a scale ranging from happy to indifferent to unhappy).
The ordinal scale is distinguished
from the nominal scale by having ordered categories.
from continuous scales by not having category widths that represent equal increments of the underlying attribute. [Wiki]
Examples:
Likert Scale
Answer to the survey question "Is your general health poor, reasonable, good, or excellent?". The answers may be coded as 1, 2, 3, and 4.
Individuals' income might be grouped into the income categories $0-$19,999, $20,000-$39,999, $40,000-$59,999,…
socioeconomic status, military ranks.
letter grades for coursework.
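The difference in allowed operations between nominal and ordinal data can be sketched with pandas categorical types. This is only an illustration: the sample values are made up, and using pandas.CategoricalDtype with ordered=True is one possible way to encode an ordinal scale.

```python
import pandas as pd

# Nominal: categories have no order, so only equal / not equal is meaningful.
hair = pd.Series(["Brown", "Black", "Blonde", "Black"], dtype="category")
is_black = (hair == "Black").tolist()

# Ordinal: declare the order explicitly so less/more comparisons are defined.
size_type = pd.CategoricalDtype(categories=["S", "M", "L"], ordered=True)
sizes = pd.Series(["M", "S", "L", "M"], dtype=size_type)
smaller_than_large = (sizes < "L").tolist()
in_order = sizes.sort_values().tolist()
largest = sizes.max()

print(is_black)
print(smaller_than_large)
print(in_order, largest)
```

Note that the ordering says nothing about distances: the code can tell that S comes before M, but not by how much.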
A Typical
DataSet
Source: https://datacatalog.worldbank.org/dataset/world-development-indicators
A Table of records (rows)
Elements in the same row are related to each other in the sense that they are all measures from the same observation, or measures of the same item.
Each record has a number of observations (columns)
Called Items, Dimensions, Variables
Sample Data Sources
https://data.oecd.org/india.htm
https://www.indiastat.com/
https://ourworldindata.org/country/india
https://www.imf.org/external/datamapper/profile/IND
https://www.kaggle.com/datasets?tags=3023-India
…
Recap: Nightingale’s Rose Chart
of
Data Science & Python
General W3Schools Python
Python Tutorial
x, y = 10, 5
numbers = [0, 1, 2, 3, 4, 5, 6]
numbers[0:3] == [0, 1, 2]
numbers[:3] == [0, 1, 2]
numbers[5:] == [5, 6]
numbers[5:7] == [5, 6]
numbers[:] == [0, 1, 2, 3, 4, 5, 6]
Collection: Tuple
ref: https://www.w3schools.com/python/python_sets.asp
Additional Collections
Ref: https://www.w3schools.com/python/pandas/pandas_intro.asp
DataFrame
dtypes: object(2)
memory usage: 160.0+ bytes
None
Pandas uses the loc attribute to return one or more specified row(s).
>>> print(df.loc[0])
courses    F441
grades        A
Name: 0, dtype: object
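A small sketch of how such a DataFrame might be built and queried with loc. Only row 0 (F441, A) comes from the output above; the other rows are hypothetical values added for illustration.

```python
import pandas as pd

# Row 0 matches the slide output; rows 1 and 2 are made-up examples.
df = pd.DataFrame({
    "courses": ["F441", "F211", "F342"],
    "grades": ["A", "B", "A"],
})

first = df.loc[0]        # a single label returns that row as a Series
subset = df.loc[[0, 2]]  # a list of labels returns a DataFrame with those rows

print(first)
print(subset)
```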
pandas.read_csv()
>>> df = pandas.read_csv("data.csv")
>>> print(df.head(10))
Duration Pulse Maxpulse Calories
0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
5 60 102 127 300.0
6 60 110 136 374.0
7 45 104 134 253.3
8 30 109 133 195.1
9 60 98 124 269.0
DataFrame (contd…)
if-then-else
for loop
while loop
Indents
Syntax:
if condition1:
    # code block 1
elif condition2:
    # code block 2
else:
    # code block 3
Example:
>>> number = 0
>>> if number > 0:
...     print("Positive number")
... elif number == 0:
...     print('Zero')
... else:
...     print('Negative number')
...
Zero
for loop
The range() function returns a sequence of numbers, starting from 0 by default, and
increments by 1 (by default), and stops before a specified number.
syntax:
range(start, stop, step)
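A few illustrative calls showing the defaults described above (start defaults to 0, step defaults to 1, and the stop value itself is never produced). The variable names are arbitrary.

```python
first_five = list(range(5))         # start and step omitted
evens = list(range(0, 10, 2))       # explicit start, stop and step
countdown = list(range(5, 0, -1))   # a negative step counts down

print(first_five)
print(evens)
print(countdown)
```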
for loop: unpacking tuples
1 10
2 20
3 30
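The code that produced the output above is not preserved in these notes. A loop along the following lines would print the same three lines; the list name pairs and the loop variable names are assumptions made for this sketch.

```python
# Each element is a 2-tuple; the for statement unpacks it into two names.
pairs = [(1, 10), (2, 20), (3, 30)]

lines = []
for index, value in pairs:
    line = f"{index} {value}"
    lines.append(line)
    print(line)
```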
while loop
and Examination:
Metadata
Familiarizing
Completeness
Metadata Pixel size: SEE FILE HEADER,, Specifies the pixel size
(width, height, separation) in millimeters.
Image format: GE 16 BITS, Compressed Unix compressed,
use "uncompress [filename]" to restore.
Header size: 7900. The header block size in bytes.
Coordinate offset: NONE,NONE
If images files are cropped to remove empty pixels,
these offsets are provided, in pixels, relative to a fixed
coordinate plane.
…
Data Quality
High-quality data needs to pass a set of quality criteria:
Validity
Accuracy
Completeness
Consistency
Uniformity
See: https://en.wikipedia.org/wiki/Data_cleansing#Data_quality
Validity: The degree to which the data conform to
defined domain rules or constraints.
Data-Type Constraints: values in a particular column must
be of a particular datatype, e.g., boolean, numeric, date,
etc.
Range Constraints: typically, numbers or dates should fall
within a certain range.
Mandatory Constraints: certain columns cannot be empty.
Unique Constraints: a field, or a combination of fields,
must be unique across a dataset.
Set-Membership constraints: values of a column come
from a set of discrete values, e.g. enum values. For
example, a person’s gender may be male or female.
Regular expression patterns: text fields that have to be in
a certain pattern. For example, phone numbers may be
required to have the pattern (999) 999–9999.
Cross-field validation: certain conditions that span across
multiple fields must hold. For example, a patient’s date of
discharge from the hospital cannot be earlier than the
date of admission.
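Several of the constraints above can be checked with a few pandas expressions. The sketch below uses a made-up patient table with deliberate violations; the column names, the 0 to 120 age range, and the phone pattern (written with a plain hyphen) are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical records containing deliberate rule violations.
records = pd.DataFrame({
    "patient_id": [1, 2, 2, 4],
    "gender": ["male", "female", "unknown", "female"],
    "age": [34, 151, 28, 45],
    "phone": ["(040) 555-0101", "555-0102", "(040) 555-0103", None],
    "admitted": pd.to_datetime(["2023-01-05", "2023-02-10", "2023-03-01", "2023-03-15"]),
    "discharged": pd.to_datetime(["2023-01-09", "2023-02-08", "2023-03-04", "2023-03-20"]),
})

# One boolean column per rule; True marks a row that violates the rule.
violations = pd.DataFrame({
    "range": ~records["age"].between(0, 120),
    "mandatory": records["phone"].isna(),
    "unique": records["patient_id"].duplicated(keep=False),
    "set_membership": ~records["gender"].isin(["male", "female"]),
    "pattern": ~records["phone"].fillna("").str.fullmatch(r"\(\d{3}\) \d{3}-\d{4}"),
    "cross_field": records["discharged"] < records["admitted"],
})

print(violations)
print(violations.sum())  # number of violating rows per rule
```

Data-type constraints are not shown here; records.dtypes is a reasonable first check for those.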
Data Quality
Accuracy: The degree of conformity of a measure to a standard or a true value.
ex: 6-digit PIN code in an address
Source:
https://towardsdatascience.com/the-ultimate-guide-to-
data-cleaning-3969843991d4
Data Cleaning
Inspection: Detect unexpected, incorrect, and inconsistent data.
Data profiling: Generate summary statistics about the data.
Is the data column recorded as a string or number?
How many values are missing?
How many unique values in a column, and their distribution?
Statistical Analysis and Visualization of Distribution
mean, standard deviation, range, or quantiles.
Source:
https://towardsdatascience.com/the-ultimate-guide-to-
data-cleaning-3969843991d4
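The profiling questions above map fairly directly onto pandas calls. A minimal sketch, assuming a small made-up table (the column names echo the read_csv example, but the values and the Label column are hypothetical):

```python
import pandas as pd

# Hypothetical data with a few missing entries.
data = pd.DataFrame({
    "Duration": [60, 60, 45, None, 30],
    "Pulse": [110, 117, 109, 104, 109],
    "Label": ["a", "b", "a", None, "a"],
})

print(data.dtypes)                    # string or number?
missing = data.isna().sum()           # how many values are missing?
unique = data.nunique()               # how many unique values per column?
label_counts = data["Label"].value_counts()  # their distribution
summary = data.describe()             # mean, std, min/max, quartiles

print(missing)
print(unique)
print(label_counts)
print(summary)
```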
Data Cleaning
Cleaning: Fix or remove the anomalies discovered.
Missing Values:
Drop Row: when missing values in a column rarely happen and occur at random.
Drop Column: when most of the column’s values are missing, and occur at random.
Assign Value (impute): mean/median value, or prediction using linear regression.
Do nothing, but flag.
Source:
https://towardsdatascience.com/the-ultimate-guide-to-
data-cleaning-3969843991d4
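The four options for missing values can be sketched in pandas as follows. The data is made up, and the 50% threshold used to decide that "most" of a column is missing is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical table: one missing pulse value, a mostly empty notes column.
df = pd.DataFrame({
    "pulse": [110.0, None, 103.0, 109.0],
    "notes": [None, None, None, "ok"],
})

# Drop Row: remove rows where pulse is missing.
dropped_rows = df.dropna(subset=["pulse"])

# Drop Column: keep only columns with at most 50% missing values.
dropped_cols = df.loc[:, df.isna().mean() <= 0.5]

# Assign Value (impute): fill with the column median.
imputed = df.assign(pulse=df["pulse"].fillna(df["pulse"].median()))

# Do nothing, but flag: record where the value was missing.
flagged = df.assign(pulse_missing=df["pulse"].isna())

print(dropped_rows, dropped_cols, imputed, flagged, sep="\n\n")
```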
Cleaning: Fix or remove the anomalies
discovered. (Continued)
Remove Irrelevant/Duplicate data
Source:
https://towardsdatascience.com/the-ultimate-guide-to-
data-cleaning-3969843991d4
Data Cleaning
Tools:
Drag and Drop Tools
Script Based
Trifacta Wrangler: Messy Data Accepted
https://www.trifacta.com/products/wrangler/
Two Categories of Tools
Drag-and-Drop Tools:
Rely on a Graphical User Interface
Make assumptions about what you may like to do
Ex: You may drag SALES to the Y-axis and DATE to the X-axis. The tool assumes that you are interested in graphing total sales per month.
Cons:
Development time can be long
Expensive
Two Categories of Tools
Drag-and-Drop Tools
Tools Based on Scripting Languages:
Ex: matplotlib and Plotly in Python; D3, Observable Plot, Plotly in JavaScript; ggplot, Plotly in R, …
Better control over the result, but you have to be explicit about what you want.
This class will use Scripting Languages
Python: Scripting language
Matplotlib is a popular Python library for creating visualizations.
matplotlib.pyplot: This is the primary module used for creating
visualizations. It provides a simple interface for creating plots and
charts
Pros:
Versatile and Customizable
Wide Adoption: Being one of the oldest and most popular plotting
libraries in Python, Matplotlib is widely adopted and often
considered the starting point for many data visualization tasks.
Cons:
Steep learning curve.
Limited interactivity.
Its default styles might not always produce the most aesthetically pleasing plots compared to some other libraries.
As in any scripting-based visualization tool, a lot of code to write.
We will mostly use pandas.plot and Seaborn: libraries developed on top of Matplotlib. They reduce the coding load.
https://matplotlib.org/
Plotly
Open-source graphics library for creating interactive, publication-quality graphs. It has concise and (hopefully) memorable functions to foster fluency.
Pros:
Interfaces are available for Python, R, JavaScript, MATLAB, Julia
Great support for interaction
Beautiful visualizations
Cons:
As in any scripting-based visualization tool, a lot of code to write.
We will mostly use plotly.express: a library developed on top of Plotly. It reduces the coding load.
https://plotly.com/graphing-libraries/
Course Goal: Exploratory Data
Learn how to examine data and relationships among variables through visualization and statistical tools, with a goal towards:
Building insight into the data and the process that generated the data.
Finding out what may be interesting.
Determining which variables have the most
Grammar of Graphics
geometric shapes
coordinate system
aesthetic mapping: mapping of data dimensions to visual dimensions
scales
statistical transformations
position adjustments
faceting
Example
import pandas as pd
alphabets = pd.read_csv("english-letter.csv")
# alphabets is the Data Array / Data Frame / Data Table
# its columns are the Channels / Data Dimensions
Plotly Example
import pandas as pd
import plotly.express as px
alphabets = pd.read_csv("english-letter.csv")
# Data channels: the "letter" column maps to x, "frequency" maps to y
px.bar(alphabets,
    x="letter",
    y="frequency"
)
Pandas Matplotlib Example
import pandas as pd
alphabets = pd.read_csv("english-letter.csv")
alphabets.plot.bar(x="letter", y="frequency",
    figsize=(10, 5), rot=0, width=0.8)
# Data array: alphabets; geometric shapes: bars
alphabets.plot.bar(
    x="letter",      # data channels
    y="frequency"
)
Plotly Example
import plotly.express as px
iris = px.data.iris()
Datasets: https://plotly.com/python-api-reference/generated/plotly.data.html
Plotly Example
import plotly.express as px
iris = px.data.iris()
fig = px.scatter(
iris,
x="sepal_width",
y="sepal_length")
fig.show()
Plotly Example
import plotly.express as px
iris = px.data.iris()
# Geometric shapes: px.scatter draws one point per row
fig = px.scatter(
    iris,
    x="sepal_width",
    y="sepal_length")
fig.show()
Plotly Example
import plotly.express as px
iris = px.data.iris()
fig = px.scatter(iris,
    x="sepal_width",
    y="sepal_length",
    color="species"
)
fig.show()
Plotly Example
import plotly.express as px
iris = px.data.iris()
# Geometric shapes: px.scatter draws one point per row
fig = px.scatter(iris,
    x="sepal_width",
    y="sepal_length",
    color="species"   # data dimensions: x, y and color each map to a column
)
fig.show()
Seaborn Matplotlib Example
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
iris = px.data.iris()
fig, ax = plt.subplots(
    figsize=(10, 5)
)
sns.scatterplot(data=iris,
    x="sepal_width",
    y="sepal_length",
    hue="species",
    ax=ax)
Plotly Example
import plotly.express as px
iris = px.data.iris()
fig = px.scatter(iris,
    x="sepal_width",
    y="sepal_length",
    color="species",
    # Mapping (scaling) function: species values map to colors from this palette
    color_discrete_sequence=px.colors.qualitative.Light24
)
fig.show()
Data Graphics types
Both Plotly and Matplotlib support a large number of data
graphics types.
Commonly used ones are
Basic charts
line chart
bar chart
scatter plot, bubble chart
pie chart
Statistical charts
histogram
box plot
violin plot
error plot
distribution plots
geo plots
choropleths
geobubble plots
…
Data Component
Geometry/Graphic Component
Cartesian Coordinates:
The 2D Cartesian coordinate system is widely used in data visualization.
Axes are orthogonal.
Represents both positive and negative real numbers.
Example
MTCar Dataset:
Motor Trend magazine 1974
Effectiveness of Various Visual
properties for Data
Log scale
https://clauswilke.com/dataviz/coordinate-systems-axes.html
Continuous Input, Continuous Output
Polar (Curvilinear) Scale
Linear transformation of data values into angles and radial distances from the origin.
Use a linear function (y = m * x + b) to interpolate across the domain and range.
Generally, one of the axes is assigned to the discrete input data.
Domain and Range:
Domain: [0, maxCoordinate]
Range: [0, 2π]
https://clauswilke.com/dataviz/coordinate-systems-axes.html
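A minimal sketch of a linear scale function y = m * x + b used as a polar angle scale, assuming the output range is the full circle from 0 to 2π. The function names and the 12-month domain are illustrative choices, not part of any library.

```python
import math

def make_linear_scale(domain, output_range):
    """Return f(x) = m * x + b that maps the domain ends onto the range ends."""
    d0, d1 = domain
    r0, r1 = output_range
    m = (r1 - r0) / (d1 - d0)
    b = r0 - m * d0
    return lambda x: m * x + b

# Map month index 0..12 onto angles 0..2*pi (cyclic data fits a circle).
to_angle = make_linear_scale((0, 12), (0, 2 * math.pi))

def polar_to_xy(radius, angle):
    """Convert a polar position to Cartesian screen coordinates."""
    return radius * math.cos(angle), radius * math.sin(angle)

for month in (0, 3, 6):
    theta = to_angle(month)
    x, y = polar_to_xy(1.0, theta)
    print(month, round(theta, 3), round(x, 3), round(y, 3))
```

The same make_linear_scale helper could map a data value to a radial distance by choosing a range such as (0, max_radius).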
Continuous data to continuous Color
https://clauswilke.com/dataviz/coordinate-systems-axes.html
Scale Functions
https://clauswilke.com/dataviz/coordinate-systems-axes.html
Summary of Data and types
Components of Data
Visualization
Data Component
Geometry/Graphic Component
Main ex: lines, bars, symbols.
Effectiveness of Various Visual
properties for Data
Cartesian Coordinates:
The 2D Cartesian coordinate system is widely used in data visualization.
Axes are orthogonal.
Represents both positive and negative real numbers.
Example
MTCar Dataset:
Motor Trend magazine 1974
Scale Functions
Log scale
https://clauswilke.com/dataviz/coordinate-systems-axes.html
Gapminder Data Source:
https://plotly.com/python/plotly-express/
Continuous Input, Continuous Output:
Polar (Curvilinear) Scale
Linear transformation of data values into angles and radial distances from the origin.
Use a linear function (y = m * x + b) to interpolate across the domain and range.
Generally, one of the axes is assigned to the discrete input data.
Domain and Range:
Domain: [0, maxCoordinate]
Range: [0, 2π]
Use: for cyclic data.
https://clauswilke.com/dataviz/coordinate-systems-axes.html
Nightingale’s Rose plot
Radial Bar Plot
Star plot (Spider plot)
https://clauswilke.com/dataviz/coordinate-systems-axes.html
Continuous data to continuous Color
Source: The scale and drivers of carbon footprints in households, cities and regions across India
January 2021 Global Environmental Change 66(11):102205
Scale Functions
https://clauswilke.com/dataviz/coordinate-systems-axes.html
Gapminder plot
Source: https://www.gapminder.org/tag/chart/
Summary of Data and types
Common, Effective
Data Visualization
Techniques
1D DATA
1D Data
iris = px.data.iris()
Visualizing 1D Data
Tabulation in order
Nominal Data: Alphabetic order
Ordinal and Quantitative data:
Could be increasing or decreasing
Visualizing 1D Data
nominal data
Tabulate
Sorted Tabulate
Total Count of unique category
import plotly.express as px
# Count how many times each unique category occurs
tips = px.data.tips()
tips["day"].value_counts()
1D Data Visualize
nominal data
Tabulate
Sorted Tabulate
Total Count of unique category
Bar plot (Histogram plot)
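The tabulation steps listed above can be sketched with pandas on a small made-up nominal column. The day labels are hypothetical; a bar plot of the counts could then be drawn with counts.plot.bar() when Matplotlib is available.

```python
import pandas as pd

# Hypothetical 1D nominal data.
days = pd.Series(["Sun", "Sat", "Thur", "Sun", "Fri", "Sat", "Sun"])

sorted_table = days.sort_values().tolist()  # sorted tabulation (alphabetic)
counts = days.value_counts()                # occurrences of each category
n_categories = days.nunique()               # total count of unique categories

print(sorted_table)
print(counts)
print(n_categories)
```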
1D Quantitative Data
iris dataset
iris = px.data.iris()
1D Quantitative Data
Old Faithful
Data Set
Distribution Analysis
Examining sets of quantitative values:
How are the values distributed from lowest to highest?
iris dataset
Distribution analysis
3 Key Characteristics of Distribution
Spread: The difference between the
maximum value and the minimum value (or
the range of the data).
Wide or short
Interquartile range:
Measure of the difference between the upper (75%) and lower (25%) quartiles.
Where the majority of values lie.
Source: https://www.onlinemathlearning.com/quartile.html
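Spread and interquartile range can be computed with the standard library. The values below are hypothetical, and the "inclusive" quartile method is one of several conventions, so other tools may report slightly different quartiles for the same data.

```python
from statistics import quantiles

# Hypothetical quantitative values.
values = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15, 21]

spread = max(values) - min(values)   # range of the data

# Quartiles: cut points at 25%, 50% and 75%.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1                        # width of the middle half of the data

print(spread, q1, q2, q3, iqr)
```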
Box plot: Distribution visualization
Histogram: Distribution Visualization
Histogram:
A better way to visualize the
distribution of 1D data. Often used
in statistical analysis.
shows the number of data points
(frequency) that lie within intervals,
called bins
visualized as a collection of
rectangles. The frequency and the
width of the bin interval represent
the height and width of a
rectangle.
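A short sketch of the binning step behind a histogram, using the ten Old Faithful waiting times listed earlier in these notes. The choice of four bins over the interval 50 to 90 is an assumption made for illustration.

```python
import numpy as np

# Waiting times (minutes) from the Old Faithful sample shown earlier.
waiting = [79, 54, 74, 62, 85, 55, 88, 85, 51, 85]

# Four equal-width bins: [50, 60), [60, 70), [70, 80), [80, 90].
counts, edges = np.histogram(waiting, bins=4, range=(50, 90))

for left, right, count in zip(edges[:-1], edges[1:], counts):
    # Each bin becomes one rectangle: width = bin interval, height = frequency.
    print(f"{left:.0f}-{right:.0f}: {count}")
```

Changing the number of bins changes the picture, which is why the choice of interval matters.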
Histogram: Distribution Visualization
Statistical parameters
shown on the histogram
Histogram
Best Practices
Keep interval constant
Select best interval
Source: https://statistics.laerd.com/statistical-guides/understanding-histograms.php
Histogram: Distribution Visualization
Density Plot: a smoothed histogram. Density plot over a histogram of eruption durations from the Old Faithful dataset.
Resources:
Wiki page on Kernel Density Estimation: https://en.wikipedia.org/wiki/Kernel_density_estimation
https://mathisonian.github.io/kde/
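A hand-rolled sketch of Gaussian kernel density estimation, written out with NumPy to show the idea rather than to replace a library routine. The function name kde_estimate, the bandwidth of 5 and the evaluation grid are assumptions; the data is the Old Faithful waiting-time sample shown earlier.

```python
import numpy as np

def kde_estimate(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(samples) * bandwidth)

waiting = [79, 54, 74, 62, 85, 55, 88, 85, 51, 85]
grid = np.linspace(20, 120, 501)
density = kde_estimate(waiting, grid, bandwidth=5.0)

# A density curve should enclose an area of about 1.
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve that follows individual points; a larger one gives a smoother curve that can hide structure such as two modes.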
Density plot of a Normal distribution
Spread
Range of Data
Difference between the largest and
smallest point
Interquartile range:
Measure of the difference between the upper (75%) and lower (25%) quartiles.
Where the majority of values lie.
Source: https://www.onlinemathlearning.com/quartile.html
Data Distribution Shape
Symmetric
Uniform
Skewed
Multi-modal
Source: https://towardsdatascience.com/intro-to-descriptive-statistics-252e9c464ac9
Distribution
Display
Single distribution display
Histogram
Frequency polygon:
A line graph that
exclusively draws
attention to the
distribution’s shape with
minimal distraction
Strip plot
Stem-and-leaf plot
Frequency Polygon
Stem-and-leaf plot
Visualizing 1D Quantitative Data
Strip Plot:
the viewer gets an idea
about the range(s) of
values that are more
frequent and those that are
less frequent
Box plot
For example, with Box Plots, you can't see if the distribution is bimodal or multimodal.
While Violin Plots display more information, they can be noisier than a Box Plot.
https://datavizcatalogue.com/methods/violin_plot.html
Multiple Distribution Display
Source: https://datavizcatalogue.com/methods/box_plot.html
Distribution
Display
Multiple distribution display
Box Plots
Violin Plots
Multiple Strip plots
Frequency Polygons
Source: https://blogs.sas.com/content/graphicallyspeaking/2013/03/24/custom-box-plot
https://datavizcatalogue.com/methods/violin_plot.html
Distribution Display
Frequency Polygons
Population Pyramid
A Population Pyramid is a pair of back-to-back Histograms (one for each sex) that displays the distribution of a population in all age groups and in both sexes.
https://datavizcatalogue.com/methods/population_pyramid.html