Visualization Summarization S25 Lec6,7

The document discusses data visualization and summarization techniques, emphasizing the importance of visualizing data to understand distributions, detect errors, and present results effectively. It outlines best practices for creating good visualizations, including simplicity, color consistency, and proper labeling, while also highlighting common pitfalls and the significance of choosing appropriate plots based on data types. Additionally, it covers measures of central tendency and dispersion, and the implications of normal and skewed distributions in statistical analysis.

Uploaded by

u80817578

BRSM

Data Visualization & Summarization


Vinoo Alluri & Bapi Raju
ORGANIZATION
Data Organization
• identify variables (IV, DV) and respective types
• identify different levels of measurement
• missing data?
• replace with mean
• remove
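The two options for missing data listed above can be sketched in a few lines. This is a minimal illustration, not a recommended pipeline: the `scores` list and the helper names `drop_missing` and `impute_mean` are hypothetical, and missing entries are assumed to be encoded as `None` or NaN.

```python
import math
from statistics import mean


def is_missing(value):
    """Treat None and NaN as missing (an assumption about how gaps are encoded)."""
    return value is None or math.isnan(value)


def drop_missing(values):
    """Option 'remove': keep only the observed values."""
    return [v for v in values if not is_missing(v)]


def impute_mean(values):
    """Option 'replace with mean': fill each gap with the mean of the observed values."""
    fill = mean(drop_missing(values))
    return [fill if is_missing(v) else v for v in values]


# Hypothetical test scores with two missing entries.
scores = [62.0, None, 71.0, float("nan"), 65.0]
print(drop_missing(scores))
print(impute_mean(scores))
```

Removing rows shrinks the sample, while mean replacement keeps the sample size but reduces the spread of the variable, so the choice affects the dispersion measures computed afterwards.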

Continuous Categorical
What information does it give ???
Outline
• Visualization
• why we visualise
• how to pick a plot
• initial data vs final results visualization (some
examples)
• bad designs and misleading graphs
• Summarization
• measures of central tendency & dispersion
• which measure to pick
Data Visualisation
Mean Mid-Sem Test Score = 65.5

How can I summarise this data?


Anscombe’s Quartet

• same mean, std, correlation, regression line
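The claim can be checked numerically. The sketch below uses the data values from Anscombe's 1973 paper (they do not appear on the slide) and computes the mean, Pearson correlation, and least-squares line for each of the four sets; the function name `summarize` is illustrative.

```python
from math import sqrt
from statistics import mean

# Anscombe's four data sets (values from the published quartet).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return means, Pearson r, and the least-squares slope and intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "mean_y": my,
        "r": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": my - slope * mx,
    }


summaries = {name: summarize(x, y) for name, (x, y) in ANSCOMBE.items()}
for name, s in summaries.items():
    print(name, {k: round(v, 2) for k, v in s.items()})
```

The printed summaries agree to about two decimal places, even though the four scatter plots look completely different, which is the argument for plotting before summarising.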



Why do we visualise?
• allows for initial guesses of data distribution
• direction of effect
• error detection (e.g., missing values, NaNs)
• outlier detection
• present results
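Error and outlier detection can also be done numerically alongside a plot. The sketch below counts missing entries and flags outliers with the 1.5 × IQR rule that box plots conventionally use; that rule is an assumption here (the slide does not name a method), and the `readings` data and function names are hypothetical.

```python
import math
from statistics import quantiles


def count_missing(values):
    """Count entries that are None or NaN."""
    return sum(1 for v in values if v is None or math.isnan(v))


def iqr_outliers(values, k=1.5):
    """Return observed values lying more than k * IQR outside the quartiles."""
    clean = sorted(v for v in values if v is not None and not math.isnan(v))
    q1, _, q3 = quantiles(clean, n=4)  # quartiles of the observed data
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in clean if v < low or v > high]


# Hypothetical measurements with one NaN and one suspicious value.
readings = [10, 12, 11, float("nan"), 13, 12, 11, 95]
print(count_missing(readings))
print(iqr_outliers(readings))
```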
what makes them “good” or “bad”?

comment on these visualizations


Tim Cook used this chart to showcase rising iPad sales between 2008 and 2013.
What makes a good visualisation?

• reduce cognitive load
• simplicity
• relevancy
• less is more
• storytelling
• ability to support the reader during their journey
• convince the reader

“Perfection is achieved not when there is nothing more to add, but when there is nothing left to take away”
– Antoine de Saint-Exupery
What makes a good visualisation
• Color Consistency
• use same colors across multiple charts for consistency
• avoid using colors with negligible contrast
• avoid using too many colors
• avoid using conventional colors to convey opposite
meanings
• pay heed to the needs of people who might be
colorblind (check also in grayscale)
• Accurate Scaling

https://www.cardinalpath.com/blog/makes-good-visualization
What makes a good visualisation

• identify & explain/infer from missing data

https://www.cardinalpath.com/blog/makes-good-visualization
What makes a good visualisation
• labelling
• label the axis correctly and consistently across all your
charts.
• avoid using acronyms that are not widely understood.
• make the chart title as concise and descriptive as
possible.
• whenever possible, label the lines in your line chart
directly rather than using a legend.
• be consistent in formatting; if you are working with
currency symbols, percentage signs and the decimal
values, retain them across all your charts.

https://www.cardinalpath.com/blog/makes-good-visualization
Market Share of Film Studios
So which visualisation was best?
Tufte’s Graphical Theory
• maximise data-ink ratio (the share of a graphic's ink that presents data)
• keep the lie factor close to 1 (i.e., preserve graphical integrity)
• minimise chart junk
• use proper scales and labelling
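Tufte defines the lie factor as the size of the effect shown in the graphic divided by the size of the effect in the data, so a faithful graphic scores close to 1. A minimal sketch, with hypothetical function names and numbers:

```python
def effect_size(first, second):
    """Relative change from the first value to the second."""
    return abs(second - first) / abs(first)


def lie_factor(data_first, data_second, drawn_first, drawn_second):
    """Effect shown in the graphic divided by the effect present in the data."""
    return effect_size(drawn_first, drawn_second) / effect_size(data_first, data_second)


# Hypothetical case: the data doubles (10 -> 20), but the bar grows from 1 cm to 4 cm.
print(lie_factor(10, 20, 1.0, 4.0))  # exaggerated graphic
print(lie_factor(10, 20, 2.0, 4.0))  # faithful graphic
```

A truncated y-axis is a common way to end up with a lie factor well above 1, because the drawn heights change far more than the underlying values.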
The good, the bad, & the ugly
How to choose the right plot?

• distributions & compositions


• proportions
• data distributions
• comparisons
• group differences
• associations
• relationships between variables
• geographical data
• variable types
Initial Data vs Final Result
Visualization
HISTOGRAMS
BOX-PLOT FUNNEL PLOTS
SCATTER PLOT SPIDER PLOT / RADAR CHART
PIE CHARTS & BAR CHARTS RADIAL HEAT MAP
MOSAIC PLOT CIRCOS PLOT
STREAMGRAPH
VIOLIN PLOT
RAIN-DROP

Not an exhaustive list!


Some plots used for both!
Data Visualization: What Info?

• allows for initial guesses of data distribution (normality) and direction of effect
• ex: histograms (bin-width dependency), box-plots, scatter plots, pie charts, bar plots (already seen), etc.
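The bin-width dependency of histograms can be seen without drawing anything. The sketch below bins the same hypothetical values with two different widths; the function name `bin_counts` and the data are illustrative, and values are assumed to be no smaller than `start`.

```python
import math
from collections import Counter


def bin_counts(values, bin_width, start=0.0):
    """Count how many values fall into each bin [start + i*w, start + (i+1)*w)."""
    counts = Counter(math.floor((v - start) / bin_width) for v in values)
    n_bins = max(counts) + 1  # index of the highest occupied bin, plus one
    return [counts.get(i, 0) for i in range(n_bins)]


# Hypothetical data with two clusters and a gap between them.
values = [1, 2, 2, 3, 3, 3, 7, 8, 8, 9]
print(bin_counts(values, bin_width=5))  # wide bins
print(bin_counts(values, bin_width=2))  # narrow bins
```

With wide bins the data looks like one lopsided block, while narrow bins expose the empty region between the two clusters, so the impression of the distribution depends on the bin width chosen.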
Data Visualisation
HISTOGRAM, BOXPLOT, SCATTER PLOT
PIE CHART
Not comprehensible!

BAR CHART
Pie vs Bar Charts
• use pie charts when
• there are few categories
• readers can differentiate slices (unless you are making a point)
• you don’t need to rely on many colors or labels to explain the proportions
• the total adds up to 100%
• use bar charts when
• there are many categories (but not too many)
• need to compare numbers side-by-side (caution: more than
two bars are hard for readers
AREA PLOTS: TREE MAP
AREA PLOTS: WAFFLE CHART
Data Visualisation
• mosaic plots
• allows you to observe the relationship among two or
more categorical variables
AREA PLOTS: MOSAIC PLOT
So which visualisation was best?
VIOLIN PLOT

Higher Probability

Lower Probability
RAINDROP PLOT (Combo)
Visualizing Results

Funnel Plots
(Regression Results)
Describing Data

Funnel Plots
SPIDER PLOT / RADAR CHART
Describing Data

Radial Heat Map


How to choose the right plot?

• temporal changes
• proportions
• data distributions
• group differences
• relationships between variables
• geographical data
Temporal

• showing change over time


Proportions
• showing a part-to-whole composition
Proportions

Area plots
Data Distribution

indicative of potential groups or group differences


Group Differences

• main effects and interaction plots


Data Distribution & Group Differences
Group differences
Describing Data + Group Differences
Association between variables
• scatter/bubble plots
• allows you to observe the relationship between
variables
Association between variables
• bubble plots
Association between variables
• heat maps depicting correlations
Geographical maps
Creative Combinations
What makes a good visualisation

• Storytelling
• Reduce Cognitive Load
• Less is more
• Missing data
• Color Consistency
• Labelling

https://www.cardinalpath.com/blog/makes-good-visualization
To do or not to do
• Provide the necessary context around visuals
• Ensure simplicity and clarity of information
• Ensure brevity and avoid unnecessary information
• Use simple, easy-to-understand color palettes
• Pay attention to graphics so that they are visually appealing
• Where possible, bring in originality by relating seemingly unrelated data and subjects
To do or not to do
• Avoid using too many variables within a single image, which can distract viewers
• Take care not to present data in an unsuitable or incorrect visualization format
• When using scales to depict differences between data points, ensure that the scale is consistent
• Avoid poor color choices. In particular:
• avoid using colors with negligible contrast
• avoid using too many colors
• avoid using conventional colors to convey opposite meanings
• pay heed to the needs of people who might be colorblind (check also in
grayscale)
Bad Designs & Improvements

https://nandeshwar.info/data-visualization/pie-chart-vs-bar-chart/
What if we want to compare genders
within the job categories and
ethnicities/races?
Descriptive Statistics
• Common descriptive statistics are:
• Measure of central tendency
- the most typical value of a given
group of values
• Measure of dispersion
- how much all the other values in the
group vary around the typical value
Measures of central tendency
Central Tendency for
Variable Types
Measures of central tendency

Mean
• Advantages: a sensitive and exact measure of the centre point of a group of values
• Disadvantages: a single extreme value in one direction can seriously distort the mean
Median
• Advantages: not as susceptible to extreme values as the mean
• Disadvantages: can be unrepresentative if there are only a small number of values
Mode
• Advantages: indicates the most important value; unaffected by extreme scores; more informative than mean
• Disadvantages: not useful for small data sets where several values occur equally frequently
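The sensitivity of the mean to a single extreme value, compared with the median and mode, can be shown with Python's standard `statistics` module. The scores below are hypothetical, and `centre` is an illustrative helper name.

```python
from statistics import mean, median, mode


def centre(values):
    """Return the three common measures of central tendency."""
    return {"mean": mean(values), "median": median(values), "mode": mode(values)}


# Hypothetical scores, then the same scores plus one extreme value.
scores = [55, 60, 60, 65, 70]
with_outlier = scores + [250]

print(centre(scores))
print(centre(with_outlier))
```

Adding the extreme value pulls the mean far above every typical score, while the median moves only slightly and the mode does not move at all.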
Measures of dispersion/spread
Measures of dispersion/spread

Range
• Advantages: none listed
• Disadvantages: distorted by extreme values; no indication of grouping around the mean
Variance / standard deviation
• Advantages: fundamental to significance testing, and forms the basis of Analysis of Variance (ANOVA); enables population parameters to be estimated from a sample of people
• Disadvantages: none listed
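As a rough illustration of measures of spread, the sketch below computes the range, the sample variance, and the sample standard deviation for hypothetical scores. Choosing these three measures is an assumption made for the example, and `spread` is an illustrative helper name.

```python
from statistics import stdev, variance


def spread(values):
    """Return the range plus the sample variance and standard deviation (n - 1 denominator)."""
    return {
        "range": max(values) - min(values),
        "variance": variance(values),
        "stdev": stdev(values),
    }


# Hypothetical scores, then the same scores plus one extreme value.
scores = [55, 60, 60, 65, 70]
with_outlier = scores + [250]

print(spread(scores))
print(spread(with_outlier))
```

The range depends only on the two most extreme values, so one unusual score changes it completely, whereas the variance uses every observation's distance from the mean.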
When do these measures fail to be representative?
Normal Distribution

• A bell-shaped mathematical curve describing how values are distributed
• Data taken from a sample is assumed to be “normally distributed”, and must approximate this shape in order to use parametric tests of significance
• Inferential statistics (e.g., t-tests, F-tests, regression analyses) require in some sense that the numeric variables are approximately normally distributed
• Note: it does not fit all populations
Normal Distribution
• symmetrical about the mean (the midpoint of the horizontal axis)
• mean, median, and mode all fall on the
midpoint
• No matter what μ and σ are, the area
between
• μ-σ and μ+σ is about 68%;
• μ-2σ and μ+2σ is about 95%;
• μ-3σ and μ+3σ is about 99.7%
• Almost all values fall within 3 standard
deviations
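These percentages do not depend on μ and σ because standardising with z = (x − μ)/σ maps every normal distribution onto the standard normal. The area within k standard deviations of the mean is then erf(k/√2), which the sketch below evaluates with the standard library; `area_within` is an illustrative name.

```python
import math


def area_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
```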
Skewed Distribution
• Resembles an exponential
distribution
• Lots of extreme values far
from mean or mode
• Not straightforward to do
useful statistical tests with
this type of distribution
Skewed Distribution
• Negative skew
- Results from relatively easy tasks, due to a ceiling effect
• Positive skew
- Results from tasks which are hard to improve upon, due to a floor effect (such as reaction time, RT)
• Bimodal
- Two distinct peaks
- Probable indicator of groups
- ex: completion time of marathon runners
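The direction of skew can be checked with the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second), or more informally by comparing the mean with the median. The sketch below uses hypothetical reaction times and easy-task scores; the function name and data are illustrative only.

```python
from statistics import mean, median


def skewness(values):
    """Moment-based skewness: positive for a long right tail, negative for a long left tail."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical reaction times in ms: bounded below, with a long right tail.
reaction_times = [310, 320, 330, 340, 350, 400, 900]
# Hypothetical scores on an easy task: bunched near the ceiling of 100.
easy_task_scores = [40, 88, 92, 94, 96, 98, 100]

print(skewness(reaction_times), mean(reaction_times), median(reaction_times))
print(skewness(easy_task_scores), mean(easy_task_scores), median(easy_task_scores))
```

In the positively skewed sample the mean sits above the median, and in the negatively skewed sample it sits below, which is one reason the median is often the more representative measure for skewed data.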
